
Voltaire Unified Fabric Manager

A new dimension to performance analysis and tuning


Ghislain de Jacquelot
ghislaindj@voltaire.com

© 2009 Voltaire Inc.


Introducing
Voltaire’s Grid Director 4000 Series
• First generally available commercial-grade QDR switches in the market
• Lowest latency switch at 100ns/300ns port-to-port
• “Smart” switch with advanced management capabilities on-board
• Most mature, 4th generation switch family and switch silicon
• Most scalable with HyperScale technology

4036 - 36 ports
4700 – From 324 ports to ∞
InfiniBand: a black box?

An InfiniBand Fabric is not a black box (1/2)

Requires hardware management

• Detect failures, communication problems
  ◦ Inside the InfiniBand fabric
    - Port counters
    - Port status (QDR, DDR, SDR – 4X, 2X, 1X)
    - Firmware upgrades (switch and HCA ASICs)
  ◦ Outside the InfiniBand fabric
    - Chassis
    - Power supplies
    - Fans
    - Temperature
    - Chassis software updates (switch management)

An InfiniBand Fabric is not a black box (2/2)

What about performance?

• Blocking vs non-blocking fabrics?
• Influence of routing algorithms?
• Congestion?
• Mixing different protocols on the same fabric?
• Running multiple jobs on the same fabric?
• Performance monitoring tools?

Some InfiniBand technology

Fabric?

A fabric is made of interconnected switch ASICs


• Mellanox InfiniScale III (aka Anafa): 24 ports
• Mellanox InfiniScale IV (aka Shaldag): 36 ports

[Figure: inside a 96-port switch. Top row: 4 switch ASICs of 24 ports each. Bottom row: 8 switch ASICs of 24 ports each, with 12 nodes attached to each bottom-row ASIC.]
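The port arithmetic behind the figure can be checked with a short sketch. It assumes that each bottom-row ASIC splits its 24 ports evenly between nodes and uplinks, which the figure suggests but does not state; the variable names are illustrative.

```python
# Port budget of the 96-port switch drawn on the slide.
# Assumption: each bottom-row ASIC uses 12 ports for nodes and 12 for uplinks.
ASIC_PORTS = 24        # InfiniScale III port count (from the slide)
BOTTOM_ROW_ASICS = 8   # ASICs that face the nodes
TOP_ROW_ASICS = 4      # ASICs that only interconnect the bottom row

node_ports_per_asic = 12
uplinks_per_asic = ASIC_PORTS - node_ports_per_asic

external_ports = BOTTOM_ROW_ASICS * node_ports_per_asic   # ports offered to nodes
leaf_uplinks = BOTTOM_ROW_ASICS * uplinks_per_asic        # links going up
spine_ports = TOP_ROW_ASICS * ASIC_PORTS                  # ports available on top

print(external_ports, leaf_uplinks, spine_ports)
```

Under that assumption every uplink has a matching top-row port, so the top row is fully used and nothing is left over.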


Blocking?

Defines the bandwidth ratio between layers in the fabric

Switch      Uplinks   Nodes   Result
24 ports    12        12      Fully non-blocking
24 ports    8         16      50% blocking
24 ports    4         20      20% blocking
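The percentages on this slide follow from dividing uplinks by node ports. A minimal sketch, assuming all links run at the same speed; the function name `blocking_ratio` is illustrative and not from the slides:

```python
from fractions import Fraction

def blocking_ratio(uplinks: int, nodes: int) -> Fraction:
    """Uplink bandwidth per unit of node bandwidth, capped at 1 (non-blocking)."""
    return min(Fraction(uplinks, nodes), Fraction(1))

# (uplinks, nodes) for the three 24-port switches shown on the slide
for uplinks, nodes in [(12, 12), (8, 16), (4, 20)]:
    ratio = blocking_ratio(uplinks, nodes)
    label = "fully non-blocking" if ratio == 1 else f"{float(ratio):.0%} blocking"
    print(f"{uplinks:2d} uplinks / {nodes:2d} nodes: {label}")
```

In each case uplinks plus node ports add up to the 24 ports of the ASIC, so the ratio is decided purely by how the ports are split.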

Congestion?

Example: All orange nodes write simultaneously to the IO node (red)

[Figure: two-level fabric of 24-port switch ASICs, 4 in the top row and 8 in the bottom row. Four bottom-row ASICs each serve compute nodes (CN), three serve groups of 12 nodes, and one serves the IO node.]

Congestion Example

Degradation due to node oversubscription

• Destination port in saturation (multiple sources)
• Congestion spreads across the fabric
• ALL other flows drop to 20% of capacity
• Takes time to recover
• Common with storage traffic

[Figure: plot with the "drop" and "recovery" phases marked]

Routing?

InfiniBand packets are ‘destination routed’ based on the Destination Local ID (DLID) field in the header
In IB: DLID = route (not only remote address)
DLIDs are 16 bits
• 48K values are used for unicast
• 16K values are used for multicast
At each switch ASIC, the incoming unicast DLID is used as an index into a Linear Forwarding Table (LFT) that returns the outgoing switch port number
• E.g. the InfiniScale III ASIC supports all 48K possible LFT entries

[Figure: LFT with columns "DLID" and "Out Port #", showing rows for DLID 0 to 11]
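The LFT lookup described above is a plain array index. The sketch below is illustrative only: the port numbers in `example_lft` are made up, since the slide shows the DLID column but its port values did not survive extraction, and the `forward` helper is a hypothetical name.

```python
# DLID space as stated on the slide: 48K unicast + 16K multicast values.
UNICAST_LIDS = 48 * 1024
MULTICAST_LIDS = 16 * 1024
assert UNICAST_LIDS + MULTICAST_LIDS == 2 ** 16   # fills the 16-bit field

def forward(lft, dlid):
    """Return the output port for a unicast DLID by indexing the LFT."""
    if not 0 <= dlid < len(lft):
        raise ValueError(f"DLID {dlid} has no LFT entry")
    return lft[dlid]

# Hypothetical 12-entry table (DLID 0..11 -> out port), values invented.
example_lft = [0, 3, 3, 7, 1, 1, 2, 5, 5, 4, 6, 6]

for dlid in (3, 11):
    print(f"DLID {dlid} -> port {forward(example_lft, dlid)}")
```

Because the table is indexed by destination only, every packet for a given DLID leaves a switch on the same port, which is why assigning DLIDs to ports amounts to choosing routes.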

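Communication Patterns (balanced)

• Communication pattern:
  - A-E
  - B-H
  - C-G
  - D-F
• No link contention

[Figure: two-level fabric. Switch 1 and Switch 2 (top) each have downlink ports 1 to 4; switches AB, CD, EF and GH (bottom) each have uplink ports 1 and 2 and node ports 3 and 4. Labels: "IB path", "2 symmetric IB paths".]

Forwarding tables, Switch 1 and Switch 2 (destination -> out port):
Dest   Switch 1   Switch 2
A      1          1
B      1          1
C      2          2
D      2          2
E      3          3
F      3          3
G      4          4
H      4          4

Forwarding tables, switches AB, CD, EF, GH (destination -> out port):
Dest   AB   CD   EF   GH
A      3    1    1    1
B      4    2    2    2
C      1    3    1    1
D      2    4    2    2
E      1    1    3    1
F      2    2    4    2
G      1    1    1    3
H      2    2    2    4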
Communication Patterns (un-balanced)

• Communication pattern:
  - A-C
  - B-E
  - D-G
  - F-H
• 2:1 link contention:
  - A->C and B->E share uplink to Switch 1 port 1
  - G->D and H->F share uplink to Switch 2 port 4

[Figure: two-level fabric. Switch 1 and Switch 2 (top) each have downlink ports 1 to 4; switches AB, CD, EF and GH (bottom) each have uplink ports 1 and 2 and node ports 3 and 4. Labels: "IB path", "2 symmetric IB paths".]

Forwarding tables, Switch 1 and Switch 2 (destination -> out port):
Dest   Switch 1   Switch 2
A      1          1
B      1          1
C      2          2
D      2          2
E      3          3
F      3          3
G      4          4
H      4          4

Forwarding tables, switches AB, CD, EF, GH (destination -> out port):
Dest   AB   CD   EF   GH
A      3    1    1    1
B      4    2    2    2
C      1    3    1    1
D      2    4    2    2
E      1    1    3    1
F      2    2    4    2
G      1    1    1    3
H      2    2    2    4
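The difference between the balanced and un-balanced patterns can be reproduced from the forwarding tables alone. The sketch below is a reading of the slide, not Voltaire's tooling: it assumes ports 1 and 2 on the bottom switches are the uplinks (to Switch 1 and Switch 2), that each listed pair communicates in both directions, and that the tables were transcribed correctly from the figure.

```python
from collections import defaultdict

# Bottom-switch forwarding tables as read from the slide: destination -> out port.
LEAF_LFT = {
    "AB": dict(A=3, B=4, C=1, D=2, E=1, F=2, G=1, H=2),
    "CD": dict(A=1, B=2, C=3, D=4, E=1, F=2, G=1, H=2),
    "EF": dict(A=1, B=2, C=1, D=2, E=3, F=4, G=1, H=2),
    "GH": dict(A=1, B=2, C=1, D=2, E=1, F=2, G=3, H=4),
}
UPLINK_PORTS = {1, 2}  # assumption: port 1 -> Switch 1, port 2 -> Switch 2

def leaf_of(node):
    """Name of the bottom switch a node is attached to (e.g. 'A' -> 'AB')."""
    return next(leaf for leaf in LEAF_LFT if node in leaf)

def uplink_load(pairs):
    """Map (bottom switch, uplink port) -> list of flows using that uplink."""
    load = defaultdict(list)
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):      # treat each pair as bidirectional
            leaf = leaf_of(src)
            port = LEAF_LFT[leaf][dst]
            if port in UPLINK_PORTS:
                load[(leaf, port)].append(f"{src}->{dst}")
    return dict(load)

balanced = uplink_load([("A", "E"), ("B", "H"), ("C", "G"), ("D", "F")])
unbalanced = uplink_load([("A", "C"), ("B", "E"), ("D", "G"), ("F", "H")])

for name, load in (("balanced", balanced), ("un-balanced", unbalanced)):
    print(name, {link: flows for link, flows in load.items() if len(flows) > 1})
```

With these tables the balanced pattern puts at most one flow on any uplink, while the un-balanced pattern puts A->C and B->E on uplink 1 of switch AB and G->D and H->F on uplink 2 of switch GH, the two cases named on the slide. Under the bidirectional assumption the sketch also flags the reverse-direction flows on switches CD and EF.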
Optimization of Parallel Applications?

Single-thread optimization
• Some examples:
  ◦ Instruction pipelining
  ◦ Blocking
  ◦ Prefetching data
• Tools: processor counters, profiling tools, compiler reports, etc.
• Goal: Overcome processor, cache, and memory architecture constraints
Parallel optimization, scalability
• Some examples:
  ◦ Load balancing
  ◦ Mix OpenMP and MPI
  ◦ Barrier optimization
• Tools: MPI profilers (Intel Trace Analyzer, etc.)
• Goals: Overcome balancing issues, increase the computation-to-communication ratio, use parallel IO, etc.
Fabric optimization?
• Benchmarking and production environments are different
• Systems are used simultaneously by several applications and several kinds of traffic
• Handling multiple concurrent flows efficiently
Observations

Blocking in cut-through networks is a big issue
Different traffic classes have different requirements
• Collectives and storage require congestion control
• IPC requires low latency (high priority)
• Storage may use more bandwidth and not be latency sensitive
• Hardware-based adaptive routing is not efficient with bursty or storage traffic
Job layout can influence routing decisions
• IPC traffic typically stays within a job, or has unique patterns
• Storage traffic fans into storage nodes
• Management traffic spreads to all nodes
Hardware capabilities can be destructive if used inappropriately
• E.g. misconfigured adaptive routing or congestion management

Introducing
Voltaire UFM Unified Fabric Manager™

Ensure fabric health and performance visibility


• Unique visibility into fabric traffic and bottlenecks
Optimize application performance
• “Benchmark” performance in real life
• (we’ve managed to see 10X improvements)
Manage the scale-out
• Application centric platform
Efficient operations across thousands of fabric resources
• Automate configurations and manage changes on the fly
• Increase fabric up-time and resiliency – better utilization

Monitor, Analyze and Optimize


Advanced Monitoring and Analysis

Monitor & analyze fabric performance


• B/W utilization
• Unique congestion monitoring
• Dashboard for aggregated fabric view
Real-time fabric-wide health monitoring
• Monitor events and errors throughout the fabric
• Threshold-based alarms
• Granular monitoring of host and switch parameters
Innovative congestion mapping
• One view for fabric-wide congestion and traffic patterns
• Enables root cause analysis for routing, job placement or resource allocation inefficiencies
Everything is managed at the application/aggregation level
Fabric Optimization with UFM

[Diagram: UFM at the center of a "Feedback and Analysis" loop, with optional schedulers attached]

• Application Modeling (CLI / GUI / API): characterize application traffic and priorities
• Fabric Optimization: fabric virtualization and QoS; optimize routing and job placement
• Monitoring: show traffic and congestion information
UFM Application Centric Approach

[Diagram labels: Applications, Fabric Policy, Monitoring, Virtual Infrastructure, Physical Infrastructure]

Map application requirements to fabric policies and map element status to application status
Combining UFM with Industry Leading Schedulers

Enabling Intelligent Performance Driven Job Scheduling

UFM’s traffic aware routing

Today’s routing algorithms are static while clusters are dynamic
• Nodes are moving in and out of the cluster
• Traffic patterns change
• Static algorithms can’t cope with changes, resulting in congestion and inefficiencies
Voltaire routing performance optimization
• Optimizations for various topologies, enhanced over recent years in large clusters
• New major conceptual shift from static to traffic-pattern-based algorithms
• Traffic model can be derived automatically from topology
• Voltaire’s enhancements are built on top of OpenSM in a modular plug-in architecture
Voltaire’s routing optimizations improve fabric performance without increasing cost
Performance optimization: partitioning and QoS

UFM makes it possible to run multiple clusters or separate application jobs on the same infrastructure
Drag and drop configuration automatically creates dedicated IPC and virtual I/O for each cluster
Quality of Service can be associated with fabric partitions so critical applications get priority in fabric routing queues
• Easy configuration of QoS via GUI or CLI – assignment to pre-defined service levels
Changes in application needs are easily handled by simple re-allocation of servers to apps or networks

[Screenshot caption: Drag and drop assignment to network triggers all configurations in the ‘back-stage’]

Critical applications can be allocated the right resources and priority
Benefits

Boost Apps Performance with Voltaire UFM™

Optimize Real-Life Environments


Test Environment

12 nodes running a bandwidth-consuming job
2 nodes running a latency-critical job
Goal: achieve best performance with latency-critical tasks

W/O Partitioning: Latency degradation of ~215%

Latency job running alone (Latency = ~0.000210)
Bandwidth job added on same partition (Latency = ~0.000450)

Create Partitions and Set QoS in UFM

Create 2 Logical Groups


• Latency job
• B/W oriented job
Create 2 Networks
• One for each job
Assign Service Level
• SL0 – Low Latency Queue
• SL1 – 50% (high/bandwidth)
• (SL2 – 25%, SL3 – 25%)
UFM automatically creates virtual NICs, partitions and Service Level definitions

Run jobs with isolation and QoS – return almost to original performance (~5% impact only)

Latency job running alone (Latency = ~0.000210)
Bandwidth job added on same partition (Latency = ~0.000450)
Separate partitions and QoS (Latency = ~0.000220) (!)
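The two headline figures can be recomputed from the three latency readings. This is only the arithmetic; the readings are approximate on the slides, which also do not state their unit, so the sketch works with ratios.

```python
# Latency readings quoted on the slides (unit not stated in the source).
baseline = 0.000210   # latency job running alone
shared = 0.000450     # bandwidth job added on the same partition
isolated = 0.000220   # separate partitions and QoS

slowdown_shared = shared / baseline                   # how many times the baseline
overhead_isolated = (isolated - baseline) / baseline  # fraction above the baseline

print(f"same partition:   {slowdown_shared:.2f}x the baseline latency")
print(f"partitions + QoS: {overhead_isolated:.1%} above the baseline")
```

The shared-partition latency comes out at a little over twice the baseline, in line with the ~215% figure, and the isolated case lands at roughly 5% above it.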

Voltaire UFM™
Redefining Fabric Management

Voltaire UFM™
Monitor, Analyze & Optimize application performance, Automate and ease fabric management, Uses OpenSM with advanced routing Plug-ins

Voltaire GridVision™
Basic monitoring & Troubleshooting, Rich GUI, CLI, SNMP functionality, Voltaire SM, Embedded in Switches

Other Fabric Mgmt. Solution
Limited Proprietary SM, Device/Port oriented limited viewer and some troubleshooting tools

OpenSM
Subnet Manager only, Technology Test Bed
Voltaire engineer is the OpenSM Maintainer

Questions?

Open-MPI Accelerator (OMA)

Voltaire OMA – Benefits

Accelerating standard, open source Open-MPI


Significant performance improvement (shmem only)
• More effective when there is more intra-node communications (between cores)
• Depends on the HW (# of cores, # of sockets) and the traffic pattern
Enhanced documentation
Open-MPI expertise – RoadRunner and many others
Works with InfiniBand and Ethernet (iWARP and TCP)

How Is Shared Memory Done Today?

[Figure: two CPU sockets with 4 CPU cores, RAM attached to each socket, ccNUMA interconnect, an HCA/iWARP adapter, and a shared memory region in RAM used by processes #1 and #2]

1. Process #1 writes the data into shmem RAM
2. Process #2 reads the data from shmem RAM
The OMA Way

[Figure: the same two-socket system, with data moving from process #1 to process #2 in a single step]

1. For large messages, the kernel copies data from process #1 directly into process #2 (saving one copy); small messages are handled as today
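The saving can be pictured by counting copies per message. The sketch below is a toy model rather than OMA's implementation: the slides do not say where the large-message cutoff lies, so `LARGE_MESSAGE_THRESHOLD` is a hypothetical value, and the function names are illustrative.

```python
LARGE_MESSAGE_THRESHOLD = 4096  # hypothetical cutoff in bytes, not from the slides

def copies_shared_memory(size: int) -> int:
    """Today's path: sender copies into shmem, receiver copies out of it."""
    return 2

def copies_oma(size: int, threshold: int = LARGE_MESSAGE_THRESHOLD) -> int:
    """OMA path as described: one kernel copy for large messages, unchanged otherwise."""
    return 1 if size >= threshold else copies_shared_memory(size)

def bytes_moved(size: int, copies: int) -> int:
    """Total bytes written to memory while delivering one message."""
    return size * copies

for size in (256, 1_000_000):
    today = bytes_moved(size, copies_shared_memory(size))
    oma = bytes_moved(size, copies_oma(size))
    print(f"{size:>9} B message: {today} B moved today, {oma} B moved with OMA")
```

In this model a large message moves half as many bytes through memory, which fits the slide's remark that the benefit grows with the amount of intra-node communication.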
OMA - Fluent – Aircraft Benchmark

[Chart: Fluent Aircraft. x-axis: # of processes (0 to 35); y-axis: Fluent Rating (0 to 800); series: Open MPI with OMA, Open MPI; improvement labels: 10%, 9%, 7%, 11%, 25%]

* OMA improves Fluent Aircraft Benchmark by up to 25%

[Charts: "eff. bandwidth/proc for alltoall" and "pingpong bandwidth". Both plot MB/s against message size in bytes (1 to 10,000,000) for HP-MPI, MVAPICH2, OPENMPI and OPENMPI+OMA.]

Questions ?