4036 - 36 ports
4700 – From 324 ports to ∞
© 2009 Voltaire Inc. Confidential - Internal 2
InfiniBand: a black box?
[Figure: eight leaf switches with 12 nodes each]
[Figure: three leaf switch configurations: 12 uplinks and 12 nodes, 8 uplinks and 16 nodes, 4 uplinks and 20 nodes]
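The three uplink/node splits in the figure can be read as oversubscription (blocking) ratios. The sketch below is only an illustration: it assumes the columns pair up as 12/12, 8/16 and 4/20 (uplinks/nodes), and that node links and uplinks run at the same speed. The function name `oversubscription` is hypothetical.

```python
from fractions import Fraction

def oversubscription(nodes, uplinks):
    """Ratio of node-facing ports to uplink ports on one leaf switch.

    Assumes every link has the same bandwidth, so the port ratio is also
    the worst-case bandwidth ratio when all nodes send off-switch.
    """
    return Fraction(nodes, uplinks)

# (nodes, uplinks) pairs as read from the figure; the pairing is an assumption.
configs = [(12, 12), (16, 8), (20, 4)]
ratios = [oversubscription(n, u) for n, u in configs]

for (n, u), r in zip(configs, ratios):
    print(f"{n} nodes / {u} uplinks -> {r}:1")
```

Under these assumptions the first configuration is non-blocking, while the other two trade uplink bandwidth for more node ports per switch.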
[Figure: nodes labeled CN, three groups of 12 nodes and one IO node; labels: drop, recovery]
•A-E
•B-H
•C-G
•D-F
•No link contention
[Figure: leaf switches AB, CD, EF, GH, each with ports 1 2 (upper row) and 3 4 (lower row); legend: IB path, 2 symmetric IB paths]
Leaf switch forwarding tables (destination node, output port):
Dest   AB   CD   EF   GH
A      3    1    1    1
B      4    2    2    2
C      1    3    1    1
D      2    4    2    2
E      1    1    3    1
F      2    2    4    2
G      1    1    1    3
H      2    2    2    4
Communication Patterns (un-balanced)
•Communication pattern:
•A-C
•B-E
•D-G
•F-H
•2:1 link contention:
•A->C and B->E share uplink to Switch 1 port 1
•G->D and H->F share uplink to Switch 2 port 4
[Figure: Switch 1 and Switch 2 (ports 1 2 3 4) above leaf switches AB, CD, EF, GH (ports 1 2 and 3 4); labels: downlinks, uplinks; legend: IB path, 2 symmetric IB paths]
Switch 1 and Switch 2 forwarding tables (destination node, output port; identical on both switches):
Dest   Port
A      1
B      1
C      2
D      2
E      3
F      3
G      4
H      4
Leaf switch forwarding tables (destination node, output port):
Dest   AB   CD   EF   GH
A      3    1    1    1
B      4    2    2    2
C      1    3    1    1
D      2    4    2    2
E      1    1    3    1
F      2    2    4    2
G      1    1    1    3
H      2    2    2    4
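The contention claims on the two communication-pattern slides can be checked mechanically from the leaf forwarding tables. The following is a minimal sketch, not a description of any Voltaire tool: it counts how many directed flows leave each leaf switch on each uplink port, and it ignores the downlinks of Switch 1 and Switch 2. The names `LEAF_OF`, `UPLINK_FOR_DEST`, `uplink_load` and `both_directions` are hypothetical helpers. The un-balanced pattern is entered with the flow directions written on the slide (A->C, B->E, G->D, H->F).

```python
from collections import Counter

# Leaf switch each node hangs off (leaves AB, CD, EF, GH on the slide).
LEAF_OF = {node: leaf for leaf in ("AB", "CD", "EF", "GH") for node in leaf}

# Uplink port a leaf uses for a remote destination, read off the leaf tables:
# destinations A, C, E, G leave on port 1; B, D, F, H leave on port 2.
UPLINK_FOR_DEST = {"A": 1, "C": 1, "E": 1, "G": 1,
                   "B": 2, "D": 2, "F": 2, "H": 2}

def uplink_load(flows):
    """Count directed flows per (leaf switch, uplink port)."""
    load = Counter()
    for src, dst in flows:
        if LEAF_OF[src] != LEAF_OF[dst]:  # same-leaf traffic uses no uplink
            load[(LEAF_OF[src], UPLINK_FOR_DEST[dst])] += 1
    return load

def both_directions(pairs):
    """Expand node pairs into one directed flow per direction."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]

# Pattern from the "no link contention" slide, counted in both directions.
balanced = both_directions([("A", "E"), ("B", "H"), ("C", "G"), ("D", "F")])
# Directed flows named on the un-balanced slide.
unbalanced = [("A", "C"), ("B", "E"), ("G", "D"), ("H", "F")]

balanced_load = uplink_load(balanced)
unbalanced_load = uplink_load(unbalanced)

print("balanced worst case:", max(balanced_load.values()))
print("un-balanced loads:", dict(unbalanced_load))
```

With these tables, every uplink carries at most one flow in the first pattern, while in the second pattern leaf AB port 1 and leaf GH port 2 each carry two flows, which matches the 2:1 contention stated on the slide.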
Optimization of Parallel Applications?
Single-thread optimization
• Some examples:
Instruction Pipelining
Blocking
Prefetch data
• Tools: processor counters, profiling tools, compiler reports, etc.
• Goal: Overcome processor, cache, and memory architecture constraints
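As an illustration of the "Blocking" item above, here is a minimal sketch of loop blocking (tiling) applied to matrix multiplication. It only shows the loop structure: in pure Python the cache benefit is negligible, and the function name `matmul_blocked` and the default block size are assumptions chosen for the example, not values from the slides.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p) tile by tile.

    The three outer loops walk over tiles; the three inner loops work
    inside one tile, so the same small pieces of a, b and c are reused
    while they are still likely to be in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = matmul_blocked(a, b)
print(product)
```

In a compiled language the block size would normally be tuned so that the tiles being worked on fit in the targeted cache level.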
Parallel optimization, scalability
• Some examples:
Load Balancing
Mix OpenMP and MPI
Barrier optimization
• Tools: MPI profilers (Intel Trace Analyzer, etc.)
• Goals: Overcome load-balancing issues, increase the computation-to-communication ratio, use parallel I/O, etc.
Fabric optimization?
• Benchmarking and production environments are different
• Systems are used simultaneously by several applications, with several kinds of traffic
• Multiple concurrent flows must be handled efficiently
[Figure: UFM layers: Schedulers (optional), Applications, Fabric Policy, Monitoring, Virtual Infrastructure, Physical Infrastructure]
12 nodes running a bandwidth-consuming job
2 nodes running a latency-critical job
Goal: achieve the best performance for the latency-critical tasks
Voltaire UFM™
• Monitor, analyze and optimize application performance
• Automate and ease fabric management
• Uses OpenSM with advanced routing plug-ins
Voltaire GridVision™
• Basic monitoring and troubleshooting
• Rich GUI, CLI, SNMP functionality
• Voltaire SM, embedded in switches
Other fabric management solutions
• Limited proprietary SM
• Device/port-oriented limited viewer and some troubleshooting tools
OpenSM
• Subnet Manager only, technology test bed
• A Voltaire engineer is the OpenSM maintainer
Questions?
[Figure, shown twice: node with CPU socket #1 and CPU socket #2 (4 CPU cores), RAM attached to each socket (NUMAcc), HCA/iWARP adapter; processes 1 and 2]
Shared memory
1. For large messages, the kernel will copy data directly from process #1 into process #2 (saving one copy); small messages will be handled as they are today.
OMA - Fluent – Aircraft Benchmark
[Figure: Fluent Aircraft. Fluent Rating (0 to 800) versus # of processes (0 to 35); data labels: 10%, 9%, 7%, 11%, 25%]
[Figure: two bandwidth plots, MB/s (0 to 3000 and 0 to 1500) versus message size in bytes (1 to 10,000,000, logarithmic axis)]