Open MPI: a Next Generation Implementation of the Message Passing Interface
Edgar Gabriel
High Performance Computing Center Stuttgart (HLRS)
gabriel@hlrs.de
Contributors
Los Alamos National Laboratory
Indiana University
The University of Tennessee
Sandia National Laboratories – Livermore
The University of Stuttgart (HLRS)
Krell Institute
Current situation on clusters
Wide diversity of MPI libraries available, confusing end-users and annoying system administrators
Diversity due to:
various MPI implementations
• MPICH-1, MPICH-2, LAM/MPI, LA-MPI, FT-MPI, PACX-MPI, …
• vendor MPI libraries – often incompatible spin-offs of standard public-domain libraries
• MPI libraries from HPC network providers – often incompatible spin-offs of standard public-domain libraries
various versions of each MPI library required by ISV demands
various network drivers (TCP, SHMEM, TCP+SHMEM, IB, GM, …)
various compilers
Motivation
Consolidation: merger of ideas from prior implementations:
FT-MPI: University of Tennessee
LA-MPI: Los Alamos National Laboratory
LAM/MPI: Indiana University
PACX-MPI: HLRS, University of Stuttgart
Design Goals
MPI-2
High Performance
Thread safe (see the sketch after this list)
Based On Component Architecture
Flexible run-time instrumentation
Portable
Maintainable
Production quality
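To illustrate the thread-safety and MPI-2 goals: a minimal sketch of how a user application would request full multi-threaded support through the standard MPI-2 interface. Whether MPI_THREAD_MULTIPLE is actually granted depends on how the library was built.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multi-threaded support; the library reports
     * the level it can actually provide. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        printf("full thread support not available (level %d)\n", provided);

    MPI_Finalize();
    return 0;
}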
Supported environments
Networks              Operating systems   Run-time environments
TCP/IP                Linux               rsh/ssh
Myrinet (gm, mx)      BSD                 RMS (Quadrics)
Quadrics (Elan4)      AIX                 Bproc
Shared memory         MS Windows          LSF
Infiniband (OpenIB)   Solaris             PBS
Portals               HP/UX               SLURM
LAPI                  IRIX                Load Leveler
X1 interconnect       UNICOS/mp           YOD
Multiple heterogeneous networks in a single process/message
Grid/Multi-cell?
Daemon and daemon-less modes
MPI Component Architecture (MCA)
[Figure: the user application calls the MPI API; beneath the API, the MPI Component Architecture dispatches to many interchangeable components, arranged in columns per framework]
MCA Component Frameworks
Components are divided into three categories:
Back-end to MPI API functions
Run-time environment
Infrastructure / management
Rule of thumb: "If we'll ever want more than one implementation, make it a component."
MCA Component Types
MPI types         Run-time env. types
P2P management    Out of band
Component / Module Lifecycle
Open → Selection → Initialization → Close
Open: per-process initialization of a component
Selection: per-scope – determine whether the component wants to be used
Initialization: a selected component creates a module for each scope
Close: per-process finalization
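A minimal sketch of this lifecycle in C. All names here (component_t, the tcp/shmem dummies, the priority-based selection loop) are illustrative assumptions, not Open MPI's actual MCA interface:

#include <stdio.h>

/* Hypothetical component interface; not Open MPI's actual MCA types. */
typedef struct component {
    const char *name;
    int (*open)(void);            /* per-process initialization  */
    int (*query)(int *priority);  /* selection: usable? how good? */
    int (*close)(void);           /* per-process finalization     */
} component_t;

/* Two dummy components standing in for real back-ends. */
static int tcp_open(void)       { return 0; }
static int tcp_query(int *p)    { *p = 10; return 0; }
static int tcp_close(void)      { return 0; }
static int shm_open_c(void)     { return 0; }
static int shm_query(int *p)    { *p = 50; return 0; }
static int shm_close(void)      { return 0; }

int main(void)
{
    component_t comps[] = {
        { "tcp",   tcp_open,   tcp_query, tcp_close },
        { "shmem", shm_open_c, shm_query, shm_close },
    };
    int n = (int)(sizeof comps / sizeof comps[0]);
    component_t *best = NULL;
    int best_prio = -1;

    /* Open every component, then select by reported priority. */
    for (int i = 0; i < n; i++) {
        int prio;
        if (comps[i].open() != 0)
            continue;
        if (comps[i].query(&prio) == 0 && prio > best_prio) {
            best_prio = prio;
            best = &comps[i];
        }
    }

    if (best)
        printf("selected: %s (priority %d)\n", best->name, best_prio);

    /* Close everything at finalization. */
    for (int i = 0; i < n; i++)
        comps[i].close();
    return 0;
}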
About components
fixed naming scheme determines where to find a component
components can also be statically compiled into the library
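For illustration, assuming the "mca_<type>_<name>" convention used by later Open MPI releases (the slide itself does not spell the scheme out): a TCP component of the ptl framework would live in a shared object such as mca_ptl_tcp.so, discovered by scanning a known component directory, or be linked statically into the library under the same name.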
[Figure: beneath the user application and the MPI API sit four example frameworks with their components – PML (TEG, …), PTL (shared memory, TCP/IP, IB, …), memory pooling (power-of-2 binning, best fit, …), memory management (process-private, shared, pinned, …)]
Pt2Pt Components
PML: MPI aware
PTL: data mover
Pt2Pt Data flow
[Figure: Pt2Pt data flow between sender and receiver – on each side the PML (TEG) sits above several PTLs (IB 1, IB 2, GM, GigE TCP); the first fragment of a message travels over one PTL while the last fragments are spread across the others]
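A minimal sketch of this kind of fragment scheduling in C. The fragment size, the round-robin striping policy, and all names are illustrative assumptions; the actual TEG scheduler is more elaborate:

#include <stdio.h>

/* Hypothetical PTL handle; the real PTL interface is far richer. */
typedef struct ptl { const char *name; } ptl_t;

#define FRAG_SIZE 32768u  /* assumed fragment size */

/* Stand-in for handing one fragment to a PTL for transmission. */
static void ptl_send(const ptl_t *p, size_t off, size_t len)
{
    printf("%-4s <- fragment [%zu, %zu)\n", p->name, off, off + len);
}

/* First fragment goes out eagerly on one PTL; the remaining
 * fragments are striped round-robin across all available PTLs. */
static void pml_send(ptl_t *ptls, int nptl, size_t msg_len)
{
    size_t off = 0, len;
    int next = 0;

    len = msg_len < FRAG_SIZE ? msg_len : FRAG_SIZE;
    ptl_send(&ptls[0], off, len);          /* first fragment */
    off += len;

    while (off < msg_len) {                /* last fragments */
        len = msg_len - off < FRAG_SIZE ? msg_len - off : FRAG_SIZE;
        ptl_send(&ptls[next], off, len);
        next = (next + 1) % nptl;
        off += len;
    }
}

int main(void)
{
    ptl_t ptls[] = { { "ib0" }, { "ib1" }, { "gm" }, { "tcp" } };
    pml_send(ptls, 4, 150000);             /* a 150 kB message */
    return 0;
}

The point of the sketch: the PML, not the application, decides how a message is split across the available networks.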
TCP/IP Latency Data (nonblocking)
[Table: one-way latency per implementation – columns: Implementation, Myrinet (µs), GigE (µs)]
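Latency numbers of this kind are typically gathered with a ping-pong microbenchmark; below is a minimal sketch using nonblocking sends and receives (the slides do not state which benchmark was actually used):

#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                   /* ping, then wait for pong */
            MPI_Isend(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Irecv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {            /* echo it back */
            MPI_Irecv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}

Run with two ranks, e.g. mpirun -np 2 ./pingpong.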
Bandwidths 1 NIC GigE
[Figure: bandwidth vs. message size (1 B to 10 MB); series: FT-MPI, LA-MPI, LAM/MPI 7, MPICH2, Open MPI (poll), Open MPI (async), raw TCP/IP]
Bandwidths 1 NIC IP over Myrinet
[Figure: bandwidth vs. message size (1 B to 10 MB); series: FT-MPI, LAM/MPI 7, MPICH2, LA-MPI, Open MPI (poll), Open MPI (async), raw TCP/IP]
Bandwidth 2 NICs IP over Myrinet
[Figure: bandwidth vs. message size (1 B to 10 MB); series: LA-MPI, Open MPI (poll), Open MPI (1 NIC)]
Current situation on clusters (II)
Diversity due to:
various MPI implementations
• MPICH-1, MPICH-2, LAM/MPI, LA-MPI, FT-MPI, PACX-MPI, …
• vendor MPI libraries – often incompatible spin-offs of standard public-domain libraries
• MPI libraries from HPC network providers – often incompatible spin-offs of standard public-domain libraries
various versions of each MPI library required by ISV demands
various network drivers (TCP, SHMEM, TCP+SHMEM, IB, GM, …)
various compilers
Much of this will be solved by Open MPI; the rest could be solved by Open MPI, but requires convincing others of Open MPI.
Future Work
Alpha release has been available to very friendly users for the past two weeks
Public beta release by the end of April
Ongoing work:
Data fault tolerance (expand on LA-MPI)
Process fault tolerance (expand on FT-MPI and CPR-type ideas)
Add one-sided operations
Add many more component types