
Open MPI:
a Next Generation Implementation of the Message Passing Interface

Edgar Gabriel
High Performance Computing Center Stuttgart (HLRS)
gabriel@hlrs.de
   
Contributors
 Los Alamos National Laboratory
 Indiana University
 The University of Tennessee
 Sandia National Laboratories - Livermore
 The University of Stuttgart (HLRS)
 Krell Institute

   
Current situation on clusters
 Wide diversity of MPI libraries available
  confusing for end users
  annoying for system administrators
 Diversity due to
  various MPI implementations
   • MPICH-1, MPICH-2, LAM/MPI, LA-MPI, FT-MPI, PACX-MPI, …
   • vendor MPI libraries – often incompatible spin-offs of standard public domain libraries
   • MPI libraries from HPC network providers – often incompatible spin-offs of standard public domain libraries
  various versions of each MPI library required by ISVs
  various network drivers (TCP, SHMEM, TCP+SHMEM, IB, GM, …)
  various compilers
Motivation
 Consolidation: merger of ideas from prior implementations
  FT-MPI: University of Tennessee
  LA-MPI: Los Alamos
  LAM/MPI: Indiana University
  PACX-MPI: University of Stuttgart (HLRS)
 Share development effort
 Share support load

   
Design Goals
 MPI-2
 High Performance
 Thread safe (illustrated in the sketch below)
 Based On Component Architecture
 Flexible run-time instrumentation
 Portable
 Maintainable
 Production quality

 Fault tolerant (optional)
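
What the "MPI-2" and "thread safe" goals mean from the application side can be shown with a minimal sketch that uses only standard MPI-2 calls; whether MPI_THREAD_MULTIPLE is actually granted depends on how a particular library was built.

```c
/* Minimal sketch: an application requesting full thread support from an
 * MPI-2 library. Only standard MPI calls are used; the library reports
 * the thread level it actually provides. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for the highest thread level defined by MPI-2. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        printf("library grants thread level %d only\n", provided);
    }

    MPI_Finalize();
    return 0;
}
```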

   
Supported environments
 Interconnects
  TCP/IP
  Myrinet (gm, mx)
  Quadrics (Elan4)
  Shared memory
  Infiniband (OpenIB)
  Portals
  LAPI
  X1 interconnect
  Multiple heterogeneous networks in a single process/message
 Operating systems
  Linux, BSD, AIX, MS Windows, Solaris, HP/UX, IRIX, UNICOS/mp, ?
 Run-time / launch environments
  rsh/ssh, RMS (Quadrics), Bproc, LSF, PBS, SLURM, Load Leveler, YOD, Grid/Multi-cell, daemon and daemon-less modes

   
MPI Component Architecture (MCA)

[Figure: layered architecture. The user application calls the MPI API, which sits on top of the MPI Component Architecture (MCA); the MCA hosts many component frameworks, and each framework contains one or more components.]

   
MCA Component Frameworks
 Components divided into three categories:
  Back-end to MPI API functions
  Run-time environment
  Infrastructure / management
 Rule of thumb: “If we’ll ever want more than one implementation, make it a component”

   
MCA Component Types
 MPI types
  P2P management
  P2P transport
  Collectives
  Topologies
  MPI-2 one-sided
  MPI-2 I/O
  Reduction operations
 Run-time environment types
  Out of band
  Process control
  Global data registry
 Management types
  Memory pooling
  Memory allocation
  Common
   
Component / Module Lifecycle
 Component
  Open: per-process initialization
  Selection: per-scope, determine whether this component should be used
  Close: per-process finalization
 Module
  Initialization: once the component has been selected
  Normal usage / checkpoint-restart
  Finalization: per-scope cleanup
 (These hooks are sketched in C below.)
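A hypothetical C sketch of this lifecycle follows; every name is invented for illustration and is not Open MPI's actual internal API.

```c
/* Hypothetical sketch of the component/module lifecycle described above.
 * Every name here is invented for illustration; this is NOT Open MPI's
 * actual internal API. */
#include <stdio.h>

/* Per-process component hooks. */
struct example_component {
    const char *name;
    int (*open)(void);        /* per-process initialization                   */
    int (*select)(void);      /* per-scope: do we want to use this component? */
    int (*close)(void);       /* per-process finalization                     */
};

/* Per-scope module hooks, created once the component has been selected. */
struct example_module {
    int (*init)(void);        /* initialization after selection */
    int (*finalize)(void);    /* per-scope cleanup              */
};

static int my_open(void)     { puts("open");     return 0; }
static int my_select(void)   { puts("selected"); return 1; }
static int my_close(void)    { puts("close");    return 0; }
static int my_init(void)     { puts("init");     return 0; }
static int my_finalize(void) { puts("finalize"); return 0; }

int main(void)
{
    struct example_component comp = { "example", my_open, my_select, my_close };
    struct example_module    mod  = { my_init, my_finalize };

    comp.open();                /* open: per-process initialization       */
    if (comp.select()) {        /* selection: per-scope decision          */
        mod.init();             /* module initialization                  */
        /* ... normal usage / checkpoint-restart would happen here ...    */
        mod.finalize();         /* module finalization: per-scope cleanup */
    }
    comp.close();               /* close: per-process finalization        */
    return 0;
}
```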
About components
 Fixed naming scheme
 Where components can live:
  statically compiled into the library
  dynamically compiled with the library (.so files)
  copy the .so file into the installation directory (no source code necessary!)
  copy the .so file into any directory and tell Open MPI where it is and to use it
  configuration file describing which components to use (e.g. centrally installed by an administrator)
 (A generic sketch of run-time loading of a .so component is shown below.)
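How a dropped-in .so file can be picked up at run time can be illustrated with the POSIX dlopen/dlsym interface. The file name and symbol name below are invented for this sketch; this is not Open MPI's actual component loader.

```c
/* Generic sketch of loading a plug-in from a .so file with the POSIX
 * dlopen/dlsym interface (link with -ldl). The path and the symbol name
 * are invented; this is not Open MPI's actual component loader. */
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical entry point every component is assumed to export. */
typedef int (*component_open_fn)(void);

int main(void)
{
    /* Hypothetical component file dropped into an installation directory. */
    void *handle = dlopen("./example_component.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the well-known entry point and run its per-process open hook. */
    component_open_fn open_hook =
        (component_open_fn) dlsym(handle, "example_component_open");
    if (open_hook != NULL && open_hook() == 0) {
        printf("component opened successfully\n");
    }

    dlclose(handle);
    return 0;
}
```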
   
Point-to-Point

[Figure: point-to-point component stack. The user application calls the MPI API on top of the MCA; the frameworks involved are the PML, the PTL, memory pooling and memory management. Example components include TEG (PML); shared memory, TCP/IP and IB (PTL); and memory pooling/allocation strategies such as process-private, shared, pinned, power-of-2 binning and best fit.]
   
Pt-2-Pt Components
 PML
  MPI aware
  Fragments messages
  May request acks
  Reassembles data
 PTL
  Data mover
  Message matching
  Responsible for its own progress (polling or async)
  Fault tolerance
   
Pt-2-Pt Data flow

[Figure: point-to-point data flow. On the send side the PML (TEG) sends the first fragment over one PTL (e.g. IB-1) and schedules the remaining fragments across all available PTLs (IB-1, IB-2, GM, TCP-GigE); the PML on the receive side reassembles the fragments.]

From the application's point of view this is ordinary non-blocking traffic, as in the sketch below.
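A minimal sketch of that application view, using only standard MPI non-blocking calls; fragmentation, scheduling across PTLs and reassembly all happen inside the library (the message size below is an arbitrary choice for illustration).

```c
/* Application view of the non-blocking point-to-point path: the program
 * only posts standard MPI_Isend/MPI_Irecv; fragmentation, PTL scheduling
 * and reassembly are handled by the library. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;              /* large message: will be fragmented internally */
    double *buf = malloc(count * sizeof(double));
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }
    if (rank < 2) {
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* progress may be polled or asynchronous */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```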
TCP/IP Latency Data (non-blocking)

Implementation      Myrinet (µs)   GigE (µs)
Open MPI (poll)         51.5          39.7
Open MPI (async)        51.2          49.9
LAM/MPI 7               51.5          39.9
LA-MPI                  51.6          42.9
FT-MPI                  51.4          46.4
MPICH2                  51.5          40.3
   
Bandwidths - 1 NIC GigE

[Figure: bandwidth vs. message size (1 byte to 10 MB) over one GigE NIC, comparing FT-MPI, LA-MPI, LAM/MPI 7, MPICH2, Open MPI (poll), Open MPI (async) and raw TCP/IP.]

   
Bandwidths - 1 NIC IP over Myrinet

[Figure: bandwidth vs. message size (1 byte to 10 MB) over one NIC with IP over Myrinet, comparing FT-MPI, LAM/MPI 7, MPICH2, LA-MPI, Open MPI (poll), Open MPI (async) and raw TCP/IP.]

   
Bandwidth - 2 NIC IP over Myrinet

[Figure: bandwidth vs. message size (1 byte to 10 MB) over two NICs with IP over Myrinet, comparing LA-MPI, Open MPI (poll) and Open MPI restricted to one NIC.]

   
Current situation on clusters (II)
 Diversity due to
  various MPI implementations
   • MPICH-1, MPICH-2, LAM/MPI, LA-MPI, FT-MPI, PACX-MPI, …
   • vendor MPI libraries – often incompatible spin-offs of standard public domain libraries
   • MPI libraries from HPC network providers – often incompatible spin-offs of standard public domain libraries
  various versions of each MPI library required by ISVs
  various network drivers (TCP, SHMEM, TCP+SHMEM, IB, GM, …)
  various compilers
 Some of these issues will be solved by Open MPI
 Others could be solved by Open MPI, but require convincing others to adopt Open MPI
Future Work
 An alpha release has been available to very friendly users for two weeks
 Public beta release by the end of April
 Ongoing work:
  data fault tolerance (expanding on LA-MPI)
  process fault tolerance (expanding on FT-MPI and CPR-type ideas)
  add one-sided operations (standard interface sketched below)
  add many more component types
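
The "add one-sided operations" item refers to the standard MPI-2 one-sided interface; a minimal sketch of what such application code looks like (standard MPI calls only, run with at least two ranks):

```c
/* Minimal sketch of the standard MPI-2 one-sided interface: rank 0 puts a
 * value directly into a window exposed by rank 1. Standard MPI calls only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, local = 0, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one int in a window. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Write into rank 1's window; rank 1 posts no matching receive. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```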

   
