IBM Parallel Environment for AIX 5L: MPI Programming Guide
SA22-7945-04
Note
Before using this information and the product it supports, read the information in “Notices” on page 225.
All function implemented in the PE MPI product is designed to comply with the
requirements of the Message Passing Interface Forum, MPI: A Message-Passing
Interface Standard, Version 1.1, University of Tennessee, Knoxville, Tennessee,
June 6, 1995 and MPI-2: Extensions to the Message-Passing Interface, University
of Tennessee, Knoxville, Tennessee, July 18, 1997. The second volume includes a
section identified as MPI 1.2, with clarifications and limited enhancements to MPI
1.1. It also contains the extensions identified as MPI 2.0. The three sections, MPI
1.1, MPI 1.2, and MPI 2.0 taken together constitute the current standard for MPI.
PE MPI provides support for all of MPI 1.1 and MPI 1.2. PE MPI also provides
support for all of the MPI 2.0 enhancements, except the contents of the chapter
titled "Process creation and management."
If you believe that PE MPI does not comply, in any way, with the MPI standard for
the portions that are implemented, please contact IBM service.
Convention Usage
bold Bold words or characters represent system elements that you must
use literally, such as: command names, file names, flag names,
path names, PE component names (pedb, for example), and
subroutines.
constant width Examples and information that the system displays appear in
constant-width typeface.
italic Italicized words or characters represent variable values that you
must supply.
Italics are also used for book titles, for the first use of a glossary
term, and for general emphasis in text.
[item] Used to indicate optional items.
<Key> Used to indicate keys you press.
\ The continuation character is used in coding examples in this book
for formatting purposes.
User actions appear in uppercase boldface type. For example, if the action is to
enter the tool command, this manual presents the instruction as:
ENTER
tool
Abbreviated names
Some of the abbreviated names used in this book follow.
Table 1. Parallel Environment abbreviations
Short Name Full Name
AIX Advanced Interactive Executive
CSM Cluster Systems Management
CSS communication subsystem
CTSEC cluster-based security
DPCL dynamic probe class library
dsh distributed shell
GUI graphical user interface
HDF Hierarchical Data Format
IP Internet Protocol
LAPI Low-level Application Programming Interface
MPI Message Passing Interface
To access the most recent Parallel Environment documentation in PDF and HTML
format, refer to the IBM eServer Cluster Information Center on the Web at:
http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp
Both the current Parallel Environment books and earlier versions of the library are
also available in PDF format from the IBM Publications Center Web site located at:
http://www.ibm.com/shop/publications/order/
It is easiest to locate a book in the IBM Publications Center by supplying the book’s
publication number. The publication number for each of the Parallel Environment
books is listed after the book title in the preceding list.
The PE message catalogs are in English, and are located in the following
directories:
/usr/lib/nls/msg/C
/usr/lib/nls/msg/En_US
/usr/lib/nls/msg/en_US
If your site is using its own translations of the message catalogs, consult your
system administrator for the appropriate value of NLSPATH or LANG. For more
information on NLS and message catalogs, see AIX: General Programming
Concepts: Writing and Debugging Programs.
Performance of jobs using the MPI library can be affected by the setting of various
environment variables. The complete list is provided in Chapter 11, “POE
environment variables and command-line flags,” on page 69 and in IBM Parallel
Environment for AIX 5L: Operation and Use, Volume 1. Programs that conform to
the MPI standard should run correctly with any combination of environment
variables within the supported ranges.
The defaults of these environment variables are generally set to optimize the
performance of the User Space library for MPI programs with one task per
processor, using blocking communication. Blocking communication includes sets of
non-blocking send and receive calls followed immediately by wait or waitall, as well
as explicitly blocking send and receive calls. Applications that use other
programming styles, in particular those that do significant computation between
posting non-blocking sends or receives and calling wait or waitall, may see a
performance improvement if some of the environment variables are changed.
The MPI library is a dynamically loaded shared object, whose symbols are linked
into the user application. At run time, when MPI_Init is called by the application
program, the various environment variables are read and interpreted, and the
underlying transport is initialized. Depending on the setting of the transport variable
MP_EUILIB, MPI initializes lower level protocol support for a User Space packet
mode, or for a UDP/IP socket mode. By default, the shared memory mechanism for
point-to-point messages (and in 64-bit applications, collective communication) is
also initialized.
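For example, the transport can be selected in the shell before the job is started. This is a minimal sketch; my_mpi_program and the task count are placeholders:

export MP_EUILIB=us     # User Space packet mode; use ip for UDP/IP socket mode
poe my_mpi_program -procs 16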
Tasks on the same node can use operating system shared memory transport for
point-to-point communication. Shared memory is used by default, but may be turned
off with the environment variable MP_SHARED_MEMORY. In addition, 64-bit
applications are provided an optimization where the MPI library uses shared
memory directly for selected collective communications, rather than just mapping
the collectives into point-to-point communications. The collective calls for which this
optimization is provided include MPI_Barrier, MPI_Reduce, MPI_Bcast,
MPI_Allreduce and others. This optimization is enabled by default, and disabled by
setting environment variable MP_SHARED_MEMORY to no. For most programs,
enabling the shared memory transport for point-to-point and collective calls provides
better performance than using the network transport.
For more information on shared memory, see Chapter 3, “Using shared memory,”
on page 15.
MPI IP performance
MPI IP performance is affected by the socket-buffer sizes for sending and receiving
UDP data. These are defined by two network tuning parameters udp_sendspace
and udp_recvspace. When the buffer for sending data is too small and quickly
becomes full, UDP data transfer can be delayed. When the buffer for receiving data
is too small, incoming UDP data can be dropped due to insufficient buffer space,
resulting in send-side retransmission and very poor performance.
LAPI, on which MPI is running, tries to increase the size of send and receive
buffers to avoid this performance degradation. However, the buffer sizes,
udp_sendspace and udp_recvspace, cannot be greater than another network
tuning parameter sb_max, which can be changed only with privileged access rights
(usually root). For optimal performance, it is suggested that sb_max be increased
to a relatively large value. For example, increase sb_max from the default of
1048576 to 8388608 before running MPI IP jobs.
The UDP/IP transport can be used on clustered servers where a high speed
interconnect is not available, or can use the IP mode of the high speed
interconnect, if desired. This transport is often useful for program development or
initial testing, rather than production. Although this transport does not match User
Space performance, it consumes only virtual adapter resources rather than limited
real adapter resources.
Details on the network tuning parameters, such as their definitions and how to
change their values, can be found in the man page for the AIX no command.
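As an illustration, a system administrator (with root authority) might raise the limits with the no command before MPI IP jobs are run. The sb_max value follows the suggestion above; the udp_sendspace and udp_recvspace values are illustrative only and are site-specific:

no -o sb_max=8388608        # raise the socket buffer ceiling
no -o udp_sendspace=65536   # illustrative value
no -o udp_recvspace=655360  # illustrative value
no -a | grep udp            # display the current UDP-related settings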
The underlying transport for MPI is LAPI, which is packaged with AIX as part of the
RSCT file set. LAPI provides a one-sided message passing API, with optimizations
to support MPI. Except when dealing with applications that make both MPI and
direct LAPI calls, or when considering compatibility of PE and RSCT levels, there is
usually little need for the MPI user to be concerned about what is in the MPI layer
and what is in the LAPI layer.
Eager messages
An eager send passes its buffer pointer, communicator, destination, length, tag, and
datatype information to a lower-level protocol (LLP) reliable message delivery function. If the message is
small enough, it is copied from the user’s buffer into a protocol managed buffer, and
the MPI send is marked complete. This makes the user’s send buffer immediately
available for reuse. A longer message is not copied, but is transmitted directly from
the user's buffer. In this second case, the send cannot be marked complete until the
data has reached the destination and the packets have been acknowledged. In either
case the send is reliable, because either the message itself, or a copy of it, is
preserved until it can be delivered.
Whenever a send is active, and at other convenient times such as during a blocking
receive or wait, a message dispatcher is run. This dispatcher sends and receives
messages, creating packets for and interpreting packets from the lower level packet
driver (User Space or IP). Since UDP/IP and User Space are both unreliable packet
transports (packets may be dropped during transport without an error being
reported), the message dispatcher manages packet acknowledgment and
retransmission with a sliding window protocol. This message dispatcher is also
run on a hidden thread once every few hundred milliseconds and, if environment
variable MP_CSS_INTERRUPT is set, upon notification of packet arrival.
When the message dispatcher recognizes the first packet of an inbound message,
a header handler or upcall is invoked. This upcall is to a function within the MPI
layer that searches a list of descriptors for posted but unmatched receives. If a
match is found, the descriptor is unlinked from the unmatched receives list and data
will be copied directly from the packets to the user buffer. The receive descriptor is
marked by a second upcall (a completion handler), when the dispatcher detects the
final packet so that the MPI application can recognize that the receive is complete.
If a receive is not found by the header handler upcall, an early arrival buffer is
allocated by MPI and the message data will be copied to that buffer. A descriptor
similar to a receive descriptor but containing a pointer to the early arrival buffer is
added to an early arrivals list. When an application does make a receive call, the
early arrivals list is searched. If a match is found:
1. The descriptor is unlinked from the early arrivals list.
2. Data is copied from the early arrival buffer to the user buffer.
3. The early arrival buffer is freed.
4. The descriptor (which is now associated with the receive) is marked so that the
MPI application can recognize that the receive is complete.
The size of the early arrival buffer is controlled by the MP_BUFFER_MEM
environment variable.
The MPI standard requires that a send not complete until it is guaranteed that its
data can be delivered to the receiver. For an eager send, this means the sender
must know in advance that there is sufficient buffer space at the destination to
cache the message if no posted receive is found. The PE MPI library accomplishes
this by using a credit flow control mechanism. At initialization time, each source to
destination pair is allocated a fixed, identical number of message credits. The
number of credits per pair is calculated based on environment variables
MP_EAGER_LIMIT, MP_BUFFER_MEM, and the total number of tasks in the job.
If an eager message arrives and finds a match, the credit is freed immediately
because the early arrival buffer space that it represents is not needed. If data must
be buffered, the credit is tied up until the matching receive call is made, which
allows the early arrival buffer to be freed. PE MPI returns message flow control
credits by piggybacking them on some regular message going back to the sender, if
possible. If credits pile up at the destination and there are no application messages
going back, MPI must send a special purpose message to return the credits. For
more information on the early arrival buffer and the environment variables,
MP_EAGER_LIMIT and MP_BUFFER_MEM, see Chapter 11, “POE environment
variables and command-line flags,” on page 69 and Appendix E, “PE MPI buffer
management for eager protocol,” on page 219.
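For example, eager protocol behavior might be tuned from the shell as follows. The values shown are illustrative only; the supported ranges are described in Chapter 11:

export MP_EAGER_LIMIT=65536   # messages up to 64 KB are sent eagerly
export MP_BUFFER_MEM=64M      # pool from which early arrival buffers are allocated
poe my_mpi_program -procs 32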
Rendezvous messages
For a standard send, PE MPI makes the decision whether to use an eager or a
rendezvous protocol based on the message length. For the standard MPI_Send
and MPI_Isend calls, messages whose size is not greater than the eager limit are
sent using eager protocol. Messages whose size is larger than the eager limit are
sent using rendezvous protocol. Thus, small messages can be eagerly sent, and
assuming that message credits are returned in a timely fashion, can continue to be
sent using the mechanisms described above. For large messages, or small
messages for which there are no message credits available, the message must be
managed with a rendezvous protocol.
Since a zero byte message has no message data to preserve, even an MPI
implementation with no early arrival buffering should be able to complete a zero
byte standard send at the send side, whether or not there is a matching receive.
Thus, for PE MPI with MP_EAGER_LIMIT set to zero, a one byte standard send
will not complete until a matching receive is found, but a zero byte standard send
will complete without waiting for a rendezvous to determine whether a receive is
waiting.
Eager messages require only one trip across the transport, while rendezvous
messages require three trips, but two of the trips are short, and the time is quickly
amortized for large messages. Using the rendezvous protocol ensures that there is
no need for temporary buffers to store the data, and no overhead from copying
packets to temporary buffers and then on to user buffers.
If all the processors are busy, enabling interrupt mode causes thread context
switching and contention for processors, which might cause the application to run
slower than it would in polling mode.
The behavior of the MPI library during message polling can also be affected by the
setting of the environment variable MP_WAIT_MODE. If set to sleep or yield, the
blocked MPI thread sleeps or yields periodically to allow the AIX dispatcher to
schedule other activity on the processor. This may be appropriate when the wait call
is part of a command processor thread. An alternate way of implementing this
behavior is with an MPI test command and user-invoked sleep or yield (or some
other mechanism to release a processor).
Environment variable MP_WAIT_MODE can also be set to nopoll, which polls the
message dispatcher for a short time (less than one millisecond) and then goes into
a thread wait, depending on an interrupt or the periodic timer to resume message
handling.
As mentioned above, packets are transferred during polling and when an interrupt is
recognized (which invokes the message dispatcher). The message dispatcher is
also invoked periodically, based on the AIX timer support. The time interval between
brief polls of the message dispatcher is controlled by environment variable
MP_POLLING_INTERVAL, specified in microseconds.
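For example, a job that computes for long stretches between MPI calls might run with settings like these (the values are illustrative):

export MP_CSS_INTERRUPT=yes        # run the message dispatcher on packet arrival
export MP_WAIT_MODE=nopoll         # poll briefly, then wait for an interrupt
export MP_POLLING_INTERVAL=400000  # dispatcher timer interval, in microseconds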
The MPI library supports multiple threads simultaneously issuing MPI calls, and
provides appropriate internal locking to make sure that the library is thread safe with
respect to these calls. If the application makes MPI calls on only one thread (or is a
non-threaded program), and does not use the nonstandard MPE_I nonblocking
collectives, MPI-IO, or MPI one-sided features, the user may wish to skip the
internal locking by setting the environment variable MP_SINGLE_THREAD to yes.
Do not set MP_SINGLE_THREAD to yes unless you are certain that the
application is single threaded.
On the send side, LAPI guarantees to make a copy of any LAPI-level message of up
to 128 bytes, letting the send complete locally. An MPI message sent by an
application has a header (or envelope) prepended by PE MPI before being sent as a LAPI
message. Therefore, the application message size from the MPI perspective is less
than from the LAPI perspective. The message envelope is no larger than 32 bytes.
LAPI also maintains a limited pool of retransmission buffers larger than 128 bytes. If
the application message plus MPI envelope exceeds 128 bytes, but is small enough
to fit a retransmission buffer, LAPI tries (but cannot guarantee) to copy it to a
retransmission buffer, allowing the MPI send to complete locally.
Striping
With PE Version 4, protocol striping is supported for HPS switch adapters (striping,
failover, and recovery are not supported over non-HPS adapters such as Gigabit
Ethernet). If the windows (or UDP ports) are on multiple adapters and one adapter
or link fails, the corresponding windows are closed and the remaining windows are
used to send messages. When the adapter or link is restored (assuming that the
node itself remains operational), the corresponding windows are added back to the
list of windows used for striping.
For single network configurations, striping, failover, and recovery can still be used
by requesting multiple instances (setting the environment variable MP_INSTANCES
to a value greater than 1). However, unless the system is configured with multiple
adapters on the network, and window resources are available on more than one
adapter, failover and recovery is not necessarily possible, because both windows
may end up on the same adapter. Similarly, improved striping performance using
RDMA can be seen only if windows are allocated from multiple adapters on the
single network.
There are some considerations that users of 32-bit applications must take into
account before deciding to use the striping, failover, and recovery function. A 32-bit
application is limited to 16 segments. The standard AIX memory model for 32-bit
applications claims five of these, and expects the application to allocate up to eight
segments (2 GB) for application data (the heap, specified with compile option
-bmaxdata). For example, -bmaxdata:0x80000000 allocates the maximum eight
segments, each of which is 256 MB. The communication subsystem takes an
additional, variable number of segments, depending on options chosen at run time.
In some circumstances, for 32-bit applications the total demand for segments can
be greater than 16 and a job will be unable to start, or will run with reduced
performance. If your application is using a very large heap and you are considering
the striping, failover, and recovery function, take this additional segment usage into
account.
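For example, a 32-bit application might be linked to claim fewer data segments, leaving more segments free for the communication subsystem. This is a sketch; the six-segment value is illustrative:

mpcc_r -o my_mpi_program my_mpi_program.c -bmaxdata:0x60000000   # six 256 MB data segments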
This especially benefits applications that either transfer relatively large amounts of
data (greater than 150 KB) in a single MPI call, or overlap computation and
communication, since the CPU is no longer required to copy data. RDMA
operations are considerably more efficient when large (16 MB) pages are used
rather than small (4 KB) pages, especially for large transfers. In order to use the
bulk transfer mode, the system administrator must enable RDMA communication
and LoadLeveler must be configured to use RDMA. Not all communications
adapters support RDMA.
For a quick overview of the RDMA feature, and the steps that a system
administrator must take to enable or disable the RDMA feature, see Switch Network
Interface for eServer pSeries High Performance Switch Guide and Reference.
For information on using LoadLeveler with bulk data transfer, see these sections in
LoadLeveler: Using and Administering:
v The chapter: Configuring the LoadLeveler environment, section Enabling support
for bulk data transfer.
v The chapter: Building and submitting jobs, section Using bulk data transfer.
Other considerations
The information provided earlier in this chapter, and the controlling variables, apply
to most applications. There are a few others that are useful in special
circumstances. These circumstances may be identified by setting the
MP_STATISTICS environment variable to print and examining the task statistics at
the end of an MPI job.
MP_ACK_THRESH
This environment variable changes the threshold for the update of the packet
sliding window. Reducing the value causes more frequent update of the window,
but generates additional message traffic.
MP_CC_SCRATCH_BUFFER
MPI collectives normally pick from more than one algorithm based on the impact
of message size, task count, and other factors on expected performance.
Normally, the algorithm that is predicted to be fastest is selected, but in some
cases the preferred algorithm depends on PE MPI allocation of scratch buffer
space. This environment variable, when set to no, instructs PE MPI to use the
collective algorithm that does not require scratch buffer space, even when that
algorithm is predicted to be slower.
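A sketch of how these controls might be combined for a diagnostic run (the values are illustrative):

export MP_STATISTICS=print      # print task statistics at the end of the job
export MP_ACK_THRESH=16         # update the sliding window more frequently
export MP_CC_SCRATCH_BUFFER=no  # avoid collective algorithms that need scratch buffers
poe my_mpi_program -procs 32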
AIX profiling
If you use the gprof, prof, or xprofiler command and the appropriate compiler
command (such as cc_r or mpcc_r) with the -p or -pg flag, you can profile your
program. For information about using:
v cc_r, gprof, and prof, see IBM Parallel Environment for AIX: Operation and Use,
Volume 2.
v mpcc_r and related compiler commands, see IBM Parallel Environment for AIX:
Operation and Use, Volume 1.
v xprofiler, which is part of the AIX operating system, see the AIX: Performance
Tools Guide and Reference.
The message passing library is not enabled for gprof or prof profiling counts. You
can obtain profiling information by using the nameshifted MPI functions provided.
Programs that use the C MPI language bindings can easily create profiling libraries
using the nameshifted interface.
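For example, a minimal profiling wrapper in C intercepts MPI_Send through the standard nameshifted (PMPI_) entry points. This sketch only counts calls; a real profiling library would typically record timings as well:

#include <mpi.h>

static int send_count = 0;   /* number of MPI_Send calls observed */

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
{
    send_count++;                                             /* profiling action */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* real send */
}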
v If you are both the creator and user of the profiling library and you are not using
FORTRAN, follow steps 1 through 6. If you are using FORTRAN, follow steps 1
through 4, then steps 7 through 14.
v If you are the creator of the profiling library, follow steps 1 through 4. You also
need to provide the user with the file created in step 2.
v If you are the user of the profiling library and you are not using FORTRAN, follow
steps 5 and 6. If you are using FORTRAN, start at step 7. You will need to make
sure that you have the file generated by the creator in step 2.
You need to change it into the following structure by rebuilding the mpifort_r.o
shared object:
c -------------------------------------
program hwinit
include 'mpif.h'
integer forterr
c
call MPI_INIT(forterr)
c
c Write comments to screen.
c
write(6,*)'Hello from task '
c
call MPI_FINALIZE(forterr)
c
stop
end
c
Point-to-point communications
MPI programs with more than one task on the same computing node may benefit
from using shared memory to send messages between same node tasks.
Setting the MP_SHARED_MEMORY environment variable to no directs MPI to not use a shared-memory protocol for
message passing between any two tasks of a job running on the same node.
For the 32-bit libraries, shared memory exploitation always allocates a 256 MB
virtual memory address segment that is not available for any other use. Thus,
programs that are already using all available segments cannot use this option. For
more information, see “Available virtual memory segments” on page 34.
For 64-bit libraries, there are so many segments in the address space that there is
no conflict between library and end user segment use.
Shared memory support is available for both IP and User Space MPI protocols. For
programs on which all tasks are on the same node, shared memory is used
exclusively for all MPI communication (unless MP_SHARED_MEMORY is set to
no).
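For example, to force all same-node point-to-point communication onto the network transport (perhaps while diagnosing a suspected shared memory problem):

export MP_SHARED_MEMORY=no
poe my_mpi_program -procs 8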
Collective communications
With PE Version 4, the PE implementation of MPI also offers an optimization of
certain collective communication routines. This optimization uses an additional
shared memory segment. The collective communication optimization is available
only to 64-bit executables, where segment registers are abundant. This optimization
is controlled by the MP_SHARED_MEMORY environment variable.
For collectives in 64-bit executables that are enhanced to use shared memory, the
algorithms used for smaller message sizes involve copying data from user buffers
to scratch buffers in shared memory, and then allowing tasks that are interested in
that data to work with the copy in shared memory. The algorithms used for larger
messages involve exposing the user buffer itself to other tasks that have an interest
in it. The effect is that for smaller messages, some tasks may return from a
collective call as soon as their data is copied to shared memory, sometimes before
tasks needing access to the data even enter the collective operation.
POE’s Partition Manager Daemon (PMD) attempts to clean up any allocated shared
memory segments when a program exits normally. However, if a PMD process
(named pmdv4) is killed with signals or with the llcancel command, shared
memory segments may not be cleaned up properly. For this reason, when shared
memory is used, users should not kill or cancel a PMD process.
using the hostfile (either as host list in the directory where POE is run, or by
specifying MP_HOSTFILE or -hostfile) that contains the names of the Ethernet
adapters.
2. If a shared file system is not used, copy the original hostfile and the addr_fix
script below to the nodes where the parallel tasks will run. The addr_fix script
must be copied to the directory with the same name as the current directory on
the POE home node (from which you ran poe in step 1 on page 16.)
3. Run your real POE job with whatever settings you were using, except:
v Use the hostnames file from step 1 on page 16 as the MP_HOSTFILE or
-hostfile that is specified to POE.
v Set the environment variable ADDR_FIX_HOSTNAME to the name of the
hostfile that contains the names of the Ethernet adapters, used in step 1 on
page 16.
v Instead of invoking the job as:
poe my_exec my_args poe_flags
invoke it as:
poe ./addr_fix my_exec my_args poe_flags
The addr_fix script follows.
======================================================================
#!/bin/ksh93
# POE sets MP_CHILD to the 0-based task id; index into the file to get
# the ethernet adapter name that this task will run on.
my_index=$(($MP_CHILD + 1))
my_name=`awk "NR==$my_index" $ADDR_FIX_HOSTNAME`
# Resolve the adapter name to its IP address (assumes AIX host prints "name is address").
my_addr=`host $my_name | awk '{print $3}'`
# Set environment variable that MPI will use as address for IP communication.
export MP_CHILD_INET_ADDR=@1:$my_addr,ip
exec "$@"
If LAPI is used, set MP_LAPI_INET_ADDR in the script instead. If both MPI and
LAPI are used, set both environment variables.
Definition of MPI-IO
The I/O component of MPI-2, or MPI-IO, provides a set of interfaces that are aimed
at performing portable and efficient parallel input and output operations.
MPI-IO allows a parallel program to express its I/O in a portable way that reflects
the program’s inherent parallelism. MPI-IO uses many of the concepts already
provided by MPI to express this parallelism. MPI datatypes are used to express the
layout and partitioning of data, which is represented in a file shared by several
tasks. An extension of the MPI communicator concept, referred to as an MPI_File,
is used to describe a set of tasks and a file that these tasks will use in some
integrated manner. Collective operations on an MPI_File allow efficient physical I/O
on a data structure that is distributed across several tasks for computation, but
possibly stored contiguously in the underlying file.
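As a brief illustration of the MPI-IO style, in the C sketch below each task writes its own block of one shared file at an offset computed from its rank. The file name is a placeholder:

#include <mpi.h>

#define COUNT 1024

int main(int argc, char *argv[])
{
    int i, rank, buf[COUNT];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < COUNT; i++)
        buf[i] = rank;

    /* All tasks collectively open one shared file. */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each task writes its own contiguous block at a rank-based offset. */
    offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}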
Features of MPI-IO
The primary features of MPI-IO are:
1. Portability: As part of MPI-2, programs written to use MPI-IO must be portable
across MPI-2 implementations and across hardware and software platforms.
The PE MPI-IO implementation guarantees portability of object code on
RS/6000 SP computers and clustered servers. The MPI-IO API ensures
portability at the source code level.
2. Versatility: The PE MPI-IO implementation provides support for:
v basic file manipulations (open, close, delete, sync)
v get and set file attributes (view, size, group, mode, info)
v blocking data access operations with explicit offsets (both independent and
collective)
v non-blocking data access operations with explicit offsets (independent only)
v blocking and non-blocking data access operations with file pointers (individual
and shared)
v split collective data access operations
v any derived datatype for memory and file mapping
v file interoperability through data representations (internal, external,
user-defined)
v atomic mode for data accesses.
3. Robustness: PE MPI-IO performs as robustly as possible in the event of error
occurrences. Because the default behavior, as required by the MPI-2 standard,
is for I/O errors to return, PE MPI-IO tries to prevent any deadlock that might
result from an I/O error returning. The intent of the "errors return" default is that
the application be given the opportunity to recover from, or report, an I/O failure
rather than be terminated.
MPI-IO is intended to be used with the IBM General Parallel File System (GPFS)
for production use. File access through MPI-IO normally requires that a single
GPFS file system image be available across all tasks of an MPI job. Shared file
systems such as AFS® and NFS do not meet this requirement when used across
multiple nodes. PE MPI-IO can be used for program development on any other file
system that supports a POSIX interface (AFS, DFS™, JFS, or NFS) as long as all
tasks run on a single node or workstation, but this is not expected to be a useful
model for production use of MPI-IO.
Use of a file that is local to (that is, distinct at) each task or node, is not valid and
cannot be detected as an error by MPI-IO. Issuing MPI_FILE_OPEN on a file in
/tmp may look valid to the MPI library, but will not produce valid results.
The default for MP_CSS_INTERRUPT is no. If you do not override the default,
MPI-IO enables interrupts while files are open. If you have forced interrupts to yes
or no, MPI-IO does not alter your selection.
MPI-IO depends on hidden threads that use MPI message passing. MPI-IO cannot
be used with MP_SINGLE_THREAD set to yes.
For AFS, DFS, and NFS, MPI-IO uses file locking for all accesses by default. If
other tasks on the same node share the file and also use file locking, file
consistency is preserved. If the MPI_FILE_OPEN is done with mode
MPI_MODE_UNIQUE_OPEN, file locking is not done.
For information about file hints, see MPI_FILE_OPEN in IBM Parallel Environment
for AIX: MPI Subroutine Reference.
Error handling
MPI-1 treated all errors as occurring in relation to some communicator. Many MPI-1
functions were passed a specific communicator, and for the rest it was assumed
that the error context was MPI_COMM_WORLD. MPI-1 provided a default error
handler named MPI_ERRORS_ARE_FATAL for each communicator, and defined
functions similar to those listed below for defining and attaching alternate error
handlers.
The MPI-IO operations use an MPI_File in much the way other MPI operations use
an MPI_Comm, except that the default error handler for MPI-IO operations is
MPI_ERRORS_RETURN. The following functions are needed to allow error
handlers to be defined and attached to MPI_File objects:
v MPI_FILE_CREATE_ERRHANDLER
v MPI_FILE_SET_ERRHANDLER
v MPI_FILE_GET_ERRHANDLER
v MPI_FILE_CALL_ERRHANDLER
For information about these subroutines, see IBM Parallel Environment for AIX: MPI
Subroutine Reference.
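Because MPI-IO errors return by default, a program can either test return codes itself or attach the fatal error handler to a file. A minimal C sketch:

int rc, len;
char msg[MPI_MAX_ERROR_STRING];
MPI_File fh;

rc = MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                   MPI_INFO_NULL, &fh);
if (rc != MPI_SUCCESS) {
    MPI_Error_string(rc, msg, &len);  /* translate the code for reporting */
    /* recover or report here, rather than deadlock in a later collective */
} else {
    /* Alternatively, make any further I/O error on this file terminate the job. */
    MPI_File_set_errhandler(fh, MPI_ERRORS_ARE_FATAL);
}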
Setting the MP_IO_ERRLOG environment variable to yes turns on error logging.
When an error occurs, a line of information will be logged in
file /tmp/mpi_io_errdump.app_name.userid.taskid, recording the time the error
occurs, the POSIX file system call involved, the file descriptor, and the returned
error number.
An Info object is an opaque object consisting of zero or more (key,value) pairs. Info
objects are the means by which users provide hints to the implementation about
things like the structure of the application or the type of expected file accesses. In
MPI-2, the APIs that use Info objects span MPI-IO, MPI one-sided, and dynamic
tasks. Both key and value are specified as strings, but the value may actually
represent an integer, boolean or other datatype. Some keys are reserved by MPI,
and others may be defined by the implementation. The implementation defined keys
should use a distinct prefix which other implementations would be expected to
avoid. All PE MPI hints begin with IBM_ (see MPI_FILE_OPEN in IBM Parallel
Environment for AIX: MPI Subroutine Reference). The MPI-2 requirement that hints,
valid or not, cannot change the semantics of a program limits the risks from
misunderstood hints.
By default, Info objects in PE MPI accept only PE MPI recognized keys. This allows
a program to identify whether a given key is understood. If the key is not
understood, an attempt to place it in an Info object will be ignored. An attempt to
retrieve the key will find no key/value present. The environment variable
MP_HINTS_FILTERED set to no will cause Info operations to accept arbitrary (key,
value) pairs. You will need to turn off hint filtering if your application, or some
non-MPI library it is using, depends on MPI Info objects to cache and retrieve its
own (key, value) pairs.
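A sketch of placing a hint in an Info object and passing it at open time. The IBM_largeblock_io key is shown only as an assumed example of an IBM_ prefixed hint; see MPI_FILE_OPEN in the Subroutine Reference for the keys PE MPI actually recognizes:

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
/* Assumed example of an IBM_ prefixed hint; with default filtering,
   unrecognized keys are silently ignored. */
MPI_Info_set(info, "IBM_largeblock_io", "true");
MPI_File_open(MPI_COMM_WORLD, "datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);   /* the open keeps its own copy of the hints */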
Setting the MP_IO_BUFFER_SIZE environment variable (for example, to 16M)
sets the default size of the MPI-IO data buffer to 16 MB. The default value of this
environment variable is the number of bytes corresponding to 16 file blocks. This
value depends on the block size associated with the file system storing the file.
Valid values are any positive size up to 128 MB. The size can be expressed as a
number of bytes, as a number of kilobytes (1024 bytes), using k or K as a suffix, or
as a number of megabytes (1024*1024 bytes), using m or M as a suffix. If
necessary, PE MPI rounds the size up, to correspond to an integral number of file
system blocks.
For information about the following topics, see the MPI-2 Standard:
v Consistency and semantics
– File consistency
– Random access versus sequential files
– Progress
– Collective file operations
– Type matching
– Miscellaneous clarifications
– MPI_Offset Type
– Logical versus physical file layout
– File size
– Examples: asynchronous I/O
v I/O error handling
v I/O error classes
v Examples: double buffering with split collective I/O, subarray filetype constructor
In addition, the MPI library is using the Low-level communication API (LAPI)
protocol as a common transport layer. For more information on this and the use of
the LAPI protocol, see IBM Reliable Scalable Cluster Technology for AIX 5L: LAPI
Programming Guide.
An n-task parallel job running in POE consists of: the n user tasks, a number of
instances of the PE partition manager daemon (pmd) that is equal to the number of
nodes, and the POE home node task in which the poe command runs. The pmd is
the parent task of the user's task. There is one pmd for each node. A pmd is
started by the POE home node for each node when the parallel job is launched.
The POE home node routes standard input, standard output, and standard error
streams between the home node and the users' tasks through the pmd daemons, using
TCP/IP sockets for this purpose. The sockets are created when the POE home
node starts the pmd daemon for each task of a parallel job. The POE home node
and pmd also use the sockets to exchange control messages to provide task
synchronization, exit status and signaling. These capabilities do not depend on the
message passing library, and are available to control any parallel program run by
the poe command.
For interactive POE applications, without using LoadLeveler, POE does not copy or
replicate the user resource limits on the remote nodes where the parallel tasks are
to run (the compute nodes). POE uses the user limits as defined by the
/etc/security/limits file. If the user limits on the submitting node (home node) are
different than those on the compute nodes, POE does not change the user limits on
the compute nodes to match those on the submitting node.
Users should ensure that they have sufficient user resource limits on the compute
nodes, when submitting interactive parallel jobs. Users may want to coordinate their
user resource needs with their AIX system administrators to ensure that proper user
limits are in place, such as in the /etc/security/limits file on each node, or by some
other means.
Exit status
The exit status is any value from 0 through 255. This value, which is returned from
POE on the home node, reflects the composite exit status of your parallel
application as follows:
v If MPI_ABORT(comm,nn>0,ierror) or MPI_Abort(comm,nn>0) is called, the exit
status is nn (mod 256).
v If all tasks terminate using exit(MM>=0) or STOP MM>=0 and MM is not equal to
1 and is less than 128 for all nodes, POE provides a synchronization barrier at
the exit. The exit status is the largest value of MM from any task of the parallel
job (mod 256).
v If any task terminates using exit(MM=1) or STOP MM=1, POE will immediately
terminate the parallel job, as if MPI_Abort(MPI_COMM_WORLD,1) had been
called. This may also occur if an error is detected within a FORTRAN library
because a common error response by FORTRAN libraries is to call STOP 1.
v If any task terminates with a signal (for example, a segment violation), the exit
status is the signal plus 128, and the entire job is immediately terminated.
v If POE terminates before the start of the user’s application, the exit status is 1.
v If the user’s application cannot be loaded or fails before the user’s main() is
called, the exit status is 255.
POE links in the routines described in the sections that follow, when your
executable is compiled with any of the POE compilation scripts, such as: mpcc_r,
or mpxlf_r. These topics are discussed:
v “Signal handlers” on page 28.
v “Handling AIX signals” on page 28.
v “Do not hard code file descriptor numbers” on page 29.
v “Termination of a parallel job” on page 29.
v “Do not run your program as root” on page 30.
v “AIX function limitations” on page 30.
v “Shell execution” on page 30.
v “Do not rewind STDIN, STDOUT, or STDERR” on page 30.
v “Do not match blocking and non-blocking collectives” on page 30.
v “Passing string arguments to your program correctly” on page 31.
v “POE argument limits” on page 31.
v “Network tuning considerations” on page 31.
v “Standard I/O requires special attention” on page 32.
v “Reserved environment variables” on page 33.
v “AIX message catalog considerations” on page 33.
v “Language bindings” on page 33.
v “Available virtual memory segments” on page 34.
Signal handlers
POE installs signal handlers for most signals that cause program termination, so
that it can notify the other tasks of termination. POE then causes the program to
exit normally with a code of (signal plus 128). This section includes information
about installing your own signal handler for synchronous signals.
Note: For information about the way POE handles asynchronous signals, see
“Handling AIX signals.”
For synchronous signals, you can install your own signal handlers by using the
sigaction() system call. If you use sigaction(), you can use either the sa_handler
member or the sa_sigaction member in the sigaction structure to define the signal
handling function. If you use the sa_sigaction member, the SA_SIGINFO flag must
be set.
For the following signals, POE installs signal handlers that use the sa_sigaction
format:
v SIGABRT
v SIGBUS
v SIGEMT
v SIGFPE
v SIGILL
v SIGSEGV
v SIGSYS
v SIGTRAP
POE catches these signals, performs some cleanup, installs the default signal
handler (or lightweight core file generation), and re-raises the signal, which will
terminate the task.
Users can install their own signal handlers, but they should save the address of the
POE signal handler, using a call to SIGACTION. If the user program decides to
terminate, it should call the POE signal handler as follows:
saved.sa_flags = SA_SIGINFO;
(*saved.sa_sigaction)(signo, NULL, NULL);
If the user program decides not to terminate, it should just return to the interrupted
code.
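The fragment below sketches this in C: the handler that POE installed for SIGSEGV is saved with sigaction() before the user handler is installed. The names my_handler and install_handler are placeholders:

#include <signal.h>

static struct sigaction saved;   /* POE's handler, saved at setup time */

static void my_handler(int signo, siginfo_t *info, void *context)
{
    /* ... application-specific cleanup (no MPI or unsafe library calls) ... */
    saved.sa_flags = SA_SIGINFO;
    (*saved.sa_sigaction)(signo, NULL, NULL);   /* let POE terminate the task */
}

void install_handler(void)
{
    struct sigaction mine;

    sigaction(SIGSEGV, NULL, &saved);    /* save the handler POE installed */
    mine.sa_sigaction = my_handler;
    mine.sa_flags = SA_SIGINFO;
    sigemptyset(&mine.sa_mask);
    sigaction(SIGSEGV, &mine, NULL);     /* install the user handler */
}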
Note: Do not issue message passing calls, including MPI_ABORT, from signal
handlers. Also, many library calls are not “signal safe”, and should not be
issued from signal handlers. See function sigaction() in the AIX Technical
Reference for a list of functions that signal handlers can call.
These handlers perform cleanup and exit with a code of (signal plus 128). You can
install your own signal handler for any or all of these signals. If you want the
application to exit after you catch the signal, call the function
pm_child_sig_handler(signal,NULL,NULL). The prototype for this function is in
file /usr/lpp/ppe.poe/include/pm_util.h.
SIGALRM
Unlike the now retired signal library, the threads library does not use SIGALRM, and
long system calls are not interrupted by the message passing library. For example,
sleep runs its entire duration unless interrupted by a user-generated event.
SIGIO
Unlike PE 3.2, SIGIO is not used by the MPI library. A user-written signal handler
will not be called when an MPI packet arrives. The user may use SIGIO for other
I/O attention purposes, as required.
SIGPIPE
Some usage environments of the now retired signal library depended on MPI use of
SIGPIPE. There is no longer any use of SIGPIPE by the MPI library.
POE opens several files and uses file descriptors as message passing handles.
These are allocated before the user gets control, so the first file descriptor allocated
to a user is unpredictable.
For normal exits, when POE gets a control message for every task, it responds to
each node, allowing that node to exit normally with its individual exit code. The pmd
daemon monitors the exit code and passes it back to the POE home node for
presentation to the user.
For abnormal exits and those detected by pmd, POE sends a message to each
pmd asking that it send a SIGTERM signal to its tasks, thereby terminating the
task. When the task finally exits, pmd sends its exit code back to the POE home
node and exits itself.
Shell execution
The program executed by POE on the parallel nodes does not run under a shell on
those nodes. Redirection and piping of STDIN, STDOUT, and STDERR applies to
the POE home node (POE binary), and not the user’s code. If shell processing of a
command line is desired on the remote nodes, invoke a shell script on the remote
nodes to provide the desired preprocessing before the user’s application is invoked.
You can have POE run a shell script that is loaded and run on the remote nodes as
if it were a binary file.
Due to an AIX limitation, if the program being run by POE is a shell script and there
are more than five tasks being run per node, the script must be run under ksh93
by using:
#!/bin/ksh93
If the POE home node task is not started under the Korn shell, mounted file system
names may not be mapped correctly to the names defined for the automount
daemon or AIX equivalent. See the IBM Parallel Environment for AIX: Operation
and Use, Volume 1 for a discussion of alternative name mapping techniques.
Without the backslashes, the string would have been treated as two arguments (a
and b).
POE behaves like rsh when arguments are passed to POE. Therefore, this
command:
poe user_program "a b"
is equivalent to:
rsh some_machine user_program "a b"
In order to pass the string argument as one token, the quotation marks have to be
escaped using the backslash.
The POE environment variable MP_SNDBUF can be used to override the default
network settings for the size of the TCP/IP buffers used.
If you have large volumes of standard input or output, work with your network
administrator to establish appropriate TCP/IP tuning parameters. You may also want
to investigate whether using named pipes is appropriate for your application.
Running the poe command (or starting a program compiled with one of the POE
compile scripts) causes POE to perform this sequence of events:
1. The POE binary is loaded on the machine on which you submitted the
command (the POE home node).
2. The POE binary, in turn, starts a partition manager daemon (pmd) on each
parallel node assigned to run the job, and tells that pmd to run one or more
copies of your executable (using fork and exec).
3. The POE binary reads STDIN and passes it to each pmd with a TCP/IP socket
connection.
4. The pmd on each node pipes STDIN to the parallel tasks on that node.
5. STDOUT and STDERR from the tasks are piped to the pmd daemon.
6. This output is sent by the pmd on the TCP/IP socket back to the home node
POE.
7. This output is written to the POE binary’s STDOUT and STDERR descriptors.
Note
Earlier versions of Parallel Environment required the use of the
MP_HOLD_STDIN environment variable in certain cases when redirected
STDIN was used. The Parallel Environment components have now been
modified to control the STDIN flow internally, so the use of this environment
variable is no longer required, and will have no effect on STDIN handling.
The script compute_home runs on the home node; the script compute_parallel
runs on the parallel nodes (those running tasks 0 through n-1).
compute_home:
#! /bin/ksh93
# Example script compute_home runs three tasks:
# data_generator creates/gets data and writes to stdout
# data_processor is a parallel program that reads data
# from stdin, processes it in parallel, and writes
# the results to stdout.
# data_consumer reads data from stdin and summarizes it
#
mkfifo poe_in_$$
mkfifo poe_out_$$
export MP_STDOUTMODE=0
export MP_STDINMODE=0
data_generator >poe_in_$$ |
If the value of MP_INFOLEVEL is greater than or equal to 1, POE will display any
MP_ environment variables that it does not recognize, but POE will continue
working normally.
Language bindings
The FORTRAN, C, and C++ bindings for MPI are contained in the same library and
can be freely intermixed. The library is named libmpi_r.a. Because it contains both
32-bit and 64-bit objects, and the compiler and linker select between them,
libmpi_r.a can be used for both 32-bit and 64-bit applications.
The AIX compilers support the flag -qarch. This option allows you to target code
generation for your application to a particular processor architecture. While this
option can provide performance enhancements on specific platforms, it inhibits
portability. The MPI library is not targeted to a specific architecture, and is not
affected by the flag -qarch on your compilation.
The MPI standard includes several routines that take choice arguments. For
example MPI_SEND may be passed a buffer of REAL on one call, and a buffer of
INTEGER on the next. The -qextcheck compiler option flags this as an error. In
F77, choice arguments are a violation of the FORTRAN standard that few compilers
would complain about. In F90, choice arguments can be interpreted by the compiler
as an attempt to use function overloading. MPI FORTRAN functions do not require
genuine overloading support to give correct results and PE MPI does not define
overloaded functions for all potential choice arguments. Because -qextcheck
considers use of choice arguments to be erroneous overloads even though the
code is correct MPI, the -qextcheck option should not be used.
Table 2 shows how the clock source is determined. PE MPI guarantees that the
MPI attribute MPI_WTIME_IS_GLOBAL has the same value at every task, and all
tasks use the same clock source (AIX or switch).
Table 2. How the clock source is determined

MP_CLOCK_SOURCE   Library version   Are all nodes on      Source used   MPI_WTIME_IS_GLOBAL
                                    the same switch?
AIX               ip                yes                   AIX           false
AIX               ip                no                    AIX           false
AIX               us                yes                   AIX           false
AIX               us                no                    Error         false
SWITCH            ip                yes*                  switch        true
SWITCH            ip                no                    AIX           false
SWITCH            us                yes                   switch        true
SWITCH            us                no                    Error
not set           ip                yes                   switch        false
not set           ip                no                    AIX           false
not set           us                yes                   switch        true
not set           us                no                    Error
Note: * If MPI_WTIME_IS_GLOBAL value is to be trusted, the user is responsible for
making sure all of the nodes are connected to the same switch. If the job is in IP mode and
MP_CLOCK_SOURCE is left to default, MPI_WTIME_IS_GLOBAL will report false even if
the switch is used because MPI cannot know it is the same switch.
In this table, ip refers to IP protocol, us refers to User Space protocol.
For limitations on the number of tasks, tasks per node, and other restrictions, see
Chapter 10, “MPI size limits,” on page 65.
Threaded programming
When programming in a threads environment, specific skills and considerations are
required. The information in this subsection provides you with specific programming
considerations when using POE and the MPI library. This section assumes that you
are familiar with POSIX threads in general, including multiple execution threads,
thread condition waiting, thread-specific storage, thread creation and thread
termination. These topics are discussed:
v “Running single threaded applications” on page 36.
v “POE gets control first and handles task initialization” on page 36.
v “Limitations in setting the thread stack size” on page 36.
v “Forks are limited” on page 37.
v “Thread-safe libraries” on page 37.
v “Program and thread termination” on page 37.
v “Order requirement for system includes” on page 37.
v “Using MPI_INIT or MPI_INIT_THREAD” on page 37.
v “Collective communication calls” on page 38.
Applications that do not intend to use threads can continue to run as single
threaded programs, despite the fact they are now compiled as threaded programs.
However there are some side issues application developers should be aware of.
Any application that was compiled with the signal library compiler scripts prior to PE
Version 4 and not using MPE_I non-blocking collectives, is in this class.
Do not set MP_SINGLE_THREAD to yes unless you are certain that the
application is single threaded. Setting MP_SINGLE_THREAD to yes, and then
creating additional user threads will give unpredictable results. Calling
MPI_FILE_OPEN, MPI_WIN_CREATE or any MPE_I nonblocking collective in an
application running with MP_SINGLE_THREAD set to yes will cause PE MPI to
terminate the job.
Also, applications that register signal handlers may need to be aware that the
execution is in a threaded environment.
If you write your own MPI reduction functions to use with nonblocking collective
communications, these functions may run on a service thread. If your reduction
functions require significant amounts of stack space, you can use the
MP_THREAD_STACKSIZE environment variable to cause larger stacks to be
created for service threads. This does not affect the default stack size for any
threads you create.
Note: A forked child must not call the message passing library (MPI library).
Thread-safe libraries
Most AIX libraries are thread-safe, such as libc.a. However, not all libraries have a
thread-safe version. It is your responsibility to determine whether the AIX libraries
you use can be safely called by more than one thread.
MPI calls on other threads must adhere to the MPI standard in regard to the
following:
v A thread cannot make MPI calls until MPI_INIT has been called.
v A thread cannot make MPI calls after MPI_FINALIZE has been called.
v Unless there is a specific thread synchronization protocol provided by the
application itself, you cannot rely on any specific order or speed of thread
processing.
The MPI_INIT_THREAD call allows the user to request a level of thread support
ranging from MPI_THREAD_SINGLE to MPI_THREAD_MULTIPLE. PE MPI ignores
the request argument. If MP_SINGLE_THREAD is set to yes, MPI runs in a mode
equivalent to MPI_THREAD_FUNNELED. If MP_SINGLE_THREAD is set to no, or
allowed to default, PE MPI runs in MPI_THREAD_MULTIPLE mode.
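For example, in C (PE MPI ignores the request argument, as noted above, but portable code should still examine provided):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* funnel all MPI calls through a single thread instead */
    }
    /* ... */
    MPI_Finalize();
    return 0;
}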
The service threads created by MPI, POE, and LAPI have system contention scope,
that is, they are mapped 1:1 to kernel threads.
Any user-created thread that began with process contention scope, will be
converted to system contention scope when it makes its first MPI call. Threads that
must remain in process contention scope should not make MPI calls.
Program restrictions
Any program that meets both these criteria:
v is compiled with one of the threaded compile scripts provided by PE
v may be checkpointed prior to its main() function being invoked
must wait for the 0031-114 message to appear in POE’s STDERR before issuing
the checkpoint of the parallel job. Otherwise, a subsequent restart of the job may
fail.
Node restrictions
The node on which a process is restarted must have:
v The same operating system level (including PTFs). In addition, a restarted
process may not load a module that requires a system call from a kernel
extension that was not present at checkpoint time.
v The same switch type as the node where the checkpoint occurred.
v The capabilities enabled in /etc/security/user that were enabled for that user on
the node on which the checkpoint operation was performed.
Task-related restrictions
v The number of tasks and the task geometry (the tasks that are common within a
node) must be the same on a restart as it was when the job was checkpointed.
v Any regular file open in a parallel task when that task is checkpointed must be
present on the node where that task is restarted, including the executable and
any dynamically loaded libraries or objects.
v If any task within a parallel application uses sockets or pipes, user callbacks
should be registered to save data that may be in transit when a checkpoint
occurs, and to restore the data when the task is resumed after a checkpoint or
restart. Similarly, any user shared memory should be saved and restored.
Other restrictions
v Processes cannot be profiled at the time a checkpoint is taken.
v There can be no devices other than TTYs or /dev/null open at the time a
checkpoint is taken.
v Open files must either have an absolute pathname that is less than or equal to
PATHMAX in length, or must have a relative pathname that is less than or equal
to PATHMAX in length from the current directory at the time they were opened.
The current directory must have an absolute pathname that is less than or equal
to PATHMAX in length.
v Semaphores or message queues that are used within the set of processes being
checkpointed must only be used by processes within the set of processes being
checkpointed.
This condition is not verified when a set of processes is checkpointed. The
checkpoint and restart operations will succeed, but inconsistent results can occur
after the restart.
Integers passed to the MPI library are always 32 bits long. If you use the
FORTRAN compiler directive -qintsize=8 as your default integer length, you will
need to type your MPI integer arguments as INTEGER*4. All integer parameters in
mpif.h are explicitly declared INTEGER*4 to prevent -qintsize=8 from altering their
length.
As defined by the MPI standard, the count argument in MPI send and receive calls
is a default size signed integer. In AIX, even 64-bit executables use 32-bit integers
by default. To send or receive extremely large messages, you may need to
construct your own datatype (for example, a ’page’ datatype of 4096 contiguous
bytes).
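A sketch of that workaround in C; buf, num_pages, dest, and tag are placeholders supplied by the caller:

#include <mpi.h>

void send_large(void *buf, int num_pages, int dest, int tag)
{
    MPI_Datatype page;

    /* 4096 contiguous bytes per unit, so count measures pages, not bytes. */
    MPI_Type_contiguous(4096, MPI_BYTE, &page);
    MPI_Type_commit(&page);
    MPI_Send(buf, num_pages, page, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&page);
}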
The FORTRAN compilation scripts mpxlf_r, mpxlf90_r, and mpxlf95_r set the
include path for mpif.h to: /usr/lpp/ppe.poe/include/thread64 or
/usr/lpp/ppe.poe/include/thread, as appropriate. Do not add a separate include
path to mpif.h in your compiler scripts or make files, as an incorrect version of
mpif.h could be picked up in compilation, resulting in subtle run time errors.
The AIX 64-bit address space is large enough to remove any limitations on the
number of memory segments that can be used, so the information in “Available
virtual memory segments” on page 34 does not apply to the 64-bit library.
OpenMP and MPI in a single application offers relative safety because the OpenMP
model normally involves distinct parallel sections in which several threads are
spawned at the beginning of the section and joined at the end. The communication
calls occur on the main thread and outside of any parallel section, so they do not
require mutex protection. This segregation of threaded epochs from communication
epochs is safe and simple, whether you use OpenMP or provide your own threads
parallelism.
The threads parallelism model in which some number of threads proceed in a more
or less independent way, but protect critical sections (periods of protected access to
a shared data object) with locks requires more care. In this model, there is much
more chance you will hold a lock while doing a blocking MPI operation related to
some shared data object.
If both MPI and LAPI use the same protocol (either User Space or IP), you can
choose to have them share the underlying packet protocol (User Space or UDP).
You do this by setting the POE environment variable MP_MSG_API to mpi_lapi. If
you do not wish to share the underlying packet protocol, set MP_MSG_API to
mpi,lapi.
In User Space, running with shared resource MP_MSG_API set to mpi_lapi causes
LoadLeveler to allocate only one window for the MPI/LAPI pair, rather than two
windows. Since each window takes program resources (segment registers, memory
for DMA send and receive FIFOs, adapter buffers and network tables), sharing the
window makes sense if MPI and LAPI are communicating at different times (during
different phases of the program). If MPI and LAPI are doing concurrent
communication, separate windows are likely to perform better.
In shared mode, MPI_INIT sets interrupt behavior of its LAPI instance, just as in
non-shared mode, but MPI has no way to recognize or control changes to the
interrupt mode of this shared instance that may occur later through the LAPI_Senv()
function. Unexpected changes in interrupt mode made with the LAPI API to the
LAPI instance being shared with MPI can affect MPI performance, but will not affect
whether a valid MPI program runs correctly.
In IP, running with shared resource MP_MSG_API set to mpi_lapi uses only one
pair of UDP ports, while running with separated resource MP_MSG_API set to
mpi,lapi uses two pairs of UDP ports. In the separated case, there may be a slight
increase in job startup time due to the need for POE to communicate two sets of
port lists.
The following variables are new. A brief description of their intended function is
provided. For more details, see Chapter 11, “POE environment variables and
command-line flags,” on page 69.
MP_UDP_PACKET_SIZE
Specifies the UDP datagram size to be used for UDP/IP message transport.
Other differences
v Handling shared memory. See Chapter 3, “Using shared memory,” on page 15.
v The MPI communication subsystem is activated at MPI_INIT and closed at
MPI_FINALIZE. When MPI and LAPI share the subsystem, whichever call comes
first between MPI_INIT and LAPI_INIT will provide the activation. Whichever call
comes last between MPI_FINALIZE and LAPI_TERM will close it.
v Additional service threads. See “POE-supplied threads.”
POE-supplied threads
Your parallel program is normally run under the control of POE. The communication
stack includes MPI, LAPI, and the hardware interface layer. The communication
stack also provides access to the global switch clock. This stack makes use of
several internally spawned threads. The options under which the job is run affect
which threads are created; therefore some, but not all, of the threads listed below
are created in a typical application run. Most of these threads sleep in the kernel
waiting for notification of some rare condition and do not compete for CPU access
during normal job processing. When a job is run in polling mode, there will normally
be little CPU demand by threads other than the users’ application threads.
This information is provided to help you understand what you will see in a debugger
when examining an MPI task. You can almost always ignore the service threads in
your debugging, but you may need to find your own thread before you can examine it.
class Exception {
public:
Exception(int error_code);
virtual ~Exception(){ }
int Get_error_code() const;
int Get_error_class() const;
const char* Get_error_string() const;
protected:
int error_code;
char error_string[MPI_MAX_ERROR_STRING];
int error_class;
};
Predefined operations
Table 9 lists the predefined operations for use with MPI_ALLREDUCE,
MPI_REDUCE, MPI_REDUCE_SCATTER and MPI_SCAN. To invoke a predefined
operation, place any of the following reductions in op.
Table 9. Predefined reduction operations
Operation Description
MPI_BAND bitwise AND
MPI_BOR bitwise OR
MPI_BXOR bitwise XOR
MPI_LAND logical AND
MPI_LOR logical OR
MPI_LXOR logical XOR
MPI_MAX maximum value
MPI_MAXLOC maximum value and location
MPI_MIN minimum value
MPI_MINLOC minimum value and location
MPI_PROD product
MPI_REPLACE f(a,b) = b (the current value in the target memory is
replaced by the value supplied by the origin)
MPI_SUM sum
Examples
Examples of user-defined reduction functions for integer vector addition follow.
C example
void int_sum (int *in, int *inout,
int *len, MPI_Datatype *type)
{
int i;
for (i=0; i<*len; i++) {
inout[i] += in[i];
}
}
FORTRAN example
SUBROUTINE INT_SUM(IN,INOUT,LEN,TYPE)
INTEGER IN(*),INOUT(*),LEN,TYPE,I
DO I = 1,LEN
INOUT(I) = IN(I) + INOUT(I)
ENDDO
END
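To use a function like int_sum with the collectives listed in Table 9, register it first with MPI_Op_create. A brief hedged sketch in C (the buffer names and the commutativity choice are illustrative, not from the reference):
#include <mpi.h>

void int_sum(int *in, int *inout, int *len, MPI_Datatype *type); /* as above */

void reduce_with_int_sum(int *my_ints, int *sums, int n)
{
MPI_Op int_sum_op;
/* Register int_sum as a commutative (second argument = 1)
user-defined operation, then reduce to task 0. */
MPI_Op_create((MPI_User_function *)int_sum, 1, &int_sum_op);
MPI_Reduce(my_ints, sums, n, MPI_INT, int_sum_op, 0, MPI_COMM_WORLD);
MPI_Op_free(&int_sum_op);
}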
Error classes
MPI::SUCCESS
MPI::ERR_BUFFER
MPI::ERR_COUNT
MPI::ERR_TYPE
MPI::ERR_TAG
MPI::ERR_COMM
MPI::ERR_RANK
MPI::ERR_REQUEST
MPI::ERR_ROOT
MPI::ERR_GROUP
MPI::ERR_OP
MPI::ERR_TOPOLOGY
MPI::ERR_DIMS
MPI::ERR_ARG
MPI::ERR_UNKNOWN
MPI::ERR_TRUNCATE
MPI::ERR_OTHER
MPI::ERR_INTERN
Maximum sizes
MPI::MAX_ERROR_STRING
MPI::MAX_PROCESSOR_NAME
MPI::MAX_FILE_NAME
MPI::MAX_DATAREP_STRING
MPI::MAX_INFO_KEY
MPI::MAX_INFO_VAL
MPI::MAX_OBJECT_NAME
Topologies
MPI::GRAPH
MPI::CART
MPI-IO constants
MPI::MODE_RDONLY
MPI::MODE_WRONLY
MPI::MODE_RDWR
MPI::MODE_CREATE
MPI::MODE_APPEND
MPI::MODE_EXCL
MPI::MODE_DELETE_ON_CLOSE
MPI::MODE_UNIQUE_OPEN
MPI::MODE_SEQUENTIAL
MPI::MODE_NOCHECK
MPI::MODE_NOSTORE
MPI::MODE_NOPUT
MPI::MODE_NOPRECEDE
MPI::MODE_NOSUCCEED
Assorted constants
MPI::BSEND_OVERHEAD
MPI::PROC_NULL
MPI::ANY_SOURCE
MPI::ANY_TAG
MPI::UNDEFINED
MPI::KEYVAL_INVALID
MPI::BOTTOM
Collective constants
MPI::ROOT
MPI::IN_PLACE
Collective operations
MPI::MAX
MPI::MIN
MPI::SUM
MPI::PROD
MPI::MAXLOC
MPI::MINLOC
MPI::BAND
MPI::BOR
MPI::BXOR
MPI::LAND
MPI::LOR
MPI::LXOR
MPI::REPLACE
Null handles
MPI::GROUP_NULL
MPI::COMM_NULL
MPI::DATATYPE_NULL
MPI::REQUEST_NULL
MPI::OP_NULL
MPI::ERRHANDLER_NULL
MPI::INFO_NULL
MPI::WIN_NULL
Empty group
MPI::GROUP_EMPTY
System limits
The following list includes system limits on the size of various MPI elements and
the relevant environment variable or tunable parameter. The MPI standard identifies
several values that have limits in any MPI implementation. For these values, the
standard indicates a named constant to express the limit. See mpi.h for these
constants and their values. The limits described below are specific to PE and are
not part of standard MPI.
v Number of tasks: MP_PROCS
v Maximum number of tasks: 8192
v Maximum buffer size for any MPI communication (for 32-bit applications only): 2 GB
v Default early arrival buffer size (MP_BUFFER_MEM):
When using Internet Protocol (IP): 2800000 bytes
When using User Space: 64 MB
v Minimum pre-allocated early arrival buffer size: (50 * eager_limit) bytes
v Maximum pre-allocated early arrival buffer size: 256 MB
v Minimum message envelope buffer size: 1 MB
v Default eager limit (MP_EAGER_LIMIT): See Table 12 on page 66. Note that the
default values shown in Table 12 on page 66 are initial estimates that are used
by the MPI library. Depending on the value of MP_BUFFER_MEM and the job
type, these values will be adjusted to guarantee a safe eager send protocol.
v Maximum eager limit: 256 KB
v MPI uses the MP_BUFFER_MEM and the MP_EAGER_LIMIT values that are
selected for a job to determine how many complete messages, each with a size
that is equal to or less than the eager_limit, can be sent eagerly from every task
of the job to a single task, without causing the single target to run out of buffer
space. This is done by allocating to each sending task a number of message
credits for each target. The sending task will consume one message credit for
each eager send to a particular target. It will get that credit back after the
message has been matched at that target.
The sending task can continue to send eager messages to a particular target as
long as it still has message credits for that target. The following equation is used
to calculate the number of credits to be allocated:
MP_BUFFER_MEM / (MP_PROCS * MAX(MP_EAGER_LIMIT, 64))
MPI uses this equation to ensure that there are at least two credits for each
target. If needed, MPI reduces the initially selected value of MP_EAGER_LIMIT,
or increases the initially selected value of MP_BUFFER_MEM, in order to
achieve this minimum threshold of two credits for each target.
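For example (illustrative numbers, not from the reference): with MP_BUFFER_MEM set
to 64 MB and MP_EAGER_LIMIT set to 4096 on a 256-task job, each sending task
receives 67108864 / (256 * 4096) = 64 message credits for each target.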
If the user has specified an initial value for MP_BUFFER_MEM or
MP_EAGER_LIMIT, and MPI has changed either one or both of these values, an
informational message is issued. If the user has specified MP_BUFFER_MEM
using the two values format, then the maximum value specified by the second
value (M2) is honored as the upper bound.
For a system with a pSeries HPS switch and adapter, the Task per Node Limit is 64
tasks per adapter per network. For a system with two adapters per network, the
task per node limit is 128, or 64 * 2. This enables the running of a 128 task per
node MPI job over User Space. This may be useful on 64 CPU nodes with the
Simultaneous Multi-Threading (SMT) technology available on IBM System p5
servers and AIX 5.3 enabled. The LoadLeveler configuration also helps determine
how many tasks can be run on a node. To run 128 tasks per node, LoadLeveler
must be configured with 128 starters per node. In theory, you can configure more
than two adapters per network and run more than 128 tasks per node. However,
this means running more than one task per CPU, and results in reduced
performance. Also, the lower layer of the protocol stack has a 128 tasks per node
limit for enabling shared memory. The protocol stack does not use shared memory
when there are more than 128 tasks per node.
For running an MPI job over IP, the task per node limit is not affected by the
number of adapters; the task per node limit is determined only by the number of
LoadLeveler starters configured per node. The 128 task per node limit for enabling
shared memory usage also applies to MPI/IP jobs.
Although the PCI adapters support the stated limits for tasks per node, maximum
aggregate bandwidth through the adapter is achieved with a smaller task per node
count, if all tasks are simultaneously involved in message passing. Thus, if
individual MPI tasks can do SMP parallel computations on multiple CPUs (using
OpenMP or threads), performance may be better than if all MPI tasks compete for
adapter resources.
The user may also want to consider using MPI IP. On SP Switch2 PCI systems with
many MPI tasks sharing adapters, MPI IP may perform better than MPI User
Space.
You can use the POE command-line flags on the poe and pdbx commands. You
can also use the following flags on program names when individually loading nodes
from STDIN or a POE commands file.
v -infolevel or -ilevel
v -euidevelop
In the tables that follow, a check mark (✓) denotes those flags you can use when
individually loading nodes.
MP_SAVE_LLFILE
-save_llfile
Set: When using LoadLeveler for node allocation, the name of the output
LoadLeveler job command file to be generated by the Partition Manager. The
output LoadLeveler job command file will show the LoadLeveler settings that result
from the POE environment variables and/or command-line options for the current
invocation of POE. If you use the MP_SAVE_LLFILE environment variable for a
batch job, or when the MP_LLFILE environment variable is set (indicating that a
LoadLeveler job command file should participate in node allocation), POE will
show a warning and will not save the output job command file.
Possible values: Any relative or full path name.
Default: None
Table 18. POE environment variables and command-line flags for Message Passing Interface (MPI)

MP_ACK_THRESH
-ack_thresh
Set: Allows the user to control the packet acknowledgement threshold. Specify a
positive integer.
Possible values: A positive integer limited to 31.
Default: 30

MP_BUFFER_MEM
-buffer_mem
Set: See “MP_BUFFER_MEM details” on page 82.
Default: 64 MB (User Space); 2800000 bytes (IP)
MP_BUFFER_MEM details
Set:
To control the amount of memory PE MPI allows for the buffering of early arrival
message data. Message data that is sent without knowing if the receive is posted is
said to be sent eagerly. If the message data arrives before the receive is posted,
this is called an early arrival and must be buffered at the receive side.
Possible values:
nnnnn (bytes)
nnnK (where: K = 1024 bytes)
nnM (where: M = 1024*1024 bytes)
nnG (where: G = 1 billion bytes)
M1
M1,M2
,M2 (a comma followed by the M2 value)
M2 specifies the upper bound of memory that PE MPI will allow to be used for early
arrival buffering in the most extreme case of sends without waiting receives. PE
MPI will throttle senders back to rendezvous protocol (stop trying to use eager
send) before allowing the early arrivals at a receive side to overflow the upper
bound.
There is no limit enforced on the value you can specify for M2, but be aware that a
program that does not behave as expected has the potential to malloc this much
memory, and terminate if it is not available.
The format that omits M1 is used to tell PE MPI to use its default size pre-allocated
pool, but set the upper bound as specified with M2. This removes the need for a
user to remember the default M1 value when the intention is to only change the M2
value.
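For example (an illustrative setting), MP_BUFFER_MEM=,128M keeps the default
pre-allocated pool while raising the M2 upper bound to 128 MB.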
It is expected that only jobs with hundreds of tasks will have any need to set M2. For
most of these jobs, there will be an M1,M2 setting that eliminates the need for PE
MPI to throttle eager sends, while allowing all early arrivals that the application
actually creates to be buffered within the pre-allocated pool.
Table 19. POE environment variables/command-line flags for corefile generation

MP_COREDIR
-coredir
Set: Creates a separate directory for each task’s core file.
Possible values: Any valid directory name, or ″none″ to bypass creating a new
directory.
Default: coredir.taskid

MP_COREFILE_FORMAT
-corefile_format
Set: The format of corefiles generated when processes terminate abnormally.
Possible values: The string ″STDERR″ (to specify that the lightweight corefile
information should be written to standard error) or any other string (to specify the
lightweight corefile name).
Default: If not set/specified, standard AIX corefiles will be generated.

MP_COREFILE_SIGTERM
-corefile_sigterm
Set: Determines if POE should generate a core file when a SIGTERM signal is
received. If not set, the default is no.
Possible values: yes, no
Default: no
mpc_isatty
Purpose
Determines whether a device is a terminal on the home node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_isatty(int FileDescriptor);
Description
This parallel utility subroutine determines whether the file descriptor specified by the
FileDescriptor parameter is associated with a terminal device on the home node. In
a parallel operating environment partition, the STDIN, STDOUT, and STDERR file
descriptors are implemented as pipes to the partition manager daemon. Therefore, the AIX isatty()
subroutine will always return false for each of them. This subroutine is provided for
use by remote tasks that may want to know whether one of these devices is
actually a terminal on the home node, for example, to determine whether or not to
output a prompt.
Parameters
FileDescriptor
is the file descriptor number of the device. Valid values are:
0 or STDIN
Specifies STDIN as the device to be checked.
1 or STDOUT
Specifies STDOUT as the device to be checked.
2 or STDERR
Specifies STDERR as the device to be checked.
Notes
This subroutine has a C version only. Also, it is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates that the device is not associated with a terminal on the home
node.
1 Indicates that the device is associated with a terminal on the home node.
-1 Indicates an invalid FileDescriptor parameter.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without redirecting STDIN, produces the following output:
*
* isatty() reports STDIN as a non-terminal device
* mpc_isatty() reports STDIN as a terminal device
*/
#include "pm_util.h"
main()
{
if (isatty(STDIN)) {
printf("isatty() reports STDIN as a terminal device\n");
} else {
printf("isatty() reports STDIN as a non-terminal device\n");
if (mpc_isatty(STDIN)) {
printf("mpc_isatty() reports STDIN as a terminal device\n");
} else {
printf("mpc_isatty() reports STDIN as a non-terminal device\n");
}
}
}
MP_BANDWIDTH, mpc_bandwidth
Purpose
Obtains user space switch bandwidth statistics.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
#include <lapi.h>
int mpc_bandwidth(lapi_handle_t hndl, int flag, bw_stat_t *bw);
FORTRAN synopsis
MP_BANDWIDTH(INTEGER HNDL, INTEGER FLAG, INTEGER*8 BW_SENT, INTEGER*8 BW_RECV,
INTEGER*8 BW_TIME_SEC, INTEGER*4 BW_TIME_USEC, INTEGER RC)
Description
This parallel utility subroutine is a wrapper API program that users can call to obtain
the user space switch bandwidth statistics. LAPI’s Query interface is used to obtain
byte counts of the data sent and received. This routine returns the byte counts and
time values to allow the bandwidth to be calculated.
For C and C++ language programs, this routine uses a structure that contains the
data count fields, as well as time values in both seconds and microseconds. These
are filled in at the time of the call, from the data obtained by the LAPI Query
interface and a ″get time of day″ call.
This routine requires a valid LAPI handle for LAPI programs. For MPI programs, the
handle is not required. A flag parameter is required to indicate whether the call has
been made from an MPI or LAPI program.
If the program is a LAPI program, the flag MP_BW_LAPI must be set and the
handle value must be specified. If the program is an MPI program, the flag
MP_BW_MPI must be set, and any handle specified is ignored.
In the case where a program uses both MPI and LAPI in the same program, where
MP_MSG_API is set to either mpi,lapi or mpi_lapi, separate sets of statistics are
maintained for the MPI and LAPI portions of the program. To obtain the MPI
bandwidth statistics, this routine must be called with the MP_BW_MPI flag, and any
handle specified is ignored. To obtain the LAPI bandwidth statistics, this routine
must be called with the MP_BW_LAPI flag and a valid LAPI handle value.
Parameters
In C, bw is a pointer to a bw_stat_t structure. This structure is defined as:
typedef struct{
unsigned long long switch_sent;
unsigned long long switch_recv;
int64_t time_sec;
int32_t time_usec;
} bw_stat_t;
where:
switch_sent is the number of bytes sent.
switch_recv is the number of bytes received.
time_sec is the time value in seconds.
time_usec is the time value in microseconds.
In FORTRAN:
BW_SENT is a 64-bit integer value of the number of bytes sent.
BW_RECV is a 64-bit integer value of the number of bytes received.
BW_TIME_SEC
is a 64-bit integer time value of time in seconds.
BW_TIME_USEC
is a 32-bit integer time value of time in microseconds.
Bw_data is a pointer to the bandwidth data structure, which will include the timestamp
and bandwidth data counts of sends and receives as requested. The bandwidth data
structure may be declared and passed locally by the calling program.
Hndl is a valid LAPI handle filled in by a LAPI_Init() call for LAPI programs. For MPI
programs, this is ignored.
RC in FORTRAN, will contain an integer value returned by this function. This should
always be the last parameter.
Notes
1. The send and receive data counts are for bandwidth data at the software level
of current tasks running, and not what the adapter is capable of.
2. Intranode communication using shared memory will specifically not be
measured with this API. Likewise, this API does not return values of the
bandwidth of local data sent to itself.
3. In the case with striping over multiple adapters, the data counts are an
aggregate of the data exchanged at the application level, and not on a
per-adapter basis.
Return values
0 Indicates successful completion.
-1 Incorrect flag (not MP_BW_MPI or MP_BW_LAPI).
greater than 0
See the list of LAPI error codes in IBM RSCT: LAPI Programming Guide.
Examples
C Examples
1. To determine the bandwidth in an MPI program:
#include <mpi.h>
#include <time.h>
#include <lapi.h>
#include <pm_util.h>
int rc;
main(int argc, char *argv[])
{
bw_stat_t bw_in;
MPI_Init(&argc, &argv);
.
.
.
/* start collecting bandwidth .. */
rc = mpc_bandwidth(NULL, MP_BW_MPI, &bw_in);
.
.
.
printf("Return from mpc_bandwidth ...rc = %d.\n",rc);
printf("Bandwidth of data sent: %lld.\n",
bw_in.switch_sent);
printf("Bandwidth of data recv: %lld.\n",
bw_in.switch_recv);
printf("time(seconds): %lld.\n",bw_in.time_sec);
printf("time(mseconds): %d.\n",bw_in->time_usec);
.
.
.
MPI_Finalize();
exit(rc);
}
2. To determine the bandwidth in a LAPI program:
#include <lapi.h>
#include <time.h>
#include <pm_util.h>
int rc;
main(int argc, char *argv[])
{
lapi_handle_t hndl;
lapi_info_t info;
bw_stat_t work;
bw_stat_t bw_in;
bzero(&info, sizeof(lapi_info_t));
rc = LAPI_Init(&hndl, &info);
.
.
.
rc = mpc_bandwidth(hndl, MP_BW_LAPI, &bw_in);
.
.
.
printf("Return from mpc_bandwidth ...rc = %d.\n",rc);
printf("Bandwidth of data sent: %lld.\n",
bw_in.switch_sent);
printf("Bandwidth of data recv: %lld.\n",
bw_in.switch_recv);
printf("time(seconds): %lld.\n", bw_in.time_sec);
printf("time(mseconds): %d.\n",bw_in.time_usec);
.
.
.
LAPI_Term(hndl);
exit(rc);
}
FORTRAN Examples
1. To determine the bandwidth in an MPI program:
program bw_mpi
include "mpif.h"
include "lapif.h"
integer retcode
integer taskid
integer numtask
integer hndl
integer*8 bw_secs
integer*4 bw_usecs
integer*8 bw_sent_data
integer*8 bw_recv_data
.
.
.
call mpi_init(retcode)
call mpi_comm_rank(mpi_comm_world, taskid, retcode)
write (6,*) ’Taskid is ’,taskid
.
.
.
call mp_bandwidth(hndl,MP_BW_MPI, bw_sent_data, bw_recv_data, bw_secs,
bw_usecs,retcode)
write (6,*) ’MP_BANDWIDTH returned. Time (sec) is ’,bw_secs
write (6,*) ’ Time (usec) is ’,bw_usecs
write (6,*) ’ Data sent (bytes): ’,bw_sent_data
write (6,*) ’ Data received (bytes): ’,bw_recv_data
write (6,*) ’ Return code: ’,retcode
.
.
.
call mpi_barrier(mpi_comm_world,retcode)
call mpi_finalize(retcode)
2. To determine the bandwidth in a LAPI program:
program bw_lapi
include "mpif.h"
include "lapif.h"
TYPE (LAPI_INFO_T) :: lapi_info
integer retcode
integer taskid
integer numtask
integer hndl
integer*8 bw_secs
integer*4 bw_usecs
integer*8 bw_sent_data
integer*8 bw_recv_data
.
.
.
call lapi_init(hndl, lapi_info, retcode)
.
.
.
call mp_bandwidth(hndl,MP_BW_LAPI, bw_sent_data, bw_recv_data, bw_secs,
bw_usecs,retcode)
write (6,*) ’MP_BANDWIDTH returned. Time (sec) is ’,bw_secs
write (6,*) ’ Time (usec) is ’,bw_usecs
write (6,*) ’ Data sent (bytes): ’,bw_sent_data
write (6,*) ’ Data received (bytes): ’,bw_recv_data
write (6,*) ’ Return code: ’,retcode
.
.
.
call lapi_term(hndl,retcode)
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
v mpxlf90_r
v mpxlf95_r
Subroutines:
v MP_STATISTICS_WRITE, mpc_statistics_write
v MP_STATISTICS_ZERO, mpc_statistics_zero
MP_DISABLEINTR, mpc_disableintr
Purpose
Disables message arrival interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_disableintr();
FORTRAN synopsis
MP_DISABLEINTR(INTEGER RC)
Description
This parallel utility subroutine disables message arrival interrupts on the individual
node on which it is run. Use this subroutine to dynamically control masking
interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
v This subroutine is only effective when the communication subsystem is active.
This is from MPI_INIT to MPI_FINALIZE. If this subroutine is called when the
subsystem is inactive, the call will have no effect and the return code will be -1.
v This subroutine overrides the setting of the environment variable
MP_CSS_INTERRUPT.
v Inappropriate use of the interrupt control subroutines may reduce performance.
v This subroutine can be used for IP and User Space protocols.
v This subroutine is thread-safe.
v Using this subroutine will suppress the MPI-directed switching of interrupt mode,
leaving the user in control for the rest of the run. See MPI_FILE_OPEN and
MPI_WIN_CREATE in IBM Parallel Environment for AIX: MPI Subroutine
Reference.
Return values
0 Indicates successful completion.
-1 Indicates that the MPI library was not active. The call was either made
before MPI_INIT or after MPI_FINALIZE.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running the following program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_ENABLEINTR, mpc_enableintr
v MP_QUERYINTR, mpc_queryintr
MP_ENABLEINTR, mpc_enableintr
Purpose
Enables message arrival interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_enableintr();
FORTRAN synopsis
MP_ENABLEINTR(INTEGER RC)
Description
This parallel utility subroutine enables message arrival interrupts on the individual
node on which it is run. Use this subroutine to dynamically control masking
interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
v This subroutine is only effective when the communication subsystem is active.
This is from MPI_INIT to MPI_FINALIZE. If this subroutine is called when the
subsystem is inactive, the call will have no effect and the return code will be -1.
v This subroutine overrides the setting of the environment variable
MP_CSS_INTERRUPT.
v Inappropriate use of the interrupt control subroutines may reduce performance.
v This subroutine can be used for IP and User Space protocols.
v This subroutine is thread safe.
v Using this subroutine will suppress the MPI-directed switching of interrupt mode,
leaving the user in control for the rest of the run. See MPI_FILE_OPEN and
MPI_WIN_CREATE in IBM Parallel Environment for AIX: MPI Subroutine
Reference.
Return values
0 Indicates successful completion.
-1 Indicates that the MPI library was not active. The call was either made
before MPI_INIT or after MPI_FINALIZE.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running this program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_DISABLEINTR, mpc_disableintr
v MP_QUERYINTR, mpc_queryintr
MP_FLUSH, mpc_flush
Purpose
Flushes task output buffers.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_flush(int option);
FORTRAN synopsis
MP_FLUSH(INTEGER OPTION)
Description
This parallel utility subroutine flushes output buffers from all of the parallel tasks to
STDOUT at the home node. This is a synchronizing call across all parallel tasks.
If the current STDOUT mode is ordered, then when all tasks have issued this call or
when any of the output buffers are full:
1. All STDOUT buffers are flushed and put out to the user screen (or redirected) in
task order.
2. An acknowledgement is sent to all tasks and control is returned to the user.
If current STDOUT mode is unordered and all tasks have issued this call, all output
buffers are flushed and put out to the user screen (or redirected).
If the current STDOUT mode is single and all tasks have issued this call, the output
buffer for the current single task is flushed and put out to the user screen (or
redirected).
Parameters
option
is an AIX file descriptor. The only valid value is:
1 Indicates to flush STDOUT buffers.
Notes
v This is a synchronizing call regardless of the current STDOUT mode.
v All STDOUT buffers are flushed at the end of the parallel job.
v If mpc_flush is not used, standard output streams not terminated with a new-line
character are buffered, even if a subsequent read to standard input is made. This
may cause a prompt message to appear only after input has been read.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion
Examples
C Example
The following program uses poe with the -labelio yes option and three tasks:
#include <pm_util.h>
main()
{
mpc_stdout_mode(STDIO_ORDERED);
printf("These lines will appear in task order\n");
/*
* Call mpc_flush here to make sure that one task
* doesn’t change the mode before all tasks have
* sent the previous printf string to the home node.
*/
mpc_flush(1);
mpc_stdout_mode(STDIO_UNORDERED);
printf("These lines will appear in the order received by the home node\n");
/*
* Since synchronization is not used here, one task could actually
* execute the next statement before one of the other tasks has
* executed the previous statement, causing one of the unordered
* lines not to print.
*/
mpc_stdout_mode(1);
printf("Only 1 copy of this line will appear from task 1\n");
}
Running this C program produces the following output (the task order of lines 4
through 6 may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
FORTRAN Example
CALL MP_STDOUT_MODE(-2)
WRITE(6, *) ’These lines will appear in task order’
CALL MP_FLUSH(1)
CALL MP_STDOUT_MODE(-3)
WRITE(6, *) ’These lines will appear in the order received by the home node’
CALL MP_STDOUT_MODE(1)
WRITE(6, *) ’Only 1 copy of this line will appear from task 1’
END
Related information
Subroutines:
v MP_STDOUT_MODE, mpc_stdout_mode
v MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
MP_INIT_CKPT, mpc_init_ckpt
Purpose
Starts user-initiated checkpointing.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_init_ckpt(int flags);
FORTRAN synopsis
i = MP_INIT_CKPT(%val(j))
Description
MP_INIT_CKPT starts complete or partial user-initiated checkpointing. The
checkpoint file name consists of the base name provided by the MP_CKPTFILE
and MP_CKPTDIR environment variables, with a suffix of the task ID and a numeric
checkpoint tag to differentiate it from an earlier checkpoint file.
Parameters
In C, flags can be set to MP_CUSER, which indicates complete user-initiated
checkpointing, or MP_PUSER, which indicates partial user-initiated checkpointing.
Notes
Complete user-initiated checkpointing is a synchronous operation. All tasks of the
parallel program must call MP_INIT_CKPT. MP_INIT_CKPT suspends the calling
thread until all other tasks have called it (MP_INIT_CKPT). Other threads in the
task are not suspended. After all tasks of the application have issued
MP_INIT_CKPT, a local checkpoint is taken of each task.
Upon returning from the MP_INIT_CKPT call, the application continues to run. It
may, however, be a restarted application that is now running, rather than the
original, if the program was restarted from a checkpoint file.
In a case where several threads in a task call MP_INIT_CKPT using the same flag,
the calls are serialized.
The task that calls MP_INIT_CKPT does not need to be an MPI program.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
0 Indicates successful completion.
1 Indicates that a restart operation occurred.
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_ckpt.h>
int mpc_init_ckpt(int flags);
FORTRAN Example
i = MP_INIT_CKPT(%val(j))
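The examples above simply restate the synopsis. A fuller sketch of a complete
user-initiated checkpoint, offered as an illustration only (the stdio.h usage and the
messages are assumptions, not from the reference), might look like this in C:
#include <stdio.h>
#include <pm_ckpt.h>
main()
{
int rc;
/* Request a complete user-initiated checkpoint; all tasks must call. */
rc = mpc_init_ckpt(MP_CUSER);
if (rc == 1) {
printf("Running as a restarted task\n");
} else if (rc == -1) {
printf("Checkpoint failed; see the accompanying error message\n");
}
}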
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
v MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
MP_QUERYINTR, mpc_queryintr
Purpose
Returns the state of interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_queryintr();
FORTRAN synopsis
MP_QUERYINTR(INTEGER RC)
Description
This parallel utility subroutine returns the state of interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
This subroutine is thread safe.
Return values
0 Indicates that interrupts are disabled on the node from which this subroutine
is called.
1 Indicates that interrupts are enabled on the node from which this subroutine
is called.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running this program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_DISABLEINTR, mpc_disableintr
v MP_ENABLEINTR, mpc_enableintr
MP_QUERYINTRDELAY, mpc_queryintrdelay
Purpose
Note
This function is no longer supported and its future use is not recommended.
The routine remains available for binary compatibility. If invoked, it performs no
action and always returns zero. Applications that include calls to this routine
should continue to function as before. We suggest that calls to this routine be
removed from source code if it becomes convenient to do so.
The original purpose of this routine was to return the current interrupt delay
time. This routine currently returns zero.
MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
Purpose
Registers subroutines to be invoked when the application is checkpointed, resumed,
and restarted.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_set_ckpt_callbacks(callbacks_t *cbs);
FORTRAN synopsis
MP_SET_CKPT_CALLBACKS(EXTERNAL CHECKPOINT_CALLBACK_FUNC,
EXTERNAL RESUME_CALLBACK_FUNC,
EXTERNAL RESTART_CALLBACK_FUNC,
INTEGER RC)
Description
The MP_SET_CKPT_CALLBACKS subroutine is called to register subroutines to be
invoked when the application is checkpointed, resumed, and restarted.
Parameters
In C, cbs is a pointer to a callbacks_t structure. The structure is defined as:
typedef struct {
void (*checkpoint_callback)(void);
void (*restart_callback)(void);
void (*resume_callback)(void);
} callbacks_t;
where:
checkpoint_callback Points to the subroutine to be called at checkpoint
time.
restart_callback Points to the subroutine to be called at restart time.
resume_callback Points to the subroutine to be called when an
application is resumed after taking a checkpoint.
In FORTRAN:
CHECKPOINT_CALLBACK_FUNC
Specifies the subroutine to be called at checkpoint
time.
RESUME_CALLBACK_FUNC Specifies the subroutine to be called when an
application is resumed after taking a checkpoint.
RESTART_CALLBACK_FUNC Specifies the subroutine to be called at restart time.
RC Contains one of the values listed under Return Values.
Notes
In order to ensure their completion, the callback subroutines cannot be dependent
on the action of any other thread in the current process, or any process created by
the task being checkpointed, because these threads or processes or both may or
may not be running while the callback subroutines are executing.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
-1 Indicates that an error occurred. A message describing the error will be
issued.
non-negative integer
Indicates the handle that is to be used in MP_UNSET_CKPT_CALLBACKS
to unregister the subroutines.
Examples
C Example
#include <pm_ckpt.h>
int ihndl;
callbacks_t cbs;
void foo(void);
void bar(void);
cbs.checkpoint_callback=foo;
cbs.resume_callback=bar;
cbs.restart_callback=bar;
ihndl = mpc_set_ckpt_callbacks(&cbs);
FORTRAN Example
SUBROUTINE FOO
.
.
.
RETURN
END
SUBROUTINE BAR
.
.
.
RETURN
END
PROGRAM MAIN
EXTERNAL FOO, BAR
INTEGER HANDLE, RC
.
.
.
CALL MP_SET_CKPT_CALLBACKS(FOO,BAR,BAR,HANDLE)
IF (HANDLE .EQ. -1) STOP 666
.
.
.
CALL MP_UNSET_CKPT_CALLBACKS(HANDLE,RC)
.
.
.
END
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_INIT_CKPT, mpc_init_ckpt
v MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
MP_SETINTRDELAY, mpc_setintrdelay
Purpose
Note
This function is no longer supported and its future use is not recommended.
The routine remains available for binary compatibility. If invoked, it performs no
action and always returns zero. Applications that include calls to this routine
should continue to function as before. We suggest that calls to this routine be
removed from source code if it becomes convenient to do so.
This function formerly set the delay parameter. It now performs no action.
MP_STATISTICS_WRITE, mpc_statistics_write
Purpose
Prints both MPI and LAPI transmission statistics.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_statistics_write(FILE *fp);
FORTRAN synopsis
MP_STATISTICS_WRITE(INTEGER FILE_DESCRIPTOR, INTEGER RC)
Description
If the MP_STATISTICS environment variable is set to yes, MPI will keep a running
total on a set of statistical data. If an application calls this function after MPI_INIT is
completed, but before MPI_FINALIZE is called, it will print out the current total of all
available MPI and LAPI data. If this function is called after MPI_FINALIZE is
completed, it will print out only the final MPI data.
Note: LAPI will always keep its own statistical total with or without having
MP_STATISTICS set.
In the output, each piece of MPI statistical data is preceded by MPI, and each piece
of LAPI statistical data is preceded by LAPI.
Parameters
fp In C, fp is either STDOUT, STDERR or a FILE pointer returned by the
fopen function.
In FORTRAN, FILE_DESCRIPTOR is the AIX file descriptor of the file that
this function will write to, having these values:
1 Indicates that the output is to be written to STDOUT.
2 Indicates that the output is to be written to STDERR.
Other Indicates the integer returned by the XL FORTRAN utility getfd, if
the output is to be written to an application-defined file.
The getfd utility converts a FORTRAN LUNIT number to an AIX file
descriptor. See Examples for more detail.
RC In FORTRAN, RC will contain the integer value returned by this function.
See Return Values for more detail.
Return values
-1 Neither MPI nor LAPI statistics are available.
0 Both MPI and LAPI statistics are available.
1 Only MPI statistics are available.
2 Only LAPI statistics are available.
Examples
C Example
#include "pm_util.h"
......
MPI_Init( ... );
MPI_Send( ... );
MPI_Recv( ... );
MPI_Finalize();
FORTRAN Example
integer(4) LUNIT, stat_ofile, stat_rc, getfd
.....
stat_ofile = getfd(LUNIT)
call MP_STATISTICS_WRITE(stat_ofile, stat_rc)
MP_STATISTICS_ZERO, mpc_statistics_zero
Purpose
Resets (zeros) the MPCI_stats_t structure. It has no effect on LAPI.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
void mpc_statistics_zero();
FORTRAN synopsis
MP_STATISTICS_ZERO()
Description
If the MP_STATISTICS environment variable is set to yes, MPI will keep a running
total on a set of statistical data, after MPI_INIT is completed. At any time during
execution, the application can call this function to reset the current total to zero.
Parameters
None.
Return values
None.
MP_STDOUT_MODE, mpc_stdout_mode
Purpose
Sets the mode for STDOUT.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_stdout_mode(int mode);
FORTRAN synopsis
MP_STDOUT_MODE(INTEGER MODE)
Description
This parallel utility subroutine requests that STDOUT be set to single, ordered, or
unordered mode. In single mode, only one task output is displayed. In unordered
mode, output is displayed in the order received at the home node. In ordered mode,
each parallel task writes output data to its own buffer. When a flush request is
made, all the task buffers are flushed, in order of task ID, to STDOUT at the home node.
Parameters
mode
is the mode to which STDOUT is to be set. The valid values are:
taskid Specifies single mode for STDOUT, where taskid is the task identifier of
the new single task. This value must be between 0 and n-1, where n is
the total number of tasks in the current partition. The taskid requested does not
have to be the issuing task.
-2 Specifies ordered mode for STDOUT. The macro STDIO_ORDERED is
supplied for use in C programs.
-3 Specifies unordered mode for STDOUT. The macro
STDIO_UNORDERED is supplied for use in C programs.
Notes
v All current STDOUT buffers are flushed before the new STDOUT mode is
established.
v The initial mode for STDOUT is set by using the environment variable
MP_STDOUTMODE, or by using the command-line option -stdoutmode, with
the latter overriding the former. The default STDOUT mode is unordered.
v This subroutine is implemented with a half-second sleep interval to ensure that
the mode change request is processed before subsequent writes to STDOUT.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion.
Examples
C Example
The following program uses poe with the -labelio yes option and three tasks:
#include <pm_util.h>
main()
{
mpc_stdout_mode(STDIO_ORDERED);
printf("These lines will appear in task order\n");
/*
* Call mpc_flush here to make sure that one task
* doesn’t change the mode before all tasks have
* sent the previous printf string to the home node.
*/
mpc_flush(1);
mpc_stdout_mode(STDIO_UNORDERED);
printf("These lines will appear in the order received by the home node\n");
/*
* Since synchronization is not used here, one task could actually
* execute the next statement before one of the other tasks has
* executed the previous statement, causing one of the unordered
* lines not to print.
*/
mpc_stdout_mode(1);
printf("Only 1 copy of this line will appear from task 1\n");
}
Running the above C program produces the following output (task order of lines 4-6
may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
FORTRAN Example
CALL MP_STDOUT_MODE(-2)
WRITE(6, *) ’These lines will appear in task order’
CALL MP_FLUSH(1)
CALL MP_STDOUT_MODE(-3)
WRITE(6, *) ’These lines will appear in the order received by the home node’
CALL MP_STDOUT_MODE(1)
WRITE(6, *) ’Only 1 copy of this line will appear from task 1’
END
Running the above program produces the following output (the task order of lines 4
through 6 may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
Subroutines:
v MP_FLUSH, mpc_flush
v MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
v MP_SYNCH, mpc_synch
MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
Purpose
Queries the current STDOUT mode setting.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_stdoutmode_query(int *mode);
FORTRAN synopsis
MP_STDOUTMODE_QUERY(INTEGER MODE)
Description
This parallel utility subroutine returns the mode to which STDOUT is currently set.
Parameters
mode
is the address of an integer in which the current STDOUT mode setting will be
returned. Possible return values are:
taskid Indicates that the current STDOUT mode is single; that is, output for
only task taskid is displayed.
-2 Indicates that the current STDOUT mode is ordered. The macro
STDIO_ORDERED is supplied for use in C programs.
-3 Indicates that the current STDOUT mode is unordered. The macro
STDIO_UNORDERED is supplied for use in C programs.
Notes
v Between the time one task issues a mode query request and receives a
response, it is possible that another task can change the STDOUT mode setting
to another value unless proper synchronization is used.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_util.h>
main()
{
int mode;
mpc_stdoutmode_query(&mode);
printf("Initial (default) STDOUT mode is %d\n", mode);
mpc_stdout_mode(STDIO_ORDERED);
mpc_stdoutmode_query(&mode);
printf("New STDOUT mode is %d\n", mode);
}
FORTRAN Example
CALL MP_STDOUTMODE_QUERY(mode)
WRITE(6, *) ’Initial (default) STDOUT mode is’, mode
CALL MP_STDOUT_MODE(-2)
CALL MP_STDOUTMODE_QUERY(mode)
WRITE(6, *) ’New STDOUT mode is’, mode
END
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
Subroutines:
v MP_FLUSH, mpc_flush
v MP_STDOUT_MODE, mpc_stdout_mode
v MP_SYNCH, mpc_synch
MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
Purpose
Unregisters checkpoint, resume, and restart application callbacks.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_unset_ckpt_callbacks(int handle);
FORTRAN synopsis
MP_UNSET_CKPT_CALLBACKS(INTEGER HANDLE, INTEGER RC)
Description
The MP_UNSET_CKPT_CALLBACKS subroutine is called to unregister checkpoint,
resume, and restart application callbacks that were registered with the
MP_SET_CKPT_CALLBACKS subroutine.
Parameters
handle is an integer indicating the set of callback subroutines to be unregistered.
This integer is the value returned by the subroutine used to register the callback
subroutine.
Notes
If a call to MP_UNSET_CKPT_CALLBACKS is issued while a checkpoint is in
progress, it is possible that the previously-registered callback will still be run during
this checkpoint.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
0 Indicates that MP_UNSET_CKPT_CALLBACKS successfully removed the
callback subroutines from the list of registered callback subroutines
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_ckpt.h>
int ihndl;
callbacks_t cbs;
void foo(void);
void bar(void);
cbs.checkpoint_callback=foo;
cbs.resume_callback=bar;
cbs.restart_callback=bar;
ihndl = mpc_set_ckpt_callbacks(&cbs);
.
.
.
mpc_unset_ckpt_callbacks(ihndl);
.
.
.
FORTRAN Example
SUBROUTINE FOO
.
.
.
RETURN
END
SUBROUTINE BAR
.
.
.
RETURN
END
PROGRAM MAIN
EXTERNAL FOO, BAR
INTEGER HANDLE, RC
.
.
.
CALL MP_SET_CKPT_CALLBACKS(FOO,BAR,BAR,HANDLE)
IF (HANDLE .EQ. -1) STOP 666
.
.
.
CALL MP_UNSET_CKPT_CALLBACKS(HANDLE,RC)
.
.
.
END
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_INIT_CKPT, mpc_init_ckpt
v MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
pe_dbg_breakpoint
Purpose
Provides a communication mechanism between Parallel Operating Environment
(POE) and an attached third party debugger (TPD).
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
void pe_dbg_breakpoint(void);
Description
The pe_dbg_breakpoint subroutine is used to exchange information between POE
and an attached TPD for the purposes of starting, checkpointing, or restarting a
parallel application. The call to the subroutine is made by the POE application
within the context of various debug events and related POE global variables, which
may be examined or filled in by POE and the TPD. All task-specific arrays are
allocated by POE and should be indexed by task number (starting with 0) to retrieve
or set information specific to that task.
The TPD should maintain a breakpoint within this function, check the value of
pe_dbg_debugevent when the function is entered, take the appropriate actions for
each event as described below, and allow the POE executable to continue.
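As an illustration only, a TPD's handler for this breakpoint might be organized as
below; the event constants are those described in this section, while the helper
itself and the mechanics of reading and writing POE's globals (for example, with
ptrace) are assumptions, not part of the documented interface:
#include <pe_dbg_checkpnt.h>

/* Hypothetical sketch of a TPD breakpoint handler; event is the value
read from pe_dbg_debugevent in the POE process. */
void handle_poe_event(int event)
{
switch (event) {
case PE_DBG_INIT_ENTRY:
/* Announce the TPD: set pe_dbg_stoptask to 1 in POE. */
break;
case PE_DBG_CKPT_REQUEST:
/* Agree to the checkpoint: set pe_dbg_do_ckpt to 1. */
break;
case PE_DBG_CKPT_VERIFY:
/* Set pe_dbg_is_tpd to 1 only if the TPD issued the request. */
break;
default:
break;
}
/* Allow the POE executable to continue. */
}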
PE_DBG_INIT_ENTRY
Used by POE to determine if a TPD is present. The TPD should set the
following:
int pe_dbg_stoptask
Should be set to 1 if a TPD is present. POE will then cause the remote
applications to be stopped using ptrace, allowing the TPD to attach to
and continue the tasks as appropriate.
In addition, POE will interpret the SIGSOUND and SIGRETRACT
signals as checkpoint requests from the TPD. SIGSOUND should be
sent when the parallel job should continue after a successful
checkpoint, and SIGRETRACT should be sent when the parallel job
should terminate after a successful checkpoint.
Note: Unpredictable results may occur if these signals are sent while a
parallel checkpoint from a PE_DBG_CKPT_REQUEST is still in
progress.
PE_DBG_CREATE_EXIT
Indicates that all remote tasks have been created and are stopped. The TPD
may retrieve the following information about the remote tasks:
int pe_dbg_count
The number of remote tasks that were created. Also the number of
elements in task-specific arrays in the originally started process, which
remains constant across restarts.
For a restarted POE process, this number may not be the same as the
number of tasks that existed when POE was originally started. To
determine which tasks may have exited prior to the checkpoint from
which the restart is performed, the poe_task_info routine should be
used.
long *pe_dbg_hosts
Address of the array of remote task host IP addresses.
long *pe_dbg_pids
Address of the array of remote task process IDs. Each of these will also
be used as the chk_pid field of the cstate structure for that task’s
checkpoint.
char **pe_dbg_executables
Address of the array of remote task executable names, excluding path.
PE_DBG_CKPT_REQUEST
Indicates that POE has received a user-initiated checkpoint request from one or
all of the remote tasks, has received a request from LoadLeveler to checkpoint
an interactive job, or has detected a pending checkpoint while being run as a
LoadLeveler batch job. The TPD should set the following:
int pe_dbg_do_ckpt
Should be set to 1 if the TPD wishes to proceed with the checkpoint.
PE_DBG_CKPT_START
Used by POE to inform the TPD whether or not to issue a checkpoint of the
POE process. The TPD may retrieve or set the following information for this
event:
int pe_dbg_ckpt_start
Indicates that the checkpoint may proceed if set to 1, and the TPD may
issue a pe_dbg_checkpnt of the POE process and some or all of the
remote tasks.
The TPD should obtain (or derive) the checkpoint file names,
checkpoint flags, cstate, and checkpoint error file names from the
variables below.
char *pe_dbg_poe_ckptfile
Indicates the full pathname to the POE checkpoint file to be used when
checkpointing the POE process. The name of the checkpoint error file
can be derived from this name by concatenating the .err suffix. The
checkpoint error file name should also be used for
PE_DBG_CKPT_START events to know the file name from which to
read the error data.
char **pe_dbg_task_ckptfiles
Address of the array of full pathnames to be used for each of the task
checkpoints. The name of the checkpoint error file can be derived from
this name by concatenating the .err suffix.
int pe_dbg_poe_ckptflags
Indicates the checkpoint flags to be used when checkpointing the POE
process. Other supported flag values for terminating or stopping the
POE process may be ORed in by the TPD, if the TPD user issued the
checkpoint request.
int pe_dbg_task_ckptflags
Indicates the checkpoint flags to be used when checkpointing the
remote tasks. Other supported flag values for stopping the remote tasks
must be ORed in by the TPD.
The following variables should be filled in by the TPD prior to continuing POE
from this event:
int *pe_dbg_ckpt_pmd
Address of an array used by the TPD to indicate which tasks will have
the checkpoints performed by the TPD (value=0) and which tasks the
Partition Manager Daemon (PMD) should issue checkpoints for
(value=1). POE requires that the TPD must perform all checkpoints for
a particular parallel job on any node where at least one checkpoint will
be performed by the TPD.
int pe_dbg_brkpt_len
Used to inform POE of how much data to allocate for
pe_dbg_brkpt_data for later use by the TPD when saving or restoring
breakpoint data. A value of 0 may be used when there is no breakpoint
data.
PE_DBG_CKPT_START_BATCH
Same as PE_DBG_CKPT_START, but the following variables should be
ignored:
v int pe_dbg_ckpt_start
v int pe_dbg_poe_ckptflags
For this event, the TPD should not issue a checkpoint of the POE process.
PE_DBG_CKPT_VERIFY
Indicates that POE has detected a pending checkpoint. POE must verify that
the checkpoint was issued by the TPD before proceeding. The TPD should set
the following:
int pe_dbg_is_tpd
Should be set to 1 if the TPD issued the checkpoint request.
PE_DBG_CKPT_STATUS
Indicates the status of the remote checkpoints that were performed by the
TPDs. The TPD should set the following:
int *pe_dbg_task_ckpterrnos
Address of the array of errnos from the remote task checkpoints (0 for
successful checkpoint). These values can be obtained from the
Py_error field of the cr_error_t struct, returned from the
pe_dbg_read_cr_errfile calls.
void *pe_dbg_brkpt_data
The breakpoint data to be included as part of POE’s checkpoint file.
The format of the data is defined by the TPD, and may be retrieved
from POE’s address space at restart time.
int *pe_dbg_Sy_errors
The secondary errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Sy_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_Xtnd_errors
The extended errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Xtnd_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_error_lens
The user error data lengths obtained from pe_dbg_read_cr_errfile.
These values can be obtained from the error_len field of the cr_error_t
struct, returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_CKPT_ERRDATA
Indicates that the TPD has reported one or more task checkpoint failures, and
that POE has allocated space in the following array for the TPD to use to fill in
the error data.
char **pe_dbg_error_data
The user error data obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the error data field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_CKPT_DETACH
Used by POE to indicate to the TPD that it should detach from the POE
process. After being continued from pe_dbg_breakpoint for this event (just
prior to the TPD actually detaching), POE will wait until its trace bit is no longer
set before instructing the kernel to write its checkpoint file. POE will indicate to
the TPD that it is safe to reattach to the POE process by creating the file
/tmp/.poe.PID.reattach, where PID is the process ID of the POE process.
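As an illustration, the TPD might poll for this file with a loop like the
following sketch (the wait_for_reattach name, the one-second poll interval,
and the buffer size are arbitrary choices, not part of the interface):

   #include <stdio.h>      /* snprintf */
   #include <unistd.h>     /* access, sleep */
   #include <sys/types.h>  /* pid_t */

   /* Wait until POE creates the reattach indicator file described
    * above, then return so the TPD can reattach. */
   static void wait_for_reattach(pid_t poe_pid)
   {
       char path[64];

       snprintf(path, sizeof(path), "/tmp/.poe.%d.reattach", (int)poe_pid);
       while (access(path, F_OK) != 0)
           sleep(1);
   }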
PE_DBG_CKPT_RESULTS
Indicates the checkpoint results to either POE or the TPD, depending on who
issued the checkpoint of POE.
int pe_dbg_ckpt_rc
If the TPD issued the checkpoint, this variable should be filled in by the
TPD and should contain the return code from the call to
pe_dbg_checkpnt. Otherwise, POE will fill in this value to indicate to
the TPD whether the checkpoint succeeded (value=1) or failed
(value=0). For failed checkpoints, the TPD may obtain the error
information from the POE checkpoint error file.
int pe_dbg_ckpt_errno
If the TPD issued the checkpoint and the checkpoint failed, this variable
should be filled in by the TPD and should contain the errno set by AIX
upon return from pe_dbg_checkpnt.
PE_DBG_CKPT_RESUME
When this event occurs, the TPD may continue or terminate the remote tasks
(or keep them stopped) after a successful checkpoint. The TPD must not
perform the post-checkpoint actions until this event is received, to ensure that
POE and LoadLeveler have performed their post-checkpoint synchronization. If
the TPD did not issue the checkpoint, the following variable should be
examined:
int pe_dbg_ckpt_action
POE will fill in this value to indicate to the TPD if the remote tasks
should be continued (value=0) or terminated (value=1) after a
successful checkpoint.
PE_DBG_CKPT_CANCEL
Indicates that POE has received a request to cancel an in-progress checkpoint.
The TPD should cause a SIGINT to be sent to the thread that issued the
pe_dbg_checkpnt calls in the remote tasks. If the TPD is non-threaded and
performs non-blocking checkpoints, the task checkpoints cannot be cancelled.
Note: If the TPD user issues a request to cancel a checkpoint being performed
by the TPD, the TPD should send a SIGGRANT to the POE process so
that the remote checkpoints being performed by the PMDs can be
interrupted. Otherwise, the checkpoint call in the TPD can return while
some remote checkpoints are still in progress.
PE_DBG_RESTART_READY
Indicates that processes for the remote task restarts have been created and
that pe_dbg_restart calls for the remote tasks may be issued by the TPD. The
TPD must perform the restarts of all remote tasks.
The TPD should first retrieve the remote task information specified in the
variables described above under PE_DBG_CREATE_EXIT. The TPD should
then obtain (or derive) the restart file names, the restart flags, rstate, and restart
error file names from the variables below. The id argument for the
pe_dbg_restart call must be derived from the remote task PID using the
pe_dbg_getcrid routine.
char **pe_dbg_task_rstfiles
Address of the array of full pathnames to be used for each of the task
restarts. The name of the restart error file can be derived from this
name by concatenating the .err suffix.
int pe_dbg_task_rstflags
Indicates the restart flags to be used when restarting the remote tasks.
Other supported flag values for stopping the remote tasks may be
ORed in by the TPD.
char **pe_dbg_task_rstate
Address of the array of strings containing the restart data required for
each of the remote tasks. This value may be used as is for the
rst_buffer member of the rstate structure used in the remote task
restarts, or additional data may be appended by the TPD, as described
below:
DEBUGGER_STOP=yes
If this string appears in the task restart data, followed by a newline (\n)
character and a \0, the remote task will send a SIGSTOP signal to itself
once all restart actions have been completed in the restart handler. This
will likely be used by the TPD when tasks are checkpoint-aware, and
the TPD wants immediate control of the task after it completes restart
initialization.
The following variables should be re-examined by the TPD during this event:
int pe_dbg_ckpt_aware
Indicates whether or not the remote tasks that make up the parallel
application are checkpoint aware.
void *pe_dbg_brkpt_data
The breakpoint data that was included as part of POE’s checkpoint file.
The format of the data is defined by the TPD.
The following variables should be filled in by the TPD prior to continuing POE
from this event. This also implies that all remote restarts must have been
performed before continuing POE:
int *pe_dbg_task_rsterrnos
Address of the array of errnos from the remote task restarts (0 for
successful restart). These values can be obtained from the Py_error
field of the cr_error_t struct, returned from the pe_dbg_read_cr_errfile
calls.
int *pe_dbg_Sy_errors
The secondary errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Sy_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_Xtnd_errors
The extended errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Xtnd_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_error_lens
The user error data lengths obtained from pe_dbg_read_cr_errfile.
These values can be obtained from the error_len field of the cr_error_t
struct, returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_RESTART_ERRDATA
Indicates that the TPD has reported one or more task restart failures, and that
POE has allocated space in the following array for the TPD to use to fill in the
error data.
char **pe_dbg_error_data
The user error data obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the error_data field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
pe_dbg_checkpnt
Purpose
Checkpoints a process that is under debugger control, or a group of processes.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_checkpnt(path, id, flags, cstate, epath)
char *path;
id_t id;
unsigned int flags;
chk_state_t *cstate;
char *epath;
Description
The pe_dbg_checkpnt subroutine allows a process to checkpoint a process that is
under debugger control, or a set of processes that have the same checkpoint/restart
group ID (CRID). The state information of the checkpointed processes is saved in a
single file. All information required to restart the processes (other than the
executable files, any shared libraries, any explicitly loaded modules, and data,
if any, passed through the restart system calls) is contained in the checkpoint file.
After all processes have been stopped, the checkpoint file is written with process
information one process at a time. After the write has completed successfully, the
pe_dbg_checkpnt subroutine will do one of the following depending on the value
of the flags passed:
v Continue the processes.
v Terminate all the checkpointed processes.
v Leave the processes in the stopped state.
Parameters
path
The path of the checkpoint file to be created. This file will be created read-only
with the ownership set to the user ID of the process invoking the
pe_dbg_checkpnt call.
id Indicates the process ID of the process to be checkpointed, or the
checkpoint/restart group ID (CRID) of the set of processes to be checkpointed,
as specified by the flags parameter.
flags
Determines the behavior of the pe_dbg_checkpnt subroutine and defines the
interpretation of the id parameter. The flags parameter is constructed by
logically ORing the following values, which are defined in the sys/checkpnt.h
file:
CHKPNT_AND_STOP
Setting this bit causes the checkpointed processes to be put in a
stopped state after a successful checkpoint operation. The processes
can be continued by sending them SIGCONT. The default is to
checkpoint and continue running the processes.
CHKPNT_AND_STOPTRC
Setting this bit causes any process that is traced to be put in a stopped
state after a successful checkpoint operation. The processes can be
continued by sending them SIGCONT. The default is to checkpoint and
continue running the processes.
CHKPNT_AND_TERMINATE
Setting this bit causes the checkpointed processes to be terminated on
a successful checkpoint operation. The default is to checkpoint and
continue running the processes.
CHKPNT_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the set of processes to be checkpointed.
CHKPNT_IGNORE_SHMEM
Specifies that shared memory should not be checkpointed.
CHKPNT_NODELAY
Specifies that pe_dbg_checkpnt will not wait for the completion of the
checkpoint call. As soon as all the processes to be checkpointed have
been identified, and the checkpoint operation started for each of them,
the call will return. The kernel will not provide any status on whether the
call was successful. The application must examine the checkpoint file to
determine if the checkpoint operation succeeded or not. By default, the
pe_dbg_checkpnt subroutine will wait for all the checkpoint data to be
completely written to the checkpoint file before returning.
epath
An error file name to log error and debugging data if the checkpoint fails. This
field is mandatory and must be provided.
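As an illustration, a checkpoint-and-stop request for a checkpoint/restart
group might look like the following sketch. The checkpoint_group name and the
/tmp file names are assumptions for this example; cstate is passed through
unchanged:

   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Checkpoint every process in checkpoint/restart group 'crid' and
    * leave the processes stopped.  On failure, the error file named by
    * epath can be examined with pe_dbg_read_cr_errfile. */
   int checkpoint_group(crid_t crid, chk_state_t *cstate)
   {
       return pe_dbg_checkpnt("/tmp/poe.ckpt",              /* path  */
                              (id_t)crid,
                              CHKPNT_CRID | CHKPNT_AND_STOP,
                              cstate,
                              "/tmp/poe.ckpt.err");         /* epath */
   }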
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed or restarted processes.
Return values
Upon successful completion, a value of CHECKPOINT_OK is returned.
If the invoking process is included in the set of processes being checkpointed, and
the CHKPNT_AND_TERMINATE flag is set, this call will not return if the checkpoint
is successful because the process will be terminated.
If a process that successfully checkpointed itself is restarted, it will return from the
pe_dbg_checkpnt call with a value of RESTART_OK.
Errors
The pe_dbg_checkpnt subroutine is unsuccessful when the global variable errno
contains one of the following values:
EACCES
One of the following is true:
v The file exists, but could not be opened successfully in exclusive mode,
or write permission is denied on the file, or the file is not a regular file.
v Search permission is denied on a component of the path prefix specified
by the path parameter. Access could be denied due to a secure mount.
v The file does not exist, and write permission is denied for the parent
directory of the file to be created.
EAGAIN
Either the calling process or one or more of the processes to be
checkpointed is already involved in another checkpoint or restart operation.
EINTR Indicates that the checkpoint operation was terminated due to receipt of a
signal. No checkpoint file will be created. A call to the
pe_dbg_checkpnt_wait subroutine should be made when this occurs, to
ensure that the processes reach a state where subsequent checkpoint
operations will not fail unpredictably.
EINVAL
Indicates that a NULL path or epath parameter was passed in, or an invalid
set of flags was set, or an invalid id parameter was passed.
ENOMEM
Insufficient memory exists to initialize the checkpoint structures.
ENOSYS
One of the following is true:
pe_dbg_checkpnt_wait
Purpose
Waits for a checkpoint, or pending checkpoint file I/O, to complete.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_checkpnt_wait(id, flags, options)
id_t id;
unsigned int flags;
int *options;
Description
The pe_dbg_checkpnt_wait subroutine can be used to:
v Wait for a pending checkpoint issued by the calling thread’s process to complete.
v Determine whether a pending checkpoint issued by the calling thread’s process
has completed, when the CHKPNT_NODELAY flag is specified.
v Wait for any checkpoint file I/O that may be in progress during an interrupted
checkpoint to complete.
Parameters
id Indicates the process ID or the checkpoint/restart group ID (CRID) of the
processes for which a checkpoint operation was initiated or interrupted, as
specified by the flags parameter.
flags
Defines the interpretation of the id parameter. The flags parameter may contain
the following values, which are defined in the sys/checkpnt.h file:
CHKPNT_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the set of processes for which a checkpoint operation was
initiated or interrupted.
CHKPNT_NODELAY
Specifies that pe_dbg_checkpnt_wait will not wait for the completion
of the checkpoint call. This flag should not be used when waiting for
pending checkpoint file I/O to complete.
options
This field is reserved for future use and should be set to NULL.
Future implementations of this function may return the checkpoint error code in
this field. Until then, the size of the checkpoint error file can be used in most
cases to determine whether the checkpoint succeeded or failed. If the size of
the file is 0, the checkpoint succeeded; otherwise the checkpoint failed and
the checkpoint error file will contain the error codes. If the file does not exist, the
checkpoint most likely failed due to an EPERM or ENOENT on the checkpoint
error file pathname.
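As an illustration, a non-blocking completion test might be written as in the
following sketch (the checkpoint_done name is an assumption for this example):

   #include <errno.h>
   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Poll a previously issued checkpoint of checkpoint/restart group
    * 'crid'.  Returns 1 when it has completed (or none is pending),
    * 0 while it is still in progress, and -1 on any other error. */
   int checkpoint_done(crid_t crid)
   {
       if (pe_dbg_checkpnt_wait((id_t)crid,
                                CHKPNT_CRID | CHKPNT_NODELAY,
                                NULL) == 0)
           return 1;
       return (errno == EINPROGRESS) ? 0 : -1;
   }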
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
Upon successful completion, a value of 0 is returned, indicating that one of the
following is true:
v The pending checkpoint completed.
v There was no pending checkpoint.
v The pending file I/O completed.
v There was no pending file I/O.
If the pe_dbg_checkpnt_wait call is unsuccessful, -1 is returned and the errno
global variable is set to indicate the error.
Errors
The pe_dbg_checkpnt_wait subroutine is unsuccessful when the global variable
errno contains one of the following values:
EINPROGRESS
Indicates that the pending checkpoint operation has not completed when
the CHKPNT_NODELAY flag is specified.
EINTR Indicates that the operation was terminated due to receipt of a signal.
EINVAL
Indicates that an invalid flag was set.
ENOSYS
The caller of the function is not a debugger.
ESRCH
The process whose process ID was passed or the checkpoint/restart group
whose CRID was passed does not exist.
pe_dbg_getcrid
Purpose
Returns the checkpoint/restart ID.
Library
POE API library (libpoeapi.a)
C synopsis
crid_t pe_dbg_getcrid(pid)
pid_t pid;
Description
The pe_dbg_getcrid subroutine returns the checkpoint/restart group ID (CRID) of
the process whose process ID was specified in the pid parameter, or the CRID of
the calling process if a value of -1 was passed.
Parameters
pid Either the process ID of a process to obtain its CRID, or -1 to request the
CRID of the calling process.
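As an illustration, the PID-to-CRID derivation required before a
pe_dbg_restart call (see PE_DBG_RESTART_READY) might look like this sketch;
the task_crid name and the header shown are assumptions:

   #include <sys/types.h>
   #include <pe_dbg_checkpnt.h>

   /* Map a remote task's real PID to the id expected by pe_dbg_restart
    * with RESTART_OVER_CRID, or by pe_dbg_checkpnt with CHKPNT_CRID. */
   crid_t task_crid(pid_t task_pid)
   {
       crid_t crid = pe_dbg_getcrid(task_pid);

       if (crid == (crid_t)-1) {
           /* errno is ENOSYS (caller is not a debugger) or ESRCH */
       } else if (crid == 0) {
           /* process is in no checkpoint/restart group */
       }
       return crid;
   }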
Notes
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
If the process belongs to a checkpoint/restart group, a valid CRID is returned. If the
process does not belong to any checkpoint/restart group, a value of zero is
returned. For any error, a value of -1 is returned and the errno global variable is set
to indicate the error.
Errors
The pe_dbg_getcrid subroutine is unsuccessful when the global variable errno
contains one of the following values:
ENOSYS The caller of the function is not a debugger.
ESRCH There is no process with a process ID equal to pid.
pe_dbg_getrtid
Purpose
Returns the real thread ID of a thread in a specified process, given its virtual thread ID.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
Description
The pe_dbg_getrtid subroutine returns the real thread ID of the specified virtual
thread in the specified process.
Parameters
pid The real process ID of the process containing the thread for which the real
thread ID is needed.
vtid The virtual thread ID of the thread for which the real thread ID is needed.
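Because an unknown process or virtual thread ID makes the call echo back its
vtid argument, a caller cannot distinguish an unmapped ID from an identity
mapping. A minimal sketch follows; the tid_t argument and return types are
assumptions based on the surrounding subroutines:

   #include <sys/types.h>
   #include <pe_dbg_checkpnt.h>

   /* Resolve a virtual thread ID recorded at checkpoint time to the
    * real thread ID in the restarted process 'pid'. */
   tid_t resolve_tid(pid_t pid, tid_t vtid)
   {
       return pe_dbg_getrtid(pid, vtid);
   }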
Return values
If the calling process is not a debugger, a value of -1 is returned. Otherwise, the
pe_dbg_getrtid call is always successful. If the process does not exist or has
exited or is not a restarted process, or if the provided virtual thread ID does not
exist in the specified process, the value passed in the vtid parameter is returned.
Otherwise, the real thread ID of the thread whose virtual thread ID matches the
value passed in the vtid parameter is returned.
Errors
The pe_dbg_getrtid subroutine is unsuccessful if the following is true:
ENOSYS The caller of the function is not a debugger.
pe_dbg_getvtid
Purpose
Returns the virtual thread ID of a thread in a specified process, given its real thread ID.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
Description
The pe_dbg_getvtid subroutine returns the virtual thread ID of the specified real
thread in the specified process.
Parameters
pid The real process ID of the process containing the thread for which the
virtual thread ID is needed.
rtid The real thread ID of the thread for which the virtual thread ID is needed.
Return values
If the calling process is not a debugger, a value of -1 is returned.
If the process does not exist, the process has exited, the process is not a restarted
process, or the provided real thread ID does not exist in the specified process, the
value passed in the rtid parameter is returned.
Otherwise, the virtual thread ID of the thread whose real thread ID matches the
value passed in the rtid parameter is returned.
Errors
The pe_dbg_getvtid subroutine is unsuccessful if the following is true:
ENOSYS The caller of the function is not a debugger.
pe_dbg_read_cr_errfile
Purpose
Opens and reads information from a checkpoint or restart error file.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
void pe_dbg_read_cr_errfile(char *path, cr_error_t *err_data, int cr_errno)
Description
The pe_dbg_read_cr_errfile subroutine is used to obtain the error information from
a failed checkpoint or restart. The information is returned in the cr_error_t structure,
as defined in /usr/include/sys/checkpnt.h.
Parameters
path
The full pathname to the error file to be read.
err_data
Pointer to a cr_error_t structure in which the error information will be returned.
cr_errno
The errno from the pe_dbg_checkpnt or pe_dbg_restart call that failed. This
value is used for the Py_error field of the returned structure if the file specified
by the path parameter does not exist, has a size of 0, or cannot be opened.
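As an illustration, after a failed checkpoint the TPD might recover the error
fields referenced by the PE_DBG_CKPT_STATUS event as in this sketch (the
report_ckpt_failure name is an assumption, and the printf formats assume the
integer fields described earlier):

   #include <stdio.h>
   #include <sys/checkpnt.h>       /* cr_error_t */
   #include <pe_dbg_checkpnt.h>

   /* 'errfile' is the epath passed to pe_dbg_checkpnt; 'ckpt_errno'
    * is the errno from the failed call. */
   void report_ckpt_failure(char *errfile, int ckpt_errno)
   {
       cr_error_t err;

       pe_dbg_read_cr_errfile(errfile, &err, ckpt_errno);
       printf("primary %d, secondary %d, extended %d, error len %d\n",
              err.Py_error, err.Sy_error, err.Xtnd_error, err.error_len);
   }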
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
pe_dbg_restart
Purpose
Restarts processes from a checkpoint file.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_restart(path, id, flags, rstate, epath)
char *path;
id_t id;
unsigned int flags;
rst_state_t *rstate;
char *epath;
Description
The pe_dbg_restart subroutine allows a process to restart all the processes whose
state information has been saved in the checkpoint file.
All information required to restart these processes (other than the executable files,
any shared libraries and explicitly loaded modules) is recreated from the information
from the checkpoint file. Then, a new process is created for each process whose
state was saved in the checkpoint file. The only exception is the primary checkpoint
process, which overlays an existing process specified by the id parameter.
When restarting a single process that was checkpointed, the id parameter specifies
the process ID of the process to be overlaid. When restarting a set of processes,
the id parameter specifies the checkpoint/restart group ID of the process to be
overlaid, and the flags parameter must include RESTART_OVER_CRID. This process
must also be the primary checkpoint process of the checkpoint/restart group. The
user ID and group IDs of the primary checkpoint process saved in the checkpoint
file should match the user ID and group IDs of the process it will overlay.
A primary checkpoint process inherits attributes from the attributes saved in the file,
and also from the process it overlays. Other processes in the checkpoint file obtain
their attributes only from the checkpoint file, unless they share some attributes with
the primary checkpoint process. In this case, the shared attributes are inherited.
Although the resource usage of each checkpointed process is saved in the
checkpoint file, the resource usage attributes are zeroed out when the process
is restarted, and the getrusage subroutine will return only resource usage
after the last restart operation.
Restart data, if any, is copied to the interface buffer previously specified in
the rst parameter by the checkpoint handler before the process is restarted.
The format of the interface buffer is entirely application dependent.
Parameters
path
The path of the checkpoint file to use for the restart. Must be a valid checkpoint
file created by a pe_dbg_checkpnt call.
id Indicates the process ID or the checkpoint/restart group ID or CRID of the
process that is to be overlaid by the primary checkpoint process as identified by
the flags parameter.
flags
Determines the behavior of the pe_dbg_restart subroutine and defines the
interpretation of the id parameter. The flags parameter is constructed by
logically ORing one or more of the following values, which are defined in the
sys/checkpnt.h file:
RESTART_AND_STOP
Setting this bit will cause the restarted processes to be put in a stopped
state after a successful restart operation. They can be continued by
sending them SIGCONT. The default is to restart and resume running
the processes at the point where each thread in the process was
checkpointed.
RESTART_AND_STOPTRC
Setting this bit will cause any process that was traced at checkpoint
time to be put in a stopped state after a successful restart operation.
The processes can be continued by sending them SIGCONT. The
default is to restart and resume execution of the processes at the point
where each thread in the process was checkpointed.
RESTART_IGNORE_BADSC
Causes the restart operation not to fail if a kernel extension that was
present at checkpoint time is not present at restart time. However, if the
restarted program uses any system calls in the missing kernel
extension, the program will fail when those calls are used.
RESTART_OVER_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the process over which the primary checkpoint process will be
restarted. Use this flag when there are multiple processes to be restarted.
RESTART_PAG_ALL
Same as RESTART_WAITER_PAG.
RESTART_WAITER_PAG
Ensures that DCE credentials are restored in the restarted process.
rstate
Pointer to a structure of type rst_state_t.
epath
Path to the error file used to log error and debugging data if the restart fails.
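As an illustration, restarting a single checkpointed task over a newly created
process might look like the following sketch (the restart_task name and file
names are assumptions for this example):

   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Restart one task over the process 'overlay_pid', leaving it
    * stopped so that the debugger keeps control. */
   int restart_task(pid_t overlay_pid, rst_state_t *rstate)
   {
       return pe_dbg_restart("/tmp/poe.ckpt",         /* checkpoint file   */
                             (id_t)overlay_pid,
                             RESTART_AND_STOP,
                             rstate,
                             "/tmp/poe.rst.err");     /* restart error file */
   }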
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
Upon successful completion, a value of 0 is returned. Otherwise, a value of -1 is
returned and the errno global variable is set to indicate the error.
Errors
The pe_dbg_restart subroutine is unsuccessful when the global variable errno
contains one of the following values:
EACCES
One of the following is true:
v The file exists, but could not be opened successfully in exclusive mode,
or write permission is denied on the file, or the file is not a regular file.
v Search permission is denied on a component of the path prefix specified
by the path parameter. Access could be denied due to a secure mount.
v The file does not exist, and write permission is denied for the parent
directory of the file to be created.
EAGAIN
One of the following is true:
v The user ID has reached the maximum limit of processes that it can
have simultaneously, and the invoking process is not privileged.
v Either the calling process or the target process is involved in another
checkpoint or restart operation.
EFAULT
Copying from the interface buffer failed. The rstate parameter points to a
location that is outside the address space of the process.
EINVAL
One of the following is true:
v A NULL path was passed in.
v The checkpoint file contains invalid or inconsistent data.
v The target process is a kernel process.
v The restart data length in the rstate structure is greater than
MAX_RESTART_DATA.
ENOMEM
One of the following is true:
v There is insufficient memory to create all the processes in the checkpoint
file.
v There is insufficient memory to allocate the restart structures inside the
kernel.
ENOSYS
One of the following is true:
v The caller of the function is not a debugger.
This chapter includes descriptions of the parallel task identification API subroutines
that are available for parallel programming:
v “poe_master_tasks” on page 146.
v “poe_task_info” on page 147.
poe_master_tasks
Purpose
Retrieves the list of process IDs of POE master processes currently running on this
system.
C synopsis
#include "poeapi.h"
int poe_master_tasks(pid_t **poe_master_pids);
Description
An application invoking this subroutine while running on a given node can retrieve
the list of process IDs of all POE master processes that are currently running on the
same node. This information can be used for accounting purposes or can be
passed to the poe_task_info subroutine to obtain more detailed information about
tasks spawned by these POE master processes.
Parameters
On return, (*poe_master_pids) points to the first element of an array of pid_t
elements that contains the process IDs of POE master processes. It is the
responsibility of the calling program to free this array. This pointer is NULL if no
POE master process is running on this system or if there is an error condition.
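As an illustration, a monitoring program might use the subroutine as in this
sketch (the list_poe_masters name is an assumption):

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include "poeapi.h"

   /* Print the POE master processes on this node, then release the
    * array, which the caller owns. */
   void list_poe_masters(void)
   {
       pid_t *pids = NULL;
       int i, n;

       n = poe_master_tasks(&pids);
       for (i = 0; i < n; i++)
           printf("POE master process %d\n", (int)pids[i]);
       if (n > 0)
           free(pids);
   }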
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file.
Return values
greater than 0
Indicates the size of the array that (*poe_master_pids) points to.
0 Indicates that no POE master process is running.
-1 Indicates that a system error has occurred.
-2 Indicates that POE is unable to allocate memory.
-3 Indicates a non-valid poe_master_pids argument.
Related information
v poe_task_info
poe_task_info
Purpose
Returns a NULL-terminated array of pointers to structures of type POE_TASKINFO.
C synopsis
#include "poeapi.h"
int poe_task_info(pid_t poe_master_pid, POE_TASKINFO ***poe_taskinfo);
Description
Given the process ID of a POE master process, this subroutine returns to the
calling program through the poe_taskinfo argument a NULL-terminated array of
pointers to structures of type POE_TASKINFO. There is one POE_TASKINFO
structure for each POE task spawned by this POE master process on a local or
remote node.
Parameters
poe_master_pid
Specifies the process ID of a POE master process.
poe_taskinfo
On return, points to the first element of a NULL-terminated array of pointers to
structures of type POE_TASKINFO.
This pointer is NULL if there is an error condition. It is the responsibility of the
calling program to free the array of pointers to POE_TASKINFO structures, as
well as the relevant POE_TASKINFO structures and the subcomponents
h_name, h_addr, and p_name.
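As an illustration, the traversal and cleanup described above might look like
the following sketch (the show_tasks name is an assumption, and treating
h_name and p_name as printable strings is an assumption of this example):

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include "poeapi.h"

   void show_tasks(pid_t master_pid)
   {
       POE_TASKINFO **info = NULL;
       int i, n;

       n = poe_task_info(master_pid, &info);
       for (i = 0; i < n; i++) {
           printf("task on host %s runs %s\n",
                  info[i]->h_name, info[i]->p_name);
           free(info[i]->h_name);      /* subcomponents first */
           free(info[i]->h_addr);
           free(info[i]->p_name);
           free(info[i]);              /* then the structure  */
       }
       free(info);                     /* finally the array   */
   }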
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file.
Return values
greater than 0
Indicates the size of the array that (*poe_taskinfo) points to.
0 Indicates that no POE master process is running or that task information is
not available yet.
-1 Indicates that a system error has occurred.
-2 Indicates that POE is unable to allocate memory.
-3 Indicates a non-valid poe_taskinfo argument.
Related information
v poe_master_tasks
With PE Version 4, these nonstandard extensions remain available, but their use is
deprecated. The implementation of these routines depends on hidden message
passing threads. These routines may not be used with environment variable
MP_SINGLE_THREAD set to yes.
Note: FORTRAN refers to FORTRAN 77 bindings that are officially supported for
MPI. However, FORTRAN 77 bindings can be used by FORTRAN 90.
FORTRAN 90 and High Performance FORTRAN (HPF) offer array section
and assumed shape arrays as parameters on calls. These are not safe with
MPI.
MPI::Init void MPI::Init();
MPI_INIT MPI_INIT(INTEGER IERROR)
MPI_Init_thread int MPI_Init_thread(int *argc, char *((*argv)[]), int required, int *provided);
MPI::Init_thread int MPI::Init_thread(int& argc, char**& argv, int required);
PE MPI uses a credit flow control, by which senders track the buffer space that
can be guaranteed at each destination. For each source-destination pair, an eager
send consumes a message credit at the source, and a match at the destination
generates a message credit. The message credits generated at the destination are
returned to the sender to enable additional eager sends. The message credits are
returned piggyback on an application message when possible. If there is no return
traffic, they will accumulate at the destination until their number reaches some
threshold, and then be sent back as a batch to minimize network traffic. When a
sender has no message credits, its sends must proceed using rendezvous
protocol until message credits become available. The fallback to rendezvous
protocol may impact performance. With a reasonable supply of message credits,
most applications will find that the credits return soon enough to enable messages
that are not larger than the eager limit to continue to be sent eagerly.
Assuming a pre-allocated early arrival buffer (whose size cannot increase), the
number of message credits that the early arrival buffer represents is equal to the
early arrival buffer size divided by the eager limit. Since no sender can know how
many other tasks will also send eagerly to a given destination, the message credits
must be divided among sender tasks equally. If every task sends eagerly to a single
destination that is not posting receives, each sender consumes its message credits,
fills its share of the destination early arrival buffer, and reverts to rendezvous
protocol. This prevents an overflow at the destination, which would result in job
failure. To offer a reasonable number of message credits per source-destination pair
at larger task counts, either a very large pre-allocated early arrival buffer, or a very
small eager limit is needed.
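As an illustration with arbitrary numbers: a 64 MB pre-allocated early arrival
buffer and a 4 KB eager limit represent 64 MB / 4 KB = 16384 message credits,
so in a job with 1024 sending tasks, each source-destination pair can be
granted only 16 credits.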
It would be unusual for a real application to flood a single destination this way, and
well-written applications try to pre-post their receives. An eager send must consume
a message credit at the send side, but when the message arrives and matches a
waiting receive, it does not consume any of the early arrival buffer space. The
message credit is available to be returned to the sender, but does not return
instantly. When they pre-post and do not flood, real applications seldom use more
than a small percentage of the total early arrival buffer space. However, because
message credits must be managed for the worst case, they may be depleted at the
send side. The send side then reverts to rendezvous protocol, even though there is
plenty of early arrival buffer space available, or there is a matching receive waiting
at the receive side, which would then not need to use the early arrival buffer.
The advantage of a pre-allocated early arrival buffer is that the Parallel Environment
implementation of MPI is able to allocate and free early arrival space in the
pre-allocated buffer quickly, and because the space is owned by the MPI library, it is
certain to be available if needed. There is nothing an application can do to make
the space that is promised by message credits unavailable in the event that all
message credits are used. A disadvantage is that the space that is pre-allocated to
the early arrival buffer to support adequate message credits is denied to the
application, even if only a small portion of that pre-allocated space is ever used.
With PE 4.2, MPI users are given new control over buffer pre-allocation and
message credits. MPI users can specify both a pre-allocated and maximum early
arrival buffer size. The pre-allocated early arrival buffer is set aside for efficient
management, and guaranteed availability. If the early arrival buffer requirement
exceeds the pre-allocated space, extra early arrival buffer space comes from the
heap using malloc and free. Message credits are calculated based on the
maximum buffer size, and all of the pre-allocated early arrival buffer is used before
using malloc and free. Since message credits are based on the maximum buffer
size, an application that floods a single destination with unmatched eager messages
from all senders could require the specified maximum. If other heap usage has
made that space unavailable, a malloc could fail and the job would be terminated.
Well-designed applications, by contrast, might see better performance from the
additional credits, yet may not even fill the pre-allocated early arrival buffer, let
alone come near needing the promised maximum. An omitted maximum, or any value at or
below the pre_allocated_size, will cause message credits to be limited so that
there will never be an overflow of the pre-allocated early arrival buffer.
For most applications, the default value for the early arrival buffer should be
satisfactory, and with the default, the message credits are calculated based on the
pre-allocated size. The pre-allocated size can be changed from its default by setting
the MP_BUFFER_MEM environment variable or using the -buffer_mem
command-line flag with a single value. The message credits are calculated based
on the modified pre-allocated size. There will be no use of malloc and free after
initialization (MPI_Init). This is the way earlier versions of the Parallel Environment
implementation of MPI worked, so there is no need to learn new habits for
command-line arguments, or to make changes to existing run scripts and default
shell environments.
For some applications, in particular those that are memory constrained or run at
large task counts, it may be useful to adjust the size of the pre-allocated early
arrival buffer to slightly more than the application’s peak demand, but specify a
higher maximum early arrival buffer size so that enough message credits are
available to ensure few or no fallbacks to rendezvous protocol. For a given run, you
can use the MP_STATISTICS environment variable to see how much early arrival
buffer space is used at peak demand, and how often a send that is small enough to
be an eager send, was processed using rendezvous protocol due to a message
credit shortage.
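For example, assuming the two-value pre-allocated,maximum form of
MP_BUFFER_MEM implied above (the sizes are purely illustrative):

   MP_BUFFER_MEM=8M,64M
   export MP_BUFFER_MEM
   poe ./myprog -procs 512

This pre-allocates 8 MB for the early arrival buffer while calculating message
credits from the 64 MB maximum; if peak demand exceeds 8 MB, the extra space
comes from malloc and free as described above.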
By decreasing the pre-allocated early arrival buffer size to slightly larger than the
application’s peak demand, you avoid wasting pre-allocated buffer space. By
increasing the maximum buffer size, you provide credits which can reduce or
eliminate fallbacks to rendezvous protocol. The application’s peak demand and
fallback frequency can vary from run to run, and the amount of variation may
depend on the nature of the application. If the application’s peak demand is larger
than the pre-allocated early arrival buffer size, the use of malloc and free may
cause a performance impact. The credit flow control will guarantee that the
application’s peak demand will never exceed the specified maximum. However, if
you pick a maximum that cannot be satisfied, it is possible for an MPI application
that does aggressive but valid flooding of a single destination to fail in a malloc.
Accessibility information
Accessibility information for IBM products is available online. Visit the IBM
Accessibility Center at:
http://www.ibm.com/able/
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may be
used instead. However, it is the user’s responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you any
license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact the IBM
Intellectual Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other country
where such provisions are inconsistent with local law:
Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those
Web sites. The materials at those Web sites are not part of the materials for this
IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
The licensed program described in this document and all licensed material available
for it are provided by IBM under terms of the IBM Customer Agreement, IBM
International Program License Agreement or any equivalent agreement between us.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM
has not tested those products and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those
products.
All statements regarding IBM’s future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
Each copy or any portion of these sample programs or any derivative work must
include a copyright notice as follows:
All implemented function in the PE MPI product is designed to comply with the
requirements of the Message Passing Interface Forum, MPI: A Message-Passing
Interface Standard. The standard is documented in two volumes, Version 1.1,
University of Tennessee, Knoxville, Tennessee, June 6, 1995 and MPI-2: Extensions
to the Message-Passing Interface, University of Tennessee, Knoxville, Tennessee,
July 18, 1997. The second volume includes a section identified as MPI 1.2 with
clarifications and limited enhancements to MPI 1.1. It also contains the extensions
identified as MPI 2.0. The three sections, MPI 1.1, MPI 1.2 and MPI 2.0 taken
together constitute the current standard for MPI.
PE MPI provides support for all of MPI 1.1 and MPI 1.2. PE MPI also provides
support for all of the MPI 2.0 Enhancements, except the contents of the chapter
titled Process Creation and Management.
If you believe that PE MPI does not comply with the MPI standard for the portions
that are implemented, please contact IBM Service.
Trademarks
The following are trademarks of International Business Machines Corporation in the
United States, other countries, or both:
AFS
AIX
AIX 5L
AIXwindows
DFS
e (logo)
IBM
IBM (logo)
IBMLink
LoadLeveler
POWER
POWER3
POWER4
pSeries
RS/6000
SP
System p5
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks
of others.
Acknowledgments
The PE Benchmarker product includes software developed by the Apache Software
Foundation, http://www.apache.org.
core file. A file that preserves the state of a program, usually just before a program is terminated for an unexpected error. See also core dump.

current context. When using the pdbx debugger, control of the parallel program and the display of its data can be limited to a subset of the tasks belonging to that program. This subset of tasks is called the current context. You can set the current context to be a single task, multiple tasks, or all the tasks in the program.

D

data decomposition. A method of breaking up (or decomposing) a program into smaller parts to exploit parallelism. One divides the program by dividing the data (usually arrays) into smaller parts and operating on each part independently.

data parallelism. Refers to situations where parallel tasks perform the same computation on different sets of data.

dbx. A symbolic command-line debugger that is often provided with UNIX systems. The PE command-line debugger pdbx is based on the dbx debugger.

debugger. A debugger provides an environment in which you can manually control the execution of a program. It also provides the ability to display the program's data and operation.

E

Ethernet. A baseband local area network (LAN) that allows multiple stations to access the transmission medium at will without prior coordination, avoids contention by using carrier sense and deference, and resolves contention by using collision detection and delayed retransmission. Ethernet uses carrier sense multiple access with collision detection (CSMA/CD).

event. An occurrence of significance to a task — the completion of an asynchronous operation such as an input/output operation, for example.

executable. A program that has been link-edited and therefore can be run in a processor.

execution. To perform the actions specified by a program or a portion of a program.

expression. In programming languages, a language construct for computing a value from one or more operands.

F

fairness. A policy in which tasks, threads, or processes must be allowed eventual access to a resource for which they are competing. For example, if multiple threads are simultaneously seeking a lock, no set of circumstances can cause any thread to wait indefinitely for access to the lock.
of POWER™ microprocessor based systems (including RS/6000 SMPs, RS/6000 SP nodes, and pSeries SMPs).

IBM Parallel Environment (PE) for AIX. A licensed program that provides an execution and development environment for parallel C, C++, and FORTRAN programs. It also includes tools for debugging, profiling, and tuning parallel programs.

installation image. A file or collection of files that are required in order to install a software product on an IBM RS/6000 workstation or on SP system nodes. These files are in a form that allows them to be installed or removed with the AIX installp command. See also fileset, licensed program, and package.

Internet. The collection of worldwide networks and gateways that function as a single, cooperative virtual network.

Internet Protocol (IP). (1) The TCP/IP protocol that provides packet delivery between the hardware and user processes. (2) The SP switch library, provided with the IBM Parallel System Support Programs for AIX, that follows the IP protocol of TCP/IP.

IP. Internet Protocol.

J

Jacobi-Seidel. See Gauss-Seidel.

K

Kerberos. A publicly available security and authentication product that works with the IBM Parallel System Support Programs for AIX software to authenticate the execution of remote commands.

kernel. The core portion of the UNIX operating system that controls the resources of the CPU and allocates them to the users. The kernel is memory-resident, is said to run in kernel mode (in other words, at higher execution priority level than user mode), and is protected from user tampering by the hardware.

L

Laplace's equation. A homogeneous partial differential equation used to describe heat transfer, electric fields, and many other applications.

latency. The time interval between the instant when an instruction control unit initiates a call for data transmission, and the instant when the actual transfer of data (or receipt of data at the remote end) begins. Latency is related to the hardware characteristics of the system and to the different layers of software that are involved in initiating the task of packing and transmitting the data.

licensed program. A collection of software packages sold as a product that customers pay for to license. A licensed program can consist of packages and file sets a customer would install. These packages and file sets bear a copyright and are offered under the terms and conditions of a licensing agreement. See also fileset and package.

lightweight corefiles. An alternative to standard AIX corefiles. Corefiles produced in the Standardized Lightweight Corefile Format provide simple process stack traces (listings of function calls that led to the error) and consume fewer system resources than traditional corefiles.

LoadLeveler. A job management system that works with POE to let users run jobs and match processing needs with system resources, in order to make better use of the system.

local variable. A variable that is defined and used only in one specified portion of a computer program.

loop unrolling. A program transformation that makes multiple copies of the body of a loop, also placing the copies within the body of the loop. The loop trip count and index are adjusted appropriately so the new loop computes the same values as the original. This transformation makes it possible for a compiler to take additional advantage of instruction pipelining, data cache effects, and software pipelining. See also optimization.

M

management domain. A set of nodes configured for manageability by the Clusters Systems Management (CSM) product. Such a domain has a management server that is used to administer a number of managed nodes. Only management servers have knowledge of the whole domain. Managed nodes only know about the servers managing them; they know nothing of each other. Contrast with peer domain.

menu. A list of options displayed to the user by a data processing system, from which the user can select an action to be initiated.

message catalog. A file created using the AIX Message Facility from a message source file that contains application error and other messages, which can later be translated into other languages without having to recompile the application source code.

message passing. Refers to the process by which parallel tasks explicitly exchange program data.

Message Passing Interface (MPI). A standardized API for implementing the message-passing model.

MIMD. Multiple instruction stream, multiple data stream.

MPMD. Multiple program, multiple data.

Multiple program, multiple data (MPMD). A parallel programming model in which different, but related, programs are run on different sets of data.

option flag. Arguments or any other additional information that a user specifies with a program name. Also referred to as parameters or command-line options.
domain has no distinguished or master node. All nodes are aware of all other nodes, and administrative commands can be issued from any node in the domain. All nodes also have a consistent view of the domain membership. Contrast with management domain.

performance monitor. A utility that displays how effectively a system is being used by programs.

PID. Process identifier.

POE. parallel operating environment.

pool. Groups of nodes on an SP system that are known to LoadLeveler, and are identified by a pool name or number.

profiling. The act of determining how much CPU time is used by each function or subroutine in a program. The histogram or table produced is called the execution profile.

Program Marker Array. An X-Windows run time monitor tool provided with parallel operating environment, used to provide immediate visual feedback on a program's execution.

pthread. A thread that conforms to the POSIX Threads Programming Model.

R

reduced instruction-set computer. A computer that uses a small, simplified set of frequently-used instructions for rapid execution.

reduction operation. An operation, usually mathematical, that reduces a collection of data by one or more dimensions. For example, the arithmetic SUM operation is a reduction operation that reduces an array to a scalar value. Other reduction operations include MAXVAL and MINVAL.

S

signal handling. A type of communication that is used by message passing libraries. Signal handling involves using AIX signals as an asynchronous way to move data in and out of message buffers.

Single program, multiple data (SPMD). A parallel programming model in which different processors execute the same program on different sets of data.

source code. The input to a compiler or assembler, written in a source language. Contrast with object code.
target application. See DPCL target application.

task. A unit of computation analogous to an AIX process.

view. (1) To display and look at data on screen. (2) A special display of data, created as needed. A view temporarily ties two or more files together so that the combined files can be displayed, printed, or queried. The user specifies the fields to be included. The original files are not permanently linked or altered; however, if the system allows editing, the data in the original files will be changed.
X
X Window System. The UNIX industry’s graphics
windowing standard that provides simultaneous views of
several executing programs or processes on high
resolution graphics displays.
F
file descriptor numbers 29
file handle 66
file operation constants 59
flags, command-line
   -buffer_mem 76, 220
   -clock_source 77
   -css_interrupt 77
   -eager_limit 78, 219
   -hints_filtered 78
   -hostfile 17
   -infolevel 39
   -io_buffer_size 81
   -io_errlog 81
   -ionodefile 78
   -msg_api 72
   -polling_interval 79
   -printenv 86
   -procs 16
   -retransmit_interval 79
   -rexmit_buf_cnt 82
   -rexmit_buf_size 81
   -shared_memory 2, 15, 80
   -single_thread 80
   -stdoutmode 31
   -thread_stacksize 81
   -udp_packet_size 81
   -wait_mode 81
FORTRAN 77 175
FORTRAN 90 175
FORTRAN 90 datatype matching constants 63
FORTRAN bindings 11, 12
FORTRAN language binding datatypes 50
FORTRAN reduction function datatypes 51
function overloading 33
functions
   MPI 155
G
General Parallel File System (GPFS) 20
gprof 11
H
hidden threads 21
High Performance FORTRAN (HPF) 175
hint filtering 23
J
job control 27
Job Specifications 69
job step progression 27
job step termination 27
   default 27
K
key collision 16
key, value pair 23
ksh93 30
L
language bindings
   MPI 33
LAPI 1, 17, 43, 44
   sliding window protocol 4
   used with MPI 43
LAPI data transfer function 3
LAPI dispatcher 4, 6, 9, 10
LAPI parallel program 45
LAPI protocol 25
LAPI send side copy 7
LAPI user message 7
LAPI_INIT 45
LAPI_TERM 45
LAPI_USE_SHM environment variable 34
limits, system
   on size of MPI elements 65
llcancel 16
LoadLeveler 9, 26, 67
LookAt message retrieval tool xii
M
M:N threads 38
malloc and free 220
MALLOCDEBUG 35
MALLOCTYPE 35
maximum sizes 58
maximum tasks per node 67
message address range 16
message buffer 16, 25
message credit 4, 5, 65, 219, 220
message descriptor 4
message envelope 5
message envelope buffer 65
message packet transfer 7
message passing
   profiling 11
message queue 41
message retrieval tool, LookAt xii
message traffic 9
message transport mechanisms 1
messages
   buffering 219
miscellaneous environment variables and flags 69
mixed parallelism with MPI and threads 43
MP_ACK_INTERVAL environment variable 44
MP_ACK_THRESH environment variable 9, 44, 76
MP_BUFFER_MEM environment variable 4, 65, 66, 76, 81, 82, 220
MP_CC_SCRATCH_BUFFER environment variable 9
MP_CLOCK_SOURCE environment variable 34, 77
MP_CSS_INTERRUPT environment variable 6, 21, 42, 44, 77
MP_EAGER_LIMIT environment variable 3, 4, 65, 66, 219
MP_EUIDEVELOP environment variable 30, 149, 151
MP_EUIDEVICE environment variable 8
MP_EUILIB environment variable 2, 3
MP_HINTS_FILTERED environment variable 23, 78
MP_HOSTFILE environment variable 16
MP_INFOLEVEL environment variable 33, 39
MP_INSTANCES environment variable 8
mp_intrdelay 44
MP_INTRDELAY environment variable 44
MP_IO_BUFFER_SIZE environment variable 24, 81
MP_IO_ERRLOG environment variable 23, 81
MP_IONODEFILE environment variable 20, 78
MP_LAPI_INET_ADDR environment variable 17
MP_MSG_API environment variable 43, 72
MP_PIPE_SIZE environment variable 44
MP_POLLING_INTERVAL environment variable 7, 42, 79
MP_PRINTENV environment variable 86
MP_PRIORITY environment variable 10
MP_PROCS environment variable 16, 65
MP_RETRANSMIT_INTERVAL environment variable 10, 79
MP_REXMIT_BUF_CNT environment variable 7
MP_REXMIT_BUF_SIZE environment variable 7
MP_SHARED_MEMORY environment variable 1, 2, 15, 30, 34, 80, 149, 151
MP_SINGLE_THREAD environment variable 7, 20, 21, 36, 37, 80
MP_SNDBUF environment variable 31
MP_STATISTICS environment variable 9, 10, 220
MP_STDOUTMODE environment variable 31
MP_SYNC_ON_CONNECT environment variable 44
MP_TASK_AFFINITY environment variable 10
MP_THREAD_STACKSIZE environment variable 36, 81
MP_TIMEOUT environment variable 81
MP_UDP_PACKET_SIZE environment variable 2, 44, 81
MP_USE_BULK_XFER environment variable 9, 44
MP_UTE environment variable 86
MP_WAIT_MODE environment variable 6, 81
MPCI 44
MPE subroutine bindings 151
MPE subroutines 149
MPI 69
   functions 155
   subroutines 155
   used with LAPI 43
MPI application exit without setting exit value 27
MPI applications
   performance 1
MPI constants 57, 58, 59, 60, 61, 62, 63
MPI datatype 19, 49
MPI eager limit 66
MPI envelope 7
MPI internal locking 7
MPI IP performance 2
MPI library 37
   architecture considerations 33
MPI Library
   performance 1
MPI message size 7
MPI reduction operations 53
MPI size limits 65
MPI subroutine bindings 175
MPI wait call 1, 3, 4, 6
MPI_Abort 26, 28
MPI_ABORT 26, 28
MPI_File 19
MPI_File object 22
MPI_Finalize 27
MPI_FINALIZE 27, 37, 45
MPI_INIT 37, 45
MPI_INIT_THREAD 37
MPI_THREAD_FUNNELED 37
MPI_THREAD_MULTIPLE 37
MPI_THREAD_SINGLE 37
MPI_WAIT_MODE environment variable 42
MPI_WTIME_IS_GLOBAL 34
MPI-IO
   API user tasks 20
   considerations 20
   data buffer size 24
   datatype constructors 23
   deadlock prevention 19
   definition 19
   error handling 22
   features 19
   file interoperability 24
   file management 20
   file open 21
   file tasks 21
   hidden threads 21
   I/O agent 20
   Info objects 23
   logging errors 23
   portability 19
   robustness 19
   versatility 19
MPI-IO constants 59
MPL 25
POE command-line flags (continued)
   -infolevel 39, 69, 76
   -instances 71
   -io_buffer_size 81
   -io_errlog 81
   -ionodefile 78
   -labelio 75
   -llfile 73
   -msg_api 72
   -newjob 73
   -nodes 72
   -pgmmodel 73
   -pmdlog 76
   -polling_interval 79
   -printenv 86
   -priority_log 85
   -priority_ntp 85
   -procs 16, 71
   -pulse 71
   -rdma_count 80
   -resd 71
   -retransmit_interval 79
   -retry 71
   -retrycount 71
   -rexmit_buf_cnt 82
   -rexmit_buf_size 81
   -rmpool 72
   -save_llfile 73
   -savehostfile 72
   -shared_memory 2, 15, 80
   -single_thread 80
   -stdinmode 75
   -stdoutmode 31, 75
   -task_affinity 74
   -tasks_per_node 72
   -thread_stacksize 81
   -udp_packet_size 81
   -use_bulk_xfer 79
   -wait_mode 81
POE considerations
   64-bit application 42
   AIX 37
   AIX function limitation 30
   AIX message catalog considerations 33
   architecture 33
   automount daemon 30
   checkpoint and restart 38
   child task 37
   collective communication call 38
   entry point 36
   environment overview 25
   exit status 26
   exits, abnormal 29
   exits, normal 29
   exits, parallel task 29
   file descriptor numbers 29
   fork limitations 37
   job step default termination 27
   job step function 27
   job step progression 27
   job step termination 27
   job termination 29
   language bindings 33
   large numbers of tasks 34
   LoadLeveler 26
   M:N threads 38
   MALLOCDEBUG 35
   mixing collective 30
   MPI_INIT 37
   MPI_INIT_THREAD 37
   MPI_WAIT_MODE 42
   network tuning 31
   nopoll 42
   order requirement for system includes 37
   other differences 45
   parent task 37
   POE additions 27
   remote file system 30
   reserved environment variables 33
   root limitation 30
   shell scripts 30
   signal handler 28
   signal library 25
   single threaded 36
   STDIN, STDOUT, or STDERR 30, 32
   STDIN, STDOUT, or STDERR, output 31
   STDIN, STDOUT, or STDERR, rewinding 30
   task initialization 36
   thread stack size 36
   thread termination 37
   thread-safe libraries 37
   threads 35
   user limits 26
   user program, passing string arguments 31
   using MPI and LAPI together 43
   virtual memory segments 34
POE environment variables 69
   MP_ACK_INTERVAL 44
   MP_ACK_THRESH 9, 44, 76
   MP_ADAPTER_USE 70
   MP_BUFFER_MEM 4, 65, 66, 76, 220
   MP_BULK_MIN_MSG_SIZE 80
   MP_BYTECOUNT 84
   MP_CC_SCRATCH_BUFFER 9
   MP_CKPTDIR 73
   MP_CKPTDIR_PERTASK 73
   MP_CKPTFILE 73
   MP_CLOCK_SOURCE 34, 77
   MP_CMDFILE 73
   MP_COREDIR 83
   MP_COREFILE_FORMAT 83
   MP_COREFILE_SIGTERM 83
   MP_CPU_USE 70
   MP_CSS_INTERRUPT 6, 21, 42, 44, 77
   MP_DBXPROMPTMOD 84
   MP_DEBUG_INITIAL_STOP 76
   MP_DEBUG_NOTIMEOUT 76
   MP_EAGER_LIMIT 3, 4, 65, 66, 78, 219
   MP_EUIDEVELOP 30, 84, 149, 151
   MP_EUIDEVICE 8, 70
   MP_EUILIB 2, 3, 70
shmget 34
sigaction 28
SIGALRM 29
SIGIO 29
signal handler 28, 36
   POE 28
   user defined 28, 29
signal library 25
SIGPIPE 29
sigwait 28
Simultaneous Multi-Threading (SMT) 67
single thread considerations 6
single threaded applications 36
sockets 40
special datatypes 61
special purpose datatypes 49
striping 8
subroutine bindings 151, 175
   collective communication 175
   communicator 178
   conversion functions 182
   derived datatype 183
   environment management 189
   external interfaces 191
   group management 193
   Info object 195
   memory allocation 196
   MPI-IO 197, 205
   non-blocking collective communication 151
   one-sided communication 205
   point-to-point communication 208
   profiling control 214
   topology 214
subroutines
   MPE 149
   MPI 155
   non-blocking collective communication 149
   parallel task identification API 145
   parallel utility subroutines 87
   poe_master_tasks 146
   poe_task_info 147
switch clock 34
system contention scope 38
system limits
   on size of MPI elements 65

T
tag 66
task limits 67
task synchronization 25
thread context 6
thread stack size
   default 36
thread-safe library 7
threaded MPI library 25
threaded programming 35
threads and mixed parallelism with MPI 43
threads constants 63
threads library 25
threads library considerations
   AIX signals 28
topologies 59
trademarks 227
tuning parameter
   sb_max 2
   udp_recvspace 2
   udp_sendspace 2

U
UDP ports 8
UDP/IP 2, 4
UDP/IP transport 2
unacknowledged packets 10
unsent data 66
upcall 4, 6
user resource limits 26
User Space 1, 2, 4
User Space FIFO mechanism 8
User Space FIFO packet buffer 8
User Space library 1
User space protocol 43
User Space transport 2, 3, 6
User Space window 3

V
virtual address space 9
virtual memory segments 34

W
wait
   MPI 1, 2, 3, 4, 6
window 66

X
xprofiler 11
Readers' Comments — We'd Like to Hear from You

Overall, how satisfied are you with the information in this book?

When you send comments to IBM, you grant IBM a nonexclusive right to use or distribute your comments in any way it believes appropriate without incurring any obligation to you.

Mail your comments to:

IBM Corporation
Department 55JA, Mail Station P384
2455 South Road
Poughkeepsie, NY 12601-5400

SA22-7945-04