IBM Parallel Environment for AIX 5L: MPI Programming Guide
SA22-7945-04
Note
Before using this information and the product it supports, read the information in “Notices” on page 225.
All function implemented in the PE MPI product is designed to comply with the
requirements of the Message Passing Interface Forum, MPI: A Message-Passing
Interface Standard, Version 1.1, University of Tennessee, Knoxville, Tennessee,
June 6, 1995 and MPI-2: Extensions to the Message-Passing Interface, University
of Tennessee, Knoxville, Tennessee, July 18, 1997. The second volume includes a
section identified as MPI 1.2, with clarifications and limited enhancements to MPI
1.1. It also contains the extensions identified as MPI 2.0. The three sections, MPI
1.1, MPI 1.2, and MPI 2.0 taken together constitute the current standard for MPI.
PE MPI provides support for all of MPI 1.1 and MPI 1.2. PE MPI also provides
support for all of the MPI 2.0 enhancements, except the contents of the chapter
titled "Process creation and management."
If you believe that PE MPI does not comply, in any way, with the MPI standard for
the portions that are implemented, please contact IBM service.
Convention Usage
bold Bold words or characters represent system elements that you must
use literally, such as: command names, file names, flag names,
path names, PE component names (pedb, for example), and
subroutines.
constant width Examples and information that the system displays appear in
constant-width typeface.
italic Italicized words or characters represent variable values that you
must supply.
Italics are also used for book titles, for the first use of a glossary
term, and for general emphasis in text.
[item] Used to indicate optional items.
<Key> Used to indicate keys you press.
\ The continuation character is used in coding examples in this book
for formatting purposes.
User actions appear in uppercase boldface type. For example, if the action is to
enter the tool command, this manual presents the instruction as:
ENTER
tool
Abbreviated names
Some of the abbreviated names used in this book follow.
Table 1. Parallel Environment abbreviations
Short Name Full Name
AIX Advanced Interactive Executive
CSM Cluster Systems Management
CSS communication subsystem
CTSEC cluster-based security
DPCL dynamic probe class library
dsh distributed shell
GUI graphical user interface
HDF Hierarchical Data Format
IP Internet Protocol
LAPI Low-level Application Programming Interface
MPI Message Passing Interface
To access the most recent Parallel Environment documentation in PDF and HTML
format, refer to the IBM eServer Cluster Information Center on the Web at:
http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp
Both the current Parallel Environment books and earlier versions of the library are
also available in PDF format from the IBM Publications Center Web site located at:
http://www.ibm.com/shop/publications/order/
It is easiest to locate a book in the IBM Publications Center by supplying the book’s
publication number. The publication number for each of the Parallel Environment
books is listed after the book title in the preceding list.
The PE message catalogs are in English, and are located in the following
directories:
/usr/lib/nls/msg/C
/usr/lib/nls/msg/En_US
/usr/lib/nls/msg/en_US
If your site is using its own translations of the message catalogs, consult your
system administrator for the appropriate value of NLSPATH or LANG. For more
information on NLS and message catalogs, see AIX: General Programming
Concepts: Writing and Debugging Programs.
Performance of jobs using the MPI library can be affected by the setting of various
environment variables. The complete list is provided in Chapter 11, “POE
environment variables and command-line flags,” on page 69 and in IBM Parallel
Environment for AIX 5L: Operation and Use, Volume 1. Programs that conform to
the MPI standard should run correctly with any combination of environment
variables within the supported ranges.
The defaults of these environment variables are generally set to optimize the
performance of the User Space library for MPI programs with one task per
processor, using blocking communication. Blocking communication includes sets of
non-blocking send and receive calls followed immediately by wait or waitall, as well
as explicitly blocking send and receive calls. Applications that use other
programming styles, in particular those that do significant computation between
posting non-blocking sends or receives and calling wait or waitall, may see a
performance improvement if some of the environment variables are changed.
The MPI library is a dynamically loaded shared object, whose symbols are linked
into the user application. At run time, when MPI_Init is called by the application
program, the various environment variables are read and interpreted, and the
underlying transport is initialized. Depending on the setting of the transport variable
MP_EUILIB, MPI initializes lower level protocol support for a User Space packet
mode, or for a UDP/IP socket mode. By default, the shared memory mechanism for
point-to-point messages (and in 64-bit applications, collective communication) is
also initialized.
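For example, the transport can be selected in the shell before the job is started. This is a minimal sketch; my_mpi_program and the task count are placeholders:

export MP_EUILIB=us     # User Space packet mode; use ip for UDP/IP socket mode
poe my_mpi_program -procs 16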
Tasks on the same node can use operating system shared memory transport for
point-to-point communication. Shared memory is used by default, but may be turned
off with the environment variable MP_SHARED_MEMORY. In addition, 64-bit
applications are provided an optimization where the MPI library uses shared
memory directly for selected collective communications, rather than just mapping
the collectives into point-to-point communications. The collective calls for which this
optimization is provided include MPI_Barrier, MPI_Reduce, MPI_Bcast,
MPI_Allreduce and others. This optimization is enabled by default, and disabled by
setting environment variable MP_SHARED_MEMORY to no. For most programs,
enabling the shared memory transport for point-to-point and collective calls provides
better performance than using the network transport.
For more information on shared memory, see Chapter 3, “Using shared memory,”
on page 15.
MPI IP performance
MPI IP performance is affected by the socket-buffer sizes for sending and receiving
UDP data. These are defined by two network tuning parameters udp_sendspace
and udp_recvspace. When the buffer for sending data is too small and quickly
becomes full, UDP data transfer can be delayed. When the buffer for receiving data
is too small, incoming UDP data can be dropped due to insufficient buffer space,
resulting in send-side retransmission and very poor performance.
LAPI, on which MPI is running, tries to increase the size of send and receive
buffers to avoid this performance degradation. However, the buffer sizes,
udp_sendspace and udp_recvspace, cannot be greater than another network
tuning parameter sb_max, which can be changed only with privileged access rights
(usually root). For optimal performance, it is suggested that sb_max be increased
to a relatively large value. For example, increase sb_max from the default of
1048576 to 8388608 before running MPI IP jobs.
The UDP/IP transport can be used on clustered servers where a high speed
interconnect is not available, or can use the IP mode of the high speed
interconnect, if desired. This transport is often useful for program development or
initial testing, rather than production. Although this transport does not match User
Space performance, it consumes only virtual adapter resources rather than limited
real adapter resources.
Details on the network tuning parameters, such as their definitions and how to
change their values, can be found in the man page for the AIX no command.
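As an illustration, a system administrator (with root authority) might raise the limits with the no command before MPI IP jobs are run. The sb_max value follows the suggestion above; the udp_sendspace and udp_recvspace values are illustrative only and are site-specific:

no -o sb_max=8388608        # raise the socket buffer ceiling
no -o udp_sendspace=65536   # illustrative value
no -o udp_recvspace=655360  # illustrative value
no -a | grep udp            # display the current UDP-related settings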
The underlying transport for MPI is LAPI, which is packaged with AIX as part of the
RSCT file set. LAPI provides a one-sided message passing API, with optimizations
to support MPI. Except when dealing with applications that make both MPI and
direct LAPI calls, or when considering compatibility of PE and RSCT levels, there is
usually little need for the MPI user to be concerned about what is in the MPI layer
and what is in the LAPI layer.
Eager messages
An eager send passes its buffer pointer, communicator, destination, length, tag, and
datatype information to a lower-level protocol (LLP) reliable message delivery function. If the message is
small enough, it is copied from the user’s buffer into a protocol managed buffer, and
the MPI send is marked complete. This makes the user’s send buffer immediately
available for reuse. A longer message is not copied, but is transmitted directly from
the user's buffer. In this second case, the send cannot be marked complete until the
data has reached the destination and the packets have been acknowledged. In either
case the send is reliable, because either the message itself, or a copy of it, is
preserved until it can be delivered.
Whenever a send is active, and at other convenient times such as during a blocking
receive or wait, a message dispatcher is run. This dispatcher sends and receives
messages, creating packets for and interpreting packets from the lower level packet
driver (User Space or IP). Since UDP/IP and User Space are both unreliable packet
transports (packets may be dropped during transport without an error being
reported), the message dispatcher manages packet acknowledgment and
retransmission with a sliding window protocol. This message dispatcher is also
run on a hidden thread once every few hundred milliseconds and, if environment
variable MP_CSS_INTERRUPT is set, upon notification of packet arrival.
When the message dispatcher recognizes the first packet of an inbound message,
a header handler or upcall is invoked. This upcall is to a function within the MPI
layer that searches a list of descriptors for posted but unmatched receives. If a
match is found, the descriptor is unlinked from the unmatched receives list and data
will be copied directly from the packets to the user buffer. The receive descriptor is
marked by a second upcall (a completion handler), when the dispatcher detects the
final packet so that the MPI application can recognize that the receive is complete.
If a receive is not found by the header handler upcall, an early arrival buffer is
allocated by MPI and the message data will be copied to that buffer. A descriptor
similar to a receive descriptor but containing a pointer to the early arrival buffer is
added to an early arrivals list. When an application does make a receive call, the
early arrivals list is searched. If a match is found:
1. The descriptor is unlinked from the early arrivals list.
2. Data is copied from the early arrival buffer to the user buffer.
3. The early arrival buffer is freed.
4. The descriptor (which is now associated with the receive) is marked so that the
MPI application can recognize that the receive is complete.
The size of the early arrival buffer is controlled by the MP_BUFFER_MEM
environment variable.
The MPI standard requires that a send not complete until it is guaranteed that its
data can be delivered to the receiver. For an eager send, this means the sender
must know in advance that there is sufficient buffer space at the destination to
cache the message if no posted receive is found. The PE MPI library accomplishes
this by using a credit flow control mechanism. At initialization time, each source to
destination pair is allocated a fixed, identical number of message credits. The
number of credits per pair is calculated based on environment variables
MP_EAGER_LIMIT, MP_BUFFER_MEM, and the total number of tasks in the job.
If an eager message arrives and finds a match, the credit is freed immediately
because the early arrival buffer space that it represents is not needed. If data must
be buffered, the credit is tied up until the matching receive call is made, which
allows the early arrival buffer to be freed. PE MPI returns message flow control
credits by piggybacking them on some regular message going back to the sender, if
possible. If credits pile up at the destination and there are no application messages
going back, MPI must send a special purpose message to return the credits. For
more information on the early arrival buffer and the environment variables,
MP_EAGER_LIMIT and MP_BUFFER_MEM, see Chapter 11, “POE environment
variables and command-line flags,” on page 69 and Appendix E, “PE MPI buffer
management for eager protocol,” on page 219.
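For example, eager protocol behavior might be tuned from the shell as follows. The values shown are illustrative only; the supported ranges are described in Chapter 11:

export MP_EAGER_LIMIT=65536   # messages up to 64 KB are sent eagerly
export MP_BUFFER_MEM=64M      # pool from which early arrival buffers are allocated
poe my_mpi_program -procs 32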
Rendezvous messages
For a standard send, PE MPI makes the decision whether to use an eager or a
rendezvous protocol based on the message length. For the standard MPI_Send
and MPI_Isend calls, messages whose size is not greater than the eager limit are
sent using eager protocol. Messages whose size is larger than the eager limit are
sent using rendezvous protocol. Thus, small messages can be eagerly sent, and
assuming that message credits are returned in a timely fashion, can continue to be
sent using the mechanisms described above. For large messages, or small
messages for which there are no message credits available, the message must be
managed with a rendezvous protocol.
Since a zero byte message has no message data to preserve, even an MPI
implementation with no early arrival buffering should be able to complete a zero
byte standard send at the send side, whether or not there is a matching receive.
Thus, for PE MPI with MP_EAGER_LIMIT set to zero, a one byte standard send
will not complete until a matching receive is found, but a zero byte standard send
will complete without waiting for a rendezvous to determine whether a receive is
waiting.
Eager messages require only one trip across the transport, while rendezvous
messages require three trips, but two of the trips are short, and the time is quickly
amortized for large messages. Using the rendezvous protocol ensures that there is
no need for temporary buffers to store the data, and no overhead from copying
packets to temporary buffers and then on to user buffers.
If all the processors are busy, enabling interrupt mode causes thread context
switching and contention for processors, which might cause the application to run
slower than it would in polling mode.
The behavior of the MPI library during message polling can also be affected by the
setting of the environment variable MP_WAIT_MODE. If set to sleep or yield, the
blocked MPI thread sleeps or yields periodically to allow the AIX dispatcher to
schedule other activity on the processor. This may be appropriate when the wait call
is part of a command processor thread. An alternate way of implementing this
behavior is with an MPI test command and user-invoked sleep or yield (or some
other mechanism to release a processor).
Environment variable MP_WAIT_MODE can also be set to nopoll, which polls the
message dispatcher for a short time (less than one millisecond) and then goes into
a thread wait, depending on an interrupt or the periodic timer to resume message
handling.
As mentioned above, packets are transferred during polling and when an interrupt is
recognized (which invokes the message dispatcher). The message dispatcher is
also invoked periodically, based on the AIX timer support. The time interval between
brief polls of the message dispatcher is controlled by environment variable
MP_POLLING_INTERVAL, specified in microseconds.
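For example, a job that computes for long stretches between MPI calls might run with settings like these (the values are illustrative):

export MP_CSS_INTERRUPT=yes        # run the message dispatcher on packet arrival
export MP_WAIT_MODE=nopoll         # poll briefly, then wait for an interrupt
export MP_POLLING_INTERVAL=400000  # dispatcher timer interval, in microseconds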
The MPI library supports multiple threads simultaneously issuing MPI calls, and
provides appropriate internal locking to make sure that the library is thread safe with
respect to these calls. If the application makes MPI calls on only one thread (or is a
non-threaded program), and does not use the nonstandard MPE_I nonblocking
collectives, MPI-IO, or MPI one-sided features, the user may wish to skip the
internal locking by setting the environment variable MP_SINGLE_THREAD to yes.
Do not set MP_SINGLE_THREAD to yes unless you are certain that the
application is single threaded.
On the send side, LAPI guarantees to make a copy of any LAPI-level message of up
to 128 bytes, letting the send complete locally. An MPI message sent by an
application has a header (or envelope) prepended by PE MPI before being sent as a LAPI
message. Therefore, the application message size from the MPI perspective is less
than from the LAPI perspective. The message envelope is no larger than 32 bytes.
LAPI also maintains a limited pool of retransmission buffers larger than 128 bytes. If
the application message plus MPI envelope exceeds 128 bytes, but is small enough
to fit a retransmission buffer, LAPI tries (but cannot guarantee) to copy it to a
retransmission buffer, allowing the MPI send to complete locally.
Striping
With PE Version 4, protocol striping is supported for HPS switch adapters (striping,
failover, and recovery are not supported over non-HPS adapters such as Gigabit
Ethernet). If the windows (or UDP ports) are on multiple adapters and one adapter
or link fails, the corresponding windows are closed and the remaining windows are
used to send messages. When the adapter or link is restored (assuming that the
node itself remains operational), the corresponding windows are added back to the
list of windows used for striping.
For single network configurations, striping, failover, and recovery can still be used
by requesting multiple instances (setting the environment variable MP_INSTANCES
to a value greater than 1). However, unless the system is configured with multiple
adapters on the network, and window resources are available on more than one
adapter, failover and recovery is not necessarily possible, because both windows
may end up on the same adapter. Similarly, improved striping performance using
RDMA can be seen only if windows are allocated from multiple adapters on the
single network.
There are some considerations that users of 32-bit applications must take into
account before deciding to use the striping, failover, and recovery function. A 32-bit
application is limited to 16 segments. The standard AIX memory model for 32-bit
applications claims five of these, and expects the application to allocate up to eight
segments (2 GB) for application data (the heap, specified with compile option
-bmaxdata). For example, -bmaxdata:0x80000000 allocates the maximum eight
segments, each of which is 256 MB. The communication subsystem takes an
additional, variable number of segments, depending on options chosen at run time.
In some circumstances, for 32-bit applications the total demand for segments can
be greater than 16 and a job will be unable to start, or will run with reduced
performance. If your application is using a very large heap and you are considering
the striping, failover, and recovery function, take this additional segment usage into
account.
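For example, a 32-bit application might be linked to claim fewer data segments, leaving more segments free for the communication subsystem. This is a sketch; the six-segment value is illustrative:

mpcc_r -o my_mpi_program my_mpi_program.c -bmaxdata:0x60000000   # six 256 MB data segments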
This especially benefits applications that either transfer relatively large amounts of
data (greater than 150 KB) in a single MPI call, or overlap computation and
communication, since the CPU is no longer required to copy data. RDMA
operations are considerably more efficient when large (16 MB) pages are used
rather than small (4 KB) pages, especially for large transfers. In order to use the
bulk transfer mode, the system administrator must enable RDMA communication
and LoadLeveler must be configured to use RDMA. Not all communications
adapters support RDMA.
For a quick overview of the RDMA feature, and the steps that a system
administrator must take to enable or disable the RDMA feature, see Switch Network
Interface for eServer pSeries High Performance Switch Guide and Reference.
For information on using LoadLeveler with bulk data transfer, see these sections in
LoadLeveler: Using and Administering:
v The chapter: Configuring the LoadLeveler environment, section Enabling support
for bulk data transfer.
v The chapter: Building and submitting jobs, section Using bulk data transfer.
Other considerations
The information provided earlier in this chapter, and the controlling variables, apply
to most applications. There are a few others that are useful in special
circumstances. These circumstances may be identified by setting the
MP_STATISTICS environment variable to print and examining the task statistics at
the end of an MPI job.
MP_ACK_THRESH
This environment variable changes the threshold for the update of the packet
sliding window. Reducing the value causes more frequent update of the window,
but generates additional message traffic.
MP_CC_SCRATCH_BUFFER
MPI collectives normally pick from more than one algorithm based on the impact
of message size, task count, and other factors on expected performance.
Normally, the algorithm that is predicted to be fastest is selected, but in some
cases the preferred algorithm depends on PE MPI allocation of scratch buffer
space. This environment variable, when set to no, instructs PE MPI to use the
collective algorithm that does not require scratch buffer space, even when that
algorithm is predicted to be slower.
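A sketch of how these controls might be combined for a diagnostic run (the values are illustrative):

export MP_STATISTICS=print      # print task statistics at the end of the job
export MP_ACK_THRESH=16         # update the sliding window more frequently
export MP_CC_SCRATCH_BUFFER=no  # avoid collective algorithms that need scratch buffers
poe my_mpi_program -procs 32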
AIX profiling
If you use the gprof, prof, or xprofiler command and the appropriate compiler
command (such as cc_r or mpcc_r) with the -p or -pg flag, you can profile your
program. For information about using:
v cc_r, gprof, and prof, see IBM Parallel Environment for AIX: Operation and Use,
Volume 2.
v mpcc_r and related compiler commands, see IBM Parallel Environment for AIX:
Operation and Use, Volume 1.
v xprofiler, which is part of the AIX operating system, see the AIX: Performance
Tools Guide and Reference.
The message passing library is not enabled for gprof or prof profiling counts. You
can obtain profiling information by using the nameshifted MPI functions provided.
Programs that use the C MPI language bindings can easily create profiling libraries
using the nameshifted interface.
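For example, a minimal profiling wrapper in C intercepts MPI_Send through the standard nameshifted (PMPI_) entry points. This sketch only counts calls; a real profiling library would typically record timings as well:

#include <mpi.h>

static int send_count = 0;   /* number of MPI_Send calls observed */

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
{
    send_count++;                                             /* profiling action */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* real send */
}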
v If you are both the creator and user of the profiling library and you are not using
FORTRAN, follow steps 1 through 6. If you are using FORTRAN, follow steps 1
through 4, then steps 7 through 14.
v If you are the creator of the profiling library, follow steps 1 through 4. You also
need to provide the user with the file created in step 2.
v If you are the user of the profiling library and you are not using FORTRAN, follow
steps 5 and 6. If you are using FORTRAN, start at step 7. You will need to make
sure that you have the file generated by the creator in step 2.
You need to change it into the following structure by rebuilding the mpifort_r.o
shared object:
c -------------------------------------
program hwinit
include 'mpif.h'
integer forterr
c
call MPI_INIT(forterr)
c
c Write comments to screen.
c
write(6,*)'Hello from task '
c
call MPI_FINALIZE(forterr)
c
stop
end
c
Point-to-point communications
MPI programs with more than one task on the same computing node may benefit
from using shared memory to send messages between same node tasks.
Setting the MP_SHARED_MEMORY environment variable to no directs MPI to not use a shared-memory protocol for
message passing between any two tasks of a job running on the same node.
For the 32-bit libraries, shared memory exploitation always allocates a 256 MB
virtual memory address segment that is not available for any other use. Thus,
programs that are already using all available segments cannot use this option. For
more information, see “Available virtual memory segments” on page 34.
For 64-bit libraries, there are so many segments in the address space that there is
no conflict between library and end user segment use.
Shared memory support is available for both IP and User Space MPI protocols. For
programs on which all tasks are on the same node, shared memory is used
exclusively for all MPI communication (unless MP_SHARED_MEMORY is set to
no).
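For example, to force all same-node point-to-point communication onto the network transport (perhaps while diagnosing a suspected shared memory problem):

export MP_SHARED_MEMORY=no
poe my_mpi_program -procs 8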
Collective communications
With PE Version 4, the PE implementation of MPI also offers an optimization of
certain collective communication routines. This optimization uses an additional
shared memory segment. The collective communication optimization is available
only to 64-bit executables, where segment registers are abundant. This optimization
is controlled by the MP_SHARED_MEMORY environment variable.
For collectives in 64-bit executables that are enhanced to use shared memory, the
algorithms used for smaller message sizes involve copying data from user buffers
to scratch buffers in shared memory, and then allowing tasks that are interested in
that data to work with the copy in shared memory. The algorithms used for larger
messages involve exposing the user buffer itself to other tasks that have an interest
in it. The effect is that for smaller messages, some tasks may return from a
collective call as soon as their data is copied to shared memory, sometimes before
tasks needing access to the data even enter the collective operation.
POE’s Partition Manager Daemon (PMD) attempts to clean up any allocated shared
memory segments when a program exits normally. However, if a PMD process
(named pmdv4) is killed with signals or with the llcancel command, shared
memory segments may not be cleaned up properly. For this reason, when shared
memory is used, users should not kill or cancel a PMD process.
using the hostfile (either as host list in the directory where POE is run, or by
specifying MP_HOSTFILE or -hostfile) that contains the names of the Ethernet
adapters.
2. If a shared file system is not used, copy the original hostfile and the addr_fix
script below to the nodes where the parallel tasks will run. The addr_fix script
must be copied to the directory with the same name as the current directory on
the POE home node (from which you ran poe in step 1 on page 16.)
3. Run your real POE job with whatever settings you were using, except:
v Use the hostnames file from step 1 on page 16 as the MP_HOSTFILE or
-hostfile that is specified to POE.
v Set the environment variable ADDR_FIX_HOSTNAME to the name of the
hostfile that contains the names of the Ethernet adapters, used in step 1 on
page 16.
v Instead of invoking the job as:
poe my_exec my_args poe_flags
invoke it as:
poe ./addr_fix my_exec my_args poe_flags
The addr_fix script follows.
======================================================================
#!/bin/ksh93
# POE sets MP_CHILD to the 0-based task id; index into the file to get
# the ethernet adapter name that this task will run on.
my_index=$(($MP_CHILD + 1))
my_name=`awk "NR==$my_index" $ADDR_FIX_HOSTNAME`
# Resolve the adapter name to its IP address (assumes AIX host prints "name is address").
my_addr=`host $my_name | awk '{print $3}'`
# Set environment variable that MPI will use as address for IP communication.
export MP_CHILD_INET_ADDR=@1:$my_addr,ip
exec "$@"
If LAPI is used, set MP_LAPI_INET_ADDR in the script instead. If both MPI and
LAPI are used, set both environment variables.
Definition of MPI-IO
The I/O component of MPI-2, or MPI-IO, provides a set of interfaces that are aimed
at performing portable and efficient parallel input and output operations.
MPI-IO allows a parallel program to express its I/O in a portable way that reflects
the program’s inherent parallelism. MPI-IO uses many of the concepts already
provided by MPI to express this parallelism. MPI datatypes are used to express the
layout and partitioning of data, which is represented in a file shared by several
tasks. An extension of the MPI communicator concept, referred to as an MPI_File,
is used to describe a set of tasks and a file that these tasks will use in some
integrated manner. Collective operations on an MPI_File allow efficient physical I/O
on a data structure that is distributed across several tasks for computation, but
possibly stored contiguously in the underlying file.
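As a brief illustration of the MPI-IO style, in the C sketch below each task writes its own block of one shared file at an offset computed from its rank. The file name is a placeholder:

#include <mpi.h>

#define COUNT 1024

int main(int argc, char *argv[])
{
    int i, rank, buf[COUNT];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < COUNT; i++)
        buf[i] = rank;

    /* All tasks collectively open one shared file. */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each task writes its own contiguous block at a rank-based offset. */
    offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}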
Features of MPI-IO
The primary features of MPI-IO are:
1. Portability: As part of MPI-2, programs written to use MPI-IO must be portable
across MPI-2 implementations and across hardware and software platforms.
The PE MPI-IO implementation guarantees portability of object code on
RS/6000 SP computers and clustered servers. The MPI-IO API ensures
portability at the source code level.
2. Versatility: The PE MPI-IO implementation provides support for:
v basic file manipulations (open, close, delete, sync)
v get and set file attributes (view, size, group, mode, info)
v blocking data access operations with explicit offsets (both independent and
collective)
v non-blocking data access operations with explicit offsets (independent only)
v blocking and non-blocking data access operations with file pointers (individual
and shared)
v split collective data access operations
v any derived datatype for memory and file mapping
v file interoperability through data representations (internal, external,
user-defined)
v atomic mode for data accesses.
3. Robustness: PE MPI-IO performs as robustly as possible in the event of error
occurrences. Because the default behavior, as required by the MPI-2 standard,
is for I/O errors to return, PE MPI-IO tries to prevent any deadlock that might
result from an I/O error returning. The intent of the "errors return" default is that
the application be given the opportunity to recover from, or report, an I/O failure
rather than be terminated.
MPI-IO is intended to be used with the IBM General Parallel File System (GPFS)
for production use. File access through MPI-IO normally requires that a single
GPFS file system image be available across all tasks of an MPI job. Shared file
systems such as AFS® and NFS do not meet this requirement when used across
multiple nodes. PE MPI-IO can be used for program development on any other file
system that supports a POSIX interface (AFS, DFS™, JFS, or NFS) as long as all
tasks run on a single node or workstation, but this is not expected to be a useful
model for production use of MPI-IO.
Use of a file that is local to (that is, distinct at) each task or node, is not valid and
cannot be detected as an error by MPI-IO. Issuing MPI_FILE_OPEN on a file in
/tmp may look valid to the MPI library, but will not produce valid results.
The default for MP_CSS_INTERRUPT is no. If you do not override the default,
MPI-IO enables interrupts while files are open. If you have forced interrupts to yes
or no, MPI-IO does not alter your selection.
MPI-IO depends on hidden threads that use MPI message passing. MPI-IO cannot
be used with MP_SINGLE_THREAD set to yes.
For AFS, DFS, and NFS, MPI-IO uses file locking for all accesses by default. If
other tasks on the same node share the file and also use file locking, file
consistency is preserved. If the MPI_FILE_OPEN is done with mode
MPI_MODE_UNIQUE_OPEN, file locking is not done.
For information about file hints, see MPI_FILE_OPEN in IBM Parallel Environment
for AIX: MPI Subroutine Reference.
Error handling
MPI-1 treated all errors as occurring in relation to some communicator. Many MPI-1
functions were passed a specific communicator, and for the rest it was assumed
that the error context was MPI_COMM_WORLD. MPI-1 provided a default error
handler named MPI_ERRORS_ARE_FATAL for each communicator, and defined
functions similar to those listed below for defining and attaching alternate error
handlers.
The MPI-IO operations use an MPI_File in much the way other MPI operations use
an MPI_Comm, except that the default error handler for MPI-IO operations is
MPI_ERRORS_RETURN. The following functions are needed to allow error
handlers to be defined and attached to MPI_File objects:
v MPI_FILE_CREATE_ERRHANDLER
v MPI_FILE_SET_ERRHANDLER
v MPI_FILE_GET_ERRHANDLER
v MPI_FILE_CALL_ERRHANDLER
For information about these subroutines, see IBM Parallel Environment for AIX: MPI
Subroutine Reference.
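Because MPI-IO errors return by default, a program can either test return codes itself or attach the fatal error handler to a file. A minimal C sketch:

int rc, len;
char msg[MPI_MAX_ERROR_STRING];
MPI_File fh;

rc = MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                   MPI_INFO_NULL, &fh);
if (rc != MPI_SUCCESS) {
    MPI_Error_string(rc, msg, &len);  /* translate the code for reporting */
    /* recover or report here, rather than deadlock in a later collective */
} else {
    /* Alternatively, make any further I/O error on this file terminate the job. */
    MPI_File_set_errhandler(fh, MPI_ERRORS_ARE_FATAL);
}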
Setting the MP_IO_ERRLOG environment variable to yes turns on error logging.
When an error occurs, a line of information will be logged in
file /tmp/mpi_io_errdump.app_name.userid.taskid, recording the time the error
occurs, the POSIX file system call involved, the file descriptor, and the returned
error number.
An Info object is an opaque object consisting of zero or more (key,value) pairs. Info
objects are the means by which users provide hints to the implementation about
things like the structure of the application or the type of expected file accesses. In
MPI-2, the APIs that use Info objects span MPI-IO, MPI one-sided, and dynamic
tasks. Both key and value are specified as strings, but the value may actually
represent an integer, boolean or other datatype. Some keys are reserved by MPI,
and others may be defined by the implementation. The implementation defined keys
should use a distinct prefix which other implementations would be expected to
avoid. All PE MPI hints begin with IBM_ (see MPI_FILE_OPEN in IBM Parallel
Environment for AIX: MPI Subroutine Reference). The MPI-2 requirement that hints,
valid or not, cannot change the semantics of a program limits the risks from
misunderstood hints.
By default, Info objects in PE MPI accept only PE MPI recognized keys. This allows
a program to identify whether a given key is understood. If the key is not
understood, an attempt to place it in an Info object will be ignored. An attempt to
retrieve the key will find no key/value present. The environment variable
MP_HINTS_FILTERED set to no will cause Info operations to accept arbitrary (key,
value) pairs. You will need to turn off hint filtering if your application, or some
non-MPI library it is using, depends on MPI Info objects to cache and retrieve its
own (key, value) pairs.
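A sketch of placing a hint in an Info object and passing it at open time. The IBM_largeblock_io key is shown only as an assumed example of an IBM_ prefixed hint; see MPI_FILE_OPEN in the Subroutine Reference for the keys PE MPI actually recognizes:

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
/* Assumed example of an IBM_ prefixed hint; with default filtering,
   unrecognized keys are silently ignored. */
MPI_Info_set(info, "IBM_largeblock_io", "true");
MPI_File_open(MPI_COMM_WORLD, "datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);   /* the open keeps its own copy of the hints */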
Setting the MP_IO_BUFFER_SIZE environment variable (for example, to 16M)
sets the default size of the MPI-IO data buffer to 16 MB. The default value of this
environment variable is the number of bytes corresponding to 16 file blocks. This
value depends on the block size associated with the file system storing the file.
Valid values are any positive size up to 128 MB. The size can be expressed as a
number of bytes, as a number of kilobytes (1024 bytes), using k or K as a suffix, or
as a number of megabytes (1024*1024 bytes), using m or M as a suffix. If
necessary, PE MPI rounds the size up, to correspond to an integral number of file
system blocks.
For information about the following topics, see the MPI-2 Standard:
v Consistency and semantics
– File consistency
– Random access versus sequential files
– Progress
– Collective file operations
– Type matching
– Miscellaneous clarifications
– MPI_Offset Type
– Logical versus physical file layout
– File size
– Examples: asynchronous I/O
v I/O error handling
v I/O error classes
v Examples: double buffering with split collective I/O, subarray filetype constructor
In addition, the MPI library is using the Low-level communication API (LAPI)
protocol as a common transport layer. For more information on this and the use of
the LAPI protocol, see IBM Reliable Scalable Cluster Technology for AIX 5L: LAPI
Programming Guide.
An n-task parallel job running in POE consists of: the n user tasks, a number of
instances of the PE partition manager daemon (pmd) that is equal to the number of
nodes, and the POE home node task in which the poe command runs. The pmd is
the parent task of the user's task. There is one pmd for each node. A pmd is
started by the POE home node for each node when the parallel job is launched.
The POE home node routes standard input, standard output, and standard error
streams between the home node and the users' tasks through the pmd daemons, using
TCP/IP sockets for this purpose. The sockets are created when the POE home
node starts the pmd daemon for each task of a parallel job. The POE home node
and pmd also use the sockets to exchange control messages to provide task
synchronization, exit status and signaling. These capabilities do not depend on the
message passing library, and are available to control any parallel program run by
the poe command.
For interactive POE applications, without using LoadLeveler, POE does not copy or
replicate the user resource limits on the remote nodes where the parallel tasks are
to run (the compute nodes). POE uses the user limits as defined by the
/etc/security/limits file. If the user limits on the submitting node (home node) are
different than those on the compute nodes, POE does not change the user limits on
the compute nodes to match those on the submitting node.
Users should ensure that they have sufficient user resource limits on the compute
nodes, when submitting interactive parallel jobs. Users may want to coordinate their
user resource needs with their AIX system administrators to ensure that proper user
limits are in place, such as in the /etc/security/limits file on each node, or by some
other means.
Exit status
The exit status is any value from 0 through 255. This value, which is returned from
POE on the home node, reflects the composite exit status of your parallel
application as follows:
v If MPI_ABORT(comm,nn>0,ierror) or MPI_Abort(comm,nn>0) is called, the exit
status is nn (mod 256).
v If all tasks terminate using exit(MM>=0) or STOP MM>=0 and MM is not equal to
1 and is less than 128 for all nodes, POE provides a synchronization barrier at
the exit. The exit status is the largest value of MM from any task of the parallel
job (mod 256).
v If any task terminates using exit(MM=1) or STOP MM=1, POE will immediately
terminate the parallel job, as if MPI_Abort(MPI_COMM_WORLD,1) had been
called. This may also occur if an error is detected within a FORTRAN library
because a common error response by FORTRAN libraries is to call STOP 1.
v If any task terminates with a signal (for example, a segment violation), the exit
status is the signal plus 128, and the entire job is immediately terminated.
v If POE terminates before the start of the user’s application, the exit status is 1.
v If the user’s application cannot be loaded or fails before the user’s main() is
called, the exit status is 255.
POE links in the routines described in the sections that follow, when your
executable is compiled with any of the POE compilation scripts, such as: mpcc_r,
or mpxlf_r. These topics are discussed:
v “Signal handlers” on page 28.
v “Handling AIX signals” on page 28.
v “Do not hard code file descriptor numbers” on page 29.
v “Termination of a parallel job” on page 29.
v “Do not run your program as root” on page 30.
v “AIX function limitations” on page 30.
v “Shell execution” on page 30.
v “Do not rewind STDIN, STDOUT, or STDERR” on page 30.
v “Do not match blocking and non-blocking collectives” on page 30.
v “Passing string arguments to your program correctly” on page 31.
v “POE argument limits” on page 31.
v “Network tuning considerations” on page 31.
v “Standard I/O requires special attention” on page 32.
v “Reserved environment variables” on page 33.
v “AIX message catalog considerations” on page 33.
v “Language bindings” on page 33.
v “Available virtual memory segments” on page 34.
Signal handlers
POE installs signal handlers for most signals that cause program termination, so
that it can notify the other tasks of termination. POE then causes the program to
exit normally with a code of (signal plus 128). This section includes information
about installing your own signal handler for synchronous signals.
Note: For information about the way POE handles asynchronous signals, see
“Handling AIX signals.”
For synchronous signals, you can install your own signal handlers by using the
sigaction() system call. If you use sigaction(), you can use either the sa_handler
member or the sa_sigaction member in the sigaction structure to define the signal
handling function. If you use the sa_sigaction member, the SA_SIGINFO flag must
be set.
For the following signals, POE installs signal handlers that use the sa_sigaction
format:
v SIGABRT
v SIGBUS
v SIGEMT
v SIGFPE
v SIGILL
v SIGSEGV
v SIGSYS
v SIGTRAP
POE catches these signals, performs some cleanup, installs the default signal
handler (or lightweight core file generation), and re-raises the signal, which will
terminate the task.
Users can install their own signal handlers, but they should save the address of the
POE signal handler, using a call to SIGACTION. If the user program decides to
terminate, it should call the POE signal handler as follows:
saved.sa_flags = SA_SIGINFO;
(*saved.sa_sigaction)(signo, NULL, NULL);
If the user program decides not to terminate, it should just return to the interrupted
code.
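The fragment below sketches this in C: the handler that POE installed for SIGSEGV is saved with sigaction() before the user handler is installed. The names my_handler and install_handler are placeholders:

#include <signal.h>

static struct sigaction saved;   /* POE's handler, saved at setup time */

static void my_handler(int signo, siginfo_t *info, void *context)
{
    /* ... application-specific cleanup (no MPI or unsafe library calls) ... */
    saved.sa_flags = SA_SIGINFO;
    (*saved.sa_sigaction)(signo, NULL, NULL);   /* let POE terminate the task */
}

void install_handler(void)
{
    struct sigaction mine;

    sigaction(SIGSEGV, NULL, &saved);    /* save the handler POE installed */
    mine.sa_sigaction = my_handler;
    mine.sa_flags = SA_SIGINFO;
    sigemptyset(&mine.sa_mask);
    sigaction(SIGSEGV, &mine, NULL);     /* install the user handler */
}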
Note: Do not issue message passing calls, including MPI_ABORT, from signal
handlers. Also, many library calls are not “signal safe”, and should not be
issued from signal handlers. See function sigaction() in the AIX Technical
Reference for a list of functions that signal handlers can call.
These handlers perform cleanup and exit with a code of (signal plus 128). You can
install your own signal handler for any or all of these signals. If you want the
application to exit after you catch the signal, call the function
pm_child_sig_handler(signal,NULL,NULL). The prototype for this function is in
file /usr/lpp/ppe.poe/include/pm_util.h.
SIGALRM
Unlike the now retired signal library, the threads library does not use SIGALRM, and
long system calls are not interrupted by the message passing library. For example,
sleep runs its entire duration unless interrupted by a user-generated event.
SIGIO
Unlike PE 3.2, SIGIO is not used by the MPI library. A user-written signal handler
will not be called when an MPI packet arrives. The user may use SIGIO for other
I/O attention purposes, as required.
SIGPIPE
Some usage environments of the now retired signal library depended on MPI use of
SIGPIPE. There is no longer any use of SIGPIPE by the MPI library.
POE opens several files and uses file descriptors as message passing handles.
These are allocated before the user gets control, so the first file descriptor allocated
to a user is unpredictable.
For normal exits, when POE gets a control message for every task, it responds to
each node, allowing that node to exit normally with its individual exit code. The pmd
daemon monitors the exit code and passes it back to the POE home node for
presentation to the user.
For abnormal exits and those detected by pmd, POE sends a message to each
pmd asking that it send a SIGTERM signal to its tasks, thereby terminating the
task. When the task finally exits, pmd sends its exit code back to the POE home
node and exits itself.
Shell execution
The program executed by POE on the parallel nodes does not run under a shell on
those nodes. Redirection and piping of STDIN, STDOUT, and STDERR applies to
the POE home node (POE binary), and not the user’s code. If shell processing of a
command line is desired on the remote nodes, invoke a shell script on the remote
nodes to provide the desired preprocessing before the user’s application is invoked.
You can have POE run a shell script that is loaded and run on the remote nodes as
if it were a binary file.
Due to an AIX limitation, if the program being run by POE is a shell script and there
are more than five tasks being run per node, the script must be run under ksh93
by using:
#!/bin/ksh93
If the POE home node task is not started under the Korn shell, mounted file system
names may not be mapped correctly to the names defined for the automount
daemon or AIX equivalent. See the IBM Parallel Environment for AIX: Operation
and Use, Volume 1 for a discussion of alternative name mapping techniques.
Without the backslashes, the string would have been treated as two arguments (a
and b).
POE behaves like rsh when arguments are passed to POE. Therefore, this
command:
poe user_program "a b"
is equivalent to:
rsh some_machine user_program "a b"
In order to pass the string argument as one token, the quotation marks have to be
escaped using the backslash.
The POE environment variable MP_SNDBUF can be used to override the default
network settings for the size of the TCP/IP buffers used.
If you have large volumes of standard input or output, work with your network
administrator to establish appropriate TCP/IP tuning parameters. You may also want
to investigate whether using named pipes is appropriate for your application.
Running the poe command (or starting a program compiled with one of the POE
compile scripts) causes POE to perform this sequence of events:
1. The POE binary is loaded on the machine on which you submitted the
command (the POE home node).
2. The POE binary, in turn, starts a partition manager daemon (pmd) on each
parallel node assigned to run the job, and tells that pmd to run one or more
copies of your executable (using fork and exec).
3. The POE binary reads STDIN and passes it to each pmd with a TCP/IP socket
connection.
4. The pmd on each node pipes STDIN to the parallel tasks on that node.
5. STDOUT and STDERR from the tasks are piped to the pmd daemon.
6. This output is sent by the pmd on the TCP/IP socket back to the home node
POE.
7. This output is written to the POE binary’s STDOUT and STDERR descriptors.
Note
Earlier versions of Parallel Environment required the use of the
MP_HOLD_STDIN environment variable in certain cases when redirected
STDIN was used. The Parallel Environment components have now been
modified to control the STDIN flow internally, so the use of this environment
variable is no longer required, and will have no effect on STDIN handling.
The script compute_home runs on the home node; the script compute_parallel
runs on the parallel nodes (those running tasks 0 through n-1).
compute_home:
#! /bin/ksh93
# Example script compute_home runs three tasks:
# data_generator creates/gets data and writes to stdout
# data_processor is a parallel program that reads data
# from stdin, processes it in parallel, and writes
# the results to stdout.
# data_consumer reads data from stdin and summarizes it
#
mkfifo poe_in_$$
mkfifo poe_out_$$
export MP_STDOUTMODE=0
export MP_STDINMODE=0
data_generator >poe_in_$$ |
If the value of MP_INFOLEVEL is greater than or equal to 1, POE will display any
MP_ environment variables that it does not recognize, but POE will continue
working normally.
Language bindings
The FORTRAN, C, and C++ bindings for MPI are contained in the same library and
can be freely intermixed. The library is named libmpi_r.a. Because it contains both
32-bit and 64-bit objects, and the compiler and linker select between them,
libmpi_r.a can be used for both 32-bit and 64-bit applications.
The AIX compilers support the flag -qarch. This option allows you to target code
generation for your application to a particular processor architecture. While this
option can provide performance enhancements on specific platforms, it inhibits
portability. The MPI library is not targeted to a specific architecture, and is not
affected by the flag -qarch on your compilation.
The MPI standard includes several routines that take choice arguments. For
example MPI_SEND may be passed a buffer of REAL on one call, and a buffer of
INTEGER on the next. The -qextcheck compiler option flags this as an error. In
F77, choice arguments are a violation of the FORTRAN standard that few compilers
would complain about. In F90, choice arguments can be interpreted by the compiler
as an attempt to use function overloading. MPI FORTRAN functions do not require
genuine overloading support to give correct results and PE MPI does not define
overloaded functions for all potential choice arguments. Because -qextcheck
considers use of choice arguments to be erroneous overloads even though the
code is correct MPI, the -qextcheck option should not be used.
Table 2 shows how the clock source is determined. PE MPI guarantees that the
MPI attribute MPI_WTIME_IS_GLOBAL has the same value at every task, and all
tasks use the same clock source (AIX or switch).
Table 2. How the clock source is determined

MP_CLOCK_SOURCE   Library version   Are all nodes on      Source used   MPI_WTIME_IS_GLOBAL
                                    the same switch?
AIX               ip                yes                   AIX           false
AIX               ip                no                    AIX           false
AIX               us                yes                   AIX           false
AIX               us                no                    Error         false
SWITCH            ip                yes*                  switch        true
SWITCH            ip                no                    AIX           false
SWITCH            us                yes                   switch        true
SWITCH            us                no                    Error
not set           ip                yes                   switch        false
not set           ip                no                    AIX           false
not set           us                yes                   switch        true
not set           us                no                    Error
Note: * If MPI_WTIME_IS_GLOBAL value is to be trusted, the user is responsible for
making sure all of the nodes are connected to the same switch. If the job is in IP mode and
MP_CLOCK_SOURCE is left to default, MPI_WTIME_IS_GLOBAL will report false even if
the switch is used because MPI cannot know it is the same switch.
In this table, ip refers to IP protocol, us refers to User Space protocol.
For limitations on the number of tasks, tasks per node, and other restrictions, see
Chapter 10, “MPI size limits,” on page 65.
Threaded programming
When programming in a threads environment, specific skills and considerations are
required. The information in this subsection provides you with specific programming
considerations when using POE and the MPI library. This section assumes that you
are familiar with POSIX threads in general, including multiple execution threads,
thread condition waiting, thread-specific storage, thread creation and thread
termination. These topics are discussed:
v “Running single threaded applications” on page 36.
v “POE gets control first and handles task initialization” on page 36.
v “Limitations in setting the thread stack size” on page 36.
v “Forks are limited” on page 37.
v “Thread-safe libraries” on page 37.
v “Program and thread termination” on page 37.
v “Order requirement for system includes” on page 37.
v “Using MPI_INIT or MPI_INIT_THREAD” on page 37.
v “Collective communication calls” on page 38.
Applications that do not intend to use threads can continue to run as single
threaded programs, despite the fact they are now compiled as threaded programs.
However there are some side issues application developers should be aware of.
Any application that was compiled with the signal library compiler scripts prior to PE
Version 4 and not using MPE_I non-blocking collectives, is in this class.
Do not set MP_SINGLE_THREAD to yes unless you are certain that the
application is single threaded. Setting MP_SINGLE_THREAD to yes, and then
creating additional user threads will give unpredictable results. Calling
MPI_FILE_OPEN, MPI_WIN_CREATE or any MPE_I nonblocking collective in an
application running with MP_SINGLE_THREAD set to yes will cause PE MPI to
terminate the job.
Also, applications that register signal handlers may need to be aware that the
execution is in a threaded environment.
If you write your own MPI reduction functions to use with nonblocking collective
communications, these functions may run on a service thread. If your reduction
functions require significant amounts of stack space, you can use the
MP_THREAD_STACKSIZE environment variable to cause larger stacks to be
created for service threads. This does not affect the default stack size for any
threads you create.
Note: A forked child must not call the message passing library (MPI library).
Thread-safe libraries
Most AIX libraries are thread-safe, such as libc.a. However, not all libraries have a
thread-safe version. It is your responsibility to determine whether the AIX libraries
you use can be safely called by more than one thread.
MPI calls on other threads must adhere to the MPI standard in regard to the
following:
v A thread cannot make MPI calls until MPI_INIT has been called.
v A thread cannot make MPI calls after MPI_FINALIZE has been called.
v Unless there is a specific thread synchronization protocol provided by the
application itself, you cannot rely on any specific order or speed of thread
processing.
The MPI_INIT_THREAD call allows the user to request a level of thread support
ranging from MPI_THREAD_SINGLE to MPI_THREAD_MULTIPLE. PE MPI ignores
the request argument. If MP_SINGLE_THREAD is set to yes, MPI runs in a mode
equivalent to MPI_THREAD_FUNNELED. If MP_SINGLE_THREAD is set to no, or
allowed to default, PE MPI runs in MPI_THREAD_MULTIPLE mode.
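For example, in C (PE MPI ignores the request argument, as noted above, but portable code should still examine provided):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* funnel all MPI calls through a single thread instead */
    }
    /* ... */
    MPI_Finalize();
    return 0;
}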
The service threads created by MPI, POE, and LAPI have system contention scope,
that is, they are mapped 1:1 to kernel threads.
Any user-created thread that began with process contention scope, will be
converted to system contention scope when it makes its first MPI call. Threads that
must remain in process contention scope should not make MPI calls.
Program restrictions
Any program that meets both these criteria:
v is compiled with one of the threaded compile scripts provided by PE
v may be checkpointed prior to its main() function being invoked
must wait for the 0031-114 message to appear in POE’s STDERR before issuing
the checkpoint of the parallel job. Otherwise, a subsequent restart of the job may
fail.
Node restrictions
The node on which a process is restarted must have:
v The same operating system level (including PTFs). In addition, a restarted
process may not load a module that requires a system call from a kernel
extension that was not present at checkpoint time.
v The same switch type as the node where the checkpoint occurred.
v The capabilities enabled in /etc/security/user that were enabled for that user on
the node on which the checkpoint operation was performed.
Task-related restrictions
v The number of tasks and the task geometry (the tasks that are common within a
node) must be the same on a restart as it was when the job was checkpointed.
v Any regular file open in a parallel task when that task is checkpointed must be
present on the node where that task is restarted, including the executable and
any dynamically loaded libraries or objects.
v If any task within a parallel application uses sockets or pipes, user callbacks
should be registered to save data that may be in transit when a checkpoint
occurs, and to restore the data when the task is resumed after a checkpoint or
restart. Similarly, any user shared memory should be saved and restored.
Other restrictions
v Processes cannot be profiled at the time a checkpoint is taken.
v There can be no devices other than TTYs or /dev/null open at the time a
checkpoint is taken.
v Open files must either have an absolute pathname that is less than or equal to
PATHMAX in length, or must have a relative pathname that is less than or equal
to PATHMAX in length from the current directory at the time they were opened.
The current directory must have an absolute pathname that is less than or equal
to PATHMAX in length.
v Semaphores or message queues that are used within the set of processes being
checkpointed must only be used by processes within the set of processes being
checkpointed.
This condition is not verified when a set of processes is checkpointed. The
checkpoint and restart operations will succeed, but inconsistent results can occur
after the restart.
Integers passed to the MPI library are always 32 bits long. If you use the
FORTRAN compiler directive -qintsize=8 as your default integer length, you will
need to type your MPI integer arguments as INTEGER*4. All integer parameters in
mpif.h are explicitly declared INTEGER*4 to prevent -qintsize=8 from altering their
length.
As defined by the MPI standard, the count argument in MPI send and receive calls
is a default size signed integer. In AIX, even 64-bit executables use 32-bit integers
by default. To send or receive extremely large messages, you may need to
construct your own datatype (for example, a ’page’ datatype of 4096 contiguous
bytes).
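A sketch of that workaround in C; buf, num_pages, dest, and tag are placeholders supplied by the caller:

#include <mpi.h>

void send_large(void *buf, int num_pages, int dest, int tag)
{
    MPI_Datatype page;

    /* 4096 contiguous bytes per unit, so count measures pages, not bytes. */
    MPI_Type_contiguous(4096, MPI_BYTE, &page);
    MPI_Type_commit(&page);
    MPI_Send(buf, num_pages, page, dest, tag, MPI_COMM_WORLD);
    MPI_Type_free(&page);
}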
The FORTRAN compilation scripts mpxlf_r, mpxlf90_r, and mpxlf95_r set the
include path for mpif.h to: /usr/lpp/ppe.poe/include/thread64 or
/usr/lpp/ppe.poe/include/thread, as appropriate. Do not add a separate include
path to mpif.h in your compiler scripts or make files, as an incorrect version of
mpif.h could be picked up in compilation, resulting in subtle run time errors.
The AIX 64-bit address space is large enough to remove any limitations on the
number of memory segments that can be used, so the information in “Available
virtual memory segments” on page 34 does not apply to the 64-bit library.
OpenMP and MPI in a single application offers relative safety because the OpenMP
model normally involves distinct parallel sections in which several threads are
spawned at the beginning of the section and joined at the end. The communication
calls occur on the main thread and outside of any parallel section, so they do not
require mutex protection. This segregation of threaded epochs from communication
epochs is safe and simple, whether you use OpenMP or provide your own threads
parallelism.
The threads parallelism model in which some number of threads proceed in a more
or less independent way, but protect critical sections (periods of protected access to
a shared data object) with locks requires more care. In this model, there is much
more chance you will hold a lock while doing a blocking MPI operation related to
some shared data object.
If both MPI and LAPI use the same protocol (either User Space or IP), you can
choose to have them share the underlying packet protocol (User Space or UDP).
You do this by setting the POE environment variable MP_MSG_API to mpi_lapi. If
you do not wish to share the underlying packet protocol, set MP_MSG_API to
mpi,lapi.
In User Space, running with shared resource MP_MSG_API set to mpi_lapi causes
LoadLeveler to allocate only one window for the MPI/LAPI pair, rather than two
windows. Since each window takes program resources (segment registers, memory
for DMA send and receive FIFOs, adapter buffers and network tables), sharing the
window makes sense if MPI and LAPI are communicating at different times (during
different phases of the program). If MPI and LAPI are doing concurrent
communication, separate windows are likely to perform better.
In shared mode, MPI_INIT sets interrupt behavior of its LAPI instance, just as in
non-shared mode, but MPI has no way to recognize or control changes to the
interrupt mode of this shared instance that may occur later through the LAPI_Senv()
function. Unexpected changes in interrupt mode made with the LAPI API to the
LAPI instance being shared with MPI can affect MPI performance, but will not affect
whether a valid MPI program runs correctly.
In IP, running with shared resource MP_MSG_API set to mpi_lapi uses only one
pair of UDP ports, while running with separated resource MP_MSG_API set to
mpi,lapi uses two pairs of UDP ports. In the separated case, there may be a slight
increase in job startup time due to the need for POE to communicate two sets of
port lists.
The following variables are new. A brief description of their intended function is
provided. For more details, see Chapter 11, “POE environment variables and
command-line flags,” on page 69.
MP_UDP_PACKET_SIZE
Specifies the UDP datagram size to be used for UDP/IP message transport.
Other differences
v Handling shared memory. See Chapter 3, “Using shared memory,” on page 15.
v The MPI communication subsystem is activated at MPI_INIT and closed at
MPI_FINALIZE. When MPI and LAPI share the subsystem, whichever call comes
first between MPI_INIT and LAPI_INIT will provide the activation. Whichever call
comes last between MPI_FINALIZE and LAPI_TERM will close it.
v Additional service threads. See “POE-supplied threads.”
POE-supplied threads
Your parallel program is normally run under the control of POE. The communication
stack includes MPI, LAPI, and the hardware interface layer. The communication
stack also provides access to the global switch clock. This stack makes use of
several internally spawned threads. The options under which the job is run affect
which threads are created; therefore some, but not all, of the threads listed below
are created in a typical application run. Most of these threads sleep in the kernel
waiting for notification of some rare condition and do not compete for CPU access
during normal job processing. When a job is run in polling mode, there will normally
be little CPU demand by threads other than the users’ application threads.
This information is provided to help you understand what you will see in a debugger
when examining an MPI task. You can almost always ignore the service threads in
your debugging, but you may need to find your own thread before you can examine it.
class Exception {
public:
Exception(int error_code);
virtual ~Exception(){ }
int Get_error_code() const;
int Get_error_class() const;
const char* Get_error_string() const;
protected:
int error_code;
char error_string[MPI_MAX_ERROR_STRING];
int error_class;
};
Predefined operations
Table 9 lists the predefined operations for use with MPI_ALLREDUCE,
MPI_REDUCE, MPI_REDUCE_SCATTER and MPI_SCAN. To invoke a predefined
operation, place any of the following reductions in op.
Table 9. Predefined reduction operations
Operation Description
MPI_BAND bitwise AND
MPI_BOR bitwise OR
MPI_BXOR bitwise XOR
MPI_LAND logical AND
MPI_LOR logical OR
MPI_LXOR logical XOR
MPI_MAX maximum value
MPI_MAXLOC maximum value and location
MPI_MIN minimum value
MPI_MINLOC minimum value and location
MPI_PROD product
MPI_REPLACE f(a,b) = b (the current value in the target memory is
replaced by the value supplied by the origin)
MPI_SUM sum
Examples
Examples of user-defined reduction functions for integer vector addition follow.
C example
void int_sum (int *in, int *inout,
int *len, MPI_Datatype *type)
{
int i;
for (i=0; i<*len; i++) {
inout[i] += in[i];
}
}
FORTRAN example
SUBROUTINE INT_SUM(IN,INOUT,LEN,TYPE)
INTEGER IN(*),INOUT(*),LEN,TYPE,I
DO I = 1,LEN
INOUT(I) = IN(I) + INOUT(I)
ENDDO
END
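To use a function like int_sum with the collectives listed in Table 9, register it first with MPI_Op_create. A brief hedged sketch in C (the buffer names and the commutativity choice are illustrative, not from the reference):
#include <mpi.h>

void int_sum(int *in, int *inout, int *len, MPI_Datatype *type); /* as above */

void reduce_with_int_sum(int *my_ints, int *sums, int n)
{
MPI_Op int_sum_op;
/* Register int_sum as a commutative (second argument = 1)
user-defined operation, then reduce to task 0. */
MPI_Op_create((MPI_User_function *)int_sum, 1, &int_sum_op);
MPI_Reduce(my_ints, sums, n, MPI_INT, int_sum_op, 0, MPI_COMM_WORLD);
MPI_Op_free(&int_sum_op);
}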
Error classes
MPI::SUCCESS
MPI::ERR_BUFFER
MPI::ERR_COUNT
MPI::ERR_TYPE
MPI::ERR_TAG
MPI::ERR_COMM
MPI::ERR_RANK
MPI::ERR_REQUEST
MPI::ERR_ROOT
MPI::ERR_GROUP
MPI::ERR_OP
MPI::ERR_TOPOLOGY
MPI::ERR_DIMS
MPI::ERR_ARG
MPI::ERR_UNKNOWN
MPI::ERR_TRUNCATE
MPI::ERR_OTHER
MPI::ERR_INTERN
Maximum sizes
MPI::MAX_ERROR_STRING
MPI::MAX_PROCESSOR_NAME
MPI::MAX_FILE_NAME
MPI::MAX_DATAREP_STRING
MPI::MAX_INFO_KEY
MPI::MAX_INFO_VAL
MPI::MAX_OBJECT_NAME
Topologies
MPI::GRAPH
MPI::CART
MPI-IO constants
MPI::MODE_RDONLY
MPI::MODE_WRONLY
MPI::MODE_RDWR
MPI::MODE_CREATE
MPI::MODE_APPEND
MPI::MODE_EXCL
MPI::MODE_DELETE_ON_CLOSE
MPI::MODE_UNIQUE_OPEN
MPI::MODE_SEQUENTIAL
MPI::MODE_NOCHECK
MPI::MODE_NOSTORE
MPI::MODE_NOPUT
MPI::MODE_NOPRECEDE
MPI::MODE_NOSUCCEED
Assorted constants
MPI::BSEND_OVERHEAD
MPI::PROC_NULL
MPI::ANY_SOURCE
MPI::ANY_TAG
MPI::UNDEFINED
MPI::KEYVAL_INVALID
MPI::BOTTOM
Collective constants
MPI::ROOT
MPI::IN_PLACE
Collective operations
MPI::MAX
MPI::MIN
MPI::SUM
MPI::PROD
MPI::MAXLOC
MPI::MINLOC
MPI::BAND
MPI::BOR
MPI::BXOR
MPI::LAND
MPI::LOR
MPI::LXOR
MPI::REPLACE
Null handles
MPI::GROUP_NULL
MPI::COMM_NULL
MPI::DATATYPE_NULL
MPI::REQUEST_NULL
MPI::OP_NULL
MPI::ERRHANDLER_NULL
MPI::INFO_NULL
MPI::WIN_NULL
Empty group
MPI::GROUP_EMPTY
System limits
The following list includes system limits on the size of various MPI elements and
the relevant environment variable or tunable parameter. The MPI standard identifies
several values that have limits in any MPI implementation. For these values, the
standard indicates a named constant to express the limit. See mpi.h for these
constants and their values. The limits described below are specific to PE and are
not part of standard MPI.
v Number of tasks: MP_PROCS
v Maximum number of tasks: 8192
v Maximum buffer size for any MPI communication (for 32-bit applications only): 2 GB
v Default early arrival buffer size (MP_BUFFER_MEM):
When using Internet Protocol (IP): 2800000 bytes
When using User Space: 64 MB
v Minimum pre-allocated early arrival buffer size: (50 * eager_limit) bytes
v Maximum pre-allocated early arrival buffer size: 256 MB
v Minimum message envelope buffer size: 1 MB
v Default eager limit (MP_EAGER_LIMIT): See Table 12 on page 66. Note that the
default values shown in Table 12 on page 66 are initial estimates that are used
by the MPI library. Depending on the value of MP_BUFFER_MEM and the job
type, these values will be adjusted to guarantee a safe eager send protocol.
v Maximum eager limit: 256 KB
v MPI uses the MP_BUFFER_MEM and the MP_EAGER_LIMIT values that are
selected for a job to determine how many complete messages, each with a size
that is equal to or less than the eager_limit, can be sent eagerly from every task
of the job to a single task, without causing the single target to run out of buffer
space. This is done by allocating to each sending task a number of message
credits for each target. The sending task will consume one message credit for
each eager send to a particular target. It will get that credit back after the
message has been matched at that target.
The sending task can continue to send eager messages to a particular target as
long as it still has message credits for that target. The following equation is used
to calculate the number of credits to be allocated:
MP_BUFFER_MEM / (MP_PROCS * MAX(MP_EAGER_LIMIT, 64))
MPI uses this equation to ensure that there are at least two credits for each
target. If needed, MPI reduces the initially selected value of MP_EAGER_LIMIT,
or increases the initially selected value of MP_BUFFER_MEM, in order to
achieve this minimum threshold of two credits for each target.
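For example (illustrative numbers, not from the reference): with MP_BUFFER_MEM set
to 64 MB and MP_EAGER_LIMIT set to 4096 on a 256-task job, each sending task
receives 67108864 / (256 * 4096) = 64 message credits for each target.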
If the user has specified an initial value for MP_BUFFER_MEM or
MP_EAGER_LIMIT, and MPI has changed either one or both of these values, an
informational message is issued. If the user has specified MP_BUFFER_MEM
using the two values format, then the maximum value specified by the second
value (M2) is honored as the upper bound.
For a system with a pSeries HPS switch and adapter, the Task per Node Limit is 64
tasks per adapter per network. For a system with two adapters per network, the
task per node limit is 128, or 64 * 2. This enables the running of a 128 task per
node MPI job over User Space. This may be useful on 64 CPU nodes with the
Simultaneous Multi-Threading (SMT) technology available on IBM System p5
servers and AIX 5.3 enabled. The LoadLeveler configuration also helps determine
how many tasks can be run on a node. To run 128 tasks per node, LoadLeveler
must be configured with 128 starters per node. In theory, you can configure more
than two adapters per network and run more than 128 tasks per node. However,
this means running more than one task per CPU, and results in reduced
performance. Also, the lower layer of the protocol stack has a 128 tasks per node
limit for enabling shared memory. The protocol stack does not use shared memory
when there are more than 128 tasks per node.
For running an MPI job over IP, the task per node limit is not affected by the
number of adapters; the task per node limit is determined only by the number of
LoadLeveler starters configured per node. The 128 task per node limit for enabling
shared memory usage also applies to MPI/IP jobs.
Although the PCI adapters support the stated limits for tasks per node, maximum
aggregate bandwidth through the adapter is achieved with a smaller task per node
count, if all tasks are simultaneously involved in message passing. Thus, if
individual MPI tasks can do SMP parallel computations on multiple CPUs (using
OpenMP or threads), performance may be better than if all MPI tasks compete for
adapter resources.
The user may also want to consider using MPI IP. On SP Switch2 PCI systems with
many MPI tasks sharing adapters, MPI IP may perform better than MPI User
Space.
You can use the POE command-line flags on the poe and pdbx commands. You
can also use the following flags on program names when individually loading nodes
from STDIN or a POE commands file.
v -infolevel or -ilevel
v -euidevelop
In the tables that follow, a check mark (✓) denotes those flags you can use when
individually loading nodes.
MP_SAVE_LLFILE
-save_llfile
Set: When using LoadLeveler for node allocation, the name of the output
LoadLeveler job command file to be generated by the Partition Manager. The
output LoadLeveler job command file will show the LoadLeveler settings that result
from the POE environment variables and/or command-line options for the current
invocation of POE. If you use the MP_SAVE_LLFILE environment variable for a
batch job, or when the MP_LLFILE environment variable is set (indicating that a
LoadLeveler job command file should participate in node allocation), POE will
show a warning and will not save the output job command file.
Possible values: Any relative or full path name.
Default: None
Table 18. POE environment variables and command-line flags for Message Passing Interface (MPI)

MP_ACK_THRESH
-ack_thresh
Set: Allows the user to control the packet acknowledgement threshold. Specify a
positive integer.
Possible values: A positive integer limited to 31.
Default: 30

MP_BUFFER_MEM
-buffer_mem
Set: See “MP_BUFFER_MEM details” on page 82.
Default: 64 MB (User Space); 2800000 bytes (IP)
MP_BUFFER_MEM details
Set:
To control the amount of memory PE MPI allows for the buffering of early arrival
message data. Message data that is sent without knowing if the receive is posted is
said to be sent eagerly. If the message data arrives before the receive is posted,
this is called an early arrival and must be buffered at the receive side.
Possible values:
nnnnn (bytes)
nnnK (where: K = 1024 bytes)
nnM (where: M = 1024*1024 bytes)
nnG (where: G = 1 billion bytes)
M1
M1,M2
,M2 (a comma followed by the M2 value)
M2 specifies the upper bound of memory that PE MPI will allow to be used for early
arrival buffering in the most extreme case of sends without waiting receives. PE
MPI will throttle senders back to rendezvous protocol (stop trying to use eager
send) before allowing the early arrivals at a receive side to overflow the upper
bound.
There is no limit enforced on the value you can specify for M2, but be aware that a
program that does not behave as expected has the potential to malloc this much
memory, and terminate if it is not available.
The format that omits M1 is used to tell PE MPI to use its default size pre-allocated
pool, but set the upper bound as specified with M2. This removes the need for a
user to remember the default M1 value when the intention is to only change the M2
value.
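For example (an illustrative setting), MP_BUFFER_MEM=,128M keeps the default
pre-allocated pool while raising the M2 upper bound to 128 MB.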
It is expected that only jobs with hundreds of tasks will have any need to set M2. For
most of these jobs, there will be an M1,M2 setting that eliminates the need for PE
MPI to throttle eager sends, while allowing all early arrivals that the application
actually creates to be buffered within the pre-allocated pool.
Table 19. POE environment variables/command-line flags for corefile generation

MP_COREDIR
-coredir
Set: Creates a separate directory for each task’s core file.
Possible values: Any valid directory name, or ″none″ to bypass creating a new
directory.
Default: coredir.taskid

MP_COREFILE_FORMAT
-corefile_format
Set: The format of corefiles generated when processes terminate abnormally.
Possible values: The string ″STDERR″ (to specify that the lightweight corefile
information should be written to standard error) or any other string (to specify the
lightweight corefile name).
Default: If not set/specified, standard AIX corefiles will be generated.

MP_COREFILE_SIGTERM
-corefile_sigterm
Set: Determines if POE should generate a core file when a SIGTERM signal is
received. If not set, the default is no.
Possible values: yes, no
Default: no
mpc_isatty
Purpose
Determines whether a device is a terminal on the home node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_isatty(int FileDescriptor);
Description
This parallel utility subroutine determines whether the file descriptor specified by the
FileDescriptor parameter is associated with a terminal device on the home node. In
a parallel operating environment partition, the STDIN, STDOUT, and STDERR file
descriptors are implemented as pipes to the partition manager daemon. Therefore, the AIX isatty()
subroutine will always return false for each of them. This subroutine is provided for
use by remote tasks that may want to know whether one of these devices is
actually a terminal on the home node, for example, to determine whether or not to
output a prompt.
Parameters
FileDescriptor
is the file descriptor number of the device. Valid values are:
0 or STDIN
Specifies STDIN as the device to be checked.
1 or STDOUT
Specifies STDOUT as the device to be checked.
2 or STDERR
Specifies STDERR as the device to be checked.
Notes
This subroutine has a C version only. Also, it is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates that the device is not associated with a terminal on the home
node.
1 Indicates that the device is associated with a terminal on the home node.
-1 Indicates an invalid FileDescriptor parameter.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without redirecting STDIN, produces the following output:
*
* isatty() reports STDIN as a non-terminal device
* mpc_isatty() reports STDIN as a terminal device
*/
#include "pm_util.h"
main()
{
if (isatty(STDIN)) {
printf("isatty() reports STDIN as a terminal device\n");
} else {
printf("isatty() reports STDIN as a non-terminal device\n");
if (mpc_isatty(STDIN)) {
printf("mpc_isatty() reports STDIN as a terminal device\n");
} else {
printf("mpc_isatty() reports STDIN as a non-terminal device\n");
}
}
}
MP_BANDWIDTH, mpc_bandwidth
Purpose
Obtains user space switch bandwidth statistics.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
#include <lapi.h>
int mpc_bandwidth(lapi_handle_t hndl, int flag, bw_stat_t *bw);
FORTRAN synopsis
MP_BANDWIDTH(INTEGER HNDL, INTEGER FLAG, INTEGER*8 BW_SENT, INTEGER*8 BW_RECV,
INTEGER*8 BW_TIME_SEC, INTEGER*4 BW_TIME_USEC, INTEGER RC)
Description
This parallel utility subroutine is a wrapper API program that users can call to obtain
the user space switch bandwidth statistics. LAPI’s Query interface is used to obtain
byte counts of the data sent and received. This routine returns the byte counts and
time values to allow the bandwidth to be calculated.
For C and C++ language programs, this routine uses a structure that contains the
data count fields, as well as time values in both seconds and microseconds. These
are filled in at the time of the call, from the data obtained by the LAPI Query
interface and a ″get time of day″ call.
This routine requires a valid LAPI handle for LAPI programs. For MPI programs, the
handle is not required. A flag parameter is required to indicate whether the call has
been made from an MPI or LAPI program.
If the program is a LAPI program, the flag MP_BW_LAPI must be set and the
handle value must be specified. If the program is an MPI program, the flag
MP_BW_MPI must be set, and any handle specified is ignored.
In the case where a program uses both MPI and LAPI in the same program, where
MP_MSG_API is set to either mpi,lapi or mpi_lapi, separate sets of statistics are
maintained for the MPI and LAPI portions of the program. To obtain the MPI
bandwidth statistics, this routine must be called with the MP_BW_MPI flag, and any
handle specified is ignored. To obtain the LAPI bandwidth statistics, this routine
must be called with the MP_BW_LAPI flag and a valid LAPI handle value.
Parameters
In C, bw is a pointer to a bw_stat_t structure. This structure is defined as:
typedef struct{
unsigned long long switch_sent;
unsigned long long switch_recv;
int64_t time_sec;
int32_t time_usec;
} bw_stat_t;
where:
switch_sent is the number of bytes sent.
switch_recv is the number of bytes received.
time_sec is the time value in seconds.
time_usec is the time value in microseconds.
In FORTRAN:
BW_SENT is a 64-bit integer value of the number of bytes sent.
BW_RECV is a 64-bit integer value of the number of bytes received.
BW_TIME_SEC
is a 64-bit integer time value of time in seconds.
BW_TIME_USEC
is a 32-bit integer time value of time in microseconds.
Bw_data is a pointer to the bandwidth data structure, which will include the timestamp
and bandwidth data counts of sends and receives as requested. The bandwidth data
structure may be declared and passed locally by the calling program.
Hndl is a valid LAPI handle filled in by a LAPI_Init() call for LAPI programs. For MPI
programs, this is ignored.
RC in FORTRAN, will contain an integer value returned by this function. This should
always be the last parameter.
Notes
1. The send and receive data counts are for bandwidth data at the software level
of current tasks running, and not what the adapter is capable of.
2. Intranode communication using shared memory will specifically not be
measured with this API. Likewise, this API does not return values of the
bandwidth of local data sent to itself.
3. In the case with striping over multiple adapters, the data counts are an
aggregate of the data exchanged at the application level, and not on a
per-adapter basis.
Return values
0 Indicates successful completion.
-1 Incorrect flag (not MP_BW_MPI or MP_BW_LAPI).
greater than 0
See the list of LAPI error codes in IBM RSCT: LAPI Programming Guide.
Examples
C Examples
1. To determine the bandwidth in an MPI program:
#include <mpi.h>
#include <time.h>
#include <lapi.h>
#include <pm_util.h>
int rc;
main(int argc, char *argv[])
{
bw_stat_t bw_in;
MPI_Init(&argc, &argv);
.
.
.
/* start collecting bandwidth .. */
rc = mpc_bandwidth(NULL, MP_BW_MPI, &bw_in);
.
.
.
printf("Return from mpc_bandwidth ...rc = %d.\n",rc);
printf("Bandwidth of data sent: %lld.\n",
bw_in.switch_sent);
printf("Bandwidth of data recv: %lld.\n",
bw_in.switch_recv);
printf("time(seconds): %lld.\n",bw_in.time_sec);
printf("time(mseconds): %d.\n",bw_in->time_usec);
.
.
.
MPI_Finalize();
exit(rc);
}
2. To determine the bandwidth in a LAPI program:
#include <lapi.h>
#include <time.h>
#include <pm_util.h>
int rc;
main(int argc, char *argv[])
{
lapi_handle_t hndl;
lapi_info_t info;
bw_stat_t work;
bw_stat_t bw_in;
bzero(&info, sizeof(lapi_info_t));
rc = LAPI_Init(&hndl, &info);
.
.
.
rc = mpc_bandwidth(hndl, MP_BW_LAPI, &bw_in);
.
.
.
printf("Return from mpc_bandwidth ...rc = %d.\n",rc);
printf("Bandwidth of data sent: %lld.\n",
bw_in.switch_sent);
printf("Bandwidth of data recv: %lld.\n",
bw_in.switch_recv);
printf("time(seconds): %lld.\n", bw_in.time_sec);
printf("time(mseconds): %d.\n",bw_in.time_usec);
.
.
.
LAPI_Term(hndl);
exit(rc);
}
FORTRAN Examples
1. To determine the bandwidth in an MPI program:
program bw_mpi
include "mpif.h"
include "lapif.h"
integer retcode
integer taskid
integer numtask
integer hndl
integer*8 bw_secs
integer*4 bw_usecs
integer*8 bw_sent_data
integer*8 bw_recv_data
.
.
.
call mpi_init(retcode)
call mpi_comm_rank(mpi_comm_world, taskid, retcode)
write (6,*) ’Taskid is ’,taskid
.
.
.
call mp_bandwidth(hndl,MP_BW_MPI, bw_sent_data, bw_recv_data, bw_secs,
bw_usecs,retcode)
write (6,*) ’MP_BANDWIDTH returned. Time (sec) is ’,bw_secs
write (6,*) ’ Time (usec) is ’,bw_usecs
write (6,*) ’ Data sent (bytes): ’,bw_sent_data
write (6,*) ’ Data received (bytes): ’,bw_recv_data
write (6,*) ’ Return code: ’,retcode
.
.
.
call mpi_barrier(mpi_comm_world,retcode)
call mpi_finalize(retcode)
2. To determine the bandwidth in a LAPI program:
program bw_lapi
include "mpif.h"
include "lapif.h"
TYPE (LAPI_INFO_T) :: lapi_info
integer retcode
integer taskid
integer numtask
integer hndl
integer*8 bw_secs
integer*4 bw_usecs
integer*8 bw_sent_data
integer*8 bw_recv_data
.
.
.
call lapi_init(hndl, lapi_info, retcode)
.
.
.
call mp_bandwidth(hndl,MP_BW_LAPI, bw_sent_data, bw_recv_data, bw_secs,
bw_usecs,retcode)
write (6,*) ’MP_BANDWIDTH returned. Time (sec) is ’,bw_secs
write (6,*) ’ Time (usec) is ’,bw_usecs
write (6,*) ’ Data sent (bytes): ’,bw_sent_data
write (6,*) ’ Data received (bytes): ’,bw_recv_data
write (6,*) ’ Return code: ’,retcode
.
.
.
call lapi_term(hndl,retcode)
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
v mpxlf90_r
v mpxlf95_r
Subroutines:
v MP_STATISTICS_WRITE, mpc_statistics_write
v MP_STATISTICS_ZERO, mpc_statistics_zero
MP_DISABLEINTR, mpc_disableintr
Purpose
Disables message arrival interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_disableintr();
FORTRAN synopsis
MP_DISABLEINTR(INTEGER RC)
Description
This parallel utility subroutine disables message arrival interrupts on the individual
node on which it is run. Use this subroutine to dynamically control masking
interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
v This subroutine is only effective when the communication subsystem is active.
This is from MPI_INIT to MPI_FINALIZE. If this subroutine is called when the
subsystem is inactive, the call will have no effect and the return code will be -1.
v This subroutine overrides the setting of the environment variable
MP_CSS_INTERRUPT.
v Inappropriate use of the interrupt control subroutines may reduce performance.
v This subroutine can be used for IP and User Space protocols.
v This subroutine is thread-safe.
v Using this subroutine will suppress the MPI-directed switching of interrupt mode,
leaving the user in control for the rest of the run. See MPI_FILE_OPEN and
MPI_WIN_CREATE in IBM Parallel Environment for AIX: MPI Subroutine
Reference.
Return values
0 Indicates successful completion.
-1 Indicates that the MPI library was not active. The call was either made
before MPI_INIT or after MPI_FINALIZE.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running the following program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_ENABLEINTR, mpc_enableintr
v MP_QUERYINTR, mpc_queryintr
MP_ENABLEINTR, mpc_enableintr
Purpose
Enables message arrival interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_enableintr();
FORTRAN synopsis
MP_ENABLEINTR(INTEGER RC)
Description
This parallel utility subroutine enables message arrival interrupts on the individual
node on which it is run. Use this subroutine to dynamically control masking
interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
v This subroutine is only effective when the communication subsystem is active.
This is from MPI_INIT to MPI_FINALIZE. If this subroutine is called when the
subsystem is inactive, the call will have no effect and the return code will be -1.
v This subroutine overrides the setting of the environment variable
MP_CSS_INTERRUPT.
v Inappropriate use of the interrupt control subroutines may reduce performance.
v This subroutine can be used for IP and User Space protocols.
v This subroutine is thread safe.
v Using this subroutine will suppress the MPI-directed switching of interrupt mode,
leaving the user in control for the rest of the run. See MPI_FILE_OPEN and
MPI_WIN_CREATE in IBM Parallel Environment for AIX: MPI Subroutine
Reference.
Return values
0 Indicates successful completion.
-1 Indicates that the MPI library was not active. The call was either made
before MPI_INIT or after MPI_FINALIZE.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running this program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_DISABLEINTR, mpc_disableintr
v MP_QUERYINTR, mpc_queryintr
MP_FLUSH, mpc_flush
Purpose
Flushes task output buffers.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_flush(int option);
FORTRAN synopsis
MP_FLUSH(INTEGER OPTION)
Description
This parallel utility subroutine flushes output buffers from all of the parallel tasks to
STDOUT at the home node. This is a synchronizing call across all parallel tasks.
If the current STDOUT mode is ordered, then when all tasks have issued this call or
when any of the output buffers are full:
1. All STDOUT buffers are flushed and put out to the user screen (or redirected) in
task order.
2. An acknowledgement is sent to all tasks and control is returned to the user.
If current STDOUT mode is unordered and all tasks have issued this call, all output
buffers are flushed and put out to the user screen (or redirected).
If the current STDOUT mode is single and all tasks have issued this call, the output
buffer for the current single task is flushed and put out to the user screen (or
redirected).
Parameters
option
is an AIX file descriptor. The only valid value is:
1 Indicates to flush STDOUT buffers.
Notes
v This is a synchronizing call regardless of the current STDOUT mode.
v All STDOUT buffers are flushed at the end of the parallel job.
v If mpc_flush is not used, standard output streams not terminated with a new-line
character are buffered, even if a subsequent read to standard input is made. This
may cause a prompt message to appear only after input has been read.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion
Examples
C Example
The following program uses poe with the -labelio yes option and three tasks:
#include <pm_util.h>
main()
{
mpc_stdout_mode(STDIO_ORDERED);
printf("These lines will appear in task order\n");
/*
* Call mpc_flush here to make sure that one task
* doesn’t change the mode before all tasks have
* sent the previous printf string to the home node.
*/
mpc_flush(1);
mpc_stdout_mode(STDIO_UNORDERED);
printf("These lines will appear in the order received by the home node\n");
/*
* Since synchronization is not used here, one task could actually
* execute the next statement before one of the other tasks has
* executed the previous statement, causing one of the unordered
* lines not to print.
*/
mpc_stdout_mode(1);
printf("Only 1 copy of this line will appear from task 1\n");
}
Running this C program produces the following output (the task order of lines 4
through 6 may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
FORTRAN Example
CALL MP_STDOUT_MODE(-2)
WRITE(6, *) ’These lines will appear in task order’
CALL MP_FLUSH(1)
CALL MP_STDOUT_MODE(-3)
WRITE(6, *) ’These lines will appear in the order received by the home node’
CALL MP_STDOUT_MODE(1)
WRITE(6, *) ’Only 1 copy of this line will appear from task 1’
END
Related information
Subroutines:
v MP_STDOUT_MODE, mpc_stdout_mode
v MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
MP_INIT_CKPT, mpc_init_ckpt
Purpose
Starts user-initiated checkpointing.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_init_ckpt(int flags);
FORTRAN synopsis
i = MP_INIT_CKPT(%val(j))
Description
MP_INIT_CKPT starts complete or partial user-initiated checkpointing. The
checkpoint file name consists of the base name provided by the MP_CKPTFILE
and MP_CKPTDIR environment variables, with a suffix of the task ID and a numeric
checkpoint tag to differentiate it from an earlier checkpoint file.
Parameters
In C, flags can be set to MP_CUSER, which indicates complete user-initiated
checkpointing, or MP_PUSER, which indicates partial user-initiated checkpointing.
Notes
Complete user-initiated checkpointing is a synchronous operation. All tasks of the
parallel program must call MP_INIT_CKPT. MP_INIT_CKPT suspends the calling
thread until all other tasks have called it (MP_INIT_CKPT). Other threads in the
task are not suspended. After all tasks of the application have issued
MP_INIT_CKPT, a local checkpoint is taken of each task.
Upon returning from the MP_INIT_CKPT call, the application continues to run. It
may, however, be a restarted application that is now running, rather than the
original, if the program was restarted from a checkpoint file.
In a case where several threads in a task call MP_INIT_CKPT using the same flag,
the calls are serialized.
The task that calls MP_INIT_CKPT does not need to be an MPI program.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
0 Indicates successful completion.
1 Indicates that a restart operation occurred.
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_ckpt.h>
int mpc_init_ckpt(int flags);
FORTRAN Example
i = MP_INIT_CKPT(%val(j))
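The examples above simply restate the synopsis. A fuller sketch of a complete
user-initiated checkpoint, offered as an illustration only (the stdio.h usage and the
messages are assumptions, not from the reference), might look like this in C:
#include <stdio.h>
#include <pm_ckpt.h>
main()
{
int rc;
/* Request a complete user-initiated checkpoint; all tasks must call. */
rc = mpc_init_ckpt(MP_CUSER);
if (rc == 1) {
printf("Running as a restarted task\n");
} else if (rc == -1) {
printf("Checkpoint failed; see the accompanying error message\n");
}
}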
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
v MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
MP_QUERYINTR, mpc_queryintr
Purpose
Returns the state of interrupts on a node.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_queryintr();
FORTRAN synopsis
MP_QUERYINTR(INTEGER RC)
Description
This parallel utility subroutine returns the state of interrupts on a node.
Parameters
In FORTRAN, RC will contain one of the values listed under Return Values.
Notes
This subroutine is thread safe.
Return values
0 Indicates that interrupts are disabled on the node from which this subroutine
is called.
1 Indicates that interrupts are enabled on the node from which this subroutine
is called.
Examples
C Example
/*
* Running this program, after compiling with mpcc_r,
* without setting the MP_CSS_INTERRUPT environment variable,
* and without using the "-css_interrupt" command-line option,
* produces the following output:
*
* Interrupts are DISABLED
* About to enable interrupts..
* Interrupts are ENABLED
* About to disable interrupts...
* Interrupts are DISABLED
*/
#include "pm_util.h"

#define QUERY if (intr = mpc_queryintr()) {\
printf("Interrupts are ENABLED\n");\
} else {\
printf("Interrupts are DISABLED\n");\
}

main()
{
int intr;
QUERY
printf("About to enable interrupts..\n");
mpc_enableintr();
QUERY
printf("About to disable interrupts...\n");
mpc_disableintr();
QUERY
}
FORTRAN Example
Running this program, after compiling with mpxlf_r, without setting the
MP_CSS_INTERRUPT environment variable, and without using the -css_interrupt
command-line option, produces the following output:
Interrupts are DISABLED
About to enable interrupts..
Interrupts are ENABLED
About to disable interrupts...
Interrupts are DISABLED
PROGRAM INTR_EXAMPLE
INTEGER RC
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to enable interrupts..’
CALL MP_ENABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
WRITE(6,*)’About to disable interrupts...’
CALL MP_DISABLEINTR(RC)
CALL MP_QUERYINTR(RC)
IF (RC .EQ. 0) THEN
WRITE(6,*)’Interrupts are DISABLED’
ELSE
WRITE(6,*)’Interrupts are ENABLED’
ENDIF
STOP
END
Related information
Subroutines:
v MP_DISABLEINTR, mpc_disableintr
v MP_ENABLEINTR, mpc_enableintr
MP_QUERYINTRDELAY, mpc_queryintrdelay
Purpose
Note
This function is no longer supported and its future use is not recommended.
The routine remains available for binary compatibility. If invoked, it performs no
action and always returns zero. Applications that include calls to this routine
should continue to function as before. We suggest that calls to this routine be
removed from source code if it becomes convenient to do so.
The original purpose of this routine was to return the current interrupt delay
time. This routine currently returns zero.
MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
Purpose
Registers subroutines to be invoked when the application is checkpointed, resumed,
and restarted.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_set_ckpt_callbacks(callbacks_t *cbs);
FORTRAN synopsis
MP_SET_CKPT_CALLBACKS(EXTERNAL CHECKPOINT_CALLBACK_FUNC,
EXTERNAL RESUME_CALLBACK_FUNC,
EXTERNAL RESTART_CALLBACK_FUNC,
INTEGER RC)
Description
The MP_SET_CKPT_CALLBACKS subroutine is called to register subroutines to be
invoked when the application is checkpointed, resumed, and restarted.
Parameters
In C, cbs is a pointer to a callbacks_t structure. The structure is defined as:
typedef struct {
void (*checkpoint_callback)(void);
void (*restart_callback)(void);
void (*resume_callback)(void);
} callbacks_t;
where:
checkpoint_callback Points to the subroutine to be called at checkpoint
time.
restart_callback Points to the subroutine to be called at restart time.
resume_callback Points to the subroutine to be called when an
application is resumed after taking a checkpoint.
In FORTRAN:
CHECKPOINT_CALLBACK_FUNC
Specifies the subroutine to be called at checkpoint
time.
RESUME_CALLBACK_FUNC Specifies the subroutine to be called when an
application is resumed after taking a checkpoint.
RESTART_CALLBACK_FUNC Specifies the subroutine to be called at restart time.
RC Contains one of the values listed under Return Values.
Notes
In order to ensure their completion, the callback subroutines cannot be dependent
on the action of any other thread in the current process, or any process created by
the task being checkpointed, because these threads or processes or both may or
may not be running while the callback subroutines are executing.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
-1 Indicates that an error occurred. A message describing the error will be
issued.
non-negative integer
Indicates the handle that is to be used in MP_UNSET_CKPT_CALLBACKS
to unregister the subroutines.
Examples
C Example
#include <pm_ckpt.h>
int ihndl;
callbacks_t cbs;
void foo(void);
void bar(void);
cbs.checkpoint_callback=foo;
cbs.resume_callback=bar;
cbs.restart_callback=bar;
ihndl = mpc_set_ckpt_callbacks(&cbs);
FORTRAN Example
SUBROUTINE FOO
.
.
.
RETURN
END
SUBROUTINE BAR
.
.
.
RETURN
END
PROGRAM MAIN
EXTERNAL FOO, BAR
INTEGER HANDLE, RC
.
.
.
CALL MP_SET_CKPT_CALLBACKS(FOO,BAR,BAR,HANDLE)
IF (HANDLE .EQ. -1) STOP 666
.
.
.
CALL MP_UNSET_CKPT_CALLBACKS(HANDLE,RC)
.
.
.
END
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_INIT_CKPT, mpc_init_ckpt
v MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
MP_SETINTRDELAY, mpc_setintrdelay
Purpose
Note
This function is no longer supported and its future use is not recommended.
The routine remains available for binary compatibility. If invoked, it performs no
action and always returns zero. Applications that include calls to this routine
should continue to function as before. We suggest that calls to this routine be
removed from source code if it becomes convenient to do so.
This function formerly set the delay parameter. It now performs no action.
MP_STATISTICS_WRITE, mpc_statistics_write
Purpose
Prints both MPI and LAPI transmission statistics.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_statistics_write(FILE *fp);
FORTRAN synopsis
MP_STATISTICS_WRITE(INTEGER FILE_DESCRIPTOR, INTEGER RC)
Description
If the MP_STATISTICS environment variable is set to yes, MPI will keep a running
total on a set of statistical data. If an application calls this function after MPI_INIT is
completed, but before MPI_FINALIZE is called, it will print out the current total of all
available MPI and LAPI data. If this function is called after MPI_FINALIZE is
completed, it will print out only the final MPI data.
Note: LAPI will always keep its own statistical total with or without having
MP_STATISTICS set.
In the output, each piece of MPI statistical data is preceded by MPI, and each piece
of LAPI statistical data is preceded by LAPI.
Parameters
fp In C, fp is either STDOUT, STDERR or a FILE pointer returned by the
fopen function.
In FORTRAN, FILE_DESCRIPTOR is the AIX file descriptor of the file that
this function will write to, having these values:
1 Indicates that the output is to be written to STDOUT.
2 Indicates that the output is to be written to STDERR.
Other Indicates the integer returned by the XL FORTRAN utility getfd, if
the output is to be written to an application-defined file.
The getfd utility converts a FORTRAN LUNIT number to an AIX file
descriptor. See Examples for more detail.
RC In FORTRAN, RC will contain the integer value returned by this function.
See Return Values for more detail.
Return values
-1 Neither MPI nor LAPI statistics are available.
0 Both MPI and LAPI statistics are available.
1 Only MPI statistics are available.
2 Only LAPI statistics are available.
Examples
C Example
#include "pm_util.h"
......
MPI_Init( ... );
MPI_Send( ... );
MPI_Recv( ... );
MPI_Finalize();
FORTRAN Example
integer(4) LUNIT, stat_ofile, stat_rc, getfd
.....
stat_ofile = getfd(LUNIT)
call MP_STATISTICS_WRITE(stat_ofile, stat_rc)
MP_STATISTICS_ZERO, mpc_statistics_zero
Purpose
Resets (zeros) the MPCI_stats_t structure. It has no effect on LAPI.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
void mpc_statistics_zero();
FORTRAN synopsis
MP_STATISTICS_ZERO()
Description
If the MP_STATISTICS environment variable is set to yes, MPI will keep a running
total on a set of statistical data, after MPI_INIT is completed. At any time during
execution, the application can call this function to reset the current total to zero.
Parameters
None.
Return values
None.
MP_STDOUT_MODE, mpc_stdout_mode
Purpose
Sets the mode for STDOUT.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_stdout_mode(int mode);
FORTRAN synopsis
MP_STDOUT_MODE(INTEGER MODE)
Description
This parallel utility subroutine requests that STDOUT be set to single, ordered, or
unordered mode. In single mode, only one task output is displayed. In unordered
mode, output is displayed in the order received at the home node. In ordered mode,
each parallel task writes output data to its own buffer. When a flush request is
made, all the task buffers are flushed, in order of task ID, to STDOUT at the home node.
Parameters
mode
is the mode to which STDOUT is to be set. The valid values are:
taskid Specifies single mode for STDOUT, where taskid is the task identifier of
the new single task. This value must be between 0 and n-1, where n is
the total number of tasks in the current partition. The taskid requested does not
have to be the issuing task.
-2 Specifies ordered mode for STDOUT. The macro STDIO_ORDERED is
supplied for use in C programs.
-3 Specifies unordered mode for STDOUT. The macro
STDIO_UNORDERED is supplied for use in C programs.
Notes
v All current STDOUT buffers are flushed before the new STDOUT mode is
established.
v The initial mode for STDOUT is set by using the environment variable
MP_STDOUTMODE, or by using the command-line option -stdoutmode, with
the latter overriding the former. The default STDOUT mode is unordered.
v This subroutine is implemented with a half-second sleep interval to ensure that
the mode change request is processed before subsequent writes to STDOUT.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion.
Examples
C Example
The following program uses poe with the -labelio yes option and three tasks:
#include <pm_util.h>
main()
{
mpc_stdout_mode(STDIO_ORDERED);
printf("These lines will appear in task order\n");
/*
* Call mpc_flush here to make sure that one task
* doesn’t change the mode before all tasks have
* sent the previous printf string to the home node.
*/
mpc_flush(1);
mpc_stdout_mode(STDIO_UNORDERED);
printf("These lines will appear in the order received by the home node\n");
/*
* Since synchronization is not used here, one task could actually
* execute the next statement before one of the other tasks has
* executed the previous statement, causing one of the unordered
* lines not to print.
*/
mpc_stdout_mode(1);
printf("Only 1 copy of this line will appear from task 1\n");
}
Running the above C program produces the following output (task order of lines 4-6
may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
FORTRAN Example
CALL MP_STDOUT_MODE(-2)
WRITE(6, *) ’These lines will appear in task order’
CALL MP_FLUSH(1)
CALL MP_STDOUT_MODE(-3)
WRITE(6, *) ’These lines will appear in the order received by the home node’
CALL MP_STDOUT_MODE(1)
WRITE(6, *) ’Only 1 copy of this line will appear from task 1’
END
Running the above program produces the following output (the task order of lines 4
through 6 may differ):
v 0 : These lines will appear in task order.
v 1 : These lines will appear in task order.
v 2 : These lines will appear in task order.
v 1 : These lines will appear in the order received by the home node.
v 2 : These lines will appear in the order received by the home node.
v 0 : These lines will appear in the order received by the home node.
v 1 : Only 1 copy of this line will appear from task 1.
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
Subroutines:
v MP_FLUSH, mpc_flush
v MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
v MP_SYNCH, mpc_synch
MP_STDOUTMODE_QUERY, mpc_stdoutmode_query
Purpose
Queries the current STDOUT mode setting.
Library
libmpi_r.a
C synopsis
#include <pm_util.h>
int mpc_stdoutmode_query(int *mode);
FORTRAN synopsis
MP_STDOUTMODE_QUERY(INTEGER MODE)
Description
This parallel utility subroutine returns the mode to which STDOUT is currently set.
Parameters
mode
is the address of an integer in which the current STDOUT mode setting will be
returned. Possible return values are:
taskid Indicates that the current STDOUT mode is single; that is, output for
only task taskid is displayed.
-2 Indicates that the current STDOUT mode is ordered. The macro
STDIO_ORDERED is supplied for use in C programs.
-3 Indicates that the current STDOUT mode is unordered. The macro
STDIO_UNORDERED is supplied for use in C programs.
Notes
v Between the time one task issues a mode query request and receives a
response, it is possible that another task can change the STDOUT mode setting
to another value unless proper synchronization is used.
v This subroutine is thread safe.
Return values
In C and C++ calls, the following applies:
0 Indicates successful completion
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_util.h>
main()
{
int mode;
mpc_stdoutmode_query(&mode);
printf("Initial (default) STDOUT mode is %d\n", mode);
mpc_stdout_mode(STDIO_ORDERED);
mpc_stdoutmode_query(&mode);
printf("New STDOUT mode is %d\n", mode);
}
FORTRAN Example
CALL MP_STDOUTMODE_QUERY(mode)
WRITE(6, *) ’Initial (default) STDOUT mode is’, mode
CALL MP_STDOUT_MODE(-2)
CALL MP_STDOUTMODE_QUERY(mode)
WRITE(6, *) ’New STDOUT mode is’, mode
END
Related information
Commands:
v mpcc_r
v mpCC_r
v mpxlf_r
Subroutines:
v MP_FLUSH, mpc_flush
v MP_STDOUT_MODE, mpc_stdout_mode
v MP_SYNCH, mpc_synch
MP_UNSET_CKPT_CALLBACKS, mpc_unset_ckpt_callbacks
Purpose
Unregisters checkpoint, resume, and restart application callbacks.
Library
libmpi_r.a
C synopsis
#include <pm_ckpt.h>
int mpc_unset_ckpt_callbacks(int handle);
FORTRAN synopsis
MP_UNSET_CKPT_CALLBACKS(INTEGER HANDLE, INTEGER RC)
Description
The MP_UNSET_CKPT_CALLBACKS subroutine is called to unregister checkpoint,
resume, and restart application callbacks that were registered with the
MP_SET_CKPT_CALLBACKS subroutine.
Parameters
handle is an integer indicating the set of callback subroutines to be unregistered.
This integer is the value returned by the subroutine used to register the callback
subroutine.
Notes
If a call to MP_UNSET_CKPT_CALLBACKS is issued while a checkpoint is in
progress, it is possible that the previously-registered callback will still be run during
this checkpoint.
For general information on checkpointing and restarting programs, see IBM Parallel
Environment for AIX: Operation and Use, Volume 1.
For more information on the use of LoadLeveler and checkpointing, see IBM
LoadLeveler for AIX 5L: Using and Administering.
Return values
0 Indicates that MP_UNSET_CKPT_CALLBACKS successfully removed the
callback subroutines from the list of registered callback subroutines
-1 Indicates that an error occurred. A message describing the error will be
issued.
Examples
C Example
#include <pm_ckpt.h>
int ihndl;
callbacks_t cbs;
void foo(void);
void bar(void);
cbs.checkpoint_callback=foo;
cbs.resume_callback=bar;
cbs.restart_callback=bar;
ihndl = mpc_set_ckpt_callbacks(&cbs);
.
.
.
mpc_unset_ckpt_callbacks(ihndl);
.
.
.
FORTRAN Example
SUBROUTINE FOO
.
.
.
RETURN
END
SUBROUTINE BAR
.
.
.
RETURN
END
PROGRAM MAIN
EXTERNAL FOO, BAR
INTEGER HANDLE, RC
.
.
.
CALL MP_SET_CKPT_CALLBACKS(FOO,BAR,BAR,HANDLE)
IF (HANDLE .EQ. -1) STOP 666
.
.
.
CALL MP_UNSET_CKPT_CALLBACKS(HANDLE,RC)
.
.
.
END
Related information
Commands:
v poeckpt
v poerestart
Subroutines:
v MP_INIT_CKPT, mpc_init_ckpt
v MP_SET_CKPT_CALLBACKS, mpc_set_ckpt_callbacks
pe_dbg_breakpoint
Purpose
Provides a communication mechanism between Parallel Operating Environment
(POE) and an attached third party debugger (TPD).
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
void pe_dbg_breakpoint(void);
Description
The pe_dbg_breakpoint subroutine is used to exchange information between POE
and an attached TPD for the purposes of starting, checkpointing, or restarting a
parallel application. The call to the subroutine is made by the POE application
within the context of various debug events and related POE global variables, which
may be examined or filled in by POE and the TPD. All task-specific arrays are
allocated by POE and should be indexed by task number (starting with 0) to retrieve
or set information specific to that task.
The TPD should maintain a breakpoint within this function, check the value of
pe_dbg_debugevent when the function is entered, take the appropriate actions for
each event as described below, and allow the POE executable to continue.
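As an illustration only, a TPD's handler for this breakpoint might be organized as
below; the event constants are those described in this section, while the helper
itself and the mechanics of reading and writing POE's globals (for example, with
ptrace) are assumptions, not part of the documented interface:
#include <pe_dbg_checkpnt.h>

/* Hypothetical sketch of a TPD breakpoint handler; event is the value
read from pe_dbg_debugevent in the POE process. */
void handle_poe_event(int event)
{
switch (event) {
case PE_DBG_INIT_ENTRY:
/* Announce the TPD: set pe_dbg_stoptask to 1 in POE. */
break;
case PE_DBG_CKPT_REQUEST:
/* Agree to the checkpoint: set pe_dbg_do_ckpt to 1. */
break;
case PE_DBG_CKPT_VERIFY:
/* Set pe_dbg_is_tpd to 1 only if the TPD issued the request. */
break;
default:
break;
}
/* Allow the POE executable to continue. */
}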
PE_DBG_INIT_ENTRY
Used by POE to determine if a TPD is present. The TPD should set the
following:
int pe_dbg_stoptask
Should be set to 1 if a TPD is present. POE will then cause the remote
applications to be stopped using ptrace, allowing the TPD to attach to
and continue the tasks as appropriate.
In addition, POE will interpret the SIGSOUND and SIGRETRACT
signals as checkpoint requests from the TPD. SIGSOUND should be
sent when the parallel job should continue after a successful
checkpoint, and SIGRETRACT should be sent when the parallel job
should terminate after a successful checkpoint.
Note: Unpredictable results may occur if these signals are sent while a
parallel checkpoint from a PE_DBG_CKPT_REQUEST is still in
progress.
PE_DBG_CREATE_EXIT
Indicates that all remote tasks have been created and are stopped. The TPD
may retrieve the following information about the remote tasks:
int pe_dbg_count
The number of remote tasks that were created. Also the number of
elements in task-specific arrays in the originally started process, which
remains constant across restarts.
For a restarted POE process, this number may not be the same as the
number of tasks that existed when POE was originally started. To
determine which tasks may have exited prior to the checkpoint from
which the restart is performed, the poe_task_info routine should be
used.
long *pe_dbg_hosts
Address of the array of remote task host IP addresses.
long *pe_dbg_pids
Address of the array of remote task process IDs. Each of these will also
be used as the chk_pid field of the cstate structure for that task’s
checkpoint.
char **pe_dbg_executables
Address of the array of remote task executable names, excluding path.
PE_DBG_CKPT_REQUEST
Indicates that POE has received a user-initiated checkpoint request from one or
all of the remote tasks, has received a request from LoadLeveler to checkpoint
an interactive job, or has detected a pending checkpoint while being run as a
LoadLeveler batch job. The TPD should set the following:
int pe_dbg_do_ckpt
Should be set to 1 if the TPD wishes to proceed with the checkpoint.
PE_DBG_CKPT_START
Used by POE to inform the TPD whether or not to issue a checkpoint of the
POE process. The TPD may retrieve or set the following information for this
event:
int pe_dbg_ckpt_start
Indicates that the checkpoint may proceed if set to 1, and the TPD may
issue a pe_dbg_checkpnt of the POE process and some or all of the
remote tasks.
The TPD should obtain (or derive) the checkpoint file names,
checkpoint flags, cstate, and checkpoint error file names from the
variables below.
char *pe_dbg_poe_ckptfile
Indicates the full pathname to the POE checkpoint file to be used when
checkpointing the POE process. The name of the checkpoint error file
can be derived from this name by concatenating the .err suffix. The
checkpoint error file name should also be used for
PE_DBG_CKPT_START events to know the file name from which to
read the error data.
char **pe_dbg_task_ckptfiles
Address of the array of full pathnames to be used for each of the task
checkpoints. The name of the checkpoint error file can be derived from
this name by concatenating the .err suffix.
int pe_dbg_poe_ckptflags
Indicates the checkpoint flags to be used when checkpointing the POE
process. Other supported flag values for terminating or stopping the
POE process may be ORed in by the TPD, if the TPD user issued the
checkpoint request.
int pe_dbg_task_ckptflags
Indicates the checkpoint flags to be used when checkpointing the
remote tasks. Other supported flag values for stopping the remote tasks
must be ORed in by the TPD.
The following variables should be filled in by the TPD prior to continuing POE
from this event:
int *pe_dbg_ckpt_pmd
Address of an array used by the TPD to indicate which tasks will have
the checkpoints performed by the TPD (value=0) and which tasks the
Partition Manager Daemon (PMD) should issue checkpoints for
(value=1). POE requires that the TPD must perform all checkpoints for
a particular parallel job on any node where at least one checkpoint will
be performed by the TPD.
int pe_dbg_brkpt_len
Used to inform POE of how much data to allocate for
pe_dbg_brkpt_data for later use by the TPD when saving or restoring
breakpoint data. A value of 0 may be used when there is no breakpoint
data.
PE_DBG_CKPT_START_BATCH
Same as PE_DBG_CKPT_START, but the following variables should be
ignored:
v int pe_dbg_ckpt_start
v int pe_dbg_poe_ckptflags
For this event, the TPD should not issue a checkpoint of the POE process.
PE_DBG_CKPT_VERIFY
Indicates that POE has detected a pending checkpoint. POE must verify that
the checkpoint was issued by the TPD before proceeding. The TPD should set
the following:
int pe_dbg_is_tpd
Should be set to 1 if the TPD issued the checkpoint request.
PE_DBG_CKPT_STATUS
Indicates the status of the remote checkpoints that were performed by the
TPDs. The TPD should set the following:
int *pe_dbg_task_ckpterrnos
Address of the array of errnos from the remote task checkpoints (0 for
successful checkpoint). These values can be obtained from the
Py_error field of the cr_error_t struct, returned from the
pe_dbg_read_cr_errfile calls.
void *pe_dbg_brkpt_data
The breakpoint data to be included as part of POE’s checkpoint file.
The format of the data is defined by the TPD, and may be retrieved
from POE’s address space at restart time.
int *pe_dbg_Sy_errors
The secondary errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Sy_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_Xtnd_errors
The extended errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Xtnd_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_error_lens
The user error data lengths obtained from pe_dbg_read_cr_errfile.
These values can be obtained from the error_len field of the cr_error_t
struct, returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_CKPT_ERRDATA
Indicates that the TPD has reported one or more task checkpoint failures, and
that POE has allocated space in the following array for the TPD to use to fill in
the error data.
char **pe_dbg_error_data
The user error data obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the error data field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_CKPT_DETACH
Used by POE to indicate to the TPD that it should detach from the POE
process. After being continued from pe_dbg_breakpoint for this event (just
prior to the TPD actually detaching), POE will wait until its trace bit is no longer
set before instructing the kernel to write its checkpoint file. POE will indicate to
the TPD that it is safe to reattach to the POE process by creating the file
/tmp/.poe.PID.reattach, where PID is the process ID of the POE process.
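As an illustration, the TPD might poll for this file with a loop like the
following sketch (the wait_for_reattach name, the one-second poll interval,
and the buffer size are arbitrary choices, not part of the interface):

   #include <stdio.h>      /* snprintf */
   #include <unistd.h>     /* access, sleep */
   #include <sys/types.h>  /* pid_t */

   /* Wait until POE creates the reattach indicator file described
    * above, then return so the TPD can reattach. */
   static void wait_for_reattach(pid_t poe_pid)
   {
       char path[64];

       snprintf(path, sizeof(path), "/tmp/.poe.%d.reattach", (int)poe_pid);
       while (access(path, F_OK) != 0)
           sleep(1);
   }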
PE_DBG_CKPT_RESULTS
Indicates the checkpoint results to either POE or the TPD, depending on who
issued the checkpoint of POE.
int pe_dbg_ckpt_rc
If the TPD issued the checkpoint, this variable should be filled in by the
TPD and should contain the return code from the call to
pe_dbg_checkpnt. Otherwise, POE will fill in this value to indicate to
the TPD whether the checkpoint succeeded (value=1) or failed
(value=0). For failed checkpoints, the TPD may obtain the error
information from the POE checkpoint error file.
int pe_dbg_ckpt_errno
If the TPD issued the checkpoint and the checkpoint failed, this variable
should be filled in by the TPD and should contain the errno set by AIX
upon return from pe_dbg_checkpnt.
PE_DBG_CKPT_RESUME
When this event occurs, the TPD may continue or terminate the remote tasks
(or keep them stopped) after a successful checkpoint. The TPD must not
perform the post-checkpoint actions until this event is received, to ensure that
POE and LoadLeveler have performed their post-checkpoint synchronization. If
the TPD did not issue the checkpoint, the following variable should be
examined:
int pe_dbg_ckpt_action
POE will fill in this value to indicate to the TPD if the remote tasks
should be continued (value=0) or terminated (value=1) after a
successful checkpoint.
PE_DBG_CKPT_CANCEL
Indicates that POE has received a request to cancel an in-progress checkpoint.
The TPD should cause a SIGINT to be sent to the thread that issued the
pe_dbg_checkpnt calls in the remote tasks. If the TPD is non-threaded and
performs non-blocking checkpoints, the task checkpoints cannot be cancelled.
Note: If the TPD user issues a request to cancel a checkpoint being performed
by the TPD, the TPD should send a SIGGRANT to the POE process so
that the remote checkpoints being performed by the PMDs can be
interrupted. Otherwise, the checkpoint call in the TPD can return while
some remote checkpoints are still in progress.
PE_DBG_RESTART_READY
Indicates that processes for the remote task restarts have been created and
that pe_dbg_restart calls for the remote tasks may be issued by the TPD. The
TPD must perform the restarts of all remote tasks.
The TPD should first retrieve the remote task information specified in the
variables described above under PE_DBG_CREATE_EXIT. The TPD should
then obtain (or derive) the restart file names, the restart flags, rstate, and restart
error file names from the variables below. The id argument for the
pe_dbg_restart call must be derived from the remote task PID using the
pe_dbg_getcrid routine.
char **pe_dbg_task_rstfiles
Address of the array of full pathnames to be used for each of the task
restarts. The name of the restart error file can be derived from this
name by concatenating the .err suffix.
int pe_dbg_task_rstflags
Indicates the restart flags to be used when restarting the remote tasks.
Other supported flag values for stopping the remote tasks may be
ORed in by the TPD.
char **pe_dbg_task_rstate
Address of the array of strings containing the restart data required for
each of the remote tasks. This value may be used as is for the
rst_buffer member of the rstate structure used in the remote task
restarts, or additional data may be appended by the TPD, as described
below:
DEBUGGER_STOP=yes
If this string appears in the task restart data, followed by a newline (\n)
character and a \0, the remote task will send a SIGSTOP signal to itself
once all restart actions have been completed in the restart handler. This
will likely be used by the TPD when tasks are checkpoint-aware, and
the TPD wants immediate control of the task after it completes restart
initialization.
The following variables should be re-examined by the TPD during this event:
int pe_dbg_ckpt_aware
Indicates whether or not the remote tasks that make up the parallel
application are checkpoint aware.
void *pe_dbg_brkpt_data
The breakpoint data that was included as part of POE’s checkpoint file.
The format of the data is defined by the TPD.
The following variables should be filled in by the TPD prior to continuing POE
from this event. This also implies that all remote restarts must have been
performed before continuing POE:
int *pe_dbg_task_rsterrnos
Address of the array of errnos from the remote task restarts (0 for
successful restart). These values can be obtained from the Py_error
field of the cr_error_t struct, returned from the pe_dbg_read_cr_errfile
calls.
int *pe_dbg_Sy_errors
The secondary errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Sy_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_Xtnd_errors
The extended errors obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the Xtnd_error field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
int *pe_dbg_error_lens
The user error data lengths obtained from pe_dbg_read_cr_errfile.
These values can be obtained from the error_len field of the cr_error_t
struct, returned from the pe_dbg_read_cr_errfile calls.
PE_DBG_RESTART_ERRDATA
Indicates that the TPD has reported one or more task restart failures, and that
POE has allocated space in the following array for the TPD to use to fill in the
error data.
char **pe_dbg_error_data
The user error data obtained from pe_dbg_read_cr_errfile. These
values can be obtained from the error_data field of the cr_error_t struct,
returned from the pe_dbg_read_cr_errfile calls.
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
pe_dbg_checkpnt
Purpose
Checkpoints a process that is under debugger control, or a group of processes.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_checkpnt(path, id, flags, cstate, epath)
char *path;
id_t id;
unsigned int flags;
chk_state_t *cstate;
char *epath;
Description
The pe_dbg_checkpnt subroutine allows a process to checkpoint a process that is
under debugger control, or a set of processes that have the same checkpoint/restart
group ID (CRID). The state information of the checkpointed processes is saved in a
single file. All information required to restart the processes (other than the
executable files, any shared libraries, any explicitly loaded modules, and data,
if any, passed through the restart system calls) is contained in the checkpoint file.
After all processes have been stopped, the checkpoint file is written with process
information one process at a time. After the write has completed successfully, the
pe_dbg_checkpnt subroutine will do one of the following depending on the value
of the flags passed:
v Continue the processes.
v Terminate all the checkpointed processes.
v Leave the processes in the stopped state.
Parameters
path
The path of the checkpoint file to be created. This file will be created read-only
with the ownership set to the user ID of the process invoking the
pe_dbg_checkpnt call.
id Indicates the process ID of the process to be checkpointed, or the
checkpoint/restart group ID (CRID) of the set of processes to be checkpointed,
as specified by the flags parameter.
flags
Determines the behavior of the pe_dbg_checkpnt subroutine and defines the
interpretation of the id parameter. The flags parameter is constructed by
logically ORing the following values, which are defined in the sys/checkpnt.h
file:
CHKPNT_AND_STOP
Setting this bit causes the checkpointed processes to be put in a
stopped state after a successful checkpoint operation. The processes
can be continued by sending them SIGCONT. The default is to
checkpoint and continue running the processes.
CHKPNT_AND_STOPTRC
Setting this bit causes any process that is traced to be put in a stopped
state after a successful checkpoint operation. The processes can be
continued by sending them SIGCONT. The default is to checkpoint and
continue running the processes.
CHKPNT_AND_TERMINATE
Setting this bit causes the checkpointed processes to be terminated on
a successful checkpoint operation. The default is to checkpoint and
continue running the processes.
CHKPNT_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the set of processes to be checkpointed.
CHKPNT_IGNORE_SHMEM
Specifies that shared memory should not be checkpointed.
CHKPNT_NODELAY
Specifies that pe_dbg_checkpnt will not wait for the completion of the
checkpoint call. As soon as all the processes to be checkpointed have
been identified, and the checkpoint operation started for each of them,
the call will return. The kernel will not provide any status on whether the
call was successful. The application must examine the checkpoint file to
determine if the checkpoint operation succeeded or not. By default, the
pe_dbg_checkpnt subroutine will wait for all the checkpoint data to be
completely written to the checkpoint file before returning.
epath
An error file name to log error and debugging data if the checkpoint fails. This
field is mandatory and must be provided.
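As an illustration, a checkpoint-and-stop request for a checkpoint/restart
group might look like the following sketch. The checkpoint_group name and the
/tmp file names are assumptions for this example; cstate is passed through
unchanged:

   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Checkpoint every process in checkpoint/restart group 'crid' and
    * leave the processes stopped.  On failure, the error file named by
    * epath can be examined with pe_dbg_read_cr_errfile. */
   int checkpoint_group(crid_t crid, chk_state_t *cstate)
   {
       return pe_dbg_checkpnt("/tmp/poe.ckpt",              /* path  */
                              (id_t)crid,
                              CHKPNT_CRID | CHKPNT_AND_STOP,
                              cstate,
                              "/tmp/poe.ckpt.err");         /* epath */
   }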
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed or restarted processes.
Return values
Upon successful completion, a value of CHECKPOINT_OK is returned.
If the invoking process is included in the set of processes being checkpointed, and
the CHKPNT_AND_TERMINATE flag is set, this call will not return if the checkpoint
is successful because the process will be terminated.
If a process that successfully checkpointed itself is restarted, it will return from the
pe_dbg_checkpnt call with a value of RESTART_OK.
Errors
The pe_dbg_checkpnt subroutine is unsuccessful when the global variable errno
contains one of the following values:
EACCES
One of the following is true:
v The file exists, but could not be opened successfully in exclusive mode,
or write permission is denied on the file, or the file is not a regular file.
v Search permission is denied on a component of the path prefix specified
by the path parameter. Access could be denied due to a secure mount.
v The file does not exist, and write permission is denied for the parent
directory of the file to be created.
EAGAIN
Either the calling process or one or more of the processes to be
checkpointed is already involved in another checkpoint or restart operation.
EINTR Indicates that the checkpoint operation was terminated due to receipt of a
signal. No checkpoint file will be created. A call to the
pe_dbg_checkpnt_wait subroutine should be made when this occurs, to
ensure that the processes reach a state where subsequent checkpoint
operations will not fail unpredictably.
EINVAL
Indicates that a NULL path or epath parameter was passed in, or an invalid
set of flags was set, or an invalid id parameter was passed.
ENOMEM
Insufficient memory exists to initialize the checkpoint structures.
ENOSYS
One of the following is true:
pe_dbg_checkpnt_wait
Purpose
Waits for a checkpoint, or pending checkpoint file I/O, to complete.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_checkpnt_wait(id, flags, options)
id_t id;
unsigned int flags;
int *options;
Description
The pe_dbg_checkpnt_wait subroutine can be used to:
v Wait for a pending checkpoint issued by the calling thread’s process to complete.
v Determine whether a pending checkpoint issued by the calling thread’s process
has completed, when the CHKPNT_NODELAY flag is specified.
v Wait for any checkpoint file I/O that may be in progress during an interrupted
checkpoint to complete.
Parameters
id Indicates the process ID or the checkpoint/restart group ID (CRID) of the
processes for which a checkpoint operation was initiated or interrupted, as
specified by the flags parameter.
flags
Defines the interpretation of the id parameter. The flags parameter may contain
the following values, which are defined in the sys/checkpnt.h file:
CHKPNT_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the set of processes for which a checkpoint operation was
initiated or interrupted.
CHKPNT_NODELAY
Specifies that pe_dbg_checkpnt_wait will not wait for the completion
of the checkpoint call. This flag should not be used when waiting for
pending checkpoint file I/O to complete.
options
This field is reserved for future use and should be set to NULL.
Future implementations of this function may return the checkpoint error code in
this field. Until then, the size of the checkpoint error file can be used in most
cases to determine whether the checkpoint succeeded or failed. If the size of
the file is 0, the checkpoint succeeded; otherwise the checkpoint failed and
the checkpoint error file will contain the error codes. If the file does not exist, the
checkpoint most likely failed due to an EPERM or ENOENT on the checkpoint
error file pathname.
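As an illustration, a non-blocking completion test might be written as in the
following sketch (the checkpoint_done name is an assumption for this example):

   #include <errno.h>
   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Poll a previously issued checkpoint of checkpoint/restart group
    * 'crid'.  Returns 1 when it has completed (or none is pending),
    * 0 while it is still in progress, and -1 on any other error. */
   int checkpoint_done(crid_t crid)
   {
       if (pe_dbg_checkpnt_wait((id_t)crid,
                                CHKPNT_CRID | CHKPNT_NODELAY,
                                NULL) == 0)
           return 1;
       return (errno == EINPROGRESS) ? 0 : -1;
   }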
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
Upon successful completion, a value of 0 is returned, indicating that one of the
following is true:
v The pending checkpoint completed.
v There was no pending checkpoint.
v The pending file I/O completed.
v There was no pending file I/O.
If the pe_dbg_checkpnt_wait call is unsuccessful, -1 is returned and the errno
global variable is set to indicate the error.
Errors
The pe_dbg_checkpnt_wait subroutine is unsuccessful when the global variable
errno contains one of the following values:
EINPROGRESS
Indicates that the pending checkpoint operation has not completed when
the CHKPNT_NODELAY flag is specified.
EINTR Indicates that the operation was terminated due to receipt of a signal.
EINVAL
Indicates that an invalid flag was set.
ENOSYS
The caller of the function is not a debugger.
ESRCH
The process whose process ID was passed or the checkpoint/restart group
whose CRID was passed does not exist.
pe_dbg_getcrid
Purpose
Returns the checkpoint/restart ID.
Library
POE API library (libpoeapi.a)
C synopsis
crid_t pe_dbg_getcrid(pid)
pid_t pid;
Description
The pe_dbg_getcrid subroutine returns the checkpoint/restart group ID (CRID) of
the process whose process ID was specified in the pid parameter, or the CRID of
the calling process if a value of -1 was passed.
Parameters
pid Either the process ID of a process to obtain its CRID, or -1 to request the
CRID of the calling process.
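As an illustration, the PID-to-CRID derivation required before a
pe_dbg_restart call (see PE_DBG_RESTART_READY) might look like this sketch;
the task_crid name and the header shown are assumptions:

   #include <sys/types.h>
   #include <pe_dbg_checkpnt.h>

   /* Map a remote task's real PID to the id expected by pe_dbg_restart
    * with RESTART_OVER_CRID, or by pe_dbg_checkpnt with CHKPNT_CRID. */
   crid_t task_crid(pid_t task_pid)
   {
       crid_t crid = pe_dbg_getcrid(task_pid);

       if (crid == (crid_t)-1) {
           /* errno is ENOSYS (caller is not a debugger) or ESRCH */
       } else if (crid == 0) {
           /* process is in no checkpoint/restart group */
       }
       return crid;
   }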
Notes
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
If the process belongs to a checkpoint/restart group, a valid CRID is returned. If the
process does not belong to any checkpoint/restart group, a value of zero is
returned. For any error, a value of -1 is returned and the errno global variable is set
to indicate the error.
Errors
The pe_dbg_getcrid subroutine is unsuccessful when the global variable errno
contains one of the following values:
ENOSYS The caller of the function is not a debugger.
ESRCH There is no process with a process ID equal to pid.
pe_dbg_getrtid
Purpose
Returns the real thread ID of a thread in a specified process, given its virtual thread ID.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
Description
The pe_dbg_getrtid subroutine returns the real thread ID of the specified virtual
thread in the specified process.
Parameters
pid The real process ID of the process containing the thread for which the real
thread ID is needed.
vtid The virtual thread ID of the thread for which the real thread ID is needed.
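Because an unknown process or virtual thread ID makes the call echo back its
vtid argument, a caller cannot distinguish an unmapped ID from an identity
mapping. A minimal sketch follows; the tid_t argument and return types are
assumptions based on the surrounding subroutines:

   #include <sys/types.h>
   #include <pe_dbg_checkpnt.h>

   /* Resolve a virtual thread ID recorded at checkpoint time to the
    * real thread ID in the restarted process 'pid'. */
   tid_t resolve_tid(pid_t pid, tid_t vtid)
   {
       return pe_dbg_getrtid(pid, vtid);
   }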
Return values
If the calling process is not a debugger, a value of -1 is returned. Otherwise, the
pe_dbg_getrtid call is always successful. If the process does not exist or has
exited or is not a restarted process, or if the provided virtual thread ID does not
exist in the specified process, the value passed in the vtid parameter is returned.
Otherwise, the real thread ID of the thread whose virtual thread ID matches the
value passed in the vtid parameter is returned.
Errors
The pe_dbg_getrtid subroutine is unsuccessful if the following is true:
ENOSYS The caller of the function is not a debugger.
pe_dbg_getvtid
Purpose
Returns the virtual thread ID of a thread in a specified process, given its real thread ID.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
Description
The pe_dbg_getvtid subroutine returns the virtual thread ID of the specified real
thread in the specified process.
Parameters
pid The real process ID of the process containing the thread for which the
virtual thread ID is needed.
rtid The real thread ID of the thread for which the virtual thread ID is needed.
Return values
If the calling process is not a debugger, a value of -1 is returned.
If the process does not exist, the process has exited, the process is not a restarted
process, or the provided real thread ID does not exist in the specified process, the
value passed in the rtid parameter is returned.
Otherwise, the virtual thread ID of the thread whose real thread ID matches the
value passed in the rtid parameter is returned.
Errors
The pe_dbg_getvtid subroutine is unsuccessful if the following is true:
ENOSYS The caller of the function is not a debugger.
pe_dbg_read_cr_errfile
Purpose
Opens and reads information from a checkpoint or restart error file.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
void pe_dbg_read_cr_errfile(char *path, cr_error_t *err_data, int cr_errno)
Description
The pe_dbg_read_cr_errfile subroutine is used to obtain the error information from
a failed checkpoint or restart. The information is returned in the cr_error_t structure,
as defined in /usr/include/sys/checkpnt.h.
Parameters
path
The full pathname to the error file to be read.
err_data
Pointer to a cr_error_t structure in which the error information will be returned.
cr_errno
The errno from the pe_dbg_checkpnt or pe_dbg_restart call that failed. This
value is used for the Py_error field of the returned structure if the file specified
by the path parameter does not exist, has a size of 0, or cannot be opened.
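As an illustration, after a failed checkpoint the TPD might recover the error
fields referenced by the PE_DBG_CKPT_STATUS event as in this sketch (the
report_ckpt_failure name is an assumption, and the printf formats assume the
integer fields described earlier):

   #include <stdio.h>
   #include <sys/checkpnt.h>       /* cr_error_t */
   #include <pe_dbg_checkpnt.h>

   /* 'errfile' is the epath passed to pe_dbg_checkpnt; 'ckpt_errno'
    * is the errno from the failed call. */
   void report_ckpt_failure(char *errfile, int ckpt_errno)
   {
       cr_error_t err;

       pe_dbg_read_cr_errfile(errfile, &err, ckpt_errno);
       printf("primary %d, secondary %d, extended %d, error len %d\n",
              err.Py_error, err.Sy_error, err.Xtnd_error, err.error_len);
   }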
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
pe_dbg_restart
Purpose
Restarts processes from a checkpoint file.
Library
POE API library (libpoeapi.a)
C synopsis
#include <pe_dbg_checkpnt.h>
int pe_dbg_restart(path, id, flags, rstate, epath)
char *path;
id_t id;
unsigned int flags;
rst_state_t *rstate;
char *epath;
Description
The pe_dbg_restart subroutine allows a process to restart all the processes whose
state information has been saved in the checkpoint file.
All information required to restart these processes (other than the executable files,
any shared libraries and explicitly loaded modules) is recreated from the information
from the checkpoint file. Then, a new process is created for each process whose
state was saved in the checkpoint file. The only exception is the primary checkpoint
process, which overlays an existing process specified by the id parameter.
When restarting a single process that was checkpointed, the id parameter specifies
the process ID of the process to be overlaid. When restarting a set of processes,
the id parameter specifies the checkpoint/restart group ID of the process to be
overlaid, and the flags parameter must include RESTART_OVER_CRID. This process
must also be the primary checkpoint process of the checkpoint/restart group. The
user ID and group IDs of the primary checkpoint process saved in the checkpoint
file should match the user ID and group IDs of the process it will overlay.
A primary checkpoint process inherits attributes from the attributes saved in the file,
and also from the process it overlays. Other processes in the checkpoint file obtain
their attributes only from the checkpoint file, unless they share some attributes with
the primary checkpoint process. In this case, the shared attributes are inherited.
Although the resource usage of each checkpointed process is saved in the
checkpoint file, the resource usage attributes are zeroed out when the process
is restarted, and the getrusage subroutine will return only resource usage
after the last restart operation.
Restart data, if any, is copied to the interface buffer previously specified in
the rst parameter by the checkpoint handler before the process is restarted.
The format of the interface buffer is entirely application dependent.
Parameters
path
The path of the checkpoint file to use for the restart. Must be a valid checkpoint
file created by a pe_dbg_checkpnt call.
id Indicates the process ID or the checkpoint/restart group ID or CRID of the
process that is to be overlaid by the primary checkpoint process as identified by
the flags parameter.
flags
Determines the behavior of the pe_dbg_restart subroutine and defines the
interpretation of the id parameter. The flags parameter is constructed by
logically ORing one or more of the following values, which are defined in the
sys/checkpnt.h file:
RESTART_AND_STOP
Setting this bit will cause the restarted processes to be put in a stopped
state after a successful restart operation. They can be continued by
sending them SIGCONT. The default is to restart and resume running
the processes at the point where each thread in the process was
checkpointed.
RESTART_AND_STOPTRC
Setting this bit will cause any process that was traced at checkpoint
time to be put in a stopped state after a successful restart operation.
The processes can be continued by sending them SIGCONT. The
default is to restart and resume execution of the processes at the point
where each thread in the process was checkpointed.
RESTART_IGNORE_BADSC
Causes the restart operation not to fail if a kernel extension that was
present at checkpoint time is not present at restart time. However, if the
restarted program uses any system calls in the missing kernel
extension, the program will fail when those calls are used.
RESTART_OVER_CRID
Specifies that the id parameter is the checkpoint/restart group ID or
CRID of the process over which the primary checkpoint process will be
restarted. Use this flag when there are multiple processes to be restarted.
RESTART_PAG_ALL
Same as RESTART_WAITER_PAG.
RESTART_WAITER_PAG
Ensures that DCE credentials are restored in the restarted process.
rstate
Pointer to a structure of type rst_state_t.
epath
Path to the error file used to log error and debugging data if the restart fails.
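As an illustration, restarting a single checkpointed task over a newly created
process might look like the following sketch (the restart_task name and file
names are assumptions for this example):

   #include <sys/types.h>
   #include <sys/checkpnt.h>
   #include <pe_dbg_checkpnt.h>

   /* Restart one task over the process 'overlay_pid', leaving it
    * stopped so that the debugger keeps control. */
   int restart_task(pid_t overlay_pid, rst_state_t *rstate)
   {
       return pe_dbg_restart("/tmp/poe.ckpt",         /* checkpoint file   */
                             (id_t)overlay_pid,
                             RESTART_AND_STOP,
                             rstate,
                             "/tmp/poe.rst.err");     /* restart error file */
   }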
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file. This flag is an uppercase
letter i.
Any references to process ID or PID above represent the real process ID, and not
the virtual process ID associated with checkpointed/restarted processes.
Return values
Upon successful completion, a value of 0 is returned. Otherwise, a value of -1 is
returned and the errno global variable is set to indicate the error.
Errors
The pe_dbg_restart subroutine is unsuccessful when the global variable errno
contains one of the following values:
EACCES
One of the following is true:
v The file exists, but could not be opened successfully in exclusive mode,
or write permission is denied on the file, or the file is not a regular file.
v Search permission is denied on a component of the path prefix specified
by the path parameter. Access could be denied due to a secure mount.
v The file does not exist, and write permission is denied for the parent
directory of the file to be created.
EAGAIN
One of the following is true:
v The user ID has reached the maximum limit of processes that it can
have simultaneously, and the invoking process is not privileged.
v Either the calling process or the target process is involved in another
checkpoint or restart operation.
EFAULT
Copying from the interface buffer failed. The rstate parameter points to a
location that is outside the address space of the process.
EINVAL
One of the following is true:
v A NULL path was passed in.
v The checkpoint file contains invalid or inconsistent data.
v The target process is a kernel process.
v The restart data length in the rstate structure is greater than
MAX_RESTART_DATA.
ENOMEM
One of the following is true:
v There is insufficient memory to create all the processes in the checkpoint
file.
v There is insufficient memory to allocate the restart structures inside the
kernel.
ENOSYS
One of the following is true:
v The caller of the function is not a debugger.
This chapter includes descriptions of the parallel task identification API subroutines
that are available for parallel programming:
v “poe_master_tasks” on page 146.
v “poe_task_info” on page 147.
poe_master_tasks
Purpose
Retrieves the list of process IDs of POE master processes currently running on this
system.
C synopsis
#include "poeapi.h"
int poe_master_tasks(pid_t **poe_master_pids);
Description
An application invoking this subroutine while running on a given node can retrieve
the list of process IDs of all POE master processes that are currently running on the
same node. This information can be used for accounting purposes or can be
passed to the poe_task_info subroutine to obtain more detailed information about
tasks spawned by these POE master processes.
Parameters
On return, (*poe_master_pids) points to the first element of an array of pid_t
elements that contains the process IDs of POE master processes. It is the
responsibility of the calling program to free this array. This pointer is NULL if no
POE master process is running on this system or if there is an error condition.
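As an illustration, a monitoring program might use the subroutine as in this
sketch (the list_poe_masters name is an assumption):

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include "poeapi.h"

   /* Print the POE master processes on this node, then release the
    * array, which the caller owns. */
   void list_poe_masters(void)
   {
       pid_t *pids = NULL;
       int i, n;

       n = poe_master_tasks(&pids);
       for (i = 0; i < n; i++)
           printf("POE master process %d\n", (int)pids[i]);
       if (n > 0)
           free(pids);
   }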
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file.
Return values
greater than 0
Indicates the size of the array that (*poe_master_pids) points to.
0 Indicates that no POE master process is running.
-1 Indicates that a system error has occurred.
-2 Indicates that POE is unable to allocate memory.
-3 Indicates a non-valid poe_master_pids argument.
Related information
v poe_task_info
poe_task_info
Purpose
Returns a NULL-terminated array of pointers to structures of type POE_TASKINFO.
C synopsis
#include "poeapi.h"
int poe_task_info(pid_t poe_master_pid, POE_TASKINFO ***poe_taskinfo);
Description
Given the process ID of a POE master process, this subroutine returns to the
calling program through the poe_taskinfo argument a NULL-terminated array of
pointers to structures of type POE_TASKINFO. There is one POE_TASKINFO
structure for each POE task spawned by this POE master process on a local or
remote node.
Parameters
poe_master_pid
Specifies the process ID of a POE master process.
poe_taskinfo
On return, points to the first element of a NULL-terminated array of pointers to
structures of type POE_TASKINFO.
This pointer is NULL if there is an error condition. It is the responsibility of the
calling program to free the array of pointers to POE_TASKINFO structures, as
well as the relevant POE_TASKINFO structures and the subcomponents
h_name, h_addr, and p_name.
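As an illustration, the traversal and cleanup described above might look like
the following sketch (the show_tasks name is an assumption, and treating
h_name and p_name as printable strings is an assumption of this example):

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/types.h>
   #include "poeapi.h"

   void show_tasks(pid_t master_pid)
   {
       POE_TASKINFO **info = NULL;
       int i, n;

       n = poe_task_info(master_pid, &info);
       for (i = 0; i < n; i++) {
           printf("task on host %s runs %s\n",
                  info[i]->h_name, info[i]->p_name);
           free(info[i]->h_name);      /* subcomponents first */
           free(info[i]->h_addr);
           free(info[i]->p_name);
           free(info[i]);              /* then the structure  */
       }
       free(info);                     /* finally the array   */
   }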
Notes
Use -I/usr/lpp/ppe.poe/include to pick up the header file.
Return values
greater than 0
Indicates the size of the array that (*poe_taskinfo) points to.
0 Indicates that no POE master process is running or that task information is
not available yet.
-1 Indicates that a system error has occurred.
-2 Indicates that POE is unable to allocate memory.
-3 Indicates a non-valid poe_taskinfo argument.
Related information
v poe_master_tasks
With PE Version 4, these nonstandard extensions remain available, but their use is
deprecated. The implementation of these routines depends on hidden message
passing threads. These routines may not be used with environment variable
MP_SINGLE_THREAD set to yes.
Note: FORTRAN refers to FORTRAN 77 bindings that are officially supported for
MPI. However, FORTRAN 77 bindings can be used by FORTRAN 90.
FORTRAN 90 and High Performance FORTRAN (HPF) offer array section
and assumed shape arrays as parameters on calls. These are not safe with
MPI.
MPI::Init void MPI::Init();
MPI_INIT MPI_INIT(INTEGER IERROR)
MPI_Init_thread int MPI_Init_thread(int *argc, char *((*argv)[]), int required, int *provided);
MPI::Init_thread int MPI::Init_thread(int& argc, char**& argv, int required);
PE MPI uses a credit flow control, by which senders track the buffer space that
can be guaranteed at each destination. For each source-destination pair, an eager
send consumes a message credit at the source, and a match at the destination
generates a message credit. The message credits generated at the destination are
returned to the sender to enable additional eager sends. The message credits are
returned piggyback on an application message when possible. If there is no return
traffic, they will accumulate at the destination until their number reaches some
threshold, and then be sent back as a batch to minimize network traffic. When a
sender has no message credits, its sends must proceed using rendezvous
protocol until message credits become available. The fallback to rendezvous
protocol may impact performance. With a reasonable supply of message credits,
most applications will find that the credits return soon enough to enable messages
that are not larger than the eager limit to continue to be sent eagerly.
Assuming a pre-allocated early arrival buffer (whose size cannot increase), the
number of message credits that the early arrival buffer represents is equal to the
early arrival buffer size divided by the eager limit. Since no sender can know how
many other tasks will also send eagerly to a given destination, the message credits
must be divided among sender tasks equally. If every task sends eagerly to a single
destination that is not posting receives, each sender consumes its message credits,
fills its share of the destination early arrival buffer, and reverts to rendezvous
protocol. This prevents an overflow at the destination, which would result in job
failure. To offer a reasonable number of message credits per source-destination pair
at larger task counts, either a very large pre-allocated early arrival buffer, or a very
small eager limit is needed.
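As an illustration with arbitrary numbers: a 64 MB pre-allocated early arrival
buffer and a 4 KB eager limit represent 64 MB / 4 KB = 16384 message credits,
so in a job with 1024 sending tasks, each source-destination pair can be
granted only 16 credits.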
It would be unusual for a real application to flood a single destination this way, and
well-written applications try to pre-post their receives. An eager send must consume
a message credit at the send side, but when the message arrives and matches a
waiting receive, it does not consume any of the early arrival buffer space. The
message credit is available to be returned to the sender, but does not return
instantly. When they pre-post and do not flood, real applications seldom use more
than a small percentage of the total early arrival buffer space. However, because
message credits must be managed for the worst case, they may be depleted at the
send side. The send side then reverts to rendezvous protocol, even though there is
plenty of early arrival buffer space available, or there is a matching receive waiting
at the receive side, which would then not need to use the early arrival buffer.
The advantage of a pre-allocated early arrival buffer is that the Parallel Environment
implementation of MPI is able to allocate and free early arrival space in the
pre-allocated buffer quickly, and because the space is owned by the MPI library, it is
certain to be available if needed. There is nothing an application can do to make
the space that is promised by message credits unavailable in the event that all
message credits are used. A disadvantage is that the space that is pre-allocated to
the early arrival buffer to support adequate message credits is denied to the
application, even if only a small portion of that pre-allocated space is ever used.
With PE 4.2, MPI users are given new control over buffer pre-allocation and
message credits. MPI users can specify both a pre-allocated and maximum early
arrival buffer size. The pre-allocated early arrival buffer is set aside for efficient
management, and guaranteed availability. If the early arrival buffer requirement
exceeds the pre-allocated space, extra early arrival buffer space comes from the
heap using malloc and free. Message credits are calculated based on the
maximum buffer size, and all of the pre-allocated early arrival buffer is used before
using malloc and free. Since message credits are based on the maximum buffer
size, an application that floods a single destination with unmatched eager messages
from all senders could require the specified maximum. If other heap usage has
made that space unavailable, a malloc could fail and the job would be terminated.
Well-designed applications, by contrast, might see better performance from the
additional credits, yet may not even fill the pre-allocated early arrival buffer, let
alone come near needing the promised maximum. An omitted maximum, or any value at or
below the pre_allocated_size, will cause message credits to be limited so that
there will never be an overflow of the pre-allocated early arrival buffer.
For most applications, the default value for the early arrival buffer should be
satisfactory, and with the default, the message credits are calculated based on the
pre-allocated size. The pre-allocated size can be changed from its default by setting
the MP_BUFFER_MEM environment variable or using the -buffer_mem
command-line flag with a single value. The message credits are calculated based
on the modified pre-allocated size. There will be no use of malloc and free after
initialization (MPI_Init). This is the way earlier versions of the Parallel Environment
implementation of MPI worked, so there is no need to learn new habits for
command-line arguments, or to make changes to existing run scripts and default
shell environments.
For some applications, in particular those that are memory constrained or run at
large task counts, it may be useful to adjust the size of the pre-allocated early
arrival buffer to slightly more than the application’s peak demand, but specify a
higher maximum early arrival buffer size so that enough message credits are
available to ensure few or no fallbacks to rendezvous protocol. For a given run, you
can use the MP_STATISTICS environment variable to see how much early arrival
buffer space is used at peak demand, and how often a send that is small enough to
be an eager send, was processed using rendezvous protocol due to a message
credit shortage.
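For example, assuming the two-value pre-allocated,maximum form of
MP_BUFFER_MEM implied above (the sizes are purely illustrative):

   MP_BUFFER_MEM=8M,64M
   export MP_BUFFER_MEM
   poe ./myprog -procs 512

This pre-allocates 8 MB for the early arrival buffer while calculating message
credits from the 64 MB maximum; if peak demand exceeds 8 MB, the extra space
comes from malloc and free as described above.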
By decreasing the pre-allocated early arrival buffer size to slightly larger than the
application’s peak demand, you avoid wasting pre-allocated buffer space. By
increasing the maximum buffer size, you provide credits which can reduce or
eliminate fallbacks to rendezvous protocol. The application’s peak demand and
fallback frequency can vary from run to run, and the amount of variation may
depend on the nature of the application. If the application’s peak demand is larger
than the pre-allocated early arrival buffer size, the use of malloc and free may
cause a performance impact. The credit flow control will guarantee that the
application’s peak demand will never exceed the specified maximum. However, if
you pick a maximum that cannot be satisfied, it is possible for an MPI application
that does aggressive but valid flooding of a single destination to fail in a malloc.
Accessibility information
Accessibility information for IBM products is available online. Visit the IBM
Accessibility Center at:
http://www.ibm.com/able/
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may be
used instead. However, it is the user’s responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you any
license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact the IBM
Intellectual Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other country
where such provisions are inconsistent with local law:
Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those
Web sites. The materials at those Web sites are not part of the materials for this
IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
The licensed program described in this document and all licensed material available
for it are provided by IBM under terms of the IBM Customer Agreement, IBM
International Program License Agreement or any equivalent agreement between us.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM
has not tested those products and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those
products.
All statements regarding IBM’s future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
Each copy or any portion of these sample programs or any derivative work must
include a copyright notice as follows:
All implemented function in the PE MPI product is designed to comply with the
requirements of the Message Passing Interface Forum, MPI: A Message-Passing
Interface Standard. The standard is documented in two volumes, Version 1.1,
University of Tennessee, Knoxville, Tennessee, June 6, 1995 and MPI-2: Extensions
to the Message-Passing Interface, University of Tennessee, Knoxville, Tennessee,
July 18, 1997. The second volume includes a section identified as MPI 1.2 with
clarifications and limited enhancements to MPI 1.1. It also contains the extensions
identified as MPI 2.0. The three sections, MPI 1.1, MPI 1.2 and MPI 2.0 taken
together constitute the current standard for MPI.
PE MPI provides support for all of MPI 1.1 and MPI 1.2. PE MPI also provides
support for all of the MPI 2.0 Enhancements, except the contents of the chapter
titled Process Creation and Management.
If you believe that PE MPI does not comply with the MPI standard for the portions
that are implemented, please contact IBM Service.
Trademarks
The following are trademarks of International Business Machines Corporation in the
United States, other countries, or both:
AFS
AIX
AIX 5L
AIXwindows
DFS
e (logo)
IBM
IBM (logo)
IBMLink
LoadLeveler
POWER
POWER3
POWER4
pSeries
RS/6000
SP
System p5
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks
of others.
Acknowledgments
The PE Benchmarker product includes software developed by the Apache Software
Foundation, http://www.apache.org.
core file. A file that preserves the state of a program, usually just before a program is terminated for an unexpected error. See also core dump.

current context. When using the pdbx debugger, control of the parallel program and the display of its data can be limited to a subset of the tasks belonging to that program. This subset of tasks is called the current context. You can set the current context to be a single task, multiple tasks, or all the tasks in the program.

D

data decomposition. A method of breaking up (or decomposing) a program into smaller parts to exploit parallelism. One divides the program by dividing the data (usually arrays) into smaller parts and operating on each part independently.

data parallelism. Refers to situations where parallel tasks perform the same computation on different sets of data.

dbx. A symbolic command-line debugger that is often provided with UNIX systems. The PE command-line debugger pdbx is based on the dbx debugger.

debugger. A debugger provides an environment in which you can manually control the execution of a program. It also provides the ability to display the program's data and operation.

E

Ethernet. A baseband local area network (LAN) that allows multiple stations to access the transmission medium at will without prior coordination, avoids contention by using carrier sense and deference, and resolves contention by using collision detection and delayed retransmission. Ethernet uses carrier sense multiple access with collision detection (CSMA/CD).

event. An occurrence of significance to a task — the completion of an asynchronous operation such as an input/output operation, for example.

executable. A program that has been link-edited and therefore can be run in a processor.

execution. To perform the actions specified by a program or a portion of a program.

expression. In programming languages, a language construct for computing a value from one or more operands.

F

fairness. A policy in which tasks, threads, or processes must be allowed eventual access to a resource for which they are competing. For example, if multiple threads are simultaneously seeking a lock, no set of circumstances can cause any thread to wait indefinitely for access to the lock.
of POWER™ microprocessor based systems (including RS/6000 SMPs, RS/6000 SP nodes, and pSeries SMPs).

IBM Parallel Environment (PE) for AIX. A licensed program that provides an execution and development environment for parallel C, C++, and FORTRAN programs. It also includes tools for debugging, profiling, and tuning parallel programs.

installation image. A file or collection of files that are required in order to install a software product on an IBM RS/6000 workstation or on SP system nodes. These files are in a form that allows them to be installed or removed with the AIX installp command. See also fileset, licensed program, and package.

Internet. The collection of worldwide networks and gateways that function as a single, cooperative virtual network.

Internet Protocol (IP). (1) The TCP/IP protocol that provides packet delivery between the hardware and user processes. (2) The SP switch library, provided with the IBM Parallel System Support Programs for AIX, that follows the IP protocol of TCP/IP.

IP. Internet Protocol.

J

Jacobi-Seidel. See Gauss-Seidel.

K

Kerberos. A publicly available security and authentication product that works with the IBM Parallel System Support Programs for AIX software to authenticate the execution of remote commands.

kernel. The core portion of the UNIX operating system that controls the resources of the CPU and allocates them to the users. The kernel is memory-resident, is said to run in kernel mode (in other words, at higher execution priority level than user mode), and is protected from user tampering by the hardware.

L

Laplace's equation. A homogeneous partial differential equation used to describe heat transfer, electric fields, and many other applications.

latency. The time interval between the instant when an instruction control unit initiates a call for data transmission, and the instant when the actual transfer of data (or receipt of data at the remote end) begins. Latency is related to the hardware characteristics of the system and to the different layers of software that are involved in initiating the task of packing and transmitting the data.

licensed program. A collection of software packages sold as a product that customers pay for to license. A licensed program can consist of packages and file sets a customer would install. These packages and file sets bear a copyright and are offered under the terms and conditions of a licensing agreement. See also fileset and package.

lightweight corefiles. An alternative to standard AIX corefiles. Corefiles produced in the Standardized Lightweight Corefile Format provide simple process stack traces (listings of function calls that led to the error) and consume fewer system resources than traditional corefiles.

LoadLeveler. A job management system that works with POE to let users run jobs and match processing needs with system resources, in order to make better use of the system.

local variable. A variable that is defined and used only in one specified portion of a computer program.

loop unrolling. A program transformation that makes multiple copies of the body of a loop, also placing the copies within the body of the loop. The loop trip count and index are adjusted appropriately so the new loop computes the same values as the original. This transformation makes it possible for a compiler to take additional advantage of instruction pipelining, data cache effects, and software pipelining. See also optimization.

M

management domain. A set of nodes configured for manageability by the Clusters Systems Management (CSM) product. Such a domain has a management server that is used to administer a number of managed nodes. Only management servers have knowledge of the whole domain. Managed nodes only know about the servers managing them; they know nothing of each other. Contrast with peer domain.

menu. A list of options displayed to the user by a data processing system, from which the user can select an action to be initiated.

message catalog. A file created using the AIX Message Facility from a message source file that contains application error and other messages, which can later be translated into other languages without having to recompile the application source code.

message passing. Refers to the process by which parallel tasks explicitly exchange program data.

Message Passing Interface (MPI). A standardized API for implementing the message-passing model.

MIMD. Multiple instruction stream, multiple data stream.

MPMD. Multiple program, multiple data.

Multiple program, multiple data (MPMD). A parallel programming model in which different, but related, programs are run on different sets of data.

option flag. Arguments or any other additional information that a user specifies with a program name. Also referred to as parameters or command-line options.
domain has no distinguished or master node. All nodes are aware of all other nodes, and administrative commands can be issued from any node in the domain. All nodes also have a consistent view of the domain membership. Contrast with management domain.

performance monitor. A utility that displays how effectively a system is being used by programs.

PID. Process identifier.

POE. parallel operating environment.

pool. Groups of nodes on an SP system that are known to LoadLeveler, and are identified by a pool name or number.

profiling. The act of determining how much CPU time is used by each function or subroutine in a program. The histogram or table produced is called the execution profile.

Program Marker Array. An X-Windows run time monitor tool provided with parallel operating environment, used to provide immediate visual feedback on a program's execution.

pthread. A thread that conforms to the POSIX Threads Programming Model.

R

reduced instruction-set computer. A computer that uses a small, simplified set of frequently-used instructions for rapid execution.

reduction operation. An operation, usually mathematical, that reduces a collection of data by one or more dimensions. For example, the arithmetic SUM operation is a reduction operation that reduces an array to a scalar value. Other reduction operations include MAXVAL and MINVAL.

S

signal handling. A type of communication that is used by message passing libraries. Signal handling involves using AIX signals as an asynchronous way to move data in and out of message buffers.

Single program, multiple data (SPMD). A parallel programming model in which different processors execute the same program on different sets of data.

source code. The input to a compiler or assembler, written in a source language. Contrast with object code.
target application. See DPCL target application.

task. A unit of computation analogous to an AIX process.

view. (1) To display and look at data on screen. (2) A special display of data, created as needed. A view temporarily ties two or more files together so that the combined files can be displayed, printed, or queried. The user specifies the fields to be included. The original files are not permanently linked or altered; however, if the system allows editing, the data in the original files will be changed.
X
X Window System. The UNIX industry’s graphics
windowing standard that provides simultaneous views of
several executing programs or processes on high
resolution graphics displays.
F
file descriptor numbers 29
file handle 66
file operation constants 59
flags, command-line
   -buffer_mem 76, 220
   -clock_source 77
   -css_interrupt 77
   -eager_limit 78, 219
   -hints_filtered 78
   -hostfile 17
   -infolevel 39
   -io_buffer_size 81
   -io_errlog 81
   -ionodefile 78
   -msg_api 72
   -polling_interval 79
   -printenv 86
   -procs 16
   -retransmit_interval 79
   -rexmit_buf_cnt 82
   -rexmit_buf_size 81
   -shared_memory 2, 15, 80
   -single_thread 80
   -stdoutmode 31
   -thread_stacksize 81
   -udp_packet_size 81
   -wait_mode 81
FORTRAN 77 175
FORTRAN 90 175
FORTRAN 90 datatype matching constants 63
FORTRAN bindings 11, 12
FORTRAN language binding datatypes 50
FORTRAN reduction function datatypes 51
function overloading 33
functions
   MPI 155
G
General Parallel File System (GPFS) 20
gprof 11
H
hidden threads 21
High Performance FORTRAN (HPF) 175
hint filtering 23
J
job control 27
Job Specifications 69
job step progression 27
job step termination 27
   default 27
K
key collision 16
key, value pair 23
ksh93 30
L
language bindings
   MPI 33
LAPI 1, 17, 43, 44
   sliding window protocol 4
   used with MPI 43
LAPI data transfer function 3
LAPI dispatcher 4, 6, 9, 10
LAPI parallel program 45
LAPI protocol 25
LAPI send side copy 7
LAPI user message 7
LAPI_INIT 45
LAPI_TERM 45
LAPI_USE_SHM environment variable 34
limits, system
   on size of MPI elements 65
llcancel 16
LoadLeveler 9, 26, 67
LookAt message retrieval tool xii
M
M:N threads 38
malloc and free 220
MALLOCDEBUG 35
MALLOCTYPE 35
maximum sizes 58
maximum tasks per node 67
message address range 16
message buffer 16, 25
message credit 4, 5, 65, 219, 220
message descriptor 4
message envelope 5
message envelope buffer 65
message packet transfer 7
message passing
   profiling 11
message queue 41
message retrieval tool, LookAt xii
message traffic 9
message transport mechanisms 1
messages
   buffering 219
miscellaneous environment variables and flags 69
mixed parallelism with MPI and threads 43
MP_ACK_INTERVAL environment variable 44
MP_ACK_THRESH environment variable 9, 44, 76
MP_BUFFER_MEM environment variable 4, 65, 66, 76, 81, 82, 220
MP_CC_SCRATCH_BUFFER environment variable 9
MP_CLOCK_SOURCE environment variable 34, 77
MP_CSS_INTERRUPT environment variable 6, 21, 42, 44, 77
MP_EAGER_LIMIT environment variable 3, 4, 65, 66, 219
MP_EUIDEVELOP environment variable 30, 149, 151
MP_EUIDEVICE environment variable 8
MP_EUILIB environment variable 2, 3
MP_HINTS_FILTERED environment variable 23, 78
MP_HOSTFILE environment variable 16
MP_INFOLEVEL environment variable 33, 39
MP_INSTANCES environment variable 8
mp_intrdelay 44
MP_INTRDELAY environment variable 44
MP_IO_BUFFER_SIZE environment variable 24, 81
MP_IO_ERRLOG environment variable 23, 81
MP_IONODEFILE environment variable 20, 78
MP_LAPI_INET_ADDR environment variable 17
MP_MSG_API environment variable 43, 72
MP_PIPE_SIZE environment variable 44
MP_POLLING_INTERVAL environment variable 7, 42, 79
MP_PRINTENV environment variable 86
MP_PRIORITY environment variable 10
MP_PROCS environment variable 16, 65
MP_RETRANSMIT_INTERVAL environment variable 10, 79
MP_REXMIT_BUF_CNT environment variable 7
MP_REXMIT_BUF_SIZE environment variable 7
MP_SHARED_MEMORY environment variable 1, 2, 15, 30, 34, 80, 149, 151
MP_SINGLE_THREAD environment variable 7, 20, 21, 36, 37, 80
MP_SNDBUF environment variable 31
MP_STATISTICS environment variable 9, 10, 220
MP_STDOUTMODE environment variable 31
MP_SYNC_ON_CONNECT environment variable 44
MP_TASK_AFFINITY environment variable 10
MP_THREAD_STACKSIZE environment variable 36, 81
MP_TIMEOUT environment variable 81
MP_UDP_PACKET_SIZE environment variable 2, 44, 81
MP_USE_BULK_XFER environment variable 9, 44
MP_UTE environment variable 86
MP_WAIT_MODE environment variable 6, 81
MPCI 44
MPE subroutine bindings 151
MPE subroutines 149
MPI 69
   functions 155
   subroutines 155
   used with LAPI 43
MPI application exit without setting exit value 27
MPI applications
   performance 1
MPI constants 57, 58, 59, 60, 61, 62, 63
MPI datatype 19, 49
MPI eager limit 66
MPI envelope 7
MPI internal locking 7
MPI IP performance 2
MPI library 37
   architecture considerations 33
MPI Library
   performance 1
MPI message size 7
MPI reduction operations 53
MPI size limits 65
MPI subroutine bindings 175
MPI wait call 1, 3, 4, 6
MPI_Abort 26, 28
MPI_ABORT 26, 28
MPI_File 19
MPI_File object 22
MPI_Finalize 27
MPI_FINALIZE 27, 37, 45
MPI_INIT 37, 45
MPI_INIT_THREAD 37
MPI_THREAD_FUNNELED 37
MPI_THREAD_MULTIPLE 37
MPI_THREAD_SINGLE 37
MPI_WAIT_MODE environment variable 42
MPI_WTIME_IS_GLOBAL 34
MPI-IO
   API user tasks 20
   considerations 20
   data buffer size 24
   datatype constructors 23
   deadlock prevention 19
   definition 19
   error handling 22
   features 19
   file interoperability 24
   file management 20
   file open 21
   file tasks 21
   hidden threads 21
   I/O agent 20
   Info objects 23
   logging errors 23
   portability 19
   robustness 19
   versatility 19
MPI-IO constants 59
MPL 25
POE command-line flags (continued)
   -infolevel 39, 69, 76
   -instances 71
   -io_buffer_size 81
   -io_errlog 81
   -ionodefile 78
   -labelio 75
   -llfile 73
   -msg_api 72
   -newjob 73
   -nodes 72
   -pgmmodel 73
   -pmdlog 76
   -polling_interval 79
   -printenv 86
   -priority_log 85
   -priority_ntp 85
   -procs 16, 71
   -pulse 71
   -rdma_count 80
   -resd 71
   -retransmit_interval 79
   -retry 71
   -retrycount 71
   -rexmit_buf_cnt 82
   -rexmit_buf_size 81
   -rmpool 72
   -save_llfile 73
   -savehostfile 72
   -shared_memory 2, 15, 80
   -single_thread 80
   -stdinmode 75
   -stdoutmode 31, 75
   -task_affinity 74
   -tasks_per_node 72
   -thread_stacksize 81
   -udp_packet_size 81
   -use_bulk_xfer 79
   -wait_mode 81
POE considerations
   64-bit application 42
   AIX 37
   AIX function limitation 30
   AIX message catalog considerations 33
   architecture 33
   automount daemon 30
   checkpoint and restart 38
   child task 37
   collective communication call 38
   entry point 36
   environment overview 25
   exit status 26
   exits, abnormal 29
   exits, normal 29
   exits, parallel task 29
   file descriptor numbers 29
   fork limitations 37
   job step default termination 27
   job step function 27
   job step progression 27
   job step termination 27
   job termination 29
   language bindings 33
   large numbers of tasks 34
   LoadLeveler 26
   M:N threads 38
   MALLOCDEBUG 35
   mixing collective 30
   MPI_INIT 37
   MPI_INIT_THREAD 37
   MPI_WAIT_MODE 42
   network tuning 31
   nopoll 42
   order requirement for system includes 37
   other differences 45
   parent task 37
   POE additions 27
   remote file system 30
   reserved environment variables 33
   root limitation 30
   shell scripts 30
   signal handler 28
   signal library 25
   single threaded 36
   STDIN, STDOUT, or STDERR 30, 32
   STDIN, STDOUT, or STDERR, output 31
   STDIN, STDOUT, or STDERR, rewinding 30
   task initialization 36
   thread stack size 36
   thread termination 37
   thread-safe libraries 37
   threads 35
   user limits 26
   user program, passing string arguments 31
   using MPI and LAPI together 43
   virtual memory segments 34
POE environment variables 69
   MP_ACK_INTERVAL 44
   MP_ACK_THRESH 9, 44, 76
   MP_ADAPTER_USE 70
   MP_BUFFER_MEM 4, 65, 66, 76, 220
   MP_BULK_MIN_MSG_SIZE 80
   MP_BYTECOUNT 84
   MP_CC_SCRATCH_BUFFER 9
   MP_CKPTDIR 73
   MP_CKPTDIR_PERTASK 73
   MP_CKPTFILE 73
   MP_CLOCK_SOURCE 34, 77
   MP_CMDFILE 73
   MP_COREDIR 83
   MP_COREFILE_FORMAT 83
   MP_COREFILE_SIGTERM 83
   MP_CPU_USE 70
   MP_CSS_INTERRUPT 6, 21, 42, 44, 77
   MP_DBXPROMPTMOD 84
   MP_DEBUG_INITIAL_STOP 76
   MP_DEBUG_NOTIMEOUT 76
   MP_EAGER_LIMIT 3, 4, 65, 66, 78, 219
   MP_EUIDEVELOP 30, 84, 149, 151
   MP_EUIDEVICE 8, 70
   MP_EUILIB 2, 3, 70
shmget 34
sigaction 28
SIGALRM 29
SIGIO 29
signal handler 28, 36
   POE 28
   user defined 28, 29
signal library 25
SIGPIPE 29
sigwait 28
Simultaneous Multi-Threading (SMT) 67
single thread considerations 6
single threaded applications 36
sockets 40
special datatypes 61
special purpose datatypes 49
striping 8
subroutine bindings 151, 175
   collective communication 175
   communicator 178
   conversion functions 182
   derived datatype 183
   environment management 189
   external interfaces 191
   group management 193
   Info object 195
   memory allocation 196
   MPI-IO 197, 205
   non-blocking collective communication 151
   one-sided communication 205
   point-to-point communication 208
   profiling control 214
   topology 214
subroutines
   MPE 149
   MPI 155
   non-blocking collective communication 149
   parallel task identification API 145
   parallel utility subroutines 87
   poe_master_tasks 146
   poe_task_info 147
switch clock 34
system contention scope 38
system limits
   on size of MPI elements 65

T
tag 66
task limits 67
task synchronization 25
thread context 6
thread stack size
   default 36
thread-safe library 7
threaded MPI library 25
threaded programming 35
threads and mixed parallelism with MPI 43
threads constants 63
threads library 25
threads library considerations
   AIX signals 28
topologies 59
trademarks 227
tuning parameter
   sb_max 2
   udp_recvspace 2
   udp_sendspace 2

U
UDP ports 8
UDP/IP 2, 4
UDP/IP transport 2
unacknowledged packets 10
unsent data 66
upcall 4, 6
user resource limits 26
User Space 1, 2, 4
User Space FIFO mechanism 8
User Space FIFO packet buffer 8
User Space library 1
User space protocol 43
User Space transport 2, 3, 6
User Space window 3

V
virtual address space 9
virtual memory segments 34

W
wait
   MPI 1, 2, 3, 4, 6
window 66

X
xprofiler 11
Readers' Comments — We'd Like to Hear from You

Overall, how satisfied are you with the information in this book?

When you send comments to IBM, you grant IBM a nonexclusive right to use or distribute your comments in any way it believes appropriate without incurring any obligation to you.

Mail your comments to:

IBM Corporation
Department 55JA, Mail Station P384
2455 South Road
Poughkeepsie, NY 12601-5400

SA22-7945-04