
Advanced High Performance Computing Using Linux
TABLE OF CONTENTS

1.0 Advanced Linux


1.1 Emacs: The Most Advanced Text Editor
1.2 Advanced Scripting

2.0 Advanced HPC


2.1 Computer System Architectures
2.2 Processors and Cores
2.3 Parallel Processing Performance

3.0 Introductory MPI Programming


3.1 The Story of the Message Passing Interface (MPI) and OpenMPI
3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers
3.3 MPI's Introductory Routines

4.0 Intermediate MPI Programming


4.1 MPI Datatypes

4.2 Intermediate Routines


4.3 Collective Communications
4.4 Derived Data Types
4.5 Particle Advector

4.6 Creating A New Communicator


4.7 Profiling Parallel Programs
4.8 Debugging MPI Applications

1.0 Advanced Linux

1.1 Emacs: The Most Advanced Text Editor

In the introductory course, we used nano as our text editor. In the intermediate
course, we used vim. Finally, in this advanced course, we'll provide an
introduction to Emacs, which can be described as the most advanced text editor
and is particularly popular among programmers.

Emacs is one of the oldest continuously developed software applications available, first written in 1976 by Richard Stallman, founder of the GNU free software movement. At the time of writing it was up to version 24, with a substantial number of forks and clones developed during its history.

The big feature of Emacs is its extremely high level of built-in commands, customisation, and extensions, so extensive that those explored here only begin to touch the extraordinarily diverse world that is Emacs. Indeed, Eric Raymond notes "[i]t is a common joke, both among fans and detractors of Emacs, to describe it as an operating system masquerading as an editor".

With extensions, Emacs can edit LaTeX formatted documents, provide syntax highlighting for major programming and scripting languages, and includes a calculator, a calendar and planner, a web browser, a news reader and email client, and an ftp client. It provides file difference, merging, and version control, a text-based adventure game, and even a Rogerian psychotherapist.

Try doing that with Notepad!

This all said, Emacs is not easy to learn for beginners. The level of customisation and the detailed use of meta- and control- characters does serve as a barrier to immediate entry.

This tutorial will provide a usable introduction to Emacs.

1.1.1 Starting Emacs

The default Emacs installation on contemporary Linux systems assumes the use of a graphical user interface. This is obviously not the case with an HPC system, but for those with a home installation you should be aware that 'emacs -nw' from the command line will launch the program without the GUI. If you wish to make this the default you should add it as an alias to .bashrc (e.g., alias emacs='emacs -nw').

Emacs is launched by simply typing 'emacs' on the command line. Commands are invoked by a combination of the Control (Ctrl) key and a character key (C-<chr>) or the Meta key (Alt, or Esc) and a character key (M-<chr>). Of course, if you're using a terminal emulator, as is often the case, the "alt" key will probably be superseded by the emulator itself, so you'll want to use Esc instead. However the Esc key is not a shift key; rather than hold it down, type it distinctly.

1.1.2 Basic Commands

To quit Emacs use C-x C-c; you'll use this a lot as a beginner! Note that with all Emacs commands this represents two sets of keystrokes. The space is not actually typed.

To "break" from a partially entered command, C-g.

If an Emacs session crashed recently, M-x recover-session can recover the files
that were being edited.

The menubar can be activated with M-`

The help files are accessed with C-h and the manual with C-h r.

1.1.3 Files, Buffers, and Windows

Emacs has three main data structures, Files, Buffers, and Windows which are
essential to understand.

A file is the actual file on disk. Strictly, when using Emacs one does not actually edit the file. Rather, the file is copied into a buffer, then edited, and then saved. Buffers can be deleted without deleting the file on disk.

The buffer is a data space within Emacs for editing a copy of the file. Emacs can handle many buffers simultaneously, the effective limit being the maximum buffer size, determined by the integer capacity of the processor and memory (e.g., for 64-bit machines, this maximum buffer size is 2^61 - 2 bytes). A buffer has a name, usually after the file from which it has copied the data.

A window is the user's view of a buffer. Not all buffers may be visible to the user
at once due to the limits of screen size. A user may split the screen into multiple
windows. Windows can be created and deleted, without deleting the buffer
associated with the window.

Emacs also has a blank line below the mode line to display messages, and for input for prompts from Emacs. This is called the mini-buffer, or echo area.

1.1.4 Exploring and Entering Text

Cursor keys can be used to move around the text, along with Page Up and Page Down, if the terminal supports them. However Emacs aficionados will recommend the use of the control key for speed. Common commands include the following; you may notice a pattern in the command logic:

C-v (move page down), M-v (move page up)
C-p (move previous line), C-n (move next line)
C-f (move forward, one character), C-b (move backward, one character)
M-f (move forward, one word), M-b (move backward, one word)
C-a (move to beginning of a line), C-e (move to end of a line)
M-a (move backward, to beginning of a sentence)
M-e (move forward, to end of a sentence)
M-{ (move backward, beginning of a paragraph), M-} (end of paragraph)
M-< (move to beginning of a text), M-> (move to end of a text).

<backspace> (delete the character just before the cursor)
C-d (delete the character on the cursor)
M-<backspace> (cut the word before the cursor)
M-d (cut the word after the cursor)
C-k (cut from the cursor position to end of line)
M-k (cut to the end of the current sentence)
C-q (prefix command; use when you want to enter a control key into the buffer, e.g., C-q ESC inserts an Escape)

Like the page-up and page-down keys on a standard keyboard you will discover
that Emacs also interprets the Backspace and Delete key as expected.

A selection can be cut (or 'killed' in Emacs lingo) by marking the beginning of the selected text with C-SPC (space), moving to the end of the selection with standard cursor movements, and entering C-w. Text that has been cut can be pasted ('yanked') by moving the cursor to the appropriate location and entering C-y.

Emacs commands also accept a numeric input for repetition, in the form of C-u, the number of times the command is to be repeated, followed by the command (e.g., C-u 8 C-n moves eight lines down the screen).

1.1.5 File Management

There are only three main file manipulation commands that a user needs to
know; how to find a file, how to save a file from a buffer, and how to save all.

The first command is C-x C-f, shorthand for "find-file". This command first prompts for the name of the file. If it is already copied into a buffer it will switch to that buffer. If it is not, it will create a new buffer with the name requested.

For the second command, to save a buffer to a file with the buffer name, use C-x C-s, shorthand for "save-buffer".

The third command is C-x s. This is shorthand for "save-some-buffers" and will cycle through each open buffer and prompt the user for their action (save, don't save, check and maybe save, etc.).

1.1.6 Buffer Management

There are four main commands relating to buffer management that a user needs
to know. How to switch to a buffer, how to list existing buffers, how to kill a
buffer, and how to read a buffer in read-only mode.

To switch to a buffer, use C-x b. This will prompt for a buffer name, and switch the buffer of the current window to that buffer. It does not change your existing windows. If you type a new name, it will create a new empty buffer.

To list current active buffers, use C-x C-b. This will provide a new window which lists the current buffers by name, whether they have been modified, their size, and the file that they are associated with.

To kill a buffer, use C-x k. This will prompt for the buffer name, and then remove
the data for that buffer from Emacs, with an opportunity to save it. This does not
delete any associated files.

To toggle read-only mode on a buffer, use C-x C-q.

1.1.7 Window Management

Emacs has its own windowing system, consisting of several areas of framed text. The behaviour is similar to a tiling window manager; none of the windows overlap with each other.

Commonly used window commands include:

C-x 0 delete the current window
C-x 1 delete all windows except the selected window
C-x 2 split the current window horizontally
C-x 3 split the current window vertically
C-x ^ make selected window taller
C-x } make selected window wider
C-x { make selected window narrower
C-x + make all windows the same height

A common use is to bring up other documents or menus. For example, the key sequence C-h usually calls for help files. If this is followed by k, it will open a new vertical window, and with C-f, it will display the help information for the command C-f (i.e., C-h k C-f). This new window can be closed with C-x 1.

1.1.8 Kill and Yank, Search-Replace, Undo

Emacs is notable for having a very large undo sequence, limited by system resources rather than application resources. This undo sequence is invoked with C-_ (control underscore), or with C-x u. However it has a special feature: by engaging in a simple navigation command (e.g., C-f) the undo action is pushed to the top of the stack, and therefore the user can undo an undo command.

1.1.9 Other Features

Emacs can make it easier to read C and C++ by colour-coding such files; enable this through the ~/.emacs configuration file by adding the line (global-font-lock-mode t).

Programmers also find it useful to be able to run the GNU debugger (GDB) from within Emacs. The command M-x gdb will start up gdb. If there's a breakpoint, Emacs automatically pulls up the appropriate source file, which gives better context than standard GDB.

1.2 Advanced Scripting

Good knowledge of scripting is required for any advanced Linux user, and especially those who find that they have regular tasks, such as the processing of data through a program. Shell scripting is not terribly difficult, although sometimes some austere syntax bugs may prove frustrating - but the machine is just doing what you asked it to. Despite their often under-rated utility, shell scripts are not the answer to everything. They are not great at resource-intensive tasks (e.g., extensive file operations) where speed is important. They are not recommended for heavy-duty maths operations (use C, C++, or Fortran instead). They are not recommended in situations where data structures, multi-dimensional arrays (it's not a database!) and port/socket I/O are important.

In the Intermediate Course, we looked at scripting in reference to regular expression utilities, such as sed and the programming language awk, along with some simple examples of using Linux command invocations as variables in a backup script, some sample "for", "while", "do/done" and "until" loops along with simple, optional, ladder, and nested conditionals using "if", "then", "else", "elif" and "fi", the use of "break" and "continue", the "case" conditional, and "select" for user input. The implementation of these into PBS job submission scripts was also illustrated. In this Advanced course we will revisit these concepts but with more sophisticated and complex examples. In addition there will be a close look at internal commands and filters, process substitution, functions, arrays, and debugging.

1.2.1 Scripts With Variables

The simplest script is simply one that runs a list of system commands. At least this saves the time of retyping the sequence each time it is used, and reduces the possibility of error. For example, in the Intermediate course, the following script was recommended to calculate the disk use in a directory. It's a good script, very handy, but how often would you want to type it? Instead, enter it once and keep it. You will recall, of course, that a script starts with an invocation of the shell, followed by commands.

emacs diskuse.sh

#!/bin/bash
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > diskuse.txt

C-x C-c, y for save

chmod +x diskuse.sh

As described in the Intermediate course, the script runs disk usage in summary, sorts in order of size and exports the result to the file diskuse.txt. The -d "\n" option ensures that filenames containing spaces are handled correctly.

Making the script a little more complex, variables are usually better than hard-coded values. There are two potential variables in this script, the wildcard '*' and the exported filename "diskuse.txt". In the former case, we'll keep the wildcard as it allows a certain portability of the script - it can run in any directory it is invoked from. For the latter case, however, we'll use the date command so that a history of disk use can be created which can be reviewed for changes. It is also good practice to alert the user when the script is completed and, although it is not strictly necessary, it is also good practice to cleanly finish any script with 'exit'.

emacs diskuse.sh

#!/bin/bash
DU=diskuse$(date +%Y%m%d).txt
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > $DU
echo "Disk summary completed and sorted."
exit

C-x C-c, y for save

1.2.2 Variables and Conditionals

Another example is a script with conditionals as well as variables. A common conditional, and sadly often forgotten, is whether or not a script has the requisite files for input and output specified. If an input file is not specified, a script that performs an action on the file will simply go idle and never complete. If an output file is hardcoded, then the person running the script runs the risk of overwriting a file with the same name, which could be a disaster.

The following script searches through any specified text file for text before and after the ubiquitous email "@" symbol and outputs these as a csv file through use of grep, sed, and sort (for neatness). If the input or output file is not specified, it exits after echoing an error.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
INPUT=${1}
OUTPUT=${2}

{
if [ ! -f "$INPUT" ] || [ -z "$OUTPUT" ]; then
echo "Input file not found, or output file not specified. Exiting script."
exit 1
fi
}
grep --only-matching -E '[.[:alnum:]]+@[.[:alnum:]]+' $INPUT > $OUTPUT
sed -i 's/$/,/g' $OUTPUT
sort -u $OUTPUT -o $OUTPUT
sed -i '{:q;N;s/\n/ /g;t q}' $OUTPUT
echo "Data file extracted to" $OUTPUT
exit

C-x C-c, y for save

chmod +x findemails.sh

Test this file with hidden.txt as the input text and found.csv as the output text.
The output will include a final comma on the last line but this is potentially useful
if one wants to run the script with several input files and append to the same
output file (simply change the single redirection in the grep statement to an
double appended redirection.

A serious weakness of the script (so far) is that it will gather any string with the '@' symbol in it, regardless of whether it's a well-formed email address or not. So it's not quite suitable for screen-scraping usenet for email addresses to turn into a spammer's list. But it's getting close.

1.2.3 Reads

The read command simply reads a line from standard input. By applying the -n option it can read in a number of characters, rather than a whole line, so -n1 is "read a single character". The -r option reads the input as raw input, so that the backslash (for example) is not interpreted as an escape character, and the -p option displays a prompt. In addition, a -t timeout in seconds can also be added. Combined, these can be used to the effect of "press any key to continue", with a limited timeframe.

Add the following to findemails.sh at the end of the file.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
..
..
read -t5 -n1 -r -p "Press any key to see the list, sorted and with unique records..."
if [ $? -eq 0 ]; then
echo A key was pressed.
else
echo No key was pressed.
exit 0
fi

less $OUTPUT | \
# Output file, piped through sort and uniq.
sort | uniq

exit

C-x C-c, y for save

1.2.4 Special Characters

Scripts essentially consist of commands, keywords, and special characters. Special characters have meaning beyond their literal meaning (a meta-meaning, if you like). Comments are the most common example.

Any text following a # (with the exception of #!) is a comment and will not be executed. Comments may begin at the beginning of a line, following whitespace, following the end of a command, and may even be embedded within a piped command (as in the example in section 1.2.3 above).

A comment ends at the end of the line, and as a result a command may not follow
a comment on the same line. A quoted or an escaped # in an echo statement
does not begin a comment.

Another special character is the command separator, a semicolon, which is used to permit two or more commands on the same line. This is already shown by the various tests in the script (e.g., if [ ! -f "$INPUT" ] || [ -z "$OUTPUT" ]; then and if [ $? -eq 0 ]; then). Note the space after the semicolon. In contrast a double semicolon (;;) represents a terminator in a case option, which was encountered in the extract script in the Intermediate course.

..
case $1 in
*.tar.bz2) tar xvjf $1 ;;
*.tar.gz)  tar xvzf $1 ;;
*.bz2)     bunzip2 $1  ;;
..
..
esac

In contrast, the colon acts as a null command. Whilst this obviously has a variety of uses (e.g., an alternative to the touch command), a really practical advantage is that it returns a true exit status, and as such it can be used as a placeholder in if/then tests. An example from the Intermediate course:

for i in *.plot.dat; do
if [ -f $i.tmp ]; then
: # do nothing and exit if-then
else
touch $i.tmp
fi
done

The use of the null command as the test at the beginning of a loop will cause it to run endlessly (e.g., while :; do ... done) as the test always evaluates as true. Note that the colon is also used as a field separator in /etc/passwd and in the $PATH variable.

A dot (.) has multiple special character uses. As a command it sources a filename, importing the code into a script, rather like the #include directive in a C program. This is very useful in situations when multiple scripts use a common data file, for example (e.g., . hidden.txt). As part of a filename, as was shown in the Introductory course, the . represents the current working directory (e.g., cp -r /path/to/directory/ ., and of course .. for the parent directory). A third use for the dot is in regular expressions, matching one character per dot. A final use is multiple dots in sequence as a brace-expansion range in a loop, e.g.,

for a in {1..10}
do
echo -n "$a "
done

Like the dot, the comma operator has multiple uses. Usually it is used to link multiple arithmetic calculations. This is typically used in for loops with a C-like syntax, e.g.,

LIMIT=5
for ((a=1, b=1; a <= LIMIT ; a++, b++))
do # The comma concatenates operations.
echo -n "$a-$b "
done

Enclosing a referenced value in double quotes (" ... ") does not interfere with
variable substitution. This is called partial quoting, sometimes referred to as
"weak quoting." Using single quotes (' ... ') causes the variable name to be used
literally, and no substitution will take place. This is full quoting, sometimes
referred to as 'strong quoting.' It can also be used to combine strings.
for file in /{,usr/}bin/*sh
do
if [ -x "$file" ]
then
echo $file
fi
done

For example, a strict single-quoted directory listing of ls with a wildcard will only provide files that are literally named by that symbol (which isn't a very good file name); compare ls * with ls '*'. The same applies with double quotes, and indeed double quotes are generally preferable as they prevent reinterpretation of all special characters except $, `, and \. These are usually the symbols which are wanted in their interpreted mode. As the escape character has a literal interpretation within single quotes, enclosing a single quote within single quotes will not work as expected.

Related to quoting is the use of the backslash (\) to escape single characters. Do not confuse it with the forward slash (/), which has multiple uses as both the separator in pathnames (e.g., /home/train01) and as the division operator.

In some scripts backticks (`) are used for command substitution, where the output of a command can be assigned to a variable. This is the older form, retained largely for historical compatibility. Nesting commands with backticks also requires escape characters; the deeper the nesting the more escape characters required (e.g., echo `echo \`echo \\\`pwd\\\`\``). The preferred method is to use the dollar sign and parentheses, e.g., echo "Hello, $(whoami)." rather than echo "Hello, `whoami`."

2.0 Advanced HPC

2.1 Computer System Architectures

As explained in the first, introductory, course, "high-performance computing (HPC) is the use of supercomputers and clusters to solve advanced computation problems". All supercomputers ("a nebulous term for computer that is at the frontline of current processing capacity") in contemporary times use parallel computing, "the submission of jobs or processes over one or more processors and by splitting up the task between them".

It is possible to illustrate the degree of parallelisation by using Flynn's Taxonomy of Computer Systems (1966), where each process is considered as the execution of a pool of instructions (instruction stream) on a pool of data (data stream). From this arrangement come four basic possibilities:

* Single Instruction Stream, Single Data Stream (SISD)
* Single Instruction Stream, Multiple Data Streams (SIMD)
* Multiple Instruction Streams, Single Data Stream (MISD)
* Multiple Instruction Streams, Multiple Data Streams (MIMD)

2.1.1 Single Instruction Stream, Single Data Stream (SISD)


(Image from Oracle Essentials, 4th edition, O'Reilly Media, 2007)

This is the simplest and, until recently, the most common processor architecture on desktop computer systems. Also known as a uniprocessor system, it offers a single instruction stream and a single data stream. Uniprocessors could however simulate or include concurrency through a number of different methods:

a) It is possible for a uniprocessor system to run processes concurrently by switching between one and another.
b) Superscalar instruction-level parallelism can be used on uniprocessors. More than one instruction during a clock cycle is simultaneously dispatched to different functional units on the processor.

c) Instruction prefetch, where an instruction is requested from main memory before it is actually needed and placed in a cache. This often also includes a prediction algorithm of what the instruction will be.

d) Pipelines, on the instruction level or the graphics level, can also serve as an example of concurrent activity. An instruction pipeline (e.g., RISC) allows multiple instructions on the same circuitry by dividing the task into stages. A graphics pipeline implements different stages of rendering operations to different arithmetic units.

2.1.2 Single Instruction Stream, Multiple Data Streams (SIMD)

SIMD architecture represents a situation where a single processor performs the same instruction on multiple data streams. This commonly occurs in contemporary multimedia processors, for example the MMX instruction set from the 1990s, which led to Motorola's PowerPC AltiVec, and in more contemporary times the AVX (Advanced Vector Extensions) instruction set used in Intel Sandy Bridge processors and AMD's Bulldozer processors. These developments have primarily been orientated towards real-time graphics, using short vectors. Contemporary supercomputers are invariably MIMD clusters which can implement short-vector SIMD instructions.

SIMD was also used especially in the 1970s, notably on the various Cray systems. For example the Cray-1 (1976) had eight "vector registers", which held sixty-four 64-bit words each (long vectors), with instructions applied to the registers. Pipeline parallelism was used to implement vector instructions, with separate pipelines for different instructions, which themselves could be run in batch and pipelined (vector chaining). As a result the Cray-1 could have a peak performance of 240 megaflops - extraordinary for the day, and even acceptable in the early 2000s.

SIMD is also known as vector processing or data parallelism, in comparison to a regular SISD CPU which operates on scalars. SIMD lines up a row of scalar data (of uniform type) as a vector and operates on it as a unit. For example, inverting an RGB picture to produce its negative, or altering its brightness, etc. Without SIMD each pixel would have to be fetched to memory, the instruction applied to it, and then returned. With SIMD the same instruction is applied to all the data, depending on the availability of cores, i.e., get n pixels, apply instruction, return. The main disadvantages of SIMD, within the limitations of the process itself, are that it requires additional registers, power consumption, and heat.
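As a rough illustration of this data-parallel idea (a sketch written for this section, not taken from the course files), the following C function applies the same operation to every element of a pixel array; a vectorising compiler can map such a loop onto short-vector SIMD instructions:

#include <stddef.h>

/* Invert an 8-bit greyscale image to produce its negative: the same
   instruction (255 - x) is applied across the whole data stream. */
void invert_pixels(unsigned char *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pixels[i] = 255 - pixels[i];
}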

2.1.3 Multiple Instruction Streams, Single Data Stream (MISD)

Multiple Instruction, Single Data (MISD) occurs when different operations are
performed on the same data. This is quite rare and indeed debateable as it is
reasonable to claim that once an instruction has been performed on the data, it's
not thesame data anymore. If one doesn't take this definition and allows for a
variety of instructions to be applied to the same data which can change then
various pipeline architectures can be considered MISD.

Systolic arrays are another form of MISD. They are different to pipelines because they have a non-linear array structure, they have multidirectional data flow, and each processing element may even have its own local memory. In this situation a matrix pipe network arrangement of processing units computes data and stores it independently. Matrix multiplication is an example of such an array in an algorithmic form, where one matrix is introduced one row at a time from the top of the array, whereas another matrix is introduced one column at a time.

MISD machines are rare; the Cisco PXF processor is an example. They can be
fast and scalable, as they do operate in parallel, but they are really difficult to
build.

2.1.4 Multiple Instruction Streams, Multiple Data Streams (MIMD)

Multiple Instruction, Multiple Data (MIMD) machines have independent and asynchronous processes that can operate on a number of different data streams. They are now the mainstream in contemporary computer systems and can be further differentiated between multiprocessor computers and their extension, multicomputer multiprocessors. As the name clearly indicates, the former refers to single machines which have multiple processors and the latter to a cluster of these machines acting as a single entity.

Multiprocessor systems can be differentiated between shared memory and distributed memory. Shared memory systems have all processors connected to a single pool of global memory (whether by hardware or by software). This may be easier to program, but it's harder to achieve scalability. Such an architecture is quite common in single system unit multiprocessor machines.

With distributed memory systems, each processor has its own memory. Finally,
another combination is distributed shared memory, where the (physically
separate) memories can be addressed as one (logically shared) address space. A
variant combined method is to have shared memory within each multiprocessor
node, and distributed between them.

2.2 Processors and Cores

2.2.1 Uni- and Multi-Processors

A further distinction needs to be made between processors and cores. A processor is a physical device that accepts data as input and provides results as output. A uniprocessor system has one such device, although the definitions can become ambiguous. In some uniprocessor systems it is possible that there is more than one, but the entities engage in separate functions. For example, a computer system that has one central processing unit may also have a co-processor for mathematical functions and a graphics processor on a separate card. Is that system uniprocessor? Arguably not, as the co-processor will be seen as belonging to the same entity as the CPU, and the graphics processor will have different memory, system I/O, and will be dealing with different peripherals. In contrast a multiprocessor system does share memory, system I/O, and peripherals. But then the debate becomes murky with the distinction between shared and distributed memory discussed above.

2.2.2 Uni- and Multi-core

In addition to the distinction between uniprocessor and multiprocessor there is also the distinction between unicore and multicore processors. A unicore processor carries out the usual functions of a CPU, according to the instruction set: data handling instructions (set register values, move data, read and write), arithmetic and logic functions (add, subtract, multiply, divide, bitwise operations for conjunction and disjunction, negate, compare), and control-flow functions (conditionally branch to another section of a program, indirectly branch and return). A multicore processor carries out the same functions, but with independent central processing units (note lower case) called 'cores'. Manufacturers integrate the multiple cores onto a single integrated circuit die or onto multiple dies in a single chip package.

In terms of theoretical architecture, a uniprocessor system could be multicore, and a multiprocessor system could be unicore. In practice the most common contemporary architecture is multiprocessor and multicore. The number of cores is represented by a prefix. For example, a dual-core processor has two cores (e.g. AMD Phenom II X2, Intel Core Duo), a quad-core processor contains four cores (e.g. AMD Phenom II X4, Intel i3, i5, and i7), a hexa-core processor contains six cores (e.g. AMD Phenom II X6, Intel Core i7 Extreme Edition 980X), an octo-core or octa-core processor contains eight cores (e.g. Intel Xeon E7-2820, AMD FX-8350), etc.

2.2.3 Uni- and Multi-Threading

In addition to the distinctions between processors and cores, whether uni- or multi-, there is also the question of threads. An execution thread is the smallest processing unit in an operating system. A thread is typically contained inside a process. Multiple threads can exist within the same process and share resources. On a uniprocessor, multithreading generally occurs by time-division multiplexing, with the processor switching between the different threads, which may give the appearance that the tasks are happening at the same time. On a multiprocessor or multi-core system, threads become truly concurrent, with every processor or core executing a separate thread simultaneously.
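As a brief illustration (a sketch, not part of the course material), the following C program uses POSIX threads to run the same function in four threads; on a multicore system these threads can execute truly in parallel. Compile with, for example, gcc -o threads threads.c -pthread (the filename is arbitrary).

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function concurrently. */
static void *work(void *arg)
{
    long id = (long) arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, work, (void *) i);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* wait for all threads to finish */
    return 0;
}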

2.2.4 Why Is It A Multicore Future?

Ideally, don't we want clusters of multicore multiprocessors with multithreaded instructions? Of course we do; but think of the heat that this generates, think of the potential for race conditions (e.g., deadlocks, data integrity issues, resource conflicts, interleaved execution issues).

These are all fundamental problems with computer architecture. One of the reasons that multicore multiprocessor clusters have become popular is that clock rate has pretty much stalled. Apart from the physical reasons, it is uneconomical: it's simply not worth the cost of increasing the clock frequency in terms of the power consumed and the heat dissipated. Intel calls the rate/heat trade-off a "fundamental theorem of multicore processors".

New multicore systems are being developed all the time. Using RISC CPUs, Tilera released 64-core processors and then, in 2009, a one-hundred-core processor. As of 2012, Tilera founder Dr. Agarwal is leading a new MIT effort dubbed The Angstrom Project. It is one of four DARPA-funded efforts aimed at building exascale supercomputers. The goal is to design a chip with 1,000 cores.

2.3 Parallel Processing Performance

2.3.1 Speedup and Locks

Parallel programming and multicore systems should mean better performance. This can be expressed as a ratio called speedup:

Speedup (p) = Time (serial) / Time (parallel)

This is varied by the number of processors, S = T(1)/T(p), where T(p) represents the execution time taken by the program running on p processors, and T(1) represents the time taken by the best serial implementation of the application measured on one processor.

Linear, or ideal, speedup is when S(p) = p. For example, doubling the processors results in double the speedup.
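For instance (a worked example with assumed figures, not from the original notes), if the best serial run of a job takes 100 seconds and the same job on 8 processors takes 14 seconds, then S(8) = T(1)/T(8) = 100/14 ≈ 7.1, somewhat below the ideal linear speedup of 8 because of serial sections and parallel overhead.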

However parallel programming is hard. More complexity = more bugs. Correctness in parallelisation requires synchronisation (locking). Synchronisation and atomic operations cause loss of performance and communication latency. A probable issue in parallel computing is deadlocks, where two or more competing actions are each waiting for the other to finish, and thus neither ever does. An apocryphal story of a Kansas railroad statute neatly illustrates the problem of a deadlock:

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone."

(A similar example is a livelock; the states of the processes involved in the livelock constantly change with regard to one another, none progressing.)

Locks are currently manually inserted in typical programming languages; without locks programs can be put in an inconsistent state. Multiple locks in different places and orders can lead to deadlocks. Manual lock insertion is error-prone, tedious and difficult to maintain. Does the programmer know what parts of a program will benefit from parallelisation? To ensure that parallel execution is safe, a task's effects must not interfere with the execution of another task.

2.3.2 Amdahl's Law and the Gustafson-Barsis Law


Amdahl's law, in the general sense, is a method to work out the maximum improvement to a system when only part of the system has been improved. A very typical use - and appropriate in this context - is the improvement in speedup from adding multiple processors to a computational task. Because some of the task is serial, there is a maximum limit to the speedup based on the time required for the sequential part - no matter how many processors are thrown at the problem. For example, if there is a complex one-hundred-hour task of which five hours is sequential processing, only 95% of the task can be parallelised - which means a maximum speedup of 20x.

Thus maximum speedup is:

S(N) = 1 / ((1 - P) + P/N)

where P is the proportion of a program that can be made parallel, (1 - P) is the proportion that cannot be parallelised (remains serial), and N is the number of processors.
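As a quick check against the one-hundred-hour example above (taking P = 0.95), the speedup on N processors is S(N) = 1 / (0.05 + 0.95/N); as N grows this approaches 1/0.05 = 20, so no number of processors can deliver more than a 20x speedup for that job.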

It seems a little disappointing to discover that, at a certain point, no matter how many processors you throw at a problem, it's just not going to get any faster, and given that almost all computational tasks are somewhat serial, the conclusion should be clear (e.g., Minsky's Conjecture). Not only are there serial tasks within a program, the very act of making a program parallel involves serial overhead, such as start-up time, synchronisation, data communications, etc.

However it is not necessarily the case that the ratio of parallel and serial parts of a job and the number of processors generate the same result, as the execution time of the specific serial and parallel implementations of a task can vary. An example is what is called "embarrassingly parallel", so named because such tasks are very simple to split up into parallel tasks as they have little communication between each other. For example, the use of GPUs for projection, where each pixel is rendered independently. Such tasks are often called "pleasingly parallel". To give an example using the R programming language, the SNOW (Simple Network of Workstations) package allows for embarrassingly parallel computations (yes, we have this installed).

Whilst originally expressed by Gene Amdahl in 1967, it wasn't until over twenty years later, in 1988, that an alternative by John L. Gustafson and Edwin H. Barsis was proposed. Gustafson noted that Amdahl's Law assumed a computation problem of fixed data set size. Gustafson and Barsis observed that programmers tend to set the size of their computational problems according to the available equipment; therefore as faster and more parallel equipment becomes available, larger problems can be solved. Thus scaled speedup occurs; although Amdahl's law is correct in a fixed sense, it can be circumvented in practice by increasing the scale of the problem.

If the problem size is allowed to grow with P, then the sequential fraction of the workload becomes less and less important. A common metaphor is based on driving (computation), time, and distance (computational task). In Amdahl's Law, if a car has been travelling at 40 km/h and needs to reach a point 80 km from the point of origin, then no matter how fast the vehicle travels afterwards it can only ever reach an average of 80 km/h by the time it gets to the 80 km point, even if it travelled at infinite speed, as the first hour has already passed. With the Gustafson-Barsis Law, it doesn't matter that the first hour has been at a plodding 40 km/h; the average can be increased without limit given enough time and distance. Just make the problem bigger!
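For reference (a standard formulation of the Gustafson-Barsis result, not quoted from the course notes), scaled speedup is commonly written as S(N) = N - α(N - 1), where N is the number of processors and α is the serial fraction of the time spent on the scaled, parallel run; as the problem grows, α shrinks and the speedup approaches N.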

(Image from Wikipedia)

3.0 Introductory MPI Programming

3.1 The Story of the Message Passing Interface (MPI) and OpenMPI

The Message Passing Interface (MPI) is a widely used standard, initially designed by academia and industry starting in 1991, to run on parallel computers. The goal of the group was to ensure source-code portability, and as a result they have a standard that defines an interface and specific functionality. As a standard, syntax and semantics are defined for core library routines which allow programmers to write message-passing programs in Fortran or C.

Some implementations of these core library routine specifications are available as free and open-source software, such as Open MPI. Open MPI combined three previous well-known implementations, namely FT-MPI from the University of Tennessee, LA-MPI from Los Alamos National Laboratory, and LAM/MPI from Indiana University, each of which excelled in particular areas, with additional contributions from the PACX-MPI team at the University of Stuttgart. Open MPI combines the quality peer review of a scientific free and open-source software project, and has been used in many of the world's top-ranking supercomputers.

Major milestones in the development of MPI include the following:

* 1991 Decision to initiate Standards for Message Passing in a Distributed Memory Environment.
* 1992 Workshop on the above held.
* 1992 Preliminary draft specification released for MPI.
* 1994 MPI-1. Specification, not an implementation. Library, not a language. Designed for C and Fortran 77.
* 1998 MPI-2. Extends the message-passing model to include parallel I/O, includes C++/Fortran90, interaction with threads, and more.
* 2007 MPI Forum reconvened; MPI-3 development.

The standard utilised in this course is MPI-2.

The message passing paradigm, as it is called, is attractive as it is portable to a wide variety of distributed architectures, including distributed and shared memory multiprocessor systems, networks of workstations, or even potentially a combination thereof. Although originally designed for distributed architectures (unicore workstations connected by a common network), which were popular at the time the standard was initiated, the arrival of shared-memory symmetric multiprocessing systems connected by networks created hybrid distributed/shared memory systems; that is, memory is shared within each machine but not distributed between machines, with data instead passed over the network communications. The MPI library standards and implementations were modified to handle both types of memory architectures.

(Image from Lawrence Livermore National Laboratory, U.S.A.)

Using MPI is a matter of some common sense. It is the only message passing library which can really be considered a standard. It is supported on virtually all HPC platforms, and has replaced all previous message passing libraries, such as PVM, PARMACS, EUI, NX, and Chameleon, to name a few predecessors. Programmers like it because there is no need to modify their source code when it is ported to a different system, as long as that system also supports the MPI standard (there may be other reasons, however, to modify the code!). MPI has excellent performance, with vendors able to exploit hardware features for optimisation.

The core principle is that many processors should be able to cooperate to solve a problem by passing messages to each other through a common communications network. The flexible architecture does overcome serial bottlenecks, but it also does require explicit programmer effort (the "questing beast" of automatic parallelisation remains somewhat elusive). The programmer is responsible for identifying opportunities for parallelism and implementing algorithms for parallelisation using MPI.

MPI programming is best where there are not too many small communications, and where coarse-level breakup of tasks or data is possible.

"In cases where the data layout is fairly simple, and the communications
patterns are regular this [data-parallel] is an excellent approach. However, when
dealing with dynamic, irregular data structures, data parallel programming can
be difficult, and the end result may be a program with sub-optimal performance."

(Warren, Michael S., and John K. Salmon. "A portable parallel particle program." Computer Physics
Communications 87.1 (1995): 266-290.)

3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers

For the purposes of this course, copy a number of files to the home directory:

cd ~
cp -r /common/advcourse .

In the Intermediate course, an example mpi-helloworld.c program was illustrated with an associated PBS script. Let's recall what that included, with explanations of the C program and of the PBS script that launched it.

This is the text for mpi-helloworld.c

#include <stdio.h>
    A standard include for C programs.

#include "mpi.h"
    A standard include for MPI programs.

int main( argc, argv )
    Beginning of the main function, establishing the arguments and vector. To incorporate input files, argc (argument count) is the number of arguments, and argv (argument vector) is an array of characters representing the arguments.

int argc;
    Argument count is an integer.

char **argv;
    Argument vector is a string of characters.

int rank, size;
    Set rank and size from the inputs.

MPI_Init( &argc, &argv );
    Initialises the MPI execution environment. The input parameters argc and argv are pointers to the number of arguments and to the argument vector.

MPI_Comm_size( MPI_COMM_WORLD, &size );
    Determines the size of the group associated with a communicator. The input parameter is simply a handle (containing all of the processes); the output parameter, size, is an integer of the number of processes in the group.

MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    As above, except rank is the rank of the calling process.

printf( "Hello world from process %d of %d\n", rank, size );
    Printing "Hello world" from each process.

MPI_Finalize();
    Terminates the MPI execution environment.

return 0;
    A successful program finishes!

It is compiled into an executable with the command:

mpicc -o mpi-helloworld mpi-helloworld.c

This is the text for the batch file pbs-helloworld, which is launched with qsub and reviewed with less.

qsub pbs-helloworld
less pbs-helloworld

The sample "hello world" program should be understandable to any C programmer (indeed, any programmer) and with the MPI-specific annotations, it should be clear what is going on. It is the same as any other program, but with a few MPI-specific additions. For example, one can check the PGI mpi.h with the following:

less /usr/local/openmpi/1.6.3-pgi/include/mpi.h

MPI compiler wrappers are used to compile MPI programs which perform basic
error checking, integrate the MPI include files, link to the MPI libraries and pass
switches to the underlying compiler. The wrappers are as follows:

mpif77 - Open MPI Fortran 77 wrapper compiler
mpif90 - Open MPI Fortran 90 wrapper compiler
mpicc - Open MPI C wrapper compiler
mpicxx - Open MPI C++ wrapper compiler

Open MPI is comprised of three software layers: OPAL (Open Portable Access Layer), ORTE (Open Run-Time Environment), and OMPI (Open MPI). Each layer provides the following wrapper compilers:

OPAL - opalcc and opalc++
ORTE - ortecc and ortec++
OMPI - mpicc, mpic++, mpicxx, mpiCC (only on systems with case-sensitive file systems), mpif77, and mpif90. Note that mpic++, mpicxx, and mpiCC all invoke the same underlying C++ compiler with the same options. All are provided for compatibility with other MPI implementations.

The distinctions between Fortran and C routines in MPI are fairly minimal. All the names of MPI routines and constants in both C and Fortran begin with the same MPI_ prefix. The main differences are:

* The include files are slightly different: in C, mpi.h; in Fortran, mpif.h.
* Fortran MPI routine names are in uppercase (e.g., MPI_INIT), whereas C MPI routine names are mixed case (e.g., MPI_Init).
* The arguments to MPI_Init are different; an MPI C program can take advantage of command-line arguments.
* The arguments in MPI C functions are more strongly typed than they are in Fortran, resulting in specific types in C (e.g., MPI_Comm, MPI_Datatype) whereas MPI Fortran uses integers.
* Error codes are returned in a separate argument for Fortran as opposed to the return value for C functions.

Consider the mpi-helloworld program in Fortran (mpi-helloworld.f90):

! Fortran MPI Hello World
    A comment.

program hello
    Program name.

include 'mpif.h'
    Include file for MPI.

integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
    Variables.

call MPI_INIT(ierror)
    Start MPI.

call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    Number of processors.

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    Process IDs.

print*, 'node', rank, ': Hello world'
    Each processor prints "Hello World".

call MPI_FINALIZE(ierror)
    Finish MPI.

end

Compile this with mpif90 (the Fortran 90 wrapper) and submit with qsub:

mpif90 mpi-helloworld.f90 -o mpi-helloworld
qsub pbs-helloworld

The mpi-helloworld program is an example of using MPI in a manner that is similar to a Single Instruction Multiple Data architecture. The same instruction stream (print hello world) is used across multiple processes. It is perhaps best described as Single Program Multiple Data, as it obtains the effect of running the same program multiple times, or, if you like, different programs with the same instructions.

3.3 MPI's Introductory Routines

The core theoretical concept in MPI programming is the move from a model where the processor and memory act in sequence with each other to a model where the memory and processor act in parallel and pass information through a communications network.

MPI has been described as both small and large. It is large, insofar as there are well over a hundred different routines. But since most of these are only called when one is engaging in advanced MPI programming (beyond the level of this course), it is perhaps fair to say that MPI is small, as there are only a handful of basic routines that are usually needed, of which we've seen four. There are two others (MPI_Send, MPI_Recv) which can also be considered "basic routines".

3.3.1 MPI_Init()

This routine initializes the MPI execution environment. Every MPI program must call this routine once, and only once, and before any other MPI routines; subsequent calls to MPI_Init will produce an error. With MPI_Init() processes are spawned and ranked, with communication channels established and the default communicator, MPI_COMM_WORLD, created. Communicators are considered analogous to the mail or telephone system; every message travels in the communicator, with every message-passing call having a communicator argument.

The input parameters are argc, a pointer to the number of arguments, and argv, the argument vector. These are for C and C++ only. The Fortran-only output parameter is IERROR, an integer.

The syntax for MPI_Init() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Init(int *argc, char ***argv)

Fortran Syntax

INCLUDE mpif.h
MPI_INIT(IERROR)
INTEGER
IERROR

C++ Syntax
#include <mpi.h>
void MPI::Init(int& argc, char**& argv)
void MPI::Init()

3.3.2 MPI_Comm_size()

This routine indicates the number of processes involved in a communicator, such as MPI_COMM_WORLD. The input parameter is comm, which is the handle for the communicator, and the output parameters are size, the number of processes in the group of comm (integer), and the Fortran-only IERROR providing the error status as an integer.

A communicator is effectively a collection of processes that can send messages to each other. Within programs many communications also depend on the number of processes executing the program.

The syntax for MPI_Comm_size() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size)

Fortran Syntax
INCLUDE mpif.h
MPI_COMM_SIZE(COMM, SIZE, IERROR)
INTEGER
COMM, SIZE, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_size() const

3.3.3 MPI_Comm_rank()

This routine indicates the rank number of the calling process within the pool of MPI communicator processes. The input parameter is comm, the communicator handle, and the output parameters are rank, the rank of the calling process expressed as an integer, and the ever-present IERROR error status for Fortran. It is common for MPI programs to be written in a manager/worker model, where one process (typically rank 0) acts in a supervisory role, and the other processes act in a computational role.

The syntax for MPI_Comm_rank() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_rank(MPI_Comm comm, int *rank)

Fortran Syntax
INCLUDE mpif.h
MPI_COMM_RANK(COMM, RANK, IERROR)
INTEGER COMM, RANK, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_rank() const

3.3.4 MPI_Send()

This routine performs a standard-mode, blocking send. By "blocking" what is meant is that the routine will not return until the message data has been safely stored away, so that the sender is free to modify the send buffer.

The message-passing system handles many messages going to and from many different sources. The programmer just needs to state the send/receive messages in an appropriate way, without needing to know the underlying implementation; the message-passing system takes care of delivery. However this appropriate way means stating various characteristics of the message, just like the post or email: who is sending it, where it's being sent to, what it's about, and so forth.

The input parameters include buf, the initial address of the send buffer; count, an integer count of the number of elements; datatype, a handle for the datatype of each send buffer element; dest, the integer rank of the destination; tag, an integer message tag; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

If MPI_Comm represents a community of addressable space, then MPI_Send and MPI_Recv represent the envelope, addressing information, and the data. In order for a message to be successfully communicated the system must append some information to the data that the application program wishes to transmit. This includes the rank of the sender, the receiver, a tag, and the communicator. The source is used to differentiate messages received from different sources; the tag to distinguish messages from a single process.

The syntax for MPI_Send() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE mpif.h
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
<type>
BUF(*)
INTEGER
COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Send(const void* buf, int count, const Datatype&
datatype, int dest, int tag) const

3.3.5 MPI_Recv()

As what is sent should be received, the MPI_Recv routine provides a standard-mode, blocking receive. A message can be received only if addressed to the receiving process, and if its source, tag, and communicator (comm) values match the source, tag, and comm values specified. After a matching send has been initiated, a receive will block until that send has completed. The length of the received message must be less than or equal to the length of the receive buffer, otherwise an overflow error will be returned.

The input parameters include count, the maximum integer number of elements to receive; datatype, a handle for the datatype of each receive buffer entry; source, the integer rank of the source; tag, an integer message tag; and comm, the communicator handle. The output parameters are buf, the initial address of the receive buffer; status, a status object; and the ever-present Fortran IERROR.

The syntax for MPI_Recv() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Status *status)

Fortran Syntax
INCLUDE mpif.h
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type>
BUF(*)
INTEGER
COUNT, DATATYPE, SOURCE, TAG, COMM
INTEGER
STATUS(MPI_STATUS_SIZE), IERROR

C++ Syntax
#include <mpi.h>
void Comm::Recv(void* buf, int count, const Datatype& datatype,
int source, int tag, Status& status) const

void Comm::Recv(void* buf, int count, const Datatype& datatype,


int source, int tag) const

The importance of MPI_Send() and MPI_Recv() relates to the nature of process variables, which remain private to each process unless explicitly passed by MPI through the communications world.

3.3.6 MPI_Finalize()

This routine should be called when all communications are completed. Whilst it cleans up MPI data structures and the like, it does not cancel continuing communications, which the programmer should look out for. Once called, no other MPI routines can be called (with some minor exceptions), not even MPI_Init. There are no input parameters. The only output parameter is Fortran's IERROR.

The syntax for MPI_Finalize() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>

int MPI_Finalize()

Fortran Syntax
INCLUDE mpif.h
MPI_FINALIZE(IERROR)
INTEGER
IERROR

C++ Syntax
#include <mpi.h>
void Finalize()

Whilst the previous mpi-helloworld.c and mpi-helloworld.f90 examples illustrated the use of four of the six core routines of MPI, they did not illustrate the use of the MPI_Recv and MPI_Send routines. The following programs, of no greater complexity, do this. There is no need to provide additional explanation of what is happening, as this should be discernible from the routine explanations given. Each program should be compiled with mpicc and mpif90 respectively, submitted with qsub, and the results checked.

Compile with mpicc -o mpi-sendrecv mpi-sendrecv.c, submit with qsub pbs-sendrecv

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int tag, source, destination, count;
    int buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    tag = 1;
    source = 0;
    destination = 1;
    count = 1;
    if (myid == source) {
        buffer = 1234;
        MPI_Send(&buffer, count, MPI_INT, destination, tag, MPI_COMM_WORLD);
        printf("processor %d sent %d\n", myid, buffer);
    }
    if (myid == destination) {
        MPI_Recv(&buffer, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
        printf("processor %d received %d\n", myid, buffer);
    }
    MPI_Finalize();
    return 0;
}

The mpi-sendrecv.f90 program; compile with mpif90 -o mpi-sendrecv mpi-sendrecv.f90, submit with qsub pbs-sendrecv

program sendrecv
    include "mpif.h"
    integer myid, ierr, numprocs
    integer tag, source, destination, count
    integer buffer
    integer status(MPI_STATUS_SIZE)

    call MPI_INIT( ierr )
    call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
    call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
    tag = 1
    source = 0
    destination = 1
    count = 1
    if (myid .eq. source) then
        buffer = 1234
        call MPI_Send(buffer, count, MPI_INTEGER, destination, &
            tag, MPI_COMM_WORLD, ierr)
        write(*,*) "processor ", myid, " sent ", buffer
    endif
    if (myid .eq. destination) then
        call MPI_Recv(buffer, count, MPI_INTEGER, source, &
            tag, MPI_COMM_WORLD, status, ierr)
        write(*,*) "processor ", myid, " received ", buffer
    endif
    call MPI_FINALIZE(ierr)
    stop
end

The following provides a summary of the use of the six core routines in C and Fortran.

Purpose: Include header files
    C:       #include <mpi.h>
    Fortran: INCLUDE mpif.h

Purpose: Initialize MPI
    C:       int MPI_Init(int *argc, char ***argv)
    Fortran: INTEGER IERROR
             CALL MPI_INIT(IERROR)

Purpose: Determine number of processes within a communicator
    C:       int MPI_Comm_size(MPI_Comm comm, int *size)
    Fortran: INTEGER COMM, SIZE, IERROR
             CALL MPI_COMM_SIZE(COMM, SIZE, IERROR)

Purpose: Determine processor rank within a communicator
    C:       int MPI_Comm_rank(MPI_Comm comm, int *rank)
    Fortran: INTEGER COMM, RANK, IERROR
             CALL MPI_COMM_RANK(COMM, RANK, IERROR)

Purpose: Send a message
    C:       int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    Fortran: <TYPE> BUF(*)
             INTEGER COUNT, DATATYPE, DEST, TAG
             INTEGER COMM, IERROR
             CALL MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)

Purpose: Receive a message
    C:       int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status)
    Fortran: <TYPE> BUF(*)
             INTEGER COUNT, DATATYPE, SOURCE, TAG
             INTEGER COMM, STATUS, IERROR
             CALL MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)

Purpose: Exit MPI
    C:       int MPI_Finalize()
    Fortran: CALL MPI_FINALIZE(IERROR)

4.0 Intermediate MPI Programming

4.1 MPI Datatypes

Like C and Fortran (and indeed, almost every programming language that comes to mind), MPI has datatypes, a classification for identifying different types of data (such as real, int, float, char, etc.). In the introductory MPI program there wasn't really much complexity in these types; as one delves deeper, however, more will be encountered. Forewarned is forearmed, so the following provides a handy comparison chart between MPI, C, and Fortran.

MPI DATATYPE             FORTRAN DATATYPE

MPI_INTEGER              INTEGER
MPI_REAL                 REAL
MPI_DOUBLE_PRECISION     DOUBLE PRECISION
MPI_COMPLEX              COMPLEX
MPI_LOGICAL              LOGICAL
MPI_CHARACTER            CHARACTER
MPI_BYTE                 (no equivalent)
MPI_PACKED               (no equivalent)

MPI DATATYPE             C DATATYPE

MPI_CHAR                 signed char
MPI_SHORT                signed short int
MPI_INT                  signed int
MPI_LONG                 signed long int
MPI_UNSIGNED_CHAR        unsigned char
MPI_UNSIGNED_SHORT       unsigned short int
MPI_UNSIGNED             unsigned int
MPI_UNSIGNED_LONG        unsigned long int
MPI_FLOAT                float
MPI_DOUBLE               double
MPI_LONG_DOUBLE          long double
MPI_BYTE                 (no equivalent)
MPI_PACKED               (no equivalent)

4.2 Intermediate Routines

In the Intermediate course, one of the last exercises involved the submission of mpi-ping and mpi-pong. The first simply tested whether a connection existed between multiple processors. The second program tested different packet sizes, asynchronous communication, and bi-directional transfers. In this example there is ping_pong.c, from the University of Edinburgh Parallel Computing Centre, and a Fortran 90 version of the same from Colorado University. The usual methods can be used for compiling and submitting these programs, e.g.,

mpicc -o mpi-pingpong mpi-pingpong.c
or
mpif90 -o mpi-pingpong mpi-pingpong.f90

and

qsub pbs-pingpong

However, for this course the interesting component is what is inside the code in terms of the MPI routines. As previously, there are the mpi.h include files, the initialisation routines, the establishment of a communications world and so forth. In addition, however, there are some new routines, specifically MPI_Wtime, MPI_Abort, and MPI_Ssend.

4.2.1 MPI_Wtime()

MPI_Wtime returns the elapsed time, as a floating point number in seconds, of the calling processor from an arbitrary point in the past. It can be applied in the following fashion:

{
    double starttime, endtime;
    starttime = MPI_Wtime();
    ...   /* stuff to be timed */
    endtime = MPI_Wtime();
    printf("That took %f seconds\n", endtime - starttime);
}

The syntax for MPI_Wtime() is as follows for C, Fortran, and C++.

C Syntax

#include <mpi.h>
double MPI_Wtime()

Fortran Syntax

INCLUDE mpif.h
DOUBLE PRECISION MPI_WTIME()

C++ Syntax

#include <mpi.h>
double MPI::Wtime()

4.2.2 MPI_Abort()

MPI_Abort() aborts (or at least tries to abort) all tasks in the group of a communicator. All associated processes are sent a SIGTERM. The input parameters include comm, the communicator of tasks to abort, and errorcode, the error code to return to the invoking environment. The only output parameter is Fortran's IERROR.

The syntax for MPI_Abort() is as follows for C, Fortran, and C++.

C Syntax

#include <mpi.h>
int MPI_Abort(MPI_Comm comm, int errorcode)

Fortran Syntax
INCLUDE mpif.h
MPI_ABORT(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Abort(int errorcode)
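
As a minimal, illustrative sketch (the file name, error code and surrounding program are assumptions, not taken from any course example), MPI_Abort() is typically called when a process hits an unrecoverable error:

/* Illustrative only: abort the whole job if a required input file is missing.
   Assumes an initialised MPI program with <stdio.h> included and rank set. */
FILE *fp = fopen("input.dat", "r");
if (fp == NULL) {
    fprintf(stderr, "rank %d: cannot open input.dat, aborting\n", rank);
    MPI_Abort(MPI_COMM_WORLD, 1);   /* error code 1 is returned to the environment */
}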

4.2.3 MPI_Ssend()

MPI_Ssend performs a synchronous-mode, blocking send. Whereas MPI_Send will not return until the program can reuse the send buffer, MPI_Ssend will not return until a matching receive is posted. It's a fairly subtle difference, but in general the best performance occurs if the program is written so that buffering can be avoided, and MPI_Ssend is used. Otherwise, MPI_Send is the more flexible option.

The input parameters include buf, the initial address of the send buffer; count, a non-negative integer number of elements in the send buffer; datatype, a datatype handle of each send buffer element; dest, an integer rank of the destination; tag, an integer message tag; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

The syntax for MPI_Ssend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE mpif.h
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Ssend(const void* buf, int count, const Datatype&
datatype, int dest, int tag) const

4.2.4 Other Send and Recv Routines

Although not used in the specific program just illustrated, there are actually a number of other send options in Open MPI. These include MPI_Bsend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, and MPI_Irsend. These are worth mentioning in summary as follows:

MPI_Bsend(). A basic send with user-specified buffering that returns immediately. It allows the user to send messages without worrying about where they are buffered, because the user has provided buffer space with MPI_Buffer_attach (if they haven't, they'll encounter problems).

MPI_Rsend(). A ready send; it may be called only if a matching receive has already been posted.

MPI_Irsend(). A ready-mode, non-blocking send; otherwise the same as MPI_Rsend.

MPI_Isend(). A standard-mode, non-blocking send. It allocates a communication request object and associates it with the request handle. A non-blocking send call indicates to the system that it may start copying data out of the send buffer. A send request can be determined to be complete by calling MPI_Wait, MPI_Waitany, MPI_Test, or MPI_Testany with the request returned by this function. The send buffer cannot be reused until one of these calls succeeds, or until MPI_Request_free indicates that the buffer is available.

MPI_Ibsend(). This initiates a buffered, non-blocking send. As it is non-blocking, the system may start copying data out of the send buffer; as it is buffered, it is a very good idea that the application does not access any part of the send buffer until the send completes.

Although MPI_Send and MPI_Ssend are typical, there may be occasions when some of these routines are preferred. If non-blocking routines are necessary, for example, then look at MPI_Isend or MPI_Irecv.

MPI_Isend()

MPI_Isend() provides a standard-mode, non-blocking send by allocating a communication request object and associating it with the request handle (the argument request). The request can be used later to query the status of the communication or wait for its completion. The non-blocking send call allows the system to copy data out of the send buffer. A send request can be determined to be complete by calling MPI_Wait.

The input parameters are buf, the initial address of the send buffer; count, an integer number of elements in the send buffer; datatype, a datatype handle of each send buffer element; dest, an integer rank of the destination; tag, an integer message tag; and comm, the communicator handle. The output parameters are request, the communication request handle, and Fortran's integer IERROR.

The syntax for MPI_Isend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
    int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE mpif.h
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

C++ Syntax
#include <mpi.h>
Request Comm::Isend(const void* buf, int count, const
Datatype& datatype, int dest, int tag) const

MPI_Irecv()

MPI_Irecv() provides a standard-mode, non-blocking receive. As with MPI_Isend(), a communication request object is allocated and associated with the request handle (the argument request). The request can be used to query the status of the communication or wait for its completion. A receive request can be determined to be complete by calling MPI_Wait.

The input parameters are buf, the initial address of the receive buffer; count, an integer number of elements in the receive buffer; datatype, a datatype handle of each receive buffer element; source, an integer rank of the source; tag, an integer message tag; and comm, the communicator handle. The output parameters are request, the communication request handle, and Fortran's integer IERROR.

The syntax for MPI_Irecv() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE mpif.h
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

C++ Syntax
#include <mpi.h>
Request Comm::Irecv(void* buf, int count, const Datatype& datatype,
    int source, int tag) const

MPI_Wait()

MPI_Wait() waits for an MPI send or receive to complete. It returns when the operation identified by request is complete. If the communication object was created by a non-blocking send or receive call, then the object is deallocated by the call to MPI_Wait and the request handle is set to MPI_REQUEST_NULL. The input parameter is request, the request handle. The output parameters are status, the status object, and Fortran's integer IERROR.

The syntax for MPI_Wait() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Wait(MPI_Request *request, MPI_Status *status)

Fortran Syntax
INCLUDE mpif.h
MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

C++ Syntax
#include <mpi.h>
void Request::Wait(Status& status)

void Request::Wait()
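
To see how MPI_Isend(), MPI_Irecv() and MPI_Wait() fit together, the following is a minimal sketch (assuming exactly two processes; the variable names are illustrative and not taken from the course examples) in which ranks 0 and 1 exchange a single integer:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = (rank == 0) ? 1 : 0;     /* partner rank */
    sendval = rank * 100;

    /* Post the receive first, then the send; neither call blocks. */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &sreq);

    /* Useful work not touching the buffers could be done here. */

    MPI_Wait(&sreq, &status);        /* send buffer may now be reused      */
    MPI_Wait(&rreq, &status);        /* receive buffer now holds the data  */
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}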

A Summary of Some Other MPI Send/Receive Modes

MPI_Send()
    Explanation: standard send; may be synchronous or buffered.
    Benefits: flexible trade-off; automatically uses a buffer if available, but goes for synchronous if not.
    Problems: can hide deadlocks; uncertainty of the mode used makes debugging harder.

MPI_Ssend()
    Explanation: synchronous send; doesn't return until the receive has also completed.
    Benefits: safest mode; confident that the message has been received.
    Problems: lower performance, especially without non-blocking.

MPI_Bsend()
    Explanation: buffered send; copies data to a buffer, leaving the program free to continue whilst the message is delivered later.
    Benefits: good performance. Need to be aware of buffer space.
    Problems: buffer management issues.

MPI_Rsend()
    Explanation: ready send; a matching receive must already be posted or the message is lost.
    Benefits: slight performance increase since there's no handshake.
    Problems: risky and difficult to design.

As described previously, the arguments dest and source in the various modes of send are the ranks of the receiving and the sending processes. MPI also allows source to be a "wildcard" through the predefined constants MPI_ANY_SOURCE (to receive from any source) and MPI_ANY_TAG (to receive with any tag). There is no wildcard for dest. Again using the postal analogy, a recipient may be ready to receive a message from anyone, but they can't send a message to just anywhere!
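
As a brief sketch of a wildcard receive (a fragment assumed to sit inside an initialised MPI program; the variable names are illustrative), the status object can be inspected to see who actually sent the message:

int value;
MPI_Status status;

/* Accept one integer from any rank, with any tag. */
MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* The status object records the actual source and tag. */
printf("received %d from rank %d (tag %d)\n",
       value, status.MPI_SOURCE, status.MPI_TAG);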

4.2.5 The Prisoner's Dilemma

The example of the Prisoner's Dilemma (cooperation vs competition) is provided to illustrate how non-blocking communications work. In this example, there are ten rounds between two players, with different pay-offs for each. In this particular version the distinction is between cooperation and competition for financial rewards. If both players cooperate they receive $2 for the round. If they both compete, they receive $1 each for the round. But if one adopts a competitive stance and the other a cooperative stance, the competitor receives $3 and the cooperative player nothing.

A serial version of the code is provided (serial-gametheory.c, serial-gametheory.f90). Review it and then attempt a parallel version from the skeleton MPI versions (mpi-skel-gametheory.c, mpi-skel-gametheory.f90). Each process must run one player's decision-making, then both have to transmit their decision to the other, and then update their own tally of the result. Consider using MPI_Send(), MPI_Irecv(), and MPI_Wait(), as in the sketch below. On completion, review against the solutions provided in mpi-gametheory.c and mpi-gametheory.f90 and submit the tasks with qsub.
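
A minimal sketch of the decision-exchange step (assuming each of the two ranks has already computed its own move in myMove; the names are illustrative and not taken from the skeleton files) might look like:

int myMove, theirMove;              /* 0 = cooperate, 1 = compete (illustrative) */
int other = (rank == 0) ? 1 : 0;    /* the opposing player's rank */
MPI_Request recvReq;
MPI_Status status;

/* Post a non-blocking receive for the opponent's move, send our own,
   then wait until the opponent's decision has arrived. */
MPI_Irecv(&theirMove, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &recvReq);
MPI_Send(&myMove, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
MPI_Wait(&recvReq, &status);

/* Both moves are now known locally, so each rank can update its own tally. */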

4.3 Collective Communications

MPI can also conduct collective communications. These include MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, and MPI_Allreduce. A brief summary of their syntax and a description of their effects is provided before a practical example.

The basic principle and motivation is that whilst collective communications may provide a performance improvement, they will certainly provide clearer code. Consider the following C snippet of a root processor sending to all others:

if ( 0 == rank ) {
    unsigned int proc_I;
    for ( proc_I = 1; proc_I < numProcs; proc_I++ ) {
        MPI_Ssend( &param, 1, MPI_UNSIGNED, proc_I, PARAM_TAG, MPI_COMM_WORLD );
    }
}
else {
    MPI_Recv( &param, 1, MPI_UNSIGNED, 0 /*ROOT*/, PARAM_TAG, MPI_COMM_WORLD, &status );
}

Replaced with:

MPI_Bcast( &param, 1, MPI_UNSIGNED, 0/*ROOT*/, MPI_COMM_WORLD );

4.3.1 MPI_Bcast()

MPI_Bcast broadcasts a message from the process with rank "root" to all other processes of the communicator, including itself. It is called by all members of the group using the same arguments for comm and root, and on return the contents of root's communication buffer have been copied to all processes.

The input parameters include count, an integer number of entries in the buffer; datatype, the datatype handle of the buffer; root, an integer rank of the broadcast root; comm, the communicator handle; and buf, the starting address of the buffer (both input and output). The only other output parameter is Fortran's IERROR.

The syntax for MPI_Bcast() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
int root, MPI_Comm comm)

Fortran Syntax
INCLUDE mpif.h
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Bcast(void* buffer, int count,
    const MPI::Datatype& datatype, int root) const = 0
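
As a minimal, self-contained sketch of MPI_Bcast() (the value 42 and the variable names are purely illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    unsigned int param = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        param = 42;              /* only the root knows the value initially */

    /* After the call, every rank's copy of param holds the root's value. */
    MPI_Bcast(&param, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
    printf("rank %d has param = %u\n", rank, param);

    MPI_Finalize();
    return 0;
}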

4.3.2 MPI_Scatter(), MPI_Scatterv()

MPI_Scatter sends data from one task to all tasks in a group; it is the inverse operation of MPI_Gather. The outcome is as if the root executed n send operations and each process executed a receive. MPI_Scatterv scatters a buffer in parts to all tasks in a group.

The input parameters include sendbuf, the address of the send buffer; sendcount, an integer (significant at root) of the number of elements to send to each process; sendtype, the datatype handle (significant at root) of the send buffer elements; recvcount, an integer number of elements in the receive buffer; recvtype, the datatype handle of the receive buffer elements; root, the integer rank of the sending process; and comm, the communicator handle. MPI_Scatterv also has the input parameter displs, an integer array of length equal to the group size, which specifies a displacement relative to sendbuf. The output parameters include recvbuf, the address of the receive buffer, and the ever-dependable IERROR for Fortran.

The syntax for MPI_Scatter() is as follows for C, Fortran, and C++.

The syntax for MPI_Scatter() is as follows for C, Fortran, and C++.

C Syntax

#include <mpi.h>
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax

INCLUDE mpif.h
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
    RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
    INTEGER COMM, IERROR

C++ Syntax

#include <mpi.h>
void MPI::Comm::Scatter(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype,
int root) const
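
The following is a minimal sketch of MPI_Scatter() distributing a root-built array in equal slices (PER_PROC and the variable names are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define PER_PROC 4   /* elements handed to each rank (illustrative) */

int main(int argc, char *argv[])
{
    int rank, nprocs, i;
    int *global = NULL;
    int local[PER_PROC];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Only the root builds the full array of nprocs * PER_PROC values. */
        global = malloc(nprocs * PER_PROC * sizeof(int));
        for (i = 0; i < nprocs * PER_PROC; i++)
            global[i] = i;
    }

    /* Each rank (including the root) receives its own PER_PROC-element slice. */
    MPI_Scatter(global, PER_PROC, MPI_INT, local, PER_PROC, MPI_INT,
                0, MPI_COMM_WORLD);
    printf("rank %d received elements starting with %d\n", rank, local[0]);

    if (rank == 0)
        free(global);
    MPI_Finalize();
    return 0;
}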

4.3.3 MPI_Gather()

MPI_Gather gathers data, combining a partial array from each processor into one array on the root processor. Each process, including the root process, sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order. The outcome is as if each of the n processes in the group (including the root process) had executed a call to MPI_Send() and root had executed n calls to MPI_Recv().

The input parameters include sendbuf, the address of the send buffer; sendcount, an integer number of elements in the send buffer; sendtype, the datatype handle of the send buffer elements; recvcount, an integer (significant at root) of the number of elements received from each process; recvtype, the datatype handle (significant at root) of the receive buffer elements; root, the integer rank of the receiving process; and comm, the communicator handle. The output parameters include recvbuf, the address of the receive buffer (significant at root), and the ever-dependable IERROR for Fortran.

The syntax for MPI_Gather() is as follows for C, Fortran, and C++.
The syntax for MPI_Gather() is as follows for C, Fortran, and C++.

C Syntax

#include <mpi.h>
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax

INCLUDE mpif.h
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
    RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
    INTEGER COMM, IERROR

C++ Syntax

#include <mpi.h>
void MPI::Comm::Gather(const void* sendbuf, int sendcount,
    const MPI::Datatype& sendtype, void* recvbuf,
    int recvcount, const MPI::Datatype& recvtype,
    int root) const = 0
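
Conversely, a minimal sketch of MPI_Gather() collecting one value from each rank onto the root (the variable names and the rank*rank "partial result" are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, i, myvalue;
    int *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    myvalue = rank * rank;       /* each rank's partial result (illustrative) */

    if (rank == 0)
        all = malloc(nprocs * sizeof(int));

    /* The root receives one integer from every rank, stored in rank order. */
    MPI_Gather(&myvalue, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < nprocs; i++)
            printf("value from rank %d: %d\n", i, all[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}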

4.3.4 MPI_Reduce(), MPI_Allreduce()

MPI_Reduce performs a reduce operation (such as sum, max, logical AND, etc.) across all the members of a communication group. The reduction operation can be either one of a predefined list of operations, or a user-defined operation.

MPI_Allreduce conducts the same operation but returns the reduced result to all processors. User-defined reduction operations must be of type:

typedef void MPI_User_function( void* invec, void* inoutvec,
    int* len, MPI_Datatype* datatype )

A handle to the reduction operation, of type MPI_Op, must be created and supplied to MPI_Reduce (don't forget to free it up after use).

The input parameters include sendbuf, the address of the send buffer; count, an integer number of elements in the send buffer; datatype, a handle of the datatype of elements in the send buffer; op, a handle of the reduce operation; root, the integer rank of the root process; and comm, the communicator handle. The output parameters are recvbuf, the address of the receive buffer for root, and Fortran's IERROR.

The syntax for MPI_Reduce() is as follows for C, Fortran, and C++.

C Syntax

#include <mpi.h>
int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Fortran Syntax

INCLUDE mpif.h
MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

C++ Syntax

#include <mpi.h>
void MPI::Intracomm::Reduce(const void* sendbuf, void* recvbuf,
int count, const MPI::Datatype& datatype, const MPI::Op& op,
int root) const
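
As a minimal sketch of MPI_Reduce() (the rank + 1 contribution is illustrative), summing one value per rank onto the root; swapping the call for MPI_Allreduce (which drops the root argument) would leave the sum on every rank instead:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, local, globalsum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = rank + 1;            /* each rank contributes one value */

    /* Sum the local values across all ranks; only the root receives the result. */
    MPI_Reduce(&local, &globalsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..%d = %d\n", nprocs, globalsum);

    MPI_Finalize();
    return 0;
}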

MPI reduction operations include the following:

MPI Name        Function

MPI_MAX         Maximum
MPI_MIN         Minimum
MPI_SUM         Sum
MPI_PROD        Product
MPI_LAND        Logical AND
MPI_BAND        Bitwise AND
MPI_LOR         Logical OR
MPI_BOR         Bitwise OR
MPI_LXOR        Logical exclusive OR
MPI_BXOR        Bitwise exclusive OR
MPI_MAXLOC      Maximum and location
MPI_MINLOC      Minimum and location

4.3.5 Other Collective Communications

Other collective communications include:

MPI_Barrier(): synchronises processes
MPI_Alltoall(): a useful way of sharing an array and interleaving at the same time
MPI_Reduce_scatter(): a reduction operation on a set of arrays, followed by a scatter
MPI_Scan(): the same as reduce, except each processor i only works on arrays 0 to i

4.4 Derived Data Types

Derived types are essentially user-defined types for MPI_Send(). They are described as 'derived' because they are derived from existing primitive datatypes like int and float. The main reason to use them in an MPI context is that they make message passing more efficient and easier to code.

For example, if a program has data in double results[5][5], what does the user do if they want to send results[0][0], results[1][0], ..., results[4][0] (the first element of each row)?

The program could send the data one at a time e.g.,

double results[5][5];
int i;
for ( i = 0; i < 5; i++ ) {
MPI_Send( &(results[i][0]), 1, MPI_DOUBLE,
dest, tag, comm );
}

But this has overhead; message passing is always (relatively) expensive. So instead, a datatype can be created that informs MPI how the data is stored so it can be sent in one routine.

To create a derived type there are two steps: firstly, construct the datatype with MPI_Type_vector() or MPI_Type_struct(), and then commit the datatype with MPI_Type_commit().

When all the data to send is the same data type use the vector method e.g.,

int MPI_Type_vector( int count, int blocklen, int stride, MPI_Datatype old_type,
MPI_Datatype* newtype )

/* Send the first double of each of the 5 rows */
MPI_Datatype newType;
double results[5][5];

MPI_Type_vector( 5, 1, 5, MPI_DOUBLE, &newType );
MPI_Type_commit( &newType );
MPI_Ssend( &(results[0][0]), 1, newType, dest, tag, comm );

Note that when sending a vector, the data on the receiving processor may be of a different type, e.g.:

double       recvData[COUNT*BLOCKLEN];
double       sendData[COUNT][STRIDE];
MPI_Datatype vecType;
MPI_Status   st;

MPI_Type_vector( COUNT, BLOCKLEN, STRIDE, MPI_DOUBLE, &vecType );
MPI_Type_commit( &vecType );
if ( rank == 0 )
    MPI_Send( &(sendData[0][0]), 1, vecType, 1, tag, comm );
else
    MPI_Recv( recvData, COUNT*BLOCKLEN, MPI_DOUBLE, 0, tag, comm, &st );

If you have specific parts of a struct you wish to send and the members are of different types, use the struct datatype:

int MPI_Type_struct( int count, int blocklens[], MPI_Aint indices[],
    MPI_Datatype old_types[], MPI_Datatype *newtype )

For example....

/* Send the Packet structure in a message */
struct Packet {
    int a;
    double array[3];
    char b[10];
};

struct Packet dataToSend;

The corresponding derived type is then constructed and committed:

int blockLens[3] = { 1, 3, 10 };
MPI_Aint intSize, doubleSize;
MPI_Aint displacements[3];
MPI_Datatype types[3] = { MPI_INT, MPI_DOUBLE, MPI_CHAR };
MPI_Datatype myType;

MPI_Type_extent( MPI_INT, &intSize );        /* # of bytes in an int */
MPI_Type_extent( MPI_DOUBLE, &doubleSize );  /* # of bytes in a double */
displacements[0] = (MPI_Aint) 0;
displacements[1] = intSize;
displacements[2] = intSize + ((MPI_Aint) 3 * doubleSize);

MPI_Type_struct( 3, blockLens, displacements, types, &myType );
MPI_Type_commit( &myType );
MPI_Ssend( &dataToSend, 1, myType, dest, tag, comm );

There are actually other functions for creating derived types:

MPI_Type_contiguous
MPI_Type_hvector
MPI_Type_indexed
MPI_Type_hindexed

In many applications, the size of a message to receive is unknown before it is received (e.g. the number of particles moving between domains). MPI has a way of dealing with this elegantly. Firstly, the receiving side calls MPI_Probe before actually receiving:

int MPI_Probe( int source, int tag, MPI_Comm comm, MPI_Status *status )

It can then examine the status and find the message length using:

int MPI_Get_count( MPI_Status *status, MPI_Datatype datatype, int *count )

The application can then dynamically allocate the receive buffer and call MPI_Recv.
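
A minimal sketch of this probe-then-receive pattern (a fragment assumed to sit inside an initialised MPI program with <stdlib.h> included; source and tag are assumed to be already defined):

MPI_Status status;
int nItems;
double *buffer;

/* Block until a matching message is available, without receiving it. */
MPI_Probe(source, tag, MPI_COMM_WORLD, &status);

/* Ask how many MPI_DOUBLE elements the pending message contains. */
MPI_Get_count(&status, MPI_DOUBLE, &nItems);

/* Allocate exactly the right amount of space, then receive. */
buffer = malloc(nItems * sizeof(double));
MPI_Recv(buffer, nItems, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &status);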

4.5 Particle Advector

The particle advector hands-on exercise consists of two parts.

The first example is designed to gain familiarity with the MPI_Scatter() routine as a means of distributing global arrays among multiple processors via collective communication. Use the skeleton code provided and determine the number of particles to assign to each processor. Then use the function MPI_Scatter() to spread the global particle coordinates, ids and tags among the processors.

For an advanced test, on the root processor only, calculate the particle with the smallest distance from the origin (hint: MPI_Reduce()). If the particle with the smallest distance is < 1.0 from the origin, then flip the direction of movement of all the particles. Then modify your code to use the MPI_Scatterv() function to allow the given number of particles to be properly distributed among a variable number of processors.

int MPI_Scatterv( void *sendbuf, int *sendcnts, int *displs,
    MPI_Datatype sendtype, void *recvbuf, int recvcnt,
    MPI_Datatype recvtype, int root, MPI_Comm comm )

The second example is designed to provide practical experience with MPI derived data types. Implement a data type storing the particle information from the previous exercise and use this data type for collective communications. Set up and commit a new MPI derived data type, based on the struct below:

typedef struct Particle {
    unsigned int globalId;
    unsigned int tag;
    Coord        coord;
} Particle;

Hint: MPI_Type_struct( ), MPI_Type_commit( )

Then seed the random number sequence on the root processor only, and
determine how many particles are to be assigned among the respective
processors (same as for last exercise) and collectively assign their data using the
MPI derived data type you have implemented.

4.6 Creating A New Communicator

Each communicator has associated with it a group of ranked processes. Before creating a new communicator, we must first create a group for it. A new group can be created by eliminating processes from an existing group:

MPI_Group worldGroup, subGroup;
MPI_Comm  subComm;
int       *procsToExcl, numToExcl;

MPI_Comm_group( MPI_COMM_WORLD, &worldGroup );
MPI_Group_excl( worldGroup, numToExcl, procsToExcl, &subGroup );
MPI_Comm_create( MPI_COMM_WORLD, subGroup, &subComm );
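
Putting this together, a minimal self-contained sketch (excluding rank 0 purely for illustration) might look like the following; ranks left out of the new group receive MPI_COMM_NULL and must not use the new communicator:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    int procsToExcl[1] = { 0 };            /* exclude rank 0 (illustrative) */
    MPI_Group worldGroup, subGroup;
    MPI_Comm subComm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    MPI_Group_excl(worldGroup, 1, procsToExcl, &subGroup);
    MPI_Comm_create(MPI_COMM_WORLD, subGroup, &subComm);

    if (subComm != MPI_COMM_NULL) {
        int subRank;
        MPI_Comm_rank(subComm, &subRank);
        printf("world rank %d is rank %d in the sub-communicator\n", rank, subRank);
        MPI_Comm_free(&subComm);
    }

    MPI_Group_free(&subGroup);
    MPI_Group_free(&worldGroup);
    MPI_Finalize();
    return 0;
}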

4.7 Profiling Parallel Programs

Parallel performance issues include the following:

* Coverage - % of the code that is parallel
* Granularity - amount of work in each section
* Load balancing
* Locality - communication structure
* Synchronization - locking latencies

Since the performance of parallel programs is dependent on so many issues, it is an inherently difficult task to profile parallel programs.

TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for
performance analysis of parallel programs written in Java, C, C++ and Fortran.

The steps involved in profiling parallel code are outlined as follows:

Instrument the source code with Tau macros
Compile the instrumented code
Run the program to generate profile.* files for each separate process

The instrumentation of source code can be done manually or with the help of
another utility called PDT, which automatically parses source files and
instruments them with Tau macros.

4.8 Debugging MPI Applications

It has taken many years for this essential truth to be realised, but software equals bugs. In parallel systems, the bugs are particularly difficult to diagnose, and the core principle of parallelisation invites race conditions and deadlocks. For example, what happens when two processors try to send a message to one another at the same time?

When debugging MPI programs it is usually a good idea to do this in one's own environment, i.e., install (from source) the compilers and version of Open MPI on your own system. The reason for this is that it is quite time-prohibitive to conduct debugging activities on a batch-processing high-performance computer. The HPC systems that we have may run tasks fairly quickly when launched, but they can take some time to begin whilst they are in the queue.

DO NOT RUN JOBS ON THE HEAD NODE


REALLY, DO NOT RUN MULTICORE JOBS ON THE HEAD NODE!

It is possible, for small tests, to bypass this by running small jobs interactively (following the instructions given in the Intermediate course), e.g.,

qsub -l walltime=0:30:0,nodes=1:ppn=2 -I
module load vpac
qsub pbs-sendrecv

In general however, parallel programs are hard to program and hard to debug.
Parallelism adds a whole new abstract layer. Although the program is being
executed on N processors, it may be running in N slightly different ways on
different data.

Although time-consuming, it is usually appropriate to build the code in serial first to the point that it's working, and working well. As part of this process, use version control systems, and engage in unit testing (check each functional component of the code independently) and integration testing (check the interfaces between components) as part of this development. Use standard methods for these tests, such as the use of mid-range, boundary, and out-of-bounds variables.

Because parallelism adds a new level of abstraction, producing a serial version of a code before producing a parallel version is not unlike producing pseudo-code for a serial program. Time and time again it has been shown that modelling significantly improves the quality of a program and reduces errors, thus saving time in the longer run. In the process of engaging in such modelling, developing a defensive style of programming is effective, for example engaging in techniques that prevent deadlocks, or keeping in consideration the state of a condition when running loops or if-else statements. When conducting actual tests on the code, tactically placed printf or write statements will assist.

For example, consider the following simple send-recv programs; compile these
with openmpi-gcc as follows:

module load openmpi-gcc

mpicc -g mpi-debug.c -o mpi-debug

or

mpif90 -g mpi-debug.f90 -o mpi-debug

qsub -l walltime=0:20:0,nodes=1:ppn=2 -I
module load vpac
module load valgrind/3.8.1-openmpi-gcc

Note that an interactive job starts the user in their home directory, requiring a change of directory.

mpiexec is then launched with two processes, with valgrind debugging the executable and error output redirected to valgrind.out:

mpiexec -np 2 valgrind ./mpi-debug 2> valgrind.out

Valgrind is a debugging suite that automatically detects many memory management and threading bugs. Whilst typically built for serial applications, it can also be built with the mpicc wrappers, but currently only for GNU GCC or Intel's C++ compiler. It is important to use the same compiler for both the build and the Valgrind test.

The file valgrind.out in this case will contain quite a few errors, but none of these
are critical to the operation of the program.

As with serial programs, gdb can also be used for thorough debugging. Execute as:

mpiexec -np [number of processors] gdb ./executable --command=gdb.cmd

where gdb.cmd is a text file of the commands that you want to send to gdb.
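
A hypothetical gdb.cmd might simply contain ordinary gdb commands, one per line, for example:

run
bt
quit

The debugger is then launched under mpiexec, e.g.,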

module load gdb


mpiexec -np 2 gdb --exec=mpi-debug --command=gdb.cmd

Which should generate a result something like the following:

[lev@trifid166 advancedhpc]$ mpiexec -np 2 gdb --command=gdb.cmd mpi-debug

(license information removed)
Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no
debugging symbols found)...done.
Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no
debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x2aaaad875700 (LWP 19784)]
[New Thread 0x2aaaad875700 (LWP 19785)]
[New Thread 0x2aaaadc8b700 (LWP 19786)]
[New Thread 0x2aaaadc8b700 (LWP 19787)]
processor 0 final value: 324 with loop # 68
processor 1 final value: 2346 with loop # 68

[Thread 0x2aaaadc8b700 (LWP 19786) exited]


[Thread 0x2aaaad875700 (LWP 19784) exited]
[Thread 0x2aaaadc8b700 (LWP 19787) exited]
[Thread 0x2aaaad875700 (LWP 19785) exited]
[Inferior 1 (process 19776) exited normally]
[Inferior 1 (process 19777) exited normally]

This, of course, simply shows that the program completed successfully with the final values as listed (hooray!). Using a serial debugger such as gdb with a program that is running in parallel is slightly more difficult. A common hack (and it is a hack) is to find out the process IDs that the job is using, then to log in to the appropriate node and run gdb -p PID. In order to discover those PIDs, the following code snippet is usually implemented:

{
    /* requires <unistd.h> for gethostname(), getpid() and sleep() */
    int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", getpid(), hostname);
    fflush(stdout);
    while (0 == i)
        sleep(5);
}

Then at job submission those PIDs will be displayed. For example,

[lev@trifid166 advancedhpc]$ mpiexec -np 2 mpi-debug


PID 23166 on trifid166 ready for attach
PID 23167 on trifid166 ready for attach

Then log in to the appropriate nodes, run gdb -p 23166 and gdb -p 23167, step through the function stack, and set the variable to a non-zero value, e.g.,

(gdb) set var i = 7

Then set a breakpoint after your block of code and continue execution until the breakpoint is hit (e.g., by adding break in the loops on lines 49 and 64), using the gdb commands to display the values as they are being generated (e.g., print loop, print value, or info locals).
