
PETER HANULIAK AND MICHAL HANULIAK

ANALYTICAL MODELLING
IN PARALLEL
AND DISTRIBUTED
COMPUTING
For a full listing of Chartridge Books Oxford's titles, please contact us:
Chartridge Books Oxford, 5 & 6 Steadys Lane, Stanton Harcourt,
Witney, Oxford, OX29 5RL, United Kingdom
Tel: +44 (0) 1865 882191
Email: editorial@chartridgebooksoxford.com
Website: www.chartridgebooksoxford.com
Analytical Modelling in
Parallel and Distributed Computing
Analytical Modelling in
Parallel and Distributed
Computing
Peter Hanuliak
and
Michal Hanuliak
Chartridge Books Oxford
5 & 6 Steadys Lane
Stanton Harcourt
Witney
Oxford OX29 5RL, UK
Tel: +44 (0) 1865 882191
Email: editorial@chartridgebooksoxford.com
Website: www.chartridgebooksoxford.com
First published in 2014 by Chartridge Books Oxford
ISBN print: 978-1-909287-90-7
ISBN ebook: 978-1-909287-91-4
© Peter Hanuliak and Michal Hanuliak 2014.
The right of Peter Hanuliak to be identified as author of this work has been asserted by them in accordance
with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
British Library Cataloguing-in-Publication Data: a catalogue record for this book is available from the
British Library.
All rights reserved. No part of this publication may be reproduced, stored in or introduced into a retrieval
system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording or
otherwise) without the prior written permission of the publishers. This publication may not be lent, resold,
hired out or otherwise disposed of by way of trade in any form of binding or cover other than that in which
it is published without the prior consent of the publishers. Any person who does any unauthorised act in
relation to this publication may be liable to criminal prosecution and civil claims for damages. Permissions
may be sought directly from the publishers, at the above address.
Chartridge Books Oxford is an imprint of Biohealthcare Publishing (Oxford) Ltd.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights. The publishers are not associated with any product or vendor mentioned in this
publication. The authors, editors, contributors and publishers have attempted to trace the copyright holders
of all material reproduced in this publication and apologise to any copyright holders if permission to publish
in this form has not been obtained. If any copyright material has not been acknowledged, please write and
let us know so we may rectify in any future reprint. Any screenshots in this publication are the copyright of
the website owner(s), unless indicated otherwise.
Limit of Liability/Disclaimer of Warranty
The publishers, author(s), editor(s) and contributor(s) make no representations or warranties with respect
to the accuracy or completeness of the contents of this publication and specifically disclaim all warranties,
including without limitation warranties of fitness for a particular purpose. No warranty may be created or
extended by sales or promotional materials. The advice and strategies contained herein may not be suitable
for every situation. This publication is sold with the understanding that the publishers are not rendering
legal, accounting or other professional services. If professional assistance is required, the services of a
competent professional person should be sought. No responsibility is assumed by the publishers, author(s),
editor(s) or contributor(s) for any loss of profit or any other commercial damages, injury and/or damage to
persons or property as a matter of products liability, negligence or otherwise, or from any use or operation
of any methods, products, instructions or ideas contained in the material herein. The fact that an
organisation or website is referred to in this publication as a citation and/or potential source of further
information does not mean that the publishers or the author(s), editor(s) and contributor(s) endorse the
information the organisation or website may provide or recommendations it may make. Further, readers
should be aware that internet websites listed in this work may have changed or disappeared between when
this publication was written and when it is read.
Typeset by Domex, India
Printed in the UK and USA
Contents
Preview xiii
Acknowledgements xv
Part I. Parallel Computing 1
Introduction 3
Developing periods in parallel computing 3
1 Modelling of Parallel Computers and Algorithms 7
Model construction 7
2 Parallel Computers 11
Classification 12
Architectures of parallel computers 14
Symmetrical multiprocessor system 14
Network of workstations 16
Grid systems 19
Conventional HPC environment versus Grid environments 21
Integration of parallel computers 22
Metacomputing 22
Modelling of parallel computers including communication networks 23
3 Parallel Algorithms 29
Introduction 29
Parallel processes 31
Classification of PAs 33
Parallel algorithms with shared memory 34
Parallel algorithms with distributed memory 35
Developing parallel algorithms 35
Decomposition strategies 37
Natural parallel decomposition 38
Domain decomposition 39
Functional decomposition 39
Mapping 44
Inter process communication 45
Inter process communication in shared memory 45
Inter process communication in distributed memory 46
Performance tuning 46
4 Parallel Program Developing Standards 49
Parallel programming languages 49
Open MP standard 50
Open MP threads 53
Problem decomposition 54
MPI API standard 55
MPI parallel algorithms 57
Task Groups 58
Communicators 59
The order of tasks 59
Collective MPI commands 60
Synchronisation mechanisms 61
Conditional synchronisation 61
Rendezvous 61
Synchronisation command barriers 61
MPI collective communication mechanisms 62
Data scattering collective communication commands 63
Java 70
5 Parallel Computing Models 73
SPMD model of parallel computation 74
Fixed (atomic) network 74
PRAM model 75
Fixed communication model GRAM 76
Flexible models 76
Flexible GRAM model 77
The BSP model 77
Computational model MPMD functionality of all system resources 80
Load of communication network 81
6 The Role of Performance 83
Performance evaluation methods 85
Analytic techniques 85
Asymptotic (order) analysis 86
Application of queuing theory systems 90
Kendall classification 91
The simulation method 94
Experimental measurement 95
Part II. Theoretical aspects of PA 97
7 Performance Modelling of Parallel Algorithms 99
Speed up 99
Efficiency 100
Isoefficiency 100
Complex performance evaluation 102
Conclusion and perspectives 103
8 Modelling in Parallel Algorithms 105
Latencies of PA 105
Part III. Applied Parallel Algorithms 113
9 Numerical Integration 115
Decomposition model 116
Mapping of parallel processes 117
Performance optimisation 120
Chosen illustration results 123
10 Synchronous Matrix Multiplication 127
The systolic matrix multiplier 127
Instruction systolic array matrix multiplier 129
ISA matrix multiplier 131
Dataflow matrix multiplication 132
Wave front matrix multiplier 133
The asynchronous matrix multiplication 134
Decomposition strategies 134
Domain decomposition methods for matrix multiplication 134
Comparison of used decomposition models 137
11 Discrete Fourier Transform 139
The Fourier series 139
The discrete Fourier transform 140
The discrete fast Fourier transform 141
Two-dimensional DFFTs 145
Analysed examples 147
One element per processor 147
Multiple elements per processor 147
Multiple elements per processor with routing 149
Multiple elements per processor in computer networks 150
Chosen illustration results 150
12 The Triangle Problem 157
The manager/worker strategy 157
Combinatorial problems 157
Sequential algorithms 159
Parallel algorithms 160
Performance optimisation 161
13 The System of Linear Equations 165
Methods of solving SLR 166
Cramer's rule 166
Gaussian elimination method 167
Sequential algorithm GEM 167
Decomposition matrix strategies 168
Evaluation of GEM 168
The evaluation of matrix parallel algorithms 170
Common features of MPA 171
Decomposition models MPA 172
Domain decomposition 172
Iteration methods 173
Parallel iterative algorithms SLR 173
Convergence of iterative methods 176
14 Partial Differential Equations 181
The Laplace differential equation 183
Local communication 185
Algorithms of the Jacobi iterative method 187
Optimisation of parallel Jacobi algorithms 189
Complexity of the sequential algorithm 193
Gauss Seidel iterative sequential algorithms 194
Matrix decomposition models 195
Jacobi iterative parallel algorithms 197
Parallel algorithms with a shared memory 197
Parallel iterative algorithms with a distributed memory 198
The red-black successive over-relaxation method 203
Complex analytical performance modelling of IPA 207
Basic matrix decomposition models 207
Matrix decomposition into strips 209
Matrix decomposition into blocks 210
Optimisation of the decomposition method selection 211
Parallel computational complexity 213
Complex analytical performance modelling 214
Isoefficiency functions 216
Canonical matrix decomposition models 217
Optimisation of isoefficiency functions 219
Conclusions of isoefficiency functions 221
Chosen results 221
Part IV. The Experimental Part 229
15 Performance Measurement of PAs 231
Direct performance measurement methodology for MPAs 231
Performance measurement of PAs 232
Performance measurement of PAs with a shared memory 232
Performance measurement MPAs with a distributed memory 232
Performance measurement of PAs in NOW and Grid 233
Measurement delays 235
Measurements on parallel computers 237
Measurement of SMPs 237
Measurements on parallel computers in the world 237
Measurement of NOW 239
Measurement of the performance verification criteria of PAs 240
Isoefficiency functions of PAs 242
16 Measuring Technical Parameters 245
Specifications of measurements 245
Technical parameters for the average time of computer operations 245
Computing complexity of GEM 246
Process for deriving technical parameter tc 250
Applied uses of technical parameter tc 254
Verification of the accuracy for approximation relations 254
Simulated performance comparisons of parallel computers 256
Communication complexity and communication technical
parameters 258
Classic parallel computers 258
The NOW parallel computer 261
Evaluation of collective communication mechanisms 262
The Broadcast collective communication mechanism type 263
17 Conclusions 265
Appendix 1 267
Basic PVM routines 267
Preliminaries 267
Point-to-point message passing 268
Group routines 270
Appendix 2 273
Basic MPI routines 273
Preliminaries 273
Point-to-point message passing 274
Group routines 276
Appendix 3 279
Basic Pthread routines 279
Thread management 279
Thread synchronisation 281
Condition variables 282
References 285
Preview
The current trends in High Performance Computing (HPC) are to use
networks of workstations (NOW, SMP) or a network of NOW
networks (Grid) as a cheaper alternative to the traditionally-used,
massive parallel multiprocessors or supercomputers. Individual
workstations could be single PCs (personal computers) used as parallel
computers based on modern symmetric multicore or multiprocessor
systems (SMPs) implemented inside the workstation.
With the availability of powerful personal computers, workstations
and networking devices, the latest trend in parallel computing is to
connect a number of individual workstations (PCs, PC SMPs) to solve
computation-intensive tasks in a parallel way in typical clusters such as NOW, SMP and Grid. In this sense it is no longer correct to consider
traditionally evolved parallel computing and distributed computing as
two separate research disciplines.
To exploit the parallel processing capability of this kind of cluster,
the application program must be made parallel. The choice of an effective way of doing this (the parallelisation strategy) is one of the most important steps in developing an effective parallel algorithm (optimisation). For
behaviour analysis we have to take into account all the overheads that
have an influence on the performance of parallel algorithms
(architecture, computation, communication etc.). In this book we
discuss this kind of complex performance evaluation of various typical
parallel algorithms (shared memory, distributed memory) and their
practical implementations. As real application examples we demonstrate
the various influences during the process of modelling and performance
evaluation and the consequences of their distributed parallel
implementations.
Keywords: Parallel computers, parallel algorithm, performance modelling, analytical model, decomposition, inter-process communication, IPC, OpenMP, MPI, complex performance modelling, effectiveness, speed up, isoefficiency, NOW, Grid.
Acknowledgements
This work was carried out within the project named 'Modelling, optimisation and prediction of parallel computers and parallel algorithms' at the University of Zilina, the Slovak Republic. The
authors gratefully acknowledge the universal help of project
supervisor Prof. Ing. Ivan Hanuliak, PhD.
Part 1:
Parallel Computing
Introduction
The performance of present-day computers (sequential, parallel) depends to a large degree on parallel principles embedded at various levels of their technical (hardware) and program support (software) means. At the level of the internal architecture of the basic CPU (Central Processor Unit) module of a PC, these are implementations of scalar pipeline execution or multiple pipeline (superscalar, superpipeline) execution, together with the capacity extension of caches and their redundant usage at various levels in the form of shared and local caches (L1, L2, L3). At the level of the motherboard there is multiple usage of cores and processors in building multicore or multiprocessor systems such as the SMP (Symmetrical Multiprocessor System) as a powerful computation node, which is itself an SMP parallel computer.
Developing periods in parallel computing
During the first period of parallel computing between 1975 and
1995, scientific supercomputers dominated, which were specially
designed for High Performance Computing (HPC). These parallel
computers have mostly used a computing model based on data
parallelism. Those systems were way ahead of standard common
computers in terms of their performance and price. General purpose
processors on a single chip, which had been invented in the early
1970s, were only mature enough to hit the HPC market by the end
of the 1980s, and it was not until the end of the 1990s that the
connected standard workstations or even personal computers (PCs)
had become competitive, at least in terms of theoretical peak
performance. Increased processor performance was achieved through the massive usage of various parallel principles in all forms of produced processors. Parallel principles were used in this way in single PCs and workstations (scalar and superscalar pipelines, symmetrical multiprocessor systems SMPs) as well as in connected networks of workstations (NOW). The experience gained with the implementation of parallel principles and the extension of computer networks has led to the use of connected computers for parallel solutions. This trend can be characterised as the downsizing of supercomputers such as the Cray/SGI T3E and other massive parallel systems (number of used processors >100) to cheaper and more universal parallel systems in the form of a Network of Workstations (NOW). This period we can refer to as the second period. Their large growth since 1980 has been driven by the simultaneous influence of three basic factors [22, 41]:
High performance processors (Pentium and higher, Power PC,
RISC etc.).
High speed interconnecting networks (100 Mbit and Gigabit Ethernet, Myrinet, InfiniBand).
Standard tools for the development of parallel algorithms
(OpenMP, Java, PVM, MPI).
The gradual change from specialised supercomputers (Cray/SGI T3E etc.) and other massive (number of processors >100) parallel computers [66] to more available but powerful PCs based on multiprocessors or multicores connected in NOW networks has been characterised as downsizing. This trend began in 1980 (the introduction of personal computers), and was inspired by the simultaneous influence of the above-mentioned three basic factors.
The developing trends are currently moving towards the building of
widespread, connected NOW networks with high computation and
memory capacity (Grid). Conceptually, Grid belongs to the class of
metacomputer.
A metacomputer can be understood as a massive computer
network of computing nodes built on the principle of the common
use of existing processors, memories and other resources with the
objective of creating an illusion of one huge, powerful supercomputer.
Such higher integrated forms of NOW (Grid module), named Grid
systems or metacomputers, we can define as the third period in the
developing trends of parallel computers.
The developing trends were going toward building widespread
connected networks with high computation and memory capacities
(Grid), whereby their components could make possible the existence of
supercomputers and their innovative types. Conceptually, Grid belongs
to the class of metacomputer, where a metacomputer can be understood
as a massive computer network (with high-speed data transmission) of
computing nodes built on the principle of the common use of the
existing processors, memory and other resources, with the objective of
creating an illusion of one huge, powerful supercomputer [96].
There has been an increasing interest in the use of networks of
distributed workstations (clusters) connected by high-speed networks
for solving large computation-intensive problems. This trend is
mainly driven by the cost effectiveness of such systems, as compared
to parallel computers with their massive numbers of tightly coupled
processors and memories. Parallel computing on a cluster of
powerful workstations (NOW, SMP, Grid), connected by high-speed
networks has given rise to a range of hardware and network-related
issues on any given platform.
The Network of workstations (NOW) has become a widely-
accepted form of high-performance parallel computing. As in
conventional multiprocessors, parallel programs running on this kind
of platform are often written in an SPMD form (Single program
Multiple data) to exploit data parallelism, or in an improved SPMD
form to also take into account the potential of the functional
parallelism of a given application. Each workstation in a NOW is
treated similarly to a processing element in a multiprocessor system.
However, workstations are far more powerful and flexible than
processing elements in conventional multiprocessors.
The dominant trend, also in the field of High Performance Computing (HPC), is towards networks of connected powerful workstations and SMPs known as NOW (Network of Workstations) and their higher, massively integrated forms named Grid or metacomputer. The effective usage of these dominant PCs in forms such as NOW and Grid requires fundamentally new approaches and methodical strategies in the proposal, development, modelling and optimisation of PAs (shared memory, distributed memory), the result of which is commonly known as an effective PA.
Distributed computing using a cluster of powerful workstations
(NOW, SMP, Grid) was reborn as a kind of lazy parallelism. A
cluster of computers could team up to solve many problems at once,
rather than one problem at a higher speed. To get the most out of a
distributed parallel system, the designers and software developers
must understand the interaction between the hardware and software
parts of the system. It is obvious that the use of a computer network based on personal computers is in principle less effective than the typical massive parallel architectures used in the world because of higher communication overheads; nevertheless, a network of ever more powerful workstations consisting of powerful personal computers (PCs, PC SMPs) is the way of the future as a very cheap, flexible and promising parallel computer. We
can see this trend in the dynamic growth of parallel architectures based on networks of workstations, a cheaper and more flexible architecture compared to conventional multiprocessors and supercomputers. The principles of these conventional parallel
computers are currently effectively implemented in modern symmetric
multiprocessor systems (SMPs) based on the same processors [1]
(multiprocessors, multicores). The unification of both approaches
(NOW and SMP) has opened up for the future new possibilities in
massive HPC computing.
1
Modelling of Parallel Computers
and Algorithms
Generally a model is the abstraction of a system (Fig. 1.1.). The
functionality of the model represents the level of the abstraction
applied. That means, if we know all there is about a system, and we
are willing to pay for the complexity of building a true model, the role
of abstraction is nearly zero. In practical cases we wish to abstract the
view we take of a system to simplify the complexity of the real
system. We wish to build a model that focuses on some basic elements
of our interest, and to leave the rest of the real system as only an
interface with no details beyond proper input and output. A real
system is the concrete applied process or system that we are going to
model [85]. In our case these are applied Parallel Algorithms (PA) or concrete parallel computers (SMP, NOW, Grid etc.).
The basic conclusion is that a model reflects the modeller's subjective insight into the modelled real system. This personal view defines what is important, what the purposes are, the details, the boundaries and so on. Therefore the modeller must understand the system in order to guarantee the useful features of the created model.
Figure 1.1 The modelling process (abstraction of a real system into a model).
Model construction
Modelling is a highly creative process, which incorporates the following
basic assumptions:
A strong aptitude for abstract thought.
Brainstorming (creativity).
Alternating behaviour and strategy.
Logical, hierarchical approaches to differentiating between
primary and secondary facts.
In general, the development of a model in any scientific area includes
the selection of the following steps [51, 77]:
Define the problem to be studied, as well as the criteria for analysis.
Define and/or refine the model of the system. This includes the
development of abstractions into mathematical, logical or
procedural relationships.
Collect data input for the model. Define the outside world and
what must be fed into or taken from the model to simulate that
world.
Select a modelling tool and prepare and augment the model for
tool implementation.
Verify that the tool implementation is an accurate reflection of
the model.
Validate that the tool implementation provides the desired
accuracy or correspondence with the real-world system being
modelled.
Experiment with the model to obtain performance measurements.
Analyse the tool results.
Use the findings to derive designs and improvements for the real-
world system.
A corresponding flow diagram of model development is represented
in Fig. 1.2:
Figure 1.2 Flow diagram of model development.
As a practical illustration we have chosen the applied modelling of the classical sequential von Neumann computer (1946), shown below in Fig. 1.3.
Figure 1.3 Applied computer modelling.
2
Parallel Computers
Basic technical components of parallel computers are illustrated in
Fig. 2.1 as follows:
Modules of processors, cores or a mixture of them.
Modules of computers (sequential, parallel).
Memory modules.
Input/output (I/O) modules.
These modules are connected through internal high-speed communication networks (within the concrete module) and external high-speed communication networks (among the used computing modules) [8, 99].
Figure 2.1 Basic building modules of parallel computers.

Classification
It is very difficult to classify all existent parallel systems. But from
the point of view of the programmer-developer we divide them into
the two following different groups:
Synchronous parallel architectures. These are used for performing
the same or a very similar process (independent part of programme)
on different sets of data (data parallelism) in active computing
nodes of a parallel system. They are often used under central
control, which means under the global clock synchronisation
(vector, array system etc.) or a distributed local control
mechanism (systolic systems etc.). This group consists mainly of
parallel computers (centralised supercomputers) with any form
of shared memory. Shared memory defines the typical system
features, and in some cases can in considerable measure reduce
the development of some parallel algorithms. To this group
belong actual dominated parallel computers based on multiple
cores, processors, or also a mixture of them (Symmetrical Multi-
processors SMP), and most of the realised massive parallel
computers (classic supercomputers) [60, 84]. One practical
example of this kind of synchronous parallel computer is
illustrated in Fig. 2.2. The basic common characteristics are as
follows:
Shared memory (or at least part of the memory).
Use of the shared memory for communication.
Supported developing standards: OpenMP, OpenMP threads, Java.
Figure 2.2 A typical example of a synchronous parallel computer (a host and control computer with shared memory and an array of processors as compute nodes).
Asynchronous parallel computers. They are composed of a
number of fully independent computing nodes (processors, cores
or computers) which are connected through some communication
network. To this group belong mainly various forms of computer
networks (cluster), a network of powerful workstations (NOW)
or a more integrated network of NOW networks (Grid). Any
cooperation and control are performed through inter process
communication mechanisms (IPC) via realised remote or local
communication channels. The typical examples of asynchronous
parallel computers are illustrated in Fig. 2.3. According to the
latest trends, asynchronous parallel computers based on PC
computers (single, SMP) are dominant parallel computers. Their
basic common characteristics are as follows [10, 63]:
No shared memory (distributed memory).
A computing node could have some form of local memory where
this memory is in use only by a connected computing node.
Cooperation and control of parallel processes only using
asynchronous message communication.
Supported developing standards:
MPI (Message Passing Interface)
PVM (Parallel Virtual Machine)
Java.
Figure 2.3 One example of an asynchronous parallel computer.
The classification of all existent parallel computer architectures is illustrated below in Fig. 2.4.
Figure 2.4 Classification of parallel computers (synchronous: SIMD, systolic, vector/array, others; asynchronous: SMP, NOW, Grid, others).
Architectures of parallel computers
Symmetrical multiprocessor systems
A symmetrical multiprocessor system (SMP) is the multiple usage of identical processors or cores on the motherboard in order to increase the overall performance of this kind of system.
Typical common characteristics are as follows [41, 43]:
Each processor or core (computing node) of the multiprocessor
system can access the main memory (shared memory).
I/O channels or I/O devices are allocated to individual computing
nodes according to their demands.
An integrated operation system coordinates the cooperation of
the entire multiprocessor resources (hardware, software etc.).
The concept of such a multiprocessor system is illustrated in
Fig. 2.5.
Figure 2.5 Typical characteristics of multiprocessor systems (hardware: processors or cores (CPU units), shared or multiport memory, shared I/O devices and channels; software: a single integrated operation system with system reconfiguration abilities).

An actual typical example of an eight-processor multiprocessor system (Intel Xeon) is illustrated below in Fig. 2.6.
Figure 2.6 The architecture of an eight-processor Intel multiprocessor: a PROfusion crossbar switch connects three 100 MHz buses and two memory ports (memory banks of up to 16 GB each), with PCI bridges leading to 64-bit, 66 MHz hot-plug PCI slots.
A basic abstract model of a parallel computer with a shared
memory is illustrated in Fig. 2.7.
Figure 2.7 A basic abstract model of a multiprocessor.
Network of workstations
There has been increasing interest in the use of networks of
workstations (NOW) connected together via high-speed networks
for solving large computation intensive problems [39]. The principal
architecture of NOW is illustrated in Fig. 2.8. below. This trend is
mainly driven by the cost effectiveness of such systems, as compared
to massive multiprocessor systems with tightly-coupled processors
and memories (supercomputers). Parallel computing on a cluster of
workstations connected by high-speed networks has given rise to a
range of hardware and network related issues on any given platform.
Load balancing, inter-processor communication (IPC), and transport
protocol for these machines are being widely studied [76, 80]. With
the availability of cheap personal computers, workstations and
networking devices, the recent trend has been to connect a number
of these workstations to solve computation intensive tasks in parallel
with clusters. The network of workstations (NOW) has become a
widely-accepted form of high performance computing (HPC). Each
workstation in a NOW is treated similarly to a processing element
in a multiprocessor system. However, workstations are far more
powerful and flexible than processing elements in conventional
multiprocessors (supercomputers). To exploit the parallel processing
capability of a NOW, an application algorithm must be parallelised.
One way to do this for an application problem is to build its own
decomposition strategy. This step belongs to one of the most
important steps in developing effective parallel algorithms.
Figure 2.8 The architecture of a NOW.
One typical example of networks of workstations also used for
solving large computation intensive problems is illustrated in Fig. 2.9
below. The individual workstations are mainly extremely powerful
personal workstations based on a multiprocessor or multicore
platform. Parallel computing on a cluster of workstations connected
by high-speed networks has given rise to a range of hardware and
network-related issues on any given platform.
A practical example of a NOW module is represented below in
Fig. 2.10. It also represents our base architecture in terms of the
laboratory parallel computer. On such a modular parallel computer
we have been able to study basic problems in parallel and distributed
computing such as load balancing, inter-processor communication
(IPC), modelling and optimisation of parallel algorithms (Effective
PA) etc. [34]. The coupled computing nodes PC1, PC2, ..., PCi
(workstations) are able to be single extremely powerful personal
computers (PC) or SMP parallel computers. In this way, parallel
computing using networks of conventional PC workstations (single,
multiprocessor, multicore) and Internet computing, all suggest the
advantages of unifying parallel and distributed computing. Parallel
computing and distributed computing have traditionally evolved as
two separate research disciplines. Parallel computing has addressed problems of communication-intensive computation on tightly-coupled processors [29], while distributed computing has been concerned with the coordination, availability, timeliness and so on of more loosely-coupled computations [62].
Figure 2.9 Typical architecture of a NOW: parallel and sequential applications run over parallel programming environments and supporting cluster middleware on PC/workstations, whose communication drivers and network cards connect them to a high-speed network/switch.
Figure 2.10 A practical example of a NOW: laboratory workstations PC1, PC2, ..., PCi (Intel Xeon based), connected through an Ethernet switch and a Myrinet (InfiniBand) switch.
A basic abstract model of a parallel computer with a distributed memory (NOW) is illustrated in Fig. 2.11.
Figure 2.11 An abstract model of a NOW (computing nodes P1, P2, ..., Pn, each with its own distributed memory module M1, M2, ..., Mn, connected by a communication network).
Grid systems
Grid technologies have attracted a great deal of attention recently, and
numerous infrastructure and software projects have been undertaken
to realise various versions of Grids. In general, Grids represent a new
way of managing and organising computer networks, and mainly of
their deeper resource sharing [5]. Grid systems are expected to operate
on a wider range of other resources, such as processors (CPUs),
storages, data modules, network components, software (typical
resources) and atypical resources like graphical and audio input/output
devices, sensors and so on (see Fig. 2.12). All these resources typically
exist within nodes that are geographically distributed and span
multiple administrative domains. The virtual machine constitutes a set
of resources taken from a resource pool [54]. It is obvious that existent
HPC parallel computers (supercomputers etc.) could also be
members of these Grid systems. In general, Grids represent a new
way of managing and organising computer networks and mainly of
their deeper resource sharing (Fig. 2.12).
Figure 2.12 Architecture of a Grid node.
Conceptually they come from a structure of virtual parallel
computers based on computer networks. In general, Grids represent
a new way of managing and organising resources like a network of
NOW networks. This term defines a massive computational Grid
with the following basic characteristics [28, 75]:
A wide area network of integrated free computing resources. It
is a massive number of inter-connected networks, which are connected through high-speed networks, whereby the entire massive system is controlled by a network operation system, which creates the illusion of a powerful computer system (a virtual supercomputer).
It grants the function of metacomputing, which means a computing environment that provides individual applications with the functionality of all system resources.
The system combines distributed parallel computation with
remote computing from user workstations.
Conventional HPC environment versus Grid
environments
In Grids, the virtual pool of resources is dynamic and diverse, since
the resources can be added or withdrawn at any time according to
their owners' discretion, and their performance or load can change
frequently over time. A typical number of resources in the pool is of
the order of several thousand or even more. An application in a
conventional parallel environment (HPC computing) typically
assumes a pool of computational nodes from a subset, of which a
virtual concurrent machine is formed. The pool consists of PCs,
workstations, and possibly supercomputers, provided that the user
has access (a valid login name and password) to all of them. This
virtual pool of nodes for a typical user can be considered as static,
and this set varies in practice from 10 to 100 nodes. In Table 2.1 we
summarise the differences analysed between the conventional
distributed and Grid systems. We can also generally say that:
Table 2.1 Basic comparison of HPC and Grid computing (conventional HPC environments vs. Grid environments).
1. A virtual pool of computational nodes vs. a virtual pool of resources.
2. A user has access (credential) to all the nodes in the pool vs. a user has access to the pool but not to the individual nodes.
3. Access to a node means access to all resources on the node vs. access to a resource may be restricted.
4. The user is aware of the applications and features of the nodes vs. a user has little or no knowledge about each resource.
5. Nodes belong to a single trust domain vs. resources span multiple trust domains.
6. Elements in the pool number from 10 to 100 and are more or less static vs. elements in the pool number >>100 and are dynamic.
HPC environments are optimised to provide maximal
performance.
Grids are optimised to provide maximal resource capacities.
Integration of parallel computers
With the availability of cheap personal computers, workstations and
networking devices, the recent trends have been to connect a number
of such workstations to solve computation-intensive tasks in
parallel with various integrated forms of clusters based on computer
networks. We have illustrated in Fig. 2.13 a typical integrated complex consisting of NOW network modules. It is clear that any classical parallel computer (a massive multiprocessor, supercomputer etc.) in the world could be a member of this integrated complex [90].
Figure 2.13 Integration of NOW networks: laboratory NOW modules (SMP, NOW), each with its own switch, connected through a central switch and a router into a Grid module.
With the aim of attaining connectivity to any of the existent
integrated parallel computers in Europe (supercomputers, NOW,
Grid) we can use the European classical massive parallel systems by
means of scientific visits by project participants to the HPC centres
in the EU (EPCC Edinburgh in the UK, BSC Barcelona in Spain,
CINECA Bologna in Italy, GENCI Paris in France, SARA Amsterdam
in the Netherlands, HLRS Stuttgart in Germany and CSC Helsinki
in Finland) [100].
Metacomputing
This term defines massive parallel computers (supercomputer, SMP,
Grid) with the following basic characteristics [94, 96]:
A wide area network of integrated free computing resources.
This is a massive number of inter-connected networks, which are connected through high-speed networks, whereby the entire massive system is controlled by a network operation system, which creates the illusion of a powerful computer system (a virtual supercomputer).
It grants the function of metacomputing, which means a computing environment that provides the functionality of all the system's resources for individual applications.
A system which combines distributed parallel computation with
remote computing from user workstations.
The best example of an existent metacomputer is the Internet as a massive international network of computer networks (an Internet module). Fig. 2.14 illustrates the Internet as a virtual parallel computer from the viewpoint of an average Internet user.
Figure 2.14 The Internet as a virtual parallel computer.
Another viewpoint of the Internet, as a network of connected individual computer networks, is illustrated in Fig. 2.15 below. The typical networking switches are bridges, routers, gateways and so on, which we denote with the common term network processors [35].
Figure 2.15 The Internet as a network of connected networks (individual networks of workstations joined by networking switches: bridges, routers, gateways etc.).
Modelling of parallel computers including
communication networks
Communication demands (parallel processes, IPC data) in parallel computers arrive randomly at a source node and follow a specific route through the communication networks towards their destination node. Data lengths of communicated IPC data units (for example in
words) are considered to be random variables following distributions
according to the Jackson theorem [15, 33]. Those data units are then
sent independently through the communication network nodes
towards the destination node. At each node, a queue of incoming
data units is served according to a first-come first-served (FCFS)
basis.
In Fig. 2.16 below we see illustrated a generalisation of any parallel computer, including its communication network, as follows:
Computing nodes u_i (i = 1, 2, ..., U) of any parallel computer are modelled as graph nodes.
Network communication channels are modelled as graph edges r_ij (i ≠ j), representing communication intensities (relation probabilities).
Other parameters of this abstract model are defined as follows:
λ_1, λ_2, ..., λ_U represent the total intensities of the input data streams to the individual network computing nodes (the summary input stream from the other connected computing nodes to the given i-th computing node). Each is given as a Poisson input stream with intensity λ_i demands per time unit.
r_ij are the relation probabilities from node i to the neighbouring connected nodes j.
μ_1, μ_2, ..., μ_U correspond to the total external output streams of data units from the used nodes (the total output stream from the given node to the connected computing nodes).
The created abstract model, according to Fig. 2.16 below, belongs in queuing theory to the class of open queuing systems (open queuing networks).
Figure 2.16 Model of a parallel computer including its communication network.
Formally, we can adjust the abstract model by
adding two virtual nodes, node 0 and node U+1, according to Fig. 2.17, where:
Virtual node 0 represents the sum of the individual total external input intensities, Λ = Σ λ_i (i = 1, ..., U), to the computing nodes u_i.
Virtual node U+1 represents the sum of the individual total output intensities, Σ μ_i (i = 1, ..., U), from the computing nodes u_i.
Figure 2.17 Adjusted abstract model (virtual node 0 supplies the external input intensities λ_i to the computing nodes and virtual node U+1 collects their output intensities μ_i).
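To make this abstract queuing model concrete, the short sketch below (an illustration under assumed values, not code or data from the book) solves the traffic equations of such an open queuing network, lambda_i = lambda0_i + Σ_j lambda_j·r_ji, by fixed-point iteration and then, following the Jackson theorem cited above, treats every node as an independent M/M/1 FCFS system to report its utilisation and mean response time. The three-node routing matrix, external input intensities and service rates are illustrative assumptions.

```c
/* Sketch: traffic equations of an open queuing network of U computing
 * nodes, lambda[i] = lambda0[i] + sum_j lambda[j] * r[j][i], solved by
 * fixed-point iteration. All numeric values are illustrative assumptions.
 * Compile: cc open_network.c -o open_network -lm
 */
#include <stdio.h>
#include <math.h>

#define U 3                                   /* number of computing nodes (assumed) */

int main(void)
{
    double lambda0[U] = { 2.0, 1.0, 0.5 };    /* external Poisson input intensities  */
    double mu[U]      = { 5.0, 4.0, 3.0 };    /* service rates of the FCFS nodes     */
    double r[U][U] = {                        /* relation probabilities r[i][j];     */
        { 0.0, 0.3, 0.2 },                    /* the remaining probability leaves    */
        { 0.1, 0.0, 0.4 },                    /* the network (virtual node U+1)      */
        { 0.2, 0.2, 0.0 }
    };
    double lambda[U] = { 0.0 }, next[U];

    /* fixed-point iteration of the traffic equations */
    for (int it = 0; it < 1000; it++) {
        double diff = 0.0;
        for (int i = 0; i < U; i++) {
            next[i] = lambda0[i];
            for (int j = 0; j < U; j++)
                next[i] += lambda[j] * r[j][i];
            diff += fabs(next[i] - lambda[i]);
        }
        for (int i = 0; i < U; i++)
            lambda[i] = next[i];
        if (diff < 1e-12)
            break;
    }

    /* per the Jackson theorem, each node behaves as an independent M/M/1 queue */
    for (int i = 0; i < U; i++) {
        double rho = lambda[i] / mu[i];                           /* utilisation */
        double t   = rho < 1.0 ? 1.0 / (mu[i] - lambda[i]) : INFINITY;
        printf("node %d: total lambda = %.4f, rho = %.3f, mean response = %.4f\n",
               i + 1, lambda[i], rho, t);
    }
    return 0;
}
```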
3
Parallel Algorithms
Introduction
During recent years, there has been increased interest in the field of
scientific research into effective parallel algorithms. This trend
towards parallel algorithms also supports the actual trends in
programming technologies towards the development of modular
applied algorithms based on object oriented programming (OOP).
OOP algorithms are on their own merit a result of abstract thinking
towards parallel solutions for existent complex problems.
From the beginning of applied computing, users and programmers have demanded ever more powerful computers and more efficient applied algorithms. Among the long-standing technologies for achieving this is the implementation of parallel principles, both in computers and in applied parallel algorithms. In this sense the term parallel programming relates to every program which contains more than one parallel process [11, 53], where each process represents a single independent sequential part of a program.
One basic attribute of parallel algorithms is the achievement of faster solutions in comparison with the quickest sequential solution. The role of the programmer is, for a given parallel computer and a given application task (a complex application problem), to develop parallel algorithms (PA). Fig. 3.1 below demonstrates how to derive parallel algorithms from existent sequential algorithms.
Figure 3.1 The methodology of deriving parallel algorithms.
In general we suppose that potentially effective parallel algorithms, according to the algorithm classification in Fig. 3.2, should be in the group P of polynomial algorithms. The other acronyms used in Fig. 3.2 are as follows [39]:
NP the general group of non-deterministic polynomial algorithms, containing all the classes below.
NC (Nick's class). A group of effective polynomial algorithms (efficiently parallelisable).
PC Polynomial complete. A group of polynomial algorithms with a high degree of solving complexity.
NPC Non-deterministic polynomial complete. This group consists of the hardest problems in NP; an effective solution of any NPC problem would make it possible to solve all other NPC problems effectively.
Figure 3.2 Algorithm classification.
Parallel processes
To derive PA we have to create the conditions for potential parallel
activities by dividing the input problem algorithm into its constituent
parts (decomposition strategy) as illustrated in Fig. 3.3. These
individual parts could be as follows:
Heavy parallel processes.
Light parallel processes, known as threads.
In general we can define a standard process as a developed algorithm,
or as its independent parts. In detail, the process does not represent
only some part of a compiled program because the register status of
the processor (process context) also belongs to its characterisation.
An illustration of this kind of standard process is shown in Fig. 3.4
on page 32.
Figure 3.3 Illustration of parallel processes (decomposition of a complex problem, i.e. a sequential algorithm, into parallel processes 1 to n).
Figure 3.4 Illustration of a standard process (program code, static data, stack and processor registers within a memory module).
Every standard process has, therefore, its own system stack, which
contains processed local data. In case of a process interruption there
is also an actual register status of the processor. It is obvious that we
may have multiple standard processes existing at the same time, which share some program parts, but whose processing contexts (process local data) are different. The tools needed to manage processes (initialisation, abort, synchronisation, communication etc.) are provided as services within the kernels of multi-tasking operation systems. An illustration of a standard multi-process state is shown in Fig. 3.5.
Figure 3.5 A parallel algorithm based on multiple parallel processes (several processes, each with its own stack and registers, sharing the process code and static data in one memory module).
But the concept of generating standard processes with individual address spaces is very time-consuming. For example, in the UNIX operating system a new process is generated with the fork() system call, which creates a nascent (child) process with its own new address space. In detail this means memory allocation, copying of the data segments and descriptors from the original process, and the creation of a stack for the nascent process. Therefore, we have named this concept a heavy-weighted
process. It is obvious that the heavy-weighted approach does not
support the effectiveness of applied parallel processing or the
necessary scalability of parallel algorithms. In relation to this it
became necessary to develop another less time-consuming concept of
process generation, named the light-weighted process. This lighter concept of generating new processes, under the name of threads, has been implemented in various operation systems, supported thread libraries and parallel developing environments.
The basic difference between a standard process and a thread is that
we can generate additional new threads within a standard process,
which use the same address space, including a descriptor declaration
of the origin of the standard process.
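To make the difference between heavy-weighted and light-weighted processes concrete, here is a minimal C sketch (an assumed example, not code from the book; the basic Pthread routines are summarised in Appendix 3). A child process created with fork() receives a copy of the address space, so its change to the counter is not seen by the parent, whereas a thread created with pthread_create() shares the address space, so its change is seen.

```c
/* Sketch: heavy-weighted process (fork, own address space) versus
 * light-weighted process (thread, shared address space). The shared
 * counter is an illustrative variable. Compile: cc procs.c -o procs -lpthread
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

static int counter = 0;               /* illustrative shared variable */

static void *thread_body(void *arg)
{
    (void)arg;
    counter++;                        /* operates on the parent's address space */
    return NULL;
}

int main(void)
{
    pid_t pid = fork();               /* heavy-weighted: copies the address space */
    if (pid == 0) {
        counter++;                    /* modifies only the child's own copy */
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork():           counter = %d (child's change not visible)\n", counter);

    pthread_t t;                      /* light-weighted: shares the address space */
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after pthread_create(): counter = %d (thread's change is visible)\n", counter);
    return 0;
}
```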
Classification of PAs
In principle, parallel algorithms are divided into the two following
basic classes:
Parallel algorithms with shared memory (PAsm). In this case, parallel processes can communicate through shared variables using an existent shared memory. In order to control parallel processes, typical synchronisation tools are used, such as busy waiting, semaphores and monitors, to guarantee the exclusive use of shared resources by only a single parallel process [60, 66]. These algorithms are developed for parallel computers with a dominant
shared memory, such as actual symmetrical multiprocessors or
multicore systems on the motherboard (SMP).
Parallel algorithms with distributed memory (PAdm). Distributed parallel algorithms perform the synchronisation and cooperation of parallel processes only via network communication. The term distributed (asynchronous) parallel algorithm is defined in terms of individual parallel processes being performed on independent computing nodes (processors, single or parallel computers) of a parallel computer with a distributed memory [68, 73]. These algorithms are developed for parallel computers with a distributed memory such as an actual NOW system and its higher integration forms known as Grid systems.
Mixed PAs. Very promising parallel algorithms which use the advantages of the dominant parallel computers based on NOW modules as follows:
The use of parallel processes with a shared memory in
individual workstations (SMPs PCs).
The use of other parallel processes based on distributed
memory in a NOW module.
The main difference between these groups is in the form of inter-
process communication (IPC) among parallel processes. Generally
we can say that IPC in a parallel system with a shared memory offers more communication possibilities than in distributed systems.
Parallel algorithms with a shared memory
A typical activity graph of parallel algorithms with shared memory (PAsm) is shown in Fig. 3.6. In order to control the decomposed parallel processes a synchronisation mechanism is necessary, such as one of the following:
Semaphores.
Monitors.
Busy waiting.
Path expressions.
Critical regions (CR).
Conditional critical regions (CCR).
Figure 3.6 A typical activity graph of PAsm: groups of parallel processes separated by synchronisation points.
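As a minimal illustration of this shared-memory pattern, and of the OpenMP standard named among the supported tools, the sketch below (an assumed example, not taken from the book) runs two groups of parallel work on shared arrays separated by the implicit barrier at the end of the first loop, mirroring the synchronisation points of the activity graph. The array size and the per-element work are illustrative.

```c
/* Sketch of the PAsm activity graph in OpenMP: two waves of parallel
 * work on shared data separated by an implicit barrier (synchronisation).
 * The array size and the work per element are illustrative assumptions.
 * Compile: cc -fopenmp pasm.c -o pasm
 */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];             /* shared memory of the parallel processes */

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp for               /* first wave of parallel processes */
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* the implicit barrier here is the synchronisation point */

        #pragma omp for               /* second wave reads the shared results */
        for (int i = 0; i < N; i++)
            b[i] = a[i] + (i > 0 ? a[i - 1] : 0.0);
    }

    printf("b[N-1] = %f (computed with up to %d threads)\n",
           b[N - 1], omp_get_max_threads());
    return 0;
}
```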
Parallel algorithms with a distributed memory
Parallel algorithms with distributed memory (PAdm) consist of parallel processes which are carried out on the asynchronous computing nodes of a given parallel computer. Therefore, for all the required cooperation of parallel processes only inter-process communication (IPC) is available. The principal illustration of parallel processes for PAdm is given in Fig. 3.7.
Figure 3.7 Illustration of the activity graph of PAdm.
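A minimal sketch of such cooperation purely through message passing, using the MPI standard treated in Chapter 4 (the value sent and the message tag are illustrative assumptions): process 0 sends a data unit to process 1, which is the only way the two parallel processes can interact without a shared memory.

```c
/* Sketch of PAdm cooperation via inter-process communication only:
 * rank 0 sends one value to rank 1 with MPI point-to-point calls.
 * Run with at least two processes, e.g. mpirun -np 2 ./padm
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        double value = 3.14;                    /* local data of process 0 (assumed) */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double value;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %f from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```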
Developing parallel algorithms
To exploit the parallel processing capability, the application program must be made parallel. The choice of the most effective way of doing this for a particular application problem (the decomposition strategy) is one of
the most important steps in developing an effective parallel algorithm
[45]. The development of the parallel network algorithm, according
to Fig. 3.8, includes the following activities:
Decomposition: the division of the application into a set of parallel processes.
Mapping: the way in which processes and data are distributed among the nodes.
Inter-process communication: the way of communication and synchronisation among individual processes.
Tuning: alteration of the working application to improve performance (performance optimisation).
Figure 3.8 Development steps in parallel algorithms: the problem is decomposed into processes, the processes are mapped onto the processors or workstations of a parallel computer, and the partial results are synthesised into the parallel solution.
Decomposition strategies
The development of sequential algorithms implicitly supposes the existence of an algorithm for the given problem; only later, during the stage of practical programming, are suitable data structures defined and used. In contrast to this classic development method, the design of a parallel algorithm should include from the very beginning a potential decomposition strategy, together with the distribution of the input data to the decomposed parallel processes. The
selection of a suitable decomposition strategy has a cardinal
influence on the further development of the parallel algorithm.
The decomposition strategy defines a potential division of any
given complex problem into its constituent parts (parallel processes)
in such a way, that they could be performed in parallel via the
computing nodes of any given parallel computer. The existence of
some kind of decomposition method is a critical assumption for a
possible parallel algorithm. The potential degree of decomposition
of any given complex problem is crucial for the effectiveness of a
parallel algorithm [62, 70]. Until now, developed parallel algorithms
and the corresponding decomposition strategies have been mainly
related to available synchronous parallel computers based on classic
massive parallel computers (supercomputers and their innovations).
The development of parallel algorithms for the currently dominant parallel computers NOW and Grid requires at least modified decomposition strategies incorporating the following priorities:
An emphasis on functional parallelism for complex problems.
Minimised inter-process communication (IPC).
The most important step is to choose the best decomposition method
for any given application problem. To do this it is necessary to
understand concrete application problems, data domains, the
algorithms used and the flow of control in any given application.
When designing a parallel program, the description of the high-level
algorithm must include, in addition to designing a sequential
program, the method you intend to use to break the application
down into processes (decomposition strategy) and distribute the data to different computing nodes (mapping). The chosen decomposition method drives the rest of the program's development. This is true both when developing new applications and when porting serial code. The decomposition method tells us how to
structure the code and data and defines the communication topology
[69, 88].
Problem parallelisation is a very creative process which determines the potential degree of parallelism. It is a way of dividing complex problems into their constituent parts (parallel processes) in such a way that it is possible to perform the PA in parallel. The way of decomposition depends strongly on the algorithm used and on the data structures, and it has a significant influence on performance and its communication consequences. Until now, the decomposition models and strategies developed have been oriented mainly towards supercomputers and their innovated types (classic parallel computers) in use around the world. On the other hand, parallel algorithms at this time run on the dominant parallel computers (SMP, NOW, Grid), thus demanding modified decomposition models and strategies with a reduced inter-process communication intensity (NOW, Grid), which also account for the waiting latency T(s, p)_wait that arises when shared resources are used, or are not available at their full capacities. Typical decomposition strategies are:
Natural parallel decomposition.
Domain decomposition.
Control decomposition:
manager/workers;
functional.
A divide-and-conquer strategy for the decomposition of complex
problems.
Object oriented programming (OOP).
Natural parallel decomposition
Natural parallel decomposition allows a simple creation of parallel processes which normally need only a low number of inter-process communications (IPC) for their cooperation. Also, in parallel computation the sequence of the individual solutions is normally not important. As a consequence there is no necessity for any synchronisation of the parallel processes carried out during parallel computation. Based on these attributes, natural parallel algorithms allow the achievement of a practically ideal p-multiple speed-up using p computing nodes of a parallel computer (linear speed-up), with minimal additional effort needed for developing the parallel algorithms. Typical examples are numerical integration parallel algorithms [31]. In chapter 9 we will analyse applied parallel algorithms based on the natural decomposition model (the computation of π).
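As a hedged illustration of such a naturally parallel problem (the sketch below is not taken from chapter 9; the interleaved splitting into p independent strips and all names are illustrative assumptions), the value of π can be approximated by numerical integration of 4/(1+x^2) over [0,1], where every strip is computed independently and no communication is needed until the final summation:

#include <stdio.h>

/* Illustrative sketch only: approximate pi as the integral of 4/(1+x*x)
   on [0,1] with n rectangles, split into p naturally parallel partial
   sums; each partial sum is independent of all the others. */
int main(void)
{
    const int n = 1000000;        /* number of rectangles                  */
    const int p = 4;              /* assumed number of computing nodes     */
    const double h = 1.0 / n;     /* width of one rectangle                */
    double pi = 0.0;

    for (int node = 0; node < p; node++) {    /* one pass per "node"       */
        double partial = 0.0;
        for (int i = node; i < n; i += p) {   /* interleaved strip of work */
            double x = (i + 0.5) * h;
            partial += 4.0 / (1.0 + x * x);
        }
        pi += partial * h;                    /* final summation (reduction) */
    }
    printf("pi is approximately %.10f\n", pi);
    return 0;
}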
Domain decomposition
One typical characteristic of many complex problems is some
regularity in sequential algorithms or in their data structures
(computational or data modularity). The existence of these
computational or data modules then represents the domain of the
computation or data. A decomposition strategy based on this domain uses a substantial part of such a complex problem to generate the parallel processes. This domain is mostly
characterised by a massive, discrete or static data structure. A typical
example of a computational domain is iteration computation, and a
typical example of a data domain is a matrix [25, 93].
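A minimal sketch of such a data-domain decomposition, assuming a block distribution of n matrix rows among p equally powerful computing nodes (the helper block_range() is hypothetical and introduced only for illustration):

#include <stdio.h>

/* Illustrative sketch: block (domain) decomposition of n rows among
   p computing nodes; the first n % p nodes receive one extra row. */
static void block_range(int n, int p, int rank, int *first, int *last)
{
    int base  = n / p;
    int extra = n % p;
    *first = rank * base + (rank < extra ? rank : extra);
    *last  = *first + base + (rank < extra ? 1 : 0);   /* half-open range */
}

int main(void)
{
    int n = 10, p = 4, first, last;
    for (int rank = 0; rank < p; rank++) {
        block_range(n, p, rank, &first, &last);
        printf("node %d owns rows %d..%d\n", rank, first, last - 1);
    }
    return 0;
}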
Functional decomposition
Functional decomposition strategies concentrate their attention on
finding parallelism in the distribution of a sequential computation
stream in order to create independent parallel processes. In
comparison to domain decomposition we are concerned with
creating potential alternative control streams of concrete complex
problems. In this way, functional decomposition strives to create as many parallel threads as possible. An illustration of functional decomposition is shown in Fig. 3.9.
The most widely distributed functional strategies are:
Control decomposition.
Manager/workers (server/clients).
Typical parallel algorithms are complex optimisation problems,
which are connected to the consecutive searching of massive data
structures.
Control decomposition
Control decomposition as an alternative to functional decomposition
concentrates on any given complex problem as a sequence of
individual activities (operations, computing steps, control activities
etc.), from which we are able to derive multiple control processes.
Figure 3.9 An illustration of functional decomposition.
For example, we can consider searching a game tree in which the branching factor changes from node to node in response to the game moves. Any static allocation of such a tree is either not possible, or it causes an unbalanced load.
In this way, the supposed irregular structure of the problem controls the decomposition; this is typical of complex problems in artificial intelligence and similar non-numerical applications. Secondly, it is very natural to look at any given complex problem as a collection of modules which represent the necessary functional parts of the algorithm.
Decomposition strategy known as manager/workers
Another alternative to functional decomposition is the strategy
called manager/workers. In this case one parallel process is used as
a control (the manager). The manager process then sequentially and
continuously generates the necessary parallel processes (the workers)
for their performance in controlled computing nodes. An illustration
of this decomposition method known as manager/workers is shown
in Fig. 3.10.
Figure 3.10 The manager/worker parallel structure.
The manager process controls the computation sequence in
relation to the sequential finishing of the parallel processes carried
out by the individual workers. This decomposition strategy is
suitable mainly in cases where any given problem does not contain
static data or a known fixed number of computations. In these
cases it becomes necessary to concentrate on controlling aspects of
the individual parts of complex problems. After this analysis has been carried out, the needed communication sequence follows in order to achieve the demanded time ordering of the created parallel processes. The degree of division of any given complex problem should match the number of computing nodes of the parallel computer, the parallel computer architecture and the known performance of the computing nodes. One of the most important elements of the previous steps is the allocation of the parallel processes. It is more effective to allocate a parallel process to the first free computing node (a worker) than to use a
defined sequential order of allocation. In chapter 12 we will
analyse applied parallel algorithms (complex combinatorial
problems) based on the manager/worker decomposition.
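The following hedged sketch illustrates the manager/workers idea with MPI point-to-point communication (the MPI routines themselves are described in chapter 4); the task and result payloads are reduced to single integers, and it is assumed that the program is started with at least two processes and that the number of tasks is not smaller than the number of workers:

#include <mpi.h>
#include <stdio.h>

#define TASK_TAG 1
#define STOP_TAG 2

/* Hedged sketch of the manager/workers strategy: the manager hands the
   next task to the first worker that becomes free, i.e. that returns a
   result. Assumes n_tasks >= number of workers and at least 2 processes. */
int main(int argc, char *argv[])
{
    int rank, size, n_tasks = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                   /* the manager */
        int next = 0, done = 0, result, stop = -1;
        MPI_Status st;
        for (int w = 1; w < size; w++) {               /* one task per worker */
            MPI_Send(&next, 1, MPI_INT, w, TASK_TAG, MPI_COMM_WORLD);
            next++;
        }
        while (done < n_tasks) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TASK_TAG,
                     MPI_COMM_WORLD, &st);
            done++;
            if (next < n_tasks) {                      /* first free worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TASK_TAG,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, STOP_TAG,
                         MPI_COMM_WORLD);
            }
        }
    } else {                                           /* a worker */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == STOP_TAG)
                break;
            result = task * task;                      /* dummy computation */
            MPI_Send(&result, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}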
The divide-and-conquer strategy
The divide-and-conquer strategy decomposes complex problems
into sub-tasks which have the same size, but it iteratively keeps
repeating this process to obtain yet smaller parts of any given
complex problem. In this sense, this decomposition model iteratively
applies a problem partitioning technique, as we can see in Fig. 3.11.
Divide-and-conquer is sometimes known as recursive partitioning
[41]. A typical complex problem size is an integer power of 2, and the divide-and-conquer strategy halves the complex problem into two equal parts at each iterative step.
In chapter 11 we will show an example of applying the divide-
and-conquer strategy to analyse Discrete Fourier Transform (DFT).
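A minimal sketch of the divide-and-conquer idea on a toy problem (summing n = 2^k numbers); this is only an illustration and not the DFT algorithm of chapter 11:

#include <stdio.h>

/* Illustrative divide-and-conquer sketch: the problem is halved at every
   step; each half is an independent sub-task that could be assigned to a
   different computing node, and the partial results are then combined. */
static double dc_sum(const double *a, int n)
{
    if (n == 1)
        return a[0];                           /* trivial sub-problem        */
    int half = n / 2;
    double left  = dc_sum(a, half);            /* first half  (one sub-task) */
    double right = dc_sum(a + half, n - half); /* second half (another one)  */
    return left + right;                       /* combine the partial results */
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};    /* n = 8, as in Fig. 3.11 */
    printf("sum = %.1f\n", dc_sum(a, 8));
    return 0;
}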
The decomposition of big problems
In order to decompose big problems it is in many cases necessary to use more than one decomposition strategy. This is mainly due to the hierarchical structure of a concrete big problem. The hierarchical character of big problems means that we look at such a big problem as a set of various hierarchical levels, whereby it would be useful to apply a different decomposition strategy on every level. This approach is known as multilayer decomposition.
The effective use of multilayer decomposition contributes to a new generation of common parallel computers based on the implementation of more than a thousand computing nodes (processors, cores). Secondly, the unifying trends of high performance
parallel computing (HPC) based on massive parallel computers
(SMP modules, supercomputers) and distributed computing (NOW,
Grid) are opening up new horizons to programmers.
Examples of typical big problems are weather forecasting, fluid flow, the structural analysis of matter, nanotechnologies, high-energy physics, artificial intelligence, symbolic processing, knowledge economics and so on. The multilayer decomposition
model makes it possible to decompose a big problem, first into
simpler modules, and then in the second phase to apply a suitable
decomposition strategy only to these decomposed modules.
Figure 3.11 An illustration of the divide-and-conquer strategy (n=8).
Object oriented decomposition
Object oriented decomposition is an integral part of object oriented
programming (OOP). In fact, it presents a modern method of
parallel program development. OOP, besides increasing the demand
for abstract thinking on the part of the programmer, contains the
decomposition of complex problems into independent parallel
modules known as objects [93]. In this way, the object oriented
approach looks at a complex problem as a collection of abstract
data structures (objects), where the integral parts of these objects
also have object functions built into them as another form of parallel
processing. In the same way, OOP creates a bridge between sequential computers (the von Neumann concept) and modern parallel computers based on SMP, NOW and Grid. An object structure is illustrated in Fig. 3.12.
Mapping
This step allocates already created parallel processes to the computing
nodes of a parallel computer for their parallel execution. It is
necessary to achieve the goal that every computing node should
perform its allocated parallel processes (one or more), with at least
approximate input loads (load balancing) based on the real
assumption of equally powerful computing nodes. The fulfilment of
this condition contributes to optimal parallel solution latency.
Figure 3.12 Object structure.
Inter process communication
In general we can say that the most dominant elements of parallel
algorithms are their sequential parts (parallel processes) and inter
process communication (IPC) between the parallel processes taking
place.
Inter process communication in shared memory
Inter process communication (IPC) for parallel algorithms with a shared memory (PA_sm) is defined in the context of supporting developing standards as follows:
OpenMP.
OpenMP threads.
Pthreads.
Java threads.
Other.
The concrete communication mechanisms make use of the existing shared memory: one parallel process stores the data being communicated in a specific memory location with its own address, and another parallel process can then read the stored data (shared variables). It looks very simple, but it is necessary to guarantee that only one parallel process at a time can use this addressed memory location. These necessary control mechanisms are known as synchronisation tools. Typical synchronisation tools are:
Busy waiting.
Semaphore.
Conditional critical regions (CCR).
Monitors.
Path expressions.
These synchronisation tools are also used in modern multi-user operating systems (UNIX, Windows etc.).
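As a hedged illustration of one such synchronisation tool, the sketch below protects a shared variable with a mutex that plays the role of a critical region; it uses the Pthreads library, which is described later in chapter 4, and all names are illustrative:

#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch: a shared variable updated inside a critical region
   guarded by a mutex, so only one parallel process (thread) at a time can
   access the shared memory location. */
static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;                        /* the argument is not used here */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical region  */
        shared_counter++;
        pthread_mutex_unlock(&lock);  /* leave the critical region  */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* expected 400000 */
    return 0;
}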
Inter process communication in distributed memory
Inter process communication (IPC) for parallel algorithms with a distributed memory (PA_dm) is defined within supporting developing standards as follows:
MPI (message passing interface).
Point to point (PTP) communication commands.
Send commands.
Receive commands.
Collective communication commands.
Data distribution commands.
Data gathering commands.
PVM (Parallel virtual machine).
Java (Network communication support).
Other.
To create the necessary synchronisation tools in MPI, we only have the existing network communication of the connected computing nodes available. A typical MPI network communication is shown in Fig. 3.13. Based on the existing communication links, MPI contains the synchronisation command BARRIER.
Performance tuning
After verifying a developed parallel algorithm on a concrete parallel
system, the next step is performance modelling and optimisation
(effective PA). This step consists of an analysis of previous steps in
such a way as to minimise the whole latency of parallel computing
T(s, p). The optimisation of T(s, p) carried out for any given parallel
algorithm depends mainly on the following factors:
The allocation of a balanced input load to the computing nodes in use in a parallel computer (load balancing) [7].
The minimisation of the accompanying overheads (parallelisation, inter-process communication IPC, control of PA) [91].
Figure 3.13 Illustration of an MPI network communication.
To carry out load balancing we obviously need to use equally powerful computing nodes of the parallel computer, which results in an even load allocation for any given developed PA. In dominant asynchronous parallel
computers (NOW, Grid) it is necessary to reduce (optimise) the
number of inter process communications IPC (communication
loads); for example, through the use of an alternative existing
decomposition model.
4 Parallel Program Developing Standards
Parallel programming languages
The necessary supported parallel developing standards should have suitable tools and services for the various existing forms of parallel processes. The existing parallel programming languages are divided
in a similar way to parallel computers into the two following basic
groups:
Synchronous programming languages (SPL).
Asynchronous programming languages (APL).
Synchronous programming languages assume a concrete shared address space (shared memory). Based on this shared memory the synchronisation tools are implemented, such as monitors, semaphores, critical sections, conditional critical sections and so on [66]. Typical parallel computers for the application of SPL are SMP multicore and multiprocessor parallel computers.
On the other hand, asynchronous programming languages
correspond to a parallel application program interface (API), which
only has a distributed memory [13, 86]. So in this case the only tool
for parallel process cooperation is inter process communication
IPC. Typical parallel computers for the application of API are
asynchronous parallel computers based on computer networks
(NOW, Grid). The support for application parallel interfaces (API) means the development of standard tools for both groups of parallel algorithms PA (PA_sm, PA_dm).
Open MP standard
In terms of the development of parallel computers and their applications, parallel architectures with a shared memory currently also play a very important role. The condition for their effective application deployment was the standardisation of their development environment (API). This standard, after the experience gained with the parallel extension High Performance Fortran (HPF), became the OpenMP standard for existing parallel computers with a shared memory.
OpenMP is an API for the programming languages C/C++ and Fortran which supports SMP parallel computers (multiprocessors, multicores, supercomputers) with a shared memory under different operating systems (Unix, MS Windows etc.). The methodology of parallelisation in OpenMP is based on the use of so-called compiler directives, library functions and shared variables in order to specify the support for parallelism in any given existing shared memory. Compiler control commands (directives) indicate the parts of a compiled program which are to be executed in parallel, together with additional auxiliary functions determined in order to parallelise the relevant parts. The advantage of this chosen approach is the relatively simple transfer between a sequential developing API and an OpenMP API, because sequential compilers consider compiler directives as remarks and ignore them. OpenMP is composed of a set of compiler directives, library functions and variables in commands that affect the implementation of parallel algorithms. The basic structure of an OpenMP API is illustrated in Fig. 4.1.
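A small hedged illustration of this directive-based approach (the loop and the variable names are assumptions for demonstration only); a compiler without OpenMP support simply ignores the directive and still produces a correct sequential program:

#include <stdio.h>

/* Illustrative sketch: the compiler directive parallelises the loop over
   the available threads; compiled without OpenMP, the pragma is ignored
   and the program runs sequentially with the same result. */
int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);

    printf("harmonic sum = %f\n", sum);
    return 0;
}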
The original demands on OpenMP and its properties are as follows [64, 82]:
Portability.
Scalability.
Efficiency.
High level.
Support of data parallelism.
Simple to use.
Functional robustness.
The basic functional properties are as follows:
An API for parallel computers with a shared memory (shared
address space).
Portability of program codes.
Supported by Fortran (in the world), C/C++.
Placed between High Performance Fortran (HPF) and MPI.
Figure 4.1 The Structure of an OpenMP.
From the HPF comes simplicity of use through compiler
directives, and from the MPI comes effectiveness and
functionality.
Standardised from 1997.
The purpose of the OpenMP standard was to provide a unified model for developing parallel algorithms with a shared memory (PA_sm) which would be portable between different parallel architectures with a shared memory coming from different producers. In relation to the other existent parallel API standard (MPI), OpenMP lies between HPF and MPI: the HPF has easy-to-use compiler commands, and the MPI standard provides high functionality. Some basic modules of OpenMP are illustrated in Fig. 4.2.
Figure 4.2 OpenMP modules.
When analysing the development of parallel algorithms and
computers and their future direction, it is implicitly assumed that
there will be further innovations of the OpenMP standard and its
available possible alternatives for more massive multiprocessors and
multicore parallel architectures with a shared memory (massive SMP architectures), which could eliminate its limited scalability in practical use. Individual OpenMP modules are illustrated in Fig. 4.3. Further details of OpenMP can be found in [64] or in special manuals for OpenMP.
Figure 4.3 Basic modules of OpenMP.
OpenMP threads
Classic operating systems generate processes with separate address
space to guarantee safety and to protect individual processes in
multi-user and multi-tasking environments. Technical support for
this kind of protection began with Intel processors in the Intel model
80286 through a new protected mode. When creating a classic
process it generates a so-called basic thread. Any additional thread can then be created by an already-generated thread through a command of the following typical shape:
thread create (function, argument1, ..., argumentn);
This way of creating threads is very similar to a procedure call.
Thread commands create a new program branch in order to perform
a function with any given argument. For each new branch a separate
stack is created, whereby they jointly use the remaining address
space (code segments, data segments, process descriptors) of the
original parent process. The OpenMP API implements multi-threading parallelism, in which the main thread of calculation is divided into a specified number of controlled subordinate threads. These threads are then executed in parallel, whereby all the threads jointly use all the resources of the basic thread. Each thread has an identification number, which can be obtained by using the function omp_get_thread_num(). The identification number is an integer,
while the main thread has the number 0. After finishing the creation
of threads in parallel, the threads join together with the main thread
to carry on executing parallel algorithms. The executive (runtime) environment allocates threads for processing based on the parallel computer's input load and other system factors. The number of threads can be set using runtime (environment) variables or OpenMP functions.
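A hedged sketch of these runtime functions (the requested number of threads is an arbitrary illustrative value):

#include <omp.h>
#include <stdio.h>

/* Illustrative sketch: the main thread forks a requested number of threads;
   each thread obtains its identification number, the main thread being 0. */
int main(void)
{
    omp_set_num_threads(4);              /* request four threads           */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* identification number          */
        int nt = omp_get_num_threads();  /* threads actually created       */
        printf("thread %d of %d\n", id, nt);
    }                                    /* implicit join with main thread */
    return 0;
}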
Problem decomposition
Fig. 4.4 illustrates problem decomposition as a process of multithread parallelisation in OpenMP. Problem decomposition into individual threads can be done through the following commands:
omp for or omp do: these commands distribute the iterations of a given loop among a defined number of threads.
Section commands perform the allocation of contiguous code blocks to different threads.
A single command defines a functional block of code that is executed by only one thread.
A master command is similar to a single command, but the functional block of code is performed only by the main thread.
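The work-sharing commands listed above might be combined as in the following hedged sketch (the printed texts are illustrative only):

#include <stdio.h>

/* Illustrative sketch of the work-sharing commands: omp for splits loop
   iterations, sections assigns code blocks to different threads, and
   single/master restrict a block to one thread. */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp for                  /* loop iterations split over threads */
        for (int i = 0; i < 8; i++)
            printf("iteration %d\n", i);

        #pragma omp sections             /* independent blocks, one per thread */
        {
            #pragma omp section
            printf("section A\n");
            #pragma omp section
            printf("section B\n");
        }

        #pragma omp single               /* executed by exactly one thread     */
        printf("single block\n");

        #pragma omp master               /* executed only by the main thread   */
        printf("master block\n");
    }
    return 0;
}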
Figure 4.4 An illustration of multithread parallelisation in OpenMP.
A thread is thus a simplified version of the standard process,
including its own instruction register and stack for the independent
execution of parallel program parts. Due to the different mechanisms
of thread generation and its support, applied parallel algorithms
were not portable to the existing different operating systems. The
same goes for various innovative alternatives of the same operating
systems. A move towards standardisation was made in the middle of the nineteen-nineties by extending the standard C library with thread support. The credit for these extensions belongs to a group of people in an organisation called POSIX (Portable Operating Systems Interface), so this extended additional library was named Pthreads (POSIX threads). A set of these library routines is currently available in various versions of UNIX operating systems. Pthreads, with their structure, were a low-level programming model for parallel computers with a shared memory, though they were not aimed at high performance computing (HPC). The reasons for this include a lack of support for the much-used FORTRAN language, and even the C/C++ language was problematic when applied to scientific parallel algorithms, because its orientation was towards supporting task parallelism, with minimal support for data parallelism.
The Pthreads library contains a set of commands for managing
and synchronising threads. In Appendix 3 there is a set of basic
commands which is sufficient for creating threads and their later
joining and synchronisation.
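A minimal hedged example in the spirit of these Pthreads routines (it does not claim to reproduce Appendix 3): every thread receives its own argument through pthread_create() and the parent joins the threads afterwards:

#include <pthread.h>
#include <stdio.h>

/* Illustrative sketch: creating Pthreads with an argument and joining
   them later, in the style of thread_create(function, argument). */
static void *worker(void *arg)
{
    int id = *(int *)arg;                 /* private argument of the thread */
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    int ids[4];

    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* later joining and synchronisation */
    return 0;
}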
MPI API standard
MPI (Message Passing Interface) [39, 82] is the standard for the development of distributed parallel algorithms (PA_dm) with message communication for asynchronous parallel computers with a distributed memory (NOW, Grid etc.). The basics of its standardisation were defined in 1993 and 1994 by an international group of dedicated professionals and developers under the name of the MPI Forum (approximately 40 organisations from the US and Europe), which drew on the experience gained until that time with message passing APIs such as PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory, as well as PARMACS, CHIMP EPCC and so on. The aims of MPI
API were to provide a standard communication message library for
creating portable parallel programs (source code portability of PP
between various parallel computers), and effective (optimised) parallel
programs based on message communication (distributed memory).
MPI is inherently not a language, but a library programming service that can be used by programs written in C/C++ and FORTRAN.
MPI provides a rich collection of communication programs for
two-point transmission PTP (point-to-point communication) and
collective operations for data exchange, global computations,
synchronisation and the joining of partial results. MPI also defines a
number of equally important essential requirements such as derived
data types and the necessary specifications of communication
services. Currently, there are several implementations of MPI and its
versions for networks of workstations, groups of personal computers
(clusters), multiprocessors with a distributed memory and a virtual
shared memory (VSM). The MPI Forum, after standardising the
MPI 1 version in 1994, began working to add further requested
new services to the previous MPI standard, including dynamic
processes, support for the parallel decomposition strategy known as
manager/worker (client/server), collective communications, parallel
I/O operations and functions for the non-blocking of collective
communication (an innovation of MPI-1 in 1995). Further
developments of the standard MPI continued through to the
development of MPI 2, which was standardised in 1997.
Generally, existing MPI standards are still considered as being low-
level because most of the activities for the development of distributed
parallel algorithms are carried out by a programmer. A programmer
developing a process designs and implements inter process
communication and synchronisation between parallel processes; he
also creates parallel data blocks and their distribution, as well as
mapping processes onto computations and the input and output of
data structures. If the programmer does not have additional support
for these activities it is very difficult to develop parallel algorithms on
a larger scale. Simplifying the development of distributed parallel algorithms is therefore achieved not by changing the MPI structure directly, but by the continuing improvement of additional accompanying tools.
Other ways towards an alternative model for the development of
distributed parallel algorithms are based on the use of virtual shared
memory (VSM) and parallel object oriented programming (POOP).
Another approach uses a series of created modular programming
structures (skeleton programming [140]) in the form of executive
libraries, which already support some task parallelisation and can be
implemented on top of the MPI standards.
MPI parallel algorithms
MPI algorithms present a set of mutually communicating parts of an
application algorithm (parallel processes). The applied parallel
algorithms MPI API contain linked library functions from a version
of the MPI API standard in use. To every parallel process of an MPI parallel algorithm consisting of n parallel processes, a unique sequence number is assigned in the form of an integer from 0 to n-1. These
integers are then used by MPI parallel processes to identify
interprocess communication IPCs, to carry out shared operations
and generally for the cooperation of created parallel processes. Data
communication between parallel processes on any given parallel
computer (supercomputer, SMP, NOW, Grid) is a transparent
operation for concrete parallel processes. MPI automatically selects
the most efficient communication mechanism available on the respective computing nodes of the communicating parallel processes. By using a defined numerical order of parallel processes,
all the operations are performed through mutual cooperation
independent of the physical position of the communicating partners.
Before calling an MPI parallel algorithm it is necessary to initialise
the MPI API version in which the MPI parallel algorithm was
developed. The following command serves this purpose:
MPI_Init (&argc,&argv)
This MPI command has to be called before using any MPI command.
After performing an initialisation of MPI API, all the MPI parallel
processes of any given parallel algorithm become available. Before
calling MPI_Init(), only one MPI command can be called, and that is MPI_Initialized(). This command is used to determine whether MPI_Init() has already been called. The command MPI_Initialized() has a single argument and is defined as follows:
MPI_Initialized (&flag)
An MPI parallel algorithm begins through calling MPI_Init (), and
ends by calling MPI_Finalize (), as shown in the following coded
part of an MPI parallel algorithm:
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);    /* MPI initialisation */
    ........
    ........
    MPI_Finalize();            /* MPI termination */
    return 0;
}
Task groups
MPI tasks may belong to some named groups. A group in MPI is an
object that can be reached using a predefined type MPI_Group. The
task group defines contexts in which MPI operations are restricted
to only the members of a given group. To members of the group are
assigned unique identifiers, or so-called serial numbers, within the
given group. The group is an ordered set of continuous serial
numbers starting from the number 0. MPI provides a number of
functions to create new groups from existing groups. MPI does not
have any function to create a new group from scratch. At the start,
all the tasks fall into one basic group from which it is possible to
form other groups. Members of a newly derived group can be drawn from one or more existing groups. A new group can be created
from an existing group, either through the exclusion of tasks or by the insertion of tasks from another group. A new group can also be
created from two existing groups using Boolean operations such as
union, conjunction and difference.
Communicators
In all systems with message communication, it is very important to
guarantee safe address space for this communication, in which
different messages are separated from one another; for example, library messages can be sent and received without interference with other system messages. For the safe separation of user and library messages it is not enough to use only a message flag (tag). Security
for safe message communications is already contained within the
MPI communicator concept. The communicator can be seen as tying a communication context to a group of tasks. A communicator is an object that can be reached using the predefined type MPI_Comm.
Communicators are divided into intracommunicators for
operations within one task group, and intercommunicators for
operations between different task groups. In this chapter we will
focus on intracommunicators. On MPI start-up, all tasks are connected to a single world communicator. If we need a new context, a parallel program sends a synchronisation call to derive a new context from the already existing one. To create a new communicator in MPI, an already-existing communicator is needed. The basic communicator for all MPI communicators is the
implicit communicator MPI_COMM_WORLD. MPI API has three
functions to create new communicators. These functions have to be
called by all the tasks belonging to the existing communicator,
although some tasks will not belong to the new communicator.
The order of tasks
Tasks contained in some communicators are sequentially assigned by
integer number identifiers from 0 to (n-1), where n is the size of the
group communicator. These numbers, known as the rank, are used to differentiate the various tasks within the group; for example, tasks with
a different rank may be assigned to different kinds of activities being
performed. Every task can determine its rank inside any given
communicator by calling the command MPI_Comm_rank ().
MPI_Comm communicator; /*communicator */
int my_rank; /* rank of task */
MPI_Comm_rank (communicator, &my_rank);
The parameter of this function is the existing communicator, and the
function result is the rank of the task called (my_rank) in this group
communicator.
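For completeness, a hedged fragment that, in addition to the rank, also determines the size of the group communicator with MPI_Comm_size(); both values are typically used to divide the work among the parallel processes:

#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch: every parallel process determines its rank and the
   size of the group communicator MPI_COMM_WORLD. */
int main(int argc, char *argv[])
{
    int my_rank, group_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);     /* rank of the calling task */
    MPI_Comm_size(MPI_COMM_WORLD, &group_size);  /* number of tasks in group */
    printf("process %d of %d\n", my_rank, group_size);
    MPI_Finalize();
    return 0;
}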
Collective MPI commands
The term for collective operations in MPI API characterises commands
that can be applied to all the members of a given group
(communicator). They also include commands from the MPI API which control the execution of the parallel processes (the synchronisation command Barrier) and the mutual inter-process communication of data (distribution, collection). Collective operations are usually defined
in terms of a set of tasks. There are three types of collective
operations:
Task management.
Global computing.
Data communication.
To increase flexibility we can use two types of collective operations:
With a prefix.
With a postfix.
The conclusion of these collective operations can then be different
for each task and in the correct order.
Synchronisation mechanisms
Synchronisation commands are used to guarantee a defined order of
calculations within a performed computation of parallel tasks. In
certain cases it is necessary to synchronise performed parallel tasks
with each other at some defined point in parallel computation or
defined time. Some parallel processes in the group should wait at a
defined synchronisation point until all the parallel processes reach
this point. Synchronisation in MPI API is implemented using message
communication and various barrier commands.
Conditional synchronisation
Receiving communication command MPI_Recv () forces the receiving
task to wait until the complete message has been received
(communication data). As a consequence of this, the sending
communication process can block the addressed parallel process
with conditional communications.
Rendezvous
A communication rendezvous of parallel processes is the establishment of a jointly defined synchronisation point. For example, if both the sending and the receiving communication operations are blocking, then the communication is not finished until both the sending and the receiving processes reach the defined synchronisation point.
Synchronisation command barrier
Tasks in any given group can be synchronised using a synchronisation
command called a barrier. This command makes sure that no task
continues beyond a defined barrier point until all of the tasks of the
given group reach the barrier point. The group can contain either all the parallel processes or only a subset of them, depending on the communicator in use. The barrier command is defined as:
MPI_Barrier (communicator)
where the command parameter is the communicator in use. Barrier
synchronisation is realised in such a way that all the tasks in the given
group call function MPI_Barrier (). Any parallel process will wait at
the defined barrier point until all the parallel processes of the given
communicator reach the barrier point. After calling the MPI_Barrier
(), parallel algorithms continue after performing the MPI_Barrier ()
call for each member of the given group communicator. An illustration
of the applied use of a barrier synchronisation command is illustrated
in Fig. 4.5. This barrier mechanism is necessary in the following two
typical cases:
The safe continuation of performing parallel process algorithms
with a shared memory while waiting when using the limited
shared resources of every real parallel computer (shared memory
module, shared communication channel etc.). In relation to
Fig. 4.5, waiting delays (latencies) develop during the performance
of parallel algorithms caused by real limited parallel computer
resources.
Parallel algorithms with a distributed memory are limited
through the performance of the various communication networks
in use (communication speed). For example, there are serial
communication transmission channels for parallel computers
such as buses, or shared communication channels on the
Ethernet. These potential latencies are very important mainly in
dominant parallel computers based on NOW and Grid.
These typical potential causes of waiting times represent a delay under the common definition T_syn(s, p). The barrier function MPI_Barrier() can be understood as a collective control operation carried out on the performed parallel processes.
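A short hedged sketch of the first case: the barrier guarantees that the time measurement of a computation phase starts at a common point for all parallel processes and that they wait for the slowest one before continuing (the dummy computation is only an illustration; MPI_Wtime() is the standard MPI wall-clock function):

#include <mpi.h>
#include <stdio.h>

/* Hedged sketch: barriers make all parallel processes start the timed
   phase together and wait for each other before the next phase. */
int main(int argc, char *argv[])
{
    int rank;
    double t_start, t_local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);             /* common starting point     */
    t_start = MPI_Wtime();

    double s = 0.0;                          /* dummy computation phase   */
    for (int i = 0; i < 1000000 * (rank + 1); i++)
        s += 1.0 / (i + 1.0);

    t_local = MPI_Wtime() - t_start;
    MPI_Barrier(MPI_COMM_WORLD);             /* wait for the slowest one  */
    printf("process %d: %.3f s (s = %.2f)\n", rank, t_local, s);

    MPI_Finalize();
    return 0;
}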
MPI collective communication mechanisms
MPI offers a wide range of collective communication commands.
These commands are necessary to make inter process communication
available between created parallel processes and their needed data
exchange. Basic collective communication operations are as follows:
Communication mechanisms for data distribution (dispersion)
are necessary, for example, in transferring input parameters to
created parallel processes from the main parallel process
(manager) and so on. The applied tasks require different variants
of data distribution.
Communication mechanisms for data reduction (collection) are
commands which create a formal dual action in relation to data
distribution. The same applied tasks require different kinds of
collection commands depending on the desired form of data
collection.
Figure 4.5 The synchronisation command barrier.
Data scattering collective communication commands
A standard example of data distribution is sending the value n/p (where p is the number of parallel processes) from the manager parallel process to all the active computing nodes of a parallel computer (workers). To carry out this activity we can use the MPI collective operation known as Broadcast when the allocated tasks are the same size, or
the Scatter MPI command in case the individual parallel processes
are different sizes.
The dispersion command Broadcast:
The MPI dispersion command Broadcast distributes data from a main
parallel process (named in MPI as Root) to all the other tasks of any
given group communicator. The Broadcast command is defined as follows:
MPI_Bcast (buffer, n, data_type, root, communicator)
where the parameters in use have the following meanings:
buffer         Starting address of the buffer.
n              Number of data elements in the buffer.
data_type      Data type of the buffer data elements.
root           Rank of the main parallel process (root).
communicator   Communicator.
This command has to be called by all the parallel processes of the communicator group with the same parameter values for both root and communicator. The buffer content of the control parallel process (root) is distributed to the buffers of all the parallel processes of the given communicator. An illustration of an MPI Broadcast command is shown in Fig. 4.6.
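A hedged usage sketch of the Broadcast command (the data values are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Hedged sketch: the root process fills a buffer and Broadcast copies it
   into the buffer of every parallel process of the communicator. */
int main(int argc, char *argv[])
{
    int rank, data[2] = {0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* only the root prepares the data */
        data[0] = 42;
        data[1] = 7;
    }
    MPI_Bcast(data, 2, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received %d %d\n", rank, data[0], data[1]);

    MPI_Finalize();
    return 0;
}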
The dispersion command Scatter:
The Scatter feature allows one parallel process (the manager) to distribute the content of various parts of its buffer to each parallel process of the group. The command is defined as follows:
MPI_Scatter (SBUF, n, stype, rbuf, m, rtype, rt, communicator)
with the following meanings of the parameters:
sbuf           Initial address of the sending buffer.
n              Number of data elements distributed to each parallel process.
stype          Type of every element in the sending buffer.
rbuf           Initial address of the receiving buffer.
m              Number of data elements in the receiving buffer of every parallel process.
rtype          Type of each element in the receiving buffer.
rt             Rank of the sending (main) parallel process.
communicator   Communicator.
The MPI command Scatter divides the sending buffer of the main parallel process (root) into individual parts, each of them of size n. The first n elements of the sending buffer of the main parallel process are copied into the receiving buffer of the first parallel process of the given communicator, the second part of n data elements of the sending buffer is copied into the receiving buffer of the second parallel process of the given communicator, and so on. An illustration of Scatter executing a scattering operation is shown in Fig. 4.7. The MPI command Scatter has to be called by all the parallel processes of the communicator with the same parameter values for the main parallel process and communicator.
Figure 4.6 The collective data distribution command Broadcast.
The collection command Gather:
When collecting data, each parallel process, including the main parallel process (root), sends the content of its sending buffer to this main parallel process. The main parallel process receives these data messages and stores them according to the ranks of the senders. The sending buffer of the first member of the given group is copied onto the first m positions of the root receiving buffer, the sending buffer of the second member of the given group is copied onto the second m positions of the receiving root buffer, and so on.
This command has to be called by all the parallel processes of the given communicator group with the same parameter values for the main parallel process (root) and the communicator. The Gather command allows the main parallel process to fill its receiving buffer with the individual data parts from every parallel process of the given communicator. The Gather command is defined as follows:
MPI_Gather (sbuf, n, stype, rbuf, m, rtype, rt, communicator)
It contains the following significant parameters:
sbuf           Start address of the sending buffer.
n              Number of elements in the sending buffer.
stype          Type of elements in the sending buffer.
rbuf           Start address of the receiving buffer.
m              Number of received data elements from each parallel process.
rtype          Type of data elements in the receiving buffer.
rt             Rank of the main parallel process.
communicator   Communicator.
An illustration of the Gather data collection command is shown in Fig. 4.8.
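The cooperation of the Scatter and Gather commands might look as in the following hedged sketch (the block size m and the local computation are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hedged sketch: the root scatters one block of its sending buffer to every
   parallel process, each process works on its block, and Gather collects
   the partial results back in rank order. */
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int m = 4;                         /* elements per process       */
    int *sbuf = NULL, *result = NULL, rbuf[4], local[4];

    if (rank == 0) {                         /* root prepares n = m*p data */
        sbuf = malloc(m * size * sizeof(int));
        result = malloc(m * size * sizeof(int));
        for (int i = 0; i < m * size; i++)
            sbuf[i] = i;
    }
    MPI_Scatter(sbuf, m, MPI_INT, rbuf, m, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < m; i++)              /* local computation          */
        local[i] = 2 * rbuf[i];

    MPI_Gather(local, m, MPI_INT, result, m, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < m * size; i++)
            printf("%d ", result[i]);
        printf("\n");
        free(sbuf);
        free(result);
    }
    MPI_Finalize();
    return 0;
}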
Figure 4.7 The collective data distribution command Scatter.
An alternative method of the collective command Gather is the
MPI command Allgather, which is illustrated in Fig. 4.9. The
difference between both MPI commands is obvious from their
illustrations.
Figure 4.8 The Gather collective data capture command.
Figure 4.9 The collective Allgather data capture command.
The collection command Reduce:
Reduction is an associative and commutative operation which can be applied to the data items of any given task group. The reduction operation can be a user-defined function or a predefined MPI operation, such as sum, minimum, maximum and so on. The results of the reduction operation can be sent to each parallel process in the given group or just to one of them, which is named
in MPI API as root (main parallel process). The Reduce global
collection command has the following structure:
MPI_Reduce (SBUF, rbuf, n, data_type, op, rt, communicator)
where the meaning of the used parameters is as follows:
sbuf           Address of the sending buffer.
rbuf           Address of the receiving buffer.
n              Number of data elements in the sending buffer.
data_type      Data type.
op             Collecting (reduction) operator.
rt             Rank of the main parallel process (root).
communicator   Communicator.
The reduction operator is applied to the data of the sending buffer
for each sending parallel process by the group communicator. The
results are only returned to the receiving root buffer. Predefined
operations while collecting data are as follows:
MPI version    Operation       MPI version    Operation
MPI_SUM        sum             MPI_LOR        logical or
MPI_PROD       multiply        MPI_LXOR       logical exclusive or
MPI_MIN        minimum         MPI_BAND       bit and
MPI_MAX        maximum         MPI_BOR        bit or
MPI_LAND       logical and     MPI_BXOR       bit exclusive or
An illustration of the Reduce data collection command is shown in Fig. 4.10 with the collection operator sum.
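A hedged usage sketch of the Reduce command with the predefined operator MPI_SUM (the contributed values are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Hedged sketch in the spirit of Fig. 4.10: every parallel process
   contributes one value and MPI_SUM delivers the total to the root. */
int main(int argc, char *argv[])
{
    int rank, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1;                    /* value in the sending buffer */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                           /* only the root receives the sum */
        printf("sum of (rank+1) values = %d\n", total);

    MPI_Finalize();
    return 0;
}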
The data collection command Allreduce:
The global collection command Allreduce defines alternative variants
of global reduction, in which the results of the operation are returned
to all the members of any given communication group. This
command can be loosely interpreted as a reduction for all (many-to-
many reduction). In this sense it is an inverse operation to the
scattering command Broadcast. The collective command Allreduce
is defined as follows:
MPI_Allreduce (SBUF, rbuf, n, data_type, op, communicator)
The parameters of this function are the same as the function parameters of MPI_Reduce(), but without the parameter rt. The results of
the operation are stored onto the receiving buffer of all the parallel
processes of any given communicator.
The data collection command Scan:
The Scan data collection command is a prefix operation which performs the data collection over the sequence of individual parallel processes of any given communicator. The command is defined as follows:
MPI_Scan (SBUF, rbuf, n, data_type, op, communicator)
The parameters of the Scan collection command are identical to the parameters of the collection command MPI_Allreduce(). After performing the MPI Scan command, the receiving buffer of the parallel process with rank i contains the results of the applied function over the data stored in the sending buffers of the parallel processes with ranks 0, 1, ..., i.
Figure 4.10 Collective data capture command Reduce.
Java
The Java programming language was developed by Sun Microsystems
in 1991 as part of a research project for the development of applied
software in consumer electronics. The background of the development
of the Java language was based on the efforts towards developing
applied portable programs. From these predecessors Java has taken
a combination of properties of interpreter languages, object oriented
programming and parallel languages. It provides broad support for
graphics, distributed environments and for other application areas.
The application use of the Java language gradually showed its suitability for a heterogeneous distributed environment. A typical application form is the applet, a small program that can communicate and is executed within a website, for example on the Internet.
The Java structure is profiled like an object oriented language.
With the gradual development of programming technology Java has
become a very powerful all-purpose object oriented language for a
wide range of various applications. This is mainly due to the fact
that programs are written independently of computer architecture
and can be implemented without modifications to various computer
platforms. One important feature is its support for multi-thread
processes (multi-threading). This support includes traditional parallel
computing on parallel computers with a shared memory (threads)
using shared variables, including building support for typical
synchronisation mechanisms in a shared memory. As for threads in
general, its parallel implementation can contribute to the performance
of a multiprocessor or multicore system. They may also be used for
single processor systems in order to flexibly switch the processor
between open threads applied to graphic or input/output operations.
Java also provides support mechanisms for the development of
manager/worker (client/server) parallel applications. For the development of applied distributed applications Java does not support the MPI API standard, but it provides the built-in java.net module. Classes of this package support low-level mechanisms based on datagrams, the higher socket-type communication mechanisms, Internet communication through uniform resource locators (URL) and an RPC (Remote Procedure Call) mechanism under its own name of Remote Method Invocation (RMI), because operations on objects are known in Java as methods and not as procedures. A Java RMI mechanism is illustrated in Fig. 4.11.
Figure 4.11 The Java RMI mechanism.
5 Parallel Computing Models
A parallel computational model is an abstract model of parallel
computing, which should include overheads and accompanying
delays. The model characterises those capabilities of a parallel computer which are decisive for parallel calculation. A degree of abstraction should also characterise the communication structure and should permit at least an approximation of its basic parameters (complexity, performance etc.). On the other hand, the
approximation accuracy is limited by the requirement that abstract
communication models have to represent similar parallel computer
architectures and parallel algorithms [19, 83, 89]. It is clear that for
every specific parallel computer and parallel algorithm we are able
to create their own communication models, which characterise in
detail their specific characteristics. Parallel communication models
can be classified according to various criteria. One of the most used
criteria is a way of presenting the model parameters. Typically used
communication parameters can be divided into two groups as
follows:
Semantic.
Communications network architecture (structure, links, control).
Communication methods.
Communication delays (latency).
Performance (complexity, efficiency).
The typical parameters are:
Size of the parallel system p (number of processors).
Workload w (number of operations).
Sequential program execution time T(s, 1).
Execution time of a parallel algorithm T(s, p).
Parallel speed-up S(s, p).
Efficiency E(s, p).
Isoefficiency w(s).
Average time of a computation unit t_c.
Average time to initialise communication (start-up time) t_s.
Average time to transmit a data unit (word) t_w.
The SPMD model of parallel computation
The parallel computing model SPMD (Single Program Multiple Data) corresponds to classical parallel computers with a shared memory (supercomputers, massive SMPs), which were primarily focused on massive data parallelism. An illustration of this model is shown in Fig. 5.1. This programming orientation assumes the following decomposition models:
Domain decomposition.
Manager/worker.
Fixed (atomic) network
This communication network consists of a large fixed number of
processors p, which communicate with each other through data
messages of a fixed length. The number of processors is proportional
to the input problem load.
The PRAM model
A model of a parallel computer with a shared memory known as
PRAM (Parallel Random Access Machine) was previously used for
its high degree of universality and abstraction. The PRAM model
still represents an idealised model, because it does not consider any
delay. Although this approach has an important role in the theoretical
design and development of parallel computers and parallel algorithms,
for real modelling it is necessary to complete it by modelling at least
for communication delays. The initial PRAM model is illustrated in
Fig. 5.2.
Figure 5.1 An illustration of the SPMD model.
Figure 5.2 The PRAM model.
In the PRAM model, computing nodes communicate via a shared
memory, whereby every addressed place according to the PRAM
model is available at the same time (idealisation). Computing nodes
are synchronised in their activities and communicate via a shared
memory. For the practical design of a parallel algorithm, a programmer
specifies sequences of parallel operations using a shared memory.
When performing parallel processes there may be long waiting delays,
which increase proportionally to the number of parallel processes in
use (waiting times for memory module access). It is necessary to model these time delays in order to analyse the real behaviour and to make a realistic evaluation (removing the idealised PRAM model assumption).
The GRAM fixed communication model
The fixed communication model known as GRAM (Graph Random
Access Machine) was one of the ways of solving the problem of
waiting delays in the PRAM model by using a distributed memory
with a precisely defined structure of its communication network, in
which the symbol G determines the topology graph of the
communication network in use [2, 26]. As examples, we can name a two-dimensional communication network or a hypercube topology.
Flexible models
Previous models were not sufficiently precise, because the increasing
robustness of parallel computers also caused rises in communication
overheads in parallel algorithms. The robustness of a parallel computer is represented by the number of its computing nodes, the parameter p, whereby every computing node is ready to work with n/p parallel processes. Parallel algorithms then consisted of sequences of
defined parallel steps named super steps, in which the necessary local
calculations were carried out, followed by a communication exchange
of data messages. It is obvious that this implemented parallel algorithm,
in which the number of super steps was small and independent of input
load n, will be effective in any parallel computer providing the efficient
implementation of just communication procedures.
The flexible GRAM model
The basic difference between the fixed and flexible GRAM models is in the number of computing nodes (processors), which is considered as a defined parameter p. At every stage of the communication
phases, a computing node could send data messages with their
variable lengths to its neighbouring computing nodes. The price of
communication could also be a subject for modelling, and includes
the following parts:
Communication section to initialise communication (start-up
time).
Its own transmission part of communication is defined as a
number of transmitted considered data units (words).
The BSP model
The communication model BSP (Bulk Synchronous Parallel) is a realistic alternative to the PRAM model (Fig. 5.3). The input load n is divided among p computing processors. Instead of synchronising after every performed instruction, this communication model uses synchronisation only at the end of each performed partial computation, referred to as a super step. A super step consists of a defined number of instructions (bulk). Each super step consists of the three following phases:
Its own partial computation.
Global communications of processors.
Barrier synchronisation.
During a super step the processors in use perform their instructions asynchronously, whereby all the read operations from the shared memory of every processor are carried out before the first write operation into the shared memory is performed. The existing delays of parallel algorithms were defined as follows:
Parallel computation times were given by the maximum number
of computation cycles w.
Synchronisation delays had their lower boundary as the waiting
time for the transmission of minimal communication data
messages (Word) through a communications network.
Communication delays were given as product gh cycles, where
parameter g characterises the throughput of the communications
network. Parameter h specified the number of cycles for the
communication of a maximal data message during a super step.
To avoid conflicts due to asynchronous communication network activities, data messages sent in stages by some processors were not dependent on the messages received during the same phase of communication.
The execution time for one super step was then given as the sum of the considered sub-delays, that is: w + g·h + l.
The BSP model does not exclude the overlapping of individual super
step activities. In the case of the overlapping of defined actions, the
execution time of a super step was given as max (w, g·h, l).
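As a small numerical illustration with assumed (purely illustrative) values w = 10^6 computation cycles, g = 4 cycles per transmitted data unit, h = 10^4 transmitted data units and l = 5·10^4 cycles of synchronisation latency, the non-overlapped super step time is w + g·h + l = 10^6 + 4·10^4 + 5·10^4 = 1.09·10^6 cycles, while with full overlapping it reduces to max (w, g·h, l) = 10^6 cycles.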
The adjusted BSP model
A further innovation to the BSP model includes the adjustments of
the PRAM and BSP models in such a way that the modified model
could precisely characterise the behaviour of real parallel computers.
These innovations were based on the following:
Figure 5.3 An illustration of the BSP model (parallel processes alternate phases of parallel computing, communication bounded by the maximum number of sent or received messages, and barrier synchronisation).
A parallel algorithm is performed in a sequence of phases. The following are the three types of phase:
1. Parallelisation overheads T(s, p)_overh.
2. Its own parallel computing T(s, p)_comp.
3. Interaction of processors T_interact (communication, synchronisation).
For any given computation phase, the input load was determined by parameters that indicate the average value of the performed operations t_c(p).
Different interactions impose different execution times. The execution time of an interaction could be computed according to the following relation:

T_interact(m, p) = t_s(p) + m·t_c(p) = t_s(p) + m / r_∞(p)

In this relationship, m indicates the length of the data message, t_s(p) is the communication start-up time, t_c(p) = 1 / r_∞(p) is the transmission time per data unit and r_∞(p) is the bandwidth limit of the communication channels in use.
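As a small worked example with assumed values t_s(p) = 100 µs and r_∞(p) = 10^7 words per second (so that t_c(p) = 0.1 µs per word), the transfer of a data message of m = 10^5 words costs T_interact = 100 µs + 10^5 · 0.1 µs = 10.1 ms; the start-up time is therefore negligible for long messages but dominates for very short ones.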
The CGM (coarse grained multicomputer) model
This model is based on the BSP model and is represented by p processors, whereby each of them has O(n/p) local memory, and every super step has h = O(n/p) communication cycles. The aim is to concentrate on designs with fewer super steps in order to achieve a higher effectiveness of the developed parallel algorithms. The ideal situation is to perform a constant number of super steps, as has been achieved in developed parallel algorithms such as sorting, image processing, optimisation problems and so on.
The Log P model
The Log P model is based on the BSP model and focuses on more
loosely-bound parallel computer architectures (asynchronous parallel
computers). The emphasis is on a parallel computer with a
distributed memory with parameters according to Fig. 5.4, where:
L: time for communication initialisation (start-up time).
o: overhead due to communication activities. This is defined as the time interval during which a computing node is occupied exclusively with the performed communication.
g: the gap between two consecutive transmitted data messages. This is defined as the inverse of the bandwidth of the communication control processor.
p: the number of computing nodes in a parallel computer (each of them with a local memory).
In this model the resources are considered as having limited capacity. Consequently, only L/g data messages can exist at any given time in a communication network. The price for a basic communication data block (data packet) between two computing nodes is L + 2·o. If we require an acknowledgement (ACK), then the price is doubled to 2·(L + 2·o).
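As a small numerical illustration with assumed values L = 16 µs, o = 2 µs and g = 4 µs, a single data packet costs L + 2·o = 20 µs, a packet with its acknowledgement costs 2·(L + 2·o) = 40 µs, and at most L/g = 4 data messages can be in flight in the communication network at the same time.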
Computational model MPMD
The computational model MPMD (Multiple Process Multiple Data) is associated mainly with computer networks, that is, with asynchronous parallel computers. The network topologies typically used in computer networks (LAN, WAN) follow one of these topological structures:
Figure 5.4 The Log P model (two computing nodes P_i and P_k exchange a data message and the next data message; the time axis shows the overhead o at the sender and receiver, the latency L and the gap g between consecutive messages).
Bus.
Multibus.
Star.
Tree.
Ring.
Suitable decomposition models are those which tend towards
functional parallelism, which means the creation of parallel processes,
which in turn perform the allocated parts of the parallel algorithms
on the corresponding data. Typical decomposition models are as
follows:
Functional decomposition.
Manager/server (server/client, master/worker).
Object oriented programming (OOP).
Load of communication networks
A typical architecture of an Ethernet communication network using a single shared communication channel is illustrated in Fig. 5.5. The disadvantage of this communication network is the serial communication between the connected computing nodes. To analyse the communication complexity we can use the analytical method of complexity theory. The upper limit of the communication complexity on the Ethernet is then given as O(p) for a network connection according to Fig. 5.5. A communication network with this communication complexity limits the development of effective parallel algorithms because of its serial communication, as in the case of the Ethernet.
A typical communication network used in NOW in our country (the Slovak Republic) is the Ethernet architecture. The communication principles in an Ethernet network are illustrated in Fig. 5.5, where P_1, P_2, ... P_p-1, P_p could be common powerful single workstations or SMP parallel computers. Generally, implementing the MPMD computational model brings different overhead delays, as follows:
Parallelisation of complex problems.
Synchronisation of decomposed parallel processes.
Inter-process communication (IPC) delays.
Real application models should take into account the potentially limited number of communication channels during the implementation of parallel algorithms (technical communication limits), along with the other limited technical resources that are required [9]. Limited technical resources are illustrated in Fig. 5.6.
Figure 5.5 Communication in an Ethernet network (the computing nodes P_1, P_2, ... P_p-1, P_p are connected to a shared Ethernet segment, over which a sender transmits a data message serially to a receiver).
Figure 5.6 Real applied models (the input load as a function of p for the fixed load model, the fixed time model and the fixed memory model, bounded by memory bounds and communication limits).
6
The Role of Performance
Quantitative evaluation and the modelling of the hardware and software components of any parallel system are critical for mastering their complexity and for delivering the high performance of developed parallel algorithms. Performance studies apply to initial design
phases, as well as to procurement, tuning and capacity planning
analysis. As performance cannot be expressed by quantities
independent of the system workload, the quantitative characterisation
of the application of resource demands and of their behaviour is an
important part of any performance evaluation study [27]. Among
the goals of parallel system performance analysis are those which
assess the performance of a system, or a system component or an
application, in order to investigate the match between the
requirements and the system architecture characteristics. This is to
identify the features that have a significant impact on the application
execution time, to predict the performance of a particular application
on any given parallel system and to evaluate the different structures
of parallel applications. In order to extend the applicability of
analytical techniques to the parallel processing domain, various
enhancements have been introduced to model phenomena such as
simultaneous resource possession, fork and join mechanisms,
blocking and synchronisation. Modelling techniques allow contention to be modelled at both the hardware and the software level by combining approximate solutions and analytical methods. However, the
complexity of parallel systems and algorithms limits the applicability
of these techniques. Therefore, in spite of its computation and time
requirements, simulation is extensively used, as it imposes no
constraints on modelling.
The study of the performance of computers attempts to understand
and predict the time-dependent behaviour of computer systems,
including parallel computers. It can be broadly divided into two
areas: modelling and measurement. These can be further divided by
objective and by technique. These two apparently disjointed
approaches are in fact mutually dependent and are both required in
any practical study of the performance of a real or planned system.
Performance studies apply to initial design phases as well as to
procurement, tuning and capacity planning analysis. As performance
cannot be expressed by quantities independent of the system
workload, the quantitative characteristics of the resource demands
of an application and of their behaviour is an important part of any
performance evaluation study. Among the goals of parallel system
performance analysis are to assess the performance of a system or a
system component or an application, to investigate the match
between requirements and system architecture characteristics, to
identify the features that have a significant impact on the application
execution time, to predict the performance of a particular application
on a given parallel system and to evaluate different structures of
parallel applications. For the performance evaluation of parallel algorithms we can use an analytical approach based on given constraints, analytical laws or other derived analytical relations.
Principally we can use the following solution methods to model the
performance of both parallel computers and parallel algorithms:
    Analytical.
        Application of queuing theory results [55, 57].
        Order (asymptotic) analysis [44, 48].
        Petri nets [17].
    Simulation [67].
    Experimental.
        Benchmarks.
            Classic [65].
            SPEC [98].
        Supporting modelling tools [71].
        Direct measuring.
            Technical parameters [47].
            Parallel algorithms [52].
Performance evaluation methods
Several fundamental concepts have been developed for evaluating
parallel computers. Tradeoffs among these performance factors are
often encountered in real-life applications. When we solve a model
we can obtain an estimate for a set of values of interest within the
system being modelled, for a given set of conditions which we set for
that execution. These conditions may be fixed permanently in the
model, or left as free variables or parameters of the model, and set
at runtime. Each set of m input parameters constitutes a single point
in m-dimensional input space. Each solution of the model produces
one set of observations. This set of n values constitutes a single point
in the corresponding n-dimensional observation space. By varying
the input conditions we hope to explore how the outputs vary
according to changes to the inputs.
Analytic techniques
The analytical method is a very well-developed set of techniques,
which can provide exact solutions very quickly, but only for a very
restricted class of models. For more general models it is often possible
to obtain approximate results significantly more quickly than when
using simulation, although the accuracy of these results may be
difficult to determine. The techniques in question belong to an area
of applied mathematics known as queuing theory, which is a branch
of stochastic modelling. Like simulation, queuing theory depends on
the use of computers to solve its models quickly. We would like to
use techniques which yield analytic solutions. We make note of
important results without proof. Details can be found in most
queuing textbooks, such as those in the bibliography [59, 72, 81].
Clearly this is very similar to the kinds of experiment we might wish
to conduct with measurements of a real system; for example, the
benchmarking of a concrete computer [56].
Asymptotic (order) analysis
In the analysis of algorithms, it is often cumbersome or impossible
to derive exact expressions for parameters such as runtime, speed
up, efficiency, isoefficiency and so on. In many cases an
approximation of the exact expression is adequate. The
approximation may indeed be more illustrative of the behaviour of
the function because it focuses on the critical factors influencing the
parameter. We have used an extension of this method to evaluate
parallel algorithms.
Order analysis comes from well-established theory and practice in the analysis of sequential algorithms (SA). The aim is to derive a mathematical expression for the number of computation steps of an analysed algorithm (its time complexity) as a function of the input load of a given problem [20]. Similarly, the space complexity of an SA mostly means an analysis of the needed memory capacity as a function of the input load of a given application problem. For parallel algorithms in NOW and Grid it is necessary, besides the time complexity T(s, p)_comp, to extend the analysis to further characteristic aspects of PA, mainly to:
The complexity of the overhead latencies, with derived analytical complex performance criteria of the PA that include the consideration of the overhead function h(s, p) (parallel speed up, efficiency, isoefficiency).
The space complexity, supported by optimisation theory, to suggest capacities for shared resources (computing nodes, memory modules, I/O equipment, communication channels and so on).
The time complexity of a PA will therefore be the sum of the time complexity of the parallel computation T(s, p)_comp and, moreover, the time complexities of all the individual overhead latencies (the overhead function h(s, p)). The basic factors influencing the whole latency of a PA (computation, overheads) are the number of computation nodes p of the parallel computer, the working load of the given problem (the problem load n) and also the parameters of the technical limits of the shared resources. For the performance analysis of a given PA, it is best to consider complex analytical relations for the defined criteria S(s, p), E(s, p), w(s) and the overhead function h(s, p).
Order notations
Order analysis and the asymptotic complexity of functions are used
extensively in practical applications, mostly to analyse the
performance of algorithms. In the analysis of algorithms, it is often
cumbersome or impossible to derive exact ex pressions for
parameters such as runtime, speed up and efficiency. In many cases,
an approximation of the exact expression is adequate. The
approximation may indeed be more illustrative of the behaviour of
the function because it focuses on the critical factors influencing the
parameter. Order analysis uses the following notations [3, 37]:
The O notation: often, we would like to bound the growth of a particular parameter by a simpler function. The O (capital oh) notation sets an upper boundary on the rate of growth of a function:
Figure 6.1 Illustration of an upper boundary (beyond n_0 the curve y = f(n) lies below y = g(n) on the T(n) axis).
The Ω notation: the Ω notation is the converse of the O notation; that is, it sets a lower boundary on the rate of growth of a function:
Figure 6.2 Illustration of a lower boundary.
The Θ notation: the Θ notation is the exact approximation. Formally, the Θ notation is defined as follows: given a function g(x), f(x) = Θ(g(x)) if and only if there exist constants c_1, c_2 > 0 and an x_0 such that c_1·f(x) ≤ g(x) ≤ c_2·f(x) for all x ≥ x_0.
Figure 6.3 Illustration of an exact approximation.
The o notation: f(n) = o(g(n)) if for every constant C > 0 there exists an n_0 such that f(n) < C·g(n) for all n ≥ n_0; the order symbol o thus defines a strict upper limit (small o order symbol). An illustration of the defined order symbol o is in Fig. 6.4.
The ω notation: f(n) = ω(g(n)) if for every constant C > 0 there exists an n_0 such that f(n) > C·g(n) for all n ≥ n_0; ω is thus a strict lower limit. An illustration of the defined lower limit is in Fig. 6.5.
Figure 6.4 Illustration of precise boundaries.
Figure 6.5 Illustration of the dened lower limit.
The defined order symbols O, Ω, Θ, o and ω represent classes of functions with similar complexities, ignoring constants [38, 49]. They are therefore very useful for the asymptotic complexity evaluation of algorithms (sequential and parallel).
Order algorithm terminology
The execution times of any algorithm are named with the order symbol terminology as follows:
Θ(1) as constant time.
Θ(log n) as logarithmic time.
Θ(log^k n) as semi-logarithmic time.
o(log n) as sub-logarithmic time.
Θ(n) as linear time.
o(n) as sub-linear time.
Θ(n^2) as quadratic time.
O(f(n)), where f(n) is a polynomial, as polynomial time.
Properties of functions expressed in order notation
The order notations for expressions have a number of properties
that are useful when ana lysing the performance of algorithms. Some
of the most important properties are as follows:
x^a = O(x^b) if and only if a ≤ b.
log_a(x) = Θ(log_b(x)) for all a and b.
a^x = O(b^x) if and only if a ≤ b.
For any constant c, c = O(1).
If f = O(g) then f + g = O(g).
If f = Θ(g) then f + g = Θ(g) = Θ(f).
f = O(g) if and only if g = Ω(f).
f = Θ(g) if and only if f = Ω(g) and f = O(g).
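As a short illustration of applying these properties to an arbitrary example expression, a run time of the form T(n) = 3·n^2 + 5·n·log n + 100 satisfies n·log n = O(n^2) and 100 = O(1) = O(n^2), so that T(n) = Θ(n^2); only the fastest-growing term determines the asymptotic complexity.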
The application of queuing theory systems
The basic premise behind the use of queuing models for computer
system analysis is that the components of a computer system can be
represented by a network of servers (resources) and waiting lines
(queues) [58]. A server is defined as an entity that can affect, or even
stop, the flow of jobs through a system. In a computer system, a
server may be the CPU, I/O channel, memory, or a communication
port. A waiting line is just that: a place where jobs queue for service.
To make a queuing model work, jobs (customers, message packets,
or anything else that requires the sort of processing provided by the
server) are inserted into the network. A simple example, the single
server model, is shown in Fig. 6.6. In that system, jobs arrive at a
certain rate, queue for service on a first-come first-served basis,
receive service and exit the system. This kind of model, with jobs entering and leaving the system, is called an open queuing system model.
Figure 6.6 The queuing theory based model (arrivals enter a queue, are served by the server and depart).
We will now turn our attention to some suitable queuing systems,
the notation used to represent them, the performance quantities of
interest and the methods for calculating them. We have already
introduced many notations for the quantities of interest for random
variables and stochastic processes.
Kendall classification
Queuing theory systems are classified according to various
characteristics, which are often summarised using Kendall's notation
[32, 61]. In addition to the notation described previously for the
quantities associated with queuing systems, it is also useful to
introduce a notation for the parameters of a queuing system. The
notation we will use here is known as the Kendall notation in its
extended form as A/B/m/K/L/Z , where:
A means arrival process definition.
B means service time distributions.
m is the number of identical servers.
K means the maximum number of customers allowed in the system (default = ∞).
L is the number of customers allowed to arrive (default = ∞).
Z means the discipline used to order the customers in the queue (default = FIFO).
The symbols used in a Kendall notation description also have some
standard definitions. The more common designators for the A and B
fields are as follows:
M means Markovian (exponential) service time or arrival rate.
D defines deterministic (constant) service time or arrival rate.
G means general service time or arrival rate.
The service discipline used to order the customers in the queue can be any of a variety of types, such as first-in first-out (FIFO), last-in
first-out (LIFO), priority ordered, randomly ordered and others. Next,
we will apply several suitable queuing systems to model computer
systems or workstations and give expressions for the more important
performance quantities. We will suppose in Kendall notation default
values, which means we will use a typical short Kendall notation.
Little's Law
One of the most important results in queuing theory is Little's Law.
This was a long-standing rule of thumb in analysing queuing
systems; it derives its name from the author of the first paper which
proved the relationship formally. It is applicable to the behaviour of
almost any system of queues, as long as they exhibit steady state
behaviour. It relates a system-oriented measure - the mean number
of customers in the system - to a customer-oriented measure - the
mean time spent in the system by each customer (the mean end-to-
end time), for a given arrival rate. Little's Law says:
E(q) = λ · E(t_q)

or its following alternatives:

E(w) = λ · E(t_w)
E(w) = E(q) - ρ (single server, m = 1)
E(w) = E(q) - m·ρ (m servers).

We can also use the following valid equation:

E(t_q) = E(t_w) + E(t_s),
where the named parameters are:
λ - arrival rate at the entrance to a queue.
m - number of identical servers in the queuing system.
ρ - traffic intensity (dimensionless coefficient of utilisation).
q - random variable for the number of customers in a system in a steady state.
w - random variable for the number of customers in a queue in a steady state.
E(t_s) - the expected (mean) service time of a server.
E(q) - the expected (mean) number of customers in a system in a steady state.
E(w) - the expected (mean) number of customers in a queue in a steady state.
E(t_q) - the expected (mean) time spent in a system (queue + servicing) in a steady state.
E(t_w) - the expected (mean) time spent in the queue in a steady state.
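As a small numerical illustration with assumed values, consider a single-server (m = 1) workstation with arrival rate λ = 8 jobs per second and mean service time E(t_s) = 0.1 s, so that ρ = λ·E(t_s) = 0.8. Taking, purely for illustration, the well-known M/M/1 value E(q) = ρ / (1 - ρ) = 4, Little's Law gives E(t_q) = E(q) / λ = 0.5 s, hence E(t_w) = E(t_q) - E(t_s) = 0.4 s and E(w) = λ·E(t_w) = 3.2, which indeed equals E(q) - ρ = 4 - 0.8.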
Petri nets
A Petri net is essentially an extension of a finite state automaton, to
allow by means of tokens several concurrent threads of activity to
be described in one representation. It is essentially a graphical
description, being a directed graph with its edges defining paths for
the evolution of a system's behaviour and its nodes or vertices being
of two sorts; places and transitions. There are a number of extensions
to these simple place/transition nets, mostly to increase the ease of
describing complex systems. The most widely-used is to define
multiplicities for the edges, which define how many tokens flow
down an edge simultaneously. This is shorthand for an equivalent
number of edges linking the same pair of vertices. The use of Petri
nets in performance modelling now centres on the Generalised
Stochastic Petri Nets (GSPNs) [17].
The simulation method
Simulation is the most general and versatile means of modelling
systems for performance estimation. It has many uses, but its results
are usually only approximations of the exact answer and the price of
increased accuracy is much longer execution times. To reduce the cost
of a simulation we may resort to simplification of the model, which
avoids explicit modelling of many features, but this increases the level
of error in the results. If we need to resort to simplification of our
models, it would be desirable to achieve exact results, even though
the model might not fully represent the system. At least then one
source of inaccuracy would be removed. At the same time it would
be useful if the method could produce its results more quickly than
even the simplified simulation. Thus it is important to consider the
use of analytic and numerical techniques before resorting to
simulation. The simulation method is based on simulating the basic characteristics of the analysed parallel system, that is, the input data stream and its servicing, according to measured and analysed probability values. Part of the method is therefore the registration in time of the discrete values of interest. The resulting values of a simulation model always have a discrete character; they do not have the universal form of mathematical formulas into which we could substitute the variables of the used distributions, as in the case of analytical models. The accuracy of a simulation model therefore depends on how accurately the simulation model captures the given task [67].
Simulation is the most general and versatile means of modelling
systems for performance estimation. It has many uses, but its results
are usually only approximations to the exact answer and the price of
increased accuracy is much longer execution times. Numerical
techniques vary in their efficiency and their accuracy. They are still
only applicable to a restricted class of models (though not as restricted
as analytic approaches). Many approaches increase rapidly in their
memory and time requirements as the size of the model increases.
To reduce the cost of a simulation we may resort to simplification
of the model, which avoids explicit modelling of many features, but
this increases the level of error in the results. If we need to resort to
the simplification of our models, it would be desirable to achieve
exact results even though the model might not fully represent the
system. At least then one source of inaccuracy would be removed. At
the same time it would be useful if the method could produce its
results more quickly than even the simplified simulation. Thus, it is
important to consider the use of analytic and numerical techniques
before resorting to simulation.
Experimental measurement
Evaluating system performance via experimental measurement is a very useful alternative for parallel systems and algorithms. Measurements can be gathered on existing systems by means of benchmark applications that aim at stressing specific aspects of the parallel systems and algorithms. Even though benchmarks can be used in all types of performance study, their main field of application is competitive procurement and the performance assessment of existing systems and algorithms. Parallel benchmarks extend the traditional sequential ones by providing ever wider sets of suites that exercise each system component with a targeted workload. The ParkBench suite, which is especially oriented to message-passing architectures, and the SPLASH suite for shared-memory architectures are among the most commonly-used benchmarks. To both architectures we can also apply the PRISM tool (a probabilistic symbolic model checker) or the PRISM benchmark suite [71]. These experimental methods arise from the reality that complex analytical modelling and performance analysis (the influences of PC architecture, parallel computation and the overhead latencies h(s, p)) are very intractable problems.
Benchmark
We divided the performance tests used as follows:
    Classical.
        Peak performance.
        Dhrystone.
        Whetstone.
        LINPACK.
        Khornerstone.
    Problem-oriented tests (benchmarks).
        SPEC tests.
        PRISM.
SPEC ratio
SPEC (the Standard Performance Evaluation Corporation, www.spec.org) defined one number to summarise all the necessary tests, for example for the integer test suite [98]. Execution times are first normalised by dividing the execution time on the reference processor (chosen by SPEC) by the execution time on the measured computer (user application program). The resulting ratio is labelled as a SPEC ratio, which has the advantage that higher numerical values represent higher performance, which means that a SPEC ratio is an inversion of the execution time. The INT 20xx (xx - year of the latest version) or CFP 20xx result values are produced as the geometric average of all the SPEC ratios. The relation for the geometric average value is given as:

geometric mean = ( ∏_{i=1}^{n} normalised execution time_i )^{1/n}

where the normalised execution time is the execution time normalised to the reference computer for the i-th tested program from the whole tested group of n programs (all tests), and ∏_{i=1}^{n} a_i denotes the product of the individual values a_i.
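As a small worked example with assumed values, if a measured computer obtains SPEC ratios of 2.0, 4.0 and 8.0 on three tested programs, the summarising result is the geometric mean (2.0 · 4.0 · 8.0)^{1/3} = 64^{1/3} = 4.0; an advantage of the geometric mean is that the ratio of the results of two measured computers does not depend on which computer is chosen as the reference.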
Part II:
Theoretical Aspects of PA
7
Performance Modelling of
Parallel Algorithms
Several fundamental concepts have been developed to evaluate parallel algorithms. Tradeoffs among these performance factors are often encountered in real-life applications.
Speed up
Let T(s, p) be the total number of unit operations performed by a p
processor system, with s defining the size of the computational
problem (load). Illustrations of possible load characteristics are in
Fig. 7.1. Then T(s, 1) defines the execution time units for one
processor system, and then the speed up factor is defined as:

S(s, p) = T(s, 1) / T(s, p)

It is a measure of the speed up obtained by a given algorithm when p processors are available for the given problem size s. Since S(s, p) ≤ p, we would like to design algorithms that achieve S(s, p) ≈ p.
Efficiency
The system efficiency for a processor system with p computing nodes
is defined by

E(s, p) = S(s, p) / p = T(s, 1) / (p · T(s, p))
The value of E(s, p) is approximately equal to 1 for some p,
indicating that this parallel algorithm, using p processors, runs
approximately p times faster than it does with one processor
(sequential algorithms). An illustration of the efficiency dependent
on the number of computing nodes is in Fig. 7.2.
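As a small numerical illustration with assumed values, if a sequential run takes T(s, 1) = 100 s and the parallel run on p = 8 computing nodes takes T(s, p) = 20 s, then the speed up is S(s, p) = 100 / 20 = 5 and the efficiency is E(s, p) = 5 / 8 = 0.625, i.e. each computing node is usefully busy for only about 62.5 % of the parallel execution time.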
Isoefficiency
The workload w of an algorithm often grows in the order O(s),
where the symbol O means the upper limit used in complexity
theory, and its parameter s is the size of the problem. Thus, we
denote workload w = w(s) as a function of s. In parallel computing
it is very useful to define an isoefficiency function relating workload
to machine size p, which is needed to obtain a fixed efficiency when
implementing a parallel algorithm on a parallel system.
Figure 7.1 Load development as a function of the number of processors p (constant, sublinear, linear and exponential load growth).
Let h(s, p) be
the entire overhead involved in the implementation of an algorithm.
This overhead is usually a function of both machine size and
problem size. Workload w(s) corresponds to useful computations
while overhead h(s, p) represents useless time attributed to architecture,
parallelisation, synchronisation and communication delays. In
general, the overheads increase with respect to increasing values of
s and p. Thus the efficiency is always less than 1. The question is
hinged on relative growth rates between w(s) and h(s, p). The
efficiency of a parallel algorithm is thus defined as:

E(s, p) = w(s) / (w(s) + h(s, p))
Workload w(s) corresponds to useful computations, while
overhead function h(s, p) represents useless overhead times
(communication delays, synchronisation, control of processes etc.).
With a fixed problem size (fixed workload), the efficiency decreases
as p increases. The reason is that overhead h(s, p) increases with p.
With a fixed machine size, the overhead function h(s, p) grows more slowly than the workload w does. Thus, efficiency increases with increasing
problem size for a fixed-size machine. Therefore, one can expect to
maintain a constant efficiency if workload w is allowed to grow
properly with increasing machine size (scalability).
Figure 7.2 Efficiency as a function of the number of computing nodes p.
For any given algorithm, the workload might need to grow
polynomially or exponentially with respect to p in order to maintain
a fixed efficiency. Different algorithms may require different workload
growth rates to keep the efficiency from dropping as p increases. The
isoefficiency functions of common parallel algorithms are polynomial
functions of p; they are O(p
k
) for some k 1. The smaller the power
of p in an isoefficiency function the more scalable the parallel system.
We can rewrite the equation for efficiency E(s, p) as:
E(s, p) = 1 / (1+(h(s, p) / w(s))
In order to maintain a constant E(s, p), workload w(s) should grow
in proportion to overhead h(s, p). This leads to the following relation:

=

( ) ( , )
1
E
w s h s p
E
Factor C = E / 1-E is a constant for fixed efficiency E(s, p). Thus,
we can define an isoefficiency function as follows:
W(s, p) = C h(s,p)
If the workload grows as fast as w(s, p) then a constant efficiency
can be maintained for any given parallel algorithm.
Complex performance evaluation
For the performance evaluation of parallel algorithms we can use an
analytical approach to get past any given constraints, analytical laws
or some other derived analytical relations [30, 92]. The best-known
analytical relations have been derived without considering
architecture or communication complexity. That means a performance
P f (computation). These assumptions could be real in many cases
in the existent massive multiprocessor systems in the world today,
but not in NOW or Grid. In NOW [46, 50], we have to take into
account all the aspects that are important for a complex performance evaluation, according to the relation P ≈ f (architecture, computation, communication, synchronisation etc.). Theoretically, we can use the following solution methods to obtain such a complex performance function:
Analytic modelling to find the P function on the basis of some
closed analytical expressions or statistical distributions for
individual overheads [12, 78].
The simulation technique - simulation modelling of the developed parallel algorithms on real parallel computers [67].
Experimental measurement - experimental performance measurement of the developed parallel algorithms on real parallel computers [74].
Due to the dominant use of parallel computers based on standard
PCs (personal computers) in the form of NOW and Grid, there has
been great interest in the performance modelling of parallel algorithms in order to achieve optimised (effective) parallel algorithms, as illustrated in Fig. 7.2. Therefore, this book summarises the methods used for complex performance analyses, which can be applied to all types of parallel computer (supercomputers, NOW, Grid). Although the use of NOW and Grid parallel computers could be less effective for some parallel algorithms than the massive parallel architectures used in the world today (supercomputers), parallel computers based on NOW and Grid nowadays belong to the dominant parallel computers.
Conclusion and perspectives
Distributed computing was reborn as a kind of lazy parallelism.
A network of computers could team up to solve many problems at
once, rather than one problem at a higher speed. To get the most out
of a distributed parallel system, designers and software developers
must understand the interaction between the hardware and software
parts of the system. It is obvious that the use of a computer network
based on personal computers would be in principle less effective
than the typical massive parallel architectures used in the world
today, because of higher communication overheads; but for the future, a network of workstations based on powerful personal computers belongs to the very cheap, flexible and promising asynchronous parallel systems.
8
Modelling in Parallel Algorithms
Latencies of PA
Until this time, the known results in complexity modelling have used
mainly classical parallel computers with a shared memory
(supercomputers, massive SMP) or a distributed memory (cluster,
NOW, Grid), which in most cases do not consider the influences of
parallel computer architecture and other real overheads
(communication, synchronisation, parallelisation etc.), supposing that these would be low in comparison to the latency of the massive parallel computations being performed [18].
In this sense, the analysis and modelling of complexity in parallel algorithms (PA) were reduced to a complexity analysis of their own computations T(s, p)_comp only, which means that the functions of all the existing control and communication overhead latencies (the overhead function h(s, p)) were not a part of the derived relations for the whole parallel execution time T(s, p). In general, the computation time of sequential as well as parallel algorithms is given through the product of the algorithm complexity Z_alg (a dimensionless number of performed instructions) and the technical parameter t_c, the average time of the computation operations carried out on a given computer (sequential or parallel).
In this sense, the dominant function in relation to the isoefficiency of parallel algorithms is the complexity of the massive computations T(s, p)_comp being performed. Such an assumption has proved to be true when using classical parallel computers (supercomputers,
massive SMPs, SIMD architectures etc.). When using this assumption in the relation for the asymptotic isoefficiency w(s), we get w(s) as follows:

w(s) = max { T(s, p)_comp , h(s, p) } = max { T(s, p)_comp }, since h(s, p) < T(s, p)_comp
Conversely, with parallel algorithms for the dominant parallel computers based on NOW (including SMP systems) and Grid, it is necessary for complex modelling to analyse at least the most important overheads out of all the existing overheads, which are [15, 16]:
The architecture of the parallel computer T(s, p)_arch.
Its own computations T(s, p)_comp.
The communication latency T(s, p)_comm:
    start-up time (t_s);
    data unit transmission (t_w);
    routing.
The parallelisation latency T(s, p)_par.
The synchronisation latency T(s, p)_syn.
Waiting caused by limited shared technical resources T(s, p)_wait (memory modules, communication channels etc.).
An illustration of typical parallel computing delays is in Fig. 8.1.
By taking these real overhead latencies into the whole parallel execution time T(s, p)_complex we get the following relation:

T(s, p)_complex = f ( T(s, p)_arch , T(s, p)_comp , T(s, p)_par , T(s, p)_comm , T(s, p)_syn )

Figure 8.1 Illustration of performing parallel processes (three parallel processes on a time axis, alternating computation, waiting and communication of data messages).
where T(s, p)_arch, T(s, p)_comp, T(s, p)_par, T(s, p)_comm and T(s, p)_syn denote the individual latencies caused by the parallel computer architecture, the parallel computations, parallelisation, inter-process communication and the synchronisation of parallel processes. These defined overhead latencies build the defined isoefficiency overhead function h(s, p). In general, it is necessary to take the influence of h(s, p) into account in the complex performance modelling of parallel algorithms. The defined overhead function h(s, p) is as follows:

h(s, p) = f ( T(s, p)_arch , T(s, p)_par , T(s, p)_comm , T(s, p)_syn )
The first part of the h(s, p) function, T(s, p)_arch (the architectural influence of the parallel computer in use), is projected into the used technical parameters t_c, t_s, t_w, which are constants for any given parallel computer.
The second part of the h(s, p) function, T(s, p)_par (the parallelisation latency of a parallel algorithm), depends on the chosen decomposition strategy, and its consequences are thus projected onto the computation part T(s, p)_comp (the whole computation complexity) as well as onto the communication part T(s, p)_comm (the whole communication complexity).
The third part of the h(s, p) function, T(s, p)_syn, we can eliminate by optimising the load balance between the individual computing nodes of the parallel computer in use. For this purpose we measure the performance of the individual computing nodes in use for any given developed parallel algorithm, and based on the measured results we are then able to redistribute the given input load better. These activities can be repeated until we have an optimally redistributed input load (load balancing).
In general, the possible non-linear influence of the overhead function h(s, p) should be taken into account during the complex performance modelling of parallel algorithms. Then, for the analysis of the asymptotic isoefficiency in a complex performance analysis, we should consider w(s) as follows:

w(s) = max { T(s, p)_comp , h(s, p) }
where, according to the earlier analysis of the influence of the individual overhead latencies, the most important parts of the overhead function h(s, p) for the dominant parallel computers (NOW, Grid) are its own parallel computation time T(s, p)_comp and the communication overhead latency T(s, p)_comm.
The parallel computation time T(s, p)_comp of a parallel algorithm is given by the quotient of the running time of the largest parallel process PP (the product of its complexity Z_pp and the constant t_c, the average time of the computation operations being performed) and the number of computation nodes in use of the given parallel computer. Based on this, we are able to derive for the parallel computation time T(s, p)_comp the following relation:

T(s, p)_comp = (Z_pp · t_c) / p
Supposing an ideally parallelised complex problem (for example, matrix algorithms) and a theoretically unlimited number of computation nodes p, the mathematical limit of T(s, p)_comp is given as:

T(s, p)_comp = lim_{p→∞} (Z_sa · t_c) / p = 0
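As a small numerical illustration with assumed values, a parallel process of complexity Z_pp = 10^10 instructions on computing nodes with t_c = 1 ns per instruction needs about 10 s sequentially; on p = 16 equally loaded computing nodes the ideal parallel computation time would be T(s, p)_comp = (10^10 · 1 ns) / 16 ≈ 0.625 s, ignoring the overhead function h(s, p).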
Finally, the assumed relation between T(s, p)_comp and h(s, p) is illustrated in Fig. 8.2. For effective parallel algorithms we are seeking the minimum of the whole execution time according to Fig. 8.2.
In relation to the previous steps, the heart of an asymptotic analysis of h(s, p) is the analysis of the communication part T(s, p)_comm, including the projected consequences of the decomposition methods used. This analysis results from applying the isoefficiency concept to the communication complexity derived in an analytical way for the decomposition strategies being used. In general, the derived isoefficiency function w(s) could have a non-linear character for a gradually increasing number of computing nodes. An analytical derivation of the isoefficiency function w(s) allows us to predict the performance of any given parallel algorithm. This goes for an existing real parallel system as well as for a hypothetical one. Thus, we
have the possibility of considering the potential efficiency of any given algorithm. The communication time T(s, p)_comm is given by the number of communications carried out for the decomposition strategy under consideration. Every communication within NOW is characterised by the two following communication parameters, which are illustrated in Fig. 8.3:
Communication initialisation t_s (start-up time).
Data unit transmission t_w (word transfer).
The communication overheads are given through the two following basic components:
f_1(t_s) as a function of the entire number of initialisations of the performed communications.
f_2(t_w) as a function of the entire number of transmitted data units, where t_w is usually the time of one word transmission for the given parallel computer.
These two components limit the performance of the parallel system in use based on NOW. An illustration of these communication parameters is in Fig. 8.3. Using these parameters in superposition, we can write:

T(s, p)_comm = f_1(t_s) + f_2(t_w)
Figure 8.2 Relations between the parts of the parallel execution time (the execution time as a function of the number of processors is composed of a decreasing processing time and an increasing communication time).
The communication time T(s, p)_comm is given by the number of communication operations performed in a concrete parallel algorithm, and depends on the decomposition model being used. As a practical illustration of communication overheads we use the possible matrix decomposition models.
In practice, the most difficult common example of communication complexity for massive and Grid parallel computers is a network communication which includes crossing several communication networks (hops) interconnected by routers and by other connecting communication elements (repeaters, switches, bridges, gateways etc.). In this case, communication passes through a number of control communication processors or communication switches, whereby in this kind of transmission chain there could occur communication networks with remote data transmission. The number of network crossings through the various communication networks is defined as the number of hops [39, 79]. In the most complex parallel computers it is necessary to extend the defined equation for T(s, p)_comm with a third function component f_3(t_h), which determines the potential multiple crossings through the NOW networks in use. This third function component is characterised by multiplying the number of hops l_h between the crossed NOW networks by their average latency time (NOW networks with the same communication speed), or by the sum of the individual latencies of these NOW networks (NOW networks with different communication speeds).
Figure 8.3 The technical parameters of communication (the message transfer time as a function of the message length: the start-up time t_s followed by t_w per transmitted data unit).
The third latency, f_3(t_s, l_h), is the time taken to send a message with m words between the NOW networks across l_h hops, and is given as t_s + l_h·t_h + m·t_w, where the new parameters are as follows:
l_h is the number of network hops.
m is the number of transmitted data units (usually words).
t_h is the average communication time for one hop, where we suppose the same communication speed.
The new parameters t_h and l_h depend on the concrete architecture of the Grid communication network and on the routing algorithms used in the Grid. For the purpose of analysis it is necessary to derive, for any given parallel algorithm or group of similar algorithms (in our case matrix algorithms) and for any given decomposition strategy, the necessary communication functions, the isoefficiency function or the basic constants for the parallel computer in use. The entire communication latency in the Grid is thus defined as:
T(s, p)_comm = f_1(t_s) + f_2(t_w) + f_3(t_s, l_h)
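As a small numerical illustration with assumed values t_s = 100 µs, t_w = 1 µs per word, t_h = 50 µs per hop and l_h = 3 hops, sending one data message of m = 1,000 words across the Grid costs t_s + l_h·t_h + m·t_w = 100 + 150 + 1,000 = 1,250 µs; for short messages the start-up and hop latencies dominate, while for long messages the term m·t_w dominates.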
The total communication latency, as a sum of the individual communication delays through the different existing communication networks, varies depending on the architecture of the communication networks in use and on the data transmission method. For the design of optimised parallel algorithms (effective PAs) it is necessary to perform an analysis of the average values of the defined communication parameters t_s, t_h and l_h. In practice, this means
deriving the analytical dependences which represent the total delay of its own parallel computing T(s, p)_comp and at least the total inter-process communication delay T(s, p)_comm. A potentially dominant delay T(s, p)_comm in the defined isoefficiency relations is represented by the overhead function h(s, p), whose effect may be comparable to the entire dominant parallel algorithm complexity.
Part III:
Applied Parallel Algorithms
9
Numerical Integration
Numerical integration algorithms are typical examples of those with
an implicitly latent decomposition strategy in which the parallelism is
an integral part of its own algorithm. The standard method of
creating a typical numerical integration algorithm (the computation of the number π) assumes that we divide the interval <0, 1> into n identical sub-intervals, whereby in each sub-interval we approximate its part of the curve with a rectangle. The function value in the middle of each sub-interval determines the height of the rectangle. The number of selected sub-intervals determines the computation accuracy. The computed value of π is given as the sum of the surface areas of the defined individual approximating rectangles. An illustration of the numerical integration applied to the computation of π is in Fig. 9.1. For the concrete calculation of the value of π, the following standard formula is used [39, 49]:

π = ∫_0^1 4 / (1 + x²) dx

where h = 1 / n is the width of the selected splitting interval, x_i = h·(i - 0.5) are the mid-points of the sub-intervals and n is the number of selected intervals (the computational accuracy). For the computation of π we can use an alternative relation based on an interpolating polynomial as follows [93]:

∫_0^1 f(x) dx = ∫_0^1 4 / (1 + x²) dx ≈ (1/n) · Σ_{i=1}^{n} f(x_i)
Figure 9.1 An illustration of numerical integration.
or by the next possible relation:

π ≈ (1/N) · Σ_{i=0}^{N-1} 4 / (1 + ((i + 0.5) / N)²)
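As a quick numerical check of this relation, for N = 4 the mid-points are x_i = 0.125, 0.375, 0.625 and 0.875, the corresponding values of 4 / (1 + x_i²) are approximately 3.9385, 3.5068, 2.8764 and 2.2655, and their average is about 3.1468, already close to π; increasing N improves the accuracy.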
Decomposition model
For a parallel method of computing the numerical integration we use the property of the latent decomposition strategy present in all naturally parallel algorithms. We divide the entire necessary computation into individual parallel processes, as illustrated in Fig. 9.2, where for the sake of simplicity there are four parallel processes. For the parallel computation of the number π we then use these created parallel processes (see after Fig. 9.2):
enter the desired number of sub-intervals n
compute the width w of each sub-interval
for each sub-interval find its centre x
    compute f(x) and add it to the sum
end of the cycle
multiply the sum by the width w to obtain π
return π
This prospective implementation onto parallel computers (NOW,
SMP, Grid) allows an analysis of the communication load depending
on its input (the desired accuracy), because the variation of the input
load is proportional to any change to the communication load. The
chosen parallel algorithm implementation influences the necessary
inter-process communication (IPC) mechanism.
Mapping of parallel processes
The individual independent processes are distributed for computation
in such a way that every created parallel process is executed on a
different computing node of a parallel computer (mapping). After
the parallel computation in the individual nodes of a network of
workstations has been performed, we only need to combine the partial results to get the final result. To manage this task we have to choose one of the computing nodes (the manager) to handle it. At the
start of the computation, the chosen node (let it be node 0) must also
know the value of n (the number of strips in every process), and then
the selected node 0 has to make this known to all the other
computing nodes. An example of a parallel computation algorithm
(manager process) is therefore as follows:
Figure 9.2 The decomposition of the numerical integration problem (the sub-intervals of <0, 1> are assigned cyclically to the four parallel processes 0, 1, 2, 3).
if my node is 0
    read the desired number n of strips and send it to all other nodes
else
    receive n from node 0
end if
for each strip assigned to this node
    compute the height of the rectangle (at its midpoint) and add it to the sum
end for
if my node is not 0
    send the sum to node 0
else
    receive the partial sums from all other nodes and add them to the sum
    multiply the sum by the width of the strips to get π
    return π
end if
The sending of the values of n/p could be done, in the case of a parallel algorithm with distributed memory PA_dm using the MPI API, with the collective communication command Broadcast when the parallel processes are of the same size, or with the MPI command Scatter when the parallel processes have different sizes.
The following code is an example of a specific implementation of
this algorithm in parallel FORTRAN for a parallel distributed
system. The algorithm extends and modifies the starting serial
algorithm for the specific parallel implementation. In terms of its
characteristics, it is noted that the number of nodes in the parallel system, the identification of each node and the identification of the executing process are obtained through the procedures known as numnodes() (number of nodes), mynode() (my node) and mypid() (the id of my process). These procedures allow the
activation of any number of nodes in a parallel system. Then if we
allow a designated node, for example node 0, to perform management
functions and certain partial calculations, while other nodes perform
a substantial part of the calculations for assigned sub-intervals, the
partial amounts obtained are communicated to the specified node 0.
The procedures known as csend and crecv serve to ensure that the required collective data communications take place (for collectively transmitting data, an alternative to the type Broadcast is used; for collecting the partial results for the final calculation, an alternative to the type Reduction is used).
      integer n, i, p, me, mpid
      integer msglen, allnds, msgtp0, msgtp1
      real w, x, sum, pi, temp, f
      f(x) = 4.0/(1.0 + x*x)            ! statement function for the integrand
      p = numnodes()                    ! return number of nodes
      me = mynode()                     ! return number of my node
      mpid = mypid()                    ! return id of my process
      msglen = 4                        ! estimated message length
      allnds = -1                       ! message destination: all nodes
      msgtp0 = 0                        ! name for message type 0
      msgtp1 = 1                        ! name for message type 1
      if (me .eq. 0) then               ! if I am node 0
        read *, n                       ! read number of sub-intervals n
        call csend (msgtp0, n, msglen, allnds, mpid)  ! send n to all nodes
      else                              ! if I am any other node
        call crecv (msgtp0, n, msglen)  ! receive the value n
      endif
      w = 1.0/n
      sum = 0.0
      do 10 i = me+1, n, p              ! dividing sub-intervals among nodes
        x = w*(i-0.5)
        sum = sum + f(x)
10    continue
      if (me .ne. 0) then               ! if I am not node 0
        call csend (msgtp1, sum, 4, 0, mpid)   ! send partial result to node 0
      else                              ! if I am node 0
        do 20 i = 1, p-1                ! for every other used node
          call crecv (msgtp1, temp, 4)  ! receive a partial result into temp
          sum = sum + temp              ! and add it to sum
20      continue
        pi = w*sum                      ! compute the final result
        print *, pi                     ! and print it
      endif
      end
One disadvantage is that all the necessary central communications pass through the designated node 0 (the manager node), and as a result a bottleneck may arise which could negatively affect the efficiency of the entire parallel algorithm. Its removal is achieved through the performance optimisation of the parallel algorithm (tuning).
Performance optimisation
After verifying the parallel algorithm on a concrete parallel computer,
the next step is to optimise its performance [42, 52]. In the above
example of numerical integration this requirement leads to reducing
the bottleneck, which is inter process communication (IPC) latency.
This latency should be minimised, since that time could be used more effectively for the useful computation of parallel algorithms. It is
therefore very important to minimise the number of communicating
data messages proportionally to the number of computational
operations, thereby also minimising the overall execution time of a
parallel algorithm. During the computation of the number π, the demanded centralisation of the necessary communications through the manager computing node 0 may cause a computation bottleneck for
the two following reasons:
Manager computing node 0 can simultaneously receive one data
message from only one other computation node.
The calculation of the partial results in manager node 0 is done
sequentially, which is a prerequisite for creating a bottleneck.
The main conclusion from both these cases is to consider using the collective communication commands of standardised development environments such as the MPI API, namely the collective commands Reduce and Gather respectively. On some parallel computers an alternative global summation operation known as gssum() is available precisely to eliminate these bottlenecks. This operation always combines the partial results of two computing nodes in an iterative way: the partial result received from one node of a computing pair is added to the result already held in the given node, and this sum is then transmitted to the next computing node of a defined communication chain. In this way the total accumulated sum over the computing nodes is gradually obtained, whereby the manager computing node 0 performs the last addition and prints the final result. This global summation could also be programmed explicitly, but the availability of an appropriate library procedure simplifies the implementation of parallel algorithms and also contributes to their effectiveness.
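As an illustration only (not the authors' original code), the following minimal C sketch shows how the same collection step could be expressed with the standardised MPI collective operation MPI_Reduce mentioned above; the number of subintervals n is fixed in the source purely for brevity.

/* pi_reduce.c - numerical integration of 4/(1+x*x) over [0,1];
   the partial sums are combined with MPI_Reduce instead of a
   manual receive loop in node 0.  Compile e.g.: mpicc pi_reduce.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int p, me, i, n = 1000000;           /* n fixed here for brevity */
    double w, x, sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);   /* number of nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);  /* my node number  */

    /* node 0 could read n and distribute it to all nodes */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    w = 1.0 / n;
    for (i = me + 1; i <= n; i += p) {   /* subintervals divided among nodes */
        x = w * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    /* global summation replaces the manual csend/crecv loop */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (me == 0)
        printf("pi = %.12f\n", w * pi);

    MPI_Finalize();
    return 0;
}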
In other applied tasks a larger number of direct inter-process communications is used, for example in the form of asynchronous communication using the direct support of multitasking in a given node, thereby overlapping the communication activities with the computation in other nodes. Of course, this example of reducing the ratio of communication to computing activities is not the only task in optimising the performance of parallel algorithms; the available methods of performance optimisation are practically as varied as the diverse parallel application tasks themselves. Inspiring examples and procedures will therefore be included in the illustrative application examples in the following sections. The commonly used methods and procedures for the decomposition of application tasks indirectly imply the possibilities of optimising their performance, which means that optimisation can also lead to a re-evaluation of the decomposition strategy used.
It is also important to note that the abovementioned mechanism assumes the direct communication of every computing node with all the other computing nodes, so that parallel communication between multiple pairs of computing nodes is possible. In this approach the final result can be obtained after the second, third or even the fourth cycle of the communication chain. In fact, only log2 p cycles of the communication chain are needed, where p is the number of computing nodes of the parallel computer, compared with the sequential collection of the partial results in the initial implementation. The parallelism used for the data message exchange therefore increases the efficiency of the parallel algorithm. For the implementation of this improved approach to data message communication it is necessary to replace the following part:
if (me .ne. 0) then
  call csend (msgtp1, sum, msglen, 0, mpid)
else
  do 20 i = 1, p-1
    call crecv (msgtp1, temp, msglen)
    sum = sum + temp
20 continue
  pi = w*sum
  print *, pi
endif
end

with the next part being:

call gssum (sum, 1, temp)
if (me .eq. 0) then
  pi = w*sum
  print *, pi
endif
Chosen illustration results
We illustrate some of the chosen results which have been measured and tested. For the experimental testing we have used the following workstations of the NOW parallel computer:
WS 1 – Pentium IV (f = 2,26 GHz)
WS 2 – Pentium IV Xeon (2 processors, f = 2,2 GHz)
WS 3 – Intel Core 2 Duo T7400 (2 cores, f = 2,16 GHz)
WS 4 – Intel Core 2 Quad (4 cores, f = 2,5 GHz)
WS 5 – Intel Sandy Bridge i5 2500S (4 cores, f = 2,7 GHz)
The complex parallel execution times in a NOW parallel computer with an Ethernet communication network and the previously defined specification of the workstations used for the parallel algorithm of numerical integration are illustrated in Fig. 9.3. We can see that with a decreasing order of epsilon (higher computation accuracy) the computing time increases in a linear way, which is caused by the linear rise in the number of necessary computations. We have divided the defined computation interval into equal parts regardless of the real, different performances of the workstations used, so the entire computing time is determined by the computing time of the slowest workstation, in our case workstation WS1. To improve this it would be necessary to distribute the input load in such a way that the performance of the individual workstations is taken into account (load balancing). For this purpose the illustrated measured results of the real performance for the given parallel algorithm (numerical integration) are necessary, as illustrated in Fig. 9.3:
Figure 9.3 The complex parallel execution times T(s, p)complex for Epsilon = 10⁻⁹ (input load s versus T(s, p) [ms] for the workstations WS1–WS6).
In a similar way we measured the complex parallel execution times T(s, p)complex of the chosen workstations for varied accuracy of the computations (Epsilon = 10⁻⁵ to 10⁻⁹), as shown in Fig. 9.4. The point-to-point communication latency in the Ethernet communication network used varies from 3 to 7 ms. This extensive dispersion is caused by the control mechanism of the Ethernet, which is a simple stochastic control mechanism for a shared communication channel.
The percentages of the individual parts (parallel computation, network load latency, initialisation latency) of the execution time are illustrated for epsilon = 10⁻⁵ in Fig. 9.5. We can see that with increasing accuracy (higher input load) the computation time becomes the dominating influence, while the communication load remains constant.
Figure 9.4 T(s, p)complex for varied accuracy (Epsilon = 10⁻⁵ – 10⁻⁹). The measured parallel execution times [ms] were:

Epsilon    WS1      WS3      WS4      WS5
10⁻⁵       8        3        2        1
10⁻⁶       71       19       17       12
10⁻⁷       755      173      153      112
10⁻⁸       7401     1715     1568     1108
10⁻⁹       73855    17186    15816    11075

Figure 9.5 The individual latencies of complex execution times (Epsilon = 10⁻⁵). The underlying data values for the execution parts were:

Part             WS1    WS3    WS4    WS5
Computation      7      2      2      1
Network load     8      6      3      3
Initialisation   9      0      0      0
10
Synchronous Matrix
Multiplication
The systolic matrix multiplier
We will now consider an example of the systolic multiplication of
two matrices A (k m) and B (m n). The most suitable algorithm
for matrix multiplication is based on outer products, rather than the
conventional inner product method. Here we evaluate:

1
. ( ). ( )
k
p
N AB C p R p
=
= =

where ( ) ( ) 1, 2,..., ,
pj pj
C p R p c r p k = = and j=1,2,...,n. Here C(p) is a
k~n matrix with all the columns identical to the p-th column of A,
and R(p) is a k~n matrix with all the rows identical to the p-th row
of B. For example, if:

11 12
11 12 13
21 22
21 22 23
31 32
b b
a a a
A B b b
a a a
b b



= =





then N=AB=C(1)R(1)+C(2)R(2)+C(3)R(3)
1
128 Analytical Modelling in Parallel and Distributed Computing
The time taken to multiply two matrices of size k m and m n
when using k n processors is m+(k-1)+(n-1) time units. This result is
easily proved using the fact that the final computation is carried out
after a lapse of (k-1)+(n-1) units, because of the skewing of the matrices,
and it takes m units of time to multiply two vectors of size m. Note that
a conventional single-processor matrix multiplication requires k m n
operational time units; in the case of the systolic multiplier we use k n
processors (or more space) to reduce the time to m+n+k-2.

13 31 13 32 11 11 11 12 12 21 12 22
23 31 23 32 21 11 21 12 22 21 22 22
N
a b a b a b a b a b a b
a b a b a b a b a b a b

= + +


A systolic matrix multiplier is shown in Fig. 10.1. The input streams of the rows of A arrive from the left and the input streams of the columns of B arrive from the top of the nodes N11, N12, N21 and N22. Each of these nodes has the capability of multiplying two input numbers and holding the result until the next set of inputs arrives for multiplication; then the sum is accumulated and the process is repeated. It is worth noting that the product matrix N is finally available at the nodes N11, N12, N21 and N22. Thus, in this case the matrix is entirely mapped in a one-to-one correspondence onto the processor array.
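The outer-product formulation described above can be checked with a short sequential C sketch (our illustration with arbitrary test values): the first triple loop accumulates the contributions C(p)·R(p) term by term, the second computes the conventional inner products for comparison.

/* outer_product.c - accumulate N = sum_p C(p).R(p) for a 2x3 by 3x2 example */
#include <stdio.h>

#define K  2   /* rows of A                 */
#define M  3   /* cols of A = rows of B     */
#define NC 2   /* cols of B                 */

int main(void)
{
    double A[K][M]  = {{1, 2, 3}, {4, 5, 6}};
    double B[M][NC] = {{7, 8}, {9, 10}, {11, 12}};
    double N1[K][NC] = {{0}}, N2[K][NC] = {{0}};
    int i, j, p;

    /* outer-product formulation: for each p add a_ip * b_pj to every element */
    for (p = 0; p < M; p++)
        for (i = 0; i < K; i++)
            for (j = 0; j < NC; j++)
                N1[i][j] += A[i][p] * B[p][j];

    /* conventional inner-product formulation for comparison */
    for (i = 0; i < K; i++)
        for (j = 0; j < NC; j++)
            for (p = 0; p < M; p++)
                N2[i][j] += A[i][p] * B[p][j];

    for (i = 0; i < K; i++)
        for (j = 0; j < NC; j++)
            printf("N[%d][%d] = %g (check %g)\n", i, j, N1[i][j], N2[i][j]);
    return 0;
}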
Figure 10.1 A systolic matrix multiplier (the skewed rows a11–a23 enter from the left and the skewed columns b11–b32 from the top of the nodes N11, N12, N21 and N22).
Instruction systolic array matrix multiplier
Systolic arrays have their limitations. Their special-purpose nature makes them less flexible, meaning that an array designed for a specific purpose cannot be used for another purpose. For example,
a sorting systolic array cannot multiply a matrix. In other words,
systolic arrays can execute only one algorithm for a fixed set of
parameters and a fixed problem size.
In order to introduce flexibility the concept of instruction systolic
arrays (ISA) has been proposed. The ISA, as its name indicates, is also
a systolic array. However, instead of data the instructions are pushed
through the processor array. This is done by using a separate
arrangement to push an instruction stream from the top, known as
the top program (TP), and an orthogonal stream of Boolean selectors
from the left called the left program (LP). An instruction is executed
in a processor only if an instruction in the TP meets a selector bit in
the LP with a value of 1; if the selector bit is 0 it inhibits the instruction.
Such an arrangement requires that the processors have the capability
of executing different instructions. However, the separation of a
program stream and a selector stream enables the execution of
different programs on the same processor array (Fig. 10.2):
Figure 10.2 An instruction systolic array.
The basic architecture of the ISA is an n × m mesh-connected array of identical (homogeneous) processors (Fig. 10.3). The processors have simple control units and can execute instructions from a fixed, simple instruction set. The processor array is synchronised by a global clock (a centralised architecture) and the execution of each instruction is assumed to take the same time.
Figure 10.3 ISA architecture
Each processor has some data registers D, including a designated communication register R. Two processors P and Q communicate by sharing R: first P writes the required data into its register Rp during one instruction cycle, and its neighbour Q reads the contents of Rp in the following instruction cycle. Thus each processor P can write only into its own register Rp, while its neighbours can read Rp. All the neighbouring processors are allowed to read the same register simultaneously. Read–write conflicts are eliminated by dictating that reading can be done only during the first half of the instruction cycle and writing only in the second half. This means that a read returns the old content that was produced in an earlier instruction cycle.
The processors at the boundaries of the array serve as communication links for the I/O data. The instructions are supplied to the processors from outside. Each processor has only one instruction register. At the beginning of each instruction cycle, each processor fetches the instruction from the instruction register of its upper neighbour. This is done synchronously, so that rows of instructions move through the processor array from top to bottom. The processors in the top row of the array are supplied with instructions from an outside memory. In a similar manner, the column of selector bits moves through the array from left to right. A processor executes its instruction if its selector bit is 1, otherwise it remains idle, leaving the contents of its registers unchanged. We will now illustrate the ISA principle using the familiar example of matrix multiplication.
ISA matrix multiplier
An ISA matrix multiplier is shown in Fig. 10.4. It is assumed that the two matrices A and B, of size k × m and m × n respectively, are available as input queues at the top left:
Figure 10.4 An ISA matrix multiplier.
The processor array is of size k × n, and the selector bits and the sets of instructions interact in the processor array so that the required instruction is performed at the correct time step. The communication register is denoted by R, and D1 and D2 are two data registers. The basic instructions used in the TP are:
Instruction 1 (←): R = R_L; reads the contents of R of its left neighbour and writes it into its own R.
Instruction 2 (D): D1 = R; stores the contents of the communication register in D1.
Instruction 3 (↑): R = R_T; reads the contents of R of its upper neighbour and writes it into its own R.
Instruction 4 (*): D1 = D1·R; stores the product of the contents of D1 and the contents of R in D1.
Instruction 5 (+): D2 = D2 + D1; stores the sum of D2 and D1 in D2.
The algorithm used for matrix multiplication is the outer product, as in the systolic matrix multiplier method. The matrices A (k × m) and B (m × n) are presented from the m-th column and the n-th row respectively, as shown in Fig. 10.4. Finally, the ij-th element of the product C appears in processor P_ij. In Fig. 10.4 we have chosen k = 2, m = 3 and n = 2 for the sake of simplicity of explanation.
To complete the remaining task we repeat the process twice. This requires the repetition of the basic TP and LP blocks twice. Thus, the total time taken is 2·5 + 7 = 17 units for the 2 × 3 by 3 × 2 matrix multiplication. For a k × m by m × n matrix multiplication we compute the time as follows: let r denote the number of instruction diagonals (the broken oblique lines in Fig. 10.4). Then the first cycle of interaction is completed in r + n + k − 2 time units and the total time is m·r + n + k − 2 units. Here the factor m denotes the number of repetitions of the LP and TP blocks. Thus we have an O(m + n + k) ISA algorithm using k × n processors.
It can be observed that the systolic array computation took m + n + k − 2 time units. In that calculation we did not take into account the time involved in moving, adding and multiplying the numbers. If we took these into account, the computational timing would have to be modified to r·m instead of m (r being the number of instructions); this is in agreement with the ISA timing (m·r + n + k − 2).
Dataflow matrix multiplication
The systolic array is entirely controlled by a global clock. Therefore,
the synchronisation of different computations requires careful
planning to ensure correct timing. When the systolic arrays become
very large, this planning may become extremely difficult. To obviate
this difficulty, wave front array processors have been suggested. A
wave front array (WFA) can be described as a systolic array in which
the dataflow computation is embedded. Thus, the computation is
essentially data driven and not control driven. This means the
successive instructions are not triggered by an external clock, but by
the availability of the required operands and resources.
Wave front matrix multiplier
The wave front matrix multiplier is shown in Fig. 10.5. Here we multiply the k × m matrix A and the m × n matrix B. A is stored in the left memory module and B is stored in the top memory module. We assume that the processor array is of size k × n in order to compute the product. In Fig. 10.5 we have chosen k = 3, m = 3 and n = 4 for the sake of simplicity:
Figure 10.5 A wave front matrix multiplier.
The columns of A and the rows of B are taken one by one to generate the product C(i)·R(i) during each sweep of the wave front, for i = 1, 2, …, m. During the first sweep the product C(1)·R(1) is formed. To do this, initially all the processors are set to P_ij^(0) = 0 for all (i, j). Processor P11 starts by computing:

P11^(1) = P11^(0) + a11·b11

Then the computational activity propagates to P12 and P21, which compute:

P12^(1) = P12^(0) + a11·b12
P21^(1) = P21^(0) + a21·b11

The secondary wave front activates P31, P22 and P13 to compute the next elements; then, similarly, the remaining elements of C(1)·R(1) are computed by P32, P23, P14 and by P33, P24 and P34. After processor P11 completes its task for the first wave front, the second sweep can begin to compute C(2)·R(2). Similarly, all m sweeps are carried out. After the m sweeps the product matrix is obtained.
Asynchronous matrix multiplication
Decomposition strategies
To choose the best decomposition method for an application we have to understand the concrete application problem, the data domain, the algorithm used and the flow of control in the given application. According to the concrete character of the given task we can therefore use the following decomposition models:
Domain decomposition.
Object-oriented programming (OOP).
Domain decomposition methods for matrix multiplication
We will illustrate the role of the correct decomposition strategy using matrix multiplication. For the sake of simplicity the principle is illustrated for matrices A and B with the number of rows and columns k = 2. The resulting matrix C = A·B is:

[ a11  a12 ] · [ b11  b12 ] = [ a11·b11 + a12·b21   a11·b12 + a12·b22 ] = [ c11  c12 ]
[ a21  a22 ]   [ b21  b22 ]   [ a21·b11 + a22·b21   a21·b12 + a22·b22 ]   [ c21  c22 ]

The method of sequential calculation is as follows:
Step 1: Compute all the values of the result matrix C for the first row of matrix A and all the columns of matrix B.
Step 2: Take the next row of matrix A and repeat Step 1.
In this procedure we can see the potential for parallel computation, which is the repetition of the activities of Step 1, each time with another row of matrix A. Let's consider an example of the calculation of matrix multiplication on a parallel system. The basic idea of a possible decomposition procedure is illustrated in Fig. 10.6:
The procedure is as follows:
Step 1: Give the i-th node a horizontal strip (rows) of matrix A named A_i and a vertical strip (columns) of matrix B named B_i.
Step 2: Compute all the values of the resulting matrix C for A_i and B_i and name them C_ii.
Step 3: Send the value B_i of the i-th computing node to node i − 1 and receive the value B_i+1 from computing node i + 1.
Repeat Steps 2 and 3 until the i-th node has computed the C_i,i−1 values from the B_i−1 columns and the A_i rows. The i-th node has then computed the i-th row strip of matrix C (Fig. 10.7) for a matrix B with k columns. The advantage of this kind of decomposition is its minimal consumption of memory cells: every node holds only three strips (its rows of A, the currently circulating columns of B and its part of C).
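A minimal serial C sketch of this circulation scheme is given below (our illustration; the matrix size N and the number of simulated nodes U are hypothetical). In step t the simulated node i multiplies its row strip A_i with the column strip of B it currently holds and thereby fills one block of C; after U steps the whole product is complete.

/* ring_decomposition.c - serial simulation of decomposition 1:
   node i keeps row strip A_i, the column strips B_j circulate in a ring */
#include <stdio.h>

#define N 4          /* matrix dimension, assumed divisible by U */
#define U 2          /* number of simulated computing nodes      */
#define S (N / U)    /* strip width                              */

int main(void)
{
    double A[N][N], B[N][N], C[N][N] = {{0}};
    int i, j, k, node, t;

    for (i = 0; i < N; i++)                  /* some test data */
        for (j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j) ? 1.0 : 0.0;  /* identity, so C should equal A */
        }

    for (t = 0; t < U; t++)                      /* U circulation steps          */
        for (node = 0; node < U; node++) {       /* every node works in parallel */
            int bs = (node + t) % U;             /* strip of B currently held    */
            for (i = node * S; i < (node + 1) * S; i++)     /* rows of A_i   */
                for (j = bs * S; j < (bs + 1) * S; j++)     /* cols of B_bs  */
                    for (k = 0; k < N; k++)
                        C[i][j] += A[i][k] * B[k][j];
        }

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) printf("%6.1f", C[i][j]);
        printf("\n");
    }
    return 0;
}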
Figure 10.6 Standard decomposition of matrix multiplication (A_i · B_i = C_ii).
Figure 10.7 An illustration of the gradual calculation of matrix C (blocks C_i,1 … C_i,i−1, C_i,i, C_i,i+1 … C_i,k).
This method is also faster than the second possible way of decomposition, shown in Fig. 10.8:
The procedure is as follows:
Step 1: Give the i-th node a vertical strip (columns) of matrix A (A_i) and a horizontal strip (rows) of matrix B (B_i).
Step 2: Perform the ordinary matrix multiplication of A_i and B_i. The result is a matrix C_i of the same size as C. Every element of C_i is one summand of the total sum which forms the corresponding element of the resulting matrix C.
Step 3: Use the parallel addition function GSSUM to create the resulting matrix C from the corresponding elements of the C_i. This added reduction increases the calculation time, and the increase depends strongly on the magnitude of the input matrices (Fig. 10.9).

Figure 10.8 Matrix decomposition model with columns from the first matrix (A_i · B_i = C_i).
Figure 10.9 An illustration of the gradual calculation of the elements C_i,j.

Let k be the number of rows or columns of A and B, and let U be the total number of nodes. Then:

C′_1(1,1) = a_1,1·b_1,1 + a_1,2·b_2,1 + … + a_1,k/U·b_k/U,1
C′_2(1,1) = a_1,k/U+1·b_k/U+1,1 + a_1,k/U+2·b_k/U+2,1 + …
  ⋮
C′_U(1,1) = a_1,(U−1)k/U+1·b_(U−1)k/U+1,1 + … + a_1,k·b_k,1

and the final element of matrix C:

C(1,1) = Σ_{i=1}^{U} C′_i(1,1)
Comparison of used decomposition models
A comparison of both decomposition strategies for various numbers of computing nodes is illustrated in Fig. 10.10. The first decomposition method computes the individual elements of the resulting matrix C directly, by multiplying the corresponding elements of matrices A and B. The second decomposition method requires, besides the multiplication of the corresponding elements of matrices A and B, an additional summation of the partial results to obtain the final elements of matrix C, which causes an additional time complexity in comparison to the first method. This additional time complexity depends strongly on the magnitude of the input matrices, as can be seen in Fig. 10.10. In this example of asynchronous matrix multiplication we can see the crucial influence of the decomposition model on the complexity of parallel algorithms (effective parallel algorithms).
Figure 10.10 A comparison of the applied decomposition methods (execution time [s] versus the number of computing nodes for Decomposition 1 and Decomposition 2).
11
Discrete Fourier Transform
The Fourier series
Periodic time functions x(t) can be expressed as a sum of sine and cosine functions with different amplitudes and frequencies. This transformation was defined by Fourier and is referred to as the Fourier transform [14]. The transformation generates a frequency function X(f) from a given function x(t). The Fourier transform has various theoretical and practical applications, including digital signal and image processing. The Fourier series is given by the sum of individual harmonic sine and cosine functions as follows:

x(t) = a_0/2 + Σ_{j=1}^{∞} [ a_j·cos(2πjt/T) + b_j·sin(2πjt/T) ]

where T is the period. At the same time the following relation is valid:

f = 1/T
where f is the frequency. This relation is known as the Thompson term. The Fourier coefficients a_j, b_j are given by specific definite integrals. With mathematical adjustments we can obtain a more convenient form of expression as a mathematical series:

x(t) = Σ_{j=−∞}^{∞} x_j · e^{2πijt/T}

where x_j is the j-th Fourier coefficient in complex form and i = √−1.
The discrete Fourier transform
Due to the important role of the Fourier transform in scientific and
technical computations there has been great interest in implementing
DFT on parallel computers and on studying its performance.
Therefore, this chapter describes DFT parallel algorithms, which can
be solved on all types of parallel computer (supercomputers, NOW,
Grid). Although the use of NOW and Grid parallel computers
should be less effective than using massive parallel computers
(supercomputers based on hypercube architecture), we are looking
for effective parallel algorithms on NOW and Grid, as nowadays
they are the dominant parallel computers.
The discrete Fourier transform (DFT) has played an important role in the evolution of digital signal processing techniques. It has opened up signal processing techniques in the frequency domain which are not easily realisable in the analogue domain. The DFT is a linear transformation that maps n regularly sampled points from a cycle of a periodic signal, like a sine wave, onto an equal number of points representing the frequency spectrum of the signal. The discrete Fourier transform (DFT) is defined as [11, 24]:

Y_k = (1/N) · Σ_{j=0}^{N−1} X_j · e^{−2πijk/N}

and the inverse discrete Fourier transform (IDFT) as:

X_k = Σ_{j=0}^{N−1} Y_j · e^{2πijk/N}
for 0 ≤ k ≤ N − 1. For N real input values X_0, X_1, X_2, …, X_{N−1} the transform generates N complex values Y_0, Y_1, Y_2, …, Y_{N−1}. If we use w = e^{−2πi/N}, that is, w being an N-th root of unity in the complex plane, we get:

Y_k = (1/N) · Σ_{j=0}^{N−1} X_j · w^{jk}

and in inverse form:

X_k = Σ_{j=0}^{N−1} Y_j · w^{−jk}
The variable w is a basic part of DFT computations and is known as the twiddle factor. The transformation equations defined above are in principle linear transformations.
A direct computation of the DFT or the IDFT requires N² complex arithmetic operations. For example, the time required just for the complex multiplications in a 1024-point DFT is T_mult = 1024² · 4 · T_real, where we assume that one complex multiplication corresponds to four real multiplications and the time required for one real multiplication (T_real) is known for the given computer. With this approach, however, we take into account only the computation times, and not the overhead delays connected with an implementation using a parallel method.
The discrete fast Fourier transform
The discrete fast Fourier transform (DFFT) is a fast method of DFT computation with a time complexity of O((N/2)·log₂ N), in comparison to the O(N²) complexity of the sequential DFT algorithm. For a quick computation of the DFT the Cooley–Tukey adjustment is used [24]. To arrive at the final adjustment we start with a modified form of the DFT:

Y_k = (1/N) · Σ_{j=0}^{N−1} X_j · w^{jk}
In general, the required calculation is divided into two parts using the divide-and-conquer decomposition strategy. We can describe its principle with a modification of the original calculation into the two following parts:

Y_k = (1/N) · [ Σ_{j=0}^{N/2−1} X_{2j} · w^{2jk} + Σ_{j=0}^{N/2−1} X_{2j+1} · w^{(2j+1)k} ]

where the first part of the calculation contains the values with even indices and the second part those with odd indices. In this sense we get:

Y_k = (1/2) · [ (1/(N/2)) · Σ_{j=0}^{N/2−1} X_{2j} · w^{2jk} + w^k · (1/(N/2)) · Σ_{j=0}^{N/2−1} X_{2j+1} · w^{2jk} ]

or:

Y_k = (1/2) · [ (1/(N/2)) · Σ_{j=0}^{N/2−1} X_{2j} · e^{−2πijk/(N/2)} + w^k · (1/(N/2)) · Σ_{j=0}^{N/2−1} X_{2j+1} · e^{−2πijk/(N/2)} ]
Every part of this calculation is a DFT of the N/2 values with even indices and of the N/2 values with odd indices, respectively. Then we can formally write Y_k = (1/2)·(Y_even + w^k·Y_odd) for k = 0, 1, 2, …, N − 1, whereby Y_even is the N/2-point DFT of the values with even indices X_0, X_2, X_4, … and Y_odd is the N/2-point transformation of the values with odd indices X_1, X_3, X_5, ….
Suppose that k is at first limited to the values 0, 1, …, N/2 − 1, that is, to N/2 values out of the entire number of N values. The whole series can then be divided into the two following parts:

Y_k = (1/2) · (Y_even + w^k · Y_odd)

and:

Y_{k+N/2} = (1/2) · (Y_even + w^{k+N/2} · Y_odd) = (1/2) · (Y_even − w^k · Y_odd)

because w^{k+N/2} = −w^k for 0 ≤ k < N/2. In this way we can compute Y_k and Y_{k+N/2} in a parallel way using two N/2-point transformations, according to the illustration in Fig. 11.1.
Each N/2-point DFT can again be divided into further parts, that is, into two N/4-point DFTs. This decomposition strategy can be continued until the possibility of dividing the given N is exhausted (one-point values). The dividing factor is known as the radix q; radices higher than two are also used.
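A compact recursive C sketch of this radix-2 divide-and-conquer scheme is shown below (our illustration; it uses the unscaled convention Y_k = Σ X_j·e^{−2πijk/N}, so the 1/N factor of the definition above would have to be applied separately, and N is assumed to be a power of two).

/* fft_radix2.c - recursive radix-2 Cooley-Tukey DFFT sketch */
#include <stdio.h>
#include <math.h>
#include <complex.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* in:  x[0..n-1] input samples, read with stride s at this recursion level
   out: y[0..n-1] transform values                                          */
static void fft(const double complex *x, double complex *y, int n, int s)
{
    if (n == 1) { y[0] = x[0]; return; }

    fft(x,     y,         n / 2, 2 * s);   /* even-indexed part */
    fft(x + s, y + n / 2, n / 2, 2 * s);   /* odd-indexed part  */

    for (int k = 0; k < n / 2; k++) {
        double complex w = cexp(-2.0 * M_PI * I * k / n);  /* twiddle factor */
        double complex even = y[k], odd = w * y[k + n / 2];
        y[k]         = even + odd;          /* Y_k       */
        y[k + n / 2] = even - odd;          /* Y_{k+N/2} */
    }
}

int main(void)
{
    enum { N = 8 };
    double complex x[N], y[N];
    for (int j = 0; j < N; j++)
        x[j] = cos(2.0 * M_PI * j / N);     /* one cosine period as test input */

    fft(x, y, N, 1);
    for (int k = 0; k < N; k++)
        printf("Y[%d] = %6.3f %+6.3fi\n", k, creal(y[k]), cimag(y[k]));
    return 0;
}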
Figure 11.1 Illustration of the divide-and-conquer strategy for DFFT.
The difference in execution time between the direct implementation of the DFT and the derived DFFT algorithm is significant for large values of N. The direct calculation of the DFT or IDFT according to the following program requires N² complex arithmetic operations:
Program Direct_DFT;
{ Direct O(N^2) computation of the DFT. A complex data type, the constants
  N and Nminus1 = N-1 and the twiddle factor W = e^(-2*pi*i/N) are assumed
  to be declared elsewhere; the 1/N scaling factor is omitted here. }
var
  x, Y : array[0..Nminus1] of complex;
  k, n : integer;
begin
  for k := 0 to N-1 do
  begin
    Y[k] := x[0];
    for n := 1 to N-1 do
      Y[k] := Y[k] + W^(n*k) * x[n];   { W^(n*k) denotes the n*k-th power of W }
  end;
end.
The difference in execution time between a direct computation of the DFT and the new DFFT algorithm is very high for a large N. For example, the time required for the complex multiplications in a 1024-point FFT is T_mult = 0,5·N·log₂(N)·4·T_real = 0,5·1024·log₂(1024)·4·T_real, where a complex multiplication corresponds approximately to four real multiplications. The principle of the Cooley–Tukey algorithm, which uses a divide-and-conquer strategy, is shown in Fig. 11.1.
Several variations of the Cooley–Tukey algorithm have been derived. These algorithms are collectively referred to as DFFT (discrete fast Fourier transform) algorithms. The basic characteristics of a parallel DFFT are its one-dimensionality (1-D), its unordered output and its use of radix-2 algorithms (using the divide-and-conquer strategy according to the principle in Fig. 11.1). Effective DFFT parallel computations tend to compute one-dimensional FFTs with a radix greater than two, and to compute multi-dimensional FFTs by using polynomial transfer methods. In general, a radix-q DFFT is computed by splitting the input sequence of size s into q sequences of size s/q each, computing the q smaller DFFTs, and then combining the results. For example, in a radix-4 DFFT each step computes four outputs from four inputs, and the total number of iterations is log₄ s rather than log₂ s. The input length should, of course, be a power of four. Parallel formulations of higher-radix strategies (e.g. radix 3 and 5) in 1-D or multi-dimensional DFFTs are similar to the basic form, because the underlying ideas behind all sequential DFFTs are the same.
An ordered DFFT is obtained by performing bit reversal (permutation)
on the output sequence of an unordered DFFT. Bit reversal does not
affect the overall complexity of a parallel implementation.
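A small C sketch of such a bit-reversal permutation for r-bit indices (our illustration, with r fixed to 4) is shown below.

/* bitrev.c - bit-reversal permutation of the indices 0 .. 2^R - 1 */
#include <stdio.h>

#define R 4                     /* number of index bits */
#define N (1 << R)              /* sequence length      */

static unsigned reverse_bits(unsigned k)
{
    unsigned rev = 0;
    for (int b = 0; b < R; b++)             /* mirror the R index bits */
        rev = (rev << 1) | ((k >> b) & 1u);
    return rev;
}

int main(void)
{
    /* an unordered DFFT output at position k belongs at position reverse_bits(k) */
    for (unsigned k = 0; k < N; k++)
        printf("%2u -> %2u\n", k, reverse_bits(k));
    return 0;
}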
The dominant computation complexity comes from the multiplication of complex numbers; different variations of the Cooley–Tukey algorithm were derived to exploit this [10, 11, 24, 50]. The basic shape for a parallel DFFT implementation is the one-dimensional (1-D), unordered, radix-2 strategy according to Fig. 11.1 (the computation is subdivided into two independent parts).
Two-dimensional DFFTs
The processing of images and signals often requires the implementation of a multi-dimensional discrete fast Fourier transform (DFFT). The simplest method of computing a two-dimensional DFFT (2-D DFFT) is to compute a one-dimensional DFFT (1-D DFFT) on each row, followed by the computation of a one-dimensional DFFT on each column. This is illustrated in Fig. 11.2:

Figure 11.2 A two-dimensional DFFT (row transformation X_jk → X_jm followed by column transformation X_jm → X_lm).
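The row–column scheme can be sketched sequentially as follows (our illustration in C; a direct O(N²) 1-D DFT is used only to keep the example short, and in practice the radix-2 DFFT routine would be substituted).

/* fft2d_rowcol.c - 2-D DFT via 1-D transforms on the rows, then on the columns */
#include <stdio.h>
#include <math.h>
#include <complex.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define N 4

static void dft_1d(double complex *v)        /* in-place 1-D DFT of length N */
{
    double complex t[N];
    for (int k = 0; k < N; k++) {
        t[k] = 0;
        for (int j = 0; j < N; j++)
            t[k] += v[j] * cexp(-2.0 * M_PI * I * j * k / N);
    }
    for (int k = 0; k < N; k++) v[k] = t[k];
}

int main(void)
{
    double complex X[N][N], col[N];

    for (int i = 0; i < N; i++)              /* some test data */
        for (int j = 0; j < N; j++)
            X[i][j] = i + j;

    for (int i = 0; i < N; i++)              /* 1-D DFT on every row    */
        dft_1d(X[i]);

    for (int j = 0; j < N; j++) {            /* 1-D DFT on every column */
        for (int i = 0; i < N; i++) col[i] = X[i][j];
        dft_1d(col);
        for (int i = 0; i < N; i++) X[i][j] = col[i];
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("(%5.1f,%5.1f) ", creal(X[i][j]), cimag(X[i][j]));
        printf("\n");
    }
    return 0;
}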
We can also take advantage of vector processors and their vector functions by storing whole rows or columns in a given computing node. This means that we can divide a given matrix by rows or by columns. Experience has shown that row distribution is more suitable with regard to the necessary communications. For example, if we consider a 32 × 32 matrix and four nodal vector processors, each node gets an 8 × 32 portion of the matrix, so each node performs the DFFT on eight rows of length 32 in parallel with the three other computing nodes, as shown in Fig. 11.3.
After the computation of the 1-D DFFTs on the rows (the first matrix dimension) it is necessary to rearrange the matrix into the correct shape to allow the further 1-D DFFT computations on the columns (the second dimension). During this rearrangement every computing node has to send the corresponding 8 × 8 sub-matrices to the remaining computing nodes. When the entire matrix is divided by rows, these sub-square matrices are stored sequentially in memory and can be sent directly, without copying them, through the caches. Finally, each of these square matrix parts has to be locally reorganised in the given computing node. Fig. 11.4 illustrates the distribution process of the sub-square matrices among the computing nodes.
Figure 11.3 The distribution of matrix rows into computing nodes (eight rows per computing node for computing nodes 0–3).
Analysed examples
One element per processor
This is the simplest example of a complexity evaluation of the DFFT. In this case we consider p = s processors (a d-dimensional hypercube architecture) to compute an s-point DFFT. A hypercube is a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors, and each processor is directly connected to d other processors.
In this case we can simply derive that T(s, 1) = s·log s and T(s, p) = log s. The speed-up factor is then S(s, p) = p and the system efficiency E(s, p) = 1. This formulation of a DFFT algorithm for a d-dimensional hypercube is cost-optimal, but for higher values of s the use of p = s processors is only hypothetical.
Multiple elements per processor
This is a very real case of practical DFFT parallel computation. In this example we examine the implementation of a binary exchange algorithm to compute an s-point DFFT on a hypercube with p processors, where s > p. Assume that both s and p are powers of two. As shown in Fig. 11.5, we partition the sequence into blocks of s/p contiguous elements and assign one block to each processor. Assume that the hypercube is d-dimensional (p = 2^d) and that s = 2^r.

Figure 11.4 Data exchange between computing nodes.
Figure 11.5 A 16-point DFFT on four processors (input elements X_0–X_15 with their 4-bit indices, outputs Y_0–Y_15, processors P_0–P_3).
Fig. 11.5 shows that elements with indices differing in their most significant d (= 2) bits are mapped onto different processors. However, all elements with indices having the same most significant r − d bits are mapped onto the same processor. Hence, this parallel DFFT algorithm performs inter-processor communication only during the first d = log p of the log s iterations; there is no communication during the remaining r − d iterations.
Each communication operation exchanges s/p words of data. Since all communications take place between directly connected processors, the total communication time does not depend on the type of routing. Thus, the time spent in communication in the DFFT algorithm is t_s·log p + t_w·(s/p)·log p, where t_s is the message start-up time and t_w is the per-word transfer time. These times are known for a concrete parallel system. If a complex multiplication and addition pair takes time t_c, then the parallel run-time T(s, p) for an s-point DFFT on a p-processor hypercube is:
T(s, p) = t_c·(s/p)·log s + t_s·log p + t_w·(s/p)·log p

The required expressions for the complex speed-up S(s, p), the complex efficiency E(s, p) and the defined constant C (part of the isoefficiency function) are given by the following equations:

S(s, p) = p·s·log s / [ s·log s + (t_s/t_c)·p·log p + (t_w/t_c)·s·log p ]

E(s, p) = 1 / { 1 + (log p/log s)·[ (t_s·p)/(t_c·s) + t_w/t_c ] }

C = (1 − E(s, p))/E(s, p) = (t_s·p·log p)/(t_c·s·log s) + (t_w·log p)/(t_c·log s)
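For orientation, the following small C program (our illustration) evaluates T(s, p), S(s, p) and E(s, p) from the expressions above; the parameter values are assumptions corresponding to the example used later for Fig. 11.8 and interpreted here as microseconds.

/* dfft_model.c - evaluate the binary-exchange DFFT performance model */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double tc = 2.0, tw = 4.0, ts = 25.0;   /* assumed unit times [us]     */
    double s = 65536.0, p = 512.0;          /* problem size and processors */

    double T1 = tc * s * log2(s);                         /* T(s,1)     */
    double Tp = tc * (s / p) * log2(s)                    /* T(s,p)     */
              + ts * log2(p) + tw * (s / p) * log2(p);
    double S  = T1 / Tp;                                  /* speed-up   */
    double E  = S / p;                                    /* efficiency */

    printf("T(s,p) = %.1f us   S(s,p) = %.2f   E(s,p) = %.3f\n", Tp, S, E);
    return 0;
}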
Multiple elements per processor with routing
This is the most complicated, but very real, case in parallel
computing. It is typical for parallel architectures, in which the
processors do not have enough directly connected processors to
compute a given parallel algorithm. Then the communication of
indirectly connected processors or computers is realised through a
number (hop) of other processors or communication switches
according to a routing algorithm.
The time for sending a message of size m between processors that are l_h hops apart is given by t_s + t_h·l_h, where t_s is the message start-up time and t_h is the overhead time for one hop. The values t_s, t_h and l_h depend on the architecture of the parallel system (mainly its interconnection network) and on the routing strategy. If we can define these values for a concrete parallel system and routing strategy, the cost–performance trade-offs can be analysed in a similar way to the previous cases.
Multiple elements per processor in computer networks
The currently dominant parallel systems are symmetrical multiprocessors on motherboards (for example the latest multiprocessors and multicore processors from Intel) and computer networks based on desktop computers (NOW, Grid). In the second case the communication overheads depend on the topology of the computer network and on its transmission parameters (transmission speed, bandwidth, transmission control etc.). To analyse the behaviour of the communication interfaces and their parameters we have performed a series of simulation experiments on a network of workstations (NOW) computing DFFTs. From these experiments it became obvious that for the effective parallel computation of DFFTs a massively parallel system (the hypercube architecture of today's supercomputers) is preferable to computer networks (NOW, Grid).
Chosen illustration results
The performance results from the NOW parallel computer based on Ethernet are shown as a graph for the 2DFFT in Fig. 11.6. These results for the 2DFFT parallel algorithm demonstrate an increase of both the computation and the communication parts in a geometric way, with a quotient of nearly four for the analysed matrix dimensions, since doubling the matrix dimension means computing twice as many rows and columns, each of twice the length.
In Fig. 11.7 we have illustrated the parallel speed-up S(s, p) of a 1-D DFFT parallel algorithm with binary data exchange for the defined workload s = 65 536 (s = n²/p) as a function of the number of computing nodes p. The character of S(s, p) is sub-linear, that is, always less than the illustrated curve p (ideal speed-up), as a consequence of the overhead latencies of the real parallel computers in use (communication, synchronisation, parallelisation etc.).
Figure 11.7 An illustration of S(s, p) as a function of the number of computing nodes p (s = 65 536); the ideal speed-up p is shown for comparison.

Figure 11.6 Results in NOW for T(s, p) of a 2DFFT (Ethernet network). The measured execution times [ms] were:

Matrix size    WS1      WS3      WS4      WS5
64 × 64        327      150      141      133
128 × 128      1067     527      452      415
256 × 256      5759     2924     2912     2891
512 × 512      17215    8667     8624     8614
1024 × 1024    68841    34051    33931    33768
A scalable parallel system is one in which the efficiency can be kept fixed as the number of processors increases, provided that the problem size also increases. The scalability of an algorithm–architecture combination determines its capacity to use an increased number of processors effectively. We consider the Cooley–Tukey algorithm for a one-dimensional s-point DFFT and ask how the problem size must grow to maintain the same efficiency. Fig. 11.8 illustrates the efficiency of the binary exchange DFFT parallel algorithm as a function of s on a 512-processor hypercube parallel architecture with t_c = 2 µs, t_w = 4 µs and t_s = 25 µs. The threshold point is given as t_c/(t_c + t_w) = 0,33. The efficiency initially increases rapidly with the problem size up to this threshold, but then the efficiency curve flattens out beyond it.
We can say that the use of a transpose algorithm has much higher overheads than a binary exchange algorithm due to the message start-up time t_s, but lower overheads due to the per-word transfer time t_w. As a result, either of the two algorithm formulations may be faster, depending on the relative values of t_s and t_w. In principle, supercomputers and other architectures with a common memory have a very low t_s in comparison to a typical NOW or Grid.
Figure 11.8 DFFT efficiency E(s, p) of a binary exchange parallel algorithm (p = 512); the threshold value 0,33 is marked.

Fig. 11.9 illustrates the isoefficiency function of a 1-D DFFT parallel algorithm. For the lower values of E(s, p) (0,1; 0,2; 0,3; 0,4; 0,45), up to the threshold (E = 0,33), we have used the following expression, derived in the theoretical analysis above:

w(s) = s·log s = C·(t_s/t_c)·p·log p

The values of the isoefficiency function above the threshold were computed using the following relation:

w(s) = C·(t_w/t_c)·p^(C·t_w/t_c)·log p
Figure 11.9 Isoefficiency function w(s) of a 1-D DFFT (E = 0,1; 0,2; 0,3; 0,4; 0,45).

Fig. 11.10 illustrates the isoefficiency functions of a 1-D DFFT on a hypercube parallel computer for the values E(s, p) = 0,5 and 0,55. From the curves illustrated in Fig. 11.10 we can see the steep growth of the isoefficiency function predicted in the theoretical part of this section; this steep growth of the scalability function for the analysed 1-D DFFT parallel algorithm with binary data exchange sets in beyond the threshold.

Figure 11.10 Isoefficiency function w(s) of a 1-D DFFT (E = 0,5; 0,55).

Figure 11.11 The influence of parallel computer architecture on scalability (E = 0,4; hypercube versus mesh).
Fig. 11.11 illustrates the influence of the parallel computer architecture on algorithm scalability for the given efficiency E(s, p) = 0,4, in accordance with the conclusions derived in the theoretical part of this section. For both parallel computer architectures we have used the 1-D DFFT parallel algorithm with binary data exchange. We can see that this parallel algorithm has essentially better scalability on the hypercube parallel architecture than on a mesh parallel computer. This reflects the better communication network architecture of the hypercube for parallel solutions of the 1-D DFFT with binary data exchange (lower communication latency). This verified result documents the important role of the parallel computer architecture in the overall performance of any given parallel algorithm.
12
The Triangle Problem
The manager/worker strategy
This method uses one parallel process as a control process (the manager), which generates the necessary tasks assigned to the individual worker nodes (Fig. 12.1). The manager process controls the progress of the parallel computation according to the gradual fulfilment of the tasks allocated to the individual computing nodes (workers). This method is suitable if the complex problem contains no static data distribution or if no fixed number of computation steps is known in advance. In these tasks it is therefore necessary to focus on the aspects of controlling the different parts of the given application problem.
Combinatorial problems
Combinatorial problems are also typical application examples for
which the manager/worker decomposition strategy could provide an
effective solution. Therefore, as an example of a combinatorial
problem we have presented the game problem on a triangular board
with 15 holes, as illustrated in Fig. 12.2 (the triangle problem).
At the start of the game there are pins in all of the holes except one hole in the centre. The principle of the game lies in a series of moves in which a pin jumps over an adjacent pin into an empty hole on the other side; the pin which has been jumped over is removed. The aim of the game is to have only one pin left on the playing board. The computational task is to find all the possible correct solutions, starting from the given position of the uncovered hole.

Figure 12.1 An illustration of the manager/worker strategy (a main parallel process with a queue of parallel processes and a queue of results, serving several workers).
Figure 12.2 The complex combinatorial game problem.

The manager/worker approach is very suitable for solving this kind of problem. Because it is not possible to predict the computation time for any defined route, a simple static distribution of the routes among the computing nodes would be likely to result in an unbalanced workload, in which some computing nodes would quickly become idle while others would become overloaded.
Sequential algorithms
The initial state of the playing board is determined by the designated hole without a pin (by default the central hole, as shown in Fig. 12.2). A sequential procedure for solving the triangle problem uses a depth-first search which follows a correct route up to the point when there are no more possible moves, either because it has found a correct solution or because the sequence cannot be continued with any correct move. If the search gets into the latter state it becomes necessary to backtrack until some branch allows another correct route. We can summarise the described algorithm as follows:
Initialise the game board
call depth_first_search(1)        /* search the sequence of possible solutions */
end;

depth_first_search(depth)
{
  if depth = 14
  {
    store the found solution
  }
  else
  {
    for each possible move in the given state of the game board
    {
      change the board state to reflect the move
      store the move in the move sequence
      call depth_first_search(depth + 1)
      /* when no further move is possible, remove the last move
         from the sequence and restore the board state */
    }
  }
  return
}
Parallel algorithms
The parallel algorithm requires the development of one main manager parallel process (the manager) and one common worker process for all workers. The manager expands the highest levels of the search tree to a specified fixed number of moves (a fixed depth, for example up to four) in order to create a sufficient number of individual parallel processes (workers) to complete the search of the allocated parts of the search tree. The manager then allocates the accumulated moves to the individual computing nodes (workers) of the given parallel computer. Each worker then searches for solutions in all of its allocated search tree branches. We have analysed the ratio of the number of expanded branches to the number of available computing workers. In general this ratio should be in the range of four to ten, which means that there should be between four and ten times more unsolved subtrees than workers. The higher the number of unsolved subtrees, the greater the likelihood of good load balancing, although the tasks should not be made too small, otherwise the necessary inter-process communication uses more time than the actual computation.
Because the analysed problem generates a relatively small total number of moves (solution sequences of at most 14 moves), only one manager process is needed to create the moves for the search tree at the initialisation of the problem. However, in larger, similar combinatorial problems, and when using parallel computers with a massive number of computing nodes, it could be inefficient for only a single manager process to generate the sufficient number of moves, as the single manager process could act as a bottleneck. For these types of combinatorial problem, the defined central manager process could allocate search tree vertices to further manager workers and let them expand the next level of the search tree, whereby the results achieved are sent back to the central manager process. In this way the central manager process only controls the integrity of the search, including controlling the desired number of manager workers so as to achieve an effective ratio between the number of unsolved moves and the number of worker processes used.
For this problem a worker simply accepts the allocated partial task and finds all the possible correct solutions within it. The manager has to expand the first levels, keep records of the generated but not yet expanded subtrees which are to be sent to the workers, send tasks to the workers according to their availability, and receive the computed partial results. Fig. 12.3 illustrates the flowcharts of the manager and worker algorithms for a parallel solution on any given parallel computer.
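A minimal manager/worker skeleton in C with MPI is sketched below (our illustration, not the authors' implementation): the manager hands out task identifiers on demand and collects the results, while each worker would call a search routine for its allocated subtree, replaced here by the dummy function solve_subtree.

/* manager_worker.c - minimal manager/worker skeleton with MPI */
#include <stdio.h>
#include <mpi.h>

#define NTASKS 40                    /* e.g. expanded subtrees of the game */
#define TAG_TASK 1
#define TAG_RESULT 2
#define STOP (-1)

static int solve_subtree(int task)   /* dummy worker computation           */
{
    return task % 3;                 /* would return the solutions found   */
}

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* manager process   */
        int next = 0, total = 0, active = 0, result;
        MPI_Status st;
        for (int w = 1; w < size; w++) {              /* seed every worker */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                int stop = STOP;
                MPI_Send(&stop, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {                      /* send further work */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
                next++;
            } else {                                  /* or stop the worker */
                int stop = STOP;
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total result of all subtrees: %d\n", total);
    } else {                                          /* worker process    */
        int task, result;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task == STOP) break;
            result = solve_subtree(task);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}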
Figure 12.3 Flowcharts of the manager and worker processes (the manager initialises the game, searches to a depth of less than four, stores the generated routes, sends a route whenever a node is available and collects the result messages; a worker receives a message, executes the moves, searches its subtree and returns a result message to the manager).

Performance optimisation
It is possible to use multiple variants of performance optimisation with the manager/worker decomposition strategy (server/clients). One possibility is to use a double-buffering mechanism for messages, which means that a worker initially receives two parallel processes to perform.
processes is the same. Only the supported system program equipment
registers the input and waiting-time of both arrivals and tasks. The
result of this mechanism is that as soon as a worker has completed
the first allocated parallel process there is immediately another one
available to perform. There is therefore no waiting time for the
allocation of further parallel process from the manager process. If it
is possible to balance the communication and computation loads,
the results of the first parallel process should be sent to the manager
process, which can now, as a response, send the next parallel process
before the worker has completed its second parallel process.
From this point of view, therefore, in a workers queue the next
parallel process from the manager process is always waiting. Another
question could be that the manager process could become a
bottleneck due to the extensive processing of the global data structure
and of the listed activities performed by the workers. Our experiments
have shown that with up to 50 workers with balanced ratio
proportions between the communication load and the computation
load, the manager process rarely acts as a bottleneck. In fact, since
the input load is dependent on the nature of the problem, it is
unlikely that two workers would communicate with the manager
process at the same time. For this reason, various parallel processes
are actually an advantage for the manager/worker decomposition
strategy. If individual parallel processes require the same input load,
then there are real conditions leading to a bottleneck, and in this way
it could cause more problems than the expected benefits. However,
for thousands of computing nodes (managers, workers) in future
massive parallel computers, using only one manager process would
certainly create bottlenecks. A logical solution to this is by using
another manager process level. This means multi-level hierarchical
control via manager processes, which are assigned a certain number
of workers at their last level. A typical example of this procedure is
the approach used in some chess algorithms.
One final idea is related to the possible implementation of the manager and worker functions on the same computing nodes (processors) in the form of multitasking. If the manager is idle for most of its computing time, multitasking could allow this idle time to be used. Such an approach should be applied only after careful reflection, because if any worker process runs at the expense of manager computing time, incoming data messages to the manager could suffer increased delays in the responses to the workers' activities, which could result in a reduction of the entire parallel algorithm's performance.
13
The System of Linear Equations
Let us consider a system of n linear equations (SLR) with n variables x_1, x_2, x_3, …, x_n in the form:

a_11·x_1 + a_12·x_2 + … + a_1n·x_n = a_1,n+1
a_21·x_1 + a_22·x_2 + … + a_2n·x_n = a_2,n+1
  ⋮
a_n1·x_1 + a_n2·x_2 + … + a_nn·x_n = a_n,n+1

where the coefficients a_ij, i = 1, 2, …, n, j = 1, 2, …, n+1 are real constants.
If we apply the following matrix definitions:

A = [ a_11  a_12  a_13  …  a_1n ]     B = [ a_1,n+1 ]     X = [ x_1 ]
    [ a_21  a_22  a_23  …  a_2n ]         [ a_2,n+1 ]         [ x_2 ]
    [   ⋮                       ]         [    ⋮    ]         [  ⋮  ]
    [ a_n1  a_n2  a_n3  …  a_nn ]         [ a_n,n+1 ]         [ x_n ]

then the system of linear equations can be written in matrix form as:

A·X = B
where A is the square matrix of coefficients, B is the right-hand side vector and X is the vector of unknowns. The conditions for the solvability of systems of linear equations are as follows [31]:
A system is solvable if, and only if, the rank of the coefficient matrix A is equal to the rank of the matrix extended by the right-hand side vector B.
If the rank of matrix A is equal to n, then the system has exactly one solution; otherwise, if the determinant |A| = 0, the SLR has infinitely many solutions, provided the previous condition is fulfilled.
Methods of solving SLR
There is no known universally optimal method for solving systems of linear equations. There are several different approaches, and the fulfilment of each method's assumptions determines whether it can be chosen as the solution method. In principle, the existing methods can be divided into:
Exact (finite):
Cramer's rule.
Gaussian elimination method (GEM).
GEM alternatives.
Iterative.
Cramer's rule
The procedure for solving the SLR has the form x_i = det A_i / det A,
i = 1, 2, ..., n, whereby det A_i is the sub-determinant of the matrix
which arises from matrix A by replacing its i-th column with the
column vector of the right sides. This at first seems a simple
computation, but it has its pitfalls, one of which is the growing number
of required sub-determinant computations det A_i. If the number of
solution variables n is very large (in practice we need large matrices),
the number of computations necessary for the various sub-determinants
det A_i grows exponentially. Even the possibility of computing the
sub-determinants in a parallel way does not change this fact.
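To make the cost growth concrete, the following is a minimal C sketch (the 3 x 3 test system and all names are hypothetical, not taken from the book) of Cramer's rule with the determinants computed by recursive cofactor expansion; this expansion alone already needs on the order of n! multiplications, which is why the rule is unusable for large n even in parallel.

#include <stdio.h>
#include <math.h>

#define N 3                               /* illustrative size only */

/* determinant by recursive cofactor expansion along the first row */
double det(int n, double a[N][N]) {
    if (n == 1) return a[0][0];
    double sum = 0.0, minor[N][N];
    for (int col = 0; col < n; col++) {
        for (int i = 1; i < n; i++) {     /* build the (n-1) x (n-1) minor */
            int mj = 0;
            for (int j = 0; j < n; j++)
                if (j != col) minor[i-1][mj++] = a[i][j];
        }
        sum += (col % 2 ? -1.0 : 1.0) * a[0][col] * det(n - 1, minor);
    }
    return sum;
}

/* Cramer's rule: x_i = det(A_i) / det(A), A_i = A with column i replaced by b */
int cramer(double a[N][N], double b[N], double x[N]) {
    double d = det(N, a);
    if (fabs(d) < 1e-12) return -1;       /* singular (or nearly singular) system */
    for (int i = 0; i < N; i++) {
        double ai[N][N];
        for (int r = 0; r < N; r++)
            for (int c = 0; c < N; c++)
                ai[r][c] = (c == i) ? b[r] : a[r][c];
        x[i] = det(N, ai) / d;
    }
    return 0;
}

int main(void) {
    double a[N][N] = {{2,1,1},{1,3,2},{1,0,0}}, b[N] = {4,5,6}, x[N];
    if (cramer(a, b, x) == 0)
        for (int i = 0; i < N; i++) printf("x_%d = %g\n", i + 1, x[i]);
    return 0;
}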
The Gaussian elimination method
One of the most-used parallel algorithms for solving systems of linear
equations is the Gaussian elimination method (GEM), which is based
on the decomposition of a square matrix into a regular upper triangular
matrix and a lower triangular matrix. The regular nature of the PA
contains a hidden computational domain (functional decomposition).
The supposed domain distribution requires only the division of the
whole matrix into its individual parts (rows, columns, blocks), and
then the allocation of these decomposed matrix parts to the individual
computing nodes of the parallel computer in use. This matrix
distribution (data decomposition model) can be achieved by adjusting
a developed sequential algorithm. A matrix parallel algorithm based
on these principles therefore solves any given matrix PA in a parallel
way on the different computing nodes of a parallel computer. It then
also becomes possible to solve much larger matrices than with
sequential algorithms.
The sequential algorithm GEM
The sequential algorithm known as GEM contains a triple nested
loop. The outer loop controls the gradual matrix decomposition. At
each repetition, the remaining part of the matrix to be decomposed
shrinks to the square sub-matrix in the bottom right part of the
original matrix. A basic characteristic of the iterative steps is that
after performing any step it is necessary to send the modified data
from the computing node to all the other computing nodes of the
parallel computer in use. The sequential algorithm procedure is
characterised by the following steps:
Select the leading matrix element (pivot) from the first column of
the sub-matrix.
Exchange the row containing the leading element with the first row.
Divide the column containing the leading element by this element.
Subtract the adjusted column, multiplied by the corresponding
element of the leading row, from each of the other columns.
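A minimal sequential C sketch of these elimination steps, assuming a row-oriented formulation with partial pivoting and the right-hand side stored as column n of an augmented matrix (the function name and layout are illustrative assumptions, not the book's implementation):

#include <math.h>

/* Gaussian elimination with partial pivoting on an augmented n x (n+1) matrix,
   followed by back substitution; returns 0 on success, -1 if a pivot is ~0. */
int gem(int n, double a[n][n + 1], double x[n]) {
    for (int k = 0; k < n; k++) {
        int piv = k;                            /* select the pivot in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i][k]) > fabs(a[piv][k])) piv = i;
        if (fabs(a[piv][k]) < 1e-12) return -1;
        for (int j = k; j <= n; j++) {          /* exchange the pivot row with row k */
            double t = a[k][j]; a[k][j] = a[piv][j]; a[piv][j] = t;
        }
        for (int i = k + 1; i < n; i++) {       /* eliminate column k below the pivot */
            double m = a[i][k] / a[k][k];
            for (int j = k; j <= n; j++)
                a[i][j] -= m * a[k][j];
        }
    }
    for (int i = n - 1; i >= 0; i--) {          /* back substitution */
        x[i] = a[i][n];
        for (int j = i + 1; j < n; j++)
            x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
    return 0;
}

The triple nested loop over k, i and j is the place where the decomposition strategies described below (rows, columns or blocks of the matrix) can be applied.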
Decomposition matrix strategies
Any matrix is a regular data structure (domain), for which we can
create parallel processes by dividing the matrix into strips (rows or
columns). When generating matrix parallel processes it is necessary
to allocate the domain's parts in such a way that every computing
node has roughly the same number of strips (rows or columns). If
the total number of strips is divisible by the number of processors
without a remainder, then all the computing nodes have the same
number of decomposed parts (load balance); otherwise some
computing nodes have some extra strips.
The matrix allocation methods for the decomposed strips (rows or
columns) for GEM are as follows:
Allocation of block strips (rows or columns).
Gradual (cyclic) allocation of strips (rows or columns).
In the first allocation method, the strips are divided into sets of
consecutive strips (rows or columns), and one block is assigned to every
computing node. An illustration of these allocation methods is in Fig. 13.1.
In the second method, the gradual allocation of columns to individual
computing nodes is like gradually dealing out cards to a game's
participants. An illustration of this gradual assignment of columns is in Fig. 13.2.
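The difference between the two allocation methods can be captured by two simple index mappings; the following C sketch is only an illustration (helper names are hypothetical), assuming ncols columns numbered from 0 and p computing nodes numbered from 0:

/* block allocation: consecutive groups of columns are assigned to one node */
int block_owner(int col, int ncols, int p) {
    int block = (ncols + p - 1) / p;    /* columns per node, rounded up */
    return col / block;
}

/* gradual (cyclic) allocation: columns are dealt out like cards to the nodes */
int cyclic_owner(int col, int p) {
    return col % p;
}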
The evaluation of GEM
The Gaussian elimination method GEM (or the Gauss elimination
method) is also known as the LU decomposition method [28, 31].
GEM, together with alternatives such as pivoting variants,
Gauss-Jordan elimination, Cholesky decomposition and so on [16],
belongs to the preferred exact methods of solving systems of linear
equations (SLR). Implementations of these methods are available as
developed libraries for both sequential and parallel computation.
Among the standard libraries are BLAS (Basic Linear Algebra
Subprograms) and the upgraded packages LINPACK (Linear Package),
LAPACK (Linear Algebra Package), ScaLAPACK (Scalable LAPACK)
and PBLACS (Parallel Basic Linear Algebra Communication
Subprograms). These packages are also used for measuring and
comparing the performance of computers (sequential, parallel).

Figure 13.1 Matching of data blocks.
Number of columns:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Computing nodes:    0 1 2 3 0 1 2 3 0 1  2  3  0  1  2

Figure 13.2 The assignment of columns.
Number of columns:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Computing nodes:    1 2 3 0 0 1 2 3 0 1  2  3  0  1  2
The regular domain nature of the GEM algorithm is essential for
the creation of parallel processes. The existing parallel matrix
algorithms for classical parallel computers with a shared memory
were based on the application of a domain decomposition model.
For example, in the product LINPACK for massive parallel
computers with a shared memory, a decomposition model was used
for matrix columns. With this decomposition model, long columns
are formed, allowing greater efficiency in using vector parallel
computers or supercomputers with vector processors.
The computational complexity of parallel algorithms with GEM
is O(n^2) operations with O(n^2) matrix data elements, and O(n^3)
operations in total.
The evaluation of matrix parallel algorithms
From the point of view of the size of the considered domain and the
available parallel matrix algorithms, we can divide them into the
following basic groups:
A fixed (static) size of both considered domains (functional, data)
when performing a given matrix parallel algorithm.
A dynamic (changing) size of both considered domains (functional,
data) when performing a given parallel algorithm.
Varied implementations of domain decomposition for the dominant
parallel computers based on NOW and Grid, causing inter-process
communication latency T_comm(s, p), which is to be modelled in
order to take measurements.
An efficient number of computing nodes for parallel algorithms
(scalability).
Overhead impacts, which are defined through the overhead function
h(p, p). For this we will use an asymptotic analysis to derive the
isoefficiency functions for every possible decomposition strategy.
The potential derivation of the isoefficiency function in an analytical
way for all possible decomposition strategies opens the possibility
of determining the range of efficiencies for hypothetical (not yet
existing) parallel computers, using the number of computing nodes p
as a parameter of the applied asymptotic analysis.
Common features of MPA
For a comprehensive analysis (including coincident overhead delays)
of the performance of parallel matrix algorithms, we will consider
the decomposition models typically used for the general shape of a
square n x n matrix, as illustrated in Fig. 13.3:

Figure 13.3 Square matrix n x n.

        | a_11  a_12  . . .  a_1n |
        | a_21  a_22  . . .  a_2n |
A =     |  . . .                  |
        | a_n1  a_n2  . . .  a_nn |

The reason for preferring the square matrix is the reduction of the
number of parameters (n = m) during the derivation of the analytical
performance relations (execution time, speed-up, efficiency,
isoefficiency etc.). This more transparent approach is supported by
the following additional reasons:
Any rectangular matrix n x m could be transformed into a
square matrix n x n either by extending the number of rows
(where m < n) or columns (if m > n).
The derivation process of the performance relations will be the
same, except that when considering the complexity of the matrix
we have to consider the product n m instead of n^2 (square matrix).
Common characteristics of matrix parallel algorithms (MPA) are
as follows:
The use of domain decomposition models, in which the
domain is mainly represented by the matrix data elements.
The matrix itself is theoretically well-parallelised up to the
level of a single data element. However, applying such a profound
degree of parallelisation would not be effective because of its
low computation complexity.
Decomposition models MPA
A typical characteristic of matrix algorithms is a certain regularity
both in the program (number of parallel activities) and also in the
data structure matrix (matrix elements). We refer to this regularity
as a domain. The selected domain decomposition matrix model then
specifies a procedure for the development of the matrix parallel
algorithm.
Domain decomposition
The main characteristic of many applications is a certain regularity
in its algorithm (functional domain) or data structure (data domain).
Recognition of these domains results in developing suitable parallel
algorithms. The functional domain decomposition model
(computation) and also domain data decomposition (matrix data
elements) determine the structure of matrix parallel algorithms
(MPAs). The most common domain applications are extensive,
discrete and static data structures. This common domain character
of all matrix algorithms increases the development of similar matrix
parallel algorithms. The universal access of domain decomposition
is as follows:
The distribution of functional domains as a basic assumption of
potential parallel processes.
The distribution of domain data as an assumption for parallel
processes on a group of data elements.
Mapping assigning groups of data elements to create parallel
processes.
Inter process communication IPX for parallel process cooperation
according to any given PA.
Iteration methods
When using exact methods on computers we often cannot arrive at
an exact solution. The causes of this are the representation of the
data elements in digital form and the rounding errors during
calculations. This is one of the main reasons for looking for other,
sometimes even more complicated methods, in order to find a
computer-supported solution with any given accuracy. To these
methods belong the iterative methods:
The Jacobi iteration method.
The Seidel iteration method.
The difference in the Seidel iteration method is that it also uses the
newest just-computed elements of the newly calculated iteration
vector (in the same iteration step) for the calculation of the other
elements. In matrix form this is:

X^(k+1) = D X^(k+1) + H X^(k) + P

where D is the lower triangular part and H is the upper triangular
part of matrix M.
Parallel iterative algorithms SLR
Despite the theoretical correctness of the exact methods, we do not
always achieve exact solutions in computer calculations. The reason
for this is the way the data are represented in computers and the
rounding errors during computations. Therefore, we look for other,
sometimes more complicated methods for solving the SLR with any
defined accuracy. To these methods belong, first of all, iterative
algorithms [6]. Iterative methods for solving the SLR start from the
SLR system in matrix form:

A X = B

which we convert to a form suitable for iterative computations:

X = M X + P

where X is the searched column vector of unknowns,
X = (x_1, x_2, ..., x_n)^T, A is the matrix of coefficients, B is the vector
of the right sides of the system of linear equations, M is a square
matrix of coefficients and P a column vector. We choose the initial
vector of the iteration computations X^(0) = (x_1^(0), x_2^(0), ..., x_n^(0)),
which is substituted into the right side of the adjusted matrix equation,
and thus we get a new approximation vector X^(1) as:

X^(1) = M X^(0) + P

The resulting vector X^(1) is substituted back into the right side of the
equation and we get the next iteration vector X^(2) as:

X^(2) = M X^(1) + P, etc.

Generally, from the iterative relationship:

X^(i+1) = M X^(i) + P

we receive the sequence of iteration vectors X^(0), X^(1), X^(2), ..., X^(i),
X^(i+1), ...
The Jacobi iteration method
The Jacobi iteration method is based on the defined system of n
linear equations by introducing the following definitions:

m_ij = - a_ij / a_ii   for i ≠ j
m_ij = 0               for i = j
p_i = a_i,n+1 / a_ii,  1 ≤ i ≤ n

We modify the system to a form appropriate for iteration, so that
from the first equation we express the unknown x_1, from the second
x_2, and so on. This can be done provided that a_ii ≠ 0, i = 1, 2, ..., n.
After these adjustments we get the following equivalent system of
equations:
x_1 = m_11 x_1 + m_12 x_2 + ... + m_1n x_n + p_1
x_2 = m_21 x_1 + m_22 x_2 + ... + m_2n x_n + p_2
. . .
x_n = m_n1 x_1 + m_n2 x_2 + ... + m_nn x_n + p_n

and this is in matrix form:

X = M X + P

So we can define the iterative process of the Jacobi method with the
following expression:

X^(k+1) = M X^(k) + P,  k = 0, 1, 2, ...,

whereby we take the vector of the right sides as the initial vector:
X^(0) = P.
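A minimal C sketch of this Jacobi iteration (array and parameter names are hypothetical); the coefficients m_ij = -a_ij/a_ii and p_i = a_i,n+1/a_ii are not stored explicitly but applied on the fly:

#include <math.h>

/* Jacobi iteration for A x = b; stops when two consecutive vectors differ by
   less than eps in the maximum norm, or after maxit iterations. */
int jacobi(int n, double a[n][n], double b[n], double x[n],
           double eps, int maxit) {
    double xnew[n];
    for (int k = 0; k < maxit; k++) {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i) s -= a[i][j] * x[j];
            xnew[i] = s / a[i][i];              /* requires a_ii != 0 */
        }
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            diff = fmax(diff, fabs(xnew[i] - x[i]));
            x[i] = xnew[i];
        }
        if (diff < eps) return k + 1;           /* number of performed iterations */
    }
    return maxit;
}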
The Gauss-Seidel iteration method
If in solving the system of n linear equations with iterative methods
we assume its convergence, then it is logical to assume that the
iteration vector X^(k+1) is a better approximation of the exact solution
X than the vector X^(k), and that this is also valid for its components
x_i^(k), i = 1, 2, ..., n. On this the Gauss-Seidel iteration method is
based. In comparison to the Jacobi iterative method, the difference
is in its use of the latest computed vector components for the
remaining computations of the other components within the same
iteration step. If we add the iteration indices to the system adjusted
for the Jacobi iteration, and take into consideration that m_ii = 0,
we get:
x_1^(k+1) = m_12 x_2^(k) + m_13 x_3^(k) + ... + m_1n x_n^(k) + p_1
x_2^(k+1) = m_21 x_1^(k+1) + m_23 x_3^(k) + ... + m_2n x_n^(k) + p_2
x_3^(k+1) = m_31 x_1^(k+1) + m_32 x_2^(k+1) + m_34 x_4^(k) + ... + m_3n x_n^(k) + p_3
. . .
x_n^(k+1) = m_n1 x_1^(k+1) + m_n2 x_2^(k+1) + ... + m_n,n-1 x_n-1^(k+1) + p_n

or in matrix form:

X^(k+1) = L X^(k+1) + U X^(k) + P

where L is the lower triangular part
and U is the upper triangular part of matrix M.
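Compared with the Jacobi sketch above, only the inner update changes: the vector is updated in place, so the components already computed in the current sweep are used immediately (a sketch under the same hypothetical names):

/* one Gauss-Seidel sweep: x[j] for j < i already holds the (k+1)-th value */
for (int i = 0; i < n; i++) {
    double s = b[i];
    for (int j = 0; j < n; j++)
        if (j != i) s -= a[i][j] * x[j];
    x[i] = s / a[i][i];
}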
The convergence of iterative methods
Under the term matrix norm we understand a defined real number
which is a measure of the size of the matrix. The norm of matrix A
is denoted ||A|| and is defined as:
The row norm ||A||_R = max_i sum_j |a_ij|.
The column norm ||A||_S = max_j sum_i |a_ij|.
The Euclidean norm ||A||_E = sqrt( sum_i sum_j |a_ij|^2 ).
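A short C sketch of these three norms (function names are hypothetical); the row norm of the iteration matrix M, for example, can later be used to check the sufficient convergence condition ||M|| < 1 discussed below:

#include <math.h>

double row_norm(int n, double a[n][n]) {      /* max_i sum_j |a_ij| */
    double best = 0.0;
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += fabs(a[i][j]);
        best = fmax(best, s);
    }
    return best;
}

double col_norm(int n, double a[n][n]) {      /* max_j sum_i |a_ij| */
    double best = 0.0;
    for (int j = 0; j < n; j++) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += fabs(a[i][j]);
        best = fmax(best, s);
    }
    return best;
}

double euclid_norm(int n, double a[n][n]) {   /* sqrt(sum_i sum_j |a_ij|^2) */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) s += a[i][j] * a[i][j];
    return sqrt(s);
}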
The spectral radius of the matrix
The spectral radius of matrix A is the number ρ(A) = max_i |λ_i(A)|,
where λ_i(A) are the eigenvalues of the matrix (the roots of the
characteristic polynomial), which are given as:

det (A - λ I) = 0

Between the spectral radius of the matrix and its norm the following
relation is valid:

ρ(A) ≤ ||A||
Convergence of the sequence of vectors
Let us have a sequence of vectors {X^(k)}, k = 1, 2, .... This sequence
converges to the vector X = (x_1, x_2, ..., x_n)^T if the following limit
is valid:

lim (k→∞) x_i^(k) = x_i   for i = 1, 2, ..., n

The convergence of the vector thus represents the convergence of its
components. This condition is equivalent to the condition:

lim (k→∞) ||X^(k) - X|| = 0

which means the following is valid:

lim (k→∞) X^(k) = X
Convergence of the iterative process
If we denote the exact solution of the system as vector X and there
exists a limit of the sequence of approximations X^(0), X^(1), X^(2), ...,
then this limit satisfies the set of equations, since:

lim (k→∞) X^(k+1) = M lim (k→∞) X^(k) + P

which means the limit of the approximation sequence is the solution,
and it is therefore valid that:

lim (k→∞) X^(k) = X

that is, if the sequence of approximations is convergent, it converges
to the exact solution.
Convergence of the Jacobi iterative method
Let the error vector of the k-th iteration be e^(k), that is:

e^(k) = X^(k) - X   for k = 0, 1, ...
X^(k) = M X^(k-1) + P
X = M X + P

and after substitution and subtraction:

e^(1) = X^(1) - X = M (X^(0) - X) = M e^(0)
e^(2) = M e^(1) = M^2 e^(0)
. . .
e^(k) = M e^(k-1) = M^k e^(0)

because:

lim (k→∞) X^(k) = X  <=>  lim (k→∞) e^(k) = 0

The convergence of this iteration process is dependent on the
properties of matrix M. The conditions for the convergence of the
Jacobi iterative method are as follows:
Necessary and sufficient condition. The iterations defined by the
relation X^(k+1) = M X^(k) + P, k = 0, 1, 2, ..., converge for any initial
vector X^(0) if, and only if, ρ(M) < 1.
Sufficient condition for convergence. If for any norm of matrix M
it is valid that ||M|| ≤ q < 1, then the sequence {X^(k)}, k = 0, 1, ...,
determined by the above relationship is convergent, independently
of the initial choice of the vector X^(0), that is, the following
relationship is valid:

lim (k→∞) X^(k) = X
Estimation of the error of the Jacobi iterative method. If the exact
solution is given as X, then the following is valid:

X^(k+1) - X = M (X^(k) - X) = M (X^(k+1) - X) - M (X^(k+1) - X^(k))

and assuming ||M|| ≤ q < 1, after adjustments we get:

||X^(k+1) - X|| ≤ q/(1-q) . ||X^(k+1) - X^(k)||

This relationship can be used as a condition for ending the iterative
process when determining the approximate solution X^(k+1) to within
a given accuracy ε > 0. If we require:

||X^(k+1) - X|| < ε

then we stop the computation at the pair of consecutive iterations for
which the following is valid:

||X^(k+1) - X^(k)|| < ε'

while for ε' we obtain ε' = ε (1-q)/q. Often we use the simplified
condition, as illustrated in Fig. 13.4:

||X^(k+1) - X^(k)|| < ε
For the convergence of the Gauss-Seidel iterative method the same
conditions are valid as for the Jacobi iterative method, whereby the
Gauss-Seidel method converges faster. The given condition is not
necessary, but only sufficient, for the convergence of both methods.
In practice we use both iterative methods also in cases where
convergence is influenced by the selection of the initial vector, and
satisfying this condition is based on this fact.

Figure 13.4 The convergence rate (the computed values approach the
exact value, until two consecutive iterations t and t+1 differ by less
than epsilon).
14
Partial Differential Equations
Partial differential equations (PDEs) are equations involving the
partial derivatives of an unknown function with respect to more than
one independent variable [4, 6]. PDEs are of fundamental importance
in modelling all types of phenomena which are continuous in nature.
Typical examples are weather forecasting, the optimisation of
aerodynamic shapes, fluid flow, and so on. Simple PDEs can be
solved directly, but in general it is necessary to approximate the
solution on an extensive network of grid points by iterative
numerical methods [21]. We will confine our attention to PDEs with
two space independent variables x, y. The needed function we
denote as u(x, y). The considered partial derivatives we denote as
u_xx, u_xy, u_yy etc. For practical use the most important PDEs are
the following second-order equations:
The heat equation, u_t = u_xx.
The wave equation, u_tt = u_xx.
The Laplace equation, u_xx + u_yy = 0.
These three types are the basic types of the general linear second-order
PDE of the following form:

a u_xx + b u_xy + c u_yy + d u_x + e u_y + f u + g = 0.
This equation could be transformed [87] by changing the variables
to one of the three basic equations (up to the lower-order terms),
provided that the coefficients a, b, c are not all equal to zero. The
variable b^2 - 4 a c is referred to as the discriminant, whereby its
value determines the following basic groups of second-order PDEs:
b^2 - 4 a c > 0: hyperbolic (a typical equation for waves).
b^2 - 4 a c = 0: parabolic (typical for heat transfer).
b^2 - 4 a c < 0: elliptical (typical for the Laplace equation) [46, 47].
The classification of more general types of PDE is not so clear. When
the coefficients are variable, the type of the equation can change
across the analysed area, and if several equations are analysed at the
same time, each equation can in general be of a different type. The
analysed problem may also be non-linear, or the equation may be of
higher than second order [20, 25]. Nevertheless, the basic classification
of PDEs is commonly used even when it is not exact. Specifically, the
types of PDE are as follows:
Hyperbolic. This group is characterised by time-dependent
processes that do not stabilise at a steady state.
Parabolic. This group is characterised by time-dependent
processes which tend towards stabilisation.
Elliptical. This group describes processes that have reached a
steady state and are therefore time-independent. A typical
example is the Laplace equation.
Here we show how to solve a specific PDE in a parallel way, namely
the Laplace equation in two dimensions, by means of a grid
computation method that employs the finite difference method.
Although we are focussing on this specific problem, the same
techniques are used for solving other PDEs (the three-dimensional
Laplace equation, the Poisson equation etc.), for extensive
approximation calculations on various parallel computers (massive
supercomputers, SMP, NOW, Grid), and possibly for solving other
similar complex problems:

∂²Φ/∂x² + ∂²Φ/∂y² = 0
The Laplace differential equation
The Laplace differential equation (LDE) is a practical example of
using iterative methods for its solution. The equation for two
dimensions is the following:

∂²Φ/∂x² + ∂²Φ/∂y² = 0

The function Φ(x, y) represents some unknown potential, such as
heat, stress and so on. Another example of a PDE is the second-order
Poisson equation with a non-zero right side. It has the following
form:

∂²Φ/∂x² + ∂²Φ/∂y² = g(x, y)
Given a two-dimensional region and the values for the points of the
region's boundary, the goal is to approximate the steady-state
solution Φ(x, y) at the points of the interior by the function u(x, y).
We can do this by covering the region with a grid of points (Fig. 14.1)
and obtaining the values u(x_i, y_j) = u_i,j over the area. Each inner
point is initialised to some initial value. The stationary values of the
inner points are then computed by applying iterative methods. At each
iteration step, the new point value (next) is defined as an average of
the previous (old) values, or a combination of the previous and new
sets of values of the neighbouring points. The iterative computation
ends either after performing a fixed number of iterations, or after
reaching a defined precision with an acceptable difference ε > 0 (the
Epsilon value) for each new value. The Epsilon accuracy is determined
as the desired difference between the previous and the new point values.
Figure 14.1 Grid approximation of the Laplace equation (* denotes a
boundary point, . denotes an interior point).
The values of the points that define the boundary conditions are
given as follows:
According to Dirichlet [5], by giving the values of the analysed
function itself on the boundary of the field (for example u = 0 for
x = 0, 0 ≤ y ≤ 1, and u = 1 for x = 1, 0 ≤ y ≤ 1).
According to Neumann [5], by giving the values of the solution
derivatives (for example u_y = 0 for y = 0, 0 ≤ x ≤ 1, and u_y = 0
for y = 1, 0 ≤ x ≤ 1).
Let the coordinates of the points (x_i, y_j) of the area be given as
x_i = i h, y_j = j h for i, j = 0, 1, 2, ..., N. The function values at the
points (x_i, y_j) we denote as u_i,j.
In the Laplace equation we replace, at each node (x_i, y_j), the partial
derivatives of the function Φ = u(x, y) by finite differences of the
function values u(x_i, y_j) = u_i,j. The approximations of the partial
derivatives u_x, u_y, u_xx, u_yy are then as follows:

u_x ≈ (u_i+1,j - u_i,j) / h
u_y ≈ (u_i,j+1 - u_i,j) / h
u_xx ≈ (u_i+1,j - 2 u_i,j + u_i-1,j) / h^2
u_yy ≈ (u_i,j+1 - 2 u_i,j + u_i,j-1) / h^2

After this substitution we obtain the final iteration formula as:

X_i,j^(t+1) = (X_i-1,j^(t) + X_i+1,j^(t) + X_i,j-1^(t) + X_i,j+1^(t)) / 4

or its alternative version:

X_i,j^(t+1) = (4 X_i,j^(t) + X_i-1,j^(t) + X_i+1,j^(t) + X_i,j-1^(t) + X_i,j+1^(t)) / 8

and after substituting into the partial Laplace equation the resulting
relationship is as follows:

u_i,j = (u_i+1,j + u_i-1,j + u_i,j+1 + u_i,j-1) / 4

Alternatively, for the three-dimensional Laplace equation the same
procedure leads to the iterative relationship:

u_i,j,k = (u_i+1,j,k + u_i-1,j,k + u_i,j+1,k + u_i,j-1,k + u_i,j,k+1 + u_i,j,k-1) / 6
These equations can be solved by applying the following iterative
methods:
Jacobi.
Gauss-Seidel.
Successive Over Relaxation (SOR).
Multi-grid.
In terms of growing implementation complexity, we first modelled
the iteration relation for the two-dimensional Laplace PDE using the
Jacobi method, developing the parallel algorithms with a shared
memory in OpenMP and the parallel algorithms with a distributed
memory in the MPI API. Subsequently, we modelled the remaining
iterative methods, which are more complex for the development of
parallel algorithms but, in terms of their faster convergence, are more
suitable for applications.
Local communication
The structure of the communication model determines the
communication complexity of parallel algorithms. We will analyse
the basic communication requirements of the Jacobi iterative method.
The essence of this iterative method is the iterative computation of the
new value of any given internal point from a fixed number of
neighbouring points (the iteration relation); each new value of a grid
point is computed from a specified number of neighbouring point
values (one iteration step). In our case we have derived the
computation of the new value X_i,j on a two-dimensional network of
points, where each new point value is computed as the arithmetic
average of the four neighbouring points as follows:

X_i,j^(t+1) = (X_i-1,j^(t) + X_i+1,j^(t) + X_i,j-1^(t) + X_i,j+1^(t)) / 4     (10.2)
An alternative relationship can be derived by taking the arithmetic
mean of X_i,j^(t+1) and X_i,j^(t), that is, by including the previous
value of the point itself, as demonstrated:

X_i,j^(t+1) = (4 X_i,j^(t) + X_i-1,j^(t) + X_i+1,j^(t) + X_i,j-1^(t) + X_i,j+1^(t)) / 8     (10.3)
The iterative calculation according to this iterative relationship is
repeated sequentially at each iteration step; it gradually produces the
new values X_i,j^(1), X_i,j^(2) and so on, where X_i,j^(t) denotes the
value of the given point X_i,j at step t. Suppose that by applying
decomposition models we create a parallel process for each point,
the respective group of points being the two-dimensional network of
points. This creates a parallel process assigned to point X_i,j, which
successively computes the following new values:

X_i,j^(1), X_i,j^(2), X_i,j^(3), . . .

For this, any given parallel process needs the values of the following
points:

X_i-1,j^(0), X_i-1,j^(1), X_i-1,j^(2), . . .
X_i+1,j^(0), X_i+1,j^(1), X_i+1,j^(2), . . .
X_i,j-1^(0), X_i,j-1^(1), X_i,j-1^(2), . . .
X_i,j+1^(0), X_i,j+1^(1), X_i,j+1^(2), . . .
For the Jacobi finite difference method a two-dimensional grid is
repeatedly updated by replacing the value at each point with some
function of the values at a small fixed number of neighbouring
points. The common approximation structure uses a four-point
stencil to update each element X
i,j
(Fig. 14.2.):
for t = 0 to t1 do
begin
  send X_i,j^(t) to each neighbour;
  receive X_i-1,j^(t), X_i+1,j^(t), X_i,j-1^(t), X_i,j+1^(t) from the neighbours;
  compute X_i,j^(t+1) using the relation 10.2 resp. 10.3;
end;
Figure 14.2 Communication for a four-point approximation
Similarly, to obtain more accurate values of any point we can also use
more precise multi-point approximation relations, for example the
approximation over nine points according to the stencil in Fig. 14.3,
with the following relation:

X_i,j^(t+1) = (16 X_i-1,j^(t) + 16 X_i+1,j^(t) + 16 X_i,j-1^(t) + 16 X_i,j+1^(t)
              - X_i-2,j^(t) - X_i+2,j^(t) - X_i,j-2^(t) - X_i,j+2^(t)) / 60

Figure 14.3 Stencil with nine points (the point (i, j) and the points
(i±1, j), (i±2, j), (i, j±1), (i, j±2)).
Algorithms of the Jacobi iterative method
The Jacobi iterative method determines a new value for each grid
point as the arithmetic average of the previous (old) values of the
four direct neighbouring points: left, right, up and down. This
activity (iteration step) is repeated until the end of the iterative
calculation. In this sense we first create a sequential program, then
analyse the possibilities of its optimisation, and finally analyse the
possibilities and conditions for its parallelisation (abstract parallel
algorithms), including the optimisation of computation and
communication.
Let us assume that the grid points are approximated by a square
matrix n x n, which is bounded on its sides by the boundary point
values. To develop the algorithms we need to create one matrix (old)
to store the current network state (internal and boundary points) and
another matrix for storing the newly computed iteration values (next):

real old [0:n+1, 0:n+1], next [0:n+1, 0:n+1];

For the boundary points of the defined matrices (old, next) we
initialise the appropriate boundary conditions, and the internal
points are initialised with the same initial value, for example zero.
Suppose that we terminate the iterative computation when each new
value of a performed iteration meets the estimated accuracy (epsilon);
the computation algorithm for one iteration step of the Jacobi
iteration is then as follows:
while (true) {
  # compute the new values for all internal points
  for (i = 1 to n, j = 1 to n)
    next[i,j] = (old[i-1,j] + old[i+1,j] +
                 old[i,j-1] + old[i,j+1]) / 4;
  itstep++;
  # compute the maximum difference maxvar
  maxvar = 0.0;
  for (i = 1 to n, j = 1 to n)
    maxvar = max(maxvar, abs(next[i,j] - old[i,j]));
  # termination test
  if (maxvar < Epsilon)
    break;
  # copy matrix next to old for the next iteration
  for (i = 1 to n, j = 1 to n)
    old[i,j] = next[i,j];
}
In this algorithm maxvar is a real number and itstep is an integer
number which counts the performed iterations. The algorithm assumes
that the matrices are stored row-by-row (C/C++).
Optimisation of parallel Jacobi algorithms
We analyse the developed sequential algorithm of the Jacobi iterative
method in order to optimise its performance (efficient algorithms).
The first for cycle is performed n^2 times within one iteration step.
The sums of the corresponding matrix values (grid points) are
essential, and for their execution we can use more powerful SIMD
instructions. The division by the number four, which is frequently
executed in each iteration step, we replace with the faster arithmetic
operation of multiplication by 0.25. Finally, for integer data a
division by four could be replaced by an even faster shift operation
to the right by two bit positions.
Similarly, we analyse the part of the algorithm in which we
compute the maximal difference. This part is also performed during
each iteration step, although the termination condition is fulfilled
only once, at the end of the whole iterative computation. Therefore,
we effectively replace the convergence test of the cycle with a fixed
number of iterations maxsteps counted by itstep, and the maximum
difference is computed only once, after the computation cycle. By
implementing these improvements we can optimise our algorithm as
follows:
for (itstep = 1 to maxsteps)
{
  # compute the new values for all interior points
  for (i = 1 to n, j = 1 to n)
    next[i,j] = (old[i-1,j] + old[i+1,j] +
                 old[i,j-1] + old[i,j+1]) * 0.25;
  # copy next to old for the next iteration
  for (i = 1 to n, j = 1 to n)
    old[i,j] = next[i,j];
}
# compute the maximum difference maxvar
maxvar = 0.0;
for (i = 1 to n, j = 1 to n)
  maxvar = max(maxvar, abs(next[i,j] - old[i,j]));
The variable maxsteps is a program argument (constant). Although
the modified algorithm is not completely identical to the original,
it is still a functional algorithm of the Jacobi iterative method. If after
running the modified program we find that the resulting value maxvar
is too large, we perform the iterative calculation again with a
modified value of maxsteps. Based on several experimental
computations we find the correct setting of maxsteps.
A modified version of the Jacobi iterative algorithm is considerably
more efficient than the original algorithm, although there are
opportunities for further improvements. We will analyse the
time-consuming part of the algorithm, in which the newly computed
next matrix values are copied onto the old matrix. A more effective
adjustment uses indices (pointers) to next and old, and after performing
an iteration step we simply exchange them. For this we join the used
matrices next and old into one three-dimensional matrix (here named
grid) as follows:

real grid [0:1, 0:n+1, 0:n+1];
int old = 0, next = 1;

The variables old and next are used for indexing the first dimension
of the grid matrix. With the above matrix adjustments, the procedure
for a single iteration step is as follows:
for (iters = 1 to MAXITERS) {   # compute new values of interior points
  for (i = 1 to n, j = 1 to n)
    grid[next][i,j] = (grid[old][i-1,j] + grid[old][i+1,j] +
                       grid[old][i,j-1] + grid[old][i,j+1]) * 0.25;
  # exchange the roles of old and next
  old = 1 - old;  next = 1 - next;
}
To exchange the roles of all the points of the network it is enough to
swap the values of old and next. But removing the copying of the
next matrix onto the old one also has a negative consequence: every
matrix reference during an iteration cycle needs the additional index,
which makes it necessary to extend the existing algorithm with
additional commands for computing the addresses of the matrix
elements. The delay arising from these additional commands is,
however, definitely smaller than the delay caused by copying the next
and old matrices. We can avoid both the copying and the more
complicated indexing of the matrix by an alternative approach, which
uses an expansion (unrolling) of the iteration cycle itself. This idea is
based on unrolling the cycle so as to reduce the number of repetitions
of the original iteration cycle. For example, if we unroll the outer
loop twice, the update of the existing algorithm will be as follows:
for (itstep = 1 to maxsteps by 2) {   # compute new values of internal points
  for (i = 1 to n, j = 1 to n)
    grid[next][i,j] = (grid[old][i-1,j] + grid[old][i+1,j] +
                       grid[old][i,j-1] + grid[old][i,j+1]) * 0.25;
  # exchange the roles of old and next
  old = 1 - old;  next = 1 - next;
  # compute new values of interior points again
  for (i = 1 to n, j = 1 to n)
    grid[next][i,j] = (grid[old][i-1,j] + grid[old][i+1,j] +
                       grid[old][i,j-1] + grid[old][i,j+1]) * 0.25;
  # exchange the roles of old and next
  old = 1 - old;  next = 1 - next;
}
The unrolling of the cycle by itself does not have a significant impact
on performance, but it opens the way for further optimising
modifications. (Performance can even be worse with unrolled cycles,
because the amount of code in the cycle may no longer fit the
processor instruction cache.) When considering the unrolled cycle in
the previous algorithm, it is no longer necessary to perform the
exchange of the network indices at all. Instead of exchanging them,
the second copy of the cycle simply reads the new values of the grid
points which were created during the first copy of the cycle, and
assigns its computed values back to the previous points. When
applying this approach it is also not necessary to create a
three-dimensional matrix, so we go back to the already developed
algorithm with the two single matrices known as next and old. The
resulting optimised algorithm for the computation of the Jacobi
iteration, marked as SA_opt, is as follows:
real old [0:n+1, 0:n+1], next [0:n+1, 0:n+1];
real maxvar = 0.0;
initialisation of grid variables including borders;
for (itstep = 1 to maxsteps by 2) {   # compute new values of interior points
  for (i = 1 to n, j = 1 to n)
    next[i,j] = (old[i-1,j] + old[i+1,j] +
                 old[i,j-1] + old[i,j+1]) * 0.25;
  # compute new values again
  for (i = 1 to n, j = 1 to n)
    old[i,j] = (next[i-1,j] + next[i+1,j] +
                next[i,j-1] + next[i,j+1]) * 0.25;
}
# compute the maximum difference maxvar
for (i = 1 to n, j = 1 to n)
  maxvar = max(maxvar, abs(old[i,j] - next[i,j]));
print final values and the maximum difference;
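For reference, a compilable C sketch of SA_opt under illustrative assumptions (the grid size, step count and the simple boundary initialisation are ours, not the book's):

#include <stdio.h>
#include <math.h>

#define N 64             /* interior grid size, illustrative */
#define MAXSTEPS 1000    /* even number: each pass performs two half-steps */

static double old[N + 2][N + 2], next[N + 2][N + 2];

int main(void) {
    for (int i = 0; i <= N + 1; i++)            /* example boundary condition */
        old[i][0] = next[i][0] = 1.0;
    for (int itstep = 1; itstep <= MAXSTEPS; itstep += 2) {
        for (int i = 1; i <= N; i++)            /* half-step: next from old */
            for (int j = 1; j <= N; j++)
                next[i][j] = (old[i-1][j] + old[i+1][j] +
                              old[i][j-1] + old[i][j+1]) * 0.25;
        for (int i = 1; i <= N; i++)            /* half-step: old from next */
            for (int j = 1; j <= N; j++)
                old[i][j] = (next[i-1][j] + next[i+1][j] +
                             next[i][j-1] + next[i][j+1]) * 0.25;
    }
    double maxvar = 0.0;                        /* difference of the last two half-steps */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            maxvar = fmax(maxvar, fabs(old[i][j] - next[i][j]));
    printf("maxvar = %g\n", maxvar);
    return 0;
}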
Complexity of the sequential algorithm
The computational delay of the iterative sequential algorithm
T(s,1)_comp is given by the product of the delay of one iteration step
and the number of performed iteration steps k. An integral part of
the optimised sequential algorithm SA_opt is the reduction of the
number of cycle repetitions by half, with two consecutive iterations
inside one repetition, that is, double the number of addition and
multiplication operations for each grid point. In the subsequent
reduced cycle a subtraction operation is carried out for each grid
point. Counting the arithmetic operations carried out per grid point
and iteration step, there are the addition operations, plus a
multiplication operation, plus one half of a subtraction operation;
and multiplication operations are more time-consuming than addition
or subtraction. We will therefore consider an average of six simple
arithmetic operations, where a simple operation is defined as the
addition of integer numbers. The delay of one iteration step per grid
point is thus given as 6 t_c, where t_c is a defined technical parameter
of the processor or computer in use. The computation delays
T(s, 1)_comp of the developed sequential algorithms SA and SA_opt
are thus given as:

T(s, 1) = 6 k n^2 t_c

T(s, 1)_opt = (3 k n^2 t_c) / 2

in which:
s is defined as the workload for the supposed square matrix, s = n^2.
t_c is the technical parameter of the given computer, defined as the
average execution time of an arithmetic operation.
k is the number of performed iterative steps of the original
sequential algorithm SA.
For one iteration step (k = 1) we get the following relations:

T(s, 1) |k=1 = 6 n^2 t_c

T(s, 1)_opt |k=1 = (3 n^2 t_c) / 2

The asymptotic computational complexity of the Jacobi iterative
sequential algorithms SA and SA_opt is given as:

T(s, 1) = O(n^2 t_c) = O(n^2)
Gauss-Seidel iterative sequential algorithms
In sequential iterative computations we prefer the Gauss-Seidel
iterative method to the Jacobi iterative method, as we can achieve
the final solution with fewer performed iterations. The Gauss-Seidel
method computes the approximated values of the grid points in such
an order that during the computation of each grid point we can also
use the previously computed new values of the neighbouring grid
points. For example, we can modify the derived Jacobi iteration
relation as follows:

X_i,j^(t+1) = (X_i-1,j^(t+1) + X_i+1,j^(t) + X_i,j-1^(t+1) + X_i,j+1^(t)) / 4

In this modified iteration relation (the Gauss-Seidel method), when
sweeping over the n x n grid points, the new value of each grid point
is computed on average from new values of about half of its
neighbouring grid points and from previous values of the other half.
Matrix decomposition models
We will analyse the decomposition matrix models of grid points for
parallel solutions to the Jacobi iteration algorithm (decomposition
models). From our analysis of the sequential Jacobi iterative method,
we know that the basics of computations are repeated at each point of
the defined grid points (iteration steps). To decompose the grid point
values of a matrix, we can use matrix decomposition domain strategies.
The matrix domain is made up of matrix data elements corresponding
to the grid point values. This domain data we can divide into its
independent parts, represented as groups of matrix elements. On these
divided parts of the matrix we will then perform iterative computations
in a parallel manner (parallel processes). This approach is bounded
from above by the potential degree of parallelism, whereby for each
data element of the matrix a single parallel process is created as follows:
real old [0:n+1, 0:n+1], nextold [0:n+1, 0:n+1];
bool convergent = false;
process old [i = 1 to n, j = 1 to n] {
  while (not convergent) {
    nextold[i,j] = (old[i-1,j] + old[i+1,j] +
                    old[i,j-1] + old[i,j+1]) * 0.25;
    test of convergence;
    barrier (i);
    old[i,j] = nextold[i,j];
    barrier (i);
  }
}
This theoretical model of decomposing the matrix down to its single
data elements would not be effective enough in a practical
implementation, because one iteration step of such a parallel process
is realised with a very low number of arithmetic operations (the
computing part of the PA). The accompanying overhead delays of
these parallel algorithms, such as inter-process communications,
could then be higher than the achieved parallel speed-up. Similarly,
each standard parallel process requires its own stack, so for an
extensive network of grid points this parallel algorithm is limited by
the demands on the necessary memory capacity. This limiting
decomposition model would therefore not be effective, at least for
the dominant parallel computers NOW and Grid, for the parallel
implementation of the iterative algorithm (IPA).
For more realistic decomposition models we will generally
consider a number of computing nodes p which is lower than the
number of matrix data elements. In this sense we will consider a
parallel algorithm which is based on the equal distribution of matrix
elements (parallel processes) among the p computing nodes of the
parallel computer in use. For a simple illustration we assume that the
number of data elements of the square matrix, n^2, is an integral
multiple of p. Thus, each computing node of the parallel computer
gets the same number of matrix elements for the parallel iterative
calculations. We can then divide the given square matrix into parts
in the form of vertical or horizontal matrix strips, that is, into p
groups of columns or p groups of rows. When the matrix data
elements are stored row-by-row, we use horizontal strips (rows),
which implies a simpler implementation of the parallel algorithm.
For the illustration of a practical approach we assume that n is a
multiple of p and, at the same time, that the matrix data elements are
stored in the main memory row-by-row. Then a horizontal strip (a
group of rows) of size n/p is assigned to each parallel process. Each
parallel process then computes the new grid point values for its
allocated strip.
Because parallel processes share common data elements of the
matrix on the border strip rows, it is necessary to ensure that each
new iteration on each computing node has been completed before
any computing node performs the next iteration step. For this
purpose we use the synchronisation command for all the performed
parallel processes. In the supporting development environment MPI
API we then use the collective command known as barrier.
Jacobi iterative parallel algorithms
Parallel algorithms with a shared memory
Parallel algorithms of the Jacobi iterative method with a shared
memory PA_sm use the OpenMP API standard for shared variables.
They are parallel algorithms of the SPMD computational model
(single program, multiple data), in which each computing node
creates a parallel process that performs the same activities on a
different allocated part of the matrix data elements. Every created
parallel process needs to initialise the allocated matrix data elements,
including the border ones. The substance of each node's parallel
process is the same as the derived optimised sequential algorithm
SA_opt. For the synchronisation of all the performed parallel processes
we use the barrier command from the supporting parallel development
environment (API) OpenMP.
An integral part of the parallel algorithm PA_sm is the method of
computing the maximum difference between all the relevant matrix
data elements and their new values. Each parallel process computes
the maximum difference over its allocated matrix elements using a
local variable mydiff, and then stores this value into a final shared
array of maximal differences maxvar. After a further barrier
synchronisation, each parallel process computes the maximum of
maxvar[*]; alternatively, we could reserve one parallel process for
this activity. By using a local variable in each parallel process we
achieve the following advantages:
We do not need a critical section command to protect a single
global variable for this action.
We avoid potential collisions in the cache memory, which could
occur as a consequence of the incorrectly shared use of the maxvar
array.
Similarly, the final parallel algorithm PA_sm implements the
repeatedly used barrier synchronisation as a procedure call, which is
effective especially for compilers with built-in optimisation support
for procedure calls. The final version of the parallel algorithm PA_sm
is thus as follows:
real old [0:n+1, 0:n+1], next [0:n+1, 0:n+1];
int HEIGHT = n / p;               # assumption: n is divisible by p
real maxvar [1:p] = ([p] 0.0);
procedure barrier (int id) {      # synchronisation barrier
}
process worker [w = 1 to p] {
  int firstRow = (w-1) * HEIGHT + 1;
  int lastRow = firstRow + HEIGHT - 1;
  real mydiff = 0.0;
  initialise matrix including borders;
  barrier (w);
  for (itstep = 1 to MAXSTEPS by 2) {    # compute new values
    for (i = firstRow to lastRow, j = 1 to n)
      next[i,j] = (old[i-1,j] + old[i+1,j] +
                   old[i,j-1] + old[i,j+1]) * 0.25;
    barrier (w);                         # compute new values again
    for (i = firstRow to lastRow, j = 1 to n)
      old[i,j] = (next[i-1,j] + next[i+1,j] +
                  next[i,j-1] + next[i,j+1]) * 0.25;
    barrier (w);
  }
  # compute maximal difference
  for (i = firstRow to lastRow, j = 1 to n)
    mydiff = max (mydiff, abs (old[i,j] - next[i,j]));
  maxvar[w] = mydiff;
  barrier (w);     # the maximal difference is the maximum of maxvar[*]
}
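The listing above uses the book's abstract notation; one possible way the same structure could look as a C/OpenMP fragment is sketched below (the function name and layout are our assumptions; the work-sharing loops with their implicit barriers play the role of the explicit barrier(w) calls):

#include <omp.h>
#include <math.h>

/* two half-steps per pass plus a max-reduction of the final difference */
void jacobi_omp(int n, int maxsteps, double old[n + 2][n + 2],
                double next[n + 2][n + 2], double *maxvar) {
    double diff = 0.0;
    #pragma omp parallel
    {
        for (int itstep = 1; itstep <= maxsteps; itstep += 2) {
            #pragma omp for                       /* implicit barrier at the end */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    next[i][j] = (old[i-1][j] + old[i+1][j] +
                                  old[i][j-1] + old[i][j+1]) * 0.25;
            #pragma omp for                       /* implicit barrier at the end */
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    old[i][j] = (next[i-1][j] + next[i+1][j] +
                                 next[i][j-1] + next[i][j+1]) * 0.25;
        }
        #pragma omp for reduction(max : diff)     /* plays the role of mydiff/maxvar[*] */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                diff = fmax(diff, fabs(old[i][j] - next[i][j]));
    }
    *maxvar = diff;
}

The reduction clause also removes the need for the per-process maxvar[w] entries, which is one way of avoiding the cache collisions mentioned above.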
Parallel iterative algorithms with a distributed memory
In this part we will consider the development of parallel algorithms
with a distributed memory for the Jacobi iterative method, supported
by the standard development environment (API) MPI. An effective
way of developing PAs with message passing communication (MPI)
is to modify the developed parallel algorithms with shared variables
PA_sm as follows:
In the first step we divide the used shared variables among the
parallel processes.
In the second step we add the communication commands send
and receive, according to the version of MPI in use, wherever
inter-process communications (IPC) are needed.
As in parallel algorithms with a shared memory PA
sm
,

we assume p
computing nodes and we assign to each computing node of the
parallel computer in use the allocated parts of the old and next
matrices, including their border data elements. We carry out the
decomposition in such a way that the allocated parts of the old and
next matrices are local to the allocated parallel process. This means
that each parallel process has stored in its memory, not only the
allocated data elements of the old and next matrices, but also the
border data elements of the allocated elements with their neighbouring
decomposed matrix parts. Each parallel process then executes a
sequence of iteration cycles on the allocated parts of the old and next
matrices, whereby after each performed iteration step inter-process
communication (IPC) takes place to exchange the newly computed
border values of the allocated data elements of the old and next
matrices with the neighbouring computing nodes of the parallel
computer. These inter-process communication commands replace
the barrier synchronisation commands used in the previously
developed parallel algorithms with a shared memory PA_sm.
It is also necessary to distribute the computation of the maximum
difference after each iteration step among the computing nodes, each
for its allocated part. As in the parallel algorithm PA_sm, we let each
computing node compute the maximum difference for its allocated
matrix part. Then we let one parallel process gather the achieved
local maximal differences. We can do this either by using an MPI
point-to-point (PTP) communication command, from one parallel
process to another, or by using the MPI collective communication
command for data collection, allgather.
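The book mentions allgather; one hedged alternative sketch in C/MPI is to use a single reduction with the MPI_MAX operation, which delivers the global maximum directly (mydiff is the locally computed value of each process; the function name is hypothetical):

#include <mpi.h>

/* every process contributes its local mydiff and obtains the global maximum;
   with MPI_Reduce instead of MPI_Allreduce only the manager (rank 0) would get it */
double collect_max_difference(double mydiff) {
    double globaldiff = 0.0;
    MPI_Allreduce(&mydiff, &globaldiff, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return globaldiff;
}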
The resulting parallel algorithm with MPI communication for the
Jacobi iterative method we have marked as PA_dp (parallel algorithm
with a distributed memory). It uses inter-process communication
(IPC) after each iteration step, so that the neighbouring computing
nodes exchange the newly computed border data element values
twice in each reduced (unrolled) iteration step. The first inter-process
communication is carried out as follows:
if (w > 1) send up[w-1] (new[1,*]);
if (w < p) send down[w+1] (new[HEIGHT,*]);
if (w < p) receive up[w] (new[HEIGHT+1,*]);
if (w > 1) receive down[w](new[0,*]);
The send command in the first line sends the first row of the allocated
matrix part (we assume the decomposition into groups of rows) to
the upper neighbouring computing node. The command in the second
line, executed by each parallel process (worker) with the exception
of the last one, sends the bottom row of its allocated matrix part to
its lower neighbouring parallel process. Each computing node then
receives the corresponding boundary rows from its neighbouring
computing nodes. The received values become the new border values
of the shared rows for each computing node. The second inter-process
communication is carried out in the same way, with the next matrix
replaced by the old matrix. After performing a sufficient number of
iteration steps (defined accuracy), each computing node computes the
maximal difference for its allocated matrix part, and the designated
computing node (manager) executes a data collection command for
the resulting values (the global maximal difference). This final global
maximal difference is the final value mydiff of the manager computing
node. The final developed parallel algorithm with a distributed
memory, PA_dp, is as follows:
chan up [1:p] (real edge [0:n+1]);
chan down [1:p] (real edge [0:n+1]);
chan diff (real);
process worker (w = 1 to p) {
  int HEIGHT = n / p;     # assumption: p divides n
  real old [0:HEIGHT+1, 0:n+1], next [0:HEIGHT+1, 0:n+1];
  real mydiff = 0.0, otherdiff = 0.0;
  initialisation of old and next of the allocated part including borders;
  for [itstep = 1 to MAXSTEPS by 2] {    # compute new values
    for [i = 1 to HEIGHT, j = 1 to n]
      next[i,j] = (old[i-1,j] + old[i+1,j] +
                   old[i,j-1] + old[i,j+1]) * 0.25;
    exchange of border lines for next;
    # compute the new values again
    for [i = 1 to HEIGHT, j = 1 to n]
      old[i,j] = (next[i-1,j] + next[i+1,j] +
                  next[i,j-1] + next[i,j+1]) * 0.25;
    exchange of border lines for old;
  }
  # compute maximum difference
  for [i = 1 to HEIGHT, j = 1 to n]
    mydiff = max (mydiff, abs (old[i,j] - next[i,j]));
  if (w > 1)
    send diff (mydiff);
  else      # manager collects differences
    for (i = 1 to p-1) {
      receive diff (otherdiff);
      mydiff = max (mydiff, otherdiff);
    }       # the maximum difference is mydiff
}
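For comparison, a hedged C/MPI sketch of the border-row exchange used by PA_dp (the row-strip layout, names and the worker-to-rank mapping are our assumptions): each worker w owns HEIGHT rows of an (n+2)-wide strip and fills its two halo rows 0 and HEIGHT+1.

#include <mpi.h>

void exchange_borders(int w, int p, int height, int n,
                      double next[height + 2][n + 2]) {
    int up   = (w > 1) ? w - 2 : MPI_PROC_NULL;   /* worker w runs as rank w-1 */
    int down = (w < p) ? w     : MPI_PROC_NULL;

    /* send the first row up, receive the halo row from below */
    MPI_Sendrecv(&next[1][0],          n + 2, MPI_DOUBLE, up,   0,
                 &next[height + 1][0], n + 2, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send the last row down, receive the halo row from above */
    MPI_Sendrecv(&next[height][0],     n + 2, MPI_DOUBLE, down, 1,
                 &next[0][0],          n + 2, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Using MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) pairs each send with the matching receive and so avoids the deadlock risk of ordering blocking send and receive calls by hand; MPI_PROC_NULL turns the transfers at the outermost strips into no-ops.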
We now modify this developed parallel algorithm PA_dp in order to
increase its performance (optimisation). In the first step, in the
iterative computation of the new data values of the matrices, it is not
necessary to perform inter-process communications for the shared
edges after every iteration step; we perform the inter-process
communications for the shared border elements only after every
second iteration step. As a result, the values of the shared border
elements lag by one step, but since the parallel algorithm converges,
we still achieve the correct results. In the second step, we overlap the
remaining inter-process communications for the shared border points
with local iterative computation, placing them between the MPI send
and receive communication commands. Specifically, each computing
node performs the local iterative computation in the following order:
Send their own border values to the neighbouring computing
nodes.
Compute the new border values of the allocated matrix part.
Receive the computed border values of the shared matrix data
elements from the neighbouring computing nodes.
Compute the new values for their own border elements of the
allocated matrix part.
These adjustments significantly increase the probability that the
neighbouring border matrix elements can be received in advance of
their use, thus MPI receiving commands do not cause any additional
communication delays. The final optimised parallel algorithm with
distributed memory PA
dpopt
looks as follows:
chan up [1:p] (real edge [0:n+1]);
chan down [1:p] (real edge [0:n+1]);
chan diff (real);
process worker (w = 1 to p) {
  int HEIGHT = n / p;     # assumption: p divides n
  real old [0:HEIGHT+1, 0:n+1], next [0:HEIGHT+1, 0:n+1];
  real mydiff = 0.0, otherdiff = 0.0;
  initialisation of old and next of the allocated part;
  for (itstep = 1 to MAXSTEPS by 2) {    # compute new values
    for [i = 1 to HEIGHT, j = 1 to n]
      next[i,j] = (old[i-1,j] + old[i+1,j] +
                   old[i,j-1] + old[i,j+1]) * 0.25;
    # send border points to the neighbours
    if (w > 1)
      send up [w-1] (next[1,*]);
    if (w < p)
      send down [w+1] (next[HEIGHT,*]);
    # compute new values of the interior points
    for [i = 2 to HEIGHT-1, j = 1 to n]
      old[i,j] = (next[i-1,j] + next[i+1,j] +
                  next[i,j-1] + next[i,j+1]) * 0.25;
    # receive border values
    if (w < p)
      receive up [w] (next[HEIGHT+1,*]);
    if (w > 1)
      receive down [w] (next[0,*]);
    # compute new values of the own border rows
    for [j = 1 to n]
      old[1,j] = (next[0,j] + next[2,j] +
                  next[1,j-1] + next[1,j+1]) * 0.25;
    for [j = 1 to n]
      old[HEIGHT,j] = (next[HEIGHT-1,j] + next[HEIGHT+1,j] +
                       next[HEIGHT,j-1] + next[HEIGHT,j+1]) * 0.25;
  }
  compute maximal difference;
}
The red-black successive over-relaxation method
The Jacobi iterative method converges relatively slowly, due to the
fact that the values of distant points influence any given grid point
only slowly. For example, it is necessary to carry out about n/2
iterations before the border values begin to influence the points in
the middle of the matrix data elements. The Gauss-Seidel method
(GSM) converges faster, while using less storage space, because the
new point values are computed on the basis of some already updated
new grid point values as well as the remaining previous values of the
neighbouring grid points. The basis of this mechanism is the movement
through the given grid from left to right and from top to bottom. The
new point values are computed in the same matrix variables as follows:
for [i= 1 to n, j=1 to n]
old[i,j] = (old[i-1,j] + old[i+1,j]+old[i,j-1] +
old[i,j+1]) * 0.25;
Thus, each new point value depends on the freshly computed values above and to the left of the given grid point, and on its previous values from the right and from below. Since the computations use the same matrix variables, no second matrix is needed. The successive over-relaxation method (SOR) is a generalisation of the Gauss-Seidel method. Using the SOR method we can compute the new grid point values, again in the same matrix variables, as follows:
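(The update below is a sketch in the notation of the previous listings; it is not reproduced from the original listing, but it is the form implied by the properties stated next: omega = 1 gives the Gauss-Seidel update, and omega = 0.5 gives one half of the neighbour mean plus one half of the previous value.)

$$old[i,j] = (1-\omega)\,old[i,j] + \omega\,\frac{old[i-1,j] + old[i+1,j] + old[i,j-1] + old[i,j+1]}{4}$$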
The new variable omega is referred to as the relaxation parameter. We choose its value from the interval 0 &lt; omega &lt; 2. If omega equals 1, we obtain the Gauss-Seidel method. If omega equals 0.5, the new grid point value is equal to one half of the arithmetic mean of the neighbouring grid point values plus one half of its previous value. Choosing an appropriate omega value depends on the type of PDE and its boundary conditions.
Although the GSM and SOR converge faster than the Jacobi iterative method and require about half the storage space, it is not easy to parallelise these methods directly. This is because the computation of a new grid point value uses values that have just been updated in the same sweep. In other words, an integral part of the GSM and SOR computations is a built-in sequential computation procedure (the sweep runs from left to right and from top to bottom), so the iterative loops in these algorithms contain data dependencies. To eliminate this dependence, it is necessary to modify the algorithms while preserving their convergence, as follows:

In the first step we colour the grid points using a red/black scheme, as illustrated in Fig. 14.4. We begin in the upper-left corner and colour every second grid point red and the remaining grid points black, in such a way that the red points have only black neighbours and the black points only red neighbours.

In the second step we replace the iterative cycle with two consecutive cycles, in such a way that the first cycle computes new values only for the red points and the second one only for the black points.
We can parallelise this red-black scheme, because the red points have only black neighbours and the black points only red neighbours. Thus, the new values of all the red points can be computed in parallel, since they depend only on the previous values of the black points. Similarly, we compute the black points in a parallel way. After each phase of the computation, synchronisation is needed in order to guarantee that all the red points have new values before the computation of the new values of the black points begins, and vice versa.

Figure 14.4 Red-black point resolution.
This modified algorithm is a parallel program for the Gauss-Seidel method, labelled PA_rb. The principal adjustment divides the iterative computation into two consecutive steps. The other assumptions remain unchanged: p is the number of computing nodes and n is a multiple of p.
real old [0:n+1, 0:n+1];
int HEIGHT = n / p;                          # assumption: p divides n
real maxvar [1:p] = ([p] 0.0);
procedure barrier (int id) {                 # synchronisation barrier
  ...
}
process worker [w = 1 to p] {
  int firstRow = (w-1) * HEIGHT + 1;
  int lastRow = firstRow + HEIGHT - 1;
  int jStart;
  real mydiff = 0.0;
  initialisation of old including borders;
  barrier (w);
  for [itstep = 1 to MAXSTEPS] {
    # compute the new values for the red points
    for [i = firstRow to lastRow] {
      if (i % 2 == 1) jStart = 1;            # odd row
      else jStart = 2;                       # even row
      for [j = jStart to n by 2]
        old [i, j] = (old [i-1, j] + old [i+1, j] +
                      old [i, j-1] + old [i, j+1]) * 0.25;
    }
    barrier (w);
    # compute the new values for the black points
    for [i = firstRow to lastRow] {
      if (i % 2 == 1) jStart = 2;            # odd row
      else jStart = 1;                       # even row
      for [j = jStart to n by 2]
        old [i, j] = (old [i-1, j] + old [i+1, j] +
                      old [i, j-1] + old [i, j+1]) * 0.25;
    }
    barrier (w);
  }
  # compute the max. difference on its part of the iterations,
  # store the value of the max. difference
  maxvar [w] = mydiff;                       # max. difference is max of maxvar [*]
  barrier (w);
}
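The following minimal C sketch (not the book's listing; grid size, boundary values and iteration count are arbitrary examples) shows how one red-black sweep can be parallelised with OpenMP; the implicit barrier at the end of each "omp for" loop plays the role of barrier(w) in the listing above.

  #include <stdio.h>
  #include <stdlib.h>

  #define U(i, j) u[(i) * (n + 2) + (j)]

  /* one red-black Gauss-Seidel sweep on an (n+2) x (n+2) grid */
  static void redblack_sweep(double *u, int n)
  {
      /* colour 0 updates the "red" points, colour 1 the "black" points */
      for (int colour = 0; colour < 2; colour++) {
          #pragma omp parallel for
          for (int i = 1; i <= n; i++) {
              int jstart = ((i + colour) % 2 == 0) ? 2 : 1;
              for (int j = jstart; j <= n; j += 2)
                  U(i, j) = 0.25 * (U(i - 1, j) + U(i + 1, j) +
                                    U(i, j - 1) + U(i, j + 1));
          }   /* implicit synchronisation barrier here */
      }
  }

  int main(void)
  {
      int n = 8;
      double *u = calloc((n + 2) * (n + 2), sizeof *u);
      for (int j = 0; j <= n + 1; j++) U(0, j) = 1.0;   /* one hot border */
      for (int it = 0; it < 100; it++) redblack_sweep(u, n);
      printf("u[n/2][n/2] = %f\n", U(n / 2, n / 2));
      free(u);
      return 0;
  }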
The basic structure of the red-black parallel algorithm PA_rb remains the same as in the Jacobi iterative method PA_sm. The maximal difference is calculated in the same way, but each computation phase computes new values for only half of the points in comparison with the parallel Jacobi algorithm PA_sm. As a result, the outer loop is performed with twice the number of steps.
This requires twice as many barriers, and the added synchronisation delays increase the overall parallel execution time. On the other hand, the faster convergence yields better results for the same value of MAXSTEPS, or comparable results for a smaller value of MAXSTEPS. The parallel algorithms have the same structure as the previously developed parallel algorithms of the Jacobi iterative method (PA_sm, PA_dp and PA_dpopt).

Each parallel process is responsible for computing the new values of the grid points of its allocated matrix part, whereby the neighbouring parallel processes exchange the shared border grid points after each computing phase. Red and black border point values can be exchanged independently, but to minimise communication delays, a combined exchange of the red and black border points is preferable. Because the red and black points are addressed with indices i and j incremented in steps of two, this colouring uses the cache less effectively: the whole strip is traversed in each phase, but each point is either only read or only written, not both. We could improve this by colouring blocks of points, or strips of points respectively. For example, we could divide each strip of a given parallel process in half horizontally and colour the first half red and the second half black. As is evident from the parallel algorithm PA_rb, each parallel process repeatedly computes new values for the red points and then new values for the black points, always with a synchronisation barrier after each computational step. Although only half the number of grid points is computed in each phase, for each of these grid points both reading and writing operations are carried out.
Complex analytical performance modelling of IPA
Basic matrix decomposition models
The supposed efficiency of parallel iterative algorithms (with shared and distributed memory) requires allocating more than one internal element of the square matrix (data element) to each parallel process. In general, an n x n square matrix has n^2 data elements. For the decomposition of the matrix elements into groups of matrix data elements we have, in principle, two decomposition strategies, as follows:
The decomposition model of an n x n matrix into continuous strips of matrix elements. Each strip consists of at least one matrix row or one matrix column. An illustration of a matrix decomposition into p strips (S_1, S_2, ..., S_p) is in Fig. 14.5. In this case the decomposed strips consist of at least one matrix row.

The decomposition model of an n x n matrix into square blocks of matrix elements (one block per parallel process). An illustration of a matrix decomposition into p blocks (B_1, B_2, ..., B_p) is in Fig. 14.6. In this case the decomposed blocks consist of at least four matrix data elements.
Figure 14.5 Decomposition strategy into matrix strips (rows).
Figure 14.6 Decomposition strategy into matrix blocks.
Matrix decomposition into strips
The methods of decomposition into rows or into columns (strips) are algorithmically equivalent; for their practical use it is critical how the matrix elements are laid out in memory. For example, in the C language the array elements are stored row by row (step-by-step building of matrix rows). In this way it is very simple to send a row by specifying its starting address and the number of elements in that row (addressing with indices). For every parallel process (strip), two messages are sent to the neighbouring processors and, in the same way, two messages are received back from them (Fig. 14.7), supposing that one row can be transmitted in one message. The communication time for one calculation step T(s, p)_comms is thus given as:

$$T(s,p)_{comms} = 4\,(t_s + n\,t_w)$$
Figure 14.7 The communication consequences of decomposition into strips (rows).
When these variables are used for the communication overheads of the decomposition into strips, the following holds:

$$T(s,p)_{comm} = T(s,p)_{comms} = h(s,p)_s = 4\,(t_s + n\,t_w)$$

The whole execution time of the parallel algorithm T(s, p) for the decomposition into strips is then given in general as:

$$T(s,p) = \frac{n^2\,t_c}{p} + 4\,(t_s + n\,t_w)$$
In this case the communication time for one calculation step does
not depend on the number of calculation processors in use.
Matrix decomposition into blocks
For the mapping of matrix elements into blocks, inter-process communication is performed across the four neighbouring edges of each block, which must be exchanged during the computation flow. Every parallel process therefore sends four messages and, in the same way, receives four messages at the end of every calculation step (Fig. 14.8), supposing that all the necessary data of one edge are sent in one message.
Figure 14.8 Communication model for decomposition into blocks.
The required communication time for this decomposition method is then given as:

$$T(s,p)_{commb} = 8\left(t_s + \frac{n}{\sqrt{p}}\,t_w\right)$$

This equation is correct for p >= 9, because only under this assumption is it possible to build at least one square block with four communication edges. Using these variables for the communication overheads of the decomposition into blocks, the following holds:

$$T(s,p)_{comm} = T(s,p)_{commb} = h(s,p)_b = 8\left(t_s + \frac{n}{\sqrt{p}}\,t_w\right)$$
The data exchange on all the shared edge points for both decomposition strategies (blocks, strips) is illustrated in Fig. 14.9.

Figure 14.9 Data exchange between processors.
Optimisation of the decomposition method selection
Based on the derived relations for communication complexity, the decomposition into blocks demands a longer communication time, and the decomposition into strips is therefore more effective, if:

$$8\left(t_s + \frac{n}{\sqrt{p}}\,t_w\right) > 4\,(t_s + n\,t_w)$$
or, after rearranging for the technical parameter t_s (this relation is valid under the assumption p >= 9, which is a realistic condition for building a real square block in the developed iterative parallel algorithms):

$$t_s > n\left(1 - \frac{2}{\sqrt{p}}\right) t_w$$

or, expressed for the second technical parameter t_w, as follows:

$$t_w < \frac{t_s}{n\left(1 - \frac{2}{\sqrt{p}}\right)}$$
Fig. 14.10 illustrates the optimisation of the choice of a suitable decomposition method, based on the derived dependences for the threshold of t_s, for n = 256 and the following values of t_w:

t_w = 230 ns = 0.23 µs for a NOW of the type IBM SP-2.
t_w = 2.4 µs for the parallel computer NCUBE-2.

Figure 14.10 Optimisation of the decomposition method.
For values of t_s above the corresponding curve t_si (t_s1 for t_w = 0.23 µs and t_s2 for t_w = 2.4 µs) for n = 256, the decomposition into strips is the more effective method. The threshold values of t_s for the choice of an optimal decomposition strategy are given by n and grow with higher values of t_w. Therefore, in general, decomposition into strips is more effective for higher values of t_s.
The threshold values of t_s for selecting the optimal matrix decomposition model grow with n for higher values of t_w. Therefore, in general, decomposition into strips (rows or columns) is more effective for higher values of t_s (NOW, Grid), and decomposition into blocks for smaller values of t_s (centralised massive parallel computers, supercomputers etc.). Generally, the values of t_s are significantly higher for NOW and Grid than for classical massive parallel computers. For example, a NOW based on FDDI optical cables has t_s = 1100 µs and t_w = 1.1 µs, and an architecture based on Ethernet has t_s = 1500 µs and t_w = 5 µs [97]. In these systems, the matrix decomposition into strips is therefore more effective.
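The following minimal C sketch (illustration only, not part of the original text) evaluates the two derived per-step communication times and selects a decomposition model; the parameter values are just the Ethernet NOW examples quoted above, and n and p are arbitrary.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double ts = 1500.0, tw = 5.0;     /* Ethernet NOW, in microseconds */
      double n = 256.0, p = 16.0;       /* blocks formula assumes p >= 9 */

      double t_strips = 4.0 * (ts + n * tw);              /* T(s,p)comms */
      double t_blocks = 8.0 * (ts + (n / sqrt(p)) * tw);  /* T(s,p)commb */

      printf("strips: %.0f us, blocks: %.0f us -> %s\n",
             t_strips, t_blocks,
             t_strips <= t_blocks ? "decompose into strips"
                                  : "decompose into blocks");
      return 0;
  }

With these example values the program reports that the decomposition into strips is the cheaper one, in agreement with the conclusion above.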
Parallel computational complexity
The parallel computational complexity Z(s, k) of the iterative parallel algorithms IPA (shared and distributed memory) is the same. It is given as the product of the computational complexity of one iteration step Z(s, 1) and the number of iteration steps k. In their computation part, the IPAs carry out n/p iteration steps in the outer loop and n iterations in the inner loop. The total number of iteration steps performed is therefore given as:

$$k = \frac{n}{p}\; n = \frac{n^2}{p}$$

The parallel computational complexity of one iteration step Z(s, 1) is determined by the number of arithmetic operations performed during one iteration step; there are three addition operations, one multiplication operation and one subtraction operation, that is, Z(s, 1) = 5 arithmetic operations. The parallel computational complexity Z(s, k) is thus given as:

$$Z(s,k) = Z(s,1)\,\frac{n^2}{p} = \frac{5\,n^2}{p}$$

The execution time of the parallel computation T(s, p)_comp of the IPAs is the product of the complexity Z(s, k) and the technical parameter t_c (the average delay of an arithmetic operation).
Thus, the parallel computational delay T(s, p)_comp is given as follows:

$$T(s,p)_{comp} = Z(s,k)\;t_c = \frac{5\,n^2\,t_c}{p}$$

For simplicity we will consider t_c1 = 5 t_c as the computational delay of one grid point during one iteration step. Using this adjustment we can write for T(s, p)_comp:

$$T(s,p)_{comp} = Z(s,k)\;t_c = \frac{n^2\,t_{c1}}{p}$$
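As a simple numerical illustration (the value t_c = 0.021 µs is the NOW-class parameter quoted later in this chapter, so t_c1 = 0.105 µs; the choice n = 512 and p = 16 is arbitrary), the computation part of one iteration sweep on each node is approximately

$$T(s,p)_{comp} = \frac{512^2 \cdot 0.105\ \mu\mathrm{s}}{16} \approx 1.72\ \mathrm{ms}.$$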
Complex analytical performance modelling
We have defined the complex performance modelling of IPAs as the derivation of their evaluation criteria, including consideration of the overhead function h(s, p). We summarise the derived analytical results as follows.

The shared results for both decomposition models (blocks and strips):

The execution time of the sequential square matrix algorithm T(s, 1):

$$T(s,1)_{comp} = n^2\,t_{c1}$$

The execution time of the parallel computation of the IPA parallel algorithms T(s, p)_comp:

$$T(s,p)_{comp} = \frac{n^2\,t_{c1}}{p}$$

The optimal conditions for the selection of a matrix decomposition model, expressed for t_s and for t_w respectively:

$$t_s > n\left(1-\frac{2}{\sqrt p}\right)t_w \qquad t_w < \frac{t_s}{n\left(1-\frac{2}{\sqrt p}\right)}$$
The different results for the basic matrix decomposition models (blocks and strips):

The overhead function for blocks h(s, p)_b and for strips h(s, p)_s:

$$h(s,p)_b = 8\left(t_s + \frac{n}{\sqrt p}\,t_w\right) \qquad h(s,p)_s = 4\,(t_s + n\,t_w)$$

The complex parallel execution time for blocks T(s, p)_complexb and for strips T(s, p)_complexs:

$$T(s,p)_{complexb} = T(s,p)_{comp} + T(s,p)_{commb} = \frac{n^2\,t_{c1}}{p} + 8\left(t_s + \frac{n}{\sqrt p}\,t_w\right)$$

$$T(s,p)_{complexs} = T(s,p)_{comp} + T(s,p)_{comms} = \frac{n^2\,t_{c1}}{p} + 4\,(t_s + n\,t_w)$$

The parallel speed-up for blocks S(s, p)_b and for strips S(s, p)_s:

$$S(s,p)_b = \frac{T(s,1)}{T(s,p)_{complexb}} = \frac{n^2\,p\,t_{c1}}{n^2\,t_{c1} + 8\,(p\,t_s + \sqrt p\,n\,t_w)}$$

$$S(s,p)_s = \frac{T(s,1)}{T(s,p)_{complexs}} = \frac{n^2\,p\,t_{c1}}{n^2\,t_{c1} + 4\,p\,(t_s + n\,t_w)}$$

The efficiency for blocks E(s, p)_b and for strips E(s, p)_s:

$$E(s,p)_b = \frac{S(s,p)_b}{p} = \frac{n^2\,t_{c1}}{n^2\,t_{c1} + 8\,(p\,t_s + \sqrt p\,n\,t_w)}$$

$$E(s,p)_s = \frac{S(s,p)_s}{p} = \frac{n^2\,t_{c1}}{n^2\,t_{c1} + 4\,p\,(t_s + n\,t_w)}$$

The constant C (the constant needed when deriving the isoefficiency function w(s)), for blocks C_b and for strips C_s:

$$C_b = \frac{E(s,p)}{1 - E(s,p)} = \frac{n^2\,t_{c1}}{8\,(p\,t_s + \sqrt p\,n\,t_w)} \qquad C_s = \frac{E(s,p)}{1 - E(s,p)} = \frac{n^2\,t_{c1}}{4\,p\,(t_s + n\,t_w)}$$
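The following minimal C sketch (illustration only, not from the original text) evaluates the summary formulas above for one iteration sweep; the NOW-class parameter values are the ones quoted elsewhere in the chapter and are used here purely as an example.

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double n = 512.0, p = 16.0;
      double tc1 = 5.0 * 0.021;                      /* tc1 = 5*tc, in us */
      double ts = 35.0, tw = 0.23;                   /* us */

      double t_seq  = n * n * tc1;                   /* T(s,1)comp */
      double t_comp = n * n * tc1 / p;               /* T(s,p)comp */
      double h_b = 8.0 * (ts + (n / sqrt(p)) * tw);  /* h(s,p)b    */
      double h_s = 4.0 * (ts + n * tw);              /* h(s,p)s    */
      double t_b = t_comp + h_b, t_str = t_comp + h_s;

      printf("blocks: T=%.0f us  S=%.2f  E=%.3f\n",
             t_b, t_seq / t_b, t_seq / (p * t_b));
      printf("strips: T=%.0f us  S=%.2f  E=%.3f\n",
             t_str, t_seq / t_str, t_seq / (p * t_str));
      return 0;
  }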
Fig. 14.11 illustrates the growth of the parallel computing time T(s, p)_comp, the communication time T(s, p)_comm and the complex parallel execution time T(s, p)_complex with growing input load n (square matrix dimension) at a constant number of computing nodes p = 256.

Figure 14.11 Dependencies of T(s, p)_complex, T(s, p)_comm and T(s, p)_comp on n (p = 256).
Fig. 14.12 illustrates the growth of the parallel computing time T(s, p)_comp, the communication time T(s, p)_comm and the complex parallel execution time T(s, p)_complex with an increasing number of computing nodes p at constant input load n (matrix dimension n = 512).

Figure 14.12 Dependencies of T(s, p)_complex, T(s, p)_comm and T(s, p)_comp on p (n = 512).
Isoefficiency functions

The isoefficiency function w(s) is very important for the prediction of parallel algorithms PA. In order to model the prediction
of PAs, we are going to derive, for the defined basic matrix decomposition models (blocks and strips), the corresponding analytical isoefficiency functions w(s)_b (decomposition into blocks) and w(s)_s (decomposition into strips). For the asymptotic complexity of w(s) the following derived relation is valid:

$$w(s) = \max\left[\,T(s,p)_{comp},\; h(s,p)\,\right]$$

where the defined workload s is a function of the input load n; for the IPAs it is given as s = n^2. We have also defined that, for a given value of the efficiency E(s, p), the following quotient is constant:

$$C = \frac{E}{1 - E}$$

Canonical matrix decomposition models

We defined the canonical matrix decomposition models as models to which all other known matrix decomposition models can be reduced. The canonical matrix decomposition models are as follows:

A matrix decomposition model into blocks.
A matrix decomposition model into strips.
For the defined constants C_b (blocks) and C_s (strips), which are integral parts of the isoefficiency functions w(s)_b (blocks) and w(s)_s (strips), we have derived the following relations:

$$C_b = \frac{E(s,p)}{1-E(s,p)} = \frac{n^2\,t_{c1}}{8\,(p\,t_s + \sqrt p\,n\,t_w)} \qquad C_s = \frac{E(s,p)}{1-E(s,p)} = \frac{n^2\,t_{c1}}{4\,p\,(t_s + n\,t_w)}$$

To obtain a closed form of the isoefficiency functions w(s)_b and w(s)_s, we used an approach in which we first analysed the growth of the input load caused by the part of the analysed expressions containing t_s in relation to p. To isolate this growth we therefore suppose that t_w = 0. Thus, for the constants C_b and C_s we get the following expressions:

$$C_b = \frac{n^2\,t_{c1}}{8\,p\,t_s} \qquad C_s = \frac{n^2\,t_{c1}}{4\,p\,t_s}$$

From these expressions we can derive, for the searched functions w(s)_b = w(s)_s = n^2, the following relations:

$$w(s)_b = n^2 = 8\,C_b\,p\,\frac{t_s}{t_{c1}} \qquad w(s)_s = n^2 = 4\,C_s\,p\,\frac{t_s}{t_{c1}}$$

With a similar approach we can analyse the influence of the growth of the input load caused by the other part of the expressions, containing t_w, in relation to p. To isolate this growth we therefore suppose that t_s = 0. Thus, after substituting and performing the necessary adjustments, we get for the searched functions w(s)_b and w(s)_s the following relations:

$$w(s)_b = n^2 = 8\,C_b\,\sqrt p\,n\,\frac{t_w}{t_{c1}} \qquad w(s)_s = n^2 = 4\,C_s\,n\,p\,\frac{t_w}{t_{c1}}$$

The final derived analytical functions w(s)_b and w(s)_s are as follows:

$$w(s)_b = \max\left[\frac{n^2\,t_{c1}}{p},\; 8\,C_b\,p\,\frac{t_s}{t_{c1}},\; 8\,C_b\,\sqrt p\,n\,\frac{t_w}{t_{c1}}\right]$$

$$w(s)_s = \max\left[\frac{n^2\,t_{c1}}{p},\; 4\,C_s\,p\,\frac{t_s}{t_{c1}},\; 4\,C_s\,n\,p\,\frac{t_w}{t_{c1}}\right]$$
The optimisation of isoefficiency functions

The optimisation of the derived isoefficiency functions requires searching for the dominant expressions in the derived final relations for w(s)_b and w(s)_s. For this purpose we have compared the individual expressions of w(s)_b and w(s)_s, with the following conclusions:

The first expressions of w(s)_b and w(s)_s are the same, and this expression is therefore a component of the final optimised w(s)_opt. At the same time, an asymptotic analysis in relation to the parameter p gives the following limit for this expression:

$$\lim_{p\to\infty} \frac{n^2\,t_{c1}}{p} = 0$$

and therefore we can omit these identical first expressions of the isoefficiency functions w(s)_b and w(s)_s when searching for w(s)_opt.

The second expressions of w(s)_b and w(s)_s are similar; after omitting the first expressions according to the previous conclusion, their comparison is as follows:

$$8\,C_b\,p\,\frac{t_s}{t_{c1}} \;\ge\; 4\,C_s\,p\,\frac{t_s}{t_{c1}}$$

This condition, after reducing the shared expression parts, leads to the inequality 2 C_b >= C_s. After substituting and making the adjustments, we get the final condition p >= 1, which is valid over the whole range of considered values of the parameter p. This comparison therefore yields the more dominant expression, and it leaves three expressions for the next comparisons (two from w(s)_b and one from w(s)_s).

In an analogous way we compare the third expressions of the original isoefficiency functions w(s)_b and w(s)_s, which are as follows:

$$4\,C_s\,n\,p\,\frac{t_w}{t_{c1}} \;\ge\; 8\,C_b\,\sqrt p\,n\,\frac{t_w}{t_{c1}}$$

This condition, after a reduction of the shared expression parts, leads to the inequality C_s \sqrt{p} >= 2 C_b. After substituting and making the adjustments, we get the final condition p >= 1, which is fulfilled over the whole range of the parameter p. With this comparison we have eliminated a further expression, and the intermediate relation for w(s)_opt is as follows:

$$w(s)_{opt} = \max\left[\,4\,C_s\,n\,p\,\frac{t_w}{t_{c1}},\; 8\,C_b\,p\,\frac{t_s}{t_{c1}}\,\right]$$
A final comparison of the remaining expressions leads to the following expression comparison:

$$4\,C_s\,n\,p\,\frac{t_w}{t_{c1}} \;\ge\; 8\,C_b\,p\,\frac{t_s}{t_{c1}}$$

This condition, after a reduction of the shared expression parts, leads to the inequality C_s n t_w >= 2 C_b t_s. After substituting and making the adjustments we come to the inequality n^2 t_w^2 >= \sqrt{p}\, t_s^2. We are only able to solve this inequality for concrete values of the parameters n, p, t_s and t_w. For example, when using the parameter values t_s = 35 µs and t_w = 0.23 µs, and under the assumption, frequently valid in practice, of choosing n = p, we get the simpler condition n\sqrt{n} >= 152.17^2, or p\sqrt{p} >= 152.17^2. The smallest integer number which satisfies this condition is p = n = 813.
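As a check of this threshold, using only the parameter values quoted above:

$$n\sqrt n \;\ge\; \left(\frac{t_s}{t_w}\right)^2 = \left(\frac{35}{0.23}\right)^2 \approx 152.17^2 \approx 23\,156, \qquad 813\sqrt{813}\approx 23\,181 \ \ge\ 23\,156, \qquad 812\sqrt{812}\approx 23\,139 < 23\,156 .$$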
When this condition is satisfied for n or p, the final isoefficiency function w(s)_opt is given by its first expression; otherwise it is given by the second expression of the following final optimised isoefficiency function w(s)_opt:

$$w(s)_{opt} = \max\left[\,4\,C_s\,n\,p\,\frac{t_w}{t_{c1}},\; 8\,C_b\,p\,\frac{t_s}{t_{c1}}\,\right]$$
Conclusions of isoefficiency functions

Thus, for a given concrete value of E(s, p) and for given values of the parameters p and n, we can analyse the thresholds at which the growth of the isoefficiency functions means a decrease in the efficiency of the given parallel algorithm with the assumed typical decomposition strategies, that is, a lower scalability of the assumed algorithm. For other decomposition strategies the approach is similar to the analysed, practically used matrix decomposition strategies.

Based on an analysis of the computer technical parameters t_s, t_w and t_c, for common parallel computers the inequalities t_s >> t_w > t_c are valid. Similarly, it is valid that p <= n. Using these inequalities it is necessary to analyse the dominant influence of all the derived expressions.

Thus, the asymptotic isoefficiency function is limited through the dominant conditions of the second and third expressions. From their comparison it follows that:

Based on the realistic condition t_w <= t_s, the third expression is bigger than or equal to the second expression, and the isoefficiency function is limited through the first expression of w(s)_opt. If we use the technical parameters t_s = 35 µs and t_w = 0.23 µs, this is true for n >= 813.

For n < 813 and for the same technical constants t_s = 35 µs and t_w = 0.23 µs, the isoefficiency function is limited through the second expression of w(s)_opt.
Chosen results
We will now illustrate some of the chosen verified results. For the experimental testing we used workstations of NOW parallel computers (workstations WS 1 to WS 5) and supercomputers as follows:
WS 1: Pentium IV (f = 2.26 GHz).
WS 2: Pentium IV Xeon (2 processors, f = 2.2 GHz).
WS 3: Intel Core 2 Duo T7400 (2 cores, f = 2.16 GHz).
WS 4: Intel Core 2 Quad (4 cores, f = 2.5 GHz).
WS 5: Intel Sandy Bridge i5 2500S (4 cores, f = 2.7 GHz).
Supercomputer Cray T3E as a remote computing node.
A comparison of the influence of the decomposition model (D1 - blocks, D2 - strips) is illustrated in Fig. 14.13. For the comparison, the measured values were recomputed to one iteration step. The performed measurements proved the higher efficiency of the decomposition model into blocks for the tested parallel computer Cray T3E. The technical parameters of the parallel computer Cray T3E were t_s = 3 µs, t_w = 0.063 µs, t_c = 0.011 µs.
Figure 14.13 Comparison of T(s, p)_complex for the decomposition models (n = 256).
Fig. 14.14 illustrates the dependencies for the optimal selection of a decomposition strategy on the technical parameter t_si (t_s1, t_s2), using the verified technical parameters of the supercomputer Cray T3E, for t_w = 0.063 µs and n = 128, 256.

In Fig. 14.15 we present the measured results of the whole solution times of both developed parallel algorithms (Jacobi, Gauss-Seidel) with the number of processors p = 8 and for various values of the input workload n (matrix dimensions), for E = 10^-5.
Figure 14.14 Influences of t_s for n = 128, 256.

Figure 14.15 Comparison of T(s, p)_complex for the Jacobi and Gauss-Seidel IPAs for E = 10^-5.
From the comparison of these measurements it follows that for a high number of workstations (p = 8) all the solution times are approximately the same. The reason is that the lower computational complexity of the Gauss-Seidel method is compensated by the greater communication complexity of its parallel algorithm, which is almost twice that of the Jacobi IPA.
Figure 14.16 Percentage comparison of T(s, p)_complex for its components (E = 0.001).

Figure 14.15 illustrates the continuous spread of the individual overheads in terms of percentages (initialisation, computation, communication, gathering) for the Jacobi parallel algorithm, with the given number of workstations p = 4, for various values of the workload n (matrix dimensions) and for the accuracy E = 0.001. From these comparisons we can see a rising trend of the computation share, independently of the accuracy E.

In general, for problems with an increasing communication complexity, when using a high number of processors p based on an Ethernet NOW we reach a threshold beyond which parallel computing is no longer effective, meaning there is no speed-up. It is therefore very important, for any given problem, parallel algorithm and parallel computer, to find such a threshold (no speed-up).

The individual parts of the entire parallel execution time are illustrated in Fig. 14.16 for the Jacobi iterative parallel algorithm on four workstations and for E = 0.001.
The influence of the input workload, at the given accuracy E = 0.001, on the individual parts of the entire solution time of the Jacobi iterative parallel algorithm for various sizes of the input workload (matrix dimensions from 64 x 64 to 512 x 512) is illustrated in Fig. 14.17 for the number of workstations p = 4. From these comparisons results a rising percentage share of the computation (and thus the potential for parallel speed-up through a higher number of workstations), with only a moderate rise in the percentage of the network communication overheads.
Fig. 14.18 illustrates the times of the individual parts of the whole solution time as a function of the input workload n (square matrix dimension) for the number of workstations p = 4 and for the accuracy E = 0.001. From a comparison of these two graphs results a higher contribution of the increased number of workstations than of the rising overheads of network communication.
Figure 14.17 Comparison of T(s, p)_complex parts for p = 4 (E = 0.001). The data labels of the chart (Jacobi, 4 workstations, E = 0.001) are:

Matrix dimension n    64      128     256     384     512
Initialisation        0.025   0.035   0.068   0.12    0.21
Computation           0.018   0.063   0.403   1.598   2.838
Communication         0.243   0.248   0.39    0.613   0.448
Collection            0.003   0.01    0.073   0.425   0.75
Figure 14.18 Influence of computing nodes on T(s, p)_comm (E = 0.001).
The derived analytical isoefficiency functions also allow us to predict the performance of parallel computers, including theoretical, not yet existing ones. In Fig. 14.20 we have illustrated the isoefficiency functions for the individual constant values of efficiency (E = 0.1 to 0.9) for n < 152, using the published technical parameters t_c, t_s and t_w of a NOW (t_c = 0.021 µs, t_s = 35 µs, t_w = 0.23 µs).

Fig. 14.19 illustrates the influence of the number of NOW workstations on the quicker solution of the developed distributed parallel algorithms (here the Gauss-Seidel parallel algorithm) for the matrix dimensions 512 x 512 and the various analysed accuracies Epsilon.
Figure 14.19 Influence of the number of workstations on the Gauss-Seidel IPA.

Figure 14.20 Isoefficiency functions w(s) for n < 152.
Fig. 14.21 illustrates the isoefficiency functions for the individual constant values of efficiency (E = 0.1 to 0.9) for n = 1024 and for the communication parameters of the parallel computer Cray T3E (t_c = 0.011 µs, t_s = 3 µs, t_w = 0.063 µs).

From both graphs (Figs. 14.20 and 14.21) we can see that, to keep a given value of efficiency, we need to increase the number of computing processors step by step together with a higher value of the workload (useful computation), in order to balance the higher communication overheads.

Figure 14.21 Isoefficiency functions w(s) (n = 1024).
Part IV:
The Experimental Part
15
Performance Measurement of PAs
Direct performance measurement methodology for MPAs
The methodology of the direct measurement principle is based on performance measurements of the parallel computer architecture and of the developed parallel algorithms (application software) [23, 95]. The hierarchy of the technical (hardware) and program (software) equipment is illustrated in Fig. 15.1.
Figure 15.1 The hierarchy of technical and program equipment (applications software, systems software, hardware).
The measurements, developed specifically for parallel algorithms (shared and distributed memory), allow the verification of the established theoretical results and of the impact of the architecture on the performance of parallel computers. The principle of the experimental measurements is illustrated in Fig. 15.2.

Another reason for applying direct measurements is that it is very difficult to obtain analytical relations for modelling the complexity (performance) of parallel algorithms.
Figure 15.2 Direct performance measurement.
Performance measurement of PAs
Performance measurement of PAs with a shared memory
The direct performance measurement of PAs with a shared memory
requires the proposal of a methodology for measuring one process
(sequential) as shown in Fig. 15.3, and multiple parallel processes for
SMP parallel computers with a shared memory as shown in Fig. 15.4.
Performance measurement of MPAs with a distributed memory
A measurement flow diagram for parallel computers with a distributed memory (NOW, Grid) is illustrated in Fig. 15.5.
Figure 15.3 Flow measurement diagram for one process.
A comparison of the flow diagrams in Figs. 15.4 (SMP) and 15.5 (NOW, Grid) shows that the measurement principles are very similar. The basic difference is in the form of the inter-process communication mechanisms.
Performance measurement of PAs in NOW and Grid
Developed parallel algorithms with a distributed memory (NOW,
Grid) are divided into two cooperating parts:
Figure 15.4 Flow diagram for an SMP.

The control program (manager).
The service program (worker, service, client).

The manager program registers the computing nodes at the start of the run time of the parallel computation; it allocates the parallel processes to the computing nodes (workers); it establishes the communication connections; and it initiates the execution of the parallel processes on the computing nodes of the given parallel computer. At the end it summarises the results obtained from the individual computing nodes. The worker program waits for the start of its allocated parallel process with the parameters assigned by the manager program. After completing the parallel
algorithm, the worker program sends the computed results to the manager control program, including the execution delay of the performed parallel process on the given computing node.
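The following minimal C/MPI sketch (not from the original text; the payload layout and message contents are arbitrary examples) shows the manager/worker split described above: rank 0 acts as the manager, the remaining ranks as workers, and each worker returns its partial result together with its measured execution time.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                       /* manager */
          for (int w = 1; w < size; w++) {
              double res[2];                 /* result, execution time */
              MPI_Recv(res, 2, MPI_DOUBLE, w, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("worker %d: result %g, time %g s\n", w, res[0], res[1]);
          }
      } else {                               /* worker */
          double t0 = MPI_Wtime();
          double partial = 0.0;              /* ... allocated parallel process ... */
          double res[2] = { partial, MPI_Wtime() - t0 };
          MPI_Send(res, 2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }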
Measurement delays
For the experimental measurements of PA delays we can use the available services of the parallel development environment in use (MPI services, Win32 API, Win64 API, etc.). For example, in order to measure the execution times of parallel processes we use the following functions of the Win32 and Win64 API:
Figure 15.5 Flow diagram for NOW and Grid.
QueryPerformanceCounter, which returns the actual value of the counter.
QueryPerformanceFrequency, which returns the counting frequency per second.

The values of both functions depend on the computing nodes in use. By using the above time measurement functions we can obtain execution times with high accuracy; for example, for common Intel Pentium processors it is 0.0008 ms, which is sufficient for the analysis of the times of PAs.
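The following minimal C sketch (not from the original text) illustrates one way of timing a code section with the two Win32 counter functions named above; the measured section and the output format are placeholders.

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      LARGE_INTEGER freq, start, stop;

      QueryPerformanceFrequency(&freq);   /* counter ticks per second */
      QueryPerformanceCounter(&start);    /* start of the measured section */

      /* ... measured parallel process or computing step ... */

      QueryPerformanceCounter(&stop);     /* end of the measured section */

      double elapsed_ms = (double)(stop.QuadPart - start.QuadPart)
                          * 1000.0 / (double)freq.QuadPart;
      printf("Measured time = %.4f ms\n", elapsed_ms);
      return 0;
  }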
If necessary, we can define functions to measure the time between two points of the performed parallel algorithms or parallel processes. An example of pseudo code to measure the time between two points T1 and T2 is as follows:

  ...
  T1: time (&t1);                          /* start of time measurement */
  ...
  T2: time (&t2);                          /* end of time measurement */
  ...
  measured_time = difftime (t2, t1);       /* measured time = t2 - t1 */
  ...
  printf ("Measured time = %5.2f s\n", measured_time);
  ...
This illustrated approach to time measurement is universal, so we can use it for the measurement of parallel algorithms, for the measurement of the overheads of parallel processes, or for measuring the delays that are typical when establishing the technical parameters of parallel computers.
Measurements on parallel computers
Measurements of SMPs
To illustrate the crucial influence of the decomposition model on parallel algorithm performance, we have taken measurements of the analysed parallel multiplication for two potential decomposition models, which we analysed in the theoretical part of PA. The aim of these measurements was to show the critical impact of the choice of the optimal decomposition model on performance. To maximise the difference between the two decomposition models D1 and D2, not only is the number of computing nodes p changed in every measurement, but also the input load, defined as s = n^2, is chosen in such a way that the parallel speed-up S(s, p) of the decomposition model D1 does not fall, in any measurement, below the speed-up value of the previous measurement, and at the same time the condition S(s, p) > 1 remains valid. The measurements were taken on an eight-core parallel Intel Xeon computer (2 Intel Xeon 5335 quad-core processors).

The first decomposition model (D1) corresponds to standard parallel multiplication. The second, alternative decomposition model (D2) corresponds to an alternative parallel multiplication, in which, to obtain the final elements of the matrix product C, alongside the multiplication of the corresponding elements of the matrices A and B it is also necessary to process the intermediate generated results. This additional computational complexity is proportional to the input sizes of the matrices A and B, which results in the comparison of the time complexity shown in Fig. 15.6.
Measurements on parallel computers in the world
After verifying the derived theoretical results on the available parallel computers, we made some comparisons using massive parallel computers for high performance computing (HPC). For this purpose we used some of the HPC centres in the European Union (EU HPC): EPCC Edinburgh, Cesca Barcelona, CINECA Bologna, IDRIS Paris, SARA Amsterdam and HLRS Stuttgart [100]. The measurements
were taken via remote access to a suitable parallel computer, or directly at one of these HPC centres in the EU. The methodology of these experimental measurements by means of remote computing is illustrated in Fig. 15.7.

Figure 15.6 Comparison of the time complexity of matrix parallel multiplication.

Figure 15.7 Basics of remote measurement.
Measurements of NOW
For the direct performance measurement of complex performance evaluation using a NOW, we have used the NOW structure shown in Fig. 15.8.

Figure 15.8 Measurement on a NOW (Ethernet network).
The measurement types of the developed PAs for NOW parallel computers are divided as follows:

Performance testing (calibration) of the workstations for various input loads. We used the results for the load balancing of the individual NOW workstations.

The measurement of the delays of the parallel computation T(s, p)_comp, the communication delays T(s, p)_comm and other essential overhead delays dependent on the input load, for various values of the important parameters under consideration, as follows:

Complex measurements of the execution time of the developed parallel algorithm T(s, p)_complex, dependent on the input load s, the number of computing nodes p and so on.
Complex measurements of the individual delay parts of the developed parallel algorithm T(s, p)_complex:
Computation time T(s, p)_comp.
Communication delay T(s, p)_comm.
Other essential overhead delays.

Measurements of the technical parameters of parallel computers:
Average delay of an arithmetic operation t_c.
Communication technical parameters t_s and t_w.

Measurement of the performance verification criteria of PAs
Fig. 15.9 illustrates a verification of the limiting character of the computation execution time T(s, p)_comp for two defined parallel complex problem sizes, n = 256 and 512 (square matrix dimensions), dependent on the increasing number of computing nodes p. This limit for T(s, p)_comp is in accordance with the conclusions of Part II, as follows:

$$\lim_{p\to\infty} T(s,p)_{comp} = \lim_{p\to\infty}\frac{n^2\,t_{c1}}{p} = 0$$

Figure 15.9 Illustration of the limited behaviour of T(s, p)_comp.
Fig. 15.10 illustrates the dependence of the parallel speed-up growth S(s, p) on the increasing size of the parallel computer (the number of computing nodes p) for the input problem size n = 256 (matrix dimension) for the following decomposition models:

Decomposition model into blocks (D1).
Decomposition model into strips (rows and columns) (D2).

Fig. 15.11 illustrates the effect of the parallel computer architecture on the efficiency E(s, p) of a parallel algorithm carried out with the same decomposition model, supposing the parallel computers have the following technical parameters:

Efficiency E1 = E1(s, p) corresponds to a NOW parallel computer with the following supposed technical parameters: t_c = 4.2 ns, t_s = 35 µs, t_w = 0.23 µs [97].

Efficiency E2 = E2(s, p) corresponds to a parallel computer with the following technical parameters: t_c = 11 ns, t_s = 3 µs, t_w = 0.063 µs (for example the Cray T3E supercomputer).

Fig. 15.12 illustrates the dependence of the efficiency E(s, p) on the increasing size of the parallel system p (number of processors) for the problem size n = 256 (matrix size) for the decomposition model into blocks (D1) and the decomposition model into strips (rows and columns) (D2).

Figure 15.10 Influence of the decomposition models on S(s, p).
Figure 15.11 Influence of the architectures on E(s, p).

Figure 15.12 Influence of the decomposition models on E(s, p).
Isoefficiency functions of PAs
The optimised isoefficiency function represents a common function for both analysed decomposition models (blocks and strips) and is as follows:

$$w(s)_{opt} = \max\left[\,8\,K\,p\,\frac{t_s}{t_c},\; 4\,K\,n\,p\,\frac{t_w}{t_c}\,\right]$$

Substituting the technical parameters of a given parallel computer, we get the mode of w(s)_opt, that is, the dominance of its individual terms. In the theoretical part we used the supposed technical parameters of NOW parallel computers and derived the boundary conditions depending on n. Fig. 15.13 illustrates the isoefficiency functions for a whole range of efficiency values E(s, p) from 0.1 to 0.9 and for n < 152 (the dominance of the first term of w(s)_opt), for the supposed technical parameters t_s = 35 µs, t_w = 0.23 µs, t_c = 0.021 µs.
Figure 15.13 Isoefficiency functions for n < 305 (n = 256).
Fig. 15.14 illustrates the isoefficiency functions for different constant values of efficiency E(s, p) (E = 0.1 to 0.9) for n = 1024, using the dominant second term of w(s)_opt and the supposed technical parameters t_s = 35 µs, t_w = 0.23 µs, t_c = 0.021 µs.
Figure 15.14 Isoefficiency functions for n >= 305 (n = 512).
16
Measuring Technical Parameters
Specifications of measurements
The measurements serve to verify the supposed experimental findings, to make the derived theoretical results more precise, or to establish more precise values of the technical parameters, as follows:

The average time of a computing operation t_c, which is a constant for any given parallel computer.
The communication technical parameters t_s and t_w, which are constants for any given parallel computer.
Technical parameters for the average time of computer
operations
To find the technical parameter t_c (the average time of performing a computing operation) we have defined the following selection criteria for a suitable parallel algorithm:

Proportional representation of all arithmetic operations, namely addition, subtraction, multiplication and division, with accurate values of their numbers.
A sufficiently representative number of performed arithmetic operations for their statistical evaluation.
Simple applicability to various parallel computers.
Elimination of the influence of inter-process communication (IPC).

Based on these criteria, we have chosen the computational part of the algorithm (sequential, parallel) of the Gauss elimination method (GEM), based on the decomposition into a lower triangular matrix. The reasons for this selection are based on the following facts:

The decomposition of matrix parallel algorithms (MPA) arises in this first step through the matrix decomposition into strips (rows and columns).

Parallel computers have varied numbers of computing nodes p (processors). Different numbers of computing nodes p have a considerable influence on the complexity (performance) of the computation part and on the complexity of the inter-process communication (IPC) (latencies). In general, if p = n, a parallel computer executes the first step on all the decomposed strips (rows and columns) in a parallel way. For n > p, a parallel computer with p computing nodes executes this first step repeatedly on the p computing nodes. The repeated computations increase the complexity of the computations and communications.

The computing nodes of any given HPC parallel computer (massive supercomputers) are based on powerful processors with the same performance.
The computing complexity of GEM
To find and verify its accuracy we have used the first step of the GEM parallel algorithm for solving a system of linear equations (SLR), based on its decomposition into a lower triangular matrix, which is applied on every processor of the computing nodes of the given parallel computer. The pseudocode of the first step of the GEM matrix parallel algorithm is as follows:
for k = 1 ... m:
  # find the pivot of the column:
  i_max := argmax (i = k ... m, abs (A [i, k]))
  if A [i_max, k] = 0
    error "matrix is singular!"
  swap rows (k, i_max)
  # do it for all rows below the pivot:
  for i = k + 1 ... m:
    # do it for all remaining elements in the current row:
    for j = k + 1 ... n:
      A [i, j] := A [i, j] - A [k, j] * (A [i, k] / A [k, k])
    # fill the lower triangular matrix with zeros:
    A [i, k] := 0
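The following minimal C sketch (not the book's listing) implements Gaussian elimination with back substitution and simple operation counters; pivoting is left out because it adds no arithmetic operations, and the test system is an arbitrary diagonally dominant example.

  #include <stdio.h>

  #define N 15

  static long n_div, n_mul, n_add;

  /* solve A x = b in place; A is N x N, row-major */
  static void gem_solve(double A[N][N], double b[N], double x[N])
  {
      for (int k = 0; k < N - 1; k++)              /* forward elimination */
          for (int i = k + 1; i < N; i++) {
              double m = A[i][k] / A[k][k];               n_div++;
              for (int j = k + 1; j < N; j++) {
                  A[i][j] -= m * A[k][j];                 n_mul++; n_add++;
              }
              b[i] -= m * b[k];                           n_mul++; n_add++;
          }
      for (int i = N - 1; i >= 0; i--) {           /* back substitution */
          double s = b[i];
          for (int j = i + 1; j < N; j++) {
              s -= A[i][j] * x[j];                        n_mul++; n_add++;
          }
          x[i] = s / A[i][i];                             n_div++;
      }
  }

  int main(void)
  {
      static double A[N][N], b[N], x[N];
      for (int i = 0; i < N; i++) {                /* diagonally dominant test system */
          for (int j = 0; j < N; j++) A[i][j] = (i == j) ? N : 1.0;
          b[i] = 1.0;
      }
      gem_solve(A, b, x);
      printf("n = %d: %ld div, %ld mul, %ld add -> total %ld\n",
             N, n_div, n_mul, n_add, n_div + n_mul + n_add);
      return 0;
  }

For n = 15 the counters give 120 divisions and 1 225 multiplications and 1 225 additions, i.e. a total of 2 570 operations, which agrees with the operation counts and with Table 16.1 below.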
The parallel algorithm of the Gauss elimination method for solving an SLR with n equations performs the following numbers of arithmetic operations:

n(n+1)/2 division operations of the parallel process (IPP_div),
(2n^3 + 3n^2 - 5n)/6 multiplication operations of the parallel process (IPP_mul),
(2n^3 + 3n^2 - 5n)/6 addition and subtraction operations of the parallel process (IPP_add).

The computational complexity, when considering the dominant terms (the second and the third), is given as n^3/3 operations; the asymptotic complexity is given as O(n^3) operations. In general, computational complexity is stated as a dimensionless number of instructions or other computing steps.

The total number of performed computational instructions IPC (instructions per computation) is obtained as the sum of the individual numbers of executed division operations IPC_div, multiplication operations IPC_mul and addition operations IPC_add. Under the assumption of the same latency for addition and subtraction, the following holds:
$$IPC_{real} = IPC_{div} + IPC_{mul} + IPC_{add} = \frac{n(n+1)}{2} + \frac{2n^3+3n^2-5n}{6} + \frac{2n^3+3n^2-5n}{6} = \frac{n\,(4n^2+9n-7)}{6}$$

In the derived relation for the computational complexity IPC_real, the second term (IPC_mul) and the third term (IPC_add) are the same. If we ignore the first term (IPC_div) and join the second and third terms, we get a simpler approximation of the computational complexity. After the adjustments, the following is valid for IPC_aprox:

$$IPC_{aprox} = IPC_{mul} + IPC_{add} = 2\,\frac{2n^3+3n^2-5n}{6} = \frac{n\,(2n^2+3n-5)}{3}$$

For further application, Table 16.1 contains the computed values of the performed numbers of instructions IPC (instructions per computation): the asymptotic IPC_asymp, the real IPC_real and the approximated IPC_aprox.
If we know the value of the technical parameter for addition t_add (a widely available technical parameter), then for accurately defining T(s, p)_comp we recompute the numbers of all the other computational instructions into addition instructions with the delay time parameter t_add. During the recomputation we consider the following relations for the delays of the multiplication and division instructions:

t_mul, the delay of a multiplication instruction, whereby approximately t_mul = 3 t_add;
t_div, the delay of a division instruction, whereby approximately t_div = 3 t_add.
Table 16.1 The computed values IPC_asymp, IPC_real and IPC_aprox.

Number of      IPC_asymp   IPC_real            IPC_aprox
equations n    n^3         (4n^3+9n^2-7n)/6    (2n^3+3n^2-5n)/3
15               3 375        2 570               2 450
18               5 832        4 353               4 182
21               9 261        6 811               6 580
24              13 824       10 052               9 752
27              19 683       14 184              13 806
30              27 000       19 315              18 850
33              35 937       25 553              24 981
36              46 656       33 006              32 340
39              59 319       41 782              41 002
42              74 088       51 989              51 086
45              91 125       63 735              62 700
After substituting the defined relations and making the adjustments, we get the final relation for the whole real number of addition-equivalent instructions (IPC_real)_add as follows:

$$\left(IPC_{real}\right)_{add} = \frac{3\,n(n+1)}{2} + \frac{3\,(2n^3+3n^2-5n)}{6} + \frac{2n^3+3n^2-5n}{6} = \frac{8n^3+21n^2-11n}{6}$$

and for (IPC_aprox)_add as follows:

$$\left(IPC_{aprox}\right)_{add} = \frac{3\,(2n^3+3n^2-5n)}{6} + \frac{2n^3+3n^2-5n}{6} = \frac{4n^3+6n^2-10n}{3}$$
For application, Table 16.2 contains the computed values of the performed instruction numbers recomputed to addition instructions, that is (IPC_real)_add and (IPC_aprox)_add, together with their difference, difference_add.
Table 16.2 The computed values (IPC_real)_add, (IPC_aprox)_add and difference_add.

Number of      (IPC_real)_add       (IPC_aprox)_add      difference_add
equations n    (8n^3+21n^2-11n)/6   (4n^3+6n^2-10n)/3    3n(n+1)/2
15                5 260                4 900                 360
18                8 877                8 364                 513
21               13 853               13 160                 693
24               20 404               19 504                 900
27               28 746               27 612               1 134
30               39 095               37 700               1 395
33               51 667               49 984               1 683
36               66 678               64 680               1 998
39               84 344               82 004               2 340
42              104 881              102 172               2 709
45              128 505              125 400               3 105
For practical use we now show the procedure for obtaining the values of the technical parameters t_ci for our tested parallel processors. We then illustrate the applied use of the technical parameter t_c as follows:

Verification of the accuracy of the derived approximate relation IPC_aprox.
Simulated computation of the execution times T(s, p)_comp for the NOW parallel computer and for massive parallel supercomputers.

Process for deriving the technical parameter t_c

Considering the average execution time of performing an instruction, t_c (the technical parameter of the parallel computer), we get the value of the computing delay T(s, p)_comp as follows:

$$T(s,p)_{comp} = IPC_{real}\; t_c = \frac{n\,(4n^2+9n-7)}{6}\; t_c$$

Thus, from the presented relation, for any given measured T(s, p)_comp we are able to compute the average computing delay of an instruction t_c. For this purpose we measured, on various SMP workstations, the execution time values T(s, p)_comp for various values of the input load n. For these measurements of T(s, p)_comp we used the following SMP parallel computers:

P1: Intel Core 2 Duo T7400 (2 cores, f = 2.16 GHz).
P2: Intel Core 2 Quad (4 cores, f = 2.5 GHz).
P3: Intel Sandy Bridge i5 2500S (4 cores, f = 2.7 GHz).

Table 16.3 shows the measured values of T(s, p)_comp dependent on the input problem size n (number of equations). A comparison of the performance of the tested workstations is illustrated in Fig. 16.1.
Table 16.3 Measured values of T(s, p)_comp.

Number of      Core 2 Duo (P1)   Core 2 Quad (P2)   Intel i5 2500S (P3)
equations n    [ms]              [ms]               [ms]
15                1.737             0.622              1.108
18                2.951             1.049              1.880
21                4.625             1.635              2.929
24                6.805             2.402              4.352
27                9.574             3.475              6.085
30               13.038             4.597              8.267
33               17.274             6.256             11.141
36               22.345             7.987             14.193
39               28.328            10.111             18.175
42               35.456            12.477             22.355
45               43.594            15.551             27.661
The measured performance values can also be used for the optimal balancing of the input problem load on the tested workstations or, as in our case, to establish the average instruction delay t_ci (i = 1, 2, 3) of the tested parallel computers. From our measured values of T(s, p)_comp we get, for every value of T(s, p)_comp, the value of the searched parameter t_cj as follows:

$$t_{cj} = \frac{T(s,p)_{comp}}{IPC_{real}} \qquad \text{for } j = 15, 18, \ldots , 45.$$

Then, from the derived values of the parameters t_cj for every measured value of T(s, p)_comp, we compute the searched technical parameter t_ci (i = 1, 2, 3) of the tested parallel computers P_i as an arithmetic mean value according to the following relation:

$$t_{ci} = \frac{\sum_{j=1}^{11} t_{cji}}{11}$$
Figure 16.1 Performance comparison of the workstations.
Table 16.4 The computation of the average delays of the tested computers, t_cj = T(s, p)_comp / IPC_real.

Number of      t_cj1 (P1)   t_cj2 (P2)   t_cj3 (P3)
equations n    [µs]         [µs]         [µs]
15             0.676        0.242        0.431
18             0.678        0.241        0.432
21             0.679        0.240        0.430
24             0.677        0.239        0.433
27             0.675        0.245        0.429
30             0.675        0.238        0.428
33             0.676        0.240        0.436
36             0.677        0.245        0.430
39             0.678        0.242        0.435
42             0.682        0.240        0.430
45             0.684        0.244        0.434
Average t_ci   0.678        0.241        0.433
We use the same procedure to get the average time of the addition operation on every tested SMP parallel computer, using the following relation:

$$t_{cj\,add} = t_{add} = \frac{T(s,p)_{comp}}{(IPC)_{add}} \qquad \text{for } j = 15, 18, \ldots , 45.$$

For the individually tested SMP parallel computers, the searched values of the technical parameters are as follows:

For P1: t_c1 = 0.678 µs, t_add = 0.219 µs.
For P2: t_c2 = 0.241 µs, t_add = 0.078 µs.
For P3: t_c3 = 0.433 µs, t_add = 0.141 µs.

From this mutual comparison we can see that the most powerful workstation is the Intel Core 2 Quad (4 cores, 2.5 GHz, t_c = t_c2 = 0.241 µs, t_add = 0.078 µs).
Applied use of technical parameter t_c

Verification of the accuracy of approximate relations
To verify the accuracy of the derived approximate relation for IPC_aprox,

    IPC_aprox = IPP_mul + IPP_add ≈ (2n³ + 3n² − 5n) / 3

we compare the computing delays T(s,p)_comp obtained from the real number of performed operations IPC_real of the tested parallel computer Core 2 Duo P1 (the reference parallel computer) with the computing delays T(s,p)_comp obtained by using the approximate relation for IPC_aprox and the derived value of the technical parameter t_c1 of the chosen parallel computer P1. The results are shown in Table 16.5, including the evaluated differences, which are expressed as percentages relative to the values of the reference parallel computer P1.
Table 16.5 Accuracy of the approximate relation for the Intel Core 2 Duo.

Intel Core 2 Duo (P1)
Number of      IPC_real · t_c1   IPC_aprox · t_c1   Relative error
equations n         [ms]               [ms]              [%]
15                  1.742              1.661             4.669
18                  2.951              2.835             3.931
21                  4.618              4.461             3.400
24                  6.815              6.612             2.979
27                  9.617              9.360             2.672
30                 13.096             12.780             2.413
33                 17.325             16.937             2.239
36                 22.378             21.926             2.021
39                 28.328             27.799             1.867
42                 35.248             34.636             1.736
45                 43.212             42.511             1.622
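As a quick arithmetic check of the first row of Table 16.5: for n = 15 we have IPC_real = 2570 and IPC_aprox = (2·3375 + 675 − 75) / 3 = 2450, so the number of ignored operations is 2570 − 2450 = 120 = n(n + 1)/2 and the relative error is 120 / 2570 ≈ 4.669 %. Multiplying by t_c1 = 0.678 µs gives 2570 · 0.678 µs ≈ 1.742 ms and 2450 · 0.678 µs ≈ 1.661 ms, in agreement with the tabulated values.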
A graphical comparison of the execution delays T(s,p)_comp is illustrated in Fig. 16.2.
The relative errors of the approximation formula for the tested parallel computer Intel Core 2 Duo (parallel computer P1) are illustrated in Fig. 16.3. The relative errors remain within a range of about five percent, and if we increase n (the number of equations) they decrease, as can be seen from the following approach. We compute the percentage distribution of the arithmetical operations within IPC_real. For n = 15 it is as follows:

    IPC_real = 4.67 % (div) + 47.7 % (mul) + 47.7 % (add)

For n = 45 it is as follows:

    IPC_real = 1.62 % (div) + 49.19 % (mul) + 49.19 % (add)
The percentage share of the division operations decreases as the value of n increases, because the error of the approximation is given by the following relation:

    Error = IPC_real − IPC_aprox = n (n + 1) / 2 = (n² + n) / 2

This relation has the asymptotic complexity O(n²), whereas the asymptotic complexities of IPC_real and IPC_aprox are O(n³), which means that if n increases, the number of ignored operations in IPC_aprox grows more slowly than the number of considered operations.
Figure 16.2 Comparison of T(s,p)_comp computed with IPC_real and with IPC_aprox. [Graph of T(s,p)_comp in ms versus the number of equations n = 15 to 45.]
Related to the relative error is the absolute error, which does not depend on the parallel computer in use.
Simulated performance comparisons of parallel computers
To compare the most powerful tested parallel computer (Intel Core 2 Quad, 4 cores, f = 2.5 GHz, t_c = t_c2 = 0.241 µs) we use representative classic parallel computers with known values of the technical parameters t_c, t_s and t_w, as follows:
The NOW parallel computer IBM SP2 (scalable POWER parallel computer) with technical parameters t_c = 4.2 ns, t_s = 35 µs, t_w = 230 ns. Its innovative successor, the BlueGene/Q, took second place in the performance evaluation of 2012 [97]. The compared values of the technical parameter t_c were recomputed to one computing processor for a relevant comparison.
A classic massive parallel computer, the supercomputer Cray T3E, with technical parameters t_c = 11 ns, t_s = 3 µs, t_w = 0.063 µs, which in 2000 took 29th place on the list of the 500 most powerful parallel computers in the world [97]. The updated model Cray XK7 took first place in 2012. The compared values of the technical parameter t_c were recomputed to one computing node (processor) for a relevant comparison.
Figure 16.3 Relative errors of the approximation relation. [Graph of the relative error in % versus the number of equations n = 15 to 45.]
The computed final values of T(s,p)_comp, obtained using the technical parameter t_c of each machine, are shown in Table 16.6.
Table 16.6 Simulated delays T(s,p)_comp = IPC_real · t_c.

Number of      IBM SP2   Cray T3E   Core 2 Quad
equations n      [ms]      [ms]        [ms]
15              0.0108    0.0283       0.277
18              0.0183    0.0479       0.470
21              0.0314    0.0749       0.732
24              0.0422    0.1106       1.088
27              0.0596    0.1560       1.521
30              0.0811    0.2124       2.066
33              0.1073    0.2811       2.785
36              0.1386    0.3631       3.548
39              0.1755    0.4596       4.543
42              0.2183    0.5719       5.589
45              0.2677    0.7011       6.915
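As an arithmetic illustration of how these simulated delays follow from T(s,p)_comp = IPC_real · t_c (using only the operation count derived earlier): for n = 45, IPC_real = 45 · (4·2025 + 9·45 − 7) / 6 = 63 735 operations, so for the IBM SP2 we get 63 735 · 4.2 ns ≈ 0.268 ms and for the Cray T3E 63 735 · 11 ns ≈ 0.701 ms, in agreement with the last row of Table 16.6.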
The final results of the performed comparison of the most powerful workstation, the Intel Core 2 Quad, with the parallel computer NOW IBM SP2 and the supercomputer Cray T3E are illustrated in Table 16.6. They show a considerable performance disadvantage for the Core 2 Quad. The performance comparisons of these powerful representatives of NOW and supercomputers with our most powerful workstation Intel Core 2 Quad are thus in favour of the classic parallel computers. Although in our comparisons we did not use the latest models of the chosen parallel computers, their computing nodes, despite substantially lower clock frequencies, produced considerable performance differences, as explained by the following facts:
Both of the tested classic parallel computers use a computing node (processor) architecture known as POWER PC (performance optimisation with enhanced RISC), where RISC stands for Reduced Instruction Set Computer. This architecture was developed jointly by Apple, IBM and Motorola. It is the basis of a group of the most powerful personal computers (PCs), whose high performance is achieved through the following technologies:
A higher number of cores (a minimum of four).
A reworked (optimised) RISC instruction set.
High-capacity multilevel cache memories.
Massive superscalar pipeline architectures (parallel execution of a minimum of four instructions).
Processing of long words (a minimum of 64 bits).
High-speed communication systems; for example, Infiniband,
Quadrics and Myrinet [40].
Communication complexity and communication technical parameters

Classic parallel computers
In general, in the theoretical Part II we derived some relations for the performance comparison of decomposition models.
Figure 16.4 Performance comparison of the tested parallel computers. [Graph of T(s,p)_comp versus the number of equations n = 15 to 45 for the IBM SP2, Cray T3E and Core 2 Quad.]
For parallel matrix algorithms we found that a decomposition model into strips (rows and columns) demands lower inter-process communication delays (i.e. it is more effective) than a decomposition model into blocks, under the condition that for blocks p ≥ 9 and that the communication parameter t_s satisfies the following inequality:

    t_s > n (1 − 2/√p) t_w
From this the general conclusion was drawn that the decomposition model into strips (rows and columns) is advantageous for higher values of the technical parameter t_s, i.e. that for such values this decomposition model is more effective. For the applied matrix parallel algorithms (MPAs) the natural allocation is n = p. In the case of large systems of equations, in which n > p, the computing nodes repeatedly perform the necessary algorithm activities for the remaining strips (rows and columns). For example, if we suppose for simplicity that the value of parameter n is divisible by parameter p without remainder, the computing nodes of the parallel computer in use repeatedly perform the necessary algorithm activities k times, until the quotient k = n/p is exhausted. For both complexities (computation and communication) this means a k-fold multiplying factor applied to the complexities derived for n = p. From this fact it is clear that the basic problem is to derive the computational and communication complexity for n = p. By substituting this equality into the relation for the technical parameter t_s, after the necessary adjustments we finally get the following relation:

    t_s > t_w (n − 2√n)
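Two brief checks of this relation (arithmetic only, based on the relations as given above): first, substituting p = n into the strip/block inequality t_s > n(1 − 2/√p) t_w indeed gives t_s > t_w(n − 2√n); second, for t_w = 0.063 µs and n = 256 the threshold is 0.063 · (256 − 2·16) = 0.063 · 224 ≈ 14.11 µs, which is the first entry of Table 16.7.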
In Table 16.7 we have computed the corresponding limit values of t_s for the known values of the communication parameter t_w.
Table 16.7 Optimisation of the decomposition method: computed limit values of t_s.

t_w        t_s1 (n = 256)   t_s2 (n = 512)   t_s3 (n = 1024)
[µs]           [µs]             [µs]              [µs]
0.063          14.11            29.40             60.48
0.070          15.68            32.67             67.20
0.080          17.92            37.34             76.80
0.230          50.60           107.35            220.8
0.440          97.68           205.37            422.4
0.540         119.88           252.04            518.4
1.100         244.20           513.4            1056
2.400         532.80          1072.2            2304
5.000        1110.00          2233.7            4800
Graphs showing the optimal decomposition strategy for the values computed in Table 16.8 are illustrated in Fig. 16.5 (lower values of t_s) and in Fig. 16.6 (higher values of t_s). By using this inequality for t_s we are also able to rearrange it for the other input parameters (t_w, n) and thus compute the remaining ones. For example, expressed for the parameter t_w, the following is valid:

    t_w < t_s / (n − 2√n)
In Table 16.8 we have computed the limit values of t_w for the known values of the communication parameter t_s.
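For example (again pure arithmetic), for t_s = 3 µs and n = 256 the limit is t_w < 3 / 224 ≈ 0.0134 µs, which is the first entry of Table 16.8.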
These limit values, for any given n, govern the optimal selection of a decomposition strategy. Thus the decomposition model into strips is more effective for higher values of t_s (NOW, Grid), and the decomposition model into blocks for lower values of t_s.
Table 16.8 Optimisation of the decomposition method: computed limit values of t_w.

t_s        t_w1 (n = 256,     t_w2 (n = 512,        t_w3 (n = 1024,
[µs]       n − 2√n = 224)     n − 2√n = 466.75)     n − 2√n = 960)
               [µs]               [µs]                  [µs]
3             0.01339            0.0064                0.0031
35            0.1562             0.0750                0.0365
64            0.2857             0.1371                0.0666
77            0.3437             0.1650                0.0802
82            0.3661             0.1757                0.0854
87            0.3884             0.1864                0.0906
154           0.6875             0.3299                0.1604
1150          5.1339             2.4638                1.1979
1500          6.6964             3.2137                1.5625
The NOW parallel computer
For the decomposition models we now analyse their communication complexity in general. The communication function T(s,p)_comm for any given decomposition strategy is principally defined in NOW by the following two parameters:
Figure 16.5 Optimisation of the decomposition methods for smaller values of t_s. [Graph of the limit values t_wi in µs versus t_s = 3 to 87 µs for n = 256, 512 and 1024.]
The communication latency due to parameter t_s (the start-up time).
The communication latency due to parameter t_w (the time needed to transfer one data unit, i.e. one data word).
A typical communication network used for NOW in our country (the Slovak Republic) is the Ethernet architecture. We outlined the communication principles of the Ethernet network in Chapter 5, in Fig. 5.5. The bottleneck of this communication network arises from its serialisation of multiple point-to-point communications, and this remains unchanged even for the collective communication mechanisms, including the collective command known as Broadcast. We will therefore evaluate in an analytical way the relations for all the MPI collective communication commands on the Ethernet.
Evaluation of collective communication mechanisms
For the typical MPI communication mechanisms on the Ethernet network, the following relationships are valid:
MPI commands of data dispersion:
MPI command Broadcast.
MPI command Scatter.
Figure 16.6 Optimisation of the decomposition methods for higher values of t_s. [Graph of the limit values t_wi in µs versus t_s = 154, 1150 and 1500 µs for n = 256, 512 and 1024.]
MPI commands of data collection:
MPI command Gather.
MPI command Allgather.
MPI command Reduce.
The collective communication mechanism type Broadcast
The collective communication mechanism Broadcast is the only collective communication mechanism that can also be effective in the Ethernet communication network. In this case the transmission of one and the same data unit (byte, word) to all the other p computing nodes (processors) of the Ethernet network is shown in Fig. 5.5. The communication complexity T(s,p)_comm of this command is O(1); expressed through the established communication parameters t_s and t_w it is:

    T(s,p)_commbr = t_s + t_w
To transmit m different data units, again always to only one processor (point-to-point), the communication complexity is O(m); expressed through the established communication parameters it is given by the following formula:

    T(s,p)_commbr = m (t_s + t_w)

For the transmission of the different data units to the remaining p − 1 processors, the communication complexity is O(p); with the support of the established communication parameters it is:

    T(s,p)_commbr = Σ_{i=1..p−1} m (t_s + t_w)
The communication delays are the same for any other collective communication mechanism on the Ethernet communication network. The common communication complexity relationship T(s,p)_commEth is therefore determined as:

    T(s,p)_commEth = Σ_{i=1..p−1} m (t_s + t_w) = (p − 1) m (t_s + t_w)
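As a small numerical illustration of this relation (an illustrative calculation only, using the Ethernet parameter values quoted in the next paragraph; the choice of p = 8 workstations and m = 1 is an arbitrary assumption): with t_s = 1500 µs and t_w = 5 µs, distributing one data unit to the remaining p − 1 = 7 workstations costs about 7 · (1500 + 5) µs ≈ 10.5 ms, which shows that on Ethernet the start-up parameter t_s dominates the collective communication delay.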
In general, the value of the parameter t_s is highly significant for the asynchronous parallel computers NOW and Grid. For example, for a NOW parallel computer based on fibre optic cables these communication parameters have the concrete values t_s = 1100 µs and t_w = 1.1 µs. For Ethernet communication networks these parameters are even higher: t_s = 1500 µs and t_w = 5 µs. The measurements of these communication parameters that we carried out in our home conditions (DTI Dubnica and the University of Zilina in the Slovak Republic) on an unloaded Ethernet network gave values higher than those specified above. The causes of the substantial differences from the classic parallel computers are mainly the following:
Built-in sophisticated technical support for block data transmission based on direct memory access (DMA) for the collective communication mechanisms.
Multiple transmission channels based on the multi-stage structure of the communication networks.
High-speed communication network switches known as HPS (High Performance Switch).
High-speed communication networks in use, such as Infiniband, Quadrics and Myrinet [40]. This is also obvious from the annually published analyses of the most powerful parallel computers [97].
17
Conclusions
Performance evaluation as a discipline has repeatedly proved to be critical for the design and successful use of operating systems. At the early design stage, performance models can be used to project a system's scalability and to evaluate design alternatives. At the production stage, performance evaluation methodologies can be used to detect bottlenecks and subsequently suggest ways to alleviate them. Queuing networks and Petri net models, simulation, experimental measurements and hybrid modelling have been successfully used for the evaluation of system components. Via the extended form of the isoefficiency concept for parallel algorithms we illustrated its concrete use to predict the performance of typical applied parallel algorithms. Based on the derived isoefficiency functions for the matrix models, this book addresses the actual role of performance prediction in parallel algorithms.
To derive isoefficiency functions in an analytical way it is necessary to derive all the typically used criteria for the performance evaluation of parallel algorithms, including their overhead functions (parallel execution time, speed-up, efficiency). Based on this knowledge we are able to derive isoefficiency functions as a real criterion for evaluating and predicting the performance of parallel algorithms, even on hypothetical parallel computers. In this way we can say that this process constitutes complex performance evaluation, including performance prediction.
Due to the dominant use of parallel computers based on standard PCs in the form of NOW, and their massive integration known as Grid (the integration of many NOW networks), there has been great interest in the performance prediction of parallel algorithms in order to achieve optimised (effective) parallel algorithms [34, 36].
This book therefore summarises the methods of complexity analysis which can be applied to all types of parallel computer (supercomputers, NOW, Grid). Although the use of NOW and Grid parallel computers is less effective for some parallel algorithms than the massive parallel architectures (supercomputers) in use around the world today, parallel computers based on NOW and Grid nevertheless belong to the category of dominant parallel computers.
Appendix 1
Basic PVM Routines
The following is a collection of PVM routines that is sufficient for
most programs in the text. The routines described here are divided into
preliminaries (those for establishing the environment and related matters),
basic point-to-point message passing, and collective message passing. The
complete set of routines and additional details can be found in the specialized materials for the individual supporting development tools.
Preliminaries
Generally, a negative return value indicates an error.
int pvm_mytid (void)
ACTIONS: Enrolls process in PVM (on first call) and returns task
identification (tid) of process.
PARAMETERS: None
int pvm_spawn (char *task, char **argv, int flag, char *where, int ntask,
int *tids)
ACTIONS: Starts new PVM processes. Returns number of processes started.
PARAMETERS: * tasks file name of process to be started
* * argv arguments to executable
flag spawn options (set to 0 to ignore where)
*where which computer to start process
ntasks no of copies of process to start
*tids array of process tids (returned)
int pvm_parent (void)
ACTIONS: Returns tid of process that spawned calling process.
PARAMETERS: None
int pvm_exit ()
ACTIONS: Tells local pvmd that this process is leaving PVM.
Routine should be present at end of each process.
PARAMETERS: None
Point-to-Point Message Passing
In receive messages, -1 in tid or msgtag matches with anything.
int pvm_initsend (int encoding)
ACTIONS: Clears and prepares send buffer for packing a new
message. Necessary before packing a new message.
Returns message buffer identifier.
PARAMETERS: encoding message encoding method (use
PvmDataDefault for default encoding)
int pvm_psend (int tid, int msgtag, char *buf, int len, int datatype)
ACTIONS: Packs data and sends data in one routine.
PARAMETERS: tid task identifier of destination
msgtag message tag
buf pointer to send buffer
len length of send buffer
datatype type of data:
PVM_STR string
PVM_INT - int
PVM_FLOAT - real
(Others available)
int pvm_precv (int tid, int msgtag, char *buf, int len, int datatype, int *atid, int *atag, int *alen)
ACTIONS: Receives data and unpacks data in one routine. Waits
until message received and then loads message directly
into specified buffer (blocking receive).
PARAMETERS: tid tid of sending process
msgtag message tag
*buf buffer
len buffer length
datatype type of data:
PVM_STR - string
PVM_INT - int
PVM_FLOAT - real
(Others available)
atid tid of sender (returned)
atag msgtag of sender (returned)
alen length of message actually received (returned)
int pvm_pk* ( )
ACTIONS: Packs send message buffer with arrays of data elements
each of the same prescribed data type. Used prior to send
routine [except psend ( )]. Parameters vary depending
upon the datatype. Can be one of many datatypes,
notably.
pvm_pkint (int *array, int nitem, int stride) integers
pvm_pkfloat (float *array, int nitem, int stride) floats
pvm_pkstr (char *array) strings
PARAMETERS: *array array of items being packed
nitem number of items
stride stride through array, 1 for every item in array
int pvm_send (int tid, int msgtag)
ACTIONS: Sends data packed in send buffer (nonblocking).
PARAMETERS: tid tid of destination process
msgtag message tag
int pvm_recv(int tid, int msgtag)
ACTIONS: Waits for message to be received and loads it into receive buffer (blocking receive). Returns receive message buffer identifier.
PARAMETERS: tid tid of sending process
msgtag message tag
int pvm_nrecv(int tid, int msgtag)
ACTIONS: If message received, loads message in new receive buffer;
otherwise routine returns immediately (nonblocking
receive) with a return value of zero. Returns receive
message buffer identifier.
PARAMETERS: tid tid of sending process
msgtag message tag
int pvm_upk* ( )
ACTIONS: Unpacks receive message buffer into arrays of data elements, each of the same prescribed data type. Used after receive routines [except precv ( )]. Parameters vary
depending upon the datatype. Can be one of many
datatypes, notably
pvm_upkint(int *array, int nitem, int stride) integers
pvm_upkfloat(float *array, int nitem, int stride) floats
pvm_upkstr(char *array) strings
PARAMETERS: *array array of items being unpacked
nitem number of items
stride stride through array, 1 for every item in array
int pvm_mcast (int *tids, int ntasks, int msgtag)
ACTIONS: Sends data in active send buffer to set of processes.
(Not strictly point-to-point but does not involve a
named group of processes.)
PARAMETERS: *tids array of destination process tids
ntask number of processes
msgtag message tag
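To show how these point-to-point routines fit together, the following minimal C sketch (an illustration, not code from the text; the executable name "worker" and the message tag are arbitrary assumptions) spawns one worker and exchanges a packed integer with it:

    #include <stdio.h>
    #include "pvm3.h"

    #define MSGTAG 1                     /* arbitrary message tag used by both sides */

    int main(void)
    {
        int mytid = pvm_mytid();         /* enroll this process in PVM */
        int parent = pvm_parent();       /* negative (PvmNoParent) if we are the master */

        if (parent < 0) {                /* master branch */
            int wtid, n = 42, reply;
            pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &wtid);
            pvm_initsend(PvmDataDefault);    /* clear and prepare send buffer */
            pvm_pkint(&n, 1, 1);             /* pack one integer */
            pvm_send(wtid, MSGTAG);          /* nonblocking send */
            pvm_recv(wtid, MSGTAG);          /* blocking receive of the reply */
            pvm_upkint(&reply, 1, 1);
            printf("master received %d\n", reply);
        } else {                         /* worker branch */
            int n;
            pvm_recv(parent, MSGTAG);
            pvm_upkint(&n, 1, 1);
            n = n + 1;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(parent, MSGTAG);
        }
        pvm_exit();                      /* leave PVM */
        return 0;
    }

The same pack/send and receive/unpack pairing can be collapsed into single calls with pvm_psend ( ) and pvm_precv ( ) described above.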
Group routines
Routines in this section involve a group of processes. Processes are
first enrolled in a group and are given an instance number of the named group.
int pvm_joingroup (char *group)
ACTIONS: Enrolls process in named group and returns instance number.
PARAMETERS: *group group name
int pvm_getinst (char *group, int tid)
ACTIONS: Returns instance number of group member.
PARAMETERS: *group group name
tid task identifier
int pvm_gettid (char *group, int inum)
ACTIONS: Returns tid of process identified by group name. Also
instance number returned (inum).
PARAMETERS: * group group name
inum instance number (returned)
int pvm_gsize (char *group)
ACTIONS: Returns number of members of named group.
PARAMETERS: *group group name
int pvm_barrier (char * group, int count)
ACTIONS: Blocks process until number of processes in group have
called it. All processes to call pvm_barrier ().
PARAMETERS: * group group name
count number of group members (-1 for all
members of group)
int pvm_bcast (char *group, int msgtag)
ACTIONS: Broadcasts data in message buffer to all members of group.
PARAMETERS: * group group name
msgtag message tag
int pvm_gather(void*result, void *data, int count, int datatype, int
msgtag, char *group, int rootginst)
ACTIONS: One member of group (root) gathers data from each
member. All processes to call pvm_gather ( ).
PARAMETERS: *result array to hold gathered data
*data data array sent from group member
count number of elements in array
datatype type of array/result
msgtag message tag
*group group name
rootginst instance number of root process
int pvm_reduce (void (*func)( ),void *data, int count, int datatype,
int msgtag, char* group, int root)
ACTIONS: One member of group (root) performs reduce operation
on data from members of group. All processes call
pvm_reduce ( ).
PARAMETERS: *func function
*data data from group member
count number of elements in data array
datatype type of data array
msgtag message tag
*group group name
root instance number of group member
acting as root
int pvm_scatter (void *result, void *data, int count, int datatype, int msgtag, char *group, int rootginst)
ACTIONS: One member of group (root) sends different portion of an
array to each group member, including itself. All processes
call pvm_scatter ( ).
PARAMETERS: *result array for data being received
*data on root only, array being scattered
count number of items to be sent to each process
datatype type of data in array
msgtag message tag
*group group name
rootginst instance number of group member
acting as root
int pvm_lvgroup (char *group)
ACTIONS: Unenrolls process from group.
PARAMETERS: *group group name
Appendix 2
Basic MPI Routines
The following is a collection of MPI routines that is sufficient for
most programs in the text. A very large number of routines are provided in
MPI. As in Appendix A, the routines described here are divided into
preliminaries (those for establishing the environment and related matters),
basic point-to-point message passing, and collective message passing. The
complete set of routines and additional details can be found in concrete
specialized materials to individual supporting developing tools.
Preliminaries
int MPI_Init (int *argc, char **argv[ ] )
ACTIONS: Initializes MPI environment.
PARAMETERS: *argc argument from main ( )
**argv [ ] argument from main ( )
int MPI_Finalize (void)
ACTIONS: Terminates MPI execution environment.
PARAMETERS: None
int MPI_Comm_rank (MPI_Comm comm, int *rank)
ACTIONS: Determines rank of process in communicator.
PARAMETERS: comm communicator
*rank rank (returned)
int MPI_Comm_size (MPI_Comm comm, int *size)
ACTIONS: Determines size of group associated with communicator.
PARAMETERS: comm comm communicator
*size size of group (returned)
double MPI_Wtime(void)
ACTIONS: Returns elapsed time from some point in past, in seconds.
PARAMETERS: None
Point-to-Point Message Passing
MPI defines various datatypes for MPI_Datatype, mostly with
corresponding C datatypes, including
MPI_CHAR signed char
MPI_INT signed int
MPI_FLOAT float
int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
ACTIONS: Sends message (blocking).
PARAMETERS: *buf send buffer
count number of entries in buffer
datatype data type of entries
dest destination process rank
tag message tag
comm communicator
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
ACTIONS: Receives message (blocking).
PARAMETERS: *buf receive buffer (loaded)
count max number of entries in buffer
datatype data type of entries
source source process rank
tag message tag
comm communicator
*status status (returned)
In receive routines, MPI_ANY_TAG in tag and MPI_ANY_SOURCE in
source matches with anything. The return status is a structure with at least
three members:
status -> MPI_SOURCE rank of source of message
status -> MPI_TAG tag of source message
status -> MPI_ERROR potential errors
int MPI_Isend (void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request * request)
ACTIONS: Starts a nonblocking send.
PARAMETERS: * buf send buffer
count number of buffer elements
data type data type of elements
dest destination rank
tag message tag
comm communicator
*request request handle (returned)
Related:
MPI_Ibsend ( ) Starts a nonblocking buffered send
MPI_Irsend ( ) Starts a nonblocking ready send
MPI_Issend ( ) Starts a nonblocking synchronous send
int MPI_Irecv (void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Request *request)
ACTIONS: Begins a nonblocking receive.
PARAMETERS: *buf receive buffer address (loaded)
count number of buffer elements
datatype data type of elements
source source rank
tag message tag
comm communicator
*request request handle (returned)
int MPI_Wait (MPI_Request *request, MPI_Status *status)
ACTIONS: Waits for a MPI send or receive to complete and then returns.
PARAMETERS: * request request handle
*status status (same as the return status of MPI_Recv ( )) if waiting for this
Related:
MPI_Waitall ( ) Wait for all processes to complete (additional parameters)
MPI_Waitany ( ) Wait for any process to complete (additional parameters)
MPI_Waitsome ( ) Wait for some processes to complete (additional parameters)
int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status)
ACTIONS: Tests for completion of a nonblocking operation.
PARAMETERS: request request handle
* flag true if operation completed (returned)
*status status (returned)
int MPI_Probe (int source, int tag, MPI_Comm comm, MPI_Status
*status)
ACTIONS: Blocking test for a message (without receiving message).
PARAMETERS: source source process rank
tag message tag
comm communicator
*status status (returned)
int MPI_Iprobe (int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
ACTIONS: Nonblocking test for a message (without receiving message).
PARAMETERS: source source process rank
tag message tag
comm communicator
*flag true if there is a message (returned)
*status status (returned)
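As an illustrative sketch of how these point-to-point routines are combined (this example is not part of the original routine list; the tag value and message contents are arbitrary assumptions), two processes exchange an integer:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, value, tag = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* initialise MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        if (rank == 0 && size > 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);          /* blocking send */
            MPI_Recv(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &status); /* blocking receive */
            printf("process 0 received %d\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            value = value + 1;
            MPI_Send(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();                         /* terminate MPI environment */
        return 0;
    }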
Group routines
int MPI_Barrier (MPI_Comm comm)
ACTIONS: Blocks process until all processes have called it.
PARAMETERS: comm communicator
int MPI_Bcast (void *buf, int count, MPI_Datatype datatype, int
root, MPI_Comm comm)
ACTIONS: Broadcasts message from root process to all processes in
comm and itself.
PARAMETERS: *buf message buffer (loaded)
count number of entries in buffer
datatype data type of buffer
root rank of root
comm communicator
int MPI_Alltoall (void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
ACTIONS: Sends data from all processes to all processes.
PARAMETERS: *sendbuf send buffer
sendcount number of send buffer elements
sendtype data type of send elements
*recvbuf receive buffer (loaded)
recvcount number of elements each receive
recvtype data type of receive elements
comm communicator
Related: MPI_Alltoallv ( ) Sends data to all processes, with displacement
int MPI_Gather (void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)
ACTIONS: Gathers values for group of processes.
PARAMETERS: *sendbuf send buffer
sendcount number of send buffer elements
sendtype data type of send elements
*recvbuf receive buffer (loaded)
recvcount number of elements each receive
recvtype data type of receive elements
root rank of receiving process
comm communicator
Related:
MPI_Allgather ( ) Gather values and distribute to all
MPI_Gatherv ( ) Gather values into specified locations
MPI_Allgatherv ( ) Gather values into specified locations and distributes to all
MPI_Gatherv ( ) and MPI_Allgatherv ( ) require additional parameter:
*displs - array of displacements, after recvcount.
int MPI_Scatter (void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_
Comm comm)
ACTIONS: Scatters a buffer from root in parts to group of processes.
PARAMETERS: *sendbuf send buffer
sendcount number of elements sent to each process
sendtype data type of send elements
*recvbuf receive buffer (loaded)
recvcount number of receive buffer elements
recvtype data type of receive elements
root root process rank
comm communicator
Related:
MPI_Scatterv ( ) Scatters a buffer in specified parts to group of processes.
MPI_Reduce_scatter ( ) Combines values and scatters the results.
int MPI_Reduce (void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
ACTIONS: Combines values on all processes to a single value.
PARAMETERS: *sendbuf send buffer address
*recvbuf receive buffer address
count number of send buffer elements
datatype data type of send elements
op reduce operation. Several operations, including
MPI_MAX Maximum
MPI_MIN Minimum
MPI_SUM Sum
MPI_PROD Product
root root process rank for result
comm communicator
Related:
MPI_Allreduce ( ) Combines values to a single value and returns it to all
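The collective routines above can be combined in the usual root/worker pattern; the following minimal sketch (an illustration, not part of the original listing; the value of n and the "partial result" are arbitrary assumptions) broadcasts a value from process 0 and then reduces a partial result from every process back to process 0:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, n = 0, partial, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            n = 100;                                 /* problem size chosen by the root */

        /* every process receives n from the root */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        partial = n / size;                          /* trivial partial result of each process */

        /* combine all partial results into sum on the root */
        MPI_Reduce(&partial, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined result = %d\n", sum);

        MPI_Finalize();
        return 0;
    }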
Appendix 3
Basic PThread Routines
The following is a collection of Pthread routines that is sufficient for most programs. Additional details can be found in the specialized materials for the individual supporting development tools.
Thread Management
The header file <pthread.h> contains the type definitions (pthread_t etc.)
and function prototypes.
int pthread_create (pthread_t *thread, const pthread_attr_t *attr, void *(*routine)(void *), void *arg)
ACTIONS: Creates thread.
PARAMETERS: thread thread identifier (returned)
attr thread attribute (NULL for default attributes)
routine new thread routine
arg argument passed to the thread routine
void pthread_exit (void *value)
ACTIONS: Terminates the calling thread.
PARAMETERS: value returned to threads that have already issued
pthread_join ( )
int pthread_join(pthread_t thread, void **value)
ACTIONS: Causes thread to wait for specified thread to terminate.
PARAMETERS: thread thread to wait for
value exit value of the terminated thread (returned)
int pthread_detach (pthread_t thread)
ACTIONS: Detaches a thread.
PARAMETERS: thread thread to detach
int pthread_attr_init (pthread_attr_t *attr)
ACTIONS: Initializes a thread attribute object to default values.
PARAMETERS: attr thread attribute object
int pthread_attr_setdetachstate (pthread_attr_t *attr, int state)
ACTIONS: Specifies whether a thread created with attr will be detached.
PARAMETERS: attr thread attribute
state not detached - PTHREAD_CREATE_JOINABLE
detached - PTHREAD_CREATE_DETACHED
int pthread_attr_destroy (pthread_attr_t *attr)
ACTIONS: Destroys a thread attribute object.
PARAMETERS: attr thread attribute object
pthread_t pthread_self (void)
ACTIONS: Returns the ID of the calling thread.
PARAMETERS: None
int pthread_equal (pthread_t thread1, pthread_t thread2)
ACTIONS: Compares two thread IDs, thread1 and thread2, and returns nonzero if they are equal, otherwise returns zero.
PARAMETERS: thread1 thread
thread2 thread
int pthread_once (pthread_once_t *once_ctr, void (*once_rtn)(void))
ACTIONS: Executes specified routine if it has not been called before.
Ensure that the routine is only called once. Useful for
initialisation. For example, mutex locks should only be
initialized once.
PARAMETERS: once_ctr variable used to determine whether once_
routine called before (a global variable that
should be initialized to PTHREAD_ONCE_
INIT (e.g., static pthread_once_t once_ctr =
PTHREAD_ONCE_INIT;)
once_rtn routine to be executed once
Thread Synchronization
Mutual Exclusion Locks (Mutex Locks)
int pthread_mutex_init (pthread_mutex_t *mutex, const pthread_mutexattr_t *attr)
ACTIONS: Initialises mutex with specified attributes.
PARAMETERS: mutex mutex
attr atrributes - NULL default
int pthread_mutex_destroy (pthread_mutex_t *mutex)
ACTIONS: Destroys a mutex.
PARAMETERS: mutex mutex
int pthread_mutex_lock (pthread_mutex_t *mutex)
ACTIONS: Locks an unlocked mutex (and becomes owner of mutex). If
already locked, blocks until thread that holds mutex releases it.
PARAMETERS: mutex mutex
int pthread_mutex_unlock(pthread_mutex_t *mutex)
ACTIONS: Unlocks a mutex. If any thread waiting for mutex, one is
awakened. If more than one thread waiting, thread chosen
dependent upon thread priority and scheduling.
PARAMETERS: mutex mutex
int pthread_mutex_trylock (pthread_mutex_t *mutex)
ACTIONS: Locks an unlocked mutex (and becomes owner of mutex). If
already locked, returns immediately with EBUSY.
PARAMETERS: mutex mutex
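As a small illustrative sketch of how the mutex routines above are typically used (the shared counter and the thread count are arbitrary assumptions for the example), several threads increment a shared counter under mutual exclusion:

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4                       /* arbitrary number of worker threads */

    static int counter = 0;                  /* shared data protected by the mutex */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int i;
        for (i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);       /* enter critical section */
            counter = counter + 1;
            pthread_mutex_unlock(&lock);     /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);  /* wait for all workers to finish */

        printf("final counter value: %d\n", counter);
        return 0;
    }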
Condition variables
int pthread_cond_init (pthread_cond_t *cond, const pthread_condattr_t *attr)
ACTIONS: Creates a condition variable with specified attributes
PARAMETERS: cond condition variable
attr attributes, NULL - default
int pthread_cond_destroy (pthread_cond_t *cond)
ACTIONS: Destroys a condition variable
PARAMETERS: cond condition variable
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
ACTIONS: Waits for a condition, awakened by a signal or
broadcast. The mutex is unlocked before the wait and
relocked after the wait. (The mutex should be locked by
the thread prior to the call.)
PARAMETERS: cond condition variable
mutex mutex
int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime)
ACTIONS: Waits for a condition with a time-out. Similar to pthread_cond_wait ( ), except that the routine returns with the mutex locked if the system time reaches or exceeds the specified time.
PARAMETERS: cond condition variable
mutex mutex
abstime time before returning if condition not
occurring. To set time to 5 seconds:
abstime.tv_sec = time(NULL) + 5;
abstime.tv_nsec = 0;
int pthread_cond_signal (pthread_cond_t *cond)
ACTIONS: Unblocks one thread currently waiting on condition
variable. If more than one thread waiting, thread
chosen dependent upon thread priority and scheduling.
If no threads are waiting, the signal is not remembered
for subsequent threads.
PARAMETERS: cond condition variable
int pthread_cond_broadcast (pthread_cond_t *cond)
ACTIONS: Similar to pthread_cond_signal ( ), except that all threads waiting on the condition are awakened.
PARAMETERS: cond condition variable
Though the actions of wake-up routines may call for only one
thread to be awakened, it may be possible on some multiprocessor systems
for more than one thread to awaken, and hence this should be taken into
account in the coding.
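To illustrate the condition variable routines, the following minimal producer/consumer sketch (an illustration only; the flag name is an arbitrary assumption) shows the usual pattern of re-testing the predicate in a loop while holding the associated mutex, which also covers the multiple wake-ups mentioned above:

    #include <stdio.h>
    #include <pthread.h>

    static int ready = 0;                    /* predicate guarded by the mutex */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        pthread_mutex_lock(&lock);
        ready = 1;                           /* make the condition true */
        pthread_cond_signal(&cond);          /* wake one waiting thread */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);

        pthread_mutex_lock(&lock);
        while (!ready)                       /* re-test the predicate after each wake-up */
            pthread_cond_wait(&cond, &lock); /* atomically unlocks the mutex and waits */
        pthread_mutex_unlock(&lock);

        printf("consumer observed the condition\n");
        pthread_join(t, NULL);
        return 0;
    }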
References
Abderazek, A.B., (2010) Multicore Systems On-chip Practical Software/
Hardware Design (UK: Imperial college press).
Koster, A.M.C.A. and Muñoz, X. (2010) Graphs and Algorithms in Communication Networks.
Arora, S. (2009) Computational Complexity (UK: Cambridge University
Press).
Articolo, G. (2009) Partial Differential Equations & Boundary Value
Problems (Elsevier).
Bader, D.A. (2007) Petascale Computing: Algorithms and Applications
(UK: CRC Press).
Bahi, J.M., Contassot-Vivier, S. and Couturier, R. (2007) Parallel Iterative Algorithms: From Sequential to Grid Computing (USA: CRC Press).
Barria, A.J. (2006) Communication Network and Computer Systems (UK:
Imperial College Press).
Baer, J.L. (2010) Microprocessor Architecture (UK: Cambridge University
Press).
Baeten, J.C.M. and Basten, T. (2010) Process Algebra: Equational Theories
of Communicating Processes, TUEindhoven (UK: Cambridge University
Press).
Bischof, C. et al. (2008) Parallel Computing: Architectures, Algorithms
and Applications, Publication Series of NIC 38: 804.
Casanova, H., Legrand, A. and Robert, Y. (2008) Parallel Algorithms
(USA: CRC Press).
Cepciansky, G. and Schwartz, L. (2013) Stochastic Processes with Discrete
States (Germany: LAP Lambert).
Coulouris, G., Dollimore, J. and Kindberg, T. (2011) Distributed Systems
Concepts and Design, 5th edn (UK: Addison Wesley).
Dasgupta, S., Papadimitriou, C.H. and Vazirani, U. (2006) Algorithms
(USA: McGraw-Hill).
Dattatreya, G.R. (2008) Performance Analysis of Queuing and Computer
Network (USA: University of Texas).
Davis, T.A. (2006) Direct Methods for Sparse Linear Systems (UK:
Cambridge University Press).
Desel, J. and Esperza, J. (2005) Free Choice Petri Nets (UK: Cambridge
University Press).
Díaz, J. et al. (2011) Paradigms for Fast Parallel Approximability (UK:
Cambridge University Press).
Dimitrios, S. and Wolf, T. (2011) Architecture of Network Systems
(Elsevier).
Dobrushkin, V.A. (2009) Methods in Algorithmic Analysis (USA: CRC
Press).
Došlý, O. and Řehák, P. (2005) Half-linear Differential Equations (North Holland).
Dubois, M., Annavaram, M. and Stenstrom, P. (2012) Parallel Computer
Organisation and Design.
Dubhash, D.P. and Panconesi A. (2009) Concentration of Measure for the
Analysis of Randomised Algorithms (UK: Cambridge University Press).
Edmonds, J. (2008) How to Think about Algorithms (UK: Cambridge
University Press).
Eldén, L. (2007) Matrix Methods in Data Mining and Pattern Recognition
(UK: Cambridge University Press).
Even, S. (2012) Graph Algorithms, 2nd edn (UK: Cambridge University
Press).
Fortier, P. and Howard, M. (2003) Computer System Performance
Evaluation and Prediction (Digital Press).
Foster, I. and Kesselman, C. (2003) The Grid 2, Blueprint for a New
Computing Infrastructure, 2nd edn (USA: Morgan Kaufmann).
Fountain, T.J. (2011) Parallel Computing (UK: Cambridge University
Press).
Gamal, A. and Kim, Y.H. (2011), Network Information Theory (UK:
Cambridge University Press).
Gaston, H.G. and Scholl, R. (2009) Scientific Computation (UK: Cambridge
University Press).
Gelenbe, E. (2010) Analysis and Synthesis of Computer Systems (UK:
Imperial College Press).
Giambene, G. (2005) Queueing Theory and Telecommunications (Germany:
Springer).
Gibbons, A. and Rytter, W. (1990) Efficient Parallel Algorithms (UK:
Cambridge University Press).
Giladi, R. (2008) Network Processors (Elsevier/Morgan Kaufmann).
Goldberg, L.A. (2009) Efficient Algorithms for Listing Combinatorial
Problems (UK: Cambridge University Press).
Goldreich, O. (2010) Computational Complexity (UK: Cambridge
University Press).
Goldreich, O. (2010) P, NP, and NP Completeness (UK: Cambridge
University Press).
Grama, A., Gupta, A., Karypis, G. and Kumar, V. (2003) Introduction to
Parallel Computing, 2nd edn (Addison Wesley).
Gunawan, T.S. and Cai, W. (2003) Performance Analysis of a Myrinet-
Based Cluster (Germany: Kluwer Academic Publishers).
Hager, G. and Wellein, G. (2010) Introduction to High Performance
Computing for Scientists and Engineers (Germany: Springer).
Hanuliak, P. and Hanuliak, M. (2013) Performance Modelling of Parallel Computers NOW and Grid, Journal of Networks and Communication 2(5): 112-124.
Hanuliak, P. and Hanuliak, M. Performance Modelling of SMP Parallel Computers, International Journal of Science, Commerce and Humanities (IJSCH) 1(5): 1-18.
Hanuliak, P. (2012) Analytical Method of Performance Prediction in Parallel Algorithms, The Open Cybernetics and Systemics Journal 6: 38-47.
Hanuliak, I. (2000) To the Role of Decomposition Strategy in High Parallel Algorithms, Kybernetes 29(9/10): 1042-1057.
Hanuliak, P. (2012) Complex Performance Evaluation of Parallel Laplace Equation, AD ALTA 2(2): 104-107.
Hanuliak, P. (2012) Parallel Iteration Algorithms for Laplacean Equation Method, Acta Moraviae 4(8): 111-120.
Hanuliak, P. and Hanuliak, I. (2010) Performance Evaluation of Iterative Parallel Algorithms, Kybernetes 39(1): 107-126.
Hanuliak, J. and Hanuliak, I. (2006) Performance Evaluation of Parallel Algorithms, Science: 126-132.
Hanuliak, J. and Hanuliak, I. (2005) To Performance Evaluation of Distributed Parallel Algorithms, Kybernetes 34(9/10): 1633-1650.
Hanuliak, M. (2007) To Modelling of Parallel Computer Systems, TRANSCOM: 67-70.
Hanuliak, J. (2003) To Performance Evaluation of Parallel Algorithms in NOW, Communications 4: 83-88.
Hanuliak, P., Hanuliak, I. and Petrucha, J. (2013) Fundamentals of
Theoretical and Applied Informatics (Czech Republic: EPI Kunovice).
Hanuliak, P. (2007) Virtual Parallel Computer, TRANSCOM.
Hanuliak, M. and Hanuliak, I. (2006) To the Correction of Analytical Models for Computer-Based Communication Systems, Kybernetes 35(9): 1492-1504.
Hanuliak, M. (2013) Performance Modelling of NOW and Grid Parallel Computers, AD ALTA 3(2): 91-96.
Hanuliak, M. (2014) Unified Analytical Models of Parallel and Distributed Computing, American Journal of Networks and Communication, reviewed and accepted.
Hanuliak, M. (2013) Performance Modelling of Computer Systems, ICSC: 43-49.
Harchol-Balter, M, (2013) Performance Modelling and Design of Computer
Systems (UK: Cambridge University Press).
El-Rewini, H. and Abd-El-Barr, M. (2005) Advanced Computer Architecture
and Processing (USA: John Wiley and Sons).
Hillston, J. (2005) A Compositional Approach to Performance Modelling
(UK: Cambridge University Press).
Hwang, K. et al. (2011) Distributed and Parallel Computing (USA: Morgan
Kaufmann).
Hwang, K., Dongarra, J. and Geoffrey, C.F. (2011) Distributed and Cloud
Computing (USA: Morgan Kaufmann).
Chapman, B., Jost, G. and Ruud Van der Pas, R. (2008) Using OpenMP -
Portable Shared Memory Parallel Programming (The MIT Press).
John, L.K. and Eeckhout, L. (2005) Performance Evaluation and
Benchmarking (USA: CRC Press).
Kirk, D.B. and Hwu, W.W. (2012) Programming Massively Parallel
Processors, 2nd edn (USA: Morgan Kaufmann).
Kostin, A. and Ilushechkina, L. (2010) Modelling and Simulation of
Distributed Systems (Imperial College Press).
Kshemkalyani, A.D. and Singhal, M. (2011) Distributed Computing (UK:
Cambridge University Press).
Kumar, A., Manjunath, D. and Kuri, J. (2004) Communication Networking
(USA: Morgan Kaufmann).
Kushilevitz, E. and Nisan, N. (2006) Communication Complexity (UK:
Cambridge University Press).
Kwiatkowska, M., Norman, G. and Parker, D. (2011) PRISM 4.0:
Verification of Probabilistic Real-time Systems, LNCS 6806: 585-591.
Le Boudec, J.Y. (2011) Performance Evaluation of Computer and
Communication Systems (USA: CRC Press).
Levesque, J. (2010) High Performance Computing: Programming and
Applications (USA: CRC Press).
Lilja, D.J. (2005) Measuring Computer Performance (UK: Cambridge
University Press).
Magoulès, F., Nguyen, T.M.H. and Yu, L. (2008) Grid Resource Management: Towards Virtual and Services Compliant Grid Computing (USA: CRC Press).
McCabe, J.D. (2010) Network Analysis, Architecture, and Design, 3rd edn
(USA: Elsevier/Morgan Kaufmann).
Meerschaert, M. (2013) Mathematical Modeling, 4th edn (Elsevier).
Miller, S. (2012) Probability and Random Processes, 2nd edn (Germany:
Academic Press, Elsevier Science).
Mieghem, P.V. (2010) Graph spectra for Complex Networks (UK:
Cambridge University Press).
Misra, C.S. and Woungang, I. (2010) Selected Topics in Communication
Network and Distributed Systems (Imperial College Press).
Natarajan, G. (2012) Analysis of Queues: Methods and Applications (USA:
CRC Press).
Pacheco, P. (2011) An Introduction to Parallel Computing, 1st edn (USA:
Morgan Kaufmann).
Parashar, M. and Li, X. (2010) Advanced Computational Infrastructures
for Parallel and Distributed Adaptive Applications (USA: John Wiley &
Sons).
Patterson, D.A. and Hennessy, J.L. (2011) Computer Organization and
Design, 4th edn, (USA: Morgan Kaufmann).
Pearl, J. (2012) Causality Models, Reasoning and Inference (UK:
Cambridge University Press).
Peterson, L.L. and Davie, B.C. (2011) Computer Networks a System
Approach (USA: Morgan Kaufmann).
Powers, D. (2005) Boundary Value Problems and Partial Differential
Equations (Germany: Elsevier).
Rajasekaran, S. and Reif, J. (2007) Handbook of Parallel Computing:
Models, Algorithms and Applications (USA: Chapman and Hall/CRC
Press).
Ramaswami, B.J., Sivarajan, K. and Sasaki, G. (2010) Optical Networks, a
Practical Perspective, 3rd edn (USA: Morgan Kaufmann).
Resch, M.M. (2009) Supercomputers in Grids, International Journal of
Grid and HPC 1: 19.
Riano, I. and McGinity, T.M. (2011) Quantifying the Role of Complexity
in a System's Performance, Evolving Systems: 189-198.
Ross, S.M. (2010) Introduction to Probability Models, 10th edn (Germany:
Academic Press, Elsevier Science).
Shapira, Y. (2012) Solving PDEs in C++ Numerical Methods in a Unified
Object-Oriented Approach 2nd edn (UK: Cambridge University Press).
Shroff, G. (2010) Enterprise Cloud Computing (UK: Cambridge University
Press).
Tullis, T. and Albert, W. (2013) Measuring the User Experience Collecting,
Analyzing, and Presenting Usability Metrics 2nd edn (USA: Morgan
Kaufmann).
Wang, L., Jie, W. and Chen, J. (2009) Grid Computing: Infrastructure,
Service, and Application (USA: CRC Press).
http://www.top500.org
http://www.spec.org
http://www.intel.com
http://www.hpc-europa.eu.
S-ar putea să vă placă și