
Figure 1.1 Astrophysical N-body simulation by Scott Linssen (undergraduate University of North Carolina at Charlotte [UNCC] student).
Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers
Barry Wilkinson and Michael Allen, Prentice Hall, 1998
Main memory
Instructions (to processor)

Data (to or from processor)
Processor
Figure 1.2 Conventional computer having a single processor and memory.
Memory modules
One
address
space
Interconnection
network
Processors
Figure 1.3 Traditional shared memory multiprocessor model.
Interconnection
network
Messages
Processor
Local
memory
Computers
Figure 1.4 Message-passing multiprocessor model (multicomputer).
Interconnection
network
Messages
Processor
Shared
memory
Computers
Figure 1.5 Shared memory multiprocessor implementation.
Program Program
Instructions Instructions
Processor Processor
Data Data
Figure 1.6 MPMD structure.
P M Computers P M
C C
Network with direct links

between computers
P M
Figure 1.7 Static link multicomputer.
Computer (node)
Switch
Links Links
to other to other
nodes nodes
Processor Memory
Figure 1.8 Node with a switch for internode message transfers.
Link
Node Node
Figure 1.9 A link between two nodes with
separate wires in each direction.
Figure 1.10 Ring.
Computer/
Links processor
Figure 1.11 Two-dimensional array

(mesh).
Root
Processing
element
Links
Figure 1.12 Tree structure.
110 111
100 101
010 011
000 001
Figure 1.13 Three-dimensional hypercube.
0110 0111 1110 1111
0100 0101 1100 1101
0010 0011 1010 1011
0000 0001 1000 1001
Figure 1.14 Four-dimensional hypercube.
Ring
Figure 1.15 Embedding a ring onto a torus.
Nodal address
1011
10
11
01
00
x
y 00 01 11 10
Figure 1.16 Embedding a mesh into a hypercube.
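The axis labels 00, 01, 11, 10 in Figure 1.16 follow a Gray code, so that neighboring mesh points map to hypercube nodes whose addresses differ in exactly one bit. A minimal sketch of that mapping follows. The function names are hypothetical, and placing the x bits in the high-order half of the address is an assumption, not something the figure residue confirms.

```c
#include <assert.h>

/* Reflected binary Gray code of i: consecutive values differ in one bit. */
unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}

/* Hypercube node address for mesh point (x, y) in a 2^bits x 2^bits mesh.
   Assumption: the Gray code of x forms the high-order half of the address. */
unsigned mesh_to_hypercube(unsigned x, unsigned y, unsigned bits)
{
    assert(bits >= 1 && bits <= 15);
    assert(x < (1u << bits) && y < (1u << bits));
    return (gray(x) << bits) | gray(y);
}

/* Number of bit positions in which two node addresses differ. */
unsigned hamming(unsigned a, unsigned b)
{
    unsigned d = a ^ b, count = 0;
    while (d) {
        count += d & 1u;
        d >>= 1;
    }
    return count;
}
```

With these definitions, mesh_to_hypercube(3, 2, 2) yields binary 1011, and any two horizontally or vertically adjacent mesh points give addresses at Hamming distance 1, which is the property that makes them directly linked in the hypercube.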
A A
Root
A A
A A
Figure 1.17 Embedding a tree into a mesh.
Packet Head
Movement
Flit buffer
Request/
Acknowledge
signal(s)
Figure 1.18 Distribution of flits.
Source Destination
processor processor
Data
R/A
Figure 1.19 A signaling method between processors for wormhole routing (Ni and McKinley, 1993).
Packet switching
Network
latency
Wormhole routing
Circuit switching
Distance (number of nodes between source and destination)
Figure 1.20 Network delay characteristics.
Node 4 Node 3
Messages
Node 1 Node 2
Figure 1.21 Deadlock in store-and-forward networks.
Virtual channel
buffer Node Node
Route
Physical link
Figure 1.22 Multiple virtual channels mapped onto a single physical channel.
Ethernet
Workstation/file server
Workstations
Figure 1.23 Ethernet-type single wire network.
Field                   Size
Preamble                64 bits
Destination address     48 bits
Source address          48 bits
Type                    16 bits
Data                    variable
Frame check sequence    32 bits
(Fields listed in the direction of transmission, preamble first.)
Figure 1.24 Ethernet frame format.
Network
Workstation/
file server
Workstations
Figure 1.25 Network of workstations connected via a ring.
Workstations
Workstation/
file server
Figure 1.26 Star connected network.
Parallel programming cluster
(a) Using specially designed adaptors
(b) Using separate Ethernet interfaces
Figure 1.27 Overlapping connectivity Ethernets.
Process 1
Process 2 Computing
Process 3
Slope indicating time
to send message
Process 4
Waiting to send a message Message Time
Figure 1.28 Space-time diagram of a message-passing program.
ts
fts (1 − f)ts
Serial section Parallelizable sections
(a) One processor
(b) Multiple
processors
n processors
(1 − f)ts /n
tp
Figure 1.29 Parallelizing sequential problem — Amdahl’s law.
(a) Speedup factor, S(n), against number of processors, n (4 to 20), with curves for f = 0%, 5%, 10%, and 20%.
(b) Speedup factor, S(n), against serial fraction, f (0.2 to 1.0), with curves for n = 16 and n = 256.
Figure 1.30 (a) Speedup against number of processors. (b) Speedup against serial fraction, f.
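The quantities labeled in Figure 1.29 give a parallel time of tp = f ts + (1 − f) ts / n, so the speedup is S(n) = ts / tp = n / (1 + (n − 1) f). The sketch below evaluates that expression. The function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law for serial fraction f (0..1)
   on n processors: S(n) = ts / (f*ts + (1 - f)*ts/n)
                         = n / (1 + (n - 1)*f). */
double amdahl_speedup(double f, int n)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(n >= 1);
    return (double)n / (1.0 + (n - 1) * f);
}

/* Limit of the speedup as n grows without bound: 1/f (requires f > 0). */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}
```

For example, amdahl_speedup(0.05, 20) is about 10.26, and with f = 5% the speedup can never exceed 20 however many processors are added.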
Source
file
Compile to suit
processor
Executables
Processor 0 Processor n − 1
Figure 2.1 Single program, multiple data operation.
Process 1
Start execution
spawn(); of process 2 Process 2
Time
Figure 2.2 Spawning a process.
Process 1 Process 2
x y
Movement
send(&x, 2); of data
recv(&y, 1);
Figure 2.3 Passing a message between
processes using send() and recv()
library calls.
Process 1 Process 2
Time send(); Request to send

Suspend Acknowledgment
process recv();
Both processes Message
continue
(a) When send() occurs before recv()
Process 1 Process 2
Time recv();
Request to send Suspend
send(); process
Both processes Message
continue Acknowledgment
(b) When recv() occurs before send()
Figure 2.4 Synchronous send() and recv() library calls using a three-way protocol.
Process 1 Process 2
Message buffer
Time send();
Continue recv();
process Read
message buffer
Figure 2.5 Using a message buffer.
Process 0 Process 1 Process n − 1
data data data
Action
buf
bcast(); bcast(); bcast();

Code
Figure 2.6 Broadcast operation.
data data data
Action
buf
scatter(); scatter(); scatter();

Code
Figure 2.7 Scatter operation.
data data data
Action
buf
gather(); gather(); gather();

Code
Figure 2.8 Gather operation.
data data data
Action
buf +
reduce(); reduce(); reduce();

Code
Figure 2.9 Reduce operation (addition).
Workstation
PVM
daemon
Application
program
(executable)
Messages
sent through
Workstation network
Workstation
PVM
daemon
Application
program PVM
(executable) daemon
Application
program
(executable)
Figure 2.10 Message passing between workstations using PVM.
Workstation
PVM
daemon
Messages
sent through
Workstation network
PVM
daemon Workstation
PVM
daemon
Application
program
(executable)
Figure 2.11 Multiple processes allocated to each processor (workstation).
Process 1 Process 2
Array Send buffer Array to
holding receive
data Pack data
pvm_psend();
Continue pvm_precv(); Wait for message
process
Figure 2.12 pvm_psend() and pvm_precv() system calls.
Process_1
Process_2
pvm_initsend(); x
Send s
buffer y
pvm_pkint( … &x …);
pvm_pkstr( … &s …);
pvm_pkfloat( … &y …);
pvm_send(process_2 … ); Message
pvm_recv(process_1 …);
pvm_upkint( … &x …);
Receive pvm_upkstr( … &s …);
buffer pvm_upkfloat(… &y … );
Figure 2.13 PVM packing messages, sending, and unpacking.
/* Master */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pvm3.h>
#define SLAVE "spsum"
#define PROC 10
#define NELEM 1000

int main() {
  int mytid, tids[PROC];
  int n = NELEM, nproc = PROC;
  int no, i, who, msgtype;
  int data[NELEM], result[PROC], tot = 0;
  char fn[255];
  FILE *fp;

  mytid = pvm_mytid();                /* Enroll in PVM */

  /* Start slave tasks */
  no = pvm_spawn(SLAVE, (char **)0, 0, "", nproc, tids);
  if (no < nproc) {
    printf("Trouble spawning slaves\n");
    for (i = 0; i < no; i++) pvm_kill(tids[i]);
    pvm_exit(); exit(1);
  }

  /* Open input file and initialize data */
  strcpy(fn, getenv("HOME"));
  strcat(fn, "/pvm3/src/rand_data.txt");
  if ((fp = fopen(fn, "r")) == NULL) {
    printf("Can't open input file %s\n", fn);
    exit(1);
  }
  for (i = 0; i < n; i++) fscanf(fp, "%d", &data[i]);

  /* Broadcast data to slaves */
  pvm_initsend(PvmDataDefault);
  msgtype = 0;
  pvm_pkint(&nproc, 1, 1);
  pvm_pkint(tids, nproc, 1);
  pvm_pkint(&n, 1, 1);
  pvm_pkint(data, n, 1);
  pvm_mcast(tids, nproc, msgtype);    /* was msgtag, which is undeclared */

  /* Get results from slaves */
  msgtype = 5;
  for (i = 0; i < nproc; i++) {
    pvm_recv(-1, msgtype);
    pvm_upkint(&who, 1, 1);
    pvm_upkint(&result[who], 1, 1);
    printf("%d from %d\n", result[who], who);
  }

  /* Compute global sum */
  for (i = 0; i < nproc; i++) tot += result[i];
  printf("The total is %d.\n\n", tot);

  pvm_exit();                         /* Program finished. Exit PVM */
  return(0);
}

/* Slave */
#include <stdio.h>
#include "pvm3.h"
#define PROC 10
#define NELEM 1000

int main() {
  int mytid;
  int tids[PROC];
  int n, me, i, msgtype;
  int x, nproc, master;
  int low, high;                      /* bounds of this slave's portion */
  int data[NELEM], sum = 0;           /* sum must start at zero */

  mytid = pvm_mytid();

  /* Receive data from master */
  msgtype = 0;
  pvm_recv(-1, msgtype);
  pvm_upkint(&nproc, 1, 1);
  pvm_upkint(tids, nproc, 1);
  pvm_upkint(&n, 1, 1);
  pvm_upkint(data, n, 1);

  /* Determine my tid */
  for (i = 0; i < nproc; i++)
    if (mytid == tids[i]) { me = i; break; }

  /* Add my portion of data */
  x = n / nproc;
  low = me * x;
  high = low + x;
  for (i = low; i < high; i++)
    sum += data[i];

  /* Send result to master */
  pvm_initsend(PvmDataDefault);
  pvm_pkint(&me, 1, 1);
  pvm_pkint(&sum, 1, 1);
  msgtype = 5;
  master = pvm_parent();
  pvm_send(master, msgtype);

  /* Exit PVM */
  pvm_exit();
  return(0);
}
Figure 2.14 Sample PVM program.
Process 0 Process 1
Destination
send(…,1,…);
lib() send(…,1,…); Source
recv(…,0,…); lib()
recv(…,0,…);
(a) Intended behavior
Process 0 Process 1
send(…,1,…);
lib() send(…,1,…);
recv(…,0,…); lib()
recv(…,0,…);
(b) Possible behavior
Figure 2.15 Unsafe message passing with libraries.
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
  int myid, numprocs;
  int data[MAXSIZE], i, x, low, high, myresult = 0, result;
  char fn[255];
  FILE *fp;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  if (myid == 0) {                    /* Open input file and initialize data */
    strcpy(fn, getenv("HOME"));
    strcat(fn, "/MPI/rand_data.txt");
    if ((fp = fopen(fn, "r")) == NULL) {
      printf("Can't open the input file: %s\n\n", fn);
      exit(1);
    }
    for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
  }

  /* Broadcast data */
  MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

  /* Add my portion of data */
  x = MAXSIZE / numprocs;             /* was n/nproc, which are undeclared here */
  low = myid * x;
  high = low + x;
  for (i = low; i < high; i++)
    myresult += data[i];
  printf("I got %d from %d\n", myresult, myid);

  /* Compute global sum */
  MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) printf("The sum is %d.\n", result);

  MPI_Finalize();
  return 0;
}
Figure 2.16 Sample MPI program.
Time
Startup time
Number of data items (n)
Figure 2.17 Theoretical communication time.
Plot of f(x) = 4x^2 + 2x + 12 together with c1g(x) = 2x^2 and c2g(x) = 6x^2, for x from 0 to 5 (vertical axis 0 to 160), with x0 marked on the x axis.
Figure 2.18 Growth of function f(x) = 4x^2 + 2x + 12.
110 111
100 101
3rd step
010 011
2nd step
1st step 000 001
Figure 2.19 Broadcast in a three-dimensional hypercube.
P000
Message
Step 1
P000 P001
Step 2
P000 P010 P001 P011

Step 3
P000 P100 P010 P110 P001 P101 P011 P111
Figure 2.20 Broadcast as a tree construction.
Steps
1 2 3
2 3 4 4
3 4 5 5
4 5 6 6
Figure 2.21 Broadcast in a mesh.
Message
Source Destinations
Figure 2.22 Broadcast on an Ethernet network.
Source
Sequential
N destinations
Figure 2.23 1-to-N fan-out broadcast.
Source
Sequential message issue
Destinations
Figure 2.24 1-to-N fan-out broadcast on a tree structure.
Process 1
Process 2
Process 3
Time
Computing
Waiting
Message-passing system routine
Message
Figure 2.25 Space-time diagram of a parallel program.
Number of repetitions or time
1 2 3 4 5 6 7 8 9 10
Statement number or regions of program
Figure 2.26 Program profile.
Input data
Processes
Results
Figure 3.1 Disconnected computational graph (embarrassingly parallel problem).
spawn() Send initial data
send()
recv()
Slaves
Master
send()
recv()
Collect results
Figure 3.2 Practical embarrassingly parallel computational graph with dynamic process
creation and the master-slave approach.
x
80 Process
y 640
80 Map
480
(a) Square region for each process
Process
640
10
Map
480
(b) Row region for each process
Figure 3.3 Partitioning into regions for individual processes.
+2
Imaginary
−2
−2 0 Real +2
Figure 3.4 Mandelbrot set.
Work pool
(xa, ya) (xe, ye)

(xc, yc)
(xb, yb) (xd, yd)
Task
Return results/
request new task
Figure 3.5 Work pool approach.
Rows outstanding in slaves (count)
0 Row sent disp_height

Increment
Row returned
Terminate
Decrement
Figure 3.6 Counter termination.
Total area = 4 (square of side 2)
Area = π (inscribed circle)
Figure 3.7 Computing π by a Monte Carlo method.
f(x)
y = sqrt(1 − x^2)
x (0 to 1)
Figure 3.8 Function being integrated in computing π by a Monte Carlo method.
Master
Partial sum
Request
Slaves
Random
number
Random number process
Figure 3.9 Parallel Monte Carlo integration.
x1 x2 xk-1 xk xk+1 xk+2 x2k-1 x2k
Figure 3.10 Parallel computation of a sequence.
x0 … x(n/m)−1 xn/m … x(2n/m)−1 … x(m−1)n/m … xn−1
+ + +
Partial sums
Sum
Figure 4.1 Partitioning a sequence of numbers into parts and adding the parts.
Initial problem
Divide
problem
Final tasks
Figure 4.2 Tree construction.
Original list
P0
P0 P4
P0 P2 P4 P6
P0 P1 P2 P3 P4 P5 P6 P7
x0 xn−1
Figure 4.3 Dividing a list into parts.
x0 xn−1
P0 P1 P2 P3 P4 P5 P6 P7
P0 P2 P4 P6
P0 P4
P0
Final sum
Figure 4.4 Partial summation.
Found/ OR
Not found
OR OR
Figure 4.5 Part of a search tree.
Figure 4.6 Quadtree.
Image area
First division
into four parts
Second division
Figure 4.7 Dividing an image.
Unsorted numbers
Buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.8 Bucket sort.
Unsorted numbers
p processors
Buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.9 One parallel version of bucket sort.
n/m numbers
Unsorted numbers
p processors
Small
buckets
Empty
small
buckets
Large
buckets
Sort
contents
of buckets
Merge lists
Sorted numbers
Figure 4.10 Parallel version of bucket sort.
Process 0 Process n − 1
Send Receive
buffer buffer
Send
buffer
0 n−1 0 n−1 0 n−1 0 n−1
Process 1 Process n − 1 Process 0 Process n − 2
Figure 4.11 “All-to-all” broadcast.
“All-to-all”
P0 A0,0 A0,1 A0,2 A0,3 A0,0 A1,0 A2,0 A3,0
P1 A1,0 A1,1 A1,2 A1,3 A0,1 A1,1 A2,1 A3,1
P2 A2,0 A2,1 A2,2 A2,3 A0,2 A1,2 A2,2 A3,2
P3 A3,0 A3,1 A3,2 A3,3 A0,3 A1,3 A2,3 A3,3
Figure 4.12 Effect of “all-to-all” on an array.
f(x)
f(p) f(q)
a p δ q b x
Figure 4.13 Numerical integration using rectangles.
f(x)
f(p) f(q)
a p δ q b x
Figure 4.14 More accurate numerical integration using rectangles.
f(x)
f(p) f(q)
a p δ q b x
Figure 4.15 Numerical integration using the trapezoidal method.
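In the trapezoidal method of Figure 4.15, each interval from p to q = p + δ contributes an area of (f(p) + f(q)) δ / 2. The sketch below sums those areas, and also shows a static assignment in which the range a to b is cut into equal regions, one per process. Here the regions are evaluated one after another rather than by real processes. All function names and the example integrand are hypothetical.

```c
#include <assert.h>

/* Example integrand (hypothetical, for illustration only). */
double square(double x)
{
    return x * x;
}

/* Trapezoidal rule over [a, b] using n intervals of width delta. */
double trapezoid(double (*f)(double), double a, double b, int n)
{
    assert(n >= 1);
    double delta = (b - a) / n;
    double area = 0.0;
    for (int i = 0; i < n; i++) {
        double p = a + i * delta;   /* left edge of the interval */
        double q = p + delta;       /* right edge of the interval */
        area += 0.5 * (f(p) + f(q)) * delta;
    }
    return area;
}

/* Static assignment: region k of `procs` equal regions is what process k
   would compute; the partial areas are simply summed here in sequence. */
double trapezoid_partitioned(double (*f)(double), double a, double b,
                             int procs, int n_per_proc)
{
    assert(procs >= 1);
    double width = (b - a) / procs;
    double total = 0.0;
    for (int k = 0; k < procs; k++)
        total += trapezoid(f, a + k * width, a + (k + 1) * width, n_per_proc);
    return total;
}
```

In a message-passing version, each process would call trapezoid() on its own region and the partial areas would be combined with a reduce operation.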
f(x)
C
A B
x
Figure 4.16 Adaptive quadrature construction.
f(x)
C=0
A B
x
Figure 4.17 Adaptive quadrature with false termination.
Center of mass
Distant cluster of bodies

r
Figure 4.18 Clustering distant bodies.
Subdivision
direction
Particles Partial quadtree
Figure 4.19 Recursive division of two-dimensional space.
Figure 4.20 Orthogonal recursive bisection
method.
log n numbers
+ + + +
+ + + +
+ + + +
+ +
Binary Tree
Result
Figure 4.21 Process diagram for Problem 4-12(b).
y
f(a)
f(x)
b
a x
f(b)
Figure 4.22 Bisection method for finding the zero crossing location of a function.
Figure 4.23 Convex hull (Problem 4-22).
P0 P1 P2 P3 P4 P5
Figure 5.1 Pipelined processes.
a[0] a[1] a[2] a[3] a[4]
a a a a a
sum sin sout sin sout sin sout sin sout sin sout
Figure 5.2 Pipeline for an unfolded loop.
Signal without Signal without Signal without Signal without
frequency f0 frequency f1 frequency f2 frequency f3
f0 f1 f2 f3 f4
Filtered signal
f(t) fin fout fin fout fin fout fin fout fin fout
Figure 5.3 Pipeline for a frequency filter.
p−1 m
Instance Instance Instance Instance Instance
P5 1 2 3 4 5
Instance Instance Instance Instance Instance Instance
P4 1 2 3 4 5 6
Instance Instance Instance Instance Instance Instance Instance
P3 1 2 3 4 5 6 7
P2 1 2 3 4 5 6 7
P1 1 2 3 4 5 6 7
P0 1 2 3 4 5 6 7
Time
Figure 5.4 Space-time diagram of a pipeline.
Instance 0 P0 P1 P2 P3 P4 P5
Time
Figure 5.5 Alternative space-time diagram.
Input sequence
d9d8d7d6d5d4d3d2d1d0 P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
(a) Pipeline structure

p−1 n
P9 d0 d1 d2 d3 d4 d5 d6
P8 d0 d1 d2 d3 d4 d5 d6 d7
P7 d0 d1 d2 d3 d4 d5 d6 d7 d8
P6 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P5 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P4 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P3 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P2 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P1 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
P0 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
Time
(b) Timing diagram
Figure 5.6 Pipeline processing 10 data elements.
P5 P5
P4 P4
Information
P3 P3
transfer
sufficient to P2 P2
start next
process P1 P1
Information passed
P0 to next stage P0
Time Time
(a) Processes with the same (b) Processes not with the
execution time same execution time
Figure 5.7 Pipeline processing where information passes to next stage before end of process.
Processor 0 Processor 1 Processor 2
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11
Figure 5.8 Partitioning processes onto processors.
Multiprocessor
Host
computer
Figure 5.9 Multiprocessor system with a line configuration.
2 3 4 5
Σ1 i Σ1 i Σ1 i Σ1 i Σ1 i
P0 P1 P2 P3 P4
Figure 5.10 Pipelined addition.
Master process Slaves
dn−1… d2d1d0 P0 P1 P2 Pn−1
Sum
Figure 5.11 Pipelined addition of numbers with a master process and ring configuration.
Master process
Numbers
d0 d1 Slaves dn−1
P0 P1 P2 Pn−1
Sum
Figure 5.12 Pipelined addition of numbers with direct access to slave processes.
P0 P1 P2 P3 P4
1 4, 3, 1, 2, 5
2 4, 3, 1, 2 5
2
3 4, 3, 1 5
1
4 4, 3 5 2
3 1
5 4 5 2
Time
(cycles) 4 2
6 5 3 1
3 1
7 5 4 2
2
8 5 4 3 1
1
9 5 4 3 2
10 5 4 3 2 1
Figure 5.13 Steps in insertion sort with five numbers.
P0 Smaller P1 P2
numbers
Series of numbers Compare

xn−1 … x1x0
xmax
Largest number Next largest

number
Figure 5.14 Pipeline for sorting using insertion sort.
Master process
dn−1… d2d1d0
P0 P1 P2 Pn−1
Sorted sequence
Figure 5.15 Insertion sort with results returned to the master process using a bidirectional line configuration.
Sorting phase Returning sorted numbers
2n − 1 n
P4 Shown for n = 5
P3
P2
P1
P0
Time
Figure 5.16 Insertion sort with results returned.
Not multiples of
1st prime number
P0 P1 P2
Series of numbers
xn−1 … x1x0
Compare 1st prime 2nd prime 3rd prime

multiples number number number
Figure 5.17 Pipeline for sieve of Eratosthenes.
P0 P1 P2 P3
x0 x0 x0
x0 x1 x1
Compute x0 Compute x1 x1 Compute x2 Compute x3
x2 x2
x3
Figure 5.18 Solving an upper triangular set of linear equations using a pipeline.
P5
P4
P3 Final computed value

Processes
P2
P1
P0 First value passed onward

Time
Figure 5.19 Pipeline processing using back substitution.
P0 P1 P2 P3 P4
divide
send(x0) ⇒ recv(x0)
end send(x0) ⇒ recv(x0)
multiply/add send(x0) ⇒ recv(x0)
divide/subtract multiply/add send(x0) ⇒ recv(x0)
send(x1) ⇒ recv(x1) multiply/add send(x1) ⇒
end send(x1) ⇒ recv(x1) multiply/add
multiply/add send(x1) ⇒ recv(x1)
divide/subtract multiply/add send(x1) ⇒
Time
send(x2) ⇒ recv(x2) multiply/add
end send(x2) ⇒ recv(x2)
multiply/add send(x2) ⇒
divide/subtract multiply/add
send(x3) ⇒ recv(x3)
end send(x3) ⇒
multiply/add
divide/subtract
send(x4) ⇒
end
Figure 5.20 Operations in back substitution pipeline.
x1 x2 x3 x4
x x x x
y4y3y2y1 yin yout yin yout yin yout yin yout Output
a a a a
a1 a2 a3 a4
Figure 5.21 Pipeline for Problem 5-9.
Display Display
Audio input
(digitized)
Pipeline
Audio input
(digitized)
(a) Pipeline solution (b) Direct decomposition
Figure 5.22 Audio histogram display.
Processes
P0 P1 P2 Pn−1
Active
Time
Waiting
Barrier
Figure 6.1 Processes reaching the barrier at different times.
Processes
P0 P1 Pn−1
Barrier();
Barrier();
Processes wait until
all reach their Barrier();
barrier call
Figure 6.2 Library call barriers.
Processes
P0 P1 Pn−1
Counter, C
Increment Barrier();
and check for n
Barrier();
Barrier();
Figure 6.3 Barrier using a centralized counter.
Master:
  Arrival phase:    for(i=0;i<n;i++) recv(Pany);
  Departure phase:  for(i=0;i<n;i++) send(Pi);
Slave processes:
  Barrier:  send(Pmaster);
            recv(Pmaster);
Figure 6.4 Barrier implementation in a message-passing system.
P0 P1 P2 P3 P4 P5 P6 P7
Arrival at barrier
Synchronizing message
Departure
from barrier
Figure 6.5 Tree barrier.
P0 P1 P2 P3 P4 P5 P6 P7
1st stage
Time
2nd stage
3rd stage
Figure 6.6 Butterfly construction.
Instruction
a[] = a[] + k;
Processors a[0]=a[0]+k; a[1]=a[1]+k; a[n-1]=a[n-1]+k;
a[0] a[1] a[n-1]
Figure 6.7 Data parallel computation.
Numbers x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
Add
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Step 1 Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ
(j = 0) i=0 i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 i=10 i=11 i=12 i=13 i=14
Add
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Add
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Add
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Final step Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ Σ
Figure 6.8 Data parallel prefix sum operation.
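The log-step scheme behind Figure 6.8 can be sketched sequentially: at step j, every element i ≥ 2^j adds in the element 2^j places to its left, so all prefix sums are complete after about log2 n steps. The code below simulates the synchronous data parallel update on one processor. The function name is hypothetical, and the high-to-low inner loop is a device of this sketch for reading the previous step's values without a second array.

```c
#include <assert.h>

/* In-place inclusive prefix sum using the log-step data parallel scheme:
   at each step, with stride = 2^j, every element i >= stride adds in
   element i - stride. The inner loop runs from high to low indices so that
   each read sees the value from the previous step, as a synchronous SIMD
   machine would. */
void prefix_sum(int x[], int n)
{
    assert(n >= 0);
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = n - 1; i >= stride; i--)
            x[i] += x[i - stride];
}
```

On a data parallel machine the inner loop would be a single forall over i, giving O(log n) steps with n processors.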
Computed
value
Error
Exact value
t t+1 Iteration
Figure 6.9 Convergence rate.
data data data
Send x0 x1 xn−1
buffer
Receive
buffer
Allgather(); Allgather(); Allgather();
Figure 6.10 Allgather operation.
Plot of execution time (τ = 1, vertical axis 0 to 2 × 10^6) against number of processors, p (0 to 32), with curves for computation, communication, and overall time.
Figure 6.11 Effects of computation and communication in Jacobi iteration.
j
Metal plate
i
Enlarged
hi−1,j
hi,j
hi,j−1 hi,j+1
hi+1,j
Figure 6.12 Heat distribution problem.
x1 x2 xk−1 xk
xk+1 xk+2 x2k−1 x2k
xi−k
xi−1 xi+1
xi
xi+k
x(k^2)
Figure 6.13 Natural ordering of heat distribution problem.
Each process Pi,j (row i, column j) executes the same sequence:
send(g, Pi-1,j);
send(g, Pi+1,j);
send(g, Pi,j-1);
send(g, Pi,j+1);
recv(w, Pi-1,j);
recv(x, Pi+1,j);
recv(y, Pi,j-1);
recv(z, Pi,j+1);
Figure 6.14 Message passing for heat distribution problem.
P0 P1 Pp−1
P0 P1
Pp−1
Blocks Strips (columns)
Figure 6.15 Partitioning heat distribution problem.
n
---
p n
Square blocks
Strips
Figure 6.16 Communication consequences of partitioning.
Plot of tstartup (0 to 2000) against processors, p (ticks at 1, 10, 100, 1000), with an upper region labeled "Strip partition best" and a lower region labeled "Block partition best".
Figure 6.17 Startup times for block and strip partitions.
Process i
Array held
by process i
One row
of points
Ghost points
Copy
Array held
by process i+1
Process i+1
Figure 6.18 Configuring array into contiguous rows for each process, with ghost points.
20°C 4ft
100°C
10ft
10ft
Figure 6.19 Room for Problem 6-14.
vehicle
Figure 6.20 Road junction for Problem 6-16.
Airflow
Actual dimensions
selected at will
Figure 6.21 Figure for Problem 6-23.
Processors P0 to P5 against time
(a) Imperfect load balancing leading to increased execution time
Processors P0 to P5 against time, all finishing at time t
(b) Perfect load balancing
Figure 7.1 Load balancing.
Work pool
Queue
Tasks
Master
process
Send task
Request task
(and possibly
submit new tasks)
Slave “worker” processes
Figure 7.2 Centralized work pool.
Initial tasks
Master, Pmaster
Process M0 Process Mn−1
Slaves
Figure 7.3 A distributed work pool.
Process
Process
Requests/tasks
Process
Process
Figure 7.4 Decentralized work pool.
Slave Pi Slave Pj
Requests Requests
Local Local
selection selection
algorithm algorithm
Figure 7.5 Decentralized selection algorithm requesting tasks between slaves.
Master
process
P0
P1 P2 P3 Pn−1
Figure 7.6 Load balancing using a pipeline structure.
Pcomm
If buffer empty,
make request Request for task
Receive task If buffer full,

from request send task
If free, Receive
request task from
task request
Ptask
Figure 7.7 Using a communication process in line load balancing.
P0
Task
when
requested
P1 P2
P3 P5 P4 P6
Figure 7.8 Load balancing using a tree.
Parent
Process Final
acknowledgment
Inactive First task
Acknowledgment
Task
Other processes
Active
Figure 7.9 Termination using message acknowledgments.
Token passed to next processor
when reached local termination condition
P0 P1 P2 Pn−1
Figure 7.10 Ring termination detection algorithm.
Token
AND
Terminated
Figure 7.11 Process algorithm for local termination.
Task
P0 Pj Pi Pn−1
Figure 7.12 Passing task to previous processes.
AND
Terminated AND
AND Terminated
Terminated
Figure 7.13 Tree termination.
Summit
F
B D
A
Base camp Possible intermediate camps
Figure 7.14 Climbing a mountain.
F 17
E
9
51
24 D
13
14
10 8
A B C
Figure 7.15 Graph of mountain climb.
                 Destination
Source     A     B     C     D     E     F
  A        ∞    10     ∞     ∞     ∞     ∞
  B        ∞     ∞     8    13    24    51
  C        ∞     ∞     ∞    14     ∞     ∞
  D        ∞     ∞     ∞     ∞     9     ∞
  E        ∞     ∞     ∞     ∞     ∞    17
  F        ∞     ∞     ∞     ∞     ∞     ∞
(a) Adjacency matrix
Source   (vertex, weight) entries, each list terminated by NULL
  A      (B, 10)
  B      (C, 8) (D, 13) (E, 24) (F, 51)
  C      (D, 14)
  D      (E, 9)
  E      (F, 17)
  F      (empty list)
(b) Adjacency list
Figure 7.16 Representing a graph.
Vertex j
di Vertex i wi,j
dj
Figure 7.17 Moore’s shortest-path algorithm.
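Figure 7.17 shows the basic step of Moore's algorithm: for a vertex i taken from a queue, each neighbor j is tested with newdist = di + wi,j, and if this is less than dj, dj is updated and j is queued for later examination. The sketch below applies that step sequentially to the graph of Figure 7.16, with vertices A to F numbered 0 to 5. The function name, the circular queue, and the use of INT_MAX for ∞ are choices of this sketch, not taken from the slides.

```c
#include <assert.h>
#include <limits.h>

#define NV 6            /* vertices A..F */
#define INF INT_MAX     /* marks "no edge" and "not yet reached" */

/* Adjacency matrix transcribed from Figure 7.16(a); rows are sources. */
static const int w[NV][NV] = {
    /*        A    B    C    D    E    F  */
    /* A */ {INF,  10, INF, INF, INF, INF},
    /* B */ {INF, INF,   8,  13,  24,  51},
    /* C */ {INF, INF, INF,  14, INF, INF},
    /* D */ {INF, INF, INF, INF,   9, INF},
    /* E */ {INF, INF, INF, INF, INF,  17},
    /* F */ {INF, INF, INF, INF, INF, INF}
};

/* Fills dist[] with the shortest distance from source to every vertex. */
void moore_shortest_paths(int source, int dist[NV])
{
    int queue[NV], head = 0, count = 0;   /* circular FIFO vertex queue */
    int queued[NV] = {0};                 /* 1 while a vertex is in the queue */

    assert(source >= 0 && source < NV);
    for (int v = 0; v < NV; v++)
        dist[v] = INF;
    dist[source] = 0;
    queue[0] = source;
    count = 1;
    queued[source] = 1;

    while (count > 0) {
        int i = queue[head];              /* next vertex to examine */
        head = (head + 1) % NV;
        count--;
        queued[i] = 0;
        for (int j = 0; j < NV; j++) {
            if (w[i][j] == INF)
                continue;                 /* no edge from i to j */
            int newdist = dist[i] + w[i][j];
            if (newdist < dist[j]) {
                dist[j] = newdist;
                if (!queued[j]) {         /* queue j once at most at a time */
                    queue[(head + count) % NV] = j;
                    count++;
                    queued[j] = 1;
                }
            }
        }
    }
}
```

Starting from A, this gives a shortest distance of 49 to the summit F, by way of B, D, and E, rather than the 61 of the direct B to F link.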
Master process
Start at
source
vertex
Vertex Vertex w[]
w[]
New
distance
dist
Vertex w[]
dist Process C
New
Process A distance
Other processes
dist
Process B
Figure 7.18 Distributed graph search.
Entrance
Search path
Exit
Figure 7.19 Sample maze for Problem 7-9.
Gold
Entrance
Figure 7.20 Plan of rooms for Problem 7-10.
Room B
Door
Room A
Figure 7.21 Graph representation for Problem 7-10.
Bus
Cache
Processors Memory modules
Figure 8.1 Shared memory multiprocessor using a single bus.
TABLE 8.1 SOME EARLY PARALLEL PROGRAMMING LANGUAGES
Language            Originator/date                   Comments
Concurrent Pascal   Brinch Hansen, 1975 (a)           Extension to Pascal
Ada                 U.S. Dept. of Defense, 1979 (b)   Completely new language
Modula-P            Bräunl, 1986 (c)                  Extension to Modula 2
C*                  Thinking Machines, 1987 (d)       Extension to C for SIMD systems
Concurrent C        Gehani and Roome, 1989 (e)        Extension to C
Fortran D           Fox et al., 1990 (f)              Extension to Fortran for data parallel programming
a. Brinch Hansen, P. (1975), “The Programming Language Concurrent Pascal,” IEEE Trans. Software Eng., Vol. 1, No. 2 (June), pp. 199–207.
b. U.S. Department of Defense (1981), “The Programming Language Ada Reference Manual,” Lecture Notes in Computer Science, No. 106, Springer-Verlag, Berlin.
c. Bräunl, T., R. Norz (1992), Modula-P User Manual, Computer Science Report, No. 5/92 (August), Univ. Stuttgart, Germany.
d. Thinking Machines Corp. (1990), C* Programming Guide, Version 6, Thinking Machines System Documentation.
e. Gehani, N., and W. D. Roome (1989), The Concurrent C Programming Language, Silicon Press, New Jersey.
f. Fox, G., S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu (1990), Fortran D Language Specification, Technical Report TR90-141, Dept. of Computer Science, Rice University.
Main program
FORK
Spawned processes
FORK
FORK
JOIN JOIN
JOIN JOIN
Figure 8.2 FORK-JOIN construct.
Code Heap
IP
Stack
Interrupt routines
Files
(a) Process
Code Heap
Stack Thread
IP
Interrupt routines
Stack Thread
Files
IP
(b) Threads
Figure 8.3 Differences between a process and threads.
Main program:
  pthread_create(&thread1, NULL, proc1, &arg);
  pthread_join(thread1, *status);
thread1:
  proc1(&arg)
  {
    return(*status);
  }
Figure 8.4 pthread_create() and pthread_join().
Main program
pthread_create(); Thread
pthread_create();
Thread
Thread
pthread_create(); Termination
Termination
Termination
Figure 8.5 Detached threads.
Shared variable, x
Write Write
Read Read
+1 +1
Process 1 Process 2
Figure 8.6 Conflict in accessing shared variable.
Process 1 Process 2
while (lock == 1) do_nothing; while (lock == 1)do_nothing;
lock = 1;
Critical section
lock = 0;
lock = 1;
Critical section
lock = 0;
Figure 8.7 Control of critical sections through busy waiting.
R1 R2 Resource
P1 P2 Process
(a) Two-process deadlock
R1 R2 Rn −1 Rn
P1 P2 Pn −1 Pn
(b) n-process deadlock
Figure 8.8 Deadlock (deadly embrace).
Main memory
7
6
5
4
Block 3
2
1
0
Address
tag
Cache Cache
Block in cache
Processor 1 Processor 2
Figure 8.9 False sharing in caches.
sum
Array a[]
addr
Figure 8.10 Shared memory locations for Section 8.4.1 program example.
global_index sum
Array a[]
addr
Figure 8.11 Shared memory locations for Section 8.4.2 program example.
TABLE 8.2 LOGIC CIRCUIT DESCRIPTION FOR FIGURE 8.12
Gate   Function   Input 1   Input 2   Output
1      AND        Test1     Test2     Gate1
2      NOT        Gate1               Output1
3      OR         Test3     Gate1     Output2
Circuit with inputs Test1, Test2, and Test3, gates 1, 2, and 3, and outputs Output1 and Output2.
Figure 8.12 Sample logic circuit.
Log
Movement
of logs
River
Frog
Figure 8.13 River and frog for Problem 8-23.
Pool of threads
Request Request Slaves

serviced
Master Signal
Figure 8.14 Thread pool for Problem 8-24.
a[i] a[0] a[i] a[n-1]
Compare
Increment
counter, x
b[x] = a[i]
Figure 9.1 Finding the rank in parallel.
a[i] a[0] a[i] a[1] a[i] a[2] a[i] a[3]
Compare
0/1 0/1 0/1 0/1
Add Add
0/1/2 0/1/2
Tree
Add
0/1/2/3/4
Figure 9.2 Parallelizing the rank computation.
Master
a[] b[]
Read Place selected

numbers number
Slaves
Figure 9.3 Rank sort using a master and slaves.
Sequence of steps
P1 P2
1
A Send(A) B
If A > B send(B)
else send(A)
If A > B load A
2 else load B
Compare 3
Figure 9.4 Compare and exchange on a message-passing system — Version 1.
P1 P2
1
A Send(A) B
Send(B)
2
If A > B load B If A > B load A
3 Compare Compare 3
Figure 9.5 Compare and exchange on a message-passing system — Version 2.
P1 P2
Merge
88 88 98
Original 50 50 Keep
88 higher
numbers 28 28 80 numbers
25 25 50
43 98 43
42 Return
Final 42 80 lower
numbers 28 43 28
25 numbers
25 42
Figure 9.6 Merging two sublists — Version 1.
P1 P2
Original
Merge numbers Merge
98 98 Keep
98 98
80 80 higher
88 88
43 43 numbers
80 80 (final
50 42 42 50
Keep numbers)
43 88 88 43
lower 42 42
numbers 50 50
28 28 28 28
(final 25 25
numbers) 25 25
Original
numbers
Figure 9.7 Merging two sublists — Version 2.
Original
sequence: 4 2 7 8 5 1 3 6
4 2 7 8 5 1 3 6
2 4 7 8 5 1 3 6
2 4 7 8 5 1 3 6
Phase 1
Place 2 4 7 8 5 1 3 6
largest
number
2 4 7 5 8 1 3 6
2 4 7 5 1 8 3 6
2 4 7 5 1 3 8 6
2 4 7 5 1 3 6 8
2 4 7 5 1 3 6 8
Phase 2
2 4 7 5 1 3 6 8
Place
next
largest 2 4 5 7 1 3 6 8
number
2 4 5 1 7 3 6 8
2 4 5 1 3 7 6 8
2 4 5 1 3 6 7 8
Phase 3
2 4 5 1 3 6 7 8
Time
Figure 9.8 Steps in bubble sort.
Phase 1
Phase 2
2 1
Time 2 1
Phase 3
3 2 1
3 2 1
Phase 4
4 3 2 1
Figure 9.9 Overlapping bubble sort actions in a pipeline.
P0 P1 P2 P3 P4 P5 P6 P7
Step
0 4 2 7 8 5 1 3 6
1 2 4 7 8 1 5 3 6
2 2 4 7 1 8 3 5 6
3 2 4 1 7 3 8 5 6
Time 4 2 1 4 3 7 5 8 6
5 1 2 3 4 5 7 6 8
6 1 2 3 4 5 6 7 8
7 1 2 3 4 5 6 7 8
Figure 9.10 Odd-even transposition sort sorting eight numbers.
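The steps listed for Figure 9.10 alternate between two phases: an even phase comparing pairs (P0, P1), (P2, P3), and so on, and an odd phase comparing pairs (P1, P2), (P3, P4), and so on. After n phases the n numbers are sorted. The sketch below runs the phases sequentially on an array. In the parallel algorithm, all compare-and-exchange operations within one phase happen at the same time. The function names are hypothetical.

```c
#include <assert.h>

/* One phase of odd-even transposition sort.
   phase even: compare-and-exchange pairs (0,1), (2,3), ...
   phase odd:  compare-and-exchange pairs (1,2), (3,4), ...
   The pairs within a phase are disjoint, so they could all run in parallel. */
void transposition_phase(int a[], int n, int phase)
{
    assert(n >= 0 && phase >= 0);
    for (int i = phase % 2; i + 1 < n; i += 2) {
        if (a[i] > a[i + 1]) {
            int tmp = a[i];
            a[i] = a[i + 1];
            a[i + 1] = tmp;
        }
    }
}

/* Full sort: n alternating phases, starting with an even phase. */
void odd_even_transposition_sort(int a[], int n)
{
    for (int phase = 0; phase < n; phase++)
        transposition_phase(a, n, phase);
}
```

Applying one even phase to 4 2 7 8 5 1 3 6 gives 2 4 7 8 1 5 3 6, which matches the row for step 1 in the figure.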
Smallest
number
Largest
number
Figure 9.11 Snakelike sorted list.
4 14 8 2 2 4 8 14 1 4 7 3
10 3 13 16 16 13 10 3 2 5 8 6
7 15 1 5 1 5 7 15 12 11 9 14
12 6 11 9 12 11 9 6 16 13 10 15
(a) Original placement (b) Phase 1 — Row sort (c) Phase 2 — Column sort
of numbers
1 3 4 7 1 3 4 2 1 2 3 4
8 6 5 2 8 6 5 7 8 7 6 5
9 11 12 14 9 11 12 10 9 10 11 12
16 15 13 10 16 15 13 14 16 15 14 13
(d) Phase 3 — Row sort (e) Phase 4 — Column sort (f) Final phase — Row sort
Figure 9.12 Shearsort.
(a) Operations between elements (b) Transpose operation (c) Operations between elements
in rows in rows (originally columns)
Figure 9.13 Using the transpose operation to maintain operations in rows.
Unsorted list
4 2 7 8 5 1 3 6 P0
4 2 7 8 5 1 3 6 P0 P4
Divide
list
4 2 7 8 5 1 3 6 P0 P2 P4 P6
4 2 7 8 5 1 3 6 P0 P1 P2 P3 P4 P5 P6 P7
2 4 7 8 1 5 3 6 P0 P2 P4 P6
Merge
2 4 7 8 1 3 5 6 P0 P4
1 2 3 4 5 6 7 8 P0
Sorted list Process allocation
Figure 9.14 Mergesort using tree allocation of processes.
Unsorted list
Pivot
4 2 7 8 5 1 3 6 P0
3 2 1 4 5 7 8 6 P0 P4
2 1 3 4 5 7 8 6 P0 P2 P4 P6
1 2 3 6 7 8 P0 P1 P6 P7
Sorted list Process allocation
Figure 9.15 Quicksort using tree allocation of processes.
Unsorted list
Pivot
4 2 7 8 5 1 3 6 4
3 2 1 5 7 8 6 3 5
1 2 7 8 6 1 7
2 6 8 2 6 8
Sorted list Pivots
Figure 9.16 Quicksort showing pivot withheld in processes.
Work pool
Sublists
Request
sublist Return
sublist
Slave processes
Figure 9.17 Work pool implementation of quicksort.
(a) Phase 1 000 001 010 011 100 101 110 111
≤ p1 > p1
(b) Phase 2 000 001 010 011 100 101 110 111
≤ p2 > p2 ≤ p3 > p3
(c) Phase 3 000 001 010 011 100 101 110 111
≤ p4 > p4 ≤ p5 > p5 ≤ p6 > p6 ≤ p7 > p7
Figure 9.18 Hypercube quicksort algorithm when the numbers are originally in node 000.
Broadcast pivot, p1
(a) Phase 1 000 001 010 011 100 101 110 111
≤ p1 > p1
Broadcast pivot, p2 Broadcast pivot, p3
(b) Phase 2 000 001 010 011 100 101 110 111
≤ p2 > p2 ≤ p3 > p3
Broadcast Broadcast Broadcast Broadcast

pivot, p4 pivot, p5 pivot, p6 pivot, p7
(c) Phase 3 000 001 010 011 100 101 110 111
≤ p4 > p4 ≤ p5 > p5 ≤ p6 > p6 ≤ p7 > p7
Figure 9.19 Hypercube quicksort algorithm when numbers are distributed among nodes.
110 111
(a) Phase 1 communication 010 011
100 101
000 001
110 111
(b) Phase 2 communication 010 011
100 101
000 001
110 111
(c) Phase 3 communication 010 011
100 101
000 001
Figure 9.20 Hypercube quicksort communication.
Broadcast pivot, p1
(a) Phase 1 000 001 011 010 110 111 101 100
≤ p1 > p1
Broadcast pivot, p2 Broadcast pivot, p3
(b) Phase 2 000 001 011 010 110 111 101 100
≤ p2 > p2 ≤ p3 > p3
Broadcast Broadcast Broadcast Broadcast

pivot, p4 pivot, p5 pivot, p6 pivot, p7
(c) Phase 3 000 001 011 010 110 111 101 100
≤ p4 > p4 ≤ p5 > p5 ≤ p6 > p6 ≤ p7 > p7
Figure 9.21 Quicksort hypercube algorithm with Gray code ordering.
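The node ordering shown for Figure 9.21 (000, 001, 011, 010, 110, 111, 101, 100) is the 3-bit reflected Gray code, in which neighboring entries differ in exactly one bit. A minimal Python sketch that generates this ordering, assuming the standard `i XOR (i >> 1)` construction; the function name `gray_code_order` is illustrative and does not come from the source:

```python
def gray_code_order(bits):
    """Return the reflected Gray code sequence for a given bit width as bit strings."""
    # The i-th Gray code word is i XOR (i shifted right by one bit).
    return [format(i ^ (i >> 1), f"0{bits}b") for i in range(2 ** bits)]


print(gray_code_order(3))
```

Because successive labels differ in one bit, consecutive nodes in this ordering are directly linked in the hypercube.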
a[] b[]
Sorted lists 2 4 5 8 1 3 6 7
Merge
Even indices
Odd indices Merge
c[] 1 2 5 6 d[] 3 4 7 8
Compare and exchange
Final sorted list e[] 1 2 3 4 5 6 7 8
Figure 9.22 Odd-even merging of two sorted lists.
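The data in Figure 9.22 can be reproduced with a short Python sketch. It assumes two sorted lists of equal, even length, and it uses an ordinary sequential merge (`heapq.merge`) for the two half-size merges rather than applying the odd-even merge recursively; the function name `odd_even_merge` is illustrative only:

```python
import heapq


def odd_even_merge(a, b):
    """One level of odd-even merging of two sorted lists of equal, even length."""
    # Elements a1, a3, ... (1-based odd indices) sit at 0-based positions 0, 2, ...
    c = list(heapq.merge(a[0::2], b[0::2]))
    # Elements a2, a4, ... (1-based even indices) sit at 0-based positions 1, 3, ...
    d = list(heapq.merge(a[1::2], b[1::2]))
    # Final compare-and-exchange stage: c[0] is the smallest, d[-1] the largest;
    # every other output pair is the min and max of c[i + 1] and d[i].
    e = [c[0]]
    for i in range(len(d) - 1):
        low, high = sorted((c[i + 1], d[i]))
        e += [low, high]
    e.append(d[-1])
    return c, d, e


c, d, e = odd_even_merge([2, 4, 5, 8], [1, 3, 6, 7])
print(c, d, e)
```

With the figure's inputs, `c` and `d` match the intermediate lists shown and `e` is the final sorted list.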
Compare and
exchange
bn c2n
bn−1 c2n−1
c2n−2
Even
mergesort
b4
b3
b2
b1
an
an−1
Odd c7
mergesort c6
c5
a4 c4
a3 c3
a2 c2
a1 c1
Figure 9.23 Odd-even mergesort.
Value
a0, a1, a2, a3, … an−2, an−1 a0, a1, a2, a3, … an−2, an−1
(a) Single maximum (b) Single maximum and single minimum
Figure 9.24 Bitonic sequences.
Bitonic sequence
3 5 8 9 7 4 2 1
Compare and
exchange
3 4 2 1 7 5 8 9
Bitonic sequence Bitonic sequence
Figure 9.25 Creating two bitonic sequences from one bitonic sequence.
Unsorted numbers
3 5 8 9 7 4 2 1
Compare and
exchange
3 4 2 1 7 5 8 9
2 1 3 4 7 5 8 9
1 2 3 4 5 7 8 9
Sorted list
Figure 9.26 Sorting a bitonic sequence.
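The three rows of Figure 9.26 follow from repeated compare-and-exchange of a_i with a_(i+n/2) on ever smaller sublists. A minimal Python sketch, assuming the input is already a bitonic sequence whose length is a power of two; the function name `bitonic_merge_steps` is illustrative:

```python
def bitonic_merge_steps(seq):
    """Sort a bitonic sequence, returning the list after each compare-and-exchange step."""
    a = list(seq)
    n = len(a)
    steps = []
    half = n // 2
    while half >= 1:
        # Each block of 2 * half elements is bitonic; compare element i with element i + half.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                if a[i] > a[i + half]:
                    a[i], a[i + half] = a[i + half], a[i]
        steps.append(list(a))
        half //= 2
    return steps


for row in bitonic_merge_steps([3, 5, 8, 9, 7, 4, 2, 1]):
    print(row)
```

All compare-and-exchange operations within one step are independent, which is what makes each step suitable for parallel execution.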
Unsorted numbers
Bitonic
sorting
operation
Direction
of increasing
numbers
Sorted list
Figure 9.27 Bitonic mergesort.
Compare and exchange
ai with ai+n/2 (n numbers)
8 3 4 7 9 2 1 5 = bitonic list
Step [Fig. 9.24 (a) or (b)]
1 n=2 ai with ai+1
Form
bitonic lists 3 8 7 4 2 9 5 1
of four
numbers
2 n=4 ai with ai+2
Split
Form 3 4 7 8 5 9 2 1
bitonic list
of eight
numbers 3 n=2 ai with ai+1
Sort
3 4 7 8 9 5 2 1
4 n=8 ai with ai+4
Sort bitonic list Split

3 4 2 1 9 5 7 8
5 n=4 ai with ai+2
Compare and Split

exchange 2 1 3 4 7 5 9 8
6 n=2 ai with ai+1

Lower Higher 1 2 3 4 5 7 8 9
Sort
Figure 9.28 Bitonic mergesort on eight numbers.
88 98
50 80
Step 1 28 43
25 42
50 98
42 88
Step 2 28 80
25 43
43 98
42 88
Step 3 28 80
25 50
Terminates when insertions at top/bottom of lists
Figure 9.29 Compare-and-exchange algorithm for Problem 9-5.
Column
a0,0 a0,1 a0,m−2 a0,m−1
a1,0 a1,1 a1,m−2 a1,m−1
Row
an−2,0 an−2,1 an−2,m−2 an−2,m−1

an−1,0 an−1,1 an−1,m−2 an−1,m−1
Figure 10.1 An n × m matrix.
Column
Multiply Sum
j results
Row
i
ci,j
A × B = C
Figure 10.2 Matrix multiplication, C = A × B.
A × b = c
Row
sum
i ci
Figure 10.3 Matrix-vector multiplication c = A × b.
q Sum
Multiply results
A × B = C
Figure 10.4 Block matrix multiplication.
a0,0 a0,1 a0,2 a0,3 b0,0 b0,1 b0,2 b0,3
a1,0 a1,1 a1,2 a1,3 b1,0 b1,1 b1,2 b1,3
×
a2,0 a2,1 a2,2 a2,3 b2,0 b2,1 b2,2 b2,3
a3,0 a3,1 a3,2 a3,3 b3,0 b3,1 b3,2 b3,3
(a) Matrices
A0,0 B0,0 A0,1 B1,0
a0,0 a0,1 b0,0 b0,1 a0,2 a0,3 b2,0 b2,1

× + ×
a1,0 a1,1 b1,0 b1,1 a1,2 a1,3 b3,0 b3,1
a0,0b0,0 + a0,1b1,0 a0,0b0,1 + a0,1b1,1 a0,2b2,0 + a0,3b3,0 a0,2b2,1 + a0,3b3,1
= +
a1,0b0,0 + a1,1b1,0 a1,0b0,1 + a1,1b1,1 a1,2b2,0 + a1,3b3,0 a1,2b2,1 + a1,3b3,1
a0,0b0,0 + a0,1b1,0 + a0,2b2,0 + a0,3b3,0 a0,0b0,1 + a0,1b1,1 + a0,2b2,1 + a0,3b3,1
=
a1,0b0,0 + a1,1b1,0 + a1,2b2,0 + a1,3b3,0 a1,0b0,1 + a1,1b1,1 + a1,2b2,1 + a1,3b3,1
= C0,0
(b) Multiplying A0,0 × B0,0 to obtain C0,0
Figure 10.5 Submatrix multiplication.
Column j b[][j]
Row i a[i][]
Processor Pi,j
c[i][j]
Figure 10.6 Direct implementation of matrix multiplication.
a0,0 b0,0 a0,1 b1,0 a0,2 b2,0 a0,3 b3,0
× × × ×
P0 P1 P2 P3
+ +
P0 P2
+
P0
c0,0
Figure 10.7 Accumulation using a tree construction.
i j
P0 P1 P2 P3
i
P0 + P1 P2 + P3
App Apq Bpp Bpq Cpp Cpq
j
P4 + P5 P6 + P7
Aqp Aqq Bqp Bqq Cqp Cqq
P4 P5 P6 P7
Figure 10.8 Submatrix multiplication and summation.
j
i
A
Pi,j
B
Figure 10.9 Movement of A and B elements.
j
B
i
i places
A
j places ai,j+i
bi+j,j
Figure 10.10 Step 2: Alignment of elements of A and B.
j
B
i
A
Pi,j
Figure 10.11 Step 4: One-place shift of elements of A and B.
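The alignment and one-place shift in Figures 10.9 to 10.11 correspond to the scheme commonly known as Cannon's algorithm. The sketch below simulates it sequentially in Python with one matrix element per processor P(i,j); this is an illustration under that assumption, not the book's code, and `cannon_multiply` is a hypothetical name:

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices by simulating alignment and cyclic shifts on an n x n mesh."""
    n = len(A)
    # Alignment: row i of A is shifted i places left and column j of B is shifted
    # j places up, so P(i,j) starts with a[i][(j+i) mod n] and b[(i+j) mod n][j].
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every processor multiplies its current pair and accumulates the product.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # One-place shift: A moves left along rows, B moves up along columns (wraparound).
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After the alignment, each processor always holds a matching pair a(i,k), b(k,j), so n multiply-and-shift rounds accumulate the full inner product.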
b3,3
b3,2 b2,3
b3,1 b2,2 b1,3
Pumping b2,1 b1,2
b3,0 b0,3
action b1,1 b0,2
b2,0
b1,0 b0,1
b0,0
a0,3 a0,2 a0,1 a0,0 c0,0 c0,1 c0,2 c0,3
One cycle delay
a1,3 a1,2 a1,1 a1,0 c1,0 c1,1 c1,2 c1,3
a2,3 a2,2 a2,1 a2,0 c2,0 c2,1 c2,2 c2,3
a3,3 a3,2 a3,1 a3,0 c3,0 c3,1 c3,2 c3,3
Figure 10.12 Matrix multiplication using a systolic array.
b3
b2
Pumping b1
action b0
a0,3 a0,2 a0,1 a0,0 c0
a1,3 a1,2 a1,1 a1,0 c1
a2,3 a2,2 a2,1 a2,0 c2
a3,3 a3,2 a3,1 a3,0 c3
Figure 10.13 Matrix-vector multiplication using a systolic array.
Column
Row
Row i
aji
Step through
Row j
Cleared
to zero
Already
cleared
to zero Column i
Figure 10.14 Gaussian elimination.
Column
Row
n − i +1 elements
(including b[i])
Row i
Broadcast
ith row
Already
cleared
to zero
Figure 10.15 Broadcast in parallel implementation of Gaussian elimination.
P0 P1 P2 Pn−1
Row
Broadcast rows
Figure 10.16 Pipeline implementation of Gaussian elimination.
Row
0
P0
n/p
P1
2n/p
P2
3n/p
P3
Figure 10.17 Strip partitioning.
Row
0
n/p
P0
2n/p P1
3n/p
Figure 10.18 Cyclic partitioning to equalize workload.
Solution space
∆ ∆
f(x, y)
y
x
Figure 10.19 Finite difference method.
Boundary points (see text)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
x11 x12 x13 x14 x15 x16 x17 x18 x19 x20
x21 x22 x23 x24 x25 x26 x27 x28 x29 x30
x31 x32 x33 x34 x35 x36 x37 x38 x39 x40
x41 x42 x43 x44 x45 x46 x47 x48 x49 x50
x51 x52 x53 x54 x55 x56 x57 x58 x59 x60
x61 x62 x63 x64 x65 x66 x67 x68 x69 x70
x71 x72 x73 x74 x75 x76 x77 x78 x79 x80
x81 x82 x83 x84 x85 x86 x87 x88 x89 x90
x91 x92 x93 x94 x95 x96 x97 x98 x99 x100
Figure 10.20 Mesh of points numbered in natural order.
Those equations with a boundary To include
point on diagonal unnecessary boundary values x1 0
for solution and some zero x2 0
entries (see text)
1 1 −4 1 1
1 1 −4 1 1
ith equation 1 1 −4 1 1 × =
ai,i−n ai,i−1 ai,i ai,i+1 ai,i+n
1 1 −4 1 1
1 1 −4 1 1
xN−1 0
xN 0
A x
Figure 10.21 Sparse matrix for Laplace’s equation.
Sequential order of computation
Point
computed
Point to be
computed
Figure 10.22 Gauss-Seidel relaxation with natural order, computed sequentially.
Red
Black
Figure 10.23 Red-black ordering.
Figure 10.24 Nine-point stencil.
Coarsest grid points Finer grid points
Processor
Figure 10.25 Multigrid processor allocation.
50°C
40°C 60°C
Ambient temperature at edges of board = 20°C
Figure 10.26 Printed circuit board for Problem 10-18.
j
Origin (0, 0)
Picture element p(i, j)

(pixel)
Figure 11.1 Pixmap.
Number
of pixels
0 Gray level 255
Figure 11.2 Image histogram.
x0 x1 x2
x3 x4 x5
x6 x7 x8
Figure 11.3 Pixel values for a 3 × 3 group.
Step 1 Step 2 Step 3 Step 4
Each pixel adds Each pixel adds Each pixel adds pixel Each pixel adds pixel
pixel from left pixel from right from above from below
Figure 11.4 Four-step data transfer for the computation of mean.
x0 x1 x2 x0 x1 x2
x0 + x1 x0 + x1 + x2
x3 x4 x5 x3 x4 x5
x3 + x4 x3 + x4 + x5
x6 x7 x8 x6 x7 x8
x6 + x7 x6 + x7 + x8
(a) Step 1 (b) Step 2
x0 x1 x2 x0 x1 x2
x0 + x1 + x2 x0 + x1 + x2
x3 x4 x5 x3 x4 x5
x0 + x1 + x2
x0 + x1 + x2
x3 + x4 + x5
x3 + x4 + x5
x6 + x7 + x8
x6 x7 x8 x6 x7 x8
x6 + x7 + x8 x6 + x7 + x8
(c) Step 3 (d) Step 4
Figure 11.5 Parallel mean data accumulation.
Largest Next largest
in row in row
Next largest
in column
Figure 11.6 Approximate median algorithm requiring six steps.
Mask Pixels Result
w0 w1 w2 x0 x1 x2
w3 w4 w5 ⊗ x3 x4 x5 = x4'
w6 w7 w8 x6 x7 x8
Figure 11.7 Using a 3 × 3 weighted mask.
k = 1/9
1 1 1
1 1 1
1 1 1
Figure 11.8 Mask to compute mean.
k = 1/16
1 1 1
1 8 1
1 1 1
Figure 11.9 A noise reduction mask.
k = 1/9
−1 −1 −1
−1 8 −1
−1 −1 −1
Figure 11.10 High-pass sharpening filter mask.
Intensity transition
First derivative
Second derivative
Figure 11.11 Edge detection using
differentiation.
x
Image
y
Constant
intensity
f(x, y)
φ
Gradient
Figure 11.12 Gray level gradient and direction.
−1 −1 −1 −1 0 1
0 0 0 −1 0 1
1 1 1 −1 0 1
Figure 11.13 Prewitt operator.
−1 −2 −1 −1 0 1
0 0 0 −2 0 2
1 2 1 −1 0 1
Figure 11.14 Sobel operator.
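As an illustration of how the two Sobel masks in Figure 11.14 are applied, the sketch below computes an approximate gradient magnitude for every interior pixel. Two choices here are assumptions rather than anything stated in the source: the magnitude is approximated as |∂f/∂x| + |∂f/∂y|, and border pixels are simply skipped. All names are illustrative:

```python
# Masks as laid out in Figure 11.14: the first responds to vertical change, the second to horizontal change.
SOBEL_DY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_DX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def apply_mask(image, mask, i, j):
    """Weighted sum of the 3 x 3 neighborhood centered on pixel (i, j)."""
    return sum(mask[di + 1][dj + 1] * image[i + di][j + dj]
               for di in (-1, 0, 1) for dj in (-1, 0, 1))


def sobel_magnitude(image):
    """Approximate gradient magnitude |dx| + |dy| for interior pixels only."""
    rows, cols = len(image), len(image[0])
    return [[abs(apply_mask(image, SOBEL_DX, i, j)) + abs(apply_mask(image, SOBEL_DY, i, j))
             for j in range(1, cols - 1)]
            for i in range(1, rows - 1)]


# A 4 x 4 image with a vertical edge between columns 1 and 2.
edge_image = [[0, 0, 10, 10] for _ in range(4)]
print(sobel_magnitude(edge_image))
```

Each output pixel depends only on its own 3 × 3 neighborhood, so the pixels can be computed independently of one another.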
(a) Original image (Annabel) (b) Effect of Sobel operator
Figure 11.15 Edge detection with Sobel operator.
0 −1 0
−1 4 −1
0 −1 0
Figure 11.16 Laplace operator.
Upper pixel
x1
x3 x4 x5
Left pixel Right pixel
x7
Lower pixel
Figure 11.17 Pixels used in Laplace operator.
Figure 11.18 Effect of Laplace operator.
y b b = −x1a + y1
y = ax + b
(x1, y1) b = −xa + y

(a, b)
Pixel in image
x a
(a) (x, y) plane (b) Parameter space
Figure 11.19 Mapping a line into (a, b) space.
y r
y = ax + b
r = x cos θ + y sin θ
(r, θ)
θ
r
x θ
(a) (x, y) plane (b) (r, θ) plane
Figure 11.20 Mapping a line into (r, θ) space.
x
θ
r
y
Figure 11.21 Normal representation using image coordinate system.
Accumulator (axis ticks: 0, 5, 10, 15; θ = 0°, 10°, 20°, 30°)
Figure 11.22 Accumulators, acc[r][θ], for the Hough transform.
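A minimal sketch of how the accumulators acc[r][θ] might be filled, using the normal representation r = x cos θ + y sin θ from Figure 11.20. The 10° step for θ, the rounding of r to the nearest integer, and the dictionary-based accumulator are assumptions chosen for illustration; `hough_accumulate` is a hypothetical name:

```python
import math
from collections import Counter


def hough_accumulate(points, theta_step_deg=10):
    """Vote in (r, theta) space for every point and every sampled angle."""
    acc = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            # Normal form of a line through (x, y) at angle theta.
            r = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(r, theta_deg)] += 1
    return acc


# Four points on the horizontal line y = 5.
acc = hough_accumulate([(0, 5), (10, 5), (20, 5), (30, 5)])
print(max(acc, key=acc.get), max(acc.values()))
```

The accumulator cell with the most votes identifies the (r, θ) pair of the line on which the points lie.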
k Transform Transform
rows columns
j
xjk Xjm Xlm
Figure 11.23 Two-dimensional DFT.
Image Transform
Convolution Multiply Inverse

fj,k f(j, k) F(j, k) transform
∗ hj,k × H(j, k) h(j, k)
gj,k g(j, k) G(j, k)
Filter/image
(a) Direct convolution (b) Using Fourier transform
Figure 11.24 Convolution using Fourier transforms.
Master process
w0 w1 wn−1
Slave processes
X[0] X[1] X[n−1]
Figure 11.25 Master-slave approach for implementing the DFT directly.
x[j]
Process j Values for

next iteration
X[k]
+ X[k]
× a × x[j]
a
wk × a
wk
Figure 11.26 One stage of a pipeline implementation of DFT algorithm.
x[0] x[1] x[2] x[3] x[N−1]
Output sequence
0 X[k] X[0],X[1],X[2],X[3]…
1 a
wk wk
P0 P1 P2 P3 PN−1
(a) Pipeline structure
X[0] X[1] X[2] X[3] X[4] X[5] X[6]
PN−1
PN−2
Pipeline
stages
P2
P1
P0
Time
(b) Timing diagram
Figure 11.27 Discrete Fourier transform with a pipeline.
Input sequence Transform
x0
x1
Xeven
+
x2 N/2 pt
x3 DFT Xk
N/2 pt
DFT − Xk+N/2
xN−2
xN−1 Xodd × wk
k = 0, 1, … N/2 − 1
Figure 11.28 Decomposition of N-point DFT into two N/2-point DFTs.
x0 + + X0
x1 + − X1
x2 + + X2
x3 + − X3
Figure 11.29 Four-point discrete Fourier transform.
Xk = Σ(0,2,4,6,8,10,12,14)+wkΣ(1,3,5,7,9,11,13,15)
{Σ(0,4,8,12)+wkΣ(2,6,10,14)}+wk{Σ(1,5,9,13)+wkΣ(3,7,11,15)}
{[Σ(0,8)+wkΣ(4,12)]+wk[Σ(2,10)+wkΣ(6,14)]}+{[Σ(1,9)+wkΣ(5,13)]+wk[Σ(3,11)+wkΣ(7,15)]}
x0 x8 x4 x12 x2 x10 x6 x14 x1 x9 x5 x13 x3 x11 x7 x15

0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111
Figure 11.30 Sixteen-point DFT decomposition.
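The input order at the bottom of Figure 11.30 (x0, x8, x4, x12, …) is the bit-reversed order of the indices 0 to 15: position i holds the input whose 4-bit index is i with its bits reversed. A short Python sketch of that permutation; the function name `bit_reverse` is illustrative:

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    result = 0
    for _ in range(bits):
        # Move the lowest bit of i onto the low end of result, then shift i down.
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


order = [bit_reverse(i, 4) for i in range(16)]
print(order)
```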
x0 X0
x1 X1
x2 X2
x3 X3
x4 X4
x5 X5
x6 X6
x7 X7
x8 X8
x9 X9
x10 X10
x11 X11
x12 X12
x13 X13
x14 X14
x15 X15
Figure 11.31 Sixteen-point FFT computational flow.
Process
Row
Inputs Outputs
P/r
0000 x0 X0
0001 x1 X1
P0
0010 x2 X2
0011 x3 X3
0100 x4 X4
0101 x5 X5
P1
0110 x6 X6
0111 x7 X7
1000 x8 X8
1001 x9 X9
P2
1010 x10 X10
1011 x11 X11
1100 x12 X12
1101 x13 X13

P3
1110 x14 X14
1111 x15 X15
Figure 11.32 Mapping processors onto 16-point FFT computation.
P0 P1 P2 P3
x0 x1 x2 x3
x4 x5 x6 x7
x8 x9 x10 x11
x12 x13 x14 x15
Figure 11.33 FFT using transpose algorithm: first two steps.
P0 P1 P2 P3
x0 x1 x2 x3
x4 x5 x6 x7
x8 x9 x10 x11
x12 x13 x14 x15
Figure 11.34 Transposing array for transpose algorithm.
P0 P1 P2 P3
x0 x4 x8 x12
x1 x5 x9 x13
x2 x6 x10 x14
x3 x7 x11 x15
Figure 11.35 FFT using transpose algorithm: last two steps.
7
2
Mask
1
1 2 3 4 5 6 7
Figure 11.36 Image for Problem 11-3.
First choice C0 C1 Cn−1
Second choice Not Not Not

including including including
C0 C1 Cn−1
Third choice
Figure 12.1 State space tree.
1 p p+1 m
Parent A A1 A2
1 p p+1 m
Parent B B1 B2
1 p p+1 m
Child 1 A1 B2
1 p p+1 m
Child 2 B1 A2
Figure 12.2 Single-point crossover.
Subpopulation
Migration path;
every island sends
to every other island
Figure 12.3 Island model.
Island subpopulations
Limited migration path
Figure 12.4 Stepping stone model.
Program
Instructions
Clock
Processors
with local
memory
Data
Shared memory
Figure D.1 PRAM model.
d[0] s[0] d[1] s[1] d[2] s[2] d[3] s[3] d[4] s[4] d[5] s[5] d[6] s[6] d[7] s[7]
1 1 1 1 1 1 1 0
Null
2 2 2 2 2 2 1 0
4 4 4 4 3 2 1 0
7 6 5 4 3 2 1 0
Figure D.2 List ranking by pointer jumping.
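The rows of d[] values in Figure D.2 can be reproduced by simulating pointer jumping sequentially. In this sketch each round reads the old d[] and s[] arrays and writes new ones, which mimics the synchronous steps of a PRAM; representing the null pointer as `None` and the name `list_rank` are assumptions made for illustration:

```python
def list_rank(successor):
    """Return the d[] array after each pointer-jumping round, starting with the initial values."""
    n = len(successor)
    s = list(successor)
    # Every element starts one hop from its successor; the last element has distance 0.
    d = [0 if s[i] is None else 1 for i in range(n)]
    history = [list(d)]
    while any(p is not None for p in s):
        new_d, new_s = list(d), list(s)
        for i in range(n):
            if s[i] is not None:
                # Add the successor's distance, then jump over the successor.
                new_d[i] = d[i] + d[s[i]]
                new_s[i] = s[s[i]]
        d, s = new_d, new_s
        history.append(list(d))
    return history


# An eight-element list in which element i points to element i + 1.
for row in list_rank([1, 2, 3, 4, 5, 6, 7, None]):
    print(row)
```

The number of rounds grows as the logarithm of the list length, since the distance covered by each pointer doubles every round.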
Threads or processes
Local computation
(maximum time w)
Maximum of h
sends or receives
Communication
Barrier synchronization
Figure D.3 A view of the bulk synchronous parallel model.
o g
Pi
Next message
Processors Message
Pk
Pi
L o Time
Figure D.4 LogP parameters.
