
TW3740TU Parallel Programming - OpenMP and MPI

Exercises

Andrea Bettini∗
Delft University of Technology, Delft, South Holland, 2628 CD

I. OpenMP Exercises

A. Exercise 1 – Hello World


This exercise looked at the parallel implementation of a "Hello World" script, i.e. each thread prints out "Hello World
from Thread no.#". I initially expected that the threads would say "Hello World" in chronological order, i.e. No. 1,
No. 2, No. 3, etc.; however, it turns out that they print in an arbitrary order. I tested further with
OMP_NUM_THREADS=16, and the order was still arbitrary.
This is not necessarily wrong: each thread's runtime is not necessarily the same due to scheduling and start-up
differences, so the text appears first for whichever thread reaches its printf first. To implement the second part of the
assignment for this task, all that needs to be done is to insert the following statement inside the parallel section:

if (tid==0)
num_threads=omp_get_num_threads();

As well as the following line after the fork-join construct (as I wanted the number of threads to print out at the very end):

printf("Number of threads = %d\n",num_threads);

Meanwhile, the thread ID from which each printf statement originates is already printed by the provided file, so
there is no need to modify the file any further to print out the ID as well.
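For reference, a minimal sketch of how the complete modified file might look; only the two snippets above (and the names tid and num_threads) come from the assignment code, the surrounding structure is an assumption:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int num_threads = 0;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello World from Thread no.%d\n", tid);

        /* only thread 0 records the team size */
        if (tid == 0)
            num_threads = omp_get_num_threads();
    }

    /* printed after the fork-join construct, i.e. at the very end */
    printf("Number of threads = %d\n", num_threads);
    return 0;
}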

B. Exercise 2 – Getting Information from the Computer you are using


This exercise is quite trivial. I did this exercise on my HP ZBook Studio G4 laptop. The following outputs compare
the maximum number of threads and cores given by lscpu and getenvironinfo.c:

                      lscpu    getenvironinfo.c
# Physical Cores        4             ~
# Threads per Core      2             ~
# Cores                 8             8
Max # Threads           8          Depends

Table 1   Difference of results between lscpu and getenvironinfo.c

Quite clearly, both the program and the Linux shell command return the same number of cores (which is just the
number of processors reported by the program). It should be noted that I only have four physical cores, and that
eight logical cores are reported due to hyper-threading, hence a maximum of eight hardware threads can run at once.
Meanwhile, the program returns the maximum number of threads set by OMP_NUM_THREADS, i.e. if OMP_NUM_THREADS=32,
then the program will report a total of 32 threads. The number of threads per core would then be the total number of
threads divided by the number of cores I have. The difference in results stems from the fact that the Linux shell returns
the total number of threads that can execute at once, whereas the program returns the total number of creatable threads
(as far as I understand it).
∗ Undergraduate Student, Faculty of Aerospace Engineering, Delft University of Technology, Kluyverweg 1, 2629 HS Delft

To confirm the results given by lscpu, I checked the processor of this computer (Intel © Core™ i7-7700HQ CPU
@ 2.80GHz × 4), which indeed does have four cores per processor∗ .
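For context, a sketch of the kind of OpenMP runtime queries a program like getenvironinfo.c presumably makes; the actual contents of the provided file are not reproduced here, so the exact calls below are an assumption:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* number of logical processors the runtime sees (8 on this machine) */
    printf("Number of processors = %d\n", omp_get_num_procs());
    /* upper limit on the team size, i.e. whatever OMP_NUM_THREADS is set to */
    printf("Maximum threads      = %d\n", omp_get_max_threads());
    return 0;
}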

C. Exercise 3 – Parallelization of Matrix-Vector Multiplication


In this exercise, the parallelization of a Matrix-Vector multiplication is undertaken. The main idea is to carry out the
following multiplication of a matrix $A \in \mathbb{R}^{n \times n}$ on a vector $b \in \mathbb{R}^{n}$, better expressed as:

c = Ab (1)

Then for each element i of c:


$c_i = \sum_{j=0}^{N-1} A_{i,j}\, b_j$    (2)

The provided C algorithm in this exercise was originally written such that it works best in a column-major ordering
programming language, such as Fortran. Note that the code as given is also wrong, as the matrix-vector multiplication is
incorrect. The sequential code for n = 1000 gave a run time of 0.00335 s. Maintaining the same column-access aspect
of this program requires that the parallelization uses private arrays which are summed up at the end (and that the code is
rewritten to be correct). This can be achieved easily enough using the reduction clause and modifying the index used
for c. Thus, the lines of code which allow the correct parallelization are:

#pragma omp parallel for private(j) reduction(+:c)


for (i=0; i < SIZE; i++){
for (j=0; j < SIZE; j++){
c[j] = c[j] + A[j][i] * b[i];
}
}

Quite obviously this is not an ideal piece of code, as it needs a global array reduction and C is a row-major
ordering language. The latter is a big issue, as the inefficient memory access pattern causes cache thrashing.
Therefore, to optimize it, it is far better to simply rewrite the code to work with row-major ordering, to
accumulate the intermediate result in a private summation variable, and to access the c array only once per i iteration;
this prevents the (same) cache line from being invalidated by every thread each time the computation in the j loop writes to c.
This also means that a global reduction is not necessary. The following lines of code can be used to achieve this:

Sum = 0.0;
#pragma omp parallel for private(j) firstprivate(Sum)
for (i=0; i < SIZE; i++){
for (j=0; j < SIZE; j++)
Sum += A[i][j] * b[j];
c[i] = Sum;
Sum = 0.0;
}

This change drastically improves performance. I ran both the unoptimized and optimized versions of the code on
my laptop (the same computer as in the previous exercise), resulting in the following execution times:
∗ The link to the computer and all the information on the processor itself is given here: https://support.hp.com/us-en/product/hp-zbook-15-g4-mobile-workstation/14840009/document/c05459695

Threads Unoptimized [s] Optimized [s]
1 0.00370 0.00343
2 0.00207 0.00161
4 0.00121 0.000893
8 0.00225 0.00108
16 0.00191 0.00136
Table 2 Execution times for Matrix-Vector Multiplications with the unoptimized code and optimized code

Note that even though I have eight hardware threads, using four threads achieved the best run time with OpenMP. This is
likely because the two hyper-threads on a physical core share that core's cache, so the threads on one core compete for
cache space and thrash it; the problem is not poor spatial locality (the data each thread uses is located
contiguously!) but a hyper-threading limitation.
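One way to test this hypothesis (not done for the measurements above) would be to pin at most one thread per physical core using the standard OpenMP affinity controls, for example:

export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./matvec        # assumed executable name for the matrix-vector program

With this binding, the four threads each get their own physical core and private cache, so any remaining slowdown could no longer be blamed on hyper-threading.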

D. Exercise 4 – Parallel approximation of Pi (numerical integration)


For this exercise, the numerical integration of π is attempted using a summation scheme. This leads to quite high
run times when running the code sequentially with many summation terms, i.e. 150,000,000 terms; the sequential
code provided took 1.721 s to run. The computationally intensive part of the code resides in the loop of the CalcPi()
function, hence parallelizing this section should increase the performance of the program. As there
are no dependencies, parallelization is incredibly easy, and can be done by inserting the following directive and clause:

#pragma omp parallel for reduction(+:fSum)
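For context, the loop in CalcPi() presumably has roughly the shape below once the directive is applied; the original function body is not reproduced in this report, so the midpoint-rule integrand and every name other than fSum are assumptions:

double CalcPi(long numTerms)
{
    const double fH = 1.0 / (double)numTerms;   /* width of one sub-interval */
    double fSum = 0.0;
    double fX;
    long   i;

    #pragma omp parallel for private(fX) reduction(+:fSum)
    for (i = 0; i < numTerms; i++)
    {
        fX = fH * ((double)i + 0.5);            /* midpoint of sub-interval i */
        fSum += 4.0 / (1.0 + fX * fX);          /* integrand of pi = integral of 4/(1+x^2) over [0,1] */
    }
    return fH * fSum;
}

The loop index i is automatically private, and the reduction clause gives each thread its own partial fSum that is combined at the end of the loop.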

This results in the following runtimes of the program (using my laptop from Exercise 2), with the speedup and efficiency
given as well:

# Threads Runtime [s] Speedup Efficiency


1 1.804 1.000 0.125
2 0.926 1.945 0.243
3 0.629 2.871 0.359
4 0.479 3.767 0.471
6 0.327 5.523 0.690
8 0.248 7.277 0.910
10 0.287 6.277 0.785
12 0.255 7.068 0.884
14 0.264 6.830 0.854
16 0.268 6.740 0.843
Table 3 Runtime of the parallelized pi.c program with different threads, alongside the speedup and efficiency

Note that, in contrast to the previous exercise, there is speedup all the way up to eight threads, which is the opposite
of what was observed before. This makes sense, as this program barely needs to fetch data from memory for its
computation, so the two hyper-threads on each core do not thrash the cache. The lower efficiency seen beyond eight
threads can be attributed to overhead from the runtime scheduler and kernel thread mapping, as well as to load
imbalance, since not all threads get assigned the same amount of work anymore.

E. Exercise 5 – An error-prone code: fixit.c


This exercise creates a private array that each thread works on. However, the program contains two simple
blunders. The first blunder is not making j private: since j is shared, the threads race on the inner loop counter,
incrementing each other's j and corrupting the nested loop. To fix this problem, it is simple enough to declare j
private. The second blunder is the size of N. With N = 1048, the matrix a of the problem is too big to store on the
stack (a 1048 × 1048 matrix of doubles is about 8580.5 kbytes, while the stack allocated to programs on my computer
is 8192 kbytes), which causes a segmentation fault. To fix this blunder, reducing N to something along the lines of 700
still works while avoiding segmentation faults (an alternative heap-based fix is sketched below). Also, for clarity, a
#pragma omp barrier should be placed right after the if statement that collects the number of threads used in the
problem.
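The heap-based alternative could look roughly like this; it keeps N = 1048 but gives each thread its own dynamically allocated copy instead of relying on the (too small) thread stack. The names N, i, j and tid follow fixit.c; the initialization formula and everything else here is an assumption:

#include <stdlib.h>   /* for malloc/free */

#pragma omp parallel private(i, j)
{
    int tid = omp_get_thread_num();
    /* the per-thread copy lives on the heap, so the 8 MB stack limit no longer applies */
    double (*a)[N] = malloc(sizeof(double[N][N]));
    if (a != NULL) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = tid + i + j;
        free(a);
    }
}

Raising the stack limit (ulimit -s, or OMP_STACKSIZE for the worker threads) would also work, but the heap allocation avoids the limit altogether.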
A second variant of this program uses a single shared matrix in which each element is the sum of its indices. In this
case, there are three blunders: j not being private, N being too large, and the a matrix being private. To fix these,
it is simple enough to make j private, set N to something along the lines of 900, and remove a from the private list.
The following code can also be used to replace the entirety of the program (the tid bookkeeping of the original file is
no longer needed here).

#include <stdio.h>
#include <omp.h>

#define N 900   /* kept small enough for the matrix to fit on the stack */

int main (int argc, char *argv[]) {

    int i, j;
    double a[N][N];

    /* The matrix is shared; each thread fills its own chunk of rows */
    #pragma omp parallel for private(j)
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            a[i][j] = i + j;

    /* Print the result to ensure that the outcome is correct */
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            printf("a[%d][%d]=%lf\n", i,j,a[i][j]);

    return 0;
}

F. Exercise 6 – Matrix-matrix multiplication code


In this exercise, the parallelization of a Matrix-Matrix multiplication is undertaken. The main idea is to carry out the
following multiplication of a matrix $A \in \mathbb{R}^{n \times m}$ with another matrix $B \in \mathbb{R}^{m \times p}$, producing a matrix $C \in \mathbb{R}^{n \times p}$, better
expressed as:
C = AB (3)
Then for element i, j of C:
$C_{i,j} = \sum_{k=0}^{m-1} A_{i,k}\, B_{k,j}$    (4)

With the given sequential exercise code, the runtime of the program is 0.326 s on my laptop, which is rather slow
for a matrix multiplication of n = 512, m = 512, p = 256. Parallelizing this sequential code is rather easy as well:
since all rows of A are independent of each other, adding parallelization to the loop over the i
index is sufficient to return the same answers as the sequential code. This can be done by replacing:

for (i=0; i<NRA; i++){


for(j=0; j<NCB; j++)
for (k=0; k<NCA; k++)
c[i][j] += a[i][k] * b[k][j];
}

with a version that parallelizes the outer i loop, makes the other indices (j, k) private, and also privatizes the
B matrix (although that is not strictly necessary). The following OpenMP code can then be used to
replace the former:

#pragma omp parallel for private(j,k) firstprivate(b)
for (i=0; i<NRA; i++){
for(j=0; j<NCB; j++)
for (k=0; k<NCA; k++)
c[i][j] += a[i][k] * b[k][j];
}

This parallelization yields the following runtimes, speedups, and efficiencies:

Threads Runtime [s] Speedup Efficiency


1 0.456 1.000 0.125
2 0.255 1.790 0.224
4 0.143 3.196 0.400
8 0.148 3.080 0.385
Table 4 Runtime of the parallelized mm.c program with different threads, alongside the speedup and efficiency

To investigate the effects of different chunk sizes for each thread, it is possible to add in a schedule(dynamic,chunk)
clause to the previous combination of directives and clauses. This results in the following replacement code:

#pragma omp parallel for private(j,k) firstprivate(b) schedule(dynamic,chunk)


for (i=0; i<NRA; i++){
for(j=0; j<NCB; j++)
for (k=0; k<NCA; k++)
c[i][j] += a[i][k] * b[k][j];
}

To test the chunk size effects, four different chunk sizes were chosen: 64, 128, 256, 512. On my laptop, a chunk
size of 128 should maximize performance, considering that only four physical cores exist and that hyper-threading is
not ideal in this case due to cache thrashing. The following runtimes, speedups, and efficiencies were obtained for
these chunk sizes:

Threads   RunTime [s]   Speedup   Efficiency
1         0.369         1.000     0.125
2         0.198         1.864     0.233
4         0.106         3.467     0.433
8         0.116         3.169     0.396
Table 5   Runtime, Speedups, and Efficiencies of mm.c with a chunk size = 64

Threads   RunTime [s]   Speedup   Efficiency
1         0.375         1.000     0.125
2         0.203         1.844     0.230
4         0.107         3.495     0.437
8         0.132         2.839     0.355
Table 6   Runtime, Speedups, and Efficiencies of mm.c with a chunk size = 128

Threads   RunTime [s]   Speedup   Efficiency
1         0.381         1.000     0.125
2         0.200         1.900     0.238
4         0.201         1.892     0.237
8         0.209         1.823     0.223
Table 7   Runtime, Speedups, and Efficiencies of mm.c with a chunk size = 256

Threads   RunTime [s]   Speedup   Efficiency
1         0.422         1.000     0.125
2         0.426         0.992     0.124
4         0.424         0.997     0.125
8         0.423         0.998     0.125
Table 8   Runtime, Speedups, and Efficiencies of mm.c with a chunk size = 512

As can be seen from Table 5 and Table 6, the runtimes are fairly similar to each other (and hence so are the speedups
and efficiencies). This is not unexpected: the run time for a chunk size of 64 with eight threads is again slower
than with four threads, in spite of hyper-threading, due to cache thrashing, and the rest of Table 5 and Table 6 are
alike because dynamic scheduling was used and the data is partitioned such that all physical cores have work.
This is in contrast to Table 7 and Table 8, where the partitioning of the work does not give every core something to do.
For example, for a chunk size of 256 there can only be two equal partitions of work for i = 512, so at most two threads
can be working, which is why there is no speedup beyond two threads in Table 7. Meanwhile, for a chunk size of 512
there is clearly only one partition of work, meaning only one thread ever receives work, hence only one physical core
does anything.
To further improve the performance of the program, it is possible to undertake a block matrix multiplication as
spatial and temporal locality is improved. For this exercise, I will be creating the following block partitioning scheme:

Thread 0:   | 0 | 1 | 2 | ··· | col_p - 1 |
Thread 1:   | 0 | 1 | 2 | ··· | col_p - 1 |
   ⋮                     ⋮

To understand this scheme: essentially all threads carry out stripe partitioning, except that each block stripe is subdivided into
col_p subblocks, where the number of subblocks per stripe block is a user input. After subdividing the blocks, each
thread must carry out its own separate summation and only then add this sum to the shared matrix (or risk race conditions if
the shared matrix is modified directly!). What I notice is that I am hindered by the B matrix, as its elements are accessed
column-wise, which has terrible spatial locality for a C program. I believe that this scheme increases the temporal
locality, as fewer cache lines are needed to hold the A matrix entries in use, giving a smaller probability that an important
cache line containing A matrix entries is replaced by cache lines containing B matrix entries. Furthermore,
storing fewer cache lines per subdivision may allow hyper-threading to pay off. A sample piece of code is:

Sum = 0;
#pragma omp parallel for private(p, j, k) firstprivate(Sum) schedule(dynamic)
for (i=0 ; i<NRA; i++){
for(p=0; p<col_p; p++){
for(j=0; j<NCB; j++){
for (k=0 + NCA/(col_p) * p; k < NCA/(col_p) * (p + 1); k++)
Sum += a[i][k] * b[k][j];
c[i][j] += Sum;
Sum = 0;
}
}
}

For col_p = 32, the following runtimes, speedups, and efficiencies are obtained:

Threads RunTime [s] Speedup Efficiency


1 0.424 1.000 0.125
2 0.222 1.907 0.238
4 0.133 3.191 0.399
8 0.085 4.986 0.623
Table 9 Runtime of the parallelized mm.c program with different threads, alongside the speedup and efficiency
for block-matrix operations

With this partitioning, I have achieved far better times than with stripe partitioning, which I believe confirms that
block partitioning allows my computer to better use its resources. However, there is not much improvement over stripe
partitioning for lower numbers of threads. A further recommendation would be to redo the scheme but take the
transpose of the B matrix first, allowing for better spatial locality during the matrix-matrix multiplication itself. However,
this may end up taking longer due to the transpose operation itself, since the entries being transposed are far apart
in memory. A sketch of this variant is shown below.
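In this sketch the temporary array bt (of size NCB × NCA) is an assumption, and this version was not timed:

/* transpose B once so that the innermost loop walks both operands row-wise */
#pragma omp parallel for private(j)
for (i=0; i<NCA; i++)
    for (j=0; j<NCB; j++)
        bt[j][i] = b[i][j];

#pragma omp parallel for private(j, k, Sum)
for (i=0; i<NRA; i++){
    for (j=0; j<NCB; j++){
        Sum = 0.0;
        for (k=0; k<NCA; k++)
            Sum += a[i][k] * bt[j][k];   /* contiguous accesses in both a and bt */
        c[i][j] = Sum;
    }
}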
A second block-matrix scheme to consider is to partition both rows and columns instead of just rows. The matrix is
then subdivided into smaller matrices such that the parent matrix is composed of row_p × col_p
submatrices. For example, consider a single matrix broken up into 4 × 4 submatrices, with the entries below showing
the submatrix/partition index p:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Then each thread can take a block (or several blocks), and perform block matrix operations as needed. In this sense, the
following code achieves this:

Sum = 0;
#pragma omp parallel for private(i, j, k) firstprivate(Sum) schedule(dynamic)
for(p=0; p<col_p*row_p; p++){
    for (i=NRA/(row_p) * (p/col_p) ; i<NRA/(row_p) * (p/col_p + 1); i++){
        for(j=0; j<NCB; j++){
            for (k=NCA/(col_p) * (p%col_p); k < NCA/(col_p) * (p%col_p + 1); k++)
                Sum += a[i][k] * b[k][j];
            /* blocks with the same row range but different column ranges can be
               assigned to different threads, so the update of c must be atomic */
            #pragma omp atomic
            c[i][j] += Sum;
            Sum = 0;
        }
    }
}

Note the integer division in the second for loop! The following runtimes, speedups, and efficiencies are achieved with
a subdivision into row_p = 64, col_p = 32 for different numbers of threads:

Threads RunTime [s] Speedup Efficiency


1 0.440 1.000 0.125
2 0.231 1.901 0.238
4 0.122 3.605 0.451
8 0.102 4.313 0.539
Table 10 Runtime of the parallelized mm.c program with different threads, alongside the speedup and efficiency
for the second version of the block-matrix operations

Overall, this scheme does not show much improvement over the first version of the block-matrix operation. This may
have to do with the fact that each thread still has to access different rows, which are spaced far apart in memory.

II. MPI Exercises


Note that throughout the exercises, I have included large chunks of code in the text as that should minimize the
amount of time spent reading through actual code!

A. Exercise 1 – Ping Pong


The first part of the code consists of having two processors ping each other. To ensure that they both get a message,
processor 0 sends a message "42" to processor 1. Processor 1 then multiplies this message by -1 and sends a message
"-42" back to processor 0. At every step of the communication there is a printout of what message is
received. Since most of the code was marked TODO, I have included the entire program here as well:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv){


int myRank, numProcs;
int pingCount = 42;
int pongCount = 0;

MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

if (myRank == 0){
printf("Sending Ping (# %i) (P%d)\n", pingCount, myRank);
MPI_Send(&pingCount, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Recv(&pongCount, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
printf("Received Pong (# %i) (P%d)\n", pongCount, myRank);
}
else{
MPI_Recv(&pingCount, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
printf("Received Ping (# %i) (P%d)\n", pingCount, myRank);
pongCount = -1 * pingCount;
MPI_Send(&pongCount, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
printf("Sending Pong (# %i) (P%d)\n", pongCount, myRank);
}
MPI_Finalize();
return 0;
}
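As a usage note, the program is built and run in the usual MPI way; the source file name below is an assumption:

mpicc pingpong.c -o pingpong
mpirun -np 2 ./pingpong

The behaviour for process counts other than two is discussed further below.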

The second part of the exercise involved having each rank send a randomly selected number of elements to the other
rank. To do so, the number of elements to send must be communicated first. This part is fairly simple,
and can be achieved using the following code snippet:

if (myRank == 0){
printf("Sending %i elements (P%d)\n", numberOfElementsToSend, myRank);
MPI_Send(&numberOfElementsToSend, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Recv(&numberOfElementsToSend, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
numberOfElementsReceived = numberOfElementsToSend;
printf("Received %i elements (P%d)\n", numberOfElementsReceived, myRank);
}
else{ // myRank == 1
MPI_Recv(&numberOfElementsToSend, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
numberOfElementsReceived = numberOfElementsToSend;
printf("Received %i elements (P%d)\n", numberOfElementsReceived, myRank);
printf("Sending back %i elements (P%d)\n", numberOfElementsToSend, myRank);
MPI_Send(&numberOfElementsToSend, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

The main issue with the code provided for the first and second parts of the exercise is that the number of processors
passed to mpirun is not checked. The exercise only works correctly and finalizes correctly if exactly
two processors are used. For a single processor, the program simply fails to run, as rank 1 does not exist and is
therefore an invalid destination. For more than two processors, all processors except myRank=0 and myRank=1 will
wait for a message from myRank=0; since only myRank=1 ever receives one, all the other processors wait indefinitely
until the program is aborted. To fix this, there needs to be a check when initializing the program that the number of
processors is greater than 1. The following will work:

if (numProcs == 1){
printf("not enough processors used!\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}

Then, to fix the issue of all other processors blocking, it is simple enough to switch the else statement
from the second part into an else if statement, i.e.:

else if (myRank == 1)

For the bonus part of the question, it is simple enough to use the following lines of code before the receive statement
to find the number of elements being received from myRank=1 (to myRank=0):

MPI_Probe(1, 0, MPI_COMM_WORLD, &status);


MPI_Get_count(&status, MPI_INT, &numberOfElementsReceived);

as well as the ones being received from myRank=0 (to myRank=1):

MPI_Probe(0, 0, MPI_COMM_WORLD, &status);


MPI_Get_count(&status, MPI_INT, &numberOfElementsReceived);

For the last part of the exercise, to derive an approximation of the communication time spent purely in
MPI_Send and MPI_Recv, a timer is placed around each communication statement while a loop doubles the
array size of the data sent up to 26 times, so that the maximum array size is 2^26 elements. This exercise was
done both on my laptop and on the HPC cluster. The individual runtimes are given in Table 18,
with the averages of the processor communication times in Table 19 (both located in the Appendix due to their size).
Note that, on average, it takes about the same time to receive and send data between processors! The communication
time tables, as graphs, are shown below. Note that the array size is expressed in bytes, and that communication in this
exercise uses integers, which are by default 4 bytes long:

Fig. 1   Correlation between array size and average send time for my computer
Fig. 2   Correlation between array size and average receive time for my computer
Fig. 3   Correlation between array size and average send time for the HPC cluster
Fig. 4   Correlation between array size and average receive time for the HPC cluster
The approximated linear communication time functions (as a function of the message size m in bytes) are given below:

$t_{send,local}(m) = 2.3267 \times 10^{-10}\, m + 1.0757 \times 10^{-4}$    (5)
$t_{recv,local}(m) = 2.3269 \times 10^{-10}\, m + 1.0350 \times 10^{-4}$    (6)
$t_{send,cluster}(m) = 5.8234 \times 10^{-10}\, m - 6.0888 \times 10^{-5}$    (7)
$t_{recv,cluster}(m) = 5.8231 \times 10^{-10}\, m - 6.2781 \times 10^{-5}$    (8)
Quite clearly, the constant coefficients for the cluster are incorrect: the startup time for communication cannot be
negative, meaning that the fitted functions cannot truly be linear. It can also be seen that the cluster communication
times are far slower than those of my computer! Note that, to estimate the total time spent sending, probing, and
receiving, the send and receive times of the cluster or local machine can simply be added to get a good approximation
of the time spent communicating.
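For reference, the timing loop described above might look as follows on the myRank=0 side; the buffer and variable names are assumptions, myRank=1 mirrors it with the send and receive swapped, and status is the MPI_Status declared earlier:

int maxExp = 26;
int *buf = malloc((1 << maxExp) * sizeof(int));   /* largest message: 2^26 ints */
double tStart, tSend, tRecv;
int e, count;

for (e = 0; e <= maxExp; e++){
    count = 1 << e;                               /* 2^e integers in this round */

    tStart = MPI_Wtime();
    MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
    tSend = MPI_Wtime() - tStart;                 /* time spent inside MPI_Send */

    tStart = MPI_Wtime();
    MPI_Recv(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    tRecv = MPI_Wtime() - tStart;                 /* time spent inside MPI_Recv */

    printf("2^%d ints: send %f s, recv %f s\n", e, tSend, tRecv);
}
free(buf);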

B. Exercise 2 – Calculating Pi using a Controller-Workers Scheme


In this exercise, π is calculated using an integration scheme. In order to do so, the integration code block is
parallelized using MPI. Below are the runtimes for the calculation of π using different numbers of processors on my
computer:

Processors Runtime [s]


1 ~
2 6.668
4 6.668
8 6.673
Table 11 Runtime of the original integration.c file with different numbers of processors

Quite clearly the run times are the same for any number of processors. No result is given for one processor, as there are
no workers for the controller to control, so the program fails immediately. The reason the execution time is the same for
different processor counts is that the code essentially behaves as if it were sequential. The problem lies in the controller
behaviour: for each step, the controller sends out work to a single processor, but then waits for that same processor to
return its result before moving on to the next step. In the next step, the controller again sends work to
a single (different) processor, waits for it to return information, and so forth.
To parallelize this code correctly, the controller should send an initial packet of work to every worker.
Afterwards, it should wait for a result from any worker, note that worker's rank, and immediately send new work back to it, while
incrementing the number of steps handed out. As long as the number of steps sent is less than the maximum number of steps this
should keep going, and while results are still outstanding the controller should keep waiting for responses from the workers.
An acceptable implementation that parallelizes the work over the processors correctly is shown below:

double integrate(double (*f)(double x), double x_start, double x_end, int maxSteps){
int myRank;
double sum = 0;
double x[2], y[2];

MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
MPI_Status status;

if (myRank == 0){
double stepSize = (x_end - x_start)/(double)maxSteps;
int step = 0;
int items = 0;
int nextRank;
int numProcs;

MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

// CONTROLLER - Send Initial Work


for (nextRank = 1; nextRank < numProcs; nextRank++){
x[0] = x_start + stepSize*step;
x[1] = x_start + stepSize*(step+1);
// Send the work
MPI_Send(x, 2, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step++;
}
// SEND MORE WORK NOW
while (items++ < maxSteps){
// Receive the result
MPI_Recv(y, 2, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &status);
sum += stepSize*0.5*(y[0]+y[1]);

nextRank = status.MPI_SOURCE;
if(step < maxSteps){
x[0] = x_start + stepSize*step;
x[1] = x_start + stepSize*(step+1);
// Send the work
MPI_Send(x, 2, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step++;
}
}
// Signal workers to stop by sending empty messages with tag TAG_END
for (nextRank = 1; nextRank < numProcs; nextRank++)
MPI_Send(&nextRank, 0, MPI_INT, nextRank, TAG_END, MPI_COMM_WORLD);
}
else{
while (1){
// I am a worker, wait for work
// Receive the left and right points of the trapezoid and compute
// the corresponding function values. If the tag is TAG_END, don’t
// compute but exit.
MPI_Recv(x, 2, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == TAG_END) break;
y[0] = f(x[0]);
y[1] = f(x[1]);

// Send back the computed result
MPI_Send(y, 2, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
}
}
return sum;
}

This set of code returns a proper approximate answer of π with the following run times:

Local Cluster
# Processors Runtime [s] Speedup Efficiency Runtime [s] Speedup Efficiency
1 ~ ~ ~ ~ ~ ~
2 6.6677 1.0000 0.1250 6.7044 1.0000 0.03125
4 2.2895 2.9123 0.3640 2.2978 2.9178 0.09118
8 1.0404 6.4087 0.8011 1.0432 6.4268 0.2008
16 0.6386 10.4417 1.3052 0.5420 12.3697 0.3866
32 0.4496 14.8310 1.8538 ~ ~ ~
Table 12   Runtimes, speedups, and efficiencies of the parallelized code shown above, for both my local
machine and the cluster.

Note that I could not get the HPC cluster to run this task with all 32 cores, hence that entry is omitted from the table.
Interestingly enough, even though my computer does not have 16 or 32 physical cores, the speedup (and thus the
efficiency) keeps increasing the more processes I use. This is probably because a large part of each process's time is
spent waiting on sends and receives; by increasing the number of processes, I reduce the fraction of time each physical
core sits idle, so there is less waiting between each send and receive per core (as more computations are in flight).
However, this is still not as good as actually having 16 physical cores, as the speedup with 16 processes is greater on
the cluster than on my local machine. I suspect the run time would have been even lower if I could have used the full
32 cores of the HPC cluster.
For the next part of the exercise, splitting the respective code into a controller function and a worker function is not
too difficult and makes the code far more readable. The following code runs in parallel while
maintaining readability:

double integrate(double (*f)(double x), double x_start, double x_end, int maxSteps){
int myRank;
double sum = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

if (myRank == 0){
sum = controller(x_start,x_end,maxSteps);
}
else{
worker(*f);
}
return sum;
}

double controller(double x_start, double x_end,int maxSteps){


double stepSize = (x_end - x_start)/(double)maxSteps;
double sum = 0;
double x[2], y[2];
int step = 0;

int items = 0;
int nextRank;
int numProcs;

MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

// CONTROLLER - Send Initial Work


for (nextRank = 1; nextRank < numProcs; nextRank++){
x[0] = x_start + stepSize*step;
x[1] = x_start + stepSize*(step+1);
// Send the work
MPI_Send(x, 2, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step++;
}
// SEND MORE WORK NOW
while (items++ < maxSteps){
// Receive the result
MPI_Recv(y, 2, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &status);
sum += stepSize*0.5*(y[0]+y[1]);

nextRank = status.MPI_SOURCE;
if(step < maxSteps){
x[0] = x_start + stepSize*step;
x[1] = x_start + stepSize*(step+1);
// Send the work
MPI_Send(x, 2, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step++;
}
}
// Signal workers to stop by sending empty messages with tag TAG_END
for (nextRank = 1; nextRank < numProcs; nextRank++)
MPI_Send(&nextRank, 0, MPI_INT, nextRank, TAG_END, MPI_COMM_WORLD);

return sum;
}

void worker(double (*f)(double x)){


double x[2], y[2];
MPI_Status status;

while (1){
// I am a worker, wait for work
// Receive the left and right points of the trapezoid and compute
// the corresponding function values. If the tag is TAG_END, don’t
// compute but exit.
MPI_Recv(x, 2, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == TAG_END) break;
y[0] = f(x[0]);
y[1] = f(x[1]);
// Send back the computed result
MPI_Send(y, 2, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
}
}

Note that in this code, sending a single trapezoid to a worker is quite inefficient, considering that every message
carries overhead in the form of communication latency and start-up time. It is far better to amortize that constant
start-up cost by giving each worker a block of points to work on. This is easily achieved in both the controller and
worker functions by having the controller send several steps at a time while the worker sends back the corresponding
block of function values. In addition, the controller stops sending work to the workers once the next block it would
send contains too many steps, and instead computes the last, irregular block itself. The following code works for this
case (with the corresponding extra argument added to the integrate function's call to the controller function):

double controller(double x_start, double x_end, int maxSteps, double (*f)(double x)){
double stepSize = (x_end - x_start)/(double)maxSteps;
double sum = 0;
double x[blocksize], y[blocksize];

int step = 0;
int items = 0;
int nextRank;
int numProcs;
int i;

MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

// CONTROLLER - Send Initial Work


for (nextRank = 1; nextRank < numProcs; nextRank++){
for (i = 0; i < blocksize; i++)
x[i] = x_start + stepSize*(step+i);

// Send the work


MPI_Send(x, blocksize, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step += blocksize - 1;
}
// SEND MORE WORK NOW
while (items <= maxSteps - blocksize){
// Receive the result
MPI_Recv(y, blocksize, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &status);

for (i = 0; i < blocksize - 1; i++)


sum += stepSize*0.5*(y[i]+y[i+1]);
nextRank = status.MPI_SOURCE;

if (step <= maxSteps - blocksize){


for (i = 0; i < blocksize; i++)
x[i] = x_start + stepSize*(step+i);
// Send the work
MPI_Send(x, blocksize, MPI_DOUBLE, nextRank, TAG_WORK, MPI_COMM_WORLD);
step += blocksize - 1;
}
items += blocksize - 1;
}
// Signal workers to stop by sending empty messages with tag TAG_END
for (nextRank = 1; nextRank < numProcs; nextRank++)
MPI_Send(&nextRank, 0, MPI_INT, nextRank, TAG_END, MPI_COMM_WORLD);

// in case there are still fewer steps taken than maxSteps
int lastBlock = maxSteps - step;
if (lastBlock>1){
for (i=0;i<lastBlock;i++){
x[i] = x_start + stepSize*(step+i);
y[i] = f(x[i]);
}
for (i=0;i<lastBlock-1;i++)
sum += stepSize*0.5*(y[i]+y[i+1]);
}
return sum;
}

void worker(double (*f)(double x)){


int i;
double x[blocksize], y[blocksize];
MPI_Status status;

while (1){
// I am a worker, wait for work
// Receive the left and right points of the trapezoid and compute
// the corresponding function values. If the tag is TAG_END, don’t
// compute but exit.
MPI_Recv(x, blocksize, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == TAG_END) break;
for (i = 0; i < blocksize; i++)
y[i] = f(x[i]);
// Send back the computed result
MPI_Send(y, blocksize, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
}
}

Using this code, the following run times were recorded for different blocksizes on my computer and on the cluster.
The additional missing entries are cases where the program's answer was not in line with the approximation of π.

Blocksize
# Processors 2 3 4 5 6 7 8 9 10
1 ~ ~ ~ ~ ~ ~ ~ ~ ~
2 6.4698 4.9010 4.3135 4.0679 3.9014 3.8223 3.7811 3.6997 3.5959
4 2.2222 1.8598 1.5698 1.7653 1.7975 1.7467 1.6515 1.7709 1.5343
8 1.0137 0.9943 0.7928 1.1224 1.2133 1.1744 1.0750 1.2599 0.9939
16 0.6195 0.6854 0.5329 0.9265 1.0496 1.0023 ~ ~ 0.9176
32 0.4327 0.5835 0.4111 ~ ~ ~ ~ ~ 0.9988
Table 13   Runtimes [s] of the parallelized code shown above on my local machine, communicating blocks of
entries rather than individual entries.

Blocksize
# Processors 2 3 4 5 6 7 8 9 10
1 ~ ~ ~ ~ ~ ~ ~ ~ ~
2 6.4703 4.9013 4.3137 4.0681 3.9016 3.8226 3.7813 3.6999 3.5957
4 2.2225 1.8601 1.5700 1.7653 1.7977 1.7468 1.6516 1.7711 1.5344
8 1.0100 0.9943 0.7929 1.1177 1.2134 1.1744 1.0751 1.2601 0.9941
16 0.5269 0.6537 0.4930 0.8802 1.0027 0.9870 ~ ~ 0.8940
32 ~ ~ ~ ~ ~ ~ ~ ~ ~
Table 14   Runtimes [s] of the parallelized code shown above on the HPC cluster, communicating blocks of
entries rather than individual entries.

It can be seen that increasing the blocksize generally decreases the run time for small numbers of processors, whereas for
larger numbers of processors it can have varying effects on the runtime (and on the correctness of the answer!). Note
once again the lower runtime on the HPC cluster compared to my machine when 16 processors are used. The optimal
blocksize for this program seems to be 4, giving low run times while still producing the correct solution for large
numbers of processors; for computers with a low number of physical cores, however, a larger block is preferable.
Overall, the work balance between the processors stays roughly the same, although each one does fewer, longer bursts
of work between communications. The exception is processor 0, the controller, which may have to do some additional
work if the chosen blocksize does not divide the maximum number of steps without a remainder. Note also that there
is a trade-off in the blocksize: increasing it too much slows the program down again, as not all cores are utilized
anymore.

C. Exercise 3 – Parallel implementation of Jacobi iteration with relaxation


In this exercise, the Jacobi iteration with relaxation is implemented and parallelized through the use of halo swaps.
Stripe partitioning is used, so the halo swaps only occur in the row direction and not in the column direction.
A sequential code is provided, and the framework for the halo swaps, as well as the iteration and absolute error estimates,
needs to be implemented. To check the validity of the solution, the absolute error is compared. The following code
performs the halo swaps.

void swap_halos(int maxXCount, int maxYCount, double *src, int prevRank, int nextRank,
int myRank, int numProcs){
#define ROW(YY) &src[(YY)*maxXCount]
if (myRank != numProcs-1 && myRank != 0){
MPI_Sendrecv(ROW(maxYCount-2), maxXCount, MPI_DOUBLE, nextRank, 0, ROW(0), maxXCount,
MPI_DOUBLE, prevRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(ROW(1), maxXCount, MPI_DOUBLE, prevRank, 0, ROW(maxYCount-1), maxXCount,
MPI_DOUBLE, nextRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
if (myRank==0){
MPI_Sendrecv(ROW(maxYCount-2), maxXCount, MPI_DOUBLE, nextRank, 0, ROW(maxYCount-1),
maxXCount, MPI_DOUBLE, nextRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
if (myRank==numProcs-1){
MPI_Sendrecv(ROW(1), maxXCount, MPI_DOUBLE, prevRank, 0, ROW(0), maxXCount,
MPI_DOUBLE, prevRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
}
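As a side note, the three cases above can be collapsed into one by exploiting the fact that sending to or receiving from MPI_PROC_NULL is a no-op; the sketch below assumes the caller sets prevRank to MPI_PROC_NULL on the first rank and nextRank to MPI_PROC_NULL on the last rank (myRank and numProcs are then no longer needed but are kept for a drop-in signature):

void swap_halos(int maxXCount, int maxYCount, double *src, int prevRank, int nextRank,
                int myRank, int numProcs){
#define ROW(YY) &src[(YY)*maxXCount]
    /* send the last interior row down, receive the top halo from above */
    MPI_Sendrecv(ROW(maxYCount-2), maxXCount, MPI_DOUBLE, nextRank, 0, ROW(0), maxXCount,
                 MPI_DOUBLE, prevRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send the first interior row up, receive the bottom halo from below */
    MPI_Sendrecv(ROW(1), maxXCount, MPI_DOUBLE, prevRank, 0, ROW(maxYCount-1), maxXCount,
                 MPI_DOUBLE, nextRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#undef ROW
}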

For the iteration error (and absolute error) estimation, it is assumed that the error is simply the square root of the sum of
the squares of the differences between the iterated values and the exact values, divided by the number of grid points,
i.e. for a matrix $A \in \mathbb{R}^{n \times m}$ the iteration error is estimated as:

$\epsilon_{\text{iteration}} = \dfrac{\sqrt{\sum_{i=0}^{n \times m} \epsilon_i^2}}{n \times m}$    (9)
In the program itself, each processor returns the sum of the squares of the differences for its own partition, so all that
is left to do is a global reduction: sum all the processors' contributions, take the square root, and
divide by n × m. To this end, the following code works:

MPI_Reduce(&localError, &error, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);


if (myRank == 0)
error = sqrt(error)/(n*m);
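As a design note (this assumes the convergence check in the iteration loop is evaluated on every rank): if each rank needs the iteration error to decide whether to keep iterating, an MPI_Allreduce delivers the global value everywhere and avoids a separate broadcast from rank 0:

/* every rank receives the summed error, so each one can evaluate the
   convergence criterion locally without an extra MPI_Bcast */
MPI_Allreduce(&localError, &error, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
error = sqrt(error)/(n*m);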

Similarly for the absolute error at the end of the iterations:

MPI_Reduce(&localError, &absoluteError, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);


if (myRank == 0)
absoluteError = sqrt(absoluteError)/(n*m);

The runtime and error of the sequential code is given below for both input and input.big:

Runtime [s] Absolute Error


input 0.4501 0.0005327
input.big 7.1430 0.0001333
Table 15 The runtime and absolute error of the sequential Jacobi code

The runtime and absolute error for the parallelized code, both on my computer and on the HPC cluster, are given below.

input input.big
Processors Runtime [s] Absolute Error Runtime [s] Absolute Error
1 0.5061 0.0005327 7.6118 0.0001333
2 0.2809 0.0005327 4.1331 0.0001333
4 0.1674 0.0005327 2.2924 0.0001333
8 0.1057 0.0005327 1.5805 0.0001333
16 ~ ~ 1.9241 0.0001333
32 ~ ~ 1.9326 0.0001333
Table 16   The runtime and absolute error of the parallelized Jacobi code on my computer. The missing entries
are due to the program not being able to partition the input mesh with the given number of processors.

input input.big
Processors Runtime [s] Absolute Error Runtime [s] Absolute Error
1 0.5928 0.0005327 9.7309 0.0001333
2 0.4035 0.0005327 6.6845 0.0001333
4 0.2530 0.0005327 3.9491 0.0001333
8 0.1560 0.0005327 2.2790 0.0001333
16 ~ ~ 1.3554 0.0001333
32 ~ ~ 0.9530 0.0001333
Table 17 The runtime and absolute error of the parallelized Jacobi code on the HPC cluster. I was lucky
enough to obtain a chance to utilize all 32 cores.

It is easy to see that using multiple cores gives a significantly lower runtime than the sequential code, with the same
absolute error. Note that when using only one core there is still plenty of overhead, as the program has to go through
the MPI bookkeeping and communication logic, which is why the runtime is higher than that of the sequential code in
that case. On my computer, as there are not 16 or 32 cores available, the extra processes must be mapped onto cores
that are already in use, which introduces additional overhead. This is clearly not the case for the cluster, where the
calculation still picks up speedup with 16 or 32 cores. Note that overall the cluster is slower than my computer until
more cores are used (16 and 32). Lastly, note the sublinear speedup: the incremental speedup becomes smaller as more
processors are utilized, which is likely due to the communication overhead of the halo swaps.

III. Appendix

                      Local                                         Cluster
Array          P0                  P1                        P0                  P1
Size    Send [s]  Recv [s]  Send [s]  Recv [s]        Send [s]  Recv [s]  Send [s]  Recv [s]
2^0     0.0000363 0.0000016 0.0000022 0.0000043       0.0000227 0.0000093 0.0000089 0.0000140
2^1     0.0000006 0.0000004 0.0000004 0.0000004       0.0000006 0.0000004 0.0000005 0.0000006
2^2     0.0000003 0.0000003 0.0000003 0.0000002       0.0000003 0.0000003 0.0000003 0.0000003
2^3     0.0000004 0.0000003 0.0000003 0.0000003       0.0000003 0.0000002 0.0000003 0.0000003
2^4     0.0000003 0.0000003 0.0000004 0.0000002       0.0000003 0.0000002 0.0000003 0.0000003
2^5     0.0000083 0.0000003 0.0000105 0.0000004       0.0000003 0.0000004 0.0000003 0.0000005
2^6     0.0000002 0.0000002 0.0000003 0.0000004       0.0000006 0.0000004 0.0000005 0.0000004
2^7     0.0000016 0.0000003 0.0000010 0.0000002       0.0000081 0.0000004 0.0000058 0.0000004
2^8     0.0000005 0.0000002 0.0000007 0.0000004       0.0000046 0.0000004 0.0000018 0.0000004
2^9     0.0000004 0.0000004 0.0000007 0.0000004       0.0000013 0.0000005 0.0000013 0.0000004
2^10    0.0000164 0.0000081 0.0000098 0.0000141       0.0000374 0.0000245 0.0000288 0.0000294
2^11    0.0000049 0.0000043 0.0000055 0.0000040       0.0000088 0.0000065 0.0000075 0.0000076
2^12    0.0000065 0.0000056 0.0000068 0.0000054       0.0000111 0.0000099 0.0000109 0.0000101
2^13    0.0000097 0.0000108 0.0000116 0.0000084       0.0000175 0.0000251 0.0000261 0.0000165
2^14    0.0000185 0.0000155 0.0000164 0.0000177       0.0000516 0.0000362 0.0000376 0.0000506
2^15    0.0000307 0.0000282 0.0000288 0.0000295       0.0000636 0.0000663 0.0000676 0.0000623
2^16    0.0000587 0.0000527 0.0000538 0.0000579       0.0001419 0.0001488 0.0001502 0.0001406
2^17    0.0001179 0.0001039 0.0001050 0.0001167       0.0002442 0.0002793 0.0002807 0.0002427
2^18    0.0002252 0.0001990 0.0002009 0.0002240       0.0004828 0.0005491 0.0005507 0.0004812
2^19    0.0003766 0.0004178 0.0004209 0.0003760       0.0009780 0.0010904 0.0010922 0.0009764
2^20    0.0009885 0.0012251 0.0012316 0.0009878       0.0021992 0.0021863 0.0021882 0.0021975
2^21    0.0023471 0.0026114 0.0026284 0.0023466       0.0039863 0.0052347 0.0052370 0.0039848
2^22    0.0045879 0.0049289 0.0049542 0.0045832       0.0082766 0.0093836 0.0093872 0.0082736
2^23    0.0091726 0.0088604 0.0088912 0.0091700       0.0168217 0.0218687 0.0218748 0.0168173
2^24    0.0146646 0.0168732 0.0168992 0.0146598       0.0359788 0.0429652 0.0429725 0.0359725
2^25    0.0281039 0.0347565 0.0347829 0.0281009       0.0710643 0.0852700 0.0852776 0.0710563
2^26    0.0558845 0.0558826 0.0687115 0.0687359       0.1422934 0.1700836 0.1700917 0.1422867
Table 18   Raw runtime data of communication time for both local computer and the HPC cluster

Array           Local                              Cluster
Size     Avg. Send [s]  Avg. Recv [s]       Avg. Send [s]  Avg. Recv [s]
2^0      0.0000192      0.0000029           0.0000158      0.0000116
2^1      0.0000005      0.0000004           0.0000005      0.0000005
2^2      0.0000003      0.0000003           0.0000003      0.0000003
2^3      0.0000003      0.0000003           0.0000003      0.0000003
2^4      0.0000004      0.0000002           0.0000003      0.0000003
2^5      0.0000094      0.0000003           0.0000003      0.0000005
2^6      0.0000003      0.0000003           0.0000005      0.0000004
2^7      0.0000013      0.0000002           0.0000070      0.0000004
2^8      0.0000006      0.0000003           0.0000032      0.0000004
2^9      0.0000005      0.0000004           0.0000013      0.0000005
2^10     0.0000131      0.0000111           0.0000331      0.0000270
2^11     0.0000052      0.0000041           0.0000081      0.0000071
2^12     0.0000066      0.0000055           0.0000110      0.0000100
2^13     0.0000106      0.0000096           0.0000218      0.0000208
2^14     0.0000175      0.0000166           0.0000446      0.0000434
2^15     0.0000297      0.0000289           0.0000656      0.0000643
2^16     0.0000563      0.0000553           0.0001460      0.0001447
2^17     0.0001114      0.0001103           0.0002625      0.0002610
2^18     0.0002130      0.0002115           0.0005168      0.0005152
2^19     0.0003987      0.0003969           0.0010351      0.0010334
2^20     0.0011101      0.0011064           0.0021937      0.0021919
2^21     0.0024878      0.0024790           0.0046117      0.0046098
2^22     0.0047710      0.0047561           0.0088319      0.0088286
2^23     0.0090319      0.0090152           0.0193483      0.0193430
2^24     0.0157819      0.0157665           0.0394756      0.0394689
2^25     0.0314434      0.0314287           0.0781709      0.0781631
2^26     0.0622978      0.0623093           0.1561925      0.1561851
Table 19   Averaged runtime data of communication time for both local computer and the HPC cluster

