
Course Title: Introduction to OpenMP

1 Introduction
1.1 Introduction
In high performance computing, there are tools that assist programmers with multi-threaded parallel processing on
distributed-memory and shared-memory multiprocessor platforms.

On distributed-memory multiprocessor platforms, each processor has its own memory whose content is not readily
available to other processors. Sharing of information among processors is customarily facilitated by message passing
using routines from standard message passing libraries such as MPI.

On shared-memory multiprocessors, memory among processors can be shared. Message passing libraries such as MPI
can be, and are, used for the parallel processing tasks. However, a directive-based OpenMP Application Program
Interface (API) has been developed specifically for shared-memory parallel processing.

OpenMP has broad support from many major computer hardware and software manufacturers. Similar to MPI's
achievement as the standard for distributed-memory parallel processing, OpenMP has emerged as the standard for
shared-memory parallel computing. Both of these standards can be used in conjunction with Fortran 77, Fortran 90, C or
C++ for parallel computing applications. It is worth noting that for a cluster of single-processor and shared-memory
multiprocessor computers, it is possible to use both paradigms in the same application program to effect an increase in the
aggregate processing power. MPI is used to connect all machines within the cluster to form one virtual machine, while
OpenMP is used to exploit the shared-memory parallelism on individual shared-memory machines within the cluster. This
approach is commonly referred to as Multi-Level Parallel Programming (MLP).
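As a rough illustration of the hybrid idea, the sketch below (ours, not part of the original tutorial) starts one MPI process per machine and then opens an OpenMP parallel region inside each process; the compile command and flags vary by platform (see the "Compile and Run" chapter for the OpenMP flags).

C/C++

/* Hybrid MPI + OpenMP sketch: MPI spans the machines in the cluster,
   OpenMP spans the processors within each shared-memory machine. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                  /* one MPI process per machine */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    #pragma omp parallel                     /* threads within the machine  */
    {
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}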


In this course, we will focus on the fundamentals of OpenMP. The topics of MPI and MLP are covered in two separate
tutorials on the CI-Tutor site - "Introduction to MPI" and "Multi-Level Parallel Programming."

1.2 What is OpenMP?


OpenMP is comprised of three complementary components:
1. a set of directives used by the programmer to communicate with the compiler on parallelism.
2. a runtime library which enables the setting and querying of parallel parameters such as number of participating threads
and the thread number.
3. a limited number of environment variables that can be used to define runtime system parallel parameters such as the
number of threads.

Figure 1.1. The three components of the OpenMP API.


A Compiler Directive Example
Code segments that consume substantial CPU cycles frequently involve do loops (or FOR loops in C). For loops that are
parallelizable, OpenMP provides a rather simple directive to instruct the compiler to parallelize the loop immediately
following the directive. Let's take a look at a Fortran example:
call omp_set_num_threads(nthread)    ! requests "nthread" threads
!$OMP PARALLEL DO
DO i=1,N
   DO j=1,M
   .
   .
   .
   END DO
END DO
!$OMP END PARALLEL DO

In this Fortran code fragment, the OpenMP library function OMP_SET_NUM_THREADS is called to set the number of
threads to "nthread". Next, the !$OMP PARALLEL DO directive notifies the compiler to parallelize the (outer) do loop that
follows. The current, or master, thread is responsible for spawning "nthread-1" child threads. The matching !$OMP END
PARALLEL DO directive makes the extent of the PARALLEL DO directive clear to the compiler (and the programmer). It
also serves to provide a barrier to ensure that all threads complete their tasks and that all child threads are subsequently
released.
In the figure below, the execution stream starts in a serial region, followed by a parallel region using four threads. Upon
completion, the child threads are released and serial execution continues until the next parallel region with 2 threads is in
effect.

Figure 1.2. OpenMP Programming Model.


For C programs, a similar set of rules applies:

omp_set_num_threads(nthread);   /* requests nthread threads */
#pragma omp parallel for
for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
   .
   .
   .
   }
}

Note that in C the parallel for directive applies to the for loop that immediately follows it; unlike Fortran, no end directive is needed, and the extent of the construct is that loop itself (here including its curly-braced body).

1.3 Why OpenMP?


For computers with shared-memory architecture, the ability to use directives that assist the compiler in the parallelization
of application codes has been around for many years. Almost all major manufacturers of high performance shared-memory
multiprocessor computers have their own sets of directives. Unfortunately, the functionalities and syntaxes of these
directive sets vary among vendors, and because of that variance, code portability (from the viewpoint of directives) is
practically impossible.
Primarily driven by the need of both the high performance computing user community and industry to have a standard to
ensure code portability across shared-memory platforms, an independent organization, openmp.org, was established in
1996. This organization's charter is to formulate and oversee the establishment and maintenance of the OpenMP standard.
As a result, the OpenMP API came into being in 1997. The primary benefit of using OpenMP is the relative ease of code
parallelization made possible by the shared-memory architecture.

1.4 Pros and Cons of OpenMP


Pros

Due to its shared-memory attribute, the programmer need not deal with message passing which is relatively difficult and
potentially harder to debug.
Generally the bulk of data decomposition is handled automatically by directives. Hence, data layout effort is minimal.

Unlike message passing MPI or PVM, OpenMP directives or library calls may be incorporated incrementally. Since codes
that need parallelism are often large, incremental implementation allows for gradual realization of performance gains
instead of having to convert the whole code at once - as is often the case in an MPI code.
Since directives, as well as OpenMP function calls, are treated as comments in the event that OpenMP invocation is not
preferred or available during compilation, the code is in effect a serial code. This affords a unified code for both serial
and parallel applications which can ease code maintenance.
Original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the
chance of inadvertently introducing bugs.
Code size increase due to OpenMP is generally smaller than that which uses MPI (or other message passing methods).
OpenMP-enabled codes tend to be more readable than an equivalent MPI-enabled version. This could, at least indirectly,
help in code maintainability.

Cons

Codes parallelized with OpenMP can only be run in multiprocessor mode on shared-memory environments; this restricts
the portability (in the multiprocessing sense) of the programs on distributed-memory environments.
Requires a compiler that supports OpenMP.
Because OpenMP codes tend to rely more on parallelizable loops, this could leave a relatively high percentage of a code
in serial processing mode. This results in lower parallel efficiency; if 10% of a code remains in serial operations, the
theoretical best speedup is a factor of ten (per Amdahl's Law), no matter how many processors are used (see the worked
example after this list).
Codes implemented to run on shared-memory parallel systems, for example those parallelized via OpenMP, are limited by
the number of processors available on the respective systems. Conceptually, parallel paradigms such as MPI do not have
such a hardware limitation.
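As a worked illustration of the Amdahl's Law point above (our notation): if a fraction s of the run time is serial, the best possible speedup on N processors is S(N) = 1 / (s + (1 - s)/N). With s = 0.1 this is bounded by 1/s = 10 no matter how large N becomes, and even on 16 processors S(16) = 1 / (0.1 + 0.9/16) = 6.4.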

The above summary of advantages and disadvantages should be taken in a broad sense, as exceptions do exist. Some
classes of embarrassingly parallel codes, such as some Monte Carlo codes, can be trivial to parallelize with MPI. On the
other hand, while OpenMP's primary model is to target do loops, it can also be used in a data decomposition model akin to
the MPI approach. With coarser-grained parallelism, a higher percentage of a code can then be parallelized to achieve
better parallel efficiency.

2 Basics
2.1 Basics
The primary goal of code parallelization is to distribute the code's operations among a number of processors so they can be
performed simultaneously. This goal can be elusive since there are usually operations that must be performed in a certain
sequence and therefore can't be performed in parallel. Codes where a vast majority of the operations are completely
independent are relatively easy to parallelize. Some examples of such codes are ones that perform Monte-Carlo simulations
and optimization problems. However, most codes contain a rather intricate combination of independent (parallelizable) and
dependent (serial) operations. One common example is the solution of hyperbolic partial differential equations such as the
simulation of time-varying phenomena. Since the state of a system at a later time is a function of the state at the previous
time, states at various times must be deduced in a sequential fashion. In these cases one must look for other aspects of the
algorithm that can be parallelized.
In the following sections we will discuss a few basic concepts of code parallelization and introduce our first simple, but
powerful, OpenMP directive. Although the amount of material presented in this chapter is limited, it can go quite far in
parallelizing many real-world applications in an efficient manner.

2.2 Basics - Approaches to Parallelism


In OpenMP, there are two main approaches for assigning work to threads. They are
1. loop-level
2. parallel regions
In the first approach, loop-level, individual loops are parallelized with each thread being assigned a unique range of the
loop index. This is sometimes called fine-grained parallelism and is the approach taken by many automatic parallelizers.
Code that is not within parallel loops is executed serially (on a single thread).
In the second approach, parallel regions, any sections of the code can be parallelized, not just loops. This is sometimes
called coarse-grained parallelism. The work within the parallel regions is explicitly distributed among the threads using the
unique identifier assigned to each thread. This is frequently done by using if statements, e.g., if(myid == 0) ... ,
where myid is the thread identifier. At the limit, the entire code can be executed on each thread, as is usually done with
message-passing approaches such as MPI.

The two approaches are illustrated in the following diagram in which vertical lines represent individual threads. In the
loop-level approach, execution starts on a single thread. Then, when a parallel loop is encountered, multiple threads are
spawned. When the parallel loop is finished, the extra threads are discarded and the execution is once again serial until the
next parallel loop (or the end of the code) is reached. In the parallel-regions approach, multiple threads are maintained,
irrespective of whether or not loops are encountered.

The main advantage of loop-level parallelism is that it's relatively easy to implement. It is most effective when a small
number of loops perform a large amount of work in the code. The parallel regions method requires a bit more work to
implement, but is more flexible. Any parallelization task performed using the loop-level method can also be implemented
using parallel regions by adding logic to the code, but not vice-versa. The disadvantage of the loop-level approach is the
overhead incurred in creating new threads at the beginning of a parallel loop and in destroying and resynchronizing
threads and data with the master thread at the end of the parallel loop. The amount of overhead incurred depends on the
details of how the slave threads are implemented, e.g., as separate OS-level processes or as so-called "light-weight"
processes such as pthreads. Parallel regions allow the programmer to exploit data parallelism on a scale larger than the "
do loop" by avoiding the need to resynchronize after every loop.

2.3 Data Dependencies


Unfortunately, not all of the operations in a code can be performed simultaneously (except in rare instances). There are
some operations which must wait for the completion of other operations before they are performed. When an operation
depends upon completion of another operation, it is called a data dependency (or sometimes data dependence). A simple
data dependency is shown in the following code fragment:

Fortran

do i = 2, 5
a(i) = a(i) + a(i-1)
enddo

C/C++:
for(i=2; i<=5; i++) a[i] = a[i] + a[i-1];
In this example, each element i of the one-dimensional array a is replaced by the sum of the original elements of the array
up to index i. Assume that the array a has been initialized with the integers 1 through 5.
The final values of the a array obtained by serial execution of the loop above are shown in the following table:

  i       1   2   3   4   5
a(i)      1   3   6  10  15

Consider what might happen if we executed the loop in parallel using 2 threads. Assume the first thread is assigned loop
indices 2 and 3, and the second thread is assigned 4 and 5. One possible order of execution is that thread 1 performs the
computation on i=4, reading the value of a(3), before thread 0 has completed the computations for i=3, which update a(3).
In this case, the results on each thread are:

Thread 0:  a(2) = a(2) + a(1) = 2 + 1 = 3
           a(3) = a(3) + a(2) = 3 + 3 = 6

Thread 1:  a(4) = a(4) + a(3) = 4 + 3 = 7
           a(5) = a(5) + a(4) = 5 + 4 = 9

Comparing the above tables, it's clear that a(4) should be equal to 10, not 7, and a(5) should be equal to 15, not 9. The
problem is that the values of a(3) and a(4) were used by thread 1 before the new values were calculated by thread 0. This
is the simplest example of a "race condition", in which the result of the operation depends upon the order in which the data
is accessed by the threads.
There are three simple criteria that, if satisfied, guarantee that there is no data dependency in a loop:

1. All assignments are performed on arrays.
2. Each element of an array is assigned to by at most one iteration.
3. No loop iteration reads array elements modified by any other iteration.

If these criteria are not met, you should carefully examine your loop for data dependency.
OpenMP will do exactly what it is instructed to do, and if a loop with a data dependency is parallelized naively, it will give
the wrong result. The programmer is responsible for the correctness of the code!
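To see the point concretely, here is a small self-contained program (our own sketch, not from the tutorial) that naively parallelizes the dependent loop above; compiled with OpenMP and run with more than one thread, its output can differ from the correct serial result 1 3 6 10 15.

C/C++

#include <stdio.h>

int main(void)
{
    int a[6] = {0, 1, 2, 3, 4, 5};   /* a[1..5] hold the values 1..5 */
    int i;

    /* WRONG: the loop carries a dependency through a[i-1] */
    #pragma omp parallel for
    for (i = 2; i <= 5; i++)
        a[i] = a[i] + a[i-1];

    for (i = 1; i <= 5; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}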

2.4 PARALLEL DO / PARALLEL FOR


In OpenMP, the primary means of parallelization is through the use of directives inserted in the source code. One of the
most fundamental and most powerful of these directives is PARALLEL DO (Fortran) or PARALLEL FOR (C). Here are
examples of these directives:
Fortran

!$omp parallel do
do i = 1, n
a(i) = b(i) + c(i)
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=n; i++)
   a[i] = b[i] + c[i];
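For completeness, a minimal self-contained program built around this directive (the array size and names are ours, chosen only for illustration):

C/C++

#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) {            /* initialize the inputs (serial) */
        b[i] = i;
        c[i] = 2.0*i;
    }

    #pragma omp parallel for             /* each thread gets a chunk of i  */
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N-1]);     /* expect 3*(N-1) = 2997          */
    return 0;
}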

2.5 Clauses
There are many situations in which we would like to modify the behavior of directives in some way in order to solve a
specific problem or to make the directive more convenient to use. Some directives can be modified using clauses.
Shared vs. Private Variables
Since OpenMP is used on shared-memory systems, all variables in a given loop share the same address space. This means
that all threads can modify and access all variables (except the loop index), and sometimes this results in undesirable
behavior. Consider the following example:
Fortran
!$omp parallel do
do i = 1, n
temp = 2.0*a(i)
a(i) = temp
b(i) = c(i)/temp
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=n; i++) {
   temp = 2.0*a[i];
   a[i] = temp;
   b[i] = c[i]/temp;
}
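The trouble with the fragment above is that temp is shared, so one thread can overwrite it before another thread has used its value. The standard remedy, which the clauses discussed in this section (and the self test below) rely on, is to give each thread its own copy of temp with the PRIVATE clause; a minimal sketch:

C/C++

#pragma omp parallel for private(temp)
for (i = 1; i <= n; i++) {
    temp = 2.0*a[i];      /* each thread now works with its own temp */
    a[i] = temp;
    b[i] = c[i]/temp;
}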

2.6 Self Test


Introduction to OpenMP - Basics
Question 1
Which of the following is NOT true: The Parallel Regions approach
is used to parallelize individual loops only.
is also sometimes called Coarse-Grained parallelism.
can emulate message-passing type parallelism, such as that which is used with MPI.
can be used to parallelize regions of code with loops within them.

Question 2
The following code fragment contains a data dependency:
do i = 1, 10
x(i) = x(i) + y(i)
enddo

True
False

Question 3
The following code fragment contains a data dependency:
for(i=2; i<=10; i++)
x[i]=x[i] + y[i-1];

True
False

Question 4
The default shared/private clause is
shared
private

Question 5
Firstprivate
must be used to qualify the first private variable to be accessed in a loop.
makes the first access of the specified variable private, with subsequent accesses being shared.
copies the value of the specified variable(s) from the master thread to all threads.
causes the default to be private rather than shared.

Question 6
Lastprivate retains the value of the specified variable(s) after the parallel region has finished. The value of the variable is
the final value on the master thread.
the value from the last thread to finish the parallel region.
indeterminate.
the value that would have been obtained from serial execution.

Question 7
In Fortran, .and., and in C, &&, are allowable reduction operations.
True
False

Question 8
The 'ordered' directive causes the affected part of the code to run serially.
True
False

3 Compile and Run


Having been introduced to the workhorse PARALLEL DO directive and several associated clauses, you are now armed with
a surprisingly powerful set of tools for parallelizing many codes. This being the case, we will take a slight detour here in
order to briefly discuss how to compile and run OpenMP codes so you can try out some of these concepts.
Compilation is, of course, platform dependent. Here are the compiler flags for several popular platforms:

Platform                  Compiler Flag
SGI IRIX                  -mp
IBM AIX                   -qsmp=omp
Portland Group Linux      -mp
Intel Linux               -openmp

Adding the appropriate flag causes the compiler to interpret OpenMP directives, functions, etc.
For C/C++ codes, the header file omp.h should be included when using OpenMP functions.
So far, we have one way to specify the number of threads, the OMP_NUM_THREADS environment variable. (Later a
function will be introduced to do the same thing.) The behavior of the code if you fail to specify the number of threads is
platform dependent, and it is good practice to always specify the number of threads. It may be convenient to create a
simple run script such as those shown in the following examples.
Example 1
#!/bin/tcsh
setenv OMP_NUM_THREADS 4
mycode > my.out
exit

Example 2
#!/bin/tcsh
setenv OMP_NUM_THREADS $1
mycode > my.out
exit

In the first case the number of threads is hard-wired, and in the second it's a substitutable argument.

4 Conditional Compilation
Portability among different platforms is often a concern for writers of large-scale scientific codes. Lines starting with
OpenMP sentinels (the first part of all directives, e.g., !$OMP, C$OMP, *$OMP, #pragma omp) are ignored if the code is
compiled on a system that doesn't support OpenMP. This allows the same source code to be used to create serial and
parallel executables.
Conditional compilation is also possible with constructs other than directives. It is handled differently in Fortran and
C/C++, so they will be discussed individually.
Fortran
When a Fortran/OpenMP compiler encounters a !$, c$, or *$ sentinel, the two characters are replaced with spaces. When
the first character of the sentinel is a comment character, a compiler without OpenMP will simply interpret the line as a
comment. This behavior can be used for conditional compilation. Suppose there are lines of code that are to be executed
only in parallel versions of the code. If a !$ prefix is added to each line of code, it will result in conditional compilation:
Fortran
!$ call parallel_stuff(num_threads)

If compiled without OpenMP, this line will simply be interpreted as a comment. With OpenMP, the compiler will replace !$
with spaces, and the line will be compiled as an executable statement.
C/C++
In C and C++, there is a macro name, _OPENMP, that is automatically defined by OpenMP. This can be used for
conditional compilation as follows:
C/C++
#ifdef _OPENMP
parallel_stuff(num_threads);
#endif
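A common related pattern, shown here as a sketch (the helper name my_rank is ours, purely illustrative), is to guard the header and the API calls together so that the identical source file builds both the serial and the parallel executables:

C/C++

#ifdef _OPENMP
#include <omp.h>
#endif

int my_rank(void)              /* illustrative helper, not an OpenMP routine */
{
#ifdef _OPENMP
    return omp_get_thread_num();   /* parallel build: the real thread rank */
#else
    return 0;                      /* serial build: behave like the master */
#endif
}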

5 PARALLEL Directive
Up to this point, we have been examining ways to parallelize loops with the PARALLEL DO (PARALLEL FOR) directive. It is
possible to break this directive into separate PARALLEL and DO (FOR) directives. For example, the following parallel loop:
Fortran
!$OMP PARALLEL DO
do i = 1, maxi
a(i) = b(i)
enddo

C/C++
#pragma omp parallel for
for(i=1; i<=maxi; i++)
   a[i] = b[i];
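The same loop can equivalently be written with the two directives separated; a sketch of the usual split form in C:

C/C++

#pragma omp parallel        /* create the team of threads          */
{
    #pragma omp for         /* distribute the loop among the team  */
    for (i = 1; i <= maxi; i++)
        a[i] = b[i];
}                           /* implied barrier; the team joins here */

In Fortran the analogous split form is !$OMP PARALLEL followed by !$OMP DO before the loop, with matching !$OMP END DO and !$OMP END PARALLEL directives.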

6 Basic Functions
6.1 Basic Functions
In this chapter, we will cover three very basic OpenMP library functions: OMP_SET_NUM_THREADS,
OMP_GET_NUM_THREADS and OMP_GET_THREAD_NUM. These functions enable us to set the thread count, find out how
many threads are in use as well as determine the rank of individual threads. These basic functionalities, together with
basic directives such as PARALLEL DO or PARALLEL FOR, are sufficient for many applications.
However, some applications may require more specialized functionalities. For these occasions, the OpenMP library
provides additional functions to deal with them. These functions will be introduced in Chapter 10.

6.2 OMP_GET_THREAD_NUM
Returns the thread rank in a parallel region.

Note that:

The rank of threads ranges from 0 to OMP_GET_NUM_THREADS() - 1.


When invoked in a serial region, this function returns the value of 0, which is the rank of the master thread.

C/C++
#include <omp.h>
int omp_get_thread_num()

Fortran
INTEGER FUNCTION OMP_GET_THREAD_NUM()

Example
Print thread ranks in a parallel region. If four threads are used, the output may look like this:

Thread rank: 2
Thread rank: 0
Thread rank: 3
Thread rank: 1

Note that, in general, the ranks are not printed in order.


C/C++
#pragma omp parallel
{
printf("Thread rank: %d\n", omp_get_thread_num());
}

Fortran (replace "C" in column 1 with "!" for F90)


C$OMP PARALLEL
write(*,*)'Thread rank: ', OMP_GET_THREAD_NUM()
C$OMP END PARALLEL

6.3 OMP_SET_NUM_THREADS
Sets the number of threads for use in subsequent parallel region(s).
Note that:

The number of threads deployed in a run is determined by the user. There are two ways with which the user dictates that:
1. call OMP_SET_NUM_THREADS prior to the beginning of a parallel region for it to take effect; it can be called as
often as needed to dynamically control the thread counts in different parallel regions.
2. alternatively, the number of threads can be set through the environment variable OMP_NUM_THREADS before a run
as follows:
setenv OMP_NUM_THREADS threads (c shell)
OMP_NUM_THREADS=threads (korn shell)
export OMP_NUM_THREADS (korn shell)

This method can be employed if the thread count need not change in the entire code.
The result is undefined if this subroutine is called within a parallel region.
The thread count remains fixed until the next call to this subprogram.
Use of this subprogram to set the number of threads takes precedence over the environment variable OMP_NUM_THREADS.
The number of threads used in a parallel region is guaranteed to be what is set via call to OMP_SET_NUM_THREADS
provided that thread dynamic status is FALSE. Otherwise, the actual number of threads used at runtime is subject to what
is available at the time the parallel region is executed. Furthermore, the threads used cannot exceed what is set by
OMP_SET_NUM_THREADS.

C/C++
#include <omp.h>

void omp_set_num_threads(int num_threads)


num_threads -- Number of threads (input)

Fortran
SUBROUTINE OMP_SET_NUM_THREADS(num_threads)
num_threads -- Number of threads (input)

The example below demonstrates:

Request threads by OMP_SET_NUM_THREADS.


Use OMP_GET_NUM_THREADS in a parallel region to see how many threads are active.

C/C++
/* set thread size before entering parallel region */
num_threads = 4;
omp_set_num_threads(num_threads);
#pragma omp parallel
{
printf("Threads allocated : %d\n",
omp_get_num_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

C set thread size before entering parallel region


num_threads = 4
call OMP_SET_NUM_THREADS(num_threads)
C$OMP PARALLEL
write(*,*)'Threads allocated : ', OMP_GET_NUM_THREADS()
C$OMP END PARALLEL

6.4 OMP_GET_NUM_THREADS
Returns the number of threads used in a parallel region.
Note that:

When invoked in a parallel region, this function reports the number of participating threads.
It returns a value of unity (1) when invoked in:
a. a serial region.
b. a nested parallel region that has been serialized; e.g., if the nested parallelism is turned off or not implemented by
the vendor. See OMP_SET_NESTED in Section 11.1 for details.
If the thread dynamic status is disabled, the thread count returned by this function is determined by the user's call to
the subprogram OMP_SET_NUM_THREADS or by the environment variable OMP_NUM_THREADS.
If the thread dynamic status is enabled, the thread count returned by this subprogram cannot be larger than what is
returned by OMP_GET_MAX_THREADS().

C/C++
int omp_get_num_threads( )

Fortran
integer function omp_get_num_threads( )

Example
C/C++
#include <omp.h>
omp_set_num_threads(num_threads);
#pragma omp parallel

{
printf("Threads allocated : %d\n", omp_get_num_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

call OMP_SET_NUM_THREADS(num_threads)
C$OMP PARALLEL
write(*,*)'Threads allocated : ', OMP_GET_NUM_THREADS()
C$OMP END PARALLEL

6.5 Self Test


Introduction to OpenMP - Basic Functions
Question 1
Assuming that multiple threads are active, which one of the following is true?
omp_get_num_threads() returns the number of active threads.
omp_get_num_threads() returns the number of threads requested via omp_set_num_threads.
omp_get_num_threads() returns the number of active threads in a parallel region and 1 otherwise.

Question 2
OMP_GET_THREAD_NUM() returns current thread's rank number which
ranges from 1 to omp_get_num_threads().
ranges from 0 to omp_get_num_threads()-1.
is the id of the physical processor.

Question 3
omp_set_num_threads
has precedence over the environment variable OMP_NUM_THREADS.
is overridden by OMP_NUM_THREADS.
like environment variable OMP_NUM_THREADS, can only be called once in the program to set the number of threads.

Question 4
omp_set_num_threads
can be called from anywhere in the application program.

can only be called in a parallel region.


can only be called in a serial region.

7 Parallel Regions
Discussion to this point has been concerned with methods for parallelizing individual loops. This is sometimes called
fine-grained or loop-level parallelism. The term "fine-grained" can be a misnomer, since loops can be large, sometimes
encompassing a majority of the work performed in the code, so we will use "loop-level" here.
We saw in Chapter 5. PARALLEL Directive that the entire region of code between a PARALLEL directive and an END
PARALLEL directive in Fortran, or within braces enclosing a parallel region in C, will be duplicated on all threads. This
allows more flexibility than restricting parallel regions of code to loops, and one can parallelize code in a manner much like
that used with MPI or other message-passing libraries. This approach is called coarse-grained parallelism or parallel
regions. We will use the latter term for the same reason discussed above.
In the loop-level approach, domain decomposition is performed automatically by distributing loop indices among the
threads. In the parallel regions approach, domain decomposition is performed manually. Starting and ending loop indices
are computed for each thread based on the number of threads available and the index of the current thread. The following
code fragment shows a simple example of how this works.
Fortran
!$OMP PARALLEL &
!$OMP PRIVATE(myid,istart,iend,nthreads,nper)
nthreads = OMP_GET_NUM_THREADS()
nper = imax/nthreads
myid = OMP_GET_THREAD_NUM()
istart = myid*nper + 1
iend = istart + nper - 1
call do_work(istart,iend)
do i = istart, iend
a(i) = b(i)*c(i)
enddo
!$OMP END PARALLEL

C/C++
#pragma omp parallel \
private(myid,istart,iend,nthreads,nper,i)
{
nthreads = omp_get_num_threads();
nper = imax/nthreads;
myid = omp_get_thread_num();
istart = myid*nper + 1;
iend = istart + nper - 1;
do_work(istart,iend);
for(i=istart; i<=iend; i++)
   a[i] = b[i]*c[i];
}
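Note that the simple decomposition above silently drops the last few iterations whenever imax is not a multiple of nthreads. One common fix, sketched below as a small helper of our own (the name my_range and its argument list are hypothetical, chosen to echo the myrange routine used later in the CRITICAL example), is to use a ceiling divide for the block size and clamp the last thread's range:

C/C++

/* Compute the 1-based, inclusive range [istart, iend] handled by thread
   myid out of nthreads, covering 1..imax even when imax % nthreads != 0. */
static void my_range(int myid, int nthreads, int imax,
                     int *istart, int *iend)
{
    int nper = (imax + nthreads - 1)/nthreads;   /* ceiling divide        */
    *istart = myid*nper + 1;
    *iend = *istart + nper - 1;
    if (*iend > imax) *iend = imax;              /* clamp the last thread */
}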

8 Thread Control
8.1 Thread Control
There are some instances in which additional control over the operation of the threads is required. Sometimes threads
must be synchronized, such as at the beginning of serial parts of the code. There are cases in which a task needs to be
performed only on one thread, and there are cases in which all threads must perform a task and do it one at a time. Also,
you may want to assign specific tasks to specific threads. All of these functions are available in OpenMP using the
directives discussed in this section.

8.2 BARRIER

There are instances in which threads must be synchronized. This can be effected through the use of the BARRIER directive.
Each thread waits at the BARRIER directive until all threads have reached this point in the source code, and they then
resume parallel execution. In the following example, an array a is filled, and then operations are performed on a in the
subprogram DOWORK.
Fortran
do i = 1, n
   a(i) = a(i) - b(i)
enddo
call dowork(a)

C/C++
for(i=1; i<=n; i++)
   a[i] = a[i] - b[i];
dowork(a);
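A sketch of how a parallel version might be structured (ours; the dowork routine is the tutorial's placeholder, and a prototype is assumed here): each thread fills its own slice of a, and the BARRIER guarantees the whole array is up to date before any thread calls dowork.

C/C++

#include <omp.h>

void dowork(double *a);     /* the tutorial's placeholder routine */

void fill_then_work(double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int myid = omp_get_thread_num();
        int nper = (n + nthreads - 1)/nthreads;
        int istart = myid*nper + 1;
        int iend = istart + nper - 1;
        int i;
        if (iend > n) iend = n;

        for (i = istart; i <= iend; i++)   /* each thread fills its slice */
            a[i] = a[i] - b[i];

        #pragma omp barrier                /* wait for every slice of a   */

        dowork(a);                         /* safe: all of a is complete  */
    }
}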

8.3 MASTER
In a parallel region, one might want to perform a certain task on the master thread only. Rather than ending the parallel
region, performing the task, and re-starting the parallel region, the MASTER directive can be used to restrict a region of
source code to the master thread. In Fortran, all operations between the MASTER and END MASTER directives are
performed on the master thread only. In C, the operations in the structured block (between curly braces) following the
MASTER directive are performed on the master thread only. In the following example, an array is computed and written to
a file.
Fortran
do i = 1, n
a(i) = b(i)
enddo
write(21) a
call do_work(1, n)

C/C++
for(i=1; i<=n; i++)
   a[i] = b[i];
/* ... write a to a file ... */
do_work(1, n);
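A sketch of the parallel structure (ours, not the tutorial's original listing): the array is filled by a worksharing loop, whose implied barrier guarantees a is complete, and then only the master thread performs the file write.

C/C++

#include <stdio.h>

void fill_and_write(double *a, const double *b, int n, FILE *fp)
{
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for                   /* fill a; implied barrier at end */
        for (i = 1; i <= n; i++)
            a[i] = b[i];

        #pragma omp master                /* file I/O on the master only    */
        fwrite(&a[1], sizeof a[0], (size_t)n, fp);
        /* note: MASTER has no implied barrier; add one if later work
           depends on the write having completed */
    }
}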

8.4 SINGLE
Threads do not execute lines of source code in lockstep. There may be explicit logic in the code assigning different tasks to
different threads, and different threads may execute specific lines of source code at different speeds, due to differing cache
access patterns for example. The SINGLE directive is similar to the MASTER directive except that the specified region of
code will be performed on the thread which is the first to reach the directive, not necessarily the master thread. Also,
unlike the MASTER directive, there is an implied barrier at the end of the SINGLE region.
Below is a serial example. Note that the routine DO_SOME_WORK has a(1) as its argument, while DO_MORE_WORK
performs work on the whole array.
Fortran
do i = 1, n
a(i) = b(i)
enddo
call do_some_work(a(1))
call do_more_work(a, 1, n)

C/C++
for(i=1; i<=n; i++)
   a[i] = b[i];
do_some_work(&a[1]);
do_more_work(a, 1, n);
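A sketch of the parallel structure (ours; do_some_work is the tutorial's placeholder and a prototype is assumed): the fill loop is shared among the threads, the SINGLE block is executed by whichever thread arrives first, and the implied barrier at its end holds the other threads back until it is done.

C/C++

void do_some_work(double *a1);            /* the tutorial's placeholder */

void fill_single_more(double *a, const double *b, int n)
{
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for                   /* fill a; implied barrier          */
        for (i = 1; i <= n; i++)
            a[i] = b[i];

        #pragma omp single                /* first thread to arrive runs it   */
        do_some_work(&a[1]);
                                          /* implied barrier at end of SINGLE */

        #pragma omp for                   /* stand-in for do_more_work(a,1,n) */
        for (i = 1; i <= n; i++)
            a[i] = 2.0*a[i];
    }
}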

8.5 CRITICAL
The CRITICAL directive is similar to the ORDERED directive in that only one thread executes the specified section of
source code at a time. With the CRITICAL directive, however, the threads can perform the task in any order.
Suppose we want to determine the maximum value in the one-dimensional array a. Let the values of the array be computed in
the function COMPUTE_A. In Fortran, the intrinsic function MAXVAL, which returns the largest value in the argument, will

be used and it will be assumed that a similar function has been provided in C. The serial code will simply look like
Fortran
call compute_a(a)
the_max = maxval(a)

C/C++
compute_a(a);
the_max = maxval(a);

In parallel, a different section of a will be computed on each thread. The maximum value will then be found for each
section, and a global maximum will be computed using the Fortran MAX function, which returns the maximum of a series
of scalar arguments. As before, it will be assumed that an analogous function has been written in C.
Fortran

the_max = 0.0
!$omp parallel private(myid, istart, iend)
call myrange(myid, nthreads, global_start, global_end, istart, iend)
call compute_a(a(istart:iend))
!$omp critical
the_max = max( maxval(a(istart:iend)), the_max )
!$omp end critical
call more_work_on_a(a)
!$omp end parallel

C/C++

the_max = 0.0;
#pragma omp parallel private(myid, istart, iend, nvals)
{
myrange(myid, nthreads, global_start, global_end, &istart, &iend);
nvals = iend-istart+1;
compute_a(&a[istart], nvals);
#pragma omp critical
the_max = max( maxval(&a[istart], nvals), the_max );
more_work_on_a(a);
}

8.6 SECTIONS
There are some tasks which must be performed serially due to data dependencies, calls to serial libraries, input/output
issues, etc. If there is more than one such task and they are independent, they can be performed by individual threads at
the same time. Note that each task is still only performed by a single thread, i.e., the individual tasks are not parallelized.
This configuration can be effected through the use of the SECTION and SECTIONS directives.
Suppose we have a code which solves for a field on a computational grid. Two of the first steps are to initialize the field and
to check the grid quality. These tasks can be performed through function or subroutine calls:
Fortran
call init_field(field)
call check_grid(grid)

C/C++
init_field(field);
check_grid(grid);

Since these are independent tasks, we would like to perform them in parallel. The SECTIONS directive is used within a
parallel region to indicate that the designated block of code (a structured block in C; the region of code between the
SECTIONS directive and an END SECTIONS directive in Fortran) will contain a number of individual sections, each of
which is to be executed on its own thread. Within the designated region, SECTIONS directives are used to delimit

individual sections. There is an implied barrier at the end of the SECTIONS region. Be careful to note that the overall
region of code which includes sections is designated with the SECTIONS directive (plural), and each individual section is
designated with the SECTION directive (singular). Here's the same code fragment using sections:
Fortran
!$omp parallel
!$omp sections
!$omp section
call init_field(field)
!$omp section
call check_grid(grid)
!$omp end sections
!$omp end parallel

C/C++
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
init_field(field);
#pragma omp section
check_grid(grid);
}
}

Each of the two tasks will now be performed in parallel on individual threads. In this example, only two threads do useful
work, irrespective of how many threads are available in the current parallel region.
There is also a PARALLEL SECTIONS directive, analogous to PARALLEL DO. The PARALLEL SECTIONS directive spawns
multiple threads and enables the use of SECTIONS directives at the same time.
Fortran
!$omp parallel sections
!$omp section
call init_field(field)
!$omp section
call check_grid(grid)
!$omp end parallel sections

C/C++
#pragma omp parallel sections
{
#pragma omp section
init_field(field);
#pragma omp section
check_grid(grid);
}

8.7 Self Test


Introduction to OpenMP - Thread Control
Question 1
The barrier directive
causes a thread to wait until a specified event has occurred before continuing.
causes threads to execute one at a time (serially).
causes threads to wait at the barrier directive until all threads have reached that point in the code.
causes a thread to wait for another specified thread to reach the barrier directive before continuing.

Question 2
The code enclosed by master/end master directives will only execute on the master thread, and will be skipped over by the
other threads.
True
False

Question 3
The single directive causes the associated code to be executed on one thread at a time.
True
False

Question 4
The only difference between the master and single directives is that the master directive specifies that only the master
thread will execute the specified region of code, while the single directive allows it to execute on any available thread.
True
False

Question 5
The critical directive
causes the affected region of code to execute in the same way as it would execute on a single thread.
causes the affected region of code to execute one thread at a time.
causes the affected region of code to execute on the master thread only.
causes the affected region of code to execute on an arbitrary thread.

Question 6
In using the sections directive, each section is assigned to a single thread.
True
False

9 More Directives
9.1 More Directives
The previous section focused on exploiting loop-level parallelism using OpenMP. This form of parallelism is relatively easy
to exploit and provides an incremental approach towards parallelizing an application, one loop at a time. However, since
loop-level parallelism is based on local analysis of individual loops, it is limited in the forms of parallelism that it can exploit.
A global analysis of the algorithm, potentially including multiple loops as well as other non-iterative constructs, can often
be used to parallelize larger portions of an application such as an entire phase of an algorithm. Parallelizing larger and
larger portions of an application in turn yields improved speedups and scalable performance.
In the previous sections of this tutorial, we mentioned the support provided in OpenMP for moving beyond loop-level
parallelism. For example, we discussed the generalized parallel region construct to express parallel execution. Rather than
being restricted to a loop as with the PARALLEL DO construct discussed previously, this construct is attached to an
arbitrary body of code that is executed concurrently by multiple threads. This form of replicated execution, with the body
of code executing in a replicated fashion across multiple threads, is commonly referred to as "SPMD" style parallelism, for
"single-program multiple-data."
Some clauses which modify the PARALLEL DO directive were introduced, such as PRIVATE, SHARED, DEFAULT, and
REDUCTION. They will continue to provide exactly the same behavior for the PARALLEL construct as they did for the
PARALLEL DO construct. In the following sections we will discuss a few more directives: THREADPRIVATE, COPYIN,
ATOMIC, and FLUSH.

9.2 THREADPRIVATE
9.2.1 THREADPRIVATE
A parallel region may include calls to other subprograms such as subroutines or functions. The lexical or static extent of a
parallel region is defined as the code that is lexically within the PARALLEL/END PARALLEL directive. The dynamic extent
of a parallel region includes not only the code that is directly within the PARALLEL/END PARALLEL directive (the static
extent), but also includes all the code in subprograms that are invoked either directly or indirectly from within the parallel
region. This distinction is illustrated in the figure below.

Figure 1: Code illustrating the difference between the static (lexical) and dynamic extents. The static extent includes the
region highlighted in yellow. The dynamic extent includes both of the highlighted regions.
The importance of this distinction is that the data scoping clauses apply only to the lexical scope of a parallel region and
not to the entire dynamic extent of the region. For variables which are global in scope (e.g., common block variables in
Fortran and global variables in C/C++), references from within the lexical extent of a parallel region are affected by the
data scoping clause (such as PRIVATE) on the parallel directive. However, references to such global variables from the
dynamic extent which are outside of the lexical extent are not affected by any of the data scoping clauses and always refer
to the global shared instance of the variable. Hence, if a global variable is declared private, then references to it from the
static extent of a parallel region and that portion of the dynamic extent outside the static extent may not refer to the same
memory location. This choice was made to simplify the implementation of the data scoping clauses.
A simple way to control the scope of such variables is to pass them as arguments of the subroutine or function being
referenced. By passing them as arguments, all references to the variables now refer to the private copy of the variables
within the parallel region.
While the problem can be solved this way, it is often cumbersome when the common blocks appear in several subprograms
or when the list of variables in the common blocks is lengthy. The OpenMP THREADPRIVATE directive provides an easier
method that does not require modification of argument lists. The syntax and specification for this directive are discussed in
the following two sub-sections.

9.2.2 Specification
The THREADPRIVATE directive tells the compiler that a common block (or global variables in C/C++) is private to each
thread. A private copy of each common block marked as threadprivate is created for each thread and within each thread all
references to variables within that common block anywhere in the entire program refer to the variable instance within the
private copy. Threads cannot refer to the private instance of the common block belonging to another thread. The important
distinction between the PRIVATE and THREADPRIVATE directives is that the threadprivate directive affects the scope of
the variable within the entire program, not just within the lexical scope of a parallel region.
The THREADPRIVATE directive is provided after the declaration of the common block (global variable in C/C++) within a
subprogram unit, not in the declaration of a parallel region. Moreover, the THREADPRIVATE directive must be supplied
after the declaration of the common block in every subprogram unit that references the common block. Variables from
threadprivate common blocks cannot appear in any other data scope clauses, nor are they affected by the DEFAULT
(SHARED) clause. Thus, it is safe to use the default (shared) clause even when threadprivate common block variables are
being referenced in the parallel region.
How are threadprivate variables initialized? And what happens to them when the program moves from a parallel region to a
serial region and back? When the program begins execution, the only executing thread is the master thread, which has its
own private copies of the threadprivate common blocks. As in the serial case, these blocks can be initialized by block data
statements (Fortran) or by providing initial values with the definition of the variables (C/C++). When the first parallel
region is entered, the slave threads get their own copies of the common blocks and the slave copies are initialized via the
block data or initial value mechanisms. Any changes to common block variables made by executable statements within the
master thread are lost. When the first parallel region exits, the slave threads stop executing, but they do not go away.
Rather, they persist, retaining the states of their private copies of the common blocks for when the next parallel region is
entered. There is one exception to this. If the user modifies the number of threads through an OpenMP runtime library call,
then the common blocks are reinitialized.

9.2.3 Syntax and Sample


The syntax of the threadprivate directive is
Fortran
!$omp threadprivate(/blk1/[, /blk2/]...)

C/C++
#pragma omp threadprivate(varlist)

where blk1, blk2, etc. are the names of common blocks to be made threadprivate. Note, threadprivate common blocks must
be named common blocks. Unnamed or "blank" common blocks cannot be threadprivate. varlist is a list of named file
scope or namespace scope variables.
Sample Code in Fortran:

program thrd_pvt_example
integer ibegin, iend, iblock
integer iarray(5000), jarray(5000), karray(5000)
integer N
integer nthreads, ithread
integer omp_get_num_threads, omp_get_thread_num
common /DO_LOOP_BOUNDS/ ibegin, iend
!$omp threadprivate(/DO_LOOP_BOUNDS/)
N = 5000
!$omp parallel private(nthreads,ithread, iblock)
nthreads = omp_get_num_threads()
ithread = omp_get_thread_num()
iblock = (N+nthreads-1)/nthreads
ibegin = ithread*iblock + 1
iend = min((ithread+1)*iblock,N)
call add_array(iarray,jarray,karray)
!$omp end parallel
end
subroutine add_array(iarray,jarray,karray)
common /DO_LOOP_BOUNDS/ ibegin, iend
!$omp threadprivate(/DO_LOOP_BOUNDS/)
integer iarray(5000),jarray(5000),karray(5000)
do i = ibegin, iend
iarray(i) = jarray(i)+karray(i)
enddo
return
end
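For C/C++, where THREADPRIVATE applies to file-scope (or namespace-scope) variables rather than common blocks, an analogous sketch (variable and routine names are ours) looks like this:

C/C++

#include <omp.h>

#define N 5000

int ibegin, iend;                         /* file-scope loop bounds...      */
#pragma omp threadprivate(ibegin, iend)   /* ...one private copy per thread */

int iarray[N+1], jarray[N+1], karray[N+1];

void add_array(void)
{
    int i;
    for (i = ibegin; i <= iend; i++)      /* uses this thread's own bounds  */
        iarray[i] = jarray[i] + karray[i];
}

int main(void)
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int ithread = omp_get_thread_num();
        int iblock = (N + nthreads - 1)/nthreads;
        ibegin = ithread*iblock + 1;      /* stored in this thread's copy   */
        iend = (ithread + 1)*iblock;
        if (iend > N) iend = N;
        add_array();
    }
    return 0;
}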

9.3 COPYIN
Specification
As mentioned previously, initialization of threadprivate data occurs when the slave threads are started for the first time
(i.e., in the first parallel region) and then only by means of block data statements (Fortran) or initialization in the
declaration statement (C/C++). OpenMP also provides limited support for another kind of initialization at the beginning of
a parallel region via the COPYIN clause. The COPYIN clause allows a slave thread to read in a copy of the master thread's
threadprivate variables.

The COPYIN clause is supplied along with a parallel directive to initialize a threadprivate variable or set of threadprivate
variables within a slave thread to the values of the threadprivate variables in the master thread's copy at the time that the
parallel region starts. It takes as arguments either a list of variables from a THREADPRIVATE common block or names of
entire THREADPRIVATE common blocks.
The COPYIN clause is useful when the threadprivate variables are used for temporary storage within each thread but still
need initial values that are either computed or read from an input file by the master thread.
Syntax
The syntax of the COPYIN clause is
copyin (list)

where list is a comma-separated list of names of either threadprivate common blocks or individual threadprivate common
block variables (Fortran), or a file scope or global threadprivate variables (C/C++). In Fortran, the names of threadprivate
common blocks appear between slashes.
Fortran Example:

program copyinexample
integer N
common /blk/ N
!$omp threadprivate(/blk/)
N = 5000
!$omp parallel copyin(N)
! Slave's copy of N initialized to 5000.
! Use N or modify N.
N=N+1
print *, "slave thread:", N
!$omp end parallel
!$omp parallel
! Initial value of the slave's copy of N is whatever
! it was at the end of the previous parallel region.
! Use N or modify N.
N=N+1
!$omp end parallel
N = 10000
print *, "master thread:", N
!$omp parallel copyin(N)
! Slave's copy of N initialized to 10000.
! Use N or modify N.
print *, "slave thread:", N
!$omp end parallel
end
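A corresponding C sketch of COPYIN with a threadprivate file-scope variable (names are ours):

C/C++

#include <stdio.h>

int N;                                  /* file-scope, one copy per thread */
#pragma omp threadprivate(N)

int main(void)
{
    N = 5000;                           /* set the master thread's copy    */
    #pragma omp parallel copyin(N)
    {
        /* every thread's copy of N starts at 5000 */
        N = N + 1;
        printf("thread copy of N: %d\n", N);
    }
    return 0;
}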

9.4 ATOMIC
Specification
The ATOMIC clause may be regarded as a special case of a CRITICAL directive. It provides exclusive access by a single
thread to a shared variable. Unlike the CRITICAL directive which can enclose an arbitrary block of code, the ATOMIC
directive can only enclose a critical section that consists of a single assignment statement that updates a scalar variable.
This clause is provided to take advantage of the hardware support provided by most modern multiprocessors for atomically
updating a single location in memory. This hardware support typically consists of special machine instructions used for
performing common operations such as incrementing loop variables. These hardware instructions have the property that
they maintain exclusive access to this single memory location for the duration of the update. Because the full overhead of
the locking mechanism provided by a critical section is not needed, these primitives can greatly improve performance.
One restriction on the use of the ATOMIC directive is that within sections of the program that will be running concurrently,
one cannot use a mixture of the ATOMIC and CRITICAL directives to provide mutually exclusive access to a shared
variable. Instead, you must consistently use one of the two directives. In regions of the program that will not run
concurrently, you are free to use different mechanisms.

Syntax
The syntax for the atomic clause is:
Fortran
!$omp atomic
x = x operator expr
......
!$omp atomic
x = intrinsic (x, expr)

C/C++
#pragma omp atomic
x <binop>= expr
......
#pragma omp atomic
/* one of */
x++, ++x, x--, or --x

where x is a scalar variable of an intrinsic type, operator is one of a set of pre-defined operators (including most arithmetic
and logical operators), intrinsic is one of a set of predefined intrinsic functions (including min, max, and logical intrinsics),
and expr is a scalar expression that does not reference x. The following table provides a complete list of operators and
intrinsics in the Fortran and C/C++ languages.
Language    Operators and Intrinsics
Fortran     +, *, -, /, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, IEOR
C/C++       +, *, -, /, &, ^, |, <<, >>
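A typical use, sketched below with names of our own choosing: many threads tally values into a shared histogram, and each update is a single scalar increment that ATOMIC can protect more cheaply than a full CRITICAL section.

C/C++

#define NBINS 16

void histogram(const int *data, int n, int *hist)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        int bin = data[i] % NBINS;       /* assumes non-negative data     */
        #pragma omp atomic
        hist[bin]++;                     /* atomic update of one location */
    }
}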

9.5 FLUSH
Specification
The FLUSH directive in OpenMP is used to identify a sequence point in the execution of the program at which the
executing threads need to have a consistent view of memory. All memory accesses (reads and writes) that occur before the
FLUSH must be completed before the sequence point and all memory accesses that occur after the FLUSH must occur
after the sequence point. For example, threads must ensure that all registers are written to memory and that all write
buffers are flushed, ensuring that any shared variables are made visible to other threads. After the FLUSH, a thread must
assume that shared variables may have been modified by other threads and read back all data from memory prior to using
it. The FLUSH directive is useful for the writing of client-server applications.
The FLUSH directive is implied by many directives except when the NOWAIT clause is used. For example, some of the
directives for which a FLUSH directive is implied include:

BARRIER
CRITICAL and END CRITICAL
ORDERED and END ORDERED
PARALLEL and END PARALLEL
PARALLEL DO and END PARALLEL DO

By default, a FLUSH directive applies to all variables that could potentially be accessed by another thread. However,
rather than applying to all shared variables, the user can also choose to provide an optional list of variables with the
FLUSH directive. In this case, only the named variables are required to be synchronized. This allows the compiler to
optimize performance by rearranging operations involving the variables that are not named.
In summary, the FLUSH directive does not, by itself, perform any synchronization. It only provides memory consistency
between the executing thread and global memory and must be used in combination with other read/write operations to
implement synchronization between threads.
Syntax
The syntax for the flush directive is as follows:

Fortran
!$omp flush [(list)]

C/C++
#pragma omp flush [(list)]

where list is an optional list of variable names.
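As an illustration of that last point, the sketch below (variable names are ours) follows the familiar producer/consumer pattern from the OpenMP examples: one section writes a value and raises a flag, the other spins on the flag; the FLUSH directives make the data and the flag visible to the other thread in the right order.

C/C++

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    omp_set_num_threads(2);
    #pragma omp parallel sections shared(data, flag)
    {
        #pragma omp section              /* producer */
        {
            data = 42;
            #pragma omp flush(data)      /* make data visible first     */
            flag = 1;
            #pragma omp flush(flag)      /* ...then publish the flag    */
        }
        #pragma omp section              /* consumer */
        {
            while (1) {                  /* spin until the flag appears */
                #pragma omp flush(flag)
                if (flag) break;
            }
            #pragma omp flush(data)      /* re-read data from memory    */
            printf("received %d\n", data);
        }
    }
    return 0;
}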

9.6 Self Test


Introduction to OpenMP - More Directives
Question 1
The threadprivate directive is used to identify a common block as:
being private to each thread.
being private to the master thread.
being private to the slave thread.
being private to the specifically declared thread.

Question 2
Through the copyin clause, which thread's copy of threadprivate data is made accessible to which other threads?
the master thread's copy to the slave threads.
the slave threads' copies to the master thread.
each thread's copy to all of the other threads.
None of the above.

Question 3
In C/C++, which of the following operators cannot be used with the atomic directive?
++
%
|

Question 4
Does the flush directive perform any synchronization by itself?
Yes
No

10 More Functions
10.1 More Functions
In Lesson 6, Basic Functions, we introduced three basic functions: OMP_SET_NUM_THREADS, OMP_GET_NUM_THREADS,
and OMP_GET_THREAD_NUM. In this chapter, we will cover the following OpenMP library functions:

OMP_SET_DYNAMIC provides the programmer the option to allow the runtime system to adjust the threads dynamically
based on availability.
OMP_GET_DYNAMIC reports the status of dynamic adjustment.
OMP_GET_MAX_THREADS returns the maximum number of threads that may be available for use in a parallel region.
OMP_GET_NUM_PROCS returns the number of processors available on the system.
OMP_IN_PARALLEL reports if the code region is parallel or serial.

10.2 OMP_SET_DYNAMIC
Sets the thread dynamic status.
On a multi-user, time-sharing computer, when a user requests more threads than there are processors, an individual
processor may have to handle the work load of multiple threads. Consequently, while the job still gets done, the parallel
performance suffers. To solve this problem, the OpenMP omp_set_dynamic function provides the capability to permit
automatic adjustment of the number of threads to match available processors. For application programs that depend on a
fixed number of threads, the dynamic threading feature can be turned off to ensure that the requested thread count is used.
Note that:

For Fortran, this function takes a logical value (.true. or .false.). For C, it takes nonzero or 0. Upon setting the status to
.true. (or nonzero for C), the runtime system is free to adjust the number of threads used in a parallel region, subject to
thread availability. The number of threads used, however, can be no larger than OMP_GET_MAX_THREADS(). On the
other hand, if the dynamic status is set to .false., then the parallel region uses the requested number of threads as
determined by a call to OMP_SET_NUM_THREADS.
Alternatively, the dynamic adjustment can also be controlled by the environment variable OMP_DYNAMIC.
If a specific number of threads must be used in a parallel region, disabling the dynamic control ensures that the
requested number of threads will be used.

C/C++

#include <omp.h>
void omp_set_dynamic(int dynamic_threads)

Fortran
subroutine omp_set_dynamic(dynamic_threads)
logical dynamic_threads

The example below demonstrates these functions

Set thread dynamic status to ".true." ("nonzero" for C).


Query the thread dynamic status.

C/C++
omp_set_dynamic(1);
printf(STDOUT, " The status of thread dynamic is %d\n", omp_get_dynamic());

Fortran (replace "C" in column 1 with "!" for F90)

call omp_set_dynamic(.true.)
write(*,*)'The status of thread dynamic is : ', omp_get_dynamic()

10.3 OMP_GET_DYNAMIC
Returns dynamic thread status.
Note that:

In Fortran, this function returns a logical value (true or false). In C, it returns 1 or 0. If the status is ".true. " (or "1" for
C), the runtime system is free to adjust the number of threads used in a parallel region - conditioned upon thread
availability. However, the number of threads used can be no larger than OMP_GET_MAX_THREADS(). On the other hand,
if the dynamic status is ".false.", then the parallel region uses the requested threads that are determined by a call to
OMP_SET_NUM_THREADS or the OMP_NUM_THREADS environment variable.
The dynamic status can either be set by calling OMP_SET_DYNAMIC or by setting the environment variable
OMP_DYNAMIC.

C/C++

#include <omp.h>
int omp_get_dynamic()

Fortran
LOGICAL FUNCTION omp_get_dynamic()

The example below demonstrates these functions:

Set thread dynamic status to ".true." ("1" for C).


Query the thread dynamic status.

C/C++

omp_set_dynamic(1);
printf(STDOUT, " The status of thread dynamic is %d\n", omp_get_dynamic());

Fortran (replace "C" in column 1 with "!" for F90)


call omp_set_dynamic(.true.)
write(*,*)'The status of thread dynamic is : ', OMP_get_dynamic()

10.4 OMP_GET_MAX_THREADS
Returns the maximum number of threads that may be available for use in a parallel region.
Note that:

This function can be used in serial and parallel regions.


OMP_GET_NUM_THREADS() <= OMP_GET_MAX_THREADS() if the thread dynamic status is .true. (or nonzero for C).
OMP_GET_NUM_THREADS() = OMP_GET_MAX_THREADS() if the thread dynamic status is .false. (or 0 for C).

C/C++
#include <omp.h>
int omp_get_max_threads()

Fortran:
INTEGER FUNCTION omp_get_max_threads()

The example below demonstrates these functions:

Query for thread dynamic status.


Query for maximum thread count in a serial region.
Query for maximum thread count in a parallel region.

C/C++
printf(STDOUT, " The thread dynamic status is : %d\n",
omp_get_dynamic());
printf(STDOUT, " In a serial region; max threads are : %d\n",
omp_get_max_threads());
#pragma omp parallel
{
printf(STDOUT, " In a parallel region; max threads are : %d\n",
omp_get_max_threads());
}

Fortran (replace "C" in column 1 with "!" for F90)

write(*,*)'The thread dynamic status is : ', omp_get_dynamic()
write(*,*)'In a serial region, max threads are : ', omp_get_max_threads()
C$OMP PARALLEL
write(*,*)'In a parallel region, max threads are : ', omp_get_max_threads()
C$OMP END PARALLEL

10.5 OMP_GET_NUM_PROCS
Queries the number of processors available on the system.
Note that:

This function can be used in serial and parallel regions.


You may or may not be able to request as many threads as reported by OMP_GET_NUM_PROCS() as further restrictions,
such as batch queue process limits, may apply.

C/C++
#include <omp.h>
int omp_get_num_procs()

Fortran:
INTEGER FUNCTION omp_get_num_procs()

The example below demonstrates these functions:

In a serial region, use OMP_IN_PARALLEL to confirm that it is so.


Use OMP_GET_NUM_PROCS in a serial region to see how many processors are available in the system.
In a parallel region, use OMP_IN_PARALLEL to confirm that it is indeed so.
Use OMP_GET_NUM_PROCS in a parallel region to see how many processors are available in the system.

C/C++

printf(STDOUT, " In parallel region (0)? %d\n", omp_in_parallel());


printf(STDOUT, "Threads available in system : %d\n", omp_get_num_procs());
#pragma omp parallel
{
printf(STDOUT, " In parallel region (1)? %d\n", omp_in_parallel());
printf(STDOUT, "Threads available in system : %d\n", omp_get_num_procs());
}

Fortran (replace "C" in column 1 with "!" for F90)

write(*,*)'In parallel region (F)? ', omp_in_parallel()
write(*,*)'Processors available in system : ', omp_get_num_procs()
C$OMP PARALLEL
write(*,*)'In parallel region (T)? ', omp_in_parallel()
write(*,*)'Processors available in system : ', omp_get_num_procs()
C$OMP END PARALLEL

10.6 OMP_IN_PARALLEL
Returns a value indicating whether the region (code segment) in question is parallel or serial. A region is parallel when
enclosed by the parallel directive. A region is serial otherwise.
Note that:

For Fortran, this function returns a logical value (.true. or .false.). For C, it returns an int of 1 or 0.
As expected, in a parallel region executed by only one thread, OMP_IN_PARALLEL returns ".false." (for Fortran) or
"0" (for C).

C/C++
#include <omp.h>
int omp_in_parallel()

Fortran:
LOGICAL FUNCTION OMP_in_parallel()

The example below demonstrates these functions:

Request threads by OMP_SET_NUM_THREADS.


In a serial region, use OMP_IN_PARALLEL to confirm that it is indeed so.
Use OMP_IN_PARALLEL in a parallel region to verify that it reports "true" for fortran and "1" for C.

C/C++
omp_set_num_threads(num_threads);
printf("In parallel region (0)? %d\n", omp_in_parallel());
#pragma omp parallel
{
printf(STDOUT, " In parallel region (1)? %d\n", omp_in_parallel());
}

Fortran (replace "C" in column 1 with "!" for F90)

call OMP_SET_NUM_THREADS(num_threads)
write(*,*)'In parallel region (F)? ', OMP_IN_PARALLEL()
C$OMP PARALLEL
write(*,*)'In parallel region (T)? ', OMP_IN_PARALLEL()
C$OMP END PARALLEL

10.7 Self Test


Question 1
omp_in_parallel
returns an integer for fortran and an int for C.
returns a logical for fortran and an int for C.

must be called in a parallel region.

Question 2
omp_get_num_procs()
must be called from a parallel region.
can be called from parallel regions or serial regions.
must only be called from a serial region.

Question 3
omp_get_max_threads
returns a value set by a call to omp_set_num_threads.
is the same as omp_get_num_procs.
is the same as omp_get_num_threads.

Question 4
omp_get_dynamic
returns the status of dynamic memory allocation.
returns an integer for both fortran and C.
returns the dynamic status of threads.

11 Nested Parallelism
11.1 Nested Parallelism
Application codes often contain nested do/for loops. Individual loops may have loop counts too small for parallel
processing to be efficient on their own, but taken together these loops may offer worthwhile gains from parallelism
(provided, of course, that they are parallelizable). Nested parallelism, as the phrase implies, is the OpenMP feature
that deals with multiple levels of parallelism. If nested parallelism is implemented in the OpenMP API and is
enabled by the user, multiple levels of nested loops or parallel regions are executed in parallel.
At present, nested parallelism has not been implemented by any vendor. This topic is covered only for completeness.
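Purely as an illustration (and subject to the vendor-support caveat above), a nested parallel region in C/C++ might look like the following sketch; the thread count and print statements are illustrative only:

#include <omp.h>
#include <stdio.h>

int main(void)
{
   omp_set_nested(1);        /* request nested parallelism (may be ignored) */
   omp_set_num_threads(2);
   #pragma omp parallel
   {
      int outer = omp_get_thread_num();
      #pragma omp parallel
      {
         /* If nesting is unsupported, this inner region runs with one thread. */
         printf("outer thread %d, inner thread %d\n",
                outer, omp_get_thread_num());
      }
   }
   return 0;
}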

11.2 OMP_SET_NESTED
This subprogram enables or disables nested parallelism by setting its only argument to .true. or .false. for Fortran, or to a
nonzero or zero value for C.
Note that:

This feature has not been implemented by any vendor; only a single level of parallelism (one parallel for loop in C or
do loop in Fortran) is honored. On such systems, enabling nested parallelism may either be silently ignored (e.g., by the
IBM compiler) or produce a warning that the feature is not implemented (e.g., by the SGI compiler).
Alternatively, nested parallelism can be enabled through the environment variable OMP_NESTED before a run as follows:

setenv OMP_NESTED TRUE       (C shell)

OMP_NESTED=TRUE
export OMP_NESTED            (Korn shell)

C/C++
#include <omp.h>
void omp_set_nested(int nested)

Fortran
subroutine omp_set_nested(nested)
logical nested

NESTED (logical, input) -- .true. to enable nested parallelism; .false. otherwise

Example
C/C++

omp_set_nested(1);
/* enables nested parallelism */
printf("Status of nested parallelism is %d\n", omp_get_nested());

Fortran

call omp_set_nested(.true.)
! enables nested parallelism
write(*,*)'Status of nested parallelism is : ', OMP_GET_NESTED()

11.3 OMP_GET_NESTED
This function returns the status of nested parallelism setting.
Note that:

As of September, 2001, no vendor has implemented the nested parallelism feature. The query function always returns
false (or 0 for C), even if the user enables it explicitly.

C/C++
#include <omp.h>
int omp_get_nested()

Fortran
logical function omp_get_nested()

Example
C/C++

omp_set_nested(1);
/* enables nested parallelism */
printf("Status of nested parallelism is %d\n", omp_get_nested());

Fortran

call omp_set_nested(.TRUE.)
! enables nested parallelism
write(*,*)'Status of nested parallelism is : ', omp_get_nested()

12 LOCKS
We have seen how the CRITICAL, MASTER, SINGLE, and ORDERED directives can be used to control the execution of a
single block of code. The SECTION directive can be used to control the parallel execution of different blocks of code, but
the number of threads is restricted to the number of sections. If additional control over the parallel execution of different
blocks of code is required, OpenMP offers a set of LOCK routines.
LOCK routines operate very much as the name implies: A given thread takes "ownership" of a lock, and no other thread can
execute a specified block of code until the lock is relinquished, i.e., the other threads are "locked out" until the lock is
"opened." One useful application of locks is when a code performs a time-consuming serial task. Using locks, other useful
work can be done by the other processors while the serial task is processing.
A name must be declared for each lock. In Fortran, the name must be an integer large enough to hold an address;
for 64-bit addresses, it can be declared as INTEGER*8 or INTEGER(SELECTED_INT_KIND(18)). In C or C++, the
lock name must be declared to be of type omp_lock_t (which is defined in the omp.h header file). Every lock routine has a
single argument, which is the lock name in Fortran, or a pointer to the lock name in C or C++.
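For reference, a minimal C/C++ declaration sketch (the name mylock is illustrative):

#include <omp.h>

omp_lock_t mylock;   /* every lock routine below takes &mylock as its argument */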
Before using a lock, the lock name must be initialized through the OMP_INIT_LOCK subroutine (Fortran) or function
(C/C++):
C/C++
omp_init_lock(&mylock);

Fortran
call omp_init_lock(mylock)

where mylock is the lock name.


Similarly, when the lock is no longer needed it should be destroyed using OMP_DESTROY_LOCK:
C/C++
omp_destroy_lock(&mylock);

Fortran
call omp_destroy_lock(mylock)

A thread gains ownership of a lock by calling


C/C++
omp_set_lock(&mylock);

Fortran
call omp_set_lock(mylock)

If OMP_SET_LOCK is called by a thread and a different thread already has ownership of the specified lock, the calling
thread will remain blocked at the call until the lock becomes available. The companion function to OMP_SET_LOCK is
OMP_UNSET_LOCK, which releases ownership of the specified lock:
C/C++
omp_unset_lock(&mylock);

Fortran
call omp_unset_lock(mylock)

There is one additional lock routine, OMP_TEST_LOCK, which is related to OMP_SET_LOCK:

C/C++
did_it_set = omp_test_lock(&mylock);

Fortran
did_it_set = omp_test_lock(mylock)

In Fortran this is a logical function, not a subroutine like the other lock routines. In C/C++ it is an integer function rather
than a void function like the others. The OMP_TEST_LOCK routine is like the OMP_SET_LOCK routine in that the calling
thread takes ownership of the specified lock if it is available. However, if the lock is currently owned by another thread, the
code continues to the next line rather than blocking to wait for the thread. The function's return value indicates whether or
not the lock was available. In Fortran, the logical value .true. is returned to indicate that the calling thread successfully
took ownership of the lock, and .false. is returned to indicate that the lock was owned by a different thread. In C/C++, the
function returns a non-zero integer if the lock was successfully set, and it returns zero if the lock was owned by a
different thread.
Example
Below is an example of the use of lock routines. One thread performs a long serial task. While it is doing so, the other
threads perform a parallel task. The routine which performs the parallel task has an index as its argument so that each
time it is called it can restart from wherever it left off in the previous call. An example of such a task could be searching a
database.
C/C++
omp_init_lock(&mylock);
#pragma omp parallel private(index)
{
   if (omp_test_lock(&mylock)) {
      long_serial_task();
      omp_unset_lock(&mylock);
   } else {
      while (!omp_test_lock(&mylock))
         short_parallel_task(index);
      omp_unset_lock(&mylock);
   }
}
omp_destroy_lock(&mylock);

Fortran
call OMP_INIT_LOCK(mylock)
!$OMP PARALLEL PRIVATE(index)
if (OMP_TEST_LOCK(mylock)) then
   call long_serial_task
   call OMP_UNSET_LOCK(mylock)
else
   do while (.not. OMP_TEST_LOCK(mylock))
      call short_parallel_task(index)
   enddo
   call OMP_UNSET_LOCK(mylock)
endif
!$OMP END PARALLEL
call OMP_DESTROY_LOCK(mylock)

A lock called "mylock" is first initialized, and then a PARALLEL directive spawns multiple threads. This is followed by a call
to OMP_TEST_LOCK. Whichever thread reaches this line first will find the lock to be free, and will take ownership of it.
This thread then goes on to perform a long serial task. The remaining threads will repeatedly check the lock and perform
the short parallel task as long as the lock is still owned by another thread. As soon as the long serial task has been
completed the thread performing that task releases the lock and goes to the end of the parallel block, where there is an
implied barrier. Each of the other threads will take ownership of the lock, perform its final short parallel task, unset the
lock, and go to the end of the parallel block. (The "final short parallel task" is performed in order to keep this example as
simple as possible. In practice, if this task is not required, it could be bypassed with additional logic.) Finally, once all
threads have been synchronized at the implied barrier, the lock is destroyed.

13 SCHEDULE
13.1 SCHEDULE
The way in which iterations of a parallel loop are distributed among the executing threads is called the loop's SCHEDULE.
In OpenMP's default scheduling scheme, the executing threads are assigned nearly equal numbers of iterations. If each
iteration contains approximately the same amount of work, the threads will finish the loop at about the same time. This
situation is called load-balanced and yields optimal performance.
In some cases, different iterations of a loop may perform different amounts of work. When threads are assigned differing
amounts of work, the load is said to be unbalanced. In the example below, each iteration of the loop calls one of the
subroutines FAST or SLOW depending on the value of y. If the iterations assigned to each thread have very different
proportions of FAST and SLOW calls, then the speed with which the threads complete their work will vary considerably.
The threads that complete earlier do no useful work while they wait for the slower threads to catch up, thus the
performance is not optimal.
Fortran

!$omp parallel do private(y)


do i = 1, n
y = f(i)
if (y .lt. 0.5e0) then
call fast(x(i))
else
call slow(x(i))
endif
enddo

If the work done by each iteration of the loop varies in some systematic fashion, then it may be possible to speed up the
execution of the loop by changing its schedule. In OpenMP, iterations are assigned to threads in contiguous ranges called
chunks. By controlling how these chunks are assigned to threads, either dynamically or in some static fashion, and the
number of iterations per chunk, the so-called chunk size, a scheduling scheme attempts to balance the work across threads.
A schedule is specified by a SCHEDULE clause on the PARALLEL DO or DO directive, or it may be optionally specified by
an environment variable. In the next section, we describe each of the options OpenMP provides for scheduling. We focus
on how each schedule assigns iterations to threads and on the overhead each schedule imposes. We provide guidelines for
choosing an appropriate schedule.
Syntax
The syntax of a schedule clause is
schedule(type[, chunk_size])

Type is one of static, dynamic, guided or runtime. If it is present, chunk_size must be a scalar integer value. The kind of
schedule specified by the schedule clause depends on the combination of the type and optional chunk_size parameter. If no
schedule clause is specified, the choice of schedule is implementation dependent. The various types are discussed in the
following sections and then summarized in a table.
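As a C/C++ sketch of the earlier FAST/SLOW loop (the names x, f, fast, and slow are the same illustrative names used above), the clause is attached directly to the work-sharing directive:

#pragma omp parallel for private(y) schedule(dynamic, 4)
for (i = 0; i < n; i++) {
   y = f(i);
   if (y < 0.5)
      fast(&x[i]);   /* cheap iteration  */
   else
      slow(&x[i]);   /* costly iteration */
}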

13.2 Static
In a static schedule, each thread is assigned a fixed number of chunks to work on.
If the type is static and the chunk_size parameter is not present, then each thread is given a single chunk of iterations to
perform. The runtime system attempts to make the chunks as equal in size as possible, but the precise assignment of
iterations to threads is implementation dependent. For example, if the number of iterations is not evenly divisible by the
number of threads, the remaining iterations may be distributed among the threads in any suitable fashion. This kind of
schedule is called "simple static."
If the type is static and the chunk_size parameter is present, iterations are divided into chunks of size chunk_size until
fewer than chunk_size iterations are left. The remaining iterations are divided into chunks in an implementation dependent
fashion. Threads are then assigned chunks in a round-robin fashion: the first thread gets the first chunk, the second thread
gets the second chunk, and so on, until no more chunks remain. This kind of schedule is called "interleaved."

The simple static scheme is appropriate if the work per iteration is nearly equal. The interleaved scheme may be useful if
the work per iteration varies systematically. For example, if the work per iteration increases monotonically, then an
interleaved scheme will more evenly distribute work among the threads, but at a cost of a small amount of additional
overhead. Simple static scheduling is usually the default.
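For example (a C/C++ sketch with an illustrative loop body), the two static variants are requested as follows:

#pragma omp parallel for schedule(static)      /* simple static: one chunk per thread   */
for (i = 0; i < n; i++)
   a[i] = work(i);

#pragma omp parallel for schedule(static, 4)   /* interleaved: chunks of 4, round-robin */
for (i = 0; i < n; i++)
   a[i] = work(i);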

13.3 Dynamic
In a dynamic schedule, the assignment of iterations to threads is determined at runtime. As in the static case, the iterations
are broken up into a number of chunks, which are then farmed out to the threads one at a time. As a thread completes work on
a chunk, it requests another chunk until the supply of chunks is exhausted.
If the scheduling type is dynamic, iterations are divided into chunks of size chunk_size, similar to an interleaved schedule.
If chunk_size is not present, the size of all chunks is 1. This kind of schedule is called "simple dynamic."
A simple dynamic schedule is more flexible than an interleaved schedule because faster threads are assigned more
iterations, but it has greater overhead, in the form of synchronization costs, because the OpenMP runtime system must
coordinate the assignment of iterations to threads.
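A C/C++ sketch (illustrative loop body): without a chunk size the threads grab one iteration at a time; a larger chunk size reduces synchronization at the cost of coarser load balancing:

#pragma omp parallel for schedule(dynamic)     /* chunks of 1 iteration  */
for (i = 0; i < n; i++)
   a[i] = work(i);

#pragma omp parallel for schedule(dynamic, 8)  /* chunks of 8 iterations */
for (i = 0; i < n; i++)
   a[i] = work(i);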

13.4 Guided
The guided type is a variant of dynamic scheduling. In this type, the first chunk of iterations is of some
implementation-dependent size, and the size of each successive chunk is a fixed fraction of the preceding chunk until the
minimum size chunk_size is reached; that is, size(chunk_n) = max(chunk_size, r^n * size(chunk_0)). The value of r (where r < 1)
is also implementation dependent. Frequently, chunk_0 is chosen to be about N/P, where N is the number of iterations and P is
the number of threads, and r is chosen as (1 - 1/P). If fewer than chunk_size iterations are left, how the remaining iterations
are divided into chunks also depends on the implementation. If chunk_size is not specified, the minimum chunk size is 1.
Chunks are assigned to threads dynamically. Guided scheduling is sometimes called "guided self-scheduling" or "GSS."
The advantage of the guided type over the dynamic type is that guided schedules use fewer chunks, reducing the amount
of synchronization overhead, i.e. the number of times a thread must ask for new work. The number of chunks produced
increases linearly with the number of iterations in the dynamic type but only logarithmically for the guided type, so the
advantage gets greater as the number of iterations in the loop increases.
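A C/C++ sketch (illustrative loop body): with P threads, the first chunk is roughly N/P iterations and successive chunks shrink toward the minimum of 10 specified here:

#pragma omp parallel for schedule(guided, 10)
for (i = 0; i < n; i++)
   a[i] = work(i);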

13.5 Runtime
The runtime type allows the scheduling to be determined at runtime. The chunk_size parameter must not appear. The
schedule type is chosen at runtime based on the value of the environment variable OMP_SCHEDULE. The environment
variable is set to a string that matches the parameters that would appear in the parentheses of a SCHEDULE clause. For
example, setting OMP_SCHEDULE via the C shell command
%setenv OMP_SCHEDULE "guided, 100"

before executing the program would result in the loops having a guided schedule with a minimum chunk size of 100. If
OMP_SCHEDULE is not set, the choice of schedule depends on the implementation.
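A C/C++ sketch (illustrative loop body): the directive names no schedule at compile time, and the actual schedule is read from OMP_SCHEDULE when the program starts (the environment-variable setting shown in the comment is just an example):

/* setenv OMP_SCHEDULE "dynamic, 8"   (C shell, before running the program) */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
   a[i] = work(i);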

13.6 Schedule Clause


The table below summarizes the different scheduling options and compares them in terms of several characteristics that
affect performance.
Summary of scheduling options
Name             Type      Chunk      Chunk Size            Number of Chunks   Static or Dynamic   Compute Overhead
Simple static    simple    no         N/P                   P                  static              lowest
Interleaved      simple    yes        C                     N/C                static              low
Simple dynamic   dynamic   optional   C                     N/C                dynamic             medium
Guided           guided    optional   decreasing from N/P   fewer than N/C     dynamic             high
Runtime          runtime   no         varies                varies             varies              varies

In this table, N is the number of iterations of the parallel loop, P is the number of threads executing the loop, and C is the
user-specified chunk size.
A note of caution: the correctness of a program should not depend on the schedule chosen for its parallel loops. If the
correctness of the results depends on the choice of schedule, then it is likely that you missed a source of dependency in
one or more of your loop parallelizations. For example, if the correct results depend on the sequential execution of some of
the iterations, then results will depend on whether the iterations are assigned to the same chunk and/or thread. A program
may get correct results at first, but then mysteriously stop working if the schedule is changed while tuning performance. If
the schedule is dynamic, the situation is potentially more challenging as the program may fail only intermittently.
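A contrived C/C++ sketch of the pitfall (array names are illustrative): the loop below carries a dependence on a[i-1], so it is not valid to parallelize it at all; whether its output happens to look correct depends entirely on how iterations are split into chunks and assigned to threads, and so it may change whenever the schedule or thread count changes.

/* WRONG: loop-carried dependence; results vary with the schedule. */
#pragma omp parallel for schedule(static)
for (i = 1; i < n; i++)
   a[i] = a[i - 1] + b[i];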

13.7 Self Test


Question 1
Which one of the following is NOT a valid type for the schedule clause?
static
dynamic
guided
shared

Question 2
About the static type, which one of the following statements is correct?
The choice of which thread performs a particular iteration is only a function of the iteration number.
The first thread gets the first chunk.
If type is static and chunk is not present, the chunk size can't be determined.
Static scheduling has higher overhead.

Question 3
How are the iterations assigned to the threads in the dynamic schedule?
All iterations are assigned to the threads at the beginning of the loop.
Each thread requests more iterations after it has completed the work already assigned to it.
All iterations are assigned to all threads evenly.
Iterations are assigned to all threads randomly.

Question 4

In the guided type, how is the chunk size distributed?


The chunk size is distributed evenly.
The chunk size is distributed randomly.
The first chunk size is implementation dependent and the size of each successive chunk decreases exponentially.
The first chunk size is implementation dependent and the size of each successive chunk decreases by one.

Question 5
Which one of the following UNIX C shell commands sets the environment value for the runtime type correctly?
%setenv OMP_SCHEDULE = "dynamic, 3"
%setenv OMP_SCHEDULE dynamic:3
%setenv OMP_SCHEDULE "dynamic", "3"
%setenv OMP_SCHEDULE "dynamic, 3"

CI-Tutor content for personal use only. All rights reserved. 2014 Board of Trustees of the University of Illinois.
