
High Performance Computing
(Lecture Notes)




















Department of Computer Science
School of Mathematical & Physical Sciences
Central University of Kerala
Module 1
Architectures and Models of Computation


LEARNING OBJECTIVES

Shared and Distributed Memory Machines, PRAM Model,
Interconnection Networks: Crossbar, Bus, Mesh, Tree, Butterfly and
MINs, Hypercube, Shuffle Exchange, etc.; Evaluation based on
Diameter, Bisection Bandwidth, Number of Edges, etc., Embeddings:
Mesh, Tree, Hypercube; Gray Codes, Flynn's Taxonomy



1. What is parallel computing? Why parallel computing? How does it
differ from concurrency? What are its application areas?
Parallel Computing: What is it?
Parallel computing is the use of a parallel computer to reduce the time needed to solve a single computational problem. Parallel computers are computer systems consisting of multiple processing units connected via some interconnection network, plus the software needed to make the processing units work together. The processing units can communicate and interact with each other using either shared memory or message passing. Parallel computing is now considered a standard way for computational scientists and engineers to solve computational problems that demand high-performance computing power.
Parallel Computing: Why is it needed?
Sequential computing systems have been with us for more than six decades since John von Neumann introduced digital computing in the 1950s. The traditional logical view of a sequential computer consists of a memory connected to a processor via a datapath. In sequential computing, all three components (processor, memory, and datapath) present bottlenecks to the overall processing rate of a computer system. To speed up execution, one would need either to increase the clock rate or to improve memory performance by reducing its latency or increasing its bandwidth. A number of architectural innovations, such as multiplicity (in processing units, datapaths and memory units), cache memory, pipelining, superscalar execution, multithreading, and prefetching, have been exploited over the years to address these performance bottlenecks. Although these architectural innovations brought about an average performance improvement of 50% per year during the period 1986 to 2002, the performance gain dropped sharply after 2002, primarily due to the fundamental architectural limitations of sequential computing. The computing industry came to the realization that uniprocessor architectures cannot sustain the rate of realizable performance increments in the future. This realization led the industry to focus on parallel computing for achieving sustained, realizable performance improvement, and the idea of a single-processor computer is fast becoming outdated.
Parallelism vs. Concurrency
In many fields, the words parallel and concurrent are synonyms; not so in
programming, where they are used to describe fundamentally different concepts.
A parallel program is one that uses a multiplicity of computational hardware
(e.g., several processor cores) to perform a computation more quickly. The aim is
to arrive at the answer earlier, by delegating different parts of the computation
to different processors that execute at the same time.
By contrast, concurrency is a program-structuring technique in which there are
multiple threads of control, which may be executed in parallel on multiple
physical processors or in interleaved fashion on a single processor. Whether they
actually execute in parallel or not is therefore an implementation detail.
While parallel programming is concerned only with efficiency, concurrent
programming is concerned with structuring a program that needs to interact
with multiple independent external agents (for example, the user, a database
server, and some external clients). Concurrency allows such programs to be
modular. In the absence of concurrency, such programs have to be written with
event loops and callbacks, which are typically more cumbersome and lack the
modularity that threads offer.
Parallel Computing: Advantages
The main argument for using multiprocessors is that powerful computers can be created by simply connecting multiple processors. A multiprocessor is expected to reach higher speeds than the fastest single-processor system. In addition, a multiprocessor consisting of a number of single processors is expected to be more cost-effective than building a high-performance single processor. Another advantage of a multiprocessor is fault tolerance: if a processor fails, the remaining processors should be able to provide continued service, although with degraded performance.
Parallel Computing: The Limits
A theoretical result known as Amdahl's law says that the amount of performance improvement that parallelism provides is limited by the amount of sequential processing in the application. This may, at first, seem counterintuitive. Amdahl's law says that no matter how many cores you have, the maximum speed-up you can ever achieve is (1 / fraction of time spent in sequential processing).
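The bound is easy to compute directly. The following is a minimal sketch in C (the function name and the 10% sequential fraction are chosen only for illustration): as the processor count grows, the speedup of a program that is 10% sequential flattens out below 10x.

#include <stdio.h>

/* Amdahl's law: speedup on p processors when a fraction s of the work
 * is inherently sequential. As p grows, the speedup approaches 1/s.  */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    /* Example: 10% sequential work caps the speedup at 10x. */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  speedup = %.2f\n", p, amdahl_speedup(0.10, p));
    return 0;
}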
Parallel Computing: Application Areas
Parallel computing is a fundamental and irreplaceable technique used in today's science and technology, as well as in manufacturing and service industries. Its applications cover a wide range of disciplines:
Basic science research, including biochemistry for decoding human genetic information as well as theoretical physics for understanding the interactions of quarks and the possible unification of all four forces.
Mechanical, electrical, and materials engineering for producing better materials such as solar cells, LCD displays, LED lighting, etc.
Service industry, including telecommunications and the financial industry.
Manufacturing, such as the design and operation of aircraft and bullet trains.
Its broad applications in oil exploration, weather forecasting, communication, transportation, and aerospace make it a unique technique for the national economy and defence. It is precisely this uniqueness and its lasting impact that
defines its role in today's rapidly growing technological society.
2. Development of parallel software has traditionally been thought of
as time and effort intensive. Justify the statement.
Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit one after another.
Parallel computing, on the other hand, uses multiple processing elements
simultaneously to solve a problem. This is accomplished by breaking the problem
into independent tasks so that each processing element can execute its part of
the algorithm simultaneously with the others. The processing elements can be
diverse and include resources such as a single computer with multiple
processors, several networked computers, specialized hardware, or any
combination of the above.
However, development of parallel software is traditionally considered a time- and effort-intensive activity for the following reasons:
Complexity in specifying and coordinating concurrent tasks
lack of portable parallel algorithms
lack of standardized parallel environments, and
lack of parallel software development toolkits
Complexity in specifying and coordinating concurrent tasks: Concurrent computing involves overlapping the execution of several computations over one or more processors. Concurrent computing often requires complex interactions between the processes. These interactions typically take the form of communication via message passing (which may be synchronous or asynchronous) or of access to shared resources. The main challenges in designing concurrent programs are concurrency control: ensuring the correct sequencing of the interactions or communications between different computational executions, and coordinating access to resources that are shared among them. Problems that may occur include non-determinism (arising from race conditions), deadlock, and resource starvation.
Lack of portable parallel algorithms: Because the interconnection scheme among processors (or between processors and memory) significantly affects the running time, efficient parallel algorithms must take the interconnection scheme into account. Because of this, most existing parallel algorithms for real-world applications suffer from a major limitation: they have been designed with a specific underlying parallel architecture in mind and are not portable to different parallel architectures.
Lack of standardized parallel environments: The lack of standards in
parallel programming languages makes parallel programs difficult to port across
parallel computers.
Lack of standardized parallel software development toolkits: Unlike sequential programming tools, the parallel programming tools available are highly dependent both on the characteristics of the problem and on the parallel programming environment opted for. The lack of standard programming tools makes parallel programming difficult, and the resultant programs are not portable across parallel computers.
However, in the last few decades, researchers have made considerable progress
in designing efficient and cost-effective parallel architectures and parallel
algorithms. Together with this, factors such as
the reduction in the turnaround time required for the development of microprocessor-based parallel machines, and
the standardization of parallel programming environments and parallel programming tools to ensure a longer life-cycle for parallel applications,
have made parallel computing today less time and effort intensive.
3. Briefly explain some of the compelling arguments in favour of
parallel computing platforms
Though considerable progress has been made in the microprocessor technology in
the past few decades, the industry came to the realization that the implicit
parallel architecture alone cannot provide sustained realizable performance
increments. Together with this, factors such as
the reduction in the turnaround time required for the development of microprocessor-based parallel machines, and
the standardization of parallel programming environments and parallel programming tools to ensure a longer life-cycle for parallel applications
present compelling arguments in favour of parallel computing platforms.
The major compelling arguments in favour of parallel computing platforms include:
a) The Computational Power Argument.
b) Memory/Disk speed Argument.
c) Data Communication Argument
The Computational Power Argument: Due to sustained development in microprocessor technology, the computational power of systems has been doubling roughly every 18 months (Moore's law). This sustained development in microprocessor technology favours parallel computing platforms.
Memory/Disk Speed Argument: The overall speed of a system is determined
not just by the speed of the processor, but also by the ability of the memory
system to feed data to it. Considering the 40% annual increase in clock speed
coupled with the increases in instructions executed per clock cycle, the small 10%
annual improvement in memory access time (memory latency) has resulted in a
performance bottleneck. This growing mismatch between processor speed and
DRAM latency can be bridged to a certain level by introducing cache memory
that relies on locality of data reference. Besides memory latency, the effective
memory bandwidth also influences the sustained improvements in computation
speed. Compared to uniprocessor systems, parallel platforms typically provide better memory system performance because they provide (a) larger aggregate caches, and (b) higher aggregate memory bandwidth. Besides, the design of parallel algorithms that exploit locality of data reference can further hide memory and disk latencies.
The Data Communication Argument: Many of the modern real world
applications in quantum chemistry, statistical mechanics, cosmology,
astrophysics, computational fluid dynamics and turbulence, superconductivity,
biology, pharmacology, genome sequencing, genetic engineering, protein folding,
enzyme activity, cell modelling, medicine, modelling of human organs and bones,
global weather and environmental modelling, data mining, etc., are massively parallel and demand large-scale, wide-area, distributed, heterogeneous parallel/distributed computing environments.
4. Describe the classical von Neumann architecture of computing
systems.
The classical von Neumann architecture consists of main memory, a central
processing unit (also known as CPU or processor or core) and an
interconnection between the memory and the CPU. Main memory consists of a
collection of locations, each of which is capable of storing both instructions and data. Every location has an address, which is used to access the instructions or data stored in that location. The classical von Neumann architecture is depicted below:
The central processing unit is divided into a control unit and an arithmetic
and logic unit (ALU). The control unit is responsible for deciding which
instructions in a program should be executed, and the ALU is responsible for
executing the actual instructions. Data in the CPU and information about the
state of an executing program are stored in special, very fast storage called
registers. The control unit has a special register called the program counter.
It stores the address of the next instruction to be executed.
Instructions and data are transferred between the CPU and memory via an interconnect called a bus. A bus consists of a collection of parallel wires and some hardware controlling access to the wires. A von Neumann machine executes a
single instruction at a time, and each instruction operates on only a few pieces of
data.
The process of transferring data or instructions from memory to the CPU is referred to as a fetch or memory read operation. The process of transferring data from the CPU to memory is referred to as a memory write.

Fig: The Classical von Neumann Architecture
The separation of memory and CPU is often called the von Neumann
bottleneck, since the interconnect determines the rate at which instructions
and data can be accessed. CPUs are capable of executing instructions more than
one hundred times faster than they can fetch items from main memory.
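As a rough illustration of the fetch-decode-execute cycle described above, the following toy simulator keeps instructions and data in the same memory array and uses a program counter to select the next instruction. The opcodes, the encoding, and the memory layout are invented purely for this sketch; they do not correspond to any real instruction set.

#include <stdio.h>

/* A toy von Neumann machine: instructions and data live in the same
 * memory, and the program counter selects the next instruction.
 * Opcodes and encoding are invented purely for illustration. */
enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

int main(void)
{
    /* {opcode, address} pairs followed by data words */
    int memory[16] = {
        LOAD, 10,    /* acc = memory[10]   */
        ADD,  11,    /* acc += memory[11]  */
        STORE, 12,   /* memory[12] = acc   */
        HALT, 0,
        0, 0, 5, 7, 0, 0, 0, 0
    };
    int pc = 0, acc = 0;              /* program counter and accumulator */

    for (;;) {                        /* fetch-decode-execute cycle */
        int opcode = memory[pc];      /* fetch instruction word     */
        int addr   = memory[pc + 1];
        pc += 2;
        if (opcode == LOAD)       acc = memory[addr];
        else if (opcode == ADD)   acc += memory[addr];
        else if (opcode == STORE) memory[addr] = acc;
        else break;                   /* HALT */
    }
    printf("memory[12] = %d\n", memory[12]);   /* prints 12 (5 + 7) */
    return 0;
}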
5. Explain the terms processes, multitasking, and threads.
Process
A process is an instance of a computer program that is being executed. When a
user runs a program, the operating system creates a process. A process consists
of several entities:
The executable machine language program
A block of memory, which will include the executable code, a call stack that
keeps track of active functions, a heap, and some other memory locations
Descriptors of resources that the operating system has allocated to the process, for example, file descriptors.
Security information, for example, information specifying which hardware and software resources the process can access.
Information about the state of the process, such as whether the process is
ready to run or is waiting on some resource, the content of the registers, and
information about the process's memory.
Multitasking
A task is a unit of execution. In some operating systems, a task is synonymous
with a process, in others with a thread. An operating system is called
multitasking if it can execute multiple tasks. Most modern operating systems are
multitasking. This means that the operating system provides support for the
simultaneous execution of multiple programs. This is possible even on a system
with a single core, since each process runs for a time slice (typically a few
milliseconds). After one running program has executed for a time slice, the
operating system can run a different program. A multitasking OS may change the running process many times a minute, even though changing the running process incurs overhead. In a multitasking OS, if a process needs to wait
for a resource (for example, it needs to read data from external storage) the OS
will block the process and schedule another ready process to run. For example,
an airline reservation system that is blocked waiting for a seat map for one user
could provide a list of available flights to another user. Multitasking does not
imply parallelism but it involves concurrency.
Threads
A thread of execution is the smallest unit of a program that can be managed
independently by an operating system scheduler. Threading provides a
mechanism for programmers to divide their programs into more or less
independent tasks with the property that when one thread is blocked another
thread can be run. Furthermore, context switching among threads is much faster than context switching among processes. This is because threads are lighter weight than processes. Threads are contained within processes, so
they can use the same executable, and they usually share the same memory and
the same I/O devices. In fact, two threads belonging to one process can share
most of the process resources. Different threads of a process need only to keep a
record of their own program counters and call stacks so that they can execute
independently of each other.
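A minimal POSIX-threads sketch of these ideas is shown below (error handling is omitted, and the counter and function names are illustrative): the two threads share the global variable, while each has its own stack for its local variable. Compile with the -pthread flag on typical Unix-like systems.

#include <pthread.h>
#include <stdio.h>

/* Two threads of the same process share global (and heap) data,
 * while each has its own stack and program counter. */
static int shared_counter = 0;                 /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int local = 0;                             /* private: lives on this
                                                  thread's own stack    */
    for (int i = 0; i < 100000; i++) {
        local++;
        pthread_mutex_lock(&lock);             /* coordinate shared access */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* 200000 */
    return 0;
}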
6. Describe the architectural innovations employed to overcome the
von Neumann bottleneck
The classical von Neumann architecture consists of main memory, a processor
and an interconnection between the memory and the processor. This separation
of memory and processor is called the von Neumann bottleneck, since the
interconnect determines the rate at which instructions and data can be accessed.
This has resulted in creating a large speed mismatch (of the order of 100 times or
more) between the processor and memory. Several architectural innovations
have been exploited as an extension to the classical von Neumann architecture
for hiding this speed mismatch and improving the overall system performance.
The prominent architectural innovations for hiding the von Neumann bottleneck
include:
1. Caching,
2. Virtual memory, and
3. Low-level parallelism: instruction-level and thread-level parallelism.
Caching
A cache is a smaller and faster memory placed between the processor and the DRAM, which stores copies of the data from frequently used main memory locations. The cache acts as a low-latency, high-bandwidth storage (it improves both memory latency and bandwidth).
Caches work by the principle of locality of reference, which states that programs tend to use data and instructions that are physically close to recently used data and instructions.
The data needed by the processor is first fetched into the cache. All subsequent accesses to data items residing in the cache are serviced by the cache, thereby reducing the effective memory latency.
In order to exploit the principle of locality, memory accesses through the cache operate on blocks (called cache blocks or cache lines) of data and instructions instead of individual instructions and individual data items (cache lines typically range from 8 to 16 words). A cache line of 16 means that 16 memory words can be accessed in 115 ns (assuming 100 ns memory latency) instead of 1600 ns (if accessed one word at a time), thereby increasing the memory bandwidth from 5 MWords/sec to 70 MWords/sec. Blocked access can also reduce the effective memory latency: with a cache line of 16, the next 15 accesses to memory can be served from the cache (if the program exhibits strong locality of reference), thereby hiding the memory latency.
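The effect of cache lines and locality can be seen in ordinary code. The sketch below (array size and loop order are arbitrary choices; absolute timings depend on the machine) sums a two-dimensional array twice: row by row, which uses every word of each fetched cache line, and column by column, which wastes most of each line. The row-wise version is typically several times faster.

#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];       /* C stores this row by row (row-major) */

/* Row-wise traversal touches consecutive addresses, so each fetched
 * cache line is fully used; column-wise traversal jumps N*8 bytes per
 * access and wastes most of every line. */
int main(void)
{
    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* good spatial locality */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)          /* poor spatial locality */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    clock_t t2 = clock();
    printf("row-major:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return (int)sum;    /* keep the compiler from removing the loops */
}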
Rather than implementing CPU cache as a single monolithic structure, in
practice, the cache is usually divided into levels: the first level (L1) is the
smallest and the fastest, and higher levels (L2, L3, . . . ) are larger and slower.
Most systems currently have at least two levels. Caches usually store copies of
information in slower memory. For example, a variable stored in a level 1 cache
will also be stored in level 2. However, some multilevel caches don't duplicate information that's available in another level. For these caches, a variable in a level 1 cache might not be stored in any other level of the cache, but it would be stored in main memory.
When the CPU needs to access an instruction or data, it works its way down the
cache hierarchy: First it checks the level 1 cache, then the level 2, and so on.
Finally, if the information needed isn't in any of the caches, it accesses main memory.
When a cache is checked for information and the information is available, it is
called a cache hit. If the information is not available, it is called a cache miss.
Hit or miss is often modified by the level. For example, when the CPU attempts
to access a variable, it might have an L1 miss and an L2 hit.
When the CPU writes data to a cache, the value in the cache and the value in
main memory are different or inconsistent. There are two basic approaches to
dealing with the inconsistency. In write-through caches, the cache line is
written to main memory when it is written to the cache. In write-back caches,
the data is not written immediately. Rather, the updated data in the cache is
marked dirty, and when the cache line is replaced by a new cache line from
memory, the dirty line is written to memory.
Virtual Memory
Caches make it possible for the CPU to quickly access instructions and data that
are in main memory. However, if we run a very large program or a program that
accesses very large data sets, all of the instructions and data may not fit into
main memory. This is especially true with multitasking operating systems. In
order to switch between programs and create the illusion that multiple programs
are running simultaneously, the instructions and data that will be used during
the next time slice should be in main memory. Thus, in a multitasking system,
even if the main memory is very large, many running programs must share the
available main memory.
Virtual memory was developed so that main memory can function as a cache for
secondary storage. It exploits the principle of locality by keeping in main memory
only the active parts of the many running programs. Those parts that are idle
are kept in a block of secondary storage called swap space. Like CPU caches,
virtual memory operates on blocks of data and instructions. These blocks are commonly called pages (page sizes typically range from 4 to 16 kilobytes).
When a program is compiled, its pages are assigned virtual page numbers. When
the program is loaded into memory, a Page Map Table (PMT) is created that
maps the virtual page numbers to physical addresses. The virtual address
references made by a running program are translated into corresponding
physical addresses by using this PMT.
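The translation itself is simple arithmetic: the low-order bits of the virtual address form the offset within the page, and the remaining bits form the virtual page number used to index the table. The following sketch assumes 4 KB pages and uses a tiny illustrative array in place of a real page table maintained by the operating system.

#include <stdint.h>
#include <stdio.h>

/* Virtual-to-physical translation with 4 KB pages: the low 12 bits are
 * the page offset, the rest is the virtual page number (VPN). */
#define PAGE_SHIFT 12                       /* 4 KB = 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

static uint32_t page_table[4] = { 7, 3, 12, 9 };   /* VPN -> frame number */

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;         /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);     /* offset within page  */
    uint32_t frame  = page_table[vpn];             /* PMT lookup          */
    return (frame << PAGE_SHIFT) | offset;
}

int main(void)
{
    uint32_t va = 0x2ABC;                   /* VPN = 2, offset = 0xABC */
    printf("virtual 0x%X -> physical 0x%X\n", va, translate(va));
    return 0;                               /* frame 12 gives 0xCABC   */
}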
The drawback of storing PMT in main memory is that a virtual address
reference made by the running program requires two memory accesses: one to
get the appropriate page table entry of the virtual page to find its location in
main memory, and one to actually access the desired memory. In order to avoid
this problem, CPUs have a special page table cache called the translation look-aside buffer (TLB) that caches a small number of entries (typically 16 to 512) from the page table in very fast memory. Using the principle of locality, one can
expect that most of the memory references will be to pages whose physical
address is stored in the TLB, and the number of memory references that require
accesses to the page table in main memory will be substantially reduced.
If the running process attempts to access a page that is not in memory, that is,
the page table does not have a valid physical address for the page and the page is
only stored on disk, then the attempted access is called a page fault. In such a case, the running process will be blocked until the faulted page is brought into memory and the corresponding entry is made in the PMT.
When the running program looks up an address and the virtual page number is in the TLB, it is called a TLB hit. If it is not in the TLB, it is called a TLB miss.
Due to the relative slowness of disk accesses, virtual memory always uses a
write-back scheme for handling write accesses. This can be handled by keeping a
bit on each page in memory that indicates whether the page has been updated. If
it has been updated, when it is evicted from main memory, it will be written to
disk.
Low-Level Parallelism: Instruction-Level Parallelism
Low-level parallelism is parallelism that is not visible to the programmer (i.e., the programmer has no control over it). Two forms of low-level parallelism are instruction-level parallelism and thread-level parallelism.
Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions. There are two main approaches to ILP: pipelining, in which functional units are arranged in stages with the output of one being the input to the next, and multiple issue, in which the functional units are replicated so that multiple instructions can be issued simultaneously.
Pipelining: Pipelining is a technique used to increase the instruction throughput of the processor. The main idea is to divide the basic instruction cycle
into a series of independent steps of micro-operations. Rather than processing
each instruction sequentially, the independent micro-operations of different
instructions are executed concurrently (by different functional units) in parallel.
Pipelining enables faster execution by overlapping various stages in instruction
execution (fetch, schedule, decode, operand fetch, execute, store, among others).
Multiple Issue (Superscalar Execution): Superscalar execution is an
advanced form of pipelined instruction-level parallelism that allows dispatching
of multiple instructions to multiple pipelines, processed concurrently by
redundant functional units on the processor. Superscalar execution can provide
better cost-effective performance as it improves the degree of overlapping of
parallel concurrent execution of multiple instructions.
Low-Level Parallelism: Thread-Level Parallelism
Thread-level parallelism, or TLP, attempts to provide parallelism through the simultaneous execution of different threads. Thread-level parallelism splits a program into independent threads and runs them concurrently. TLP is considered a coarser-grained parallelism than ILP; that is, the program units that are simultaneously executed (threads in TLP) are larger or coarser than the finer-grained units (individual instructions in ILP).
Multithreading provides a means for systems to continue doing useful work by
switching the execution to another thread when the thread being currently
executed has stalled (for example, if the current task has to wait for data to be
loaded from memory). There are different ways to implement the multithreading.
In fine-grained multithreading, the processor switches between threads after
each instruction, skipping threads that are stalled. While this approach has the
potential to avoid wasted machine time due to stalls, it has the drawback that a thread that's ready to execute a long sequence of instructions may have to wait to execute every instruction.
Coarse-grained multithreading attempts to avoid this problem by switching threads only when the running thread is stalled waiting for a time-consuming operation to complete (e.g., a load from main memory).
Simultaneous multithreading is a variation on fine-grained multithreading.
It attempts to exploit superscalar processors by allowing multiple threads to
make use of the multiple functional units.
7. Explain how implicit parallelisms like pipelining and superscalar execution result in cost-effective performance gains
Microprocessor technology has recorded an average 50% annual performance
improvement over the last few decades. This development has also uncovered
several performance bottlenecks in achieving the sustained realizable
performance improvement. To alleviate these bottlenecks, microprocessor designers have explored a number of alternative architectural innovations to achieve cost-effective performance gains. One of the most important innovations is
multiplicity in processing units, datapaths, and memory units. This
multiplicity is either entirely hidden from the programmer or exposed to the
programmer in different forms.
Implicit parallelism is an approach to provide multiplicity at the level of
instruction execution for achieving the cost-effective performance gain. In
implicit parallelism, parallelism is exploited by the compiler and/or the runtime
system and this type of parallelism is transparent to the programmer.
Two common approaches for the implicit parallelism are (a) Instruction
Pipelining and (b) Superscalar Execution.
Instruction Pipelining: Instruction pipelining is a technique used in the
design of microprocessors to increase their instruction throughput. The main
idea is to divide the basic instruction cycle into a series of independent steps of
micro-operations. Rather than processing each instruction sequentially, the
independent micro-operations of different instructions are executed concurrently
(by different functional units) in parallel. Pipelining enables faster execution by
overlapping various stages in instruction execution (fetch, schedule, decode,
operand fetch, execute, store, among others).
To illustrate how instruction pipelining enables faster execution, consider the
execution of the following code fragment:
load R1, @1000
load R2, @1008
add R1, @1004
add R2, @100C
add R1, R2
store R1, @2000

Sequential Execution
Instruction Sequential Execution Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID E
store R1, @2000 IF ID WB
Clock Cycles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Pipelined Execution
Instruction Pipeline Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID NOP E
Store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7 8 9

As seen in the example, the pipelined execution requires only 9 clock cycles, which is a significant improvement over the 20 clock cycles needed for sequential execution. However, the speed of a single pipeline is always limited by its largest atomic task. Also, pipeline performance is always dependent on the efficiency of the dynamic branch prediction mechanism employed.
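As a rough check on these figures (assuming, as in the table above, that each of the n = 6 instructions occupies k = 4 pipeline slots, counting the NOPs, and that a new instruction enters the pipeline every cycle with no stalls), the pipelined cycle count matches the usual estimate k + (n - 1) = 4 + (6 - 1) = 9 cycles, giving a speedup of about 20/9, i.e., roughly 2.2, over the 20-cycle sequential execution. For a long instruction stream with no stalls, the ideal speedup approaches the pipeline depth k.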
Superscalar Execution: Superscalar execution is an advanced form of
pipelined instruction-level parallelism that allows dispatching of multiple
instructions to multiple pipelines, processed concurrently by redundant
functional units on the processor. Superscalar execution can provide better cost-
effective performance as it improves the degree of overlapping of parallel
concurrent execution of multiple instructions.
To illustrate how the superscalar execution results in better performance gain,
consider the execution of the previous code fragment on a processor with two
pipelines and the ability to simultaneously issue two instructions.

Superscalar Execution
Instruction Pipeline Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID NOP E
store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7

With the superscalar execution, the execution of the same code fragment takes
only 7 clock cycles instead of 9.
These examples illustrate that implicit parallelism techniques like pipelining and superscalar execution can result in cost-effective performance gains.
8. Explain the concepts of pipelining and superscalar execution with suitable examples. Also explain their individual merits and demerits.
Pipelining and superscalar execution are two forms of instruction-level implicit
parallelism inherent in the design of modern microprocessors to increase their
instruction throughput.
Pipelining: The main idea of the instruction pipelining is to divide the
instruction cycle into a series of independent steps of micro-operations. Rather
than processing each instruction sequentially, the independent micro-operations
of different instructions are executed concurrently (by different circuitry) in
parallel. Pipelining enables faster execution by overlapping various stages in
instruction execution (fetch, schedule, decode, operand fetch, execute, store,
among others).
To illustrate how instruction pipelining executes instructions, consider the
execution of the following code fragment using pipelining:
load R1, @1000
load R2, @1008
add R1, @1004
add R2, @100C
add R1, R2
store R1, @2000
Pipelined Execution
Instruction Pipeline Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID NOP E
store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7 8 9
As seen in the example, pipelining results in overlapping the execution of the various stages of different instructions.
Advantage of Pipelining:
The effective cycle time per instruction is reduced by overlapping the execution stages of different instructions, thereby increasing the overall instruction throughput.
Disadvantages of Pipelining:
Design Complexity: Pipelining involves adding hardware to the chip
Inability to continuously run the pipeline at full speed because of pipeline
hazards, such as data dependency, resource dependency and branch
dependency, which disrupt the smooth execution of the pipeline.
Superscalar Execution: Superscalar execution is an advanced form of
pipelined instruction-level parallelism that allows dispatching of multiple
instructions to multiple pipelines, processed concurrently by redundant
functional units on the processor. Superscalar execution can provide better cost-
effective performance as it improves the degree of overlapping of parallel
concurrent execution of multiple instructions.
To illustrate how the superscalar pipelining executes instructions, consider the
execution of the previous code fragment on a processor with two pipelines and
the ability to simultaneously issue two instructions.
Superscalar Execution
Instruction Pipeline Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID NOP E
store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7

Advantage of Superscalar Pipelining:
Since the processor issues multiple instructions per clock cycle, superscalar execution results in better performance as compared to a single pipeline.
Disadvantages of Superscalar Pipelining:
Design Complexity: Superscalar processor designs are more complex than single-pipeline designs.
Inability to continuously run the pipeline at full speed because of pipeline
hazards, such as data dependency, resource dependency and branch
dependency, which disrupt the smooth execution of the pipeline.
The performance of superscalar architectures is limited by the available
instruction-level parallelism and the ability of the processor to detect and schedule concurrent instructions.
9. Though the superscalar execution seems to be simple and natural,
there are a number of issues to be resolved. Elaborate on the issues
that need to be resolved.
Superscalar execution is an advanced form of pipelined instruction-level
parallelism that allows dispatching of multiple instructions to multiple pipelines,
processed concurrently by redundant functional units on the processor.
Superscalar execution can provide better cost-effective performance as it
improves the degree of overlapping of parallel concurrent execution of multiple
instructions. Since superscalar execution exploits multiple instruction pipelines,
it seems to be a simple and natural means for improving the performance.
However, it needs to resolve the following issues for achieving the expected
performance improvement.
a. Pipeline Hazards
b. Out of Order Execution
c. Available Instruction Level Parallelism
Pipeline Hazards: A pipeline hazard is the inability to continuously run the
pipeline at full speed because of various pipeline dependencies such as data
dependency (called Data Hazard), resource dependency (called Structural
Hazard) and branch dependency (called Control or Branch Hazard).
Data Hazards: Data hazards occur when instructions that exhibit data
dependence modify data in different stages of a pipeline. Ignoring potential data
hazards can result in race conditions. There are three situations in which a data
hazard can occur:
read after write (RAW), called true dependency
write after read (WAR), called anti-dependency
write after write (WAW), called output dependency
As an example, consider the superscalar execution of the following two
instructions i1 and i2, with i1 occurring before i2 in program order.
True dependency: i2 tries to read R2 before i1 writes to it
i1. R2 ← R1 + R3
i2. R4 ← R2 + R3
Anti-dependency: i2 tries to write R5 before it is read by i1
i1. R4 ← R1 + R5
i2. R5 ← R1 + R2
Output dependency: i2 tries to write R2 before it is written by i1
i1. R2 ← R4 + R7
i2. R2 ← R1 + R3
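The same three orderings can be written on ordinary variables. In the hedged C sketch below (variable names and values are illustrative), the RAW pair must execute in order to produce the right result, while the WAR and WAW pairs are "false" dependences that register renaming in hardware can remove.

#include <stdio.h>

/* The three data-dependence types from the register examples above,
 * restated on ordinary C variables. */
int main(void)
{
    int r1 = 1, r2 = 2, r3 = 3, r4, r5 = 5, r7 = 7;

    /* RAW (true dependence): i2 needs the value that i1 produces.        */
    r2 = r1 + r3;            /* i1                                        */
    r4 = r2 + r3;            /* i2 reads r2                               */

    /* WAR (anti-dependence): i2 must not overwrite r5 before i1 reads it. */
    r4 = r1 + r5;            /* i1 reads r5                               */
    r5 = r1 + r2;            /* i2 writes r5                              */

    /* WAW (output dependence): the final value of r2 must come from i2.  */
    r2 = r4 + r7;            /* i1 writes r2                              */
    r2 = r1 + r3;            /* i2 writes r2 again                        */

    printf("%d %d %d\n", r2, r4, r5);
    return 0;
}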
Structural Hazards: A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A popular example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.
Control Hazards: Branching hazards (also known as control hazards) occur with branches. In many instruction pipelines, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (normally in the fetch stage).
Dependencies of the above types must be resolved before simultaneous issue of
instructions. Pipeline Bubbling (also known as a pipeline break or a pipeline
stall) is the general strategy to prevent all the three kinds of hazards. As
instructions are fetched, control logic determines whether a hazard will occur. If
this is true, then the control logic inserts NOPs into the pipeline. Thus, before
the next instruction (which would cause the hazard) is executed, the previous
one will have had sufficient time to complete and prevent the hazard.
A variety of specific strategies are also available for handling the different
pipeline hazards. Examples include branch prediction for handling control
hazards and out-of-order execution for handling data hazards.
There are two implications to the pipeline hazards handling. First, since the
resolution is done at runtime, it must be supported in hardware; the complexity
of this hardware can be high. Second, the amount of instruction level parallelism
in a program is often limited and is a function of coding technique.
Out of Order Execution: The ability of a processor to detect and schedule
concurrent instructions is critical to superscalar performance. As an example,
consider the execution of the following code fragment on a processor with two
pipelines and the ability to simultaneously issue two instructions.
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
In the above code fragment, there is a data dependency between the first two
instructions
load R1, @1000 and
add R1, @1004.
Therefore, these instructions cannot be issued simultaneously. However, if the
processor has the ability to look ahead, it will realize that it is possible to
schedule the third instruction
load R2, @1008
with the first instruction
load R1, @1000.
In the next issue cycle, instructions two and four
add R1, @1004
add R2, @100C
can be scheduled, and so on.
However, the processor needs the ability to issue instructions out-of-order to
accomplish the desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example. Most current
microprocessors are capable of out-of-order issue and completion.
Available Instruction Level Parallelism: The performance of superscalar
architectures is also limited by the available instruction level parallelism. As an
example, consider the execution of the following code fragment on a processor
with two pipelines and the ability to simultaneously issue two instructions.
load R1, @1000
add R1, @1004
add R1, @1008
add R1, @100C
store R1, @2000

Superscalar Execution
Instruction Pipeline Stages
load R1, @1000 IF ID OF
add R1, @1004 IF ID OF E
add R1, @1008 IF ID OF E
add R1, @100C IF ID OF E
store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7

For simplicity of discussion, let us ignore the pipelining aspects of the example
and focus on the execution aspects of the program. Assuming two execution units
(multiply-add units), the following figure illustrates that there are several zero-
issue cycles (cycles in which the floating point unit is idle).
Clock Cycle    Exe. Unit 1    Exe. Unit 2
4              -              -              (vertical waste)
5              E              E              (full issue slot)
6              E              -              (horizontal waste)
7              -              -              (vertical waste)

These are essentially wasted cycles from the point of view of the execution unit.
If, during a particular cycle, no instructions are issued on the execution units, it
is referred to as vertical waste; if only part of the execution units are used during
a cycle, it is termed horizontal waste. In the example, we have two cycles of
vertical waste and one cycle with horizontal waste. In all, only three of the eight
available cycles are used for computation. This implies that the code fragment
will yield no more than three eighths of the peak rated FLOP count of the
processor.
In short, though the superscalar execution seems to be a simple and natural
means for improving the performance, due to limited parallelism, resource
dependencies, or the inability of a processor to extract parallelism, the resources
of superscalar processors are heavily under-utilized.
10. The ability of a processor to detect and schedule concurrent
instructions is critical to superscalar performance. Justify the
statement with example
Superscalar execution is an advanced form of pipelined instruction-level
parallelism that allows dispatching of multiple instructions to multiple pipelines,
processed concurrently by redundant functional units on the processor.
Superscalar execution can provide better cost-effective performance as it
improves the degree of overlapping of parallel concurrent execution of multiple
instructions. Since superscalar execution exploits multiple instruction pipelines,
it seems to be a simple and natural means for improving the performance.
However, the ability of a processor to detect and schedule concurrent
instructions is critical to superscalar performance.
To illustrate this point, consider the execution of the following two different code
fragments for adding four numbers on a processor with two pipelines and the
ability to simultaneously issue two instructions.
Code Fragment 1:
1. load R1, @1000
2. load R2, @1008
3. add R1, @1004
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
Consider the execution of the above code fragment for adding four numbers. The
first and second instructions are independent and therefore can be issued
concurrently. This is illustrated in the simultaneous issue of the instructions
load R1, @1000 and
load R2, @1008
at t = 1. The instructions are fetched, decoded, and the operands are fetched.
These instructions terminate at t = 3. The next two instructions,
add R1, @1004 and
add R2, @100C
are also mutually independent, although they must be executed after the first
two instructions. Consequently, they can be issued concurrently at t = 2 since the processor is pipelined. These instructions terminate at t = 5. The next two
instructions,
add R1, R2 and
store R1, @2000
cannot be executed concurrently since the result of the former (contents of
register R1) is used by the latter. Therefore, only the add instruction is issued at
t = 3 and the store instruction at t = 4. Note that the instruction
add R1, R2
can be executed only after the previous two instructions have been executed. The
instruction schedule is illustrated below.
Superscalar Execution of Code Fragment 1
Instruction Pipeline Stages
load R1, @1000 IF ID OF
load R2, @1008 IF ID OF
add R1, @1004 IF ID OF E
add R2, @100C IF ID OF E
add R1, R2 IF ID NOP E
store R1, @2000 IF ID NOP WB
Clock Cycles 1 2 3 4 5 6 7

Code Fragment 2:
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000
This code fragment is exactly equivalent to code fragment 1 and it computes the
sum of four numbers. In this code fragment, there is a data dependency between
the first two instructions
load R1, @1000 and
add R1, @1004
Therefore, these instructions cannot be issued simultaneously. However, if the
processor has the ability to look ahead, it will realize that it is possible to
schedule the third instruction
load R2, @1008
with the first instruction
load R1, @1000.
In the next issue cycle, instructions two and four
add R1, @1004
add R2, @100C
can be scheduled, and so on.
However, the processor needs the ability to issue instructions out-of-order to
accomplish the desired reordering. The parallelism available in in-order issue of
instructions can be highly limited as illustrated by this example. Most current
microprocessors are capable of out-of-order issue and completion.
11. Explain how VLIW processors can achieve cost-effective performance gains over uniprocessors. What are their merits and demerits?
Microprocessor technology has recorded an unprecedented growth over the past
few decades. This growth has also unveiled various bottlenecks in achieving
sustained performance gain. To alleviate these performance bottlenecks,
microprocessor designers have explored a number of alternate architectural
innovations involving implicit instruction-level parallelisms like pipelining,
superscalar architectures and out-of-order execution. All these implicit
instruction-level parallelism approaches have the demerits that they involve
increased hardware complexity (higher cost, larger circuits, higher power
consumption) because the processor must inherently make all of the decisions
internally for these approaches to work (for example, the scheduling of
instructions and determining of interdependencies).
Another alternate architectural innovation to cost-effective performance gains is
the Very Long Instruction Word (VLIW) processors. VLIW is one particular style
of processor design that tries to achieve high levels of explicit instruction level
parallelism by executing long instruction words composed of multiple operations.
The long instruction word called a MultiOp consists of multiple arithmetic, logic
and control operations. The VLIW processor concurrently executes the set of
operations within a MultiOp thereby achieving instruction level parallelism.
A VLIW processor allows programs to explicitly specify instructions to be
executed in parallel. That is, a VLIW processor depends on the programs
themselves for providing all the decisions regarding which instructions are to be
executed simultaneously and how conflicts are to be resolved. A VLIW processor
relies on the compiler to resolve the scheduling and interdependencies at compile
time. Instructions that can be executed concurrently are packed into groups and
are passed to the processor as a single long instruction word (thus the name) to
be executed on multiple functional units at the same time. This means that the
compiler becomes much more complex, but the hardware is simpler than many
other approaches to parallelism.
Advantages:
Since VLIW processors depend on compilers for resolving scheduling and interdependencies, the decoding and instruction issue mechanisms are simpler in VLIW processors.
Since scheduling and interdependencies are resolved at compilation time, instruction-level parallelism can be exploited to the maximum, as the compiler has a larger-scale view of the program for selecting parallel instructions compared to the instruction-window view of a superscalar processor. Further, compilers can also use a variety of transformations to optimize parallelism when compared to a hardware issue unit.
The VLIW approach executes operations in parallel based on a fixed schedule
determined when programs are compiled. Since determining the order of
execution of operations (including which operations can execute
simultaneously) is handled by the compiler, the processor does not need the
scheduling hardware. As a result, VLIW CPUs offer significant computational
power with less hardware complexity.
Disadvantages:
VLIW programs only work correctly when executed on a processor with the
same number of execution units and the same instruction latencies as the
processor they were compiled for, which makes it virtually impossible to
maintain compatibility between generations of a processor family. For example, if the number of execution units in a processor increases between generations, the new processor will try to combine operations from multiple instructions in each cycle, potentially causing dependent instructions to execute in the same cycle. Similarly, changing instruction latencies between generations of a processor family can cause operations to execute before their inputs are ready or after their inputs have been overwritten, resulting in incorrect behaviour.
Since the scheduling and interdependencies are resolved at compilation time, the compiler lacks dynamic program state, such as the branch history buffer, that helps in making scheduling decisions. Since the static prediction mechanism employed by the compiler may not be as effective as a dynamic one, the branch and memory predictions made by the compiler may not be accurate. Moreover, some runtime situations, such as stalls on data fetch because of cache misses, are extremely difficult to predict accurately. This limits the scope and performance of static compiler-based scheduling.
12. With an example, illustrate how the memory latency can be a
bottleneck in achieving the peak processor performance. Also
illustrate how the cache memory can reduce this performance
bottleneck.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. There are two figures that are often used to describe the
performance of a memory system: the latency and the bandwidth.
The memory latency is the time that elapses between the memory beginning to
transmit the data and the processor starting to receive the first byte. The
memory bandwidth is the rate at which the processor receives data after it has
started to receive the first byte. So if the latency of a memory system is l seconds
and the bandwidth is b bytes per second, then the time it takes to transmit a
message of n bytes is l+n/b.
To illustrate the effect of memory system latency on system performance, consider a processor operating at 1 GHz (1 ns clock, since 1/10^9 s = 10^-9 s = 1 ns) connected to a DRAM with a latency of 100 ns (no caches). Assume that the size of the memory block is 1 word per block. Also assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The peak processor rating is therefore 4 GFLOPS (10^9 clock cycles × 4 FLOPs per clock cycle = 4 × 10^9 FLOPs = 4 GFLOPS). Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data. That is, the processor is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating.
This example highlights how a long memory latency (and hence a large speed mismatch between memory and CPU) can be a bottleneck in achieving peak processor performance.
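The arithmetic of this example can be reproduced directly. The short C sketch below simply encodes the stated parameters (1 ns cycle, 4 FLOPs per cycle, 100 ns latency, one-word blocks, no cache); it is an illustration of the calculation, not a measurement.

#include <stdio.h>

/* Reproduces the arithmetic of the example above: a 1 GHz processor
 * (4 FLOPs per 1 ns cycle) fed by a DRAM with 100 ns latency and a
 * one-word block, with no cache. Every operand costs a full 100 ns
 * access, so the sustained rate collapses to about 10 MFLOPS. */
int main(void)
{
    double cycle_ns        = 1.0;        /* 1 GHz clock                */
    double flops_per_cycle = 4.0;        /* two multiply-add units     */
    double latency_ns      = 100.0;      /* DRAM latency, 1 word/block */

    double peak_gflops = flops_per_cycle / cycle_ns;      /* FLOPs per ns */
    double sustained_mflops = 1.0 / latency_ns * 1000.0;  /* 1 FLOP per
                                                             100 ns      */
    printf("peak      = %.0f GFLOPS\n", peak_gflops);
    printf("sustained = %.0f MFLOPS\n", sustained_mflops);
    return 0;
}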
One of the architectural innovations in memory system design for reducing the
mismatch in processor and memory speeds is the introduction of a smaller and
faster cache memory between the processor and the memory. The cache acts as
low-latency high-bandwidth storage.
The data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache.
To illustrate the impact of caches on memory latency and system performance,
consider a 1 GHz processor with a 100 ns latency DRAM. Assume that the size of the memory block is 1 word per block and that a cache memory of size 32 KB with a latency of 1 ns is available. Assume that this setup is used to multiply two matrices A and B of dimensions 32 × 32. Fetching the two matrices into the cache from memory corresponds to fetching 2K words (one matrix = 32 × 32 words = 2^5 × 2^5 = 2^10 = 1K words, so two matrices = 2K words), which takes approximately 200 µs (memory latency = 100 ns, so fetching 2K words takes 2 × 10^3 × 100 ns = 200,000 ns = 200 µs). Multiplying two n × n matrices takes 2n^3 operations. For our problem, this corresponds to 64K operations (2 × 32^3 = 2 × (2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles (or 16 µs) at four instructions per cycle (64K/4 = 16K cycles = 16,000 ns = 16 µs). The total time for the computation is therefore approximately the sum of the time for the load/store operations and the time for the computation itself, i.e., 200 + 16 = 216 µs. This corresponds to a peak computation rate of 64K/(216 µs), or about 303 MFLOPS.
Note that this is a thirty-fold improvement over the previous example, although it is still less than 10% of the peak processor performance. This example illustrates that placing a small cache memory between the processor and main memory improves processor utilization considerably.
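For reference, the computation in this example is the classic triple-loop matrix multiplication sketched below. With N = 32, each matrix of doubles occupies 8 KB, so A, B and C together fit easily in the 32 KB cache assumed above, and after the initial fetch essentially every access is a cache hit (the initialization values are arbitrary).

#include <stdio.h>

#define N 32     /* 32 x 32 doubles per matrix = 8 KB */

static double A[N][N], B[N][N], C[N][N];

/* Classic triple loop: 2*N^3 = 64K floating point operations. After the
 * initial fetch of A and B (the 200 microseconds in the example), every
 * access hits the cache, which is what lifts the rate to ~303 MFLOPS. */
int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 2.0;
        }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* reuses cached data */
            C[i][j] = sum;
        }

    printf("C[0][0] = %.1f\n", C[0][0]);    /* 32 * 1.0 * 2.0 = 64.0 */
    return 0;
}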
13. With a suitable example, illustrate the effect of memory bandwidth on improving processor performance
Memory bandwidth refers to the rate at which data can be moved between the processor and memory. It is determined by the bandwidth of the memory bus as well as of the memory units. The memory bandwidth of a system decides the rate at which data can be pumped to the processor, and it has a large impact on the realizable system performance.
One commonly used technique to improve the effective memory bandwidth is to increase the size of the memory blocks. Since bigger blocks effectively exploit spatial locality, increasing the block size helps hide the memory latency.
To illustrate the effect of block size on hiding memory latency (improving system
performance), consider a 1 GHz processor with a 100 ns latency DRAM. Assume
that memory block size (cache line) is 1 word. Assume that this set up is used to
find the dot-product of two vectors. Since the block size is one word, the
processor takes 100 cycles to fetch each word. For each pair of words, the dot-
product performs one multiply-add, i.e., two FLOPs in 200 cycles. Therefore, the
algorithm performs one FLOP every 100 cycles for a peak speed of 10 MFLOPS.
Now let us consider what happens if the block size is increased to four words, i.e.,
the processor can fetch a four-word cache line every 100 cycles. For each pair of
four-words, the dot-product performs eight FLOPs in 200 cycles. This
corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS. Note that
increasing the block size from one to four words did not increase the latency of
the memory system. However, it increased the bandwidth four-fold.
The above example assumed a wide data bus equivalent to the size of the cache
line. In practice, such wide buses are expensive to construct. In a more practical
system, consecutive words are sent on the memory bus on subsequent bus cycles
after the first word is retrieved. For example, with a 32 bit data bus, the first
word is put on the bus after 100 ns (the associated latency) and one word is put
on each subsequent bus cycle. This changes our calculations above slightly since
the entire cache line becomes available only after 100 + 3 cycles. However, this
does not change the execution rate significantly.
The above examples clearly illustrate how increased bandwidth results in higher
peak computation rates.
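For reference, the dot product assumed in these examples is the following C loop. The point of the example is that with a one-word cache line every iteration pays the full 100-cycle latency for each operand, whereas with a four-word line one memory access feeds four consecutive iterations; the loop itself does not change.

/* Dot product of two n-element vectors: one multiply-add (2 FLOPs) per
   element pair. The achievable rate is set by how many of these elements
   each memory access (cache line) delivers.                             */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}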
14. Data reuse is critical to cache performance. Justify the statement
with an example.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. There are two figures that are often used to describe the
performance of a memory system: the latency and the bandwidth. Memory
latency has a larger role in controlling the speed mismatch between processor
and memory. One of the architectural innovations in memory system design for
reducing the mismatch in processor and memory speeds is the introduction of a
smaller and faster cache memory between the processor and the memory. The
data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache. The fraction of data references
satisfied by the cache is called the cache hit ratio of the computation on the
system.
The data reuse measured in terms of cache hit ratio is critical for cache
performance because if each data item is used only once, it would still have to be
fetched once per use from the DRAM, and therefore the DRAM latency would be
paid for each operation.
To illustrate this, consider a 1 GHz processor with a 100 ns latency DRAM with
a memory block size of 1 word. Assume that a cache memory of size 32 KB with a
latency of 1 ns is available. Also assume that the processor has two multiply-add
units and is capable of executing four instructions in each cycle of 1 ns. Assume
that this setup is used to multiply two matrices A and B of dimensions 32×32.
Fetching the two matrices into the cache from memory corresponds to fetching
2K words (one matrix = 32×32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2 matrices =
2K words). Multiplying two n×n matrices takes 2n^3 operations (this indicates data
reuse, because the 2K data words are used in 64K operations). For our problem, this
corresponds to 64K operations (2×32^3 = 2×(2^5)^3 = 2^16 = 64K). This results in a cache
hit ratio of
Hit ratio = (number of references served by the cache) / (total number of data references)
i.e., Hit ratio = (64K - 2K)/64K = 62K/64K, or about 0.97
(of the 64K data references made by the computation, only 2K require a DRAM
access; the remaining 62K are served from the cache).
A higher hit ratio results in lower memory latency and higher system
performance.
For example, in the earlier example of matrix multiplication, the peak
performance of the system would be 4 GFLOPS at the rate of 4 FLOPS per clock
cycle (for a total of 1 GHz = 10^9 clock cycles per second). However, due to the
memory latency of 100 ns, in the absence of cache memory the realizable peak
performance will be 4×10^9/100 = 0.04×10^9 = 40×10^6 = 40 MFLOPS. In the
presence of a cache memory of size 32 KB with a latency of 1 ns, the increase in
the realizable peak performance can be illustrated with the matrix multiplication
example. Fetching the two 32×32 matrices into the cache from memory corresponds
to fetching 2K words (one matrix = 32×32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e.,
2 matrices = 2K words), which takes approximately 200 µs (2K words = 2×10^3
words = 2×10^3 × 10^2 ns = 2×10^5 ns = 200 µs). Multiplying two n×n matrices
takes 2n^3 operations. For our problem, this corresponds to 64K operations
(2×32^3 = 2×(2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles or 16 µs
at four instructions per cycle (no. of cycles = 64K/4 = 16K cycles = 16K × 1 ns =
16 µs). The total time for the computation is therefore approximately the sum of
the time for load/store operations and the time for the computation itself, i.e.,
200 + 16 = 216 µs. This corresponds to a peak computation rate of 64K/216 µs, or
about 303 MFLOPS, roughly a ten-fold improvement over the model with no cache.
This performance improvement is due to data reuse: the 2K data words are used
in 64K operations.
If each data item is used exactly once, then every data reference has to be
satisfied by a DRAM access, so essentially no reference is found in the cache.
In our case, the hit ratio becomes
Hit ratio = (number of references served by the cache) / (total number of data references)
i.e., Hit ratio = (2K - 2K)/2K ≈ 0.
A lower hit ratio results in higher memory latency and lower system
performance.
For example, consider the case of finding the dot product of the previous matrices
(instead of multiplying them). Now the total number of operations will be 32×32×2 =
2^11 = 2K (one multiply and one add for each element), which can be performed in
0.5K cycles or 0.5 µs at four instructions per cycle (no. of cycles = 2K/4 = 0.5K
cycles = 0.5K × 1 ns = 0.5 µs). The total time for the computation is therefore
200 + 0.5 µs. This corresponds to a peak computation rate of 2K/200.5 µs, or about
9.97 MFLOPS. This performance reduction is due to the absence of data reuse, as
the 2K data words are used in only 2K operations.
The above examples illustrate that the data reuse measured in terms of cache hit
ratio is critical for cache performance.
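The contrast between the two computations can be reduced to a single figure, the number of operations performed per word fetched from DRAM; the small sketch below just restates the counts assumed above (2K fetched words in both cases, 64K operations for the multiplication and 2K for the dot product):

#include <stdio.h>

int main(void)
{
    double words_fetched = 2.0 * 1024;   /* two 32x32 matrices = 2K words */
    double matmul_ops    = 64.0 * 1024;  /* 2 x 32^3 = 64K operations     */
    double dotprod_ops   = 2.0 * 1024;   /* 2 x 32 x 32 = 2K operations   */

    printf("matrix multiply: %.0f operations per fetched word\n",
           matmul_ops / words_fetched);                  /* 32: high reuse */
    printf("dot product    : %.0f operations per fetched word\n",
           dotprod_ops / words_fetched);                 /* 1: no reuse    */
    return 0;
}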

15. The performance of a memory bound program is critically
impacted by the cache hit ratio. Justify the statement with an
example.
The effective performance of a program on a computer relies not just on the
speed of the processor, but also on the ability of the memory system to feed data
to the processor. There are two figures that are often used to describe the
performance of a memory system: the latency and the bandwidth. Memory
latency has a larger role in controlling the speed mismatch between processor
and memory. One of the architectural innovations in memory system design for
reducing the mismatch in processor and memory speeds is the introduction of a
smaller and faster cache memory between the processor and the memory. The
data needed by the processor is first fetched into the cache. All subsequent
accesses to data items residing in the cache are serviced by the cache. Thus, in
principle, if a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache. The fraction of data references
satisfied by the cache is called the cache hit ratio of the computation on the
system. The effective computation rate of many applications is bounded not by
the processing rate of the CPU, but by the rate at which data can be pumped into
the CPU. Such computations are referred to as being memory bound. The
performance of memory bound programs is critically impacted by the cache hit
ratio.
To illustrate this, consider a 1 GHz processor with a 100 ns latency DRAM.
Assume that a cache memory of size 32 KB with a latency of 1 ns is available.
Assume that this setup is used to multiply two matrices A and B of dimensions
32×32. Fetching the two matrices into the cache from memory corresponds to
fetching 2K words (one matrix = 32×32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2
matrices = 2K words). Multiplying two n×n matrices takes 2n^3 operations
(this indicates data reuse, because the 2K data words are used in 64K operations).
For our problem, this corresponds to 64K operations (2×32^3 = 2×(2^5)^3 = 2^16 = 64K).
This results in a cache hit ratio:
Hit ratio = (number of references served by the cache) / (total number of data references)
i.e., Hit ratio = (64K - 2K)/64K = 62K/64K, or about 0.97.
A higher hit ratio results in lower memory latency and higher system
performance.
If each data item is used exactly once, then every data reference has to be
satisfied by a DRAM access, so essentially no reference is found in the cache.
In that case, the hit ratio becomes
Hit ratio = (number of references served by the cache) / (total number of data references)
i.e., Hit ratio = (2K - 2K)/2K ≈ 0.
A lower hit ratio results in higher memory latency and lower system
performance.
The above examples illustrate that the performance of a memory bound program
is critically impacted by the cache hit ratio.
16. How can the locality of reference influence the performance gain of
a processor?
Locality of reference (also known as the principle of locality) is the phenomenon
of the same or related memory locations being frequently accessed.
Two types of locality of reference have been observed:
temporal locality and
spatial locality
Temporal locality is the tendency for a program to reference the same memory
location or a cluster several times during brief intervals of time. Temporal
locality is exhibited by program loops, subroutines, stacks and variables used for
counting and totalling.
Spatial locality is the tendency for program to reference clustered locations in
preference to randomly distributed locations. Spatial locality suggests that once
a location is referenced, it is highly likely that nearby locations will be referenced
in the near future. Spatial locality is exhibited by array traversals, sequential code
execution, the tendency to reference stack locations in the vicinity of the stack
pointer, etc.
Locality of reference is one type of predictable program behaviour and the
programs that exhibit strong locality of reference are great candidates for
performance optimization through the use of techniques such as the cache and
instruction prefetch technology that can improve the memory bandwidth and can
hide the memory latency.
Effect of Locality of Reference in Hiding Memory Latency Using Cache:
Both the spatial and temporal locality of reference exhibited by a program can
improve the cache hit ratio, which results in hiding memory latency and hence
improved system performance.
To illustrate how the locality of reference can improve the system performance
by hiding memory latency through the use of cache, consider the following
example:
Consider a processor operating at 1 GHz (1/10^9 s = 10^-9 s = 1 ns clock) connected
to a DRAM with a latency of 100 ns (no caches). Assume that the size of the memory
block is 1 word per block. Also assume that the processor has two multiply-add
units and is capable of executing four instructions in each cycle of 1 ns. The peak
processor rating is therefore 4 GFLOPS (10^9 clock cycles × 4 FLOPS per clock
cycle = 4×10^9 FLOPS = 4 GFLOPS). Since the memory latency is equal to 100 cycles and
block size is one word, every time a memory request is made, the processor must
wait 100 cycles before it can process the data. That is, the peak speed processor
is limited to one floating point operation in every 100 ns, or a speed of 10
MFLOPS, a very small fraction of the peak processor rating.
The performance of the above processor can be improved at least 30 fold by
incorporating a cache memory of size 32 KB, as illustrated below:
Assume that the size of the memory block is 1 word per block and that a cache
memory of size 32 KB with a latency of 1 ns is available. Assume that this setup
is used to multiply two matrices A and B of dimensions 32×32. Fetching the two
matrices into the cache from memory corresponds to fetching 2K words (one
matrix = 32×32 words = 2^5 × 2^5 = 2^10 = 1K words, i.e., 2 matrices = 2K words),
which takes approximately 200 µs (memory latency = 100 ns; memory latency for
2K words = 2×10^3 × 100 ns = 200,000 ns = 200 µs). Multiplying two n×n matrices
takes 2n^3 operations. For our problem, this corresponds to 64K operations
(2×32^3 = 2×(2^5)^3 = 2^16 = 64K), which can be performed in 16K cycles (or 16 µs)
at four instructions per cycle (64K/4 = 16K cycles = 16,000 ns = 16 µs). The
total time for the computation is therefore approximately the sum of the time for
load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
This corresponds to a peak computation rate of 64K/216 µs, or about 303 MFLOPS
(roughly a 30-fold improvement).
Effect of Locality of Reference in Improving the Memory Bandwidth:
The locality of reference also has an effect on improving the memory bandwidth
by allowing the blocks of larger size to be brought into the memory, as illustrated
in the following example:
Consider again a memory system with a single cycle cache and 100 cycle latency
DRAM with the processor operating at 1 GHz. If the block size is one word, the
processor takes 100 cycles to fetch each word. If we use this setup to find the dot-
product of two vectors, for each pair of words, the dot-product performs one
multiply-add, i.e., two FLOPs. Therefore, the algorithm performs one FLOP
every 100 cycles for a peak speed of 10 MFLOPS. Now let us consider what
happens if the block size is increased to four words, i.e., the processor can fetch a
four-word cache line every 100 cycles. Assuming that the vectors are laid out
linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200
cycles. This is because a single memory access fetches four consecutive words in
the vector. Therefore, two accesses can fetch four elements of each of the vectors.
This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS. Note
that increasing the block size from one to four words did not change the latency
of the memory system. However, it increased the bandwidth four-fold. In this
case, the increased bandwidth of the memory system enabled us to accelerate
system performance.
Effect of Locality of Reference in Hiding Memory Latency Using
Prefetching:
Prefetching is the process of bringing data or instructions from memory into the
cache, in anticipation, before they are needed. Prefetching is found to be
successful with programs exhibiting good locality of reference, and it results in
increased system performance by hiding the memory latency.
The following example illustrates how prefetching can hide memory latency:
Consider the problem of adding two vectors a and b using a single for loop. In the
first iteration of the loop, the processor requests a[0] and b[0]. Since these are
not in the cache, the processor must pay the memory latency. While these
requests are being serviced, the processor also requests the subsequent elements
a[1] and b[1], a[2] and b[2], etc., and puts them in the cache in advance. Assuming
that each request is generated in one cycle (1 ns) and memory requests are
satisfied in 100 ns, after 100 such requests the first set of data items is returned
by the memory system. Subsequently, one pair of vector components will be
returned every cycle. In this way, in each subsequent cycle, one addition can be
performed and processor cycles are not wasted (results in latency hiding).
17. Explain the terms pre-fetching and multithreading. How can pre-
fetching and multithreading result in processor performance gain.
The scope for achieving the effective performance of a program on a computer
has traditionally been limited by two memory related performance factors --
latency and bandwidth. Several techniques have been proposed for handling this
problem, including cache memory, pre-fetching and multithreading. Here, the
latter two approaches are discussed.
Prefetching
Prefetching is the process of bringing data or instructions from memory into the
cache, in anticipation, before they are actually needed. Prefetching works
well with programs exhibiting good locality of reference, thereby hiding the
memory latency.
In a typical program, a data item is loaded and used by a processor in a small
time window. If the load results in a cache miss, then the program stalls. A
simple solution to this problem is to advance the load operation so that even if
there is a cache miss, the data is likely to have arrived by the time it is used.
To illustrate the effect of prefetching on memory latency hiding, consider the
problem of adding two vectors a and b using a single for loop. In the first
iteration of the loop, the processor requests a[0] and b[0]. Since these are not in
the cache, the processor must pay the memory latency. While these requests are
being serviced, the processor also requests the subsequent elements a[1] and
b[1], a[2] and b[2], etc., and puts them in the cache in advance. Assuming that each
request is generated in one cycle (1 ns) and memory requests are satisfied in 100
ns, after 100 such requests the first set of data items is returned by the memory
system. Subsequently, one pair of vector components will be returned every
cycle. In this way, in each subsequent cycle, one addition can be performed and
processor cycles are not wasted (results in latency hiding).
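A hedged C sketch of this idea is given below. The __builtin_prefetch call is a GCC/Clang compiler builtin and is used here only to make the advanced load visible in the source; many processors issue such prefetches automatically in hardware, and the prefetch distance of 16 elements is purely illustrative.

/* c[i] = a[i] + b[i], requesting data for a later iteration early so that
   its memory latency overlaps with the additions being done now.        */
void vector_add(const double *a, const double *b, double *c, int n)
{
    const int dist = 16;                         /* illustrative prefetch distance */
    for (int i = 0; i < n; i++) {
        if (i + dist < n) {
            __builtin_prefetch(&a[i + dist], 0, 1);   /* 0 = read access           */
            __builtin_prefetch(&b[i + dist], 0, 1);
        }
        c[i] = a[i] + b[i];
    }
}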
Multithreading
Multithreading is the ability of an operating system to execute different parts of
a program by maintaining multiple threads of execution at a time. The different
threads of control of a program can be executed concurrently with other threads of
the same program. Since multiple threads of the same program are concurrently
available, execution control can be switched between processor-resident threads
on cache misses. The programmer must carefully design the program in such a
way that all the threads can run at the same time without interfering with each
other.
To illustrate the effect of multithreading on hiding memory latency, consider the
following code segment for multiplying an n×n matrix a by a vector b to get the
vector c:
for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);
This code computes each element of c as the dot product of the corresponding row
of a with the vector b. Notice that each dot product is independent of the others,
and therefore represents a concurrent unit of execution. We can safely rewrite
the above code segment to explicitly specify each instance of the dot-product
computation as a thread:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
Now, consider the execution of each instance of the function dot_product. The
first instance of this function accesses a pair of vector elements and waits for
them. In the meantime, the second instance of this function can access two other
vector elements in the next cycle, and so on. After l units of time, where l is the
latency of the memory system, the first function instance gets the requested data
from memory and can perform the required computation. In the next cycle, the
data items for the next function instance arrive, and so on. In this way, in every
clock cycle, we can perform a computation. This hides the memory latency.
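The create_thread call above is pseudocode. A hedged sketch of the same idea with POSIX threads is shown below; creating one thread per row is done purely to mirror the pseudocode, and a real program would normally use a small pool of threads instead.

#include <pthread.h>

#define N 512
static double a[N][N], b[N], c[N];

/* Each thread computes one element of c as the dot product of one row of a
   with b. While one thread stalls on a memory access, another can run.    */
static void *row_dot_product(void *arg)
{
    long i = (long)arg;                 /* row index passed as the argument */
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += a[i][k] * b[k];
    c[i] = sum;
    return NULL;
}

void matvec_threaded(void)
{
    pthread_t tid[N];
    for (long i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, row_dot_product, (void *)i);
    for (long i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
}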
18. At this point, multithreaded systems become bandwidth bound
instead of latency bound. Explain the statement
The effective computation rate of many applications is bounded not by the
processing rate of the CPU, but by the rate at which data can be pumped into the
CPU. Such computations are referred to as being memory bound. Since the
realizable peak performance of such an application is always bounded by the
memory system, such applications are referred to as memory bound
applications.
The traditional approaches for hiding the memory latency include use of cache
memory, pre-fetching and multithreading. Contrary to the general belief that
pre-fetching and multithreading with a supporting cache can solve all the
problems related to memory system performance, these techniques are critically
impacted by the memory bandwidth.
To illustrate the impact of bandwidth on multithreaded programs, consider a
computation running on a machine with a 1 GHz clock, 4-word cache line, single
cycle access to the cache, and 100 ns latency to DRAM. Assume that the
computation has a cache hit ratio of 25% at 1 KB and of 90% at 32 KB. Consider
two cases: first, a single threaded execution in which the entire cache is available
to the serial context, and second, a multithreaded execution with 32 threads
where each thread has a cache residency of 1 KB. If the computation makes one
data request in every cycle of 1 ns, then in the first case the bandwidth requirement
to DRAM is one word every 10 ns, since the other words come from the cache (90%
cache hit ratio). This corresponds to a bandwidth of 400 MB/s. In the second case,
the bandwidth requirement to DRAM increases to three words every four cycles
of each thread (25% cache hit ratio). Assuming that all threads exhibit similar
cache behaviour, this corresponds to 0.75 words/ns, or 3 GB/s.
In the above example, while a sustained DRAM bandwidth of 400 MB/s is
reasonable (case I), 3.0 GB/s (case II) is more than most systems currently offer.
At this point, multithreaded systems become bandwidth bound instead of latency
bound, because the bandwidth requirement is now more severe than the memory
latency. It is important to realize that multithreading and prefetching only
address the latency problem and may often exacerbate the bandwidth problem.
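The two bandwidth figures quoted above follow from a one-line calculation per case; a sketch in C with the example's assumed parameters (4-byte words, one data request per 1 ns cycle):

#include <stdio.h>

int main(void)
{
    double word_bytes      = 4.0;   /* 32-bit words assumed in the example  */
    double requests_per_ns = 1.0;   /* one data request every cycle of 1 ns */

    /* Case I: one thread, 90% hit ratio -> 1 word from DRAM per 10 requests. */
    double bw_single = (1.0 - 0.90) * requests_per_ns * word_bytes;   /* GB/s */

    /* Case II: 32 threads, 25% hit ratio each -> 3 words from DRAM
       for every 4 requests, taken across all the threads together.   */
    double bw_multi  = (1.0 - 0.25) * requests_per_ns * word_bytes;   /* GB/s */

    printf("single-threaded DRAM bandwidth: %.1f GB/s\n", bw_single); /* 0.4 = 400 MB/s */
    printf("multithreaded DRAM bandwidth  : %.1f GB/s\n", bw_multi);  /* 3.0 GB/s       */
    return 0;
}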
19. Describe Flynn's Taxonomy of Computers
The most popular taxonomy of computer architecture is Flynn's taxonomy, which
was defined by Flynn in 1966. Flynn's classification model is based on the
concept of a stream of information. Two types of information flow into a
processor: instructions and data. The instruction stream is defined as the
sequence of instructions performed by the processing unit. The data stream is
defined as the data traffic exchanged between the memory and the processing
unit. According to Flynn's classification, a computer's hardware may support a
single instruction stream or multiple instruction streams manipulating a single
data stream or multiple data streams. Hence Flynn's classification results in four
categories:
single-instruction single-data streams (SISD)
single-instruction multiple-data streams (SIMD)
multiple-instruction single-data streams (MISD)
multiple-instruction multiple-data streams (MIMD)
SISD Architecture
The category SISD refers to computers with a single instruction stream and a
single data stream, as illustrated below:

Fig: SISD Architecture
Uniprocessors fall into SISD category. Even though it has only a single CPU
executing a single instruction stream, a modern uniprocessor may still exhibit
some concurrency of execution. For example, superscalar architectures support
the dynamic identification and selection of multiple independent operations that
may be executed simultaneously. Instruction prefetching and pipelined execution
of instructions are other examples of concurrency typically found in modern SISD
computers, though according to Flynn these are examples of concurrency of
processing rather than concurrency of execution.
SIMD Architecture
The category SIMD refers to computers with a single instruction stream but
multiple data streams. The SIMD model of parallel computing consists of two
parts: a front-end control unit and a processor array as shown below.
The processor array is a set of identical synchronized processing elements
capable of simultaneously performing the same operation on different data. The
application program is executed by the front end in the usual serial way, but
issues commands to the processor array to carry out SIMD operations in parallel.
In SIMD architecture, parallelism is exploited by applying simultaneous
operations across large sets of data.
Processor arrays and pipelined vector processors are examples of SIMD
computers.

Figure: SIMD Architecture
A processor array is a parallel computer with a single control unit executing one
instruction stream, as well as multiple subordinate processors capable of
simultaneously performing the same operation on different data elements. A
pipelined vector processor relies upon a very fast clock and one or more pipelined
functional units to execute the same operation on the elements of a dataset.
MISD Architecture
The MISD category is for computers with multiple instruction streams, but only
a single data stream. An MISD computer is a pipeline of multiple independently
executing functional units operating on a single stream of data, forwarding
results from one functional unit to the next. In practice, there is no viable MISD
machine; however, systolic-array computers are sometimes considered as
examples of MISD machines.
MIMD Architecture
The MIMD category is for computers with multiple instruction streams and
multiple data streams. Multiprocessors and multicomputers fit into MIMD
category. MIMD parallel architectures are made of multiple processors and
multiple memory modules connected together via some interconnection network,
as shown in the figure given below.

Fig: MIMD Architecture
MIMD computers fall into two broad categories: shared memory or message
passing. Processors exchange information through their central shared memory
in shared memory systems, and exchange information through their
interconnection network in message passing systems.
20. Explain the terms task parallelism and data parallelism.
There are different approaches for parallel computing, but most of them depend
on the basic idea of partitioning the work to be done among the cores. There are
two widely used approaches: task-parallelism and data-parallelism. In task-
parallelism, we partition the various tasks carried out in solving the problem
among the cores. In data-parallelism, we partition the data used in solving the
problem among the cores, and each core carries out more or less similar
operations on its part of the data.
As an example, suppose that we need to compute n values and add them
together. We know that this can be done with the following serial code:
sum = 0;
for (i = 0; i < n; i++)
{ x = Compute_next_value(...);
sum += x;
}
Now suppose we also have p cores and p is much smaller than n. Then each core
can form a partial sum of approximately n/p values:
my_sum = 0;
my_first_i = ...;
my_last_i = ...;
for (my_i = my_first_i; my_i < my_last_i; my_i++)
{ my_x = Compute_next_value(...);
my_sum += my_x;
}

Here the prefix my_ indicates that each core is using its own, private variables,
and each core can execute this block of code independently of the other cores.
After each core completes execution of this code, its variable my_sum will store
the sum of the values computed by its calls to Compute_next_value.
For example, if there are eight cores, n=24, and the 24 calls to
Compute_next_value return the values:
1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8,
6, 5, 1, 2, 3, 9.
then the values stored in my_sum might be
Core 0 1 2 3 4 5 6 7
my_sum 8 19 7 15 7 13 12 14

When the cores are done computing their values of my_sum, they can form a
global sum by sending their results to a designated master core, which can add
their results:
if (I'm the master core)
{ sum = my_sum;
for each core other than myself
{ receive value from core;
sum += value;
}
}
else
{ send my_sum to the master;
}
In our example, if the master core is core 0, it would add the values
8+19+7+15+7+13+12+14=95.
The first part of the global sum example can be considered an example of data-
parallelism. The data are the values computed by Compute_next_value, and each
core carries out roughly the same operations on its assigned elements: it
computes the required values by calling Compute_next_value and adds them
together. The second part of the global sum example can be considered an
example of task-parallelism. There are two tasks: receiving and adding the cores'
partial sums, which is carried out by the master core, and sending the partial sum
to the master core, which is carried out by the other cores.
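The same global sum can also be written with MPI, which these notes take up later. The sketch below is only illustrative: Compute_next_value is a placeholder for whatever per-element work the application does, and MPI_Reduce replaces the explicit receive-and-add loop carried out by the master core in the pseudocode above.

#include <mpi.h>
#include <stdio.h>

static double Compute_next_value(int i) { return (double)(i % 10); } /* placeholder */

int main(int argc, char *argv[])
{
    int rank, size, n = 24;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Data-parallel part: each process sums its own block of roughly n/size values. */
    double my_sum = 0.0;
    for (int i = rank * n / size; i < (rank + 1) * n / size; i++)
        my_sum += Compute_next_value(i);

    /* Task handled by the master (rank 0): combining the partial sums. */
    double sum = 0.0;
    MPI_Reduce(&my_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}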
Parallel programming is more complex than sequential programming due to the
communication, load balancing and synchronization requirements of a parallel
application.
There are several programming languages and libraries that support parallel
programming. We focus on learning the basics of programming parallel computers
using two extensions of the C language: the Message-Passing Interface (MPI) and
OpenMP. MPI is a library of type definitions, functions, and macros that can be
used in C programs. OpenMP consists of a library and some modifications to the
C compiler.
21. Distinguish between Concurrent, parallel and distributed
computing.
Although there is no complete agreement on the distinction between the terms
parallel, distributed, and concurrent, many authors make the following
distinctions:
In concurrent computing, a program is one in which multiple tasks can be in
progress at any instant
In parallel computing, a program is one in which multiple tasks cooperate
closely to solve a problem.
In distributed computing, a program may need to cooperate with other
programs to solve a problem.
So all parallel and distributed programs are always concurrent but the reverse is
not true always. For example, a program such as a multitasking operating
system is concurrent, even when it is run on a machine with only one core, since
multiple tasks can be in progress at any instant, but multitasking is neither
parallel nor distributed.
There is no clear-cut distinction between parallel and distributed programs, but
a parallel program usually runs multiple tasks simultaneously on cores that are
physically close to each other and that either share the same memory or are
connected by a very high-speed network. On the other hand, distributed
programs tend to be more loosely coupled. The tasks may be executed by
multiple computers that are separated by large distances, and the tasks
themselves are often executed by programs that were created independently.
However, many authors consider the shared-memory programs as parallel and
the distributed-memory programs as distributed.
22. Describe the Parallel Computing Architecture
As per Flynn's taxonomy, parallel computers are either single-instruction
multiple-data streams (SIMD) or multiple-instruction multiple-data streams
(MIMD) computers.
SIMD Architecture
The category SIMD refers to computers with a single instruction stream but
multiple data streams. The SIMD model of parallel computing consists of two
parts: a front-end control unit, and a processor array as shown below.

Figure: SIMD Architecture
The processor array is a set of identical synchronized processing elements
capable of simultaneously performing the same operation on different data.
There are two main configurations that have been used in SIMD machines. In
the first scheme (see figure given below), each processor has its own local
memory. Processors can communicate with each other through the
interconnection network. If the interconnection network does not provide direct
connection between a given pair of processors, then this pair can exchange data
via an intermediate processor.

In the second SIMD scheme (See figure given below), processors and memory
modules communicate with each other via the interconnection network. Two
processors can transfer data between each other via intermediate memory
module(s) or possibly via intermediate processor(s).

In SIMD architecture, parallelism is exploited by applying simultaneous
operations across large sets of data. An instruction is broadcast from the control
unit to the processing units, and each processing unit either applies the
instruction to the current data item, or it is idle.
As an example, consider the addition of two arrays x and y, each with n elements
and we want to add the elements of y to the elements of x:
for (i = 0; i < n; i++)
x[i] += y[i];
Suppose that our SIMD system has n processing elements. Then we could load
x[i] and y[i] into the i-th processing element, have the i-th processing element add
y[i] to x[i], and store the result in x[i]. If the system has m processing elements
and m < n, we can simply execute the additions in blocks of m elements at a
time. For example, if m = 4 and n = 15, we can first add elements 0 to 3, then
elements 4 to 7, then elements 8 to 11, and finally elements 12 to 14. Note that
in the last group of elements in our example (elements 12 to 14) we are only
operating on three elements of x and y, so one of the four processing elements
will be idle.
The requirement that all the processing units execute the same instruction or
are idle can seriously degrade the overall performance of a SIMD system. For
example, suppose we only want to carry out the addition if y[i] is positive:
for (i = 0; i < n; i++)
if (y[i] > 0.0) x[i] += y[i];
In this setting, we must load each element of y into a processing unit and
determine whether it is positive. If y[i] is positive, we can proceed to carry out
the addition. Otherwise, the processing unit storing y[i] will be idle while the
other processing units carry out the addition.
SIMD systems are ideal for parallelizing simple loops that operate on large
arrays of data; this type of parallelism is called data parallelism. Processor arrays and pipelined vector
processors are examples of SIMD computers.
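On an ordinary multicore processor the same kind of data parallelism can be requested with OpenMP, which these notes use later for programming examples. The sketch below is only an illustration of the idea, not a description of a processor array; the simd pragma asks the compiler to apply one instruction to several array elements at a time, and elements whose condition is false contribute nothing, mirroring the idle processing elements described above.

/* Conditionally add y to x using the processor's SIMD (vector) units. */
void conditional_add(double *x, const double *y, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        if (y[i] > 0.0)      /* elements failing the test are left unchanged, */
            x[i] += y[i];    /* just as those processing elements stay idle   */
}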
MIMD Architecture
The MIMD category is for computers with multiple instruction streams and
multiple data streams. Multiprocessors and multicomputers fit into MIMD
category. MIMD parallel architectures are made of multiple processors and
multiple memory modules connected together via some interconnection network,
as shown in the figure given below.

Fig: MIMD Architecture
MIMD computers fall into two broad categories: shared memory or message
passing.
In a shared-memory system (see figure below), a collection of autonomous
processors is connected to a memory system via an interconnection network, and
each processor can access each memory location. In a shared-memory system, the
processors usually communicate implicitly by accessing shared data structures.

Fig: Shared Memory MIMD Architecture
Depending on the interconnection network, shared memory systems can be
classified as: uniform memory access (UMA), nonuniform memory access
(NUMA), and cache-only memory architecture (COMA).
In the UMA system (see figure below), a shared memory is accessible by all
processors through an interconnection network in the same way a single
processor accesses its memory. Therefore, all processors have equal access time
to any memory location. The interconnection network used in the UMA can be a
single bus, multiple buses, a crossbar, or a multiport memory. UMA systems are
usually easier to program, since the programmer doesn't need to worry about
different access times for different memory locations.

Fig: A UMA Multicore System
In the NUMA system (see figure below), each processor has part of the shared
memory attached. The memory has a single address space. Therefore, any
processor could access any memory location directly using its real address.
However, the access time to modules depends on the distance to the processor.
This results in a nonuniform memory access time. A number of architectures are
used to interconnect processors to memory modules in a NUMA. NUMA systems
have the potential to use larger amounts of memory than UMA systems.

Fig: A NUMA Multicore System
Similar to the NUMA, each processor has part of the shared memory in the
COMA. However, in this case the shared memory consists of cache memory. A
COMA system requires that data be migrated to the processor requesting it.
In a Message passing or distributed-memory system (see figure given below),
each processor is paired with its own private memory, and the processor-memory
pairs communicate over an interconnection network. Unlike shared memory
systems, the processors in distributed-memory systems usually communicate
explicitly via send and receive operations that provide access to the memory of
another processor.

Fig: Distributed Memory MIMD Architecture
A node in such a system consists of a processor and its local memory. Nodes are
typically able to store messages in buffers (temporary memory locations where
messages wait until they can be sent or received), and perform send/receive
operations at the same time as processing. Processors do not share a global
memory and each processor has access to its own address space. The processing
units of a message passing system may be connected in a variety of ways ranging
from architecture-specific interconnection structures to geographically dispersed
networks. The message passing approach is scalable to large proportions.
The most widely available distributed-memory systems are called clusters. The
grid provides the infrastructure necessary to turn large networks of
geographically distributed computers into a unified distributed-memory system.
23. Describe different topologies used for interconnecting multiple
processors and memory modules in various parallel architectures
A multiprocessor system consists of multiple processing units connected via some
interconnection network plus the software needed to make the processing units
work together.
There are two major factors used to categorize multiprocessor systems: the
processing units themselves, and the interconnection network that ties them
together.
Interconnection networks provide mechanisms for data transfer between
processing nodes or between processors and memory modules. A number of
communication styles exist for multiprocessing networks. These can be broadly
classified according to the communication model as shared memory (single
address space) versus message passing (multiple address spaces).
Communication in shared memory systems is performed by writing to and
reading from the global memory, while communication in message passing
systems is accomplished via send and receive commands. In both cases, the
interconnection network plays a major role in determining the communication
speed.
Here we are considering the topologies used for interconnecting multiple
processors and memory modules.
An interconnection network could be either static (also known as direct
network) or dynamic (also known as indirect network). Connections in a static
network (e.g., hypercube and mesh topologies) are fixed links, i.e., static
networks form all connections when the system is designed rather than when the
connection is needed. In a static network, messages must be routed along
established links. Dynamic interconnection networks (e.g., bus, crossbar, and
multistage interconnection topologies) establish connections between two or
more nodes on the fly as messages are routed along the links.
Static networks can be further classified according to their interconnection
pattern as one-dimension (1D), two-dimension (2D), or hypercube (HC).
Dynamic networks, on the other hand, can be classified based on interconnection
scheme as bus-based versus switch-based. Bus-based networks can further be
classified as single bus or multiple buses. Switch-based dynamic networks can
be classified according to the structure of the interconnection network as single-
stage (SS), multistage (MS), or crossbar networks. The following figure
illustrates this taxonomy.

Fig: A topology-based taxonomy for interconnection networks
STATIC INTERCONNECTION NETWORKS
Static (fixed) interconnection networks are characterized by having fixed paths
(unidirectional or bidirectional) between processors. Two types of static networks
can be identified. These are
a) Completely connected networks (CCNs) and
b) Limited connection networks (LCNs).
(a) Completely Connected Networks
In a completely connected network (CCN) each node is connected to all other
nodes in the network. Completely connected networks guarantee fast delivery of
messages from any source node to any destination node (only one link has to be
traversed). Since every node is connected to every other node in the network,
routing of messages between nodes becomes a straightforward task.
However, completely connected networks are expensive in terms of the number
of links needed for their construction. This disadvantage becomes more and more
apparent for higher values of N. The number of links in a completely connected
network is given by N(N-1)/2, that is, O(N^2). The delay complexity of CCNs,
measured in terms of the number of links traversed as messages are routed from
any source to any destination is constant, that is, O(1). An example having N = 6
nodes is shown in figure given below. A total of 15 links are required in order to
satisfy the complete interconnectivity of the network.

Figure: Example Completely Connected Network
(b) Limited Connection Networks
Limited connection networks (LCNs) do not provide a direct link from every node
to every other node in the network. Instead, communications between some
nodes have to be routed through other nodes in the network. The length of the
path between nodes, measured in terms of the number of links that have to be
traversed, is expected to be longer compared to the case of CCNs.
The limited interconnectivity in LCNs imposes two additional requirements: (i)
the need for a pattern of interconnection among nodes and (ii) the need for a
mechanism for routing messages around the network until they reach their
destinations.
A number of regular interconnection patterns have evolved over the years for
LCNs. These patterns include:
i. Linear arrays
ii. Ring (loop) networks
iii. Two-dimensional arrays (nearest-neighbour mesh)
iv. Tree networks and
v. Cube networks
Simple examples for these networks are shown below:
(i) Linear Array and Ring Networks

Figure: Limited connected networks (a) a linear array network; (b) a ring
network
In a linear array, each node is connected to its two immediate neighbouring
nodes. The two nodes at the extreme ends of the array are connected to their
single immediate neighbour. If node i needs to communicate with node j, j > i,
then the message from node i has to traverse nodes i+1, i+2, ..., j-1. Similarly,
when node i needs to communicate with node j, where i > j, then the message from
node i has to traverse nodes i-1, i-2, ..., j+1. In the worst possible case, when
node 1 has to send a message to node N, the message has to traverse a total of
N-1 nodes before it can reach its destination. Therefore, although linear arrays
are simple in their architecture and have simple routing mechanisms, they tend
to be slow. This is particularly true when the number of nodes N is large. The
network complexity of the linear array is O(N) and its time complexity is O(N).
If the two nodes at the extreme ends of a linear array network are connected,
then the resultant network has ring (loop) architecture.
(ii) Tree Network
In a tree (e.g., binary tree) network, if a node at level i (assuming that the root
node is at level 0) needs to communicate with a node at level j, where i > j and the
destination node belongs to the same root's child subtree, then it will have to
send its message up the tree, traversing nodes at levels i-1, i-2, . . . , j+1, until it
reaches the destination node. If a node at level i needs to communicate with
another node at the same level i (or with node at level j = i where the destination
node belongs to a different root's child subtree), it will have to send its message
up the tree until the message reaches the root node at level 0. The message will
then have to be sent down from the root node until it reaches its destination.
The number of nodes (processors) in a binary tree system having k levels is given
by 2^k - 1, and the depth of a binary tree system is log2 N, where N is the number of
nodes (processors) in the network. Therefore, the network complexity is O(2^k)
and the time complexity is O(log2 N).

Figure: Limited Connected Tree Networks
(iii) Cube-Connected Networks
Cube-connected networks are patterned after the n-cube structure. An n-cube
(hypercube of order n) is defined as an undirected graph having 2^n vertices
labelled 0 to 2^n - 1 such that there is an edge between a given pair of vertices if
and only if the binary representation of their addresses differs by one and only
one bit. Different Hypercube Networks are shown below.

Figure: Different Hypercube Networks
In a cube-based multiprocessor system, processing elements are positioned at the
vertices of the graph. Edges of the graph represent the point-to-point
communication links between processors. As can be seen from the figure, each
processor in a 4-cube is connected to four other processors. In an n-cube, each
processor has communication links to n other processors. Recall that in a
hypercube, there is an edge between a given pair of nodes if and only if the
binary representation of their addresses differs by one and only one bit. This
property allows for a simple message routing mechanism. The route of a message
originating at node i and destined for node j can be found by XOR-ing the binary
address representation of i and j. If the XOR-ing operation results in a 1 in a
given bit position, then the message has to be sent along the link that spans the
corresponding dimension.

For example, if a message is sent from source (S) node 0101 to destination (D)
node 1011, then the XOR operation results in 1110. That will mean that the
message will be sent only along dimensions 2, 3, and 4 (counting from right to
left) in order to arrive at the destination. The order in which the message
traverses the three dimensions is not important. Once the message traverses the
three dimensions in any order it will reach its destination. The three possible
disjoint routes that can be taken by the message in this example are shown in
bold in figure given above.
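A small C sketch of this routing rule is given below; it XORs the source and destination addresses and reports the dimensions that must be crossed (the order in which they are crossed does not matter, as noted above). The 4-cube example 0101 -> 1011 is used as the test case.

#include <stdio.h>

/* Print the dimensions a message must traverse in an n-cube when going
   from node src to node dst: exactly the bit positions where they differ. */
void hypercube_route(unsigned src, unsigned dst, int n)
{
    unsigned diff = src ^ dst;            /* 1-bits mark dimensions to cross */
    for (int d = 0; d < n; d++)
        if (diff & (1u << d))
            printf("traverse dimension %d\n", d + 1);  /* counted from the right */
}

int main(void)
{
    hypercube_route(0x5 /* 0101 */, 0xB /* 1011 */, 4);  /* dimensions 2, 3, 4 */
    return 0;
}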
In an n-cube, each node has a degree n. The degree of a node is defined as the
number of links incident on the node. One of the desirable features of hypercube
networks is the recursive nature of their constructions. An n-cube can be
constructed from two subcubes each having an (n-1) degree by connecting nodes
of similar addresses in both subcubes. Notice that the 4-cube shown above is
constructed from two subcubes each of degree three.
(iv) Mesh-Connected Networks
An n-dimensional mesh can be defined as an interconnection structure that has
K0 × K1 × ... × Kn-1 nodes, where n is the number of dimensions of the network and
Ki is the radix of dimension i. A mesh architecture with wraparound connections
forms a torus. The following figure shows different mesh networks. In a 3D mesh
network, a node whose position is (i, j, k) is connected to its neighbours at
dimensions i ± 1, j ± 1, and k ± 1.

Fig: Different Mesh Networks: (a) 2-D mesh with no wraparound; (b) 2-D mesh
with wraparound links (2-D torus); and (c) a 3-D 3×3×2 mesh with no wraparound.
A number of routing mechanisms have been used to route messages around
meshes. One such routing mechanism is known as the dimension-ordering
routing. Using this technique, a message is routed in one given dimension at a
time, arriving at the proper coordinate in each dimension before proceeding to
the next dimension. Consider, for example, a 3D mesh. Since each node is
represented by its position (i, j, k), then messages are first sent along the i
dimension, then along the j dimension, and finally along the k dimension. At
most two turns will be allowed and these turns will be from i to j and then from j
to k. In the 3D mesh shown above, we show the route of a message sent from
node S at position (0, 0, 0) to node D at position (2, 1, 1). Other routing
mechanisms like dimension reversal routing, the turn model routing, and
node labelling routing, etc., are also used in mesh networks.
BUS-BASED DYNAMIC INTERCONNECTION NETWORKS
Bus-based networks can be classified as (a) single bus or (b) multiple buses.
(a) Single Bus Systems
A single bus is considered the simplest way to connect multiprocessor systems.
The following figure shows an illustration of a single bus system.

Fig: A single Bus System
In its general form, such a system consists of N processors, each having its own
cache, connected by a shared bus. The use of local caches reduces the processor
memory traffic. All processors communicate with a single shared memory. The
typical size of such a system varies between 2 and 50 processors. The actual size
is determined by the traffic per processor and the bus bandwidth (defined as the
maximum rate at which the bus can propagate data once transmission has
started).
The single bus network complexity, measured in terms of the number of buses
used, is O(1), while the time complexity, measured in terms of the amount of
input to output delay is O(N).
Although simple and easy to expand, single bus multiprocessors are inherently
limited by the bandwidth of the bus. Since only one processor can access the bus,
only one memory access can take place at any given time.
(b) Multiple Bus Systems
The use of multiple buses to connect multiple processors is a natural extension to
the single shared bus system. A multiple bus multiprocessor system uses several
parallel buses to interconnect multiple processors and multiple memory modules.
A number of connection schemes are possible in this case. Among the
possibilities are
multiple bus with full bus-memory connection (MBFBMC),
multiple bus with single bus memory connection (MBSBMC),
multiple bus with partial bus memory connection (MBPBMC), and
multiple bus with class-based memory connection (MBCBMC).
Illustrations of these connection schemes for the case of N = 6 processors, M = 4
memory modules, and B = 4 buses are shown below:

Figure: Multiple Bus with Full Bus-Memory Connection (MBFBMC)
Figure: Multiple Bus with Single Bus-Memory Connection (MBSBMC)


Figure: Multiple Bus with Partial Bus-Memory Connection (MBPBMC)

Figure: Multiple Bus with Class-Based Memory Connection (MBCBMC).
The multiple bus organization has the advantage that it is highly reliable and
supports incremental growth. A single bus failure will leave (B - 1) distinct fault-
free paths between the processors and the memory modules. On the other hand,
when the number of buses is less than the number of memory modules (or the
number of processors), bus contention is expected to increase.
Bus Synchronization
A bus can be either synchronous or asynchronous. The time for any
transaction over a synchronous bus is known in advance. In accepting and/or
generating information over the bus, devices take the transaction time into
account. Asynchronous bus, on the other hand, depends on the availability of
data and the readiness of devices to initiate bus transactions.
In a single bus system, bus arbitration is required in order to resolve the bus
contention that takes place when more than one processor competes to access the
bus. In this case, processors that want to use the bus submit their requests to
bus arbitration logic. The arbitration logic, using certain priority schemes like
random priority, simple rotating priority, equal priority, and least recently used
(LRU) priority, decides which processor will be granted access to the bus during
a certain time interval (bus master). The process of passing bus mastership from
one processor to another is called handshaking and requires the use of two
control signals: bus request and bus grant. The first indicates that a given
processor is requesting mastership of the bus, while the second indicates that
bus mastership is granted. A third signal, called bus busy, is usually used to
indicate whether or not the bus is currently being used. The following figure
illustrates such a system.

Fig: Bus Handshaking Mechanism
SWITCH-BASED INTERCONNECTION NETWORKS
In this type of network, connections among processors and memory modules are
made using simple switches. Three basic interconnection topologies exist:
a) Crossbar,
b) Single-stage, and
c) Multistage
(a) Crossbar Networks
A simple way to connect N processors to M memory modules is to use a crossbar
network. A crossbar network employs a grid of switches or switching elements as
shown below. The lines are bidirectional communication links, the squares are
cores or memory modules, and the circles are switches.

Fig: (a) A crossbar switch connecting four processors (Pi) and four memory
modules (Mj); (b) configuration of internal switches in a crossbar; (c)
simultaneous memory accesses by the processors.
The individual switches can assume one of the two configurations shown in
figure (b).
The crossbar network is a non-blocking network in the sense that the connection
of a processing node to a memory module does not block the connection of any
other processing nodes to other memory module. For example, Figure (c) shows
the configuration of the switches if P1 writes to M4, P2 reads from M3, P3 reads
from M1, and P4 writes to M2.
Crossbars allow simultaneous communication among different devices, so they
are much faster than buses. However, the cost of the switches and links is
relatively high. A small bus-based system will be much less expensive than a
crossbar-based system of the same size.
(b) Single-Stage Networks (Recirculating Networks)
In this case, a single stage of switching elements (SEs) exists between the inputs
and the outputs of the network.

The simplest switching element that can be used is the 2×2 switching element.
The following figure illustrates the four possible settings that an SE can assume.
These settings are called straight, exchange, upper-broadcast, and lower-
broadcast. In the straight setting, the upper input is transferred to the upper
output and the lower input is transferred to the lower output. In the exchange
setting the upper input is transferred to the lower output and the lower input is
transferred to the upper output. In the upper-broadcast setting the upper input
is broadcast to both the upper and the lower outputs. In the lower-broadcast the
lower input is broadcast to both the upper and the lower outputs.
Each input may connect to one of a set of possible outputs, depending on the
physical hardware network connectivity. If the network does not have full
connectivity, to establish communication between a given input (source) to a
given output (destination), data has to be circulated a number of times around
the network. A well-known connection pattern for interconnecting the inputs and
the outputs of a single-stage network is the Shuffle-Exchange. Two operations
are used. These can be defined using an m-bit address pattern of the inputs,
p_{m-1} p_{m-2} ... p_1 p_0, as follows:
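In the standard formulation (which is consistent with the worked example below),
the shuffle S performs a cyclic left shift of the address bits and the exchange E
complements the least significant bit:

S(p_{m-1} p_{m-2} ... p_1 p_0) = p_{m-2} p_{m-3} ... p_1 p_0 p_{m-1}
E(p_{m-1} p_{m-2} ... p_1 p_0) = p_{m-1} p_{m-2} ... p_1 p_0'

where p_0' denotes the complement of p_0.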

Fig: The different settings of the 2 × 2 SE.

With shuffle (S) and exchange (E) operations, data is circulated from input to
output until it reaches its destination.
For example, in an 8-input single-stage Shuffle-Exchange network, if the source is
0 (000) and the destination is 6 (110), then the following is the required sequence
of Shuffle/Exchange operations and circulation of data:
E(000) → 001 (1), S(001) → 010 (2), E(010) → 011 (3), S(011) → 110 (6)
In addition to the shuffle and the exchange functions, there exist a number of
other interconnection patterns that are used in forming the interconnections
among stages in interconnection networks. Among these are the Cube and the
Plus-Minus 2^i (PM2I) networks.
The Cube Network: The interconnection pattern used in the cube network is
defined as follows:
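In the standard definition (which matches the values computed below), the cube
function C_i complements the i-th bit of the address:

C_i(p_{m-1} ... p_{i+1} p_i p_{i-1} ... p_0) = p_{m-1} ... p_{i+1} p_i' p_{i-1} ... p_0,   0 ≤ i ≤ m-1

where p_i' denotes the complement of p_i.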

Consider a 3-bit address (N = 8), then we have C2(6) = 2, C1(7) = 5 and C0(4) = 5.
The following figure shows the cube interconnection patterns for a network with
N = 8.

Fig: The cube network for N = 8 (a) C0; (b) C1; and (c) C2
The Plus-Minus 2^i (PM2I) Network: The PM2I network consists of 2k
interconnection functions defined as follows:
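With N = 2^k nodes, the 2k functions are the standard plus/minus-2^i mappings
(consistent with the example below):

PM2+i(j) = (j + 2^i) mod N
PM2-i(j) = (j - 2^i) mod N,    0 ≤ i ≤ k-1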

For example, consider the case N = 8: PM2+1(4) = (4 + 2^1) mod 8 = 6. The following
figure shows the PM2I for N = 8.

Fig: The PM2I network for N = 8: (a) PM2+0; (b) PM2+1; and (c) PM2+2.
The Butterfly Function: The interconnection pattern used in the butterfly
network is defined as follows:
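In the standard definition (which the mapping listed below confirms), the
butterfly function swaps the most significant and least significant address bits:

B(p_{m-1} p_{m-2} ... p_1 p_0) = p_0 p_{m-2} ... p_1 p_{m-1}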

Consider a 3-bit address (N = 8), the following is the butterfly mapping:
B(000) = 000, B(001) = 100, B(010) = 010, B(011) = 110, B(100) = 001, B(101) =
101, B(110) = 011, B(111) = 111.
(c) Multistage Interconnection Networks
The crossbar interconnection network is scalable in terms of performance but
unscalable in terms of cost. Conversely, the shared bus network is scalable in
terms of cost but unscalable in terms of performance. An intermediate class of
networks, called multistage interconnection networks, lies between these two
extremes. It is more scalable than the bus in terms of performance and more
scalable than the crossbar in terms of cost.
Multistage interconnection networks (MINs) were introduced as a means to
improve some of the limitations of the single bus system while keeping the cost
within an affordable limit. The most undesirable single-bus limitation that MINs
are set to improve is the availability of only one path between the processors and
the memory modules; MINs instead provide a number of simultaneous paths
between the processors and the memory modules. As shown in the figure given
below, a general MIN consists of a number of stages, each consisting of a set of
2 × 2 switching elements. Stages are connected to each other using an Inter-stage
Connection (ISC) pattern. These patterns may follow any of the routing functions
such as Shuffle-Exchange, Butterfly, Cube, and so on.

Fig: Multistage Interconnection Network
Figure given below shows an example of an 8 × 8 MIN that uses 2 × 2 SEs. This
network is known as the Shuffle-Exchange network (SEN).

Fig: An Example 8 × 8 Shuffle-Exchange Network (SEN).
The settings of the SEs in the figure illustrate how a number of paths can be
established simultaneously in the network. For example, the figure shows how
three simultaneous paths connecting the three pairs of input/output 000/101,
101/011, and 110/010 can be established. Note that the interconnection pattern
among stages follows the shuffle operation.
Each bit in the destination address can be used to route the message through
one stage. The destination address bits are scanned from left to right and the
stages are traversed from left to right. The first (most significant) bit is used to
control the routing in the first stage; the next bit is used to control the routing in
the next stage, and so on. The convention used in routing messages is that if the
bit in the destination address controlling the routing in a given stage is 0, then
the message is routed to the upper output of the switch; if the bit is 1, the
message is routed to the lower output of the switch. Consider, for example, the
routing of a message from source input 101 to destination output 011 in the
8 × 8 SEN shown above. Since the first bit of the destination address is 0, the
message is first routed to the upper output of the switch in the first (leftmost)
stage. The next bit in the destination address is 1, so the message is routed to
the lower output of the switch in the middle stage. Finally, the last bit is 1,
causing the message to be routed to the lower output of the switch in the last
stage. This sequence causes the message to arrive at the correct output (see
figure above). Ease of message routing in MINs is one of the most desirable
features of these networks.
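The destination-tag routing rule just described is easy to express in code. The
following C sketch (illustrative only; the function name and the printed output
format are not from the notes) reports, for a MIN built from 2 × 2 switches, which
switch output is taken at each stage:

#include <stdio.h>

/* Destination-tag routing in a 2^n-input Shuffle-Exchange MIN:
 * at stage s (0 = leftmost), bit (n-1-s) of the destination address
 * selects the upper (0) or lower (1) output of the 2x2 switch.     */
void route(int n_stages, int src, int dst)
{
    printf("Routing %d -> %d:\n", src, dst);
    for (int s = 0; s < n_stages; s++) {
        int bit = (dst >> (n_stages - 1 - s)) & 1;
        printf("  stage %d: take %s output\n", s, bit ? "lower" : "upper");
    }
}

int main(void)
{
    route(3, 5, 3);   /* source 101, destination 011: upper, lower, lower */
    return 0;
}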
Besides the Shuffle-Exchange MIN, a number of other MINs exist; among these,
the Banyan and Omega networks are well known.
The Banyan (Butterfly) Network
Figure given below shows an example of an 8 × 8 Banyan network.

Fig: An 8 × 8 Banyan Network
The Omega Network
The Omega Network represents another well-known type of MIN. A size-N
Omega network consists of n = log2 N single-stage Shuffle-Exchange networks
(stages). Each stage consists of a column of N/2 two-input switching elements
whose inputs follow a shuffle connection. The following figure illustrates the
case of an N = 8 Omega network. As can be seen from the figure, the inputs to
each stage follow the shuffle interconnection pattern. Notice that the
connections are identical to those used in the 8 × 8 Shuffle-Exchange network
(SEN).

Fig: The Omega Network for N = 8
24. Describe the issues involved in writing Software for parallel
systems
Parallel hardware has arrived. With the rapid advancements in technology,
SIMD and MIMD computers are commonplace today, and virtually all desktop
and server systems use multicore processors. The same cannot be said for
parallel software. Except for operating systems, database systems and Web
servers, application programs that make extensive use of parallel hardware are
rare.
Contrary to the general belief, the development of parallel software is no longer
as time- and effort-intensive as it once was, because of the advent of parallel
architectures, parallel algorithms, standardized parallel programming
environments and software development toolkits.
Techniques for programming parallel processing systems can be classified into
three categories:
Parallelizing compilers
Parallel language constructs
Parallel programming languages
Parallelizing compilers are widely used in conjunction with sequential
programming languages, such as FORTRAN and C, usually for scientific
computation. This approach is useful in that existing sequential software can be
adapted to a parallel programming environment with minor modifications.
However, parallelizing compilers can only detect parallelism associated with
iterations over common data structures, such as arrays and matrices, and
require extensive dependency analysis. Thus, it is not appropriate for developing
large-scale software for parallel processing systems.
The parallel language constructs approach is to extend the existing sequential
programming languages with parallel constructs. This approach requires
programmers to explicitly specify the communication and synchronization among
parallel processes. Considering the fact that many errors in parallel software
stem from incorrect synchronization and communication, this approach may
increase the software development effort.
Parallel programming languages are based on different paradigms such as the
functional paradigm (e.g., Concurrent Haskell), the logic paradigm (e.g., Parlog),
the object-oriented paradigm (e.g., Java, C*, Smalltalk), communication-based
coordination models (e.g., MPI), shared-memory models (e.g., OpenMP), etc. The
underlying computation models of these programming languages are
fundamentally different from those of imperative programming languages in
that parallelism is mostly implicit and massive parallelism can be obtained.
The development of parallel software presents a unique set of problems that do
not arise in the development of conventional sequential software. Principal
among these problems is the influence of the target parallel architecture on the
software development process. Basic steps in the software development process
such as software design and performance prediction exhibit great sensitivity to
the target parallel architecture. Consequently, the development of parallel
software is typically carried out in an architecture-dependent manner, with a
fixed target architecture.
Here we are considering some of the issues involved in writing software for
homogeneous MIMD parallel systems.
1. Coordinating the processes/threads
Unless the computing task is embarrassingly parallel (programs that can be
parallelized by simply dividing the work among the processes/threads), the
programmer must successfully devise a parallel solution to the computing task
by dividing the computing task among processes/threads. While dividing the
computing task into processes/threads, the following considerations must be
taken into account:
Divide the work among the processes/threads in such a way that
(a) each process/thread gets roughly the same amount of work and therefore
load balancing is achieved to ensure better utilization of all the processing
elements and
(b) the amount of communication required is minimized for reducing the
communication overhead.
Arrange for the processes/threads to synchronize.
Arrange for communication among the processes/threads.
2. Issues Specific to Shared-memory
In shared-memory programs, variables can be shared or private. Shared
variables can be read or written by any thread, and private variables can
ordinarily only be accessed by one thread. Communication among the threads is
usually done through shared variables, so communication is implicit, rather than
explicit.
The following are the specific issues to be considered while designing parallel
software for shared memory systems:
(a) Decision about the use of Static vs Dynamic Threads:
Shared-memory programs use either dynamic or static threads. In the dynamic
paradigm, there is a master thread and at any given instant a collection of
worker threads. The master thread typically waits for work requests (for
example, over a network) and when a new request arrives, it forks a worker
thread, the thread carries out the request, and when the thread completes the
work, it terminates and joins the master thread. This paradigm makes efficient
use of system resources since the resources required by a thread are only being
used while the thread is actually running.
In the static thread paradigm, all of the threads are forked after any needed
setup by the master thread and the threads run until all the work is completed.
After the threads join the master thread, the master thread may do some clean-
up (e.g., free memory) and then it also terminates. In terms of resource usage,
this may be less efficient: if a thread is idle, its resources (e.g., stack, program
counter, and so on) can't be freed. However, forking and joining threads can be
fairly time-consuming operations. So if the necessary resources are available, the
static thread paradigm has the potential for better performance than the
dynamic paradigm.
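As a rough illustration of the static paradigm (a sketch, not taken from the
notes; it assumes POSIX threads and uses a placeholder worker function), all
threads are forked up front and joined only after the work is done:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Placeholder worker; a real program would do its share of the work here. */
void *worker(void *arg)
{
    long rank = (long) arg;
    printf("worker %ld running\n", rank);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    /* Static paradigm: fork all worker threads after any needed setup ... */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *) t);

    /* ... and join them only once all the work is completed. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}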
(b) Issue regarding nondeterminism
In any MIMD system in which the processors execute asynchronously, it is likely
that there will be nondeterminism. A computation is nondeterministic if a given
input can result in different outputs. If multiple threads are executing
independently, the relative rate at which they will complete statements varies
from run to run, and hence the results of the program may be different from run
to run. As a very simple example, suppose we have two threads, one with rank 0
and the other with rank 1. Suppose also that each is storing a private variable
my_x, thread 0's value for my_x is 7, and thread 1's is 19. Further, suppose both
threads execute the following code:
. . .
printf("Thread %d > my val = %d\n", my_rank, my_x);
. . .
Then the output could be
Thread 0 > my val = 7
Thread 1 > my val = 19
but it could also be
Thread 1 > my val = 19
Thread 0 > my val = 7
The point here is that because the threads are executing independently and
interacting with the operating system, the time it takes for one thread to
complete a block of statements varies from execution to execution, so the order in
which these statements complete cannot be predicted.
In many cases nondeterminism is not a problem. In our example, since we have
labelled the output with the thread's rank, the order in which the output appears
probably does not matter. However, there are also many cases in which
nondeterminism, especially in shared-memory programs, can be disastrous,
because it can easily result in program errors. Here is a simple example with two
threads.
Suppose each thread computes an int, which it stores in a private variable
my_val. Suppose also that we want to add the values stored in my_val into a
shared-memory location x that has been initialized to 0. Both threads therefore
want to execute code that looks something like this:
my_val = Compute_val(my_rank);
x += my_val;
Now recall that an addition typically requires loading the two values to be added
into registers, adding the values, and finally storing the result.
Here is one possible sequence of events:
Time | Core 0                         | Core 1
-----+--------------------------------+--------------------------------
  0  | Finish assignment to my_val    | Call to Compute_val
  1  | Load x = 0 into register       | Finish assignment to my_val
  2  | Load my_val = 7 into register  | Load x = 0 into register
  3  | Add my_val = 7 to x            | Load my_val = 19 into register
  4  | Store x = 7                    | Add my_val to x
  5  | Start other work               | Store x = 19

Clearly, this is not the desired result: x should end up as 7 + 19 = 26, but it is
stored as 19. The nondeterminism here is a result of the fact that two threads are
attempting to update the memory location x more or less simultaneously. When
threads or processes attempt to simultaneously access a shared resource, and at
least one of the accesses is an update, the accesses can result in an error; this
situation is called a race condition, because the threads or processes are in a race
and the outcome of the computation depends on which thread wins the race. In
our example, the
completes x += my_val before the other thread starts, the result will be
incorrect. A block of code that can only be executed by one thread at a time is
called a critical section, and the programmer must ensure mutually exclusive
access to the critical section. That is, programmer needs to ensure that if one
thread is executing the code in the critical section, then the other threads are
excluded.
The most commonly used mechanism for ensuring mutual exclusion is a mutual
exclusion lock, or mutex, or simply a lock. A mutex is a special type of object
that has support in the underlying hardware. The basic idea is that each critical
section is protected
by a lock. Before a thread can execute the code in the critical section, it must
obtain the mutex lock by calling a mutex function, and, when it is done executing
the code in the critical section, it should relinquish the mutex by calling an
unlock function. While one thread owns the lock any other thread attempting to
execute the code in the critical section will wait in its call to the lock function.
Thus, the code segment in the above example must be modified as given below
for ensuring that the code functions correctly:
my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
This ensures that only one thread at a time can execute the statement x +=
my_val. Note that the code does not impose any predetermined order on the
threads. Either thread 0 or thread 1 can execute x += my_val first.
Also note that the use of a mutex enforces serialization of the critical section.
Since only one thread at a time can execute the code in the critical section, this
code is effectively serial. Thus, a parallel program should have as few critical
sections as possible, and the critical sections should be as short as possible.
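With POSIX threads, the generic Lock/Unlock calls above would typically map
onto a pthread_mutex_t, roughly as in the sketch below (an illustration, not part
of the notes; Compute_val simply returns the values 7 and 19 used in the running
example):

#include <pthread.h>
#include <stdio.h>

/* Values from the running example: thread 0 computes 7, thread 1 computes 19. */
int Compute_val(long my_rank) { return my_rank == 0 ? 7 : 19; }

int x = 0;                                   /* shared variable                */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void *Thread_work(void *arg)
{
    long my_rank = (long) arg;
    int  my_val  = Compute_val(my_rank);

    pthread_mutex_lock(&add_my_val_lock);    /* enter the critical section     */
    x += my_val;                             /* only one thread at a time here */
    pthread_mutex_unlock(&add_my_val_lock);  /* leave the critical section     */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long r = 0; r < 2; r++)
        pthread_create(&t[r], NULL, Thread_work, (void *) r);
    for (long r = 0; r < 2; r++)
        pthread_join(t[r], NULL);
    printf("x = %d\n", x);     /* always 26, whichever thread enters the lock first */
    return 0;
}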
There are alternatives to mutexes. In busy-waiting, a thread enters a loop
whose sole purpose is to test a condition. In our example, suppose there is a
shared variable ok_for_1 that has been initialized to false. Then something like
the following code can ensure that thread 1 won't update x until after thread 0
has updated it:
my_val = Compute_val(my_rank);
if (my_rank == 1)
while (!ok_for_1); /* Busy wait loop */
x += my_val; /* Critical section */
if (my_rank == 0)
ok_for_1 = true; /* Let thread 1 update x */
So until thread 0 executes ok_for_1 = true, thread 1 will be stuck in the loop
while (!ok_for_1). This loop is called a busy-wait because the thread can be
very busy waiting for the condition. This has the advantage that it is simple to
understand and implement. However, it can be very wasteful of system
resources, because even when a thread is doing no useful work, the core running
the thread will be repeatedly checking to see if the critical section can be entered.
Semaphores are similar to mutexes, although the details of their behaviour are
slightly different. A monitor provides mutual exclusion at a somewhat higher-
level: it is an object whose methods can only be executed by one thread at a time.
(c) Issue Related to Thread Safety
A second potential problem with shared-memory programs is thread safety. In
many cases parallel programs can call functions developed for use in serial
programs, and there won't be any problems. However, there are some notable
exceptions. The most important exception for C programmers occurs in functions
that make use of static local variables. Recall that a static variable that is
declared in a function persists from one call to the next. Thus, static variables
are effectively shared among any threads that call the function, and this can
have unexpected and unwanted consequences.
For example, the C string library function strtok splits an input string into
substrings. When it is first called, it is passed a string, and on subsequent calls it
returns successive substrings. This can be arranged through the use of a static
char * variable that refers to the string that was passed on the first call. Now
suppose two threads are splitting strings into substrings. Clearly, if, for example,
thread 0 makes its first call to strtok, and then thread 1 makes its first call to
strtok before thread 0 has completed splitting its string, then thread 0's string
will be lost or overwritten, and, on subsequent calls, it may get substrings of
thread 1's strings.
A block of code that functions correctly when it is run by multiple threads is said
to be thread safe. Functions that were written for use in serial programs can
make unknowing use of shared data. This means that if it is used in a
multithreaded program, there may be errors or unexpected results. Such
functions are not thread safe. When a block of code is not thread safe, it is
usually because different threads are accessing shared data. Thus, even though
many serial functions can be used safely in multithreaded programs, that is,
they're thread safe, programmers need to be cautious of functions that were
written exclusively for use in serial programs.
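As an illustration of a thread-safe alternative (not discussed in the notes),
POSIX provides strtok_r, a reentrant variant of strtok that keeps its position in
a caller-supplied pointer instead of a static variable, so each thread can safely
tokenize its own string:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "alpha beta gamma";
    char *saveptr;                       /* per-caller state replaces the
                                            static variable used by strtok */
    char *tok = strtok_r(line, " ", &saveptr);
    while (tok != NULL) {
        printf("token: %s\n", tok);
        tok = strtok_r(NULL, " ", &saveptr);
    }
    return 0;
}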


3. Issues Specific to Distributed-Memory
(a) Issues Related to Message Passing
In distributed-memory programs, the cores can directly access only their own
private memories. The most common API for programming distributed-memory
systems is message-passing. In message-passing, there are (at least) two distinct
functions: a send function and a receive function. When processes need to
communicate, one calls the send and the other calls the receive.
Processes typically identify each other by ranks in the range 0, 1, ... , p-1, where
p is the number of processes. So, for example, process 1 might send a message to
process 0 with the following pseudo-code:
char message[100];
. . .
my_rank = Get_rank();
if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
There are several possibilities for the exact behaviour of the Send and Receive
functions. The simplest behaviour is for the call to Send to block until the call to
Receive starts receiving the data. This means that the process calling Send won't
return from the call until the matching call to Receive has started. Alternatively,
the Send function may copy the contents of the message into storage that it
owns, and then it will return as soon as the data is copied. The most common
behaviour for the Receive function is for the receiving process to block until the
message is received.
Typical message-passing APIs also provide a wide variety of additional functions
such as broadcast function for various collective communications (e.g., a single
process transmits the same data to all the processes) or a reduction function, in
which results computed by the individual processes are combined into a single
result (for example, values computed by the processes are added). MPI also
supports special functions for managing processes and communicating
complicated data structures.
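For concreteness, the generic pseudo-code above might look roughly as follows
when written with MPI (a sketch only; error handling is omitted):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char message[100];
    int  my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}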
Message-passing is a very powerful and versatile API for developing parallel
programs. However, it is also very low level. That is, there is a huge amount of
detail that the programmer needs to manage. For example, in order to parallelize
a serial program, it is usually necessary to rewrite the vast majority of the
program. The data structures in the program may have to either be replicated by
each process or be explicitly distributed among the processes.
(b) Using One-Sided Communication
Distributed-memory systems can also be programmed using one-sided
communications. Recall that in message-passing, one process must call a send
function and the send must be matched by another process call to a receive
function. Any communication requires the explicit participation of two processes.
In one-sided communication, or remote memory access, a single process
calls a function, which updates either local memory with a value from another
process or remote memory with a value from the calling process. This can
simplify communication, since it only requires the active participation of a single
process. Furthermore, it can significantly reduce the cost of communication by
eliminating the overhead associated with synchronizing two processes. It can
also reduce overhead by eliminating the overhead of one of the function calls
(send or receive).
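A minimal sketch of one-sided communication using MPI's remote memory access
interface is given below (an illustration, not from the notes; run with at least
two processes). Process 1 writes directly into a window exposed by process 0, and
process 0 makes no matching receive call:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Each process exposes one int through an RMA window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (my_rank == 1) {
        int my_val = 42;
        /* One-sided: process 1 writes into process 0's window. */
        MPI_Put(&my_val, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (my_rank == 0)
        printf("Process 0 > value written remotely: %d\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}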
(c) Partitioned Global Address Space Languages
Since many programmers find shared-memory programming more attractive
than message-passing or one-sided communication, researchers are now at work
developing parallel programming languages that allow the user to use some
shared-memory techniques for programming distributed-memory hardware. This
is not quite as simple as it sounds. For example, if we simply wrote a compiler
that treated the collective memories in a distributed-memory system as a single
large memory, our programs would have poor performance, since each time a
running process accessed memory, it might access local memory (that is, memory
belonging to the core on which it was executing) or remote memory (memory
belonging to another core). Accessing remote memory can take hundreds or even
thousands of times longer than accessing local memory.




Module 2
Parallel Computing:
Design Methodology &
Analysis Techniques


LEARNING OBJECTIVES
Foster's Design Methodology, Time Complexity (computation and
communication complexities), Speedup, Efficiency, Cost Optimality,
Amdahl's Law, Brent's Scheduling Principle, Simple PRAM
Algorithms: Boolean Operations, Max Finding in O(1) time,
Reduction, Prefix-Sum, etc.



1. Describe Foster's Task/Channel parallel programming model
The task/channel methodology proposed by Foster views a parallel program as a
collection of tasks that communicate by sending messages to each other through
channels. The following figure shows a task/channel representation of a
hypothetical parallel program.

Fig: The task/channel programming model. (a) A task consists of a program,
local memory, and a collection of I/O ports. (b) A parallel computation can be
viewed as a directed graph in which vertices represent tasks and directed
edges represent communication channels.
A task consists of an executable program, together with its local memory and a
collection of I/O ports. The local memory contains program code and private data.
Access to this memory is called a local data access. The only way that a task can
send copies of its local data to other tasks is through its output ports, and
conversely, it can only receive non-local data from other tasks through its input
ports. A channel is a message queue that connects one task's output port to
another task's input port.
A task cannot receive a data value until the task at the other end of the channel
has sent it. A task that tries to receive data from an empty input port is blocked
until a message appears in the queue, but a task that sends data onto an output
port is never blocked, even if previous messages it has sent along the same
channel have not yet been received. That is, reads are synchronous but writes are
asynchronous.
The fact that a task never blocks when sending implies that the I/O model
assumes a queue of unbounded size. There is an inherent, intentional,
asymmetry between local and non-local data accesses. Access to local memory is
considered to be much faster than access to non-local data through a channel.
You should think of a local data access, whether it is a read or a write, as an
access to a locally attached RAM, typically through a dedicated
processor/memory bus.
The lifetime of a parallel program or algorithm in the task/channel model is
defined as the time during which any task is active. A task is active from the
point at which it starts execution until the point where it terminates. Therefore,
the program is active from when the first tasks start until the last task
terminates.
2. Describe Foster's design methodology for designing parallel
algorithms
In 1995, Ian Foster proposed a design methodology for designing parallel
algorithms. It is a four-stage process whose input is the problem statement, as
shown in figure below:

Fig: Foster's Parallel Algorithm Design Methodology
The four stages, with brief descriptions, are
1. Partitioning: The process of dividing the computation and the data into
pieces
2. Communication: The process of determining how tasks will communicate
with each other, distinguishing between local communication and global
communication
3. Agglomeration: The process of grouping tasks into larger tasks to improve
performance or simplify programming
4. Mapping: The process of assigning tasks to physical processors.
PARTITIONING
The purpose of partitioning is to discover as much parallelism as possible. There
are two potential sources of parallelism: data and computation, leading to two
complementary methods of extracting parallelism.
Domain Decomposition: Domain decomposition is a paradigm whose goal is to
decompose the data into many small pieces to which parallel computations may
be applied. These parallel computations will be called primitive tasks. In domain
decomposition, the most parallelism is achieved by identifying the largest and
most frequently accessed data object, decomposing it into as many small,
identical pieces as possible, and assigning a primitive task to each piece.
For example, suppose the largest and most frequently accessed data structure is
a three-dimensional matrix. A very coarse decomposition would be to divide the
matrix into a collection of two-dimensional slices, resulting in a one-dimensional
collection of primitive tasks. A finer decomposition would be to decompose it into
a collection of one-dimensional slices, resulting in a two-dimensional collection of
primitive tasks. The finest decomposition would be to consider each individual
matrix element, resulting in a three-dimensional collection of primitive tasks.

Fig: Three domain decompositions of a three-dimensional matrix, resulting in
markedly different collections of primitive tasks.
At this first stage of the design process, it is usually best to maximize the
number of primitive tasks. Hence the three-dimensional partitioning is
preferred.
Functional Decomposition: In functional decomposition, the focus is on the
computation that is to be performed rather than on the data manipulated by the
computation. Functional decomposition is the paradigm in which the functional
parallelism is identified, primitive tasks are assigned to these separate
functions, and then the data to which these functions can be applied is identified.
This is shown conceptually in figure given below:

Fig: Functional Decomposition, Conceptually
Foster provides a checklist for evaluating the partitioning stage. Your
partitioning should satisfy the following criteria as much as possible:
1. The partition defines at least an order of magnitude more tasks than there
are processors in your target computer. If not, you have little flexibility in
subsequent design stages.
2. Redundant computations and data storage are avoided as much as possible. If
not, the resulting algorithm may not be able to deal with large problems.
3. Primitive tasks are roughly the same size. If not, it may be hard to allocate to
each processor equal amounts of work, and this will cause an overall decrease
in performance.
4. The number of tasks is an increasing function of the problem size. Ideally, an
increase in problem size should increase the number of tasks rather than the
size of individual tasks. If this is not the case, your parallel algorithm may
not be able to solve larger problems when more processors are available.
COMMUNICATION
When the entire computation is one sequential program, all of the data is
available to all parts of the program. When such a sequential computation is
divided up into independent tasks that may execute in separate processors, some
of the data needed by a task may reside in its local memory, but some of it may
reside in that of other tasks. As a result, these tasks may need to exchange data
with one another. This information flow is specified in the communication stage
of the design.
There are two types of communication among the tasks, local and global. Local
communication is when a task needs values from a small number of other tasks.
Global communication is when a great many tasks must contribute data to
perform a computation.
This inter-task communication does not exist in a sequential program; it is due
to the parallelization of the computation and is therefore considered as overhead.
Minimizing parallel overhead and reducing the delays caused by the
communication are important goals of parallel algorithm design.
The following is the Foster's checklist for communication:
1. All tasks perform about the same number of communication operations.
Unbalanced communication reduces the scalability.
2. Each task should communicate only with a small number of neighbours. If
each task must communicate with many other tasks, it will add too much
overhead.
3. The communication operations should be able to proceed concurrently. If not,
your algorithm is likely to be inefficient and non-scalable to a larger problem
instance.
4. Tasks can perform their computations concurrently. If not, your algorithm is
likely to be inefficient and non-scalable to a larger problem instance.
AGGLOMERATION
During the first two steps of the parallel algorithm design process, the focus was
on identifying as much parallelism as possible. The resulting algorithm at this
stage may not be efficient as it is not designed to execute on any particular
parallel computer. For example, if the number of tasks exceeds the number of
processors by several orders of magnitude, then task creation and inter-task
communication result in significant overhead. In the final two steps of the design
process we have a target architecture in mind (e.g., centralized multiprocessor or
multicomputer) and we consider how to combine primitive tasks into larger tasks
and map them onto physical processors to reduce the amount of parallel
overhead.
Agglomeration is the process of grouping tasks into larger tasks in order to
improve performance or simplify programming.
The major objectives of the agglomeration are:
i. Reduce communication overhead. When two tasks that exchange data with
each other are combined into a single task, the data that was exchanged through
a channel becomes internal to the combined task, and that channel, along with
its overhead, is removed. This is called increasing locality (see figure below).

Fig: Agglomerating to Increase Locality.
A second way to reduce communication overhead is to combine groups of tasks
that all send data, and groups of tasks that all receive data, into single tasks.
In other words, suppose task S1 sends to task R1 and S2 sends to R2. If we
combine S1 and S2 into a single task S and R1 and R2 into a single task R,
then communication overhead will be reduced. This is because, before we
agglomerated the tasks there were two messages sent by the senders, and
afterward, one longer message. The cost of sending a message has two
components, the initial start-up time, called the latency, which is
independent of how large the message is, and the transmission time, which is
a function of the number of bytes sent. The transmission time is not reduced,
but we cut the total latency in half. Following figure illustrates this type of
agglomeration.


Fig: Agglomerating to reduce message transmissions. By combining the two tasks
that send, and the two that receive, into a single task each, the number of
transmissions is reduced, decreasing message latency.
ii. Maintain the scalability of the parallel design. We want to ensure that we
have not combined so many tasks that we will not be able to port our program
at some point in the future to a computer with more processors. For example,
suppose we are developing a parallel program that manipulates a three-
dimensional matrix of size 8 × 128 × 256. We plan to execute our program on a
centralized multiprocessor with four CPUs. If we design the parallel
algorithm so that the second and third dimensions are agglomerated, we
could certainly execute the resulting program on four CPUs. Each task would
be responsible for a 2 × 128 × 256 submatrix. Without changing the design, we
could even execute on a system with eight CPUs. Each task would be
responsible for a 1 × 128 × 256 submatrix. However, we could not port the
program to a parallel computer with more than eight CPUs without changing
the design. Hence the decision to agglomerate the second and third
dimensions of the matrix could turn out to be a short-sighted one.
iii. Reduce software engineering costs. If we are parallelizing a sequential program,
agglomeration may allow us to make greater use of the existing sequential
code, reducing the time and expense of developing the parallel program.
One should evaluate how well agglomeration is done by considering each of the
following criteria.
1. Agglomeration should reduce communication costs by increasing locality.
2. If agglomeration has replicated computation, the benefits of this replication
should outweigh the cost of the communications they replace.
3. If agglomeration replicates data, it should not compromise the scalability of
the algorithm.
4. Agglomeration should produce tasks with similar computation and
communication costs.
5. The number of tasks should be an increasing function of the problem size.
6. The number of tasks should be as small as possible, yet at least as great as
the number of processors in the likely target computers.
7. The trade-off between the chosen agglomeration and the cost of modification
to existing sequential code is reasonable.
MAPPING
Mapping, the final stage of Foster's methodology, is the process of assigning each
task to a processor. Of course, this mapping problem does not arise on
uniprocessors or on shared-memory computers whose operating systems provide
automatic task scheduling. Therefore, we assume here that the target
architecture is a distributed-memory parallel computer.
The goals of mapping are to maximize processor utilization and minimize
interprocessor communication. Processor utilization is the average percentage of
time during which the computer's processors are actively executing code
necessary to solve the problem. It is maximized when the computation is
balanced evenly, allowing all processors to begin and end execution at the same
time. The processor utilization drops when one or more processors are idle while
the remainder of the processors are still busy.
Interprocessor communication increases when two tasks connected by a channel
are mapped to different processors. Interprocessor communication decreases
when two tasks connected by a channel are mapped to the same processor. For
example, consider the mapping shown in the figure below. Eight tasks are
mapped onto three processors. The left and right processors are responsible for
two tasks, while the middle processor is responsible for four tasks. If all
processors have the same speed and every task requires the same amount of
time to be performed, then the middle processor will spend twice as much time
executing tasks as the other two processors. If every channel communicates the
same amount of data, then the middle processor will also be responsible for twice
as many interprocessor communications as the other two processors.

Fig: The mapping process. (a) A task/channel graph. (b) Mapping of tasks to
three processors. Some channels now represent intraprocessor communications,
while others represent interprocessor communications
Increasing processor utilization and minimizing interprocessor communication
are often conflicting goals. For example, suppose there are p processors available.
Mapping every task to the same processor reduces interprocessor communication
to zero, but reduces utilization to 1/p. Our goal, then, is to choose a mapping that
represents a reasonable middle point between maximizing utilization and
minimizing communication.
3. Using twice as many hardware resources, one can reasonably expect
a program to run twice as fast. However, in typical parallel
programs, this is rarely the case, due to a variety of overheads
associated with parallelism. What are those overheads?
Parallel overhead is the amount of time required to coordinate parallel tasks, as
opposed to doing useful work. In addition to performing essential computation
(i.e., computation that would be performed by the serial program for solving the
same problem instance), a parallel program may also spend time in
1. Interprocess communication,
2. Idling, and
3. Excess computation (computation not performed by the serial
formulation).
A typical execution profile of a parallel program is illustrated in figure given
below.

Fig: The execution profile of a hypothetical parallel program executing on eight
processing elements. Profile indicates times spent performing computation (both
essential and excess), communication, and idling
Interprocess Communication: Any nontrivial parallel system requires its
processing elements to interact and communicate data (e.g., intermediate
results). The time spent communicating data between processing elements is
usually the most significant source of parallel processing overhead.
Idling: Processing elements in a parallel system may become idle due to many
reasons such as
Load imbalance,
Synchronization, and
Presence of serial components in a program.
In many parallel applications (for example, when task generation is dynamic), it
is impossible (or at least difficult) to predict the size of the subtasks assigned to
various processing elements. Hence, the problem cannot be subdivided statically
among the processing elements while maintaining uniform workload. If different
processing elements have different workloads, some processing elements may be
idle during part of the time that others are working on the problem.
In some parallel programs, processing elements must synchronize at certain
points during parallel program execution. If all processing elements are not
ready for synchronization at the same time, then the ones that are ready sooner
will be idle until all the rest are ready.
Parts of an algorithm may be unparallelizable, allowing only a single processing
element to work on it. While one processing element works on the serial part, all
the other processing elements must wait.
Excess Computation: The fastest known sequential algorithm for a problem
may be difficult or impossible to parallelize, forcing us to use a parallel algorithm
based on a poorer but easily parallelizable (that is, algorithm with a higher
degree of concurrency) sequential algorithm. The difference in computation
performed by the parallel program and the best serial program is the excess
computation overhead incurred by the parallel program.
A parallel algorithm based on the best serial algorithm may still perform more
aggregate computation than the serial algorithm. An example of such a
computation is the Fast Fourier Transform algorithm. In its serial version, the
results of certain computations can be reused. However, in the parallel version,
these results cannot be reused because they are generated by different
processing elements. Therefore, some computations are performed multiple
times on different processing elements.
4. Describe the commonly used performance metrics for parallel
systems.
The commonly used metrics for measuring the performance of the parallel
systems include:
Execution time
Total parallel overhead
Speedup
Efficiency
Cost
Execution Time
The serial runtime (TS) of a program is the time elapsed between the beginning
and the end of its execution on a sequential computer. The parallel runtime (TP)
is the time that elapses from the moment a parallel computation starts to the
moment the last processing element finishes execution.
Total Parallel Overhead
The total parallel overhead (to) incurred by a parallel program is the total time
collectively spent by all the processing elements over and above that required by
the fastest known sequential algorithm for solving the same problem on a single
processing element.
The total time spent in solving a problem, summed over all p processing
elements, is p tp. Of this time, ts units are spent performing useful work, and
the remainder is overhead. Therefore, the overhead function is given by

    to = p tp - ts
Speedup
Speedup (S) is a measure that captures how much performance gain is achieved
by parallelizing a given application over a sequential implementation.
Assume that a given computation can be divided into p equal subtasks,
each of which can be executed by one processor. If ts is the execution time of the
whole task using a single processor, then the time taken by each processor to
execute its subtask is tp = ts/p.
The speedup factor of a parallel system is defined as the ratio of the time taken
by a single processor to solve a given problem instance to the time taken by a
parallel system consisting of p processors to solve the same problem instance:

    S(p) = ts / tp = ts / (ts / p) = p
The above equation indicates that the speedup factor resulting from using p
processors is equal to the number of processors used, p. When this happens, we
say that our parallel program has linear speedup.
In practice, it is unlikely to get linear speedup because the use of multiple
processes/threads introduces some overhead. For example, shared memory
programs will almost always have critical sections, which will require the use of
some mutual exclusion mechanism for ensuring the synchronization. Similarly,
distributed-memory programs will almost always need to transmit data across
the network, which is usually much slower than local memory access. Serial
programs, on the other hand, will not have these overheads. Thus, it will be very
unusual to have a linear speedup. Furthermore, it is likely that the overheads
will increase as the number of processes or threads increases. That is, as p
increases, we expect S to become a smaller and smaller fraction of the linear
speedup p. Another way of saying this is that S/p will probably get smaller and
smaller as p increases.
For a given problem, more than one sequential algorithm may be available, but
all of these may not be equally suitable for parallelization. Given a parallel
algorithm, its performance is measured with respect to the fastest sequential
algorithm for solving the same problem on a single processing element.
The speedup, therefore, can be formally defined as the ratio of the serial runtime
of the best sequential algorithm for solving a problem to the time taken by the
parallel algorithm to solve the same problem on p processing elements. The p
processing elements used by the parallel algorithm are assumed to be identical
to the one used by the sequential algorithm.
For example, consider case of parallelizing bubble sort. Assume that a serial
version of bubble sort of 10^5 records takes 150 seconds and a serial quicksort can
sort the same list in 30 seconds. If a parallel version of bubble sort, also called
odd-even sort, takes 40 seconds on four processing elements, it would appear
that the parallel odd-even sort algorithm results in a speedup of 150/40 or 3.75.
However, this conclusion is misleading, as in reality the parallel algorithm
results in a speedup of 30/40 or 0.75 with respect to the best serial algorithm.
Sometimes the whole of the algorithm cannot be parallelized. Assume that f is
the fraction that can be parallelized and 1 - f is the fraction that cannot be
parallelized. In this case, the speedup can be written as

    S(p) = 1 / ((1 - f) + f/p)
Example:
Consider the process of improving the performance of a program by optimizing
some of the code in the program and running it in an enhanced (i.e., optimized)
form. Suppose that optimized instructions run 20 times faster than sequential.
Derive the speedup equation when x% of the instructions are optimized.
Determine the percentage of code that must be optimized to get a speedup of
2, 5, and 10, respectively.
If 25% of the code cannot be optimized (due to the inherently sequential
nature such as I/O etc.), then what is the maximum speedup you can achieve.
Ans: The speedup S is defined as
S = ts/tp, where ts is non-optimized sequential and tp is optimized (parallel) time.
For each sequential cycle, the optimized time is (1/20 = 0.05). If each sequential
instruction takes 1 cycle, the sequential time for a program with N instructions
is ts = N cycles. The time for a program with a fraction x of its instructions
optimized is N*0.05*x + N*1*(1-x) = N(0.05x + (1-x)).
Therefore the speedup is S = N / [N(0.05x + (1-x))] = 1/(1 - 0.95x).
Use this equation to determine the values of x for S = 2, 5, 10. Rewrite the above
speedup equation to solve for x, the percentage of code optimized, to get
x = (100/95)(1 - 1/S). Next insert the desired values of S to solve for x.
For S = 2, we get x = (100/95)(1 - 1/2) = 52.6%
For S = 5, we get x = (100/95)(1 - 1/5) = 84.2%
For S = 10, we get x = (100/95)(1 - 1/10) = 94.7%
If 25% of the code cannot be optimized, then the maximum value of x is 0.75.
Substituting in the speedup equation gives the maximum speedup
S = 1/(1 - 0.95 × 0.75) ≈ 3.48.
Theoretically, speedup can never exceed the number of processing elements, p. If
the best sequential algorithm takes ts units of time to solve a given problem on a
single processing element, then a speedup of p can be obtained on p processing
elements if none of the processing elements spends more than time ts/p. A
speedup greater than p is possible only if each processing element spends less
than time ts/p solving the problem. This is a contradiction because speedup, by
definition, is computed with respect to the best sequential algorithm. If ts is the
serial runtime of the algorithm, then the problem cannot be solved in less than
time ts on a single processing element.
In practice, a speedup greater than p is sometimes observed (a phenomenon
known as superlinear speedup). This usually happens when the work
performed by a serial algorithm is greater than its parallel formulation or due to
hardware features that put the serial implementation at a disadvantage. For
example, the data for a problem might be too large to fit into the cache of a single
processing element, thereby degrading its performance due to the use of slower
memory elements. But when partitioned among several processing elements, the
individual data-partitions would be small enough to fit into their respective
processing elements' caches.
As an example of superlinear speedup, consider the execution of a parallel
program on a two-processor parallel system each with a clock speed of 1GHz. The
program attempts to solve a problem instance of size W. With this size and
available cache of 64 KB on one processor, the program has a cache hit rate of
80%. Assuming the latency to cache of 2 ns and latency to DRAM of 100 ns, the
effective memory access time is 2 × 0.8 + 100 × 0.2, or 21.6 ns. If the computation
is memory bound and performs one FLOP per memory access, this corresponds to a
processing rate of 46.3 MFLOPS (1000 × 10^6 / 21.6 = 46.3 × 10^6, i.e., 46.3
MFLOPS). Now
consider a situation when each of the two processors is effectively executing half
of the problem instance (i.e., size W/2). At this problem size, the cache hit ratio is
expected to be higher, since the effective problem size is smaller. Let us assume
that the cache hit ratio is 90%, 8% of the remaining data comes from local
DRAM, and the other 2% comes from the remote DRAM (communication
overhead). Assuming that remote data access takes 400 ns, this corresponds to
an overall access time of 2 × 0.9 + 100 × 0.08 + 400 × 0.02, or 17.8 ns. The
corresponding execution rate at each processor is therefore 56.18 MFLOPS, for a total
execution rate of 112.36 MFLOPS. The speedup in this case is given by the
increase in speed over serial formulation, i.e., 112.36/46.3 or 2.43! Here, because
of increased cache hit ratio resulting from lower problem size per processor, we
notice superlinear speedup.
Superlinear speedup can also happen when the work performed by a serial
algorithm is greater than its parallel formulation. For example, consider an
algorithm for exploring leaf nodes of an unstructured tree. Each leaf has a label
associated with it and the objective is to find a node with a specified label. For
example, in figure below, assume that the solution node is the rightmost leaf in
the tree.

A serial formulation of this problem based on depth-first tree traversal explores
the entire tree, i.e., all 14 nodes. If it takes time tc to visit a node, the time for
this traversal is 14tc. Now consider a parallel formulation in which the left
subtree is explored by processing element 0 and the right subtree by processing
element 1. If both processing elements explore the tree at the same speed, the
parallel formulation explores only the shaded nodes before the solution is found.
Notice that the total work done by the parallel algorithm is only nine node
expansions, i.e., 9tc. The corresponding parallel time, assuming the root node
expansion is serial, is 5tc (one root node expansion, followed by four node
expansions by each processing element). The speedup of this two-processor
execution is therefore 14tc /5tc , or 2.8!
The cause for this superlinearity is that the work performed by parallel and
serial algorithms is different.
Efficiency
Only an ideal parallel system containing p processing elements can deliver a
speedup equal to p. In practice, ideal behaviour cannot be achieved because of
the parallel overheads. Efficiency (E) is a measure of the fraction of time for
which a processing element is usefully employed; it is defined as the ratio of
speedup to the number of processing elements. In an ideal parallel system,
speedup is equal to p and efficiency is equal to one. In practice, speedup is less
than p and efficiency is between zero and one, depending on the effectiveness
with which the processing elements are utilized.
Mathematically, E is given by
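    E = S / p = ts / (p × tp)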

Cost
The cost of solving a problem on a parallel system is the product of parallel
runtime and the number of processing elements used. Cost reflects the sum of
the time that each processing element spends solving the problem. Efficiency can
also be expressed as the ratio of the execution time of the fastest known
sequential algorithm for solving a problem to the cost of solving the same
problem on p processing elements. The cost of solving a problem on a single
processing element is the execution time of the fastest known sequential
algorithm. A parallel system is said to be cost-optimal if the cost of solving a
problem on a parallel computer has the same asymptotic growth as a function of
the input size as the fastest-known sequential algorithm on a single processing
element. Since efficiency is the ratio of sequential cost to parallel cost, a cost-
optimal parallel system has an efficiency of O(1).
Cost is sometimes referred to as work or processor-time product, and a cost-optimal system is also known as a pTP-optimal system.
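As a small numerical illustration (the run times below are made up for the example, not taken from any particular system), the following C sketch computes speedup, efficiency and cost from an assumed serial time Ts and parallel time Tp:

#include <stdio.h>
int main(void)
{
    double Ts = 100.0, Tp = 30.0;   /* assumed serial and parallel run times   */
    int    p  = 4;                  /* number of processing elements           */
    double S  = Ts / Tp;            /* speedup, about 3.33                     */
    double E  = S / p;              /* efficiency, about 0.83                  */
    double C  = p * Tp;             /* cost (processor-time product) = 120     */
    printf("S = %.2f  E = %.2f  cost = %.0f\n", S, E, C);
    return 0;
}

Here the cost (120) exceeds the serial time (100), which is exactly what a less-than-perfect efficiency expresses.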
5. State and Explain Amdahl's law
Back in the 1960s, Gene Amdahl made an observation that has become known as
Amdahl's law. It says, roughly, that unless virtually all of a serial program is
parallelized, the possible speedup is going to be very limited, regardless of the
number of cores available.
Let f be the fraction of operations in a computation that must be performed
sequentially, where 0 ≤ f ≤ 1. The maximum speedup S achievable by a parallel
computer with p processors performing the computation is
S ≤ 1 / (f + (1 - f)/p)
(Many authors consider f as the fraction of operations in a computation that can
be infinitely parallelized with no overhead, while the remaining fraction, 1 - f, is
totally sequential. Then Amdahl's law can be stated as
S ≤ 1 / ((1 - f) + f/p)
or, equivalently,
S ≤ p / (p(1 - f) + f). )
Suppose, for example, that we are able to parallelize 90% of a serial program.
Further suppose that the parallelization is perfect, that is, regardless of the
number of cores p we use, the speedup of this part of the program will be p. If the
serial run-time is ts = 20 seconds, then the run-time of the parallelized part will
be 0.9ts/p = 18/p and the run-time of the unparallelized part will be 0.1ts = 2.
The overall parallel run-time will be
0.9ts/p + 0.1ts = 18/p + 2
and the speedup will be
S = ts / (0.9ts/p + 0.1ts) = 20 / (18/p + 2).
Now as p gets larger and larger, 0.9ts/p = 18/p gets closer and closer to 0, so the
total parallel run-time cannot be smaller than 0.1ts = 2. That is, the
denominator in S cannot be smaller than 0.1ts = 2. The fraction S must
therefore be smaller than
ts / (0.1ts) = 20/2 = 10.
That is, S ≤ 10. This is saying that even though we have done a perfect job in
parallelizing 90% of the program, and even if we have, say, 1000 cores, we will
never get a speedup better than 10.
More generally, if a fraction r of our serial program remains unparallelized, then
Amdahl's law says we can't get a speedup better than 1/r. In our example,
r = 1 - 0.9 = 1/10, so we couldn't get a speedup better than 10. Therefore, if a
fraction r of our serial program is inherently serial, that is, cannot possibly be
parallelized, then we cannot possibly get a speedup better than 1/r. Thus, even if
r is quite small, say 1/100, and we have a system with thousands of cores, we
cannot possibly get a speedup better than 100.
This is illustrated through the following examples:
- If 90% of a calculation can be parallelized then the maximum speed-up on 10
  processors is 1/(0.1 + (1 - 0.1)/10) or about 5.3.
- If 90% of a calculation can be parallelized then the maximum speed-up on 20
  processors is 1/(0.1 + (1 - 0.1)/20) or about 6.9.
- If 90% of a calculation can be parallelized then the maximum speed-up on 1000
  processors is 1/(0.1 + (1 - 0.1)/1000) or about 9.9.
Example:
Suppose a program fragment consists of a loop and an initialization part such as:
...
val = func1(a, b, c);
for (index = 0; index < 100; index++)
{ array[index] = func2(array[index], val); }
...
Suppose further that the initialization requires 1000 cycles, and each iteration of
the loop requires 200 cycles. Note that the iterations are independent and hence
can be executed in parallel.
The total time required to execute the program sequentially is 1000 + 100 × 200 =
21,000 cycles. The loop can be parallelized, but the initialization is purely
sequential. Therefore f = 20000/21000, or about 0.95. The maximum speedup
given by the formula is
1 / ((1 - 0.95) + 0.95/p) = 1 / (0.05 + 0.95/p)
This would produce the speedups shown below for some typical values of p:
p          Speedup
1          1
20         10.25
100        16.8
500        19.27
infinite   20
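The entries in this table can be reproduced with a few lines of C. The sketch below simply evaluates the formula above; the value f = 0.95 and the list of processor counts are taken from the example:

#include <stdio.h>

/* Amdahl's law with f as the parallelizable fraction: S(p) = 1 / ((1-f) + f/p) */
double amdahl_speedup(double f, double p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void)
{
    double f = 0.95;                       /* parallelizable fraction, as above */
    int procs[] = {1, 20, 100, 500};
    for (int i = 0; i < 4; i++)
        printf("p = %4d   speedup = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    printf("p -> infinity  speedup -> %.1f\n", 1.0 / (1.0 - f));   /* limit = 20 */
    return 0;
}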

In reality, this simple form of Amdahl's law neglects several issues, most notably
the degree of parallelism of the problem itself. Since this problem clearly cannot
use more than 100 processors, the maximum speedup is 16.8. Another major
factor ignored here is communication overhead. In addition, portions of a real
problem would have many different degrees of parallelism.
In practice, contrary to Amdahl's law, many parallel programs obtain excellent
speedups. One possible reason for this apparent contradiction is that Amdahl's
law doesn't take into consideration the fact that the unparallelized part often
decreases in size relative to the parallelized part as the problem size increases (a
more mathematical version of this statement is known as Gustafson's law).
Second, there are thousands of programs used by scientists and engineers that
routinely obtain huge speedups on large distributed-memory systems. Finally, in
many cases, even obtaining a speedup of 5 or 10 is more than adequate,
especially if the effort involved in developing the parallel program was not very
large.
6. Explain the term scalability with respect to parallel systems
Scalability is a term that has many interpretations. In general, a technology is
scalable if it can handle ever-increasing problem sizes. Formally, a parallel
program is scalable if there is a rate at which the problem size can be increased
so that as the number of processes/threads is increased, the efficiency remains
constant.
As an example, suppose that ts = n, where the units of ts are microseconds and
n is the problem size. Also suppose that tp = n/p + 1. Then the efficiency is
E = ts / (p × tp) = n / (p(n/p + 1)) = n / (n + p).
To see if the program is scalable, increase the number of processes/threads by a
factor of k, and find the factor x by which the problem size must be increased
such that E is unchanged. If the number of processes/threads is kp and the
problem size is xn,
E = xn / (xn + kp).
If x = k,
E = kn / (kn + kp) = n / (n + p),
which is the same efficiency as before. In other words, if we increase the problem
size at the same rate that we increase the number of processes/threads, then the
efficiency will be unchanged, and the program is scalable.
If, while increasing the number of processes/threads, the efficiency can be kept
fixed without increasing the problem size, the program is said to be strongly
scalable. If the efficiency can be kept fixed by increasing the problem size at the
same rate as we increase the number of processes/threads, then the program is
said to be weakly scalable. The program in our example would be weakly
scalable.
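The weak-scaling argument above is easy to check numerically. The sketch below uses the assumed timings from the example (ts = n microseconds and tp = n/p + 1) and prints the efficiency when both the problem size and the number of processes are scaled by the same factor k; every line prints the same value n/(n + p):

#include <stdio.h>

/* Efficiency for the example timings ts = n and tp = n/p + 1:
   E = ts / (p * tp) = n / (n + p)                              */
double efficiency(double n, double p)
{
    return n / (p * (n / p + 1.0));
}

int main(void)
{
    double n = 1000.0, p = 10.0;          /* base problem size and process count */
    for (int k = 1; k <= 8; k *= 2)
        printf("k = %d   E(kn, kp) = %.4f\n", k, efficiency(k * n, k * p));
    return 0;
}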
7. What is an abstract parallel architecture? Describe the ideal abstract
PRAM architecture and its variations.
The parallel architecture could be categorized into shared memory and message
passing systems. In a shared memory system, processing elements communicate
with each other via shared variables in the global memory, while in message
passing systems, each processing element has its own local memory and
communication is performed via message passing. Here we discuss the
architecture of an ideal abstract model for the shared memory system. The
purpose of such an abstract models for parallel computation is to give
frameworks by which we can describe and analyse parallel algorithms. These
ideal models are used to obtain performance bounds and complexity estimates.
The PRAM Model and Its Variations
PRAM is an abstract model for the shared memory system introduced by
Fortune and Wyllie in 1978 for modelling idealized parallel computers in which
communication cost and synchronization overhead are negligible.
A PRAM consists of a control unit, a global memory shared by p processors, each
of which has a unique index P1, P2, . . . , Pp . In addition to the global memory via
which the processors can communicate, each processor has its own private
memory. Following figure shows a diagram illustrating the components in the
PRAM model.

Fig: PRAM Model for Parallel Computations
The p processors operate on a synchronized read, compute, and write cycle.
During a computational step, an active processor may read a data value from a
memory location, perform a single operation, and finally write back the result
into a memory location. Active processors must execute the same instruction,
generally, on different data. Hence, this model is sometimes called the shared
memory, single instruction, multiple data (SM SIMD) machine. Algorithms are
assumed to run without interference as long as only one memory access is
permitted at a time. We say that PRAM guarantees atomic access to data located
in shared memory. An operation is considered to be atomic if it is completed in
its entirety or it is not performed at all (all or nothing).
There are different PRAM models depending on how they handle read or write
conflicts; i.e., when two or more processors attempt to read from, or write to, the
same global memory location.
These different read/write modes are:
- Exclusive Read (ER): Only one processor can read from any memory location at a time.
- Exclusive Write (EW): Only one processor can write to any memory location at a time.
- Concurrent Read (CR): Multiple processors can read from the same memory location simultaneously.
- Concurrent Write (CW): Multiple processors can write to the same memory location simultaneously. Write conflicts must be resolved using a well-defined policy such as:
  - Common: All concurrent writes store the same value.
  - Arbitrary: Only one value, selected arbitrarily, is stored; the other values are ignored.
  - Minimum: The value written by the processor with the smallest index is stored; the other values are ignored.
  - Reduction: All the values are reduced to one value using some reduction function such as sum, minimum, maximum, and so on.
Based on the above read/write modes, the PRAM can be further divided into the following subclasses:
- EREW PRAM: Access to any memory cell is exclusive. This is the most restrictive PRAM model.
- ERCW PRAM: This allows concurrent writes to the same memory location by multiple processors, but read accesses remain exclusive.
- CREW PRAM: Concurrent read accesses are allowed, but write accesses are exclusive.
- CRCW PRAM: Both concurrent read and write accesses are allowed.
The EREW PRAM model is considered the most restrictive among the four
subclasses discussed above. Only one processor can read from or write to a given
memory location at any time. An algorithm designed for such a model must not
rely on having multiple processors access the same memory location
simultaneously in order to improve its performance. Obviously, an algorithm
designed for an EREW PRAM can run on a CRCW PRAM. The algorithm simply
will not use the concurrent access features in the CRCW PRAM. However, the
converse is not true: an algorithm designed for a CRCW PRAM cannot, in general, run on an EREW
PRAM.
8. Explain the major performance components used in the analysis of
PRAM algorithms.
The complexity of a sequential algorithm is generally determined by its time and
space complexity. The time complexity of an algorithm refers to its execution
time as a function of the problems size. Similarly, the space complexity refers to
the amount of memory required by the algorithm as a function of the size of the
problem. For parallel algorithms, the number of processors is also taken into
consideration when measuring the time complexity. That is, the
performance of a parallel algorithm is expressed in terms of how fast it is, and
how many resources it uses when it runs. These criteria can be measured
quantitatively as follows:
1. Run time, which is defined as the time spent during the execution of the
algorithm.
2. Number of processors the algorithm uses to solve a problem.
3. The cost of the parallel algorithm, which is the product of the run time and
the number of processors.
The run time of a parallel algorithm is the length of the time period between the
time the first processor to begin execution starts and the time the last processor
to finish execution terminates. However, since the analysis of algorithms is
normally conducted before the algorithm is even implemented on an actual
computer, the run time is usually obtained by counting the number of steps in
the algorithm.
The cost of a parallel algorithm is basically the total number of steps executed
collectively by all processors. A parallel algorithm is said to be cost optimal if its
cost matches the lower bound on the number of sequential operations to solve a
given problem within a constant factor. It follows that a parallel algorithm is not
cost optimal if there exists a sequential algorithm whose run time is smaller
than the cost of the parallel algorithm.
It may be possible to speed up the execution of a cost-optimal PRAM algorithm
by increasing the number of processors. However, one should be careful because
using more processors may increase the cost of the parallel algorithm. Similarly,
a PRAM algorithm may use fewer processors in order to reduce the cost. In this
case the execution may be slowed down and offset the decrease in the number of
processors.
In order to design efficient parallel algorithms, one must consider the following
general rules. The number of processors must be bounded by the size of the
problem. The parallel run time must be significantly smaller than the execution
time of the best sequential algorithm. The cost of the algorithm must be optimal.
9. Analyse the EREW PRAM parallel algorithm to find the sum of an
array of numbers.
Summation of n numbers in an array can be done in time O(log n) by organizing
the numbers at the leaves of a binary tree and performing the sums at each level
of the tree in parallel.
This algorithm can be designed for an EREW PRAM with n/2 processors because
we will not need to perform any multiple read or write operations on the same
memory location. Recall that in an EREW PRAM, read and write conflicts are
not allowed. We assume that the array A[1 . . n] is stored in the global memory.
The summation will end up in the last location A[n]. For simplicity, we assume
that n is an integral power of 2. The algorithm will complete the work in log n
iterations as follows. In the first iteration, all the processors are active. In the
second iteration, only half of the processors will be active, and so on.






The figure given below illustrates the algorithm on an array of eight elements: 5,
2, 10, 1, 8, 12, 7, 3. In order to sum eight elements, three iterations are needed as
follows. In the first iteration, processors P1, P2, P3, and P4 add the values stored
at locations 1, 3, 5, and 7 to the numbers stored at locations 2, 4, 6, and 8,
respectively. In the second iteration, processors P2 and P4 add the values stored
at locations 2 and 6 to the numbers stored at locations 4 and 8, respectively.
Finally, in the third iteration processor P4 adds the value stored at location 4 to
the value stored at location 8. Thus, location 8 will eventually contain the sum of
all numbers in the array.

The algorithm is given below.
Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 ≤ j ≤ n/2, do in parallel
        if (2j modulo 2^i) = 0 then
            A[2j] ← A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
Notice that most of the processors are idle most of the time. During iteration i,
only n/2^i processors are active.
Complexity Analysis: The for loop is executed log n times, and each iteration
has constant time complexity. Hence the run time of the algorithm is O(log n).
Since the number of processors used is n/2, the cost is obviously O(n log n). The
complexity measures of Algorithm Sum_EREW are summarized as follows:
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n/2.
3. Cost, C(n) = O(n log n).
Since a good sequential algorithm can sum the list of n elements in O(n), this
algorithm is not cost optimal.
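The behaviour of Algorithm Sum_EREW can be traced in ordinary C by running each synchronous "forall" step as a sequential loop over the active processors. The sketch below is only an illustration of the access pattern (array indices start at 1 to match the pseudocode), not a parallel implementation:

#include <stdio.h>

#define N 8                                     /* must be a power of 2 */

int main(void)
{
    int A[N + 1] = {0, 5, 2, 10, 1, 8, 12, 7, 3};   /* A[1..N], as in the example */

    /* log N iterations; in iteration i (step = 2^i) only processors Pj with
       (2j modulo 2^i) = 0 are active, exactly as in the pseudocode.           */
    for (int step = 2; step <= N; step *= 2)
        for (int j = 1; j <= N / 2; j++)             /* "forall Pj"              */
            if ((2 * j) % step == 0)
                A[2 * j] += A[2 * j - step / 2];     /* A[2j] <- A[2j] + A[2j - 2^(i-1)] */

    printf("sum = %d\n", A[N]);                      /* prints 48 for this example */
    return 0;
}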
(Note: The binary tree is one of the most important paradigms of parallel
computing. In some algorithms data flows top-down from the root of the tree to
the leaves. Broadcast and divide-and-conquer algorithms both fit this model. In
broadcast algorithms the root sends the same data to every leaf. In divide-and
conquer algorithms the tree represents the recursive subdivision of problems into
sub-problems. In other algorithms data flows bottom-up from the leaves of the
tree to the root. These are called fan-in or reduction operations. More formally,
given a set of n values a1, a2, ..., an and an associative binary operator ⊕,
reduction is the process of computing a1 ⊕ a2 ⊕ ... ⊕ an. The parallel summation
discussed above is an example of a reduction operation).
10. Analyse the EREW PRAM parallel algorithm to find all partial sums of
an array of numbers.
Given n numbers, stored in array A[1 . . n], the partial sum (also known as prefix
sum) problem is to compute the partial sums A[1], A[1]+A[2], A[1]+A[2]+A[3], . . .
, A[1]+A[2]+...+ A[n].
The algorithm AllSums_EREW is presented to calculate all partial sums of an
array on an EREW PRAM with n-1 processors (P2, P3, . . . , Pn). The elements of
the array A[1 . . n] are assumed to be in the global shared memory.
The partial sum algorithm replaces each A[k] by the sum of all elements
preceding and including A[k].
The algorithm is given below.
Algorithm AllSums_EREW
for i = 1 to log n do
    forall Pj, where 2^(i-1) + 1 ≤ j ≤ n, do in parallel
        A[j] ← A[j] + A[j - 2^(i-1)]
    endfor
endfor
Note that unlike in Sum_EREW presented earlier, here nearly all processors are
in use.
The figure given below illustrates the three iterations of the algorithm on an
array of eight elements named A[1] through A[8].

Complexity Analysis: The complexity measures of Algorithm AllSums_EREW
are summarized as follows:
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n-1.
3. Cost, C(n) = O(n log n).
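As with the summation algorithm, the data movement of AllSums_EREW can be traced with a small sequential C sketch. One caveat: in the PRAM every processor reads the pre-step value of A[j - 2^(i-1)], so the sequential emulation below runs the inner loop from high indices to low, which preserves exactly that behaviour. The sketch is illustrative only:

#include <stdio.h>

#define N 8

int main(void)
{
    int A[N + 1] = {0, 5, 2, 10, 1, 8, 12, 7, 3};    /* A[1..N] */

    /* d = 2^(i-1); in step i every processor Pj with j > d adds A[j-d] to A[j].
       Iterating j downwards makes each read see the value from before the step. */
    for (int d = 1; d < N; d *= 2)
        for (int j = N; j > d; j--)
            A[j] += A[j - d];

    for (int j = 1; j <= N; j++)
        printf("partial sum up to A[%d] = %d\n", j, A[j]);
    return 0;
}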
11. Analyse the CREW PRAM parallel algorithm to find the product of
two n×n matrices.
Assume that n is a power of 2. The algorithm is designed for the CREW PRAM
model with n^3 processors to allow multiple read operations from the same memory
locations. Assume that the two input matrices are stored in the shared memory
in the arrays A[1 . . n, 1 . . n] and B[1 . . n, 1 . . n].
We consider the n^3 processors as being arranged into a three-dimensional array.
Processor Pi,j,k is the one with index (i, j, k). A three-dimensional array C[i, j, k],
where 1 ≤ i, j, k ≤ n, in the shared memory will be used as working space. The
resulting matrix will be stored in locations C[i, j, n], where 1 ≤ i, j ≤ n.
The algorithm consists of two steps. In the first step, all n^3 processors operate in
parallel to compute n^3 multiplications. For each of the n^2 cells in the output
matrix, n products are computed. In the second step, the n products computed for
each cell in the output matrix are summed to produce the final value of this cell.
The following figure shows the activities of the active processors after each of the
two steps of the algorithm when used to multiply two 2×2 matrices.


The summation step can be performed in parallel in O(log n) time as shown in
Algorithm Sum_EREW discussed earlier.
The two steps of the algorithm are given as:
1. Each processor Pi,j,k computes the product of A[i, k] * B[k, j] and stores it
in C[i, j, k].
2. The idea of Algorithm Sum_EREW is applied along the k dimension n^2
times in parallel to compute C[i, j, n], where 1 ≤ i, j ≤ n.
The details of these two steps are presented in Algorithm MatMult_CREW:
Algorithm MatMult_CREW
/* Step 1 */
forall Pi,j,k, where 1 ≤ i, j, k ≤ n, do in parallel
    C[i,j,k] ← A[i,k] * B[k,j]
endfor
/* Step 2 */
for l = 1 to log n do
    forall Pi,j,k, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
        if (2k modulo 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k - 2^(l-1)]
        endif
    endfor
    /* The output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
endfor
Complexity Analysis: In the first step, the products are computed in parallel in
constant time, that is, O(1). These products are summed in O(log n) time during
the second step. Therefore, the run time is O(log n). Since the number of
processors used is n^3, the cost is O(n^3 log n). The complexity measures of the
matrix multiplication on a CREW PRAM with n^3 processors are summarized as:
1. Run time, T(n) = O(log n).
2. Number of processors, P(n) = n^3.
3. Cost, C(n) = O(n^3 log n).
Since an n×n matrix multiplication can be done sequentially in less than
O(n^3 log n) time, this algorithm is not cost optimal. In order to reduce the cost of
this parallel algorithm, we should try to reduce the number of processors.
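The two phases of MatMult_CREW can again be traced sequentially. In the sketch below (illustrative only, with n = 2 and small made-up matrices), step 1 fills the working array C[i][j][k] with the n^3 products and step 2 performs the logarithmic-depth summation along the k dimension, leaving the result in C[i][j][n] as described above:

#include <stdio.h>

#define N 2                                        /* must be a power of 2 */

int main(void)
{
    int A[N + 1][N + 1] = {{0}, {0, 1, 2}, {0, 3, 4}};   /* A[1..N][1..N] */
    int B[N + 1][N + 1] = {{0}, {0, 5, 6}, {0, 7, 8}};   /* B[1..N][1..N] */
    int C[N + 1][N + 1][N + 1] = {{{0}}};

    /* Step 1: "processor" P(i,j,k) computes one product. */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            for (int k = 1; k <= N; k++)
                C[i][j][k] = A[i][k] * B[k][j];

    /* Step 2: tree summation along the k dimension, as in Algorithm Sum_EREW. */
    for (int step = 2; step <= N; step *= 2)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                for (int k = 1; k <= N / 2; k++)
                    if ((2 * k) % step == 0)
                        C[i][j][2 * k] += C[i][j][2 * k - step / 2];

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            printf("C[%d][%d] = %d\n", i, j, C[i][j][N]);   /* product matrix */
    return 0;
}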
12. Analyse the CRCW PRAM parallel algorithm to sort a set of numbers
using enumeration sort.
Given an unsorted list of n elements a1, a2, . . . , ai, . . . , an, an enumeration sort
(also known as list ranking) determines the position of each element ai in the
sorted list by computing the number of elements smaller than it. If ci elements
are smaller than ai, then it is the (ci + 1)th element in the sorted list. If two or
more elements have the same value, the element with the largest index in the
unsorted list will be considered as the largest in the sorted list. For example,
suppose that ai = aj; then ai will be considered the larger of the two if i > j,
otherwise aj is the larger.
The algorithm is designed for a CRCW PRAM with n^2 processors, which allows
both concurrent read and concurrent write operations. However, write conflicts
must be resolved according to a certain policy. The algorithm assumes that when
multiple processors try to write different values into the same address, the sum
of these values will be stored in that address.
Consider the n^2 processors as being arranged into n rows of n elements each. The
processors are numbered as follows: Pi,j is the processor located in row i and
column j in the grid of processors. Assume that the unsorted list is stored in the
global memory in an array A[1 . . n]. Another array C[1 . . n] will be used to store
the number of elements smaller than each element in A.
The algorithm consists of two steps:
1. Each row of processors i computes C[i], the number of elements smaller
than A[i]. Each processor Pi,j compares A[i] and A[j], then updates C[i]
appropriately.
2. The first processor in each row Pi,1 places A[i] in its proper position in the
sorted list (C[i] + 1).
The details of these two steps are presented in Algorithm Sort_CRCW:
Algorithm Sort_CRCW
/* Step 1 */
forall Pi,j, where 1 ≤ i, j ≤ n, do in parallel
    if A[i] > A[j] or (A[i] = A[j] and i > j) then
        C[i] ← C[i] + 1
    else
        C[i] ← C[i] + 0
    endif
endfor
/* Step 2 */
forall Pi,1, where 1 ≤ i ≤ n, do in parallel
    A[C[i] + 1] ← A[i]
endfor
Complexity Analysis: The complexity measures of the enumeration sort on a
CRCW PRAM are summarized as:
1. Run time, T(n) = O(1).
2. Number of processors, P(n) = n^2.
3. Cost, C(n) = O(n^2).
Since a good sequential algorithm can sort a list of n elements in O(n log n), this
algorithm is not cost optimal. Although the above algorithm sorts n elements in
constant time, it has no practical value because it uses a very large number of
processors in addition to its reliance on a very powerful PRAM model (CRCW).
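The counting idea behind Sort_CRCW translates directly into sequential C. The sum-combining concurrent-write policy is emulated by ordinary += accumulation into C[i], and a separate output array S is used because a sequential loop cannot reproduce the simultaneous reads and writes of the PRAM step. Again, this is only an illustration:

#include <stdio.h>

#define N 8

int main(void)
{
    int A[N + 1] = {0, 5, 2, 10, 1, 8, 12, 7, 3};   /* unsorted list, A[1..N] */
    int C[N + 1] = {0};                              /* rank counters          */
    int S[N + 1];                                    /* sorted output          */

    /* Step 1: "processor" P(i,j) contributes 1 to C[i] if A[j] ranks below A[i]
       (ties broken by index); the CW-sum policy becomes a += accumulation.     */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            if (A[i] > A[j] || (A[i] == A[j] && i > j))
                C[i] += 1;

    /* Step 2: P(i,1) places A[i] at position C[i] + 1 of the sorted list. */
    for (int i = 1; i <= N; i++)
        S[C[i] + 1] = A[i];

    for (int i = 1; i <= N; i++)
        printf("%d ", S[i]);                         /* 1 2 3 5 7 8 10 12 */
    printf("\n");
    return 0;
}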
Module 3-5
Parallel &
Distributed Programming


LEARNING OBJECTIVES
Foster's Design Methodology, Time Complexity (computation and
communication complexities), Speedup, Efficiency, Cost Optimality,
Amdahl's Law, Brent's Scheduling Principle, Simple PRAM
Algorithms: Boolean Operations, Max Finding in O(1) time,
Reduction, Prefix-Sum, etc.



Programming Using the Message-Passing Paradigm
Principles of Message Passing Paradigm
The message-passing programming paradigm is one of the oldest and most
widely used approaches for programming parallel computers. There are two key
attributes that characterize the message-passing programming paradigm. The
first is that it assumes a partitioned address space and the second is that it
supports only explicit parallelization.
The logical view of a machine supporting the message-passing paradigm consists
of p processes, each with its own exclusive address space. This partitioned
address space has two implications. First, each data element must belong to one
of the partitions of the address space. Hence, data must be explicitly partitioned
and placed. This adds complexity to programming, but encourages locality of
access that is critical for achieving high performance on non-UMA architecture,
since a processor can access its local data much faster than non-local data on
such architectures. The second implication is that all interactions (read-only or
read/write) require the cooperation of two processes: the process that has the data
and the process that wants to access the data. This requirement for cooperation
results in increased complexity for a number of reasons. The process that has the
data must participate in the interaction even if it has no logical connection to the
events at the requesting process. However, a primary advantage of explicit two-
way interactions is that the programmer is fully aware of all the costs of non-
local interactions, and is more likely to think about algorithms (and mappings)
that minimize interactions. Another major advantage of this type of
programming paradigm is that it can be efficiently implemented on a wide
variety of architectures.
Structure of Message-Passing Programs:
Message-passing programs are often written using the asynchronous or loosely
synchronous paradigms. In the asynchronous paradigm, all concurrent tasks
execute asynchronously. This makes it possible to implement any parallel
algorithm. However, such programs can be harder to reason about, and can have
nondeterministic behaviour due to race conditions. Loosely synchronous
programs are a good compromise between fully asynchronous and fully synchronous execution. In such programs,
tasks or subsets of tasks synchronize to perform interactions. However, between
these interactions, tasks execute completely asynchronously. Since the
interaction happens synchronously, it is still quite easy to reason about the
program.
In its most general form, the message-passing paradigm supports execution of a
different program on each of the p processes. This provides the ultimate
flexibility in parallel programming, but makes the job of writing parallel
programs effectively unscalable. For this reason, most message-passing
programs are written using the single program multiple data (SPMD) approach.
In SPMD programs the code executed by different processes is identical except
for a small number of processes (e.g., the "root" process). This does not mean that
the processes work in lock-step. In an extreme case, even in an SPMD program,
each process could execute a different code (the program contains a large case
statement with code for each process). But except for this degenerate case, most
processes execute the same code. SPMD programs can be loosely synchronous or
completely asynchronous.
The Building Blocks: Send and Receive Operations
The basic operations send and receive are used in message-passing programming
paradigm for sending and receiving messages.
The prototypes of these operations are defined as follows:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
The sendbuf points to a buffer that stores the data to be sent, recvbuf points to a
buffer that stores the data to be received, nelems is the number of data units to
be sent and received, dest is the identifier of the process that receives the data,
and source is the identifier of the process that sends the data.
To investigate various issues in using the send and receive operations, let us
consider a simple example of a process sending a piece of data to another process
as illustrated in the following code-fragment:
P0 P1
a = 100; receive(&a, 1, 0)
send(&a, 1, 1); printf("%d\n", a);
a=0;
In this simple example, process P0 sends a message to process P1 which receives
and prints the message. Note that the process P0 changes the value of a to 0
immediately following the send. However, the value received by process P1 must
be 100 as opposed to 0. That is, the value of a at the time of the send operation
must be the value that is received by process P1.
The difficulty in ensuring the semantics of the send and receive operations
depends on how the send and receive operations are implemented. Most message
passing platforms have additional hardware support for sending and receiving
messages. They may support DMA (direct memory access) and asynchronous
message transfer using network interface hardware. Network interfaces allow
the transfer of messages from buffer memory to desired location without CPU
intervention. Similarly, DMA allows copying of data from one memory location to
another (e.g., communication buffers) without CPU support (once they have been
programmed). As a result, if the send operation programs the communication
hardware and returns before the communication operation has been
accomplished, process P1 might receive the value 0 in a instead of 100. The
method for ensuring the semantics of the send and receive operations in the
context of the above hardware environment are discussed below.
Blocking Message Passing Operations
A simple way to ensure the semantics of the send and receive operations is to
return from send only when it is semantically safe to do so. Note that this is not
the same as saying that the send operation returns only after the receiver has
received the data. It simply means that the sending operation blocks until it can
guarantee that the semantics will not be violated on return irrespective of what
happens in the program subsequently. There are two mechanisms by which this
can be achieved.
1. Blocking Non-Buffered Send/Receive
2. Blocking Buffered Send/Receive
Blocking Non-Buffered Send/Receive
In this case, the send operation does not return until the matching receive has
been encountered at the receiving process. When this happens, the message is
sent and the send operation returns upon completion of the communication
operation. Typically, this process involves a handshake between the sending and
receiving processes. The sending process sends a request to communicate to the
receiving process. When the receiving process encounters the target receive, it
responds to the request. The sending process upon receiving this response
initiates a transfer operation. The operation is illustrated below:

Fig: Handshake for a blocking non-buffered send/receive operation. It is easy to
see that in cases where sender and receiver do not reach communication point at
similar times, there can be considerable idling overheads.
Since there are no buffers used at either sending or receiving ends, this is also
referred to as a non-buffered blocking operation.
Idling Overheads in Blocking Non-Buffered Operations: The figure above
illustrates three scenarios: (a) the send is reached before the receive is posted, (b)
the send and receive are posted around the same time, and (c) the receive is
posted before the send is reached. In cases (a) and (c), we notice that there is
considerable idling at the sending and receiving process. It is also clear from the
figures that a blocking non-buffered protocol is suitable when the send and
receive are posted at roughly the same time. However, in an asynchronous
environment, this may be impossible to predict. This idling overhead is one of the
major drawbacks of blocking non-buffered protocol.
Deadlocks in Blocking Non-Buffered Operations: Consider the following simple
exchange of messages that can lead to a deadlock:
P0 P1
send(&a, 1, 1); send(&a, 1, 0);
receive(&b, 1, 1); receive(&b, 1, 0);
The code fragment makes the values of a available to both processes P0 and P1.
However, if the send and receive operations are implemented using a blocking
non-buffered protocol, the send at P0 waits for the matching receive at P1
whereas the send at process P1 waits for the corresponding receive at P0,
resulting in an infinite wait. As can be inferred, deadlocks are very easy to
introduce with blocking protocols, and care must be taken to break cyclic waits to recover from
deadlocks. In the above example, this can be corrected by replacing the operation
sequence of one of the processes by a receive and a send as opposed to send and
receive. This often makes the code more cumbersome and buggy.
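A minimal sketch of the corrected ordering, written with the same generic send/receive prototypes as before (process identifiers 0 and 1 as in the example):

P0                          P1
send(&a, 1, 1);             receive(&b, 1, 0);
receive(&b, 1, 1);          send(&a, 1, 0);

Because P1 now posts its receive first, P0's blocking send can complete; P1's subsequent send then matches P0's receive, so the cyclic wait is broken.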
Blocking Buffered Send/Receive
A simple solution to the idling and deadlocking problem encountered in blocking
non-buffered protocol is to rely on buffers at the sending and receiving ends.
Assume a simple case in which the sender has a buffer pre-allocated for
communicating messages. On encountering a send operation, the sender simply
copies the data into the designated buffer and returns after the copy operation
has been completed. The sender process can now continue with the program
knowing that any changes to the data will not impact program semantics. The
actual communication can be accomplished in many ways depending on the
available hardware resources. If the hardware supports asynchronous
communication (independent of the CPU), then a network transfer can be
initiated after the message has been copied into the buffer. Note that at the
receiving end, the data cannot be stored directly at the target location since this
would violate program semantics. Instead, the data is copied into a buffer at the
receiver as well. When the receiving process encounters a receive operation, it
checks to see if the message is available in its receive buffer. If so, the data is
copied into the target location. This operation is illustrated in the following
figure(a).


Fig: Blocking buffered transfer protocols: (a) in the presence of communication
hardware with buffers at send and receive ends; and (b) in the absence of
communication hardware, sender interrupts receiver and deposits data in buffer
at receiver end
In the protocol illustrated above, buffers are used at both sender and receiver
and communication is handled by dedicated hardware. Sometimes machines do
not have such communication hardware. In this case, some of the overhead can
be saved by buffering only on one side. For example, on encountering a send
operation, the sender interrupts the receiver, both processes participate in a
communication operation and the message is deposited in a buffer at the receiver
end. When the receiver eventually encounters a receive operation, the message is
copied from the buffer into the target location. This protocol is illustrated in
figure (b) above.
We can also have a protocol in which the buffering is done only at the sender and
the receiver initiates a transfer by interrupting the sender.
It is easy to see that buffered protocols alleviate idling overheads at the cost of
adding buffer management overheads. In general, if the parallel program is
highly synchronous (i.e., sends and receives are posted around the same time),
non-buffered sends may perform better than buffered sends. However, in general
applications, this is not the case and buffered sends are desirable unless buffer
capacity becomes an issue.
Buffer Overflow in Buffered Send and Receive Operations: The impact of finite
buffers in message passing is illustrated below:
P0 P1
for (i = 0; i < 1000; i++) for (i = 0; i < 1000; i++)
{ produce_data(&a); { receive(&a, 1, 0);
send(&a, 1, 1); consume_data(&a);
} }
In this code fragment, process P0 produces 1000 data items and process P1
consumes them. However, if process P1 was slow getting to this loop, process P0
might have sent all of its data. If there is enough buffer space, then both
processes can proceed. However, if the buffer is not sufficient (i.e., buffer
overflow), the sender would have to be blocked until some of the corresponding
receive operations had been posted, thus freeing up buffer space. This can often
lead to unforeseen overheads and performance degradation. In general, it is a
good idea to write programs that have bounded buffer requirements.
Deadlocks in Buffered Send and Receive Operations: While buffering alleviates
many of the deadlock situations, it is still possible to write code that deadlocks.
This is due to the fact that as in the non-buffered case, receive calls are always
blocking (to ensure semantic consistency). Thus, a simple code fragment such as
the following deadlocks since both processes wait to receive data but nobody
sends it.
P0 P1
receive(&a, 1, 1); receive(&a, 1, 0);
send(&b, 1, 1); send(&b, 1, 0);
Once again, such circular waits have to be broken. However, deadlocks are
caused only by waits on receive operations in this case.
Non-Blocking Message Passing Operations
In blocking protocols, the overhead of guaranteeing semantic correctness was
paid in the form of idling (non-buffered) or buffer management (buffered). Often,
it is possible to require the programmer to ensure semantic correctness and
provide a fast send/receive operation that incurs little overhead. This class of
non-blocking protocols returns from the send or receive operation before it is
semantically safe to do so. Consequently, the user must be careful not to alter
data that may be potentially participating in a communication operation. Non-
blocking operations are generally accompanied by a check-status operation,
which indicates whether the semantics of a previously initiated transfer may be
violated or not. Upon return from a non-blocking send or receive operation, the
process is free to perform any computation that does not depend upon the
completion of the communication operation. Later in the program, the process
can check whether or not the non-blocking operation has completed, and, if
necessary, wait for its completion.
As illustrated in the following figure, non-blocking operations can themselves be
buffered or non-buffered.

Fig: Space of possible protocols for send and receive operations
In the non-buffered case, a process wishing to send data to another simply posts
a pending message and returns to the user program. The program can then do
other useful work. At some point in the future, when the corresponding receive is
posted, the communication operation is initiated. When this operation is
completed, the check-status operation indicates that it is safe for the
programmer to touch this data. This transfer is indicated in figure (a) below:

Fig: Non-blocking non-buffered send and receive operations (a) in absence of
communication hardware; (b) in presence of communication hardware
Comparing the above figures with the corresponding figure in blocking non-
buffered protocol, it is easy to see that the idling time when the process is
waiting for the corresponding receive in a blocking operation can now be utilized
for computation, provided it does not update the data being sent. This alleviates
the major bottleneck associated with the former at the expense of some program
restructuring. The benefits of non-blocking operations are further enhanced by
the presence of dedicated communication hardware. This is illustrated in figure
(b) above. In this case, the communication overhead can be almost entirely
masked by non-blocking operations. In this case, however, the data being
received is unsafe for the duration of the receive operation.
Non-blocking operations can also be used with a buffered protocol. In this case,
the sender initiates a DMA operation and returns immediately. The data
becomes safe the moment the DMA operation has been completed. At the
receiving end, the receive operation initiates a transfer from the sender's buffer
to the receiver's target location. Using buffers with non-blocking operation has
the effect of reducing the time during which the data is unsafe.
Typical message-passing libraries such as Message Passing Interface (MPI) and
Parallel Virtual Machine (PVM) implement both blocking and non-blocking
operations. Blocking operations facilitate safe and easier programming and non-
blocking operations are useful for performance optimization by masking
communication overhead. One must, however, be careful using non-blocking
protocols since errors can result from unsafe access to data that is in the process
of being communicated.
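As a concrete illustration, MPI (described in the next section) provides the non-blocking variants MPI_Isend and MPI_Irecv together with MPI_Wait as the completion check. The fragment below is a minimal sketch of a non-blocking receive overlapped with computation; src, tag and do_independent_work() are placeholders assumed to be defined elsewhere in the program:

MPI_Request req;
MPI_Status  stat;
double recvbuf[100];                                  /* illustrative buffer       */
MPI_Irecv(recvbuf, 100, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
do_independent_work();      /* computation that does not touch recvbuf             */
MPI_Wait(&req, &stat);      /* after this call it is safe to use recvbuf           */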
What is MPI? Why is it needed?
Message passing is a programming paradigm used widely on parallel MIMD
computers. Since the message-passing libraries provided by different hardware
vendors differed in syntax and semantics, portability of the message-passing
programs was seriously affected and often required significant re-engineering to
port a message-passing program from one library to another. The message-
passing interface (MPI) was created to essentially solve this problem. MPI
defines a standard library for message-passing that can be used to develop
portable message-passing programs using either C or Fortran. The MPI standard
defines both the syntax as well as the semantics of a core set of library routines
that are very useful in writing message-passing programs. MPI was developed by
a group of researchers from academia and industry, and has enjoyed wide
support by almost all the hardware vendors. Vendor implementations of MPI are
available on almost all commercial parallel computers.
Suitable implementations of MPI can be found for most distributed-memory
systems. To the programmer, MPI appears in the form of libraries for FORTRAN
or C family languages. Message passing is realized by an MPI routine (or
function) call. The MPI library contains over 125 routines. All MPI routines,
data-types, and constants are prefixed by "MPI_ ". MPI routines, data types and
constants are defined for C in the file "mpi.h" . This header file must be included
in each MPI program.
Major design goals of MPI include:
- Allow portability across different machines. For example, although message
  passing is often thought of in the context of distributed-memory parallel
  computers, the same code can also run well on shared-memory parallel
  computers or networks of workstations.
- Allow for implementations that can be used in a heterogeneous environment,
  i.e., the ability to run transparently on a collection of processors with distinct
  architectures, providing a virtual computing model that hides the
  architectural differences. The MPI implementation will automatically do any
  necessary data conversion and utilize the correct communications protocol.
- Allow efficient communication. Avoid memory-to-memory copying, allow
  overlap of computation and communication, and offload to a communication
  co-processor where available.
- Allow convenient C and FORTRAN bindings for the interface. Also, the
  semantics of the interface should be language independent.
- Provide a reliable communication interface. The user need not cope with
  communication failures.
- Define an interface not too different from current practice, such as PVM, and
  provide extensions that allow greater flexibility.
- Define an interface that can be implemented on many vendors' platforms,
  with no significant changes in the underlying communication and system
  software.
- The interface should be designed to allow for thread-safety.
Initialize and Terminate MPI library
MPI_Init and MPI_Finalize are the two MPI library routines used for
initializing and terminating the MPI library.
MPI_Init initializes the MPI execution environment. This function must be
called in every MPI program, must be called before any other MPI functions and
must be called only once in an MPI program. For C programs, MPI_Init may be
used to pass the command line arguments to all processes. Its syntax is
int MPI_Init(int *argc, char ***argv)
The arguments argc and argv of MPI_Init are pointers to the command-line
arguments of the C program. When our program does not use command line
arguments, we can just pass NULL for both. Upon successful execution, MPI_Init
returns MPI_SUCCESS; otherwise it returns an implementation-defined error
code.
MPI_Finalize terminates the MPI execution environment. This function should
be the last MPI routine called in every MPI program and no other MPI routines
may be called after it. Its syntax is
int MPI_Finalize()
The call to MPI_Finalize tells the MPI system that we are done using MPI, and
that any resources allocated for MPI can be freed.
A typical MPI program has the following basic outline:
. . .
#include <mpi.h>
. . .
int main(int argc, char* argv[])
{ . . .
/* No MPI calls before this */
MPI_Init(&argc, &argv);
. . .
MPI_Finalize();
/* No MPI calls after this */
. . .
return 0;
}
Communicators
In MPI a communicator is a set of processes that are allowed to communicate
with each other. This set of processes form a communication domain.
Information about a communicator is stored in variables of type MPI_Comm.
Communicators are used as arguments to all message transfer MPI routines and
they uniquely identify the processes participating in the message transfer
operation. Note that each process can belong to many different, possibly
overlapping, communicators.
MPI defines a default communicator called MPI_COMM_WORLD which includes all
the processes involved in the parallel execution. However, in many cases we
want to perform communication only within (possibly overlapping) groups of
processes. By using a different communicator for each such group, we can ensure
that no messages will ever interfere with messages destined to any other group.
The MPI_Comm_size and MPI_Comm_rank functions are used to determine the
number of processes and the label of the calling process, respectively. The calling
sequences of these routines are as follows:
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
The function MPI_Comm_size returns in the variable size the number of
processes that belong to the communicator comm.
Every process that belongs to a communicator is uniquely identified by its rank.
The rank of a process is an integer that ranges from zero up to the size of the
communicator minus one. A process can determine its rank in a communicator
by using the MPI_Comm_rank function that takes two arguments: the
communicator and an integer variable rank. Upon return, the variable rank
stores the rank of the process. Note that each process that calls either one of
these functions must belong to the supplied communicator; otherwise an error
will occur.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int npes, myrank;
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  printf("From process %d out of %d, Hello World!\n", myrank, npes);
  MPI_Finalize();
  return 0;
}
(Note on Compiling and Executing MPI programs: For executing an MPI
program on a Unix like system where the MPI headers are preinstalled, you can
use the mpicc compiler by issuing the command:
mpicc hellow.c -o hellow.exe
and can be executed by the command:
mpiexec -n 4 ./hellow.exe
(-n 4 means that it is to be run with four processes)
The output of the execution looks like:
From process 1 out of 4, Hello World!
From process 2 out of 4, Hello World!
From process 3 out of 4, Hello World!
From process 0 out of 4, Hello World!
mpicc can also be used on a Windows system with the Cygwin environment (from
Cygnus Solutions). You download a small setup.exe from the Cygwin site which
helps you select the needed components. You can download cygwin setup from
the links http://www.cygwin.com/setup-x86.exe (win-32 bit version) or
http://www.cygwin.com/setup-x86_64.exe (win-64 bit version). On running
the setup, it will download and install the cygwin packages for you
Cygwin setup will ask you for the installation folder, as setups often do. It will
offer you c:\cygwin as the default one.
At some point of the installation, you will see the list of available components by
categories. You will have to add some by clicking on 'Skip' and changing it to the
version you want to install. Some of them are required, like:
Archive - zip
Archive - unzip
Devel - binutils
Devel - gcc-core
Devel - gcc-g++
Devel - git
Devel - libiconv
Devel - mingw-gcc-core
Devel - mingw-gcc-g++
Devel - mingw-pthread
Devel - mingw-runtime
Devel - make
Devel - win32api-headers
Devel - win32api-runtime
Editors - ed
Editors - emacs
Editors - emacs-win32
Editors - nano
Web - wget
The second step is to install MPICH2. Download MPICH2 from
http://www.mcs.anl.gov/research/projects/mpich2staging/goodell/downloads/tarballs/1.5/mpich2-1.5.tar.gz
(Not the compiled Windows version, remember, we want to run it on Cygwin). Put the mpich2-1.5.tar.gz in
c:\cygwin (or cygwin64)\usr\local\bin
or
you may open the cygwin console and work as you would on Linux (if you
installed wget from the previous list):
cd /usr/local/bin/
wget http://www.mcs.anl.gov/research/projects/mpich2staging/goodell/downloads/tarballs/1.5/mpich2-1.5.tar.gz

(or whatever the current version is). Unpack it:
tar zxvf mpich2-1.5.tar.gz
Now, install:
mkdir /home/your-username/mpich2-install
(your-username is your Linux account name, change accordingly)
cd /usr/local/bin/mpich2-1.5
./configure --disable-f77 --disable-fc --prefix=/home/your-username/mpich2-install
(don't forget to change your-username. We are skipping the installation of
FORTRAN compiler)
This will take a while even on really fast computers. But you will eventually get
the
Configuration completed
message. Two more lengthy steps remain, in the same folder:
make
and...
make install
Finally, we have to add the path to the binaries to .bash_profile:
cd
(goes to your home folder)
nano .bash_profile
(or whichever editor you chose like emacs, emacs-w32, etc, or, you can even edit
it through a Windows editor)
Add the following line:
export PATH=/home/your-username/mpich2-install/bin:${PATH}
(change your-username). And save the file...
Let us just run a "Hello, World!" example. We will run it on several machines
and they will all say "Hello World":
cd
cp /usr/local/bin/mpich2-1.5/examples/hellow.c hellow.c
mpicc hellow.c -o hellow.exe
This program can be executed by the command,
mpiexec -n 3 ./hellow.exe
(-n 3 means that it is to be run with three processes)
Note ends here.)
MPI send and receive functions
The basic functions for sending and receiving messages in MPI are the MPI_Send
and MPI_Recv, respectively.
The syntax of MPI_Send and MPI_Recv are:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_Send sends the data stored in the buffer pointed by buf. This buffer consists
of consecutive entries of the type specified by the parameter datatype. The
number of entries in the buffer is given by the parameter count.
Since C types cannot be passed as arguments to functions, MPI defines an
equivalent MPI datatype for all C datatypes, as listed below:
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED
MPI allows two additional datatypes that are not part of the C language. These
are MPI_BYTE and MPI_PACKED. MPI_BYTE corresponds to a byte (8 bits) and
MPI_PACKED corresponds to a collection of data items that has been created by
packing non-contiguous data.
(Note that the length of the message in MPI_Send, as well as in other MPI
routines, is specified in terms of the number of entries being sent and not in terms
of the number of bytes. Specifying the length in terms of the number of entries has
the advantage of making the MPI code portable, since the number of bytes used to
store various datatypes can be different for different architectures.)
The destination of the message sent by MPI_Send is uniquely specified by the
dest and comm arguments. The dest argument is the rank of the destination
process in the communication domain specified by the communicator comm. Each
message has an integer-valued tag associated with it. This is used to distinguish
different types of messages. The message-tag can take values ranging from zero
up to the MPI defined constant MPI_TAG_UB. Even though the value of
MPI_TAG_UB is implementation specific, it is at least 32,767.
MPI_Recv receives a message sent by a process whose rank is given by the
source in the communication domain specified by the comm argument. The tag of
the sent message must be that specified by the tag argument. If there are many
messages with identical tag from the same process, then any one of these
messages is received.
MPI allows specification of wildcard arguments for both source and tag. If
source is set to MPI_ANY_SOURCE, then any process of the communication domain
can be the source of the message. Similarly, if tag is set to MPI_ANY_TAG, then
messages with any tag are accepted. The received message is stored in
continuous locations in the buffer pointed to by buf. The count and datatype
arguments of MPI_Recv are used to specify the length of the supplied buffer. The
received message should be of length equal to or less than this length. This
means that the receiving process need not know the exact size of the message being sent in advance.
If the received message is larger than the supplied buffer, then an overflow error
will occur, and the routine will return the error MPI_ERR_TRUNCATE.
The MPI_Recv returns only after the requested message has been received and
copied into the buffer. That is, MPI_Recv is a blocking receive operation.
However, MPI allows two different implementations for MPI_Send. In the first
implementation, MPI_Send returns only after the corresponding MPI_Recv have
been issued and the message has been sent to the receiver. In the second
implementation, MPI_Send first copies the message into a buffer and then
returns, without waiting for the corresponding MPI_Recv to be executed. In
either implementation, the buffer that is pointed by the buf argument of
MPI_Send can be safely reused and overwritten. MPI programs must be able to
run correctly regardless of which of the two methods is used for implementing
MPI_Send.
Suppose process q calls MPI_Send with
MPI_Send(send_buf, send_buf_sz, send_datatype, dest, send_tag,
send_comm)
Also suppose that process r calls MPI_Recv with
MPI_Recv(rec_buf, recv_buf_sz, recv_datatype, src, recv_tag, recv_
comm, &status)
Then the message sent by q with the above call to MPI_Send can be received by r
with the call to MPI_Recv if
recv_comm = send_comm,
recv_tag = send_tag,
dest = r, and
src = q.
Besides the above conditions, to receive a message successfully, the parameters
specified by the first three pairs of arguments, send_buf/recv_buf,
send_buf_sz/recv_buf_sz, and send_type/recv_type must specify compatible
buffers.
The following code illustrates the usage of the MPI_Send and MPI_Recv functions,
where every process with a rank greater than 0 sends a greeting message to the
process with rank 0.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void)
{ char greeting[MAX_STRING];
int comm_sz, my_rank, q;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank != 0)
{ sprintf(greeting, "Greetings from process %d of %d!\n",
my_rank, comm_sz);
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0,
MPI_COMM_WORLD);
}
else
{ for (q = 1; q < comm_sz; q++)
{ MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%s\n", greeting);
}
}
MPI_Finalize();
return 0;
}
Compiling and executing this program (named mpi3.c) produces the following output:
mpicc mpi3.c -o mpi3.exe
mpiexec -n 4 ./mpi3.exe
Greetings from process 1 of 4!
Greetings from process 2 of 4!
Greetings from process 3 of 4!
In the following example, the process with rank 0 sends greeting messages to all
other processes with rank greater than 0.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void)
{ char greeting[MAX_STRING];
int comm_sz, my_rank, q;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank == 0)
{ sprintf(greeting, "greetings from process %d of %d!\n",
my_rank, comm_sz);
for(q=1;q<comm_sz;q++)
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, q, 0,
MPI_COMM_WORLD);
}
else
{ MPI_Recv(greeting, MAX_STRING, MPI_CHAR, MPI_ANY_SOURCE, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("%s\n", greeting);
}
MPI_Finalize();
return 0;
}
MPI_Status data structure
The general syntax of the MPI_Recv function is
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Status *status)
In this command, source and tag arguments can use wild card arguments
MPI_ANY_SOURCE and MPI_ANY_TAG, respectively. This means that the receiver
can receive messages from any source with any tag. In other words, a receiver
can receive a message without knowing
the amount of data in the message,
the sender of the message, or
the tag of the message.
After a message has been received, the receiver can use the status argument to
get information about the MPI_Recv operation. The status argument of the
MPI_Recv has type MPI_Status. The MPI type MPI_Status is a struct with three
fields, as follows:
typedef struct {
   int MPI_SOURCE;
   int MPI_TAG;
   int MPI_ERROR;
} MPI_Status;
MPI_SOURCE and MPI_TAG store the source and the tag of the received message.
They are particularly useful when MPI_ANY_SOURCE and MPI_ANY_TAG are used
for the source and tag arguments. MPI_ERROR stores the error-code of the
received message.
Suppose our program contains the definition
MPI_Status stat;
Then, after a call to MPI_Recv in which &stat is passed as the last argument, we
can determine the sender and tag by examining the two members
stat.MPI_SOURCE
stat.MPI_TAG
The status argument of MPI_Recv also returns information about the length of
the received message. This information is not directly accessible from the status
variable, but it can be retrieved by calling the MPI_Get_count function. The
calling sequence of this function is as follows:
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int
*count)
MPI_Get_count takes as arguments the status returned by MPI_Recv and the
type of the received data in datatype, and returns the number of entries that
were actually received in the count variable.
For example, suppose that in our call to MPI_Recv, the type of the receive buffer
is recv_type and, once again, we passed in &stat. Then the call
MPI_Get_count(&stat, recv_type, &count)
will return the number of elements received in the count argument.
The following example illustrates the usage of the various fields of the
MPI_Status struct and MPI_Get_count function.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void)
{ char greeting[MAX_STRING];
int comm_sz, my_rank, q, count;
MPI_Status stat;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank == 0)
{ sprintf(greeting, "Greetings from process %d of %d!\n",
my_rank, comm_sz);
for(q=1;q<comm_sz;q++)
MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, q, 0,
MPI_COMM_WORLD);
}
else
{ if(MPI_Recv(greeting, MAX_STRING, MPI_CHAR, MPI_ANY_SOURCE,
MPI_ANY_TAG, MPI_COMM_WORLD, &stat)==MPI_SUCCESS)
{ MPI_Get_count(&stat, MPI_CHAR, &count);
printf("Received a message of %d characters from process
with rank %d and tag %d and the message has been received
successfully\n", count, stat.MPI_SOURCE, stat.MPI_TAG);
printf("The Message is \"%s\"", greeting);
}
else
printf("Message received from process with rank %d and tag %d
has some error and the error code is %d", stat.MPI_SOURCE,
stat.MPI_TAG, stat.MPI_ERROR);
}
MPI_Finalize();
return 0;
}
Running the program with 2 processes generates the following output:
mpiexec -n 2 ./mpi5.exe
Received a message of 32 characters from process with rank 0 and tag
0 and the message has been received successfully
The Message is "Greetings from process 0 of 2!
Deadlocks
Non-judicious use of the MPI_Send and MPI_Recv functions can result in deadlock
situations, whereby a sender/receiver waits indefinitely for the corresponding
receiver/sender. Deadlocks can arise in three situations:
1. Blocking send implementations
2. A process sends a message to itself
3. Circular arrangements of MPI_Send and MPI_Recv
These situations are explained below in detail.
The semantics of MPI_Send and MPI_Recv place some restrictions on how we can
mix and match send and receive operations. For example, consider the
following piece of code in which process 0 sends two messages with different tags
to process 1, and process 1 receives them in the reverse order.
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{ MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
  MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1)
{ MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
  MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is implemented using buffering, then this code will run correctly
provided that sufficient buffer space is available. However, if MPI_Send is
implemented by blocking until the matching MPI_Recv has been issued, then
neither of the two processes will be able to proceed. This is because process zero
(i.e., myrank == 0 ) will wait until process one issues the matching MPI_Recv
(i.e., the one with tag equal to 1), and at the same time process one will wait
until process zero performs the matching MPI_Send (i.e., the one with tag equal
to 2).
This code fragment is not safe, as its behavior is implementation dependent. It is
up to the programmer to ensure that his or her program will run correctly on any
MPI implementation. The problem in this program can be corrected by matching
the order in which the send and receive operations are issued.
Similar deadlock situations can also occur when a process sends a message to
itself. Even though this is legal, its behavior is implementation dependent and
must be avoided.
Improper use of MPI_Send and MPI_Recv can also lead to deadlocks in situations
when each processor needs to send and receive a message in a circular fashion.
Consider the following piece of code, in which process i sends a message to
process i + 1 (modulo the number of processes) and receives a message from
process i - 1 (modulo the number of processes), as shown in the following
example.
int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...
When MPI_Send is implemented using buffering, the program will work correctly,
since every call to MPI_Send will get buffered, allowing the call of the MPI_Recv
to be performed, which will transfer the required data. However, if MPI_Send
blocks until the matching receive has been issued, all processes will enter an
infinite wait state, waiting for the neighboring process to issue a MPI_Recv
operation. Note that the deadlock still remains even when we have only two
processes. Thus, when pairs of processes need to exchange data, the above
method leads to an unsafe program. The above example can be made safe, by
rewriting it as follows:
int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1)
{ MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
  MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
}
else
{ MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
  MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}
...
This new implementation partitions the processes into two groups. One consists
of the odd numbered processes and the other of the even-numbered processes.
The odd-numbered processes perform a send followed by a receive, and the even-
numbered processes perform a receive followed by a send. Thus, when an odd-
numbered process calls MPI_Send, the target process (which has an even number)
will call MPI_Recv to receive that message, before attempting to send its own
message.
Sending and Receiving Messages Simultaneously
Exchanging messages, in which a process both sends data to one process and
receives data from another, is a communication pattern that appears frequently
in message-passing programs. For this reason, MPI provides the MPI_Sendrecv
function, which both sends and receives a message.
MPI_Sendrecv does not suffer from the circular deadlock problems of MPI_Send
and MPI_Recv, because it allows the send and the receive to proceed
simultaneously. The syntax of MPI_Sendrecv is the following:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, int dest, int sendtag, void *recvbuf, int
recvcount, MPI_Datatype recvdatatype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
The arguments of MPI_Sendrecv are essentially the combination of the
arguments of MPI_Send and MPI_Recv. The send and receive buffers must be
disjoint, and the source and destination of the messages can be the same or
different.
The following code illustrates the usage of MPI_Sendrecv for cyclic
communication without deadlock. Note that, as required, separate send and
receive buffers are used.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void)
{ char greeting[MAX_STRING], received[MAX_STRING];
  int comm_sz, my_rank;
  MPI_Status stat;
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  sprintf(greeting, "Greetings!");
  /* send to the next process, receive from the previous one;
     note the disjoint send and receive buffers */
  MPI_Sendrecv(greeting, MAX_STRING, MPI_CHAR, (my_rank+1)%comm_sz, 1,
      received, MAX_STRING, MPI_CHAR, (my_rank-1+comm_sz)%comm_sz, 1,
      MPI_COMM_WORLD, &stat);
  printf("%s Received from process %d by process %d\n",
      received, stat.MPI_SOURCE, my_rank);
  MPI_Finalize();
  return 0;
}
Running the program with 4 processes gives the following output:
mpiexec -n 4 ./mpi6.exe
Greetings! Received from process 3 by process 0
Greetings! Received from process 0 by process 1
Greetings! Received from process 1 by process 2
Greetings! Received from process 2 by process 3
In many programs, the requirement that the send and receive buffers of
MPI_Sendrecv be disjoint may force us to use a temporary buffer. This increases
the amount of memory required by the program and also increases the overall
run time due to the extra copy. This problem can be solved by using the
MPI_Sendrecv_replace function. This function performs a blocking send and
receive, but it uses a single buffer for both the send and the receive operation.
That is, the received data replaces the data that was sent out of the buffer.
The syntax of this function is the following:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype
datatype, int dest, int sendtag, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
Note that both the send and receive operations must transfer data of the same
datatype.
The following program illustrates the usage of MPI_Sendrecv_replace function.
#include <stdio.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void)
{ char greeting[MAX_STRING];
  int comm_sz, my_rank;
  MPI_Status stat;
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  sprintf(greeting, "Greetings!");
  /* the same buffer is used for both the send and the receive */
  MPI_Sendrecv_replace(greeting, MAX_STRING, MPI_CHAR,
      (my_rank+1)%comm_sz, 1, (my_rank-1+comm_sz)%comm_sz, 1,
      MPI_COMM_WORLD, &stat);
  printf("%s Received from process %d by process %d\n",
      greeting, stat.MPI_SOURCE, my_rank);
  MPI_Finalize();
  return 0;
}
Non-Blocking Communication Operations
In order to overlap communication with computation, MPI provides a pair of
functions for performing non-blocking send and receive operations. These
functions are MPI_Isend and MPI_Irecv. MPI_Isend starts a send operation but
does not wait for it to complete; that is, it may return before the data is copied
out of the buffer. Similarly, MPI_Irecv starts a receive operation but returns before the data has
been received and copied into the buffer. With the support of appropriate
hardware, the transmission and reception of messages can proceed concurrently
with the computations performed by the program upon the return of the above
functions.
However, at a later point in the program, a process that has started a non-
blocking send or receive operation must make sure that this operation has
completed before it proceeds with its computations. This is because a process
that has started a non-blocking send operation may want to overwrite the buffer
that stores the data that are being sent, or a process that has started a non-
blocking receive operation may want to use the data it requested. To check the
completion of non-blocking send and receive operations, MPI provides a pair of
functions MPI_Test and MPI_Wait. The first tests whether or not a non-blocking
operation has finished and the second waits (i.e., gets blocked) until a non-
blocking operation actually finishes. The calling sequences of MPI_Isend and
MPI_Irecv are the following:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Request *request)
Note that these functions have similar arguments as the corresponding blocking
send and receive functions. The main difference is that they take an additional
argument request.
MPI_Isend and MPI_Irecv functions allocate a request object and return a
pointer to it in the request variable. This request object is used as an argument
in the MPI_Test and MPI_Wait functions to identify the operation whose status
we want to query or to wait for its completion.
Note that the MPI_Irecv function does not take a status argument similar to the
blocking receive function, but the status information associated with the receive
operation is returned by the MPI_Test and MPI_Wait functions.
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_Test tests whether or not the non-blocking send or receive operation
identified by its request has finished. It returns flag = {true} (non-zero value in
C) if it completed, otherwise it returns {false} (a zero value in C). In the case that
the non-blocking operation has finished, the request object pointed to by request
is deallocated and request is set to MPI_REQUEST_NULL. Also the status object is
set to contain information about the operation. If the operation has not finished,
request is not modified and the value of the status object is undefined. The
MPI_Wait function blocks until the non-blocking operation identified by request
completes. In that case it deallocates the request object, sets it to
MPI_REQUEST_NULL, and returns information about the completed operation in
the status object.
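The following is a minimal sketch (not taken from the notes' example set) of how these functions fit together: each process posts a non-blocking receive and a non-blocking send for a ring exchange, could perform useful computation in between, and then waits for both operations before touching the buffers. The variable names are illustrative only.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int npes, myrank, sendval, recvval;
  MPI_Request reqs[2];
  MPI_Status stats[2];
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  sendval = myrank;
  /* post the receive and the send; neither call blocks */
  MPI_Irecv(&recvval, 1, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(&sendval, 1, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD, &reqs[1]);
  /* ... computation that does not touch sendval or recvval can overlap here ... */
  MPI_Wait(&reqs[0], &stats[0]);   /* block until the receive has completed */
  MPI_Wait(&reqs[1], &stats[1]);   /* block until the send has completed */
  printf("Process %d received %d\n", myrank, recvval);
  MPI_Finalize();
  return 0;
}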

Collective Communication and Computation Operations
MPI provides an extensive set of functions for performing many commonly used
collective communication operations. All of the collective communication
functions provided by MPI take as an argument a communicator that defines the
group of processes that participate in the collective operation. All the processes
that belong to this communicator participate in the operation, and all of them
must call the collective communication function. In some of the collective
functions data is required to be sent from a single process (source-process) or to
be received by a single process (target-process). In these functions, the source- or
target-process is one of the arguments supplied to the routines. All the processes
in the group (i.e., communicator) must specify the same source- or target-process.
For most collective communication operations, MPI provides two different
variants. The first transfers equal-size data to or from each process, and the
second transfers data that can be of different sizes.
Barrier
The barrier synchronization operation is performed in MPI using the
MPI_Barrier function.
int MPI_Barrier(MPI_Comm comm)
The only argument of MPI_Barrier is the communicator that defines the group of
processes that are synchronized. The call to MPI_Barrier returns only after all
the processes in the group have called this function.
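One common use of MPI_Barrier is to bracket a timed region so that all processes start and stop the measurement together. The sketch below assumes the standard MPI wall-clock timer MPI_Wtime, which is not discussed elsewhere in these notes; the structure is illustrative rather than a prescribed pattern.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int my_rank;
  double t_start, t_elapsed;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Barrier(MPI_COMM_WORLD);        /* all processes start timing together */
  t_start = MPI_Wtime();
  /* ... the work to be timed would go here ... */
  MPI_Barrier(MPI_COMM_WORLD);        /* wait until every process has finished the work */
  t_elapsed = MPI_Wtime() - t_start;
  if (my_rank == 0)
    printf("Elapsed time: %f seconds\n", t_elapsed);
  MPI_Finalize();
  return 0;
}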
Broadcast
The one-to-all broadcast operation can be performed in MPI using the MPI_Bcast
function.
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
int source, MPI_Comm comm)
MPI_Bcast sends the data stored in the buffer buf of process source to all the
other processes in the group. The data received by each process is stored in the
buffer buf. The data that is broadcast consist of count entries of type datatype.
The amount of data sent by the source process must be equal to the amount of
data that is being received by each process; i.e., the count and datatype fields
must match on all processes.
#include <stdio.h>
#include <mpi.h>
int main()
{ int rank;
  double a, b;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if(rank==0)       /* only the source process initializes the values */
  { a=50.0;
    b=70.0;
  }
  /* MPI_Bcast must be called by every process in the communicator */
  MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Bcast(&b, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if(rank!=0)
    printf("Inside the process %d: a=%f and b=%f\n",rank,a,b);
  MPI_Finalize();
  return 0;
}
Execution of the above program generates the following output:
$ mpicc mpi11.c -o mpi11.exe
$ mpiexec -n 4 ./mpi11.exe
Inside the process 1: a=50.000000 and b=70.000000
Inside the process 2: a=50.000000 and b=70.000000
Inside the process 3: a=50.000000 and b=70.000000
Reduction
The all-to-one reduction operation can be performed in MPI using the
MPI_Reduce function.
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype
datatype, MPI_Op op, int target, MPI_Comm comm)
MPI_Reduce combines the elements stored in the buffer sendbuf of each process
in the group, using the operation specified in op, and returns the combined
values in the buffer recvbuf of the process with rank target. Both the sendbuf
and recvbuf must have the same number of count items of type datatype.
Note that all processes must provide a recvbuf array, even if they are not the
target of the reduction operation. When count is more than one, then the
combine operation is applied element-wise on each entry of the sequence. All the
processes must call MPI_Reduce with the same value for count, datatype, op,
target, and comm.
MPI provides a list of predefined operations that can be used to combine the
elements stored in sendbuf. MPI also allows programmers to define their own
operations. The predefined operations are shown below:
Operation    Meaning                            Data-Types Supported
MPI_MAX      Maximum                            C integers and floating point
MPI_MIN      Minimum                            C integers and floating point
MPI_SUM      Sum                                C integers and floating point
MPI_PROD     Product                            C integers and floating point
MPI_LAND     Logical AND                        C integers
MPI_BAND     Bit-wise AND                       C integers and byte
MPI_LOR      Logical OR                         C integers
MPI_BOR      Bit-wise OR                        C integers and byte
MPI_LXOR     Logical XOR                        C integers
MPI_BXOR     Bit-wise XOR                       C integers and byte
MPI_MAXLOC   Maximum and location of maximum    Data-pairs
MPI_MINLOC   Minimum and location of minimum    Data-pairs
For example, in order to compute the maximum of the elements stored in
sendbuf, the MPI_MAX value must be used for the op argument. Not all of these
operations can be applied to all possible data-types supported by MPI. For
example, a bit-wise OR operation (i.e., op = MPI_BOR) is not defined for real-
valued data-types such as MPI_FLOAT and MPI_REAL. The last column of the
table shows the various data-types that can be used with each operation.
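As a brief, hedged illustration of the pair data-types, MPI_MAXLOC can be combined with the predefined pair type MPI_DOUBLE_INT to find both the largest value and the rank that contributed it; the per-process value below is made up purely for illustration.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int rank;
  struct { double val; int loc; } in, out;   /* layout expected by MPI_DOUBLE_INT */
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  in.val = 1.0/(rank + 1);                   /* some per-process value */
  in.loc = rank;                             /* remember which process owns it */
  MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("Maximum value %f was contributed by process %d\n", out.val, out.loc);
  MPI_Finalize();
  return 0;
}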
The following program illustrates the usage of MPI_Reduce for finding the
maximum in a linear array:
#include <stdio.h>
#include <stdlib.h>  /* for rand() and srand() */
#include <time.h>    /* for time() */
#include <mpi.h>
int main()
{ int rank, size,start, finish;
int a[100],i,big,biggest;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
srand(time(0));
if(rank==0)
{ printf("\nThe Original List is : ");
for(i=0;i<100;i++)
{ a[i]=((int)rand())%100;
printf("%d ",a[i]);
}
printf("\n");
}
MPI_Bcast(a, 100, MPI_INT, 0, MPI_COMM_WORLD);
start=rank*(100/size);
finish=(rank+1)*(100/size)-1;
printf("\nThe Segment Processed by Process-%d is \n",rank);
for(i=start;i<=finish;i++) printf("%d ",a[i]);
big=a[start];
for(i=start+1;i<=finish;i++)
if(big<a[i]) big=a[i];
printf("\nBiggest from Process %d is %d ",rank, big);
MPI_Reduce(&big, &biggest,1, MPI_INT, MPI_MAX, 0,
MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
if(rank==0) printf("\nThe Overall Biggest is : %d", biggest);
MPI_Finalize();
return 0;
}
When the result of the reduction operation is needed by all the processes, MPI
provides the MPI_Allreduce operation that returns the result to all the
processes. MPI_Allreduce operation has the following syntax.
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
The all-reduce operation is identical to performing an all-to-one reduction
followed by a one-to-all broadcast of the result.
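A minimal sketch of MPI_Allreduce is given below (the local value is invented purely for illustration): every process supplies a local integer and every process, not just a designated target, receives the global maximum.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int rank, local, global_max;
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  local = (rank*37)%11;              /* some per-process value, for illustration only */
  /* unlike MPI_Reduce, every process receives the reduced result */
  MPI_Allreduce(&local, &global_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
  printf("Process %d: local=%d, global maximum=%d\n", rank, local, global_max);
  MPI_Finalize();
  return 0;
}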
Prefix
Finding prefix sums (also known as the scan operation) is another important
problem that can be solved by using a communication pattern similar to that
used in allreduce operation. Given p numbers n0, n1, ..., np-1 (one on each node),
the problem is to compute the sums sk = n0 + n1 + ... + nk for all k between 0 and
p - 1. For example, if the original sequence of numbers is <3, 1, 4, 0, 2>, then the sequence
of prefix sums is <3, 4, 8, 8, 10>. Initially, nk resides on the node labeled k, and at
the end of the procedure, the same node holds sk. Instead of starting with a
single number, each node could start with a buffer or vector of size m and the
m-word result would be the sum of the corresponding elements of buffers.
The prefix-sum operation can be performed in MPI using the MPI_Scan function.
int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype
datatype, MPI_Op op, MPI_Comm comm)
MPI_Scan performs a prefix reduction of the data stored in the buffer sendbuf at
each process and returns the result in the buffer recvbuf. The receive buffer of
the process with rank i will store, at the end of the operation, the reduction of
the send buffers of the processes whose ranks range from 0 up to and including i.
The type of supported operations (i.e., op) as well as the restrictions on the
various arguments of MPI_Scan are the same as those for the reduction operation
MPI_Reduce.
The following example illustrates the usage of MPI_Scan function:
#include <stdio.h>
#include <mpi.h>
int main()
{ int rank, size;
int a=10,b,i;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Scan(&a, &b, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf("\nPrefix Sum of Process %d is %d ",rank, b);
MPI_Finalize();
return 0;
}
Scatter
In the scatter operation, a single node sends a unique message of size m to every
other node. This operation is also known as one-to-all personalized
communication. One-to-all personalized communication is different from one-to-
all broadcast in that the source node starts with p unique messages, one destined
for each node. Unlike one-to-all broadcast, one-to-all personalized communication
does not involve any duplication of data.
The scatter operation can be performed in MPI using the MPI_Scatter function.
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, void *recvbuf, int recvcount, MPI_Datatype
recvdatatype, int source, MPI_Comm comm)
The source process sends a different part of the send buffer sendbuf to each
process, including itself. The data that are received are stored in recvbuf.
Process i receives sendcount contiguous elements of type senddatatype starting
from the i * sendcount location of the sendbuf of the source process (assuming
that sendbuf is of the same type as senddatatype). MPI_Scatter must be called
by all the processes with the same values for the sendcount, senddatatype,
recvcount, recvdatatype, source, and comm arguments. Note that sendcount is
the number of elements sent to each individual process.
MPI provides a vector variant of the scatter operation, called MPI_Scatterv, that
allows different amounts of data to be sent to different processes.
int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
MPI_Datatype senddatatype, void *recvbuf, int recvcount,
MPI_Datatype recvdatatype, int source, MPI_Comm comm)
The parameter sendcount has been replaced by the array sendcounts that
determines the number of elements to be sent to each process. In particular, the
source process sends sendcounts[i] elements to process i. Also, the array
displs is used to determine where in sendbuf these elements will be sent from.
In particular, if sendbuf is of the same type as senddatatype, the data sent to
process i start at location displs[i] of array sendbuf. Both the sendcounts and
displs arrays are of size equal to the number of processes in the communicator.
Note that by appropriately setting the displs array we can use MPI_Scatterv to
send overlapping regions of sendbuf.
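The following is a small, hedged sketch of MPI_Scatterv (the triangular layout is invented for illustration and assumes a modest number of processes): the source sends i + 1 elements to process i, so the sendcounts and displs arrays describe a triangular partition of sendbuf.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int rank, size, i, recvcount;
  int sendbuf[100], recvbuf[100];
  int sendcounts[32], displs[32];          /* assumes at most 32 processes, for brevity */
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  for (i = 0; i < 100; i++) sendbuf[i] = i;   /* data only needed at the source */
  for (i = 0; i < size; i++)
  { sendcounts[i] = i + 1;                    /* process i receives i+1 elements */
    displs[i] = (i*(i + 1))/2;                /* starting offset of its block in sendbuf */
  }
  recvcount = rank + 1;
  MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
               recvbuf, recvcount, MPI_INT, 0, MPI_COMM_WORLD);
  printf("Process %d received %d elements starting with %d\n",
         rank, recvcount, recvbuf[0]);
  MPI_Finalize();
  return 0;
}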
The following program illustrates the usage of MPI_Scatter function.
#include <stdio.h>
#include <stdlib.h>  /* for rand() and srand() */
#include <time.h>    /* for time() */
#include <mpi.h>
int main()
{ int rank, size,start, finish;
int a[100], b[10],i,big,biggest;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
srand(time(0));
if(rank==0)
{ printf("\nThe Original List is : ");
for(i=0;i<100;i++)
{ a[i]=((int)rand())%100;
printf("%d ",a[i]);
}
printf("\n");
}
MPI_Scatter(a, 10, MPI_INT, b, 10, MPI_INT, 0, MPI_COMM_WORLD);
printf("\nThe Segment Processed by Process-%d is \n",rank);
for(i=0;i<10;i++) printf("%d ",b[i]);
big=b[0];
for(i=1;i<10;i++)
if(big<b[i]) big=b[i];
printf("\nBiggest from Process %d is %d ",rank, big);
MPI_Reduce(&big, &biggest,1, MPI_INT, MPI_MAX, 0,
MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
if(rank==0) printf("\nThe Overall Biggest is : %d", biggest);
MPI_Finalize();
return 0;
}
Gather
The dual of one-to-all personalized communication or the scatter operation
is the gather operation, or concatenation, in which a single node collects a
unique message from each node. A gather operation is different from an all-to-
one reduce operation in that it does not involve any combination or reduction of
data. The scatter and gather operations are illustrated below.

The gather operation can be performed in MPI using the MPI_Gather function.
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, void *recvbuf, int recvcount, MPI_Datatype
recvdatatype, int target, MPI_Comm comm)
Each process, including the target process, sends the data stored in the array
sendbuf to the target process. As a result, if p is the number of processes in the
communicator comm, the target process receives a total of p buffers. The data is
stored in the array recvbuf of the target process, in a rank order. That is, the
data from process with rank i are stored in the recvbuf starting at location i *
sendcount (assuming that the array recvbuf is of the same type as
recvdatatype).
The data sent by each process must be of the same size and type. That is,
MPI_Gather must be called with the sendcount and senddatatype arguments
having the same values at each process. The information about the receive
buffer, its length and type applies only for the target process and is ignored for
all the other processes. The argument recvcount specifies the number of
elements received by each process and not the total number of elements it
receives. So, recvcount must be the same as sendcount and their datatypes
must be matching. MPI also provides the MPI_Allgather function in which the
data are gathered to all the processes and not only at the target process.
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, void *recvbuf, int recvcount, MPI_Datatype
recvdatatype, MPI_Comm comm)
The meanings of the various parameters are similar to those for MPI_Gather;
however, each process must now supply a recvbuf array that will store the
gathered data. In addition to the above versions of the gather operation, in which
the sizes of the arrays sent by each process are the same, MPI also provides
versions in which the size of the arrays can be different. MPI refers to these
operations as the vector variants. The vector variants of the MPI_Gather and
MPI_Allgather operations are provided by the functions MPI_Gatherv and
MPI_Allgatherv, respectively.
int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, void *recvbuf, int *recvcounts, int *displs,
MPI_Datatype recvdatatype, int target, MPI_Comm comm)
int MPI_Allgatherv(void *sendbuf, int sendcount, MPI_Datatype
senddatatype, void *recvbuf, int *recvcounts, int *displs,
MPI_Datatype recvdatatype, MPI_Comm comm)
These functions allow a different number of data elements to be sent by each
process by replacing the recvcount parameter with the array recvcounts. The
amount of data sent by process i is equal to recvcounts[i]. Note that the size of
recvcounts is equal to the size of the communicator comm. The array parameter
displs, which is also of the same size, is used to determine where in recvbuf the
data sent by each process will be stored. In particular, the data sent by process i
are stored in recvbuf starting at location displs[i]. Note that, as opposed to
the non-vector variants, the sendcount parameter can be different for different
processes.
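The following is a minimal sketch of MPI_Gather (the contributed value, the rank squared, is chosen purely for illustration): each process sends one integer and process 0 receives them in rank order.
#include <stdio.h>
#include <mpi.h>
int main(void)
{ int rank, size, i, local;
  int gathered[128];                  /* assumes at most 128 processes, for brevity */
  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  local = rank*rank;                  /* the value contributed by this process */
  /* each process sends one int; process 0 receives one int from every process */
  MPI_Gather(&local, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (rank == 0)
  { printf("Gathered values:");
    for (i = 0; i < size; i++) printf(" %d", gathered[i]);
    printf("\n");
  }
  MPI_Finalize();
  return 0;
}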











1. Programming Shared Address Space
Platforms
1.1. Introduction
Explicit parallel programming requires specification of parallel tasks along with
their interactions. These interactions may be in the form of synchronization
between concurrent tasks or communication of intermediate results. In shared
address space architectures, communication is implicitly specified since some (or
all) of the memory is accessible to all the processors. Thus, programming
paradigms for shared address space machines focus on constructs for expressing
concurrency and synchronization.
Different shared address space programming paradigms vary on mechanisms for
data sharing, concurrency models, and support for synchronization. For ensuring
protection in multiuser systems, process based models assume that all data
associated with a process is private, by default, unless otherwise specified.
However, private memory is not necessary when multiple concurrent aggregates
are cooperating to solve the same problem. The overheads associated with
enforcing protection make processes less suitable for parallel programming. In
contrast, lightweight processes and threads assume that all memory is global. By
relaxing the protection requirement, lightweight processes and threads support
parallel processing. As a result, this is the preferred model for parallel
programming.
1.2. Thread Basics
A thread is a single stream of control in the flow of a program.
As an example, consider the following code segment that computes the product of
two matrices of size n × n.
for (row = 0; row < n; row++)
   for (column = 0; column < n; column++)
      c[row][column] = dot_product(get_row(a, row), get_col(b, column));
The for loop in this code fragment has n² iterations, each of which can be
executed independently. Such an independent sequence of instructions is
referred to as a thread. In the example presented above, there are n² threads,
one for each iteration of the for-loop. Since each of these threads can be executed
independently of the others, they can be scheduled concurrently on multiple
processors. We can transform the above code segment as follows:
for (row = 0; row < n; row++)
   for (column = 0; column < n; column++)
      c[row][column] = create_thread(dot_product(get_row(a, row),
                                                 get_col(b, column)));
Here, we use a function, create_thread, to provide a mechanism for specifying a
C function as a thread. The underlying system can then schedule these threads
on multiple processors. To execute this code fragment on multiple processors,
each processor must have access to matrices a, b, and c. This is accomplished via
a shared address space which is globally accessible to every thread as illustrated
in the figure given below. However, since threads are invoked as function calls,
the stack corresponding to the function call is generally treated as being local to
the thread. In the logical machine model illustrated below, the memory modules
M hold thread-local (stack allocated) data.

Fig: The logical machine model of a thread-based programming paradigm.
Threaded programming models offer significant advantages over message-
passing programming models along with some disadvantages as well.
Software Portability: Threaded applications can be developed on serial
machines and run on parallel machines without any changes. This ability to
migrate programs between diverse architectural platforms is a very significant
advantage of threaded APIs. It has implications not just for software utilization
but also for application development since supercomputer time is often scarce
and expensive.
Latency Hiding: One of the major overheads in programs (both serial and
parallel) is the access latency for memory access, I/O, and communication. By
allowing multiple threads to execute on the same processor, threaded APIs
enable this latency to be hidden. In effect, while one thread is waiting for a
communication operation, other threads can utilize the CPU, thus masking
associated overhead.
Scheduling and Load Balancing: While writing shared address space parallel
programs, a programmer must express concurrency in a way that minimizes
overheads of remote interaction and idling. While in many structured
applications the task of allocating equal work to processors is easily
accomplished, in unstructured and dynamic applications (such as game playing
and discrete optimization) this task is more difficult. Threaded APIs allow the
programmer to specify a large number of concurrent tasks and support system-
level dynamic mapping of tasks to processors with a view to minimizing idling
overheads. By providing this support at the system level, threaded APIs free the
programmer of the burden of explicit scheduling and load balancing.
Ease of Programming, Widespread Use: Due to the above-mentioned
advantages, threaded programs are significantly easier to write than
corresponding programs using message passing APIs. Achieving identical levels
of performance for the two programs may require additional effort, however.
2. The POSIX Thread API (Pthreads)
A number of vendors provide vendor-specific thread APIs. The IEEE specifies a
standard API, called POSIX threads (or Pthreads), for threaded parallelism.
POSIX (Portable Operating System Interface) is a standard for Unix-like
operating systems, for example, Linux and Mac OS X. It specifies a variety of
facilities that should be available in such
systems, including an application programming interface (API) for
multithreaded programming.
Pthreads is not a programming language. Rather, like MPI, Pthreads specifies a
library that can be linked with C programs. Unlike MPI, the Pthreads API is
only available on POSIX systems (Linux, Mac OS X, Solaris, and so on). Also
unlike MPI, there are a number of other widely used specifications for
multithreaded programming: Java threads, Windows threads, Solaris threads.
However, all of the thread specifications support the same basic ideas, so once
you have learned how to program in Pthreads, it will not be difficult to learn how
to program another thread API.
As an example, take a look at the following Pthreads program. The program
shows a program in which the main function starts up several threads. Each
thread prints a message and then quits.
1.  #include <stdio.h>
2.  #include <stdlib.h>
3.  #include <pthread.h>
4.  /* Global variable: accessible to all threads */
5.  int thread_count;
6.  void* Hello(void* rank); /* Thread function */
7.  int main(int argc, char* argv[])
8.  { long thread; /* Use long in case of a 64-bit system */
9.    pthread_t* thread_handles;
10.   /* Get number of threads from command line */
11.   thread_count = strtol(argv[1], NULL, 10);
12.   thread_handles = malloc(thread_count*sizeof(pthread_t));
13.   for (thread = 0; thread < thread_count; thread++)
14.     pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);
15.   printf("Hello from the main thread\n");
16.   for (thread = 0; thread < thread_count; thread++)
17.     pthread_join(thread_handles[thread], NULL);
18.   free(thread_handles);
19.   return 0;
20. }/* main */
21. void* Hello(void* rank)
22. { long my_rank = (long) rank;
23.   /* Use long in case of a 64-bit system */
24.   printf("Hello from thread %ld of %d\n", my_rank, thread_count);
25.   return NULL;
26. }/* Hello */
This is just a C program with a main function and one other function. Besides
the usual header files like stdio.h and stdlib.h, in Line 3 we include
pthread.h, the Pthreads header file, which declares the various Pthreads
functions, constants, types, and so on.
This program can be compiled in gcc using the command
gcc thread.c -o thread.exe -lpthread
and to execute the program
./thread <number of threads>
In Line 5 we define a global variable thread_count. In Pthreads programs,
global variables are shared by all the threads. Local variables and function
arguments (that is, variables declared in functions) are (ordinarily) private to
the thread executing the function. If several threads are executing the same
function, each thread will have its own private copies of the local variables and
function arguments. This is because each thread has its own stack.
In Line 11 the program gets the number of threads it should start from the
command line. Unlike MPI programs, Pthreads programs are typically compiled
and run just like serial programs, and one relatively simple way to specify the
number of threads that should be started is to use a command-line argument.
This is not a requirement; it is simply a convenient convention we will be using.
The strtol function converts a string into a long int. It is declared in
stdlib.h, and its syntax is
long strtol(
const char* number_p /* in */,
char** end_p /* out */,
int base /* in */);
It returns a long int corresponding to the string referred to by number_p. The base
of the representation of the number is given by the base argument. If end_p is not
NULL, it will point to the first invalid (that is, nonnumeric) character in number_p.
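A tiny sketch of strtol in isolation (the input string is made up for illustration):
#include <stdio.h>
#include <stdlib.h>
int main(void)
{ char *end_p;
  const char *arg = "8 threads";            /* e.g., text taken from the command line */
  long nthreads = strtol(arg, &end_p, 10);  /* nthreads becomes 8 */
  printf("value = %ld, rest = \"%s\"\n", nthreads, end_p);  /* rest = " threads" */
  return 0;
}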
Starting the Threads
Unlike MPI programs, in which the processes are usually started by a script, in
Pthreads the threads are started by the program executable. This introduces a
bit of additional complexity, as we need to include code in our program to
explicitly start the threads, and we need data structures to store information on
the threads.
In Line 12 we allocate storage for one pthread_t object for each thread. The
pthread_t data structure is used for storing thread-specific information. It is
declared in pthread.h.
The pthread_t objects are examples of opaque objects. The actual data that they
store is system specific, and their data members are not directly accessible to
user code. However, the Pthreads standard guarantees that a pthread_t object
does store enough information to uniquely identify the thread with which it is
associated. So, for example, there is a Pthreads function that a thread can use to
retrieve its associated pthread_t object, and there is a Pthreads function that
can determine whether two threads are in fact the same by examining their
associated pthread_t objects.
In Lines 13-14, we use the pthread_create function to start the threads. Like
most Pthreads functions, its name starts with the string pthread_. The syntax of
pthread_create is
int pthread_create(
pthread_t* thread_p /* out */,
const pthread_attr_t* attr_p /* in */,
void* (*start_routine)(void*) /* in */,
void* arg_p /* in */);
The first argument is a pointer to the appropriate pthread_t object. Note that
the object is not allocated by the call to pthread_create; it must be allocated
before the call. We will not be using the second argument, so we just pass the
argument NULL in our function call. The third argument is the function that the
thread is to run, and the last argument is a pointer to the argument that should
be passed to the function start_routine. The return value for most Pthreads
functions indicates if there has been an error in the function call.
The function that is started by pthread_create should have a prototype of the
form:
void* thread_function(void* args_p);
Running the threads
The thread that is running the main function is sometimes called the main
thread. Hence, after starting the threads, it prints the message
Hello from the main thread
In the meantime, the threads started by the calls to pthread_create are also
running. They get their ranks by casting in Line 22, and then print their
messages. Note that when a thread is done, since the type of its function has a
return value, the thread should return something. In this example, the threads
don't actually need to return anything, so they return NULL.
In Pthreads, the programmer does not directly control where the threads are
run. There is no argument in pthread_create saying which core should run
which thread. Thread placement is controlled by the operating system. If a
program starts more threads than cores, we should expect multiple threads to be
run on a single core. However, if there is a core that is not being used, operating
systems will typically place a new thread on such a core.
Stopping the threads
In Lines 16 and 17, we call the function pthread_join once for each thread. A
single call to pthread_join will wait for the thread associated with the pthread_t
object to complete. The syntax of pthread_join is
int pthread_join(
pthread_t thread /* in */,
void** ret_val_p /* out */);
The second argument can be used to receive any return value computed by the
thread. So in our example, each thread executes a return and, eventually, the
main thread will call pthread_join for that thread to complete the termination.
This function is called pthread_join because of a diagramming style that is often
used to describe the threads in a multithreaded process. If we think of the main
thread as a single line in our diagram, then, when we call pthread_create, we
can create a branch or fork off the main thread. Multiple calls to
pthread_create will result in multiple branches or forks. Then, when the
threads started by pthread_create terminate, the diagram shows the branches
joining the main thread (see figure given below).

Matrix-Vector Multiplication
Now, let us write a Pthreads matrix-vector multiplication program. Recall that if
A = (aij) is an m × n matrix and x = (x0, x1, ..., xn-1)^T is an n-dimensional column
vector, then the matrix-vector product Ax = y is an m-dimensional column vector,
y = (y0, y1, ..., ym-1)^T, in which the i-th component yi is obtained by finding the
dot product of the i-th row of A with x:
yi = ai0*x0 + ai1*x1 + ... + ai,n-1*xn-1
The pseudo-code for a serial program for matrix-vector multiplication looks like:
/* For each row of A */
for (i = 0; i < m; i++)
{ y[i] = 0.0;
/* For each element of the row and each element of x */
for (j = 0; j < n; j++)
y[i] += A[i][j]* x[j];
}
We want to parallelize this by dividing the work among the threads. One
possibility is to divide the iterations of the outer loop among the threads. If we do
this, each thread will compute some of the components of y. For example,
suppose that m = n = 6 and the number of threads, thread_count, is three. Then
the computation could be divided among the threads as follows:
Thread Components of y
0 y[0], y[1]
1 y[2], y[3]
2 y[4], y[5]
To compute y[0], thread 0 will need to execute the code
y[0] = 0.0;
for (j = 0; j < n; j++)
y[0] += A[0][j]* x[j];
Thread 0 will therefore need to access every element of row 0 of A and every
element of x. More generally, the thread that has been assigned y[i] will need to
execute the code
y[i] = 0.0;
for (j = 0; j < n; j++)
y[i] += A[i][j]*x[j];

The thread assigned y[i] will therefore need to access every element of row i of A
and every element of x.
We need to decide what components of y will be computed by which thread. In
order to simplify the code, let us assume that both m and n are evenly divisible
by t. Our example with m = 6 and t = 3 suggests that each thread gets m/t
components. Furthermore, thread 0 gets the first m/t, thread 1 gets the next m/t,
and so on. Thus, the formulas for the components assigned to thread q might be
first component: q*(m/t), and last component: (q+1)*(m/t) - 1.
The complete program is given below. We are assuming that A, x, y, m, and n are
all global and shared.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count,a[10][10],x[10][1],y[10][1],m,n;
void* Pth_mat_vect(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Order of the matrix m and n : ");
scanf("%d%d", &m, &n);
printf("Enter the matrix: \n");
for(i=0;i<m;i++)
for(j=0;j<n;j++)
{ printf("Enter element (%d,%d) : ",i+1,j+1);
scanf("%d", &a[i][j]);
}
printf("Enter the Column Matrix: \n");
for(i=0;i<m;i++)
{ printf("Enter element (%d,1) :",i+1);
scanf("%d", &x[i][0]);
}
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Pth_mat_vect, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Dot Product Matrix is:\n ");
for(i=0;i<m;i++)
printf("Element (%d,1): %d\n",i+1,y[i][0]);
return 0;
}/* main */
void* Pth_mat_vect(void* rank)
{ long my_rank = (long) rank;
int i, j;
int local_m = m/thread_count;
if(local_m==0) local_m=1;
int my_first_row = my_rank*local_m;
int my_last_row = (my_rank+1)*local_m - 1;
for (i = my_first_row; i <= my_last_row; i++)
{ y[i][0] = 0;
for (j = 0; j < n; j++)
y[i][0] += a[i][j]*x[j][0];
}
return NULL;
} /* Pth_mat_vect */
3. Synchronization Primitives in Pthreads
While communication is implicit in shared-address-space programming, much of
the effort associated with writing correct threaded programs is spent on
synchronizing concurrent threads with respect to their data accesses or
scheduling.
3.1. Mutual Exclusion for Shared Variables
Using pthread_create and pthread_join calls, we can create concurrent tasks. These
tasks work together to manipulate data and accomplish a given task. When
multiple threads attempt to manipulate the same data item, the results can
often be incoherent if proper care is not taken to synchronize them. Consider the
following code fragment being executed by multiple threads. The variable
my_cost is thread-local and best_cost is a global variable shared by all threads.
1 /* each thread tries to update variable best_cost as follows */
2 if (my_cost < best_cost)
3 best_cost = my_cost;
To understand the problem with shared data access, let us examine one
execution instance of the above code fragment. Assume that there are two
threads, the initial value of best_cost is 100, and the values of my_cost are 50 and
75 at threads t1 and t2, respectively. If both threads execute the condition inside
the if statement concurrently, then both threads enter the then part of the
statement. Depending on which thread executes first, the value of best_cost at
the end could be either 50 or 75. There are two problems here: the first is the
non-deterministic nature of the result; second, and more importantly, the value
75 of best_cost is inconsistent in the sense that no serialization of the two threads
can possibly yield this result. This is an undesirable situation, sometimes also
referred to as a race condition (so called because the result of the computation
depends on the race between competing threads).
The above-mentioned situation occurred because the test-and-update operation
illustrated above is not executed as an atomic operation; to be correct, it must be
performed atomically, i.e., it should not be broken into sub-operations.
Furthermore, the code corresponds to a critical segment, i.e.,
a segment that must be executed by only one thread at any time. Many
statements that seem atomic in higher level languages such as C may in fact be
non-atomic.
For example, although we can add the contents of a memory location y to a
memory location x with a single C statement,
x = x + y;
what the machine does is typically more complicated. The current values stored
in x and y will be stored in the computer's main memory. Before the addition can
be carried out, the values stored in x and y have to be transferred from main
memory to registers in the CPU. Once the values are in registers, the addition
can be carried out. After the addition is completed, the result may have to be
transferred from a register back to memory.
Suppose that we have two threads, and each computes a value that is stored in
its private variable y. Also suppose that we want to add these private values
together into a shared variable x that has been initialized to 0 by the main
thread. Each thread will execute the following code:
y = Compute(my_rank);
x = x + y;
Let us also suppose that thread 0 computes y = 1 and thread 1 computes y = 2.
The correct result should then be x = 3. Here is one possible scenario:
Time   Thread 0                          Thread 1
1      Started by main thread
2      Call Compute()                    Started by main thread
3      Assign y = 1                      Call Compute()
4      Put x=0 and y=1 into registers    Assign y = 2
5      Add 0 and 1                       Put x=0 and y=2 into registers
6      Store 1 in memory location x      Add 0 and 2
7                                        Store 2 in memory location x
We see that if thread 1 copies x from memory to a register before thread 0 stores
its result, the computation carried out by thread 0 will be overwritten by thread
1. The problem could be reversed: if thread 1 races ahead of thread 0, then its
result may be overwritten by thread 0. In fact, unless one of the threads stores
its result before the other thread starts reading x from memory, the winner's
result will be overwritten by the loser.
As an example, let us try to estimate the value of π. One of the simplest formulas
for estimating π is the Leibniz series:
π = 4 (1 - 1/3 + 1/5 - 1/7 + ...)
The following serial code uses this formula:
double factor = 1.0;
double sum = 0.0;
for (i = 0; i < n; i++, factor = -factor)
sum += factor/(2*i+1);
pi = 4.0*sum;
We can try to parallelize this in the same way we parallelized the matrix-vector
multiplication program: divide up the iterations in the for loop among the
threads and make sum a shared variable. To simplify the computations, let us
assume that the number of threads, thread_count or t, evenly divides the number
of terms in the sum, n. Then, if my_n = n/t, thread 0 can add the first my_n terms.
Therefore, for thread 0, the loop variable i will range from 0 to my_n -1. Thread 1
will add the next my_n terms, so for thread 1, the loop variable will range from
my_n to 2*my_n - 1.
The pthread code is shown below:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count;
long long n;
double sum=0;
void* Thread_sum(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Number of terms : ");
scanf("%ld", &n);
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Thread_sum, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Pi is: %15.13g",4*sum);
return 0;
}/* main */
void* Thread_sum(void* rank)
{ long my_rank = (long) rank;
double factor;
long long i;
long long my_n = n/thread_count;
long long my_first_i = my_n*my_rank;
long long my_last_i = my_first_i + my_n;
if (my_first_i % 2 == 0) /* my_first_i is even */
factor = 1.0;
else /* my_first_i is odd */
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
sum += factor/(2*i+1);
return NULL;
} /* Thread_sum */
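Assuming the program above is saved in a file named pth_pi.c (a filename chosen here only for illustration), it can be compiled and run from the command line with gcc roughly as follows; the -lpthread option links the Pthreads library and the command-line argument gives the number of threads:
gcc -g -Wall -o pth_pi pth_pi.c -lpthread
./pth_pi 2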
If we run the Pthreads program with two threads and n is relatively small, we
find that the results of the Pthreads program are in agreement with the serial
sum program. However, as n gets larger, we start getting some peculiar results.
For example, with a dual-core processor we get the following results:
n            10^5        10^6         10^7          10^8
π            3.14159     3.141593     3.1415927     3.14159265
1 Thread     3.14158     3.141592     3.1415926     3.14159264
2 Threads    3.14158     3.141480     3.1413692     3.14164686
Notice that as we increase n, the estimate with one thread gets better and better.
In fact, with each factor of 10 increase in n we get another correct digit. With
n = 10^5, the result as computed by a single thread has five correct digits. With
n = 10^6, it has six correct digits, and so on. The result computed by two threads
agrees with the result computed by one thread when n = 10^5. However, for larger
values of n, the result computed by two threads actually gets worse. In fact, if we
ran the program several times with two threads and the same value of n, we
would see that the result computed by two threads changes from run to run.
3.2. Mutual Exclusion by Busy-Waiting
As illustrated in the previous section, uncontrolled accesses to shared resources
(i.e., critical sections) can lead to race conditions. To avoid a race condition,
threads must ensure that access to the critical section is done in a mutually
exclusive manner.
Suppose that we have two threads, and each computes a value that is stored in
its private variable y. Also suppose that we want to add these private values
together into a shared variable x that has been initialized to 0 by the main
thread. Each thread will execute the following code:
y = Compute(my_rank);
x = x + y;
To ensure mutual exclusion, when, say, thread 0 wants to execute the statement
x = x + y, it needs to first make sure that thread 1 is not already executing the
statement. Once thread 0 makes sure of this, it needs to provide some way for
thread 1 to determine that it, thread 0, is executing the statement, so that
thread 1 will not attempt to start executing the statement until thread 0 is done.
Finally, after thread 0 has completed execution of the statement, it needs to
provide some way for thread 1 to determine that it is done, so that thread 1 can
safely start executing the statement.
A simple approach that doesn't involve any new concepts is the use of a flag
variable. Suppose flag is a shared int that is set to 0 by the main thread.
Further, suppose we add the following code to our example:
1 y = Compute(my_rank);
2 while (flag != my_rank);
3 x = x + y;
4 flag++;
Let us suppose that thread 1 finishes the assignment in Line 1 before thread 0.
When it reaches the busy wait while statement in Line 2, it will find the test flag
!= my_rank is true and the thread 1 will keep on re-executing the test until the
test is false. When the test is false, thread 1 will go on to execute the code in the
critical section x = x + y.
Since we are assuming that the main thread has initialized flag to 0, thread 1
will not proceed to the critical section in Line 3 until thread 0 executes the
statement flag++. However, when thread 0 executes its first test of flag != my_rank,
the condition is false, and it will go on to execute the code in the critical section x
= x + y. When it is done with this, we see that it will execute flag++, and thread 1
can finally enter the critical section.
The key here is that thread 1 cannot enter the critical section until thread 0 has
completed the execution of flag++. And, provided the statements are executed
exactly as they are written, this means that thread 1 cannot enter the critical
section until thread 0 has completed it.
The while loop is an example of busy-waiting. In busy-waiting, a thread
repeatedly tests a condition, but, effectively, does no useful work until the
condition has the appropriate value (false in our example).
Note that the busy-wait solution works only when the statements are executed
exactly as they are written. If compiler optimization is turned on, it is possible
that the compiler will make changes that will affect the correctness of busy-
waiting. The reason for this is that the compiler is unaware that the program is
multithreaded, so it does not know that the variables x and flag can be
modified by another thread. For example, if our code
1 y = Compute(my_rank);
2 while (flag != my_rank);
3 x = x + y;
4 flag++;
is run by just one thread, the order of the statements while (flag != my_rank) and x
= x + y is unimportant. An optimizing compiler might therefore determine that
the program would make better use of registers if the order of the statements
were switched. Of course, this will result in the code
1 y = Compute(my_rank);
2 x = x + y;
3 while (flag != my_rank);
4 flag++;
which defeats the purpose of the busy-wait loop. The simplest solution to this
problem is to turn compiler optimizations off when we use busy-waiting.
We can immediately see that busy-waiting is not an ideal solution to the problem
of controlling access to a critical section. Since thread 1 will execute the test over
and over until thread 0 executes flag++, if thread 0 is delayed (for example, if the
operating system preempts it to run something else), thread 1 will simply spin on
the test, eating up CPU cycles. This can be positively disastrous for performance.
Turning off compiler optimizations can also seriously degrade performance.
Moreover, since access to the critical section follows a strict turn-taking order, the
failure or delay of a thread inside the critical section can leave the other threads
waiting indefinitely.
The busy-wait implementation of the π estimation program is given below:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count;
long long n, flag=0;
double sum=0;
void* Thread_sum(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Number of terms : ");
scanf("%ld", &n);
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Thread_sum, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Pi is: %15.13g",4*sum);
return 0;
}/* main */
void* Thread_sum(void* rank)
{ long my_rank = (long) rank;
double factor;
long long i;
long long my_n = n/thread_count;
long long my_first_i = my_n*my_rank;
long long my_last_i = my_first_i + my_n;
if (my_first_i % 2 == 0) /* my_first_i is even */
factor = 1.0;
else /* my_first_i is odd */
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
{ while (flag!=my_rank);
sum += factor/(2*i+1);
flag= (flag+1)%thread_count;
}
return NULL;
} /* Thread_sum */
In the above implementation, the last thread, thread t-1, resets the flag to zero.
This can be accomplished by the statement
flag = (flag + 1) % thread_count;
If we compile the program and run it with two threads, we see that it is
computing the correct results. However, if we add in code for computing elapsed
time, we see that when n = 10^8, the serial sum is consistently faster than the
parallel sum. For example, on the dual-core system, the elapsed time for the sum
as computed by two threads is about 25.89 seconds, while the elapsed time for
the serial sum is about 4.35 seconds. This version of the π estimating Pthreads
program is given below:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count;
long long n, flag=0;
double sum=0;
void* Thread_sum(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
clock_t t;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Number of terms : ");
scanf("%ld", &n);
t = clock();
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Thread_sum, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Pi is: %15.13g",4*sum);
t = clock() - t;
printf ("\nIt took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
return 0;
}/* main */
void* Thread_sum(void* rank)
{ long my_rank = (long) rank;
double factor;
long long i;
long long my_n = n/thread_count;
long long my_first_i = my_n*my_rank;
long long my_last_i = my_first_i + my_n;
if (my_first_i % 2 == 0) /* my_first_i is even */
factor = 1.0;
else /* my_first_i is odd */
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
{ while (flag!=my_rank);
sum += factor/(2*i+1);
flag= (flag+1)%thread_count;
}
return NULL;
} /* Thread_sum */
Since the code in a critical section can only be executed by one thread at a time,
no matter how we limit access to the critical section, we will effectively serialize
the code in the critical section. Therefore, as far as possible, we should minimize
the number of times we execute critical section code. One way to greatly improve
the performance of the π estimation program is to have each thread use a private
variable to store its total contribution to the sum. Then, each thread can add in
its contribution to the global sum once, after the for loop. See the program given
below:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count;
long long n, flag=0;
pthread_mutex_t mutex;
double sum=0;
void* Thread_sum(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
clock_t t;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
pthread_mutex_init(&mutex, NULL);
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Number of terms : ");
scanf("%ld", &n);
t = clock();
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Thread_sum, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Pi is: %15.13g",4*sum);
t = clock() - t;
printf ("\nIt took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
return 0;
}/* main */
void* Thread_sum(void* rank)
{ long my_rank = (long) rank;
double factor, my_sum=0.0;
long long i;
long long my_n = n/thread_count;
long long my_first_i = my_n*my_rank;
long long my_last_i = my_first_i + my_n;
if (my_first_i % 2 == 0) /* my_first_i is even */
factor = 1.0;
else /* my_first_i is odd */
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
my_sum+= factor/(2*i+1);
while(flag!=my_rank);
sum += my_sum;
flag=(flag+1)%thread_count;
return NULL;
} /* Thread_sum */
When we run this on the dual-core system with n = 10^8, the elapsed time is
reduced to 1.872 seconds for two threads, a substantial improvement over the
previous 25.89 seconds.
3.3. Mutual Exclusion by Mutexes
Since a thread that is busy-waiting may continually use the CPU, busy-waiting
is generally not an ideal solution to the problem of limiting access to a critical
section. Two better solutions are mutexes and semaphores. Mutex is an
abbreviation of mutual exclusion, and a mutex is a special type of variable that,
together with a couple of special functions, can be used to restrict access to a
critical section to a single thread at a time. Thus, a mutex can be used to
guarantee that one thread excludes all other threads while it executes the
critical section. Hence, the mutex guarantees mutually exclusive access to the
critical section.
The Pthreads standard includes a special type for mutexes: pthread_mutex_t. A
variable of type pthread_mutex_t needs to be initialized by the system before it is
used. This can be done with a call to
int pthread_mutex_init(
pthread_mutex_t* mutex_p /* out */,
const pthread_mutexattr_t* attr_p /* in */);
We will not make use of the second argument, so we will just pass in NULL.
When a Pthreads program finishes using a mutex, it should call
int pthread_mutex_destroy(pthread_mutex_t* mutex_p /* in/out */);
To gain access to a critical section, a thread calls
int pthread_mutex_lock(pthread_mutex_t* mutex_p /* in/out */);
When a thread is finished executing the code in a critical section, it should call
int pthread_mutex_unlock(pthread_mutex_t* mutex p /* in/out */);
The call to pthread_mutex_lock will cause the thread to wait until no other thread
is in the critical section, and the call to pthread_mutex_unlock notifies the system
that the calling thread has completed execution of the code in the critical section.
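Putting the four calls together, the typical life cycle of a mutex looks like the following minimal, self-contained sketch (the Work function and the shared counter are illustrative only, not part of the π program):
#include <stdio.h>
#include <pthread.h>
pthread_mutex_t mutex;            /* protects the shared counter */
int counter = 0;                  /* shared data (illustrative)  */
void* Work(void* rank)
{  pthread_mutex_lock(&mutex);    /* enter the critical section  */
   counter++;                     /* update the shared data safely */
   pthread_mutex_unlock(&mutex);  /* leave the critical section  */
   return NULL;
}
int main(void)
{  pthread_t t0, t1;
   pthread_mutex_init(&mutex, NULL);   /* once, before the threads start   */
   pthread_create(&t0, NULL, Work, NULL);
   pthread_create(&t1, NULL, Work, NULL);
   pthread_join(t0, NULL);
   pthread_join(t1, NULL);
   pthread_mutex_destroy(&mutex);      /* once, after all threads have joined */
   printf("counter = %d\n", counter);
   return 0;
}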
We can use mutexes instead of busy-waiting in our π estimation program by
declaring a global mutex variable, having the main thread initialize it, and then,
instead of busy-waiting and incrementing a flag, the threads call to
pthread_mutex_lock before entering the critical section, and they call to
pthread_mutex_unlock when they are done with the critical section.
See the program given below:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
/* Global variable: accessible to all threads */
int thread_count;
long long n;
pthread_mutex_t mutex;
double sum=0;
void* Thread_sum(void* rank); /* Thread function */
int main(int argc, char* argv[])
{ int i,j;
clock_t t;
long thread; /* Use long in case of a 64bit system */
pthread_t* thread_handles;
pthread_mutex_init(&mutex, NULL);
/* Get number of threads from command line */
thread_count = strtol(argv[1], NULL, 10);
thread_handles = malloc (thread_count*sizeof(pthread_t));
printf("Number of terms : ");
scanf("%ld", &n);
t = clock();
for (thread = 0; thread < thread_count; thread++)
pthread_create(&thread_handles[thread], NULL, Thread_sum, (void*) thread);
for (thread = 0; thread < thread_count; thread++)
pthread_join(thread_handles[thread], NULL);
free(thread_handles);
printf("The Pi is: %15.13g",4*sum);
t = clock() - t;
printf ("\nIt took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
return 0;
}/* main */
void* Thread_sum(void* rank)
{ long my_rank = (long) rank;
double factor;
long long i;
long long my_n = n/thread_count;
long long my_first_i = my_n*my_rank;
long long my_last_i = my_first_i + my_n;
double my_sum=0.0;
if (my_first_i % 2 == 0) /* my_first_i is even */
factor = 1.0;
else /* my_first_i is odd */
factor = -1.0;
for (i = my_first_i; i < my_last_i; i++, factor = -factor)
my_sum+= factor/(2*i+1);
pthread_mutex_lock(&mutex);
sum += my_sum;
pthread_mutex_unlock(&mutex);
return NULL;
} /* Thread_sum */
Notice that with mutexes (unlike the busy-waiting solution), the order in which
the threads execute the code in the critical section is more or less random: the
first thread to call pthread_mutex_lock will be the first to execute the code in the
critical section. Subsequent accesses will be scheduled by the system.
If we look at the performance of the busy-wait π estimation program (with the
critical section after the loop) and the mutex program, we see that for both
versions the ratio of the run-time of the single-threaded program with the
multithreaded program is equal to the number of threads, as long as the number
of threads is no greater than the number of cores (see table below). That is,
T_serial / T_parallel = thread_count,
provided thread_count is less than or equal to the number of cores. Recall that the ratio
T_serial / T_parallel
is called the speedup, and when the speedup is equal to the number of threads,
we have achieved more or less ideal performance or linear speedup.
Thread Busy-Wait Mutex
1 2.90 2.90
2 1.45 1.45
4 0.73 0.73
8 0.38 0.38
16 0.50 0.38
32 0.80 0.40
64 3.56 0.38
Table: Run-times (in seconds) of π estimation programs using n = 10^8 terms on
a system with two four-core processors
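For instance, reading off the first column of the table above, the speedup with two threads is 2.90/1.45 = 2.0 and with four threads it is 2.90/0.73 ≈ 3.97, both close to the linear speedup just described.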
If we compare the performance of the version that uses busy-waiting with the
version that uses mutexes, we don't see much difference in the overall run-time
when the programs are run with fewer threads than cores. Since each thread
only enters the critical section once, unless the critical section is very long we
would not expect the threads to be delayed very much by waiting to enter the
critical section. However, if we start increasing the number of threads beyond
the number of cores, the performance of the version that uses mutexes remains
pretty much unchanged, while the performance of the busy-wait version
degrades.
With busy-waiting, performance can degrade if there are more threads than
cores. This can be explained through an example. Suppose we have two cores and
five threads (0-4). Also suppose that thread 0 is in the critical section, thread 1 is
in the busy-wait loop, and threads 2, 3, and 4 have been descheduled by the
operating system. After thread 0 completes the critical section and sets flag = 1, it
will be terminated, and thread 1 can enter the critical section so the operating
system can schedule thread 2, thread 3, or thread 4. Suppose it schedules thread
3, which will spin in the while loop. When thread 1 finishes the critical section
and sets flag = 2, the operating system can schedule thread 2 or thread 4. If it
schedules thread 4, then both thread 3 and thread 4, will be busily spinning in
the busy-wait loop until the operating system deschedules one of them and
schedules thread 2.
4. Shared-Memory Programming with OpenMP
4.1. Introduction to OpenMP
OpenMP stands for Open Multi-Processing, which is the de facto standard API
(Application Program Interface) for writing shared-memory parallel applications.
It has been jointly defined and endorsed by a group of major computer hardware
and software vendors.
OpenMP supports thread-based shared address space parallelism. It provides an
explicit programming model to control the creation, communication and
synchronization of multiple threads. OpenMP uses the fork-join model of parallel
execution: All OpenMP programs begin as a single process called the master
thread. When the master thread reaches the parallel region, it creates multiple
threads to execute the parallel codes enclosed in the parallel region. When the
threads complete the parallel region, they synchronize and terminate, leaving
only the master thread. The fork-join model is illustrated below.
Fig: The Fork-join model used in OpenMP
The master thread (a series of instructions executed sequentially) forks a
specified number of slave threads and a task is divided among them. The threads
then run concurrently, with the runtime environment allocating threads to
different processors.
The section of code that is meant to run in parallel is marked with a preprocessor
directive that will cause the threads to form before the section is executed. Each
thread has an id attached to it which can be obtained using a function (called
omp_get_thread_num()). The thread id is an integer, and the master thread has an
id of 0. After the execution of the parallelized code, the threads join back into the
master thread, which continues onward to the end of the program.
By default, each thread executes the parallelized section of code independently.
Work-sharing constructs can be used to divide a task among the threads so that
each thread executes its allocated part of the code. Both task parallelism and
data parallelism can be achieved using OpenMP in this way.
The runtime environment allocates threads to processors depending on usage,
machine load and other factors. The number of threads can be assigned by the
runtime environment based on environment variables or in code using functions.
The OpenMP functions are included in a header file labelled omp.h in C/C++.
Although OpenMP and Pthreads are both APIs for shared-memory
programming, they have many fundamental differences. Pthreads are considered
to be low-level primitives. Pthreads requires that the programmer explicitly
specify the behaviour of each thread. OpenMP, on the other hand, sometimes
allows the programmer to simply state that a block of code should be executed in
parallel, and the precise determination of the tasks and which thread should
execute them is left to the compiler and the run-time system. This suggests a
further difference between OpenMP and Pthreads, that is, that Pthreads (like
MPI) is a library of functions that can be linked to a C program, so any Pthreads
program can be used with any C compiler, provided the system has a Pthreads
library. OpenMP, on the other hand, requires compiler support for some
operations, and hence it is entirely possible that you may run across a C compiler
that cannot compile OpenMP programs into parallel programs.
These differences also suggest why there are two standard APIs for shared
memory programming: Pthreads is lower level and provides us with the power to
program virtually any possible thread behaviour. This power, however, comes
with some associated cost: it is up to the programmer to specify every detail of
the behaviour of each thread. OpenMP, on the other hand, allows the compiler
and run-time system to determine some of the details of thread behaviour, so it
can be simpler to code some parallel behaviours using OpenMP. The cost is that
some low-level thread interactions can be more difficult to program.
OpenMP is directive based. Most parallelism is specified through the use of
compiler directives which are embedded in C/C++ or FORTRAN. Such directive-
based languages have existed for a long time, but only recently have
standardization efforts succeeded in the form of OpenMP. OpenMP directives
provide support for concurrency, synchronization, and data handling while
avoiding the need for explicitly setting up mutexes, condition variables, data
scope, and initialization.
OpenMP was developed by a group of programmers and computer scientists who
believed that writing large-scale high-performance programs using APIs such as
Pthreads was too difficult, and they defined the OpenMP specification so that
shared-memory programs could be developed at a higher level. In fact, OpenMP
was explicitly designed to allow programmers to incrementally parallelize
existing serial programs. OpenMP allows the programmer to parallelise
individual sections of a code, such as loops, one at a time. This allows for testing
of each new parallel section before further parallelisation. This incremental
parallelism is virtually impossible with MPI and fairly difficult with Pthreads.
OpenMP uses three types of constructs to control the parallelization of a
program.
Compiler directives instructing the compiler on how to parallelize the code
Compiler directives appear as comments in the source code and are ignored
by compilers unless specified otherwise - usually by specifying the
appropriate compiler flag. OpenMP compiler directives are used for various
purposes:
o Producing a parallel region
o Dividing blocks of code among threads
o Distributing loop iterations between threads
o Serializing sections of code
o Synchronization of work among threads
Compiler directives in C and C++ are based on the #pragma directives (the
word pragma is the abbreviation for pragmatic information). The directive
itself consists of a directive name followed by clauses.
#pragma omp directive [clause list]
Pragmas (like all preprocessor directives) are, by default, one line in length.
Long directive lines can be continued on succeeding lines by escaping the
newline character with a backslash \ at the end of a directive line. The
number sign (#) must be the first non-white-space character on the line that
contains the pragma; white-space characters can separate the number sign
and the word pragma.
Runtime library functions to modify and check the number of threads and to
check how many processors there are in the multiprocessor system
The OpenMP API includes an ever-growing number of run-time library
routines. These routines are used for a variety of purposes:
o Setting and querying the number of threads
o Querying a thread's unique identifier (thread ID), a thread's ancestor's
identifier, the thread team size
o Setting and querying the dynamic threads feature
o Querying if in a parallel region, and at what level
o Setting and querying nested parallelism
o Setting, initializing and terminating locks and nested locks
o Querying wall clock time and resolution
For C/C++, all of the run-time library routines are actual subroutines. For
example:
#include <omp.h>
int omp_get_num_threads(void)
Note that for C/C++, the program should include the <omp.h> header file.
Environment variables to alter the execution of OpenMP applications
OpenMP provides several environment variables for controlling the execution
of parallel code at run-time. These environment variables can be used to
control such things as:
o Setting the number of threads
o Specifying how loop iterations are divided
o Binding threads to processors
o Enabling/disabling nested parallelism; setting the maximum levels of
nested parallelism
o Enabling/disabling dynamic threads
o Setting thread stack size
o Setting thread wait policy
Setting OpenMP environment variables is done the same way as any other
environment variables, and depends upon which shell is being used. For
example, in sh/bash shell, we use
export OMP_NUM_THREADS=8
OpenMP does not require restructuring the serial program. The user only needs
to add compiler directives to turn the serial program into a parallel one.
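To tie the run-time routines and environment variables described above together, here is a minimal sketch (written for this discussion, not taken from a particular source) that sets the number of threads from code and times a parallel region with omp_get_wtime():
#include <stdio.h>
#include <omp.h>
int main(void)
{  double start, elapsed;
   omp_set_num_threads(4);        /* overrides OMP_NUM_THREADS for this program */
   start = omp_get_wtime();       /* wall-clock time in seconds */
   #pragma omp parallel
   {  printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
   }
   elapsed = omp_get_wtime() - start;
   printf("The parallel region took %f seconds\n", elapsed);
   return 0;
}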
4.2. OpenMP Compiler Directives
OpenMP is directive based. Most parallelism is specified through the use of
compiler directives which are embedded in C/C++ or FORTRAN. A directive in
C/C++ has the following format:
#pragma omp <directive-name> [clause,...]
There are four different types of directives:
Parallel construct
Work-sharing construct
Combined parallel work-sharing constructs
Synchronization directives
4.3. Parallel Construct
The parallel directive has the following prototype:
#pragma omp parallel [clause ...] newline
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
num_threads (integer-expression)
structured_block
OpenMP programs execute serially until they encounter the parallel directive.
When a thread reaches a parallel directive, it creates a team of threads and
becomes the master of the team. The master is a member of that team and has
thread number 0 within that team. Starting from the beginning of this parallel
region, the code is duplicated and all threads will execute that code. There is an
implied barrier at the end of a parallel section. Only the master thread continues
execution past this point.
The number of threads in a parallel region is determined by the following
factors, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default - usually the number of CPUs on a node, though it
could be dynamic
Each thread created by this directive executes the structured block specified by
the parallel directive. The clause list is used to specify conditional
parallelization, number of threads, and data handling.
Clauses:
Conditional Parallelization (if Clause): The clause if (scalar expression) determines
whether the parallel construct results in creation of threads. If present, it must
evaluate to non-zero (C/C++) in order for a team of threads to be created.
Otherwise, the region is executed serially by the master thread. Only one if
clause can be used with a parallel directive.
Degree of Concurrency: The clause num_threads(integer-expression) specifies the
number of threads that are created by the parallel directive.
Data Scope Attribute Clauses: Because OpenMP is based upon the shared
address space programming model, most variables are shared by default. In
particular, all of the variables visible when entering the parallel region are
global (shared). Private variables include loop index variables and stack
variables in subroutines called from parallel regions. In addition, data scope
attribute clauses can be used to explicitly define how variables should be scoped.
They include private, firstprivate, lastprivate, shared, default, reduction and copyin.
The private(list) clause declares variables in its list to be private to each thread.
For a private variable, a new object of the same type is declared once for each
thread in the team and all reference to the original object are replaced with
references to the new object. Hence, variables declared private should be
assumed to be uninitialized for each thread.
The firstprivate(list) clause has the same behaviour as private but it automatically
initializes the variables in its list according to their original values.
The lastprivate(list) clause does what private does and, in addition, copies the value
from the last loop iteration back to the original variable object. The shared(list) clause declares
variables in its list to be shared among all threads.
The following example illustrates the usage of some of these parallel directives
#pragma omp parallel if (is_parallel == 1) num_threads(8) \
private (a) shared (b) firstprivate(c)
{
/* structured block */
}
Here, if the value of the variable is_parallel equals one, eight threads are created.
Each of these threads gets private copies of variables a and c, and shares a single
value of variable b. Furthermore, the value of each copy of c is initialized to the
value of c before the parallel directive.
The default clause allows the user to specify a default scope for all variables within a
parallel region to be either shared or not. The clause default (shared) implies that,
by default, a variable is shared by all the threads. The clause default (none)
implies that the state of each variable used in a thread must be explicitly
specified. This is generally recommended, to guard against errors arising from
unintentional concurrent access to shared data.
The reduction clause specifies how multiple local copies of a variable at different
threads are combined into a single copy at the master when threads exit. A
private copy of each list variable is created for each thread. At the end of the
reduction, the reduction operator is applied to all private copies of the shared
variable, and the final result is written to the global shared variable. The usage
of the reduction clause is reduction (operator: list). This clause performs a reduction
on the scalar variables specified in the list using the operator. The variables in
the list are implicitly specified as being private to threads. The operator can be
one of +, *, -, &, |, ^, &&, and ||.
The following example illustrates the usage of the reduction clause
#pragma omp parallel reduction(+: sum) num_threads(8)
{
/* compute local sums here */
}
/* sum here contains sum of all local instances of sums */
In this example, each of the eight threads gets a copy of the variable sum. When
the threads exit, the sum of all of these local copies is stored in the single copy of
the variable (at the master thread).
The following guideline can be used to identify which clauses must be used when.
If a thread initializes and uses a variable (such as loop indices) and no other
thread accesses the data, then a local copy of the variable should be made for
the thread. Such data should be specified as private.
If a thread repeatedly reads a variable that has been initialized earlier in the
program, it is beneficial to make a copy of the variable and inherit the value
at the time of thread creation. This way, when a thread is scheduled on the
processor, the data can reside at the same processor (in its cache if possible)
and accesses will not result in interprocessor communication. Such data
should be specified as firstprivate.
If multiple threads manipulate a single piece of data, one must explore ways
of breaking these manipulations into local operations followed by a single
global operation. For example, if multiple threads keep a count of a certain
event, it is beneficial to keep local counts and to subsequently accrue it using
a single summation at the end of the parallel block. Such operations are
supported by the reduction clause.
If multiple threads manipulate different parts of a large data structure, the
programmer should explore ways of breaking it into smaller data structures
and making them private to the thread manipulating them.
After all the above techniques have been explored and exhausted, remaining
data items may be shared among various threads using the clause shared.
In addition to private, shared, firstprivate and lastprivate, OpenMP supports one
additional data class called threadprivate.
Often, it is useful to make a set of objects locally available to a thread in such a
way that these objects persist through parallel and serial blocks provided the
number of threads remains the same. In contrast to private variables, these
variables are useful for maintaining persistent objects across parallel regions,
which would otherwise have to be copied into the master thread's data space and
reinitialized at the next parallel block. This class of variables is supported in
OpenMP using the threadprivate directive. The syntax of the directive is as
follows:
#pragma omp threadprivate(list)
This directive implies that all variables in list are local to each thread and are
initialized once before they are accessed in a parallel region. Furthermore, these
variables persist across different parallel regions provided dynamic adjustment
of the number of threads is disabled and the number of threads is the same.
Similar to firstprivate, OpenMP provides a mechanism for assigning the same
value to threadprivate variables across all threads in a parallel region. The syntax
of the clause, which can be used with parallel directives, is copyin(list).
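The following small sketch (the counter variable and the values used are illustrative) shows threadprivate together with copyin; each thread starts with the master's value of 100 and then updates only its own persistent copy:
#include <stdio.h>
#include <omp.h>
int counter = 0;                 /* file-scope variable made thread-private below */
#pragma omp threadprivate(counter)
int main(void)
{  counter = 100;                /* set by the master thread before the parallel region */
   #pragma omp parallel copyin(counter) num_threads(4)
   {  counter += omp_get_thread_num();   /* each thread updates its own copy */
      printf("Thread %d: counter = %d\n", omp_get_thread_num(), counter);
   }
   return 0;
}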
Here is a very simple little multi-threaded parallel program HelloWorld.c,
written in C that will print Hello World, displaying the number of the thread
processing each printf statement.
#include <omp.h>
#include<stdio.h>
main ()
{ int nthreads, tid;
/* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid) num_threads(4)
{ /* Obtain and print thread id */
tid = omp_get_thread_num();
/* Only master thread does this */
if (tid == 0)
{ nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Hello from %d\n",tid);
} /* All threads join master thread and terminate */
}
The first line is the OpenMP include file. The parallel region is placed between
the directive #pragma omp parallel {...}. The runtime library function
omp_get_thread_num() returns the thread ID to the program.
In order to compile (using gcc) this program from the command line, type
gcc -fopenmp HelloWorld.c -o HelloWorld.exe
A sample output from this program is shown below:
Hello from 2
Number of threads = 4
Hello from 1
Hello from 3
Hello from 0
From this output you can see that the program was run on 4 threads (from zero
to three) and also that the threads do not necessarily run in any particular order.
4.4. Work-Sharing Construct
A work-sharing construct distributes the execution of the associated region
among the threads encountering it. However, it does not launch new threads by
itself. It should be used within the parallel region or combined with the parallel
region constructs.
OpenMP provides two directives, for and sections, to specify concurrent
iterations and tasks.
4.4.1. The omp for Directive
The for directive is used to split parallel iteration spaces across threads. The omp
for directive specifies that the iterations of the loop immediately following it
must be executed in parallel by the team. This assumes a parallel region has
already been initiated, otherwise it executes in serial on a single processor.
The general form of an omp for directive is as follows:
#pragma omp for [clause ...] newline
schedule (type [,chunk])
ordered
private (list)
firstprivate (list)
lastprivate (list)
shared (list)
reduction (operator: list)
collapse (n)
nowait
for_loop
Clauses:
schedule: The schedule clause of the for directive deals with the assignment of
iterations to threads. OpenMP supports five scheduling classes: static, dynamic,
guided, runtime and auto. The default schedule is implementation dependent.
static: The general form of the static scheduling class is
schedule(static[, chunk])
This technique splits the loop iterations into pieces of size chunk which are then
statically assigned to threads. If chunk is not specified, the iterations are evenly
(if possible) divided among the threads.
dynamic: Often, because of a number of reasons, ranging from heterogeneous
computing resources to non-uniform processor loads, equally partitioned
workloads take widely varying execution times. For this reason, OpenMP has a
dynamic scheduling class. The general form of this class is
schedule(dynamic[, chunk])
The loop iterations are divided into pieces of size chunk, and dynamically
scheduled among the threads; when a thread finishes one chunk, it is
dynamically assigned another. If no chunk-size is specified, it defaults to a single
iteration per chunk (default chunk size is 1).
guided: Consider the partitioning of an iteration space of 100 iterations with a
chunk size of 5. This corresponds to 20 chunks. If there are 16 threads, in the
best case, 12 threads get one chunk each and the remaining four threads get two
chunks. Consequently, if there are as many processors as threads, this
assignment results in considerable idling. The solution to this problem (also
referred to as an edge effect) is to reduce the chunk size as we proceed through
the computation. This is the principle of the guided scheduling class.
The general form of this class is
schedule(guided[, chunk])
In this class, iterations are dynamically assigned to threads in blocks as threads
request them until no blocks remain to be assigned. This is similar to dynamic
scheduling except that the block size decreases each time a parcel of work is given to a thread. The
size of the initial block is proportional to:
number_of_iterations / number_of_threads
Subsequent blocks are proportional to
number_of_iterations_remaining / number_of_threads
The chunk parameter defines the minimum block size. The default chunk size is 1.
runtime: Often it is desirable to delay scheduling decisions until runtime. For
example, if one would like to see the impact of various scheduling strategies to
select the best one, the scheduling can be set to runtime. In this case, the
scheduling decision is deferred until runtime. The environment variable
OMP_SCHEDULE determines the scheduling class and the chunk size. It is illegal to
specify a chunk size for this clause.
auto: The scheduling decision is delegated to the compiler and/or runtime
system.
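As a brief illustration of the runtime class (the loop body below is only placeholder work), the schedule is left unspecified in the code and chosen at run time through OMP_SCHEDULE:
#include <omp.h>
#define N 1000
int main(void)
{  int i;
   double a[N];
   #pragma omp parallel for schedule(runtime)
   for (i = 0; i < N; i++)
      a[i] = i * 2.0;       /* placeholder work */
   return 0;
}
The scheduling strategy can then be changed without recompiling, for example in the bash shell:
export OMP_SCHEDULE="guided,4"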
The other clauses that can be used with omp for directive are: private, firstprivate,
lastprivate, reduction, nowait and ordered. The first four clauses deal with data
handling and have identical semantics as in the case of the omp parallel directive.
The lastprivate clause deals with how multiple local copies of a variable are
written back into a single copy at the end of the parallel omp for loop. When
using an omp for loop (or sections directive) for farming work to threads, it is
sometimes desired that the last iteration (as defined by serial execution) of the
for loop update the value of a variable. This is accomplished using the lastprivate
directive.
If specified, the nowait clause does not synchronize the threads at the end of the
parallel loop. The ordered clause specifies that the iterations of the loop must be
executed as they would be in a serial program.
The collapse clause specifies how many loops in a nested loop should be collapsed
into one large iteration space and divided according to the schedule clause. The
sequential execution of the iterations in all associated loops determines the order
of the iterations in the collapsed iteration space.
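For example, the following sketch (array names and sizes are illustrative) collapses a doubly nested loop so that all N*M iterations form a single iteration space divided among the threads:
#include <omp.h>
#define N 100
#define M 200
double a[N][M], b[N][M], c[N][M];
int main(void)
{  int i, j;
   #pragma omp parallel for collapse(2)
   for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
         c[i][j] = a[i][j] + b[i][j];   /* N*M iterations shared among the threads */
   return 0;
}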
The following vector-add program illustrates the usage of omp for directive.
Arrays a, b, c, and variable N will be shared by all threads. Variable i will be
private to each thread; each thread will have its own unique copy. The iterations
of the loop will be distributed dynamically in CHUNK sized pieces. Threads will
not synchronize upon completing their individual pieces of work (nowait).
#include <omp.h>
#include<stdio.h>
#define CHUNKSIZE 10
#define N 100
main ()
{ int i, chunk;
float a[N], b[N], c[N];
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,chunk) private(i) num_threads(4)
{
#pragma omp for schedule(dynamic,chunk) nowait
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
} /* end of parallel section */
printf("a[]=");
for(i=0;i<N;i++) printf("%g ",a[i]);
printf("\nb[]=");
for(i=0;i<N;i++) printf("%g ",b[i]);
printf("\nc[]=");
for(i=0;i<N;i++) printf("%g ",c[i]);
}
4.4.2. The sections Directive
The for directive is suited to partitioning iteration spaces across threads.
Consider now a scenario in which there are three tasks (taskA, taskB, and
taskC) that need to be executed. Assume that these tasks are independent of
each other and therefore can be assigned to different threads. OpenMP supports
such non-iterative parallel task assignment using the sections directive.
The sections directive is a non-iterative work-sharing construct. It specifies that
the enclosed section(s) of code are to be divided among the threads in the team.
Independent section directives are nested within a sections directive. Each section
is executed once by a thread in the team. Different sections may be executed by
different threads. It is possible for a thread to execute more than one section if it
is quick enough and the implementation permits it.
The general form of the sections directive is as follows:
#pragma omp sections [clause ...] newline
private (list)
firstprivate (list)
lastprivate (list)
reduction (operator: list)
nowait
{
#pragma omp section newline
structured_block
#pragma omp section newline
structured_block
}
The following simple program demonstrates that different blocks of work will be
done by different threads.
#include <omp.h>
#define N 1000
main ()
{ int i;
float a[N], b[N], c[N], d[N];
for (i=0; i < N; i++)
{ a[i] = i * 1.5;
b[i] = i + 22.35;
}
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections nowait
{
#pragma omp section
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i < N; i++)
d[i] = a[i] * b[i];
} /* end of sections */
} /* end of parallel section */
}
4.4.3. Merging Directives
Since the work-sharing directives for and sections do not create any threads by
themselves, these directives are generally preceded by a parallel directive to
create concurrent threads. If there was no parallel directive specified, the for and
sections directives would execute serially (all work is farmed to a single thread,
the master thread). OpenMP allows the programmer to merge the parallel directive with these
constructs into parallel for and parallel sections, respectively. The clause list for the
merged directive can be from the clause lists of either the parallel or for / sections
directives.
For example:
#pragma omp parallel default (private) shared (n)
{
#pragma omp for
for (i = 0 < i < n; i++)
{ /* body of parallel for loop */ }
}
is identical to:
#pragma omp parallel for default (private) shared (n)
{ for (i = 0 < i < n; i++)
{ /* body of parallel for loop */ }
}
and
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{ taskA(); }
#pragma omp section
{ taskB(); }
/* other sections here */
}
}
is identical to:
#pragma omp parallel sections
{
#pragma omp section
{ taskA(); }
#pragma omp section
{ taskB(); }
/* other sections here */
}
4.5. Synchronization Constructs in OpenMP
There are various synchronization constructs available in OpenMP to coordinate
the work by multiple threads.
4.5.1. Critical Sections: The critical and atomic Directives
The critical directive specifies a region of code that must be executed by only one
thread at a time. The syntax of a critical directive is:
#pragma omp critical [(name)]
/*structured block*/
If a thread is currently executing inside a critical region and another thread
reaches that critical region and attempts to execute it, it will block until the first
thread exits that critical region.
The optional name enables multiple different critical regions to exist. Names act
as global identifiers. Different critical regions with the same name are treated as
the same region. All unnamed critical sections are treated as the same
section.
The following code illustrates the usage of the critical directive. All threads in the
team will attempt to execute in parallel; however, because of the critical construct
surrounding the update of result, only one thread will be able to
read/update/write result at any time.
#include<stdio.h>
#include <omp.h>
main ()
{ int i, n=1000, chunk=100;
float a[1000], b[1000], result=0.0;
for (i=0; i < n; i++)
{ a[i] = i * 1.0;
b[i] = i * 2.0;
}
#pragma omp parallel for \
private(i) shared(result)\
schedule(static,chunk) \
num_threads(100)
for (i=0; i < n; i++)
{
#pragma omp critical
{ result = result + (a[i] * b[i]); }
}
printf("Final result= %f\n",result);
}
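For comparison (this variant is not part of the original example), the same dot product can be written with the reduction clause introduced earlier, which removes the explicit critical section; the variable declarations are the same as in the program above:
#pragma omp parallel for schedule(static,chunk) reduction(+: result)
for (i = 0; i < n; i++)
   result = result + (a[i] * b[i]);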
Often, a critical section consists simply of an update to a single memory location,
for example, incrementing or adding to an integer. OpenMP provides another
directive, atomic, for such atomic updates to memory locations. The atomic
directive specifies that the memory location update in the following instruction
should be performed as an atomic operation. In essence, the atomic directive
provides a mini-critical section. The atomic directive has the following syntax:
#pragma omp atomic newline
statement_expression
The following example illustrates the usage of atomic directive:
#include <omp.h>
main()
{ int x;
x = 0;
#pragma omp parallel shared(x)
{
#pragma omp atomic
x = x + 1;
} /* end of parallel section */
}
4.5.2. Single Thread Executions: The single & master Directives
Often, a computation within a parallel section needs to be performed by just one
thread. A simple example of this is the computation of the mean of a list of
numbers. Each thread can compute a local sum of partial lists, add these local
sums to a shared global sum, and have one thread compute the mean by dividing
this global sum by the number of entries in the list. The last step can be
accomplished using a single directive. A single directive specifies a structured
block that is executed by a single (arbitrary) thread.
The syntax of the single directive is as follows:
#pragma omp single [clause list]
structured block
The clause list can take clauses private, firstprivate, and nowait. These clauses have
the same semantics as before. On encountering the single block, the first thread
enters the block. All the other threads proceed to the end of the block. If the
nowait clause has been specified on the single directive, then the other threads
proceed; otherwise they wait at the end of the single block for the thread to finish
executing the block. This directive is useful for computing global data as well as
performing I/O.
The master directive is a specialization of the single directive in which only the
master thread executes the structured block. The syntax of the master directive is
as follows:
#pragma omp master
structured block
In contrast to the single directive, there is no implicit barrier associated with the
master directive.
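A small self-contained sketch (written for illustration) shows both directives inside one parallel region:
#include <stdio.h>
#include <omp.h>
int main(void)
{
   #pragma omp parallel num_threads(4)
   {  printf("Thread %d working\n", omp_get_thread_num());  /* executed by every thread */
      #pragma omp single
      printf("Printed by exactly one (arbitrary) thread\n");
      /* implicit barrier here, since nowait is not specified */
      #pragma omp master
      printf("Printed only by the master thread, with no barrier\n");
   }
   return 0;
}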
4.5.3. Synchronization Point: The barrier Directive
A barrier is one of the most frequently used synchronization primitives. OpenMP
provides a barrier directive, whose syntax is as follows:
#pragma omp barrier
The barrier directive synchronizes all threads in the team. When a barrier
directive is reached, a thread will wait at that point until all other threads have
reached that barrier. All threads then resume executing in parallel the code that
follows the barrier.
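The following minimal sketch (the two printf calls stand in for real phases of work) shows the effect of the barrier: every phase 1 line is printed before any phase 2 line appears:
#include <stdio.h>
#include <omp.h>
int main(void)
{
   #pragma omp parallel num_threads(4)
   {  int id = omp_get_thread_num();
      printf("Thread %d: phase 1\n", id);
      #pragma omp barrier     /* no thread starts phase 2 until all have finished phase 1 */
      printf("Thread %d: phase 2\n", id);
   }
   return 0;
}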
4.5.4. In-Order Execution: The ordered Directive
In many circumstances, it is necessary to execute a segment of a parallel loop in
the order in which the serial version would execute it. For example, consider a
for loop in which, at some point, we compute the cumulative sum in array
cumul_sum of a list stored in array list. The array cumul_sum can be computed
using a for loop over index i serially by executing
cumul_sum[i] = cumul_sum[i-1] + list[i].
When executing this for loop across threads, it is important to note that
cumul_sum[i] can be computed only after cumul_sum[i-1] has been computed.
Therefore, the statement would have to be executed within an ordered block.
The syntax of the ordered directive is as follows:
#pragma omp ordered
structured block
Since the ordered directive refers to the in-order execution of a for loop, it must
be within the scope of a for or parallel for directive. Furthermore, the for or parallel
for directive must have the ordered clause specified to indicate that the loop
contains an ordered block.
The following example illustrates the usage of ordered directive for calculating
the cumulative sum of a linear array.
#include<stdio.h>
#include <omp.h>
main ()
{ int i, list[100], cumul_sum[100], n=100, chunk=10;
for (i=0; i < n; i++)
list[i]=i;
cumul_sum[0] = list[0];
#pragma omp parallel for private (i) \
shared (cumul_sum, list, n) ordered
for (i = 1; i < n; i++)
{
#pragma omp ordered
{ cumul_sum[i] = cumul_sum[i-1] + list[i]; }
printf("cumul_sum[%d]=%d\n",i,cumul_sum[i]);
}
}
4.5.5. Memory Consistency: The flush Directive
The flush directive provides a mechanism for making memory consistent across
threads. While it would appear that such a directive is unnecessary for shared
address space machines, it is important to note that variables may often be
assigned to registers and register-allocated variables may be inconsistent. In
such cases, the flush directive provides a memory fence by forcing a variable to be
written to or read from the memory system. All write operations to shared
variables must be committed to memory at a flush and all references to shared
variables after a fence must be satisfied from the memory. Since private
variables are relevant only to a single thread, the flush directive applies only to
shared variables.
The syntax of the flush directive is as follows:
#pragma omp flush[(list)]
The optional list specifies the variables that need to be flushed. The default is
that all shared variables are flushed.
Several OpenMP directives have an implicit flush. Specifically, a flush is implied
at a barrier, at the entry and exit of critical, ordered, parallel, parallel for, and
parallel sections blocks and at the exit of for, sections, and single blocks. A flush is
not implied if a nowait clause is present. It is also not implied at the entry of for,
sections, and single blocks and at entry or exit of a master block.
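A classic use of flush is a simple producer-consumer handshake. The following sketch (adapted for this discussion and assuming the two sections really do run on two different threads) publishes a value through a ready flag; each flush forces the flagged variables to be written to or re-read from memory:
#include <stdio.h>
#include <omp.h>
int main(void)
{  int data = 0, ready = 0;
   #pragma omp parallel sections shared(data, ready) num_threads(2)
   {
      #pragma omp section
      {  data = 42;                    /* produce the value          */
         #pragma omp flush(data)       /* commit data to memory first */
         ready = 1;
         #pragma omp flush(ready)      /* then publish the flag       */
      }
      #pragma omp section
      {  while (!ready)                /* spin until the flag becomes visible */
         {
            #pragma omp flush(ready)
         }
         #pragma omp flush(data)       /* re-read data from memory    */
         printf("Consumer read data = %d\n", data);
      }
   }
   return 0;
}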