1. What is Parallel Computing?

I would like to start with some metaphors from real life. What do we do when
we are given a Herculean task that must be finished quickly? We divide the task
into pieces and employ more people to do the work. What else would you do?
Similarly, a large corporation requires many employees for its proper
functioning, and a large economy requires many people at work in various
sectors of business.

If a computer was human, then its central processing unit (CPU) would be its
brain. A CPU is a microprocessor -- a computing engine on a chip. While
modern microprocessors are small, they're also really powerful. They can
interpret millions of instructions per second. Even so, there are some
computational problems that are so complex that a powerful microprocessor
would require years to solve them.

Computer scientists use different approaches to address this problem. One
potential approach is to push for more powerful microprocessors. Usually this
means finding ways to fit more transistors on a microprocessor chip.
Computer engineers are already building microprocessors with transistors that
are only a few dozen nanometers wide. How small is a nanometer? It's
one-billionth of a meter. A red blood cell has a diameter of 2,500 nanometers --
the width of modern transistors is a fraction of that size.

Building more powerful microprocessors requires an intense and expensive
production process. Some computational problems take years to solve even
with the benefit of a more powerful microprocessor. Partly because of these
factors, computer scientists sometimes use a different approach: parallel
processing.

In general, parallel processing means that at least two microprocessors handle
parts of an overall task. The concept is pretty simple: A computer scientist
divides a complex problem into component parts using special software
specifically designed for the task. He or she then assigns each component part
to a dedicated processor. Each processor solves its part of the overall
computational problem. The software reassembles the data to reach the end
conclusion of the original complex problem.


It's a high-tech way of saying that it's easier to get work done if you can share
the load. You could divide the load up among different processors housed in
the same computer, or you could network several computers together and
divide the load up among all of them. There are several ways to achieve the
same goal.

Traditionally, computer software has been written for serial computation. To
solve a problem, an algorithm is constructed and implemented as a serial
stream of instructions. These instructions are executed on a central processing
unit on one computer. Only one instruction may execute at a time—after that
instruction is finished, the next is executed.

Parallel computing, on the other hand, uses multiple processing elements
simultaneously to solve a problem. This is accomplished by breaking the
problem into independent parts so that each processing element can execute its
part of the algorithm simultaneously with the others. The processing elements
can be diverse and include resources such as a single computer with multiple
processors, several networked computers, specialized hardware, or any
combination of the above.

1.1 Frequency Scaling and Power Consumption

Frequency scaling was the dominant reason for improvements in computer
performance from the mid-1980s until 2004. The runtime of a program is equal
to the number of instructions multiplied by the average time per instruction.
Maintaining everything else constant, increasing the clock frequency decreases
the average time it takes to execute an instruction. An increase in frequency
thus decreases runtime for all computation-bound programs.

However, power consumption by a chip is given by the equation P = C × V² × F,
where P is power, C is the capacitance being switched per clock cycle
(proportional to the number of transistors whose inputs change), V is voltage,
and F is the processor frequency (cycles per second). Increases in frequency
increase the amount of power used in a processor. Increasing processor power
consumption led ultimately to Intel's May 2004 cancellation of its Tejas and
Jayhawk processors, which is generally cited as the end of frequency scaling as
the dominant computer architecture paradigm.

1.2 Amdahl’s Law

Theoretically, the speed-up from parallelization should be linear—doubling the
number of processing elements should halve the runtime, and doubling it a
second time should again halve the runtime. However, very few parallel
algorithms achieve optimal speed-up. Most of them have a near-linear speed-up
for small numbers of processing elements, which flattens out into a constant
value for large numbers of processing elements.

The potential speed-up of an algorithm on a parallel computing platform is
given by Amdahl’s Law, originally formulated by Gene Amdahl in the 1960s. It
states that a small portion of the program which cannot be parallelized will
limit the overall speed-up available from parallelization. Any large
mathematical or engineering problem will typically consist of several
parallelizable parts and several non-parallelizable (sequential) parts.


This relationship is given by the equation:

S = 1 / (1 - P)

where S is the speed-up of the program (as a factor of its original sequential
runtime), and P is the fraction that is parallelizable. If the sequential portion of
a program is 10% of the runtime, we can get no more than a 10x speed-up,
regardless of how many processors are added. This puts an upper limit on the
usefulness of adding more parallel execution units. "When a task cannot be
partitioned because of sequential constraints, the application of more effort has
no effect on the schedule. The bearing of a child takes nine months, no matter
how many women are assigned."
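To see the limit numerically, the following minimal sketch (assuming a hypothetical program that is 90% parallelizable) evaluates the more general form S = 1 / ((1 - P) + P/N) for N processors; as N grows, the speed-up approaches the 1 / (1 - P) bound quoted above.

# amdahl.py -- minimal sketch of Amdahl's Law; the parallel fraction is illustrative

def amdahl_speedup(parallel_fraction, processors):
    """Speed-up over the sequential runtime for a given processor count."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

P = 0.9   # 90% of the program parallelizes, 10% is sequential
for n in (2, 4, 16, 1024):
    print(f"{n:5d} processors -> speed-up {amdahl_speedup(P, n):.2f}x")
# As n grows, the speed-up approaches 1 / (1 - P) = 10x and never exceeds it.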

1.3 Gustafson’s Law

Gustafson's law is another law in computer engineering, closely related to
Amdahl's law. It can be formulated as:

S(P) = P - α(P - 1)

where P is the number of processors, S is the speed-up, and α the
non-parallelizable part of the process. Amdahl's law assumes a fixed problem
size and that the size of the sequential section is independent of the number of
processors, whereas Gustafson's law does not make these assumptions.
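A companion sketch for Gustafson's law, using an illustrative α of 0.1; unlike the fixed-size Amdahl bound, the scaled speed-up keeps growing with the processor count.

# gustafson.py -- minimal sketch of Gustafson's Law (alpha chosen for illustration)

def gustafson_speedup(processors, alpha):
    """Scaled speed-up when the problem size grows with the processor count."""
    return processors - alpha * (processors - 1)

alpha = 0.1   # 10% of the work is non-parallelizable
for p in (2, 4, 16, 1024):
    print(f"{p:5d} processors -> scaled speed-up {gustafson_speedup(p, alpha):.1f}x")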

1.4 Pipelining


One of the basic concepts behind parallel computing is the concept of
pipelining.

In computing, a pipeline is a set of data processing elements connected in
series, so that the output of one element is the input of the next one. The
elements of a pipeline are often executed in parallel or in time-sliced fashion;
in that case, some amount of buffer storage is often inserted between elements.

Computer-related pipelines include:

* Instruction pipelines, such as the classic RISC pipeline, which are used in
processors to allow overlapping execution of multiple instructions with the
same circuitry. The circuitry is usually divided up into stages, including
instruction decoding, arithmetic, and register fetching stages, wherein each
stage processes one instruction at a time.

* Software pipelines, consisting of multiple processes arranged so that the
output stream of one process is automatically and promptly fed as the input
stream of the next one. Unix pipelines are the classical implementation of this
concept; a small sketch of the idea follows below.
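As a rough illustration of the software-pipeline idea, the following sketch chains Python generators so that each stage consumes the output stream of the previous one; the stage functions and the data are made up for the example.

# pipeline.py -- minimal sketch of a software pipeline built from generators

def read_numbers(values):
    for v in values:            # stage 1: produce the raw stream
        yield v

def square(stream):
    for v in stream:            # stage 2: transform each item
        yield v * v

def keep_even(stream):
    for v in stream:            # stage 3: filter the transformed stream
        if v % 2 == 0:
            yield v

# Wire the stages together; items flow through one element at a time,
# much like "producer | transform | filter" in a Unix shell pipeline.
stream = keep_even(square(read_numbers(range(10))))
print(list(stream))             # [0, 4, 16, 36, 64]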


2. Approaches To Parallel Computing

To understand parallel processing, we need to look at the four basic
programming models. Computer scientists define these models based on two
factors: the number of instruction streams and the number of data streams
the computer handles. Instruction streams are algorithms. An algorithm is just a
series of steps designed to solve a particular problem. Data streams are
information pulled from computer memory used as input values to the
algorithms. The processor plugs the values from the data stream into the
algorithms from the instruction stream. Then, it initiates the operation to obtain
a result.

2.1 Flynn’s Taxonomy

Michael J. Flynn created one of the earliest classification systems for parallel
(and sequential) computers and programs, now known as Flynn's taxonomy.
Flynn classified programs and computers by whether they were operating using
a single set or multiple sets of instructions, and whether those instructions were
using a single set or multiple sets of data.

Flynn’s Taxonomy


SISD-Single Instruction Single Data

SIMD-Single Instruction Multiple Data

MISD-Multiple Instruction Single Data

MIMD-Multiple Instruction Multiple Data

2.1.1 Single Instruction Single Data(SISD)

Single Instruction, Single Data (SISD) computers have one processor that
handles one algorithm using one source of data at a time. The computer tackles
and processes each task in order, and so sometimes people use the word
"sequential" to describe SISD computers. They aren't capable of performing
parallel processing on their own.

2.1.2 Single Instruction Multiple Data(SIMD)

Single Instruction, Multiple Data (SIMD) computers have several processors
that follow the same set of instructions, but each processor inputs different data
into those instructions. SIMD computers run different data through the same
algorithm. This can be useful for analyzing large chunks of data based on the
same criteria. Many complex computational problems don't fit this model.
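As a loose software analogy for SIMD (not a model of actual SIMD hardware), the sketch below uses NumPy, whose vectorized operations apply the same operation across many data elements at once; the array contents are arbitrary.

# simd_analogy.py -- one operation applied to many data elements (requires NumPy)
import numpy as np

prices = np.array([10.0, 12.5, 8.0, 99.9])   # arbitrary example data
taxed = prices * 1.18                         # the same "instruction" hits every element
print(taxed)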

2.1.3 Multiple Instruction Single Data(MISD)

Multiple Instruction, Single Data (MISD) computers have multiple processors.
Each processor uses a different algorithm but uses the same shared input data.
MISD computers can analyze the same set of data using several different
operations at the same time. The number of operations depends upon the
number of processors. There aren't many actual examples of MISD computers,
partly because the problems an MISD computer can calculate are uncommon
and specialized.

2.1.4 Multiple Instruction Multiple Data (MIMD)

Multiple Instruction, Multiple Data (MIMD) computers have multiple
processors, each capable of accepting its own instruction stream independently
from the others. Each processor also pulls data from a separate data stream. An
MIMD computer can execute several different processes at once. MIMD
computers are more flexible than SIMD or MISD computers, but it's more
difficult to create the complex algorithms that make these computers work.
Single Program, Multiple Data (SPMD) systems are a subset of MIMDs. An
SPMD computer is structured like an MIMD, but it runs the same set of
instructions across all processors.

According to David A. Patterson and John L. Hennessy, "Some machines are
hybrids of these categories, of course, but this classic model has survived
because it is simple, easy to understand, and gives a good first approximation.
It is also—perhaps because of its understandability—the most widely used
scheme."

2.2 Classification On The Basis Of Computation

On this basis, parallel computing problems can be classified into three major
classes of parallelism:

Massively Parallel

Embarrassingly Parallel

Grand Challenge Problems

2.2.1 Massively Parallel Systems

Massive parallelism (MP) is a term used in computer architecture,
reconfigurable computing, application-specific integrated circuit (ASIC) and
field-programmable gate array (FPGA) design. It signifies the presence of
many independent arithmetic units or entire microprocessors that run in
parallel. The term massive connotes hundreds if not thousands of such units.
Early examples of such a system are the Distributed Array Processor, the
Goodyear MPP, and the Connection Machine.

Today's most powerful supercomputers are all MP systems, such as Earth
Simulator, Blue Gene, ASCI White, ASCI Red, ASCI Purple, and ASCI Thor's
Hammer.


In this class of computing, all of the processing elements are connected
together to be one very large computer. This is in contrast to distributed
computing, where massive numbers of separate computers are used to solve a
single problem.

Through advances in Moore's Law, System-On-Chip (SOC) implementations
of massively parallel architectures are becoming cost effective, and finding
particular application in high definition video processing.

2.2.2 Embarrassingly Parallel Systems

In the jargon of parallel computing, an embarrassingly parallel workload (or
embarrassingly parallel problem) is one for which no particular effort is needed
to segment the problem into a very large number of parallel tasks, and there is
no essential dependency (or communication) between those parallel tasks.

A very common usage of an embarrassingly parallel problem lies within
graphics processing units (GPUs) for things like 3D projection, since each pixel
on the screen can be rendered independently from any other pixel.

Embarrassingly parallel problems are ideally suited to distributed computing
over the Internet (e.g. SETI@home), and are also easy to perform on server
farms which do not have any of the special infrastructure used in a true
supercomputer cluster.

Embarrassingly parallel problems lie at one end of the spectrum of
parallelization, the degree to which a computational problem can be readily
divided amongst processors.
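A minimal sketch of an embarrassingly parallel workload in Python: each task below is independent, so a process pool can run them with no communication between tasks. The work function is a made-up stand-in.

# embarrassingly_parallel.py -- independent tasks spread over a process pool
from multiprocessing import Pool

def render_pixel(index):
    """Stand-in for an independent unit of work, e.g. shading one pixel."""
    return (index * 2654435761) % 255   # arbitrary deterministic computation

if __name__ == "__main__":
    with Pool() as pool:                 # one worker per available core by default
        shades = pool.map(render_pixel, range(10000))
    print(len(shades), "independent results, no inter-task communication needed")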

2.2.3 Grand Challenge Problems

A grand challenge is a fundamental problem in science or engineering, with
broad applications, whose solution would be enabled by the application of high
performance computing resources that could become available in the near
future. Examples of grand challenges are:

Computational fluid dynamics, for the design of hypersonic aircraft, efficient
automobile bodies, and extremely quiet submarines, for weather forecasting for
short and long term effects, for efficient recovery of oil, and for many other
applications;

Electronic structure calculations for the design of new materials such as
chemical catalysts, immunological agents, and superconductors;

Plasma dynamics for fusion energy technology and for safe and efficient
military technology;

Calculations to understand the fundamental nature of matter, including
quantum chromodynamics and condensed matter theory;

Symbolic computations including speech recognition, computer vision, natural
language understanding, automated reasoning, and tools for design,
manufacturing, and simulation of complex systems.

Prediction of weather, climate, and global change

Challenges in materials sciences

Semiconductor design

Superconductivity

10
11

Structural biology

Design of pharmaceutical drugs

Human genome

Quantum Chromodynamics

Astronomy

Challenges in Transportation

Vehicle Signature

Turbulence

Vehicle dynamics

Nuclear fusion

Efficiency of combustion systems

Enhanced oil and gas recovery

Computational ocean sciences

Speech

Vision

Undersea surveillance for ASW

One current topical Grand Challenge is the DARPA Autonomous Vehicle
challenge.

2.3 Types Of Parallelism

There are four major types of parallelism:

Bit-Level Parallelism
Instruction-Level Parallelism
Data Parallelism
Task Parallelism

2.3.1 Bit-Level Parallelism

From the advent of very-large-scale integration (VLSI) computer-chip
fabrication technology in the 1970s until about 1986, speed-up in computer
architecture was driven by doubling computer word size—the amount of
information the processor can manipulate per cycle. Increasing the word size
reduces the number of instructions the processor must execute to perform an
operation on variables whose sizes are greater than the length of the word. For
example, where an 8-bit processor must add two 16-bit integers, the processor
must first add the 8 lower-order bits from each integer using the standard
addition instruction, then add the 8 higher-order bits using an add-with-carry
instruction and the carry bit from the lower-order addition; thus, an 8-bit
processor requires two instructions to complete a single operation, where a
16-bit processor would be able to complete the operation with a single
instruction.
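The two-instruction addition described above can be simulated in software; the sketch below combines an 8-bit add and an add-with-carry into a 16-bit result, purely as an illustration of why a wider word size saves instructions.

# word_size.py -- adding two 16-bit integers using only 8-bit operations
def add16_with_8bit_ops(a, b):
    """Combine an 8-bit ADD and an 8-bit ADD-with-carry into a 16-bit addition."""
    lo = (a & 0xFF) + (b & 0xFF)                          # standard add of the low bytes
    carry = lo >> 8                                       # carry bit out of the low-byte add
    hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry    # add-with-carry on the high bytes
    return ((hi & 0xFF) << 8) | (lo & 0xFF)               # reassemble the 16-bit result

print(hex(add16_with_8bit_ops(0x12F0, 0x0340)))           # 0x1630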

Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then
32-bit microprocessors. This trend generally came to an end with the
introduction of 32-bit processors, which have been a standard in general-purpose
computing for two decades. Not until recently (c. 2003–2004), with the advent
of x86-64 architectures, have 64-bit processors become commonplace.

2.3.2 Instruction-Level Parallelism

A computer program is, in essence, a stream of instructions executed by a
processor. These instructions can be re-ordered and combined into groups
which are then executed in parallel without changing the result of the program.
This is known as instruction-level parallelism. Advances in instruction-level
parallelism dominated computer architecture from the mid-1980s until the
mid-1990s.

Modern processors have multi-stage instruction pipelines. Each stage in the
pipeline corresponds to a different action the processor performs on that
instruction in that stage; a processor with an N-stage pipeline can have up to N
different instructions at different stages of completion. The canonical example
of a pipelined processor is a RISC processor, with five stages: instruction fetch,
decode, execute, memory access, and write back. The Pentium 4 processor had
a 35-stage pipeline.

A five-stage pipelined superscalar processor, capable of issuing two
instructions per cycle, can have two instructions in each stage of the pipeline,
for a total of up to ten instructions being executed simultaneously.

In addition to instruction-level parallelism from pipelining, some processors
can issue more than one instruction at a time. These are known as superscalar
processors. Instructions can be grouped together only if there is no data
dependency between them. Scoreboarding and the Tomasulo algorithm (which
is similar to scoreboarding but makes use of register renaming) are two of the
most common techniques for implementing out-of-order execution and
instruction-level parallelism.

Instruction-level parallelism (ILP) is a measure of how many of the operations
in a computer program can be performed simultaneously. Consider the
following program:

1. e = a + b

2. f = c + d

3. g = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be
calculated until both of them are completed. However, operations 1 and 2 do
not depend on any other operation, so they can be calculated simultaneously. If
we assume that each operation can be completed in one unit of time, then these
three instructions can be completed in a total of two units of time, giving an
ILP of 3/2.
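The dependency structure of this example can be mirrored in Python with futures: operations 1 and 2 are submitted independently, and operation 3 waits for both. This only illustrates the dependency graph at the software level; real instruction-level parallelism is exploited inside the processor.

# ilp_sketch.py -- the dependency structure of the three operations above
from concurrent.futures import ThreadPoolExecutor

a, b, c, d = 1, 2, 3, 4

with ThreadPoolExecutor(max_workers=2) as pool:
    fut_e = pool.submit(lambda: a + b)   # operation 1
    fut_f = pool.submit(lambda: c + d)   # operation 2 (independent of 1)
    g = fut_e.result() * fut_f.result()  # operation 3 runs only after 1 and 2 finish

print(g)   # (1 + 2) * (3 + 4) = 21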

2.3.3 Data Parallelism

Data parallelism is parallelism inherent in program loops, which focuses on
distributing the data across different computing nodes to be processed in
parallel. "Parallelizing loops often leads to similar (not necessarily identical)
operation sequences or functions being performed on elements of a large data
structure." Many scientific and engineering applications exhibit data
parallelism.

A loop-carried dependency is the dependence of a loop iteration on the output
of one or more previous iterations. Loop-carried dependencies prevent the
parallelization of loops. For example, consider the following pseudocode that
computes the first few Fibonacci numbers:

1: PREV2 := 0

2: PREV1 := 1

3: CUR := 1

4: do:

5: CUR := PREV1 + PREV2

6: PREV2 := PREV1

7: PREV1 := CUR

8: while (CUR < 10)

This loop cannot be parallelized because CUR depends on itself (via PREV1)
and on PREV2, both of which are computed in each loop iteration. Since each
iteration depends on the result of the previous one, the iterations cannot be
performed in parallel. As the size of a problem gets bigger, the amount of
data parallelism available usually does as well.
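By contrast, a loop whose iterations are independent can be data-parallelized directly. The sketch below distributes an element-wise computation over worker processes; the data and the function are arbitrary examples.

# data_parallel.py -- no loop-carried dependency, so iterations can run in parallel
from multiprocessing import Pool

def f(x):
    return x * x + 1          # each element is processed independently

if __name__ == "__main__":
    data = list(range(1000))
    with Pool() as pool:
        results = pool.map(f, data)   # the runtime splits the data across workers
    print(results[:5])                # [1, 2, 5, 10, 17]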

2.3.4 Task Parallelism

Task parallelism (also known as function parallelism and control parallelism)
is a form of parallelization of computer code across multiple processors in
parallel computing environments. Task parallelism focuses on distributing
execution processes (threads) across different parallel computing nodes. It
contrasts with data parallelism as another form of parallelism.

In a multiprocessor system, task parallelism is achieved when each processor
executes a different thread (or process) on the same or different data. The
threads may execute the same or different code. In the general case, different
execution threads communicate with one another as they work. Communication
usually takes place to pass data from one thread to the next as part of a
workflow.

As a simple example, if we are running code on a 2-processor system (CPUs
"a" and "b") in a parallel environment and we wish to do tasks "A" and "B", it
is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B"
simultaneously, thereby reducing the runtime of the execution. The tasks can
be assigned using conditional statements, as described below.

Task parallelism emphasizes the distributed (parallelized) nature of the
processing (i.e. threads), as opposed to the data (data parallelism). Most real
programs fall somewhere on a continuum between task parallelism and data
parallelism.

The pseudocode below illustrates task parallelism:

program:

...

if CPU="a" then

do task "A"

else if CPU="b" then

do task "B"

end if

...

end program

The goal of the program is to do some net total task ("A+B"). If we write the
code as above and launch it on a 2-processor system, then the runtime
environment will execute it as follows.


In an SPMD system, both CPUs will execute the code.

In a parallel environment, both will have access to the same data.

The "if" clause differentiates between the CPUs. CPU "a" will read true on the
"if" and CPU "b" will read true on the "else if", thus having their own task.

Now, both CPUs execute separate code blocks simultaneously, performing
different tasks simultaneously.

Code executed by CPU "a":

program:

...

do task "A"

...

end program

Code executed by CPU "b":

program:

...

do task "B"

...

end program

This concept can now be generalized to any number of processors.
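A minimal runnable sketch of the same idea in Python: two worker processes are given different task functions (both hypothetical) and run simultaneously, potentially on different cores.

# task_parallel.py -- two different tasks executed by two different processes
from multiprocessing import Process

def task_a():
    print("task A: summing", sum(range(1_000_000)))

def task_b():
    print("task B: reversed text ->", "parallel"[::-1])

if __name__ == "__main__":
    workers = [Process(target=task_a), Process(target=task_b)]
    for w in workers:
        w.start()          # each task runs in its own process
    for w in workers:
        w.join()           # wait for both tasks to finish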

2.4 Classification Based on Memory Access


One major way to classify parallel computers is based on their memory
architectures. Shared memory parallel computers have multiple processors
accessing all available memory as a global address space. They can be further
divided into two main classes based on memory access times:

Uniform Memory Access (UMA), in which access times to all parts of memory
are equal.
Non-Uniform Memory Access (NUMA), in which they are not.

2.4.1 Uniform Memory Access(UMA)

Uniform Memory Access (UMA) is a computer memory architecture used in
parallel computers having multiple processors and, possibly, multiple memory
chips.

All the processors in the UMA model share the physical memory uniformly.
Peripherals are also shared. Cache memory may be private to each processor.
In a UMA architecture, the access time to a memory location is independent of
which processor makes the request or which memory chip contains the target
data. It is used in symmetric multiprocessing (SMP).

Uniform Memory Access computer architectures are often contrasted with
Non-Uniform Memory Access (NUMA) architectures. UMA machines are,
generally, harder for computer architects to design, but easier for programmers
to program, than NUMA architectures.

2.4.2 Non-Uniform Memory Access(NUMA)

Non-Uniform Memory Access or Non-Uniform Memory Architecture
(NUMA) is a computer memory design used in multiprocessors, where the
memory access time depends on the memory location relative to a processor.
Under NUMA, a processor can access its own local memory faster than
non-local memory, that is, memory local to another processor or memory
shared between processors.

NUMA architectures logically follow in scaling from symmetric
multiprocessing (SMP) architectures. Their commercial development came in
work by Burroughs, Convex Computer (later HP), SGI, Sequent and Data
General during the 1990s. Techniques developed by these companies later
featured in a variety of Unix-like operating systems, as well as, to some degree,
in Windows NT.

Modern CPUs operate considerably faster than the main memory they are
attached to. In the early days of high-speed computing and supercomputers the
CPU generally ran slower than its memory, until the performance lines crossed
in the 1970s. Since then, CPUs, increasingly starved for data, have had to stall
while they wait for memory accesses to complete. Many supercomputer
designs of the 1980s and 90s focused on providing high-speed memory access
as opposed to faster processors, allowing them to work on large data sets at
speeds other systems could not approach.

Architecture of a NUMA system.

In the figure above, notice that the processors are connected to the bus or
crossbar by connections of varying thickness/number. This shows that different
CPUs have different priorities for memory access based on their location.

Limiting the number of memory accesses provided the key to extracting high
performance from a modern computer. For commodity processors, this means
installing an ever-increasing amount of high-speed cache memory and using
increasingly sophisticated algorithms to avoid "cache misses". But the dramatic
increase in size of the operating systems and of the applications run on them
has generally overwhelmed these cache-processing improvements.
Multi-processor systems make the problem considerably worse. Now a system
can starve several processors at the same time, notably because only one
processor can access memory at a time.

NUMA attempts to address this problem by providing separate memory for
each processor, avoiding the performance hit when several processors attempt
to address the same memory. For problems involving spread data (common for
servers and similar applications), NUMA can improve the performance over a
single shared memory by a factor of roughly the number of processors (or
separate memory banks).

Of course, not all data ends up confined to a single task, which means that
more than one processor may require the same data. To handle these cases,
NUMA systems include additional hardware or software to move data between
banks. This operation has the effect of slowing down the processors attached to
those banks, so the overall speed increase due to NUMA will depend heavily
on the exact nature of the tasks run on the system at any given time.

2.5 Classes Of Parallel Computers

Parallel computers can be roughly classified according to the level at which the
hardware supports parallelism. This classification is broadly analogous to the
distance between basic computing nodes. These are not mutually exclusive; for
example, clusters of symmetric multiprocessors are relatively common.

2.5.1 Multi-Core Computing

A multi-core CPU (or chip-level multiprocessor, CMP) combines two or
more independent cores into a single package composed of a single integrated
circuit (IC), called a die, or more dies packaged together. A dual-core
processor contains two cores, and a quad-core processor contains four cores.
A multi-core microprocessor implements multiprocessing in a single physical
package. A processor with all cores on a single die is called a monolithic
processor. Cores in a multi-core device may share a single coherent cache at
the highest on-device cache level (e.g. L2 for the Intel Core 2) or may have
separate caches (e.g. current AMD dual-core processors). The processors also
share the same interconnect to the rest of the system. Each "core"
independently implements optimizations such as superscalar execution,
pipelining, and multithreading. A system with n cores is effective when it is
presented with n or more threads concurrently. The most commercially
significant (or at least the most 'obvious') multi-core processors are those used
in personal computers (primarily from Intel and AMD) and game consoles
(e.g., the eight-core Cell processor in the PS3 and the three-core Xenon
processor in the Xbox 360). In this context, "multi" typically means a relatively
small number of cores. However, the technology is widely used in other areas,
especially embedded processors, such as network processors and digital signal
processors, and in GPUs.

The amount of performance gained by the use of a multi-core processor
depends on the problem being solved and the algorithms used, as well as their
implementation in software (Amdahl's Law). For so-called "embarrassingly
parallel" problems, a dual-core processor with two cores at 2 GHz may perform
very nearly as fast as a single core at 4 GHz. Other problems, though, may not
yield as much speed-up. This all assumes, however, that the software has been
designed to take advantage of available parallelism. If it hasn't, there will not
be any speed-up at all. However, the processor will multitask better, since it
can run two programs at once, one on each core.

2.5.2 Symmetric Multiprocessing

In computing, symmetric multiprocessing or SMP involves a multiprocessor
computer architecture where two or more identical processors are connected to
a single shared main memory. Most common multiprocessor systems today use
an SMP architecture. In the case of multi-core processors, the SMP architecture
applies to the cores, treating them as separate processors.

SMP systems allow any processor to work on any task no matter where the
data for that task are located in memory; with proper operating system support,
SMP systems can easily move tasks between processors to balance the
workload efficiently.

2.5.3 Distributed Computing

Distributed computing deals with hardware and software systems containing
more than one processing element or storage element, concurrent processes, or
multiple programs, running under a loosely or tightly controlled regime.

In distributed computing, a program is split up into parts that run
simultaneously on multiple computers communicating over a network.
Distributed computing is a form of parallel computing, but parallel computing
is most commonly used to describe program parts running simultaneously on
multiple processors in the same computer. Both types of processing require
dividing a program into parts that can run simultaneously, but distributed
programs often must deal with heterogeneous environments, network links of
varying latencies, and unpredictable failures in the network or the computers.

2.5.4 Cluster Computing

A cluster is a group of loosely coupled computers that work together closely,
so that in some respects they can be regarded as a single computer. Clusters are
composed of multiple standalone machines connected by a network. While
machines in a cluster do not have to be symmetric, load balancing is more
difficult if they are not. The most common type of cluster is the Beowulf
cluster, which is a cluster implemented on multiple identical commercial
off-the-shelf computers connected with a TCP/IP Ethernet local area network.
Beowulf technology was originally developed by Thomas Sterling and Donald
Becker. The vast majority of the TOP500 supercomputers are clusters.

Beowulf Clusters

2.5.5 Grid Computing

Grid computing is the most distributed form of parallel computing. It makes
use of computers communicating over the Internet to work on a given problem.
Because of the low bandwidth and extremely high latency available on the
Internet, grid computing typically deals only with embarrassingly parallel
problems. Many grid computing applications have been created, of which
SETI@home and Folding@Home are the best-known examples.


Most grid computing applications use middleware, software that sits between
the operating system and the application to manage network resources and
standardize the software interface. The most common grid computing
middleware is the Berkeley Open Infrastructure for Network Computing
(BOINC). Often, grid computing software makes use of "spare cycles",
performing computations at times when a computer is idling.

2.5.6 Vector Processors

A vector processor is a CPU or computer system that can execute the same
instruction on large sets of data. "Vector processors have high-level operations
that work on linear arrays of numbers or vectors. An example vector operation
is A = B × C, where A, B, and C are each 64-element vectors of 64-bit
floating-point numbers." They are closely related to Flynn's SIMD
classification.

Cray computers became famous for their vector-processing computers in the
1970s and 1980s. However, vector processors—both as CPUs and as full
computer systems—have generally disappeared. Modern processor instruction
sets do include some vector processing instructions, such as AltiVec and
Streaming SIMD Extensions (SSE).
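The quoted vector operation can be mimicked in software with NumPy; the sketch below forms A = B × C over 64-element arrays. This is a library-level analogy, not a description of real vector hardware.

# vector_op.py -- element-wise A = B * C over 64-element vectors (requires NumPy)
import numpy as np

B = np.linspace(0.0, 1.0, 64)     # arbitrary 64-element input vectors
C = np.full(64, 2.0)
A = B * C                          # one high-level operation over the whole array
print(A.shape, A[:4])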

Cray-1, the most famous vector processor.

3. Parallel Computing Implementation

Parallel computing has implementations in both software and hardware. To be
precise in terms of terminology, parallel computing in software is called
'parallel programming', whereas in the world of hardware it is referred to as
'parallel processing'.

3.1 Parallel Computing in Software

Parallel models are implemented in several ways: as libraries invoked from
traditional sequential languages, as language extensions, or as complete new
execution models. They are also roughly categorized for two kinds of systems:
shared-memory systems and distributed-memory systems, though the lines
between them are largely blurred nowadays.

A programming model is usually judged by its expressibility and simplicity,
which are often conflicting goals. The ultimate goal is to improve the
productivity of programming.

3.1.1 Automatic Parallelization

Automatic parallelization of a sequential program by a compiler is the "holy
grail" of parallel computing. Despite decades of work by compiler researchers,
automatic parallelization has had only limited success.

Mainstream parallel programming languages remain either explicitly parallel
or (at best) partially implicit, in which a programmer gives the compiler
directives for parallelization. A few fully implicit parallel programming
languages exist—SISAL, Parallel Haskell, and (for FPGAs) Mitrion-C—but
these are niche languages that are not widely used.

The Fortran code below can be auto-parallelized by a compiler because each
iteration is independent of the others, and the final result of array z will be
correct regardless of the order in which the iterations are executed.

do i=1, n
z(i) = x(i) + y(i)
enddo

On the other hand, the following code cannot be auto-parallelized, because the
value of z(i) depends on the result of the previous iteration, z(i-1).

do i=2, n
z(i) = z(i-1)*2
enddo

This does not mean that the code cannot be parallelized. Indeed, it is equivalent
to

do i=2, n
z(i) = z(1)*2**(i-1)
enddo

However, current parallelizing compilers are not usually capable of extracting
this parallelism automatically, and it is questionable whether this code would
benefit from parallelization in the first place.
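For comparison, the two Fortran loops above can be mirrored in Python: the independent loop parallelizes trivially with a process pool, while the dependent loop must either run serially or be rewritten using its closed form. The array sizes and values are arbitrary.

# auto_parallel_sketch.py -- independent vs. loop-carried-dependent loops
from multiprocessing import Pool

x = list(range(1, 9))
y = list(range(10, 18))

def add_pair(pair):
    return pair[0] + pair[1]

if __name__ == "__main__":
    # Independent iterations: safe to farm out, like the first Fortran loop.
    with Pool() as pool:
        z = pool.map(add_pair, zip(x, y))

    # Dependent iterations: z2(i) = z2(i-1) * 2 must run in order...
    z2 = [1.0] * 8
    for i in range(1, 8):
        z2[i] = z2[i - 1] * 2
    # ...or be replaced by the closed form z2(i) = z2(0) * 2**i.
    closed = [z2[0] * 2 ** i for i in range(8)]
    print(z, z2 == closed)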

3.1.2 Application Checkpointing

This is all about providing a restore or backup point, to ensure we don't lose
data. The larger and more complex a computer, the more that can go wrong
and the shorter the mean time between failures. Application checkpointing is a
technique whereby the computer system takes a "snapshot" of the application
—a record of all current resource allocations and variable states, akin to a core
dump; this information can be used to restore the program if the computer
should fail. Application checkpointing means that the program has to restart
from only its last checkpoint rather than the beginning. For an application that
may run for months, that is critical. Application checkpointing may also be
used to facilitate process migration.
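A minimal sketch of the checkpoint idea using Python's pickle module: the loop periodically snapshots its state to disk and, if restarted, resumes from the last snapshot. The file name and state layout are made up for illustration.

# checkpoint_sketch.py -- periodically snapshot state and resume after a failure
import os
import pickle

CHECKPOINT = "state.pkl"   # hypothetical checkpoint file

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)      # resume from the last snapshot
    return {"i": 0, "total": 0}        # fresh start

state = load_state()
for i in range(state["i"], 1_000_000):
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:               # take a snapshot every so often
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)
print(state["total"])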

3.1.3 Dependencies in Parallel Algorithms

Understanding data dependencies is fundamental in implementing parallel
algorithms. No program can run more quickly than the longest chain of
dependent calculations (known as the critical path), since calculations that
depend upon prior calculations in the chain must be executed in order.
However, most algorithms do not consist of just a long chain of dependent
calculations; there are usually opportunities to execute independent
calculations in parallel.

Let Pi and Pj be two program fragments. Bernstein's conditions describe when
the two are independent and can be executed in parallel. Let Ii be all of the
input variables to Pi and Oi the output variables, and likewise for Pj. Pi and Pj
are independent if they satisfy the following three conditions: Ij ∩ Oi = ∅ (no
flow dependency), Ii ∩ Oj = ∅ (no anti-dependency), and Oi ∩ Oj = ∅ (no
output dependency).

Consider the following functions, which demonstrate several kinds of dependencies:

1: function Dep(a, b)
2: c := a·b
3: d := 2·c
4: end function

Operation 3 in Dep(a, b) cannot be executed before (or even in parallel with) operation 2,
because operation 3 uses a result from operation 2. It violates condition 1, and thus
introduces a flow dependency.

1: function NoDep(a, b)
2: c := a·b
3: d := 2·b
4: e := a+b
5: end function

In this example, there are no dependencies between the instructions, so they can all be run
in parallel.

Bernstein’s conditions do not allow memory to be shared between different processes. For
that, some means of enforcing an ordering between accesses is necessary, such as
semaphores, barriers or some other synchronization method.
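A small sketch that applies the three conditions above to explicit read/write sets; the variable sets in the example calls are taken from the Dep and NoDep fragments.

# bernstein.py -- checking Bernstein's conditions on explicit read/write sets
def independent(in_i, out_i, in_j, out_j):
    """True if fragments Pi and Pj may execute in parallel."""
    no_flow   = not (in_j & out_i)   # Pj reads nothing that Pi writes
    no_anti   = not (in_i & out_j)   # Pi reads nothing that Pj writes
    no_output = not (out_i & out_j)  # they write no variable in common
    return no_flow and no_anti and no_output

# Dep(a, b): "c := a*b" then "d := 2*c" -> flow dependency on c
print(independent({"a", "b"}, {"c"}, {"c"}, {"d"}))        # False

# NoDep(a, b): "c := a*b" and "e := a+b" share only inputs -> independent
print(independent({"a", "b"}, {"c"}, {"a", "b"}, {"e"}))   # True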

3.2 Parallel Computing in Hardware

Some implementations of parallel computing in hardware are found in vector
processors like the Cray-1, and in many of the classes of parallel computers
discussed earlier (Section 2.5). The NUMA architecture (Section 2.4.2) is a
practical implementation in hardware. However, most new-generation
computers, be they client systems or server systems, come with multiple
processors. Most personal computers today have Intel's Dual-Core or later
processors, or AMD's Sempron series. The need for parallel computing is felt
even at the level of a personal computer.

No supercomputer can be built without the concepts of parallel computing
applied in it. The famous Cray-2 supercomputer used them as well. In fact, one
of the major distinguishing features of a supercomputer is its extensive use of
parallelism.

We will have a short look at how parallel processing is implemented in Intel's
Xeon processors.

3.2.1 Parallel Processing in Intel Xeon Series

The diagrams on the following pages show how multi-threading technology
gives an edge over older, now largely obsolete, single-threading processors.

Thread (instruction) processing in a single-threading processor (top figure):
threads are executed serially.

Thread (instruction) processing in a multi-threading processor (bottom figure):
threads are executed concurrently.

3.3 Applications

The applications of parallel computing are far and wide.



As parallel computers become larger and faster, it becomes feasible to solve
problems that previously took too long to run. Parallel computing is used in a
wide range of fields, from bioinformatics (to do protein folding) to economics
(to do simulation in mathematical finance). Common types of problems found
in parallel computing applications are:

Dense linear algebra

Sparse linear algebra

Spectral methods (such as Cooley-Tukey Fast Fourier transform)

N-body problems (such as Barnes-Hut simulation)

Structured grid problems (such as Lattice Boltzmann methods),

Unstructured grid problems (such as found in finite element analysis)

Monte Carlo simulation

Combinational logic (such as brute-force cryptographic techniques)

Graph traversal (such as Sorting algorithms)

Dynamic programming

Branch and bound methods

Graphical models (such as detecting Hidden Markov models and constructing
Bayesian networks)

Finite State Machine simulation


FUTURE SCOPE

We all know that silicon-based chips are reaching a physical limit in
processing speed, as they are constrained by the speed of electricity, light, and
certain thermodynamic laws. A viable solution to overcome this limitation is to
connect multiple processors working in coordination with each other to solve
grand challenge problems. Hence, high performance computing requires the
use of Massively Parallel Processing (MPP) systems containing thousands of
powerful CPUs.

The development of parallel processing is being influenced by many factors.
The prominent among them include the following:

1. Computational requirements are ever increasing, both in scientific and
business computing. The technical computing problems that require high-speed
computational power are related to life sciences, aerospace, geographical
information systems, mechanical design and analysis, etc.

2. Sequential architectures are reaching physical limits, as they are constrained
by the speed of light and the laws of thermodynamics. The speed at which
sequential CPUs can operate is reaching a saturation point (no more vertical
growth), and hence an alternative way to get high computational speed is to
connect multiple CPUs (an opportunity for horizontal growth).

3. Hardware improvements such as pipelining and superscalar execution are
not infinitely scalable and require sophisticated compiler technology.
Developing such compiler technology is a difficult task.

4. Vector processing works well only for certain kinds of problems, mainly
scientific ones involving many matrix operations. It is not as useful in other
areas such as databases.


5. The technology of parallel processing is mature and can be exploited
commercially; significant research and development work on tools and
environments has already been done.

6. Significant developments in networking technology are paving the way for
heterogeneous computing.

India launched a major initiative in parallel computing in 1988. There were five
or six independent projects to construct parallel processing systems. This was
motivated by the need for advanced computing, a vision of developing its own
technology, and difficulties (political and economic) in obtaining commercial
products.

CONCLUSION


With increasing human demands and greater expectations from technology, the
importance of processing faster and at a larger scale is growing rapidly. In
every sphere of computing, be it an application as ambitious as sending a
spacecraft to another planet, launching a satellite into space, obtaining images
for meteorological purposes, or even imaging the secret hideouts of terrorists,
the need for complex and fast computing is felt. With growing demands and
expectations, parallel processing, in software and in hardware, is the only way
to go.

Many algorithms designed by the developer community must be heavily
prioritized, and some parts of a large algorithm (as big as 10,000 lines) need to
be executed only if certain conditions are met. Also, some algorithms are
time-sliced: they work as if running a marathon, with every piece of the
algorithm scheduled to execute within a limited time frame. Consider a satellite
that needs to send images of the earth and oceans at strict points of time (day,
evening, night), from which forecasts are made based on certain assumptions.
This requires parallel processing to a great extent.

Even in hardware, multi-core processors are used in home PCs. Earlier, the
privilege of multiple processors was restricted to mainframe computers. In
new-generation mobiles (3G and 4G) and PDAs, parallel processing is already
present. Google's Android mobile project uses parallel processing in its phones,
which also provides an edge over its rivals, the Apple iPhone and the Nokia
N96.

All this shows the growing influence of parallelism and the declining use of
sequential processing.

I would like to end this seminar report with one line about parallel computing:

“The future of computing is parallelism!”

REFERENCES


http://portal.acm.org/citation.cfm?id=290768&coll=portal&dl=ACM

http://www-users.cs.umn.edu/~karypis/parbook/

http://www.cs.berkeley.edu/~yelick/cs267-sp04/lectures/01/lect01-intro

http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_01_1999b

http://www.intel.com/technology/computing/dual-core/demo/popup/dualcore.swf

http://www.parallel.ru/ftp/computers/intel/xeon/24896607.pdf

http://www.google.com

http://www.wikipedia.org

http://www.howstuffworks.com

ACKNOWLEDGEMENT

I take this opportunity to express a deep sense of gratitude and sincere thanks
for the invaluable assistance that I received at the worthy hands of my guide,
Dr. R. P. Adgaonkar.

I express my sincere thanks to our HOD of the Computer Department,
Dr. R. P. Adgaonkar, for permitting me to present this seminar, and also to all
staff members who have helped me directly or indirectly.

I also express my thanks to my friends and well-wishers for their support
during the preparation of this seminar.

Finally, I would like to express my sincere gratitude towards my family for
always being there. The credit goes to all of them.

Ameya Waghmare

BE CSE

Roll No. 41


