
Parallel and Distributed

Computing
Joint International Masters

Winter Semester 2010/11

Prof. Dr. Ronald Moore

Last Modified: 10/26/10


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Changes in the course of the semester are possible (probable, even).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 2


Personal Introductions – Instructor (1)

Instructor: Prof. Dr. Ronald Charles Moore


1983      B.Sc., Michigan State Univ.
1983-89   Software Engineer, Texas Instruments, Design Automation Div.
1989      Came to Germany
1995      Diplom-Informatiker, Goethe-Univ. (Frankfurt)
2001      PhD, Goethe-Univ. (Frankfurt)
2001-7    Product Mgr (among other things) with an Internet Company in Frankfurt
since Sept. 2007   with the FBI (a.k.a. Fachbereich Informatik)

Research: Compilers & Architectures for exploiting implicit parallelism.
Business: Web-based Software as a Service (SaaS), serving Financial Market Data Apps...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 3


Personal Introductions – Instructor (2)

Instructor: Prof. Dr. Ronald Charles Moore


Official Subject Areas (Fachgebiete):
Foundations of Informatics – I'm still trying to decide what that means, but this course belongs to it!
Net-Centric Computing – that means Operating Systems and Distributed Systems, among other things.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 4


Personal Introductions – Instructor (3)

Instructor: Prof. Dr. Ronald Charles Moore


Yes, English is my mother-tongue...
...but I do speak German and I'll take questions in either English or German
...but we have participants who do not speak German, so let's try to stick to English most of the time.
...it is your responsibility to tell me should I:
→ speak too quickly, perhaps?
→ use colloquialisms (such as "colloquialisms")
→ ...or terminology you're not familiar with,
→ or just generally stop making sense!

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 5


Personal Introductions – Students

Your Background:
0. What have you learned so far (e.g. in your Bachelor's)?
1. Have you had a chance to program parallel and/or distributed
systems already?
2. What do you expect from this course?
3. What would you like to learn in this course?

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 6


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 7


What is Parallel & Distributed Computing?

A term you always see written like that (seemingly redundant!).
Distributed Computing is Parallel, and Parallel Computing is Distributed, by definition! Right?
We will take Distributed Computing to be computing which is distributed to serve a purpose (other than being parallel) – normally overcoming geographical separation.
Thus Distributed Computing can be taken to be a subset of Parallel Computing.

(Diagram: Distributed Computing drawn as a subset of Parallel Computing.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 8


What else is Parallel Computing?

Why a subset? Why else build or use parallel computers?
One word: Performance!
The other subset of Parallel Computing is High Performance Computing (HPC).
The focus of this course is on Parallelism, both in the sense of Distributed Computing and in the sense of HPC.

(Diagram: Parallel Computing with two overlapping subsets, Distributed Computing and HPC.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 9


Parallel vs. Concurrent

But I haven't defined parallel computing (did you notice?)
Computing is parallel as soon as 2 (or more) operations can take place at the same time. This usually implies multiple CPUs (Cores).
Computing is concurrent as soon as 2 (or more) operations can take place in any order, i.e. they are not placed in a strict sequential order. This implies multiple threads and/or multiple processes.

(Diagram: Concurrent Computing contains Parallel Computing, which in turn contains Distributed Computing and HPC.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 10


Example (1) - Distributed!
The private network connecting the various branches of a bank to the central office.
Meanwhile, extended to accommodate shops, home banking, etc.
(Pardon the German).
Source: “Verteilte Systeme”, Skript, Peter Wollenweber, Hochschule Darmstadt, Fachbereich Informatik.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 11


Examples (2) - Distributed?

The computer center of a large bank – full of servers with connections to the branch offices.
Very heterogeneous.
Much of the architecture consists of legacy systems.

(Diagram: the bank's system landscape – SAP GUI and web browsers, an NT/IIS front end, Oracle and DB2 databases, MQSeries messaging, IMS and MVS mainframe applications, Sybase, Kondor+ and other legacy systems, all tied together by the "LuxNet" network.)

Source: “Verteilte Systeme”, Skript, Alois Schütte, Hochschule Darmstadt, Fachbereich Informatik.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 12


Examples (3) – A Web Server Farm

The Internet? Many systems or one big system?
A "Production" Web-Site is often hosted by a "web server farm" – a.k.a. a "load-balanced cluster" – consisting of:
- 2 (or more) Load Balancers (one active, one stand-by)
- Many Web Servers
- Database Servers
- Possibly connections to other servers, services...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 13


Examples (4) –The Googleplex

Google supports several computer centers with more than 200,000 PCs (as of 2003; meanwhile a lot more!).
Servers:
Load Balancers
Proxy Servers (caches)
Web Servers
Index servers
Document servers
Data-gathering servers
Ad servers
Spelling servers

Source: http://en.wikipedia.org/w/index.php?title=Google_platform&oldid=202504102, Barroso et al., Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, March-April 2003 (http://labs.google.com/papers/googlecluster-ieee.pdf)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 14


Examples (5) – An Internet Company in Frankfurt

Financial Market Data Servers import data from data suppliers, add value, mix in data supplied by users, plus charts and quantitative analysis, and make it available as an internet or intranet application.
Note dataflow, granularity...

(Diagram: an HTTP request for a URL is answered with a web page assembled from components – watchlist, quotes, charts, figures, news, forum – backed by caches of differing lifespans, filled by reads (pull), writes (push), calculations and active invalidate/update processes over external and internal data sources.)
Source: Cotoaga, K.; Müller, A.; Müller, R.: Effiziente Distribution dynamischer Inhalte im
Web. In Wirtschaftsinformatik 44 (2002) 3, p. 249-259.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 15


Examples (6) – Deep Blue
“Deep Blue was a chess-playing computer
developed by IBM. On May 11, 1997, the
machine won a six-game match... against
world champion Garry Kasparov.”

“...It was a 32-node IBM RS/6000 SP high-


performance computer, which utilized the
Power Two Super Chip processors (P2SC).
Each node of the SP employed 8 dedicated
VLSI chess processors, for a total of 256
processors working in tandem. … It was
capable of evaluating 200 million positions per
second... . In June 1997, Deep Blue was the
259th most powerful supercomputer
according to the TOP500 list, achieving 11.38
GFLOPS on the High-Performance LINPACK
benchmark.”

Sources: "Deep Blue (chess computer)." Wikipedia, The Free Encyclopedia.15 Oct 2009
<http://en.wikipedia.org/w/index.php?title=Deep_Blue_(chess_computer)&oldid=318710273>, corrected with information from
<http://www.research.ibm.com/deepblue/meet/html/d.3.shtml>.
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 16
Examples (7) – Blue Gene

"Blue Gene is a computer architecture project designed to produce several supercomputers, designed to reach operating speeds in the PFLOPS (petaFLOPS) range, and currently reaching sustained speeds of nearly 500 TFLOPS (teraFLOPS). ... The project was awarded the National Medal of Technology and Innovation by US President Obama on September 18, 2009."

"On November 12, 2007, the first [Blue Gene/P] system, JUGENE, with 65536 processors is running in the Jülich Research Centre in Germany with a performance of 167 TFLOPS.[15] It is the fastest supercomputer in Europe and the sixth fastest in the world."

Source: "Blue Gene." Wikipedia, The Free Encyclopedia. 15 Oct 2009
<http://en.wikipedia.org/w/index.php?title=Blue_Gene&oldid=319969920>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 17


Examples (8) – HHLR

The Hessische HochLeistungs Rechner.


Right here in Darmstadt.

“The system consists of 15 SMP-Nodes


with a total of 452 Processors. … The 14
computing nodes contain 32 Power6-
CPUs each and at least 128 GB RAM.
When used as Shared Memory
computers (SMP), the nodes are
suitable for applications with high
communication requirements. In order
to also run programs that require more
than one SMP-Node, the computers are
connected by a very fast internal
network ( 8 Lanes DDR Infiniband ).”

Source: “ Der Hessische Hochleistungsrechner”


<http://www.hhlr.tu-darmstadt.de/organisatorisches/startseite/index.de.jsp>, translation by R. C. Moore.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 18


Examples (9) – BOINC + Grids + Clouds = ???

See c't (Heise Verlag), issue 21 (29 Sept.) 2008, pages 128-145.

BOINC = Berkeley Open Infrastructure for Network Computing = open infrastructure for applications such as SETI@home – exploit CPU cycles that would otherwise be wasted.

Grids = more or less open infrastructures for collaborations and resource (and data) sharing among distributed teams. Goal: Utility Computing (role model: the electric grid). Mostly limited (still) to research centers (e.g. CERN).

Clouds = commercial infrastructures (open to paying customers). Newest form of Utility Computing. Built on availability of Virtual Machines (and waste CPU cycles at Amazon, IBM, etc.).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 19


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 20


(De)Motivation: Why (not) Use Parallelism?

Why was Parallel Computing neglected for so long?
Parallelism has been omnipresent in hardware development (computer architecture) for decades.
But it has remained a specialty (often treated as an oddity) in software architecture and software engineering until recently (consider its place in your education so far).
Why?

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 21


(de)Motivation: Moore's Law

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
The number of transistors per chip
doubles every two years.

Result: Until recently, last year's High


Performance Computer was slower
than next year's Commodity Computer
(exaggeration to make a point).

Source: "Moore's law." Wikipedia, The Free Encyclopedia. 15 Oct 2009


<http://en.wikipedia.org/w/index.php?title=Moore%27s_law&oldid=319734055>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 22


(de)Motivation: Amdahl's Law (1)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Two Critical Measurements:
If T(p) = total run-time with p processors, then
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

By the way... Speedup and Efficiency are really crucial concepts! Remember them.
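A quick worked example (numbers made up for illustration): if a program takes T(1) = 100 s on one processor and T(4) = 40 s on four processors, then

$$S(4) = \frac{T(1)}{T(4)} = \frac{100}{40} = 2.5, \qquad E(4) = \frac{S(4)}{4} = 0.625,$$

i.e. the four processors are only used at 62.5% efficiency.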

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 23


(de)Motivation: Amdahl's Law (2)
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Definitions:
T(p) = total run-time with p processors
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

More Definitions:
α = inherently sequential time (e.g. input, output...)
π = parallelizable time.
Thus T(1) = α + π and T(p) = α + π / p.
Let f = α / (α + π), e.g. 1/4th.
Then Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{\alpha + \pi}{\alpha + \pi / p} = \frac{1}{\dfrac{\alpha}{\alpha + \pi} + \dfrac{\pi}{(\alpha + \pi)\,p}} = \frac{1}{f + \dfrac{1 - f}{p}}$$

Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.5 (p. 10-13).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 24


(de)Motivation: Amdahl's Law (3)
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Definitions:
T(p) = total run-time with p processors
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

More Definitions:
α = inherently sequential time (e.g. input, output...)
π = parallelizable time.
Thus T(1) = α + π and T(p) = α + π / p.
Let f = α / (α + π), e.g. 1/4th.

The upper limit on Speedup is

$$S(p) = \frac{1}{f + \dfrac{1 - f}{p}} \;\le\; \frac{1}{f}$$
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.5 (p. 10-13).
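A worked example for the illustrative value f = 1/4 used above: eight processors give less than a threefold speedup, and no number of processors can give more than a fourfold one:

$$S(8) = \frac{1}{\frac{1}{4} + \frac{3/4}{8}} = \frac{1}{0.34375} \approx 2.9, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f} = 4.$$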

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 25


(de)Motivation: Amdahl's Law (4)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
Since the upper limit on Speedup is 1/f, Efficiency goes to zero as p → ∞.
If we introduce communication into our model, we can find a value of p with maximum speedup. Afterwards, more processors mean more run-time!

(Figure: Speedup and Efficiency, with and without communication costs, for f = 0.005 and 1 ≤ p ≤ 10000.)
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Abb. 1.3, p. 12.
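A small numerical sketch of that effect (not from the slides; the linear communication term c·p is an assumed toy model, not the model used in Bauke & Mertens):

#include <cstdio>

// Toy model: T(p) = alpha + pi/p + c*p, i.e. Amdahl's model plus an assumed
// communication/overhead term that grows linearly with the processor count.
int main() {
    const double alpha = 1.0;    // inherently sequential time
    const double pi    = 199.0;  // parallelizable time, so f = 0.005
    const double c     = 0.01;   // assumed communication cost per processor

    const double t1 = alpha + pi;            // the sequential run has no communication
    for (int p = 1; p <= 10000; p *= 10) {
        double tp = alpha + pi / p + c * p;  // parallel run time in this model
        double s  = t1 / tp;                 // speedup
        double e  = s / p;                   // efficiency
        std::printf("p=%6d  S(p)=%7.2f  E(p)=%6.3f\n", p, s, e);
    }
    // With these numbers the speedup peaks near p = sqrt(pi/c) (about 140)
    // and then falls again: more processors mean more run time.
    return 0;
}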

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 26


(de)Motivation: Inherent Difficulty (1)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Prima Facie Argument:
Sequential Programmers have 2 things to worry about:
1) Space (Memory)
2) Time (Instructions)
Parallel Programmers have 3 things to worry about:
1) Space (Memory)
2) Time (Instructions)
3) Choice of Processor (distributing the instructions – and possibly the memory – across the processors).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 27


(de)Motivation: Inherent Difficulty (2)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

An Example: Sequential Scheduling
Problem: Given a set of n jobs, each of which takes time t_i, assign an order to the jobs to minimize total run time.
Solution: Trivial. Any order will do. The run time is always the same:

$$\sum_{0 \le i < n} t_i$$

Comparison: Parallel Scheduling
Problem: Given a set of n jobs, each of which takes time t_i, and m machines, assign each job to a machine so as to minimize total run time.
Solution: Finding an optimal schedule is NP-Complete!!!

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 28


(de)Motivation: Lack of a Unifying Paradigm

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
"Some Parallel Programming Environments from the Mid-1990s"
(Table not reproduced: a long list of such environments.)

(How) Is this really different from the sequential languages?

Source: T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005. Table 2.1, p. 14.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 29


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 30


Motivation(!): Moore meets Multicores

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

The number of transistors keeps growing, but they can't all be used in one CPU any more:
Deep pipelining has its limits; this led to hyper-threading (virtually more than 1 CPU).
Hyper-threading has its limits; this led to multicore chips (really more than 1 CPU per chip).
Both can be combined.
Result: Sequential Processors are dying out! Multiprocessors are now the rule, not the exception!

(Diagram: register sets and cores on a die, dies in a package, packages on a main-board.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 31


Motivation(!): Amdahl meets Gustafson
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Gustafson's Law (by the way, Gustafson published it, but said it was due to E. Barsis):
Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{f\,T(p) + p\,(1 - f)\,T(p)}{T(p)} = f + p\,(1 - f) = p + (1 - p)\,f$$

Amdahl: Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{1}{f + \dfrac{1 - f}{p}}$$

(Figure: how T(1) and T(p) scale with p under Amdahl and under Gustafson.)

Moral of the Story:
Speedup is limited only if T(1) is held constant;
for sufficiently large T(1), f goes to zero.
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.7 and Abb. 1.6, p. 16.
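A worked comparison with illustrative numbers (f = 0.05, p = 100), keeping in mind that the two laws answer different questions – Amdahl holds the problem size fixed, Gustafson holds the parallel run time fixed:

$$S_{\mathrm{Amdahl}}(100) = \frac{1}{0.05 + \frac{0.95}{100}} \approx 16.8, \qquad S_{\mathrm{Gustafson}}(100) = 0.05 + 100 \cdot 0.95 = 95.05.$$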

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 32


Motivation(!): Isoefficiency

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Can we extend Amdahl's Law to express how problems scale?

New definition of parallel run time:

$$T(n,p) = \alpha(n) + \frac{\pi(n)}{p} + T_c(n,p)$$

where T_c(n,p) is the communication costs + overhead.

It follows that efficiency is

$$E(n,p) = \frac{T(n,1)}{p\,T(n,p)} = \frac{\alpha(n) + \pi(n)}{p\,\alpha(n) + \pi(n) + p\,T_c(n,p)} = \frac{\alpha(n) + \pi(n)}{\alpha(n) + \pi(n) + T_0(n,p)}$$

where $T_0(n,p) = (p-1)\,\alpha(n) + p\,T_c(n,p)$.

The Isoefficiency Equation: constant Efficiency implies that

$$T(n,1) = \frac{E(n,p)}{1 - E(n,p)}\,T_0(n,p) = C\,T_0(n,p)$$

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 33


Motivation(!): Amdahl meets Disney/Pixar

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

1995: Disney/Pixar release Toy Story.
Length: 51 minutes, 44 seconds, 24 frames per second.
Rendered on a "render farm" with 100 dual-processor workstations.

1999: Disney/Pixar release Toy Story 2.
Length: 92 minutes.
Rendered on a 1,400-processor system.

2001: Disney/Pixar release Monsters, Inc.
Length: 94 minutes.
Rendered on a system with 250 enterprise servers, each with 14 processors, for a total of 3,500 processors.

Moral of the Story: Problems scale. There are problems out there that will gladly take all the processing power we can give them.

Source: T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005, p. 1-2.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 34


Motivation(!): Embarrassing Parallelism

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Example: Parallel Scheduling is NP-Complete...

Counter-Example 1: Finding the optimal schedule is very hard; finding a nearly optimal schedule is easy. Longest-Processing-Time-first scheduling is a "4/3-optimal solution" (never worse than 4/3 the optimum). See the sketch below.

Counter-Example 2: Given sufficiently many, sufficiently small jobs, any schedule approximates the optimum.

Definition: An embarrassingly parallel problem is a problem consisting of very many (essentially) independent tasks.

Clay Breshears ("The Art of Concurrency", O'Reilly, 2009) suggests saying "enchantingly parallel" instead of "embarrassingly parallel".

Of course, not all problems are embarrassingly/enchantingly parallel, and we don't want to restrict ourselves to those that are.
The others are fun too! Like rendering Pixar films!
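To make Counter-Example 1 concrete, here is a minimal sketch (not from the slides) of Longest-Processing-Time-first scheduling; the job times and the machine count are made up for illustration:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// LPT: sort the jobs by decreasing run time, then always give the next job
// to the machine that has the least work so far. The resulting makespan is
// never worse than 4/3 of the optimal one.
int main() {
    const int times[] = {7, 5, 4, 3, 3, 2, 2, 1};       // made-up job times
    std::vector<int> jobs(times, times + 8);
    const int machines = 3;                              // made-up machine count

    std::sort(jobs.begin(), jobs.end(), std::greater<int>());

    // Min-heap of (current load, machine id): the top is the least-loaded machine.
    typedef std::pair<int, int> Load;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load> > heap;
    for (int m = 0; m < machines; ++m) heap.push(Load(0, m));

    std::vector<int> load(machines, 0);
    for (int j = 0; j < (int) jobs.size(); ++j) {
        Load least = heap.top();                         // least-loaded machine so far
        heap.pop();
        int m = least.second;
        load[m] += jobs[j];
        std::printf("job of length %d -> machine %d\n", jobs[j], m);
        heap.push(Load(load[m], m));
    }
    std::printf("makespan = %d\n", *std::max_element(load.begin(), load.end()));
    return 0;
}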

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 35


Motivation(!): Something like Convergence

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Of all of these [mid-1990s environments], only Java... ...and MPI have survived – more or less – plus newcomers: C++0x (Boost), OpenMP & something for GPGPUs.
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 36
Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 37


Paradigms vs. Programming Packages
Paradigm \ Package               Java / C++0x (Boost)            OpenMP                                 MPI
Shared Memory                    Yes                             Yes                                    No (*)
Message Passing                  Yes                             No                                     Yes
Programming Language Extension   No (Java), Minimally (C++0x)    Debatably Yes (Compiler-Extensions)    No (!)
Libraries                        Yes                             Yes                                    Yes
Languages                        Java, C++                       C, C++, Fortran                        C, C++, Fortran, Java, ...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 38


Boost: Threads in C++: History

C++:
- Inherits POSIX threads and sockets from C / Unix
- Extended with Templates, Standard Template Library (STL)
- Upcoming Standards: (a) STL TR (Technical Report) 1, (b) C++0x, (c) STL TR (Technical Report) 2
  - now include STL-style threads & sockets (amongst other things)
  - all tried out first in the Boost Library(!)
- Threading & sockets are: not in TR1, may be in C++0x, will be in TR2, in BOOST.

Java:
- Includes Threads & Sockets from the start
- Extended with Templates / Generics
- Threading later further extended (e.g. thread pools).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 39


Boost: Threads in C++ (1)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

The boost::thread constructor starts a thread running (here given a function pointer); the join method waits until a thread is finished.

We have a synchronization problem here. Where?

#include <boost/thread.hpp>
#include <iostream>

void thread() {
    std::cout << "Hello from Thread "
              << boost::this_thread::get_id()
              << std::endl;
}

int main()
{
    boost::thread t1(thread);
    boost::thread t2(thread);
    t1.join();
    t2.join();
}
Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 40


Boost: Threads in C++ (2)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

Mutex = Mutual Exclusion. Mutexes are used to guard critical sections – areas with limited (or no) parallelism.

#include <boost/thread.hpp>
#include <iostream>

// global variable, shared by all threads
boost::mutex mutex;

void thread() {
    mutex.lock(); // critical section!
    std::cout << "Hello from Thread "
              << boost::this_thread::get_id()
              << std::endl;
    mutex.unlock(); // parallelism resumes here...
}

int main()
{
    boost::thread t1(thread);
    boost::thread t2(thread);
    t1.join();
    t2.join();
}

Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 41


Boost: Threads in C++ (3)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

We need more than static (global) and local variables!

"thread_local" will be a new keyword in C++0x. In the meantime, Boost uses a template (boost::thread_specific_ptr).

lock_guard: the constructor locks, the destructor unlocks.
Principle: RAII = Resource Acquisition Is Initialization.

#include <boost/thread.hpp>
#include <iostream>
#include <cstdlib>
#include <ctime>

void init_number_generator() {
    static boost::thread_specific_ptr<bool> tls;
    if (!tls.get())
        tls.reset(new bool(false));
    if (!*tls) {
        *tls = true;
        std::srand(static_cast<unsigned int>(std::time(0)));
    }
} // end init_number_generator

boost::mutex mutex;

void random_number_generator() {
    init_number_generator();
    int i = std::rand();
    boost::lock_guard<boost::mutex> lock(mutex);
    std::cout << i << std::endl;
} // end random_number_generator

int main() {
    boost::thread t[3];
    for (int i = 0; i < 3; ++i)
        t[i] = boost::thread(random_number_generator);
    for (int i = 0; i < 3; ++i)
        t[i].join();
}
Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 42


Boost: Threads in C++ (4)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

Sockets are part of the more general "asynchronous I/O" (asio) library.
Central concepts: IO_Services & Handlers.
This example connects to a web server (via TCP) and downloads an HTML page.

#include <boost/asio.hpp>
#include <boost/array.hpp>
#include <iostream>
#include <string>

boost::asio::io_service io_service;
boost::asio::ip::tcp::resolver resolver(io_service);
boost::asio::ip::tcp::socket sock(io_service);
boost::array<char, 4096> buffer;

void read_handler(const boost::system::error_code &ec, std::size_t bytes_transferred) {
    if (!ec) {
        std::cout << std::string(buffer.data(), bytes_transferred) << std::endl;
        sock.async_read_some(boost::asio::buffer(buffer), read_handler);
    }
} // end read_handler

void connect_handler(const boost::system::error_code &ec) {
    if (!ec) {
        boost::asio::write(sock, boost::asio::buffer("GET / HTTP 1.1\r\nHost: highscore.de\r\n\r\n"));
        sock.async_read_some(boost::asio::buffer(buffer), read_handler);
    }
} // end connect_handler

void resolve_handler(const boost::system::error_code &ec,
                     boost::asio::ip::tcp::resolver::iterator it) {
    if (!ec) {
        sock.async_connect(*it, connect_handler);
    }
} // end resolve_handler

int main() {
    boost::asio::ip::tcp::resolver::query query("www.highscore.de", "80");
    resolver.async_resolve(query, resolve_handler);
    io_service.run();
}
Source: http://en.highscore.de/cpp/boost/asio.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 43


OpenMP: Compiler Supported Parallelism (1)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP is a compiler extension (not really a language extension) for C, C++ and Fortran. It (usually) assumes shared memory. Supported by the Intel and gnu compilers (et al.).
Not to be confused with OpenMPI!

#include <omp.h>
#include <cstdio>

int main()
{
    // serial startup goes here...
    #pragma omp parallel num_threads (8)
    { // parallel segment
        printf("\nHello world, I am thread %d\n",
               omp_get_thread_num());
    }
    // rest of serial segment ...
}
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 44
OpenMP: Compiler Supported Parallelism (2)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP encourages a Single Program, Multiple Data style, where parallel blocks follow each other sequentially (first fork, then join...).

Example output of the two serial printf calls below:
Thread0 a=1, b=9, sum=8
Thread0 a=1, b=17, sum=16

#include <omp.h>
#include <cstdio>

int a, b, sum;

int main()
{
    // serial segment
    b = 1;
    a = 1;
    sum = 0;

    #pragma omp parallel num_threads (8) private (a) shared (b) reduction(+:sum)
    { // parallel segment
        a += 1; b += 1; sum = 1;
        printf("\nThread%d a=%d, b=%d, sum=%d",
               omp_get_thread_num(), a, b, sum);
    }
    // rest of serial segment
    printf(" Thread%d a=%d, b=%d, sum=%d\n",
           omp_get_thread_num(), a, b, sum);

    #pragma omp parallel num_threads (8) private (a) shared (b) reduction(+:sum)
    { // parallel segment
        a += 1; b += 1; sum = 1;
        printf("\nThread%d a=%d, b=%d, sum=%d",
               omp_get_thread_num(), a, b, sum);
    }
    // rest of serial segment
    printf(" Thread%d a=%d, b=%d, sum=%d\n",
           omp_get_thread_num(), a, b, sum);
}

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 45


OpenMP: Compiler Supported Parallelism (3)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP provides semi-automatic parallelization of loops. This is the dominant programming style in OpenMP.

#include <omp.h>
#include <cstdio>

int sum;
float a[100], b[100];

int main()
{
    // serial segment
    sum = 0;
    // Initialise array a
    for (int j = 0; j < 100; ++j) a[j] = j;

    #pragma omp parallel num_threads (2)
    { // parallel segment
        #pragma omp for
        for (int i = 0; i < 100; ++i)
            b[i] = a[i] * 2.0;
    }
    // rest of serial segment
    // Print array b
    for (int k = 0; k < 100; ++k)
        printf("%5.2f ", b[k]);
}
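Not from the slides: the fork and the loop-splitting can also be written as one combined directive, and a reduction clause can collect per-thread partial results. A minimal sketch in the style of the example above (array and values made up):

#include <omp.h>
#include <cstdio>

int main()
{
    float a[100];
    for (int j = 0; j < 100; ++j) a[j] = j;

    float total = 0.0f;
    // One combined directive: fork, split the loop among the threads,
    // and add the per-thread partial sums into 'total' at the join.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 100; ++i)
        total += a[i] * 2.0f;

    printf("total = %.2f\n", total); // expect 9900.00
}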
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 46
MPI: Library-Supported Parallelism (1)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

MPI is a library standard, defined for many languages (including Fortran, C and C++ – C++ is shown here), available in various implementations (e.g. Open MPI, not to be confused with OpenMP). It does not assume shared memory – but uses it where available.

Every MPI program should start (the MPI-enabled portion) with MPI::Init() and end with MPI::Finalize().

The number of parallel processes is determined by how the programs are run – e.g. with the command mpirun.

#include "mpi.h"
#include <iostream>

int main(int argc, char **argv)
{
    int rank, size;
    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    std::cout << "Hello, world! I am "
              << rank << " of "
              << size << std::endl;
    MPI::Finalize();
    return 0;
}
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 47


MPI: Library-Supported Parallelism (2)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

The basic operations in MPI are sending and receiving messages. Message passing can be blocking or non-blocking. Functions are available for handling arbitrary data – and for ensuring compatibility across heterogeneous machines.

This example passes a message around a ring of processes:

#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, next, prev, message, tag = 201;
    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;
    message = 10;
    if (0 == rank)
        MPI::COMM_WORLD.Send(&message, 1, MPI::INT, next, tag);
    while (1) {
        MPI::COMM_WORLD.Recv(&message, 1, MPI::INT, prev, tag);
        if (0 == rank) --message;
        MPI::COMM_WORLD.Send(&message, 1, MPI::INT, next, tag);
        if (0 == message) break; // exit loop!
    } // end while
    if (0 == rank)
        MPI::COMM_WORLD.Recv(&message, 1, MPI::INT, prev, tag);
    MPI::Finalize();
    return 0;
}

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 48


MPI: Library-Supported Parallelism (3a)

MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

MPI also contains message passing functions that make it easy to distribute data (usually arrays) across a group of processors. A sketch follows below.

Bcast:   before: P0: A,  P1: -,  P2: -,  P3: -     after: P0: A,  P1: A,  P2: A,  P3: A
Scatter: before: P0: A B C D,  P1-P3: -            after: P0: A,  P1: B,  P2: C,  P3: D
Gather:  the inverse of Scatter.
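As an illustration (not from the slides), a minimal Scatter/Gather sketch using the same C++ bindings as the examples above; the data values are made up:

#include "mpi.h"
#include <iostream>

int main(int argc, char **argv)
{
    MPI::Init();
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    // The root (rank 0) prepares one value per process: 0, 10, 20, ...
    int *senddata = 0;
    if (0 == rank) {
        senddata = new int[size];
        for (int i = 0; i < size; ++i) senddata[i] = 10 * i;
    }

    int mine = 0;
    // Scatter: process i receives senddata[i] from the root.
    MPI::COMM_WORLD.Scatter(senddata, 1, MPI::INT, &mine, 1, MPI::INT, 0);
    mine += rank; // each process works on its own piece

    int *result = (0 == rank) ? new int[size] : 0;
    // Gather: the root collects one value from every process, in rank order.
    MPI::COMM_WORLD.Gather(&mine, 1, MPI::INT, result, 1, MPI::INT, 0);

    if (0 == rank) {
        for (int i = 0; i < size; ++i) std::cout << result[i] << " ";
        std::cout << std::endl;
        delete [] senddata;
        delete [] result;
    }
    MPI::Finalize();
    return 0;
}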

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 49


MPI: Library-Supported Parallelism (3b)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

Allgather: before: P0: A,  P1: B,  P2: C,  P3: D    after: each of P0-P3 holds A B C D
Alltoall:  before: P0: A0 A1 A2 A3,  P1: B0 B1 B2 B3,  P2: C0 C1 C2 C3,  P3: D0 D1 D2 D3
           after:  P0: A0 B0 C0 D0,  P1: A1 B1 C1 D1,  P2: A2 B2 C2 D2,  P3: A3 B3 C3 D3

There are more; this is not a complete list.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 50


Outlook:

The cutting edge? Specialized Hardware:
- General Purpose Graphics Processing Units (GPGPUs)
- Cell Processors
- Field Programmable Gate Arrays (FPGAs)

Problem: How to program them?
- CUDA? Looks promising, but only NVIDIA.
- OpenCL? Looks promising, promises to support many architectures...
- ...?!?

Source: "CUDA." Wikipedia, The Free Encyclopedia. 28 Oct 2009, <http://en.wikipedia.org/w/index.php?title=CUDA&oldid=322466787>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 51


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 52


Summary (of Chap. 1)
A) Personal Introductions
B) What is Parallel & Distributed Computing?
   - Distributed Computing is a subset of Parallel Computing.
   - High Performance Computing (HPC) is another subset.
   - Parallel is, strictly speaking, a subset of concurrent.
C) (De)Motivation: Why (not) Use Parallelism?
   - Moore's Law – vs. Multicore Processors
   - Amdahl's Law – vs. Gustafson's Law
   - Inherent Difficulty – vs. Embarrassing (enchanting) Parallelism
   - Lack of a Unifying Paradigm – vs. Convergence to...
   - Key Measurements: Speedup & Efficiency
D) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
   - Boost (C++0x) offers threads & synchronization (mutexes etc.) for shared memory programming, and sockets for message passing, for one language (C++).
   - OpenMP offers shared memory programming with Loop Parallelism for languages with extended compilers (C, C++ & Fortran).
   - MPI offers message passing programming with data parallelism in the form of a library for multiple languages (mainly C, C++ & Fortran).
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 53
Bibliography
Main Sources:
H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006.
Clay Breshears, The Art of Concurrency, O'Reilly Media Inc, 2009.
Ian Foster, Designing and Building Parallel Programs, Addison-Wesley Publishing, 1995.
   Available on-line: http://www.mcs.anl.gov/~itf/dbpp/ (!)
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley (Pearson), 2003.
T. G. Mattson, B. A. Sanders & B. L. Massingill, Patterns for Parallel Programming, Addison-Wesley (Pearson Education), 2005.
R. C. Moore, SDAARC: A Self Distributing Associative Architecture, Shaker Verlag, 2001 – Hard to find! ;-)
This list is not meant to be exhaustive.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 54
