
Parallel and Distributed

Computing
Joint International Masters

Winter Semester 2010/11

Prof. Dr. Ronald Moore

Last Modified: 10/26/10


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Changes in the course of the semester are possible (probable, even).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 2


Personal Introductions – Instructor (1)

Instructor: Prof. Dr. Ronald Charles Moore


1983      B.Sc., Michigan State Univ.
1983-89   Software Engineer, Texas Instruments, Design Automation Div.
1989      Came to Germany
1995      Diplom-Informatiker, Goethe-Univ. (Frankfurt)
2001      PhD, Goethe-Univ. (Frankfurt)
2001-7    Product Mgr (among other things) with an Internet Company in Frankfurt
since Sept. 2007   with the FBI (a.k.a. Fachbereich Informatik)

Research: Compilers & Architectures for exploiting implicit parallelism.
Business: Web-based Software as a Service (SaaS), serving Financial Market Data Apps...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 3


Personal Introductions – Instructor (2)

Instructor: Prof. Dr. Ronald Charles Moore


Official Subject Areas (Fachgebiete):
Foundations of Informatics – I'm still trying to decide what that means, but this course belongs to it!
Net-Centric Computing – that means Operating Systems and Distributed Systems, among other things.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 4


Personal Introductions – Instructor (3)

Instructor: Prof. Dr. Ronald Charles Moore


Yes, English is my mother-tongue...
...but I do speak German and I'll take questions in either English or German
...but we have participants who do not speak German, so let's try to stick to English most of the time.
...it is your responsibility to tell me should I:
→ speak too quickly, perhaps?
→ use colloquialisms (such as "colloquialisms")
→ ...or terminology you're not familiar with,
→ or just generally stop making sense!

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 5


Personal Introductions – Students

Your Background:
0. What have you learned so far (e.g. in your Bachelor's)?
1. Have you had a chance to program parallel and/or distributed
systems already?
2. What do you expect from this course?
3. What would you like to learn in this course?

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 6


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 7


What is Parallel & Distributed Computing?

A term you always see written like that (seemingly redundant!).
Distributed Computing is Parallel, and Parallel Computing is Distributed, by definition! Right?
We will take Distributed Computing to be computing which is distributed to serve a purpose (other than being parallel) – normally overcoming geographical separation.
Thus Distributed Computing can be taken to be a subset of Parallel Computing.

(Diagram: Distributed Computing drawn as a subset of Parallel Computing.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 8


What else is Parallel Computing?

Why a subset? Why else build or use parallel computers?
One word: Performance!
The other subset of Parallel Computing is High Performance Computing (HPC).
The focus of this course is on Parallelism, both in the sense of Distributed Computing and in the sense of HPC.

(Diagram: Parallel Computing with two overlapping subsets, Distributed Computing and HPC.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 9


Parallel vs. Concurrent

But I haven't defined parallel computing (did you notice?)
Computing is parallel as soon as 2 (or more) operations can take place at the same time. This usually implies multiple CPUs (Cores).
Computing is concurrent as soon as 2 (or more) operations can take place in any order, i.e. they are not placed in a strict sequential order. This implies multiple threads and/or multiple processes.

(Diagram: Concurrent Computing contains Parallel Computing, which in turn contains Distributed Computing and HPC.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 10


Example (1) - Distributed!
The private network connecting the various branches of a bank to the central office.
Meanwhile, extended to accommodate shops, home banking, etc.
(Pardon the German).
Source: “Verteilte Systeme”, Skript, Peter Wollenweber, Hochschule Darmstadt, Fachbereich Informatik.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 11


Examples (2) - Distributed?

The computer center of a large bank – full of servers with connections to the branch offices.
Very heterogeneous.
Much of the architecture consists of legacy systems.

(Diagram: the bank's system landscape – SAP GUI and web browsers, an NT/IIS front end, Oracle and DB2 databases, MQSeries messaging, IMS and MVS mainframe applications, Sybase, Kondor+ and other legacy systems, all tied together by the "LuxNet" network.)

Source: “Verteilte Systeme”, Skript, Alois Schütte, Hochschule Darmstadt, Fachbereich Informatik.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 12


Examples (3) – A Web Server Farm

The Internet? Many systems or one big system?
A "Production" Web-Site is often hosted by a "web server farm" – a.k.a. a "load-balanced cluster" – consisting of:
- 2 (or more) Load Balancers (one active, one stand-by)
- Many Web Servers
- Database Servers
- Possibly connections to other servers, services...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 13


Examples (4) –The Googleplex

Google supports several computer centers with more than 200,000 PCs (as of 2003; meanwhile a lot more!).
Servers:
Load Balancers
Proxy Servers (caches)
Web Servers
Index servers
Document servers
Data-gathering servers
Ad servers
Spelling servers

Source: http://en.wikipedia.org/w/index.php?title=Google_platform&oldid=202504102, Barroso et al., Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, March-April 2003 (http://labs.google.com/papers/googlecluster-ieee.pdf)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 14


Examples (5) – An Internet Company in Frankfurt

Financial Market Data Servers import data from data suppliers, add value, mix in data supplied by users, plus charts and quantitative analysis, and make it available as an internet or intranet application.
Note dataflow, granularity...

(Diagram: an HTTP request for a URL is answered with a web page assembled from components – watchlist, quotes, charts, figures, news, forum – backed by caches of differing lifespans, filled by reads (pull), writes (push), calculations and active invalidate/update processes over external and internal data sources.)
Source: Cotoaga, K.; Müller, A.; Müller, R.: Effiziente Distribution dynamischer Inhalte im
Web. In Wirtschaftsinformatik 44 (2002) 3, p. 249-259.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 15


Examples (6) – Deep Blue
“Deep Blue was a chess-playing computer
developed by IBM. On May 11, 1997, the
machine won a six-game match... against
world champion Garry Kasparov.”

“...It was a 32-node IBM RS/6000 SP high-


performance computer, which utilized the
Power Two Super Chip processors (P2SC).
Each node of the SP employed 8 dedicated
VLSI chess processors, for a total of 256
processors working in tandem. … It was
capable of evaluating 200 million positions per
second... . In June 1997, Deep Blue was the
259th most powerful supercomputer
according to the TOP500 list, achieving 11.38
GFLOPS on the High-Performance LINPACK
benchmark.”

Sources: "Deep Blue (chess computer)." Wikipedia, The Free Encyclopedia.15 Oct 2009
<http://en.wikipedia.org/w/index.php?title=Deep_Blue_(chess_computer)&oldid=318710273>, corrected with information from
<http://www.research.ibm.com/deepblue/meet/html/d.3.shtml>.
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 16
Examples (7) – Blue Gene

"Blue Gene is a computer architecture project designed to produce several supercomputers, designed to reach operating speeds in the PFLOPS (petaFLOPS) range, and currently reaching sustained speeds of nearly 500 TFLOPS (teraFLOPS). ... The project was awarded the National Medal of Technology and Innovation by US President Obama on September 18, 2009."

"On November 12, 2007, the first [Blue Gene/P] system, JUGENE, with 65536 processors is running in the Jülich Research Centre in Germany with a performance of 167 TFLOPS.[15] It is the fastest supercomputer in Europe and the sixth fastest in the world."

Source: "Blue Gene." Wikipedia, The Free Encyclopedia. 15 Oct 2009
<http://en.wikipedia.org/w/index.php?title=Blue_Gene&oldid=319969920>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 17


Examples (8) – HHLR

The Hessische HochLeistungs Rechner.


Right here in Darmstadt.

“The system consists of 15 SMP-Nodes


with a total of 452 Processors. … The 14
computing nodes contain 32 Power6-
CPUs each and at least 128 GB RAM.
When used as Shared Memory
computers (SMP), the nodes are
suitable for applications with high
communication requirements. In order
to also run programs that require more
than one SMP-Node, the computers are
connected by a very fast internal
network ( 8 Lanes DDR Infiniband ).”

Source: “ Der Hessische Hochleistungsrechner”


<http://www.hhlr.tu-darmstadt.de/organisatorisches/startseite/index.de.jsp>, translation by R. C. Moore.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 18


Examples (9) – BOINC + Grids + Clouds = ???

See c't (Heise Verlag), issue 21 (29 Sept.) 2008, pages 128-145.

BOINC = Berkeley Open Infrastructure for Network Computing = open infrastructure for applications such as SETI@home – exploit CPU cycles that would otherwise be wasted.

Grids = more or less open infrastructures for collaborations and resource (and data) sharing among distributed teams. Goal: Utility Computing (role model: the electric grid). Mostly limited (still) to research centers (e.g. CERN).

Clouds = commercial infrastructures (open to paying customers). Newest form of Utility Computing. Built on availability of Virtual Machines (and waste CPU cycles at Amazon, IBM, etc.).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 19


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 20


(De)Motivation: Why (not) Use Parallelism?

Why was Parallel Computing neglected for so long?
Parallelism has been omnipresent in hardware development (computer architecture) for decades.
But it has remained a specialty (often treated as an oddity) in software architecture and software engineering until recently (consider its place in your education so far).
Why?

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 21


(de)Motivation: Moore's Law

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
The number of transistors per chip
doubles every two years.

Result: Until recently, last year's High


Performance Computer was slower
than next year's Commodity Computer
(exaggeration to make a point).

Source: "Moore's law." Wikipedia, The Free Encyclopedia. 15 Oct 2009


<http://en.wikipedia.org/w/index.php?title=Moore%27s_law&oldid=319734055>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 22


(de)Motivation: Amdahl's Law (1)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Two Critical Measurements:
If T(p) = total run-time with p processors, then
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

By the way... Speedup and Efficiency are really crucial concepts! Remember them.
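A quick worked example (numbers made up for illustration): if a program takes T(1) = 100 s on one processor and T(4) = 40 s on four processors, then

$$S(4) = \frac{T(1)}{T(4)} = \frac{100}{40} = 2.5, \qquad E(4) = \frac{S(4)}{4} = 0.625,$$

i.e. the four processors are only used at 62.5% efficiency.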

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 23


(de)Motivation: Amdahl's Law (2)
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Definitions:
T(p) = total run-time with p processors
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

More Definitions:
α = inherently sequential time (e.g. input, output...)
π = parallelizable time.
Thus T(1) = α + π and T(p) = α + π / p.
Let f = α / (α + π), e.g. 1/4th.
Then Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{\alpha + \pi}{\alpha + \pi / p} = \frac{1}{\dfrac{\alpha}{\alpha + \pi} + \dfrac{\pi}{(\alpha + \pi)\,p}} = \frac{1}{f + \dfrac{1 - f}{p}}$$

Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.5 (p. 10-13).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 24


(de)Motivation: Amdahl's Law (3)
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Definitions:
T(p) = total run-time with p processors
Speedup = S(p) = T(1) / T(p).
Efficiency = E(p) = S(p) / p.

More Definitions:
α = inherently sequential time (e.g. input, output...)
π = parallelizable time.
Thus T(1) = α + π and T(p) = α + π / p.
Let f = α / (α + π), e.g. 1/4th.

The upper limit on Speedup is

$$S(p) = \frac{1}{f + \dfrac{1 - f}{p}} \;\le\; \frac{1}{f}$$
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.5 (p. 10-13).
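A worked example for the illustrative value f = 1/4 used above: eight processors give less than a threefold speedup, and no number of processors can give more than a fourfold one:

$$S(8) = \frac{1}{\frac{1}{4} + \frac{3/4}{8}} = \frac{1}{0.34375} \approx 2.9, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f} = 4.$$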

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 25


(de)Motivation: Amdahl's Law (4)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
Since the upper limit on Speedup is 1/f, Efficiency goes to zero as p → ∞.
If we introduce communication into our model, we can find a value of p with maximum speedup. Afterwards, more processors mean more run-time!

(Figure: Speedup and Efficiency, with and without communication costs, for f = 0.005 and 1 ≤ p ≤ 10000.)
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Abb. 1.3, p. 12.
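A small numerical sketch of that effect (not from the slides; the linear communication term c·p is an assumed toy model, not the model used in Bauke & Mertens):

#include <cstdio>

// Toy model: T(p) = alpha + pi/p + c*p, i.e. Amdahl's model plus an assumed
// communication/overhead term that grows linearly with the processor count.
int main() {
    const double alpha = 1.0;    // inherently sequential time
    const double pi    = 199.0;  // parallelizable time, so f = 0.005
    const double c     = 0.01;   // assumed communication cost per processor

    const double t1 = alpha + pi;            // the sequential run has no communication
    for (int p = 1; p <= 10000; p *= 10) {
        double tp = alpha + pi / p + c * p;  // parallel run time in this model
        double s  = t1 / tp;                 // speedup
        double e  = s / p;                   // efficiency
        std::printf("p=%6d  S(p)=%7.2f  E(p)=%6.3f\n", p, s, e);
    }
    // With these numbers the speedup peaks near p = sqrt(pi/c) (about 140)
    // and then falls again: more processors mean more run time.
    return 0;
}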

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 26


(de)Motivation: Inherent Difficulty (1)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Prima Facie Argument:
Sequential Programmers have 2 things to worry about:
1) Space (Memory)
2) Time (Instructions)
Parallel Programmers have 3 things to worry about:
1) Space (Memory)
2) Time (Instructions)
3) Choice of Processor (distributing the instructions – and possibly the memory – across the processors).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 27


(de)Motivation: Inherent Difficulty (2)

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

An Example: Sequential Scheduling
Problem: Given a set of n jobs, each of which takes time t_i, assign an order to the jobs to minimize total run time.
Solution: Trivial. Any order will do. The run time is always the same:

$$\sum_{0 \le i < n} t_i$$

Comparison: Parallel Scheduling
Problem: Given a set of n jobs, each of which takes time t_i, and m machines, assign each job to a machine so as to minimize total run time.
Solution: Finding an optimal schedule is NP-Complete!!!

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 28


(de)Motivation: Lack of a Unifying Paradigm

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying
Paradigm
"Some Parallel Programming Environments from the Mid-1990s"
(Table not reproduced: a long list of such environments.)

(How) Is this really different from the sequential languages?

Source: T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005. Table 2.1, p. 14.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 29


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 30


Motivation(!): Moore meets Multicores

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

The number of transistors keeps growing, but they can't all be used in one CPU any more:
Deep pipelining has its limits; this led to hyper-threading (virtually more than 1 CPU).
Hyper-threading has its limits; this led to multicore chips (really more than 1 CPU per chip).
Both can be combined.
Result: Sequential Processors are dying out! Multiprocessors are now the rule, not the exception!

(Diagram: register sets and cores on a die, dies in a package, packages on a main-board.)

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 31


Motivation(!): Amdahl meets Gustafson
Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Gustafson's Law (by the way, Gustafson published it, but said it was due to E. Barsis):
Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{f\,T(p) + p\,(1 - f)\,T(p)}{T(p)} = f + p\,(1 - f) = p + (1 - p)\,f$$

Amdahl: Speedup is

$$S(p) = \frac{T(1)}{T(p)} = \frac{1}{f + \dfrac{1 - f}{p}}$$

(Figure: how T(1) and T(p) scale with p under Amdahl and under Gustafson.)

Moral of the Story:
Speedup is limited only if T(1) is held constant;
for sufficiently large T(1), f goes to zero.
Source: H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006. Section 1.7 and Abb. 1.6, p. 16.
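A worked comparison with illustrative numbers (f = 0.05, p = 100), keeping in mind that the two laws answer different questions – Amdahl holds the problem size fixed, Gustafson holds the parallel run time fixed:

$$S_{\mathrm{Amdahl}}(100) = \frac{1}{0.05 + \frac{0.95}{100}} \approx 16.8, \qquad S_{\mathrm{Gustafson}}(100) = 0.05 + 100 \cdot 0.95 = 95.05.$$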

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 32


Motivation(!): Isoefficiency

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Can we extend Amdahl's Law to express how problems scale?

New definition of parallel run time:

$$T(n,p) = \alpha(n) + \frac{\pi(n)}{p} + T_c(n,p)$$

where T_c(n,p) is the communication costs + overhead.

It follows that efficiency is

$$E(n,p) = \frac{T(n,1)}{p\,T(n,p)} = \frac{\alpha(n) + \pi(n)}{p\,\alpha(n) + \pi(n) + p\,T_c(n,p)} = \frac{\alpha(n) + \pi(n)}{\alpha(n) + \pi(n) + T_0(n,p)}$$

where $T_0(n,p) = (p-1)\,\alpha(n) + p\,T_c(n,p)$.

The Isoefficiency Equation: constant Efficiency implies that

$$T(n,1) = \frac{E(n,p)}{1 - E(n,p)}\,T_0(n,p) = C\,T_0(n,p)$$

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 33


Motivation(!): Amdahl meets Disney/Pixar

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

1995: Disney/Pixar release Toy Story.
Length: 51 minutes, 44 seconds, 24 frames per second.
Rendered on a "render farm" with 100 dual-processor workstations.

1999: Disney/Pixar release Toy Story 2.
Length: 92 minutes.
Rendered on a 1,400-processor system.

2001: Disney/Pixar release Monsters, Inc.
Length: 94 minutes.
Rendered on a system with 250 enterprise servers, each with 14 processors, for a total of 3,500 processors.

Moral of the Story: Problems scale. There are problems out there that will gladly take all the processing power we can give them.

Source: T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005, p. 1-2.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 34


Motivation(!): Embarrassing Parallelism

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Example: Parallel Scheduling is NP-Complete...

Counter-Example 1: Finding the optimal schedule is very hard; finding a nearly optimal schedule is easy. Longest-Processing-Time-first scheduling is a "4/3-optimal solution" (never worse than 4/3 the optimum). See the sketch below.

Counter-Example 2: Given sufficiently many, sufficiently small jobs, any schedule approximates the optimum.

Definition: An embarrassingly parallel problem is a problem consisting of very many (essentially) independent tasks.

Clay Breshears ("The Art of Concurrency", O'Reilly, 2009) suggests saying "enchantingly parallel" instead of "embarrassingly parallel".

Of course, not all problems are embarrassingly/enchantingly parallel, and we don't want to restrict ourselves to those that are.
The others are fun too! Like rendering Pixar films!
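To make Counter-Example 1 concrete, here is a minimal sketch (not from the slides) of Longest-Processing-Time-first scheduling; the job times and the machine count are made up for illustration:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// LPT: sort the jobs by decreasing run time, then always give the next job
// to the machine that has the least work so far. The resulting makespan is
// never worse than 4/3 of the optimal one.
int main() {
    const int times[] = {7, 5, 4, 3, 3, 2, 2, 1};       // made-up job times
    std::vector<int> jobs(times, times + 8);
    const int machines = 3;                              // made-up machine count

    std::sort(jobs.begin(), jobs.end(), std::greater<int>());

    // Min-heap of (current load, machine id): the top is the least-loaded machine.
    typedef std::pair<int, int> Load;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load> > heap;
    for (int m = 0; m < machines; ++m) heap.push(Load(0, m));

    std::vector<int> load(machines, 0);
    for (int j = 0; j < (int) jobs.size(); ++j) {
        Load least = heap.top();                         // least-loaded machine so far
        heap.pop();
        int m = least.second;
        load[m] += jobs[j];
        std::printf("job of length %d -> machine %d\n", jobs[j], m);
        heap.push(Load(load[m], m));
    }
    std::printf("makespan = %d\n", *std::max_element(load.begin(), load.end()));
    return 0;
}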

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 35


Motivation(!): Something like Convergence

Four Factors:
1) Moore's Law
2) Amdahl's Law
3) Inherent Difficulty
4) Lack of a Unifying Paradigm

Of all of these [mid-1990s environments], only Java... ...and MPI have survived – more or less – plus newcomers: C++0x (Boost), OpenMP & something for GPGPUs.
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 36
Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 37


Paradigms vs. Programming Packages
Paradigm \ Package               Java / C++0x (Boost)            OpenMP                                 MPI
Shared Memory                    Yes                             Yes                                    No (*)
Message Passing                  Yes                             No                                     Yes
Programming Language Extension   No (Java), Minimally (C++0x)    Debatably Yes (Compiler-Extensions)    No (!)
Libraries                        Yes                             Yes                                    Yes
Languages                        Java, C++                       C, C++, Fortran                        C, C++, Fortran, Java, ...

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 38


Boost: Threads in C++: History

C++:
- Inherits POSIX threads and sockets from C / Unix
- Extended with Templates, Standard Template Library (STL)
- Upcoming Standards: (a) STL TR (Technical Report) 1, (b) C++0x, (c) STL TR (Technical Report) 2
  - now include STL-style threads & sockets (amongst other things)
  - all tried out first in the Boost Library(!)
- Threading & sockets are: not in TR1, may be in C++0x, will be in TR2, in BOOST.

Java:
- Includes Threads & Sockets from the start
- Extended with Templates / Generics
- Threading later further extended (e.g. thread pools).

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 39


Boost: Threads in C++ (1)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

The boost::thread constructor starts a thread running (here given a function pointer); the join method waits until a thread is finished.

We have a synchronization problem here. Where?

#include <boost/thread.hpp>
#include <iostream>

void thread() {
    std::cout << "Hello from Thread "
              << boost::this_thread::get_id()
              << std::endl;
}

int main()
{
    boost::thread t1(thread);
    boost::thread t2(thread);
    t1.join();
    t2.join();
}
Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 40


Boost: Threads in C++ (2)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

Mutex = Mutual Exclusion. Mutexes are used to guard critical sections – areas with limited (or no) parallelism.

#include <boost/thread.hpp>
#include <iostream>

// global variable, shared by all threads
boost::mutex mutex;

void thread() {
    mutex.lock(); // critical section!
    std::cout << "Hello from Thread "
              << boost::this_thread::get_id()
              << std::endl;
    mutex.unlock(); // parallelism resumes here...
}

int main()
{
    boost::thread t1(thread);
    boost::thread t2(thread);
    t1.join();
    t2.join();
}

Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 41


Boost: Threads in C++ (3)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

We need more than static (global) and local variables!

"thread_local" will be a new keyword in C++0x. In the meantime, Boost uses a template (boost::thread_specific_ptr).

lock_guard: the constructor locks, the destructor unlocks.
Principle: RAII = Resource Acquisition Is Initialization.

#include <boost/thread.hpp>
#include <iostream>
#include <cstdlib>
#include <ctime>

void init_number_generator() {
    static boost::thread_specific_ptr<bool> tls;
    if (!tls.get())
        tls.reset(new bool(false));
    if (!*tls) {
        *tls = true;
        std::srand(static_cast<unsigned int>(std::time(0)));
    }
} // end init_number_generator

boost::mutex mutex;

void random_number_generator() {
    init_number_generator();
    int i = std::rand();
    boost::lock_guard<boost::mutex> lock(mutex);
    std::cout << i << std::endl;
} // end random_number_generator

int main() {
    boost::thread t[3];
    for (int i = 0; i < 3; ++i)
        t[i] = boost::thread(random_number_generator);
    for (int i = 0; i < 3; ++i)
        t[i].join();
}
Source: http://en.highscore.de/cpp/boost/multithreading.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 42


Boost: Threads in C++ (4)
Boost supports:
1) Threads
2) Synchronization
3) Thread-local Memory
4) Sockets for Message Passing

Sockets are part of the more general "asynchronous I/O" (asio) library.
Central concepts: IO_Services & Handlers.
This example connects to a web server (via TCP) and downloads an HTML page.

#include <boost/asio.hpp>
#include <boost/array.hpp>
#include <iostream>
#include <string>

boost::asio::io_service io_service;
boost::asio::ip::tcp::resolver resolver(io_service);
boost::asio::ip::tcp::socket sock(io_service);
boost::array<char, 4096> buffer;

void read_handler(const boost::system::error_code &ec, std::size_t bytes_transferred) {
    if (!ec) {
        std::cout << std::string(buffer.data(), bytes_transferred) << std::endl;
        sock.async_read_some(boost::asio::buffer(buffer), read_handler);
    }
} // end read_handler

void connect_handler(const boost::system::error_code &ec) {
    if (!ec) {
        boost::asio::write(sock, boost::asio::buffer("GET / HTTP 1.1\r\nHost: highscore.de\r\n\r\n"));
        sock.async_read_some(boost::asio::buffer(buffer), read_handler);
    }
} // end connect_handler

void resolve_handler(const boost::system::error_code &ec,
                     boost::asio::ip::tcp::resolver::iterator it) {
    if (!ec) {
        sock.async_connect(*it, connect_handler);
    }
} // end resolve_handler

int main() {
    boost::asio::ip::tcp::resolver::query query("www.highscore.de", "80");
    resolver.async_resolve(query, resolve_handler);
    io_service.run();
}
Source: http://en.highscore.de/cpp/boost/asio.html

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 43


OpenMP: Compiler Supported Parallelism (1)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP is a compiler extension (not really a language extension) for C, C++ and Fortran. It (usually) assumes shared memory. Supported by the Intel and gnu compilers (et al.).
Not to be confused with OpenMPI!

#include <omp.h>
#include <cstdio>

int main()
{
    // serial startup goes here...
    #pragma omp parallel num_threads (8)
    { // parallel segment
        printf("\nHello world, I am thread %d\n",
               omp_get_thread_num());
    }
    // rest of serial segment ...
}
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 44
OpenMP: Compiler Supported Parallelism (2)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP encourages a Single Program, Multiple Data style, where parallel blocks follow each other sequentially (first fork, then join...).

Example output of the two serial printf calls below:
Thread0 a=1, b=9, sum=8
Thread0 a=1, b=17, sum=16

#include <omp.h>
#include <cstdio>

int a, b, sum;

int main()
{
    // serial segment
    b = 1;
    a = 1;
    sum = 0;

    #pragma omp parallel num_threads (8) private (a) shared (b) reduction(+:sum)
    { // parallel segment
        a += 1; b += 1; sum = 1;
        printf("\nThread%d a=%d, b=%d, sum=%d",
               omp_get_thread_num(), a, b, sum);
    }
    // rest of serial segment
    printf(" Thread%d a=%d, b=%d, sum=%d\n",
           omp_get_thread_num(), a, b, sum);

    #pragma omp parallel num_threads (8) private (a) shared (b) reduction(+:sum)
    { // parallel segment
        a += 1; b += 1; sum = 1;
        printf("\nThread%d a=%d, b=%d, sum=%d",
               omp_get_thread_num(), a, b, sum);
    }
    // rest of serial segment
    printf(" Thread%d a=%d, b=%d, sum=%d\n",
           omp_get_thread_num(), a, b, sum);
}

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 45


OpenMP: Compiler Supported Parallelism (3)
OpenMP supports:
1) Threads
2) Synchronization with private and shared data, fork/join
3) Loop Parallelism

OpenMP provides semi-automatic parallelization of loops. This is the dominant programming style in OpenMP.

#include <omp.h>
#include <cstdio>

int sum;
float a[100], b[100];

int main()
{
    // serial segment
    sum = 0;
    // Initialise array a
    for (int j = 0; j < 100; ++j) a[j] = j;

    #pragma omp parallel num_threads (2)
    { // parallel segment
        #pragma omp for
        for (int i = 0; i < 100; ++i)
            b[i] = a[i] * 2.0;
    }
    // rest of serial segment
    // Print array b
    for (int k = 0; k < 100; ++k)
        printf("%5.2f ", b[k]);
}
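Not from the slides: the fork and the loop-splitting can also be written as one combined directive, and a reduction clause can collect per-thread partial results. A minimal sketch in the style of the example above (array and values made up):

#include <omp.h>
#include <cstdio>

int main()
{
    float a[100];
    for (int j = 0; j < 100; ++j) a[j] = j;

    float total = 0.0f;
    // One combined directive: fork, split the loop among the threads,
    // and add the per-thread partial sums into 'total' at the join.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 100; ++i)
        total += a[i] * 2.0f;

    printf("total = %.2f\n", total); // expect 9900.00
}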
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 46
MPI: Library-Supported Parallelism (1)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

MPI is a library standard, defined for many languages (including Fortran, C and C++ – C++ is shown here), available in various implementations (e.g. Open MPI, not to be confused with OpenMP). It does not assume shared memory – but uses it where available.

Every MPI program should start (the MPI-enabled portion) with MPI::Init() and end with MPI::Finalize().

The number of parallel processes is determined by how the programs are run – e.g. with the command mpirun.

#include "mpi.h"
#include <iostream>

int main(int argc, char **argv)
{
    int rank, size;
    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    std::cout << "Hello, world! I am "
              << rank << " of "
              << size << std::endl;
    MPI::Finalize();
    return 0;
}
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 47


MPI: Library-Supported Parallelism (2)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

The basic operations in MPI are sending and receiving messages. Message passing can be blocking or non-blocking. Functions are available for handling arbitrary data – and for ensuring compatibility across heterogeneous machines.

This example passes a message around a ring of processes:

#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, next, prev, message, tag = 201;
    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;
    message = 10;
    if (0 == rank)
        MPI::COMM_WORLD.Send(&message, 1, MPI::INT, next, tag);
    while (1) {
        MPI::COMM_WORLD.Recv(&message, 1, MPI::INT, prev, tag);
        if (0 == rank) --message;
        MPI::COMM_WORLD.Send(&message, 1, MPI::INT, next, tag);
        if (0 == message) break; // exit loop!
    } // end while
    if (0 == rank)
        MPI::COMM_WORLD.Recv(&message, 1, MPI::INT, prev, tag);
    MPI::Finalize();
    return 0;
}

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 48


MPI: Library-Supported Parallelism (3a)

MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

MPI also contains message passing functions that make it easy to distribute data (usually arrays) across a group of processors. A sketch follows below.

Bcast:   before: P0: A,  P1: -,  P2: -,  P3: -     after: P0: A,  P1: A,  P2: A,  P3: A
Scatter: before: P0: A B C D,  P1-P3: -            after: P0: A,  P1: B,  P2: C,  P3: D
Gather:  the inverse of Scatter.
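As an illustration (not from the slides), a minimal Scatter/Gather sketch using the same C++ bindings as the examples above; the data values are made up:

#include "mpi.h"
#include <iostream>

int main(int argc, char **argv)
{
    MPI::Init();
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    // The root (rank 0) prepares one value per process: 0, 10, 20, ...
    int *senddata = 0;
    if (0 == rank) {
        senddata = new int[size];
        for (int i = 0; i < size; ++i) senddata[i] = 10 * i;
    }

    int mine = 0;
    // Scatter: process i receives senddata[i] from the root.
    MPI::COMM_WORLD.Scatter(senddata, 1, MPI::INT, &mine, 1, MPI::INT, 0);
    mine += rank; // each process works on its own piece

    int *result = (0 == rank) ? new int[size] : 0;
    // Gather: the root collects one value from every process, in rank order.
    MPI::COMM_WORLD.Gather(&mine, 1, MPI::INT, result, 1, MPI::INT, 0);

    if (0 == rank) {
        for (int i = 0; i < size; ++i) std::cout << result[i] << " ";
        std::cout << std::endl;
        delete [] senddata;
        delete [] result;
    }
    MPI::Finalize();
    return 0;
}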

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 49


MPI: Library-Supported Parallelism (3b)
MPI supports:
1) Remote Execution (not threads)
2) Message Passing
3) Data Parallelism

Allgather: before: P0: A,  P1: B,  P2: C,  P3: D    after: each of P0-P3 holds A B C D
Alltoall:  before: P0: A0 A1 A2 A3,  P1: B0 B1 B2 B3,  P2: C0 C1 C2 C3,  P3: D0 D1 D2 D3
           after:  P0: A0 B0 C0 D0,  P1: A1 B1 C1 D1,  P2: A2 B2 C2 D2,  P3: A3 B3 C3 D3

There are more; this is not a complete list.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 50


Outlook:

The cutting edge? Specialized Hardware:
- General Purpose Graphics Processing Units (GPGPUs)
- Cell Processors
- Field Programmable Gate Arrays (FPGAs)

Problem: How to program them?
- CUDA? Looks promising, but only NVIDIA.
- OpenCL? Looks promising, promises to support many architectures...
- ...?!?

Source: "CUDA." Wikipedia, The Free Encyclopedia. 28 Oct 2009, <http://en.wikipedia.org/w/index.php?title=CUDA&oldid=322466787>.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 51


Outline of Chapter I – Introduction

Course Outline:
I   Introduction
II  Models of Parallel Computing
III Parallel Computation Design
IV  The Message-Passing Paradigm
V   The Shared Memory Paradigm
VI  Frontiers
VII Summary

Chapter I – Introduction:
A) Personal Introductions
B) What is Parallel & Distributed Computing?
C) Demotivation: Why wasn't Parallelism used?
D) Motivation: Why use Parallelism?
E) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
F) Summary & Bibliography

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 52


Summary (of Chap. 1)
A) Personal Introductions
B) What is Parallel & Distributed Computing?
   - Distributed Computing is a subset of Parallel Computing.
   - High Performance Computing (HPC) is another subset.
   - Parallel is, strictly speaking, a subset of concurrent.
C) (De)Motivation: Why (not) Use Parallelism?
   - Moore's Law – vs. Multicore Processors
   - Amdahl's Law – vs. Gustafson's Law
   - Inherent Difficulty – vs. Embarrassing (enchanting) Parallelism
   - Lack of a Unifying Paradigm – vs. Convergence to...
   - Key Measurements: Speedup & Efficiency
D) Paradigms & Packages for Parallel Programming: An Overview & an Outlook
   - Boost (C++0x) offers threads & synchronization (mutexes etc.) for shared memory programming, and sockets for message passing, for one language (C++).
   - OpenMP offers shared memory programming with Loop Parallelism for languages with extended compilers (C, C++ & Fortran).
   - MPI offers message passing programming with data parallelism in the form of a library for multiple languages (mainly C, C++ & Fortran).
Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 53
Bibliography
Main Sources:
H. Bauke, S. Mertens, Cluster Computing, Springer Verlag, 2006.
Clay Breshears, The Art of Concurrency, O'Reilly Media Inc, 2009.
Ian Foster, Designing and Building Parallel Programs, Addison-Wesley Publishing, 1995.
   Available on-line: http://www.mcs.anl.gov/~itf/dbpp/ (!)
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley (Pearson), 2003.
T. G. Mattson, B. A. Sanders & B. L. Massingill, Patterns for Parallel Programming, Addison-Wesley (Pearson Education), 2005.
R. C. Moore, SDAARC: A Self Distributing Associative Architecture, Shaker Verlag, 2001 – Hard to find! ;-)
This list is not meant to be exhaustive.

Prof. Ronald Moore – P&DC – Introduction – Winter 2010/11 54
