
Exploring Parallel Computing
Fabian Frie
Numerical Methods in Quantum Physics
February 6th, 2014
Syllabus
1 Introduction
What is Parallel Computing?
Scalability
2 Parallel Programming Models
Memory Models
Exploring MPI
Exploring OpenMP
Comparison
3 Examples with OpenMP
Matrix Matrix Multiplication
Approximation of π
4 Conclusion
What is Parallel Computing?
Introduction

Parallelization is another optimization technique to reduce execution time.
Thread: a series of instructions for a processing unit.
Coarse-grain parallelism: parallelization achieved by distributing domains over different processors.
Fine-grain parallelism: parallelization achieved by distributing iterations equally over different processors (a minimal sketch of both styles follows below).
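A minimal sketch of the two styles (not from the slides; it uses OpenMP directives that are only introduced later in this talk, and all names are illustrative): the first loop shares the iterations of a single loop among the threads (fine-grain), while the second gives each thread one contiguous block of the array to work on (coarse-grain).

program grain_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  integer :: i, tid, nthreads, lo, hi
  real(kind=8) :: a(n)

  ! Fine-grain: iterations of one loop are distributed over the threads.
  !$omp parallel do
  do i = 1, n
     a(i) = dble(i)
  enddo
  !$omp end parallel do

  ! Coarse-grain: each thread handles its own contiguous sub-domain.
  !$omp parallel private(tid, nthreads, lo, hi, i)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  lo = tid*n/nthreads + 1
  hi = (tid + 1)*n/nthreads
  do i = lo, hi
     a(i) = 2d0*a(i)
  enddo
  !$omp end parallel

  write(*,*) 'a(n) = ', a(n)   ! expected: 2*n
end program grain_sketch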
Scalability I
Amdahl's Law

Define the speedup with respect to the number of threads n by

    S(n) = t(1) / t(n)

Unless the application is embarrassingly parallel, S(n) will deviate from the ideal curve.
Assume the program has a parallel fraction f; then with n processors the execution time changes according to

    t(n) = (f/n) t(1) + (1 - f) t(1)
Scalability II
Amdahl's Law

Amdahl's Law states: if the fraction f of a program can be made parallel, then the maximum speedup that can be achieved by using n threads is

    S(n) = 1 / ((1 - f) + f/n)
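For a quick worked example (numbers chosen here for illustration, not taken from the slides): with f = 0.9 and n = 8 threads, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7, and in the limit n → ∞ the speedup saturates at 1/(1 - f) = 10. Even a 90% parallel program therefore gains less than a factor of five on eight threads.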
Scalability III
Amdahl's Law
[Figure omitted in transcript.]
Memory Architectures
Shared vs. Distributed

Shared Memory Architectures
Symmetric Multi-Processor (SMP): a shared address space with equal access cost for each processor.
Non-Uniform Memory Access (NUMA): different memory regions have different access costs.
Distributed Memory Architectures
Clusters: each processor acts on its own private memory space; for remote data, communication is required.
Shared Memory Architecture
Intel Core i7 980X Extreme Edition
[Figure omitted in transcript.]
Exploring MPI I
What is MPI?

MPI: Message Passing Interface
MPI is an extensive parallel programming API for distributed memory systems (clusters, grids)
First introduced in 1994
MPI supports C, C++, and Fortran
All data is private to each processing unit
Data communication must be programmed explicitly (a minimal send/receive sketch follows below)
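As a minimal sketch of what explicit communication looks like (not from the slides; the program name is illustrative, and it must be started with at least two ranks, e.g. mpirun -np 2 ./mpi_sketch), rank 0 sends one integer to rank 1 using the standard MPI Fortran bindings:

program mpi_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, payload
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     payload = 42
     ! Data does not move unless we ask for it explicitly.
     call MPI_Send(payload, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(payload, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     write(*,*) 'rank 1 received ', payload
  end if

  call MPI_Finalize(ierr)
end program mpi_sketch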
Exploring MPI II
What is MPI?
Pros
Flexibility: can use any cluster of any size
Widely available
Widely used: popular in high-performance computing

Cons
Redesign of the application is required
More resources required: typically more memory
Error-prone and hard to debug, due to the many layers involved
Exploring OpenMP
Parallel Programming Models

OpenMP: Open Multi-Processing (API)
OpenMP is built for shared memory architectures such as Symmetric Multi-Processing (SMP) machines
Supports both coarse-grained and fine-grained parallelism
Data can be shared or private
All threads have access to the same, shared memory
Synchronization is mostly implicit (a minimal data-scoping sketch follows below)
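A minimal sketch of data scoping and implicit synchronization (not from the slides; names are illustrative): tid is private, so each thread has its own copy, while total is shared and combined through a reduction; the barriers at the end of the worksharing loop and of the parallel region are implicit, so no explicit synchronization call is needed.

program scoping_sketch
  use omp_lib
  implicit none
  integer :: ii, tid
  real(kind=8) :: total

  total = 0d0
  !$omp parallel private(tid) shared(total)
  tid = omp_get_thread_num()        ! private: one copy per thread
  !$omp do reduction(+:total)       ! iterations are split over the team
  do ii = 1, 1000
     total = total + dble(ii)
  enddo
  !$omp end do                      ! implicit barrier
  !$omp end parallel                ! implicit barrier
  write(*,*) 'total = ', total      ! 500500.0, independent of the thread count
end program scoping_sketch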


Comparison
Parallel Programming Models
MPI
popular, widely used
ready for grids
steep learning curve
no data scoping (shared, private, ...)
sequential code is not preserved
requires only one library
easier model
requires a runtime environment (see the note below)

OpenMP
popular, widely used
limited to one system (SMP), not grid ready
easy to learn
data scoping required
preserves sequential code
requires compiler support
performance issues are implicit
no runtime environment required
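A practical note (standard usage, not from the slides): an MPI program is launched through its runtime environment, typically with something like mpirun -np 4 ./prog, whereas an OpenMP executable starts like any serial program and takes its thread count from the OMP_NUM_THREADS environment variable or from omp_set_num_threads(), as in the examples later in this talk.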
Simple Tasks with OpenMP
Examples
Matrix Matrix Multiplication

    C = A B                              (1)
    C_ij = Σ_k A_ik B_kj                 (2)

Approximation of π

    ∫_0^1 dx 4/(1 + x²) = π              (3)
    4 [arctan(x)]_0^1 = π                (4)
    Σ_{i=0}^{N} 4/(1 + x_i²) Δx ≈ π      (5)

    with Δx = 1/N and the midpoints x_i = (i + 1/2) Δx.

How efficiently can these problems be parallelized?
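Both problems are, in Amdahl's terms, almost entirely parallel: each element C_ij can be computed independently of the others, and the terms of the Riemann sum are independent apart from the final summation, which needs a reduction. One would therefore expect speedups close to the ideal curve for sufficiently large problem sizes.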
Matrix Matrix Multiplication
Examples
[Figure omitted in transcript.]

Approximation of π
Examples
[Figure omitted in transcript.]
Approximation of π
Source Code

program integ_pi
  use omp_lib
  implicit none

  integer(kind=8) :: ii, num_steps, jj
  integer :: tid, nthreads
  real(kind=8) :: step, xx, pi, summ, start_time, run_time

  num_steps = 100000000
  step = 1d0/dble(num_steps)

  do jj = 1,8 ! Number of requested threads
    pi = 0d0
    call omp_set_num_threads(jj)
    start_time = omp_get_wtime()
    nthreads = omp_get_num_threads()

    !$omp single
    write(*,*) "Number of threads: ", nthreads
    !$omp end single

    !$omp parallel do reduction(+:pi) private(ii,xx)
    do ii = 0,num_steps
      xx = (dble(ii)+0.5d0)*step
      pi = pi + 4d0 / (1d0 + xx*xx)
    enddo
    !$omp end parallel do

    run_time = omp_get_wtime() - start_time
    pi = pi*step
    write(*,*) "pi approx ", pi
    write(*,*) "wtime: ", run_time
  enddo
end program integ_pi
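Usage note (standard compiler behaviour, not from the slides): assuming the file is saved as integ_pi.f90, it can be built with GNU Fortran as gfortran -fopenmp integ_pi.f90 -o integ_pi; the -fopenmp flag enables the !$omp directives and links the OpenMP runtime (other compilers use analogous flags, e.g. -qopenmp for recent Intel Fortran). Since the loop over jj requests 1 to 8 threads in turn, the reported wtime should shrink with the thread count roughly as predicted by Amdahl's Law.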
Wrap Up
Prospects

Hybrid parallelism: combine MPI and OpenMP (e.g. MPI between nodes, OpenMP within a node)
Nested parallelism: divide-and-conquer principle
Problems with data races and deadlocks (a minimal data-race sketch follows below)
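As a minimal sketch of a data race and its cure (not from the slides; names are illustrative): in the first loop all threads update the shared counter without synchronization, so increments can be lost; the reduction clause in the second loop removes the race.

program race_sketch
  use omp_lib
  implicit none
  integer :: ii, counter

  counter = 0
  !$omp parallel do shared(counter)
  do ii = 1, 100000
     counter = counter + 1          ! data race: unsynchronized read-modify-write
  enddo
  !$omp end parallel do
  write(*,*) 'racy counter:    ', counter   ! often less than 100000 with more than one thread

  counter = 0
  !$omp parallel do reduction(+:counter)
  do ii = 1, 100000
     counter = counter + 1          ! each thread increments a private copy, combined at the end
  enddo
  !$omp end parallel do
  write(*,*) 'reduced counter: ', counter   ! always 100000
end program race_sketch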


Thank you for your attention!
Enjoy your meal!
Matrix Matrix Multiplication
Source Code

program matmult
  use omp_lib
  implicit none

  integer nra, nca, ncb, tid, nthreads, ii, jj, kk, chunk, nn
  parameter (nra=900)
  parameter (nca=900)
  parameter (ncb=100)
  real*8 a(nra,nca), b(nca,ncb), c(nra,ncb), time

  chunk = 10
  do nn = 1,8
    call omp_set_num_threads(nn)
    !$omp parallel shared(a,b,c,nthreads,chunk) private(tid,ii,jj,kk)
    tid = omp_get_thread_num()

    ! !$omp single
    ! write(*,*) "threads: ", omp_get_num_threads()
    ! !$omp end single

    !$omp do schedule(static,chunk)
    do ii = 1, nra
      do jj = 1, nca
        a(ii,jj) = (ii-1)+(jj-1)
      enddo
    enddo
    !$omp end do

    !$omp do schedule(static,chunk)
    do ii = 1, nca
      do jj = 1, ncb
        b(ii,jj) = (ii-1)*(jj-1)
      enddo
    enddo
    !$omp end do

    !$omp do schedule(static,chunk)
    do ii = 1, nra
      do jj = 1, ncb
        c(ii,jj) = 0d0
      enddo
    enddo
    !$omp end do

    time = omp_get_wtime()
    !$omp do schedule(static,chunk)
    do ii = 1,nra
      do jj = 1,ncb
        do kk = 1,nca
          c(ii,jj) = c(ii,jj) + a(ii,kk)*b(kk,jj)
        enddo
      enddo
    enddo
    !$omp end do

    !$omp end parallel
    write(*,*) omp_get_wtime() - time
  enddo
end program matmult
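A brief note on the work sharing (standard OpenMP semantics, not from the slides): schedule(static,chunk) with chunk = 10 deals out the iterations of each outer ii loop to the threads round-robin in blocks of ten rows, while the inner jj and kk loops run sequentially within each thread. The time printed after the parallel region measures only the work-shared triple loop, not the array initialization, since time is taken after the initialization loops and their implicit barriers.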
