
Exploring Parallel Computing
Fabian Frie
Numerical Methods in Quantum Physics
February 6th, 2014
Syllabus
1 Introduction
What is Parallel Computing?
Scalability
2 Parallel Programming Models
Memory Models
Exploring MPI
Exploring OpenMP
Comparison
3 Examples with OpenMP
Matrix Matrix Multiplication
Approximation of π
4 Conclusion
What is Parallel Computing?
Introduction

Parallelization is another optimization technique to reduce execution time.
Thread: a series of instructions for a processing unit.
Coarse-grain parallelism: parallelization achieved by distributing domains over different processors.
Fine-grain parallelism: parallelization achieved by distributing iterations equally over different processors (a minimal sketch of both styles follows below).
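A minimal sketch of the two styles (not from the slides; it uses OpenMP directives that are only introduced later in this talk, and all names are illustrative): the first loop shares the iterations of a single loop among the threads (fine-grain), while the second gives each thread one contiguous block of the array to work on (coarse-grain).

program grain_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 100000
  integer :: i, tid, nthreads, lo, hi
  real(kind=8) :: a(n)

  ! Fine-grain: iterations of one loop are distributed over the threads.
  !$omp parallel do
  do i = 1, n
     a(i) = dble(i)
  enddo
  !$omp end parallel do

  ! Coarse-grain: each thread handles its own contiguous sub-domain.
  !$omp parallel private(tid, nthreads, lo, hi, i)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  lo = tid*n/nthreads + 1
  hi = (tid + 1)*n/nthreads
  do i = lo, hi
     a(i) = 2d0*a(i)
  enddo
  !$omp end parallel

  write(*,*) 'a(n) = ', a(n)   ! expected: 2*n
end program grain_sketch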
Scalability I
Amdahl's Law

Define the speedup with respect to the number of threads n by

    S(n) = t(1) / t(n)

Unless the application is embarrassingly parallel, S(n) will deviate from the ideal curve.
Assume the program has a parallel fraction f; then with n processors the execution time changes according to

    t(n) = (f/n) t(1) + (1 - f) t(1)
Scalability II
Amdahl's Law

Amdahl's Law states: if the fraction f of a program can be made parallel, then the maximum speedup that can be achieved by using n threads is

    S(n) = 1 / ((1 - f) + f/n)
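For a quick worked example (numbers chosen here for illustration, not taken from the slides): with f = 0.9 and n = 8 threads, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7, and in the limit n → ∞ the speedup saturates at 1/(1 - f) = 10. Even a 90% parallel program therefore gains less than a factor of five on eight threads.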
Scalability III
Amdahl's Law
[Figure omitted in transcript.]
Memory Architectures
Shared vs. Distributed

Shared Memory Architectures
Symmetric Multi-Processor (SMP): a shared address space with equal access cost for each processor.
Non-Uniform Memory Access (NUMA): different memory regions have different access costs.
Distributed Memory Architectures
Clusters: each processor acts on its own private memory space; for remote data, communication is required.
Shared Memory Architecture
Intel Core i7 980X Extreme Edition
[Figure omitted in transcript.]
Exploring MPI I
What is MPI?

MPI: Message Passing Interface
MPI is an extensive parallel programming API for distributed memory systems (clusters, grids)
First introduced in 1994
MPI supports C, C++, and Fortran
All data is private to each processing unit
Data communication must be programmed explicitly (a minimal send/receive sketch follows below)
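As a minimal sketch of what explicit communication looks like (not from the slides; the program name is illustrative, and it must be started with at least two ranks, e.g. mpirun -np 2 ./mpi_sketch), rank 0 sends one integer to rank 1 using the standard MPI Fortran bindings:

program mpi_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, payload
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     payload = 42
     ! Data does not move unless we ask for it explicitly.
     call MPI_Send(payload, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(payload, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     write(*,*) 'rank 1 received ', payload
  end if

  call MPI_Finalize(ierr)
end program mpi_sketch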
Exploring MPI II
What is MPI?
Pros
Flexibility: can use any cluster of any size
Widely available
Widely used: popular in high-performance computing

Cons
Redesign of the application is required
More resources required: typically more memory
Error-prone and hard to debug, due to the many layers involved
Exploring OpenMP
Parallel Programming Models

OpenMP: Open Multi-Processing (API)
OpenMP is built for shared memory architectures such as Symmetric Multi-Processing (SMP) machines
Supports both coarse-grained and fine-grained parallelism
Data can be shared or private
All threads have access to the same, shared memory
Synchronization is mostly implicit (a minimal data-scoping sketch follows below)
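A minimal sketch of data scoping and implicit synchronization (not from the slides; names are illustrative): tid is private, so each thread has its own copy, while total is shared and combined through a reduction; the barriers at the end of the worksharing loop and of the parallel region are implicit, so no explicit synchronization call is needed.

program scoping_sketch
  use omp_lib
  implicit none
  integer :: ii, tid
  real(kind=8) :: total

  total = 0d0
  !$omp parallel private(tid) shared(total)
  tid = omp_get_thread_num()        ! private: one copy per thread
  !$omp do reduction(+:total)       ! iterations are split over the team
  do ii = 1, 1000
     total = total + dble(ii)
  enddo
  !$omp end do                      ! implicit barrier
  !$omp end parallel                ! implicit barrier
  write(*,*) 'total = ', total      ! 500500.0, independent of the thread count
end program scoping_sketch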


Comparison
Parallel Programming Models
MPI
popular, widely used
ready for grids
steep learning curve
no data scoping (shared, private, ...)
sequential code is not preserved
requires only one library
easier model
requires a runtime environment (see the note below)

OpenMP
popular, widely used
limited to one system (SMP), not grid ready
easy to learn
data scoping required
preserves sequential code
requires compiler support
performance issues are implicit
no runtime environment required
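A practical note (standard usage, not from the slides): an MPI program is launched through its runtime environment, typically with something like mpirun -np 4 ./prog, whereas an OpenMP executable starts like any serial program and takes its thread count from the OMP_NUM_THREADS environment variable or from omp_set_num_threads(), as in the examples later in this talk.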
Simple Tasks with OpenMP
Examples
Matrix Matrix Multiplication

    C = A B                              (1)
    C_ij = Σ_k A_ik B_kj                 (2)

Approximation of π

    ∫_0^1 dx 4/(1 + x²) = π              (3)
    4 [arctan(x)]_0^1 = π                (4)
    Σ_{i=0}^{N} 4/(1 + x_i²) Δx ≈ π      (5)

    with Δx = 1/N and the midpoints x_i = (i + 1/2) Δx.

How efficiently can these problems be parallelized?
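Both problems are, in Amdahl's terms, almost entirely parallel: each element C_ij can be computed independently of the others, and the terms of the Riemann sum are independent apart from the final summation, which needs a reduction. One would therefore expect speedups close to the ideal curve for sufficiently large problem sizes.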
Matrix Matrix Multiplication
Examples
[Figure omitted in transcript.]

Approximation of π
Examples
[Figure omitted in transcript.]
Approximation of π
Source Code

program integ_pi
  use omp_lib
  implicit none

  integer(kind=8) :: ii, num_steps, jj
  integer :: tid, nthreads
  real(kind=8) :: step, xx, pi, summ, start_time, run_time

  num_steps = 100000000
  step = 1d0/dble(num_steps)

  do jj = 1,8 ! Number of requested threads
    pi = 0d0
    call omp_set_num_threads(jj)
    start_time = omp_get_wtime()
    nthreads = omp_get_num_threads()

    !$omp single
    write(*,*) "Number of threads: ", nthreads
    !$omp end single

    !$omp parallel do reduction(+:pi) private(ii,xx)
    do ii = 0,num_steps
      xx = (dble(ii)+0.5d0)*step
      pi = pi + 4d0 / (1d0 + xx*xx)
    enddo
    !$omp end parallel do

    run_time = omp_get_wtime() - start_time
    pi = pi*step
    write(*,*) "pi approx ", pi
    write(*,*) "wtime: ", run_time
  enddo
end program integ_pi
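Usage note (standard compiler behaviour, not from the slides): assuming the file is saved as integ_pi.f90, it can be built with GNU Fortran as gfortran -fopenmp integ_pi.f90 -o integ_pi; the -fopenmp flag enables the !$omp directives and links the OpenMP runtime (other compilers use analogous flags, e.g. -qopenmp for recent Intel Fortran). Since the loop over jj requests 1 to 8 threads in turn, the reported wtime should shrink with the thread count roughly as predicted by Amdahl's Law.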
Wrap Up
Prospects

Hybrid parallelism: combine MPI and OpenMP (e.g. MPI between nodes, OpenMP within a node)
Nested parallelism: divide-and-conquer principle
Problems with data races and deadlocks (a minimal data-race sketch follows below)
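As a minimal sketch of a data race and its cure (not from the slides; names are illustrative): in the first loop all threads update the shared counter without synchronization, so increments can be lost; the reduction clause in the second loop removes the race.

program race_sketch
  use omp_lib
  implicit none
  integer :: ii, counter

  counter = 0
  !$omp parallel do shared(counter)
  do ii = 1, 100000
     counter = counter + 1          ! data race: unsynchronized read-modify-write
  enddo
  !$omp end parallel do
  write(*,*) 'racy counter:    ', counter   ! often less than 100000 with more than one thread

  counter = 0
  !$omp parallel do reduction(+:counter)
  do ii = 1, 100000
     counter = counter + 1          ! each thread increments a private copy, combined at the end
  enddo
  !$omp end parallel do
  write(*,*) 'reduced counter: ', counter   ! always 100000
end program race_sketch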


Thank you for your attention!
Enjoy your meal!
Matrix Matrix Multiplication
Source Code

program matmult
  use omp_lib
  implicit none

  integer nra, nca, ncb, tid, nthreads, ii, jj, kk, chunk, nn
  parameter (nra=900)
  parameter (nca=900)
  parameter (ncb=100)
  real*8 a(nra,nca), b(nca,ncb), c(nra,ncb), time

  chunk = 10
  do nn = 1,8
    call omp_set_num_threads(nn)
    !$omp parallel shared(a,b,c,nthreads,chunk) private(tid,ii,jj,kk)
    tid = omp_get_thread_num()

    ! !$omp single
    ! write(*,*) "threads: ", omp_get_num_threads()
    ! !$omp end single

    !$omp do schedule(static,chunk)
    do ii = 1, nra
      do jj = 1, nca
        a(ii,jj) = (ii-1)+(jj-1)
      enddo
    enddo
    !$omp end do

    !$omp do schedule(static,chunk)
    do ii = 1, nca
      do jj = 1, ncb
        b(ii,jj) = (ii-1)*(jj-1)
      enddo
    enddo
    !$omp end do

    !$omp do schedule(static,chunk)
    do ii = 1, nra
      do jj = 1, ncb
        c(ii,jj) = 0d0
      enddo
    enddo
    !$omp end do

    time = omp_get_wtime()
    !$omp do schedule(static,chunk)
    do ii = 1,nra
      do jj = 1,ncb
        do kk = 1,nca
          c(ii,jj) = c(ii,jj) + a(ii,kk)*b(kk,jj)
        enddo
      enddo
    enddo
    !$omp end do

    !$omp end parallel
    write(*,*) omp_get_wtime() - time
  enddo
end program matmult
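A brief note on the work sharing (standard OpenMP semantics, not from the slides): schedule(static,chunk) with chunk = 10 deals out the iterations of each outer ii loop to the threads round-robin in blocks of ten rows, while the inner jj and kk loops run sequentially within each thread. The time printed after the parallel region measures only the work-shared triple loop, not the array initialization, since time is taken after the initialization loops and their implicit barriers.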
