
Parallel Computing

N-Body-Simulation
Thorsten Grahs, 06.07.2015

Overview
Introduction
N-Body model
Algorithms
Hierarchical approach
Multipole expansion
MPI derived data-types

N-Body-Simulation
Dwarf #4: N-body systems
Multi-body / particle simulation
Simulation of interacting bodies/particles
Many physical systems can be modelled as particle simulations
(e.g. a micro-scale view of fluid dynamics)
Covers the whole range of scales:
from the atomic scale
to solar systems

Particle-oriented Simulation Methods


General Approach
N-body problem
compute motion paths of many individual particles
requires modelling and computation of inter-particle forces
typically leads to ODEs for the particle positions and velocities
Examples
Molecular dynamics
Astrophysics
Particle-oriented discretization techniques (approximation)
Application | Molecular dynamics


Solid State Physics

hexagonal crystal (Boron nitride)


Application | Molecular dynamics


Nano fluids

Flow through a nano-tube (modelling via continuum mechanics is no longer valid)
Application | Molecular dynamics


Protein Structure

spatial structure of Haemoglobin


HPC Example | Bell Prize 2005


Fred Streitz et al., Lawrence Livermore National Laboratory
Solidification in tantalum and uranium
3D molecular dynamics simulation, up to 524,000,000 atoms
Platform: IBM Blue Gene/L, 131,072 CPUs
(at that time #1 of the Top 500)
Performance: more than 100 TeraFlops (30% of peak)

HPC Example | XXL Millennium Project


Simulating Galaxy Population in Dark Energy Universes
Angulo & White, Max Planck Institute for Astrophysics, 2012

N-body simulation with
N = 3 × 10^11 particles
each particle corresponds to 10^9 suns
simulates the formation of galaxy clusters
served to validate the cold dark matter model
HPC Example | XXL Millennium Project


Simulation Details
N-body simulation with N = 3 × 10^11 particles
10 TB of RAM required just to store positions and velocities (single precision)
entire memory requirement: 29 TB
JuRoPA supercomputer (Jülich)
computation on 1536 nodes
(each with 2 quad-core CPUs, i.e. 12,288 cores)
hybrid parallelization: MPI plus OpenMP/POSIX threads
execution time: 9.3 days; ca. 300 CPU years

Model | N-Body Simulation


Building blocks
Newton's second law:
F = ma
The movement of each body is influenced by the potentials of the other bodies, i.e.
the force between two particles at positions x_i, x_j is
f(x_i, x_j)
The resulting force on the i-th particle is

F(x_i) = \sum_{j=1, j \neq i}^{n} f(x_i, x_j)

Forces for N bodies


Assume body B_i at location x_i with mass m_i
Body B_j (location x_j, mass m_j) acts on body B_i with the force

F_ij = G m_i m_j (x_j - x_i) / |x_j - x_i|^3

Body B_i in turn acts on B_j with the opposite (reaction) force
F_ji = -F_ij
G is the gravitational constant
G = 6.673 × 10^-11 [m^3 kg^-1 s^-2]
For an n-body system, the total force on B_i sums up to

F_i = \sum_{j=1, j \neq i}^{n} F_ij = G m_i \sum_{j=1, j \neq i}^{n} m_j (x_j - x_i) / |x_j - x_i|^3

(a direct-summation code sketch of this force evaluation follows below)
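A minimal direct-summation sketch of this O(N^2) force evaluation (an illustration, not the lecture's code; the struct name Body, the helper compute_forces and the softening parameter eps are assumptions):

#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double m;                   // mass
    std::array<double, 3> x;    // position
    std::array<double, 3> v;    // velocity
    std::array<double, 3> F;    // accumulated force
};

constexpr double G = 6.673e-11;  // gravitational constant [m^3 kg^-1 s^-2]

// F_i = G m_i sum_j m_j (x_j - x_i) / |x_j - x_i|^3, evaluated for all bodies.
void compute_forces(std::vector<Body>& bodies, double eps = 1e-9) {
    for (auto& b : bodies) b.F = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            std::array<double, 3> d;
            double r2 = eps;                   // softening avoids division by zero
            for (int k = 0; k < 3; ++k) {
                d[k] = bodies[j].x[k] - bodies[i].x[k];
                r2 += d[k] * d[k];
            }
            const double s = G * bodies[i].m * bodies[j].m / (std::sqrt(r2) * r2);
            for (int k = 0; k < 3; ++k)
                bodies[i].F[k] += s * d[k];    // contribution of body j on body i
        }
    }
}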

Modelling
The outcome of the N-body forces is a
system of d·N 2nd-order ODEs, with
N: # of particles (molecules) and d: spatial dimension

F_i = m_i d^2 x_i / dt^2

This can be reformulated into a system of 2·d·N 1st-order ODEs:

m_i \dot{x}_i = p_i
\dot{p}_i = F_i

Time integration with an Euler or Runge-Kutta scheme; the explicit Euler update reads

x_i^{n+1} = x_i^n + \Delta t \, v_i^n,    v_i^{n+1} = v_i^n + (\Delta t / m_i) F_i^n

with the velocity v_i^n = p_i^n / m_i (a code sketch of this update follows below).
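A matching explicit Euler step over the same hypothetical Body vector (again only a sketch; the time step dt is an assumed parameter):

// One explicit Euler step: x <- x + dt*v, v <- v + dt*F/m,
// reusing the Body struct and compute_forces() from the sketch above.
void euler_step(std::vector<Body>& bodies, double dt) {
    compute_forces(bodies);                  // O(N^2) force evaluation
    for (auto& b : bodies) {
        for (int k = 0; k < 3; ++k) {
            b.x[k] += dt * b.v[k];           // position update
            b.v[k] += dt * b.F[k] / b.m;     // velocity update
        }
    }
}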

Cost of computation
Computational costs
The cost of evaluating the forces depends strongly (and not surprisingly) on the number of involved bodies.
It is of the order
O(N^2)
The force evaluation dominates the overall cost of the simulation
Need for cost reduction in order to achieve
high performance
Cost reduction
Ways to cost reduction
1. For every action there is an equal and opposite reaction,
F_ji = -F_ij,
so each pair force has to be evaluated only once (see the sketch below)
2. Introduction of a cutoff radius R.
Forces due to particles outside the radius R are updated only rarely:
O(N R^3 + N^2)
3. Hierarchical methods (octree approach)
O(N log N)
4. Multipole methods
O(N)
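A sketch of point 1 (illustration only, reusing the hypothetical Body struct and the constant G from the force sketch above): visiting every unordered pair once and applying Newton's third law halves the number of force evaluations.

// Symmetric force accumulation: each pair (i, j) is visited once; the computed
// force is added to body i and subtracted from body j (F_ji = -F_ij).
void compute_forces_symmetric(std::vector<Body>& bodies, double eps = 1e-9) {
    for (auto& b : bodies) b.F = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = i + 1; j < bodies.size(); ++j) {
            std::array<double, 3> d;
            double r2 = eps;
            for (int k = 0; k < 3; ++k) {
                d[k] = bodies[j].x[k] - bodies[i].x[k];
                r2 += d[k] * d[k];
            }
            const double s = G * bodies[i].m * bodies[j].m / (std::sqrt(r2) * r2);
            for (int k = 0; k < 3; ++k) {
                bodies[i].F[k] += s * d[k];   // force on i due to j
                bodies[j].F[k] -= s * d[k];   // reaction force on j
            }
        }
    }
}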
Hierarchical approach
Barnes-Hut Algorithm (J. Barnes, P. Hut, 1986)
The simulation domain is divided into sub-volumes based on where the particles actually are
The algorithm is built on the octree approach
Advantages
Only particles in neighbouring or nearby cells have to be treated individually.
Remote cells can be treated as a single particle with the summed-up mass located at the cell barycentre.

O(N log N)

Quad-/Octree

Recursive decomposition into four (2D) or eight (3D) boxes
Should be adaptive
Each box stores the total mass and the barycentre of the particles inside
(a minimal node structure is sketched below)
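A minimal tree-node layout as it might look for such a quadtree/octree (an illustration, not the lecture's data structure; the names TreeNode and NUM_CHILDREN are assumptions):

#include <array>
#include <memory>
#include <vector>

constexpr int NUM_CHILDREN = 4;              // quadtree (2D); an octree (3D) uses 8

struct TreeNode {
    double total_mass = 0.0;                 // summed mass of all particles below
    std::array<double, 2> barycentre{};      // mass-weighted centre of those particles
    std::array<double, 2> centre{};          // geometric centre of the cell
    double half_width = 0.0;                 // half the cell edge length
    std::vector<std::size_t> particles;      // particle indices (leaf cells only)
    std::array<std::unique_ptr<TreeNode>, NUM_CHILDREN> child;  // sub-cells
};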
Particle injection

Mass centre

Force calculation

Multipole expansion
A fast algorithm for particle simulation
L. Greengard and V. Rokhlin, J. Comp. Phys., 73 (1987)
Building blocks:
Multipole expansion of the forces
Approximation (truncation) of the expansion up to a given tolerance
Translation operators to recentre the expansion

O(N)

Example: N-body interaction


Force derivation from a potential

Taylor expansion
Force expansion in a Taylor series
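As a one-dimensional reminder of the underlying idea (a sketch, assuming the usual far-field setting |x| >> |y|), the pair interaction 1/(x - y) can be expanded about the origin and truncated after p terms:

\frac{1}{x - y} = \frac{1}{x} \cdot \frac{1}{1 - y/x}
               = \sum_{k=0}^{\infty} \frac{y^{k}}{x^{k+1}}
               \approx \sum_{k=0}^{p} \frac{y^{k}}{x^{k+1}}

Truncating after p terms gives the approximation up to a given tolerance mentioned on the multipole-expansion slide.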

Taylor expansion | components

Taylor expansion | general form

Shifting Taylor expansion

Shifting Taylor expansion | components

Tree structure

Constructing the tree


i) Keep putting particles into the root cell
ii) If the number of particles in a cell becomes larger than a threshold, subdivide the cell
(an insertion sketch follows below)
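A sketch of this insertion rule (illustration only, reusing the hypothetical TreeNode from the quad-/octree sketch above; MAX_PARTICLES_PER_CELL, quadrant() and insert() are assumed names):

// Insert particle index idx into the (2D) tree; a leaf cell is subdivided as
// soon as it holds more than MAX_PARTICLES_PER_CELL particles, and its
// contents are pushed down into the new children.
constexpr std::size_t MAX_PARTICLES_PER_CELL = 8;

// Child index (quadrant) that contains position p.
int quadrant(const TreeNode& node, const std::array<double, 2>& p) {
    return (p[0] > node.centre[0] ? 1 : 0) + (p[1] > node.centre[1] ? 2 : 0);
}

void insert(TreeNode& node, std::size_t idx,
            const std::vector<std::array<double, 2>>& pos) {
    if (!node.child[0]) {                         // leaf cell
        node.particles.push_back(idx);
        if (node.particles.size() <= MAX_PARTICLES_PER_CELL) return;
        for (int c = 0; c < NUM_CHILDREN; ++c) {  // threshold exceeded: subdivide
            node.child[c] = std::make_unique<TreeNode>();
            node.child[c]->half_width = 0.5 * node.half_width;
            node.child[c]->centre = {
                node.centre[0] + (c & 1 ? 0.5 : -0.5) * node.half_width,
                node.centre[1] + (c & 2 ? 0.5 : -0.5) * node.half_width};
        }
        for (std::size_t p : node.particles)      // redistribute the particles
            insert(*node.child[quadrant(node, pos[p])], p, pos);
        node.particles.clear();
        return;
    }
    insert(*node.child[quadrant(node, pos[idx])], idx, pos);   // inner cell: descend
}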

Information transformation

Comparison of Methods

Scaling & evaluation time

Constructing the tree

Expansion shift

Multipole expansion shift

MPI basic data-types


MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             type-less 8-bit memory range
MPI_PACKED           denotes packed data

MPI data-types and their counterpart types in C


MPI derived data-types


In MPI it is possible to define new datatypes by grouping existing ones.
This class of datatypes is called derived datatypes.
Derived datatypes in MPI can be used for
grouping data of different datatypes for communication,
grouping non-contiguous data for communication.

MPI has the following functions to group data:

MPI_Type_contiguous
MPI_Type_vector
MPI_Type_indexed

MPI_Type_struct
MPI_Pack
MPI_Unpack

Why group data?


In general, each element of a system of interest has attributes of different datatypes.
It is desirable to group these attributes to streamline manipulation and access.
Example: classical n-body system
Each particle has the following attributes (a possible struct layout is sketched below):
mass (m)        MPI_DOUBLE (1)
position (x)    MPI_DOUBLE (3)
momentum (p)    MPI_DOUBLE (3)
ID tag          MPI_INT (1)
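A plain struct matching this attribute list might look as follows (a sketch; the name Particle and its field names are assumptions; the corresponding MPI derived type is built on the following slides):

// One record per particle: 1 int plus 7 doubles, matching the MPI type
// counts listed above.
struct Particle {
    int    id;       // ID tag    -> MPI_INT    (1)
    double m;        // mass      -> MPI_DOUBLE (1)
    double x[3];     // position  -> MPI_DOUBLE (3)
    double p[3];     // momentum  -> MPI_DOUBLE (3)
};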

Why group data?


Data grouping allows the transfer of different datatypes in one MPI communication call; otherwise one call per attribute would be needed.
Data grouping allows the transfer of non-contiguous data in one MPI communication call.
Each MPI call is expensive, as it involves several steps to initiate the communication and to ensure that it completes successfully.
Data grouping can therefore reduce the number of communication calls.

Building derived data-types


Suppose the following structure definition

struct {
    int num;
    float x;
    double data[2];
} obj;

Layout in memory?

What do we have to know?

MPI custom data-types


Where do the types start?

How many of each type?


What type of data?

MPI datatype is a map

struct {
    int num;
    float x;
    double data[2];
} obj;

MPI_Datatype obj_type;

MPI_Bcast(&obj, 1, obj_type, 0, comm);

MPI datatype constructors


So how do we define those derived datatypes?
There are different constructors for this:
MPI_Type_contiguous
Contiguous layout in memory
(same basic data type)
MPI_Type_vector
Block layout in memory
(same basic data type, equally sized blocks, with a stride)
MPI_Type_struct
Block layout in memory
(different basic data types, different block sizes, with individual displacements)

Continuous
Constructor

int MPI_Type_contiguous(int count, MPI_Datatype old_type,
                        MPI_Datatype *new_type);

Example

int B[2][3];
MPI_Datatype matrix;
MPI_Type_contiguous(6, MPI_INT, &matrix);

What's the advantage over this?

MPI_Send(B, 6, MPI_INT, 1, tag, comm);


Continuous example
Constructor

const int N = 10;

double A[N][N];
double B[N][N];

MPI_Datatype matrix;

MPI_Type_contiguous(N*N, MPI_DOUBLE, &matrix);
MPI_Type_commit(&matrix);

if (rank == master_rank)
    MPI_Send(A, 1, matrix, 1, 10, comm);
else if (rank == 1)
    MPI_Recv(B, 1, matrix, 0, 10, comm, &status);
Vector
Constructor

int MPI_Type_vector(int count,
                    int blocklength,
                    int stride,
                    MPI_Datatype old_type,
                    MPI_Datatype *newtype);

newtype has
count blocks, each consisting of
blocklength copies of
old_type data-types.

The displacement between the starts of consecutive blocks is stride, measured in units of the old_type extent.

Vector example
const int N = 5;
MPI_Datatype dt;
MPI_Type_vector(N, 1, N, MPI_INT, &dt);

Variables
count = 5
blocklength = 1
stride = 5

The resulting type picks one int out of every five, e.g. one column of a 5×5 int matrix stored row by row (cf. the next slide).

Matrix column example


Constructor

const int N = 10;

double A[N][N];

MPI_Datatype column;

MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
MPI_Type_commit(&column);

if (rank == master_rank)
    MPI_Send(&A[0][2], 1, column, 1, 10, comm);
else if (rank == 1)
    MPI_Recv(&A[0][2], 1, column, 0, 10, comm, &status);

Structure
Constructor

int MPI_Type_struct(int count,
                    const int *array_of_blocklengths,
                    const MPI_Aint *array_of_displacements,
                    const MPI_Datatype *array_of_types,
                    MPI_Datatype *newtype);

Parameters
count: number of blocks (integer)
array_of_blocklengths: number of elements in each block (array)
array_of_displacements: byte displacement of each block (array)
array_of_types: type of the elements in each block (array)
Structure example
struct {
    int num;
    float x;
    double data[4];
} obj;

Variables
count = 3
(count blocks, where the i-th block consists of blocks[i] copies of the type types[i];
the displacement of the i-th block, in bytes, is given by displacements[i])
blocks = {1, 1, 4}
types = {MPI_INT, MPI_FLOAT, MPI_DOUBLE}
displacements = {0, sizeof(int), sizeof(int) + sizeof(float)}
(see the sketch below)
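A possible end-to-end construction of this derived type (a sketch, not the lecture's code; it uses MPI_Get_address and MPI_Type_create_struct, the current MPI replacements for the MPI_Address/MPI_Type_struct calls shown on these slides, and takes the displacements from a concrete instance so that compiler padding is respected):

#include <mpi.h>

struct Obj {
    int    num;
    float  x;
    double data[4];
};

// Builds and commits an MPI datatype describing Obj.
MPI_Datatype make_obj_type(const Obj& obj) {
    int          blocks[3] = {1, 1, 4};
    MPI_Datatype types[3]  = {MPI_INT, MPI_FLOAT, MPI_DOUBLE};
    MPI_Aint     displacements[3];

    MPI_Aint base;
    MPI_Get_address(&obj,      &base);
    MPI_Get_address(&obj.num,  &displacements[0]);
    MPI_Get_address(&obj.x,    &displacements[1]);
    MPI_Get_address(&obj.data, &displacements[2]);
    for (int i = 0; i < 3; ++i)
        displacements[i] -= base;        // displacements relative to the start of obj

    MPI_Datatype obj_type;
    MPI_Type_create_struct(3, blocks, displacements, types, &obj_type);
    MPI_Type_commit(&obj_type);
    return obj_type;                     // release later with MPI_Type_free
}

With the committed type, a whole Obj can then be sent or broadcast in one call, e.g. MPI_Bcast(&obj, 1, obj_type, 0, comm); as on the "MPI datatype is a map" slide.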

Extent | Displacement of struct members


Memory span of a datatype

int MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent);

Usage: a kind of sizeof for MPI datatypes

MPI_Aint intex;
MPI_Type_extent(MPI_INT, &intex);

displacements[0] = static_cast<MPI_Aint>(0);
displacements[1] = intex;

MPI_Aint
Address integers
Addresses in C are integers that may be longer than int.
The datatype for locations/addresses is therefore MPI_Aint, not int.
Displacements, which are differences of two addresses, can also be longer than int.
The datatype MPI_Aint takes care of this possibility.
Fortran's INTEGER is four bytes long, hence it is not necessary to use MPI_Aint there.

Pointer in C/C++
Array

struct {
    int n;
    double x[3];
} obj;

Pointer to array

struct {
    int n;
    double *x;
} obj;

MPI_Address
int MPI_Address(void *location, MPI_Aint *address);
Gets the address of location in memory

struct {
    int n; double *x;
} obj;
obj.n = 10;
obj.x = new double[obj.n];

MPI_Aint address_n, address_x;

MPI_Address(&(obj.n), &address_n);
MPI_Address(obj.x, &address_x);

MPI_Aint displacements[2];
displacements[0] = static_cast<MPI_Aint>(0);
displacements[1] = address_x - address_n;
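A hedged continuation (not from the slides): with these two displacements, the int and the heap array can be described as one derived type. Note that the resulting type is only valid for this particular obj, because the offset to the heap buffer differs from instance to instance.

/* Describe {obj.n, obj.x[0..obj.n-1]} as a single derived type rooted at &obj.n.
   MPI_Type_struct is the MPI-1 call used on these slides; current MPI code
   would use MPI_Type_create_struct instead. */
int blocks[2] = {1, obj.n};
MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
MPI_Datatype obj_type;

MPI_Type_struct(2, blocks, displacements, types, &obj_type);
MPI_Type_commit(&obj_type);

/* Send the int and the obj.n doubles in a single call. */
MPI_Send(&obj.n, 1, obj_type, 1, 0, MPI_COMM_WORLD);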
Commit and Free


Commit
After construction, you must commit the datatype
int MPI_Type_commit(MPI_Datatype *datatype);
Free
When you are done, you need to free the datatype
int MPI_Type_free(MPI_Datatype *datatype);

MPI_Pack
Packing data
int MPI_Pack(void *inbuf, int incount,
             MPI_Datatype datatype, void *outbuf,
             int outsize, int *pos, MPI_Comm comm);
Creates a single package from non-contiguous data

MPI_Pack | Arguments
inbuf
Input buffer where the data comes from.
incount
# elements of type datatype that will be packed.
datatype
Datatype of the elements.
outbuf
Buffer for the packed data.
outsize
Size of outbuf in bytes.
pos
Position in outbuf (in bytes) at which the data are injected.
It is advanced by the size of the inserted data after packing.
MPI_Pack
Works like a stream
Alternative approach to grouping data for communication
Manually pack variables into a contiguous buffer
Transmit
Unpack into desired variables

MPI_Unpack
Where there's a pack, there's an unpack...
int MPI_Unpack(void *inbuf, int insize,
               int *pos, void *outbuf, int outcount,
               MPI_Datatype datatype, MPI_Comm comm);
Unpacks the data on the receiver side:
receive the data into the inbuf buffer,
store the unpacked data in the outbuf buffer.

MPI_Unpack | Arguments
inbuf
Incoming buffer with the packed data.
insize
Size of inbuf in bytes.
pos
Position in inbuf (in bytes) from where the data should be unpacked.
outbuf
Buffer where the unpacked data are stored.
outcount
# elements to be unpacked.
datatype
Datatype of the elements.
Example
int position, i, j, a[2];
char buff[1000];
int myrank;
MPI_Status status;
....
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    /* SENDER CODE */
    position = 0;
    MPI_Pack(&i, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Pack(&j, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Send(buff, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
else {
    /* RECEIVER CODE: packed data may be received directly as the matching basic types */
    MPI_Recv(a, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
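An alternative receiver (a sketch, not part of the original example) that receives the message in packed form and extracts the two integers with MPI_Unpack:

/* RECEIVER CODE, variant using MPI_Unpack */
MPI_Recv(buff, 1000, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
position = 0;
MPI_Unpack(buff, 1000, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buff, 1000, &position, &j, 1, MPI_INT, MPI_COMM_WORLD);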
Considerations
Good
Sending a few large messages is more efficient than sending many small messages.
May avoid the use of system buffering
Ideal for sending variable-length messages
(sparse matrix data, client/server task dispatch)

Bad
You have to pack and unpack the data yourself
Requires a lot of attention to detail

You have to weigh the effort:
Do you really want variable-length messages?
You still need big pre-allocated fixed-size buffers
Further reading
J. Barnes and P. Hut,
A hierarchical O(N log N) force-calculation algorithm,
Nature 324 (4): 446-449, 1986.
L. Greengard and V. Rokhlin,
A Fast Algorithm for Particle Simulations,
J. Comput. Phys. 73, 325-348, 1987.
Netlib Repository at UTK and ORNL,
User-Defined Datatypes and Packing,
http://www.netlib.org/utk/papers/mpi-book/node70.html
