
Parallel Computing

N-Body-Simulation
Thorsten Grahs, 06.07.2015

Overview
Introduction
N-Body model
Algorithms
Hierarchical approach
Multipole expansion
MPI derived data-types

N-Body-Simulation
Dwarf #4: N-body systems
Multi-body / particle simulation
Simulation of interacting bodies/particles
Many physical systems can be modelled as particle simulations
(e.g. a micro-scale view of fluid dynamics)
Covers the whole range of scales:
from the atomic scale
to solar systems

Particle-oriented Simulation Methods


General Approach
N-body problem
compute motion paths of many individual particles
requires modelling and computation of inter-particle forces
typically leads to ODEs for the particle positions and velocities
Examples
Molecular dynamics
Astrophysics
Particle-oriented discretization techniques (approximation)
Application | Molecular dynamics


Solid State Physics

hexagonal crystal (Boron nitride)


Application | Molecular dynamics


Nano fluids

Flow through a nano-tube (modelling via continuum mechanics is no longer valid)
Application | Molecular dynamics


Protein Structure

spatial structure of Haemoglobin


HPC Example | Bell Prize 2005


Fred Streitz et al., Lawrence Livermore National Laboratory
Solidification in tantalum and uranium
3D molecular dynamics simulation, up to 524,000,000 atoms
Platform: IBM Blue Gene/L, 131,072 CPUs
(at that time #1 of the Top 500)
Performance: more than 100 TeraFlops (30% of peak)

HPC Example | XXL Millennium Project


Simulating Galaxy Population in Dark Energy Universes
Angulo & White, Max Planck Institute for Astrophysics, 2012

N-body simulation with
N = 3 × 10^11 particles
each particle corresponds to 10^9 suns
simulates the formation of galaxy clusters
served to validate the cold dark matter model
HPC Example | XXL Millennium Project


Simulation Details
N-body simulation with N = 3 × 10^11 particles
10 TB of RAM required just to store positions and velocities (single precision)
entire memory requirement: 29 TB
JuRoPA supercomputer (Jülich)
computation on 1536 nodes
(each with 2 quad-core CPUs, i.e. 12,288 cores)
hybrid parallelization: MPI plus OpenMP/POSIX threads
execution time: 9.3 days; ca. 300 CPU years

Model | N-Body Simulation


Building blocks
Newton's second law:
F = ma
The movement of each body is influenced by the potentials of the other bodies, i.e.
the force between two particles at positions x_i, x_j is
f(x_i, x_j)
The resulting force on the i-th particle is

F(x_i) = \sum_{j=1, j \neq i}^{n} f(x_i, x_j)

Forces for N bodies


Assume body B_i at location x_i with mass m_i
Body B_j (location x_j, mass m_j) acts on body B_i with the force

F_ij = G m_i m_j (x_j - x_i) / |x_j - x_i|^3

Body B_i in turn acts on B_j with the opposite (reaction) force
F_ji = -F_ij
G is the gravitational constant
G = 6.673 × 10^-11 [m^3 kg^-1 s^-2]
For an n-body system, the total force on B_i sums up to

F_i = \sum_{j=1, j \neq i}^{n} F_ij = G m_i \sum_{j=1, j \neq i}^{n} m_j (x_j - x_i) / |x_j - x_i|^3

(a direct-summation code sketch of this force evaluation follows below)
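A minimal direct-summation sketch of this O(N^2) force evaluation (an illustration, not the lecture's code; the struct name Body, the helper compute_forces and the softening parameter eps are assumptions):

#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double m;                   // mass
    std::array<double, 3> x;    // position
    std::array<double, 3> v;    // velocity
    std::array<double, 3> F;    // accumulated force
};

constexpr double G = 6.673e-11;  // gravitational constant [m^3 kg^-1 s^-2]

// F_i = G m_i sum_j m_j (x_j - x_i) / |x_j - x_i|^3, evaluated for all bodies.
void compute_forces(std::vector<Body>& bodies, double eps = 1e-9) {
    for (auto& b : bodies) b.F = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            std::array<double, 3> d;
            double r2 = eps;                   // softening avoids division by zero
            for (int k = 0; k < 3; ++k) {
                d[k] = bodies[j].x[k] - bodies[i].x[k];
                r2 += d[k] * d[k];
            }
            const double s = G * bodies[i].m * bodies[j].m / (std::sqrt(r2) * r2);
            for (int k = 0; k < 3; ++k)
                bodies[i].F[k] += s * d[k];    // contribution of body j on body i
        }
    }
}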

Modelling
The outcome of the N-body forces is a
system of d·N 2nd-order ODEs, with
N: # of particles (molecules) and d: spatial dimension

F_i = m_i d^2 x_i / dt^2

This can be reformulated into a system of 2·d·N 1st-order ODEs:

m_i \dot{x}_i = p_i
\dot{p}_i = F_i

Time integration with an Euler or Runge-Kutta scheme; the explicit Euler update reads

x_i^{n+1} = x_i^n + \Delta t \, v_i^n,    v_i^{n+1} = v_i^n + (\Delta t / m_i) F_i^n

with the velocity v_i^n = p_i^n / m_i (a code sketch of this update follows below).
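A matching explicit Euler step over the same hypothetical Body vector (again only a sketch; the time step dt is an assumed parameter):

// One explicit Euler step: x <- x + dt*v, v <- v + dt*F/m,
// reusing the Body struct and compute_forces() from the sketch above.
void euler_step(std::vector<Body>& bodies, double dt) {
    compute_forces(bodies);                  // O(N^2) force evaluation
    for (auto& b : bodies) {
        for (int k = 0; k < 3; ++k) {
            b.x[k] += dt * b.v[k];           // position update
            b.v[k] += dt * b.F[k] / b.m;     // velocity update
        }
    }
}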

Cost of computation
Computational costs
The cost of evaluating the forces depends strongly (and not surprisingly) on the number of involved bodies.
It is of the order
O(N^2)
The force evaluation dominates the overall cost of the simulation
Need for cost reduction in order to achieve
high performance
Cost reduction
Ways to cost reduction
1. For every action there is an equal and opposite reaction,
F_ji = -F_ij,
so each pair force has to be evaluated only once (see the sketch below)
2. Introduction of a cutoff radius R.
Forces due to particles outside the radius R are updated only rarely:
O(N R^3 + N^2)
3. Hierarchical methods (octree approach)
O(N log N)
4. Multipole methods
O(N)
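A sketch of point 1 (illustration only, reusing the hypothetical Body struct and the constant G from the force sketch above): visiting every unordered pair once and applying Newton's third law halves the number of force evaluations.

// Symmetric force accumulation: each pair (i, j) is visited once; the computed
// force is added to body i and subtracted from body j (F_ji = -F_ij).
void compute_forces_symmetric(std::vector<Body>& bodies, double eps = 1e-9) {
    for (auto& b : bodies) b.F = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = i + 1; j < bodies.size(); ++j) {
            std::array<double, 3> d;
            double r2 = eps;
            for (int k = 0; k < 3; ++k) {
                d[k] = bodies[j].x[k] - bodies[i].x[k];
                r2 += d[k] * d[k];
            }
            const double s = G * bodies[i].m * bodies[j].m / (std::sqrt(r2) * r2);
            for (int k = 0; k < 3; ++k) {
                bodies[i].F[k] += s * d[k];   // force on i due to j
                bodies[j].F[k] -= s * d[k];   // reaction force on j
            }
        }
    }
}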
Hierarchical approach
Barnes-Hut Algorithm (J. Barnes, P. Hut, 1986)
The simulation domain is divided into sub-volumes based on where the particles actually are
The algorithm is built on the octree approach
Advantages
Only particles in neighbouring or nearby cells have to be treated individually.
Remote cells can be treated as a single particle with the summed-up mass located at the cell barycentre.

O(N log N)

Quad-/Octree

Recursive decomposition into four (2D) or eight (3D) boxes
Should be adaptive
Each box stores the total mass and the barycentre of the particles inside
(a minimal node structure is sketched below)
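A minimal tree-node layout as it might look for such a quadtree/octree (an illustration, not the lecture's data structure; the names TreeNode and NUM_CHILDREN are assumptions):

#include <array>
#include <memory>
#include <vector>

constexpr int NUM_CHILDREN = 4;              // quadtree (2D); an octree (3D) uses 8

struct TreeNode {
    double total_mass = 0.0;                 // summed mass of all particles below
    std::array<double, 2> barycentre{};      // mass-weighted centre of those particles
    std::array<double, 2> centre{};          // geometric centre of the cell
    double half_width = 0.0;                 // half the cell edge length
    std::vector<std::size_t> particles;      // particle indices (leaf cells only)
    std::array<std::unique_ptr<TreeNode>, NUM_CHILDREN> child;  // sub-cells
};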
Particle injection

Mass centre

Force calculation

Multipole expansion
A fast algorithm for particle simulation
L. Greengard and V. Rokhlin, J. Comp. Phys., 73 (1987)
Building blocks:
Multipole expansion of the forces
Approximation (truncation) of the expansion up to a given tolerance
Translation operators to recentre the expansion

O(N)

Example: N-body interaction


Force derivation from a potential

Taylor expansion
Force expansion in a Taylor series
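As a one-dimensional reminder of the underlying idea (a sketch, assuming the usual far-field setting |x| >> |y|), the pair interaction 1/(x - y) can be expanded about the origin and truncated after p terms:

\frac{1}{x - y} = \frac{1}{x} \cdot \frac{1}{1 - y/x}
               = \sum_{k=0}^{\infty} \frac{y^{k}}{x^{k+1}}
               \approx \sum_{k=0}^{p} \frac{y^{k}}{x^{k+1}}

Truncating after p terms gives the approximation up to a given tolerance mentioned on the multipole-expansion slide.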

Taylor expansion | components

Taylor expansion | general form

Shifting Taylor expansion

Shifting Taylor expansion | components

Tree structure

Constructing the tree


i) Keep putting particles into the root cell
ii) If the number of particles in a cell becomes larger than a threshold, subdivide the cell
(an insertion sketch follows below)
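A sketch of this insertion rule (illustration only, reusing the hypothetical TreeNode from the quad-/octree sketch above; MAX_PARTICLES_PER_CELL, quadrant() and insert() are assumed names):

// Insert particle index idx into the (2D) tree; a leaf cell is subdivided as
// soon as it holds more than MAX_PARTICLES_PER_CELL particles, and its
// contents are pushed down into the new children.
constexpr std::size_t MAX_PARTICLES_PER_CELL = 8;

// Child index (quadrant) that contains position p.
int quadrant(const TreeNode& node, const std::array<double, 2>& p) {
    return (p[0] > node.centre[0] ? 1 : 0) + (p[1] > node.centre[1] ? 2 : 0);
}

void insert(TreeNode& node, std::size_t idx,
            const std::vector<std::array<double, 2>>& pos) {
    if (!node.child[0]) {                         // leaf cell
        node.particles.push_back(idx);
        if (node.particles.size() <= MAX_PARTICLES_PER_CELL) return;
        for (int c = 0; c < NUM_CHILDREN; ++c) {  // threshold exceeded: subdivide
            node.child[c] = std::make_unique<TreeNode>();
            node.child[c]->half_width = 0.5 * node.half_width;
            node.child[c]->centre = {
                node.centre[0] + (c & 1 ? 0.5 : -0.5) * node.half_width,
                node.centre[1] + (c & 2 ? 0.5 : -0.5) * node.half_width};
        }
        for (std::size_t p : node.particles)      // redistribute the particles
            insert(*node.child[quadrant(node, pos[p])], p, pos);
        node.particles.clear();
        return;
    }
    insert(*node.child[quadrant(node, pos[idx])], idx, pos);   // inner cell: descend
}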

Information transformation

Comparison of Methods

Scaling & evaluation time

Constructing the tree

Expansion shift

Multipole expansion shift

MPI basic data-types


MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             type-less 8-bit memory range
MPI_PACKED           denotes packed data

MPI data-types and their counterpart types in C


MPI derived data-types


In MPI it is possible to define new datatypes by grouping existing ones.
This class of datatypes is called derived datatypes.
Derived datatypes in MPI can be used for
grouping data of different datatypes for communication,
grouping non-contiguous data for communication.

MPI has the following functions to group data:

MPI_Type_contiguous
MPI_Type_vector
MPI_Type_indexed

MPI_Type_struct
MPI_Pack
MPI_Unpack

Why group data?


In general, each element of a system of interest has attributes of different datatypes.
It is desirable to group these attributes to streamline manipulation and access.
Example: classical n-body system
Each particle has the following attributes (a possible struct layout is sketched below):
mass (m)        MPI_DOUBLE (1)
position (x)    MPI_DOUBLE (3)
momentum (p)    MPI_DOUBLE (3)
ID tag          MPI_INT (1)
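A plain struct matching this attribute list might look as follows (a sketch; the name Particle and its field names are assumptions; the corresponding MPI derived type is built on the following slides):

// One record per particle: 1 int plus 7 doubles, matching the MPI type
// counts listed above.
struct Particle {
    int    id;       // ID tag    -> MPI_INT    (1)
    double m;        // mass      -> MPI_DOUBLE (1)
    double x[3];     // position  -> MPI_DOUBLE (3)
    double p[3];     // momentum  -> MPI_DOUBLE (3)
};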

Why group data?


Data grouping allows the transfer of different datatypes in one MPI communication call; otherwise one call per attribute would be needed.
Data grouping allows the transfer of non-contiguous data in one MPI communication call.
Each MPI call is expensive, as it involves several steps to initiate the communication and to ensure that it completes successfully.
Data grouping can therefore reduce the number of communication calls.

Building derived data-types


Suppose the following structure definition

struct {
    int num;
    float x;
    double data[2];
} obj;

Layout in memory?

What do we have to know?

MPI custom data-types


Where do the types start?

How many of each type?


What type of data?

MPI datatype is a map

struct {
    int num;
    float x;
    double data[2];
} obj;

MPI_Datatype obj_type;

MPI_Bcast(&obj, 1, obj_type, 0, comm);

MPI datatype constructors


So how do we define those derived datatypes?
There are different constructors for this:
MPI_Type_contiguous
Contiguous layout in memory
(same basic data type)
MPI_Type_vector
Block layout in memory
(same basic data type, equally sized blocks, with a stride)
MPI_Type_struct
Block layout in memory
(different basic data types, different block sizes, with individual displacements)

Continuous
Constructor

int MPI_Type_contiguous(int count, MPI_Datatype old_type,
                        MPI_Datatype *new_type);

Example

int B[2][3];
MPI_Datatype matrix;
MPI_Type_contiguous(6, MPI_INT, &matrix);

What's the advantage over this?

MPI_Send(B, 6, MPI_INT, 1, tag, comm);


Continuous example
Constructor

const int N = 10;

double A[N][N];
double B[N][N];

MPI_Datatype matrix;

MPI_Type_contiguous(N*N, MPI_DOUBLE, &matrix);
MPI_Type_commit(&matrix);

if (rank == master_rank)
    MPI_Send(A, 1, matrix, 1, 10, comm);
else if (rank == 1)
    MPI_Recv(B, 1, matrix, 0, 10, comm, &status);
Vector
Constructor

int MPI_Type_vector(int count,
                    int blocklength,
                    int stride,
                    MPI_Datatype old_type,
                    MPI_Datatype *newtype);

newtype has
count blocks, each consisting of
blocklength copies of
old_type data-types.

The displacement between the starts of consecutive blocks is stride, measured in units of the old_type extent.

Vector example
const int N = 5;
MPI_Datatype dt;
MPI_Type_vector(N, 1, N, MPI_INT, &dt);

Variables
count = 5
blocklength = 1
stride = 5

The resulting type picks one int out of every five, e.g. one column of a 5×5 int matrix stored row by row (cf. the next slide).

Matrix column example


Constructor

const int N = 10;

double A[N][N];

MPI_Datatype column;

MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
MPI_Type_commit(&column);

if (rank == master_rank)
    MPI_Send(&A[0][2], 1, column, 1, 10, comm);
else if (rank == 1)
    MPI_Recv(&A[0][2], 1, column, 0, 10, comm, &status);

Structure
Constructor

int MPI_Type_struct(int count,
                    const int *array_of_blocklengths,
                    const MPI_Aint *array_of_displacements,
                    const MPI_Datatype *array_of_types,
                    MPI_Datatype *newtype);

Parameters
count: number of blocks (integer)
array_of_blocklengths: number of elements in each block (array)
array_of_displacements: byte displacement of each block (array)
array_of_types: type of the elements in each block (array)
Structure example
struct {
    int num;
    float x;
    double data[4];
} obj;

Variables
count = 3
(count blocks, where the i-th block consists of blocks[i] copies of the type types[i];
the displacement of the i-th block, in bytes, is given by displacements[i])
blocks = {1, 1, 4}
types = {MPI_INT, MPI_FLOAT, MPI_DOUBLE}
displacements = {0, sizeof(int), sizeof(int) + sizeof(float)}
(see the sketch below)
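A possible end-to-end construction of this derived type (a sketch, not the lecture's code; it uses MPI_Get_address and MPI_Type_create_struct, the current MPI replacements for the MPI_Address/MPI_Type_struct calls shown on these slides, and takes the displacements from a concrete instance so that compiler padding is respected):

#include <mpi.h>

struct Obj {
    int    num;
    float  x;
    double data[4];
};

// Builds and commits an MPI datatype describing Obj.
MPI_Datatype make_obj_type(const Obj& obj) {
    int          blocks[3] = {1, 1, 4};
    MPI_Datatype types[3]  = {MPI_INT, MPI_FLOAT, MPI_DOUBLE};
    MPI_Aint     displacements[3];

    MPI_Aint base;
    MPI_Get_address(&obj,      &base);
    MPI_Get_address(&obj.num,  &displacements[0]);
    MPI_Get_address(&obj.x,    &displacements[1]);
    MPI_Get_address(&obj.data, &displacements[2]);
    for (int i = 0; i < 3; ++i)
        displacements[i] -= base;        // displacements relative to the start of obj

    MPI_Datatype obj_type;
    MPI_Type_create_struct(3, blocks, displacements, types, &obj_type);
    MPI_Type_commit(&obj_type);
    return obj_type;                     // release later with MPI_Type_free
}

With the committed type, a whole Obj can then be sent or broadcast in one call, e.g. MPI_Bcast(&obj, 1, obj_type, 0, comm); as on the "MPI datatype is a map" slide.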

Extent | Displacement of struct members


Memory span of a datatype

int MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent);

Usage: a kind of sizeof for MPI datatypes

MPI_Aint intex;
MPI_Type_extent(MPI_INT, &intex);

displacements[0] = static_cast<MPI_Aint>(0);
displacements[1] = intex;

MPI_Aint
Address integers
Addresses in C are integers that may be longer than int.
The datatype for locations/addresses is therefore MPI_Aint, not int.
Displacements, which are differences of two addresses, can also be longer than int.
The datatype MPI_Aint takes care of this possibility.
Fortran's INTEGER is four bytes long, hence it is not necessary to use MPI_Aint there.

Pointer in C/C++
Array

struct {
    int n;
    double x[3];
} obj;

Pointer to array

struct {
    int n;
    double *x;
} obj;

MPI_Address
int MPI_Address(void *location, MPI_Aint *address);
Gets the address of location in memory

struct {
    int n; double *x;
} obj;
obj.n = 10;
obj.x = new double[obj.n];

MPI_Aint address_n, address_x;

MPI_Address(&(obj.n), &address_n);
MPI_Address(obj.x, &address_x);

MPI_Aint displacements[2];
displacements[0] = static_cast<MPI_Aint>(0);
displacements[1] = address_x - address_n;
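A hedged continuation (not from the slides): with these two displacements, the int and the heap array can be described as one derived type. Note that the resulting type is only valid for this particular obj, because the offset to the heap buffer differs from instance to instance.

/* Describe {obj.n, obj.x[0..obj.n-1]} as a single derived type rooted at &obj.n.
   MPI_Type_struct is the MPI-1 call used on these slides; current MPI code
   would use MPI_Type_create_struct instead. */
int blocks[2] = {1, obj.n};
MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
MPI_Datatype obj_type;

MPI_Type_struct(2, blocks, displacements, types, &obj_type);
MPI_Type_commit(&obj_type);

/* Send the int and the obj.n doubles in a single call. */
MPI_Send(&obj.n, 1, obj_type, 1, 0, MPI_COMM_WORLD);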
Commit and Free


Commit
After construction, you must commit the datatype
int MPI_Type_commit(MPI_Datatype *datatype);
Free
When you are done, you need to free the datatype
int MPI_Type_free(MPI_Datatype *datatype);

MPI_Pack
Packing data
int MPI_Pack(void *inbuf, int incount,
             MPI_Datatype datatype, void *outbuf,
             int outsize, int *pos, MPI_Comm comm);
Creates a single package from non-contiguous data

MPI_Pack | Arguments
inbuf
Input buffer where the data comes from.
incount
# elements of type datatype that will be packed.
datatype
Datatype of the elements.
outbuf
Buffer for the packed data.
outsize
Size of outbuf in bytes.
pos
Position in outbuf (in bytes) at which the data are injected.
It is advanced by the size of the inserted data after packing.
MPI_Pack
Works like a stream
Alternative approach to grouping data for communication
Manually pack variables into a contiguous buffer
Transmit
Unpack into desired variables

MPI_Unpack
Where there's a pack, there's an unpack...
int MPI_Unpack(void *inbuf, int insize,
               int *pos, void *outbuf, int outcount,
               MPI_Datatype datatype, MPI_Comm comm);
Unpacks the data on the receiver side:
receive the data into the inbuf buffer,
store the unpacked data in the outbuf buffer.

MPI_Unpack | Arguments
inbuf
Incoming buffer with the packed data.
insize
Size of inbuf in bytes.
pos
Position in inbuf (in bytes) from where the data should be unpacked.
outbuf
Buffer where the unpacked data are stored.
outcount
# elements to be unpacked.
datatype
Datatype of the elements.
Example
int position, i, j, a[2];
char buff[1000];
int myrank;
MPI_Status status;
....
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    /* SENDER CODE */
    position = 0;
    MPI_Pack(&i, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Pack(&j, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Send(buff, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
else {
    /* RECEIVER CODE: packed data may be received directly as the matching basic types */
    MPI_Recv(a, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
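An alternative receiver (a sketch, not part of the original example) that receives the message in packed form and extracts the two integers with MPI_Unpack:

/* RECEIVER CODE, variant using MPI_Unpack */
MPI_Recv(buff, 1000, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
position = 0;
MPI_Unpack(buff, 1000, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(buff, 1000, &position, &j, 1, MPI_INT, MPI_COMM_WORLD);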
Considerations
Good
Sending a few large messages is more efficient than sending many small messages.
May avoid the use of system buffering
Ideal for sending variable-length messages
(sparse matrix data, client/server task dispatch)

Bad
You have to pack and unpack the data yourself
Requires a lot of attention to detail

You have to weigh the effort:
Do you really want variable-length messages?
You still need big pre-allocated fixed-size buffers
Further reading
J. Barnes and P. Hut,
A hierarchical O(N log N) force-calculation algorithm,
Nature 324 (4): 446-449, 1986.
L. Greengard and V. Rokhlin,
A Fast Algorithm for Particle Simulations,
J. Comput. Phys. 73, 325-348, 1987.
Netlib Repository at UTK and ORNL,
User-Defined Datatypes and Packing,
http://www.netlib.org/utk/papers/mpi-book/node70.html
