
Accelerating OpenFOAM with GPUs
(Graphics Processing Units)

Daniel Jasiński
Atizar

Who are we?

Atizar Limited
OpenFOAM: consulting, research, customization, analysis, robust interface
Atizar Flagship Product

simFlow

simFlow Features
create and import mesh
prepare cases
parametrise your problem
calculate solutions
compute in parallel with just one click
post-process results with ParaView

Graphics Processing Units

GPU and CFD?

Processor vs Memory (RAM) Performance

Intel Xeon E5-2670
A high-end CPU

CPU Features
Clock Frequency: 2.6 GHz
No. of Cores: 8
Memory Bandwidth: 51.2 GB/s

Feeding 1 CPU Core

double *a, *b, *c;

for(int i = 0; i < N; i++)
{
    c[i] = a[i] + b[i];
}

2.6 GHz x (2 inputs + 1 output) x 8 bytes x 4 ops/cycle = 249.6 GB/s

We only have 51.2 GB/s
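
As a side note, a minimal standalone C++ sketch (not from the original slides) that times the same vector-add loop and reports the achieved bandwidth; on typical hardware the result sits near the memory limit, not the compute limit:

// A rough check of the reasoning above: time the vector-add loop and report
// the achieved memory bandwidth (2 reads + 1 write per element).
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t N = std::size_t(1) << 26;        // ~67 million doubles per array
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double bytes   = 3.0 * N * sizeof(double);   // a and b read, c written
    std::printf("c[0] = %g, effective bandwidth = %.1f GB/s\n",
                c[0], bytes / seconds / 1e9);           // reading c prevents dead-code elimination
    return 0;
}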

Memory Latency

CPU Core
L1 Cache: ~3 cycles
L2 Cache: ~10 cycles
L3 Cache: ~30 cycles
Main Memory (RAM): ~300 cycles

Fight Memory Latency

CPU:
Large and deep cache hierarchy
Speculative prefetching
Branch prediction
Speculative execution
Out-of-order execution

[Diagram: execution timeline with memory stalls, comparing a CPU core and a GPU core]

How to use all those GPU cores?

No task-level parallelism on a GPU

Data parallelism
forAll(a,i)
{
c[i] = a[i] + b[i];
}

thrust::transform
(
    a.begin(), a.end(),
    b.begin(),
    c.begin(),
    thrust::plus<scalar>()
);
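
For reference, a self-contained Thrust version of the same element-wise addition (not OpenFOAM code; scalar is assumed to be double, and the containers are plain thrust::device_vector):

// Minimal standalone Thrust version of c[i] = a[i] + b[i] (compile with nvcc).
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    using scalar = double;                         // stand-in for OpenFOAM's scalar
    const int N = 1 << 20;

    thrust::device_vector<scalar> a(N, 1.0);
    thrust::device_vector<scalar> b(N, 2.0);
    thrust::device_vector<scalar> c(N);

    // One data-parallel call replaces the serial loop; Thrust launches the kernel.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<scalar>());

    scalar c0 = c[0];                              // copies one element back to the host
    std::printf("c[0] = %f\n", c0);                // expect 3.0
    return 0;
}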

Run CFD code on the GPU

Structures for GPU data

Keep a copy of the mesh in GPU memory
Store the matrix in GPU memory
Convert loops into kernel invocations (see the sketch below)

Challenges
A lot of programming work
Parallel code is hard
Easy to introduce bugs
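
As an illustration of "converting loops into kernel invocations", here is a minimal plain-CUDA sketch (generic names, not RapidCFD's actual code):

// Plain CUDA version of the forAll loop: each thread handles one or more cells.
#include <cuda_runtime.h>

__global__ void addFields(const double* a, const double* b, double* c, int n)
{
    // Grid-stride loop so any grid size covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        c[i] = a[i] + b[i];
    }
}

void addFieldsOnGpu(const double* d_a, const double* d_b, double* d_c, int n)
{
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    addFields<<<grid, block>>>(d_a, d_b, d_c, n);   // kernel invocation replaces the loop
    cudaDeviceSynchronize();                        // wait for completion (error checks omitted)
}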

Accelerating Linear Solvers

OFgpu
SpeedIT
CUFFLINK
PARALUTION

Advantages
Simple drop-in library
No code changes in OpenFOAM

Disadvantages
Limited acceleration

Memory Copy

CPU-GPU: 16 GB/s
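
For context, a small CUDA sketch (not from the slides) that measures the host-to-device copy bandwidth over PCI-E, the roughly 16 GB/s bottleneck a drop-in solver library pays every time it copies the matrix to the GPU:

// Measure host->device copy bandwidth over PCI-E with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t bytes = std::size_t(1) << 28;    // 256 MB
    std::vector<char> host(bytes, 1);

    char* device = nullptr;
    cudaMalloc(&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("Host->device: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaFree(device);
    return 0;
}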

RapidCFD Features

Simulations running fully on the GPU
Minimized CPU-GPU memory copy
Support for multiple GPUs
PCG and GAMG solvers
AMI interpolation

RapidCFD Limitations

Not all solvers are available
Need original OpenFOAM for meshing and decomposition
GAMG agglomeration on the CPU
AMI addressing calculation on the CPU

Original Code

template<class Type>
void Foam::GAMGAgglomeration::restrictFaceField
(
    Field<Type>& cf,
    const Field<Type>& ff,
    const label fineLevelIndex
) const
{
    const labelList& fineToCoarse = faceRestrictAddressing_[fineLevelIndex];

    cf = Zero;

    forAll(fineToCoarse, ffacei)
    {
        label cFace = fineToCoarse[ffacei];
        if (cFace >= 0)
        {
            cf[cFace] += ff[ffacei];
        }
    }
}

GPU Code

template<class Type>
struct GAMGAgglomerationRestrictFunctor
{
    const Type* ff;
    const label* sort;
    const Type zero;

    GAMGAgglomerationRestrictFunctor(const Type* _ff, const label* _sort):
        ff(_ff), sort(_sort), zero(pTraits<Type>::zero)
    {}

    // Sums the fine-face values belonging to one coarse face, given the
    // [start, end) range of its sorted fine-face indices.
    __host__ __device__
    Type operator()(const label& start, const label& end)
    {
        Type out = zero;
        for(label i = start; i < end; i++)
        {
            out += ff[sort[i]];
        }
        return out;
    }
};

template<class Type>
void Foam::GAMGAgglomeration::restrictFaceField
(
    gpuField<Type>& cf,
    const gpuField<Type>& ff,
    const label fineLevelIndex
) const
{
    const labelgpuList& sort = faceRestrictSortAddressing_[fineLevelIndex];
    const labelgpuList& target = faceRestrictTargetAddressing_[fineLevelIndex];
    const labelgpuList& targetStart =
        faceRestrictTargetStartAddressing_[fineLevelIndex];

    cf = pTraits<Type>::zero;

    // The predicate skips coarse faces with negative target indices,
    // mirroring the (cFace >= 0) check in the serial loop.
    thrust::transform_if
    (
        targetStart.begin(), targetStart.end()-1,
        targetStart.begin()+1,
        target.begin(),
        thrust::make_permutation_iterator(cf.begin(), target.begin()),
        GAMGAgglomerationRestrictFunctor<Type>
        (
            ff.data(),
            sort.data()
        ),
        nonNegativeGAMGFunctor<label>()
    );
}
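
To clarify the extra addressing used above, here is a simplified standalone sketch of the same pattern (toy data, not RapidCFD code): fine faces are pre-sorted so each coarse face owns a contiguous segment, targetStart marks the segment boundaries, and one Thrust call reduces each segment independently, avoiding the concurrent writes the serial cf[cFace] += ff[ffacei] loop would cause on a GPU:

// Segmented-sum addressing on the host; with device_vector and a
// __host__ __device__ functor the same pattern runs on the GPU.
#include <thrust/transform.h>
#include <cstdio>
#include <vector>

struct SegmentSum
{
    const double* ff;     // fine-face values
    const int* sort;      // fine faces reordered so each coarse face is contiguous

    double operator()(int start, int end) const
    {
        double out = 0.0;
        for (int i = start; i < end; i++)
            out += ff[sort[i]];                    // gather, then accumulate locally
        return out;
    }
};

int main()
{
    std::vector<double> ff = {1, 2, 3, 4, 5, 6};   // six fine-face values

    // Coarse face 0 owns fine faces {0, 2}; coarse face 1 owns {1, 3, 4, 5}.
    std::vector<int> sort = {0, 2, 1, 3, 4, 5};
    std::vector<int> targetStart = {0, 2, 6};      // segment boundaries into 'sort'
    std::vector<double> cf(2, 0.0);

    // Each coarse face is computed independently -- no concurrent writes to cf.
    thrust::transform(targetStart.begin(), targetStart.end() - 1,
                      targetStart.begin() + 1,
                      cf.begin(),
                      SegmentSum{ff.data(), sort.data()});

    std::printf("cf = [%g, %g]\n", cf[0], cf[1]);  // expect [4, 17]
    return 0;
}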

Performance Tests: Hardware

               Type    Cores    Bandwidth [GB/s]    GFlops (DP)
Xeon E5-2670   CPU         8                51.2          166.4
Tesla K20X     GPU      2688                 250           1312

Performance Tests: Case Study

3.15M hexahedral cells
pisoFoam + LES
double precision

What impacts GPU performance?
The best algorithms are usually sequential (DIC/DILU, Gauss-Seidel)
GPUs require additional addressing for parallel access
Memory allocation and deallocation on the GPU is slower (see the sketch below)
The GPU needs lots of data to efficiently hide latency
MPI communication needs to go through PCI-E
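
As an aside, a minimal CUDA sketch (not from the slides) that illustrates the allocation point above by timing repeated cudaMalloc/cudaFree against reusing a single buffer:

// Compare allocate/free every "time step" with allocating once and reusing.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const std::size_t bytes = std::size_t(64) << 20;   // 64 MB
    const int iters = 100;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
    {
        void* p = nullptr;
        cudaMalloc(&p, bytes);                          // fresh allocation each step
        cudaFree(p);
    }
    auto t1 = std::chrono::steady_clock::now();

    void* q = nullptr;
    cudaMalloc(&q, bytes);                              // allocate once, reuse
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
    {
        cudaMemset(q, 0, bytes);                        // stand-in for real work
    }
    cudaDeviceSynchronize();
    auto t3 = std::chrono::steady_clock::now();
    cudaFree(q);

    std::printf("alloc/free each step: %.1f ms, reuse: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t3 - t2).count());
    return 0;
}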

sim-flow.com
