
Deutsche Forschungsgemeinschaft
Priority Program 1253
Optimization with Partial Differential Equations

Stephan Schmidt, Volker Schulz
A 2589 Line Topology Optimization Code Written for the Graphics Card
May 2009
Preprint-Number SPP1253-068
http://www.am.uni-erlangen.de/home/spp1253

Abstract
We investigate topology optimization based on the SIMP method on CUDA-
enabled video cards in three dimensions. Using finite elements, linear elasticity is
solved by a matrix-free conjugate gradient method entirely on the GPU. Due to
the unique requirements of the SIMD stream processors, special attention is given
to procedural generation of matrix-vector products entirely on the graphics card.
The GPU code is found to be extremely efficient, solving problems of 4.65·10^7 un-
knowns on commodity hardware up to 58 times faster than the CPU. The sources
are available at http://www.mathematik.uni-trier.de/~schmidt/gputop.
1 Introduction
Few areas are developing as quickly as the computational power of micro-
processors. However, the means of sustaining the exponential increase in processing
speed predicted by Moore's law appear to be changing from making a sin-
gle processing unit faster to making it wider, i.e. increasing the parallelism by
switching to many processing cores and highly parallel architectures. Also,
the gap between the processor's computational power and the memory's abil-
ity to deliver data is widening. These developments can lead to future
systems which are very fast and highly parallel, but also highly heterogeneous,
requiring numerical algorithms which are able to scale well to such systems.
Two architectures, which are available today, already feature many of
these aspects: The Cell Broadband Engine Architecture, jointly developed
by Sony, Toshiba, and IBM, combines a Power Architecture core with several

Dipl.-Math. S. Schmidt, Univ. Trier, Stephan.Schmidt@uni-trier.de
Prof. Dr. V. Schulz, Univ. Trier, Volker.Schulz@uni-trier.de
co-processing elements, creating a heterogeneous processing unit with strong
emphasis on computational performance and power consumption rather than
simplicity of program code. The other such system is the commodity graphics
card. Originally highly adapted to rendering three-dimensional polygonal
data, these graphics adapters have by now reached a flexibility such that they
can be put to good use in scientific computing. These devices feature a stream
processing approach with many processing cores, each of which can process many
threads in a SIMD fashion. They also require manually managing a heterogeneous
memory system, such that the processors are not left idle waiting for data.
Presently, nVidia's Compute Unified Device Architecture (CUDA) appears
to be the most widely used solution for using graphics cards in scientific com-
puting. Due to the easy availability of commodity graphics hardware and
CUDA being an extension of the C/C++ programming language, we chose
to use CUDA for the studies presented in this article.
The aim of this work is to study the applicability of this novel hardware in
the field of PDE-constrained optimization. With respect to solving problems
involving PDEs, most literature on stream computing focuses either on
linear algebra [6] or on simulation alone [10], but not on optimization. Also, finite
differences [8] are almost always used. The interest in computational fluid
dynamics and finite volume methods is also particularly strong [5, 7, 11, 13].
As such, this work focuses on topology optimization based on the power
law or Solid Isotropic Material with Penalization (SIMP) approach using
finite elements on a structured mesh in three dimensions [1, 15]. The SIMP
approach models the distribution of material and voids by assuming constant
material properties in each finite element. The optimization variable is the
relative material density in each element, raised to some power and multiplied
by the material properties of the solid material. The SIMP approach is generally
known for its problem size, as, especially in three dimensions, the number
of unknowns can become huge. GPU acceleration of solid mechanics solvers
and linear elasticity has very recently been studied in [9], where the GPU
is used as a coprocessor. Here, however, it is our aim to use the graphics
accelerator card as the whole solver, including preprocessing and procedural
generation of the matrix-vector product on the graphics card.
Some of the authors have previously studied the SIMP approach using in-
terior point multigrid methods in [12]. Due to its structure and the problem
size, the SIMP method is well suited for the vector computing of the graph-
ics card. We generally follow a similar optimization strategy as presented
in [14]. However, there are numerous modifications to the finite element as-
sembly procedure to make it better suited to the graphics card. There are
also some modifications necessary to account for the single precision floating-
point limitation of present GPUs. Depending on which classical x86-64
CPU and how many cores one measures against, the GPU implementation
offers speed-ups between a factor of 22 and 58. Without compiler optimization
for the CPU, the gap increases to a factor of 110.
The structure of this work is as follows: Section 2 briefly recapitulates
linear elasticity and topology optimization using the SIMP method. In the
following section 3, the programming model of the GPU is discussed in
detail. Special attention is given to the procedural matrix-vector product
entirely on the GPU and the single precision limitation of the device. In
section 4, we compare the time needed by the GPU solver with different
CPU implementations. The GPU is up to 58 times faster for problems with
4.6·10^7 unknowns.
2 Linear Elasticity and the SIMP Method
2.1 Linear Elasticity
A three-dimensional body occupying a domain $\Omega \subset \mathbb{R}^3$ is considered. The de-
formation of the body under body forces $f : \Omega \to \mathbb{R}^3$ and boundary tractions
$t : \Gamma \to \mathbb{R}^3$ is modeled by linear elasticity in the displacement formulation. Let
$u : \Omega \to \mathbb{R}^3$ be the displacements of the material under load. The linearized
strain is given by
$$\varepsilon_{ij} := \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right), \qquad (2.1)$$
and the load linear form by
$$L(v) := \int_\Omega f \cdot v \, dA + \int_\Gamma t \cdot v \, dS.$$
The displacements $u$ can be computed as the solution of the energy bilinear
form
$$a(u, v) := \int_\Omega E_{ijkl}\, \varepsilon_{ij}(u)\, \varepsilon_{kl}(v)\, dS = L(v) \quad \forall v \in V, \qquad (2.2)$$
where $E_{ijkl}$ is an a priori given constant elasticity tensor. In a homogeneous
isotropic medium, symmetry allows reducing the order of the elasticity tensor
using Voigt notation. Due to the symmetry of (2.1), one can write
$$\varepsilon := (\varepsilon_{11}, \varepsilon_{22}, \varepsilon_{33}, \varepsilon_{12}, \varepsilon_{13}, \varepsilon_{23})^T =: Bu$$
and the Cauchy stress tensor is given by
$$\sigma = \frac{E}{(1+\nu)(1-2\nu)}
\begin{pmatrix}
1-\nu & \nu & \nu & & & \\
\nu & 1-\nu & \nu & & & \\
\nu & \nu & 1-\nu & & & \\
 & & & 1-2\nu & & \\
 & & & & 1-2\nu & \\
 & & & & & 1-2\nu
\end{pmatrix} \varepsilon =: CBu,$$
where $E$ is Young's modulus and $\nu$ is Poisson's ratio. Hence, (2.2) can also
be expressed as
$$a(u, v) = \int_\Omega (Bv)^T C B u \, dS = L(v) \quad \forall v \in V. \qquad (2.3)$$
In the following, we discretize (2.3) using 8-node cubic finite elements with
linear test and trial functions in three dimensions, resulting in a symmetric,
positive definite linear system
$$Ku = f.$$
For more details on finite elements and elasticity theory, see [4]. The compli-
ance $c$ of the structure is given by
$$c(u) = u^T f = u^T K u,$$
which is the objective function to be minimized.
2.2 SIMP Method
The SIMP method models the distribution of solid material and voids by
introducing a pseudo-density $\rho \in [0, 1]$ into (2.3):
$$a(u, v) = \sum_{e=1}^{N} \int_{\Omega_e} (Bv)^T \rho_e^p\, C B u \, dS = L(v) \quad \forall v \in V. \qquad (2.4)$$
As intermediate values of $\rho$ do not have a physical meaning, the exponent $p$ is
used as a penalty factor. Here, $N$ is the number of finite elements, and $\rho$ is
considered constant on each element. The topology optimization problem is
now given by
$$\min_{(u,\rho)} \; J(u, \rho) := u^T K(\rho) u \qquad (2.5)$$
subject to
$$K(\rho)u = f \qquad (2.6)$$
$$\sum_{e=1}^{N} \rho_e = v_0 \qquad (2.7)$$
$$\rho_e \in \{0, 1\}. \qquad (2.8)$$
Equation (2.7) is a volume constraint, as otherwise the solution would be a solid
body. Condition (2.8) is usually relaxed to $\rho_e \in [\rho_0, 1]$ to prevent the state
equation from becoming singular. More details on topology optimization
in general and the SIMP method in particular can be found in [3]. Each
optimization step follows [14]: According to [2], a heuristic updating scheme
for the design variables is given by
$$\rho_e^{\mathrm{new}} = \begin{cases}
\max(\rho_0, \rho_e - m) & \text{if } \rho_e B_e^\eta \le \max(\rho_0, \rho_e - m)\\
\rho_e B_e^\eta & \text{if } \max(\rho_0, \rho_e - m) < \rho_e B_e^\eta < \min(1, \rho_e + m)\\
\min(1, \rho_e + m) & \text{if } \min(1, \rho_e + m) \le \rho_e B_e^\eta
\end{cases} \qquad (2.9)$$
where $m$ is a positive move-limit, $\eta = 0.5$ is a numerical damping coefficient,
and $B_e$ is found from the optimality condition as
$$B_e = \frac{-\partial c / \partial \rho_e}{\lambda},$$
where $\lambda$ is a Lagrange multiplier that can be found by a bi-sectioning
algorithm. The sensitivity of the objective function is found as
$$\frac{\partial c}{\partial \rho_e} = -p\, \rho_e^{p-1}\, u_e^T K_e u_e,$$
where $K_e$ is the element stiffness matrix from which $K$ is assembled. In order to
prevent checkerboarding and to arrive at a mesh-independent structure, an
additional mesh-independence filter is applied. However, we refer to
[3, 14] instead of repeating the mesh filter here.
3 Implementation on the Graphics Processing Unit
3.1 Overview
The domain is discretized using a Cartesian mesh of cubic, linear finite
elements. We employ the matrix-free conjugate gradient method to solve the
state equation. The whole optimization is conducted in a one-shot sense,
meaning we use the iterative solver for the state equation as a defect cor-
rection after each density update. The system matrix K is never explicitly
created; instead, we generate the product Ku needed for the conjugate gra-
dient method only procedurally, which, due to the Cartesian mesh, is an
ideal task for the SIMD graphics processor. Due to the effectiveness of the
optimality criteria update, only very few optimization iterations are needed
compared to the number of CG iterations. Also, the mesh independence
filter from [14] requires an additional halo layer. Thus, these two steps are
performed by the CPU. However, porting them to the GPU would be straightforward.
In the following, we adopt nVidia's nomenclature by referring to the
graphics unit as the "device" and the CPU as the "host". The graphics
card acts like an autonomous compute device, meaning that after copying
the data into the device memory, the actual computation is conducted by
the GPU autonomously. It is therefore important to perform the entire CG
method on the device, as copying data from the system's RAM to the device
memory is rather slow.
3.1.1 GPU Execution Model
The device uses a hierarchical structure in both execution and memory.
A grid is started on the device, which consists of one- or two-dimensional
blocks. Each block contains the threads, which can have one-, two-, or three-
dimensional indices. The hierarchy is illustrated in figure 3.1.

Fig. 3.1: CUDA execution model.

The threads
perform the actual computation by running a kernel. There is no control over
the order in which the blocks are processed, and it is hard to determine the
number of threads which are actually executed in parallel, as each streaming
multiprocessor can switch between blocks to minimize idle time while waiting
for data. Since memory latency is hidden by parallelism and not by caching,
it is in general advantageous to maximize the number of threads. However,
since the shared memory and the registers of each streaming multiprocessor can
overlap, the parallelism will be reduced if more data needs to be exchanged
between the threads. This is particularly problematic in the data-intensive
case of linear elasticity. It is also important that the kernel being executed
conforms with the SIMD hardware, meaning branches in the code as
a result of an if-statement do not reduce performance provided each half-
warp executes the same branch. On present hardware, a half-warp consists of 16
consecutive threads. In terms of solving PDEs, this usually means reduced
performance at boundary nodes.
3.1.2 GPU Memory Model
The device also features many different memories, which must be managed
manually. Those are:

- the device memory,
- the shared memory,
- the constant memory,
- the texture memory.

The device memory has the biggest capacity but is also the slowest. On
present graphics cards, one access to the device memory can cost up to 600
cycles, which is equivalent to 150 or more single precision floating-point
operations. As opposed to the CPU RAM, it is highly optimized for band-
width and less for latency, meaning that much more data can be accessed
provided consecutive threads access consecutive memory addresses with a
starting address that is aligned to a multiple of 4. Such an access is referred
to as coalesced. Due to the alignment requirement, three displacements
per node are highly unfavorable. Therefore, we store the density $\rho_e$ of
each element as part of the nodal displacements, resulting in 4 unknowns
per node with little memory overhead. Similarly, this also means that for a
two-dimensional Cartesian mesh of size $N_x \times N_y$, where $N_x$ and $N_y$ are not
multiples of the block sizes, the vector must be padded with zeros, such
that threads $(0, 1), (0, 2), \ldots$ can access aligned vector components $v_\alpha$ where
$\alpha = i + N_x j$. There is no atomic write operation to the device memory for
floating-point numbers, meaning that if multiple threads or blocks want to incre-
ment the same value in the device memory, the result is undefined. This is
particularly problematic for computing scalar products and for computing the
finite element matrix-vector product in the traditional fashion.
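The padding rule just described can be sketched as follows. `BLOCKX` and the function names are hypothetical, chosen for illustration only.

```c
/* Pad each mesh row to a multiple of the thread-block width so that
 * thread (i, j) reads component alpha = i + pitch*j from an aligned
 * address. BLOCKX is a hypothetical block width; the padded entries
 * are simply kept at zero. */
enum { BLOCKX = 16 };

static int padded_pitch(int nx)
{
    /* round nx up to the next multiple of BLOCKX */
    return ((nx + BLOCKX - 1) / BLOCKX) * BLOCKX;
}

static int component_index(int i, int j, int pitch)
{
    return i + pitch * j;   /* alpha = i + Nx*j on the padded mesh */
}
```

For instance, a row of Nx = 180 nodes would be padded to a pitch of 192, so the first thread of every row starts on an aligned address.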
The shared memory is the fastest, but very small. Present hardware features
only 16 KiB of shared memory. It can be accessed by all threads of the same
block in one cycle provided there is no bank conflict. On present hardware,
the shared memory is grouped into 16 banks, meaning 16 different values can
be loaded from the shared memory at the same time. The kernel will serialize
with a bank conflict should more threads access the shared memory at the
same time. An exception to this is given when all threads access the same
component of the shared memory, as the data can then be broadcast in
hardware. For the finite element solver, we found bank conflicts to be almost
unavoidable, given the need to store four numbers per node. The shared
memory can overlap with the registers of a stream processor, meaning that
if too much is needed, the number of blocks that can be processed in parallel
suffers. This can lead to the counter-intuitive situation where smaller blocks,
and consequently fewer threads, lead to a faster overall performance.
On present hardware, the constant memory is up to 80 KiB in size. It
features the same broadcasting mechanism should all threads access the same
value. It is also cached, meaning that accessing the constant memory can
be as fast as one cycle or as slow as one access to the device memory when
there is a cache miss. The device can only read values from the constant
memory which have been stored by the host prior to the launch of the grid.
We use the constant memory to store the element stiffness matrix $K_e$. For
linear elasticity using 8-node linear finite elements, those are 576 numbers
or 2304 bytes. Theoretically, each thread can access the same element of $K_e$
during the computation of $Ku$; however, due to bank conflicts when loading
$u_e$ into the processor registers, it is possible that the broadcasting function
cannot be used to maximum performance.
The texture memory is a fairly recent addition to CUDA and describes a
special way of accessing the memory in such a manner that the device can
use some auxiliary functions like interpolation in hardware. We do not use
texture memory in the present work.
3.2 Finite Element Matrix Assembly
Due to the limited amount of 16 KiB of shared memory on present compute
capability 1.x devices, we follow the common strategy of extending a 2D com-
putation into the third dimension by loading only three slices into the shared
memory. Each thread then loops, or "streams", slice-wise into the third di-
mension.
3.2.1 Element Based
The standard approach to computing $Ku$ is given by
$$u_{\mathrm{new}} = Ku = \sum_{e=1}^{N} K_e u_e, \qquad (3.1)$$
where $u_e$ refers to the restriction of $u$ to the respective element $e$. An outer
loop over all elements with two inner loops testing each vertex of the element
against each other results in incrementing one value of the state vector for
each vertex. Processing one element results in 3 memory accesses per vertex:
loading $u_e^\alpha$, the value of $u_e$ at vertex $\alpha$ of element $e$, loading $u_{\mathrm{new}}^\alpha$, and writing back
the incremented value. The element-based assembly is tempting because no
special treatment of nodes using the natural boundary condition is necessary,
which is very much in line with the SIMD requirement.
Unfortunately, we found the standard approach to be unsuitable for
the graphics processing unit for several reasons. The maximum number of
threads allowed per block is 512. Thus, it is in general not possible to com-
pute (3.1) using only one block. Consequently, there will be nodes belonging
to several blocks. Without an atomic write operation to the device mem-
ory, the correct value at these nodes is not guaranteed. Also, if one thread
per element is used, the update of only one vertex fulfills the alignment re-
quirement for a coalesced device memory access, whereas the remaining seven
other vertices do not. Additionally, incrementing in the device memory this
often is highly undesirable. Alternatively, a possible storage of intermediate
values of $u_{\mathrm{new}}$ in the shared memory cuts the number of usable threads in
half.
3.2.2 Nodal Based
Instead of element-based finite element assembly, we assume there is one
thread per test function, i.e. per node. We use the built-in data type float4
to store the three displacements and the density of each node, satisfying the
memory alignment as discussed earlier. To minimize device memory access,
each thread loads the state at its node into the shared memory, such that all
threads of the block can operate on shared memory when accessing the values
of neighbor nodes. Finite element assembly requires knowledge of the neighbor
nodes. Hence, some threads must also load halo values, which cannot be done
entirely coalesced and which can also interfere with the SIMD execution if the
number of halo values is not a multiple of the half-warp size. In three dimensions,
we hold three such planes in the shared memory, looping into the third dimension by
discarding the last plane from the shared memory and loading one new plane.
One such plane is sketched in figure 3.2.

Fig. 3.2: 16 × 16 patch with one halo layer in shared memory per block.
Green shows the inner 16 × 16 nodes, loaded coalesced. Blue shows
halo nodes loaded coalesced. Red shows halo nodes loaded unco-
alesced.

The strategy of feeding the shared memory is adapted from [8]. The central
part of the code, without the feeding of the shared memory, is shown in
listing 1 below. The complete sources are available at
http://www.mathematik.uni-trier.de/~schmidt/gputop.
Listing 1: Central Part of Computing Ku Procedurally

 1  for (int ek1 = 0; ek1 < 2; ek1++)
 2  {
 3    const int EIDK = k - ek1;
 4    if (EIDK >= 0 && EIDK < NZ - 1)
 5    {
 6      for (int ej1 = 0; ej1 < 2; ej1++)
 7      {
 8        const int EIDJ = j - ej1;
 9        if (EIDJ >= 0 && EIDJ < NY - 1)
10        {
11          for (int ei1 = 0; ei1 < 2; ei1++)
12          {
13            const int EIDI = i - ei1;
14            if (EIDI >= 0 && EIDI < NX - 1)
15            {
16              const REAL Dens =
17                pow(s_u[ind - ei1*IOFF - ej1*JOFF - ek1*KOFF].w, gpu_pexp);
18              const int LID1 = ei1 + ej1*2 + ek1*4;
19              for (int ek2 = 0; ek2 < 2; ek2++)
20              {
21                for (int ej2 = 0; ej2 < 2; ej2++)
22                {
23                  for (int ei2 = 0; ei2 < 2; ei2++)
24                  {
25                    const int LID2 = ei2 + ej2*2 + ek2*4;
26                    const int idiff = ei2 - ei1;
27                    const int jdiff = ej2 - ej1;
28                    const int kdiff = ek2 - ek1;
29                    const REAL4 MyU =
30                      s_u[ind + idiff*IOFF + jdiff*JOFF + kdiff*KOFF];
31                    MyRes.x += Dens * (GPU_EleStiff[LID1][LID2]    * MyU.x
32                             + GPU_EleStiff[LID1][LID2+8]  * MyU.y
33                             + GPU_EleStiff[LID1][LID2+16] * MyU.z);
34                    MyRes.y += Dens * (GPU_EleStiff[LID1+8][LID2]    * MyU.x
35                             + GPU_EleStiff[LID1+8][LID2+8]  * MyU.y
36                             + GPU_EleStiff[LID1+8][LID2+16] * MyU.z);
37                    MyRes.z += Dens * (GPU_EleStiff[LID1+16][LID2]    * MyU.x
38                             + GPU_EleStiff[LID1+16][LID2+8]  * MyU.y
39                             + GPU_EleStiff[LID1+16][LID2+16] * MyU.z);
40                  }
41                }
42              } // end e2 loops
43            } // if i okay
44          } // i loop
45        } // if j okay
46      } // j loop
47    } // if k okay
48  } // k loop
49  } // end if not dirichlet
50  // store results
51  res[ind_gcur] = MyRes;
52  } // end if active
53  __syncthreads();
Lines 1-11 loop over all elements which have the node (i, j, k) associated with
this thread as a vertex. The if-statements are needed at boundary nodes.
The EID variables hold the global index of the element being processed.
Line 16 loads the constant density for the element under computation. The
node/vector component with global index (i, j, k) has different local indices
depending on the element it is a vertex of. The loops in lines 19-23 iterate
over the local indices node (i, j, k) has in adjacent elements. Line 25 computes
the index of the trial function being tested against, needed to access the element
stiffness stored in GPU_EleStiff[24][24], residing in the constant memory. Lines 26-
28 compute the same indices for accessing the states in the shared memory.
The following computation of the matrix-vector product is straightforward,
and the kernel ends with a single store to the device memory per component
of the result vector. Consequently, this is also one perfectly coalesced write
operation per thread, completely avoiding the problems with missing atomic
increment instructions.
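The gather structure of listing 1 can be illustrated by a one-dimensional analogue: each "thread" owns one node, loops over the (at most two) adjacent elements, accumulates the element-stiffness contributions, and performs exactly one write. This is our own reduced sketch, not the paper's code; the 2×2 element stiffness is that of a linear bar element with unit coefficient.

```c
/* 1D analogue of the nodal (gather) matrix-vector product:
 * Ke is the 2x2 stiffness of a linear bar element. */
static const double Ke[2][2] = { {  1.0, -1.0 },
                                 { -1.0,  1.0 } };

void nodal_matvec_1d(const double *u, double *res, int n_nodes)
{
    for (int n = 0; n < n_nodes; n++) {      /* one thread per node  */
        double acc = 0.0;
        for (int e1 = 0; e1 < 2; e1++) {     /* adjacent elements    */
            int eid = n - e1;                /* global element index */
            if (eid < 0 || eid >= n_nodes - 1)
                continue;                    /* boundary: no element */
            int lid1 = e1;                   /* local index of node n */
            for (int e2 = 0; e2 < 2; e2++)   /* trial functions      */
                acc += Ke[lid1][e2] * u[eid + e2];
        }
        res[n] = acc;                        /* single write per node */
    }
}
```

The result equals the assembled tridiagonal stiffness matrix times u, but no thread ever increments a value written by another thread, which is exactly the property needed on hardware without atomic floating-point increments.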
3.3 Single Precision
Presently, GPU devices are limited to single precision floating-point numbers.
Although double precision support has recently been added in hardware,
the limited amounts of shared memory and double precision processing power
still make the GPU uncompetitive for double precision computations. Appar-
ently, the optimality criteria update is somewhat sensitive to round-off
errors. For finding the adjoint variable for the volume constraint, we em-
ploy the same bi-sectioning algorithm as in [14]. However, the termination
criterion had to be relaxed, as the original termination criterion was below
single precision accuracy. This did not lead to any problems with the volume
fraction being violated.
We also had to modify the actual optimality criteria update scheme. In
single precision, components $v_i$ of the gradient vector $v$ with $|v_i| < 10^{-5}$
can have an unreliable sign. Although the relative error is insignificant,
the updating scheme (2.9) is quite sensitive to components of the gradient
being $-0$ instead of $+0$ due to the discontinuous nature of the max and min
operators. Since the optimal solution without constraints is completely filled
and without voids, the sign of the gradient can be corrected manually.
The single precision limit is especially problematic when the problem size
increases, as the condition number of the system matrix becomes very large
for problems with $10^6$ to $10^7$ unknowns.
4 Results
With four floating-point numbers per node, the problem of three-dimensional
linear elasticity is very data intensive, especially when one considers that the
actual multiplication with the element stiffness $K_e$ is of very low arithmetic
intensity. Nevertheless, we found the speed-up to be tremendous.

Fig. 4.1: Optimal cantilever, 180 × 180 × 360 mesh. GPU computation. Filter
radius 2.5, penalty exponent p = 3.0. Figure is interactive in the
electronic version of the paper.

Our main compute device is a GeForce GTX280 with 1 GiB of device memory. It
features 30 streaming multiprocessors with a total of 240 cores, all running at
1.30 GHz. Since the frame buffer also resides in this memory, we found only
around 700 MiB to be actually usable. The device is of compute capability 1.3
and allows us to overlap the x-y plane of the three-dimensional domain with
16 × 16 blocks, thus launching 256 threads per block.
4.1 Cantilever Beam
4.1.1 Overview
Our first example is a cantilever beam. The domain consists of 180 × 180 × 360
nodes, resulting in 34,992,000 displacements and 11,502,719 densities, a to-
tal of 46,494,719 unknowns or 177 MiB. The additional overhead to meet
the memory alignment restrictions is less than 1 MiB. To account for a loss
of orthogonality in the CG iteration due to single precision accuracy, we
re-initialize the residual every 50 CG iterations. These iterations require
storing 4 vectors. With an additional overhead to compute the scalar prod-
ucts needed for the CG iteration by a tree reduction scheme, a total of 711
MiB of device memory is needed. The domain is overlapped by 12 × 12 blocks,
resulting in 36,864 threads streaming into the k-plane. This block setup
requires 13,872 out of 16,384 bytes of shared memory per block, resulting in
a multiprocessor occupancy of only 1. We could not observe any speed-up
by lowering the block sizes and consequently consuming less shared mem-
ory with an accompanying increase in multiprocessor occupancy. In fact, the
opposite was often the case.
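The tree reduction used for the CG scalar products can be sketched serially as follows. This is our own sketch; on the device, each pass of the outer loop is executed in parallel by the threads of a block. As a side effect, pairwise summation also accumulates less round-off error in single precision than a naive left-to-right sum.

```c
/* Tree (pairwise) reduction for a scalar product: in each pass,
 * element i accumulates element i + half, halving the active range
 * until one value remains. A serial sketch of the parallel scheme;
 * operates on a scratch copy, not the input. */
double tree_reduce_dot(const double *x, const double *y, int n)
{
    double buf[1024];                 /* scratch; assume n <= 1024 */
    for (int i = 0; i < n; i++)
        buf[i] = x[i] * y[i];
    for (int len = n; len > 1; ) {
        int half = (len + 1) / 2;     /* handle odd lengths */
        for (int i = 0; i < len - half; i++)
            buf[i] += buf[i + half];  /* one parallel pass  */
        len = half;
    }
    return (n > 0) ? buf[0] : 0.0;
}
```

A reduction of n values thus needs only about log2(n) passes, which maps well onto one thread block per partial sum.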
4.1.2 Speedup
1000 CG Iterations, 180 × 180 × 360 Mesh:

  GeForce GTX280             00:05:27.60     1.00
  Core2Duo E6600, 1 core     05:18:27.38    58.33
  Core2Duo E6600, 2 cores    02:51:28.55    31.41
  Core2Duo T9600, 1 core     04:37:29.92    50.82
  Core2Duo T9600, 2 cores    01:58:50.87    21.77

Fig. 4.2: Time needed for 1000 CG iterations on a 180 × 180 × 360 mesh
(hh:mm:ss and slowdown factor relative to the GPU). The GPU is up
to 60 times faster. CPU code with maximum compiler optimization.
Since the actual optimization might take different paths due to round-off
discrepancies, we measure the time needed for the first 1,000 CG iterations on
the starting domain with uniform density. The results are shown in figure 4.2.
The CPU code was compiled with gcc 4.3.1 using OpenMP parallelization
for shared memory systems and the option -O3 for maximum compiler
optimization. This especially includes auto-vectorization for the SSE units
of modern CPUs. We compare the GTX280 graphics card with the Intel
Core2Duo E6600 and T9600 processors. The E6600 desktop processor was
released ca. 2006 and runs at 2.4 GHz. It features 4 MiB of cache shared by
both cores. The graphics card was 58.33 times faster than one core and 31.41
times faster than both cores in parallel. To compare the graphics card with
a processor released at roughly the same time, we also measured the time
needed on a T9600 laptop processor. The Core2Duo T9600 runs at 2.8 GHz
and features 6 MiB of cache shared by both cores. Here, the gap narrows, with
the CPU being only 50.82 times slower on one core and 21.77 times
slower when both cores are used in parallel. As it turns out, even when compared
to quite recent processors, the graphics card implementation solves problems
in one day which would require more than two weeks on two processor cores
costing roughly the same amount of money. The graphics card becomes up
to 110 times faster than a single core of the E6600 processor if no compiler
optimization is used when translating the CPU code.
4.2 Wheel
Fig. 4.3: Optimal wheel, 100 × 100 × 100 mesh. Filter radius 2.0, penalty
exponent p = 3.0. Figure is interactive in the electronic version of
the paper.

In the second example, a single load is attached in the middle of the
bottom of the domain, and the lower corners define the supports. In two
dimensions, the optimal topology is often called a wheel. The wheel-like
appearance is less obvious in three dimensions. Here, the structure becomes
more dome-like on the outside, with many hidden cavities on the inside. Fig-
ure 4.4 shows the cavities.

Fig. 4.4: Transparent view into the wheel structure showing internal cavities.

Using a 100 × 100 × 100 mesh, this problem consists
of 3,000,000 displacements and 970,299 densities, a total of 3,970,299 un-
knowns. Even a comparatively weak laptop graphics chip such as the GeForce
9600M GT, consisting of four streaming multiprocessors and 32 cores running
at a mere 0.78 GHz clock rate, can solve the whole topology optimization
problem in less than two hours.
5 Conclusions and Outlook
We have investigated topology optimization on CUDA-enabled graphics cards.
An optimization procedure very similar to [14], in three dimensions, is con-
ducted on the graphics card and found to be up to 58 times faster than the
CPU implementation. Special attention is given to conducting a matrix-free
CG iteration on the device with procedural generation of the matrix-vector
product. Due to the missing atomic increment operations and the high number
of memory accesses, we present a nodal finite element matrix-vector routine.
The resulting topology optimization code is found to be very efficient both
in memory requirement and speed, solving problems of 4.65·10^7 unknowns
on a commodity graphics card.
References

[1] M. P. Bendsøe. Optimal shape design as a material distribution problem.
    Structural Optimization, 1:193-202, 1989.

[2] M. P. Bendsøe. Methods for Optimization of Structural Topology, Shape
    and Material. Springer, 1995.

[3] M. P. Bendsøe and O. Sigmund. Topology Optimization: Theory, Methods
    and Applications. Springer, Berlin, Heidelberg, New York, 2nd edition, 2004.

[4] D. Braess. Finite Elemente: Theorie, schnelle Löser und Anwendungen
    in der Elastizitätstheorie. Springer, Berlin, Heidelberg, New York, 2nd
    edition, 1997.

[5] T. Brandvik and G. Pullan. Acceleration of a 3D Euler solver using com-
    modity graphics hardware. In Proceedings of the 46th AIAA Aerospace
    Sciences Meeting, volume AIAA 2008-607. AIAA, 2008.

[6] L. Buatois, G. Caumon, and B. Levy. Concurrent number cruncher:
    An efficient sparse linear solver on the GPU. Lecture Notes in Computer
    Science, 4782:358-371, 2007.

[7] E. Elsen, P. LeGresley, and E. Darve. Large calculation of the flow over
    a hypersonic vehicle using a GPU. Journal of Computational Physics,
    227:10148-10161, 2008.

[8] M. Giles. Using NVIDIA GPUs for computational finance.
    http://people.maths.ox.ac.uk/~gilesm/hpc/.

[9] D. Göddeke, H. Wobker, R. Strzodka, J. Mohd-Yusof, P. McCormick,
    and S. Turek. Co-processor acceleration of an unmodified parallel solid
    mechanics code with FEASTGPU. International Journal of Computa-
    tional Science and Engineering (IJCSE), 2009. To appear.

[10] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys.
    A multigrid solver for boundary value problems using programmable
    graphics hardware. In HWWS '03: Proceedings of the ACM SIG-
    GRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages
    102-111, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[11] T. R. Hagen, K.-A. Lie, and J. R. Natvig. Solving the Euler equations
    on graphics processing units. Lecture Notes in Computer Science,
    3994:220-227, 2006.

[12] B. Maar and V. Schulz. Interior point multigrid methods for topology
    optimization. Structural Optimization, 19(3):214-224, 2000. Also appeared
    as IWR, University of Heidelberg, Technical Report 9857, 1998.

[13] E. H. Phillips, Y. Zhang, R. L. Davis, and J. D. Owens. Rapid aerody-
    namic performance prediction on a cluster of graphics processing units.
    In Proceedings of the 47th AIAA Aerospace Sciences Meeting, volume
    AIAA 2009-565. AIAA, 2009.

[14] O. Sigmund. A 99 line topology optimization code written in Matlab.
    Structural and Multidisciplinary Optimization, 21(2):120-127, 2001.

[15] M. Zhou and G. I. N. Rozvany. The COC algorithm, part II: Topological,
    geometry and generalized shape optimization. Computer Methods in
    Applied Mechanics and Engineering, 89:197-224, 1991.
