
GPU accelerated Hough transformation for HLT application

Siddhant Mohanty (1,2) and Peter Hristov (1)

1. CERN, CH-1211, Geneve 23, Switzerland
2. Department of Computer Science, VNIT, Nagpur, India 440010

E-mail: siddhant9892@gmail.com, Peter.Hristov@cern.ch
ABSTRACT.
The Hough transform has been carried out on a graphics processor for both circular and linear trajectories, the latter being obtained through conformal mapping. As expected, on the CPU the circular track model takes longer than the linear Hough transform. However, both methods take nearly the same time on the GPU and run about 60 times faster than the corresponding CPU implementation.

1. Introduction
ALICE (A Large Ion Collider Experiment) is a general-purpose detector at the CERN LHC used to study nucleus-nucleus and proton-proton collisions at different centre-of-mass energies [1]. The task of the ALICE High Level Trigger (HLT) is to select events of interest using various fast track-reconstruction algorithms. A fast track-reconstruction algorithm for the Time Projection Chamber of the ALICE detector (see appendix B for a detailed description of the ALICE TPC), based on the linear Hough Transform (HT), has been discussed in Ref. [2]. In this work, we implement a similar HT on a graphics processing unit (GPU). We consider both the circular and the linear HT (for details on the HT, see appendix A).
The equation of a circle with centre (a, b) and radius r passing through the origin,

(x − a)² + (y − b)² = r²    (1)

can be written in the parametric form

k = 2 sin(φ − φ₀) / R    (2)

where k = 1/r is the curvature, R = √(x² + y²), φ₀ is the azimuthal angle of the track at (0, 0), and

tan φ = y / x.    (3)

The above circle equation (Eq. 1) can be transformed into a straight line through conformal mapping with the prescription

u = x / (x² + y²),  v = y / (x² + y²).    (4)

If r is fixed by the relation

r² = a² + b²    (5)

the following straight line is obtained:

v = 1/(2b) − (a/b) u.    (6)

Both Eq. 6 and Eq. 2 can be used for the Hough transformation.

Figure 1. The x-y coordinates for 100 circular tracks. The clockwise curved tracks are for positively charged particles and the anti-clockwise curved tracks are for negatively charged particles.

Figure 2. The corresponding conformally mapped linear tracks in the u-v plane.
In the following, we have simulated curved trajectories in the x-y plane to mimic the transverse trajectories in the Time Projection Chamber (TPC) of the ALICE detector. The inner radius is fixed at 84.5 cm and the outer radius at 246.6 cm. The initial azimuthal angle φ₀ is randomly chosen from 0 ≤ φ₀ ≤ 2π, and the transverse momentum is chosen randomly between 0.5 ≤ |pT| ≤ 1 GeV. The curvature k is determined from the relation

k = 0.002998 B / pT    (7)

where B = 0.5 T is the magnetic field corresponding to the ALICE setup. Figure 1 and figure 2 show a typical example of 100 tracks generated using the above procedure. For each track, we have generated 256 data points, which corresponds to the number of pads per trajectory.

1.1. Implementation on GPU

To determine the trajectory parameters, we have implemented the Hough transformation using both Eq. 2 and Eq. 6. In the case of the circular Hough transform, the independent parameters in Hough space are φ₀ and pT (k). Similarly, under conformal mapping the Hough parameters are the slope and intercept of the line in the u-v plane. Both methods are implemented on a CPU (Intel Core2 Duo, 3.16 GHz) and a GPU (NVIDIA GeForce GTX 480). For GPU programming, NVIDIA CUDA is used, although the code could equally be implemented in OpenCL.

While implementing on the GPU, two methods have been followed. In the first method, each data point is associated with a given thread, so N threads spread over several blocks are used for N data points. In the second method, N blocks are used, one per data point, and the threads within a block build the accumulator. A detailed description of each method is given below:
1.2. Method 1

Figure 3. The Hough transformation in φ₀ and pT space. As an example we have considered 10 curves, each consisting of 256 data points.

In the following, a pseudo-code of the kernel implementation is given. We have considered all the data points in an azimuthal sector 0 ≤ φ₀ ≤ 45°.
__global__ void hough_thread(float *R, float *phi, float *sine, int *temp)
{
    int i, y1, y2, ip;
    float ag, m, c, sinag, p, k, phi0;
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    for (i = 0; i < A; i++) {                  /* A = 400 phi0 bins */
        phi0 = i * 0.1;
        ag = phi[index] * (180.0 / 3.14) - phi0;
        if (ag >= 0) {
            /* linear interpolation in the precomputed sine table */
            y1 = ag;
            y2 = y1 + 1;
            m = sine[y2] - sine[y1];
            c = sine[y1] - m * y1;
            sinag = m * ag + c;
        }
        if (ag < 0) {
            ag = -ag;
            y1 = ag;
            y2 = y1 + 1;
            m = sine[y2] - sine[y1];
            c = sine[y1] - m * y1;
            sinag = -1 * (m * ag + c);         /* sin(-x) = -sin(x) */
        }
        k = 2.0 * sinag / R[index];            /* Eq. 2             */
        p = (0.002998 * .5) / k;               /* Eq. 7, B = 0.5 T  */
        ip = p * B / 2 + 200;                  /* B = 400 momentum bins */
        if (ip >= 0 && ip < 400)
            atomicAdd(&temp[i * B + ip], 1);
    }
}

Figure 4. The Hough transformation using conformal mapping for the same points as shown in Fig. 3.
The variable index varies from 0 to the total number of threads N. In the circular representation (in R and φ coordinates), the data points are given by phi[index] and R[index]. The independent variable phi0 varies from 0 to 45° over 400 bins, and the corresponding momentum variable p is also binned over 400 bins. Appropriate scaling factors are used to build the accumulator temp as a 400x400 array. Note that the built-in atomicAdd function is used to avoid race conditions. The sine values are precomputed on the CPU, and intermediate values are obtained on the GPU by simple linear interpolation.

The following shows the pseudo-code of the kernel for conformal mapping in the linear Hough space.

__global__ void linear_hough(float *u, float *v, int *temp)
{
    int i, c;
    float m, k, d;
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    /* conformal mapping (Eq. 4): the input arrays hold x and y */
    d = v[index] * v[index] + u[index] * u[index];
    v[index] = v[index] / d;
    u[index] = u[index] / d;
    for (i = 0; i < A; i++) {                  /* A = 400 slope bins */
        m = i / 100.0;
        k = v[index] - m * u[index];           /* intercept for slope m */
        c = (k * 50000 / 2 + .5) + 50;
        if (c >= 0 && c < 400)
            atomicAdd(&temp[i * B + c], 1);
    }
}
As in the circular case, the linear conformal mapping also uses a 400x400-bin accumulator. The results of the circular and linear Hough transforms are shown in Fig. 3 and Fig. 4. As an example, we consider 10 trajectories, each consisting of 256 data points.
1.3. Method 2
In the first method, each thread carries out A = 400 calculations. In order to reduce the load on each thread, we have also implemented a second method where each block is assigned to one data point; the A threads of a given block then build the accumulator. The following is a pseudo-code for the block implementation corresponding to the circular Hough transformation. We observe that the second method is not superior to the first one.
__global__ void hough_block(float *R, float *phi, float *sine, int *temp)
{
    int y1, y2, ip;
    float ag, m, c, sinag, p, k, phi0;
    int index = blockIdx.x;                /* one block per data point */
    phi0 = threadIdx.x * 0.1;              /* one thread per phi0 bin  */
    ag = phi[index] * (180.0 / 3.14) - phi0;
    if (ag >= 0) {
        y1 = ag;
        y2 = y1 + 1;
        m = sine[y2] - sine[y1];
        c = sine[y1] - m * y1;
        sinag = m * ag + c;
    }
    if (ag < 0) {
        ag = -ag;
        y1 = ag;
        y2 = y1 + 1;
        m = sine[y2] - sine[y1];
        c = sine[y1] - m * y1;
        sinag = -1 * (m * ag + c);         /* sin(-x) = -sin(x) */
    }
    k = 2.0 * sinag / R[index];
    p = (0.002998 * .5) / k;
    ip = p * B / 2 + 200;
    if (ip >= 0 && ip < 400)
        atomicAdd(&temp[threadIdx.x * B + ip], 1);
}

Table 1. Comparison of execution times for the circular Hough transform. The first column shows the number of blocks in units of 1024. The second column shows the memory-transfer time from host to device (T1). The third column shows the kernel execution time (T2). The fourth column shows the memory-transfer time from device to host (T3). The fifth column shows the total execution time on the GPU. The corresponding CPU execution time is shown in column six. The last column shows the speedup factor with respect to the CPU.
Blocks   T1 (s)    T2 (s)    T3 (s)    GPU Total (s)   CPU Time (s)   Speedup
  -      0.00036   0.00072   0.00030   0.00138         0.01084          7.85
  -      0.00040   0.00225   0.00032   0.00297         0.05860         20
  -      0.00041   0.00223   0.00034   0.00298         0.08435         28.35
  -      0.00041   0.00225   0.00034   0.00300         0.11362         37.89
 16      0.00048   0.00452   0.00035   0.00535         0.42880         80.14
 24      0.00051   0.00509   0.00034   0.00594         0.63959        107.14
 32      0.00054   0.00694   0.00034   0.00782         0.86569        110.7
 45      0.00060   0.00783   0.00034   0.00877         1.20404        115.8
 50      0.00062   0.00961   0.00034   0.01057         1.33536        126.33
 64      0.00062   0.01149   0.00037   0.01248         1.70349        136.49

2. Results and Discussion

The program is implemented using NVIDIA CUDA and the nvcc compiler with the -arch=sm_13 option, which enables double-precision floating-point arithmetic. The CUDA code essentially has three main parts. In the first part, all the data points of a sector are transferred from the host

Table 2. Comparison of execution times for the linear Hough transform using conformal mapping. The columns have the same meaning as in table 1.

Blocks   T1 (s)    T2 (s)    T3 (s)    GPU Total (s)   CPU Time (s)   Speedup
  -      0.00025   0.00025   0.00032   0.00082         0.00424          5.17
  -      0.00026   0.00065   0.00032   0.00123         0.03045         24.75
  -      0.00030   0.00073   0.00033   0.00136         0.03304         25.29
  -      0.00027   0.00080   0.00032   0.00139         0.04615         33.2
 16      0.00030   0.00286   0.00030   0.00346         0.16750         48.41
 24      0.00039   0.00414   0.00035   0.00488         0.25115         51.46
 32      0.00039   0.00543   0.00032   0.00614         0.3336          54.33
 45      0.00047   0.00728   0.00034   0.00811         0.46704         57.58
 50      0.00049   0.00792   0.00032   0.00871         0.52035         59.74
 64      0.00053   0.01001   0.00032   0.01086         0.67324         61.99

to the device using the cudaMemcpy() function. The second part invokes the kernel, specifying the number of blocks and the number of threads per block to be used; the kernel implements the Hough transform. In the third part, the accumulator is transferred from device to host, again using cudaMemcpy().
To compare the relative performance, we have fixed the block size at 1024 threads, the maximum possible block size for the NVIDIA GeForce GTX 480, and estimated the various execution times as a function of the number of blocks, as shown in table 1 and table 2 for the circular and linear transforms respectively using method 1. In both tables, the first column shows the number of blocks in units of 1024, the second column the memory-transfer time from host to device, the third the kernel execution time, and the fourth the memory-transfer time from device to host. The fifth column shows the total execution time on the GPU, the sixth the corresponding CPU execution time, and the last column the speedup factor with respect to the CPU.
Fig. 5 shows the execution time versus the number of blocks for both the circular and linear Hough transforms on CPU and GPU. As expected, the circular Hough transform takes more time than the linear implementation on the CPU. It is interesting to note that the corresponding GPU execution times are nearly the same and about 50 times faster than the corresponding CPU times. Fig. 6 shows the GPU speedup factor versus the number of blocks for both transforms. Fig. 7 shows a typical example of execution time versus the number of threads using both method 1 and method 2 as described in the text. We observe that method 1 performs better than method 2.
3. Conclusion
In conclusion, the GPU implementation of the circular Hough transform is around 120 times faster than the corresponding CPU implementation, whereas the speedup is about 60 times in the case of the linear transform using conformal mapping. It is interesting to note that the GPU execution times for the circular and linear Hough transforms are nearly equal, although the circular transform takes more time than the linear transform on the CPU. This is an important observation, as the circular Hough transform can still be used for the full azimuthal range (0 ≤ φ ≤ 2π), whereas the linear Hough transform requires the data to be divided into sectors so that lines with slope close to 90° can be avoided.

Figure 5. The number of blocks used versus execution time in seconds, for both CPU and GPU, using the circular and linear Hough transforms.

4. Appendix A
4.1. Hough Transformation
The Hough transform is a technique which can be used to isolate features of a particular shape within an image. Because it requires that the desired features be specified in some parametric form, the classical Hough transform is most commonly used for the detection of regular curves such as lines, circles, ellipses, etc. A generalised Hough transform can be employed in applications where a simple analytic description of a feature is not possible; due to its computational complexity, however, the classical form is generally preferred. These classical transforms are used as a feature-extraction technique in image analysis, computer vision and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. The voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm computing the Hough transform.

Figure 6. The number of blocks used versus the speedup factor of the GPU with respect to the CPU.
The classical Hough transform has been extended to identifying positions of arbitrary shapes, most commonly circles or ellipses. In the automated analysis of digital images, a subproblem often arises of detecting simple shapes such as straight lines, circles or ellipses. In many cases an edge detector can be used as a pre-processing stage to obtain image points or pixels that lie on the desired curve in image space. Due to imperfections in either the image data or the edge detector, however, there may be missing points or pixels on the desired curves, as well as spatial deviations between the ideal line, ellipse or circle and the noisy edge points obtained from the edge detector. For these reasons, it is often non-trivial to group the extracted edge features into an appropriate set of lines, circles or ellipses. The purpose of the Hough transform is to address this problem by making it possible to group edge points into object candidates through an explicit voting procedure over a set of parametrized image objects.
The Hough transform algorithm uses an array, called an accumulator, to detect the existence of a line or curve. The dimension of the accumulator is equal to the number of unknown parameters of the Hough transform problem, and the dimensions of the accumulator array correspond to quantized values of those parameters. For each pixel and its neighbourhood, the algorithm determines whether there is enough evidence of an edge at that pixel. If so, it calculates the parameters of the line, finds the accumulator bin that those parameters fall into, and increments that bin. By finding the bins with the highest values, typically by looking for local maxima in the accumulator space, the most likely lines can be extracted and their (approximate) geometric definitions read off. The simplest way of finding these peaks is to apply some form of threshold, but different techniques may yield better results in different circumstances, determining which lines are found as well as how many. Since the lines returned do not contain any length information, it is often necessary in a next step to find which parts of the image match up with which lines. Moreover, due to imperfection errors in the edge-detection step, there will usually be errors in the accumulator space, which may make it non-trivial to find the appropriate peaks, and thus the appropriate lines.

Figure 7. The number of data points (threads) versus execution time for both methods described in the text. The corresponding CPU time is also shown for comparison.
The Hough transform is only efficient if a high number of votes falls in the right bin, so that the bin can be easily detected amid the background noise. This means that the bin must not be too small, or else some votes will fall in neighbouring bins, reducing the visibility of the main bin. Also, when the number of parameters is large, i.e. when we are using the Hough transform with typically more than three parameters, the average number of votes cast in a single bin is very low, and the bins corresponding to a real figure in the image do not necessarily receive many more votes than their neighbours. The complexity increases at a rate of O(A^(m−2)) with each additional parameter, where A is the size of the image space and m is the number of parameters. Thus, the Hough transform must be used with great care to detect anything other than lines or circles. Finally, much of the efficiency of the Hough transform depends on the quality of the input data: the edges must be detected well for the Hough transform to be efficient. Use of the Hough transform on noisy images is a very delicate matter, and generally a denoising stage must be applied first. In the case where the image is corrupted by speckle, other transforms are preferred for detecting lines, because they attenuate the noise through summation.
4.2. Linear Hough Transform
The simplest case of the Hough transform is the linear transform for detecting straight lines. In image space, a straight line can be described as y = mx + b and plotted for each pair of image points (x, y). In the Hough transform, the main idea is to consider the characteristics of the straight line not in terms of image points (x1, y1), (x2, y2), etc., but in terms of its parameters, i.e. the slope m and the intercept b. However, one faces the problem that vertical lines give rise to unbounded values of the parameters m and b. For computational reasons, it is therefore better to use a different pair of parameters, denoted r and θ, for the lines in the Hough transform; these are polar coordinates.

Figure 8. A diagram representing the line in parameter space, eventually to be transformed to the Hough space.

The parameter r represents the distance between the line and the origin, while θ is the angle of the vector from the origin to the closest point, as shown in Fig. 8. Using this parametrization, the equation of the line can be written as

y = (−cos θ / sin θ) x + (r / sin θ)    (8)

which can be rearranged to r = x cos θ + y sin θ. It is possible to associate with each line of the image a pair (r, θ), which is unique if θ ∈ [0, π) and r ∈ R, or if θ ∈ [0, 2π) and r ≥ 0. The (r, θ) plane is sometimes referred to as the Hough space for the set of straight lines in two dimensions. This representation makes the Hough transform conceptually very close to the two-dimensional Radon transform; they can be seen as different ways of looking at the same transform.

Figure 9. A sample linear Hough transformation.

For an arbitrary point on the image plane with coordinates, e.g., (x0, y0), the lines that go through it are the pairs (r, θ) with

r(θ) = x0 cos θ + y0 sin θ    (9)

where r, the distance between the line and the origin, is determined by θ. This corresponds to a sinusoidal curve in the (r, θ) plane, which is unique to that point. If the curves corresponding to two points are superimposed, the location (in the Hough space) where they cross corresponds to a line (in the original image space) that passes through both points. More generally, a set of points that form a straight line will produce sinusoids which cross at the parameters of that line. Thus, the problem of detecting collinear points can be converted into the problem of finding concurrent curves.
In Fig. 9 we have three data points, shown as black dots. For each data point, a number of lines are plotted through it, all at different angles; these are shown as solid lines. For each solid line, a line is plotted which is perpendicular to it and which intersects the origin; these are shown as dashed lines. The length (i.e. the perpendicular distance to the origin) and the angle of each dashed line are measured, and the corresponding results are shown in the table. This is repeated for each data point.

4.3. Circular Hough Transform

Circles are a common geometric shape of interest in computer vision applications. The Hough transform can be used to determine the parameters of a circle when a number of points that fall on the perimeter are known. A circle with radius R and centre (a, b) can be described by the parametric equations

x = a + R cos(θ)    (10)
y = b + R sin(θ)    (11)

When the angle θ sweeps through the full 360-degree range, the points (x, y) trace the perimeter of the circle. If an image contains many points, some of which fall on the perimeters of circles, then the job of the search program is to find the parameter triplets (a, b, R) describing each circle. The fact that the parameter space is three-dimensional makes a direct implementation of the Hough technique more expensive in computer memory and time. If the circles in an image have a known radius R, then the search can be reduced to 2D: the objective is to find the (a, b) coordinates of the centres. The locus of (a, b) in the parameter space falls on a circle and can be found with a Hough accumulation array. This particular transformation is demonstrated in Fig. 10.
Figure 10. An example of the circular Hough transformation for a circle with fixed radius.
Multiple circles with the same radius can be found with the same technique. Overlapping circles can cause spurious centres to be found as well; the spurious circles can be removed by matching against circles in the original image.
If the radius is not known, then the locus of points in parameter space falls on the surface of a cone, as shown in Fig. 11. Each point (x, y) on the perimeter of a circle produces a cone surface in parameter space, and the triplet (a, b, R) corresponds to the accumulation cell where the largest number of cone surfaces intersect. A circle with a different radius is constructed at each level r.

Figure 11. An example of the circular Hough transformation for circles with variable radius.

The search for circles of unknown radius can be conducted using a three-dimensional accumulation matrix. This three-dimensional accumulator array can grow large quite fast; its size depends on the number of different radii and especially on the image size. The computational cost of calculating all circles for each edge point increases with the number of edge points, which is usually a function of image size. The overall computation time of the circular Hough transform can therefore quickly become infeasible for large images with many edge points. While drawing a circle, one problem that arises is the selection of the discrete values for the resolution of θ. One solution is to use a high resolution of θ and then round the values off, but this is likely to result in massive overdraw or a lack of pixels if the radius is large. The rounding of sin θ and cos θ should be carried out after the values have been multiplied by the radius. It is desirable to be able to find circles from the accumulator data. If no a priori knowledge about the number of circles and their radii is available, this process can be quite challenging. One approach is to find the highest peaks in each (a, b) plane, corresponding to a particular radius, in the accumulator data. If the height of a peak is comparable to the number of edge points for a circle with that radius, its coordinates probably correspond to the centre of such a circle. The centre of a circle can, however, also be represented by a peak with a height less than the number of edge points, for instance if the circle is incomplete or ellipse-shaped. If it is difficult to locate exact peaks, the accumulator data can be smoothed.
4.4. Conformal Mapping
Fitting a circle through a number of measured points, obtained e.g. from a drift chamber, generally requires an iterative fit due to the nonlinearity of the problem. Since iterative methods are too time consuming in most applications, one generally uses approximations which permit the linearization of the problem, yielding a set of linear equations which can be solved in one single iteration. One standard method of track recognition is the method of conformal mapping. In the normal version of that method one transforms the circle equation
(x − a)² + (y − b)² = R²    (12)

into a straight line with the prescription

u = x / (x² + y²),  v = y / (x² + y²).    (13)

If one fixes R by imposing the relation

R² = a² + b²    (14)

one obtains straight lines of the form

v = 1/(2b) − (a/b) u.    (15)
These straight lines can then be used for pattern recognition by establishing search roads following the points, to find out which points belong to a given straight line. A fit to such a straight line in (u, v) space then yields the coordinates of the centre of the circle, a and b, and together with the rest of the equations the radius R is also determined. The values thus obtained are, however, only approximate, since the constraint makes the circle pass through the origin, and the important third parameter determining a track, the impact parameter, is lost. The situation can be remedied by allowing for a small difference between R² and a² + b², which we will call ε:

ε = R² − a² − b².    (16)
For ε much smaller than R² we then have, instead of a straight line, a parabola with a very small curvature:

v = (1 − ε/(4b²)) · 1/(2b) − (a/b)(1 − ε/(2b²)) u − (ε/(2b))(1 + a²/b²) u²    (17)

Here terms of order ε² and higher have been neglected. Furthermore, we usually set

1 − ε/(4b²) ≈ 1    (18)

in the above equation. The equation for the parabola then becomes

v = 1/(2b) − (a/b) u − (ε/(2b))(1 + a²/b²) u²    (19)

where δ = R − √(a² + b²), with ε ≈ 2Rδ, is the impact parameter.

5. Appendix B
5.1. ALICE TPC
5.2. Architecture
As shown in Fig. 12, the ALICE Time Projection Chamber (TPC) is the main device in the ALICE central barrel for the tracking of charged particles and for particle identification. The ALICE TPC is designed to cope with the highest conceivable charged-particle multiplicities predicted, at the time of the Technical Proposal (TP), for central Pb-Pb collisions at LHC energy, i.e. rapidity densities approaching dNch/dy = 8000 at a centre-of-mass energy of 5.5 TeV. Its acceptance covers 2π in azimuthal angle and a pseudo-rapidity interval |η| < 0.9. Including secondaries, the above charged-particle density could amount to 20000 tracks in one interaction in the TPC acceptance.

Figure 12. The ALICE Time Projection Chamber.

5.3. Principle of Operation

The ALICE TPC is an 88 m³ cylinder filled with gas and divided into two drift regions by the electrode located at its axial centre. The active volume is contained in a cylinder with inner and outer radii of 84.5 and 246.6 cm respectively and a length of 500 cm along the beam axis. The field cage secures the uniform electric field along the z-axis. Charged particles traversing the TPC volume ionise the gas along their path, liberating electrons that drift towards the end plates of the cylinder. The necessary signal amplification is provided through the avalanche effect in the vicinity of the anode wires strung in the readout chambers. Moving from the anode wire towards the surrounding electrodes, the positive ions created in the avalanche induce a positive current signal on the pad plane. This current signal, which is characterised by a fast rise time (less than 1 ns) and a long tail with a rather complex shape, carries a charge that, for a minimum ionising particle, is about 4.8 fC. The readout of the signal is done by the 570132 pads that form the cathode plane of the multi-wire proportional chambers located at the TPC end plates.
5.4. Field Cage
The design of the field cage of the ALICE TPC is based on a novel construction principle to adapt the detector to the specific running conditions with heavy-ion collisions at the LHC. The expected high particle densities make it necessary that the field cage keeps instrumental (systematic) errors at a minimum in order not to impair the sensitive pattern recognition and the resolution capabilities of the detector as a whole. Although a classical cylindrical geometry, optimal for colliding-beam experiments, was chosen, the other features of the device differ largely from any other field cage.
The purpose of the field cage is to define a uniform electrostatic field in the gas volume in order to transport ionization electrons from their point of creation to the readout chambers on the endplates without significant distortions. The field cage provides a stable mechanical structure for the precise positioning of the chambers and other detector elements, while being as thin as possible in terms of radiation lengths presented to the tracks entering the TPC. In addition, the walls of the field cage provide a gas-tight envelope and ensure appropriate electrical isolation of the field cage from the rest of the experiment.
Separated by the central high-voltage electrode, the field cage has two detection volumes with inner/outer diameters of 1.25/5 m and a drift length of 2.5 m each. The total sensitive detector volume is 88 m³, filled with a gas mixture of Ne-CO2 (90:10). With a drift field of 400 V/cm, this gas represents the optimum in terms of charge transport (velocity and diffusion), signal amplification (over 10⁴) and transparency for traversing particles. Hence, the field cage has to sustain a maximum potential of 100 kV at its central electrode. Consequently, and in line with the requirements of ALICE, the field-cage vessels are built from light material, yet with sufficient mechanical rigidity for such a large apparatus. A composite honeycomb sandwich structure was thus chosen for its favourable stability-to-mass ratio. Materials of aerospace quality have been used, such as aramide-based honeycomb cores (Nomex) and foils of Tedlar. A principal element of the design philosophy is to contain the actual field-cage volume within a protective CO2 gas envelope provided by two additional cylinders called the inner and outer containment vessels; this allows a substantial reduction in the material traversed by particles. Another unique feature of the field cage is its internal potential-defining system, designed to provide a highly uniform electric field with radial distortions of no more than one part in 10⁴. The entire potential strip network is suspended from 18 support rods mounted equidistantly over 360 degrees, 31 mm away from the cylinder walls.
5.5. Readout Chambers
The ALICE TPC readout chambers were specially designed to cope with the high track density expected
in heavy-ion collisions at the LHC. The pad size (granularity) of the inner chambers, i.e. those closest to the
beam-beam interaction diamond, has been minimised (7 x 4.5 mm^2) to the point that a signal induced in
the pads after amplification at the anode proportional wires is just visible above the electronic noise (S/N
> 20). Large-scale TPCs have been employed and proven to work in collider experiments before, but
none of them had to cope with the particle densities and rates anticipated for the ALICE experiment. For
the design of the Read-Out Chambers (ROCs), this leads to requirements that go beyond an
optimization in terms of momentum and dE/dx resolution. In particular, the optimization of rate capability
in a high-track-density environment has been the key input for the design considerations. The azimuthal
segmentation of the readout plane is common with the subsequent ALICE detectors TRD and TOF, i.e.
18 trapezoidal sectors, each covering 20 degrees in azimuth. The radial dependence of the track density
leads to different requirements for the readout-chamber design as a function of radius. Consequently,
there are two different types of readout chambers, leading to a radial segmentation of the readout plane
into Inner and Outer Chambers (IROC and OROC, respectively). In addition, this segmentation eases the
assembly and handling of the chambers compared to a single large one covering the full radial extension
of the TPC.
The ALICE-TPC readout chambers employ a commonly used scheme of wire planes, i.e. a grid of anode
wires above the pad plane, a cathode-wire grid, and a gating grid. All wires run in the azimuthal direction.
Since the design constraints are different for the inner and outer chambers, their wire geometry is
different. The gap between the anode-wire grid and the pad plane is 3 mm for the outer chambers, and
only 2 mm for the inner chambers. The same is true for the distance between the anode-wire grid and the
cathode-wire grid. The gating grid is located 3 mm above the cathode-wire grid in both types of
chamber. The anode-wire grid and the gating grid are staggered with respect to the cathode-wire grid.
The chambers are attached to the endplate from the inside to minimize the dead space between
neighboring chambers. This required a special mounting technique, by which the chambers are attached
to a long manipulator arm that allows the rotation and tilting of the chambers.

5.6. Front End electronics


The readout of the signal is done by the 570132 pads that form the cathode plane of conventional multi-wire proportional chambers located at the TPC end caps. The signals from the pads are passed on to
4356 front-end cards located some 10 cm away from the pad plane. In the front-end card a custom-made charge-sensitive shaping amplifier transforms the charge induced in the pads into a differential
semi-Gaussian signal that is fed to the input of the ALTRO chip.
Each ALTRO chip contains 16 channels that digitise and process the input signals. Upon arrival of a first-level
trigger, the data stream is stored in a memory. When the second trigger (accept or reject) is
received, the latest event data stream is either frozen in the data memory, until its complete readout
takes place, or discarded. The readout takes place, at a speed of up to 300 MB/s, through a 40-bit-wide
backplane bus linking the front-end cards to the readout controller unit. The ALTRO chip is a mixed-signal
custom integrated circuit designed to be one of the building blocks of the front-end electronics for
the ALICE TPC. In one single chip, the analogue signals from 16 channels are
digitised, processed, compressed and stored in a memory ready for readout. The analogue-to-digital
converters embedded in the chip have a 10-bit dynamic range and a maximum sampling rate of 40 MHz.

6. Appendix C
In this parallel-programming implementation, we have made use of CUDA (Compute Unified Device
Architecture), a parallel computing architecture developed by Nvidia for graphics processing. Other
parallel languages like OpenCL can also be used for the same purpose. For this particular implementation, we
have made use of CUDA C, an extension of C on CUDA, which makes its structure and syntax similar to
C. It is compiled through a PathScale Open64 C compiler to code algorithms for execution on the GPU.
6.1. CUDA
CUDA gives access to the virtual instruction set and memory of the parallel computational elements in
CUDA GPUs. Using CUDA, the latest Nvidia GPUs become accessible for general-purpose computations, like
CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many
concurrent threads slowly, rather than executing a single thread very quickly. This approach of solving
general-purpose problems on GPUs is known as GPGPU. CUDA works with all Nvidia GPUs from the
G8x series onwards, including the GeForce, Quadro and Tesla lines. CUDA is compatible with most standard
operating systems. Nvidia states that programs developed for the G8x series will also work without
modification on all future Nvidia video cards, due to binary compatibility. In our case, we are using a
GeForce GTX 480 Nvidia card, which has a compute capability of 2.0. Hence, it supports double-precision
floating-point calculations, which are not supported in graphics cards with compute
capability less than 1.3.
6.2. Parallel-Programming Model
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the
GPU is specialized for compute-intensive, highly parallel computation and therefore designed such that
more transistors are devoted to data processing rather than data caching and flow control. More
specifically, the GPU is especially well suited to address problems that can be expressed as data-parallel
computations. Because the same program is executed for each data element, there is a lower
requirement for sophisticated flow control, and because it is executed on many data elements with
high arithmetic intensity, memory-access latency can be hidden with calculations instead of big data
caches. Data-parallel processing maps data elements to parallel processing
threads. This property can be exploited for applications that process large data sets, as in the case of
implementing the Hough transform for track recognition, Monte Carlo simulations and many more. For
these applications a data-parallel programming model can be used to speed up the computations.

Figure 13. Architecture of the NVIDIA graphics card according to the CUDA programming model

6.3. Architecture
The architecture of modern NVIDIA cards is designed so as to maximize parallelization, not just for
graphics and texture programming but also for general computations and simulation. They are built
around an array of streaming multiprocessors. These multiprocessors are multithreaded, focused on a
high floating-point operation throughput instead of single-thread performance. Each of these
multiprocessors consists of eight scalar processor cores, a special function unit which can execute
complex functions like sine or inverse square root, a double-precision unit, a large register file, an
instruction decoder and a block of shared memory. Threads are distributed onto the
multiprocessors in groups called blocks. These are also the unit of granularity at which synchronization
statements in the code are executed. The blocks are further subdivided into warps, each warp
containing 32 threads. At each cycle a multiprocessor takes a warp that is ready for execution and
runs it in a single-instruction, multiple-threads (SIMT) fashion. The shader clock, which is the clock for
the scalar processors, is faster than the core clock of the device. Therefore a scalar processor can
process more than one thread in one clock cycle of the multiprocessor. The exact architecture of the
Nvidia graphics card 8x series is shown in Fig. 13.
6.4. SIMT versus SIMD model
The SIMT model is similar to the single-instruction, multiple-data (SIMD) execution model on the CPU. Both
use one instruction decoder to feed multiple arithmetic logic units, which means all arithmetic logic units
will always execute the same operation. In the SIMD execution model one thread, meaning one
instruction counter and register set, is used for all arithmetic logic units. The SIMT model, however,
gives each logic unit a thread of its own. This allows the executed code to be independent of
the actual physical arithmetic-logic-unit count, while in SIMD it needs to be aware of the width of the vector
registers used to feed them. Also, in the SIMT model a logic unit can skip execution of an instruction
based on special predicate registers or if its instruction counter does not match the address of the
currently executed instruction. In the SIMD model, masking of operands is required to achieve a similar
effect.
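In practice this means a kernel may simply contain a data-dependent branch and the hardware predicates the diverging threads of a warp automatically. A minimal sketch (the kernel name and data layout are our own, not from the paper's code):

```cuda
// Hypothetical kernel illustrating SIMT divergence: threads of the same
// warp may take different paths through the if-statement; the hardware
// predicates each path, with no explicit operand masking as in CPU SIMD.
__global__ void clampNegative(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index
    if (i < n) {
        if (data[i] < 0.0f)   // divergent branch: only some threads of a
            data[i] = 0.0f;   // warp execute the store, the rest are
                              // simply predicated off for this cycle
    }
}
```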

Figure 14. The internal arrangement of threads and blocks in a grid launched by the kernel

6.5. CUDA programming language


Parallelization of a CPU executable, i.e. a serially processed program, is done at the thread level. Data-parallel
applications can be written in the SIMT model by using the extension of C/C++ on CUDA. From
the programming point of view, the execution of threads happens in a grid of blocks. All threads in a grid
execute the same function. The dimensions of this grid can be specified by the programmer. Due to the
SIMT execution model the threads can take different code paths within the function. Within a block,
threads can cooperate using shared memory and synchronize at barriers given by the __syncthreads()
intrinsic function. There is no synchronization between blocks. If multiple grids are launched, they are
executed sequentially. Therefore synchronization between blocks can be done by splitting the
application into two functions launched in separate grids. The decision of where a method is to be
executed is specified through the additional function qualifiers defined in CUDA. Execution on the host
is selected either by default or by explicitly specifying the __host__ qualifier. The equivalent for functions to be
invoked from and executed on the device is called __device__. The __host__ and __device__ specifiers can be
combined to make a function usable on both platforms.
Methods to be executed on the device but invoked from the host are called kernels and are qualified by
__global__. There is no way to invoke a function on the host from the device. The __device__ qualifier on a
variable defines it to be located in device memory, which by default is global memory. Function
invocations are limited by the fact that the GPU does not have a hardware stack. Similar qualifiers are
present to define the other levels of memory, e.g. __constant__ allows to declare variables located in
constant memory and __shared__ allows to declare variables located in shared memory. CUDA does not offer
mechanisms to declare what memory pointers point to; the locations of variables used as function
parameters are automatically determined by the compiler.
Each thread has access to the special variables threadIdx, blockDim and blockIdx. These represent the
thread's position in the grid. Global memory on the device can be dynamically allocated from the CPU using
the cudaMalloc() intrinsic function. Pointers to such memory regions can be handed to the GPU code as kernel
parameters as well. Transfers of data to the GPU and back need to be performed using the special memory
function cudaMemcpy() defined in the CUDA library. Generally, this is the function whose execution leads to
a significant rise in the overall execution time, because the bus connecting host and device memory is
much slower than on-device memory access, leading to latency in the transfer of data from host to device.
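The complete host-device round trip described above can be sketched as follows; the kernel and variable names are illustrative only, not taken from our implementation:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each thread squares one array element.
__global__ void squareKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        d_data[i] = d_data[i] * d_data[i];
}

int main(void)
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Allocate device (global) memory and copy the input to it.
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid of blocks with 256 threads per block.
    squareKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Copy the result back; for small workloads these two cudaMemcpy()
    // transfers dominate the total execution time.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("%.0f %.0f\n", h_data[2], h_data[31]);
    return 0;
}
```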
CUDA defines several compute capabilities that specify the features supported by the hardware.
These capabilities have the format of versions, each capability including the features of the lower
compute capabilities. Among other features, compute capabilities 1.1 and 1.2 add atomic functions on
global and shared memory.

6.6. Atomic Functions


Many applications involve a read-modify-write operation on shared values. When moving
from a single-threaded to a multi-threaded version of such an application, we suddenly have the potential for
unpredictable results if multiple threads need to read or write shared values. If the threads get
scheduled unfavourably, then the end result might be wrong. So the read-modify-write operation must
be performed by a thread without being interrupted by another. Because the execution of these
operations cannot be broken into smaller parts by other threads, we call operations that satisfy this
constraint atomic operations. An atomic function performs a read-modify-write atomic operation on
one 32-bit or 64-bit word residing in global or shared memory.
CUDA C supports several atomic operations that allow safe operation on memory, even when a large
number of threads are potentially competing for access. For example, atomicAdd() reads a 32-bit word
at some address in global or shared memory, adds a number to it, and writes the result back to the
same address. The operation is atomic in the sense that it is guaranteed to be performed without
interference from other threads, i.e. no other thread can access this address until the operation is
complete.
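In the context of this work, atomicAdd() is exactly what a parallel Hough transform needs to fill its accumulator safely, since many hits may vote for the same bin at once. A minimal sketch (the kernel and array names are hypothetical, not taken from our implementation):

```cuda
// Sketch: each thread processes one hit and increments the accumulator
// bin it votes for. atomicAdd() serialises the read-modify-write on each
// bin, so concurrent votes for the same bin are never lost.
__global__ void fillAccumulator(const int *binIndex, int nHits,
                                unsigned int *accumulator)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nHits)
        atomicAdd(&accumulator[binIndex[i]], 1u);
}
```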
Atomic functions can only be used in device functions and are only available for devices of compute
capability 1.1 and above.
Atomic functions operating on shared memory and atomic functions operating on 64-bit words are only
available for devices of compute capability 1.2 and above.
Atomic functions operating on 64-bit words in shared memory are only available for devices of compute
capability 2.x and higher.
Atomic functions operating on mapped page-locked memory are not atomic from the point of view of the
host or other devices. Since in our case we are using a GeForce GTX 480, which has a compute
capability of 2.0, we have all these facilities at our disposal. Some examples of atomic functions are
atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec() and many
more. Atomic operations only work with signed and unsigned integers, with the exception of atomicAdd()
for devices of compute capability 2.x and atomicExch() for all devices, which also work with single-precision
floating-point numbers. However, it is interesting to note that any atomic operation can be
implemented based on atomicCAS() (Compare And Swap).
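A sketch of this technique is the well-known pattern from the CUDA C Programming Guide for building a double-precision atomicAdd() out of atomicCAS(), needed on devices that lack a native one:

```cuda
// Double-precision atomicAdd() built from 64-bit atomicCAS(): read the
// word, compute the new value, and attempt to swap it in; if another
// thread modified the word in the meantime, the swap fails and we retry.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits as double, add, and try to install the sum.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);  // integer compare avoids a hang on NaN
    return __longlong_as_double(old);
}
```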
References
[1] ALICE Collaboration 2000 TDR of Time Projection Chamber (CERN/LHCC/2000-001).
[2] C Cheshkov 2006 Nuclear Instruments and Methods in Physics Research A 566 35-39.
