[Figure: microbenchmark execution-time plots — panels (a) Increment, (b) Lock/Unlock, (c) Mutex Chaining; series Q1, Q2, Q4, and P (pthreads).]
[Figure 2: Microbenchmarks on a 48-node Altix — panels (a) Increment, (b) Lock/Unlock, (c) Mutex Chaining, (d) Spinlock, (e) Thread Creation, (f) Concurrent Threads; each panel plots execution time for shepherd configurations Q1–Q128 and pthreads (P).]

and 2(c). This is a detail that would likely not be true on a system that had hardware support for FEBs, and would be significantly improved with a better centralized data structure, such as a lock-free hash table. Because of the qthread library's simple scheduler, it outperforms pthreads when using spinlocks and a low number of shepherds, as illustrated in Figure 1(d). The impact of the scheduler is demonstrated by larger numbers of shepherds (Figure 2(d)).

The pthread library was incapable of more than several hundred concurrent threads—requesting too many threads deadlocked the kernel (Figures 1(f) and 2(f)). A benchmark was designed that worked within pthreads' limitations by allowing a maximum of 400 concurrent threads. Threads are spawned in blocks of 200, and after each block, threads are joined until there are only 200 outstanding before spawning a new block of 200 threads. In this benchmark, Figures 1(e) and 2(e), pthreads performs more closely to qthreads—on the PowerPC system, it is only a factor of two more expensive.

5. Application development

Development of software that realistically takes advantage of lightweight threading is important to research, but difficult to achieve due to the lack of lightweight threading interfaces. To evaluate the performance potential of the API and how difficult it is to integrate into existing code, two applications were used. First, a parallel quicksort algorithm was analyzed and modified to fit the qthread model. Secondly, a small parallel benchmark was modified to use qthreads.

5.1. Quicksort

…late, such as the hashed memory design found in the Cray MTA-2. Memory addresses in the MTA-2 are distributed throughout the machine at word boundaries. When dividing work amongst several threads on the MTA-2, the boundaries of the work regions can be fine-grained without significant loss of performance. Conventional processors, on the other hand, assume that memory within page boundaries is all contiguous. Thus, conventional cache designs reward programs that allow an entire page to reside in a single processor's cache, and limit the degree to which tasks can be divided among multiple processors without paying a heavy cache coherency penalty.

An example wherein the granularity of data distribution can be crucial to performance is a parallel quicksort algorithm. In any quicksort algorithm, there are two phases: first the array is partitioned into two segments around a "pivot" point, and then both segments are sorted independently. Sorting the segments independently is relatively easy, but partitioning the array in parallel is more complex. On the MTA-2, elements of the array to be partitioned can be divided up among each thread without regard to the location of the elements. On conventional processors, however, that behavior is very likely to result in multiple processors transferring the same cache line or memory page between processors. Constantly sending the same memory back and forth between processors prevents the parallel algorithm from exploiting the capabilities of multiple processors.

The qthread library includes an implementation of the quicksort algorithm that avoids contention problems by ensuring that work chunks are always at least the size of a page. This avoids cache-line competition between processors while still exploiting the parallel computational power of all available processors on sufficiently large arrays. Figure 3 illustrates the scalability of the qthread-based quicksort implementation, and compares its performance to the libc qsort() function. This benchmark sorts an array of one billion double-precision floating-point numbers on a 48-node SGI Altix SMP with 1.5 GHz Itanium processors.
[Figure 3: execution-time plot]

…ferent from the qthread implementation and does not use shepherds. The qthread and MPI implementations scale ap…