
[Figure 3: Shared-nothing partitioning for 10^10 tuples (32-bit key and 32-bit payload on separate arrays). Throughput in billion tuples / second and GB / second versus the number of partitions (2–8192), for the non-in-place/in-place and out-of-cache/in-cache variants.]

[Figure 6: Shared-nothing partitioning for 10^10 tuples (64-bit key and 64-bit payload on separate arrays). Same variants and axes as Figure 3.]

In Figure 3 we show the performance of the four variants of partitioning using 32-bit keys and 32-bit payloads, run in a shared-nothing fashion. The in-cache variants are bound by the TLB capacity and thus have poor performance at large fanout; when running in-cache, the optimal cases are 5–6 bits (32–64 partitions). The out-of-cache variants increase the optimal fanout to 10–12 bits when non-in-place, and 9–10 bits when in-place. Using out-of-cache variants for small cache-resident array sizes incurs unnecessary overheads. The optimal fanout is the one with the highest ratio of performance to partitioning bits (= log P, for P-way partitioning).
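
For concreteness, here is a simplified scalar sketch of one non-in-place radix partitioning pass over separate key and payload arrays (as in Figure 3); the function name is ours and the sketch omits the cache- and TLB-conscious optimizations that distinguish the measured variants:

    #include <stdint.h>
    #include <stdlib.h>

    /* One non-in-place radix partitioning pass into P = 2^bits
     * partitions on the low key bits. Illustrative only. */
    static void radix_partition(const uint32_t *keys, const uint32_t *payloads,
                                uint32_t *out_keys, uint32_t *out_payloads,
                                size_t n, int bits)
    {
        size_t P = (size_t)1 << bits, mask = P - 1;
        size_t *offset = calloc(P, sizeof(size_t));

        /* 1. Histogram: count tuples per partition. */
        for (size_t i = 0; i < n; i++)
            offset[keys[i] & mask]++;

        /* 2. Exclusive prefix sum: first output slot of each partition. */
        size_t sum = 0;
        for (size_t p = 0; p < P; p++) {
            size_t c = offset[p];
            offset[p] = sum;
            sum += c;
        }

        /* 3. Scatter keys and payloads (kept on separate arrays). */
        for (size_t i = 0; i < n; i++) {
            size_t dst = offset[keys[i] & mask]++;
            out_keys[dst] = keys[i];
            out_payloads[dst] = payloads[i];
        }
        free(offset);
    }
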
In Figure 5 we show the performance of histogram generation for all variants of partitioning. Radix and hash partitioning operate roughly at the memory bandwidth. Range partitioning using the configured range function index (see Section 3.5.2) improves 4.95–5.8X compared to binary search.
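
As a reference for the binary-search baseline ("range (bs)" in Figure 5), a minimal sketch of scalar range histogram generation over P-1 sorted splitters; the names are ours, and the range function index of Section 3.5.2 replaces exactly this per-key lookup:

    #include <stdint.h>
    #include <stddef.h>

    /* Count keys per range partition by binary search: partition p
     * holds keys below splitters[p] and at or above splitters[p-1]. */
    static void range_histogram(const uint32_t *keys, size_t n,
                                const uint32_t *splitters, size_t P,
                                size_t *hist)
    {
        for (size_t i = 0; i < n; i++) {
            size_t lo = 0, hi = P - 1;        /* partitions 0..P-1 */
            while (lo < hi) {                 /* first splitter > key */
                size_t mid = (lo + hi) / 2;
                if (keys[i] >= splitters[mid]) lo = mid + 1;
                else hi = mid;
            }
            hist[lo]++;
        }
    }
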
Figure 6 shows the performance of partitioning for 64-bit keys and 64-bit payloads. Compared to the 32-bit case, partitioning is actually slightly faster (in GB/s), since RAM accesses and computation overlap more effectively.

[Figure 4: Partitioning as in Figure 3 using uniform and skewed data under the Zipf distribution (θ = 1.2). Throughput in billion tuples / second and GB / second versus the number of partitions (2–8192).]

[Figure 7: Shared-nothing out-of-cache 1024-way partitioning for 10^9 tuples (64-bit key, 64-bit payload). Throughput in billion tuples / second and GB / second versus threads / CPU (SMT), for the non-in-place and in-place variants on 1 and 4 CPUs.]

In Figure 4, we repeat the experiment of Figure 3, including runs with data that follow the Zipf distribution with θ = 1.2. Under skew, some partitions are accessed more often than others; implicit caching of these partitions decreases the probability of cache misses and TLB thrashing, improving performance. With less skew (θ < 1), we found no significant difference in partitioning performance.
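
To make the skewed workload concrete, here is one standard way to draw item ids from a Zipf distribution via inverse-CDF sampling; this sketch is illustrative and not necessarily the generator used for these experiments:

    #include <stdlib.h>
    #include <math.h>

    /* Zipf(theta) over ids 0..n-1: p(i) proportional to 1/(i+1)^theta.
     * Precompute the CDF once, then sample by binary search. */
    typedef struct { double *cdf; int n; } zipf_t;

    static zipf_t zipf_init(int n, double theta)
    {
        zipf_t z = { malloc(n * sizeof(double)), n };
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 1.0 / pow(i + 1, theta);
            z.cdf[i] = sum;
        }
        for (int i = 0; i < n; i++)
            z.cdf[i] /= sum;              /* normalize to a CDF */
        return z;
    }

    static int zipf_next(const zipf_t *z)
    {
        double u = (double)rand() / RAND_MAX;
        int lo = 0, hi = z->n - 1;
        while (lo < hi) {                 /* first index with cdf >= u */
            int mid = (lo + hi) / 2;
            if (z->cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
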
Figure 7 shows the scalability of the out-of-cache partitioning variants with a 1024-way fanout. The in-place variant gets a noticeable benefit from SMT compared to the non-in-place variant.

Figure 8 shows the performance of 64-bit histogram generation. Radix and hash histogram generation still run at the memory bandwidth. Using the range function index speeds up the process 3.17–3.4X over scalar code, despite the limitation of only 2-way 64-bit operations. The speed-up decreases because scalar binary search doubles its performance (in GB/s): it inspects roughly the same number of keys per second, but each key is now twice as wide, fully shadowing the RAM accesses.
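
The 2-way limitation can be made concrete with a hypothetical single comparison step of a SIMD range index (not the paper's actual index layout): a 128-bit SSE register holds four sorted 32-bit splitters but only two sorted 64-bit splitters, halving the comparisons per instruction.

    #include <stdint.h>
    #include <nmmintrin.h>   /* SSE4.2 intrinsics */

    /* Both functions return how many splitters the (signed) key
     * exceeds, i.e. the child to descend into at this node. */
    static int node4_32(__m128i splitters, int32_t key)
    {
        __m128i gt = _mm_cmpgt_epi32(_mm_set1_epi32(key), splitters);   /* 4 lanes */
        return _mm_popcnt_u32(_mm_movemask_ps(_mm_castsi128_ps(gt)));
    }

    static int node2_64(__m128i splitters, int64_t key)
    {
        __m128i gt = _mm_cmpgt_epi64(_mm_set1_epi64x(key), splitters);  /* 2 lanes */
        return _mm_popcnt_u32(_mm_movemask_pd(_mm_castsi128_pd(gt)));
    }
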

[Figure 5: Histogram generation for 10^10 32-bit keys. Throughput in billion keys / second and GB / second versus the number of partitions (128–2048), for range (index), range (bs), radix, and hash.]

[Figure 8: Histogram generation for 10^10 64-bit keys. Same variants and axes as Figure 5.]
