
The Computer Journal Advance Access published June 28, 2013

© The British Computer Society 2013. All rights reserved.


For Permissions, please email: journals.permissions@oup.com
doi:10.1093/comjnl/bxt057

BlackjackBench: Portable Hardware Characterization with Automated Results Analysis

Anthony Danalis1, Piotr Luszczek1, Gabriel Marin2, Jeffrey S. Vetter2 and Jack Dongarra1



1 University of Tennessee, Knoxville, TN, USA
2 Oak Ridge National Laboratory, Oak Ridge, TN, USA
Corresponding author: luszczek@eecs.utk.edu

DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a
compiler automatically characterizes the targeted hardware and optimizes the application codes
accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that
automate system characterization, plus statistical analysis techniques for interpreting the results. The
BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment
rather than the often unattainable peak values. We aim at hardware characteristics that can be
observed by running executables generated by existing compilers from standard C codes. We
characterize the memory hierarchy, including cache sharing and non-uniform memory access
characteristics of the system, properties of the processing cores affecting the instruction execution
speed and the length of the operating system scheduler time slot. We show how these features
of modern multicores can be discovered programmatically. We also show how the features could
potentially interfere with each other resulting in incorrect interpretation of the results, and how
established classification and statistical analysis techniques can reduce experimental noise and aid
automatic interpretation of results. We show how effective hardware metrics from our probes allow
guided tuning of computational kernels that outperform an autotuning library further tuned by the
hardware vendor.

Keywords: micro-benchmarks; hardware characterization; statistical analysis


Received 31 March 2012; revised 25 March 2013
Handling editor: Stephen Jarvis

1. INTRODUCTION

Compilers, autotuners, numerical libraries and other performance-sensitive software need information about the underlying hardware. If portable performance is a goal, automatic detection of hardware characteristics is necessary given the dramatic changes undergone by computer hardware. Several system benchmarks exist in the literature [1-8]. However, as hardware becomes more complex, new features need to be characterized, and assumptions about hardware behavior need to be revised, or completely redesigned.
In this paper, we present BlackjackBench, a system characterization benchmark suite. The contribution of this work is 2-fold:

(1) A collection of portable micro-benchmarks that can probe the hardware and record its behavior while control variables, such as buffer size, are varied.
(2) A statistical analysis methodology, implemented as a collection of scripts for result parsing, that examines the output of the micro-benchmarks and produces the desired system characterization information, e.g. effective speeds and sizes.

BlackjackBench was specifically motivated by the effort to develop architecture aware compiler environments [9] that automatically adapt to hardware, which is unknown to the compiler writer, and optimize application codes based on the discovery of the runtime environment.


Often, important performance-related decisions take into account effective values of hardware features, rather than their peak values. In this context, we consider an effective value to be the value of a hardware feature that would be experienced by a user level application written in C (or any other portable, high-level, standards-compliant language) running on that hardware. This is in contrast with values that can be found in vendor documents, or through assembler benchmarks, or specialized instructions and system calls.
BlackjackBench goes beyond the state of the art in system benchmarking by characterizing features of modern multicore systems, taking into account contemporary complex hardware characteristics such as modern sophisticated cache prefetchers, and the interaction between the cache and translation lookaside buffer (TLB) hierarchies, etc. Furthermore, BlackjackBench combines established classification and statistical analysis techniques with heuristics tailored to specific benchmarks, to reduce experimental noise and aid automatic interpretation of the results. As a consequence, BlackjackBench does not merely output large sets of data that require human intervention and comprehension; it shows information about the hardware characteristics of the tested platform. Moreover, BlackjackBench does not rely on assembler code, specialized kernel modules and libraries, nor non-portable system calls. Therefore, it is a portable system characterization tool.

2. RELATED WORK

Several low-level system benchmarks exist, but most have different target audiences, functionality or assumptions. Some benchmarks, such as those described by Molka et al. [10], aim to analyze the micro-architecture of a specific platform in great detail and thus sacrifice portability and generality. Others [11] sacrifice portability and generality by depending upon specialized software such as PAPI [12]. Autotuning libraries such as ATLAS [13] rely on micro-benchmarking for accurate system characterization for a very specific set of routines that need tuning. These libraries also develop their own characterization techniques [14], most of which we need to subsume in order to target a much broader feature spectrum.
Other benchmarks, such as CacheBench [8], or lmbench [6, 7], are higher level and portable and use similar techniques to those we use, such as pointer chasing, but output large data sets or graphs that need human interpretation instead of answers about the values that characterize the hardware platform.
X-Ray [1, 2] is a micro- and nano-benchmark suite that is close to our work in terms of the scope of the system characteristics. There are, however, a number of features that we chose to discover with our tests that are not addressed by X-Ray. There are also differences in methodology which we mention, where appropriate, throughout this document. One important distinguishing feature is X-Ray's emphasis on code generation as part of the benchmarking activity, while we give more emphasis on analyzing the resulting data.
P-Ray [3] is a micro-benchmark suite whose primary aim is to complement X-Ray by characterizing multicore hardware features such as cache sharing and processor interconnect topology. We extend the P-Ray contribution with new tests and measurement techniques as well as a larger set of tested hardware architectures. In particular, the authors express interest in testing IBM Power and Intel Itanium architectures, which we did in our work.
Servet [5] is yet another suite of benchmarks that attempts to subsume both X-Ray and P-Ray by measuring a similar set of hardware parameters. It adds measurements of interconnect parameters for communication that occurs in a cluster of multicore processors with distributed memory. The methodology and measurement techniques in Servet complement, rather than imitate, those of X-Ray and P-Ray. And they remain in sharp contrast to our work. Unlike Servet, we do not focus on the actual hardware parameters of the tested system. Rather, we seek the observable parameters, whose values are, too often, below vendors' advertised specifications. Servet aims for maximum portability of its constituent tests, as does our work, but we were unable to compare this aspect of our efforts as the authors only presented results from Intel Xeon and Itanium clusters.
In summary, our work differs from existing benchmarks in the methodology used in several micro-benchmarks, the breadth of hardware features it characterizes, the automatic statistical analysis of the results, and the emphasis on effective values and the ability to address modern, sophisticated architectures.
We consider the use of BlackjackBench as a tool for model-based tuning and performance engineering to be related to autotuning-based existing exhaustive search approaches [15, 16], analytical search methodologies [17] and techniques based on machine learning [18].

3. BENCHMARKS

In this section, we describe the operation of our micro-benchmarks and discuss the assumptions about compiler and hardware behavior that make our benchmarks possible. We also present experimental results, from diverse hardware environments, as supporting evidence for the validity of our assumptions.
Our benchmarks regulate control variables, such as buffer size, access pattern, number of threads, etc., in order to observe variations in performance. We assert that, by observing variations in the performance of benchmarks, all hardware characteristics that can significantly affect the performance of applications can be discovered. Conversely, if a hardware characteristic cannot be discovered through performance measurements, it is probably not very important to optimization tools such as compilers, auto-tuners, etc. Our benchmarks rely on assumptions about the behavior of the hardware under


different circumstances, and try to trigger different behaviors by varying the circumstances.
The memory hierarchy in modern architectures is rather complex, with sophisticated hardware prefetchers, victim caches, etc. As a result, several details must be addressed to attain clean results from the benchmarks. Unlike benchmarks [4, 5] that use constant strides when accessing their buffers, we use a technique known as pointer chasing (or pointer chaining) [1, 3, 6]. To achieve this, we use a buffer that holds pointers (uintptr_t) instead of integers, or characters. We initialize the buffer so that each element points to the element that should be accessed next, and then we traverse the buffer in a loop that reads an element and dereferences the value read to access the next element: ptr=(uintptr_t *)*ptr. The benefits of pointer chasing are 3-fold (a sketch of the technique is given after the list below).

(1) The initialization of the buffer is not part of the timed code. Therefore, it does not cause noise in the performance measurement loop, unlike calculations of offsets or random addresses done on the critical path. Furthermore, since we can afford for the initialization to be slow, we can use sophisticated pseudo-random number generators, when random access patterns are desirable.
(2) It eliminates the possibility that a compiler could alter the loop that traverses the buffer, since the addresses used in the loop depend on program data (the pointers stored in the buffer itself).
(3) It eliminates the possibility that the hardware prefetcher(s) can guess the address of the next element, when the access pattern is random, regardless of the sophistication level of the prefetcher.
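As a concrete illustration (a minimal sketch, not the BlackjackBench source itself), the following code builds a randomly permuted pointer-chasing buffer and then traverses it; the unroll factor, buffer sizing and timing harness are omitted, and the function names are ours.

    #include <stdint.h>
    #include <stdlib.h>

    /* Initialize buf[0..n-1] so that each element holds the address of the
       next element to visit, following one random cycle (untimed code). */
    static void init_chase(uintptr_t *buf, size_t n)
    {
        size_t i, j, tmp, *perm = malloc(n * sizeof(size_t));
        for (i = 0; i < n; i++) perm[i] = i;
        for (i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
            j = rand() % (i + 1);
            tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        for (i = 0; i < n; i++)                  /* link perm[i] -> perm[i+1] */
            buf[perm[i]] = (uintptr_t)&buf[perm[(i + 1) % n]];
        free(perm);
    }

    /* Timed traversal: every load produces the address of the next load,
       so neither the compiler nor a prefetcher can predict the stream. */
    static uintptr_t chase(uintptr_t *buf, long accesses)
    {
        uintptr_t *ptr = buf;
        while (accesses-- > 0)
            ptr = (uintptr_t *)*ptr;             /* follow the chain */
        return (uintptr_t)ptr;                   /* keep the result live */
    }

Returning the final pointer value keeps the chain observably live, so the compiler cannot remove the loop even at high optimization levels.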
To minimize the loop overhead, the pointer-chasing loop is unrolled over 100 times. Finally, to minimize random noise in our measurements, we repeat each measurement several times (30). But at the same time, we strive to keep the total time to run the tests in a matter of minutes. By doing so, we allow for a complete set of runs to be performed every time the tested system undergoes a significant change, such as a change in a BIOS setting that could potentially affect the observable system characteristics.
The sample results we provide were selected to emphasize the portability of our implementation. More importantly, however, we showcase the experiments that feature more challenging data patterns and could be misinterpreted if the data analysis phase was not done carefully.

3.1. Cache hierarchy

Improved cache utilization is one of the most performance-critical optimizations in modern computer hardware. Processor speed has been increasing faster than memory speed for several decades, making it increasingly harder for the main memory to feed all processing elements with data quickly enough. To bridge the gap, fast, albeit small, cache memory has become necessary for fast program execution. In recent years, the pressure on the main memory has increased further, as the number of processing elements per socket has been going up [19]. As a result, most modern processor designs incorporate complex, multilevel cache hierarchies that include both shared and non-shared cache levels between the processing elements. In this section, we describe the operation of the micro-benchmarks that detect the cache hierarchy characteristics, as well as any eventual asymmetries in the memory hierarchy. The cache benchmarks are the most similar to lmbench [6].

3.1.1. Cache line size
The assumption that enables this benchmark is that upon a cache miss the cache fetches a whole line¹ (A1). As a result, two consecutive accesses to memory addresses that are mapped to the same cache line will result in, at most, one cache miss. In contrast, two accesses to memory addresses that are farther apart than the size of a cache line can result in, at most, two cache misses.
To utilize this observation, our benchmark allocates a buffer that is much larger than the cache, and performs memory accesses in pairs. The buffer is aligned to 512 bytes, which is a size assumed to be safely larger than the cache line size. Each access is to an element of size equal to the size of a pointer (uintptr_t). Each pair of accesses, in a sense, touches the first and the last elements of a memory segment with extent D at a random buffer location. To achieve this, the pairs are chosen such that every odd access, starting with the first one, is at a random memory address within the buffer and every even access is D - sizeof(uintptr_t) bytes away from the previous one. The random nature of the access pattern and the large size of the buffer guarantee, statistically, that the vast majority of the odd accesses will result in cache misses. However, the even accesses can result in either cache hits or misses depending on the extent D. If D is smaller than the cache line size, each even access will be in the same line as the previous odd access, and this will result in a cache hit. If D is larger than the cache line size, the two addresses will map to different cache lines and both accesses will result in cache misses.
Clearly, an access that results in a cache hit is served at the latency of the cache, while a cache miss is served at the latency of further away caches, or the RAM, which leads to significantly higher latency. Our benchmark forces this behavior by varying the value of D, expecting a significant increase in average access latency when D becomes larger than the cache line size. A sample run of this benchmark, on a Core 2 Duo processor, can be seen in Fig. 1.

Footnote 1: On some architectures the cache fetches two consecutive lines upon a miss. Since we are interested in effective sizes, for such architectures we report the cache line to be twice the size of the vendor-documented value.


FIGURE 1. Cache line size characterization on Core 2 Duo. (Access latency in ns versus access pair extent D, for an Intel Core 2 Duo 2.8 GHz.)

FIGURE 2. Cache count, size and latency characterization on Intel Atom. (Average access latency versus buffer size in KiB.)

3.1.2. Cache size and latency
Using the cache line size information, we can detect the number of cache levels, as well as their sizes and access latencies. The enabling assumption is that performing multiple accesses to a buffer that resides in the cache is faster than performing the same number of accesses to a buffer that does not reside in the cache (or only partially resides in the cache) (A2).
The benchmark starts by allocating a buffer that is expected to fit in the smallest cache of the system; for example, a buffer only a few cache lines large. Then we access the whole buffer with stride equal to the cache line size. By making the access pattern random to avoid prefetching effects, and by continuously accessing the buffer until every element has been accessed multiple times to amortize start-up and cold-miss overhead, we can estimate the average latency per access.
The benchmark varies the buffer size and performs the same random access traversal for each new buffer size, recording the average access latency at each step. Owing to assumption A2, we expect that the average access latency will be constant for all buffers that fit in a particular cache of the hierarchy, but there will be a significant access latency difference for buffers that reside in different levels of cache. Therefore, by varying the buffer size, we expect to generate a set of Access Latency vs. Buffer Size data that looks like a step function with multiple steps. A step in the data set should occur when the buffer size exceeds the size of a cache. The number of steps will equal the number of caches plus one: the extra step corresponds to the main memory, and the Y value at the apex of each step will correspond to the access latency of each cache.
To eliminate false steps, the cache benchmarks need to minimize the effects of the TLB. To achieve that goal, the accesses to the buffer are not uniformly random. Instead, the buffer is logically split into segments equal to a TLB page size. The benchmark accesses the elements of each segment in random order, but exhausts all the addresses in a segment before proceeding to the next segment. This approach guarantees that, in a system with a cache line size SL and a page size SP, there are at least SP/SL cache misses for each TLB miss. In a typical modern architecture, the value of SP/SL is around 64. While this approach does not completely eliminate the cost of a TLB miss, it significantly amortizes it. Indeed, the data sets we have gathered from real hardware are easy to correlate with the characteristics of the cache hierarchy and exhibit little interference due to the TLB. Figure 2 shows the output of the benchmark on an Intel Atom processor. The horizontal ledges of the graph indicate the cache or memory latency. The sudden jumps in latency value show the sizes of the caches: 24 KiB for Level 1 and 512 KiB for Level 2.
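A minimal sketch of the page-blocked random traversal just described is given below. It builds on the pointer-chasing initialization from the beginning of Section 3 and assumes that the page size and cache line size are already known; the names init_chase_segmented, page_size and line_size are illustrative and do not come from the benchmark source.

    #include <stdint.h>
    #include <stdlib.h>

    /* Link the elements of buf so that the chase visits one element per cache
       line, in random order *within* each page-sized segment, and only then
       moves on to the next segment (amortizing TLB misses over a full page). */
    uintptr_t *init_chase_segmented(uintptr_t *buf, size_t bytes,
                                    size_t page_size, size_t line_size)
    {
        size_t per_page = page_size / line_size;     /* lines per segment */
        size_t pages    = bytes / page_size;
        size_t stride   = line_size / sizeof(uintptr_t);
        uintptr_t *first = NULL, *prev = NULL;

        for (size_t p = 0; p < pages; p++) {
            size_t *order = malloc(per_page * sizeof(size_t));
            for (size_t i = 0; i < per_page; i++) order[i] = i;
            for (size_t i = per_page - 1; i > 0; i--) {  /* shuffle lines in page */
                size_t j = rand() % (i + 1), t = order[i];
                order[i] = order[j]; order[j] = t;
            }
            for (size_t i = 0; i < per_page; i++) {
                uintptr_t *elem = buf + p * (page_size / sizeof(uintptr_t))
                                      + order[i] * stride;
                if (prev) *prev = (uintptr_t)elem;       /* chain previous -> current */
                else      first = elem;
                prev = elem;
            }
            free(order);
        }
        *prev = (uintptr_t)first;                        /* close the cycle */
        return first;                                    /* start of the timed chase */
    }

The traversal itself is the same ptr = (uintptr_t *)*ptr loop as before; timing it for a range of buffer sizes yields the step curve of Fig. 2.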
3.1.3. Cache associativity
This benchmark is based on the assumption that in a system with a cache of size Sc, two memory addresses A1 and A2, where A1 mod Sc = A2 mod Sc, will map to the same set of an N-way set associative cache (A3). The benchmark assumes knowledge of the cache size, potentially extracted from the execution of the benchmark mentioned above.
Assuming that the cache size is Sc, we allocate a buffer many times larger than the cache size, M × Sc. Next, the benchmark starts accessing a part of the buffer that is K × Sc large (with K ≤ M), repeating the experiment for different integral values of K. For every value of K, the access pattern consists of randomly accessing addresses that are Sc bytes apart; this process is repeated a sufficient number of times. Since all such addresses will map into the same cache set, as soon as K becomes larger than N, the cache will start evicting elements to store the new ones, and therefore some of the accesses will result in cache misses. Consequently, the average access time should be significantly lower for K ≤ N than for K > N. The output from a sample run of this benchmark on an Itanium processor is shown in Fig. 3.


FIGURE 3. Cache associativity characterization on Intel Itanium II. (Access latency in ns versus buffer size in multiples of the cache size, 1 through 16.)

We note that a victim cache may cause the benchmark to detect an associativity value higher than the real one. To prevent this, our benchmark implementation accesses elements in multiple sets, instead of just one.

3.1.4. Asymmetries in the memory hierarchy
With the move to multicore processors, we witnessed the quasi-general introduction of shared cache levels to the memory hierarchy. A shared cache design provides larger cache capacity by eliminating data replication for multithreaded applications. The entire cache may be used by the single active core for single-threaded workloads. More importantly, a shared cache design eliminates on-chip cache coherence at that cache level. In addition, it resolves coherence of the private lower level caches internally within the chip and thus reduces external coherence traffic. One downside of shared caches is a larger hit latency, which may cause increased cache contention and unpredictable conflicts. Shared caches are not an entirely new design feature. Before Level 2 caches were integrated onto the chip, some SMP architectures were using external shared Level 2 caches to increase capacity for single-threaded workloads, and to reduce communication costs between processors.
Over the last decade, we also witnessed the integration of the memory controller into the processor chip even for desktop grade processors. On the positive side, the on-chip memory controller increases memory performance by reducing the number of chip boundary crossings for a memory access, and by removing the memory bus as a shared resource. Memory bandwidth can be increased by adding additional memory controllers, and multisocket systems now feature multiple memory controllers. It is expected that as the number of cores packed onto a single chip continues to increase, multiple on-chip memory controllers will be needed to keep up with the memory bandwidth demand. This higher memory bandwidth availability comes at the expense of increased hardware complexity in the form of non-uniform memory access (NUMA) costs. Cores experience lower latency when accessing memory attached to their own memory controller.
Both of these hardware features can potentially create asymmetries in the memory system. They may cause subsets of cores to have an affinity to each other: cores communicate faster with other cores from the same subset than with cores that are not part of the subset. For example, a cache level may be shared by only a subset of cores on a chip, as is the case with the Level 2 cache on the Intel Yorkfield family of quad core processors. Thus, cores that share some level of cache may exchange data faster among themselves than with cores that do not share that same level of cache. For NUMA architectures, data allocated on one NUMA node is accessed faster by cores located on the same NUMA node than by cores from different NUMA nodes.
To understand such memory asymmetries, we use a multi-threaded micro-benchmark based on a producer-consumer pattern (a sketch follows below). Two threads access a shared memory block. First, one thread allocates and then initializes the memory block with zeros. The benchmark assumes that a NUMA system implements the first-touch page allocation policy. By allocating a new memory block and then touching it, the memory block will be allocated in the local memory of the first thread. Page allocation policy can also be set using operating system (OS) specific controls, e.g. the numactl command on NUMA-aware Linux kernels. Next, the two threads take turns incrementing all the locations in the memory block. One thread reads and then modifies the entire memory block before the other thread takes its turn. The two threads synchronize using busy waiting on a volatile variable. We use padding to ensure that the locking variable is allocated in a separate cache line, to avoid false sharing with other data. This approach causes the two threads to act both as producers and as consumers, switching roles after each complete update of the memory block. We repeat this process many times to minimize secondary effects and we compute the achieved bandwidth for different memory block sizes. We use a range of block sizes, from a block size smaller than the Level 1 cache size to a block size larger than the size of the last level of cache.
memory controller increases memory performance by reducing an OS-specific application programming interface (API) for
the number of chip boundary crossings for a memory access, pinning threads to cores. The hwloc library [20] provides a
and by removing the memory bus as a shared resource. Memory portable API for setting thread affinity, and we plan to integrate
bandwidth can be increased by adding additional memory it into the benchmark suite. Next, the communication profiles
controllers. Multisocket systems now feature multiple memory are analyzed in decreasing order of the memory block size, to
controllers. It is expected that as the number of cores packed detect any potential cliques of cores that have an affinity to each
onto a single chip continues to increase, multiple on-chip other. The algorithm starts with a large clique that includes all
memory controllers will be needed to keep up with the memory the cores in the system. At each step, the data for the next smaller
bandwidth demand. This higher memory bandwidth availability memory block is processed. The communication data between


all cores of a clique, identified at a previous step, is analyzed to identify any sub-cliques of cores that communicate faster with each other at this smaller memory block level. In the end, the algorithm will produce a locality tree that captures all detectable asymmetries in the memory hierarchy.
Figure 4 shows aggregated results for an Intel Gainestown system with two sockets and Hyper-Threading disabled. The x-axis represents the memory block size and the y-axis represents the bandwidth observed by one of the threads. The bandwidth is computed as number_updated_lines × cache_line_size / time, where number_updated_lines is the number of cache lines updated by the first thread. Since the two threads update an equal number of lines, the values shown in the figure represent only half of the actual two-way bandwidth. As expected for such a system, the benchmark captures two distinct communication patterns:

(1) when the two cores reside on different NUMA nodes, curve labeled inter-node;
(2) when the two cores are on the same NUMA node, curve labeled intra-node.

FIGURE 4. One-way, inter-core communication bandwidth for different memory block sizes and core placements on a dual-socket Intel Gainestown system. (One-way bandwidth in MB/s versus block size in KB; the markers L1, L2 and L3 indicate the cache sizes, and the two curves are labeled inter-node and intra-node.)

For this system, when the data for the largest memory block size is analyzed, the algorithm divides the initial set of eight cores into two cliques of four cores, corresponding to the two NUMA nodes in the system. The cores in each of these two cliques communicate among themselves at the speed shown by the intra-node point, while the communication speed between cores in distinct cliques corresponds to the inter-node point. The algorithm continues by processing the data for the smaller memory block sizes, but no additional asymmetries are uncovered in the following steps. The locality tree for this system has just two levels, with the root node corresponding to the entire system at the top, and two nodes of four cores each at the next level, corresponding to the two NUMA nodes in the system.

3.2. TLB hierarchy

The TLB hierarchy is an important part of the memory system that bears some resemblance to the cache hierarchy. However, the TLB is sufficiently different to warrant its own characterization methodology. Accordingly, we will focus on the description of our TLB benchmarking techniques rather than present differences and similarities with the cache benchmarks.
The crucial feature that any TLB benchmark should possess is the ability to alleviate cache effects on the measurements. Both conflict and capacity misses coming from data caches should either be avoided at runtime or filtered out during the analysis of the results. We chose the latter, as it has the added benefit of capturing the rare events when the TLB and the data cache are inherently interconnected, such as when the TLB fits the same number of pages as there are data cache lines.
To determine the page size, our benchmark maximizes the penalty coming from the TLB misses. We do it by traversing a large array multiple times with a given stride. The array is large enough to exceed the span of any TLB level; this guarantees a high miss rate if the stride is larger than or equal to the page size. If the stride is less than the page size, some of the accesses to the array will be contained in the same page, and thus will decrease the number of misses and the overall benchmark execution time. The false positives stemming from interference of data cache misses are eliminated by the high cost of a TLB miss in the last level of TLB. Handling these misses requires the traversal of the OS page table stored in the main memory; the combined latency exceeds the cost of a miss for any level of cache. Typical timing curves for this benchmark are shown in Fig. 5. The figure shows results from three very different processors: ARM OMAP3, Intel Itanium 2 and Intel Nehalem EP. The graph line for each system has the same shape: for strides smaller than the page size, the line rises as the number of misses increases because fewer memory accesses hit the same page, and for strides that exceed the page size, the graph line is flat because each array access touches a different page and so the per-access overhead remains the same. The page size is determined as the inflection point in the graph line. For our example, the page size is 4 KB on both ARM and Nehalem, and 16 KB on Itanium.
There are programmatic means of obtaining the TLB page size, such as the legacy POSIX call getpagesize() or the more modern sysconf(_SC_PAGE_SIZE). One obvious caveat of using any of these functions is portability. A more important caveat is the fact that these functions return the system page size rather than the hardware TLB page size. For most practical purposes, they are the same, but modern processors support multiple page sizes and large page sizes are often used to increase performance. Under such circumstances, our test delivers the observable size of a TLB page, which is the preferred value from a performance standpoint.
A similar argument can be made about counting the actual page faults rather than measuring the execution time and inferring the number of faults from the timing variations. There


FIGURE 5. Graph of timing results that reveal the translation lookaside buffer (TLB) page size. (Access latency in microseconds versus the stride of the memory accesses, for ARM OMAP3, Intel Itanium 2 and Intel Nehalem EP.)

FIGURE 6. Graph of results that reveal the number of levels of TLB and the number of entries in each level. (Access latency per access in microseconds versus working set size in number of visited TLB pages, for Intel Penryn (Core 2 Duo), AMD Opteron K10 (Istanbul) and IBM POWER7.)

are system interfaces such as getrusage() on Unix. The pitfall here is that these interfaces assume the faults will either be the result of the first touch of the page or that the fault occurred because the page was swapped out to disk. Again, our technique side-steps these issues altogether.
Finally, we do not target scenarios with mixed page sizes even though it is practically possible. The main reason is the lack of available optimization techniques that could effectively use such information.
Once the actual TLB page size is known, it is possible to proceed with discovering the sizes of the levels of the TLB hierarchy. Probably the most important task is to minimize the impact of the data cache. The common and portable technique is to perform repeated accesses to a large memory buffer at strides equal to the TLB page size. This technique is prone to creating as many false positives as there are data cache levels, and a slight modification to this technique is required [4]. On each visited TLB page, our benchmark chooses a different cache line to access, thus maximizing the use of any level of data cache. As a side note, choosing a random cache line within a page utilizes only half of the data cache on average. Figure 6 shows the timing graphs on a variety of platforms. Both Level 1 and Level 2 TLBs are identified accurately. As we mentioned before, we aim at the observable parameters: even if the TLB size is 512 entries, some of these entries will not contain user data. In a running process, there is a need for the function stack and the code segment, which will use one TLB page each, thus reducing the observable TLB size by 2.

3.3. Arithmetic operations

Information about the processing units, such as the latency and throughput of various arithmetic operations, is important to compilers and performance analysis tools. Such information is needed to produce efficient execution schedules that attempt to balance the type of operations in loops with the number and type of resources available on the target architecture.

3.3.1. Instruction latencies
The latency L(O, T) of an operation O, with operands of type T, is calculated as the number of cycles it takes from the time one such operation is issued until its result becomes available to subsequent dependent operations. BlackjackBench reports all operation latencies relative to the latency of a 32-bit integer addition, since the CPU frequency is not directly detected. We use a code generator to output a set of micro-benchmarks, tailored to each operation type and data type. For each combination of operation type O and operand type T, our code generator outputs a micro-benchmark that executes in a loop a large number of chained operations of the given type.
Figure 7 presents a sample micro-benchmark loop for computing the latency of Add operations. The data types of the accumulator t0 and the increment variable x are set when the micro-benchmarks are generated. We use a switch-case construct, placing the operations in separate case statements with fall-through execution, to prevent the compiler from optimizing away the code. This approach is similar to the technique used by the X-Ray [1, 2] tool. The value of the variable key used by the switch condition is unknown at compile time, and the variable is declared as volatile. At runtime, its value determines how many case statements and operations are executed in each iteration. The micro-benchmarks can be set to use different increment variables in consecutive case statements, interleaved in a round-robin fashion. Thus, for floating-point multiplication and division micro-benchmarks, we use two increment variables whose product is approximately 1, to prevent the accumulator variable from over- or underflowing. For integer multiplication operations, we use the identity element to prevent overflowing. We found the performance of integer operations to be mostly insensitive to the value of their operands. The exception is represented by the division operation. For integer division measurements, we interleave division operations with additions


using a large value. At the end of the loop, we subtract the cost of the additions that we measure separately.
The cost per operation is computed by dividing the wall clock time of the loop by the total number of operations executed. The loop is executed twice, using the same number of iterations but with different unroll factors, and the difference of the two execution times is divided by the difference in operation counts between the two loops. This approach eliminates the effects of the loop overhead, without increasing the unroll factor to a very large value, which would significantly increase the time spent in compiling the micro-benchmarks and could create instruction cache issues. To account for runtime variability, each benchmark is executed six times, and the second lowest computed value is selected. Table 1 shows sample results for an Intel Atom-based machine.

FIGURE 7. Sample benchmark loop for Add operations.
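Figure 7 appears only as an image in the typeset paper; the following sketch illustrates the structure it describes (a volatile switch key, fall-through case statements, chained adds on an accumulator), with illustrative names and sizes rather than the generated code itself.

    volatile int key = 0;                  /* value unknown at compile time */

    double add_latency_loop(long iterations)
    {
        double t0 = 0.0;                   /* accumulator; type set by generator */
        double x  = 1.0;                   /* increment variable */
        for (long i = 0; i < iterations; i++) {
            switch (key) {                 /* fall-through: key == 0 runs all cases */
            case 0: t0 = t0 + x;
            case 1: t0 = t0 + x;           /* each add depends on the previous one, */
            case 2: t0 = t0 + x;           /* so the chain exposes the add latency */
            case 3: t0 = t0 + x;
                /* ...the generated code contains many more case statements... */
            }
        }
        return t0;                         /* keep the result observably live */
    }

Timing such a loop twice with different unroll factors (numbers of case statements) and dividing the time difference by the difference in operation counts gives the per-operation cost described above.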
TABLE 1. Latency, throughput, and operations in flight for Intel Atom for various data types and bit widths.

Type-width   Op.         Latency   Throughput   In flight
float 32     add           5         1            5
float 32     subtract      5         1            5
float 32     divide       31         0.0323       1
float 32     multiply      4         1            4
float 64     add           5         1            5
float 64     subtract      5         1            5
float 64     divide       60         0.0167       1
float 64     multiply      5         0.5          2

3.3.2. Instruction throughputs
The throughput T(O, T) of an operation O, with operands of type T, represents the rate at which one thread of control can issue and retire independent operations of a given type. Throughput is reported as the number of operations that can be issued in the time it takes to execute a 32-bit integer addition.
Instruction throughputs are measured using an approach similar to the one used for measuring instruction latencies. However, to determine the maximum rate at which operations are issued, micro-benchmarks must include many independent operations as opposed to chained ones. We generate multiple independent operations with different accumulator variables for each case statement of the micro-benchmark loop shown in Fig. 7.
Using too many independent operations increases register pressure, potentially causing unnecessary delays due to register spills/unspills. Therefore, for each operation type O and operand type T, multiple micro-benchmarks are generated, each with a different number of independent streams of operations. The number of parallel operations is varied between 1 and 20, and the minimum time per operation across all versions, Lmin, is recorded. Throughput is computed as the ratio between the latency of a 32-bit integer addition and Lmin. Table 1 shows sample results for Intel Atom.

3.3.3. Operations in flight
The number of operations in flight F(O, T) for an operation O, with operands of type T, is a measure of how many such operations can be outstanding at any given time in a single thread of control. This measure is a function of both the issue rate and the pipeline depth of the target processor, and is a unitless quantity.
Operations in flight are measured using an approach similar to the ones used for operation latencies and operation throughputs. For each operation type O and operand type T, multiple micro-benchmarks, with different numbers of independent streams of operations, are generated and benchmarked. However, unlike the operation throughput benchmarks, where we are interested in the minimum cost per operation, to understand the number of operations in flight we look at the cost per loop iteration. Each independent stream contains the same number of chained operations, and thus the total number of operations in one loop iteration grows proportionally with the number of streams. When we increase the number of independent streams in the loop, as long as the processor can pipeline all the independent operations, the cost per iteration should remain constant. Thus, the inflection point where the cost per iteration starts to increase yields the number of operations in flight supported by one thread of control. Table 1 shows sample results for Intel Atom.

3.4. Execution contexts

Modern systems commonly have multiple cores per socket and multiple sockets per node. To avoid confusion due to the overloading of the terms core, CPU, node, etc., by hardware vendors, we use the term Execution Context to refer to the minimum hardware necessary to effect the execution of a compute thread. This allows us to side-step the issue of discovering the number of available physical cores and their computational resources. Instead, we again focus on the functionality required by the application (such as floating-point computation) and discover the available level of parallelism for the functionality.
Several modern architectures implementing virtual hardware


threads exhibit selective preference over different resources. For example, a processor could have private integer units for each virtual hardware thread, but only a shared floating-point unit for all hardware threads residing on a physical core. The Blackjack benchmarks attempt to discover the maximum number of:

(1) Floating-Point Execution Contexts,
(2) Integer Execution Contexts and
(3) Memory Intensive Execution Contexts.

Intuitively, we are interested in the maximum number of compute threads that could perform floating-point computations, integer computations or intense memory accesses, without having to compete with one another, or wait for the completion of one another. As a result, in a system with N floating-point contexts (or integer, or memory), we should be able to execute up to N parallel threads that perform intense floating-point computations (or integer, or memory) and observe no interference between the threads. Thus, the assumption is that in a system with N execution contexts, there should be no delay in the execution of each thread for M ≤ N threads, but for M > N threads at least one thread should be delayed (A4).
To utilize this assumption, our benchmark instantiates M threads that do identical work (floating-point, integer or memory intensive, depending on what we are measuring) and records the total execution time, normalized to the time of the case where M = 1. The experiment is repeated for different (integral) values of M until the normalized time exceeds some small predetermined threshold, typically 2 or 3. Owing to assumption A4, the normalized total execution time will be constant and equal to 1 (with some possible noise) for M ≤ N and greater than 1 for M > N.
While this behavior is true for every machine we have tested, the behavior for M > N depends on the scheduler of the OS. Namely, since we do not explicitly bind the threads to cores, the OS is free to move them around in an effort to distribute the extra load caused by the M − N threads (for M > N) among all computing resources. If the OS does so, then the total execution time for M > N will increase almost linearly with the number of threads. If the OS chooses to leave the threads on the same core for their whole life time, the result will look like a step function with steps at M = N, M = 2N, M = 3N, etc., and corresponding apexes at Y = 1, Y = 2, Y = 3 and so on. The latter tends to be a rare case, but we observed it on our private Cray XT5 system, Jaguar.
For the case of the Memory Intensive Contexts, it is important to distinguish between the ability of the execution contexts to perform memory operations and the ability of the memory subsystem to serve multiple parallel memory requests. To avoid erroneous results due to memory limitations, our benchmark limits the memory accesses of each thread to a small memory block that fits in the Level 1 cache and uses pointer chasing to traverse it. For the case of the Integer and Floating-Point Contexts, the corresponding benchmarks execute a tight compute loop with divisions, which are among the most demanding arithmetic operations, on variables initialized to 1 (so the values do not diverge), in a non-decidable way so that the compiler cannot simplify the loop. A sample output of this benchmark on a Power7 processor can be seen in Fig. 8.

FIGURE 8. Execution contexts characterization on Power7. (Normalized execution time versus number of threads, for floating-point, integer and memory workloads.)

3.5. Live ranges

An important hardware characteristic for optimization is the maximum number of registers available to an application. Although the number of registers on an architecture is usually known, not all the registers can be used for application variables. For example, either the stack pointer or the frame pointer should be held in a register, as at least one of them is essential for retrieving data from the stack (both cannot be stored on the stack). Other constraints or conventions require allocation of hardware registers for special purposes, thus reducing the actual number of registers available for applications. We use the term Live Ranges to describe the maximum number of concurrently usable registers by a user code.
When the number of concurrently live variables in a program exceeds the number of available registers, the compiler has to introduce a piece of code that spills the extraneous variables to memory. This incurs a delay because: (1) memory (even the Level 1 cache) is slower than the register file and (2) the extra instructions needed to transfer the variable and calculate the appropriate memory location consume CPU resources that are otherwise used by the original computation performed by the application. Thus, the enabling assumption of this benchmark is that in a machine with N Live Ranges, a program using K > N registers will experience higher latency per operation than a program using K ≤ N registers (A5). To measure this effect, our benchmark runs a large number of small tests (130), each executing a loop with K live variables (involved in K additions), with K ranging from 3 to 132. By executing all these tests and measuring the average time per operation, we detect the maximum number of usable registers R by detecting the step in the resulting data.


The structure of the loop is similar to the one used by X-Ray [1], but is modified in two ways (a sketch of the resulting loop body follows the list below).

(1) The switch statement, used in the loop by X-Ray, is unnecessary and detrimental to the benchmark. It is unnecessary because the chained dependences of the operations are sufficient to prohibit the compiler from aggressive code optimization (by simplifying, or reducing the number of operations in the loop). It is detrimental because some compilers, on some architectures, choose to implement the switch with a jump determined by a register. As a result, the benchmark could report one less register than the number that is available for application variables.
(2) The simple dependence chain suggested in X-Ray, P_(i mod N) = P_(i mod N) + P_((i+N-1) mod N), can lead to inconclusive results in architectures with deep pipelines and sophisticated arithmetic logic units with spare resources. We observed this effect when the compiler-generated instruction schedules tried to hide the spill/unspill overhead by using the extra resources of the CPU. To reduce the impact of this effect, we used the pattern P_(i mod N) = P_(i mod N) + P_((i+N-N/2) mod N) for the loop body. This pattern enables N/2 operations to be pipelined, provided there is sufficient pipeline depth. As a result, the execution of the operations in the loop puts much more pressure on the CPU and makes it harder to hide the memory overheads.
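For illustration only, this is what the generated loop body might look like for N = 4 live variables under the modified dependence pattern; the real benchmark generates one such test for every K between 3 and 132 and unrolls the loop body many times.

    /* N = 4 live variables; with offset N - N/2 = 2, the update of P0 uses P2
       and the update of P1 uses P3, so two additions can be in flight at once
       while all four variables remain live across the loop body. */
    long live_ranges_kernel_4(long reps)
    {
        long P0 = 1, P1 = 1, P2 = 1, P3 = 1;
        for (long r = 0; r < reps; r++) {
            P0 = P0 + P2;     /* i = 0: P[(0+2) mod 4] = P2 */
            P1 = P1 + P3;     /* i = 1: P[(1+2) mod 4] = P3 */
            P2 = P2 + P0;     /* i = 2: P[(2+2) mod 4] = P0 */
            P3 = P3 + P1;     /* i = 3: P[(3+2) mod 4] = P1 */
        }
        return P0 + P1 + P2 + P3;   /* keep all variables live to the end */
    }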
In a similar manner to other benchmarks where sensitive timing is performed, special attention was paid to details that can affect the performance or the decisions of the compiler. As an example, the K operations that we time have to be performed thousands of times in a loop, due to the coarse granularity of widely available timers. However, the use of loops introduces latency and poses a dilemma for the compiler as to whether the loop induction variable should be kept in a register. Manually unrolling the body of the loop multiple times amortizes the loop overhead and reduces the importance of the induction variable when register allocation is performed. An example run of this benchmark, on a Power7 processor, can be seen in Fig. 9.
The actual output data of this benchmark shows that, for a very small number of variables, the average operation latency is higher than the minimum operation latency achieved. This is the case because the pipeline is not fully utilized when the data dependencies do not allow several operations to execute in parallel. However, this benchmark is only aiming to identify the latency associated with spilling registers to memory, which results in increased latency. For this reason, the curves shown in the figure, and used to extract the Live Ranges information, were obtained after applying monotonicity enforcement to the data. This technique is discussed in Section 4.1.

FIGURE 9. Live ranges characterization on Power7. (Average operation latency in ns versus number of variables, for 32- and 64-bit integer and floating-point operations.)

3.6. OS scheduler time slot

Multithreaded applications can tune synchronization, scheduling and work distribution decisions by using information about the typical duration of time a thread will run on the CPU uninterrupted after the OS schedules it for execution. We refer to this time duration as the OS Scheduler time slot. The assumption that enables this benchmark is that when a program is being executed on a CPU with frequency F, the latency between instructions should be in the order of 1/F, whereas two instructions that execute in different time slots (because the program execution was preempted by the OS between these instructions) will be separated by a time interval in the order of the OS Scheduler time slot, Ts, which is several orders of magnitude larger than 1/F (A6).
Our benchmark consists of several threads. Each thread executes the following pseudocode:

    te = ts = t0 = timestamp()
    while (te - t0 < TOTAL_RUNNING_TIME)
        OP0 ... OPN
        te = timestamp()
        if (te - ts > THRESHOLD) record(te - ts)
        ts = te

In other words, a loop that executes a minimal number of operations takes a timestamp and, if the length of time Ti since the previous timestamp was larger than a predefined threshold, records it. The threshold must be chosen such that it is much longer than the time the few operations and the timestamp function take to execute, but much shorter than the expected value of the scheduler time slot, Ts. In a typical modern system, a handful of arithmetic operations and a call to a timestamp function, such as gettimeofday, take less than a microsecond. On the other hand, the scheduler time slot is typically in the order of, or larger than, a millisecond. Therefore, a THRESHOLD of 20 µs safely satisfies all our criteria. Since we do not record the


duration Ti of any loop iteration with Ti < THRESHOLD, and we have chosen THRESHOLD to be significantly larger than the pure execution time of each iteration, the only values recorded will be from iterations whose execution was interrupted by the OS and spanned more than one scheduler time slot.
The benchmark assumes knowledge of the number of execution contexts on the machine (i.e. cores), and oversubscribes them by a factor of 2 by generating twice as many threads as there are execution contexts. Since all threads are compute-bound, perform identical computation and share an execution context with some other thread, we expect that, statistically, each thread will run for one time slot and wait during another. Regardless of scheduling fairness decisions and other esoteric OS details, the mode of the output data distribution (i.e. the most common measurement taken) will be the duration of the scheduler time slot Ts. Indeed, experiments on several hardware platforms and OSs have produced the expected results: 10 ms on server and desktop Linux distributions and 100 ms on Cray's lightweight kernel for the XT series of supercomputers.

4. STATISTICAL ANALYSIS

The output of the micro-benchmarks discussed in Section 3 is typically a performance curve, or rather, a large number of points representing the performance of the code for different values of the control variable. Since our motivation for developing BlackjackBench was to inform compilers and auto-tuners about the characteristics of a given hardware platform, we have developed analyses that can process the performance curves and output values that correspond to actual hardware characteristics.

4.1. Monotonicity enforcement

In all our benchmarks, except for the micro-benchmark that detects asymmetries in the memory hierarchy, the output curve is expected to be monotonically increasing. If any data points violate this expectation, it is due to random noise, or esoteric 'second-order' hardware details that are beyond the scope of this benchmark suite. Therefore, as a first post-processing step we enforce monotonic increase using the formula: for all i, X_i = min_{j >= i} X_j.

4.2. Gradient analysis

Most of our benchmarks result in data that resemble step functions. Therefore, the challenge is to detect the location of the step, or steps, that contain the useful information.

4.2.1. First step
The Execution Contexts benchmark produces curves that start flat (when the number of threads is less than the available resources), then exhibit a large jump (when the demand exceeds the resources), and then continue with noisy behavior and potentially additional steps, as can be seen in Fig. 8.
To extract the number of Execution Contexts from this data, we are interested in that first jump from the straight line to the noisy data. However, due to noise, there can be small jumps in the part of the data that is expected to be flat. The challenge is to systematically define what constitutes a small jump vs. a large jump for the data sets that can be generated by this benchmark. To address this, we first calculate the relative value increase in every step, dY^r_n = (Y_{n+1} - Y_n)/Y_n, and then compute the average relative increase <dY^r>. We argue that the data point that corresponds to the jump we want (and thus the number of execution contexts) is the first data point i for which dY^r_i > <dY^r>. The rationale is that the average of a large number of very small values (noise) and a few much larger values (actual steps) will be a value higher than the noise, but smaller than the steps. Thus, the average relative increase <dY^r> gives us a threshold to differentiate between small and large values for every data set of this type.

4.2.2. Biggest step
The Live Ranges benchmark produces curves that start flat (when all variables fit in registers), then potentially grow slightly (if some registers are unavailable for some reason), then exhibit a large step when the first spill to memory occurs, and then continue growing in a non-regular way, as can be seen in Fig. 9. Owing to the increase in latency before the first spill, the previous approach for detecting the first step is not appropriate for this type of data. However, we can utilize the fact that the steps caused by the additional spills will be no larger than the step caused by the first spill. Furthermore, since the additional steps have higher starting values than the first step, the relative increase (Y_{n+1} - Y_n)/Y_n for every n higher than the first spill will be lower than the relative increase of the first spill. To demonstrate this point, Fig. 10 shows the data curves for the integer live ranges along with the relative dY^r values.

FIGURE 10. Live ranges and relative dY. (Average operation latency in ns versus number of variables, for 32- and 64-bit integer data, together with the corresponding relative dY curves.)
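The first-step and biggest-relative-step rules of Sections 4.2.1 and 4.2.2 reduce to a few lines of code; the sketch below assumes the curve has already been made monotonic (Section 4.1) and uses illustrative names rather than the actual analysis scripts.

    #include <stddef.h>

    /* Section 4.2.1: index of the first point whose relative increase
       exceeds the average relative increase of the whole curve. */
    size_t first_step(const double *Y, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i + 1 < n; i++)
            sum += (Y[i + 1] - Y[i]) / Y[i];
        double avg = sum / (n - 1);                /* <dY^r> */
        for (size_t i = 0; i + 1 < n; i++)
            if ((Y[i + 1] - Y[i]) / Y[i] > avg)
                return i + 1;                      /* first "large" jump */
        return n;                                  /* no step found */
    }

    /* Section 4.2.2: index of the largest relative increase (first spill). */
    size_t biggest_step(const double *Y, size_t n)
    {
        size_t best = 0;
        double best_dyr = 0.0;
        for (size_t i = 0; i + 1 < n; i++) {
            double dyr = (Y[i + 1] - Y[i]) / Y[i];
            if (dyr > best_dyr) { best_dyr = dyr; best = i + 1; }
        }
        return best;
    }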


FIGURE 11. Quality threshold clustering pseudocode.

The biggest relative step technique can also be used for processing the results of the cache line size benchmark and the cache associativity benchmark. For the TLB page size, where the desired information is in the last large step, the analysis seeks the biggest scaled step dY^s = dY · Y (instead of the biggest relative step).
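Minimal sketches of the two step detectors described above follow (function and variable names are illustrative; in dY^s = dY · Y we take Y to be the value at the start of the step, which is our reading rather than a detail stated in the text):

    /* Biggest relative step: the index whose jump dY/Y is largest.
     * Used for live ranges, cache line size and cache associativity. */
    int biggest_relative_step(const double *y, int n)
    {
        int best = -1;
        double best_val = 0.0;
        for (int i = 0; i < n - 1; i++) {
            double v = (y[i + 1] - y[i]) / y[i];
            if (v > best_val) { best_val = v; best = i + 1; }
        }
        return best;
    }

    /* Biggest scaled step dY^s = dY * Y: scaling by Y favours the later,
     * larger steps, which is what the TLB page size analysis needs. */
    int biggest_scaled_step(const double *y, int n)
    {
        int best = -1;
        double best_val = 0.0;
        for (int i = 0; i < n - 1; i++) {
            double v = (y[i + 1] - y[i]) * y[i];
            if (v > best_val) { best_val = v; best = i + 1; }
        }
        return best;
    }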
4.2.3. Quality threshold clustering

Unlike the previous cases, where the analysis aimed to extract a single value from each data set, the benchmark for detecting the cache size, count and latency has multiple steps that carry useful information. Owing to the regular nature of the steps this benchmark tends to generate, we can group the data points into clusters based on their Y value (access latency), such that each cluster includes the data points that belong to one cache level. For the clustering, we use a modified version of the quality threshold clustering algorithm [21]. The modification pertains to the cluster diameter threshold used by the algorithm to determine whether a candidate point belongs to a cluster or not. In particular, unlike regular QT-Clustering, where the diameter is a constant value predetermined by the user of the algorithm, our version uses a variable diameter equal to 25% of the average value of each cluster. The algorithm we use is represented in pseudocode in Fig. 11.

Using QT-Clustering, we can obtain the clusters of points that correspond to each cache level. Thus, we can extract, for each cache level, the size information from the maximum X value of each cluster, the latency information from the minimum Y value of each cluster, and the number of levels of the cache hierarchy from the number of clusters. An example use of this analysis, on data from a Power7 processor, can be seen in Fig. 12. QT-Clustering is also used for the levels of TLB.
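The sketch below illustrates the idea with a simplified, one-dimensional greedy pass rather than the full QT-Clustering of [21]: after sorting the (buffer size, latency) points by latency, a point joins the current cluster while the cluster's spread stays within 25% of its running average latency, and each finished cluster reports an effective size (the largest buffer that still hit that level) and an effective latency (the cluster minimum). All names are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double size_kib; double latency_ns; } sample_t;

    static int by_latency(const void *a, const void *b)
    {
        double d = ((const sample_t *)a)->latency_ns - ((const sample_t *)b)->latency_ns;
        return (d > 0) - (d < 0);
    }

    /* Group latency samples into cache levels and print each level's
     * effective size (max X in the cluster) and latency (min Y). */
    void cluster_cache_levels(sample_t *s, int n)
    {
        if (n == 0)
            return;
        qsort(s, n, sizeof *s, by_latency);
        int level = 0, start = 0;
        double lo = s[0].latency_ns, sum = 0.0;
        double max_size = 0.0, min_lat = s[0].latency_ns;
        for (int i = 0; i < n; i++) {
            double spread = s[i].latency_ns - lo;
            double avg = (sum + s[i].latency_ns) / (double)(i - start + 1);
            if (spread > 0.25 * avg) {         /* 25% rule: start a new cluster */
                printf("level %d: size %.0f KiB, latency %.2f ns\n",
                       ++level, max_size, min_lat);
                start = i;
                lo = s[i].latency_ns;
                sum = 0.0;
                max_size = 0.0;
                min_lat = s[i].latency_ns;
            }
            sum += s[i].latency_ns;
            if (s[i].size_kib > max_size) max_size = s[i].size_kib;
            if (s[i].latency_ns < min_lat) min_lat = s[i].latency_ns;
        }
        printf("level %d: size %.0f KiB, latency %.2f ns\n", ++level, max_size, min_lat);
    }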
FIGURE 12. QT-Clustering applied to Power7 Cache Data. (Plot of access latency (ns) versus buffer size (KiB); the points form an L1 cluster, an L2 cluster, a (local) L3 cluster and a (remote) L3 cluster.)

TABLE 2. Comparison between hardware and software-determined instruction latency on Intel Atom Diamondville.

Type-width   Op.        H/W latency   S/W latency
int 32       add        1, 2          1
int 64       add        1, 2          1
int 32       multiply   5, 6          5
int 64       multiply   13, 14        13
int 32       divide     49, 61        65
int 64       divide     183, 207      153

Type-width   Op.        x87    XMM    S/W latency
float 32     add        5      5      5
float 64     add        5      5      5
float 32     multiply   5      4      4
float 64     multiply   5      5      5
float 64     divide     71     31     31
float 32     divide     71     60     60

5. ACCURACY

We stressed from the beginning that we focus on effective values rather than raw hardware values. Nonetheless, some of the raw hardware parameters are available and we readily compared them with the results of our tests. For brevity, we focus here on values published for two well-known variations of the x86 architecture [22], whose results we presented earlier. In particular, Table 2 shows instruction latency for various data types and operations on Intel Atom Diamondville and Table 3 for AMD Opteron K10 Istanbul.

One important thing to note in Tables 2 and 3 is that often hardware values cannot be known exactly and, consequently, we are forced to show more than one value or even the entire range. For integer operations, the difference may come from the instruction used (e.g. MUL vs. IMUL) or a complex interaction between internal CPU units (e.g. decoder, reservation stations), which results in the latency given as a range of numbers. For floating-point operands, the compiler may choose to generate either x87 or SSE instructions. What we can see is that, for the most part, our benchmarks are able to match the hardware parameters and the compiler chooses the faster SSE instructions for the generated code. For the integer divide, our observed latencies are on the high side of the range of values provided by the manufacturers.
TABLE 3. Comparison between hardware and software-determined instruction latency on AMD Opteron K10 Istanbul.

Type-width   Op.        H/W latency      S/W latency
int 32       add        1                1
int 64       add        1                1
int 32       multiply   3                3
int 64       multiply   4                4
int 32       divide     15–46, 24–55     55
int 64       divide     15–78, 24–87     66

Type-width   Op.        x87    SSE    S/W latency
float 32     add        4      4      4
float 64     add        4      4      4
float 32     multiply   4      4      4
float 64     multiply   4      4      4
float 64     divide     31     16     16
float 32     divide     31     20     21

We measured live ranges and matched them accurately with the hardware values for the floating-point registers (16). The integer registers were measured at 15, which matches the hardware register file once the stack pointer register is excluded.

The measured cache parameters were determined accurately on the two architectures, as we include sophisticated techniques to remove the influence of modern cache hierarchy features such as high levels of cache associativity, hardware prefetchers, etc.

6. DISCUSSION

Comparing Figs 2 and 12 against vendor documents, one can note that there are discrepancies in some of the reported values. For example, IBM advertises the Level 3 cache of the 8-core Power7 processor as 32 MiB. Also, according to the vendor, there is no L4 cache. However, our benchmark detects a Level 3 of size just under 4 MiB and a fourth level of cache with a size just under 32 MiB. The reason behind this behavior is that our benchmarks aim at the effective values of the hardware features, as those are viewed using changes in performance as the only criterion. In the case of the Power7, the Level 3 cache is organized in 4 MiB chunks local to each core, but shared between all cores. As a result, each core can access the first 4 MiB of the Level 3 faster than it accesses the rest. It then appears as if there were two discrete cache levels. The size being detected is slightly smaller than the vendor's claims. This phenomenon occurs in virtually every architecture we have tested, for the largest cache in the hierarchy. The cause differs between architectures, and it is a combination of imperfect cache replacement policies, misses due to associativity and pollution due to caching of data such as page table entries, or program instructions. As a result, if a user application has a working set equal to the vendor-advertised value for the largest cache, it is practically impossible for that application to execute without a performance penalty due to cache misses. Therefore, we consider the size of the largest working set that can be used without a significant performance penalty to be the effective cache size, which is the value we are interested in.

A few of our measurements may be affected by a variety of system settings. For example, the BIOS on some servers includes an option for pair-wise prefetching in Intel Xeon server chips. The Itanium architecture has its peculiarities, such as the lack of Level 1 caching support for floating-point values and two different TLB page sizes at each level of TLB. In fact, TLB support can be quite peculiar, and this includes a separate TLB space for huge page entries on AMD's Opteron chips. Similarly, there are lock-down TLB entries that could potentially be utilized differently by various implementations of the ARM processor designs. And there is the issue of the read-only Level 1 TLB on the Intel Core 2 architecture. Occasionally, the concurrency of in-memory TLB table walkers may limit performance if multiple cores happen to suffer heavy TLB miss rates and require constant TLB reloading from the main memory. Finally, some OSes (HP-UX and AIX in particular) allow the user's binary executables to include the information on what page size should be used during process execution, rather than have a system-wide default value. Such uncommon conditions can be challenging for our benchmarks, in that the produced results could differ from the users' expectations.

7. GUIDED TUNING

Aside from benefiting AACE, we consider BlackjackBench an indispensable tool for model-based tuning and performance engineering of computational kernels for scientific codes. One such kernel, called DSSSM [23], is absolutely essential for good performance of tile linear algebra codes [24]. For the sake of brevity and simplicity of exposition, we will show how BlackjackBench helps us guide the model-based tuning of one of the components of the kernel: a Schur's complement update based on matrix–matrix multiplication [25]. We believe that our approach is a viable alternative to existing exhaustive search approaches [15, 16], analytical search [17] or approaches based on machine learning [18].
We proceed first by constructing a cache reuse model to maximize the issue rate of floating-point operations. In fact, it is desirable to retire as many such operations in every cycle as is feasible given the number of FPU units. The Schur's complement kernel has a high memory reuse factor that grows linearly with the input matrix sizes. Our model is a constrained optimization problem with discrete parameters:

    maximize   2mnk / (2mnk + (mn + mk + nk))   subject to   mn + mk + nk ≤ R,

where m, n and k are the input matrix dimensions and R is the number of registers or the cache size. When the above fraction is maximized, the code performs the largest number of floating-point operations for each load operation for the kernel under consideration. BlackjackBench was used to supply the values of R for the register file (Live Ranges) and the Level 1 cache size. Optimizations that take into account additional levels of cache are not as important for tile algorithms [24].
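A minimal sketch of this model as we read it: enumerate feasible blockings and keep the one with the best flop-to-load ratio. We assume that the constraint is that the data touched by the update (mn + mk + nk values) must fit in R register or cache slots, and we fix k = 1, the granularity used by the blocking choices discussed below; both are our assumptions for the sake of illustration.

    #include <stdio.h>

    /* Fraction of floating-point operations per (flop + load) for an
     * m x n x k block: 2mnk / (2mnk + (mn + mk + nk)). */
    static double flop_ratio(long m, long n, long k)
    {
        double flops = 2.0 * (double)m * (double)n * (double)k;
        double data  = (double)(m * n + m * k + n * k);
        return flops / (flops + data);
    }

    /* Exhaustively search m and n (with k fixed to 1) so that the block
     * footprint mn + mk + nk fits in R slots, maximizing the flop ratio. */
    static void best_blocking(long R, long *best_m, long *best_n)
    {
        const long k = 1;
        double best = -1.0;
        *best_m = *best_n = 1;
        for (long m = 1; 2 * m + 1 <= R; m++)          /* footprint with n = 1 */
            for (long n = 1; m * n + m * k + n * k <= R; n++) {
                double r = flop_ratio(m, n, k);
                if (r > best) { best = r; *best_m = m; *best_n = n; }
            }
    }

    int main(void)
    {
        long m, n;
        best_blocking(16, &m, &n);              /* R = 16 registers (Live Ranges) */
        printf("register blocking: m = %ld, n = %ld, k = 1\n", m, n);
        best_blocking(32 * 1024 / 8, &m, &n);   /* R = 32 KiB L1 / 8-byte doubles */
        printf("cache blocking:    m = %ld, n = %ld, k = 1\n", m, n);
        return 0;
    }

With the measured values of R (16 registers; 32 KiB of Level 1 cache, i.e. 4096 doubles), this enumeration reproduces the m = n = 3 and m = n = 63 blockings quoted below, before the practical adjustments for even matrix sizes and cache-line granularity.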
We put our methodology to the test on a dual-core Intel Core i7 machine with a 2.66 GHz clock. This determines the size of the register file (16) and the Level 1 cache (32 KiB), as supplied by BlackjackBench and confirmed with the hardware specification. Our model points to register blocking with m = 3, n = 3 and k = 1, and cache blocking with m = 63, n = 63 and k = 1. These parameters, however, are not practical, because the kernel would most likely always be called with even matrix sizes, and so the cleanup code [15] would have to be utilized in the majority of cases, which may drastically diminish the performance. Thus, a more feasible register blocking is m = 4, n = 2 and k = 1. The Level 1 cache blocking has to be adjusted to match the register blocking. A further adjustment comes from the fact that one of the input matrices in our kernel may have non-unit stride, and loading its entries into the Level 1 cache brings in one cache line worth of floating-point data, so we need room to accommodate these extra items. Using BlackjackBench, we discover the cache line size and, thus, the corresponding change in cache blocking parameters is m, n = 56 and k = 1. Since m and n are the same, we refer to them collectively as the matrix size.

Figure 13 shows performance results for our kernel and the vendor implementation. According to our model and the considerations presented above, optimal performance should occur at matrix size 56, and the figure confirms this finding: the Blackjack line drops for matrix sizes larger than this value. An additional benefit of our guided tuning is that we are able to achieve significantly higher performance for the range of matrix sizes that are the most important in our application.

FIGURE 13. Performance of the Schur's complement kernel routines. VecLib is based on ATLAS. Blackjack is a routine optimized for Level 1 cache. (Plot of performance in Gflop/s versus matrix size.)

8. CONCLUSION

We have presented the BlackjackBench system characterization suite. This suite of micro-benchmarks goes beyond the state of the art in benchmarking by:

(1) offering micro-benchmarks that can exercise a wider set of hardware features than most existing benchmark suites do;
(2) emphasizing portability by avoiding low-level primitives, specialized software tools and libraries, or non-portable OS calls;
(3) providing comprehensive statistical analyses as part of the characterization suite, capable of distilling the results of micro-benchmarks into useful values that describe the hardware;
(4) emphasizing the detection of hardware features through variations in performance. As a result, BlackjackBench detects the effective values of hardware characteristics, which is what a user-level application experiences when running on the hardware, instead of often unattainable peak values.

We describe how the micro-benchmarks operate, and their fundamental assumptions. We explain the analysis techniques for extracting useful information from the results and demonstrate, through several examples drawn from a variety of hardware platforms and OSs, that our assumptions are valid and our benchmarks are portable.
REFERENCES

[1] Yotov, K., Pingali, K. and Stodghill, P. (2005) Automatic measurement of memory hierarchy parameters. SIGMETRICS Perform. Eval. Rev., 33, 181–192.
[2] Yotov, K., Jackson, S., Steele, T., Pingali, K. and Stodghill, P. (2005) Automatic Measurement of Instruction Cache Capacity. Proc. 18th Workshop on Languages and Compilers for Parallel Computing (LCPC), Hawthorne, NY, USA, October 20–22, pp. 230–243. Springer.
[3] Duchateau, A.X., Sidelnik, A., Garzarán, M.J. and Padua, D.A. (2008) P-ray: A Suite of Micro-Benchmarks for Multi-Core Architectures. Proc. 21st Int. Workshop on Languages and Compilers for Parallel Computing (LCPC'08), Edmonton, Canada, July 31–August 2, Lecture Notes in Computer Science 5335, pp. 187–201. Springer.
[4] Saavedra, R.H. and Smith, A.J. (1995) Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Trans. Comput., 44, 1223–1235.
[5] González-Domínguez, J., Taboada, G., Fraguela, B., Martín, M. and Touriño, J. (2010) Servet: A Benchmark Suite for Autotuning on Multicore Clusters. IEEE Int. Symp. Parallel & Distributed Processing (IPDPS), Atlanta, GA, USA, April 19–23, pp. 1–10. IEEE Computer Society. doi:10.1109/IPDPS.2010.5470358.
[6] McVoy, L. and Staelin, C. (1996) lmbench: Portable Tools for Performance Analysis. ATEC '96: Proc. USENIX 1996 Annual Technical Conf., Berkeley, CA, USA, January 24–26, pp. 23–23. USENIX Association.
[7] Staelin, C. and McVoy, L. (1998) MHZ: Anatomy of a Micro-Benchmark. USENIX 1998 Annual Technical Conf., New Orleans, LA, USA, January 15–18, pp. 155–166. USENIX Association.
[8] Mucci, P.J. and London, K. (1998) The CacheBench Report. Technical Report. Computer Science Department, University of Tennessee, Knoxville, TN.
[9] DARPA AACE. http://www.darpa.mil/ipto/solicit/baa/BAA-08-30_PIP.pdf.
[10] Molka, D., Hackenberg, D., Schöne, R. and Müller, M.S. (2009) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System. PACT '09: Proc. 2009 18th Int. Conf. Parallel Architectures and Compilation Techniques, Washington, DC, USA, September 14–16, pp. 261–270. IEEE Computer Society.
[11] Dongarra, J., Moore, S., Mucci, P., Seymour, K. and You, H. (2004) Accurate Cache and TLB Characterization Using Hardware Counters. In Bubak, M., van Albada, G.D., Sloot, P.M. and Dongarra, J. (eds), Int. Conf. Computational Science, Kraków, Poland, June, Lecture Notes in Computer Science 3036, pp. III:432–439. Springer, Berlin. ISBN 3-540-22114-X.
[12] Browne, S., Dongarra, J., Garner, N., Ho, G. and Mucci, P. (2000) A portable programming interface for performance evaluation on modern processors. Int. J. High Perform. Comput. Appl., 14, 189–204.
[13] Whaley, R.C., Petitet, A. and Dongarra, J.J. (2001) Automated empirical optimizations of software and the ATLAS project. Parallel Comput., 27, 3–35.
[14] Whaley, R.C. and Castaldo, A.M. (2008) Achieving accurate and context-sensitive timing for code optimization. Softw. Pract. Exper., 38, 1621–1642.
[15] Whaley, R.C. and Dongarra, J. (1998) Automatically Tuned Linear Algebra Software. CD-ROM Proc. IEEE/ACM Conf. SuperComputing 1998: High Performance Networking and Computing, Orlando, FL, USA, November 7–13, p. 38. IEEE Computer Society. doi:10.1109/SC.1998.10004.
[16] Whaley, R.C., Petitet, A. and Dongarra, J.J. (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput., 27, 3–35.
[17] Yotov, K., Li, X., Ren, G., Garzarán, M.J., Padua, D.A., Pingali, K. and Stodghill, P. (2005) Is search really necessary to generate high-performance BLAS? Proc. IEEE, 93, 358–386. Special issue on Program Generation, Optimization, and Adaptation. doi:10.1109/JPROC.2004.840444.
[18] Singh, K. and Weaver, V. (2007) Learning Models in Self-Optimizing Systems. COM S612: Software Design for High Performance Architectures. Cornell University, Computer Systems Laboratory.
[19] Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P. and Tomov, S. (2006) The Impact of Multicore on Math Software. In Kågström, B., Elmroth, E., Dongarra, J. and Waśniewski, J. (eds), Applied Parallel Computing. State of the Art in Scientific Computing, 8th Int. Workshop, PARA, Umeå, Sweden, June 18–21, Lecture Notes in Computer Science 4699, pp. 1–10. Springer, Heidelberg.
[20] Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S. and Namyst, R. (2010) hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. Proc. PDP 2010, the 18th Euromicro Int. Conf. Parallel, Distributed and Network-Based Computing, Pisa, Italy, February 17–19, pp. 180–186. IEEE Computer Society Press. doi:10.1109/PDP.2010.67.
[21] Heyer, L.J., Kruglyak, S. and Yooseph, S. (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Res., 9, 1106–1115.
[22] Fog, A. (2012) Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Technical Report, last updated 2012-02-29. Copenhagen University College of Engineering, Copenhagen, Denmark.
[23] Buttari, A., Langou, J., Kurzak, J. and Dongarra, J.J. (2008) Parallel tiled QR factorization for multicore architectures. Concurrency Comput. Pract. Exper., 20, 1573–1590. doi:10.1002/cpe.1301.
[24] Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P. and Tomov, S. (2009) Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys. Conf. Ser., 180. doi:10.1088/1742-6596/180/1/012037.
[25] Golub, G. and Van Loan, C. (1996) Matrix Computations (3rd edn). Johns Hopkins University Press, Baltimore, MD.