
College of Engineering, Pune - 5

Department of Computer Engineering and Information Technology B.Tech. Project: VII Semester - Progress Report

Techniques for Benchmarking of CPU Micro-Architecture


Varad R. Deshmukh, Computer Engineering, deshmukhvr07.comp@coep.ac.in, Roll No: 702015
Nishchay S. Mhatre, Information Technology, mhatrens07.it@coep.ac.in, Roll No: 705040

Faculty Advisor: Prof. S. Gosavi
External Advisor: Dr. Shrirang Karandikar (Computational Research Laboratory, Tata Sons Ltd.)
Abstract

The micro-architectural features of modern CPUs have become common across most commercial processors. However, there is a large variety of differences at the implementation level, which is thought to have an impact on performance. These differences make comparison difficult when it comes to choosing a CPU for High Performance Computing applications. Existing benchmarks, like SPEC and LINPACK, do not measure the direct impact of micro-architecture on performance, while certain micro-architecture-specific benchmarks like STREAM give information about only one aspect. Thus, the aim of this project is to create a suite of benchmarks that provides information about both overall performance and the contribution of particular features and optimisations. The approach adopted consists of understanding the implementation details of the micro-architectural features, writing programs to isolate the effect of a particular feature, and making measurements to determine their impact on performance. Comparison of these measurements with theoretical and calculated estimates will be used to determine their correctness. Running the programs on multiple platforms will be used to compare micro-architectures. The challenge is to isolate the features, make the programs fairly portable, and provide a coherent interpretation of the generated data.

1 Introduction

The Micro-Architecture (abbr. uArch) of a computer comprises the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different micro-architectures; implementations might vary due to different goals of a given design or due to shifts in technology. Various micro-architectural features have become fairly commonplace due to advancement in design and the silicon process. For example, most commercial

desktop or server processors ship with multi-stage instruction pipelines and on-chip cache memories of megabyte order. However, there is enormous variety in the design and implementation of these features across vendors and within CPU families from the same vendor, which is largely invisible to consumers and developers. Therefore it is hard to make a direct comparison in terms of performance, given only the performance claims made by the vendors. While this does not affect the everyday user to a great degree, it is particularly important in the field of high performance computing, due to the sheer volume of data on which computationally intensive operations are performed.

There exist many commercial and scientific benchmarks, like SPEC and LINPACK, which measure performance. However, most of these benchmarks do not isolate features of the uArch. Certain benchmarks, like STREAM and Cachebench, consider only one feature; moreover, they give an average value of performance. Hence the need is felt for a comprehensive set of benchmarks which provides information about the impact of uArch features at a finer level of detail. The aim of this project is to create such a benchmark suite.

The rest of this report is organised as follows. Section two talks briefly about the different uArch features: their concept, operation, and expected effects on performance. Section three talks about previous work in this area surveyed by the authors. Section four describes the proposed approach. Section five describes the experiments conducted to isolate and benchmark the memory hierarchy and the experiments on functional unit performance. Section six enumerates the future course of action.

2 CPU Micro-Architecture

All practical CPUs today follow the Von Neumann architecture: an instruction is fetched from memory, decoded, and executed. Execution involves use of the ALU, fetching operands from memory, and writing results to memory. This purely sequential, word-at-a-time approach gives rise to the Von Neumann bottleneck. In order to overcome this, processors use many clever designs at the uArch level. Most prominent among them are:

1. Instruction pipelining
2. Cache memory
3. SIMD functional units
4. Out of order execution

2.1 Instruction Pipelining

The various stages of the fetch-decode-execute cycle can be executed in parallel for successive instructions. This process, called Instruction Pipelining, is analogous to an industrial assembly line. To overcome practical problems caused by pipelined execution, supporting features such as branch prediction, scoreboarding, and multiple functional units with reservation stations are implemented.

2.2 Cache Memory

Main memory speeds have not been able to keep up with the increase in CPU speed, which closely follows Moore's Law. Spatial and temporal locality is observed in most programs, and this is exploited using cache memory. The cache itself has many features that are optimisations of the basic cache design.

2.3 Vectorisation

Single Instruction Multiple Data functional units help exploit inherent data parallelism in a variety of programs. Most recent processors have SIMD extensions in their ISA and hence, the required micro-architectural features.
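As an illustration of the data parallelism these units exploit, the loop below adds two arrays of doubles two lanes per instruction using SSE2 intrinsics. This is a sketch of the general technique (the function and parameter names are ours); a production kernel would also handle lengths that are not a multiple of two.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add two double arrays two lanes at a time with addpd.
   Assumes n is a multiple of 2. */
void vadd(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);            /* load 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  /* one addpd = 2 adds */
    }
}
```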

2.4 Out of Order Execution

This technique reorders the instruction stream so that an instruction which is not waiting on anything can execute before one that is stalled on an operand or a functional unit, even though the former comes later in instruction memory. This again results in complex modifications at the uArch level.
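The effect can be illustrated at the source level. In the hypothetical kernel below (the names are ours), the additions do not depend on the divide, so an out-of-order core can issue them while the long-latency divide is still in flight.

```c
/* A long-latency divide followed by independent adds: on an in-order
   core the adds wait behind the divide; an out-of-order core can begin
   them immediately and hide much of the divide's latency. */
double ooo_demo(double a, double b, const double c[4]) {
    double q = a / b;        /* long-latency operation (e.g. divsd) */
    double s = c[0] + c[1];  /* independent of q: may issue early */
    s += c[2] + c[3];        /* still independent of q */
    return q + s;            /* join point: needs both results */
}
```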

3 Previous Work

We explored the existing CPU benchmarks and searched for work done in uArch-level measurements. Some of the existing work is described below.

The SPEC suite runs 17 floating point benchmarks and 12 integer benchmarks to give a single numerical score for each. These benchmarks provide only an overall performance score; there is no study of the effect of the uArch. It is an accepted industry standard, currently in its third increment, and is portable but proprietary.

Among the uArch-level benchmarks, the most notable is the STREAM benchmark for evaluation of the memory hierarchy. Others, like Cachebench and LLCbench, exist but focus only on cache memory. There are micro-architecture benchmarks concerned with the other aspects as well, such as the experiments by Krishnaswami and Scherson[] and Wenisch et al.[]. However, all of these involve CPU simulation and do not run code directly on the actual machine. There is significant work in this area, like the code written by Dusseau[] to evaluate the memory hierarchy, which is in turn based on the paper by Saavedra-Barrera[]; this has been helpful to us. Such a uArch benchmarking suite has been developed for evaluating NVIDIA GPUs by Wong et al.[], but a similarly comprehensive suite for CPUs was not found.

4 Proposed Approach

First, thorough research will be done into the internal workings of various uArch implementations. The most prominent features will be chosen for evaluation. Programs will be written to isolate these features, and accurate timing mechanisms will be used for evaluating performance. Comparisons with calculated estimates will be used for verification. Comparisons of performance on various CPUs can then be used to evaluate their uArchs. The end product is envisaged to be a suite of portable programs, along with scripts which will allow the user to automate the testing process.

5 Work Progress Report

Among the uArch features listed in section two, work to date has mainly been done on the memory hierarchy and the functional units (both scalar and SIMD). Other features, like the Out of Order Execution engine and the branch predictor, are under study. A deeper literature survey is underway.

5.1 Memory Hierarchy

In general, in a multi-level cache, the speed of memory access increases with the level while the size decreases. Hence, if data is in a higher-level cache, performance will be higher than when it is in a lower-level cache, which in turn is higher than when the data is not in cache at all and has to be brought in from main memory. A program was written to isolate and evaluate the performance of these three phases.

The working of this program is simple. An array of double precision floating point numbers is declared, with a size slightly larger than the highest-level cache. It is initialised to dummy values, sequentially; in this way, the last elements to be initialised are in the registers at the end of the initialisation. After this, the array is accessed backwards. Since all the elements at the end of the array are now in L1 cache, we effectively isolate this mode. Thus, working backward through the array, we isolate each level of cache and, finally, main memory. Some of the newer Intel uArchs have a hardware prefetch mechanism; to obtain worst-case performance, code was written specifically to defeat the prefetch logic.

The program has been coded in C, with some of the performance measurement parts in x86 assembly. Time is measured using the RDTSC instruction and the performance is reported. (It was initially measured with the Performance API, but portability requirements dictated the move to RDTSC.) This code was run on different machines, namely Intel Pentium 4 (NetBurst uArch), Core 2 Duo (Core uArch), the Eka supercomputer nodes at CRL, which are Intel Xeon nodes (Clovertown uArch), and Core i7 (Nehalem uArch). The output of the program agrees with the calculated values. However, this code requires prior knowledge of the cache size and number of levels. In contrast, the Dusseau program is more diagnostic and needs no prior information except cache size. Further improvements are underway.

5.2 Functional Units

This is more straightforward than the cache evaluation program. A kernel is written in C and compiled with the -S option to give assembly output. The kernel is then filled with the desired arithmetic instructions: two operands are loaded into registers, followed by an arithmetic instruction (e.g. divsd, addsd, or SIMD instructions) repeated 10,000 to 100,000 times. Any instruction can be inserted here. This serves to exercise and evaluate the functional units where these instructions are executed. Care is taken to remove dependencies between successive instructions so as to measure the reciprocal throughput of the instructions. The latency of the instructions has also been measured. The results concur with the available numbers from the paper by Agner Fog[].

6 Future Course of Action

The next step is improving the cache evaluation code to require minimal prior knowledge of the system memory hierarchy. The effects of other basic cache features, like set associativity, block size, and write policies, have to be accounted for. Virtual memory features, like the TLB, also have to be taken into consideration. Pipelining, branch prediction, and the other chief micro-architectural features will be handled along the same lines, differing only in details.
