Using Intel VTune Performance Analyzer to Optimize Software on Intel Core i7 Processors
Contents
Introduction
Intel Core i7 Processor Features
Intel Core i7 Processor Performance Monitoring Unit
Complexities of Performance Measurement
The Software Optimization Process
Step 1: Identify the Hotspots
Step 2: Determine Efficiency
  Efficiency Method 1: % Execution Stalled
  Efficiency Method 2: Changes in Cycles per Instruction (CPI)
  Efficiency Method 3: Code Examination
    Code Examination Issue 1: Failure to Use Intel SSE Instructions for Floating Point
    Code Examination Issue 2: Failure to Use New Intel SSE4.2 Instructions
Step 3: Identify Architectural Reason for Inefficiency
  Cache Misses
  Branch Mispredicts
  Front-End Stalls
  Address Aliasing
    4K False Store Forwarding
    Set Associative Cache Issue
    Memory Bank Issues
  Long Latency Instructions and Exceptions
  DTLB Misses
For More Information
Introduction
This article is designed for developers working on an Intel Core i7 platform and using the Intel VTune Performance Analyzer. Software optimization should begin only after you have:
- Utilized any compiler optimization options (/O2, /QxSSE4.2, etc.)
- Chosen an appropriate workload
- Measured baseline performance
If you are new to the Intel VTune analyzer, see the companion article, Intel VTune Performance Analyzer How-To Guide. The performance information here applies to other tools as well, such as PTU, but is focused on the Intel VTune analyzer.
Intel Turbo Boost Technology adds complexity to performance analysis; if you can, it is easier to turn it off while analyzing. The integrated memory controller, Smart Memory Access, the Intel QuickPath Interconnect (QPI), and the 8 MB Intel Smart Cache add more things that can be, or need to be, measured on Nehalem.
- 3 fixed-function performance counters
- Measure more events than ever before: hundreds of different things can now be measured, though only a few are commonly required for performance insight
- Measure core and uncore events
- Gather more information for certain events (e.g., load latency)

Notes: The 3 fixed-function performance counters are:
- INST_RETIRED.ANY
- CPU_CLK_UNHALTED.THREAD
- CPU_CLK_UNHALTED.REF
Each of these counters has a corresponding programmable version (which will use a general-purpose counter): INST_RETIRED.ANY_P, CPU_CLK_UNHALTED.THREAD_P, and CPU_CLK_UNHALTED.REF_P. The new capability to measure uncore events is not supported by the Intel VTune analyzer, nor is the extra information gathered when measuring certain events (e.g., the load latency recorded in the PEBS record).
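The two fixed counters above give the basic CPI calculation directly. A minimal sketch (the function name and the sample counts are hypothetical, not VTune output):

```python
def cpi(cpu_clk_unhalted_thread, inst_retired_any):
    # cycles per instruction, from the two fixed-function counters
    return cpu_clk_unhalted_thread / inst_retired_any

# hypothetical sample counts from a profiling run
print(cpi(2_000_000, 1_000_000))  # 2.0
```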
The % execution stalled and changes-in-CPI methods rely on the Intel VTune analyzer's sampling. The code examination method relies on using the Intel VTune analyzer's capability as a source/disassembly viewer.

Notes: Although the Intel VTune analyzer can also be helpful in threading an application, this discussion is not aimed at the process of introducing threads; the process described here can be used either before or after threading. Intel Thread Profiler, another performance analysis tool included with the Intel VTune Performance Analyzer for Windows*, can help determine the efficiency of a parallel algorithm. Remember to focus only on hotspots: only try to determine efficiency, identify causes, and optimize in hotspots.
Next:
- < 10% stall cycles is good: focus on code reduction
- 10%-50% for client applications: worth investigating stall reduction
- 50%-80% for server applications: worth investigating stall reduction

Notes: The UOPS_EXECUTED.CORE_STALL_CYCLES counter measures when the execution stage of the pipeline is stalled. This counter counts per core, not per thread, which has implications for performance analysis with Intel Hyper-Threading Technology enabled. For example, with Intel Hyper-Threading enabled, you won't be able to tell whether the stall cycles are evenly distributed across the threads or occurring on just one thread. For this reason it can be easier to interpret performance data taken while Intel Hyper-Threading is disabled. (Be sure to enable it again after optimizations are completed.) With Intel Hyper-Threading enabled, there are actually two different definitions of CPI; we call them per-thread CPI and per-core CPI, and the per-thread CPI will be twice the per-core CPI. Only convert between per-thread and per-core CPI when viewing aggregate CPIs (summed for all logical threads). Optimized code (e.g., using SSE instructions) may actually raise the CPI and increase the stall percentage, yet still improve performance. CPI and stalls provide only general guidance on efficiency: the real measure of efficiency is the work taking less time, and "work" is not something the CPU can measure.
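The stall-percentage and per-thread/per-core CPI arithmetic above can be sketched as follows; the function names and example counts are hypothetical, not part of VTune:

```python
def stall_pct(core_stall_cycles, unhalted_cycles):
    # UOPS_EXECUTED.CORE_STALL_CYCLES as a percentage of
    # CPU_CLK_UNHALTED.THREAD
    return core_stall_cycles / unhalted_cycles * 100.0

def per_core_cpi(per_thread_cpi):
    # with Hyper-Threading enabled, aggregate per-thread CPI is twice
    # the per-core CPI; valid only for aggregate (summed) CPIs
    return per_thread_cpi / 2.0

# hypothetical counts: 40,000 stall cycles out of 100,000 unhalted cycles
print(stall_pct(40_000, 100_000))  # 40.0
```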
Code Examination Issue 1: Failure to Use Intel SSE Instructions for Floating Point
Why: Using packed Intel Streaming SIMD Extensions (SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2) instructions has the potential to greatly increase floating-point performance. Compilers and libraries may not use the latest instructions by default.
How: Drill down to Source View on floating-point loops and examine the assembly.
Next: Look for floating-point math assembly that is not using p(acked) instructions with xmm registers. For example:
Code Examination Issue 2: Failure to Use New Intel SSE4.2 Instructions

- PCMPGTQ
Next: If Intel SSE4.2 is not used, try any of the following:
- Libraries, e.g., Intel Integrated Performance Primitives (Intel IPP) and/or Intel Math Kernel Library (Intel MKL)
- Intel Compiler 11.0: use /QxSSE4.2 (Linux*: -xSSE4.2). Note: some LIBM and CRT library functions have been enhanced in the 11.0 compiler to use Intel SSE4.2 (strlen, for example)
- Switches for gcc (-msse4.2) or the Microsoft compiler (/arch:SSE2)
- Rewrite in assembly/intrinsics
Cache Misses
Why: Cache misses raise the CPI of an application. Focus on long-latency data accesses coming from second- and third-level misses.
How: MEM_LOAD_RETIRED.LLC_MISS, MEM_LOAD_RETIRED.LLC_UNSHARED_HIT, MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM
Next: Within a hotspot, estimate the % of cycles due to long-latency data access:
- For third-level misses: ((MEM_LOAD_RETIRED.LLC_MISS * 180) / CPU_CLK_UNHALTED.THREAD) * 100
- For second-level misses: (((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100
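The two estimates above can be written as small helpers. The 180, 35, and 74 cycle costs come straight from the formulas in the text; the function names and sample counts are hypothetical:

```python
def llc_miss_cost_pct(llc_miss, cycles):
    # MEM_LOAD_RETIRED.LLC_MISS, costed at ~180 cycles per
    # third-level miss, as a percentage of CPU_CLK_UNHALTED.THREAD
    return (llc_miss * 180) / cycles * 100

def l2_miss_cost_pct(llc_unshared_hit, other_core_hit_hitm, cycles):
    # second-level misses: ~35 cycles for an unshared LLC hit,
    # ~74 cycles when another core's L2 holds the line (HIT/HITM)
    return ((llc_unshared_hit * 35) + (other_core_hit_hitm * 74)) / cycles * 100

# hypothetical counts: 1,000 LLC misses in 900,000 cycles
print(llc_miss_cost_pct(1_000, 900_000))  # 20.0 -> worth reducing misses
```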
If the percentage is significant (> 20%), consider reducing misses: use software prefetch instructions, data blocking, local variables for threads, padding data structures to cache-line boundaries, or changing your algorithm to reduce data storage.
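One of the remedies listed above, data blocking, can be illustrated with a short sketch (Python here purely for readability; in performance-sensitive code this would be C/C++ or Fortran, and the block size would be tuned to the cache):

```python
def sum_blocked(matrix, block=64):
    # visit the square matrix in block x block tiles, so each tile is
    # fully reused while it is still resident in cache, instead of
    # streaming across whole rows and evicting data before reuse
    n = len(matrix)
    total = 0
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    total += matrix[i][j]
    return total
```

The traversal order changes, but every element is still visited exactly once, so the result is identical to a plain double loop.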
- Try using the /QxSSE4.2 switch on the Intel Compiler
- Try using profile-guided optimizations with your compiler
- Use linker ordering techniques (/ORDER on Microsoft's linker, or a linker script on gcc)
- Use switches that reduce code size, such as /O1 or /Os

Notes: The pipeline of the Intel Core i7 processor consists of many stages. Some stages are collectively referred to as the front end; these are the stages responsible for fetching, decoding, and issuing instructions and micro-operations. The back end contains the stages responsible for dispatching, executing, and retiring instructions and micro-operations. These two parts of the pipeline are decoupled and separated by a buffer (in fact, buffering is used throughout the pipeline), so when the front end is stalled, it does not necessarily mean the back end is stalled. Front-end stalls are a problem when they cause instruction starvation, meaning no micro-operations are being issued to the back end. If instruction starvation occurs for a long enough period of time, the back end may stall.
Branch Mispredicts
Why: Mispredicted branches cause pipeline inefficiencies through wasted work or instruction starvation (while waiting for new instructions to be fetched).
How: UOPS_ISSUED.ANY, UOPS_RETIRED.ANY, BR_INST_EXEC.ANY, BR_MISP_EXEC.ANY, RESOURCE_STALLS.ANY
Next:
- (UOPS_ISSUED.ANY - UOPS_RETIRED.ANY) / UOPS_ISSUED.ANY tells you the fraction of execution that was wasted (due to mispredictions)
- Instruction starvation = (UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY) / CPU_CLK_UNHALTED.THREAD
- If you have significant wasted work (> 0.1) or instruction starvation (> 0.1), examine the branch misprediction percentage: (BR_MISP_EXEC.ANY / BR_INST_EXEC.ANY) * 100
- Try to reduce the misprediction impact with better code generation (compiler, profile-guided optimizations, or hand tuning)

Notes: All applications will have some branch mispredicts. Up to 10% of branches being mispredicted could be considered normal, depending on the application; anything above 10% should be investigated. With branch mispredicts, it is not the number of mispredicts that is the problem, but their impact. The equations above show how to determine the impact of mispredicts on the front end (instruction starvation) and on the back end (wasted execution cycles given to wrongly speculated code).
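The three metrics above can be sketched as helpers, assuming wasted work is the fraction of issued uops that never retire; all function names and counts are hypothetical:

```python
def wasted_work(uops_issued, uops_retired):
    # fraction of issued uops discarded, e.g. due to branch mispredictions
    return (uops_issued - uops_retired) / uops_issued

def instruction_starvation(issue_stall_cycles, resource_stall_cycles, cycles):
    # issue-stage stall cycles not explained by back-end resource stalls,
    # as a fraction of CPU_CLK_UNHALTED.THREAD
    return (issue_stall_cycles - resource_stall_cycles) / cycles

def mispredict_pct(br_misp_exec, br_inst_exec):
    # BR_MISP_EXEC.ANY / BR_INST_EXEC.ANY as a percentage
    return br_misp_exec / br_inst_exec * 100
```

With hypothetical counts of 1,000 issued and 900 retired uops, `wasted_work` gives 0.1, right at the threshold where the article suggests investigating the misprediction percentage.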
Address Aliasing
Why: Having multiple accesses to data that are a large power of two (2^N; see example: 4K) apart will result in inefficiencies that manifest themselves in multiple ways, such as:
- Set-associative cache conflicts
- 4K false store forwarding
How: PARTIAL_ADDRESS_ALIAS
Next:
- ((PARTIAL_ADDRESS_ALIAS * 3) / CPU_CLK_UNHALTED.THREAD) * 100 tells you the fraction of execution that was wasted (due to 4K false store forwarding)
- If the impact is significant (> 10%), change the layout of the data, as it is also likely to cause significant extra data-cache misses
- Change the data layout or data access patterns so that your software isn't skipping through data at large 2^N strides

Notes: This counter only measures 4K false store forwarding, but if 4K false store forwarding is detected, your code is probably also being impacted by set-associative cache issues and potentially even memory bank issues, all resulting in more cache misses and slower data access.
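The cost estimate uses the 3-cycle penalty from the formula above, and the address check mirrors the mechanism behind 4K aliasing (the early store-forwarding check compares only low address bits). Function names and the exact bit count are illustrative assumptions:

```python
def alias_cost_pct(partial_address_alias, cycles):
    # PARTIAL_ADDRESS_ALIAS events, costed at ~3 cycles each,
    # as a percentage of CPU_CLK_UNHALTED.THREAD
    return (partial_address_alias * 3) / cycles * 100

def may_false_forward(store_addr, load_addr):
    # assuming the early check compares only the low 12 address bits:
    # distinct addresses an exact multiple of 4 KB apart look identical
    # at that stage and can trigger a false store-forward
    return (store_addr & 0xFFF) == (load_addr & 0xFFF) and store_addr != load_addr
```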
Front-End Stalls
Why: Front-end stalls (at the issue stage of the pipeline) may cause instruction starvation, which may lead to stalls at the execute stage of the pipeline.
How: RESOURCE_STALLS.ANY, UOPS_ISSUED.CORE_STALL_CYCLES
Next:
- If (UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY) / CPU_CLK_UNHALTED.THREAD > 0.1, you have significant instruction starvation
- This category of stalls can be fixed with better code generation and layout techniques, such as those listed earlier (compiler switches, profile-guided optimization, linker ordering, and code-size reduction)
4K False Store Forwarding
To speed things up when processing stores and loads, the CPU uses a subset of the address bits to decide whether the store-forwarding optimization can occur; if those bits match, it starts to process the load. Later, it fully resolves the addresses and may determine that the load is not from the same address as the store, forcing the processor to fetch the appropriate data.

Set Associative Cache Issue
Set-associative caches are organized into sets of ways. If two data elements sit in memory at addresses 2^N apart such that they fall in the same set, they will take two entries of that set in a 4-way set-associative cache. A fifth access that tries to occupy the same set will force one of the other entries out of the cache, turning a large multi-KB/MB cache into an N-entry cache. The level one, two, and three caches of different processor SKUs have different cache sizes, line sizes, and associativities. Skipping through memory at exact 2^N boundaries (2K, 4K, 16K, ... 256K) will cause the cache to evict entries more quickly; the exact size depends on the processor and the cache.

Memory Bank Issues
Typically (and for best performance), computers are installed with multiple DIMMs of RAM, because no single DIMM can keep up with requests from the CPU. The system determines which data goes to which DIMM via some bits in the address. Skipping through memory at large 2^N strides may cause all or most of the accesses to hit the same DIMM.
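A common fix for the set-associative and DIMM-bank problems above is to pad row strides away from exact powers of two. A sketch under stated assumptions (the one-cache-line pad and the 64-byte line size are typical choices, not prescriptions from this article):

```python
CACHE_LINE = 64  # assumed line size in bytes

def padded_row_bytes(row_bytes):
    # if the row stride is an exact power of two, successive rows map to
    # the same cache sets (and possibly the same DIMM); padding by one
    # cache line breaks the 2^N stride
    if row_bytes >= 2 and row_bytes & (row_bytes - 1) == 0:
        return row_bytes + CACHE_LINE
    return row_bytes

print(padded_row_bytes(4096))  # 4160: a 4K stride gets padded
print(padded_row_bytes(4000))  # 4000: non-power-of-two strides are left alone
```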
Notes: FP_ASSIST.ALL: it would be nice to use this counter, but it only works for x87 code (some of the manuals misreport that this event works on SSE code). UOPS_DECODED.MS counts the number of uops decoded by the Microcode Sequencer (MS). The MS delivers uops when an instruction is more than 4 uops long or when a microcode assist is occurring. There is no public, definitive list of the events/instructions that will cause this to fire. If there is a high percentage of UOPS_DECODED.MS, try to deduce why the processor is issuing extra instructions or causing an exception in your code. One uop does not necessarily equal one cycle, but the result is close enough to determine whether it is affecting your overall time.
DTLB Misses
Why: DTLB load misses that require a page walk can affect your application's performance.
How: DTLB_LOAD_MISSES.WALK_COMPLETED
Next:
- Estimate the impact of TLB misses on your app: ((DTLB_LOAD_MISSES.WALK_COMPLETED * 30) / CPU_CLK_UNHALTED.THREAD) * 100
- If the impact is significant (> 5-10%), optimize the functions with high DTLB misses
- Possible optimizations: target data locality to the TLB size, use Extended Page Tables (EPT) on virtualized systems, try large pages (database/server apps only)

Notes: Targeting data locality to the TLB size is accomplished via data blocking and minimizing random access patterns. DTLB problems are more likely to occur in server applications or applications with a large, randomly accessed dataset.
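The DTLB impact estimate above can be sketched as follows; the 30-cycle walk cost is the article's estimate, and the function name and sample counts are hypothetical:

```python
def dtlb_walk_cost_pct(walks_completed, cycles, walk_cycles=30):
    # DTLB_LOAD_MISSES.WALK_COMPLETED, costed at ~30 cycles per completed
    # page walk, as a percentage of CPU_CLK_UNHALTED.THREAD
    return (walks_completed * walk_cycles) / cycles * 100

# hypothetical counts: 1,000 completed walks in 600,000 cycles -> 5%,
# right at the threshold where optimization becomes worthwhile
print(dtlb_walk_cost_pct(1_000, 600_000))
```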
2009-2010, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. 0209/BLA/CMD/PDF 321520-001