
PTU Intel Performance Tuning Utility

Intel PTU Sample View


Basic block grouping and execution flow analysis

10/23/2009

Core 2 Architecture - Basic Blocks


[Block diagram of the Core 2 pipeline. Labels shown:
- Front end: BPU, Next IP, Instruction Fetch, I$, BAC, Length Decoding (6 inst/clk), IQ / LSD (18 entries), Decoders 4:1:1:1 (7 ops/clk), MS, IDQ, Alloc & Rename (4 ops/clk)
- Out-of-order engine: Reservation Station (32 entries, dispatch 6 ops/clk), Result Bus, ReOrder Buffer (96 entries, 4 ops/clk), RRF
- Execution ports: Port 0 ALU/SSE FMul, Port 1 ALU/SSE FAdd, Port 5 ALU/SSE Branch, Port 2 loads, Port 3 STA, Port 4 STD
- Memory: DCU (32 KB), DTLB, MOB (32 LB + 20 SB), FB (8 entries), DPL, Router, L2Q/XQ, L2, FSB
- Other figures in the diagram: 32 bytes/clk, 16 bytes/clk, 11 entries]

Execution
[Diagram of the execution units fed by the reservation station (32 entries) through Ports 0, 1, 5, 2, 3 and 4. Labels shown:
- INT: ALU1, ALU2, ALU3, SHIFT1, SHIFT2, LEA, IMUL, JEU
- SIMD: SIALU1, SIALU2, SISHIFT, SISHUF, SIMUL
- FP: FMUL, FADD, FPROM, FDIV, FSHUF
- Memory: LOAD AGU, STORE AGU, STORE DATA (MIU), MOB with LB (32 entries) and SB (20 entries), DTLB, PMH, DCU, L2 / FSB
- Result Bus, ROB]

Core Architecture PMU


The PMU of the Intel Core architecture offers some 700 events (including sub-events), compared to some 140 on the NetBurst architecture

The event counters (5 in total) are not identical
- Three frequently used (basic) events, INST_RETIRED.ANY, CPU_CLK_UNHALTED.CORE and CPU_CLK_UNHALTED.REF, have dedicated counter registers (1:1 mapping)
- Some events are counted only by PMC0 (precise and at-retirement events other than the three events above)
- All other events are counted by either PMC0 or PMC1

The counter structure allows collection of up to 5 events in one run; depending on the event selection, however, typically only 2-3 fit
This is important to remember when running VTune/PTU sampling
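A minimal sketch of why an event selection may need more than one run, assuming the counter model described on this slide: three dedicated counters plus one PMC0 and one PMC1 slot per run. The function `plan_runs`, the `pmc0_only` set passed by the caller, and the placeholder event names EVENT_A and EVENT_B are illustrative assumptions, not PTU's actual scheduler.

```python
# The three events with dedicated counter registers (from the slide).
FIXED_EVENTS = {"INST_RETIRED.ANY", "CPU_CLK_UNHALTED.CORE", "CPU_CLK_UNHALTED.REF"}


def plan_runs(events, pmc0_only):
    """Greedily pack events into runs of one PMC0 slot and one PMC1 slot.

    events    -- event names requested by the user
    pmc0_only -- names that may only be counted on PMC0 (caller-supplied assumption)
    """
    fixed = [e for e in events if e in FIXED_EVENTS]
    only0 = [e for e in events if e in pmc0_only and e not in FIXED_EVENTS]
    general = [e for e in events if e not in FIXED_EVENTS and e not in pmc0_only]

    runs = []
    while only0 or general:
        run = {}
        # PMC0 goes to a PMC0-only event first, otherwise to a general event.
        run["PMC0"] = only0.pop(0) if only0 else general.pop(0)
        # PMC1 can only take a general event.
        if general:
            run["PMC1"] = general.pop(0)
        runs.append(run)

    if not runs and fixed:
        runs.append({})          # only dedicated-counter events were requested
    for run in runs:
        run["FIXED"] = list(fixed)   # dedicated counters are available in every run
    return runs


runs = plan_runs(
    ["INST_RETIRED.ANY", "CPU_CLK_UNHALTED.CORE",
     "MEM_LOAD_RETIRED.L2_LINE_MISS", "EVENT_A", "EVENT_B"],
    pmc0_only={"MEM_LOAD_RETIRED.L2_LINE_MISS"},
)
for number, run in enumerate(runs, start=1):
    print(number, run)
```

In this example the two dedicated-counter events ride along in every run, while the three remaining events need two runs because only two general-purpose slots exist.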

PTU Background and Motivation


A replacement for Intel VTune Performance Analyzer?
- No! It is an alternative
- A lot of components are shared (e.g. Linux* sampling drivers)
- A free tool, but it requires an Intel VTune Performance Analyzer license

A tool for experienced developers, inside and outside Intel
- Less focus on user guidance and help functionality than VTune
- Focus is on getting results fast; pragmatic view
- No registration/RPM/registry: simply unpack and start!
- Same look and feel, same GUI on all platforms (Eclipse-based)
- Has been an Intel-internal tool only, but is now available to everybody!

A test environment for sophisticated new features and prototypes that may become available in VTune only in future releases
- Includes pre-launch platform support

Make simple things easy, make extraordinary possible

Intel PTU Analysis Methods at a Glance


Event Based Sampling (PMU hardware events)
- Optional event multiplexing: more events collected in one run
- Includes sophisticated Data Access Profiling
- Includes features to compare multiple runs

Statistical Call Graph Sampling (with loop detection)
- No instrumentation, very low overhead
- Statistical (pragmatic) approach, thus not 100% precise

Instrumentation-based Call Graph/Count Analysis
- Instrumentation similar to VTune Callgraph but using PIN technology
- Intrusive (slowdown by a factor of 2 or more) but 100% precise

Heap / Memory Access Profiling
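The slides do not say how PTU scales multiplexed counts. As a hedged illustration of the general idea, a common approach is to extrapolate each count by the fraction of time the event actually occupied a counter. The function name and parameters below are assumptions made for this sketch.

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate a full-run count for an event that was counted only part of the time.

    raw_count    -- count observed while the event occupied a counter
    time_enabled -- total measurement time
    time_running -- time the event was actually being counted
    """
    if time_running <= 0:
        raise ValueError("event was never scheduled on a counter")
    return raw_count * time_enabled / time_running


# An event counted during 2.5 s of a 10 s run, with 1000 occurrences observed.
estimate = scale_multiplexed(1000, time_enabled=10.0, time_running=2.5)
print(estimate)
```

The estimate is only as good as the assumption that the event rate is similar during the intervals that were not measured.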

Structural Components
Data collectors: command utilities
- Driver based and non-driver based collectors
- Collection controllers

Data viewers: command utilities
- Data converters
- Symbol resolver
- Filtering, comparing and displaying

Eclipse-based GUI (a plug-in for Eclipse)
- Windows* and Linux*: single look and feel
- Pre-built Eclipse package is part of the distribution
* Other names and brands may be claimed as the property of others.

PTU Architecture High Level Overview

[Architecture diagram. Components shown: Eclipse* GUI (configuration, collection command line, queries, data); command line collectors (vtsarun, vtssrun); experiment directory; command line viewers (vtsaview, vtsadiff, vtssview); data conversion (symbols, tb5 -> database); SEP; kernel driver; PMU hardware.]

Command Line utilities


Collectors                                  Viewers
vtsarun  Sampling collector                 vtsaview  Sampling viewer
                                            vtdpview  Data profiler viewer
                                            vtsadiff  Hotspot difference viewer
vtssrun  Statistical call graph collector   vtssview  Statistical call graph viewer
vtcgrun  Exact call graph collector         vtcgview  Exact call graph viewer
vthprun  Heap profiler collector            vthpview  Heap profiler viewer

Command line utilities can be used independently of the graphical interface
GUI functionality is completely based on these batch applications

Some Interesting Usability Features


Basic blocks as underlying data representation
- Aggregation and manipulation per basic block

Drill-down to source, disassembly and control flow graph view for a procedure
- All three can be displayed simultaneously

Results difference (text mode, CSV export, GUI mode)
- GUI for different binaries, and for identical sources

Ability to modify CPU frequency via GUI (CPU Frequency Changing)
Configuration per PMU, export/import
Sampling Hotspot View expansion to show events per CPU
Event over IP chart
Assembler display choice (Intel versus AT&T)
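Per-basic-block aggregation can be sketched as follows, assuming the start addresses of a procedure's basic blocks are already known (for example from a disassembler) and that the blocks are contiguous. The function name and the addresses are hypothetical; this is not PTU's internal representation.

```python
from bisect import bisect_right
from collections import Counter


def samples_per_block(block_starts, sample_ips):
    """Attribute each sampled instruction pointer to the basic block containing it.

    block_starts -- start addresses of the basic blocks of one procedure
    sample_ips   -- instruction pointers recorded by the sampling collector
    """
    starts = sorted(block_starts)
    counts = Counter()
    for ip in sample_ips:
        # Index of the last block that starts at or before this address.
        i = bisect_right(starts, ip) - 1
        if i >= 0:                      # ignore samples below the first block
            counts[starts[i]] += 1
    return counts


blocks = [0x400, 0x410, 0x42C]
ips = [0x400, 0x404, 0x410, 0x42C, 0x430, 0x3FF]
per_block = samples_per_block(blocks, ips)
for start in blocks:
    print(hex(start), per_block[start])
```

A real implementation would also need the end address of the last block to reject samples that fall beyond the procedure.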

PTU Workspace & Projects


Main concepts
- Eclipse* Workspace
- Project
- Experiment
- Modules (for experiment sharing)
- Sources (for experiment sharing)

A Project defines the application and its workload (parameters)
Simply zipping up an experiment creates a self-contained test case

A PTU Project is a workload specification

PTU Profile Configurations



A Project is the definition of the workload
A Profile Configuration is the definition of the collection method
- You apply a Profile Configuration to a Project to collect an Experiment
- The GUI generates a command line for the collector and invokes it

Workspace, Project and Profile Configuration exist in the GUI world only
- Command line tools know only about Experiments

5 predefined configurations come with the PTU distribution
- Basic Statistical Callgraph, Basic Sampling and others
- You can create your own configurations as well

Project + Profile Configuration produces the Experiment

Project vs. Profiling Configuration


[Diagram: Configurations (Configuration 1, Configuration 2) applied to Projects (Project 1, Project 2, Project 3), each combination yielding an Experiment Result.]

A Profiling Configuration is orthogonal to the Project concept
A Profiling Configuration created by the user can be applied to any Project
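The orthogonality of Projects and Profiling Configurations can be pictured with a small data model. This is only a sketch: the class names, fields and the application names `./matmul` and `./solver` are hypothetical, and PTU's actual storage format is not described in these slides. The collector names are the ones listed on the command line utilities slide.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Project:
    """Workload definition: the application and its parameters."""
    application: str
    arguments: tuple = ()


@dataclass(frozen=True)
class ProfileConfiguration:
    """Collection method definition."""
    name: str
    collector: str


@dataclass(frozen=True)
class Experiment:
    """Result of applying one configuration to one project."""
    project: Project
    configuration: ProfileConfiguration


projects = [Project("./matmul", ("1024",)), Project("./solver")]
configurations = [
    ProfileConfiguration("Basic Sampling", "vtsarun"),
    ProfileConfiguration("Basic Statistical Callgraph", "vtssrun"),
]

# Any configuration can be applied to any project.
experiments = [Experiment(p, c) for p, c in product(projects, configurations)]
for e in experiments:
    print(e.project.application, "+", e.configuration.name)
```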

Event based Sampling


- Use the predefined Basic Sampling configuration, or create your own with the events you need in the experiment
- Granularity in the base display is function, module, basic block or instruction (RVA)
- Standard drill-down to source/asm by row in the spreadsheet

Add the Control Flow Graph with the Tool bar Button


All Three Views at once


Events over IP histogram


In the Hotspot view, right-click the row of interest; the appropriate menu item invokes the Events over IP histogram for the selected module

Drill down from a highlighted region of the histogram to the source and/or disassembly view
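An events-over-IP histogram amounts to binning the sampled instruction pointers of one module by address. The sketch below assumes the module's load address and size are known; the function name, the bin size and the sample addresses are illustrative assumptions.

```python
from collections import Counter


def events_over_ip(sample_ips, module_base, module_size, bin_size=16):
    """Bin sampled instruction pointers of one module by relative address (RVA)."""
    histogram = Counter()
    for ip in sample_ips:
        rva = ip - module_base
        if 0 <= rva < module_size:                  # keep samples inside the module
            histogram[rva // bin_size * bin_size] += 1
    return sorted(histogram.items())


ips = [0x400000, 0x400004, 0x400010, 0x400011, 0x40001F, 0x500000, 0x3FFFFF]
bins = events_over_ip(ips, module_base=0x400000, module_size=0x100)
for rva, count in bins:
    print(f"{rva:#06x} {'#' * count}")
```

Selecting a range of bins then yields the RVA interval to open in the source or disassembly view.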

Data Access Profiling


Intel Pentium 4 processors, the Intel Core2 processor family and the Itanium processor family have Precise Event Based Sampling (PEBS)
- Precise events have hardware to capture the Instruction Pointer of the instruction that caused the event
- Ex: on the Intel Core2 processor family, an L2 load miss that retrieves a cache line can be identified with MEM_LOAD_RETIRED.L2_LINE_MISS

Further, the values of all registers are also captured in a buffer
- At completion of the instruction on the Intel Core2 processor family
- Prior to execution of the instruction on Pentium 4 processors

Coupled with the disassembly, the addresses of load and store instructions can be reconstructed
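Reconstructing an address from a captured register set follows the x86 addressing formula base + index * scale + displacement. The sketch below assumes the memory operand has already been decoded from the disassembly; the function name, the register values and the example instruction are hypothetical.

```python
def effective_address(regs, base=None, index=None, scale=1, disp=0, bits=64):
    """Compute base + index * scale + disp from captured register values.

    regs -- mapping of register name to the value captured in the sample
    """
    address = disp
    if base is not None:
        address += regs[base]
    if index is not None:
        address += regs[index] * scale
    return address & ((1 << bits) - 1)      # wrap to the address width


# Operand of a hypothetical load: mov rax, [rbx + rcx*8 + 0x10]
captured = {"rbx": 0x7F0000001000, "rcx": 3}
address = effective_address(captured, base="rbx", index="rcx", scale=8, disp=0x10)
print(hex(address))
```

Because the Core2 family captures registers at completion of the instruction, a load that overwrites its own base or index register no longer holds the input value, so its address cannot be recovered this way.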

Basic Data Access Profiling: Easy to Use Pre-Defined Event Set

Data Access Result: Heavy Contention on the Marked Line. Multiple Threads Accessing Different Offsets Indicate False Sharing
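The false sharing pattern on this slide can be sketched as a check over (thread, address) samples: a cache line touched by several threads, each at its own offsets, is a candidate. The function name and the sample data are assumptions; the 64-byte line size matches Core 2.

```python
from collections import defaultdict

CACHE_LINE = 64  # bytes per cache line on Core 2


def false_sharing_candidates(samples, line_size=CACHE_LINE):
    """Return start addresses of lines where threads touch disjoint offsets.

    samples -- iterable of (thread_id, data_address) pairs
    """
    offsets = defaultdict(lambda: defaultdict(set))   # line -> thread -> offsets
    for thread_id, address in samples:
        offsets[address // line_size][thread_id].add(address % line_size)

    candidates = []
    for line, per_thread in offsets.items():
        threads = list(per_thread)
        if len(threads) < 2:
            continue                                  # only one thread: no sharing
        disjoint = all(
            per_thread[a].isdisjoint(per_thread[b])
            for i, a in enumerate(threads)
            for b in threads[i + 1:]
        )
        if disjoint:                                  # same line, different data
            candidates.append(line * line_size)
    return sorted(candidates)


samples = [(1, 0x1000), (2, 0x1008),    # same line, different offsets
           (1, 0x2000), (2, 0x2000),    # same offset: true sharing
           (1, 0x3000)]                 # single thread
suspects = false_sharing_candidates(samples)
print([hex(a) for a in suspects])
```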

From one Hotspot View to two of them


This is the model that we have used so far: aggregate over time

But the reality is like this: aggregate over time and over one more dimension

Easy Navigation between both Dimensions


- Filtering: projecting slices onto another dimension
- Sorting: repositioning segments of the axes
- Applying granularity: changing the scale of the axis

Intel PTU Data Access profiling operates with two hotspot views: Code and Data hotspots
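The two hotspot views can be thought of as two aggregations of the same samples, one keyed by code and one keyed by data. A minimal sketch, assuming each sample carries a function name and a data address; the field names and sample values are hypothetical.

```python
from collections import Counter


def hotspots(samples, key, keep=lambda sample: True):
    """Aggregate samples along one dimension, optionally filtered by the other."""
    counts = Counter(key(s) for s in samples if keep(s))
    return counts.most_common()          # sorting: hottest entries first


samples = [
    {"func": "multiply", "addr": 0x1000},
    {"func": "multiply", "addr": 0x1040},
    {"func": "init", "addr": 0x1000},
    {"func": "multiply", "addr": 0x1008},
]


def cache_line(sample):
    """Granularity: report data addresses per 64-byte cache line."""
    return sample["addr"] // 64 * 64


code_view = hotspots(samples, key=lambda s: s["func"])
data_view = hotspots(samples, key=cache_line)
# Filtering: project the samples of one function onto the data dimension.
multiply_data = hotspots(samples, key=cache_line,
                         keep=lambda s: s["func"] == "multiply")
print(code_view, data_view, multiply_data)
```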

Differences of EBS Measurements


Intel PTU 3.x supports an analysis of the differences between experiments. This requires that
- Event names are the same
- Load modules have the same names

The load modules
- Can be the same, with data taken on different machines
- Can be different but built from the same source, allowing differences to be analyzed down to the source view
- Can be completely different (sources and binaries); PTU will compare functions with the same names in modules with the same names

To compare
- Single click on the first experiment
- Ctrl + single click on the second experiment
- Right click and select compare

Data blocked 2X2 unrolled Matrix Multiply compiled at -O2


Data blocked 2X2 unrolled Matrix Multiply compiled at -O3 QxT


Only Significant Difference is Cycle Count: Create Difference Display

- Control click to select 2 experiments
- Right click to select Compare Experiments

Differences of Samples: Differences in Cycles Shown in msec to Correct for Comparison of Machines at Different Frequencies
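Converting cycle counts to milliseconds before subtracting is what makes runs on machines with different clock frequencies comparable. A minimal sketch, assuming each experiment is reduced to cycle counts keyed by (module, function) and that the core frequency of each machine is known; the names and numbers are illustrative.

```python
def cycles_to_msec(cycles, frequency_hz):
    """Convert a cycle count to milliseconds at the given core frequency."""
    return cycles / frequency_hz * 1e3


def diff_experiments(first, second, first_hz, second_hz):
    """Time difference in msec (second minus first) per matching function.

    first, second -- dicts mapping (module name, function name) to cycle counts
    Only functions present in both experiments under the same names are compared.
    """
    return {
        key: cycles_to_msec(second[key], second_hz) - cycles_to_msec(first[key], first_hz)
        for key in first.keys() & second.keys()
    }


first = {("mm.exe", "multiply"): 3_000_000_000}
second = {("mm.exe", "multiply"): 1_000_000_000, ("mm.exe", "helper"): 5_000}
delta = diff_experiments(first, second, first_hz=3e9, second_hz=2e9)
print(delta)
```

Here 3 billion cycles at 3 GHz is 1000 msec and 1 billion cycles at 2 GHz is 500 msec, so the function is reported as 500 msec faster even though the raw cycle counts differ by a factor of three.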

Same Source can Display Difference per Source Line


Statistical Call Graph

Along with the standard features, it supports multithreaded applications and 32-bit binaries on 64-bit operating systems

Easy to identify the path to the hotspot
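A statistical call graph can be built from periodically sampled call stacks without any instrumentation. The sketch below assumes each sample is a list of function names from the outermost caller to the innermost callee; it illustrates the general technique and is not a description of how vtssrun works internally.

```python
from collections import Counter


def summarize_stacks(stacks):
    """Count caller->callee edges and complete call paths over sampled stacks."""
    edges = Counter()
    paths = Counter()
    for stack in stacks:
        paths[tuple(stack)] += 1
        for caller, callee in zip(stack, stack[1:]):
            edges[(caller, callee)] += 1
    return edges, paths


stacks = [
    ["main", "solve", "multiply"],
    ["main", "solve", "multiply"],
    ["main", "init"],
]
edges, paths = summarize_stacks(stacks)
hot_path, hits = paths.most_common(1)[0]
print(" -> ".join(hot_path), hits)
```

The counts are sample counts, not call counts, which is why the approach is cheap but not 100% precise.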

Exact Call Graph



Reconstructs the application's call graph by instrumenting function entry and exit points using the PIN dynamic instrumentation framework
Records each caller-callee pair together with timing and call count information
The text viewer provides the following information:
- List of functions with their parents, children and timing information
- Flat data: list of functions with timing and call information, with no relationships
- Per-thread view for multithreaded applications

Collection overhead for computationally intensive console applications: 1.8-2x

Call Count GUI View

Displays a limited set of exact call graph data: function name, module name and call count (the number of times the function was called)

Summary

Intel PTU - Easy-to-use performance debugging tool
- Focus on getting results quickly
- The tool is easy to use, but the performance methodology it supports so well requires some experience: it is not a tool for the beginner!

Using it and providing feedback / submitting bugs
- You help make it better for you and contribute to the future Performance Suite
- User Forum at whatif.intel.com

Call to Action:

Download, install and use Intel PTU!
