CODE OPTIMIZATIONS
"Optimization for scalar machines is a problem that was solved ten years ago." (David Kuck, 1990, Rice University)

Yet:
- new features present new problems
- changing costs lead to different concerns
- well-known solutions must be re-engineered
Why Optimizations?
Compiler optimization is essential for modern computer architectures.
- Without optimization, most applications would perform very poorly on modern architectures
- Even with optimization, most applications do not get a high fraction of peak performance
- Optimization techniques are also the basis for exploiting SIMD components, vector units, hyperthreading, and other forms of multithreading
    Program OhNoNotPtrsAndFuncs
    ptr p, q ; var x, y, z, w[20] ;
    do i = n, m, p
        k = 2*i + q*q
        x(i + f(n)) = x(k + q)
    y = DoSomething(x, y, z, w) ;
    .....
    x = SomeFunc(p, q, y, z) ;
    w[x + *p] = w[*q] ;

What can compilers exploit? What defeats optimization?
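Pointer and opaque-call constructs like these are exactly what defeats analysis. A hedged C sketch (names hypothetical) of how possible aliasing forces conservative code:

    /* If *p may alias an element of a, the compiler must reload *p on
       every iteration and cannot vectorize the loop.  Declaring the
       parameter as "int *restrict p" (C99) is one way to assert that
       no alias exists. */
    void add_all(int *a, int *p, int n) {
        for (int i = 0; i < n; i++)
            a[i] += *p;
    }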
Definition

Optimization will:
- produce improved code, not optimal code
- sometimes produce worse code
- achieve speedups ranging from perhaps 1.01 to 4 (or more)
Classical optimizations
- reduce the number of instructions
- reduce the cost of instructions
- change the ordering of instructions (latency)
- execute instructions in parallel
- modify data placement (registers, cache)
[Diagram: Source code → Front End → IR → Back End → Target code; error messages may be emitted by either stage]

- The front end maps legal source code into the IR
- The back end maps the IR into target machine code
- Admits multiple front ends & multiple passes
- Typically, the front end runs in O(n) or O(n log n) time, while the back end faces NP-complete problems
- Computation of arithmetic expressions, simplification of logical expressions
- Any work performed here depends on the parsing process: it is syntax-directed
- Semantic analysis begins to gather information that may help in optimization
Adapt code to the actual registers provided, or to better exploit the functional units
- Sun's Fortran compiler has ca. 4 MLOC (million lines of code)
- Open64 (Fortran/C/C++) has ca. 7 MLOC
- No one person understands the complete system
Pro64/Open64 Framework
[Diagram of the framework: F77, F90/F95, and C/C++ sources enter language-specific front ends, which all produce a common Intermediate Representation; an IR optimizer (IR Opt) and code generator (CG) then emit IA-64 object code]
Machine-dependent transformations:
1. replace a costly operation with a cheaper one (a sketch follows this list)
2. replace a sequence of instructions with a more powerful one
3. hide latency
4. improve locality
5. lower power consumption
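A hedged C sketch of point 1; the replacements are target-dependent and would be performed by the compiler, not by hand:

    /* Costly operations and the cheaper forms a compiler may substitute. */
    unsigned cheaper(unsigned x) {
        unsigned y = x * 8;    /* multiply by a power of two: x << 3        */
        unsigned z = x % 16;   /* unsigned modulo by a power of two: x & 15 */
        return y + z;
    }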
Middle End
[Diagram: Source code → Front End → IR → Middle End → IR → Back End → Target code; error messages may be emitted by every stage]
- Analyzes the IR and rewrites (or transforms) it
- The primary goal is assumed to be reducing the running time of the compiled code
- May also improve code size or power consumption
- Improvements must provably preserve the meaning of the code
Some people distinguish the middle end from the back end; some don't.
- What is optimality? Many problems are hard, intractable, or NP-complete
- Which optimizations should we use? Lots of optimizations overlap and/or interact
A few examples:
- common subexpression elimination
- constant folding
- dead code elimination
- instruction scheduling
- register allocation
- loop transformations

Analysis versus Optimization: knowledge doesn't make code run faster; changing the code can sometimes make it run faster. We use analysis to transform code.
[Diagram: the IR flows through a chain of optimization passes (Opt 1 → IR → Opt 2 → IR → Opt 3 → IR → ... → Opt n → IR); errors may be reported along the way]
There are many ways a pass can improve the code (a sketch of the first and fifth items follows this list):
- Discover and propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover and remove a redundant computation
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form
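A hedged C sketch combining the first and fifth items (constant propagation plus unreachable-code elimination); the function is illustrative:

    int f(void) {
        int debug = 0;         /* constant discovered here...          */
        int x = 4;
        int extra = 0;
        if (debug)             /* ...and propagated: the test is false */
            extra = x + 1;     /* unreachable, so it is removed        */
        return x * 2 + extra;  /* after folding: return 8;             */
    }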
[Diagram: Source code → Compiler → Target code]

For the average user, a compiler is a black box.

[Diagram: Source code → Front End → Back End → Target code]

For the informed user (or computer scientist), the compiler is structured as a front end and a back end.

A real compiler has a more complex internal structure: it consists of a series of phases that analyze and transform a program.

[Diagram: Source code → phase 1 → phase 2 → phase 3 → phase 4 → Target code]
- linear-time heuristics for hard problems
- unforeseen consequences
- multiple ways to achieve the same end
Scope of Optimization
- confined to straight-line code
- simplest to analyze
- time frame: sixties, seventies, maybe now?
- code as good as or better than an assembly programmer's
- stable, robust performance (predictability)
- architectural strengths fully exploited
- architectural weaknesses fully hidden
- broad, efficient support for language features
- instantaneous compiles

Unfortunately, modern compilers often drop the ball.
Compilers are engineered objects:
- consistent philosophy
- careful selection of transformations
- thorough application
- coordination between transformations and data structures
- attention to the results (code, time, space)
- minimize the running time of the compiled code
- minimize compile time
- use reasonable compile-time space (a serious problem)
Transformation and Optimization

Optimization improves a program by making a number of modifications to it:
- Such modifications are formalized as program transformations, often expressed as rewrite rules
- They are applied to an intermediate representation
- Some optimizations are local and may affect only a few operations
- Others may affect entire regions of a program
Analysis
Analysis establishes the conditions governing a rewrite:
- In general, an analysis of the program is needed to prove that applying a given transformation is permitted in a specific context (i.e., to prove its correctness)
- The quality of the analysis results may have a profound impact on the compiler's ability to optimize by applying transformations
Implementing Analysis
There are many trade-offs to be considered in practice:
- The ability to apply a transformation depends on whether the compiler can prove it is correct, so the quality of optimization depends on the results of analysis
- Can the information be derived from the IR? That depends on how the program was written and on what the IR represents
- Executing some analyses is very time-consuming
- The cost of analysis, in both programming effort and execution time, is a major issue in compiler design
Redundant Expression Elimination (Common Subexpression Elimination)
Reuse an address or value that has been computed previously, taking control and data dependencies into account.
    x := a + b              x := a + b
    ...              =>     ...
    y := a + b              y := x
Common Subexpression Elimination
Some transformations are well known and widely applicable.
Goal: eliminate redundant (multiple) computations. Two expressions are equivalent only if they produce the same result; this holds if they are identical and none of the operands is redefined in the intervening code.
    t3 = a * t2                t3 = a * t2
    t4 = t3 * t1               t4 = t3 * t1
    t5 = t4 + b         =>     t5 = t4 + b
    t6 = t3 * t1               c  = t5 * t5
    t7 = t6 + b
    c  = t5 * t7

Here t6 = t3 * t1 recomputes t4, so t7 equals t5 and both redundant instructions can be removed.
Partially Redundant Expression (PRE) Elimination
A variant of Redundant Expression Elimination: if a value or address is redundant along some execution paths, add computations to the other paths to create a fully redundant expression (which is then removed).
Like CSE, but the earlier expression is available only along some paths:

    if c then                 if c then
        x := a + b                t := a + b; x := t
    else              =>      else
        ...                       t := a + b
    end                       end
    y := a + b                y := t
Constant Propagation:

    x := 5               x := 5               x := 5
    y := x + 2     =>    y := 5 + 2     =>    y := 7
Copy Propagation:

    x := y                   x := y
    w := w + x       =>      w := w + y

Constant Folding:

    x := 3 + 4       =>      x := 7
Dead Code Elimination:

    if (false) Statement     =>     (removed)
Loop-invariant code motion:

Before:
    for j := 1 to 10
        for i := 1 to 10
            a[i] := a[i] + b[j]

After (b[j] is invariant in the inner loop):
    for j := 1 to 10 {
        t := b[j]
        for i := 1 to 10
            a[i] := a[i] + t
    }
Loop Unrolling
Replace a loop body executed N times with an expanded loop body consisting of M copies of the original body. The expanded body is executed N/M times, reducing loop overhead and increasing optimization possibilities within it.
Before:
    for i := 1 to N
        a[i] := a[i] + 1

After (unrolled by 4):
    for i := 1 to N by 4
        a[i]   := a[i] + 1
        a[i+1] := a[i+1] + 1
        a[i+2] := a[i+2] + 1
        a[i+3] := a[i+3] + 1

This creates more optimization opportunities in the loop body.
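A hedged C version of the same unrolling; unlike the sketch above, it keeps a remainder loop so it stays correct when N is not a multiple of 4:

    void incr_all(int *a, int n) {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {  /* main loop, unrolled by 4 */
            a[i]     += 1;
            a[i + 1] += 1;
            a[i + 2] += 1;
            a[i + 3] += 1;
        }
        for (; i < n; i++)                 /* remainder iterations */
            a[i] += 1;
    }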
DO Loop Normalization
    DO I = N, M, 2
        A(I) = ... I + 1 ...
    END DO
This transformation makes it easier to specify and apply other transformations; a C sketch of the normalized form follows.
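A hedged C sketch of normalization (assuming m >= n): the loop counts from 0 with unit stride, and the original index is recomputed from the new one:

    void normalized(int *a, int n, int m) {
        int trips = (m - n) / 2 + 1;   /* trip count of DO I = N, M, 2 */
        for (int j = 0; j < trips; j++) {
            int i = n + 2 * j;         /* recover the original index */
            a[i] = i + 1;              /* the original loop body */
        }
    }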
Call Inlining
At the site of a call, insert the body of a subprogram, with actual parameters initializing formal parameters.
Inlining:

    l := w := 4              l := w := 4            l := w := 4
    a := area(l, w)    =>    a := l * w       =>    a := l << 2

Many simple optimizations become important after inlining, e.g. interprocedural constant propagation (here w = 4 is propagated into the inlined body, and l * 4 is strength-reduced to l << 2).
Code Hoisting and Sinking
If the same code sequence appears in two or more alternative execution paths, the code may be hoisted to a common ancestor or sunk to a common successor. (This reduces code size, but does not reduce the number of instructions executed.) A sketch follows.
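A hedged C illustration of hoisting; the functions are illustrative:

    /* Before: a + b is computed on both paths. */
    int before(int cond, int a, int b) {
        int x;
        if (cond) x = a + b + 1;
        else      x = a + b - 1;
        return x;
    }

    /* After hoisting to the common ancestor: one textual copy, smaller
       code, but each path still executes as many instructions as before. */
    int after(int cond, int a, int b) {
        int x = a + b;
        if (cond) x += 1;
        else      x -= 1;
        return x;
    }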
Call Optimizations
- Calls through function pointers in imperative languages (see the sketch below)
- Calls of a computed function in functional languages
- OO dispatch in OO languages (e.g., COOL)
- If the receiver class can be deduced, the dispatch can be replaced with a direct call
- Other optimizations are possible even when there are multiple targets (e.g., using PICs = polymorphic inline caches)
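A hedged C sketch of the function-pointer case; types and functions are illustrative:

    typedef int (*op_fn)(int);

    static int twice(int x) { return 2 * x; }

    int apply(op_fn f, int x) {
        return f(x);   /* indirect call through a function pointer */
    }

    /* If analysis proves that f is always 'twice' at some call site, the
       compiler may replace the indirect call with a direct call to
       twice(), or inline it entirely. */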
Related techniques: procedure specialization and partial evaluation.
Machine-dependent Optimizations
1. Register allocation
2. Instruction selection (important for CISCs)
3. Instruction scheduling (particularly important with long-delay instructions and on wide-issue machines: superscalar and VLIW)
Global Register Allocation
Within a subprogram, frequently accessed variables and constants are allocated to registers. Usually there are many more register candidates than available registers.

Interprocedural Register Allocation
Variables and constants accessed by more than one subprogram are allocated to registers. This can greatly reduce call/return overhead.
Register Allocation
Performed when the object code is almost ready. The goal is to minimize the CPU delay waiting for data (typically ca. 60% of total execution time is spent waiting for data).
Symbolic registers may be used initially, but a careful choice of which data to put in registers is necessary. The algorithms are heuristic, since the underlying problem is NP-complete.
Software Pipelining
A value needed in iteration i of a loop is computed during iteration i-1 (or i-2, ...). This allows long-latency operations (floating-point divides and square roots, low hit-ratio loads) to execute in parallel with other operations. Software pipelining is sometimes called symbolic loop unrolling. A sketch follows.
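A hedged, source-level C imitation of the idea (real software pipelining is performed by the code generator on machine instructions): the loads for iteration i+1 are issued while the arithmetic for iteration i executes:

    double dot(const double *a, const double *b, int n) {
        if (n <= 0) return 0.0;
        double s = 0.0;
        double x = a[0], y = b[0];       /* prologue: loads for iteration 0 */
        for (int i = 0; i < n - 1; i++) {
            double nx = a[i + 1];        /* loads for iteration i + 1 */
            double ny = b[i + 1];
            s += x * y;                  /* work for iteration i */
            x = nx;
            y = ny;
        }
        return s + x * y;                /* epilogue: last iteration */
    }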
Later optimization passes will clean up:
- common subexpression elimination
- register allocation
- instruction scheduling
- constant folding
- constant propagation
Kinds of Analysis
- Control flow analysis
- Scalar data flow analysis
- (Data) dependence analysis

Classification of analyses according to the scope of their results:
1. Basic block analysis
2. Intraprocedural analysis
3. Interprocedural analysis
Analysis determines the applicability of transformations
Transformation
There are many potentially useful transformations. When a compiler is written, the developers must consider:
- the trade-off between the cost of implementing and executing a transformation and its likely usefulness
- a question without a proper answer: what is the input program like?

Transformations are usually combined into a general strategy for optimizing; an individual transformation may be applied more than once to a program as part of that strategy.
Kinds of Transformation
Many transformations have been proposed for a variety of purposes. They may be:
- specific to a source language (the best-developed set of these is for Fortran, especially Fortran 77)
- source-language and target-machine neutral (more or less)
- target-machine specific
Optimization Challenges
Problem with Analysis
- We want as much information as possible
- But we don't want the compiler to be too slow or to take up too much space
- We don't know what the optimal program is like
- We don't know much about any given input program
- So what are the useful changes?
Another Challenge
Some transformations are guaranteed to improve program performance when applied, but:
- most are not
- they may occasionally degrade performance
- they may negatively affect the size of the program
Optimization Strategy
So the compiler writer makes some assumptions about the form of the input program:
- Optimizations may work well on some input programs and not on others
- The compiler usually provides a choice of strategies
- A variety of optimization flags lets a smart application developer assemble a custom strategy
Role of IR
Analyses gather information on the program represented by the IR
Information not available in the IR cannot be exploited; e.g., some IRs represent arrays and loops explicitly, others do not.
Using IR
    int a, b, c, d;
    c = a + b;
    d = c + 1;

Naive translation:

    ldw  a, r1
    ldw  b, r2
    add  r1, r2, r3
    stw  r3, c
    ldw  c, r3
    add  r3, 1, r4
    stw  r4, d

After IR-level optimization (the redundant load of c is eliminated; its value is already in r3):

    add  r1, r2, r3
    add  r3, 1, r4
This usually includes some optimization levels, but may also include flags or options that request specific optimizations:
- -O0, -O1, -O2, -O3, ...
- Flags are compiler-specific; there are no standards
The compiler supports the -xOn option to specify the level of optimization you wish to apply. Each level includes the lower optimization levels. Choices for n are:
- n=1: basic block optimization
- n=2: global optimization
- n=3: loop unrolling and modulo scheduling
- n=4: intra-file inlining and pointer tracking
- n=5: aggressive optimizations
Examples of specific optimization-related flags (Sun compilers):
    -fns
    -fsimple=2
    -dalign
    -xlibmil
    -ftrap=%none
    -xbuiltin=%all
    -xalias_level=basic
The Sun compilers use a "rightmost wins" rule: in case of conflicting options, the option furthest to the right on the compile line wins (this usually happens when compiler macro options are used). For example, in "cc -xO4 foo.c -xO2" the -xO2 wins. When in doubt, use the -v option to see the macro expansions; you can also use the -dryrun option to see the macro expansion without compiling.
Performance depends on:
- the way the program was written
- features of the compiler
- the compiler flags used
- the hardware
Goals of Optimization
Memory hierarchy inefficiencies lead to very high stall rates in current machines: the CPU spends ca. 60% of total execution time waiting for data.

- Early in compilation, reorganize the execution of loops to get a better cache hit rate (see the interchange sketch below)
- Later, examine the simplified code and try to reduce the number of variables and the work of computing array addresses
- At the end of compilation, put heavily used variables into registers
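For the first step, a hedged C sketch of loop interchange: C stores arrays row-major, so making j the inner loop walks memory contiguously and improves the cache hit rate:

    enum { N = 512 };

    void col_order(double a[N][N]) {   /* before: strides of N doubles */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    void row_order(double a[N][N]) {   /* after interchange: contiguous */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }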
Applying Transformations
Order of application of transformations may influence outcome. Some transformations may be applied with the purpose of enabling other transformations.
We may deliberately degrade performance in an enabling transformation if we expect it will subsequently lead to a dramatic improvement.
Selecting Transformations
- Is it essential for the optimization goals? If not, how frequently is it likely to be used to modify code?
- Does it simplify the implementation of some essential part of the optimization process?
- How hard is it to implement? How time-consuming is its execution?
- How often will it lead to a program improvement? How great is that improvement likely to be?
Summary: Optimizations
Optimization is not a single process or procedure.
Rather, it is a collection of strategies for program improvement that may be applied at various stages during compilation. One compiler may implement a number of distinct strategies that may be used according to the needs of the user and the kind of application.
Optimizations in the back end or in the middle end of a modern compiler are usually performed in a series of phases, some of which are executed multiple times. In a real-world compiler this can be pretty complex. We look at some examples.
[Diagram: Front End → IR Gen → IR → optimization phases → Machine code; errors may be reported along the way]
Modern restructuring transformations:
- full and partial inlining of routines
- blocking for memory hierarchy and register reuse (a sketch follows this list)
- vectorization
- parallelization
All are based on dependence analysis.
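A hedged C sketch of blocking (tiling): the loop nest is restructured so that small sub-blocks of the matrices are reused while still resident in cache; the block size B is a tuning assumption:

    enum { N = 512, B = 64 };

    static int min(int x, int y) { return x < y ? x : y; }

    /* Blocked matrix multiply: each (ii, kk, jj) step works on B x B
       sub-blocks of a, b, and c that fit in cache together. */
    void matmul_blocked(double c[N][N], double a[N][N], double b[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < min(ii + B, N); i++)
                        for (int k = kk; k < min(kk + B, N); k++)
                            for (int j = jj; j < min(jj + B, N); j++)
                                c[i][j] += a[i][k] * b[k][j];
    }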
Classic Compilers
1957: The FORTRAN Automatic Coding System
[Diagram: Front End → Index Optimization → Code Merge → Bookkeeping → Flow Analysis → Register Allocation → Final Assembly, grouped into Front End, Middle End, and Back End]
Classic Compilers
1969: IBMs FORTRAN H Compiler
[Diagram: Scan & Parse → Build CFG & DOM → Find Busy Vars → CSE → Loop Inv Code Motion → Copy Elim. (consts) → OSR → Re-assoc → Reg. Alloc. → Final Assy., grouped into Front End, Middle End, and Back End]
- Used a low-level IR (quads); identified loops with dominators
- Focused on optimizing loops (inside-out order)
Fairly modern set of passes
Classic Compilers
1980: IBMs PL.8 Compiler
[Diagram: Front End → Middle End → Back End]
Dead code elimination, code motion, constant folding, strength reduction, value numbering, dead store elimination, code straightening, algebraic re-association
Classic Compilers
1986: HPs PA-RISC Compiler
[Diagram: Front End → Middle End → Back End]
- Several front ends, an optimizer, and a back end
- Four fixed-order choices for optimization (9 passes)
- Coloring register allocator, instruction scheduler, peephole optimizer
Modern Compilers
The SGI Pro64 Compiler (now Open64)
[Diagram: Fortran, C, and C++ front ends feed Interprocedural Analysis & Optimization, Loop Nest Optimization, Global Optimization, and Code Generation, spanning the Front End, Middle End, and Back End]
- Open-source optimizing compiler for IA-64
- Multiple front ends, one back end
- Five levels of IR, with gradual lowering of the abstraction level
Modern Compilers
The SGI Pro64 Compiler (now Open64), continued
Interprocedural analysis and optimization:
- Inlining (user and library code)
- Cloning (for constants and locality)
- Dead function and dead variable elimination
Modern Compilers
The SGI Pro64 Compiler (now Open64), continued
Loop nest optimizations:
- Dependence analysis
- Loop transformations: interchange, fission, fusion, peeling, tiling, unroll-and-jam
- Parallelization
Modern Compilers
The SGI Pro64 Compiler (now Open64), continued

Global analysis and optimization:
- Intraprocedural data flow analysis (using SSA form)
- Constant propagation
Modern Compilers
The SGI Pro64 Compiler for IA64 (now Open64)
Code generation:
- If-conversion and predication (a sketch follows this list)
- Code motion
- Instruction scheduling, register allocation
- Peephole optimizations
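A hedged C sketch of if-conversion: the branch becomes a data dependence, which predicated hardware (as on IA-64) or a conditional-move instruction can execute without branching:

    int sel_before(int x, int a, int b) {
        int y;
        if (x > 0) y = a;   /* control dependence: a conditional branch */
        else       y = b;
        return y;
    }

    int sel_after(int x, int a, int b) {
        /* data dependence: typically compiled to a conditional move or
           to predicated instructions, eliminating the branch */
        return (x > 0) ? a : b;
    }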