COMPILER CONSTRUCTION
O S ADEWALE
Department of Computer Science
The Federal University of Technology
AKURE, NIGERIA
PREFACE
Compilers and interpreters are a necessary part of any computer system;
without them, we would all be programming in assembly language or even
machine language. This has made compiler construction an important,
practical area of research in computer science. The object of this book is to
present in a coherent fashion the major techniques used in compiler writing,
in order to make it easier for the novice to enter the field and for the expert
to reference the literature.
The book is intended to serve two needs: it can be used as a self-study and
reference book for the professional programmer interested in or involved in
compiler construction, and as a text in compiler construction at the
undergraduate or graduate level. The emphasis is on solving the problems
universally encountered in designing a compiler, regardless of the source
language or the target machine.
A number of ideas and techniques discussed in this book can be profitably
used in general software design. For example, the finite-state techniques and
regular expressions used to build lexical analysers have also been used in text
editors, bibliographic search systems and pattern-recognition programs.
Context-free grammars and syntax-directed translation schemes have been
used to build text processors of many sorts. Techniques of code optimisation
also have applicability to program verifiers and to programs that produce
structured programs from unstructured ones.
TABLE OF CONTENTS
Chapter One The Compiling Process
Bibliography
CHAPTER ONE
The Compiling Process
What is a compiler?
A compiler is a program that takes as input a program written in one
language (the source language) and translates it into a functionally
equivalent program in another language (the target language). The source
language is usually a high-level language like Pascal or C, and the target
language is usually a low-level language like assembly or machine
language. As it translates, a compiler also reports errors and warnings to
help the programmer make corrections to the source, so the translation can
be completed. Theoretically, the source and target can be any language,
but the most common use of a compiler is translating an ASCII source
program written in a language such as C into a machine specific result like
SPARC assembly that can execute on that designated hardware.
Although we will focus on writing a compiler for a programming
language, the techniques you learn can be valuable and useful for a wide
variety of parsing and translating tasks, e.g. converting javadoc
comments to HTML, generating a table from the results of an SQL query,
collating responses from an e-mail survey, implementing a server that
responds to a network protocol like http or imap, or screen-scraping
information from an on-line source. Your printer uses parsing to render
PostScript files. Hardware engineers use a full-blown compiler to translate
from a hardware description language to the schematic of the circuit. Your
email spam filter quite possibly uses scanning and parsing to detect
unwanted email. And the list goes on.
2)
The symbol on the left side of the -> in each rule can be replaced
by the symbols on the right. To parse a + 2, we would apply the
following rules:
Expression -> Expression + Expression
           -> Variable + Expression
           -> T_IDENTIFIER + Expression
           -> T_IDENTIFIER + Constant
           -> T_IDENTIFIER + T_INTCONSTANT
4)
_t1 = b * c
_t2 = b * d
_t3 = _t1 + _t2
a = _t3
_t1 = a > b
if _t1 goto L0
_t2 = a c
a = _t2
_t3 = b * c
c = _t3
_t1 = b * c
_t2 = _t1 + 0
_t3 = b * c
_t4 = _t2 + _t3
a = _t4
_t1 = b * c
_t2 = _t1 + _t1
a = _t2
In the example above, the code generator translated the TAC input
into MIPS assembly output.
3)
The symbol table holds information about all the identifiers in the program along with important
attributes such as type and scope. Identifiers can be found in the lexical
analysis phase and added to the symbol table. During the two phases that
follow (syntax and semantic analysis), the compiler updates the identifier
entry in the table to include information about its type and scope.
When generating intermediate code, the type of the variable is used to
determine which instructions to emit. During optimisation, the live
range of each variable may be placed in the table to aid in register
allocation. The memory location determined in the code generation phase
might also be kept in the symbol table.
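To make this concrete, a symbol table entry can be modelled as a small C structure; the field names below are illustrative rather than taken from any particular compiler:

#include <stddef.h>

/* Illustrative symbol table entry; the exact fields depend on the compiler. */
typedef struct Symbol {
    char          *name;        /* identifier spelling, added by the scanner       */
    int            type;        /* filled in during syntax/semantic analysis       */
    int            scope;       /* scope or block nesting level                    */
    int            live_start;  /* live range, used to aid register allocation     */
    int            live_end;
    long           mem_offset;  /* memory location chosen during code generation   */
    struct Symbol *next;        /* chaining, e.g. for a hash-bucket implementation */
} Symbol;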
Error-handling
Another activity that occurs across several phases is error handling. Most
error handling occurs in the first three phases of the analysis stage. The
scanner keeps an eye out for stray tokens, the syntax analysis phase reports
invalid combinations of tokens, and the semantic analysis phase reports
type errors and the like. Sometimes these are fatal errors that stop the
entire process, while at other times, the compiler can recover and continue.
One-pass versus multi-pass
In looking at this phased approach to the compiling process, one might
think that each phase generates output that is then passed on to the next
phase. For example, the scanner reads through the entire source program
and generates a list of tokens. This list is the input to the parser that reads
through the entire list of tokens and generates a parse tree or derivation. If
a compiler works in this manner, we call it a multi-pass compiler. The
number of passes refers to how many times the compiler must read through the
source program. In reality, most compilers are one-pass up to the code
optimisation phase. Thus, scanning, parsing, semantic analysis and
intermediate code generation are all done simultaneously as the compiler
reads through the source program once. Once we get to code optimisation,
several passes are usually required which is why this phase slows the
compiler down so much.
CHAPTER TWO
Lexical Analysis
The basics
Lexical analysis or scanning is the process where the stream of characters
making up the source program is read from left-to-right and grouped into
tokens. Tokens are sequences of characters with a collective meaning.
There are usually only a small number of tokens for a programming
language: constants (integer, double, char, string, etc.), operators
(arithmetic, relational, logical), punctuation, and reserved words.
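In C, the token kinds and the tokens themselves are commonly represented with an enumeration and a small structure; the names below are illustrative (T_IDENTIFIER and T_INTCONSTANT match the ones used earlier, the rest are made up):

/* Illustrative token representation; the set of kinds varies by language. */
typedef enum {
    T_INTCONSTANT, T_DOUBLECONSTANT, T_STRINGCONSTANT,    /* constants      */
    T_PLUS, T_MINUS, T_STAR, T_SLASH, T_LESS, T_ASSIGN,   /* operators      */
    T_LPAREN, T_RPAREN, T_SEMICOLON,                      /* punctuation    */
    T_IF, T_WHILE, T_RETURN,                              /* reserved words */
    T_IDENTIFIER
} TokenKind;

typedef struct {
    TokenKind kind;
    char      text[64];   /* the lexeme as it appeared in the source */
    int       line;       /* where it was found, for error reporting */
} Token;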
Furthermore, the scanner has no idea how tokens are grouped. In the above
sequence, it returns b, [, 2, and ] as four separate tokens, having no idea they
collectively form an array access.
The lexical analyser can be a convenient place to carry out some other chores like
stripping out comments and white space between tokens and perhaps even some
features like macros and conditional compilation (although often these are
handled by some sort of preprocessor which filters the input before the compiler
runs).
The mythical source language tokenised by the above scanner requires that
reserved words be in all upper-case and identifiers in all lower-case. This
convenient feature makes it easy for the scanner to choose which path to
pursue after reading just one character. It is sometimes necessary to design
[Definitions given here: alphabet, string, empty string, formal language, regular expressions]
The basic operations for building regular expressions (for regular expressions r1 and r2):
r1 r2     concatenation of r1 and r2
r1 | r2   r1 or r2 (alternation)
r1 *      r1 repeated 0 or more times
2)
3)
Now that we know what FAs are, here is a regular expression and a simple
finite automaton that recognises an integer.
This FA handles only a subset of all Pascal tokens but it should give you
an idea of how an FA can be used to drive a scanner. The
numbered/lettered states are final states. The loops on states 1 and 2
continue to execute until a character other than a letter or digit is read. For
example, when scanning "temp:=temp+1;" it would report the first token
at final state 1 after reading the ":" having recognised the lexeme "temp"
as an identifier token.
What happens in an FA-driven scanner is we read the source program one
character at a time beginning with the start state. As we read each
character, we move from our current state to the next by following the
appropriate transition for that character. When we end up in a final state, we perform
an action associated with that final state. For example, the action
associated with state 1 is to first check if the token is a reserved word by
looking it up in the reserved word list. If it is, the reserved word is passed
to the token stream being generated as output. If it is not a reserved word,
it is an identifier so a procedure is called to check if the name is in the
symbol table. If it is not there, it is inserted into the table.
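A sketch of that final-state action in C (the helper routines and token codes here are hypothetical, standing in for whatever the real scanner provides):

#include <stddef.h>

/* Hypothetical helpers; a real scanner supplies its own versions. */
int   lookup_reserved(const char *lexeme);     /* keyword code, or NOT_A_KEYWORD */
void *symtab_lookup(const char *name);
void  symtab_insert(const char *name);
void  emit_token(int kind, const char *lexeme);
enum { NOT_A_KEYWORD = -1, T_IDENTIFIER = 1 };

/* Action performed on reaching final state 1 with the lexeme just read. */
void identifier_action(const char *lexeme) {
    int kw = lookup_reserved(lexeme);          /* check the reserved word list */
    if (kw != NOT_A_KEYWORD) {
        emit_token(kw, lexeme);                /* pass the reserved word along */
        return;
    }
    if (symtab_lookup(lexeme) == NULL)         /* not yet in the symbol table? */
        symtab_insert(lexeme);                 /* insert it                    */
    emit_token(T_IDENTIFIER, lexeme);          /* emit an identifier token     */
}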
Once a final state is reached and the associated action is performed, we
pick up where we left off at the first character of the next token and begin
again at the start state. If we do not end in a final state or encounter an
unexpected symbol while in any state, we have an error condition. For
example, if we run "ASC@I" through the above FA, we would error out
of state 1.
From regular expressions to NFA
So that's how FAs can be used to implement scanners. Now we need to
look at how to create an FA given the regular expressions for our tokens.
There is a looser definition of an FA that is especially useful to us in this
process. A nondeterministic finite automaton (NFA) has:
1) a finite set of states, with one start state and one or more final states;
2) transitions between states, each labelled with a symbol of the alphabet; and
3) possibly more than one transition out of a state on the same symbol.
Notice that there is more than one path through the machine for a given
string. For example, 000 can take you to a final state, or it can leave you
in the start state. This is where the non-determinism (choice) comes in. If
any of the possible paths for a string leads to a final state, that string is in
the language of this automaton.
There is a third type of finite automaton called an ε-NFA, which has
transitions labelled with the empty string ε. The interpretation for such
transitions is one can travel over an empty-string transition without using
any input symbols.
A famous proof in formal language theory (Kleene's Theorem) shows that
FAs are equivalent to NFAs, which are equivalent to ε-NFAs. And, all
these types of FAs are equivalent in language generating power to regular
expressions. In other words,
If R is a regular expression, and L is the language
corresponding to R, then there is an FA that recognises L.
Conversely, if M is an FA recognising a language L, there is
a regular expression R corresponding to L.
It is quite easy to take a regular expression and convert it to an equivalent
NFA or ε-NFA, thanks to the simple rules of Thompson's construction:
Rule 1: There is an NFA that accepts any particular symbol of the alphabet:
Here is an example:
We continue with this process analysing all the new states that we create.
We need to determine where we go in the NFA from each state, on all the
symbols of the alphabet.
And finally, filling in the transitions from {X2, X3} state brings us full
circle. This is now a deterministic FA that accepts the same language as
the original NFA. We have 5 states instead of the original 4, a rather modest
increase in this case.
The process then goes like this: from a regular expression for a token, we
construct an NFA that recognises them using Thompson's algorithm.
NFAs are not useful as drivers for programs because non-determinism
implies choices and thus, expensive exhaustive backtracking algorithms.
So, we use subset construction to convert that NFA to a DFA. Once we
have the DFA, we can use it as the basis for an efficient non-backtracking
scanner.
Lex, a Scanner Generator
The reason we have spent so much time looking at how to go from regular
expressions to finite automata is because this is exactly the process that lex
goes through in creating a scanner. Lex is a lexical analysis generator that
takes as input a series of regular expressions and builds a finite automaton
and a driver program for it in C through the mechanical steps shown
above. Theory in practice!
The first phase in a compiler reads the input source and converts strings in
the source to tokens. Using regular expressions, we can specify patterns to
lex that allow it to scan and match strings in the input. Each pattern in lex
has an associated action. Typically an action returns a token, representing
the matched string, for subsequent use by the parser. To begin with,
however, we will simply print the matched string rather than return a token
value. We may scan for identifiers using the regular expression
letter(letter|digit)*
This pattern matches a string of characters that begins with a single letter,
and is followed by zero or more letters or digits. This example nicely
illustrates operations allowed in regular expressions:
- concatenation
- alternation (|)
- repetition, zero or more times (*)
A regular expression like this can be recognised by a finite-state automaton (FSA): a set of states, with transitions
between states. There is one start state, and one or more final or accepting
states.
In the above figure, state 0 is the start state, and state 2 is the accepting
state. As characters are read, we make a transition from one state to
another. When the first letter is read, we transition to state 1. We remain in
state 1 as more letters or digits are read. When we read a character other
than a letter or digit, we transition to state 2, the accepting state. Any FSA
may be expressed as a computer program. For example, our 3-state
machine is easily programmed:
start:  goto state0
state0: read c
        if c = letter goto state1
        goto state0
state1: read c
        if c = letter goto state1
        if c = digit goto state1
        goto state2
This is the technique used by lex. Regular expressions are translated by lex
to a computer program that mimics an FSA. Using the next input
character, and current state, the next state is easily determined by indexing
into a computer-generated state table.
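As a sketch of that technique, here is a hand-written state table and driver in C for the 3-state identifier machine above (lex generates something similar, though far more general):

#include <ctype.h>
#include <stdio.h>

/* Character classes: rows of the table are states, columns are classes. */
enum { LETTER, DIGIT, OTHER };

static int char_class(int c) {
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* next_state[s][class]: state 0 waits for a letter, state 1 collects
 * letters and digits, state 2 is the accepting state. */
static const int next_state[3][3] = {
    {1, 0, 0},
    {1, 1, 2},
    {2, 2, 2},
};

int main(void) {
    int state = 0, c;
    while (state != 2 && (c = getchar()) != EOF)
        state = next_state[state][char_class(c)];   /* index into the state table */
    printf(state == 2 ? "identifier recognised\n" : "no identifier found\n");
    return 0;
}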
Now we can easily understand some of lex's limitations. For example, lex
cannot be used to recognize nested structures such as parentheses. Nested
structures are handled by incorporating a stack. Whenever we encounter a
"(", we push it on the stack. When a ")" is encountered, we match it with
the top of the stack, and pop the stack. Lex, however, only has states and
transitions between states. Since it has no stack, it is not well suited for
parsing nested structures. Yacc augments an FSA with a stack, and can
process constructs such as parentheses with ease. The important thing is to
use the right tool for the job. Lex is good at pattern matching. Yacc is
appropriate for more challenging tasks. We shall consider yacc in the
future.
to output. Defaults for input and output are stdin and stdout,
respectively. Here is the same example, with defaults explicitly coded:
%%
    /* match everything except newline */
.   ECHO;
    /* match newline */
\n  ECHO;
%%
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
Two patterns have been specified in the rules section. Each pattern must
begin in column one. This is followed by whitespace (space, tab or
newline), and an optional action associated with the pattern. The action
may be a single C statement, or multiple C statements enclosed in braces.
Anything not starting in column one is copied verbatim to the generated C
file. We may take advantage of this behavior to specify comments in our
lex file. In this example there are two patterns, . and \n, with an
ECHO action associated for each pattern. Several macros and variables are
predefined by lex. ECHO is a macro that writes out the text matched by the
pattern. This is the default action for any unmatched strings. Typically,
ECHO is defined as:
#define ECHO fwrite(yytext, yyleng, 1, yyout)
The following example prepends line numbers to each line in a file. Some
implementations of lex predefine and calculate yylineno. The input file
for lex is yyin, and defaults to stdin.
Whitespace must separate the defining term and the associated expression.
References to substitutions in the rules section are surrounded by braces
({letter}) to distinguish them from literals. When we have a match in
the rules section, the associated C code is executed. Here is a scanner that
counts the number of characters, words, and lines in a file:
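A minimal specification along these lines (the counter names are illustrative) might be:

%{
#include <stdio.h>
/* counts of characters, words and lines */
int nchar = 0, nword = 0, nline = 0;
%}
%%
\n          { nline++; nchar++; }
[^ \t\n]+   { nword++; nchar += yyleng; }
.           { nchar++; }
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("%d characters, %d words, %d lines\n", nchar, nword, nline);
    return 0;
}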
CHAPTER THREE
Formal Grammars
What is a grammar?
A grammar is a powerful tool for describing and analysing languages. It is
a set of rules by which valid sentences in a language are constructed.
Here's a trivial example of English grammar:
<sentence>    -> <subject> <verb-phrase> <object>
<subject>     -> This | Computers | I
<verb-phrase> -> <adverb> <verb> | <verb>
<adverb>      -> never
<verb>        -> is | run | am | tell
<object>      -> the <noun> | a <noun> | <noun>
<noun>        -> university | world | cheese | lies
Using the above rules or productions, we can derive simple sentences such
as these:
This is a university.
Computers run the world.
I am the cheese.
I never tell lies.
[Definitions given here: terminal, production, derivation, start symbol, null symbol, BNF]
A few grammar exercises to try on your own; the alphabet in each case is
{a, b}.
Parse Representation
In working with grammars, we can represent the application of the rules to
derive a sentence in two ways. The first is a derivation, as shown earlier for
"This is a university", where the rules are applied step-by-step and we
substitute for one nonterminal at a time. Think of a derivation as a history
of how the sentence was parsed because it not only includes which
productions were applied, but also the order they were applied (i.e. which
nonterminal was chosen for expansion at each step). There can be many
different derivations for the same sentence (the leftmost, the rightmost,
and so on).
A parse tree is the second method for representation. It diagrams how each
symbol derives from other symbols in a hierarchical manner. Here is a
parse tree for "This is a university":
Although the parse tree includes all of the productions that were applied, it
does not encode the order they were applied. For an unambiguous
grammar, there is exactly one parse tree for a particular sentence.
More Formal Definitions
Here are some other definitions we will need, described in reference to this
example grammar:
S -> AB
A -> Ax | y
B -> z
alphabet
The alphabet is {S, A, B, x, y, z}. It is divided into two disjoint sets.
The terminal alphabet consists of terminals, which appear in the
sentences of the language: {x, y, z}. The remaining symbols are the
nonterminal alphabet; these are the symbols that appear on the left
side of productions and can be replaced during the course of a
derivation: {S, A, B}. Formally, we use V for the alphabet, T for the
equivalence
The language L(G) defined by grammar G is the set of sentences
derivable using G. Two grammars G and G' are said to be
equivalent if the languages they generate L(G) and L(G') are the
same.
A Hierarchy of Grammars
We owe a lot of our understanding of grammars to the work of the
American linguist Noam Chomsky (yes, the Noam Chomsky known for
his politics). There are four categories of formal grammars in the Chomsky
Hierarchy; they span from Type 0, the most general, to Type 3, the most
restrictive. More restrictions on the grammar make it easier to describe and
efficiently parse, but reduce the expressive power.
E  -> E op E | (E) | int
op -> + | - | * | /
Both trees are legal in the grammar as stated and thus either interpretation
is valid. Although natural languages can tolerate some kind of ambiguity
(puns, plays on words, etc.), it is not acceptable in computer languages.
We don't want the compiler just haphazardly deciding which way to
interpret our expressions! Given our expectations from algebra concerning
precedence, only one of the trees seems right. The right-hand tree fits our
expectation that * binds tighter; its result is computed first and then
combined within the outer expression, which has the lower-precedence
operator.
It's fairly easy for a grammar to become ambiguous if you are not careful
in its construction. Unfortunately, there is no magical technique that can be
used to resolve all varieties of ambiguity. It is an undecidable problem to
determine whether any grammar is ambiguous, much less attempt to
mechanically remove all ambiguity. However, that doesn't mean in
practice that we cannot detect ambiguity or can't do something about it.
For programming language grammars, we usually take pains to construct
an unambiguous grammar or introduce additional disambiguating rules to
throw away the undesirable parse trees, leaving only one for each
sentence.
Using the above ambiguous expression grammar, one technique would
leave the grammar as is, but add disambiguating rules into the parser
implementation. We could code into the parser knowledge of precedence
and associativity to break the tie and force the parser to build the tree on
the right rather than the left. The advantage of this is that the grammar
remains simple and less complicated. But as a downside, the syntactic
structure of the language is no longer given by the grammar alone.
Another approach is to change the grammar to only allow the one tree that
correctly reflects our intention and eliminate the others. For the expression
The common prefix is if Cond then Stmt. This causes problems because
when a parser encounters an if, it does not know which production to use.
A useful technique called left-factoring allows us to restructure the
grammar to avoid this situation. We rewrite the productions to defer the
decision about which of the options to choose until we have seen enough
of the input to make the appropriate choice. We factor out the common
part of the two options into a shared rule that both will use and then add a
new rule that picks up where the tokens diverge.
Stmt    -> if Cond then Stmt OptElse | Other
OptElse -> else Stmt | ε
A cursory examination of the grammar may not detect that the first and
second productions of B overlap with the third. We substitute the
expansions for A into the third production to expose this:
A -> da | acB
B -> abB | daA | daf | acBf
Similarly, the following grammar does not appear to have any left recursion:
S -> Tu | wx
T -> Sq | vvS
CHAPTER FOUR
Top-Down Parsing
Approaches to Parsing
The syntax analysis phase of a compiler verifies that the sequence of
tokens extracted by the scanner represents a valid sentence in the grammar
of the programming language. There are two major parsing approaches:
top-down and bottom-up. In top-down parsing, you start with the start
symbol and apply the productions until you arrive at the desired string. In
bottom-up parsing, you start with the string and reduce it to the start
symbol by applying the productions backwards. As an example, let's trace
through the two approaches on this simple grammar that recognises strings
consisting of any number of a's followed by at least one (and possibly
more) b's:
S -> AB
A -> aA | ε
B -> b | bB
Here is a top-down parse of aaab. We begin with the start symbol and at
each step, expand one of the remaining nonterminals by replacing it with
the right side of one of its productions. We repeat until only terminals
remain. The top-down parse produces a leftmost derivation of the
sentence.
S
AB       S -> AB
aAB      A -> aA
aaAB     A -> aA
aaaAB    A -> aA
aaaB     A -> ε
aaab     B -> b
Let's follow a backtracking parse of the input bcd, using the grammar S -> bab | bA
with A -> d | cA. In the trace below, the column on the
left is the expansion thus far, the middle is the remaining input, and
the right is the action attempted at each step:
S        bcd    Try S -> bab
bab      bcd    match b
ab       cd     dead end, backtrack
S        bcd    Try S -> bA
bA       bcd    match b
A        cd     Try A -> d
d        cd     dead end, backtrack
A        cd     Try A -> cA
cA       cd     match c
A        d      Try A -> d
d        d      match d, success
As you can see, each time we hit a dead-end, we backup to the last
decision point, unmake that decision and try another alternative. If all
alternatives have been exhausted, we back up to the preceding decision
point and so on. This continues until we either find a working parse or
have exhaustively tried all combinations without success.
A number of authors have described backtracking parsers; the appeal is
that they can be used for a variety of grammars without requiring them to
fit any specific form. For a small grammar such as above, a backtracking
In a recursive-descent parser there is one parsing function for
each nonterminal; this function calls the associated functions to handle its
part of the parsing.
To make things a little cleaner, let's introduce a utility function that can be
used to verify that the next token is what is expected and will error and
exit otherwise. We will need this again and again in writing the parsing
routines.
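A sketch of such a helper in C (the token-stream interface is improvised):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative globals: the current lookahead token and a way to advance.
 * A real parser gets these from the scanner (e.g. from yylex()). */
extern int lookahead;
extern int next_token(void);

/* Verify that the next token is what is expected; consume it, or report
 * an error and exit otherwise. */
void match(int expected) {
    if (lookahead != expected) {
        fprintf(stderr, "syntax error: unexpected token %d\n", lookahead);
        exit(1);
    }
    lookahead = next_token();    /* it matched, so consume it and advance */
}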
When parsing the closing portion of the if, we have to decide which of the
two right-hand side options to expand. In this case, it isn't too difficult.
We try to match the first token against ENDIF and, on a non-match, we try to
match the ELSE clause; if that doesn't match, it will report an error.
Navigating through two choices seemed simple enough; however, what
happens when we have many alternatives on the right side?
statement -> assg_statement | return_statement | print_statement | null_statement
           | if_statement | while_statement | block_of_statements
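With one token of lookahead, the parsing routine for this nonterminal simply switches on the token that can begin each alternative (a sketch; the token codes and sub-parsers are placeholders for whatever the real grammar defines):

extern int lookahead;
void assg_statement(void);  void return_statement(void);  void print_statement(void);
void null_statement(void);  void if_statement(void);      void while_statement(void);
void block_of_statements(void);
void syntax_error(const char *msg);

enum { T_IDENTIFIER, T_RETURN, T_PRINT, T_SEMICOLON, T_IF, T_WHILE, T_LBRACE };

void statement(void) {
    switch (lookahead) {                      /* each case covers one First set */
    case T_IDENTIFIER: assg_statement();      break;
    case T_RETURN:     return_statement();    break;
    case T_PRINT:      print_statement();     break;
    case T_SEMICOLON:  null_statement();      break;
    case T_IF:         if_statement();        break;
    case T_WHILE:      while_statement();     break;
    case T_LBRACE:     block_of_statements(); break;
    default:           syntax_error("expected a statement");
    }
}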
If the first sets of the various productions for a nonterminal are not
disjoint, a predictive parser doesn't know which choice to make. We would
either need to re-write the grammar or use a different parsing technique for
this nonterminal. For programming languages, it is usually possible to restructure the productions or embed certain rules into the parser to resolve
conflicts, but this constraint is one of the weaknesses of the top-down non-backtracking approach.
It is a bit trickier if the nonterminal we are trying to recognize is nullable.
A nonterminal A is nullable if there is a derivation of A that results in ε (i.e.
that nonterminal would completely disappear in the parse string), i.e., ε is in
First(A). In this case A could be replaced by nothing and the next token
would be the first token of the symbol following A in the sentence being
parsed. Thus if A is nullable, our predictive parser also needs to consider
the possibility that the path to choose is the one corresponding to A =>* ε.
To deal with this we define the following:
The follow set of a nonterminal A is the set of terminal
symbols that can appear immediately to the right of A in a
valid sentence. A bit more formally, for every valid sentence
S =>*uAv , where v begins with some terminal, that terminal
is in Follow(A).
Informally, you can think about the follow set like this: A can appear in
various places within a valid sentence. The follow set describes what
terminals could have followed the sentential form that was expanded from
A. We will detail how to calculate the follow set a bit later. For now,
realize follow sets are useful because they define the right context
consistent with a given nonterminal and provide the lookahead that might
signal that a nullable nonterminal should be expanded to ε.
With these two definitions, we can now generalize how to handle A -> u1 |
u2 | ... in a recursive-descent parser. In all situations, we need a case to
handle each member in First(ui). In addition, if there is a derivation from
any ui that could yield ε (i.e. if it is nullable) then we also need to handle
the members in Follow(A).
What about left-recursive productions? Now we see why these are such a
problem in a predictive parser. Consider this left-recursive production that
matches a list of one or more functions.
function_list -> function_list function | function
function      -> FUNC identifier ( parameter_list ) statement
becomes
3.
2.
3.
S  -> AB
A  -> Ca | ε
B  -> BaAC | c
C  -> b | ε
becomes
B  -> cB'
B' -> aACB' | ε
It helps to first compute the nullable set (i.e. those nonterminals X such that X
=>* ε), since you need to refer to the nullable status of various
nonterminals when computing the first and follow sets:
Nullable(G) = {A, B', C}
Follow(S) = {$}
S doesn't appear on the right-hand side of any productions.
We put $ in the follow set because S is the start symbol.
Follow(B) = {$}
B appears on the right-hand side of the S -> AB production.
Its follow set is the same as S's.
Follow(B') = {$}
B' appears on the right-hand side of two productions. The B'
-> aACB' production tells us its follow set includes the
follow set of B', which is tautological. From B -> cB', we
learn its follow set is the same as B's.
Follow(C) = {a $}
C appears on the right-hand side of two productions. The
production A -> Ca tells us a is in the follow set. From B' ->
aACB', we add First(B'), which is just a again. Because B'
is nullable, we must also add Follow(B'), which is $.
Follow(A) = {c b a $}
A appears on the right-hand side of two productions. From S
-> AB we add First(B), which is just c; B is not nullable. From
B' -> aACB', we add First(C), which is b. Since C is nullable,
we also include First(B'), which is a. B' is also nullable, so
we include Follow(B'), which adds $.
It can be convenient to compute the follow sets for the nonterminals that
appear toward the top of the parse tree and work your way down, but
sometimes you have to circle around computing the follow sets of other
nonterminals in order to complete the one you're on.
The calculation of the first and follow sets follows mechanical algorithms,
but it is very easy to get tripped up in the details and make mistakes even
when you know the rules. Be careful!
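The nullable computation, for example, is a small fixed-point iteration. Here is a sketch in C over a hand-coded copy of the grammar above (the encoding, with P standing for B', is improvised for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Grammar S->AB, A->Ca|eps, B->cP, P->aACP|eps, C->b|eps (P stands for B').
 * Right-hand sides are strings; "" is an epsilon production. */
enum { S, A, B, P, C, NUM_NT };
static const char *rhs[NUM_NT][3] = {
    {"AB", NULL},          /* S  */
    {"Ca", "", NULL},      /* A  */
    {"cP", NULL},          /* B  */
    {"aACP", "", NULL},    /* B' */
    {"b", "", NULL},       /* C  */
};
static int nt_index(char c) {                    /* nonterminal letter -> index, or -1 */
    switch (c) { case 'S': return S; case 'A': return A; case 'B': return B;
                 case 'P': return P; case 'C': return C; default: return -1; }
}

int main(void) {
    bool nullable[NUM_NT] = {false};
    for (bool changed = true; changed; ) {       /* iterate until nothing changes */
        changed = false;
        for (int n = 0; n < NUM_NT; n++) {
            if (nullable[n]) continue;
            for (int p = 0; rhs[n][p]; p++) {
                bool all = true;                 /* every symbol of this rhs nullable? */
                for (const char *s = rhs[n][p]; *s; s++) {
                    int m = nt_index(*s);
                    if (m < 0 || !nullable[m]) { all = false; break; }
                }
                if (all) { nullable[n] = changed = true; break; }
            }
        }
    }
    const char *names[NUM_NT] = {"S", "A", "B", "B'", "C"};
    for (int n = 0; n < NUM_NT; n++)
        if (nullable[n]) printf("%s is nullable\n", names[n]);
    return 0;
}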
Table-Driven LL(1) Parsing
In a recursive-descent parser, the production information is embedded in
the individual parse functions for each nonterminal and the run-time
execution stack is keeping track of our progress through the parse. There is
another method for implementing a predictive parser that uses a table to
store the production information, along with an explicit stack to keep track of where
we are in the parse.
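The driver for such a parser is a short loop over the explicit stack and the table M[nonterminal, token]; a sketch in C (the table, the symbol encoding and the helper routines are placeholders):

#include <stdio.h>
#include <stdlib.h>

#define MAX_STACK 256
enum { END_MARKER = 0 };                        /* the $ symbol */

extern int  M[][128];                           /* M[nonterminal][terminal]: production number or -1 */
extern int *production_rhs(int prod, int *len); /* right-hand side symbols, left to right             */
extern int  is_terminal(int sym);
extern int  next_token(void);
extern int  start_symbol;

void ll1_parse(void) {
    int stack[MAX_STACK], top = 0;
    stack[top++] = END_MARKER;                  /* $ sits on the bottom      */
    stack[top++] = start_symbol;                /* then the start symbol     */
    int tok = next_token();
    while (top > 0) {
        int X = stack[--top];                   /* pop the top of the stack  */
        if (X == END_MARKER || is_terminal(X)) {
            if (X != tok) { fprintf(stderr, "syntax error\n"); exit(1); }
            if (X == END_MARKER) return;        /* matched $: parse complete */
            tok = next_token();                 /* matched a terminal        */
        } else {
            int prod = M[X][tok];
            if (prod < 0) { fprintf(stderr, "syntax error\n"); exit(1); }
            int len, *sym = production_rhs(prod, &len);
            for (int i = len - 1; i >= 0; i--)  /* push the rhs in reverse    */
                stack[top++] = sym[i];          /* (an epsilon rhs has len 0) */
        }
    }
}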
This grammar for add/multiply expressions is already set up to handle
precedence and associativity:
E -> E + T | T
T -> T * F | F
F -> (E) | int
E  -> TE'
E' -> + TE' | ε
T  -> FT'
T' -> * FT' | ε
F  -> (E) | int
One way to illustrate the process is to study some transition graphs that
represent the grammar:
4.
Suppose, instead, that we were trying to parse the input +$. The first step
of the parse would give an error because there is no entry at M[E, +].
Constructing the Parse Table
The next task is to figure out how we built the table. The construction of
the table is somewhat involved and tedious (the perfect task for a
Once we have the first and follow sets, we build a table M with the
leftmost column labelled with all the nonterminals in the grammar, and the
top row labelled with all the terminals in the grammar, along with $. The
following algorithm fills in the table cells:
1.
2.
3.
4.
No ambiguity
No left recursion
A grammar G is LL(1) iff whenever A -> u | v are two
distinct productions of G, the following conditions hold:
All of this translates intuitively to the requirement that, when trying to recognise A, the parser
must be able to examine just one input symbol of lookahead and uniquely
determine which production to use.
Error-Reporting and Recovery
A few general principles apply to errors found regardless of parsing
technique being used:
The problem is how to fix the error in some way to allow parsing to
continue.
Many errors are relatively minor and involve syntactic violations for
which the parser has a correction that it believes is likely to be what the
programmer intended. For example, a missing semicolon at the end of the
line or a misspelled keyword can usually be recognised. For many minor
errors, the parser can fix the program by guessing at what was intended
and reporting a warning, but allowing compilation to proceed unhindered.
The parser might skip what appears to be an erroneous token in the input
or insert a necessary, but missing, token or change a token into the one
expected (substituting BEGIN for BGEIN). For more major or complex
errors, the parser may have no reliable correction. The parser will attempt
to continue but will probably have to skip over part of the input or take
some other exceptional action to do so.
Panic-mode error recovery is a simple technique that just bails out of the
current construct, looking for a safe symbol at which to restart parsing.
The parser just discards input tokens until it finds what is called a
synchronising token. The set of synchronising tokens are those that we
believe confirm the end of the invalid statement and allow us to pick up at
the next piece of code. For a nonterminal A, we could place all the symbols
in Follow(A) into its synchronising set. If A is the nonterminal for a variable
declaration and the garbled input is something like duoble d; the parser
might skip ahead to the semi-colon and act as though the declaration didn't
exist. This will surely cause some more cascading errors when the variable
is later used, but it might get through the trouble spot. We could also use
the symbols in First(A) as a synchronising set for re-starting the parse of A.
This would allow input junk double d; to parse as a valid variable
declaration.
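A sketch of panic-mode recovery in C (the token-stream interface and the end-of-input code are improvised):

#include <stdbool.h>

enum { T_EOF = 0 };                 /* hypothetical end-of-input token code */
extern int  peek_token(void);
extern void consume_token(void);

static bool in_set(int tok, const int *set, int n) {
    for (int i = 0; i < n; i++)
        if (set[i] == tok) return true;
    return false;
}

/* Discard tokens until one from the synchronising set (for example,
 * Follow(A)) appears, then return so the caller can resume parsing. */
void panic_recover(const int *sync_set, int n) {
    while (peek_token() != T_EOF && !in_set(peek_token(), sync_set, n))
        consume_token();
}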
CHAPTER FIVE
Bottom-Up Parsing
As the name suggests, bottom-up parsing works in the opposite direction
from top-down. A top down parser begins with the start symbol at the top
of the parse tree and works downward, driving productions in forward
order until it gets to the terminal leaves. A bottom-up parse starts with the
string of terminals itself and builds from the leaves upward, working
backwards to the start symbol by applying the productions in reverse.
Along the way, a bottom-up parser searches for substrings of the working
string that match the right side of some production. When it finds such a
substring, it reduces it, i.e., substitutes the left side nonterminal for the
matching right side. The goal is to reduce all the way up to the start
symbol and report a successful parse.
In general, bottom-up parsing algorithms are more powerful than top-down methods, but not surprisingly, the constructions required are also
more complex. It is difficult to write a bottom-up parser by hand for
anything but trivial grammars, but fortunately, there are excellent parser
generator tools like yacc that build a parser from an input specification,
not unlike the way lex builds a scanner to your spec.
Shift-reduce parsing is the most commonly used and most powerful of the
bottom-up techniques. It takes as input a stream of tokens and develops the
list of productions used to build the parse tree, but the productions are
discovered in reverse order of a top-down parser. Like a table-driven
predictive parser, a bottom-up parser makes use of a stack to keep track of
the position in the parse and a parsing table to determine what to do next.
To illustrate stack-based shift-reduce parsing, consider this simplified
expression grammar:
S -> E
E -> T | E + T
T -> id | (E)
The shift-reduce strategy divides the string we are trying to parse into two
parts: an undigested part and a semi-digested part. The undigested part
contains the tokens that are still to come in the input, and the semi-digested part is put on a stack. If parsing the string v, it starts out
completely undigested, so the input is initialised to v, and the stack is
initialised to empty. A shift-reduce parser proceeds by taking one of three
actions at each step:
Reduce: replace a matching right-hand side on top of the stack with the nonterminal from the left side of that production.
Shift: move the next input token onto the top of the stack.
Error: if neither a shift nor a reduce is possible, report a syntax error.
The general idea is to read tokens from the input and push them onto the
stack attempting to build sequences we recognise as the right side of a
production. When we find a match, we replace that sequence with the
nonterminal from the left side and continue working our way up the parse
tree. This process builds the parse tree from the leaves upward, the inverse
of the top-down parser. If all goes well, we will end up moving everything
from the input to the stack and eventually construct a sequence on the
stack that we recognise as a right-hand side for the start symbol.
Let's trace the operation of a shift-reduce parser in terms of its actions
(shift or reduce) and its data structure (a stack). The chart below traces a
parse of (id+id) using the previous example grammar:
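One way the trace works out, step by step (stack on the left with $ marking its bottom, remaining input in the middle, action on the right), is:

Stack        Input        Action
$            (id+id)$     shift (
$(           id+id)$      shift id
$(id         +id)$        reduce T -> id
$(T          +id)$        reduce E -> T
$(E          +id)$        shift +
$(E+         id)$         shift id
$(E+id       )$           reduce T -> id
$(E+T        )$           reduce E -> E + T
$(E          )$           shift )
$(E)         $            reduce T -> (E)
$T           $            reduce E -> T
$E           $            reduce S -> E and accept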
reduction. An example of a shift-reduce conflict occurs with the if-then-else
construct in programming languages. A typical production might be:
S -> if E then S | if E then S else S
with else as the next token. It could reduce because the contents of the
stack match the right-hand side of the first production or shift the else
trying to build the right-hand side of the second production. Reducing
would close off the inner if and thus associate the else with the outer if.
Shifting would continue building and later reduce the inner if with the else.
Either is syntactically valid given the grammar, but two different parse
trees result, showing the ambiguity. This quandary is commonly referred
to as the dangling else. Does an else appearing within a nested if statement
belong to the inner or the outer? The C and Java languages agree that an
else is associated with its nearest unclosed if. Other languages, such as Ada
and Modula, avoid the ambiguity by requiring a closing endif delimiter.
Reduce-reduce conflicts are not common and usually indicate a problem in
the grammar definition.
Now that we have a general idea of how a shift-reduce parser operates, we
will look at how it recognises a handle, and how it decides which
production to use in a reduction. To deal with these two issues, we will
look at a specific shift-reduce implementation called LR parsing.
LR Parsing
LR parsers (L for left to right scan of input; R for rightmost
derivation) are efficient, table-driven shift-reduce parsers. The class of
grammars that can be parsed using LR methods is a proper superset of the
class of grammars that can be parsed with predictive LL parsers. In fact,
virtually all programming language constructs for which CFGs can be
written can be parsed with LR techniques. As an added advantage, there is
no need for lots of grammar rearrangement to make it acceptable for LR
parsing the way that LL parsing requires.
The primary disadvantage is the amount of work it takes to build the tables
by hand, which makes it infeasible to hand-code an LR parser for most
grammars. Fortunately, there exist LR parser generator tools that create the
parser from a CFG specification. The parser tool does all the tedious and
complex work to build the necessary tables and can report any ambiguities
or language constructs that interfere with the ability to parse it using LR
techniques.
We begin by tracing how an LR parser works. Determining the handle to
reduce in a sentential form depends on the sequence of tokens on the stack,
not only the topmost ones that are to be reduced, but also the context we are
in at that point in the parse. Rather than reading and shifting tokens onto a stack,
an LR parser pushes "states" onto the stack; these states describe what is
on the stack so far. Think of each state as encoding the current left context.
The state on top of the stack, possibly augmented by peeking at a
lookahead token, enables us to figure out whether we have a handle to
reduce, or whether we need to shift a new state on top of the stack for the
next input token.
An LR parser uses two tables:
1.
The action table Action[s,a] tells the parser what to do when the
state on top of the stack is s and terminal a is the next input token.
The possible actions are to shift a state onto the stack, to reduce the
handle on top of the stack, to accept the input, or to report an error.
2.
The goto table Goto[s,X] indicates the new state to place on top of
the stack after a reduce of the nonterminal X while state s is on top
of the stack.
The two tables are usually combined, with the action table specifying
entries for terminals, and the goto table specifying entries for
nonterminals.
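A sketch of the corresponding driver loop in C (the table layout, the tagging of action entries and the helper arrays are invented for illustration):

#include <stdio.h>
#include <stdlib.h>

enum { ACT_ERROR, ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT };
struct action { int kind; int value; };          /* value = state or production number  */

extern struct action action_table[][64];         /* Action[state][terminal]             */
extern int  goto_table[][32];                    /* Goto[state][nonterminal]            */
extern int  prod_len[], prod_lhs[];              /* length and left side per production */
extern int  next_token(void);

void lr_parse(void) {
    int stack[256], top = 0;
    stack[top++] = 0;                             /* start with state s0 on the stack    */
    int tok = next_token();
    for (;;) {
        struct action a = action_table[stack[top - 1]][tok];
        if (a.kind == ACT_SHIFT) {
            stack[top++] = a.value;               /* push the new state and consume      */
            tok = next_token();
        } else if (a.kind == ACT_REDUCE) {
            top -= prod_len[a.value];             /* pop one state per right-side symbol */
            int s = stack[top - 1];
            stack[top++] = goto_table[s][prod_lhs[a.value]];  /* goto on the left side   */
        } else if (a.kind == ACT_ACCEPT) {
            printf("parse successful\n");
            return;
        } else {
            fprintf(stderr, "syntax error\n");
            exit(1);
        }
    }
}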
Tracing an LR Parser
We start with the initial state s0 on the stack. The next input token is the
terminal a and the current state is st. The action of the parser is as follows:
Here is the combined action and goto table. In the action columns, sN
means shift the state numbered N onto the stack and an rN action means
reduce using production numbered N. The goto column entries are the
number of the new state to push onto the stack after reducing the specified
nonterminal. This is an LR(0) table (more details on table construction will
come in a minute).
Here is a parse of id + (id) using the LR algorithm with the above action
and goto table:
Types of LR Parsers
There are three types of LR parsers: LR(k), simple LR(k), and lookahead
LR(k) (abbreviated LR(k), SLR(k), and LALR(k)). The k identifies the
number of tokens of lookahead. We will usually only concern ourselves
This dot marks how far we have gotten in parsing the production.
Everything to the left of the dot has been shifted onto the parsing stack and
next input token is in the First set of the symbol after the dot (or in the
follow set if that symbol is nullable). A dot at the right end of a
configuration indicates that we have that entire configuration on the stack
i.e., we have a handle that we can reduce. A dot in the middle of the
configuration indicates that to continue further, we need to shift a token
that could start the symbol following the dot. For example, if we are
currently in this position:
A -> X•YZ
We want to shift something from First(Y) (something that matches the next
input token). Say we have productions Y > u | w. Given that, these three
productions all correspond to the same state of the shift-reduce parser:
A -> X•YZ
Y -> •u
Y -> •w
At the above point in parsing, we have just recognised an X and expect the
upcoming input to contain a sequence derivable from YZ. Examining the
expansions for Y, we furthermore expect the sequence to be derivable from
either u or w . We can put these three items into a set and call it a
configurating set of the LR parser. The action of adding equivalent
configurations to create a configurating set is called closure. Our parsing
tables will have one state corresponding to each configurating set.
These configurating sets represent states that the parser can be in as it
parses a string. Each state must contain all the items corresponding to
each of the possible paths that are concurrently being explored at that point
in the parse. We could model this as a finite automaton where we move
from one state to another via transitions marked with a symbol of the CFG.
For example,
Recall that we push states onto the stack in a LR parser. These states
describe what is on the stack so far. The state on top of the stack
in the above example, we would add the following to the configurating set.
Xi+1 -> •Y1...Yg
Xi+1 -> •Z1...Zh
1.
2.
3.
production
If u begins with a nonterminal B, add all productions with B
on the left side, with the dot at the start of the right side: B -> •v
4.
Now, to create the action and goto tables, we need to construct all the
configurating sets and successor functions for the expression grammar. At
the highest level, we want to start with a configuration with a dot before
the start symbol and move to a configuration with a dot after the start
symbol. This represents shifting and reducing an entire sentence of the
grammar. To do this, we need the start symbol to appear on the right side
of a production. This may not happen in the grammar so we modify it. We
create an augmented grammar by adding the production:
S' -> S
where S is the start symbol. So we start with the initial configurating set C0,
which is the closure of S' -> •S. The augmented grammar for the example
expression grammar:
0) E' -> E
1) E -> E + T
2) E -> T
3) T -> (E)
4) T -> id
3.
Here is the full family of configurating sets for the grammar given above.
Note that the order of defining and numbering the sets is not important;
what is important is that all the sets are included.
A useful means to visualise the configurating sets and successors is with a
diagram like the one shown below. The transitions mark the successor
relationship between sets. We call this a goto graph or transition diagram.
To construct the LR(0) table, we use the following algorithm. The input is
an augmented grammar G' and the output is the action/goto tables:
1.
2.
3.
4.
5.
Notice how the shifts in the action table and the goto table are just
transitions to new states. The reductions are where we have a handle on
the stack that we pop off and replace with the nonterminal for the handle;
this occurs in the states where the • is at the end of a production.
At this point, we should go back and look at the parse of id + (id) from
earlier in the chapter and trace what the states mean. (Refer to the action
and goto tables and the parse diagrammed earlier.)
T -> id•
E -> T•
T -> id•
E -> T•
T -> (E)•
E -> E+T•
E' -> E•
Now let's examine the action of the parser. We start by pushing s0 on the
stack. The first token we read is an id. In configurating set I0, the successor
of id is set I4; this means pushing s4 onto the stack. This is a final state for id
(the • is at the end of the production) so we reduce the production T -> id.
We pop s4 to match the id being reduced and we are back in state s0. We
reduced the handle into a T, so we use the goto part of the table, and
Goto[0, T] tells us to push s2 on the stack. (In set I0 the successor for T was
set I2.) In set I2 the action is to reduce E -> T, so we pop off the s2 state and
are back in s0. Goto[0, E] tells us to push s1. From set I1 seeing a + takes us
to set I5 (push s5 on the stack).
From set I5 we read an open ( which takes us to set I3 (push s3 on the
stack). We have an id coming up and so we shift state s4. Set I4 reduces T ->
id, so we pop s4 to remove the right side and we are back in state s3. We use the
goto table Goto[3, T] to get to set I2. From here we reduce E -> T, pop s2 to
get back to state s3, and now we goto s6. Action[6, )] tells us to shift s7. Now in s7
we reduce T -> (E). We pop the top three states off (one for each symbol in
the right-hand side of the production being reduced) and we are back in s5
again. Goto[5, T] tells us to push s8. We reduce by E -> E + T, which pops off
three states to return to s0. Because we just reduced E we goto s1. The next
input symbol is $, which means we completed the production E' -> E and the
parse is successful.
The stack allows us to keep track of what we have seen so far and what we
are in the middle of processing. We shift states that represent the
amalgamation of the possible options onto the stack until we reach the end
of a production in one of the states. Then we reduce. After a reduce, states
are popped off the stack to match the symbols of the matching right-side.
What's left on the stack is what we have yet to process.
Consider what happens when we try to parse id++. We start in s0 and do the
same as above to reduce the id to T and then to E. Now we are in set I5 and
we encounter another +. This is an error because the action table is empty
for that transition. There is no successor for + from that configurating set,
because there is no viable prefix that begins E++.
Interesting, isn't it, to see the parallels between the two processes? They
both are grouping the possibilities into states that only diverge once we get
further along and can be sure of which path to follow.
Limitations of LR(0) Parsing
The LR(0) method may appear to be a strategy for creating a parser that
can handle any context-free grammar, but in fact, the grammars we used as
examples in this chapter were specifically selected to fit the criteria
needed for LR(0) parsing. Remember that LR(0) means we are parsing
with zero tokens of lookahead. The parser must be able to determine what
action to take in each state without looking at any further input symbols,
i.e. by only considering what the parsing stack contains so far. In an LR(0)
table, each state must only shift or reduce. Thus an LR(0) configurating set
cannot have both shift and reduce items, and can only have exactly one
reduce item. This turns out to be a rather limiting constraint.
To be precise, a grammar is LR(0) if the following two conditions hold:
1.
2.
Very few grammars meet the requirements to be LR(0). For example, any
grammar with an ε-rule will be problematic. If the grammar contains the
production A -> ε, then the item A -> • will create a shift-reduce conflict if
there is any other non-null production for A. ε-rules are fairly common in
programming language grammars, for example, for optional features such
as type qualifiers or variable declarations.
Even modest extensions to the earlier example grammar cause trouble.
Suppose we extend it to allow array elements, by adding the production
rule T -> id[E]. When we construct the configurating sets, we will have one
containing the items T -> id• and T -> id•[E], which gives a shift-reduce
conflict.
Or suppose we allow assignments by adding the productions E -> V = E
and V -> id. One of the configurating sets for this grammar contains the
items V -> id• and T -> id•, leading to a reduce-reduce conflict.
The above examples show that the LR(0) method is just too weak to be
useful. This is caused by the fact that we try to decide what action to take
only by considering what we have seen so far, without using any
information about the upcoming input. By adding just a single token
lookahead, we can vastly increase the power of the LR parsing technique
and work around these conflicts. There are three ways to use a one token
lookahead: SLR(1), LR(1) and LALR(1), each of which we will consider
in turn in the next few chapters.
CHAPTER SIX
SLR and LR parsing
The problem with LR(0)
LR(0) is the simplest technique in the LR family. Although that makes it
the easiest to learn, these parsers are too weak to be of practical use for
anything but a very limited set of grammars. The examples given at the
end of the LR(0) handout show how even small additions to an LR(0)
grammar can introduce conflicts that make it no longer LR(0). The
fundamental limitation of LR(0) is the zero, meaning no lookahead tokens
are used. It is a stifling constraint to have to make decisions using only
what has already been read, without even glancing at what comes next in
the input. If we could peek at the next token and use that as part of the
decision-making, we will find that it allows for a much larger class of
grammars to be parsed.
SLR(1)
We will first consider SLR(1) where the S stands for Simple. SLR(1)
parsers use the same LR(0) configurating sets and have the same table
structure and parser operation, so everything you've already learned about
LR(0) applies here. The difference comes in assigning table actions, where
we are going to use one token of lookahead to help arbitrate among the
conflicts. If we think back to the kind of conflicts we encountered in LR(0)
parsing, it was the reduce actions that caused us grief. A state in an LR(0)
parser can have at most one reduce action and cannot have both shift and
reduce actions. Since a reduce is indicated for any completed item, this
dictates that each completed item must be in a state by itself. But let's
revisit the assumption that if the item is complete, the parser must choose
to reduce. Is that always appropriate? If we peeked at the next upcoming
token, it may tell us something that invalidates that reduction. If the
sequence on top of the stack could be reduced to the non-terminal A, what
tokens do we expect to find as the next input? What tokens would tell us
that the reduction is not appropriate? Perhaps Follow(A) could be useful
here!
The simple improvement that SLR(1) makes on the basic LR(0) parser is
to reduce only if the next input token is a member of the follow set of the
non-terminal being reduced. When filling in the table, we don't assume a
reduce on all inputs as we did in LR(0), we selectively choose the
reduction only when the next input symbol is a member of the follow set.
To be more precise, here is the algorithm for SLR(1) table construction
(note that all steps are the same as for LR(0) table construction except for 2a):
1.
2.
3.
4.
5.
In an SLR(1) parser, it is allowable for there to be both shift and reduce items
in the same state as well as multiple reduce items. The SLR(1) parser will
be able to determine which action to take as long as the follow sets are
disjoint.
Let's consider those changes from the end of the previous chapter to the
simplified expression grammar that would have made it no longer LR(0).
Here is the version with the addition of array access:
E' -> E
E  -> E + T | T
T  -> (E) | id | id[E]
Here are the first two LR(0) configurating sets entered if id is the first
token of the input.
This means we have states corresponding to X1...Xi on the stack and we are
looking to put states corresponding to Xi+1...Xj on the stack and then reduce,
but only if the token following Xj is the terminal a. a is called the
lookahead of the configuration. The lookahead only comes into play with
LR(1) configurations with a dot at the right end:
A -> X1...Xj•, a
This means we have states corresponding to X1...Xj on the stack but we may
only reduce when the next symbol is a. The symbol a is either a terminal or
$ (end of input marker). With SLR(1) parsing, we would reduce if the next
token was any of those in Follow(A). With LR(1) parsing, we reduce only if
the next token is exactly a. We may have more than one symbol in the
lookahead for the configuration; as a convenience, we list those symbols
separated by a forward slash. Thus, the configuration A -> u•, a/b/c says
that it is valid to reduce u to A only if the next token is equal to a, b, or c.
The configuration lookahead will always be a subset of Follow(A).
Recall the definition of a viable prefix from the discussion of SLR parsing. Viable
prefixes are those prefixes of right sentential forms that can appear on the
stack of a shift-reduce parser. Formally, we say that a configuration [A ->
u•v, a] is valid for a viable prefix α if there is a rightmost derivation S =>*
βAw =>* βuvw where α = βu and either a is the first symbol of w, or w is ε
and a is $. For example,
S -> ZZ
Z -> xZ | y
The above grammar would only have seven SLR states, but has ten in
canonical LR. We end up with additional states because we have split
states that have different lookaheads. For example, states 3 and 6 are the
same except for lookahead, state 3 corresponds to the context where we
are in the middle of parsing the first X, state 6 is the second X. Similarly,
states 4 and 7 are completing the first and second X respectively. In SLR,
those states are not distinguished, and if we were attempting to parse a
single b by itself, we would allow that to be reduced to X, even though this
will not lead to a valid sentence. The SLR parser will eventually notice the
syntax error, too, but the LR parser figures it out a bit sooner.
To fill in the entries in the action and goto tables, we use a similar
algorithm as we did for SLR(1), but instead of assigning reduce actions
using the follow set, we use the specific lookaheads. Here are the steps to
build an LR(1) parse table:
1.
2.
3.
4.
5.
Now, let's consider what the states mean. S4 is where X -> b is completed;
S2 and S6 are where we are in the middle of processing the two a's; S7 is where
we process the final b; S9 is where we complete the X -> aX production; S5
is where we complete S -> XX; and S1 is where we accept.
LR(1) Grammars
Every SLR(1) grammar is a canonical LR(1) grammar, but the canonical
LR(1) parser may have more states than the SLR(1) parser. An LR(1)
grammar is not necessarily SLR(1); the grammar given earlier is an
example. Because an LR(1) parser splits states based on differing
lookaheads, it may avoid conflicts that would otherwise result if using the
full follow set.
A grammar is LR(1) if the following two conditions are satisfied for each
configurating set:
1. For any item in the set [A -> u•xv, a] with x a terminal, there
   is no item in the set of the form [B -> v•, x]. In the action
   table, this translates to no shift-reduce conflict for any state.
   The successor function for x either shifts to a new state or
   reduces, but not both.
2. The lookaheads for all complete items within the set must
   be disjoint, e.g. a set cannot have both [A -> u•, a] and [B -> v•, a].
   This translates to no reduce-reduce conflict on any
   state. If more than one non-terminal could be reduced from
   this set, it must be possible to uniquely determine which is
   appropriate from the next input token.
CHAPTER SEVEN
LALR Parsing
The Motivation for LALR
Because a canonical LR(1) parser splits states based on differing
lookahead sets, it can have many more states than the corresponding
SLR(1) or LR(0) parser. Potentially it could require splitting a state with
just one item into a different state for each subset of the possible
lookaheads; in a pathological case, this means the entire power set of its
follow set (which theoretically could contain all terminals). It never
actually gets that bad in practice, but a canonical LR(1) parser for a
programming language might have an order of magnitude more states than
an SLR(1) parser. Is there something in between?
With LALR (lookahead LR) parsing, we attempt to reduce the number of
states in an LR(1) parser by merging similar states. This reduces the
number of states to the same as SLR(1), but still retains some of the power
of the LR(1) lookaheads. Let's examine the LR(1) configurating sets from
the example given in the chapter on LR parsing.
S' > S
S > XX
X > aX
X > b
Notice that some of the LR(1) states look suspiciously similar. Take I3 and
I6 for example. These two states are virtually identical: they have the
same number of items, the core of each item is identical, and they differ
only in their lookahead sets. This observation may make you wonder if it is
possible to merge them into one state. The same is true of I4 and I7, and I8
and I9. If we did merge, we would end up replacing those six states with
just these three:
I36:  X -> a•X, a/b/$
      X -> •aX, a/b/$
      X -> •b,  a/b/$
I47:  X -> b•,  a/b/$
I89:  X -> aX•, a/b/$
But isn't this just SLR(1) all over again? In the above example, yes, since
after the merging we did end up with the complete follow sets as the
lookaheads. This is not always the case, however. Consider this grammar, in
which both B and C derive e:
S' -> S
S  -> aBc | bCc | aCd | bBd
B  -> e
C  -> e
We try to merge I6 and I9 since they have the same core items and they only
differ in lookahead:
I69:  B -> e•, c/d
      C -> e•, c/d
But now the lookaheads of the two complete items are no longer disjoint: on
seeing c or d the parser cannot tell whether to reduce to B or to C. Merging
has introduced a reduce-reduce conflict that the canonical LR(1) parser did
not have, so this grammar is LR(1) but not LALR(1).
The brute-force approach to building the LALR(1) states, then, is:
1. Construct all of the canonical LR(1) configurating sets.
2. Merge those sets that have the same core, i.e. those that are identical if the
   lookaheads are ignored.
3. Take the successor function for each merged set from the sets being merged
   (this is well defined, because sets with the same core have successors with
   the same core).
4. Fill in the action and goto entries from the merged sets using the same
   algorithm as for canonical LR(1).
Let's do an example to make this clearer. Consider the LR(1) table for
the grammar given earlier. There are nine states.
However there is a more efficient strategy for building the LALR(1) states
called step-by-step merging. The idea is that you merge the configurating
sets as you go, rather than waiting until the end to find the identical ones.
Sets of states are constructed as in the LR(1) method, but at each point
where a new set is spawned, you first check to see whether it may be
merged with an existing set. This means examining the other states to see
if one with the same core already exists. If so, you merge the new set with
the existing one, otherwise you add it normally.
Here is an example of this method in action:
S' -> S
S  -> V = E
E  -> F | E + F
F  -> V | int | (E)
V  -> id
Suppose that partway through the construction we spawn a new configurating
set containing the item E -> F• with lookaheads $/+/). It has the same core as
I6, so rather than add a new state, we go ahead and merge with that one to get:
I6:  E -> F•, $/+/)
We have a similar situation with state I12, which can be merged with state I7.
The algorithm continues like this, merging into existing states where
possible and only adding new states when necessary. When we finish
creating the sets, we construct the table just as in LR(1).
LALR(1) Grammars
A formal definition of what makes a grammar LALR(1) cannot be easily
encapsulated in a set of rules, because it needs to look beyond the
seen in the input and shift state 3 on the stack (the successor for id in these
states), effectively faking that the necessary token was found. The error
message printed might be something like "missing operand".
Error e2 is called from states 0, 1, 2, 4, 5 on finding a right parenthesis
where we were expecting either the beginning of a new expression (or
potentially the end of input for state 1). A possible fix: remove right
parenthesis from the input and discard it. The message printed could be
"unbalanced right parenthesis."
Error e3 is called from states 1, 3, 6, 7, 8, and 9 on finding an id or a left parenthesis.
What were these states expecting? What might be a good fix? How should
you report the error to the user?
Error e4 is called from state 6 on finding $. What is a reasonable fix? What
do you tell the user?
CHAPTER EIGHT
Parsing Miscellany
A different way of resolving ambiguity
Recall that ambiguity means we have two or more leftmost derivations for
the same input string, or equivalently, that we can build more than one
parse tree for the same input string. A simple arithmetic expression
grammar is a common example:
E > E + E | E * E | (E) | id
Let's say we were building an SLR(1) table for this grammar. Look
carefully at state 7. In the action table, there are two conflicting entries
under the column labeled * : s5 and r1, a shift/reduce conflict. Trace the
parse of input id + id * id up to the point where we arrive in state 7:
State stack          Remaining input
S0S1S4S7             * id $               (state 7: next input is *)
At this point during the parse, we have the handle E + E on top of the stack,
and the lookahead is *. Since * is in the follow set for E, we could reduce that E
+ E to E. But we could also shift the * and keep going. What choice should
we make if we want to preserve the usual arithmetic precedence?
What about if we were parsing id + id + id? We have a similar shift/reduce
in state 7, now on next input +. How do we want to resolve the
shift/reduce conflict here? (Because addition is commutative, it actually
doesn't much matter, but it will for subtraction!)
State stack          Remaining input
S0S1S4S7             + id $               (state 7: next input is +)
An analogous snapshot arises in state 8, where the handle E * E is on top of the
stack and the next input is +.
Here are the LR(0) configurating sets. Where is the conflict in the
following collection?
Note this diagram refers to grammars, not languages, e.g. there may be
an equivalent LR(1) grammar that accepts the same language as another
non-LR(1) grammar. No ambiguous grammar is LL(1) or LR(1), so we
must either re-write the grammar to remove the ambiguity or resolve
conflicts in the parser table or implementation.
The hierarchy of LR variants is clear: every LR(0) grammar is SLR(1), and
every SLR(1) grammar is LALR(1), which in turn is LR(1). But there are grammars
that don't meet the requirements for the weaker forms yet can be parsed
by the more powerful variants.
We've seen several examples of grammars that are not LL(1) but are
LR(1). The reverse is not possible: every LL(1) grammar is guaranteed
to be LR(1). A rigorous proof is fairly straightforward from the definitions
of LL(1) and LR(1) grammars. Your intuition should tell you that an
LR(1) parser uses more information than the LL(1) parser, since it
postpones the decision about which production is being expanded until it
sees the entire right side rather than attempting to predict after seeing just
the first terminal.
Comparing LL(1) Parsers to LALR(1)
The two dominant parsing techniques in real compilers are LL(1) and
LALR(1). These techniques are the ones to stash away in your brain cells
for further usage. Here are some thoughts on how to weigh the two
approaches against one another:
Implementation. Because the underlying algorithms are more
complicated, most LALR(1) parsers are built using parser
generators such as yacc and bison. LL(1) parsers may be
implemented via hand-coded recursive-descent or via LL(1)
table-driven predictive parser generators like LLgen. There
are those who like managing details and writing all the code
themselves: no errors result from misunderstanding how the
tools work, and so on. But as projects get bigger, the
automated tools can be a help; yacc/bison can find
ambiguities and conflicts that you might have missed doing
the work by hand, for example. The implementation chosen
also has an effect on maintenance. Which would you rather
do: add new productions into a grammar specification being
fed to a generator, add new entries into a table, or write new
functions for a recursive-descent parser?
Simplicity: Both techniques have fairly simple drivers. The
algorithm underlying LL(1) is more intuitively
understandable and thus easier to visualise and debug. The
myriad details of the LALR(1) configurations can be messy
and when trying to debug can be a bit overwhelming.
CHAPTER NINE
INTRODUCTION TO YACC
Grammars for yacc are described using a variant of Backus Naur Form
(BNF). This technique was pioneered by John Backus and Peter Naur, and
used to describe ALGOL60. A BNF grammar can be used to express
context-free languages. Most constructs in modern programming
languages can be represented in BNF.
Input to yacc is divided into three sections:
... definitions ...
%%
... rules ...
%%
... subroutines ...
The definitions section is where tokens are declared; yacc writes these token
definitions to a header file, y.tab.h.
Lex includes this file and utilizes the definitions for token values. To
obtain tokens, yacc calls yylex. Function yylex has a return type of
int, and returns the token value. Values associated with the token are
returned by lex in variable yylval. For example,
[0-9]+      { yylval = atoi(yytext);
              return INTEGER;
            }
would store the value of the integer in yylval, and return token INTEGER
to yacc. The type of yylval is determined by YYSTYPE. Since the default
type is integer, this works well in this case. Token values 0-255 are
reserved for character values. For example, if you had a rule such as
[-+] return *yytext; /* return operator */
the character value for minus or plus is returned. Note that we placed the
minus sign first so that it wouldn't be mistaken for a range designator.
Generated token values typically start around 258, as lex reserves several
values for end-of-file and error processing. Here is the complete lex input
specification for a simple calculator:
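That listing does not survive in this copy of the text, so the sketch below is a
reconstruction in the spirit of Niemann's calculator example rather than the
author's exact code; the set of operator characters handled is an assumption.

%{
    #include <stdlib.h>
    #include "y.tab.h"        /* token values generated by yacc */
    void yyerror(char *);
%}
%%
[0-9]+      { yylval = atoi(yytext);     /* integer constant */
              return INTEGER;
            }
[-+\n]      { return *yytext; }          /* operators and end of line */
[ \t]       ;                            /* skip whitespace */
.           { yyerror("invalid character"); }
%%
int yywrap(void) {
    return 1;
}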
Internally, yacc maintains two stacks in memory; a parse stack and a value
stack. The parse stack contains terminals and nonterminals, and represents
the current parsing state. The value stack is an array of YYSTYPE
elements, and associates a value with each element in the parse stack. For
example, when lex returns an INTEGER token, yacc shifts this token to the
parse stack. At the same time, the corresponding yylval is shifted to the
value stack. The parse and value stacks are always synchronized, so
finding a value related to a token on the stack is easily accomplished. Here
is the yacc input specification for a simple calculator:
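The corresponding yacc listing is also missing here; a minimal sketch consistent
with the rules discussed below (the exact grammar is an assumption, not the
author's listing) might be:

%{
    #include <stdio.h>
    int yylex(void);
    int yyparse(void);
    void yyerror(char *);
%}
%token INTEGER
%%
program:
        program expr '\n'      { printf("%d\n", $2); }
        |                      /* empty */
        ;
expr:
        INTEGER
        | expr '+' expr        { $$ = $1 + $3; }
        | expr '-' expr        { $$ = $1 - $3; }
        ;
%%
void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}
int main(void) {
    yyparse();
    return 0;
}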
The rules section resembles the BNF grammar discussed earlier. The left-hand
side of a production, or nonterminal, is entered left-justified,
followed by a colon. This is followed by the right-hand side of the
production. Actions associated with a rule are entered in braces.
By utilizing left-recursion, we have specified that a program consists of
zero or more expressions. Each expression terminates with a newline.
When a newline is detected, we print the value of the expression. When we
apply the rule
expr: expr '+' expr { $$ = $1 + $3; }
we replace the right-hand side of the production in the parse stack with the
left-hand side of the same production. In this case, we pop expr '+'
expr and push expr. We have reduced the stack by popping three
terms off the stack, and pushing back one term. We may reference
positions in the value stack in our C code by specifying $1 for the first
term on the right-hand side of the production, $2 for the second, and so
on. $$ designates the top of the stack after reduction has taken place.
The above action adds the value associated with two expressions, pops
three terms off the value stack, and pushes back a single sum. Thus, the
parse and value stacks remain synchronised.
Numeric values are initially entered on the stack when we reduce from
INTEGER to expr. After INTEGER is shifted to the stack, we apply the
rule expr: INTEGER, whose action simply copies the value through ($$ = $1).
The INTEGER token is popped off the parse stack, followed by a push of
expr. For the value stack, we pop the integer value off the stack, and then
push it back on again. In other words, we do nothing. In fact, this is the
default action, and need not be specified. Finally, when a newline is
encountered, the value associated with expr is printed.
In the event of syntax errors, yacc calls the user-supplied function
yyerror. If you need to modify the interface to yyerror, you can alter
the canned file that yacc includes to fit your needs. The last function in our
yacc specification is main, in case you were wondering where it was.
This example still has an ambiguous grammar. Yacc will issue shift-reduce
warnings, but will still process the grammar using shift as the
default operation.
In the remainder of this chapter we will extend the simple calculator by
incorporating some new functionality. New features include the arithmetic
operators multiply and divide. Parentheses may be used to override
operator precedence, and single-character variables may be specified in
assignment statements. The following illustrates sample input and
calculator output:
user:   3 * (4 + 5)
calc:   27
user:   x = 3 * (5 + 4)
user:   y = 5
user:   x
calc:   27
user:   y
calc:   5
user:   x + 2*y
calc:   37
The input specification for yacc follows. The tokens for INTEGER and
VARIABLE are utilized by yacc to create #defines in y.tab.h for use
in lex. This is followed by definitions for the arithmetic operators. We may
specify %left, for left-associative, or %right, for right-associative. The
last definition listed has the highest precedence. Thus, multiplication and
division have higher precedence than addition and subtraction. All four
operators are left-associative. Using this simple technique, we are able to
disambiguate our grammar.
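The specification itself falls in the missing pages; based on the description
above, its definitions and rules sections would look roughly like the sketch
below. The 26-entry sym array for single-letter variables is an illustrative
assumption, not a detail taken from the original text.

%{
    #include <stdio.h>
    int yylex(void);
    void yyerror(char *);
    int sym[26];                    /* one slot per single-letter variable */
%}
%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'
%%
program:
        program statement '\n'
        |                           /* empty */
        ;
statement:
        expr                        { printf("%d\n", $1); }
        | VARIABLE '=' expr         { sym[$1] = $3; }
        ;
expr:
        INTEGER
        | VARIABLE                  { $$ = sym[$1]; }
        | expr '+' expr             { $$ = $1 + $3; }
        | expr '-' expr             { $$ = $1 - $3; }
        | expr '*' expr             { $$ = $1 * $3; }
        | expr '/' expr             { $$ = $1 / $3; }
        | '(' expr ')'              { $$ = $2; }
        ;
%%
void yyerror(char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }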
CHAPTER TEN
Syntax-Directed Translation
Syntax-directed translation refers to a method of compiler implementation
where the source language translation is completely driven by the parser.
In other words, the parsing process and parse trees are used to direct
semantic analysis and the translation of the source program. This can be a
separate phase of a compiler or we can augment our conventional grammar
with information to control the semantic analysis and translation. Such
grammars are called attribute grammars.
We augment a grammar by associating attributes with each grammar
symbol that describe its properties. An attribute has a name and an
associated value: a string, a number, a type, a memory location, an
assigned register, whatever information we need. For example, variables
may have an attribute "type" (which records the declared type of a
variable, useful later in type-checking) and an integer constant may have an
attribute "value" (which we will later need to generate code).
With each production in a grammar, we give semantic rules or actions,
which describe how to compute the attribute values associated with each
grammar symbol in a production. The attribute value for a parse node may
depend on information from its children nodes below or its siblings and
parent node above.
Consider this production, augmented with a set of actions that use the
value attribute for a digit node to store the appropriate numeric value.
Below, we use the syntax X.a to refer to the attribute a associated with
symbol X.
digit -> 0    {digit.value = 0}
      |  1    {digit.value = 1}
      |  2    {digit.value = 2}
      ...
      |  9    {digit.value = 9}
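The note in parentheses that follows refers to a companion production for
multi-digit numbers, which does not appear in the surviving text; a plausible
reconstruction (an illustration, not the author's exact rule) is:

int  -> digit         {int.value = digit.value}
int1 -> int2 digit    {int1.value = int2.value * 10 + digit.value}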
(We are using subscripts in this example to clarify which attribute we are
referring to, so int1 and int2 are different instances of the same nonterminal symbol.)
There are two types of attributes we might encounter: synthesised or
inherited. Synthesised attributes are those attributes that are passed up a
parse tree, that is, the left-side attribute is computed from the right-side
attributes.
Inherited attributes are those that are passed down a parse tree, i.e., the
right-side attributes are derived from the left-side attributes (or other
right-side attributes). These attributes are used for passing information about
the context to nodes further down the tree.
For a production X -> Y1 Y2 ... Yn, an inherited attribute of Yk is computed
from attributes of the parent and siblings:
    Yk.a = f(X.a, Y1.a, Y2.a, ..., Yk-1.a, Yk+1.a, ..., Yn.a)
Now we add two attributes to this grammar, name and dl, for the name of a
variable and the list of declarations. Each time a new variable is declared,
a synthesised attribute for its name is attached to it. That name is added to
a list of variables declared so far in the synthesised attribute dl that is
created from the declaration block. The list of variables is then passed as
an inherited attribute to the statements following the declarations for use in
checking that variables are declared before use.
P  -> D S             {S.dl = D.dl}
D1 -> var V; D2       {D1.dl = addlist(V.name, D2.dl)}
   |                  {D1.dl = NULL}
S1 -> V := E; S2      {check(V.name, S1.dl); S2.dl = S1.dl}
   |
V  -> x               {V.name = 'x'}
   |  y               {V.name = 'y'}
   |  z               {V.name = 'z'}
If we were to parse the following code, what would the attribute structure
look like?
var x;
var y;
x := ...;
y := ...;
Top-Down SDT
We can implement syntax-directed translation in either a top-down or a
bottom-up parser and we'll briefly investigate each approach. First, let's
look at adding attribute information to a hand-constructed top-down
recursive-descent parser. Our example will be a very simple FTP client,
where the parser accepts user commands and uses a syntax-directed
translation to act upon those requests. Here's the grammar we'll use,
already in an LL(1) form:
Session     -> CommandList T_QUIT
CommandList -> Command CommandList | ε
Command     -> Login | Get | Logout
Login       -> User Pass
User        -> T_USER T_IDENT
Pass        -> T_PASS T_IDENT
Get         -> T_GET T_IDENT MoreFiles
MoreFiles   -> T_IDENT MoreFiles | ε
Logout      -> T_LOGOUT
Now, let's see how the attributes, such as the username, filename, and
connection, can be passed around during the parsing. This recursive-descent
parser uses the lookahead/token-matching utility functions from the chapter
on top-down parsing. During processing of the Login command, the parser
gathers the username and password returned from the children nodes and uses
that information to create a new connection attribute to pass up the tree.
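The parser code itself is not included in the surviving pages. The C sketch
below shows the general shape such a routine might take; the token codes, the
lookahead variable, MatchToken, and the Connection type are all illustrative
stand-ins for the utility functions referred to above, not names from the
original text.

/* Hypothetical sketch of attribute passing in a recursive-descent parser. */
#include <stdio.h>
#include <string.h>

enum { T_USER, T_PASS, T_IDENT, T_QUIT };

typedef struct { int type; char text[64]; } Token;
typedef struct { char user[64]; char pass[64]; } Connection;

static Token lookahead;                 /* filled in by the scanner */

static void MatchToken(int expected)    /* consume one token or report an error */
{
    if (lookahead.type != expected)
        fprintf(stderr, "syntax error\n");
    /* ... advance lookahead by calling the scanner ... */
}

/* User -> T_USER T_IDENT : synthesise the username attribute upward */
static void ParseUser(char *name)
{
    MatchToken(T_USER);
    strcpy(name, lookahead.text);
    MatchToken(T_IDENT);
}

/* Pass -> T_PASS T_IDENT : synthesise the password attribute upward */
static void ParsePass(char *pass)
{
    MatchToken(T_PASS);
    strcpy(pass, lookahead.text);
    MatchToken(T_IDENT);
}

/* Login -> User Pass : combine the children's attributes into a
 * connection attribute that is passed up to the caller */
static Connection ParseLogin(void)
{
    Connection c;
    ParseUser(c.user);
    ParsePass(c.pass);
    /* a real client would open the connection using c at this point */
    return c;
}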
CHAPTER ELEVEN
Semantic Analysis
What is semantic analysis?
Parsing only verifies that the program consists of tokens arranged in a
syntactically valid combination. We now move on to semantic analysis,
where we delve deeper to check whether they form a sensible set of
instructions in the programming language. Whereas any old noun phrase
followed by some verb phrase makes a syntactically-correct English
followed by some verb phrase makes a syntactically-correct English
sentence, a semantically-correct one has subject-verb agreement, gender is
properly used, and the components go together to express an idea that
makes sense. For a program to be semantically correct, all variables,
functions, classes, etc. are properly defined, expressions and variables are
used in ways that respect the type system, access control isn't violated, and
so on. Semantic analysis is the next-to-last phase of the front end and the
compiler's last chance to weed out incorrect programs. We need to ensure
the program is well-formed enough to continue on to the next phase where
we generate code.
A large part of semantic analysis consists of tracking
variable/function/type declarations and type checking. In many languages,
identifiers have to be declared before use. As the compiler encounters a
new declaration, it records the type information assigned to that identifier.
Then, as it continues examining the rest of the program, it verifies that the
type of an identifier is respected in terms of the operations being
performed. For example, the type of the right-side expression of an
assignment statement should match the type of the left side, and the left
side needs to be a properly declared and assignable identifier (i.e. not some
sort of constant). The parameters of a function should match the arguments
of a function call in both number and type. The language may require that
identifiers are unique, disallowing a global variable and function of the
same name. The operands to a multiplication operation will need to be of
numeric type, perhaps even the exact same type depending on the
strictness of the language. These are examples of the things checked in the
semantic analysis phase.
Semantic analysis can be done right in the midst of parsing. As a particular
construct is recognised, say an addition expression, the parser action
would be to check the two operands and verify they are of numeric type
and compatible for this operation. In fact, in a one-pass compiler, the code
is generated right then and there as well. In a compiler that runs in more
than one pass, the first pass only parses the input and builds the tree
representation of the program. Then in a second pass, we go back and
traverse the tree we built to verify the semantic rules of the language are
being respected. The single-pass strategy offers space and time savings.
Compound types
Complex types
In many languages, a programmer must first establish the name and type
of any data object (variable, function, type, etc). In addition, the
programmer usually defines the lifetime. A declaration is a statement in a
program that communicates to the compiler this information. The basic
declaration is just a name and type, but in many languages may include
modifiers that control visibility and lifetime (i.e. static in C, private in
Java). Some languages also allow declarations to initialise variables,
such as in C, where you can declare and initialise in one statement.
The fact that types do not have to be declared unless necessary makes it
possible for ML to provide one of its most important features:
polymorphism. A polymorphic function is one that takes parameters of
different types on different activations. For example, a function that
returns the number of elements in a list:
fun length(L) = if L = nil then 0 else length (tl(L)) + 1;
(Note: tl is a built-in function that returns all the elements after the first
element of a list.) This function will work on a list of integers, reals,
characters, strings, lists, etc. Polymorphism is an important feature of most
object-oriented languages also. It introduces some interesting problems in
semantic analysis, as we will see a bit later.
Designing a Type Checker
When designing a type checker for a compiler, here's the process to
follow:
1.
identify the types that are available in the language
2.
identify the language constructs that have types associated
with them
3.
identify the semantic rules for the language
Type equivalence of compound types
The equivalence of base types is usually very easy to establish: an int is
only exactly equivalent to an int, a bool only to a bool. Many languages also
provide compound types (arrays, structs, pointers, and so on), whose
definitions are commonly represented as type trees:
arrays:     two subtrees, one for the number of elements and one for the base type
structs:    one subtree for each field
pointers:   one subtree that is the type being referenced by the pointer
var
    a, b: array[1..5] of integer;
    c:    array[1..5] of integer;
    d, e: little;
    f, g: small;
    h, i: big;
When are two types the same? Which of the types are equivalent in the
above example? It depends on how one defines equivalence; the two
main options are named versus structural equivalence. If the language
have the same name. Thus d and e are type-equivalent, so are f and g
and h and i. The variables a and b are also type-equivalent because they
have identical (but unnamed) types. (Any variables declared in the same
statement have the same type.) But c is a different, anonymous type. And
even though the small type is a synonym for little which is a
synonym for an array of 5 integers, Pascal, which only supports named
equivalence, does not consider d to be type-equivalent to a or f. The
more general form of equivalence is structural equivalence. Two types are
structurally equivalent if a recursive traversal of the two type definition
trees matches in entirety. Thus, the variables a-g are all structurally
equivalent but are distinct from h and i.
Which definition of equivalence a language supports is a part of the
definition of the language. This, of course, has an impact on the
implementation of the type checker of the compiler for the language.
Clearly, a language supporting named equivalence is much easier and
quicker to type check than one supporting structural equivalence. But there
is a trade-off. Named equivalence does not always reflect what is really
being represented in a user-defined type. Which version of equivalence
does C support? Do you know? How could you find out?
Type Compatibility and Subtyping
In addition to establishing rules for type equivalency, the type system also
defines type compatibility. Certain language constructs may require
equivalent types, but most allow for substitution of coercible or
compatible types.
We've already talked a bit about type coercion. An int and a double are not
type equivalent, but a function that takes a double parameter may allow an
integer argument to be passed because an integer can be coerced to a
double without loss of precision. The reverse may or may not be true: in C,
a double is substitutable for an int (it is truncated); in Java, a typecast is
required to force the truncation. This sort of automatic coercion affects
both the type checker and the code generator, since we need to recognise
which coercions are valid in a particular context and if required, generate
the appropriate instructions to actually do the conversion.
When we assign to a, should we use the global variable, the local variable,
or the parameter? Normally it is the innermost declaration, the one nearest
the reference, which wins out. Thus, the local variable is assigned the
value 2. When a variable name is re-used like this, we say the innermost
declaration shadows the outer one. Inside the Binky function, there is no
way to access the other two a variables because the local variable is
shadowing them and C has no mechanism to explicitly specify which
scope to search.
There are two common approaches to the implementation of scope
checking in a compiler. The first is to implement an individual symbol
table for each scope. We organise all these symbol tables into a scope
stack with one entry for each open scope. The innermost scope is stored at
the top of the stack, the next containing scope is underneath it, etc. When a
new scope is opened, a new symbol table is created and the variables
declared in that scope are placed in the symbol table. We then push the
symbol table on the stack. When a scope is closed, the top symbol table is
popped. To find a name, we start at the top of the stack and work our way
down until we find it. If we do not find it, the variable is not accessible and
an error should be generated.
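A minimal C sketch of this scope-stack arrangement follows; the structure and
function names are illustrative, not taken from the original text.

#include <stdlib.h>
#include <string.h>

typedef struct Symbol {
    char *name;
    char *type;                  /* or a richer type descriptor */
    struct Symbol *next;
} Symbol;

typedef struct Scope {
    Symbol *symbols;             /* entries declared in this scope */
    struct Scope *enclosing;     /* the scope underneath on the stack */
} Scope;

static Scope *top = NULL;        /* innermost open scope */

void OpenScope(void) {
    Scope *s = malloc(sizeof(Scope));
    s->symbols = NULL;
    s->enclosing = top;
    top = s;                     /* push the new symbol table */
}

void CloseScope(void) {
    top = top->enclosing;        /* pop (a real compiler would free the entries) */
}

void Declare(const char *name, const char *type) {
    Symbol *sym = malloc(sizeof(Symbol));
    sym->name = strdup(name);
    sym->type = strdup(type);
    sym->next = top->symbols;
    top->symbols = sym;
}

/* Search from the innermost scope outward; NULL means "not declared". */
Symbol *Lookup(const char *name) {
    for (Scope *s = top; s != NULL; s = s->enclosing)
        for (Symbol *sym = s->symbols; sym != NULL; sym = sym->next)
            if (strcmp(sym->name, name) == 0)
                return sym;
    return NULL;
}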
There is an important disadvantage to this approach, besides the obvious
overhead of creating additional symbol tables and doing the stack
processing. All global variables will be at the bottom of the stack, so scope
checking of a program that accesses a lot of global variables through many
levels of nesting can run slowly. The overhead of a table per scope can
also contribute to memory bloat in the compiler.
The other approach to the implementation of scope checking is to have a
single global table for all the scopes. We assign to each scope a scope
number. Each entry in the symbol table is assigned the scope number of
the scope it is contained in. A name may appear in the symbol table more
than once as long as each repetition has a different scope number.
When we encounter a new scope, we increment a scope counter. All
variables declared in this scope are placed in the symbol table and
assigned this scopes number. If we then encounter a nested scope, the
scope counter is incremented once again and any newly declared variables
are assigned this new number. Using a hash table, new names are always
entered at the front of the chains to simplify the searches. Thus, if we have
the same name declared in different nested scopes, the first occurrence of
the name on the chain is the one we want.
When a scope is closed, all entries with the closing scope number are
deleted from the table. Any previously shadowed variables will now be
accessible again. If we try to access a name in a closed scope, we will not
find it, since its entries were removed from the table when the scope closed.
We are rotating all the objects in the list without providing any type
information. The receiving object itself will respond with the correct
implementation for its shape type.
The primary difference between virtual functions and non-virtual functions
is their binding times. Binding means associating a name with a definition
or storage location. In C++, the names of nonvirtual functions are bound at
compile time. The names of virtual functions are bound at run-time, at the
time of the call to the method. Thus, the binding is determined by the class
of the object at the time of the function call. To implement this, each
virtual method in a derived class reserves a slot in the class definition
record, which is created at run-time. A constructor fills in this slot with the
location of the virtual function defined in the derived class, if one exists. If
it does not exist, it fills in the location with the function from the base
class.
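To make the slot-filling idea concrete, here is a small C sketch that simulates
virtual dispatch with a table of function pointers. The Shape and rotate names
echo the example above, but the code is an illustration, not the author's.

#include <stdio.h>

typedef struct Shape Shape;

typedef struct {
    void (*rotate)(Shape *self, double degrees);   /* one slot per virtual method */
} VTable;

struct Shape {
    const VTable *vtable;        /* filled in by the "constructor" */
};

static void RotateCircle(Shape *self, double degrees) {
    (void)self;
    printf("rotating a circle by %.1f degrees (a no-op)\n", degrees);
}

static void RotateSquare(Shape *self, double degrees) {
    (void)self;
    printf("rotating a square by %.1f degrees\n", degrees);
}

static const VTable circle_vtable = { RotateCircle };
static const VTable square_vtable = { RotateSquare };

int main(void) {
    Shape circle = { &circle_vtable };
    Shape square = { &square_vtable };
    Shape *shapes[] = { &circle, &square };

    /* The call site does not know the concrete type; the binding is
     * resolved through the slot in the table at run-time. */
    for (int i = 0; i < 2; i++)
        shapes[i]->vtable->rotate(shapes[i], 90.0);
    return 0;
}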
CHAPTER TWELVE
Intermediate Representations
Most compilers translate the source program first to some form of
intermediate representation (IR) and convert from there into machine
code. The intermediate representation is a machine- and language-independent
version of the original source code. Although converting the
code twice introduces another step and thus incurs loss in compiler
efficiency, use of an intermediate representation provides advantages in
increased abstraction, cleaner separation between the front and back ends,
and adds possibilities for re-targeting/cross-compilation. Intermediate
representations also lend themselves to supporting advanced compiler
optimisations and most optimisation is done on this form of the code.
There are many intermediate representations in use (one author suggests it
may be as many as a unique one for each existing compiler) but the
various representations are actually more alike than they are different.
Once you become familiar with one, it's not hard to learn others.
Intermediate representations are usually categorised according to where
they fall between a high-level language and machine code. IRs that are
close to a high-level language are called high-level IRs, and IRs that are
close to assembly are called low-level IRs. A high-level IR
might preserve things like array subscripts or field accesses, whereas a
low-level IR converts those into explicit addresses and offsets. For example,
consider the following three code examples (from Muchnick), offering
three translations of a 2-dimensional array access:
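The three listings themselves are not reproduced in this copy. As a rough
illustration in the same spirit (assuming a 10 x 20 array of 4-byte integers and
made-up frame offsets; these numbers are assumptions, not Muchnick's), the
access t = a[i][j] might appear at the three levels like this:

High-level IR (subscripts preserved):
    t = a[i][j] ;

Medium-level IR (subscripts lowered to index arithmetic):
    t1 = i * 20 ;
    t2 = t1 + j ;
    t3 = t2 * 4 ;
    t  = *(a + t3) ;

Low-level IR (explicit machine addresses and loads):
    r1 = [fp - 8]          ; load i
    r2 = r1 * 20
    r3 = [fp - 12]         ; load j
    r4 = r2 + r3
    r5 = r4 * 4
    r6 = fp - 216          ; base address of a
    r7 = r6 + r5
    t  = [r7]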
The thing to observe here isn't so much the details of how this is done (we
will get to that later), as the fact that the low-level IR has different
information than the high-level IR. What information does a high-level IR
have that a low-level one does not? What information does a low level IR
have that a high-level one does not? What kind of optimisation might be
possible in one form that might not in another?
High-level IRs usually preserve information such as loop structure and
if-then-else statements. They tend to reflect the source language they are
compiling more than lower-level IRs do. Medium-level IRs often attempt to
be independent of both the source language and the target machine. Low-level
IRs tend to reflect the target architecture very closely, and as such are
often machine dependent. They differ from actual assembly code in that
there may be choices for generating a certain sequence of operations, and
the IR stores this data in such a way as to make it clear that choice must be
made. Sometimes a compiler will start out with a high-level IR, perform
some optimisations, translate the result to a lower-level IR and optimise
again, then translate to a still lower IR, and repeat the process until final
code generation.
Abstract Syntax Trees
A parse tree is an example of a very high-level intermediate
representation. You can usually completely reconstruct the actual source
code from a parse tree since it contains all the information about the parsed
program. (It's fairly unusual that you can work backwards in that way
from most IRs, since much information has been removed in translation.)
More likely, a tree representation used as an IR is not quite the literal parse
tree (intermediate nodes may be collapsed, grouping units can be
dispensed with, etc.), but it is winnowed down to the structure sufficient to
drive the semantic processing and code generation. Such a tree is usually
referred to as an abstract syntax tree. In the programming projects so far,
you have already been immersed in creating and manipulating such a tree.
Each node represents a piece of the program structure and the node will
have references to its children subtrees (or none if the node is a leaf) and
possibly also have a reference to its parent.
Consider the following excerpt of a programming language grammar:
program        -> function_list
function_list  -> function_list function | function
function       -> PROCEDURE ident ( params ) body
params         -> ...
The literal parse tree for the sample program looks something like:
Here is what the abstract syntax tree looks like (notice how some pieces
like the parentheses and keywords are no longer needed in this representation):
The parser actions to construct the tree might look something like this:
function: PROCEDURE ident ( params ) body
{ $$ = MakeFunctionNode($2, $4, $6); }
function_list: function_list function
{ $$ = $1; $1->AppendNode($2); }
What about the terminals at the leaves? Those nodes have no children,
usually these will be nodes that represent constants and simple variables.
When we recognise those parts of the grammar that represent leaf nodes,
we store the data immediately in that node and pass it upwards for joining
in the larger tree.
constant : int_constant
{ $$ = MakeIntConstantNode($1); }
To generate code for the entire tree, we first generate code for each of the
subtrees, storing the result in some agreed-upon location (usually a
register), and then combine those results. The function GenerateCode
below takes two arguments: the subtree for which it is to generate
assembly code and the number of the register in which the computed result
will be stored.
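The listing for GenerateCode falls on a missing page; the C sketch below
follows the description in the next paragraph, with the node field names
(label, left, right) and the instruction mnemonics chosen for illustration
rather than taken from the original.

#include <stdio.h>
#include <string.h>

typedef struct Node {
    char label[16];              /* an operator ("+", "*", ...) or a variable name */
    struct Node *left, *right;
} Node;

static int IsOperator(const char *label) {
    return strcmp(label, "+") == 0 || strcmp(label, "-") == 0 ||
           strcmp(label, "*") == 0 || strcmp(label, "/") == 0;
}

/* Generate assembly for the subtree rooted at n, leaving the value in R<result>. */
void GenerateCode(Node *n, int result)
{
    if (!IsOperator(n->label)) {
        /* leaf: load the variable's current value into the result register */
        printf("LOAD  %s, R%d\n", n->label, result);
    } else {
        /* evaluate the left subtree into R<result> and the right subtree into
         * the next higher-numbered register, then combine the two results */
        GenerateCode(n->left, result);
        GenerateCode(n->right, result + 1);
        printf("%s  R%d, R%d, R%d\n", n->label, result, result + 1, result);
    }
}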
In the first line of GenerateCode, we test if the label of the root node is an
operator. If it's not, we emit a load instruction to fetch the current value of
the variable and store it in the result register. If the label is an operator, we
call GenerateCode recursively for the left and right expression subtrees,
storing the results in the result register and the next higher numbered
register, and then emit the instruction applying the operator to the two
results. Note that the code as written above will only work if the number
of available registers is greater than the height of the expression tree. (We
could certainly be smarter about re-using them as we move through the
tree, but the code above is just to give you the general idea of how we go
about generating the assembly instructions).
Let's trace a call to GenerateCode for the following tree:
The initial call to GenerateCode is with a pointer to the '+' and result
register 0.
Notice how using the tree height for the register number (adding one as we
go down the side) allows our use of registers to not conflict. It also reuses
registers (R2 is used for both c and d). It is clearly not an optimal
strategy for assigning registers.
A quick aside: Directed Acyclic Graphs
In a tree, there is only one path from a root to each leaf of a tree. In
compiler terms, this means there is only one route from the start symbol to
each terminal. When using trees as intermediate representations, it is often
the case that some subtrees are duplicated. A logical optimisation is to
share the common subtree. We now have a data structure with more than
one path from start symbol to terminals. Such a structure is called a
directed acyclic graph (DAG). They are harder to construct internally, but
provide an obvious savings in space. They also highlight equivalent
sections of code and that will be useful later when we study optimisation
techniques, such as only computing the needed result once and saving it,
rather than re-generating it several times.
a * b + a * b;
world program.
Method Main()
0 aload_0
1 invokespecial #1 <Method java.lang.Object()>
4 return
Method void main(java.lang.String[])
CHAPTER THIRTEEN
Code Optimisation
Optimisation is the process of transforming a piece of code to make it more
efficient (either in terms of time or space) without changing its output or
side-effects. The only difference visible to the code's user should be that it
runs faster and/or consumes less memory. The name is really a misnomer:
it implies you are finding an optimal solution when, in truth,
optimisation aims to improve, not perfect, the result.
Optimisation is the field where most compiler research is done today. The
tasks of the front-end (scanning, parsing, semantic analysis) are well
understood and unoptimised code generation is relatively straightforward.
Optimisation, on the other hand, still retains a sizable measure of
mysticism. High-quality optimisation is more of an art than a science.
Compilers for mature languages aren't judged by how well they parse or
analyse the code (you just expect it to do it right with a minimum of
hassle) but instead by the quality of the object code they produce.
Many optimisation problems are NP-complete and thus most optimisation
algorithms rely on heuristics and approximations. It may be possible to
come up with a case where a particular algorithm fails to produce better
code or perhaps even makes it worse. However, the algorithms tend to do
rather well overall.
It's worth reiterating here that efficient code starts with intelligent
decisions by the programmer. No one expects a compiler to replace
BubbleSort with Quicksort. If a programmer uses a lousy algorithm, no
amount of optimisation can make it zippy. In terms of big-O, a compiler
can only make improvements to constant factors. But, all else being equal,
you want an algorithm with low constant factors.
First let me note that you probably shouldn't try to optimise the way we
will discuss here in your favourite high-level language. Consider the following
two code snippets, where each walks through an array and sets every
element to one. Which one is faster?
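The two snippets fall on a missing page; the classic form of this comparison
(an illustration, not necessarily the author's exact code) contrasts array
indexing with pointer arithmetic:

#define N 100
int a[N];

void fill_with_index(void) {             /* version 1: index the array */
    for (int i = 0; i < N; i++)
        a[i] = 1;
}

void fill_with_pointer(void) {           /* version 2: walk a pointer down the array */
    for (int *p = a; p < a + N; p++)
        *p = 1;
}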
You will invariably encounter people who think the second one is faster.
And they are probably right, if using a compiler without optimisation.
But, many modern compilers emit the same object code for both, by use of
Now we can construct the control-flow graph between the blocks. Each
basic block is a node in the graph, and the possible different routes a
program might take are the connections, i.e. if a block ends with a branch,
there will be a path leading from that block to the branch target. The
blocks that can follow a block are called its successors. There may be
multiple successors or just one. Similarly the block may have many, one,
or no predecessors.
Connect up the flow graph for Fibonacci basic blocks given above. What
does an if-then-else look like in a flow graph? What about a loop?
You probably have all seen the gcc warning or javac error about
"Unreachable code at line XXX". How can the compiler tell when code is
unreachable?
Local Optimisations
Optimisations performed exclusively within a basic block are called local
optimisations. These are typically the easiest to perform since we do not
consider any control flow information, we just work with the statements
within the block. Many of the local optimisations we will discuss have
corresponding global optimisations that operate on the same principle, but
require additional analysis to perform. We'll consider some of the more
common local optimisations as examples.
Constant Folding
Constant folding refers to the evaluation at compile-time of expressions
whose operands are known to be constant. In its simplest form, it involves
determining that all of the operands in an expression are constant-valued,
performing the evaluation of the expression at compile-time, and then
replacing the expression by its value. If an expression such as 10+2*3 is
encountered, the compiler can compute the result at compile-time (16) and
emit code as if the input contained the result rather than the original
expression. Similarly, constant conditions, such as a conditional branch if
a < b goto L1 else goto L2 where a and b are constant can be
replaced by a Goto L1 or Goto L2 depending on the truth of the
expression evaluated at compile-time.
The constant expression has to be evaluated at least once, but if the
compiler does it, it means you don't have to do it again as needed during
runtime. One thing to be careful about is that the compiler must obey the
grammar and semantic rules from the source language that apply to
expression evaluation, which may not necessarily match the language you
are writing the compiler in. (For example, if you were writing an APL
compiler, you would need to take care that you were respecting its
Iversonian precedence rules). It should also respect the expected treatment
of any exceptional conditions (divide by zero, over/underflow).
Consider the unoptimised TAC translation on the left, which is transformed
by constant folding on the right:
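The TAC listing itself is missing from this copy; a small example in the same
spirit (illustrative, not the author's) folds the constant multiplication from
the 10+2*3 discussion above:

Before:                          After constant folding:
    _tmp0 = 2 * 3 ;                  _tmp0 = 6 ;
    _tmp1 = 10 + _tmp0 ;             _tmp1 = 10 + _tmp0 ;
    x = _tmp1 ;                      x = _tmp1 ;

A following round of constant propagation and folding would then reduce the
whole sequence to x = 16.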
The most obvious of these are the optimisations that can remove useless
instructions entirely via algebraic identities. The rules of arithmetic can
come in handy when looking for redundant calculations to eliminate.
Consider the examples below, which allow you to replace an expression
on the left with a simpler equivalent on the right:
x+0 = x
0+x = x
x*1 = x
1*x = x
0/x = 0
x-0 = x
b && true = b
b && false = false
b || true = true
b || false = b
Each time through the loop, we multiply i by 4 (the element size) and add
to the array base. Instead, we could maintain the address to the current
element and instead just add 4 each time:
_tmp4 = arr ;
L0:
_tmp2 = i < 100 ;
IfZ _tmp2 Goto _L1 ;
*_tmp4 = 0 ;
_tmp4 = _tmp4 + 4 ;
i = i + 1 ;
Goto L0 ;
_L1:
This eliminates the multiplication entirely and reduces the need for an
extra temporary. By rewriting the loop termination test in terms of arr, we
could remove the variable i entirely and not bother tracking and
incrementing it at all.
Copy Propagation
This optimisation is similar to constant propagation, but generalised to
non-constant values. If we have an assignment a = b in our instruction
stream, we can replace later occurrences of a with b (assuming there are
no changes to either variable in-between). Given the way we generate
TAC code, this is a particularly valuable optimisation since it is able to
eliminate a large number of instructions that only serve to copy values
from one variable to another.
The code on the left makes a copy of tmp1 in tmp2 and a copy of tmp3 in
tmp4. In the optimised version on the right, we eliminated those
unnecessary copies and propagated the original variable into the later uses:
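That before-and-after listing is missing in this copy; an illustrative pair (not
the author's exact code) showing the copies being propagated away:

Before:                          After copy propagation:
    tmp2 = tmp1 ;                    tmp2 = tmp1 ;
    tmp3 = tmp2 * tmp2 ;             tmp3 = tmp1 * tmp1 ;
    tmp4 = tmp3 ;                    tmp4 = tmp3 ;
    tmp5 = tmp4 + tmp2 ;             tmp5 = tmp3 + tmp1 ;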
and tmp1 is never used again, we can eliminate this instruction altogether.
However, we have to be a little careful about making assumptions, for
example, if tmp1 holds the result of a function call:
tmp1 = LCall _Binky;
What sub-expressions can be eliminated? How can valid common subexpressions (live ones) be determined? Here is an optimised version, after
constant folding and propagation and elimination of common subexpressions:
tmp2 = -x ;
x = 21 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
z = tmp5 / tmp3 ;
y = z ;
First, divide the code above into basic blocks. Now calculate the available
expressions for each block. Then find an expression available in a block
and perform step 2c above. What common subexpression can you share
between the two blocks?
What if the above code were:
main:
BeginFunc 28;
b = a + 2 ;
c = 4 * b ;
tmp1 = b < c ;
IfNZ tmp1 Goto L1 ;
b = 1 ;
z = a + 2 ;          <=== an additional line here
L1:
d = a + 2 ;
EndFunc ;
Code Motion
Code motion (also called code hoisting) unifies sequences of code
common to one or more basic blocks to reduce code size and potentially
avoid expensive re-evaluation. The most common form of code motion is
loop-invariant code motion that moves statements that evaluate to the
same value every iteration of the loop to somewhere outside the loop.
What statements inside the following TAC code can be moved outside the
loop body?
L0:
tmp1 = tmp2 + tmp3 ;
tmp4 = tmp4 + 1 ;
PushParam tmp4 ;
LCall _PrintInt ;
PopParams 4;
tmp6 = 10 ;
tmp5 = tmp4 == tmp6 ;
IfZ tmp5 Goto L0 ;
Loop invariant code can be moved to just above the entry point to the
loop.
Machine Optimisations
In final code generation, there is a lot of opportunity for cleverness in
generating efficient target code. In this pass, specific machine features
(specialised instructions, hardware pipeline abilities, register details) are
taken into account to produce code optimised for this particular
architecture.
Register Allocation
One machine optimisation of particular importance is register allocation,
which is perhaps the single most effective optimisation for all
architectures. Registers are the fastest kind of memory available, but as a
resource, they are scarce. The problem is how to minimise traffic between
the registers and what lies beyond them in the memory hierarchy to
eliminate time wasted sending data back and forth across the bus.
One common register allocation technique is called register colouring,
after the central idea to view register allocation as a graph colouring
problem. If we have 8 registers, then we try to colour a graph with eight
different colours. The graph's nodes are made of webs and the arcs are
determined by calculating interference between the webs. A web
represents a variable's definitions, the places where it is assigned a value (as
in x = ...), and the possible different uses of those definitions (as in y = x
+ 2). This problem, in fact, can be approached as another graph. The
definition and uses of a variable are nodes, and if a definition reaches a
use, there is an arc between the two nodes. If two portions of a variable's
definition-use graph are unconnected, then we have two separate webs for
a variable. In the interference graph for the routine, each node is a web.
We seek to determine which webs don't interfere with one another, so we
know we can use the same register for those two variables. For example,
consider the following code:
i = 10;
j = 20;
x = i + j;
y = j + k;
If we get stuck, then the graph may not be r-colourable; we could try again
with a different heuristic, say reusing colours as often as possible. If there is
no other choice, we have to spill a variable to memory.
Instruction Scheduling
Another extremely important optimisation of the final code generator is
instruction scheduling. Because many machines, including most RISC
architectures, have some sort of pipelining capability, effectively
harnessing that capability requires judicious ordering of instructions.
In MIPS, each instruction is issued in one cycle, but some take multiple
cycles to complete. It takes an additional cycle before the value of a load is
available and two cycles for a branch to reach its destination, but an
instruction can be placed in the "delay slot" after a branch and executed in
that slack time. On the left is one arrangement of a set of instructions that
requires 7 cycles. It assumes no hardware interlock and thus explicitly
stalls between the second and third slots while the load completes and has
a dead cycle after the branch because the delay slot holds a noop. On the
right, a more favourable rearrangement of the same instructions will
execute in 5 cycles with no dead cycles.
Peephole Optimisations
Peephole optimisation is a pass that operates on the target assembly and
only considers a few instructions at a time (through a "peephole") and
attempts to do simple, machine-dependent code improvements. For
example, peephole optimisations might include elimination of
multiplication by 1, elimination of load of a value into a register when the
previous instruction stored that value from the register to a memory
location, or replacing a sequence of instructions by a single instruction
with the same effect. Because of its myopic view, a peephole optimiser
does not have the potential payoff of a full-scale optimiser, but it can
significantly improve code at a very local level and can be useful for
cleaning up the final code that resulted from more complex optimisations.
Much of the work done in peephole optimisation can be thought of as
find-and-replace activity, looking for certain idiomatic patterns in a single
instruction or a sequence of two to three instructions that can be replaced by more
efficient alternatives. For example, MIPS has instructions that can add a
small integer constant to the value in a register without loading the
constant into a register first, so the sequence on the left can be replaced
with that on the right:
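The instruction pair itself is on a missing page; a typical example of this kind
of replacement (illustrative MIPS, not the author's exact listing) is:

    li   $t0, 4                          addi $t1, $t2, 4
    add  $t1, $t2, $t0

That is, loading the small constant into a register and then adding can be
collapsed into a single add-immediate instruction.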
Optimisation Soup
You might wonder about the interactions between the various optimisation
techniques. Some transformations may expose possibilities for others, and
even the reverse is true, one optimisation may obscure or remove
possibilities for others. Algebraic rearrangement may allow for common
subexpression elimination or code motion. Constant folding usually paves
the way for constant propagation, and then it turns out to be useful to run
another round of constant folding, and so on. How do you know you are
done? You don't!
As one compiler textbook author (Pyster) puts it:
Adding optimisations to a compiler is a lot like eating chicken
soup when you have a cold. Having a bowl full never hurts, but
who knows if it really helps. If the optimisations are structured
modularly so that the addition of one does not increase compiler
complexity, the temptation to fold in another is hard to resist. How
well the techniques work together or against each other is hard to
determine.
BIBLIOGRAPHY
24. Niemann, T. A Compact Guide to Lex and Yacc. Portland, OR: ePaper Press, 2000.
25. Pyster, A. Compiler Design and Construction. New York, NY: Van Nostrand Reinhold, 1988.
26. Sudkamp, T. Languages and Machines: An Introduction to the Theory of Computer Science. Reading, MA: Addison-Wesley, 1988.
27. Tremblay, J.; Sorenson, P. The Theory and Practice of Compiler Writing. New York, NY: McGraw-Hill, 1985.
28. Wexelblat, R. L. History of Programming Languages. London: Academic Press, 1981.
29. Wirth, N. The Design of a Pascal Compiler. Software - Practice and Experience, Vol. 1, No. 4, 1971.