
3012 Formal Languages & Compilers

CS3012 Formal Languages

Course Notes


UNIVERSITY OF ABERDEEN
Department of Computing Science
CS3012: Formal Languages & Compilers
Course Notes 2006
Frank Guerin [1]
17/12/2012

About CS3012

Formal languages underlie all of Computing Science: if a language is not formally defined it is difficult to use a computer to process it in a consistent manner. This course provides an introduction to how formal languages may be defined and describes how computers may be used to manipulate such languages, with particular reference to compiling programming languages. You will gain practical experience of tools that are widely used in industrial applications to generate parsers and lexical analysers.

Attendance

You are expected to attend all meetings of CS3012. The course introduces many formal concepts, and builds upon them each week. The only way to do well on this course is to work steadily throughout the term, building up your familiarity with the different formalisms. If you can't do the exercises in the tutorials, you will not be able to understand the lectures. You will not be able to cram in all the information in the weeks before the exam if you have not been working steadily. Attendance will be taken at tutorials. If you are not attending at least 75% of the tutorial classes, you will be reported to the Senate Office as being "at risk".

Motivation for the Course


A compiler takes a program we have written in a particular language, and converts it into another format ready to be executed on a particular computer. In order to write efficient programs, we must understand how a compiler works. The basic ideas of compiling are used in many different areas in computing, including user interfaces, software design, and intelligent agents. Before we can write a compiler, we must establish the language that we will be compiling. That is, we have to determine what programs are valid. Then, we must be able to recognise valid programs, and process them.

We face similar problems understanding natural language. For example, which of the following five examples are proper sentences?
1. large red cars go quickly,
2. Large red cars go quickly.
3. Colourless green ideas sleep furiously.
4. Go cars large red quickly.
5. Coches rojos grandes marchan rapidamente.

Even if we can't state clearly which of these are valid sentences, we can still make a good guess at what some of them mean. Unfortunately, we can't get away with hand-waving like this when we are dealing with computer programs: we have to be precise and unambiguous, and the only effective way we have of interpreting programs requires us to be very strict about what is and what is not a valid program. We will then use this strict specification to help us recognise the structure of a program. Once we have recognised the structure, we can start worrying about what it means, and what actions we have to take to execute the program as expected.

To specify the valid programs, we need the concept of a formal language: a language where the programs (or sentences) that are valid are defined solely in terms of the form (or shape, or structure) of the program. The course will begin by looking at some simple ways to define a formal language, and some algorithms for recognising sentences in those languages. We will see how to define simple examples, and we will investigate how powerful the methods are.

[1] These notes are mostly from Ken Brown's original notes.

We will then look at more powerful methods, which will allow us to specify programming languages, and we will see how to recognise whether or not a program is valid. Once we have recognised a valid program, we have to start translating it into an executable form. First, we will check that the program is meaningful, i.e. that the instructions make sense and obey the conventions of the language. We will then start translating the original instructions into a different format, and we will show how to create the necessary structures in memory. We will also look at the runtime environment, showing how different programming language styles manipulate the memory of the computer as programs are executed. We will use the standard UNIX tools of Lex and Yacc to build parts of a working compiler. A schematic model of a compiler is shown below:
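[Figure: schematic model of a compiler. The phases run in sequence: source -> lexical analysis -> syntax analysis -> semantic analysis -> intermediate code generation -> code optimisation -> code generation -> target. The symbol table and error handling components are drawn alongside the phases; error handling produces the error messages.]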

Course Content

basic formal language theory
1. Alphabets, Strings, Languages and Machines
2. Finite State Automata
3. Regular Expressions and Regular Languages
4. Finite State Automata and Regular Languages
5. Finite State Automata with Output

lexical analysis
6. Lex: A Lexical Analysis Tool

grammar theory
7. Languages and Grammars
8. Derivations and Ambiguity
9. Parsing

compilation
10. Yacc: A Parser Generator
11. Error Handling
12. Syntax Directed Translation
13. Symbol Table
14. Type Checking
15. Run-time Environment
16. Intermediate Code Generation


A note on the style of the course


This handout, together with the exercise sheets from the problem classes, contains all of the required material for the course. However, the handout is concise, may be difficult to read, and will not contain much in the way of discussion, motivation or examples. The lectures will consist of slides, spoken material, and additional examples given on the blackboard. In order to understand the algorithms and the reasons for studying the material, you will need to attend the lectures and take notes to supplement the handout. This is your responsibility. If there is anything you do not understand during the lectures, then ask, either during or after the lecture. If the lectures are covering the material too quickly, then say so. If there is anything you do not understand in the handout, then ask, either at a lecture or in the problem classes. The textbooks contain additional material, and may motivate the material in different ways. They are useful background to the course, and may help you understand it.

The exercises in the problem classes are designed to test your understanding of the material and to give you practice applying the definitions and algorithms in solving problems. In order to survive on the course, you are strongly advised to attend the problem classes and attempt the exercises. Much of this material is formal and abstract, and without practice you will quickly fall behind. Make sure you do attempt the exercises for yourself, and don't wait for the solutions - reading solutions is no substitute for trying to solve problems. If there is anything you don't understand in the exercises (or the notes), or you can't see how to generate the solutions for yourself, then ask for help. Do not sit in silence hiding the fact that you are struggling. There are also continuous assessment exercises. These put into practice the theory you will have learned on the course. They are part of the formal assessment for the course, so it is obviously important that you make serious attempts at solving them. If you do not complete the assessments, you risk failing the course.


Finite State Automata

1. Alphabets, Strings and Languages

Definitions 1.0: Set and Mathematical Notation
We need to define some mathematical notation before we start, so that we can talk about things concisely, without having to write a paragraph every time.

A ⊆ B             A is a subset of B - every element of A is also in B
A ∪ B             the union of A and B - a set containing every element in either A or B
A ∩ B             the intersection of A and B - a set containing only those elements in both A and B
∃                 there exists
∀                 for all
2^P               the set of all subsets of P
∈                 is an element of
∉                 is not an element of
A\B               A minus B: all elements of A, except those that are also in B
A × B             the set of all pairs of elements, where the first is an element of A and the second is an element of B
∅                 the empty set - i.e. the set with no members
¬                 not
{ x | p(x) }      the set of all elements x, such that the sentence p(x) is true
                  Example: {x : x ∈ Z, x > 0, x < 10} means the set of all integers strictly between 0 and 10, and is the same set as {1,2,3,4,5,6,7,8,9}
f : X → Y :: x ↦ y
                  f is a function that maps elements of set X to elements of set Y, and maps a particular element x to a particular element y
                  Example: f: N → N :: x ↦ x^2 is the "square" function for positive integers. It may also be written f(x) = x^2. Thus f(2) = 4, f(3) = 9, etc.

Definitions 1.1
A symbol is a basic unit.
An alphabet is a finite set of symbols.
A string over an alphabet T is a finite sequence of symbols from T. This may be shortened to T-string, or if the context is clear or unimportant, simply string.

Example 1.2
If T = {a, b, c, d} is an alphabet, then abd, aaaa and abaabc are T-strings.

Definitions 1.3
The empty string is the string with no symbols, denoted ε.
The length of a string w is the number of symbols in the sequence, denoted |w|.
Two strings, w and v, are equal if they have exactly the same sequence of symbols, denoted w = v.
The concatenation of two strings, w and v, is the string consisting of the sequence of symbols in w followed by the sequence of symbols in v, denoted wv.
Note: concatenation is not commutative - vw and wv need not be equal - but is associative - (uv)w = u(vw).

Example 1.4

If w = abb and v = bab then wv = abbbab and wε = abb.

Definitions 1.5
A string u is a substring of w if there exist strings x and y such that (s.t.) w = xuy.
If u is a substring of w as above, and x = ε, then u is a prefix of w. If u ≠ w, then u is a proper prefix.
If u is a substring of w as above, and y = ε, then u is a suffix of w. If u ≠ w, then u is a proper suffix.
Note that ε is a substring of every string.

Example 1.6

ba is both a prefix and a suffix of babba.
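The string definitions above map directly onto built-in string operations. The following is a small sketch in Python, assuming strings over T are modelled as ordinary Python strings; the helper names (EPSILON, is_prefix and so on) are illustrative and not part of the course material.

```python
# Strings over an alphabet modelled as Python str values.
EPSILON = ""                 # the empty string

w, v = "abb", "bab"          # the strings of Example 1.4


def is_substring(u: str, w: str) -> bool:
    """u is a substring of w if w = x u y for some strings x and y."""
    return u in w


def is_prefix(u: str, w: str) -> bool:
    """u is a prefix of w if w = u y for some string y (x = EPSILON)."""
    return w.startswith(u)


def is_suffix(u: str, w: str) -> bool:
    """u is a suffix of w if w = x u for some string x (y = EPSILON)."""
    return w.endswith(u)


def is_proper_prefix(u: str, w: str) -> bool:
    """A proper prefix is a prefix that is not the whole string."""
    return is_prefix(u, w) and u != w


print(w + v)                  # concatenation wv
print(len(w))                 # the length |w|
print(is_prefix("ba", "babba"), is_suffix("ba", "babba"))
```

Concatenation is the + operator here, so the non-commutativity noted in Definitions 1.3 can be checked by comparing w + v with v + w.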

Definition 1.7

If T is an alphabet, then T* is the set of all strings over T.

Example 1.8

T = {a,b} and T* = {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}

Definitions 1.9
T+ is T* without ε.
If a is a symbol, then
(i) a^n (n ≥ 0) is the string consisting of n a's. Note that a^n a^m = a^(n+m).
(ii) a* = {ε, a, aa, aaa, ...}
(iii) a+ = {a, aa, aaa, ...}

Definition 1.10

Language over T

A language over an alphabet T is a set of strings over T. This may be abbreviated to T-language, or simply language. Note that L is a T-language if and only if (iff) L ⊆ T*.

Example 1.11

If T = {a, b}, then {ε, ab, babba, bbbbbbb} is a T-language.

Definitions 1.12
Let A and B be languages over an alphabet T.
A+B (or A ∪ B) denotes the set union of A and B.
A ∩ B denotes the set intersection of A and B.
A' denotes the complement of A - i.e. all the strings in T* but not in A.
AB denotes the concatenation of A and B - all strings uv s.t. u ∈ A and v ∈ B. Note that language concatenation is associative, but not commutative.
A^n denotes the concatenation of A with itself n times ( = AA...A). Note: A^0 = {ε}.
A* = A^0 + A^1 + A^2 + ... i.e. the set of all strings consisting of the concatenation of strings from A. This operation is called the "Kleene closure". Note: A** = A*.
A+ = A^1 + A^2 + ...
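For finite languages, the operations of Definitions 1.12 can be tried out with Python sets. This is only a sketch: the function names are illustrative, and since A* is infinite whenever A contains a non-empty string, kleene_star below computes just the finite part A^0 + A^1 + ... + A^max_power.

```python
from itertools import product


def concat(A, B):
    """AB: every string uv with u in A and v in B."""
    return {u + v for u, v in product(A, B)}


def power(A, n):
    """A^n: A concatenated with itself n times, with A^0 = {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result


def kleene_star(A, max_power):
    """Finite approximation of A*: A^0 + A^1 + ... + A^max_power."""
    result = set()
    for n in range(max_power + 1):
        result |= power(A, n)
    return result


A = {"a", "ab"}
B = {"b", ""}          # "" plays the role of the empty string

print(sorted(concat(A, B)))
print(sorted(concat(B, A)))      # differs from AB: not commutative
print(sorted(power(A, 2)))
```

Union and intersection need no helper: they are the set operators A | B and A & B.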

Definitions 1.13
Let T be an alphabet with an ordering on its symbols. Say T = {t1, t2, t3, ...}. Strings over T can be ordered in two ways:

Dictionary Order: All strings beginning t1 are ordered before strings beginning t2, and t2 before t3, etc. Within the group of strings beginning t1, strings are ordered by the second symbol, etc. ε is always the first string.

Lexical Order: Strings are ordered by their length. Within each group of strings of the same length, strings are ordered by dictionary order. Again, ε is the first string.
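The two orderings can be compared with a short sketch, assuming the alphabet is passed as a Python string whose characters are already in the chosen symbol order (the function name lexical_order is illustrative). Lexical order can enumerate all of T* one string at a time; dictionary order cannot, because infinitely many strings beginning with t1 come before the first string beginning with t2.

```python
from itertools import count, islice, product


def lexical_order(alphabet):
    """Yield every string over `alphabet`: shortest first, and in
    dictionary order within each length."""
    for n in count(0):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


# The first few strings of T* for T = {a, b}.
first = list(islice(lexical_order("ab"), 9))
print(first)

# Dictionary order on a finite sample. Python's built-in string comparison
# is dictionary order when symbol order agrees with character order.
sample = ["b", "ab", "", "aa", "a"]
print(sorted(sample))
```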


2. Finite State Automata


Definition 2.1 Finite State Automata
A Finite-State Automaton (FSA) is a quintuple (Q,I,F,T,E), where
Q is a finite set (whose elements are called states)
I is a subset of Q (whose elements are the initial states)
F is a subset of Q (whose elements are the final states)
T is an alphabet, and
E is a subset of Q × (T + {ε}) × Q (whose elements are called edges)

Essentially, a FSA is a labelled, directed graph - that is, it is a set of nodes with directed arcs between the nodes, where arcs may have labels from an alphabet.

Notation: We will sketch a FSA as a graph, where the edges of the FSA are the arcs of the graph and the states are the nodes (drawn as circles). The initial states will be drawn with a short incoming arrow, and the final states will be drawn as double circles.

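Example 2.2
The FSA A1:
Q = {1,2,3,4}, I = {1}, F = {4}, T = {a, b}
E = {(1,a,2), (1,b,4), (2,b,4), (2,a,3), (3,a,3), (3,b,3), (4,a,2), (4,b,4)}
can be sketched as shown:

[Figure: state diagram of A1, with nodes 1 to 4, an incoming arrow on state 1, a double circle on state 4, and one labelled arc for each edge in E.]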

Example 2.3
The FSA of 2.2 (A1) can be interpreted as follows:
The machine starts in state "1". From there it can move either to state "2", by action labelled "a", or it can move to state "4", by action labelled "b".
From state "2" it can move to state "3", by action labelled "a", or it can move to state "4", by action labelled "b".
From state "3", it can stay in state "3" by actions labelled "a" or "b".
From state "4", it can move to state "2", by action labelled "a", or it can stay in state "4", by action labelled "b".
The machine can stop successfully in state "4".

Definitions 2.4
If (x,a,y) is an edge in a FSA, then x is the start state of the edge and y is the end state.
A path in a FSA is a sequence of edges, such that the end state of one is the start state of the next.
A cycle in a FSA is a path, such that the start state and the end state are the same.
A path is successful if its first state is an initial state, and its last state is a final state.
The label of a path is the sequence of labels of the edges in the path.
A string is accepted by a FSA if it is the label of a successful path.
A string is rejected if it is not the label of a successful path.

Definition 2.5 Language accepted by a FSA
The language accepted by a FSA, A, is the set of strings accepted by A. Denote the language L(A).

Example 2.6
Consider the FSA A1 (example 2.2):
(i) p1 = (2,b,4), (4,a,2), (2,a,3) is a path;
(ii) p2 = (2,b,4), (4,b,4), (4,a,2) is a cycle;
(iii) p3 = (1,b,4), (4,a,2), (2,b,4), (4,b,4) is a successful path;
(iv) The label of p1 = baa;
(v) babb is accepted by A1;
(vi) baa is rejected by A1;
(vii) A1 accepts the language of strings of a's and b's which end in a b, and in which no two a's are adjacent.

Definition 2.7
The transition function of a FSA, A, is the function δ : (x,t) ↦ {y : there is an edge (x,t,y) in A}.

Definition 2.8
If A = (Q,I,F,T,E) is a FSA, then a transition matrix for A is a matrix which has one row for each state in Q and one column for each symbol in T s.t. the entry in row q and column t is δ(q,t) (∈ 2^Q).

Notation: A transition matrix will be drawn as a table, labelling the rows and columns with states and symbols. Each entry in the table will be the set of states as defined above, or will be left blank in the case of the empty set. Additionally, rows corresponding to initial states will be labelled with an "in" arrow, and final states with an "out" arrow.

Example 2.9
The transition matrix for A1 is:

    A1    |  a  |  b
  -> 1    | {2} | {4}
     2    | {3} | {4}
     3    | {3} | {3}
     4 -> | {2} | {4}

Definition 2.10
A FSA, A, is non-deterministic if
(i) there are edges labelled with ε, or
(ii) there are two edges (x,t,y) and (x,t,z) in A s.t. y ≠ z, or
(iii) there is more than one initial state.
Conversely, if none of (i), (ii) or (iii) hold, then A is a deterministic FSA.
Non-deterministic and deterministic FSA's will be denoted NDFSA and DFSA respectively.

Example 2.11

A1 is a DFSA

Note: For a DFSA, every entry in the transition matrix is either a singleton set or the empty set.

Algorithm 2.12

Recognition Algorithm (DFSA)

Problem: Given a DFSA, A = (Q,I,F,T,E) and a string w, determine whether w ∈ L(A).
begin
  Add symbol # to end of w
  q := initial state
  t := first symbol of w#
  while (t ≠ # & q ≠ {}) do      /* while the current symbol is not the end marker and we are in a proper state */
  begin
    q := δ(q, t)                 /* get the next state from the transition table */
    t := next symbol in w#       /* get the next symbol from the input string */
  end
  return ((t == #) & (q ∈ F))    /* if the current symbol is the end marker and the current state is a finish state, return true, else false */
end
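The recognition algorithm can be sketched in Python using A1 from Example 2.2. The dictionary layout and the name accepts are illustrative assumptions. Because Python can iterate over a string directly, the explicit end marker # is not needed, and a missing table entry (the empty set in the transition matrix) is modelled by None.

```python
# A1 from Example 2.2, with the transition function stored as a dictionary
# from (state, symbol) to the single next state.
A1 = {
    "initial": 1,
    "final": {4},
    "delta": {
        (1, "a"): 2, (1, "b"): 4,
        (2, "a"): 3, (2, "b"): 4,
        (3, "a"): 3, (3, "b"): 3,
        (4, "a"): 2, (4, "b"): 4,
    },
}


def accepts(dfsa, w):
    """Return True iff the DFSA accepts the string w."""
    q = dfsa["initial"]
    for t in w:
        q = dfsa["delta"].get((q, t))   # None models an empty table entry
        if q is None:
            return False                # no edge to follow: reject
    return q in dfsa["final"]           # input used up: accept iff q is final


for word in ["babb", "baa", "abab", ""]:
    print(repr(word), accepts(A1, word))
```

The two strings from Example 2.6, babb and baa, come out accepted and rejected respectively.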

Theorem 2.13 DFSA = NDFSA
Let L be a language. L is accepted by a NDFSA iff L is accepted by a DFSA.

Algorithm 2.14 NDFSA -> DFSA

Problem: Given a NDFSA, A, create a DFSA, A'
begin
  create unique initial state
  remove ε-edges
  remove edge choices
end

Algorithm 2.15 create unique initial state

Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with a single initial state s.t. L(A) = L(A')
begin
  Q := Q ∪ {i} (where i ∉ Q)                /* add a new initial state i */
  for each q ∈ I do add edge (i,ε,q) to E
  I := {i}                                  /* reset the initial set to be just i */
  return (Q,I,F,T,E)
end

Algorithm 2.16 remove ε-edges

Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with no ε-edges
begin
  remove all edges of form (q,ε,q) from E
  while there are cycles of ε-edges in E do
  begin
    select a cycle
    merge all states in cycle into single state, keeping all edges in/out of cycle
  end
  while there are ε-edges in E do
  begin
    select a ε-edge (p,ε,q)
    for each edge (q,t,r) ∈ E do add edge (p,t,r) to E
    if q ∈ F then add p to F
    remove (p,ε,q) from E
  end
  return (Q,I,F,T,E)
end
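A different but equivalent way to remove ε-edges is to work with ε-closures: for each state p, find every state reachable from p by ε-edges alone, then copy the non-ε edges of those states back to p. The sketch below uses that closure-based technique instead of the edge-by-edge loop of Algorithm 2.16. The names EPS, eps_closure and remove_eps_edges are illustrative, and ε is represented by the empty Python string.

```python
EPS = ""   # label used for ε-edges in this sketch


def eps_closure(state, edges):
    """All states reachable from `state` using ε-edges only (including itself)."""
    closure, stack = {state}, [state]
    while stack:
        p = stack.pop()
        for (x, t, y) in edges:
            if x == p and t == EPS and y not in closure:
                closure.add(y)
                stack.append(y)
    return closure


def remove_eps_edges(Q, I, F, T, E):
    """Return an FSA with no ε-edges accepting the same language."""
    new_E, new_F = set(), set(F)
    for p in Q:
        reach = eps_closure(p, E)
        if reach & set(F):
            new_F.add(p)                 # p reaches a final state on ε alone
        for (x, t, y) in E:
            if x in reach and t != EPS:
                new_E.add((p, t, y))     # copy the labelled edge back to p
    return set(Q), set(I), new_F, set(T), new_E


# A small NDFSA for the string ab with one ε-edge in the middle.
Q, I, F, T = {1, 2, 3, 4}, {1}, {4}, {"a", "b"}
E = {(1, "a", 2), (2, EPS, 3), (3, "b", 4)}
print(remove_eps_edges(Q, I, F, T, E))
```

This version leaves unreachable states in place (state 3 above); they do not change the language accepted.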


Algorithm 2.17 remove edge choices (subset construction)

Given a NDFSA, A = (Q,I,F,T,E), create DFSA A' s.t. L(A) = L(A')
begin
  I' := {I}      /* Note: I is a set; I' is a set with one member, I */
  if I ∩ F ≠ {} then F' := {I} else F' := {}      /* the new initial state is final if any old initial state is final */
  E' := {}
  S := {I}
  Q' := {I}
  while S is not empty do
  begin
    select X ∈ S
    for each t ∈ T do
    begin
      S' := {q ∈ Q : (p,t,q) ∈ E, for some p ∈ X}
      if S' ≠ {} then
      begin
        if S' ∩ F ≠ {} then F' := F' ∪ {S'}
        E' := E' ∪ {(X,t,S')}
        S := S ∪ ({S'}\Q')      /* if we haven't seen S' before, add to S */
        Q' := Q' ∪ {S'}         /* if S' is already in Q', Q' doesn't change */
      end
    end
    S := S\{X}
  end
  return (Q',I',F',T,E')
end

Alternative description of Algorithm 2.17
Make a new initial state, I', representing all the old initial states.
Make sets F' and E', for the new finish states and new edges. E' starts empty; F' starts empty unless one of the old initial states is a finish state, in which case it starts with the new initial state in it.
Create a set S of states we haven't expanded yet, initially containing just I'.
Create a set Q' of all new states, initially containing just I'.
While there are states we haven't expanded yet (i.e. still states left in S)
  Pick one of those states, and call it X
  For each symbol in the alphabet
    Find all the old states that make up X
    Find all the old states we could have got to from those states by reading in the current alphabet symbol
    Group all those old states into a new state, S'
    If S' is not empty (i.e. there is at least one old state making up S')
      If any of the states making up S' were old finish states
        Make S' a new finish state (i.e. add to F')
      Add a new edge from X to S' for the current symbol (i.e. add to E')
      If we hadn't seen S' before, add S' to S
      If we hadn't seen S' before, add S' to Q'
  Take X out of S
Return the new FSA we have just created, where Q' is the set of states, I' is the set of initial states, F' is the set of finish states, T is the alphabet, and E' is the set of edges.

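Example 2.18
A = ({1,2,3},{1},{3},{a,b},{(1,a,1),(1,a,2),(1,b,1),(2,b,3),(3,a,3),(3,b,3)})
Convert A into a DFSA.

[Figure: state diagram of A, with states 1, 2 and 3; state 1 loops on a,b and has an a-edge to 2, state 2 has a b-edge to 3, and state 3 loops on a,b.]

Initially I' = {{1}}, F' = {}, E' = {}, S = {{1}} and Q' = {{1}}. The algorithm then proceeds as follows:

  X        | t | S'      | edge added to E'      | added to F' | added to S and Q'
  {1}      | a | {1,2}   | ({1},a,{1,2})         |             | {1,2}
  {1}      | b | {1}     | ({1},b,{1})           |             |
  {1,2}    | a | {1,2}   | ({1,2},a,{1,2})       |             |
  {1,2}    | b | {1,3}   | ({1,2},b,{1,3})       | {1,3}       | {1,3}
  {1,3}    | a | {1,2,3} | ({1,3},a,{1,2,3})     | {1,2,3}     | {1,2,3}
  {1,3}    | b | {1,3}   | ({1,3},b,{1,3})       |             |
  {1,2,3}  | a | {1,2,3} | ({1,2,3},a,{1,2,3})   |             |
  {1,2,3}  | b | {1,3}   | ({1,2,3},b,{1,3})     |             |

A' = ({{1},{1,2},{1,3},{1,2,3}}, {{1}}, {{1,3},{1,2,3}}, {a,b}, {({1},a,{1,2}), ({1},b,{1}), ({1,2},a,{1,2}), ({1,2},b,{1,3}), ({1,3},a,{1,2,3}), ({1,3},b,{1,3}), ({1,2,3},a,{1,2,3}), ({1,2,3},b,{1,3})})

[Figure: state diagram of A', with states "1", "1,2", "1,3" and "1,2,3" and the eight edges listed above.]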
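The subset construction can be sketched in Python and run on the automaton A of Example 2.18. This is an illustrative sketch rather than the course's implementation: new states are represented as frozensets of old states, the input is assumed to have no ε-edges, and the new initial state is marked final when it contains an old final state. The names subset_construction and fs are hypothetical.

```python
def subset_construction(Q, I, F, T, E):
    """Return a DFSA (Q', I', F', T, E') whose states are frozensets of old states.
    Assumes E contains no ε-edges."""
    start = frozenset(I)
    states = {start}
    final = {start} if start & set(F) else set()
    edges = set()
    todo = [start]                       # states not yet expanded (the set S)
    while todo:
        X = todo.pop()
        for t in sorted(T):
            # every old state reachable from some member of X on symbol t
            target = frozenset(q for (p, label, q) in E if label == t and p in X)
            if not target:
                continue                 # empty set: no edge is added
            if target & set(F):
                final.add(target)
            edges.add((X, t, target))
            if target not in states:     # first time we see this new state
                states.add(target)
                todo.append(target)
    return states, {start}, final, set(T), edges


def fs(*xs):
    """Shorthand for building a frozenset."""
    return frozenset(xs)


# The NDFSA A of Example 2.18.
Q, I, F, T = {1, 2, 3}, {1}, {3}, {"a", "b"}
E = {(1, "a", 1), (1, "a", 2), (1, "b", 1), (2, "b", 3), (3, "a", 3), (3, "b", 3)}

Qd, Id, Fd, Td, Ed = subset_construction(Q, I, F, T, E)
print(sorted(sorted(s) for s in Qd))
print(sorted(sorted(s) for s in Fd))
```

The order in which states are expanded differs from the trace in the example, but the resulting sets of states, final states and edges do not depend on that order.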

Definition 2.19
Let A = (Q,I,F,T,E) be a FSA. For any two strings x, y ∈ T*, x and y are distinguishable w.r.t. A if there is a string z ∈ T* s.t. exactly one of xz and yz is in L(A). We say z distinguishes x and y w.r.t. A.

Theorem 2.20
L is a language over T. If, for some integer n, there are n elements of T* s.t. any two are distinguishable w.r.t. L (that is, w.r.t. any FSA accepting L), then any FSA that recognises L must have at least n states.

Theorem 2.21
For a given language L, there exists a minimal DFSA accepting L, and it is unique.

Algorithm 2.22 DFSA -> minimal DFSA

Given a DFSA A, create a DFSA A' s.t. A' is minimal over all FSAs accepting L(A).
begin
  R := {}
  for all edges (p,t,q) ∈ E do add (q,t,p) to R
  remove edge choices from (Q,F,I,T,R) to get (Q',I',F',T,E')      /* F and I are swapped: the reversed FSA starts in the old final states */
  Z := equivalent_states(Q, Q')
  (Q'',I'',F'',T,E'') := merge(Z,Q,I,F,T,E)
  return (Q'',I'',F'',T,E'')
end

Algorithm 2.23 equivalent states

Given two sets of states Q and Q', produce Z, the set of states of Q equivalent in Q'.
M is a 2d array, indexed by Q and {ε} ∪ Q
begin
  set all cells of M to t
  for all p ∈ Q do
    for all sets S ∈ Q' do
      if p ∈ S then
        for each q ∈ Q do
        begin
          if q ∉ S then M[p][q] := f
        end
      else
        for each q ∈ Q do
          if q ∈ S then M[p][q] := f
  S := {}
  for each p ∈ Q do
  begin
    Z := {}
    if M[p][ε] = t then
      for each q ∈ Q do
        if M[p][q] = t then
        begin
          add q to Z
          M[q][ε] := f
        end
    if Z ≠ {} then add Z to S      /* skip states that were already placed in an earlier group */
  end
  return S
end

Algorithm 2.24 merge

Given a DFSA A and a set of sets of states Z, return a new DFSA A'
begin
  for all S ∈ Z do
  begin
    select p ∈ S
    for all q ∈ S s.t. q ≠ p do
    begin
      delete q from Q
      for each edge (q,t,w) ∈ E do
      begin
        delete (q,t,w) from E
        if (p,t,w) ∉ E then add (p,t,w) to E
      end
      for each edge (w,t,q) ∈ E do
      begin
        delete (w,t,q) from E
        if (w,t,p) ∉ E then add (w,t,p) to E
      end
      if q ∈ I then
      begin
        delete q from I
        if p ∉ I then add p to I
      end
      if q ∈ F then
      begin
        delete q from F
        if p ∉ F then add p to F
      end
    end
  end
  return (Q,I,F,T,E)
end

Example 2.25
Minimise the following DFSA:
[Figure: state diagram of the DFSA to be minimised, with states 1 to 5 and edges labelled a and b.]

Reverse:

[Figure: the same diagram with the direction of every edge reversed.]

The set Q' from "remove edge choices" is { {3,4}, {2,3,4}, {2,3,4,5}, {1,2,3,4,5} }

The array M starts with every cell set to t:

       ε  1  2  3  4  5
  1    t  t  t  t  t  t
  2    t  t  t  t  t  t
  3    t  t  t  t  t  t
  4    t  t  t  t  t  t
  5    t  t  t  t  t  t

=>

After comparing every state against every set in Q':

       ε  1  2  3  4  5
  1    t  t  f  f  f  f
  2    t  f  t  f  f  f
  3    t  f  f  t  t  f
  4    t  f  f  t  t  f
  5    t  f  f  f  f  t

giving merge set Z = { {1}, {2}, {3,4}, {5} }:

[Figure: the minimised DFSA, with states 1, 2, "3,4" and 5.]
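The grouping step can be sketched compactly in Python. Two states end up with t in each other's cells of M exactly when they belong to the same members of Q', so instead of filling in the array the sketch below groups states by the collection of sets that contain them. This is an alternative formulation of the same test rather than a transcription of Algorithm 2.23; the function name and the use of frozensets are assumptions made for illustration.

```python
def equivalent_states(Q, Q_prime):
    """Group the states of Q that belong to exactly the same sets in Q_prime."""
    groups = {}
    for p in sorted(Q):
        # the "signature" of p: which subset-states of the reversed DFSA contain it
        signature = frozenset(S for S in Q_prime if p in S)
        groups.setdefault(signature, set()).add(p)
    return [sorted(group) for group in groups.values()]


# Data from Example 2.25.
Q = {1, 2, 3, 4, 5}
Q_prime = [
    frozenset({3, 4}),
    frozenset({2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 3, 4, 5}),
]

Z = equivalent_states(Q, Q_prime)
print(sorted(Z))
```

On the data of Example 2.25 this reproduces the merge set given there.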


3. Regular Expressions and Regular Languages

Definition 3.1 Regular Expressions
Let T be an alphabet. A regular expression over T defines a language over T as follows:
(i) ∅ denotes {}, ε denotes {ε}, and t denotes {t} for t ∈ T;
(ii) if r and s are regular expressions denoting languages R and S, then (r + s) denoting R + S, (rs) denoting RS, and (r*) denoting R* are regular expressions; and
(iii) nothing else is a regular expression over T.

Note: when writing regular expressions, if we give the operators +, . and * ascending priorities, then we can omit most of the brackets. For example, the regular expression
((a)* + ((b)* + (c))*)((b) + (c))
can be written as
(a* + (b* + c)*)(b + c)

Precedence:
r + st should be interpreted as r + (st)
r + st* should be interpreted as r + (s(t*))

Notation: If T is an alphabet, then T also denotes the regular language of strings over T of length 1. t^n denotes ttt...t n times.

Example 3.2
(i) The regular expression (a* + (b* + c)*)(b + c) denotes a set, some of whose members are: aaaab, b, bbbcbbbbcccccc, etc.
(ii) aT* denotes the language consisting of all strings over T starting with a.
(iii) T*(a^2 + b^2)T* denotes the set of all strings over T with a substring of aa or bb.
(iv) 0 + 1(0 + 1)* denotes the set of all binary numbers (written without leading zeros).
(v) The set of strings over {0,1} not containing two adjacent 0's is (1 + 01)*(ε + 0)

Example 3.3
Application of Regular Expressions (I)
Searching for strings of characters in UNIX using ex and other editors, and using grep and egrep. The ex command /a*[abc]/ means find any line containing a substring starting with any number of a's followed by an a, a b, or a c.

Application of Regular Expressions (II)
Lexical analysis, the initial phase of compiling, divides the source code into "tokens". The definition of what constitutes the different tokens is given by regular expressions.
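Some of these expressions can be tried out with Python's re module, as a rough sketch. The translation into re syntax is an assumption of this example: the notes' + (union) becomes |, and (ε + 0) becomes the optional 0?. Python patterns also support features beyond regular expressions in the formal sense, none of which are used here. The check for Example 3.2 (v) compares the pattern with a direct test on every binary string of length up to 6.

```python
import re
from itertools import product

# Example 3.2 (v): (1 + 01)*(ε + 0), strings over {0,1} with no two adjacent 0's.
no_adjacent_zeros = re.compile(r"(1|01)*0?")


def matches(w):
    """True iff the whole string w belongs to the language of the pattern."""
    return no_adjacent_zeros.fullmatch(w) is not None


all_strings = ["".join(p) for n in range(7) for p in product("01", repeat=n)]
agree = all(matches(w) == ("00" not in w) for w in all_strings)
print(agree)

# Example 3.2 (iii) with T = {a, b}: T*(aa + bb)T*.
has_double = re.compile(r"[ab]*(aa|bb)[ab]*")
print(has_double.fullmatch("abba") is not None)

# Example 3.3: searching a line for a substring matching a*[abc].
print(re.search(r"a*[abc]", "xyzc") is not None)
```

fullmatch tests membership of the whole string in the language, whereas search looks for a matching substring, which is how the editor and grep use patterns.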


Definition 3.4 A language L over T is a regular language iff there is a regular expression defining it.

Theorem 3.5

If A and B are regular languages, then so are A+B, AB and A*.

Notation: A' denotes the complement of A: i.e. the set of all strings in T* not in A.

Theorem 3.6

If A and B are regular languages, then so are A ∩ B and A'.

Theorem 3.7

Any finite language is regular.


4. Finite State Automata and Regular Languages

Theorem 4.1

Kleene's Theorem

A language L is accepted by a FSA iff L is regular.

Algorithm 4.2 Regular Expression -> NDFSA
Given a regular language, L, over T, defined by a regular expression, r, create a NDFSA, A, s.t. L = L(A).
begin
  if r == ε then A := ({q},{q},{q},T,{})
  else if r == ∅ then A := ({q},{q},{},T,{})
  else if r == t then A := ({p,q},{p},{q},T,{(p,t,q)})
  else if r == r1 + r2 then
  begin
    obtain A1 = (Q1,{i1},{f1},T,E1), L1 = L(A1)
    obtain A2 = (Q2,{i2},{f2},T,E2), L2 = L(A2)
    A := (Q1 ∪ Q2 ∪ {i,f},{i},{f},T,E1 ∪ E2 ∪ {(i,ε,i1),(i,ε,i2),(f1,ε,f),(f2,ε,f)})
  end
  else if r == r1r2 then
  begin
    obtain A1 and A2 as above
    A := (Q1 ∪ Q2,{i1},{f2},T,E1 ∪ E2 ∪ {(f1,ε,i2)})
  end
  else if r == r1* then
  begin
    obtain A1 as above
    A := (Q1 ∪ {i,f},{i},{f},T,E1 ∪ {(i,ε,i1),(i,ε,f),(f1,ε,f),(f1,ε,i1)})
  end
  return A
end

Example 4.3 Regular Expression -> NDFSA
Let L = (b+ab)(b+ab)*, T = {a, b}
Find NDFSA's for: (i) a (ii) b (iii) ab (iv) (b+ab) (v) (b+ab)* (vi) (b+ab)(b+ab)*

(i) ({1,2},{1},{2},T,{(1,a,2)})
(ii) ({3,4},{3},{4},T,{(3,b,4)})
(iii) ({1,2,3,4},{1},{4},T,{(1,a,2),(2,ε,3),(3,b,4)})
(ii)' ({5,6},{5},{6},T,{(5,b,6)})
(iv) ({1,2,3,4,5,6,7,8},{7},{8},T,{(7,ε,1),(7,ε,5),(1,a,2),(2,ε,3),(3,b,4),(5,b,6),(4,ε,8),(6,ε,8)})
(iv)' ({9,10,11,12,13,14,15,16},{15},{16},T,{(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),(13,b,14),(12,ε,16),(14,ε,16)})
(v) ({9,10,11,12,13,14,15,16,17,18},{17},{18},T,{(17,ε,15),(17,ε,18),(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),(13,b,14),(12,ε,16),(14,ε,16),(16,ε,18),(16,ε,15)})
(vi) ({1,2,...,18},{7},{18},T,{(7,ε,1),(7,ε,5),(1,a,2),(2,ε,3),(3,b,4),(5,b,6),(4,ε,8),(6,ε,8),(8,ε,17),(17,ε,15),(17,ε,18),(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),(13,b,14),(12,ε,16),(14,ε,16),(16,ε,18),(16,ε,15)})

[Figure: state diagram of the NDFSA (vi), with states 1 to 18, initial state 7 and final state 18.]
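The construction of Algorithm 4.2 can be sketched in Python and checked against the language of Example 4.3. The representation is an assumption of this sketch: a regular expression is a nested tuple such as ("alt", r1, r2), ε is the empty Python string, and the ∅ case is left out because the composite cases need an automaton with exactly one final state. All function names are illustrative.

```python
from itertools import count, product

EPS = ""                 # label for ε-edges
_fresh = count(1)        # supplies new state numbers


def build(r):
    """Return (Q, i, f, E) for the expression tree r: one initial state i, one final state f."""
    kind = r[0]
    if kind == "eps":
        q = next(_fresh)
        return {q}, q, q, set()
    if kind == "sym":
        p, q = next(_fresh), next(_fresh)
        return {p, q}, p, q, {(p, r[1], q)}
    if kind == "star":
        Q1, i1, f1, E1 = build(r[1])
        i, f = next(_fresh), next(_fresh)
        new = {(i, EPS, i1), (i, EPS, f), (f1, EPS, f), (f1, EPS, i1)}
        return Q1 | {i, f}, i, f, E1 | new
    Q1, i1, f1, E1 = build(r[1])
    Q2, i2, f2, E2 = build(r[2])
    if kind == "cat":
        return Q1 | Q2, i1, f2, E1 | E2 | {(f1, EPS, i2)}
    if kind == "alt":
        i, f = next(_fresh), next(_fresh)
        new = {(i, EPS, i1), (i, EPS, i2), (f1, EPS, f), (f2, EPS, f)}
        return Q1 | Q2 | {i, f}, i, f, E1 | E2 | new
    raise ValueError(f"unknown expression kind: {kind}")


def closure(states, E):
    """The given states plus everything reachable from them by ε-edges."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for (x, t, y) in E:
            if x == p and t == EPS and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def accepts(nfa, w):
    """Simulate the NDFSA on w by tracking the set of states it could be in."""
    Q, i, f, E = nfa
    current = closure({i}, E)
    for c in w:
        current = closure({y for (x, t, y) in E if x in current and t == c}, E)
    return f in current


# (b+ab)(b+ab)* from Example 4.3.
b_or_ab = ("alt", ("sym", "b"), ("cat", ("sym", "a"), ("sym", "b")))
nfa = build(("cat", b_or_ab, ("star", b_or_ab)))
print(accepts(nfa, "babb"), accepts(nfa, "baa"))
```

The state numbers differ from those in the example, since they depend on the order in which sub-expressions are built, but the automaton has the same shape and 18 states.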

Algorithm 4.4 FSA -> Regular Expression
Given a FSA, A, create a regular expression defining L(A)
begin
  create unique initial state
  create unique final state
  unique FSA -> regular expression
end

Algorithm 4.5 create unique final state
Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with a single final state s.t. L(A) = L(A')
begin
  Q := Q ∪ {f} (where f ∉ Q)
  for each q ∈ F do add (q,ε,f) to E
  F := {f}
  return (Q,I,F,T,E)
end

Definition 4.6
A regular finite state automaton (RFSA) is a FSA where the edge labels may be regular expressions. An edge labelled with the regular expression r indicates that we can move along that edge on input of any string defined by r.


Algorithm 4.7 unique FSA -> regular expression
Given a FSA, A = (Q,{i},{f},T,E), with unique initial and final states, create a regular expression r defining L(A).
begin
  convert A to a RFSA      /* trivial */
  while Q\{i,f} is not empty do
  begin
    for each state p ∈ Q with more than one edge (p,ri,p) (i ≤ n) do
      replace all those edges by (p,r1+r2+...+rn,p)
    for each pair p,q ∈ Q with more than one edge (p,ri,q) (i ≤ n) do
      replace all those edges by (p,r1+r2+...+rn,q)
    select s ∈ Q\{i,f}
    for each pair p,q ∈ Q (p,q ≠ s) s.t. there are edges (p,r1,s) and (s,r2,q) do
      if there is an edge (s,r3,s)
        then add the edge (p,r1r3*r2,q)
        else add the edge (p,r1r2,q)
    remove all edges to or from s
    remove all states and edges with no path from i
  end
  return r, where E = {(i,r,f)}
end

Example 4.8 FSA -> Regular Expression

[Figures: the state diagrams drawn for each step are not reproduced here. The FSA has states 1 to 4 and, judging by the edges produced below, the same edges as A1 of Example 2.2.]

1. create unique initial and final states: new states i and f are added, with i leading to 1 and 4 leading to f
2. remove state 2 - edges are 1->3 (aa), 1->4 (ab), 4->3 (aa), 4->4 (ab)
3. remove edge pairs: the two edges 1->4 become b+ab, the two loops 4->4 become b+ab, and the loop on 3 is a+b
4. remove state 3 - no edges
5. remove state 4 - edge is 1->f, labelled (b+ab)(b+ab)*
6. remove state 1 - edge is i->f, labelled (b+ab)(b+ab)*

The expression is (b+ab)(b+ab)*

Theorem 4.9

The Pumping Lemma

If L is a regular language, then there exists an integer N s.t. for any w ∈ L with |w| ≥ N, there are strings x, u and y s.t.
  w = xuy
  |xu| ≤ N
  |u| > 0
and s.t. for any m ≥ 0, xu^m y ∈ L.

Proof: Since L is a regular language, there must be a DFSA A which accepts L. Let N be the number of states in A. Suppose there is a string w of length ≥ N which is accepted by A (i.e. w ∈ L). Since w is accepted by A, the accepting path must make |w|+1 visits to states, and hence must make > N visits to states. But A has only N states, and so at least one state must be visited at least twice. Let s be the first state which is visited twice on the accepting path. We can split the path into three sub-paths: from the start to the first visit to s, from the first visit to s to the second visit to s, and from the second visit to s to the end. Let x, u and y be the substrings of w corresponding to these three subpaths (so w = xuy). The subpath xu does not visit any states more than once except s, and so makes at most N+1 visits, and so must have length ≤ N (so |xu| ≤ N). For the two visits to s to be separate, there must be at least one character accepted in that subpath (so |u| > 0).

Now, if we are in state s, an input of u will take us back to s, and an input of y will take us to the finish state. From the start state, an input of x will take us to state s. Therefore, the input xy will be accepted (x takes us to s, and y takes us to the finish state), and so will xuy, xuuy, xuuuy, etc. Therefore, for any m ≥ 0, xu^m y will be accepted by A. But, by definition of A, any string accepted by A is in L. Therefore, for any m ≥ 0, xu^m y ∈ L.


Example 4.10

Using the Pumping Lemma

Show L = {a^n b^n : n ≥ 0} is not regular.

Suppose L is regular. Then, by the pumping lemma, there exists some integer N s.t. for any w ∈ L with |w| ≥ N, there are strings x, u and y s.t.
  w = xuy
  |xu| ≤ N
  |u| > 0
and for any m ≥ 0, xu^m y ∈ L.

Choose i > N/2. Let w be the string a^i b^i. Then w has length > N. By the pumping lemma, w can be split into substrings xuy, s.t. |xu| ≤ N and |u| > 0. Now u must be of the form a^n, or a^n b^m, or b^m, for some n and m.

If u = a^n, then w = xuy = a^j a^n a^k b^i, where j+n+k = i. So xu^2y = a^j a^n a^n a^k b^i, which is not in L, because it has more a's than b's. The same argument works for u = b^m.

If u = a^n b^m, then w = xuy = a^j a^n b^m b^k, and xu^2y = a^j a^n b^m a^n b^m b^k, which is obviously not in L, because it has b's before a's.

Thus in no case is xu^2y in L. But by the Pumping Lemma, xu^2y ∈ L. Contradiction. Therefore our first assumption must have been wrong, so L is not regular.
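The case analysis can be spot-checked mechanically for one particular value of N. The sketch below tries every split w = xuy allowed by the lemma and confirms that pumping u twice always leaves L. It illustrates the argument for a single N and is not a substitute for the proof, which must work for every N. The function names are illustrative.

```python
def in_L(w):
    """Membership test for L = { a^n b^n : n >= 0 }."""
    n = len(w) // 2
    return w == "a" * n + "b" * n


def pump(x, u, y, m):
    """The string x u^m y."""
    return x + u * m + y


def every_split_fails(w, N):
    """True iff xu^2y is outside L for every split w = xuy with |xu| <= N and |u| > 0."""
    for end in range(1, min(N, len(w)) + 1):      # end = |xu|
        for start in range(end):                  # start = |x|, so |u| = end - start > 0
            x, u, y = w[:start], w[start:end], w[end:]
            if in_L(pump(x, u, y, 2)):
                return False
    return True


N = 4
i = 3                         # any i > N/2 will do
w = "a" * i + "b" * i
print(in_L(w), len(w) > N, every_split_fails(w, N))
```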


5. Finite State Automata with Output

Definition 5.1 Moore Machine
A Moore Machine is a 6-tuple (Q,I,T,E,Γ,O), where Q, I, T and E are as for DFSA's, Γ is an alphabet (called the output alphabet), and O is a subset of Q × Γ (called the output function).

Notation: if (q,x) ∈ O, then sketch state q with the label q/x. The output function defines the output of the machine whenever the machine enters a particular state.

Example 5.2 A Moore machine which prints out a "1" every time an aab substring is input:

[Figure: state diagram of a Moore machine with four states labelled 0/, 1/, 2/ and 3/1, and arcs labelled a and b.]

The input aaababaaab gives the output 11.

Definition 5.3 Mealy Machine
A Mealy Machine is a 6-tuple (Q,I,T,E,Γ,O) where Q, I, T and E are as for DFSA's, Γ is an alphabet (called the output alphabet), and O is a subset of Q × T × Γ (called the output function).

Notation: if (q,t,x) ∈ O, then for any arc (q,t,p) ∈ E, label the arc by t/x. The output function defines the output of the machine whenever the machine leaves a particular state through a particular labelled action.


Example 5.4 A Mealy Machine which takes reversed binary numbers as input, and prints as output the reversed number one larger:

[Figure: state diagram of a Mealy machine with arcs labelled 0/0, 1/1, 0/1 and 1/0.]

The input 11101 gives the output 00011.

Definitions 5.5
Let M be a Moore machine or a Mealy Machine, with output alphabet Γ. Define Mo(w) to be the output of M on w.

Let M1 = (Q1,I1,T1,E1,Γ1,O1) be a Moore Machine, and M2 = (Q2,I2,T2,E2,Γ2,O2) be a Mealy Machine. Let M1o(ε) = b. M1 and M2 are equivalent if T1 = T2 and for all strings w ∈ T1*, M1o(w) = bM2o(w).

Theorem 5.6 Moore-Mealy Equivalence If M1 is a Moore Machine, then there exists a Mealy Machine M2 equivalent to M1. If M2 is a Mealy Machine, then there exists a Moore Machine M1 equivalent to M2.
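Both kinds of machine are easy to simulate, which gives a way to check the outputs quoted in Examples 5.2 and 5.4. The sketch below rests on assumptions: the state diagrams did not survive in these notes, so the two transition tables are reconstructions (a standard "aab" detector and a two-state carry machine), and the names MOORE_NEXT, MOORE_OUT, MEALY, run_moore and run_mealy are hypothetical.

```python
# Moore machine for Example 5.2 (reconstructed table, an assumption).
# State k means "the last symbols read match the first k symbols of aab".
MOORE_NEXT = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
MOORE_OUT = {0: "", 1: "", 2: "", 3: "1"}   # output emitted on entering a state


def run_moore(word: str, start: int = 0) -> str:
    state = start
    out = [MOORE_OUT[state]]                # the start state also emits its output
    for symbol in word:
        state = MOORE_NEXT[(state, symbol)]
        out.append(MOORE_OUT[state])
    return "".join(out)


# Mealy machine for Example 5.4 (a two-state reconstruction, an assumption).
# (state, input) -> (next state, output); "carry" is the start state.
MEALY = {
    ("carry", "1"): ("carry", "0"),         # 1 + carry = 0, keep carrying
    ("carry", "0"): ("done", "1"),          # 0 + carry = 1, carry absorbed
    ("done", "0"): ("done", "0"),           # copy the remaining bits
    ("done", "1"): ("done", "1"),
}


def run_mealy(word: str, start: str = "carry") -> str:
    state, out = start, []
    for symbol in word:
        state, emitted = MEALY[(state, symbol)]
        out.append(emitted)                 # output is produced on the arc
    return "".join(out)


print(run_moore("aaababaaab"))   # prints 11
print(run_mealy("11101"))        # prints 00011
```

Note the difference in where output is attached: the Moore simulator emits when a state is entered, the Mealy simulator emits while an arc is taken. The Mealy sketch produces exactly one output bit per input bit, so an input consisting only of 1s overflows (for example 111 gives 000).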


6. Lex: A Lexical Analysis Tool

Lex is a program generator, accepting a series of regular expression definitions, and producing a program which analyses input to identify lexical tokens defined by those regular expressions.

A Lex script has three sections, separated by a line containing only "%%":

    ... definitions ...
    %%
    ... regular expression / action pairs ...
    %%
    ... user-defined functions ...

Lex Syntax
Let c be a character, x,y regular expressions, s a string, m,n integers, and i an identifier.

regular expressions
    c         any character except meta characters
    [...]     any of the list of characters enclosed (may be a range of characters)
    [^...]    any of the characters not in the list enclosed
    .         any ASCII character except newline
    xy        the concatenation of x and y
    x*        zero or more occurrences of x (the * of regular expressions)
    x+        one or more occurrences of x (the + of regular expressions)
    x?        an optional x (same as x + ε)
    x|y       x or y
    {i}       the definition of i
    x/y       x, but only if followed by y (and y is not read from the input)
    x{m,n}    m to n occurrences of x
    ^x        x, but only at the beginning of a line
    x$        x, but only at the end of a line
    "s"       exactly what is in the quotes (except for "\" and the following character)

Precedence: brackets, then unary operators (+,?,*), then concatenation, then |, then /. Regular expressions are terminated by a space or a tab.


If there is a conflict between different regular expressions, then Lex will choose the expression that matches the longest piece of input, and among expressions matching the same length, will choose the one defined first.

meta characters (do not match themselves)
    ( ) [ ] { } < > + / , ^ * | . \ " $ ? - %
A match with a meta-character can be obtained by preceding it with "\". Backslash, tab and newline are represented by \\, \t and \n respectively.

Actions
An action is a C language statement (followed by ";"). For example:

    [0-9]+       printf("Integer\n");
    [a-zA-Z]+    printf("String\n");

will print out "Integer" after receiving a digit string as input, and "String" after receiving a character string. Thus the input 12+19=sum will result in

    Integer
    +Integer
    =String

Note that a recognised regular expression is held in the string variable yytext, and its length is held in the integer variable yyleng. Any input not recognised by the regular expression section will simply be echoed to the screen.

Definition Section
If

    identifier string

appears in the definition section, then string will replace {identifier} whenever {identifier} appears in the regular expression section. Thus

    L [a-zA-Z]
    %%
    {L}+

is equivalent to

    %%
    [a-zA-Z]+

Anything enclosed between %{ ... %} in this section will be copied into the output program. #include and #define statements, all variable declarations, all function definitions and any comments should be so enclosed.

Functions Section
The section should contain the user-defined "main" routine, and any other required functions, written as C code. A simple "main" routine is found in the lex library, and will be used if no user-defined "main" is supplied.


Running Lex
The command lex calls the lex program on the specified file (usually with a ".l" suffix). The output, a C file, is called lex.yy.c. This program must then be compiled with the lex library (using the -ll option), with the resulting executable renamed if required. To run the program, simply type the name of the executable. For example, to compile and run the lex script "example.l", type:

    lex example.l
    cc lex.yy.c -o example.o -ll
    example.o

Example Lex Program
The following program specifies a simple word recognition lexical analyser

    %{
    /* simple word recognition program */
    %}
    L [a-zA-Z]
    %%
    [ \t]+      ; /* ignore whitespace */
    is|are      printf("verb: %s; ", yytext);
    a|the       printf("determiner: %s; ", yytext);
    dog |
    cat |
    male |
    female      printf("noun: %s; ", yytext);
    {L}+        printf("unknown: %s; ", yytext);
    .|\n        ECHO;
    %%
    main() { yylex(); }

Running this program as above would give the following (lines ending in <cr> are user input):

    % word.o
    the dog is a male <cr>
    determiner: the; noun: dog; verb: is; determiner: a; noun: male;
    female cat dog is <cr>
    noun: female; noun: cat; noun: dog; verb: is;
    catdog is male <cr>
    unknown: catdog; verb: is; noun: male;
    <ctrl-d>
    %
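The way Lex resolves conflicts (longest match first, then the earliest rule) explains why "catdog" is reported as unknown rather than as the noun "cat" followed by something else. The following is a rough sketch of that rule applied to the word-recognition patterns; it is not how Lex is implemented (Lex compiles the rules into a single automaton), and the names RULES and analyse are hypothetical.

```python
import re

# (pattern, label) pairs in the same order as the Lex rules.
# A label of None means the match is silently discarded.
RULES = [
    (r"[ \t]+", None),                    # ignore whitespace
    (r"is|are", "verb"),
    (r"a|the", "determiner"),
    (r"dog|cat|male|female", "noun"),
    (r"[a-zA-Z]+", "unknown"),
]
# Python alternation is ordered rather than longest-match; that is harmless
# here because no alternative is a prefix of another in the same rule.
COMPILED = [(re.compile(p), label) for p, label in RULES]


def analyse(line: str) -> str:
    out, pos = [], 0
    while pos < len(line):
        best_len, best_label = 0, None
        for regex, label in COMPILED:
            m = regex.match(line, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_label = m.end() - pos, label
        if best_len == 0:                 # nothing matched: echo one character
            out.append(line[pos])
            pos += 1
            continue
        if best_label is not None:
            out.append(f"{best_label}: {line[pos:pos + best_len]}; ")
        pos += best_len
    return "".join(out)


print(analyse("the dog is a male"))
print(analyse("catdog is male"))
```

On "catdog" the noun rule matches only 3 characters while the {L}+ rule matches 6, so the longer match wins. On "is" both the verb rule and the {L}+ rule match 2 characters, so the rule listed first wins.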


Practical Class: Using Lex


Write a lexical analyser using Lex, for the language C-, defined below. What is required? A "y.tab.h" file will be supplied, defining all the different tokens to be used. It is linked from the web page for the practical, and should be copied to your own filespace before beginning the practical. Note that "KEY_REAL_T" is intended for the "real" keyword, and "REAL_T" is intended for real numbers. You have to write a Lex script, containing a definition section, a regular expression/action pair section, and a function section. The script, when run through Lex, should create a program which takes a file as input, reads the file, and outputs the result of a lexical analysis (either to another file or to the screen). The output from the analyser should be in the form of <token, attribute> pairs. Every element of an input program should be classified. Thus, on receiving input of z := y*27; the output should be something like: <ID_T,z> <BECOMES_T,:=> <ID_T,y> <MUL_T,*> <INT_T,27> <SEMI_T,;> The input must be described by regular expressions, and you must use Lex. You are advised to use the skeleton file "lexer.l" in the above directory. Note that you are only asked to do word recognition, and not check syntax. The action for each regular expression should be a simple "return" statement. You have to decide what to do with errors, but do not allow something to pass as a token if it should not pass, do not misclassify tokens, and do not allow valid tokens to pass as errors.

C- language definition
NOTE: A program is a sequence of function declarations and variable declarations. Each function and each variable must be declared before use. A variable is declared by stating the variable type followed by a nonempty space-separated sequence of identifiers or array specifications, ending with a semi-colon. A function is declared by stating its return type, followed by the function name (an identifier), a (possibly empty) comma-separated list of parameter declarations between "(" and ")", followed by the code block between "{" and "}". Possible variable and return types are "real" and "int" (and "void" as a special return type). An identifier is a sequence of letters. An array specification is an identifier followed by an integer between "[" and "]" representing the size of the array. An array reference is an identifier followed by an integer-valued expression between "[" and "]". A parameter declaration is a variable type, a space, and an identifier. Each program must have a "main" function block, which must be the last function to be declared. It has no parameters, and no return type. The code block is a sequence of statements. Each statement may be a variable declaration, an assignment, a call to a function, a print statement, a code block between "{" and "}", a while statement, an if-then statement, an if-then-else statement, or a return statement. All except the code block and while and if statements must be terminated by a semi-colon.

An assignment has an array reference or an identifier on the left-hand side, a ":=", and an expression on the right-hand side. Expressions are built from the "+", "-", "*" and "/" operators, and the basic factors are reals, integers, identifiers, array references, function calls, or expressions inside "(" and ")". A call to a function is the function name, and then the argument list between "(" and ")". The argument list is a possibly empty comma-separated sequence of expressions. A print statement is the keyword "print" followed by an expression between "(" and ")". A while statement is the keyword "while" followed by a test between "(" and ")" followed by the keyword "do" followed by a statement. An if-then statement is the keyword "if" followed by a test between "(" and ")" followed by the keyword "then" followed by a statement. An if-then-else statement is an if-then statement followed by the keyword "else" followed by a statement. A test is two expressions with a relational operator in between. The relational operators are "<", ">", "=<", "=>", "=" and "!=" (standing for "not equals"). A return statement is the keyword "return" followed by an expression.

Integers have an optional sign followed by one or more digits. Reals have an optional sign, one or more digits, a decimal point, and one or more digits. Variables or functions which are used before being declared will give an error. Trying to use a return value of a void function gives an error. Using a return statement inside a void function gives an error. Variables are either declared outside a function, and can be accessed by all functions which follow them, or are declared inside a function, and can only be used inside that function. Redeclaring a variable within its scope gives an error. All parameters are passed by value (i.e. the value of a passed parameter does not change once the function has completed).

A small example C- program is given below:

    int a b c ;
    int g[5];
    int testFunc(int x) {
        real y;
        y := (x+a)/2;
        print(y);
        return a;
    }
    main() {
        a := 1;
        while (a < 3) do {
            testFunc(a);
            a := a + 1;
        }
    }


7. Languages and Grammars

Definitions 7.1

Grammar

A grammar is a 4-tuple, G = (N,T,S,P), where
    N is a finite alphabet (called the non-terminals);
    T is a finite alphabet (called the terminals);
    N ∩ T = ∅;
    S ∈ N is the start symbol; and
    P is a finite set of productions of the form α -> β, where α ∈ (N ∪ T)+, α has at least one member from N, and β ∈ (N ∪ T)*.

Let G = (N,T,S,P) be a grammar. If s, t, x, y, u and v are strings s.t. s = xuy, t = xvy, and u -> v ∈ P, then s directly derives t, written s => t.

If there is a sequence of strings s0, s1, ..., sn s.t. s0 => s1 => ... => sn-1 => sn, then s0 derives sn, written s0 =>* sn.

A sentential form of G is a string w ∈ (N ∪ T)* s.t. S =>* w.

A sentence of G is a sentential form w ∈ T* - i.e. w has only terminal symbols.

Definition 7.2

Language defined by a grammar

The language defined by G is the set of all sentences of G, denoted L(G).

Example 7.3 Let G = ({S}, {a,b}, S, {S -> ε, S -> aSb}).
    G has one non-terminal: S.
    The terminals of G are a and b.
    The start symbol of G is S.
    G has two productions.
    aaaSbbb => aaaaSbbbb.
    S =>* aaaabbbb.
    aaaSbbb is a sentential form of G.
    aaaabbbb is a sentence of G.
    L(G) = {ε, ab, aabb, aaabbb, ...}, which is {a^n b^n : n ≥ 0}.
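The definitions of "directly derives" and "sentence" translate into a small search: start from S, apply productions in every possible position, and collect the forms that contain only terminals. The sketch below does this for the grammar of Example 7.3. The function name sentences and the length bound max_len are assumptions introduced for the illustration; the bound is what keeps the search finite for this grammar.

```python
from collections import deque

PRODUCTIONS = [("S", ""), ("S", "aSb")]      # S -> ε and S -> aSb
NONTERMINALS = {"S"}


def sentences(start: str = "S", max_len: int = 6):
    """Return all sentences with at most max_len terminals, shortest first."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()               # a sentential form
        if not any(c in NONTERMINALS for c in form):
            found.add(form)                  # only terminals left: a sentence
            continue
        for lhs, rhs in PRODUCTIONS:
            idx = form.find(lhs)
            while idx != -1:                 # rewrite each occurrence of lhs
                new = form[:idx] + rhs + form[idx + len(lhs):]
                terminals = sum(c not in NONTERMINALS for c in new)
                if terminals <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                idx = form.find(lhs, idx + 1)
    return sorted(found, key=len)


print(sentences())   # prints ['', 'ab', 'aabb', 'aaabbb']
```

The empty string in the output is ε. Raising max_len extends the list with aaaabbbb and so on, in line with L(G) = {a^n b^n : n ≥ 0}.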


Notation: We will not normally write the grammar as a tuple, but will use the following conventions:
    Non-terminals will be uppercase.
    Terminals will be lowercase.
    Unless stated otherwise, the start symbol will be S.
    The set of productions may be numbered. If x => y using production number i, then we write x =>i y.
    α -> β1 | β2 | ... | βn will be shorthand for the n productions α -> βi.

Definition 7.4 Context-Free Grammars and Languages
A context-free grammar (denoted CFG) is a grammar in which all productions are of the form α -> β, where α ∈ N - i.e. the left hand side is a single non-terminal.
A context-free language (denoted CFL) is one defined by a context-free grammar.

Example 7.5

A Grammar of Algebraic Expressions: G0

G = ({S}, {a, +, *, (, )}, S, {
    1) S -> S + S
    2) S -> S * S
    3) S -> (S)
    4) S -> a
})

Example derivation:
S =>2 S * S =>4 a * S =>3 a * (S) =>1 a * (S + S) =>4 a * (a + S) =>4 a * (a + a).
Note that there are many other ways of deriving the same string.

Definition 7.6 Regular Grammar
A grammar is regular if each production is of the form:
    (i) A -> t,
    (ii) A -> tB, or
    (iii) A -> ε
where A, B ∈ N, t ∈ T.


Example 7.7
    S -> aA | bB
    A -> aS | a
    B -> bS | b

S => aA => aaS => aaaA => aaaaS => aaaabB => aaaabb

The language generated by this grammar is the same as that of the regular expression (aa + bb)+, i.e. one or more repetitions of aa or bb.

Theorem 7.8 A language is regular iff it can be defined by a regular grammar.

Techniques for constructing grammars

To create sequences of a symbol (e.g. aaa...a):
    A -> aA | ε
or
    A -> Aa | ε
Example: A => aA => aaA => ... => aaaaaA => aaaaa

To "bracket" a string (e.g. axxx...xb):
    A -> aBb
    B -> xB | ε
or
    A -> Cb
    C -> ax | Cx
Example: A => aBb => axBb => axxBb => ... => axxxxxBb => axxxxxb

To create a nested structure (e.g. aaa...<.....>...bbb):
    A -> aAb | B
    B -> xB | ε
Example: A => aAb => aaAbb => ... => aaaaaAbbbbb => aaaaaBbbbbb => aaaaaxBbbbbb => aaaaaxxBbbbbb => aaaaaxxxBbbbbb => aaaaaxxxbbbbb

Example 7.9 Construct a grammar for the language consisting of all strings of the form

    abccc...cab
or
    abab...ab ccc...c abab...ab

where the block ab is repeated n times before the c's and n times after them.

    A -> abAab | abBab
    B -> cB | c


8. Derivations and Ambiguity

Recognition problem

Given a grammar, G, and a string, w, is w ∈ L(G)?

Parsing Problem

Given a grammar, G, and a string, w ∈ L(G), how is w derived in G?

Definition 8.1

Derivation tree

Let (S =) w0 =>i1 w1 =>i2 w2 =>i3 ... =>in wn be a derivation. We construct the corresponding derivation tree as follows. It has w0 (= S) as its root. Every time a symbol A is replaced by a string β, a branch is added from A to every symbol in β, in the same order in which they appear in β.

Example 8.2 Let S => S+S => S+(S) => S+(S*S) => S+(S*a) => S+(a*a) => a+(a*a) be a derivation in the grammar G0. Its corresponding derivation tree is

            S
          / | \
         S  +  S
         |   / | \
         a  (  S  )
             / | \
            S  *  S
            |     |
            a     a

Definitions 8.3 A derivation in which, at each step, the rightmost non-terminal is replaced is a right-derivation. A CFG is ambiguous if there is at least one string in L(G) having two or more different right derivations. Note: A string has two different right derivations iff it has two different derivation trees.


Example 8.4 G0 is ambiguous, since the string a+a*a has two different right derivations:
    1. S => S+S => S+S*S => S+S*a => S+a*a => a+a*a
    2. S => S*S => S*a => S+S*a => S+a*a => a+a*a
with the two derivation trees:

    1.      S
          / | \
         S  +  S
         |   / | \
         a  S  *  S
            |     |
            a     a

    2.        S
            / | \
           S  *  S
         / | \   |
        S  +  S  a
        |     |
        a     a
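Since a string has two right derivations exactly when it has two derivation trees, ambiguity of G0 on a particular string can be checked by counting trees. The sketch below does this by trying every production of G0 on every substring; the function name count_trees is hypothetical, and the approach is a plain memoised recursion written for this one grammar, not a general parsing algorithm.

```python
from functools import lru_cache


def count_trees(s: str) -> int:
    """Number of distinct derivation trees for s from S in G0."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:            # trees deriving s[i:j]
        total = 0
        if j - i == 1 and s[i] == "a":
            total += 1                           # S -> a
        if j - i >= 3 and s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)         # S -> (S)
        for k in range(i + 1, j - 1):            # S -> S + S and S -> S * S
            if s[k] in "+*":                     # operator at position k
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(s))


print(count_trees("a+a*a"))     # prints 2
print(count_trees("(a+a)*a"))   # prints 1
```

The count of 2 for a+a*a corresponds to the two trees of Example 8.4; adding brackets removes the choice and leaves a single tree.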

Example 8.5

An unambiguous grammar of algebraic expressions G

    1) S -> S + T
    2) S -> T
    3) T -> T * F
    4) T -> F
    5) F -> (S)
    6) F -> a

S => S+T => S+T*F => S+T*a => S+F*a => S+a*a => T+a*a => F+a*a => a+a*a

              S
            / | \
           S  +  T
           |   / | \
           T  T  *  F
           |  |     |
           F  F     a
           |  |
           a  a

Definition 8.6 A language for which every defining grammar is ambiguous is inherently ambiguous.


9. Parsing

Definition 9.1 Top-down parsing creates a derivation tree for a given string by expanding from the start symbol by applying productions.

Definition 9.2 Recursive-descent parsing is a top-down parsing method that associates a recursive procedure with each non-terminal of the grammar. Predictive parsing is recursive-descent parsing where it is possible to determine which procedure to call at each stage by examining the next symbol of the input.

Example 9.3

Consider the following grammar:
    Type -> Simple | array [Simple] of Type
    Simple -> int | num .. num

We can write procedures for Type and Simple as follows:

    procedure type
    begin
        if token ∈ {int, num} then
            simple
        else if token = array then begin
            match(array)
            match('[')
            simple
            match(']')
            match(of)
            type
        end
        else error
    end

    procedure simple
    begin
        if token = int then
            match(int)
        else if token = num then begin
            match(num)
            match(..)
            match(num)
        end
        else error
    end

    procedure match (t:token)
    begin
        if token = t then
            token := nexttoken
        else error
    end


A parse of "array[3 .. 11] of int" then consists of the following procedure calls:

    token:            array [ num .. num ] of int
    procedure calls:  type
                      match(array)
                      match('[')
                      simple
                      match(num)
                      match(..)
                      match(num)
                      match(']')
                      match(of)
                      type
                      simple
                      match(int)

The corresponding derivation tree is:

    Type
        array
        [
        Simple
            num .. num
        ]
        of
        Type
            Simple
                int
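The three procedures of Example 9.3 carry over almost line for line into a runnable program. The sketch below is one possible transcription: the class name Parser, the exception ParseError, the "#" end marker and the calls list (used to record the order of procedure calls) are all assumptions added for the illustration, and the input is taken to be already tokenised.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["#"]   # "#" marks end of input (assumed)
        self.pos = 0
        self.calls = []                      # trace of procedure calls

    @property
    def token(self):                         # the current lookahead token
        return self.tokens[self.pos]

    def match(self, expected):
        self.calls.append(f"match({expected})")
        if self.token != expected:
            raise ParseError(f"expected {expected!r}, found {self.token!r}")
        self.pos += 1                        # token := nexttoken

    def type_(self):                         # Type -> Simple | array [Simple] of Type
        self.calls.append("type")
        if self.token in ("int", "num"):
            self.simple()
        elif self.token == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected token {self.token!r}")

    def simple(self):                        # Simple -> int | num .. num
        self.calls.append("simple")
        if self.token == "int":
            self.match("int")
        elif self.token == "num":
            self.match("num")
            self.match("..")
            self.match("num")
        else:
            raise ParseError(f"unexpected token {self.token!r}")


def parse(tokens):
    parser = Parser(tokens)
    parser.type_()
    if parser.token != "#":
        raise ParseError("unexpected input after a complete Type")
    return parser.calls


print(parse("array [ num .. num ] of int".split()))
```

The printed list repeats the sequence of procedure calls given for "array[3 .. 11] of int". The parser is predictive because each branch is selected by looking only at the current token.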

Definition 9.4 LL(1) parsing means: (i) read the input from the left (to the right) (ii) generate a left derivation (iii) using 1 lookahead symbol.

LL(1) parsing for a given grammar requires a 2D table with a column for each terminal plus a new symbol #, and a row for each non-terminal. Each cell holds a single production from the grammar, or is left blank to indicate an error.

Algorithm 9.5

LL(1) Parsing Algorithm

Given a string, a grammar and an LL(1) parse table, parse the string using the table.
Variables: z - a string (the parsing stack), w - a string (the input), M - the LL(1) table

    begin
        z := start symbol concatenated with #
        w := input string concatenated with #
        while q ≠ # do begin
            q represents the first symbol in z                  % top symbol in stack
            t represents the first symbol in w
            if q = a and t = a then begin                       % a is a terminal
                remove a from front of w                        % 'match'
                remove a from front of z
            end
            else if q = N and t = a and M[N,a] = p then begin   % p = N -> β
                remove N from front of z
                put β onto the front of z
            end
            else error
        end while
        if q = # and t = # then accept                          % input ∈ L(G)
        else error
    end

Example 9.6
Grammar:
    1) S -> ( S ) S
    2) S -> ε

LL(1) table:
          (    )    #
    S:    1    2    2

    z (parsing stack)    w (input stack)    action
    S#                   ()#                S -> ( S ) S
    (S)S#                ()#                match
    S)S#                 )#                 S -> ε
    )S#                  )#                 match
    S#                   #                  S -> ε
    #                    #                  accept
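Algorithm 9.5 can be run directly on the grammar and table of Example 9.6. The sketch below is a minimal transcription under a few assumptions: the names GRAMMAR, TABLE, NONTERMINALS and ll1_parse are hypothetical, tokens are single characters, and an error is reported by returning False rather than by raising.

```python
# Productions of Example 9.6, numbered as in the notes; [] stands for ε.
GRAMMAR = {1: ("S", ["(", "S", ")", "S"]), 2: ("S", [])}
# LL(1) table M[non-terminal, lookahead] -> production number.
TABLE = {("S", "("): 1, ("S", ")"): 2, ("S", "#"): 2}
NONTERMINALS = {"S"}


def ll1_parse(text: str, start: str = "S") -> bool:
    z = [start, "#"]                 # parsing stack, top at the front
    w = list(text) + ["#"]           # remaining input
    while z[0] != "#":
        q, t = z[0], w[0]
        if q not in NONTERMINALS:    # q is a terminal: it must match the input
            if q != t:
                return False
            z.pop(0)
            w.pop(0)
        elif (q, t) in TABLE:        # expand q using the production in M[q, t]
            _, rhs = GRAMMAR[TABLE[(q, t)]]
            z[0:1] = rhs             # replace q by the right-hand side
        else:
            return False             # blank table cell: error
    return w[0] == "#"               # accept only if the input is used up too


print(ll1_parse("()"))    # prints True
print(ll1_parse("(()"))   # prints False
```

On the input () the stack goes through S#, (S)S#, S)S#, )S#, S# and #, the same sequence as the trace in Example 9.6.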

Both recursive-descent parsing and LL(1) parsing require first(N) and follow(N) to be known for all non-terminal symbols N. first(N) is the set of all tokens which could appear as the first symbol in a token substring derived from N, while follow(N) is the set of all tokens which could appear as the next token once N's token substring is finished. Algorithms to compute these sets are known (but are omitted from the course). Neither recursive-descent nor LL(1) parsing can be used on grammars which are left recursive, or which have two or more productions for the one non-terminal where the right-hand side starts with the same substring.

Definition 9.7 Bottom-up parsing constructs a derivation tree from the input string, applying productions in reverse (called reductions) until the start symbol is reached.

Algorithm 9.8 Basic Shift-Reduce Parsing
Given a string and a grammar, construct a derivation of the string.
variables: z - a string (the stack), w - a string (the input), h - a substring (the handle)

    begin
        z := ε
        w := input string concatenated with #
        while z ≠ S or w ≠ # do begin
            obtain the handle h (corresponding to production A -> h)
            if z does not end in h, then
                move first symbol of w to end of z      % shift
            else begin                                  % reduce
                remove h from z
                put A on end of z
            end
        end
    end

Example 9.9 Parse a+a*a in grammar G

    Stack      Input     Action
               a+a*a     shift
    a          +a*a      reduce (6)
    F          +a*a      reduce (4)
    T          +a*a      reduce (2)
    S          +a*a      shift
    S+         a*a       shift
    S+a        *a        reduce (6)
    S+F        *a        reduce (4)
    S+T        *a        shift
    S+T*       a         shift
    S+T*a                reduce (6)
    S+T*F                reduce (3)
    S+T                  reduce (1)
    S                    accept

Definition 9.10 LR(k) Parse Table

An LR(k) parse table is a 2D matrix, with rows indexed by integers and columns indexed by length k strings of grammar symbols plus an endmarker. The entries of the table are of five types:

    Rp    (reduce by production p)
    Sn    (shift, go to state n)
    n     (go to state n)
    A     (accept)    /* there is only one 'A' entry */
    E     (error)     /* Notation: appear blank in the table */

"LR(k)" means:
    (i) read the input from the left (to the right)
    (ii) generate a right derivation
    (iii) using k lookahead symbols.

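Example 9.11 LR(1) parse table for G

    State |  S    T    F  |  a    +    *    (    )    #
    ------+---------------+------------------------------
      0   |  1    2    3  |  S5             S4
      1   |               |       S6                  A
      2   |               |       R2   S7        R2   R2
      3   |               |       R4   R4        R4   R4
      4   |  8    2    3  |  S5             S4
      5   |               |       R6   R6        R6   R6
      6   |       9    3  |  S5             S4
      7   |            10 |  S5             S4
      8   |               |       S6             S11
      9   |               |       R1   S7        R1   R1
     10   |               |       R3   R3        R3   R3
     11   |               |       R5   R5        R5   R5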


Algorithm 9.12

LR(1) Parsing Algorithm

Given a string, a grammar and an LR(1) parse table, parse the string using the table.

    begin
        z := 0
        w := input string concatenated with #
        loop
            q := last symbol in z                     % top state in stack
            t := first symbol in w
            if M[q, t] = Sn then begin                % row q, col t in table
                remove t from front of w
                put n on end of z
            end
            else if M[q, t] = Rp then begin
                take the grammar rule numbered p;
                let its left hand side be B and its right hand side be β,
                i.e. the grammar rule has the form: p = B -> β
                remove |β| symbols from end of z
                q := last symbol in z                 % top state in stack
                put M[q,B] on end of z                % new state
            end
            else if M[q, t] = A then return true      % input ∈ L(G)
            else return false                         % input ∉ L(G)
        end
    end

Example 9.13 Parse a+a*a using the table of 9.11 and grammar G
    1) S -> S + T
    2) S -> T
    3) T -> T * F
    4) T -> F
    5) F -> (S)
    6) F -> a


Each reduction occupies two rows: the first shows the table entry Rp being found, and the second shows the stack after the right-hand side has been removed, where t is the left-hand side B used to look up the new state M[q,B].

    Symbol stack   Stack (z)       Input (w)   q    t   Action   Grammar Rule
                   0               a+a*a#      0    a   S5
    a              0 5             +a*a#       5    +   R6       6) F -> a
    a              0               +a*a#       0    F   R6
    F              0 3             +a*a#       3    +   R4       4) T -> F
    F              0               +a*a#       0    T   R4
    T              0 2             +a*a#       2    +   R2       2) S -> T
    T              0               +a*a#       0    S   R2
    S              0 1             +a*a#       1    +   S6
    S+             0 1 6           a*a#        6    a   S5
    S+a            0 1 6 5         *a#         5    *   R6       6) F -> a
    S+a            0 1 6           *a#         6    F   R6
    S+F            0 1 6 3         *a#         3    *   R4       4) T -> F
    S+F            0 1 6           *a#         6    T   R4
    S+T            0 1 6 9         *a#         9    *   S7
    S+T*           0 1 6 9 7       a#          7    a   S5
    S+T*a          0 1 6 9 7 5     #           5    #   R6       6) F -> a
    S+T*a          0 1 6 9 7       #           7    F   R6
    S+T*F          0 1 6 9 7 10    #           10   #   R3       3) T -> T * F
    S+T*F          0 1 6           #           6    T   R3
    S+T            0 1 6 9         #           9    #   R1       1) S -> S + T
    S+T            0               #           0    S   R1
    S              0 1             #           1    #   A
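Algorithm 9.12 is short enough to run as it stands once the parse table is written down as data. The sketch below encodes the table of Example 9.11 as two dictionaries, one for the terminal columns and one for the non-terminal (go to) columns. The names RULES, ACTION, GOTO and lr_parse are hypothetical, tokens are single characters, and instead of returning true or false the function returns the list of productions used in reductions, or None on an error.

```python
# Production number -> (left-hand side, length of right-hand side).
RULES = {1: ("S", 3), 2: ("S", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}


def _reduce_row(p):
    """A row holding Rp under +, ), # (and * when requested by the caller)."""
    return {"+": f"R{p}", ")": f"R{p}", "#": f"R{p}"}


ACTION = {
    0: {"a": "S5", "(": "S4"},
    1: {"+": "S6", "#": "A"},
    2: {**_reduce_row(2), "*": "S7"},
    3: {**_reduce_row(4), "*": "R4"},
    4: {"a": "S5", "(": "S4"},
    5: {**_reduce_row(6), "*": "R6"},
    6: {"a": "S5", "(": "S4"},
    7: {"a": "S5", "(": "S4"},
    8: {"+": "S6", ")": "S11"},
    9: {**_reduce_row(1), "*": "S7"},
    10: {**_reduce_row(3), "*": "R3"},
    11: {**_reduce_row(5), "*": "R5"},
}
GOTO = {0: {"S": 1, "T": 2, "F": 3}, 4: {"S": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}


def lr_parse(text: str):
    z = [0]                               # stack of states, top at the end
    w = list(text) + ["#"]                # remaining input
    reductions = []
    while True:
        entry = ACTION[z[-1]].get(w[0])   # M[q, t]
        if entry is None:
            return None                   # blank entry: input not in L(G)
        if entry == "A":
            return reductions             # accept
        n = int(entry[1:])
        if entry[0] == "S":               # shift, go to state n
            w.pop(0)
            z.append(n)
        else:                             # reduce by production n = B -> β
            lhs, length = RULES[n]
            del z[len(z) - length:]       # remove |β| states
            z.append(GOTO[z[-1]][lhs])    # new state M[q, B]
            reductions.append(n)


print(lr_parse("a+a*a"))   # prints [6, 4, 2, 6, 4, 6, 3, 1]
print(lr_parse("a+"))      # prints None
```

The list printed for a+a*a is the Action column of Example 9.13 with the shifts removed. Read backwards (1, 3, 6, 4, 6, 2, 4, 6) it is the right derivation of a+a*a given in Example 8.5.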

Definitions 9.14 A grammar is LR(k) if we can construct a deterministic LR(k) parse table for it. A language is LR(k) if it has an LR(k) grammar.


10. Yacc: A Parser Generator


Yacc is a parser generator, accepting a context-free grammar, and producing a program which analyses input to check whether it conforms to the syntax of the grammar. Yacc constructs the LR(1) parse table, and implements the LR(1) parsing algorithm (in fact, it uses LALR(1), a slight restriction of LR(1), rather than full LR(1)). The input must first be converted to a stream of integer tokens, using a function yylex(). The function yylex() can be hand-written, or generated by Lex.

A Yacc script has three sections, separated by lines containing only %%:

    ... definitions ...
    %%
    ... production rules ...
    %%
    ... user-defined functions ...

Definitions section
As in Lex, anything in this section enclosed between %{ and %} will be copied into the output program. Any #include or #define statements or variable or function declarations required for the user-defined functions should be enclosed here. In this section must appear a set of "token" declarations, and there must be a token for each terminal which will appear in the grammar. For example:

    %token VERB_T
    %token NOUN_T

declares two terminals for use in a grammar. A useful convention is to use uppercase ending in "_T" for token names, and to use mixed case, starting with a capital letter, for non-terminals.

Productions section
Instead of writing

    A -> a b c | e f g

we now write

    A : a b c | e f g ;

and liberal use of white space is encouraged to improve readability, and to make it easier to update scripts. For example, the two productions above would be better written as

    A : a b c
      | e f g
      ;

Yacc will take the left-hand symbol of the first rule in this section, and make it the start symbol. Comments can be included in 'C' format. For example:

    /* A can be rewritten to abc or to efg */

Functions section
As in Lex, this section should contain the user-defined main() routine, and any other required functions. The usual functions to include here apart from main() are:

    lexerr() - defining what to do if the lexical analyser finds an undefined token. This requires that the default case in the lexer has a call to this function as its associated action.

    yyerror(char*) - defining what to do if the parser cannot recognise the syntax of part of the input. This function will be called by the parser, which passes a string describing the type of error.

Note that when an error occurs, the line number of the input is held in yylineno, and the last token read when the error is reached is held in yytext.

Running Yacc
The command yacc calls the Yacc program on the specified input. Using the "-d" option forces Yacc to create a file y.tab.h, which contains the #define statements for all the tokens declared in the definitions section. If we need to use the integer values of these tokens in the user defined functions, we can then place #include "y.tab.h" between the %{ and %} lines of the definitions section. Using the "-v" option forces Yacc to create a file y.output, which contains information on the parse table useful for debugging. The output of the yacc command is a file y.tab.c, which contains the 'C' source for the parser. If we have written a Lex script for the lexical analyser, we must also create lex.yy.c as before. To obtain executable code for the complete parser, we then must link the object files, using both the yacc library, "-ly", and the Lex library, "-ll".

Error Messages
Yacc can only accept grammars of a particular sort. Specifically, it cannot handle ambiguous grammars, nor can it handle grammars requiring two or more symbols of lookahead for parsing. The two messages resulting from ambiguous grammars that you will see most often are:

    shift-reduce conflict
or
    reduce-reduce conflict

Example productions giving rise to these messages are:

    Expr : TOKEN
         | Expr + Expr
         ;

for the first case, where TOKEN+TOKEN+TOKEN could be parsed in two ways, and

    Animal : Dog
           | Cat
           ;
    Dog    : FRED_T ;
    Cat    : FRED_T ;

for the second, where FRED_T could be parsed two ways. If these messages appear, then your grammar is not suitable. In most cases, by carefully studying the grammar (using the information in y.output), you can find a different set of productions which Yacc can handle. The two simplest cases are given above. In particular, note that productions of the form E -> E+E are guaranteed to produce conflicts. Occasionally, it may turn out that the language you are trying to define is inherently ambiguous, in which case Yacc is of no use; however, this is very unlikely. If the language is easy to understand, then, generally, it is easy to write a simple, unambiguous grammar for it. Remember that Yacc can handle even large and relatively complex languages like PASCAL and C - in fact, the Berkeley PASCAL and Sun C compilers are written in Yacc.


If Yacc does output the above messages, do not let your grammar go uncorrected. Although a parser will be generated, it will probably not define the language you intend, and will fail in mysterious ways.

Example 10.1 Write a Yacc script to construct a parser for sentences from the natural language grammar below.

    S   -> NP VP
    NP  -> Det NP1 | PN
    NP1 -> Adj NP1 | N
    Det -> a | the
    PN  -> peter | paul | mary
    Adj -> large | grey
    N   -> dog | cat | male | female
    VP  -> V NP
    V   -> is | likes | hates

First, we will accept files consisting of multiple sentences. Each sentence will be delimited by a ".". Therefore, change the first production to read:

    S -> NP VP .

and we also add two new productions describing "documents" in terms of sentences:

    D -> S D | ε

Note that we are only trying to parse sentences, and not understand them - therefore, our lexical analysis only needs to be to the level of the parts of speech (i.e. we only need to recognise nouns and verbs, and not individual words).

The lexical analyser is a modification of the example Lex program given in Section 6. Instead of "printf" statements, we will return tokens. Therefore, in the definitions section, we have a line which includes the token list which will be created by Yacc.

    %{
    /* simple part of speech lexer */
    #include "y.tab.h"
    %}
    L [a-zA-Z]
    %%

In the regular expression section, we need expressions for each part of speech, plus special symbols and unknown input.

    [ \t\n]+          /* ignore whitespace */;
    is|likes|hates    return VERB_T;
    a|the             return DET_T;
    dog |
    cat |
    male |
    female            return NOUN_T;
    peter |
    paul |
    mary              return PROPER_T;
    large |
    grey              return ADJ_T;
    \.                return PERIOD_T;
    {L}+              lexerr();
    .                 lexerr();
    %%

We will use the standard yylex() function created by Lex, and so we don't need user-defined functions.

In the definitions section of the Yacc script, we need to declare the variables we will use in the error functions, as well as all the tokens we expect to be passed by the lexer.

    %{
    /* a Yacc script for a simple natural language grammar */
    #include <stdio.h>
    #include "y.tab.h"
    extern int yyleng;
    extern char yytext[];
    extern int yylineno;
    extern int yyval;
    extern int yyparse();
    %}
    %token DET_T
    %token NOUN_T
    %token PROPER_T
    %token VERB_T
    %token ADJ_T
    %token PERIOD_T
    %%

The grammar rules are straightforward.

    /* a document is a sentence and the rest of the document, or is empty */
    Doc : Sent Doc
        | /* empty */
        ;

    /* a sentence is a noun phrase, verb phrase, and a period */
    Sent : NounPhrase VerbPhrase PERIOD_T
         ;

    /* a noun phrase is a determiner and an undetermined noun phrase, or a proper noun */
    NounPhrase : DET_T NounPhraseUn
               | PROPER_T
               ;

    /* an undetermined noun phrase is an adjective and an undetermined noun phrase, or a noun */
    NounPhraseUn : ADJ_T NounPhraseUn
                 | NOUN_T
                 ;

    /* a verb phrase is a verb and a noun phrase */
    VerbPhrase : VERB_T NounPhrase
               ;
    %%

In the user-defined functions section, we need to handle errors from the lexical analysis and errors from the syntax analysis, as well as defining the output from successful parsing.

    void lexerr()
    {
        printf("Invalid input '%s' at line %i\n",yytext,yylineno);
        exit(1);
    }

    void yyerror(s)
    char *s;
    {
        (void)fprintf(stderr, "%s at line %i, last token: %s\n", s, yylineno, yytext);
    }

    void main()
    {
        if (yyparse() == 0)
            printf("Parse OK\n");
        else
            printf("Parse Failed\n");
    }

To compile the program, we type:

    yacc -d -v parser.y
    cc -c y.tab.c
    lex parser.l
    cc -c lex.yy.c
    cc y.tab.o lex.yy.o -o parser -ly -ll

Suppose we have three different input files, file1, file2 and file3, as follows:

    file1:
        peter is a large grey cat.
        the dog is a female.
        paul is peter.

    file2:
        the cat is mary.
        a dogcat is a male.

    file3:
        peter is male.
        mary is a female.

Typing the following commands gives the following results:

    % parser < file1
    Parse OK
    % parser < file2
    Invalid input 'dogcat' at line 2
    % parser < file3
    syntax error at line 1, last token: male
    %

The second sentence of file2 contains unknown input - the word "dogcat". The first sentence of file3 has a syntax error - we have defined the word "male" to be a noun, and it must be preceded by a determiner.


11. Error Handling


Error detection
It is part of the task of a compiler to assist in the identification, location and correction of errors. Errors can occur at any stage in the process, and it is desirable for each component of the compiler to report (and maybe recover from) the errors corresponding to its operation.

Lexical errors
Very few errors can be detected during lexical analysis, because the analyser has a very local view of the code. The main type of error is when the analysis halts because the input cannot be matched to any of the declared regular expressions - i.e. there is an invalid character or sequence of characters in the program. The easiest way to recover from this type of error (after reporting it) is simply to delete the offending characters from the input, and continue processing. This is not very satisfactory, however, as it is uncontrolled, and may cause confusion during later stages of compilation.

Parsing errors
The LR-parsing method has the advantage that it detects the errors at the earliest possible point in the input. The errors are detected by the parser reaching a blank (or "E") entry in the parse table, indicating that this (state,lookahead) pair can never be reached during the parse of a syntactically correct string. This condition is used to trigger an error recovery procedure which reports the error and then tries to return the parser to a state where it can continue.

Error recovery
Once an error has been detected, the aim is to put the parser in a state such that it can continue processing input with a reasonable hope that subsequent correct input will be parsed, and subsequent errors will be detected. If the parser is not returned to a good state, there will be an avalanche of spurious errors, which are not actually errors in the source program, but were introduced by the changes made to the state of parser. Even if the rest of the input is accepted, there is no guarantee that it doesn't contain errors.

The error handler in the parser should:
    report errors clearly and accurately
    recover from each error quickly enough to detect subsequent errors
    not significantly slow down the compilation.
The design of parser error handling requires finding a balance between these three objectives.

Strategies
panic mode - ignore all input symbols until a designated "synchronising" token is reached - for example, end or ";". Start processing again after this token. This method often skips large parts of code without checking for errors, but it is simple, and it does not enter infinite loops.

phrase level - locally correct the input - that is, replace a prefix of the current input by something that would allow the parser to continue. Commonly, this involves replacing, inserting or deleting delimiters. Care must be taken, however, that the parser does not start to loop - a possibility if it always adds input onto the front rather than replaces input. The method also has a problem if the error actually occurred before the current point on the input stack.

error productions - if certain errors are known to happen frequently, it is possible to include in the grammar what are called error productions. The grammar then caters for these errors, includes likely recovery, and allows specific diagnostics to be output.

global correction - ideally, we would want the compiler to carry out the minimum of changes to the input in order to jump over an error. Given an incorrect input string x, and a grammar G, it is possible to find a parse tree for a related string y, such that the number of changes made to x to get y is minimised. However, this method is very expensive in time and space, and so, generally, is not used in practice.
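The panic mode strategy can be sketched in C as follows. This is only an illustration under stated assumptions: the token codes and the function name skip_to_sync are hypothetical and are not part of Lex or Yacc, and the input is assumed to be an array of tokens ending in an end-of-file marker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes, for illustration only. */
enum token { TOK_ID, TOK_PLUS, TOK_SEMI, TOK_END, TOK_EOF };

/* Panic-mode recovery: starting at position pos (where the error was
   detected), discard tokens until a synchronising token (";" or end) has
   been passed. Returns the position at which processing should start
   again, or the position of TOK_EOF if no synchronising token is found. */
size_t skip_to_sync(const enum token *input, size_t pos)
{
    assert(input != NULL);
    while (input[pos] != TOK_EOF) {
        if (input[pos] == TOK_SEMI || input[pos] == TOK_END)
            return pos + 1;   /* start processing again after this token */
        pos++;                /* ignore this input symbol */
    }
    return pos;               /* ran out of input */
}
```

Because the position only ever moves forward, the loop cannot run forever, which is the property claimed for panic mode above. The price is that everything between the error and the synchronising token goes unchecked.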

Error recovery in LR Parsing
The SLR parser may make a few erroneous reductions before discovering an error, but will never shift an erroneous token from the input onto the stack. We can implement the first two error recovery strategies in the following ways:

panic mode - scan down the stack until we find a state, s, which has a shift command for particular non-terminals (A, say, with shift action Si). We then discard input symbols until we reach one, a, say, which is in follow(A). Normally, we restrict the possibilities for A to be major program components - e.g. statement - and then a might be a semi-colon or an end. We remove the states above the selected one from the stack, and place i on the stack. Basically, we assume that a string derivable from A contains the error. Part of this string has already been processed (the states above s), and part remains on the input (the symbols to be discarded). The parser tries to skip over the error by assuming that A has been parsed successfully, and jumping to a symbol that should follow it.

phrase level - for this mode, we study each error entry in the table (the blanks or "E"s), and decide on the most likely cause. We then implement recovery procedures which assume that cause and take the appropriate action to modify the input.

Example 11.1 phrase-level error recovery in LR(1) parsing
Consider the LR(1) parse table for the grammar G augmented with error procedures:

    State   a    +    *    (    )    #      S    T    F
    0       S5   e1   e1   S4   e2   e1     1    2    3
    1       e3   S6   e4   e3   e2   A
    2       e3   R2   S7   e3   R2   R2
    3       e3   R4   R4   e3   R4   R4
    4       S5   e1   e1   S4   e2   e1     8    2    3
    5       e3   R6   R6   e3   R6   R6
    6       S5   e1   e1   S4   e2   e1          9    3
    7       S5   e1   e1   S4   e3   e1               10
    8       e3   S6   e4   e3   S11  e5
    9       e3   R1   S7   e3   R1   R1
    10      e3   R3   R3   e3   R3   R3
    11      e3   R5   R5   e3   R5   R5

e1: /* called from states 0, 4, 6 or 7, that are expecting the beginning of an operand (either an a or a "("), but instead a "+", "*" or "#" is found */
    put 5 on top of the stack /* assumes a has been found */
    issue message "missing operand"
e2: /* called from states 0, 1, 4, 6 or 7, which find an unexpected ")" */
    remove ")" from input /* simply ignore it */
    issue message "unmatched right parenthesis"
e3: /* called from states 1 or 8 which expect "+", but find an a or a "(" */
    put 6 on to the stack /* assume a "+" has been found */
    issue message "missing '+'"
e4: /* called from states 1 or 8 which expect "+" but find "*" */
    put 6 on top of stack /* assume a "+" has been found */
    remove "*" from input /* assume it was a "+" */
    issue message "'*' instead of '+'"
e5: /* called from state 8 which expects a ")" but finds # */
    put 11 on stack /* assume ")" is found */
    issue message "missing right parenthesis"

Error recovery in Yacc
The easiest way to recover from errors in Yacc is to use error productions. In practice, this corresponds more to the idea of phrase level recovery discussed above. You must decide which non-terminals will have error recovery procedures associated with them, and then add to the grammar productions of the form
    A -> error α
where α is a string of grammar symbols (possibly empty). When Yacc finds an error, it scans down the stack until it finds a state whose items include a rule of the form
    A -> something error α.
The parser then "shifts" a fictitious token, and scans through the input until it finds a substring matching α; once found, it removes everything up to the end of that substring from the input. The parser then reduces to A, and continues. For example, an error production
    Statement -> error ;
would say to Yacc to skip beyond the next semi-colon and assume a statement had been parsed. An appropriate error message can be generated at this point.


12. Syntax-directed Translation


Translation is the process of taking some input and converting it into some other form whose structure and content is dependent on the structure and content of the input. We will do this for programming languages by associating actions with the productions of the grammar defining the programming language.

Example 12.1 translating from infix expressions to postfix expressions
The following actions convert expressions from the grammar G to postfix notation:

    number   rule             action
    1        S -> S + T       print("+")
    2        S -> T
    3        T -> T * F       print("*")
    4        T -> F
    5        F -> ( S )
    6        F -> a           print(a)

Parse a + a*a + a:

         a + a * a + a
    <=6  F + a * a + a
    <=   T + a * a + a
    <=   S + a * a + a
    <=6  S + F * a + a
    <=   S + T * a + a
    <=6  S + T * F + a
    <=3  S + T + a
    <=1  S + a
    <=6  S + F
    <=   S + T
    <=1  S

Printing output in the order in which the reductions were applied (6, 6, 6, 3, 1, 6, 1) gives aaa*+a+, which is the corresponding postfix expression.
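The effect of the actions in Example 12.1 can be checked with a small C sketch that replays a sequence of reductions and collects what the actions would print. The function names rule_action and translate are assumptions made for this illustration; a real parser would call the action at the moment it performs each reduction instead of working from a prepared list.

```c
#include <assert.h>
#include <string.h>

/* Append the output of the action attached to the given rule (numbered
   as in Example 12.1) to out. Rules 2, 4 and 5 have no action. */
static void rule_action(int rule, char *out)
{
    switch (rule) {
    case 1: strcat(out, "+"); break;   /* S -> S + T   print("+") */
    case 3: strcat(out, "*"); break;   /* T -> T * F   print("*") */
    case 6: strcat(out, "a"); break;   /* F -> a       print(a)   */
    default: break;
    }
}

/* Replay n reductions in the order the parser applied them. out must
   have room for n characters plus the terminating '\0'. */
void translate(const int *reductions, int n, char *out)
{
    assert(reductions != NULL && out != NULL);
    out[0] = '\0';
    for (int i = 0; i < n; i++)
        rule_action(reductions[i], out);
}
```

Replaying the full reduction sequence of the parse above, 6 4 2 6 4 6 3 1 6 4 1, leaves the string aaa*+a+ in out.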

The Value Stack
A more general scheme is to associate values with each symbol on the parsing stack. On the stack, therefore, we have pairs of <symbol, value>, so we can think of this as two separate stacks, the symbol stack and the value stack. We can then associate with each reduction some action to be carried out on the value stack. The end result of a parse is then a report on whether the input had the correct syntax, and a value derived from the input's structure.

Suppose we are about to apply the reduction A -> x1x2...xn. The parsing stack then has the symbols x1, x2, ... xn on the right. The values corresponding to these symbols we will call $1, $2, ... $n. On performing the reduction, we remove the n symbols from the symbol stack (and eventually replace by A): therefore, we will remove the top n symbols from the value stack, and replace by some new value defined by the rule augmentation. Call this new value $$. The most general form of this action is then a function, such that $$ = f($1, $2, ..., $n). In practice, this function might be an actual function, or a sequence of lower level actions which take the $i values as parameters. We don't need to use all of the $i.

Putting values on the stack
There are basically two cases: putting on a non-terminal, and putting on a terminal. A non-terminal only goes on during a reduction (or the shift immediately following a reduction). This corresponds to the evaluation of the function defined above. The values of the terminal symbols, on the other hand, generally come from the lexical analysis.

Example 12.2 computing the values of expressions
Assume a lexical analyser returns the value of an integer along with the ID_T token.

    1) S -> S + T       $$ := $1 + $3
    2) S -> T           $$ := $1
    3) T -> T * F       $$ := $1 * $3
    4) T -> F           $$ := $1
    5) F -> ( S )       $$ := $2
    6) F -> a           $$ := $1

Parsing 1 + 2 * 3 is then as follows:

    Symbol    Values    Stack            Input      Action
                        0                1+2*3#     S5
    a         1         0 5              +2*3#      R6
    F         1         0 3              +2*3#      R4
    T         1         0 2              +2*3#      R2
    S         1         0 1              +2*3#      S6
    S+        1         0 1 6            2*3#       S5
    S+a       1 2       0 1 6 5          *3#        R6
    S+F       1 2       0 1 6 3          *3#        R4
    S+T       1 2       0 1 6 9          *3#        S7
    S+T*      1 2       0 1 6 9 7        3#         S5
    S+T*a     1 2 3     0 1 6 9 7 5      #          R6
    S+T*F     1 2 3     0 1 6 9 7 10     #          R3
    S+T       1 6       0 1 6 9          #          R1
    S         7         0 1              #          A
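The value-stack half of Example 12.2 can be sketched in C as below. The names shift, reduce and top_value are assumptions made for the illustration, and the sketch keeps a value slot for every symbol (tokens such as + and * are given the value 0), which is a simplification of the trace above, where only operand values are shown.

```c
#include <assert.h>

#define MAX_DEPTH 32

static int vstack[MAX_DEPTH];   /* the value stack */
static int vtop = 0;            /* number of values currently held */

/* A shift pushes the value supplied by the lexical analyser. */
void shift(int value)
{
    assert(vtop < MAX_DEPTH);
    vstack[vtop++] = value;
}

/* A reduction by the given rule (numbered as in Example 12.2) removes
   the values $1..$n and replaces them by $$. */
void reduce(int rule)
{
    static const int rhs_len[] = { 0, 3, 1, 3, 1, 3, 1 };
    assert(rule >= 1 && rule <= 6);
    int n = rhs_len[rule];
    assert(vtop >= n);
    int base = vtop - n;                 /* vstack[base] holds $1 */
    int result;
    switch (rule) {
    case 1:  result = vstack[base] + vstack[base + 2]; break; /* $$ := $1 + $3 */
    case 3:  result = vstack[base] * vstack[base + 2]; break; /* $$ := $1 * $3 */
    case 5:  result = vstack[base + 1];                break; /* $$ := $2 */
    default: result = vstack[base];                    break; /* $$ := $1 */
    }
    vtop = base;
    vstack[vtop++] = result;
}

/* The value on top of the stack; after acceptance, the value of the input. */
int top_value(void)
{
    assert(vtop > 0);
    return vstack[vtop - 1];
}
```

Replaying the Action column for 1 + 2 * 3 (S5 R6 R4 R2 S6 S5 R6 R4 S7 S5 R6 R3 R1), with each shift pushing the token's value, leaves 7 on top of the stack.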

The Value Stack in Lex and Yacc

Lex
Yacc assumes values are passed to it in the global variable yylval. Lex places the lookahead in yytext; it also must assign values to yylval. There are a number of possibilities:

1. The lookahead is a digit string. The internal value must be computed and placed in yylval.
2. The lookahead is a character string. It must be copied from yytext to a safe place, usually either:
    (i) a much larger string array, and the value placed in yylval is the position in which it starts in that larger array, or
    (ii) a dynamically allocated character string, and the value placed in yylval is the pointer to that string.
3. The lookahead is a string representing a real number. It should be converted to a floating point, and stored in a real array. Again, its position in the array will be passed to yylval.
4. The lookahead is an identifier or a keyword. User-defined identifiers must be stored as for strings (but only one copy should be kept).
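Cases 1 and 2(ii) above might look like this in C. The helper names digit_string_value and save_string are hypothetical; in a Lex action they would be called with yytext as the argument and their result assigned to the appropriate field of yylval.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Case 1: compute the internal value of a digit string such as "123". */
int digit_string_value(const char *text)
{
    assert(text != NULL);
    return (int)strtol(text, NULL, 10);
}

/* Case 2(ii): copy the matched text into a dynamically allocated string.
   The copy is needed because the contents of yytext are replaced by the
   next match. Returns NULL if memory cannot be allocated. */
char *save_string(const char *text)
{
    assert(text != NULL);
    char *copy = malloc(strlen(text) + 1);
    if (copy != NULL)
        strcpy(copy, text);
    return copy;
}
```

The pointer returned by save_string stays valid until it is passed to free, so it is safe to keep it on the value stack or in a table.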

Yacc
Yacc allows us to place an action after any production. This action will be performed at the moment the reduction is performed (which is before the values are removed from the stack). The action is a C statement within {...}. Values should be represented by the $i notation described above. When the statement is reached by Yacc, it will translate the $i's into their appropriate values or array positions.


Example 12.3 using Yacc's value stack
S will be represented by "Expr", T by "Term" and F by "Factor".

    %%
    Finish : Expr                   { printf("%d", $1); }
           ;
    Expr   : Expr PLUS_T Term       { $$ = $1 + $3; }
           | Term
           ;
    Term   : Term MUL_T Factor      { $$ = $1 * $3; }
           | Factor
           ;
    Factor : OB_T Expr CB_T         { $$ = $2; }
           | INT_T
           ;
    %%

With each symbol in the grammar, we associate a set of attributes. An attribute can represent any form of information we require, including data type, number, pointer or string. The semantic rules we associate with each production determine how the values of the attributes are computed.

Definition 12.4 Attribute Grammar
In an attribute grammar, each grammar production A -> α has associated with it a set of semantic rules of the form b := f(c1, c2, ..., cn), where f is a function, c1, c2, ..., cn are attributes of any of the grammar symbols in the production. If b is an attribute of A, and the ci are attributes of symbols in α, then b is a synthesised attribute. If b is an attribute of one of the symbols in α, then b is an inherited attribute.

Example 12.5 computing the value of expressions

        Production          Semantic Rules
    1)  S -> E              print(E.val)
    2)  E1 -> E2 + T        E1.val := E2.val + T.val
    3)  E -> T              E.val := T.val
    4)  T1 -> T2 * F        T1.val := T2.val * F.val
    5)  T -> F              T.val := F.val
    6)  F -> ( E )          F.val := E.val
    7)  F -> digit          F.val := digit.lexval

The subscripts on symbols are simply to distinguish which symbol in the semantic rule refers to which symbol in the syntax. The symbol digit is a terminal (or token), and it is assumed to have a single attribute, returned by the lexical analyser. In this case, it will be the value of the particular number token.

Definition 12.6
A syntax-directed definition which uses only synthesised attributes is called an S-attributed definition.

We can augment parse trees for attribute grammars with the attribute values at each node: for S-attributed definitions, we can evaluate all the attribute values by starting at the leaf nodes and applying the semantic rules from the bottom to the top.

Example 12.7 Annotated parse tree for the expression 6+2*3

    S (val = 12)
        E (val = 12)
            E (val = 6)
                T (val = 6)
                    F (val = 6)
                        digit (lexval = 6)
            +
            T (val = 6)
                T (val = 2)
                    F (val = 2)
                        digit (lexval = 2)
                *
                F (val = 3)
                    digit (lexval = 3)

An inherited attribute is one whose value is determined by the values of the attributes of its parent or siblings. They are useful for describing the way in which the meaning of a symbol depends upon the context in which it appears. For example, we can use an inherited attribute to keep track of which side of an assignment statement an identifier appears on, so that we know whether to use its address or value during processing.

Example 12.8 using inherited attributes
The following grammar defines a language of integer or real variable declarations. The semantic rules determine how the symbol table is to be updated, by passing the values of the inherited attributes down from the attribute of the T symbol (which is synthesised from rules 2 and 3).

        Production          Semantic Rules
    1)  D -> T L            L.t := T.t
    2)  T -> int            T.t := integer
    3)  T -> real           T.t := real
    4)  L1 -> L2 , id       L2.t := L1.t, addtype(id, L1.t)
    5)  L -> id             addtype(id, L.t)

The augmented parse tree is shown below on the left, and the flow of information between the different attributes is shown on the right:

    D
        T (t = real)
            real
        L (t = real)
            L (t = real)
                L (t = real)
                    id1
                ,
                id2
            ,
            id3

[Figure: the same tree drawn a second time, with arrows showing the flow of the t attribute between the nodes.]

Dependency graphs
If the value of an attribute b depends on the value of attribute c, then the semantic rule for b must be evaluated after that for c. The interdependencies between the attributes can be drawn as a dependency graph (as above). A topological sort of a graph is an ordering of the attributes of the graph such that all edges in the graph go from the attributes earlier in the ordering to attributes later. A topological sort gives a valid order in which to evaluate the semantic rules.
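A topological sort of a small dependency graph can be sketched in C as follows. The adjacency-matrix representation, the limit MAX_ATTRS and the function name topological_sort are assumptions made for this illustration; the method used is the usual one of repeatedly picking an attribute that has no unevaluated dependencies.

```c
#include <assert.h>

#define MAX_ATTRS 16

/* edge[c][b] != 0 means attribute b depends on attribute c, so c must be
   evaluated first. Writes a valid evaluation order into order[] and
   returns the number of attributes ordered; a result smaller than n
   means the graph contains a cycle. */
int topological_sort(int n, int edge[MAX_ATTRS][MAX_ATTRS], int order[])
{
    int indegree[MAX_ATTRS] = { 0 };   /* unevaluated dependencies of each attribute */
    int done[MAX_ATTRS] = { 0 };
    int count = 0;

    assert(n >= 0 && n <= MAX_ATTRS);
    for (int c = 0; c < n; c++)
        for (int b = 0; b < n; b++)
            if (edge[c][b])
                indegree[b]++;

    for (;;) {
        int next = -1;
        for (int a = 0; a < n; a++)
            if (!done[a] && indegree[a] == 0) { next = a; break; }
        if (next < 0)
            break;                     /* nothing left that can be evaluated */
        done[next] = 1;
        order[count++] = next;
        for (int b = 0; b < n; b++)    /* next is now available to its dependants */
            if (edge[next][b])
                indegree[b]--;
    }
    return count;
}
```

For the declaration example, numbering T.t as 1 and the three L.t attributes from the top of the tree downwards as 2, 3 and 0, the edges 1 to 2, 2 to 3 and 3 to 0 give the order 1, 2, 3, 0.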

There are a number of different methods for evaluating semantic rules.
1. Parse-tree based. At compile time, construct a parse tree, then a dependency graph, and then obtain a topological sort. Use the sort to determine the order in which to process the rules. This method works for all dependency graphs with no cycles.
2. Rule based. When the compiler is constructed, analyse the rules for dependencies. The order in which rules are to be evaluated is then fixed before compilation starts.
3. Oblivious. The compiler simply selects an evaluation order without analysing the dependencies. This obviously limits the class of attribute grammars that can be implemented.
Methods 2 and 3 are more efficient, in that no compile-time analysis is required.

Abstract Syntax Trees
A useful form of intermediate representation of a program is a syntax tree. Using syntax trees allows the translation process to be separated from the parsing process. This is particularly useful for two reasons:
1. A grammar that is suitable for parsing might not explicitly represent the hierarchical nature of the programs it describes.
2. The parsing method constrains the order in which the nodes are considered. This may not be the best order for translation.
A syntax tree is a condensed parse tree, where the operators and keywords do not appear as leaves, but are associated with the interior nodes that would have been their parent node in the parse tree. Also, chains of single productions can be collapsed into a single branch.

Example 12.9 abstract syntax trees
The derivation step S => if B then S1 else S2 would have the syntax tree:

    if-then-else
        B
        S1
        S2

The parse tree below has the syntax tree on the right:

    Parse tree (left):
    E
        E
            T
                F
                    6
        +
        T
            T
                F
                    2
            *
            F
                3

    Syntax tree (right):
    +
        6
        *
            2
            3

Example 12.10 creating abstract syntax trees
We will use the following functions, which return pointers to the newly created nodes:
    mknode(op, left, right): creates an internal node for the operator "op", with two fields containing pointers to the left and right operands;
    mkleaf_id(id, string): creates a leaf node for the identifier "id", and a field containing a pointer to a string for that identifier;
    mkleaf_num(num, val): creates a leaf node, labelled "num", with a field containing the value of the number.
The grammar and semantic rules are given below. Each non-terminal in the grammar has an attribute ptr, which keeps track of the pointers returned by the functions:

        Production          Semantic Rules
    1)  E1 -> E2 + T        E1.ptr := mknode('+', E2.ptr, T.ptr)
    2)  E -> T              E.ptr := T.ptr
    3)  T1 -> T2 * F        T1.ptr := mknode('*', T2.ptr, F.ptr)
    4)  T -> F              T.ptr := F.ptr
    5)  F -> ( E )          F.ptr := E.ptr
    6)  F -> id             F.ptr := mkleaf_id(id, id.string)
    7)  F -> num            F.ptr := mkleaf_num(num, num.val)

The parse tree for 6+2*x is shown below, with the constructed syntax tree on the right.

[Figure: parse tree for 6+2*x, in which the ptr attribute of each E, T and F node points to a node of the syntax tree.]

    Syntax tree:
    +
        num 6
        *
            num 2
            id (string for x)
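The three node-building functions of Example 12.10 could be written in C roughly as follows. This is a sketch under assumptions: the struct layout and the enum of node kinds are invented for the illustration, and the label arguments "id" and "num" of the original signatures are dropped because the node kind already records them.

```c
#include <assert.h>
#include <stdlib.h>

enum kind { OP, ID, NUM };          /* node labels; an assumed encoding */

struct node {
    enum kind kind;
    char op;                        /* the operator, for OP nodes */
    struct node *left, *right;      /* the operands, for OP nodes */
    const char *string;             /* the identifier's string, for ID nodes */
    int val;                        /* the number's value, for NUM nodes */
};

/* Allocate a zeroed node of the given kind. */
static struct node *new_node(enum kind kind)
{
    struct node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    return n;
}

/* Internal node for operator op with the given operands. */
struct node *mknode(char op, struct node *left, struct node *right)
{
    struct node *n = new_node(OP);
    n->op = op;
    n->left = left;
    n->right = right;
    return n;
}

/* Leaf for an identifier; the string is referenced, not copied. */
struct node *mkleaf_id(const char *string)
{
    struct node *n = new_node(ID);
    n->string = string;
    return n;
}

/* Leaf for a number with the given value. */
struct node *mkleaf_num(int val)
{
    struct node *n = new_node(NUM);
    n->val = val;
    return n;
}
```

Applying the semantic rules bottom-up for 6+2*x amounts to the single expression mknode('+', mkleaf_num(6), mknode('*', mkleaf_num(2), mkleaf_id("x"))).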

Example 12.11
The following grammar specifies compound statements:

    CStat -> Stat ; CStat | Stat
    Stat -> s

The string s ; s ; s ; s has the parse tree below on the left, and one possible syntax tree on the right.

    Parse tree (left):
    CStat
        Stat
            s
        ;
        CStat
            Stat
                s
            ;
            CStat
                Stat
                    s
                ;
                CStat
                    Stat
                        s

    Syntax tree (right):
    ;
        s
        ;
            s
            ;
                s
                s

The semi-colon serves only to bind the statements into a sequence. A more natural tree is shown below on the left. However, this requires each node to have arbitrarily many children. A better tree is shown on the right, where statements are joined as siblings. This requires only one extra field in our syntax tree nodes.

    More natural tree (left):
    seq
        s
        s
        s
        s

    Sibling-linked tree (right):
    seq
        s -> s -> s -> s

A note on implementing attribute grammars and Yacc and Lex
The value stack in Yacc maintains a single value for each symbol on the symbol stack. In the syntax-directed definitions in these notes, however, we use multiple attributes of different types for the symbols. We can implement this in Yacc as follows.

Symbol Types
Internally, Yacc declares each value as a C union. List all types that will be required in a %union declaration in the definitions section of the Yacc script - e.g.

    %union {
        int intval;
        char *strptr;
        struct table *tblptr;
    }

which declares three symbol types - an integer value, a string pointer, and a pointer to some table structure (which would have to be declared elsewhere). Each token must be declared to use one of the types from the union, using the %token declaration - e.g.

    %token <intval> INT_T
    %token <tblptr> ID_T

Non-terminals must also be declared, using the %type declaration - e.g.

    %type <intval> Expr Term Factor

Referring to a value using the $$, $1,... notation causes Yacc to use the appropriate field of the union.


Multiple Attributes
Multiple attributes can be implemented using the symbol table (see section 13), which is defined to have a number of attribute places for each entry. Instead of referring explicitly to values on the value stack, we would then refer to the symbol table entry, and extract the appropriate attribute value as required.

Inherited Attributes
The Yacc value stack is designed for synthesised attributes - that is, when a rule is used as a reduction, the values of all symbols on the right hand side are known. In some cases, however, we would like to use inherited attributes to assign values to symbols on the right hand side. Yacc does allow us to do this, by accessing symbols on the internal stack to the left of the current rule, using the notation $0, $-1, $-2, ... . Thus we might have the rules:

    Decl   : Type Idlist
           ;
    Type   : KEY_REAL_T         {$$ = 1;}
           | KEY_INT_T          {$$ = 2;}
           ;
    Idlist : Idlist ID_T        {action($0, $2);}
           | ID_T               {action($0, $1);}
           ;

where action(...) is some function which assigns type information. The symbol Type, which contains that information, always occurs one place to the left of the Idlist nonterminal, and this symbol's value is referred to by the $0 notation. If we wanted to refer to the symbol two places to the left, we would use $-1, etc. Note that these rules are different from the ones given in lectures (p48) - here the values are not passed in the Decl rules, but in the Idlist rule.

This use of the value stack is not recommended. It is very hard to keep track of positions in more complicated grammars - to use this notation, every time we use Idlist, we must be confident that the symbol one place to the left in the internal stack has the appropriate value.

The preferred method of dealing with inherited values is to create a list of pointers to the attribute places and maintain this list as an attribute of Idlist. The Decl rule would then use this list to assign the correct attribute values to the various identifiers.
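The preferred method can be sketched in C as below. Everything here is an assumption made for illustration: the struct names, the fixed limit MAX_IDS and the two functions, which stand for the bodies of the actions that would be attached to the Idlist and Decl rules.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDS 32

/* A symbol table entry, reduced to the one attribute of interest. */
struct entry { const char *name; int type; };

/* The attribute of Idlist: pointers to the attribute places of the
   identifiers seen so far. */
struct idlist { struct entry *item[MAX_IDS]; int count; };

/* Action for both Idlist rules: remember the identifier's entry. */
void idlist_add(struct idlist *list, struct entry *e)
{
    assert(list != NULL && e != NULL);
    assert(list->count < MAX_IDS);
    list->item[list->count++] = e;
}

/* Action for Decl : Type Idlist. Both values are on the right-hand side
   of this rule, so only synthesised attributes ($1 and $2) are needed. */
void decl_assign(int type, const struct idlist *list)
{
    assert(list != NULL);
    for (int i = 0; i < list->count; i++)
        list->item[i]->type = type;
}
```

With this arrangement no action has to look to the left of its own rule on the internal stack, so Idlist can be used in any context.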


13. Symbol Table


The symbol table stores information about various source language constructs. Information is built up during the analysis stages of compiling, and is used in succeeding stages. Finally, the code generation phase uses the information in the table to generate the target code. The symbol table is central to the work of the compiler. In practice, efficient methods of manipulating and storing the table must be used. In this course, though, we will not consider efficiency - we will use a linked list and some operations for manipulating the information.

In some compilers, the symbol table is used extensively during lexical analysis and parsing, to represent information and resolve ambiguities. In other cases, lexical analysis and parsing simply construct a complete abstract syntax tree. This tree is then analysed to produce the symbol table.

There are three main functions we need to implement:
    lookup(s):    determines whether a particular string has already been stored
                  returns the index of the table entry, or 0 (or -1 in some systems) if it has not been stored
    insert(s,t):  inserts a new string (of token t) into the table
                  returns the index of the new entry
    delete(s):    deletes an entry from the table (or, typically, hides it)

Example 13.1 A Simple Symbol Table Implementation
An initial node will point to the first and last entries, and store the length of the table. A separate array will store all the string identifiers. Each node will be of the form:

    index    token    atts    next    strPtr
    7        ID_T     ...     ...     ...

The table will be of the form:

[Figure: an initial node Table with fields first, length (78) and last, followed by a linked list of nodes with index 1, 2, ..., 78 and token ID_T; the strPtr field of each node points into a separate character array holding the identifiers, each terminated by "#", for example i# and name#.]
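A linked-list table with the three operations might be sketched in C as follows. The sketch makes some simplifying assumptions that differ from Example 13.1: each node owns its own copy of the string instead of pointing into a shared array, the atts field is left out, deletion is done by hiding, and the third function is called delete_entry here. lookup returns -1 for a string that has not been stored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct node {
    int index;            /* position in the table, starting at 1 */
    int token;            /* token code, e.g. the value of ID_T */
    char *strPtr;         /* the stored string */
    int hidden;           /* non-zero once the entry has been hidden */
    struct node *next;
};

struct table { struct node *first, *last; int length; };

/* Returns the index of the entry for s, or -1 if it has not been stored. */
int lookup(const struct table *t, const char *s)
{
    for (const struct node *n = t->first; n != NULL; n = n->next)
        if (!n->hidden && strcmp(n->strPtr, s) == 0)
            return n->index;
    return -1;
}

/* Appends a new entry for string s of token tok; returns its index. */
int insert(struct table *t, const char *s, int tok)
{
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->strPtr = malloc(strlen(s) + 1);
    assert(n->strPtr != NULL);
    strcpy(n->strPtr, s);
    n->index = ++t->length;
    n->token = tok;
    n->hidden = 0;
    n->next = NULL;
    if (t->last != NULL)
        t->last->next = n;
    else
        t->first = n;
    t->last = n;
    return n->index;
}

/* Hides the first visible entry for s, if there is one. */
void delete_entry(struct table *t, const char *s)
{
    for (struct node *n = t->first; n != NULL; n = n->next)
        if (!n->hidden && strcmp(n->strPtr, s) == 0) {
            n->hidden = 1;
            return;
        }
}
```

A table starts out as struct table t = { NULL, NULL, 0 }; the first insert then returns index 1. Every lookup walks the whole list, which is acceptable here only because efficiency is not being considered.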


Declarations
There are four basic kinds of declaration that may require entries in the Symbol Table:
    Constant:  e.g. const int MAX = 10000;
    Type:      e.g. struct Entry { int index; char *strPtr; };
    Variable:  e.g. int count, marks[100];
    Function:  e.g. int gcd(int n, int m) { if (m == 0) return n; else return gcd(m, n % m); }

Constant and variable declarations can be stored in the table in the style shown above. Type declarations may require more work, while function declarations are normally indexed by their name, with the code being treated separately. In some compilers, separate symbol tables are used for each different kind of declaration; in others, each separate region of the program (e.g. functions) may be given a separate table. The attributes stored with each entry will depend on the kind of declaration: constant declarations will typically have value bindings; type, variable and function declarations will have type signatures. Variables will have pointers to allocated memory for storing values. Functions may have pointers to code representations. All kinds may have scope attributes, defining when memory should be allocated, and when it should be accessible.

Example 13.2 Single scope
In languages with very restricted scoping rules (and in other situations) it is possible to construct the symbol table during lexical analysis. Augment the lexical analysis rule for recognising identifiers as follows:

    {L}+    { entry = lookup(yytext);
              if (entry == -1) yylval.entry = insert(yytext, ID_T); }

This will insert an entry the first time an identifier is encountered. Ensuring that, for example, it is properly declared will be a function of the semantic analysis phase.


Scoping Rules
Many languages allow programs to be constructed from blocks. In C, blocks are files, function declarations and compound statements (between "{" and "}"). Also, structures and unions can be considered to be blocks. The use of blocks complicates the symbol table, as the same identifier can refer to different data objects depending on the position it occurs in the code, and the scoping rules. In this situation, it is not sufficient to use lookup during lexical analysis.


Example 13.3 Nested scope

    int i;
    int f1(int k)
    {
        int j;
        ...
        print i;
    }
    int f2()
    {
        int j;
        ...
    }

In the above C program fragment, i is a global (integer) variable, normally accessible throughout the code. When f1 is entered, a new (integer) entry is required for k. Immediately, a new (integer) entry for j is also required. Inside the function, i -- in print(i) -- refers to the global variable. Once f1 is exited, the entries for j and k are deleted. Once f2 is entered, a new (integer) entry for j is required - note that this is a different variable from the one inside f1.

To implement nested scopes, the lookup function must find the most recently inserted declaration, the insert function must not overwrite previous declarations of the same name, but should hide them, while the delete function should only delete (or hide) the most recent declaration and uncover the previous one. The symbol table should thus behave as a stack.
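The stack behaviour described above can be sketched in C as follows. The function names declare, find, enter_block and exit_block, the fixed array sizes and the use of a mark per open block are all assumptions made for this illustration; declare and find play the roles of insert and lookup.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 64
#define MAX_BLOCKS 16

static const char *names[MAX_ENTRIES];  /* declarations, oldest first */
static int count = 0;                   /* number of visible declarations */
static int marks[MAX_BLOCKS];           /* value of count at each block entry */
static int depth = 0;                   /* number of open blocks */

/* Entering a block records where its declarations will start. */
void enter_block(void)
{
    assert(depth < MAX_BLOCKS);
    marks[depth++] = count;
}

/* Leaving a block deletes its declarations, uncovering any they hid. */
void exit_block(void)
{
    assert(depth > 0);
    count = marks[--depth];
}

/* A new declaration goes on top; earlier ones of the same name are
   hidden, not overwritten. Returns the index of the new entry. */
int declare(const char *name)
{
    assert(count < MAX_ENTRIES);
    names[count] = name;
    return count++;
}

/* Searches from the top, so the most recent declaration is found.
   Returns -1 if the name is not visible. */
int find(const char *name)
{
    for (int i = count - 1; i >= 0; i--)
        if (strcmp(names[i], name) == 0)
            return i;
    return -1;
}
```

For the fragment of Example 13.3, the j declared inside f1 is no longer found once exit_block has been called, and the j declared inside f2 is a separate entry.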

Example 13.4 Nesting Level
One possible way of obeying scope rules while constructing the symbol table during the first pass of the compiler is to use explicit nesting level and scope variables. Use an explicit stack where the top entry represents the current nesting level and scoping identifier. We also need last, the index of the last entry in the table. Initially, the top of the stack is set to (0,0), and last is set to 0. Consider the following grammar fragment for recognising programs similar to 13.3:

    Prog  -> Dec Prog
    Prog  -> Main
    Dec   -> VDec ;
    Dec   -> FDec
    VDec  -> int id
    FDec  -> SFDec Par ) { CStat }      { decrement(stack); }
    SFDec -> int id (                   { increment(stack); }
    Par   ->
    Par   -> VDec
    Par   -> PList , VDec
    PList -> VDec
    PList -> VDec , PList

The lexical analysis action then becomes

    {L}+    { entry = lookup(yytext,stack);
              if (entry == -1) insert(yytext,ID_T, stack); }

insert now places an entry at the end of the table, and associates the pair of values at top of the stack as nesting level and scope attributes. lookup now searches the symbol table for a matching string. When it finds a match, it checks the nesting level, and then moves down the stack until it finds the entry with the same nesting level. If the index of the match is less than the corresponding scope value, it ignores it and continues with the search. If no appropriate match is found, return -1. decrement simply deletes the top element of the stack. increment adds a new element to the top of the stack, incrementing the nesting level, and assigning the last index as the scope value. A parse tree and associated symbol table for 13.3 are shown below.
[Figure: parse tree for the program of 13.3, with four numbered points marked on it: 1 at SFDec and 2 at FDec for f1, 3 at SFDec and 4 at FDec for f2.]

    Index    Str    Nest    Scope    Atts
    0        i      0       0        ...
    1        f1     0       0        ...
    2        k      1       1        ...
    3        j      1       1        ...
    4        f2     0       0        ...
    5        j      1       4        ...

The changes in the stack are as follows (top on the right):

    Event    Last    Stack (Nest,Scope)
             0       (0,0)
    1        1       (0,0), (1,1)
    2        3       (0,0)
    3        4       (0,0), (1,4)
    4        5       (0,0)


Instead of attempting to complete all compilation in a single pass, it is often easier to make a number of passes through the program. Using the techniques of section 12, an abstract syntax tree can be constructed during the parsing phase. This tree can then be processed to build the symbol table and to support the later phases of the compilation. Although this may be slower, it can result in more natural grammars, and simpler translation and analysis routines.

Example 13.5
A possible abstract syntax tree for the program of 13.3 is shown below. From this, it should be easy to see the nesting levels and scope of the different declarations.

    Prog
        VDec
            int
            id (i)
        func
            int
            id (f1)
            VDec
                int
                id (k)
            VDec
                int
                id (j)
            print
                id (i)
        func
            int
            id (f2)
            VDec
                int
                id (j)


14. Type Checking

The final part of the analysis phases of compilation that we will consider is type checking, where the compiler checks that operators, functions and procedures are not applied to objects of incompatible datatypes.

Definition 14.1 A type checker verifies that the type of a construct matches that expected by its context.

Example 14.2 type signature

The PASCAL arithmetic operator "mod" requires two integer operands, and returns an integer. We can describe this by a signature:

_mod_ : integer × integer -> integer

The underscores on either side of the mod operator indicate that it is an infix operator (that is, it is placed between its two operands). After the colon is the signature, which here indicates that the operator takes two integers, and returns an integer.

Type information will be required when intermediate code is generated. Operators like "+" can be used in a number of different ways, and the particular way depends on the context. Four different signatures can be given for "+":

_+_ : integer × integer -> integer
_+_ : integer × real -> real
_+_ : real × integer -> real
_+_ : real × real -> real

In the second and third cases, some form of type translation will be required, in order to allow the integers and reals to be added together. Once the types have been determined, the intermediate code generator can put in the required conversion operations. The "+" operator is an example of an overloaded operator - that is, an operator which represents different operations in different contexts.
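
One way to picture these signatures is as a lookup table from operand types to result type. The Python sketch below is illustrative only: the names SIGNATURES and result_type are hypothetical, and returning the string "type_error" for an unlisted combination is an assumption made here.

```python
# Each operator maps a pair of operand types to its result type.
SIGNATURES = {
    "mod": {("integer", "integer"): "integer"},
    "+": {
        ("integer", "integer"): "integer",
        ("integer", "real"): "real",
        ("real", "integer"): "real",
        ("real", "real"): "real",
    },
}


def result_type(op, left, right):
    """Select the signature of an (overloaded) operator from its operand types."""
    return SIGNATURES.get(op, {}).get((left, right), "type_error")


mixed_sum = result_type("+", "integer", "real")
bad_mod = result_type("mod", "real", "integer")
```

The overloading of "+" shows up as four entries under one operator, whereas "mod" has a single entry.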

Type Expressions

In Pascal and C, types are either basic or constructed. Basic types have no internal structure as far as the programmer is concerned - for example, boolean, character and integer in Pascal. Constructed types are built from basic types and other constructed types, such as arrays, records and sets in Pascal. Each language construct has a type associated with it implicitly; this will be denoted by a type expression.

Definition 14.3 A type expression is either a basic type, or is formed by applying an operator called a type constructor to other type expressions.

1. A basic type is a type expression (e.g. boolean, char, integer). A special basic type called type_error will indicate an error found during type checking. A basic type, void, indicates the absence of a value, allowing constructs with no type to be checked.

2. A type name is a type expression.

3. A type constructor applied to type expressions is a type expression. Constructors include:

(a) arrays - if T is a type expression, then array(I,T) is a type expression denoting the type of an array with elements of type T and index set I. For example, the Pascal declaration
    var A : array[1..10] of integer;
associates the type expression array(1..10,integer) with A.

(b) products - if T1 and T2 are type expressions, then so is T1 × T2.

(c) records - a record is a product with names for its fields. The record type constructor will apply to a tuple formed from field names and types. E.g.:
    type row = record
        address : integer;
        lexeme : array[1..15] of char;
    end;
declares the type name row representing the type expression:
    record((address × integer) × (lexeme × array(1..15,char)))

(d) functions - mathematically a function maps elements of one set to another set. We will treat functions in programming languages as mapping from a type D to a type R (from domain to range). The type will be denoted D -> R. E.g.:
    function f(a, b : char) : integer;
has the type expression char × char -> integer.
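
As a rough illustration of this definition, type expressions can be modelled as small immutable objects, one class per constructor. Everything named below (Basic, Array, Product, Record, Function, show) is a hypothetical choice for this sketch; it is not part of the course material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basic:
    name: str          # e.g. "integer", "char", "void", "type_error"


@dataclass(frozen=True)
class Array:
    low: int           # index set low..high
    high: int
    elem: object       # element type expression


@dataclass(frozen=True)
class Product:
    left: object
    right: object


@dataclass(frozen=True)
class Record:
    fields: tuple      # ((field name, type expression), ...)


@dataclass(frozen=True)
class Function:
    domain: object
    result: object


def show(t):
    """Print a type expression in the notation of Definition 14.3."""
    if isinstance(t, Basic):
        return t.name
    if isinstance(t, Array):
        return f"array({t.low}..{t.high},{show(t.elem)})"
    if isinstance(t, Product):
        return f"{show(t.left)} × {show(t.right)}"
    if isinstance(t, Record):
        return "record(" + " × ".join(f"({n} × {show(ty)})" for n, ty in t.fields) + ")"
    if isinstance(t, Function):
        return f"{show(t.domain)} -> {show(t.result)}"
    raise TypeError(f"not a type expression: {t!r}")


INTEGER, CHAR = Basic("integer"), Basic("char")
a_type = Array(1, 10, INTEGER)
row = Record((("address", INTEGER), ("lexeme", Array(1, 15, CHAR))))
f_type = Function(Product(CHAR, CHAR), INTEGER)
```

Because the dataclasses are frozen and compared field by field, two type expressions built the same way are equal, which is the comparison a type checker needs.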

It is sometimes convenient to represent type expressions as graphs. We can use abstract syntax trees, with nodes for type constructors, and leaves for basic types and names.

Example 14.4 type trees

[Figure: possible trees for the type expressions of 14.3: the product T1 × T2, the function f : char × char -> integer, the record row with fields address and lexeme, and the array type array(1..10,integer).]

Note the use of siblings to represent the different elements of products, but the use of child nodes to represent the function name and return type.

Type Systems

A type system is a collection of rules for assigning type expressions to the different parts of a program. A type checker implements a type system. Since type checking has the potential for discovering errors in programs, it is important for a type checker to do something reasonable when an error is discovered. The compiler must report the nature and location of the error, but the checker should also recover from the error so that the rest of the input can be processed. A type checker able to handle errors may result in a more complicated grammar than that required solely for processing correct programs. For that reason, some type checkers operate on the abstract syntax tree rather than during parsing.

Example 14.5 Specifying the type checker

We now specify a type checker for a simple language which requires declaration of identifiers before their use. The grammar below generates programs, represented by the non-terminal P, consisting of a sequence of declarations D followed by a single expression E.

P -> D ; E
D -> D ; D
D -> id : T
T -> char | integer | array[num] of T
E -> num | id | E mod E | E [E] | id := E

The language has two basic types: char and integer. The two special basic types type_error and void are used to signal errors and the absence of a type respectively. Arrays are assumed to start at index 1, so the declaration array [256] of char leads to the type expression array(1..256,char). In the translation scheme given below (for a one-pass compiler), actions add type information to the symbol table entry for the identifiers.

P  -> D ; E
D  -> D ; D
D  -> id : T              addtype(id.entry, T.type)
T  -> char                T.type := char
T  -> integer             T.type := integer
T1 -> array[num] of T2    T1.type := array(1..num.value, T2.type)

These actions allow the type of all declared identifiers to be added to the symbol table. The expressions can now be checked.

E  -> num          E.type := integer
E  -> id           E.type := lookup(id.entry)
E1 -> E2 mod E3    E1.type := if E2.type = integer and E3.type = integer then integer else type_error
E1 -> E2 [E3]      E1.type := if E3.type = integer and E2.type = array(s,t) then t else type_error
E1 -> id := E2     E1.type := if lookup(id.entry) = E2.type then void else type_error

Numbers are of type integer. "lookup(x)" searches the symbol table and returns the stored type of entry x. The mod operator requires that both its operands are of type integer. If so, then the resulting expression is also of type integer; if not, then there is an error.

For the array lookups, the index to the array must be of type integer, and the type of the array name must be an array. If both of these conditions are met, then the type of the expression is the same as the type of the elements of the array. For the assignment expressions, the type of the identifier must match the type of the expression. If so, then the special type void is returned; otherwise, the value type_error is returned.
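
The checking rules of Example 14.5 can be tried out with a short Python sketch that walks an expression tree. The tuple encoding of expressions and types, the function name check, and the dictionary standing in for the symbol table are all assumptions of this sketch. It also assumes that an undeclared identifier has type type_error and that an assignment involving type_error is itself an error; the rules above leave these cases open.

```python
def check(expr, symtab):
    """Return the type of an expression following the rules of Example 14.5.

    Expressions are tuples: ("num", n), ("id", name), ("mod", e1, e2),
    ("index", e1, e2) for e1[e2], and ("assign", name, e).
    Types are "integer", "char", "void", "type_error" or
    ("array", (low, high), element_type).
    """
    kind = expr[0]
    if kind == "num":
        return "integer"
    if kind == "id":
        return symtab.get(expr[1], "type_error")    # assumption: undeclared -> error
    if kind == "mod":
        left, right = check(expr[1], symtab), check(expr[2], symtab)
        return "integer" if left == "integer" and right == "integer" else "type_error"
    if kind == "index":
        array, index = check(expr[1], symtab), check(expr[2], symtab)
        if index == "integer" and isinstance(array, tuple) and array[0] == "array":
            return array[2]
        return "type_error"
    if kind == "assign":
        target = symtab.get(expr[1], "type_error")
        value = check(expr[2], symtab)
        # assumption: type_error on either side never yields void
        return "void" if target == value and target != "type_error" else "type_error"
    return "type_error"


# Declarations  a : array[256] of char ; i : integer ; c : char
symtab = {"a": ("array", (1, 256), "char"), "i": "integer", "c": "char"}

ok = check(("assign", "c", ("index", ("id", "a"), ("mod", ("id", "i"), ("num", 7)))), symtab)
bad_mod = check(("mod", ("id", "c"), ("num", 2)), symtab)
bad_index = check(("index", ("id", "i"), ("num", 1)), symtab)
```

Here c := a[i mod 7] checks to void, while c mod 2 and i[1] both produce type_error.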

15. Runtime Environment

After the analysis phases are complete, the compiler must generate executable code. In particular, the compiler must generate code to maintain the structure of the target machine's registers and memory during execution. In this section, we consider the types of environment that are required.

In most compiled languages executable code is stored in a fixed area of RAM which cannot be changed during execution. The code for each different function or procedure is stored separately, at a known address (or at a known offset from a base address). Static data (e.g. constants or strings known at compile time) and global variables can also be stored in this fixed area. The remainder of the data, plus bookkeeping information for control flow, will be stored in areas that will be allocated dynamically during execution.

Example 15.1 Simple runtime storage structure

entry address ->   code for function 1
entry address ->   code for function 2
                   ...
entry address ->   code for function n
                   global/static area
                   stack
                   free space
                   heap

The stack is used for data that can be allocated in a last-in, first-out manner, while the heap area is used for other data (e.g. data reached through C pointers).

Definition 15.2 A procedure activation record is a section of memory allocated each time a procedure is called. It contains space for arguments, local data and local temporary variables, and pointers to the code area and to the activation record of the procedure which called it.

Definition 15.3 In a fully static environment, no procedures can be called recursively, there are no pointers, and no dynamic memory allocation - for example, FORTRAN77. In such an environment, we only ever need to maintain one procedure activation record for each procedure, as it is not possible for more than one copy of a single procedure to be in use simultaneously. Thus, at compile time, we can construct a procedure activation record for each procedure. Each time a procedure is called, we compute its arguments and store them in the appropriate record, and store the address of the calling procedure. We then jump to the start of the code for the called procedure, execute it, using the space in its record for maintaining data, and on exit, jump back to the return address.

Example 15.4 A simple static environment

 1  int i = 10;
 2  int f1(int j) {
 3    int k;
 4    k = 3 * j;
 5    if (k < i) print(i);
 6    else print(j);
 7  }
 8  main() {
 9    int k = 1;
10    while (k < 5) {
11      f1(k);
12      k = k+1;
13    }
14  }

                           initial        on entry    on reaching
                           environment    to f1       line 14
global area
  i (int):                                10          10
activation record: main
  k (int):                                1           2
  start code ptr:          8              8           8
  current code ptr:                       11          14
activation record: f1
  j (int):                                1
  start code ptr:          2              2           2
  current code ptr:                       2
  return address:                         11
  k (int):

Definition 15.5 In a stack-based environment, procedures may be called recursively. It is not sufficient to maintain a single activation record for each procedure. A stack is required, onto which new records are placed each time a procedure is called, and from which old records are deleted when procedures exit. Each procedure may have several records on the stack at any one time. Each activation record should maintain a pointer to the previous activation record, to allow it to be recovered on exit. The environment requires a pointer to the current activation record, and a pointer to the last allocated position on the stack.

Example 15.6 A simple stack-based environment

1  int x, y;
2  int gcd(int u, int v) {
3    if (v == 0) return u;
4    else return gcd(v, u % v);
5  }
6  main() {
7    scanf("%d%d", &x, &y);
8    printf("%d\n", gcd(x,y));
9  }

In each state below, fp points to the current (most recent) activation record and sp to the last allocated position on the stack. The return pointer of each gcd record refers to the activation record below it.

initial environment
  global area               x (int):      y (int):
  activation record: main   start code address: 6   current code address:
  free space

on 1st entry to gcd
  global area               x (int): 15   y (int): 10
  activation record: main   start code address: 6   current code address: 8
  activation record: gcd    u (int): 15   v (int): 10   start code address: 2   current code address: 2   return address: 8
  free space

on 3rd entry to gcd
  global area               x (int): 15   y (int): 10
  activation record: main   start code address: 6   current code address: 8
  activation record: gcd    u (int): 15   v (int): 10   start code address: 2   current code address: 4   return address: 8
  activation record: gcd    u (int): 10   v (int): 5    start code address: 2   current code address: 4   return address: 4
  activation record: gcd    u (int): 5    v (int): 0    start code address: 2   current code address: 2   return address: 4
  free space

about to exit main
  global area               x (int): 15   y (int): 10
  activation record: main   start code address: 6   current code address: 9
  free space
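
The growth of the stack in Example 15.6 can be reproduced with a small Python simulation that keeps the activation records in an explicit list. This is a sketch only: the function name gcd_trace and the dictionary layout of a record are assumptions, and the code addresses are simply the line numbers of the listing.

```python
def gcd_trace(x, y):
    """Simulate the calls of gcd(x, y) with an explicit stack of activation records.

    Returns the result and a snapshot of the stack at its deepest point.
    """
    # main calls gcd from line 8, so the first record returns to address 8.
    stack = [{"u": x, "v": y, "return_address": 8}]
    while stack[-1]["v"] != 0:
        frame = stack[-1]                       # fp: the current activation record
        # The recursive call on line 4 pushes a new record that returns to line 4.
        stack.append({"u": frame["v"], "v": frame["u"] % frame["v"], "return_address": 4})
    deepest = [(f["u"], f["v"], f["return_address"]) for f in stack]
    result = stack[-1]["u"]                     # v == 0: line 3 returns u
    while stack:                                # each exit deletes the top record
        stack.pop()
    return result, deepest


result, deepest = gcd_trace(15, 10)
```

For x = 15 and y = 10 the deepest snapshot holds three gcd records, matching the state shown on the 3rd entry to gcd.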

The details of how much space to allocate for an activation record and the offsets to be computed to reach the appropriate data items must be provided by the compiler. This is covered in the next chapter. Note that the stack-based environment presented here is particularly simple - there is no discussion of variable length data, temporary variables, internal blocks and nested declarations, local procedures (as in Pascal), nor procedures as arguments. Detailed descriptions of methods for dealing with such situations are given by Aho et al. (1986).

Definition 15.7 In a dynamic environment, activation records are not maintained on a stack, but must exist and be accessible for as long as all references to them exist, and must be capable of being dynamically deallocated when they become inaccessible (a process called garbage collection).

Example 15.8 dangling references

int *dangle() {
  int x;
  return &x;
}

In a stack-based environment, the variable x is allocated space only during the lifetime of the function dangle(). Once the function has returned, the space is reclaimed, and will be reused by subsequent procedure calls. However, the address of x has been returned, and will be assumed to point to an integer, although the memory location could now contain anything. C is a stack-based language, and the above procedure is considered a logical error. Other languages do not have this restriction, and so require dynamic environments.

Dynamic allocation is handled in the heap area. A heap provides two operations: allocate and free. Allocate takes a size parameter, and returns the address of a block of memory of the correct size. Free takes an address, and marks the block at that address as free. The main problem in heap management is that the memory can quickly become fragmented, unless contiguous free memory blocks are combined into a whole. A second problem is in ensuring that free is only ever applied to the start of an allocated block of the appropriate size, or corruption can result.

Example 15.9 simple heap management

Maintain a circular linked list of allocated memory blocks. Each block to be allocated is headed by some bookkeeping information, with the address of the next allocated block, the size of the used space, and the size of the following free space. The first element of the list is the top of the heap, which also has a pointer to a block with some free space.

To allocate a new block, move round the list until we find an element with enough free space. Create a new element, insert it into the list after the selected element's used space, set the new element's free size to the selected element's free size minus the size of the new element, and set the selected element's free size to zero.

To free a block, move to the start of the list, and step through until the appropriate address is found - if it is not found, the address is invalid. Add the current block size and its free size to its predecessor's free size, and delete the block from the list. The figure below shows the heap during a sequence of allocations and deallocations.

[Figure: the heap during a sequence of allocations and deallocations. The list header holds the "last" and "next" pointers; each block has a header (next, used sz, free sz) followed by its used space and any free space.]

Note that the heap management system in Example 15.9 is for dealing with explicit manual allocation and deallocation commands, and as such is required in stack-based languages like C. Fully dynamic languages require additional routines for garbage collection.
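
The allocation scheme of Example 15.9 can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the notes: addresses are plain integers, block headers occupy no space, the extra pointer to a block with free space is omitted (every search starts at the head), and the class and method names are hypothetical.

```python
class Block:
    def __init__(self, addr, used, free):
        self.addr = addr      # start address of the used space
        self.used = used      # size of the used space
        self.free = free      # size of the free space that follows it
        self.next = self      # circular list: a lone block points to itself


class Heap:
    def __init__(self, size):
        # The head is an empty block at the top of the heap owning all free space.
        self.head = Block(0, 0, size)

    def allocate(self, size):
        elem = self.head
        while elem.free < size:               # move round the list
            elem = elem.next
            if elem is self.head:
                raise MemoryError("no element has enough free space")
        # The new block starts right after the selected element's used space.
        new = Block(elem.addr + elem.used, size, elem.free - size)
        elem.free = 0
        new.next, elem.next = elem.next, new
        return new.addr

    def free(self, addr):
        pred, block = self.head, self.head.next
        while block is not self.head:         # step through from the start
            if block.addr == addr:
                pred.free += block.used + block.free
                pred.next = block.next        # delete from the list
                return
            pred, block = block, block.next
        raise ValueError("invalid address")


heap = Heap(100)
a = heap.allocate(10)
b = heap.allocate(20)
c = heap.allocate(30)
heap.free(b)              # b's 20 units become free space after a
d = heap.allocate(15)     # reuses the gap left by b
```

Because freed space is always merged into the predecessor's free size, adjacent free areas are combined automatically, which limits the fragmentation mentioned above.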

16. Intermediate Code Generation

After a program has been parsed and statically checked, the compiler converts it to an intermediate language, and then optimises the code before producing the final executable version. The main advantage of developing this intermediate code is machine independence - the analysis techniques can be developed without concern for the target language, a single optimisation procedure can be used, and porting the compiler to new machines only requires changing the final component. In this section, we build on the previous material, and consider syntax-directed methods of generating the intermediate code, in the language known as three-address code.

Three Address Code

Statements in this language take the general form:

x := y op z

where x, y and z are names, constants or compiler-generated temporaries, and op stands for any operator. An expression like a+b*c has to be translated into the sequence

t1 := b * c
t2 := a + t1

where t1 and t2 are compiler-generated temporary names. Unravelling complicated arithmetical expressions allows them to be optimised effectively and translated easily to the target language, as three-address code is similar to assembly language.

Example 16.1 three-address code

Three-address code is a linearised form of postfix expressions and syntax trees. For example, consider the statement

a := b * c + b / c

In postfix, this is

a b c * b c / + :=

As a syntax tree, it becomes:

      :=
     /  \
    a    +
       /   \
      *     /
     / \   / \
    b   c b   c

In three-address code, it is

t1 := b * c
t2 := b / c
t3 := t1 + t2
a := t3

The three-address statements used in this chapter are shown below:

assignment statement    x := y op z
unary assignment        x := op y               (e.g. unary minus, negation, type conversion)
copy                    x := y
unconditional jump      goto L                  (L is the label of a statement)
conditional jump        if x relop y goto L     (relop is a relational operator: <, <=, =, ...)
procedure call          param x                 (defines x as a parameter)
                        call p n                (call procedure p, passing the last n declared parameters)
                        return y                (optional)
indexed assignments     x := y[i]               (x is set to the value at i memory locations after y)
                        x[i] := y               (i memory locations after x is set to y)

We assume that statements in three-address code can be labelled (labels are referred to by the goto statement). The choice of allowable memory operators is the critical issue in the design of an intermediate language. The operator set must be sufficiently rich to implement the operations in the source language, and expressive enough that the code generator does not need to generate long sequences of instructions to implement each operator.

Example 16.2 A Syntax-directed Translation

The syntax-directed definition given below translates assignment statements into three-address code. The synthesised attribute S.code in the definition that follows represents the three-address code fragment for assignment S. The non-terminal E has two attributes:

E.place - the name that will hold the value of E
E.code - the three-address code fragment for E

The notation gen(x ":=" y "+" z) represents the three-address code statement x := y+z. Expressions appearing instead of the variables (x,y,z) are evaluated before being passed to gen, and the quoted strings are taken literally. The notation <code fragment> || expression means concatenate the expression onto the end of the code fragment. newtemp() creates a new temporary variable.

Production        Semantic Rules
S  -> id := E     S.code := E.code || gen(id.place ":=" E.place)
E1 -> E2 + E3     E1.place := newtemp();
                  E1.code := E2.code || E3.code || gen(E1.place ":=" E2.place "+" E3.place)
E1 -> E2 * E3     E1.place := newtemp();
                  E1.code := E2.code || E3.code || gen(E1.place ":=" E2.place "*" E3.place)
E1 -> -E2         E1.place := newtemp();
                  E1.code := E2.code || gen(E1.place ":=" "uminus" E2.place)
E1 -> (E2)        E1.place := E2.place;
                  E1.code := E2.code
E  -> id          E.place := id.place;
                  E.code := ""

Example 16.3

The parse tree for a := b*c + b*-c is:

S
  a
  :=
  E8
    E3
      E1  b
      *
      E2  c
    +
    E7
      E4  b
      *
      E6
        -
        E5  c

The attributes are constructed as follows:

Symbol  place  code
E1      b
E2      c
E3      t1     E1.code || E2.code || t1 := b * c
E4      b
E5      c
E6      t2     E5.code || t2 := uminus c
E7      t3     E4.code || E6.code || t3 := b * t2
E8      t4     E3.code || E7.code || t4 := t1 + t3
S              E8.code || a := t4

Expanding the code attribute for S then gives us the three-address code:

t1 := b * c
t2 := uminus c
t3 := b * t2
t4 := t1 + t3
a := t4
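
The construction of the place and code attributes can be imitated with a short recursive function over a syntax tree. The sketch below is not the syntax-directed definition itself: it works on a hand-built tuple tree rather than during parsing (so the rule for parentheses has no counterpart), and the names translate_assignment and expr are hypothetical. Temporaries are numbered after the operands have been translated, which reproduces the numbering of Example 16.3.

```python
import itertools


def translate_assignment(target, tree):
    """Return the three-address code for  target := tree  as a list of strings."""
    temps = itertools.count(1)

    def newtemp():
        return f"t{next(temps)}"

    def expr(node):
        """Return the (place, code) attributes of an expression node."""
        kind = node[0]
        if kind == "id":                       # E -> id
            return node[1], []
        if kind == "uminus":                   # E1 -> -E2
            place2, code2 = expr(node[1])
            place = newtemp()
            return place, code2 + [f"{place} := uminus {place2}"]
        if kind in ("+", "*"):                 # E1 -> E2 + E3  and  E1 -> E2 * E3
            place2, code2 = expr(node[1])
            place3, code3 = expr(node[2])
            place = newtemp()
            return place, code2 + code3 + [f"{place} := {place2} {kind} {place3}"]
        raise ValueError(f"unknown node kind: {kind}")

    place, code = expr(tree)                   # S -> id := E
    return code + [f"{target} := {place}"]


# a := b*c + b*-c
tree = ("+",
        ("*", ("id", "b"), ("id", "c")),
        ("*", ("id", "b"), ("uminus", ("id", "c"))))
code = translate_assignment("a", tree)
```

List concatenation plays the role of the || operator, so the returned list is the expanded S.code.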

Example 16.4 flow of control

We can extend the language defined by 16.2 by including flow of control statements:

Production               Semantic Rules
S1 -> while E do S2      S1.begin := newlabel();
                         S1.after := newlabel();
                         S1.code := gen(S1.begin ":") || E.code ||
                                    gen("if" E.place "=" "0" "goto" S1.after) ||
                                    S2.code || gen("goto" S1.begin) ||
                                    gen(S1.after ":")

We have introduced new attributes, "begin" and "after", which will hold labels, and the function newlabel(), which creates a new label and returns it. A schematic drawing of the code created by this semantic rule is shown below:

labels        code
S1.begin :    E.code
              if E.place = 0 goto S1.after
              S2.code
              goto S1.begin
S1.after :    ...
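
As a rough sketch of this rule, the Python function below assembles the code for a while statement from the already-translated condition and body. The function name translate_while, the label names L1, L2, ... and the sample statement are assumptions made for illustration.

```python
import itertools


def translate_while(cond_place, cond_code, body_code, newlabel):
    """Build S1.code for  while E do S2  from E.place, E.code and S2.code."""
    begin = newlabel()
    after = newlabel()
    return ([f"{begin}:"]
            + cond_code
            + [f"if {cond_place} = 0 goto {after}"]
            + body_code
            + [f"goto {begin}", f"{after}:"])


labels = itertools.count(1)


def newlabel():
    return f"L{next(labels)}"


# while n do n := n + d
# The condition is the identifier n, so E.place is "n" and E.code is empty.
loop = translate_while("n", [], ["t1 := n + d", "n := t1"], newlabel)
```

The resulting list has the shape of the schematic: the begin label, the test, the body, the jump back, and the after label.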

We assume that if the expression E is non-zero, it is true. Thus if the expression, evaluated by E.code, is false, control shifts to S1.after; if the expression is true, S2.code is executed, then control shifts back to S1.begin, and the expression is evaluated again.

Assignment statements

The previous sections assumed that when variable names were used, they represented pointers to the symbol table. This section demonstrates how names corresponding to the terminal id are looked up in the symbol table - the function lookup(id.name) returns a pointer to the entry of the identifier if it is in the symbol table, or nil if not.

Example 16.5

We now redo the semantic rules of 16.2 to show the use of the lookup function. Instead of concatenating the code together in the attributes of the symbols, we now output the intermediate code to a file, using the emit() function.

Production        Semantic Rules
S  -> id := E     p := lookup(id.name);
                  if p ≠ nil then emit(p ":=" E.place) else error
E1 -> E2 + E3     E1.place := newtemp();
                  emit(E1.place ":=" E2.place "+" E3.place)
E1 -> E2 * E3     E1.place := newtemp();
                  emit(E1.place ":=" E2.place "*" E3.place)
E1 -> -E2         E1.place := newtemp();
                  emit(E1.place ":=" "uminus" E2.place)
E1 -> (E2)        E1.place := E2.place
E  -> id          p := lookup(id.name);
                  if p ≠ nil then E.place := p else error

Parsing the fragment res := a * (alpha + -b), assuming that res, a, alpha and b have already been declared and placed in the symbol table:

lexptr      token   attributes   index
:           :       :            :
-> res      ID_T                 5
-> a        ID_T                 6
-> alpha    ID_T                 7
-> b        ID_T                 8

gives the following sequence:

processed string             attributes         output
res := a * (alpha + -b)
res := E1 * (alpha + -b)     E1.place = <6>
res := E1 * (E2 + -b)        E2.place = <7>
res := E1 * (E2 + -E3)       E3.place = <8>
res := E1 * (E2 + E4)        E4.place = <9>     <9> := uminus <8>
res := E1 * (E5)             E5.place = <10>    <10> := <7> + <9>
res := E1 * E6               E6.place = <10>
res := E7                    E7.place = <11>    <11> := <6> * <10>
S                                               <5> := <11>

For the remainder of the chapter, we will dispense with the <i> notation, and simply refer to identifiers by their name.

Arrays

We can access the elements of an array quickly if we store them in a block of consecutive locations. Let A be an array, the width of each array element be w, the lower bound of the index be low, and the address of the storage for A be base. The ith element of A then begins at location:

base + (i - low) × w

To speed up the access of array elements, we can partially evaluate this address at compile time by rewriting it as:

i × w + (base - low × w)

and evaluating the subexpression (base - low × w). This value, c, say, is then stored in the table with A, and the relative address of an element A[i] can then be found by adding i × w to c.

We can also do something similar for multi-dimensional arrays. Two-dimensional arrays can be stored either row by row or column by column. For arrays stored row by row, the relative address of A[i,j] can be calculated by the formula:

base + ((i - low1) × n2 + j - low2) × w

where low1 and low2 are the lower bounds on i and j, and n2 is the number of values that j can take. Assuming that i and j are the only two values not known at compile time, we can rewrite this as:

((i × n2) + j) × w + (base - ((low1 × n2) + low2) × w)        (**)

As before, the last term can be pre-computed at compile time.

The chief problem in generating code for array references is to relate the computation of the positions of elements in an array to a grammar of array references. A grammar may be given as follows:

L -> id[Elist] | id
Elist -> Elist, E | E

It is useful to re-write this grammar to allow the dimensional limits of the array to be available as the index expressions are grouped into an Elist:

L -> Elist] | id
Elist -> Elist, E | id [E

These productions allow a pointer to the symbol table entry for the array name to be passed as a synthesised attribute of Elist. The following attributes are used below:

Elist.ndim: the number of dimensions of Elist;
limit(array,j): function returning the number of elements along the jth dimension of the array;
Elist.place: temporary variable holding a value computed from Elist;
L.place: position in the symbol table;
L.offset: an offset into the array, or null to indicate that the l-value is a simple name rather than an array reference;
c(Elist.array): a function returning the pre-computed last term of (**) above;
width(array): a function returning w in (**) above.

Example 16.6 translation scheme for addressing array elements

Production              Semantic Rules
S -> L := E             if L.offset = null then emit(L.place ":=" E.place)
                        else emit(L.place "[" L.offset "]" ":=" E.place)
E1 -> E2 + E3           E1.place := newtemp();
                        emit(E1.place ":=" E2.place "+" E3.place)
E1 -> (E2)              E1.place := E2.place
E -> L                  if L.offset = null then E.place := L.place
                        else E.place := newtemp();
                             emit(E.place ":=" L.place "[" L.offset "]")
L -> Elist]             L.place := newtemp();
                        L.offset := newtemp();
                        emit(L.place ":=" c(Elist.array));
                        emit(L.offset ":=" Elist.place "*" width(Elist.array))
L -> id                 L.place := id.place;
                        L.offset := null
Elist1 -> Elist2, E     t := newtemp();
                        m := Elist2.ndim + 1;
                        emit(t ":=" Elist2.place "*" limit(Elist2.array,m));
                        emit(t ":=" t "+" E.place);
                        Elist1.array := Elist2.array;
                        Elist1.place := t;
                        Elist1.ndim := m
Elist -> id [E          Elist.array := id.place;
                        Elist.place := E.place;
                        Elist.ndim := 1

Example 16.7 generating code using the scheme of 16.6

Let A be a 10 × 20 array with low1 = low2 = 1. Therefore n1 = 10 and n2 = 20. Take w to be 4. The assignment x := A[y,z] is parsed and translated as follows:

sentential forms      attributes                             generated code
x := A[y, z]
L1 := A[y, z]         L1.place = x, L1.offset = null
L1 := A[L2, z]        L2.place = y, L2.offset = null
L1 := A[E1, z]        E1.place = y
L1 := Elist1, z]      Elist1.array = A, Elist1.place = y,
                      Elist1.ndim = 1
L1 := Elist1, L3]     L3.place = z, L3.offset = null
L1 := Elist1, E2]     E2.place = z
L1 := Elist2]         <t = t1> <m = 2>                       t1 := y * 20
                      Elist2.array = A, Elist2.place = t1,   t1 := t1 + z
                      Elist2.ndim = 2
L1 := L4              L4.place = t2, L4.offset = t3          t2 := c /* baseA - 84 */
                                                             t3 := t1 * 4
L1 := E3              E3.place = t4                          t4 := t2[t3]
S                                                            x := t4
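
The constant 84 in the generated code comes from the pre-computed term of (**): ((low1 × n2) + low2) × w = ((1 × 20) + 1) × 4 = 84. The Python sketch below checks that splitting the address into a compile-time part c and a run-time part gives the same result as the direct formula. The function names and the base address 1000 are arbitrary assumptions for this check.

```python
def address_direct(base, i, j, low1, low2, n2, w):
    """Row-major address of A[i,j] computed directly."""
    return base + ((i - low1) * n2 + (j - low2)) * w


def precomputed_c(base, low1, low2, n2, w):
    """The compile-time constant: the last term of (**)."""
    return base - (low1 * n2 + low2) * w


def address_split(c, i, j, n2, w):
    """Run-time part of (**) added to the pre-computed constant c."""
    return c + (i * n2 + j) * w


# The array of Example 16.7: 10 x 20, lower bounds 1, element width 4.
BASE, LOW1, LOW2, N1, N2, W = 1000, 1, 1, 10, 20, 4
c = precomputed_c(BASE, LOW1, LOW2, N2, W)

all_agree = all(
    address_direct(BASE, i, j, LOW1, LOW2, N2, W) == address_split(c, i, j, N2, W)
    for i in range(LOW1, LOW1 + N1)
    for j in range(LOW2, LOW2 + N2)
)
```

The run-time part (i × n2 + j) × w corresponds to the three statements t1 := y * 20, t1 := t1 + z and t3 := t1 * 4 in the trace.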

Type Conversions

In practice, there are many different types of variables and constants, and the programmer may wish to combine their use in a single expression where appropriate; it is the task of the compiler to generate appropriate type conversion instructions. In the above, suppose there are reals and integers, and that integers can be converted to reals. The semantic rules for the arithmetic operations (and most of the other productions) must be modified to generate three-address statements which carry out the type conversion where necessary. We also include with the operator some indication of whether we intend fixed point or floating point operations.

Example 16.8 arithmetic type conversions (for E1 -> E2 + E3)

We need one additional attribute, E.type, which is either integer or real.

E1.place := newtemp();
if E2.type = integer and E3.type = integer then begin
    emit(E1.place ":=" E2.place "int+" E3.place);
    E1.type := integer
end
else if E2.type = real and E3.type = real then begin
    emit(E1.place ":=" E2.place "real+" E3.place);
    E1.type := real
end
else if E2.type = integer and E3.type = real then begin
    u := newtemp();
    emit(u ":=" "inttoreal" E2.place);
    emit(E1.place ":=" u "real+" E3.place);
    E1.type := real
end
else if E2.type = real and E3.type = integer then begin
    u := newtemp();
    emit(u ":=" "inttoreal" E3.place);
    emit(E1.place ":=" E2.place "real+" u);
    E1.type := real
end
else
    E1.type := type_error;

Similar semantic functions are required for E -> E*E, replacing "int+" with "int*" etc.

Example 16.9

Parsing and translating the string x := y + i * j, where x and y are reals, and i and j are integers:

symbols                attributes                              code
x := y + i * j
x := E1 + i * j        E1.place = y, E1.type = real
x := E1 + E2 * j       E2.place = i, E2.type = integer
x := E1 + E2 * E3      E3.place = j, E3.type = integer
x := E1 + E4           E4.place = t1, E4.type = integer        t1 := i int* j
x := E5                E5.place = t2, <u = t3>,                t3 := inttoreal t1
                       E5.type = real                          t2 := y real+ t3
S                                                              x := t2
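
The rules of Example 16.8 can be exercised on the string of Example 16.9 with the Python sketch below. It is an illustration under assumptions: the expression is supplied as a hand-built tuple tree, the function name gen_expr is hypothetical, and the same rule is used for "+" and "*" by splicing the operator into "int" or "real". As in Example 16.8, the result temporary is created before the conversion temporary u.

```python
import itertools


def gen_expr(node, types, code, newtemp):
    """Append three-address code for node to code; return (place, type)."""
    if node[0] == "id":
        return node[1], types[node[1]]
    op, left, right = node                     # op is "+" or "*"
    lplace, ltype = gen_expr(left, types, code, newtemp)
    rplace, rtype = gen_expr(right, types, code, newtemp)
    place = newtemp()
    if ltype == "integer" and rtype == "integer":
        code.append(f"{place} := {lplace} int{op} {rplace}")
        return place, "integer"
    if ltype == "real" and rtype == "real":
        code.append(f"{place} := {lplace} real{op} {rplace}")
        return place, "real"
    if ltype == "integer" and rtype == "real":
        u = newtemp()                          # converted copy of the left operand
        code.append(f"{u} := inttoreal {lplace}")
        code.append(f"{place} := {u} real{op} {rplace}")
        return place, "real"
    if ltype == "real" and rtype == "integer":
        u = newtemp()                          # converted copy of the right operand
        code.append(f"{u} := inttoreal {rplace}")
        code.append(f"{place} := {lplace} real{op} {u}")
        return place, "real"
    return place, "type_error"


temps = itertools.count(1)
types = {"x": "real", "y": "real", "i": "integer", "j": "integer"}
code = []

# x := y + i * j
place, result_type = gen_expr(("+", ("id", "y"), ("*", ("id", "i"), ("id", "j"))),
                              types, code, lambda: f"t{next(temps)}")
code.append(f"x := {place}")
```

The emitted statements appear in the same order as the code column of Example 16.9.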
