Parsing
13. Parsing
Parsing is one of the major functions of a compiler. Given a source program w, the parser examines w
to see whether it can be derived by the grammar of the programming language and, if it can, constructs a parse tree
yielding w. Based on this parse tree, the compiler generates object code. So the parser acts as a membership test algorithm
designed for a given grammar G: given a string w, it tells us whether w is in L(G) and, if it is, outputs a parse tree.
Notice that the parser tests membership based on the given grammar. Recall that when we practiced constructing a PDA for a
given language, say {a^i b^i | i > 0}, we used structural information about the language: the a’s come first, then the b’s, and the
numbers of a’s and b’s are the same. Consider the two CFG’s G1 and G2 shown below in figure (a), which generate the same language
{a^i b^i | i > 0}. Figure (b) shows a PDA that recognizes this language. For an input string w, this PDA gives no information about the
grammar or about how w is derived. Hence, we need a different approach, one that constructs a parser based on the grammar, not the
language.
There are several parsing algorithms that, given an arbitrary CFG G and a string x, tell whether x ∈ L(G) and,
if it is, output how x is derived. (The CYK algorithm is a typical example; it is shown in Appendix F.) However, these algorithms
are too slow to be practical. (For example, the CYK algorithm takes O(n^3) time for an input string of length n.) Thus, we restrict CFG’s
to a subclass for which we can build a fast, practical parser. This chapter presents two parsing strategies applicable to such restricted
grammars, together with several design examples. Finally, the chapter briefly introduces Lex (the lexical analyzer generator) and
YACC (the parser generator).
[Figure: (a) the grammars G1 and G2; (b) a PDA for the language, with transitions labeled (a, a/aa) and (b, a/ε).]
Two very elderly ladies were enjoying the sunshine on a park bench in Miami. They had been meeting at the park
every sunny day, for over 12 years, chatting, and enjoying each other's friendship. One day, the younger of the two
ladies turns to the other and says, “Please don't be angry with me dear, but I am embarrassed, after all these years...
What is your name ? I am trying to remember, but I just can't.”
The older friend stares at her, looking very distressed, says nothing for 2 full minutes, and finally with tearful
eyes, says, “How soon do you have to know ?”
- overheard by Rubin -
13.1 Derivation
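The parser of a grammar generates a parse tree for a given input string. For
convenience, the tree is commonly presented as the sequence of rules applied to
derive the input string starting from S, in one of the following two ways. In a
leftmost derivation, the leftmost nonterminal is rewritten at each step; in a
rightmost derivation, the rightmost nonterminal is rewritten.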
Example: G: S → ABC A → aa B → a C → cC | c
Leftmost derivation: S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc
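The two derivations above can be reproduced mechanically. The following is a minimal Python sketch, assuming single-character symbols and a hand-supplied list of right-hand sides to apply; the names RULES and derive are illustrative and do not come from the slides.

```python
# Grammar of the example: S -> ABC, A -> aa, B -> a, C -> cC | c.
# Assumption: every symbol is one character; keys of RULES are the nonterminals.
RULES = {
    "S": ["ABC"],
    "A": ["aa"],
    "B": ["a"],
    "C": ["cC", "c"],
}

def derive(start, choices, leftmost=True):
    """Return the sentential forms obtained by rewriting, at each step,
    the leftmost (or rightmost) nonterminal with the chosen right side."""
    form = start
    steps = [form]
    for rhs in choices:
        positions = [i for i, ch in enumerate(form) if ch in RULES]
        i = positions[0] if leftmost else positions[-1]
        if rhs not in RULES[form[i]]:
            raise ValueError(f"{form[i]} -> {rhs} is not a rule")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

left = derive("S", ["ABC", "aa", "a", "cC", "c"], leftmost=True)
right = derive("S", ["ABC", "cC", "c", "a", "aa"], leftmost=False)
print(" => ".join(left))
print(" => ".join(right))
```

Both calls end in the same string aaacc; only the order in which the rules are applied differs.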
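G: S → ABC   A → aa   B → bD   C → cC | c   D → bd
Rightmost derivation (the number after each arrow is the step number):
S ⇒(1) ABC ⇒(2) ABcC ⇒(3) ABcc ⇒(4) AbDcc ⇒(5) Abbdcc ⇒(6) aabbdcc
[Figure: parse tree for aabbdcc, with each nonterminal node labeled by the step at which it is expanded: S(1), C(2), C(3), B(4), D(5), A(6). The figure also shows the reversed step sequence 6 5 4 3 2 1.]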
13.2 LL(k) parsing strategy
We know that parsers are different from PDA’s, because their membership test
should be based on the given CFG. Let’s try to build a conventional DPDA which,
with the grammar G stored in the finite control, tests whether the input string x is in
L(G), and, if it is, outputs a sequence of rules applied to derive x. We equip the
finite control with an output port for the output (see figure (b) below).
Our first strategy is to derive the same input string x in the stack. Because any
string must be derived starting with the start symbol, we let the machine push S into
the stack and enter state q1 for the next move. For convenience, we assign a rule
number to each rule as shown in figure (a).
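G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c
L(G) = {a^10 x | x = b^i or x = c^i, i ≥ 1}
[Figure: (a) the grammar G with rule numbers; (b) the machine with SZ0 in the stack.]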
Now, we ask which rule, (1) or (2), the machine should apply to S to
eventually derive the string on the input tape. If the input string is derived using rule
(1) (respectively, rule (2)) first, then the symbol b (respectively, c) must follow
the 10th a. Unfortunately, our conventional DPDA model cannot look ahead at the
input before reading it. Recall that a conventional DPDA decides whether or not to
read the input depending on the stack top symbol. Only after reading the
input does the machine know what it is. Thus, without reading up to the 11th input
symbol, there is no way for the machine in the figure to identify the symbol at that
position.
To overcome this problem, we equip the finite state control with a “telescope”
with which the machine can look some finite number k of cells ahead on the input tape. For
the grammar G, a telescope with a range of 11 cells is enough.
(Notice that the look-ahead range includes the cell under the head.)
With this new capability, the machine scans the input ahead within the range
and, based on what it sees, takes the next move. While looking ahead,
the input head does not move.
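G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c
L(G) = {a^10 x | x = b^i or x = c^i, i ≥ 1}
[Figure: (a) the grammar G with rule numbers; (b) the machine in state q1 with aaaaaaaaaabbb on the input tape and SZ0 in the stack, about to apply S → AB.]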
Now, the parser, looking ahead 11 cells, sees aaaaaaaaaab. Since there is b at
the end, the machine chooses rule (1) (i.e., S → AB), rewrites the stack top S with
AB and outputs rule number (1) as shown in figure (a).
Let q, α , and β be, respectively, the current state, the remaining input portion
to read, and the current stack contents. From now on, for convenience we shall use
the triple (q, α , β ), called the configuration, instead of drawing the cumbersome
diagram to show the parser.
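G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c
[Figure: (a) applying rule (1), S → AB: the machine in state q1 with input aaaaaaaaaabbb and stack ABZ0; (b) the configuration (q, α, β), with α on the input tape and β in the stack.]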
G: S → AB | AC A → aaaaaaaaaa B → bB | b C → cC | c
Looking ahead 11 cells in the current configuration (q1, aaaaaaaaaabbb, SZ0), the
parser applies rule (1) by rewriting the stack top S with the rule’s right side AB.
Consequently, the configuration changes as follows (the number after an arrow is
the rule applied).
Look-ahead 11 cells: aaaaaaaaaab
(q0, aaaaaaaaaabbb, Z0) ⇒ (q1, aaaaaaaaaabbb, SZ0) ⇒(1) (q1, aaaaaaaaaabbb, ABZ0)
Now, with nonterminal symbol A at the stack top, the parser must find a rule to
apply. Since A has only one rule, i.e., rule (3), there is no choice. So, the parser
applies rule (3), consequently changing the configuration as follows.
(q1, aaaaaaaaaabbb, ABZ0) ⇒(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0)
G: S → AB | AC A → aaaaaaaaaa B → bB | b C → cC | c
Notice that the terminal symbol appearing at the stack top after applying rule
(3) corresponds to the leftmost terminal symbol appearing in the leftmost
derivation. Thus, the terminal symbol appearing at the stack top must match the
next input symbol, if the input string is generated by the grammar.
So, the parser, seeing a terminal symbol at the stack top, reads the input and, if
they match, pops the stack top. The following sequence of configurations shows
how the parser successfully pops all the terminal symbols pushed on the stack
top by applying rule (3).
(q0, aaaaaaaaaabbb, Z0) ⇒ (q1, aaaaaaaaaabbb, SZ0) ⇒(1) (q1, aaaaaaaaaabbb, ABZ0)
⇒(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⇒ . . . ⇒ (q1, abbb, aBZ0) ⇒ (q1, bbb, BZ0)
G: S → AB | AC A → aaaaaaaaaa B → bB | b C → cC | c
(q0, aaaaaaaaaabbb, Z0) ⇒ (q1, aaaaaaaaaabbb, SZ0) ⇒(1) (q1, aaaaaaaaaabbb, ABZ0) ⇒(3) . . . ⇒ (q1, bbb, BZ0)
Now, the parser must choose one of B’s rules, either (4) or (5). If only one b
remains on the input tape, rule (5) is the choice. Otherwise (i.e., if more
than one b remains), rule (4) must be applied. It follows that the parser needs to look
two cells ahead, and it proceeds as follows.
Look-ahead 2 cells:
(q1, bbb, BZ0) ⇒(4) (q1, bbb, bBZ0) ⇒ (q1, bb, BZ0) ⇒(4) (q1, bb, bBZ0) ⇒
(q1, b, BZ0) ⇒(5) (q1, b, bZ0) ⇒ (q1, ε, Z0)
In summary, our parser works as follows, where the bracketed part of the input
string is the look-ahead contents used to choose a rule and the number after an
arrow is the rule applied.
(q0, aaaaaaaaaabbb, Z0) ⇒ (q1, [aaaaaaaaaab]bb, SZ0) ⇒(1) (q1, aaaaaaaaaabbb, ABZ0) ⇒(3)
(q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⇒ . . . ⇒ (q1, abbb, aBZ0) ⇒
(q1, [bb]b, BZ0) ⇒(4) (q1, bbb, bBZ0) ⇒ (q1, [bb], BZ0) ⇒(4) (q1, bb, bBZ0) ⇒
(q1, [b], BZ0) ⇒(5) (q1, b, bZ0) ⇒ (q1, ε, Z0)
Notice that the last configuration above implies a successful parsing. It shows that
the sequence of rules applied on the stack generates exactly the same string as the one
originally written on the input tape. If the parser fails to reach the accepting
configuration, we say the input is rejected. In the above example, the sequence of rules
applied to the nonterminal symbols appearing at the stack top matches the sequence of
rules applied for the leftmost derivation of the input string shown below.
S ⇒(1) AB ⇒(3) aaaaaaaaaaB ⇒(4) aaaaaaaaaabB ⇒(4) aaaaaaaaaabbB ⇒(5) aaaaaaaaaabbb
G: S → AB | AC A → aaaaaaaaaa B → bB | b C → cC | c
For the other input strings, those ending with c’s, the parser can apply the same
strategy and successfully parse them by looking ahead at most 11 cells (see below).
This parser is called an LL(11) parser, named after the following properties: the
input is read Left-to-right, the order of rules applied matches the order
of the Leftmost derivation, and the longest look-ahead range is 11 cells. For a
grammar G, if we can build an LL(k) parser for some constant k, we call G an
LL(k) grammar.
(q0, aaaaaaaaaaccc, Z0) ⇒ (q1, aaaaaaaaaaccc, SZ0) ⇒(2) (q1, aaaaaaaaaaccc, ACZ0) ⇒(3)
(q1, aaaaaaaaaaccc, aaaaaaaaaaCZ0) ⇒ . . . ⇒ (q1, accc, aCZ0) ⇒
(q1, ccc, CZ0) ⇒(6) (q1, ccc, cCZ0) ⇒ (q1, cc, CZ0) ⇒(6) (q1, cc, cCZ0) ⇒
(q1, c, CZ0) ⇒(7) (q1, c, cZ0) ⇒ (q1, ε, Z0)
G: S → AB | AC A → aaaaaaaaaa B → bB | b C → cC | c
                 Contents of 11 look-ahead
Stack top   a^10 b   a^10 c   bbX^9   bB^10   ccX^9   cB^10   ε
S           AB       AC
A           a^10     a^10
B                             bB      b
C                                             cC      c
(X: don't care, B in a look-ahead: blank; a^10 denotes ten a's)
Parse Table
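The parse table can be turned into a small table-driven parser. The following Python sketch is one possible rendering under stated assumptions: the stack is a Python list whose empty state stands for Z0, a blank look-ahead cell is represented by the input simply ending, and the names choose and parse are illustrative rather than taken from the slides.

```python
A10 = "a" * 10
NONTERMINALS = "SABC"

def choose(top, look):
    """Select (rule number, right side) from the stack top and the
    look-ahead contents (at most 11 cells); None means no table entry."""
    if top == "S":
        if look == A10 + "b":
            return 1, "AB"
        if look == A10 + "c":
            return 2, "AC"
    elif top == "A":
        return 3, A10          # A has a single rule, so no choice is needed
    elif top == "B":
        if look[:2] == "bb":
            return 4, "bB"     # more than one b remains
        if look == "b":
            return 5, "b"      # exactly one b remains
    elif top == "C":
        if look[:2] == "cc":
            return 6, "cC"
        if look == "c":
            return 7, "c"
    return None

def parse(inp, k=11):
    """Return the list of rule numbers applied, or None if inp is rejected."""
    stack = ["S"]              # the end of the list is the stack top
    pos = 0                    # position of the input head
    applied = []
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            entry = choose(top, inp[pos:pos + k])
            if entry is None:
                return None
            number, rhs = entry
            applied.append(number)
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
        elif pos < len(inp) and inp[pos] == top:
            pos += 1                      # terminal matches: read and pop
        else:
            return None
    return applied if pos == len(inp) else None

print(parse(A10 + "bbb"))
print(parse(A10 + "ccc"))
```

For the two inputs traced in the text, the returned rule sequences are the ones applied in the leftmost derivations.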
Designing LL(k) Parser
G: (1) S → aSb   (2) S → aabbb
Pushing the start symbol S into the stack in the initial configuration, the parser
gets ready to parse the string as shown below. With S in the stack top, it must
apply one of S’s two rules. To choose one of them, the parser needs to look ahead
for supporting information. What could be the shortest range to look ahead?
(q0, aaaaabbbbbb, Z0) ⇒ (q1, aaaaabbbbbb, SZ0) ⇒ ?
Rules: (1) S → aSb   (2) S → aabbb
If aabbb lies ahead, rule (2) must be applied, so it appears that k = 5. But the parser does
not have to see the whole string. If aaa lies ahead, the leftmost symbol a must
have been generated by rule (1). Otherwise, if aab lies ahead, the leftmost a
must have been generated by rule (2). So it is enough to look ahead 3 cells (i.e., k = 3).
Thus, in the current configuration, since the 3 look-ahead contents are aaa, the
parser applies rule (1), then reads the input to match and pop the terminal symbol a
from the stack top as follows.
Look-ahead 3: aaa
(q1, aaaaabbbbbb, SZ0) ⇒(1) (q1, aaaaabbbbbb, aSbZ0) ⇒ (q1, aaaabbbbbb, SbZ0)
Again, with S on the stack top, the parser looks ahead 3 cells, and seeing aaa,
applies rule (1), and repeats the same procedure until it looks ahead aab as follows.
Rules: (1) S → aSb   (2) S → aabbb
(q1, aaaaabbbbbb, SZ0) ⇒(1) (q1, aaaaabbbbbb, aSbZ0) ⇒
(q1, aaaabbbbbb, SbZ0) ⇒(1) (q1, aaaabbbbbb, aSbbZ0) ⇒
(q1, aaabbbbbb, SbbZ0) ⇒(1) (q1, aaabbbbbb, aSbbbZ0) ⇒
Now, the parser finally applies rule (2) and keeps reading, matching, and
popping until it enters the accepting configuration as follows.
(q1, aabbbbbb, SbbbZ0) ⇒(2) (q1, aabbbbbb, aabbbbbbZ0) ⇒ … ⇒ (q1, ε, Z0)
The parser applied the rules in the order, (1), (1), (1), (2), which is the same
order applied for the leftmost derivation of the input string aaaaabbbbbb.
Given an arbitrary input string, the parser, applying the same procedure, will end
up in the final accepting configuration if and only if the input belongs to the
language of the grammar. The parser needs to look ahead at least 3 cells. Hence, the
grammar is LL(3). The parse table is shown below.
              3 look-ahead
Stack top     aaa      aab
S             aSb      aabbb
Rules: (1) S → aSb   (2) S → aabbb
Parse Table
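As a rough illustration, the LL(3) parse table can be encoded as a dictionary keyed by (stack top, 3 look-ahead). This is only a sketch: the names TABLE and parse_ll3 are assumptions for illustration, and the empty Python list plays the role of the bottom marker Z0.

```python
# Parse table for (1) S -> aSb and (2) S -> aabbb, keyed by 3 look-ahead.
TABLE = {
    ("S", "aaa"): (1, "aSb"),
    ("S", "aab"): (2, "aabbb"),
}

def parse_ll3(inp):
    """Return the rule numbers applied, or None if inp is rejected."""
    stack = ["S"]              # the end of the list is the stack top
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        if top == "S":
            entry = TABLE.get((top, inp[pos:pos + 3]))
            if entry is None:
                return None    # no table entry for this look-ahead
            number, rhs = entry
            applied.append(number)
            stack.extend(reversed(rhs))
        elif pos < len(inp) and inp[pos] == top:
            pos += 1           # match the terminal and pop it
        else:
            return None
    return applied if pos == len(inp) else None

print(parse_ll3("aaaaabbbbbb"))
```

The call on aaaaabbbbbb applies the rules in the same order as the walkthrough, and strings outside the language fall through to None.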
Example 2. Construct an LL(k) parser with minimum k for the following CFG.
(1) S → abA   (2) S → ε   (3) A → Saa   (4) A → b
If the input is not empty, the parser, with S at the stack top, should choose rule (1).
Then, as shown below, for each terminal symbol appearing at the stack top, the
parser reads the next input symbol and, if they match, pops the stack top, until A
appears. If the input tape were empty, the parser would simply pop S (i.e., rewrite S
with ε) and enter the accepting configuration. Now, with A at the stack top, the
parser should choose between rules (3) and (4).
(q1, ababaaaa, SZ0) ⇒(1) (q1, ababaaaa, abAZ0) ⇒ . . ⇒ (q1, abaaaa, AZ0) ⇒ ?
If rule (4) had been used to derive the input, the next input symbol would be b, not
a. Seeing a ahead, the parser applies rule (3) and consequently, having S on
the stack top as before, needs to look ahead to choose the next rule. Up to this point,
it appears that a look-ahead of 1 cell is sufficient.
(q1, abaaaa, AZ0) ⇒(3) (q1, abaaaa, SaaZ0) ⇒ ?
But this time, with S at the stack top, it is uncertain which rule to apply. Seeing
a ahead, the parser can apply either rule (1) or rule (2), because in either case the
parser will successfully match the stack top a with the next input symbol a (see
below). To resolve this uncertainty, the parser needs one more cell of look-ahead.
We could instead have the parser look down the stack, but we have
chosen to extend the look-ahead range, a straightforward solution. Later in this
chapter, we will discuss parsers which are allowed to look down the stack to some
finite depth.
Rules: (1) S → abA   (2) S → ε   (3) A → Saa   (4) A → b
(q1, abaaaa, SaaZ0) ⇒(1) (q1, abaaaa, abAaaZ0)
(q1, abaaaa, SaaZ0) ⇒(2) (q1, abaaaa, aaZ0)
Now, looking ab ahead in the extended range, which must be generated by rule
(1), the parser applies the rule and repeats the previous procedure as follows till S
appears at the stack top again.
(q1, abaaaa, SaaZ0) ⇒(1) (q1, abaaaa, abAaaZ0) ⇒ . . ⇒ (q1, aaaa, AaaZ0) ⇒(3)
(q1, aaaa, SaaaaZ0) ⇒ ?
Looking aa ahead with S on the stack top, the parser applies rule (2). Then, for
each a appearing at the stack top, it keeps reading the next input symbol, matching
them and popping the stack top, eventually entering the accepting configuration.
(q1, aaaa, SaaaaZ0) ⇒(2) (q1, aaaa, aaaaZ0) ⇒ . . . . ⇒ (q1, ε, Z0)
The input string that we have just examined is the one derived by applying rule
(2) last. For the other typical string ababbaa that can be derived by applying rule (4)
last, the LL(2) parser will parse it as follows.
(q1, ababbaa, SZ0) ⇒(1) (q1, ababbaa, abAZ0) ⇒ . . ⇒ (q1, abbaa, AZ0) ⇒(3)
(q1, abbaa, SaaZ0) ⇒(1) (q1, abbaa, abAaaZ0) ⇒ . . ⇒ (q1, baa, AaaZ0)
⇒(4) (q1, baa, baaZ0) ⇒ . . . . ⇒ (q1, ε, Z0)
From the analysis with the two parsing examples, we construct the following parse
table. (Notice that with A at the stack top, though 1 look-ahead is enough, the
entries are under the column of 2 look-ahead.)
              2 look-ahead
Stack top     ab       aa       bX       BB
S             abA      ε                 ε
A             Saa      Saa      b
B: blank, X: don't care
Rules: (1) S → abA   (2) S → ε   (3) A → Saa   (4) A → b
Parse Table
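The LL(2) table for Example 2 can be exercised with a short Python sketch. It assumes that a blank look-ahead (BB) shows up as an empty slice of the input and that the "don't care" cell in bX is handled by testing only the first look-ahead symbol; choose2 and parse_ll2 are illustrative names.

```python
def choose2(top, look):
    """Select (rule number, right side) from the stack top and the
    2 look-ahead contents; None means the table entry is empty."""
    if top == "S":
        if look == "ab":
            return 1, "abA"
        if look in ("aa", ""):      # "" plays the role of BB (blank input)
            return 2, ""            # S -> ε
    elif top == "A":
        if look in ("ab", "aa"):
            return 3, "Saa"
        if look.startswith("b"):    # bX: the second cell is "don't care"
            return 4, "b"
    return None

def parse_ll2(inp):
    """Return the rule numbers applied, or None if inp is rejected."""
    stack = ["S"]                   # the end of the list is the stack top
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        if top in "SA":
            entry = choose2(top, inp[pos:pos + 2])
            if entry is None:
                return None
            number, rhs = entry
            applied.append(number)
            stack.extend(reversed(rhs))
        elif pos < len(inp) and inp[pos] == top:
            pos += 1                # match the terminal and pop it
        else:
            return None
    return applied if pos == len(inp) else None

print(parse_ll2("ababaaaa"))
print(parse_ll2("ababbaa"))
```

The two calls correspond to the two strings analyzed above, one whose derivation ends with rule (2) and one whose derivation ends with rule (4).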
For a given input string, the basic strategy of LL(k) parsing is to generate the
same string on the top of the stack by rewriting every nonterminal symbol appearing
at the stack top with the right side of that nonterminal’s rule. If the nonterminal
symbol has more than one rule, the parser picks the right one based on the prefix of
the input string appearing on k cells looked ahead.
Whenever a terminal symbol appears on the stack top, the machine reads the next
input symbol and, if they match, pops the stack top. The sequence of rules applied in
a successful parse according to this strategy is the same as the one applied in the
leftmost derivation of the input string.
The class of CFG’s that can be parsed by LL(k) parsing strategy is limited. The
CFG G1 below is an example for which no LL(k) parser exists. However, G2, which
generates the same language, is an LL(k) grammar. We will shortly explain why.
G1: S → A | B   A → aA | 0   B → aB | 1
G2: S → aS | D   D → 0 | 1
L(G1) = L(G2) = {a^i t | i ≥ 0, t ∈ {0, 1}}
Consider the first working configuration illustrated below (with the start symbol S
on top of the stack.) The parser should choose one of S’s two rules, S→A and S
→B. But it is impossible to choose a correct rule, because the right end symbol 0 (or
1), which is essential for the correct choice, can be located arbitrarily far to the right.
It is impossible for any LL(k) parser to identify it ahead with its “telescope” of a
finite range k.
But for the grammar G2, we can easily design an LL(1) parser.
[Figure: the parser for G1 in state q1, with aaaa.....aa0 on the input tape.]
G1: S → A | B   A → aA | 0   B → aB | 1
G2: S → aS | D   D → 0 | 1
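To make the claim about G2 concrete, here is a minimal Python sketch of an LL(1) parser for it. The slides give no rule numbers for G2, so the sketch reports each applied rule as a string; the table layout and the names G2_TABLE and parse_g2 are assumptions for illustration.

```python
# One symbol of look-ahead suffices for G2: S -> aS | D, D -> 0 | 1.
G2_TABLE = {
    ("S", "a"): "aS",
    ("S", "0"): "D",
    ("S", "1"): "D",
    ("D", "0"): "0",
    ("D", "1"): "1",
}

def parse_g2(inp):
    """Return the rules applied (as strings), or None if inp is rejected."""
    stack = ["S"]                   # the end of the list is the stack top
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        if top in "SD":
            rhs = G2_TABLE.get((top, inp[pos:pos + 1]))
            if rhs is None:
                return None
            applied.append(f"{top}->{rhs}")
            stack.extend(reversed(rhs))
        elif pos < len(inp) and inp[pos] == top:
            pos += 1                # match the terminal and pop it
        else:
            return None
    return applied if pos == len(inp) else None

print(parse_g2("aaa0"))
```

Unlike G1, the choice at S never depends on the final 0 or 1: seeing a selects S → aS, and seeing 0 or 1 selects S → D.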
Definition of LL(k) grammars
We saw just now two CFG’s that generate the same language, but for the one,
no LL(k) parser exists, and for the other, we can design an LL(k) parser. So, we
may ask the following: What is the property of LL(k) grammars?
For a string x, let (k)x denote the prefix of length k of string x. If |x| < k, then
(k)x = x. For example, (2)ababaa = ab and (3)ab = ab.
Definition (LL(k) grammar). Let G = (VT, VN, P, S) be a CFG. Grammar G is an
LL(k) grammar if it satisfies the following condition. Consider two arbitrary
leftmost derivations of the following forms.
S ⇒* ω Aα ⇒ ω β α ⇒* ω y
S ⇒* ω Aα ⇒ ω γ α ⇒* ω x
where α, β, γ ∈ (VT ∪ VN)*, ω, x, y ∈ VT*, and A ∈ VN.
If (k) x = (k) y , then it must be that β = γ . That is, in the above two derivations,
the same rule of A should have been applied if (k) x = (k) y.
The above condition implies that with a nonterminal symbol A on the stack
top, the parser can identify which of A’s rules to apply by looking ahead k cells.
If G has such a property, we can build an LL(k) parser.
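The condition can be checked by hand against G1 (S → A | B, A → aA | 0, B → aB | 1). The Python sketch below is only an illustration of the definition, not a general LL(k) test: it takes ω = ε, A = S, and α = ε, and for any k produces two derived strings with the same k-prefix that require different rules of S. The names prefix and g1_conflict are hypothetical.

```python
def prefix(k, x):
    """(k)x: the prefix of length k of x, or x itself when |x| < k."""
    return x[:k]

def g1_conflict(k):
    """Return (x, y) with (k)x = (k)y, where x is derived via S -> A
    and y via S -> B, so no k-cell look-ahead can pick S's rule."""
    x = "a" * k + "0"    # S => A =>* a^k 0
    y = "a" * k + "1"    # S => B =>* a^k 1
    return x, y

for k in (1, 2, 3):
    x, y = g1_conflict(k)
    print(k, x, y, prefix(k, x) == prefix(k, y))
```

Since such a pair exists for every k, the definition fails for G1 at every k, which matches the earlier observation that no LL(k) parser exists for it.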