Sunteți pe pagina 1din 38

Top-down and Bottom-up Parsing - a

whirlwind
tour
Simplified Compiler Structure

Source
code
if (b == 0) a = b; Understand
source code Front end
(machine-
Intermediate code independent)

Optimize Optimizer
Intermediate code

Generate Back end


Assembly code (machine-
cmp $0,ecx
assembly code dependent)
cmovz edx,ecx
Simplified Front-End
Structure
Source code if (b == 0) a = b;
(character stream)
Lexical Analysis
Token
if ( b == 0 ) a = b ;
stream
Syntax Analysis
(Parsing)
if
Abstract Syntax == =
Tree (AST) b 0 a b
Semantic Analysis
Parse Tree vs. AST
• Parse tree also called “concrete syntax”
S Abstract
E + S Syntax Tree
Parse Tree ( S ) E +
(Concrete E + S 5 + 5
Syntax) 1 E + S 1 +
2 E 2 +
(S) 3 4
E+S Discards (abstracts)
E
3 4 unneeded information
How to build an AST
• Need to find a derivation for the
program in the grammar
• Want an efficient algorithm
– should only read token stream once
– exponential brute-force search out of
question
– even CKY is too slow
• Two main ways to parse:
– top-down parsing (recursive descent)
– bottom-up parsing (shift-reduce)
Parsing Top-down S→E+S |E
E → num | ( S )

Goal: construct a leftmost derivation of string


while reading in token stream
Partly-derived String Lookahead parsed part unparsed part
S ( (1+2+(3+4))+5
→ E+S ( (1+2+(3+4))+5
→ (S) +S 1 (1+2+(3+4))+5
→ (E+S)+S 1 (1+2+(3+4))+5
→ (1+S)+S 2 (1+2+(3+4))+5
→ (1+E+S)+S 2 (1+2+(3+4))+5
→ (1+2+S)+S 2 (1+2+(3+4))+5
→ (1+2+E)+S ( (1+2+(3+4))+5
→ (1+2+(S))+S 3 (1+2+(3+4))+5
→ (1+2+(E+S))+S 3 (1+2+(3+4))+5
Problem S→E+S |E
E → num | ( S )

• Want to decide which production to


apply based on next symbol

(1) S → E → (S) → (E) → (1)


(1)+2 S → E + S → (S) + S →(E) + S
→ (1)+E → (1)+2

• Why is this hard?


Grammar is Problem
• This grammar cannot be parsed top-down
with only a single look-ahead symbol
• Not LL(1) = Left-to-right-scanning, Left-
most derivation, 1 look-ahead symbol
• Is it LL(k) for some k?
• Can rewrite grammar to allow top-down
parsing: create LL(1) grammar for same
language
Making a grammar LL(1)

S → E+S
S → E • Problem: can’t decide which
E → num S production to apply until
E → (S)
we see symbol after first
expression
• Left-factoring: Factor
S → ES' common S prefix, add new
S' → ε non-terminal S' at decision
S' → + S
E → num point. S' derives (+E)*
E→(S)
Parsing with new grammar
S → ES ' S'→ε|+S E → num | ( S )

S ( (1+2+(3+4))+5
→ E S' ( (1+2+(3+4))+5
→ (S) S' 1 (1+2+(3+4))+5
→ (E S') S' 1 (1+2+(3+4))+5
→ (1 S') S' + (1+2+(3+4))+5
→ (1+E S' ) S' 2
(1+2+(3+4))+5
→ (1+2 S') S' + (1+2+(3+4))+5
→ (1+2 + S) S' ( (1+2+(3+4))+5
→ (1+2 + E S') S' ( (1+2+(3+4))+5
→ (1+2 + (S) S') S' 3 (1+2+(3+4))+5
→ (1+2 + (E S' ) S') S' 3
(1+2+(3+4))+5
→ (1+2 + (3 S') S') S' +
(1+2+(3+4))+5
→ (1+2 + (3 + E) S') S' 4 (1+2+(3+4))+5
Predictive Parsing

• LL(1) grammar:
– for a given non-terminal, the look-ahead
symbol uniquely determines the
production to apply
– top-down parsing = predictive parsing
– Driven by predictive parsing table of
non-terminals × terminals → productions
Using Table S→ES'
S' →ε | +S
E → num | ( S )

S ( (1+2+(3+4))+5
→ E S' ( (1+2+(3+4))+5
→ (S) S' 1 (1+2+(3+4))+5
→ (E S' ) S' 1 (1+2+(3+4))+5
→ (1 S') S' + (1+2+(3+4))+5
→ (1 + S) S' 2 (1+2+(3+4))+5
→ (1+E S' ) S' 2
(1+2+(3+4))+5
→ (1+2 S') S' + (1+2+(3+4))+5

num + ( ) $
S →ES' →ES'
S' → +S →ε →ε
E → num →(S)
How to Implement?
• Table can be converted easily into a
recursive-descent parser

num + ( ) $
S →ES' →ES'
S' → +S →ε →ε
E → num →(S)

• Three procedures: parse_S, parse_S’,


parse_E
Recursive-Descent Parser
void parse_S () { lookahead token
switch (token) {
case num: parse_E(); parse_S’(); return;
case ‘(’: parse_E(); parse_S’(); return;
default: throw new ParseError();
}
}

number + ( ) $
S → ES’ → ES’
S’ → +S →ε →ε
E → number →(S)
Recursive-Descent Parser
void parse_S’() {
switch (token) {
case ‘+’: token = input.read(); parse_S();
return;
case ‘)’: return;
case EOF: return;
default: throw new ParseError();
}
}
number + ( ) $
S → ES’ → ES’
S’ → +S →ε →ε
E → number →(S)
Recursive-Descent Parser
void parse_E() {
switch (token) {
case number: token = input.read(); return;
case ‘(‘: token = input.read(); parse_S();
if (token != ‘)’) throw new
ParseError();
token = input.read(); return;
default: throw new ParseError(); }
}
number + ( ) $
S → ES’ → ES’
S’ → +S →ε →ε
E → number →(S)
Call Tree = Parse Tree
(1 + 2 + (3 + 4)) + 5
S
parse_S E S’
( S ) +S
parse_E parse_S’
E S’ 5
parse_S parse_S 1 +S
parse_E parse_S’ E S’
2 + S
parse_S
E S’
parse_Eparse_S’ ( S ) ε
parse_S E S’
parse_Eparse_S’ 3 +S
parse_S E
4
How to Construct Parsing Tables

• There exists an algorithm for


automatically generating a
predictive parse table from a
grammar (take 412 for details)
N + ( ) $
S → ES’
S’ → ε | + S S ES’ ES’
E → number | ( S ) S’ +S ε ε
E N (S)
Summary for top-down
parsing
• LL(k) grammars
– left-to-right scanning
– leftmost derivation
– can determine what production to apply from
the next k symbols
– Can automatically build predictive parsing
tables
• Predictive parsers
– Can be easily built for LL(k) grammars from
the parsing tables
– Also called recursive-descent, or top-down
parsers
Top-Down Parsing Summary
Language grammar

Left-recursion elimination
Left-factoring

LL(1) grammar

predictive parsing table

recursive-descent parser

parser with AST generation


Now: Bottom-up Parsing
• A more powerful parsing technology

• LR grammars -- more expressive than


LL
– construct right-most derivation of program
– virtually all programming languages
– easier to express programming language
syntax

• Shift-reduce parsers
– Parsers for LR grammars
– automatic parser generators (e.g. yacc,CUP)
Bottom-up Parsing
• Right-most derivation -- backward
– Start with the tokens
– End with the start symbol S→ S+E|E
E → num | ( S )

(1+2+(3+4))+5 ← (E+2+(3+4))+5
← (S+2+(3+4))+5 ←(S+E+(3+4))+5
← (S+(3+4))+5 ← (S+(E+4))+5 ←(S+(S+4))+5
← (S+(S+E))+5 ← (S+(S))+5 ←(S+E)+5 ← (S)+5
← E+5 ← S+E ← S
(1+2+(3+4))+5 ←
Progress of Bottom-up
(1+2+(3+4))+5

Parsing (1
(E+2+(3+4))+5
+2+(3+4))+5
(S+2+(3+4))+5 ← (1
+2+(3+4))+5
(S+E+(3+4))+5 ← (1+2
right-most derivation

+(3+4))+5
(S+(3+4))+5 ← (1+2+(3 +4))+5
(S+(E+4))+5 ← (1+2+(3 +4))+5
(S+(S+4))+5 ← (1+2+(3 +4))+5
(S+(S+E))+5 ← (1+2+(3+4 ))+5
(S+(S))+5 ← (1+2+(3+4 ))
+5
(S+E)+5 ← (1+2+(3+4) )
+5
(S)+5 ← (1+2+(3+4) )
+5
E+5 ← (1+2+(3+4))
+5
S+E ← (1+2+(3+4))+5
Bottom-up Parsing
• (1+2+(3+4))+5 ← S→ S+E|E
(E+2+(3+4))+5 ← E → num | ( S )
(S+2+(3+4))+5 ←
S
(S+E+(3+4))+5 …
S + E
E 5
• Advantage of bottom-up
parsing: can postpone the ( S )
selection of productions until S + E
more of the input is scanned S+E (S)
E 2 S+E
1 E 4
3
Top-down Parsing
(1+2+(3+4))+5 S→ S+E|E
E → num | ( S )
S → S+E → E+E → (S)+E →
(S+E)+E
S
→ (S+E+E)+E →(E+E+E)+E
S + E
→ (1+E+E)+E → (1+2+E)+E ...
E 5
• In left-most derivation, ( S )
entire tree above a token S + E
(2) has been expanded S+E (S)
when encountered E 2 S+E
1 E 4
3
Top-down vs. Bottom-up
Bottom-up: Don’t need to figure out as much of
the parse tree for a given amount of input

scanned scanned
unscanned unscanned

Top-down Bottom-up
Shift-reduce Parsing
• Parsing actions: is a sequence of shift and reduce
operations
• Parser state: a stack of terminals and non-
terminals (grows to the right)

•Derivation
Current derivation
step step = always
stack stack+input
unconsumed
input
(1+2+(3+4))+5 ←
(1+2+(3+4))+5
(E+2+(3+4))+5 ← (E
+2+(3+4))+5
(S+2+(3+4))+5 ← (S
+2+(3+4))+5
(S+E+(3+4))+5 ← (S+E
+(3+4))+5
Shift-reduce Parsing
• Parsing is a sequence of shifts and reduces
• Shift : move look-ahead token to stack
stack input action
( 1+2+(3+4))+5 shift 1
(1 +2+(3+4))+5
• Reduce : Replace symbols γ from top of stack
with non-terminal symbol X, corresponding to
production X → γ (pop γ, push X)
stack input action
(S+E +(3+4))+5 reduce S
→ S+E
(S +(3+4))+5
Shift-reduce Parsing
S→ S+E|E
E → num | ( S )

derivation stack input stream action


(1+2+(3+4))+5 ← (1+2+(3+4))+5 shift
(1+2+(3+4))+5 ← ( 1+2+(3+4))+5 shift
(1+2+(3+4))+5 ← (1 +2+(3+4))+5 reduce E
→num
(E+2+(3+4))+5 ← (E +2+(3+4))+5 reduce S
→E
(S+2+(3+4))+5 ← (S +2+(3+4))+5 shift
(S+2+(3+4))+5 ← (S+ 2+(3+4))+5 shift
(S+2+(3+4))+5 ← (S+2 +(3+4))+5 reduce E
→num
(S+E+(3+4))+5 ← (S+E +(3+4))+5 reduce S
→ S+ E
(S+(3+4))+5 ← (S +(3+4))+5 shift
(S+(3+4))+5 ← (S+ (3+4))+5 shift
(S+(3+4))+5 ← (S+( 3+4))+5 shift
(S+(3+4))+5 ← (S+(3 +4))+5 reduce E
→num
Problem
• How do we know which action to
take: whether to shift or reduce, and
which production?

• Issues:
– Sometimes can reduce but shouldn’t
– Sometimes can reduce in different ways
Action Selection Problem
• Given stack σ and look-ahead symbol b,
should parser:
– shift b onto the stack (making it σb)
– reduce X → γ assuming that stack has
the form α γ (making it αX)

• If stack has form α γ, should apply


reduction X → γ (or shift) depending on
stack prefix α
α is different for different possible
reductions, since γ’s have different length.
LR Parsing Engine
• Basic mechanism:
– Use a set of parser states
– Use a stack with alternating symbols and
states
• E.g: 1 ( 6 S 10 + 5
– Use a parsing table to:
• Determine what action to apply
(shift/reduce)
• Determine the next state

• The parser actions can be precisely


determined from the table
The LR Parsing Table
Terminals Non-terminals

Next action Next


State and next state state

Action table Goto table

• Algorithm: look at entry for current state S and input


terminal C
If Table[S,C] = s(S’) then shift:
push(C), push(S’)
If Table[S,C] = X→α then reduce:
pop(2*|α|), S’=top(), push(X), push(Table[S’,X])
LR Parsing Table Example
( ) id , $ S L
1 s3 s2 g4
2 S→id S→id S→id S→id S→id
3 s3 s2 g7
g5
4 accept
5 s6 s8
6 S→(L) S→(L) S→(L) S→(L) S→(L)
7 L→S L→S L→S L→S L→S
8 s3 s2 g9
9 L→L,S L→L,S L→L,S L→L,S L→L,S
LR(k) Grammars

• LR(k) = Left-to-right scanning, Right-most


derivation, k look-ahead characters

• Main cases: LR(0), LR(1), and some


variations (SLR and LALR(1))

• Parsers for LR(0) Grammars:


– Determine the actions without any lookahead
symbol
Building LR(0) Parsing Tables
• To build the parsing table:
– Define states of the parser
– Build a DFA to describe the transitions
between states
– Use the DFA to build the parsing table
Summary for bottom-up
parsing
• LR(k) grammars
– left-to-right scanning
– rightmost derivation
– can determine whether to shift or
reduce from the next k symbols
– Can automatically build predictive
parsing tables
• Shift-reduce parsers
– Can be built for LR(k) grammars using
automated parser generator tools, eg.
CUP, yacc.
Top-down vs. Bottom-up
again
LL(k), recursive descent LR(k), shift-
reduce

scanned scanned
unscanned unscanned

Top-down Bottom-up

S-ar putea să vă placă și