Crafting An Interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodePr

Crafting an interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodeProject
https://workspaces.codeproject.com/vrosnet/crafting-an-interpreter-part-3-parse-trees-and-syn[19/06/2014 4:46:28 AM]

Home VROSNET Crafting an interpreter Part 3 - Parse Trees and Syntax Trees
Workspace home
Co
Ta
Arti
Add app

Crafting an interpreter Part 3 - Parse
Trees and Syntax Trees
by: VROSNET forked from: martinholzherr / crafting-an-interpreter-part-3-parse-trees-and-syn
Third article on building a language interpreter describing the generation of parse trees and syntax trees. (Back to
article)

0

Share with others
Congratulations! Your download is complete
You now have a new workspace in the cloud!
Your downloaded code is ready to be worked on and shared. You can now access your
downloaded code from anywhere at any time.
What can I do with this Workspace?
Connect directly via Visual Studio
Make changes to the code without affecting the authors original
Issue a pull request to the author to alert them of fixes or updates youve made
View your workspace
This workspace will provide you a snapshot of
the download. If the author updates or
changes the code (uploads zip or push it over
git) then you can update this snapshot easily
or keep your version locked to ensure stability.
Learn More
Check out our selection of articles to get you
started!
Working with ::GitMachine
Introduction to ::Tasks
Introduction to git for Windows users
Migrating a SVN Repository to Git using
::GitMachine
Customizing ::Tasks - tips and tricks

CREATE A
WORKSPACE
EXPLORE
WORKSPACES
My favorites
Nothing to display
My workspaces
Nothing to display
Shared with me
Nothing to display
WCode WTasks
Forum

VROSNET Sign out S home workspaces premium account
Submit
Download demo - 66 Kb
Download PegWizard source (V0.01) - 19.2 Kb
Introduction
In this third article on building an interpreter, I will show how to generate a physical tree from the source text, which
must be parsed. Building a physical tree from the source text has many advantages: it allows detailed analysis,
transformations and optimizations of the original source. A tree together with symbol tables compiled from tree
visits can even serve as basis for a simple interpreter.
Tree generation is particularly easy for a recursive descent parser, since a recursive descent parser consists of a set of
functions - one for each grammar rule. To generate a tree, the grammar-rule function must additionally add a new
child to the already built tree, where the new child represents the source part matched by the corresponding
grammar rule.
With a tree, language analysis and code generation becomes much easier. But despite this and the fact, that it is easy
to generate such a physical tree with most parsers, early compilers (e.g. the first Pascal or C compilers) did not
generate such a tree. This is due to the fact, that the overhead in memory usage can be overwhelming - and
memory is a resource which was much more precious in the early days of computing.
In a PEG parser [1][2] as used in this article series, tree generation is complicated by backtracking. In case of a parsing
failure, the part of the tree belonging to the failed rule must be removed again. Exceptions during parsing must be
handled exactly like backtracking.
As a useful application of tree generation, I will present the core of a parser wizard, which takes a set of grammar
rules in source text form and generates the skeleton of a parser for this grammar. Without having a parse tree, such
a parser wizard (simple form of parser generator) would be difficult to build.
The Tree advantage
The following table shows some problems which can be easily solved when having a physical tree representing the
source text, but which are difficult otherwise.
Task Source Tree Solution Advantage of tree representation
Optimize
Code
arithmetic
expression:
x= x + 1
arithmetic expression optimized:
x += 1
Easy to recognize when looking from the
parent node.
Recognize
Out of
Order
Elements
PEG grammar
rule:
Int
(
('+' /
'-') Int
)*
template implementation of PEG rule:
And<
Int,
OptRepeat<
And<
OneOfChars<'+','-'>,
Int
>
>
>
To automatically translate Int(('+'/'-')
Int)* into the template shown under Tree
Solution, a tree representation can help very
much. The problem here: * appears at the
end, but must be placed as OptRepeat at
the front. Easy to handle when looking
from the parent node.
Interpret
loop
while loop:
while(i>0)
{
j+= i;i--
;
}
using tree evaluating function calls
while( ConditionTrue(t) ){
EvalStat(t->child);
}
A tree can be easily evaluated by passing
sibling and child nodes to evaluation
functions.
Multiple
Passes
any language
source
using tree walks starting at
pContext.tree
if(G::Matches(pIn,&pContext))
{
pContext.Analyze();
pContext.OptimizeTree();
return pContext.GenCode();
}
Multiple passes over the same source
without the need for reparsing.
Parse Trees and Syntax Trees
A parse tree uses one physical tree node per nonterminal, what usually results in huge trees. A syntax tree, often
called abstract syntax tree or abbreviated AST is a parse tree where most nonterminals have been removed. The
difference is memory usage as the comparison of the parse and the syntax tree for the following PEG grammar
shows:
//PEG grammar with start rule expr.
//Note: this grammar does not accept any white space
expr : mexpr (('+'/'-') mexpr )*;
mexpr: atom ('*' atom )*;
atom : [0-9]+ / '(' expr ')';
sample expr Parse Tree Abstract Syntax Tree
2*(3+5)
expr<
mexpr<
atom<'2'>
'*'
atom<
expr<
mexpr<atom<'3'>>
'+'
mexpr<atom<'5'>>
>
>
>
>
mexpr<'2' '*' expr<'3' '+' '5'>>
Storage usage: 8 bytes Storage usage: >= 14*12=168 bytes Storage usage: >= 7*12 =84 bytes
Compared to the source text, any tree representation will use much more memory, but parse trees can be especially
huge. Tree building also takes much more time than just recognizing the input as shown by the following output
taken from the demo program coming with this article:
-p
>> parse a 100000 char buffer filled with 2*(3+5)
>> CPU usage ( Syntax tree with 87504 nodes built) = 0.17 seconds
>> CPU usage ( Parse tree with 175003 nodes built) = 0.211 seconds
>> CPU usage ( Expression recognizer) = 0.03 seconds
A tenfold increase in parsing time compared to a simple recognizer is not uncommon.
A tree node typically has at least the following members:
two pointer members - one pointing to the first child, other pointing to the sibling.
a member used for node identification
a virtual destructor in order to support derived nodes.
My current PEG implementation has the following node interface:
template<typename It >
struct TNode{
typedef TNode<It> Self;
TNode (int id=0, It beg=0, It end=0, Self* firstChild=0);
virtual ~TNode (){} //support derived nodes
virtual Self* AddChild (TNode* pChild);

int id;
Self *next,*child;
BegEnd<It> match; //references the source text matched by this node
};
Integrating Tree Generation into a PEG Parser
Tree generation can be added to the template based implementation of a PEG parser just by the use of a handful of
tree generating templates. To use these templates, the Context structure passed to the Matches function must be
derived from the predefined TreeContext. Each of these tree generating templates has a type parameter
responsible for recognizing a PEG fragment and an optional integer value parameter used as identifying key for the
generated tree node. The following table lists the available tree generating templates.
Tree generating template Meaning, Effect
template<int id,typename
TRule> struct TreeNTWithId
Builds a tree node representing the rule TRule. This node gets the ID
id. All tree nodes created inside TRule will be descendants of this
tree node.
TRule> struct AstNTWithId
Same effect as TreeNTWithId, but with automatic replacement by
the child node if the child node has no sibling.
template<typename TRule>
struct TreeNT
Same effect as TreeNTWithId, but the id set to eTreeNT.
template<typename TRule>
struct AstNT
Same effect as AstNTWithId, with the id set to eTreeNT.
TTerminalRule> struct
TreeCharsWithId
Builds a tree node having the source text recognized by
TTerminalRule as match string. This node gets the ID id.
TTerminalRule> struct
TreeChars
Same effect as TreeCharsWithId, with the id set to
eTreeSpecChar.
template<typename T0,
typename T1, ...> struct
TreeSafeAnd
Replacement for And in case of using tree templates. The use of
TreeSafeAnd guarantees correct behaviour in case of backtracking.
The practical usage of this tree generating templates will be shown with a grammar for expressions supporting the
operations '+', '-', '*', where parentheses can be used around subexpressions.
Topic Recognizer Grammar Parse Tree Grammar
Grammar Specification
expr :
mexpr (('+'/'-') mexpr )*;
mexpr: atom ('*' atom )*;
atom : [0-9]+ / '(' expr ')';
// ^ means: generate tree
// node for next terminal
expr : mexpr (^('+'/'-') mexpr )*;
mexpr: atom (^'*' atom )*;
atom : ^([0-9]+) / '(' expr ')';
Rules as Types
using namespace peg_templ;
struct expr{
typedef
//atom : [0-9]+ / '(' expr ')';
Or<
PlusRepeat<In<'0',
'9' > >,
And<Char<'('>, expr,
Char<')'> >
> atom;
typedef
//mexpr: atom ('*' atom )*;
And<
atom,
OptRepeat<
And<Char<'*'>, atom >
>
> mexpr;
typedef
//expr : mexpr (('+'/'-')
mexpr )*;
And<
mexpr,
OptRepeat<
And<
Or<Char<'+'>,
Char<'-'> >,
mexpr
>
>
> expr_rule;
template<
typename It,
typename Context
>
static bool
Matches(It& p,Context* pC)
{
return
expr_rule::Matches(p,pC);
}
};
using namespace peg_tree;
enum{eAtom,eMExpr,eInt,eExpr};
struct exprT{
typedef
TreeNTWithId<eAtom,
Or<
TreeCharsWithId<eInt,
PlusRepeat<In<'0', '9' > >
>,
And<Char<'('>,
exprT, Char<')'> >
>
> atom;
typedef
TreeNTWithId<eMExpr,
And<
atom,
OptRepeat<
TreeSafeAnd<TreeChars
<Char<'*'> >, atom >
>
>
> mexpr;
typedef
TreeNTWithId<eExpr,
And<
mexpr,
OptRepeat<
TreeSafeAnd<
TreeChars<Or<Char
<'+'>, Char<'-'> > >,
mexpr
>
>
>
> exprT_rule;
template<
typename It,
typename Context
>
static bool
Matches(It& p,Context* pC)
{
return
exprT_rule::Matches(p,pC);
}
};
struct EmptyContext{};
bool CallRecognizingParser()
{
EmptyContext c;
typedef const
bool CallTreeBuildingParser()
{
typedef const
unsigned char* Iterator;
//either use TreeContext
Calling the Parser
unsigned char* Iterator;
const char* sTest= "2*(3+5)";
Iterator it= (Iterator)sTest;
return expr::Matches(it,&c);
}
//or derived class
peg_tree::TreeContext
<Iterator> tc;
const char* sTest= "2*(3+5)";
Iterator it= (Iterator)sTest;
return exprT::Matches(it,&tc);
}
The following chapters will introduce Parser Generators and Parser Wizards as examples of complex tree building
parsers. A Parser Generator reads a grammar specification and outputs a Parser. A Parser generator typically builds
a tree representation of the grammar specification provided by the user, because it must analyze and transform the
user input before it can generate the appropriate parser.
Parser Generators
A well known product category in the parser business are the so called Parser Generators like e.g., YACC and its
successors or to name a more recent development ANTLR. They take an annotated grammar in text form and
generate a parser for it. These annotated grammars contain not only the grammar rules (using a notation close to what I
introduced to explain PEG parsers), but also the code (the so called semantic actions), which shall be executed by the
generated parser when it passes the annotation point. This is shown in the following excerpt from the ANTLR
documentation [3] for a grammar which corresponds to the following PEG grammar:
//PEG grammar for expressions supporting
//the operations '+','-','*' on integers.
expr : S mexpr S(('+'/'-') S mexpr S)*;
mexpr: atom S('*' S atom S)*;
atom : [0-9]+ / '(' expr ')';
S : [ \t\v\r\n]*;
ANTLR parser specification for expression recognizer
(semantic actions bold)
ANTLR parser specification for expression
parser with semantic actions (semantic actions
bold)
//Note, that ANTLR uses a separate scanner, PLUS below therefore
corresponds to '+' which has been defined in the scanner
specification.
class ExprParser extends Parser;
expr: mexpr ((PLUS|MINUS) mexpr)*
;
mexpr
: atom (STAR atom)*
;
atom: INT
| LPAREN expr RPAREN
class ExprParser extends Parser;
expr returns [int value=0]
{int x;}
: value=mexpr
( PLUS x=mexpr {value += x;}
| MINUS x=mexpr {value -= x;}
)*
;
mexpr returns [int value=0]
{int x;}
: value=atom
( STAR x=atom {value *= x;} )*
;
atom returns [int value=0]
: i:INT {value=
Integer.parseInt(i.getText());}
| LPAREN value=expr RPAREN
;
Typically, a parser generator copies the code annotations (the bold parts in the ANTLR spec. above) to the
appropriate places in the generated parser. The next step is to compile the generated code (using e.g., your Java
compiler). In case of a compile error or an unexpected behaviour when testing the parser, you either go back editing
the original grammar specification or you edit the generated parser code directly. In the second case, you'll break
the relation between the specification and the generated parser. But sooner or later you will do so and when using
ANTLR you will have the luck, that the generated code is quite readable. Unfortunately the same is not true for YACC
type parsers, which generate really impenetrable code. Stroustrap, the inventor of C++, experienced this when using
YACC for his Cfront C++ front-end. "In retrospect, for me and C++ it was a bad mistake." [to use YACC].. "Cfront has
a YACC parser supplemented by much lexical trickery relying on recursive descent techniques."[4]
But I believe, that even with integrated parsers like boost::spirit or Perl 6, something similar to a parser
generator can still be useful. Such a tool, which generates a starter parsing application, has a similar purpose as e.g.,
the MFC application wizard, namely to generate the initial objects and functions, which then will be edited by the
programmer.
The PegWizard Parser Wizard
The PegWizard parser wizard has a similar job profile as the MFC Application Wizard:
The MFC Application Wizard generates an application having built-in functionality, that, when
compiled, implements the basic features of a Windows executable application. The MFC starter
application includes C++ source (.cpp), resource files, header (.h) files, and a project (.vcproj) file.
The code generated in these starter files is based on MFC.[5]
Translated to the case of the PegWizard Parser Wizard:
The PegWizard Parser Wizard generates an application having built-in functionality, that, when
compiled, implements the basic features of a parser application. The PegWizard starter application
includes C++ source (.cpp) and header (.h) files. The code generated in these starter files is based on
the PEG parsing methodology.
The PegWizard parser wizard should do the following:
Meaningful Parser Wizard Functionality Implementation Status
Check completeness of grammar: for each used Nonterminal, there should be a
corresponding rule.
Implemented in
PegWizard 0.01
Generate Recognizers for arbitrary PEG grammar.
Implemented in
PegWizard 0.01
Check for left recursive grammar rules (serious error). Planned for version 0.02
Generate Tree Builder for arbitrary PEG grammar. Planned for version 0.02
Support error annotations in PEG grammar. Planned for version 0.02
Support File Iterators, Unicode Iterators.. Planned for version 1.00
Support for automatic error annotations. Planned for version 1.00
The following table shows a sample usage of the PegWizard parser wizard V0.01.00:
command line activation PegWizard -g TestGrammar\expr.gram -o TestGrammar
program output on command line
Peg0Wizard 0.01.00 (alpha) [20050508]
INFO: file content of 'TestGrammar\expr.gram' read in
INFO: saved the generated parser in 'TestGrammar\expr.cpp'
program ended successfully
content of 'expr.gram' file
//PEG grammar for expressions supporting
//the operations '+','-','*' on integers.
expr : S mexpr S(('+'/'-') S mexpr S)*;
mexpr: atom S('*' S atom S)*;
atom : [0-9]+ / '(' expr ')';
S : [ \t\v\r\n]*;
code excerpt of generated parser
#include "peg_templ.h"
#include <iostream>
using namespace peg_templ;
struct expr; //forward declarations
struct mexpr;
struct atom;
struct S;
struct expr{ //grammar rule implementation
typedef And<
S,
mexpr,
S,
...
test run output '2*(3+5)' is matched by grammar
The input grammar file must adhere to the following PEG grammar:
Grammar: S GrammarRule (S GrammarRule)* S ;
GrammarRule: Nonterminal S ':' S RSideOfRule ;
RSideOfRule: Alternative (S '/' S Alternative)* S ';' ;
Alternative: Term (S Term)* ;
Term: ('&' / '!')? S Factor S ( '*' / '+' / '?')? ;
Factor: (Nonterminal / Terminal / '(' S RSideOfRule S ')' ) ;
Nonterminal: [a-zA-Z0-9_]+ ;
Terminal: LiteralString | Charset | '.';
Charset: '[' (CharsetRange / EscapeChar / CharsetChar )+ ']' ;
LiteralString: '\'' ( EscapeChar / !'\'' PrintChar )+ '\'' ;
CharsetChar: !']' PrintChar ;
EscapeChar: '\\' ( HexChar / PrintChar ) ;
CharsetRange: ( EscapeChar / CharsetChar ) '-' ( EscapeChar / CharsetChar ) ;
HexChar: '(x|X)(' Hexdigit+ ')' ;
S: ([ \v\t\r\n] | '//' [ \v\r\t]* '\n')* ;
PrintChar: //determined by 'isprint(c)' ;
Why the PegWizard internally builds a Parse Tree
The PegWizard reads a PEG grammar file and generates the corresponding C++ source file. This would be difficult
without having a parse tree, because a PEG grammar uses postfix notation (e.g. the quantifiers '*', '?', '+' appear at
the end of the construct to which they apply), but the generated code uses prefix operators. Furthermore, code
generation is only one of many tasks of the PegWizard. Other tasks are checking for left recursive rules, checking for
completeness of the grammar, automatic insertion of error handling code and so on. Without having a tree, this
would require reparsing the grammar file.
Generally speaking, a parse or syntax tree makes almost anything easier. The only disadvantages of a physical tree
are the high runtime and memory costs. But in the case of the PegWizard this does not matter, because grammar
files are in general quite small. These days a language compiler builds almost certainly a physical tree out of the
source text. Cfront [4], Stroustrap's C front end, e.g., builds tree structures for all global elements and for the current
function.
Future Directions
The PegWizard presented in this article as a tree building parser will be improved and shall soon be a useful tool.
Currently, it only generates recognizing parsers.
Besides teaching the subject - namely parsing - a goal of this ongoing article series is to build highly efficient parsers.
This will be investigated by comparing hand crafted parsers with schematically constructed ones.
Take-Home Points
It is easy to build syntax trees or parse trees during parsing. A syntax tree makes life (parsing and the later steps)
much easier, but has its costs. When parsing speed matters and it seems feasible to go without a syntax tree, this
should be preferred.
Advertise | Privacy | Pricing | Report problem | Forum | Twitter Copyright CodeProject, 1999-2014

All Rights Reserved. Terms of Use
layout: fixed | fluid
History
2005 May: initial version.
References
1. Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
2. Parsing Expression Grammars, Wikipedia, the free encyclopedia [^]
3. ANTLR Parser Generator [^]
4. Bjarne Stroustrap, The Design and Evolution of C++, Addison Wesley 1995.
5. MFC Application Wizard [^]

Crafting An Interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodePr

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Crafting An Interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodePr

Încărcat de

Drepturi de autor:

Formate disponibile

Crafting an interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodeProject

https://workspaces.codeproject.com/vrosnet/crafting-an-interpreter-part-3-parse-trees-and-syn[19/06/2014 4:46:28 AM]

S-ar putea să vă placă și