
How to create pragmatic, lightweight languages

The Unix philosophy applied to language design, for GPLs and DSLs

Federico Tomassetti
This book is for sale at http://leanpub.com/create_languages

This version was published on 2017-05-03

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2016 - 2017 Federico Tomassetti


Contents

1. Motivation: why do you want to build language tools?
Why to create a new language?
Why to invest in tools for languages?
Summary

2. The general plan
Philosophy
How the different tools are related
Technology used for the examples

2. The example languages we are going to build
MiniCalc
StaMac

Part I: the basics

4. Writing a lexer
Why using ANTLR?
The plan
Configuration
The Lexer grammar for MiniCalc
The Lexer grammar for StaMac
Testing
Summary

5. Writing a parser
The parser grammar for MiniCalc
The parser grammar for StaMac
Testing
Summary

6. Mapping: from the parse-tree to the Abstract Syntax Tree
General support for the Abstract Syntax Tree
Defining the metamodel of the Abstract Syntax Tree
Mapping the parse tree into the Abstract Syntax Tree
Testing the mapping
Summary

7. Symbol resolution
Example: reference to a value in Java
Example: reference to a type in Java
Example: reference to a method in Java
Resolving symbols in MiniCalc
Resolving symbols in StaMac
Testing the symbol resolution
Summary

8. Typesystem
Types
Typesystem rules
Let's see the code
Typesystem for MiniCalc
Typesystem for StaMac
Summary

9. Validation
Validation for MiniCalc
Validation for StaMac
Summary

Part II: compiling

10. Build an interpreter
What you need to build an interpreter
Let's see the code
Summary

11. Generate JVM bytecode
The Java Virtual Machine
The main instructions
Code
Summary

12. Generate LLVM bitcode

Part III: editing support

13. Syntax highlighting

14. Auto completion

Write to me
1. Motivation: why do you want to build language tools?
In this book we are going to see how to build tools to support languages.
There are two different scenarios in which you may want to do that:

1. you want to create a new language: maybe a general purpose language (GPL), maybe a
domain specific language (DSL). In any case you may want to build some support for this
language of yours. Maybe you want to generate C and compile the generated code, maybe you
want to interpret it. Maybe you want to build a compiler or a simulator for your language. Or
you want to do all of this stuff and more.
2. you want to create an additional tool for an existing language. Do you want to perform
static analysis on your Java code? Or build a translator from Python to JavaScript? Maybe a
web editor for some lesser-known language?

In both scenarios building the right tools to support a language can make a difference. The tools you
use can make or break your experience with that language. They can support your programming, or
any other kind of intellectual activity you carry out using your language, or they can instead hinder
your every move and put all sorts of limitations on what you can achieve.

Why to create a new language?


This really depends on the nature of the language you want to create. The basic distinction is about
the domains you care about: do you want to build a General Purpose Language or a Domain Specific
Language?
A General Purpose Language (GPL) is a language which can be used to build all sorts of applications.
Examples are C, Java, Kotlin, Haskell, Lisp, Ruby, Python, C# and more.
A Domain Specific Language (DSL) is instead a language created to serve a single purpose. The
advantage is that the language can be very good at solving a certain kind of problem. Examples
are CSS, HTML, dot (the language used by Graphviz), and SQL.
I think there are good reasons that apply to both cases:

1. It will be a lot of fun
2. You will learn a lot by creating a language

These reasons apply even if your language ends up not being that useful. However, we are
pragmatic people, aren't we? So how will building languages affect our productivity?

The case for Domain Specific Languages is very easy to defend: if you create a language for a specific
goal you will end up with a language tailored for a set of tasks, and it will arguably be better at
supporting those than some generic language. Think about some notable DSLs like HTML, LaTeX,
or SQL. You could define documents using a program written in C that draws on the screen the
information you want to display, or that generates a PDF document to distribute. While C could
also be used to write this kind of application, it is not a language designed for this goal, so it
will be more complicated to use for it than HTML or LaTeX. People would be required to
learn much more to write simple documents. Also, many more things could go wrong: when writing an
HTML document it is pretty difficult to cause a memory leak or to dereference a null pointer. You have
less power, but also fewer things to consider. You can analyze more easily the things you write with
your DSL. You can build special tools for the very specific tasks you need to accomplish with your
language.
The case for General Purpose Languages is different: building a language that is better than the
existing General Purpose Languages is not an easy task. Many try and end up with languages that are
not powerful enough, or give up when they realize that designing a language is far from easy.
On the other hand, from time to time someone succeeds and builds a GPL that works well for them.
Or for their team. Or for a small community. Or a language that changes how we program. Think
about the influence a language like Ruby had in the last decade. A language can make a difference,
or can be better than the existing ones, for many different reasons. You can build a GPL that is really
good at a specific aspect, like Go, which is famous for being good at concurrency and is thus a
good choice for networking. But there are also other good reasons in addition to the technical ones,
such as educational or artistic ones. This is typically the case for esoteric programming
languages: languages created to make a point. Building a GPL is one of those challenges which
attract a significant percentage of the most talented developers. Maybe you are one of them, or
you aspire to be one of them, and you want to give it a try. If this challenge appeals to you, even if
you don't leave a mark in the history of computer science you are still going to have fun.
There is a good reason for building a language today that was not true before: it's easier than ever.
The barriers to creating a language, and making it usable by sane people, are now significantly lower
than they used to be. In this book I will try to demonstrate why I think this is the case.
First of all there are ecosystems like the JVM or the CLR: if you build your language to be compatible
with one of those you get access to tons of libraries from day one. Frameworks like LLVM also make it
possible to build very efficient languages with much less effort than was required in the past.
There are great frameworks and libraries you can reuse. In this book, for example, we are going to
use ANTLR to generate most of our lexers and parsers. You can also build your editor as a plugin for
well-known IDEs like Eclipse or IntelliJ IDEA.
So it is a great time to take your first steps as a Language Engineer.

Why to invest in tools for languages?


Depending on the nature of the tools they can either be absolutely necessary to use a language (e.g.,
you probably need a compiler or an interpreter for your language) or just extremely useful (an editor
with syntax highlighting and auto-completion).
If you are trying to build a user base for your language you need to offer great tool support. There
are a lot of competitors out there, and an inferior language with much better tool support will beat
your creation hands down, every single time. People expect build systems, compilers, editors: the
whole set of stuff. If you want to give your language a fighting chance you need to provide that. If
you want to build all the tool support by yourself it means that you have to become very productive
at writing tools. And you have to be smart about it, because you cannot afford to
take decades to build a basic editor or a decent compiler. You just can't.
Tools can also be power-ups for existing languages, languages that you can already use decently
enough. Perhaps you can use static analysis to catch more errors, or you can build editors that provide
documentation on the fly and autocompletion, making you code faster. You can generate
documentation or some executable format, or build a converter to some other technology. You can build
tools to perform automated refactoring for a language. There are all sorts of things you can do when
you know how to manipulate source code programmatically.
If you are using a common language, like Java or C#, you can build tools to improve your productivity
with that language. You could build a simple tool that reformats the code for you, or one that checks
for typical errors your co-workers make. You could build tools that perform smart refactoring for you:
for example, a tool that updates your code to work with a new version of a framework. Or, again, tools that
analyze your code and find duplicate code. Once you know how to build tools to process code you
start seeing all sorts of new possibilities.

Summary
We have seen that there are different reasons to build language machinery and different things that
can be achieved. However there are some common tools and principles that are shared. We will look
into those and see how to apply them pragmatically to get concrete results. At the end of the book
you should have learnt an approach that you can adapt, to produce systems that you can understand
and extend.
2. The general plan
In this book we are going to see how to build machinery for your languages.
These include:

parsers
compilers
code generators
static analysis tools
editors
simulators

In other words we are going to see how to implement all sorts of tools that make working with
a language productive.
We are not going to discuss in detail how to design languages. While there will be some comments
here and there in the book, I think there is no better way to learn design principles than by building
things. So you shouldn't expect theoretical dissertations on the merits of this or that paradigm: we
are going to learn how to build stuff in practice, and with practice you will form your own ideas.
We will also see different kinds of languages, and you will be able to see the merits of different
approaches and decide what makes sense in your case.

Philosophy
Building software is a complex task. It would be easy to spend a whole life working on one single
problem. Think about the amount of effort that went into producing a parser generator like ANTLR, or the
thousands of man-years poured into building a Java compiler or the IDEs used by most developers.
If you want to build all the machinery for a language, and build all of it to a high quality, you need
to adopt the UNIX philosophy to reach your goal: take simple, high quality components and
combine them together in smart ways.
This is exactly what we are going to do: we are going to look at components that we can reuse
and combine. For our strategy to work, we need to select components which are not just of high
quality, but also that can be combined easily. Components with very large requirements or very
complex interfaces are not good candidates. Components that do one thing well, and have a simple
architecture are the ideal ones.

How the different tools are related


We can build all sorts of machinery for our language. We can imagine the set of tools for our language
as a tree: depending on our specific needs we will make our tree grow, adding piece by piece as we
move forward.
At the center of this tree there is one piece, which is the model of the code written in your
language: the Abstract Syntax Tree (AST). Let's see how the other tools relate to the AST.
We will have tools to obtain the AST: lexers, parsers and transformations make it possible to take the
source code and obtain the corresponding AST from it. We may also want to obtain an equivalent of
the AST from compiled versions of your code. We will see how to do that as well.
Once you have a model of your code you may want to extract information from it: for example,
finding the methods which use some deprecated library. Or you may want to transform your code,
maybe to generate something else, maybe to perform some refactoring to improve efficiency. Either
way you are going to manipulate an AST. We will see a few techniques to do that.
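To make "extracting information from an AST" a bit more concrete, here is a minimal sketch in Kotlin. The node classes (`Expr`, `IntLit`, `VarRef`, `BinaryOp`) and the function `referencedVariables` are hypothetical names chosen for this illustration; they are not the classes developed later in the book.

```kotlin
// Hypothetical AST for arithmetic expressions (illustrative names only).
sealed class Expr
data class IntLit(val value: Int) : Expr()
data class VarRef(val name: String) : Expr()
data class BinaryOp(val op: String, val left: Expr, val right: Expr) : Expr()

// Walks the tree recursively and collects the names of all referenced variables.
fun referencedVariables(e: Expr): Set<String> = when (e) {
    is IntLit -> emptySet()
    is VarRef -> setOf(e.name)
    is BinaryOp -> referencedVariables(e.left) + referencedVariables(e.right)
}
```

Most analyses and transformations follow this same shape: a recursive visit of the tree that either gathers data or builds a new tree.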
Finally you may want to produce something from your AST. Typically, after a few transformations,
you may want to generate bytecode or native code. Or maybe JavaScript. Or you could write
an interpreter for your AST or for some derived format.
Editors also take advantage of the AST to extract the information needed to implement different
features. For example, syntax highlighting is typically based on the result produced by the lexer, but
autocompletion or validation need to operate on the AST, maybe resolving symbols to elements we
got from compiled code.
This is a very brief overview. Now it is time to jump in and try to build something. If you are like
me, things make much more sense when you see what working on them means. Let's get started!

Technology used for the examples


I believe that you should put your code where your mouth is, so we will not just discuss solutions
but show real code for every single tool we are going to discuss.
The examples will be written in Kotlin, which is a JVM language. Kotlin can also be transformed
to Java automatically, so when the book is finished we will make the code available in
Java too. The ideas discussed in this book should be applicable using any language.
Depending on the response this book gets we could also translate the examples to other languages.
Why start with Kotlin? Because it is very concise and it reduces boilerplate. It is also well
supported and reasonably clear. Also, the JVM should work decently on all relevant platforms. I am
testing all my stuff under Linux but I am confident it will work on Windows and Mac too.
We will also use Gradle as our build system.
2. The example languages we are going to build
In this book we are going to build different languages. We will use these languages as examples to
show how to implement the different techniques.

MiniCalc
This will be a toy language, created to show us how to work with expressions. This language would
be of limited use in practice, but it will be helpful to start introducing the basics of building languages.
The language will make it possible to define inputs and variables. It will be possible to execute a MiniCalc
module, specifying the values for the inputs. The execution will consist of evaluating all the
expressions and then executing the print statements, producing an output visible to the user.
We will support:

integer and decimal literals
variable definition and assignment
the basic mathematical operations (addition, subtraction, multiplication, division)
the usage of parentheses

Particularities:

newlines will be meaningful
we will have string interpolation like "hi #{name}!"

Example:

input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}")
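To give an idea of what string interpolation has to do at execution time, here is a small Kotlin sketch. It is a deliberate simplification: the function name `interpolate` is made up for this example, and it only substitutes plain variable names, while MiniCalc allows full expressions inside `#{...}`.

```kotlin
// Replaces each #{name} in the template with the value bound to that name.
// Simplification: only plain names are handled, not arbitrary expressions.
fun interpolate(template: String, values: Map<String, Any>): String =
    Regex("""#\{(\w+)\}""").replace(template) { m ->
        val name = m.groupValues[1]
        values[name]?.toString() ?: error("Unknown variable: $name")
    }
```

Calling it with the template from the example above and `width = 3`, `height = 4`, `area = 12` yields the text `A rectangle 3x4 has an area 12`.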

MiniCalcFun
When implementing an interpreter we will enrich MiniCalc by adding support for functions. This
variant will be named MiniCalcFun because creativity is not really my strong suit. We will also
allow nested functions. This will be useful to discuss scoping.
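Nested functions matter for scoping because a name used in an inner function may be defined in any of the enclosing ones. A common way to model this is a chain of scopes, sketched below in Kotlin. The class `Scope` and its methods are hypothetical and only illustrate the lookup rule; the interpreter chapter may organize things differently.

```kotlin
// A scope holds its own bindings and an optional link to the enclosing scope.
class Scope(private val parent: Scope? = null) {
    private val bindings = mutableMapOf<String, Int>()

    fun define(name: String, value: Int) {
        bindings[name] = value
    }

    // Looks in the current scope first, then walks outwards through the parents.
    fun lookup(name: String): Int? = bindings[name] ?: parent?.lookup(name)
}
```

With this model a definition in an inner scope shadows one with the same name in an outer scope, while names that are not redefined remain visible from the inside.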

StaMac
This language will make it possible to represent state machines. It will be useful to see how to work with an
execution model different from the classical procedural one.
A state machine starts in a specific state and, when it receives an event, it moves to a different state.
When entering or leaving a state it can execute specific actions.
StaMac will make it possible to define inputs for our state machines, so that they are configurable. State
machines will also have variables, a list of events to which they can react and a list of states.
Of all the states one will be marked as the start state. A state will have a name and specify to which
events it reacts and to which states it moves. It will also specify the actions to execute on
entering and leaving that state.
Consider a state machine used to represent some piece of equipment producing physical items.
This state machine will start as turned off. We will send it a command (an event) to turn it on.
Later we could increase or decrease the speed. The machine will support three speeds: still,
low speed, high speed. We could also simulate the fact that time passes without anything happening:
we will do that by sending the event doNothing. This state machine will be configurable: we can
specify how many items it produces while in low speed or high speed mode.
In StaMac this state machine could be written as:

statemachine mySm

input lowSpeedThroughtput: Int
input highSpeedThroughtput: Int

var totalProduction = 0

event turnOff
event turnOn
event speedUp
event speedDown
event emergencyStop
event doNothing

start state turnedOff {
    on turnOn -> turnedOn
}

state turnedOn {
    on turnOff -> turnedOff
    on speedUp -> lowSpeed
}

state lowSpeed {
    on entry {
        totalProduction = totalProduction + lowSpeedThroughtput
        print("Producing " + lowSpeedThroughtput + " elements (total "+totalProduction+")")
    }
    on speedDown -> turnedOn
    on speedUp -> highSpeed
    on doNothing -> lowSpeed
}

state highSpeed {
    on entry {
        totalProduction = totalProduction + highSpeedThroughtput
        print("Producing " + highSpeedThroughtput + " elements (total "+totalProduction+")")
    }
    on speedDown -> lowSpeed
    on emergencyStop -> turnedOn
    on doNothing -> highSpeed
}
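To show how this execution model differs from a procedural one, the following Kotlin sketch hand-codes the behaviour of the state machine above. It is only an illustration under two assumptions that the text does not state: events with no matching transition are ignored, and a transition from a state to itself (such as doNothing) runs the entry actions again. The names `transitions` and `simulate` are made up, and the print actions are left out.

```kotlin
// Transition table for the example above: (current state, event) -> next state.
val transitions: Map<Pair<String, String>, String> = mapOf(
    ("turnedOff" to "turnOn") to "turnedOn",
    ("turnedOn" to "turnOff") to "turnedOff",
    ("turnedOn" to "speedUp") to "lowSpeed",
    ("lowSpeed" to "speedDown") to "turnedOn",
    ("lowSpeed" to "speedUp") to "highSpeed",
    ("lowSpeed" to "doNothing") to "lowSpeed",
    ("highSpeed" to "speedDown") to "lowSpeed",
    ("highSpeed" to "emergencyStop") to "turnedOn",
    ("highSpeed" to "doNothing") to "highSpeed"
)

// Feeds the events to the machine; returns the final state and the total production.
fun simulate(
    events: List<String>,
    lowSpeedThroughtput: Int,
    highSpeedThroughtput: Int
): Pair<String, Int> {
    var state = "turnedOff"          // the start state
    var totalProduction = 0
    for (event in events) {
        // Assumption: events without a matching transition are ignored.
        val next = transitions[state to event] ?: continue
        state = next
        // "on entry" actions of the state being entered.
        when (state) {
            "lowSpeed" -> totalProduction += lowSpeedThroughtput
            "highSpeed" -> totalProduction += highSpeedThroughtput
        }
    }
    return state to totalProduction
}
```

For instance, with throughputs of 2 and 5, the events turnOn, speedUp, doNothing, speedUp, emergencyStop leave the machine in turnedOn with a total production of 9 (2 + 2 + 5).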
Part I: the basics
We are going to see the basic building blocks for building language tools.
We are going to work with the example languages presented in Chapter 2.
The languages are deliberately simple because here we want to show the principles without getting
caught up in too many nitty-gritty details and corner cases.
At the end of Part I you will know the basics of building a model from the raw code of your language. You
will be able to validate such a model, to resolve references and to calculate the types of the different
expressions. At that point you will be ready to move to the next steps.
4. Writing a lexer
When we start analyzing the code of our language we get an entire file to process. The first step is to
break that big file into a list of tokens. Divide et impera is a principle that has worked for some millennia
and is still valid.
To split a file into tokens we will build a lexer. The lexer is the piece of code that takes a textual
document and breaks it into tokens. Tokens are portions of text with a specific role.
Our tokens could be:

numeric literals
string literals
comments
keywords

and some others.
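Before turning to a generator, it may help to see what a lexer does in its simplest form. The Kotlin sketch below is a hand-written toy tokenizer, not what ANTLR produces: the token categories, the `rules` list and the `tokenize` function are all invented for this illustration, and the first matching rule wins.

```kotlin
data class Token(val type: String, val text: String)

// Hypothetical token rules, tried in this order at each position.
val rules = listOf(
    "KEYWORD" to Regex("""(input|var|print)\b"""),
    "NUMBER" to Regex("""[0-9]+(\.[0-9]+)?"""),
    "ID" to Regex("""[A-Za-z_][A-Za-z0-9_]*"""),
    "SYMBOL" to Regex("""[-+*/=()]"""),
    "WS" to Regex("""\s+""")
)

fun tokenize(input: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var pos = 0
    while (pos < input.length) {
        // Take the first rule that matches exactly at the current position.
        val hit = rules.firstNotNullOfOrNull { (type, re) ->
            re.find(input, pos)?.takeIf { it.range.first == pos }?.let { Token(type, it.value) }
        } ?: error("Unexpected character at $pos")
        if (hit.type != "WS") tokens += hit   // whitespace is dropped in this sketch
        pos += hit.text.length
    }
    return tokens
}
```

Given `var area = width * 2` this produces six tokens: a keyword, an identifier, a symbol, an identifier, a symbol and a number.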


We could use a lexer to provide syntax highlighting. Do you want to show the keywords in green?
First you need to recognize which parts are the keywords!
To build our lexer we are going to use ANTLR. Indeed we will use ANTLR to generate both our lexer
and our parser. The parser will later be used to arrange tokens into an organized structure called a
parse tree. Typically a lexer and a parser need to work together, so it makes sense for a single tool
to generate both of them.

Why using ANTLR?


ANTLR is a very mature tool for writing lexers and parsers. It can generate code for several languages
and has decent performance. It is well maintained and we can be confident it has the features we could
possibly need to handle the corner cases we may meet. In addition to that, ANTLR 4 makes it
possible to write simple grammars because it handles directly left-recursive definitions for you. So you do not
have to write many intermediate node types to specify precedence rules for your expressions.
More on this when we look into the parser.
Sometimes a lexer is also called a tokenizer.
ANTLR 4.6 introduced support for even more target languages. From the ANTLR grammar files we can generate lexers and
parsers in Java, JavaScript, Python, Go, C#, and Swift.

The plan
We are going to look into how to set up our project and then we will see the lexer grammars
for MiniCalc and StaMac, which are both described in chapter 2, the one presenting the example
languages we are going to work with throughout the book.

Configuration
As a first thing we need to set up our project. We are going to use Gradle as our build system but
any build system would work. At this stage we just need to:

be able to invoke ANTLR to generate the lexer code from the lexer grammar. We will generate
a Java class, but ANTLR supports many other targets
compile the code generated by ANTLR

I typically start by setting up a new local git repository (git init) and a Gradle wrapper
(gradle wrapper). This is just a small script that installs a specific version of Gradle locally, so when
we share the project anyone will be able to use the wrapper, and the wrapper will take care of
installing Gradle for the specific platform of our user.
Then I create a Gradle build file (build.gradle). My build file looks like this:

buildscript {
    // The version of Kotlin I am using, soon moving to 1.1
    ext.kotlin_version = '1.0.6'

    repositories {
        mavenCentral()
        maven {
            name 'JFrog OSS snapshot repo'
            url 'https://oss.jfrog.org/oss-snapshot-local/'
        }
        jcenter()
    }

    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
    }
}

apply plugin: 'kotlin'
apply plugin: 'antlr'
// I use IntelliJ IDEA and this plugin generates project files for that IDE
apply plugin: 'idea'

repositories {
    mavenCentral()
    jcenter()
}

dependencies {
    antlr "org.antlr:antlr4:4.5.1"
    compile "org.antlr:antlr4-runtime:4.5.1"
    compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
    compile "org.jetbrains.kotlin:kotlin-reflect:$kotlin_version"
    testCompile "org.jetbrains.kotlin:kotlin-test:$kotlin_version"
    testCompile "org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version"
    testCompile 'junit:junit:4.12'
}

// This is the task to generate the lexer using ANTLR
generateGrammarSource {
    maxHeapSize = "64m"
    arguments += ['-package', 'me.tomassetti.minicalc']
    outputDirectory = new File("generated-src/antlr/main/me/tomassetti/minicalc".toString())
}
// We want to compile the generated lexer AFTER having generated it
compileJava.dependsOn generateGrammarSource
sourceSets {
    generated {
        java.srcDir 'generated-src/antlr/main/'
    }
}
compileJava.source sourceSets.generated.java, sourceSets.main.java
compileKotlin.source sourceSets.generated.java, sourceSets.main.java, sourceSets.main.kotlin

// When we run ./gradlew clean we want to remove the generated code
clean {
    delete "generated-src"
}

idea {
    module {
        sourceDirs += file("generated-src/antlr/main")
    }
}

At this point we can run:

./gradlew idea: to generate the project files for IntelliJ IDEA
./gradlew generateGrammarSource: if we just want to generate the lexer class from our
ANTLR grammar
./gradlew build: to generate and compile everything
./gradlew check: to run tests

The Lexer grammar for MiniCalc


This is our complete lexer grammar:

lexer grammar MiniCalcLexer;

channels { WHITESPACE }

// Whitespace
NEWLINE            : '\r\n' | '\r' | '\n' ;
WS                 : [\t ]+ -> channel(WHITESPACE) ;

// Keywords
INPUT              : 'input' ;
VAR                : 'var' ;
PRINT              : 'print';
AS                 : 'as';
INT                : 'Int';
DECIMAL            : 'Decimal';
STRING             : 'String';

// Literals
INTLIT             : '0'|[1-9][0-9]* ;
DECLIT             : ('0'|[1-9][0-9]*) '.' [0-9]+ ;

// Operators
PLUS               : '+' ;
MINUS              : '-' ;
ASTERISK           : '*' ;
DIVISION           : '/' ;
ASSIGN             : '=' ;
LPAREN             : '(' ;
RPAREN             : ')' ;

// Identifiers
ID                 : [_]*[a-z][A-Za-z0-9_]* ;

STRING_OPEN        : '"' -> pushMode(MODE_IN_STRING);

UNMATCHED          : . ;

mode MODE_IN_STRING;

ESCAPE_STRING_DELIMITER : '\\"' ;
ESCAPE_SLASH            : '\\\\' ;
ESCAPE_NEWLINE          : '\\n' ;
ESCAPE_SHARP            : '\\#' ;
STRING_CLOSE            : '"' -> popMode ;
INTERPOLATION_OPEN      : '#{' -> pushMode(MODE_IN_INTERPOLATION) ;
STRING_CONTENT          : ~["\n\r\t\\#]+ ;

STR_UNMATCHED           : . -> type(UNMATCHED) ;

mode MODE_IN_INTERPOLATION;

INTERPOLATION_CLOSE : '}' -> popMode ;

INTERP_WS           : [\t ]+ -> skip ;

// Keywords
INTERP_AS           : 'as'-> type(AS) ;
INTERP_INT          : 'Int'-> type(INT) ;
INTERP_DECIMAL      : 'Decimal'-> type(DECIMAL) ;
INTERP_STRING       : 'String'-> type(STRING) ;

// Literals
INTERP_INTLIT       : ('0'|[1-9][0-9]*) -> type(INTLIT) ;
INTERP_DECLIT       : ('0'|[1-9][0-9]*) '.' [0-9]+ -> type(DECLIT) ;

// Operators
INTERP_PLUS         : '+' -> type(PLUS) ;
INTERP_MINUS        : '-' -> type(MINUS) ;
INTERP_ASTERISK     : '*' -> type(ASTERISK) ;
INTERP_DIVISION     : '/' -> type(DIVISION) ;
INTERP_ASSIGN       : '=' -> type(ASSIGN) ;
INTERP_LPAREN       : '(' -> type(LPAREN) ;
INTERP_RPAREN       : ')' -> type(RPAREN) ;

// Identifiers
INTERP_ID           : [_]*[a-z][A-Za-z0-9_]* -> type(ID);

INTERP_STRING_OPEN  : '"' -> type(STRING_OPEN), pushMode(MODE_IN_STRING);

INTERP_UNMATCHED    : . -> type(UNMATCHED) ;

Now let's look at it in detail.

Preamble
We start by specifying that this is a lexer grammar. Using ANTLR we could also define parser
grammars or mixed grammars (containing a lexer and a parser in one file).
We also specify that we want to use an extra channel, in addition to the default one. You can imagine
channels as conveyor belts. You put tokens on different channels so that different consumers are free to
consider or ignore them. We will see more when looking at whitespace.

lexer grammar MiniCalcLexer;

channels { WHITESPACE }

Whitespace
In our language the newlines are relevant while spaces are not. We will therefore ignore spaces most
of the time. They will however be useful when performing syntax highlighting, so we will not just
throw them away but we will put them into a separate channel, where we can retrieve them when
we need them.

// Whitespace
NEWLINE : '\r\n' | '\r' | '\n' ;
WS : [\t ]+ -> channel(WHITESPACE) ;
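To make the idea of channels concrete, here is a minimal Kotlin sketch that does not use ANTLR at all. The names Tok, parserView and fullText, as well as the channel numbers, are assumptions introduced only for this illustration. It shows why keeping whitespace on a separate channel is handy: one consumer can ignore it, while another can still rebuild the original text.

```kotlin
// Simplified stand-in for a token: only the fields needed for this sketch.
data class Tok(val type: String, val text: String, val channel: Int)

// Hypothetical channel numbers used only in this example.
val DEFAULT_CHANNEL = 0
val WHITESPACE_CHANNEL = 1

// What a parser-like consumer looks at: tokens on the default channel only.
fun parserView(tokens: List<Tok>): List<Tok> =
    tokens.filter { it.channel == DEFAULT_CHANNEL }

// What a highlighter-like consumer can do: rebuild the text from every token.
fun fullText(tokens: List<Tok>): String =
    tokens.joinToString("") { it.text }

fun main() {
    val tokens = listOf(
        Tok("VAR", "var", DEFAULT_CHANNEL),
        Tok("WS", " ", WHITESPACE_CHANNEL),
        Tok("ID", "a", DEFAULT_CHANNEL))
    println(parserView(tokens).map { it.type }) // prints [VAR, ID]
    println(fullText(tokens))                   // prints var a
}
```

Had we discarded the whitespace tokens instead, the second consumer could no longer reconstruct the exact text.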

Keywords and ID
Defining keywords is pretty simple: we just have to pay attention to the fact that the rule for
identifiers typically matches most, if not all, of the keywords. This matters because in ANTLR,
when several rules match the same longest piece of text, the one defined first is chosen. The
solution is simply to put the ID rule after all the keywords and you are good to go.
Also, notice that our ID rule specifies that an ID cannot start with a capital letter.

// Keywords
INPUT : 'input' ;
VAR : 'var' ;
PRINT : 'print';
AS : 'as';
INT : 'Int';
DECIMAL : 'Decimal';
STRING : 'String';

// Identifiers
ID : [_]*[a-z][A-Za-z0-9_]* ;
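The effect of rule ordering can be sketched in plain Kotlin. This is only a toy model, not how ANTLR is implemented: the rules list and the firstToken function are hypothetical, and only three of the rules are included. The longest match wins and, when two rules match the same text, the one listed first is kept.

```kotlin
// Ordered rules, as in the lexer grammar: keywords first, ID last.
val rules: List<Pair<String, String>> = listOf(
    "INPUT" to "input",
    "VAR" to "var",
    "ID" to "_*[a-z][A-Za-z0-9_]*")

// Returns (rule name, matched text) for the token at the start of the text,
// or null when no rule matches there.
fun firstToken(text: String): Pair<String, String>? {
    var best: Pair<String, String>? = null
    for ((name, pattern) in rules) {
        // The ^ anchor restricts the search to the beginning of the text.
        val match = Regex("^(?:$pattern)").find(text) ?: continue
        // Only a strictly longer match replaces the current one,
        // so on a tie the rule defined first is kept.
        if (best == null || match.value.length > best.second.length) {
            best = name to match.value
        }
    }
    return best
}

fun main() {
    println(firstToken("input"))  // prints (INPUT, input)
    println(firstToken("inputs")) // prints (ID, inputs)
}
```

If ID were listed before INPUT in this sketch, the text input would come out as an ID, which is exactly the problem that the ordering of the rules avoids.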

Examples of strings that are valid identifiers in our language:

_____a______
a99
foo_99_

Examples of strings which are not valid identifiers:

__A
A
99a
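These examples can be checked quickly with a regular expression equivalent to the ID rule. This is just a sketch: idPattern and isValidId are names made up for this illustration, and the check is stricter than the real lexer, which would not reject 99a but split it into other tokens.

```kotlin
// Regex counterpart of the lexer rule ID : [_]*[a-z][A-Za-z0-9_]* ;
val idPattern = Regex("_*[a-z][A-Za-z0-9_]*")

// True when the whole string is a single valid identifier.
fun isValidId(s: String): Boolean = idPattern.matches(s)

fun main() {
    for (s in listOf("_____a______", "a99", "foo_99_", "__A", "A", "99a")) {
        println("$s -> ${isValidId(s)}")
    }
}
```

The first three strings are accepted and the last three are rejected, in line with the two lists above.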

Numeric Literals
Our language is very simple: it lets us manipulate just numbers and strings. Our number
literals are very simple:

// Literals
INTLIT : '0'|[1-9][0-9]* ;
DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ ;
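One detail worth checking in a rule like DECLIT is grouping. In ANTLR, as in regular expressions, alternation (|) binds more loosely than sequence, so the integer part needs parentheses for the '.' [0-9]+ suffix to apply to both alternatives. The following sketch shows the difference using Kotlin regular expressions; the names ungrouped and grouped are introduced only for this example.

```kotlin
// Reads as: "0"  OR  (digits '.' digits)
val ungrouped = Regex("0|[1-9][0-9]*\\.[0-9]+")

// Reads as: ("0" OR digits) '.' digits
val grouped = Regex("(0|[1-9][0-9]*)\\.[0-9]+")

fun main() {
    for (s in listOf("0.5", "1.23", "0")) {
        println("$s ungrouped=${ungrouped.matches(s)} grouped=${grouped.matches(s)}")
    }
}
```

Only the grouped form accepts 0.5, and only the ungrouped form accepts a bare 0 as a decimal, which is not what we want.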

Our string literals are much more involved because we support interpolation. Let's see them in the
next section.

String
Lexers are typically not context-sensitive; however, in some cases it makes sense to build them to
be context-sensitive. In this way we can have simple rules that apply only in a given context. For
example, when we are inside a string we want to recognize sequences like \n, while these are not
relevant outside strings.
In ANTLR we achieve this by using modes: as we open a string we enter the mode MODE_IN_STRING.

STRING_OPEN : '"' -> pushMode(MODE_IN_STRING);

Now new rules apply:

mode MODE_IN_STRING;

ESCAPE_STRING_DELIMITER : '\\"' ;
ESCAPE_SLASH : '\\\\' ;
ESCAPE_NEWLINE : '\\n' ;
ESCAPE_SHARP : '\\#' ;
STRING_CLOSE : '"' -> popMode ;
INTERPOLATION_OPEN : '#{' -> pushMode(MODE_IN_INTERPOLATION) ;
STRING_CONTENT : ~["\n\r\t\\#]+ ;

STR_UNMATCHED : . -> type(UNMATCHED) ;

We have all the escape sequences and then we have STRING_CLOSE. When we match it we go back
to the mode we were in when we entered MODE_IN_STRING (typically the default mode).
We can also enter another mode: MODE_IN_INTERPOLATION.
Finally, all the other characters (excluding newlines, which are illegal in strings) are just
STRING_CONTENT.
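The way modes nest can be pictured as a stack. The sketch below is a toy model in plain Kotlin, not ANTLR code: the Mode enum and the modeTrace function are hypothetical, and escape sequences are deliberately ignored. It records the stack of modes every time a push or a pop happens.

```kotlin
enum class Mode { DEFAULT, IN_STRING, IN_INTERPOLATION }

// Returns the mode stack, rendered as text, after each push or pop.
fun modeTrace(input: String): List<String> {
    val modes = mutableListOf(Mode.DEFAULT) // the last element is the current mode
    val trace = mutableListOf<String>()
    var i = 0
    while (i < input.length) {
        val current = modes.last()
        val sizeBefore = modes.size
        when {
            // A quote outside a string opens one (like pushMode)
            input[i] == '"' && current != Mode.IN_STRING -> { modes.add(Mode.IN_STRING); i++ }
            // A quote inside a string closes it (like popMode)
            input[i] == '"' -> { modes.removeAt(modes.lastIndex); i++ }
            // "#{" inside a string opens an interpolation
            current == Mode.IN_STRING && input.startsWith("#{", i) -> {
                modes.add(Mode.IN_INTERPOLATION); i += 2
            }
            // "}" inside an interpolation goes back to the string
            current == Mode.IN_INTERPOLATION && input[i] == '}' -> {
                modes.removeAt(modes.lastIndex); i++
            }
            else -> i++ // any other character leaves the stack untouched
        }
        if (modes.size != sizeBefore) trace.add(modes.joinToString(" > "))
    }
    return trace
}

fun main() {
    modeTrace("print(\"a #{b} c\")").forEach { println(it) }
    // prints:
    // DEFAULT > IN_STRING
    // DEFAULT > IN_STRING > IN_INTERPOLATION
    // DEFAULT > IN_STRING
    // DEFAULT
}
```

Because it is a stack, a string nested inside an interpolation would simply push IN_STRING again and pop back to IN_INTERPOLATION when it closes.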

Interpolation
When we are in interpolation mode we can write basically all the expressions we can write at the
top level. For this reason we have to duplicate several rules:

mode MODE_IN_INTERPOLATION;

INTERPOLATION_CLOSE : '}' -> popMode ;

INTERP_WS : [\t ]+ -> skip ;

// Keywords
INTERP_AS : 'as' -> type(AS) ;
INTERP_INT : 'Int' -> type(INT) ;
INTERP_DECIMAL : 'Decimal' -> type(DECIMAL) ;
INTERP_STRING : 'String' -> type(STRING) ;

// Literals
INTERP_INTLIT : ('0'|[1-9][0-9]*) -> type(INTLIT) ;
INTERP_DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ -> type(DECLIT) ;

// Operators
INTERP_PLUS : '+' -> type(PLUS) ;
INTERP_MINUS : '-' -> type(MINUS) ;
INTERP_ASTERISK : '*' -> type(ASTERISK) ;
INTERP_DIVISION : '/' -> type(DIVISION) ;
INTERP_ASSIGN : '=' -> type(ASSIGN) ;
INTERP_LPAREN : '(' -> type(LPAREN) ;
INTERP_RPAREN : ')' -> type(RPAREN) ;

// Identifiers
INTERP_ID : [_]*[a-z][A-Za-z0-9_]* -> type(ID);

INTERP_STRING_OPEN : '"' -> type(STRING_OPEN), pushMode(MODE_IN_STRING);

INTERP_UNMATCHED : . -> type(UNMATCHED) ;

This is not ideal and it is one of the very few things I do not like about ANTLR. Unfortunately we do
not live in an ideal world, so I guess we have to cope with it. Anyway, we are getting a full lexer by
writing a few dozen lines of definitions, so we probably should not complain too much. All things
considered, the advantages clearly outweigh this drawback.

Unmatched
There are characters that are not allowed in certain positions. For example, you cannot put a dollar
symbol outside a string in MiniCalc. Normally you may want to just throw an error when you meet
such characters. However, you want to handle those characters differently when doing syntax
highlighting: those characters need to be considered and maybe colored in red to give feedback
to the user. This is why we have rules to produce an UNMATCHED token in all modes.

UNMATCHED : . ;
STR_UNMATCHED : . -> type(UNMATCHED) ;
INTERP_UNMATCHED : . -> type(UNMATCHED) ;

Invoke it on a few examples


Now that we have defined our lexer grammar we need to invoke ANTLR to generate the actual code
for the lexer. If you have configured gradle like I did, you can run ./gradlew generateGrammarSource
and a file named MiniCalcLexer.java should appear in the directory generated-src. Or
you can run ./gradlew build and the Java file will be generated and compiled.
Let's see how we can use this lexer and what it produces.
This is the code we can use to invoke the lexer and print the list of tokens to the screen:

package examples

import me.tomassetti.minicalc.MiniCalcLexer
import org.antlr.v4.runtime.ANTLRInputStream
import org.antlr.v4.runtime.Token
import java.io.FileInputStream
import java.io.StringReader

// Builds a lexer reading from the given source code
fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

// Loads the example file used in this chapter
fun readExampleCode() = FileInputStream("examples/rectangle.mc").bufferedReader().use { it.readText() }

fun main(args: Array<String>) {
    val lexer = lexerForCode(readExampleCode())
    do {
        val token = lexer.nextToken()
        val typeName = MiniCalcLexer.VOCABULARY.getSymbolicName(token.type)
        // Escape control characters so that each token is printed on a single line
        val text = token.text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
        println("L${token.line}(${token.startIndex}-${token.stopIndex}) $typeName '$text'")
    } while (token.type != Token.EOF)
}

This is the content of the example named rectangle.mc:

input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}")

This is the produced output:

L1(0-4) INPUT 'input'
L1(5-5) WS ' '
L1(6-8) INT 'Int'
L1(9-9) WS ' '
L1(10-14) ID 'width'
L1(15-15) NEWLINE '\n'
L2(16-20) INPUT 'input'
L2(21-21) WS ' '
L2(22-24) INT 'Int'
L2(25-25) WS ' '
L2(26-31) ID 'height'
L2(32-32) NEWLINE '\n'
L3(33-35) VAR 'var'
L3(36-36) WS ' '
L3(37-40) ID 'area'
L3(41-41) WS ' '
L3(42-42) ASSIGN '='
L3(43-43) WS ' '
L3(44-48) ID 'width'
L3(49-49) WS ' '
L3(50-50) ASTERISK '*'
L3(51-51) WS ' '
L3(52-57) ID 'height'
L3(58-58) NEWLINE '\n'
L4(59-63) PRINT 'print'
L4(64-64) LPAREN '('
L4(65-65) STRING_OPEN '"'
L4(66-77) STRING_CONTENT 'A rectangle '
L4(78-79) INTERPOLATION_OPEN '#{'
L4(80-84) ID 'width'
L4(85-85) INTERPOLATION_CLOSE '}'
L4(86-86) STRING_CONTENT 'x'
L4(87-88) INTERPOLATION_OPEN '#{'
L4(89-94) ID 'height'
L4(95-95) INTERPOLATION_CLOSE '}'
L4(96-108) STRING_CONTENT ' has an area '
L4(109-110) INTERPOLATION_OPEN '#{'
L4(111-114) ID 'area'
L4(115-115) INTERPOLATION_CLOSE '}'
L4(116-116) STRING_CLOSE '"'
L4(117-117) RPAREN ')'
L4(118-118) NEWLINE '\n'
L5(119-118) EOF '<EOF>'

The Lexer grammar for StaMac


Let's build a second lexer. This time we will build a lexer for StaMac, our language to represent state
machines:

lexer grammar SMLexer;

channels { COMMENT_CH, WHITESPACE_CH }

// Comment
COMMENT : '//' ~( '\r' | '\n' )* -> channel(COMMENT_CH) ;

// Whitespace
NEWLINE : ('\r\n' | '\r' | '\n') -> channel(WHITESPACE_CH) ;
WS : [\t ]+ -> channel(WHITESPACE_CH) ;

// Keywords : preamble
SM : 'statemachine' ;
INPUT : 'input' ;
VAR : 'var' ;
EVENT : 'event' ;

// Keywords : statements and expressions
PRINT : 'print';
AS : 'as';
INT : 'Int';
DECIMAL : 'Decimal';
STRING : 'String';

// Keywords : SM
START : 'start';
STATE : 'state';
ON : 'on';
ENTRY : 'entry';
EXIT : 'exit';

// Identifiers
ID : [_]*[a-z][A-Za-z0-9_]* ;

// Literals
INTLIT : '0'|[1-9][0-9]* ;
DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ ;
STRINGLIT : '"' ~["]* '"' ;

// Operators
PLUS : '+' ;
MINUS : '-' ;
ASTERISK : '*' ;
DIVISION : '/' ;
ASSIGN : '=' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACKET : '{' ;
RBRACKET : '}' ;
ARROW : '->' ;

UNMATCHED : . ;

There are many similarities between this lexer grammar and the previous one. This is not by accident
but rather typical, because there are common elements present in many languages:

- keywords
- literals
- operators
- the UNMATCHED rule
- comments
- whitespace

Let's focus on the differences:

- in this language we do not support string interpolation, therefore our string literal rule is much
  simpler and it does not involve using different modes
- we have two channels because we support comments, which are not supported in MiniCalc
- in this language newlines are not meaningful, so we send them to the same channel as
  whitespace

Testing
Of course we want to start off on the right foot and begin writing tests for our language machinery.
We started by writing a lexer, so we will start our testing efforts from here.
What is a lexer supposed to do? Take a string and return a list of tokens. Let's build our tests to
verify it does that correctly.

package me.tomassetti.minicalc

import org.antlr.v4.runtime.ANTLRInputStream
import org.junit.Test
import java.io.StringReader
import java.util.LinkedList
import kotlin.test.assertEquals

// Utilities included only for completeness

fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

// Returns the names of the token types produced by the lexer, ignoring whitespace
fun tokensNames(lexer: MiniCalcLexer): List<String> {
    val tokens = LinkedList<String>()
    do {
        val t = lexer.nextToken()
        when (t.type) {
            -1 -> tokens.add("EOF")
            else -> if (t.type != MiniCalcLexer.WS) tokens.add(lexer.vocabulary.getSymbolicName(t.type))
        }
    } while (t.type != -1)
    return tokens
}

// tokensContent is used by some tests but its definition is not shown in the original listing:
// this version is an assumption that mirrors tokensNames and collects the token text instead
fun tokensContent(lexer: MiniCalcLexer): List<String> {
    val tokens = LinkedList<String>()
    do {
        val t = lexer.nextToken()
        when (t.type) {
            -1 -> tokens.add("EOF")
            else -> if (t.type != MiniCalcLexer.WS) tokens.add(t.text)
        }
    } while (t.type != -1)
    return tokens
}

// End of utilities, here starts the real test code

class MiniCalcLexerTest {

    @Test fun parseVarDeclarationAssignedAnIntegerLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1")))
    }

    @Test fun parseVarDeclarationAssignedADecimalLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "DECLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1.23")))
    }

    @Test fun parseVarDeclarationAssignedASum() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "PLUS", "INTLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1 + 2")))
    }

    @Test fun parseMathematicalExpression() {
        assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "MINUS", "INTLIT", "EOF"),
                tokensNames(lexerForCode("1 + a * 3 / 4 - 5")))
    }

    @Test fun parseMathematicalExpressionWithParenthesis() {
        assertEquals(listOf("INTLIT", "PLUS", "LPAREN", "ID", "ASTERISK", "INTLIT", "RPAREN", "MINUS", "DECLIT", "EOF"),
                tokensNames(lexerForCode("1 + (a * 3) - 5.12")))
    }

    @Test fun parseCast() {
        assertEquals(listOf("ID", "ASSIGN", "ID", "AS", "INT", "EOF"),
                tokensNames(lexerForCode("a = b as Int")))
    }

    @Test fun parseSimpleString() {
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode("\"hi!\"")))
    }

    @Test fun parseStringWithNewlineEscape() {
        val code = "\"hi!\\n\""
        assertEquals(listOf("\"", "hi!", "\\n", "\"", "EOF"),
                tokensContent(lexerForCode(code)))
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "ESCAPE_NEWLINE", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode(code)))
    }

    @Test fun parseStringWithSlashEscape() {
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "ESCAPE_SLASH", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode("\"hi!\\\\\"")))
    }

    @Test fun parseStringWithDelimiterEscape() {
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "ESCAPE_STRING_DELIMITER", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode("\"hi!\\\"\"")))
    }

    @Test fun parseStringWithSharpEscape() {
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "ESCAPE_SHARP", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode("\"hi!\\#\"")))
    }

    @Test fun parseStringWithInterpolation() {
        val code = "\"hi #{name}. This is a number: #{5 * 4}\""
        assertEquals(listOf("\"", "hi ", "#{", "name", "}", ". This is a number: ", "#{", "5", "*", "4", "}", "\"", "EOF"),
                tokensContent(lexerForCode(code)))
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "INTERPOLATION_OPEN", "ID", "INTERPOLATION_CLOSE",
                "STRING_CONTENT", "INTERPOLATION_OPEN", "INTLIT", "ASTERISK", "INTLIT", "INTERPOLATION_CLOSE",
                "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode(code)))
    }
}

Easy and straight to the point.


In our tests we are mostly verifying that the tokens have the correct type; only a couple of them
also check the content. We could also check that tokens have the right position. We could potentially
write a couple of such tests, just to verify we do not get any surprises. However, normally what
matters is the type of the tokens returned; the rest should just work as expected because ANTLR is
very mature and battle-tested.

Summary
Building a lexer is not very difficult, but there are a few things you need to pay attention to if you
want to get it right. My advice is to use ANTLR because it is a powerful tool which we can adapt
to many contexts. In this chapter we have seen how to use it with two different languages and how
to test our lexer. We also discussed some aspects to consider in order to make our lexer usable from
the other components, like the parser or the editor.
Now it is time to move to the next component, the parser.
5. Writing a parser
We have seen how to organize the characters of our text into tokens. Now these tokens can be
organized in a more structured form: the Parse Tree.
A boring but needed note about terminology: in some situations people refer to the tree produced
by a parser as the parse tree while in others they refer to it as the Abstract Syntax Tree. In this book
we are calling the tree produced by ANTLR parse tree. In the following chapters we are going to
see how to refine and transform the parse tree to obtain a second tree. We will call that transformed
tree the Abstract Syntax Tree (AST).
The picture below shows the whole process: from code to the AST.

Parsing: From code to the AST

We have already seen how to obtain a list of tokens by using a lexer. Now, to get the parse tree, we
are going to use an ANTLR parser. The ANTLR parser is generated by ANTLR according to a parser
grammar. In this chapter we are going to build such a grammar.
In the parser grammar we will refer to the terminals or token types defined in the lexer grammar:
NEWLINE, VAR, ID, and the like.

The parser grammar for MiniCalc


Here is the ANTLR grammar for our first example language, MiniCalc:

parser grammar MiniCalcParser;

// We specify which lexer we are using: so it knows which terminals we can use
options { tokenVocab=MiniCalcLexer; }

miniCalcFile : lines=line+ ;

line : statement (NEWLINE | EOF) ;

statement : inputDeclaration # inputDeclarationStatement
          | varDeclaration   # varDeclarationStatement
          | assignment       # assignmentStatement
          | print            # printStatement ;

print : PRINT LPAREN expression RPAREN ;

inputDeclaration : INPUT type name=ID ;

varDeclaration : VAR assignment ;

assignment : ID ASSIGN expression ;

expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
           | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
           | value=expression AS targetType=type                           # typeConversion
           | LPAREN expression RPAREN                                      # parenExpression
           | ID                                                            # varReference
           | MINUS expression                                              # minusExpression
           | STRING_OPEN (parts+=stringLiteralContent)* STRING_CLOSE       # stringLiteral
           | INTLIT                                                        # intLiteral
           | DECLIT                                                        # decimalLiteral ;

stringLiteralContent : STRING_CONTENT                                      # constantString
                     | INTERPOLATION_OPEN expression INTERPOLATION_CLOSE   # interpolatedValue ;

type : INT     # integer
     | DECIMAL # decimal
     | STRING  # string ;

We reuse the existing lexer (tokenVocab=MiniCalcLexer).


At the top of the grammar you typically put the rule describing the whole file. In this case our
top rule is miniCalcFile. It is simply defined as a list of lines.
Each line is composed of a statement terminated either by a newline or by the end of the file.
A statement can be an inputDeclaration, a varDeclaration, an assignment or a print statement.
An inputDeclaration is just defined as an INPUT terminal followed by a type and an identifier. An
INPUT terminal is basically the keyword input, as you can see by looking at the lexer grammar. A type
is defined by the last rule of the grammar. In this case we use a label (name) to specify the role of the
ID terminal. This does not affect the way a parse tree is produced, but later we will be able to use
that label to get the ID terminal from the inputDeclaration node, referring to it as name.
Labels are more useful when we have more than one element of the same kind in the same rule.
This is for example the case in the binaryOperation alternative of the expression rule. You can see
that we have two expressions: one with label left and one with label right. Later we will be able
to ask for the left or the right expression of a binaryOperation, avoiding any confusion.
An expression can be defined in many different ways. The order is important because it determines
the operator precedence. So the multiplication comes before the sum. Notice that we specify two
ways to obtain a binaryOperation: the first time using a DIVISION or ASTERISK operator, the second
time using a PLUS or MINUS operator. It is important to define them separately because they have
different operator precedence, however the resulting node of the parse tree will have exactly the
same form, so we reuse the same name (binaryOperation).
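To see what "different operator precedence" means in practice, here is a small self-contained Kotlin sketch. It is not how ANTLR works internally: it is a simple precedence-climbing routine, and the names precedence, ToyParser and group are made up for this illustration. It takes a space-separated expression and adds parentheses to show how the operands are grouped.

```kotlin
// Binding strength per operator: a higher number binds tighter.
val precedence = mapOf("*" to 2, "/" to 2, "+" to 1, "-" to 1)

class ToyParser(private val tokens: List<String>) {
    private var pos = 0

    // Parses operators whose precedence is at least minPrec (precedence climbing).
    fun parse(minPrec: Int = 1): String {
        var left = tokens[pos++] // an operand: a number or an identifier
        while (pos < tokens.size) {
            val op = tokens[pos]
            val prec = precedence[op] ?: break
            if (prec < minPrec) break
            pos++
            // Asking for prec + 1 on the right makes operators left-associative
            val right = parse(prec + 1)
            left = "($left $op $right)"
        }
        return left
    }
}

// Returns the expression with explicit parentheses showing the grouping.
fun group(expr: String): String = ToyParser(expr.split(" ")).parse()

fun main() {
    println(group("1 + 2 * 3 - 4")) // prints ((1 + (2 * 3)) - 4)
}
```

The grouping ((1 + (2 * 3)) - 4) has the same shape as the nesting of binaryOperation nodes that we expect from the grammar for the same expression.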
String interpolation makes our stringLiteral not trivial: it starts and ends with two terminals
(STRING_OPEN, STRING_CLOSE). Between those terminals we can have any number of
stringLiteralContent elements. Each stringLiteralContent can be a simple piece of text (constantString) or
an interpolated value. An interpolated value is an expression wrapped between the terminals
INTERPOLATION_OPEN and INTERPOLATION_CLOSE.
How do we obtain the code for the parser from this parser grammar? We simply run ./gradlew
generateGrammarSource. Please refer to the build.gradle file in the repository or take a look at the
previous chapter.

Printing a parse tree for an example file


Let's try to invoke the parser on a simple example and look at the resulting parse tree.

The example we are going to consider is this:

input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}")

This is the code we are going to use to print the parse tree:

// Imports added for completeness; we assume here that the generated parser
// lives in the same package as the generated lexer
import me.tomassetti.minicalc.MiniCalcLexer
import me.tomassetti.minicalc.MiniCalcParser
import org.antlr.v4.runtime.ANTLRInputStream
import org.antlr.v4.runtime.CommonTokenStream
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.Vocabulary
import org.antlr.v4.runtime.tree.TerminalNode
import java.io.StringReader
import java.util.LinkedList

///
/// Parsing
///

fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

fun parseCode(code: String) : MiniCalcParser.MiniCalcFileContext =
        MiniCalcParser(CommonTokenStream(lexerForCode(code))).miniCalcFile()

///
/// Transform the Parse Tree in a string we can print on the screen
///

abstract class ParseTreeElement {
    abstract fun multiLineString(indentation : String = ""): String
}

class ParseTreeLeaf(val type: String, val text: String) : ParseTreeElement() {
    override fun toString(): String {
        return "T:$type[$text]"
    }

    override fun multiLineString(indentation : String): String = "${indentation}T:$type[$text]\n"
}

class ParseTreeNode(val name: String) : ParseTreeElement() {
    val children = LinkedList<ParseTreeElement>()
    fun child(c : ParseTreeElement) : ParseTreeNode {
        children.add(c)
        return this
    }

    override fun toString(): String {
        return "Node($name) $children"
    }

    override fun multiLineString(indentation : String): String {
        val sb = StringBuilder()
        sb.append("${indentation}$name\n")
        children.forEach { c -> sb.append(c.multiLineString(indentation + " ")) }
        return sb.toString()
    }
}

fun toParseTree(node: ParserRuleContext, vocabulary: Vocabulary) : ParseTreeNode {
    val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
    node.children.forEach { c ->
        when (c) {
            is ParserRuleContext -> res.child(toParseTree(c, vocabulary))
            is TerminalNode -> res.child(ParseTreeLeaf(vocabulary.getSymbolicName(c.symbol.type), c.text))
        }
    }
    return res
}

///
/// Invoking the parser and print the parse tree
///

fun main(args: Array<String>) {
    // readExampleCode is a simple function that reads the code of our example file
    println(toParseTree(parseCode(readExampleCode()), MiniCalcParser.VOCABULARY).multiLineString())
}

What we do here is:

1. we invoke the parser and get the parse tree
2. we transform the parse tree so that we can print it

Parsing is quite easy: MiniCalcParser(CommonTokenStream(lexerForCode(code))). We simply
create a lexer for our code, wrap it in a token stream and pass it to our parser. Done.
Transforming the parse tree is a bit more complicated and requires working with the classes generated
by ANTLR to represent the nodes of the parse tree.
The function toParseTree takes the root of the parse tree returned by ANTLR (a ParserRuleContext
instance) together with the Vocabulary object, which gives us the names of the terminals.
This function takes the node it has received, looks at the class name and drops the
Context suffix. ANTLR generates one class for each parser rule, named like the rule with the extra
Context suffix, and we do not want our representation of the parse tree to be polluted by Context
appearing all over the place. At this point we take all the children and check whether they correspond to
simple terminals or to other rules. For terminals we instantiate ParseTreeLeaf elements and for
nodes corresponding to rules we instantiate ParseTreeNode instead. Once we have a whole tree
made of ParseTreeNode and ParseTreeLeaf we can invoke the method multiLineString on the
root and get a readable version of the parse tree. This is what we get:

MiniCalcFile
 Line
  InputDeclarationStatement
   InputDeclaration
    T:INPUT[input]
    Integer
     T:INT[Int]
    T:ID[width]
  T:NEWLINE[
]
 Line
  InputDeclarationStatement
   InputDeclaration
    T:INPUT[input]
    Integer
     T:INT[Int]
    T:ID[height]
  T:NEWLINE[
]
 Line
  VarDeclarationStatement
   VarDeclaration
    T:VAR[var]
    Assignment
     T:ID[area]
     T:ASSIGN[=]
     BinaryOperation
      VarReference
       T:ID[width]
      T:ASTERISK[*]
      VarReference
       T:ID[height]
  T:NEWLINE[
]
 Line
  PrintStatement
   Print
    T:PRINT[print]
    T:LPAREN[(]
    StringLiteral
     T:STRING_OPEN["]
     ConstantString
      T:STRING_CONTENT[A rectangle ]
     InterpolatedValue
      T:INTERPOLATION_OPEN[#{]
      VarReference
       T:ID[width]
      T:INTERPOLATION_CLOSE[}]
     ConstantString
      T:STRING_CONTENT[x]
     InterpolatedValue
      T:INTERPOLATION_OPEN[#{]
      VarReference
       T:ID[height]
      T:INTERPOLATION_CLOSE[}]
     ConstantString
      T:STRING_CONTENT[ has an area ]
     InterpolatedValue
      T:INTERPOLATION_OPEN[#{]
      VarReference
       T:ID[area]
      T:INTERPOLATION_CLOSE[}]
     T:STRING_CLOSE["]
    T:RPAREN[)]
  T:EOF[<EOF>]

The parser grammar for StaMac


Let's see another example of a grammar. Here is the parser grammar for the StaMac language:

parser grammar SMParser;

options { tokenVocab=SMLexer; }

stateMachine : preamble (states+=state)+ EOF ;

preamble : SM name=ID (elements+=preambleElement)* ;

preambleElement : EVENT name=ID                                            # eventDecl
                | INPUT name=ID COLON type                                 # inputDecl
                | VAR name=ID (COLON type)? ASSIGN initialValue=expression # varDecl
                ;

state : (start=START)? STATE name=ID LBRACKET (blocks+=stateBlock)* RBRACKET ;

stateBlock : ON ENTRY LBRACKET (statements+=statement)* RBRACKET # entryBlock
           | ON EXIT LBRACKET (statements+=statement)* RBRACKET  # exitBlock
           | ON eventName=ID ARROW destinationName=ID            # transitionBlock
           ;

statement : assignment # assignmentStatement
          | print      # printStatement
          | EXIT       # exitStatement ;

print : PRINT LPAREN expression RPAREN ;

assignment : ID ASSIGN expression ;

expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
           | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
           | value=expression AS targetType=type                           # typeConversion
           | LPAREN expression RPAREN                                      # parenExpression
           | ID                                                            # valueReference
           | MINUS expression                                              # minusExpression
           | INTLIT                                                        # intLiteral
           | DECLIT                                                        # decimalLiteral
           | STRINGLIT                                                     # stringLiteral ;

type : INT     # integer
     | DECIMAL # decimal
     | STRING  # string;

You can see that we have reused some definitions while others are very similar.
In MiniCalc the top rule was defined to recognize a list of lines. In StaMac the top rule instead is
defined to organize the code in two areas: the first is the preamble while the second one is a list of
states. After that we expect the end of file, represented by the special terminal EOF.
The preamble is defined as the terminal SM (corresponding to the keyword statemachine), an ID
representing the name of the state machine and finally a list of preambleElements. A preambleElement
can be an event declaration, an input declaration or a variable declaration. By defining the rules in
this way we let users mix event, input and variable declarations in any order they want.
However, all these declarations must precede the ones related to the states.
In MiniCalc we were using the newline as a terminator of each line, while in StaMac we ignore
newlines.
In StaMac we also have an optional element: the keyword start (terminal START). It can be used at
the beginning of a state.
Note also that the rule statement could be written equivalently as:

statement : ID ASSIGN expression           # assignmentStatement
          | PRINT LPAREN expression RPAREN # printStatement
          | EXIT                           # exitStatement ;

This is the form we obtain by replacing print and assignment by their definitions. This alternative
form would produce a slightly simpler parse tree, but I prefer the original one because I find it more
readable. We will later process the parse tree to obtain the abstract syntax tree, so we have no gain
in sacrificing readability to affect the exact shape of the parse tree we will obtain.

Testing
Ok, we defined our parser, now we need to test it. In general, I think we need to test a parser in three
ways:

1. Verify that all the code we need to parse is parsed without errors
2. Ensure that code containing errors is not parsed
3. Verify that the shape of the resulting AST is the one we expect

In practice the first point is the one on which I tend to insist the most. If you are building a parser
for an existing language the best way to test your parser is to try parsing as much code as you can,
verifying that all the errors found correspond to actual errors in the original code, and not errors in
the parser. Typically I iterate over this step multiple times to complete my grammars.
The second and third points are refinements on which I work once I am sure my grammar can
recognize everything.
In this simple case, we will write simple test cases to cover the first and the third point: we will
verify that some examples are parsed and we will verify that the tree produced is the one we want.
It is a bit cumbersome to verify that the tree produced is the one you want. There are different ways
to do that, but in this case I chose to generate a string representation of the parse tree and verify it is the
same as the one expected. It is an indirect way of testing that the tree is the one I want, but it is much
easier for simple cases like this one.
This is how we produce a string representation of the parse tree:

// Imports added for completeness
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.tree.TerminalNode
import java.util.LinkedList

abstract class ParseTreeElement {
    abstract fun multiLineString(indentation : String = ""): String
}

// A leaf corresponds to a terminal: here we keep only its text
class ParseTreeLeaf(val text: String) : ParseTreeElement() {
    override fun toString(): String {
        return "T[$text]"
    }

    override fun multiLineString(indentation : String): String = "${indentation}T[$text]\n"
}

// A node corresponds to a parser rule and holds its children in order
class ParseTreeNode(val name: String) : ParseTreeElement() {
    val children = LinkedList<ParseTreeElement>()
    fun child(c : ParseTreeElement) : ParseTreeNode {
        children.add(c)
        return this
    }

    override fun toString(): String {
        return "Node($name) $children"
    }

    override fun multiLineString(indentation : String): String {
        val sb = StringBuilder()
        sb.append("${indentation}$name\n")
        children.forEach { c -> sb.append(c.multiLineString(indentation + " ")) }
        return sb.toString()
    }
}

fun toParseTree(node: ParserRuleContext) : ParseTreeNode {
    val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
    node.children.forEach { c ->
        when (c) {
            is ParserRuleContext -> res.child(toParseTree(c))
            is TerminalNode -> res.child(ParseTreeLeaf(c.text))
        }
    }
    return res
}

And these are some test cases:

// parseResource is a helper, not shown here, that parses an example file
class MiniCalcParserTest {

    @org.junit.Test fun parseAdditionAssignment() {
        assertEquals(
"""MiniCalcFile
 Line
  AssignmentStatement
   Assignment
    T[a]
    T[=]
    BinaryOperation
     IntLiteral
      T[1]
     T[+]
     IntLiteral
      T[2]
  T[<EOF>]
""",
                toParseTree(parseResource("addition_assignment", this.javaClass)).multiLineString())
    }

    @org.junit.Test fun parseSimplestVarDecl() {
        assertEquals(
"""MiniCalcFile
 Line
  VarDeclarationStatement
   VarDeclaration
    T[var]
    Assignment
     T[a]
     T[=]
     IntLiteral
      T[1]
  T[<EOF>]
""",
                toParseTree(parseResource("simplest_var_decl", this.javaClass)).multiLineString())
    }

    @org.junit.Test fun parsePrecedenceExpressions() {
        assertEquals(
"""MiniCalcFile
 Line
  VarDeclarationStatement
   VarDeclaration
    T[var]
    Assignment
     T[a]
     T[=]
     BinaryOperation
      BinaryOperation
       IntLiteral
        T[1]
       T[+]
       BinaryOperation
        IntLiteral
         T[2]
        T[*]
        IntLiteral
         T[3]
      T[-]
      IntLiteral
       T[4]
  T[<EOF>]
""",
                toParseTree(parseResource("precedence_expression", this.javaClass)).multiLineString())
    }

}

Simple, isn't it?

Summary
We have seen how to build a simple lexer and a simple parser. Many tutorials you can find online
stop there. We are instead going to move on and build more tools from our lexer and parser. We
laid the foundations, we now have to move to the rest of the infrastructure. Things will start to get
pretty interesting.
6. Mapping: from the parse-tree to
the Abstract Syntax Tree
In this chapter we are going to see how to process and transform the information obtained from the
parser. The ANTLR parser recognizes the elements present in the source code and builds a parse tree.
From the parse tree we will obtain the Abstract Syntax Tree, on which we will perform validation
and from which we will produce compiled code.
Our goal here is obtain a new tree which satisfies three requirements:

1. Is composed of classes that are easy to work with


2. Does not contain purely syntactical elements
3. Is as explicit as possible

Why? Because we will need to perform several operations on this tree, and to traverse and transform
it easily. The operations that we are going to perform are based on the semantic content of
the code, not on its syntactic structure. The syntax has guided us to produce the parse tree and has
now served its purpose: it is time to move on to the semantics. Are you confused by this discussion
about syntax vs. semantics? Do not worry, I am going to throw a lot of code at you and show what I
mean in practice.
In other words we will build a model of our code to simplify the hard work that follows, so that the
hard work becomes a walk in the park.

General support for the Abstract Syntax Tree


There are some operations that we will need to perform over and over on our AST:

• navigate the tree, touching all nodes
• find all the nodes of a given type

For this reason every node of the AST will implement this interface:

interface Node {
    val position: Position?
}

A Node represents every possible node of an AST and it is general. We can reuse it across the different
languages that we may want to create.
The most important operation that we want to be able to perform on each node is navigate through
it and all its descendants. In particular we want to have the ability to define an operation and execute
it for all nodes of an AST. To do that we will define Node.process:

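import kotlin.reflect.full.memberProperties  // needs the kotlin-reflect library (Kotlin 1.1 and later)

fun Node.process(operation: (Node) -> Unit) {
    // First apply the operation to this node...
    operation(this)
    // ...then to every child found among its properties
    this.javaClass.kotlin.memberProperties.forEach { p ->
        val v = p.get(this)
        when (v) {
            is Node -> v.process(operation)
            is Collection<*> -> v.forEach { if (it is Node) it.process(operation) }
        }
    }
}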

This takes a Node and looks at all its properties. It finds the children by identifying those
properties that have as value a Node or a collection of Nodes.
What about performing an operation only on nodes of a certain kind? Easy!

fun <T: Node> Node.specificProcess(klass: Class<T>, operation: (T) -> Unit) {
    process { if (klass.isInstance(it)) { operation(it as T) } }
}

We just invoke process and for each Node we traverse we check if it corresponds to the expected
type. In that case we execute the given operation on it.
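
To make the traversal concrete, here is a minimal sketch of how process and specificProcess could be used. The node classes are simplified stand-ins without positions, and countNodes and intLiteralsIn are hypothetical helpers introduced only for this illustration. It assumes the kotlin-reflect library is on the classpath.

```kotlin
import kotlin.reflect.full.memberProperties

// Simplified stand-ins for the AST classes (no positions, to keep the sketch short)
interface Node
interface Expression : Node
data class IntLit(val value: String) : Expression
data class SumExpression(val left: Expression, val right: Expression) : Expression
data class Print(val value: Expression) : Node
data class MiniCalcFile(val statements: List<Node>) : Node

// Same traversal as in the text: visit the node, then every property
// whose value is a Node or a collection of Nodes
fun Node.process(operation: (Node) -> Unit) {
    operation(this)
    this.javaClass.kotlin.memberProperties.forEach { p ->
        val v = p.get(this)
        when (v) {
            is Node -> v.process(operation)
            is Collection<*> -> v.forEach { if (it is Node) it.process(operation) }
        }
    }
}

fun <T : Node> Node.specificProcess(klass: Class<T>, operation: (T) -> Unit) {
    // Class.cast is used here instead of "it as T" to avoid an unchecked cast warning
    process { if (klass.isInstance(it)) operation(klass.cast(it)) }
}

// Hypothetical helper: count all the nodes of a tree
fun countNodes(root: Node): Int {
    var count = 0
    root.process { count++ }
    return count
}

// Hypothetical helper: collect the values of all the integer literals of a tree
fun intLiteralsIn(root: Node): List<String> {
    val found = mutableListOf<String>()
    root.specificProcess(IntLit::class.java) { found.add(it.value) }
    return found
}

fun main() {
    val ast = MiniCalcFile(listOf(Print(SumExpression(IntLit("1"), IntLit("2")))))
    println(countNodes(ast))     // the file, the print, the sum and two literals
    println(intLiteralsIn(ast))
}
```

Note that reflection does not promise to return the properties in declaration order, so code relying on process should not depend on the order in which siblings are visited.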

Node position
The Node interface has exactly one property: the position. The position represents, well, the position
of the node in the original code. It will be useful when we need to show a message to the
user, for example about an error we found. To do so we want to be able to indicate a position in the
code, like "line 3, columns 10 to 20".
These are the classes we will use to define the position: Position and Point.
A Point is a pair of a line and a column, while a Position is a portion of code defined by two
extremes: two points.
Here are their definitions and some operations that will be useful:

data class Point(val line: Int, val column: Int) {

    override fun toString() = "Line $line, Column $column"

    /**
     * Translate the Point to an offset in the original code stream.
     */
    fun offset(code: String) : Int {
        val lines = code.split("\n")
        val newLines = this.line - 1
        return lines.subList(0, this.line - 1).foldRight(0, { it, acc -> it.length + acc }) +
                newLines + column
    }

    fun isBefore(other: Point) : Boolean =
            line < other.line || (line == other.line && column < other.column)

}

data class Position(val start: Point, val end: Point) {

    init {
        if (end.isBefore(start)) {
            throw IllegalArgumentException("End should follow start")
        }
    }

    /**
     * Given the whole code extract the portion of text corresponding to this position
     */
    fun text(wholeText: String): String {
        return wholeText.substring(start.offset(wholeText), end.offset(wholeText))
    }

    fun length(code: String) = end.offset(code) - start.offset(code)
}

/**
 * Utility function to create a Position
 */
fun pos(startLine:Int, startCol:Int, endLine:Int, endCol:Int) =
        Position(Point(startLine,startCol),Point(endLine,endCol))
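
The offset calculation is easier to follow on a worked example. The sketch below repeats only the parts of Point and Position needed for it (with the sum written in a slightly different but equivalent way) and assumes, as the tests later in this chapter suggest, that lines are counted from 1 and columns from 0.

```kotlin
data class Point(val line: Int, val column: Int) {
    fun offset(code: String): Int {
        val precedingLines = code.split("\n").subList(0, line - 1)
        // Characters on the preceding lines, plus one newline for each of them, plus the column
        return precedingLines.fold(0) { acc, l -> acc + l.length } + (line - 1) + column
    }
}

data class Position(val start: Point, val end: Point) {
    fun text(wholeText: String): String =
        wholeText.substring(start.offset(wholeText), end.offset(wholeText))
}

fun main() {
    val code = "var a = 1\na = 7"
    // The literal 7 sits on line 2, between columns 4 and 5
    val seven = Position(Point(2, 4), Point(2, 5))
    println(seven.start.offset(code)) // 14: nine characters on line 1, one newline, four columns
    println(seven.text(code))         // 7
}
```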

Other operations on Node


We may want to be able to print an AST, as we printed the parse-tree in previous examples. We can
do that with Node.multilineString:

import java.lang.reflect.ParameterizedType
import kotlin.reflect.full.memberProperties
import kotlin.reflect.jvm.javaType

// indentBlock is a String constant holding one level of indentation; it is not defined in this listing
fun Node.multilineString(indent: String = "") : String {
    val sb = StringBuffer()
    sb.append("$indent${this.javaClass.simpleName} {\n")
    this.javaClass.kotlin.memberProperties
            .filter { !it.name.startsWith("component") && !it.name.equals("position") }
            .forEach {
        val mt = it.returnType.javaType
        if (mt is ParameterizedType && mt.rawType.equals(List::class.java)){
            // A list of nodes: print each element, indented by two more levels
            val paramType = mt.actualTypeArguments[0]
            if (paramType is Class<*> && Node::class.java.isAssignableFrom(paramType)) {
                sb.append("$indent$indentBlock${it.name} = [\n")
                (it.get(this) as List<out Node>).forEach {
                    sb.append(it.multilineString(indent + indentBlock + indentBlock))
                }
                sb.append("$indent$indentBlock]\n")
            }
        } else {
            val value = it.get(this)
            if (value is Node) {
                // A single child node
                sb.append("$indent$indentBlock${it.name} = [\n")
                sb.append(value.multilineString(indent + indentBlock + indentBlock))
                sb.append("$indent$indentBlock]\n")
            } else {
                // A simple value
                sb.append("$indent$indentBlock${it.name} = ${it.get(this)}\n")
            }
        }
    }
    sb.append("$indent}\n")
    return sb.toString()
}

Or we may want to check if a Node comes before or after another node, considering their
corresponding position in the code:

fun Node.isBefore(other: Node) : Boolean = position!!.start.isBefore(other.position!!.start)

Names and references


In addition to that we will also want to resolve references. When we parse the code we recognize
identifiers. Sometimes identifiers are used to name things we are declaring, like here:

event myEvent

Sometimes they are used to refer to things we have declared:

state aState {
    on myEvent -> myOtherState
}

In this example we have three identifiers:

• aState defines the name of a state we are declaring
• myEvent identifies an event on which we want to do a transition, i.e. it indicates a reference
  to an event declaration
• myOtherState identifies the state to move to when receiving the event, i.e. it indicates a
  reference to a state declaration

During the parsing phase an identifier is just an identifier. In our AST we want instead to recognize
the references and treat them differently from the identifiers used to name things. In particular
we want to be able to resolve those references: we want to get a pointer from each reference to the
declared element it refers to. This will make implementing some operations much easier.
Let's start by defining an interface which will mark the things having a name:

interface Named {
    val name: String
}

Now, not everything that is Named is necessarily a Node, because our code could refer to external
elements which are not defined by code: for example compiled classes or external resources.

data class ReferenceByName<N>(val name: String, var referred: N? = null) where N : Named {
    override fun toString(): String {
        if (referred == null) {
            return "Ref($name)[Unsolved]"
        } else {
            return "Ref($name)[Solved]"
        }
    }
}

How will we resolve references? Simply by passing a list of named things and trying to find a match:

fun <N> ReferenceByName<N>.tryToResolve(candidates: List<N>) : Boolean where N : Named {
    val res = candidates.find { it.name == this.name }
    this.referred = res
    return res != null
}

Note that references are the only mutable classes we have as part of our model.
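
A small, self-contained sketch may help to see resolution at work. Named, ReferenceByName and tryToResolve are repeated from the text in compact form, while EventDecl is a hypothetical declaration type used only for this illustration.

```kotlin
interface Named {
    val name: String
}

data class ReferenceByName<N>(val name: String, var referred: N? = null) where N : Named {
    override fun toString() =
        if (referred == null) "Ref($name)[Unsolved]" else "Ref($name)[Solved]"
}

fun <N> ReferenceByName<N>.tryToResolve(candidates: List<N>): Boolean where N : Named {
    val res = candidates.find { it.name == this.name }
    this.referred = res
    return res != null
}

// Hypothetical declaration type, standing in for something like an event declaration
data class EventDecl(override val name: String) : Named

fun main() {
    val declared = listOf(EventDecl("myEvent"), EventDecl("otherEvent"))
    val ref = ReferenceByName<EventDecl>("myEvent")
    println(ref)                         // Ref(myEvent)[Unsolved]
    println(ref.tryToResolve(declared))  // true
    println(ref)                         // Ref(myEvent)[Solved]
    println(ReferenceByName<EventDecl>("missing").tryToResolve(declared)) // false
}
```

One consequence of using a data class is worth keeping in mind: referred takes part in equals, so a resolved reference is not equal to an unresolved reference with the same name. This matters when comparing an AST built by hand with one on which references have already been resolved.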

Defining the metamodel of the Abstract Syntax Tree


We have seen the basic classes that we will use to define all ASTs. Now let's see the metamodels for
our example languages.
Metamodel is another big word you can use to impress your friends. It means a model of a model. In
other words, a metamodel defines the structure you can use to build a model. So by metamodel in
this case we mean the list of classes which will be used for the AST.

The metamodel for MiniCalc


We will define one data class for each type of Node. We are using data classes so we get the
hashCode, equals and toString methods for free. Kotlin also generates constructors and getters for us.
Try to imagine how much code that would be in Java.
Let's start with the top node type, the one representing the whole file. Let's also include the
interfaces we will use to represent the most relevant types of nodes:

//
// MiniCalc main entities
//

data class MiniCalcFile(val statements : List<Statement>,
                        override val position: Position? = null) : Node

interface Statement : Node

interface Expression : Node

interface Type : Node

Now we can look at the Nodes representing Types:

//
// Types
//

data class IntType(override val position: Position? = null) : Type

data class DecimalType(override val position: Position? = null) : Type

data class StringType(override val position: Position? = null) : Type

Note that these nodes do not bring any relevant information, just their position.
Time to look at the expressions. In the parse tree we used to have a node of type binaryOperation. In
our AST metamodel instead we have four separate node types: SumExpression, SubtractionExpression,
MultiplicationExpression, and DivisionExpression. BinaryExpression is just an interface which acts
as a common ancestor for these four operations.

//
// Expressions
//

interface BinaryExpression : Expression {
    val left: Expression
    val right: Expression
}

data class SumExpression(override val left: Expression, override val right: Expression,
                         override val position: Position? = null) : BinaryExpression

data class SubtractionExpression(override val left: Expression, override val right: Expression,
                                 override val position: Position? = null) : BinaryExpression

data class MultiplicationExpression(override val left: Expression, override val right: Expression,
                                    override val position: Position? = null) : BinaryExpression

data class DivisionExpression(override val left: Expression, override val right: Expression,
                              override val position: Position? = null) : BinaryExpression

data class UnaryMinusExpression(val value: Expression,
                                override val position: Position? = null) : Expression

data class TypeConversion(val value: Expression, val targetType: Type,
                          override val position: Position? = null) : Expression

data class ValueReference(val ref: ReferenceByName<ValueDeclaration>,
                          override val position: Position? = null) : Expression

data class IntLit(val value: String, override val position: Position? = null) : Expression

data class DecLit(val value: String, override val position: Position? = null) : Expression

Most of the expressions have other nodes as children. A few have simple values instead. They are
ValueReference (which has a property ref of type ReferenceByName<ValueDeclaration>),
and IntLit and DecLit (both have a property value of type String).
Let's look separately at StringLit. Given that we support interpolated strings in MiniCalc, each
string literal is a sequence of elements which can be constants or interpolated values. For example
"hi #{name}!" will be represented as a StringLit node with three elements: a ConstantStringLitPart
("hi "), an ExpressionStringLItPart (name), and another ConstantStringLitPart ("!").

data class StringLit(val parts: List<StringLitPart>,
                     override val position: Position? = null) : Expression

interface StringLitPart : Node

data class ConstantStringLitPart(val content: String,
                                 override val position: Position? = null) : StringLitPart

data class ExpressionStringLItPart(val expression: Expression,
                                   override val position: Position? = null) : StringLitPart
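
As an illustration, this is roughly how the AST for the literal "hi #{name}!" could be built by hand. The classes are simplified copies of the ones above (positions are left out), and greeting and render are hypothetical helpers that exist only in this sketch.

```kotlin
// Simplified copies of the metamodel classes, without positions
interface Node
interface Expression : Node
interface Named { val name: String }
interface ValueDeclaration : Node, Named
data class ReferenceByName<N : Named>(val name: String, var referred: N? = null)
data class ValueReference(val ref: ReferenceByName<ValueDeclaration>) : Expression

interface StringLitPart : Node
data class ConstantStringLitPart(val content: String) : StringLitPart
data class ExpressionStringLItPart(val expression: Expression) : StringLitPart
data class StringLit(val parts: List<StringLitPart>) : Expression

// Hypothetical helper: the AST for the MiniCalc literal "hi #{name}!"
fun greeting(): StringLit = StringLit(listOf(
    ConstantStringLitPart("hi "),
    ExpressionStringLItPart(ValueReference(ReferenceByName("name"))),
    ConstantStringLitPart("!")))

// Hypothetical helper: turn the parts back into the text between the quotes.
// Only references are rendered, any other expression is shown as "...".
fun StringLit.render(): String = parts.joinToString("") { part ->
    when (part) {
        is ConstantStringLitPart -> part.content
        is ExpressionStringLItPart -> {
            val e = part.expression
            "#{" + (if (e is ValueReference) e.ref.name else "...") + "}"
        }
        else -> throw UnsupportedOperationException(part.javaClass.canonicalName)
    }
}

fun main() {
    println(greeting().parts.size) // 3
    println(greeting().render())   // hi #{name}!
}
```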

Time to look at the statements. We introduce the interface ValueDeclaration to represent a common
ancestor for InputDeclaration and VarDeclaration. We need it because our ValueReferences can
refer to either inputs or variables, so we need some node type to indicate both.
Finally we have the four classes implementing Statement.

//
// Statements
//

interface ValueDeclaration : Statement, Named

data class VarDeclaration(override val name: String, val value: Expression,
                          override val position: Position? = null) : ValueDeclaration

data class InputDeclaration(override val name: String, val type: Type,
                            override val position: Position? = null) : ValueDeclaration

data class Assignment(val varDecl: ReferenceByName<VarDeclaration>, val value: Expression,
                      override val position: Position? = null) : Statement

data class Print(val value: Expression, override val position: Position? = null) : Statement

The metamodel for StaMac


Let's now take a look at the metamodel for StaMac, starting with the top node:

data class StateMachine(val name: String,
                        val inputs: List<InputDeclaration>,
                        val variables: List<VarDeclaration>,
                        val events: List<EventDeclaration>,
                        val states: List<StateDeclaration>,
                        override val position: Position? = null) : Node

Here we see that we separate the different kinds of children that are part of the preamble into
different groups: inputs, variables, and events. Finally we get a list of states.

//
// Top level elements
//

interface Typed { val type: Type }

interface ValueDeclaration : Node, Named, Typed { }

data class InputDeclaration(override val name: String,
                            override val type: Type,
                            override val position: Position? = null) : ValueDeclaration

data class VarDeclaration(override val name: String,
                          val explicitype: Type?,
                          val value: Expression,
                          override val position: Position? = null) : ValueDeclaration {
    override val type: Type
        get() = explicitype ?: value.type()
}

data class EventDeclaration(override val name: String,
                            override val position: Position? = null) : Node, Named

data class StateDeclaration(override val name: String,
                            val start: Boolean,
                            val blocks: List<StateBlock>,
                            override val position: Position? = null) : Node, Named

As we did for MiniCalc we have introduced a common ancestor for the InputDeclaration and the
VarDeclaration. It is named ValueDeclaration. Here we also have an interface named Typed. A
Typed element has a type, obviously. In the case of the InputDeclaration it is always explicitly
present, while in the case of VarDeclaration it can be either explicitly present or inferred by
looking at the type of the initial value.
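
The following sketch shows the two cases side by side. It is only an approximation of the real code: types are modelled as objects to keep it short, and Expression.type() is an assumed helper, because the listing above calls value.type() without showing how it is implemented.

```kotlin
interface Type
// Types are plain objects here only to keep the sketch short
object IntType : Type { override fun toString() = "IntType" }
object DecimalType : Type { override fun toString() = "DecimalType" }

interface Expression
data class IntLit(val value: String) : Expression
data class DecLit(val value: String) : Expression

// Assumed helper: a very small type calculation covering just the two literals
fun Expression.type(): Type = when (this) {
    is IntLit -> IntType
    is DecLit -> DecimalType
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

interface Typed { val type: Type }

data class VarDeclaration(val name: String,
                          val explicitype: Type?,
                          val value: Expression) : Typed {
    // The explicit type wins; otherwise the type of the initial value is used
    override val type: Type
        get() = explicitype ?: value.type()
}

fun main() {
    println(VarDeclaration("a", null, IntLit("1")).type)        // IntType (inferred)
    println(VarDeclaration("b", DecimalType, IntLit("1")).type) // DecimalType (explicit)
}
```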

//
// Interfaces
//

interface StateBlock : Node
interface Statement : Node
interface Expression : Node
interface Type : Node

//
// StateBlocks
//

data class OnEntryBlock(val statements: List<Statement>,
                        override val position: Position? = null) : StateBlock
data class OnExitBlock(val statements: List<Statement>,
                       override val position: Position? = null) : StateBlock
data class OnEventBlock(val event: ReferenceByName<EventDeclaration>,
                        val destination: ReferenceByName<StateDeclaration>,
                        override val position: Position? = null) : StateBlock

For StaMac we also introduced a common ancestor for IntType and DecimalType: NumberType.

//
// Types
//

interface NumberType : Type

data class IntType(override val position: Position? = null) : NumberType

data class DecimalType(override val position: Position? = null) : NumberType

data class StringType(override val position: Position? = null) : Type

//
// Expressions
//

interface BinaryExpression : Expression {
    val left: Expression
    val right: Expression
}

data class SumExpression(override val left: Expression, override val right: Expression,
                         override val position: Position? = null) : BinaryExpression

data class SubtractionExpression(override val left: Expression, override val right: Expression,
                                 override val position: Position? = null) : BinaryExpression

data class MultiplicationExpression(override val left: Expression, override val right: Expression,
                                    override val position: Position? = null) : BinaryExpression

data class DivisionExpression(override val left: Expression, override val right: Expression,
                              override val position: Position? = null) : BinaryExpression

data class UnaryMinusExpression(val value: Expression,
                                override val position: Position? = null) : Expression

data class TypeConversion(val value: Expression, val targetType: Type,
                          override val position: Position? = null) : Expression

data class ValueReference(val symbol: ReferenceByName<ValueDeclaration>,
                          override val position: Position? = null) : Expression

data class IntLit(val value: String, override val position: Position? = null) : Expression

data class DecLit(val value: String, override val position: Position? = null) : Expression

data class StringLit(val value: String, override val position: Position? = null) : Expression

Expressions look similar to the ones we had in MiniCalc, just the StringLit is much simpler because
we do not have string interpolation in StaMac.

//
// Statements
//

data class Assignment(val variable: ReferenceByName<VarDeclaration>, val value: Expression,
                      override val position: Position? = null) : Statement

data class Print(val value: Expression, override val position: Position? = null) : Statement

Mapping the parse tree into the Abstract Syntax Tree


The Abstract Syntax Tree metamodel is simply the structure of the data we want to use for our
Abstract Syntax Tree (AST). In this case we are defining it by defining the classes which we will use
for our AST.
The AST metamodel looks reasonably similar to the parse tree metamodel, i.e., the set of classes
generated by ANTLR to contain the nodes.
We have already discussed some of the differences. In summary:

• it will have a simpler and nicer API than the classes generated by ANTLR (so the classes
  composing the parse tree). In the next sections we will see how this API makes it possible to
  perform transformations on the AST
• we will remove elements which are meaningful only while parsing but that logically are
  useless: for example the parenthesis expression or the line node
• some nodes for which we have separate instances in the parse tree can correspond to a single
  instance in the AST. This is the case of the type references Int and Decimal, which in the AST
  can be defined using singleton objects when positions are not needed (the metamodels above use
  data classes so that each node can carry its own position)
• we can define common interfaces for related node types like BinaryExpression
• to define how to parse a variable declaration we reuse the assignment rule. In the AST the
  two concepts are completely separated
• certain operations have the same node type in the parse tree, but are separated in the AST.
  This is the case of the different types of binary expressions

Let's now see how we can get the parse tree, produced by ANTLR, and map it into our AST
classes.

First we define some utility functions to translate the positions, from the way they are expressed in
the parse tree, to the way we want to define them in the AST:

fun Token.startPoint() = Point(line, charPositionInLine)

fun Token.endPoint() = Point(line, charPositionInLine + text.length)

fun ParserRuleContext.toPosition(considerPosition: Boolean) : Position? {
    return if (considerPosition) Position(start.startPoint(), stop.endPoint()) else null
}
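
To see what these functions produce, here is a small sketch that builds two tokens by hand. It assumes the ANTLR 4 runtime is on the classpath; the token helper is hypothetical (in real code the tokens come from the lexer), and Point and Position are reduced to plain data holders.

```kotlin
import org.antlr.v4.runtime.CommonToken
import org.antlr.v4.runtime.Token

data class Point(val line: Int, val column: Int)
data class Position(val start: Point, val end: Point)

// ANTLR counts lines from 1 and the position in the line from 0
fun Token.startPoint() = Point(line, charPositionInLine)

// The end is computed from the token text, so this assumes the token does not span several lines
fun Token.endPoint() = Point(line, charPositionInLine + text.length)

// Hypothetical helper that builds a token by hand for the sake of the example
fun token(text: String, line: Int, column: Int): Token =
    CommonToken(Token.INVALID_TYPE, text).apply {
        this.line = line
        this.charPositionInLine = column
    }

fun main() {
    // The first and the last token of the statement "var a = 13"
    val first = token("var", 1, 0)
    val last = token("13", 1, 8)
    println(Position(first.startPoint(), last.endPoint()))
    // Position(start=Point(line=1, column=0), end=Point(line=1, column=10))
}
```

The position of a whole rule is obtained in the same way: it goes from the start of its first token to the end of its last token.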

Now we can look at the specific mapping, as implemented for MiniCalc and for StaMac.

Mapping MiniCalc

fun MiniCalcFileContext.toAst(considerPosition: Boolean = false) : MiniCalcFile =
        MiniCalcFile(this.line().map { it.statement().toAst(considerPosition) },
                toPosition(considerPosition))

fun StatementContext.toAst(considerPosition: Boolean = false) : Statement = when (this) {
    is VarDeclarationStatementContext -> VarDeclaration(
            varDeclaration().assignment().ID().text,
            varDeclaration().assignment().expression().toAst(considerPosition),
            toPosition(considerPosition))
    is AssignmentStatementContext -> Assignment(
            ReferenceByName(assignment().ID().text),
            assignment().expression().toAst(considerPosition),
            toPosition(considerPosition))
    is PrintStatementContext -> Print(
            print().expression().toAst(considerPosition),
            toPosition(considerPosition))
    is InputDeclarationStatementContext -> InputDeclaration(
            this.inputDeclaration().ID().text,
            this.inputDeclaration().type().toAst(considerPosition),
            toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun ExpressionContext.toAst(considerPosition: Boolean = false) : Expression = when (this) {
    is BinaryOperationContext -> toAst(considerPosition)
    is IntLiteralContext -> IntLit(text, toPosition(considerPosition))
    is DecimalLiteralContext -> DecLit(text, toPosition(considerPosition))
    is StringLiteralContext -> StringLit(this.parts.map { it.toAst(considerPosition) },
            toPosition(considerPosition))
    is ParenExpressionContext -> expression().toAst(considerPosition)
    is ValueReferenceContext -> ValueReference(ReferenceByName(text),
            toPosition(considerPosition))
    is TypeConversionContext -> TypeConversion(expression().toAst(considerPosition),
            targetType.toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun StringLiteralContentContext.toAst(considerPosition: Boolean = false) : StringLitPart = when (this) {
    is ConstantStringContext -> ConstantStringLitPart(this.STRING_CONTENT().text,
            toPosition(considerPosition))
    is InterpolatedValueContext -> ExpressionStringLItPart(
            this.expression().toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun TypeContext.toAst(considerPosition: Boolean = false) : Type = when (this) {
    is IntegerContext -> IntType(toPosition(considerPosition))
    is DecimalContext -> DecimalType(toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun BinaryOperationContext.toAst(considerPosition: Boolean = false) : Expression = when (operator.text) {
    "+" -> SumExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "-" -> SubtractionExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "*" -> MultiplicationExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "/" -> DivisionExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

To implement this we have taken advantage of three very useful features of Kotlin:

• extension methods: we added the method toAst to several existing classes
• the when construct, which is a more powerful version of switch
• smart casts: after we check that an object has a certain class the compiler implicitly casts it
  to that type, so that we can use the specific methods of that class

We could come up with a mechanism to automatically derive this mapping for most of the rules and
just customize it where the parse tree and the AST differ. To avoid using too much reflection black
magic we are not going to do that for now. If I were using Java I would just go down the reflection
road to avoid having to manually write a lot of redundant and boring code. However, using Kotlin
this code is compact and clear.
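
If these three features are new to you, the following toy example shows them in isolation. All the classes in it are hypothetical: Ctx, NumberCtx and NegationCtx play the role of classes generated by ANTLR, which we cannot modify, while Num and Neg play the role of our AST.

```kotlin
// Hypothetical "parse tree" classes: imagine they are generated and cannot be changed
open class Ctx
class NumberCtx(val digits: String) : Ctx()
class NegationCtx(val inner: Ctx) : Ctx()

// Hypothetical AST classes
interface Expr
data class Num(val value: Int) : Expr
data class Neg(val operand: Expr) : Expr

// Extension method: toAst is added to Ctx without touching its source code
fun Ctx.toAst(): Expr = when (this) {
    // when: each branch checks the class of the receiver.
    // Smart cast: inside a branch "this" has the checked type, so digits and inner
    // can be used without an explicit cast.
    is NumberCtx -> Num(digits.toInt())
    is NegationCtx -> Neg(inner.toAst())
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun main() {
    println(NegationCtx(NumberCtx("4")).toAst()) // Neg(operand=Num(value=4))
}
```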

Mapping StaMac
When mapping the root of the parse-tree to the root of the AST for StaMac we remove the preamble
and redistribute its content directly into the StateMachine node. This is because the preamble had a
role from the syntactic point of view but it has no semantic meaning. It was useful to group all kinds
of declarations that we wanted to have at the top of the file, before the state declarations, but we
do not need to preserve it. Also, the preamble contained a list of preamble elements: input
declarations, variable declarations, and event declarations all mixed together in any order. In the
AST we instead prefer to have three separate lists, so we filter the preamble elements depending on
their type. We then translate each preamble element to its equivalent in the AST and pass the
resulting collections to the StateMachine constructor.

//
// StateMachine
//

fun StateMachineContext.toAst(considerPosition: Boolean = false) : StateMachine = StateMachine(
        this.preamble().name.text,
        this.preamble().elements.filterIsInstance(InputDeclContext::class.java)
                .map { it.toAst(considerPosition) },
        this.preamble().elements.filterIsInstance(VarDeclContext::class.java)
                .map { it.toAst(considerPosition) },
        this.preamble().elements.filterIsInstance(EventDeclContext::class.java)
                .map { it.toAst(considerPosition) },
        this.states.map { it.toAst(considerPosition) },
        toPosition(considerPosition))
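
The filtering step can be tried out on its own. In this sketch PreambleElement and its subclasses are hypothetical stand-ins for the contexts generated by ANTLR, and Preamble and splitPreamble exist only for the illustration.

```kotlin
// Hypothetical stand-ins for the preamble element contexts
open class PreambleElement(val name: String)
class InputDecl(name: String) : PreambleElement(name)
class VarDecl(name: String) : PreambleElement(name)
class EventDecl(name: String) : PreambleElement(name)

// Hypothetical result type: the names of the elements, grouped by kind
data class Preamble(val inputs: List<String>,
                    val variables: List<String>,
                    val events: List<String>)

fun splitPreamble(elements: List<PreambleElement>) = Preamble(
    // Form taking a Java class, as used in the mapping code
    elements.filterIsInstance(InputDecl::class.java).map { it.name },
    // Equivalent form with a reified type parameter
    elements.filterIsInstance<VarDecl>().map { it.name },
    elements.filterIsInstance<EventDecl>().map { it.name })

fun main() {
    val mixed = listOf(EventDecl("tick"), InputDecl("x"), VarDecl("count"), InputDecl("y"))
    println(splitPreamble(mixed))
    // Preamble(inputs=[x, y], variables=[count], events=[tick])
}
```

Each resulting list keeps its elements in the order in which they appeared in the mixed list, and it is typed with the requested class, so no cast is needed before calling toAst on the elements.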

The rest of the transformations are not particularly interesting and they follow a basic schema.

//
// Top level elements
//

fun InputDeclContext.toAst(considerPosition: Boolean = false) : InputDeclaration = InputDeclaration(
        this.name.text, this.type().toAst(considerPosition), toPosition(considerPosition))

fun VarDeclContext.toAst(considerPosition: Boolean = false) : VarDeclaration = VarDeclaration(
        this.name.text, this.type()?.toAst(considerPosition),
        this.initialValue.toAst(considerPosition), toPosition(considerPosition))

fun EventDeclContext.toAst(considerPosition: Boolean = false) : EventDeclaration = EventDeclaration(
        this.name.text, toPosition(considerPosition) )

fun StateContext.toAst(considerPosition: Boolean = false) : StateDeclaration = StateDeclaration(
        this.name.text, this.start != null, this.blocks.map { it.toAst(considerPosition) },
        toPosition(considerPosition))

//
// StateBlocks
//

fun StateBlockContext.toAst(considerPosition: Boolean = false) : StateBlock = when (this) {
    is EntryBlockContext -> OnEntryBlock(this.statements.map { it.toAst(considerPosition) })
    is ExitBlockContext -> OnExitBlock(this.statements.map { it.toAst(considerPosition) })
    is TransitionBlockContext -> OnEventBlock(ReferenceByName(this.eventName.text),
            ReferenceByName(this.destinationName.text), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Types
//

fun TypeContext.toAst(considerPosition: Boolean = false) : Type = when (this) {
    is IntegerContext -> IntType(toPosition(considerPosition))
    is DecimalContext -> DecimalType(toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Expressions
//

fun ExpressionContext.toAst(considerPosition: Boolean = false) : Expression = when (this) {
    is BinaryOperationContext -> toAst(considerPosition)
    is IntLiteralContext -> IntLit(text, toPosition(considerPosition))
    is DecimalLiteralContext -> DecLit(text, toPosition(considerPosition))
    is StringLiteralContext -> StringLit(text, toPosition(considerPosition))
    is ParenExpressionContext -> expression().toAst(considerPosition)
    is ValueReferenceContext -> ValueReference(ReferenceByName(text),
            toPosition(considerPosition))
    is TypeConversionContext -> TypeConversion(expression().toAst(considerPosition),
            targetType.toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun BinaryOperationContext.toAst(considerPosition: Boolean = false) : Expression = when (operator.text) {
    "+" -> SumExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "-" -> SubtractionExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "*" -> MultiplicationExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    "/" -> DivisionExpression(left.toAst(considerPosition), right.toAst(considerPosition),
            toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Statements
//

fun StatementContext.toAst(considerPosition: Boolean = false) : Statement = when (this) {
    is AssignmentStatementContext -> Assignment(ReferenceByName(assignment().ID().text),
            assignment().expression().toAst(considerPosition), toPosition(considerPosition))
    is PrintStatementContext -> Print(print().expression().toAst(considerPosition),
            toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

Testing the mapping



    class ModelTest {

        @test fun transformVarName() {
            val startTree = MiniCalcFile(listOf(
                    VarDeclaration("A", IntLit("10")),
                    Assignment("A", IntLit("11")),
                    Print(VarReference("A"))))
            val expectedTransformedTree = MiniCalcFile(listOf(
                    VarDeclaration("B", IntLit("10")),
                    Assignment("B", IntLit("11")),
                    Print(VarReference("B"))))
            assertEquals(expectedTransformedTree, startTree.transform {
                when (it) {
                    is VarDeclaration -> VarDeclaration("B", it.value)
                    is VarReference -> VarReference("B")
                    is Assignment -> Assignment("B", it.value)
                    else -> it
                }
            })
        }
    }

Given we are solid engineers we want to build solid code by testing every component. In this case
we will test the mapping by defining an expected AST, parsing the code and verifying that the two
match. Note that we build the expected AST manually.

    @test fun mapSimpleFileWithPositions() {
        val code = """var a = 1 + 2
                     |a = 7 * (2 / 3)""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst(considerPosition = true)
        val expectedAst = MiniCalcFile(listOf(
                VarDeclaration("a",
                        SumExpression(
                                IntLit("1", pos(1,8,1,9)),
                                IntLit("2", pos(1,12,1,13)),
                                pos(1,8,1,13)),
                        pos(1,0,1,13)),
                Assignment(ReferenceByName("a"),
                        MultiplicationExpression(
                                IntLit("7", pos(2,4,2,5)),
                                DivisionExpression(
                                        IntLit("2", pos(2,9,2,10)),
                                        IntLit("3", pos(2,13,2,14)),
                                        pos(2,9,2,14)),
                                pos(2,4,2,15)),
                        pos(2,0,2,15))),
                pos(1,0,2,15))
        assertEquals(expectedAst, ast)
    }

It would be much more convenient not to have to define the positions of all the elements of the AST.
So we do not specify the position for the nodes we build manually and, for the AST obtained by
transforming the parse tree, we leave considerPosition set to false, which is the default value. This
way the tests are much easier to write:

    @test fun mapSimpleFileWithoutPositions() {
        val code = """var a = 1 + 2
                     |a = 7 * (2 / 3)""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        val expectedAst = MiniCalcFile(listOf(
                VarDeclaration("a", SumExpression(IntLit("1"), IntLit("2"))),
                Assignment(ReferenceByName("a"), MultiplicationExpression(
                        IntLit("7"),
                        DivisionExpression(
                                IntLit("2"),
                                IntLit("3"))))))
        assertEquals(expectedAst, ast)
    }

    @test fun mapCastInt() {
        val code = "a = 7 as Int"
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        val expectedAst = MiniCalcFile(listOf(Assignment(ReferenceByName("a"),
                TypeConversion(IntLit("7"), IntType()))))
        assertEquals(expectedAst, ast)
    }

    @test fun mapCastDecimal() {
        val code = "a = 7 as Decimal"
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        val expectedAst = MiniCalcFile(listOf(Assignment(ReferenceByName("a"),
                TypeConversion(IntLit("7"), DecimalType()))))
        assertEquals(expectedAst, ast)
    }

    @test fun mapPrint() {
        val code = "print(a)"
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        val expectedAst = MiniCalcFile(listOf(Print(ValueReference(ReferenceByName("a")))))
        assertEquals(expectedAst, ast)
    }

Summary
With this chapter we conclude our journey from the code to the model: the AST is the model on
which we are going to work. We have designed it to contain only the relevant information, we have
built functions to operate on it. By doing this we have put solid foundations on which to build the
next blocks.
7. Symbol resolution
In this short chapter we will see how to resolve symbols.
When we parse our code we obtain a parse-tree, which is indeed a tree: a parent is
connected to its children but there are no other links. The nodes are organized in a strict hierarchy.
By resolving symbols we create new links between references and their declarations. In a sense we
transform our tree into a graph, because the links are no longer strictly hierarchical.
These new links are very important because references are just placeholders, which have no
knowledge about the referred elements. We instead need that knowledge when processing code.
For example, if you have a reference to a variable v in an expression, in order to calculate the type
of the expression you need to know the type of v. The reference does not contain that information,
but the original declaration of the variable does. For this reason you want to have a link to navigate
from the reference to the declaration and extract from there all the information you need.
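
To make the idea concrete, here is a minimal sketch of a reference that carries such a link. It is modelled on the ReferenceByName class used in the listings of this chapter, but the exact shape shown here, and the VarDecl class with its typeName field, are assumptions made for illustration:

```kotlin
// Hypothetical, simplified model: anything that can be referred to has a name.
interface Named {
    val name: String
}

// Illustrative declaration node: a variable with a name and the name of its type.
data class VarDecl(override val name: String, val typeName: String) : Named

// A reference starts as a bare name; `referred` is the extra link added by symbol resolution.
class ReferenceByName<N : Named>(val name: String, var referred: N? = null) {
    val resolved: Boolean
        get() = referred != null
}

// Once the link is set we can navigate from the reference to the declaration and read its type.
fun typeNameOf(ref: ReferenceByName<VarDecl>): String? = ref.referred?.typeName
```

Before resolution typeNameOf returns null; once referred has been set it returns the type recorded in the declaration.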
References can be of different types, and resolving them can be more or less complicated depending
on the case. Let's consider some examples from the Java language. (These are examples I had to
consider when working on JavaSymbolSolver, a symbol solver for Java used to analyze Java code
parsed with JavaParser.)

Example: reference to a value in Java

This is the simplest of the examples we are going to consider. When we refer to a value in Java we
could be referring to:

a local variable,
a method parameter,
a field of the current class,
an inherited field,
a statically imported field

In some cases we could have multiple matches, for example a field and a local variable both having
the name used by a certain reference. We resolve these ambiguities by selecting the most specific
declaration, where most specific in general means closest to the point of usage.

Example: reference to a type in Java

If we encounter a reference to a type A in Java we need to consider different possibilities. A could be
a type parameter. For example:

    class Foo<A> {
        // here I can refer to A, the type parameter
    }

Or it could be the current class or a class wrapping the current class:

    class A {
        class B {
            // here I can refer to A or B
        }
    }

Alternatively it could be a class we imported:

    import foo.bar.A;

Maybe we imported a whole package:

    import foo.bar.*; // that package could contain a class A

Or we could refer to a class A defined in the same package as the current class.

Example: reference to a method in Java


Understanding which method is invoked can be very complicated in Java, more than you would probably
imagine.
First of all, in a method call the method actually invoked depends on the type of the scope. So if I
have this call:

    foo.aMethod(aParam);

The actual method invoked depends on the type of foo: when we have a scope we first of all need to
calculate its type. If no scope is specified then only methods of the current class
(declared or inherited) or statically imported methods can be invoked.
Secondly, there could be different overloaded versions of the same method, i.e., different methods
with the same name but taking different parameters. There are all sorts of considerations to make. In
general you start by considering the number of parameters, taking into account variadic methods, i.e.,
methods that can accept a variable number of parameters. Then you need to verify that the types of
the actual arguments are compatible with the types of the formal parameters of the method considered.
You could also have multiple matches; in that case you need to select the closest match.
We are not even considering type arguments, lambdas, type inference and other aspects of the
language that make this problem significantly more complex.
So in general resolving symbols is not trivial. However in many cases it is, and it definitely is for the
simple languages we are considering.

Resolving symbols in MiniCalc


In MiniCalc we could have references to variables or inputs. We want to be able to refer to inputs
and variables defined before the current statement.
Ideally we would do that by finding all the references, then for each reference we would look at its
containing statement and get the preceding statements. At that point we would consider all
the InputDeclaration and VarDeclaration nodes contained among those statements: our reference
should point to one of those.
The problem is that, because of the way we have implemented the AST so far, we have no way to find
the parent of a Node. We can traverse the tree from the top to the leaves but not the other way around.
If we wanted to change that we would have to implement bidirectional relationships, so that whenever
a Node knows its child the child also knows its parent, and whenever we assign the child to some other
parent both sides of the relationship are updated. We could do that, but it would not be trivial and it
would mean building more complicated classes in our model.
We could instead navigate the AST once, after it is built, finding all the child-parent pairs and saving
them. Then, considering we are not changing the AST, we can keep using those pairs to navigate
from the child to the parent, as needed. We store those pairs in a map, built by the function
childParentMap:

    fun Node.childParentMap() : Map<Node, Node> {
        val map = IdentityHashMap<Node, Node>()
        this.processConsideringParent({ child, parent -> if (parent != null) map[child] = parent })
        return map
    }

Now we can use it to find the parent and the parent of the parent and so on, until we reach the root
of the AST. We will use this mechanism to find an ancestor of a particular type:

    fun <T: Node> Node.ancestor(klass: Class<T>, childParentMap: Map<Node, Node>) : T? {
        if (childParentMap.containsKey(this)) {
            val p = childParentMap[this]
            if (klass.isInstance(p)) {
                return p as T
            }
            return p!!.ancestor(klass, childParentMap)
        }
        return null
    }
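
If you want to experiment with this mechanism in isolation, the following self-contained sketch rebuilds it on a toy tree. The Node interface with a children property and the Leaf, Stmt and Root classes are hypothetical stand-ins for the real AST classes. The map is filled with a plain recursive visit rather than with processConsideringParent, and ancestor is written as a loop instead of a recursion:

```kotlin
import java.util.IdentityHashMap

// Hypothetical mini-AST, only for illustration: each node exposes its direct children.
interface Node {
    val children: List<Node>
}

class Leaf(val label: String) : Node {
    override val children: List<Node> = emptyList()
}

class Stmt(val expr: Node) : Node {
    override val children: List<Node> = listOf(expr)
}

class Root(val statements: List<Stmt>) : Node {
    override val children: List<Node> = statements
}

// Visit the tree once and record the parent of every node, comparing nodes by identity.
fun Node.childParentMap(): Map<Node, Node> {
    val map = IdentityHashMap<Node, Node>()
    fun visit(parent: Node) {
        for (child in parent.children) {
            map[child] = parent
            visit(child)
        }
    }
    visit(this)
    return map
}

// Walk up the recorded parents until a node of the requested class is found.
fun <T : Node> Node.ancestor(klass: Class<T>, parents: Map<Node, Node>): T? {
    var current: Node? = parents[this]
    while (current != null) {
        if (klass.isInstance(current)) return klass.cast(current)
        current = parents[current]
    }
    return null
}
```

Using an identity-based map matters here: AST nodes are often data classes, so two distinct nodes with the same content would otherwise collide as keys.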

Now we can use the function ancestor to find the Statement containing a certain ValueReference.
When we have it we just look at the statements preceding that one. We select all the ValueDeclaration
nodes (either InputDeclaration or VarDeclaration) and we look for a match with our
reference, going from the last one to the first one. We do that by reversing the list of preceding value
declarations and passing it to tryToResolve.

    fun MiniCalcFile.resolveSymbols() {

        val childParentMap = this.childParentMap()

        // Resolve value reference to the closest thing before
        this.specificProcess(ValueReference::class.java) {
            val statement = it.ancestor(Statement::class.java, childParentMap)!! as Statement
            val valueDeclarations = this.statements.preceedings(statement).filterIsInstance<ValueDeclaration>()
            it.ref.tryToResolve(valueDeclarations.reversed())
        }

        // We need to consider also assignments
    }
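
The function above relies on an extension named preceedings, which is not shown in this section. A minimal sketch of how it could be written, assuming it simply returns the elements of the list that come before the given one:

```kotlin
// Assumed helper: all the elements that come before `element` in the list.
// Elements are compared by identity, since two distinct AST nodes may be structurally equal.
fun <T> List<T>.preceedings(element: T): List<T> {
    val index = this.indexOfFirst { it === element }
    require(index >= 0) { "The element is not part of the list" }
    return this.subList(0, index)
}
```

Because the declarations are returned in source order, reversing the result (as resolveSymbols does) makes the closest preceding declaration the first candidate.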

The function tryToResolve looks like this:

    fun <N> ReferenceByName<N>.tryToResolve(candidates: List<N>) : Boolean where N : Named {
        val res = candidates.find { it.name == this.name }
        this.referred = res
        return res != null
    }

We also have assignments to consider, because they contain a reference to a variable declaration.
They are simpler to handle because they are statements themselves (no need to search for the containing
statement). We will use the same approach of considering only the preceding statements. In this
case we will look only at VarDeclarations, not at all ValueDeclarations, because assignments cannot
refer to InputDeclarations.

    this.specificProcess(Assignment::class.java) {
        val varDeclarations = this.statements.preceedings(it).filterIsInstance<VarDeclaration>()
        it.varDecl.tryToResolve(varDeclarations.reversed())
    }

Resolving symbols in StaMac


In StaMac we can have references to variables or inputs and assignments to variables, like we had
in MiniCalc.
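However there is a difference: in StaMac all assignments happen on transitions, after inputs and
variables have been defined. So we do not need to look at what precedes a reference: we can simply
try to resolve it against all the declared variables and inputs.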

    this.specificProcess(ValueReference::class.java) {
        if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
            errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved",
                    it.position!!))
        }
    }

    this.specificProcess(Assignment::class.java) {
        if (!it.variable.tryToResolve(this.variables)) {
            errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved",
                    it.position!!))
        }
    }

We then have transitions. Each transition has two references: one to the event on which we execute
the transition and one to the destination state.

    this.specificProcess(OnEventBlock::class.java) {
        if (!it.event.tryToResolve(this.events)) {
            errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(OnEventBlock::class.java) {
        if (!it.destination.tryToResolve(this.states)) {
            errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved",
                    it.position!!))
        }
    }

Testing the symbol resolution


Time to write some tests. Let's consider just MiniCalc in this case. First of all we want to verify
that references to values are resolved correctly. We should:

be able to resolve references to a variable or an input declared before
not be able to resolve references to a variable declared in the same statement
not be able to resolve references to a variable or an input declared in a following statement
not be able to resolve references to nonexistent variables or inputs

    class SymbolResolutionTest {

        @test fun resolveValueReferenceToVariableDeclaredBefore() {
            val code = """var a = 1 + 2
                         |var b = 7 * a""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(true, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
            assertEquals("a", ast.collectByType(ValueReference::class.java)[0].ref.name)
        }

        @test fun resolveValueReferenceToInputDeclaredBefore() {
            val code = """input Int a
                         |var b = 7 * a""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(true, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
            assertEquals("a", ast.collectByType(ValueReference::class.java)[0].ref.name)
        }

        @test fun resolveValueReferenceToVariableDeclaredOnSameLine() {
            val code = """var a = 1 + a""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(false, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
        }

        @test fun resolveValueReferenceToVariableDeclaredAfter() {
            val code = """var a = b
                         |var b = 0""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(false, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
        }

        @test fun resolveValueReferenceToInputDeclaredAfter() {
            val code = """var a = b
                         |input Int b""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(false, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
        }

        @test fun resolveValueReferenceToUnexistingValue() {
            val code = """var a = 1 + 2
                         |var b = 7 * c""".trimMargin("|")
            val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
            ast.resolveSymbols()
            assertEquals(1, ast.collectByType(ValueReference::class.java).size)
            assertEquals(false, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
        }

        // more tests to follow

    }

We can also verify variable assignments. We should:

be able to assign variables defined before
not be able to assign inputs defined before
not be able to assign variables or inputs defined after
not be able to assign nonexistent values

    @test fun resolveAssignmentOfVariableDeclaredBefore() {
        val code = """var a = 1 + 2
                     |a = 7 * a""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(Assignment::class.java).size)
        assertEquals(true, ast.collectByType(Assignment::class.java)[0].varDecl.resolved)
        assertEquals("a", ast.collectByType(Assignment::class.java)[0].varDecl.name)
    }

    @test fun resolveAssignmentOfInputDeclaredBefore() {
        val code = """input Int a
                     |a = 10""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(Assignment::class.java).size)
        assertEquals(false, ast.collectByType(Assignment::class.java)[0].varDecl.resolved)
    }

    @test fun resolveAssignmentOfVariableDeclaredAfter() {
        val code = """a = 7 * a
                     |var a = 1 + 2""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(Assignment::class.java).size)
        assertEquals(false, ast.collectByType(Assignment::class.java)[0].varDecl.resolved)
    }

    @test fun resolveAssignmentOfInputDeclaredAfter() {
        val code = """a = 10
                     |input Int a""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(Assignment::class.java).size)
        assertEquals(false, ast.collectByType(Assignment::class.java)[0].varDecl.resolved)
    }

    @test fun resolveAssignmentOfUnexistingValue() {
        val code = """var a = 1 + 2
                     |d = 7 * a""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(Assignment::class.java).size)
        assertEquals(false, ast.collectByType(Assignment::class.java)[0].varDecl.resolved)
    }

Summary
The examples of symbol resolution we have seen are rather simple, because we have not used
concepts like inheritance or nested scopes, where a symbol defined in an inner scope can shadow a
symbol defined in an outer one. However the principles remain the same: what changes is just
the way we navigate the AST to identify the referenced elements.
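
To give an idea of what changes with nested scopes, here is a small sketch that is not part of MiniCalc or StaMac. The Scope class is hypothetical, and plain strings stand in for declaration nodes:

```kotlin
// Hypothetical scope: a set of declared names plus an optional enclosing scope.
class Scope(val parent: Scope? = null) {
    private val declarations = mutableMapOf<String, String>()

    // Register a declaration; the String value stands in for a real declaration node.
    fun declare(name: String, declaration: String) {
        declarations[name] = declaration
    }

    // Look in the current scope first, then in the enclosing ones:
    // this is what makes an inner symbol shadow an outer one with the same name.
    fun resolve(name: String): String? =
        declarations[name] ?: parent?.resolve(name)
}
```

The lookup is the same "closest declaration wins" rule used for MiniCalc, except that we walk up a chain of scopes instead of scanning the preceding statements.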
8. Typesystem
I guess you know what a typesystem is and why it is useful, so we can skip the motivational speech
and get down to business.
In this chapter we are going to briefly look at how a typesystem works and then move on to
the interesting part: how to implement one.

Types
Many languages have pretty similar typesystems. Sure, as supporters of one language or another
we tend to emphasize the differences that make our favorite one so much better (look ma, reified
generics!), but the core is pretty common. We are just taking a look at what works decently well.
When you build your own language you will get a chance to be more creative, but it is always
good to know what worked for others.

Basic Types
There are some basic types you have in most languages:

boolean
types for numbers
character
string

Types for numbers can be divided into types for integers and types for real numbers, and each group
can contain several types that differ in precision.
These types typically get special support in the language, given their role as building blocks.
In some languages string is not a primitive type but normally it has some kind of built-in support
because, you know, strings are the kind of things you may want to use quite frequently.

Declared types
Depending on the language the user could have the possibility of defining new types.
For example:

classes
structs
interfaces
enums

Some languages also permit defining type aliases. They typically are not proper types, just additional
names for existing types. That means that when you do this:

    typedef myNewType = int

you just define an alias for int, so you can use int or myNewType interchangeably.

Parametric types
Simple types are types which do not have any sort of parameter, so every use of such a type denotes
exactly the same type. An int is just an int.
Not all types are like that: think about arrays or collections. An array of int is not the same type as
an array of string. This is because array is a parametric type. Or if you wish, it is not a type at all,
it is more like a type template, something that you can use to create proper types like array of float,
or array of double.
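
A minimal sketch of how this distinction could be modelled. The names below are hypothetical and are not the classes used later in this chapter:

```kotlin
// Hypothetical type model, for illustration only.
interface Type

// Simple types: every occurrence denotes the same type.
object IntType : Type
object StringType : Type

// A parametric type: we get a proper type only once the element type is supplied.
data class ArrayType(val elementType: Type) : Type
```

With this representation ArrayType(IntType) and ArrayType(StringType) are two different types, while two separately built ArrayType(IntType) values compare as equal.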

Subtyping
Types are typically organized in some sort of hierarchy, so that some types are subtypes of other
types.
The classical example is defining the class Cat as a subtype of the class Animal. Or maybe the class
Rectangle as a subtype of the interface Shape.
This could also apply to primitive or built-in types. You could have a type named number and int,
or float, could be subtypes of number, for example.
In general a subtype should respect the Liskov substitution principle
(https://en.wikipedia.org/wiki/Liskov_substitution_principle). In layman's terms we
could summarize it as: you should be able to use an instance of the subtype wherever you can legally
use an instance of the supertype, and the program should still be legal.

Typesystem rules
Here we see how to calculate the types of all the expressions.
The crucial part of implementing a simple typesystem is specifying how to calculate the type of each
expression.

Calculate the type of literals


Typically you start by calculating the type of literals.
A string literal? It has type string.
A boolean literal? It has type boolean.
Things can be slightly trickier for numbers, when you have different types with different precision.
For example in Java you can define a floating point literal to be of type float or double:

    // a float
    float f = 0.01f;

    // a double
    double d = 0.01;

In Java there are four types for integer numbers and not all of them are supported in the same way:

    byte b = (byte)9999999;
    short s = (short)9999999;
    int i = 9999999;
    long l = 9999999L;

By default an integer literal is an int; however the suffix l/L can be used to make the literal a
long. The other types representing integers (byte and short) do not have the same level of support.
Indeed there is simply no way of defining a byte literal or a short literal. All that you can do is
define an int or a long value and then cast it to either byte or short. Note that I am not advocating
that the Java typesystem is a good example to follow, just discussing a real case.
Some languages also distinguish between signed and unsigned numbers. In that case there are
typically modifiers to indicate literals of one type or the other.

Calculate the type of mathematical operations


Typically the type of a mathematical operation depends on the types of the operands. For example,
summing two integers should produce an integer while summing two floats should produce
a float. Things can get more complicated. Depending on the language we may want to allow summing
a float and an integer and consider the result a float, or we could consider it invalid, requiring the
two operands to be converted to a common type before being added.
Supporting the four basic arithmetic operations could be enough in many DSLs. GPLs tend to support
all sorts of bit operations (shifts, bitwise and, bitwise or, and so on), but unless those concepts are
important in the domain of your language you can leave them out.

Boolean logic operations


These operations tend to work on boolean values and produce a boolean result. The ones you typically
want to have are:

logical-not
logical-and
logical-or

Relational operations
In this case you need two elements that are comparable. So you need some logic to understand which
types are comparable with each other. Is it legal to compare 5 to 3.12 in your language? Are strings
comparable?
The result of these operations will always be a boolean.

==
!=
>
>=
<
<=

Collection operations
You may have operators to access or set elements in collections.
For example:

    val v1 = myList[0]
    val v2 = myMap["Key"]

In this case the result of the access has the element type of the collection. It means that if myList
is a collection of float then v1 will be of type float.

Conditional operator
The conditional operator is a sort of concise if that is an expression and not a statement. It is present
in C and Java:
myCondition ? valueIfConditionIsTrue : valueIfConditionIsFalse

What is the type of the result? Well, ideally you want to find the most specific common ancestor of
the types of the two possible return values. It could get tricky.
Consider this case:
myCondition ? "hi" : "hello"

The result is clearly a string.


What about this?
myCondition ? 1 : 2.3

What should be the result? A float? An int? A number, meaning some abstract concept which is a
supertype of both int and float?
This is the kind of decision you need to make when designing your typesystem. In this case I would
find it elegant to consider the result type to be a number, but it would probably be more practical
to consider it a float.
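
The first option, finding the most specific common ancestor, can be sketched as follows. The type hierarchy below (AnyType, NumberType and so on, each pointing to a single direct supertype) is an assumption made for this example, not the typesystem built later in this chapter:

```kotlin
// Hypothetical type hierarchy: each type optionally names its direct supertype.
open class Type(val name: String, val supertype: Type? = null) {
    // The type itself followed by all of its ancestors, from most to least specific.
    fun selfAndAncestors(): List<Type> =
        generateSequence(this) { it.supertype }.toList()
}

object AnyType : Type("any")
object NumberType : Type("number", AnyType)
object IntType : Type("int", NumberType)
object FloatType : Type("float", NumberType)
object StringType : Type("string", AnyType)

// Type of `cond ? a : b`: the first ancestor of the first branch that is also an ancestor of the second.
fun conditionalType(thenType: Type, elseType: Type): Type? {
    val elseAncestors = elseType.selfAndAncestors()
    return thenType.selfAndAncestors().firstOrNull { it in elseAncestors }
}
```

With this hierarchy the expression myCondition ? 1 : 2.3 gets the type number. Choosing float instead would require an explicit promotion rule rather than a walk up the hierarchy.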

Casting
We may want to cast a type to another, either to force a conversion (e.g., from integer to float) or
because we know that a certain value has a more specific type. For example, we could have a value
we got from a parameter of type Object, but we know that the value will always be a String, so we
explain it to the compiler by using a cast.
So the type of:
(someType) anyExpression

is always the type specified in the cast (someType in this case).

I want more
This is just a very brief discussion on types. It should be sufficient to get you started and build many
simple but useful languages.
If you are going to build DSLs, you are probably not going to need more complex
typesystems, while if you are going to build a General Purpose Language you could need way more
complex stuff.
However if you want to get into the hard stuff you could read:

Types and Programming Languages by Benjamin C. Pierce
Type Theory and Functional Programming by Simon Thompson
Proofs and Types by Jean-Yves Girard, Yves Lafont, and Paul Taylor

Let's see the code

Enough talking (or better, enough writing). Let's go to the real stuff and see some code.
The two typesystems we are going to see are very similar and both very simple.

Typesystem for MiniCalc


Depending on the kind of operations you want to support in your language, you need to support
some operations on your types. For example, if you want to support relational operators you need,
given two types, to figure out if they are comparable or not.
In our simple language we just want to know whether a value of a given type can be assigned to a
variable of a certain type.

    interface Type : Node {
        fun isAssignableBy(other: Type) : Boolean {
            return this.equals(other)
        }
    }

Our default implementation tells us that a value of a certain type is assignable exclusively to a
variable of the very same type. That means that you can assign a string to a string variable, for
example.
Let's see one exception:

    data class DecimalType(override val position: Position? = null) : Type {
        override fun isAssignableBy(other: Type) : Boolean {
            return other is IntType || other is DecimalType
        }
    }

For variables of type DecimalType we can accept either values of type DecimalType or of type
IntType, because we can promote an int to a decimal.
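
To see the rule at work outside the full AST, here is a simplified, self-contained sketch. It uses singleton objects instead of the data classes above and drops the Node supertype and the position, so it is an approximation of the real classes:

```kotlin
// Simplified stand-ins for the types of this chapter (no Node, no position).
interface Type {
    // By default only a value of the very same type can be assigned.
    fun isAssignableBy(other: Type): Boolean = this == other
}

object IntType : Type
object StringType : Type

// The exception: a decimal variable also accepts int values, which can be promoted.
object DecimalType : Type {
    override fun isAssignableBy(other: Type): Boolean =
        other === IntType || other === DecimalType
}
```

Note that the relation is not symmetric: DecimalType.isAssignableBy(IntType) holds, while IntType.isAssignableBy(DecimalType) does not.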
Links for the books suggested in the section "I want more":
Types and Programming Languages: https://www.cis.upenn.edu/~bcpierce/tapl/
Type Theory and Functional Programming: https://www.cs.kent.ac.uk/people/staff/sjt/TTFP/
Proofs and Types: http://www.paultaylor.eu/stable/Proofs+Types.html

Now we are going to see the code for calculating the type of expressions. You are going to be surprised
by how concise it is. Part of the merit goes to Kotlin, which is wonderful for writing this kind of
code:

    fun Expression.type() : Type =
        when (this) {
            is IntLit -> IntType()
            is DecLit -> DecimalType()
            is StringLit -> StringType()
            is SumExpression -> {
                if (this.left.type() is StringType) {
                    StringType()
                } else if (onNumbers()) {
                    if (left.type() is DecimalType || right.type() is DecimalType)
                        DecimalType() else IntType()
                } else {
                    throw IllegalArgumentException(
                            "This operation should be performed on numbers or start with a string")
                }
            }
            is SubtractionExpression, is MultiplicationExpression, is DivisionExpression -> {
                val be = this as BinaryExpression
                if (!be.onNumbers()) {
                    throw IllegalArgumentException("This operation should be performed on numbers")
                }
                if (be.left.type() is DecimalType || be.right.type() is DecimalType)
                    DecimalType() else IntType()
            }
            is UnaryMinusExpression -> this.value.type()
            is TypeConversion -> this.targetType
            is ValueReference -> this.ref.referred!!.type()
            else -> throw UnsupportedOperationException("No way to calculate the type of $this")
        }

Let's examine it. Calculating the type of literals is pretty simple:

    is IntLit -> IntType()
    is DecLit -> DecimalType()
    is StringLit -> StringType()

Then comes the SumExpression, which is not as simple:

    is SumExpression -> {
        if (this.left.type() is StringType) {
            StringType()
        } else if (onNumbers()) {
            if (left.type() is DecimalType || right.type() is DecimalType)
                DecimalType() else IntType()
        } else {
            throw IllegalArgumentException(
                    "This operation should be performed on numbers or start with a string")
        }
    }

The point is that what we called SumExpression is doing two very different things depending on the
operands:

If on the left we have a string what we are doing is actually a string concatenation. We convert
whatever is on the right to a string and append it to the string on the left.
If we have numbers as operands then we are actually summing them. The type of the result
depends on the types of the operands. If at least one operand is a DecimalType then the
result is a DecimalType, otherwise both operands are IntType and the result is also
an IntType.
In all other cases we cannot perform the operation.

Note that in this case we are handling, as part of the typesystem, the fact that we basically have two
different operations using the same syntax. This is ok here because the language is reasonably
simple. In other cases I would prefer to do a transformation on the AST as an intermediate step,
transforming the SumExpression nodes representing string concatenation into nodes of a different
type, so that the rest of the code would be much simpler.
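
Such an intermediate transformation could be sketched as follows. This is not part of MiniCalc as presented: the cut-down AST, the StringConcatenation node and the desugar function are all hypothetical names introduced for this example:

```kotlin
// Hypothetical mini-AST, for illustration only.
sealed class Expression
data class IntLit(val value: String) : Expression()
data class StringLit(val value: String) : Expression()
data class SumExpression(val left: Expression, val right: Expression) : Expression()
// Extra node type introduced by the transformation.
data class StringConcatenation(val left: Expression, val right: Expression) : Expression()

// True when an already transformed expression evaluates to a string.
fun Expression.isString(): Boolean = when (this) {
    is StringLit, is StringConcatenation -> true
    is IntLit, is SumExpression -> false
}

// Rewrite bottom-up: a sum whose left operand is a string becomes a concatenation.
fun Expression.desugar(): Expression = when (this) {
    is SumExpression -> {
        val l = left.desugar()
        val r = right.desugar()
        if (l.isString()) StringConcatenation(l, r) else SumExpression(l, r)
    }
    is StringConcatenation -> StringConcatenation(left.desugar(), right.desugar())
    is IntLit, is StringLit -> this
}
```

After this step every remaining SumExpression is a numeric sum, so the typesystem and any later stage can treat the two operations separately.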
The other mathematical operators are less ambiguous:

    is SubtractionExpression, is MultiplicationExpression, is DivisionExpression -> {
        val be = this as BinaryExpression
        if (!be.onNumbers()) {
            throw IllegalArgumentException("This operation should be performed on numbers")
        }
        if (be.left.type() is DecimalType || be.right.type() is DecimalType)
            DecimalType() else IntType()
    }

The only thing we have to consider is if we should return a DecimalType or an IntType.


When inverting the sign, the result has the same type as the original value. If we invert an IntType
we still get an IntType and if we invert a DecimalType we still get a DecimalType.

    is UnaryMinusExpression -> this.value.type()

This is our cast. The result has the type to which we cast:

    is TypeConversion -> this.targetType

When we have a ValueReference, the type of the reference is exactly the type of the element being
referred. So if we refer to the variable a and a was declared to be an IntType then also our reference
is an IntType.

    is ValueReference -> this.ref.referred!!.type()

We left out a few extension methods we introduced to make the previous code simpler. Lets take a
look at those.
This extension method is useful to figure out if a type represents a number.

fun Type.isNumberType() = this is IntType || this is DecimalType

We could have instead created an abstract supertype named NumberType and made both IntType and
DecimalType extend it. Then we could have checked whether a type represents a number just by
using the instance-of operator (myType is NumberType). In this case the chosen solution was good
enough and simpler.
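For comparison, the alternative design described above could be sketched as follows. The class names mirror the ones used in this chapter, but the hierarchy itself is an assumption made for illustration; it is not the code used in MiniCalc.

```kotlin
// Hypothetical alternative hierarchy: NumberType groups the numeric types.
interface Type
abstract class NumberType : Type
class IntType : NumberType()
class DecimalType : NumberType()
class StringType : Type

// With the common supertype, the check becomes a plain instance-of test.
fun Type.isNumberType() = this is NumberType
```

The trade-off is one more class in the hierarchy in exchange for a check that does not need updating when a new numeric type is added.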
We then have another extension method which is also related to numbers. We just want a simple
way to figure out if a BinaryExpression is performed on two number operands.

fun BinaryExpression.onNumbers() = left.type().isNumberType() && right.type().isNumberType()

We want to be able to get the type for every ValueDeclaration. In the case of inputs the type is
explicitly defined, so we just return it. In the case of variables it is not: it is inferred from the
initial value.

fun ValueDeclaration.type() =
    when (this) {
        is VarDeclaration -> this.value.type()
        is InputDeclaration -> this.type
        else -> throw UnsupportedOperationException()
    }

Consider this example:


input Int myInput
var myVar = "hello"

The type of myInput is found in the original code, while the type of myVar is obtained by calculating
the type of the initial value ("hello" in this case).

Typesystem for StaMac


The typesystem for StaMac is very, very similar to the one for MiniCalc. We had no reason to get
creative and we just reapplied the stuff that worked.
Let's see the corresponding code:

fun BinaryExpression.onNumbers() = (left.type() is NumberType) && (right.type() is NumberType)

fun Expression.type() : Type =
    when (this) {
        is IntLit -> IntType()
        is DecLit -> DecimalType()
        is StringLit -> StringType()
        is SumExpression -> {
            if (this.left.type() is StringType) {
                StringType()
            } else if (onNumbers()) {
                if (left.type() is DecimalType || right.type() is DecimalType)
                    DecimalType() else IntType()
            } else {
                throw IllegalArgumentException("This operation should be performed on numbers or start with a string")
            }
        }
        is SubtractionExpression, is MultiplicationExpression, is DivisionExpression -> {
            val be = this as BinaryExpression
            if (!be.onNumbers()) {
                throw IllegalArgumentException("This operation should be performed on numbers")
            }
            if (be.left.type() is DecimalType || be.right.type() is DecimalType)
                DecimalType() else IntType()
        }
        is UnaryMinusExpression -> this.value.type()
        is TypeConversion -> this.targetType
        is ValueReference -> this.symbol.referred!!.type
        else -> throw UnsupportedOperationException("No way to calculate the type of $this")
    }

The only difference we have here is on the rule for ValueReference.

is ValueReference -> this.symbol.referred!!.type

In MiniCalc we created an extension method to calculate the type of the symbol referred to. In StaMac
it is instead always contained in a field.
The field type comes from the interface Typed. Note that in the case of an InputDeclaration it
is always explicit. In the case of a VarDeclaration instead it can be either explicit or inferred. In
StaMac you can write this:

var v1 : Int = 10
var v2 = "foo"

The type of v1 is Int, because it is explicitly indicated, while the type of v2 is
inferred by calculating the type of the initial value ("foo"). This logic is handled directly in
VarDeclaration:

interface Typed { val type: Type }

interface ValueDeclaration : Node, Named, Typed

data class InputDeclaration(override val name: String,
                            override val type: Type,
                            override val position: Position? = null) : ValueDeclaration

data class VarDeclaration(override val name: String,
                          val explicitType: Type?,
                          val value: Expression,
                          override val position: Position? = null) : ValueDeclaration {
    override val type: Type
        get() = explicitType ?: value.type()
}

Summary
Typesystems have a scary reputation. Now, you may need very complex and elaborate typesystems,
which are not trivial to implement. However, that does not have to be the case, and unless you need
to be creative you can get away with reapplying some basic patterns common to most languages.
9. Validation
You have parsed your code and built an Abstract Syntax Tree. At this point we can start working on
this Abstract Syntax Tree. The first thing we should do is verify that the code we parsed makes
sense at a semantic level.
The process of lexing and parsing told us whether the code made sense at a syntactical level. If it
did not, maybe we could not even build an Abstract Syntax Tree. The fact that a piece of code makes
sense at a syntactical level does not necessarily mean it is correct.
Typical semantic errors are:

defining two variables with the same name
referring to a symbol that was not defined
trying to assign a value to a variable of an incompatible type

Validation for MiniCalc


The validation will produce a list of errors, possibly empty. For each error we will need a description
and the position in the code, so that we can communicate that to the user. This translates to a very
simple data class in Kotlin:

data class Error(val message: String, val position: Point)

We could add a level, for example to also support warnings. We could also come up with error codes.
But let's keep things simple.
When writing in Kotlin I add the validation as an extension method for the AST root. Note that
in this case all the validation happens at the root level. In more complex cases I would define the
validation at different levels (e.g., class level and method level, if your language has those) and
invoke these more specific validation methods from the validation method of the root node.

fun MiniCalcFile.validate() : List<Error> {
    val errors = LinkedList<Error>()

    // check a variable is not duplicated
    val varsByName = HashMap<String, VarDeclaration>()
    this.specificProcess(VarDeclaration::class.java) {
        if (varsByName.containsKey(it.name)) {
            errors.add(Error("A variable named '${it.name}' has been already declared at ${varsByName[it.name]!!.position!!.start}",
                    it.position!!.start))
        } else {
            varsByName[it.name] = it
        }
    }

    // check all references are resolved
    this.specificProcess(ValueReference::class.java) {
        if (!it.ref.resolved) {
            errors.add(Error("Unresolved reference ${it.ref.name}", it.position!!.start))
        }
    }
    this.specificProcess(Assignment::class.java) {
        if (!it.varDecl.resolved) {
            errors.add(Error("Unresolved reference ${it.varDecl.name}", it.position!!.start))
        }
    }

    // check assignments use compatible types
    this.specificProcess(Assignment::class.java) {
        if (it.varDecl.resolved) {
            val actualType = it.value.type()
            val formalType = it.varDecl.referred!!.type()
            if (!formalType.isAssignableBy(actualType)) {
                errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!.start))
            }
        }
    }

    return errors
}

Let's examine the different pieces one by one.

// check a variable is not duplicated
val varsByName = HashMap<String, VarDeclaration>()
this.specificProcess(VarDeclaration::class.java) {
    if (varsByName.containsKey(it.name)) {
        errors.add(Error("A variable named '${it.name}' has been already declared at ${varsByName[it.name]!!.position!!.start}",
                it.position!!.start))
    } else {
        varsByName[it.name] = it
    }
}

In this case we do not want to find two variables with the same name. There are no other named
elements that could clash with variables, so we are considering only one type of node. In other cases
we might want to verify that a name is unique among several types of node. For example, we may
want to forbid having both a variable foo and a function foo if this leads to ambiguous usages in our
language.
Note that we do not have nested scopes here, but only one global scope, so all variables are
defined in the global scope.
As we find variables we check whether their name was already used. If that is the case we produce an
error, otherwise we mark the name as used. This means that if we have two variables with the same
name the error will be associated only with the second one; the first one will be considered correct. I
prefer this approach, while others prefer to show the error on both variables. This is another small
design choice. Small, sure, but they tend to pile up.
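For readers who prefer the other policy, reporting the error on every declaration that shares a name, a minimal sketch could look like this. The data classes below are simplified stand-ins for the real AST classes, and duplicateErrors is a hypothetical helper, not part of MiniCalc.

```kotlin
// Minimal stand-ins for the AST classes; the real ones carry more information.
data class Point(val line: Int, val column: Int)
data class VarDeclaration(val name: String, val start: Point)
data class Error(val message: String, val position: Point)

// Alternative policy: every declaration sharing a name gets an error.
fun duplicateErrors(declarations: List<VarDeclaration>): List<Error> =
    declarations.groupBy { it.name }
        .values
        .filter { it.size > 1 }
        .flatten()
        .map { Error("A variable named '${it.name}' is declared more than once", it.start) }
```

Grouping first means the whole list is inspected before any error is produced, which is what allows the first declaration to be flagged as well.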

// check all references are resolved
this.specificProcess(ValueReference::class.java) {
    if (!it.ref.resolved) {
        errors.add(Error("Unresolved reference ${it.ref.name}", it.position!!.start))
    }
}
this.specificProcess(Assignment::class.java) {
    if (!it.varDecl.resolved) {
        errors.add(Error("Unresolved reference ${it.varDecl.name}", it.position!!.start))
    }
}

Here we check that all references are resolved. For this validation to succeed we expect the symbol
resolution to have happened in a previous step. In other cases we could explicitly invoke the symbol
resolution during validation (and that is what we do in StaMac).
For MiniCalc we expect someone to have called resolveSymbols:

fun <E> List<E>.preceedings(element: E) = this.subList(0, indexOf(element))

fun MiniCalcFile.resolveSymbols() {

    val childParentMap = this.childParentMap()

    // Resolve value reference to the closest thing before
    this.specificProcess(ValueReference::class.java) {
        val statement = it.ancestor(Statement::class.java, childParentMap)!! as Statement
        val valueDeclarations = this.statements.preceedings(statement).filterIsInstance<ValueDeclaration>()
        it.ref.tryToResolve(valueDeclarations.reversed())
    }

    this.specificProcess(Assignment::class.java) {
        val varDeclarations = this.statements.preceedings(it).filterIsInstance<VarDeclaration>()
        it.varDecl.tryToResolve(varDeclarations.reversed())
    }
}

Finally we need to check that our usage of types is consistent. To do that we verify that when assigning
a value to a variable the value has a type compatible with the type of the variable. We do not want
to assign a string value to an int variable.

// check assignments use compatible types
this.specificProcess(Assignment::class.java) {
    if (it.varDecl.resolved) {
        val actualType = it.value.type()
        val formalType = it.varDecl.referred!!.type()
        if (!formalType.isAssignableBy(actualType)) {
            errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!.start))
        }
    }
}

In this case, this translates to checking that the values assigned to variables are compatible with the
variable. This prevents us from assigning a string value to an int variable. The type of a variable
is inferred from the type of its initial value, like this:

var a = 1 // this is an int variable
var b = "hi!" // this is a string variable

While the type is not explicit in the code, it is defined for each variable. Yes, we got static typing
without the typical ceremonies.
Note also that we do not strictly need to assign the exact same type the variable had, but only a
type that is compatible. This means that we can assign an int value to a decimal variable. It will be
converted to a decimal value. We cannot do the opposite: we cannot assign a decimal value to an
int variable because that conversion could lead to a loss of information. Of course you can allow that
in your language, if you want.
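The compatibility rule just described could be expressed as in the sketch below. The method name isAssignableBy is the one used in the validation code, but this body is an assumption, and the types are modelled as simple objects only to keep the example self-contained.

```kotlin
// Minimal stand-ins for the type classes used in this chapter.
interface Type
object IntType : Type
object DecimalType : Type
object StringType : Type

// Assumed rule: a type accepts itself, and a decimal also accepts an int
// (widening). The opposite direction is rejected because it may lose information.
fun Type.isAssignableBy(other: Type): Boolean =
    this == other || (this == DecimalType && other == IntType)
```

A language that wanted to allow narrowing conversions would only need to relax this one function.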

Validation for StaMac


This is the method performing validation on the StaMac AST:

fun StateMachine.validate() : List<Error> {
    val errors = LinkedList<Error>()

    // check a symbol or input is not duplicated
    val valuesByName = HashMap<String, Int>()
    this.specificProcess(ValueDeclaration::class.java) {
        checkForDuplicate(valuesByName, errors, it)
    }

    val eventsByName = HashMap<String, Int>()
    this.specificProcess(EventDeclaration::class.java) {
        checkForDuplicate(eventsByName, errors, it)
    }

    val statesByName = HashMap<String, Int>()
    this.specificProcess(StateDeclaration::class.java) {
        checkForDuplicate(statesByName, errors, it)
    }

    // check references
    this.specificProcess(ValueReference::class.java) {
        if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
            errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(Assignment::class.java) {
        if (!it.variable.tryToResolve(this.variables)) {
            errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(OnEventBlock::class.java) {
        if (!it.event.tryToResolve(this.events)) {
            errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(OnEventBlock::class.java) {
        if (!it.destination.tryToResolve(this.states)) {
            errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved", it.position!!))
        }
    }

    // check the initial value is compatible with the explicitly declared type
    this.specificProcess(VarDeclaration::class.java) {
        if (it.explicitType != null && !it.explicitType.isAssignableBy(it.value.type())) {
            // report the value type first, then the declared type of the variable
            errors.add(Error("Cannot assign ${it.value.type()} to variable of type ${it.explicitType}", it.position!!))
        }
    }
    // check the type used in assignment is compatible
    this.specificProcess(Assignment::class.java) {
        if (it.variable.resolved) {
            val actualType = it.value.type()
            val formalType = it.variable.referred!!.type
            if (!formalType.isAssignableBy(actualType)) {
                errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!))
            }
        }
    }

    // we have exactly one start state
    if (this.states.filter { it.start }.size != 1) {
        errors.add(Error("A StateMachine should have exactly one start state", this.position!!))
    }

    return errors
}

StaMac and MiniCalc have a similar typesystem and similar validation rules. This is not surprising
because we are seeing the most typical patterns for typesystems and validations, and they tend to
be common across many languages. There are a few differences, anyway, so let's look at them.
We start by defining a function for checking duplicate names.

fun checkForDuplicate(elementsByName: MutableMap<String, Int>, errors : MutableList<Error>, named: Named) {
    if (elementsByName.containsKey(named.name)) {
        errors.add(Error("A symbol named '${named.name}' has been already declared at line ${elementsByName[named.name]}",
                (named as Node).position!!))
    } else {
        elementsByName[named.name] = (named as Node).position!!.start.line
    }
}

In MiniCalc we had only variables to consider. In StaMac we have different kinds of nodes, with
different namespaces. This means that names have to be unique only within a certain kind of node,
while it is ok to have a state and an event with the same name. Maybe it is not a smart idea, but
it is legal in the language. Someone could prefer to forbid it or give a warning to the user. In my
experience people who are designing their first language tend to want to be more in control and
prohibit things like this, while after a while a language designer realizes that they need to provide a
tool to users and get out of the way. To me it does not seem to make sense to give an event and a
state the same name, but a user could have a reason to do that, so unless it is strictly needed
for the consistency of my language I would not prohibit it.
So here is how we check for duplicate names:

val valuesByName = HashMap<String, Int>()
this.specificProcess(ValueDeclaration::class.java) {
    checkForDuplicate(valuesByName, errors, it)
}

val eventsByName = HashMap<String, Int>()
this.specificProcess(EventDeclaration::class.java) {
    checkForDuplicate(eventsByName, errors, it)
}

val statesByName = HashMap<String, Int>()
this.specificProcess(StateDeclaration::class.java) {
    checkForDuplicate(statesByName, errors, it)
}

Then we verify that all references are resolved. In this case we perform symbol resolution as part of
the validation.

// check references
this.specificProcess(ValueReference::class.java) {
    if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
        errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(Assignment::class.java) {
    if (!it.variable.tryToResolve(this.variables)) {
        errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(OnEventBlock::class.java) {
    if (!it.event.tryToResolve(this.events)) {
        errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(OnEventBlock::class.java) {
    if (!it.destination.tryToResolve(this.states)) {
        errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved", it.position!!))
    }
}

As you can see we just call tryToResolve and verify whether it has found a match. If it did not, we add an
error to our list. Note also that in the case of a ValueReference we try twice to resolve the symbol,
first looking for variables with that name and then looking for inputs. The order does not matter
because inputs and variables cannot share a name: both are ValueDeclaration, and we have
checked them for duplicates in the initial part of our validation method.
Then we check that the Assignments refer to an existing variable. We do not consider inputs here
because inputs cannot be assigned.
Finally we check for every OnEventBlock that both the event and the destination state can be
resolved.
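The validation code relies on tryToResolve without showing it here. A minimal sketch of how such a reference holder could work is given below; the ReferenceByName class, its fields and the first-match policy are assumptions made for illustration and may differ from the actual implementation.

```kotlin
interface Named { val name: String }

// Hypothetical reference holder: keeps the name and, once resolved, the target.
class ReferenceByName<N : Named>(val name: String, var referred: N? = null) {
    val resolved: Boolean
        get() = referred != null

    // Picks the first candidate with a matching name; returns whether it succeeded.
    // A failed attempt leaves any previous resolution untouched.
    fun tryToResolve(candidates: List<N>): Boolean {
        val match = candidates.firstOrNull { it.name == name } ?: return false
        referred = match
        return true
    }
}

// Stand-in declaration type used to exercise the sketch.
data class EventDeclaration(override val name: String) : Named
```

Returning a Boolean is what lets the validation chain two attempts with && and report an error only when both fail.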
Regarding the typesystem consistency, we have the same rule on assignments as we have seen in
MiniCalc and we have an additional rule. The additional rule is necessary because in StaMac we
can optionally specify the type of a variable. If we do so, we need to ensure that the initial value is
compatible with the explicitly defined type. This way we will prevent users from writing:

var myIntVar : Int = "Hi!" // mmm, it does not seem to be an int...

This is the code that performs these checks:

// check the initial value is compatible with the explicitly declared type
this.specificProcess(VarDeclaration::class.java) {
    if (it.explicitType != null && !it.explicitType.isAssignableBy(it.value.type())) {
        // report the value type first, then the declared type of the variable
        errors.add(Error("Cannot assign ${it.value.type()} to variable of type ${it.explicitType}", it.position!!))
    }
}
// check the type used in assignment is compatible
this.specificProcess(Assignment::class.java) {
    if (it.variable.resolved) {
        val actualType = it.value.type()
        val formalType = it.variable.referred!!.type
        if (!formalType.isAssignableBy(actualType)) {
            errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!))
        }
    }
}

There are other semantic checks that can be performed which are more language specific. For
example, in StaMac we need to ensure that we have exactly one start state:

// we have exactly one start state
if (this.states.filter { it.start }.size != 1) {
    errors.add(Error("A StateMachine should have exactly one start state", this.position!!))
}

Summary
There are kinds of validation checks that are common to all languages, like typesystem related checks,
symbol resolution checks or duplicate name checks. They are the bread and butter of validation and you
are going to need those in most of your languages.
Then there is a different category of checks that depends on the specifics of your language. Maybe
you can define classes and each class should have a constructor. Or maybe all variables of type int
should have a name that starts with an i. While these rules will be different in each case they will
be implemented using similar techniques: navigate the AST, find the nodes you are interested in,
record errors and show them.
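As an illustration of a language-specific rule, the naming convention mentioned above could be checked as in this sketch. The flattened VarDeclaration and the checkIntNaming function are hypothetical and exist only for this example.

```kotlin
data class Point(val line: Int, val column: Int)
data class Error(val message: String, val position: Point)

// Hypothetical, flattened view of a declaration: name, type name and position.
data class VarDeclaration(val name: String, val typeName: String, val position: Point)

// Language-specific rule from the example above: int variables must start with 'i'.
fun checkIntNaming(declarations: List<VarDeclaration>): List<Error> =
    declarations
        .filter { it.typeName == "Int" && !it.name.startsWith("i") }
        .map { Error("Int variable '${it.name}' should have a name starting with 'i'", it.position) }
```

The shape is the same as the generic checks: select the interesting nodes, then turn each violation into an error with a position.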
In complex languages you could have a multi-level validation: you first resolve symbols, then check
type consistency, then do other semantic checks. With time you will be able to grow more complexity
in your language implementation, but hopefully this should give you the basis to start building
something real, and usable.
Part II: compiling
We have seen how to recognize what the user wrote and verify it is correct. Good.
Now it is time to do something with the information we obtained.
For example we could:

interpret the code and execute something in response
compile to native code or to bytecode
generate something: a graph or some code for another language

In this part we are going to see how to do the first two things.
First we are going to see how to build an interpreter, then how to compile to bytecode and finally
how to compile to native code using LLVM.
We will not see how to write a generator, but consider that this can be done either by using a
template engine or by writing an interpreter that prints something to a file. While we will not see an
example, you should know all the techniques necessary to write a generator. Or you can look for the
next post on my blog.
https://tomassetti.me
10. Build an interpreter
In the introduction to Part II we have seen that we have two main ways to execute a piece of code:
building an interpreter or building a compiler.
In the case of an interpreter the code is executed directly, while when you are using a compiler
you have to go through an intermediate step: producing the bytecode or native code that will be
executed. There are different technical aspects to consider, such as performance or the ease of
distributing one or the other, but if we put them aside we can obtain very similar results using one
or the other. That said, writing an interpreter is typically easier than writing a compiler.
Let's put aside the technical considerations for a moment and just consider that by the end
of this chapter we will know how to build interpreters and we will have built two fully working
interpreters.
I love this stage in building a language: it is when code comes alive and starts doing stuff.
So, enough chatting, let's get down to business.

What you need to build an interpreter


The way you write the logic of an interpreter could depend on the execution model you have:
imperative, functional, or based on state machines.
Let's see some of the elements you will typically need.

Symbol table
A symbol table is a data structure you use to track the symbols that are available in a given context.
For example, while you are inside a function your symbol table could contain the parameters of the
function and the local variables.
Typically symbol tables are organized in a stack. What does that mean? It means that when you enter
a more specific context you see new symbols, available only in that context, but you also see the more
generally available symbols.
Consider a Java program: all code inside a class can access the class fields. When you enter a more
specific context, like a method or an inner class, you get access to more symbols. At the same time
you still have access to the more general symbols, class fields in this case.
Typically when looking for symbols in a symbol table you first check whether the most specific one has a
match. If it does not, you check the parent symbol table, and so on until you reach the
root symbol table. This approach typically leads to the possibility of shadowing. It means that if you
have a global variable named foo and a local variable named foo, wherever the local variable is available
you will always access it instead of the global variable. The name foo will always be resolved to
the most local element, making the more general one inaccessible from within that specific context.
We used the term context, but we could have also used the term scope. For each scope we have a
specific symbol table, connected to a parent symbol table. Examples of scopes:

global scope
class
method/function
for, while, block

Basically every section of code where I can define symbols is typically a scope.
Take the following example:

[Figure: Scopes and Symbol Tables]

We have four scopes:

the whole Main class
the three methods method1, method2, main

In main we refer to the field v. There is no symbol v defined in the symbol table associated with the
method, so we look into its parent: the symbol table for the whole class. There we find the field
v. The same thing happens in method1 when we refer to v. In method2 instead there is
a symbol named v in the local symbol table (the method parameter). So when we refer to v we
refer to this local element, which is shadowing the field with the same name.
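The lookup rule just described (most specific table first, then the parents) can be sketched with a small class. This SymbolTable, with its define and lookup methods, is an illustrative assumption and not the implementation used later in the book.

```kotlin
// A symbol table linked to the table of the enclosing scope (null for the root).
class SymbolTable(private val parent: SymbolTable? = null) {
    private val values = HashMap<String, Any?>()

    // Defines (or redefines) a symbol in this scope only.
    fun define(name: String, value: Any?) {
        values[name] = value
    }

    // Looks in the most specific scope first, then walks up to the root.
    fun lookup(name: String): Any? = when {
        values.containsKey(name) -> values[name]
        parent != null -> parent.lookup(name)
        else -> throw IllegalArgumentException("Unknown symbol $name")
    }
}
```

With a class-level table holding v and a child table for each method, a lookup of v from a method that does not define it falls through to the class table, while a method that defines its own v shadows the field.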

Interpreting expressions
In most languages you will have some form of expression. How do we evaluate them in an
interpreter?
The evaluation of an expression typically produces two things:

a resulting value
side-effects

Let's consider them separately.

On side-effects

Side-effects in this case could be the change of a value in the symbol table or the execution of some
statements.
For example, consider these C expressions:

1. (a = b) == 2
2. foo() + 2

The first expression causes the value of a to change in the symbol table. a = b also produces a value
(the original value of b, corresponding to the new value of a). This value is compared to 2 and if it
is equal then the whole expression evaluates to true, otherwise to false.
The second expression sums the result of invoking the function foo and 2. Now, the function foo could
do all sorts of things, like writing on the screen or opening a socket connection. In general it could
execute code that has side-effects.
Because of side-effects we have to evaluate some pieces of our expressions in a specific, predictable
order. This is not necessary for languages which do not have side-effects. Those languages are free
to have things like lazy-evaluation and be opaque with respect to the rules they use to determine
the order in which they process parts of the expressions.
In an interpreter, side-effects other than changing values in a symbol table typically correspond
to calls to runtime libraries or to interfaces representing the outside world. In the implementation of
MiniCalcFun we will use the latter approach, using an interface named SystemInterface.
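To give an idea of what an interface representing the outside world could look like, here is a hedged sketch. Only the name SystemInterface comes from the text; the print method and the two implementations are assumptions made for this example.

```kotlin
// Hypothetical shape of an interface standing between interpreter and outside world.
interface SystemInterface {
    fun print(message: String)
}

// Real implementation: writes to standard output.
class ConsoleSystemInterface : SystemInterface {
    override fun print(message: String) = println(message)
}

// Test implementation: records what would have been printed.
class RecordingSystemInterface : SystemInterface {
    val output = mutableListOf<String>()
    override fun print(message: String) {
        output.add(message)
    }
}
```

Routing side-effects through an interface like this makes the interpreter easy to test, because a test can pass the recording implementation and inspect what was printed.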

Resulting value

The key thing we want to get out of evaluating an expression is the resulting value. This is typically
obtained in a way that depends on the expression. Let's consider some cases:

literals
unary expressions
binary arithmetic expressions
logical expressions
value references

Literals are quite easy to consider: we just have some value that we may have to parse in some
way.
Number literals need to be parsed and converted to a canonical internal representation, so
that things like 32, 0x20, 040, 100000b are recognized to be the same thing, assuming our language
supports specifying numbers in decimal, hexadecimal (0x prefix), octal (0 prefix) and binary format
(b suffix).
Decimal numbers could instead be expressed in their typical form or in the exponential form. String
literals could have escape sequences that we need to recognize. When evaluating literals we need to
consider these aspects.
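The conversion of integer literals to a canonical representation could be sketched as follows. The function parseIntLiteral is hypothetical, and the literal syntax is the one assumed in the text (0x prefix, 0 prefix, b suffix).

```kotlin
// Converts the textual form of an integer literal to a canonical Int.
// Assumed syntax: 0x prefix for hexadecimal, 0 prefix for octal, b suffix for binary.
fun parseIntLiteral(text: String): Int = when {
    // The hexadecimal check comes first because a hex literal may end with the digit b.
    text.startsWith("0x") -> text.substring(2).toInt(16)
    text.endsWith("b") -> text.dropLast(1).toInt(2)
    text.length > 1 && text.startsWith("0") -> text.substring(1).toInt(8)
    else -> text.toInt()
}
```

With this function the four spellings 32, 0x20, 040 and 100000b all map to the same internal value.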
Alternatively we could translate literals to a canonical form during the mapping step. In that case
evaluating a literal means just accessing its value, already calculated and stored in the AST node.
Unary expressions are typically also very simple. We can consider the logical negation, the binary
negation or the unary minus sign. The only thing to do is to transform the value of the child
expression. For example if we process -a we need first to evaluate a, then take its value and multiply
it by -1.
Binary arithmetic expressions can be calculated differently depending on the type of the operands.
Why is that? Because summing two integers or summing two decimals are conceptually the same
operation but for the CPU they could be very different operations. For this reason on one level we
may want to represent them as one single construct in our language, and one single AST node type,
but in the interpreter we may have to process them differently.
It is not just about differentiating between integers and decimals. Some languages support a wide
range of numerical types: byte, short, int, long, long long, float, double. In some cases you can have
both the signed and unsigned variants of these numbers. Now, executing mathematical operations on
a CPU is still one of those fascinating adventures that seems to work decently well, until it surprises
you with some apparently absurd result. This is not the place to talk about all the issues you can
have with overflows, underflows and problems due to limited precision, but you need to consider
that the specific type involved in the operation could lead to different results. For example, dividing
5 by 2, if 5 and 2 are integers, could produce 2 in your language. Or maybe 3, depending on how you do
the rounding. Or maybe you want to produce 2.5, so that the result is not an integer anymore. What
about the result of 10 divided by 3? Is it 3.33333333? 3.33333334? Or do you represent it
internally as a fraction? Honestly I think there are two strategies to protect your mental health:

1. You just use the primitive types of the language in which you are writing the interpreter.
It means you could run into all sorts of strange results (hey, summing XXX and YYY produces
-123, that is surprising!) but operations are performed fast
2. You internally represent these values as BigDecimal or something equivalent. That means
that all mathematical operations will be much slower but the results will be correct
(with some approximation)

What is the best strategy? Well, it depends on what your language is used for. If it is used by
developers who need to write fast code, then go for the first one. If you are building this language
for non-developers, or the language will be used in safety-critical or mission-critical applications, go
for the second.
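The difference between the two strategies can be seen in a few lines. The helper functions below are hypothetical and only wrap standard Kotlin arithmetic and java.math.BigDecimal; the choice of MathContext.DECIMAL128 as precision is an assumption for this sketch.

```kotlin
import java.math.BigDecimal
import java.math.MathContext

// Strategy 1: rely on the primitive types of the host language (fast, may surprise).
fun sumPrimitive(a: Int, b: Int): Int = a + b
fun dividePrimitive(a: Int, b: Int): Int = a / b

// Strategy 2: represent every number as a BigDecimal (slower, more predictable).
fun sumBig(a: String, b: String): BigDecimal = BigDecimal(a).add(BigDecimal(b))

// A non-terminating division needs an explicit precision; DECIMAL128 keeps 34 digits.
fun divideBig(a: String, b: String): BigDecimal =
    BigDecimal(a).divide(BigDecimal(b), MathContext.DECIMAL128)
```

With the primitive functions, adding 1 to Int.MAX_VALUE wraps around to Int.MIN_VALUE and 5 divided by 2 is truncated to 2. With the BigDecimal functions, 0.1 plus 0.2 is exactly 0.3, and 10 divided by 3 is rounded to the chosen precision.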
Logical expressions: typical logical expressions are logical-and, logical-or, and logical-xor. Now,
depending on who uses your language there could be different expectations on how these are
evaluated. Developers typically expect you to use short-circuit evaluation. What does that mean? It
means that you evaluate the first element and if you can already determine the result you do not
evaluate the second element. So if you have a logical-and b you:

evaluate a
a is true -> you evaluate b
a is false -> you return false without evaluating b

Why does that matter? Because evaluating b could have side-effects. If b is a function call that prints
something on the screen, or changes some value, evaluating it or not evaluating it could change the
behavior of your program. However, if your language does not allow side-effects then all of this is
just a performance optimization.
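Short-circuit evaluation of logical-and could be implemented as in the sketch below. The tiny AST and the log parameter, which stands in for observable side-effects, are assumptions made to keep the example self-contained.

```kotlin
// Tiny hypothetical AST: boolean literals, calls with a side-effect, logical-and.
sealed class Expression
data class BoolLit(val value: Boolean) : Expression()
data class Call(val name: String, val result: Boolean) : Expression()
data class And(val left: Expression, val right: Expression) : Expression()

// The log stands in for observable side-effects such as printing.
fun evaluate(e: Expression, log: MutableList<String>): Boolean = when (e) {
    is BoolLit -> e.value
    is Call -> {
        log.add(e.name)
        e.result
    }
    // Short-circuit: the right operand is evaluated only if the left one is true.
    is And -> if (evaluate(e.left, log)) evaluate(e.right, log) else false
}
```

When the left operand is false the call on the right is never evaluated, so its side-effect never reaches the log.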
Value references: these are the expressions that give access to the value of a variable, a constant or
a parameter in your code. In foo + 3, foo is a value reference. Basically you evaluate them by taking
their value out of the symbol table. That is it. Unless you are supporting accessors. For example in
Ruby when writing bar, bar could be a variable or a method. In that case the method would have to
be invoked.

Executing statements
Statements let you execute all sorts of operations. They also determine control flow, i.e., what code
you are going to execute. For example a while-loop could make you execute its body multiple times.
http://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html

As you will see, executing statements basically means implementing the control flow correctly: most
statements are composed of other statements. You just need to know in which order to execute the
component statements and the logic wiring them together. For example, in the case of an if statement you have to
evaluate the condition and if the result is truthy you will execute the statement in the then-branch,
otherwise you will execute the one in the else-branch.
Statements determine which expressions need to be evaluated. These expressions need to be evaluated
in a certain scope, which means we need to use the corresponding symbol table. So we need to pass it
around when executing statements. Statements could also modify that symbol table.
Let's see how we could implement a typical bunch of statements.
Print statement: let's start with something simple. The old, glorious print statement. Basically you
need to do two things: 1) evaluate the expression that you want to print 2) get the result and call the
print function or method of the language in which you are building your interpreter. That is
basically it. In the implementation of StaMac we will do just that, while in the implementation of
MiniCalcFun we will do something slightly more elaborate.
Variable declaration statement: this statement adds a new symbol to the symbol table. It will be
available to all following statements. Now, a variable could also have an expression determining
the initial value. You typically want to evaluate it first and then add the new variable. In this way
an initialization value cannot refer to the variable itself.
Expression statement: this is just about evaluating an expression. You typically want to do that for
the side-effects of that expression. For example in some languages an assignment is an expression
(while in others it is a statement). In the languages in which the assignment is an expression you
may want to put it into an expression statement and execute it. So the assignment is performed and
the symbol table is changed as a consequence.
Block statement: a block statement is typically a list of statements to be executed one after the
other. It also delimits a new scope so that variables defined inside the block are visible only inside
the block. The way you typically execute it is to define a new symbol table having as parent the
current symbol table. You use this new symbol table to execute all the statements which are part of
the block. When leaving you just go back to the original symbol table, forgetting about the symbol
table used inside the block.
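The handling of a block can be sketched in a few lines. Scope, Stmt, Declare, Inspect and Block are hypothetical names used only for this illustration; Inspect exists just to let us observe what is visible at a given point.

```kotlin
// Hypothetical minimal scope: a map of values plus an optional parent.
class Scope(val parent: Scope? = null) {
    private val values = HashMap<String, Any>()
    fun declare(name: String, value: Any) { values[name] = value }
    fun lookup(name: String): Any? = values[name] ?: parent?.lookup(name)
}

// Hypothetical statements, just enough to show how a block is executed.
sealed class Stmt
data class Declare(val name: String, val value: Any) : Stmt()
data class Inspect(val name: String, val seen: MutableList<Any?>) : Stmt()
data class Block(val statements: List<Stmt>) : Stmt()

fun execute(stmt: Stmt, scope: Scope) {
    when (stmt) {
        is Declare -> scope.declare(stmt.name, stmt.value)
        is Inspect -> stmt.seen.add(scope.lookup(stmt.name))
        is Block -> {
            // New scope whose parent is the current one.
            val inner = Scope(parent = scope)
            stmt.statements.forEach { execute(it, inner) }
            // On exit we simply keep using the original scope:
            // the inner one is forgotten.
        }
    }
}
```

A variable declared inside the block can be looked up by the statements of the block, together with everything visible from the parent scope, while a lookup of the same name after the block returns null.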
If statement: we have anticipated this one but it is really as easy as that. You evaluate the condition
and depending on the result you execute the then portion or the else portion (if it is present).
While statement: another easy one, just a variation of the if statement. You evaluate the condition,
if it is satisfied you execute the statement corresponding to the body of the while statement and then
go back to re-evaluate the condition.
For statement: a for statement as present in C99 or Java is a complex beast. Consider this:

    for (int i=0;i<10;i++)
        body

First of all you want to execute the statement introducing a new variable. This is executed just once.
Then you verify the condition (i<10). If it evaluates to true you execute the body and then the
iteration step (i++). At this point you verify the condition again and keep repeating the same steps until
the condition evaluates to false.
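The order of these steps can be sketched as follows. Here the four parts of the for statement are passed as lambdas, which is an assumption made to keep the example short: in a real interpreter they would be AST nodes executed against a symbol table. The names executeFor and traceOfTwoIterations are hypothetical.

```kotlin
// A sketch of the execution order of a C-style for statement.
fun executeFor(init: () -> Unit, condition: () -> Boolean,
               step: () -> Unit, body: () -> Unit) {
    init()                 // executed exactly once
    while (condition()) {  // verified before every iteration
        body()
        step()             // executed after the body, before re-checking
    }
}

// Records the order of events for: for (int i=0;i<2;i++) body
fun traceOfTwoIterations(): List<String> {
    val trace = mutableListOf<String>()
    var i = 0
    executeFor(
        init = { i = 0; trace.add("init") },
        condition = { trace.add("check $i"); i < 2 },
        step = { i++; trace.add("step") },
        body = { trace.add("body $i") }
    )
    return trace
}
```

The resulting trace is init, check 0, body 0, step, check 1, body 1, step, check 2: the condition is verified one more time than the body is executed.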

Things we are not considering


We will not build data structures or garbage collectors because we can just reuse the ones from
our host language. Do you need a map in your language and you are writing the interpreter in Java?
Just use a Java map in your interpreter. The same applies to lists, hash tables and so on. This significantly
reduces our work.
We will not use the bytecode technique used by several interpreters. CPython is an interpreter that
uses it: basically the Python code is translated into a low-level representation (assembly-like) and that
representation is then interpreted. This technique is used by many industrial-grade interpreters.
In this chapter we are not talking about advanced topics that you may want to look at after you
have built your first couple of interpreters.
For example you may want to trace what code you execute so that when you have to show an error
you can give context by specifying a stack trace.
We are not discussing tracking coverage. If your language supports testing you may want to add the
possibility to track which branches of the various control statements are executed, to determine the
code coverage.
These and other aspects are important if you want to build solid, industrial-grade interpreters.
However in my experience the hardest part is getting started building your first interpreters. Many
do not overcome that first step. This chapter is about helping you do that. There will be time to look
into more advanced stuff when you are able to walk on your own.

Let's see the code


As usual, we now look at how to apply the theory to the languages we use for our examples.

MiniCalcFun
To spice things up we introduce a variant of MiniCalc which supports functions. It is named
MiniCalcFun.
A function must specify its return type, and its last statement is expected to be an expression
statement. The value of that expression will be the return value of the function:

    fun double(Int p) Int {
        p * 2 // returned value
    }

MiniCalcFun also supports nested functions. They look like this:

    fun f() Int {
        fun g() Int {
            0 // returned value
        }
        g() + 1 // returned value
    }

What is changed with respect to MiniCalc?

We change the grammar slightly:

    statement : inputDeclaration # inputDeclarationStatement
              | varDeclaration   # varDeclarationStatement
              | assignment       # assignmentStatement
              | print            # printStatement
              | expression       # expressionStatement // new
              | function         # functionStatement; // new

    // FUN is a token capturing 'fun'
    function : FUN name=ID LPAREN (params+=param (COMMA params+=param)*)? RPAREN type LBRACE NEWLINE
               (statements+=statement NEWLINE)* RBRACE;

    param : type name=ID ;

    expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
               | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
               | value=expression AS targetType=type                           # typeConversion
               | LPAREN expression RPAREN                                      # parenExpression
               | ID                                                            # valueReference
               | MINUS expression                                              # minusExpression
               | STRING_OPEN (parts+=stringLiteralContent)* STRING_CLOSE       # stringLiteral
               | INTLIT                                                        # intLiteral
               | DECLIT                                                        # decimalLiteral
               // new
               | funcName=ID LPAREN (params+=expression (COMMA params+=expression)*)? RPAREN # funcCall ;

In addition to introducing the functionStatement we also make it possible to use an expression
as a statement, by adding expressionStatement. We need this because functions will return the
result of their last statement, which should be an expressionStatement. This will also permit us to
have function calls as statements, which we may want to invoke for their side effects (e.g., because
they could print something).
This is how we calculate the type of a function call:

    fun Expression.type() : Type =
        when (this) {
            ...
            is FunctionCall -> {
                if (!this.function.resolved) {
                    throw IllegalStateException("Unsolved reference")
                }
                this.function.referred!!.returnType
            }
            ...
        }

It would be also possible to infer the return type of the function but we are not going to do that. Or
as better authors would say this is left as an exercise for the reader.

Combined Symbol Table

In the interpreter for MiniCalcFun we use the class CombinedSymbolTable. This is a class we created
to represent a symbol table that can contain elements in two separate namespaces: values and
functions. This permits having a value named foo and a function named foo without any
problem. In this language it makes sense to do so because the only operation we perform on functions
is invoking them. Functions are not first-class citizens in our language. We cannot save a reference
to a function in a variable or pass functions around, so they can never be confused with values.
The class CombinedSymbolTable is not particularly complex.

    class CombinedSymbolTable<V, F>(val parent: CombinedSymbolTable<V, F>? = null) {
        private val values = HashMap<String, V>()
        private val functions = HashMap<String, F>()

        fun hasValue(name: String) = tryToGetValue(name) != null

        fun getValue(name: String) : V {
            val res = tryToGetValue(name)
            if (res == null) {
                throw RuntimeException("Unknown symbol $name. Known symbols: ${values.keys}")
            } else {
                return res
            }
        }

        // looks in this table first, then in the ancestors
        fun tryToGetValue(name: String) : V? {
            if (!values.containsKey(name)) {
                if (parent == null) {
                    return null
                } else {
                    return parent.tryToGetValue(name)
                }
            }
            return values[name]!!
        }

        fun setValue(name: String, value: V) {
            values[name] = value
        }

        fun hasFunction(name: String) = tryToGetFunction(name) != null

        fun getFunction(name: String) : F {
            val res = tryToGetFunction(name)
            if (res == null) {
                throw RuntimeException("Unknown function $name. Known functions: ${functions.keys}")
            } else {
                return res
            }
        }

        // looks in this table first, then in the ancestors
        fun tryToGetFunction(name: String) : F? {
            if (!functions.containsKey(name)) {
                if (parent == null) {
                    return null
                } else {
                    return parent.tryToGetFunction(name)
                }
            }
            return functions[name]!!
        }

        fun setFunction(name: String, value: F) {
            functions[name] = value
        }

        // returns the closest table (this one or an ancestor) declaring the function
        fun popUntil(function: F): CombinedSymbolTable<V, F> {
            if (this.functions.containsValue(function)) {
                return this
            }
            if (this.parent == null) {
                throw IllegalArgumentException("Function not found: $function")
            }
            return this.parent.popUntil(function)
        }

        override fun toString(): String {
            return "SymbolTable(values=${values.keys}, functions=${functions.keys})"
        }
    }

Just a few comments:

Symbol tables are organized in a stack. Each instance can have a parent. When a value cannot
be found in the current symbol table we will ask the parent, if present
We store values and functions separately, so most methods are duplicated and we have separate
fields
hasValue/hasFunction can be used to check if an element is known to a Symbol Table (directly
or through its parent)
getValue/getFunction will get the element with the corresponding name or throw an
exception. The element is searched first in this symbol table, then in the stack of ancestor
symbol tables
tryToGetValue/tryToGetFunction will try to get the element with the corresponding name
or just return null. The element is searched first in this symbol table, then in the stack of
ancestor symbol tables
setValue/setFunction store a new element in the current symbol table. This could cause a value
known by the parent to be shadowed, i.e., to be not accessible anymore. For example in a
function with a parameter p we would be unable to access a global variable named p because
every time we accessed p we would get the parameter back, never the global variable
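Shadowing can be observed with a reduced, values-only variant of the symbol table. The class ValueTable and the function shadowingDemo are hypothetical names introduced for this sketch; the lookup logic mirrors the one described above.

```kotlin
// Reduced, values-only variant of the symbol table (hypothetical name).
class ValueTable<V : Any>(val parent: ValueTable<V>? = null) {
    private val values = HashMap<String, V>()

    fun setValue(name: String, value: V) { values[name] = value }

    // Look in this table first, then ask the ancestors.
    fun tryToGetValue(name: String): V? =
        values[name] ?: parent?.tryToGetValue(name)
}

fun shadowingDemo(): List<Int?> {
    val global = ValueTable<Int>()
    global.setValue("p", 1)
    global.setValue("q", 10)
    val functionScope = ValueTable(global)
    functionScope.setValue("p", 2)         // shadows the global p
    return listOf(
        functionScope.tryToGetValue("p"),  // found locally
        functionScope.tryToGetValue("q"),  // found through the parent
        global.tryToGetValue("p")          // the global p is untouched
    )
}
```

Inside the function scope p resolves to 2 and q to 10, while the global table still holds 1 for p: the global value is hidden, not overwritten.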

Aside from toString this leaves out one method: popUntil.


This method takes a function and traverses the stack of symbol tables until it finds the one containing
the given function. Why do we need this? We will use this when invoking functions. When we invoke
a function we could invoke a function defined in the same scope. For example:

    fun inc(Int p) Int {
        p + 1
    }
    inc(3) // invocation at the same level as the function declaration

But we could also call a function defined in a higher-level scope:

    fun myBigFunction() Int {
        fun inc(Int p) Int {
            p + 1
        }
        fun anotherFunction() Int {
            var j = 0
            inc(3) // invocation at an upper level w.r.t. the function declaration
        }
        anotherFunction()
    }

When we invoke inc we move from the scope inside anotherFunction to the scope of myBigFunction.
There the variable j is not visible. When we change scope we should use a different Symbol
Table, because every Symbol Table represents the elements present in a certain scope.
We can have multiple levels of nested scopes:

    fun inc(Int p) Int {
        p + 1 // here a, b, c are not visible
    }

    fun wrapper1() Int {
        var a = 0
        fun wrapper2() Int {
            var b = 1
            fun wrapper3() Int {
                var c = 2
                inc(a + b + c) // here a, b, c are visible
            }
            wrapper3()
        }
        wrapper2()
    }
    wrapper1()

In this case when invoking inc we will do that from a scope (and using a Symbol Table) specific to
wrapper3. This scope would be contained in the scope of wrapper2 (so our Symbol Table would have
as parent a Symbol Table for wrapper2). The scope of wrapper2 would be contained in the scope of
wrapper1 and the scope of wrapper1 would be contained in the global scope.

At the point in which we invoke inc we can access all the variables. But we cannot do that when
executing inc. For this reason we would go from the Symbol Table containing the values of wrapper3
directly to the global one, where inc is defined.
Once we have done that we will need to create a Symbol Table for the specific execution of inc,
adding the values for the parameters but we will see this later. This new Symbol Table will have as
parent the Symbol Table where the function inc is defined.
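The jump between scopes can be sketched with a reduced stand-in for the symbol table. ScopeTable and callScopeDemo are hypothetical names, and for brevity functions are identified by their name here rather than by their declaration object.

```kotlin
// Reduced stand-in for the symbol table (hypothetical, for illustration).
class ScopeTable(val label: String, val parent: ScopeTable? = null) {
    val functions = HashSet<String>()
    val values = HashMap<String, Int>()

    // Walk towards the root until we reach the scope declaring the function.
    fun popUntil(function: String): ScopeTable {
        if (function in functions) return this
        val p = parent
            ?: throw IllegalArgumentException("Function not found: $function")
        return p.popUntil(function)
    }

    fun canSee(name: String): Boolean =
        name in values || (parent?.canSee(name) ?: false)
}

fun callScopeDemo(): ScopeTable {
    val global = ScopeTable("global").apply { functions.add("inc") }
    val w1 = ScopeTable("wrapper1", global).apply { values["a"] = 0 }
    val w2 = ScopeTable("wrapper2", w1).apply { values["b"] = 1 }
    val w3 = ScopeTable("wrapper3", w2).apply { values["c"] = 2 }
    // Invoking inc from wrapper3: jump to the scope where inc is defined,
    // then open a fresh scope for this execution and bind p = a + b + c.
    val defining = w3.popUntil("inc")
    return ScopeTable("inc call", defining).apply { values["p"] = 3 }
}
```

The table returned by callScopeDemo has the global table as parent: it can see p, but not a, b or c.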

System Interface

Our language permits doing one thing that affects the outside world: printing. We could just print
on the screen when we interpret a print statement. We will not do that directly, to make our system
more testable.
We will provide an instance of SystemInterface to our interpreter and delegate interactions with
the system to it.
In a real application the implementation of the interface will actually print, while during tests we
will capture the strings we would have printed and save them. Later we can add assertions to
verify we tried to print exactly what we expected.

    import java.util.LinkedList

    interface SystemInterface {
        fun print(message: String)
    }

    class RealLifeSystemInterface : SystemInterface {

        override fun print(message: String) {
            println(message)
        }

    }

    class TestSystemInterface : SystemInterface {

        // later we can assert on the content of this property
        val output = LinkedList<String>()

        override fun print(message: String) {
            output.add(message)
        }

    }

Interpreter

Let's see the whole code for the interpreter. Later we will comment on it piece by piece.

    class MiniCalcInterpreter(val systemInterface: SystemInterface,
                              val inputValues: Map<String, Any> = emptyMap()) {

        private val globalSymbolTable = CombinedSymbolTable<Any, FunctionDeclaration>()

        fun fileEvaluation(miniCalcFile: MiniCalcFile) {
            miniCalcFile.statements.forEach { executeStatement(it, globalSymbolTable) }
        }

        fun singleStatementEvaluation(statement: Statement) {
            executeStatement(statement, globalSymbolTable)
        }

        fun getGlobalValue(name: String) : Any = globalSymbolTable.getValue(name)

        private fun executeStatement(statement: Statement,
                symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : Any? =
            when (statement) {
                is ExpressionStatatement ->
                    evaluate(statement.expression, symbolTable)
                is VarDeclaration ->
                    symbolTable.setValue(statement.name,
                            evaluate(statement.value, symbolTable))
                is Print ->
                    systemInterface.print(evaluate(statement.value,
                            symbolTable).toString())
                is Assignment ->
                    symbolTable.setValue(statement.varDecl.name,
                            evaluate(statement.value, symbolTable))
                is FunctionDeclaration ->
                    symbolTable.setFunction(statement.name, statement)
                is InputDeclaration ->
                    symbolTable.setValue(statement.name,
                            inputValues[statement.name]!!)
                else ->
                    throw UnsupportedOperationException(
                            statement.javaClass.canonicalName)
            }

        private fun StringLitPart.evaluate(
                symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : String =
            when (this) {
                is ConstantStringLitPart -> this.content
                is ExpressionStringLItPart -> evaluate(
                        this.expression, symbolTable).toString()
                else -> throw UnsupportedOperationException(
                        this.javaClass.canonicalName)
            }

        private fun evaluate(expression: Expression,
                symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : Any =
            when (expression) {
                is IntLit -> expression.value.toInt()
                is DecLit -> expression.value.toDouble()
                is StringLit -> expression.parts.map {
                    it.evaluate(symbolTable)
                }.joinToString(separator = "")
                is ValueReference -> symbolTable.getValue(expression.ref.name)
                is SumExpression -> {
                    val l = evaluate(expression.left, symbolTable)
                    val r = evaluate(expression.right, symbolTable)
                    if (l is Int) {
                        l + r as Int
                    } else if (l is Double) {
                        l + r as Double
                    } else if (l is String) {
                        l + r.toString()
                    } else {
                        throw UnsupportedOperationException(
                                l.toString() + " from evaluating " + expression.left)
                    }
                }
                is SubtractionExpression -> {
                    val l = evaluate(expression.left, symbolTable)
                    val r = evaluate(expression.right, symbolTable)
                    if (l is Int) {
                        l - r as Int
                    } else if (l is Double) {
                        l - r as Double
                    } else {
                        throw UnsupportedOperationException(expression.toString())
                    }
                }
                is MultiplicationExpression -> {
                    val l = evaluate(expression.left, symbolTable)
                    val r = evaluate(expression.right, symbolTable)
                    if (l is Int) {
                        l * r as Int
                    } else if (l is Double) {
                        l * r as Double
                    } else {
                        throw UnsupportedOperationException("Left is " + l.javaClass)
                    }
                }
                is DivisionExpression -> {
                    val l = evaluate(expression.left, symbolTable)
                    val r = evaluate(expression.right, symbolTable)
                    if (l is Int) {
                        l / r as Int
                    } else if (l is Double) {
                        l / r as Double
                    } else {
                        throw UnsupportedOperationException(expression.toString())
                    }
                }
                is FunctionCall -> {
                    // SymbolTable: should leave the symbol table until
                    // we go at the same level at which the function
                    // was declared
                    val functionSymbolTable = CombinedSymbolTable(
                            symbolTable.popUntil(expression.function.referred!!))
                    var i = 0
                    expression.function.referred!!.params.forEach {
                        functionSymbolTable.setValue(it.name,
                                evaluate(expression.params[i++], symbolTable))
                    }
                    var result : Any? = null
                    expression.function.referred!!.statements.forEach {
                        result = executeStatement(it, functionSymbolTable) }
                    if (result == null) {
                        throw IllegalStateException()
                    }
                    result as Any
                }
                else -> throw UnsupportedOperationException(
                        expression.javaClass.canonicalName)
            }

    }

Let's start from the beginning:

    class MiniCalcInterpreter(val systemInterface: SystemInterface,
                              val inputValues: Map<String, Any> = emptyMap()) {
        ...
    }

When instantiating an interpreter we specify how we will interact with the rest of the world
(systemInterface) and we also need to provide values for our inputs. Inputs are the mechanism
we have to get parameters in our little algorithm. Better than reading what the user is typing, right?
Our class then contains a symbol table:

    private val globalSymbolTable = CombinedSymbolTable<Any, FunctionDeclaration>()

This is the global symbol table. It will contain the inputs, the variables and the functions defined
in the global scope.
Then we have the methods that constitute the public interface of our interpreter:

    fun fileEvaluation(miniCalcFile: MiniCalcFile) {
        miniCalcFile.statements.forEach { executeStatement(it, globalSymbolTable) }
    }

    fun singleStatementEvaluation(statement: Statement) {
        executeStatement(statement, globalSymbolTable)
    }

    fun getGlobalValue(name: String) : Any = globalSymbolTable.getValue(name)

What can we use these methods for?

fileEvaluation to evaluate one entire file. We execute all the top-level statements using the
global Symbol Table
singleStatementEvaluation could be used to execute statements one by one. It could be
useful to implement a REPL or maybe a simple debugger
getGlobalValue takes a value out of the symbol table. We will use it in tests

The most interesting method seems to be executeStatement which, well, executes a statement using a
specified symbol table. When executing top-level statements we pass the global symbol table
but that will not always be the case.

    private fun executeStatement(statement: Statement,
            symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : Any? =
        when (statement) {
            is ExpressionStatatement ->
                evaluate(statement.expression, symbolTable)
            is VarDeclaration ->
                symbolTable.setValue(statement.name,
                        evaluate(statement.value, symbolTable))
            is Print ->
                systemInterface.print(evaluate(statement.value,
                        symbolTable).toString())
            is Assignment ->
                symbolTable.setValue(statement.varDecl.name,
                        evaluate(statement.value, symbolTable))
            is FunctionDeclaration ->
                symbolTable.setFunction(statement.name, statement)
            is InputDeclaration ->
                symbolTable.setValue(statement.name,
                        inputValues[statement.name]!!)
            else ->
                throw UnsupportedOperationException(
                        statement.javaClass.canonicalName)
        }

Let's look at each statement separately.


The ExpressionStatement evaluates an expression in the current scope (so we use the current
symbol table). That one was easy.
All declarations are about putting stuff into the symbol table. I am sure there is a fancier way to say
this but this is what declarations are supposed to do.
More specifically:

VarDeclaration inserts a value. The actual value is determined by the initialization expression
that is evaluated. Note that we first evaluate the initialization expression and only then we
insert the resulting value in the symbol table. That means that in the initialization of a variable
we cannot refer to the variable itself
InputDeclaration inserts a value. The value has been provided when instantiating the
interpreter because it is coming from outside. The user could specify it as a parameter in
the command line or in a form. When we find the InputDeclaration we make that value
available to the program by putting it in the symbol table
FunctionDeclaration inserts a function. We just take the function and put it in the symbol
table. Note that we do not evaluate the body of the function

Assignment is similar to the declarations because it changes the symbol table. The only difference is
that we expect the element to be already present in the symbol table and we just change its value.
Finally Print evaluates the expression and transforms the result into a string. Once it has the string to print
it uses the systemInterface. That interface could actually print something on the screen, or log it,
or store it to check it later in an assertion.
The gist is that statements mainly evaluate expressions and put things into the symbol table. They
control what is happening, but most of the action passes through expressions.
Let's check how we evaluate them.

    private fun evaluate(expression: Expression,
            symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : Any =
        when (expression) {
            is IntLit -> expression.value.toInt()
            is DecLit -> expression.value.toDouble()
            is StringLit -> expression.parts.map {
                it.evaluate(symbolTable)
            }.joinToString(separator = "")
            is ValueReference -> symbolTable.getValue(expression.ref.name)
            is SumExpression -> {
                val l = evaluate(expression.left, symbolTable)
                val r = evaluate(expression.right, symbolTable)
                if (l is Int) {
                    l + r as Int
                } else if (l is Double) {
                    l + r as Double
                } else if (l is String) {
                    l + r.toString()
                } else {
                    throw UnsupportedOperationException(
                            l.toString() + " from evaluating " + expression.left)
                }
            }
            is SubtractionExpression -> {
                val l = evaluate(expression.left, symbolTable)
                val r = evaluate(expression.right, symbolTable)
                if (l is Int) {
                    l - r as Int
                } else if (l is Double) {
                    l - r as Double
                } else {
                    throw UnsupportedOperationException(expression.toString())
                }
            }
            is MultiplicationExpression -> {
                val l = evaluate(expression.left, symbolTable)
                val r = evaluate(expression.right, symbolTable)
                if (l is Int) {
                    l * r as Int
                } else if (l is Double) {
                    l * r as Double
                } else {
                    throw UnsupportedOperationException("Left is " + l.javaClass)
                }
            }
            is DivisionExpression -> {
                val l = evaluate(expression.left, symbolTable)
                val r = evaluate(expression.right, symbolTable)
                if (l is Int) {
                    l / r as Int
                } else if (l is Double) {
                    l / r as Double
                } else {
                    throw UnsupportedOperationException(expression.toString())
                }
            }
            is FunctionCall -> {
                // SymbolTable: should leave the symbol table until
                // we go at the same level at which the function
                // was declared
                val functionSymbolTable = CombinedSymbolTable(
                        symbolTable.popUntil(expression.function.referred!!))
                var i = 0
                expression.function.referred!!.params.forEach {
                    functionSymbolTable.setValue(it.name,
                            evaluate(expression.params[i++], symbolTable))
                }
                var result : Any? = null
                expression.function.referred!!.statements.forEach {
                    result = executeStatement(it, functionSymbolTable) }
                if (result == null) {
                    throw IllegalStateException()
                }
                result as Any
            }
            else -> throw UnsupportedOperationException(
                    expression.javaClass.canonicalName)
        }

Literals are the building blocks of our expressions and they are easy to deal with:

    is IntLit -> expression.value.toInt()
    is DecLit -> expression.value.toDouble()
    is StringLit -> expression.parts.map {
        it.evaluate(symbolTable)
    }.joinToString(separator = "")

For IntLit and DecLit we just parse them as Int and Double and we are done. Our string literals
are a little more complex because we support inserting expressions into them (i.e., we have string
interpolation). So a string literal is really a concatenation of constant parts and expressions to
transform into strings. We evaluate the single parts and join them together without spaces in
between. Voila!
We just need to see how to evaluate the single parts of our string literal:

    private fun StringLitPart.evaluate(
            symbolTable: CombinedSymbolTable<Any, FunctionDeclaration>) : String =
        when (this) {
            is ConstantStringLitPart -> this.content
            is ExpressionStringLItPart -> evaluate(
                    this.expression, symbolTable).toString()
            else -> throw UnsupportedOperationException(
                    this.javaClass.canonicalName)
        }

How do we perform operations? In MiniCalc (and in MiniCalcFun) we support the four basic
arithmetic operations, but the same mechanism can be used for all sorts of operations: we calculate
the values of the single operands and then we figure out how to combine those values.
For example, in the case of a SumExpression, once we have the left and the right values we may want
to sum them as ints, sum them as doubles or concatenate them as strings, if we have a string on
the left.
So:

1 + 2 -> 3
1.1 + 2 -> 3.1
"foo " + 2 -> foo 2

This is how we implement this logic:



    is SumExpression -> {
        val l = evaluate(expression.left, symbolTable)
        val r = evaluate(expression.right, symbolTable)
        if (l is Int) {
            l + r as Int
        } else if (l is Double) {
            l + r as Double
        } else if (l is String) {
            l + r.toString()
        } else {
            throw UnsupportedOperationException(
                    l.toString() + " from evaluating " + expression.left)
        }
    }
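Note that a cast such as r as Double succeeds only when the right operand is already a Double, so this logic relies on both operands having the same type by the time they are summed (for example because a conversion was applied earlier, which is not shown here). As a hedged sketch, this is how a sum could instead promote an Int operand to Double when the operand types are mixed. The standalone function sum and its promotion rule are assumptions made for this illustration, not the code of the interpreter.

```kotlin
// Sketch of a sum that tolerates mixed Int/Double operands by promoting
// the Int operand to Double (an assumed rule, for illustration only).
fun sum(l: Any, r: Any): Any = when {
    l is String -> l + r.toString()                            // concatenation
    l is Int && r is Int -> l + r                              // integer sum
    l is Number && r is Number -> l.toDouble() + r.toDouble()  // promoted sum
    else -> throw UnsupportedOperationException("Cannot sum $l and $r")
}
```

With this function sum(1, 2) gives the Int 3, sum(1.5, 2) gives the Double 3.5 and sum("foo ", 2) gives the String "foo 2".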

The other operations are simpler because we do not support string operands. You cannot divide a
string, multiply it or subtract something from it, so we just deal with ints and doubles.

    is SubtractionExpression -> {
        val l = evaluate(expression.left, symbolTable)
        val r = evaluate(expression.right, symbolTable)
        if (l is Int) {
            l - r as Int
        } else if (l is Double) {
            l - r as Double
        } else {
            throw UnsupportedOperationException(expression.toString())
        }
    }
    is MultiplicationExpression -> {
        val l = evaluate(expression.left, symbolTable)
        val r = evaluate(expression.right, symbolTable)
        if (l is Int) {
            l * r as Int
        } else if (l is Double) {
            l * r as Double
        } else {
            throw UnsupportedOperationException("Left is " + l.javaClass)
        }
    }
    is DivisionExpression -> {
        val l = evaluate(expression.left, symbolTable)
        val r = evaluate(expression.right, symbolTable)
        if (l is Int) {
            l / r as Int
        } else if (l is Double) {
            l / r as Double
        } else {
            throw UnsupportedOperationException(expression.toString())
        }
    }

Then we can look at how we handle value references:

    is ValueReference -> symbolTable.getValue(expression.ref.name)

We simply get the value out of the symbol table. That's it.
The FunctionCall is the most complex expression.

    is FunctionCall -> {
        // SymbolTable: should leave the symbol table until
        // we go at the same level at which the function
        // was declared
        val functionSymbolTable = CombinedSymbolTable(
                symbolTable.popUntil(expression.function.referred!!))
        var i = 0
        expression.function.referred!!.params.forEach {
            functionSymbolTable.setValue(it.name,
                    evaluate(expression.params[i++], symbolTable))
        }
        var result : Any? = null
        expression.function.referred!!.statements.forEach {
            result = executeStatement(it, functionSymbolTable) }
        if (result == null) {
            throw IllegalStateException()
        }
        result as Any
    }

We first move up until we find the scope where the function was defined and we get the
corresponding symbol table (see the discussion in the previous section on popUntil).
Then we create a new Symbol Table having the symbol table in which the function is defined as
parent. This is our way to say that we go in one more specific scope (the scope inside the function).
In that symbol table we register the values for the parameters. We get their names from the function
definition, while their values are evaluated. Pay attention to how we evaluate them: we use the
expressions provided in the function call and we evaluate them using the symbol table of the scope
in which the function is called, not the scope representing the inside of the function.
At this point all that we have to do is to execute all the statements composing the body of the
function, using the appropriate symbol table. We just get the result of the last statement and use it
as the result of our function call.
Ok, that was as tricky as it gets for this interpreter.

Testing

It is time to test our interpreter. By testing it we show how to use it. It should not be too hard to put
some UI around it and get a simple REPL or a simulator out of it.
This is the structure of our test case:

    class InterpreterTest {

        private var interpreter : MiniCalcInterpreter? = null
        private var systemInterface : TestSystemInterface? = null

        class TestSystemInterface : SystemInterface {

            val output = LinkedList<String>()

            override fun print(message: String) {
                output.add(message)
            }

        }

        // inputValues lets a test provide the values for the inputs
        fun interpret(code: String, inputValues: Map<String, Any> = emptyMap()) {
            val res = MiniCalcParserFacade.parse(code)
            assertTrue(res.isCorrect(), res.errors.toString())
            val miniCalcFile = res.root!!
            systemInterface = TestSystemInterface()
            interpreter = MiniCalcInterpreter(systemInterface!!, inputValues)
            interpreter!!.fileEvaluation(miniCalcFile)
        }

        ...
        our test methods
        ...
    }

We have the TestSystemInterface we discussed before. In the interpret method we parse
the code, assert that it is correct and interpret it, saving the systemInterface and the interpreter
as fields of the class. Later in the tests we access them to check our assertions.
Let's look at some convoluted code. This example is very useful to show how you should not write
code. Incidentally, it is also useful to check that our interpreter resolves values and functions correctly,
using the ones defined closest to the point where they are used.
There are two functions named f. When we invoke f(3) - f(a) we are referring to the innermost
one, while when we invoke f(a + 1) + f(a + 2) we invoke the external one. Inside the
external function f, references to a are resolved to the parameter a, while outside that function they
are resolved to the variable a.

@Test fun interpretAnnidatedTwoLevels() {
    interpret("""var a = 0
            fun f(Int a) Int {
                print("external f invoked with " + a)
                fun f(Int p) Int {
                    print("internal f invoked with " + p)
                    3 * p
                }
                f(3) - f(a)
            }
            a = f(a + 1) + f(a + 2)""")
    assertEquals(listOf("external f invoked with 1",
            "internal f invoked with 3",
            "internal f invoked with 1",
            "external f invoked with 2",
            "internal f invoked with 3",
            "internal f invoked with 2"), systemInterface!!.output)
    assertEquals(9, interpreter!!.getGlobalValue("a"))
}

This is how you can use it.


You can also write simpler tests:

@Test fun interpretInputReference() {
    interpret("""input Int i
            input String s
            print(s + i)""", mapOf("i" to 34, "s" to "Age="))
    assertEquals(listOf("Age=34"), systemInterface!!.output)
}

@Test fun interpretIntDivision() {
    interpret("""print(10 / 3)""")
    assertEquals(listOf("3"), systemInterface!!.output)
}

@Test fun interpretMultiplication() {
    interpret("""print(3 * 4)""")
    assertEquals(listOf("12"), systemInterface!!.output)
}

StaMac
Looking at MiniCalcFun we have seen how to implement a typical imperative language with
statements and expressions. In the case of StaMac we have a different execution model, based on
state machines, so there are some differences. However, the part related to expressions is quite
similar.
This is all the code we need to interpret StaMac files.

fun StateMachine.stateByName(name: String) = this.states.find { it.name.equals(name) }!!
fun StateMachine.eventByName(name: String) = this.events.find { it.name.equals(name) }!!
fun StateMachine.inputByName(name: String) = this.inputs.find { it.name.equals(name) }!!

class SymbolTable {
    private val values = HashMap<String, Any>()
    fun readByName(name: String) : Any {
        if (!values.containsKey(name)) {
            throw RuntimeException("Unknown symbol $name. Known symbols: ${values.keys}")
        }
        return values[name]!!
    }
    fun writeByName(name: String, value: Any) {
        values[name] = value
    }
}

class Interpreter(val stateMachine: StateMachine, val inputsValues: Map<InputDeclaration, Any>) {
    var currentState : StateDeclaration = stateMachine.states.find { it.start }!!
    val symbolTable = SymbolTable()
    var alive = true

    init {
        stateMachine.inputs.forEach { symbolTable.writeByName(it.name, inputsValues[it]!!) }
        stateMachine.variables.forEach { symbolTable.writeByName(it.name, it.value.evaluate(symbolTable)) }
        executeEntryActions()
    }

    fun variableValue(variable: VarDeclaration) = symbolTable.readByName(variable.name)

    fun receiveEvent(event: EventDeclaration) {
        if (!alive) {
            println("[Log] Receiving event ${event.name} after exiting")
            return
        }
        println("[Log] Receiving event ${event.name} while in ${currentState.name}")
        val transition = currentState.blocks.filterIsInstance(OnEventBlock::class.java)
                .firstOrNull { it.event.referred!! == event }
        if (transition != null) {
            enterState(transition.destination.referred!!)
        }
    }

    private fun enterState(enteredState: StateDeclaration) {
        executeExitActions()
        currentState = enteredState
        executeEntryActions()
    }

    private fun executeEntryActions() {
        currentState.blocks.filterIsInstance(OnEntryBlock::class.java).forEach { it.execute(symbolTable, this) }
    }

    private fun executeExitActions() {
        currentState.blocks.filterIsInstance(OnExitBlock::class.java).forEach { it.execute(symbolTable, this) }
    }

}

private fun OnEntryBlock.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

private fun OnExitBlock.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

private fun Statement.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    when (this) {
        is Print -> println(this.value.evaluate(symbolTable))
        is Assignment -> symbolTable.writeByName(this.variable.name, this.value.evaluate(symbolTable))
        is Exit -> interpreter.alive = false
        else -> throw UnsupportedOperationException(this.toString())
    }
}

private fun Expression.evaluate(symbolTable: SymbolTable): Any =
    when (this) {
        is ValueReference -> symbolTable.readByName(this.symbol.name)
        is SumExpression -> {
            val l = this.left.evaluate(symbolTable)
            val r = this.right.evaluate(symbolTable)
            if (l is Int) {
                l + r as Int
            } else if (l is Double) {
                l + r as Double
            } else if (l is String) {
                l + r.toString()
            } else {
                throw UnsupportedOperationException(this.toString())
            }
        }
        is SubtractionExpression -> {
            val l = this.left.evaluate(symbolTable)
            val r = this.right.evaluate(symbolTable)
            if (l is Int) {
                l - r as Int
            } else if (l is Double) {
                l - r as Double
            } else {
                throw UnsupportedOperationException(this.toString())
            }
        }
        is MultiplicationExpression -> {
            val l = this.left.evaluate(symbolTable)
            val r = this.right.evaluate(symbolTable)
            if (l is Int) {
                l * r as Int
            } else if (l is Double) {
                l * r as Double
            } else {
                throw UnsupportedOperationException(this.toString())
            }
        }
        is DivisionExpression -> {
            val l = this.left.evaluate(symbolTable)
            val r = this.right.evaluate(symbolTable)
            if (l is Int) {
                l / r as Int
            } else if (l is Double) {
                l / r as Double
            } else {
                throw UnsupportedOperationException(this.toString())
            }
        }
        is IntLit -> this.value.toInt()
        is DecLit -> this.value.toDouble()
        is StringLit -> this.value
        else -> throw UnsupportedOperationException(this.toString())
    }

Let's first see how to look up elements by name:

fun StateMachine.stateByName(name: String) = this.states.find { it.name.equals(name) }!!
fun StateMachine.eventByName(name: String) = this.events.find { it.name.equals(name) }!!
fun StateMachine.inputByName(name: String) = this.inputs.find { it.name.equals(name) }!!

In this case we have only one scope: the global scope. So we have only one Symbol Table. While
processing the different parts of the AST we pass the Symbol Table around.
This is how our Symbol Table is defined:

class SymbolTable {
    private val values = HashMap<String, Any>()

    fun readByName(name: String) : Any {
        if (!values.containsKey(name)) {
            throw RuntimeException(
                "Unknown symbol $name. Known symbols: ${values.keys}")
        }
        return values[name]!!
    }

    fun writeByName(name: String, value: Any) {
        values[name] = value
    }
}

Let's see how we construct the interpreter.

class Interpreter(val stateMachine: StateMachine,
                  val inputsValues: Map<InputDeclaration, Any>) {
    var currentState : StateDeclaration =
            stateMachine.states.find { it.start }!!
    val symbolTable = SymbolTable()
    var alive = true

    init {
        stateMachine.inputs.forEach {
            symbolTable.writeByName(it.name, inputsValues[it]!!) }
        stateMachine.variables.forEach {
            symbolTable.writeByName(it.name, it.value.evaluate(symbolTable)) }
        executeEntryActions()
    }

First of all we have to provide values for the inputs. Inputs make a State Machine
configurable. The input values are inserted in the Symbol Table. Then we evaluate all the initial
expressions for the variables and put those into the Symbol Table too.
We also set the current state to the state marked as the start state, and we execute all the entry actions
for that state.
After the setup we are ready to react to events. We expose a method named receiveEvent and we
expect it to be called when an event is sent to our State Machine. For example, if we built a UI
for our interpreter, the user could hit a button for each event type and that button could call this
method, passing the associated event.

fun receiveEvent(event: EventDeclaration) {
    if (!alive) {
        println("[Log] Receiving event ${event.name} after exiting")
        return
    }
    println("[Log] Receiving event ${event.name} while in ${currentState.name}")
    val transition = currentState.blocks
            .filterIsInstance(OnEventBlock::class.java)
            .firstOrNull { it.event.referred!! == event }
    if (transition != null) {
        enterState(transition.destination.referred!!)
    }
}

What does this method do? If the state machine has already exited, we just log the event and ignore it.
Otherwise we print a log message and look for a transition that can be triggered from the current
state by the event we received. Two things can happen:

- we find a transition: in that case we simply go to the destination (another state)
- we do not find a transition: in that case nothing happens. It is perfectly normal for some
  events to be ignored in certain states

How do we navigate to the destination? We do the following:

- execute the exit actions for the current state
- change the current state
- execute the entry actions for the new state

Note that neither executeEntryActions nor executeExitActions takes as a parameter the state
on which to execute the entry or exit actions: they use currentState instead, so the order in
which we call these methods and update the currentState variable is important.
Executing entry or exit actions is done by looking for entry or exit blocks in the current state. If such
blocks are found we invoke execute on them, passing the Symbol Table and the interpreter.

private fun enterState(enteredState: StateDeclaration) {
    executeExitActions()
    currentState = enteredState
    executeEntryActions()
}

private fun executeEntryActions() {
    currentState.blocks.filterIsInstance(OnEntryBlock::class.java)
            .forEach { it.execute(symbolTable, this) }
}

private fun executeExitActions() {
    currentState.blocks.filterIsInstance(OnExitBlock::class.java)
            .forEach { it.execute(symbolTable, this) }
}
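
To see why the order of the three steps matters, here is a stripped-down sketch in Java. It is not the StaMac interpreter: the class TransitionSketch is hypothetical, states are plain strings, and the "actions" only append a line to a log. It reproduces just the ordering constraint, since both action methods look at the current state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the order of operations when moving between states.
class TransitionSketch {
    private String currentState;
    final List<String> log = new ArrayList<>();

    TransitionSketch(String startState) {
        currentState = startState;
        executeEntryActions(); // the start state is entered when the machine is created
    }

    void enterState(String enteredState) {
        executeExitActions();        // still refers to the old state
        currentState = enteredState; // switch only after the exit actions ran
        executeEntryActions();       // now refers to the new state
    }

    private void executeEntryActions() {
        log.add("entry " + currentState);
    }

    private void executeExitActions() {
        log.add("exit " + currentState);
    }

    public static void main(String[] args) {
        TransitionSketch machine = new TransitionSketch("idle");
        machine.enterState("running");
        System.out.println(machine.log);
    }
}
```

If the assignment to currentState were moved before the call to executeExitActions, the exit actions of the new state would run instead of those of the old one.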

Executing a block just consists of executing every statement contained in the block,
in order:

private fun OnEntryBlock.execute(symbolTable: SymbolTable,
                                 interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

private fun OnExitBlock.execute(symbolTable: SymbolTable,
                                interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

Executing a statement is fairly easy because we have only three types of statements: print, assignment and exit.

private fun Statement.execute(symbolTable: SymbolTable,
                              interpreter: Interpreter) {
    when (this) {
        is Print -> println(this.value.evaluate(symbolTable))
        is Assignment -> symbolTable.writeByName(
                this.variable.name, this.value.evaluate(symbolTable))
        is Exit -> interpreter.alive = false
        else -> throw UnsupportedOperationException(this.toString())
    }
}

Note that you may want to collect logs through an interface, letting the user supply different
implementations, such as loggers that print messages on the screen, to a file or maybe to a DB. The same goes for
the output of the print statement of the language.
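
A minimal sketch of that idea, in Java and under stated assumptions: OutputChannel, ConsoleChannel and CollectingChannel are hypothetical names introduced only for this illustration, and receiveEvent here is a stand-in that only produces the log line, not the method of the interpreter above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: routing interpreter output through an interface.
class OutputSketch {

    // The interpreter only knows this interface, not where messages end up.
    interface OutputChannel {
        void emit(String message);
    }

    // Prints on the screen.
    static class ConsoleChannel implements OutputChannel {
        public void emit(String message) {
            System.out.println(message);
        }
    }

    // Keeps messages in memory, which is handy in tests.
    static class CollectingChannel implements OutputChannel {
        final List<String> messages = new ArrayList<>();

        public void emit(String message) {
            messages.add(message);
        }
    }

    // Stand-in for the logging part of an interpreter method.
    static void receiveEvent(String eventName, String stateName, OutputChannel log) {
        log.emit("[Log] Receiving event " + eventName + " while in " + stateName);
    }

    public static void main(String[] args) {
        receiveEvent("start", "idle", new ConsoleChannel());
    }
}
```

This is the same trick used for MiniCalc with the SystemInterface: a test can pass a collecting implementation and then assert on the recorded messages.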

Summary
In this chapter we have seen the basics of writing an interpreter. We have studied the typical
structure of an interpreter and discussed its main components. We got started working with symbol
tables, executing statements and evaluating expressions. Now building an interpreter should look
less mysterious. After all, it is just about following the information captured in the AST and doing
something in response.
In many cases writing an interpreter is just easier than writing a compiler. I
would suggest going down this path at least while you are designing the language and it is not
yet stable.
Have fun writing interpreters and, when you are ready, let's move to the next chapter and explore
an alternative: generating bytecode.
11. Generate JVM bytecode
In the previous chapter we have seen how to write an interpreter. In this one we will see how to
write a compiler instead. Our compiler will produce JVM bytecode. By compiling for the JVM we
will be able to run our code on all sorts of platforms. That sounds pretty great to me!
Also, the JVM classes generated by our compiler could be used inside applications written in Java,
Kotlin, Scala, JRuby, Frege and all the other languages that run on the JVM. This opens up all sorts
of scenarios. For example you may want to create the core of a complex system in Java, and maybe
use a smaller language, like MiniCalc or StaMac, to define specific subsystems. In other words you
could combine many specific languages and other more general, established JVM languages to build
rich applications. This is a scenario that I think has a lot of potential, because it lets us combine
the strengths of different languages to define different portions of an application.
Before we start writing a compiler targeting the JVM we need to examine how the Java Virtual
Machine works. We will start by doing that in the first section of this chapter. Later we will write
two different compilers: one for MiniCalcFun (i.e., MiniCalc extended to support functions) and the
other one for StaMac.

The Java Virtual Machine


To be able to write bytecode it is very important to have a general understanding of how the Java
Virtual Machine works. It is not that complex but there are a few peculiarities and a few terms you
should familiarize yourself with. For this reason the first part of this chapter presents all the concepts
you need to know about the JVM. If you need more details you can always refer to the JVM
Specification, which is freely available on the Oracle website
(https://docs.oracle.com/javase/specs/jvms/se8/html/index.html).
At a very general level the first thing to notice is that the JVM is a stack based machine. Many
processors today are register-based. This means their elementary operations are based on accessing
registers, which can be purpose-specific or generic. Now, the issue is that every processor family
has different registers and uses them differently. So if you want to build a Virtual Machine that has
good performance on all sorts of different processors, you are probably better off not thinking in
terms of registers. You can instead build a stack based machine. In a stack based machine most
operations take values from the stack, manipulate them, and put the results back on the stack. In addition
to the stack, other structures are used too. The most important ones are the constant pool and the
local variables table. We will see instructions that access values from those structures
and put them on the stack.
In the rest of this section we look at:

- the general structure of class files: class files contain the code executed by the JVM
- JVM Type descriptions: how the JVM defines types internally
- the stack: a first example of how the stack is used to perform operations
- bytecode: the bytecode specifies the instructions to execute when running a method
- frames: they will be useful to understand the execution of methods

Finally we will look at a class file and examine its different parts.

Class files
The JVM executes code contained in class files. The format of such files is described in Chapter
4 of the JVM Specification (https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html).
To understand how to write bytecode effectively it is not necessary to
look into every single field of a class file. There are details we can ignore if we use a simple library
like ASM. What is important is to understand the general structure.
A class file contains:

- a signature corresponding to the hexadecimal value 0xCAFEBABE. Yes, really.
- the class file version. Each JVM can support certain class file versions. The JVM for Java 8
  supports up to class file version 52.
- the constant pool. It is a list of constants (more on this later)
- access flags: they tell us if the class is public or not, if it is an interface, if it is an abstract class, an
  annotation or an enum, and a few other things
- the name of the class
- the name of the super class
- the interfaces implemented or extended
- the fields defined
- the methods defined
- attributes: they can be a variety of different things like annotations, exceptions, code, debug
  information

Each class file represents a single class. Inner classes, anonymous classes and local classes
are also compiled into separate class files.
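
As a small illustration of the first two items of the list, the following Java sketch checks the signature and reads the version from the first eight bytes of a class file. The class name ClassHeaderSketch and the method readMajorVersion are hypothetical, and the bytes are written by hand here; with a real file you could obtain them, for example, with java.nio.file.Files.readAllBytes. In the class file the minor version comes before the major version, and all values are big-endian.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: reading the first fields of a class file.
class ClassHeaderSketch {

    // Returns the major version after checking the 0xCAFEBABE signature.
    static int readMajorVersion(byte[] classFileBytes) {
        ByteBuffer in = ByteBuffer.wrap(classFileBytes); // big-endian by default
        if (in.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file");
        }
        int minor = in.getShort() & 0xFFFF; // two bytes, read as unsigned
        int major = in.getShort() & 0xFFFF; // two bytes, read as unsigned
        return major;
    }

    public static void main(String[] args) {
        // Signature, minor version 0, major version 52 (Java 8)
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(readMajorVersion(header));
    }
}
```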
A very important structure contained in a class file is the constant pool. It contains a set of constants
that can be used for very different goals. Many other fields of the class file contain just indexes that
refer to the constant pool. For example the this_class field contains just an index to an entry in the
constant pool. That entry is expected to contain a data structure describing the current class. The
constant pool also contains constants that will be accessed from bytecode. For example, instructions
that invoke methods do not directly specify the method to invoke. They instead specify indexes into
the constant pool. At that position in the constant pool we will find the name of the method to invoke
and its signature. This saves space when we refer to the same method more than once, because
the name is present just once. Considering that an index takes just two bytes, this is considerably less
than the space needed to record the information needed to identify a method.
The class file also contains a list of the fields and a list of the methods declared in the class. Inherited
fields and methods are not present in the class file. For each field we have information like the name,
the type, and the access level. For each method we have the name, its signature and its code.
There are other possible attributes associated with fields and methods, which can be useful for
debugging purposes or which can contain other information (like the exceptions thrown by a
method). We are not going to look into those; if you want to learn more about them please refer
to the JVM Specification, section 4.7
(https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7). The class file also
contains a list of inner classes. We are not going to use them in this chapter.
The code attribute associated with a method contains the bytecode and some complementary
information. The bytecode contains a list of instructions that are executed when invoking the method.
We are going to see more about this in the following sections. Before that we are going to look into
concepts that are relevant to understanding how the bytecode operates.

JVM Type description


All types are referred to in class files through their JVM Type description. We have one-letter
type descriptions for the primitive types:

type       JVM Type description
void       V
boolean    Z
char       C
byte       B
short      S
int        I
long       J
float      F
double     D

In addition to that we need to consider two other cases: declared types and arrays.
By declared types we mean classes, interfaces, enums, and annotations. Their JVM Type
description is constructed as "L" + internal name + ";". The internal name is simply the qualified
name with dots replaced by slashes.
To compose the type description of an array we add the [ symbol to the start of the type description
of the element type.

Let's see some examples:

type            JVM Type description
int[]           [I
String          Ljava/lang/String;
String[]        [Ljava/lang/String;
Object[]        [Ljava/lang/Object;
Object[][][]    [[[Ljava/lang/Object;

Typically in the class file you use the JVM Type description, unless only a declared type can be
used. In that case you use an internal name. For example, when specifying a superclass, or the
class defining a method we will use internal names. For parameter types we will instead use type
descriptions.
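
The rules above are mechanical enough to be written down as a short function. The following Java sketch is only an illustration of those rules: DescriptorSketch, descriptor and internalName are hypothetical names, not part of any library used in this book.

```java
// Hypothetical sketch: computing JVM Type descriptions from Java classes.
class DescriptorSketch {

    static String descriptor(Class<?> type) {
        if (type.isArray()) {
            // Arrays: "[" followed by the description of the element type
            return "[" + descriptor(type.getComponentType());
        }
        if (type.isPrimitive()) {
            if (type == void.class) return "V";
            if (type == boolean.class) return "Z";
            if (type == char.class) return "C";
            if (type == byte.class) return "B";
            if (type == short.class) return "S";
            if (type == int.class) return "I";
            if (type == long.class) return "J";
            if (type == float.class) return "F";
            return "D"; // double is the only primitive type left
        }
        // Declared types: "L" + internal name + ";"
        return "L" + internalName(type) + ";";
    }

    // The internal name is the qualified name with dots replaced by slashes.
    static String internalName(Class<?> type) {
        return type.getName().replace('.', '/');
    }

    public static void main(String[] args) {
        System.out.println(descriptor(int[].class));
        System.out.println(descriptor(String.class));
        System.out.println(descriptor(Object[][][].class));
        System.out.println(internalName(Object.class));
    }
}
```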

The stack
We have said that the JVM is a stack based machine. But what is a stack based machine? It is a
machine that executes operations by extracting values from a stack and putting results back on the
stack. A stack is a LIFO structure: when extracting values the first value we pick is the last value that
was inserted on the stack.
Consider one instruction of the JVM, IADD. This instruction expects to find two integers at the top
of the stack when it is invoked. It will then get these two values, sum them, and put the result back
on the stack.
Suppose that we have inserted the values 1, 2, and 3 in the stack and we execute two consecutive
IADD. What will happen?

1. Initially the stack will contain some values. We will leave them untouched
2. We will first push the value 1 on the top of the stack
3. Then we will push the value 2. Now 2 is on top, above 1
4. Then we will push the value 3. Now 3 is on top, above 2, which is above 1
5. We perform an addition by removing the two values at the top of the stack. We remove first
   3 and then 2. We sum them, and we put the result on the top of the stack. Now 5 is on the top
   of the stack, above 1
6. We perform an addition. We remove first 5 and then 1. We put the result on the top of the
   stack. Now 6 is on the top of the stack, above the values that were originally present in the
   stack
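
The steps above can be replayed with an ordinary stack data structure. This Java sketch is a toy model, not how the JVM is implemented: StackSketch is a hypothetical name and the iadd method only mimics what the IADD instruction does to the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: replaying the IADD example on an ordinary stack.
class StackSketch {

    // Mimics IADD: pops two ints and pushes their sum.
    static void iadd(Deque<Integer> stack) {
        int second = stack.pop(); // the value on top, pushed last
        int first = stack.pop();  // the value below it
        stack.push(first + second);
    }

    static int run() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        stack.push(2);
        stack.push(3);
        iadd(stack); // removes 3 and 2, pushes 5: the stack now holds 5 above 1
        iadd(stack); // removes 5 and 1, pushes 6
        return stack.peek();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```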

Bytecode
A JVM instruction starts with an opcode that takes exactly one byte. Opcode stands for operation
code, and it is a number that identifies one of the instructions the JVM knows how to execute. The
maximum number of opcodes would theoretically be 256, but some values are reserved and not all
the values correspond to valid opcodes. Each opcode also has an associated mnemonic name:
it is much clearer to read iadd instead of 96 (which is the value of the opcode for iadd).
An opcode can be followed by zero or more operands. Operands can be immediate values or indexes
indicating entries in the constant pool.
Note that the opcode determines how many operands are expected and their type, so that by looking
at the opcode we know how long the whole instruction is going to be. For example, after d2f we
know that there will be no operands, so the whole instruction will take one byte. After bipush we
have an operand of one byte, so the instruction will take two bytes. After putfield there is one
operand of two bytes, so the whole instruction will take three bytes.
For conceptually similar operations we can have different opcodes. For example summing two
numbers can be done using iadd, ladd, fadd, or dadd depending on the type of the operands:
iadd for byte, short, and int values, ladd for long, fadd for float, and dadd for double.

Frames
Each time a method is invoked a new frame is created. The frame is destroyed when the invocation
is completed.
Associated with each frame we have an array of local variables. It contains, in order:

- the value of this, if the method is an instance method. This entry is not present for static
  methods
- the values of the method parameters, starting from position 1 for instance methods or from
  position 0 for static methods
- the local variables defined in the method

Note that while most values in the local variables array take one slot, long and double values
take two.
Let's see a couple of cases.
Suppose we have these two Java methods:

void foo(String s, long l) {
    boolean b;
}

static void bar(int i, int j, double d) {
    Object o;
}

This will be the local variables array for foo:

Index    Content
0        this
1        parameter s
2        parameter l (1st part)
3        parameter l (2nd part)
4        local variable b

This will be the local variables array for bar:

Index    Content
0        parameter i
1        parameter j
2        parameter d (1st part)
3        parameter d (2nd part)
4        local variable o
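
The two tables can be derived mechanically from the JVM Type descriptions of the parameters and local variables. The following Java sketch shows one way to do it; SlotsSketch and assignSlots are hypothetical names, and the method assumes that parameters are listed before local variables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: assigning indexes in the local variables array.
class SlotsSketch {

    // names and descriptors are parallel arrays: parameters first, then local variables.
    static Map<String, Integer> assignSlots(boolean isStatic, String[] names, String[] descriptors) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int next = 0;
        if (!isStatic) {
            slots.put("this", next++); // instance methods keep this at index 0
        }
        for (int i = 0; i < names.length; i++) {
            slots.put(names[i], next);
            // long (J) and double (D) values take two entries, everything else takes one
            boolean wide = descriptors[i].equals("J") || descriptors[i].equals("D");
            next += wide ? 2 : 1;
        }
        return slots;
    }

    public static void main(String[] args) {
        // void foo(String s, long l) { boolean b; }
        System.out.println(assignSlots(false,
                new String[] {"s", "l", "b"},
                new String[] {"Ljava/lang/String;", "J", "Z"}));
        // static void bar(int i, int j, double d) { Object o; }
        System.out.println(assignSlots(true,
                new String[] {"i", "j", "d", "o"},
                new String[] {"I", "I", "D", "Ljava/lang/Object;"}));
    }
}
```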

Examining a class file


Bundled with the JDK there is a utility named javap that you can use to inspect class files.
Create a Java file with this code:

class A {
}

And compile it (for example by running javac A.java).

Now you can disassemble it by running javap -v -c -s A.class. I get this result (I have just omitted
a few lines):

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref          #3.#10         // java/lang/Object."<init>":()V
   #2 = Class              #11            // A
   #3 = Class              #12            // java/lang/Object
   #4 = Utf8               <init>
   #5 = Utf8               ()V
   #6 = Utf8               Code
   #7 = Utf8               LineNumberTable
   #8 = Utf8               SourceFile
   #9 = Utf8               a.java
  #10 = NameAndType        #4:#5          // "<init>":()V
  #11 = Utf8               A
  #12 = Utf8               java/lang/Object
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1    // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0
}

First we have the version of the class file. When a new version of Java is released, new opcodes
may be introduced or the class format may change slightly. When that happens a new version of the
class file format is introduced. Not all new releases of Java have introduced a new version of the class file format.
The ACC_SUPER flag is present for historical reasons. It should always be there.
This class file contains one method, even though the class in the source code was empty. This method is
the default constructor, which is added by the compiler when no constructor is explicitly defined.
Now let's look into the constant pool. To understand how the constant pool works let's look at one
line of code:

1: invokespecial #1 // Method java/lang/Object."<init>":()V

This line invokes the method specified in entry #1 of the constant pool. If we look at the constant pool
we can see that entry #1 is of type Methodref (i.e., it describes a method) and it refers to two
other entries: #3 and #10. Entry #3 defines the class in which the method is declared, while entry
#10 defines the name and the signature of the method. Constructors are basically special methods with the name <init>. In
this case we have a default constructor that invokes the parent constructor. Our class A implicitly
extends java.lang.Object so we call the java.lang.Object constructor.
Entry #3 is an entry of type Class that refers to entry #12. Entry #12 actually contains the
name of the class. The name is in the internal format (java/lang/Object), which is basically
the canonical format with slashes replacing the dots.
Entry #10 is an entry of type NameAndType which refers to two other entries. The first entry (#4)
specifies the name of the method while the second one (#5) specifies the parameters accepted and
the return type.
Entry #4 contains the name of the method, which is <init>. This is the special name used to represent
constructors.
Entry #5 indicates that the method takes no parameters and returns void (i.e., it returns nothing).
What should be clear at this point is that filling the constant pool is not complicated, but it requires
a lot of bookkeeping. For this reason we are going to use a library to write the class files instead of
writing the bytes directly. That would not be conceptually difficult, just boring.
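
To make the chain of references explicit, here is a toy model of the constant pool printed by javap above, written in Java. It is not how class files are parsed in practice: ConstantPoolSketch is a hypothetical name, every entry is stored as an array of strings for simplicity, and only the four entry kinds that appear in this example are handled.

```java
// Hypothetical sketch: following references between constant pool entries.
class ConstantPoolSketch {

    // Index 0 is unused, as in a real class file; the other entries mirror the javap output.
    static final String[][] POOL = {
        null,
        {"Methodref", "3", "10"},
        {"Class", "11"},
        {"Class", "12"},
        {"Utf8", "<init>"},
        {"Utf8", "()V"},
        {"Utf8", "Code"},
        {"Utf8", "LineNumberTable"},
        {"Utf8", "SourceFile"},
        {"Utf8", "a.java"},
        {"NameAndType", "4", "5"},
        {"Utf8", "A"},
        {"Utf8", "java/lang/Object"},
    };

    // Resolves an entry to the text that javap shows in its comments.
    static String resolve(int index) {
        String[] entry = POOL[index];
        switch (entry[0]) {
            case "Utf8":
                return entry[1];
            case "Class":
                return resolve(Integer.parseInt(entry[1]));
            case "NameAndType":
                return "\"" + resolve(Integer.parseInt(entry[1])) + "\":"
                        + resolve(Integer.parseInt(entry[2]));
            case "Methodref":
                return resolve(Integer.parseInt(entry[1])) + "."
                        + resolve(Integer.parseInt(entry[2]));
            default:
                throw new IllegalArgumentException("Unsupported entry type " + entry[0]);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(1));
    }
}
```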

Generics and arrays


Java 5 introduced generics in a way that was backward compatible. The way generics are handled
in the JVM is peculiar. We are not going to cover it in this chapter.
We are also not considering the instructions to work with arrays.

The main instructions


In this section we look at the most commonly used instructions we will see in bytecode.

Constants, Loading, and storing


These instructions read values from the local variables table and push them on the stack,
or do the opposite: they pop a value from the stack and save it into the local variables table.
Consider this Java method:

int foo(int p) {
    p = 0;
    return p;
}

In this method we first store 0 into an entry of the local variables table (p) and then we read a value
from the same entry, to return it.
If we compile it and decompile it we get:

int foo(int);
  descriptor: (I)I
  flags:
  Code:
    stack=1, locals=2, args_size=2
       0: iconst_0
       1: istore_1
       2: iload_1
       3: ireturn

The first instruction is iconst_0. This is an instruction that pushes the int value 0 on the stack. We
will see more about the instructions to push constants in the paragraph on Constants.
Then we have the instruction istore_1, which takes the int value on the top of the stack and stores it
into entry #1 of the local variables table. Note that in this case entry #0 holds this,
while entry #1 holds the only parameter of the method, p.
After that we load the integer value in entry #1 of the local variables table, and then return it.
Each of these instructions takes exactly one byte. The number before the instruction indicates the
index in the byte array describing the code. For example 0: iconst_0 starts at byte 0, 1: istore_1
at byte 1, and so on.

Let's consider this method now:

int foo(int p1, int p2, int p3, int p4, int p5, int p6) {
    p1 = 10;
    p2 = 20;
    p3 = 30;
    p4 = 40;
    p5 = 50;
    p6 = 60;
    return p6;
}

This is compiled into this:

int foo(int, int, int, int, int, int);
  descriptor: (IIIIII)I
  flags:
  Code:
    stack=1, locals=7, args_size=7
       0: bipush        10
       2: istore_1
       3: bipush        20
       5: istore_2
       6: bipush        30
       8: istore_3
       9: bipush        40
      11: istore        4
      13: bipush        50
      15: istore        5
      17: bipush        60
      19: istore        6
      21: iload         6
      23: ireturn

Here we can see that conceptually we execute the same operation over and over: assigning a constant
to a parameter. However the instructions are different. In the first example we pushed
the value 0 on the stack with the instruction iconst_0. That is an instruction of one byte that specifies both what to
do (push an integer value) and the value to push (0). In general we do not have a specific
instruction for each possible value to push; we can instead use the parametric instruction bipush,
which requires us to specify the value to push. We could write bipush 0. It would be equivalent
to iconst_0, it would just take more bytes. A bipush takes 2 bytes: indeed the next
instruction starts at byte 2, not at byte 1.
The same reasoning applies to storing instructions. We have special instructions to store values in
the entries #0, #1, #2, and #3, but after that we need to use the generic instruction istore. The same is true for
loading: we have seen iload_1 before but there is no iload_6, so we instead use iload and specify as
a parameter the index of the entry (6 in this example).
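
The byte offsets shown in the listing above can be recomputed from the length of each instruction. The Java sketch below does that for the opcodes used in the example. OffsetsSketch is a hypothetical name, the opcode values are the ones assigned by the JVM Specification, and the length method covers only the handful of opcodes that appear in this example.

```java
// Hypothetical sketch: computing the byte offset of each instruction.
class OffsetsSketch {
    static final int BIPUSH = 0x10, ILOAD = 0x15, ISTORE = 0x36,
            ISTORE_1 = 0x3c, ISTORE_2 = 0x3d, ISTORE_3 = 0x3e, IRETURN = 0xac;

    // The opcodes of the second foo method, in order, without their operands.
    static final int[] EXAMPLE = {
        BIPUSH, ISTORE_1, BIPUSH, ISTORE_2, BIPUSH, ISTORE_3,
        BIPUSH, ISTORE, BIPUSH, ISTORE, BIPUSH, ISTORE, ILOAD, IRETURN
    };

    // Length in bytes of an instruction: the opcode plus its operands.
    static int length(int opcode) {
        switch (opcode) {
            case BIPUSH:
            case ILOAD:
            case ISTORE:
                return 2; // one-byte opcode followed by a one-byte operand
            default:
                return 1; // the other opcodes used here have no operands
        }
    }

    static int[] offsets(int[] opcodes) {
        int[] result = new int[opcodes.length];
        int offset = 0;
        for (int i = 0; i < opcodes.length; i++) {
            result[i] = offset;
            offset += length(opcodes[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(offsets(EXAMPLE)));
    }
}
```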

Constants

We have special instructions to push the values between -1 and 5, inclusive. For the other values between
-128 and 127 we use bipush. For the other values between -32768 and 32767 we use sipush. For all remaining values
we insert the constant in the constant pool and then we use the instruction ldc #x where x is the
index of the constant in the constant pool.

Value to push   Instruction    Length in bytes
-10             bipush -10     2 bytes
-2              bipush -2      2 bytes
-1              iconst_m1      1 byte
0               iconst_0       1 byte
1               iconst_1       1 byte
2               iconst_2       1 byte
3               iconst_3       1 byte
4               iconst_4       1 byte
5               iconst_5       1 byte
6               bipush 6       2 bytes
7               bipush 7       2 bytes
100             bipush 100     2 bytes
127             bipush 127     2 bytes
128             sipush 128     3 bytes
32767           sipush 32767   3 bytes
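
The selection described in this section can be sketched as a small Java function. PushConstant is a hypothetical helper written for this illustration, not an existing API. For ldc the sketch prints the value itself, while a real class file would contain the index of a constant pool entry, and it assumes that index fits in one byte.

```java
// Hypothetical helper, written only for illustration: it chooses the
// instruction used to push an int constant onto the stack.
final class PushConstant {

    static String instructionFor(int value) {
        if (value == -1) return "iconst_m1";
        if (value >= 0 && value <= 5) return "iconst_" + value;
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return "bipush " + value;
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) return "sipush " + value;
        // A real class file would reference a constant pool index here;
        // the sketch shows the value itself for readability.
        return "ldc " + value;
    }

    // Length of the chosen instruction, operands included.
    static int sizeInBytes(int value) {
        if (value >= -1 && value <= 5) return 1;
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return 2;
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) return 3;
        // Assumption: the constant pool index fits in one byte (plain ldc).
        return 2;
    }

    public static void main(String[] args) {
        int[] samples = {-10, -1, 0, 5, 6, 127, 128, 32767, 32768};
        for (int value : samples) {
            System.out.println(value + " -> " + instructionFor(value)
                    + " (" + sizeInBytes(value) + " byte(s))");
        }
    }
}
```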

Mathematical operations
Let's look at how addition is executed.

int sumBytes(byte a, byte b) {
    return a + b;
}

int sumShorts(short a, short b) {
    return a + b;
}

int sumInts(int a, int b) {
    return a + b;
}

long sumLongs(long a, long b) {
    return a + b;
}

float sumFloats(float a, float b) {
    return a + b;
}

double sumDoubles(double a, double b) {
    return a + b;
}

These methods result in:

int sumBytes(byte, byte);
  descriptor: (BB)I
  Code:
     0: iload_1
     1: iload_2
     2: iadd
     3: ireturn

int sumShorts(short, short);
  descriptor: (SS)I
  Code:
     0: iload_1
     1: iload_2
     2: iadd
     3: ireturn

int sumInts(int, int);
  descriptor: (II)I
  Code:
     0: iload_1
     1: iload_2
     2: iadd
     3: ireturn

long sumLongs(long, long);
  descriptor: (JJ)J
  Code:
     0: lload_1
     1: lload_3
     2: ladd
     3: lreturn

float sumFloats(float, float);
  descriptor: (FF)F
  Code:
     0: fload_1
     1: fload_2
     2: fadd
     3: freturn

double sumDoubles(double, double);
  descriptor: (DD)D
  Code:
     0: dload_1
     1: dload_3
     2: dadd
     3: dreturn

We can notice a few things:

• the JVM treats bytes, shorts, and ints in the same way in many cases, i.e., internally they are
  all treated as if they were ints
• we have different instructions to load primitive values: iload, lload, fload, dload
• correspondingly we have different instructions to sum: iadd, ladd, fadd, dadd
• the same is true for the return instructions: ireturn, lreturn, freturn, dreturn
• a long or a double takes two entries in the local variables table, which is why the second
  parameter is loaded with lload_3 or dload_3 instead of lload_2 or dload_2

We have not seen it in this example, but subtraction, multiplication, and division also come in
four variants each.
Type     Addition   Subtraction   Multiplication   Division
Byte     iadd       isub          imul             idiv
Short    iadd       isub          imul             idiv
Int      iadd       isub          imul             idiv
Long     ladd       lsub          lmul             ldiv
Float    fadd       fsub          fmul             fdiv
Double   dadd       dsub          dmul             ddiv

We are not considering what happens when you sum two values which are not of the same type.
We are going to figure that out in the next section about conversions.
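
The table follows a regular naming scheme: a one-letter prefix for the type, followed by the name of the operation. A minimal sketch in Java, assuming a hypothetical helper called ArithmeticOpcodes (the class and its enum are names invented for this illustration):

```java
import java.util.Locale;

// Hypothetical helper, written only for illustration: it builds the mnemonic
// of an arithmetic instruction from the operand type and the operation.
final class ArithmeticOpcodes {

    enum Operation { ADD, SUB, MUL, DIV }

    static String opcodeFor(Class<?> type, Operation operation) {
        return prefixFor(type) + operation.name().toLowerCase(Locale.ROOT);
    }

    // byte, short, char, and int all use the int instructions.
    static char prefixFor(Class<?> type) {
        if (type == byte.class || type == short.class
                || type == char.class || type == int.class) {
            return 'i';
        }
        if (type == long.class) return 'l';
        if (type == float.class) return 'f';
        if (type == double.class) return 'd';
        throw new IllegalArgumentException("Not a numeric primitive type: " + type);
    }

    public static void main(String[] args) {
        System.out.println(opcodeFor(short.class, Operation.ADD));
        System.out.println(opcodeFor(double.class, Operation.DIV));
    }
}
```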

Conversions
When the two operands do not have the same type we need to convert one of them. Consider these
cases:

int sumByteAndShort(byte a, byte b) {
    return a + b;
}

int sumByteAndInt(byte a, int b) {
    return a + b;
}

long sumByteAndLong(byte a, long b) {
    return a + b;
}

float sumByteAndFloat(byte a, float b) {
    return a + b;
}

double sumByteAndDouble(byte a, double b) {
    return a + b;
}

They result in this bytecode:

int sumByteAndShort(byte, byte);
  descriptor: (BB)I
  Code:
     0: iload_1
     1: iload_2
     2: iadd
     3: ireturn

int sumByteAndInt(byte, int);
  descriptor: (BI)I
  Code:
     0: iload_1
     1: iload_2
     2: iadd
     3: ireturn

long sumByteAndLong(byte, long);
  descriptor: (BJ)J
  Code:
     0: iload_1
     1: i2l
     2: lload_2
     3: ladd
     4: lreturn

float sumByteAndFloat(byte, float);
  descriptor: (BF)F
  Code:
     0: iload_1
     1: i2f
     2: fload_2
     3: fadd
     4: freturn

double sumByteAndDouble(byte, double);
  descriptor: (BD)D
  Code:
     0: iload_1
     1: i2d
     2: dload_2
     3: dadd
     4: dreturn

What happens here?

When we sum bytes, shorts, or ints we do not need any conversion because internally they are all
ints. However when we sum a byte and a long we need to convert the byte (which internally is an int)
to a long. This is why sumByteAndLong contains i2l. In sumByteAndFloat and sumByteAndDouble
we instead convert the int to a float (i2f) or to a double (i2d).
After the conversion we have two values of the same type at the top of the stack. We can then just
sum them using the appropriate version of add.

Original type   Target type   Operation   Operation type
int             long          i2l         Widening
int             float         i2f         Widening
int             double        i2d         Widening
long            float         l2f         Widening
long            double        l2d         Widening
float           double        f2d         Widening
int             byte          i2b         Narrowing
int             char          i2c         Narrowing
int             short         i2s         Narrowing
long            int           l2i         Narrowing
float           int           f2i         Narrowing
float           long          f2l         Narrowing
double          int           d2i         Narrowing
double          long          d2l         Narrowing
double          float         d2f         Narrowing

Conversions can be widening or narrowing. A widening numeric conversion always keeps the original
value or a value that is close to it: converting an int or a long to a floating point type can lose
some precision, because of how floating point values are represented. A narrowing numeric
conversion can instead change the value significantly.
For example, converting an int containing the value 3 into a byte works as expected. However,
converting an int containing the value 128 into a byte is a problem, because a byte can represent
values between -128 and 127: 128 does not fit into a byte. For this reason the resulting value is not
equivalent to the original one (the result is -128).
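
A short Java program lets us observe both behaviours. ConversionDemo and its method names are chosen only for this illustration. The cast to byte is compiled to i2b, and the assignment of an int to a float is compiled to i2f.

```java
// Illustration only: observing a narrowing and a widening conversion.
final class ConversionDemo {

    // Narrowing: keeps only the lowest 8 bits of the int (i2b).
    static byte narrow(int value) {
        return (byte) value;
    }

    // Widening: always allowed, but a float cannot represent every int (i2f).
    static float widen(int value) {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(narrow(3));               // fits in a byte
        System.out.println(narrow(128));             // does not fit in a byte
        System.out.println((int) widen(16_777_217)); // loses the last unit
    }
}
```

The value 3 survives the narrowing, while 128 becomes -128. The widening of 16777217 yields 16777216: close to the original value, but not equal to it.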

Operations on objects
So far we have focused on instructions involving primitive types. We have not yet seen how to deal
with object instances.
In a few words: they work very similarly. The only thing worth noticing is that the instructions do
not work directly on the object itself, but on a reference to it, in other words on its address. For the
old guys who learnt to program a while ago this will bring back old memories: basically we are back
to working with pointers.
Consider this method:

String passingStringAround(String param) {
    String myStringVar = param;
    return myStringVar;
}

This is translated to:



java.lang.String passingStringAround(java.lang.String);
  descriptor: (Ljava/lang/String;)Ljava/lang/String;
  Code:
     0: aload_1
     1: astore_2
     2: aload_2
     3: areturn

Basically we have variants of the load, store, and return instructions for references. They start
with a.
The other thing that you may notice is that the descriptors (the signatures shown by javap) are much
longer. Each primitive type is described by a single letter, so a method taking an int and
returning a long has the descriptor (I)J.

For classes and interfaces the description is instead L + internal name + ;. For example
java.lang.String becomes Ljava/lang/String;. This also explains why the letter for long is not L but J.
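
These rules are mechanical enough to be written down as code. The following Java sketch computes descriptors from Class objects; Descriptors, of, and ofMethod are hypothetical names used only for this illustration.

```java
// Hypothetical helper, written only for illustration: it computes type and
// method descriptors from java.lang.Class instances.
final class Descriptors {

    static String of(Class<?> type) {
        if (type == void.class) return "V";
        if (type == boolean.class) return "Z";
        if (type == byte.class) return "B";
        if (type == char.class) return "C";
        if (type == short.class) return "S";
        if (type == int.class) return "I";
        if (type == long.class) return "J";
        if (type == float.class) return "F";
        if (type == double.class) return "D";
        // Arrays: one [ per dimension, followed by the element descriptor.
        if (type.isArray()) return "[" + of(type.getComponentType());
        // Classes and interfaces: L + internal name + ;
        return "L" + type.getName().replace('.', '/') + ";";
    }

    // Parameter descriptors between parentheses, then the return descriptor.
    static String ofMethod(Class<?> returnType, Class<?>... parameterTypes) {
        StringBuilder sb = new StringBuilder("(");
        for (Class<?> parameterType : parameterTypes) {
            sb.append(of(parameterType));
        }
        return sb.append(')').append(of(returnType)).toString();
    }

    public static void main(String[] args) {
        System.out.println(ofMethod(long.class, int.class));
        System.out.println(ofMethod(String.class, String.class));
    }
}
```

The second call in main reproduces the descriptor of passingStringAround shown in the listing above.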

Method invocations
Now that we have seen how object references are passed around we may want to see how to actually
use them. What do you do with objects? You invoke methods on them.
This is where things get complicated.
We have 5 different instructions:

• invokedynamic
• invokeinterface
• invokespecial
• invokestatic
• invokevirtual

invokedynamic has to do with the support for dynamic languages and was introduced in version 7 of
the JVM. We are not going to look into it because we would have to introduce a lot of different
concepts, and you can build many interesting things without it.
invokeinterface is used to invoke methods on references having an interface type.
invokespecial is used to invoke superclass methods, private methods, and constructors.
invokestatic is used to invoke static methods.
In the other cases you want to use invokevirtual.
Let's consider this piece of Java code which contains an interface, an abstract class, and a concrete
class.

class A {

    interface MyInterface {
        void foo();
    }

    abstract class MyAbstractClass implements MyInterface {

    }

    class MyConcreteClass implements MyInterface {
        public void foo() {}
    }

    void invoking(MyInterface p0, MyAbstractClass p1, MyConcreteClass p2) {
        p0.foo();
        p1.foo();
        p2.foo();
    }

}

The corresponding class file is:

Constant pool:
   #1 = Methodref #6.#22 // java/lang/Object."<init>":()V
   #2 = InterfaceMethodref #12.#23 // A$MyInterface.foo:()V
   #3 = Methodref #10.#23 // A$MyAbstractClass.foo:()V
   #4 = Methodref #7.#23 // A$MyConcreteClass.foo:()V
   #5 = Class #24 // A
   #6 = Class #25 // java/lang/Object
   #7 = Class #26 // A$MyConcreteClass
   #8 = Utf8 MyConcreteClass
   #9 = Utf8 InnerClasses
  #10 = Class #27 // A$MyAbstractClass
  #11 = Utf8 MyAbstractClass
  #12 = Class #28 // A$MyInterface
  #13 = Utf8 MyInterface
  #14 = Utf8 <init>
  #15 = Utf8 ()V
  #16 = Utf8 Code
  #17 = Utf8 LineNumberTable
  #18 = Utf8 invoking
  #19 = Utf8 (LA$MyInterface;LA$MyAbstractClass;LA$MyConcreteClass;)V
  #20 = Utf8 SourceFile
  #21 = Utf8 a.java
  #22 = NameAndType #14:#15 // "<init>":()V
  #23 = NameAndType #29:#15 // foo:()V
  #24 = Utf8 A
  #25 = Utf8 java/lang/Object
  #26 = Utf8 A$MyConcreteClass
  #27 = Utf8 A$MyAbstractClass
  #28 = Utf8 A$MyInterface
  #29 = Utf8 foo
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0

  void invoking(A$MyInterface, A$MyAbstractClass, A$MyConcreteClass);
    descriptor: (LA$MyInterface;LA$MyAbstractClass;LA$MyConcreteClass;)V
    flags:
    Code:
      stack=1, locals=4, args_size=4
         0: aload_1
         1: invokeinterface #2, 1 // InterfaceMethod A$MyInterface.foo:()V
         6: aload_2
         7: invokevirtual #3 // Method A$MyAbstractClass.foo:()V
        10: aload_3
        11: invokevirtual #4 // Method A$MyConcreteClass.foo:()V
        14: return
      LineNumberTable:
        line 16: 0
        line 17: 6
        line 18: 10
        line 19: 14
}
SourceFile: "a.java"
InnerClasses:
  #8= #7 of #5; //MyConcreteClass=class A$MyConcreteClass of class A
  abstract #11= #10 of #5; //MyAbstractClass=class A$MyAbstractClass of class A
  static #13= #12 of #5; //MyInterface=class A$MyInterface of class A

What is interesting to us are the three invocations of the method foo. The first one is performed on
an interface, so we use invokeinterface. The other two are performed on classes, one abstract and
the other one concrete. In both cases we use invokevirtual.
Let's look into constructors:

class Derived extends Super { }

The class file will contain a default constructor for Derived which will call the default constructor
of Super.

class Derived extends Super
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #3.#10 // Super."<init>":()V
   #2 = Class #11 // Derived
   #3 = Class #12 // Super
   #4 = Utf8 <init>
   #5 = Utf8 ()V
   #6 = Utf8 Code
   #7 = Utf8 LineNumberTable
   #8 = Utf8 SourceFile
   #9 = Utf8 a.java
  #10 = NameAndType #4:#5 // "<init>":()V
  #11 = Utf8 Derived
  #12 = Utf8 Super
{
  Derived();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method Super."<init>":()V
         4: return
      LineNumberTable:
        line 5: 0
}

The invocation of the super constructor is done by using invokespecial.


Let's now see an example in which we call methods of the same class.

class A {

    private void myPrivateInstanceMethod() { }
    public void myPublicInstanceMethod() { }
    private static void myPrivateStaticMethod() { }
    public static void myPublicStaticMethod() { }

    private void myMethodCallingTheOthers() {
        myPrivateStaticMethod();
        myPublicStaticMethod();
        myPrivateInstanceMethod();
        myPrivateInstanceMethod();
    }

}

The corresponding class is:

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #6.#18 // java/lang/Object."<init>":()V
   #2 = Methodref #5.#19 // A.myPrivateStaticMethod:()V
   #3 = Methodref #5.#20 // A.myPublicStaticMethod:()V
   #4 = Methodref #5.#21 // A.myPrivateInstanceMethod:()V
   #5 = Class #22 // A
   #6 = Class #23 // java/lang/Object
   #7 = Utf8 <init>
   #8 = Utf8 ()V
   #9 = Utf8 Code
  #10 = Utf8 LineNumberTable
  #11 = Utf8 myPrivateInstanceMethod
  #12 = Utf8 myPublicInstanceMethod
  #13 = Utf8 myPrivateStaticMethod
  #14 = Utf8 myPublicStaticMethod
  #15 = Utf8 myMethodCallingTheOthers
  #16 = Utf8 SourceFile
  #17 = Utf8 A.java
  #18 = NameAndType #7:#8 // "<init>":()V
  #19 = NameAndType #13:#8 // myPrivateStaticMethod:()V
  #20 = NameAndType #14:#8 // myPublicStaticMethod:()V
  #21 = NameAndType #11:#8 // myPrivateInstanceMethod:()V
  #22 = Utf8 A
  #23 = Utf8 java/lang/Object
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0

  private void myPrivateInstanceMethod();
    descriptor: ()V
    flags: ACC_PRIVATE
    Code:
      stack=0, locals=1, args_size=1
         0: return
      LineNumberTable:
        line 3: 0

  public void myPublicInstanceMethod();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return
      LineNumberTable:
        line 4: 0

  private static void myPrivateStaticMethod();
    descriptor: ()V
    flags: ACC_PRIVATE, ACC_STATIC
    Code:
      stack=0, locals=0, args_size=0
         0: return
      LineNumberTable:
        line 5: 0

  public static void myPublicStaticMethod();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=0, locals=0, args_size=0
         0: return
      LineNumberTable:
        line 6: 0

  private void myMethodCallingTheOthers();
    descriptor: ()V
    flags: ACC_PRIVATE
    Code:
      stack=1, locals=1, args_size=1
         0: invokestatic #2 // Method myPrivateStaticMethod:()V
         3: invokestatic #3 // Method myPublicStaticMethod:()V
         6: aload_0
         7: invokespecial #4 // Method myPrivateInstanceMethod:()V
        10: aload_0
        11: invokespecial #4 // Method myPrivateInstanceMethod:()V
        14: return
      LineNumberTable:
        line 9: 0
        line 10: 3
        line 11: 6
        line 12: 10
        line 13: 14
}

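We can see that the static methods are invoked by using invokestatic. The private instance method
is instead invoked by using invokespecial (the example happens to call myPrivateInstanceMethod
twice and never calls myPublicInstanceMethod). A call to a public instance method would be
compiled to invokevirtual, as we have seen before.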
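
The rules seen in this section can be summarized with a small decision function. This is a sketch under simplifying assumptions: InvokeSelector is a hypothetical helper, the properties of the call are passed as plain booleans, and invokedynamic is left out. It mirrors the output shown above (class file version 52); newer compilers may make a different choice for private methods.

```java
// Hypothetical helper, written only for illustration: it selects the
// invocation instruction from a few properties of the call.
final class InvokeSelector {

    static String instructionFor(boolean isStatic,
                                 boolean isConstructor,
                                 boolean isPrivate,
                                 boolean isSuperCall,
                                 boolean ownerIsInterface) {
        if (isStatic) return "invokestatic";
        // Constructors, private methods, and super calls are not subject
        // to the usual dynamic dispatch.
        if (isConstructor || isPrivate || isSuperCall) return "invokespecial";
        if (ownerIsInterface) return "invokeinterface";
        return "invokevirtual";
    }

    public static void main(String[] args) {
        // p0.foo() where p0 has the interface type MyInterface
        System.out.println(instructionFor(false, false, false, false, true));
        // p2.foo() where p2 has the class type MyConcreteClass
        System.out.println(instructionFor(false, false, false, false, false));
        // myPrivateStaticMethod()
        System.out.println(instructionFor(true, false, true, false, false));
        // myPrivateInstanceMethod()
        System.out.println(instructionFor(false, false, true, false, false));
    }
}
```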

Working with fields


The other operations we can do on objects are related to fields. We can read and write them.

class A {
    String name;

    A(String name) {
        this.name = name;
    }

    String getName() {
        return this.name;
    }
}

The compiled class looks like this:

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #4.#15 // java/lang/Object."<init>":()V
   #2 = Fieldref #3.#16 // A.name:Ljava/lang/String;
   #3 = Class #17 // A
   #4 = Class #18 // java/lang/Object
   #5 = Utf8 name
   #6 = Utf8 Ljava/lang/String;
   #7 = Utf8 <init>
   #8 = Utf8 (Ljava/lang/String;)V
   #9 = Utf8 Code
  #10 = Utf8 LineNumberTable
  #11 = Utf8 getName
  #12 = Utf8 ()Ljava/lang/String;
  #13 = Utf8 SourceFile
  #14 = Utf8 A.java
  #15 = NameAndType #7:#19 // "<init>":()V
  #16 = NameAndType #5:#6 // name:Ljava/lang/String;
  #17 = Utf8 A
  #18 = Utf8 java/lang/Object
  #19 = Utf8 ()V
{
  java.lang.String name;
    descriptor: Ljava/lang/String;
    flags:

  A(java.lang.String);
    descriptor: (Ljava/lang/String;)V
    flags:
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield #2 // Field name:Ljava/lang/String;
         9: return
      LineNumberTable:
        line 4: 0
        line 5: 4
        line 6: 9

  java.lang.String getName();
    descriptor: ()Ljava/lang/String;
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield #2 // Field name:Ljava/lang/String;
         4: areturn
      LineNumberTable:
        line 9: 0
}

We can see the two instructions we use:

• putfield is used in the constructor to set the field
• getfield is used in getName to read the field

In both cases we specify the index of a field reference (a Fieldref entry), which is contained in the
constant pool. The field reference specifies the class, the name of the field, and its type.

Object creation
To work with objects we need to be able to instantiate them. Let's see how.

A instance() {
    return new A();
}

The class:

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #4.#13 // java/lang/Object."<init>":()V
   #2 = Class #14 // A
   #3 = Methodref #2.#13 // A."<init>":()V
   #4 = Class #15 // java/lang/Object
   #5 = Utf8 <init>
   #6 = Utf8 ()V
   #7 = Utf8 Code
   #8 = Utf8 LineNumberTable
   #9 = Utf8 instance
  #10 = Utf8 ()LA;
  #11 = Utf8 SourceFile
  #12 = Utf8 A.java
  #13 = NameAndType #5:#6 // "<init>":()V
  #14 = Utf8 A
  #15 = Utf8 java/lang/Object
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0

  A instance();
    descriptor: ()LA;
    flags:
    Code:
      stack=2, locals=1, args_size=1
         0: new #2 // class A
         3: dup
         4: invokespecial #3 // Method "<init>":()V
         7: areturn
      LineNumberTable:
        line 4: 0
}

Here we first use the special instruction new to allocate the object. Once we have allocated it we
need to call the corresponding constructor.
You may wonder why we have the dup instruction here. This instruction takes the value on top
of the stack and duplicates it, so that two copies of the same value are placed on top of the stack.
We need two references to the instance of A because the first one is consumed by the invocation
of the constructor (which returns nothing) and the second one is needed by areturn.
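
To see why dup is needed we can simulate the operand stack by hand. The following Java sketch is only a model: DupDemo is a hypothetical class, and a plain Object stands in for the reference produced by new. Without dup the stack is empty when areturn tries to pop the value to return.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: a hand-written model of the operand stack while
// executing new / dup / invokespecial / areturn.
final class DupDemo {

    // Returns the reference that areturn would give back to the caller.
    static Object run(boolean withDup) {
        Deque<Object> stack = new ArrayDeque<>();
        stack.push(new Object());      // new: pushes a reference to the object
        if (withDup) {
            stack.push(stack.peek());  // dup: pushes a second copy of the reference
        }
        stack.pop();                   // invokespecial <init>: consumes one reference
        return stack.pop();            // areturn: needs a reference on the stack
    }

    public static void main(String[] args) {
        System.out.println("with dup: " + (run(true) != null));
        try {
            run(false);
        } catch (java.util.NoSuchElementException e) {
            System.out.println("without dup: nothing left to return");
        }
    }
}
```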

Comparison
Until now we have just seen how to execute a list of instructions, without conditions. However real
code has if statements and loops. We do not execute a list of statements from the beginning to the
end: we jump.
Let's look at a simple example with an if:

void choice(boolean flag) {
    if (flag) {
        System.out.println("Flag is set!");
    }
}

And as always let's look at the corresponding class:

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #6.#16 // java/lang/Object."<init>":()V
   #2 = Fieldref #17.#18 // java/lang/System.out:Ljava/io/PrintStream;
   #3 = String #19 // Flag is set!
   #4 = Methodref #20.#21 // java/io/PrintStream.println:(Ljava/lang/String;)V
   #5 = Class #22 // A
   #6 = Class #23 // java/lang/Object
   #7 = Utf8 <init>
   #8 = Utf8 ()V
   #9 = Utf8 Code
  #10 = Utf8 LineNumberTable
  #11 = Utf8 choice
  #12 = Utf8 (Z)V
  #13 = Utf8 StackMapTable
  #14 = Utf8 SourceFile
  #15 = Utf8 A.java
  #16 = NameAndType #7:#8 // "<init>":()V
  #17 = Class #24 // java/lang/System
  #18 = NameAndType #25:#26 // out:Ljava/io/PrintStream;
  #19 = Utf8 Flag is set!
  #20 = Class #27 // java/io/PrintStream
  #21 = NameAndType #28:#29 // println:(Ljava/lang/String;)V
  #22 = Utf8 A
  #23 = Utf8 java/lang/Object
  #24 = Utf8 java/lang/System
  #25 = Utf8 out
  #26 = Utf8 Ljava/io/PrintStream;
  #27 = Utf8 java/io/PrintStream
  #28 = Utf8 println
  #29 = Utf8 (Ljava/lang/String;)V
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0

  void choice(boolean);
    descriptor: (Z)V
    flags:
    Code:
      stack=2, locals=2, args_size=2
         0: iload_1
         1: ifeq 12
         4: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
         7: ldc #3 // String Flag is set!
         9: invokevirtual #4 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        12: return
      LineNumberTable:
        line 4: 0
        line 5: 4
        line 7: 12
      StackMapTable: number_of_entries = 1
        frame_type = 12 /* same */
}

What is interesting in this case is the ifeq instruction. It has one parameter, that in this case has the
value 12. The parameter indicates the position at which to jump.
How does it work? We first put on the stack the content of the local variables table entry with index
1. It will be the parameter named flag. ifeq performs the jump if the value on top of the stack is
equal to zero. Now, the boolean value false is represented by zero, so we jump if the flag is set to
false.

Where do we jump to? We jump to the implicit return instruction at the very end of the method.
If we do not jump (because flag is true) we just keep executing the following instructions, which
correspond to the statement System.out.println("Flag is set!");.
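
The control flow just described can be imitated with a tiny hand-written interpreter. This Java sketch is a model, not real bytecode execution: IfeqDemo is a hypothetical class, the program counter takes the same offsets as in the listing above, and the three instructions that print the message are collapsed into one step that records the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: a hand-written model of how choice(boolean) executes.
final class IfeqDemo {

    // Returns what the method would print, so the result is easy to check.
    static String choice(boolean flag) {
        StringBuilder output = new StringBuilder();
        Deque<Integer> stack = new ArrayDeque<>();
        // Entry 0 would hold `this`; entry 1 holds flag (false is 0, true is 1).
        int[] locals = {0, flag ? 1 : 0};
        int pc = 0;
        while (true) {
            switch (pc) {
                case 0:   // iload_1
                    stack.push(locals[1]);
                    pc = 1;
                    break;
                case 1:   // ifeq 12: jump if the value on top of the stack is zero
                    pc = (stack.pop() == 0) ? 12 : 4;
                    break;
                case 4:   // getstatic, ldc, invokevirtual collapsed into one step
                    output.append("Flag is set!");
                    pc = 12;
                    break;
                case 12:  // return
                    return output.toString();
                default:
                    throw new IllegalStateException("Unexpected offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("flag=true  -> \"" + choice(true) + "\"");
        System.out.println("flag=false -> \"" + choice(false) + "\"");
    }
}
```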
Another typical condition is checking if a reference is null:

void choice(Object obj) {
    if (obj != null) {
        System.out.println("Obj is not null!");
    }
}

Let's see what it is translated to:

class A
  minor version: 0
  major version: 52
  flags: ACC_SUPER
Constant pool:
   #1 = Methodref #6.#16 // java/lang/Object."<init>":()V
   #2 = Fieldref #17.#18 // java/lang/System.out:Ljava/io/PrintStream;
   #3 = String #19 // Obj is not null!
   #4 = Methodref #20.#21 // java/io/PrintStream.println:(Ljava/lang/String;)V
   #5 = Class #22 // A
   #6 = Class #23 // java/lang/Object
   #7 = Utf8 <init>
   #8 = Utf8 ()V
   #9 = Utf8 Code
  #10 = Utf8 LineNumberTable
  #11 = Utf8 choice
  #12 = Utf8 (Ljava/lang/Object;)V
  #13 = Utf8 StackMapTable
  #14 = Utf8 SourceFile
  #15 = Utf8 A.java
  #16 = NameAndType #7:#8 // "<init>":()V
  #17 = Class #24 // java/lang/System
  #18 = NameAndType #25:#26 // out:Ljava/io/PrintStream;
  #19 = Utf8 Obj is not null!
  #20 = Class #27 // java/io/PrintStream
  #21 = NameAndType #28:#29 // println:(Ljava/lang/String;)V
  #22 = Utf8 A
  #23 = Utf8 java/lang/Object
  #24 = Utf8 java/lang/System
  #25 = Utf8 out
  #26 = Utf8 Ljava/io/PrintStream;
  #27 = Utf8 java/io/PrintStream
  #28 = Utf8 println
  #29 = Utf8 (Ljava/lang/String;)V
{
  A();
    descriptor: ()V
    flags:
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1 // Method java/lang/Object."<init>":()V
         4: return
      LineNumberTable:
        line 1: 0

  void choice(java.lang.Object);
    descriptor: (Ljava/lang/Object;)V
    flags:
    Code:
      stack=2, locals=2, args_size=2
         0: aload_1
         1: ifnull 12
         4: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
         7: ldc #3 // String Obj is not null!
         9: invokevirtual #4 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        12: return
      LineNumberTable:
        line 4: 0
        line 5: 4
        line 7: 12
      StackMapTable: number_of_entries = 1
        frame_type = 12 /* same */
}

Here the structure is very similar, we just have a different kind of jump. This time we use ifnull.

Code
For writing our JVM compilers we are going to use ASM (http://asm.ow2.org). ASM is a library that
can produce bytecode and class files. On one hand this library is extremely useful because it handles
all the bookkeeping involved in generating the bytecode while giving access to the low level
structures present in the class file. On the other hand the documentation is extremely outdated and
poor. All in all, it is worth going through the difficulties of learning how to use ASM to build your
own compiler.

MiniCalcFun
We are going to build a JVM compiler that, given a source file written in MiniCalcFun, will produce
a class file.

General structure

Let's start from the entry point of our compiler. We expect the name of a source file to be
specified as the first and only parameter. We open the file, read the code, and try to build an
AST. We check for lexical and syntactical errors. If there are none we validate the AST
and check for semantic errors. If there are no semantic errors we go on with the class file
generation. If instead errors are found we show them to the user and terminate.

import java.io.File

fun main(args: Array<String>) {
    if (args.size != 1) {
        System.err.println("Exactly one argument expected")
        return
    }
    val sourceFile = File(args[0])
    if (!sourceFile.exists()) {
        System.err.println("Given file does not exist")
        return
    }
    // Parse and validate the source, collecting all the errors
    val res = MiniCalcParserFacade.parse(sourceFile)
    if (res.isCorrect()) {
        val miniCalcFile = res.root!!
        val className = "minicalc.${sourceFile.nameWithoutExtension}"
        val bytes = JvmCompiler().compile(miniCalcFile, className)
        val outputFile = File("${sourceFile.nameWithoutExtension}.class")
        outputFile.writeBytes(bytes)
    } else {
        System.err.println("${res.errors.size} error(s) found\n")
        res.errors.forEach { System.err.println(it) }
    }
}

We are reusing the code that builds the AST and validates it. The only line containing new code is
this one:

val bytes = JvmCompiler().compile(miniCalcFile, className)

This is a simple invocation of this method:

class JvmCompiler {

    fun compile(ast: MiniCalcFile, className: String) =
        Compilation(ast, className).compile()

}

Here we take an AST and a name to assign to the class to generate. We use them to instantiate
Compilation. Why do we do that? Because Compilation will be used to track different pieces of
temporary data we need while producing the class file.
Before going to examine the Compilation class we will look at some utilities we will need.

Internal names and JVM Type descriptions

When looking at how the JVM works we have seen that internally it uses type descriptions and
internal names for declared types (classes, interfaces, enums, and annotations). We have seen that
there are type descriptions for all primitive types, for arrays, and for declared types. For example
the type description for int is I, for an array of arrays of int it is [[I, and for the class String it is
Ljava/lang/String;. Internal names are instead defined only for declared types. The internal
name of String is java/lang/String, the one of File is java/io/File.
When compiling our code we will translate the types present in our language to types for the JVM.
In particular we have three types in MiniCalcFun:

• Int: we translate it to the primitive JVM type int
• Decimal: we translate it to the primitive JVM type double
• String: we translate it to the corresponding JVM class java.lang.String

In general it is useful to have functions to get the internal names and type descriptions of
the different classes. In our simple compiler we will refer to String but also to Object.
Given a canonical name (like java.lang.Object) we can obtain an internal name or a
type description like this:

fun canonicalNameToInternalName(canonicalName: String) =
    canonicalName.replace(".", "/")

fun canonicalNameToJvmDescription(canonicalName: String) =
    "L${canonicalNameToInternalName(canonicalName)};"

If we have instances of Class we can use these extension methods:

fun Class<*>.jvmDescription() = canonicalNameToJvmDescription(this.canonicalName)

fun Class<*>.internalName() = canonicalNameToInternalName(this.canonicalName)
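
One caveat, which does not affect String or Object: for nested classes the canonical name and the internal name do not differ only by the separator. The internal name uses $ between the outer and the nested class, as in A$MyInterface in the constant pools we have seen. The following Java sketch compares the two approaches; InternalNames and its methods are hypothetical names used only for this illustration, and the sketch is meant for classes and interfaces only, not for arrays or primitive types.

```java
import java.util.Map;

// Illustration only: two ways of deriving an internal name from a Class.
final class InternalNames {

    // Based on the binary name: nested classes keep their $ separator.
    static String fromBinaryName(Class<?> type) {
        return type.getName().replace('.', '/');
    }

    // Based on the canonical name, like the functions shown above.
    static String fromCanonicalName(Class<?> type) {
        return type.getCanonicalName().replace('.', '/');
    }

    public static void main(String[] args) {
        // For top-level classes the two results are the same
        System.out.println(fromBinaryName(String.class));
        System.out.println(fromCanonicalName(String.class));
        // For nested types they differ
        System.out.println(fromBinaryName(Map.Entry.class));
        System.out.println(fromCanonicalName(Map.Entry.class));
    }
}
```

For Map.Entry the first approach gives java/util/Map$Entry, while the second gives java/util/Map/Entry, which is not a valid internal name for that interface.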

We will use these extension methods in our extension method for Type. This method gives us the
type description for any of the three types we support in MiniCalcFun:

fun Type.jvmDescription() =
    when (this) {
        is IntType -> "I"
        is DecimalType -> "D"
        is StringType -> String::class.java.jvmDescription()
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }

Type specific operations

Now that we have started looking into types we can look at other operations that depend on the
type.
We have four of them:

• localVarTableSize: when invoked on a type, it returns the number of entries needed for an
  element of that type in the local variables table
• loadOp: we have seen that there are different operations to load a value from the local variables
  table onto the stack, depending on its type. For example, for int we should use ILOAD, while
  for double we should use DLOAD. This method gives us the right opcode to use with a given
  type
• storeOp: similarly to loadOp, given a type it returns the opcode to use to store a value of that
  type into the local variables table
• returnOp: similarly to loadOp and storeOp, given a type it returns the opcode to use to return
  a value of that type

These methods are very simple; maybe we could have used maps instead of writing them.
Anyway they will be useful to abstract some of the nitty-gritty details necessary when writing the
compiler.

import org.objectweb.asm.Opcodes.*

// We have seen that all types but long and double take one entry in the local
// variables table. Here we have a type (DecimalType) that is translated into
// the JVM type double, so it takes two entries, while the other types take
// just one
fun Type.localVarTableSize() =
    when (this) {
        is IntType -> 1
        is DecimalType -> 2
        is StringType -> 1
        else -> throw UnsupportedOperationException(
                this.javaClass.canonicalName)
    }

// Opcode to load a value of this type from the local variables table
fun Type.loadOp() =
    when (this) {
        is IntType -> ILOAD
        is DecimalType -> DLOAD
        is StringType -> ALOAD
        else -> throw UnsupportedOperationException(
                this.javaClass.canonicalName)
    }

// Opcode to store a value of this type into the local variables table
fun Type.storeOp() =
    when (this) {
        is IntType -> ISTORE
        is DecimalType -> DSTORE
        is StringType -> ASTORE
        else -> throw UnsupportedOperationException(
                this.javaClass.canonicalName)
    }

// Opcode to return a value of this type
fun Type.returnOp() =
    when (this) {
        is IntType -> IRETURN
        is DecimalType -> DRETURN
        is StringType -> ARETURN
        else -> throw UnsupportedOperationException(
                this.javaClass.canonicalName)
    }
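
The size of a type matters because it determines the index of whatever comes next in the local variables table. As a minimal sketch in Java, assuming a hypothetical helper called LocalSlots (it is not part of the compiler presented here), this is how the indexes of the parameters of a method could be computed:

```java
import java.util.Arrays;

// Hypothetical helper, written only for illustration: it assigns an index
// of the local variables table to each parameter of a method.
final class LocalSlots {

    // long and double take two entries, every other type takes one.
    static int size(Class<?> type) {
        return (type == long.class || type == double.class) ? 2 : 1;
    }

    static int[] parameterIndexes(boolean isStatic, Class<?>... parameterTypes) {
        // In instance methods entry 0 holds the reference to this.
        int next = isStatic ? 0 : 1;
        int[] indexes = new int[parameterTypes.length];
        for (int i = 0; i < parameterTypes.length; i++) {
            indexes[i] = next;
            next += size(parameterTypes[i]);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // sumLongs(long, long): the parameters are loaded with lload_1 and lload_3
        System.out.println(Arrays.toString(parameterIndexes(false, long.class, long.class)));
        // sumByteAndLong(byte, long): iload_1 and lload_2
        System.out.println(Arrays.toString(parameterIndexes(false, byte.class, long.class)));
    }
}
```

The two calls in main reproduce the indexes we have seen in the bytecode of sumLongs and sumByteAndLong.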

Pushing values

Now we are going to see how we deal with expressions. The typical thing you want to do with an
expression is to evaluate it. What does that mean from the point of view of the compiler? It means
executing a sequence of instructions and, at the end, having the result of the expression at the top
of the stack.
This is how we evaluate all the expressions. Note that we are referring to some classes we have not
yet seen (MethodVisitor, CompilationContext), so not everything will be clear right now, but let's
start by focusing on the general structure.

    private fun Expression.push(methodVisitor: MethodVisitor,
                                context: CompilationContext) {
        when (this) {
            is IntLit -> methodVisitor.visitLdcInsn(
                    Integer.parseInt(this.value))
            is DecLit -> methodVisitor.visitLdcInsn(
                    java.lang.Double.parseDouble(this.value))
            is StringLit -> {
                if (this.parts.isEmpty()) {
                    methodVisitor.visitLdcInsn("")
                } else {
                    val part = this.parts.first()
                    when (part) {
                        is ConstantStringLitPart -> methodVisitor.visitLdcInsn(
                                part.content)
                        is ExpressionStringLItPart -> part.expression.pushAsString(
                                methodVisitor, context)
                    }
                    if (this.parts.size > 1) {
                        StringLit(this.parts.subList(1, this.parts.size))
                                .push(methodVisitor, context)
                        methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
                                "java/lang/String", "concat",
                                "(${String::class.java.jvmDescription()})"
                                + "${String::class.java.jvmDescription()}",
                                false)
                    }
                }
            }
            is ValueReference -> methodVisitor.visitVarInsn(this.type().loadOp(),
                    context.localSymbols[this.ref.referred!!]!!.index)
            is SumExpression -> {
                val lt = this.left.type()
                val rt = this.right.type()
                if (lt is StringType) {
                    this.left.pushAsString(methodVisitor, context)
                    this.right.pushAsString(methodVisitor, context)
                    methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
                            "java/lang/String",
                            "concat",
                            "(${String::class.java.jvmDescription()})"
                            + "${String::class.java.jvmDescription()}", false)
                } else if (lt is IntType && rt is IntType) {
                    this.left.pushAsInt(methodVisitor, context)
                    this.right.pushAsInt(methodVisitor, context)
                    methodVisitor.visitInsn(IADD)
                } else if (lt is NumberType && rt is NumberType) {
                    this.left.pushAsDouble(methodVisitor, context)
                    this.right.pushAsDouble(methodVisitor, context)
                    methodVisitor.visitInsn(DADD)
                } else {
                    throw UnsupportedOperationException(lt.toString()
                            + " from evaluating " + this.left)
                }
            }
            is SubtractionExpression -> {
                val lt = this.left.type()
                val rt = this.right.type()
                if (lt is IntType && rt is IntType) {
                    this.left.pushAsInt(methodVisitor, context)
                    this.right.pushAsInt(methodVisitor, context)
                    methodVisitor.visitInsn(ISUB)
                } else if (lt is NumberType && rt is NumberType) {
                    this.left.pushAsDouble(methodVisitor, context)
                    this.right.pushAsDouble(methodVisitor, context)
                    methodVisitor.visitInsn(DSUB)
                } else {
                    throw UnsupportedOperationException(lt.toString()
                            + " from evaluating " + this.left)
                }
            }
            is MultiplicationExpression -> {
                val lt = this.left.type()
                val rt = this.right.type()
                if (lt is IntType && rt is IntType) {
                    this.left.pushAsInt(methodVisitor, context)
                    this.right.pushAsInt(methodVisitor, context)
                    methodVisitor.visitInsn(IMUL)
                } else if (lt is NumberType && rt is NumberType) {
                    this.left.pushAsDouble(methodVisitor, context)
                    this.right.pushAsDouble(methodVisitor, context)
                    methodVisitor.visitInsn(DMUL)
                } else {
                    throw UnsupportedOperationException(lt.toString()
                            + " from evaluating " + this.left)
                }
            }
            is DivisionExpression -> {
                val lt = this.left.type()
                val rt = this.right.type()
                if (lt is IntType && rt is IntType) {
                    this.left.pushAsInt(methodVisitor, context)
                    this.right.pushAsInt(methodVisitor, context)
                    methodVisitor.visitInsn(IDIV)
                } else if (lt is NumberType && rt is NumberType) {
                    this.left.pushAsDouble(methodVisitor, context)
                    this.right.pushAsDouble(methodVisitor, context)
                    methodVisitor.visitInsn(DDIV)
                } else {
                    throw UnsupportedOperationException(lt.toString()
                            + " from evaluating " + this.left)
                }
            }
            is FunctionCall -> {
                val functionCode =
                        context.compilation.functions[this.function.referred!!]!!
                var index = 0
                // we push this
                methodVisitor.visitVarInsn(ALOAD, index)
                // we push all the parameters we received and we need to pass along
                index = 1
                functionCode.surroundingValues.forEach {
                    val type = it.type()
                    methodVisitor.visitVarInsn(type.loadOp(), index)
                    index += type.localVarTableSize()
                }
                // we push all the parameters specified in the call
                this.params.forEach { it.push(methodVisitor, context) }
                // we invoke the method
                methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
                        context.compilation.className,
                        functionCode.methodName,
                        functionCode.signature, false)
            }
            else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
        }
    }

In the following sub-sections we are going to examine the different portions of this method.

Literals

Let's start with some simple cases. How do we evaluate integer and decimal literals?

    is IntLit -> methodVisitor.visitLdcInsn(Integer.parseInt(this.value))
    is DecLit -> methodVisitor.visitLdcInsn(java.lang.Double.parseDouble(this.value))

In this case all we have to do is push a constant onto the stack. ASM creates an entry in the
constant pool to hold the value and generates an instruction referring to that entry. The JVM also
has shorter instructions that embed small values directly (such as iconst_1 or bipush), but we do
not need them here. These little details are abstracted away by ASM: we just invoke visitLdcInsn.
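
As an aside, the JVM offers several instructions for pushing an int constant, and the shortest usable one depends on the value. The following sketch shows the selection rule a code generator could apply. The function name intPushInstruction is an assumption made for this illustration: it is not part of ASM or of the MiniCalcFun compiler, and instructions are represented as plain text.

```kotlin
// Hypothetical helper: returns the textual form of the shortest JVM
// instruction able to push the given int constant onto the stack.
fun intPushInstruction(value: Int): String = when (value) {
    -1 -> "iconst_m1"                  // dedicated one-byte instruction
    in 0..5 -> "iconst_$value"         // dedicated one-byte instructions
    in -128..127 -> "bipush $value"    // value embedded as one byte
    in -32768..32767 -> "sipush $value" // value embedded as two bytes
    else -> "ldc $value"               // value stored in the constant pool
}
```

With this rule the constant 3 is pushed with iconst_3, 100 with bipush, 1000 with sipush, and 100000 requires a constant pool entry loaded with ldc.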
String literals are more complex because MiniCalcFun supports interpolated strings. This means that
we can embed expressions in string literals, like this:

    var myString = "area = #{42 * height}"

String literals in MiniCalcFun are composed of parts that can be either constant strings or
embedded expressions.
How do we translate this?
We consider three different cases:

- we have zero elements in the string literal
- we have exactly one element in the string literal
- we have two or more elements in the string literal

If we have zero elements we just push an empty string onto the stack (methodVisitor.visitLdcInsn("")).
If we have one or more elements we evaluate the first element. If it is a constant string we just push it.
If it is an expression we instead evaluate it and convert it to a string using the method pushAsString,
which we will see later. This means that evaluating 3 * 4 will not produce the integer
12 but the string "12". In this way every single part of the interpolated string
produces a string.
If we have more than one element, at this point we have evaluated only the first one. To evaluate the
remaining ones we create a temporary StringLit with all the parts from the second one to the last
one (all but the first part, which we have already evaluated). We then make a recursive call to push.
At this point we have two strings on top of the stack: the first one representing the first part,
the second one representing the concatenation of all the other parts. Now we just call the method
String.concat(String), which merges the two elements into a single string. It uses the first
element as the this value and the second one as the parameter of the concat method.

    is StringLit -> {
        if (this.parts.isEmpty()) {
            methodVisitor.visitLdcInsn("")
        } else {
            val part = this.parts.first()
            when (part) {
                is ConstantStringLitPart -> methodVisitor.visitLdcInsn(part.content)
                is ExpressionStringLItPart -> part.expression.pushAsString(
                        methodVisitor, context)
            }
            if (this.parts.size > 1) {
                StringLit(this.parts.subList(1, this.parts.size))
                        .push(methodVisitor, context)
                methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
                        "java/lang/String", "concat",
                        "(${String::class.java.jvmDescription()})"
                        + "${String::class.java.jvmDescription()}", false)
            }
        }
    }

This is what happens when we evaluate Area=#{7 * 6}
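
The evolution of the stack for this literal can also be followed with a small simulation. This is a sketch under simplifying assumptions: the Part, Constant and Product classes and the pushStringLit function are hypothetical stand-ins for the AST classes and for push, and a Kotlin ArrayDeque plays the role of the JVM operand stack.

```kotlin
// Hypothetical, simplified model of the parts of an interpolated string.
sealed class Part
data class Constant(val content: String) : Part()
// Models an embedded expression of the form left * right.
data class Product(val left: Int, val right: Int) : Part()

// Mirrors the structure of the StringLit case: push the first part as a
// string, recurse on the remaining parts, then concatenate the two strings.
fun pushStringLit(parts: List<Part>, stack: ArrayDeque<String>,
                  trace: MutableList<String>) {
    if (parts.isEmpty()) {
        stack.addLast("")
        trace += "ldc \"\""
        return
    }
    when (val part = parts.first()) {
        is Constant -> {
            stack.addLast(part.content)
            trace += "ldc \"${part.content}\""
        }
        is Product -> {
            // Stands for pushAsString: evaluate, then convert to a string.
            stack.addLast((part.left * part.right).toString())
            trace += "push ${part.left} * ${part.right} as a string"
        }
    }
    if (parts.size > 1) {
        pushStringLit(parts.subList(1, parts.size), stack, trace)
        val argument = stack.removeLast()  // result of the remaining parts
        val receiver = stack.removeLast()  // first part, the "this" of concat
        stack.addLast(receiver + argument)
        trace += "invokevirtual String.concat"
    }
}
```

For the parts Constant("Area=") and Product(7, 6), the stack first holds "Area=", then "Area=" and "42", and finally the single string "Area=42".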



Value reference

When we have a reference to an input, a variable or a parameter we just need to find its value and
push it onto the stack:

    is ValueReference -> methodVisitor.visitVarInsn(
            this.type().loadOp(), context.localSymbols[this.ref.referred!!]!!.index)

The only question is: where do we find the value?
The answer is: in the local variables table. We will build our code so that all the inputs, variables, and
parameters we can refer to are always present in the local variables table. So we just need to get
the right index into that table and produce the correct load operation. We have seen that the actual
load operation to be used depends on the type of the value to push onto the stack, for example iload
for integers and dload for doubles. To find the index we instead use a map named localSymbols.
More on this later.
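
The dependency between the type of a value and the load instruction can be sketched as follows. The ValueType enum and the loadFrom function are assumptions made for this example; in the compiler the same information is obtained through loadOp(), and instructions are shown here as text.

```kotlin
// Hypothetical model of the MiniCalcFun types and of the JVM instruction
// used to load a value of each type from the local variables table.
enum class ValueType(val loadInstruction: String) {
    INT("iload"),      // int values
    DECIMAL("dload"),  // double values
    STRING("aload")    // references
}

// Textual form of the instruction loading the local variable at an index.
fun loadFrom(type: ValueType, index: Int): String =
        "${type.loadInstruction} $index"
```

For example, an Int stored at index 1 is loaded with iload 1, while a Decimal stored at index 2 is loaded with dload 2.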

Binary operations

Let's start by looking at how subtraction is implemented:

    is SubtractionExpression -> {
        val lt = this.left.type()
        val rt = this.right.type()
        if (lt is IntType && rt is IntType) {
            // we know both operands are already ints, so we could just use
            // push instead of pushAsInt
            this.left.pushAsInt(methodVisitor, context)
            this.right.pushAsInt(methodVisitor, context)
            methodVisitor.visitInsn(ISUB)
        } else if (lt is NumberType && rt is NumberType) {
            // at least one operand is a decimal: pushAsDouble converts the
            // other one if needed
            this.left.pushAsDouble(methodVisitor, context)
            this.right.pushAsDouble(methodVisitor, context)
            methodVisitor.visitInsn(DSUB)
        } else {
            throw UnsupportedOperationException(lt.toString()
                    + " from evaluating " + this.left)
        }
    }

In practice we start by looking at the types of the operands. If they are both integers we just push
both of them onto the stack. Once the values are on the stack we emit the instruction ISUB to subtract
them. If at least one of them is a decimal we need to convert both values to decimal by using
pushAsDouble and then emit DSUB.
Multiplication and division work in exactly the same way; they just use different opcodes: IMUL, DMUL,
IDIV, and DDIV.
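
The selection logic shared by these operations can be summarized in a small sketch. It is not the compiler's code: the NumType enum and the arithmeticInstructions function are hypothetical, instructions are represented by their names, and "push left" and "push right" stand for whatever code evaluates the two operands.

```kotlin
// Hypothetical model of the two numeric types of MiniCalcFun.
enum class NumType { INT, DECIMAL }

// Names of the instructions emitted for a binary arithmetic operation.
// opName is the opcode suffix: "add", "sub", "mul" or "div".
fun arithmeticInstructions(left: NumType, right: NumType,
                           opName: String): List<String> {
    val instructions = mutableListOf<String>()
    val useInts = left == NumType.INT && right == NumType.INT
    for ((side, type) in listOf("left" to left, "right" to right)) {
        instructions += "push $side"
        // An int operand of a decimal operation is widened with i2d.
        if (!useInts && type == NumType.INT) instructions += "i2d"
    }
    // Integer operations use the i prefix, decimal ones the d prefix.
    instructions += (if (useInts) "i" else "d") + opName
    return instructions
}
```

Subtracting two Int values therefore ends with isub, while multiplying an Int by a Decimal inserts an i2d after the left operand and ends with dmul.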

Addition is instead more complex because we also handle the case in which we are summing strings.
While this operation uses the plus sign, it is not a real addition but a string concatenation. In that
case we use the concat method that we have seen when we looked at interpolated strings.

    is SumExpression -> {
        val lt = this.left.type()
        val rt = this.right.type()
        if (lt is StringType) {
            this.left.pushAsString(methodVisitor, context)
            this.right.pushAsString(methodVisitor, context)
            methodVisitor.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String",
                    "concat",
                    "(${String::class.java.jvmDescription()})"
                    + "${String::class.java.jvmDescription()}",
                    false)
        } else if (lt is IntType && rt is IntType) {
            this.left.pushAsInt(methodVisitor, context)
            this.right.pushAsInt(methodVisitor, context)
            methodVisitor.visitInsn(IADD)
        // NumberType is a common ancestor for IntType and DecimalType
        // if both are NumberType and the previous condition
        // was not satisfied it means at least one is a Decimal
        // and the other is either a Decimal or an Int
        } else if (lt is NumberType && rt is NumberType) {
            this.left.pushAsDouble(methodVisitor, context)
            this.right.pushAsDouble(methodVisitor, context)
            methodVisitor.visitInsn(DADD)
        } else {
            throw UnsupportedOperationException(lt.toString()
                    + " from evaluating " + this.left)
        }
    }

Function call

Function calls are considerably more complex than other expressions.

    is FunctionCall -> {
        val functionCode = context.compilation.functions[this.function.referred!!]!!
        var index = 0
        // we push this
        methodVisitor.visitVarInsn(ALOAD, index)
        // we push all the parameters we received and we need to pass along
        index = 1
        functionCode.surroundingValues.forEach {
            val type = it.type()
            methodVisitor.visitVarInsn(type.loadOp(), index)
            index += type.localVarTableSize()
        }
        // we push all the parameters specified in the call
        this.params.forEach { it.push(methodVisitor, context) }
        // we invoke the method
        methodVisitor.visitMethodInsn(INVOKEVIRTUAL, context.compilation.className,
                functionCode.methodName, functionCode.signature, false)
    }

To understand how the function call works you need to know how we will compile each function.
We will see it later in detail, but the idea is that each function in MiniCalcFun is compiled as a
JVM method. This method has as many parameters as the values which are visible to the function.
Consider this function:

    input Int i
    var globalVar = 0

    fun f(Int p0) Int {
        i * globalVar * p0
    }

The function f needs to access not only its own parameter p0 but also the inputs and global variables.
For this reason we will generate a JVM method named fun_f which takes three parameters. In this
way, when we call it, we are able to pass all the necessary values to it.
MiniCalcFun also supports nested functions, as in this example:

    input Int i
    var globalVar = 0

    fun f(Int p0) Int {
        fun g(Int p1) Int {
            p1 * 2
        }
        i * globalVar * g(p0)
    }

In this case g will be compiled to a JVM method taking 4 parameters: one for the input (i), one for the
global variable (globalVar), one for the parameter of the wrapping function f (p0) and one for its
own parameter (p1).
Consider also this case:

    input Int i
    var globalVar = 0

    fun f(Int p0) Int {
        fun g(Int p1) Int {
            fun h(Int p2) Int {
                f(p0) * p2
            }
            p1 * 2
        }
        i * globalVar * g(p0)
    }

Things start to get complex, so let's look at the parameter lists in a table.

    Function   Method signature                          Method parameters
    f          int fun_f(int, int, int);                 i, globalVar, p0
    g          int fun_f_g(int, int, int, int);          i, globalVar, p0, p1
    h          int fun_f_g_h(int, int, int, int, int);   i, globalVar, p0, p1, p2

The idea is that for more deeply nested functions we pass all the surrounding information plus the
new parameters. Note also that local variables have to be passed along too.
Suppose we update the example in this way:

    input Int i
    var globalVar = 0
    fun f(Int p0) Int {
        var v0 = 2
        fun g(Int p1) Int {
            var v1 = 3
            fun h(Int p2) Int {
                var v2 = 4
                f(p0) * (p2 - v2 + v1)
            }
            p1 * (2 + v0)
        }
        i * globalVar * g(p0)
    }

For these functions we will generate these methods.

    Function   Method signature                                    Method parameters
    f          int fun_f(int, int, int);                           i, globalVar, p0
    g          int fun_f_g(int, int, int, int, int);               i, globalVar, p0, v0, p1
    h          int fun_f_g_h(int, int, int, int, int, int, int);   i, globalVar, p0, v0, p1, v1, p2
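
The parameter lists in the table can be derived mechanically. The sketch below does so for the last example, following the ordering shown in the table: the values visible to the enclosing function, then its parameters and variables, then the function's own parameters. The Fn class and the collect function are hypothetical and much simpler than the code of the compiler.

```kotlin
// Hypothetical, simplified description of a MiniCalcFun function.
class Fn(val name: String, val params: List<String>,
         val variables: List<String>, val nested: List<Fn> = emptyList())

// Maps each generated method name to the names of its JVM parameters.
fun collect(fn: Fn, prefix: String, surrounding: List<String>,
            result: MutableMap<String, List<String>> = mutableMapOf()
): Map<String, List<String>> {
    val methodName = "${prefix}_${fn.name}"
    result[methodName] = surrounding + fn.params
    // Nested functions also see the parameters and variables of fn.
    fn.nested.forEach {
        collect(it, methodName, surrounding + fn.params + fn.variables, result)
    }
    return result
}

// The last example of this section: f contains g, which contains h.
val h = Fn("h", listOf("p2"), listOf("v2"))
val g = Fn("g", listOf("p1"), listOf("v1"), listOf(h))
val f = Fn("f", listOf("p0"), listOf("v0"), listOf(g))
val methods = collect(f, "fun", listOf("i", "globalVar"))
```

Here methods associates fun_f, fun_f_g and fun_f_g_h with the three parameter lists of the table.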

So when we execute a function call we need to pass more than just the parameters of the function
as defined in the MiniCalcFun code. We also need to pass all the values visible to that function.
This means:

- all the values visible to its parent
- the variables of the parent
- all the parameters received by that function

When we call a function we are sure to have all the values it needs already present in our local
variables table. We just need to push them, so that they are available to the method we are going
to invoke. Once we have passed the contextual values we also push the values for the parameters,
which are instead specified in the function call.
Note also that values are ordered from the most global to the most specific, both in the local variables
table and among the parameters of the JVM methods. This will be useful.
It may sound confusing right now, but we will see more details when looking at how the code for
the functions and for the top level statements is generated.
Back to the code for FunctionCall, we:

- push the value of this, because we are going to call an instance method (the JVM method for
  the function)
- push as many values from the local variables table as needed
- evaluate all the parameter values specified in the function call by pushing their expressions
- invoke the JVM method corresponding to the function
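
As an illustration, consider the call g(p0) inside f in the last example. Assuming the local variables table of fun_f holds this, i, globalVar, p0 and v0 at indexes 0 to 4, the four steps yield the instruction sequence built by this sketch. The Local class and the callSequence function are hypothetical, and instructions are rendered as text.

```kotlin
// Hypothetical description of a value in the local variables table:
// the load instruction for its type and the number of slots it occupies.
data class Local(val name: String, val loadInstruction: String, val slots: Int)

fun callSequence(surrounding: List<Local>, argumentInstructions: List<String>,
                 methodName: String, descriptor: String): List<String> {
    // Step 1: push this, the receiver of the instance method.
    val instructions = mutableListOf("aload 0")
    // Step 2: pass along the contextual values, starting from index 1.
    var index = 1
    surrounding.forEach {
        instructions += "${it.loadInstruction} $index"
        index += it.slots
    }
    // Step 3: evaluate the arguments written in the call.
    instructions += argumentInstructions
    // Step 4: invoke the JVM method generated for the function.
    instructions += "invokevirtual $methodName $descriptor"
    return instructions
}
```

Called with the four Int values i, globalVar, p0 and v0, the argument instruction iload 3 (which loads p0), the method name fun_f_g and the descriptor (IIIII)I, it produces aload 0, iload 1, iload 2, iload 3, iload 4, iload 3 and finally the invokevirtual.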

This one was not easy, but we are building a compiler after all. We have to sweat a little.

Pushing and converting values

We have seen that while pushing values we may want to convert them, to ensure they have a certain
type. Let's see the methods we use to do these conversions:

    private fun Expression.pushAsInt(methodVisitor: MethodVisitor,
                                     context: CompilationContext) {
        when (this.type()) {
            is IntType -> {
                this.push(methodVisitor, context)
            }
            is DecimalType -> {
                this.push(methodVisitor, context)
                methodVisitor.visitInsn(D2I)
            }
            else -> throw UnsupportedOperationException(
                    this.type().javaClass.canonicalName)
        }
    }

    private fun Expression.pushAsDouble(methodVisitor: MethodVisitor,
                                        context: CompilationContext) {
        when (this.type()) {
            is IntType -> {
                this.push(methodVisitor, context)
                methodVisitor.visitInsn(I2D)
            }
            is DecimalType -> {
                this.push(methodVisitor, context)
            }
            else -> throw UnsupportedOperationException(
                    this.type().javaClass.canonicalName)
        }
    }

    private fun Expression.pushAsString(methodVisitor: MethodVisitor,
                                        context: CompilationContext) {
        when (this.type()) {
            is IntType -> {
                this.pushAsInt(methodVisitor, context)
                methodVisitor.visitMethodInsn(INVOKESTATIC, "java/lang/Integer",
                        "toString",
                        "(I)${String::class.java.jvmDescription()}", false)
            }
            is DecimalType -> {
                this.pushAsDouble(methodVisitor, context)
                methodVisitor.visitMethodInsn(INVOKESTATIC, "java/lang/Double",
                        "toString",
                        "(D)${String::class.java.jvmDescription()}", false)
            }
            is StringType -> this.push(methodVisitor, context)
            else -> throw UnsupportedOperationException(
                    this.type().javaClass.canonicalName)
        }
    }

The structure is pretty simple: if the value already has the expected type we do a simple push,
otherwise we do a push followed by some operation to perform a conversion.
Consider pushAsInt: if the value to be converted to an int is a double, we emit the instruction D2I,
which converts a double value on top of the stack to an int value. We do the opposite in pushAsDouble,
by using I2D.
To convert numbers to strings we instead need to invoke the methods Integer.toString and
Double.toString. Both of them are static methods expecting one parameter. So we push the value
to be converted and invoke those methods. They pop the value on top of the stack and use it as
their parameter, then they convert it to a string and push that string on top of the stack.
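
The effect of these conversions can be checked from Kotlin, assuming Kotlin running on the JVM: there Double.toInt() and Int.toDouble() give the same results as D2I and I2D, and the two toString methods are the same static methods the generated code invokes. The wrapper functions below are hypothetical names introduced only for this sketch.

```kotlin
// Same result as D2I: the fractional part is discarded (rounding toward zero).
fun asInt(value: Double): Int = value.toInt()

// Same result as I2D: every int is exactly representable as a double.
fun asDouble(value: Int): Double = value.toDouble()

// The static methods invoked by pushAsString for the two numeric types.
fun intAsString(value: Int): String = Integer.toString(value)
fun decimalAsString(value: Double): String = java.lang.Double.toString(value)
```

Note that a decimal is rendered with its fractional part: the decimal 12.0 becomes the string "12.0", while the integer 12 becomes "12".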

Compilation
It is time to see the remaining element of our compiler: the Compilation class.
    class Compilation(val ast: MiniCalcFile, val className: String) {
        val functions = HashMap<FunctionDeclaration, FunctionCode>()
        val cw = ClassWriter(ClassWriter.COMPUTE_FRAMES or ClassWriter.COMPUTE_MAXS)

        data class Entry(val index: Int, val type: Type)

        data class FunctionCode(val functionDeclaration: FunctionDeclaration,
                                val methodName: String,
                                val surroundingValues: List<ValueDeclaration>) {
            val signature: String
                get() = "(" + (surroundingValues + functionDeclaration.params).map {
                    it.type().jvmDescription()
                }.joinToString(separator = "") +
                        ")" + functionDeclaration.returnType.jvmDescription()
        }

        private fun collectFunctions(functionDeclaration: FunctionDeclaration,
                                     prefix: String,
                                     surroundingValues: List<ValueDeclaration> =
                                             ast.inputs() + ast.topLevelVariables()) {
            val methodName = "${prefix}_${functionDeclaration.name}"
            functions[functionDeclaration] = FunctionCode(functionDeclaration,
                    methodName, surroundingValues)
            functionDeclaration.containedFunctions().forEach {
                collectFunctions(it, methodName,
                        surroundingValues + functionDeclaration.params)
            }
        }

        private fun compileConstructor() {
            val constructor = cw.visitMethod(ACC_PUBLIC, "<init>",
                    "(${SystemInterface::class.java.jvmDescription()})V",
                    null, null)
            constructor.visitVarInsn(ALOAD, 0)
            constructor.visitMethodInsn(INVOKESPECIAL,
                    Object::class.java.internalName(), "<init>",
                    "()V", false)
            constructor.visitVarInsn(ALOAD, 0)
            constructor.visitVarInsn(ALOAD, 1)
            constructor.visitFieldInsn(PUTFIELD,
                    canonicalNameToInternalName(className),
                    "systemInterface",
                    SystemInterface::class.java.jvmDescription())
            constructor.visitInsn(RETURN)
            constructor.visitEnd()
            constructor.visitMaxs(-1, -1)
        }

        private fun compileFunction(functionDeclaration: FunctionDeclaration) {
            val functionCode = functions[functionDeclaration]!!
            val allParams = LinkedList<ValueDeclaration>()
            allParams += functionCode.surroundingValues
            allParams += functionDeclaration.params
            generateMethod(functionCode.methodName, allParams,
                    functionDeclaration.variables(),
                    functionDeclaration.statements,
                    functionDeclaration.returnType)
        }

        private fun generateMethod(methodName: String,
                                   methodParameters: List<ValueDeclaration>,
                                   variables: List<VarDeclaration>,
                                   statements: List<Statement>,
                                   returnType: Type? = null) {
            // we generate one JVM method: either the calculate method or
            // the method corresponding to a MiniCalcFun function.
            // It returns nothing (void) when no return type is given
            val methodVisitor = cw.visitMethod(ACC_PUBLIC, methodName,
                    "(${methodParameters.map { it.type().jvmDescription() }.joinToString(separator = "")})"
                    + "${returnType?.jvmDescription() ?: "V"}", null, null)
            methodVisitor.visitCode()
            // labels are used by ASM to mark points in the code
            val methodStart = Label()
            val methodEnd = Label()
            // with this call we indicate to what point in the method the label
            // methodStart corresponds
            methodVisitor.visitLabel(methodStart)

            // Variable declarations:
            // we find all variable declarations in our code and we assign to them
            // an index value. The localSymbols map will tell us which declaration
            // corresponds to which index
            var nextIndex = 1
            val localSymbols = HashMap<ValueDeclaration, Entry>()
            methodParameters.forEach {
                localSymbols[it] = Entry(nextIndex, it.type())
                nextIndex += it.type().localVarTableSize()
                // they are just represented by the params
            }
            variables.forEach {
                localSymbols[it] = Entry(nextIndex, it.type())
                methodVisitor.visitLocalVariable(it.name, it.type().jvmDescription(),
                        null, methodStart, methodEnd, nextIndex)
                nextIndex += it.type().localVarTableSize()
            }

            // time to generate bytecode for all the statements
            val ctx = CompilationContext(localSymbols, this)
            statements.forEach { s ->
                when (s) {
                    is InputDeclaration -> {
                        // Nothing to do, the value is already stored where it should be
                    }
                    is VarDeclaration -> {
                        s.value.push(methodVisitor, ctx)
                        methodVisitor.visitVarInsn(s.type().storeOp(),
                                localSymbols[s]!!.index)
                    }
                    is Print -> {
                        methodVisitor.visitVarInsn(ALOAD, 0)
                        methodVisitor.visitFieldInsn(GETFIELD,
                                canonicalNameToInternalName(className),
                                "systemInterface",
                                SystemInterface::class.java.jvmDescription())
                        s.value.pushAsString(methodVisitor, ctx)
                        methodVisitor.visitMethodInsn(INVOKEINTERFACE,
                                SystemInterface::class.java.internalName(), "print",
                                "(${String::class.java.jvmDescription()})V", true)
                    }
                    is Assignment -> {
                        s.value.push(methodVisitor, ctx)
                        methodVisitor.visitVarInsn(s.varDecl.referred!!.type().storeOp(),
                                localSymbols[s.varDecl.referred!!]!!.index)
                    }
                    is FunctionDeclaration -> compileFunction(s)
                    is ExpressionStatatement -> s.expression.push(methodVisitor, ctx)
                    else -> throw UnsupportedOperationException(
                            s.javaClass.canonicalName)
                }
            }

            // We just say that here is the end of the method
            methodVisitor.visitLabel(methodEnd)
            // And we add the return instruction
            if (returnType == null) {
                methodVisitor.visitInsn(RETURN)
            } else {
                methodVisitor.visitInsn(returnType.returnOp())
            }
            methodVisitor.visitEnd()
            methodVisitor.visitMaxs(-1, -1)
        }

        private fun compileCalculateMethod() {
            generateMethod("calculate", ast.inputs(), ast.topLevelVariables(),
                    ast.statements)
        }

        fun compile() : ByteArray {
            ast.topLevelFunctions().forEach { collectFunctions(it, "fun") }

            // here we specify that the class is in the format introduced with
            // Java 8 (so it would require a JRE >= 8 to run). We also specify the
            // name of the class, the fact it extends Object and it implements no
            // interfaces
            cw.visit(V1_8, ACC_PUBLIC, canonicalNameToInternalName(className), null,
                    "java/lang/Object", null)

            cw.visitField(ACC_PRIVATE, "systemInterface",
                    SystemInterface::class.java.jvmDescription(),
                    null, null)

            compileConstructor()
            compileCalculateMethod()

            cw.visitEnd()
            return cw.toByteArray()
        }
    }

Our strategy is to generate for each MiniCalcFun source file one JVM class with:

- one constructor
- one method named calculate that will execute the whole code
- one method for each MiniCalcFun function

The method calculate and the methods for the functions will be generated using generateMethod.
In both cases we just need to generate code for a sequence of statements. In one case the sequence
of statements will come from the global scope, in the other cases it will come from the body of the
function.
When we instantiate the compilation we pass the AST to compile and the name of the class:

    class Compilation(val ast: MiniCalcFile, val className: String) {

We need to provide the name of the class because there is no name for the whole script in the AST.
We start by creating a ClassWriter:

    val cw = ClassWriter(ClassWriter.COMPUTE_FRAMES or ClassWriter.COMPUTE_MAXS)

ClassWriter is a class from ASM that will be used to generate the class. It will give us the actual
bytes to save in the class file. The flags we specify instruct ASM to calculate several values
for us: the maximum stack size and the number of local variables of each method (COMPUTE_MAXS) and
the stack map frames (COMPUTE_FRAMES).
The action starts in compile. First of all we collect all the functions. We pick the top level ones and
on each of them we invoke collectFunctions. This method looks for other functions inside the
given one, recursively.

    private fun collectFunctions(functionDeclaration: FunctionDeclaration,
                                 prefix: String,
                                 surroundingValues: List<ValueDeclaration> =
                                         ast.inputs() + ast.topLevelVariables()) {
        val methodName = "${prefix}_${functionDeclaration.name}"
        functions[functionDeclaration] = FunctionCode(functionDeclaration,
                methodName, surroundingValues)
        functionDeclaration.containedFunctions().forEach {
            collectFunctions(it, methodName,
                    surroundingValues + functionDeclaration.params)
        }
    }

What we do is create a map that associates each FunctionDeclaration with an instance of
FunctionCode. In FunctionCode we store the name of the method to generate for a given function
and the list of surrounding values it will need to receive, in addition to its parameters, when invoked.
Consider a function named g inside a function named f:

- the name will be fun_f_g: a qualified name that lets us distinguish functions having the
  same name but declared in different scopes
- the surrounding values of g will be all the inputs and global variables plus the parameters
  and variables of f

The constructor

The constructor of our compiled class will receive one parameter of type SystemInterface. That
parameter defines how we will interact with the system. This will make testing much easier, as we
will see later.

    private fun compileConstructor() {
        val constructor = cw.visitMethod(ACC_PUBLIC, "<init>",
                "(${SystemInterface::class.java.jvmDescription()})V", null, null)
        constructor.visitVarInsn(ALOAD, 0)
        constructor.visitMethodInsn(INVOKESPECIAL,
                Object::class.java.internalName(), "<init>", "()V", false)
        constructor.visitVarInsn(ALOAD, 0)
        constructor.visitVarInsn(ALOAD, 1)
        constructor.visitFieldInsn(PUTFIELD, canonicalNameToInternalName(className),
                "systemInterface", SystemInterface::class.java.jvmDescription())
        constructor.visitInsn(RETURN)
        constructor.visitEnd()
        constructor.visitMaxs(-1, -1)
    }

We start by defining the constructor (special name <init>). Its signature indicates that the return
type is void and the only parameter is an instance of the class SystemInterface, which we have
already seen while examining the interpreter:

    interface SystemInterface {
        fun print(message: String)
    }

We start by pushing this onto the stack (aload 0) and then invoke the constructor of Object, our
superclass. After that we take the value of the parameter and store it in the field systemInterface.
To do so we first push this (aload 0), then push the value to assign, i.e. the value of the
first and only parameter (aload 1), and finally emit PUTFIELD. At this point we just need to insert a
RETURN. In bytecode RETURN is never implicit: it must always be present.
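
In source form, the class we are generating corresponds roughly to the following Kotlin code. This is only a sketch: HandWrittenScript and RecordingSystemInterface are hypothetical names, the input i and the body of calculate stand for whatever the script declares, and SystemInterface is repeated to keep the example self-contained.

```kotlin
interface SystemInterface {
    fun print(message: String)
}

// Hypothetical hand-written equivalent of a compiled script with one
// Int input: the constructor stores its parameter in a private field,
// and print statements go through that field.
class HandWrittenScript(private val systemInterface: SystemInterface) {
    fun calculate(i: Int) {
        val doubled = i * 2
        systemInterface.print(Integer.toString(doubled))
    }
}

// Hypothetical implementation for tests: it records the messages
// instead of writing them to the console.
class RecordingSystemInterface : SystemInterface {
    val messages = mutableListOf<String>()
    override fun print(message: String) {
        messages += message
    }
}
```

Passing a RecordingSystemInterface to the constructor lets a test inspect what the script printed, which is why the generated class receives the interface as a parameter.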

Generate method

The generateMethod method is used to generate both the method for the global scope (calculate) and
a method for each function present in the MiniCalcFun script to be compiled.

    private fun generateMethod(methodName: String,
                               methodParameters: List<ValueDeclaration>,
                               variables: List<VarDeclaration>,
                               statements: List<Statement>,
                               returnType: Type? = null) {
        // we generate one JVM method: either the calculate method or
        // the method corresponding to a MiniCalcFun function.
        // It returns nothing (void) when no return type is given
        val methodVisitor = cw.visitMethod(ACC_PUBLIC, methodName,
                "(${methodParameters.map { it.type().jvmDescription() }.joinToString(separator = "")})"
                + "${returnType?.jvmDescription() ?: "V"}", null, null)
        methodVisitor.visitCode()
        // labels are used by ASM to mark points in the code
        val methodStart = Label()
        val methodEnd = Label()
        // with this call we indicate to what point in the method the label
        // methodStart corresponds
        methodVisitor.visitLabel(methodStart)

We expect to receive the name of the JVM method to generate, the parameters this method will
receive, the variables that will be defined in this method, the statements composing the method and,
optionally, the return type.
Remember that a function can see variables defined outside of it: in the global scope or in a function
wrapping it. These variables become parameters in the corresponding JVM method, so they are
inserted in the list of methodParameters.
variables contains exclusively the variables defined in the function being compiled. When
generateMethod is called for the global scope, variables contains the global variables.

The signature is defined by simply taking all the method parameters and getting the corresponding
JVM type descriptions. They are all joined without spaces between them and enclosed in
parentheses. At the end of the signature we have V if no return type is present, otherwise the JVM
type description corresponding to the return type.
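
The construction of the signature can be tried in isolation with the following sketch. The MiniType enum and the methodDescriptor function are hypothetical; the descriptors themselves (I for int, D for double, Ljava/lang/String; for String, V for void) are the standard JVM ones.

```kotlin
// Hypothetical model of the MiniCalcFun types with their JVM descriptors.
enum class MiniType(val jvmDescription: String) {
    INT("I"),
    DECIMAL("D"),
    STRING("Ljava/lang/String;")
}

// Builds a method descriptor: the parameter descriptors are joined without
// separators inside parentheses, followed by the return type descriptor.
// A null return type stands for void.
fun methodDescriptor(params: List<MiniType>, returnType: MiniType?): String =
        params.joinToString(separator = "", prefix = "(", postfix = ")") {
            it.jvmDescription
        } + (returnType?.jvmDescription ?: "V")
```

A method taking an Int, a Decimal and an Int and returning nothing gets the descriptor (IDI)V, while a method taking two Int values and returning an Int gets (II)I.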
We then define labels indicating the start and the end of the method. We will use them when defining
the range of validity of the variables, which is relevant only for debugging purposes.
At this point we register all the parameters and variables:

    // Variable declarations:
    // we find all variable declarations in our code and we assign to them an
    // index value.
    // The localSymbols map will tell us which declaration corresponds to which index
    var nextIndex = 1
    val localSymbols = HashMap<ValueDeclaration, Entry>()
    methodParameters.forEach {
        localSymbols[it] = Entry(nextIndex, it.type())
        nextIndex += it.type().localVarTableSize()
        // they are just represented by the params
    }
    variables.forEach {
        localSymbols[it] = Entry(nextIndex, it.type())
        methodVisitor.visitLocalVariable(it.name, it.type().jvmDescription(), null,
                methodStart, methodEnd, nextIndex)
        nextIndex += it.type().localVarTableSize()
    }

We have seen that the first entry of the local variables table is this. It is followed by the
parameters. So for each parameter we record its position in the local variables table. Remember that
double and long entries take two slots. We do not use long, but we use double, which corresponds to
the Decimal type in MiniCalcFun. So what happens if we have a method taking three parameters
(p0, p1, p2), where the first and the last one are of type Int and the second one is of type Decimal?
The resulting local variables table will have this content:

    Entry   Type        Index
    this    reference   0
    p0      int         1
    p1      double      2-3
    p2      int         4

Then we proceed to insert the variables. For example, if we have two string variables (v0 and v1) in
this method, the local variables table will have this content:

    Entry   Type        Index
    this    reference   0
    p0      int         1
    p1      double      2-3
    p2      int         4
    v0      reference   5
    v1      reference   6
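
The indexes in this table can be reproduced with a few lines of code. In this sketch the Slot class and the assignIndexes function are hypothetical; the size of each value plays the role of localVarTableSize().

```kotlin
// Hypothetical description of a parameter or variable: its name and the
// number of slots its type occupies (2 for double, 1 for int and references).
data class Slot(val name: String, val size: Int)

// Assigns to each value its index in the local variables table.
// Index 0 is reserved for this, so the first value gets index 1.
fun assignIndexes(values: List<Slot>): Map<String, Int> {
    val indexes = LinkedHashMap<String, Int>()
    var nextIndex = 1
    values.forEach {
        indexes[it.name] = nextIndex
        // A two-slot value pushes the following one two positions further.
        nextIndex += it.size
    }
    return indexes
}
```

Given p0 (size 1), p1 (size 2), p2, v0 and v1 (size 1 each), the function assigns the indexes 1, 2, 4, 5 and 6, as in the table.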

The call to visitLocalVariable only fills a table used for debugging purposes.
At this point we create an instance of the context:

    val ctx = CompilationContext(localSymbols, this)

This instance captures the list of symbols and their positions in the local variables table. We will
need it when executing statements and evaluating expressions. Why is that? Because when we
find a reference to p0 we can look into localSymbols to know where p0 is in the local variables table,
so that we can write the correct instruction to retrieve or set its value.
Now we can process all the statements, in order:

is InputDeclaration -> {
    // Nothing to do, the value is already stored where it should be
}
is VarDeclaration -> {
    s.value.push(methodVisitor, ctx)
    methodVisitor.visitVarInsn(s.type().storeOp(), localSymbols[s]!!.index)
}

For input declarations we do not need to do anything. We receive their values as parameters, so
they are already in the local variables table.
For variables we need instead to evaluate the expression providing the initial value. Once we have
evaluated that expression, its value is on top of the stack and the store operation puts it in the
local variables table. We figure out the index in the local variables table by looking in the
localSymbols map we created earlier.
The assignment works exactly as the variable declaration: we evaluate an expression and store its
value in the local variables table.

is Assignment -> {
    s.value.push(methodVisitor, ctx)
    methodVisitor.visitVarInsn(s.varDecl.referred!!.type().storeOp(),
            localSymbols[s.varDecl.referred!!]!!.index)
}
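
Both the variable declaration and the assignment rely on picking a store instruction that matches
the type of the value on top of the stack. As a rough illustration, the sketch below uses a
hypothetical SimpleType enum in place of the compiler's types and hard-codes the opcode numbers
defined by the JVM specification instead of depending on ASM. It is a reduced stand-in, not the
storeOp extension method of the compiler.

```kotlin
// Opcode numbers as defined by the JVM specification (ASM exposes the same values in Opcodes).
val ISTORE = 54
val DSTORE = 57
val ASTORE = 58

// Hypothetical, simplified version of the MiniCalcFun types.
enum class SimpleType { INT, DECIMAL, STRING }

// Picks the store instruction matching the type of the value on top of the stack.
fun SimpleType.storeOp(): Int = when (this) {
    SimpleType.INT -> ISTORE        // int values use istore
    SimpleType.DECIMAL -> DSTORE    // Decimal is compiled to double, so dstore
    SimpleType.STRING -> ASTORE     // strings are references, so astore
}

fun main() {
    SimpleType.values().forEach { println("$it -> opcode ${it.storeOp()}") }
}
```

Using the wrong variant (for example istore for a double) would produce a class that the JVM
verifier rejects when the class is loaded.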

The expression statement requires even less code: we just evaluate an expression, without saving
its result.

is ExpressionStatatement -> s.expression.push(methodVisitor, ctx)

Then we have the function declarations, which are handled in a separate method:

is FunctionDeclaration -> compileFunction(s)

We are left with the print statement. Now, one simple way to implement it would be this:

is Print -> {
    // this means that we access the field "out" of "java.lang.System" which
    // is of type "java.io.PrintStream"
    mainMethodWriter.visitFieldInsn(GETSTATIC, "java/lang/System", "out",
            "Ljava/io/PrintStream;")
    // we push the value we want to print on the stack
    s.value.push(mainMethodWriter, localSymbols)
    // we call the method println of System.out to print the value. It will
    // take its parameter from the stack. Note that we have to tell the JVM
    // which variant of println to call. To do that we describe the
    // signature of the method, depending on the type of the value we want
    // to print.
    // If we want to print an int we will produce the signature "(I)V",
    // we will produce "(D)V" for a double
    mainMethodWriter.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream",
            "println", "(${s.value.type().jvmDescription()})V", false)
}

This just consists in pushing an expression on the stack and then invoking one of the different
System.out.println methods. There are several of these methods: one taking a string, one taking an
int, another one taking a double.
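
To make the choice of the println variant concrete, here is a small sketch of how the method
descriptor changes with the type of the printed value. PrintableType and printlnDescriptor are
hypothetical names used only for this illustration; the descriptors themselves follow the standard
JVM notation.

```kotlin
// Hypothetical, simplified stand-in for the types of MiniCalcFun.
enum class PrintableType(val jvmDescription: String) {
    INT("I"),
    DECIMAL("D"),
    STRING("Ljava/lang/String;")
}

// Builds the descriptor of the println overload accepting a value of the given type:
// one parameter of that type, and void (V) as the return type.
fun printlnDescriptor(type: PrintableType): String = "(${type.jvmDescription})V"

fun main() {
    PrintableType.values().forEach { println("$it -> ${printlnDescriptor(it)}") }
}
```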
However, we did not implement the print statement in this way. We instead delegate the
implementation to the systemInterface field. By choosing this approach we can either i) pass a
SystemInterface that actually prints to the screen or ii) pass an instance that collects the strings
we tried to print in a list, to later test the result. This is the same strategy we used in the
interpreter.

is Print -> {
    methodVisitor.visitVarInsn(ALOAD, 0)
    methodVisitor.visitFieldInsn(GETFIELD,
            canonicalNameToInternalName(className),
            "systemInterface",
            SystemInterface::class.java.jvmDescription())
    s.value.pushAsString(methodVisitor, ctx)
    methodVisitor.visitMethodInsn(INVOKEINTERFACE,
            SystemInterface::class.java.internalName(), "print",
            "(${String::class.java.jvmDescription()})V", true)
}
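
The benefit of going through SystemInterface can be seen without generating any bytecode. In the
sketch below the interface matches the one used by the compiler, while ConsoleSystemInterface,
CollectingSystemInterface, and runScript are hypothetical names: runScript stands in for the
generated calculate method, which only knows about the interface.

```kotlin
interface SystemInterface {
    fun print(message: String)
}

// Prints to the screen: what we would pass when really running a script.
class ConsoleSystemInterface : SystemInterface {
    override fun print(message: String) {
        println(message)
    }
}

// Collects the messages so that a test can inspect them later.
class CollectingSystemInterface : SystemInterface {
    val output = mutableListOf<String>()
    override fun print(message: String) {
        output.add(message)
    }
}

// Stand-in for compiled code: the value is converted to a string, then handed over.
fun runScript(systemInterface: SystemInterface) {
    systemInterface.print((21 * 2).toString())
}

fun main() {
    runScript(ConsoleSystemInterface())
    val collector = CollectingSystemInterface()
    runScript(collector)
    println(collector.output)
}
```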

The resulting bytecode for a few examples

When writing a JVM compiler you may want to generate class files and examine them using the
javap utility. Let's look at some examples of MiniCalcFun code and the resulting class files,
disassembled using javap.
First example:

input Int i
print(i * 2)

And the corresponding class file:

{
  private me.tomassetti.minicalc.interpreter.SystemInterface systemInterface;
    descriptor: Lme/tomassetti/minicalc/interpreter/SystemInterface;
    flags: ACC_PRIVATE

  public minicalc.example(me.tomassetti.minicalc.interpreter.SystemInterface);
    descriptor: (Lme/tomassetti/minicalc/interpreter/SystemInterface;)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #11       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #13       // Field systemInterface:Lme/tomassetti/minicalc/interpreter/SystemInterface;
         9: return

  public void calculate(int);
    descriptor: (I)V
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=2, args_size=2
         0: aload_0
         1: getfield      #13       // Field systemInterface:Lme/tomassetti/minicalc/interpreter/SystemInterface;
         4: iload_1
         5: ldc           #16       // int 2
         7: imul
         8: invokestatic  #22       // Method java/lang/Integer.toString:(I)Ljava/lang/String;
        11: invokeinterface #28, 2  // InterfaceMethod me/tomassetti/minicalc/interpreter/SystemInterface.print:(Ljava/lang/String;)V
        16: return
}

In this example we have:

- the field for storing the instance of SystemInterface we get in the constructor
- the constructor
- the calculate method, which executes the whole script

Let's now look at an example with a function:

fun f(Int p) Int {
    p + 1
}
print(f(5))

And here we have the corresponding class:

{
  private me.tomassetti.minicalc.interpreter.SystemInterface systemInterface;
    descriptor: Lme/tomassetti/minicalc/interpreter/SystemInterface;
    flags: ACC_PRIVATE

  public minicalc.example(me.tomassetti.minicalc.interpreter.SystemInterface);
    descriptor: (Lme/tomassetti/minicalc/interpreter/SystemInterface;)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #11       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #13       // Field systemInterface:Lme/tomassetti/minicalc/interpreter/SystemInterface;
         9: return

  public void calculate();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: aload_0
         1: getfield      #13       // Field systemInterface:Lme/tomassetti/minicalc/interpreter/SystemInterface;
         4: aload_0
         5: ldc           #18       // int 5
         7: invokevirtual #22       // Method "minicalc.example".fun_f:(I)I
        10: invokestatic  #28       // Method java/lang/Integer.toString:(I)Ljava/lang/String;
        13: invokeinterface #34, 2  // InterfaceMethod me/tomassetti/minicalc/interpreter/SystemInterface.print:(Ljava/lang/String;)V
        18: return

  public int fun_f(int);
    descriptor: (I)I
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: iload_1
         1: ldc           #17       // int 1
         3: iadd
         4: ireturn
}

In this case in addition to the field, the constructor, and the calculate method we have a method
for the function, named fun_f.

Testing

We may have finished looking into the code of our first compiler, but one important piece is still
missing: our tests.
Let's look at the general structure we use for testing the compilation of MiniCalcFun source files.

class JvmCompilerTest {

    fun compile(code: String): Class<*> {
        val res = MiniCalcParserFacade.parse(code)
        assertTrue(res.isCorrect(), res.errors.toString())
        val miniCalcFile = res.root!!
        val bytes = JvmCompiler().compile(miniCalcFile, "me/tomassetti/MyCalc")
        return MyClassLoader(bytes).loadClass("me.tomassetti.MyCalc")
    }

    class MyClassLoader(val bytes: ByteArray) : ClassLoader() {
        override fun findClass(name: String?): Class<*> {
            return defineClass(name, bytes, 0, bytes.size)
        }
    }

    class TestSystemInterface : SystemInterface {

        val output = LinkedList<String>()

        override fun print(message: String) {
            output.add(message)
        }

    }

    ...
}

In the compile method of the test we get some code, we parse it, and we verify there are no errors.
If everything is fine we invoke the compile method of JvmCompiler, which returns the bytes of the
compiled class. We pass those to our simple classloader (MyClassLoader), which uses them to define
the class when it is asked to load it.
We also define an implementation of SystemInterface that, instead of printing strings, stores them
in a list.
This is how we can write tests using this structure:

@test fun inputReference() {
    val clazz = compile("""input Int i
                           input String s
                           print(s + i)""")
    val systemInterface = TestSystemInterface()
    val instance = clazz.declaredConstructors[0].newInstance(systemInterface)
    clazz.methods.find { it.name == "calculate" }!!.invoke(instance, 12, "hi")
    assertEquals(listOf("hi12"), systemInterface.output)
}

@test fun varAssignment() {
    val clazz = compile("""var i = 0
                           print(i)
                           i = 2
                           print(i)""")
    val systemInterface = TestSystemInterface()
    val instance = clazz.declaredConstructors[0].newInstance(systemInterface)
    clazz.methods.find { it.name == "calculate" }!!.invoke(instance)
    assertEquals(listOf("0", "2"), systemInterface.output)
}

@test fun interpolatedStringLitWithThreeParts() {
    val clazz = compile("""print("hi!#{2 * 3}bye")""")
    val systemInterface = TestSystemInterface()
    val instance = clazz.declaredConstructors[0].newInstance(systemInterface)
    clazz.methods.find { it.name == "calculate" }!!.invoke(instance)
    assertEquals(listOf("hi!6bye"), systemInterface.output)
}

@test fun annidatedFunction() {
    val clazz = compile("""fun f() Int {
                               fun f() Int {
                                   2
                               }
                               3 * f()
                           }
                           print(f())""")
    val systemInterface = TestSystemInterface()
    val instance = clazz.declaredConstructors[0].newInstance(systemInterface)
    clazz.methods.find { it.name == "calculate" }!!.invoke(instance)
    assertEquals(listOf("6"), systemInterface.output)
}

All these tests have a similar structure:

- we pass some code to compile and get a class back
- we instantiate our TestSystemInterface
- we instantiate our compiled class by using reflection (see newInstance)
- we find the method calculate, again through reflection, and invoke it. Note that we always need
  to pass the instance of our compiled class and, if the script uses inputs, we need to pass
  values for those
- we verify that the script has printed the messages we expected, by examining the output captured
  by the TestSystemInterface

And this is how you can write and test your JVM compiler.
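
The reflection steps used by the tests can be tried on an ordinary class, without involving the
compiler. The sketch below is only an illustration: FakeCompiledScript is a hypothetical,
hand-written class playing the role of the generated class for a script with an Int input and a
String input.

```kotlin
// Hypothetical stand-in for a compiled script with two inputs (an Int and a String).
class FakeCompiledScript(private val sink: MutableList<String>) {
    fun calculate(i: Int, s: String) {
        sink.add(s + i)
    }
}

fun main() {
    val clazz: Class<*> = FakeCompiledScript::class.java
    val output = mutableListOf<String>()
    // Instantiate through reflection, as the tests do with the generated class.
    val instance = clazz.declaredConstructors[0].newInstance(output)
    // Find the method by name, then pass the receiver followed by the input values.
    val calculate = clazz.methods.find { it.name == "calculate" }!!
    calculate.invoke(instance, 12, "hi")
    println(output) // [hi12]
}
```

The first argument of invoke is the receiver; the remaining ones are the values of the inputs,
in the order in which they are declared.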

StaMac
We are going to see how to write a JVM compiler for StaMac. We will reuse many principles we
have seen when writing the compiler for MiniCalcFun. However the structure of the generated
classes will be different because the execution model is different. MiniCalcFun is a typical imperative
language which executes a list of statements from beginning to end. StaMac instead is based on State
Machines, so it is event based.
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#newInstance--

What we want to obtain

For each StaMac source file we are going to generate several class files.
One class file will represent the whole state machine. We will then have one interface named
State. We will also have one class for each state of the state machine. Each of these state classes
will implement the State interface.
We will generate all the classes in one package, named using the name of the state machine.
Consider this example:

statemachine simpleSM

input lowSpeedThroughtput : Int
input highSpeedThroughtput : Int
var counter = 0

event accelerate
event slowDown
event clock

start state turnedOff {
    on accelerate -> lowSpeed
}

state lowSpeed {
    on entry {
        counter = counter + lowSpeedThroughtput
    }
    on accelerate -> highSpeed
    on slowDown -> turnedOff
    on clock -> lowSpeed
}

state highSpeed {
    on entry {
        counter = counter + highSpeedThroughtput
    }
    on slowDown -> lowSpeed
    on clock -> highSpeed
}

It will produce five class files:

- stamac.simpleSM.StateMachine
- stamac.simpleSM.State
- stamac.simpleSM.turnedOff
- stamac.simpleSM.lowSpeed
- stamac.simpleSM.highSpeed

The StateMachine class is the only one users should work with directly. The State interface and its
three implementations are used by StateMachine. They could have been inner classes, but this would
have introduced a little more complexity that is not worth it at this stage.
The StateMachine class will have several elements. Let's start with the public ones:

- a constructor taking a SystemInterface instance and a value for each input
- a method for each event, named as the event itself
- the method isExited

Other elements are package-private because they are intended to be accessed only by the state
classes:

- one field for SystemInterface, which we have already seen in the interpreter chapter and in the
  MiniCalcFun compiler
- one field for each input and variable
- the exit method
- a goTo_Xxx method for each state

Some are just private:

- one field indicating if the state machine has reached the exit
- one field containing an instance of a state class. This represents the current state

The State interface has:

- a method enter
- a method leave
- a method for each event

Each state class has:

- a private field containing a reference to the StateMachine
- a constructor taking an instance of StateMachine
- the implementation of the methods defined in the State interface

Why did we choose this approach?

There are different ways of producing bytecode that would lead to a system with the same behavior.
We chose this approach because it seems reasonably clean and easy to implement. The main activity
of our system is reacting to events from the external world, so we started from there. What
should happen when we receive an event? We should react in a way that depends on the event and
the current state.
The first choice is having a different method for each event: in this way users communicate
which event they are sending by invoking the corresponding method. Alternatively we could have
chosen to have a single method, named receiveEvent, and to specify which event was sent using
a parameter, as in receiveEvent(ACCELERATE). With the approach we chose, the user calls
accelerate() instead.
The second choice is how to implement these event methods. One way would be to switch
on the current state and do something different depending on the state. Something equivalent to:

void accelerate() {
    switch (currentState) {
        case TURNED_OFF:
            ...
            break;
        case LOW_SPEED:
            ...
            break;
        case HIGH_SPEED:
            ...
            break;
    }
}

We did not like this approach for several reasons, including the fact that generating the bytecode
for switch statements is not trivial. We chose instead to simply delegate to an object representing
the current state, as if we were applying the State pattern from the Design Patterns book.
So our code will be very similar to what we would obtain by compiling this example in Java:

https://en.wikipedia.org/wiki/Design_Patterns

class StateMachine {

    ...

    void accelerate() {
        currentState.accelerate();
    }

    ...
}
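
To see how the pieces fit together, here is a hand-written Kotlin sketch of the same structure,
reduced to two states and one event. It is only an approximation of what the compiler generates:
the method names goToTurnedOff and goToLowSpeed, the helper currentStateName, and the body of
the enter method of LowSpeed are assumptions made for this illustration.

```kotlin
interface State {
    fun enter()
    fun leave()
    fun accelerate()
}

class StateMachine {
    internal var counter = 0
    private var currentState: State? = null

    init {
        // the state machine starts in its start state
        goToTurnedOff()
    }

    // Public event method: it simply delegates to the current state.
    fun accelerate() {
        currentState!!.accelerate()
    }

    internal fun goToTurnedOff() {
        goTo(TurnedOff(this))
    }

    internal fun goToLowSpeed() {
        goTo(LowSpeed(this))
    }

    // Leaves the current state (if any), then installs and enters the new one.
    private fun goTo(next: State) {
        currentState?.leave()
        currentState = next
        next.enter()
    }

    // Helper added only to observe the sketch from the outside.
    fun currentStateName(): String = currentState!!.javaClass.simpleName
}

class TurnedOff(private val stateMachine: StateMachine) : State {
    override fun enter() {}
    override fun leave() {}
    override fun accelerate() {
        stateMachine.goToLowSpeed()
    }
}

class LowSpeed(private val stateMachine: StateMachine) : State {
    override fun enter() {
        stateMachine.counter += 1
    }
    override fun leave() {}
    override fun accelerate() {} // no transition in this reduced sketch
}

fun main() {
    val sm = StateMachine()
    println(sm.currentStateName())
    sm.accelerate()
    println("${sm.currentStateName()}, counter = ${sm.counter}")
}
```

Each state decides by itself how to react to an event, so adding a state means adding a class
rather than touching every event method.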

Example of compiled classes

Compiling the example we have seen before we will obtain this code.
StateMachine class:

public class stamac.simpleSM.StateMachine
{
  me.tomassetti.stamac.jvmcompiler.SystemInterface systemInterface;
    descriptor: Lme/tomassetti/stamac/jvmcompiler/SystemInterface;
    flags:

  private boolean exited;
    descriptor: Z
    flags: ACC_PRIVATE

  private stamac.simpleSM.State currentState;
    descriptor: Lstamac/simpleSM/State;
    flags: ACC_PRIVATE

  int lowSpeedThroughtput;
    descriptor: I
    flags:

  int highSpeedThroughtput;
    descriptor: I
    flags:

  int counter;
    descriptor: I
    flags:

  public stamac.simpleSM.StateMachine(me.tomassetti.stamac.jvmcompiler.SystemInterface, int, int);
    descriptor: (Lme/tomassetti/stamac/jvmcompiler/SystemInterface;II)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=4, args_size=4
         0: aload_0
         1: invokespecial #19       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #21       // Field systemInterface:Lme/tomassetti/stamac/jvmcompiler/SystemInterface;
         9: aload_0
        10: aload_2
        11: putfield      #23       // Field lowSpeedThroughtput:I
        14: aload_0
        15: aload_3
        16: putfield      #25       // Field highSpeedThroughtput:I
        19: aload_0
        20: invokevirtual #28       // Method goTo_turnedOff:()V
        23: return

  void exit();
    descriptor: ()V
    flags:
    Code:
      stack=2, locals=1, args_size=1
         0: aload_0
         1: ldc           #30       // int 1
         3: putfield      #32       // Field exited:Z
         6: return

  public boolean isExited();
    descriptor: ()Z
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #32       // Field exited:Z
         4: ireturn

  void goTo_turnedOff();
    descriptor: ()V
    flags:
    Code:
      stack=4, locals=1, args_size=1
         0: aload_0
         1: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
         4: ifnull        16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #41, 1  // InterfaceMethod stamac/simpleSM/State.leave:()V
        16: aload_0
        17: new           #43       // class stamac/simpleSM/turnedOff
        20: dup
        21: aload_0
        22: invokespecial #46       // Method stamac/simpleSM/turnedOff."<init>":(Lstamac/simpleSM/StateMachine;)V
        25: putfield      #36       // Field currentState:Lstamac/simpleSM/State;
        28: aload_0
        29: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        32: invokeinterface #49, 1  // InterfaceMethod stamac/simpleSM/State.enter:()V
        37: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */

  void goTo_lowSpeed();
    descriptor: ()V
    flags:
    Code:
      stack=4, locals=1, args_size=1
         0: aload_0
         1: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
         4: ifnull        16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #41, 1  // InterfaceMethod stamac/simpleSM/State.leave:()V
        16: aload_0
        17: new           #52       // class stamac/simpleSM/lowSpeed
        20: dup
        21: aload_0
        22: invokespecial #53       // Method stamac/simpleSM/lowSpeed."<init>":(Lstamac/simpleSM/StateMachine;)V
        25: putfield      #36       // Field currentState:Lstamac/simpleSM/State;
        28: aload_0
        29: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        32: invokeinterface #49, 1  // InterfaceMethod stamac/simpleSM/State.enter:()V
        37: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */

  void goTo_highSpeed();
    descriptor: ()V
    flags:
    Code:
      stack=4, locals=1, args_size=1
         0: aload_0
         1: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
         4: ifnull        16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #41, 1  // InterfaceMethod stamac/simpleSM/State.leave:()V
        16: aload_0
        17: new           #56       // class stamac/simpleSM/highSpeed
        20: dup
        21: aload_0
        22: invokespecial #57       // Method stamac/simpleSM/highSpeed."<init>":(Lstamac/simpleSM/StateMachine;)V
        25: putfield      #36       // Field currentState:Lstamac/simpleSM/State;
        28: aload_0
        29: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        32: invokeinterface #49, 1  // InterfaceMethod stamac/simpleSM/State.enter:()V
        37: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */

  public void accelerate();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #32       // Field exited:Z
         4: ifne          16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #60, 1  // InterfaceMethod stamac/simpleSM/State.accelerate:()V
        16: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */

  public void slowDown();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #32       // Field exited:Z
         4: ifne          16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #63, 1  // InterfaceMethod stamac/simpleSM/State.slowDown:()V
        16: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */

  public void clock();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #32       // Field exited:Z
         4: ifne          16
         7: aload_0
         8: getfield      #36       // Field currentState:Lstamac/simpleSM/State;
        11: invokeinterface #66, 1  // InterfaceMethod stamac/simpleSM/State.clock:()V
        16: return
      StackMapTable: number_of_entries = 1
        frame_type = 16 /* same */
}

State interface:

interface stamac.simpleSM.State
{
  public abstract void enter();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_ABSTRACT

  public abstract void leave();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_ABSTRACT

  public abstract void accelerate();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_ABSTRACT

  public abstract void slowDown();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_ABSTRACT

  public abstract void clock();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_ABSTRACT
}

turnedOff class:

class stamac.simpleSM.turnedOff implements stamac.simpleSM.State
{
  private stamac.simpleSM.StateMachine stateMachine;
    descriptor: Lstamac/simpleSM/StateMachine;
    flags: ACC_PRIVATE

  public stamac.simpleSM.turnedOff(stamac.simpleSM.StateMachine);
    descriptor: (Lstamac/simpleSM/StateMachine;)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #13       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         9: return

  public void enter();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void leave();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void accelerate();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #23       // Method stamac/simpleSM/StateMachine.goTo_lowSpeed:()V
         7: return

  public void slowDown();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void clock();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return
}

lowSpeed class:

class stamac.simpleSM.lowSpeed implements stamac.simpleSM.State
{
  private stamac.simpleSM.StateMachine stateMachine;
    descriptor: Lstamac/simpleSM/StateMachine;
    flags: ACC_PRIVATE

  public stamac.simpleSM.lowSpeed(stamac.simpleSM.StateMachine);
    descriptor: (Lstamac/simpleSM/StateMachine;)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #13       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         9: return

  public void enter();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: aload_0
         5: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         8: getfield      #22       // Field stamac/simpleSM/StateMachine.counter:I
        11: aload_0
        12: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
        15: getfield      #25       // Field stamac/simpleSM/StateMachine.lowSpeedThroughtput:I
        18: iadd
        19: putfield      #22       // Field stamac/simpleSM/StateMachine.counter:I
        22: return

  public void leave();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void accelerate();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #30       // Method stamac/simpleSM/StateMachine.goTo_highSpeed:()V
         7: return

  public void slowDown();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #34       // Method stamac/simpleSM/StateMachine.goTo_turnedOff:()V
         7: return

  public void clock();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #38       // Method stamac/simpleSM/StateMachine.goTo_lowSpeed:()V
         7: return
}

And finally the highSpeed class:

class stamac.simpleSM.highSpeed implements stamac.simpleSM.State
{
  private stamac.simpleSM.StateMachine stateMachine;
    descriptor: Lstamac/simpleSM/StateMachine;
    flags: ACC_PRIVATE

  public stamac.simpleSM.highSpeed(stamac.simpleSM.StateMachine);
    descriptor: (Lstamac/simpleSM/StateMachine;)V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=2, args_size=2
         0: aload_0
         1: invokespecial #13       // Method java/lang/Object."<init>":()V
         4: aload_0
         5: aload_1
         6: putfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         9: return

  public void enter();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=3, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: aload_0
         5: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         8: getfield      #22       // Field stamac/simpleSM/StateMachine.counter:I
        11: aload_0
        12: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
        15: getfield      #25       // Field stamac/simpleSM/StateMachine.highSpeedThroughtput:I
        18: iadd
        19: putfield      #22       // Field stamac/simpleSM/StateMachine.counter:I
        22: return

  public void leave();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void accelerate();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=0, locals=1, args_size=1
         0: return

  public void slowDown();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #31       // Method stamac/simpleSM/StateMachine.goTo_lowSpeed:()V
         7: return

  public void clock();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=1, locals=1, args_size=1
         0: aload_0
         1: getfield      #15       // Field stateMachine:Lstamac/simpleSM/StateMachine;
         4: invokevirtual #35       // Method stamac/simpleSM/StateMachine.goTo_highSpeed:()V
         7: return
}

Types and expression

Let's start with the similarities: in StaMac too we have functions to deal with types. So the
extension methods Type.jvmDescription, Type.localVarTableSize, Type.loadOp, Type.storeOp,
and Type.returnOp are exactly the same as we have seen in the compiler for MiniCalcFun. The
methods canonicalNameToInternalName and canonicalNameToJvmDescription, and the corresponding
extension methods for Class, are reused as well.
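
As a reminder of what the two name-conversion helpers are expected to produce, here is a minimal
sketch. The behavior is inferred from the names and descriptors visible in the disassembled classes
above; the actual implementations are the ones presented for MiniCalcFun and may differ in details.

```kotlin
// Assumed behavior: the internal name uses slashes instead of dots.
fun canonicalNameToInternalName(canonicalName: String): String =
        canonicalName.replace('.', '/')

// Assumed behavior: the descriptor of a reference type is L<internal name>;
fun canonicalNameToJvmDescription(canonicalName: String): String =
        "L${canonicalNameToInternalName(canonicalName)};"

fun main() {
    println(canonicalNameToInternalName("stamac.simpleSM.StateMachine"))
    println(canonicalNameToJvmDescription("stamac.simpleSM.State"))
}
```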
The code that deals with expressions is also very similar: pushAsInt, pushAsDouble, and
pushAsString look exactly the same. The code of push is mostly the same, but it contains some
differences. Let's focus on them:

fun Expression.push(methodVisitor: MethodVisitor, context: CompilationContext) {
    ...
    ... code similar to what we had in MiniCalcFun, omitted
    ...
        is ValueReference -> {
            methodVisitor.visitVarInsn(Opcodes.ALOAD, 0) // this
            methodVisitor.visitFieldInsn(Opcodes.GETFIELD,
                    canonicalNameToInternalName(context.classCname),
                    "stateMachine",
                    canonicalNameToJvmDescription(
                            context.compilation.stateMachineCName))
            methodVisitor.visitFieldInsn(Opcodes.GETFIELD,
                    canonicalNameToInternalName(
                            context.compilation.stateMachineCName),
                    this.symbol.name,
                    this.symbol.referred!!.type.jvmDescription())
        }
        is StringLit -> methodVisitor.visitLdcInsn(
                this.value.removePrefix("\"").removeSuffix("\""))
        is UnaryMinusExpression -> {
            this.value.push(methodVisitor, context)
            when (this.value.type()) {
                is DecimalType -> {
                    methodVisitor.visitLdcInsn(-1.0)
                    methodVisitor.visitInsn(Opcodes.DMUL)
                }
                is IntType -> {
                    methodVisitor.visitLdcInsn(-1)
                    methodVisitor.visitInsn(Opcodes.IMUL)
                }
                else -> throw UnsupportedOperationException(
                        this.value.type().javaClass.canonicalName)
            }
        }
        else -> throw UnsupportedOperationException(
                this.javaClass.canonicalName)
    }
}

ValueReference is treated differently from what we did for MiniCalcFun because of the way we
store global variables and inputs. In the case of StaMac we store them as class fields.

The general structure

We have a JvmCompiler class, the SystemInterface interface, and the Compilation and
CompilationContext classes, as we had in MiniCalcFun.

class JvmCompiler {

    fun compile(ast: StateMachine) = Compilation(ast).compile()

}

interface SystemInterface {
    fun print(message: String)
}

data class CompilationContext(val compilation: Compilation,
                              val classCname: String)

class Compilation(val ast: StateMachine) {
    ...
}

The main method of our compiler asks for a source file but it produces a list of class files, instead of
just one as it did for MiniCalcFun. This is because for MiniCalcFun we generated one class file for
each source file, while for StaMac we generate several.

fun main(args: Array<String>) {
    if (args.size != 1) {
        System.err.println("Exactly one argument expected")
        return
    }
    val sourceFile = File(args[0])
    if (!sourceFile.exists()) {
        System.err.println("Given file does not exist")
        return
    }
    val res = SMLangParserFacade.parse(sourceFile)
    if (res.isCorrect()) {
        val stateMachine = res.root!!
        for (c in JvmCompiler().compile(stateMachine)) {
            val outputFile = File("${c.key.split(".").last()}.class")
            outputFile.writeBytes(c.value)
        }
    } else {
        System.err.println("${res.errors.size} error(s) found\n")
        res.errors.forEach { System.err.println(it) }
    }
}

This is what the Compilation class looks like:

class Compilation(val ast: StateMachine) {
    val smClass = ClassWriter(ClassWriter.COMPUTE_FRAMES
            or ClassWriter.COMPUTE_MAXS)

    val packageName = "stamac.${ast.name}"
    val stateMachineCName = "$packageName.StateMachine"
    val stateInterfaceCName = "$packageName.State"

    private fun stateClassCName(state: StateDeclaration) =
            "$packageName.${state.name}"

    private fun compileStateInterface(classes: HashMap<String, ByteArray>) {
        ...
    }

    private fun compileStateClass(state: StateDeclaration,
                                  classes: HashMap<String, ByteArray>) {
        ...
    }

    private fun smConstructor() {
        ...
    }

    private fun compileStatement(statement: Statement, mv: MethodVisitor,
                                 classCname: String) {
        ...
    }

    private fun goToMethodName(state: StateDeclaration) = "goTo_${state.name}"

    private fun smExitMethod() {
        ...
    }

    private fun smIsExitedMethod() {
        ...
    }

    private fun smEventMethod(event: EventDeclaration) {
        ...
    }

    private fun smGoToStateMethod(state: StateDeclaration) {
        ...
    }

    fun compile() : Map<String, ByteArray> {
        val classes = HashMap<String, ByteArray>()

        // here we specify that the class is in the format introduced with
        // Java 8 (so it would require a JRE >= 8 to run)
        // We also specify the name of the class, the fact it extends Object
        // and it implements no interfaces
        smClass.visit(V1_8, ACC_PUBLIC,
                canonicalNameToInternalName(stateMachineCName), null,
                "java/lang/Object", null)

        smClass.visitField(0, "systemInterface",
                SystemInterface::class.java.jvmDescription(), null, null)
        smClass.visitField(ACC_PRIVATE, "exited", "Z", null, null)
        smClass.visitField(ACC_PRIVATE, "currentState",
                canonicalNameToJvmDescription(stateInterfaceCName), null, null)
        ast.inputs.forEach {
            smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
        }
        ast.variables.forEach {
            smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
        }

        smConstructor()
        smExitMethod()
        smIsExitedMethod()
        ast.states.forEach {
            smGoToStateMethod(it)
        }

        ast.events.forEach {
            smEventMethod(it)
        }

        smClass.visitEnd()
        classes[stateMachineCName] = smClass.toByteArray()

        // add State interface
        compileStateInterface(classes)

        ast.states.forEach {
            // generate State classes
            compileStateClass(it, classes)
        }
        return classes
    }
}

The Compilation receives the AST to compile. It defines a ClassWriter exactly as we did in the
compiler for MiniCalcFun.

class Compilation(val ast: StateMachine) {

    val smClass = ClassWriter(ClassWriter.COMPUTE_FRAMES
            or ClassWriter.COMPUTE_MAXS)

Then we have a few constants and a function that define the names of the classes to generate. CName
stands for canonical name, i.e., the qualified name of a class, which includes the package name.

    val packageName = "stamac.${ast.name}"
    val stateMachineCName = "$packageName.StateMachine"
    val stateInterfaceCName = "$packageName.State"

    private fun stateClassCName(state: StateDeclaration) =
            "$packageName.${state.name}"
    ...
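
The helpers canonicalNameToInternalName and canonicalNameToJvmDescription, used throughout this section, are not listed here. The sketch below is an assumption about what they plausibly do, based on how the JVM names classes: internal names use / instead of . as separator, and the descriptor of a reference type has the form L<internal name>;. The real helpers may differ in detail.

```kotlin
// Assumed implementations: the real helpers are defined elsewhere
// in the compiler and may differ in detail.

// "stamac.sm.State" -> "stamac/sm/State"
fun canonicalNameToInternalName(canonicalName: String): String =
    canonicalName.replace('.', '/')

// "stamac.sm.State" -> "Lstamac/sm/State;"
fun canonicalNameToJvmDescription(canonicalName: String): String =
    "L${canonicalNameToInternalName(canonicalName)};"

fun main() {
    val packageName = "stamac.sm"
    val stateInterfaceCName = "$packageName.State"
    println(canonicalNameToInternalName(stateInterfaceCName))   // stamac/sm/State
    println(canonicalNameToJvmDescription(stateInterfaceCName)) // Lstamac/sm/State;
}
```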

The compile method is where we coordinate the work. We start by preparing a map to collect all
the classes we are going to generate by their name. To be precise we are going to store the actual
bytes corresponding to the class (as ByteArray instances).

We will define the class for the state machine. First we define the fields: systemInterface, exited,
currentState, and then one field for each input and one for each variable.
Then we define the constructor, the exit method, the isExited method, one goTo_xxx method for
each state and one method for each event.
After that we define the State interface and one class for each state.

fun compile() : Map<String, ByteArray> {
    val classes = HashMap<String, ByteArray>()

    // here we specify that the class is in the format introduced with
    // Java 8 (so it would require a JRE >= 8 to run)
    // we also specify the name of the class, the fact it extends Object
    // and it implements no interfaces
    smClass.visit(V1_8, ACC_PUBLIC,
            canonicalNameToInternalName(stateMachineCName), null,
            "java/lang/Object", null)

    smClass.visitField(0, "systemInterface",
            SystemInterface::class.java.jvmDescription(), null, null)
    smClass.visitField(ACC_PRIVATE, "exited", "Z", null, null)
    smClass.visitField(ACC_PRIVATE, "currentState",
            canonicalNameToJvmDescription(stateInterfaceCName), null, null)
    ast.inputs.forEach {
        smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
    }
    ast.variables.forEach {
        smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
    }

    smConstructor()
    smExitMethod()
    smIsExitedMethod()
    ast.states.forEach {
        smGoToStateMethod(it)
    }

    ast.events.forEach {
        smEventMethod(it)
    }

    smClass.visitEnd()
    classes[stateMachineCName] = smClass.toByteArray()

    // add State interface
    compileStateInterface(classes)

    ast.states.forEach {
        // generate State classes
        compileStateClass(it, classes)
    }
    return classes
}

The StateMachine class

The StateMachine class is the only class intended to be used directly. It coordinates all the activities.
The constructor of StateMachine expects an instance of SystemInterface and values for all the
inputs. The signature is similar to the signature we have seen for the constructor of the class
generated for MiniCalcFun.
As before, we first call the super constructor, which is the default constructor of Object.
We then store the first parameter in the field systemInterface. After that we look at all the values
for the inputs, which are passed as parameters, and we store them one by one in separate fields.
Finally we invoke the goTo_xxx method for the start state and we close the constructor.

private fun smConstructor() {
    val constructor = smClass.visitMethod(ACC_PUBLIC, "<init>",
            "(${SystemInterface::class.java.jvmDescription()}" +
            "${ast.inputs.map { it.type.jvmDescription() }.joinToString(separator = "")})V",
            null, null)
    constructor.visitCode()

    // call the constructor of Object
    constructor.visitVarInsn(ALOAD, 0)
    constructor.visitMethodInsn(INVOKESPECIAL,
            Object::class.java.internalName(),
            "<init>", "()V", false)

    // this.systemInterface = first parameter
    constructor.visitVarInsn(ALOAD, 0)
    constructor.visitVarInsn(ALOAD, 1)
    constructor.visitFieldInsn(PUTFIELD,
            canonicalNameToInternalName(stateMachineCName),
            "systemInterface", SystemInterface::class.java.jvmDescription())

    // store each input in its own field
    var index = 2
    ast.inputs.forEach {
        constructor.visitVarInsn(ALOAD, 0)
        constructor.visitVarInsn(it.type.loadOp(), index)
        constructor.visitFieldInsn(PUTFIELD,
                canonicalNameToInternalName(stateMachineCName),
                it.name, it.type.jvmDescription())
        index += it.type.localVarTableSize()
    }

    // go to the start state
    constructor.visitVarInsn(ALOAD, 0)
    constructor.visitMethodInsn(INVOKEVIRTUAL,
            canonicalNameToInternalName(stateMachineCName),
            goToMethodName(ast.startState()), "()V", false)

    constructor.visitInsn(RETURN)
    constructor.visitEnd()
    constructor.visitMaxs(-1, -1)
}
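
To make the descriptor string and the index bookkeeping of the constructor concrete, here is a small sketch in plain Kotlin. SketchType is a hypothetical stand-in for the type classes of the AST, and the descriptor of SystemInterface is passed in as a parameter because its package is not shown in this section. The only JVM facts relied upon are that an int has descriptor I and takes one slot in the local variables table, while a double has descriptor D and takes two.

```kotlin
// Hypothetical stand-in for the type classes of the AST.
enum class SketchType(val descriptor: String, val slots: Int) {
    INT("I", 1),      // an int occupies one local variable slot
    DECIMAL("D", 2)   // a double occupies two slots
}

// Descriptor of a constructor taking a SystemInterface followed by the inputs.
fun constructorDescriptor(systemInterfaceDescriptor: String,
                          inputs: List<SketchType>): String =
    "(" + systemInterfaceDescriptor +
        inputs.joinToString(separator = "") { it.descriptor } + ")V"

// Local variable index of each input parameter:
// slot 0 holds `this` and slot 1 holds the SystemInterface.
fun inputSlots(inputs: List<SketchType>): List<Int> {
    var index = 2
    return inputs.map { type ->
        val slot = index
        index += type.slots
        slot
    }
}

fun main() {
    val inputs = listOf(SketchType.INT, SketchType.DECIMAL, SketchType.INT)
    println(constructorDescriptor("Lexample/SystemInterface;", inputs))
    // (Lexample/SystemInterface;IDI)V
    println(inputSlots(inputs)) // [2, 3, 5]
}
```

This is why the constructor advances index by localVarTableSize() rather than by one: a two-slot value shifts every parameter that follows it.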

What do these goTo_xxx methods look like?

They are named after the states, so if we have the states foo, bar and zum we will have three of these
methods: goTo_foo, goTo_bar and goTo_zum. None of these methods takes parameters or returns
anything.
We start the method by loading the value of the field currentState onto the stack. Then we check if
that value is null (IFNULL). If it is, we jump to the label afterCallToExitCurrentState, so that we
skip some instructions.
The instructions we skip when currentState is null are:

• loading the value of currentState
• invoking the method leave on it

So what does this do in practice? If currentState is not null we call the method leave on it. The
currentState is null only when we first start the state machine and we are not yet in any state:
when we go to the first state (the start state) we have no state to leave. From then on, whenever we
call one of these goTo_xxx methods there is a state to leave, and we leave it by calling the method
leave on it.
Once we have done that we want to instantiate the class representing the state to which we are
going. After we have instantiated it we assign it to the field currentState. Finally we execute the
method enter on the new value of currentState. In this case we are sure this value is not null, so
there is no reason to check.

private fun smGoToStateMethod(state: StateDeclaration) {
    val mv = smClass.visitMethod(0, goToMethodName(state), "()V", null, null)
    mv.visitCode()

    val afterCallToExitCurrentState = Label()

    // leave the current state, if there is one
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "currentState",
            canonicalNameToJvmDescription(stateInterfaceCName))
    mv.visitJumpInsn(IFNULL, afterCallToExitCurrentState)
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "currentState",
            canonicalNameToJvmDescription(stateInterfaceCName))
    mv.visitMethodInsn(INVOKEINTERFACE,
            canonicalNameToInternalName(stateInterfaceCName),
            "leave", "()V", true)
    mv.visitLabel(afterCallToExitCurrentState)

    // assign field
    mv.visitVarInsn(ALOAD, 0) // push this for PUTFIELD
    mv.visitTypeInsn(NEW, canonicalNameToInternalName(stateClassCName(state)))
    mv.visitInsn(DUP) // so we will have 2 copies of the reference to the
                      // instantiated state: we will consume the first while
                      // calling the constructor and the second as the value
                      // for PUTFIELD
    mv.visitVarInsn(ALOAD, 0) // push this as the argument of the constructor
    mv.visitMethodInsn(INVOKESPECIAL,
            canonicalNameToInternalName(stateClassCName(state)), "<init>",
            "(${canonicalNameToJvmDescription(stateMachineCName)})V", false)
    mv.visitFieldInsn(PUTFIELD, canonicalNameToInternalName(stateMachineCName),
            "currentState", canonicalNameToJvmDescription(stateInterfaceCName))

    // enter the new state
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "currentState", canonicalNameToJvmDescription(stateInterfaceCName))
    mv.visitMethodInsn(INVOKEINTERFACE,
            canonicalNameToInternalName(stateInterfaceCName), "enter", "()V", true)

    mv.visitInsn(RETURN)
    mv.visitEnd()
    mv.visitMaxs(-1, -1)
}
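
It can help to read the bytecode above next to a source-level equivalent. The following is only a sketch with illustrative names (Machine, LoggingState): the compiler emits one goTo_xxx method per state and never produces Kotlin source, while this sketch uses a single goTo method with a parameter to keep things short.

```kotlin
// Source-level sketch of what a generated goTo_xxx method does.
// All names here are illustrative; the compiler emits bytecode directly.
interface State {
    fun enter()
    fun leave()
}

class LoggingState(private val name: String,
                   private val log: MutableList<String>) : State {
    override fun enter() { log.add("enter $name") }
    override fun leave() { log.add("leave $name") }
}

class Machine(private val log: MutableList<String>) {
    private var currentState: State? = null

    fun goTo(name: String) {
        // IFNULL skips the call to leave when there is no current state yet
        currentState?.leave()
        // NEW, DUP, INVOKESPECIAL <init>, PUTFIELD currentState
        val next = LoggingState(name, log)
        currentState = next
        // GETFIELD currentState, INVOKEINTERFACE enter
        next.enter()
    }
}

fun transitions(): List<String> {
    val log = mutableListOf<String>()
    val machine = Machine(log)
    machine.goTo("foo") // nothing to leave the first time
    machine.goTo("bar")
    return log
}

fun main() {
    println(transitions()) // [enter foo, leave foo, enter bar]
}
```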

Similar to the goTo_xxx methods is the exit method. This method is not public either, because it is
not intended to be called directly by the users of our compiled class.

private fun smExitMethod() {
    val mv = smClass.visitMethod(0, "exit", "()V", null, null)
    mv.visitCode()
    mv.visitVarInsn(ALOAD, 0)
    mv.visitLdcInsn(true)
    mv.visitFieldInsn(PUTFIELD, canonicalNameToInternalName(stateMachineCName),
            "exited", "Z")
    mv.visitInsn(RETURN)
    mv.visitEnd()
    mv.visitMaxs(-1, -1)
}

It simply sets the field exited to true.

A related method is isExited, which returns the value of the exited field.

private fun smIsExitedMethod() {
    val mv = smClass.visitMethod(ACC_PUBLIC, "isExited", "()Z", null, null)
    mv.visitCode()
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "exited", "Z")
    mv.visitInsn(IRETURN)
    mv.visitEnd()
    mv.visitMaxs(-1, -1)
}

Finally we have the main public methods the users will need to interact with the state machine.
These methods are named after the events and they let the user report that an event has been
received.
Each of them loads the value of the field exited and checks whether it is true (opcode IFNE, which
jumps when the value is not zero). If that is the case the rest of the method is skipped, as we jump
directly to the RETURN instruction.
If instead the field exited is false we continue by invoking the method corresponding to the event
on the field currentState. For example, if we are defining the method foo on the StateMachine,
we will invoke currentState.foo(). In other words, we delegate to currentState the decision on how
to react to the event received.

private fun smEventMethod(event: EventDeclaration) {
    val mv = smClass.visitMethod(ACC_PUBLIC, event.name, "()V", null, null)
    mv.visitCode()
    val ret = Label()
    // if exited, skip to the end of the method
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "exited", "Z")
    mv.visitJumpInsn(IFNE, ret)
    // otherwise delegate the event to the current state
    mv.visitVarInsn(ALOAD, 0)
    mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(stateMachineCName),
            "currentState",
            canonicalNameToJvmDescription(stateInterfaceCName))
    mv.visitMethodInsn(INVOKEINTERFACE,
            canonicalNameToInternalName(stateInterfaceCName),
            event.name, "()V", true)
    mv.visitLabel(ret)
    mv.visitInsn(RETURN)
    mv.visitEnd()
    mv.visitMaxs(-1, -1)
}
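
The guard on exited followed by the delegation can also be read as ordinary source code. This is a minimal sketch with illustrative names (Machine, CountingState, a single clock event). In the generated class exit is not public and is called by the state classes, while here it is public only to keep the example short.

```kotlin
// Source-level sketch of a generated event method and of exit/isExited.
interface State {
    fun clock()
}

class CountingState : State {
    var received = 0
    override fun clock() { received++ }
}

class Machine(private val currentState: State) {
    private var exited = false

    fun exit() { exited = true }

    fun isExited() = exited

    fun clock() {
        if (exited) return     // IFNE: exited is not zero, jump to RETURN
        currentState.clock()   // GETFIELD currentState, INVOKEINTERFACE clock
    }
}

fun deliveredEvents(): Int {
    val state = CountingState()
    val machine = Machine(state)
    machine.clock()
    machine.clock()
    machine.exit()
    machine.clock() // ignored: the machine has exited
    return state.received
}

fun main() {
    println(deliveredEvents()) // 2
}
```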

The State interface

It is time to examine how the State interface is defined. There is no code involved here: we just
declare the methods. All of them are public and abstract, because they are interface methods. We
always have enter and leave. Then we also have one method for each event, named after the event
itself.

private fun compileStateInterface(classes: HashMap<String, ByteArray>) {
    val interfaceClass = ClassWriter(ClassWriter.COMPUTE_FRAMES
            or ClassWriter.COMPUTE_MAXS)
    interfaceClass.visit(V1_8, ACC_INTERFACE or ACC_ABSTRACT,
            canonicalNameToInternalName(stateInterfaceCName),
            null, "java/lang/Object", null)
    interfaceClass.visitMethod(ACC_PUBLIC or ACC_ABSTRACT, "enter", "()V",
            null, null)
    interfaceClass.visitMethod(ACC_PUBLIC or ACC_ABSTRACT, "leave", "()V",
            null, null)
    ast.events.forEach {
        interfaceClass.visitMethod(ACC_PUBLIC or ACC_ABSTRACT, it.name, "()V",
                null, null)
    }
    interfaceClass.visitEnd()
    classes[stateInterfaceCName] = interfaceClass.toByteArray()
}

The classes for each state

We define a class for each state. While defining it we specify that it implements one interface: the
State interface (arrayOf(canonicalNameToInternalName(stateInterfaceCName))).
The class will have a stateMachine field. In the constructor we will receive the reference to the
state machine and store it in that field.
We then have the enter and leave methods. They simply contain all the code for the statements
associated with the on-entry and on-exit blocks.

private fun compileStateClass(state: StateDeclaration,
                              classes: HashMap<String, ByteArray>) {
    // define the class for this state: it implements the State interface
    val stateClass = ClassWriter(ClassWriter.COMPUTE_FRAMES
            or ClassWriter.COMPUTE_MAXS)
    stateClass.visit(V1_8, 0,
            canonicalNameToInternalName(stateClassCName(state)), null,
            "java/lang/Object",
            arrayOf(canonicalNameToInternalName(stateInterfaceCName)))

    stateClass.visitField(ACC_PRIVATE, "stateMachine",
            canonicalNameToJvmDescription(stateMachineCName), null, null)

    // constructor: store the reference to the state machine
    val constructor = stateClass.visitMethod(ACC_PUBLIC, "<init>",
            "(${canonicalNameToJvmDescription(stateMachineCName)})V",
            null, null)
    constructor.visitCode()
    constructor.visitVarInsn(ALOAD, 0)
    constructor.visitMethodInsn(INVOKESPECIAL,
            Object::class.java.internalName(),
            "<init>", "()V", false)
    constructor.visitVarInsn(ALOAD, 0)
    constructor.visitVarInsn(ALOAD, 1)
    constructor.visitFieldInsn(PUTFIELD,
            canonicalNameToInternalName(stateClassCName(state)),
            "stateMachine",
            canonicalNameToJvmDescription(stateMachineCName))
    constructor.visitInsn(RETURN)
    constructor.visitEnd()
    constructor.visitMaxs(-1, -1)

    // enter: statements of the on-entry blocks
    val enterMethod = stateClass.visitMethod(ACC_PUBLIC, "enter", "()V",
            null, null)
    enterMethod.visitCode()
    state.blocks.filterIsInstance(OnEntryBlock::class.java).forEach {
        it.statements.forEach {
            compileStatement(it, enterMethod, stateClassCName(state))
        }
    }
    enterMethod.visitInsn(RETURN)
    enterMethod.visitEnd()
    enterMethod.visitMaxs(-1, -1)

    // leave: statements of the on-exit blocks
    val leaveMethod = stateClass.visitMethod(ACC_PUBLIC, "leave", "()V",
            null, null)
    leaveMethod.visitCode()
    state.blocks.filterIsInstance(OnExitBlock::class.java).forEach {
        it.statements.forEach {
            compileStatement(it, leaveMethod, stateClassCName(state))
        }
    }
    leaveMethod.visitInsn(RETURN)
    leaveMethod.visitEnd()
    leaveMethod.visitMaxs(-1, -1)

    // one method per event: go to the target state if there is a transition
    ast.events.forEach { e ->
        val eventMethod = stateClass.visitMethod(ACC_PUBLIC, e.name, "()V",
                null, null)
        eventMethod.visitCode()

        val transition = state.blocks.filterIsInstance(OnEventBlock::class.java)
                .find { it.event.referred!! == e }
        if (transition != null) {
            eventMethod.visitVarInsn(ALOAD, 0)
            eventMethod.visitFieldInsn(GETFIELD,
                    canonicalNameToInternalName(stateClassCName(state)),
                    "stateMachine",
                    canonicalNameToJvmDescription(stateMachineCName))
            eventMethod.visitMethodInsn(INVOKEVIRTUAL,
                    canonicalNameToInternalName(stateMachineCName),
                    goToMethodName(transition.destination.referred!!),
                    "()V", false)
        }

        eventMethod.visitInsn(RETURN)
        eventMethod.visitEnd()
        eventMethod.visitMaxs(-1, -1)
    }

    stateClass.visitEnd()
    classes[stateClassCName(state)] = stateClass.toByteArray()
}

Consider this example:

state lowSpeed {
    on entry {
        counter = counter + lowSpeedThroughtput
    }
    on accelerate -> highSpeed
    on slowDown -> turnedOff
    on clock -> lowSpeed
}

In this case the enter method will contain the code for the statement counter = counter +
lowSpeedThroughtput, while the leave method will be empty. The code for each statement is
generated by the compileStatement method shown below.
We then have one method for each event. To compile each of these methods we check if there is
a transition in that state for that event. If there is none, the method just returns. In the previous
example the state lowSpeed has a transition on the event accelerate. That transition goes to the
state highSpeed, so we need to generate code that expresses that. We do it by calling the goTo_xxx
method corresponding to the target state on the StateMachine instance. In this case we would
invoke stateMachine.goTo_highSpeed().
All that remains is to see how statements are compiled:

private fun compileStatement(statement: Statement, mv: MethodVisitor,
                             classCname: String) {
    when (statement) {
        is Print -> {
            // we should call the method print of the field systemInterface
            // of the state machine
            mv.visitVarInsn(ALOAD, 0) // this
            mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(classCname),
                    "stateMachine",
                    canonicalNameToJvmDescription(stateMachineCName))
            mv.visitFieldInsn(GETFIELD,
                    canonicalNameToInternalName(stateMachineCName),
                    "systemInterface",
                    SystemInterface::class.java.jvmDescription())
            statement.value.pushAsString(mv,
                    CompilationContext(this, classCname))
            mv.visitMethodInsn(INVOKEINTERFACE,
                    SystemInterface::class.java.internalName(),
                    "print", "(${String::class.java.jvmDescription()})V", true)
        }
        is Assignment -> {
            mv.visitVarInsn(ALOAD, 0) // this
            mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(classCname),
                    "stateMachine",
                    canonicalNameToJvmDescription(stateMachineCName))
            statement.value.push(mv, CompilationContext(this, classCname))
            mv.visitFieldInsn(PUTFIELD,
                    canonicalNameToInternalName(stateMachineCName),
                    statement.variable.name,
                    statement.variable.referred!!.type.jvmDescription())
        }
        is Exit -> {
            mv.visitVarInsn(ALOAD, 0) // this
            mv.visitFieldInsn(GETFIELD, canonicalNameToInternalName(classCname),
                    "stateMachine",
                    canonicalNameToJvmDescription(stateMachineCName))
            mv.visitMethodInsn(INVOKEVIRTUAL,
                    canonicalNameToInternalName(stateMachineCName),
                    "exit", "()V", false)
        }
        else -> throw UnsupportedOperationException(
                statement.javaClass.canonicalName)
    }
}

The Print statement works as we have seen in MiniCalcFun.

The Assignment statement is different only because we store variables in fields, instead of using
entries in the local variables table. We use fields of the StateMachine instance, so we first load the
stateMachine field (ALOAD 0, GETFIELD), then we push the value and finally we execute PUTFIELD.

Finally, the Exit statement just causes us to invoke stateMachine.exit().

Tests

The general structure of the tests for the StaMac compiler is practically the same as the one we had
for MiniCalcFun:

class JvmCompilerTest {

    fun compile(code: String): Class<*> {
        val res = SMLangParserFacade.parse(code)
        assertTrue(res.isCorrect(), res.errors.toString())
        val stateMachine = res.root!!
        val classesBytecode = JvmCompiler().compile(stateMachine)
        val classes = HashMap<String, Class<*>>()
        classesBytecode.forEach { name, bytes ->
            classes[name.replace("/", ".")] =
                    MyClassLoader(classesBytecode).loadClass(name.replace("/", "."))
        }
        return classes["stamac.sm.StateMachine"]!!
    }

    // looks up the bytecode of each class by name in the given map
    class MyClassLoader(val bytes: Map<String, ByteArray>) : ClassLoader() {
        override fun findClass(name: String?): Class<*> {
            return defineClass(name, bytes[name], 0, bytes[name]!!.size)
        }
    }

    class TestSystemInterface : SystemInterface {

        val output = LinkedList<String>()

        override fun print(message: String) {
            output.add(message)
        }

    }

    ...
}

The only difference is that in this case we can produce several classes, not just one. So we slightly
adapted MyClassLoader.
This is an actual test based on the example we have used in this section:

@Test fun exampleMachinery() {
    val clazz = compile("""statemachine sm

        input lowSpeedThroughtput : Int
        input highSpeedThroughtput : Int
        var counter = 0

        event accelerate
        event slowDown
        event clock

        start state turnedOff {
            on accelerate -> lowSpeed
        }

        state lowSpeed {
            on entry {
                counter = counter + lowSpeedThroughtput
                print(counter)
            }
            on accelerate -> highSpeed
            on slowDown -> turnedOff
            on clock -> lowSpeed
        }

        state highSpeed {
            on entry {
                counter = counter + highSpeedThroughtput
                print(counter)
            }
            on slowDown -> lowSpeed
            on clock -> highSpeed
        }""")

    val systemInterface = TestSystemInterface()
    val instance = clazz.declaredConstructors[0].newInstance(
            systemInterface, 2, 5)
    assertEquals(emptyList<String>(), systemInterface.output)
    clazz.methods.find { it.name == "accelerate" }!!.invoke(instance)
    assertEquals(listOf("2"), systemInterface.output)
    clazz.methods.find { it.name == "clock" }!!.invoke(instance)
    assertEquals(listOf("2", "4"), systemInterface.output)
    clazz.methods.find { it.name == "clock" }!!.invoke(instance)
    assertEquals(listOf("2", "4", "6"), systemInterface.output)
    clazz.methods.find { it.name == "accelerate" }!!.invoke(instance)
    assertEquals(listOf("2", "4", "6", "11"), systemInterface.output)
    clazz.methods.find { it.name == "slowDown" }!!.invoke(instance)
    assertEquals(listOf("2", "4", "6", "11", "13"), systemInterface.output)
}

We added some print statements to our original example, so that we can verify the output in our
assertions.
In the test we start by instantiating the StateMachine class returned by compile. We do that by
passing an instance of TestSystemInterface and values for the inputs (lowSpeedThroughtput, and
highSpeedThroughtput in this case).

Then we invoke the methods corresponding to the different events and we verify that the correct
values are printed.
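
To see why these are the expected values, it can be useful to write by hand, in plain Kotlin, classes with the same shape as the ones the compiler generates for this example. This is only a sketch: the class and method names (TurnedOff, goToLowSpeed and so on) are adapted to Kotlin conventions, and the private helper switchTo is an illustration that stands in for the body shared by the generated goTo_xxx methods.

```kotlin
interface SystemInterface { fun print(message: String) }

interface State {
    fun enter()
    fun leave()
    fun accelerate()
    fun slowDown()
    fun clock()
}

class StateMachine(val systemInterface: SystemInterface,
                   val lowSpeedThroughtput: Int,
                   val highSpeedThroughtput: Int) {
    var counter = 0
    private var exited = false
    private var currentState: State? = null

    init { goToTurnedOff() } // the constructor goes to the start state

    // leave the current state (if any), instantiate the new one, enter it
    private fun switchTo(newState: () -> State) {
        currentState?.leave()
        val next = newState()
        currentState = next
        next.enter()
    }

    fun goToTurnedOff() = switchTo { TurnedOff(this) }
    fun goToLowSpeed() = switchTo { LowSpeed(this) }
    fun goToHighSpeed() = switchTo { HighSpeed(this) }

    fun exit() { exited = true }
    fun isExited() = exited

    // events are ignored after exit, otherwise delegated to the current state
    fun accelerate() { if (!exited) currentState!!.accelerate() }
    fun slowDown() { if (!exited) currentState!!.slowDown() }
    fun clock() { if (!exited) currentState!!.clock() }
}

class TurnedOff(private val stateMachine: StateMachine) : State {
    override fun enter() {}
    override fun leave() {}
    override fun accelerate() = stateMachine.goToLowSpeed()
    override fun slowDown() {}
    override fun clock() {}
}

class LowSpeed(private val stateMachine: StateMachine) : State {
    override fun enter() {
        stateMachine.counter = stateMachine.counter + stateMachine.lowSpeedThroughtput
        stateMachine.systemInterface.print(stateMachine.counter.toString())
    }
    override fun leave() {}
    override fun accelerate() = stateMachine.goToHighSpeed()
    override fun slowDown() = stateMachine.goToTurnedOff()
    override fun clock() = stateMachine.goToLowSpeed()
}

class HighSpeed(private val stateMachine: StateMachine) : State {
    override fun enter() {
        stateMachine.counter = stateMachine.counter + stateMachine.highSpeedThroughtput
        stateMachine.systemInterface.print(stateMachine.counter.toString())
    }
    override fun leave() {}
    override fun accelerate() {}
    override fun slowDown() = stateMachine.goToLowSpeed()
    override fun clock() = stateMachine.goToHighSpeed()
}

// Replays the sequence of events used in the test above
fun runExample(): List<String> {
    val output = mutableListOf<String>()
    val machine = StateMachine(object : SystemInterface {
        override fun print(message: String) { output.add(message) }
    }, 2, 5)
    machine.accelerate()
    machine.clock()
    machine.clock()
    machine.accelerate()
    machine.slowDown()
    return output
}

fun main() {
    println(runExample()) // [2, 4, 6, 11, 13]
}
```

Note that every clock event in lowSpeed re-enters the state, which is why the counter grows by 2 each time.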

Summary
In this chapter we have learned how the JVM works and we have defined two JVM compilers for
two different languages.
There are undoubtedly some similarities and a common structure in the way we work with types,
expressions and statements. However, the general structure of the generated classes can vary a lot,
depending on the nature of the language. In this chapter we have examined the most common and
useful concepts you can leverage to write powerful compilers for the JVM. There is still a lot to learn:
inner classes, invokedynamic, control-flow statements. We could not cover all of it in one
chapter, but at this stage you should be familiar with how compilers for the JVM work and you can
keep going from here. Remember that the JVM Specification is a very useful resource.
12. Generate LLVM bitcode
Part III: editing support
13. Syntax highlighting
14. Auto completion
Write to me
I would be extremely grateful if you could share with me your feedback. Write to me about your
ideas, suggestions, comments at federico@tomassetti.me
If you want to read more about these topics you can find articles on my blog on Language
Engineering.

https://tomassetti.me
