
Compiler Construction

Maher Ahmed Raza
Virtual Machines for Compilers
We will discuss how virtual machines, together with compilers, help in executing a
program. Before we discuss anything, we should know what a compiler is and how
it works. We should also know what a virtual machine is and how it works. Then,
finally, we will see how a program is converted from source code to machine code
and executed on the machine.

Compiler
A compiler is a piece of software that translates a source program from a
source language into an equivalent program in a target language. An important feature
of a compiler is that it identifies errors in the source program during the
compilation/translation process.

Structure
A compiler consists of many phases. This makes it easier to understand a
compiler and to implement it. We will give an overview of the phases one by one.

Lexical Analyzer The first phase of a compiler is called lexical analysis or
scanning. The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called lexemes. For
each lexeme, the lexical analyzer produces as output a token of the form
(token-name, attribute-value) that it passes on to the subsequent phase, syntax
analysis.
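As a rough illustration of this phase (not taken from any particular compiler), the
sketch below groups characters into identifier, number and operator lexemes and emits
(token-name, attribute-value) pairs; the TinyLexer, Token and TokenType names are
invented for this example.

import java.util.ArrayList;
import java.util.List;

// Minimal scanner sketch: groups characters into lexemes and emits
// (token-name, attribute-value) pairs. All names here are illustrative.
public class TinyLexer {
    enum TokenType { IDENT, NUMBER, PLUS, ASSIGN, EOF }

    record Token(TokenType type, String lexeme) { }

    public static List<Token> scan(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isLetter(c)) {                  // identifier lexeme
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(new Token(TokenType.IDENT, src.substring(start, i)));
            } else if (Character.isDigit(c)) {            // integer literal lexeme
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token(TokenType.NUMBER, src.substring(start, i)));
            } else if (c == '+') {
                tokens.add(new Token(TokenType.PLUS, "+")); i++;
            } else if (c == ':' && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                tokens.add(new Token(TokenType.ASSIGN, ":=")); i += 2;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        tokens.add(new Token(TokenType.EOF, ""));
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the token stream for the statement used later in these notes.
        System.out.println(scan("i := i + j + k"));
    }
}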

Syntax Analyzer The second phase of the compiler is syntax analysis or parsing.
The parser uses the first components of the tokens produced by the lexical analyzer
to create a tree-like intermediate representation that depicts the grammatical
structure of the token stream. A typical representation is a syntax tree in which
each interior node represents an operation and the children of the node represent
the arguments of the operation. The parser analyzes the source code (token stream)
against the production rules to detect any errors in the code. The output of this
phase is a parse tree. In this way, the parser accomplishes two tasks: parsing
the code to look for errors, and generating a parse tree as the output of the phase.
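As a minimal sketch of this idea (assuming a toy grammar of operands joined by '+',
with class names invented for the example), a recursive-descent parser can build a
tree whose interior nodes represent operations and whose leaves represent operands.

// Recursive-descent parser sketch for sums such as "i + j + k".
public class TinyParser {
    // Interior nodes represent operations; leaves represent operands.
    interface Node { }
    record Leaf(String name) implements Node { }
    record BinOp(String op, Node left, Node right) implements Node { }

    private final String[] tokens;
    private int pos = 0;

    TinyParser(String input) { this.tokens = input.trim().split("\\s+"); }

    // expr -> operand ('+' operand)*
    Node parseExpr() {
        Node left = parseOperand();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++;                                    // consume '+'
            left = new BinOp("+", left, parseOperand());
        }
        return left;
    }

    Node parseOperand() {
        if (pos >= tokens.length) throw new IllegalStateException("unexpected end of input");
        return new Leaf(tokens[pos++]);
    }

    public static void main(String[] args) {
        // Builds BinOp(+, BinOp(+, i, j), k) for the expression used later.
        System.out.println(new TinyParser("i + j + k").parseExpr());
    }
}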

Semantic Analyzer The semantic analyzer uses the syntax tree and the
information in the symbol table to check the source program for semantic
consistency. It performs two functions.

Scope checking: verify that all applied occurrences of identifiers are declared
Type checking: verify that all operations in the program are used according to
their type rules.

The output of the semantic analyzer is a parse tree or abstract syntax tree (AST),
which is then fed to the intermediate code generator.
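A minimal sketch of these two checks, assuming a flat (single-scope) symbol table and
method names invented for the example, might look as follows.

import java.util.HashMap;
import java.util.Map;

// Semantic-analysis sketch: a flat symbol table used for scope checking
// (is the identifier declared?) and a simple type check on assignments.
public class TinyChecker {
    private final Map<String, String> symbolTable = new HashMap<>(); // name -> type

    void declare(String name, String type) {
        if (symbolTable.containsKey(name))
            throw new IllegalStateException("redeclaration of " + name);
        symbolTable.put(name, type);
    }

    // Scope checking: every applied occurrence must have a declaration.
    String lookup(String name) {
        String type = symbolTable.get(name);
        if (type == null) throw new IllegalStateException(name + " is not declared");
        return type;
    }

    // Type checking: both sides of an assignment must have the same type.
    void checkAssignment(String lhs, String rhs) {
        if (!lookup(lhs).equals(lookup(rhs)))
            throw new IllegalStateException("type mismatch: " + lhs + " := " + rhs);
    }

    public static void main(String[] args) {
        TinyChecker checker = new TinyChecker();
        checker.declare("n", "integer");
        checker.declare("c", "char");
        checker.checkAssignment("n", "n");       // ok
        try {
            checker.checkAssignment("n", "c");   // detected as a type mismatch
        } catch (IllegalStateException e) {
            System.out.println("semantic error: " + e.getMessage());
        }
    }
}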

Intermediate Code Generation This is the phase where the compiler produces a
low-level intermediate representation of the source program, which can have a variety
of forms:
1. Postfix notation
2. Syntax tree
3. Three address code
The most popular of the above representations is three address code.
Three Address Code Three address code is a sequence of statements of the
general form
a := b op c
where a, b and c are identifier names and op represents an operator; the symbol
:= stands for the assignment operator.
This representation is called three address code because each statement has three
addresses, two for the operands and one for the result.
Quadruples and Triples The three address code is usually implemented using one
of the following representations.

Quadruples
Triples

Quadruples In quadruples we represent three address code with four fields:
1. Op (operator)
2. Arg1 (argument 1)
3. Arg2 (argument 2)
4. Result

For example, if the expression is i := i + j + k then the three address code will be
t1 := i + j
i := t1 + k
The quadruple representation is

        Op      Arg1    Arg2    Result
(0)     +       i       j       t1
(1)     +       t1      k       i
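In an implementation, each quadruple is simply a record with these four fields. A
minimal sketch (class and field names invented for the example):

// Quadruple sketch: each three-address instruction is stored with four
// fields (op, arg1, arg2, result), matching the table above.
public class Quads {
    record Quadruple(String op, String arg1, String arg2, String result) { }

    public static void main(String[] args) {
        Quadruple[] code = {
            new Quadruple("+", "i", "j", "t1"),   // t1 := i + j
            new Quadruple("+", "t1", "k", "i")    // i  := t1 + k
        };
        for (Quadruple q : code) {
            System.out.printf("%s := %s %s %s%n", q.result(), q.arg1(), q.op(), q.arg2());
        }
    }
}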

Triples In triples we represent three address code with three fields:
1. Op (operator)
2. Arg1 (argument 1)
3. Arg2 (argument 2)
The triples representation of the above expression will be

        Op      Arg1    Arg2
(0)     +       i       j
(1)     +       (0)     k

Code Optimization Code optimization is a program transformation technique which tries
to improve the code by making it consume fewer resources (e.g. CPU, memory) and
deliver higher speed.
Optimization depends on various factors, such as:
1. Memory
2. Algorithm
3. Execution time
4. Programming language
5. Others

In optimization, high-level general programming constructs are replaced by very
efficient low-level programming code. A code optimizing process must follow the
three rules given below:

The output code must not, in any way, change the meaning of the program.

Optimization should increase the speed of the program and, if possible, the
program should demand fewer resources.

Optimization should itself be fast and should not delay the overall compiling
process.
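One classic example of such a transformation is constant folding, where an operation on
known constants is evaluated at compile time instead of at run time. The sketch below
(class names invented, not tied to any real compiler) folds constant additions in a
small expression tree.

// Constant-folding sketch: if both operands of '+' are known constants,
// the addition is performed at compile time.
public class ConstantFolding {
    interface Expr { }
    record Num(int value) implements Expr { }
    record Var(String name) implements Expr { }
    record Add(Expr left, Expr right) implements Expr { }

    static Expr fold(Expr e) {
        if (e instanceof Add add) {
            Expr left = fold(add.left()), right = fold(add.right());
            if (left instanceof Num a && right instanceof Num b)
                return new Num(a.value() + b.value());   // evaluated at compile time
            return new Add(left, right);
        }
        return e;
    }

    public static void main(String[] args) {
        // x + (2 + 3)  ->  x + 5
        System.out.println(fold(new Add(new Var("x"), new Add(new Num(2), new Num(3)))));
    }
}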

Code Generation The code generator takes as input an intermediate
representation of the source program and maps it into the target language.
The target program may be in:
1. Assembly language
2. Relocatable machine code or
3. Absolute machine code
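As a hedged sketch of this mapping, the example below translates the two three-address
statements produced earlier into an assembly-like listing; the LOAD/ADD/STORE mnemonics
are invented for the illustration and do not correspond to a real instruction set.

import java.util.List;

// Code-generation sketch: maps three-address statements a := b op c onto
// an illustrative three-instruction pattern (LOAD / ADD / STORE).
public class CodeGen {
    record TAC(String op, String arg1, String arg2, String result) { }

    static void emit(TAC t) {
        System.out.println("LOAD  " + t.arg1());
        System.out.println(("+".equals(t.op()) ? "ADD   " : "OP    ") + t.arg2());
        System.out.println("STORE " + t.result());
    }

    public static void main(String[] args) {
        // The two quadruples produced earlier for i := i + j + k.
        List<TAC> code = List.of(
            new TAC("+", "i", "j", "t1"),
            new TAC("+", "t1", "k", "i"));
        code.forEach(CodeGen::emit);
    }
}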

Virtual Machine
Now that we have seen how the compiler works, we are going to see what a virtual
machine is.
A virtual machine (VM) is a software implementation of a machine that executes
programs like a physical machine. It shares physical hardware resources with
other users but isolates the OS or application to avoid changing the end-user
experience.

Why do we need a virtual machine?

We need virtual machines because compilers for languages like Java and C#
often output an intermediate representation as the target language, e.g. bytecode or an
intermediate language. We could output machine code directly from the compiler, but
that defeats the purpose of higher-level languages. We use an intermediate
representation because it makes it possible to transfer the compiled output to
different computers and architectures. We can then feed this IR to a virtual machine to
get the machine code.
We are going to see how the Java Virtual Machine (JVM) works and how it executes
the bytecode output of the compiler.
Java Virtual Machine (JVM)

The Java Virtual Machine is an abstract computing machine. Like a real computing
machine, it has an instruction set and manipulates various memory areas at run
time. It is reasonably common to implement a programming language using a
virtual machine. The Java Virtual Machine knows nothing of the Java programming
language, only of a particular binary format, the class file format. A class file
contains Java Virtual Machine instructions (or bytecodes) and a symbol table, as well
as other ancillary information.
For the sake of security, the Java Virtual Machine imposes strong syntactic and
structural constraints on the code in a class file. However, any language with
functionality that can be expressed in terms of a valid class file can be hosted by
the Java Virtual Machine. Attracted by a generally available, machine-independent
platform, implementers of other languages can turn to the Java Virtual Machine as a
delivery vehicle for their languages.
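To get a feel for what a class file contains, a trivial method can be compiled with
javac and disassembled with the JDK's javap tool. The exact listing depends on the
compiler and JDK version, so treat the bytecode shown in the comments as an
illustration rather than guaranteed output.

// Compile with:  javac Adder.java
// Inspect with:  javap -c Adder
public class Adder {
    static int add(int a, int b) {
        return a + b;
    }
}
// For add(int, int), javap typically shows stack-based instructions such as:
//   iload_0    // push parameter a onto the operand stack
//   iload_1    // push parameter b
//   iadd       // pop both, push a + b
//   ireturn    // return the value on top of the operand stack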
Architecture of JVM

Class Loader Subsystem


The class loader subsystem consists of three phases: load, link and initialize.

Load is the part responsible for loading bytecode into memory. The class loader
loads files from different sources using different loaders, such as:

Bootstrap class loader: responsible for loading Java internal classes from
rt.jar, which is distributed with the JVM.
Extension class loader: responsible for loading additional application jars
that reside in jre/lib/ext.
Application class loader: loads classes from the values specified in your
CLASSPATH environment variable and from folders given with the -cp parameter.
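The loader hierarchy can be observed from a running program through the standard
ClassLoader API, as in the sketch below; note that on Java 9 and later the extension
loader is replaced by a platform class loader, and the bootstrap loader is reported
as null.

// Prints the class loader chain for this class. Output differs between
// JDK versions; the bootstrap class loader appears as null.
public class LoaderDemo {
    public static void main(String[] args) {
        ClassLoader cl = LoaderDemo.class.getClassLoader();
        while (cl != null) {
            System.out.println(cl);
            cl = cl.getParent();
        }
        System.out.println("null (bootstrap class loader)");
    }
}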

Link is the phase where much of the work is done. It consists of three parts:

Verify This is the part where the bytecode is verified according to the JVM
class file specification.
Prepare This is the part where memory is allocated for the static
variables inside the class file. The memory locations are then initialized with
default values.
Resolve In this part all the symbolic references used by the current class are
resolved into actual references, for example when one class has a reference to
another class.

Initialization This is the phase where the actual values of the static variables defined
in the source code are set, unlike prepare, where the default values are set.

Runtime Data Areas In this area memory is reserved for all the parts of the
program. It consists of five parts.

Method Area The place where metadata corresponding to classes is stored,
e.g. static variables, bytecodes and the class-level constant pool. It was also
called PermGen space, with only 64 MB of memory allocated by default. In Java 8 it
is called Metaspace and can automatically expand and shrink based on
requirements.
Heap Area The place where object data is stored. All the instance variables,
arrays etc. are stored in the heap. -Xms is used to set the minimum size and -Xmx
the maximum size. Objects in the heap are automatically garbage-collected
when they are no longer needed.
PC Registers The program counter registers, which point to the next
instruction to be executed, one per thread.
Stack Areas Contain the stack frames corresponding to the current method
execution, per thread. Stack frames contain the data corresponding to the
current method, such as memory for the parameters, the return values, the
local variables within a method and the operand stack.
Native Method Stacks Contain a stack for native method execution per
thread. The stack contains memory portions for different parts of a function,
such as parameters and local variables.
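The heap limits mentioned above can be inspected from inside a program through the
standard Runtime API; the values reported are simply whatever limits the JVM was
started with (for example via -Xms and -Xmx).

// Reports the heap sizes the JVM is currently running with. Start it,
// for example, with: java -Xms64m -Xmx256m HeapInfo
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("currently committed heap (bytes): " + rt.totalMemory());
        System.out.println("maximum heap (bytes):             " + rt.maxMemory());
    }
}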

Execution Engine
Once the instruction to be executed is ready, the Java interpreter interprets the
instruction and executes it.
Interpreter The interpreter takes a bytecode instruction, interprets it, finds out
which native operation is to be done and then executes that operation with the help
of the native method interface, which uses the native method libraries.
JIT Compiler JIT stands for Just In Time. As bytecodes are interpreted, the JIT compiler
keeps a log of the most frequently executed code, compiles it to binary code and
optimizes it. The next time the same method runs, the optimized code runs instead,
which eliminates the overhead of interpreting and optimizing the same instructions again.
A profiler keeps track of which portions are frequently being executed. Experiments
show that Java programs using JIT can be as fast as a compiled C program.
Garbage Collection A garbage collector's primary function is to automatically
reclaim the memory used by objects that are no longer referenced by the running
application. It may also move objects as the application runs to reduce heap
fragmentation. A garbage collector is not strictly required by the Java virtual
machine specification. The specification only requires that an implementation
manage its own heap in some manner.

Assembler
An assembler is a program which converts assembly language into machine code. It
performs the translation in a similar way to a compiler, but an assembler translates a
low-level programming language whereas a compiler translates a high-level programming
language.
An assembler performs the following functions:

Convert mnemonic operation codes to their machine language codes

Convert symbolic operands (e.g., jump labels, variable names) to their
machine addresses

Use proper addressing modes and formats to build efficient machine
instructions

Translate data constants into internal machine representations

Output the object program and provide other information (e.g., for the linker and
loader)

Two Pass Assembler

A two pass assembler processes the source program in two passes. The
first pass generates the symbol table and the second pass generates the machine
code. We use a two pass assembler to deal with the forward reference problem.

Pass 1
Assign addresses to all statements in the program
Save the values (addresses) assigned to all labels (including label and
variable names) for use in Pass 2 (deal with forward references)
Perform some processing of assembler directives (e.g., BYTE, RESW, these
can affect address assignment)

Pass 2
Assemble instructions (generate opcode and look up addresses)
Generate data values defined by BYTE, WORD
Perform processing of assembler directives not done in Pass 1
Write the object program and the assembly listing

Two major data structures:

Operation Code Table (OPTAB): used to look up mnemonic operation codes and
translate them to their machine language equivalents
Symbol Table (SYMTAB): used to store the values (addresses) assigned to labels

Location Counter (LOCCTR): used to help with the assignment of addresses. LOCCTR
is initialized to the beginning address specified in the START statement. The length
of the assembled instruction or data area to be generated is added to LOCCTR.
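The sketch below illustrates the two-pass idea on an invented one-word-per-statement
notation ("LABEL: OP OPERAND"); the opcodes in OPTAB, the uniform word size and the
starting address of 0 are assumptions made for the example. Pass 1 assigns addresses
and fills SYMTAB, and pass 2 looks up OPTAB and SYMTAB, resolving the forward
references to the data words while generating code.

import java.util.LinkedHashMap;
import java.util.Map;

// Two-pass assembler sketch. Pass 1 builds SYMTAB using the LOCCTR;
// pass 2 uses OPTAB and SYMTAB to emit the object code.
public class TwoPass {
    static final Map<String, Integer> OPTAB =
        Map.of("LOAD", 0x01, "ADD", 0x02, "STORE", 0x03, "JUMP", 0x04);

    public static void main(String[] args) {
        String[] program = {
            "START: LOAD x", "ADD y", "STORE x", "JUMP START",
            "x: WORD 5", "y: WORD 7"
        };

        // Pass 1: assign an address to every statement and record labels.
        Map<String, Integer> symtab = new LinkedHashMap<>();
        int locctr = 0;                              // starting address assumed to be 0
        for (String line : program) {
            if (line.contains(":")) {
                symtab.put(line.split(":", 2)[0].trim(), locctr);
            }
            locctr += 1;                             // every statement occupies one word here
        }

        // Pass 2: generate code, resolving forward references via SYMTAB.
        locctr = 0;
        for (String line : program) {
            String stmt = line.contains(":") ? line.split(":", 2)[1].trim() : line;
            String[] f = stmt.split("\\s+");
            if (f[0].equals("WORD")) {
                System.out.printf("%04d  data  %s%n", locctr, f[1]);
            } else {
                int opcode = OPTAB.get(f[0]);
                int address = symtab.get(f[1]);
                System.out.printf("%04d  op=%02X addr=%04d%n", locctr, opcode, address);
            }
            locctr++;
        }
    }
}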

Linker
A programming tool which combines one or more partial object files and libraries
into a (more) complete executable object file.

It performs three tasks:

o Searches the program to find library routines used by the program, e.g.
  printf(), math routines
o Determines the memory locations that code from each module will
  occupy and relocates its instructions by adjusting absolute references
o Resolves references among files

Loader
Part of the OS that brings an executable file residing on disk into memory and
starts it running.

Steps:

o Read the executable file's header to determine the size of the text and data
  segments
o Create a new address space for the program
o Copy instructions and data into the address space
o Copy arguments passed to the program onto the stack
o Initialize the machine registers, including the stack pointer
o Jump to a startup routine that copies the program's arguments from
  the stack to registers and calls the program's main routine


Editors
Source code editors have features specifically designed to simplify and speed up
input of source code, such as syntax highlighting, indentation, autocomplete and
bracket matching functionality. These editors also provide a convenient way to run a
compiler, interpreter, debugger, or other program relevant for the software
development process. So, while many text editors can be used to edit source code,
if they don't enhance, automate or ease the editing of code, they are not source
code editors, but simply text editors that can also be used to edit source code.
Some well-known editors are Gedit, Vim, Atom etc.

Single Pass Compiler with Complete Example


A single pass compiler makes a single pass over the source text, parsing, analyzing,
and generating code all at once.
A pass is a complete traversal of the source program, or a complete traversal of
some internal representation of the source program (such as an AST). A pass can
correspond to a phase but it does not have to! Sometimes a single pass
corresponds to several phases that are interleaved in time.
Dependency diagram of a typical Single Pass Compiler:


There is no intermediate code representation present in a single pass compiler.


Example:
Source program

let var n: integer;
    var c: char
in begin
    c := '&';
    n := n+1
end

Symbol Table

Ident   Type    Address
n       int     0[SB]
c       char    1[SB]

Assembly Code

PUSH 2
LOADL 38
STORE 1[SB]
LOAD 0[SB]
LOADL 1
CALL add
STORE 0[SB]
POP 2
HALT


The basic problem is that of forward references. The correct interpretation of a symbol at
some point in the source file may depend on the presence or absence of other
symbols further on in the source file, and until they are encountered, correct code
for the current symbol cannot be produced. This is the problem of context
dependence, and the span can be anywhere from adjacent symbols to arbitrarily
large amounts of source text.
Example: Pascal
Pascal was explicitly designed to be easy to implement with a single pass compiler:
every identifier must be declared before its first use. How is mutual recursion handled
then? Pascal provides forward declarations for this purpose: a routine can be declared
with the forward directive before its body appears, so that another routine can call it.
