
UNIT I

Overview of Compilation: Phases of compilation, lexical analysis, regular grammar and regular expression for common programming language features, pass and phases of translation, interpretation, bootstrapping, data structures in compilation, LEX lexical analyzer generator.

OVERVIEW OF LANGUAGE PROCESSING SYSTEM


Language Processing System

Pre-processor
A source program may be divided into modules stored in separate files. The task of collecting the source program is entrusted to a separate program called the pre-processor. It may also expand macros into source language statements.
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.

A program written in high-level language is called as source code. To convert the source
code into machine code, translators are needed.
A translator takes a program written in source language as input and converts it into a
program in target language as output.
It also detects and reports errors during translation.
The roles of a translator are:
Translating the high-level language program input into an equivalent machine language program.
Providing diagnostic messages wherever the programmer violates the specification of the high-level language.
Different types of translators
The different types of translators are as follows:

Compiler
Compiler is a translator which is used to convert programs in high-level language to low-
level language. It translates the entire program and also reports the errors in source
program encountered during the translation.

Interpreter
Interpreter is a translator which is used to convert programs in high-level language to low-level language. An interpreter translates line by line and reports an error as soon as it is encountered during the translation process.
It directly executes the operations specified in the source program when the input is given
by the user.
It gives better error diagnostics than a compiler.

Assembler
Assembler is a translator which is used to translate the assembly language code into
machine language code.

Loader and link-editor


The re-locatable machine code has to be linked together with other re-locatable object
files and library files into the code that actually runs on the machine.
The linker resolves external memory addresses, where the code in one file may refer to a
location in another file.
The loader puts the entire executable object file into memory for execution.

Differences between compiler and interpreter

1. A compiler performs the translation of the program as a whole; an interpreter performs statement-by-statement translation.

2. Execution of compiled code is faster; execution under an interpreter is slower.

3. A compiler requires more memory, as linking is needed for the generated intermediate object code; an interpreter generates no intermediate object code, so memory usage is efficient.

4. With a compiler, debugging is hard because error messages are produced only after the entire program has been scanned; an interpreter stops translation when the first error is met, so debugging is easy.

5. Programming languages like C and C++ use compilers; programming languages like Python, BASIC and Ruby use interpreters.

The phases of a compiler can be grouped as:

Front end
Front end of a compiler consists of the phases
Lexical analysis.
Syntax analysis.
Semantic analysis.
Intermediate code generation.
Back end
Back end of a compiler contains
Code optimization.
Code generation.
Front End
The front end comprises the phases that depend on the source language and are independent of the target machine (target language).
It includes lexical and syntactic analysis, symbol table management, semantic analysis and the generation of intermediate code.
Some code optimization can also be done by the front end.
It also includes error handling for the phases concerned.
Back End
The back end comprises the phases of the compiler that depend on the target machine and are independent of the source language.
This includes code optimization and code generation.
In addition, it also encompasses error handling and symbol table management operations.

Passes
The phases of a compiler can be implemented in a single pass, where a pass consists of reading an input file and writing an output file.
Several phases of the compiler are grouped into one pass in such a way that the operations of each of those phases are performed during that pass.
(e.g.) Lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.
Reducing the Number of Passes
Minimizing the number of passes improves time efficiency, as reading from and writing to intermediate files is reduced.
When grouping phases into one pass, the entire program may have to be kept in memory to ensure proper information flow to each phase, because one phase may need information in a different order than the order in which a previous phase produces it.
The source program and target program differ from their internal representation, so the memory needed for the internal form may be larger than that needed for the input and output.

1. Phases of Compilation
The structure of compiler consists of two parts:
Analysis part
Analysis part breaks the source program into constituent pieces and imposes a
grammatical structure on them which further uses this structure to create an intermediate
representation of the source program.
It is also termed as front end of compiler.
Information about the source program is collected and stored in a data structure called
symbol table.

Synthesis part
Synthesis part takes the intermediate representation as input and transforms it to the
target program.
It is also termed as back end of compiler.

The design of compiler can be decomposed into several phases, each of which converts
one form of source program into another.
The different phases of compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve the following tasks:
Symbol table management.
Error handling.
Lexical Analysis
Lexical analysis is the first phase of compiler which is also termed as scanning.
The source program is scanned to read the stream of characters, and those characters are grouped into meaningful sequences called lexemes, for each of which a token is produced as output.
Token: A token is a sequence of characters that represents a lexical unit matching a pattern, such as a keyword, operator or identifier.
Lexeme: A lexeme is an instance of a token, i.e., a group of characters forming a token.
Pattern: A pattern describes the rule that the lexemes of a token must follow. It is the structure that must be matched by the strings.
Once a token is generated the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
(eg.) c=a+b*5;

Lexemes and tokens


Lexemes Tokens

c identifier

= assignment symbol

a identifier

+ + (addition symbol)

b identifier

* * (multiplication symbol)

5 5 (number)

Hence, the token stream is <id, 1> <=> <id, 2> <+> <id, 3> <*> <5>
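As a rough illustration (not part of the original notes), the <token-name, attribute-value> pair can be modelled in C as a small struct; the enumerator names and the symbol-table indices below are assumptions chosen to match the example c = a + b * 5.

#include <stdio.h>

/* Hypothetical token names for the example c = a + b * 5 */
enum TokenName { ID, ASSIGN, PLUS, STAR, NUMBER };

/* <token-name, attribute-value>: for an ID the attribute is its
   symbol-table index; for a NUMBER it is the numeric value.     */
struct Token {
    enum TokenName name;
    int attribute;
};

int main(void)
{
    /* Token stream for c = a + b * 5, assuming c, a and b occupy
       symbol-table entries 1, 2 and 3 respectively.              */
    struct Token stream[] = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0},
        {ID, 3}, {STAR, 0}, {NUMBER, 5}
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d, %d>\n", stream[i].name, stream[i].attribute);
    return 0;
}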


Syntax Analysis
Syntax analysis is the second phase of compiler which is also called as parsing.
Parser converts the tokens produced by lexical analyzer into a tree like representation
called parse tree.
A parse tree describes the syntactic structure of the input.

Syntax tree is a compressed representation of the parse tree in which the operators
appear as interior nodes and the operands of the operator are the children of the node for
that operator.
Input: Tokens
Output: Syntax tree

Semantic Analysis
Semantic analysis is the third phase of compiler.
It checks for the semantic consistency.
Type information is gathered and stored in symbol table or in syntax tree.
Performs type checking.
Intermediate Code Generation
Intermediate code generation produces intermediate representations for the source
program which are of the following forms:
o Postfix notation
o Three address code
o Syntax tree
Most commonly used form is the three address code.
t1 = inttofloat(5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
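One common way to hold three-address code in memory is as a list of quadruples (operator, two operands, result). The sketch below is only an illustration under that assumption; the field widths and operator spellings are not prescribed by the notes.

#include <stdio.h>

/* A quadruple: operator, up to two operands, and a result field. */
struct Quad {
    char op[12];
    char arg1[12];
    char arg2[12];
    char result[12];
};

int main(void)
{
    /* The three-address code shown above for id1 = id2 + id3 * 5 */
    struct Quad code[] = {
        {"inttofloat", "5",   "",   "t1"},
        {"*",          "id3", "t1", "t2"},
        {"+",          "id2", "t2", "t3"},
        {"=",          "t3",  "",   "id1"}
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-4s %-4s %-4s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}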
Properties of intermediate code

It should be easy to produce.


It should be easy to translate into target program.
Code Optimization
Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
It results in faster running machine code.
It can be done by reducing the number of lines of code for a program.
This phase reduces the redundant code and attempts to improve the intermediate code so
that faster-running machine code will result.
During the code optimization, the result of the program is not affected.
To improve the code generation, the optimization involves
o Detection and removal of dead code (unreachable code).
o Calculation of constants in expressions and terms (constant folding).
o Collapsing of repeated (common) expressions into a temporary variable.
o Loop unrolling.
o Moving code outside the loop.
o Removal of unwanted temporary variables.
t1 = id3 * 5.0
id1 = id2 + t1
Code Generation
Code generation is the final phase of a compiler.
It gets input from code optimization phase and produces the target code or object code
as result.
Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
The code generation involves
o Allocation of register and memory.
o Generation of correct references.
o Generation of correct data types.
o Generation of missing code.
LDF R2, id3
MULF R2, #5.0
LDF R1, id2
ADDF R1, R2
STF id1, R1

Symbol Table Management

Symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes
of the identifier.
It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
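A minimal sketch of such a data structure in C, assuming a fixed-size linear table (real compilers usually use hashing); the field names follow the name/type/address columns of the example below.

#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 256

/* One record per identifier: name, type and (relative) address. */
struct Symbol {
    char name[32];
    char type[16];
    int  address;
};

static struct Symbol table[MAX_SYMBOLS];
static int nsymbols = 0;

/* Return the index of name, inserting a new record if it is absent. */
int lookup_or_insert(const char *name, const char *type, int address)
{
    for (int i = 0; i < nsymbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                        /* already in the table */
    strcpy(table[nsymbols].name, name);      /* create a new entry   */
    strcpy(table[nsymbols].type, type);
    table[nsymbols].address = address;
    return nsymbols++;
}

int main(void)
{
    lookup_or_insert("a", "int", 1000);
    lookup_or_insert("b", "int", 1002);
    lookup_or_insert("c", "float", 1004);
    for (int i = 0; i < nsymbols; i++)
        printf("%s\t%s\t%d\n", table[i].name, table[i].type, table[i].address);
    return 0;
}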
Example

int a, b; float c; char z;

Symbol name	Type	Address

a	int	1000
b	int	1002
c	float	1004
z	char	1008

extern double test (double x);


double sample (int count)
{
double sum = 0.0;
for (int i = 1; i <= count; i++)
sum += test((double) i);
return sum;
}

Symbol name Type Scope

test function, double extern

x double function parameter

sample function, double global

count int function parameter

sum double block local

i int for-loop statement


Error Handling
Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
In lexical analysis, errors occur in separation of tokens.
In syntax analysis, errors occur during construction of syntax tree.
In semantic analysis, errors may occur at the following cases:
(i) When the compiler detects constructs that have right syntactic structure but
no meaning
(ii) During type conversion.
In code optimization, errors occur when the result is affected by the optimization.
In code generation, it shows error when code is missing etc.
The figure illustrates the translation of source code through each phase, considering the statement
c = a + b * 5.
Error Encountered in Different Phases
Each phase can encounter errors. After detecting an error, a phase must somehow deal with the error so that compilation can proceed.
A program may have the following kinds of errors at various stages:
Lexical Errors
These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.
Syntactical Errors
These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the syntax analyzer (parser).
When an error is detected, it must be handled by parser to enable the parsing of the rest of
the input. In general, errors may be expected at various stages of compilation but most of
the errors are syntactic errors and hence the parser should be able to detect and report
those errors in the program.
The goals of error handler in parser are:
Report the presence of errors clearly and accurately.
Recover from each error quickly enough to detect subsequent errors.
Add minimal overhead to the processing of correct programs.
There are four common error-recovery strategies that can be implemented in the parser to
deal with errors in the code.
Panic mode.
Statement level.
Error productions.
Global correction.
Semantical Errors
These errors are a result of incompatible value assignment. The semantic errors that the
semantic analyzer is expected to recognize are:
Type mismatch.
Undeclared variable.
Reserved identifier misuse.
Multiple declaration of variable in a scope.
Accessing an out of scope variable.
Actual and formal parameter mismatch.
Logical errors
These errors occur due to code that is not reachable or an infinite loop.
2. Lexical Analysis
Lexical analysis is the process of converting a sequence of characters from source
program into a sequence of tokens.
A program which performs lexical analysis is termed as a lexical analyzer (lexer),
tokenizer or scanner.
Lexical analysis consists of two stages of processing which are as follows:
Scanning
Tokenization
Token, Pattern and Lexeme
Token
Token is a valid sequence of characters which are given by lexeme. In a programming
language,
keywords,
constant,
identifiers,
numbers,
operators and
punctuations symbols
are possible tokens to be identified.
Pattern
Pattern describes a rule that must be matched by sequence of characters (lexemes) to form
a token. It can be defined by regular expressions or grammar rules.
Lexeme
Lexeme is a sequence of characters that matches the pattern for a token i.e., instance of a
token.
(eg.) c=a+b*5;
Lexemes and tokens

Lexemes Tokens

c identifier

= assignment symbol

a identifier

+ + (addition symbol)

b identifier

* * (multiplication symbol)

5 5 (number)

The sequence of tokens produced by the lexical analyzer helps the parser in analyzing the syntax of the programming language.
Role of Lexical Analyzer

Lexical analyzer performs the following tasks:


Reads the source program, scans the input characters, groups them into lexemes and produces tokens as output.
Enters the identified tokens into the symbol table.
Strips out white space and comments from the source program.
Correlates error messages with the source program, i.e., displays an error message with its occurrence by specifying the line number.
Expands macros if they are found in the source program.
Tasks of lexical analyzer can be divided into two processes:
Scanning: Performs reading of input characters, removal of white spaces and comments.
Lexical Analysis: Produces tokens as the output.
Need of Lexical Analyzer
Simplicity of compiler design: The removal of white space and comments enables the syntax analyzer to deal only with meaningful syntactic constructs.
Compiler efficiency is improved: Specialized buffering techniques for reading characters speed up the compilation process.
Compiler portability is enhanced.
Issues in Lexical Analysis
Lexical analysis is the process of producing tokens from the source program. It has the
following issues:
Lookahead
Ambiguities
Lookahead
Lookahead is required to decide where one token ends and the next token begins.
Simple examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required, along with a way to resolve ambiguities:
Is if two variables i and f, or the keyword if?
Is == two separate = signs, or a single == operator?
arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).
Hence, the amount of lookahead to be considered, and a way to describe the lexemes of each token, are both needed.
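A minimal sketch of one-character lookahead in C, distinguishing = from ==; the token names and the getch/ungetch style of input are assumptions made for the example, not part of the notes.

#include <stdio.h>

/* Illustration of one character of lookahead: deciding between the
   assignment operator = and the equality operator ==.              */
static const char *input = "a == b; c = d;";
static int pos = 0;

static int getch(void)    { return input[pos] ? input[pos++] : EOF; }
static void ungetch(void) { if (pos > 0) pos--; }

static void scan(void)
{
    int c;
    while ((c = getch()) != EOF) {
        if (c == '=') {
            int next = getch();                 /* look ahead one character */
            if (next == '=') {
                printf("token: EQ (==)\n");
            } else {
                if (next != EOF) ungetch();     /* push the character back  */
                printf("token: ASSIGN (=)\n");
            }
        }
        /* all other characters are ignored in this sketch */
    }
}

int main(void) { scan(); return 0; }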
Regular expressions are one of the most popular ways of representing tokens.
Ambiguities
The lexical analysis programs written with lex accept ambiguous specifications and
choose the longest match possible at each input point. Lex can handle ambiguous
specifications. When more than one expression can match the current input, lex chooses
as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given first is
preferred.
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error.
Lexical errors are uncommon, but they still must be handled by a scanner.
Misspellings of identifiers, keywords, or operators are considered lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.
Error Recovery Schemes
Panic mode recovery
Local correction
o Source text is changed around the error point in order to get a correct text.
o Analyzer will be restarted with the resultant new text as input.
Global correction
o It is an enhanced panic mode recovery.
o Preferred when local correction fails.
Panic mode recovery
In panic mode recovery, unmatched patterns are deleted from the remaining input, until
the lexical analyzer can find a well-formed token at the beginning of what input is left.
(e.g.) Suppose the string fi is encountered for the first time in a C program in the context:
fi (a == f(x))
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
Local correction
Local correction performs deletion/insertion and/or replacement of any number of
symbols in the error detection point.
(e.g.) In Pascal, c[i] '='; the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser replaces the resulting '=' so that an assignment results.
Most of the errors are corrected by local correction.
(e.g.) The effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider
for $tnight
The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier.
In effect, the keyword for followed by the identifier tnight is seen where the identifier fortnight was intended, which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair may help.
Lexical error handling approaches
Lexical errors can be handled by the following actions:
Deleting one character from the remaining input.
Inserting a missing character into the remaining input.
Replacing a character by another character.
Transposing two adjacent characters.

3. Regular Expression
Regular expressions are a notation to represent lexeme patterns for a token.

They are used to represent the language for lexical analyzer.


They assist in finding the type of token that accounts for a particular lexeme.
Strings and Languages
Alphabet: a finite, non-empty set of input symbols.
Σ = {0, 1} - the binary alphabet
A string is a finite sequence of symbols drawn from the alphabet.
w = {0, 1, 00, 01, 10, 11, 001, 010, ...}
w indicates the set of possible strings for the given binary alphabet.
A language (L) is a collection of strings over the alphabet, for example the set of strings accepted by a finite automaton.
L = {0^n 1 | n >= 0}
The length of a string is the number of input symbols in the string; it is denoted by the | | operator.
Let w = 0101
|w| = 4
The empty string denotes zero occurrences of input symbols. It is represented by ε.
Concatenation of two strings p and q is denoted by pq.
Let p = 010
and q = 001
pq = 010001
qp = 001010
i.e., pq ≠ qp

The empty string is the identity under concatenation.
Let x be a string.
εx = xε = x
Prefix: A prefix of a string s is obtained by removing zero or more symbols from the end of s.
(e.g.) s = balloon
Possible prefixes are: ball, balloon, ε
Suffix: A suffix of a string s is obtained by removing zero or more symbols from the beginning of s.
(e.g.) s = balloon
Possible suffixes are: loon, balloon, ε
Proper prefix: A proper prefix p of a string s is a prefix with p ≠ s and p ≠ ε.
Proper suffix: A proper suffix x of a string s is a suffix with x ≠ s and x ≠ ε.
Substring: A substring is a part of a string obtained by removing some prefix and some suffix from s.
Operations on Languages
Important operations on a language are:
Union
Concatenation and
Closure
Union
The union of two languages L and M produces the set of strings which are in language L, in language M, or in both. It can be denoted as
L ∪ M = {p | p is in L or p is in M}
Concatenation
The concatenation of two languages L and M produces the set of strings formed by concatenating the strings in L with the strings in M (a string from L followed by a string from M). It can be represented as
LM = {pq | p is in L and q is in M}
Closure
Kleene closure (L*)
Kleene closure refers to zero or more occurrences of strings from L, i.e., it includes the empty string (the set of strings formed by 0 or more concatenations of strings from L).

Positive closure (L+)
Positive closure indicates one or more occurrences, i.e., it excludes the empty string (the set of strings formed by 1 or more concatenations of strings from L).

L^3 - the set of strings each of length 3.

(e.g.) Let Σ = {a, b} and L = {a, b}
L* = {ε, a, b, aa, ab, ba, bb, aab, aba, aaba, ...}
L+ = {a, b, aa, ab, ba, bb, aab, aaba, ...}
L^3 = {aaa, aba, abb, bba, bab, bbb, ...}
Precedence of operators
The unary operator (*) has the highest precedence.
The concatenation operator is second highest and is left associative.
letter_ (letter_ | digit)*
The union operator (| or ∪) has the least precedence and is left associative.
Based on this precedence, the regular expression is transformed into finite automata when implementing the lexical analyzer.
Regular Expressions
Regular expressions are a combination of input symbols and language operators such as
union, concatenation and closure.
It can be used to describe the identifiers of a language. An identifier is a collection of letters, digits and underscores which must begin with a letter or underscore. Hence, the regular expression for an identifier can be given by
letter_ (letter_ | digit)*
Note: The vertical bar ( | ) means 'or' (the union operator).
The following describes the language for given regular expression:

Languages for regular expressions

S.No.	Regular expression	Language

1	ε	L(ε) = {ε}
2	a	L(a) = {a}
3	r | s	L(r) ∪ L(s)
4	rs	L(r) L(s)
5	r*	(L(r))*

Regular set: the language defined by a regular expression.
Two regular expressions are equivalent if they represent the same regular set.
(p | q) = (q | p)

Algebraic laws of regular expressions

Law	Description

r | s = s | r	| is commutative
r | (s | t) = (r | s) | t	| is associative
r(st) = (rs)t	concatenation is associative
r(s | t) = rs | rt; (s | t)r = sr | tr	concatenation is distributive over |
εr = rε = r	ε is the identity for concatenation
r* = (r | ε)*	ε is guaranteed in a closure
r** = r*	* is idempotent

Regular definitions are of the following form:
d1 --> r1
d2 --> r2
d3 --> r3
...
dn --> rn
in which the definitions d1, d2, ..., dn can be used in place of r1, r2, ..., rn respectively.
letter_ --> A | B | ... | Z | a | b | ... | z | _
digit --> 0 | 1 | 2 | ... | 9
id --> letter_ (letter_ | digit)*
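The identifier definition above corresponds to the POSIX pattern [A-Za-z_][A-Za-z0-9_]*. As an illustration only, the POSIX regex library (available on most Unix-like systems) can be used to test strings against it:

#include <regex.h>
#include <stdio.h>

/* Check whether s matches the definition letter_ (letter_ | digit)*
   using a POSIX extended regular expression.                        */
int is_identifier(const char *s)
{
    regex_t re;
    int ok;
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void)
{
    const char *samples[] = { "count", "_tmp1", "9lives", "sum_total" };
    for (int i = 0; i < 4; i++)
        printf("%-10s %s\n", samples[i],
               is_identifier(samples[i]) ? "identifier" : "not an identifier");
    return 0;
}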

Convert Regular Expression to DFA

A regular expression is used to represent the language (the set of lexemes) recognized by finite automata (the lexical analyzer).

Finite automata
A recognizer for a language is a program that takes as input a string x and answers yes if x is a sentence of the language and no otherwise.
A regular expression is compiled into a recognizer by constructing a generalized transition diagram called a Finite Automaton (FA).
Finite automata can be Non-deterministic Finite Automata (NFA) or Deterministic Finite Automata (DFA).
A finite automaton is given by M = (Q, Σ, q0, F, δ)
where Q - set of states
Σ - set of input symbols
q0 - start state
F - set of final states
δ - transition function (mapping a state and an input symbol to states)
δ: Q × Σ -> Q
Non-deterministic Finite Automata (NFA)
o More than one transition may occur for an input symbol from a state.
o A transition can occur even on the empty string (ε).
Deterministic Finite Automata (DFA)
o For each state and for each input symbol, exactly one transition occurs from that state.
A regular expression can be converted into a DFA by the following methods:
(i) Thompson's construction followed by subset construction
The given regular expression is converted into an NFA.
The resultant NFA is converted into a DFA.
(ii) Direct method
In the direct method, the given regular expression is converted directly into a DFA.

Rules for Conversion of Regular Expression to NFA

Union
r = r1 + r2
(A new start state has ε-transitions to the start states of the NFAs for r1 and r2, and their final states have ε-transitions to a new final state.)

Concatenation
r = r1 r2
(The final state of the NFA for r1 is merged with, or connected by an ε-transition to, the start state of the NFA for r2.)

Closure
r = r1*
(A new start state and a new final state are added, with ε-transitions that allow the NFA for r1 to be skipped or repeated.)

ε-closure
The ε-closure of a state is the set of states that are reachable from that state on taking the empty string as input. It describes the paths that consume the empty string (ε) to reach states of the NFA.
Example 1

ε-closure(q0) = {q0, q1, q2}
ε-closure(q1) = {q1, q2}
ε-closure(q2) = {q2}

Example 2

ε-closure(1) = {1, 2, 3, 4, 6}
ε-closure(2) = {2, 3, 6}
ε-closure(3) = {3, 6}
ε-closure(4) = {4}
ε-closure(5) = {5, 7}
ε-closure(6) = {6}
ε-closure(7) = {7}
Sub-set Construction
The given regular expression is converted into an NFA.
Then, the NFA is converted into a DFA.
Steps
1. Convert the regular expression into an NFA using the above rules for the operators (union, concatenation and closure) and their precedence.
2. Find the ε-closure of all states.
3. Start with the ε-closure of the start state of the NFA.
4. Apply the input symbols and find the ε-closure of the result:
Dtran[state, input symbol] = ε-closure(move(state, input symbol))
where Dtran is the transition function of the DFA.
5. Analyze the output state to find whether it is a new state.
6. If a new state is found, repeat step 4 and step 5 until no more new states are found.
7. Construct the transition table for the Dtran function.
8. Draw the transition diagram with the start state as the ε-closure of the NFA's start state; a final state is any DFA state that contains a final state of the NFA.
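A compact sketch of steps 2-7 in C, under the assumption that the NFA has at most 32 states so a set of states fits in one unsigned bitmask; the eps[] and delta[][] tables are assumed to have been filled in from the NFA built in step 1.

#include <stdio.h>

#define MAX_NFA 32     /* assumption: NFA states fit in a 32-bit mask */
#define MAX_DFA 64
#define NSYMS   2      /* illustrative alphabet of two symbols        */

/* eps[s]      : states reachable from s on epsilon
   delta[s][c] : states reachable from s on input symbol c            */
unsigned eps[MAX_NFA];
unsigned delta[MAX_NFA][NSYMS];

/* epsilon-closure of a set of states, computed as a fixpoint. */
unsigned eps_closure(unsigned set)
{
    unsigned old;
    do {
        old = set;
        for (int s = 0; s < MAX_NFA; s++)
            if (set & (1u << s))
                set |= eps[s];
    } while (set != old);
    return set;
}

/* move(T, c): all NFA states reachable from some state in T on c. */
unsigned move_on(unsigned T, int c)
{
    unsigned result = 0;
    for (int s = 0; s < MAX_NFA; s++)
        if (T & (1u << s))
            result |= delta[s][c];
    return result;
}

unsigned dstates[MAX_DFA];       /* each DFA state is a set of NFA states */
int      dtran[MAX_DFA][NSYMS];  /* Dtran transition table of the DFA     */

int subset_construction(int nfa_start)
{
    int ndfa = 0;
    dstates[ndfa++] = eps_closure(1u << nfa_start);     /* DFA start state */
    for (int i = 0; i < ndfa; i++) {                    /* unmarked states */
        for (int c = 0; c < NSYMS; c++) {
            unsigned U = eps_closure(move_on(dstates[i], c));
            int j;
            for (j = 0; j < ndfa; j++)                  /* seen already?   */
                if (dstates[j] == U)
                    break;
            if (j == ndfa)
                dstates[ndfa++] = U;                    /* new DFA state   */
            dtran[i][c] = j;
        }
    }
    return ndfa;                                        /* #DFA states     */
}

int main(void)
{
    /* Tiny NFA for a* over {a, b}: state 0 loops to itself on 'a' (symbol 0). */
    delta[0][0] = 1u << 0;
    printf("DFA states constructed: %d\n", subset_construction(0));
    return 0;
}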

Direct Method
The direct method is used to convert the given regular expression directly into a DFA.
It uses the augmented regular expression r#.
The important states of the NFA correspond to the positions in the regular expression that hold symbols of the alphabet.
The regular expression is represented as a syntax tree in which interior nodes correspond to the operators for union, concatenation and closure.
Leaf nodes correspond to the input symbols.
The DFA is constructed directly from the regular expression by computing the functions nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.
o nullable(n): true for a *-node and for a node labeled ε; false for other nodes.
o firstpos(n): the set of positions in the subtree rooted at n that correspond to the first symbol of some string generated by the sub-expression rooted at n.
o lastpos(n): the set of positions in the subtree rooted at n that correspond to the last symbol of some string generated by the sub-expression rooted at n.
o followpos(i): the set of positions that can follow position i in some string generated by the given (augmented) regular expression.

Rules for computing nullable, firstpos and lastpos

Node n: a leaf labeled ε
	nullable(n) = true; firstpos(n) = ∅; lastpos(n) = ∅

Node n: a leaf with position i
	nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}

Node n: an or-node n = c1 | c2
	nullable(n) = nullable(c1) or nullable(c2)
	firstpos(n) = firstpos(c1) ∪ firstpos(c2)
	lastpos(n) = lastpos(c1) ∪ lastpos(c2)

Node n: a cat-node n = c1 c2
	nullable(n) = nullable(c1) and nullable(c2)
	firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
	lastpos(n) = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)

Node n: a star-node n = c1*
	nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)

Computation of followpos

A position in the regular expression can follow another in the following ways:
If n is a cat-node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
o For a cat-node, for each position i in lastpos of its left child, the firstpos of its right child is in followpos(i).
If n is a star-node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
o For a star-node, the firstpos of that node is in followpos of all positions in lastpos of that node.
Context Free Grammar
Grammars are used to describe the syntax of a programming language. They specify the structure of expressions and statements.

stmt -> if (expr) then stmt


where stmt denotes statements,
expr denotes expressions.

Types of grammar

Type 0 grammar
Type 1 grammar
Type 2 grammar
Type 3 grammar

Context Free Grammar

Context free grammar is also called as Type 2 grammar.

Definition

A context free grammar G is defined by four tuples as,


G=(V,T,P,S)
where,
G - Grammar
V - Set of variables
T - Set of Terminals
P - Set of productions
S - Start symbol
It produces a Context Free Language (CFL), which is defined as
L(G) = { w in T* | S =>* w }
where,
L - language
G - grammar
w - input string (a string of terminals)
S - start symbol
T - set of terminals
Hence, a CFL is the collection of strings of terminals that can be derived from the start symbol of the grammar in a finite number of steps.

Conventions

Terminals are symbols from which strings are formed.


Lowercase letters i.e., a, b, c.
Operators, i.e., +, -, *.
Punctuation symbols, i.e., comma, parentheses.
Digits, i.e., 0, 1, 2, ..., 9.
Boldface letters i.e., id, if.

Non-terminals are syntactic variables that denote a set of strings.


Uppercase letters i.e., A, B, C.
Lowercase italic names i.e., expr , stmt.
Start symbol is the head of the production stated first in the grammar.
Production is of the form LHS -> RHS (or) head -> body, where the head contains only one non-terminal and the body contains a collection of terminals and non-terminals.
Context Free Grammars vs Regular Expressions
Grammars are more powerful than regular expressions.
Every construct that can be described by a regular expression can be described by a
grammar but not vice-versa.
Every regular language is a context free language but reverse does not hold.
(e.g.)
RE = (a | b)*abb (the set of strings ending with abb).
Grammar (one possible grammar obtained from an NFA for this expression using the rules below):
A0 -> a A0 | b A0 | a A1
A1 -> b A2
A2 -> b A3
A3 -> ε

Rules

For each state i of the NFA, create a non-terminal Ai.


If state i has a transition to state j on input a, add the production Ai -> a Aj.
If state i goes to state j on input ε, add the production Ai -> Aj.
If i is an accepting state, add Ai -> ε.
If i is the start state, make Ai the start symbol of the grammar.
