Documente Academic
Documente Profesional
Documente Cultură
Lexical Analyzer
Lexical Analyzer
Lexical Analyzer
reads the source program character by character to produce tokens
doesn
doesn’tt return a list of tokens at one shot
returns a token when the parser asks a token from it
For simplicity, a token may have a single attribute which holds the required information for that
token
k
For identifiers, this attribute is a pointer to the symbol table and the symbol table holds the
actual attributes for that token
Some attributes:
<id,attr> where attr is pointer to the symbol table
<assgop,
g p,_> no attribute is needed ((if there is onlyy one assignment
g operator)
p )
<num,val> where val is the actual value of the number.
Token
k type andd its attribute
b uniquely
l identifies
d f a lexeme
l
Regular expressions: widely used technique to specify patterns
Terminology of Languages
Alphabet : a finite set of symbols (a, b, X etc.)
String :
Finite sequence of symbols on an alphabet
Sentences and words are also represented in terms of string
: empty string
|s| : length of string s
Language:
g g sets of strings
g over some fixed alphabet
p
: the empty set is a language
{}: the set containing empty string is a language
The
h set off well-formed
ll f d C programs is a language
l
The set of all possible identifiers is a language
Operators on Strings:
Concatenation: xy represents the concatenation of strings x and y
s =s s=s
sn = s s s .. s ( n times) s0 =
Operations on Languages
Concatenation:
L1L2 = { s1s2 | s1 L1 and s2 L2 }
Union
L1 L2 = { s | s L1 or s L2 }
Exponentiation:
L0 = {
{}} L1 = L L2 = LL
Kleene Closure
L* = L
i0
i
Positive Closure
L+ =
L i 1
i
Example
L1 = {a,b,c,d} L2 = {1,2}
L1L2 = {
{a1,a2,b1,b2,c1,c2,d1,d2}
, , , , , , , }
L1 L2 = {
{a,b,c,d,1,2}}
Each
E h regular
l expression
i ddenotes a llanguage
(r)+ = (r)(r)*
(r)? = (r) |
Regular Expressions (cont.)
Ex:
= {0,1}
0|1 => {0,1}
(0|1)(0|1) =>
> {00
{00,01,10,11}
01 10 11}
0* => { ,0,00,000,0000,....}
(0|1)* => all strings with 0 and 1, including the empty string
Regular Definitions
Writing regular expression for some languages may be difficult
Alternative: regular definitions
Assign names to regular expressions and we can use these names as symbols
to define other regular expressions
A regular
g definition
f is a sequence
q of the definitions of the form:
d1 r1 where di is a distinct name and
d2 r2 ri is a regular expression over symbols in
. {d1,d2,...,di-1}
dn r n
basic symbols previously defined names
Regular Definitions (cont.)
(cont )
Ex: Identifiers in Pascal
letter A | B | ... | Z | a | b | ... | z
digit 0 | 1 | ... | 9
id letter (letter | digit
g )*
If we try to write the regular expression representing identifiers without using regular definitions, that
regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
A non
non-deterministic
deterministic finite automaton (NFA) is a mathematical model that
consists of:
S - a set of states
- a set of input symbols (alphabet)
move – a transition function move to map state-symbol pairs to sets of
states.
s0 - a start (initial) state
F – a set of accepting states (final states)
- transitions are allowed in NFAs
we can move from one state to another one without consumingg anyy symbol
y
A NFA accepts a string x, if and only if there is a path from the starting state to
one of accepting states such that edge labels along this path spell out x
NFA (Example)
a 0 is the start state s0
{2} is the set of final states F
a b = {a,b}
0 1 2
start S = {0,1,2}
b Transition Function: a b
0 {0,1} {0}
Transition graph of the NFA 1 _ {2}
2 _ _
Th language
The l recognized
i d by
b this
thi NFA is ( |b) * a b
i (a|b)
Deterministic Finite Automaton (DFA)
Thomson
Thomson’ss Construction is simple and systematic
It guarantees that the resulting NFA will have exactly one final state, and one start state
a
• Recognize
g a symbol
y a in the alphabet
p i f
N(r1)
i f NFA for r1 | r2
N(r2)
Thomson’s Construction (cont.)
( )
• For regular expression r1 r2
NFA for r1 r2
• For regular
g expression
p r*
i N(r) f
NFA for r*
Thomson’s Construction (Example - (a|b) * a )
a a
a:
(a | b)
b b
b:
a
( |b)
(a|b) *
b
a
(a|b) * a
a
b
Convertingg a NFA into a DFA ((subset construction))
put -closure({s0}) as an unmarked state into the set of DFA (DS)
while ((there is one unmarked S1 in DS)) do
begin -closure({s0}) is the set of all states, accessible
mark S1 from s0 by -transition.
for each input symbol a do
sett off states
t t tot which
hi h there
th is i a transition
t iti on
begin a from a state s in S1
S2 -closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] S2
end
endd
S1
S0 b a
S2
b
Converting Regular Expressions Directly to DFAs
Regular expression can be directly converted into a DFA (without creating a
NFA first)
Augment the given regular expression by concatenating it with a special
symbol #
r (r)# augmented regular expression
Create a syntax tree for this augmented regular expression
Syntax tree
Leaves:
L alphabet
l h b t symbols
b l (including
(i l di # andd theth empty t string)
t i ) in
i the
th
augmented regular expression
Intermediate nodes: operators
Number each alphabet symbol (including #) depending upon the positions
Regular Expression DFA (cont.)
(cont )
| ) * a ((a|b)
((a|b) | )* a # augmented
g regular
g expression
p
S t tree
Syntax ( |b) * a #
t off (a|b)
#
4
* a
3 • each symbol is numbered (positions)
| • each symbol is at a leave
a b • inner nodes are operators
1 2
followpos
Define the function followpos for the positions (positions
assigned to leaves)
followpos(i)
p -- set of ppositions which can follow
the position i in the strings generated by
the augmented regular expression
For example, ( a | b) * a #
1 2 3 4
followpos(1) = {1,2,3}
followpos is just defined for leaves,
f ll
followpos(2)
(2) = {1,2,3}
{1 2 3} it is not defined for inner nodes.
followpos(3) = {4}
followpos(4) = {}
firstpos, lastpos, nullable
1 If n is concatenation
1. concatenation-node
node with left child c1 and right child c2,
and i is a position in lastpos(c1), then all positions in firstpos
(c2) are in followpos (i)
2. If n is a star-node, and i is a position in lastpos(n), then all
positions in firstpos(n) are in followpos(i).
If firstpos and lastpos have been computed for each node, followpos
off eachh position can be b computedd by
b making
k one depth-first
d hf
traversal of the syntax tree
Example -- ( a | b) * a #
{1,2,3} {4} pink – firstpos
{1,2,3} {3} {4} # {4} blue – lastpos
4
{1,2} *{1,2} {{3}} a{{3}}
3
{1,2} | {1,2}
Then we can calculate followpos
{1} a {1} {2} b {2}
2 followpos(1) = {1,2,3}
1
followpos(2) = {1,2,3}
f ll
followpos(3)
(3) = {4}
followpos(4) = {}
S1=firstpos(root)={1,2,3}
p ( ) { , , }
mark S1
a: followpos (1) followpos(3)={1,2,3,4}=S2 move(S1,a)=S2
b: followpos (2)={1
(2)={1,2,3}=S
2 3}=S1 move(S1,b)=S
b)=S1
mark S2
a: followpos(1) followpos(3)={1,2,3,4}=S2 move(S2,a)=S2
b followpos(2)={1,2,3}=S
b: f ll (2) {1 2 3} S1 move(S
(S2,b)=S
b) S1
b a
start state: S1 a
accepting states: {S2} S1 S2
b
Example -- ( a | ) b c* #
1 2 3 4
followpos(1)={2} followpos(2)={3,4} followpos(3)={3,4} followpos(4)={}
S1=firstpos(root)={1,2}
mark S1
a: followpos(1)={2}=S2 move(S1,a)=S2
b: followpos(2)={3,4}=S3 move(S1,b)=S3
mark S2
b: followpos(2)={3,4}=S3 move(S2,b)=S3
mark S3
S2
c: followpos(3)={3,4}=S3 move(S3,c)=S3 a
b
S1
b
start state: S1 S3 c
accepting states: {S3}
Minimizingg Number of States of a DFA
partition the set of states into two groups:
G1 : set of accepting states
G2 : set of non-accepting states
Start state: the group containing the start state of the original DFA
Accepting states: the groups containing the accepting states of the original
DFA
Minimizing DFA - Example
a
G1 = {2}
2
a G2 = {1,3}
1 b a
b
3 G2 cannot be partitioned because
move(1,a)=2 move(1,b)=3
b
( )
move(3,a)=2 move(3,b)=3
( )
b a
{1,3} a {2}
b
Minimizing DFA – Another Example
a
2
a a
1 4
Groups: {1,2,3} {4}
b
b a
{1 2}
{1,2} {3} a b
3 b no more partitioning 1->2 1->3
2->2 2->3
b 3->4 3->3
{3}
a b
{1 2}
{1,2} a b
a {4}
An Example
l
Grammar Fragment (Pascal)
if if
then then
else else
relop < | <= | = | <> | > | >=
id letter
l tt ( letter
l tt di it )*
| digit
num digit+ (. digit+ )? (E(+|-)? digit+ )?
delim blank | tab | newline
ws delim+
Tokens and Attributes
Regular Expression Token Attribute Value
ws - -
if if -
then then -
else else -
id id pointer
i t t
to entry
t
num num pointer to entry
< relop LT
<= relop LE
= relop
p EQ
Q
<> relop NE
> relop GT
=> relop GE
Transition Diagram for “relop”
return(relop,LE)
= 2
return(relop,NE)
> 3
other * return(relop,LT)
1 4
<
start = return(relop,EQ)
0 5
>
= return(relop,GE)
6 7
other
* return(relop,GT)
8
Identifiers and Keywords
Share a transition diagram
After reachingg accepting
p g state,, code determines if lexeme is keyword
y or
identifier
Easier than encoding exceptions in diagram
Simple technique is to appropriately initialize symbol table with keywords
letter or digit
digit
+ or - digit other * return(get_token(), install_num())
16 17 18 19
digit
digit digit
digit
Diagrams
g with low numbered start states tried before diagrams
g with
high numbered start states
return start;
}
Finding the Next Token
token nexttoken(void) {
while (1) {
switch (state) {
case 0:
c = nextchar();
if (c == ' ' || c=='\t' || c == '\n') {
state = 0;
l
lexeme_beginning++;
b i i ++
}
else if (c == '<') state = 1;
else if (c == '=') state = 5;
else if (c == '>') state = 6;
else state = next_td();
break;
case 19:
retract();
install_num();
return(NUM);
break;
… /* Final 8 cases */
Some Other Issues in Lexical Analyzer
Lexical analyzer has to recognize the longest possible string
Ex: identifier newval -- n ne new newv newva newval
What is the end of a token? Is there any character which marks the end of a
token?
normally
o a y not
ot defined
e e
Not an issue if the number of characters in a token is fixed
But < < or <> (in Pascal) (not fixed)
E d off an identifier
End id ifi : the
h characters
h cannot bbe iin an id
identifier
ifi that
h can markk the
h endd off token
k
We may need a lookahead
In Prolog:
g p :- X is 1. p :- X is 1.5.
The dot followed by a white space character can mark the end of a number.
B iff that
But h is not the
h case, the
h ddot must bbe treatedd as a part off the
h number.
b
Some Other Issues in Lexical Analyzer (cont.)
Skipping comments
Normally we don’t return a comment as a token
We skip a comment, and return the next token (which is not a comment) to the parser
So, the comments are only processed by the lexical analyzer, and don’t complicate the syntax
of the language