Lexical Analyzer
Outline
Informal sketch of lexical analysis
Identifies tokens in input string
Specifying lexemes
Regular expressions
Examples of regular expressions
Lexical Analyzer
Functions
Grouping input characters into tokens
Stripping out comments and whitespace
Correlating error messages with the
source program
Issues (why separate lexical analysis from parsing)
Simpler design
Compiler efficiency
Compiler portability (e.g., Linux to Windows)
Scanning
Perspective
Purpose
Transform a stream of symbols
Into a stream of tokens
Lexical Analyzer
Responsibilities
Lexical analyzer [Scanner]
Scan input
Remove whitespace
Remove comments
Manufacture tokens
Generate lexical errors
Pass tokens to parser
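[Figure: the lexical analyzer reads characters from the source program ("read char", "put back char"; the entire program may be read into memory), answers the parser's "get next" requests, and records identifiers (id) in the symbol table.]
Compiler Construction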
Lexical Analysis
What do we want to do? Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a string of characters:
\t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;
Goal: Partition input string into substrings
Where the substrings are tokens
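As a rough illustration of this goal, the following C sketch partitions the example string into substrings. The function name split_lexemes, the MAX_LEXEME limit, and the handling of == as the only two-character operator are assumptions made for this example, not part of the slides.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEXEME 32   /* hypothetical upper bound on lexeme length */

/* Copy the lexemes of `input` into `out`, skipping whitespace.
   Returns the number of lexemes stored (at most `max`). */
size_t split_lexemes(const char *input, char out[][MAX_LEXEME], size_t max)
{
    size_t count = 0;
    const char *p = input;

    while (*p != '\0' && count < max) {
        if (isspace((unsigned char)*p)) {          /* whitespace separates lexemes */
            p++;
            continue;
        }
        const char *start = p;
        if (isalpha((unsigned char)*p)) {          /* identifier or keyword */
            while (isalnum((unsigned char)*p)) p++;
        } else if (isdigit((unsigned char)*p)) {   /* integer */
            while (isdigit((unsigned char)*p)) p++;
        } else if (p[0] == '=' && p[1] == '=') {   /* two-character operator */
            p += 2;
        } else {                                   /* any other single character */
            p++;
        }
        size_t len = (size_t)(p - start);
        if (len >= MAX_LEXEME) len = MAX_LEXEME - 1;   /* truncate overly long lexemes */
        memcpy(out[count], start, len);
        out[count][len] = '\0';
        count++;
    }
    return count;
}
```

Called on the example input, the sketch yields the substrings if, (, i, ==, j, ), z, =, 0, ;, else, z, =, 1, ; in that order.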
What's a Token?
A syntactic category
In English:
Tokens
Tokens correspond to sets of strings.
Identifier: strings of letters or digits,
starting with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or
Whitespace: a non-empty sequence of
blanks, newlines, and tabs
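Since each token corresponds to a set of strings, deciding a token class amounts to a membership test. The sketch below is one possible way to write those tests in C; the TokenClass enum, the classify function, and the three-word keyword list are assumptions for illustration only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_IDENTIFIER, TOK_INTEGER, TOK_KEYWORD, TOK_WHITESPACE, TOK_UNKNOWN
} TokenClass;

/* True if s is non-empty and every character satisfies pred. */
static int all_chars(const char *s, int (*pred)(int))
{
    if (*s == '\0') return 0;
    for (; *s != '\0'; s++)
        if (!pred((unsigned char)*s)) return 0;
    return 1;
}

/* Blanks, newlines, and tabs, as in the whitespace definition above. */
static int is_blank_char(int c) { return c == ' ' || c == '\n' || c == '\t'; }

TokenClass classify(const char *lexeme)
{
    /* Assumed keyword set; a real language has more. */
    static const char *keywords[] = { "else", "if", "begin" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0) return TOK_KEYWORD;

    if (all_chars(lexeme, is_blank_char)) return TOK_WHITESPACE;
    if (all_chars(lexeme, isdigit)) return TOK_INTEGER;
    /* Letters or digits, starting with a letter. */
    if (isalpha((unsigned char)lexeme[0]) && all_chars(lexeme, isalnum))
        return TOK_IDENTIFIER;
    return TOK_UNKNOWN;
}
```

Keywords are checked before identifiers because every keyword here also fits the identifier pattern.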
Typical Tokens in a PL
Terminology
Token
A classification for a common set of strings
Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, ...
Pattern
The rules that characterize the set of strings for a token
A pattern is a description of the form that the lexemes of
a token may take.
Examples: [0-9]+
Lexeme
Actual sequence of characters that matches a pattern and
has a given Token class.
A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.
Examples:
Identifier: Name, Data, x
Integer: 345, 2, 0, 629, ...
Examples
Lexical Errors
Error handling is very localized with respect to the input source
Example:
fi(a==f(x))
generates no lexical error in C
Do a case analysis
Based on lookahead char
Based on current lexeme
Outcome
If char can extend lexeme, all is well, go on.
If char cannot extend lexeme:
Figure out what the complete lexeme is and
return its token
Put the lookahead back into the symbol stream
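The case analysis above can be sketched in C for one token class, integers. The Stream type and the functions next_char, put_back, and scan_integer are hypothetical names for this example; the input is an in-memory string rather than a file.

```c
#include <ctype.h>
#include <stddef.h>

typedef struct {
    const char *text;   /* NUL-terminated source text */
    size_t pos;         /* index of the next unread character */
} Stream;

/* Read one character, or return -1 at end of input. */
int next_char(Stream *s)
{
    char c = s->text[s->pos];
    if (c == '\0') return -1;
    s->pos++;
    return (unsigned char)c;
}

/* Put the most recently read character back into the stream. */
void put_back(Stream *s)
{
    if (s->pos > 0) s->pos--;
}

/* Scan an integer lexeme into buf; returns the number of characters stored. */
size_t scan_integer(Stream *s, char *buf, size_t cap)
{
    size_t len = 0;
    for (;;) {
        int c = next_char(s);
        if (c != -1 && isdigit(c)) {
            /* The lookahead extends the lexeme: all is well, go on. */
            if (len + 1 < cap) buf[len++] = (char)c;
        } else {
            /* The lookahead cannot extend the lexeme: the lexeme is
               complete, so return the lookahead to the stream. */
            if (c != -1) put_back(s);
            break;
        }
    }
    if (cap > 0) buf[len] = '\0';
    return len;
}
```

After scanning "123+4", the lexeme is 123 and the next character read from the stream is still +.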
Token Attribute
E = C1 ** 10
Token    Attribute
ID       pointer to symbol-table entry for E
=
ID       pointer to symbol-table entry for C1
**
NUM      10
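One way to represent token/attribute pairs in C is sketched below for the statement E = C1 ** 10. The Token struct, the fixed-size symbol table, and the functions intern and next_token are assumptions for this example; the ID attribute is an index into the table rather than a pointer.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { T_ID, T_ASSIGN, T_EXP, T_NUM, T_END } TokenType;

typedef struct {
    TokenType type;
    int attr;   /* symbol-table index for T_ID, numeric value for T_NUM */
} Token;

#define MAX_SYMBOLS 16
#define MAX_NAME 32
static char symtab[MAX_SYMBOLS][MAX_NAME];
static int nsymbols = 0;

/* Return the table index of name[0..len), inserting it if it is new.
   Returns -1 when the table is full. */
int intern(const char *name, size_t len)
{
    if (len >= MAX_NAME) len = MAX_NAME - 1;
    for (int i = 0; i < nsymbols; i++)
        if (strlen(symtab[i]) == len && strncmp(symtab[i], name, len) == 0)
            return i;
    if (nsymbols == MAX_SYMBOLS) return -1;
    memcpy(symtab[nsymbols], name, len);
    symtab[nsymbols][len] = '\0';
    return nsymbols++;
}

/* Return the next token starting at *p and advance *p past its lexeme.
   Returns T_END at end of input or on a character this sketch does not handle. */
Token next_token(const char **p)
{
    Token t = { T_END, 0 };
    const char *s = *p;
    while (*s == ' ') s++;                       /* skip blanks */
    if (isalpha((unsigned char)*s)) {
        const char *start = s;
        while (isalnum((unsigned char)*s)) s++;
        t.type = T_ID;
        t.attr = intern(start, (size_t)(s - start));
    } else if (isdigit((unsigned char)*s)) {
        char *end;
        t.type = T_NUM;
        t.attr = (int)strtol(s, &end, 10);
        s = end;
    } else if (s[0] == '*' && s[1] == '*') {
        t.type = T_EXP;
        s += 2;
    } else if (*s == '=') {
        t.type = T_ASSIGN;
        s++;
    }
    *p = s;
    return t;
}
```

Operators carry no attribute here because the token type alone identifies them.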
Lexical Errors
It is hard for a lexical analyzer to tell,
without the aid of other components,
that there is a source-code error.
E.g., in fi ( a == f(x)) ... the scanner cannot tell whether fi is a
misspelling of the keyword if or an identifier.
Error detection
Error reporting
Error recovery
Delete the current character and
restart scanning at the next character
Delete the first character read by the
scanner and resume scanning at the
character following it.
How about runaway strings and
comments?
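A minimal sketch of the delete-and-resume recovery strategy follows. The function name scan_with_recovery, the set of legal punctuation characters, and the treatment of any run of letters and digits as one lexeme are simplifying assumptions for this example. Runaway strings and comments are not handled.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Scan src, counting lexemes in *ntokens. On an illegal character,
   report it, delete it, and resume scanning at the next character.
   Returns the number of lexical errors reported. */
int scan_with_recovery(const char *src, int *ntokens)
{
    int errors = 0;
    *ntokens = 0;
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) {
            src++;                                   /* skip whitespace */
        } else if (isalnum(c)) {
            while (isalnum((unsigned char)*src)) src++;   /* word or number */
            (*ntokens)++;
        } else if (strchr("=+-*/();<>", c) != NULL) {
            src++;                                   /* assumed legal punctuation */
            (*ntokens)++;
        } else {
            /* Error detection and reporting. */
            fprintf(stderr, "lexical error: unexpected character '%c'\n", c);
            errors++;
            /* Error recovery: delete the character and carry on. */
            src++;
        }
    }
    return errors;
}
```

For the input "x = 1 @ 2;" the sketch reports one error for @ and still recognizes the five lexemes around it.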
Specification of Tokens
Regular expressions are an important notation
for specifying lexeme patterns. While they
cannot express all possible patterns, they are
very effective in specifying those types of
patterns that we actually need for tokens.
Input Buffering
The lexical analyzer (LA) reads the source program from secondary storage. Reading the input one character at a time is costly.
Therefore a block of data is first read into a buffer and then scanned by the LA.
Input buffering makes it possible to move back and forth in the input stream efficiently.
There are various methods by which buffering may be done.
One-buffer scheme: if a lexeme crosses the boundary of the buffer, the buffer has to be refilled in order to scan the rest of the lexeme, and the first part of the lexeme is overwritten in the process.
Buffer Pairs
Two buffers of the same size, say 4096 characters, are alternately
reloaded.
Two pointers to the input are maintained:
Pointer lexeme_Begin (lb) marks the beginning of the
current lexeme.
Pointer forward (fp) scans ahead until a pattern
match is found.
The string of characters between the two pointers is the
current lexeme. Initially both pointers point to the
beginning of the lexeme.
The fp then scans forward until a match for a
pattern is found. When the end of the lexeme is
identified, the fp is set to the character at its right end,
and the token and attribute corresponding to this
lexeme are returned.
After that, both lb and fp are set to the
beginning of the next token, which is the character
immediately past the lexeme.
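The two-pointer idea can be sketched in C as follows. For simplicity the sketch scans an in-memory string and treats any whitespace-delimited run as a lexeme, so no buffer reloading is shown; the Scanner type and next_lexeme function are hypothetical names. Unlike the description above, forward stops one position past the lexeme, which is the usual C convention for half-open ranges.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *lexeme_begin;   /* start of the current lexeme */
    const char *forward;        /* scans ahead to find the end of the lexeme */
} Scanner;

/* Copy the next whitespace-delimited lexeme into out (truncated to fit cap).
   Returns the lexeme's length, or 0 at end of input. */
size_t next_lexeme(Scanner *sc, char *out, size_t cap)
{
    /* Skip whitespace so lexeme_begin marks the first lexeme character. */
    while (isspace((unsigned char)*sc->lexeme_begin)) sc->lexeme_begin++;

    /* Both pointers start at the beginning of the lexeme. */
    sc->forward = sc->lexeme_begin;
    while (*sc->forward != '\0' && !isspace((unsigned char)*sc->forward))
        sc->forward++;

    /* The characters between the two pointers are the current lexeme. */
    size_t len = (size_t)(sc->forward - sc->lexeme_begin);
    if (cap > 0) {
        size_t n = len < cap - 1 ? len : cap - 1;
        memcpy(out, sc->lexeme_begin, n);
        out[n] = '\0';
    }

    /* Move lexeme_begin up to forward, just past the lexeme. */
    sc->lexeme_begin = sc->forward;
    return len;
}
```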
Buffering
Input Buffering:
Some efficiency issues concerned with the buffering of input.
A two-buffer input scheme that is useful when lookahead on the input is necessary to
identify tokens.
Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the
buffer end.
There are three general approaches to the implementation of a lexical analyser:
1. Use a lexical-analyser generator, such as the Lex compiler, to produce the lexical analyser from a
regular-expression-based specification. In this case, the generator provides routines for reading and
buffering the input.
2. Write the lexical analyser in a conventional systems-programming language, using I/O
facilities of that language to read the input.
3. Write the lexical analyser in assembly language and explicitly manage the reading of input.
Buffer pairs:
Because a large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the overhead required to
process an input character.
The scheme to be discussed:
Consists of a buffer divided into two N-character halves.
Once the next lexeme is determined, the forward pointer is set to the character at its
right end.
If the forward pointer is about to move past the halfway mark, the right half is
filled with N new input characters.
If the forward pointer is about to move past the right end of the buffer, the left
half is filled with N new characters and the forward pointer wraps around to the
beginning of the buffer.
Disadvantage of this scheme:
This scheme works well most of the time, but the amount of lookahead is
limited.
Sentinels
We can reduce the tests made for each character to one by
introducing a special character, called a
sentinel, at the end of each buffer. Let us choose EOF as
the sentinel.
[Figure: buffer pair with an eof sentinel at the end of each half: | E M * eof | C * * 2 eof | eof]
Sentinels
In the previous scheme, each time the forward pointer is moved we must check that it
has not moved off one half of the buffer. If it has, the other half must be
reloaded.
Checking for the ends of the buffer halves therefore costs two tests for each advance of the
forward pointer.
The two tests can be reduced to one if each buffer half is extended to hold a
sentinel character at the end.
The sentinel is a special character that cannot be part of the source program (the eof
character is used as the sentinel).
With sentinels, most of the time only one test is performed: whether forward points
to an eof.
Only when forward reaches the end of a buffer half or the end of the input are more tests performed.
Since N input characters are encountered between eofs, the average number of
tests per input character is very close to 1.
forward := forward + 1;
if forward points to eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer half marks the end of the input */
        terminate lexical analysis
end
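The sentinel scheme can be sketched in runnable C under a few stated assumptions: the source is an in-memory string standing in for a file, the NUL character plays the role of eof and is assumed never to occur in the source, and the half size N is kept tiny so that reloads happen often. The names Input, reload, input_init, and input_next are hypothetical. This variant reads the character at forward before advancing, and it ignores the problem of a lexeme being overwritten by a reload.

```c
#include <stddef.h>
#include <string.h>

#define N 4               /* characters per buffer half (tiny, for illustration) */
#define SENTINEL '\0'     /* assumption: NUL never occurs in the source text */

typedef struct {
    char buf[2 * (N + 1)];   /* two halves, each followed by a sentinel slot */
    char *forward;           /* next character to be read */
    const char *src;         /* input not yet loaded into the buffer */
} Input;

/* Load up to N characters into the half starting at `half` and place a
   sentinel right after them. When the half is full the sentinel lands in
   its end slot; at end of input it lands earlier. */
static void reload(Input *in, char *half)
{
    size_t n = strlen(in->src);
    if (n > N) n = N;
    memcpy(half, in->src, n);
    half[n] = SENTINEL;
    in->src += n;
}

void input_init(Input *in, const char *text)
{
    in->src = text;
    reload(in, in->buf);
    in->forward = in->buf;
}

/* Return the next input character, or -1 at end of input. */
int input_next(Input *in)
{
    for (;;) {
        char c = *in->forward;
        if (c != SENTINEL) {              /* the single test in the common case */
            in->forward++;
            return (unsigned char)c;
        }
        if (in->forward == in->buf + N) {                 /* end of first half */
            reload(in, in->buf + N + 1);
            in->forward = in->buf + N + 1;
        } else if (in->forward == in->buf + 2 * N + 1) {  /* end of second half */
            reload(in, in->buf);
            in->forward = in->buf;
        } else {
            return -1;    /* sentinel inside a half: end of input */
        }
    }
}
```

Only when the sentinel is seen does the code decide which of the three cases applies, so ordinary characters cost one comparison each.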
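id → letter (letter | digit)*
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
An identifier can also be expressed by the following
regular expression:
(A | B | ... | Z | a | b | ... | z) (A | B | ... | Z | a | b | ... | z | 0 | 1 | 2 | ... | 9)*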
Regular Expression
A notation that allows us to define a pattern in
a high level language.
Regular language
Each regular expression r denotes a language
L(r) (the set of sentences relating to the regular
expression r)
ε is a regular expression
{ε} is the related regular language
a (a symbol of the alphabet) is a regular expression
{a} is the related regular language
8) If L(),then
= |
= *
= |
= *
5 Notational Shorthands
a) One or more instances
(r)+, e.g. digit+
b) Zero or one instance
r? is a shorthand for r | ε
digits)?
c) Character classes
[a-z] denotes a | b | c | ... | z
[A-Za-z] [A-Za-z0-9]
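These shorthands carry over almost directly to POSIX extended regular expressions, so they can be tried out with the regcomp and regexec functions from regex.h. The helper name matches is hypothetical, and so are the example patterns in the usage note, including the trailing * on the identifier pattern.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text contains a match for the POSIX extended regular
   expression pattern, and 0 otherwise (including when the pattern does
   not compile). Callers anchor the pattern with ^ and $ to require a
   match of the whole string. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, "^[0-9]+$" uses the one-or-more shorthand, "^[0-9]+(\\.[0-9]+)?$" adds an optional fraction with the zero-or-one shorthand, and "^[A-Za-z][A-Za-z0-9]*$" uses character classes to describe identifiers.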
(E(+|-)?
Regular expression    Token    Attribute-value
if                    if       -
id                    id       Pointer to table entry
<                     relop    LT
3 Transition Diagram (Stylized Flowchart)
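[Figure: transition diagram for > and >=. From start state 0, the character > leads to state 6. From state 6, = leads to an accepting state with return(relop,GE), and any other character leads to accepting state 8 with return(relop,GT).]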
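A transition diagram of this kind translates into a small state machine. The C sketch below covers only > and >= and keeps the state numbers 0 and 6 from the diagram; the Relop enum and the scan_gt function are hypothetical names, and the input is an in-memory string.

```c
typedef enum { RELOP_GE, RELOP_GT, RELOP_NONE } Relop;

/* Recognize > or >= at *p. On success, advance *p past the lexeme and
   return its attribute; otherwise leave *p unchanged and return RELOP_NONE. */
Relop scan_gt(const char **p)
{
    const char *s = *p;
    int state = 0;                    /* start state */
    for (;;) {
        switch (state) {
        case 0:
            if (*s == '>') { s++; state = 6; }
            else return RELOP_NONE;   /* this diagram does not apply */
            break;
        case 6:
            if (*s == '=') {          /* accepting state for >= */
                s++;
                *p = s;
                return RELOP_GE;
            }
            /* "other": accept > without consuming the lookahead character */
            *p = s;
            return RELOP_GT;
        }
    }
}
```

The "other" edge is the case where the lookahead character is put back, so scanning ">x" leaves x as the next character to be read.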
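while (1) {
    switch (state) {
    case 0:
        c = nextchar();
        /* skip whitespace and move the start of the lexeme along */
        if (c == blank || c == tab || c == newline) {
            state = 0;
            lexeme_beginning++;
        }
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = fail();
        break;
    /* ... other cases ... */
    case 9:
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = fail();
        break;
    /* ... other cases ... */
    }
}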
Deterministic or non-deterministic FA
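[Figure: an FA simulator reads the input buffer (showing the characters d 2 =), with Lexeme_beginning marking the start of the current lexeme.]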
[Figure: transition diagram for identifiers, with edges labeled letter and digit.]