
Unit 2

Lexical Analyzer

Outline
Informal sketch of lexical analysis
Identifies tokens in input string

Issues in lexical analysis


Lookahead
Ambiguities

Specifying lexemes
Regular expressions
Examples of regular expressions

Lexical Analyzer
Functions
Grouping input characters into tokens
Stripping out comments and whitespace
Correlating error messages with the
source program
Reasons for separating lexical analysis
from parsing:
Simpler design
Compiler efficiency
Compiler portability (e.g. Linux to Windows)

The Role of the Lexical Analyzer
Lexical analysis is often divided into two processes:
Scanning
The simple processing that does not require tokenization of the input:
deletion of comments, compaction of consecutive whitespace
characters
Lexical analysis proper
Producing tokens

Scanning
Perspective
Purpose
Transform a stream of symbols
Into a stream of tokens

Lexical Analyzer
Responsibilities
Lexical analyzer [Scanner]:
Scan the input
Remove whitespace
Remove comments
Manufacture tokens
Generate lexical errors
Pass tokens to the parser

The Role of a Lexical Analyzer

[Diagram: the lexical analyzer reads characters from the source program
(reading a char, putting a char back when needed; alternatively the entire
program is read into memory first). On each "get next" request from the
parser it passes back a token and its attribute value. Identifiers (id)
are entered into the Symbol Table, which both phases consult.]

Compiler Construction
Lexical Analysis
What do we want to do? Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a string of characters:
\t if (i == j) \n \t \t z = 0; \n \t else \n \t \t z = 1;
Goal: Partition the input string into substrings,
where the substrings are tokens

What's a Token?
A syntactic category
In English:
noun, verb, adjective, ...
In a programming language:
Identifier, Integer, Keyword, Whitespace, ...

What are Tokens For?
Classify program substrings
according to role
The output of lexical analysis is a stream
of tokens ... which is the input to the
parser
The parser relies on token distinctions
An identifier is treated differently than a
keyword

Tokens
Tokens correspond to sets of strings.
Identifier: strings of letters or digits,
starting with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or ...
Whitespace: a non-empty sequence of
blanks, newlines, and tabs

Typical Tokens in a PL

Symbols: +, -, *, /, =, <, >, ->, ...

Keywords: if, while, struct, float, int, ...

Integer and Real (floating point) literals
123, 123.45
Char (string) literals
Identifiers
Comments
White space

Terminology
Token
A classification for a common set of strings
Examples: Identifier, Integer, Float, Assign, LeftParen,
RightParen, ...

Pattern
The rule that characterizes the set of strings for a token
A pattern is a description of the form that the lexemes of
a token may take.
Example: [0-9]+

Lexeme
The actual sequence of characters that matches a pattern and
has a given token class.
A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that
token.
Examples:
Identifier: Name, Data, x
Integer: 345, 2, 0, 629, ...


Lexical Errors
Error handling is very localized with respect to the input source.
Example:
fi(a==f(x))
generates no lexical error in C

In what situations do errors occur?

A prefix of the remaining input doesn't match any defined
token

Possible error recovery actions:

Deleting or inserting input characters
Replacing or transposing characters

Or, skip ahead to the next separator to ignore the problem

Basic Scanning technique


Use 1 character of look-ahead
Obtain char with getc()

Do a case analysis
Based on lookahead char
Based on current lexeme

Outcome
If the char can extend the lexeme, all is well; go on.
If the char cannot extend the lexeme:
Figure out what the complete lexeme is and
return its token
Put the lookahead back into the input stream

Token Attribute

E = C1 ** 10

Token    Attribute
ID       Index of symbol table entry for E
=
ID       Index of symbol table entry for C1
**
NUM      10

Lexical Errors
It is hard for a lexical analyzer to tell,
without the aid of other components,
that there is a source-code error.
E.g., fi ( a == f(x)) ...
Is fi a misspelling of the keyword if, or an identifier?

Suppose a situation in which none of
the patterns for tokens matches a
prefix of the remaining input.

E.g. $%#if a>0 a+=1;

The simplest recovery strategy is
panic mode recovery:
Delete successive characters from the
remaining input until the lexical analyzer
can find a well-formed token.

This technique may occasionally
confuse the parser, but in an
interactive computing environment
it may be quite adequate.

Other possible error-recovery actions:

Delete one extraneous character
from the remaining input.
Insert a missing character into the
remaining input.
Replace a character by another
character.
Transpose two adjacent characters.

Lexical Error and Recovery

Error detection
Error reporting
Error recovery
Delete the current character and
restart scanning at the next character
Delete the first character read by the
scanner and resume scanning at the
character following it.
How about runaway strings and
comments?

Specification of Tokens
Regular expressions are an important notation
for specifying lexeme patterns. While they
cannot express all possible patterns, they are
very effective in specifying those types of
patterns that we actually need for tokens.


Input Buffering
The LA reads the source program character by character from secondary storage. Reading the input character by
character is costly.
Therefore a block of data is first read into a buffer and then scanned by the LA.
Input buffering is used in order to move back and forth in the input stream efficiently.
There are various methods by which buffering may be done:

One-buffer scheme

Two-buffer scheme

One-buffer scheme: if a lexeme crosses the boundary of the buffer, the buffer has to be refilled in order to scan the
rest of the lexeme, and the first part of the lexeme will be overwritten in the process.

Buffer Pairs
Two buffers of the same size, say 4096, are alternately
reloaded.
Two pointers into the input are maintained:
Pointer lexeme_begin marks the beginning of the
current lexeme.
Pointer forward (fp) scans ahead until a pattern
match is found.
The string of characters between the two pointers is the
current lexeme. Initially both pointers point to the
beginning of the lexeme.
The fp then scans forward until a match for a
pattern is found. When the end of the lexeme is
identified, the fp is set to the character at its right end,
and the token and the attribute corresponding to this
lexeme are returned.
After that, both pointers lb and fp are set to the
beginning of the next token, which is the character
immediately past the lexeme.

Buffering
Input Buffering:
Some efficiency issues are concerned with the buffering of input.
A two-buffer input scheme is useful when lookahead on the input is necessary to
identify tokens.
Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the
buffer end.
There are three general approaches to the implementation of a lexical analyser:
1. Use a lexical-analyser generator, such as the Lex compiler, to produce the lexical analyser from a
regular-expression-based specification. In this case, the generator provides routines for reading and
buffering the input.
2. Write the lexical analyser in a conventional systems-programming language, using the I/O
facilities of that language to read the input.
3. Write the lexical analyser in assembly language and explicitly manage the reading of input.

Buffer pairs:
Because a large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the amount of overhead required to
process an input character.
The scheme to be discussed:
Consists of a buffer divided into two N-character halves.

N = number of characters on one disk block, e.g., 1024 or 4096.

Read N characters into each half of the buffer with one system read command.
If fewer than N characters remain in the input, then eof is read into the buffer
after the input characters.
Two pointers into the input buffer are maintained.
The string of characters between the two pointers is the current lexeme.
Initially both pointers point to the first character of the next lexeme to be
found.
The forward pointer scans ahead until a match for a pattern is found.

Once the next lexeme is determined, the forward pointer is set to the character at its
right end.
If the forward pointer is about to move past the halfway mark, the right half is
filled with N new input characters.
If the forward pointer is about to move past the right end of the buffer, the left
half is filled with N new characters and the forward pointer wraps around to the
beginning of the buffer.
Disadvantage of this scheme:
This scheme works well most of the time, but the amount of lookahead is
limited.

This limited lookahead may make it impossible to recognize tokens in
situations where the distance that the forward pointer must travel is more than the
length of the buffer.
For example: DECLARE ( ARG1, ARG2, ..., ARGn ) in a PL/1 program:
we cannot determine whether DECLARE is a keyword or an array name until
we see the character that follows the right parenthesis.

Whenever we advance fp we must check whether
we have reached the end of one buffer half, because in
that case we have to refill the other buffer before
advancing fp. So we have to make three tests:
1. To check whether we are at the end of the first buffer half.
2. To check whether we are at the end of the second buffer half.
3. To check for the end of the input.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else if fp = eof then
    terminate scanning
else
    forward := forward + 1

Sentinels
We can reduce these three tests to one by
introducing a special character, called a
sentinel, at the end. Let us choose eof as
the sentinel at the end of each buffer half.
E.g., the buffer holding E = M * C * * 2 :

E | = | M | * | eof | C | * | * | 2 | eof | ... | eof

Sentinels
In the previous scheme, each time we move the forward pointer we must check
that we have not moved off one half of the buffer. If we have, we must reload the other
half.
Therefore the ends of the buffer halves require two tests for each advance of the
forward pointer.
We can reduce the two tests to one if we extend each buffer half to hold a
sentinel character at the end.
The sentinel is a special character that cannot be part of the source program (the eof
character is used as the sentinel).
With this scheme, most of the time we perform only one test, to see whether forward points
to an eof.
Only when we reach the end of a buffer half or the real eof do we perform more tests.
Since N input characters are encountered between eofs, the average number of
tests per input character is very close to 1.

forward := forward + 1;
if forward points to eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else
        /* eof within a buffer half marks the end of input */
        terminate lexical analysis
end

How can we deal with a very long lexeme? This is a
problem in the two-buffer scheme.
E.g. DECLARE(ARG1, ARG2, ..., ARGn)
Similarly, when a function is overloaded in C++,
a function name may represent several functions,
so many characters may need to be seen before the
token is determined.

3.3 Specification of Tokens

Regular expressions are an important notation for
specifying token patterns.
We study formal notations for regular expressions;
these expressions are used in lexical-analyzer generators.
Sec. 3.7 shows how to build the lexical analyzer by
converting regular expressions to automata.

1 Regular Definition of Tokens

Tokens are defined by regular expressions.
E.g., an identifier can be defined by the regular
definition:

id → letter ( letter | digit )*
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
An identifier can also be expressed by the following
regular expression:
(A|B|...|Z|a|b|...|z)(A|B|...|Z|a|b|...|z|0|1|2|...|9)*

Regular expressions are an


important notation for specifying
patterns. Each pattern matches a
set of strings, so regular
expressions will serve as names
for sets of strings.


2 Regular Expressions & Regular Languages

Regular expression
A notation that allows us to define a pattern in
a high-level language.

Regular language
Each regular expression r denotes a language
L(r) (the set of sentences matching the regular
expression r).

Each token in a program can be expressed in a regular expression


3 The construction rules of regular
expressions over alphabet Σ

1) ε is a regular expression that denotes {ε}

ε is a regular expression;
{ε} is the related regular language.

2) If a is a symbol in Σ, then a is a regular
expression that denotes {a}

a is a regular expression;
{a} is the related regular language.

3) Suppose α and β are regular
expressions; then α|β, αβ, (α), α*, β* are
also regular expressions, where
L(α|β) = L(α) ∪ L(β)
L(αβ) = L(α)L(β)
L((α)) = L(α)
L(α*) = {ε} ∪ L(α) ∪ L(α)L(α) ∪ L(α)L(α)L(α) ∪ ...

4 Algebraic laws of regular
expressions
1) α|β = β|α
2) α|(β|γ) = (α|β)|γ    α(βγ) = (αβ)γ
3) α(β|γ) = αβ | αγ
   (α|β)γ = αγ | βγ
4) εα = αε = α
5) (α*)* = α*
6) α* = αα* | ε
   αα* = α*α
7) (α|β)* = (α*|β*)* = (α*β*)*

8) If ε ∉ L(α), then
X = αX | β  has the unique solution  X = α*β

X = Xα | β  has the unique solution  X = βα*

Notes: We assume that the precedence
of * is the highest, the precedence of |
is the lowest, and the operators are left-associative.

Example: unsigned numbers such as
5280, 39.37, 6.336E4, 1.894E-4

digit → 0 | 1 | ... | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent

5 Notational Shorthands
a) One or more instances
( r )+ , e.g. digit+
b) Zero or one instance
r? is a shorthand for r | ε , e.g. ( E (+|-)? digits )?
c) Character classes
[a-z] denotes a|b|c|...|z
similarly [A-Za-z], [A-Za-z0-9]

3.4 Recognition of Tokens

1 The task of token recognition in a
lexical analyzer

Isolate the lexeme for the next token
in the input buffer.

Produce as output a pair consisting of
the appropriate token and attribute value,
such as <id, pointer to table entry>,
using the translation table given in the
figure on the next page.

Regular expression   Token   Attribute-value
if                   if      (none)
id                   id      pointer to table entry
<                    relop   LT

2 Methods of token recognition

Use Transition Diagram

3 Transition Diagram (stylized flowchart)

Transition diagrams depict the actions that take place
when a lexical analyzer is called by the
parser to get the next token.

[Diagram for > and >= : from start state 0, the edge
labeled > leads to state 6; from state 6, = leads to an
accepting state that returns (relop, GE), while any
other character leads to accepting state 8*, which
returns (relop, GT).]

Notes: Here we use * to indicate states at which input
retraction must take place.

4 Implementing a Transition Diagram

Each state gets a segment of code

If there are edges leaving a state, then


its code reads a character and selects
an edge to follow, if possible

Use nextchar() to read next character


from the input buffer

while (1) {
    switch (state) {
    case 0:
        c = nextchar();
        if (c == ' ' || c == '\t' || c == '\n') {
            state = 0;              /* stay in the start state */
            lexeme_beginning++;     /* skip the whitespace character */
        }
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = fail();
        break;
    case 9:
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = fail();
        break;
    }
}

5 A generalized transition diagram:
the Finite Automaton

A finite automaton is deterministic or non-deterministic.

Non-deterministic means that more
than one transition out of a state may
be possible on the same input
symbol.

6 The model of token recognition

[Diagram: an input buffer (holding, e.g., d 2 = ) with a
lexeme_beginning pointer feeds an FA simulator, which
recognizes the next token.]

E.g. the FA simulator for identifiers is:

[Diagram: start state 1 has an edge labeled letter to
accepting state 2, which loops on letter and digit.]

which represents the rule:
identifier = letter ( letter | digit )*
