
Lexical Analysis

Shashank Gupta
Assistant Professor
Department of Computer Science and Information Systems
BITS Pilani, Pilani Campus
What is Lexical Analysis (LA)?

• The input to the lexical analyzer (LA) is a program in a high-level language; the output is a sequence of tokens.
• It generally cleans the code: it strips blanks, tabs, newlines and comments.
• It keeps track of line numbers so that error messages can refer to them, and it performs some pre-processor functions.
CS F363 Compiler Construction BITS Pilani, Pilani Campus
Separation of Lexical Analysis from Syntax Analysis

• Separating the two phases simplifies the design of the compiler.
• I/O issues are confined to the lexical analyzer.
• The parser becomes more compact and faster.

Tokens, Patterns and Lexemes

• Token: a string of characters that logically belong together.
• Pattern: the set of strings for which the same token is produced.
• Lexeme: the sequence of characters matched by a pattern to form the corresponding token.

Tokens

• Tokens include keywords, operators, identifiers, constants, literal strings and punctuation symbols.
• The LA passes a unique integer representing the token to the parser.
• Attributes, such as an integer value, can also be passed along with the token.
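
The pairing of an integer token code with an attribute could be sketched as follows. This is only an illustration: the names `TokenKind`, `Token`, `make_token` and `print_token`, and the numeric codes, are assumptions and do not come from the slides.

```c
#include <stdio.h>

/* Hypothetical token codes; starting at 256 keeps them distinct
   from single-character tokens such as '+' (an assumption). */
typedef enum { TOK_NUM = 256, TOK_ID, TOK_KEYWORD } TokenKind;

/* A token as handed to the parser: an integer code plus an attribute. */
typedef struct {
    int kind;       /* unique integer identifying the token class */
    int attribute;  /* e.g. numeric value or symbol-table index; -1 if none */
} Token;

/* Build a token value (hypothetical helper). */
Token make_token(int kind, int attribute) {
    Token t;
    t.kind = kind;
    t.attribute = attribute;
    return t;
}

/* Print a token in <kind, attribute> form for debugging. */
void print_token(Token t) {
    printf("<%d, %d>\n", t.kind, t.attribute);
}
```

Under these assumptions, the number 42 would reach the parser as `make_token(TOK_NUM, 42)`, while an operator such as `+` could be passed as `make_token('+', -1)`.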

Problems in Lexical Analysis

• Certain languages do not have any reserved words.
• Blanks are not significant in some languages, but they are in C.
• The LA can catch only simple errors; it cannot detect any significant ones.

Specification and Recognition of Tokens

• Regular expressions are widely used to specify tokens.
• Transition diagrams are used to implement regular definitions and to recognize tokens.

Interface to other Phases

Push back operation must be implemented through a buffer.
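
A push-back buffer of the kind mentioned above might look like the following minimal sketch. The names `next_char`, `push_back` and `PUSHBACK_MAX`, and the fixed capacity, are assumptions made for illustration.

```c
#include <stdio.h>

#define PUSHBACK_MAX 8            /* assumed capacity of the push-back buffer */

static int pushback[PUSHBACK_MAX];
static int pushback_count = 0;

/* Return the next character, preferring characters that were pushed back. */
int next_char(FILE *in) {
    if (pushback_count > 0)
        return pushback[--pushback_count];
    return fgetc(in);
}

/* Push a character back so the next call to next_char returns it.
   Returns 0 on success, -1 if the buffer is full or c is EOF. */
int push_back(int c) {
    if (c == EOF || pushback_count == PUSHBACK_MAX)
        return -1;
    pushback[pushback_count++] = c;
    return 0;
}
```

The buffer is last-in, first-out, so pushing characters back in the reverse of the order they were read restores the original input order.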

Construct a Lexical Analyzer

• Allow white space, numbers and arithmetic operators in an expression.
• Return the token and its attributes to the syntax analyzer.
• A global variable tokenval is set to the value of the number.
• The design requires defining a finite set of tokens and describing the strings that belong to each token.

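C Program

#include <stdio.h>
#include <ctype.h>

#define NONE (-1)   /* assumed attribute value for tokens without one */
#define NUM  256    /* assumed token code for numbers, above any character code */

int lineno = 1;       /* current line number, for error messages */
int tokenval = NONE;  /* attribute of the most recent token */

int lex(void) {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                       /* skip blanks and tabs */
        else if (t == '\n')
            lineno = lineno + 1;
        else if (isdigit(t)) {
            tokenval = t - '0';
            t = getchar();
            while (isdigit(t)) {    /* accumulate the value of the number */
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);       /* push back the lookahead character */
            return NUM;
        } else {
            tokenval = NONE;
            return t;               /* any other character (or EOF) is its own token */
        }
    }
}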
Problems

• The analyzer scans the text character by character.
• A lookahead character determines what kind of token to read and when the current token ends.
• The first character alone cannot always determine what kind of token is being read.
Symbol Table

Stores information for subsequent phases.

Interface to the symbol table:

• Insert(s, t): save lexeme s and token t, and return a pointer to the new entry.
• Lookup(s): return the index of the entry for lexeme s, or 0 if s is not found.

Implementation of Symbol Table

• Reserve a fixed amount of space for each lexeme. This is not advisable because it wastes space.
• Store the lexemes in a separate array, with each lexeme terminated by an end-of-string marker (eos). The symbol table then holds pointers to the lexemes.
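
The second layout could be sketched roughly as below. The array sizes, the names `lexemes`, `symtable`, `lastchar` and `lastentry`, and the choice to return an index (rather than a pointer) from `insert` are assumptions for illustration; entry 0 is left unused so that 0 can signal "not found".

```c
#include <string.h>

#define STRMAX 999   /* assumed size of the lexeme array */
#define SYMMAX 100   /* assumed size of the symbol table */

struct entry {
    char *lexptr;    /* points into lexemes[] */
    int token;       /* token code stored with the lexeme */
};

static char lexemes[STRMAX];          /* all lexemes, each ended by '\0' (eos) */
static struct entry symtable[SYMMAX]; /* entry 0 is unused so 0 can mean "not found" */
static int lastchar = -1;             /* last used position in lexemes[] */
static int lastentry = 0;             /* last used position in symtable[] */

/* Return the index of the entry for s, or 0 if s is not found. */
int lookup(const char *s) {
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Store lexeme s with token tok; return the new index, or 0 if out of space. */
int insert(const char *s, int tok) {
    int len = (int)strlen(s);
    if (lastentry + 1 >= SYMMAX || lastchar + len + 1 >= STRMAX)
        return 0;
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar += len + 1;              /* account for the lexeme and its eos */
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
```

Because each lexeme occupies only its own length plus one terminator, short and long identifiers share the same array without the waste of fixed-width slots.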

Implementation of Symbol Table
(Continued…..)

Handling of Keywords

• Consider the tokens DIV and MOD with lexemes div and mod.
• Initialize the symbol table with insert("div", DIV) and insert("mod", MOD).
• Any subsequent lookup of these lexemes returns a nonzero value, so they cannot be used as identifiers.
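
The keyword scheme above can be illustrated with a small sketch. For brevity it uses a fixed-width table, and the token codes, `init_keywords` and `token_for` are hypothetical names introduced here, not taken from the slides.

```c
#include <string.h>

enum { ID = 257, DIV, MOD };          /* hypothetical token codes */

#define MAXSYMS 64                    /* assumed table capacity */
static struct { char lexeme[32]; int token; } table[MAXSYMS]; /* slot 0 unused */
static int nsyms = 0;

/* Return the index of lexeme s, or 0 if it is not in the table. */
int lookup(const char *s) {
    for (int i = 1; i <= nsyms; i++)
        if (strcmp(table[i].lexeme, s) == 0)
            return i;
    return 0;
}

/* Add lexeme s with the given token; return its index, or 0 on failure. */
int insert(const char *s, int token) {
    if (nsyms + 1 >= MAXSYMS || strlen(s) >= sizeof table[0].lexeme)
        return 0;
    nsyms++;
    strcpy(table[nsyms].lexeme, s);
    table[nsyms].token = token;
    return nsyms;
}

/* Pre-load the keywords so that later lookups find them. */
void init_keywords(void) {
    insert("div", DIV);
    insert("mod", MOD);
}

/* Classify a lexeme: keywords keep their own token, anything new becomes ID. */
int token_for(const char *lexeme) {
    int p = lookup(lexeme);
    if (p == 0)
        p = insert(lexeme, ID);
    return p ? table[p].token : 0;
}
```

After `init_keywords()`, `token_for("div")` yields DIV because the lookup succeeds before any identifier entry can be created, whereas a fresh name such as `count` is inserted and reported as ID.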
Difficulties in Design of Lexical Analyzer

Lexemes in a fixed position
• Fixed-format vs. free-format languages.

Handling of blanks
• In Pascal, blanks separate identifiers.
• In Fortran, blanks are important only in some situations.
PL/1 Problems

Keywords are not reserved in PL/1
• if then then then = else else else = then
• if if then then = then + 1

PL/1 declarations
• Declare (arg 1, arg 2, arg 3, ...., arg n )
• The LA cannot tell whether Declare is a keyword or something else until it reaches the ‘)’.

Specification of Tokens

• How to describe tokens?
  4.e1 40.e-1 4.000

• How to break text into tokens?
  if (y==0) a = y << 1;
  iff (y==0) a = y < 1;

• How to break input into tokens efficiently?
  - Tokens may have similar prefixes.

Recognition of Tokens

• Programming language tokens can be described by regular languages.
• Regular languages
- are easy to understand
- have a well-understood and useful theory
- have efficient implementations
• Regular languages are discussed in detail in the "Theory of Computation" course.
Notations in Regular Expressions

• If r and s are regular expressions denoting the languages L(r) and L(s), then
- (r)|(s) is a regular expression denoting L(r) ∪ L(s)
- (r)(s) is a regular expression denoting L(r)L(s)
- (r)* is a regular expression denoting (L(r))*
- (r) is a regular expression denoting L(r)
How to Specify Tokens?

Examples

Regular Definition for Unsigned Numbers

Digit    → 0 | 1 | 2 | ... | 9
Digits   → Digit+
Fraction → '.' Digits | ε
Exponent → ( E ( + | - | ε ) Digits ) | ε
Number   → Digits Fraction Exponent
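
A hand-written recognizer for this regular definition might look like the sketch below. The function names `digits` and `match_number` are assumptions; the code returns the length of the longest prefix that matches Number, and it treats Fraction and Exponent as empty when no digits follow the '.' or the 'E'.

```c
#include <ctype.h>
#include <stddef.h>

/* Consume Digits (one or more digits) starting at s[i]; return the new
   index, or i unchanged if there is no digit. */
static size_t digits(const char *s, size_t i) {
    while (isdigit((unsigned char)s[i]))
        i++;
    return i;
}

/* Return the length of the longest prefix of s that matches Number,
   or 0 if no prefix matches. */
size_t match_number(const char *s) {
    size_t i = digits(s, 0);
    if (i == 0)
        return 0;                         /* Number must start with Digits */

    if (s[i] == '.') {                    /* Fraction -> '.' Digits | epsilon */
        size_t j = digits(s, i + 1);
        if (j > i + 1)
            i = j;                        /* take the fraction only if digits follow */
    }
    if (s[i] == 'E') {                    /* Exponent -> E (+|-|epsilon) Digits | epsilon */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        size_t k = digits(s, j);
        if (k > j)
            i = k;                        /* take the exponent only if digits follow */
    }
    return i;
}
```

For instance, `match_number("6.02E+23")` consumes all 8 characters, while `match_number("12.")` consumes only 2 because the definition requires digits after the point.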

Regular Expressions in
Specifications

• Regular expressions describe many useful languages.
• Regular expressions are only specifications; an implementation is still required.
• Given a string s and a regular expression R, does s belong to L(R)?
• The solution to this problem is the basis of lexical analyzers.
• Goal: partition the input into tokens.

Regular Expressions in
Specifications

Regular Expressions in
Specifications

• The algorithm gives priority to tokens listed earlier
- It treats "if" as a keyword and not as an identifier
• How much input is used? What if both of the following hold, with i ≠ j?
- x1...xi ∈ L(R)
- x1...xj ∈ L(R)
- Pick the longest possible string in L(R)
- This is the principle of "maximal munch"
• Regular expressions provide a concise and useful notation for string patterns
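
The two rules, priority order and maximal munch, can be combined as in the following sketch. The rule set (keyword `if`, identifiers, and the operators `<<`, `<=`, `<`), the token codes, and all function names are assumptions chosen for illustration.

```c
#include <ctype.h>
#include <string.h>

enum { T_NONE = 0, T_IF, T_ID, T_SHL, T_LE, T_LT };   /* hypothetical token codes */

/* Each matcher returns the length of the prefix of s it accepts (0 if none). */
static int match_literal(const char *s, const char *lit) {
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? (int)n : 0;
}
static int match_if(const char *s)  { return match_literal(s, "if"); }
static int match_shl(const char *s) { return match_literal(s, "<<"); }
static int match_le(const char *s)  { return match_literal(s, "<="); }
static int match_lt(const char *s)  { return match_literal(s, "<"); }
static int match_id(const char *s) {
    int n = 0;
    if (isalpha((unsigned char)s[0]))
        for (n = 1; isalnum((unsigned char)s[n]); n++)
            ;
    return n;
}

/* Rules in priority order: earlier entries win ties. */
static const struct { int token; int (*match)(const char *); } rules[] = {
    { T_IF, match_if }, { T_ID, match_id },
    { T_SHL, match_shl }, { T_LE, match_le }, { T_LT, match_lt },
};

/* Find the token for the longest prefix of s; store its length in *len.
   Returns T_NONE (and *len == 0) if no rule matches. */
int next_token(const char *s, int *len) {
    int best = T_NONE;
    *len = 0;
    for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
        int n = rules[r].match(s);
        if (n > *len) {           /* strictly longer: keeps the earlier rule on ties */
            *len = n;
            best = rules[r].token;
        }
    }
    return best;
}
```

With these rules, `if` is reported as the keyword because the keyword rule is listed first and the identifier rule only ties it, while `iff` is an identifier because the identifier rule matches a longer prefix.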

How to break up text?

Transition Diagrams

• Regular expressions are declarative specifications.
• A finite automaton is an implementation.
• A transition diagram consists of an input alphabet, a set of states, a set of transitions, a set of final states and a start state.

Pictorial Notations

Transition Diagrams

• Transitions may be labelled with a symbol, a group of symbols or a regular definition.
• A few states may be treated as retracting states, which indicate that the lexeme does not include the symbol that brought us to the accepting state.
• Every state has an action attached to it, which is executed when the state is reached. Usually, such an action returns a token and its attribute value.
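
A transition diagram with retracting states can be coded as a state machine, as in this sketch for relational operators. The operator set (`<`, `<=`, `<>`, `=`, `>`, `>=`), the state numbering and the result codes are assumptions, since the diagram itself is not reproduced in this text.

```c
enum { NO_RELOP = 0, LT, LE, NE, EQ, GT, GE };   /* hypothetical attribute codes */

/* Walk a transition diagram for < <= <> = > >= over the string s.
   On success, returns the operator code and stores the lexeme length
   in *len. Retracting states do not count the lookahead character. */
int relop(const char *s, int *len) {
    int state = 0, i = 0;
    for (;;) {
        char c = s[i];
        switch (state) {
        case 0:                                   /* start state */
            if (c == '<')      { state = 1; i++; }
            else if (c == '=') { *len = 1; return EQ; }
            else if (c == '>') { state = 2; i++; }
            else               { *len = 0; return NO_RELOP; }
            break;
        case 1:                                   /* seen '<' */
            if (c == '=') { *len = 2; return LE; }
            if (c == '>') { *len = 2; return NE; }
            *len = 1;                             /* retract: c is not part of the lexeme */
            return LT;
        case 2:                                   /* seen '>' */
            if (c == '=') { *len = 2; return GE; }
            *len = 1;                             /* retract */
            return GT;
        }
    }
}
```

On input `<a`, the machine reads `a` as lookahead to decide that the operator is `<`, then reports a lexeme of length 1 so that `a` is left for the next token.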
Transition Diagram for Identifier
and Keywords

* Indicates Retraction State

Transition Diagram for Relational
Operators

Transition Diagram for Relational
Operators

Transition Diagram for Unsigned
Numbers in C

Example

Design a regular definition and a transition diagram for hex and octal constants. Assume that, in your compiler, hex notation must begin with 0x whereas octal notation must begin with 0. In addition, both notations may include a qualifier (unsigned, long or none) as a suffix.
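
One possible reading of this exercise is sketched below; it is not the only valid answer. The spelling of the qualifier as a single `u` or `l`, the treatment of a lone `0` as octal, and the names `match_hex_or_oct`, `HEX_CONST` and `OCT_CONST` are all assumptions.

```c
#include <ctype.h>

enum { NOT_CONST = 0, HEX_CONST, OCT_CONST };   /* hypothetical result codes */

/* Assumed regular definition:
     qualifier -> u | l | epsilon     (unsigned, long, or none)
     hex       -> 0 x hexdigit+ qualifier
     oct       -> 0 octdigit* qualifier
   Classifies the longest matching prefix of s and stores its length in *len. */
int match_hex_or_oct(const char *s, int *len) {
    int i, kind;
    *len = 0;
    if (s[0] != '0')
        return NOT_CONST;
    if (s[1] == 'x' && isxdigit((unsigned char)s[2])) {
        kind = HEX_CONST;
        for (i = 2; isxdigit((unsigned char)s[i]); i++)
            ;
    } else {
        kind = OCT_CONST;
        for (i = 1; s[i] >= '0' && s[i] <= '7'; i++)
            ;
    }
    if (s[i] == 'u' || s[i] == 'l')      /* optional qualifier suffix */
        i++;
    *len = i;
    return kind;
}
```

Under these assumptions, `0x1Fu` is a hex constant of length 5 and `0755l` an octal constant of length 5, while `0x` with no hex digit after it falls back to the octal constant `0` and leaves `x` unread.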

Transition Diagram for Hex and Oct
Constants

Example

Construct a generalized DFA to capture the patterns of logical operators, integers and real numbers. The regular expressions [.][a][n][d][.], [.][o][r][.], [.][n][o][t][.], [0-9][0-9]* and [0-9][0-9]*[.][0-9][0-9] define the patterns for logical and, or, not, integers and real numbers, respectively. How will you tokenize the following input?
46.an89.or.and.45.54
Explain the step-by-step procedure precisely.
Transition Diagram

Transition Diagram for Identifiers
and Keywords

* Indicates Retraction State

Lexical Analyzer Implementation
from Transition Diagrams

Transition Diagram for Hex and Oct
Constants

Lexical Analyzer Implementation
from Transition Diagrams

Lexical Analyzer Implementation
from Transition Diagrams

Lexical Analyzer Implementation
from Transition Diagrams

Transition Diagram for Integer
Constant

Lexical Analyzer Implementation
from Transition Diagram

Generation of Lexical Analyzer from Transition Diagram

• Different transition diagrams must be combined appropriately to generate a lexical analyzer.
  - Merging different transition diagrams is not easy.
• Trace the different transition diagrams one after another.
• To find the longest match, all transition diagrams must be tried and the longest match must be used.
Lexical Analyzer Generator

Input to the generator

• A list of regular expressions in priority order
• An associated action for each regular expression (generates the kind of token and other bookkeeping information)

Output of the generator

• A program that reads the input character stream and breaks it into tokens
• The program reports lexical errors (unexpected characters), if any
Thank You

