
The Role of the Lexical Analyzer

The lexical analyzer is the first phase of a compiler. Its main task is to read the
input characters and produce as output a sequence of tokens that the parser uses
for syntax analysis.
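A minimal sketch of this interface in Python; the Token type, the token names, and
the toy regular expression are illustrative assumptions rather than part of any
particular compiler:

    import re
    from collections import namedtuple

    # A token pairs a syntactic category with the lexeme that was matched.
    Token = namedtuple("Token", ["kind", "lexeme"])

    def tokenize(source):
        # Illustrative stub: find lexemes with a toy pattern and classify each one.
        for lexeme in re.findall(r"\w+|==|[=;()*+]", source):
            if lexeme.isdigit():
                kind = "NUMBER"
            elif lexeme[0].isalpha():
                kind = "ID"
            else:
                kind = "OP"
            yield Token(kind, lexeme)

    # The parser consumes the resulting token stream one token at a time.
    for token in tokenize("count = count + 17;"):
        print(token)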
Lexical Analyzer Functions
Grouping input characters into tokens
Stripping out comments and whitespace (see the sketch after this list)
Correlating error messages with the source program
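As a sketch of the comment- and whitespace-stripping function, the Python fragment
below removes C-style comments and collapses whitespace before tokenization; it
assumes comment markers never appear inside string literals:

    import re

    def strip_comments_and_whitespace(source):
        # Remove /* ... */ block comments and // line comments.
        source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
        source = re.sub(r"//[^\n]*", " ", source)
        # Collapse runs of blanks, tabs, and newlines into a single space.
        return re.sub(r"\s+", " ", source).strip()

    print(strip_comments_and_whitespace("int x = 1; // counter\n/* temp */ int y = 2;"))
    # -> int x = 1; int y = 2;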
Issues in Lexical Analysis
There are several reasons for separating the
analysis phase of compiling into lexical
analysis and parsing:
Simpler design
Compiler efficiency
Compiler portability (e.g., Windows to Linux)
Specialized tools have been designed to help automate the construction of lexical
analyzers and parsers when they are separated.
Tokens, Patterns, Lexemes
A lexeme is a sequence of characters in the source
program that is matched by the pattern for a token.
A lexeme is a basic lexical unit of a language comprising
one or several words, the elements of which do not
separately convey the meaning of the whole.
The lexemes of a programming language include its identifiers, literals, operators,
and special words.
A token of a language is a category of its lexemes.
A pattern is a rule describing the set of lexemes that can represent a particular
token in the source program.
What's a Token?
A syntactic category
In English: noun, verb, adjective, ...
In a programming language: Identifier, Integer, Keyword, Whitespace, ...

Examples of Tokens
const pi = 3.1416;
The substring pi is a lexeme for the token identifier.
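To make the relationship concrete, the sketch below uses an assumed identifier
pattern and shows that it matches the lexeme pi (it also matches const, which a
real lexer would then reclassify as a keyword):

    import re

    # Assumed identifier pattern: a letter followed by letters or digits.
    IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

    statement = "const pi = 3.1416;"
    print(IDENTIFIER.findall(statement))
    # -> ['const', 'pi']  (both match the identifier pattern; 'const' would be
    #    reclassified as a keyword by the lexer)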
Lexeme and Token
semicolon ;
int_literal 17
plus_op +
identifier Count
multi_op *
int_literal 2
equal_sign =
Identifier Index
Tokens Lexemes
Index = 2 * count +17;
Tokens, Patterns and Lexemes
Pattern: A rule that describes a set of strings
Token: A set of strings described by the same pattern
Lexeme: The sequence of characters of one instance of a token


Token      Sample Lexemes        Pattern
if         if                    if
id         abc, n, count         letter followed by letters and digits
NUMBER     3.14, 1000            numerical constant
;          ;                     ;
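The Pattern column can be expressed as regular expressions. The sketch below checks
the sample lexemes against one possible set of regexes; the exact expressions are
assumptions, chosen for illustration:

    import re

    # One possible regular expression for each token's pattern.
    PATTERNS = {
        "if":     r"if",
        "id":     r"[A-Za-z][A-Za-z0-9]*",   # letter followed by letters and digits
        "NUMBER": r"\d+(\.\d+)?",            # simple numerical constant
        ";":      r";",
    }

    SAMPLES = {
        "if": ["if"],
        "id": ["abc", "n", "count"],
        "NUMBER": ["3.14", "1000"],
        ";": [";"],
    }

    # Every sample lexeme should be fully matched by its token's pattern.
    for token, lexemes in SAMPLES.items():
        for lexeme in lexemes:
            assert re.fullmatch(PATTERNS[token], lexeme), (token, lexeme)
    print("all sample lexemes match their patterns")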
Lexical Analysis
What do we want to do? Example:
if (i == j)
z = 0;
else
z = 1;
The input is just a string of characters:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: Partition the input string into substrings, where the substrings are tokens

Example
Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Token-lexeme pairs returned by the lexer:
(Whitespace, \t)
(Keyword, if)
(OpenPar, ()
(Identifier, i)
(Relation, ==)
(Identifier, j)
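A minimal tokenizer sketch that produces pairs like those listed above. The token
names and regular expressions are illustrative assumptions; in practice such a
specification is usually fed to a lexer generator rather than hand-coded:

    import re

    # Ordered token specification: keywords before identifiers, '==' before '='.
    TOKEN_SPEC = [
        ("Whitespace", r"[ \t\n]+"),
        ("Keyword",    r"\b(?:if|else)\b"),
        ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
        ("Number",     r"\d+"),
        ("Relation",   r"=="),
        ("Assign",     r"="),
        ("OpenPar",    r"\("),
        ("ClosePar",   r"\)"),
        ("Semicolon",  r";"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))

    def tokenize(source):
        # Partition the input string into (token, lexeme) pairs.
        for match in MASTER.finditer(source):
            yield match.lastgroup, match.group()

    source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
    for pair in tokenize(source):
        print(pair)
    # First few pairs: ('Whitespace', '\t'), ('Keyword', 'if'), ('Whitespace', ' '),
    # ('OpenPar', '('), ('Identifier', 'i'), ('Whitespace', ' '), ('Relation', '=='), ...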

What are Tokens For?
Classify program substrings according to their role
The output of lexical analysis is a stream of tokens, which is the input to the parser
The parser relies on token distinctions
An identifier is treated differently from a keyword
Tokens
Tokens correspond to sets of strings.
Identifier: strings of letters or digits, starting
with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or ...
Whitespace: a non-empty sequence of blanks,
newlines, and tabs

Token Attribute
E = C1 ** 10

Token   Attribute
ID      index to symbol-table entry for E
=       (none)
ID      index to symbol-table entry for C1
**      (none)
NUM     10
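One way these attributes might be represented, assuming a simple list-based symbol
table; the helper intern and the token names ASSIGN and EXP are illustrative, not
standard:

    # Identifier tokens carry an attribute: an index into the symbol table.
    symbol_table = []

    def intern(name):
        # Return the symbol-table index for name, inserting it on first use.
        if name not in symbol_table:
            symbol_table.append(name)
        return symbol_table.index(name)

    # Token stream for:  E = C1 ** 10
    tokens = [
        ("ID", intern("E")),    # attribute: index of E in the symbol table
        ("ASSIGN", None),       # = carries no attribute
        ("ID", intern("C1")),   # attribute: index of C1 in the symbol table
        ("EXP", None),          # ** carries no attribute
        ("NUM", 10),            # attribute: the numeric value
    ]
    print(tokens)          # [('ID', 0), ('ASSIGN', None), ('ID', 1), ('EXP', None), ('NUM', 10)]
    print(symbol_table)    # ['E', 'C1']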
Lexical Error and Recovery
Error detection
Error reporting
Error recovery
Delete the current character and restart scanning at the next character (see the
sketch below)
Delete the first character read by the scanner and resume scanning at the character
following it.
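A sketch of the first recovery strategy: when no pattern matches, the offending
character is reported and deleted, and scanning resumes at the character that
follows it. The token pattern is the simplified one used in the earlier sketches:

    import re

    # Simplified token pattern: whitespace, identifiers, numbers, a few operators.
    TOKEN_RE = re.compile(r"[ \t\n]+|[A-Za-z][A-Za-z0-9]*|\d+|==|[=;()*+]")

    def tokenize_with_recovery(source):
        tokens, errors, pos = [], [], 0
        while pos < len(source):
            match = TOKEN_RE.match(source, pos)
            if match:
                tokens.append(match.group())
                pos = match.end()
            else:
                # Recovery: report the stray character, delete it, and resume
                # scanning at the character that follows it.
                errors.append(f"unexpected character {source[pos]!r} at position {pos}")
                pos += 1
        return tokens, errors

    print(tokenize_with_recovery("x = 1 @ 2;"))
    # (['x', ' ', '=', ' ', '1', ' ', ' ', '2', ';'],
    #  ["unexpected character '@' at position 6"])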
