
Lexical Analyzer

Introduction to Lexical Analyzer


As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters
of the source program, group them into lexemes, and produce as output a sequence of tokens, one for
each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis.
It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme
into the symbol table.
The interaction is implemented by having the parser call the lexical analyzer by the getNextToken
command.
That causes the lexical analyzer to read characters from its input until it can identify the next lexeme
and produce for it the next token, which it returns to the parser.
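
The interaction described above can be sketched in Python. This is only an illustration under stated assumptions: the class Lexer, the function parse_all, the token names id, number, op and eof, and the use of a dictionary as the symbol table are all hypothetical and do not come from the text. Only the idea of the parser calling getNextToken (spelled get_next_token here) is from the source.

```python
class Lexer:
    """Hypothetical lexer that hands out one token per call."""

    def __init__(self, text, symbol_table):
        self.text = text
        self.pos = 0
        self.symbol_table = symbol_table  # maps identifier lexeme -> entry index

    def get_next_token(self):
        """Read characters until the next lexeme is identified; return its token."""
        # Skip whitespace between lexemes.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("eof", None)
        start = self.pos
        ch = self.text[self.pos]
        if ch.isalpha():
            while self.pos < len(self.text) and self.text[self.pos].isalnum():
                self.pos += 1
            lexeme = self.text[start:self.pos]
            # Enter the identifier into the symbol table the first time it is seen.
            index = self.symbol_table.setdefault(lexeme, len(self.symbol_table))
            return ("id", index)
        if ch.isdecimal():
            while self.pos < len(self.text) and self.text[self.pos].isdecimal():
                self.pos += 1
            return ("number", int(self.text[start:self.pos]))
        # Assumption: any other character is a one-character operator token.
        self.pos += 1
        return ("op", ch)


def parse_all(lexer):
    """Stand-in for the parser: keep requesting tokens until end of input."""
    tokens = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "eof":
            return tokens
        tokens.append(token)
```

For the input count = count + 42, the two occurrences of count yield the same token ("id", 0), because the lexeme is entered into the symbol table only once.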

Figure: Interactions between the lexical analyzer and the parser
Besides identifying lexemes, the lexical analyzer may perform certain other tasks, such as:
Stripping out comments and whitespace
Generating error messages with line numbers
Sometimes, lexical analyzers are divided into a cascade of two processes:
Scanning consists of the simple processes that do not require tokenization of the input,
such as deletion of comments and compaction of consecutive whitespace characters
into one.
Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.
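
The scanning stage can be illustrated with a short Python sketch. The comment syntax (// up to the end of the line) and the function name scan are assumptions made for this example; the text does not fix a source language. The sketch also ignores string literals and discards line breaks, so a real scanner that reports line numbers would need to do more.

```python
import re


def scan(source):
    """Scanning pass: delete comments, then compact runs of whitespace."""
    # Assumption: comments run from '//' to the end of the line.
    without_comments = re.sub(r"//[^\n]*", "", source)
    # Replace each run of consecutive whitespace characters with one blank.
    return re.sub(r"\s+", " ", without_comments).strip()
```

For example, scan("x  =  1; // set x\n\ty = 2;") returns "x = 1; y = 2;", which is then ready for lexical analysis proper.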
Some lexical analysis related terms:
A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword,
or a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes. In what follows, we shall generally write the name of
a token in boldface. We will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take. In the case
of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure
that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for
a token and is identified by the lexical analyzer as an instance of that token.
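
The three terms can be seen side by side in a small Python sketch. The token names (if, id, number, relop) and their patterns are assumptions chosen for illustration, and the patterns are written in the syntax of Python's re module rather than in the notation of these notes.

```python
import re

# Each entry pairs a token name with the pattern its lexemes must match.
# The keyword comes first so that "if" is not classified as an identifier.
PATTERNS = [
    ("if", r"if\b"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+"),
    ("relop", r"<=|>=|==|<|>"),
]


def tokenize(text):
    """Return (token name, lexeme) pairs; the lexeme serves as the attribute value."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in PATTERNS:
            match = re.compile(pattern).match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no pattern matches at position {pos}")
    return tokens
```

In the input if count <= 10, the lexeme count matches the pattern for id, and the lexeme <= matches the pattern for relop. The keyword pattern is matched by exactly one string, while the identifier pattern is matched by many.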
Input Buffering
Buffer Pairs:
Because of the amount of time taken to process characters and the large number of characters that
must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character.
Two pointers to the input are maintained:
Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found.
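
A minimal sketch of the two-pointer scheme, in Python. The function name split_lexemes and the lexeme rule used here (a maximal run of letters and digits, or else a single character) are simplifying assumptions; a real lexical analyzer would decide the extent of a lexeme from its token patterns.

```python
def split_lexemes(buffer):
    """Split buffer into lexemes using a lexeme_begin pointer and a forward pointer."""
    lexemes = []
    lexeme_begin = 0
    while lexeme_begin < len(buffer):
        if buffer[lexeme_begin].isspace():
            lexeme_begin += 1  # whitespace is not part of any lexeme
            continue
        forward = lexeme_begin
        if buffer[forward].isalnum():
            # Scan ahead while the next character still extends the lexeme.
            while forward + 1 < len(buffer) and buffer[forward + 1].isalnum():
                forward += 1
        # forward now rests on the character at the right end of the lexeme.
        lexemes.append(buffer[lexeme_begin:forward + 1])
        # lexeme_begin moves to the character immediately after the lexeme.
        lexeme_begin = forward + 1
    return lexemes
```

For the input sum = a1 + 27; the function returns the lexemes sum, =, a1, +, 27 and ; in that order.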
Sentinels:
If we use the buffer-pair scheme, we must check, each time we advance forward, that we have
not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each
character read, we make two tests: one for the end of the buffer, and one to determine what
character is read (the latter may be a multiway branch). We can combine the buffer-end test with
the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is
the character EOF.
Note that EOF retains its use as a marker for the end of the entire input. Any EOF that appears other
than at the end of a buffer means that the input is at an end.
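
The sentinel idea can be sketched in Python as follows. Several details are assumptions made for the example and are not given in the text: the buffer pair is modelled as one list holding two halves of N characters, each followed by a sentinel slot; N is kept tiny so that reloads actually occur; and the NUL character stands in for EOF. The names SentinelBuffer, next_char and read_all are hypothetical.

```python
import io

EOF = "\0"  # assumption: NUL never occurs in the source program
N = 8       # deliberately tiny half-buffer size so that reloads happen


class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Two halves of N characters, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * (N + 1))
        self._load(0)
        self.forward = 0

    def _load(self, start):
        """Fill the half that begins at index start and mark where its data ends."""
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # For a full half this lands on the sentinel slot; otherwise it marks end of input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next input character, or None at the end of the input."""
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:  # common case: a single test per character
            return ch
        at = self.forward - 1  # index at which the EOF was found
        if at == N:  # sentinel at the end of the first half
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if at == 2 * N + 1:  # sentinel at the end of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward = at  # EOF inside a buffer: the input is at an end
        return None


def read_all(text):
    """Usage example: drain a SentinelBuffer over an in-memory stream."""
    source = SentinelBuffer(io.StringIO(text))
    return "".join(iter(source.next_char, None))
```

In the common case only the comparison with EOF is made for each character. The tests for the end of a buffer run only after an EOF has actually been read.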

Specification of Tokens
Regular expressions are a notation for specifying patterns.
Each pattern matches a set of strings.
Regular expressions will serve as names for sets of strings.
Strings and Languages:
The term alphabet or character class denotes any finite set of symbols;
e.g., the set {0,1} is the binary alphabet.
The terms sentence and word are often used as synonyms for the term string.
The length of a string s, written |s|, is the number of occurrences of symbols in s;
e.g., the string banana has length six.
The empty string, denoted by ε, has length zero.
The term language denotes any set of strings over some fixed alphabet;
e.g., {ε}, the set containing only the empty string, is a language over the alphabet Σ.
If x and y are strings, then the concatenation of x and y (written xy) is the
string formed by appending y to x; e.g., if x = dog and y = house, then xy = doghouse.
The empty string is the identity under concatenation: εs = sε = s.
Exponentiation of strings: s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, and so on.
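
These string operations map directly onto Python strings. In the sketch below, the names EPSILON and power are chosen for the example only.

```python
EPSILON = ""  # the empty string


def power(s, n):
    """Return s^n, where s^0 is the empty string and s^n is s^(n-1) followed by s."""
    result = EPSILON
    for _ in range(n):
        result = result + s
    return result
```

Here len("banana") is 6, "dog" + "house" is "doghouse", and power("ab", 3) is "ababab".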


TERM                      DEFINITION
Prefix of s               A string obtained by removing zero or more trailing symbols
                          of string s; e.g., ban is a prefix of banana.
Suffix of s               A string formed by deleting zero or more of the leading
                          symbols of s; e.g., nana is a suffix of banana.
Substring of s            A string obtained by deleting a prefix and a suffix from s;
                          e.g., nan is a substring of banana.
Proper prefix, suffix,    Any nonempty string x that is a prefix, suffix, or substring
or substring of s         of s such that s ≠ x.
Subsequence of s          Any string formed by deleting zero or more not necessarily
                          contiguous symbols from s; e.g., baaa is a subsequence of
                          banana.

Table: Terms for parts of a string
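
The terms in the table can be checked with a few Python helpers. The function names are hypothetical, and the sketch favours clarity over efficiency.

```python
def prefixes(s):
    """All strings obtained by removing zero or more trailing symbols of s."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All strings obtained by deleting zero or more leading symbols of s."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All strings obtained by deleting a prefix and a suffix from s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the nonempty members of parts that differ from s."""
    return {x for x in parts if x != "" and x != s}


def is_subsequence(x, s):
    """True if x can be formed by deleting symbols of s without reordering the rest."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to and including the first match.
    return all(ch in remaining for ch in x)
```

With s = banana, ban is in prefixes(s), nana is in suffixes(s), nan is in substrings(s), and is_subsequence("baaa", s) holds.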
Operations on Languages:
There are several operations that can be applied to languages:


OPERATION                               DEFINITION
Union of L and M, written L ∪ M         L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM    LM = { st | s is in L and t is in M }
Kleene closure of L, written L*         L* denotes zero or more concatenations of L.
Positive closure of L, written L+       L+ denotes one or more concatenations of L.

Table: Definitions of operations on languages L and M
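
For finite languages, these operations can be tried out with Python sets. Because L* and L+ are infinite whenever L contains a nonempty string, the sketch below only approximates them up to a chosen number of concatenations. The parameter max_power and all function names are assumptions made for the example.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def kleene(L, max_power):
    """Approximate L* by the union of L^0, L^1, ..., L^max_power."""
    result = {""}
    current = {""}  # L^0 contains only the empty string
    for _ in range(max_power):
        current = concat(current, L)
        result |= current
    return result


def positive(L, max_power):
    """Approximate L+ by the union of L^1, ..., L^max_power."""
    if max_power < 1:
        return set()
    return concat(L, kleene(L, max_power - 1))
```

With L = {"a", "b"} and M = {"0", "1"}, concat(L, M) is {"a0", "a1", "b0", "b1"}, and kleene({"a"}, 3) is {"", "a", "aa", "aaa"}.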

Regular Expressions:
Regular expressions allow the sets of strings that form tokens to be defined precisely.
e.g., letter ( letter | digit )*
defines a Pascal identifier: a letter followed by zero or more letters or digits.
A regular expression is built up out of simpler regular expressions using a set of defining
rules.
Each regular expression r denotes a language L(r).
The rules that define the regular expressions over an alphabet Σ are as follows.
(Associated with each rule is a specification of the language denoted by the regular
expression being defined.)
1. ε is a regular expression that denotes {ε}, i.e., the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the
string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting the language (L(r))*.
d) (r) is a regular expression denoting the language L(r).
A language denoted by a regular expression is said to be a regular set.
The specification of a regular expression is an example of a recursive definition.
Rules (1) and (2) form the basis of the definition.
Rule (3) provides the inductive step.
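
The recursive definition translates almost line by line into a recursive function. In the Python sketch below, a regular expression is represented as a nested tuple; this encoding and the names language, eps, sym, alt, cat and star are assumptions made for the example. Since L(r) can be infinite, the function returns only the strings of length at most max_len. Rule 3(d) needs no case of its own, because parentheses are implicit in the nesting of the tuples.

```python
def language(r, max_len):
    """Strings of length <= max_len in L(r), following the defining rules."""
    kind = r[0]
    if kind == "eps":  # rule 1: denotes the set containing the empty string
        return {""}
    if kind == "sym":  # rule 2: a single symbol a denotes {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":  # rule 3(a): union of the two languages
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":  # rule 3(b): concatenation of the two languages
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {s + t for s in left for t in right if len(s + t) <= max_len}
    if kind == "star":  # rule 3(c): Kleene closure
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            # Extend the newest strings by one more factor; keep only unseen results.
            frontier = {s + t for s in frontier for t in base
                        if len(s + t) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")


# letter ( letter | digit )* over a reduced alphabet: letters a, b and digit 0
LETTER = ("alt", ("sym", "a"), ("sym", "b"))
DIGIT = ("sym", "0")
IDENT = ("cat", LETTER, ("star", ("alt", LETTER, DIGIT)))
```

Here language(IDENT, 2) contains the eight strings a, b, aa, ab, a0, ba, bb and b0. Each begins with a letter, as the identifier expression requires.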
AXIOM                             DESCRIPTION
r|s = s|r                         | is commutative
r|(s|t) = (r|s)|t                 | is associative
(rs)t = r(st)                     Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr    Concatenation distributes over |
εr = r; rε = r                    ε is the identity element for concatenation
r* = (r|ε)*                       Relation between * and ε
r** = r*                          * is idempotent

Table: Algebraic Properties of regular expressions
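
Individual instances of these axioms can be spot-checked with Python's re module by comparing which short strings two patterns match. This is a bounded test on particular choices of r, s and t, not a proof. The helper names matched and equivalent, the alphabet ab and the length bound 4 are assumptions of the sketch. In Python's syntax an empty alternative, as in (a|), plays the role of ε.

```python
import re
from itertools import product


def matched(pattern, alphabet="ab", max_len=4):
    """Strings over alphabet, up to max_len long, that pattern matches in full."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


def equivalent(p, q):
    """True if p and q match exactly the same strings within the bound."""
    return matched(p) == matched(q)
```

For example, with r = a, s = b and t = ab, equivalent("a(b|ab)", "ab|aab") holds (distribution), and so do equivalent("(a|)*", "a*") and equivalent("(a*)*", "a*").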
Regular Definition:
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions
of the form
d_1 → r_1
d_2 → r_2
...
d_n → r_n
where each d_i is a distinct name, and each r_i is a regular expression over the symbols in
Σ ∪ {d_1, d_2, ..., d_(i-1)}, i.e., the basic symbols and the previously defined names.
e.g., a regular definition for identifiers:
letter → A|B|...|Z|a|b|...|z
digit → 0|1|...|9
id → letter ( letter | digit )*
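
A regular definition can be turned into a single pattern by substituting each name into the definitions that follow it. The Python sketch below does this for the identifier example. The {name} placeholder convention, the function expand and the use of Python character classes for the letter and digit alternatives are all assumptions made for illustration.

```python
import re

# Each body may refer only to basic symbols and to previously defined names.
DEFINITIONS = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]


def expand(definitions):
    """Substitute earlier names into later bodies, in definition order."""
    expanded = {}
    for name, body in definitions:
        for earlier, pattern in expanded.items():
            # Wrap the substituted pattern in a group so that it stays one unit.
            body = body.replace("{" + earlier + "}", "(?:" + pattern + ")")
        expanded[name] = body
    return expanded


ID = re.compile(expand(DEFINITIONS)["id"])
```

ID.fullmatch("count2") succeeds, while ID.fullmatch("9lives") returns None because an identifier must begin with a letter.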

Notational Shorthand:
This shorthand is used for certain constructs that occur frequently in regular expressions.
1. One or more instances: the unary postfix operator + means one or more instances
of. If r is a regular expression that denotes the language L(r), then r+ is a regular
expression that denotes the language (L(r))+. Similarly, the unary postfix
operator * means zero or more instances of. The two algebraic
identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.
2. Zero or one instance: the unary postfix operator ? means zero or one instance
of. The notation r? is a shorthand for r|ε.
3. Character class: the notation [abc] is a shorthand for a|b|c.
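
Each shorthand can be rewritten using only union, concatenation and Kleene closure. The Python sketch below builds the rewritten patterns as strings for the re module. The function names plus, optional and char_class are hypothetical, and (?:...) is Python's non-capturing group, used here in place of plain parentheses.

```python
import re


def plus(r):
    """r+ rewritten with the identity r+ = rr*."""
    return f"(?:{r})(?:{r})*"


def optional(r):
    """r? rewritten as r|ε; the empty alternative stands for ε."""
    return f"(?:{r}|)"


def char_class(symbols):
    """[abc] rewritten as a|b|c; symbols are escaped so that they match literally."""
    return "(?:" + "|".join(re.escape(c) for c in symbols) + ")"


# An optionally signed binary numeral built from the three helpers.
SIGNED_BINARY = optional(char_class("+-")) + plus(char_class("01"))
```

char_class("abc") yields (?:a|b|c), and re.fullmatch(SIGNED_BINARY, "-101") succeeds, whereas a bare sign with no digits is rejected.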
Recognition of Tokens
