As the first phase of a compiler, the lexical analyzer's main task is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. The lexical analyzer commonly interacts with the symbol table as well: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. The interaction between the parser and the lexical analyzer is implemented by having the parser call the lexical analyzer with the getNextToken command. That call causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
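The call pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name Lexer, the token set in TOKEN_SPEC, the stand-in parse function, and the choice to represent a token as a (name, attribute) pair are all hypothetical and do not come from any particular compiler.

```python
import re

# Hypothetical token set: (token name, compiled pattern), tried in order.
TOKEN_SPEC = [
    ("ws", re.compile(r"[ \t\n]+")),
    ("number", re.compile(r"[0-9]+")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("op", re.compile(r"[=+\-*/]")),
]


class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.symbol_table = {}  # identifier lexeme -> entry index

    def get_next_token(self):
        """Read characters until the next lexeme is found; return its token."""
        while self.pos < len(self.source):
            for name, regex in TOKEN_SPEC:
                match = regex.match(self.source, self.pos)
                if match:
                    break
            else:
                raise SyntaxError(f"unexpected character {self.source[self.pos]!r}")
            lexeme = match.group()
            self.pos = match.end()
            if name == "ws":
                continue  # whitespace is stripped, not passed to the parser
            if name == "id":
                # Enter the identifier into the symbol table on first sight.
                index = self.symbol_table.setdefault(lexeme, len(self.symbol_table))
                return ("id", index)
            if name == "number":
                return ("number", int(lexeme))
            return (lexeme, None)  # operators: the lexeme doubles as token name
        return ("eof", None)


def parse(source):
    """Stand-in for a parser: it only pulls tokens until end of input."""
    lexer = Lexer(source)
    tokens = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "eof":
            break
        tokens.append(token)
    return tokens, lexer.symbol_table


if __name__ == "__main__":
    tokens, table = parse("position = initial + rate * 60")
    print(tokens)
    print(table)
```

The parser drives the loop, and the lexical analyzer does work only when asked for the next token, so the whole token stream never has to be held in memory at once.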
Figure: Interactions between the lexical analyzer and the parser

Besides identifying lexemes, the lexical analyzer may perform certain other tasks, such as:
1. Stripping out comments and whitespace.
2. Generating error messages that include the line number.

Sometimes, lexical analyzers are divided into a cascade of two processes:
1. Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
2. Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

Some terms related to lexical analysis:
1. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.
2. A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
3. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Input Buffering

Buffer Pairs: Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.

Sentinels: If we use the scheme of buffer pairs, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character EOF. Note that EOF retains its use as a marker for the end of the entire input: any EOF that appears other than at the end of a buffer means that the input is at an end.
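The sentinel scheme can be sketched as follows. This is a simplified illustration, not a production reader: it assumes the NUL character stands in for EOF and never occurs in the source text, the class name BufferPair and the half size n are hypothetical, and only the forward pointer is modeled (lexemeBegin is omitted).

```python
import io

EOF = "\0"  # assumed sentinel: a character that cannot occur in the source


class BufferPair:
    """Two buffer halves of n characters, each followed by a sentinel slot."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        # Layout: [0 .. n-1] first half, [n] sentinel,
        #         [n+1 .. 2n] second half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start):
        """Fill one half from the stream; pad a short read with EOF."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance forward by one character; return None at end of input."""
        while True:
            c = self.buf[self.forward]
            self.forward += 1
            if c != EOF:
                return c  # common case: a single test per character
            if self.forward == self.n + 1:
                # Passed the sentinel of the first half: reload the second half.
                self._load(self.n + 1)
            elif self.forward == 2 * self.n + 2:
                # Passed the sentinel of the second half: reload the first half.
                self._load(0)
                self.forward = 0
            else:
                # EOF inside a buffer half: the input is at an end.
                self.forward -= 1
                return None


def read_all(text, n):
    """Helper for demonstration: drain a BufferPair over the given text."""
    pair = BufferPair(io.StringIO(text), n)
    chars = []
    while (c := pair.next_char()) is not None:
        chars.append(c)
    return "".join(chars)
```

For ordinary characters the loop performs one comparison against EOF; the tests that decide which half to reload run only when a sentinel is actually hit.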
Specification of Tokens

Regular expressions are a notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings.

Strings and Languages: The term alphabet or character class denotes any finite set of symbols; e.g., the set {0,1} is the binary alphabet. A string over an alphabet is a finite sequence of symbols drawn from that alphabet; the terms sentence and word are often used as synonyms for the term string. The length of a string s, written |s|, is the number of occurrences of symbols in s; e.g., the string banana is of length six. The empty string, denoted ε, is the string of length zero. The term language denotes any set of strings over some fixed alphabet; e.g., {ε}, the set containing only the empty string, is a language under this definition. If x and y are strings, then the concatenation of x and y, written xy, is the string formed by appending y to x; e.g., if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation: εs = sε = s. Exponentiation of strings is defined by s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, and so on.
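As a quick check of the string definitions above, the following Python sketch models the empty string as "" and writes concatenation and exponentiation as small functions. The names EPSILON, concat, and power are chosen here for illustration only.

```python
EPSILON = ""  # the empty string, of length zero


def concat(x, y):
    """Concatenation xy: the string formed by appending y to x."""
    return x + y


def power(s, n):
    """Exponentiation: s^0 is the empty string, s^n is s^(n-1) followed by s."""
    result = EPSILON
    for _ in range(n):
        result = concat(result, s)
    return result


if __name__ == "__main__":
    print(len("banana"))           # length |s| of banana
    print(concat("dog", "house"))  # concatenation of dog and house
    print(power("ab", 3))          # ab repeated three times
```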
TERM                                      DEFINITION
Prefix of s                               A string obtained by removing zero or more trailing symbols of s; e.g., ban is a prefix of banana.
Suffix of s                               A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
Substring of s                            A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana.
Proper prefix, suffix, or substring of s  Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
Subsequence of s                          Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
Table: Terms for parts of a string

Operations on Languages: There are several operations that can be applied to languages:
OPERATION                                 DEFINITION
Union of L and M, written L ∪ M           L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM      LM = { st | s is in L and t is in M }
Kleene closure of L, written L*           L* denotes zero or more concatenations of L.
Positive closure of L, written L+         L+ denotes one or more concatenations of L.
Table: Definitions of operations on languages L and M
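The operations in the table can be tried out on small finite languages represented as Python sets of strings. This is a sketch under one important assumption: the closures of a nonempty language are usually infinite sets, so the hypothetical helpers kleene_closure_upto and positive_closure_upto enumerate them only up to k concatenations.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string st with s in L and t in M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure_upto(L, k):
    """Finite approximation of L*: union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result


def positive_closure_upto(L, k):
    """Finite approximation of L+: union of L^1, ..., L^k."""
    result = set()
    for i in range(1, k + 1):
        result |= power(L, i)
    return result


if __name__ == "__main__":
    L = {"a", "b"}
    M = {"0", "1"}
    print(sorted(union(L, M)))
    print(sorted(concat(L, M)))
    print(sorted(kleene_closure_upto(L, 2)))
    print(sorted(positive_closure_upto(L, 2)))
```

The only difference between the two closures is the L^0 term: the approximation of L* always contains the empty string, while that of L+ contains it only if L itself does.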
Regular Expressions: Regular expressions allow the sets of strings that form tokens to be defined precisely; e.g., letter ( letter | digit )* defines a Pascal identifier, which is formed by a letter followed by zero or more letters or digits. A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The following rules define the regular expressions over an alphabet Σ. (Associated with each rule is a specification of the language denoted by the regular expression being defined.)
1. ε is a regular expression that denotes {ε}, i.e., the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting the language (L(r))*.
d) (r) is a regular expression denoting the language L(r).

A language denoted by a regular expression is said to be a regular set. The specification of a regular expression is an example of a recursive definition: rules (1) and (2) form the basis of the definition, and rule (3) provides the inductive step.

AXIOM                               DESCRIPTION
r|s = s|r                           | is commutative
r|(s|t) = (r|s)|t                   | is associative
(rs)t = r(st)                       Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr      Concatenation distributes over |
εr = r; rε = r                      ε is the identity element for concatenation
r* = (r|ε)*                         Relation between * and ε
r** = r*                            * is idempotent
Table: Algebraic properties of regular expressions

Regular Definition: If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d_1 → r_1
d_2 → r_2
...
d_n → r_n
where each d_i is a distinct name, and each r_i is a regular expression over the symbols in Σ ∪ {d_1, d_2, ..., d_(i-1)}, i.e., the basic symbols and the previously defined names. E.g., a regular definition for identifiers:
letter → A|B|...|Z|a|b|...|z
digit → 0|1|...|9
id → letter ( letter | digit )*
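The regular definition for identifiers can be tried out with Python's re module. The sketch below makes several assumptions: the helper expand and the convention of writing a reference to an earlier name as {name} are hypothetical, and Python's regular expression syntax is used as a stand-in for the notation in the text. Each definition may use only names defined before it, which mirrors the restriction on r_i above.

```python
import re
import string


def expand(definitions):
    """Expand a sequence of (name, body) definitions into plain patterns.

    A body may refer to an earlier name as {name}. A reference to a name
    that has not been defined yet raises KeyError.
    """
    expanded = {}
    for name, body in definitions:
        # Parenthesize so the substituted pattern keeps its grouping.
        expanded[name] = "(" + body.format(**expanded) + ")"
    return expanded


DEFINITIONS = [
    ("letter", "|".join(string.ascii_uppercase + string.ascii_lowercase)),
    ("digit", "|".join(string.digits)),
    ("id", "{letter}({letter}|{digit})*"),
]

ID_PATTERN = re.compile(expand(DEFINITIONS)["id"])


def is_identifier(lexeme):
    """True if the whole lexeme matches the pattern for id."""
    return ID_PATTERN.fullmatch(lexeme) is not None


if __name__ == "__main__":
    for candidate in ["rate60", "x", "60rate", ""]:
        print(repr(candidate), is_identifier(candidate))
```

Writing letter and digit as explicit alternations keeps the sketch close to the definition in the text, at the cost of a longer pattern than a practical lexer would use.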
Notational Shorthand: Certain constructs occur so frequently in regular expressions that shorthands are used for them.
1. One or more instances: the unary postfix operator + means one or more instances of. If r is a regular expression that denotes the language L(r), then r+ is a regular expression that denotes the language (L(r))+. Similarly, the unary postfix operator * means zero or more instances of. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.
2. Zero or one instance: the unary postfix operator ? means zero or one instance of. The notation r? is a shorthand for r|ε.
3. Character class: the notation [abc] is a shorthand for a|b|c.

Recognition of Tokens