
System Software Term Paper

LEX- A Lexical Analyzer Generator


CSE318

Submitted to: Sahiba Assistant prof. Himanshi Raperia

Submitted by: Section :K1R03 Roll no. A02 Reg. no. 11000677

Abstract
Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Acknowledgement
I would like to thank my System Software teacher, Miss Himanshi Raperia, for helping me to complete my term paper on the topic "Lex - A Lexical Analyzer Generator". I am very thankful to her for her valuable guidance and support; without her, my term paper would not have been a success.

Contents
1) Introduction
2) Lex Source
3) Lex Regular Expressions
4) Lex Actions
5) Ambiguous Source Rules
6) Lex Source Definitions
7) Usage
8) Lex and Yacc
9) Character Set
10) Summary of Source Format
11) Caveats and Bugs

1) INTRODUCTION
Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source given to Lex. The code written by Lex recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions with the program fragments: as each expression appears in the input, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete the task, possibly including code written by other generators. The program that recognizes the expressions is generated in the general-purpose language employed for the user's program fragments. This avoids forcing a user who wishes to use a string manipulation language for input analysis to write the processing programs in the same, often inappropriate, string handling language.

Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Run-time libraries make Lex adaptable to different environments and different users. Lex itself exists on UNIX, GCOS, and OS/370.

Lex turns the user's expressions and actions into the host general-purpose language; the generated program is named yylex. The yylex program recognizes expressions in a stream and performs the specified actions for each expression:

    Source -> Lex -> yylex
    Input -> yylex -> Output

For example, consider a program to delete all blanks or tabs at the end of lines from the input:

    %%
    [ \t]+$   ;

is all that is required. The %% delimiter marks the beginning of the rules. This rule contains a regular expression which matches one or more instances of the characters blank or tab just prior to the end of a line. The brackets indicate the character class made of blank and tab, the + indicates "one or more", and the $ indicates "end of line". No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, we should add another rule:

    %%
    [ \t]+$   ;
    [ \t]+    printf(" ");

The automaton generated for this source scans for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule matches all the remaining strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics gathering at the lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase. It is particularly easy to interface Lex and Yacc. Lex programs recognize only regular expressions, whereas Yacc writes parsers that accept a large class of context-free grammars but require a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex partitions the input stream, and the parser generator assigns structure to the resulting pieces:

    lexical rules -> Lex -> yylex
    grammar rules -> Yacc -> yyparse
    Input -> yylex -> yyparse -> Parsed input

    Lex with Yacc

Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, to save space. The result is still a fast analyzer. The time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input.
The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd.
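The lookahead-and-back-up behaviour described above can be modelled in a few lines. The following is only a sketch in Python, used here as a stand-in for the C analyzer that Lex generates; the function name scan_with_backup and its return values are hypothetical, and the rules are treated as plain literal strings rather than full regular expressions.

```python
def scan_with_backup(rules, text, start=0):
    """Longest-match scan over literal rules, backing up after overshooting.

    Returns (matched_rule, next_position, characters_examined).
    """
    candidates = list(rules)   # rules still consistent with the input read so far
    best = None                # longest rule completely matched so far
    pos = start
    while candidates and pos < len(text):
        offset = pos - start
        ch = text[pos]
        # Drop every rule that disagrees with the character just read.
        candidates = [r for r in candidates if offset < len(r) and r[offset] == ch]
        pos += 1
        for r in candidates:
            if len(r) == pos - start:
                best = r       # remember the most recent complete match
    if best is None:
        return None, start, pos - start
    # Back up: the next scan resumes right after the remembered match.
    return best, start + len(best), pos - start


matched, next_pos, examined = scan_with_backup(["ab", "abcdefg"], "abcdefh")
print(matched, next_pos, examined)
```

In this model the scanner reads all the way to the h before giving up on abcdefg, then falls back to ab and resumes at cd, which is the behaviour the paragraph describes.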

2) Lex Source
The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

The first %% is required to mark the beginning of the rules, and the second %% is optional. The definitions and the user subroutines are often omitted. The absolute minimum Lex program is thus

    %%

which translates into a program which copies the input to the output unchanged.

The rules represent the user's control decisions. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum.
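The petroleum problem is easy to reproduce. The sketch below uses Python's re module as a stand-in for the generated analyzer; respell is a hypothetical helper, and the extra petroleum rule is one possible repair assumed for illustration, not a fix given in the text above.

```python
import re


def respell(text, rules):
    """Replace each word listed in `rules`; copy everything else unchanged."""
    # Try longer words first so the alternation imitates Lex's longest-match rule.
    words = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return pattern.sub(lambda m: rules[m.group(0)], text)


rules = {"colour": "color", "mechanise": "mechanize", "petrol": "gas"}
naive = respell("colour of petrol and petroleum", rules)

# Assumed repair: a longer rule that simply echoes the word it matched.
fixed_rules = dict(rules, petroleum="petroleum")
fixed = respell("colour of petrol and petroleum", fixed_rules)

print(naive)
print(fixed)
```

Because the longer rule wins wherever both apply, petroleum is left alone while a free-standing petrol is still rewritten.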

3) Lex Regular Expressions


A regular expression specifies a set of strings to be matched. It contains text characters and operator characters. The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears, and the expression a57D looks for the string a57D.

(a) Operators: The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus xyz"++" matches the string xyz++ when it appears. It is harmless but unnecessary to quote an ordinary text character; the expression "xyz++" is the same as the one above. An operator character may also be turned into a text character by preceding it with \ as in xyz\+\+ which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression. Any blank character not contained within [ ] must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

(b) Character Classes: Classes of characters can be specified using the operator pair [ ]. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \, - and ^. The - character indicates ranges. For example, [a-z0-9<>_] indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. In character classes, the ^ operator must appear as the first character after the left bracket. It indicates that the resulting string is to be complemented with respect to the computer character set. Thus [^abc] matches all characters except a, b, or c, including all special or control characters, and [^a-zA-Z] is any character which is not a letter.

(c) Optional expressions: The operator ? indicates an optional element of an expression. Thus ab?c matches either ac or abc.

(d) Repeated expressions: Repetitions of classes are indicated by the operators * and +. a* is any number of consecutive a characters, including zero, while a+ is one or more instances of a. For example, [a-z]+ is all strings of lower case letters.
And [A-Za-z][A-Za-z0-9]* indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.

(e) Alternation and Grouping: The operator | indicates alternation: (ab|cd) matches either ab or cd. The parentheses are used for grouping. Parentheses can also be used for more complex expressions: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd, but not abc, abcd, or abcdef.
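The operators covered so far can be tried out quickly. The sketch below uses Python's re module, on the assumption that character classes, ?, *, +, | and grouping behave the same way there as in Lex for these particular expressions; the helper whole and the variable names are hypothetical, and Lex's quoting rules are not modelled.

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # leading letter, then letters or digits
grouped = re.compile(r"(ab|cd+)?(ef)*")             # optional ab or cd+, then any number of ef
not_letter = re.compile(r"[^a-zA-Z]")               # complemented character class
optional = re.compile(r"ab?c")                      # the b may be absent


def whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return pattern.fullmatch(s) is not None


samples = ["abefef", "efefef", "cdef", "cddd", "abc", "abcd", "abcdef"]
accepted = [s for s in samples if whole(grouped, s)]
print(accepted)
```

The accepted list contains exactly the four strings the text says should match, and the three counterexamples are rejected.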

(f) Context sensitivity: Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line. If the very last character is $, the expression will only be matched at the end of a line. The latter operator is a special case of the / operator character, which indicates trailing context. The expression ab/cd matches the string ab, but only if followed by cd. Thus ab$ is the same as ab/\n. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by <x> using the angle bracket operator characters. If we considered "being at the beginning of a line" to be start condition ONE, then the ^ operator would be equivalent to <ONE>.

(g) Repetitions and Definitions: The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example, {digit} looks for a predefined string named digit and inserts it at that point in the expression. In contrast, a{1,5} looks for 1 to 5 occurrences of a.
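Trailing context has a close analogue in Python's lookahead assertion, which makes it easy to see what ab/cd does and does not consume. This is a sketch under that assumption only: the (?=...) syntax belongs to Python's re module, not to Lex, and the variable names are hypothetical. Start conditions are not modelled here.

```python
import re

trailing = re.compile(r"ab(?=cd)")     # stands in for the Lex expression ab/cd
end_of_line = re.compile(r"ab(?=\n)")  # stands in for ab$ written as ab/\n
one_to_five = re.compile(r"a{1,5}")    # same notation as the Lex expression a{1,5}

m = trailing.match("abcd")
print(m.group(0), m.end())             # the match covers ab only; cd is left unread
print(trailing.match("abce"))          # no match: ab is not followed by cd
print(one_to_five.match("aaaaaaa").group(0))
```

The point to notice is that the context is inspected but not consumed, so the characters cd remain available to the next rule.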

4) Lex Actions
When an expression is matched, Lex executes the corresponding action. There is a default action, which consists of copying the input to the output; it is performed on all strings not otherwise matched. One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

    [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be ignored. Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The above example could also have been written

    " "    |
    "\t"   |
    "\n"   ;

with the same result, although in a different style.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

    [a-z]+   printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case the format is "print string", and the data are the characters in yytext.

A Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation.

1) yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext.
2) yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input(), which returns the next input character.
2) output(c), which writes the character c on the output.
3) unput(c), which pushes the character c back onto the input stream to be read later by input().

These routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently.

Another Lex library routine is yywrap(), which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. The default yywrap always returns 1. This routine is also a convenient place to print tables, summaries, etc. at the end of a program.
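How input(), unput(c) and yyless(n) fit together can be shown with a toy model. The Python class below is a hypothetical sketch, not the Lex library: the class name Scanner, the use of an empty string to signal end of input, and the way yytext is filled in by hand are all assumptions made for illustration. yymore() and output(c) are left out.

```python
class Scanner:
    """Toy model of the input(), unput(c) and yyless(n) routines."""

    def __init__(self, text):
        self._pending = list(reversed(text))  # the next character sits at the end
        self.yytext = ""

    def input(self):
        # Assumption of this sketch: an empty string marks the end of the input.
        return self._pending.pop() if self._pending else ""

    def unput(self, c):
        # Push one character back so that the next input() call returns it.
        self._pending.append(c)

    def yyless(self, n):
        # Keep the first n characters; push the rest back, last character first.
        for c in reversed(self.yytext[n:]):
            self.unput(c)
        self.yytext = self.yytext[:n]


s = Scanner("abcdef")
s.yytext = "".join(s.input() for _ in range(4))  # pretend a rule matched "abcd"
s.yyless(2)                                      # keep "ab", give back "cd"
kept = s.yytext
rest = "".join(iter(s.input, ""))                # read whatever is still pending
print(kept, rest)
```

The characters handed back by yyless are read again in their original order, which is why the routines have to be replaced together or not at all.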

5) Ambiguous Source Rules


Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.
2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

    integer   keyword action ...;
    [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (for example, int) will not match the expression integer, and so the identifier interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example, '.*' might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

    'first' quoted string here, 'second' here

the above expression will match

    'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

    '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line.

Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once.
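Both disambiguation rules, and the quoted-string pitfall, can be checked with a small experiment. The following Python sketch is an illustration only: choose_rule is a hypothetical function, and it relies on Python's re.match, which happens to return the longest match for these simple patterns but does not guarantee it for every regular expression the way Lex does.

```python
import re


def choose_rule(rules, text):
    """Return (name, matched_text) for the winning rule, or None if nothing matches."""
    best = None
    for name, pattern in rules:          # rules are tried in the order given
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current winner,
        # so on a tie the rule given first is kept.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best


rules = [("keyword", r"integer"), ("identifier", r"[a-z]+")]
for word in ["integers", "integer", "int"]:
    print(word, choose_rule(rules, word))

line = "'first' quoted string here, 'second' here"
greedy = re.match(r"'.*'", line).group(0)         # runs on to the last quote
careful = re.match(r"'[^'\n]*'", line).group(0)   # stops at the first closing quote
print(greedy)
print(careful)
```

The strict comparison in choose_rule is what implements "the rule given first is preferred"; changing it to greater-or-equal would hand ties to the last rule instead.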

6) Lex Source Definitions


Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action and which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions. This material must look like program fragments, and should precede the first Lex rule. As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are passed through to the generated program. This can be used to include comments in either the Lex source or the generated code. The comments should follow the host language convention.
2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.
3) Anything after the second %% delimiter, regardless of format, is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter. Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings. The format of such lines is

    name translation

and it causes the string given as a translation to be associated with the name. The name and translation must be separated by at least one blank or tab, and the name must begin with a letter. The translation can then be called out by the {name} syntax in a rule. Using {D} for the digits and {E} for an exponent field, for example, might abbreviate rules to recognize numbers:

    D                   [0-9]
    E                   [DEde][-+]?{D}+
    %%
    {D}+                printf("integer");
    {D}+"."{D}*({E})?   |
    {D}*"."{D}+({E})?   |
    {D}+{E}             printf("real");
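Definition expansion is essentially textual substitution, which the following Python sketch imitates. It is an illustration under stated assumptions: expand and NAME_REF are hypothetical names, the translations are wrapped in (?:...) as a choice of this sketch so that a following + or * applies to the whole translation, names are assumed to continue with letters, digits or underscores, and the Lex quoted "." is written as \. because Python has no quoting operator.

```python
import re

# A name reference is {name}, where the name begins with a letter.
# Repetitions such as {1,5} begin with a digit and are left untouched.
NAME_REF = re.compile(r"\{([A-Za-z][A-Za-z0-9_]*)\}")


def expand(expression, definitions):
    """Replace every {name} by its translation, recursively."""
    def substitute(m):
        return "(?:" + expand(definitions[m.group(1)], definitions) + ")"
    return NAME_REF.sub(substitute, expression)


definitions = {"D": "[0-9]", "E": "[DEde][-+]?{D}+"}

integer = re.compile(expand(r"{D}+", definitions))
real = re.compile(expand(
    r"{D}+\.{D}*({E})?|{D}*\.{D}+({E})?|{D}+{E}", definitions))

for text in ["42", "3.14", ".5", "1e10", "6.02E+23"]:
    kind = "integer" if integer.fullmatch(text) else "real" if real.fullmatch(text) else "neither"
    print(text, kind)
```

Note that E refers to D, so the expansion has to be applied to the translation itself before it is inserted.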

7) Usage
There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general-purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library. The library is accessed by the loader flag -ll. So an appropriate set of commands is

    lex source
    cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later execution.

8) Lex and Yacc


If we want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program in the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end with

    return(token);

where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line

    # include "lex.yy.c"

in the last section of the Yacc input. Supposing the grammar to be named "good" and the lexical rules to be named "better", the UNIX command sequence can just be:

    yacc good
    lex better
    cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generation of the Lex and Yacc programs can be done in either order.

9) Character Set
The programs generated by Lex handle character I/O only through the routines input, output, and unput. Thus the character representation provided in these routines is accepted by Lex and employed to return values in yytext. For internal use a character is represented as a small integer which, if the standard library is used, has a value equal to the integer value of the bit pattern representing the character on the host computer. Normally, the letter a is represented in the same form as the character constant 'a'. If this interpretation is changed, by providing I/O routines which translate the characters, Lex must be told about it, by giving a translation table. This table must be in the definitions section, and must be bracketed by lines containing only %T. The table contains lines of the form

    {integer} {character string}

which indicate the value associated with each character. Thus the next example

    %T
    1     Aa
    2     Bb
    ...
    26    Zz
    27    \n
    28    +
    29    -
    30    0
    31    1
    ...
    39    9
    %T

    Sample character table.

maps the lower and upper case letters together into the integers 1 through 26, newline into 27, + and - into 28 and 29, and the digits into 30 through 39. Note the escape for newline. If a table is supplied, every character that is to appear either in the rules or in any valid input must be included in the table. No character may be assigned the number 0, and no character may be assigned a bigger number than the size of the hardware character set.
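To make the table format concrete, the sketch below reads the body of such a table into a character-to-integer mapping. It is a hypothetical Python helper (parse_char_table is not part of Lex), it handles only the \n escape, and the sample contains just the lines spelled out in the example above; the rows elided there are not filled in.

```python
def parse_char_table(lines):
    """Map each character to the integer given on its table line."""
    table = {}
    for line in lines:
        number_text, chars = line.split(None, 1)
        number = int(number_text)
        if number == 0:
            raise ValueError("no character may be assigned the number 0")
        chars = chars.replace("\\n", "\n")   # the table spells newline as backslash-n
        for c in chars:
            table[c] = number                # every character on the line shares the value
    return table


sample = ["1 Aa", "2 Bb", "26 Zz", "27 \\n", "28 +", "29 -", "30 0", "31 1", "39 9"]
table = parse_char_table(sample)
print(table["A"], table["a"], table["\n"], table["9"])
```

Upper and lower case letters land on the same integer because they appear on the same line, which is how the sample table folds case.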

10) Summary of Source Format


The general form of a Lex source file is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

The definitions section contains a combination of

1) Definitions, in the form "name space translation".
2) Included code, in the form "space code".
3) Included code, in the form

    %{
    code
    %}

4) Start conditions, given in the form

    %S name1 name2 ...

5) Character set tables, in the form

    %T
    number space character-string
    ...
    %T

6) Changes to internal array sizes, in the form

    %x nnn

where nnn is a decimal integer representing an array size and x selects the parameter as follows:

    Letter   Parameter
    p        positions
    n        states
    e        tree nodes
    a        transitions
    k        packed character classes
    o        output array size

Lines in the rules section have the form "expression action", where the action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

    x        the character "x"
    "x"      an "x", even if x is an operator.
    \x       an "x", even if x is an operator.
    [xy]     the character x or y.
    [x-z]    the characters x, y or z.
    [^x]     any character but x.
    .        any character but newline.
    ^x       an x at the beginning of a line.
    <y>x     an x when Lex is in start condition y.
    x$       an x at the end of a line.
    x?       an optional x.
    x*       0,1,2, ... instances of x.
    x+       1,2,3, ... instances of x.
    x|y      an x or a y.
    (x)      an x.
    x/y      an x but only if followed by y.
    {xx}     the translation of xx from the definitions section.
    x{m,n}   m through n occurrences of x.

11) Caveats and Bugs


There are pathological expressions which produce exponential growth of the tables when converted to deterministic machines. REJECT, the action which means "go do the next alternative", does not rescan the input; instead it remembers the results of the previous scan. This means that if a rule with trailing context is found, and REJECT is executed, the user must not have used unput to change the characters forthcoming from the input stream. This is the only restriction on the user's ability to manipulate the not-yet-processed input.


