
Regular Expression

- an important notation for specifying patterns
- used to define sets of strings
- example: letter ( letter | digit )* is a regular expression for
identifiers, where | means "or"

Rules for defining regular expressions over an alphabet ∑


1. ε is a regular expression that denotes {ε}, the set containing only
the empty string
2. if a is a symbol in ∑, then a is a regular expression that denotes {a}
3. let r and s be regular expressions denoting the languages L(r) and
L(s). Then
a. ( r ) | ( s ) is a regular expression denoting L( r ) ∪ L( s )
b. ( r ) ( s ) is a regular expression denoting L( r ) L( s )
c. ( r )* is a regular expression denoting (L( r ))*
d. ( r ) is a regular expression denoting L( r )

Conventions for Regular Expression


1. Unary operator * has the highest precedence and is left associative
2. Concatenation has the second highest precedence and is left associative
3. | has the lowest precedence and is left associative

So ( a ) | ( ( b )* ( c ) ) = a | b*c
This denotes either a single a, or zero or more b's followed by a single c

Examples
- let ∑ = {a,b}
o a | b denotes {a,b}
o (a | b )(a | b ) denotes {aa,ab,ba,bb}
o a* denotes {ε, a,aa,aaa,…}
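The three examples can be checked mechanically. The following is a minimal sketch, assuming Python's re module as a stand-in for the notation above; the helper name language and the length bound max_len are illustrative and not part of the notes. It lists every string over {a,b} up to a given length that a pattern matches in full.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=3):
    """Return all strings over `alphabet`, up to `max_len` long, that the
    pattern matches in full (a finite slice of the denoted language)."""
    candidates = ("".join(chars)
                  for n in range(max_len + 1)
                  for chars in product(alphabet, repeat=n))
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(language("a|b"))          # ['a', 'b']
print(language("(a|b)(a|b)"))   # ['aa', 'ab', 'ba', 'bb']
print(language("a*"))           # ['', 'a', 'aa', 'aaa']  ('' is the empty string ε)
```

Because a* denotes an infinite set, only the strings up to max_len are shown.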

Algebraic properties for Regular Expression


- r | s = s | r → | is commutative
- r | ( s | t ) = ( r | s ) | t → | is associative
- ( rs ) t = r ( s t ) → concatenation is associative
- r ( s | t ) = rs | rt → concatenation distributes over |
- ε r = r and r ε = r → ε is the identity element for concatenation
- r* = ( r | ε )* → relation between * and ε
- r** = r* → * is idempotent
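These identities can be spot-checked by brute force. The sketch below is an illustration only, assuming Python's re module and single symbols a, b, c standing in for r, s, t; the helper same_language is a hypothetical name. It compares two patterns on every string up to a small length, so agreement is evidence rather than a proof.

```python
import re
from itertools import product

def same_language(r1, r2, alphabet="abc", max_len=4):
    """True if both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(r1, s)) != bool(re.fullmatch(r2, s)):
                return False
    return True

# Each law is written with a, b, c in place of r, s, t.
# An empty alternative, as in (a|), plays the role of ε.
laws = {
    "| is commutative":                 ("a|b", "b|a"),
    "| is associative":                 ("a|(b|c)", "(a|b)|c"),
    "concatenation is associative":     ("(ab)c", "a(bc)"),
    "concatenation distributes over |": ("a(b|c)", "ab|ac"),
    "r* = (r|ε)*":                      ("a*", "(a|)*"),
    "* is idempotent":                  ("(a*)*", "a*"),
}

for name, (left, right) in laws.items():
    print(name, same_language(left, right))
```

Concatenation is not commutative, so same_language("ab", "ba") is False.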

Regular Definition
- For notational convenience, we give names to regular expressions
- If ∑ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form
o d1 → r1
o d2 → r2
o …
o dn → rn
where each di is a distinct name and each ri is a regular expression
over the symbols in ∑ ∪ { d1, d2, …, di-1 }, i.e. the basic symbols and
the previously defined names
- By restricting each ri to symbols of ∑ and previously defined
names, we can construct a regular expression over ∑ for any ri by
repeatedly replacing regular expression names by the expressions
they denote. If ri used dj for some j ≥ i, then ri might be recursively
defined, and the substitution process would not terminate

- Example
o letter → A|B|…|Z|a|b|…|z
o digit → 0|1|…|9
o id → letter ( letter | digit )*
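The substitution process can be sketched in a few lines. This is an illustration under assumptions, not a prescribed method: the definitions are written in Python re syntax, character classes such as [A-Za-z] abbreviate A|B|…|z, and the function name expand is hypothetical. Each body may mention only names defined earlier, so every name is replaced exactly once and the process terminates.

```python
import re

def expand(definitions):
    """Replace earlier names in each body by the (parenthesized)
    expressions they denote, in definition order."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            # \b keeps a name such as "digit" from matching inside "digits"
            body = re.sub(r"\b%s\b" % re.escape(earlier),
                          lambda m, regex=regex: "(" + regex + ")",
                          body)
        expanded[name] = body
    return expanded

definitions = [
    ("letter", "[A-Za-z]"),
    ("digit",  "[0-9]"),
    ("id",     "letter(letter|digit)*"),
]

expanded = expand(definitions)
print(expanded["id"])                               # ([A-Za-z])(([A-Za-z])|([0-9]))*
print(bool(re.fullmatch(expanded["id"], "x25")))    # True
print(bool(re.fullmatch(expanded["id"], "9lives"))) # False
```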

- Example

o Regular definition for unsigned numbers

o digit → 0|1|2|…|9

o digits → digit digit*

o optional-fraction → . digits | ε

o optional-exponent → ( E ( + | - | ε ) digits ) | ε

o num → digits optional-fraction optional-exponent


Using Notational Shorthand

- (r)? = r | ε

- r+ = r r*

- r* = r+ | ε

- using these shorthands, we can rewrite the above definitions as

- digit → 0|1|…|9

- digits → digit+

- optional-fraction → ( . digits )?

- optional-exponent → ( E ( + | - ) ? digits ) ?

- num → digits optional-fraction optional-exponent
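The shorthand version carries over almost symbol for symbol to a practical regex library. The sketch below assumes Python's re module; the variable names mirror the definitions, with hyphens changed to underscores because Python identifiers cannot contain hyphens.

```python
import re

digit = r"[0-9]"                               # digit → 0|1|…|9
digits = rf"{digit}+"                          # digits → digit+
optional_fraction = rf"(\.{digits})?"          # ( . digits )?
optional_exponent = rf"(E[+-]?{digits})?"      # ( E ( + | - )? digits )?
num = re.compile(rf"{digits}{optional_fraction}{optional_exponent}")

for text in ["42", "6.336", "1.89E-4", "6.336E4", "3.", ".5", "1E"]:
    print(text, bool(num.fullmatch(text)))
```

The first four strings are accepted. "3.", ".5" and "1E" are rejected, because the definition requires at least one digit on both sides of the point and after E.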

Recognition of Tokens
- Takes place by implementing a stylized flowchart, called a Transition
Diagram (TD)
- Transition Diagram
o Depicts the actions that take place when the Lexical Analyzer is
called by the parser to get the next token
o Positions in a TD are drawn as circles, called States
o States are connected by arrows, called Edges
o Edges leaving state S have Labels indicating the input characters
that can appear next after the TD has reached state S
o The Start state is the initial state of the TD, where control resides
when we begin to recognize a token
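o E.g. the TD for > and >=, where (n) is a state, ((n)) is an
accepting state and * marks retraction
start → (0) --[ > ]--> (6) --[ = ]--> ((7))
                       (6) --[ other ]--> ((8)) *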

o The above TD works as follows
o The start state is state 0
o In state 0, we read the next input character
o The edge labeled > from state 0 is followed to state 6 if this
input character is >
o In state 6, if the next input character is =, we go to state 7
o The double circle on 7 indicates that it is an accepting state, a
state in which the token >= has been found
o In case of > only, reading an extra character leads to
state 8, where the token for > is recognized and the forward
pointer (FP) is retracted by one character (the symbol * indicates
retraction in a TD)
o If failure occurs in one TD, another is activated by pointing the FP
back to the start of the lexeme
o A Lexical Error is generated only when all TDs fail
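o Example: TD for relational operators
start → (0) --[ < ]--> (1) --[ = ]-----> ((2))   return (relop, LE)
                       (1) --[ > ]-----> ((3))   return (relop, NE)
                       (1) --[ other ]-> ((4)) * return (relop, LT)
        (0) --[ = ]--> ((5))                     return (relop, EQ)
        (0) --[ > ]--> (6) --[ = ]-----> ((7))   return (relop, GE)
                       (6) --[ other ]-> ((8)) * return (relop, GT)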
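The relop diagram translates directly into a loop over states. The following is a minimal sketch under assumptions: the function name next_relop, the use of an empty string for end of input, and the returned (token, position) pair are illustrative choices, not part of the notes. The forward pointer fp advances on every read and is moved back by one in the retracting states 4 and 8.

```python
def next_relop(text, start=0):
    """Simulate the relop TD from text[start].
    Returns ((token, attribute), new_position), or (None, start) on failure."""
    fp = start                                    # forward pointer
    state = 0
    while True:
        c = text[fp] if fp < len(text) else ""    # "" stands for end of input
        fp += 1                                   # the character has been read
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), fp        # state 5
            elif c == ">":
                state = 6
            else:
                return None, start                # fail: FP goes back to start
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), fp        # state 2
            if c == ">":
                return ("relop", "NE"), fp        # state 3
            return ("relop", "LT"), fp - 1        # state 4 *: retract
        else:                                     # state 6
            if c == "=":
                return ("relop", "GE"), fp        # state 7
            return ("relop", "GT"), fp - 1        # state 8 *: retract

print(next_relop("<= 10"))   # (('relop', 'LE'), 2)
print(next_relop("> x"))     # (('relop', 'GT'), 1)
print(next_relop("x"))       # (None, 0)
```

On failure the position is returned unchanged, which is what lets a driver activate the next TD from the same starting point.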


- The Lexical Analyzer returns tokens and attribute values, using the
translation table below

Regular Expression Patterns for Token

Regular Exp    Token    Attribute

ws             ---      ---
if             if       ---
then           then     ---
else           else     ---
id             id       Pointer to symbol table entry
num            num      Pointer to symbol table entry
<              relop    LT
<=             relop    LE
>              relop    GT
>=             relop    GE
<>             relop    NE
=              relop    EQ

Regular Definition for White Spaces


delim → blank | tab | newline

ws → delim+

Regular Definition for Operators

relop → < | > | <= | >= | <> | =


Recognition of Keywords

- There are two approaches to the recognition of keywords


- The first approach is to define a separate TD for keywords. If a letter
is encountered, first check it in the TD of keywords. If a match is not
found, then analyze the string in the TD of identifiers
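start → (0) --[ B ]--> (1) --[ E ]--> (2) --[ G ]--> (3) --[ I ]--> (4) --[ N ]--> (5) --[ ws/{ ]--> ((6)) *
        (0) --[ E ]--> (7) --[ N ]--> (8) --[ D ]--> (9) --[ ws/: ]--> ((10)) *
                       (7) --[ L ]--> (11) --[ S ]--> (12) --[ E ]--> (13) --[ ws ]--> ((14)) *
        (0) --[ I ]--> (15) --[ F ]--> (16) --[ ws/( ]--> ((17)) *
        (0) --[ T ]--> (18) --[ H ]--> (19) --[ E ]--> (20) --[ N ]--> (21) --[ ws ]--> ((22)) *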

Note: All final states (6, 10, 14, 17, 22) are marked with *. This
symbol is called the retraction symbol; it indicates that the pointer
moves back to the previous position

R.E for Keywords


Keyword → BEGIN | END | IF | THEN | ELSE

Disadvantages of using this technique to identify keywords


o The number of states increases, which also increases the coding and
handling effort in the program
o If keywords are added, the TD has to be changed
- The second approach is to initialize the symbol table with all
keywords and accept the string in a combined TD for identifiers and
keywords, then look the string up in the symbol table. If a keyword
match is found, generate the token for that keyword; otherwise generate
the token for id

TD for Keywords and Identifier combined

start → (9) --[ letter ]--> (10) --[ other ]--> ((11)) * Return(gettoken(), installid())
                            (10) --[ letter or digit ]--> (10)

- Keywords are sequences of letters, and an id is a letter followed by
a sequence of letters and digits


- Keywords are stored in the symbol table
- 2 tasks are performed
o When a lexeme has been matched against the pattern, the 1st task is
to look in the Symbol Table for a keyword that matches. If one is found,
generate the corresponding keyword token; if not, generate the id token.
Gettoken() is used for this purpose
o If the id token is generated, store the lexeme in the symbol table
and return a pointer to that entry, or, if it is already stored, return
a pointer to the existing entry. Return 0 if a keyword is identified.
Installid() is used for this purpose
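The two tasks can be sketched as follows. This is one possible arrangement under stated assumptions, not the notes' own implementation: the symbol table is a plain Python list of (lexeme, token) pairs, a pointer is a 1-based position in that list (so 0 stays free to mean "keyword"), keyword tokens are named after the keyword itself, and the helpers lookup, init_keywords and scan_word are hypothetical names. Only gettoken and installid come from the notes.

```python
KEYWORDS = ("BEGIN", "END", "IF", "THEN", "ELSE")

symbol_table = []   # list of (lexeme, token); pointer p refers to symbol_table[p - 1]

def lookup(lexeme):
    """Return the pointer of `lexeme` in the symbol table, or 0 if absent."""
    for p, (stored, _) in enumerate(symbol_table, start=1):
        if stored == lexeme:
            return p
    return 0

def init_keywords(keywords):
    """Pre-load the symbol table with the keywords."""
    for kw in keywords:
        symbol_table.append((kw, kw))

def gettoken(lexeme):
    """Keyword token if the lexeme is a stored keyword, otherwise 'id'."""
    p = lookup(lexeme)
    if p and symbol_table[p - 1][1] != "id":
        return symbol_table[p - 1][1]
    return "id"

def installid(lexeme):
    """0 for a keyword; otherwise the pointer to the (possibly new) id entry."""
    p = lookup(lexeme)
    if p == 0:
        symbol_table.append((lexeme, "id"))
        return len(symbol_table)
    return 0 if symbol_table[p - 1][1] != "id" else p

def scan_word(text, start=0):
    """Combined TD: letter (letter | digit)*, accepting on 'other' with retraction."""
    if start >= len(text) or not text[start].isalpha():
        return None, start
    fp = start + 1
    while fp < len(text) and text[fp].isalnum():
        fp += 1                      # stay in the looping state
    lexeme = text[start:fp]          # fp rests on 'other', which is not consumed
    return (gettoken(lexeme), installid(lexeme)), fp

init_keywords(KEYWORDS)
print(scan_word("IF x1"))        # (('IF', 0), 2)
print(scan_word("count1 = 0"))   # (('id', 6), 6)
```

Adding a keyword only means extending KEYWORDS; scan_word itself stays unchanged. Note that str.isalpha and str.isalnum also accept non-ASCII letters and digits, which is broader than the letter and digit definitions in the notes.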

Advantage of using this technique


- When keywords are added, the TD does not change; only the symbol
table has to be initialized with the new keywords

TD for unsigned number

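start → (12) --[ digit ]--> (13) --[ . ]--> (14) --[ digit ]--> (15) --[ E ]--> (16) --[ +/- ]--> (17) --[ digit ]--> (18) --[ other ]--> ((19)) * Install_num()
        (13) --[ digit ]--> (13)
        (13) --[ E ]--> (16)
        (15) --[ digit ]--> (15)
        (16) --[ digit ]--> (18)
        (18) --[ digit ]--> (18)

start → (20) --[ digit ]--> (21) --[ . ]--> (22) --[ digit ]--> (23) --[ other ]--> ((24)) * Install_num()
        (21) --[ digit ]--> (21)
        (23) --[ digit ]--> (23)

start → (25) --[ digit ]--> (26) --[ other ]--> ((27)) * Install_num()
        (26) --[ digit ]--> (26)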
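The three number diagrams are tried in order, each one restarting from the same position when the previous one fails. The sketch below illustrates that strategy under assumptions: every function name is hypothetical, an empty string stands for end of input, and the matched lexeme is returned directly instead of calling Install_num().

```python
def peek(text, i):
    return text[i] if i < len(text) else ""       # "" stands for end of input

def is_digit(c):
    return c != "" and c in "0123456789"

def digits(text, i):
    """Consume digit+ from position i; return the new position or None."""
    if not is_digit(peek(text, i)):
        return None
    while is_digit(peek(text, i)):
        i += 1
    return i

def td_exponent(text, start):     # states 12-19: digits [ . digits ] E [ +|- ] digits
    i = digits(text, start)
    if i is None:
        return None
    if peek(text, i) == ".":
        i = digits(text, i + 1)
        if i is None:
            return None
    if peek(text, i) != "E":
        return None
    i += 1
    if peek(text, i) in ("+", "-"):
        i += 1
    return digits(text, i)

def td_fraction(text, start):     # states 20-24: digits . digits
    i = digits(text, start)
    if i is None or peek(text, i) != ".":
        return None
    return digits(text, i + 1)

def td_integer(text, start):      # states 25-27: digits
    return digits(text, start)

def scan_number(text, start=0):
    """Try each TD from `start`; the first that accepts decides the lexeme."""
    for td in (td_exponent, td_fraction, td_integer):
        end = td(text, start)     # a failed TD leaves `start` untouched
        if end is not None:
            return ("num", text[start:end]), end
    return None, start

print(scan_number("12.5E-3;"))   # (('num', '12.5E-3'), 7)
print(scan_number("12.5 + x"))   # (('num', '12.5'), 4)
print(scan_number("12E"))        # (('num', '12'), 2)
```

Trying the longest form first means that "12E" is still recognized as the number 12, with E left in the input for the next token.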

TD for white space

start → (28) --[ delim ]--> (29) --[ other ]--> ((30)) *
                            (29) --[ delim ]--> (29)

Nothing is returned when the accepting state is reached
