Regular Expression

Regular Expression
- important notations for specifying patterns

- used to define sets of string
- example: letter ( letter | digit )* where | = or is a regular
expression for identifier
Rules for defining regular expression over alphabet ε

1. ∈ is a regular expression that denotes {∈}
2. if a is a symbol in ε , then a is a regular expression that denotes {a}
3. let r and s are regular expressions denoting the language L(r) and
L(s). Then
a. ( r ) | ( s ) is a regular expression denoting L( r ) | L( s )
b. ( r ) ( s ) is a regular expression denoting L( r ) L( s )
c. ( r )* is a regular expression denoting (L( r ))*
d. ( r ) is a regular expression denoting L( r )
Conventions for Regular Expression

1. Unary operator * has highest precedence and is left associative
2. Concatenation has second highest precedence and is lest associative
3. | has lowest precedence and is left associative
So ( a ) | ( ( b )* ( c ) ) = a | b*c
Either a single a or zero or more b`s followed by a single c
Examples
- let ε = {a,b}
o a | b = {a,b}
o (a | b )(a | b ) = {aa,ab,ba,bb}
o a* = {∈, a,aa,aaa,…}
Algebraic properties for Regular Expression

- r | s = s | r → | is commutative
- r | ( s | t) = ( r | s ) | t → is associative
- ( rs ) t = r ( s t ) → concatenation is associative
- r ( s | t ) = rs | rt → concatenation distributes over |
- ∈ r = r or r ∈ = r → identity element
- r* = ( r | ∈)*
- r** = r* → * is idem potent
Regular Definition
- For notational convenience, we give names to regular expressions
- If ∑ is an alphabet of basic symbols, then a regular definition is a
sequence of a definitions of the form
o D1 → r1
o D2 → r2
o … Dn → rn
Where Di is a distinct name and ri is a regular expression over the
symbol ∑ U { d1,d2,di-1} i.e. the basic symbols and the previously
defined names
By restricting each ri to symbols of ∑ and previously defined
names, we an construct regular expression over ∑ for any ri by
repeatedly replacing regular expression names by the expressions
they denote. If ri used dj for some j≥I, then ri must be recursively
defined, and the substitution process would not terminate
- Example
o Letter → A|B|….|Z|a|b|….|z
o digit → 0|1|….|9
o Id → letter (letter | digit)*
- Example
o Regular definition for unsigned numbers
o digit → 0|1|2|…|9
o digits → digit digit*
o optional-fraction → . digits | ∈
o optional-exponent → ( E ( +| - | ∈ ) digits ) | ∈
o num → digits optional-fraction optional-exponent

Using Notational Shorthand
- (r)?=r|∈
- r+ = r r*
- r* = r+ | ∈
- using these shorthand we can rewrite the above definitions as
- digit → 0|1|….|9
- digits → digit+
- optional-fraction → ( . digits )?
- optional-exponent → ( E ( + | - ) ? digits ) ?
- num → digits optional-fraction optional-exponent
Recognition of Token
- Takes place by implementing a stylized flowchart, called Transition
Diagram
- Transition Diagram
o Depicts actions that take place when a Lexical Analyzer is
called by the parser to get next token
o Positions in a TD are drawn as circles, called States
o States are connected by arrows, called Edges
o Edges leaving state S have Labels indicating the i/p character
that can next appear after the TD has reached state S
o Start state is the initial state of the TD where control resides
when we begin to recognize a token
o E.g.
start > 6v = 7
8
other *
o The above TD works as follows

o Start state is state Ø
o In state Ø, we read the next i/p character
o The edge labeled > from state Ø is to be folloed to state 6 if this
i/p character is >.
o The double circle on 7 indicates that it is an accepting state , a
state in which the token >= has been found
o In case of > only, reading of an extra character will lead to
state 8 , where token for > will be recognized and the FP will be
retracted (sumbol indicates retraction in TD)
o If failure occurs in one TD , another is activated by pinting FP
to start
o Lexical Error is generated only when all TD’s Fails
o example
start < =
Ø 1 2 return (relop, LE)
> 3 return (relop, NE)
other 4 * return (relop, LT )
= 5 return ( relop, EQ )
> 6 = 7
return (relop, GE )
other 8 * return (relop, GT )

- The lexical Analyzer return tokens and attribute values, using the
translation table
Regular Expression Patterns for Token
Regular Exp Token Attribute
ws --- ---
If If ---
then Then ---
else Else ---
Id id Pointer to
symbol Table
entry
num num Pointer to ST
entry
< relop LT
<= relop LE
> relop GT
>= relop GE
<> relop NE
= relop EQ
Regular Definition for White Spaces

Delim → blank | tab | newline
Ws → delim +
RD for operators
Relop → < | > | <= | >= | <> | =

Recognition of Keywords
- there are two approaches for the recognition of keywords\

- Define a separate TD of keywords. If a letter is encountered , first
check it in the TD of keywords. If a match is not found then analyze
the string in TD of of identifier
*
S 6
Start B 1 E 2 G 3 I 4 N 5 ws/{
o
*
1
E 7 N 8 D 9 ws/: 0
*
L S E ws 1
1 1 1 4
1 2 3
I 1 F 1 ws/( 1
5 6 7
T H 1 E 2N N 2 ws 2
1 9 0 1 2
8
Note : All final States (6,10,14,17,22) have * like one shown on state
6. the symbol is called retraction symbol which indicates that
pointer moves back to previous sate
R.E for Keywords

Keyword → BEGIN | END | IF | THEN | ELSE
Disadvantage of using this techniques to identify keywords

o The number of states will be increased which increases the
coding and handling of pgm also
o Incase of addition in keywords , TD will have to be changed
- The second approach is to initialize all keywords in symbol table and
accept the string in the TD of Identifiers and keywords combined then
match the string in symbols table for keywords .if match is found
generate token or that keyword otherwise generate token for id
TD for Keywords and Identifier combined
Letter or digit *
Start letter 1 other 1
9
0 1 Return(gettoken(),installid())
- Keywords are sequence of letter and id is sequence of letter and digits

- Keywords are stored in symbol table
- 2 tasks are performed
o if lexeme is matched against the pattern then 1st task is to look
for the keyword in Symbol Table that matches and if so
generate corresponding token and if not then generate id token.
Gettoken() is used to here for this purpose
o if id token is generated then store that lexeme in symbol table
and return pointer to that entry or if it is already stored than
return pointer to that entry. Return 0 if keyword is identified.
Installid () is used for this purpose
Advantage of using this technique

- In case of additional keywords, TD will not be changed only symbol
Table will be initialized with new keywords
TD for unsigned number
digit digit dg
*
1 digit 1 . 1 digit 1 E 1 +/- 1 digit 1 ot 1
S 2 3 4 5 6 7 8 9
Install-num()
E digit
digit digit *
digit . digit other
S 2 2 2 2 2
0 1 2 3 4 Install_num()
digit *
digit
S 2 2 other 2
5 6 7
Install_num()
TD for white space
Delim
*
S delim other
2 2 3
8 9 0
Nothing is returned when accepting condition is reached

Regular Expression

Încărcat de

Informații document

Descriere originală:

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Regular Expression

Încărcat de

Drepturi de autor:

Formate disponibile

Regular Expression

- important notations for specifying patterns

Rules for defining regular expression over alphabet ε

Conventions for Regular Expression

Algebraic properties for Regular Expression

o Regular definition for unsigned numbers

o digits → digit digit*

o num → digits optional-fraction optional-exponent

- using these shorthand we can rewrite the above definitions as

- num → digits optional-fraction optional-exponent

o The above TD works as follows

> 3 return (relop, NE)

other 4 * return (relop, LT )

other 8 * return (relop, GT )

Regular Expression Patterns for Token

Regular Exp Token Attribute

Regular Definition for White Spaces

Relop → < | > | <= | >= | <> | =

- there are two approaches for the recognition of keywords\

R.E for Keywords

Disadvantage of using this techniques to identify keywords

TD for Keywords and Identifier combined

- Keywords are sequence of letter and id is sequence of letter and digits

Advantage of using this technique

TD for unsigned number

TD for white space

Nothing is returned when accepting condition is reached

S-ar putea să vă placă și