
THEORY OF COMPUTATION

UNIT 2

Regular expressions
Given an alphabet $\Sigma$, a language is a set of words $L \subseteq \Sigma^*$. So far we were able to describe
languages either by using set theory (i.e. enumeration or comprehension) or by an automaton.
In this section we shall introduce regular expressions as an elegant and concise way to describe
languages.
We shall see that the languages definable by regular expressions are precisely the same as
those accepted by deterministic or nondeterministic finite automata. These languages are called
regular languages or (according to the Chomsky hierarchy) Type 3 languages.

The meaning of regular expressions


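For this purpose, we shall first define an operation on languages called the Kleene star. Given a
language $L \subseteq \Sigma^*$ we define

$$L^* = \{\, w_0 w_1 \dots w_{n-1} \mid n \in \mathbb{N},\ w_i \in L \text{ for all } i < n \,\}$$

Intuitively, $L^*$ contains all the words which can be formed by concatenating an arbitrary number
of words in $L$. This includes the empty word $\epsilon$, since the number may be 0.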

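As an example consider $L = \{a, ab\}$ over $\Sigma = \{a, b\}$:

$$L^* = \{\epsilon, a, aa, ab, aaa, aab, aba, \dots\}$$

You should notice that we use the same symbol as in $\Sigma^*$ but there is a subtle difference: $\Sigma$ is a set
of symbols but $L \subseteq \Sigma^*$ is a set of words.

Alternatively (and more abstractly) one may describe $L^*$ as the least language (wrt $\subseteq$) which
contains $L$ and the empty word and is closed under concatenation:

$$w \in L^* \wedge v \in L^* \implies wv \in L^*$$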
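The closure description can be turned into a small computation. The following is a minimal Python sketch, not part of the original notes: the function name `kleene_star` and the length bound `max_len` are assumptions introduced so that the (infinite) language can be listed up to a cut-off. It starts from the empty word and keeps appending words of the language until nothing new of the allowed length appears.

```python
def kleene_star(language, max_len):
    """Return all words of L* whose length is at most max_len."""
    result = {""}            # concatenating zero words gives the empty word
    frontier = {""}
    while frontier:
        new_words = set()
        for prefix in frontier:
            for word in language:
                candidate = prefix + word
                if len(candidate) <= max_len and candidate not in result:
                    new_words.add(candidate)
        result |= new_words  # closed under concatenation, up to the bound
        frontier = new_words
    return result

# Illustrative language {a, ab} over the alphabet {a, b}
print(sorted(kleene_star({"a", "ab"}, 3), key=lambda w: (len(w), w)))
```

Note that the star of the empty language still contains the empty word, which the sketch reproduces.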

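We now define the semantics of regular expressions: to each regular expression $E$ over $\Sigma$ we
assign a language $L(E) \subseteq \Sigma^*$. We do this by induction over the definition of the syntax
(the syntax itself is given in the following section):

1. $L(\emptyset) = \emptyset$
2. $L(\epsilon) = \{\epsilon\}$
3. $L(\mathbf{x}) = \{x\}$ for each $x \in \Sigma$
4. $L(E + F) = L(E) \cup L(F)$
5. $L(EF) = \{\, wv \mid w \in L(E),\ v \in L(F) \,\}$
6. $L(E^*) = L(E)^*$
7. $L((E)) = L(E)$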

What are regular expressions?


We assume as given an alphabet $\Sigma$ (e.g. $\Sigma = \{a, b, c, \dots, z\}$) and define the syntax of regular
expressions (over $\Sigma$):

1. $\emptyset$ is a regular expression.
2. $\epsilon$ is a regular expression.
3. For each $x \in \Sigma$, $\mathbf{x}$ is a regular expression. E.g. in the example all small letters are regular
   expressions. We use boldface to emphasize the difference between the symbol a and the
   regular expression $\mathbf{a}$.
4. If $E$ and $F$ are regular expressions then $E + F$ is a regular expression.
5. If $E$ and $F$ are regular expressions then $EF$ (i.e. just one after the other) is a regular
   expression.
6. If $E$ is a regular expression then $E^*$ is a regular expression.
7. If $E$ is a regular expression then $(E)$ is a regular expression.

These are all regular expressions.
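To make the inductive definition concrete, here is a minimal Python sketch. It is an illustration only: the tuple encoding (`"empty"`, `"eps"`, `"sym"`, `"alt"`, `"cat"`, `"star"`) and the function name `lang` are assumptions, and the language is cut off at a maximum word length `n` because it may be infinite. Brackets need no constructor of their own, since nesting of tuples already fixes the grouping.

```python
def lang(e, n):
    """Words of length <= n in the language of regular expression e."""
    kind = e[0]
    if kind == "empty":                      # the empty language
        return set()
    if kind == "eps":                        # only the empty word
        return {""}
    if kind == "sym":                        # a single symbol
        return {e[1]} if n >= 1 else set()
    if kind == "alt":                        # E + F: union
        return lang(e[1], n) | lang(e[2], n)
    if kind == "cat":                        # EF: concatenate word by word
        return {w + v for w in lang(e[1], n) for v in lang(e[2], n)
                if len(w + v) <= n}
    if kind == "star":                       # E*: repeat until nothing new fits
        base = lang(e[1], n)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {w + v for w in frontier for v in base
                        if len(w + v) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown constructor: {kind}")

a, b = ("sym", "a"), ("sym", "b")
expr = ("star", ("alt", a, ("cat", a, b)))   # (a + ab)*
print(sorted(lang(expr, 3), key=lambda w: (len(w), w)))
```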

Regular Grammar

A grammar is a set of rewrite rules which are used to generate strings by successively
rewriting symbols. For example, consider the language represented by a+, which is { a, aa, aaa, ... }.
One can generate the strings of this language by the following procedure: let S be a symbol to
start the process with, and repeatedly rewrite S using one of the following two rules, S -> a and S -> aS, until no S remains.

Formally a grammar consists of a set of non terminals (or variables) V, a set of terminals $\Sigma$
(the alphabet of the language), a start symbol S, which is a non terminal, and a set of rewrite
rules (productions) P. A production has in general the form $\alpha \to \beta$, where $\alpha$ is a string of
terminals and non terminals with at least one non terminal in it and $\beta$ is a string of terminals and
non terminals.

A grammar is regular if and only if $\alpha$ is a single non terminal and $\beta$ is a single terminal
or a single terminal followed by a single non terminal, that is, a production is of the form X -> a or
X -> aY, where X and Y are non terminals and a is a terminal. A production X -> $\epsilon$ for the empty
string is commonly allowed as well, as in the example that follows.

For example, $\Sigma$ = {a, b}, V = { S } and P = { S -> aS, S -> bS, S -> $\epsilon$ } is a regular grammar and
it generates all the strings consisting of a's and b's including the empty string.
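The rewriting process can be simulated directly. The sketch below is illustrative and assumes a right-linear grammar encoded as a Python dictionary from a non terminal to the list of its right-hand sides (with `""` for the empty string); the name `generate` and the bound `max_len` are not from the notes.

```python
from collections import deque

def generate(productions, start, max_len):
    """Terminal strings of length <= max_len derivable from start."""
    words, seen = set(), {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # In a right-linear grammar only the last symbol can be a non terminal.
        if form and form[-1] in productions:
            prefix, variable = form[:-1], form[-1]
            for rhs in productions[variable]:
                new_form = prefix + rhs
                terminals = sum(1 for s in new_form if s not in productions)
                if terminals <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
        else:
            words.add(form)      # no non terminal left: a generated string
    return words

print(sorted(generate({"S": ["a", "aS"]}, "S", 3)))          # the grammar for a+
print(sorted(generate({"S": ["aS", "bS", ""]}, "S", 2)))     # all strings over {a, b}
```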

A grammar is a context-free grammar if and only if every production is of the form X -> $\beta$, where
X is a single non terminal and $\beta$ is a string of terminals and non terminals, possibly the empty string.
For example P = { S -> aSb, S -> ab } with $\Sigma$ = { a, b } and V = { S } is a context-free grammar
and it generates the language { $a^n b^n$ | n is a positive integer }. As we shall see later this is an
example of a context-free language which is not regular.

A grammar is a context-sensitive grammar if and only if every production is of the form
$\alpha_1 X \alpha_2 \to \alpha_1 \beta \alpha_2$, where X is a non terminal and $\alpha_1$, $\alpha_2$ and $\beta$ are strings of terminals and non
terminals, possibly empty except $\beta$.

Thus the non terminal X can be rewritten as $\beta$ only in the context of $\alpha_1 X \alpha_2$.


Construction: machine from regular expression:
Given a regular expression r there is an associated regular language L(r). Since there is a finite
automaton for every regular language, there is a machine, M, for every regular expression such that
L(M) = L(r).

The constructive proof provides an algorithm for constructing a machine, M, from a regular
expression r. The six constructions below correspond to the cases:

1) The entire regular expression is the null string, i.e. L = {epsilon}:   r = epsilon

2) The entire regular expression is empty, i.e. L = phi:   r = phi

3) An element of the input alphabet, sigma, is in the regular expression:   r = a, where a is an element of sigma

4) Two regular expressions are joined by the union operator, +:   r1 + r2

5) Two regular expressions are joined by concatenation (no symbol):   r1 r2

6) A regular expression has the Kleene closure (star) applied to it:   r*

The construction proceeds by using 1) or 2) if either applies.

The construction first converts all symbols in the regular expression using construction 3).

Then, working from inside outward, left to right at the same scope,
apply the one construction that applies from 4), 5) or 6).

Note: add one arrow head to figure 6) going into the top of the second circle.

The result is an NFA with epsilon moves. This NFA can then be converted to an NFA without
epsilon moves. Further conversion can be performed to get a DFA. All these machines have the
same language as the regular expression from which they were constructed.

The construction covers all possible cases that can occur in any regular expression. Because of
this generality, many more states are generated than are necessary. The unnecessary states
are joined by epsilon transitions. Very careful compression may be performed. For example,
the machine for the regular expression fragment aba would be

       a       e       b       e       a
   q0 ---> q1 ---> q2 ---> q3 ---> q4 ---> q5

with e used for epsilon; this can be trivially reduced to

       a       b       a
   q0 ---> q1 ---> q2 ---> q3

A careful reduction of unnecessary states requires use of the Myhill-Nerode Theorem of section
3.4 in 1st Ed. or section 4.4 in 2nd Ed. This will provide a DFA that has the minimum number of
states.
Up to a renaming of the states and a reordering of the rows of delta, the state transition table, all
minimum DFAs for the same language are identical.

Conversion of a NFA to a regular expression was started in this lecture and finished in the next
lecture. The notes are in lecture 7.

Example: r = (0+1)* (00+11) (0+1)*


Solution: find the primary operator(s) that are concatenation or union.
In this case, the two outermost are concatenation, giving, crudely:
//---------------\ /----------------\\ /-----------------\
-->|| <> M((0+1)*) <> |->| <> M((00+11)) <> ||->| <> M((0+1)*) <<>> |
\\---------------/ \----------------// \-----------------/

There is exactly one start "-->" and exactly one final state "<<>>"
The unlabeled arrows should be labeled with epsilon.
Now recursively decompose each internal regular expression.
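The six constructions can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the notes' own code: regular expressions are encoded as nested tuples (`"eps"`, `"phi"`, `"sym"`, `"alt"`, `"cat"`, `"star"`), the class `NFA` and the functions `build`, `closure` and `accepts` are hypothetical names, and epsilon is represented by the symbol `None`. Each fragment gets exactly one start and one final state, and fragments are glued together with epsilon arrows only.

```python
import itertools

class NFA:
    def __init__(self):
        self.trans = {}                      # (state, symbol or None) -> set of states
        self.counter = itertools.count()
    def new_state(self):
        return next(self.counter)
    def add(self, src, sym, dst):
        self.trans.setdefault((src, sym), set()).add(dst)

def build(nfa, e):
    """Return (start, final) of the fragment built for expression e."""
    s, f = nfa.new_state(), nfa.new_state()
    kind = e[0]
    if kind == "eps":                        # construction 1
        nfa.add(s, None, f)
    elif kind == "phi":                      # construction 2: no path from s to f
        pass
    elif kind == "sym":                      # construction 3
        nfa.add(s, e[1], f)
    elif kind == "alt":                      # construction 4: union
        for sub in e[1:]:
            s1, f1 = build(nfa, sub)
            nfa.add(s, None, s1)
            nfa.add(f1, None, f)
    elif kind == "cat":                      # construction 5: concatenation
        s1, f1 = build(nfa, e[1])
        s2, f2 = build(nfa, e[2])
        nfa.add(s, None, s1)
        nfa.add(f1, None, s2)
        nfa.add(f2, None, f)
    elif kind == "star":                     # construction 6: Kleene closure
        s1, f1 = build(nfa, e[1])
        nfa.add(s, None, s1)
        nfa.add(f1, None, f)
        nfa.add(s, None, f)                  # zero repetitions
        nfa.add(f1, None, s1)                # go round again
    else:
        raise ValueError(kind)
    return s, f

def closure(nfa, states):
    """All states reachable from `states` using epsilon moves only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.trans.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(e, word):
    nfa = NFA()
    start, final = build(nfa, e)
    current = closure(nfa, {start})
    for ch in word:
        moved = set()
        for q in current:
            moved |= nfa.trans.get((q, ch), set())
        current = closure(nfa, moved)
    return final in current

# r = (0+1)* (00+11) (0+1)*
zero, one = ("sym", "0"), ("sym", "1")
any_bits = ("star", ("alt", zero, one))
middle = ("alt", ("cat", zero, zero), ("cat", one, one))
r = ("cat", any_bits, ("cat", middle, any_bits))
print(accepts(r, "0100"), accepts(r, "0101"))
```

The sketch makes no attempt to remove redundant states; it keeps every epsilon arrow the constructions introduce.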

Convert NFA to regular expression


Conversion algorithm from an NFA to a regular expression.
Start with the transition table for the NFA with the following state naming conventions:
the first state is 1 or q1 or s1, which is the starting state; states are numbered consecutively, 1, 2, 3,
... n.
The transition table is a typical NFA table where the entries are sets of states, and phi, the empty
set, is allowed.

The set F of final states must be known.

We call the variable r a regular expression.

We can talk about $r_{ij}$ being the regular expression with subscripts i, j.

Note that $r_{12}$ is just a (possibly) different regular expression from $r_{53}$.

Because we need multiple columns in a table we are going to build, we
also use a superscript in the naming of regular expressions:
$r_{12}^{1}$, $r_{64}^{3}$, $r_{1k}^{k}$ and $r_{ij}^{k-1}$ are just names of different regular expressions.

We are going to build a table with $n^2$ rows and n+1 columns labeled k=0, k=1, ..., k=n.
For n = 3 the table looks as follows (writing r^k_ij for $r_{ij}^{k}$):

      | k=0    | k=1    | k=2    | ... | k=n
------+--------+--------+--------+-----+--------
 r_11 | r^0_11 | r^1_11 | r^2_11 | ... | r^n_11
 r_12 | r^0_12 | r^1_12 | r^2_12 | ... | r^n_12
 r_13 | r^0_13 | r^1_13 | r^2_13 | ... | r^n_13
 r_21 | r^0_21 | r^1_21 | r^2_21 | ... |
 r_22 | r^0_22 | r^1_22 | r^2_22 | ... |
 r_23 | r^0_23 | r^1_23 | r^2_23 | ... |
 r_31 | r^0_31 | r^1_31 | r^2_31 | ... |
 r_32 | r^0_32 | r^1_32 | r^2_32 | ... |
 r_33 | r^0_33 | r^1_33 | r^2_33 | ... |

Note there are $n^2$ rows, one for each pair of numbers from 1 to n.
Only build column n for the rows 1,f where f is a final state. The final regular expression
is then the union, +, of these column n entries.

Now, build the table entries for the k=0 column:

$$r_{ij}^{0} = \begin{cases} +\{\, x \mid q_j \in \delta(q_i, x) \,\} & i \neq j \\ +\{\, x \mid q_j \in \delta(q_i, x) \,\} + \epsilon & i = j \end{cases}$$

where delta is the transition table function, x is some symbol from sigma, the q's
are states, and +{ ... } is the union (+) of the symbols in the set (phi if the set is empty).
$r_{ij}^{0}$ could be phi, epsilon, a, 0+1, or a+b+d+epsilon.
Notice there is no Kleene star or concatenation in this column.

Next, build the k=1 column (all items on the right are from the previous column):

$$r_{ij}^{1} = r_{i1}^{0} \, (r_{11}^{0})^{*} \, r_{1j}^{0} + r_{ij}^{0}$$

Next, build the k=2 column (all items on the right are from the previous column):

$$r_{ij}^{2} = r_{i2}^{1} \, (r_{22}^{1})^{*} \, r_{2j}^{1} + r_{ij}^{1}$$

Then, build the rest of the columns, for general k (all items on the right are from the previous column):

$$r_{ij}^{k} = r_{ik}^{k-1} \, (r_{kk}^{k-1})^{*} \, r_{kj}^{k-1} + r_{ij}^{k-1}$$

Finally, for final states p, q, r the regular expression is

$$r_{1p}^{n} + r_{1q}^{n} + r_{1r}^{n}$$

Note that this is from a constructive proof that for every NFA there is a
regular expression describing the language of that NFA.

Some minimization rules for regular expressions

These can be applied at every step.

Note: phi is the empty set
      epsilon is the zero length string
      0, 1, a, b, c are symbols in sigma
      x is a variable or regular expression
      ( ... )( ... ) is concatenation
      ( ... ) + ( ... ) is union
      ( ... )* is the Kleene Closure = Kleene Star

(phi)(x) = (x)(phi) = phi

(epsilon)(x) = (x)(epsilon) = x

(phi) + (x) = (x) + (phi) = x

x + x = x

(epsilon)* = (epsilon)(epsilon) = epsilon

(x)* + (epsilon) = (x)* = x*

(x + epsilon)* = x*

x* (a+b) + (a+b) = x* (a+b)

x* y + y = x* y

(x + epsilon)x* = x* (x + epsilon) = x*

(x+epsilon)(x+epsilon)* (x+epsilon) = x*
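Several of these rules can be applied mechanically, bottom-up. The following Python sketch is only an illustration: the tuple encoding of expressions and the name `simplify` are assumptions, and only a subset of the rules listed above is implemented (the phi and epsilon rules, x + x = x, (x)* + epsilon = x*, (x + epsilon)* = x*, and x* y + y = x* y).

```python
PHI, EPS = ("phi",), ("eps",)

def simplify(e):
    """Apply a few of the minimization rules, innermost expressions first."""
    kind = e[0]
    if kind in ("phi", "eps", "sym"):
        return e
    if kind == "cat":
        x, y = simplify(e[1]), simplify(e[2])
        if PHI in (x, y):                    # (phi)(x) = (x)(phi) = phi
            return PHI
        if x == EPS:                         # (epsilon)(x) = x
            return y
        if y == EPS:                         # (x)(epsilon) = x
            return x
        return ("cat", x, y)
    if kind == "alt":
        x, y = simplify(e[1]), simplify(e[2])
        if x == PHI:                         # (phi) + (x) = x
            return y
        if y == PHI:                         # (x) + (phi) = x
            return x
        if x == y:                           # x + x = x
            return x
        if y == EPS and x[0] == "star":      # (x)* + (epsilon) = x*
            return x
        if x == EPS and y[0] == "star":
            return y
        if x[0] == "cat" and x[1][0] == "star" and x[2] == y:
            return x                         # x* y + y = x* y
        return ("alt", x, y)
    if kind == "star":
        x = simplify(e[1])
        if x == EPS:                         # (epsilon)* = epsilon
            return EPS
        if x[0] == "alt" and EPS in x[1:]:   # (x + epsilon)* = x*
            other = x[2] if x[1] == EPS else x[1]
            return simplify(("star", other))
        return ("star", x)
    raise ValueError(kind)

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
# phi (c + epsilon)* (a + b) + epsilon
expr = ("alt", ("cat", PHI, ("cat", ("star", ("alt", c, EPS)), ("alt", a, b))), EPS)
print(simplify(expr))
```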

Now for an example:


Given M = (Q, sigma, delta, q0, F) as

  delta  |  a   |  b   |  c          Q     = { q1, q2 }
 --------+------+------+------       sigma = { a, b, c }
    q1   | {q2} | {q2} | {q1}        q0    = q1
    q2   | phi  | phi  | phi         F     = { q2 }

(using e for epsilon)

      | k=0         | k=1
------+-------------+----------------------------------------
 r_11 | c + epsilon | (c+e)(c+e)* (c+e) + (c+e) = c*
 r_12 | a + b       | (c+e)(c+e)* (a+b) + (a+b) = c* (a+b)
 r_21 | phi         | phi (c+e)* (c+e) + phi    = phi
 r_22 | epsilon     | phi (c+e)* (a+b) + e      = e

      | k=0         | k=1      | k=2
------+-------------+----------+-----------------------------------------------
 r_11 | c + epsilon | c*       |
 r_12 | a + b       | c* (a+b) | c* (a+b)(e)* (e) + c* (a+b)   (only final state)
 r_21 | phi         | phi      |
 r_22 | epsilon     | e        |

The final regular expression minimizes to c* (a+b).
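The column-by-column procedure can be sketched in Python. This is a hedged illustration rather than the lecture's code: the helper names (`union`, `concat`, `star`, `nfa_to_regex`, `same_language`) are assumptions, phi is represented by `None`, epsilon by the empty string, and the result is written in Python `re` syntax, where union is `|` instead of `+`. Only the trivial phi and epsilon simplifications are applied, so the output is longer than c* (a+b), but a brute-force comparison on short words suggests it describes the same language.

```python
import re
from itertools import product

def union(x, y):
    """x + y, with None standing for phi."""
    if x is None:
        return y
    if y is None:
        return x
    return x if x == y else f"(?:{x}|{y})"

def concat(x, y):
    """xy; concatenation with phi gives phi."""
    return None if x is None or y is None else x + y

def star(x):
    """x*; (epsilon)* = epsilon, and the star of phi is epsilon as well."""
    return "" if x in (None, "") else f"(?:{x})*"

def nfa_to_regex(n, delta, finals):
    """States are 1..n and state 1 is the start; delta: (state, symbol) -> set."""
    r = {}
    for i in range(1, n + 1):                # the k = 0 column
        for j in range(1, n + 1):
            expr = None
            for (src, sym), targets in sorted(delta.items()):
                if src == i and j in targets:
                    expr = union(expr, re.escape(sym))
            r[i, j] = union(expr, "") if i == j else expr
    for k in range(1, n + 1):                # columns k = 1 .. n
        prev, r = r, {}
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                via_k = concat(concat(prev[i, k], star(prev[k, k])), prev[k, j])
                r[i, j] = union(via_k, prev[i, j])
    result = None
    for f in sorted(finals):                 # union over the final states
        result = union(result, r[1, f])
    return result

def same_language(p1, p2, alphabet, max_len):
    words = ("".join(t) for m in range(max_len + 1)
             for t in product(alphabet, repeat=m))
    return all(bool(re.fullmatch(p1, w)) == bool(re.fullmatch(p2, w)) for w in words)

delta = {(1, "a"): {2}, (1, "b"): {2}, (1, "c"): {1}}
regex = nfa_to_regex(2, delta, {2})
print(regex)
print(same_language(regex, "c*(?:a|b)", "abc", 4))
```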

Exercise Questions on Regular Language and Regular Expression

Ex. 1: Find the shortest string that is not in the language represented by the regular
expression a*(ab)*b*.

Solution: It can easily be seen that the empty string, a and b, which are the strings of length 1 or less, are all in the language.
Of the strings with length 2, aa, bb and ab are in the language. However, ba is not in it. Thus the
answer is ba.
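The same search can be done by brute force. The sketch below is an illustration using Python's standard `re` module; the function name `shortest_missing` and the bound `max_len` are assumptions. It enumerates strings in order of increasing length and returns the first one that the expression does not match in full.

```python
import re
from itertools import product

def shortest_missing(pattern, alphabet, max_len=6):
    """First string (shortest first) not matched in full by the pattern."""
    compiled = re.compile(pattern)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if not compiled.fullmatch(word):
                return word
    return None                              # every string up to max_len matches

print(shortest_missing(r"a*(ab)*b*", "ab"))  # prints: ba
```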

Ex. 2: For the two regular expressions given below,


(a) find a string corresponding to r2 but not to r1 and
(b) find a string corresponding to both r1 and r2.

r1 = a* + b* r2 = ab* + ba* + b*a + (a*b)*

Solution: (a) Any string consisting of only a's or only b's, as well as the empty string, is in r1. So we
need to find strings of r2 which contain at least one a and at least one b. For example ab and ba are
such strings.
(b) A string corresponding to r1 consists of only a's or only b's or is the empty string. The only
strings corresponding to r2 which consist of only a's or only b's are a, b and the strings consisting of only
b's, including the empty string (from (a*b)*). Any of these, for example a or bb, corresponds to both r1 and r2.

Ex. 3: Let r1 and r2 be arbitrary regular expressions over some alphabet. Find a simple (the
shortest and with the smallest nesting of * and +) regular expression which is equal to each of
the following regular expressions.

(a) (r1 + r2 + r1r2 + r2r1)*


(b) (r1(r1 + r2)*)+
Solution: One general strategy to approach this type of question is to try to see whether or not they
are equal to simple regular expressions that are familiar to us such as a, a*, a+, (a + b)*, (a + b)+
etc.
(a) Since (r1 + r2)* represents all strings consisting of strings of r1 and/or r2 , r1r2 + r2r1 in the
given regular expression is redundant, that is, they do not produce any strings that are not
represented by (r1 + r2)*. Thus (r1 + r2 + r1r2 + r2r1)* is reduced to (r1 + r2)*.

(b) (r1(r1 + r2)*)+ means that all the strings represented by it must consist of one or more strings
of (r1(r1 + r2)*). However, the strings of (r1(r1 + r2)*) start with a string of r1 followed by any
number of strings taken arbitrarily from r1 and/or r2. Thus anything that comes after the first r1 in
(r1(r1 + r2)*)+ is represented by (r1 + r2)*. Hence (r1(r1 + r2)*) also represents the strings of
(r1(r1 + r2)*)+, and conversely (r1(r1 + r2)*)+ represents the strings represented by (r1(r1 + r2)*).
Hence (r1(r1 + r2)*)+ is reduced to (r1(r1 + r2)*).

Ex. 4: Find a regular expression corresponding to the language of all strings over the
alphabet { a, b } that contain exactly two a's.

Solution: A string in this language must have exactly two a's. Since any string of b's can be placed
in front of the first a, behind the second a and between the two a's, and since an arbitrary string of
b's can be represented by the regular expression b*, b*a b*a b* is a regular expression for this
language.

Ex. 5: Find a regular expression corresponding to the language of all strings over the alphabet { a,
b } that do not end with ab.

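Solution: Any non-empty string over { a, b } must end in a or b. Hence if a string does not end
with ab, then it is the empty string, or it ends with a, or it is the single symbol b, or it ends with b and
that last b is preceded by another b. Since a string can have anything in front of the last a or bb,
( a + b )*( a + bb ) covers the strings ending with a or bb. Adding the two short strings gives
( a + b )*( a + bb ) + b + $\epsilon$ as a regular expression for the language.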
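Answers to exercises like Ex. 4 and Ex. 5 can be sanity-checked by comparing the regular expression with a direct description of the language on all short strings. The Python sketch below is illustrative: the function name `mismatches` and the length bound are assumptions, and union is written `|` as Python's `re` module requires. On this check ( a + b )*( a + bb ) by itself leaves out the empty string and the string b, both of which do not end with ab, so they need alternatives of their own.

```python
import re
from itertools import product

def mismatches(pattern, predicate, alphabet="ab", max_len=6):
    """Strings up to max_len on which the pattern and the predicate disagree."""
    compiled = re.compile(pattern)
    bad = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if bool(compiled.fullmatch(word)) != predicate(word):
                bad.append(word)
    return bad

def exactly_two_as(word):
    return word.count("a") == 2

def not_ending_with_ab(word):
    return not word.endswith("ab")

print(mismatches(r"b*ab*ab*", exactly_two_as))               # []
print(mismatches(r"(a|b)*(a|bb)", not_ending_with_ab))       # ['', 'b']
print(mismatches(r"(a|b)*(a|bb)|b|", not_ending_with_ab))    # []
```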

Kleene's Theorem --- Part 1

Theorem 1 (Part 1 of Kleene's theorem): Any regular language is accepted by a finite


automaton.

Proof: This is going to be proven by (general) induction following the recursive definition of
regular language.

Basis Step: As shown below the languages $\emptyset$, { $\epsilon$ } and { a } for any symbol a in $\Sigma$ are accepted
by an FA.

Example 1: An NFA-$\epsilon$ that accepts the language represented by the regular expression (aa + b)*
can be constructed as follows using the operations given above.

Kleene's Theorem -- Part 2

The converse of the part 1 of Kleene Theorem also holds true. It states that any language accepted
by a finite automaton is regular.
Before proceeding to a proof outline for the converse, let us study a method to compute the set of
strings accepted by a finite automaton.
Given a finite automaton, first relabel its states with the integers 1 through n, where n is the
number of states of the finite automaton. Next denote by L(p, q, k) the set of strings representing
paths from state p to state q that go through only states numbered no higher than k. Note that paths
may go through arcs and vertices any number of times.
Then the following lemmas hold.

Lemma 1: L(p, q, k+1) = L(p, q, k) $\cup$ L(p, k+1, k) L(k+1, k+1, k)* L(k+1, q, k).

What this lemma says is that the set of strings representing paths from p to q passing through states
labeled with k+1 or lower numbers consists of the following two sets:

1. L(p, q, k): The set of strings representing paths from p to q passing through states labeled with k
or lower numbers.

2. L(p, k+1, k) L(k+1, k+1, k)* L(k+1, q, k): The set of strings going first from p to k+1, then from
k+1 to k+1 any number of times, then from k+1 to q, all without passing through states labeled
higher than k.
See the figure below for the illustration.

Lemma 2: L(p, q, 0) is regular.


Proof: L(p, q, 0) is the set of strings representing paths from p to q without passing through any states in
between. Hence if p and q are different, then it consists of single symbols representing arcs from p
to q. If p = q, then $\epsilon$ is in it as well as the strings representing any loops at p (they are all single
symbols). Since the number of symbols is finite and since any finite language is regular, L(p, q, 0)
is regular.

From Lemmas 1 and 2, by induction on k, the following lemma holds.

Lemma 3: L(p, q, k) is regular for any states p and q and any natural number k.

Since the language accepted by a finite automaton is the union of L(q0, q, n) over all accepting
states q, where n is the number of states of the finite automaton, we have the following converse of
Part 1 of Kleene's theorem.

Theorem 2 (Part 2 of Kleene's theorem): Any language accepted by a finite automaton is regular.

Translating regular expressions to NFAs


Theorem: For each regular expression $E$ we can construct an NFA $N(E)$ s.t. $L(N(E)) = L(E)$,
i.e. the automaton accepts the language described by the regular expression.

Proof:

We do this again by induction on the syntax of regular expressions:

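1. $N(\emptyset)$:

   [Diagram: an automaton without final states]

   which will reject everything (it has got no final states) and hence $L(N(\emptyset)) = \emptyset = L(\emptyset)$.

2. $N(\epsilon)$:

   [Diagram: a single state which is both initial and final, without transitions]

   This automaton accepts the empty word but rejects everything else, hence $L(N(\epsilon)) = \{\epsilon\} = L(\epsilon)$.

3. $N(\mathbf{x})$:

   [Diagram: an initial state with one arrow labelled x to a final state]

   This automaton only accepts the word x, hence $L(N(\mathbf{x})) = \{x\} = L(\mathbf{x})$.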

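4. $N(E + F)$:

   We merge the diagrams for $N(E)$ and $N(F)$ into one. I.e. given

   $$N(E) = (Q_E, \Sigma, \delta_E, S_E, F_E), \qquad N(F) = (Q_F, \Sigma, \delta_F, S_F, F_F)$$

   we set

   $$N(E + F) = (Q_E \uplus Q_F,\ \Sigma,\ \delta_{E+F},\ S_E \uplus S_F,\ F_E \uplus F_F)$$

   where $\delta_{E+F}$ behaves like $\delta_E$ on the states of $N(E)$ and like $\delta_F$ on the states of $N(F)$.

   The disjoint union $\uplus$ just signals that we are not going to identify states, even if they
   accidentally happen to have the same name.

   Just thinking of the game with markers you should be able to convince yourself that

   $$L(N(E + F)) = L(N(E)) \cup L(N(F))$$

   Moreover, to show that $L(N(E + F)) = L(E + F)$ we are allowed to assume that

   $$L(N(E)) = L(E), \qquad L(N(F)) = L(F);$$

   that's what is meant by induction over the syntax of regular expressions.

   Now putting everything together:

   $$L(N(E + F)) = L(N(E)) \cup L(N(F)) = L(E) \cup L(F) = L(E + F)$$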

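5. $N(EF)$:

   We want to put the two automata $N(E)$ and $N(F)$ in series. We do this by connecting the
   final states of $N(E)$ with the initial states of $N(F)$ in a way explained below.

   [Diagram: $N(E)$ and $N(F)$ connected in series]

   In this diagram I only depicted one initial and one final state of each of the automata,
   although there may be several of them.

   Here is how we construct $N(EF)$ from $N(E)$ and $N(F)$:

   o The states of $N(EF)$ are the disjoint union of the states of $N(E)$ and $N(F)$: $Q_{EF} = Q_E \uplus Q_F$

   o The transition function of $N(EF)$ contains all the transitions of $N(E)$ and $N(F)$
     (as for $N(E + F)$), and for each pair of a final state $q$ of $N(E)$ and an initial state $q'$
     of $N(F)$ we add all the arrows coming out of $q'$ to $q$.

   o The initial states of $N(EF)$ are the initial states of $N(E)$, and the initial states of $N(F)$
     if there is an initial state of $N(E)$ which is also a final state.

   o The final states of $N(EF)$ are the final states of $N(F)$, and also the final states of $N(E)$
     if there is an initial state of $N(F)$ which is also a final state (so that a word of $L(E)$
     followed by the empty word is accepted).

   We now set $N(EF) = (Q_{EF}, \Sigma, \delta_{EF}, S_{EF}, F_{EF})$.

   I hope that you are able to convince yourself that

   $$L(N(EF)) = \{\, wv \mid w \in L(N(E)),\ v \in L(N(F)) \,\}$$

   and hence we can reason

   $$L(N(EF)) = \{\, wv \mid w \in L(N(E)),\ v \in L(N(F)) \,\} = \{\, wv \mid w \in L(E),\ v \in L(F) \,\} = L(EF)$$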

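6. $N(E^*)$:

   We construct $N(E^*)$ from $N(E)$ by merging initial and final states of $N(E)$ in a way
   similar to the previous construction, and we add a new state which is initial and final.

   Given $N(E) = (Q_E, \Sigma, \delta_E, S_E, F_E)$ we construct $N(E^*)$.

   o We add one extra state $q_0$: $Q_{E^*} = Q_E \uplus \{q_0\}$

   o $N(E^*)$ inherits all transitions from $N(E)$, and for each state which has an arrow labelled x to a
     final state we also add an arrow labelled x to all the initial states:
     $\delta_{E^*}(q, x) = \delta_E(q, x) \cup \{\, q' \in S_E \mid \delta_E(q, x) \cap F_E \neq \emptyset \,\}$ for $q \in Q_E$, and $\delta_{E^*}(q_0, x) = \emptyset$.

   o The initial states of $N(E^*)$ are the initial states of $N(E)$ and $q_0$: $S_{E^*} = S_E \uplus \{q_0\}$

   o The final states of $N(E^*)$ are the final states of $N(E)$ and $q_0$: $F_{E^*} = F_E \uplus \{q_0\}$

   We define $N(E^*) = (Q_{E^*}, \Sigma, \delta_{E^*}, S_{E^*}, F_{E^*})$.

   We claim that $L(N(E^*)) = L(N(E))^*$,
   since we can run through the automaton an arbitrary number of times. The new state $q_0$
   allows us also to accept the empty sequence. Hence:

   $$L(N(E^*)) = L(N(E))^* = L(E)^* = L(E^*)$$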

7. $N((E)) = N(E)$
   I.e. using brackets does not change anything.

As an example we construct $N(a^* b^*)$. First we construct $N(a)$:

[Diagram: $N(a)$]

Now we have to apply the *-construction and we obtain:

[Diagram: $N(a^*)$]

$N(b^*)$ is just the same and we get

[Diagram: $N(b^*)$]

and now we have to serialize the two automata and we get:

[Diagram: $N(a^* b^*)$]

Now, you may observe that this automaton, though correct, is unnecessarily complicated, since we
could have just used

[Diagram: a smaller automaton for the same language]

However, we shall not be concerned with minimality at the moment.

Showing that a language is not regular :

Regular languages are languages which can be recognized by a computer with finite (i.e. fixed)
memory. Such a computer corresponds to a DFA. However, there are many languages which
cannot be recognized using only finite memory; a simple example is the language

$$L = \{\, 0^n 1^n \mid n \in \mathbb{N} \,\}$$

i.e. the language of words which start with a number of 0s followed by the same number of 1s.

Note that this is different from $L(0^* 1^*)$, which is the language of words consisting of a sequence of 0s
followed by a sequence of 1s where the two numbers need not be identical (and which we know to be
regular because it is given by a regular expression).

Why can $L$ not be recognized by a computer with finite memory? Assume we have 32 Megabytes
of memory, that is we have $32 \cdot 1024 \cdot 1024 \cdot 8 = 268435456$ bits. Such a computer corresponds
to an enormous DFA with $2^{268435456}$ states (imagine you have to draw the transition diagram).
However, the computer can only count until $2^{268435456}$; if we feed it any more 0s in the beginning
it will get confused! Hence, you need a (potentially) infinite amount of memory to recognize $L$.
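The counting argument can be tried out on any concrete DFA. The Python sketch below is only an illustration: the dictionary encoding of the transition function and the names `confusing_pair` and `run` are assumptions, and the three-state machine is an arbitrary example. Among the prefixes $0^0, 0^1, \dots, 0^n$ fed to a DFA with n states, two must lead to the same state, and from then on the machine cannot tell them apart.

```python
def run(delta, start, word):
    """State reached by the DFA after reading word."""
    state = start
    for symbol in word:
        state = delta[state, symbol]
    return state

def confusing_pair(delta, start, num_states):
    """Find i < j such that 0^i and 0^j lead to the same state."""
    seen = {}
    state = start
    for count in range(num_states + 1):
        if state in seen:                    # this state was reached before
            return seen[state], count
        seen[state] = count
        state = delta[state, "0"]
    raise AssertionError("cannot happen: more prefixes than states")

# Example DFA with 3 states: a 0 steps forward, a 1 steps backward (mod 3).
delta = {(q, "0"): (q + 1) % 3 for q in range(3)}
delta.update({(q, "1"): (q - 1) % 3 for q in range(3)})

i, j = confusing_pair(delta, 0, 3)
good = "0" * i + "1" * i                     # has the form 0^n 1^n
bad = "0" * j + "1" * i                      # does not, since j > i
print(i, j, run(delta, 0, good) == run(delta, 0, bad))
```

Whatever the accepting states are, the machine ends in the same state on `good` and `bad`, so it accepts both or rejects both, although only `good` is in the language.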

We shall now show a general theorem called the pumping lemma which allows us to prove that a
certain language is not regular.
QUESTIONS

PART –A

1. Define Abstract model.
2. Give the formal definition of DFA.
3. Define NDFA with example.
4. What are the steps for processing a string by a finite state machine?
5. Define regular expression.
6. Check whether (a+b)*(cd) is a regular expression.
7. State Kleene's theorem for regular expressions.
8. Define pumping lemma.
9. Write the applications of pumping lemma.
10. Prove that L = { $0^i 1^j$ | i > j } is not regular.
11. Define recursive and non recursive production.
12. Define regular language.
13. What are the applications of automata theory?
14. Construct an FA for the set of all strings over {0,1} with 3 consecutive 0's.
15. Define FA with epsilon moves.
16. How to construct a regular expression from an FA?
17. Show that L = { $a^n b^n c^n$ | n >= 0 } is not regular.

PART –B
1. Convert the following NFA to DFA.

2. Explain Kleene's theorem in detail.

3. Construct an NFA-$\epsilon$ that accepts the language represented by the regular expression (aa + b)*

4. Construct an NFA-$\epsilon$ that accepts the language represented by the regular expression ((a +
b)a*)*

5. Construct a deterministic finite automaton equivalent to M = ({q0,q1,q2,q3}, {a,b}, $\delta$, q0, {q3}),
where $\delta$ is
$\delta$(q0,a)={q0,q1}
$\delta$(q0,b)={q0}
$\delta$(q1,a)={q2}
$\delta$(q1,b)={q1}
$\delta$(q2,a)={q3}
$\delta$(q0,b)={q3}
$\delta$(q3,b)={q2}
