UNIT 2
Regular expressions
Given an alphabet Σ, a language is a set of words L ⊆ Σ*. So far we were able to describe
languages either by using set theory (i.e. enumeration or comprehension) or by an automaton.
In this section we shall introduce regular expressions as an elegant and concise way to describe
languages.
We shall see that the languages definable by regular expressions are precisely the same as
those accepted by deterministic or nondeterministic finite automata. These languages are called
regular languages or (according to the Chomsky hierarchy) Type 3 languages.
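Given a language L ⊆ Σ* we define
L* = { w1 w2 ... wn | n ≥ 0 and each wi ∈ L }.
Intuitively, L* contains all the words which can be formed by concatenating an arbitrary number
of words in L. This includes the empty word ε since the number may be 0.
As an example consider L = {a, ab}:
L* = {ε, a, ab, aa, aab, aba, abab, ...}
You should notice that we use the same symbol * as in Σ*, but there is a subtle difference: Σ is a set
of symbols, whereas L is a set of words.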
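The star operation can be explored mechanically. The following is a minimal sketch, not part of the original notes: the function name star_up_to and the sample language {a, ab} are illustrative assumptions, and because L* is usually infinite the sketch lists only the words up to a chosen length.

```python
def star_up_to(language, max_len):
    """Return all words of L* whose length is at most max_len."""
    words = {""}        # concatenating zero words gives the empty word
    frontier = {""}     # words found in the previous round
    while frontier:
        next_frontier = set()
        for prefix in frontier:
            for w in language:
                candidate = prefix + w
                if len(candidate) <= max_len and candidate not in words:
                    words.add(candidate)
                    next_frontier.add(candidate)
        frontier = next_frontier
    return words


L = {"a", "ab"}
print(sorted(star_up_to(L, 3), key=lambda w: (len(w), w)))
# ['', 'a', 'aa', 'ab', 'aaa', 'aab', 'aba']
```

Note that the empty word appears even when L itself is empty, which matches the remark that the number of concatenated words may be 0.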
We now define regular expressions over Σ. The rules below give their syntax; the semantics then
assigns to each regular expression E over Σ a language L(E) ⊆ Σ*.
1. ∅ is a regular expression.
2. ε is a regular expression.
3. For each x ∈ Σ, x is a regular expression. E.g. in the example all small letters are regular
expressions. We use boldface to emphasize the difference between the symbol a and the
regular expression a.
4. If E and F are regular expressions then E + F is a regular expression.
5. If E and F are regular expressions then EF (i.e. just one after the other) is a regular
expression.
6. If E is a regular expression then E* is a regular expression.
7. If E is a regular expression then (E) is a regular expression.
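One way to make the rules concrete is to represent each regular expression as a small tree and decide membership by following the rules directly. This is only a sketch under assumptions: the tuple encoding and the names EMPTY, EPS, sym, alt, cat, star and matches are invented for illustration, and the matcher tries all ways of splitting a word, so it is suitable only for short inputs.

```python
# One constructor per rule of the definition (names are illustrative).
EMPTY = ("empty",)                    # the expression for the empty language
EPS = ("eps",)                        # the expression for the empty word


def sym(x):                           # a single symbol of the alphabet
    return ("sym", x)


def alt(e, f):                        # E + F
    return ("alt", e, f)


def cat(e, f):                        # EF
    return ("cat", e, f)


def star(e):                          # E*
    return ("star", e)


def matches(e, w):
    """Decide whether the word w belongs to the language of expression e."""
    kind = e[0]
    if kind == "empty":
        return False
    if kind == "eps":
        return w == ""
    if kind == "sym":
        return w == e[1]
    if kind == "alt":
        return matches(e[1], w) or matches(e[2], w)
    if kind == "cat":                 # try every split w = uv
        return any(matches(e[1], w[:i]) and matches(e[2], w[i:])
                   for i in range(len(w) + 1))
    if kind == "star":                # empty word, or a nonempty first piece followed by more
        return w == "" or any(matches(e[1], w[:i]) and matches(e, w[i:])
                              for i in range(1, len(w) + 1))
    raise ValueError(f"unknown expression: {e!r}")


aa_or_b_star = star(alt(cat(sym("a"), sym("a")), sym("b")))   # (aa + b)*
print(matches(aa_or_b_star, "aab"), matches(aa_or_b_star, "aba"))   # True False
```

In the star case the first piece is required to be nonempty, which does not change the language and guarantees that the recursion terminates.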
Regular Grammar
A grammar is a set of rewrite rules which are used to generate strings by successively
rewriting symbols. For example, consider the language represented by a⁺ (one or more a's), which
is { a, aa, aaa, ... }. One can generate the strings of this language by the following procedure:
Let S be a symbol to start the process with. Rewrite S using one of the following two rules:
S -> a , and S -> aS .
Formally a grammar consists of a set of non terminals (or variables) V, a set of terminals Σ
(the alphabet of the language), a start symbol S, which is a non terminal, and a set of rewrite
rules (productions) P. A production has in general the form α -> β, where α is a string of
terminals and non terminals with at least one non terminal in it and β is a string of terminals and
non terminals.
A grammar is regular if and only if α is a single non terminal and β is a single terminal
or a single terminal followed by a single non terminal, that is, a production is of the form X -> a or
X -> aY, where X and Y are non terminals and a is a terminal. A production X -> ε, with the empty
string ε on the right, is also allowed, as the following example shows.
For example, Σ = {a, b}, V = { S } and P = { S -> aS, S -> bS, S -> ε } is a regular grammar and
it generates all the strings consisting of a's and b's including the empty string.
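The rewriting process can be simulated directly. The sketch below is illustrative only: the function name generate and the convention that upper-case letters are non terminals and lower-case letters are terminals are assumptions made here, and the search is guaranteed to stop only when every production either adds a terminal or removes a non terminal, as is the case for regular grammars.

```python
from collections import deque


def generate(productions, start, max_len):
    """Terminal strings with at most max_len symbols derivable from `start`.

    productions maps a non terminal (upper-case letter) to a list of
    right-hand sides; "" stands for the empty string.
    """
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Rewrite the leftmost non terminal, if there is one.
        pos = next((i for i, c in enumerate(form) if c.isupper()), None)
        if pos is None:
            results.add(form)          # no non terminal left: a generated string
            continue
        for rhs in productions[form[pos]]:
            new_form = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(1 for c in new_form if not c.isupper())
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results


print(sorted(generate({"S": ["a", "aS"]}, "S", 3)))               # ['a', 'aa', 'aaa']
print(len(generate({"S": ["aS", "bS", ""]}, "S", 2)))             # 7
```

The second call counts the strings of length at most 2 over {a, b}: one of length 0, two of length 1 and four of length 2.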
A grammar is a context-free grammar if and only if its productions are of the form X -> β, where
β is a string of terminals and non terminals, possibly the empty string.
For example P = { S -> aSb, S -> ab } with Σ = { a, b } and V = { S } is a context-free grammar
and it generates the language { a^n b^n | n is a positive integer } . As we shall see later this is an
example of a context-free language which is not regular.
A grammar is a context-sensitive grammar if and only if its productions are of the form α1 X α2 ->
α1 β α2, where X is a non terminal and α1, α2 and β are strings of terminals and non terminals.
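The constructive proof provides an algorithm for constructing a machine, M, from a regular
expression r. The six constructions below correspond to the cases of the definition of regular
expressions: the three base cases (phi, epsilon and a single symbol of sigma) and the three operator
cases 4), 5) and 6) (union, concatenation and Kleene star).
Then, working from the inside outward and left to right at the same scope,
apply whichever of constructions 4), 5) or 6) applies.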
Note: add one arrow head to figure 6) going into the top of the second circle.
The result is an NFA with epsilon moves. This NFA can then be converted to an NFA without
epsilon moves. Further conversion can be performed to get a DFA. All these machines accept the
same language as the regular expression from which they were constructed.
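The construction can be sketched in code. The following is an assumption-laden illustration rather than a transcription of the figures: it uses the usual textbook shapes for the six cases (which may differ in detail from the figures), the class name EpsilonNFA and its method names are invented here, and acceptance is decided by tracking epsilon closures instead of converting to a DFA.

```python
import itertools


class EpsilonNFA:
    """NFA with epsilon moves. A fragment is a pair (start, final) of state numbers."""

    def __init__(self):
        self.edges = {}                   # (state, label) -> set of states; label None = epsilon
        self._fresh = itertools.count()   # supplies new state numbers 0, 1, 2, ...

    def _state(self):
        return next(self._fresh)

    def _arrow(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    def phi(self):                        # empty set: the final state is unreachable
        return self._state(), self._state()

    def epsilon(self):
        s, f = self._state(), self._state()
        self._arrow(s, None, f)
        return s, f

    def symbol(self, a):
        s, f = self._state(), self._state()
        self._arrow(s, a, f)
        return s, f

    def union(self, r1, r2):              # new start and final, joined by epsilon arrows
        s, f = self._state(), self._state()
        for start, final in (r1, r2):
            self._arrow(s, None, start)
            self._arrow(final, None, f)
        return s, f

    def concat(self, r1, r2):             # epsilon arrow from final of r1 to start of r2
        self._arrow(r1[1], None, r2[0])
        return r1[0], r2[1]

    def star(self, r):                    # allow skipping r and looping back through r
        s, f = self._state(), self._state()
        self._arrow(s, None, r[0])
        self._arrow(s, None, f)
        self._arrow(r[1], None, r[0])
        self._arrow(r[1], None, f)
        return s, f

    def _closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for r in self.edges.get((q, None), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def accepts(self, fragment, word):
        start, final = fragment
        current = self._closure({start})
        for ch in word:
            moved = set()
            for q in current:
                moved |= self.edges.get((q, ch), set())
            current = self._closure(moved)
        return final in current


m = EpsilonNFA()
aba = m.concat(m.concat(m.symbol("a"), m.symbol("b")), m.symbol("a"))
print(m.accepts(aba, "aba"), m.accepts(aba, "ab"))   # True False
```

Each fragment should be used only once when building a larger expression, because the builder adds arrows to the states of the fragments it combines.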
The construction covers all possible cases that can occur in any regular expression. Because of
this generality, many more states are generated than are necessary. The unnecessary states
are joined by epsilon transitions. They can be removed, but the compression must be done very
carefully. For example, the fragment regular expression aba would be (writing e for epsilon)

        a        e        b        e        a
   q0 ----> q1 ----> q2 ----> q3 ----> q4 ----> q5

which can be reduced to

        a        b        a
   q0 ----> q1 ----> q2 ----> q3
A careful reduction of unnecessary states requires use of the Myhill-Nerode Theorem of section
3.4 in 1st Ed. or section 4.4 in 2nd Ed. This will provide a DFA that has the minimum number of
states.
Up to a renaming of the states and a reordering of the rows of the state transition table (delta),
all minimum-state DFAs for a given language are identical.
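As an illustration of state reduction, the sketch below merges indistinguishable states of a DFA by repeated partition refinement. This is a standard technique swapped in for illustration, not the procedure of the cited textbook sections; the function name minimize and the example automaton are assumptions, and delta must be a total transition function.

```python
def minimize(alphabet, delta, start, finals):
    """Return the classes of equivalent reachable states of a DFA.

    alphabet is an ordered sequence of symbols, delta maps (state, symbol)
    to a state. Each returned frozenset is one state of the reduced DFA.
    """
    # 1. Discard states that cannot be reached from the start state.
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)

    # 2. Start from the split final / non-final and refine until stable.
    block = {q: int(q in finals) for q in reachable}
    while True:
        ids, refined = {}, {}
        for q in reachable:
            # Two states stay together only if they are in the same block and
            # every symbol leads them to the same block.
            signature = (block[q],) + tuple(block[delta[(q, a)]] for a in alphabet)
            refined[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            break                      # no block was split: the partition is stable
        block = refined

    classes = {}
    for q, b in block.items():
        classes.setdefault(b, set()).add(q)
    return {frozenset(c) for c in classes.values()}


# Strings over {a, b} ending in a; states 0 and 2 behave alike, state 3 is unreachable.
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 2, (3, "a"): 3, (3, "b"): 3}
print(minimize("ab", delta, 0, {1}))
```

For this example the result consists of the two classes {0, 2} and {1}, so the reduced DFA has two states.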
Conversion of an NFA to a regular expression was started in this lecture and finished in the next
lecture. The notes are in lecture 7.
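There is exactly one start state "-->" and exactly one final state "<<>>".
The unlabeled arrows should be labeled with epsilon.
Now recursively decompose each internal regular expression.

We write r_{ij} for the regular expression named by the subscripts i,j, and add a superscript when
several columns of a table are needed. For example, r^1_{12}, r^3_{64}, r^k_{1k} and r^{k-1}_{ij}
are just names of different regular expressions.

We are going to build a table with n^2 rows and n+1 columns labeled k=0, k=1, ..., k=n. There is
one row for each r_{ij}: note n^2 rows, one for every pair of numbers i, j from 1 to n.

The entries of the k=0 column are
   r^0_{ij} = +{ x | delta(q_i, x) = q_j }              if i /= j
   r^0_{ij} = +{ x | delta(q_i, x) = q_j } + epsilon    if i = j
where delta is the transition table function, x is some symbol from sigma, the q's
are states, and +{ ... } is the union of the listed symbols (phi if there are none).
r^0_{ij} could be phi, epsilon, a, 0+1, or a+b+d+epsilon.
Notice there is no Kleene star or concatenation in this column.

Each later column is built from the previous one:
   r^k_{ij} = r^{k-1}_{ik} (r^{k-1}_{kk})* r^{k-1}_{kj} + r^{k-1}_{ij}

Note that this is from a constructive proof that every NFA has a language
for which there is a corresponding regular expression.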
Some minimization rules for regular expressions. These can be applied at every
step.
(epsilon)(x) = (x)(epsilon) = x
x + x = x
(x + epsilon)* = x*
x* y + y = x* y
(x + epsilon)x* = x* (x + epsilon) = x*
(x+epsilon)(x+epsilon)* (x+epsilon) = x*
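Rules of this kind can be sanity-checked by brute force. The sketch below is not a proof: it substitutes two arbitrary sample expressions for x and y (an assumption made here), translates each rule into Python's re syntax (| for +, an empty alternative for epsilon), and compares the two sides on all words over {a, b} up to a small length.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=6):
    """All words over `alphabet` of length <= max_len matched in full by `pattern`."""
    compiled = re.compile(pattern)
    return {
        w
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
        if compiled.fullmatch(w)
    }


X, Y = "(?:ab)", "(?:b|aa)"      # arbitrary sample expressions standing in for x and y
X_OR_EPS = f"(?:{X}|)"           # x + epsilon: the empty alternative matches the empty word

RULES = [
    (f"(?:){X}", X),                                   # (epsilon)(x) = x
    (f"{X}(?:)", X),                                   # (x)(epsilon) = x
    (f"(?:{X}|{X})", X),                               # x + x = x
    (f"{X_OR_EPS}*", f"{X}*"),                         # (x + epsilon)* = x*
    (f"(?:{X}*{Y}|{Y})", f"{X}*{Y}"),                  # x* y + y = x* y
    (f"{X_OR_EPS}{X}*", f"{X}*"),                      # (x + epsilon)x* = x*
    (f"{X}*{X_OR_EPS}", f"{X}*"),                      # x* (x + epsilon) = x*
    (f"{X_OR_EPS}{X_OR_EPS}*{X_OR_EPS}", f"{X}*"),     # (x+epsilon)(x+epsilon)*(x+epsilon) = x*
]

for left, right in RULES:
    print(language(left) == language(right), left, "=", right)
```

A mismatch on any short word would refute a proposed rule; agreement on all of them is only supporting evidence for the particular x and y chosen.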
Ex. 1: Find the shortest string that is not in the language represented by the regular
expression a*(ab)*b*.
Solution: It can easily be seen that ε, a and b, the strings of length 1 or less, are all in the
language. Of the strings with length 2, aa, bb and ab are in the language. However, ba is not in it.
Thus the answer is ba.
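The same search can be done mechanically by trying all strings in order of increasing length. This is a small sketch using Python's re module; the function name shortest_non_member is an assumption made here, and (ab)* is written as a non-capturing group.

```python
import re
from itertools import product


def shortest_non_member(pattern, alphabet, max_len=8):
    """First string (by length, then alphabetically) not matched in full by pattern."""
    compiled = re.compile(pattern)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if not compiled.fullmatch(w):
                return w
    return None                      # every string up to max_len is in the language


print(shortest_non_member(r"a*(?:ab)*b*", "ab"))   # ba
```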
Solution: (a) Any string consisting of only a's or only b's and the empty string are in r1. So we
need to find strings of r2 which contain at least one a and at least one b. For example ab and ba are
such strings.
(b) A string corresponding to r1 consists of only a's or only b's or the empty string. The only
strings corresponding to r2 which consist of only a's or b's are a, b and the strings consisting of only
b's (from (a*b)*).
Ex. 3: Let r1 and r2 be arbitrary regular expressions over some alphabet. Find a simple (the
shortest and with the smallest nesting of * and +) regular expression which is equal to each of
the following regular expressions.
(b) (r1(r1 + r2)*)+ means that all the strings represented by it must consist of one or more strings
of (r1(r1 + r2)*). However, the strings of (r1(r1 + r2)*) start with a string of r1 followed by any
number of strings taken arbitrarily from r1 and/or r2. Thus anything that comes after the first r1 in
(r1(r1 + r2)*)+ is represented by (r1 + r2)*. Hence (r1(r1 + r2)*) also represents the strings of
(r1(r1 + r2)*)+, and conversely (r1(r1 + r2)*)+ represents the strings represented by (r1(r1 + r2)*).
Hence (r1(r1 + r2)*)+ is reduced to (r1(r1 + r2)*).
Ex. 4: Find a regular expression corresponding to the language of all strings over the
alphabet { a, b } that contain exactly two a's.
Solution: A string in this language must have exactly two a's. Since any string of b's can be placed
in front of the first a, behind the second a and between the two a's, and since an arbitrary string of
b's can be represented by the regular expression b*, b*a b*a b* is a regular expression for this
language.
Ex. 5: Find a regular expression corresponding to the language of all strings over the alphabet { a,
b } that do not end with ab.
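Solution: Any nonempty string over { a , b } must end in a or b. Hence if a nonempty string does
not end with ab, then either it ends with a, or it ends with b and that last b is either the whole
string or is preceded by a symbol b. Since any string can come in front of the last a or bb,
( a + b )*( a + bb ) covers all of these strings except the single symbol b. The empty string does not
end with ab either, so ε + b + ( a + b )*( a + bb ) is a regular expression for the language.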
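A candidate expression for this exercise can be tested against the defining property by brute force. The sketch below (function name disagreements assumed for illustration) lists the short strings on which a pattern and the property "does not end with ab" disagree, first for ( a + b )*( a + bb ) on its own and then with the alternatives ε and b added.

```python
import re
from itertools import product


def disagreements(pattern, predicate, alphabet="ab", max_len=7):
    """Strings up to max_len on which full matching and the predicate differ."""
    compiled = re.compile(pattern)
    return [
        w
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
        if bool(compiled.fullmatch(w)) != predicate(w)
    ]


def does_not_end_with_ab(w):
    return not w.endswith("ab")


print(disagreements(r"(?:a|b)*(?:a|bb)", does_not_end_with_ab))           # ['', 'b']
print(disagreements(r"(?:|b|(?:a|b)*(?:a|bb))", does_not_end_with_ab))    # []
```

The first result shows that the empty string and the single symbol b have to be listed separately; with them included, no disagreement remains among the strings tested.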
Proof: This is going to be proven by (general) induction following the recursive definition of
regular language.
Basis Step: As shown below the languages ∅, { ε } and { a } for any symbol a in Σ are accepted
by an FA.
Example 1: An NFA-ε that accepts the language represented by the regular expression (aa + b)*
can be constructed as follows using the operations given above.
Kleene's Theorem -- Part 2
The converse of part 1 of Kleene's Theorem also holds true. It states that any language accepted
by a finite automaton is regular.
Before proceeding to a proof outline for the converse, let us study a method to compute the set of
strings accepted by a finite automaton.
Given a finite automaton, first relabel its states with the integers 1 through n, where n is the
number of states of the finite automaton. Next denote by L(p, q, k) the set of strings representing
paths from state p to state q that go through only states numbered no higher than k. Note that paths
may go through arcs and vertices any number of times.
Then the following lemmas hold.
Lemma 1: L(p, q, k+1) = L(p, q, k) ∪ L(p, k+1, k)L(k+1, k+1, k)*L(k+1, q, k).
The right-hand side is the union of two sets:
1. L(p, q, k) : The set of strings representing paths from p to q passing through states labeled with k
or lower numbers.
2. L(p, k+1, k)L(k+1, k+1, k)*L(k+1, q, k) : The set of strings going first from p to k+1, then from
k+1 to k+1 any number of times, then from k+1 to q, all without passing through states labeled
higher than k.
See the figure below for the illustration.
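Lemma 2: L(p, q, 0) is regular. (It is a finite set: the labels of the arcs from p to q, together
with ε when p = q.)
Lemma 3: L(p, q, k) is regular for any states p and q and any natural number k.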
Since the language accepted by a finite automaton is the union of L(q0, q, n) over all accepting
states q, where n is the number of states of the finite automaton, we have the following converse of
part 1 of Kleene's Theorem: any language accepted by a finite automaton is regular.
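The sets L(p, q, k) can be computed step by step, which turns the argument into an algorithm. The sketch below makes several assumptions: the automaton is a DFA with states numbered 1 to n, regular expressions are built as strings in Python's re syntax (None stands for the empty set and the empty pattern for ε), and the names union, concat, star and automaton_to_regex are invented for illustration. No simplification is attempted, so the result is correct but far from the shortest expression.

```python
import re
from itertools import product


def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    if r == s:
        return r
    if r == "":
        r, s = s, r                  # keep the empty alternative last
    return f"(?:{r}|{s})"


def concat(r, s):
    return None if r is None or s is None else r + s


def star(r):
    return "" if r in (None, "") else f"(?:{r})*"


def automaton_to_regex(n, delta, start, finals):
    """Regex for a DFA with states 1..n; returns None if no string is accepted."""
    states = range(1, n + 1)
    # L[p, q] holds an expression for L(p, q, k); begin with k = 0.
    L = {(p, q): None for p in states for q in states}
    for (p, x), q in delta.items():
        L[p, q] = union(L[p, q], re.escape(x))
    for p in states:
        L[p, p] = union(L[p, p], "")
    # L(p, q, k) = L(p, q, k-1) U L(p, k, k-1) L(k, k, k-1)* L(k, q, k-1)
    for k in states:
        L = {
            (p, q): union(L[p, q], concat(concat(L[p, k], star(L[k, k])), L[k, q]))
            for p in states
            for q in states
        }
    result = None
    for q in finals:
        result = union(result, L[start, q])
    return result


def dfa_accepts(delta, start, finals, word):
    q = start
    for ch in word:
        q = delta[(q, ch)]
    return q in finals


# DFA for strings over {a, b} with an even number of a's.
delta = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}
regex = automaton_to_regex(2, delta, 1, {1})
words = ["".join(t) for n in range(7) for t in product("ab", repeat=n)]
print(all(bool(re.fullmatch(regex, w)) == dfa_accepts(delta, 1, {1}, w) for w in words))
```

The final line compares the generated expression with the automaton on every string of length at most 6.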
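Proof: For each regular expression E we construct an NFA N(E) such that L(N(E)) = L(E),
following the cases of the definition of regular expressions.
1. N(∅):
This is an automaton
which will reject everything (it has got no final states) and hence L(N(∅)) = ∅ = L(∅).
2. N(ε):
This automaton accepts the empty word but rejects everything else, hence: L(N(ε)) = {ε} = L(ε).
3. N(x):
This automaton accepts only the word x, hence: L(N(x)) = {x} = L(x).
4. N(E + F):
We merge N(E) and N(F) into one automaton by taking the disjoint union of their states,
transitions, initial states and final states.
The disjoint union just signals that we are not going to identify states, even if they
accidentally happen to have the same name.
Just thinking of the game with markers you should be able to convince yourself that
L(N(E + F)) = L(N(E)) ∪ L(N(F)) = L(E + F).
5. N(EF):
We want to put the two automata N(E) and N(F) in series. We do this by connecting the
final states of N(E) with the initial states of N(F).
Each of the automata may have several initial states and several final states.
o The states of N(EF) are the disjoint union of the states of N(E) and N(F).
o The transitions of N(EF) are all the transitions of N(E) and N(F)
(as for N(E + F)) and, for each pair of a final state of N(E) and an initial state
of N(F), an arrow labelled x into that initial state from every state of N(E) which has an arrow
labelled x into that final state.
o The initial states of N(EF) are the initial states of N(E), and the initial states of
N(F) if some initial state of N(E) is also a final state.
o The final states of N(EF) are the final states of N(F).
We now set N(EF) to be the automaton given by these states, transitions, initial states and
final states, so that L(N(EF)) = L(E)L(F) = L(EF).
6. N(E*):
Given
N(E)
we construct N(E*).
o N(E*) inherits all transitions from N(E) and for each state which has an arrow to the
final state labelled x we also add an arrow to all the initial states labelled x.
o N(E*) has one new state, which is both an initial and a final state and has no arrows.
We define N(E*) to be the automaton given by these states and transitions.
We claim that L(N(E*)) = L(N(E))*
since we can run through the automaton an arbitrary number of times. The new state
allows us also to accept the empty sequence. Hence: L(N(E*)) = L(N(E))* = L(E)* = L(E*).
7. N((E)) = N(E).
I.e. using brackets does not change anything.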
Regular languages are languages which can be recognized by a computer with finite (i.e. fixed)
memory. Such a computer corresponds to a DFA. However, there are many languages which
cannot be recognized using only finite memory; a simple example is the language
L = { 0^n 1^n | n ≥ 0 },
i.e. the language of words which start with a number of 0s followed by the same number of 1s.
Why can L not be recognized by a computer with finite memory? Assume we have 32 Megabytes
of memory, that is we have 32 × 1024 × 1024 × 8 = 268435456 bits. Such a computer corresponds
to an enormous DFA with 2^268435456 states (imagine you have to draw the transition diagram).
However, the computer can only count until 2^268435456; if we feed it any more 0s in the beginning
it will get confused! Hence, you need a (potentially) infinite amount of memory to recognize L.
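The counting argument can be made concrete for any given DFA. The sketch below is an illustration under assumptions (the function names run and confusing_pair and the three-state example are invented here): among the words 0^0, 0^1, ..., 0^k fed to a DFA with k states, two must lead to the same state, and from then on the automaton cannot tell them apart.

```python
def run(delta, state, word):
    """State reached from `state` after reading `word`; delta must be total."""
    for ch in word:
        state = delta[(state, ch)]
    return state


def confusing_pair(delta, start, num_states):
    """Find i < j such that 0^i and 0^j lead the DFA to the same state."""
    seen = {}
    for i in range(num_states + 1):      # num_states + 1 words, only num_states states
        q = run(delta, start, "0" * i)
        if q in seen:
            return seen[q], i
        seen[q] = i
    return None                          # unreachable if num_states is correct


# A 3-state attempt at counting: 0 moves one state forward, 1 moves one state back.
delta = {(q, "0"): (q + 1) % 3 for q in range(3)}
delta.update({(q, "1"): (q - 1) % 3 for q in range(3)})

i, j = confusing_pair(delta, 0, 3)
print(i, j)                                                                # 0 3
print(run(delta, 0, "0" * i + "1" * i) == run(delta, 0, "0" * j + "1" * i))  # True
```

Since 0^i 1^i is in the language and 0^j 1^i is not, yet both end in the same state, this automaton must give a wrong answer on one of them; the same reasoning applies to every DFA, whatever its number of states.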
We shall now show a general theorem called the pumping lemma which allows us to prove that a
certain language is not regular.
QUESTIONS
PART –A
PART –B
1. Convert the following NFA to a DFA.
4. Construct an NFA-ε that accepts the language represented by the regular expression ((a +
b)a*)*