Grammars
11. Ambiguity of Context-free Grammars
Ambiguity in a language occurs either when a symbol or an expression has more than one meaning (e.g., the word "story"), or when an expression can be (grammatically) parsed in two different ways. The former is called lexical (or semantic) ambiguity, and the latter syntactic (or structural) ambiguity. For example, in natural language, the sentence "A man entered the room with a picture" can be interpreted (i.e., parsed) into two different grammatical structures as follows.
Ambiguity
The Reply:
Dear Son,
I kNOw that astroNOmy, ecoNOmics, and oceaNOgraphy are eNOugh to keep even an hoNOr student busy.
Do NOt forget that the pursuit of kNOwledge is a NOble task, and you can never study eNOugh.
Love,
Dad
- Adrea -
Parse Trees
Given a derivation (i.e., a sequence of rules applied) for a string w, the parse tree for w with respect to the derivation is constructed as follows. First put down the root node, labeled with the start symbol S. Then, for each leaf node of the current tree with a nonterminal label, say A, recursively expand the tree as follows:
Suppose that A → α is the rule applied next to derive w. For each symbol X appearing in α, in order from left to right, a child node with label X is added to A. This procedure repeats until the tree has no leaf node (labeled with a nonterminal symbol) left to expand. Reading all the leaf nodes left to right on the final tree gives the string w. This string of terminal symbols is called the yield of the parse tree.
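The construction described above can be turned into a short program. The sketch below (illustrative Python, not from the text; all names are my own) expands the leftmost nonterminal leaf with each rule of a derivation in turn and then reads off the yield, using the grammar S → bS | Sb | c that appears later in this chapter.

```python
def build_tree(nonterminals, derivation, start='S'):
    """Expand the leftmost nonterminal leaf with each rule (lhs, rhs)
    of the derivation in turn; a node is a pair [symbol, children]."""
    root = [start, []]

    def leftmost_nt_leaf(node):
        sym, kids = node
        if not kids:
            return node if sym in nonterminals else None
        for k in kids:
            found = leftmost_nt_leaf(k)
            if found:
                return found
        return None

    for lhs, rhs in derivation:
        node = leftmost_nt_leaf(root)
        assert node is not None and node[0] == lhs, "rule does not apply here"
        node[1] = [[s, []] for s in rhs]
    return root

def yield_of(node):
    """Concatenate the leaf symbols left to right."""
    sym, kids = node
    return ''.join(yield_of(k) for k in kids) if kids else sym

# Replay the derivation S => bS => bSb => bcb:
tree = build_tree({'S'}, [('S', 'bS'), ('S', 'Sb'), ('S', 'c')])
```

Here yield_of(tree) returns the string bcb, the yield of the parse tree built for this derivation.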
G: S → S∨S | S∧S | ¬S | A,  A → p | q | r
[Figure: two distinct parse trees (a) and (b) of G yielding the same string.]
G: S → S∨S | S∧S | ¬S | A,  A → p | q | r
[Figure: parse trees (a), (b), and (c); trees (b) and (c) both yield the string p∨q∨r, structured as p∨(q∨r) and (p∨q)∨r, respectively.]
Parse Trees and Ambiguity
As we saw in the previous example, the existence of two parse trees yielding the same string is a problematic property of a CFG, called ambiguity, that should be eliminated. In real applications, we cannot expect correct results from a program written in the language of an ambiguous grammar. Before we discuss how to eliminate ambiguity from a CFG, we need a formal definition of it.
Definition (Ambiguity): A CFG G is ambiguous if there is a string x ∈ L(G) for which there are two parse trees yielding x.
Unfortunately, it is an unsolvable problem to decide whether an arbitrary CFG is ambiguous or not. Also, there is no algorithm that, given an ambiguous CFG, converts it to an unambiguous grammar. However, for certain restricted constructs, it is possible to solve these problems. In this section we will present several such techniques with some examples.
Ambiguity

G1: S → S∨S | S∧S | ¬S | A,  A → p | q | r
[Figure: two distinct parse trees (a) and (b) of G1 yielding the same string.]
Eliminating Ambiguity
This ambiguity can be eliminated by parenthesizing the right side of those two
rules as shown in G2 below. The parentheses make the yields of the two parse trees
different as shown in figures (a) and (b).
G2: S → (S∨S) | (S∧S) | ¬S | A,  A → p | q | r
[Figure: parse trees (a) and (b) of G2, now with the distinct yields (a): (p∨(q∨r)) and (b): ((p∨q)∨r).]
Eliminating Ambiguity

Example 1. Consider the grammar G3: S → bS | Sb | c.
[Figure: two distinct parse trees of G3, both yielding the string bcb.]
Eliminating Ambiguity
We can simply modify the grammar G3 into G4, shown below, such that the left-side b's, if any, are always generated first. Figure (b) is then the only parse tree for the string bcb. Grammar G4 is unambiguous.
Ambiguous G3: S → bS | Sb | c
Unambiguous G4: S → bS | A,  A → Ab | c
[Figure: (a) a parse tree of the ambiguous G3 for bcb; (b) the unique parse tree of G4 for bcb.]
Eliminating Ambiguity
Example 2. Using the technique of fixing the order of derivation, the ambiguous
CFG G1 that we have examined can be converted to an unambiguous grammar G5
shown in figure (a). Notice that this grammar generates the operators left to right in
the order they appear in the string.
Ambiguous G1: S → S∨S | S∧S | ¬S | A,  A → p | q | r
Unambiguous G5: S → A∨S | A∧S | ¬S | A,  A → p | q | r
[Figure: (a) the grammars; (b) the unique parse tree of G5 for a string over p and q, with the operators generated left to right.]
Eliminating Ambiguity
The CFG G8 below is ambiguous because it can generate ε in two ways. Applying the technique for minimizing the number of ε-production rules, we can convert it to an unambiguous grammar G9.
[Figure: (a) the ambiguous grammar G8 and (b) the unambiguous grammar G9 obtained from it.]
Eliminating Ambiguity
Example. Assume that the logical operators ¬, ∧, and ∨ are right associative, with precedence given in that order (i.e., ¬ is at the top, followed by ∧, with ∨ at the bottom). The ambiguous CFG G1 (repeated below) can be modified into an unambiguous G10 by implementing the precedence and associativity in the production rules.
Notice that every OR (∨) operator in the string must be generated by S → D∨S before the others are generated. Then the AND (∧) operators, if any, are generated by D → C∧D, and finally the NOT (¬) operators. Also notice that from D there is no way to derive ∨, and from C neither ∨ nor ∧ can be derived.
Ambiguous G1: S → S∨S | S∧S | ¬S | A,  A → p | q | r
Unambiguous G10: S → D∨S | D,  D → C∧D | C,  C → ¬C | A,  A → p | q | r
[Figure: (a) the grammars; (b) the unique parse tree of G10 for a logical expression over p, q, and r.]
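A grammar shaped like G10 maps directly onto a recursive-descent parser, one function per nonterminal. The sketch below (illustrative Python; the nested-tuple output format is my own choice) shows how the right-recursive rules make ¬ bind tightest and ∨ loosest.

```python
def parse_G10(s):
    """Parse s by G10: S → D∨S | D,  D → C∧D | C,  C → ¬C | A,  A → p | q | r.
    Returns a nested tuple showing the structure assigned to s."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def S():                              # S → D∨S | D
        nonlocal pos
        left = D()
        if peek() == '∨':
            pos += 1
            return ('∨', left, S())       # right recursion: right associative
        return left

    def D():                              # D → C∧D | C
        nonlocal pos
        left = C()
        if peek() == '∧':
            pos += 1
            return ('∧', left, D())
        return left

    def C():                              # C → ¬C | A
        nonlocal pos
        if peek() == '¬':
            pos += 1
            return ('¬', C())
        return A()

    def A():                              # A → p | q | r
        nonlocal pos
        t = s[pos]
        assert t in 'pqr', "expected a variable"
        pos += 1
        return t

    tree = S()
    assert pos == len(s), "trailing input"
    return tree
```

For example, parse_G10('¬p∧q∨r') returns ('∨', ('∧', ('¬', 'p'), 'q'), 'r'): the ∨ ends up at the root, exactly the precedence the grammar encodes.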
Eliminating Ambiguity
This fixed order of rule applications helps the compiler generate object code that evaluates the string (i.e., a logical expression) according to the operator precedence and associativity.
For example, because the NOT (¬) operators are generated last, they appear nearest the leaves, below the nodes introduced for ∧ or ∨. Hence, the compiler, traversing the tree bottom-up, left to right, generates the instructions that execute the NOT (¬) operators before the instructions executing the other operators. Similarly, we can see how the compiler generates the instructions to execute the AND (∧) operators before the OR (∨) operators.
G10: S → D∨S | D,  D → C∧D | C,  C → ¬C | A,  A → p | q | r
[Figure: the parse tree of G10 for a logical expression, annotated with the order in which the compiler evaluates the operators.]
Eliminating Ambiguity
Now, for the associativity, notice that identical operators are generated left to right in the order they appear in the string. For example, all OR (∨) operators are derived by applying S → D∨S recursively. Thus, the compiler, traversing the parse tree bottom-up, left to right, will generate object code that evaluates the logical expression following the order of right associativity.
If we want to implement left associativity for an operator used in the language, we can simply reverse the right side of the rule which derives it. For example, we use S → S∨D instead to make the operator ∨ left associative.
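A left-recursive rule such as S → S∨D cannot be coded as naive recursion (the parser would call itself forever), but the standard remedy is to run the left recursion as a loop. A minimal sketch (illustrative Python, keeping only the ∨ operator for brevity):

```python
def parse_left(s):
    """Parse by S → S∨D | D,  D → p | q | r; the left recursion runs as a loop."""
    pos = 0

    def D():
        nonlocal pos
        t = s[pos]
        assert t in 'pqr', "expected a variable"
        pos += 1
        return t

    node = D()
    while pos < len(s) and s[pos] == '∨':
        pos += 1
        node = ('∨', node, D())   # earlier operators end up deeper: left associative
    return node
```

parse_left('p∨q∨r') returns ('∨', ('∨', 'p', 'q'), 'r'), the mirror image of the right-associative tree built by S → D∨S.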
Ambiguity
• There are some languages that can only be generated by ambiguous context-free grammars. Such languages are called inherently ambiguous. The following language L is an example.
L = {a^n b^n c^m | m, n ≥ 1} ∪ {a^n b^m c^m | m, n ≥ 1}
• It is an unsolvable problem to tell whether an arbitrary CFG is ambiguous or not.
• It is an unsolvable problem to convert an arbitrary ambiguous CFG, whose language is not inherently ambiguous, to an unambiguous CFG.
Exercises
11.1 Show that each of the following CFGs is ambiguous, and convert it to an unambiguous CFG.
(a) S → Sa | A | a,  A → a | bA
(b) S → AB | ε,  A → a | ε,  B → b | ε
G: S → S + S | S * S | T,  T → a | b
(b) In the CFG G above, let + and * be the addition and multiplication operators, respectively, and a and b be integer variables. Convert G to a CFG G' that satisfies the following conditions.
(i) L(G) = L(G')
(ii) Operator * has higher precedence than operator +.
(iii) Both operators are left associative.
You should also present a parse tree showing that your grammar meets the required order of operator precedence and associativity.
Exercises
The following problem is concerned with the ambiguity of the if-statement, which appears often in the text. Question (a)
is easy, but question (b) is challenging.
11.3 The syntax flow graph below defines a simplified <If-statement> of the Pascal programming language. Following
the convention, symbols in a circle or oval are terminals, and the words in rectangles correspond to nonterminal symbols.
(a) Transform the flow graph to a CFG G and show that G is ambiguous.
(b) Convert grammar G to an unambiguous CFG and explain in detail how you got your answer.
[Figure: syntax flow graph defining the simplified <If-statement>, built from terminal symbols in circles/ovals and the nonterminals <bool>, <statement>, and <IF-statement> in rectangles.]
12. Hierarchy of the Models: Proper Containment
In Chapter 7 we saw the Chomsky hierarchy among languages, automata, and other models and proved the horizontal
relations (i.e., the characterization) at the lowest level of the hierarchy. Figures (a) and (b), respectively, are the copies of the
relations among the classes of languages and automata presented in Chapter 7 (see next slide). Recall that if a language is a
member of a class of languages at a lower level, then it is also a member of an upper class. If a language is recognizable by an
automaton at a lower level, then it is also recognizable by an automaton at an upper level. But the reverse of these relations
does not hold. An upper level language class has a member that does not belong to a lower level class, and there is an
automaton belonging to an upper level whose language cannot be recognized by any automaton belonging to a lower level of
the hierarchy.
[Figure: (a) the language hierarchy Type-3L ⊆ Type-2L ⊆ Type-1L ⊆ Type-0L; (b) the automata hierarchy FA ⊆ PDA ⊆ LBA ⊆ TM.]
Hierarchy: Proper Containment
Review: The Chomsky Hierarchy
[Table: the Chomsky hierarchy, pairing each class of languages (grammars) with its machines and other models, e.g., context-sensitive languages (type 1) with linear-bounded automata (LBA), and context-free languages (type 2) with pushdown automata (PDA); arrows mark the containment and characterization relations.]
Proving the proper containment relations of the hierarchy requires the kind of in-depth logical arguments that we developed in Chapter 1. In this chapter we will prove the containment relations up to the class of context-sensitive languages. In Chapter 15 we will complete the proof of the hierarchy by proving the characterizations of context-free, context-sensitive, and phrase-structured languages, together with the proof of the last proper containment relation, between type 0 and type 1. The logic involved in these proofs is elegant and well worth the challenge.
12.1 The Containment Relations
Before proving the proper containment relations (⊃) between levels of the hierarchy, we first prove the containment relations (⊇) by the following theorem.
Theorem 12.1 Let Type-iL denote the type i (i = 0, 1, 2, 3) language class. Then for all i < 3, the following relation holds.
Type-iL ⊇ Type-(i+1)L
Proof. Let α → β be a rule of a grammar G = (V_T, V_N, P, S). We know that, by definition, type 3 (regular) grammar rules have the form of type 2 (context-free) grammar rules (i.e., |α| = 1) with the additional restriction that either β = xB or β = x, where x ∈ (V_T)* and B ∈ V_N. It follows that every type 3 grammar is a type 2 grammar, and hence all type 3 languages belong to the class of type 2 languages, i.e., Type-2L ⊇ Type-3L.
Containment Relations
Proof (continued). By definition, type 1 grammar (CSG) rules are type 0 grammar rules with the additional restriction that the left side of each rule cannot be longer than the right side. It follows that every type 1 (context-sensitive) grammar is also a type 0 (phrase-structured) grammar. That is,
Type-0L ⊇ Type-1L
Now, we prove that the type 1 language class contains the type 2 (context-free) language class. Recall that type 1 grammar rules are non-contracting. In other words, type 1 grammar rules are type 0 grammar rules with the restriction that for any rule α → β, the left side cannot be longer than the right side, i.e., |α| ≤ |β|. Type 2 grammar rules are type 0 grammar rules having one nonterminal symbol on their left side, i.e., |α| = 1.
Both type 1 and type 2 grammars are defined by adding a restriction on type 0 grammar rules. So, for the proof of the containment relation between the type 1 and type 2 language classes, we cannot apply the simple logic that we used for the relations between the other language classes.
Clearly, since every CFG rule has exactly one nonterminal symbol on its left side, as far as the left sides of grammar rules are concerned, CFG rules qualify as CSG rules. The only problem is the ε-production rules allowed in CFGs, which violate the requirement that the right side of a CSG rule cannot be shorter than the left side, except for S → ε. All the other CFG rules, which do not produce ε, are CSG rules. Fortunately, we have a way to overcome this problem. Recall Theorem 10.1, repeated below.
Theorem 10.1 Given a CFG G, we can construct a CFG G' that satisfies the following conditions.
• L(G) = L(G').
• If ε ∉ L(G), G' has no ε-production rule. Otherwise, S → ε is the only ε-production rule in G'.
Grammar G' is a CSG because every rule in G' is non-contracting except for the rule S → ε, if any. This implies that every CFL can be generated by a CSG. It follows that Type-1L ⊇ Type-2L.
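The rule forms used in this proof can be restated as a small classifier. The sketch below (illustrative Python; it deliberately ignores the S → ε special case) reports the most restrictive Chomsky type whose rule conditions every rule of a grammar meets.

```python
def chomsky_type(rules, nonterminals):
    """rules is a list of (lhs, rhs) pairs, each side a tuple of symbols.
    Returns 3, 2, 1, or 0: the most restrictive type whose conditions
    every rule satisfies (the S → ε special case is ignored here)."""
    N = set(nonterminals)

    def type3(l, r):      # lhs one nonterminal; rhs = x or xB, x terminal-only
        if len(l) != 1 or l[0] not in N:
            return False
        x = r[:-1] if (r and r[-1] in N) else r
        return all(s not in N for s in x)

    def type2(l, r):      # lhs a single nonterminal (context-free)
        return len(l) == 1 and l[0] in N

    def type1(l, r):      # non-contracting (context-sensitive)
        return 0 < len(l) <= len(r)

    for pred, t in ((type3, 3), (type2, 2), (type1, 1)):
        if all(pred(l, r) for l, r in rules):
            return t
    return 0
```

For example, [(('S',), ('a', 'S')), (('S',), ('b',))] classifies as type 3, while replacing the first rule by (('S',), ('a', 'S', 'b')) drops the grammar to type 2, illustrating that each restriction is a special case of the one below it.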
12.2 The Pumping Lemma
To prove the relation Type-2L ⊃ Type-3L, it is enough to find a context-free language that is not regular. Our approach is to first find a common property of all regular languages, and then exhibit a context-free language that does not satisfy this property. Since every finite language is regular (we leave the proof of this claim to the reader), the property we are looking for must be found among the infinite regular languages. The following lemma presents such a property.
Pumping Lemma: For every infinite regular language L, there is a constant integer n that satisfies the following conditions:
Let Σ be the alphabet of the language. Then for every string z ∈ L whose length is greater than or equal to n, there are u, v, w ∈ Σ* such that
• z = uvw, |uv| ≤ n, |v| ≥ 1, and
• for all i ≥ 0, uv^i w ∈ L
Pumping Lemma
Proof: For the proof we need two terms: a transition path is a directed path in a state transition graph, and a path label is the string obtained by collecting the input symbols along a transition path. For example, in the state transition graph shown below, aabba is the path label of the transition path 1 → 2 → 4 → 3 → 2 → 4. Given the string aabba as input, the DFA takes the sequence of transitions along this path.
To help the reader follow the argument, we will use the following state transition graph throughout the proof, together with other figures when needed.
[Figure: a DFA over {a, b} with states 1–4 and start state 1; the transition path 1 → 2 → 4 → 3 → 2 → 4 has path label aabba.]
Given an infinite regular language L, let M be a DFA that recognizes L, and let n be the number of states of this automaton. For an input string z ∈ L whose length is greater than or equal to n (i.e., |z| ≥ n), examine the state transition path with path label z.
Notice that because M is deterministic, there is only one transition path with path label z. Since L is infinite and the number of states n is finite, there must be a string z in L such that |z| ≥ n; otherwise, L could not be infinite.
Let m be the length of z (i.e., |z| = m ≥ n) and let z = a_1 a_2 … a_m.
Since z ∈ L, the transition path with label z must end in an accepting state. Because |z| ≥ n, the path involves a cycle, and hence there must be a state, say q, visited a second time on reading some j-th symbol a_j (j ≤ n) of z, as illustrated in figure (b) below. (This can be proved by the pigeonhole principle; refer to the application example of the pigeonhole principle in Chapter 1.)
[Figure: (a) the example transition path with path label aabba; (b) the transition path with path label z = a_1 a_2 … a_m, m ≥ n, on which some state q is visited twice.]
As shown in figure (b), let v be the path label of the cyclic transition path, and let u and w be, respectively, the prefix and the suffix of z cut off by v. (If q is the start state, then u = ε, and if q is the last state on the path, w = ε.) Since the cycle involves at least one edge, |v| ≥ 1. Since j ≤ n, we have |uv| ≤ n.
[Figure: the transition path for z partitioned into u (the path into q), v (the cycle at q), and w (the path from q to the accepting state).]
Since uvw is in L, M accepts uvw. It follows that all the following strings, corresponding to the path labels of the transition paths with the cycle repeated zero or more times, are also accepted by M.
uw = uv^0 w,  uvvw = uv^2 w,  uvvvw = uv^3 w, . . .
That is, for all i ≥ 0, uv^i w ∈ L. (Refer to figure (a) for an example.)
z = a_1 a_2 … a_i a_(i+1) … a_j a_(j+1) … a_m, partitioned as u = a_1 … a_i, v = a_(i+1) … a_j, w = a_(j+1) … a_m.
In the example DFA: u = a, v = abb, w = a, with |z| = m ≥ n, |uv| ≤ n, and |v| ≥ 1; for all i ≥ 0, a(abb)^i a ∈ L, i.e., uv^i w ∈ L.
[Figure: (a) the example transition path with label aabba; (b) the general transition path for z with the cycle labeled v.]
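The constructive heart of this proof, i.e., simulate the DFA on z, stop at the first repeated state, and cut z into u, v, w at the cycle, can be sketched as follows. The two-state DFA for (ab)* used here is a hypothetical example, not the four-state automaton of the figure.

```python
def find_pump(delta, start, z):
    """Simulate the DFA on z and return (u, v, w), where v labels the
    first cycle found on the transition path of z."""
    q, seen = start, {start: 0}
    for i, ch in enumerate(z):
        q = delta[(q, ch)]
        if q in seen:                    # state q repeated: the cycle read v
            j = seen[q]
            return z[:j], z[j:i + 1], z[i + 1:]
        seen[q] = i + 1
    raise ValueError("no state repeated; need |z| >= number of states")

def accepts(delta, start, accept, s):
    q = start
    for ch in s:
        if (q, ch) not in delta:
            return False
        q = delta[(q, ch)]
    return q in accept

# Hypothetical two-state DFA for (ab)*; q0 is the start and only accepting state.
delta = {('q0', 'a'): 'q1', ('q1', 'b'): 'q0'}
u, v, w = find_pump(delta, 'q0', 'ababab')   # u = '', v = 'ab', w = 'abab'
```

Every u v^i w with i ≥ 0 is then accepted as well, e.g. accepts(delta, 'q0', {'q0'}, u + v * 5 + w) is True, which is exactly the claim of the lemma.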
12.3 Application of the Pumping Lemma
Now, as an application of the pumping lemma, we will show that there is a CFL which is not regular. Specifically, using the lemma, we will prove that the language {a^i b^i | i ≥ 1} does not satisfy the pumping lemma, thereby showing Type-2L ⊃ Type-3L, the proper containment relation between the classes of type 2 and type 3 in the Chomsky hierarchy.
For convenience of application, the lemma is rewritten below to expose its logical structure. (Notice the quantified parts highlighted.)
Application of the Pumping Lemma
Before using the pumping lemma for the proof, recall the techniques (in Section 1.3) for dealing with existential quantification (∃) and universal quantification (∀) in a statement to be proved. In this example, we are going to prove that for the given context-free language, the pumping lemma does not hold. Hence, for the existentially quantified parts of the lemma, i.e., the constant n and the strings u, v, and w, we should consider all possible cases. For the universally quantified parts, i.e., the regular language L, the string z ∈ L, and the integer i ≥ 0, it is enough to consider a single case, which can be chosen for the convenience of our proof.
The argument of the proof will run as a game between the proponent of the lemma and us, who are going to show that the lemma does not hold for the given language.
Now, we show that our familiar CFL L = {a^i b^i | i ≥ 1} does not satisfy the pumping lemma, and hence is not regular. Suppose that L is a regular language. (By this sentence the reader will promptly recognize that we are going to use the proof-by-contradiction technique.) We will carefully analyze the pumping lemma to find a part that contradicts our supposition. (The numbers in the following argument match the ones that we put in the lemma.)
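The case analysis can also be checked exhaustively for small n. The sketch below (illustrative Python) tries every decomposition z = uvw of z = a^n b^n with |uv| ≤ n and |v| ≥ 1 and confirms that each one fails the lemma, since v then consists of a's only and pumping unbalances the string.

```python
def in_L(s):
    """Membership in L = {a^i b^i | i >= 1}."""
    n = len(s) // 2
    return n >= 1 and s == 'a' * n + 'b' * n

def refutes_pumping(n):
    """True iff every split z = uvw with |uv| <= n, |v| >= 1 fails the lemma."""
    z = 'a' * n + 'b' * n
    for j in range(n):                   # u = z[:j]
        for k in range(j + 1, n + 1):    # v = z[j:k], so |uv| <= n and |v| >= 1
            u, v, w = z[:j], z[j:k], z[k:]
            if all(in_L(u + v * i + w) for i in (0, 2)):
                return False             # a decomposition survived pumping
    return True
```

refutes_pumping(n) comes out True for every n tried, matching the conclusion that no constant n can make the lemma hold for L, so L is not regular.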
The other day I had the opportunity to drop by my department head's office. He's a friendly guy, and on the rare occasions that I have to pay him a visit, we have enjoyable conversations. While I was in his office yesterday I asked him, "Sir, what is the secret of your success?"
He said, "Two words."
"And, Sir, what are they?"
"Right decisions."
"But how do you make right decisions?"
"One word," he responded.
"And, Sir, what is that?"
"Experience."
"And how do you get experience?"
"Two words."
"And, Sir, what are they?"
"Wrong decisions."
- Rubin -
12.4 The Pumping Lemma for Context-free Languages
Now, we prove the proper containment relation between the next two levels of the language classes, i.e., Type-1L ⊃ Type-2L. We use the same approach that we took for proving the relation Type-2L ⊃ Type-3L. With no proof, we first present a common property of infinite context-free languages, called the pumping lemma for context-free languages (see below). Then, as an application of the lemma, we will show that the context-sensitive language {a^i b^i c^i | i ≥ 1}, which we introduced in Section 2.3, does not satisfy the lemma. We shall present the proof of the lemma after showing the application.
The pumping lemma for context-free languages: For every infinite CFL L, there is a constant integer p that satisfies the following conditions:
Let Σ be the alphabet of the language. Then for every string z ∈ L whose length is greater than or equal to p, there are u, v, w, x, y ∈ Σ* such that
• z = uvwxy, |vwx| ≤ p, |vx| ≥ 1, and
• for all i ≥ 0, uv^i wx^i y ∈ L
Pumping Lemma for CFLs
The pumping lemma for CFLs looks very similar to the one for regular languages. The figure below shows the difference. Both lemmas claim the existence of a constant (i.e., n for regular languages and p for CFLs). The lemma for regular languages divides the string z into three segments (i.e., z = uvw), while the lemma for CFLs divides it into five segments (i.e., z = uvwxy). The lemma for regular languages provides one site, named v, to pump (see the up arrow), while the lemma for CFLs provides two sites, v and x, to pump simultaneously.
[Figure: z = uvwxy with pump sites v and x and window |vwx| ≤ p (the CFL lemma), versus z = uvw with pump site v and |uv| ≤ n (the regular-language lemma).]
Recall that when we applied the pumping lemma for regular languages, we took advantage of the fact that the range of the pump site v is restricted to the prefix of length n of the string z, by the condition |uv| ≤ n.
Unfortunately, in the pumping lemma for CFLs, the segment vwx, which contains the two pump sites, can be flanked by u to the left and by y to the right. So the pump sites can be located anywhere in z within a window of width |vwx| ≤ p. When we apply the lemma to prove that a language is not context-free, we must consider all possible locations of the window vwx, from the left end of z up to the right end.
Application of the CFL Pumping Lemma
Now, we show that the CSL L = {a^i b^i c^i | i ≥ 1}, which we introduced in Section 2.3, does not satisfy the pumping lemma for CFLs. Our argument is similar to the one used to prove that the language {a^i b^i | i ≥ 1} is not regular.
Suppose L is a CFL. Then it should satisfy the lemma. With L we walk through the lemma and come out with a contradiction.
(1) For every infinite CFL L, ...
(2) there exists a constant p such that ...
(3) . . . .
(4) . . . .
Commentary: L is an infinite CFL. Hence, there should be a constant p such that conditions (3) and (4) are satisfied. Let p be the constant. Treating p as a variable, we consider all possible constant values of p in our argument.
(3) For all strings z ∈ L such that |z| ≥ p,
(4) there exist strings u, v, w, x, y that satisfy the following:
(i) z = uvwxy,
(ii) |vwx| ≤ p,
(iii) |vx| ≥ 1,
(iv) for all i ≥ 0, uv^i wx^i y ∈ L.
Commentary: We choose the string z = a^p b^p c^p in L and examine whether, for all strings u, v, w, x, y satisfying conditions (i)–(iii), condition (iv) is satisfied. By (i), let z = a^p b^p c^p = uvwxy, and examine what kinds of symbols can appear in the two pump sites v and x. Recall that vwx, whose length is at most p, can be anywhere in z.
[Figure: z = a^p b^p c^p with the window vwx sliding over z.]
As we can see in the figure below, by condition (ii), |vwx| ≤ p, the pump sites v and x together can contain no more than two different symbols (a's and b's, or b's and c's). By condition (iii), |vx| ≥ 1, the strings v and x together contain at least one symbol. It follows that in uv^2 wx^2 y the numbers of a's, b's, and c's cannot all be the same, which implies that uv^2 wx^2 y ∉ L. The language does not satisfy condition (iv) of the lemma when i = 2. Hence, L is not context-free.
(i) z = uvwxy, (ii) |vwx| ≤ p, (iii) |vx| ≥ 1, (iv) for all i ≥ 0, uv^i wx^i y ∈ L.
[Figure: z = a^p b^p c^p with the window |vwx| ≤ p placed over z.]
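The same exhaustive check used for the regular-language lemma works for the five-way split here. The sketch below (illustrative Python) enumerates every decomposition z = uvwxy of z = a^p b^p c^p with |vwx| ≤ p and |vx| ≥ 1 and confirms that each one fails at i = 0 or i = 2.

```python
def in_L(s):
    """Membership in L = {a^i b^i c^i | i >= 1}."""
    n = len(s) // 3
    return n >= 1 and s == 'a' * n + 'b' * n + 'c' * n

def refutes_cfl_pumping(p):
    """True iff every z = uvwxy with |vwx| <= p and |vx| >= 1 fails the lemma."""
    z = 'a' * p + 'b' * p + 'c' * p
    m = len(z)
    for i0 in range(m + 1):                               # v = z[i0:i1]
        for i1 in range(i0, m + 1):
            for i2 in range(i1, m + 1):                   # w = z[i1:i2]
                for i3 in range(i2, min(m, i0 + p) + 1):  # x = z[i2:i3], |vwx| <= p
                    if (i1 - i0) + (i3 - i2) < 1:         # enforce |vx| >= 1
                        continue
                    u, v, w = z[:i0], z[i0:i1], z[i1:i2]
                    x, y = z[i2:i3], z[i3:]
                    if all(in_L(u + v * i + w + x * i + y) for i in (0, 2)):
                        return False                      # a decomposition survived
    return True
```

refutes_cfl_pumping(p) comes out True for the small values of p tried, matching the argument that a window of width at most p can never cover all three symbol regions.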
For the language {a^i b^i c^i | i ≥ 1}, it was fairly straightforward to apply the lemma and show that the language is not context-free. However, this lemma is weak enough that there are many non-context-free languages to which it is hard or impossible to apply. For example, consider the following language L, which is not context-free. Whichever string z (of length greater than or equal to p) we pick from this language, it is impossible to pump it and arrive at a contradiction.
12.6 Ogden’s Lemma
Ogden's lemma, shown below, is very similar to the original lemma except for the additional (highlighted) parts restricting the locations of the pump sites. We omit the proof of this lemma. Interested readers are referred to the book "Introduction to Automata Theory, Languages, and Computation" by Hopcroft, Motwani, and Ullman (Addison-Wesley, 2001).
Ogden’s Lemma
Now, we can apply Ogden's lemma to show that the following language, which we failed to handle with the original lemma, is not a CFL.
Suppose L is a CFL. Then it should satisfy Ogden's lemma. Let p be the constant. Pick the string z = 1^p 0^p a^p 1^p 0^p, let z = uvwxy, and mark the prefix 1^p. According to the lemma, |vwx| ≤ p, and the string vx should contain at least one marked symbol. It follows that the string vx can never include the symbol a in the middle. We see that the string z' = uv^2 wx^2 y ∉ L, because in z' the binary string on the left side of a must differ from the string on the right side of a. We have a contradiction. It follows that L is not context-free.
With the two application examples of the pumping lemmas for regular languages and context-free languages, we have proven the proper containment relations between the lower two levels of the Chomsky hierarchy (reviewed below). We will complete the remaining proofs of the other relations in Chapter 15.
Review: The Chomsky Hierarchy
[Table: the language classes paired with their machines and other models, e.g., context-sensitive languages (type 1) with linear-bounded automata (LBA) and context-free languages (type 2) with pushdown automata (PDA); colored arrows indicate the containment and characterization relations proven so far.]
G: S → AB,  A → DE,  B → FD,  E → FC,  C → DB | c,  F → HC,  D → d,  H → h
Let's examine a parse tree of this grammar yielding the following string:
d(hd)^3 hc(d)^3 dhcdhcd
Proof of the CFL Pumping Lemma
Notice that on the longest trunk there appears a repeating subsequence of nonterminals,
S-A-E-F-C-B-F-C-B-F-C-B-F-C
(see the highlighted part). Also notice that in the yield of this tree,
d(hd)^3 hc(d)^3 dhcdhcd,
the highlighted substrings produced by the repeating nonterminals also repeat.
Since every nonterminal has two children, it can produce "fruits" (i.e., a terminal string) either on its left-side branches or on its right-side branches. In the example, F, C, and B, respectively, produce h, d, and d.
[Figure: the parse tree of G for d(hd)^3 hc(d)^3 dhcdhcd, with the repeating trunk segment F-C-B highlighted.]
On the longest trunk, F-C-B repeats. This repeating segment can be inserted (or deleted), proportionally increasing (or decreasing) the "fruits" on both sides of the trunk. Notice that each repeating segment F-C-B bears hd to the left and d to the right.
For the CFG G, it follows that for all i ≥ 0, the string z = d(hd)^i hc(d)^i dhcdhcd is in the language L(G).
z = d (hd)^3 hc (d)^3 dhcdhcd
      u    v   w    x     y
[Figure: the parse tree with the repeating trunk segment F-C-B and the fruits it bears on each side.]
We can make the trunk of the parse tree grow by inserting more repeating subsequences of F-C-B, or cut it short by deleting them. Since the number of repeats can be arbitrary, there can be an infinite number of parse trees. Every repeating trunk segment has a repeating pattern of branches to its left or right, and every branch bears "fruit" (i.e., a terminal symbol). (Recall that there is no ε-production rule except S → ε.) Hence, the parse tree can yield a repeating substring either to the left (hd in the example) or to the right (d in the example) of the trunk. For the CFG G, we can claim that for all i ≥ 0, the string d(hd)^i hc(d)^i dhcdhcd is the yield of a parse tree of G, i.e., d(hd)^i hc(d)^i dhcdhcd ∈ L(G).
[Figure: a parse tree in which a nonterminal A appears above another A on the same trunk; the inner A yields w, the outer A yields vwx, and u and y flank the subtree.]
We see that the string uvwxy ∈ L(G), and for all i ≥ 0, uv^i wx^i y ∈ L(G), because if A ⇒* vAx is a legal pattern on a trunk, then for all i ≥ 0, A ⇒* v^i Ax^i is also a legal pattern.
Let V_N be the nonterminal alphabet of the grammar. The length of a string z for which every parse tree has a trunk with a repeating nonterminal is bounded in terms of |V_N|, i.e., for some constant p = c^|V_N|, every z with |z| ≥ p has such a parse tree. It follows that |vwx| ≤ p and |vx| ≥ 1.
[Figure: pumping the parse tree: replacing the inner A-subtree (which yields w) by a copy of the outer A-subtree (which yields vwx) gives uv^2 wx^2 y, and deleting the outer copy gives uwy.]
Proof of the CFL Pumping Lemma Hierarchy: Proper Containment
The pumping lemma for context-free languages. For every infinite CFL L, there
is a constant integer p that satisfies the following conditions:
Let be the alphabet of the language. Then for every string z L whose length is
greater than or equal to p, there are u, v, w * such that
• z = uvwxy, |vwx| p, |vx| 1
• for all i 0, uviwxiy L
345
Rumination (1): Application of the Pumping Lemma
• Here is an application example of the pumping lemma with a logical bug that we often come across while grading homework assignments.
We can show that L is regular by exhibiting an FA which recognizes L (see the figure below). What is wrong in the above argument? It is in the treatment of the existential quantification over the strings u, v, and w in part (4) of the lemma. The argument took u, v, and w only for the specific values u = a, v = b, and w = ca^n abc, not considering the other cases satisfying conditions (i)–(iii). We should show that uv^i w ∉ L for all possible locations of the "pump site" v in the prefix of length n of the string z and for all possible lengths of v (i.e., 1 ≤ |v| ≤ n). Actually, for the case where v contains some a's in the substring a^n, we have that for all i ≥ 0, uv^i w ∈ L. The proof fails.
• The pumping lemma presents a common property of all infinite regular languages. Recall the last claim of the lemma, i.e., "for all i ≥ 0, uv^i w ∈ L," which is based on the existence of a cyclic path in the state transition graph of every DFA that recognizes an infinite regular language. Notice that every finite language is regular. Let L = {w_1, w_2, . . . , w_k} be a finite language. Then we can simply construct a regular grammar with the rules S → w_1 | w_2 | . . . | w_k. Therefore, every non-regular language is infinite.
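The machine-side view of the same claim: a finite set of words is accepted by a DFA shaped like a trie, with missing edges leading to an implicit dead state. A sketch (illustrative Python; the word set is arbitrary):

```python
from itertools import count

def trie_dfa(words):
    """Build a partial DFA shaped like a trie; the states reached at the
    end of each word are the accepting states. State 0 is the start."""
    delta, accept, fresh = {}, set(), count(1)
    for w in words:
        q = 0
        for ch in w:
            if (q, ch) not in delta:
                delta[(q, ch)] = next(fresh)
            q = delta[(q, ch)]
        accept.add(q)
    return delta, 0, accept

def run(dfa, s):
    delta, q, accept = dfa
    for ch in s:
        if (q, ch) not in delta:      # missing edge = implicit dead state
            return False
        q = delta[(q, ch)]
    return q in accept

dfa = trie_dfa(['ab', 'abc', 'ba'])   # an arbitrary finite language
```

run(dfa, 'abc') is True while run(dfa, 'a') is False; since the trie has finitely many states and no cycles, the cycle-based pumping argument never arises for a finite language.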
• We know that every infinite regular language satisfies the pumping lemma. Is the converse of this statement true? In other words, can we say that if an infinite language satisfies the pumping lemma, then the language is regular? We didn't prove that every infinite language satisfying the lemma is regular; we proved the lemma only for infinite regular languages. Actually, Appendix E shows a non-regular CFL which satisfies the lemma. Thus, it is wrong to prove the regularity of a language by showing that it satisfies the pumping lemma.
348
Hierarchy: Proper Containment
....
DO 10 I = 1, 100, 1
SUM = SUM + I
10 PRO = PRO * I
....
The formal language L below has a context-dependency similar to that of programming languages. String x corresponds to the label
(i.e., a string of decimal digits) and a^i corresponds to the loop body. Using Ogden's lemma we showed that L is not a context-free
language.
349
Hierarchy: Proper Containment
Exercises
12.1 Using the pumping lemma for regular languages, we are going to prove that L = {xx^R | x ∈ {a, b, c}* } is not regular as
follows. Complete the proof by answering each of the following questions.
Proof. Suppose that L is regular, and let n be the constant of the lemma. All the following strings are in L.
(a) abccba   (b) a^100 b^100 b^100 a^100   (c) a^(n/2) b^(n/2) b^(n/2) a^(n/2)   (d) a^n b^n b^n a^n
Question 1: Which of the strings above are you going to pick as the string z for your proof? Why are the other strings not your
choice? If you don't like any, choose another string in L and explain why.
Question 2: Consider strings u, v, w ∈ {a, b, c}* that satisfy the conditions (i) z = uvw, (ii) |uv| ≤ n, and (iii) |v| ≥ 1. What
will be in v? Briefly explain all possible contents of string v.
Question 3: To show that L is not regular, how are you going to use the last part of the lemma: "For all i ≥ 0, string uv^i w
∈ L"? Write your argument.
12.2 Which of the following languages are regular? Justify your answer.
L1 = {a^i b^j c^k | i, j, k > 0 }   L2 = {aaaa^i b^j c^i | i, j > 0 }
12.3 For the symbols a, b, and c, and a string x, let #a(x), #b(x), and #c(x), respectively, be the number of a's, b's, and c's in
string x. Which of the following languages are context-free? Justify your answer.
L3 = { x | x ∈ {a,b,c}+, and #a(x) > #b(x) }
L4 = { x | x ∈ {a,b,c}+, and #a(x) = #b(x) = #c(x) }
L5 = { x | x ∈ {a,b,c}+, and #a(x) < #b(x) < #c(x) }
12.4 We learned that the language { w#w^R | w ∈ {a, b}* } is context-free. The following languages are not context-free. Prove
why.
L6 = {w#w | w ∈ {a, b}* }   L7 = {xyx | x, y ∈ {a, b}* }
350
Practical Application:
Parsing
351
13. Parsing
Parsing is one of the major functions of the compiler of a programming language. Given a source code w, the parser examines w
to see whether it can be derived by the grammar of the programming language, and, if it can be, the parser constructs a parse tree
yielding w. Based on this parse tree, the compiler generates an object code. So, the parser acts as a membership test algorithm
designed for a given grammar G that, given a string w, tells us whether w is in L(G) or not, and, if it is, outputs a parse tree.
Notice that the parser tests the membership based on the given grammar. Recall that when we practiced constructing a PDA for a
given language, say {a^i b^i | i > 0 }, we used the structural information of the language, such as: the a's come first, then the b's, and the
numbers of a's and b's are the same. Consider the two CFG's G1 and G2 shown below in figure (a), which generate the same language
{a^i b^i | i > 0 }. Figure (b) shows a PDA that recognizes this language. For an input string w, this PDA does not give any information
about the grammar and how the string w is derived. Hence, we need a different approach to construct a parser based on the
grammar, not the language.
There are several algorithms available for parsing that, given an arbitrary CFG G and a string x, tell whether x ∈ L(G) or not, and,
if it is, output how x is derived. (The CYK algorithm, shown in Appendix F, is a typical example.) However, these algorithms
are too slow to be practical. (For example, the CYK algorithm takes O(n^3) time for an input string of length n.) Thus, we restrict CFG's
to a subclass for which we can build a fast practical parser. This chapter presents two parsing strategies applicable to such restricted
grammars, together with several design examples. Finally, the chapter briefly introduces Lex (the lexical analyzer generator) and
YACC (the parser generator).
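As a reference point, the CYK algorithm mentioned above can be sketched compactly. The encoding of the grammar (CNF rules as (head, body) pairs) and the CNF grammar used below for {a^i b^i | i > 0} are our own illustrative choices, not taken from Appendix F.

```python
# A minimal CYK membership test for a grammar in Chomsky normal form.
# Rules are (head, body) pairs: body is a terminal string of length 1,
# or a pair of nonterminals.
def cyk(rules, start, w):
    n = len(w)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving the substring w[i : i+l+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = {head for head, body in rules if body == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for head, body in rules:
                    if (not isinstance(body, str) and len(body) == 2
                            and body[0] in table[i][split - 1]
                            and body[1] in table[i + split][length - split - 1]):
                        table[i][length - 1].add(head)
    return start in table[0][n - 1]

# An assumed CNF grammar for {a^i b^i | i > 0}:
# S -> AT | AB, T -> SB, A -> a, B -> b
cnf = [("S", ("A", "T")), ("S", ("A", "B")), ("T", ("S", "B")),
       ("A", "a"), ("B", "b")]
assert cyk(cnf, "S", "aabb") and not cyk(cnf, "S", "aab")
```

The three nested loops over length, start position, and split point are what give the O(n^3) running time quoted above.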
(Figure (b): the PDA for {a^i b^i | i > 0}, with transitions such as (a, a/aa) and (b, a/ε).)
353
Parsing
13.1 Derivation
The parser of a grammar generates a parse tree for a given input string. For
convenience, the tree is commonly presented in a sequence of rules applied in
one of the following two ways to derive the input string starting with S.
Example: G: S → ABC, A → aa, B → a, C → cC | c
Leftmost derivation: S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc
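A leftmost derivation is mechanical once the rule sequence is fixed: always rewrite the leftmost nonterminal. A small sketch (our own illustration of the example above):

```python
# Replay the leftmost derivation of "aaacc" in the example grammar
# S -> ABC, A -> aa, B -> a, C -> cC | c.
def apply_leftmost(form, nonterminal, rhs):
    # Rewrite the leftmost occurrence of that nonterminal (which in this
    # derivation is also the leftmost nonterminal overall).
    i = form.index(nonterminal)
    return form[:i] + rhs + form[i + 1:]

steps = [("S", "ABC"), ("A", "aa"), ("B", "a"), ("C", "cC"), ("C", "c")]
form, history = "S", ["S"]
for nt, rhs in steps:
    form = apply_leftmost(form, nt, rhs)
    history.append(form)

assert history == ["S", "ABC", "aaBC", "aaaC", "aaacC", "aaacc"]
```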
354
Derivation Parsing
355
Derivation Parsing
G: S → ABC, A → aa, B → bD, C → cC | c, D → bd

Rightmost derivation (steps numbered 1-6):
S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ AbDcc ⇒ Abbdcc ⇒ aabbdcc

(Figure: the parse tree for aabbdcc, with each nonterminal node labeled by the step number in which it is expanded; reading the nodes in the order 6, 5, . . . , 1 recovers the rightmost derivation in reverse.)
356
Parsing
13.2 LL(k) parsing strategy
We know that parsers are different from PDA's, because their membership test
should be based on the given CFG. Let's try to build a conventional DPDA which,
with the grammar G stored in the finite control, tests whether the input string x is in
L(G), and, if it is, outputs the sequence of rules applied to derive x. We equip the
finite control with an output port for the output (see figure (b) below).
Our first strategy is to derive the same input string x in the stack. Because any
string must be derived starting with the start symbol, we let the machine push S onto
the stack and enter state q1 for the next move. For convenience, we assign a rule
number to each rule as shown in figure (a).
(Figures: (a) the grammar with numbered rules; (b) the DPDA with the grammar G in its finite control and an output port.)
357
LL(k) Parsing Parsing
Now, we ask which rule, either (1) or (2), the machine should apply to S to
eventually derive the string on the input tape. If the input string is derived using rule
(1) (rule (2)) first, then there should be the symbol b (respectively, the symbol c) after
the 10-th a. Unfortunately, our conventional DPDA model cannot look ahead on the
input before reading it. Recall that conventional DPDA's decide whether they will
read the input or not depending on the stack top symbol. Only after reading the
input does the machine know what it is. Thus, without reading up to the 11-th input
symbol, there is no way for the machine in the figure to identify the symbol at that
position.
To overcome this problem, we equip the finite state control with a "telescope"
with which the machine can look some finite k cells ahead on the input tape. For
the grammar G, it is enough to have a telescope with a range of 11 cells.
(Notice that in the look-ahead range we also include the cell under the head.)
With this new capability, the machine scans the input string ahead in the range,
and, based on what it sees ahead, it takes the next move. While looking ahead, the
input head does not move.
G: (1) S → AB, (2) S → AC, (3) A → aaaaaaaaaa,
   (4) B → bB, (5) B → b, (6) C → cC, (7) C → c

L(G) = {a^10 x | x = b^i or x = c^i, i ≥ 1 }

(Figure: (a) the grammar G with its numbered rules; (b) the machine in state q1, with the input aaaaaaaaaabbb on the tape and SZ0 in the stack, about to choose between rules (1) and (2).)
359
LL(k) Parsing Parsing
Now, the parser, looking ahead 11 cells, sees aaaaaaaaaab. Since there is a b at
the end, the machine chooses rule (1) (i.e., S → AB), rewrites the stack top S with
AB, and outputs rule number (1) as shown in figure (a).
Let q, x, and α be, respectively, the current state, the remaining input portion to
read, and the current stack contents. From now on, for convenience we shall use
the triple (q, x, α), called the configuration, instead of drawing the cumbersome
diagram to show the parser.
G: (1) S → AB, (2) S → AC, (3) A → aaaaaaaaaa, (4) B → bB, (5) B → b, (6) C → cC, (7) C → c

(Figure: (a) the parser applying rule S → AB by rewriting the stack top S with AB, with the input aaaaaaaaaabbb still unread; (b) the configuration written as the triple (q, x, α).)
360
LL(k) Parsing Parsing
G: S → AB | AC, A → aaaaaaaaaa, B → bB | b, C → cC | c
Looking ahead 11 cells in the current configuration (q1, aaaaaaaaaabbb, SZ0), the
parser applies rule (1) by rewriting the stack top S with the rule's right side AB.
Consequently, the configuration changes as follows. (We write the number of the rule
applied in parentheses after the ⊢ sign.)

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0)   (look-ahead 11 cells)

Now, with the nonterminal symbol A at the stack top, the parser must find a rule to
apply. Since A has only one rule, i.e., rule (3), there is no choice. So, the parser
applies rule (3), consequently changing the configuration as follows.

(q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0)
361
LL(k) Parsing Parsing
G: S → AB | AC, A → aaaaaaaaaa, B → bB | b, C → cC | c
Notice that the terminal symbols appearing at the stack top after applying rule
(3) correspond to the leftmost terminal symbols appearing in the leftmost
derivation. Thus, the terminal symbol appearing at the stack top must match the
next input symbol, if the input string is generated by the grammar.
So, the parser, seeing a terminal symbol at the stack top, reads the input and, if
they match, pops the stack top. The following sequence of configurations shows
how the parser successfully pops all the terminal symbols pushed onto the stack
by applying rule (3).

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0)
⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ . . . ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0)
362
LL(k) Parsing Parsing
G: S → AB | AC, A → aaaaaaaaaa, B → bB | b, C → cC | c

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) . . .

Now, the parser must choose one of B's rules, either (4) or (5). If there remains
only one b on the input tape, rule (5) is the choice. Otherwise (i.e., if there is
more than one b), rule (4) must be applied. It follows that the parser needs to look
two cells ahead and proceeds as follows.

(q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0)
⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)   (look-ahead 2 cells)
363
LL(k) Parsing Parsing
In summary, our parser works as follows, where the numbers attached to ⊢ are
the rules in the order applied during the parsing.

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0)
⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ . . . ⊢ (q1, abbb, aBZ0)
⊢ (q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0)
⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)
Notice that the last configuration above implies a successful parsing. It shows that
the sequence of rules applied on the stack generates exactly the same string as the one
originally written on the input tape. If the parser fails to reach the accepting
configuration, we say the input is rejected. In the above example, the sequence of rules
applied to the nonterminal symbols appearing at the stack top matches the sequence of
rules applied in the leftmost derivation of the input string shown below.
S ⇒(1) AB ⇒(3) aaaaaaaaaaB ⇒(4) aaaaaaaaaabB ⇒(4) aaaaaaaaaabbB ⇒(5) aaaaaaaaaabbb
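The whole LL(11) walk-through can be condensed into a small simulator. This is a hedged sketch of the strategy (stack top at the end of a Python list; rule numbering (1)-(7) as in the grammar), not production parser code.

```python
# LL(11) parsing sketch for S -> AB | AC, A -> a^10, B -> bB | b, C -> cC | c.
def ll11_parse(w):
    stack, i, rules = ["S"], 0, []       # list end = stack top
    while stack:
        top = stack.pop()
        if top in "abc":                 # terminal: match and read
            if w[i:i + 1] != top:
                return None
            i += 1
        elif top == "S":                 # look ahead 11 cells
            if "b" in w[i:i + 11]:
                rules.append(1); stack += ["B", "A"]
            else:
                rules.append(2); stack += ["C", "A"]
        elif top == "A":                 # only one rule: no look-ahead
            rules.append(3); stack += ["a"] * 10
        elif top == "B":                 # look ahead 2 cells
            if w[i + 1:i + 2] == "b":
                rules.append(4); stack += ["B", "b"]
            else:
                rules.append(5); stack += ["b"]
        else:                            # top == "C": look ahead 2 cells
            if w[i + 1:i + 2] == "c":
                rules.append(6); stack += ["C", "c"]
            else:
                rules.append(7); stack += ["c"]
    return rules if i == len(w) else None

assert ll11_parse("a" * 10 + "bbb") == [1, 3, 4, 4, 5]   # as traced above
assert ll11_parse("a" * 10 + "ccc") == [2, 3, 6, 6, 7]
```

Note how the returned rule sequence is exactly the leftmost-derivation order shown above.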
364
LL(k) Parsing Parsing
G: S → AB | AC, A → aaaaaaaaaa, B → bB | b, C → cC | c
For the other input strings, ending with c's, the parser can apply the same
strategy and successfully parse them by looking ahead at most 11 cells (see below).
This parser is called an LL(11) parser, named after the following property of the
parser: the input is read Left-to-right, the order of rules applied matches the order
of the Leftmost derivation, and the longest look-ahead range is 11 cells. For a
grammar G, if we can build an LL(k) parser for some constant k, we call G an
LL(k) grammar.
(q0, aaaaaaaaaaccc, Z0) ⊢ (q1, aaaaaaaaaaccc, SZ0) ⊢(2) (q1, aaaaaaaaaaccc, ACZ0)
⊢(3) (q1, aaaaaaaaaaccc, aaaaaaaaaaCZ0) ⊢ . . . ⊢ (q1, accc, aCZ0)
⊢ (q1, ccc, CZ0) ⊢(6) (q1, ccc, cCZ0) ⊢ (q1, cc, CZ0) ⊢(6) (q1, cc, cCZ0)
⊢ (q1, c, CZ0) ⊢(7) (q1, c, cZ0) ⊢ (q1, ε, Z0)
365
LL(k) Parsing Parsing
G: S → AB | AC, A → aaaaaaaaaa, B → bB | b, C → cC | c

Parse table (contents of the 11-cell look-ahead; B = blank, X = don't care):

  Stack top S:  a^10 b → S → AB;   a^10 c → S → AC
  Stack top A:  A → a^10 (no look-ahead needed)
  Stack top B:  bbX^9 → B → bB;   bB^10 → B → b
  Stack top C:  ccX^9 → C → cC;   cB^10 → C → c
366
Parsing
G: (1) S → aSb, (2) S → aabbb
The language of this grammar is {a^i aabbb b^i | i ≥ 0}. Every string generated by
this grammar has aabbb at the center. As we did in the preceding section, let's
examine how an LL(k) parser will parse the input aaaaabbbbbb with the shortest
possible look-ahead range k.
To parse the input string successfully, the machine should apply the rules in the
order (1), (1), (1), (2), which is the same order applied for the following
leftmost derivation.
367
Designing LL(k) Parser Parsing
Pushing the start symbol S into the stack in the initial configuration, the parser
gets ready to parse the string as shown below. With S in the stack top, it must apply
one of S‟s two rules. To choose one of them, the parser needs to look ahead for
supporting information. What could be the shortest range to look ahead?
If there is aabbb, rule (2) must be applied. So it appears k = 5. But the parser does
not have to see the whole string. If there is aaa ahead, the leftmost symbol a must
have been generated by rule (1). Otherwise, if there is aab ahead, the leftmost a
must have been generated by rule (2). It is enough to look ahead 3 cells (i.e., k = 3).
Thus, in the current configuration, since the contents of 3 look-ahead is aaa, the
parser applies rule (1), then reads the input to match and pop the terminal symbol a
from the stack top as follows.
(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0) ⊢ (q1, aaaabbbbbb, SbZ0)   (look-ahead 3)
368
Designing LL(k) Parser Parsing
Again, with S on the stack top, the parser looks ahead 3 cells, and seeing aaa,
applies rule (1), and repeats the same procedure until it looks ahead aab as follows.
G: (1) S → aSb, (2) S → aabbb

(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0)
⊢ (q1, aaaabbbbbb, SbZ0) ⊢(1) (q1, aaaabbbbbb, aSbbZ0)
⊢ (q1, aaabbbbbb, SbbZ0) ⊢(1) (q1, aaabbbbbb, aSbbbZ0)
⊢ (q1, aabbbbbb, SbbbZ0) ⊢ ?
Now, the parser finally applies rule (2), and keeps reading and match-and-
popping until it enters the accepting configuration as follows.
(q1, aabbbbbb, SbbbZ0) ⊢(2) (q1, aabbbbbb, aabbbbbbZ0) ⊢ … ⊢ (q1, ε, Z0)
369
Designing LL(k) Parser Parsing
The parser applied the rules in the order, (1), (1), (1), (2), which is the same
order applied for the leftmost derivation of the input string aaaaabbbbbb.
Given an arbitrary input string, the parser, applying the same procedure, will end
up in the final accepting configuration if and only if the input belongs to the
language of the grammar. The parser needs to look ahead at least 3 cells. Hence, the
grammar is LL(3). The parse table is shown below.
G: (1) S → aSb, (2) S → aabbb

Parse table (3-cell look-ahead):

  Stack top S:  aaa → (1) S → aSb;   aab → (2) S → aabbb
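The LL(3) table has a direct one-function rendering; a sketch under the same rule numbering ((1) S → aSb, (2) S → aabbb):

```python
# LL(3) parsing sketch for S -> aSb | aabbb.
def ll3_parse(w):
    stack, i, rules = ["S"], 0, []       # list end = stack top
    while stack:
        top = stack.pop()
        if top in "ab":                  # terminal: match and read
            if w[i:i + 1] != top:
                return None
            i += 1
        elif w[i:i + 3] == "aaa":        # top == "S": look ahead 3 cells
            rules.append(1); stack += ["b", "S", "a"]
        else:
            rules.append(2); stack += list("aabbb")[::-1]
    return rules if i == len(w) else None

assert ll3_parse("aaa" + "aabbb" + "bbb") == [1, 1, 1, 2]   # the traced input
assert ll3_parse("aabbb") == [2]
```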
370
Designing LL(k) Parser Parsing
Example 2. Construct an LL(k) parser with minimum k for the following CFG.
G: (1) S → abA, (2) S → ε, (3) A → Saa, (4) A → b
371
Designing LL(k) Parser Parsing
If the input is not empty, the parser, with S at the stack top, should choose rule (1) to
apply. Then, as shown below, for each terminal symbol appearing at the stack top, the
parser reads the next input symbol, and if they match, pops the stack top until A
appears. If the input tape were empty, the parser would simply pop S (i.e., rewrite S
with ε) and enter the accepting configuration. Now, with A at the stack top, the parser
should choose between rules (3) and (4).

(q1, ababaaaa, SZ0) ⊢(1) (q1, ababaaaa, abAZ0) ⊢ . . ⊢ (q1, abaaaa, AZ0) ⊢ ?

If rule (4) had been used to derive the input, the next input symbol ahead would be b, not
a. Looking a ahead, the parser applies rule (3), and consequently, having S on
the stack top as before, it needs to look ahead to choose the next rule. Up to this point,
it appears that a 1-cell look-ahead is an appropriate range.

(q1, abaaaa, AZ0) ⊢(3) (q1, abaaaa, SaaZ0) ⊢ ?
372
Designing LL(k) Parser Parsing
But this time, with S at the stack top it is uncertain which rule to apply. Looking a
ahead, the parser can apply either rule (1) or rule (2), because in either case, the
parser will successfully match the stack top a with the next input symbol a (see
below). To resolve this uncertainty, the parser needs one more cell to look ahead.
To solve this problem we could have the parser look down the stack. But we have
chosen to extend the range of look-ahead, a straightforward solution. Later in this
chapter, we will discuss parsers which are allowed to look down the stack some
finite depth.
G: (1) S → abA, (2) S → ε, (3) A → Saa, (4) A → b

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0)
(q1, abaaaa, SaaZ0) ⊢(2) (q1, abaaaa, aaZ0)
Now, looking ab ahead in the extended range, which must have been generated by rule (1),
the parser applies the rule and repeats the previous procedure as follows, until S
appears at the stack top again.

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0) ⊢ . . ⊢ (q1, aaaa, AaaZ0)
⊢(3) (q1, aaaa, SaaaaZ0) ⊢ ?
373
Designing LL(k) Parser Parsing
Looking aa ahead with S on the stack top, the parser applies rule (2). Then, for
each a appearing at the stack top, it keeps reading the next input symbol, matching
them and popping the stack top, eventually entering the accepting configuration.
(q1, aaaa, SaaaaZ0) ⊢(2) (q1, aaaa, aaaaZ0) ⊢ . . . ⊢ (q1, ε, Z0)
374
Designing LL(k) Parser Parsing
The input string that we have just examined is one derived by applying rule (2)
last. For the other typical string ababbaa, which can be derived by applying rule (4)
last, the LL(2) parser proceeds as follows.

(q1, ababbaa, SZ0) ⊢(1) (q1, ababbaa, abAZ0) ⊢ . . ⊢ (q1, abbaa, AZ0)
⊢(3) (q1, abbaa, SaaZ0) ⊢(1) (q1, abbaa, abAaaZ0) ⊢ . . ⊢ (q1, baa, AaaZ0)
⊢(4) (q1, baa, baaZ0) ⊢ . . . ⊢ (q1, ε, Z0)
From the analysis with the two parsing examples, we construct the following parse
table. (Notice that with A at the stack top, though 1 look-ahead is enough, the entries
are under the column of 2 look-ahead.)
G: (1) S → abA, (2) S → ε, (3) A → Saa, (4) A → b

Parse table (2-cell look-ahead; B = blank, X = don't care):

  Stack top S:  ab → (1) S → abA;   aa, BB → (2) S → ε
  Stack top A:  ab, aa → (3) A → Saa;   bX → (4) A → b
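The LL(2) strategy of this example can likewise be sketched directly from the table (rule numbering (1) S → abA, (2) S → ε, (3) A → Saa, (4) A → b; our own condensation):

```python
# LL(2) parsing sketch for S -> abA | ε, A -> Saa | b.
def ll2_parse(w):
    stack, i, rules = ["S"], 0, []       # list end = stack top
    while stack:
        top = stack.pop()
        if top in "ab":                  # terminal: match and read
            if w[i:i + 1] != top:
                return None
            i += 1
        elif top == "S":                 # look ahead 2 cells
            if w[i:i + 2] == "ab":
                rules.append(1); stack += ["A", "b", "a"]
            else:                        # aa ahead, or end of input
                rules.append(2)          # S -> ε: push nothing
        else:                            # top == "A": 1 look-ahead suffices
            if w[i:i + 1] == "b":
                rules.append(4); stack += ["b"]
            else:
                rules.append(3); stack += ["a", "a", "S"]
    return rules if i == len(w) else None

assert ll2_parse("ababaaaa") == [1, 3, 1, 3, 2]   # the first traced string
assert ll2_parse("ababbaa") == [1, 3, 1, 4]       # the second traced string
```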
375
Designing LL(k) Parser Parsing
For a given input string, the basic strategy of LL(k) parsing is to generate the same
string on the top of the stack by rewriting every nonterminal symbol appearing at the
stack top with the right side of one of that nonterminal's rules. If the nonterminal has
more than one rule, the parser picks the right one based on the prefix of the input
string appearing in the k cells looked ahead.
Whenever a terminal symbol appears on the stack top, the machine reads the next
input symbol and pops the stack top, if they match. The sequence of rules applied in
a successful parsing according to this strategy is the same as the one applied for the
leftmost derivation of the input string.
The class of CFG's that can be parsed by the LL(k) parsing strategy is limited. The
CFG G1 below is an example for which no LL(k) parser exists. However, G2, which
generates the same language, is an LL(k) grammar. We will shortly explain why.

G1: S → A | B, A → aA | 0, B → aB | 1
G2: S → aS | D, D → 0 | 1
L(G1) = L(G2) = {a^i t | i ≥ 0, t ∈ {0, 1}}
376
Designing LL(k) Parser Parsing
Consider the first working configuration illustrated below (with the start symbol S
on top of the stack). The parser should choose one of S's two rules, S → A and S → B.
But it is impossible to choose the correct rule, because the right-end symbol 0 (or 1),
which is essential for the correct choice, can be located arbitrarily far to the right. It
is impossible for any LL(k) parser to identify it ahead with its "telescope" of a finite
range k.
But for the grammar G2, we can easily design an LL(1) parser.

G1: S → A | B, A → aA | 0, B → aB | 1
G2: S → aS | D, D → 0 | 1

(Figure: the parser for G1 in state q1 with S on the stack top and the input aaaa.....aa0 on the tape; the deciding symbol 0 lies beyond any finite look-ahead range.)
377
Parsing
Definition of LL(k) grammars
We saw just now two CFG's that generate the same language, but for the one,
no LL(k) parser exists, and for the other, we can design an LL(k) parser. So, we
may ask the following: what is the property of LL(k) grammars?
For a string x, let (k)x denote the prefix of length k of string x. If |x| < k, then
(k)x = x. For example, (2)ababaa = ab, and (3)ab = ab.
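The prefix operator is trivial to state in code; a one-line rendering for reference (our own notation):

```python
def k_prefix(x, k):
    """(k)x: the prefix of length k of x; x itself if |x| < k."""
    return x[:k]

assert k_prefix("ababaa", 2) == "ab"
assert k_prefix("ab", 3) == "ab"
```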
(Figure: an LR parser before and after a reduce step; with the string β on the stack top portion, the parser rewrites it by A according to the rule A → β.)
379
LR(k) parsing Parsing
If there is uncertainty because another rule generates the same string β, for
example B → β, the parser looks some constant k cells ahead on the input string to
resolve the uncertainty. The following figures show the difference between the two parsing
strategies.

(Figures: in LL(k) parsing, the parser with A on the stack top looks k cells ahead, chooses a rule A → β, and rewrites A by β on the stack; in LR(k) parsing, the parser first builds β on the stack top portion and then, looking k cells ahead, reduces β to A by the rule A → β, or to B if B → β is the correct choice.)
380
LR(k) Parsing Parsing
Let's take a simple example of designing an LR(k) parser with the CFG
shown below. As we did for designing LL(k) parsers, we shall first pick a
typical string that can be derived by the grammar, and with this string, observe
how the machine should work according to the LR(k) strategy. Based on the
observation, we will build an LR(k) parser.

G: (1) S → AC, (2) S → BD, (3) A → aab, (4) B → aab, (5) C → c, (6) D → d

Suppose the string aabc is given on the input tape. The basic strategy of LR(k) parsers
is to shift the input (i.e., read the input and push it) onto the stack, until the right
side of a rule in the grammar appears on some stack top portion.

(q0, aabc, Z0) ⊢ ?
381
LR(k) Parsing Parsing
For this example, the parser will read up to aab, until it sees baa (notice that it is
reversed) on the stack top portion. Two rules, rule (3) and rule (4), are applicable:
the string aab must be reduced to either A or B. To resolve this uncertainty, the machine
needs to look ahead. For this example, it is enough to look one cell ahead. Looking c
ahead, the parser applies rule (3) by rewriting the aab on the top portion of the stack by A
as follows.

(q0, aabc, Z0) ⊢ . . . ⊢ (q1, c, baaZ0) ⊢(3) (q1, c, AZ0) ⊢ ?   (look-ahead 1)
382
LR(k) Parsing Parsing
After applying rule (3), the parser sees no string on the stack top portion that is
reducible (i.e., generated by a rule). So the parser shifts in the next symbol c from the
input, and applies rule (5) to rewrite it with C. (This time no look-ahead is needed.)
Finally, the parser applies rule (1) S → AC, and rewrites the AC appearing on the stack
top portion (the parser reads the stack from inside out to the top) by the start symbol S,
entering the final accepting configuration (q1, ε, SZ0).
383
LR(k) Parsing Parsing
The following figure shows an LR(1) parsing history for the other string aabd,
which can be generated by the grammar.

(q0, aabd, Z0) ⊢ . . . ⊢ (q1, d, baaZ0) ⊢(4) (q1, d, BZ0)
⊢ (q1, ε, dBZ0) ⊢(6) (q1, ε, DBZ0) ⊢(2) (q1, ε, SZ0)   (look-ahead 1; the last is the accepting configuration)
(6) (2)
Notice that the sequence of rules applied by the parser, (4)(6)(2), is exactly the reverse
order of the rightmost derivation of aabd (see below). It is also true for the former
example aabc, and actually for every LR(k) parsing. (Recall that the reverse order of the
rules applied for the rightmost derivation of a string x corresponds to the bottom-up left-
to-right traversal of the parse tree yielding x. )
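The shift-reduce behavior just described can be condensed into a short simulator for this particular grammar (rule numbering (1) S → AC, (2) S → BD, (3) A → aab, (4) B → aab, (5) C → c, (6) D → d, our reading of the slide's figure); a hedged sketch, not a general LR engine:

```python
# LR(1) shift-reduce sketch: shift until a rule's right side appears on
# the stack top portion, then reduce, using 1 look-ahead cell for aab.
def lr1_parse(w):
    stack, i, rules = [], 0, []          # list end = stack top
    while True:
        if "".join(stack[-3:]) == "aab":     # reduce by (3) or (4)
            del stack[-3:]
            if w[i:i + 1] == "c":            # 1 look-ahead decides A vs B
                rules.append(3); stack.append("A")
            else:
                rules.append(4); stack.append("B")
        elif stack[-1:] == ["c"]:
            rules.append(5); stack[-1] = "C"
        elif stack[-1:] == ["d"]:
            rules.append(6); stack[-1] = "D"
        elif "".join(stack[-2:]) == "AC":
            rules.append(1); stack[-2:] = ["S"]
        elif "".join(stack[-2:]) == "BD":
            rules.append(2); stack[-2:] = ["S"]
        elif i < len(w):                     # nothing reducible: shift in
            stack.append(w[i]); i += 1
        else:
            break
    return rules if stack == ["S"] else None

assert lr1_parse("aabc") == [3, 5, 1]
assert lr1_parse("aabd") == [4, 6, 2]   # reverse of the rightmost derivation
```

The returned rule sequence is the reverse of the rightmost-derivation order, exactly as observed above.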
384
LR(k) Parsing Parsing
With a simple modification to the LR(1) grammar that we have just examined, we
can change it into an LR(k) grammar for any given constant k. For example, by
prefixing eeee to the right side of the rules of C and D as shown below, we can
change the grammar to LR(5). Below are two LR parsing histories of an LR(5)
parser for two typical strings derivable by the modified grammar and the parse
table constructed based on the histories.
386
Parsing
13.5 Designing LR(k) Parsers
We shall now practice designing an LR(k) parser for a couple of grammars. As
with LL(k), we will first pick a typical string that can be generated by the given
grammar and observe how an LR(k) parser parses it with minimum k. Based on the
observation, we will build the parse table.
Example 1. Construct an LR(k) parser for the following CFG with minimum k.
387
Designing LR(k) Parsers Parsing
The production rule applied last in the rightmost derivation of a string that can be
generated by this grammar is rule (3) or rule (2). So, before applying one of these
rules, the parser should shift in the input until it sees either aaa or aaaddd on the
stack top portion. But we have a problem. Consider the following two
configurations with aaa shifted in onto the stack. Cases (a) and (b) are, respectively,
for input strings aaaddd and aaadddccc.
388
Designing LR(k) Parsers Parsing
390
Designing LR(k) Parsers Parsing
After applying rule (4), the parser, seeing no string on the stack that is reducible
by applying a rule, shifts in the next input symbol c from the input. This first c
from the input is the one generated by rule (6). (Note that if rule (5) were C → cC,
with the right side reversed, the last c would have been generated by rule (6).) Thus,
the parser applies rule (6), and reduces the c on the stack top to C as follows.

(q0, aaadddccc, Z0) ⊢ (q1, aadddccc, aZ0) ⊢ . . . ⊢ (q1, dddccc, aaaZ0)
⊢(3) (q1, dddccc, AZ0) ⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0)
⊢ (q1, cc, cDAZ0) ⊢(6) (q1, cc, CDAZ0) ⊢ ?
391
Designing LR(k) Parsers Parsing
(q0, aaadddccc, Z0) ⊢ (q1, aadddccc, aZ0) ⊢ . . . ⊢ (q1, dddccc, aaaZ0)
⊢(3) (q1, dddccc, AZ0) ⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0)
⊢ (q1, cc, cDAZ0) ⊢(6) (q1, cc, CDAZ0) ⊢ ?
Now, the parser has ADC on the stack that would be reducible by applying rule
(1). But it is too early to do this, because there are more c‟s to be processed. Here it
needs to look one cell ahead, and shift in the next c to apply rule (5) and reduce Cc
on the stack by C. The parser repeats this reduction until it sees no c‟s ahead, and
finally applies rule (1) to reduce ADC by S, entering the accepting configuration.
(q1, cc, CDAZ0) ⊢ (q1, c, cCDAZ0) ⊢(5) (q1, c, CDAZ0)
⊢ (q1, ε, cCDAZ0) ⊢(5) (q1, ε, CDAZ0) ⊢(1) (q1, ε, SZ0)
392
Designing LR(k) Parsers Parsing
Below is the summary of our observations. The order of the rules applied is
exactly the reverse order of the rules applied for the rightmost derivation of the
input string (see below).
(q0, aaadddccc, Z0) ⊢ . . ⊢ (q1, dddccc, aaaZ0) ⊢(3) (q1, dddccc, AZ0)
⊢ . . . ⊢ (q1, ccc, dddAZ0) ⊢(4) (q1, ccc, DAZ0)
⊢ (q1, cc, cDAZ0) ⊢(6) (q1, cc, CDAZ0) ⊢ (q1, c, cCDAZ0) ⊢(5) (q1, c, CDAZ0)
⊢ (q1, ε, cCDAZ0) ⊢(5) (q1, ε, CDAZ0) ⊢(1) (q1, ε, SZ0)

Rightmost derivation:
S ⇒(1) ADC ⇒(5) ADCc ⇒(5) ADCcc ⇒(6) ADccc ⇒(4) Adddccc ⇒(3) aaadddccc
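The 4-look-ahead behavior traced above can be condensed as follows (rule numbering (1) S → ADC, (2) S → aaaddd, (3) A → aaa, (4) D → ddd, (5) C → Cc, (6) C → c, our reading of the slide's figure); a hedged sketch for this one grammar:

```python
# LR(4) shift-reduce sketch: 4 look-ahead cells distinguish reducing
# aaa to A (input continues dddc...) from shifting toward aaaddd.
def lr4_parse(w):
    stack, i, rules = [], 0, []

    def top(n):
        return "".join(stack[-n:]) if len(stack) >= n else ""

    while True:
        look = w[i:i + 4]
        if top(3) == "aaa" and look[:3] == "ddd" and look[3:4] == "c":
            rules.append(3); stack[-3:] = ["A"]
        elif top(6) == "aaaddd":
            rules.append(2); stack[-6:] = ["S"]
        elif top(4) == "Addd":
            rules.append(4); stack[-3:] = ["D"]
        elif top(3) == "ADc":
            rules.append(6); stack[-1:] = ["C"]
        elif top(2) == "Cc":
            rules.append(5); stack[-2:] = ["C"]
        elif top(3) == "ADC" and not look:   # only blanks ahead: accept move
            rules.append(1); stack[-3:] = ["S"]
        elif i < len(w):                     # nothing reducible: shift in
            stack.append(w[i]); i += 1
        else:
            break
    return rules if stack == ["S"] else None

assert lr4_parse("aaadddccc") == [3, 4, 6, 5, 5, 1]   # as in the summary
assert lr4_parse("aaaddd") == [2]
```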
393
Designing LR(k) Parsers Parsing
Now, based on the observations, while tracing how an LR(k) parser can
successfully parse the chosen input string, we can construct the following parse
table for an LR(4) parser.
Parse table (4-cell look-ahead; B = blanks, X = don't care):

  Stack top portion aaa:     dddc → A;   dddB → shift in
  Stack top portion ddd:     → D (no look-ahead)
  Stack top portion aaaddd:  BBBB → S
  Stack top portion c:       → C (no look-ahead)
  Stack top portion Cc:      → C
  Stack top portion ADC:     cXXX → shift in;   BBBB → S
394
Parsing
Designing LR(k) Parsers
Example 2. Construct an LR(k) parser for the following CFG with minimum k.

G: (1) S → EAF, (2) S → EBF, (3) E → aE, (4) E → a, (5) A → aaab, (6) B → aaac, (7) F → d

This is not an LL(k) grammar. (We leave the proof for the reader. For the proof,
recall the non-LL(k) grammar that we discussed at the end of the last section.)
The language of this grammar is {a^i x d | i ≥ 1, x ∈ {aaab, aaac}}. Again, we will
first trace how an LR parser should parse a typical string from this language to
construct a parse table.
Let's choose the string aaaaaabd for the analysis. The rightmost derivation for
this string is shown below. Our LR(k) parser, given this string as an input, should
parse it by applying the rules in the reverse order of the derivation, i.e., (4), (3), (3),
(5), (7), (1).

S ⇒(1) EAF ⇒(7) EAd ⇒(5) Eaaabd ⇒(3) aEaaabd ⇒(3) aaEaaabd ⇒(4) aaaaaabd
395
Designing LR(k) Parsers Parsing
Every rightmost derivation of this grammar must end by applying rule (4). Thus,
our LR(k) parser should shift onto the stack the a that has been generated by
this rule. The language of this grammar is {a^i x d | i ≥ 1, x ∈ {aaab, aaac}}, and the
rightmost a in the prefix a^i is generated by rule (4).
So the parser should shift the whole string a^i into the stack to bring the
rightmost a (of the string a^i) to the top of the stack. But here we have a problem,
because there are more a's to the right, which belong to x ∈ {aaab, aaac}.
To identify the last a in a^i, the parser needs to look ahead. If it sees either aaab
or aaac ahead, then the a on top of the stack is the last a in a^i, which should be
reduced by applying rule (4) as follows.
396
Designing LR(k) Parsers Parsing
Now, we have aE at the stack top portion that is reducible by applying rule (3).
The parser does this reduction twice, and then shifts in the next input d, because
no string appears on the stack top that is reducible. Symbol d on the stack is
reduced by rule (7), and the result of this reduction EAF is in turn reduced by rule
(1) to enter the final accepting configuration (q1, , SZ0).
(q1, aaabd, EaaZ0) ⊢(3) (q1, aaabd, EaZ0) ⊢(3) (q1, aaabd, EZ0)
⊢ (q1, aabd, aEZ0) ⊢ … ⊢ (q1, d, baaaEZ0) ⊢(5) (q1, d, AEZ0)
⊢ (q1, ε, dAEZ0) ⊢(7) (q1, ε, FAEZ0) ⊢(1) (q1, ε, SZ0)
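As before, the traced behavior condenses into a short simulator (rule numbering (1) S → EAF, (2) S → EBF, (3) E → aE, (4) E → a, (5) A → aaab, (6) B → aaac, (7) F → d, our reading of the slide's grammar); a hedged sketch:

```python
# LR(4) sketch: reduce the last a of a^i to E only when aaab or aaac
# is seen in the 4 look-ahead cells.
def lr4_parse2(w):
    stack, i, rules = [], 0, []

    def top(n):
        return "".join(stack[-n:]) if len(stack) >= n else ""

    while True:
        if top(1) == "a" and w[i:i + 4] in ("aaab", "aaac"):
            rules.append(4); stack[-1:] = ["E"]
        elif top(2) == "aE":
            rules.append(3); stack[-2:] = ["E"]
        elif top(4) == "aaab":
            rules.append(5); stack[-4:] = ["A"]
        elif top(4) == "aaac":
            rules.append(6); stack[-4:] = ["B"]
        elif top(1) == "d":
            rules.append(7); stack[-1:] = ["F"]
        elif top(3) in ("EAF", "EBF"):
            rules.append(1 if top(3) == "EAF" else 2)
            stack[-3:] = ["S"]
        elif i < len(w):                 # nothing reducible: shift in
            stack.append(w[i]); i += 1
        else:
            break
    return rules if stack == ["S"] else None

assert lr4_parse2("aaaaaabd") == [4, 3, 3, 5, 7, 1]   # reverse rightmost order
assert lr4_parse2("aaaacd") == [4, 6, 7, 2]
```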
397
Designing LR(k) Parsers Parsing
Here is the summary of our analysis and the parse table. The order of the rules
applied is the reverse of the order in the rightmost derivation, i.e., (4), (3), (3),
(5), (7), (1).

(q0, aaaaaabd, Z0) ⊢ (q1, aaaaabd, aZ0) ⊢ … ⊢ (q1, aaabd, aaaZ0)
⊢(4) (q1, aaabd, EaaZ0) ⊢(3) (q1, aaabd, EaZ0) ⊢(3) (q1, aaabd, EZ0)
⊢ (q1, aabd, aEZ0) ⊢ … ⊢ (q1, d, baaaEZ0) ⊢(5) (q1, d, AEZ0)
⊢ (q1, ε, dAEZ0) ⊢(7) (q1, ε, FAEZ0) ⊢(1) (q1, ε, SZ0)

Parse table (4-cell look-ahead):

  Stack top portion a:     aaab → E;   aaac → E;   others → shift in
  Stack top portion aE:    → E
  Stack top portion aaab:  → A
  Stack top portion aaac:  → B
  Stack top portion d:     → F
  Stack top portion EAF:   → S
  Stack top portion EBF:   → S
398
Designing LR(k) Parsers Parsing
Example 3. Construct an LR(k) parser for the following CFG with minimum k.
G: (1) S → bS, (2) S → Accc, (3) A → bAc, (4) A → bc
399
Designing LR(k) Parsers Parsing
(q0, bbbbbcccccc, Z0) ⊢ … ⊢ (q1, ccccc, cbbbbbZ0)
⊢(4) (q1, ccccc, AbbbbZ0) ⊢ (q1, cccc, cAbbbbZ0) ⊢ ?
After rule (4) is applied, with no stack top portion reducible, the parser shifts in
the next input symbol c as above. Now, the parser sees bAc on the stack, which
is reducible by rule (3). However, we should be careful: we cannot let the parser
repeat this reduction without looking ahead. It needs to save the last
three c's of the input string for the final reduction by rule (2).
Thus the parser needs to look 3 cells ahead. In the current configuration, since
there is ccc ahead, it can reduce the bAc on the stack top portion by rule (3). The
parser can do this shift-in and reduction once more, as shown below.
(q1, cccc, cAbbbbZ0) ⊢(3) (q1, cccc, AbbbZ0)
⊢ (q1, ccc, cAbbbZ0) ⊢(3) (q1, ccc, AbbZ0) ⊢ (q1, cc, cAbbZ0) ⊢ ?
400
Designing LR(k) Parsers Parsing
Now, the parser sees two c‟s ahead, that must be shifted in onto the stack for
the reduction by rule (2), because those last three c‟s, together with A, are
generated by rule (2). After this reduction by rule (2), the parser reduces the
stack top twice by applying rule (1) and enters the final configuration as follows.
(q1, cc, cAbbZ0) ⊢ . . . ⊢ (q1, ε, cccAbbZ0) ⊢(2) (q1, ε, SbbZ0)
⊢(1) (q1, ε, SbZ0) ⊢(1) (q1, ε, SZ0)
401
Designing LR(k) Parsers Parsing
Below is the summary of our analysis and the parse table of the LR(3) parser.
(q0, bbbbbcccccc, Z0) ⊢ … ⊢ (q1, ccccc, cbbbbbZ0)
⊢(4) (q1, ccccc, AbbbbZ0) ⊢ (q1, cccc, cAbbbbZ0) ⊢(3) (q1, cccc, AbbbZ0)
⊢ (q1, ccc, cAbbbZ0) ⊢(3) (q1, ccc, AbbZ0) ⊢ (q1, cc, cAbbZ0) ⊢ . . .
⊢ (q1, ε, cccAbbZ0) ⊢(2) (q1, ε, SbbZ0) ⊢(1) (q1, ε, SbZ0) ⊢(1) (q1, ε, SZ0)
G: (1) S → bS, (2) S → Accc, (3) A → bAc, (4) A → bc

Parse table (3-cell look-ahead; B = blank, X = don't care):

  Stack top portion bc:    → A (no look-ahead)
  Stack top portion bAc:   ccc → A;   ccB → shift in
  Stack top portion Accc:  → S
  Stack top portion bS:    → S
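A condensed simulator of this LR(3) behavior (rule numbering (1) S → bS, (2) S → Accc, (3) A → bAc, (4) A → bc, our reading of the slide's grammar); a hedged sketch:

```python
# LR(3) sketch: reduce bAc to A only when ccc is seen ahead, so the
# last three c's are saved for the final reduction Accc -> S.
def lr3_parse(w):
    stack, i, rules = [], 0, []

    def top(n):
        return "".join(stack[-n:]) if len(stack) >= n else ""

    while True:
        if top(2) == "bc":                         # no look-ahead needed
            rules.append(4); stack[-2:] = ["A"]
        elif top(3) == "bAc" and w[i:i + 3] == "ccc":
            rules.append(3); stack[-3:] = ["A"]
        elif top(4) == "Accc":
            rules.append(2); stack[-4:] = ["S"]
        elif top(2) == "bS":
            rules.append(1); stack[-2:] = ["S"]
        elif i < len(w):                           # nothing reducible: shift
            stack.append(w[i]); i += 1
        else:
            break
    return rules if stack == ["S"] else None

assert lr3_parse("bbbbbcccccc") == [4, 3, 3, 2, 1, 1]   # as in the summary
assert lr3_parse("bcccc") == [4, 2]
```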
402
Designing LR(k) Parsers Parsing
Definition of LR(k) Grammars
LR(k) grammars are defined as follows, which is quite similar to the
definition of LL(k) grammars except that in this definition, we are concerned
with rightmost derivations instead of leftmost derivations.
Definition (LR(k) grammar). Let G = (VT, VN, P, S) be a CFG, and let αβx and
αβy be two sentential forms appearing in rightmost derivations of strings in L(G), where
α, β, γ ∈ V*, x, y ∈ VT*, and A, B ∈ VN. If there is a constant k for which the following
conditions are satisfied, then grammar G is an LR(k) grammar (i.e., we can
construct an LR(k) parser for G).
(i) If S ⇒* αAx ⇒ αβx and S ⇒* γBy ⇒ αβy, such that (k)x = (k)y,
(ii) then α = γ and A = B.
In other words, if the above condition is satisfied, it is possible for an LR(k) parser
to establish β on the stack top portion and reduce it with the rule A → β by
looking at most k cells ahead. (The proof, which can be done using the proof-by-
induction technique, is a little challenging for the level of this text. We omit the
proof.)
403
Parsing
13.6 Lex and YACC
The lexical analyzer and the parser are the major components of a compiler. A lexical
analyzer, scanning the source program, identifies the basic program components,
called tokens, such as reserved words, operators, and variable names. A parser,
based on the grammar of the programming language, constructs the parse tree (see
Section 11.1) whose yield is the sequence of tokens found by the lexical analyzer.
This section briefly introduces two software tools, Lex and YACC, which
automatically generate a lexical analyzer and a parser, respectively.
Lex
Given tokens defined in terms of regular expressions, Lex constructs a lexical analyzer.
(Actually, this lexical analyzer is a software implementation of an FA that
recognizes the set of tokens.) The input to Lex consists of the following three
parts, each delimited by %%.
(1) Token definition
(2) Token description and actions
(3) User-written codes
404
Lex Parsing
(1) Token definition
Tokens are defined in terms of a variation of the regular expression that was
introduced in the rumination section of Chapter 3. Here we repeat the examples
presented in that rumination section.
(a | b)* = (a+b)*,  (a | b)+ = (a+b)(a+b)*,  [ab] = (a+b),
[a-z] = a + b + . . . + z,  (ab\.c)? = (ab.c + ε),  abc?d = {abcd, abd}
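These shorthands can be checked directly with Python's re module (a small sketch; the pattern names and the matches helper are illustrative, and Python's regex syntax stands in for Lex's):

```python
import re

# The shorthands above, rewritten in Python regex syntax:
#   [ab]      one character from {a, b}           = (a + b)
#   [a-z]     one lowercase letter                = a + b + ... + z
#   (ab\.c)?  optional literal "ab.c"             = (ab.c + epsilon)
#   abc?d     "ab", an optional "c", then "d"     = {abcd, abd}
patterns = {
    "set":      re.compile(r"[ab]"),
    "range":    re.compile(r"[a-z]"),
    "optional": re.compile(r"(ab\.c)?"),
    "opt_c":    re.compile(r"abc?d"),
}

def matches(name: str, s: str) -> bool:
    """True iff the whole string s matches the named pattern."""
    return patterns[name].fullmatch(s) is not None
```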
406
YACC Parsing
407
YACC Parsing
An elderly couple were experiencing declining memories, so they decided to take a power memory class, where
they teach one to remember things by association. Later, the man was talking to a neighbor about how much the
class helped them.
“Who was the instructor?” the neighbor asked.
"Oh, let's see," pondered the man. "Umm... what's that flower, you know, the one that smells real nice but has
those thorns...?"
"A rose?" offered the neighbor.
"Right," said the man. He then turned toward his house and shouted,
"Hey, Rose, what's the name of the guy we took that memory class from?"
- DonTravels -
408
Parsing
Rumination (1): Parsing
The LL and LR parsers work based on a given grammar. The following three context-free grammars generate the same
language {a^i X | i ≥ 1, X ∈ {b, c}}. Grammar G1 is neither an LL(k) nor an LR(k) grammar, for any k. G2 is an LR(1) grammar, but not
LL(k), for any k, and G3 is LL(2) and LR(0). (We leave the proofs for the reader.) Thus, LL(k) and LR(k) refer to the
grammar, not to the language.
G1: S → Ab | Bc    A → Aa | a    B → Ba | a
G2: S → Ab | Bc    A → aA | a    B → aB | a
G3: S → aS | aA    A → b | c
Ambiguous CFGs have more than one parse tree yielding the same string x, and hence there is more than one leftmost
or rightmost derivation for x. Thus, it is impossible to parse x and give the exact sequence of rules applied in
the leftmost or rightmost derivation. However, if we allow the parser to choose one of the multiple derivations arbitrarily, we
can design a parser. As an example, consider the following grammar, which is ambiguous. We can make an LL(0) or LR(0)
parser for this grammar by letting it choose either one of the following two derivations: S ⇒ A ⇒ a and S ⇒ B ⇒ a.
S → A | B    A → a    B → a
409
Rumination (1): Parsing Parsing
One extra working state (q1) was enough for all LL(k) parsers that we have designed in this chapter. We can show that this
is true for every LL(k) grammar. However, this is not true for LR(k) parsing, though all the LR(k) parsers that we have
presented in this chapter use two states including the start state q0. The following grammar is an example that requires three
states.
The language of this grammar is {xa^i | i ≥ 1, x ∈ {a, b, c}}. Every string in this language begins with either a, b or c, and
depending on that first symbol, the last a in a^i is generated by either A → a, B → a or C → a. Notice that for any rightmost derivation, one of
these rules is the last one applied. It follows that the parser must shift in the whole input string to bring the last a on top of the
stack, which can be done by looking one cell ahead (see above). Now, the problem is to correctly choose a rule among the three
A → a, B → a and C → a to reduce the a on the stack top.
The only information needed for the correct reduction is the symbol pushed in first, on top of the bottom-of-stack symbol Z0 (b
in the above illustration). This symbol can be arbitrarily far from the top. (Recall that the parser is allowed to look down the
stack only to a finite depth.) So the parser, reading the first symbol, needs to remember it in its finite state control and use it to
choose the correct rule for the first reduction. The parser thus needs three states, say qa, qb, and qc, corresponding to
keeping the first symbol a, b or c, respectively, in its memory.
410
Rumination (1): Parsing Parsing
Below are the three LR(1) parsing histories showing how the parser can parse the above grammar with three states qa,
qb and qc.
(i) Choose a typical string w that is derivable by the grammar, and trace the LL(k) parsing for the input w with clear indication
of where and why your parser needs to look k cells ahead.
(ii) Based on the analysis in step (i) above, construct an LL(k) parse table of your parser.
13.2 (a) Why can CFG G below not be an LL(k) grammar for any constant k?
(b) Construct an LL(k) grammar G’ whose language is L(G), and show an LL(k) parse table for G’, together with the analysis
that you took to build your parse table according to step (i) in problem 13.1 above.
13.3 Following steps (i) and (ii) below, construct an LR(k) parser with minimum k for each of the following grammars.
(i) Choose a typical string w that is derivable by the grammar, and trace the LR(k) parsing for the input w with clear
indication of where and why your parser needs to look k cells ahead.
(ii) Based on the analysis in step (i) above, construct an LR(k) parse table of your LR(k) parser.
412
Practical Application:
Web Programming and Bioinformatics
413
14. Web Programming and Bioinformatics
In Section 3.2, we learned how the syntax of Pascal programming language can be defined with a CFG. (Actually,
Appendix A shows the whole definition of Pascal programming language in terms of a syntax flow graph.) Other
programming languages can also be defined formally with a grammar, except for a few exceptional cases of context
dependency, such as double declaration of a variable in a block and using numeric labels in DO-loop in FORTRAN.
This chapter first shows that the major parts of the popular Web programming languages HTML (Hypertext Markup
Language) and XML (Extensible Markup Language), the latter designed for e-commerce, can also be defined in terms
of a CFG. Then the chapter presents a potential application of formal languages to molecular biology.
414
Web Programming and Bioinformatics
14.1 Hypertext Markup Language (HTML)
In this section, to see how our knowledge of formal languages can be applied
to investigating the properties of programming languages, we shall first examine
HTML. The left box below shows an informal definition (itemized for convenience) of
an HTML list as it would commonly appear in a text. To the right are CFG rules
translated from the informal definition.
417
Web Programming and Bioinformatics
14.2 DTD and XML
The major objective of HTML is to define the format of a document, and hence it is
not easy to describe the semantics of the text, which is needed in the area of e-commerce.
XML has been introduced to overcome this drawback. Actually, XML is the language
(i.e., the set of documents) written according to a DTD (Document Type Definition),
which can be transformed into a CFG. All DTDs have the following form.
This format can be transformed into the following CFG rules, where DTD corresponds
to the start symbol S.
DTD → <!DOCTYPE name-of-DTD [ List-of-Element-Definitions ]>
List-of-Element-Definitions → Element-Definition List-of-Element-Definitions
                            | Element-Definition
Element-Definition → <!ELEMENT element-name (Element-Description)>
418
DTD and XML Web Programming and Bioinformatics
<!DOCTYPE PcSpecs [
<!ELEMENT PCS (PC*)>
<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
<!ELEMENT MANF (#PCDATA)>
<!ELEMENT MODEL (#PCDATA)>
<!ELEMENT SPEED (#PCDATA)>
<!ELEMENT RAM (#PCDATA)>
<!ELEMENT DISK (HARDDISK | CD | DVD)>
<!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
<!ELEMENT SIZE (#PCDATA)>
<!ELEMENT CD (SPEED)>
<!ELEMENT DVD (SPEED)>
]>
420
Web Programming and Bioinformatics
DTD and XML
Here, we show how to transform each element definition of PcSpecs into CFG
rules (shown in the box).
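The element-by-element transformation can also be sketched mechanically (a simplified illustration, not a full DTD parser; the content-list input convention, the helper name, and the generated nonterminal names are assumptions of this sketch):

```python
# Sketch of the DTD-element-to-CFG-rule transformation described above.
# Each element E with content (C1, C2, ...) becomes the rule
#     E -> <E> C1 C2 ... </E>
# A starred component C* becomes a fresh nonterminal C-list with rules
#     C-list -> C C-list | epsilon
# and C+ becomes C-seq with rules  C-seq -> C C-seq | C.
def element_to_rules(name, content):
    rules = []
    rhs = []
    for c in content:
        if c.endswith("*"):
            base, star = c[:-1], c[:-1] + "-list"
            rules.append((star, [base, star]))
            rules.append((star, []))          # epsilon rule
            rhs.append(star)
        elif c.endswith("+"):
            base, seq = c[:-1], c[:-1] + "-seq"
            rules.append((seq, [base, seq]))
            rules.append((seq, [base]))
            rhs.append(seq)
        else:
            rhs.append(c)
    rules.insert(0, (name, ["<%s>" % name] + rhs + ["</%s>" % name]))
    return rules
```

For example, the PCS element of PcSpecs yields the rule PCS → &lt;PCS&gt; PC-list &lt;/PCS&gt; together with the two PC-list rules.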
421
Web Programming and Bioinformatics
DTD and XML
.....
DISK → <DISK>HARDDISK</DISK> |
       <DISK>CD</DISK> | <DISK>DVD</DISK>
.....
422
Web Programming and Bioinformatics
DTD and XML
<PCS>
<PC>
<MODEL>1234</MODEL>
<PRICE>$3000</PRICE>
<PROCESSOR> . . . . </PROCESSOR>
<RAM>512</RAM>
<DISK><HARDDISK>
<MANF>Superdc</MANF>
<MODEL>xx1000</MODEL>
<SIZE>62Gb</SIZE>
</HARDDISK></DISK>
<DISK><CD>
<SPEED>32x</SPEED>
</CD></DISK>
</PC>
<PC> . . . . </PC>
</PCS>
423
Web Programming and Bioinformatics
DTD and XML
Again in XML we see the same context dependency caused by the matching tags.
As in the previous example, if we restrict the matching tags to those chosen from a
finite set of reserved words, such as MODEL, PRICE, DISK, etc., XML turns out to
be a CFL.
Warning Signs
Break Time
- On children's alphabet blocks: Letters may be used to construct words, phrases and sentences that may be
deemed offensive.
- On a microscope: Objects are smaller and less alarming than they appear.
- On a revolving door: Passenger compartments for individual use only.
- On work gloves: For best results, do not leave at crime scene.
- On a palm sander: Not to be used to sand palms.
- On a calendar: Use of term "Sunday" for reference only. No meteorological warranties express or implied.
-On a blender: Not for use as an aquarium.
- Dennis -
424
Web Programming and Bioinformatics
14.3 Gene code and grammar
Quite a few linguistic terms, such as code, express, translate, and edit, have
been adopted by biologists, especially in the field of molecular biology, ever since
James Watson and Francis Crick proposed the structure of DNA (deoxyribonucleic acid)
strands in 1953. This suggests that in biology there are problems that we can approach
with the knowledge of automata and formal languages. As an example, this section
shows how it is possible to express genes in terms of a grammar.
Before the example, we need a very brief introduction to the process of protein
synthesis. A genome is a long double-helix DNA strand, consisting of four chemical
bases: adenine, thymine, cytosine and guanine, denoted by A, T, C, and G, respectively.
425
Gene Grammar Web Programming and Bioinformatics
Each base in a double helix forms one of the four complementary pairs, A-T, T-A,
C-G, and G-C as the following example shows. (Thus, given a DNA single strand,
we can figure out its complementary strand.)
ACTTAACGGCGAT
TGAATTGCCGCTA
Proteins are synthesized in the following three steps. (1) A single-strand segment
(i.e., a gene) of the genome is transcribed into an RNA. The RNA is the complementary
strand of the source strand, with every base T replaced by U (uracil). Hence, RNAs
consist of four bases: A, C, G and U.
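The base-pairing and transcription rules just stated are easy to express in code (a minimal sketch; the function names are illustrative):

```python
# The two facts stated above: the complementary DNA strand (pairs A-T,
# T-A, C-G, G-C), and the transcribed RNA, which is the complement of
# the source strand with every T replaced by U.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Complementary DNA strand of a single strand."""
    return "".join(DNA_PAIR[b] for b in strand)

def transcribe(strand: str) -> str:
    """RNA transcript: complement the source strand, then replace T with U."""
    return complement(strand).replace("T", "U")
```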
[Figure: (1) RNA transcription. The DNA strands TCAAGCT (5' end) and AGTTCGA (3' end) are shown together with the transcribed RNA UCAAGCU.]
426
Gene Grammar Web Programming and Bioinformatics
(2) In eukaryotes, only the substrings useful for protein synthesis, called exons, are
spliced (i.e., cut out and concatenated) to form an mRNA (messenger RNA). (3) Finally,
ribosomes, large molecular assemblies traveling along the mRNA, translate the mRNA
into a protein, which is an amino-acid sequence.
[Figure: the transcribed RNA, consisting of exons and introns, is turned by (2) splicing into the messenger RNA, which (3) translation turns into a protein.]
427
Gene Grammar Web Programming and Bioinformatics
There are 20 different amino acids. Hence, proteins can be represented as strings over
20 characters. The translation is done by triplets (i.e., groups of three adjacent bases),
also called codons, according to the table shown below. The translation begins with the
start codon ATG, and continues until it meets one of the stop codons, TAA, TAG, and
TGA (see the table below). For every codon scanned, a specific amino acid is attached to
the next position of the protein being synthesized. Notice that because there are 4×4×4 =
64 different codons and only 20 amino acids, several codons are translated into the same
amino acid. (Following convention, we shall represent codons using the character T rather
than U.)
Ala: GCA, GCG, GCT, GCC          Arg: AGA, AGG, CGA, CGG, CGT, CGC
Asp: GAT, GAC                    Asn: AAT, AAC
Cys: TGT, TGC                    Glu: GAA, GAG
Gln: CAA, CAG                    Gly: GGA, GGG, GGT, GGC
His: CAT, CAC                    Ile: ATA, ATT, ATC
Leu: TTA, TTG, CTA, CTG, CTT, CTC
Lys: AAA, AAG                    Met: ATG (start)
Phe: TTT, TTC                    Pro: CCA, CCG, CCT, CCC
Ser: AGT, AGC, TCA, TCG, TCT, TCC
Thr: ACA, ACG, ACT, ACC          Trp: TGG
Tyr: TAT, TAC                    Val: GTA, GTG, GTT, GTC
Stop: TAA, TAG, TGA
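The codon table above can be turned into a small translation routine (a sketch; following the text's convention, codons are written with T rather than U, and translation stops at the first stop codon):

```python
# The standard codon table, keyed by amino acid (T written for U).
CODON_TABLE = {
    "Ala": ["GCA", "GCG", "GCT", "GCC"],
    "Arg": ["AGA", "AGG", "CGA", "CGG", "CGT", "CGC"],
    "Asp": ["GAT", "GAC"], "Asn": ["AAT", "AAC"],
    "Cys": ["TGT", "TGC"], "Glu": ["GAA", "GAG"],
    "Gln": ["CAA", "CAG"], "Gly": ["GGA", "GGG", "GGT", "GGC"],
    "His": ["CAT", "CAC"], "Ile": ["ATA", "ATT", "ATC"],
    "Leu": ["TTA", "TTG", "CTA", "CTG", "CTT", "CTC"],
    "Lys": ["AAA", "AAG"], "Met": ["ATG"],            # ATG is also the start codon
    "Phe": ["TTT", "TTC"], "Pro": ["CCA", "CCG", "CCT", "CCC"],
    "Ser": ["AGT", "AGC", "TCA", "TCG", "TCT", "TCC"],
    "Thr": ["ACA", "ACG", "ACT", "ACC"], "Trp": ["TGG"],
    "Tyr": ["TAT", "TAC"], "Val": ["GTA", "GTG", "GTT", "GTC"],
    "Stop": ["TAA", "TAG", "TGA"],
}
# Invert the table: codon -> amino acid (all 64 codons appear once).
AMINO = {c: aa for aa, cs in CODON_TABLE.items() for c in cs}

def translate(coding_region: str) -> list:
    """Translate a coding region codon by codon, beginning at the start
    codon ATG and stopping at the first stop codon."""
    assert coding_region.startswith("ATG"), "translation begins at ATG"
    protein = []
    for i in range(0, len(coding_region) - 2, 3):
        aa = AMINO[coding_region[i:i + 3]]
        if aa == "Stop":
            break
        protein.append(aa)
    return protein
```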
428
Gene Grammar Web Programming and Bioinformatics
This three-step process for protein synthesis is known as the central dogma of
molecular biology, though with its many exceptional cases it is no longer accepted as
a universally valid theory. However, to show a potential application of formal languages and
automata theory to molecular biology, we shall present an example based on this
dogma. As we did in Section 14.1 for HTML, we will transform an informal definition
of genes into a formal definition in terms of a grammar. (This example is an excerpt,
lightly edited, from David Searls' paper "The Linguistics of DNA," which appeared in
the journal American Scientist, Vol. 80, 1992.)
• The transcript, which is the part of the gene that is copied into a messenger RNA, has
flanking 5'- and 3'-untranslated-regions around a coding-region initiated by a start-
codon.
<gene> → <transcript>
<transcript> → <5'-untranslated-region><start-codon><coding-region>
               <3'-untranslated-region>
429
Gene Grammar Web Programming and Bioinformatics
• A stop-codon is any of the three triplets TAA, TAG and TGA, whereas a codon is any
of the remaining 61 possible triplets that are translated into an amino acid according to
the rule given in the codon table.
430
Gene Grammar Web Programming and Bioinformatics
• The first two bases of a codon partially specify the amino acid into which it is
translated. Every codon starting with GC (CG) is translated into alanine (respectively,
arginine), independent of the third base. Every codon starting with GA is translated into
aspartic acid if the third base is a pyrimidine (i.e., C or T). Otherwise, if the third base is
a purine (i.e., A or G), it is translated into glutamic acid. Here are some formal
representations of such translation rules specified in the codon table.
All the grammar rules presented up to this point are context-free. But the following
biological fact requires a rule from the upper level in the Chomsky hierarchy.
• Introns can be transposed to the left or right.
The following slide shows a grammar that specifies the set of “genes” that we have
developed in this section.
431
Gene Grammar Web Programming and Bioinformatics
<gene> → <transcript>
<transcript> → <5'-untranslated-region><start-codon><coding-region><3'-untranslated-region>
<coding-region> → <codon><coding-region> | <splice><coding-region> | <stop-codon>
---------------------------------------------------------------------------
<codon> → <Ala> | <Arg> | <Asp> | <Asn> | <Cys> | <Glu> | <Gln> | <Gly> | <His> | <Ile> |
          <Leu> | <Lys> | <Met> | <Phe> | <Pro> | <Ser> | <Thr> | <Trp> | <Tyr> | <Val>
<start-codon> → <Met>    <stop-codon> → TAA | TAG | TGA
---------------------------------------------------------------------------
<Ala> → GC<base>    <Arg> → CG<base>    <Asp> → GA<pyrimidine>    <Asn> → AA<pyrimidine>
<Cys> → TG<pyrimidine>    <Glu> → GA<purine>    <Gln> → CA<purine>    <Gly> → GG<base>
<His> → CA<pyrimidine>    <Ile> → AT<pyrimidine> | ATA    <Leu> → TT<purine> | CT<base>
<Lys> → AA<purine>    <Met> → ATG    <Phe> → TT<pyrimidine>    <Pro> → CC<base>
<Ser> → AG<pyrimidine> | TC<base>    <Thr> → AC<base>    <Trp> → TGG
<Tyr> → TA<pyrimidine>    <Val> → GT<base>
---------------------------------------------------------------------------
<base> → A | T | G | C    <pyrimidine> → C | T    <purine> → A | G
---------------------------------------------------------------------------
<splice> → <intron>    <intron> → GT<intron-body>AG
---------------------------------------------------------------------------
<intron><base> → <base><intron>    <base><intron> → <intron><base>
432
Gene Grammar Web Programming and Bioinformatics
The grammar that we have just developed is incomplete, because no rules are defined
for the nonterminal symbols <5'-untranslated-region>, <3'-untranslated-region> and
<intron-body>. The two untranslated regions flanking the coding-region usually
contain segments regulating the gene expression. The contents and the function of
these parts are not yet completely understood. The intron-body parts generally consist
of a long repeating string, whose exact role is also under investigation. More research
is needed before a formal model of these regions can be presented.
What else is love but understanding and rejoicing in the fact that another person lives, acts, and experiences
otherwise than we do….? - Friedrich Nietzsche –
Just because you love someone doesn't mean you have to be involved with them. Love is not a bandage to cover
wounds. - Hugh Elliot -
433
Hierarchy of the Models
434
Review Chomsky Hierarchy
[Figure: review of the Chomsky hierarchy, relating languages (grammars) to machines and other models: context-sensitive languages (type 1) are characterized by linear-bounded automata (LBA), and context-free languages (type 2) by pushdown automata (PDA). Colored arrows indicate proven containment and characterization relations.]
436
Hierarchy
15.1 Languages that TM’s cannot recognize
At the top level of the Chomsky hierarchy are the type 0 languages. In this
chapter we will prove that a language is type 0 if and only if it is recognizable
by a TM. Before we prove this characterization, we may ask the following question:
is there a language that cannot be recognized by a TM? The answer is yes, and we
will prove it first. A popular approach that we will use in this chapter is the
diagonalization technique introduced in Chapter 1. This technique requires an
enumeration (i.e., a listing out) of all possible target objects, such as certain types of
grammars, automata, or strings. Before using the diagonalization technique for the
proof, we need to formally define the term enumeration.
For example, the set of positive integers and the set of character strings on a finite
alphabet are enumerable, because we can build a TM that lists all the elements of the
sets on its tape (see the illustration on the next page). Notice that because these sets
are infinite, the enumerator never stops.
437
Hierarchy
Non-type 0 Languages
The figure below shows how we can build a TM that enumerates the set of
positive integers in binary. (The set Σ*, for a finite alphabet Σ, is also enumerable,
because we can build a TM that enumerates the set in lexicographic order.)
[Figure: a TM enumerator that begins by writing 1 on the blank tape.]
Enumerable sets are also called countable, because all the elements of an enumerable
set can be counted, i.e., put into one-to-one correspondence with the set of
positive integers. There are also uncountable sets. The following theorem shows an example.
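The two enumerations mentioned above can be sketched as generators (a Python sketch, not a TM; like the TM enumerator, neither generator ever stops on its own, so a caller takes finitely many elements at a time):

```python
from itertools import count, islice, product

def binary_integers():
    """Enumerate the positive integers in binary: 1, 10, 11, 100, ..."""
    for n in count(1):
        yield format(n, "b")

def sigma_star(sigma):
    """Enumerate all strings over the alphabet sigma in lexicographic
    (length-first) order, starting with the empty string."""
    for length in count(0):
        for letters in product(sigma, repeat=length):
            yield "".join(letters)
```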
438
Hierarchy
Non-type 0 Languages
Theorem 15.1 For a given finite alphabet Σ, the set of all languages over Σ (i.e., the set
of all subsets of Σ*) is not enumerable.
Proof. We use the diagonalization technique. Suppose that all the languages over Σ are
enumerable, i.e., we can list them out along an infinite line. Let Li be the i-th language
in the list. As illustrated in the figure below, consider a matrix M, where M[i, j] = 1 if
wi is in language Lj, and M[i, j] = 0 otherwise. (Notice that this is a conceptual setting. We
don't actually have to construct the whole matrix.)
        L1  L2  L3  L4  L5  . . .
w1       0   1   0   1   1
w2       1   1   0   0   1
w3       1   0   1   1   0
w4       0   1   1   0   0
.
.
                 M
439
Hierarchy
Non-type 0 Languages
LD = {wi | M[i, i] = 0}
We claim that LD is a language that does not appear in the list. Suppose that the i-th
language Li = LD. If wi ∈ Li, then wi ∉ LD. Otherwise, if wi ∉ Li, then wi ∈ LD. It
must be that Li ≠ LD. Notice that this proof holds even for a singleton alphabet,
for example, Σ = {a}.
440
Non-type 0 Languages Hierarchy
441
Hierarchy
Non-type 0 Languages
With no loss of generality, assume that every TM uses a finite number of state names
and tape symbols chosen, respectively, from the left ends of the following infinite lists
(pools): q0, q1, q2, . . . of state names and t0, t1, t2, . . . of tape symbols. As usual,
we will use R, L, N for the direction of the tape head move.
We further assume that q0 is the start state, q1 is the accepting state, t0 is the blank
symbol (B), t1 = a, and t2 = b. (Recall from the ruminations at the end of Chapters 4
and 5 that every TM can accept its language with one accepting state.) Finally, we
choose a format for describing the state transition function δ. For example, we may list δ as
follows.
δ(q0, a) = (q2, b, R); δ(q2, a) = (q5, a, L); δ(q0, b) = (q3, b, R); . . .
442
Hierarchy
Non-type 0 Languages
For simplicity, let's delete the parentheses and commas from the list, as follows.
q0t1=q2t2R; q2t1=q5t1L; q0t2=q3t2R; . . .
In this shortened list, every transition consists of six symbols and is delimited by a
semicolon. Though poorly legible, it carries the same information as the original list.
We are going to represent a TM by its transition function written in this format.
Recall that once all the symbols in the set of states Q, the input alphabet Σ, the tape alphabet Γ,
the start state q0, and the accepting state in F are fixed, we can uniquely represent a
TM by the transition function δ alone (or by the state transition graph).
We will use this list format for the convenience of converting it into a binary number in the
next step.
443
Hierarchy
Non-type 0 Languages
Let E be an encoding function (i.e., a homomorphism) such that, given a symbol used
in the transition function of a TM, E returns a binary code as follows.
E(;) = 010,  E(=) = 0 1^2 0,
E(L) = 0 1^3 0,  E(R) = 0 1^4 0,  E(N) = 0 1^5 0,
E(qi) = 0 1^(6+2i) 0,  E(ti) = 0 1^(7+2i) 0,  for i ≥ 0.
Notice that every binary code is uniquely identified by the number of 1's between
the two 0's. State codes have an even number of 1's, and tape-symbol codes an odd number
of 1's. Here are some examples of the binary encoding according to the above
encoding function E.
q0t1=q2t2R;  →  0 1^6 0 0 1^9 0 0 1^2 0 0 1^10 0 0 1^11 0 0 1^4 0 010
q2t2=q5t1L;  →  0 1^10 0 0 1^11 0 0 1^2 0 0 1^16 0 0 1^9 0 0 1^3 0 010
q0t2=q3t3R;  →  0 1^6 0 0 1^11 0 0 1^2 0 0 1^12 0 0 1^13 0 0 1^4 0 010
We extend the function E such that, given a TM M, E(M) gives the binary encoding of
M's transition function according to the above definition.
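The encoding function E can be implemented directly (a sketch; representing a transition as a list of symbol names like "q0" and "t1" is an assumption of this illustration):

```python
# The encoding function E defined above:
#   E(;) = 010, E(=) = 0 1^2 0, E(L) = 0 1^3 0, E(R) = 0 1^4 0,
#   E(N) = 0 1^5 0, E(q_i) = 0 1^(6+2i) 0, E(t_i) = 0 1^(7+2i) 0.
FIXED = {";": 1, "=": 2, "L": 3, "R": 4, "N": 5}

def E(symbol: str) -> str:
    """Binary code of one transition-function symbol."""
    if symbol in FIXED:
        n = FIXED[symbol]
    elif symbol.startswith("q"):
        n = 6 + 2 * int(symbol[1:])    # states: even number of 1's
    elif symbol.startswith("t"):
        n = 7 + 2 * int(symbol[1:])    # tape symbols: odd number of 1's
    else:
        raise ValueError("unknown symbol: " + symbol)
    return "0" + "1" * n + "0"

def encode(transition: list) -> str:
    """Encode one transition given as a list of symbols, e.g.
    ["q0", "t1", "=", "q2", "t2", "R", ";"]."""
    return "".join(E(s) for s in transition)
```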
444
Hierarchy
Non-type 0 Languages
[Figure: a TM TE that enumerates the set of binary encoded TMs.]
445
Hierarchy
Non-type 0 Languages
Corollary. The set of all type 0 languages over a given finite alphabet is enumerable.
Type 0 languages are also called recursively enumerable (R.E.) languages. By
Theorem 15.1, we know that for a given finite alphabet Σ, it is impossible to enumerate
all the languages in Σ*. With that theorem and the above Corollary, we can claim the
following theorem.
Theorem 15.3 Let Σ be a finite alphabet. There is a language in Σ* that cannot be
recognized by a TM.
446
Hierarchy
Non-type 0 Languages
Notice that this theorem also holds for a singleton alphabet, for example Σ = {a}.
The same binary encoding technique that we have introduced for TMs is also
applicable to grammars and other automata. Thus, we can also claim the following.
Theorem 15.4 The set of all binary encoded CSGs is enumerable.
- We make a living by what we get; we make a life by what we give. - Sir Winston Churchill -
- Life is something that happens when you can't get to sleep. - Fran Lebowitz -
- Life is something that everyone should try at least once. - Henry J. Tillman -
- Life is pleasant. Death is peaceful. It's the transition that's troublesome. - Isaac Asimov -
- Life is what happens to you while you're busy making other plans. - John Lennon -
- The secret of a good life is to have the right loyalties and hold them in the right scale of values. - Norman Thomas -
447
Hierarchy
15.2 Universal TM
Definition 15.2 (Universal TM). We call a TM Tu universal, if Tu, given an arbitrary
TM M and an input x, simulates the computation of M on the input x.
Here is an idea of how to construct a universal TM Tu. The input M and x given to
Tu must be represented in an agreed format. Suppose that they are binary
encoded as E(M) and E(x), using the same function E that we developed in the
previous section. Tu records E(M), E(x), the current state E(qi) of M, and the head
position, as the following figure illustrates. Using this information on its tape, Tu
follows M's moves one by one. If M enters an accepting state, so does Tu. If M meets
an undefined transition, Tu terminates the simulation.
[Figure: the tape of Tu, holding E(M), E(x), the current state E(qi) of M, and M's head position.]
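The simulation loop of Tu can be sketched concretely (a sketch that keeps M's transition function as a lookup table rather than a binary encoding; the step budget max_steps is an artifact of the sketch, since a real Tu simply runs forever when M loops):

```python
# Sketch of the simulation performed by a universal machine.  States and
# tape symbols are plain strings; "B" is the blank symbol, and the
# single accepting state is passed in, as assumed in the text.
def simulate(delta, accept, tape, max_steps):
    """Run the TM from state q0 on `tape`.  Return True if it accepts
    within max_steps moves, False if it halts on an undefined
    transition, and None if the step budget runs out."""
    state, head = "q0", 0
    cells = dict(enumerate(tape))       # sparse two-way-infinite tape
    for _ in range(max_steps):
        if state == accept:
            return True
        key = (state, cells.get(head, "B"))
        if key not in delta:
            return False                # undefined transition: halt
        state, symbol, move = delta[key]
        cells[head] = symbol
        head += {"R": 1, "L": -1, "N": 0}[move]
    return None
```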
448
Hierarchy
Universal TM
Break Time
Science & Research
The whole science is nothing more than a refinement of everyday thinking. - Albert Einstein –
Research is of mind, not of hands, a concentration of thought and not a process of experimentation. Research is the
effort of the mind to comprehend relationships which no one had previously known. - DeForest Arnold -
449
Hierarchy
15.3 Enumerable Languages
Now, we turn our attention from the topic of enumerating the sets of automata and
languages to enumerating a language itself. We first prove the following theorem,
which claims that enumerability is another characterization of type 0 languages.
Theorem 15.5 (1) If a language L is enumerable, then L is a type 0 (R.E.) language. (2)
If a language L is type 0 (R.E.), then L is enumerable.
Proof (1). Let Me be a TM (i.e., an enumerator) that enumerates L. With Me we can build
a 2-tape TM M recognizing L as follows. Given an input x, M, running Me in its finite state
control, writes the next string wi in the enumeration and checks whether x = wi. If so,
M enters an accepting state and halts. Otherwise, it writes the next string wi+1 in
the enumeration and repeats the process. M accepts x if and only if x ∈ L.
[Figure: the TM M, running the enumerator Me in its finite state control and comparing each enumerated string wi with the input x.]
450
Enumerable Languages Hierarchy
Proof (2). This part of the proof is not as simple as that of part (1). Let M be a
TM recognizing L. We build an enumerator Me for L using M. Let Σ be the alphabet
of the language L.
We may think of the following straightforward idea. We let Me, while writing the
strings wi ∈ Σ* in lexicographic order on one of its tape tracks, check whether
wi is accepted by M. If it is accepted, we let it be the next string in the enumeration.
Otherwise, we erase it and repeat the process with the next string in Σ*.
There is a serious problem with this idea. What happens to Me if M enters an
infinite loop without accepting wi? We cannot simply have Me leave wi out of the
enumeration, because it is impossible for Me to decide whether M will eventually
halt or not (this is an unsolvable decision problem). Thus, the idea fails.
[Figure: the straightforward (failing) idea: Me writes w1, w2, w3, w4, . . . on one track and keeps on another track only the strings accepted by M, e.g., w2, w4, w7, . . .]
451
Enumerable Languages Hierarchy
[Figure: the dovetailing schedule: Me simulates M on w1 for 1 step; then on w1 and w2 for 2 steps each; then for 3 steps; 4 steps; and so on, appending each string to the enumeration as soon as M accepts it.]
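The figure above suggests the standard fix, dovetailing: run M on w1 for one step, then on w1 and w2 for two steps each, and so on, so that every accepted string is found after finitely many rounds even if M loops forever on some other string. A sketch (here accepts stands in for a step-bounded simulation of M, an assumption of this illustration):

```python
# Dovetailing sketch: round r simulates M on the first r strings for r
# steps each.  accepts(w, steps) is assumed to return True iff M
# accepts w within `steps` moves.
def dovetail(strings, accepts, rounds):
    """Enumerate, without repetition, the strings among strings[0:rounds]
    that M accepts within the round budget."""
    found = []
    for steps in range(1, rounds + 1):
        for w in strings[:steps]:
            if w not in found and accepts(w, steps):
                found.append(w)
    return found
```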
Enumerable Languages Hierarchy
453
Enumerable Languages Hierarchy
Proof (cont'd). We construct a matrix M such that for every i ≥ 0, the i-th row
corresponds to the i-th string wi, and the i-th column corresponds to the i-th TM Mi in
the enumeration. This is possible because the set of TMs and the set Σ* are both
enumerable. For the matrix entries, we let M[i, j] = 1 if Mj accepts string wi, and
M[i, j] = 0 otherwise.
Now, we claim that no TM recognizes the language L0 below, which contains the string wi
exactly when the i-th TM Mi in the enumeration does not accept the i-th string wi.
        M0  M1  M2  M3  M4  . . .
w0       0   1   1   1   0
w1       1   1   0   0   1
w2       1   0   0   0   0        L0 = { wi | M[i, i] = 0, i ≥ 0 }
w3       0   0   1   1   1
w4       1   1   1   0   0
.
454
Enumerable Languages Hierarchy
Suppose that there is a TM, say the i-th TM Mi in the enumeration, that recognizes L0. (We are
using the proof-by-contradiction technique.) Consider what happens if we give the i-th
string wi as input to Mi. (Notice that Mi and wi have the same subscript.) Either
wi ∈ L0 or wi ∉ L0. We examine these two cases.
(1) When wi ∈ L0: By the definition of the matrix, this case implies that M[i, i] = 0, i.e.,
Mi does not accept wi. It follows that wi ∉ L(Mi) = L0. We have a contradiction.
(2) When wi ∉ L0: By definition, this case implies that M[i, i] = 1. That is, Mi
accepts wi. It follows that wi ∈ L(Mi) = L0. Again, we have a contradiction.
So we must give up the assumption that L0 = L(Mi) for some i ≥ 0. No TM exists
that recognizes L0. Language L0 is not type 0 (R.E.).
        M0  M1  M2  M3  M4  . . .
w0       0   1   1   1   0
w1       1   1   0   0   1        L0 = { wi | M[i, i] = 0, i ≥ 0 }
w2       1   0   0   0   0           = { w0, w2, w4, . . . }
w3       0   0   1   1   1
w4       1   1   1   0   0
.
455
Enumerable Languages Hierarchy
        M0  M1  M2  M3  M4  . . .
w0       0   1   1   1   0
w1       1   1   0   0   1        L1 = { wi | M[i, i] = 1, i ≥ 0 }
w2       1   0   0   0   0           = { w1, w3, . . . }
w3       0   0   1   1   1
w4       1   1   1   0   0
.
456
Enumerable Languages Hierarchy
[Figure: a TM MD with the input x on its input tape, running a TM enumerator and a Σ* enumerator and comparing each enumerated string wi with x.]
457
Hierarchy
Enumerable Languages
When you were born, you were crying and everyone around you was smiling.
Live your life so that when you die, you're smiling and everyone around you is crying, with tears of joy for having
known you. - Anonymous -
458
Hierarchy
15.4 Recursive Language
The Turing machine model that we have defined halts if the input is accepted by
the machine. The definition does not explicitly say what the machine should do if
the input is not accepted. The machine may either enter a dead state (if an undefined
input symbol is read in a state) or never halt, going around an indefinite loop of non-
accepting states. (Recall that for convenience, we do not show transitions entering
the dead state.)
How about defining a TM model which, as the following figure illustrates, always
halts after some finite number of moves, in either an accepting state or a non-
accepting state? Such TMs and their languages are called recursive.
[Figure: a halting TM, with transitions such as (a, b, N) leading to a "reject and halt" state.]
459
Hierarchy
Recursive Languages
This model could be more user-friendly than the original one. We may let it output
a kind message before halting, like "Yes, I accept the input!" or "No, I regret that I
have to reject it." For the original TM model, it is impossible to give any specific
message when the input is not accepted, because there is no way to know whether
the machine will enter an infinite loop or not.
Interestingly, we will show that the recursive (i.e., halting) TM model is weaker
than the original model in the sense that there is a type 0 language that cannot be
recognized by any recursive TM. Before we show the proof, we define the following.
460
Hierarchy
Recursive Languages
Let Type-Rec be the class of recursive languages. We know that every recursive
language is type 0. According to the Corollary of Theorem 15.7, there is a type 0
language whose complement is not type 0. Thus, Theorems 15.7 and 15.8 above
imply the following theorem.
Theorem 15.9 Type-Rec is properly contained in Type-0L.
[Figure: Type-Rec drawn as a proper subset of Type-0L.]
461
Recursive Languages Hierarchy
Now that the top level (i.e., type 0) of the Chomsky hierarchy has been refined, we
move down to the next lower level (i.e., type 1). We will first prove Theorem 15.10
below, which claims that every context-sensitive language is recursive. Then, using
this theorem and Theorem 15.9, we will show that Type-1L ⊂ Type-0L.
[Figure: a TM M that, given x and a grammar G, checks whether S ⇒ . . . ⇒ x.]
462
Recursive Languages Hierarchy
(1) wi = x;   (2) wi = wj for some j < i;   (3) (|wi| > |x|) OR (wi ∈ (VT)+ AND wi ≠ x)
[Figure: a recursive TM M that, given x, generates the derivations S ⇒ w1 ⇒ w2 ⇒ . . . of the grammar G, checking each sentential form wi against the three conditions above.]
463
Recursive Languages Hierarchy
If condition (1) is satisfied, M halts in an accepting state. Case (2) implies that the
derivation got into a loop. So, M halts in a non-accepting state. Case (3) implies that
according to the non-contracting property of context-sensitive grammar rules, it is
impossible to derive string x. Thus in this case too M halts in a non-accepting state.
M is a recursive (i.e., halting) TM that recognizes L(G). It follows that every context-
sensitive language is recursive.
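The halting membership test described above can be sketched as an exhaustive search over derivations (a toy illustration, not the TM construction itself; representing rules as pairs of strings with uppercase letters as nonterminals is an assumption of this sketch):

```python
# Halting membership test for a non-contracting grammar: explore all
# derivations, pruning a sentential form when it repeats (condition 2)
# or grows longer than x (condition 3).  Because sentential forms of
# bounded length over a finite alphabet are finite in number, and no
# form is expanded twice, the search always halts.
def derives(rules, start, x):
    """True iff the grammar (a list of (lhs, rhs) string pairs,
    uppercase = nonterminal) derives the string x."""
    seen = set()
    frontier = [start]
    while frontier:
        w = frontier.pop()
        if w == x:                        # condition (1): x derived
            return True
        if w in seen or len(w) > len(x):  # conditions (2) and (3)
            continue
        seen.add(w)
        for lhs, rhs in rules:            # apply each rule everywhere
            i = w.find(lhs)
            while i != -1:
                frontier.append(w[:i] + rhs + w[i + len(lhs):])
                i = w.find(lhs, i + 1)
    return False
```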
Now, we may ask the following question: Is the converse of Theorem 15.10 true?
(That is, is every recursive language context-sensitive?) In terms of automata, this is
equivalent to the question of whether every language recognized by a recursive TM
can also be recognized by an LBA. The next theorem shows that there is a recursive
language that is not context-sensitive.
464
Recursive Languages Hierarchy
        G0  G1  G2  G3  G4  . . .
w0       0   1   1   1   0
w1       1   1   0   0   1
w2       1   0   0   0   0
w3       0   0   1   1   1
w4       1   1   1   0   0
.
465
Recursive Languages Hierarchy
Now, going along the diagonal of the matrix M, we pick every wi with wi ∉ L(Gi) and construct
the language
L0 = { wi | M[i, i] = 0, i ≥ 0 } = { w0, w2, w4, . . . }
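The diagonal selection can be replayed on the finite top-left fragment of the matrix shown above (a plain computation, just to illustrate the definition of L0):

```python
# M[i][j] = 1 iff wi is in L(Gj); the 5-by-5 fragment of the matrix above.
M = [
    [0, 1, 1, 1, 0],   # w0
    [1, 1, 0, 0, 1],   # w1
    [1, 0, 0, 0, 0],   # w2
    [0, 0, 1, 1, 1],   # w3
    [1, 1, 1, 0, 0],   # w4
]

# L0 collects every wi that is NOT in the language of its own grammar Gi.
L0 = ["w%d" % i for i in range(len(M)) if M[i][i] == 0]
print(L0)   # ['w0', 'w2', 'w4']
```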
We first prove that L0 is not context-sensitive. This proof is very similar to that
of Theorem 15.3, where we exhibited a language that cannot be recognized by any TM.
Suppose that L0 is the language of some CSG Gi, i.e., L0 = L(Gi). Consider the string wi.
(Notice that wi and Gi have the same subscript.) Either wi ∈ L0 or wi ∉ L0. We
examine each of these cases.
(1) Case of wi ∈ L0: By the definition of L0, we have M[i, i] = 0, which implies that
wi ∉ L(Gi) = L0. We have a contradiction.
(2) Case of wi ∉ L0: Again by the definition of L0, we have M[i, i] = 1, which implies
wi ∈ L(Gi) = L0. Again, we have a contradiction.
So we must give up the supposition that L0 = L(Gi). Language L0 is not context-sensitive.
Now, we will show that L0 is recursive. The proof is similar to the proof of Theorem
15.7. We construct a recursive TM M0 which recognizes L0 as follows. Suppose that
on the input tape of M0, a string x is given.
(Figure: M0 runs two enumerators, a CSG enumerator producing G0, G1, . . . , Gi, . . . and a Σ* enumerator producing w0, w1, . . . , wi, . . . , on two tapes.)
The machine M0 has two enumerators, one for the set of context-sensitive
grammars and another for the set Σ*. Using these two enumerators, M0 keeps
alternately generating grammars and strings on two tapes until a string wi
equal to x appears. Then M0 tests whether Gi can derive x, using the idea
presented in the proof of Theorem 15.10. If this test shows that Gi can derive x, M0
rejects the input x and halts. Otherwise, M0 accepts x and halts. It follows that M0 is
a recursive TM that recognizes L0. Language L0 is recursive.
In this chapter we have introduced two additional classes of languages, i.e., the class
of recursive languages and the class of languages that are not type 0, and we have
identified their levels in the Chomsky hierarchy. The latter is the class of languages that
cannot be recognized by any TM, and the former is the class of languages recognized by
recursive TM's. The next page shows the Chomsky hierarchy, including the class of
recursive languages. (The class of non-type-0 languages is not shown.) In the figure, the
red arrows show the characterization relations and the proper containment
relations that we have proved so far.
Chomsky Hierarchy
(Figure: the Chomsky hierarchy, pairing languages with their automata — recursively enumerable sets (type 0) with Turing machines, and recursive languages with recursive Turing machines.)
Proof (1). Let G be an arbitrary type 0 grammar, and let {α1, α2, . . . , αn} be the set of
strings used for the left sides of the rules of G. Let {βi1, βi2, . . . , βik} be the set of strings
appearing on the right side of the rules with left side αi, i.e.,
αi → βi1 | βi2 | . . . | βik
(Notice that n and k are some finite integers.) We construct a TM M that recognizes
L(G) as follows. M has G stored in its finite state control as a look-up table.
Proofs of Characterizations
Suppose that a string x is given on its input tape, as illustrated in the figure below.
The machine M, using an extra work tape, tests whether x can be derived by the grammar
G. M starts the work by writing the start symbol S on the work tape. Suppose w is a
sentential form that can be derived starting with S. M, reading w from left to right,
nondeterministically chooses a substring αi and replaces it with one of its right sides
βij, also chosen nondeterministically.
M repeats this process until it sees that the current sentential form w is equal to the
input string x, and if it is, M halts in an accepting state. Notice that if x ∉ L(G), M
may repeat the process forever, or see a sentential form w on which no rule is
applicable and halt. M accepts x if and only if x is generated by G. It follows that M
recognizes L(G), and hence every type 0 language is recognized by a TM.
(Figure: the input tape holds x and the finite control holds G, while the work tape holds, in turn, S, . . . , a form containing αi, the form with βij substituted, . . . , and finally x.)
Type 0 Grammars and TM's
Proof (2). For the proof, let M = (Q, Σ, Γ, δ, q0, F) be an arbitrary TM. We construct
a type 0 grammar G that generates L(M). Let Q = {q0, q1, . . . , qn}, and let qf ∈ F. For
the construction we shall use two distinct symbols #, $ ∉ Γ.
Grammar G generates L(M) in the following three phases (see the figure below): (1)
for an arbitrary string x ∈ Σ*, grammar G generates the string x#q0x$. (2) Starting with
the start state q0, G simulates M's computation on input x, using the copy of x
generated to the right of q0 in the first phase. (3) If M enters the accepting state qf in
the second phase, G erases all symbols except the left copy of x generated in the first
phase.
We will show how G carries out the three phases. Following the convention of
using upper-case letters for the nonterminal symbols of a grammar, we shall use Qi
instead of the state symbol qi. In the figure below, the concatenation of the strings u and v
corresponds to the final tape content of M when it accepts x.
Let Σ = {a1, a2, . . . , ak} be the input alphabet of M. The following set of rules
carries out phase (1) to generate the string x#Q0x$ for every x ∈ Σ*. Notice that
for x = ε, the rules generate #Q0$.
S → A$ (1)
(Figure: four cases (a) – (d) showing how the context of Qi in the sentential form — (a) . . . bQia . . . , (b) . . . #Qia . . , (c) . . . Qi#a . . , (d) . . . aQi$ . . . — corresponds to the position of M's head on the tape.)
Depending on M's transitions, we design G's rules as follows. Part (a) below is for
the case of M reading a non-blank symbol a, and part (b) is for the case of reading
the blank symbol next to either end of the current tape contents. Grammar G uses
the two special symbols $ and # as boundary markers, respectively, to expand its
sentential form to the right and to the left to accommodate M's current tape contents.
It is simple to design a set of rules for phase (3). By the two rules in (a) below,
we let the nonterminal symbol Qf, which denotes the accepting state of M, move to
the right till it meets $. Then by the rules in part (b), we let QE keep moving to the
left and erasing every symbol it meets up to and including #.
CSG and LBA
Now, we prove the characterization relation between CSG's and LBA's. The
approach is similar to the one for the characterization between TM's and type 0
grammars that we have just completed. For the proof we should take two
restrictions into consideration: the tape space of an LBA is bounded by the input
length, and the grammar rules are non-contracting.
Theorem 15.13 (1) Every language generated by a CSG can be recognized by an
LBA. (2) For every language L recognizable by an LBA, there is a CSG G such that
L(G) = L – {ε}.
Proof (1). The LBA uses a two-track tape, instead of two tapes, to conveniently limit
the tape space. If the LBA sees that the next sentential form derived would be longer than x,
it stops the derivation and rejects the input string x, because it is then impossible to derive x
by applying non-contracting CSG rules.
(Figure: the input track holds [ x ] and the work track holds, in turn, [ S ], [ αi ], [ βij ], . . . , [ x ], with grammar G in the finite control.)
For the special case when the input is ε (i.e., the tape has [ ]), the machine enters
an accepting state if the grammar has the rule S → ε. (The details are omitted.)
Proof (2). Suppose L is the language of an LBA M = (Q, Σ, Γ, δ, q0, [, ], F),
where Q = {q0, q1, . . . , qm} and Σ = {a1, a2, . . . , ak}, for some constants m and k.
For the proof we will construct a CSG G such that L(G) = L(M) – {ε}.
For the construction, we are inclined to use the same three-phase approach
(repeated below) that we used to construct a type 0 grammar deriving every
input string x accepted by a TM.
(1) S ⇒ . . . ⇒ x#Q0x$  (2) ⇒ . . . ⇒ x#uQfv$  (3) ⇒ . . . ⇒ x
However, we have a problem following this approach because CSG rules are non-
contracting. Reviewing the type 0 grammar rules for the three-phase procedure, we
find that among the rules for phase (3) (repeated below), all the rules in line (b) are
contracting. These rules are used to erase the tape, leaving only the accepted input
string x.
(a) Qf# → #Qf    Qfa → aQf
(b) Qf$ → QE    aQE → QE    #QE → ε
Now, we are ready to show how G can derive the string x if and only if M accepts it.
After deriving a sentential form of the above format, the CSG G begins to "simulate" the
computation of M on the string [x1x2 . . . xn], going along the sentential form of
composite symbols. The leftmost elements of the composite symbols delimited by the
angle brackets are reserved for the final derivation when M accepts the input.
Seeing that M is in an accepting state, G erases all symbols in the current sentential
form except the reserved input symbols, consequently deriving the input string x =
x1x2 . . . xn.
Below, (1) – (4) are the rules for deriving all possible initial sentential forms shown above.
(Recall that in the above sentential form, xi is a member of the input alphabet Σ = {a1,
a2, . . . , ak}.) The rules in line (2) generate the initial sentential forms of all
possible input strings of length one. We need the rule in line (5) for the case when M
accepts ε. Notice that we can safely add this rule because no rule has S on its right side.
(Recall that CSG's are allowed to use this contracting rule under the condition that no
rule has S on its right side.)
(1) S → A<a1, (a1, ])> | A<a2, (a2, ])> | . . . | A<ak, (ak, ])>
(2)   | <a1, q0, ([, ], a1)> | <a2, q0, ([, ], a2)> | . . . | <ak, q0, ([, ], ak)>
(3) A → A<a1, a1> | A<a2, a2> | . . . | A<ak, ak>
(4)   | <a1, q0, ([, a1)> | <a2, q0, ([, a2)> | . . . | <ak, q0, ([, ak)>
(5) If the start state q0 is an accepting state, add S → ε.
For the simulation, G assumes that M is currently reading (in a state qi) the
rightmost element of the composite symbol (delimited by parentheses) appearing
immediately to the right of the symbol qi. The following figures show the correspondence
between the various contexts involving the state symbol qi in a sentential form and the
positions of M's tape head. (Recall that M does not keep the original input string x and
may change it while computing.)
(Figure: eight cases pairing the context of qi among the composite symbols — <xi, a><qi, (xi+1, b)>, <x1, qi, ([, a)>, <x1, qi, (a, [)>, <xn, qi, (], a)>, <xn, qi, (a, ])>, <x1, qi, ([, ], a)>, <x1, qi, ([, a, ])>, <x1, qi, (], a, [)> — with the corresponding position of M's head on the tape.)
Below is the set of rules of G simulating each move of M. The reader is strongly
recommended to refer to the figures above, which show the correspondence between the
context of qi and M's local configuration.
Finally, to complete the construction for phase (2), we add the rules shown below
for the special cases when M reads the boundary markers.
Now, we present below the set of rules needed for the final phase. With qf indicating
that M has accepted the input, the grammar erases all the symbols in the sentential
form, except for the leftmost elements reserved to generate the input string. Notice that
all the rules satisfy the non-contracting condition. (Recall from Section 1.1 that
composite symbols are introduced for ease of comprehension. They are no
different from the regular symbols; we are just extending the alphabet size.)
In the following rules, a, b ∈ Σ and c, d ∈ Γ.
(1) <qf, (a, c)><b, d> → <a, c><qf, (b, d)>    (2) <a, qf, (c, ])> → a
(3) <b, c>a → ba    (4) <b, ([, c)>a → ba
(5) <a, qf, ([, c, ])> → a    <a, qf, ([, ], c)> → a    <a, qf, (], c, [)> → a
The rule in part (1) keeps shifting qf to the right until it reaches the rightmost composite
symbol. Rule (2) extracts the rightmost input symbol (a in the figure). Rules (3) and (4)
let the grammar, going left along the sentential form, collect the reserved input
symbols to finally derive the input string accepted by M. The rules in part (5) are for the
special cases when the length of the input string is 1.
Grammar G derives a string x ∈ Σ* if and only if M accepts it, implying that
L(G) = L(M).
CFG’s and PDA’s
Now we finally complete the proofs of the characterizations in the Chomsky
hierarchy by proving the characterization relation between CFG's and PDA's.
Theorem 15.14 (1) Every language generated by a CFG can be recognized by a
PDA. (2) Every language recognized by a PDA can be generated by a CFG.
Proof (1). Let G be an arbitrary CFG. We construct an NPDA M that recognizes
L(G) as follows. The basic approach is the same as the one we took for the proof of the
characterization at the upper level. Using G, the machine nondeterministically derives
a terminal string in the stack to match the input string.
M starts the work by pushing the start symbol S onto the stack and does the
following until it sees that the stack is empty. Let A be a nonterminal symbol that
has some constant k rules:
A → β1 | β2 | . . . | βk
(a) If A is at the stack top, M nondeterministically chooses a rule A → βi, 1 ≤ i ≤ k,
pops A, and pushes βi.
(b) If a terminal symbol, say a, appears at the stack top, M reads the input and pops
the stack top if the input symbol and the stack top match.
(c) If the stack is empty (i.e., the stack top is Z0), M enters an accepting state.
The NPDA M enters an accepting state if and only if G derives the string given on the input
tape. It follows that L(M) = L(G).
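The three steps (a) – (c) can be sketched as a program. The sketch below is our own illustration of the construction, with the PDA's nondeterminism explored by backtracking search; the example grammar S → aSb | ab is an assumption for the demo, not taken from the text. (For grammars with cyclic ε- or unit productions the naive search may not terminate; the genuine NPDA is nondeterministic rather than a backtracking search.)

```python
def npda_accepts(rules, start, x):
    """Simulate the NPDA built from a CFG: the stack is a string whose top
    is the last character; nondeterministic choices are searched."""
    def run(stack, i):
        if not stack:                          # step (c): empty stack ->
            return i == len(x)                 # accept iff input consumed
        top, rest = stack[-1], stack[:-1]
        if top.isupper():                      # step (a): pop A, push some beta
            return any(run(rest + beta[::-1], i)    # reversed: beta's first
                       for lhs, beta in rules       # symbol lands on top
                       if lhs == top)
        # step (b): terminal at the top must match the next input symbol
        return i < len(x) and x[i] == top and run(rest, i + 1)
    return run(start, 0)

RULES = [("S", "aSb"), ("S", "ab")]            # our example grammar
```

For instance, npda_accepts(RULES, "S", "aabb") succeeds by expanding S twice and matching the terminals one by one.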
Example: Suppose that the CFG G shown in figure (a) below is given. We construct an
NPDA, as shown in figure (b), which recognizes L(G). Interestingly, we can show that
for any CFG G, three states are enough to construct an NPDA recognizing L(G).
(ε, S/aBC), (a, a/ε), (c, c/ε)
(ε, B/bBc), (ε, B/bc), (b, b/ε)
(ε, C/bCc), (ε, C/ε)
Proof (2). This proof is the last and most challenging among the four
characterizations between the models for grammars and automata. However, once
we understand the underlying concept of the approach, it requires only perseverance.
Here again we will present how to construct a CFG G for a given PDA M such that
L(G) = L(M). For the proof we assume that M has the following properties.
(1) M accepts the input by empty stack (i.e., Z0 is popped from the stack).
(2) In each move, M either pops the stack or pushes exactly one symbol onto it.
In Section 6.3, we showed that every context-free language can be recognized by a
PDA that accepts by empty stack (condition (1)). With an empty stack, no move is possible, so
we assume that there is one distinguished state that M enters after emptying the stack.
(If there is more than one such state, we can merge them all into one state.)
In the rumination section of Chapter 4, we showed that every context-free language
can be recognized by a DPDA with stack operations restricted to either popping or
pushing exactly one symbol in a move (condition (2)). (The same claim applies to
NPDA's.)
Thus, without loss of generality, we can assume that M meets the two conditions (1)
and (2) above. Otherwise, we can modify it without affecting the language.
Suppose that M, in a state p, with a symbol A on top of its stack, takes some
number of moves until the next stack symbol below A (symbol B in figure (a) below)
appears on the stack top for the first time, as shown in figure (b). Let q be the state of
M when B appears for the first time on the stack top, but before any operation with B.
Notice that B may never appear on the stack top, or the stack height may change many
times before B appears.
Let [p, A, q], called the stack-top erasing set (STES), denote the set of all input
string segments that M may read from the input tape from the time the machine
sees A at the stack top in state p until it first sees B under A in state q, as the following
figures illustrate. (Notice that [p, A, q] may be empty.)
(Figure: (a) M in state p with stack Z0 . . . BA; (b) after reading x, M in state q with stack Z0 . . . B; (c) [p, A, q] = {x, y, . . .}.)
Figure (b) below shows the non-empty STES's for the PDA in figure (a). Notice that,
among many others, [2, A, 5] and [1, Z0, 3] are empty STES's.
(Figure (a): a PDA with states 1 – 7 and transitions (a, Z0/AZ0) from 1 to 2, (a, A/aA) from 2 to 5, (b, a/ε) from 5 to 6, (b, A/ε) from 6 to 7, (b, A/ε) from 2 to 3, (ε, Z0/ε) from 7 to 4, and (ε, Z0/ε) from 3 to 4.)
(b) [5, a, 6] = {b}    [6, A, 7] = {b}    [7, Z0, 4] = {ε}
    [2, A, 7] = {abb}    [2, A, 3] = {b}    [1, Z0, 4] = {aabb, ab}
Let M = (Q, Σ, Γ, δ, q0, Z0, ∅) be the NPDA (accepting by empty stack). Here is an algorithm for constructing
a context-free grammar G = (VT, VN, S, P) such that L(G) = L(M).
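The algorithm's figure is omitted in this excerpt, but its steps, as described in the text, can be sketched as follows. This is our own rendering: moves are keyed by (state, input, stack top), a pushing move records the pushed pair (D, C) with D on top, and a popping move records the empty pair.

```python
def pda_to_cfg(states, moves, q0, Z0, t):
    """Construct the rule set P of G from the moves of M (accepting by
    empty stack in the distinguished state t)."""
    N = lambda p, A, q: "[%s,%s,%s]" % (p, A, q)
    P = [("S", [N(q0, Z0, t)])]                    # step (2): S -> [q0,Z0,t]
    for (p, a, A), successors in moves.items():
        for r, push in successors:
            if not push:                           # pop move delta(p,a,A)=(r,e):
                P.append((N(p, A, r), [a]))        #   [p,A,r] -> a
            else:                                  # push move delta(p,a,A)=(r,DC):
                D, C = push                        #   [p,A,q] -> a[r,D,s][s,C,q]
                for q in states:
                    for s in states:
                        P.append((N(p, A, q), [a, N(r, D, s), N(s, C, q)]))
    return P

# The PDA of Example 1 below (states p, r, s, q; accepts {aba}):
STATES = ["p", "r", "s", "q"]
MOVES = {("p", "a", "Z0"): [("r", ("A", "Z0"))],
         ("r", "b", "A"):  [("s", ())],
         ("s", "a", "Z0"): [("q", ())]}
G = pda_to_cfg(STATES, MOVES, "p", "Z0", "q")
```

Among the rules produced are those shown in Example 1 below, e.g. S → [p, Z0, q] and [r, A, s] → b; the many useless rules that the blind double loop over states also produces are removed later (Section 10.3).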
Now we need to understand the underlying idea for constructing, from the PDA M, the
CFG rules that generate L(M). The main objective is to design the grammar rules such
that x ∈ [p, A, q] if and only if the string x is derivable starting with the nonterminal
symbol [p, A, q]:
x ∈ [p, A, q] if and only if [p, A, q] ⇒ . . . ⇒ x
The STES [q0, Z0, t] contains all the strings that M reads, starting in the start state and
continuing until it enters the state t by emptying the stack (i.e., popping Z0). Hence, we get
L(M) = [q0, Z0, t]. It follows that G should be able to derive L(M) starting with the
nonterminal symbol [q0, Z0, t]. However, to use S as the start symbol following the
convention, in step (2) of the algorithm we put the rule S → [q0, Z0, t] in P.
Next, let us examine step (3) of the algorithm. Recall that [p, A, q] is the set of all
possible input segments that M may read, starting in state p with A on the stack top, until
the machine, in state q, sees the symbol under A (B in figure (a) below) appear for the
first time at the stack top. For example, if M has a move as shown in figure (b) below
(i.e., δ(p, a, A) = (q, ε)), then a ∈ [p, A, q].
(Figure: (a) M goes from state p with stack Z0 . . . BA to state q with stack Z0 . . . B; (b) the single move (a, A/ε) from p to q.)
Now, suppose that M has a pushing move, say δ(p, a, A) = (r1, DC). (Note that in
this move, A is changed to C, and D is pushed on top of it.) Let s1 be the state M
enters the first time it sees C at the stack top, and let q be the state M enters the first
time it sees B (which was under the A) at the stack top, as shown below. The set [p, A,
q] must contain all the strings in a[r1, D, s1][s1, C, q] (for example, axy in the figure),
i.e., [p, A, q] ⊇ a[r1, D, s1][s1, C, q].
(Figure: reading a, p goes to r1 with stack Z0 . . BCD; reading x, r1 goes to s1 with stack Z0 . . BC; reading y, s1 goes to q with stack Z0 . . . B.)
The following state transition graph helps us understand why this containment
relation holds.
[p, A, q] ⊇ a[r1, D, s1][s1, C, q]
(Figure: from p, the move (a, A/DC) leads to r1; while erasing D, the machine may first see C at the stack top in any of the states s11, s12, . . . , from which it continues on to q.)
Notice that, as the figure illustrates, there can be more than one state (s1i in the figure)
that M may enter the first time it sees C at the stack top. So the following relation holds.
Suppose that M has multiple pushing moves in the same state p with the same stack
top A, like δ(p, a1, A) = (r1, D1C1), δ(p, a2, A) = (r2, D2C2), . . . We apply the same idea
and let [p, A, q] include all the segments of the input strings that M may read until it
sees, for the first time, the symbol (B in the figure below) under DiCi. Thus, for this
more general case, we have the following (see the figure below, which illustrates the
case).
Now, suppose that M has some popping moves in state p with A at the stack top, for
example, δ(p, b1, A) = (q, ε), δ(p, b2, A) = (q, ε), . . . Since M has only popping and
pushing moves on the stack, we finally get the following.
Suppose that in state p, M has k1 pushing moves and k2 popping moves, and let n be
the number of states of M. The above equation can be expressed as follows.
[p, A, q] = ( ∪i=1..k1 ∪j=1..n ai[ri, Di, sij][sij, Ci, q] ) ∪ ( ∪i=1..k2 {bi} )
We know that for each triple p, q and A, the following equation holds.
Thus, for every nonterminal [p, A, q] in grammar G, we make the following rules.
This is exactly what step (3) of the algorithm does.
Example 1. For the PDA M shown in figure (a) below, figure (b) shows the CFG
constructed with the algorithm. Rule (1) is produced by step (2) of the algorithm,
rule (2) by step (3)-(a), and rules (3) and (4) by step (3)-(b). We can show that L(M) =
{aba} = L(G).
(a) PDA M: states p, r, s, q with transitions (a, Z0/AZ0) from p to r, (b, A/ε) from r to s, and (a, Z0/ε) from s to q.
(b) CFG G:
(1) S → [p, Z0, q]
(2) [p, Z0, q] → a[r, A, s][s, Z0, q]
(3) [r, A, s] → b
(4) [s, Z0, q] → a
Example 2. The PDA below is a little more complex than the one in Example 1.
Notice that this is an NPDA because of the two transitions in state 2 with A at the
stack top.
(Figure: states 1 – 7 with transitions (a, Z0/AZ0) from 1 to 2, (a, A/aA) from 2 to 5, (b, a/ε) from 5 to 6, (b, A/ε) from 6 to 7, (ε, A/ε) from 2 to 3, (ε, Z0/ε) from 7 to 4, and (b, Z0/ε) from 3 to 4.)
L(M) = L(G) = {aabb, ab}
S → [1, Z0, 4]
[1, Z0, 4] → a[2, A, 7][7, Z0, 4] | a[2, A, 3][3, Z0, 4]
[2, A, 7] → a[5, a, 6][6, A, 7]
[5, a, 6] → b    [6, A, 7] → b    [2, A, 3] → ε
[7, Z0, 4] → ε    [3, Z0, 4] → b
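We can check mechanically that the grammar above derives exactly {ab, aabb}. The sketch below is our own; the bracketed nonterminals are flattened to plain names, and the search enumerates every terminal string the grammar derives.

```python
# The grammar of Example 2, with [i, X, j] flattened to "iXj".
RULES = {
    "S":    [["1Z04"]],
    "1Z04": [["a", "2A7", "7Z04"], ["a", "2A3", "3Z04"]],
    "2A7":  [["a", "5a6", "6A7"]],
    "5a6":  [["b"]],
    "6A7":  [["b"]],
    "2A3":  [[]],          # [2, A, 3] -> epsilon
    "7Z04": [[]],          # [7, Z0, 4] -> epsilon
    "3Z04": [["b"]],
}

def language():
    """Collect all terminal strings.  This grammar has no recursion,
    so the search terminates."""
    results, stack = set(), [["S"]]
    while stack:
        form = stack.pop()
        nts = [i for i, s in enumerate(form) if s in RULES]
        if not nts:
            results.add("".join(form))
            continue
        i = nts[0]                       # expand the leftmost nonterminal
        for rhs in RULES[form[i]]:
            stack.append(form[:i] + rhs + form[i + 1:])
    return results

print(sorted(language()))   # ['aabb', 'ab']
```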
The grammar that we have just constructed has too many rules for such a
simple language. If we apply to this grammar the techniques for minimizing the number of
ε-production rules and eliminating the unit production rules (recall
the techniques in Sections 10.1 and 10.2), we can drastically simplify it
as shown below.
S → [1, Z0, 4]
[1, Z0, 4] → a[2, A, 7][7, Z0, 4] | a[2, A, 3][3, Z0, 4]
[2, A, 7] → a[5, a, 6][6, A, 7]
[5, a, 6] → b    [6, A, 7] → b    [2, A, 3] → ε
[7, Z0, 4] → ε    [3, Z0, 4] → b

simplifies to

S → aabb | ab
Example 3. Figure (a) below shows a PDA accepting (by empty stack) our familiar
language L = {aibi | i ≥ 1}. Figure (b) below shows the grammar G generated by the
algorithm. Again, it has too many rules to comprehend; we need to clean it up.
Rules (5) and (6) are useless, because the STES's they involve are empty. In state 2, M
makes only pushing moves, and it never sees (in the same state 2) the stack symbol
under the current stack top symbol A; so [2, A, 2] is empty. The set [3, A, 4] is empty,
since the machine empties the stack and terminates without reading the input. Thus
we can eliminate from the grammar all rules involving the two nonterminals [2, A, 2]
and [3, A, 4].
Cleaning up those useless symbols from the grammar (see Section 10.3), we get the
following simplified grammar.
(Figure (a): PDA M, with transitions including (a, A/AA).)
To further simplify the grammar, let us replace each of the composite symbols with a
unique nonterminal symbol, as shown in figure (c) below. Eliminating the unit productions
and minimizing the number of ε-production rules (presented in Sections 10.1 and 10.2),
we finally get the CFG shown in figure (d). Now we see that this grammar generates
the language L = {aibi | i ≥ 1}.
(d) (1) S → aB    (4) B → aBb | b
(a) If M has a move δ(p, a, A) = (r, BC) (where a ∈ Σ ∪ {ε}, r ∈ Q, and B, C ∈ Γ), then for every s ∈ Q, put the following
rule in P:
[p, A, q] → a[r, B, s][s, C, q]
Exercises
15.1 Prove that the union S1 ∪ S2 of two enumerable sets S1 and S2 is also enumerable.
15.2 Show that the set of all real numbers is not enumerable.
15.3 Prove that the class of languages which are not type 0 is not enumerable.