
CS138, Wim van Dam, UCSB

Automata and
Formal Languages

CS138, Winter 2006
Wim van Dam
Room 5109, Engr. I
vandam@cs.ucsb.edu
http://www.cs.ucsb.edu/~vandam/
Formalities
This Friday the Midterm will be returned.

Homework 3 will be announced on Friday
and will be due the following Friday (instead of Monday).

Questions?
Context Free Languages
Having dealt with regular languages, in the coming weeks
we will discuss the power of context free languages and
the pushdown automata that accept them.

This week: Context Free Languages, Chapter 5 in
An Introduction to Formal Languages and Automata by
Peter Linz [Reader, pp. 21-70] and Section 6.1 Methods
for Transforming Grammars, ibidem [Reader, pp. 71-80]

The notation of Linz is somewhat different from that of Sipser.
Sipser-Linz Dictionary
                        Sipser:          Linz:

Empty string            ε                λ
Conditional in sets     { x | x ∈ N }    { x : x ∈ N }
Letters                 alphabet Σ       Terminals T
Grammars
Linz defines grammars as follows:
A grammar G is defined by (V,T,S,P), where
V is a finite set of variables
T is a finite set of terminal symbols (think alphabet Σ)
S ∈ V is the special start variable
P is a finite set of productions

Each grammar G defines a language L(G), which is the
set of strings in T* (= Σ*) that G can generate from S.
It is all about the production rules.
Context Free Grammars
A Context Free Grammar (V,T,S,P) is a grammar
where all production rules are of the form:
A → x, with A ∈ V and x ∈ (V ∪ T)*

Example 5.1: Let G = ({S}, {a,b}, S, P) with for P:
S → aSa, S → bSb, and S → λ.
Some derivations from this grammar:
S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa
S ⇒ bSb ⇒ baSab ⇒ baab, and so on.

In general S ⇒ … ⇒ ww^R for w ∈ {a,b}*,
where w^R denotes the reverse of w.
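
The grammar of Example 5.1 can be tried out mechanically. The following is a minimal sketch, not part of the original slides: it assumes a grammar is stored as a Python dictionary from variables to right-hand sides, that every symbol is a single character, and that the empty string "" stands for the empty right-hand side. The names RULES and generate are made up for this illustration. The pruning step relies on the fact that terminals never disappear from a sentential form, and the search is only guaranteed to stop for grammars like this one, where every growing rule adds terminals.

```python
from collections import deque

# Example 5.1: S -> aSa | bSb | (empty string)
RULES = {"S": ["aSa", "bSb", ""]}

def generate(rules, start="S", max_len=4):
    """Return all terminal strings of length <= max_len derivable from start."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if only terminals remain.
        pos = next((i for i, c in enumerate(form) if c in rules), None)
        if pos is None:
            found.add(form)
            continue
        for rhs in rules[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            terminals = sum(1 for c in new if c not in rules)
            # Terminals never disappear, so longer forms cannot yield short words.
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda s: (len(s), s))

print(generate(RULES, max_len=4))
# ['', 'aa', 'bb', 'aaaa', 'abba', 'baab', 'bbbb']
```

Every generated word reads the same forwards and backwards and has even length, as expected for strings of the form ww^R.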

Context Free Languages
A single-step derivation consists of the substitution of a
variable by a string according to a production rule in P.
(Note that the rules are written with single arrows →,
while the derivations themselves use double arrows ⇒.)

A sequence of several derivation steps (or none) is indicated
by ⇒*. Previous example: S ⇒* aabbaa.

L is a Context Free Language if and only if there is a
context free grammar G = (V,T,S,P) such that:
L = L(G) = { w ∈ T* : S ⇒* w }
Why Context Free Languages?
Context-free languages allow us to describe languages
that are nonregular like { 0^n 1^n : n > 0 }.

CFLs are complex enough to give us a model for natural
languages (cf. Noam Chomsky) and programming languages.
The theory of CFLs is very closely related to the problem
of parsing a computer program.
Later we will see that CFLs are the languages that can
be recognized by automata that have one single stack:
{ 0^n 1^n : n > 0 } is a CFL
{ 0^n 1^n 0^n : n > 0 } is not a CFL
Some Remarks
The language L(G) = { w ∈ T* : S ⇒* w } contains
only strings of terminals, not variables.
Notation: We summarize several rules for one variable, such as
A → B
A → 01
A → AA
by A → B | 01 | AA.
Question: What is the CFG ({S}, {(,)}, S, P) that produces
the language of correct parentheses like (), (()), or ()(())?
Answer: S → (S) | SS | λ [see Example 5.4]
Another CFG Example
Consider the CFG G = ({S,Z}, {0,1}, S, P) with
P: S → 0S1 | 0Z1
   Z → 0Z | λ
What is the language generated by this G?

Answer: L(G) = { 0^i 1^j : i ≥ j ≥ 1 }

Specifically, S yields 0^(j+k) 1^j (with j ≥ 1 and k ≥ 0) according to:
S ⇒ 0S1 ⇒ … ⇒ 0^(j-1) S 1^(j-1) ⇒ 0^j Z 1^j
  ⇒ 0^j 0Z 1^j ⇒ … ⇒ 0^(j+k) Z 1^j ⇒ 0^(j+k) 1^j
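
The answer can be checked by brute force for short strings. The sketch below is an illustration under assumptions, not something from the slides: grammars are dictionaries from single-character variables to right-hand sides, "" encodes the empty right-hand side, and language_up_to is a hypothetical helper name. It enumerates every word of length at most 6 that the grammar derives and compares the result with the set of strings 0^i 1^j with i ≥ j ≥ 1. Note that 01 itself is derivable (S ⇒ 0Z1 ⇒ 01), so i = j is allowed.

```python
from collections import deque

# S -> 0S1 | 0Z1,  Z -> 0Z | (empty string)
RULES = {"S": ["0S1", "0Z1"], "Z": ["0Z", ""]}

def language_up_to(rules, max_len, start="S"):
    """Set of all terminal strings of length <= max_len derivable from start."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        pos = next((i for i, c in enumerate(form) if c in rules), None)
        if pos is None:
            words.add(form)
            continue
        for rhs in rules[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Forms with too many terminals can never shrink back to max_len.
            if sum(c not in rules for c in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

generated = language_up_to(RULES, 6)
expected = {"0" * i + "1" * j
            for j in range(1, 7) for i in range(j, 7) if i + j <= 6}
print(generated == expected)
```

A check like this is no proof, but it quickly exposes an off-by-one in the claimed description of L(G).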
Last Monday
A Context Free Grammar (V,T,S,P) is a grammar
where all production rules are of the form:
A → x, with A ∈ V and x ∈ (V ∪ T)*

Example 5.1: Let G = ({S}, {a,b}, S, P) with for P:
S → aSa, S → bSb, and S → λ.
In general we have S ⇒* ww^R for w ∈ {a,b}*,
hence L(G) = { ww^R : w ∈ {a,b}* }.

Questions
Can you make Context Free Grammars for the following?
a) { 0^n 1^n : n ≥ 0 }
b) { 0^n 1^m : n,m ≥ 0 }
c) Arithmetic formulas over a, b, c like a+b×c+a (without parentheses)

Answers:
a) S → 0S1 | λ
b) S → 0S | R and R → 1R | λ
c) S → a | b | c | S+S | S×S
Linear Grammars
A grammar is linear if and only if in every production rule at
most one variable occurs in the right hand side.
Example: S → (S) | SS | λ is not linear, but S → 0S1 | λ is.

A grammar (V,T,S,P) is right-linear if all production rules
are of the form A → xB or A → x with A,B ∈ V and x ∈ T*.
A grammar (V,T,S,P) is left-linear if all production rules are
of the form A → Bx or A → x with A,B ∈ V and x ∈ T*.

Note: All regular languages can be described by a right-
linear grammar (or a left-linear one), and vice versa.
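
These three definitions are purely syntactic, so they can be tested rule by rule. Here is a small sketch under the assumption that a grammar is given as a dictionary from single-character variables to lists of right-hand sides ("" for the empty right-hand side); the function name classify is invented for this example.

```python
def classify(rules):
    """Report whether a grammar is linear, right-linear and/or left-linear."""
    variables = set(rules)

    def var_positions(rhs):
        # Indices in the right-hand side that hold a variable.
        return [i for i, c in enumerate(rhs) if c in variables]

    all_rhs = [rhs for alternatives in rules.values() for rhs in alternatives]
    linear = all(len(var_positions(r)) <= 1 for r in all_rhs)
    # Right-linear: the only variable, if present, is the last symbol.
    right = linear and all(var_positions(r) in ([], [len(r) - 1]) for r in all_rhs)
    # Left-linear: the only variable, if present, is the first symbol.
    left = linear and all(var_positions(r) in ([], [0]) for r in all_rhs)
    return {"linear": linear, "right_linear": right, "left_linear": left}

print(classify({"S": ["(S)", "SS", ""]}))          # not linear
print(classify({"S": ["0S1", ""]}))                # linear, neither left nor right
print(classify({"S": ["0S", "R"], "R": ["1R", ""]}))  # right-linear
```

The last grammar is the answer to question b) from the earlier slide; being right-linear, it describes a regular language.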
Non Linear Grammars
Most CFGs will not be linear, which means that in the
derivation of a word we will often have more than one
variable in the sentential forms (example: S ⇒* xAyBz).
Note: in a derivation S ⇒ w_1 ⇒ w_2 ⇒ … ⇒ w_n ⇒ w, all
strings S, w_1, …, w_n ∈ (V ∪ T)* are called sentential forms.

A derivation is leftmost (rightmost) if in each derivation
step x ⇒ y the leftmost (rightmost) variable is replaced.

Requiring leftmost derivations does not limit the power
of a CFG but creates some order in the many ways one
can derive a single word. See, for example, S → (S) | SS | λ.
Order is Unimportant
Take the CFG S → 0 | 1 | ¬(S) | (S)∨(S) | (S)∧(S), which
generates all proper Boolean formulas that use 0, 1,
¬, ∨, ∧, ( and ).
Then (0)∨((0)∧(1)) can be derived in the following ways:
[leftmost] S ⇒ (S)∨(S) ⇒ (0)∨(S) ⇒ (0)∨((S)∧(S))
⇒ (0)∨((0)∧(S)) ⇒ (0)∨((0)∧(1))

[rightmost] S ⇒ (S)∨(S) ⇒ (S)∨((S)∧(S)) ⇒ (S)∨((S)∧(1))
⇒ (S)∨((0)∧(1)) ⇒ (0)∨((0)∧(1))

[something else] S ⇒ (S)∨(S) ⇒ (0)∨(S) ⇒ (0)∨((S)∧(S))
⇒ (0)∨((S)∧(1)) ⇒ (0)∨((0)∧(1))

The fact that it is irrelevant in which order we use the
production rules is expressed by the derivation tree.
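
The leftmost and rightmost derivations can be replayed step by step. The sketch below is an illustration only: it assumes single-character symbols and a dictionary representation of the grammar, and the function name derive is hypothetical. It applies a list of right-hand sides, each time to the leftmost or the rightmost variable, and rejects any step that is not a production of the grammar.

```python
# S -> 0 | 1 | ¬(S) | (S)∨(S) | (S)∧(S)
RULES = {"S": ["0", "1", "¬(S)", "(S)∨(S)", "(S)∧(S)"]}

def derive(rules, rhs_sequence, start="S", leftmost=True):
    """Return the list of sentential forms obtained by applying rhs_sequence."""
    form, steps = start, [start]
    for rhs in rhs_sequence:
        positions = [i for i, c in enumerate(form) if c in rules]
        if not positions:
            raise ValueError("no variable left to replace")
        pos = positions[0] if leftmost else positions[-1]
        if rhs not in rules[form[pos]]:
            raise ValueError(f"{form[pos]} -> {rhs} is not a production")
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(form)
    return steps

left = derive(RULES, ["(S)∨(S)", "0", "(S)∧(S)", "0", "1"], leftmost=True)
right = derive(RULES, ["(S)∨(S)", "(S)∧(S)", "1", "0", "0"], leftmost=False)
print(" ⇒ ".join(left))
print(" ⇒ ".join(right))
```

Both runs use the same five productions and end in the same string; only the order in which the variables are expanded differs.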
Derivation Trees
The derivation S ⇒* (0)∨((0)∧(1)) can be expressed by
the following derivation tree (children are indented below
their parent and listed from left to right):

S
  (
  S
    0
  )
  ∨
  (
  S
    (
    S
      0
    )
    ∧
    (
    S
      1
    )
  )
Reading Tree Leaves
Application of a production rule A → x is
represented by node A with children x.
(Note that the tree is ordered:
the ordering of the nodes matters.)

The root has variable S.

The yield of S is
expressed by the
leaves of the tree.
Defining a Tree
Definition 5.3: For a CFG G = (V,T,S,P) a derivation
tree has the following properties:

1) The root is labeled S
2) Each leaf is from T ∪ {λ}
3) Each interior node is from V
4) If a node has label A ∈ V and
   its children are a_1, …, a_n (from left to right),
   then P must have the rule
   A → a_1 … a_n (with a_j ∈ V ∪ T ∪ {λ})
5) A leaf labeled λ is a single
   child (has no siblings).

For partial derivation trees we have:
2a) Each leaf is from V ∪ T ∪ {λ}
Purpose of Trees
Looking at a tree you see the derivation without the
unnecessary information about its order.

Theorem 5.1: Let G be a CFG. We have w ∈ L(G) if and
only if there exists a derivation tree of G with yield w.
Also, y is a sentential form of G if and only if there exists
a partial derivation tree of G with yield y.
Remember: the root always has to be S.
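
The properties of Definition 5.3 and the notion of yield translate directly into a small data structure. The following is a sketch under assumptions that are not in the slides: every symbol is a single character, a leaf with label "" plays the role of the empty string, and the names Node, tree_yield and is_derivation_tree are made up for this illustration.

```python
from dataclasses import dataclass, field

RULES = {"S": ["0", "1", "¬(S)", "(S)∨(S)", "(S)∧(S)"]}

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def tree_yield(node):
    """Concatenate the leaf labels from left to right."""
    if not node.children:
        return node.label
    return "".join(tree_yield(child) for child in node.children)

def is_derivation_tree(node, rules, is_root=True, start="S"):
    if is_root and node.label != start:          # property 1
        return False
    if not node.children:                        # property 2: terminal or ""
        return node.label not in rules
    if node.label not in rules:                  # property 3
        return False
    labels = [child.label for child in node.children]
    if "" in labels and len(labels) != 1:        # property 5
        return False
    if "".join(labels) not in rules[node.label]: # property 4
        return False
    return all(is_derivation_tree(c, rules, False, start) for c in node.children)

def S(*kids):
    return Node("S", list(kids))

inner = S(Node("("), S(Node("0")), Node(")"), Node("∧"),
          Node("("), S(Node("1")), Node(")"))
tree = S(Node("("), S(Node("0")), Node(")"), Node("∨"),
         Node("("), inner, Node(")"))
print(tree_yield(tree), is_derivation_tree(tree, RULES))
```

To accept partial derivation trees as well, the leaf test would have to allow variables too (property 2a).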
Formalities
Homework will be announced later today.
This homework will be due Friday afternoon.

The Midterm has been graded; ask Yen Ting for it.

The scores are as follows:
The Midterm
[Figure: plot of HW 1+2 scores (vertical axis, 0 to 70)
against Midterm scores (horizontal axis, 0 to 60)]

Midterm scores:
average: 42.9
median: 43
min-max: 23-55

Correlation between
HWs 1+2 and the
Midterm: 0.60
Sipser-Linz Dictionary
                        Sipser:          Linz:

Empty string            ε                λ
Conditional in sets     { x | x ∈ N }    { x : x ∈ N }
Letters                 alphabet Σ       Terminals T
Union in RE             a ∪ b            a + b
Parsing
Generative aspect of CFG: By now it should be clear how,
from a CFG G, you can derive strings w ∈ L(G).

Analytical aspect: Given a CFG G and a string w, how do
you decide if w ∈ L(G) and, if so, how do you determine
the derivation tree or the sequence of production rules
that produce w? This is called the problem of parsing.
Exhaustive Parsing
Exhaustive parsing is a form of top-down parsing where
you start with S and systematically go through all possible
(say leftmost) derivations until you produce the string w.
(You can remove sentential forms that will not work.)

Example 5.7: Can the CFG S → SS | aSb | bSa | λ
produce the string w = aabb, and how?
After one step: S ⇒ SS or aSb or bSa or λ.
After two steps: S ⇒* SSS or aSbS or bSaS or S,
or S ⇒* aSSb or aaSbb or abSab or ab.
After three steps we see that: S ⇒ aSb ⇒ aaSbb ⇒ aabb.
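
The round-by-round search of Example 5.7 can be sketched as a breadth-first search over leftmost derivations. This is an illustrative sketch, not the course's reference implementation: it assumes single-character symbols, a dictionary representation of the grammar with "" for the empty right-hand side, and an explicit max_rounds cut-off (a made-up parameter) so that the search always stops. Two kinds of sentential forms are discarded because they cannot work: those with more terminals than w, and those whose terminal prefix disagrees with w.

```python
# Example 5.7: S -> SS | aSb | bSa | (empty string)
RULES = {"S": ["SS", "aSb", "bSa", ""]}

def exhaustive_parse(rules, w, start="S", max_rounds=10):
    """Return a leftmost derivation of w as a list of forms, or None."""
    frontier = [[start]]
    for _ in range(max_rounds):
        next_frontier, seen = [], set()
        for derivation in frontier:
            form = derivation[-1]
            pos = next((i for i, c in enumerate(form) if c in rules), None)
            if pos is None:
                continue  # only terminals left and not equal to w: dead end
            for rhs in rules[form[pos]]:
                new = form[:pos] + rhs + form[pos + 1:]
                if new == w:
                    return derivation + [new]
                # Length of the terminal prefix before the leftmost variable.
                prefix_len = next((i for i, c in enumerate(new) if c in rules),
                                  len(new))
                too_long = sum(c not in rules for c in new) > len(w)
                if too_long or not w.startswith(new[:prefix_len]):
                    continue
                if new not in seen:
                    seen.add(new)
                    next_frontier.append(derivation + [new])
        frontier = next_frontier
    return None

print(exhaustive_parse(RULES, "aabb"))
# ['S', 'aSb', 'aaSbb', 'aabb']
```

Without the max_rounds cut-off this grammar would make the search run forever on strings outside L(G), because the rules S → SS and S → λ let sentential forms grow and shrink again.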
Flaws of Exhaustive Parsing
Obvious flaw: it will take a long time and a lot of memory
for moderately long strings w: It is inefficient.

For cases w ∉ L(G) exhaustive parsing may never end.
This will especially happen if we have rules like A → λ that
make the sentential forms shrink so that we will never
know if we went too far with our parsing attempts.
Similar problems occur if the parsing can get in a loop
according to A ⇒ B ⇒ A ⇒ B ⇒ …
Fortunately, it is always possible to remove problematic
rules like A → λ and A → B from a CFG G
(when λ ∈ L(G), the new grammar generates L(G) without λ).


Exhaustive yet Finite Parsing
Theorem 5.2: Let G be a CFG without rules of the form
A → λ and A → B (with A,B ∈ V), then on any string w,
the exhaustive parsing method either produces w or halts
eventually such that we can conclude w ∉ L(G).
This derivation will require no more than 2|w| rounds.

The complexity of this algorithm is still exponential in the
length |w| of the string. We can do much better though:
Theorem 5.3: For every CFG G there exists a parsing
algorithm that runs in time O(|w|^3).
(This algorithm uses dynamic programming.)
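
The slide does not name the dynamic programming algorithm. One well-known method with cubic running time is the CYK algorithm, sketched below as an assumption about what is meant. CYK requires the grammar to be in Chomsky normal form (every rule is A → BC or A → a), which is not introduced on these slides. The example grammar for { 0^n 1^n : n ≥ 1 }, with S → AB | AC, C → SB, A → 0, B → 1, and the names UNIT, PAIR and cyk are all choices made for this illustration.

```python
# Chomsky normal form grammar for { 0^n 1^n : n >= 1 } (illustrative choice)
UNIT = {"0": {"A"}, "1": {"B"}}                 # A -> 0, B -> 1
PAIR = {("A", "B"): {"S"}, ("A", "C"): {"S"},   # S -> AB | AC
        ("S", "B"): {"C"}}                      # C -> SB

def cyk(w, unit_rules, pair_rules, start="S"):
    """Decide whether the CNF grammar derives w, in time O(|w|^3)."""
    n = len(w)
    if n == 0:
        return False  # this CNF grammar has no rule producing the empty string
    # table[i][l - 1] holds the variables that derive the substring w[i : i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(unit_rules.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[i][split - 1]:
                    for right in table[i + split][length - split - 1]:
                        table[i][length - 1] |= pair_rules.get((left, right), set())
    return start in table[0][n - 1]

for word in ["01", "0011", "001", "0101"]:
    print(word, cyk(word, UNIT, PAIR))
```

The three nested loops over length, start position and split point give the cubic bound; the work per table cell depends only on the size of the grammar.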
Simple Grammars
Definition 5.4: A CFG (V,T,S,P) is a simple grammar
(s-grammar) if and only if all its productions are of the form
A → ax with
A ∈ V, a ∈ T, x ∈ V* and any pair (A,a) occurs at most once.

Note, for simple grammars a leftmost derivation of a
string w ∈ L(G) is straightforward and requires time |w|.

Example: Take the s-grammar S → aS | bSS | c with aabcc:
S ⇒ aS ⇒ aaS ⇒ aabSS ⇒ aabcS ⇒ aabcc.
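
Because each pair (A,a) occurs at most once, the next input symbol always determines which rule to apply. A minimal sketch of the resulting linear-time parser follows; it assumes the rules are stored in a dictionary keyed by (variable, terminal), and the names S_RULES and s_parse are invented for this example.

```python
# s-grammar S -> aS | bSS | c, stored as (variable, terminal) -> trailing variables
S_RULES = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}

def s_parse(rules, w, start="S"):
    """Return the leftmost derivation of w as a list of forms, or None."""
    stack = [start]          # pending variables, leftmost variable on top
    consumed = ""
    derivation = [start]
    for ch in w:
        if not stack:
            return None      # input left over but nothing to expand
        variable = stack.pop()
        if (variable, ch) not in rules:
            return None      # no rule A -> ch x exists
        stack.extend(reversed(rules[(variable, ch)]))
        consumed += ch
        derivation.append(consumed + "".join(reversed(stack)))
    return derivation if not stack else None

print(" ⇒ ".join(s_parse(S_RULES, "aabcc")))
# S ⇒ aS ⇒ aaS ⇒ aabSS ⇒ aabcS ⇒ aabcc
```

Each input symbol triggers exactly one rule application, which is where the |w| bound on the number of derivation steps comes from.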
Ambiguity
A string w ∈ L(G) is derived ambiguously if it has
more than one derivation tree (or equivalently: if it has
more than one leftmost derivation (or rightmost)).

A grammar is ambiguous if some strings are derived
ambiguously.
Typical example: rule S → 0 | 1 | S+S | S×S

S ⇒ S+S ⇒ S×S+S ⇒ 0×S+S ⇒ 0×1+S ⇒ 0×1+1
versus
S ⇒ S×S ⇒ 0×S ⇒ 0×S+S ⇒ 0×1+S ⇒ 0×1+1
Ambiguity and Parse Trees
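The ambiguity of 0×1+1 is shown by the two
different parse trees (children are indented below
their parent and listed from left to right):

Tree 1, starting with S → S+S:
S
  S
    S
      0
    ×
    S
      1
  +
  S
    1

Tree 2, starting with S → S×S:
S
  S
    0
  ×
  S
    S
      1
    +
    S
      1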
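
For this particular grammar the number of parse trees of a string can be counted directly, which makes the ambiguity measurable. The sketch below is specific to the rule S → 0 | 1 | S+S | S×S and is only an illustration; the function name count_trees is made up. A parse tree is fixed by choosing which operator is applied at the root, so the count is the sum over all operator positions of the product of the counts for the two sides.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(s):
    """Number of parse trees of s for the grammar S -> 0 | 1 | S+S | S×S."""
    total = 1 if s in ("0", "1") else 0      # rules S -> 0 and S -> 1
    for k, ch in enumerate(s):
        if ch in ("+", "×"):                 # rules S -> S+S and S -> S×S
            total += count_trees(s[:k]) * count_trees(s[k + 1:])
    return total

print(count_trees("0+1"))      # 1: not ambiguous
print(count_trees("0×1+1"))    # 2: the two trees for this string
print(count_trees("0+1+1+1"))  # 5
```

A string is derived ambiguously exactly when this count is larger than 1; a count of 0 means the string is not in the language.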
More on Ambiguity
Note that the two different derivations:
S ⇒ S+S ⇒ 0+S ⇒ 0+1
and
S ⇒ S+S ⇒ S+1 ⇒ 0+1
do not make the string 0+1 ambiguous,
as they have the same parse tree:

S
  S
    0
  +
  S
    1

Ambiguity causes trouble when trying to interpret strings
like: She likes men who love women who don't smoke.

Solutions: Use parentheses, or use precedence rules
such as a+(b×c) = a+b×c ≠ (a+b)×c.
Inherently Ambiguous
Languages that can only be generated by ambiguous
grammars are inherently ambiguous.

Example 5.13: L = { a^n b^n c^m } ∪ { a^n b^m c^m }.

The way to make a CFG for this L somehow has to
involve the step S → S_1 | S_2 where S_1 produces the
strings a^n b^n c^m and S_2 the strings a^n b^m c^m.
This will be ambiguous on strings a^n b^n c^n.

Proving this rigorously is hard though.
Programming Languages
Programming languages are often defined as Context
Free Grammars in Backus-Naur Form (BNF).

Example:
<if_statement> ::= IF <expression><then_clause><else_clause>
<expression> ::= <term> | <expression>+<term>
<term> ::= <factor>|<term>*<factor>

The variables are indicated by <variable name>.
The arrow → is replaced by ::=
Here, IF, + and * are terminals.

Syntax checking is checking whether a program is an
element of the language generated by the CFG of the
programming language.
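
To connect BNF with the (V,T,S,P) notation, the example rules can be read into a table of productions. This is a rough sketch under simplifying assumptions: one rule per line, variables written as <name>, and no terminal containing the characters |, < or >. Real BNF dialects differ in these details, and the names BNF and parse_bnf are invented for this illustration.

```python
import re

BNF = """
<if_statement> ::= IF <expression><then_clause><else_clause>
<expression> ::= <term> | <expression>+<term>
<term> ::= <factor>|<term>*<factor>
"""

def parse_bnf(text):
    """Map each variable to its list of alternatives (lists of symbols)."""
    rules = {}
    for line in text.strip().splitlines():
        head, body = line.split("::=")
        # A symbol is either a <variable> or a run of terminal characters.
        rules[head.strip()] = [re.findall(r"<[^>]+>|[^<>\s]+", alternative)
                               for alternative in body.split("|")]
    return rules

for variable, alternatives in parse_bnf(BNF).items():
    print(variable, "→", alternatives)
```

In the resulting table the keys are variables of V, every symbol without angle brackets (IF, +, *) belongs to T, and each alternative is one production in P.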