
# Automata and Formal Languages

## CS138, Winter 2006

Wim van Dam, Room 5109, Engr. I, vandam@cs.ucsb.edu, http://www.cs.ucsb.edu/~vandam/
CS138, Wim van Dam, UCSB

## Formalities

The new Homework 3 is due on Friday afternoon. Questions?


## This Week

This week: Simplification of Context-Free Grammars, from *An Introduction to Formal Languages and Automata* by Peter Linz [Reader, pp. 71-85].

We will look at the important task of rewriting context-free grammars into equivalent ones to ease the computational problem of parsing the words of a CFG. Ultimately, we will describe a parsing algorithm that runs in polynomial time $O(|w|^3)$.

## Dealing with Empty String

Section 6.1: It does not really matter whether or not the empty string $\lambda$ is part of the language of a CFG $G = (V,T,S,P)$.

To add $\lambda$, define $V' = V \cup \{S_0\}$ and add $S_0 \to S \mid \lambda$ to $P$ (giving $P'$), such that the CFG $(V',T,S_0,P')$ produces $L(G) \cup \{\lambda\}$.

To remove $\lambda$: Exercise 13, p. 79.

From now on, we assume that our languages are $\lambda$-free.

## Substituting Production Rules

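Let $G = (V,T,S,P)$ have the rules $A \to x_1 B x_2$ (with $A \neq B$) and $B \to y_1 \mid y_2 \mid \dots \mid y_n$. Then the following CFG $G' = (V,T,S,P')$ is equivalent to $G$ (that is, $L(G) = L(G')$): $P'$ does not have the rule $A \to x_1 B x_2$; instead it has $A \to x_1 y_1 x_2 \mid x_1 y_2 x_2 \mid \dots \mid x_1 y_n x_2$.

Proof: That $G$ and $G'$ are equivalent should be obvious. Any derivation in $G$ that uses $A \to x_1 B x_2$ must later rewrite that $B$ with some $B \to y_i$, and the two steps together are exactly one of the new rules; conversely, each new rule can be simulated by those two steps in $G$.

Example: If $P$ has $A \to a \mid aaA \mid abBc$ and $B \to abbA \mid b$, then $P'$ has $A \to a \mid aaA \mid ababbAc \mid abbc$ and $B \to abbA \mid b$. Here $B$ has become useless in $P'$.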
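
The substitution rule can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the grammar is encoded as a dictionary from each variable to a set of right-hand-side strings, every symbol is a single character, and the function name `substitute` is made up for this sketch.

```python
def substitute(productions, a, rule, b):
    """Replace the rule a -> rule (which contains b) by the rules obtained
    from substituting every right-hand side of b for the first b in it.

    Assumed encoding: dict variable -> set of right-hand-side strings,
    one character per symbol.
    """
    if a == b:
        raise ValueError("the substitution rule requires A != B")
    if rule not in productions[a]:
        raise ValueError("rule is not a production of the given variable")
    i = rule.index(b)                 # rule = x1 + b + x2
    x1, x2 = rule[:i], rule[i + 1:]
    new = {v: set(rhs) for v, rhs in productions.items()}  # copy, keep input intact
    new[a].remove(rule)
    new[a] |= {x1 + y + x2 for y in productions[b]}
    return new


# The example from the slide: A -> a | aaA | abBc and B -> abbA | b.
P = {"A": {"a", "aaA", "abBc"}, "B": {"abbA", "b"}}
P_prime = substitute(P, "A", "abBc", "B")
print(sorted(P_prime["A"]))
```

The productions of `B` are left in place; whether `B` is still needed is a separate question, answered by the removal of useless rules.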

## Removing Useless Rules

Definition 6.1: A variable $A \in V$ of a CFG $G$ is useful if and only if there is a $w \in L(G)$ such that $S \Rightarrow^* xAy \Rightarrow^* w$. A variable that is not useful is useless; a production rule that uses a useless variable is a useless production rule.

Example 1: With $S \to A$, $A \to aA \mid \lambda$, and $B \to bA$, the variable $B$ is useless: it is impossible to get $S \Rightarrow^* xBy$.

Example 2: With $S \to aSb \mid \lambda \mid A$ and $A \to aA$, the variable $A$ is useless: it is impossible to get $A \Rightarrow^* x$ with $x \in T^*$.

## How to Remove Useless Rules

Theorem 6.2: There is an efficient algorithm that, given $G = (V,T,S,P)$, produces an equivalent grammar $G' = (V',T',S,P')$ that has no useless variables or production rules.

Proof:
1. By backtracking, collect all variables that can terminate (derive a string in $T^*$), and remove all those that cannot, together with the rules that use them.
2. Now, starting from $S$, draw the dependency graph of the reduced CFG to detect all variables $A$ such that $S \Rightarrow^* xAy$. Remove all other variables and their production rules.

The resulting CFG $G'$ is equivalent to $G$. See the Reader for the details.
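
The two steps of this proof can be sketched as follows. This is a minimal illustration, not the Reader's formulation; it assumes a grammar encoded as a dictionary from variables to sets of right-hand-side strings, with upper-case characters as variables, all other characters as terminals, and the empty string standing for $\lambda$. The name `remove_useless` is hypothetical.

```python
def remove_useless(productions, start):
    """Return the grammar without useless variables and rules.

    Assumed encoding: dict variable -> set of right-hand-side strings;
    upper-case characters are variables, "" is the empty string.
    """
    def all_ok(rhs, good):
        # every symbol is a terminal or a variable already known to be good
        return all(not s.isupper() or s in good for s in rhs)

    # Step 1: collect the variables that can derive a terminal string.
    terminating = set()
    changed = True
    while changed:
        changed = False
        for var, rules in productions.items():
            if var not in terminating and any(all_ok(r, terminating) for r in rules):
                terminating.add(var)
                changed = True
    reduced = {v: {r for r in rules if all_ok(r, terminating)}
               for v, rules in productions.items() if v in terminating}

    # Step 2: keep only the variables reachable from the start variable.
    reachable, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v in reachable or v not in reduced:
            continue
        reachable.add(v)
        for r in reduced[v]:
            stack.extend(s for s in r if s.isupper())
    return {v: reduced[v] for v in reachable}


# Example 1 from the definition above: B is unreachable from S.
print(remove_useless({"S": {"A"}, "A": {"aA", ""}, "B": {"bA"}}, "S"))
# Example 2: A can never terminate.
print(remove_useless({"S": {"aSb", "", "A"}, "A": {"aA"}}, "S"))
```

The order of the two steps matters: removing non-terminating variables first can make further variables unreachable, but not the other way around.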

## Removing $\lambda$-Productions

A $\lambda$-production rule is of the form $A \to \lambda$. Any variable $A$ for which we have $A \Rightarrow^* \lambda$ is called nullable.

Theorem 6.3: If the language of a CFG $G$ is $\lambda$-free, then we can efficiently rewrite $G$ to an equivalent CFG $G'$ without $\lambda$-production rules.

Proof: By backtracking, collect all nullable variables in $V_N$. For every rule $A \to x_1 x_2 \cdots x_m$ in $P$ with $m \geq 1$, add to $P'$ the rule itself as well as the rules obtained by replacing variables from $V_N$ by $\lambda$ in all possible combinations. There is one exception: if all $x_j$ are nullable, then $A \to \lambda$ is not added. Again, see the Reader for more details.

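## Example 6.5

Take the CFG defined by

$$S \to ABaC, \quad A \to BC, \quad B \to b \mid \lambda, \quad C \to D \mid \lambda, \quad D \to d.$$

The nullable variables are $V_N = \{A, B, C\}$. Thus we get the equivalent, $\lambda$-production-free CFG:

$$\begin{aligned} S &\to ABaC \mid BaC \mid AaC \mid ABa \mid aC \mid Ba \mid Aa \mid a \\ A &\to BC \mid B \mid C \\ B &\to b \\ C &\to D \\ D &\to d \end{aligned}$$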
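
The construction of Theorem 6.3 can be checked against Example 6.5 with a short sketch. As before, this assumes a hypothetical encoding (a dictionary from variables to sets of right-hand-side strings, one character per symbol, `""` for $\lambda$); the function names are chosen for this illustration only.

```python
from itertools import product


def nullable_variables(productions):
    """Collect all variables A with A =>* lambda (the set V_N)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, rules in productions.items():
            # a rule is nullable if all of its symbols are nullable ("" trivially is)
            if var not in nullable and any(all(s in nullable for s in r) for r in rules):
                nullable.add(var)
                changed = True
    return nullable


def remove_lambda_productions(productions):
    """Return an equivalent grammar without lambda-productions
    (assuming lambda is not in the language)."""
    nullable = nullable_variables(productions)
    result = {}
    for var, rules in productions.items():
        new_rules = set()
        for rhs in rules:
            # each nullable symbol is either kept or replaced by lambda
            options = [(s, "") if s in nullable else (s,) for s in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate:          # never add var -> lambda
                    new_rules.add(candidate)
        result[var] = new_rules
    return result


G = {"S": {"ABaC"}, "A": {"BC"}, "B": {"b", ""}, "C": {"D", ""}, "D": {"d"}}
print(sorted(nullable_variables(G)))
print(remove_lambda_productions(G))
```

A variable whose only rule was $A \to \lambda$ ends up with an empty rule set here; it is then useless and is cleaned up by the removal of useless rules.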

## Removing Unit Productions

After removing useless rules and $\lambda$-productions, we also want to get rid of unit-productions of the kind $A \to B$.

Theorem 6.4: For a $\lambda$-production-free CFG, we can make an equivalent CFG without unit-productions. We do this using the earlier described substitution rule (but be careful with cycles such as $A \to B$ and $B \to A$).

Proof: By backtracking, collect all pairs of variables with $A \Rightarrow^* B$. First, add to $P'$ all non-unit productions of $P$. Then, for all $(A,B)$ with $A \Rightarrow^* B$ and $B \to y_1 \mid y_2 \mid \dots \mid y_n$ in $P'$, add to $P'$ the productions $A \to y_1 \mid y_2 \mid \dots \mid y_n$.


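## Example 6.6

Take the CFG

$$S \to Aa \mid B, \quad B \to A \mid bb, \quad A \to a \mid bc \mid B.$$

The unit derivations for this CFG are $S \Rightarrow^* B$ and $S \Rightarrow^* A$, $B \Rightarrow^* A$, and $A \Rightarrow^* B$.

For the CFG $G'$, we start with the non-unit production rules

$$S \to Aa, \quad B \to bb, \quad A \to a \mid bc,$$

and add to them

$$S \to bb \mid a \mid bc, \quad B \to a \mid bc, \quad A \to bb.$$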
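
The proof of Theorem 6.4 can be sketched in code and checked against Example 6.6. The sketch assumes a $\lambda$-free grammar in a hypothetical encoding (dictionary from variables to sets of right-hand-side strings, one character per symbol); `remove_unit_productions` is a name invented for this illustration.

```python
def remove_unit_productions(productions):
    """Return an equivalent grammar without unit-productions.

    Assumes there are no lambda-productions, so A =>* B can only
    happen through a chain of unit-productions.
    """
    variables = set(productions)

    def is_unit(rhs):
        return len(rhs) == 1 and rhs in variables

    # unit_reach[A] = all B with A =>* B via unit-productions, including A itself.
    # The 'seen' set makes cycles such as A -> B, B -> A harmless.
    unit_reach = {}
    for a in variables:
        seen, stack = {a}, [a]
        while stack:
            v = stack.pop()
            for rhs in productions[v]:
                if is_unit(rhs) and rhs not in seen:
                    seen.add(rhs)
                    stack.append(rhs)
        unit_reach[a] = seen

    # A receives every non-unit production of every B it can reach.
    return {a: {rhs for b in unit_reach[a] for rhs in productions[b]
                if not is_unit(rhs)}
            for a in variables}


G = {"S": {"Aa", "B"}, "B": {"A", "bb"}, "A": {"a", "bc", "B"}}
print(remove_unit_productions(G))
```

Because each variable reaches itself, its own non-unit productions are kept without a separate step.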

## Putting It All Together

Theorem 6.5: For every context-free language $L$ without $\lambda$, there exists a context-free grammar $G$ that generates $L$ and has no useless productions, $\lambda$-productions, or unit-productions.

Proof: Perform the manipulations in the right order:
1. Remove $\lambda$-productions (might produce unit-productions).
2. Remove unit-productions (does not create $\lambda$-productions).
3. Remove useless productions (does not create unit- or $\lambda$-productions).

This theorem is useful for parsing algorithms.

## Today

Last Monday we saw how to transform a ($\lambda$-free) CFG into an equivalent CFG that has:
1. no $\lambda$-productions ($A \Rightarrow^* \lambda$)
2. no unit-productions ($A \Rightarrow^* B$)
3. no useless variables or useless productions

Today we will discuss two important normal forms, the Chomsky Normal Form and the Greibach Normal Form, and the fast parsing of CFGs in CNF [Reader, pp. 80-84].


## Chomsky Normal Form

Definition 6.4: A CFG is in Chomsky normal form if and only if all production rules are of the form $A \to BC$ or $A \to x$, with variables $A,B,C \in V$ and $x \in T$. (Sometimes the rule $S \to \lambda$ is also allowed.)

CFGs in CNF can be parsed in time $O(|w|^3)$.

Named after Noam Chomsky, who in the 1950s and 60s made seminal contributions to the field of theoretical linguistics (cf. the Chomsky hierarchy of languages).

## Theorem 6.6

Theorem 6.6: Every $\lambda$-free CFG $G$ can be described by an equivalent CFG $G'$ in Chomsky normal form. The transformation from $G$ to $G'$ can be done efficiently.

Outline of proof:
1. Rewrite $G$ to eliminate unit- and $\lambda$-productions.
2. Rewrite such that all terminal-producing rules are of the form $B_a \to a$.
3. Rewrite such that all variable-producing rules are of the form $A \to CD$ with $C,D \in V$.


## Details of Proof

Step 2: How do you transform general production rules of the kind $A \to y_1 \cdots y_n$ with $y_j \in V \cup T$ into rules that are of the kind $A \to y_1 \cdots y_n$ with $y_j \in V$, or $A \to y$ with $y \in T$?

Answer: Introduce terminal-producing variables $B_y \to y$ for each $y \in T$ and replace $y$ by $B_y$ in all relevant rules (those with $n \geq 2$).

Step 3: How do you transform production rules of the kind $A \to C_1 \cdots C_n$ with $C_j \in V$ into rules of the kind $A \to C_1 C_2$?

Answer: Make a chain of rules to produce $C_1 \cdots C_n$: $A \to C_1 D_1$ and $D_1 \to C_2 D_2$ and $\dots$ and $D_{n-2} \to C_{n-1} C_n$.

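## Example of Converting to CNF

Initial grammar: $S \to aSb \mid AAA$ and $A \to a \mid SA$.

Create terminal-producing variables $X$ and $Y$ for $a$ and $b$ to get:

$$S \to XSY \mid AAA, \quad A \to a \mid SA, \quad X \to a, \quad Y \to b.$$

Note that we do not create $A \to X$.

Make variable chains to get:

$$S \to X S_1 \mid A S_2, \quad S_1 \to SY, \quad S_2 \to AA, \quad A \to a \mid SA, \quad X \to a, \quad Y \to b.$$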
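
Steps 2 and 3 of the proof can be sketched as below and run on this example. The sketch assumes the grammar already has no unit- or $\lambda$-productions, and it uses an encoding chosen for this illustration: a dictionary from variable names to lists of tuples of symbols. The generated names `T_a`, `T_b`, `D1`, `D2` play the roles of $X$, $Y$, $S_1$, $S_2$ on the slide and are assumed not to clash with existing variables.

```python
def to_cnf(productions):
    """Convert a grammar without unit- and lambda-productions to CNF.

    Assumed encoding: dict variable -> list of right-hand sides,
    each a tuple of symbols; symbols that are not keys are terminals.
    """
    variables = set(productions)
    cnf = {}
    term_var = {}          # terminal -> name of its terminal-producing variable
    counter = 0            # numbering of the chain variables D1, D2, ...

    def add(var, rhs):
        cnf.setdefault(var, []).append(rhs)

    for var, rules in productions.items():
        for rhs in rules:
            if len(rhs) == 1:
                add(var, rhs)          # A -> a is already in CNF
                continue
            # Step 2: replace every terminal y by its variable B_y.
            symbols = []
            for s in rhs:
                if s not in variables:
                    if s not in term_var:
                        term_var[s] = f"T_{s}"
                        add(term_var[s], (s,))
                    s = term_var[s]
                symbols.append(s)
            # Step 3: break long right-hand sides into a chain of pairs.
            head = var
            while len(symbols) > 2:
                counter += 1
                chain_var = f"D{counter}"
                add(head, (symbols[0], chain_var))
                head, symbols = chain_var, symbols[1:]
            add(head, tuple(symbols))
    return cnf


G = {"S": [("a", "S", "b"), ("A", "A", "A")], "A": [("a",), ("S", "A")]}
for var, rules in to_cnf(G).items():
    print(var, "->", " | ".join(" ".join(r) for r in rules))
```

As on the slide, the rule $A \to a$ is left alone, so no $A \to X$ is created.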

## Greibach Normal Form

Definition 6.5: A CFG is in Greibach Normal Form if and only if all production rules are of the form $A \to ax$ with $a \in T$ and $x \in V^*$. Note: the same pair $(A,a)$ may occur in several rules (unlike in s-grammars).

Theorem 6.7: For every CFG $G$ with $\lambda \notin L(G)$ there is an equivalent CFG that is in Greibach Normal Form.

Proof: Just trust me on this one.

Example: We can rewrite $S \to ab \mid aS \mid aaS$ to GNF as $S \to aB \mid aS \mid aAS$, with $A \to a$ and $B \to b$.

## CFG Membership Algorithm

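The CYK algorithm (Cocke-Younger-Kasami) decides in time $O(|w|^3)$ whether or not $w \in L(G)$, with $G$ in Chomsky normal form.

How it works: Let the string be $w = a_1 \cdots a_n$ and define $V_{ik} = \{ A \in V : A \Rightarrow^* a_i \cdots a_k \}$ for all $1 \leq i \leq k \leq n$, so that the question becomes: is $S \in V_{1n}$? We solve this by first determining $V_{11}, V_{22}, \dots, V_{nn}$, then $V_{12}, V_{23}, \dots, V_{n-1,n}$, then $V_{13}, \dots$, up to the final $V_{1n}$.

Observation 1: Because of CNF, finding the $V_{ii}$ is trivial: $V_{ii} = \{ A \in V : A \to a_i \}$.

Observation 2: Also, $V_{ik}$ with $i < k$ is determined by the combinations $V_{ik} = \{ A \in V : A \to BC \text{ with } B \in V_{ij} \text{ and } C \in V_{j+1,k} \text{ for some } i \leq j < k \}$.

Using this dynamic programming technique we find $V_{1n}$.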

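## CYK in Action

Take the grammar

$$S \to AB \mid CC, \quad A \to CC, \quad B \to BC \mid 0, \quad C \to 0 \mid 1,$$

with $w = 1101$. Is $w \in L$?

Filling in the sets diagonal by diagonal gives:

| $V_{ik}$ | $k=1$ | $k=2$ | $k=3$ | $k=4$ |
|---|---|---|---|---|
| $i=1$ | $\{C\}$ | $\{S,A\}$ | $\{S\}$ | $\{S\}$ |
| $i=2$ | | $\{C\}$ | $\{S,A\}$ | $\{\}$ |
| $i=3$ | | | $\{B,C\}$ | $\{S,A,B\}$ |
| $i=4$ | | | | $\{C\}$ |

Since $S \in V_{14}$, we have $w \in L$. A derivation of $w$:

$$S \Rightarrow AB \Rightarrow CCB \Rightarrow 1CB \Rightarrow 11B \Rightarrow 11BC \Rightarrow 110C \Rightarrow 1101.$$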
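
The table-filling procedure can be sketched in Python and run on this example. This is an illustrative sketch, not code from the course: the grammar encoding (a dictionary from single-character variables to sets of right-hand-side strings, each either one terminal or two variables) and the names `cyk_table` and `cyk_accepts` are assumptions made here.

```python
def cyk_table(productions, w):
    """Compute all sets V_ik for a grammar in Chomsky normal form.

    Returns a dict with table[i, k] = V_ik, using 1-based indices
    as on the slides. Assumes len(w) >= 1.
    """
    n = len(w)
    table = {}
    # Observation 1: V_ii holds the variables A with a rule A -> a_i.
    for i in range(1, n + 1):
        table[i, i] = {a for a, rules in productions.items() if w[i - 1] in rules}
    # Observation 2: combine shorter substrings, by increasing length.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            k = i + length - 1
            cell = set()
            for j in range(i, k):                      # split point, i <= j < k
                for a, rules in productions.items():
                    for rhs in rules:
                        if (len(rhs) == 2 and rhs[0] in table[i, j]
                                and rhs[1] in table[j + 1, k]):
                            cell.add(a)
            table[i, k] = cell
    return table


def cyk_accepts(productions, start, w):
    """Decide whether w is in L(G): is the start variable in V_1n?"""
    return len(w) > 0 and start in cyk_table(productions, w)[1, len(w)]


G = {"S": {"AB", "CC"}, "A": {"CC"}, "B": {"BC", "0"}, "C": {"0", "1"}}
table = cyk_table(G, "1101")
print(sorted(table[1, 4]))
print(cyk_accepts(G, "S", "1101"))
```

Storing only the sets answers the membership question; to recover a derivation tree one would additionally remember, for each variable in a cell, one rule and split point that put it there.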

## Complexity of CYK

There are $O(n^2)$ variable sets $V_{ik}$ that we have to construct. For each set $V_{ik}$ there are no more than $n$ pairs $(V_{ij}, V_{j+1,k})$ that we have to consider to determine $V_{ik}$. In total, for a fixed grammar, the running time is upper bounded by $O(n^3)$.

Note that this does not include the time required to bring the CFG into Chomsky Normal Form (which can be done efficiently, though).
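
As a sanity check on the cubic bound, the split points can be counted exactly. This count is a routine calculation added for illustration, not taken from the Reader: a set $V_{ik}$ with $k - i = d$ has $d$ split points $j$, and there are $n - d$ such sets, so the total number of pairs $(V_{ij}, V_{j+1,k})$ considered is

$$\sum_{1 \leq i \leq k \leq n} (k - i) = \sum_{d=1}^{n-1} d\,(n - d) = \frac{n^3 - n}{6} = O(n^3).$$

For $n = 4$ this gives $10$ pairs. Each pair is checked against the production rules, which adds a factor that depends only on the grammar and not on $n$.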


## Formalities

The new Homework 3 is due today, 5pm. New homework will be announced this weekend. The midterm on context-free grammars will probably be later than originally planned (so, after Friday March 3). Coming Monday there will be no class. Questions?


## CYK in Action

For the same grammar and $w = 1101$, retracing the $V_{14} = \{S\}$ result gives the derivation tree:

- $S$
  - $A$
    - $C \to 1$
    - $C \to 1$
  - $B$
    - $B \to 0$
    - $C \to 1$

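## An Exercise (1)

Write into Chomsky Normal Form the CFG:

$$S \to aA \mid aBB, \quad A \to aaA \mid \lambda, \quad B \to bC \mid bbC, \quad C \to B.$$

Answer (1): First you remove the $\lambda$-productions ($A \to \lambda$):

$$S \to aA \mid aBB \mid a, \quad A \to aaA \mid aa, \quad B \to bC \mid bbC, \quad C \to B.$$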


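## An Exercise (2)

Answer (2): Next you remove the unit-productions from:

$$S \to aA \mid aBB \mid a, \quad A \to aaA \mid aa, \quad B \to bC \mid bbC, \quad C \to B.$$

Removing $C \to B$, we have to include the $C \Rightarrow^* B$ possibility, which can be done by substitution (Thm 6.4) and gives:

$$S \to aA \mid aBB \mid a, \quad A \to aaA \mid aa, \quad B \to bC \mid bbC, \quad C \to bC \mid bbC.$$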