Summer 2011
Handout 08
June 24th, 2011
Formal Grammars
Handout written by Maggie Johnson and Julie Zelenski.
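What is a grammar?
A grammar is a powerful tool for describing and analyzing languages. It is a set of rules
by which valid sentences in a language are constructed. Here's a trivial example of
English grammar:
    sentence    -> <subject> <verb-phrase> <object>
    subject     -> This | Computers | I
    verb-phrase -> <adverb> <verb> | <verb>
    adverb      -> never
    verb        -> is | run | am | tell
    object      -> the <noun> | a <noun> | <noun>
    noun        -> university | world | cheese | lies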
Using the above rules or productions, we can derive simple sentences such as these:
This is a university.
Computers run the world.
I am the cheese.
I never tell lies.
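Here is a leftmost derivation of the first sentence using these productions.
    sentence -> <subject> <verb-phrase> <object>
             -> This <verb-phrase> <object>
             -> This <verb> <object>
             -> This is <object>
             -> This is a <noun>
             -> This is a university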
In addition to several reasonable sentences, we can also derive nonsense like "Computers
run cheese" and "This am a lies". These sentences don't make semantic sense, but they
are syntactically correct because they follow the sequence of subject, verb phrase, and
object. Formal grammars are a tool for syntax, not semantics. We worry about semantics
at a later point in the compiling process. In the syntax analysis phase, we verify
structure, not meaning.
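To make the syntax-versus-semantics point concrete, here is a minimal Python sketch that enumerates every sentence the toy grammar can produce. The right-hand sides in GRAMMAR are an assumption, chosen to be consistent with the example sentences in this handout, and the names GRAMMAR, expand, and SENTENCES are illustrative only.

```python
from itertools import product

# Assumed right-hand sides for the toy English grammar (hypothetical encoding:
# each nonterminal maps to a list of alternatives, each a list of symbols).
GRAMMAR = {
    "sentence": [["subject", "verb-phrase", "object"]],
    "subject": [["This"], ["Computers"], ["I"]],
    "verb-phrase": [["adverb", "verb"], ["verb"]],
    "adverb": [["never"]],
    "verb": [["is"], ["run"], ["am"], ["tell"]],
    "object": [["the", "noun"], ["a", "noun"], ["noun"]],
    "noun": [["university"], ["world"], ["cheese"], ["lies"]],
}


def expand(symbol):
    """Yield every terminal string (as a list of words) derivable from symbol."""
    if symbol not in GRAMMAR:
        # A symbol with no productions is a terminal: it stands for itself.
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of every symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


# The grammar is not recursive, so its language is finite and can be listed.
SENTENCES = {" ".join(words) for words in expand("sentence")}

print("This is a university" in SENTENCES)
print("Computers run cheese" in SENTENCES)
```

Both membership checks succeed: the grammar accepts the nonsense sentence just as readily as the sensible one, because it only describes structure.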
Vocabulary
We need to review some definitions before we can proceed:
grammar
a set of rules by which valid sentences in a language are constructed.
nonterminal
a grammar symbol that can be replaced/expanded to a sequence of
symbols.
terminal
an actual word in a language; these are the symbols in a grammar that
cannot be replaced by anything else. "Terminal" is supposed to conjure
up the idea that it is a dead end: no further expansion is possible.
production
a grammar rule that describes how to replace/exchange symbols. The
general form of a production for a nonterminal is:
    X -> Y1 Y2 Y3 ... Yn
The nonterminal X is declared equivalent to the concatenation of the
symbols Y1 Y2 Y3 ... Yn. The production means that anywhere we
encounter X, we may replace it by the string Y1 Y2 Y3 ... Yn. Eventually we
will have a string containing nothing that can be expanded further, i.e., it
will consist of only terminals. Such a string is called a sentence. In the
context of programming languages, a sentence is a syntactically correct
and complete program.
derivation
a sequence of applications of the rules of a grammar that produces a
finished string of terminals. A leftmost derivation is where we always
substitute for the leftmost nonterminal as we apply the rules (we can
similarly define a rightmost derivation). A derivation is also called a
parse.
start symbol
a grammar has a single nonterminal (the start symbol) from which all
sentences derive:
    S -> X1 X2 X3 ... Xn
All sentences are derived from S by successive replacement using the
productions of the grammar.
null symbol
it is sometimes useful to specify that a symbol can be replaced by
nothing at all. To indicate this, we use the null symbol ε, e.g., A -> B | ε.
BNF
a way of specifying programming languages using formal grammars
and production rules with a particular form of notation (Backus-Naur
form).
A few grammar exercises to try on your own (the alphabet in each case is {a, b}):
o Define a grammar for the language of strings with one or more a's followed by
  zero or more b's.
o Define a grammar for even-length palindromes.
o Define a grammar for strings where the number of a's is equal to the number of b's.
o Define a grammar where the number of a's is not equal to the number of b's. (Hint:
  think about it as two separate cases...)
(Can you write regular expressions for these languages? Why or why not?)
Parse Representation
In working with grammars, we can represent the application of the rules to derive a
sentence in two ways. The first is a derivation, as shown earlier for "This is a university",
where the rules are applied step by step and we substitute for one nonterminal at a time.
Think of a derivation as a history of how the sentence was parsed because it not only
includes which productions were applied, but also the order they were applied (i.e.,
which nonterminal was chosen for expansion at each step). There can be many different
derivations for the same sentence (the leftmost, the rightmost, and so on).
A parse tree is the second method for representation. It diagrams how each symbol
derives from other symbols in a hierarchical manner. Here is a parse tree for "This is a
university":
                     s
           __________|__________
          |          |          |
       subject      v-p       object
          |          |        __|__
        This        verb     |     |
                     |       a    noun
                     is            |
                              university
Although the parse tree includes all of the productions that were applied, it does not
encode the order they were applied. For an unambiguous grammar (we'll define
ambiguity in a minute), there is exactly one parse tree for a particular sentence.
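As a rough illustration of how one parse tree corresponds to several derivations, the Python sketch below walks the tree for "This is a university" and expands either the leftmost or the rightmost nonterminal at each step. The tuple encoding of the tree and the names TREE, render, and derivation are assumptions made for this example.

```python
# Parse tree for "This is a university": a nonterminal is a (name, children)
# tuple and a terminal is a plain string (hypothetical encoding).
TREE = ("sentence", [
    ("subject", ["This"]),
    ("verb-phrase", [("verb", ["is"])]),
    ("object", ["a", ("noun", ["university"])]),
])


def render(form):
    """Show a sentential form as text, using node names for nonterminals."""
    return " ".join(n[0] if isinstance(n, tuple) else n for n in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding one nonterminal per step."""
    form = [tree]
    steps = [render(form)]
    while True:
        positions = [i for i, node in enumerate(form) if isinstance(node, tuple)]
        if not positions:
            break  # only terminals remain: the form is a sentence
        i = positions[0] if leftmost else positions[-1]
        _, children = form[i]
        form[i:i + 1] = children  # replace the nonterminal by its expansion
        steps.append(render(form))
    return steps


left = derivation(TREE, leftmost=True)
right = derivation(TREE, leftmost=False)
for l, r in zip(left, right):
    print(f"{l:<32}{r}")
```

Both derivations apply the same six productions and end in the same sentence; only the order of the intermediate sentential forms differs, which is exactly the information the tree leaves out.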
More Definitions
Here are some other definitions we will need, described in reference to this example
grammar:
    S -> AB
    A -> Ax | y
    B -> z
alphabet
The alphabet is {S, A, B, x, y, z}. It is divided into two disjoint sets. The terminal
alphabet consists of terminals, which appear in the sentences of the language:
{x, y, z}. The remaining symbols are the nonterminal alphabet; these are the
symbols that appear on the left side of productions and can be replaced during
the course of a derivation: {S, A, B}. Formally, we use V for the alphabet, T for
the terminal alphabet and N for the nonterminal alphabet, giving us: V = T ∪ N,
and T ∩ N = ∅.
The conventions used in our lecture notes are a sans-serif font for grammar
elements, lowercase for terminals, uppercase for nonterminals, and underlined
lowercase (e.g., u, v) to denote arbitrary strings of terminal and nonterminal
symbols (possibly null). In some textbooks, Greek letters are used for arbitrary
strings of terminal and nonterminal symbols (e.g., α, β).
context-free grammar
To define a language, we need a set of productions, of the general form: u -> v. In
a context-free grammar, u is a single nonterminal and v is an arbitrary string of
terminal and nonterminal symbols. When parsing, we can replace u by v
wherever it occurs. We shall refer to this set of productions symbolically as P.
formal grammar
We formally define a grammar as a 4-tuple {S, P, N, T}. S is the start symbol (with
S ∈ N), P is the set of productions, and N and T are the nonterminal and terminal
alphabets. A sentence is a string of symbols in T derived from S using one or
more applications of productions in P. A string of symbols derived from S but
possibly including nonterminals is called a sentential form or a working string.
A production u -> v is used to replace an occurrence of u by v. Formally, if we
apply a production p ∈ P to a string of symbols w in V to yield a new string of
symbols z in V, we say that z derives from w using p, written as follows: w =>p z.
We also use:
    w => z     z derives from w (production unspecified)
    w =>* z    z derives from w using zero or more productions
    w =>+ z    z derives from w using one or more productions
equivalence
The language L(G) defined by grammar G is the set of sentences derivable using
G. Two grammars G and G' are said to be equivalent if the languages they
generate, L(G) and L(G'), are the same.
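The definitions above can be tried out on the example grammar S -> AB, A -> Ax | y, B -> z. The following Python sketch is one possible encoding, not a standard one: each symbol is a single character, and the names N, T, P, START, step, derives, and is_sentence are assumptions made for this illustration. The search bound in derives relies on this particular grammar having no null productions, so sentential forms never shrink.

```python
from collections import deque

N = {"S", "A", "B"}                      # nonterminal alphabet
T = {"x", "y", "z"}                      # terminal alphabet
P = [("S", "AB"), ("A", "Ax"), ("A", "y"), ("B", "z")]  # productions
START = "S"


def step(w):
    """Return every z with w => z (one production applied at one position)."""
    results = set()
    for lhs, rhs in P:
        i = w.find(lhs)
        while i != -1:
            results.add(w[:i] + rhs + w[i + len(lhs):])
            i = w.find(lhs, i + 1)
    return results


def derives(w, z):
    """Return True if w =>* z, by breadth-first search over sentential forms.

    Assumes no production shortens a string, so forms longer than z are pruned.
    """
    seen = {w}
    queue = deque([w])
    while queue:
        current = queue.popleft()
        if current == z:
            return True
        for nxt in step(current):
            if len(nxt) <= len(z) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def is_sentence(w):
    """A sentence contains terminals only and derives from the start symbol."""
    return all(symbol in T for symbol in w) and derives(START, w)


print(sorted(step("AB")))
print(is_sentence("yxxz"), is_sentence("zy"))
```

Here step corresponds to the relation =>, and derives to =>* (zero steps are allowed, so every string derives itself).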
Grammar Hierarchy
We owe a lot of our understanding of grammars to the work of the American linguist
Noam Chomsky (yes, the Noam Chomsky known for his politics). There are four
categories of formal grammars in the Chomsky Hierarchy; they span from Type 0, the
most general, to Type 3, the most restrictive. More restrictions on the grammar make it
easier to describe and efficiently parse, but reduce the expressive power.
Type 0: free or unrestricted grammars
These are the most general. Productions are of the form u -> v where both u
and v are arbitrary strings of symbols in V, with u non-null. There are no
restrictions on what appears on the left or right-hand side other than that the
left-hand side must be non-empty.
Type 1: context-sensitive grammars
Productions are of the form uXw -> uvw where u, v and w are arbitrary strings
of symbols in V, with v non-null, and X a single nonterminal. In other words, X
may be replaced by v but only when it is surrounded by u and w (i.e., in a
particular context).
Type 2: context-free grammars
Productions are of the form X -> v where v is an arbitrary string of symbols in
V, and X is a single nonterminal. Wherever you find X, you can replace it with v
(regardless of context).
Type 3: regular grammars
Productions are of the form X -> a, X -> aY, or X -> ε where X and Y are
nonterminals and a is a terminal. That is, the left-hand side must be a single
nonterminal and the right-hand side can be either empty, a single terminal by
itself, or a single terminal followed by a single nonterminal. These grammars are
the most limited in terms of expressive power.
Every type 3 grammar is a type 2 grammar, and every type 2 is a type 1, and so on. Type
3 grammars are particularly easy to parse because recursion can appear only at the end
of a production, so there are no nested recursive constructs.
Efficient parsers exist for many classes of Type 2 grammars. Although Type 1 and Type 0
grammars are more powerful than Type 2 and 3, they are far less useful since we cannot
create efficient parsers for them. In designing programming languages using formal
grammars, we will use Type 2 or context-free grammars, often just abbreviated as CFG.
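The four production shapes can be checked mechanically. The Python sketch below applies the definitions exactly as stated in this handout and reports the most restrictive type that admits every production of a grammar. The tuple encoding and the function names production_type and grammar_type are assumptions made for this example.

```python
def production_type(lhs, rhs, nonterminals, terminals):
    """Return the most restrictive type (3, 2, 1, or 0) that admits lhs -> rhs.

    lhs and rhs are tuples of symbols; an empty rhs stands for the null symbol.
    """
    if len(lhs) == 0:
        raise ValueError("left-hand side must be non-empty")
    if len(lhs) == 1 and lhs[0] in nonterminals:
        # Single nonterminal on the left: at least context-free (Type 2).
        regular = (
            len(rhs) == 0
            or (len(rhs) == 1 and rhs[0] in terminals)
            or (len(rhs) == 2 and rhs[0] in terminals and rhs[1] in nonterminals)
        )
        return 3 if regular else 2
    # Type 1: look for a split lhs = u X w with rhs = u v w and v non-null.
    for i, symbol in enumerate(lhs):
        if symbol not in nonterminals:
            continue
        u, w = lhs[:i], lhs[i + 1:]
        v_length = len(rhs) - len(u) - len(w)
        if v_length >= 1 and rhs[:len(u)] == u and rhs[len(rhs) - len(w):] == w:
            return 1
    return 0


def grammar_type(productions, nonterminals, terminals):
    """A grammar is only as restricted as its least restricted production."""
    return min(production_type(l, r, nonterminals, terminals)
               for l, r in productions)


NONTERMINALS = {"S", "A", "B", "X", "Y", "Z"}
TERMINALS = {"a", "b", "x", "y", "z"}

example = [(("S",), ("A", "B")), (("A",), ("A", "x")),
           (("A",), ("y",)), (("B",), ("z",))]
regular = [(("X",), ("a", "X")), (("X",), ("b",)), (("X",), ())]
sensitive = [(("a", "X", "b"), ("a", "Y", "Z", "b"))]
unrestricted = [(("A", "B"), ("B", "A"))]

for g in (example, regular, sensitive, unrestricted):
    print(grammar_type(g, NONTERMINALS, TERMINALS))
```

Note that the check is purely syntactic: it classifies the grammar as written, not the language, which might also be describable by a more restricted grammar.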
Issues in parsing context-free grammars
There are several efficient approaches to parsing most Type 2 grammars and we will talk
through them over the next few lectures. However, there are some issues that can
interfere with parsing that we must take into consideration when designing the
grammar. Let's take a look at three of them: ambiguity, recursive rules, and left-
factoring.
Ambiguity
If a grammar permits more than one parse tree for some sentences, it is said to be
ambiguous. For example, consider the following classic arithmetic expression grammar:
    E  -> E op E | ( E ) | int
    op -> + | - | * | /
This grammar denotes expressions that consist of integers joined by binary operators
and possibly including parentheses. As defined above, this grammar is ambiguous
because for certain sentences we can construct more than one parse tree. For example,
consider the expression 10 - 2 * 5. We parse by first applying the production E -> E op E.
The parse tree on the left chooses to expand that first op to *, the one on the right to -. We
have two completely different parse trees. Which one is correct?
             E                           E
         ____|____                   ____|____
        |    |    |                 |    |    |
        E    op   E                 E    op   E
      __|__  |    |                 |    |  __|__
     |  |  | *   int               int   - |  |  |
     E  op E      |                 |      E  op E
     |  |  |      5                 10     |  |  |
    int - int                             int * int
     |     |                               |     |
     10    2                               2     5
Both trees are legal in the grammar as stated and thus either interpretation is valid.
Although natural languages can tolerate some kind of ambiguity (e.g., puns, plays on
words, etc.), it is not acceptable in computer languages. We don't want the compiler just
haphazardly deciding which way to interpret our expressions! Given our expectations
from algebra concerning precedence, only one of the trees seems right. The right-hand
tree fits our expectation that * "binds tighter": its result is computed first and then
integrated into the outer expression, which has a lower-precedence operator.
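The ambiguity can be observed directly by enumerating every parse tree the grammar E -> E op E | ( E ) | int allows for a token list. The Python sketch below is a brute-force illustration, not a practical parser; the tuple encoding of trees, the names OPS, trees, and evaluate, and the use of integer division for / are assumptions made for this example.

```python
import operator

# Integer semantics for each operator (floor division is an assumption).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def trees(tokens):
    """Return every parse tree of tokens under E -> E op E | ( E ) | int."""
    found = []
    if len(tokens) == 1 and isinstance(tokens[0], int):
        found.append(tokens[0])                        # E -> int
    if len(tokens) >= 3 and tokens[0] == "(" and tokens[-1] == ")":
        found.extend(("paren", t) for t in trees(tokens[1:-1]))  # E -> ( E )
    for k in range(1, len(tokens) - 1):
        if tokens[k] in OPS:                           # E -> E op E, split at k
            for left in trees(tokens[:k]):
                for right in trees(tokens[k + 1:]):
                    found.append((left, tokens[k], right))
    return found


def evaluate(tree):
    """Compute the value a parse tree denotes."""
    if isinstance(tree, int):
        return tree
    if len(tree) == 2:                                 # ("paren", subtree)
        return evaluate(tree[1])
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


for tree in trees([10, "-", 2, "*", 5]):
    print(tree, "=", evaluate(tree))
```

The expression 10 - 2 * 5 yields two trees, one evaluating to 0 and the other to 40, so the choice of tree is not just a matter of presentation: it changes the meaning of the program.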
It's fairly easy for a grammar to become ambiguous if you are not careful in its
construction. Unfortunately, there is no magical technique that can be used to resolve all
varieties of ambiguity. It is an undecidable problem to determine whether an arbitrary
grammar is ambiguous, much less to attempt to mechanically remove all ambiguity. However,
that doesn't mean in practice that we cannot detect ambiguity or do something about it.
For programming language grammars, we usually take pains to construct an
unambiguous grammar or introduce additional disambiguating rules to throw away the
undesirable parse trees, leaving only one for each sentence.
Using the above ambiguous expression grammar, one technique would leave the
grammar as is, but add disambiguating rules into the parser implementation. We could
code into the parser knowledge of precedence and associativity to break the tie and force
the parser to build the tree on the right rather than the left. The advantage of this is that
the grammar remains simple. But as a downside, the syntactic
structure of the language is no longer given by the grammar alone.
Another approach is to change the grammar to only allow the one tree that correctly
reflects our intention and eliminate the others. For the expression grammar, we can
separate expressions into multiplicative and additive subgroups and force them to be
expanded in the desired order.
    E    -> E t_op E | T
    t_op -> + | -
    T    -> T f_op T | F
    f_op -> * | /
    F    -> (E) | int
Terms are the operands of addition and subtraction, and factors are the operands of
multiplication and division. Since the base case for an expression is a term, addition and
subtraction will appear higher in the parse tree, and thus receive lower precedence.
After verifying that the above rewritten grammar has only one parse tree for the earlier
ambiguous expression, you might think we were home free, but now consider the
expression 10 - 2 - 5. The recursion on both sides of the binary operator allows either
side to match repetitions. The arithmetic operators usually associate to the left, so
replacing the right-hand operand with the base case forces the repetitive matches onto the
left side. The final result is:
    E    -> E t_op T | T
    t_op -> + | -
    T    -> T f_op F | F
    f_op -> * | /
    F    -> (E) | int
Whew! The obvious disadvantage of changing the grammar to remove ambiguity is that
it may complicate and obscure the original grammar definitions. There is no mechanical
means to change any ambiguous grammar into an unambiguous one (undecidable,
remember?). However, most programming languages have only limited issues with
ambiguity that can be resolved using ad hoc techniques.
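To see that the final E/T/F grammar pins down a single tree, here is a small hand-written parser sketch in Python. It is an illustration under stated assumptions rather than the parsing method developed in this course: the left-recursive productions E -> E t_op T and T -> T f_op F are realized as loops that extend the tree to the left, and the names tokenize and parse_expression, as well as the tuple encoding of trees, are hypothetical.

```python
import re


def tokenize(text):
    """Split an expression into integer, operator, and parenthesis tokens."""
    return re.findall(r"\d+|[-+*/()]", text)


def parse_expression(text):
    """Build the parse tree for text as nested (left, op, right) tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    def parse_e():                      # E -> E t_op T | T
        tree = parse_t()
        while peek() in ("+", "-"):
            op = advance()
            tree = (tree, op, parse_t())    # grow to the left: left-associative
        return tree

    def parse_t():                      # T -> T f_op F | F
        tree = parse_f()
        while peek() in ("*", "/"):
            op = advance()
            tree = (tree, op, parse_f())
        return tree

    def parse_f():                      # F -> (E) | int
        token = advance()
        if token == "(":
            tree = parse_e()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse_expression("10 - 2 * 5"))
print(parse_expression("10 - 2 - 5"))
```

The first call groups 2 * 5 below the subtraction (precedence), and the second groups 10 - 2 first (left associativity), matching the single tree the grammar permits in each case.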
Recursive productions
Productions are often defined in terms of themselves. For example, a list of variables in a
programming language grammar could be specified by this production:
    variable_list -> variable | variable_list , variable
Such productions are said to be recursive. If the recursive nonterminal is at the left of the
right side of the production, e.g., A -> u | Av, we call the production left-recursive.
Similarly, we can define a right-recursive production: A -> u | vA. Some parsing
techniques have trouble with one or the other variants of recursive productions and so
sometimes we have to massage the grammar into a different but equivalent form.
Left-recursive productions can be especially troublesome in top-down parsers (and we'll
see why a bit later). Handily, there is a simple technique for rewriting the grammar to
move the recursion to the other side. For example, consider this left-recursive rule:
    X -> Xa | Xb | AB | C | DEF
To convert the rule, we introduce a new nonterminal X' that we append to the end of all
non-left-recursive productions for X. The expansion for the new nonterminal is basically
the reverse of the original left-recursive rule. The rewritten productions are:
    X  -> ABX' | CX' | DEFX'
    X' -> aX' | bX' | ε
It appears we just exchanged the left-recursive rules for an equivalent right-recursive
version. This might seem pointless, but some parsing algorithms prefer or even require
only left or right recursion.
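The rewriting step just described is mechanical enough to sketch in a few lines of Python. The function name eliminate_left_recursion and the list-of-symbols encoding (with an empty list standing for the null symbol) are assumptions made for this example, and the sketch handles only immediate left recursion on a single nonterminal.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite name -> name a1 | ... | b1 | ... without immediate left recursion.

    alternatives is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping each nonterminal to its new alternatives.
    """
    new_name = name + "'"
    # Tails of the left-recursive alternatives (the part after the leading name).
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not tails:
        return {name: alternatives}      # nothing to do
    return {
        # Every non-recursive alternative is followed by the new nonterminal.
        name: [alt + [new_name] for alt in others],
        # The new nonterminal repeats a tail or stops (empty list = null symbol).
        new_name: [tail + [new_name] for tail in tails] + [[]],
    }


rule = [["X", "a"], ["X", "b"], ["A", "B"], ["C"], ["D", "E", "F"]]
for lhs, alts in eliminate_left_recursion("X", rule).items():
    print(lhs, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```

Applied to X -> Xa | Xb | AB | C | DEF, the sketch produces one alternative of X for each non-recursive option and one alternative of X' for each recursive one, plus the null alternative that ends the repetition.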
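Left-factoring
The parser usually reads tokens from left to right and it is convenient if, upon reading a
token, it can make an immediate decision about which production from the grammar to
expand. However, this can be trouble if there are productions that have common first
symbol(s) on the right side of the productions. Here is an example we often see in
programming language grammars:
    Stmt -> if Cond then Stmt else Stmt | if Cond then Stmt | Other | ....
The first two options share the prefix if Cond then Stmt, so upon reading an if the parser
cannot tell which production to expand. To defer that decision, we factor out the common
part of the two options into a shared rule that both will use and then add a new rule that
picks up where the tokens diverge.
    Stmt    -> if Cond then Stmt OptElse | Other | ....
    OptElse -> else Stmt | ε
In the rewritten grammar, upon reading an if we expand the first production and wait
until if Cond then Stmt has been seen to decide whether to expand OptElse to else or ε.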
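Left-factoring can also be sketched mechanically. The Python below performs a single factoring pass over the alternatives of one nonterminal; the names common_prefix and left_factor are assumptions, and the generated nonterminal is called Stmt' here rather than OptElse because the sketch has no way to invent a descriptive name.

```python
def common_prefix(alternatives):
    """Return the longest list of symbols that starts every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix


def left_factor(name, alternatives):
    """Factor alternatives that share a first symbol (one pass only).

    Each alternative is a list of symbols; an empty list is the null symbol.
    """
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None
        groups.setdefault(key, []).append(alt)

    result = {name: []}
    fresh = 0
    for key, alts in groups.items():
        if key is None or len(alts) == 1:
            result[name].extend(alts)            # nothing shared: keep as is
            continue
        fresh += 1
        new_name = name + "'" * fresh            # hypothetical generated name
        prefix = common_prefix(alts)
        result[name].append(prefix + [new_name])
        # The new rule picks up where the alternatives diverge.
        result[new_name] = [alt[len(prefix):] for alt in alts]
    return result


stmt = [
    ["if", "Cond", "then", "Stmt", "else", "Stmt"],
    ["if", "Cond", "then", "Stmt"],
    ["Other"],
]
for lhs, alts in left_factor("Stmt", stmt).items():
    print(lhs, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```

A single pass may leave common prefixes inside the newly created rule, so in general the procedure would be repeated until no two alternatives of any nonterminal start with the same symbol.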
Hidden left-factors and hidden left recursion
A grammar may not appear to have left recursion or left factors, yet still have issues that
will interfere with parsing. This may be because the issues are hidden and need to be
first exposed via substitution.
For example, consider this grammar:
    A -> da | acB
    B -> abB | daA | Af
A cursory examination of the grammar may not detect that the first and second
productions of B overlap with the third. We substitute the expansions for A into the
third production to expose this:
    A -> da | acB
    B -> abB | daA | daf | acBf
This exchanges the original third production of B for several new productions, one for
each of the productions for A. These directly show the overlap, and we can then
left-factor:
    A -> da | acB
    B -> aM | daN
    M -> bB | cBf
    N -> A | f
Similarly, the following grammar does not appear to have any left recursion:
    S -> Tu | wx
    T -> Sq | vvS
Yet after substitution of S into T, the left recursion comes to light:
    S -> Tu | wx
    T -> Tuq | wxq | vvS
If we then eliminate left recursion, we get:
    S  -> Tu | wx
    T  -> wxqT' | vvST'
    T' -> uqT' | ε
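The substitution step used in both examples can be written as a short Python sketch. It assumes each grammar symbol is a single character so that a right-hand side can be stored as a plain string; the names substitute_leading and is_left_recursive are hypothetical.

```python
def substitute_leading(grammar, target, inner):
    """Expand a leading occurrence of inner in the alternatives of target.

    grammar maps a nonterminal to a list of right-hand-side strings, with one
    character per symbol. A new grammar is returned; the input is not modified.
    """
    expanded = []
    for alt in grammar[target]:
        if alt.startswith(inner):
            # One new alternative for each production of the inner nonterminal.
            expanded.extend(body + alt[len(inner):] for body in grammar[inner])
        else:
            expanded.append(alt)
    return {**grammar, target: expanded}


def is_left_recursive(grammar, name):
    """True if some alternative of name begins with name itself."""
    return any(alt.startswith(name) for alt in grammar[name])


factors = {"A": ["da", "acB"], "B": ["abB", "daA", "Af"]}
print(substitute_leading(factors, "B", "A")["B"])

recursion = {"S": ["Tu", "wx"], "T": ["Sq", "vvS"]}
exposed = substitute_leading(recursion, "T", "S")
print(exposed["T"], is_left_recursive(exposed, "T"))
```

After the substitution, the shared prefixes in B and the immediate left recursion in T are visible to a purely local check, and the left-factoring and left-recursion rewrites described earlier can be applied.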
Programming language case study: ALGOL
Algol is of interest to us because it was the first programming language to be defined
using a grammar. It grew out of an international effort in the late 1950s to create a
"universal programming language" that would run on all machines. At that time,
FORTRAN and COBOL were the prominent languages, with new languages sprouting
up all around. Programmers became increasingly concerned about portability of
programs and being able to communicate with one another on programming topics.
Consequently the ACM and GAMM (Gesellschaft für angewandte Mathematik und
Mechanik) decided to come up with a single programming language that all could use
on their computers, and in whose terms programs could be communicated between the
users of all machines. Their first decision was not to use FORTRAN as their universal
language. This may seem surprising to us today, since it was the most commonly used
language back then. However, as Alan J. Perlis, one of the original committee members,
puts it:
    "Today, FORTRAN is the property of the computing world, but in 1957, it
    was an IBM creation and closely tied to IBM hardware. For these reasons,
    FORTRAN was unacceptable as a universal language."
ALGOL 58 was the first version of the language, followed up very soon after by
ALGOL 60, which is the version that had the most impact. As a language, it introduced
the following features:
o block structure and nested structures
o strong typing
o scoping
o procedures and functions
o call by value, call by name
o side effects (is this good or bad?)
o recursion
It may seem surprising that recursion was not present in the original FORTRAN or
COBOL. You probably know that to implement recursion we need a runtime stack to
store the activation records as functions are called. In FORTRAN and COBOL,
activation records were created at compile time, not runtime. Thus, only one activation
record per subroutine was created. No stack was used. The parameters for the
subroutine were copied into the activation record and that data area was used for
subroutine processing.
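A small simulation shows why a single, statically allocated activation record rules out recursion. The Python sketch below is only an analogy, not a description of how any actual FORTRAN or COBOL compiler was built: STATIC_RECORD stands in for the one data area per subroutine, and the function names are hypothetical.

```python
# One activation record for the subroutine, fixed before the program runs.
STATIC_RECORD = {"n": None}


def factorial_static(n):
    """Factorial using the single shared record for its parameter."""
    STATIC_RECORD["n"] = n                 # parameter copied into the record
    if STATIC_RECORD["n"] <= 1:
        return 1
    # The recursive call overwrites the record with its own parameter...
    sub = factorial_static(STATIC_RECORD["n"] - 1)
    # ...so the value read here is the callee's n, not this call's n.
    return STATIC_RECORD["n"] * sub


def factorial_stacked(n):
    """Factorial where each call gets its own record (Python's call stack)."""
    if n <= 1:
        return 1
    return n * factorial_stacked(n - 1)


print(factorial_static(4), factorial_stacked(4))
```

With the shared record, every pending call sees the parameter of the innermost call once it returns, so the static version computes 1 instead of 24. A runtime stack gives each activation its own copy of n, which is what makes the second version work.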
The ALGOL report was the first use of BNF to describe a programming language.
Both John Backus and Peter Naur were on the ALGOL committees. They derived this
description technique from an earlier paper written by Backus. The technique was
adopted because they needed a machine-independent method of description. If one
looks at the early definitions of FORTRAN, one can see the links to the IBM hardware.
With ALGOL, the machine was not relevant. BNF had a huge impact on programming
language design and compiler construction. First, it stimulated a large number of
studies on the formal structure of programming languages, laying the groundwork for a
theoretical approach to language design. Second, a formal syntactic description could be
used to drive a compiler directly (as we shall see).
ALGOL had a tremendous impact on programming language design, compiler
construction, and language theory, but the language itself was a commercial failure.
Partly this was due to design decisions (overly complex features, no I/O) along with the
politics of the time (popularity of FORTRAN, lack of support from the all-powerful IBM,
resistance to BNF).
Bibliography
A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA:
Addison-Wesley, 1986.
J. Backus, The Syntax and Semantics of the Proposed International Algebraic Language of
the Zurich ACM-GAMM Conference, Proceedings of the International Conference on
Information Processing, 1959, pp. 125-132.
N. Chomsky, On Certain Formal Properties of Grammars, Information and Control, Vol. 2,
1959, pp. 137-167.
J.P. Bennett, Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990.
D. Cohen, Introduction to Computer Theory. New York: Wiley, 1986.
J.C. Martin, Introduction to Languages and the Theory of Computation. New York, NY:
McGraw-Hill, 1991.
P. Naur, Programming Languages, Natural Languages, and Mathematics, Communications
of the ACM, Vol 18, No. 12, 1975, pp. 676-683.
J. Sammet, Programming Languages: History and Fundamentals. Englewood-Cliffs, NJ:
Prentice-Hall, 1969.
R.L.Wexelblat, History of Programming Languages. London: Academic Press, 1981.