The Linguistics of DNA

Sigma Xi, The Scientific Research Society

Author(s): David B. Searls
Source: American Scientist, Vol. 80, No. 6 (November-December 1992), pp. 579-591
Published by: Sigma Xi, The Scientific Research Society
Stable URL: http://www.jstor.org/stable/29774782 .
The Linguistics of DNA

In the quest to understand the language of life, a promising strategy is to construct a grammar of genes

David B. Searls

David B. Searls is research associate professor of genetics at the University of Pennsylvania School of Medicine, with a secondary appointment in computer and information science. He received bachelor's degrees in philosophy and life sciences from the Massachusetts Institute of Technology, and a Ph.D. in biology from the Johns Hopkins University. Subsequently, he turned his attention to computational biology and received a master's degree in computer science from the University of Pennsylvania. He is currently co-director of the Computational Biology and Informatics Laboratory there. Address: Department of Genetics, Room 475, Clinical Research Building, University of Pennsylvania School of Medicine, 422 Curie Blvd., Philadelphia, PA 19104-6145.

Linguistic metaphors have been a part of molecular biology ever since the structure of DNA was solved in 1953. Biologists speak of the genetic code, of gene expression and of reading frames in nucleic acids. DNA is transcribed into RNA, which is then translated into protein. Certain enzymes are even said to edit RNA. Despite all this linguistic terminology, however, there has been little effort to apply the tools of formal language theory to the problems of interpreting biological sequences.

At about the same time that Watson and Crick were laying the foundation of molecular biology with their model of the double helix, Noam Chomsky was initiating an equally fundamental revolution in linguistics. The engine of this revolution was the idea of a generative grammar: a set of rules that can generate all the grammatically acceptable sentences of a language, without generating any erroneous ones. Through this device a relatively small set of rules can capture the structure of a potentially infinite language. Chomsky's primary interest was in understanding natural human languages such as English, but his work was immediately successful in elucidating the more abstract and formal languages of mathematics and computer science. It now appears that the idea of a generative grammar may also be a powerful tool for expressing the biological messages inscribed in DNA and RNA.

Finding effective methods for reading the language of nucleic acids is rapidly becoming an issue of practical concern. The Human Genome Project, in its most ambitious phase, proposes to record the entire sequence of the three billion DNA base pairs that make up the human genetic endowment. If this archive is to be of any use, biologists will have to be able to identify and retrieve meaningful segments of it. This is no trivial challenge, given that functional genes make up only a few percent of the total DNA, and our knowledge of how they are controlled by surrounding elements of the genome is still rudimentary. Equipped with knowledge of the linguistic structure of the genome, one can endeavor to write a computer program that parses genes and other high-level features of DNA.

Grammars and Languages

A generative grammar is more than a set of rules and recommendations for the correct use of language; it is a mechanism for actually producing all the syntactically correct utterances in a language, and only those utterances. Here is a generative grammar for a tiny subset of English:

    sentence → noun-phrase verb-phrase
    noun-phrase → article modified-noun
    verb-phrase → verb noun-phrase
    modified-noun → noun | adjective modified-noun
    noun → linguist | biologist
    verb → sees | believes
    adjective → young | famous
    article → a | the

The grammar consists of "production rules," which specify the structure of an English sentence in progressively greater detail, until finally individual words are introduced. The first rule states that a sentence consists of a noun phrase followed by a verb phrase; the second rule then defines a noun phrase as an article followed by a modified noun; the third rule declares that a verb phrase is made up of a verb and a noun phrase. The fourth rule offers a choice: The vertical bar | simply means "or," so that a modified noun may be either a noun by itself or an adjective followed by a modified noun. The remaining production rules give a small vocabulary of nouns, verbs, adjectives and articles. All of the italicized words are "nonterminal" symbols, meaning that they never appear in the actual sentences that are the final output of the grammar; lexical items such as linguist and sees, on the other hand, are "terminal" symbols that will be part of the generated sentences. The arrow within each rule can be read as "produces" or "expands into."

To generate a sentence, all that is needed is to apply the rules, one after another. The process begins with the top-level rule, in which the single symbol sentence is replaced by the sequence of symbols noun-phrase verb-phrase. Then more rules are applied to expand these symbols in turn, so that noun-phrase yields article modified-noun, and so on. Eventually all of the nonterminals are replaced by terminal symbols. An entire sequence of rule applications, called a derivation, is shown in Figure 2. It leads to the sentence: "The linguist sees a famous young biologist."

With the grammar given here, which generates a very limited variety of forms, it is easy to verify that all of the sentences are grammatical (if slightly inane). Despite this simplicity, the grammar could in theory produce an infinite number of sentences. The source of its
infinite capacity is the rule for modified-noun, which has a recursive structure: It invokes itself. Each time modified-noun expands into the sequence adjective modified-noun, there is another instance of modified-noun to be expanded. The process can string together an unlimited number of adjectives. Although the sentences would quickly grow tiresome and even nonsensical, there is no point at which adding just one more adjective makes the sentence ungrammatical.

[Figure 1 image: an annotated nucleotide sequence, positions 19,410 to 20,100 within a region numbered from 0 to about 70,000. Labeled features include upstream, CCAAT-box, TATA-box, transcript, CAP, exon, intron, 5' splice, branch and 3' splice, with the encoded amino acids shown above the sequence.]

Figure 1. Structure of a gene is revealed by linguistic analysis. In the author's laboratory, grammars are used to parse DNA sequences, elucidating the hierarchical organization of genes and other high-level components of the genome. Shown here is the first half of a human hemoglobin gene, one of five similar genes distributed in a region of DNA some 73,000 nucleotides in length. The long list of letters is the nucleotide sequence; the variously colored lines represent features recognized, including the gene itself. Although a simple syntactic analysis cannot unambiguously identify genes, grammars provide a promising formal framework for exploring higher-order structure in biological sequences.

Such a toy grammar can hardly begin to express the diversity of English syntax. To make the grammar more realistic, it would have to be greatly elaborated, but that exercise would probably reveal more about the peculiarities of natural languages than about the nature of grammars. A better strategy is to look instead at the grammars of still simpler languages, namely the abstract and formal languages studied in mathematics and computer science, and in molecular genetics.

A Hierarchy of Languages

Formal languages look nothing like the languages of everyday experience. The theory of formal languages defines a language as a set of "strings" formed by writing down a sequence of symbols drawn from some specified alphabet. The strings do not have to mean anything; indeed, the term "string" is favored over more familiar alternatives such as "word" or "sentence" because it carries no connotation that the sequence of symbols should be meaningful. A string is a member of a language if it satisfies some formally defined property, such as being derivable from a given grammar; there are no other criteria.

As an example, consider the alphabet made up of the two symbols 0 and 1, and the language consisting of all possible strings drawn from this alphabet. The language, which is equivalent to the set of all integers expressed in binary notation (allowing leading 0's), includes strings such as 0, 101, 111011, 010101101, and countless others. Indeed, the language is infinite, even though the alphabet has only a finite number of symbols.

The language of binary integers has a simple generative grammar:

    S → 0S | 1S | ε

Here the symbol S is a nonterminal, whereas 0 and 1 are terminals. As in the English grammar, the vertical bar means "or." The one new element of the grammar is the symbol ε (the Greek letter epsilon), which represents the "empty string," the string of zero length. The three alternative productions can be translated into words as follows: "Any instance of the nonterminal S yields the string 0S or the string 1S or the empty string." Choosing the third possibility, ε, amounts to erasing the S.

The operation of the grammar is straightforward. Beginning with S, any of the productions can be chosen. If the first production selected is S → 1S, then
the effect is to replace the solitary S with the string 1S. This derivation is represented as S ⇒ 1S, where the double arrow signifies that this is an application of a grammar rule, not a statement of the rule itself. The S in the new string 1S is now subject to replacement in turn; if the production S → 0S is chosen this time, the derivation is 1S ⇒ 10S. Now the rule S → 1S might be applied again, yielding 101S. Finally one selects the third alternative production, S → ε, and the S disappears, leaving the string 101. Since this string has no nonterminal symbols, it cannot be transformed further. The process of building the string is summarized as follows:

    S ⇒ 1S ⇒ 10S ⇒ 101S ⇒ 101

The grammar can generate strings of any length, and either 0 or 1 can appear at any position within a string, so that all binary integers are included in the language.

Figure 2. Parse tree represents the syntactic structure of an English sentence. The sentence is generated according to the rules of a grammar in the series of steps shown at the top. All the stages of this derivation can then be arranged in a tree structure (which by convention is drawn inverted, with the root at the top). Formal languages, as well as the languages of biological sequences, are also well suited to this kind of analysis.

Introducing more restrictive production rules gives rise to languages with a more complicated structure. Consider this grammar:

    S → 0S | 1T | ε
    T → 0T | 1S

Here S and T are both nonterminals; the alphabet of terminals again consists of 0 and 1. What set of strings is generated by the grammar? It is the language of all binary strings having an even number of 1's, as in this sample derivation:

    S ⇒ 0S ⇒ 01T ⇒ 011S ⇒ 0110S ⇒ 0110

Palindromes and Repeats

The grammars of binary numerals presented above have a distinctive form. In every rule the left-hand side consists of a single nonterminal symbol, and the corresponding productions on the right-hand side either have no nonterminals (as in S → ε) or else they have a single nonterminal as their final symbol (as in S → 0T). It follows that strings generated by these rules can grow only from one end. A grammar that fits this description is called a regular grammar, and the corresponding language is a regular language. (The same languages can be specified in other ways as well; a common technique is the formalism known as regular expressions, often used in searching for patterns in text.)

Regular languages have a number of attractive properties, most notably the computational efficiency of recognizing them. But they also have limitations. A regular language can capture the notion "all strings of 0's and 1's," or "all strings with an even number of 1's," but it cannot represent certain other concepts. Of particular importance in a biological context, no regular language can specify the set of all palindromic strings, that is, strings that read the same forward and backward. There are many proofs of this assertion, but I shall merely offer an informal argument in support of it: The way to construct a palindrome is to build it from the middle outward or from the two ends toward the middle, but the defining trait of regular grammars is that they add symbols only to one end of a string.

To write a grammar for palindromes, some of the restrictions on the form of the production rules must be relaxed. Specifically, nonterminals must be allowed in any position on the right-hand side of a rule. Here is a simple grammar that produces palindromes drawn from the alphabet consisting of 0 and 1:

    S → 0S0 | 1S1 | ε

Note that the nonterminal S appears in the middle of the productions 0S0 and 1S1, which would not be allowed in a regular grammar. The grammar is guaranteed to generate only palindromes because the string grows outward from the center, with each production adding a balanced pair of 0's or 1's. A sample derivation would look like this:

    S ⇒ 0S0 ⇒ 01S10 ⇒ 011S110 ⇒ 011110

All the palindromes generated by this grammar have an even number of symbols, but a grammar that can accommodate odd-length palindromes, such as 01010, is only a bit more complicated.

Grammars with no restriction on the number or placement of nonterminals on the right-hand side of a rule are called context-free grammars, and the associated languages are context-free languages. Chomsky showed that context-free languages are more powerful than regular languages, precisely because they can express concepts such as that of a palindrome. Moreover, the class of context-free languages includes the class of regular languages, since a regular grammar is simply a special case of a context-free grammar.

Context-free languages are not at the
pinnacle of Chomsky's pyramid of languages. A context-free grammar can specify palindromes and numerous other interesting patterns, but there are also many concepts that escape the context-free formalism. A copy language, for example, consists of all strings in which the first half of the string is identical to the second half. Where a palindrome might read 011110, a member of the copy language would read 011011. Generating such repeats may seem little different from generating palindromes, but in fact the task is much more difficult to accomplish.

Figure 3. Regular languages, the simplest languages in the hierarchy defined by Noam Chomsky, are described by regular grammars and by the abstract machines called finite-state automata. An FSA is a collection of states (circles) and transitions (arrows). It generates language by emitting symbols associated with some of the transitions. The FSA shown here begins in state S, where it can emit any number of 0's by making transitions back to the same state; on moving to state A it can generate 1's in a similar way; then in state B its output consists of 2's. The grammar below the diagram specifies the same language, namely all strings made up of any number of 0's followed by any number of 1's followed by any number of 2's. Each grammar rule shows how a nonterminal symbol on the left-hand side of the arrow may be replaced by a rule body on the right-hand side; the vertical bar means "or"; the ε rule simply erases the nonterminal. All rule bodies have at most one nonterminal, and it is the rightmost symbol. The derivation below the grammar shows how a string develops by repeated application of the grammar rules. At each step the underlined nonterminal is replaced by the rule body shown in color in the following step.

Creating a grammar for a copy language entails a further loosening of the constraints on the form of the production rules. This time the requirement to be jettisoned is the one stating that the left-hand side of each rule must consist of a single nonterminal symbol. By allowing additional symbols on the left-hand side, one can indeed create a grammar for the copy language, although it is too complex to be presented here. The key to its operation is that the presence of multiple symbols on the left-hand side of a rule allows for arbitrary movement of symbols to the left and right in the developing string, so that (for example) a palindromic string can be systematically rearranged into a string from a copy language.

Figure 4. Context-free languages are generated by pushdown automata, in which an FSA is augmented by a simple memory mechanism: a "stack" where symbols are "pushed" and "popped" during state transitions. The automaton shown generates 0's followed by 1's followed by 2's, as in the regular language above, but requires that the number of 0's equal the number of 2's. The machine counts the 0's by pushing a y onto the stack for each 0 generated, and later emits a 2 for each y it pops; the string ends when the machine pops an x that was pushed at the outset. The equivalent grammar form relaxes restrictions on the number and placement of nonterminals in rule bodies.

A copy language is an example of a context-sensitive language. In a context-sensitive grammar the one constraint remaining on the form of a rule is that the right-hand side be at least as long as the left-hand side. Beyond the context-sensitive languages there are languages that can be specified only by dropping all constraints on the form of a grammar rule. This rarefied class of languages specified only by unrestricted grammars, however, is not easy to characterize; the examples tend to be rather contrived and mathematical.

Figure 5. Context-sensitive languages require a machine with a more versatile memory: a tape on which the automaton can read or write symbols, and then move left or right. This machine can produce all strings with equal numbers of 0's, 1's, and 2's, in that order. The grammar has rules with more than one symbol on the left-hand side, which serve, for example, to transpose symbols. In a context-sensitive language the length of the tape is proportional to the length of the string generated, and the left-hand side of a rule is never longer than the right-hand side. Without these constraints, the language is called recursively enumerable, and the automaton is a Turing machine.

The Chomsky Hierarchy

The ranking of grammars and languages in Chomsky's classification is sometimes confusing. Going up the ladder from regular through context-free and context-sensitive to unrestricted languages, the grammars are subject to fewer constraints, but the languages themselves are more narrowly defined. A context-free grammar can include rules that would be forbidden in a regular grammar, but a context-free language is more powerful not because of what it includes but because of what it excludes, such as strings that are not
palindromes. Ascending the hierarchy allows one to be more specific, and thus to describe a wider variety of languages.

One way of understanding the Chomsky hierarchy is by considering the patterns inherent in the languages. A regular grammar can generate overall patterns in strings, such as alternating 0's and 1's. What a regular grammar cannot do is maintain dependencies at arbitrary distances within a string of symbols; it cannot use a symbol in one part of the string to decide what symbol to write in another part. Context-free languages allow certain dependencies within a string, for example where the first symbol is matched with the last symbol, the second with the penultimate, and so on. Any pattern of dependencies that can be drawn so as not to cross one another (as in a palindrome) is said to be nested, and nested dependencies are characteristic of context-free languages. When dependencies cross (as in a copy language), a context-sensitive language is mandated, because the crossing dependencies can be established only with the freedom of movement that nonterminals enjoy during a context-sensitive derivation.

Figure 7. Dependencies between distant elements of a string characterize the nonregular languages. If the dependencies are strictly nested, as in the palindrome at the top, the language is typically context-free. When dependencies of unrestricted extent cross one another, as in the example of a copy language at the bottom, the language is beyond context-free.

Another approach to understanding the hierarchy looks at the kind of machine needed to generate each class of language. The machines in question are not devices of steel or silicon but abstract and idealized computing mechanisms. For a regular language, the machine required is a finite-state automaton, or FSA. As the name suggests, such a machine has a finite number of states (typically one state for each nonterminal in the grammar), which can be represented as nodes connected by arrows showing possible transitions between the states. The FSA begins operation in the state corresponding to the start symbol, then with each symbol produced moves along an arrow to a new state. An important property of an FSA is that it has no storage facility apart from the collection of states; it has no way of recording information for later use, and so it can produce patterns only when they are "hard-wired" into the connections between states.

More powerful languages are associated with more sophisticated machines. A context-free language is generated by a "pushdown automaton," which consists of an FSA augmented by a "stack" that provides a limited form of auxiliary memory. The stack works like a stack of cafeteria trays in that only the topmost item is accessible; it is also known as a last-in, first-out buffer. The stack can be used to generate palindromes by pushing a copy of each symbol onto the stack as it is produced, then popping the symbols off the stack in reverse sequence.

For a context-sensitive language the appropriate machine is one with a more versatile memory, one based on a tape that can move in either direction. The machine can read from or write to the tape and move left or right on it, which allows for the erasures and rearrangements required in dealing with copy languages and other manifestations of crossing dependencies. The one constraint is that the length of the tape be proportional to the length of the generated string, so that the space used for intermediate calculations can never exceed some fixed multiple of the string length.

When the memory tape is allowed to grow to any length, the resulting automaton is a Turing machine, which is the device needed to generate an unrestricted language. The Turing machine is a kind of universal computer; no one has yet found any enhancement to this computing architecture that would allow it to execute some algorithm that it cannot already handle. Thus any language that can be produced by a digital computer program can also be precisely specified by some unrestricted grammar, and vice versa.

When viewing languages as sets of strings, it is natural to ask what happens when languages are combined by the operations of set union and set intersection. The union of two languages is a concept that should be familiar to anyone who is bilingual, and who therefore recognizes strings in either of two languages. Two people who speak closely related languages might define the intersection of their languages as the set of strings they both recognize. Mathematicians investigate the "closure" of collections of sets under such operations. A collection of languages is closed under the operation of set union if the union of any two languages in the collection is also a language in the same collection. All the levels of the Chomsky hierarchy are closed under set union; in other words, the union of two context-free languages (for example) must still be a context-free language. It is interesting, however, that the intersection of two context-free languages (the strings those languages have in common) cannot be guaranteed to be context-free. This fact may have some bearing on the linguistic status of DNA.

Figure 6. Union and intersection of languages form new languages that may have different properties. Here two languages are shown side by side, both based on sequences of 0's, 1's and 2's. In the language on the left, the number of 1's in each string must equal the number of 2's, whereas in the language on the right the number of 0's always equals the number of 1's. Both languages are context-free, and their union (the set of all strings that are members of either language) is also context-free. However, the intersection of the languages (the set of strings common to both) is the language with equal numbers of 0's, 1's and 2's, which is greater than context-free.

Although the term generative grammar suggests only the production of language, grammars are equally useful in recognizing and analyzing language; they can listen as well as speak. One way to do this is to apply the rules of the grammar "in reverse," starting with a string of terminal symbols and building up progressively more abstract categories. A program that takes as input a grammar and a string and decides whether the string is a member of the corresponding language is a recognizer, a more practical version of the abstract machines described above. In the case of English, a recognizer accepts a string of words as a grammatical sentence if the words can be organized into lexical categories, such as noun and verb, then into phrases, then into still higher-level structures, until finally the entire string is subsumed under the grammar's start symbol, sentence. The time it may take to process a string with a general-purpose recognizer for a class of languages increases sharply as one ascends the Chomsky hierarchy, a price paid for greater expressive power.

With a little more effort, a recognizer can be extended to serve as a parser, which not only passes judgment on a string but also produces an analysis of its syntactic structure. The result is usually displayed as a parse tree, a downward-branching diagram that reveals the hierarchy of grammatical categories implicit in the derivation. Parse trees can also be produced in the process of generation to represent what Chomsky, in the context of natural language, called the deep structure of a sentence. Chomsky's subsequent work has been much concerned with transformations, or rearrangements, of parse trees. Transformations can systematically produce wholesale changes in the surface structure or word order of a sentence, reflecting different ways of expressing essentially identical syntactic constructs, such as "The linguist sees the biologist" and "The biologist is seen by the linguist."

A further extension of a recognizer allows it to produce output as it reads input. For example, in the course of changing states, an FSA might read a symbol and at the same time emit a symbol from an entirely different alphabet. A machine of this kind is a transducer; it can travel along one string and generate another string that is related to the first by rules inherent in the construction of the machine. As we shall see, the living cell has a number of transducers, as well as analogues of linguistic transformations, dependencies and recognition.

The Genetic Code

How do these linguistic notions apply to DNA and other biological macromolecules? At first glance, DNA, RNA and proteins are all linear, one-dimensional molecules, made up of subunits strung together like the links of a chain. In the case of DNA, the links are the nucleotide bases adenine, cytosine, guanine and thymine, abbreviated a, c, g and t. At the most superficial level, DNA can be viewed as the language made up of all strings drawn from this alphabet; all possible strings are members of the language because it appears that no physical or chemical constraints would prevent some particular sequence or class of sequences from being assembled. But viewing DNA as such a totally unconstrained and unstructured language ignores both its role as a genetic archive and its status as a participant in the biochemical processes of a living cell. Although it may be true that any base sequence could be constructed in principle, only a very small subset of possible sequences appear in the living world. Grammars might help to characterize that subset.
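
As an illustration of what a recognizer does, here is a minimal Python sketch for the regular case only. It takes a grammar and a string and decides membership by simulating the corresponding finite-state automaton. The dictionary encoding, the names `EVEN_ONES`, `ANY_DNA` and `recognize`, and the one-rule grammar for unconstrained DNA are assumptions made for this example; they are not taken from the article.

```python
# Right-linear (regular) grammars encoded as dicts:
#   nonterminal -> list of productions (terminal, next_nonterminal).
# The pair ("", None) stands for the empty-string (epsilon) production.

# The grammar for binary strings with an even number of 1's:
#   S -> 0S | 1T | epsilon,   T -> 0T | 1S
EVEN_ONES = {
    "S": [("0", "S"), ("1", "T"), ("", None)],
    "T": [("0", "T"), ("1", "S")],
}

# Assumption: a grammar for "all strings over a, c, g, t", written by
# analogy with the binary-integer grammar S -> 0S | 1S | epsilon.
ANY_DNA = {"S": [(base, "S") for base in "acgt"] + [("", None)]}


def recognize(grammar, string, start="S"):
    """Return True if `string` is derivable from `start` in `grammar`."""
    # Track every nonterminal that could be awaiting expansion after
    # each input symbol has been consumed (an FSA simulation).
    states = {start}
    for symbol in string:
        states = {
            nxt
            for state in states
            for terminal, nxt in grammar[state]
            if terminal == symbol and nxt is not None
        }
        if not states:
            return False
    # Accept if some pending nonterminal can be erased by an epsilon rule.
    return any(("", None) in grammar[state] for state in states)


print(recognize(EVEN_ONES, "0110"))    # two 1's
print(recognize(EVEN_ONES, "0111"))    # three 1's
print(recognize(ANY_DNA, "gattaca"))
```

Because the machine keeps no memory beyond its current set of states, the running time grows only linearly with the length of the string. A recognizer for context-free or context-sensitive grammars would need a stack or a tape, as described above, and would generally take longer.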
Figure 8. Linguistic themes abound in the mechanism of gene expression, illustrated here in a highly compressed and schematic way. Transduction, or the element-by-element mapping of one string to another having a different alphabet, is marked by red arrows; instances of this process are the transcription of the double-helical DNA (purple) into RNA (blue) by the enzyme RNA polymerase, shown as an ellipsoid near the middle of the diagram, and the subsequent translation of RNA into protein (orange) by the snowman-shaped ribosome at the lower right. Recognition of specific classes of strings is indicated by yellow, as at the upper left where regulatory proteins find promoter sequences in the DNA that help determine when and where transcription should begin. Transformation, the piecewise or wholesale modification of a string, is indicated by orange sequences, such as the intron being spliced out of the RNA at left. The splicing is mediated by RNA-protein complexes that recognize splice signals shown as yellow circles, squares and diamonds; the resulting lariat-shaped piece of RNA is discarded. The ribosome, another RNA-protein complex, is a transducer from triplets of RNA nucleotide bases to amino acids, shown as red spheres. The cruciform shapes are transfer RNAs that act as triplet recognizers (at their yellow end) and carry the corresponding amino acids for the ribosome to add to the protein. Dependencies between segments of a string are shown in green; they are most obviously manifested in the folding of protein and RNA. Part of the newly synthesized protein is shown being clipped off in a process called post-translational modification, which is another example of transformation. Even DNA often undergoes transformation, for example via the insertion of transposable elements such as the small circular DNA molecule at the upper right, which is shown recognizing a specific site.

gene -> upstream transcript downstream
transcript -> 5'-untranslated-region start-codon coding-region 3'-untranslated-region
coding-region -> codon coding-region | stop-codon | splice coding-region
codon -> lys | asn | ile | thr | met | ser | gln | his | arg | pro | asp | glu | ala | gly | val | tyr | trp | cys | phe | leu
start-codon -> atg
stop-codon -> taa | tag | tga
lys -> aa purine
asn -> aa pyrimidine
ile -> at pyrimidine | ata
thr -> ac base
met -> atg
ser -> ag pyrimidine | tc base
gln -> ca purine
his -> ca pyrimidine
arg -> cg base | ag purine
pro -> cc base
asp -> ga pyrimidine
glu -> ga purine
ala -> gc base
gly -> gg base
val -> gt base
tyr -> ta pyrimidine
trp -> tgg
cys -> tg pyrimidine
phe -> tt pyrimidine
leu -> tt purine | ct base
base -> purine | pyrimidine
purine -> a | g
pyrimidine -> c | t
splice -> intron
intron -> gt intron-body ag
splice a -> a intron
splice c -> c intron
splice g -> g intron
splice t -> t intron
a splice -> intron a
c splice -> intron c
g splice -> intron g
t splice -> intron t
Figure 9. Partial grammar specifies some of the structure of a typical protein-encoding gene. The transcript, which is the part of the gene that is copied into messenger RNA, has flanking 5'- and 3'-untranslated-regions, around a coding-region initiated by a start-codon. The coding-region consists of a codon followed by more coding-region, a recursion that ultimately terminates in a stop-codon. A stop-codon is any of the three triplets indicated, whereas a codon is any of the 61 other possibilities, given in individual codon rules. Some of those rules refer to the nucleotide classes purine and pyrimidine to capture variability in the third codon position. The rule for coding-region also allows for a splice at any point in the recursion, resulting in the insertion of an intron. A series of context-sensitive splice rules allow introns to shift left or right, reflecting the fact that introns can appear within as well as between codons. An intron of this type is generally bracketed by splice signals including gt and ag. This simple syntactic description omits many other imperfectly understood signals, such as promoters of transcription usually found in the region upstream of the transcript.
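The lexical level of the grammar in Figure 9 lends itself to a mechanical check. The following Python sketch is an illustration added here, not part of the original figure: it expands the codon rules, written with the nucleotide classes purine, pyrimidine and base, into explicit triplets, and then recognizes a start-codon followed by a coding-region. The names CODON_RULES, expand and is_open_reading_frame are hypothetical, and the recognizer deliberately ignores the splice rules.

```python
# Nucleotide classes used by the codon rules (DNA alphabet).
CLASSES = {"purine": "ag", "pyrimidine": "ct", "base": "acgt"}

# Each amino acid maps to its alternatives: either a literal triplet or a
# two-base prefix followed by the name of a nucleotide class.
CODON_RULES = {
    "lys": ["aa purine"], "asn": ["aa pyrimidine"],
    "ile": ["at pyrimidine", "ata"], "thr": ["ac base"], "met": ["atg"],
    "ser": ["ag pyrimidine", "tc base"], "gln": ["ca purine"],
    "his": ["ca pyrimidine"], "arg": ["cg base", "ag purine"],
    "pro": ["cc base"], "asp": ["ga pyrimidine"], "glu": ["ga purine"],
    "ala": ["gc base"], "gly": ["gg base"], "val": ["gt base"],
    "tyr": ["ta pyrimidine"], "trp": ["tgg"], "cys": ["tg pyrimidine"],
    "phe": ["tt pyrimidine"], "leu": ["tt purine", "ct base"],
}
STOP_CODONS = {"taa", "tag", "tga"}


def expand(alternative):
    """Expand one right-hand side into the list of triplets it generates."""
    parts = alternative.split()
    if len(parts) == 1:          # a literal triplet such as "atg"
        return [parts[0]]
    prefix, class_name = parts   # e.g. "aa", "purine"
    return [prefix + b for b in CLASSES[class_name]]


# Triplet -> amino acid, obtained by expanding every codon rule.
CODON_TABLE = {
    triplet: amino_acid
    for amino_acid, alternatives in CODON_RULES.items()
    for alternative in alternatives
    for triplet in expand(alternative)
}


def is_open_reading_frame(dna):
    """Recognize start-codon coding-region, where
    coding-region -> codon coding-region | stop-codon (splices ignored)."""
    if len(dna) % 3 != 0 or not dna.startswith("atg"):
        return False
    triplets = [dna[i:i + 3] for i in range(3, len(dna), 3)]
    if not triplets or triplets[-1] not in STOP_CODONS:
        return False
    return all(t in CODON_TABLE for t in triplets[:-1])
```

Expanding the rules this way gives 61 distinct codons for 20 amino acids, which together with the three stop codons accounts for all 64 triplets, as the caption states.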
Before proceeding further with a linguistic analysis of biological sequences, it would be well to briefly review the actual structure and chemistry of nucleic acids and proteins. The architecture of DNA is well known: Two strands of nucleotides, each thousands or millions of bases long, twine around each other in a double helix. The two strands are held together by a specific pattern of hydrogen bonding in which g mates with c and t with a. Thus the strands have complementary sequences: wherever a g appears in one strand there must be a c in the other strand, and likewise every t requires a complementary a. The strands are oppositely oriented, so that if one strand reads from left to right, the other reads from right to left. Such base-pairing is the key to the faithful replication of DNA, in which the strands separate and then serve as templates for new complementary strands.

When a gene is expressed, a region along one strand of the DNA is transcribed by the enzyme RNA polymerase. The resulting molecule of RNA is also a chain of nucleotides, and it is complementary to the transcribed strand of DNA. RNA is chemically quite similar to DNA, except that thymine is replaced by the closely related base uracil, so that the RNA alphabet consists of a, c, g and u. Although RNA is single-stranded, base-pairing nonetheless has an important place in its chemistry: Complementary pairing between regions of the strand determines how the RNA folds up to form a distinctive three-dimensional structure.

The RNA transcribed from most genes is not an end product but rather serves as an intermediary, called messenger RNA, which is subsequently translated into protein. Proteins are another class of linear molecules, but they are assembled from a wholly different alphabet, namely 20 amino acids. Each sequence of amino acids folds up to form a specific three-dimensional structure, guided by chemical interactions that are even more complex than those observed in DNA and RNA. The translation from RNA to protein is done by ribosomes, which are large molecular assemblies made up of both protein and RNA; amino acids are carried to the site of translation by transfer RNAs. The ribosomal RNAs and transfer RNAs are examples of RNA molecules that are not themselves translated into proteins. Even so, like proteins, they derive their functionality in large degree from the shape of their folded structure.

In translation, RNA can be interpreted as a language of triplets, in which successive groups of three adjacent bases, called codons, specify the sequence of amino acids in a protein. This is the language that biologists had in mind when they first began to speak of a genetic code and of transcription and translation. Four letters taken three at a time yield 64 possible codons, all of which are used, although there are only 20 amino acids spelled out in the genetic code. It follows that the code must have a good deal of redundancy, where several codons all specify the same amino acid. A few codons serve as marks of punctuation, signaling where translation should start and end.

The RNA polymerase molecule can be viewed as a simple linguistic transducer, which reads bases of DNA and writes complementary bases in the slightly different alphabet of RNA. Indeed, the enzyme is remarkably machine-like in its procession down the DNA template and synthesis of RNA output. Similarly, the ribosome is a transducer from RNA to protein. The ribosome begins by scanning an RNA molecule one base at a time, looking for the codon aug, which is the start signal of the genetic code and which also specifies the amino acid methionine (met).

S -> aSt | cSg | gSc | tSa | ε
atgttcgaacat
caaatcgatcatcgaagagctcttgttg
Figure 10. Biological palindromes in the genome give rise to distinctive secondary structures in folded molecules. In double-stranded DNA (far left), each g pairs with a c on the opposite strand, and each t with an a. When a substring of one strand appears on the other strand running in the opposite direction, the resulting pattern is called an inverted repeat. The symmetry of the pattern allows either of the strands to fold and pair with itself (middle); the corresponding RNA can adopt the same stem configuration. The single-stranded language of such biological palindromes is specified by a context-free grammar. (In real nucleic acids there is a loop of unpaired bases at the tip of the stem; the grammar is easily extended to accommodate such features.) By adding a rule that doubles the start symbol, the grammar is able to generate strings that form branched secondary structure (right), as is often found in RNA molecules. The parse tree for any derivation of these grammars reflects the actual physical structure of the folded RNA and is shown here drawn within the structure.
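The two grammars described in this caption can be turned into small recognizers. The Python sketch below is an illustration only, assuming the idealized DNA grammar S -> aSt | cSg | gSc | tSa | ε with no unpaired loop bases; the function names derives and derives_branched are hypothetical. The first function follows the palindrome rules directly, and the second adds the doubling rule S -> SS by trying every way of splitting the string in two.

```python
from functools import lru_cache

# Complementary DNA bases: the terminal pairs in S -> aSt | cSg | gSc | tSa.
PAIR = {"a": "t", "t": "a", "c": "g", "g": "c"}


def derives(s):
    """True if S =>* s using S -> aSt | cSg | gSc | tSa | ε."""
    if s == "":
        return True                      # S -> ε
    if len(s) < 2:
        return False
    # The outermost bases must pair; the rest must again derive from S.
    return PAIR.get(s[0]) == s[-1] and derives(s[1:-1])


@lru_cache(maxsize=None)
def derives_branched(s):
    """Same grammar with the extra rule S -> SS (branched structures)."""
    if s == "":
        return True
    if len(s) % 2 == 1:
        return False                     # every rule adds bases in pairs
    if PAIR.get(s[0]) == s[-1] and derives_branched(s[1:-1]):
        return True                      # outermost bases form a stem pair
    # S -> SS: split into two nonempty parts, each derivable from S.
    return any(
        derives_branched(s[:k]) and derives_branched(s[k:])
        for k in range(2, len(s), 2)
    )
```

Applied to the two sequences shown with the figure, the simple grammar accepts the first, while the second is accepted only once the doubling rule is available.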
On finding an aug, the transducer outputs a met unit, then continues scanning in a mode where it looks at groups of three successive nucleotides. For each triplet, the transducer (with the help of transfer RNA "adaptors") adds a specific amino acid to the growing protein chain; for example, gcg corresponds to alanine (ala), and aag to lysine (lys). The scanning continues until the transducer comes upon one of the triplets uaa, uag or uga, which do not specify an amino acid but instead terminate the translation process.

Figure 11. Cloverleaf pattern of transfer RNA is an example of branched secondary structure. The loop of the topmost stem includes a triplet complementary to a codon recognized on the messenger RNA. This codon specifies the amino acid carried at the bottom of the main stem of the transfer RNA.

If these two transducers (as described so far) could be combined and set loose on a real genome, they would eventually translate into protein all the coding regions present in the DNA. Unfortunately, the transducers would also produce an enormous volume of utter nonsense. One problem is that although every protein starts with a methionine unit, not every methionine appears at the start of a protein chain. Furthermore, each strand of DNA has three "reading frames," distinguished by whether the transducer begins reading with the first, the second, or the third nucleotide; since a DNA molecule is double-stranded, there are six reading frames altogether. Each of the reading frames generates a totally different message, and with few exceptions only one such transcript is valid. Thus, the actual transducers must be guided to the correct sites and the correct reading frames by other factors. It is worth remembering in this regard that DNA is not simply an abstract string of symbols but rather a molecular object in a cellular context. For example, it spends much of its existence tightly wound around proteins like thread on a succession of spools, and it interacts with a great many other proteins at specific sites.

The reality is no less complex for RNA. Even when the ribosome transducer happens to find a valid start signal in the correct reading frame, it cannot be taken for granted that the resulting protein is an actual gene product. In higher organisms genes are generally discontinuous, with long stretches of noncoding verbiage that must be spliced out in a step called processing. The meaningful regions that are preserved for translation are called exons; the intervening excised regions are introns. During transcription the entire length of the gene, including both exons and introns, is copied into RNA, but processing must be completed, to remove the introns, before the RNA can be translated into protein.

The removal of introns is largely governed by specific sequences in the RNA transcript. In some genes in some organisms these sequences participate in a characteristic folding of the RNA and are sufficient to remove the intron without outside help. Protein-coding genes rely on a more involved mechanism. The intron signals are found at and near the ends of the introns to be spliced, and they are recognized and bound by a complex of RNA and protein that supervises the precise removal of the intron. The sequence at the upstream end of the intron includes, among other features, the two-base sequence gu. The downstream end of the intron has several landmarks, ending in an ag at the splice site itself.

Although a simple transducer could perhaps be designed to model at least the bare facts of intron removal, this would miss the point that splicing involves complex structures and interactions that would be better described by a more hierarchical representation. In fact, the wholesale rearrangement of specific sequences is reminiscent of linguistic transformations that manipulate parse trees. What sort of grammar would be necessary to produce useful parse trees, and to be more selective in recognizing protein-coding genes?
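Before taking up that question, the two naive transducers discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model of the real enzymes: the template strand is assumed to be written in the order the polymerase reads it, amino acids are written with the standard one-letter codes rather than the three-letter names used in the text, and the function names transcribe, translate and six_reading_frames are hypothetical.

```python
# Transcription: each template DNA base is mapped to its complementary RNA base.
DNA_TO_RNA = {"a": "u", "c": "g", "g": "c", "t": "a"}

# Standard genetic code, indexed by codon with bases in the order u, c, a, g.
# "*" marks the three stop codons uaa, uag and uga.
BASES = "ucag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(template):
    """RNA polymerase as a transducer: one RNA base out per DNA base in."""
    return "".join(DNA_TO_RNA[base] for base in template)


def translate(rna):
    """Ribosome as a transducer: scan base by base for aug, then read
    triplets until a stop codon (or until the RNA runs out)."""
    start = rna.find("aug")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        residue = CODE[rna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def reverse_complement(dna):
    comp = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(comp[base] for base in reversed(dna))


def six_reading_frames(dna):
    """Three frames on the given strand, then three on the opposite strand."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(
                [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            )
    return frames
```

Run on an arbitrary stretch of DNA, translate will happily emit a "protein" from any aug it meets in any frame, which is exactly the nonsense the text warns about: nothing in these transducers knows about promoters, the correct reading frame, or introns.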
A Grammar of Genes
A grammar describing coding sequences as they appear in the DNA would capture the genetic code in a straightforward manner, recursively building up strings of codons beginning with a met codon and ending with one of the stop codons. The rules mapping codons to amino acids would constitute the lowest, lexical, level of the grammar. Introns might be inserted at any point during this accretion of codons, but there is a complication: Processing of the RNA is not constrained to a particular reading frame, so that it is quite possible for an intron splice site to interrupt a codon. One way to accommodate such interpolations is with context-sensitive grammar rules that give rise to movement of nonterminals in the developing string.

gaatattcgaatattc
Figure 12. Ambiguity in the grammar for nested palindromes gives rise to strings that have more than one parse tree, or molecules with more than one secondary structure. A double inverted repeat can form a simple hairpin (upper left), an intermediate cruciform structure (upper right) or a dumbbell (bottom). Although there is an unambiguous grammar for the language of general secondary structure, the ambiguous grammar may be preferable since the alternative parse trees reflect alternative secondary structures. Moreover, any grammar that specifically generates only adjacent pairs of inverted repeats must be ambiguous.

The part of the gene grammar dealing with translated sequences would be embedded in a higher-level rule for the entire RNA transcript, including transcribed but untranslated regions at both ends of the RNA; and the transcript rule would in turn be embedded within a rule of still wider scope covering upstream and downstream control regions associated with the gene. This rule structure allows for a natural hierarchical organization of our knowledge about the mechanisms of gene expression, with detail always presented at its appropriate level. What is most challenging at this point is the incorporation of grammar elements for the subtle signals controlling which potential coding regions are expressed as genes, and how they are processed.

Important features that signal the presence of a protein-coding gene, which are collectively called the promoter region, lie upstream of the transcript itself. In this region is a sequence called the tata box, after its characteristic nucleotide sequence tata. In higher organisms many other sequences may be present, such as a caat box further upstream (but with more variability) and gc boxes, which can be found in multiple copies on either strand of the DNA. Promoter sequences are involved in the binding of RNA polymerase at the start of transcription. Their effectiveness is greatly influenced by the presence of other genetic elements called enhancers, whose positions and orientations relative to the gene vary enormously; they may lie thousands of bases away, upstream or downstream.

The variation in the sequences of promoters, enhancers, splice signals and so on has made them hard to identify reliably. Most of them have been discovered either by noting similar sequences at similar positions relative to many genes, or else by direct evidence that they are binding sites of other molecules such as transcription factors, the proteins that mediate gene expression. In either case, the linguistic problem is one of recognition. From one perspective, the challenge is to recognize a "word" similar to a statistical aggregate of examples called a consensus sequence. From the other perspective, recognition models the action of a transcription factor (for example) in finding and binding to the appropriate sequence. Although grammars have their shortcomings in this regard, they also have some great strengths.

A recurring theme in these recognition regions is that more than one sequence comes into play at once. Transcription factors seldom act alone; instead a number of them seem to be required to work cooperatively. Similarly, in the processing of introns several RNA-protein complexes bind different sites. Grammars, by their nature, describe the relationships of words, and of categories of words, and of categories of categories. This last point is important because some transcription factors recognize not a DNA sequence directly but rather other transcription factors already bound to specific sequences. The picture that emerges is of a complex of transcription factors being modeled by a parse tree describing the organization of those factors and of the sites to which they bind on the string of DNA.
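The consensus-sequence view of recognition mentioned above can be made concrete with a small sketch. The Python below is an illustration only: it assumes the example sites are already aligned and of equal length, it takes the most common base in each column as the consensus, and it scans a sequence for windows within a given number of mismatches of that "word". The function names consensus and find_matches are hypothetical, and any sites passed to them here are toy strings, not data from the article.

```python
from collections import Counter


def consensus(sites):
    """Most common base in each column of a list of aligned, equal-length sites."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sites)
    )


def find_matches(dna, word, max_mismatches=1):
    """Slide the consensus word along the sequence and report every window
    that differs from it in at most max_mismatches positions."""
    hits = []
    for i in range(len(dna) - len(word) + 1):
        window = dna[i:i + len(word)]
        mismatches = sum(x != y for x, y in zip(window, word))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits
```

A recognizer of this kind treats each signal in isolation. It has no way to express that several signals must occur together, or in a particular arrangement, which is where the hierarchical rules of a grammar become useful.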

Grammars are also adept at capturing theme and variation. There are myriad arrangements of regulatory regions, including identifiable patterns that are specific for sets of genes expressed under similar conditions. Response elements, for example, are patterns associated with genes whose activity is altered by hormones, heat shock or other environmental factors. Whole families of genes are coordinately expressed in specific tissues and at specific periods in the development of the organism. There are classes and subclasses of introns with varying splice mechanisms, and growing evidence of species specificity of signals within these classes. As with natural language, it is relatively easy to express alternatives in the "phrasing" of these control elements, and to distinguish the more constant organizing aspects from the more variable, using the implicit structure of grammars.

Biological Palindromes
Grammars can also express dependencies with great facility. Such dependencies are likely to be inherent in regulatory regions, as we have seen, but dependencies may also be present even in the actual lexical sequences of DNA and RNA.

Many protein-binding sites on DNA are palindromes, although they differ somewhat from natural-language palindromes. The requirement for a biological palindrome is not that the letters be identical when they are matched one by one from both ends of the string; instead the letters must form complementary pairs, so that a aligns with t (or u) and c corresponds to g. For example, the RNA sequence aguucgaacu is a biological palindrome, since the first a pairs with the last u, the g in the second position complements the c that comes next to last in the string, and so on. In DNA such palindromes are called inverted repeats and have an interesting corollary, owing to the fact that any sequence of bases in one strand is also mirrored in the opposite strand. Even though a biological palindrome does not literally read the same forward and backward, if you read such a sequence from left to right on one strand, then switch to the complementary strand and read from right to left, you will indeed pass through the identical sequence again.

Palindromes are quite frequent in DNA, and short ones are bound to occur simply by chance. These random inverted repeats do not imply that the language of DNA is nonregular, and even palindromes that appear to be purposeful (such as those associated with protein binding) may yet be regular, technically, if there is a definite limit on their size. It is only when a language requires dependencies over an arbitrary extent that it must be deemed greater than regular.

The folding and base-pairing of an RNA strand to form what is called secondary structure does establish dependencies between the paired bases, and there is no theoretical limit on the distance over which those dependencies might extend. An RNA palindrome can be folded in the middle to create a base-paired "hairpin" structure, which is thermodynamically more stable than the same RNA without base pairing. Moreover, one cannot identify a longest hairpin, such that it would not be possible to add just one more base pair. To the extent that such folding is a requirement of the language of nucleic acids, they are beyond regular languages by virtue of their very chemistry.

A grammar for RNA palindromes is only a little more complicated than a grammar of binary-numeral palindromes. The underlying principle is the same, although the alphabet is larger and the nature of the pairing is different. Here is one form of the grammar:

S -> aSu | uSa | cSg | gSc | ε

Note that the nonterminal S is embedded within a string in the productions on the right-hand side of the rule, which is the trait that distinguishes a context-free grammar from a regular grammar. Trying a few random derivations gives a sense of how the grammar can generate all RNA palindromes, and only palindromes, as in this example:

S => aSu => auSau => augScau => augcau

Folding the final structure in half and pairing the bases that are thereby brought together yields a perfect match. What's more, the parse tree produced has the pleasing property that it can be drawn to reflect the base pairing seen in the actual secondary structure.

The perfection of the palindromes
string;
Figure 13. Attenuators are regulatory mechanisms that are thought to depend on alternative secondary structure for their operation. They are found upstream of certain bacterial genes coding for proteins that help to manufacture amino acids. When RNA polymerase begins to transcribe this upstream region, ribosomes immediately attach to the RNA and begin to translate the sequence. If the amino acid to be synthesized is present in abundance (and thus the corresponding transfer RNA is abundant as well), the first ribosome reads through a group of codons for that amino acid and into a region capable of forming either of two alternative hairpins (left). The ribosome obstructs the first part of this region and thus favors the formation of the second stem. Formation of the second hairpin sends a signal that causes the RNA polymerase to cease transcribing and to fall off the DNA. This shuts down gene expression and the synthesis of the amino acid (which was already abundant). If the amino acid is scarce, the ribosomes cannot read through the codons for that amino acid (right); they stall upstream and allow the first alternative hairpin to form instead of the second. This event in turn allows the RNA polymerase to proceed and manufacture the protein that will alleviate the shortage of the amino acid.

generated by this grammar is actually a weakness from the point of view of biological realism. Real RNA palindromes are often imperfect, but an occasional mismatched base pair does not destroy the secondary structure. (Some alternative base pairings, such as g:u pairs, are more tolerated than others.) It is common for the stem of a hairpin to have "bulges" of unpaired bases. Moreover, the RNA is not flexible enough for the tip of the hairpin to make a sudden 180-degree turn; generally there is a loop of at least three or four unpaired bases, and sometimes the loop is much longer. Similarly, in DNA significant inverted repeats may be separated by very large distances. Incorporating such features into a grammar complicates the form of the rules, but it can be done, for example by substituting for the ε in the palindrome grammar a nonterminal representing the loop, and adding a simple, regular rule specifying that loop. It is perhaps more interesting, though, to study further the linguistic nature of idealized secondary structure with complete base pairing.

S -> AaS | CcS | GgS | TtS | X     X -> ε
Aa -> aA   Ac -> cA   Ag -> gA   At -> tA
Ca -> aC   Cc -> cC   Cg -> gC   Ct -> tC
Ga -> aG   Gc -> cG   Gg -> gG   Gt -> tG
Ta -> aT   Tc -> cT   Tg -> gT   Tt -> tT
AX -> Xa   CX -> Xc   GX -> Xg   TX -> Xt

ctaacctaac

Figure 14. Direct repeats, or simple repetitions of a sequence, are a common feature of DNA. The language consisting of all such sequences is known to be greater than context-free. Here direct repeats are generated by a grammar that produces a nonterminal and a terminal for each base in the repeated sequence, then "slides" the nonterminals to the right where they are converted to terminals in the correct order by interaction with the nonterminal X. In terms of the parse tree, the nonterminals are skipped over and then gathered up at the position of the X in the tree. This rearrangement creates crossing dependencies.

S -> PX     X -> ε
P -> aPt | cPg | gPc | tPa | Q
Q -> AQt | CQg | GQc | TQa | ε

gcagaatgctgccatt

Figure 15. Pseudoknots are elements of secondary structure that may produce crossing dependencies and require non-context-free expression. A pseudoknot in RNA, shown in the diagram at right, results when one side of a stem resides within the loop of another stem. Only the context-free part of the pseudoknot grammar is given; the complete grammar also includes all the context-sensitive rules shown above in the grammar for direct repeats. The pseudoknot grammar generates an idealized pseudoknot language, without any unpaired bases. Like the direct-repeat grammar, it "skips over" the nonterminals produced by the Q rules and gathers them up at the position of the X in the parse tree. Tracing the terminal string around the parse tree in this manner produces the topologically equivalent secondary structure shown at the left, and preserves the correct base-pairing dependencies.

All of the palindrome grammars presented so far are limited to generating strings containing a single palindrome whose center of symmetry is the middle of the string, as in the classic "Madam, I'm Adam," where the "I" marks the center of the (odd-length) palindrome. But strings can have multiple palindromes, most simply when two or more palindromes are found side by side in the same string, as in "Madam Eve." Nucleic acids are rich in such compound palindromes. Indeed, DNA and RNA abound not only in sequential palindromes but also in recursive ones, where one palindrome is embedded in another. For example, "Madam Eve, I'm Adam" has the short palindrome "Eve" inserted off-center in the longer one. The secondary structure that corresponds to a recursive palindrome is a stem with another stem budding from its side. In principle there can be any degree of nesting of stems upon stems upon stems.

Modifying the palindrome grammar to allow for recursive secondary structure turns out to be simpler than one might guess. All that is required is to add to the existing productions a new one, stating that S -> SS. Duplicating the start symbol plants the seed of a new palindrome anywhere within an existing palindrome. It can be shown that this is a completely general grammar describing structures of branched stems, known as orthodox secondary structure. The most famous example of orthodox secondary structure is the cloverleaf structure of transfer RNA, but there are many other instances of extensively nested stem-and-loop structures, particularly in ribosomal RNA.

The grammar of recursive palindromes remains context-free, but orthodox secondary structure alters the nature of the language in another way: Ambiguity enters the scene. A grammar

is ambiguous if itcan generate the same elements of an attenuator can form a Climbing Chomsky's Ladder
string with two different parse trees. stem and loop in either of twomutually All of the palindromic structures, no
Syntactic ambiguity innatural language exclusive conformations; one
base-pair?
matter how elaborate, remain within
can be seen in the sentence "The lin?
ing allows transcription to continue, the family of context-free languages.
guist sees the biologist with the tele? whereas the other causes it to terminate But there are features of nucleic acids
scope/' where in one parse the linguist prematurely. The tendency to one or the thatdo lie beyond the descriptive pow?
has the telescope, but in the other the other secondary structure,under the in? er of context-free grammars. The most
biologist has it.Whereas the grammar fluence of thebiochemical context, in ef? obvious example is the presence of di?
of simple palindromes isunambiguous, fect forms a binary switch for the gene. rect repeats inDNA. Direct repeats are a
the grammar of recursive secondary There are many other indications that fairly common motif in the genomes of
structure is ambiguous in a particularly the language of genes should be consid? most organisms; for example, they oc?
interestingway. Consider strings con? ered to be ambiguous?cases where cur in some enhancers, in segments of
sisting of an inverted repeat that is dou? multiple start sites can be selected in the DNA that enter and leave the genome
bled, such as gatcgatc. This sequence same coding region, where alternative (such as viruses) and in the "amplifica?
clearly parses with the recursive-sec? patterns of splicing are achieved by tion" of certain heavily used genes. Di?
ondary-structure grammar as two side mixing and matching splice sites, and rect repeats may arise when a segment
by-side stems, formed by doubling the so on.
Ambiguous grammars accom? of DNA is duplicated, and they essen?
S at the outset. But the entire string can modate these themes with aplomb. tially comprise a copy language. Hence
also form a simple hairpin. In fact a se? Ambiguity, alas, does create difficulty they are beyond context-free, provided
ries of distinct parses is possible with in parsing. For one thing, a grammar
they are essential to the language and
such strings, inwhich S's are doubled at with more than one nonterminal on the are not simply present by chance.
a variety of points in the derivation. In
right-hand side of a production (such as A more interesting example of a non
each case, the parse tree again rnimics the duplicated start symbol SS) is called context-free structure is found by refer?
the actual form of a potential folded nonlinear, and it is disqualified from ring again to themechanism of RNA
structure?the ambiguity of the gram? certain speedups allowed inparsing lin? folding. RNA structures called pseudo
mar is indeed modeling a known ear context-free Even knots can be understood as palin?
phe? grammars. worse,
nomenon in RNA, that of alternative ifwe are interested in alternative sec? dromes that are interleaved rather than
secondary structure. ondary structures,we should be inter? nested. For example, where aaccgguu
In at least one case structural ambi? ested in all possible parse trees. The can be seen as a
nesting of the two bio?
guity inRNA is exploited forbiological number of treesmay increase exponen? logical palindromes aauu and ccgg, the
effect.An attenuator is a regulatory ele? tiallywith the length of the string being sequence aaccuugg is a pseudoknot,
ment found in some bacterial messen? parsed. Thus even if each individual where the complementary pairings
ger RNAs that appears to employ alter? parse is quite efficient, the overall task must cross over each other in order to
native secondary structures to control of finding all the trees may be in? form a base-paired secondary structure.
its own transcription. The palindromic tractablewith a general-purpose parser. An English example would be "DNA's
loops and spools," where "DNA" and "and" form one palindrome and the remainder of the string another. In actual pseudoknots, which are observed (for example) in certain RNA viruses and in a class of introns, one side of the stem of a stem-and-loop resides within the loop of another stem-and-loop. Features of this kind are referred to as nonorthodox secondary structure. As with direct repeats, pseudoknots entail crossing dependencies that exceed the capabilities of any pushdown automaton. A pseudoknot parser might push "DNA's loops" on a stack, but then it could not pop "DNA" from the bottom of the stack (in order to generate "and") without discarding the intervening letters that would be needed later for "spools." Both direct repeats and pseudoknots can be described by context-sensitive grammars and the corresponding automata.
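
The difference between nested and crossing pairings can be made concrete with a small sketch. This is an illustration only, not the author's parser (which is described as written in Prolog); the names COMPLEMENT, is_nested_palindrome and has_crossing are hypothetical, and the pairings passed to has_crossing are assumed to be supplied by hand as index pairs. The first function imitates a pushdown automaton by pushing the first half of the string and popping it against the second half; the second simply asks whether any two base pairs interleave.

```python
# Watson-Crick complements for RNA (lowercase, as in the article's examples).
COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}


def is_nested_palindrome(rna):
    """Stack-based check that every base pairs with its mirror-image partner.

    The first half is pushed; each base of the second half must complement
    the base popped from the top of the stack (last in, first out).
    """
    if len(rna) % 2:
        return False
    half = len(rna) // 2
    stack = list(rna[:half])
    for base in rna[half:]:
        if COMPLEMENT[stack.pop()] != base:
            return False
    return True


def has_crossing(pairs):
    """Return True if two pairs (i, j) and (k, l) interleave as i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for n, (i, j) in enumerate(ordered):
        for k, l in ordered[n + 1:]:
            if i < k < j < l:
                return True
    return False


# aaccgguu: aauu wrapped around ccgg, so the pairs nest.
nested_pairs = [(0, 7), (1, 6), (2, 5), (3, 4)]
# aaccuugg: aa pairs with uu and cc with gg, so the pairs interleave.
pseudoknot_pairs = [(0, 5), (1, 4), (2, 7), (3, 6)]

print(is_nested_palindrome("aaccgguu"), has_crossing(nested_pairs))
print(is_nested_palindrome("aaccuugg"), has_crossing(pseudoknot_pairs))
```

The stack check accepts aaccgguu and rejects aaccuugg. The rejection only shows that this one stack discipline fails on the interleaved string; the general claim that no pushdown automaton can handle such crossing dependencies rests on the formal argument in the text.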
Figure 16. Evolutionary "operations," such as the translocation or inversion of genetic sequences, may promote a language to a higher level in the Chomsky hierarchy. The strictly nested dependencies of the palindrome at the center become crossed dependencies when a translocation occurs, as in the word re-ordering at the top. This creates a pseudoknot-like structure. An inversion within the palindrome, as at the bottom, may create a direct repeat, also with crossing dependencies.

Specific secondary structures are characteristic of ribosomal RNA and transfer RNA and other forms of RNA that are not translated into protein. In protein-coding genes the evidence for such conserved syntactic structures is
less obvious. Secondary structure in transcripts may be important in splice site selection in cases of alternative splicing, and an excess of secondary structure at the beginning of RNA transcripts, where ribosomes first attach, may hinder translation in certain instances. In such cases, the absence of stems could be as important as their presence, for example if they interfere with some mechanical process. This is interesting from a mathematical standpoint because context-free languages are not closed under complementation; that is, the language of all strings that are not members of a given context-free language need not be context-free. Whether the resulting language of translatable transcripts is in fact greater than context-free on this account depends on the specific secondary structures that were excluded.

Closure arguments suggest a mechanism by which the language of genetic sequences may have ascended the Chomsky hierarchy, from the first random (and thus regular) polymers that may have condensed in the primordial soup. For the everyday operations that are performed on DNA (replication, scission, ligation), all the levels of the Chomsky hierarchy are closed. For example, creating the complementary strand of a string of DNA cannot make a context-free language greater than context-free (although it can make an unambiguous language ambiguous). However, context-free languages are not closed under evolutionary operations, those that involve block movements of segments of a genome. This is easily seen in the case of duplication, which creates direct repeats. It is less obvious why a context-free language of DNA should not be closed under inversion (where a segment of double-stranded DNA is flipped to an opposite orientation) or transposition (where two segments exchange positions). But consider what effect these operations can have on a palindrome: By reversing half of a palindrome, inversion creates a direct repeat, whereas transposition can produce a pseudoknot structure.

Another source of linguistic complexity in the living cell is superposition. A gene grammar must reflect a number of processes mediated by separate machinery at different times in different parts of the cell. The language of genes really represents the intersection of separate languages for transcription, processing, translation and even for the encoded protein sequence, which must specify the folded structure of the protein with its rich array of nested and crossing dependencies. Context-free languages, it will be recalled, are not closed under intersection.

Practical Parsing
The recognition of genes and other high-level features of DNA sequences is challenging from both a theoretical and a practical standpoint. Even if the language of DNA is greater than context-free, a practical parser can include special-purpose features that anticipate known problematic aspects of the language, such as direct and inverted repeats. This strategy can ameliorate the inefficiency of general-purpose parsing, at least for appropriately constructed grammars. Balancing this gain is the need to examine a potentially exponential number of parses for some grammars, and the fact that many recognition sites are underconstrained and vary in "strength."

In my own work, I have developed a parser in the logic programming language Prolog for a new grammar formalism that is "slightly" greater than context-free and that handles repeats in any form. This allows me, for example, to write a grammar for transfer RNAs that performs on a par with specialized algorithms designed for the purpose, and that offers much greater flexibility to modify and add rules, another strength of grammars. I have also added mechanisms that allow for imperfect matching and for keeping track of the "goodness of fit" of the parse. For protein-coding genes, where the combinatorics of splicing leads to a prohibitively large number of parses, I am using the grammar as a framework to apply statistical heuristics that others have used to detect likely coding regions, in this way limiting the parser to only the most probable combinations.

Such enhancements necessarily depart from the purity of formal grammars and from the realm of syntax, delving instead into the semantics, or meaning, of language. English syntax allows sentences that are nonsensical, even though grammatical, just as the rudimentary gene grammar described here allows nonsense proteins. Typical natural-language understanding programs hand off the results of parsing to components that attempt to evaluate the meaning of an utterance, perhaps in a much larger context, and thus help to resolve any syntactic ambiguity. In a similar way, gene grammars offer a means to encapsulate that which is inherent in the biochemistry of nucleic acids and the machinery that impinges on them, and then to incorporate more concrete data, heuristic methods or biochemical context in a principled way. In the process, the grammars developed serve to help codify our current knowledge of the higher-order structure of genetic information, and classify its complexity in terms relevant to its computational analysis.

Acknowledgments
For useful discussions and contributions to this work, the author thanks Brian Hayes, Erik Cheever, G. Christian Overton, James Tisdall, Sandip Biswas and Shan Dong. The author is supported in part by the Department of Energy under Genome Grant 92ER61371.

Bibliography
Brendel, V., and H. G. Busse. 1984. Genome structure described by formal languages. Nucleic Acids Research 12:2561-2568.
Brendel, V., J. S. Beckmann and E. N. Trifinov. 1986. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. Journal of Biomolecular and Structural Dynamics 4:11-21.
Collado-Vides, J. 1989. A transformational-grammar approach to the study of the regulation of gene expression. Journal of Theoretical Biology 136:403-425.
Head, T. 1987. Formal language theory and DNA: An analysis of the generative capacity of specific recombinant behaviors. Bulletin of Mathematical Biology 49(6):737-759.
Hopcroft, J. E., and J. D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley.
Lewin, B. 1990. Genes IV. Cambridge, Mass.: Cell Press.
Searls, D. B. 1988. Representing genetic information with formal grammars. In Proceedings of the Seventh National Conference on Artificial Intelligence, pp. 386-391. San Mateo, Calif.: Morgan Kaufmann.
Searls, D. B. 1989. Investigating the linguistics of DNA with definite clause grammars. In Logic Programming: Proceedings of the North American Conference on Logic Programming, ed. E. Lusk and R. Overbeek, pp. 189-208. Cambridge, Mass.: MIT Press.
Searls, D. B., and M. O. Noordewier. 1991. Pattern-matching search of DNA sequences using logic grammars. In Proceedings of the Seventh Conference on Artificial Intelligence Applications, pp. 3-9. Los Alamitos, Calif.: IEEE Computer Society Press.
Searls, D. B. 1992. The computational linguistics of biological sequences. In Artificial Intelligence and Molecular Biology, ed. L. Hunter, pp. 47-120. Cambridge, Mass.: AAAI Press.
Searls, D. B., and S. Dong. In press. A syntactic pattern-recognition system for DNA sequences. In Proceedings of the Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis, ed. H. A. Lim, J. Fickett, C. R. Cantor and R. J. Robbins. Singapore: World Scientific.

