
Symmetry:

Founding editors: G. Darvas and D. Nagy

The journal of the Symmetrion

Editor:
György Darvas

Volume 23, Numbers 3-4, 225-448, 2012

SYMMETRIES IN GENETIC INFORMATION AND ALGEBRAIC BIOLOGY

CONTENTS
ANNOUNCEMENT
- Symmetry Festival 2013, 2-7 August, Delft, The Netherlands 228
EDITORIAL, Sergey Petoukhov 229
SYMMETRY IN SCIENCE AND ART
- Genome symmetries, Paul Dan Cristea 233
- Symmetry of mitochondrial DNA. The case of COXn genes in primates and carnivores, Teodora Popovici and Paul Dan Cristea 255
- Symmetries of the genetic code, hypercomplex numbers and genetic matrices with internal complementarities, Sergey V. Petoukhov 275
- Fractal genetic nets and symmetry principles in long nucleotide sequences, S.V. Petoukhov, V.I. Svirin 303
- A Markov information source for the syntactic characterization of amino acid substitutions in protein evolution, Miguel A. Jiménez-Montaño 323
- Symmetries in molecular-genetic systems and musical harmony, G. Darvas, A.A. Koblyakov, S.V. Petoukhov, I.V. Stepanian 343
- Modeling “cognition” with nonlinear dynamic systems, Yuri V. Andreyev, Alexander S. Dmitriev 377
- The irregular (integer) tetrahedron as a warehouse of biological information, Tidjani Négadi 403
- Theory of topological coding of proteins and nature of antisymmetry of the amino acids canonical set, Vladimir A. Karasev 427
SYMMETRY: CULTURE AND SCIENCE is the journal of and is published by the
Symmetrion, http://symmetry.hu/. Edition is backed by the Executive Board and the
Advisory Board (http://symmetry.hu/isa_leadership.html) of the International Symmetry
Association. The views expressed are those of individual authors, and not necessarily
shared by the boards and the editor.


Any correspondence should be addressed to the

Symmetrion
Mailing address: Symmetrion c/o G. Darvas, 29 Eötvös St., Budapest, H-1067 Hungary
Phone: 36-1-302-6965
E-mail: symmetry@symmetry.hu
http://symmetry.hu/

Annual subscription:
Normal € 120.00,
Members of ISA € 90.00,
Student Members of ISA € 60.00,
Benefactors € 900.00,
Institutional Members please contact the Symmetrion.
Make checks payable to the Symmetrion and mail to the above address, or transfer to the
following account: Symmetrology Foundation, IBAN: HU24 1040 5004 5048 5557 4953 1021,
SWIFT: OKHBHUHB, K&H Bank, 20 Arany J. St., Budapest, H-1051.

© Symmetrion. No part of this publication may be reproduced without written
permission from the publisher.
ISSN 0865-4824 – print version
ISSN 2226-1877 – electronic version

Cover layout: Günter Schmitz;


Image on the front cover: Matjuska Teja Krasek: Star(s) for Donald, 2000, (tribute to H.S.M. Coxeter);
Images on the back cover: Matjuska Teja Krasek: Twinstar and Octapent;
Ambigram on the back cover: Douglas R. Hofstadter.
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 225-448, 2012

SYMMETRIES IN
GENETIC INFORMATION
AND
ALGEBRAIC BIOLOGY

A thematic issue

Guest editor:
Sergey V. Petoukhov
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 229-231, 2012

EDITORIAL

The biological meaning of genetic informatics is reflected in the brief statement: "life is
a partnership between genes and mathematics" (I. Stewart, 1999, Life’s Other Secret:
The New Mathematics of the Living World. New York: 304 p.). But what kind of
mathematics is in partnership with the genetic code, and what kind of mathematics
lies behind genetic phenomenology? This question is one of the main challenges in the
exact natural sciences today.

This thematic issue, Symmetries in genetic information and algebraic biology, focuses
on different aspects of the impressive properties of living matter: noise-immune
transmission of genetic information along the chain of generations; genetic coordination
of all inherited subsystems of any organism into a single unit, including the members of a
huge chorus of its cyclic processes; genetically concerted self-reproduction of the
genetic system and of the whole organism, etc. The authors of the issue try to understand
such properties from the mathematical point of view of the exact natural sciences, where
symmetry principles play a leading role.

Modern science knows that deep knowledge about phenomenological relations of
symmetry among separate parts of a complex natural system can tell many important
things about the evolution and mechanisms of these systems. Physics and other natural
sciences have a great number of successful applications of symmetry approaches.
Nowadays, many physical theories, beginning from the theory of relativity to quantum
mechanics, are created as theories of invariants of mathematical groups of
transformations, in other words as theories of special kinds of symmetry. The study of
symmetries and asymmetries in molecular structures is one of the important branches of
chemistry. For example, functional differences between the right handed forms of
molecules and the left handed forms of molecules in living organisms have become
known to mankind due to investigations of symmetry in biological molecules.

Biological organisms belong to a category of very complex natural systems, which
correspond to a huge number of biological species with inherited properties. But
surprisingly, molecular genetics has discovered that all organisms are identical to each
other in their basic molecular-genetic structures. This revolutionary discovery has
brought about a great unification of all biological organisms in science. The
information-genetic line of investigation has become one of the most promising lines
not only in biology, but also in science as a whole. The basic system of genetic coding
has turned out to be strikingly simple. Its simplicity and orderliness present challenges to
specialists from many scientific fields. Bioinformatics considers each biological
organism as an ensemble of interrelated information systems.
The genetic coding system is the basic one. All other biological systems must be
correlated to this system to be transmitted to the next generations of organisms.

The natural technology of genetic coding is the major and most effective technology of
life on our planet. Using this natural technology, a huge biomass of living matter with
unique and valuable properties is produced around the world. Bioinformatics and
biotechnology have been applied to many areas such as biology, medicine, and life
sciences. Bioinformatical knowledge is used to manufacture biological organisms with
new properties, to extend human life, to diagnose and treat disease on the basis of
“personal genetics”, to clone organisms, to develop new computer technologies, to
create new materials with unique characteristics, to propose new genetic algorithms for
technical applications, and so on. It seems that all fields of human life will be
influenced in the future by progress in bioinformatics.

Modern science recognizes the key role of information principles in the inherited
self-organization of living matter. Modern informatics is an independent branch of science,
which possesses its own language and mathematical formalisms and exists alongside
physics, chemistry and other scientific branches. The problem of the information
evolution of living matter has been investigated intensively in recent decades, in
addition to studies of the classical problem of biochemical evolution.

Physics and chemistry are not the only fields that deal with the principles and methods of
symmetry; informatics and digital signal processing also pay great attention to them. How is
the theory of signal processing connected to geometry and geometrical symmetries? Signals are
represented there as sequences of the numeric values of their amplitude at
reference points. The theory of signal processing is based on the interpretation of
discrete signals as vectors in multi-dimensional spaces. At every time step, a
signal value is interpreted as the corresponding value of a coordinate in a
multi-dimensional vector space of signals. In this way, the theory of discrete signals turns out
to be the science of geometries of multi-dimensional spaces, where different
multi-dimensional numeric systems can be useful. The number of dimensions of such a
space is equal to the number of reference points of the signal. Metric notions and all
other necessary tools are introduced in these multi-dimensional vector spaces to address
various problems of ensuring the reliability, speed, and economy of signal
information. For example, the important notions of the energy and the power of a discrete
signal appear in the multi-dimensional geometry of the space of signals as, respectively, the
square of the length of a multi-dimensional vector-signal and the square of the length
of a vector-signal divided by the number of dimensions of the corresponding space. On this
geometrical basis, many methods and algorithms are constructed for the recognition of signals
and images, the coding of information, the detection and correction of information errors,
artificial intelligence, and the training of robots. One can add here the importance of symmetries
in permutations of components for coding signals, in spectral analysis of signals, in
orthogonal and other transformations of signals, and so on. Investigation of symmetrical
and structural analogies between computer informatics and genetic informatics is also
needed for the creation of DNA-computers and DNA-robotics.
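
The geometric reading of energy and power described above can be illustrated with a minimal sketch. The function names and the sample values are illustrative assumptions, not part of the editorial; the signal is simply treated as a vector whose coordinates are its samples.

```python
def signal_energy(signal):
    """Energy: the squared Euclidean length of the signal vector."""
    return sum(x * x for x in signal)


def signal_power(signal):
    """Power: the energy divided by the number of dimensions (samples)."""
    return signal_energy(signal) / len(signal)


# Hypothetical four-sample signal, i.e. a vector in a 4-dimensional space.
samples = [3.0, -4.0, 0.0, 12.0]
print(signal_energy(samples))  # 169.0
print(signal_power(samples))   # 42.25
```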

This thematic issue contains nine articles in the following order. The first four articles –
by P. Cristea; T. Popovici and P. Cristea; S. Petoukhov; S. Petoukhov and V. Svirin –
are devoted to the study of hidden regularities in long nucleotide sequences and of a
possible use of multi-dimensional numerical systems for understanding the
structural organization of the genetic coding system. The article by M.A. Jiménez-Montaño
introduces a theoretical model to understand some aspects of protein
evolution. The article by G. Darvas, A. Koblyakov, S. Petoukhov and I. Stepanian presents some
relations between musical harmony and the genetic coding system that have
applications, in particular, in the field of musical culture. The article by Y. Andreyev
and A. Dmitriev reviews different biological applications of the theory of non-linear
dynamic systems and proposes a method for storing and processing information on the
basis of dynamic chaos. The article by T. Négadi is devoted to his classification model
of the amino acids and to some numeric features of the genetic code. The article by V.
Karasev describes his theory of topological coding of proteins and the nature of
antisymmetry of the amino acids’ canonical set.

Sergey Petoukhov
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 233-254, 2012

SYMMETRY IN SCIENCE AND ART

GENOME SYMMETRIES

Paul Dan Cristea

Electronics Engineer, Physicist (1941-2013)


Affiliation: BioMedical Engineering Center, University “Politehnica” of Bucharest (UPB), Splaiul
Independentei no. 313, 060042 Bucharest, Romania (e-mail: pcristea@dsp.pub.ro).
Current position: Professor of Electrical Engineering and Applied Information Science, Member of the
Romanian Academy, Fellow IEEE, Member of Honor of the Romanian Scientists Academy.
Publications: Over 510 papers on Genomic Signals (including Symmetry in Genomics), Signal & Image
Digital Processing, Neural Systems, Evolutionary Intelligent Agents, Intelligent e-Learning Environments,
Circuit Theory and Design, Computerized Medical Equipment, Measurement Equipment,
High Performance Electrical Batteries, Semiconductor Thin Layers, and Technical Physics.

Abstract: The paper gives an overview of the nucleotide genomic signal (NuGS)
methodology and its applications in revealing regularities and symmetries in genomes.
The long range symmetries make a genome resemble, from the structural point of view,
less a "plain text" and more a "poem," which obeys rules evoking “rhythm”
and “rhyme.” Both tangible symmetries, found in the genomes of extant taxa, and
hidden symmetries, which existed in ancestral genomes but have disappeared under
the evolutionary pressures linked to species separation, can be detected. The approach
also offers the possibility of applying signal processing methods to the analysis of
genomic data in the local study of genes, inserts, motifs, mtDNA, etc.

Keywords: nucleotide genomic signals, nucleotide imbalance, nucleotide pair imbalance

1. INTRODUCTION

The currently accepted concept of symmetry reaches much further than the classical
one, which only required “the harmony of the different parts of an object, the good
proportion between its constituent parts” (Darvas, 2007). The link between symmetry
and conservation laws was formulated in the work and thinking of the early physicists,
such as Galileo and Newton, but especially as a result of Einstein's theories on
the invariance of equations (Wigner, 1964). Now, it is clear that symmetry, regularity and
conservation share an ontological dimension. The existence of a system, object or
property requires its conservation from one moment to the next, which is essential for its
identification and recognition. Any conservation of a system, object or property with
respect to a transformation corresponds to a specific symmetry, ostensive or not.
Moreover, symmetry is much more than the harmony between parts of some objects,
actually underlying the deep harmony of the cosmos and granting its existence. The lack of
symmetry and regularity, which is the main feature of chaos, is always relative, being
accompanied by hidden symmetries and regularities. Cosmos and chaos form a dialectic
dichotomy, as opposite faces of the reality of the universe. Science focuses on
reproducible systems or phenomena, so that it always assumes their conservation and
the existence of the corresponding essential symmetry.

In the living world, the very existence of life as we know it, a large variety of extremely
slowly changing forms, requires very well structured genetic information and its
almost strict conservation in non-stationary and closely interacting multiple component
eco-systems. On close analysis, it is a matter of wonder that genomes survived and
even evolved on Earth, in their anti-entropic march towards increasing complexity,
from the first primitive self-replicating single-celled organisms up to the
tremendously complex extant organisms, including Homo sapiens. This process has taken
place for more than 3.5 billion years, over many successive generations and many
parallel or successive interrelated species. It continues today, in spite of the
intricate complexity of genomes, the apparent vulnerability of their replication
mechanisms, and the large number of perturbing factors potentially hindering their
replication. Genomes can consist of thousands of nucleotides for some viruses,
millions of nucleotides for bacteria, and up to billions of nucleotides for eukaryotes,
again including humans. For all extant species, the replication accuracy is high enough
to conserve the essential genetic information to a degree at which the various
correction, repair, and selection mechanisms, acting at molecular, infra-cellular,
cellular, tissue, organism, population, species and biosphere levels, are able to assure
life continuity, i.e., survival. In fact, this means about one error per 10,000 replicated
nucleotides for most genomic areas, but less than one error per 10,000,000
replicated nucleotides for critical genomic areas (Kunkel, 2004). A large variety of
DNA polymerases and repair enzymes are used to carry out and control these
processes. The replication speed is also impressively high. In humans, it is typically
about 50 nucleotides per second per replication fork (initiation site). Consequently, the
whole genome can be copied in only a few hours, due to the many replication forks
operating simultaneously. The replication speed can reach over 1,000 nucleotides
per second in bacteria, where the error rate is also considerably higher. The multiple
adverse factors that tend to disturb accurate genome replication, introducing genetic
noise, range from simple physical factors (thermal noise, ultra-violet,
radioactive and cosmic radiation) and chemical factors (fluctuating pH, unstable
environment, various active radicals triggering interconversion), up to more complex
molecular factors (insertion, deletion, recombination, transposition, chromosome breakage)
and pathogen level factors (chromatin virus modulation, retrovirus invasion), and many
more.

Despite the amazing complexity of the mechanisms through which it is put into action,
the strategy that proved effective in the conservation and evolution of genomes and of
their corresponding organisms during all the successive phases of biosphere
development on Earth remains essentially simple: replication is fast enough to keep
the accumulation of mutations low. This way, potentially harmful factors can create
only a small number of deviant genomes, cells and, finally, organisms. The first line of
defense against this genetic noise is given by the correction, repair and maintenance
mechanisms, acting at molecular and cellular levels. These mechanisms have themselves
evolved over time, gradually becoming highly complex and powerful, and are now able to
assure the already mentioned surprisingly good replication accuracy. But even more
important are the processes that act at a higher level, involving variability and selection as
creative tools to enable evolution. The vast majority of the genome is copied faultlessly,
or is properly corrected, carrying the intact genomic information to the next generation.
Most of the remaining mutations simply harm the functionality and replication capacity
of genomes and/or of their corresponding organisms, compromising their survival.
Natural selection will subsequently eliminate these faulty genomes and organisms,
without significant effects on the majority population. Nevertheless, there remain
exceptionally few mutations that confer advantages to genomes and/or to their
corresponding organisms. These exceptional mutations will be successful and, under the
effect of the same natural selection, will spread over the whole species by a
diffusion-like process, in a few thousand years. In this way, the potentially harmful
genetic noise, which seems to threaten genome integrity, actually emerges as fruitful
genetic variability, providing the informational support on which natural selection can
enrich the genetic pool of species with positive mutations. This fundamental
mechanism uses properly selected “noise” to produce evolution, by reversing the flow
of entropy towards higher order and complexity. The process is neither neat, nor gentle,
and occurs through attacks on genomes by all the mentioned and many other adverse
factors (Albrecht-Buehler, 2006). Still, without it, natural selection would have no
variants to act upon, and evolution would be impossible.

2. EXPLORING REGULARITIES IN DNA MOLECULES

The regularity and symmetry of genomes, which result in highly correlated distributions
of nucleotides along DNA molecules, are among the conditions necessary for the
existence of low level correction mechanisms in genome replication. By using the
Nucleotide Genomic Signal (NuGS) methodology (Cristea, 2002, 2005), we have
revealed such correlations, for both large- and short-scale sequences of nucleotides. We
have also proven theoretically, and verified experimentally for all available genomes,
that cross-over and recombination conserve the distribution of pairs of nucleotides
(Cristea, 2004). We have used this result to build an efficient model for the prediction
of nucleotide sequences, similar to the models used in time series prediction. A
prediction system with improved performance has been obtained, using a novel
two-component architecture comprising a Principal Component Analysis (PCA) block and
an Artificial Neural Network (ANN) acting as a learning machine (Cristea et al., 2007).
For real life sequences, such as genome nucleotide sequences, the system requires
learning a much lower number of parameters, thus greatly reducing the training time in
comparison to classical time-series prediction systems of similar complexity. A
retraining algorithm, which further reduces the computational burden and increases the
prediction accuracy, has also been used. Elements of such a system will be discussed in
Section 5.

There have been many attempts to reveal the symmetry and regular inner structure of the DNA
molecule. The best known result is Chargaff’s first law for mono- and oligo-nucleotides
in a double-stranded DNA molecule (Chargaff, 1951):

“The numbers of mono-, di-, tri-, ..., nucleotides on one strand of a double-
helix DNA molecule are equal to the numbers of their reverse complements on
the other.”

This result was fundamental in guiding Watson and Crick to formulate their well known
double-helix model for the DNA molecule (Watson & Crick, 1953). It clearly pointed
towards the complementarity of the two DNA molecule strands, and opened the way to
establish the existence of the nitrogenous base-pairs adenine-thymine (A-T), and
cytosine-guanine (C-G), formed by nucleotides placed on the two strands. Conversely,
the law results as a direct consequence of the nucleotide pair complementarity. It is
satisfied with utmost precision by all natural double-helix DNA molecules.
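
That the first law follows directly from base-pair complementarity can be checked with a short sketch. The helper names and the example strand are hypothetical; the second strand is built here purely from the A-T and C-G pairing rule, so the equality of counts is exact by construction.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a nucleotide word or strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmer_counts(seq, k):
    """Counts of all overlapping k-nucleotide words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def first_law_holds(strand, k):
    """Check that each k-mer on one strand occurs as often as its
    reverse complement on the other strand of the double helix."""
    other = reverse_complement(strand)  # second strand, read 5' to 3'
    counts = kmer_counts(strand, k)
    other_counts = kmer_counts(other, k)
    return all(n == other_counts[reverse_complement(word)]
               for word, n in counts.items())


# Hypothetical short strand; the check succeeds for any k.
print(all(first_law_holds("ATGCGTTAGC", k) for k in (1, 2, 3)))
```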

Chargaff also formulated a second law for mono- and oligo-nucleotides, which refers to
each strand of a DNA molecule:

“The numbers of mono-, di-, tri-, ..., nucleotides on one strand of a natural
DNA molecule (>100 Kb) are equal to the numbers of their reverse
complements on the same strand.”

This statement refers only to the nucleotides on one strand, so it has no connection
to nitrogenous base-pair complementarity in the DNA double helix. The rule is a truly
statistical law, valid only approximately and only for large enough nucleotide sequences
(> 100 Kb) of nuclear DNA. It is not valid for mitochondrial DNA or plasmids.
Chloroplast DNA contains two inverted repeats (IRA and IRB), so many genes
encoded by a chloroplast genome have two complementary copies, and Chargaff's
second law is valid for them. Mitochondrial, chloroplast and plasmid DNAs are
naked, i.e., not histone associated. Chargaff’s second law is satisfied by the total
numbers of mono- and oligo-nucleotides, but not locally, by the density of their
distribution along a nucleotide sequence, as is the case for Chargaff’s first law. In
many cases, the law is better satisfied by whole chromosomes, or whole genomes. Still,
this law is important as it reveals a marked regularity in the DNA longitudinal structure.
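
Because the second law is only approximate, it is natural to quantify how closely a given strand obeys it. The sketch below is one possible measure, not taken from the paper: the function name and the normalisation (0 for perfect agreement, 1 for maximal disagreement) are assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def second_law_deviation(strand, k):
    """Relative mismatch between the counts of k-mers and the counts of
    their reverse complements on the SAME strand.

    Returns 0.0 when every count equals that of its reverse complement,
    and 1.0 when no k-mer has its reverse complement on the strand.
    """
    counts = Counter(strand[i:i + k] for i in range(len(strand) - k + 1))
    total = sum(counts.values())
    # Include reverse complements that never occur (count 0).
    words = set(counts) | {reverse_complement(w) for w in counts}
    mismatch = sum(abs(counts[w] - counts[reverse_complement(w)])
                   for w in words)
    return 0.5 * mismatch / total


print(second_law_deviation("ATAT", 1))  # 0.0
print(second_law_deviation("AAAA", 1))  # 1.0
```

For real data, the same function would be applied to a sequence longer than 100 Kb, where the text above states that the law holds approximately.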

The genome pixel image (GPxI) method has been introduced as a simple and
straightforward graphical representation of the distribution of nucleotides along a DNA
strand, in order to explore its possible ordered structure (Albrecht-Buehler, 2006).
Different gray-tone values (or colors; Cristea, 2002) are arbitrarily assigned to the four
bases (A, C, G, T) of a DNA sequence, transforming the symbolic sequence of
nucleotides into a continuous line of pixels with varying gray values. Similarly, the
method can be extended to visualize the distribution of di- and tri-nucleotides along
a strand, by using a large enough number of color hues. The resulting line is arranged in
a rectangular frame, as successive lines of an arbitrarily chosen length w, placed one
below the other like a text. It is expected that a computer generated random
sequence would yield a featureless dot-pattern, while for exactly repetitive
motifs there would exist specific values of the width w for which the motifs align
perfectly, generating a pattern of vertical lines. Furthermore, in the more general case
of pseudo-repetitive sequences, otherwise easily overlooked
relationships could be revealed, as well as the possible causes of the broken periodicity,
e.g., mutations. Nevertheless, the need to carefully select the visualized segment of
the analyzed DNA sequence, as well as the need to decide upon an adequate width w
for which seemingly significant images are obtained, make the method too subjective.
Much more objective methods are based on the Nucleotide Genomic Signal (NuGS)
methodology, mentioned above and discussed in the next section.
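
The dependence of the GPxI picture on the width w can be seen in a small sketch. The gray values, function names and the toy repetitive sequence are arbitrary assumptions chosen for illustration; they are not the values used by Albrecht-Buehler or in the NuGS work.

```python
# Arbitrary gray levels for the four bases (an assumption of this sketch).
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}


def genome_pixel_rows(seq, width):
    """Map a sequence to gray values and fold it into rows of length width."""
    pixels = [GRAY[base] for base in seq]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def has_vertical_lines(rows):
    """True when all full-length rows are identical, so that equal gray
    values stack into vertical lines."""
    if not rows:
        return False
    width = len(rows[0])
    full = [row for row in rows if len(row) == width]
    return len(full) > 1 and all(row == full[0] for row in full)


motif_sequence = "ACG" * 8  # exactly repetitive toy sequence, period 3
print(has_vertical_lines(genome_pixel_rows(motif_sequence, 6)))  # True
print(has_vertical_lines(genome_pixel_rows(motif_sequence, 4)))  # False
```

Only widths that are multiples of the motif period align the repeats, which illustrates why the choice of w affects what the image appears to show.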

3. NUCLEOTIDE GENOMIC SIGNALS

The NuGS method is based on the conversion of symbolic nucleotide sequences into
digital signals (Cristea, 2005). The approach is suited to the analysis of large scale
genomic sequences, including both exons and introns, but also to local DNA
analysis, such as the study of the variability of pathogens. Comparisons of sets of closely
related signals describing the same genomic area in various individuals of a population,
or across species, are topics of interest. An important case is the development of
pathogen resistance to treatment (Cristea, 2006).

A key feature of this approach is that it operates with signals and, as a direct
consequence, the observed regularities take the form of mathematical properties, like a
piece-wise linear distribution of nucleotides or pairs of nucleotides, so that it is possible
to measure the extent to which they are satisfied. The method reveals significant
regularities, not only in the distribution of nucleotides along DNA sequences, but also
in the distribution of nucleotide pairs, similar to Chargaff’s second law. One important
difference is that not only the total numbers of nucleotides, di-nucleotides (pairs) or
other oligo-nucleotides are considered, but also the distributions of these genetic
alphabet structural elements along DNA sequences. Significant regularities have been
found in all studied genomes – Archaea, Bacteria and Eukaryota. A genome appears to
be more than a plain text, by satisfying symmetry restrictions that evoke the rhythm and
rhyme in poems (Cristea, 2010).

Such regularities help to identify exogenous inserts in the genomes of prokaryotes,
because such inserts show clearly different regularities. Inserts can comprise entire
retroviruses, individual genes, or some non-coding repeats. It is interesting to note that
even when these inserts are not fully active viruses, they can nevertheless
retain a certain pathogenicity, as they correspond to genes of enzymes, such as protease,
which facilitate the multiplication and dissemination of certain viruses, generating an
increased susceptibility to contamination with the corresponding virus.

For convenience, we briefly review here the main features of the 2D (complex) NuGS
representation we have mostly used in our work (Cristea, 2002, 2010, 2012). A
graphical representation with many similarities has been used for both DNA and protein
sequences (Randić et al., 2011). The mapping is an unbiased representation of
nucleotide classes, which uses the ordinality of numbers – their capability to order
classes, instead of the cardinality – their capability to express quantities (Cristea, 2002).
Four complex numbers (a = 1+j, c = –1–j, g = –1+j, and t = 1–j), having the quadrantal
symmetry shown in Fig. 1, are attached to the four nucleotides (adenine, cytosine,
guanine and thymine) of the DNA alphabet.

The complex representation in Fig. 1 is the result of a recursive process of nucleotide
representation and genome analysis, which showed that it is possible, and desirable, to
ignore the less important “amino” vs. “keto” dichotomy we have used in a 3D
tetrahedral representation (Cristea, 2002), thus reducing dimensionality without
information loss.

The real part (Re) of these numbers corresponds to the strength of the hydrogen bonds
among the nucleotide pairs (di-nucleotides) in the double helix Watson-Crick DNA
model (Watson & Crick, 1953), expressing the dichotomy “strong bonds” (C-G pair, Re = –1)
vs. “weak bonds” (A-T pair, Re = +1).

Similarly, the imaginary part (Im) corresponds to the nucleotide molecular structure,
expressing the dichotomy “pyrimidines” – heterocyclic aromatic compounds similar to
benzene, containing two nitrogen atoms in positions 1 and 3 of a ring (C-T pair,
Im=–1) vs. “purines” – heterocyclic aromatic compounds consisting of a pyrimidine
ring fused to a five-membered imidazole ring (A-G pair, Im = +1).
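
The mapping and its two dichotomies can be written down directly. This is a minimal sketch: the dictionary reproduces the four complex values given in the text, while the function names are hypothetical.

```python
# Complex values as given in the text: a = 1+j, c = -1-j, g = -1+j, t = 1-j.
NUCLEOTIDE_TO_COMPLEX = {
    "A": 1 + 1j,
    "C": -1 - 1j,
    "G": -1 + 1j,
    "T": 1 - 1j,
}


def to_complex_signal(seq):
    """Convert a symbolic nucleotide sequence into a complex signal."""
    return [NUCLEOTIDE_TO_COMPLEX[base] for base in seq.upper()]


def has_weak_bonds(base):
    """Re = +1: weak hydrogen bonds (A, T); Re = -1: strong bonds (C, G)."""
    return NUCLEOTIDE_TO_COMPLEX[base].real > 0


def is_purine(base):
    """Im = +1: purines (A, G); Im = -1: pyrimidines (C, T)."""
    return NUCLEOTIDE_TO_COMPLEX[base].imag > 0


print(to_complex_signal("ACGT"))
print([b for b in "ACGT" if has_weak_bonds(b)])  # ['A', 'T']
print([b for b in "ACGT" if is_purine(b)])       # ['A', 'G']
```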

Figure 1: The complex representation of nucleotides

All the complex representations of the nucleotides in Fig. 1 have the same absolute
value ($\sqrt{2}$), but their phases can be used to build NuGSs associated to the nucleotide
sequences. Two phase signals are particularly useful for this purpose:

- The cumulated phase – the sum of the phases of the complex representations of the
nucleotides in a sequence, from the first to the current h-th sample in the sequence:

$$\theta_c(h) = \sum_{k=1}^{h} \arg\left(C\{Nu(k)\}\right) = \frac{\pi}{4}\, N(h), \quad h = 1, \ldots, n_b, \tag{1}$$

where $Nu(k)$ is the k-th nucleotide in the sequence, $C\{Nu(k)\}$ – its complex
representation, $N(h)$ – the nucleotide imbalance, and $n_b$ – the number of nucleotides
(bases) in the sequence. We have shown (Cristea, 2005) that the nucleotide imbalance is
a signature of the distribution of nucleotides in the sequence:

$$N(h) = 3\,[n_G(h) - n_C(h)] + [n_A(h) - n_T(h)], \quad h = 1, \ldots, n_b, \tag{2}$$

where $n_A(h)$, $n_C(h)$, $n_G(h)$ and $n_T(h)$ are the number of occurrences of adenine, cytosine,
guanine and thymine nucleotides, respectively, in the first h samples of the sequence.

- The unwrapped phase – the phase of the elements in the sequence, corrected by
adding an integer multiple of $2\pi$ (i.e., $2\pi m$, $m \in \mathbb{Z}$, $\mathbb{Z}$ – the set of integers), so that the
absolute value of the difference of phase between any two successive entries of the
sequence be smaller than $\pi$:

$$\theta_u(1) = \arg\left(C\{Nu(1)\}\right),$$

$$\theta_u(h) = \arg\left(C\{Nu(h)\}\right) + 2\pi m, \quad m \in \mathbb{Z}, \ \text{so that} \ |\theta_u(h) - \theta_u(h-1)| < \pi. \tag{3}$$

We have also shown (Cristea, 2005) that:

$$\theta_u(h) = \theta_u(1) + \frac{\pi}{2}\, P(h), \quad h \in \{2, \ldots, n_b\}, \tag{4}$$

where $P(h)$ is the nucleotide pair imbalance, a signature of the distribution of
di-nucleotides (pairs of nucleotides) in the sequence, given by:

$$P(h) = n_+(h) - n_-(h), \quad h \in \{2, \ldots, n_b\}, \tag{5}$$

where $n_+$ is the number of positive pairs (AG, GC, CT, TA) and $n_-$ is the number
of negative pairs (AT, TC, CG, GA) formed by the first h samples of the
sequence. Usually, $\theta_u(1)$ is negligible.
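
Equations (1) and (2) can be checked numerically. The sketch below computes the cumulated phase from the complex representation and the nucleotide imbalance from the base counts, then compares them through the factor π/4. The function names and the example sequence are hypothetical.

```python
import cmath
import math

NUCLEOTIDE_TO_COMPLEX = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}

# Per-base contribution to N(h) = 3 [n_G - n_C] + [n_A - n_T], eq. (2).
IMBALANCE_STEP = {"A": 1, "C": -3, "G": 3, "T": -1}


def cumulated_phase(seq):
    """theta_c(h) for h = 1..len(seq): running sum of the phases, eq. (1)."""
    total, signal = 0.0, []
    for base in seq:
        total += cmath.phase(NUCLEOTIDE_TO_COMPLEX[base])
        signal.append(total)
    return signal


def nucleotide_imbalance(seq):
    """N(h) for h = 1..len(seq), as an integer-valued signal, eq. (2)."""
    total, signal = 0, []
    for base in seq:
        total += IMBALANCE_STEP[base]
        signal.append(total)
    return signal


seq = "GATTACA"  # hypothetical example sequence
n_signal = nucleotide_imbalance(seq)
print(n_signal)  # [3, 4, 3, 2, 3, 0, 1]
print(all(math.isclose(theta, math.pi / 4 * n, abs_tol=1e-12)
          for theta, n in zip(cumulated_phase(seq), n_signal)))  # True
```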

As they have a direct statistical significance and are expressed by integer numbers, it is
convenient to use the nucleotide imbalance (N) and the nucleotide pair imbalance (P)
in genomic signal analysis, instead of the cumulated phase ($\theta_c$) and the unwrapped
phase ($\theta_u$), respectively.
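
The nucleotide pair imbalance can be computed directly from the pair counts of eq. (5), without going through phase unwrapping. In this sketch the function name and example sequence are hypothetical, and pairs outside the two listed sets (for example AA or AC) are treated as contributing zero, which is how eq. (5) reads.

```python
POSITIVE_PAIRS = {"AG", "GC", "CT", "TA"}
NEGATIVE_PAIRS = {"AT", "TC", "CG", "GA"}


def pair_imbalance(seq):
    """P(h) for h = 2..len(seq): positive minus negative pairs formed by
    the first h samples, following eq. (5)."""
    total, signal = 0, []
    for previous, current in zip(seq, seq[1:]):
        pair = previous + current
        if pair in POSITIVE_PAIRS:
            total += 1
        elif pair in NEGATIVE_PAIRS:
            total -= 1
        # Any other pair leaves the imbalance unchanged.
        signal.append(total)
    return signal


# Pairs of GATTACA: GA(-), AT(-), TT(0), TA(+), AC(0), CA(0)
print(pair_imbalance("GATTACA"))  # [-1, -2, -2, -1, -1, -1]
```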

For the comparative analysis of genes or other conserved nucleotide sequences, e.g.,
when studying variability, one considers a set of n similar NuGSs derived from n
individuals in a group, among which mutations might have occurred. In such cases, the
set of signals Sk (k = 1,..., n) can be characterized by a pair of signals:

R – the reference against which we compare the set of signals, usually a signal
that expresses the common trend of the signals, and

Ok = Sk – R – the individual offset of Sk (k = 1,..., n) with respect to R.


In most cases, the reference can be chosen as: (1) the average (mean) or any other
linear combination of the signals in the set (weighted mean), including the choice of
one of the signals in the set, (2) the median, or (3) the ModeStep signal. As its name
suggests, the ModeStep signal is built by selecting at each point the variation (step) that
occurs the largest number of times (the mode) at that point, over the entire set of signals.
The starting point is chosen as the median of the starting points.
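
The reference choices and the offsets can be sketched as follows. All function names and the toy signals are hypothetical. When several steps tie for the mode, this sketch takes the first one encountered, which is an assumption the text does not settle.

```python
from collections import Counter
from statistics import median


def median_reference(signals):
    """Point-wise median of a set of equally long signals."""
    return [median(column) for column in zip(*signals)]


def mode_step_reference(signals):
    """ModeStep: start at the median of the starting points, then at each
    point add the step that occurs most often across the set."""
    reference = [median(s[0] for s in signals)]
    for i in range(1, len(signals[0])):
        steps = Counter(s[i] - s[i - 1] for s in signals)
        most_frequent_step = steps.most_common(1)[0][0]  # ties: first seen
        reference.append(reference[-1] + most_frequent_step)
    return reference


def offsets(signals, reference):
    """O_k = S_k - R for every signal in the set."""
    return [[x - r for x, r in zip(s, reference)] for s in signals]


# Three toy signals; the third deviates by one step at index 2.
signals = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 0, 1]]
reference = mode_step_reference(signals)
print(reference)                       # [0, 1, 2, 3]
print(offsets(signals, reference)[2])  # [0, 0, -2, -2]
```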

To monitor the variability of a pathogen, e.g., to detect and track the development of its
resistance to treatment, the natural choice for R is the wild type (WT) of the pathogen
nucleotide sequence, usually downloaded from a genomic database such as
GenBank (GenBank, 2012). This is feasible when the variability is small enough, as in the
case of Mycobacterium tuberculosis, so that the pathogens in the isolates from various
patients are not too different from the WT. But for highly variable pathogens,
such as HIV, the differences between individual signals and the WT signal
might be too large, so that the reference must be constructed from the set of
signals itself.

The resolution can be further improved by using the digital derivatives of the offsets.
This is particularly useful for identifying point mutations (single-nucleotide genetic
variations), which correspond to step variations in the offsets, and for determining the
distance between the individual signals and the reference.
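As an illustration, the digital derivative can be taken as the first-order difference of an offset; this reading is an assumption, since the text does not define the operator explicitly. Non-zero samples of the derivative then mark the positions of the steps. The function names are hypothetical.

```python
def digital_derivative(signal):
    """First-order difference d(k) = s(k) - s(k-1), with d(0) = s(0)."""
    return [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]


def step_positions(offset):
    """Indices at which the offset changes, i.e. candidate mutation sites."""
    return [k for k, d in enumerate(digital_derivative(offset)) if d != 0]


offset = [0, 0, 4, 4, 2, 2]  # toy offset with two steps
print(digital_derivative(offset))
print(step_positions(offset))
```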

4. REGULARITIES IN GENOMIC SIGNALS

To illustrate the NuGS methodology and the regularities in the nucleotide distribution
that it can reveal, we present some results of the global and local analysis of DNA
molecules. We briefly discuss the phase (Subsection 4.1) and di-nucleotide (Subsection 4.2)
statistical analysis of a prokaryote genome (Helicobacter pylori), as well as a
comparison of Hominidae family mitochondrial DNA (mtDNA) genes (Subsection 4.3).
The three problems presented below have been chosen not only to show the versatility
of the NuGS approach in tackling global and local aspects of nucleotide sequence
analysis, but also because of the intrinsic interest in understanding the molecular scale
properties of genomes and their ecological role. This is certainly true for Helicobacter
pylori, a bacterium with a significant function in both the normal and the pathological
state of the human gastrointestinal tract, and for which the appropriateness and
management of antibiotic treatment need further improvement. On the other hand, it is
important to find new methods to study gene variability and evolution, to better
understand gene functions and the possibility of harmless control.

4.1 Entire genome NuGS analysis

Figure 2 presents the nucleotide imbalance (N) and the nucleotide pair imbalance (P)
along the DNA sequence of the entire Helicobacter pylori 26695c genome, downloaded
from GenBank (GenBank 2012) under the accession number NC_000915. The length of the
DNA sequence is 1,667,867 bp, comprising the whole circular chromosome of the H. pylori
genome. H. pylori is a helix-shaped bacterium, about 3 µm long and about 0.5 µm in
diameter, identified in 1982 by B. Marshall and R. Warren in patients with chronic
gastritis and gastric ulcers, conditions previously not believed to have a microbial
cause. It is also linked to duodenal ulcers and stomach cancer. More than 50% of the
world's population harbor H. pylori in their upper gastrointestinal tract, but about 80%
of carriers are asymptomatic, and the bacterium is considered to play an important role
in the natural stomach ecology (Yamaoka, 2008). The study of the genome is focused on
understanding pathogenesis, the ability of about 29% of the loci to cause disease,
believed to be linked to a 40 kb cag pathogenicity island (PAI), which contains genes of
virulence proteins. The low GC content of the cag PAI relative to the rest of the
H. pylori genome suggests that it was acquired by horizontal transfer from another
bacterial species.

Figure 2: Nucleotide imbalance (N) and nucleotide pair imbalance (P) for H. pylori 26695c genome
(GenBank 2012, NC_000915).

The circular DNA of H. pylori is divided by the variation of the nucleotide imbalance
signal N into two segments: one having 3nG + nA in excess of 3nC + nT, resulting in a
positive slope (+0.0504) for N, the other having the reversed property (slope -0.0768).
Taking into account equation (2) and Chargaff’s second law, which states that the numbers
of occurrences of complementary nucleotides in a single strand of a DNA molecule should
be the same, one would expect nG and nC, as well as nA and nT, to balance each other, and
N to remain close to zero. This is the case for most eukaryotes, but obviously not for
H. pylori, for which the variation of N is approximately piece-wise linear. As can be
seen by direct visual inspection of the curve in Fig. 2, the linearity of the two
segments of the nucleotide imbalance signal N is not very good. This type of regularity,
usually significantly better than for H. pylori, is found in all bacteria and some
archaea, with typical features and parameters defining a “physiognomy” for each genome.
The separation points have a biological meaning: the minimum of the nucleotide imbalance
signal N, at the origin of the sequence, corresponds quite accurately to the origin of
replication, while the maximum (8.14 × 10⁵ bp) corresponds, with less precision, to the
terminus of replication.

We have used the following measures of the linearity:


- Mean Absolute Error (MAE): the average per nucleotide of the absolute differences
between the actual values and the values of the best (least mean square error) linear
fit; it estimates the error of the linear fit;
- Linear to Absolute Error Ratio (LAER): the ratio between the variation per nucleotide
of the best (least mean square error) linear fit and MAE; it compares the estimated
linear variation per nucleotide with the fluctuations with respect to the linear fit.

LAER is a ratio which compares the variation of the best linear fit with the absolute
error, on the same length of the nucleotide sequence, for each segment of an
(approximately) piece-wise linear signal. As the error has been expressed by the mean
absolute error, computed as an average per nucleotide, the same approach has been used
for the linear (regular) variation, to shorten the wording of the definition. Certainly,
dividing both the numerator and the denominator by the number of nucleotides does not
change the ratio.
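The two measures can be sketched as follows, reading the verbal definitions literally: MAE is the mean absolute residual of the least-squares line, and LAER is the absolute slope (linear variation per nucleotide) divided by MAE. Any additional scaling or normalization behind the reported values is not stated, so this sketch is not expected to reproduce them; the function names are hypothetical.

```python
def linear_fit(signal):
    """Least-squares line through the points (k, signal[k]); returns (slope, intercept)."""
    n = len(signal)
    x_mean = (n - 1) / 2
    y_mean = sum(signal) / n
    sxx = sum((k - x_mean) ** 2 for k in range(n))
    sxy = sum((k - x_mean) * (y - y_mean) for k, y in enumerate(signal))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def mae_and_laer(signal):
    """Return (MAE, LAER) for one (approximately) linear segment."""
    slope, intercept = linear_fit(signal)
    residuals = [abs(y - (slope * k + intercept)) for k, y in enumerate(signal)]
    mae = sum(residuals) / len(signal)
    # A perfectly linear segment has zero error, hence an unbounded ratio.
    laer = abs(slope) / mae if mae > 0 else float("inf")
    return mae, laer


print(mae_and_laer([0, 2, 2, 4]))
```

For a piece-wise linear signal, the function would be applied separately to each segment, as the definition above requires.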

For the ascending branch (0 – 8.14 × 10⁵ bp) of the H. pylori nucleotide imbalance signal
N, these measures of linearity are MAE = 1.1 and LAER = 7.65, showing a rather high
fluctuation of the N signal with respect to its inferred regular (linear) variation. The
behavior is a little better for the descending branch (8.14 × 10⁵ – 16.68 × 10⁵ bp), where
MAE = 0.85 and LAER = 9.13, in accordance with the visual estimation of the linearity.

In contrast, a much better linearity is found for the variation of the nucleotide pair
imbalance P signal along the entire DNA strand. According to (5), such a feature
corresponds to a uniform statistical difference between the n+ pairs and the n− pairs. The
linearity of the nucleotide pair imbalance P is a general property found in all the
investigated genomes, but the slope of P is positive for animal eukaryotes and negative
for plant eukaryotes and bacteria. We have shown (Cristea, 2003) that recombination
and crossing-over conserve this regularity, while local random mutations, such as
uncorrelated SNPs (single nucleotide polymorphisms, “snips”), tend to destroy it. For
the nucleotide pair imbalance P signal of H. pylori, the slope is +0.0231 for the entire
genome, while the linearity measures are MAE = 0.35 and LAER = 19.45, showing a
lower fluctuation and a variation closer to the linear one. Much better linearity of P has
been found in other genomes, such as Mycobacterium tuberculosis, for which MAE =
0.12 and LAER = 170.51.

The linearity of the nucleotide pair imbalance P is one of the most striking regularities
of the DNA nucleotide genomic signals, especially since it holds even at the points
where the N signals change slope (Cristea, 2005), as can be seen in Fig. 2. As
mentioned above, recombination and crossing-over, which imply the simultaneous
direction reversal and strand switching of a DNA double-helix segment, conserve both
n+ and n− in (5), so that the nucleotide pair imbalance P signal remains unchanged,
whereas the nucleotide imbalance N signal changes. This property suggests exploring a
possible “hidden” symmetry of DNA molecules by re-orienting all exons in the
genome along the same positive direction (Cristea, 2004). The positive or negative
orientation of the coding segments (exons) in a nucleotide sequence is known from the
tRNAs and the proteins which are synthesized, but there is no such information for the
non-coding segments. Consequently, only the exons can be re-oriented in the same
positive direction. As expected, the nucleotide pair imbalance P signal does not
change, but the nucleotide imbalance N signal becomes almost linear along the
entire strand. This outcome points to a putative less differentiated ancestral genomic
structure, from which the current nucleotide structures, revealed by their specific
piece-wise linear N signals, have evolved (Cristea, 2010).
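The invariance argument can be checked on a toy sequence. The sketch below assumes that direction reversal combined with strand switching amounts to taking the reverse complement of the segment, and that N uses the weights A = 1, C = -3, G = 3, T = -1; the sequence and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
WEIGHT = {"A": 1, "C": -3, "G": 3, "T": -1}  # assumed weights of N
POSITIVE = {"AG", "GC", "CT", "TA"}
NEGATIVE = {"AT", "TC", "CG", "GA"}


def reverse_complement(seq):
    """Complement every base, then reverse the direction of the segment."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pair_counts(seq):
    """Return (n+, n-) for the whole segment."""
    s = seq.upper()
    pairs = [a + b for a, b in zip(s, s[1:])]
    n_plus = sum(p in POSITIVE for p in pairs)
    n_minus = sum(p in NEGATIVE for p in pairs)
    return n_plus, n_minus


def total_imbalance(seq):
    """End value of the nucleotide imbalance N over the segment."""
    return sum(WEIGHT[b] for b in seq.upper())


segment = "AGGCTTACGA"            # hypothetical toy segment
flipped = reverse_complement(segment)
print(pair_counts(segment), pair_counts(flipped))          # identical pair counts
print(total_imbalance(segment), total_imbalance(flipped))  # opposite signs
```

Each positive pair maps to a positive pair under reverse complementation (e.g., AG becomes CT) and each negative pair to a negative one, while every base weight changes sign, which is why P is conserved and N is not.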

Figure 3 gives the nucleotide imbalance N signals for the H. pylori complete genome
(marked all, length 1,667,867 bp, as in Fig. 2) and for the 1,626 concatenated
exons, kept in their initial orientation in the genome (marked NOCS, non-reoriented
coding segments, length 1,527,953 bp). The figure also shows the nucleotide imbalance
N signals for the complete genome and for the concatenated exons after the
re-orientation of all exons in the same positive direction (marked rfr, re-framed,
containing both reoriented exons and non-reoriented non-coding segments, and ROCS,
reoriented coding segments, containing only reoriented exons, respectively). The most
important property of the nucleotide imbalance N signals revealed by the exon
re-orientation shown in Fig. 3 is the transformation of the piece-wise linear variation
of the all and NOCS signals into the quite surprising approximately linear variation of
the rfr and ROCS signals. It is remarkable that the linearity of the ROCS signal
(MAE = 0.66 and LAER = 42.75), which contains only exons oriented in the same positive
direction, is better than the linearity of the rfr signal (MAE = 1.01 and LAER = 30),
which also contains some unchanged non-coding segments.

Figure 3: Nucleotide imbalance (N) of H. pylori 26695c genome (GenBank 2012, NC_000915, all – complete
genome, rfr –reoriented exons and non-reoriented non-coding segments, NOCS – 1,626 non-reoriented
concatenated exons, ROCS – reoriented concatenated exons).

As mentioned above, the four nucleotide pair imbalance P signals, for the complete
genome (all and rfr) on the one hand, and for the non-reoriented and reoriented
concatenated exons (NOCS and ROCS) on the other, do not change after reorienting the
exons and remain approximately linear. In a graphical representation (not shown here),
the curves would appear as two pairs of superposed lines.

These results reveal not only the large scale regularities in the genomes of extant
species, but also those in putative ancestral genomes, which disappeared in the process
of evolution, under the pressure of species separation.

4.2 Whole genome di-nucleotide statistical analysis

On a DNA strand it is possible to have 4 × 4 = 16 distinct di-nucleotide pairs (aa, ac,
ag, at, … , tt), which can be arranged in 8 complementary di-nucleotide couple
difference signals in accordance with the Watson-Crick rules (tg – ca, gt – ac, gg – cc,
aa – tt, ct – ag, ga – tc, at – at, cg – cg). The first six difference signals are
non-trivial, the other two are identically zero, as they contain pairs of
self-complementary di-nucleotides. For the special case of di-nucleotides, Chargaff's
second law states that the numbers of di-nucleotides on one strand of a natural DNA
molecule (> 100 kb) should be equal to the numbers of their reverse complements on the
same strand, meaning that the complementary di-nucleotide couple difference signals,
which start from the origin, should become zero again at the end of the sequence (null
return value). This statement is only approximately fulfilled, and only for some of the
mentioned signals. Figure 4 presents the complementary di-nucleotide difference signals
for the H. pylori 26695c whole genome (GenBank 2012, NC_000915), whereas Table 1 gives
the range, the return value, as well as the Chargaff’s second law and di-nucleotide
distribution percent errors for each of these signals. The error in Chargaff’s second
law can be measured as the ratio of the return value, which measures the global
imbalance in the complementary di-nucleotide pair (e.g., tg – ca), to the total number
of di-nucleotide pairs in the sequence (tg + ca). The di-nucleotide distribution error
is defined as the ratio of the range, i.e., the difference between the maximum and
minimum values of the complementary di-nucleotide difference signal along the sequence
(e.g., max(tg – ca) – min(tg – ca)), to the total number of di-nucleotide pairs in the
sequence (tg + ca).
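The two error measures can be sketched for one complementary pair as follows. The sketch assumes overlapping di-nucleotides read along a linear sequence (no wrap-around for the circular chromosome), a signal that starts at zero at the origin, and the absolute value of the return value in the error; the function names and the toy sequence are hypothetical.

```python
def difference_signal(seq, pair, rc_pair):
    """Running count of `pair` minus `rc_pair`, starting from 0 at the origin."""
    s = seq.lower()
    signal = [0]
    total = 0  # number of occurrences of either di-nucleotide
    for first, second in zip(s, s[1:]):
        dinuc = first + second
        step = 0
        if dinuc == pair:
            step, total = 1, total + 1
        elif dinuc == rc_pair:
            step, total = -1, total + 1
        signal.append(signal[-1] + step)
    return signal, total


def chargaff_errors(seq, pair="tg", rc_pair="ca"):
    """Return value, range and the two percent errors for one difference signal."""
    signal, total = difference_signal(seq, pair, rc_pair)
    if total == 0:
        raise ValueError("neither di-nucleotide occurs in the sequence")
    return_value = signal[-1]
    value_range = max(signal) - min(signal)
    return {
        "return_value": return_value,
        "range": value_range,
        "chargaff_error_pct": 100 * abs(return_value) / total,
        "distribution_error_pct": 100 * value_range / total,
    }


print(chargaff_errors("tgtgca"))
```

As a consistency check against Table 1, the tg – ca row gives 100 × 719 / 0.37524 ≈ 191,600 for tg + ca, and the range 3027 – (–877) = 3904 divided by that total gives about 2.04%, in agreement with the tabulated distribution error.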

Chargaff’s second law is statistically well satisfied by the first four pairs of
complementary di-nucleotides in Table 1 and the corresponding difference signals (gt –
ac, tg – ca, gg – cc, aa – tt), which comprise neutral pairs, and less accurately by the
two difference signals (ct – ag, ga – tc) containing pairs of positive and negative
di-nucleotides. It is known that the aa and tt di-nucleotides satisfy specific
distributions, symmetrical relative to each other (Ioshikhes et al., 1992). The
di-complementary pairs are also distributed quite uniformly along DNA sequences, but the
errors are relatively larger.

The uniformity of the di-nucleotide distribution for the H. pylori 26695c whole genome is
shown in Figs. 5 and 6, for the neutral, positive and negative di-nucleotides. The
distribution of the eight neutral di-nucleotides in Fig. 5 is quite uniform along the
sequence, as shown by the MAE and LAER measures of linearity (best for the tt
di-nucleotide), and the complementary di-nucleotide pairs compensate each other well, as
can be seen from the low slope errors (best for the cc – gg pair). The linearity for the
positive and negative di-nucleotides in Fig. 6 is also very good, or even better (for the
self-complementary ta, at, gc, and cg di-nucleotides), but with a less accurate pairing
of the remaining nontrivial complementary di-nucleotide pairs (ct – ag and ga – tc).

[Figure 4 plot. Vertical axis: di-nucleotide differences (number of pairs); horizontal
axis: nucleotides × 10^5; curves: ct-ag, gt-ac, tg-ca, gg-cc, ga-tc, aa-tt.]

Figure 4: Complementary di-nucleotide difference signals for H. pylori 26695c whole genome
(GenBank 2012, NC_000915, 1,667,867 bp).

Table 1: Complementary di-nucleotide difference signals for H. pylori 26695c whole genome

Type      Di-nucleotide  Minimum  Maximum  Return  Chargaff’s second  Di-nucl. distribution
          signal         value    value    value   law error (%)      error (%)
Neutral   tg - ca           -877     3027    -719  0.37524            2.0374
Neutral   gt - ac          -1087     2488    -691  0.52342            2.7080
Neutral   gg - cc           -921     6067    -838  0.57120            4.7632
Neutral   aa - tt          -5220      263   -5167  1.2121             1.2863
Positive  ct - ag          -3552     4174    4099  2.1325             4.0194
Negative  ga - tc          -4176     4246   -4122  2.3924             4.8881
Figure 5: Neutral di-nucleotide signals for H. pylori 26695c whole genome (GenBank 2012).

Figure 6: Positive and negative di-nucleotide signals for H. pylori 26695c whole genome
(GenBank 2012).

Figure 7: Neutral di-nucleotide signals for H. pylori 26695c for the 1,626 NOCS.

Figure 8: Neutral di-nucleotide signals for H. pylori 26695c for the 1,626 ROCS.

It is also interesting to analyze the linearity of the di-nucleotide distribution for the
1,626 concatenated exons (length 1,527,953 bp) of the H. pylori 26695c genome
(GenBank 2012), for which the nucleotide imbalance N has been given in Fig. 3,
before (Fig. 7, NOCS) and after (Fig. 8, ROCS) the re-orientation of the exons in the
same positive direction. Only the neutral di-nucleotide distribution is considered here,
in order to show the contribution of the non-coding DNA segments in satisfying the
regularity of the DNA segments. As shown in Table 4, the elimination of the non-coding
segments reduces the slope of the di-nucleotide signals about three times and similarly
increases the slope error, but keeps almost the same linearity measures. In contrast, the
re-orientation of all exons in the same positive direction changes the slope (even if
switching the components of each complementary di-nucleotide pair) and the linearity
measures only a little, but largely increases the slope error.

4.3 Comparison of mtDNA genes

The NuGS methodology can also be used in the local analysis of nucleotide sequences,
such as in the study of pathogen variability, primarily for the detection of the
development of resistance to drugs and to treatment (Cristea, 2006), the investigation of
genetic inserts (Cristea et al., 2008), or the comparison of genes belonging to
individuals in the same or related species (Cristea et al., 2011). In Figs. 9 to 11 we
consider for illustration the case of the ND6 mitochondrial (mt) gene, one of the
Complex I genes in the respiratory chain.

Fig. 9 presents the nucleotide imbalance signals N of the ND6 mt gene for seven
species of the Hominidae family: Homo sapiens (shortened to Hs), Hs neanderthalensis
(Hsn), Pan troglodytes (Pat, the Chimpanzee), Pan paniscus (Papa, the Bonobo
Chimpanzee), Gorilla gorilla (Gg), Pongo pygmaeus abelii (Popya, the Sumatran
Orangutan) and Pongo pygmaeus (Popy, the Bornean Orangutan). The genes have the same
length (525 bp) and are aligned.

The distance between two homologous genes, from different species or from different
individuals in the same species, is defined as the sum of the absolute values of the
differences between the NuGSs describing the two genes. The distance between two
species, from the point of view of some specified set of genes, is defined as the
Euclidean distance in a space in which each considered gene is an independent
coordinate. As mentioned in Section 3, the resolution of the NuGSs in Fig. 9 is not
good enough to measure the distances between genes, and the description based on the
reference, the offsets and the digital derivatives of the offsets has to be used
(Cristea et al., 2011). Fig. 10 gives the offsets of the N signals of the ND6 mt genes
in Fig. 9 with respect to the Hs signal, chosen as reference (thus, its offset is zero).
The step variations in the other offsets correspond to the points where there are
differences (mutations) with respect to the Hs signal. One can already estimate the
distances between the genes (e.g., Hsn is the closest to Hs). The resolution increases
by using the digital derivatives of the offsets. In Fig. 11 we have represented the
digital derivatives of the offsets of the N signals shown in Fig. 10. The pulses along
these lines correspond to the differences in the N signals of the ND6 mt genes for the
seven Hominidae species. The distances between each gene and the reference gene for Hs
are given by the numbers at the right of the lines. The distances between the genes are
now expressed quantitatively (e.g., from the ND6 mt gene point of view, Hsn is the
closest to Hs, at distance 5, whereas Popy is the farthest, at distance 71).

Fig. 12 represents graphically the distances between the seven Hominidae species
considered above, evaluated on the basis of the distances among their genes in the
mtDNA respiratory chain. The succession of the genes along the horizontal axis
corresponds to their position in the mtDNA nucleotide sequence. The rhythmicity of the
variation of the mt gene distances along the mtDNA molecule cannot be fully attributed
to the differences in the gene lengths and seems to indicate the existence of hotspots
in an mtDNA molecule from the variability point of view. Similar results have been
obtained in the study of pathogen variability (Cristea, 2006).
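The species-level distance defined above can be sketched as a Euclidean norm over per-gene distances. In the example, the ND6 value of 5 is the Hsn to Hs distance quoted in the text, while the second gene name and its value are purely hypothetical placeholders; the function name is also hypothetical.

```python
from math import sqrt


def species_distance(gene_distances):
    """Euclidean distance with each gene treated as an independent coordinate.

    gene_distances maps a gene name to the distance between the two
    homologous genes of the species being compared.
    """
    return sqrt(sum(d * d for d in gene_distances.values()))


# ND6 = 5 is quoted in the text (Hsn vs. Hs); "GENE_X" = 12 is a made-up value.
print(species_distance({"ND6": 5, "GENE_X": 12}))
```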

Figure 9: Nucleotide imbalance N signals of the ND6 mt gene for seven species of the Hominidae family
(abbreviations: Hs – Homo sapiens, NC001807, Hsn – Hs neanderthalensis, NC011137, Pat – Pan
troglodytes, NC001643, Papa – Pan paniscus, NC001644, Popya – Pongo pygmaeus abelii, NC002083, Popy
– Pongo pygmaeus, NC001646, Gg – Gorilla gorilla, NC001645). The accession codes of the mtDNA genes
in GenBank (2012) are given.

Figure 10: Offsets of nucleotide imbalance signals (N) of the ND6 mt gene for the seven
species in Fig. 9 with respect to the Hs signal as reference.

Figure 11: Digital derivatives of the offsets in Fig. 10. The distances between each gene
and the homologous gene for Hs are on the right.

Figure 12: Distances between the respiratory chain & ATP synthase genes and the homologous Hs mt genes.

5. CONCLUSIONS

The study of nucleotide sequences by using NuGSs reveals regularities in the structure
of DNA and RNA molecules. This approach has been applied to describe the structure
of nucleotide sequences, both in the current state and in a putative ancestral state, from
which they have evolved. The structural restrictions in genomic sequences are reflected
in symmetries and regularities observed in the corresponding genomic signals. Results
on the Entire genome NuGS analysis (Subsection 4.1), Whole genome di-nucleotide
statistical analysis (4.2) and the Comparison of mtDNA genes (4.3) are presented in the
paper.

REFERENCES

Albrecht-Buehler, G. (2006) Asymptotically increasing compliance of genomes with Chargaff's second parity
rules through inversions and inverted transpositions, Proc. Natl. Acad. Sci. U.S.A, 103, 17828-17833.
Chargaff, E. (1951) Structure and function of nucleic acids as cell constituents, Federal Proceedings, 10, 654–
659.
Cristea, P.D. (2002) Conversion of Nitrogenous Base Sequences into Genomic Signals, Journal of Cellular
and Molecular Medicine, 6, no. 2, 279–303.
Cristea, P.D. (2004) Genomic Signals of Re-Oriented ORFs, Eurasip – Journal on Applied Signal Processing,
[Special Issue on Genomic Signal Processing], 2004, no. 1, 132–137.
Cristea, P.D. (2005) Chapter 1: Representation and analysis of DNA sequences, in Genomic Signal Processing
and Statistics, Dougherty, E., Shmulevich, I., Chen, J. and Wang, Z.J., eds. Eurasip Book Series on
Signal Processing and Communications, Hindawi Publ. Corp., p. 15–65.
Cristea, P.D. (2006) Genomic Signal Analysis of Pathogen Variability, Progress in Biomedical Optics and
Imaging, Proceedings of SPIE, 6088, P1-P12.
Cristea, P.D., Tuduce, R., Cornelis J., Deklerck, R., Nastac, I., Andrei, M. (2007) Signal Representation and
Processing of Nucleotide Sequences, Proceeding of the 7th IEEE Intl. Conf. on Bioinformatics and
Bioengineering (IEEE BIBE 2007), 1214-1219, Harvard Medical School, Boston, USA.
Cristea, P.D., Tuduce, R. (2008) Use of Nucleotide Genomic Signals in the Analysis of Variability and Inserts
in Prokaryote Genomes, Proceedings of the 2008 International Conference on Bioinformatics &
Computational Biology (BIOCOMP'08), Ed. H.R. Arabnia, Las Vegas, Nevada, USA, pp. 241-247.
Cristea, P.D. (2010) Symmetry in Genomics, The Journal of the Symmetrion – Symmetry: Culture and Science
– Symmetry in Mathematical Education, ISSN 0865-4824, 21, no. 1-304, 71-86,
http://symmetry.hu/aus_journal_thematic_issues.html#SME.
Cristea, P.D., Tuduce, R. (2011) Hominidae mtDNA Analysis by Using Nucleotide Genomic Signals,
Proceedings of 2nd International Workshop on Genomic Signals Processing (GSP2011), Bucharest,
Romania, pp. 61-65.
Cristea, P.D. (2012) Building Phylogenetic Trees by Using Gene Nucleotide Genomic Signals, 2012 IEEE
Int'l Conf. of the Engineering in Medicine & Biology Soc. (EMBS), San Diego, CA.
Darvas, G. (2007) Symmetry, Basel: Birkhauser, 508 pp, 2007.
GenBank (2012) NIH - National Institutes of Health, National Centre for Biotechnology Information,
National Library of Medicine, (NCBI/GenBank), http://www.ncbi.nlm.nih.gov/
Ioshikhes I, Bolshoy A, Trifonov E.N. (1992) Preferred positions of AA and TT dinucleotides in aligned
nucleosomal DNA sequences, J Biomol Struct Dyn., 9, no. 6, 1111-7.
Kunkel, T.A. (2004) DNA Replication Fidelity, The Journal of Biological Chemistry, 279, no. 17, 16895–
16898.
Randić M, Zupan J, Balaban AT, Vikić-Topić D, Plavsić D (2011) Graphical representation of proteins, Chem
Rev., 111(2), 790-862.
Telenti, A., Imboden, P., Marchesia, F., Matter, L., Schopfer, K., Bodmer, T., Lowrie, D., Colston, M.J.,
Cole, S. (1993) Detection of rifampin-resistance mutations in Mycobacterium tuberculosis, Lancet, 341,
647-650.
Watson, J.D., Crick, F.H.C. (1953) A structure for deoxyribose nucleic acid, Nature, 171, no. 4356, 737–738.
Wigner, E.P. (1964) Symmetry and conservation laws, Proc. Natl. Acad. Sci. U.S.A., 51, 956-965.
Yamaoka, Y. (2008) Helicobacter pylori: Molecular Genetics and Cellular Biology, Caister Academic Pr.
Symmetry: Culture and Science
Vol. 23 , Nos. 3-4, 255-274, 2012

SYMMETRY OF MITOCHONDRIAL DNA. THE CASE
OF COXn GENES IN PRIMATES AND CARNIVORES

Teodora Popovici1 and Paul Dan Cristea2

Abstract: The Nucleotide Genomic Signals (NuGSs) have been proven to be an
effective measure of the characteristic features of nucleotide strands, and have
illustrated the regularities and symmetries within these structures. Together with a
proper alignment of two mitochondrial DNA genes from individuals of different species,
they can help compute the distances between the genes. Furthermore, clustering algorithms
applied to the data set of distances can illustrate patterns of the relationships among
species. This paper presents an approach to the study of these symmetries, using distances
among genomic signals of previously aligned genes. For the purpose of this research,
the Distance Computer application has been designed and tested on the Primates and
Carnivora orders. The results support the idea that, by using the Nucleotide Genomic
Signal as a mathematical abstraction of the nucleotide strands, along with a proper
software alignment tool, relevant conclusions can be drawn about the inner symmetries
of the mitochondrial DNA.

Keywords: mitochondrial DNA, DNA symmetry, inter-gene distances, nucleotide
genomic signals, nucleotide imbalance.

1
Teodora Popovici graduated from the Artificial Intelligence Master at the Computer Science Department of
the University “Politehnica” of Bucharest (UPB), Romania, in July 2012, and is currently affiliated to the
Bio-Medical Engineering Center of UPB (e-mail: teodora.popovici@cti.pub.ro).
2
Paul Dan Cristea is Professor of Electrical Engineering and Applied Information Sciences at the University
“Politehnica” of Bucharest (UPB), Splaiul Independentei no. 313, 060042 Bucharest, Romania, Member of
the Romanian Academy, Fellow IEEE, Member of Honor of the Romanian Scientists Academy, director of
the Bio-Medical Engineering Center of UPB.
256 T. POPOVICI AND P. D. CRISTEA

1. INTRODUCTION

The concept of symmetry is closely related to regularity and harmony. In order to
explain the universe, mankind has always searched for rules and patterns. These rules
create certain symmetries that one can find in almost any existing system. Sometimes,
though, these symmetrical patterns are not that clear, being hidden by the way in which
one observes the system. However, these symmetries emerge once the observer analyzes
the system from another point of view. This new point of view could appear, for example,
by applying some sort of transformation to the system data.

Genetics is one of the domains that contain a very large number of rules and
symmetries, but at the same time it still hides patterns and regularities from us.
New discoveries are constantly made and new patterns are revealed by means of
technology and science. Things that were once considered chaotic and meaningless could
later gain a well-defined and regulated structure. Researchers strive to create better
methods of analyzing the genome, in order to understand more of its rules and
symmetries. One of these methods is the Nucleotide Genomic Signal (NuGS) approach, which
will be explained in a later section. This concept offers an interesting approach to
genome analysis, which enables one to use powerful signal processing techniques on DNA
data. The paper presents the studies conducted on genomic data using techniques such as
sequence alignment, NuGS and hierarchical clustering. This research aims to prove the
effectiveness and applicability of the above-mentioned methods in the field of DNA
analysis. Studies in this area can help improve the knowledge one has on gene
functionality, involvement in diseases, trends of related species in the same family and
many other aspects, more broadly discussed in the Motivation chapter of this paper. This
study can also reveal the natural symmetry of the mitochondrial DNA, supported by the
accurate species classification that can be created using the previously mentioned
techniques.

The following sections will give an overview of the terms used across this paper and an
introductory part on DNA analysis. Further along, there will be a description of the
innovative NuGS technology and its geometrical importance in the field of DNA
analysis. In the State of the Art chapter, one may read about related research in the area
of nucleotide genomic signals. The final parts of this paper present the developed
application and several interesting results that were obtained.

1.1. DNA analysis

Deoxyribonucleic acid (DNA) is the hereditary material in almost all living
organisms; RNA viruses are the only exception. DNA is a type of nucleic acid
(in almost all cases, it resides in the nucleus of the cell) that contains information, the
“genetic instructions” that determine the functioning of organisms (Watson and Crick,
1969) and (Pearson, 2006).

The genes are the portions of the DNA strands that carry the useful genetic information.
The other segments may have structural or regulating functions. The genes are the
molecular units that code for a polypeptide or an RNA chain. Basically, genes contain
the information relevant for the building and maintaining of cells and can pass this
information on to offspring. For example, genes are responsible for all traits of an
organism, either physically visible (like hair or eye color) or more hidden, like
predisposition to a certain condition and so on.

The mitochondrial DNA (mtDNA) is a special kind of DNA that does not reside in the
nucleus of the cell, but in an organelle named the mitochondrion. Mitochondria are
structures inside cells that convert food into energy for the use of the cell.
They have the special ability to replicate independently of the genetic information in
the nuclear DNA, unlike most other cell components. Also, unlike nuclear DNA, mtDNA is
in most species inherited solely from the mother. In humans, the mitochondrial DNA is a
circular structure of approximately 16,500 base pairs. It contains 37 genes and also has
a non-coding section named the D-loop, in which the two strands of DNA are separated by
a third one (hence the name, displacement loop). The mtDNA is often used in forensic
investigations and, more recently, in phylogeny research.

1.2. Mathematical models of DNA sequences

Despite its simple and standard representation, the symbolic form of nucleotide
sequences (namely the enumeration of the nucleobases) limits the possibilities of
exploitation to pattern matching and statistical analysis. These paths of research are
sometimes difficult to use and limiting in nature. Hence, a new approach has been
proposed by Cristea (2005), one that interprets genomic information as signals. The
resulting numerical values have an accurate mathematical meaning and can be subject to
signal processing procedures.

After different mappings from the symbolic form to genomic signals were studied,
a tetrahedral representation and a 2D model were proposed in (Cristea,
2002, 2005). Using these models, the four bases can be assigned phases: A corresponds
to π/4, G to 3π/4, T to −π/4 and C to −3π/4. The complex mapping translates the
biochemical characteristics of the bases into corresponding mathematical features. Based
on the phase values of the four bases, one can compute the cumulated phase, which is
defined as the sum of the phases of all the nucleotides in the sequence of interest:

\[
\theta_c = \frac{\pi}{4}\left[\,3(n_G - n_C) + (n_A - n_T)\,\right], \tag{1}
\]


where n_A, n_C, n_G, n_T are the numbers of adenine, cytosine, guanine and thymine bases
from the beginning of the sequence to the current location.
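
As an illustration, the cumulated phase can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the sample sequence are hypothetical, and only the phase values and Equation (1) come from the text.

```python
import math

# Phase assigned to each base, as stated in the text.
PHASE = {"A": math.pi / 4, "G": 3 * math.pi / 4,
         "T": -math.pi / 4, "C": -3 * math.pi / 4}

def cumulated_phase(seq):
    """Running sum of base phases; element k covers bases 0..k."""
    total, signal = 0.0, []
    for base in seq.upper():
        total += PHASE[base]
        signal.append(total)
    return signal

def cumulated_phase_closed_form(seq):
    """Equation (1) evaluated on the whole sequence from the base counts."""
    s = seq.upper()
    n_a, n_c, n_g, n_t = (s.count(b) for b in "ACGT")
    return math.pi / 4 * (3 * (n_g - n_c) + (n_a - n_t))

example = "ACGTGGA"  # made-up sequence, for illustration only
signal = cumulated_phase(example)
```

The last element of the running sum agrees with the closed form of Equation (1), which is a quick consistency check on the phase assignment.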

A second measure of the complex representation of DNA strands is the unwrapped
phase, which corrects the differences between consecutive
elements in the sequence so that their absolute value is lower than π. These two nucleotide
genomic signals (NuGS) can be used primarily to find large-scale features of DNA molecules.
The distance between two homologous genes is computed as the sum, over all positions
in the sequence, of the absolute differences between their signals:
\[
D(G_1, G_2) = \sum_{k=1}^{L} \left| S_{G_1}(k) - S_{G_2}(k) \right|, \tag{2}
\]

where S_G is a genomic signal (in this case, the nucleotide imbalance signal) and L is the number of positions compared.
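
A minimal sketch of Equation (2) follows. It assumes that the nucleotide imbalance signal is the running value of 3(n_G − n_C) + (n_A − n_T), i.e. the bracketed term of Equation (1), and that the two sequences already have equal length; the function names and test sequences are hypothetical.

```python
from itertools import accumulate

# Per-base increments of the nucleotide imbalance signal
# (assumption: N = 3(n_G - n_C) + (n_A - n_T), the bracket in Equation (1)).
STEP = {"A": 1, "T": -1, "G": 3, "C": -3}

def imbalance_signal(seq):
    """Running nucleotide imbalance along the sequence."""
    return list(accumulate(STEP[b] for b in seq.upper()))

def signal_distance(seq1, seq2):
    """Equation (2): sum of absolute differences between two signals."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    s1, s2 = imbalance_signal(seq1), imbalance_signal(seq2)
    return sum(abs(x - y) for x, y in zip(s1, s2))
```

Because the signal is cumulative, a single substitution shifts every later sample, so the distance grows with the number of positions that follow the change.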

2. MOTIVATION AND PROBLEM DESCRIPTION

This research aims to establish a sequence of steps by which DNA strands
can be analyzed reliably and structural symmetries can be identified. Several
studies have employed the above-mentioned NuGS technique: (Cristea, 2002),
(Teodorescu and Cristea, 2012), and (Cristea and Tuduce, 2003, 2009, 2010). However,
in those studies no alignment was applied before the genomic signal distance was
calculated, although alignment was among the authors' recommendations for future work. An
appropriate alignment of the initial nucleotide sequences could greatly enhance the
quality of the distance results. Furthermore, the application behind the current study
adds a clustering phase at the end, so that conclusions about the appropriateness of
the techniques for the described problem can be drawn more easily. The clustering
results also reveal patterns and symmetries in the structure of mitochondrial DNA.
GENOME SYMMETRIES 259

The main objective of this research is to compute the distance between different
species at the molecular level, and to prepare for computing intra-species distances as well
(Cristea and Tuduce, 2003). Such a study makes it possible to explore the stability of a DNA
segment across different species, which is an indication of its importance.

Over the past decades, well-known projects have been concluded, such as the Human
Genome Project, which was conducted between 1990 and 2003 under the
coordination of the U.S. Department of Energy and the National Institutes of Health
(Human Genome Project). The project’s main goals were to sequence the
approximately 3 billion base pairs that compose human DNA, identify its 20,000-25,000
genes, store all the information and improve tools for data analysis. In
addition, several other organisms were also sequenced. Although the project is
officially finished, research on the sequenced data will go on for many years to come.
The aim of this continuing research is to identify
more genes and reveal their functionality, in order to improve research on
disease-causing genes and to identify new treatment solutions.

3. STATE OF THE ART

3.1. Genomic Signals

Work in the field of genomic signals is concerned with comparing
homologous genes in different individuals or across species. In the first case, the aim
could be to identify specific variations in genes that might have a great impact on
diseases, as in (Teodorescu and Cristea, 2012), a study of
the variations of the gene TCF7L2 that investigates its influence on type 2 diabetes.
Numerical experiments were performed on human individuals and also on some
other species. Because of the numerous insertions in this gene across the different species,
the conclusions are to be finalized after a proper alignment of the genes.

Several studies of mitochondrial DNA genes have been conducted by (Cristea and
Tuduce, 2003, 2009, 2010). Mitochondrial DNA (mtDNA) is an exception, since most
DNA is present only in the nucleus, whereas mtDNA is present in the mitochondrion.
The typical length of mtDNA is 16,500 base pairs, which encode
37 genes, and 2-10 mtDNA copies are present in each mitochondrion. The main focus is on
the Hominidae family. Apart from the genes, mtDNA also contains a non-coding
region, the D-loop, which is also important, as it controls the initiation and regulation of
transcription and replication of mtDNA. A distinct feature of mtDNA is that it is almost
always inherited from the mother, and not from both parents, as is the case with nuclear
DNA.

The research on the mtDNA of six Hominidae species conducted by (Cristea and
Tuduce, 2003, 2009, 2010) emphasizes the similarities between the species, but also the
characteristics that tell them apart. The nucleotide imbalance, the offsets to a reference
signal (usually Homo sapiens) and the differential signals of the offsets are studied.
Studies of the nucleotide path have also been conducted on several mtDNA genes.
Together, these methods provide an efficient way of comparing and viewing related signals,
and give an accurate view of highly related sequences.

3.2. Phylogenetic Trees

Phylogenetic trees (phylogenies or evolutionary trees) are tree-like structures which
correspond to the inferred evolutionary relationships within a group of species or
organisms. These branching diagrams are built using a measure of similarity among
physical and/or genetic traits (Benton, 2000). The
tree often takes into account constraints such as time spans.

Evolutionary trees can be used to structure classifications, to order the diversity of a
system, to infer certain events that took place throughout the evolution of the system, or
to guide evolutionary research, according to (Baum, 2008) and (Baum and
Offner, 2008). Phylogenies have been used since the studies by Edward Hitchcock in
1840 and the theories published by Charles Darwin in 1859, but have gained popularity
in recent decades.

4. APPLICATION

The application employed for the purpose of this research is mainly implemented in
Java and uses results from other pieces of software: BLAST (BLAST/NCBI),
FeatureExtract (Wernersson, 2005) and MultiDendrograms (Fernández and Gómez,
2008). The application itself is called DistanceComputer and employs several modules.
A description of the overall architecture of the DistanceComputer application and its
additional software will be given in the following subsections.

Given two nucleotide sequences in symbolic format, the program calls BLAST in order
to find the best matching segments. BLAST may introduce gaps in the
sequences, even in sequences of the same length, if there has been an insertion in
one of the genes and the matching content is "shifted". This happens in batch mode:
the program iterates through a directory of gene content files, comparing them all
against each other.

The application analyzes the compact output of BLAST, which basically contains
only the differences between the two sequences. This format is
convenient and efficient to process for computing distances using one of
the signals as reference, as well as the cumulated phase. Other features could be added in order to
process other types of BLAST output (for example, the complete alignments,
along with the statistics that the program reports about the alignment).

4.1. Overall algorithm

Processing is done in batch mode. The application can execute a variety
of tasks, depending on the processing step:
1. Filtering the input files – this step lets the user select which inputs
are of interest. For example, one might be interested in comparing all the
Vertebrates with one another.
2. Separating the desired genes and grouping the organisms into families – the
genes are separated using a modified version of the FeatureExtract utility, by
Rasmus Wernersson. In order to have a broad view of the selected species, organisms
are grouped from level 9 (the Primates/Carnivora level) to level 13 (the smallest
family level, needed to classify the organisms correctly in the final phylogenetic trees).
3. Doing the alignment – the program accepts a list of genes of interest and a
certain family of organisms and calls BLAST, for every gene in the list, for all
organisms against one another.
4. Computing the distances between organism genes for each pair (every
alignment) – the application reads the BLAST output in the format specified above and
computes the nucleotide imbalance signals for each organism, only on the portions that
disagree. The distance is then computed as the sum of the absolute values of the signal
differences. When a deletion or insertion occurs (one of the signals has a gap), a
distance of 7 is assigned to represent the gravity of this difference. This value was
chosen because it is one unit larger than the largest absolute value of the difference
between nucleotide imbalance signals.
5. Post-processing phase – the distance files are rewritten in order to match the
input format of the MultiDendrograms application, designed by Alberto Fernández and
Sergio Gómez.
6. Running the MultiDendrograms application on the distance set of choice – this
step outputs the phylogenetic tree of the considered species.
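
Step 4 can be sketched as follows. This is one possible reading of the description, not the DistanceComputer code: it assumes that the signal difference is taken column by column on the aligned sequences, using the per-base increments A = 1, T = −1, G = 3, C = −3 (whose largest mutual difference is 6, consistent with the gap value of 7 stated above). The function name and the gap symbol are hypothetical.

```python
# Per-base increments of the nucleotide imbalance signal (assumed mapping).
STEP = {"A": 1, "T": -1, "G": 3, "C": -3}
GAP_PENALTY = 7  # from the text: one more than the largest base difference (6)

def aligned_distance(row1, row2, gap="-"):
    """Distance between two rows of a pairwise alignment of equal length."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row1.upper(), row2.upper()):
        if x == gap or y == gap:
            total += GAP_PENALTY              # insertion or deletion
        else:
            total += abs(STEP[x] - STEP[y])   # 0 when the bases agree
    return total
```

Columns where the bases agree contribute nothing, so only the portions that disagree add to the distance, as in the description of step 4.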

4.2. Auxiliary software

The other pieces of software used are BLAST (BLAST/NCBI), FeatureExtract
(Wernersson, 2005) and MultiDendrograms (Fernández and Gómez, 2008).

BLAST was developed by researchers at the US National Center for Biotechnology
Information (NCBI), and is publicly available (BLAST/NCBI), both as a
web service and as an executable. BLAST compares a query sequence
(nucleotides or proteins) with another sequence or a database of sequences. The
algorithm is based on finding high-scoring segment pairs (HSPs), which form
an alignment. These HSPs are found through a heuristic approximation of the
Smith-Waterman algorithm. Because BLAST uses a heuristic, it is much faster than the
original algorithm (reportedly 50 times faster), but on the downside it does not guarantee
the same accuracy as Smith-Waterman. For gene data alignment, this compromise between
speed and accuracy favors BLAST.
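
For reference, the exact local-alignment score that BLAST approximates can be computed with the Smith-Waterman recurrence. The sketch below is not BLAST and not part of the application; the scoring values (match 2, mismatch −1, linear gap −2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The double loop visits every pair of positions, so the cost grows with the product of the two sequence lengths; this is the cost that a heuristic such as BLAST avoids by extending only promising seed matches.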

There are several parsing utilities that can assist in extracting relevant information from
GenBank files. One of the open and free pieces of software that can extract sequences
and annotations from this file format is FeatureExtract (Wernersson, 2005), with its
command-line Python program gb2tab. The program receives its requirements as
command-line arguments: which types of sequences to search for, the name of
the input GenBank file and other options, which are outside the scope of this research. It
then reads the descriptive part of the file, gathering all the information it needs about
the features to extract. For the purpose of this paper, the author has modified the
software so that it saves the useful information (the nucleotide sequences) in several
files, organized in directories based on gene type and family of the organism. The
initial descriptive part of the file is also saved alongside the content, for further
reference. The modified software also splits the organisms into smaller families.

MultiDendrograms (Fernández and Gómez, 2008) is an open software program
designed to create hierarchical clusterings of real-valued data. It is implemented
in Java and presents a user-friendly graphical interface that lets the user choose
the desired clustering algorithm, along with visual markers for the output. The
algorithms that can be used are part of the Agglomerative Hierarchical Clustering
family: Variable-group Single-Linkage, Complete Linkage, Unweighted Average,
Weighted Average, Unweighted Centroid, Weighted Centroid, Joint Between-Within.

5. EXPERIMENTS

This section aims to present several experiments that have been conducted in order to
compute inter-gene distances and construct the associated phylogenetic trees. In order
to create the phylogenetic trees depicted in the Appendices section, the Unweighted
Average clustering distance has been used.
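
Unweighted Average linkage is commonly known as UPGMA. The sketch below shows the basic procedure on a made-up four-item distance table; it is not the MultiDendrograms implementation, and in particular it does not include the variable-group treatment of tied distances that MultiDendrograms provides. The function name, labels and distances are hypothetical.

```python
def upgma(dist):
    """dist maps (name_a, name_b) pairs to distances.

    Returns the list of merges as (cluster_a, cluster_b, distance)."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for pair in dist for name in pair}
    merges = []
    while len(size) > 1:
        # Closest pair of clusters; ties are broken by name (arbitrary choice).
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        merges.append((a, b, d[pair]))
        for c in size:
            if c not in pair:
                # Size-weighted mean = unweighted average over original items.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        size[merged] = size.pop(a) + size.pop(b)
    return merges

# Made-up distances between four hypothetical items.
toy = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 8,
       ("A", "D"): 10, ("B", "D"): 12, ("C", "D"): 9}
merges = upgma(toy)
```

Each merge records the distance at which two clusters are joined, which is the quantity a dendrogram displays as branch height.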

5.1. The Respiratory Chain of mtDNA

As mentioned in the earlier sections of this paper, mtDNA is composed of several
segments. Among them, 13 segments code for proteins of the electron transport chain
(the respiratory chain), as shown in Table 1 (Cristea and Tuduce, 2009).

In this paper, the focus falls on three COX genes of the respiratory complex,
namely COX1, COX2 and COX3. The tests were done using the Euclidean distance of
the component genes. It is worth mentioning that during testing, the "megablast"
version of the BLAST software was used, so that even distant species could be
properly aligned. This version of BLAST searches for segments of smaller length to
match exactly, such that it can be used for more distant species.

Table 1: Products and genes encoded by mtDNA

Description               Product        Genes
------------------------  -------------  ----------------------------------------
Electron transport chain  Complex I      MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L,
(Respiratory complex)                    MT-ND5, MT-ND6
                          Complex III    MT-CYB
                          Complex IV     MT-COX1, MT-COX2, MT-COX3
                          ATP synthase   MT-ATP6, MT-ATP8
Ribosomal DNA             mt rRNA        MT-RNR1 (12S), MT-RNR2 (16S)
Transport DNA             mt tRNA        MT-Ala, ... , MT-Val

5.2. The Primates Order

The Primates order is part of the Mammalian class, mainly distinguished by a
tendency toward bipedalism and a mostly arboreal life, according to The Primates Order, The
Haplorrhini Suborder, The Strepsirrhini Suborder, (Goodman et al., 1990), (Rylands
and Mittermeier, 2009), (Saint-Hilaire, 1812), and (Groves, 2005). The Haplorrhini are
a suborder of Primates, named after a characteristic of their nose: they are the "dry-nosed"
primates. Compared with the Strepsirrhini, the other suborder of Primates (the
"wet-nosed" primates), the Haplorrhini have a much more developed brain.

During this research, the authors had access to only a limited number of complete
mtDNA genomes from the Mammalian class. Namely, only 59 species of the Primates
order had been sequenced and posted on the NCBI ftp site to date (Metazoa
GenBank). Among them, 45 belong to the Haplorrhini suborder and 14 to the
Strepsirrhini suborder. Unfortunately, not all families of the order currently have a
representative in the GenBank data, the most significant omissions being in the
Platyrrhini parvorder. The tests performed on these species, and the hierarchies built
from the computed distances, show a close resemblance to the biological
hierarchy.

5.3. The Carnivora Order

The Carnivora Order is also a part of the Mammalian class. As in the previous case, not
all the suborders of this clade have been available for testing purposes. However, the
tests have been conducted on 12 species of the Feliformia suborder and 59 species of
the Caniformia suborder.

5.4. Test case 1 – Primates

For the Primates order, tests were performed on 59 species, using the three COX
genes as reference. The results are similar for each of the three genes: the
species are correctly classified in almost all cases. The resulting phylogenetic
tree is accurate for all species except the Tarsiidae, which are clustered
among the Strepsirrhini, although they are officially part of the Haplorrhini suborder.
A curious fact is that throughout the last century, the Tarsiidae were alternately
considered part of the Strepsirrhini and of the Haplorrhini suborder, because
this family has genetic traits that resemble both suborders. With the exception of
this incorrect classification, the resulting phylogenetic trees depicted in the Appendices
section of this article show a close resemblance to the actual phylogeny of these species.

It is remarkable that tests considering only one gene can give such accurate results. One
can expect that, by combining all the genes of the mitochondrial DNA, the results will be
even more accurate. The phylogenetic trees have a scale attached, so the height of the
branches represents the computed distances between the clusters.

5.5. Test case 2 – Carnivora

In the case of the Carnivora order, the results follow a similar pattern, and all
the main families have been successfully identified and separated in the phylogeny. The
exception this time is that the algorithm reports a slightly smaller distance between the
cheetah and the puma than between the puma and the rest of its
family. The results for the three genes are quite similar to each other.

Nonetheless, in the case of COX3 the phylogenetic tree is not entirely accurate, because
one family from the Caniformia suborder is grouped with the Feliformia
suborder. Its distance to the other members of its correct group is, however, very small.
This could be interpreted as a further indication that, by taking into consideration all the
mitochondrial DNA genes, a more accurate result could be obtained and such outliers
would be corrected.

5.6. Numerical results

The Appendices section contains visual representations of the experimental results. The
branches of the phylogenetic trees have heights that are consistent with the numerical results.
For example, one may notice that in Figure 4, the smallest depicted distance is the one
between the Gray Wolf and the Eurasian Wolf, namely 2. This result shows that the
COX1 gene differs only slightly between these two closely related animals. Similar
results show the resemblance within clusters of tiger species, bears or seals. In this
particular case of the Carnivora order and the COX1 gene, the most distant pair
was the Snow Leopard and the Wolverine (of the Mustelidae family), with a reported
distance of 884.

There were cases in the experiments where BLAST could not find a
proper alignment, so it reported no resemblance between the two genes. In these cases,
a default “infinite” distance of 100,000 is used. This is a path worth studying, as additional
alignment techniques should be implemented for the cases in which BLAST fails to identify
an alignment. The simplest of these techniques is to compute the nucleotide
imbalance signal without prior alignment of the genes. This can be misleading, however,
because it is difficult to compare distances obtained by different techniques.

6. CONCLUSIONS AND FUTURE WORK

The tests in the previous section have shown remarkable results, in the sense that the
phylogenies that were obtained closely resemble the scientifically accepted phylogeny. This
can be an indicator that the employed distance measure genuinely models the
differences between the species. This supports the assumption that the nucleotide
genomic signal is an appropriate mathematical and signal model for DNA
strands.

In some cases, BLAST may not be able to find a suitable alignment, even though the
species are closely related. Providing a fallback alignment method for the
cases in which BLAST finds no proper alignment would enable the software to give more
appropriate distances.

Other aspects worth noting are the accuracy and the reliability of the test data. For
example, this study and many others make use of some data sequenced from extinct
species, or from extant ones but from very old samples. These pieces of
data may not be as accurate as one would like, mainly because of the factors that
have affected the DNA over centuries and millennia. Nevertheless, the techniques used by
specialists for DNA sequencing are highly advanced and can be trusted to bring
about the best possible results under these conditions.

The completeness of the results is also affected by the fact that mtDNA has not yet
been completely sequenced for all species. For example, large
portions of the Platyrrhini parvorder of the Primates order are not covered in the
test set, so it is only partially represented in the results. A more complete study
could be conducted as soon as more data is made available online or by the various
university and research facilities.

6.1. Future Work

There is a wide area of possible future research in this field. The results presented in
this paper give confidence that the Nucleotide Genomic Signal
technique, along with a proper alignment, can give good results in the comparison of DNA
strands. This can be a starting point for implementing other alignment variants for the cases
in which the main software (BLAST in this case) fails to find one. These can also be built using
NuGS techniques.

This research can be extended to other fields. There are various ways in which
this kind of technology can be useful. For example, testing can be done on genes
associated with some diseases, from both healthy and afflicted patients. Obtaining
distances in these cases and clustering the final results can lead to important discoveries
about the impact that the particular genes have on the development of the disease. Treatment
plans could also be evaluated using this data.

One closely related direction of study is the use of other variants of computing
distances. For example, the Mode Step reference can be used when aligning the
sequences, so that all the individuals or species can be compared against the common
trend of the group. This research could offer interesting insight into which individuals are
more closely connected to the common trend, and whether this common signal can be
interpreted as the best solution in the evolutionary path or is merely an average path.
Such a technique could give an even clearer image of the symmetries hidden in
mitochondrial DNA.

To conclude, using techniques such as sequence alignment and nucleotide genomic
signals gives a better perspective on the symmetry of highly related sequences and improves
the estimation of the distances between these sequences. This is a relatively new
topic in the field of Bioinformatics, and it promises to offer great opportunities for
medical studies as well.

REFERENCES

Baum, D. A. (2008), Reading a phylogenetic tree: The meaning of monophyletic groups. Nature Education, 1
(1).
Baum, D. A., and Offner, S. (2008), Phylogenies and tree thinking. American Biology Teacher, 70, 222–229.
Benton, M. J. (2000), Stems, nodes, crown clades, and rank-free lists: is Linnaeus dead?, Biological Reviews,
75, 633-648.
BLAST / NCBI, http://blast.ncbi.nlm.nih.gov/Blast.cgi.
Cristea, P. D. (2005), Representation and Analysis of DNA sequences, in Genomic Signal Processing and
Statistics, Editors Dougherty E. G., Shmulevici I., Chen Jie, Wang Z. J., Book Series on Signal
Processing and Communication, Hindawi, 15-65.
Cristea, P. D. (2002), Conversion of nucleotides sequences into genomic signals, International Journal of
Cellular and Molecular Medicine, 6, 2, 279–303.
Cristea, P. D., and Tuduce, Rodica (2003), Signal processing of genomic information: Mitochondrial genomic
signals of hominidae, 4th EURASIP Conference - Video/Image Processing and Multimedia
Communications, 2-5 July 2003, 209-214.
Cristea, P. D., and Tuduce, Rodica (2009), Nucleotide genomic signal analysis of hominidae mitochondrial
DNA, DSP2009 - 16th International Conference on Digital Signal Processing, 1-6.
Cristea, P. D. and Tuduce, Rodica (2009), Nucleotide genomic signal comparative analysis of homo sapiens
and other hominidae mtDNA, ISSCS 2009 - International Symposium on Signals, Circuits and Systems,
1-4.
Cristea, P. D. and Tuduce, Rodica (2010), Comparative Analysis of Mitochondrial DNA by using Nucleotide
Genomic Signals, Materials Science Forum, 670, 507-516.
Fernández, A. and Gómez, S. (2008), Solving Non-uniqueness in Agglomerative Hierarchical Clustering
Using Multidendrograms, Journal of Classification 25, 43-65.
Goodman, M., Tagle, D. A., Fitch, D. H., Bailey, W., Czelusniak, J., Koop, B. F., Benson, P., and Slightom, J.
L. (1990), Primate evolution at the DNA level and a classification of hominoids, Journal of Molecular
Evolution, 30 (3), 260–266.
Groves, C. (2005), Strepsirrhini, in Wilson D. E., Reeder D. M., Mammal Species of the World (3rd ed.).
Johns Hopkins University Press, Baltimore, 111.
Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml.
Metazoa GenBank - ftp://ftp.ncbi.nlm.nih.gov/genomes/MITOCHONDRIA/Metazoa.
Pearson, H. (2006), Genetics: What is a gene?, Nature, 441 (7092), 398-401.
Rylands A. B. and Mittermeier R. A. (2009). The Diversity of the New World Primates (Platyrrhini), in
Garber P.A., Estrada A, Bicca-Marques J. C., Heymann E.W , Strier K.B., South American Primates:
Comparative Perspectives in the Study of Behavior, Ecology, and Conservation, Springer.
Saint-Hilaire, É. G. (1812), Suite au tableau des quadrumanes. Seconde famille. Lemuriens. Strepsirrhini,
Annales du Muséum d'Histoire Naturelle, 19, 156–170.
Teodorescu, D., and Cristea, P.D. (2012), Nucleotide Genomic Signal comparative analysis of genes involved
in diabetes type 2 for various taxons, 19th International Conference on Systems, Signals and Image
Processing (IWSSIP), 518-521.
Watson, J. and Crick, F. (1969), Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic
Acid, Nature, 224, 470-471, reprinted from Nature, April 25, 1953.
Wernersson, R. (2005), FeatureExtract—extraction of sequence annotation made easy. Oxford University
Press, Nucleic Acid Research, 33, 2, w567-w569.

APPENDICES

Figure 1: Primates phylogenetic tree for gene mtCOX1.

Figure 2: Primates phylogenetic tree for gene mtCOX2.

Figure 3: Primates phylogenetic tree for gene mtCOX3.

Figure 4: Carnivora phylogenetic tree for gene mtCOX1.

Figure 5: Carnivora phylogenetic tree for gene mtCOX2.

Figure 6: Carnivora phylogenetic tree for gene mtCOX3.


Symmetry: Culture and Science
Vol. 23, Nos.3-4, 275-301, 2012

SYMMETRIES OF THE GENETIC CODE,
HYPERCOMPLEX NUMBERS AND
GENETIC MATRICES WITH
INTERNAL COMPLEMENTARITIES

S.V.Petoukhov

Biophysicist, bioinformatician (b. Moscow, Russia, 1946).


Address: Laboratory of Biomechanical Systems, Mechanical Engineering Institute of Russian Academy of
Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: spetoukhov@gmail.com.
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony,
mathematical crystallography (also history of sciences, oriental medicine).
Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the
USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012.
Publications: 1) S.V. Petoukhov (1981) Biomechanics, Bionics and Symmetry. Moscow, Nauka, 239 pp. (in
Russian); 2) S.V. Petoukhov (1999) Biosolitons. Fundamentals of Soliton Biology. Moscow, GPKT, 288 pp.
(in Russian); 3) S.V. Petoukhov (2008) Matrix Genetics, Algebras of the Genetic Code, Noise-immunity.
Moscow, RCD, 316 pp. (in Russian); 4) S.V. Petoukhov, M. He (2010) Symmetrical Analysis Techniques for
Genetic Systems and Bioinformatics: Advanced Patterns and Applications, Hershey, USA: IGI Global, 271
pp.; 5) He M., Petoukhov S.V. (2011) Mathematics of Bioinformatics: Theory, Practice, and Applications.
USA: John Wiley & Sons, Inc., 295 pp.

Abstract: The article describes results of a study of some symmetries of the genetic
coding system by means of matrix representations of its molecular ensembles. This
matrix approach is borrowed by the author from the well-known theory of noise-immunity
coding, which has long been used in discrete signal processing for communication
and computer technology. In the process, important connections between the hierarchy
of genetic alphabets and complex numbers, Hamilton's quaternions and some other
multi-dimensional numbers are discovered by analyzing reasoned numeric
representations of genetic (2^n*2^n)-matrices. It is shown that these numeric
matrices belong to a class of “matrices with internal complementarities” and that they
allow the creation of new mathematical tools to study the molecular-genetic system,
including hidden regularities of long nucleotide sequences. The described results provide
some evidence of the algebraic nature of the molecular-genetic system.
276 S. V. PETOUKHOV

Keywords: symmetry, genetic code, matrix, hypercomplex numbers, complementarity,
Kronecker multiplication, long nucleotide sequences.

1. ABOUT THE PARTNERSHIP OF THE GENETIC CODE AND
MATHEMATICS

Science has led to a new understanding of life itself: “Life is a partnership between
genes and mathematics” (Stewart, 1999). This article describes a system of
multidimensional numeric structures together with some evidence that this
mathematical system is the partner of molecular ensembles of the genetic code. The
described results are based on symmetric properties of the genetic code system and on a
matrix approach which was borrowed by the author from the mathematics of
noise-immunity coding to study genetic phenomenology (Petoukhov, 2008a-c, 2011, 2012;
Petoukhov, He, 2010).
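\[
H_4 = \begin{bmatrix}
 1 &  1 & -1 & 1 \\
-1 &  1 &  1 & 1 \\
 1 & -1 &  1 & 1 \\
-1 & -1 & -1 & 1
\end{bmatrix}; \quad
H_8 = \begin{bmatrix}
 1 & -1 &  1 & -1 & -1 &  1 & 1 & -1 \\
 1 &  1 &  1 &  1 & -1 & -1 & 1 &  1 \\
-1 &  1 &  1 & -1 &  1 & -1 & 1 & -1 \\
-1 & -1 &  1 &  1 &  1 &  1 & 1 &  1 \\
 1 & -1 & -1 &  1 &  1 & -1 & 1 & -1 \\
 1 &  1 & -1 & -1 &  1 &  1 & 1 &  1 \\
-1 &  1 & -1 &  1 & -1 &  1 & 1 & -1 \\
-1 & -1 & -1 & -1 & -1 & -1 & 1 &  1
\end{bmatrix}
\]

\[
R_4 = \begin{bmatrix}
 1 &  1 &  1 & -1 \\
-1 &  1 & -1 & -1 \\
 1 & -1 &  1 &  1 \\
-1 & -1 & -1 &  1
\end{bmatrix}; \quad
R_8 = \begin{bmatrix}
 1 &  1 &  1 &  1 &  1 &  1 & -1 & -1 \\
 1 &  1 &  1 &  1 &  1 &  1 & -1 & -1 \\
-1 & -1 &  1 &  1 & -1 & -1 & -1 & -1 \\
-1 & -1 &  1 &  1 & -1 & -1 & -1 & -1 \\
 1 &  1 & -1 & -1 &  1 &  1 &  1 &  1 \\
 1 &  1 & -1 & -1 &  1 &  1 &  1 &  1 \\
-1 & -1 & -1 & -1 & -1 & -1 &  1 &  1 \\
-1 & -1 & -1 & -1 & -1 & -1 &  1 &  1
\end{bmatrix}
\]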

Figure 1: numeric matrices H4, H8, R4 and R8 which are connected with phenomenology of the genetic coding
system (Petoukhov, 2011, 2012)

The main mathematical objects of the article are four matrices R4, R8, H4 and H8 shown
in Figure 1. Why are these numeric matrices chosen from the infinite set of matrices? The
reason is that they are connected with the phenomenology of the genetic code system in
GENETIC MATRICES WITH INTERNAL COMPLEMENTARITIES 277

matrix forms of its representation, as was shown in (Petoukhov, 2011, 2012)
and as will be additionally demonstrated at the end of this article, where a conclusion
about the algebraic essence of genetic informatics will be made. The matrices
H4 and H8 belong to the huge set of well-known Hadamard matrices, which are widely used
for noise-immunity coding in signal processing technologies. The matrices R4 and
R8 are conditionally termed “Rademacher matrices” because each of their columns
represents one of the known Rademacher functions.
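
The Hadamard property (entries ±1 and mutually orthogonal rows) can be checked directly. The sketch below is an illustration only: the entries are transcribed from Figure 1, reading each "-" as −1, and the helper name is hypothetical.

```python
# Matrices transcribed from Figure 1 (each "-" entry read as -1).
H4 = [[ 1,  1, -1, 1],
      [-1,  1,  1, 1],
      [ 1, -1,  1, 1],
      [-1, -1, -1, 1]]

H8 = [[ 1, -1,  1, -1, -1,  1, 1, -1],
      [ 1,  1,  1,  1, -1, -1, 1,  1],
      [-1,  1,  1, -1,  1, -1, 1, -1],
      [-1, -1,  1,  1,  1,  1, 1,  1],
      [ 1, -1, -1,  1,  1, -1, 1, -1],
      [ 1,  1, -1, -1,  1,  1, 1,  1],
      [-1,  1, -1,  1, -1,  1, 1, -1],
      [-1, -1, -1, -1, -1, -1, 1,  1]]

def is_hadamard(m):
    """True if m is square, has entries +1/-1 and pairwise orthogonal rows."""
    n = len(m)
    if any(len(row) != n or any(v not in (1, -1) for v in row) for row in m):
        return False
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(m[i], m[j]))
            if dot != (n if i == j else 0):
                return False
    return True
```

Both transcribed matrices pass this check, which is equivalent to the usual condition that the product of the matrix and its transpose equals n times the identity matrix.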

2. THE HADAMARD MATRICES H4 AND H8

Let us begin with an analysis of the (4*4)-matrix H4 (Figure 1). One variant of
decomposition of the matrix H4 gives a set of 4 sparse matrices H40, H41, H42 and H43
(Figure 2). This set is closed under multiplication and defines a
multiplication table (Figure 2, bottom row) that is identical to the famous multiplication
table of Hamilton's quaternions. From this point of view, the matrix H4 is the
Hamilton quaternion with unit coordinates. (This type of decomposition is termed a
dyadic-shift decomposition because it corresponds to the structure of matrices of dyadic
shifts, well known in signal processing technology (Ahmed, Rao, 1975)).

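$$
\mathrm{H4} = \mathrm{H40} + \mathrm{H41} + \mathrm{H42} + \mathrm{H43} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} +
\begin{pmatrix} 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}
$$

Multiplication table (the symbol 1 denotes the identity matrix H40; each entry is the product of the row element by the column element):

         1      H41    H42    H43
1        1      H41    H42    H43
H41      H41    -1     H43    -H42
H42      H42    -H43   -1     H41
H43      H43    H42    -H41   -1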

Figure 2: the dyadic-shift decomposition of the (4*4)-matrix H4 (from Figure 1) gives the set of 4 sparse
matrices H40, H41, H42 and H43, whose multiplication table is that of Hamilton's quaternions
(bottom). The matrix H40 is the identity matrix
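
The decomposition and the multiplication table can be reproduced with a short sketch. It assumes the usual dyadic-shift indexing, in which the entry in row i and column j belongs to part number i XOR j; this rule is not spelled out in the text, but it reproduces the four matrices of Figure 2. The helper names are illustrative.

```python
H4 = [[1, 1, -1, 1],
      [-1, 1, 1, 1],
      [1, -1, 1, 1],
      [-1, -1, -1, 1]]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neg(a):
    return [[-x for x in row] for row in a]


def dyadic_shift_parts(m):
    """Split m into n sparse parts; part k keeps the entries with i XOR j == k."""
    n = len(m)
    return [[[m[i][j] if (i ^ j) == k else 0 for j in range(n)]
             for i in range(n)] for k in range(n)]


H40, H41, H42, H43 = dyadic_shift_parts(H4)
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

# Quaternion relations: i*i = j*j = k*k = -1, i*j = k, j*k = i, k*i = j, j*i = -k.
checks = {
    "H40 is the identity": H40 == identity,
    "H41*H41 = -1": matmul(H41, H41) == neg(identity),
    "H42*H42 = -1": matmul(H42, H42) == neg(identity),
    "H43*H43 = -1": matmul(H43, H43) == neg(identity),
    "H41*H42 = H43": matmul(H41, H42) == H43,
    "H42*H43 = H41": matmul(H42, H43) == H41,
    "H43*H41 = H42": matmul(H43, H41) == H42,
    "H42*H41 = -H43": matmul(H42, H41) == neg(H43),
}
for name, ok in checks.items():
    print(name, ok)
```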

But the matrix H4 is also the sum of two sparse matrices HL4 and HR4 (Figure 3). Let us
number the 4 columns of the matrix H4 from left to right as 0, 1, 2 and 3. Then the two
columns with non-zero entries in the matrix HL4 have the even numbers 0 and 2, and the
two columns with non-zero entries in the matrix HR4 have the odd numbers 1 and 3. In
view of this, the decomposition H4=HL4+HR4 can be conditionally termed “the even-odd
decomposition” (this type of decomposition will be used several times in this article).

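$$
\mathrm{H4} = \mathrm{HL4} + \mathrm{HR4} =
\begin{pmatrix} 1 & 0 & -1 & 0 \\ -1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ -1 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & -1 & 0 & 1 \\ 0 & -1 & 0 & 1 \end{pmatrix}
$$

$$
\mathrm{HL4} = \mathrm{HL40} + \mathrm{HL41} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}
$$

$$
\mathrm{HR4} = \mathrm{HR40} + \mathrm{HR41} =
\begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & -1 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix}
$$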

Figure 3: upper row: the representation of the matrix H4 as sum of matrices HL4 and HR4. Other rows:
representations of each of matrices HL4 and HR4 as sums of two matrices: HL4=HL40+HL41, HR4 =HR40+HR41

Unexpectedly, the set of two (4*4)-matrices HL40 and HL41 is also closed under
multiplication, and its multiplication table (Figure 4) is identical to the
multiplication table of complex numbers (http://en.wikipedia.org/wiki/Complex_number).
Note that in matrix analysis complex numbers are usually represented by (2*2)-matrices
[a, -b; b, a]. Let us now consider the set of (4*4)-matrices CL = a0*HL40+a2*HL41, which
is an unusual representation of complex numbers (here a0, a2 are real numbers) (Figure
4). The classical identity matrix E=[1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] is absent from
the set of matrices CL, each of which has zero determinant. Consequently the usual
inverse matrix CL^(-1) (with CL*CL^(-1)=E) cannot be defined in relation to the classical
identity matrix E, in accordance with the well-known theorem that matrices with zero
determinant have no inverse (Bellman, 1960, Chapter 6, § 4). On the other hand, the set
of matrices CL contains the matrix HL40, which has all the properties of an identity
matrix (the real unit) for any member of this set: one can check that
HL40*CL = CL*HL40 = CL. Within the set of matrices CL, where the matrix HL40 represents
the real unit, one can define a special notion of the inverse matrix CL^(-1) for any
non-zero matrix CL in relation to the matrix HL40 by the equations
CL*CL^(-1) = CL^(-1)*CL = HL40. From this point of view, the genetic (4*4)-matrix HL4 is
the complex number with unit coordinates (a0=a2=1). In the case of genetic matrices, we
thus reveal that 4-dimensional spaces can contain 2-parametric subspaces in which
complex numbers exist in the form of (4*4)-matrices CL.

         HL40   HL41
HL40     HL40   HL41
HL41     HL41   -HL40

$$
\mathrm{CL} = a_0\,\mathrm{HL40} + a_2\,\mathrm{HL41} =
\begin{pmatrix} a_0 & 0 & -a_2 & 0 \\ -a_0 & 0 & a_2 & 0 \\ a_2 & 0 & a_0 & 0 \\ -a_2 & 0 & -a_0 & 0 \end{pmatrix}
$$

$$
\mathrm{CL}^{-1} = (a_0^2 + a_2^2)^{-1}
\begin{pmatrix} a_0 & 0 & a_2 & 0 \\ -a_0 & 0 & -a_2 & 0 \\ -a_2 & 0 & a_0 & 0 \\ a_2 & 0 & -a_0 & 0 \end{pmatrix}
$$

Figure 4: the multiplication table of the two (4*4)-matrices HL40 and HL41 (from Figure 3), which represent a set
of two basic elements of complex numbers CL = a0*HL40+a2*HL41, where a0, a2 are real numbers. Within
the set of 2-parametric matrices CL, where the matrix HL40 represents the real unit, the matrix CL^(-1) is the
inverse matrix for CL by definition, on the basis of the equation CL*CL^(-1) = HL40
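
A small numerical check of these statements is sketched below. The helper names `cl` and `cl_inverse` are illustrative, and exact rational arithmetic (with integer coordinates) is used only to avoid rounding in the comparison.

```python
from fractions import Fraction

# Basis matrices transcribed from Figure 3.
HL40 = [[1, 0, 0, 0], [-1, 0, 0, 0], [0, 0, 1, 0], [0, 0, -1, 0]]
HL41 = [[0, 0, -1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [-1, 0, 0, 0]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cl(a0, a2):
    """The matrix CL = a0*HL40 + a2*HL41."""
    return [[a0 * u + a2 * v for u, v in zip(row0, row1)]
            for row0, row1 in zip(HL40, HL41)]


def cl_inverse(a0, a2):
    """Inverse of CL relative to the unit HL40 (integer a0, a2 assumed)."""
    norm = a0 * a0 + a2 * a2
    if norm == 0:
        raise ZeroDivisionError("the zero matrix CL has no inverse")
    return cl(Fraction(a0, norm), Fraction(-a2, norm))


# Products of CL matrices follow complex multiplication: (1 + 2i)(3 + 4i) = -5 + 10i.
product = matmul(cl(1, 2), cl(3, 4))
# CL * CL^(-1) gives HL40, not the classical identity matrix E.
unit = matmul(cl(3, 4), cl_inverse(3, 4))
print(product == cl(-5, 10), unit == HL40)
```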

A similar situation holds true for (4*4)-matrices HR4 = HR40 + HR41 (from Figure 3).
The set of two matrices HR40 and HR41 is also closed in relation to multiplication; it
gives the multiplication table (Figure 5) which is also identical to the multiplication
table of complex numbers. The set of (4*4)-matrices CR = a1*HR40+a3*HR41, where a1,
a3 are real numbers, represents complex numbers in the (4*4)-matrix form (Figure 5).

         HR40   HR41
HR40     HR40   HR41
HR41     HR41   -HR40

$$
\mathrm{CR} = a_1\,\mathrm{HR40} + a_3\,\mathrm{HR41} =
\begin{pmatrix} 0 & a_1 & 0 & a_3 \\ 0 & a_1 & 0 & a_3 \\ 0 & -a_3 & 0 & a_1 \\ 0 & -a_3 & 0 & a_1 \end{pmatrix}
$$

$$
\mathrm{CR}^{-1} = (a_1^2 + a_3^2)^{-1}
\begin{pmatrix} 0 & a_1 & 0 & -a_3 \\ 0 & a_1 & 0 & -a_3 \\ 0 & a_3 & 0 & a_1 \\ 0 & a_3 & 0 & a_1 \end{pmatrix}
$$

Figure 5: the multiplication table of the two (4*4)-matrices HR40 and HR41 (from Figure 3), which represent a set
of two basic elements of complex numbers CR = a1*HR40+a3*HR41, where a1, a3 are real numbers. Within
the set of 2-parametric matrices CR, where the matrix HR40 represents the real unit, the matrix CR^(-1) is the
inverse matrix for any non-zero matrix CR by definition, on the basis of the equation CR*CR^(-1) = HR40.

The matrix HR40 plays the role of the real unit in this set of matrices CR. Within the
set of matrices CR, where HR40 represents the real unit, the matrix CR^(-1) (Figure 5) is
the inverse matrix for any non-zero matrix CR by definition, on the basis of the
equations CR*CR^(-1) = CR^(-1)*CR = HR40. The genetic matrix HR4 is the complex number with
unit coordinates (a1=a3=1). The two sets of (4*4)-matrices CL and CR are quite different
representations of complex numbers; for example, the sum CL+CR of members of these
sets is not a complex number.

One should note that actions of the (4*4)-matrices HL4 and HR4 on 4-dimensional
vectors in their planes R0(x0, 0, x2, 0) and R1(0, x1, 0, x3) rotate the vectors in different
directions: clockwise and counterclockwise (Figure 6). The properties of these genetic
matrices can be used in studying the famous problem of dissymmetry in biological
organisms.

Figure 6: The action of the matrix HL4 on a 4-dimensional vector R0(x0, 0, x2, 0) leads to a vector rotation
clockwise (on the left). The action of the matrix HR4 on a 4-dimensional vector R1(0, x1, 0, x3) leads to a
vector rotation counterclockwise (on the right)
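
The two rotation directions can be illustrated numerically. This is only a sketch under stated assumptions: the text does not say how the matrices act, so the vector is taken here as a row vector multiplied by the matrix on the right (with this choice the image stays in the same plane), and the angle is measured with the first non-zero coordinate on the horizontal axis. The helper names are illustrative.

```python
import math

# Matrices transcribed from Figure 3.
HL4 = [[1, 0, -1, 0], [-1, 0, 1, 0], [1, 0, 1, 0], [-1, 0, -1, 0]]
HR4 = [[0, 1, 0, 1], [0, 1, 0, 1], [0, -1, 0, 1], [0, -1, 0, 1]]


def row_times_matrix(v, m):
    """Row vector v multiplied by matrix m (assumed action, see lead-in)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def turn_in_degrees(before, after):
    """Signed angle from 2D vector `before` to `after`; negative means clockwise."""
    delta = math.atan2(after[1], after[0]) - math.atan2(before[1], before[0])
    return math.degrees(delta)


left = row_times_matrix([1, 0, 0, 0], HL4)    # stays in the plane (x0, 0, x2, 0)
right = row_times_matrix([0, 1, 0, 0], HR4)   # stays in the plane (0, x1, 0, x3)

left_turn = turn_in_degrees((1, 0), (left[0], left[2]))
right_turn = turn_in_degrees((1, 0), (right[1], right[3]))
print(left, left_turn)
print(right, right_turn)
```

Under these assumptions HL4 turns the vector by 45 degrees in one direction and HR4 by 45 degrees in the other, and both stretch it by a factor of the square root of 2.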

As described above, we have obtained one more interesting result: the sum of two 2-
dimensional complex numbers HL4 and HR4 with unit coordinates (they belong to two
different matrix types of complex numbers) generates the 4-dimensional Hamilton
quaternion with unit coordinates, H4=HL4+HR4 (Figures 2 and 3). This resembles a
situation in which a union of Yin and Yang (or a union of female and male beginnings, or
a fusion of male and female gametes) generates a new organism. Below we will meet other
similar situations concerning (2^n*2^n)-matrices that represent 2^n-dimensional numbers
with unit coordinates and consist of two “complementary” halves (like the matrix H4),
each of which is a 2^(n-1)-dimensional number with unit coordinates. Matrices of this
type can be named “matrices with internal complementarities”. To some extent they
resemble the complementary structure of the double helix of DNA.

Let us now return to the (8*8)-matrix H8 (Figure 1) and demonstrate that it is also a
matrix with internal complementarities. Figure 7 shows the matrix H8 as the sum of the
matrices HL8 and HR8.

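$$
\mathrm{H8} = \mathrm{HL8} + \mathrm{HR8} =
\begin{pmatrix}
1 & 0 & 1 & 0 & -1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & -1 & 0 & 1 & 0 \\
-1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
-1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & 1 & 0 \\
-1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 \\
-1 & 0 & -1 & 0 & -1 & 0 & 1 & 0
\end{pmatrix} +
\begin{pmatrix}
0 & -1 & 0 & -1 & 0 & 1 & 0 & -1 \\
0 & 1 & 0 & 1 & 0 & -1 & 0 & 1 \\
0 & 1 & 0 & -1 & 0 & -1 & 0 & -1 \\
0 & -1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & -1 & 0 & 1 & 0 & -1 & 0 & -1 \\
0 & 1 & 0 & -1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & -1 \\
0 & -1 & 0 & -1 & 0 & -1 & 0 & 1
\end{pmatrix}
$$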

Figure 7: The matrix H8 (from Figure 1) is one of matrices with internal complementarities, which are
represented by its halves HL8 and HR8 (explanation in text)

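HL8 = HL80 + HL81 + HL82 + HL83, where

$$
\mathrm{HL80} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}, \qquad
\mathrm{HL81} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{HL82} = \begin{pmatrix}
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{HL83} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table:

         HL80   HL81    HL82    HL83
HL80     HL80   HL81    HL82    HL83
HL81     HL81   -HL80   HL83    -HL82
HL82     HL82   -HL83   -HL80   HL81
HL83     HL83   HL82    -HL81   -HL80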

Figure 8: upper rows: the decomposition of the matrix HL8 (from Figure 7) as sum of 4 matrices: HL8 = HL80
+ HL81 + HL82 + HL83. Bottom row: the multiplication table of these 4 matrices HL80, HL81, HL82 and HL83,
which is identical to the multiplication table of quaternions by Hamilton. The matrix HL80 represents the real
unit for this matrix set

A similar situation holds true for the matrix HR8 (from Figure 7). Figure 9 shows a
decomposition of the matrix HR8 as a sum of 4 matrices: HR8 = HR80 + HR81 + HR82 +
HR83. The set of matrices HR80, HR81, HR82 and HR83 is closed under multiplication, and
its multiplication table is identical to the same multiplication table of Hamilton's
quaternions. The general expression for quaternions in this case can be written as
QR = a0*HR80 + a1*HR81 + a2*HR82 + a3*HR83, where a0, a1, a2, a3 are real numbers. From
this point of view, the (8*8)-genomatrix HR8 is the Hamilton quaternion with unit
coordinates.

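HR8 = HR80 + HR81 + HR82 + HR83, where

$$
\mathrm{HR80} = \begin{pmatrix}
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
\mathrm{HR81} = \begin{pmatrix}
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{HR82} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{HR83} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table:

         HR80   HR81    HR82    HR83
HR80     HR80   HR81    HR82    HR83
HR81     HR81   -HR80   HR83    -HR82
HR82     HR82   -HR83   -HR80   HR81
HR83     HR83   HR82    -HR81   -HR80

Figure 9: upper rows: the decomposition of the matrix HR8 (from Figure 7) as a sum of 4 matrices: HR8 = HR80
+ HR81 + HR82 + HR83. Bottom: the multiplication table of these 4 matrices HR80, HR81, HR82 and HR83,
which is identical to the multiplication table of Hamilton's quaternions. HR80 represents the real unit for this
matrix set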
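
Both halves of H8 can be checked against the quaternion relations with one routine. The splitting rule used below is an observation read off Figures 8 and 9, not a rule stated in the text: rows and columns are grouped in consecutive pairs, an entry in an even (HL8) or odd (HR8) column goes to part k when the XOR of its row-pair and column-pair numbers equals k. All helper names are illustrative.

```python
# H8 transcribed from Figure 1 (it equals HL8 + HR8 of Figure 7).
H8 = [[1, -1, 1, -1, -1, 1, 1, -1],
      [1, 1, 1, 1, -1, -1, 1, 1],
      [-1, 1, 1, -1, 1, -1, 1, -1],
      [-1, -1, 1, 1, 1, 1, 1, 1],
      [1, -1, -1, 1, 1, -1, 1, -1],
      [1, 1, -1, -1, 1, 1, 1, 1],
      [-1, 1, -1, 1, -1, 1, 1, -1],
      [-1, -1, -1, -1, -1, -1, 1, 1]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neg(a):
    return [[-x for x in row] for row in a]


def half_parts(m, parity):
    """Parts of the even-column (parity 0) or odd-column (parity 1) half of m."""
    n = len(m)
    return [[[m[i][j] if j % 2 == parity and ((i // 2) ^ (j // 2)) == k else 0
              for j in range(n)] for i in range(n)] for k in range(n // 2)]


def is_quaternion_basis(e, i, j, k):
    """Check that e acts as the unit and i, j, k obey Hamilton's relations."""
    unit_ok = all(matmul(e, x) == x and matmul(x, e) == x for x in (e, i, j, k))
    squares_ok = all(matmul(x, x) == neg(e) for x in (i, j, k))
    cyclic_ok = (matmul(i, j) == k and matmul(j, k) == i and matmul(k, i) == j
                 and matmul(j, i) == neg(k))
    return unit_ok and squares_ok and cyclic_ok


left_parts = half_parts(H8, 0)    # HL80, HL81, HL82, HL83
right_parts = half_parts(H8, 1)   # HR80, HR81, HR82, HR83
print(is_quaternion_basis(*left_parts), is_quaternion_basis(*right_parts))
```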

The initial (8*8)-matrix H8 (Figure 1) can also be decomposed in another way, by the
dyadic-shift decomposition. Figure 10 shows this decomposition
H8 = H80+H81+H82+H83+H84+H85+H86+H87, in which 8 sparse matrices H80, H81, H82, H83,
H84, H85, H86, H87 arise (H80 is the identity matrix). The set H80, H81, H82, H83, H84,
H85, H86, H87 is closed under multiplication and defines the multiplication table in
Figure 10. This multiplication table is identical to the multiplication table of the
8-dimensional hypercomplex numbers termed Hamilton's biquaternions (or Hamilton's
quaternions over the field of complex numbers). The general expression for
biquaternions in this case can be written as Q8 = a0*H80+a1*H81+a2*H82+a3*H83+a4*H84
+a5*H85+a6*H86+a7*H87, where a0, a1, a2, a3, a4, a5, a6, a7 are real numbers. From this
point of view, the (8*8)-genomatrix H8 is the Hamilton biquaternion with unit
coordinates.

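H8 = H80+H81+H82+H83+H84+H85+H86+H87, where

$$
\mathrm{H80} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
\mathrm{H81} = \begin{pmatrix}
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

$$
\mathrm{H82} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{H83} = \begin{pmatrix}
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{H84} = \begin{pmatrix}
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{H85} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{H86} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{H87} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table (the symbol 1 denotes the identity matrix H80):

        1      H81    H82    H83    H84    H85    H86    H87
1       1      H81    H82    H83    H84    H85    H86    H87
H81     H81    -1     H83    -H82   H85    -H84   H87    -H86
H82     H82    H83    -1     -H81   -H86   -H87   H84    H85
H83     H83    -H82   -H81   1      -H87   H86    H85    -H84
H84     H84    H85    H86    H87    -1     -H81   -H82   -H83
H85     H85    -H84   H87    -H86   -H81   1      -H83   H82
H86     H86    H87    -H84   -H85   H82    H83    -1     -H81
H87     H87    -H86   -H85   H84    H83    -H82   -H81   1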

Figure 10: Upper rows: the decomposition of the matrix H8 (from Figure 1) as a sum of 8 matrices:
H8 = H80+H81+H82+H83+H84+H85+H86+H87. Bottom: the multiplication table of these 8 matrices H80, H81,
H82, H83, H84, H85, H86, H87, which is identical to the multiplication table of Hamilton's biquaternions (or
Hamilton's quaternions over the field of complex numbers). H80 is the identity matrix
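
The closure of this set can be verified mechanically. The sketch below again assumes that the dyadic-shift part number of an entry is the XOR of its row and column numbers, which reproduces the eight matrices of Figure 10; it checks only closure up to sign and the signs of the squares, not every entry of the table. Helper names are illustrative.

```python
H8 = [[1, -1, 1, -1, -1, 1, 1, -1],
      [1, 1, 1, 1, -1, -1, 1, 1],
      [-1, 1, 1, -1, 1, -1, 1, -1],
      [-1, -1, 1, 1, 1, 1, 1, 1],
      [1, -1, -1, 1, 1, -1, 1, -1],
      [1, 1, -1, -1, 1, 1, 1, 1],
      [-1, 1, -1, 1, -1, 1, 1, -1],
      [-1, -1, -1, -1, -1, -1, 1, 1]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neg(a):
    return [[-x for x in row] for row in a]


def dyadic_shift_parts(m):
    """Part k keeps the entries of m whose row and column numbers XOR to k."""
    n = len(m)
    return [[[m[i][j] if (i ^ j) == k else 0 for j in range(n)]
             for i in range(n)] for k in range(n)]


def find_signed_part(m, parts):
    """Return (sign, index) with m == sign * parts[index], or None if no match."""
    for index, part in enumerate(parts):
        if m == part:
            return (1, index)
        if m == neg(part):
            return (-1, index)
    return None


parts = dyadic_shift_parts(H8)   # H80, ..., H87
products = [[find_signed_part(matmul(a, b), parts) for b in parts] for a in parts]

closed = all(p is not None for row in products for p in row)
square_signs = [products[k][k][0] for k in range(8)]
print(closed, square_signs)
```

The signs of the squares agree with the diagonal of the table in Figure 10: the squares of H81, H82, H84 and H86 equal -1, and those of H83, H85 and H87 equal +1.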

Here, for the (8*8)-genomatrix H8, we have obtained an interesting result: the sum of
two different 4-dimensional Hamilton quaternions with unit coordinates (they belong to
two different matrix representations of Hamilton's quaternions) generates the
8-dimensional biquaternion with unit coordinates. This result resembles the results on
genetic matrices with internal complementarities described above; it resembles a
situation in which a union of Yin and Yang (or a union of male and female beginnings,
or a fusion of male and female gametes) generates a new organism.

3. THE RADEMACHER MATRICES R4 AND R8

Now let us turn to the Rademacher matrices R4 and R8 (Figure 1), which belong to the
second important type of genetic matrices with internal complementarities. Let us first
analyze the matrix R4, which is the sum of two matrices RL4 and RR4 (Figure 11).
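$$
\mathrm{R4} = \mathrm{RL4} + \mathrm{RR4} =
\begin{pmatrix} 1 & 0 & 1 & 0 \\ -1 & 0 & -1 & 0 \\ 1 & 0 & 1 & 0 \\ -1 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 1 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & -1 & 0 & 1 \\ 0 & -1 & 0 & 1 \end{pmatrix}
$$

$$
\mathrm{RL4} = \mathrm{RL40} + \mathrm{RL41} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 \\ 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}
$$

$$
\mathrm{RR4} = \mathrm{RR40} + \mathrm{RR41} =
\begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & -1 \\ 0 & -1 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix}
$$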

Figure 11: upper row: the representation of the matrix R4 as sum of matrices RL4 and RR4.
Other rows: representations of matrices RL4 and RR4 as sums of matrices RL40, RL41, RR40 and RR41.

The (4*4)-matrix RL4 is the sum of two matrices RL40 and RL41 (Figure 11). This set of
two matrices is closed under multiplication and defines the multiplication table of
these matrices (Figure 12). This table is identical to the well-known multiplication
table of split-complex numbers (their synonyms are Lorentz numbers, hyperbolic numbers,
perplex numbers, double numbers, etc.; see
http://en.wikipedia.org/wiki/Split-complex_number). Split-complex numbers form a
two-dimensional commutative algebra over the real numbers.

         RL40   RL41
RL40     RL40   RL41
RL41     RL41   RL40

$$
\mathrm{DL} = A_0\,\mathrm{RL40} + A_2\,\mathrm{RL41} =
\begin{pmatrix} A_0 & 0 & A_2 & 0 \\ -A_0 & 0 & -A_2 & 0 \\ A_2 & 0 & A_0 & 0 \\ -A_2 & 0 & -A_0 & 0 \end{pmatrix}
$$

$$
\mathrm{DL}^{-1} = (A_0^2 - A_2^2)^{-1}
\begin{pmatrix} A_0 & 0 & -A_2 & 0 \\ -A_0 & 0 & A_2 & 0 \\ -A_2 & 0 & A_0 & 0 \\ A_2 & 0 & -A_0 & 0 \end{pmatrix}
$$

Figure 12: the multiplication table of the two (4*4)-matrices RL40 and RL41 (Figure 11), which form a set
of basic elements of split-complex numbers DL = A0*RL40+A2*RL41, where A0, A2 are real numbers. The
matrix RL40 represents the real unit for this matrix set. If A0^2 ≠ A2^2, the matrix DL^(-1) is the inverse matrix
for DL by definition, on the basis of the equation DL*DL^(-1) = RL40

The set of (4*4)-matrices DL = A0*RL40+A2*RL41, where A0, A2 are real numbers,
represents split-complex numbers in a special (4*4)-matrix form (Figure 12). The
classical identity matrix E=[1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1] is absent from the set
of matrices DL, each of which has zero determinant. Consequently the usual inverse
matrix DL^(-1) (with DL*DL^(-1)=E) cannot be defined in relation to the classical identity
matrix E, in accordance with the well-known theorem that matrices with zero determinant
have no inverse (Bellman, 1960, Chapter 6, § 4). But the set of matrices DL contains
the matrix RL40, which has all the properties of an identity matrix (the real unit) for
any member of this set. Within the set of matrices DL, where the matrix RL40 represents
the real unit, one can define a special notion of the inverse matrix DL^(-1) in relation
to the matrix RL40 by the equations DL*DL^(-1) = DL^(-1)*DL = RL40 (Figure 12). This inverse
exists whenever A0^2 ≠ A2^2; otherwise the factor (A0^2-A2^2)^(-1) in Figure 12 is
undefined. From this point of view, the genetic (4*4)-matrix RL4 is the split-complex
number with unit coordinates (A0=A2=1). So we reveal that 4-dimensional spaces can
contain 2-parametric subspaces in which split-complex numbers exist in the form of
(4*4)-matrices DL. In mathematics, split-complex numbers are traditionally represented
in the form of the (2*2)-matrix [a0 a1; a1 a0], where a0, a1 are real numbers
(http://en.wikipedia.org/wiki/Split-complex_number).
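
The behaviour of the matrices DL can be checked numerically. In this sketch the helper names `dl` and `dl_inverse` are illustrative, and exact rational arithmetic with integer coordinates is an implementation choice. It also shows why the inverse needs A0^2 different from A2^2: matrices with A0 = A2 or A0 = -A2 are zero divisors.

```python
from fractions import Fraction

# Basis matrices transcribed from Figure 11.
RL40 = [[1, 0, 0, 0], [-1, 0, 0, 0], [0, 0, 1, 0], [0, 0, -1, 0]]
RL41 = [[0, 0, 1, 0], [0, 0, -1, 0], [1, 0, 0, 0], [-1, 0, 0, 0]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dl(a0, a2):
    """The matrix DL = A0*RL40 + A2*RL41."""
    return [[a0 * u + a2 * v for u, v in zip(row0, row1)]
            for row0, row1 in zip(RL40, RL41)]


def dl_inverse(a0, a2):
    """Inverse of DL relative to the unit RL40 (integer A0, A2 assumed)."""
    norm = a0 * a0 - a2 * a2
    if norm == 0:
        raise ZeroDivisionError("DL with A0^2 == A2^2 has no inverse")
    return dl(Fraction(a0, norm), Fraction(-a2, norm))


zero = [[0] * 4 for _ in range(4)]

unit = matmul(dl(2, 1), dl_inverse(2, 1))        # gives RL40
zero_product = matmul(dl(1, 1), dl(1, -1))       # (1 + j)(1 - j) = 0
rl4_squared = matmul(dl(1, 1), dl(1, 1))         # (1 + j)^2 = 2 + 2j
print(unit == RL40, zero_product == zero, rl4_squared == dl(2, 2))
```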

A similar situation holds true for the (4*4)-matrix RR4 = RR40 + RR41 (from Figure 11).
The set of two matrices RR40 and RR41 is also closed under multiplication; its
multiplication table (Figure 13) is also identical to the multiplication table of
split-complex numbers. The set of (4*4)-matrices DR = A1*RR40+A3*RR41, where A1, A3 are
real numbers, represents split-complex numbers in (4*4)-matrix form (Figure 13). The
matrix RR40 plays the role of the real unit in this set of matrices DR. When
A1^2 ≠ A3^2, the matrix DR^(-1) (Figure 13) is the inverse matrix for DR by definition, on
the basis of the equations DR*DR^(-1) = DR^(-1)*DR = RR40.

         RR40   RR41
RR40     RR40   RR41
RR41     RR41   RR40

$$
\mathrm{DR} = A_1\,\mathrm{RR40} + A_3\,\mathrm{RR41} =
\begin{pmatrix} 0 & A_1 & 0 & -A_3 \\ 0 & A_1 & 0 & -A_3 \\ 0 & -A_3 & 0 & A_1 \\ 0 & -A_3 & 0 & A_1 \end{pmatrix}
$$

$$
\mathrm{DR}^{-1} = (A_1^2 - A_3^2)^{-1}
\begin{pmatrix} 0 & A_1 & 0 & A_3 \\ 0 & A_1 & 0 & A_3 \\ 0 & A_3 & 0 & A_1 \\ 0 & A_3 & 0 & A_1 \end{pmatrix}
$$

Figure 13: The multiplication table of the two (4*4)-matrices RR40 and RR41, which form a set of basic elements of
split-complex numbers DR = A1*RR40+A3*RR41, where A1, A3 are real numbers. The matrix RR40 represents
the real unit in this matrix set. If A1^2 ≠ A3^2, the matrix DR^(-1) is the inverse matrix for DR by definition, on the
basis of the equation DR*DR^(-1) = RR40

The initial matrix R4 can also be decomposed in another way, by the dyadic-shift
decomposition, as was done for the matrix H4 in Figure 2. Figure 14 shows this
dyadic-shift decomposition R4 = R04+R14+R24+R34, in which 4 sparse matrices R04, R14,
R24 and R34 arise (R04 is the identity matrix). The set of these matrices is closed
under multiplication and defines the multiplication table in Figure 14. This
multiplication table is identical to the multiplication table of the 4-dimensional
hypercomplex numbers that are termed split-quaternions of J. Cockle and are well known
in mathematics and physics (http://en.wikipedia.org/wiki/Split-quaternion). From this
point of view, the matrix R4 is the split-quaternion with unit coordinates.
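
$$
\mathrm{R4} =
\begin{pmatrix} 1 & 1 & 1 & -1 \\ -1 & 1 & -1 & -1 \\ 1 & -1 & 1 & 1 \\ -1 & -1 & -1 & 1 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} +
\begin{pmatrix} 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & -1 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix} +
\begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 \\ 0 & -1 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}
$$

The four summands are, from left to right, R04, R14, R24 and R34. Multiplication table:

        R04    R14    R24    R34
R04     R04    R14    R24    R34
R14     R14    -R04   R34    -R24
R24     R24    -R34   R04    -R14
R34     R34    R24    R14    R04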

Figure 14: upper row: the dyadic-shift decomposition R4 = R04+R14+R24+R34. Bottom row: the
multiplication table of the sparse matrices R04, R14, R24 and R34, which is identical to the multiplication table
of split-quaternions by J.Cockle (http://en.wikipedia.org/wiki/Split-quaternion). R04 is identity matrix, which
plays a role of the real unit in this form of split-quaternions by Cockle.
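
The split-quaternion relations of Figure 14 can be reproduced in a few lines. As in the earlier sketches, the dyadic-shift part of an entry is assumed to be given by the XOR of its row and column numbers, which reproduces R04, R14, R24 and R34; helper names are illustrative.

```python
# R4 as given in Figure 1 and Figure 11 (R4 = RL4 + RR4).
R4 = [[1, 1, 1, -1],
      [-1, 1, -1, -1],
      [1, -1, 1, 1],
      [-1, -1, -1, 1]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neg(a):
    return [[-x for x in row] for row in a]


def dyadic_shift_parts(m):
    """Part k keeps the entries of m whose row and column numbers XOR to k."""
    n = len(m)
    return [[[m[i][j] if (i ^ j) == k else 0 for j in range(n)]
             for i in range(n)] for k in range(n)]


R04, R14, R24, R34 = dyadic_shift_parts(R4)

# Split-quaternion relations: one imaginary unit squares to -1, two square to +1.
checks = {
    "R14*R14 = -R04": matmul(R14, R14) == neg(R04),
    "R24*R24 = R04": matmul(R24, R24) == R04,
    "R34*R34 = R04": matmul(R34, R34) == R04,
    "R14*R24 = R34": matmul(R14, R24) == R34,
    "R24*R14 = -R34": matmul(R24, R14) == neg(R34),
    "R24*R34 = -R14": matmul(R24, R34) == neg(R14),
    "R34*R14 = R24": matmul(R34, R14) == R24,
}
for name, ok in checks.items():
    print(name, ok)
```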

So we have obtained an interesting result: the sum of two 2-dimensional split-complex
numbers RL4 and RR4 with unit coordinates (they belong to two different matrix types of
split-complex numbers) generates the 4-dimensional split-quaternion with unit
coordinates. It again resembles a situation in which a union of Yin and Yang (a union of
female and male beginnings, or a fusion of male and female gametes) generates a new
organism. In particular, it means that the matrix R4 is one of the matrices with
internal complementarities.

Let us now return to the (8*8)-matrix R8 (Figure 1) and demonstrate that it is also a
matrix with internal complementarities. Figure 15 shows the matrix R8 as the sum of the
matrices RL8 and RR8.

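$$
\mathrm{R8} = \mathrm{RL8} + \mathrm{RR8} =
\begin{pmatrix}
1 & 0 & 1 & 0 & 1 & 0 & -1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 & -1 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & -1 & 0 & 1 & 0 & 1 & 0 \\
-1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 \\
-1 & 0 & -1 & 0 & -1 & 0 & 1 & 0
\end{pmatrix} +
\begin{pmatrix}
0 & 1 & 0 & 1 & 0 & 1 & 0 & -1 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & -1 \\
0 & -1 & 0 & 1 & 0 & -1 & 0 & -1 \\
0 & -1 & 0 & 1 & 0 & -1 & 0 & -1 \\
0 & 1 & 0 & -1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & -1 & 0 & 1 & 0 & 1 \\
0 & -1 & 0 & -1 & 0 & -1 & 0 & 1 \\
0 & -1 & 0 & -1 & 0 & -1 & 0 & 1
\end{pmatrix}
$$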

Figure 15: the matrix R8 consists of two complementary parts RL8 and RR8

Figure 16 shows a decomposition of the matrix RL8 (from Figure 15) as a sum of 4
matrices: RL8 = RL80 + RL81 + RL82 + RL83. The set of matrices RL80, RL81, RL82 and
RL83 is closed under multiplication and defines a multiplication table identical to the
same multiplication table of Cockle's split-quaternions. The general expression for
split-quaternions in this case can be written as SL = a0*RL80 + a1*RL81 + a2*RL82 +
a3*RL83, where a0, a1, a2, a3 are real numbers. From this point of view, the
(8*8)-genomatrix RL8 is the Cockle split-quaternion with unit coordinates.

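RL8 = RL80 + RL81 + RL82 + RL83, where

$$
\mathrm{RL80} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}, \qquad
\mathrm{RL81} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{RL82} = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{RL83} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table:

         RL80   RL81    RL82   RL83
RL80     RL80   RL81    RL82   RL83
RL81     RL81   -RL80   RL83   -RL82
RL82     RL82   -RL83   RL80   -RL81
RL83     RL83   RL82    RL81   RL80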

Figure 16: Upper rows: the decomposition of the matrix RL8 (from Figure 15) as sum of 4
matrices: RL8 = RL80 + RL81 + RL82 + RL83. Bottom row: the multiplication table of these 4 matrices RL80,
RL81, RL82 and RL83, which is identical to the multiplication table of split-quaternions by J.Cockle. RL80
represents the real unit for this matrix set

A similar situation holds for the matrix RR8 (from Figure 15). Figure 17 shows a
decomposition of the matrix RR8 as a sum of 4 matrices: RR8 = RR80 + RR81 + RR82 +
RR83. The set of matrices RR80, RR81, RR82 and RR83 is closed under multiplication and
defines a multiplication table that is identical to the same multiplication table of
Cockle's split-quaternions. The general expression for split-quaternions in this case
can be written as SR = a0*RR80 + a1*RR81 + a2*RR82 + a3*RR83, where a0, a1, a2, a3 are
real numbers. From this point of view, the (8*8)-matrix RR8 is the split-quaternion
with unit coordinates.

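RR8 = RR80 + RR81 + RR82 + RR83, where

$$
\mathrm{RR80} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
\mathrm{RR81} = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{RR82} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{RR83} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table:

         RR80   RR81    RR82   RR83
RR80     RR80   RR81    RR82   RR83
RR81     RR81   -RR80   RR83   -RR82
RR82     RR82   -RR83   RR80   -RR81
RR83     RR83   RR82    RR81   RR80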

Figure 17: upper rows: the decomposition of the matrix RR8 (from Figure 15) as the sum of 4 matrices:
RR8 = RR80 + RR81 + RR82 + RR83. Bottom row: the multiplication table of these 4 matrices RR80, RR81, RR82
and RR83, which is identical to the multiplication table of split-quaternions by Cockle. RR80 represents the real
unit here.

The initial (8*8)-matrix R8 (Figure 1) can also be decomposed in another way, by the
dyadic-shift decomposition, as was done for the matrix H8 in Figure 10. Figure 18 shows
this dyadic-shift decomposition R8 = R08+R18+R28+R38+R48+R58+R68+R78, in which 8 sparse
matrices R08, R18, R28, R38, R48, R58, R68, R78 arise (R08 is the identity matrix). The
set R08, R18, R28, R38, R48, R58, R68, R78 is closed under multiplication and defines
the multiplication table in Figure 18. This multiplication table is identical to the
multiplication table of the 8-dimensional hypercomplex numbers termed Cockle's
bi-split-quaternions (or split-quaternions over the field of complex numbers). The
general expression for bi-split-quaternions in this case can be written as
S8 = a0*R08+a1*R18+a2*R28+a3*R38+a4*R48+a5*R58+a6*R68+a7*R78, where a0, a1, a2, a3, a4,
a5, a6, a7 are real numbers. From this point of view, the (8*8)-genomatrix R8 is the
bi-split-quaternion with unit coordinates.

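R8 = R08+R18+R28+R38+R48+R58+R68+R78, where

$$
\mathrm{R08} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
\mathrm{R18} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

$$
\mathrm{R28} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{R38} = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{R48} = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{R58} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

$$
\mathrm{R68} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad
\mathrm{R78} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

Multiplication table:

        R08    R18    R28    R38    R48    R58    R68    R78
R08     R08    R18    R28    R38    R48    R58    R68    R78
R18     R18    R08    R38    R28    R58    R48    R78    R68
R28     R28    R38    -R08   -R18   R68    R78    -R48   -R58
R38     R38    R28    -R18   -R08   R78    R68    -R58   -R48
R48     R48    R58    -R68   -R78   R08    R18    -R28   -R38
R58     R58    R48    -R78   -R68   R18    R08    -R38   -R28
R68     R68    R78    R48    R58    R28    R38    R08    R18
R78     R78    R68    R58    R48    R38    R28    R18    R08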

Figure 18: Upper rows: the decomposition of the matrix R8 (from Figure 1) as sum of 8 matrices: R8 =
R08+R18+R28+R38+R48+R58+R68+R78. Bottom row: the multiplication table of these 8 matrices R08, R18,
R28, R38, R48, R58, R68 and R78, which is identical to the multiplication table of bi-split-quaternions by
Cockle. R08 is identity matrix and represents the real unit here.

Here, for the (8*8)-genomatrix R8, we have obtained an interesting result: the sum of
two different 4-dimensional Cockle split-quaternions with unit coordinates (they
belong to two different matrix types of split-quaternions) generates the 8-dimensional
bi-split-quaternion with unit coordinates. This result resembles the above-described
result about the sum of 2-dimensional split-complex numbers with unit coordinates that
generates the 4-dimensional split-quaternion with unit coordinates (Figures 12-14). It
also resembles a situation in which a union of Yin and Yang (a union of male and female
beginnings, or a fusion of male and female gametes) generates a new organism.

4. MATRICES OF GENETIC DUPLETS AND TRIPLETS

The theory of noise-immune coding is based on matrix methods. For example, matrix
methods allow high-quality photos of the surface of Mars to be transmitted across
millions of kilometers of strong interference. In particular, Kronecker families of
Hadamard matrices are used for this purpose. Kronecker multiplication of matrices is a
well-known operation in signal processing technology, theoretical physics, etc. It is
used for the transition from spaces of smaller dimension to associated spaces of higher
dimension.

By analogy with the theory of noise-immune coding, the 4-letter alphabet of RNA
(adenine A, cytosine C, guanine G and uracil U) can be represented in the form of the
(2*2)-matrix [C U; A G] (Figure 19) as the kernel of the Kronecker family of matrices
[C U; A G](n), where (n) means a Kronecker power (Figure 19). Inside this family, the
4-letter alphabet of monoplets is connected with the alphabets of 16 duplets and 64
triplets by means of the second and third Kronecker powers of the kernel matrix,
[C U; A G](2) and [C U; A G](3), where all duplets and triplets are arranged in a strict
order (Figure 19). We begin with the RNA alphabet A, C, G, U here because mRNA sequences
of triplets define protein sequences of amino acids when they are read in ribosomes
(below we will separately consider the case of DNA with its own alphabet).
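
The Kronecker powers of the alphabet matrix can be generated with a few lines of code. This is a minimal sketch in which the "product" of two letters is taken to be their concatenation; the function names `kron` and `kron_power` are illustrative.

```python
KERNEL = [["C", "U"],
          ["A", "G"]]


def kron(a, b):
    """Kronecker product of two symbolic matrices, with concatenation as product."""
    return [[x + y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_power(n):
    """The n-th Kronecker power [C U; A G](n)."""
    result = KERNEL
    for _ in range(n - 1):
        result = kron(result, KERNEL)
    return result


duplets = kron_power(2)    # (4*4)-matrix of 16 duplets
triplets = kron_power(3)   # (8*8)-matrix of 64 triplets

for row in duplets:
    print(" ".join(row))
for row in triplets:
    print(" ".join(row))
```

The printed matrices have the same order of duplets and triplets as Figure 19.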

Figure 19 contains not only the 64 triplets but also the amino acids and stop-codons
encoded by them in the case of the Vertebrate mitochondrial genetic code, which is the
most symmetrical among the known variants of the genetic code
(http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). One can see in Figure 19
that in the matrix [C U; A G](3) the set of columns with even numbers 0, 2, 4, 6 and
the set of columns with odd numbers 1, 3, 5, 7 contain the same collection of amino
acids and stop-codons. In other words, nature has constructed the distribution of
amino acids and stop-codons in accordance with the principle of a matrix with internal
complementarity. This fact is only one piece of evidence that the described matrices
with internal complementarities are mathematical patterns of the genetic coding system
(mathematical partners of the genetic code).
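
The statement about even and odd columns can be checked by counting. In this sketch the rows of amino acids and stop-codons are transcribed from Figure 19 (Vertebrate mitochondrial code); the variable names are illustrative.

```python
from collections import Counter

# Rows of the matrix [C U; A G](3) in Figure 19, amino acids and stop-codons only.
AMINO_ROWS = [
    "PRO PRO LEU LEU SER SER PHE PHE",
    "PRO PRO LEU LEU SER SER LEU LEU",
    "HIS HIS ARG ARG TYR TYR CYS CYS",
    "GLN GLN ARG ARG STOP STOP TRP TRP",
    "THR THR ILE ILE ALA ALA VAL VAL",
    "THR THR MET MET ALA ALA VAL VAL",
    "ASN ASN SER SER ASP ASP GLY GLY",
    "LYS LYS STOP STOP GLU GLU GLY GLY",
]
table = [row.split() for row in AMINO_ROWS]

# Collections (with multiplicities) in the even and in the odd columns.
even_columns = Counter(row[j] for row in table for j in (0, 2, 4, 6))
odd_columns = Counter(row[j] for row in table for j in (1, 3, 5, 7))

print(even_columns == odd_columns)
print(sorted(even_columns.items()))
```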

Let us explain the black-and-white mosaics of [C U; A G](2) and [C U; A G](3) (Figure
19), which reflect important features of the genetic code. These features are connected
with the specifics of reading mRNA sequences in ribosomes to define protein sequences
of amino acids (this is the reason why we use the RNA alphabet A, C, G, U in the
matrices in Figure 19; below we will consider the case of DNA sequences separately).

[C U; A G] =

C  U
A  G

[C U; A G](2) =

CC  CU  UC  UU
CA  CG  UA  UG
AC  AU  GC  GU
AA  AG  GA  GG

[C U; A G](3), with the encoded amino acid or stop-codon under each triplet:

CCC   CCU   CUC   CUU   UCC   UCU   UUC   UUU
PRO   PRO   LEU   LEU   SER   SER   PHE   PHE
CCA   CCG   CUA   CUG   UCA   UCG   UUA   UUG
PRO   PRO   LEU   LEU   SER   SER   LEU   LEU
CAC   CAU   CGC   CGU   UAC   UAU   UGC   UGU
HIS   HIS   ARG   ARG   TYR   TYR   CYS   CYS
CAA   CAG   CGA   CGG   UAA   UAG   UGA   UGG
GLN   GLN   ARG   ARG   STOP  STOP  TRP   TRP
ACC   ACU   AUC   AUU   GCC   GCU   GUC   GUU
THR   THR   ILE   ILE   ALA   ALA   VAL   VAL
ACA   ACG   AUA   AUG   GCA   GCG   GUA   GUG
THR   THR   MET   MET   ALA   ALA   VAL   VAL
AAC   AAU   AGC   AGU   GAC   GAU   GGC   GGU
ASN   ASN   SER   SER   ASP   ASP   GLY   GLY
AAA   AAG   AGA   AGG   GAA   GAG   GGA   GGG
LYS   LYS   STOP  STOP  GLU   GLU   GLY   GLY

Figure 19: the first three representatives of the Kronecker family of RNA-alphabetic matrices [C U; A G](n).
Black color marks 8 strong duplets in the matrix [C U; A G](2) (at the top) and 32 triplets with strong roots in
the matrix [C U; A G](3) (bottom). 20 amino acids and stop-codons, which correspond to triplets, are also
shown in the matrix [C U; A G](3) for the case of the Vertebrate mitochondrial genetic code

A combination of letters in the first two positions of each triplet is usually termed the
“root” of this triplet (Konopelchenko, Rumer, 1975a,b; Rumer, 1968). Modern science
recognizes many variants (or dialects) of the genetic code, data about which are shown
on the NCBI’s website http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi.
There are 17 variants (or dialects) of the genetic code, which differ from one another in some
details of the correspondence between triplets and the objects they encode. Most of these
dialects (including the so-called Standard Code and the Vertebrate Mitochondrial Code)
share a general symmetrological scheme of these correspondences, with 32 “black”
triplets with “strong” roots and 32 “white” triplets with “weak” roots (see details
in (Petoukhov, 2008c)). In this basic scheme, the set of 64 triplets contains 16
subfamilies of triplets, each of which contains 4 triplets with the same two letters
in the first positions (an example of such a subfamily is the four triplets CAC, CAA,
CAT, CAG with the same two letters CA in their first positions). In the described basic
scheme, the set of these 16 subfamilies of NN-triplets is divided into two equal subsets.
The first subset contains 8 subfamilies of so-called “two-position” NN-triplets, the coding
value of which does not depend on the letter in their third position: (CCC, CCT, CCA,
CCG), (CTC, CTT, CTA, CTG), (CGC, CGT, CGA, CGG), (TCC, TCT, TCA, TCG),
(ACC, ACT, ACA, ACG), (GCC, GCT, GCA, GCG), (GTC, GTT, GTA, GTG),
(GGC, GGT, GGA, GGG). An example of such a subfamily is the four triplets CGC,
CGA, CGT, CGG, all of which encode the same amino acid Arg, though they have
different letters in their third position. The 32 triplets of the first subset are termed
“triplets with strong roots” (Konopelchenko, Rumer, 1975a,b; Rumer, 1968). The
corresponding 8 strong roots are the duplets CC, CT, CG, AC, TC, GC,
GT, GG (strong duplets). All of these 32 NN-triplets and 8 strong duplets are
marked in black in the matrices [C U; A G](3) and [C U; A G](2) in Figure 19.

The second subset contains 8 subfamilies of “three-position” NN-triplets, the coding
value of which depends on the letter in their third position: (CAC, CAT, CAA, CAG),
(TTC, TTT, TTA, TTG), (TAC, TAT, TAA, TAG), (TGC, TGT, TGA, TGG), (AAC,
AAT, AAA, AAG), (ATC, ATT, ATA, ATG), (AGC, AGT, AGA, AGG), (GAC,
GAT, GAA, GAG). An example of such a subfamily is the four triplets CAC, CAA,
CAT, CAG, two of which (CAC, CAT) encode the amino acid His and the other two
(CAA, CAG) encode another amino acid, Gln. The 32 triplets of the second subset are
termed “triplets with weak roots” (Konopelchenko, Rumer, 1975a,b; Rumer, 1968).
The corresponding 8 weak roots are the duplets CA, AA, AT, AG, TA,
TT, TG, GA (weak duplets). All of these 32 NN-triplets and 8 weak duplets
are marked in white in the matrices [C U; A G](3) and [C U; A G](2) in Figure 19.

From the point of view of its black-and-white mosaic, each column of the genetic
matrices [C U; A G](2) and [C U; A G](3) has a meander-like character and coincides
with one of the Rademacher functions, which form orthogonal systems and are well known in
discrete signal processing. These functions contain only the elements “+1” and “-1”. Due
to this fact, one can construct Rademacher representations of the symbolic genomatrices
[C U; A G](2) and [C U; A G](3) (Figure 19) by means of the following operation: each
black duplet and black triplet is replaced by the number “+1” and each white
duplet and white triplet is replaced by the number “-1”. This operation leads immediately
to the matrices R4 and R8 from Figure 1, which are the Rademacher representations of the
phenomenological genomatrices [C U; A G](2) and [C U; A G](3). This fact is one piece of
evidence for the algebraic nature of the genetic code.
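
The replacement operation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's software: the helper names (kron_symbolic, kron_power, rademacher) are hypothetical, and the strong roots are taken from the list given in the text, written with U instead of T. The resulting numeric matrices are computed from that list alone and are not compared here with Figure 1.

```python
# Basic RNA-alphabetic matrix [C U; A G] and the eight strong roots from the text
# (written with U in place of T). All names below are illustrative assumptions.
BASE = [["C", "U"], ["A", "G"]]
STRONG_ROOTS = {"CC", "CU", "CG", "AC", "UC", "GC", "GU", "GG"}


def kron_symbolic(left, right):
    """Kronecker product of symbolic matrices: entries are concatenated."""
    return [
        [a + b for a in left_row for b in right_row]
        for left_row in left
        for right_row in right
    ]


def kron_power(base, n):
    """n-th Kronecker power of a symbolic matrix, i.e. [C U; A G](n)."""
    result = base
    for _ in range(n - 1):
        result = kron_symbolic(result, base)
    return result


def rademacher(matrix):
    """Replace black cells (strong root) by +1 and white cells (weak root) by -1."""
    return [[1 if cell[:2] in STRONG_ROOTS else -1 for cell in row] for row in matrix]


duplets = kron_power(BASE, 2)   # [C U; A G](2)
triplets = kron_power(BASE, 3)  # [C U; A G](3)
R4 = rademacher(duplets)
R8 = rademacher(triplets)

for row in R4:
    print(row)
```

Each column printed for R4 alternates in runs of equal length, which is the meander-like behaviour mentioned in the text.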

One can note that genomatrices [C U; A G](2) and [C U; A G](3) and their Rademacher
representations R4 and R8 (Figure 1) are connected on the base of the equations (1),
where ⊗ means Kronecker multiplication:

R4 ⊗ [1 1; 1 1] = R8, [C U; A G](2) ⊗ [C U; A G] = [C U; A G](3) (1)

Here [1 1; 1 1] is the traditional (2*2)-matrix representation of the split-complex number
with unit coordinates, which can be considered as the Rademacher representation R2 of
the genomatrix [C U; A G]. Equations (1) testify that, in the case of the RNA alphabet,
each of the four letters in the matrix [C U; A G] should be taken as equal to the number
“+1”: A=C=G=U=+1. They also show that the Rademacher representations R2 and R4 of the
matrices [C U; A G] and [C U; A G](2) can be considered as basic, because
the Rademacher representation R8 is deduced from them by means of their Kronecker
multiplication.
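
The numeric half of equations (1) can be checked with a short sketch. The matrix R4 below is an assumption: it is derived here from the strong/weak duplet lists given in the text, not copied from Figure 1. The helper names kron and matmul are hypothetical.

```python
def kron(a, b):
    """Numeric Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in a_row for y in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(a, b):
    """Plain matrix product, used only to check the split-complex unit."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


E = [[1, 0], [0, 1]]   # real unit
J = [[0, 1], [1, 0]]   # split-complex imaginary unit: J*J = +E
R2 = [[1, 1], [1, 1]]  # split-complex number with unit coordinates, E + J

# Assumed R4: +1 for strong duplets, -1 for weak duplets of [C U; A G](2).
R4 = [
    [1, 1, 1, -1],
    [-1, 1, -1, -1],
    [1, -1, 1, 1],
    [-1, -1, -1, 1],
]

R8 = kron(R4, R2)  # numeric part of equations (1)

print(matmul(J, J) == E)  # True
for row in R8:
    print(row)
```

Because every entry of R2 is +1, each entry of R4 simply expands into a 2*2 block of the same sign, which mirrors the fact that the four triplets sharing a root share a color.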

Now let us turn to the DNA alphabet (adenine A, cytosine C, guanine G and
thymine T) and the corresponding Kronecker family of matrices [C T; A G](n). What kind
of black-and-white mosaics (or dispositions of elements “+1” and “-1” in numeric
representations of these symbolic matrices) are appropriate in this case for the basic
matrices [C T; A G] and [C T; A G](2)? The important phenomenological fact is that
thymine T is the only nitrogenous base in DNA that is replaced in RNA by another
nitrogenous base, U (uracil), for an unknown reason (this is one of the mysteries of the
genetic system). In other words, in this system the letter T stands in opposition
to the letter U, and so the letter T can be symbolized by the number “-1” (instead of the number
“+1” for U). For this objective reason, one can construct numeric representations H2 and
H4 of the matrices [C T; A G] and [C T; A G](2) by means of the following
transformation of the black-and-white mosaics of the matrices [C U; A G] and
[C U; A G](2) from Figure 19 together with their Rademacher representations R2 and R4:
- in the matrices [C T; A G] and [C T; A G](2), each monoplet and duplet that begins
with the letter T is taken with the opposite color in comparison with the corresponding
entry in the matrices [C U; A G] and [C U; A G](2) from Figure 19; correspondingly, the
numeric representations of these DNA-alphabetic matrices [C T; A G] and [C T; A G](2)
reflect the new mosaics of these symbolic matrices.

The numeric representation H8 of the DNA-alphabetic matrix of triplets [C T; A G](3) is
constructed on the basis of equations (2) by analogy with equations (1):

H4 ⊗ [1 -1; 1 1] = H8, [C T; A G](2) ⊗ [C T; A G] = [C T; A G](3) (2)

Here [1 -1; 1 1] is the traditional (2*2)-matrix representation of the complex number with
unit coordinates. The black-and-white mosaic of the matrix [C T; A G](3) is defined by
the disposition of the numbers “+1” and “-1” in its numeric representation H8. Figure 20
shows the DNA-alphabetic matrices [C T; A G], [C T; A G](2) and [C T; A G](3) with their
mosaics constructed in this way, which is based on objective properties of the
molecular-genetic system and can be used in biological computers of organisms. One
can see that the mosaics of these symbolic matrices [C T; A G](2) and [C T; A G](3) coincide
with the disposition of the numbers “+1” and “-1” in the numeric matrices H4 and H8 (Figure 1),
which can be termed “Hadamard representations” of these genomatrices because
the matrices H4 and H8 satisfy the definition of Hadamard matrices (Petoukhov, 2008b,
2011).
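
The T-flip rule and equations (2) can be illustrated with the following sketch. It assumes the numeric matrix R4 obtained from the strong/weak duplet lists in the text (not copied from Figure 1), so the matrices H4 and H8 printed here follow from the stated rule only; the helper names are hypothetical. The Hadamard test checks that the product of the matrix with its transpose equals n times the identity.

```python
def kron(a, b):
    """Numeric Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in a_row for y in b_row]
        for a_row in a
        for b_row in b
    ]


def is_hadamard(m):
    """True if all entries are +1/-1 and the rows are mutually orthogonal."""
    n = len(m)
    if any(abs(v) != 1 for row in m for v in row):
        return False
    return all(
        sum(m[i][k] * m[j][k] for k in range(n)) == (n if i == j else 0)
        for i in range(n)
        for j in range(n)
    )


# Assumed R4 (+1 strong duplet, -1 weak duplet) and the DNA duplet matrix [C T; A G](2).
R4 = [[1, 1, 1, -1], [-1, 1, -1, -1], [1, -1, 1, 1], [-1, -1, -1, 1]]
DUPLETS = [
    ["CC", "CT", "TC", "TT"],
    ["CA", "CG", "TA", "TG"],
    ["AC", "AT", "GC", "GT"],
    ["AA", "AG", "GA", "GG"],
]

H2 = [[1, -1], [1, 1]]  # [C T; A G] with T = -1: complex number with unit coordinates

# Rule from the text: duplets beginning with T take the opposite color (sign).
H4 = [
    [-value if duplet.startswith("T") else value for value, duplet in zip(r_row, d_row)]
    for r_row, d_row in zip(R4, DUPLETS)
]
H8 = kron(H4, H2)  # numeric part of equations (2)

print(is_hadamard(R4), is_hadamard(H4), is_hadamard(H8))  # False True True
```

The contrast in the last line shows that the sign change for T is what turns the Rademacher mosaic into a matrix with mutually orthogonal rows.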

[C T; A G]:

C  T
A  G

[C T; A G](2):

CC  CT  TC  TT
CA  CG  TA  TG
AC  AT  GC  GT
AA  AG  GA  GG

[C T; A G](3):

CCC  CCT  CTC  CTT  TCC  TCT  TTC  TTT
CCA  CCG  CTA  CTG  TCA  TCG  TTA  TTG
CAC  CAT  CGC  CGT  TAC  TAT  TGC  TGT
CAA  CAG  CGA  CGG  TAA  TAG  TGA  TGG
ACC  ACT  ATC  ATT  GCC  GCT  GTC  GTT
ACA  ACG  ATA  ATG  GCA  GCG  GTA  GTG
AAC  AAT  AGC  AGT  GAC  GAT  GGC  GGT
AAA  AAG  AGA  AGG  GAA  GAG  GGA  GGG

Figure 20: the first three representatives [C T; A G], [C T; A G](2) and [C T; A G](3) of the Kronecker family
of DNA-alphabetic matrices [C T; A G](n). Hadamard representations H4 and H8 of the symbolic matrices [C
T; A G](2) and [C T; A G](3) with the same mosaics are shown on Figure 1

Genetic matrices with internal complementarities resemble objects with Yin and Yang
parts from the doctrines of Ancient China. One can add here the following mathematical
fact. The famous Yin-Yang symbol has a symmetrical configuration: a 180-degree
turn changes only its black-and-white mosaic, while the new configuration of the symbol
coincides with the initial one. It is interesting that a 180-degree turn of the genetic
matrices R4, R8, H4, H8 (Figure 1) leads to a similar result: the mosaics of these matrices are
essentially changed, but the new matrices are again matrices with internal
complementarities, whose algebraic properties coincide with the initial ones (the same
multiplication tables as in Figures 9, 10, 12-14, 16-18). So, in this case the mythological object
helps to reveal new mathematical properties of the genetic matrices.

The phenomenology of the genetic system gives additional confirmation of its connection
with the mosaic genomatrices [C T; A G](n), numeric representations of which possess
internal complementarities. In the matrices [C T; A G](n), let us enumerate their 2^n columns
from left to right by the numbers 0, 1, 2, ..., 2^n-1 and then consider two sets of n-plets
(oligonucleotides) in each of the matrices [C T; A G](n): 1) the first set contains all n-plets
from columns with even numbers 0, 2, 4, … (this set is conditionally termed the
even-set or the Yin-set); 2) the second set contains all n-plets from columns with odd
numbers 1, 3, 5, … (this set is conditionally termed the odd-set or the Yang-set).
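
The division into an even-set and an odd-set can be sketched as follows. The function names are hypothetical and the code is only an illustration of the column-parity rule stated above, applied to the Kronecker power [C T; A G](n) built by concatenating letters.

```python
BASE = [["C", "T"], ["A", "G"]]  # basic DNA-alphabetic matrix [C T; A G]


def kron_symbolic(left, right):
    """Kronecker product of symbolic matrices: entries are concatenated."""
    return [
        [a + b for a in left_row for b in right_row]
        for left_row in left
        for right_row in right
    ]


def kron_power(base, n):
    """n-th Kronecker power, i.e. the matrix [C T; A G](n) of n-plets."""
    result = base
    for _ in range(n - 1):
        result = kron_symbolic(result, base)
    return result


def split_by_column_parity(matrix):
    """Return (even_set, odd_set): n-plets from even- and odd-numbered columns."""
    even_set = [cell for row in matrix for j, cell in enumerate(row) if j % 2 == 0]
    odd_set = [cell for row in matrix for j, cell in enumerate(row) if j % 2 == 1]
    return even_set, odd_set


triplets = kron_power(BASE, 3)
even_set, odd_set = split_by_column_parity(triplets)

print(len(even_set), len(odd_set))           # 32 32
print(sorted({t[-1] for t in even_set}))     # ['A', 'C']
print(sorted({t[-1] for t in odd_set}))      # ['G', 'T']
```

In this construction the parity of a column is decided by the last letter of the n-plet, so the even-set consists of the n-plets ending in C or A and the odd-set of those ending in T or G.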

For example, the genomatrix [C T; A G](3) (Figure 20) contains the even-set of 32
triplets in its columns with even numbers 0, 2, 4, 6 (CCC, CCA, CAC, CAA, ACC,
ACA, AAC, AAA, CTC, CTA, CGC, CGA, ATC, ATA, AGC, AGA, TCC, TCA,
TAC, TAA, GCC, GCA, GAC, GAA, TTC, TTA, TGC, TGA, GTC, GTA, GGC,
GGA) and the odd-set of 32 triplets in its columns with odd numbers 1, 3, 5, 7
(CCT, CCG, CAT, CAG, ACT, ACG, AAT, AAG, CTT, CTG, CGT, CGG, ATT,
ATG, AGT, AGG, TCT, TCG, TAT, TAG, GCT, GCG, GAT, GAG, TTT, TTG, TGT,
TGG, GTT, GTG, GGT, GGG). One can show, for example, that the structure of the
whole human genome is connected with the equal division of the whole set of 64
triplets into the even-set of 32 triplets and the odd-set of 32 triplets. Indeed, let us
calculate the total quantities (frequencies Feven and Fodd) of members of these two sets of
triplets in the whole human genome, which contains 2,843,411,612
(about three billion) triplets. The initial data about this genome (Figure 21) are taken by
the author from the article (Perez, 2010). Different triplets occur in this genome with
very different frequencies. For example, the frequency of the triplet CGA is equal
to 6,251,611 and the frequency of the triplet TTT is equal to 109,591,342; they differ by
a factor of approximately 18. But the result of our calculation shows that the total quantities
of members of the even-set (Feven) and of the odd-set (Fodd) in the whole human genome
are equal to each other to within 0.12%:
Feven = 1,420,853,821 for the even-set of 32 triplets;
Fodd = 1,422,557,791 for the odd-set of 32 triplets.
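
The totals above can be recomputed from the triplet frequencies of (Perez, 2010) reproduced in the figure that follows. The sketch below is only an illustration: it uses the observation that, in the matrix [C T; A G](3), columns with even numbers hold exactly the triplets ending in C or A and columns with odd numbers hold those ending in T or G. The variable names are hypothetical.

```python
# Triplet frequencies in the whole human genome, transcribed from the table of (Perez, 2010).
TABLE = """
AAA 109143641 CAA 53776608 GAA 56018645 TAA 59167883
AAC 41380831 CAC 42634617 GAC 26820898 TAC 32272009
AAG 56701727 CAG 57544367 GAG 47821818 TAG 36718434
AAT 70880610 CAT 52236743 GAT 37990593 TAT 58718182
ACA 57234565 CCA 52352507 GCA 40907730 TCA 55697529
ACC 33024323 CCC 37290873 GCC 33788267 TCC 43850042
ACG 7117535 CCG 7815619 GCG 6744112 TCG 6265386
ACT 45731927 CCT 50494519 GCT 39746348 TCT 62964984
AGA 62837294 CGA 6251611 GGA 43853584 TGA 55709222
AGC 39724813 CGC 6737724 GGC 33774033 TGC 40949883
AGG 50430220 CGG 7815677 GGG 37333942 TGG 52453369
AGT 45794017 CGT 7137644 GGT 33071650 TGT 57468177
ATA 58649060 CTA 36671812 GTA 32292235 TTA 59263408
ATC 37952376 CTC 47838959 GTC 26866216 TTC 56120623
ATG 52222957 CTG 57598215 GTG 42755364 TTG 54004116
ATT 71001746 CTT 56828780 GTT 41557671 TTT 109591342
"""

tokens = TABLE.split()
freq = {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

# Even-numbered columns of [C T; A G](3): triplets ending in C or A; odd: ending in T or G.
f_even = sum(count for triplet, count in freq.items() if triplet[-1] in "CA")
f_odd = sum(count for triplet, count in freq.items() if triplet[-1] in "TG")
relative_difference = (f_odd - f_even) / f_even

print(f_even)  # 1420853821
print(f_odd)   # 1422557791
print(f"{100 * relative_difference:.2f}%")  # 0.12%
```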

One should note that the work (Perez, 2010, Table 10) shows another variant of the
division of the set of 64 triplets into two subsets with 32 triplets in each, based not on
the matrix approach but on a traditional table of triplets and the
principle of “codons and their mirror-codons”. This variant also reveals an approximate
equality, with high precision, of the quantities of members of these two subsets for the case
of the whole human genome.

TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY    TRIPLET  FREQUENCY
AAA      109143641    CAA      53776608     GAA      56018645     TAA      59167883
AAC      41380831     CAC      42634617     GAC      26820898     TAC      32272009
AAG      56701727     CAG      57544367     GAG      47821818     TAG      36718434
AAT      70880610     CAT      52236743     GAT      37990593     TAT      58718182
ACA      57234565     CCA      52352507     GCA      40907730     TCA      55697529
ACC      33024323     CCC      37290873     GCC      33788267     TCC      43850042
ACG      7117535      CCG      7815619      GCG      6744112      TCG      6265386
ACT      45731927     CCT      50494519     GCT      39746348     TCT      62964984
AGA      62837294     CGA      6251611      GGA      43853584     TGA      55709222
AGC      39724813     CGC      6737724      GGC      33774033     TGC      40949883
AGG      50430220     CGG      7815677      GGG      37333942     TGG      52453369
AGT      45794017     CGT      7137644      GGT      33071650     TGT      57468177
ATA      58649060     CTA      36671812     GTA      32292235     TTA      59263408
ATC      37952376     CTC      47838959     GTC      26866216     TTC      56120623
ATG      52222957     CTG      57598215     GTG      42755364     TTG      54004116
ATT      71001746     CTT      56828780     GTT      41557671     TTT      109591342

Figure 21: Number of occurrences of each triplet in the whole human genome (from (Perez, 2010)).

A more general confirmation of the genetic importance of the structure of genomatrices with
internal complementarities for long nucleotide sequences was provided by the results of
the study of Symmetry Principle № 6 from the work (Petoukhov, 2008c, 6th version,
section 11), where, in contrast to this article, a special notion of fractal genetic nets for
long nucleotide sequences was used. We now propose to use the relevant
phenomenological data for the justification and development of the new idea: the described
matrices with internal complementarities are important algebraic patterns for the
structurization of the genetic coding system, the nature of which has algebraic bases.

The described connection between the genetic system and matrices with internal
complementarities can be associated with Plato’s conception of androgynes. In
accordance with this ancient conception, in primal times people had doubled bodies.
But at one moment the gods punished them by splitting them in half. Ever since
that time, people run around saying they are looking for their other half because they
are really trying to recover their primal nature
(http://en.wikipedia.org/wiki/Symposium_(Plato)). This conception is frequently used in
discussions of important facts of embryology and other modern scientific fields concerning
hermaphroditism, including the embryological principle of primordial hermaphroditism,
etc. (Dreger, 1998; Money, 1990, etc.). Taking Plato’s conception into account,
genetic matrices with internal complementarities can also be termed “androgynous
matrices”. The results of our research lead to the idea that phenomena of
hermaphroditism have a basic analogue at the molecular-genetic level. These results can
be related to biological problems of genetically inherited symmetries and
dissymmetry (Darvas, 2007; Gal, 2011; Hellige, 1993).

5. SOME CONCLUDING REMARKS

At the beginning of the 19th century, it was believed that there exists one
arithmetic that is true for all natural systems. But after the discovery of quaternions by
Hamilton, science was compelled to abandon the former belief in the existence of
only one true arithmetic/algebra in the world (see (Kline, 1980)). It was recognized that
various natural systems can have not only their own geometry (Euclidean or non-
Euclidean geometries), but also their own algebra (arithmetic of multi-dimensional
numbers). A scientist who takes an inadequate algebra to model a natural system can
repeat the impressive example of Hamilton, who wasted 10 years trying to solve the task
of 3D space transformations on the basis of inadequate 3-dimensional algebras (this
task needs the 4-dimensional algebra of Hamilton’s quaternions). Modern theoretical
physics includes, as one of its main parts, a great number of attempts to reveal what
kinds of multi-dimensional numeric systems correspond to ensembles of relations in
concrete physical systems.

The results of our research show that relations in the genetic coding system
correspond to the described algebraic system of matrices with internal
complementarities. A researcher who does not take into account this fact and this special
mathematics runs the risk of wasting a lot of time and effort by
applying inadequate approaches to the study of the algebraic properties of the genetic
system.

In particular, this article shows the connection of the genetic coding system with
Hamilton's quaternions. Hamilton quaternions are closely related to the Pauli
matrices, the theory of the electromagnetic field (Maxwell wrote his equations in the
language of Hamilton quaternions), the special theory of relativity, the theory of spins,
the quantum theory of chemical valency, etc. In the twentieth century thousands of works
were devoted to quaternions in physics [http://arxiv.org/abs/math-ph/0511092]. Now
Hamilton quaternions are manifested in the genetic code system. Our scientific
direction - "matrix genetics" - has led to the discovery of an important bridge among
physics, biology and computer science for their mutual enrichment. In addition, our
study provides a new example of the inconceivable effectiveness of mathematics:
abstract mathematical structures derived by mathematicians at the tip of the pen 160
years ago are embodied inside the molecular-genetic system, which is the informational
basis of living matter. And what mathematicians discover by means of painful
reflection (like Hamilton, who spent 10 years of continuous thought to discover his
quaternions) is already represented in the genetic coding system.

The described genetic matrices with internal complementarities (or “androgynous
matrices”) possess many other interesting mathematical properties related to cyclic and
dyadic shifts, multiplications of these matrices, the Kronecker families of matrices R4⊗[1
1; 1 1](n) and H4⊗[1 -1; 1 1](n), dichotomous trees of different 2^n-dimensional numbers,
rotational transformations of these numeric genomatrices into new numeric
genomatrices with internal complementarities, etc. The set of (2^n*2^n)-matrices with
internal complementarities contains a huge quantity of different types of matrix
representations of complex numbers (or relevant algebraic fields
(http://en.wikipedia.org/wiki/Field_(mathematics))) and of split-complex numbers that,
as far as the author can judge, have not been specially studied in mathematics previously. The
relevant 2^n-dimensional numeric systems, including the said plurality of complex and
split-complex numbers and their extensions, have prospects for application in
mathematical natural sciences and signal processing. Here one can recall the
statement: “Profound study of nature is the most fertile source of mathematical
discoveries” (Fourier, 2006). The discovery of the genetic importance of matrices with
internal complementarities makes it possible to divide the sets of amino acids and stop-
signals into interesting subsets in accordance with the structure of the genomatrix [C T;
A G](3); it also offers new approaches to the study of proteins. One should note that
phenomena of complementarity play a basic role at different genetic levels. We
hope to expand on this and similar topics in future publications.

The notion of number is one of the main notions of mathematics. In the long evolution of
this notion, many kinds of multi-dimensional numerical systems have appeared.
Complex numbers and split-complex numbers occupy a particularly important place in
mathematics and mathematical natural sciences. For example, complex numbers have
served as magic instruments for the development of theories and calculations in the fields
of heat, light, sound, vibrations, elasticity, gravitation, magnetism,
electricity, liquid streams, and phenomena of the micro-world. Complex numbers
are the mathematical basis of quantum mechanics and of many other branches of science.
For example, the Schrödinger equation contains the imaginary unit, and the wave
functions of quantum mechanics are complex-valued. This article shows that many
kinds of complex numbers and split-complex numbers exist which are connected with
the genetic matrices. One can think that this splitting of the numeric basis of mathematical
natural sciences leads to a relevant splitting in the mathematical natural sciences themselves. For
example, one can ask what kinds of complex numbers should be used in the
Schrödinger equation. Or can different types of wave functions of quantum mechanics
exist, which correspond to different kinds of complex numbers? In our opinion, such
questions should be deeply analyzed in the future.

This article proposes a new mathematical approach to the study of “a partnership between
genes and mathematics” (see Section 1 above). In the author’s opinion, this kind of
mathematics is beautiful and can be used for the further development of algebraic biology
and theoretical physics in accordance with the famous statement by P. Dirac, who taught
that the creation of a physical theory must begin with a beautiful mathematical theory:
“If this theory is really beautiful, then it necessarily will appear as a fine model of
important physical phenomena. It is necessary to search for these phenomena to
develop applications of the beautiful mathematical theory and to interpret them as
predictions of new laws of physics” (Arnold, 2007). According to Dirac, all new
physics, including relativistic and quantum physics, develops in this way.

The results of matrix genetics lead to the idea that the structure of the genetic coding
system is dictated by patterns of the described numeric genomatrices; here one can
recall the famous Pythagorean statement that “numbers rule the world”, with the
refinement that we should now talk about multi-dimensional numbers.

Acknowledgments. The described research was carried out by the author within the framework of
a long-term cooperation between the Russian and Hungarian Academies of Sciences. The
author is grateful to Darvas, G., Stepanyan, I.V., Svirin, V.I. for their collaboration.

REFERENCES

Ahmed, N.U., Rao, K.R. (1975). Orthogonal transforms for digital signal processing. New York: Springer-
Verlag, Inc.
Arnold, V. (2007) A complexity of the finite sequences of zeros and units and geometry of the finite
functional spaces. Lecture at the session of the Moscow Mathematical Society, May 13,
http://elementy.ru/lib/430178/430281.
Bellman, R. (1960) Introduction to Matrix Analysis. New-York: McGraw-Hill Book Company, Inc., 351 pp.
Darvas, G. (2007) Symmetry. Basel: Birkhäuser.
Dreger A. (1998) Hermaphrodites and the Medical Invention of Sex. Harvard University Press.
Fourier, J. (2006) The Analytical Theory of Heat. Cambridge: University Press.
Gal, J. (2011) Louis Pasteur, language, and molecular chirality. I. Background and dissymmetry, Chirality,
23, 1–16.
Hellige, J. B. (1993). Hemispheric Asymmetry: What's Right and What's Left. Cambridge. Massachusetts:
Harvard University Press.
Kline, M. (1980) Mathematics. The Loss of Certainty. New-York: Random House, 384 p.
Konopelchenko, B. G., Rumer, Yu. B. (1975a) Classification of the codons of the genetic code. I & II.
Preprints 75-11 and 75-12 of the Institute of Nuclear Physics of the Siberian department of the USSR
Academy of Sciences. Novosibirsk: Institute of Nuclear Physics.
Konopelchenko, B. G., Rumer, Yu. B. (1975b). Classification of the codons in the genetic code. Doklady
Akademii Nauk SSSR, 223(2), 145-153 (in Russian).
Money, J. (1990) Androgyne becomes bisexual in sexological theory: Plato to Freud and neuroscience. The
Journal of the American Academy of Psychoanalysis. 18(3): 392-413
(http://www.ncbi.nlm.nih.gov/pubmed/2258314)
Petoukhov, S.V. (2008a). Matrix genetics, algebras of the genetic code, noise-immunity. Moscow: Regular
and Chaotic Dynamics, 316 p. (in Russian; summary in English is on the
http://www.geocities.com/symmetrion/Matrix_genetics/matrix_genetics.html)
Petoukhov, S.V. (2008b) The degeneracy of the genetic code and Hadamard matrices. -
http://arXiv:0802.3366, p. 1-26 (The first version is from February 22, 2008; the last revised is from
December, 26, 2010).
Petoukhov, S.V. (2008c) Matrix genetics, part 1: permutations of positions in triplets and symmetries of
genetic matrices. - http://arxiv.org/abs/0803.0888, version 6, p. 1-34.
Petoukhov, S.V. (2011) Matrix genetics and algebraic properties of the multi-level system of genetic
alphabets. - Neuroquantology, 9, No 4, 60-81,
http://www.neuroquantology.com/index.php/journal/article/view/501
Petoukhov, S.V. (2012). The genetic code, 8-dimensional hypercomplex numbers and dyadic
shifts. http://arxiv.org/abs/1102.3596, p. 1-80.
Petoukhov, S.V. , He M. (2010) Symmetrical Analysis Techniques for Genetic Systems and
Bioinformatics: Advanced Patterns and Applications. Hershey, USA: IGI Global. 271 p.
Perez, J.-C. (2010) Codon populations in single-stranded whole human genome DNA are fractal and fine-
tuned by the golden ratio 1.618. - Interdiscip Sci Comput Life Sci, 2, 1–13.
Rumer, Yu. B. (1968). Systematization of the codons of the genetic code. Doklady Akademii Nauk SSSR,
183(1), p. 225-226 (in Russian).
Stewart, I. (1999) Life's Other Secret: The New Mathematics of the Living World. New-York: Wiley, 304 p.
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 303-322, 2012

FRACTAL GENETIC NETS AND SYMMETRY
PRINCIPLES IN LONG NUCLEOTIDE SEQUENCES

S.V. Petoukhov*, V.I. Svirin**

* Biophysicist, bioinformatician (b. Moscow, Russia, 1946).
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian
Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail:
spetoukhov@gmail.com.
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony,
mathematical crystallography (also history of sciences, oriental medicine).
Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the
USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012.
Publications: Biomechanics, Bionics and Symmetry, Moscow, Nauka, (1981), 239 pp. (in Russian);
Biosolitons. Fundamentals of Soliton Biology, Moscow, GPKT, (1999), 288 pp. (in Russian); Matrix Genetics,
Algebras of the Genetic Code, Noise-immunity, Moscow, RCD, (2008), 316 pp. (in Russian); with M. He:
Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics: Advanced Patterns and
Applications, Hershey, USA: IGI Global, (2010), 271 pp.; with He M.: Mathematics of Bioinformatics:
Theory, Practice, and Applications, USA: John Wiley & Sons, Inc., (2011), 295 pp.

** Biophysicist, bioinformatician (b. Nizhnekamsk, Russia, 1987).
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian
Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail:
vitaly.i.svirin@gmail.com.
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony,
discrete mathematics, cybernetics, neural networks.
Publications: Stepanyan I.V., Cygankov V.D., Svirin V.I. and Golovanyova G.V. (2012) Neurophysiological
approaches to medical cybernetics based on the creative heritage of Academician P.K. Anokhin. In:
Biozashchita i Biobezopastnost', IV, №1, 28-42 (in Russian).

Abstract: This article is devoted to hidden regularities of long nucleotide sequences. It
contains a description and a thematic application of a new research tool termed
“fractal genetic nets”. The described results testify in favor of the existence of new
Symmetry Principles of long nucleotide sequences in addition to the known Symmetry
Principle based on the generalized Chargaff’s second parity rule. Our results
provide new material for Chargaff's problem of a grammar of biology and for the
idea of an algebraic essence of the genetic coding system.
304 S.V. PETOUKHOV AND V.I. SVIRIN

Keywords: genetic code, Chargaff’s rule, long nucleotide sequence, grammar, fractal,
genetic nets.

1. ON A GRAMMAR OF BIOLOGY AND THE NOTION OF
FRACTAL GENETIC NETS

The remarkable successes of molecular genetics were due in particular to the discovery of
phenomenological facts of symmetry in the molecular constructions of the genetic code and to
a skillful use of these facts in theoretical modeling. A striking example is the
discovery of the symmetrological fact reflected in the famous Chargaff's first parity rule,
which says that in any double-stranded DNA segment, the quantities (or frequencies) of
adenine and thymine are equal, and so are the frequencies of cytosine and guanine
(Chargaff, 1950). This rule was used by Watson and Crick to support their famous
DNA double-helix structure model (Watson & Crick, 1953).

In his works, Chargaff pursued the goal of finding a grammar of biology that defines
the hidden regularities of genetic texts used to construct living cells with their “confounded
multi-dimensionality”, etc. One of his works, “Preface to a Grammar of Biology”
(Chargaff, 1971), which was devoted to a hundred years of nucleic acid research,
reflects his thought that all the achievements of molecular genetics are only the first steps
towards discovering such a grammar.

Besides his first parity rule for double-stranded DNA, Chargaff also perceived that the
parity rule approximately holds in a sufficiently long single-stranded DNA segment.
This latter rule is known as Chargaff’s second parity rule (CSPR), and it has been
confirmed in several organisms (Mitchell & Bridge, 2006). Originally, CSPR was meant to
be valid only for mononucleotide frequencies (that is, quantities of monoplets) in
single-stranded DNA. “But, it occurs that oligonucleotide frequencies follow a
generalized Chargaff’s second parity rule (GCSPR) where the frequency of an
oligonucleotide is approximately equal to its complement reverse oligonucleotide
frequency (Prabhu, 1993). This is known in the literature as the Symmetry Principle”
(Yamagishi, Herai, 2011, p. 2). The work of Prabhu (1993) demonstrates
the Symmetry Principle in long DNA sequences for complementary reverse
n-plets with n = 2, 3, 4, 5 at least. In scientific publications, long genetic sequences are
those that contain no fewer than 50,000 nucleotides (see for example
(Yamagishi, Herai, 2011)). “As correctly pointed out by Forsdyke, higher order
equifrequency does imply lower order, and he therefore conjectured that the original
CSPR was actually a particular case of a higher order parity rule” (Yamagishi, Herai,
2011). This Symmetry Principle has been studied or described in many other publications
(Albrecht-Buehler, 2006; Chargaff, 1971, 1975; Dong, Cuticchia, 2001; Forsdyke,
2002; Forsdyke, Bell, 2004; Kong, et al. 2009; Mitchell, Bridge, 2006; Sueoka, 1999).
Building on this Symmetry Principle, the work (Yamagishi, Herai, 2011) uncovered new
rules of long nucleotide sequences and emphasized a fractal-like property of such
sequences across a large set of genomes “since no matter of scale, the same pattern is
observed (self-similarity)”.
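
The generalized second parity rule compares the frequency of each oligonucleotide with that of its reverse complement within a single strand. The following minimal sketch shows one way such a comparison could be set up; the function names are hypothetical, the counting with an overlapping sliding window is an assumption (the cited works may count n-plets differently), and the short test string is a toy example, far below the length at which the rule is expected to hold.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo):
    """Complement every base and reverse the order, e.g. AAC -> GTT."""
    return oligo.translate(COMPLEMENT)[::-1]


def kmer_counts(sequence, k):
    """Count all k-plets of a single strand with an overlapping sliding window."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def parity_pairs(sequence, k):
    """Map each observed k-plet to (its count, the count of its reverse complement)."""
    counts = kmer_counts(sequence, k)
    return {kmer: (counts[kmer], counts[reverse_complement(kmer)]) for kmer in sorted(counts)}


toy_strand = "AACGTT"  # toy example only; real tests need very long sequences
for kmer, (own, partner) in parity_pairs(toy_strand, 3).items():
    print(kmer, own, reverse_complement(kmer), partner)
```

For a real genome one would compare the two counts of each pair as a ratio or relative difference rather than expect exact equality.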

Our previous research in the field of “matrix genetics” (Petoukhov, 2008a-d, 2011,
2012a,b,c; Petoukhov, He, 2009; Petoukhov, Svirin, 2012) has led to the hypothesis
that the structures of long nucleotide sequences of different organisms are connected with
so-called “fractal genetic nets” (FGN). In this work we propose a novel approach
to discovering new Symmetry Principles in such sequences. In the general case, each variant
of FGN is constructed by means of the author’s “method of a positional convolution (or
positional splitting) of long genetic sequences” to get a cluster of long sequences, each
of which is correspondingly shorter than the original sequence. In the particular case
considered in our article, the method consists in the positional convolution (or splitting) of
long sequences of triplets through the removal or retention of individual positions
(items) in each triplet.

1.1. Methodology

Let us explain the construction of FGN of various types using the example of the FGN for
sequences of triplets (Figure 1). In each triplet, the three positions are numbered 0, 1
and 2. At the first level of the convolution, an initial long sequence S0 of
triplets is transformed by a positional convolution into three new sequences of
nucleotides S1/0, S1/1, S1/2, each of which is 3 times shorter than the initial
sequence (the numerator of the index in this notation shows the level of the
convolution, and the denominator shows the triplet position used for the
convolution): the sequence S1/0 includes, one by one, all the nucleotides that are in the
initial position "0" of the triplets of the original sequence S0; the sequence S1/1 includes, one
by one, all the nucleotides that are in the middle position "1" of the triplets of the original
sequence S0; the sequence S1/2 includes, one by one, all the nucleotides that are in the last
position "2" of the triplets of the original sequence S0. At the final stage of the first level of
the positional convolution, each of the nucleotide sequences S1/0, S1/1, S1/2 is
represented as a sequence of triplets, where the three positions inside each triplet are
numbered again by 0, 1 and 2. To construct the second level of the convolution, each of
the sequences S1/0, S1/1, S1/2 is transformed by the same positional convolution
into three new sequences: S1/0 is convolved into S2/00, S2/01, S2/02; S1/1 into S2/10, S2/11, S2/12;
S1/2 into S2/20, S2/21, S2/22. The third and subsequent levels of the
convolution are constructed similarly, forming a multi-level net of nucleotide sequences called
"the fractal genetic net for the triplet convolution" or, briefly, "FGN-3" (Figure 1).

Figure 1: The scheme of the fractal genetic net (FGN-3) for a sequence of triplets
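
The positional convolution described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' software: the function names `positional_convolution` and `build_fgn`, the use of index strings such as "01" for S2/01, and the decision to drop an incomplete trailing n-plet are all assumptions made for the example.

```python
def positional_convolution(seq, n=3):
    """Split seq into n subsequences, one per position inside consecutive n-plets."""
    # Assumption: an incomplete n-plet at the end of the sequence is ignored.
    usable = len(seq) - len(seq) % n
    return [seq[k:usable:n] for k in range(n)]


def build_fgn(seq, levels, n=3):
    """Return a dict mapping index strings to sequences.

    "" stands for S0, "0" for S1/0, "01" for S2/01, and so on
    (single-digit position labels, so this sketch assumes n <= 10).
    """
    net = {"": seq}
    frontier = [""]
    for _ in range(levels):
        next_frontier = []
        for index in frontier:
            for k, child in enumerate(positional_convolution(net[index], n)):
                net[index + str(k)] = child
                next_frontier.append(index + str(k))
        frontier = next_frontier
    return net


# Toy example: the triplets CGA-TAA-AGC.
net = build_fgn("CGATAAAGC", levels=2)
print(net["0"], net["1"], net["2"])
```

With two levels the dictionary holds 1 + 3 + 9 = 13 sequences, the same count as the 13 sequences S0, S1/0, ..., S2/22 discussed in the text.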

This FGN possesses a fractal-like character if only the enumeration of positions is taken
into account: each of the long sequences of this FGN can be taken as an initial sequence to
form a similar genetic net on its basis (Figure 1). In the general case, an FGN can be built
not only for triplets, but also for other n-plets (n = 2, 4, 5, ...) or oligonucleotides by
means of a repeated positional convolution of each of the sequences from the previous level
into n sequences of the next level of the convolution. In this way one can build FGN-2,
FGN-3, FGN-4, FGN-5, etc. for n = 2, 3, 4, 5, … correspondingly. (Each of these FGN-2, FGN-3,
FGN-4, FGN-5, etc. is a tree, and together they form a net of separate trees; in a wide
sense, FGN is the complete set of such separate trees.) This article
concentrates only on the results related to the FGN-3.

2. ON SYMMETRY PRINCIPLES IN LONG NUCLEOTIDE SEQUENCES AND THE FGN-3

To test the author's hypothesis that the structures of long nucleotide sequences of different
organisms are connected with fractal genetic nets (first of all with FGN-3), we analyze
how the known Symmetry Principle for long nucleotide sequences is realized at
different levels of a positional convolution in the fractal genetic net for the triplet
convolution (FGN-3). In our article we use two different notions of complementary
oligonucleotides (or n-plets): 1) complementary oligonucleotides in the traditional sense
(for example, ACGTG and TGCAC are a pair of complementary oligonucleotides in the
traditional sense); 2) complementary reverse oligonucleotides (Prabhu, 1993), briefly
called CR-oligonucleotides or reverse complements (for example, ACGTG and CACGT
are a pair of CR-oligonucleotides). The mentioned Symmetry Principle has been
revealed for pairs of CR-oligonucleotides. Taking this into account, we began testing the
author’s hypothesis by analyzing the frequencies (or quantities) of all variants of
pairs of CR-oligonucleotides in long DNA-sequences of different organisms at different
levels of their FGN-3. We test the frequencies of n-plets in the FGN-3 with n = 1, 2, 3, 4, 5
only, because of our computer limitations, but we assume that the results described for
FGN-3 also hold true for n > 5. The initial nucleotide sequences for testing are taken from
(NCBI, 2012a). To test the proposed hypothesis, we use special software written by
V.I. Svirin in the Python programming language.
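
The two notions of complementarity used here can be illustrated with a short sketch. It is not taken from the authors' software; the helper names `complement` and `reverse_complement` are chosen for this example only, and the alphabet is assumed to be limited to A, C, G and T.

```python
# Watson-Crick pairing: A<->T, C<->G (assumption: only these four letters occur).
_PAIRING = str.maketrans("ACGT", "TGCA")


def complement(oligo):
    """Complementary oligonucleotide in the traditional sense."""
    return oligo.translate(_PAIRING)


def reverse_complement(oligo):
    """Complementary reverse (CR) oligonucleotide: complement read backwards."""
    return complement(oligo)[::-1]


# The two examples given in the text.
print(complement("ACGTG"))          # TGCAC
print(reverse_complement("ACGTG"))  # CACGT
```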

In our preliminary studies we have revealed the following: 1) the Symmetry Principle
for pairs of CR-oligonucleotides holds in each of the long nucleotide sequences at
different levels of the convolution in FGN-3 (for oligonucleotides or n-plets of
length n = 1, 2, 3, 4, 5 at least); 2) a series of new Symmetry
Principles exists in those initial long nucleotide sequences in which the known Symmetry
Principle for pairs of CR-oligonucleotides holds; 3) each of these new Symmetry
Principles holds for n-plets in each of the long nucleotide sequences at different
levels of the convolution in FGN-3 (n = 1, 2, 3, 4, 5 at least).

Let us take, for example, the long nucleotide sequence of the Mycoplasma crocodyli
MP145 chromosome, complete genome (NCBI Reference Sequence: NC_014014.1
(NCBI, 2012b)). This sequence contains 934379 nucleotides. Figure 2 shows
realisations of the mentioned Symmetry Principle (we shall call it the Symmetry
Principle №1) in the 13 sequences at the first three levels of convolution in the FGN-3
of this sequence. It displays the number of occurrences of 32 triplets (AAA, AAC,
AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, ATA, ATC, ATG, CAA,
CAC, CAG, CCA, CCC, CCG, CGA, CGC, CTA, CTC, GAA, GAC, GCA, GCC,
GGA, GTA, TAA, TCA) and their 32 CR-triplets (TTT, GTT, CTT, ATT, TGT, GGT,
CGT, AGT, TCT, GCT, CCT, TAT, GAT, CAT, TTG, GTG, CTG, TGG, GGG, CGG,
TCG, GCG, TAG, GAG, TTC, GTC, TGC, GGC, TCC, TAC, TTA, TGA) in the long
sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22 at the first
three levels of the FGN-3 (the limited length of the article does not allow other levels
of this FGN to be shown).

The straight line in each frame has slope 1 (it is the bisector of the coordinate angle).
Each dot in a frame represents one pair “triplet and CR-triplet”; its X coordinate shows
the number of occurrences (or the frequency) of the triplet, and its Y coordinate shows
the frequency of its CR-triplet on the same strand of the sequence. Each frame
contains all 32 pairs “triplet and its CR-triplet”. The dots cluster along the line of slope
1, demonstrating that the numbers of occurrences (or frequencies) of the two members of each
of the 32 pairs “triplet and its CR-triplet” are approximately equal in each of the sequences
at each of the levels of convolution in the FGN-3. This means that the Symmetry Principle
№1 holds for each of these sequences.
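
The coordinates of the dots in such a frame could be computed along the following lines. This is a hedged sketch rather than the authors' code: the names `count_nplets` and `cr_pair_points` are hypothetical, and the reading of each sequence as consecutive, non-overlapping n-plets starting from its first position is an assumption (it agrees with the description of sequences of triplets above).

```python
from collections import Counter
from itertools import product

_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo):
    return oligo.translate(_PAIRING)[::-1]


def count_nplets(seq, n=3):
    """Count consecutive, non-overlapping n-plets (assumed reading mode)."""
    usable = len(seq) - len(seq) % n  # ignore an incomplete trailing n-plet
    return Counter(seq[i:i + n] for i in range(0, usable, n))


def cr_pair_points(counts, n=3):
    """Return (oligo, cr_oligo, x, y) for every unordered pair of CR-oligonucleotides."""
    points = []
    for letters in product("ACGT", repeat=n):
        oligo = "".join(letters)
        partner = reverse_complement(oligo)
        if oligo < partner:  # keep each pair once; skips self-paired n-plets
            points.append((oligo, partner, counts.get(oligo, 0), counts.get(partner, 0)))
    return points


points = cr_pair_points(count_nplets("AAATTTAAACG"))
print(len(points), points[0])
```

For n = 3 this yields 32 pairs, beginning with AAA/TTT and ending with TCA/TGA, in the same order as the list of 32 triplets given in the text.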

Figure 2: Realizations of the Symmetry Principle №1 in the long sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02,
S2/10, S2/11, S2/12, S2/20, S2/21, S2/22 at the first three levels of the FGN-3 for Mycoplasma crocodyli MP145
chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). The initial
sequence S0 contains 934379 nucleotides.

These results show the effectiveness of the proposed fractal genetic nets as a research
tool, and they also testify in favour of the existence of the generalized
Symmetry Principle № 1: in long nucleotide sequences at different levels of
convolution in FGN-3, oligonucleotide frequencies follow a generalized Chargaff’s
second parity rule, in which the frequency of each oligonucleotide is approximately equal
to the frequency of its complementary reverse oligonucleotide.

Now let us present our research results that testify in favor of the existence of new
Symmetry Principles of long nucleotide sequences. Below, we formulate these new
Symmetry Principles directly and then provide some data confirming their existence.

The Symmetry Principle № 2 (concerning FGN): the frequency of each
oligonucleotide is approximately the same in all the long nucleotide sequences of each
of the levels of FGN-3.

Figure 3: Frequencies of the triplet ACG in 40 long nucleotide sequences S0, S1/0, S1/1, S1/2,
S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21, S2/22, ..., S3/221, S3/222 at the first four levels
of the FGN-3 of Mycoplasma crocodyli MP145 chromosome, complete genome (NCBI Reference
Sequence: NC_014014.1 (NCBI, 2012b)). Coordinate X shows the 40 sequences and coordinate Y
shows the corresponding frequencies of the triplet ACG in them.

Figure 3 shows, as an example, the frequencies of the triplet ACG in 40 long
nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12, S2/20, S2/21,
S2/22, ..., S3/221, S3/222 at the first four levels of the FGN-3 of the same initial sequence
as in Figure 2.

Figure 4 shows the frequencies of all 64 triplets in 12 long nucleotide
sequences at the first three levels of FGN-3 of the same initial sequence as in Figure 2.

S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12 S2/20 S2/21
AAA 19832 5786 5679 5768 1975 1944 1944 1986 1899 1952 1954 1935
AAC 6246 1709 1707 1643 550 560 567 587 543 557 504 531
AAG 7087 1859 1783 1940 607 619 651 607 630 611 615 679
AAT 15037 5320 5352 5428 1784 1685 1769 1770 1758 1743 1757 1775
ACA 5049 1527 1564 1635 542 513 546 492 566 521 537 517
ACC 2363 747 755 747 233 253 266 241 273 253 231 203
ACG 1029 660 663 684 214 205 197 175 210 208 228 203
ACT 4714 1713 1745 1702 526 548 553 536 568 544 552 537
AGA 5272 1784 1688 1737 657 635 640 631 600 598 691 644
AGC 2754 590 623 586 226 216 200 220 199 181 198 223
AGG 2150 973 880 912 293 294 322 292 259 287 314 288
AGT 4713 1700 1704 1820 530 595 543 513 492 523 562 549
ATA 11126 4952 5051 4886 1688 1629 1637 1671 1655 1659 1706 1635
ATC 5250 1619 1570 1568 507 537 511 518 527 583 531 522
ATG 5499 1450 1488 1505 494 547 582 510 529 538 551 514
ATT 15079 5397 5390 5419 1797 1802 1835 1816 1832 1799 1857 1745
CAA 7427 1620 1615 1661 526 538 568 565 502 508 540 561
CAC 1872 700 746 733 225 236 201 255 262 243 222 238
CAG 2105 553 653 605 212 206 211 205 203 221 197 203
CAT 5375 1473 1497 1388 547 546 519 548 497 539 543 507
CCA 2750 727 712 664 235 248 252 267 252 246 218 255
CCC 569 522 622 513 181 166 170 188 198 183 164 196
CCG 681 336 390 326 81 113 84 105 109 118 94 109
CCT 2181 887 945 885 292 277 293 323 302 307 281 292
CGA 1171 635 607 600 189 208 178 196 207 194 204 208
CGC 508 321 341 319 106 113 89 104 101 107 109 116
CGG 693 402 365 366 124 93 102 121 137 105 85 97
CGT 989 671 663 684 214 175 194 199 189 215 164 204
CTA 4326 1664 1659 1562 545 515 520 526 532 542 543 524
CTC 1786 832 893 841 310 306 295 289 306 316 283 312
CTG 2115 577 647 636 215 236 206 207 172 211 196 173
CTT 6917 1950 1913 1785 635 646 641 684 653 577 617 614
GAA 7190 1823 1801 1812 651 689 675 640 610 655 602 654
GAC 1404 585 598 611 208 204 178 195 223 208 189 201
GAG 1820 932 833 930 289 289 274 284 275 295 291 291
GAT 5225 1664 1563 1555 555 530 523 488 507 534 577 557
GCA 2974 572 580 631 228 195 197 241 210 196 196 198
GCC 710 276 353 288 97 91 110 102 92 118 106 99
GCG 497 377 321 361 113 97 121 109 108 109 103 104
GCT 2973 622 582 585 227 233 218 181 227 192 207 178
GGA 2330 958 875 888 283 302 296 297 272 316 308 318
GGC 676 286 275 283 115 110 100 112 112 108 93 104
GGG 616 551 546 555 185 200 193 187 170 200 215 174
GGT 2446 733 755 728 246 240 281 232 249 246 229 243
GTA 3636 1680 1606 1709 516 537 532 518 544 501 513 546
GTC 1374 601 577 578 199 198 218 194 217 226 202 203
GTG 2083 711 720 777 265 257 238 219 226 237 244 239
GTT 6587 1694 1736 1770 575 541 553 560 559 530 551 577
TAA 13334 5401 5418 5430 1732 1708 1678 1693 1760 1717 1762 1723
TAC 3369 1625 1685 1624 520 528 509 571 571 535 492 554
TAG 4368 1685 1596 1641 498 538 556 475 505 561 587 544
TAT 11019 4895 4950 4973 1635 1730 1667 1648 1672 1649 1711 1695
TCA 6993 1542 1679 1586 549 514 546 542 559 575 573 542
TCC 2302 872 904 854 291 276 293 340 353 277 295 316
TCG 1240 681 662 669 183 201 204 193 204 206 220 212
TCT 5710 1794 1866 1798 639 645 600 646 634 658 647 669
TGA 6550 1642 1533 1602 577 570 560 530 557 525 568 556
TGC 2790 616 610 611 190 233 201 219 219 213 205 191
TGG 2658 720 689 796 222 285 245 246 241 240 239 293
TGT 5123 1727 1583 1619 585 568 547 565 592 546 531 536
TTA 13519 5470 5525 5585 1755 1753 1760 1765 1793 1802 1733 1777
TTC 7131 1839 1864 1809 644 631 659 669 660 672 632 680
TTG 7918 1700 1745 1689 576 579 583 563 562 559 541 553
TTT 20219 5886 5876 5921 1997 1929 2004 2034 1960 2010 1995 1969

Figure 4: the table of frequencies of 64 triplets in long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02,
S2/10, S2/11, S2/12, S2/20, S2/21 at the first three levels of FGN-3 of Mycoplasma crocodyli MP145 chromosome,
complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)).
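
One simple way to quantify how closely the frequencies of a given triplet agree within one level of the FGN-3 is the largest relative deviation from the level mean. The sketch below applies this to the ACG row of Figure 4. The measure itself and the name `relative_spread` are assumptions made for this illustration; the article does not state which measure of closeness the authors use.

```python
def relative_spread(counts):
    """Largest absolute deviation from the mean, divided by the mean."""
    mean = sum(counts) / len(counts)
    return max(abs(c - mean) for c in counts) / mean


# ACG counts copied from the ACG row of Figure 4.
ACG_LEVEL_1 = [660, 663, 684]                           # S1/0, S1/1, S1/2
ACG_LEVEL_2 = [214, 205, 197, 175, 210, 208, 228, 203]  # S2/00 ... S2/21

for name, row in (("level 1", ACG_LEVEL_1), ("level 2", ACG_LEVEL_2)):
    print(name, f"{100 * relative_spread(row):.1f}%")
```

Raw counts are compared directly here because the sequences of one level have practically equal lengths; for sequences of different lengths the counts would first have to be divided by the number of triplets.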

The Symmetry Principle № 3: for each of the long nucleotide sequences at each level of
FGN-3 the following rules hold true: the sum of the frequencies of all the oligonucleotides
that begin with the letter A is approximately equal to the sum of the frequencies of all the
oligonucleotides that begin with the letter T; the sum of the frequencies of all the
oligonucleotides that begin with the letter C is approximately equal to the sum of the
frequencies of all the oligonucleotides that begin with the letter G.

In particular, these rules hold not only for long sequences at lower levels of FGN-3
but also for the initial long sequence S0. Figure 5 illustrates the Symmetry Principle
№ 3 using examples of n-plets (n=2, 3, 4, 5) in the sequences S0, S1/0,…, S2/22 at the first
levels in FGN-3 of the same sequence as in Figure 2.

The total frequencies of the sets of duplets in sequences of FGN-3:


S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12
F(A) 169757 56674 56197 56747 19065 18857 18836 18764 18756 18713
F(T) 171420 57022 57247 57179 18870 18937 19067 19102 19155 19040
F(C) 62531 20763 21486 20600 6929 6903 6882 7113 7148 7178
F(G) 63471 21265 20794 21198 7044 7211 7123 6929 6849 6977

The total frequencies of the sets of triplets in sequences of FGN-3:


S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12
F(A) 113200 37786 37642 37980 12623 12582 12763 12565 12540 12557
F(T) 114243 38095 38185 38207 12593 12688 12612 12699 12842 12745
F(C) 41465 13870 14268 13568 4637 4622 4523 4782 4622 4632
F(G) 42541 14065 13721 14061 4752 4713 4707 4559 4601 4671

The total frequencies of the sets of 4-plets in sequences of FGN-3:


S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12
F(A) 84955 28475 27999 28493 9522 9391 9274 9342 9417 9434
F(T) 85573 28439 28601 28639 9531 9532 9624 9552 9570 9538
F(C) 31189 10286 10758 10270 3409 3438 3499 3573 3578 3594
F(G) 31867 10662 10504 10460 3492 3593 3557 3487 3389 3388

The total frequencies of the sets of 5-plets in sequences of FGN-3:


S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12
F(A) 67729 22626 22512 22788 7551 7506 7542 7491 7417 7431
F(T) 68688 22951 22918 22764 7620 7613 7684 7626 7668 7677
F(C) 25144 8242 8503 8217 2780 2762 2701 2851 2907 2849
F(G) 25304 8470 8356 8520 2812 2882 2836 2795 2771 2806

Figure 5: The illustration of the Symmetry Principle № 3 in the case of Mycoplasma crocodyli MP145
chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). Here F(A), F(T),
F(C) and F(G) denote the sums of the frequencies of oligonucleotides (or n-plets) that begin with the letters A, T, C
or G correspondingly. The tables show that F(A) ≈ F(T) and F(C) ≈ F(G) for sets of n-plets (n=2, 3, 4, 5) in
each of the long nucleotide sequences S0, S1/0, S1/1, S1/2, S2/00, S2/01, S2/02, S2/10, S2/11, S2/12 at the first three levels of
FGN-3 of this sequence.
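
The totals F(A), F(T), F(C) and F(G) are sums of n-plet counts grouped by the first letter. A minimal sketch, with the hypothetical function name `first_letter_totals` and an input dictionary of counts keyed by n-plet (both assumptions of this example), could look like this:

```python
def first_letter_totals(counts):
    """Sum n-plet counts by the first letter of the n-plet."""
    totals = dict.fromkeys("ACGT", 0)
    for oligo, count in counts.items():
        totals[oligo[0]] += count
    return totals


# Counts of the sixteen triplets beginning with A in S0, copied from Figure 4.
S0_A_TRIPLETS = {
    "AAA": 19832, "AAC": 6246, "AAG": 7087, "AAT": 15037,
    "ACA": 5049, "ACC": 2363, "ACG": 1029, "ACT": 4714,
    "AGA": 5272, "AGC": 2754, "AGG": 2150, "AGT": 4713,
    "ATA": 11126, "ATC": 5250, "ATG": 5499, "ATT": 15079,
}
print(first_letter_totals(S0_A_TRIPLETS)["A"])
```

Applied to these sixteen counts, the sum is 113200, which agrees with the value of F(A) for triplets in S0 given in Figure 5.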

This result was obtained in connection with studies of the genetic matrices
[C T; A G](n) (see (Petoukhov, 2012c) in this issue). Each of the 4 quadrants of such genomatrices
contains all the oligonucleotides that begin with one of the 4 letters C, T, A or G. In these
genomatrices, each oligonucleotide and its complementary oligonucleotide are located
inverse-symmetrically relative to the center of the corresponding matrix. In accordance with
the Symmetry Principle № 3, the total frequencies of oligonucleotides in the two quadrants
along the main diagonal of these genomatrices are approximately equal to each other
(F(C) ≈ F(G)); the total frequencies of oligonucleotides in the two quadrants along the
second diagonal of these genomatrices are also approximately equal to each other (F(A) ≈
F(T)).

An additional illustration of the Symmetry Principle № 3 is obtained from the initial
data about the frequencies of separate triplets in the whole human genome from the work
(Perez, 2010). This genome contains 2843411612 triplets. Figure 6 shows the total
frequencies FC, FG, FA and FT of the sets of triplets that begin with one of the four letters C, G,
A or T. The percentage difference between the total frequencies FC and FG is equal to
0.05%, and that between FA and FT is equal to 0.16%. However, we do not have data about the
quantities of separate triplets in the convolved sequences S1/0, S1/1,… at the lower levels of
FGN-3 for this genome, because we have no information about the order of triplets in its
huge sequence S0.

The total frequencies of the sets of triplets, which begin with C:
FC = F(CCC+CCT+CCA+CCG+CTC+CTT+CTA+CTG+CAC+CAT+CAA+CAG+CGC+CGT+CGA+CGG) = 581026275

The total frequencies of the sets of triplets, which begin with G:
FG = F(GGG+GGA+GGT+GGC+GAG+GAA+GAT+GAC+GTG+GTA+GTT+GTC+GCG+GCA+GCT+GCC) = 581343106

The total frequencies of the sets of triplets, which begin with A:
FA = F(ACC+ACT+ACA+ACG+ATC+ATT+ATA+ATG+AAC+AAT+AAA+AAG+AGC+AGT+AGA+AGG) = 839827642

The total frequencies of the sets of triplets, which begin with T:
FT = F(TGG+TGA+TGT+TGC+TAG+TAA+TAT+TAC+TTG+TTA+TTT+TTC+TCG+TCA+TCT+TCC) = 841214589

Figure 6: the approximate equality of the total frequencies of sets of triplets that begin with letters C and G
(upper table) and with letters A and T (bottom table) in the case of the sequence S0 of the whole human
genome. Initial data about frequencies of separate triplets are taken from the work (Perez, 2010).
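
The percentage differences quoted for the human genome can be reproduced from the four totals of Figure 6. The article does not say which value the difference is divided by; the sketch below assumes normalisation by the larger of the two totals, and the function name `percent_difference` is hypothetical.

```python
def percent_difference(a, b):
    """Absolute difference as a percentage of the larger value (assumed convention)."""
    return 100.0 * abs(a - b) / max(a, b)


# Totals copied from Figure 6.
FC, FG = 581026275, 581343106
FA, FT = 839827642, 841214589

print(f"FC vs FG: {percent_difference(FC, FG):.3f}%")
print(f"FA vs FT: {percent_difference(FA, FT):.3f}%")
```

Under this convention the two results are close to the values of 0.05% and 0.16% given in the text.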

Now let us introduce the Symmetry Principle № 4, which deals with reading frame
shifts, deletion mutations, and also positional permutations in oligonucleotides.
For those DNA sequences (including the sequence shown in Figures 2-5)
that have been tested to date in our laboratory, we have discovered the following
phenomenological facts (this study is continuing for a wide list of DNA-sequences
of different organisms and organelles):

- a transformation of long nucleotide sequences by means of a reading frame shift
preserves the implementations of all the described Symmetry Principles in the new long
nucleotide sequences (in our tests, a reading frame shift means that the reading of the
sequence begins not with its first position but with one of the subsequent positions;
the skipped fragment of the sequence can be moved to the end of the sequence, and in
this case a reading frame shift leads to a simple change in the order of all the sequences at each
of the lower levels of FGN);

- a transformation of long nucleotide sequences by means of a deletion mutation (when
short parts of them are missing) preserves the implementations of all the described Symmetry
Principles in the new long nucleotide sequences.

The question of positional permutations in oligonucleotides should be considered
separately. The theory of noise-immunity coding pays special attention to
permutations of the elements of transmitted signals. For different n-plets,
different numbers of variants of positional permutations exist:

for duplets two variants of positional permutations exist (1-2 and 2-1);

for triplets six variants of positional permutations exist (1-2-3, 2-3-1, 3-1-2, 3-2-1,
2-1-3, 1-3-2);

for 4-plets 24 variants of positional permutations exist (1-2-3-4, 2-3-4-1, …..);

for 5-plets 120 variants of positional permutations exist (1-2-3-4-5, 2-3-4-5-1, …..).
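
These numbers are the factorials n! = 2, 6, 24, 120 for n = 2, 3, 4, 5. As a small illustration (the helper name `positional_orders` is hypothetical), the variants can be enumerated with the Python standard library:

```python
from itertools import permutations


def positional_orders(n):
    """All orders of the positions 1..n, written in the style "2-3-1"."""
    return ["-".join(map(str, order)) for order in permutations(range(1, n + 1))]


for n in (2, 3, 4, 5):
    print(n, len(positional_orders(n)))
print(positional_orders(3))
```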

It is also evident that if a long nucleotide sequence is interpreted as a sequence of
oligonucleotides of a certain type (duplets, or triplets, or 4-plets, or 5-plets, …), and one
of the possible positional permutations is made simultaneously inside all of its
oligonucleotides, then a quite new long nucleotide sequence appears (we call such
simultaneous positional permutations inside all oligonucleotides of a certain type
“collective positional permutations” inside these oligonucleotides). For example, if we
initially have a sequence of triplets CGA-TAA-AGC-GTC-TAG-CGC-ATC-…, then
after changing the positional order from the initial order 1-2-3 to the new order 2-3-1
inside each of the triplets, we obtain the quite different sequence GAC-AAT-GCA-TCG-
AGT-GCC-TCA-… . But our studies of a wide set of long nucleotide sequences
(including the sequence in Figures 2-5) have demonstrated that the FGN-3 for such a new long
nucleotide sequence obeys the same Symmetry Principles №№ 1-3 described
above. These results attest to the possible existence of the Symmetry Principle № 4, which
can be briefly formulated as:

The Symmetry Principle № 4:

- reading frame shifts and deletion mutations in long nucleotide sequences, and also
collective positional permutations inside their oligonucleotides don't essentially violate
implementations of all Symmetry Principles for long nucleotide sequences and their
fractal genetic net (FGN-3).
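
The two transformations named in this principle can be sketched as follows. The function names `collective_permutation` and `frame_shift` are hypothetical, and the sketch is not the authors' implementation; the frame shift is modelled in the cyclic form mentioned above, with the skipped fragment moved to the end of the sequence.

```python
def collective_permutation(seq, order, n=3):
    """Apply the same positional order (1-based, e.g. (2, 3, 1)) inside every n-plet."""
    usable = len(seq) - len(seq) % n  # assumption: an incomplete trailing n-plet is dropped
    pieces = []
    for i in range(0, usable, n):
        nplet = seq[i:i + n]
        pieces.append("".join(nplet[p - 1] for p in order))
    return "".join(pieces)


def frame_shift(seq, k):
    """Start reading at position k and move the skipped fragment to the end."""
    if not seq:
        return seq
    k %= len(seq)
    return seq[k:] + seq[:k]


# The example from the text: CGA-TAA-AGC-GTC-TAG-CGC-ATC with the order 2-3-1.
print(collective_permutation("CGATAAAGCGTCTAGCGCATC", (2, 3, 1)))
# GACAATGCATCGAGTGCCTCA
```

The transformed sequence can then be analysed in the same way as the original one, level by level of its FGN-3.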

The final part of this article illustrates an additional application of the FGN-3 approach
to study hidden regularities of long nucleotide sequences from the point of view of the
black-and-white mosaics of the genetic matrix [C T; A G](3) from the article
(Petoukhov, 2012c, Figure 20). Figure 7 shows this matrix, which reflects
phenomenological properties of the genetic coding system and which contains the
complete set of 64 triplets in a strict order. The mosaic of this matrix is identical to the
mosaic of one of the Hadamard (8*8)-matrices that are widely used in noise-immunity
coding (for example, codes based on Hadamard matrices were used on the spacecraft
“Mariner” and “Voyager”, which made it possible to obtain high-quality photos of Mars,
Jupiter, Saturn, Uranus and Neptune in spite of the distortion and weakening of the
incoming signals; Hadamard matrices are also used in quantum computing, which relies
on Hadamard gates, etc.). In addition, this Hadamard representation of the genetic
matrix [C T; A G](3) is the Hamilton biquaternion with unit coordinates (see details
in the article (Petoukhov, 2012c)). A possible connection between the black-and-white
mosaic of this genetic matrix and hidden regularities of long nucleotide sequences is of
special interest. Below we present our initial results of studying this connection.

CCC CCT CTC CTT TCC TCT TTC TTT
CCA CCG CTA CTG TCA TCG TTA TTG
CAC CAT CGC CGT TAC TAT TGC TGT
CAA CAG CGA CGG TAA TAG TGA TGG
ACC ACT ATC ATT GCC GCT GTC GTT
ACA ACG ATA ATG GCA GCG GTA GTG
AAC AAT AGC AGT GAC GAT GGC GGT
AAA AAG AGA AGG GAA GAG GGA GGG

Figure 7: the genetic matrix [C T; A G](3) of 64 triplets with its black-and-white mosaic, which reflects
phenomenological properties of the genetic coding system (from the work (Petoukhov, 2012c)).
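
The arrangement of the 64 triplets in Figure 7 coincides with what one obtains by taking the third Kronecker power of the symbolic matrix [C T; A G], with multiplication of entries replaced by concatenation of letters. The sketch below reproduces only this arrangement; the function name `kronecker_power` is hypothetical, and the black-and-white mosaic is not reconstructed because its colouring is not given in the text.

```python
def kronecker_power(base, power):
    """Kronecker power of a symbolic matrix, with string concatenation as the product."""
    matrix = [[""]]
    for _ in range(power):
        matrix = [
            [left + right for left in row_left for right in row_base]
            for row_left in matrix
            for row_base in base
        ]
    return matrix


BASE = [["C", "T"],
        ["A", "G"]]

for row in kronecker_power(BASE, 3):
    print(" ".join(row))
```

The first printed row is CCC CCT CTC CTT TCC TCT TTC TTT and the last is AAA AAG AGA AGG GAA GAG GGA GGG, as in Figure 7.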

The matrix [C T; A G](3) in Figure 7 contains two subsets, with 28 kinds of white
triplets and 36 kinds of black triplets. The authors calculate the total quantities (frequencies
FWHITE and FBLACK) of members of these two subsets in long nucleotide sequences. For
example, we calculated the total frequencies for the whole human genome, which
contains the huge number 2843411612 (about three billion) of triplets. The initial data
about this genome, taken from the article (Perez, 2010), are shown in Figure 8. The
frequencies of different triplets in this genome are very different. For example, the
frequency of the triplet CGA is equal to 6251611 and the frequency of the triplet TTT is
equal to 109591342; they differ by a factor of approximately 18. But our
calculation shows that in this genome the percentage difference between FWHITE and
FBLACK is approximately equal to 0.1%, because the total quantity FWHITE of white
triplets is equal to 1422456641 and the total quantity FBLACK of black triplets is equal to
1420954971.
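
The arithmetic behind these statements can be checked directly from the quoted numbers. In this short sketch the variable names are chosen for illustration, the TTT count is the one listed in Figure 8, and the percentage is taken relative to the larger total, which is an assumption.

```python
F_WHITE = 1422456641
F_BLACK = 1420954971
TOTAL_TRIPLETS = 2843411612

CGA_COUNT = 6251611    # from Figure 8
TTT_COUNT = 109591342  # from Figure 8

# The two subsets together cover the whole set of triplets.
print(F_WHITE + F_BLACK == TOTAL_TRIPLETS)

# Percentage difference between the totals (relative to the larger one).
print(f"{100.0 * abs(F_WHITE - F_BLACK) / max(F_WHITE, F_BLACK):.2f}%")

# Ratio between the most and least frequent of the two quoted triplets.
print(f"{TTT_COUNT / CGA_COUNT:.1f}")
```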

TRIPLET FREQUENCY TRIPLET FREQUENCY TRIPLET FREQUENCY TRIPLET FREQUENCY
AAA 109143641 CAA 53776608 GAA 56018645 TAA 59167883
AAC 41380831 CAC 42634617 GAC 26820898 TAC 32272009
AAG 56701727 CAG 57544367 GAG 47821818 TAG 36718434
AAT 70880610 CAT 52236743 GAT 37990593 TAT 58718182
ACA 57234565 CCA 52352507 GCA 40907730 TCA 55697529
ACC 33024323 CCC 37290873 GCC 33788267 TCC 43850042
ACG 7117535 CCG 7815619 GCG 6744112 TCG 6265386
ACT 45731927 CCT 50494519 GCT 39746348 TCT 62964984
AGA 62837294 CGA 6251611 GGA 43853584 TGA 55709222
AGC 39724813 CGC 6737724 GGC 33774033 TGC 40949883
AGG 50430220 CGG 7815677 GGG 37333942 TGG 52453369
AGT 45794017 CGT 7137644 GGT 33071650 TGT 57468177
ATA 58649060 CTA 36671812 GTA 32292235 TTA 59263408
ATC 37952376 CTC 47838959 GTC 26866216 TTC 56120623
ATG 52222957 CTG 57598215 GTG 42755364 TTG 54004116
ATT 71001746 CTT 56828780 GTT 41557671 TTT 109591342

Figure 8: quantities of repetitions of each triplet in the whole human genome (from [Perez, 2010])

Similar results about the approximate equality of FWHITE and FBLACK were obtained for all
811 long fragments of the human genome studied by one of the authors, V. Svirin, in his
student thesis; he pioneered this comparative analysis of FWHITE and
FBLACK in long nucleotide sequences from the point of view of the phenomenological
genomatrix shown in Figure 7.

What conclusion can be drawn from applying the FGN-3 method to the study of
the total quantities FWHITE and FBLACK in long nucleotide sequences? Figure 9 shows
typical results of the comparative analysis of FWHITE and FBLACK in sequences at
different levels of the FGN-3 for the same initial sequence S0 of the Mycoplasma crocodyli
MP145 chromosome.

S0 S1/0 S1/1 S1/2 S2/00 S2/01 S2/02 S2/10 S2/11 S2/12
FWHITE % 52 49 49 49 49 49 50 50 49 49
FBLACK % 48 51 51 51 51 51 50 50 51 51

S2/20 S2/21 S2/22 S3/000 S3/001 S3/002 S3/010 S3/011 S3/012 S3/020
FWHITE % 49 49 49 50 49 50 49 50 50 49
FBLACK % 51 51 51 50 51 50 51 50 50 51

S3/021 S3/022 S3/100 S3/101 S3/102 S3/110 S3/111 S3/112 S3/120 S3/121
FWHITE % 49 50 50 50 50 49 50 49 50 50
FBLACK % 51 50 50 50 50 51 50 51 50 50

S3/122 S3/200 S3/201 S3/202 S3/210 S3/211 S3/212 S3/220 S3/221 S3/222
FWHITE % 49 50 49 50 50 49 49 49 50 49
FBLACK % 51 50 51 50 50 51 51 51 50 51

Figure 9: percentage of frequencies FWHITE and FBLACK of white and black triplets (from Figure 7) in long
sequences S0, S1/0, ..., S3/222 in the first four levels of the FGN-3 for the Mycoplasma crocodyli MP145
chromosome, complete genome (NCBI Reference Sequence: NC_014014.1 (NCBI, 2012b)). The sequence S0
contains 934379 nucleotides.

From Figure 9, one can observe the fact of approximate equality of total quantities of
white and black triplets in all these sequences S0, S1/0, ..., S3/222.
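
Percentages of the kind shown in Figure 9 could be computed with a helper such as the one below. It is a hedged sketch only: the name `mosaic_shares` is hypothetical, and the set of white triplets must be supplied by the caller from the mosaic of Figure 7, which is not reproduced in this text. The small set used in the example is purely a placeholder and is not the real mosaic.

```python
def mosaic_shares(counts, white_triplets):
    """Return (white %, black %) of the total triplet count.

    counts: dict mapping triplet -> number of occurrences in one sequence.
    white_triplets: set of triplets coloured white in the genetic matrix
    (to be taken from Figure 7; not defined here).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no triplets counted")
    white = sum(c for triplet, c in counts.items() if triplet in white_triplets)
    white_share = 100.0 * white / total
    return white_share, 100.0 - white_share


# Placeholder data for illustration only.
toy_counts = {"AAA": 3, "CCC": 1}
toy_white = {"AAA"}
print(mosaic_shares(toy_counts, toy_white))
```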

It appears that the described FGN-3 and the fractal-like properties of long genetic
sequences that are related to the invariance of these Symmetry Principles have a
biological value (a biological sense) associated with mutational changes of such
sequences and with the evolutionary creation of new types of DNA-sequences. The authors
presume that mechanisms of biological evolution use these permutational and other
described properties of long nucleotide sequences in producing new biological
organisms and organelles. For instance, new DNA sequences could be constructed in the
course of the biological evolution of organisms by means of combinatorics of nucleotide
sequences from different levels of FGN (including genetic crossing among long
nucleotide sequences from different levels of FGN, by analogy with well-known
examples of genetic crossing). One should note here that the question of the
permutation properties of DNA-sequences is very important because some biological
organisms differ from each other only by permutations in their DNA sequences (see, for
example, the book (Pevzner, 2000)). The proposed FGN method is a new
approach that the authors consider effective and useful in bioinformatics, molecular genetics, and
evolutionary biology. It generates new data in the field of symmetrology (Darvas, 2007;
Cristea, 2005, etc.).

In addition, one can mention fractal images in genetic systems. A number of
publications are devoted to fractal features of genetic texts (Gusev et al, 2009; Jeffrey,
1990; Pellionisz et al, 2012a; Petoukhov, 2008b; Petoukhov, He, 2009; Yam, 1995,
etc.). Interesting data about fractal approaches in genetics, including materials about an
important connection of fractal defects with cancer, are presented at the website
(Pellionisz, 2012b). Research in this direction continues all over the world. In this
article, the authors propose Fractal Genetic Nets (FGN) as a new tool for studying
fractal-like properties of long DNA sequences, and describe new fractal-like properties of
such nucleotide sequences. We believe that these FGN and the fractal-like properties of
long nucleotide sequences can lead to new principles and systems in the fields of signal
processing, image recognition and artificial intelligence. The list of such scientific tools
also includes genetic algorithms, which have been developed intensively during recent
decades (see, for example, (Goldberg, Korb, Deb, 1989; Forrest, Mitchell, 1991)). Our
findings described here add to the evidence for the idea of an algebraic essence
of the genetic coding system (Petoukhov, 2008a-d, 2011, 2012a,b,c; Petoukhov, He,
2009).

We plan to publish in the near future other results of our studies of FGN and the
Symmetry Principles for a wide list of long DNA sequences of different organelles
and organisms from different taxonomic classes. These results would require a large
volume for their publication and, therefore, are not included in the limited volume of
this article.

3. DISCUSSION

The genetic coding system possesses impressive noise-immunity properties. Modern
technology of noise-immunity coding is based on matrix presentations of discrete
signals. This technology allows, for example, the noise-immune transfer of photos of the
surface of Mars across millions of kilometers of noisy space, so that
high-quality photos are received on Earth. The authors are studying hidden regularities
of the genetic coding system by means of known matrix methods from this
communication technology. As a result, a special scientific direction called “matrix
genetics” has been developing in recent years (Petoukhov, 2008a-d, 2011, 2012a,b,c;
Petoukhov, He, 2009; Petoukhov, Svirin, 2012). The results described in our article are
closely connected with many other results of this direction of research, in which
many connections have been revealed between the genetic coding system and
the mathematics of discrete signal processing, including noise-immunity coding. In
particular, the list of relevant mathematical formalisms includes Hadamard matrices,
orthogonal systems of Walsh functions and Rademacher functions, Kronecker families
of matrices, dyadic-shift matrices and dyadic-shift decompositions of matrices,
hypercomplex numbers (including Hamilton quaternions and bi-quaternions), new matrix
presentations of complex numbers and split-complex numbers, algebras of projective
FRACTAL GENETIC NETS IN NUCLEOTIDE SEQUENCES 319

operators, etc. Without these algebraic results, we couldn’t offer fractal genetic nets as a
new tool for genetic analyses and we couldn’t receive the described phenomenological
data about the proposed symmetry principles in long nucleotide sequences. Our matrix
approach to the genetic system gives opportunity to receive data in favor of existence of
not only symmetry principles proposed above but also some other symmetry principles
that will be published in the nearest future.

Modern science knows that deep knowledge of the phenomenological symmetry relations
among separate parts of a complex natural system can reveal much about the evolution
and mechanisms of such systems. The great successes of molecular genetics were due in
part to the discovery of phenomenological facts of symmetry in the molecular
constructions of the genetic code and to the skilful use of these facts in theoretical
modeling. A striking example is the symmetrological fact, expressed in Chargaff's first
rule, that the quantities of nitrogenous bases are equal within their complementary
pairs (adenine-thymine and cytosine-guanine) in DNA molecules of different organisms.
This phenomenological rule was skilfully used by F. Crick and J. Watson, together with
additional symmetrological principles, in their theoretical modeling of the DNA double
helix.

Biological organisms belong to a category of very complex natural systems, comprising a
huge number of biological species with inherited properties. Surprisingly, however,
molecular genetics has discovered that all organisms are identical in their basic
molecular-genetic structures. This revolutionary discovery brought about a great
unification of all biological organisms in science. The information-genetic line of
investigation has become one of the most promising lines not only in biology but in
science as a whole. The more science studies living matter, the more facts of
unification are discovered in other physiological systems (metabolic biosystems, energy
biosystems, etc.). The search for unification principles in living matter is an
important direction of modern science, and the materials of our article belong to this
direction.

Modern science recognizes the key role of information principles in the inherited
self-organization of living matter. Modern informatics is an independent branch of
science, which possesses its own language and mathematical formalisms and exists
alongside physics, chemistry and other scientific branches. The problem of the
informational evolution of living matter has been investigated intensively in recent
decades, in addition to the classical problem of biochemical evolution. Not only
physics and chemistry deal with the principles and methods of symmetry; informatics and
digital signal processing also pay great attention to them. How is the theory of signal
processing connected to geometry and geometrical symmetries? Signals are represented
there as sequences of the numeric values of their amplitude at reference points. The
theory of signal processing is based on the interpretation of discrete signals as
vectors in multi-dimensional spaces. At each time step, the signal value is interpreted
as the value of the corresponding coordinate in a multi-dimensional vector space of
signals. In this way, the theory of discrete signals turns out to be the science of the
geometries of multi-dimensional spaces, where different multi-dimensional numeric
systems can be useful. The number of dimensions of such a space is equal to the number
of reference points of the signal. Metric notions and all other necessary tools are
introduced in these multi-dimensional vector spaces to address particular problems of
reliability, speed and economy of signal information. On this geometrical basis, many
methods and algorithms for the recognition of signals and images, information coding,
detection and correction of information errors, artificial intelligence and robot
training are constructed. One can add here the importance of symmetries in permutations
of components for coding signals, in spectral analysis of signals, in orthogonal and
other transformations of signals, and so on. Investigation of symmetrical and
structural analogies between computer informatics and genetic informatics is also
needed for the creation of DNA computers and DNA robotics, for the so-called “genetic
algorithms” that are widely used in modern engineering, etc. The authors hope that the
symmetry principles proposed in this article will be useful not only for fundamental
knowledge but also for technological applications.

Chargaff's thoughts and dreams about disclosing a grammar of biology on the basis of a
symmetrological analysis of hidden regularities of DNA remain valid; they define an
important area of research that is additionally supported by this article.

Acknowledgments: The described research was conducted within the framework of long-term
cooperation between the Russian and Hungarian Academies of Sciences. The authors are
grateful to G. Darvas, M. He, A. Pellionisz and I. Stepanyan for their support. Some
results of this paper were made possible by the Russian State scientific contract P377
of July 30, 2009.

REFERENCES

Albrecht-Buehler, G. (2006) Asymptotically increasing compliance of genomes with Chargaff's second parity
rules through inversions and inverted transpositions. Proceedings of the National Academy of Sciences,
November 21, 103(47), 17828–17833.
Bell, S. J., Forsdyke, D. R. (1999) Deviations from Chargaff's Second Parity Rule Correlate with Direction of
Transcription, Journal of Theoretical Biology, 197, 63-76
Chargaff, E. (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation.
Experientia, 6, 201
Chargaff, E. (1971) Preface to a Grammar of Biology: A hundred years of nucleic acid research, Science, 172,
637-642,
http://www.sciencemag.org/content/172/3984/637.full.pdf?ijkey=99298aa2ffc516d64de947c301cfa5f6
a56d3c08&keytype2=tf_ipsecsha
Chargaff, E. (1975) A fever of reason, Annual Review of Biochemistry, 44, 1-20
Cristea, P.D. (2005) Representation and analysis of DNA sequences. Genomic Signal Processing and
Statistics, Chapter 1, E. Daugherty et al. Eds., Hindawi Publishing Corp., pp. 15–65.
Darvas, G. (2007) Symmetry. Basel: Birkhäuser, xi + 508 pp.
Dong, Q., Cuticchia, A.J. (2001) Compositional symmetries in complete genomes. Bioinformatics, 17, 557-
559.
Forrest, S., Mitchell, M. (1991) The performance of genetic algorithms on Walsh polynomials: Some
anomalous results and their explanation. – In R.K.Belew and L.B.Booker, editors, Proceedings of the
Fourth International Conference on Genetic Algorithms, pp.182-189. Morgan Kaufmann, San Mateo,
CA.
Forsdyke, D. R. (2002) Symmetry observations in long nucleotide sequences: a commentary on the Discovery
Note of Qi and Cuticchia. Bioinformatics letter, v. 18, 1, 215-217.
Forsdyke, D. R., Bell, S. J. (2004) A discussion of the application of elementary principles to early chemical
observations. Applied Bioinformatics, 3, 3-8.
Goldberg, D.E., Korb B., Deb K. (1989) Messy genetic algorithms: Motivation, analysis, and first results.
Complex systems, 1989, 3(5), 493-530.
Gusev, V., Miroshnichenko, L., Chuzhanova, N. (2009). Detection of fractal structures in the DNA sequences.
- International Book Series “Information Science and Computing”, Book 8, Classification, Forecasting,
Data Mining, p.117-124, in Russian (Supplement to International Journal "Information Technologies
and Knowledge", v. 3) http://www.foibg.com/ibs_isc/ibs-08/ibs-08-p17.pdf
Jeffrey, H.J. (1990) Chaos game representation of gene structure. Nucleic Acids Research, v.18, 8, 2163-2170
Kong, S-G, Fan W-L, Chen, H-D, Hsu, Z-T, Zhou, N, et al. (2009) Inverse Symmetry in Complete Genomes
and Whole-Genome Inverse Duplication, PLoS ONE 4(11): e7553. doi:10.1371/journal.pone.0007553
Mitchell, D., Bridge, R. (2006) A test of Chargaff's second rule. Biochemical and Biophysical Research
Communications, 340(1): 90-94, http://www.ncbi.nlm.nih.gov/pubmed/16364245 .
NCBI. (2012a). http://www.ncbi.nlm.nih.gov/.
NCBI. (2012b). http://www.ncbi.nlm.nih.gov/nuccore/294155300.
Pellionisz, A.J, Graham, R., Pellionisz, P.A., Perez, J.C. (2012a) Recursive Genome Function of the
Cerebellum: Geometric Unification of Neuroscience and Genomics. In: Springer Handbook "The
Cerebellum" pp. 1381-1423 M. Manto, D.L. Gruol, J.D. Schmahmann, N. Koibuchi, F. Rossi (eds.),
Handbook of the Cerebellum and Cerebellar Disorders, Submitted October 20, Accepted November 1,
2011.DOI 10.1007/978-94-007-1333-8_61, #Springer Science+Business Media Dordrecht 2012 (full
text in http://fr.scribd.com/doc/111439455/BOOK-Unification- of-Neuroscience-and-Genomics-
Pellionisz-Et-Al-in-Section-4-Springer-the-Cerebellum-Handbook-2012).
Pellionisz, A.J. (2012b). http://www.junkdna.com/the_genome_is_fractal.html.
Perez, J.-C. (2010). Codon populations in single-stranded whole human genome DNA are fractal and fine-
tuned by the golden ratio 1.618. Interdisciplinary Sciences Computational Life Sciences, 2, 1–13.
http://www.ncbi.nlm.nih.gov/pubmed/20658335 , full text in:
(http://fr.scribd.com/doc/95641538/Codon-Populations-in-Single-stranded-Whole-Human-Genome-
DNA-Are-Fractal-and-Fine-tuned-by-the- Golden-Ratio-1-618 ).
Petoukhov, S.V. (2008a) The degeneracy of the genetic code and Hadamard matrices. arXiv:0802.3366 [q-
bio.QM].
Petoukhov, S.V. (2008b) Matrix genetics, algebras of the genetic code, noise immunity. Moscow: RCD,
316 p. (in Russian).
Petoukhov, S.V. (2008c) Matrix genetics, part 1: Permutations of positions in triplets and symmetries of
genetic matrices, http://arxiv.org/abs/0803.0888, 6th version, 1-34.
Petoukhov, S.V. (2008d) Matrix genetics, part 3: the evolution of the genetic code from the viewpoint of the
genetic octave Yin-Yang-algebra. arXiv:0805.4692[q-bio.QM].
Petoukhov, S.V. (2011) Hypercomplex numbers and the algebraic system of genetic alphabets. Elements of
algebraic biology. Hypercomplex numbers in geometry and physics, v. 8, 2(16), 118-139
(Gipercompleksnyie chisla v geometrii i fizike, in Russian)
Petoukhov, S.V. (2012a) The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. (7th
version from January, 30, 2012), http://arxiv.org/abs/1102.3596
Petoukhov, S.V. (2012b) On fractal structure of long nucleotide sequences. Joint scientific journal
(Ob’edinennyi nauchnyi journal), # 6-7, 50 (in Russian)
Petoukhov, S.V. (2012c) Symmetries of the genetic code, hypercomplex numbers and genetic matrices with
internal complementarities. Symmetry: Culture and Science, in this issue
Petoukhov, S.V., He, M. (2009) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics:
Advanced Patterns and Applications. Hershey, USA: IGI Global. 271 p.
Petoukhov, S.V., Svirin, V.I. (2012) Fractal genetic nets and the rules of long genetic sequences. Joint
scientific journal (Ob’edinennyi nauchnyi journal), # 8-9, 50-52 (in Russian)
Pevzner, P.A. (2000) Computational molecular biology. An algorithmic approach. – Cambridge,
Massachusetts: MIT Press.
Prabhu, V. V. (1993) Symmetry observation in long nucleotide sequences. Nucleic Acids Research, 21, 2797-
2800.
Sueoka, N. (1999) Two aspects of DNA base composition: G + C content and translation-coupled deviation
from intra-strand rule of A = T and G = C. – Journal of Molecular Evolution, 49, 49–62
Yam, Ph. (1995). Talking trash (Linguistic patterns show up in junk DNA). – Scientific American, 272(3),
12-15.
Yamagishi, M.E.B., Herai, R.H. (2011) Chargaff’s “Grammar of Biology”: New Fractal-like Rules.
arXiv:1112.1528v1 from 07.12.2011
Watson, J. D., Crick, F. H. C. (1953) Molecular Structure of Nucleic Acids. Nature, 4356, 737.
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 323-342, 2012

A MARKOV INFORMATION SOURCE FOR THE
SYNTACTIC CHARACTERIZATION OF AMINO ACID
SUBSTITUTIONS IN PROTEIN EVOLUTION

Miguel A. Jiménez-Montaño

Biophysicist (b. México, D.F., Mexico, 1941).
Address: Faculty of Physics and Artificial Intelligence, University of Veracruz, Sebastián Camacho # 5, Col.
Centro, C.P. 91000, Xalapa, Ver., México. E-mail: ajimenez@uv.mx.
Fields of interest: The structure of the genetic code, technological evolution and informational measures and
algorithmic complexity of sequences of symbols and nerve signals.
Awards: Research Award, 1989; Dean Award, 2004; both from Universidad Veracruzana. Fulbright Fellow,
1982. First Prize National Contest on Scientific Non-technical Essay, 1990.
Publications: Ebeling W., Jiménez-Montaño M. A. (1980)*. On Grammars, Complexity, and Information
Measures of Biological Macromolecules. Mathematical Biosciences, Vol. 52: 53-71. Jiménez-Montaño M. A.
(1984). On the Syntactic Structure of Protein Sequences, and Concept of Grammar Complexity. Bulletin of
Mathematical Biology, Vol. 46: 641-660. Jiménez-Montaño M. A., de la Mora-Basáñez R., Pöschel T. (1996)*.
The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Aminoacid
Substitutions in Vivo and in Vitro. BioSystems, Vol. 39: 117-125. Jiménez-Montaño M. A. (1999)*. Protein
Evolution Drives the Evolution of the Genetic Code and Vice Versa. BioSystems, Vol. 54: 47-64. Weiss O.,
Jiménez-Montaño M. A., Herzel H. (2000). Information content of protein sequences. J. Theor. Biol., Vol. 206:
379-386. Jiménez-Montaño M. A. (2004). Applications of Hyper Genetic Code to Bioinformatics. Journal of
Biological Systems, Vol. 12: 5-20. Jiménez-Montaño M. A. (2009)*. The fourfold way of the genetic code.
BioSystems, Vol. 98 (2): 105-114.

Abstract: We introduce a theoretical model consisting of a Markov Information Source
that generates codon sequences and, from them, amino acid sequences that maintain the
same or very similar functions and structures, as a direct consequence of the structure
of the genetic code and of general physicochemical constraints. With the help of the
model, we propose a codon dendrogram describing a hierarchy of codon categorizations
that explains the pattern of frequent amino acid substitutions in short-term evolution.

Keywords: Markov source, genetic code, codon, amino acid, protein evolution.
324 M. A. JIMÉNEZ-MONTAÑO

1. INTRODUCTION

Understanding protein evolution remains as much a major challenge in molecular biology
today as it was a decade ago (Dokholyan and Shakhnovich, 2001 and references therein),
despite the huge amount of data on genes, protein sequences and structures presently
available. Our knowledge of the relation between the genotype (DNA coding for a
protein) and the phenotype (a protein’s structure and its pattern of specific traits
related to its biological function), which is central to the Theory of Evolution and to
all biology, is still at a very primitive stage (Thorne and Goldman, 2001; Wagner,
2012). While the mechanisms of mutation in DNA sequences that code for proteins are
known (Parkhomchuk et al., 2009; Skipper et al., 2012), the contribution of the genetic
code to creating new information, as opposed to the part played by natural selection in
fixing it in the population, is not fully appreciated. According to Abel and Trevors
(2006), “Genetic prescription of computation precedes and produces phenotypic
realization. And this prescription is ‘written in stone’”. Only recently has the full
complexity of the genotype/phenotype map been recognized in the literature (Crutchfield
and Schuster, 2003).

According to DePristo et al. (2005), “Taken as whole, recent findings from biochemistry
and evolutionary biology indicate that our understanding of protein evolution is
incomplete, if not fundamentally flawed”. They suggest joining the fields of protein
biophysics and molecular evolution by highlighting their shared questions. Along the
same line of thought, Pàl et al. (2006) argue that an integrated view of this field
should embrace the genomic, structural and population levels of description. These
levels are displayed graphically in Fig. 1. The difficulty in achieving this aim is, on
the one hand, that these levels belong to fields of knowledge with radically different
conceptual frameworks and, on the other, the degenerate relationship between physics
and biology. It is well known that many proteins with no apparent sequence similarity
display the same fold (Kleiger et al., 2000). Thus, the many-to-one map M(Ai/S),
depicted in Fig. 1, gives the amino acid at position i of any of the sequences
corresponding to the given structure (fold) S. In the concise statement of Hietpas et
al. (2011): “Biology is governed by physical interactions, but biological requirements
can have multiple physical solutions”.
AMINO ACID SUBSTITUTIONS IN PROTEIN EVOLUTION 325

[Figure 1 (diagram). Panels: Environment (Natural Selection, Neutral Evolution); Virus
Quasi-Species f1, f2…fn; ψ1, ψ2…ψn (Concentration Space; Population genetics;
PHENOTYPE); Protein structure space (form space; Biophysics & Biochemistry;
Thermodynamics); Protein sequence space (Coding of structure); DNA sequence space
(Genotypes; Codon usage); Genetic Code Space (Hypercube; Translation apparatus;
MUTATIONS; Molecular biology & Bioinformatics; GENOTYPE). Maps shown: M(S/ψi), M(Ai/S),
M(Cj/Ai), M(Cm’/Cm), M(Ai’/C’m), M(S’/Aj’), M(f’/S’). Legend: S = 3-d structure of the
protein; Ai = aa at site i of protein sequence; Ci = codon at site i of protein;
fi = frequency of species i; ψi = phenotype of species i.]

Figure 1: The conceptual scheme for protein evolution. The connection between the domain of population
genetics (where Darwinian selection operates and true evolution occurs) and the genome of an organism
(where mutations, an essential ‘raw material’ of evolution, occur) is mediated by the physics and chemistry of
proteins. The crucial point for appreciating the complexity implicit in this diagram is the degenerate
relationship between physics and biology. Given a population, e.g. of virus quasi-species with frequencies {fi,
i = 1, 2, …, n} and corresponding phenotypes {ψi, i = 1, 2, …, n}, M(S/ψi) specifies the common
structure (S) of a protein family (the set π of orthologous proteins associated with phenotype ψi). Then M(Ai/S)
is the amino acid at position i of any of the sequences corresponding to the structure S. In the same way,
M(Ci/Ai) gives one of the codons coding for the amino acid, according to the genetic code. Let Cm be the
chosen codon; then M(C’m/Cm) represents the single-nucleotide mutation from codon Cm to codon C’m,
obeying the structure of the genetic code. In turn, M(A’j/C’m) gives the mutated amino acid A’j, coded by
C’m, and M(S’/A’j) maps A’j to the corresponding new structure S’. This modified protein structure (fold) is
mapped to a new phenotype ψ+ through M(ψ+/S’), which through Darwinian selection starts the evolutionary
cycle again. Certainly, there is no direct feedback from the old to the new structure. Proteins are concrete
molecules that do not evolve.

The main purpose of the present work is to provide a theoretical model, built upon an
empirical codon substitution matrix (Schneider et al., 2005), that explains the pattern
of amino acid substitutions in proteins that maintain the same or very similar functions
and structures. In the model, this pattern is a direct consequence of the structure of
the genetic code (Jiménez-Montaño, 1994), which controls the amino acid changes that are
possible through single-nucleotide mutations, and of general physicochemical
constraints, which are responsible for the stability of the protein.

2. MODELS OF PROTEIN EVOLUTION

2.1 Amino acid models of protein evolution

Notwithstanding the complexity of protein evolution, the phenomenological approach to
amino acid substitutions in protein families, which began with the empirical work of
Margaret Dayhoff and her colleagues (1978), is still widely used in a broad range of
applications such as database search, sequence alignment, protein family classification
and phylogenetic inference, among many others. Following in Dayhoff’s footsteps, and
with the help of the large databases that became available in subsequent years, various
authors built amino acid substitution matrices based on observed mutation counts in
protein alignments (e.g. the updated Dayhoff matrices by Gonnet et al., 1992 or Jones et
al., 1992). This formalism operates in protein space (see below) and thus completely
ignores the underlying mutational process that occurs at the DNA level. Dayhoff’s PAM
matrices describe the probabilities of amino acid substitutions for a given period of
evolution. They are derived from a model in which amino acids mutate randomly and
independently of one another. Each substitution probability during some time interval
depends only on the identities of the initial and replacement residues. Mathematically
speaking, the dynamics of amino acid substitution resembles a time-homogeneous,
first-order, reversible Markov chain (Dayhoff et al., 1972, 1978; Gonnet et al., 1992;
Jones et al., 1992; Müller and Vingron, 2000). Of course, these assumptions are not
strictly true, and various authors have pointed out that the dynamics of amino acid
substitutions is neither Markovian, nor stationary, nor homogeneous (Crooks and Brenner,
2005).
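
The Markov-chain assumptions behind PAM-type matrices can be illustrated with a small
sketch. It is only a toy under stated assumptions: a three-letter alphabet stands in for
the 20 amino acids, and the one-step matrix P1 and the equilibrium frequencies PI are
invented numbers, not values from any published PAM matrix. Time-homogeneity means that
the substitution probabilities after n time units are the n-th power of the one-step
matrix, and reversibility means that detailed balance holds.

```python
# Toy illustration: a 3-letter alphabet stands in for the 20 amino acids.
# All numbers are invented for the example; they are NOT PAM values.
PI = [0.5, 0.3, 0.2]            # assumed equilibrium frequencies
P1 = [                          # assumed one-step substitution probabilities
    [0.90, 0.06, 0.04],
    [0.10, 0.85, 0.05],
    [0.10, 0.075, 0.825],
]


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, steps):
    """Substitution probabilities after `steps` time units (p multiplied `steps` times)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


def is_reversible(p, pi, tol=1e-12):
    """Detailed balance: pi[i] * p[i][j] equals pi[j] * p[j][i] for all i, j."""
    n = len(p)
    return all(abs(pi[i] * p[i][j] - pi[j] * p[j][i]) <= tol
               for i in range(n) for j in range(n))


P2 = mat_pow(P1, 2)         # two time units
P_LONG = mat_pow(P1, 500)   # long divergence: every row approaches PI
```

In this toy chain each row of P_LONG is numerically indistinguishable from PI, which is
the sense in which such a model forgets the initial residue at large evolutionary
distances.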

Sequence space, the abstract space of all sequences of length n drawn from an alphabet
of k letters, was first introduced in coding theory by Hamming (1950). It is a metric
space with respect to the Hamming distance, dH (Hamming, 1950), which represents the
minimum number of changes required to convert one sequence into another. Maynard Smith
(1970) applied this idea to amino acid sequences, defining the concept of protein space
(see also Kauffman, 1989 and references therein). As recently described in a delightful
paper by Frances Arnold (2011), in protein space each sequence is surrounded by its
one-mutant neighbors, that is, by all the proteins that differ from it by a change in a
single amino acid letter. As described in (Kauffman, 1989), “The concept of protein
space is a high-dimensional space in which each point represents one protein, and is
next to 19N points representing all the 1-mutant neighbors of that protein. The protein
space therefore simultaneously represents the entire ensemble of 20^N proteins and keeps
track of which proteins are 1-mutant neighbors of each other”.
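
The counting in the quotation above can be checked with a short sketch. The function
names and the three-residue example sequence are assumptions made for illustration only;
the Hamming distance and the 19N neighbor count are as described in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter codes of the 20 amino acids


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def one_mutant_neighbors(seq):
    """Yield every sequence that differs from `seq` in exactly one residue."""
    for i, residue in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != residue:
                yield seq[:i] + aa + seq[i + 1:]


protein = "MKT"                         # hypothetical example with N = 3
neighbors = list(one_mutant_neighbors(protein))
n_neighbors = len(neighbors)            # 19 * N
space_size = len(AMINO_ACIDS) ** len(protein)   # 20 ** N
```

For N = 3 this gives 57 one-mutant neighbors in a space of 8000 sequences.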

However, an autonomous description of protein evolution in protein (sequence) space is
misleading, because it violates the Central Dogma of molecular biology. In nature,
amino acids do not interchange among themselves during evolution. The space of
possibilities lies at the DNA level, where the meaningful unit is a base triplet, or
codon. Therefore, the relevant space in which to describe codon substitutions is a
genetic code space (Fig. 1), where codon mutations occur (Swanson, 1984;
Jiménez-Montaño et al., 1996; Petoukhov, 1999; Stambuk, 2000; Jiménez-Montaño, 2004;
Jiménez-Montaño and He, 2009). Many of the 19 alternative amino acids cannot be reached
from the original amino acid by a single-nucleotide mutation, and thus they have null
probability of appearance. This is the first place where the symmetry associated with
the concept of random mutations is broken. The single-nucleotide mutations among the
four bases are indeed random (although not necessarily equally probable), but the
corresponding amino acid substitution probabilities cannot be equal, owing to the
structure of the genetic code. The dynamics of amino acid substitutions refers to an
aggregate level (see below).
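
A minimal sketch of this restriction, assuming the standard genetic code written in DNA
letters: for a given codon it lists the amino acids that a single-nucleotide change can
produce. The helper names are hypothetical and the code is only an illustration, not
part of the model described in this paper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}


def single_nucleotide_neighbors(codon):
    """The nine codons that differ from `codon` at exactly one position."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]


def reachable_amino_acids(codon):
    """Amino acids (other than the original) reachable by one nucleotide change."""
    original = CODE[codon]
    return {CODE[n] for n in single_nucleotide_neighbors(codon)} - {original, "*"}


from_met = reachable_amino_acids("ATG")   # methionine
from_trp = reachable_amino_acids("TGG")   # tryptophan
```

From ATG (methionine) only six of the 19 other amino acids are reachable (L, V, T, K, R
and I), and from TGG (tryptophan) only five; all remaining substitutions require at
least two nucleotide changes.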

2.2 Codon models of protein evolution

Goldman and Yang (1994) and, independently, Muse and Gaut (1994) introduced the
first models of a Markovian dynamics at the DNA (codon) level. In these models all
substitution rates are derived from parameters. We will not discuss parametric codon
models here; instead, we are going to employ an adaptation for short-term evolution of
the empirical codon substitution model proposed by Gaston Gonnet and his group
(Schneider et al., 2005). In this case, all substitution rates were estimated from a large
data set of aligned vertebrate coding sequences and then fixed.

Assuming Markovian dynamics at the DNA (codon) level, the dynamics of amino acid
substitutions is defined by an aggregation (grouping) of codon states. However,
Görnerup and Jacobi (2010) pointed out that, in general, the dynamics at the aggregated
level is not closed, since partitioning the original state space introduces memory at
the aggregated level. Only in the special case in which the aggregated dynamics is
indeed closed does the stochastic process over the partition constitute a Markov chain
of the same order as the original process. Employing the same empirical codon
substitution matrix (Schneider et al., 2005) as we do, they showed that the substitution
process operates hierarchically on multiple levels: from nucleotides to codons, to
groups of codons associated with amino acids, and to amino acid groups that form
“reduced alphabets”. Since each level has, approximately, its own closed dynamics, the
original dynamics and the partition of the state space define a new stochastic process
at the coarser level. These theoretical aspects of molecular evolution were corroborated
by our computer simulations.
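
One way to make the notion of a closed aggregated dynamics concrete is the classical
strong-lumpability condition for Markov chains: every state in a block must have the
same total probability of jumping into each block. Using this condition as the
formalization of "closed" is an assumption of the sketch below, as are the two invented
3-state matrices; it does not reproduce the method of Görnerup and Jacobi.

```python
def block_sums(p, partition):
    """For each state, total probability of jumping into each block."""
    return [[sum(row[j] for j in block) for block in partition] for row in p]


def is_strongly_lumpable(p, partition, tol=1e-12):
    """True if all states of a block share the same block-transition sums."""
    sums = block_sums(p, partition)
    for block in partition:
        reference = sums[block[0]]
        for i in block[1:]:
            if any(abs(a - b) > tol for a, b in zip(sums[i], reference)):
                return False
    return True


def lumped_matrix(p, partition):
    """Transition matrix on the blocks (meaningful only if lumpable)."""
    sums = block_sums(p, partition)
    return [sums[block[0]] for block in partition]


PARTITION = [[0, 1], [2]]        # states 0 and 1 are grouped, state 2 stays alone
P_CLOSED = [[0.5, 0.3, 0.2],     # invented example: aggregated dynamics is closed
            [0.1, 0.7, 0.2],
            [0.3, 0.3, 0.4]]
P_OPEN = [[0.5, 0.3, 0.2],       # invented example: grouping introduces memory
          [0.1, 0.5, 0.4],
          [0.3, 0.3, 0.4]]

closed_ok = is_strongly_lumpable(P_CLOSED, PARTITION)
open_ok = is_strongly_lumpable(P_OPEN, PARTITION)
coarse = lumped_matrix(P_CLOSED, PARTITION)
```

In the first matrix both grouped states leave their block with probability 0.2, so the
coarse process is again a first-order Markov chain; in the second they leave with
different probabilities, and the coarse process depends on which underlying state is
occupied.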

Recently, Kosiol and Goldman (2011) proposed a closely related approach in terms of
aggregated Markov processes (AMPs), modeling protein evolution as time-homogeneous and
Markovian at the DNA (codon) level but observed (via the genetic code) only at the amino
acid level. They showed that this approach leads to time-dependent and non-Markovian
observations of amino acid sequence evolution. The main difference between their work,
on the one hand, and the paper by Görnerup and Jacobi (2010) and our model, on the
other, is that Kosiol and Goldman employed a parametric codon substitution matrix.
Nonetheless, our model is consistent with their assertion that the genetic code and
amino acids' physiochemical properties “influence the average substitution patterns
observed over collections of proteins at all evolutionary distances in the same way”.
That is, we assert that it is not accurate to say that the influence of the genetic code
dominates in the short term and that of physicochemical properties in the long term, as
supposed by Benner et al. (1994).

3. THE MARKOVIAN CODON-SUBSTITUTION MODEL

Markov processes, chains and models were first developed by Andrei A. Markov. Their
first use was linguistic: modeling the letter sequences in works of Russian literature
(Markov, 1913). Later, Markov models were developed into a general statistical tool and
applied to problems in natural language processing (Christopher and Schutze, 2003) and
in computational biology (Nielsen, 2005; Ewens et al., 2001; Yang, 2006), among many
other applications.

3.1 The Markov Information Source

Probabilistic finite-state automata (PFA), like hidden Markov models (HMM), are widely
used in computational linguistics, machine learning, time series analysis, computational
biology and speech recognition, among other fields of research. Their definition, given
in (Vidal et al., 2005), is equivalent to that of a stochastic regular grammar. PFA are
built to deal with the problem of probabilizing a structured space by adding
probabilities to structure. This is precisely what we want to do: to bring codon
transition probabilities into the structure of the genetic code.

This is necessary because in our model, as in the parametric models by Goldman and Yang
(1994) and by Halpern and Bruno (1998), the state space of the Markov process
corresponds to the standard genetic code (or its variants). In the model of Goldman and
Yang, “The states of the Markov process are the 61 sense codons. The three nonsense
(stop) codons are not considered in the model, as mutations to or from stop codons can
be assumed to affect drastically the structure and function of the protein and therefore
will rarely survive”. But, apart from sharing the same abstract space, our approach is
not related to the models mentioned. Rather, its formulation and interpretation are
closer to those of informational and linguistic models. Therefore, we interpret the
genetic code as a Markov information source, exactly as this expression is understood in
information theory (Ash, 1965, p. 172): a finite Markov chain together with a function f
whose domain is the set of states S and whose range is a finite set Γ called the
alphabet of the source. In our case, the states are the codons and Γ = {A, G, S, T, …,
Y, W} is the amino acid alphabet. The PFA can be displayed graphically as a
six-dimensional Boolean hypercube (Jiménez-Montaño et al., 1996; Petoukhov, 1999;
Stambuk, 2000; Jiménez-Montaño, 2004; Sánchez et al., 2004; Karasev and Soronkin, 1997).
In a forthcoming paper (Jiménez-Montaño and Ramos-Fernández, 2013) we describe an
implementation of the PFA with the help of the software tool GSEQUENCE, which we
developed specifically to simulate the generation of codon sequences, and from them
amino acid sequences, in protein evolution.
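
The definition of a Markov information source can be sketched in a few lines. This is
not the GSEQUENCE tool, whose implementation is not described here; it is a toy under
stated assumptions: the states are the 61 sense codons of the standard code, the
function f maps a codon to its amino acid, and the transition probabilities are taken
as uniform over single-nucleotide sense neighbors, purely as a placeholder for the
empirical values.

```python
import random
from itertools import product

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}
SENSE = [c for c in CODE if CODE[c] != "*"]      # the 61 states of the chain


def sense_neighbors(codon):
    """Sense codons reachable from `codon` by one nucleotide change."""
    return [n for n in (codon[:i] + b + codon[i + 1:]
                        for i in range(3) for b in BASES if b != codon[i])
            if CODE[n] != "*"]


def f(codon):
    """Output function of the source: state (codon) -> letter (amino acid)."""
    return CODE[codon]


def run_source(start, steps, seed=0):
    """Follow the chain for `steps` transitions; return codons and emitted letters.

    Assumption: each sense neighbor is chosen with equal probability. An
    empirical matrix would replace rng.choice with a weighted choice.
    """
    rng = random.Random(seed)
    codon = start
    codons = [codon]
    for _ in range(steps):
        codon = rng.choice(sense_neighbors(codon))
        codons.append(codon)
    return codons, "".join(f(c) for c in codons)


codons, letters = run_source("ATG", 10)
```

The list codons records the successive states of one site, and letters is the
corresponding string over the amino acid alphabet Γ; several codons map to the same
letter, so the emitted sequence is an aggregated view of the underlying chain.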

Just as Shannon (1948) did not mean that his statistical description of human language
by a Markov information source is the actual manner in which human discourse is
generated, we are not suggesting that Nature really produces proteins with the help of a
Markov information source at the codon level. This is only a mathematical device to
describe the correlations among amino acid substitutions along the evolutionary process.

As mentioned above, for our model we have adapted to short-term evolution (i.e., to
single-nucleotide changes) the 61 x 61 codon matrix introduced in (Schneider et al.,
2005), for which all substitution rates were estimated from a set of 17,502 alignments
of orthologous genetic sequences from five vertebrate genomes. The codon transition
probabilities are fixed and correspond to protein divergence between 25 and 60 accepted
point mutations per 100 amino acids (PAMs). Besides the influence of the genetic code,
this matrix, like any other empirical codon substitution matrix, includes variable
factors such as codon usage, transition/transversion bias and selective pressures.
Inversions and duplications are not considered in this paper.

Out of the 190 possible interchanges among the 20 amino acids, we consider only the 75
that can be obtained by single-base substitutions. Therefore, we employ a reduced
empirical matrix (REM), obtained by setting to zero all entries of the original matrix
that correspond to more than one nucleotide change and normalizing the resulting matrix;
see (Jiménez-Montaño and He, 2009) for more details. In this way, we take into account
the local structure of the genetic code around each codon. In one step, a codon can
change in nine different ways; generally, from zero to three of these changes are
synonymous and from six to nine are non-synonymous, except in the cases of six-fold
degeneracy (serine, leucine and arginine). We consider all possible one-step changes for
all 61 codons, disregarding the three stop codons.
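
The reduction step and the count of 75 interchanges can be sketched as follows. The
empirical values of Schneider et al. (2005) are not reproduced here, so a uniform
placeholder matrix is used; keeping the diagonal (zero nucleotide changes) and
normalizing each row to one are assumptions about details that the text leaves to
(Jiménez-Montaño and He, 2009).

```python
from itertools import product

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}
SENSE = sorted(c for c in CODE if CODE[c] != "*")


def n_diff(a, b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def reduce_matrix(full):
    """Zero all entries with more than one nucleotide change; renormalize rows."""
    reduced = {}
    for a in SENSE:
        kept = {b: full[a][b] for b in SENSE if n_diff(a, b) <= 1}
        total = sum(kept.values())
        reduced[a] = {b: kept.get(b, 0.0) / total for b in SENSE}
    return reduced


def change_profile(codon):
    """Counts of (synonymous, non-synonymous, stop) one-step changes of a codon."""
    neighbors = [codon[:i] + b + codon[i + 1:]
                 for i in range(3) for b in BASES if b != codon[i]]
    syn = sum(CODE[n] == CODE[codon] for n in neighbors)
    stop = sum(CODE[n] == "*" for n in neighbors)
    return syn, len(neighbors) - syn - stop, stop


# Placeholder for the empirical 61 x 61 matrix (uniform, for illustration only).
FULL = {a: {b: 1.0 / len(SENSE) for b in SENSE} for a in SENSE}
REM = reduce_matrix(FULL)

# Unordered amino acid pairs connected by a single-base substitution.
PAIRS = {frozenset((CODE[a], CODE[b]))
         for a in SENSE for b in SENSE
         if n_diff(a, b) == 1 and CODE[a] != CODE[b]}
```

With the standard code, len(PAIRS) evaluates to 75, in agreement with the number quoted
above, and change_profile("GGG") returns three synonymous and six non-synonymous
changes.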

The important contribution of the genetic code to protein evolution has recently been
underlined by Hietpas et al., (2011), who found that the genetic code is highly
optimized (+2.4σ) to favor single-base substitutions between codons with WT-like
fitness compared with randomly generated codes. Thus, the genetic code generally
permits single-base substitution pathways between codons with WT-like fitness.

4. SELECTIVE CONSTRAINTS

The functions and structure of individual proteins impose different constraints on their
evolution. Irrespective of their dispensability, most proteins require a suitable
three-dimensional structure to function. Therefore, any polypeptide having a well-defined
globular structure must be subject to strong selection, and its sequence is, from this
point of view, nearly optimal in terms of stability (Sánchez et al., 2006). Consequently, a
majority of positions in a protein globular domain are selected for stability. The
fundamental role of selection for thermodynamic stability in shaping molecular
evolution has been demonstrated by studies that simulated sequence evolution under
structural constraints (Parisi and Echave, 2001). The amino acid substitution
probabilities derived from the REM are highly anti-correlated with the
physical-chemical distances between amino acids proposed long ago by Miyata et al.
(1979) (Table 2). Therefore, the selective constraints are approximately captured by the
amino acid properties of hydrophobicity and volume.

More than thirty years ago, while analyzing the globin fold, Lesk and Chothia (1980)
had already reached a similar conclusion. So long as some basic physical-chemical
constraints are satisfied, there is considerable latitude in primary structure (Axe, 2004).
About the same time, a related conclusion was reached by Sander and Schulz (1979) in
their study of the degeneracy of the information content of amino acid sequences from
overlaid genes. From the families of homologous proteins known at that time, they
concluded that the information contained in a sequence is degenerate with respect to
function. They quantified this degeneracy on the basis of viral overlays and found that
five amino acid groups is the largest number of groups for which the assumption is
tenable that there exists one group sequence per protein function. Ever since, a great
number of reduced alphabets have been proposed, based on different amino acid
properties or observed substitutions (Miyata et al., 1979; Jiménez-Montaño, 1984;
Murphy et al., 2000; Solis and Rackovsky, 2000; Cannata et al., 2002; Li et al., 2003),
among many others. Here, following (Görnerup and Jacobi, 2010), we interpret reduced
amino acid alphabets simply as a result of the various codon sub-dynamics among
different groups of codons that are neighbors according to the topology of the genetic
code space.

Recently, Chothia (Sasidharan and Chothia, 2007) returned to the problem from a
different perspective. For the divergence process in proteins that maintain the same or
very similar functions and structures, Sasidharan and Chothia reported very similar
overall patterns of divergence by counting observed amino acid substitutions in three
very different groups of orthologs. They interpret this result to mean that individual
responses of most proteins are variations on a common set of selective constraints
which govern the types of frequent mutations that are acceptable. In RESULTS we
show that the frequencies of amino acid pair substitutions deduced from our computer
simulations are in very good agreement with the mutation profile obtained in their
paper.

5. PROTEIN SYNTAX

Paraphrasing Prince and Smolensky (1997) in their attempt to relate the sciences of the
brain with the sciences of the mind, we can say in the present context that: “It is evident
that statistical thermodynamics and molecular biology are separated by many gulfs, not
the least of which lies between the formal methods appropriate for continuous
dynamical systems and those for discrete symbol structures”.

For an amino acid substitution to be acceptable it is necessary, first of all, that the
alteration it produces in the protein structure be as small as possible. Therefore, the
general Darwinian principle of “gradual change” is interpreted in the sense that the
destabilization of the structure should be as small as possible. Thus, thermodynamics
requires minimization of the Gibbs free energy change. However, this continuous
optimization is hampered by the discrete nature of the amino acid change. A rough
estimate of the effect produced by the substitution can be obtained, for example, by
calculating the Miyata et al. (1979) distance between the original and the new amino
acid. However, this distance should not be calculated between the original and any of the
other 19 amino acids, but only between the original amino acid and the amino acids
accessible after a single-nucleotide mutation (that is, at most nine amino acids). In this way
the genetic code modulates acceptable mutations.
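
The restriction to accessible amino acids can be sketched in Python as follows. The standard genetic code is written as the usual 64-letter string in TCAG order; the helper names and the `distance` argument (any dissimilarity function, for instance a lookup of the Miyata et al. values) are illustrative assumptions, and no distance values are supplied here.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def accessible_amino_acids(codon):
    """Amino acids reachable from `codon` by one nucleotide change,
    excluding the original amino acid and the stop codons."""
    original = GENETIC_CODE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbor = codon[:pos] + base + codon[pos + 1:]
            aa = GENETIC_CODE[neighbor]
            if aa != "*" and aa != original:
                reachable.add(aa)
    return reachable


def candidate_distances(codon, distance):
    """Evaluate a user-supplied dissimilarity only on accessible amino acids."""
    original = GENETIC_CODE[codon]
    return {aa: distance(original, aa) for aa in accessible_amino_acids(codon)}
```

For the aspartic acid codon GAT, for instance, `accessible_amino_acids("GAT")` contains E and N but not Q, so Q would not enter the comparison however small its distance to D.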

Following the parallelism between linguistics and a formal protein language (Jiménez-
Montaño, 1984), we recall that an important challenge of the first discipline is to
discover an architecture for grammars that both allows variation and limits its range to
what is actually possible in human language (Prince and Smolensky, 1997).
Furthermore, these authors remark that “… a central element in the architecture of
grammar is a formal means for managing the pervasive conflict between grammatical
constraints”. “The key observation is this: In a variety of clear cases where there is a
strength asymmetry between two conflicting constraints, no amount of success on the
weaker constraint can compensate for failure on the stronger one”. Finally, “… a
grammar consists entirely of constraints arranged in a strict domination hierarchy, in
which each constraint is strictly more important than - takes absolute priority over - all
constraints lower-ranked in the hierarchy. With this type of constraint interaction, it is
only the ranking of constraints in the hierarchy that matters for the determination of
optimality; no particular numerical strengths, for example, are necessary”.

Below we are going to show in which sense these concepts can be applied to the
characterization of amino acid substitutions. First, we need to say some words about
amino acid categorizations and the syntactic structure of proteins at the letter-unit level
which we discussed in detail in (Jiménez-Montaño, 1984).

We will call any classification of the 20 amino acid types into r groups (under different
criteria) an amino acid categorization. The set of symbols denoting the group names
will be called a reduced alphabet. When the reduced alphabet corresponds to a pattern
of substitutions according to an empirical matrix (Dayhoff et al., 1972, 1978; Gonnet et
al., 1992; Jones et al., 1992; etc.), the pattern is called a pattern of substitution classes.
If the categorization is based on physical-chemical properties (Grantham, 1974; Miyata
et al., 1979; etc.), the resulting patterns are called amino acid property sequences. In the
paper mentioned above, we approached the question of how to select a fixed
number of categories which best mirror amino acid replacements. The mutational
categories should reflect the most frequent amino acid substitutions observed. In the
same way, if the selected physical-chemical properties truly determine the protein's
architecture, both patterns will be consistent. In other words, the reduced alphabets
obtained under different criteria will be almost equivalent. It is on this basis that it is
reasonable to assume that the constraints responsible for a given fold are somehow
encoded in the pattern of substitution classes.

A hierarchy (inverted tree) of amino acid categorizations represents a syntactic structure
of proteins at the letter-unit level. We shall employ for our discussion the hierarchy
introduced in Fig. 1 of (Jiménez-Montaño, 1984). In the same paper we discussed two
more hierarchies, one from Sneath (1966) and the other from Lim (1974). Twelve more
hierarchies (also called dendrograms), associated with the same number of popular
amino acid substitution matrices, are displayed in Fig. 4 of (Johnson and Overington,
1993). See also Fig. 1 in (Fan and Wang, 2003), and (Venkatarajan and Braun, 2001),
among many other proposals in the literature.

With this background, the interpretation of the above quotations from Optimality
Theory (Prince and Smolensky, 1997) in the context of the present article is
straightforward. A hierarchy of amino acid categorizations encodes physical-chemical
constraints arranged in a strict domination hierarchy. Thus, the dominant partition in the
dendrogram in Fig. 1 of (Jiménez-Montaño, 1984) separates amino acids into
non-hydrophobic, represented by the group symbol a, and hydrophobic, represented by the
group symbol b. This constraint dominates over the lower constraints (for example, size).
Therefore, we expect that an amino acid of a given class will be substituted with another
amino acid of the same class. In this case, we say that the substitution is syntactically
correct, and that the new sequence belongs to the language generated by the grammar.

Let us illustrate with an example how the grammar generates average amino acid
substitutions obeying general physical-chemical constraints that preserve the stability of
the protein. If, in a given site of a protein sequence, we have aspartic acid (D), we expect
that it will be replaced by glutamic acid (E) because the node n in Fig. 1 of
(Jiménez-Montaño, 1984) is the smallest class that includes both amino acids. Next, we have node
e, which includes two more amino acids, Q and N, that is, e = {D, E, N, Q}. Thus, the
category represented by the symbol n corresponds to the most conservative substitution,
then follows the wider category e, and so on up to the category represented by the
symbol a, which embraces the non-hydrophobic amino acids.
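
The idea of the smallest class containing both amino acids can be sketched in a few lines of Python. Only classes whose membership is spelled out in this paper are listed (n, e, and the classes i, h and c mentioned in later sections); the complete tree of Fig. 1 in (Jiménez-Montaño, 1984) is not reproduced, so this is a partial, illustrative hierarchy, and the function name is hypothetical.

```python
# Partial hierarchy: only the classes whose members are stated in the text.
CLASSES = {
    "n": {"D", "E"},
    "e": {"D", "E", "N", "Q"},
    "i": {"F", "Y", "W"},
    "h": {"L", "I", "V", "M"},
    "c": {"P", "A", "G", "S", "T"},
}


def smallest_common_class(x, y, classes=CLASSES):
    """Name of the smallest listed class containing both amino acids,
    or None when no listed class contains them both."""
    candidates = [(len(members), name)
                  for name, members in classes.items()
                  if x in members and y in members]
    return min(candidates)[1] if candidates else None
```

With this partial listing, the pair (D, E) is assigned to n and the pair (D, Q) to the wider class e; a pair such as (D, F) returns None only because the upper classes a and b are not listed here.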

The amino acid dendrograms we are considering were derived from a number of amino
acid substitution matrices by several authors, who employed different clustering
procedures that have a significant influence on the result. Besides, this approach in
protein space disregards the fact that two amino acids in the same group may be
separated by two or three nucleotide substitutions, and are thus unlikely to substitute one
another. As pointed out long ago by Miyata et al. (1979): “Amino acids separated by
two or three codon position differences are unlikely to interchange even if they are
chemically similar”. Recently, we discussed this problem (Jiménez-Montaño and He,
2009). For example, the category i = {F, Y, W} in Fig. 1 of (Jiménez-Montaño, 1984),
which includes the three large hydrophobic amino acids, should be refined into two
groups: {F, Y} and {W}. This is so because to go from the codon of W to any of the
codons of the other two amino acids we need two nucleotide changes; thus, W
constitutes a separate group by itself. This splitting of W was already proposed, for
example, in (Murphy et al., 2000) but for a different reason (which is a consequence of
the above reason): the small number of substitutions observed between W and the other
two amino acids, as reflected in the empirical BLOSUM 50 matrix. Therefore, it is clear
that to improve over previous approaches it is necessary to have a syntactic structure at
the codon level.
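
The refinement of the category {F, Y, W} can be checked with a short Python sketch. Splitting a category into groups connected by single-nucleotide changes is used here only as one simple illustrative criterion (an assumption, not the procedure of the cited papers); the genetic code is the standard one, written in TCAG order.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def min_codon_distance(aa1, aa2):
    """Fewest nucleotide changes needed to go from a codon of aa1 to one of aa2."""
    codons1 = [c for c, aa in GENETIC_CODE.items() if aa == aa1]
    codons2 = [c for c, aa in GENETIC_CODE.items() if aa == aa2]
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons1 for c2 in codons2)


def split_by_adjacency(category):
    """Split a category into groups linked by single-nucleotide changes."""
    remaining = set(category)
    groups = []
    while remaining:
        seed = remaining.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = {aa for aa in remaining
                      if min_codon_distance(current, aa) == 1}
            remaining -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(group)
    return groups


groups_i = split_by_adjacency("FYW")  # category i = {F, Y, W}
```

Under this criterion `groups_i` consists of {F, Y} and {W}, since every codon of F or Y is two nucleotide changes away from TGG.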

6. RESULTS

The first result of this paper is the proposal of the codon dendrogram shown in Fig. 2. It
was obtained by applying the clustering algorithm UPGMA (unweighted pair-group
method using arithmetic averages) to the full codon substitution matrix introduced in
(Schneider et al., 2005). This classification of codons, inferred from an empirical
matrix, induces a corresponding arrangement for amino acids. We observe that the
codons for D and E share the same group; therefore, we expect these two amino acids
to exchange frequently, both because they are very similar and because they are neighbors
in codon space. They are ranked first in our simulations (Table 1) and in
Human-Chicken orthologs, and ranked second in Escherichia coli and Salmonella orthologs, both
of the latter from the observed data (Table 6 in supp. material from Sasidharan and Chothia, 2007).
However, the group e = {D, E, N, Q} does not occur in Fig. 2; N and S2 (S with AGY
codons) form one category, and Q and H another. Therefore, the letters in category e are
not completely equivalent from the point of view of their substitutability. From the
results in Table 1, the substitutions DN and EQ are in ranks between 12 and
14, while NQ and EN are ranked 44 and 45, respectively, in Table 6 in supp.
material from (Sasidharan and Chothia, 2007). In the first case the amino acids have
neighboring codons; in the second case the corresponding codons differ by two bases.
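
For readers unfamiliar with UPGMA, a minimal pure-Python sketch of the algorithm follows. It is not the implementation used to produce Fig. 2: the four codons and their pairwise distances are hypothetical toy values, and how the empirical substitution matrix was converted into distances is not specified in the text.

```python
def upgma(labels, matrix):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.
    Returns the merges in order, as (members_a, members_b, height)."""
    n = len(labels)
    members = {i: (labels[i],) for i in range(n)}
    dist = {(i, j): float(matrix[i][j])
            for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(members) > 1:
        a, b = min(dist, key=dist.get)          # closest pair of clusters
        height = dist[(a, b)] / 2.0
        size_a, size_b = len(members[a]), len(members[b])
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # Size-weighted mean = arithmetic average over all member pairs.
            updated[(c, next_id)] = ((size_a * d_ac + size_b * d_bc)
                                     / (size_a + size_b))
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(updated)
        merges.append((members.pop(a), members.pop(b), height))
        members[next_id] = merges[-1][0] + merges[-1][1]
        next_id += 1
    return merges


# Hypothetical toy distances between four codons (not empirical values).
toy_labels = ["GAT", "GAC", "GAA", "TGG"]
toy_matrix = [[0, 1, 3, 9],
              [1, 0, 3, 9],
              [3, 3, 0, 8],
              [9, 9, 8, 0]]
toy_merges = upgma(toy_labels, toy_matrix)
```

In this toy example the two D codons merge first, the E codon joins them next, and TGG joins last, which mirrors how synonymous and neighboring codons end up in the same low-level group of a dendrogram.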

Figure 2: Dendrogram (Codon rooted tree) obtained from the full empirical codon-substitution matrix
(Schneider et al., 2005), employing the UPGMA method.

E. COLI-S. ENTERICA       MARKOV SOURCE             RANDOM GENERATOR
RANK ORDER   MUTATION     RANK ORDER   MUTATION     RANK ORDER   MUTATION
2            DE           1            DE/ED        10           DE/ED
1            IV           3            IV/VI        6            IV/VI
4            ST           4            ST/TS        3            ST/TS
5            AT           8            AT/TA        5            AT/TA
10           NS           5            SN/NS        6            SN/NS
3            AS           9            AS/SA        2            AS/SA
15           KR           2            RK/KR        6            RK/KR
18           GS           15           SG/GS        2            SG/GS
8            AV           13           AV/VA        5            AV/VA
6            IL           14           IL/LI        4            IL/LI
-            -            12           SP/PS        3            SP/PS
13           LV           6            VL/LV        3            VL/LV
7            LM           7            LM/ML        9            LM/ML
19           AP           20           AP/PA        5            AP/PA
16           AG           17           AG/GA        5            AG/GA
21           HQ           3            QH/HQ        10           QH/HQ
17           FY           10           FY/YF        10           FY/YF
23           NT           19           TN/NT        8            TN/NT
-            -            16           VM/MV        10           VM/MV
29           IM           10           IM/MI        11           IM/MI
-            -            23           PQ/QP        8            PQ/QP
25           FL           10           LF/FL        6            LF/FL
-            -            22           TP/PT        5            TP/PT
-            -            19           TI/IT        6            TI/IT
24           QR           18           RQ/QR        6            RQ/QR
14           DN           12           ND/DN        10           ND/DN
-            -            19           SC/CS        7            SC/CS
11           KQ           9            KQ/QK        10           KQ/QK
-            -            28           TM/MT        10           TM/MT
20           HN           17           NH/HN        10           NH/HN
9            AE           24           AE/EA        8            AE/EA
12           EQ           8            EQ/QE        10           EQ/QE
22           KN           11           KN/NK        10           KN/NK
26           AD           25           AD/DA        8            AD/DA
27           TV           -            -            5            TV/VT
28           HR           20           RH/HR        6            RH/HR
30           LQ           22           QL/LQ        6            QL/LQ
-            -            20           GE/EG        8            GE/EG
-            -            24           PL/LP        2            PL/LP

Table 1: Comparison of rank positions of the most frequent mutation types found in pairs of orthologs from
Escherichia coli and Salmonella enterica, for < 10 % divergence (Sasidharan and Chothia, 2007), with the ones
obtained from simulations generated with our Markov information source, implemented with the help of the
software tool GSEQUENCE (Jiménez-Montaño and Ramos-Fernández, 2013). In the third column we display
results obtained with a random source.
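
How a ranking of this kind can be derived from a simulation is sketched below in Python. This is not GSEQUENCE, whose protocol is not described here: the transition table is a hypothetical uniform one over single-nucleotide neighbors (a run in the spirit of the model would use the REM probabilities instead), and the sequence length, number of steps and seed are arbitrary.

```python
import random
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}
SENSE = [c for c, aa in GENETIC_CODE.items() if aa != "*"]


def uniform_one_step_matrix():
    """Hypothetical transition table: each sense codon moves with equal
    probability to any sense codon one nucleotide away."""
    table = {}
    for src in SENSE:
        nbrs = [dst for dst in SENSE
                if sum(x != y for x, y in zip(src, dst)) == 1]
        table[src] = {dst: 1.0 / len(nbrs) for dst in nbrs}
    return table


def simulate_pair_counts(codons, transition, steps, rng):
    """Apply `steps` single-codon changes and count unordered amino acid
    substitution pairs (DE and ED are pooled, as in Table 1)."""
    sequence = list(codons)
    counts = Counter()
    for _ in range(steps):
        site = rng.randrange(len(sequence))
        old = sequence[site]
        targets = list(transition[old])
        weights = [transition[old][t] for t in targets]
        new = rng.choices(targets, weights=weights, k=1)[0]
        a, b = GENETIC_CODE[old], GENETIC_CODE[new]
        if a != b:  # synonymous changes are not counted
            counts["".join(sorted(a + b))] += 1
        sequence[site] = new
    return counts


def rank_pairs(counts):
    """Pairs ordered from most to least frequent."""
    return [pair for pair, _ in counts.most_common()]


rng = random.Random(2012)
start = [rng.choice(SENSE) for _ in range(200)]
counts = simulate_pair_counts(start, uniform_one_step_matrix(), 5000, rng)
ranking = rank_pairs(counts)
```

The list `ranking` plays the role of a RANK ORDER column; with the uniform table it only reflects the connectivity of the genetic code, which is why empirical transition probabilities are needed to approach the observed ranks.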

Despite the discrepancies just explained, there is very good agreement between most
of the categories in Fig. 1 from (Jiménez-Montaño, 1984) and those in the codon
dendrogram (Fig. 2). For example, the important category h = {L, I, V, M} of aliphatic
amino acids in our hierarchy of amino acid categorizations coincides with the category
d in Fig. 2. The same is true for the small neutral amino acids in group c = {P, A, G, S,
T}, except for glycine (G), which in Fig. 2 belongs to class b = {N, S2, G, D, E}. There are
other coincidences and differences that the reader can easily find. An important
difference worth mentioning is the following: while in the grouping based on amino acid
properties I and L are in the same category q, in the codon dendrogram I joins V to form
a group. Even though there are proofreading processes in protein synthesis to correct
translation errors, the substitutions VI are ranked third and the substitutions
LI are ranked fourteenth in our simulations, and are ranked first and sixth, respectively,
in pairs of orthologs from Escherichia coli and Salmonella enterica (see our Table 1
and Table 6 in supp. material from Sasidharan and Chothia, 2007).

The amino acid substitution pairs from our codon dendrogram, (D, E), (K, R), (I, V),
(Y, F), (M, L), (N, S2), (Q, H) and (A, S), are consistent with the ones reported in the
dendrogram displayed in Fig. 3 of (Görnerup and Jacobi, 2010), except for minor
differences. These are in the last two groups; in their paper (which employs a
completely different agglomeration procedure), Q and H form separate groups, and A
pairs with T instead of S. However, the higher categorizations are completely different
in the two dendrograms. The separation of hydrophobic and hydrophilic amino acids in our
dendrogram (Fig. 2) is consistent with that in Fig. 1 of (Jiménez-Montaño, 1984). The
only difference comes from the small neutral amino acids, which are in the
non-hydrophobic group in our former publication, and are grouped with the hydrophobic
amino acids in the codon dendrogram.

The second, but no less important, result is that six of the ten most frequent amino acid
substitution pairs obtained from simulations generated with our Markov information
source, implemented with the help of the software tool GSEQUENCE (Jiménez-Montaño
and Ramos-Fernández, 2013), agree with the pairs in the three sets of orthologs from E.
coli – S. enterica, Human-Mouse and Human-Chicken, respectively, from Table 6 in
supp. material from (Sasidharan and Chothia, 2007). These pairs are: DE, IV, ST, NS,
AT, AS. The pair KR agrees with two of the three sets of orthologs. Additionally, we
have in descending order the pairs VL, LM and FY, which are ranked twelve, thirteen
and seventeen, respectively, in the same source. Seven of these amino acid substitution
pairs agree with the codon pairs in the codon dendrogram (Fig. 2); they are: DE, IV,
NS, AS, KR, LM, FY. These results are consistent with the most frequent amino acid
exchanges found in (Schmitt et al., 2007). In our simulations, the corresponding codons
outline approximately closed dynamics, as discussed in (Görnerup and Jacobi, 2010).
Therefore, these cycles of the codon dynamics produce amino acid substitutions which
are fixed in the population because of the similarity of the corresponding amino acids.
These outcomes take us to the third and last result of this paper.

AA   CODON   CORR COEF     AA   CODON   CORR COEF     AA   CODON   CORR COEF
K    AAA     -0.8791       I    ATA     -0.5617       D    GAC     -0.7193
K    AAG     -0.9266       I    ATC     -0.5668       D    GAT     -0.7201
N    AAC     -0.6477       I    ATT     -0.5791       A    GCA     -0.3349
N    AAT     -0.7568       M    ATG     -0.6726       A    GCC     -0.3815
T    ACA     -0.6894       Q    CAA     -0.7166       A    GCG     0.7656
T    ACC     -0.4522       Q    CAG     -0.5851       A    GCT     -0.4195
T    ACG     0.4707        H    CAC     -0.6559       G    GGA     -0.4947
T    ACT     -0.4727       H    CAT     -0.6578       G    GGC     -0.8047
R    AGA     -0.8543       P    CCA     -0.8373       G    GGG     -0.2542
R    AGG     -0.8397       P    CCC     -0.8056       G    GGT     -0.7988
R    CGA     -0.8787       P    CCG     0.3259        V    GTA     -0.4626
R    CGC     -0.9202       P    CCT     -0.7635       V    GTC     -0.6288
R    CGG     -0.9238       L    CTA     -0.6903       V    GTG     -0.8257
R    CGT     -0.9403       L    CTC     -0.964        V    GTT     -0.6141
S2   AGC     -0.7744       L    CTG     -0.9467       Y    TAC     -0.9487
S2   AGT     -0.6739       L    CTT     -0.9273       Y    TAT     -0.9621
S    TCA     -0.9765       L    TTA     -0.5547       C    TGC     -0.788
S    TCC     -0.8915       L    TTG     -0.7221       C    TGT     -0.7812
S    TCG     -0.7629       E    GAA     -0.6831       W    TGG     -0.7817
S    TCT     -0.9292       E    GAG     -0.8145       F    TTC     -0.4676
                                                      F    TTT     -0.4737

Table 2: Anti-correlation between the substitution probabilities from the reduced empirical matrix, REM
(Jiménez-Montaño and He, 2009), and the physical-chemical dissimilarity index (distance) from (Miyata et
al., 1979). For a given codon, e.g. AAA (K), I calculated the correlation between the list of values of
substitution probabilities with its neighbors (AGA (R), GAA (E), etc.) and the list of values of the index of the
associated amino acids (in parentheses).

In Table 2 we display the anti-correlation between the substitution probabilities from the
reduced empirical matrix, REM (Jiménez-Montaño and He, 2009), and the
physical-chemical dissimilarity index (distance) from (Miyata et al., 1979), which is based on
hydrophobicity and volume of amino acids. As expected, amino acid pairs which have
codons that substitute frequently have small values of the dissimilarity index and vice
versa. So, this well-known result from comparisons of amino acid substitution matrices
is corroborated at the codon level.
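
The per-codon calculation behind Table 2 can be sketched in Python as follows. The use of the Pearson coefficient is an assumption (the text only speaks of a correlation), and the two short lists are hypothetical numbers chosen for illustration, not REM probabilities or Miyata distances.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical values for the one-step neighbors of a single codon:
# substitution probabilities and the distances of the associated amino acids.
toy_probabilities = [0.30, 0.20, 0.10, 0.05]
toy_distances = [0.5, 1.0, 2.0, 3.0]
toy_r = pearson(toy_probabilities, toy_distances)
```

A negative `toy_r` corresponds to the pattern reported in the table: the more probable a substitution, the smaller the distance between the two amino acids.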

7. CONCLUSIONS

After presenting a general conceptual framework for the analysis of protein evolution,
we introduced a theoretical model, which consists of a Markov Information Source that
generates codon sequences, and from them amino acid sequences, that maintain the
same or very similar functions and structures. This invariance is a consequence not only
of natural selection (which preserves the sequences that obey general physical-chemical
constraints, which are responsible for the stability of the protein), but also of the
structure of the genetic code, which controls the amino acid changes that are possible
through single-nucleotide mutations. With the help of the model, we introduced a syntactic
formulation (codon dendrogram) to describe a hierarchy of codon categorizations which
explains the pattern of frequent amino acid substitutions in short-term evolution. From
our computer simulations (Jiménez-Montaño and Ramos-Fernández, 2013) we
interpreted the reduced amino acid alphabets simply as a result of the various codon
sub-dynamics among different clusters of codons that are neighbors according to the
topology of the genetic code space.

Acknowledgements: I wrote this paper while commissioned at Dirección General de
Investigaciones de la Universidad Veracruzana. I want to thank director César I.
Beristain-Guevara for his support. I also thank Q.F.B. Antero Ramos-Fernández for his
help in doing some calculations and preparing the tables and figures. I thank David Abel
for suggestions to make the manuscript clearer and for some references. I express thanks to
Sistema Nacional de Investigadores, México, for partial support. Finally, I thank my
wife, Ma. Eta. Castellanos G., for her patience and understanding.

REFERENCES

Abel, D.L. and Trevors, J.T. (2006) More than Metaphor: Genomes are Objective Sign Systems, Journal of
BioSemiotics, 1 253-267.
Arnold, F.H. (2011) The Library of Maynard-Smith: My Search for Meaning in the Protein Universe.
Microbe, ASM News 6(7) 316-318.
Ash, R. (1965) Information Theory, New York: Interscience Publishers, 339pp.
Axe D.D. (2004) Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds.
Journal of Molecular Biology, 341 1295-1315.
Benner S.A. Cohen M.A. Gonnet G.H. (1994). Amino acid substitution during functionally constrained
divergent evolution of protein sequences. Protein Engineering, 7 1323–1332.
Cannata, N., Toppo, S., Romualdi, C. and Valle, G. (2002) Simplifying amino acid alphabets by means of a
branch and bound algorithm and substitution matrices. Bioinformatics 18 1102-1108.
Crooks, G.E. and Brenner, S.E. (2005) An alternative model of amino acid replacement. Bioinformatics 21
975–980.
Crutchfield, J.P. and Schuster, P. (2003) Evolutionary Dynamics–Exploring the Interplay of Accident,
Selection, Neutrality, and Function, Oxford University Press, New York, 452pp.
Dayhoff, M.O., Eck, R.V. and Park, C.M. (1972) A model of evolutionary change in proteins. In: Dayhoff M,
ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington,
D.C., 5 89–99pp.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model of evolutionary change in proteins. In:
Dayhoff M, ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation,
Washington, D.C. 5(3) 345–352pp.
DePristo, M.A., Weinreich, D.M. and Hartl, D.L. (2005) Missense meanderings in sequence space: a
biophysical view of protein evolution. Nature Reviews Genetics 6 678-687.
Dokholyan, N.V. and Shakhnovich, E.I. (2001) Understanding hierarchical protein evolution from first
principles. Journal of Molecular Biology 312 289–307.
Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag,
New York, 476pp.
Fan, K. and Wang, W. (2003) What is the Minimum Number of Letters Required to Fold a Protein? Journal of
Molecular Biology 328 921–926.
Goldman, N. and Yang, Z. (1994) A Codon-based Model of Nucleotide Substitution for Protein-coding DNA
Sequences. Molecular Biology and Evolution 11 725-736.
Gonnet, G.H., Cohen, M.A. and Benner, S.A. (1992) Exhaustive matching of the entire protein sequence
database. Science 256 1443-1445.
Görnerup, O. and Jacobi, M.N. (2010) A model-independent approach to infer hierarchical codon substitution
dynamics. BMC Bioinformatics 11 201
Grantham, R. (1974) Amino acid difference formula to help explain protein evolution. Science 185 862-864.
Halpern, A.L. and Bruno, W.J. (1998) Evolutionary Distances for Protein-Coding Sequences: Modeling Site-
Specific Residue Frequencies. Molecular Biology and Evolution 15 910–917.
Hamming, R.W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29 147-160.
Hietpas, R.T., Jensen, J.D. and Bolon, D.N.A. (2011) Experimental illumination of a fitness landscape.
Proceedings of the National Academy of Sciences 108 7896–7901.
Jiménez-Montaño, M.A. (1984) On the syntactic structure of protein sequences and the concept of grammar
complexity. Bulletin of Mathematical Biology 46 641-659.
Jiménez Montaño M. A. (1994) On the Syntactic Structure and Redundancy Distribution of the Genetic Code.
BioSystems, 32 11-23.
Jiménez-Montaño, M.A. (2004) Applications of Hyper Genetic Code to Bioinformatics. Journal of Biological
Systems 12 5-20.
Jiménez-Montaño, M.A. and He, M. (2009) Irreplaceable Amino Acids and Reduced Alphabets in Short-term
and Directed Protein Evolution. In Bioinformatics Research and Applications. Mandoiu, Ion;
Narasimhan, Giri; Zhang, Yanquing (Eds.). Springer-Verlag Berlin Heidelberg, 297–309pp.
Jiménez-Montaño, M.A. and Ramos-Fernández, A. (2013) Simulation of protein evolution with a Markovian
empirical codon-substitution model. Manuscript in preparation.
Jiménez-Montaño, M.A., de la Mora-Basáñez, R. and Pöschel, T. (1996) The Hypercube Structure of the
Genetic Code Explains Conservative and Non-Conservative Aminoacid Substitutions in Vivo and in
Vitro. BioSystems 39 117-125.
Johnson, M.S. and Overington, J.P. (1993) A structural basis for sequence comparisons—an evaluation of
scoring methodologies. Journal of Molecular Biology 233 716–738
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from
protein sequences. Computer Applications in the Biosciences 8 275–282.
Karasev, V.A. and Soronkin, S.G. (1997) Topological structure of the genetic code, Russian Journal of
Genetics 33 622–628.
Kauffman, S. (1989) Adaptation on Rugged Fitness Landscapes. In Lectures in the Sciences of Complexity.
Stein, D.L., Editor. Addison-Wesley Publishing Company, Redwood City, California, 527-618pp.
Kleiger, G., Beamer, L.J., Grothe, R., Mallick, P., and Eisenberg, D. (2000) The 1.7 Å Crystal Structure of
BPI: A Study of How Two Dissimilar Amino Acid Sequences can Adopt the Same Fold, Journal of
Molecular Biology 299 1019-1034.
Kosiol, C. and Goldman, N. (2011) Markovian and Non-Markovian Protein Sequence Evolution: Aggregated
Markov Process Models, Journal of Molecular Biology 411 910–923.
Lesk, A.M. and Chothia, C. (1980) How different amino acid sequences determine similar protein structures:
The structure and evolutionary dynamics of the globins, Journal of Molecular Biology 136 225-270.
Li, T., Fan, K., Wang, J. and Wang, W. (2003) Reduction of protein sequence complexity by residue
grouping, Protein Engineering 16 323-330.
Lim, V.I. (1974) Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins,
Journal of Molecular Biology 88 873-94.
Markov, A.A. (1913) Primer statisticheskogo issledovanija nad tekstom `Evgenija Onegina' illjustrirujuschij
svjaz' ispytanij v tsep (An example of statistical study on the text of `Eugene Onegin' illustrating the
linking of events to a chain). Izvestija Imp, Akademii nauk, serija VI, 3 153-162.
Miyata, T., Miyazawa, S. and Yasunaga, T. (1979) Two types of amino acid substitutions in protein evolution,
Journal of Molecular Evolution 12 219-236.
Müller, T. and Vingron, M. (2000) Modeling amino acid replacement, Journal of Computational Biology 7
761–776.
Murphy, L.R., Wallqvist, A. and Levy, R.M. (2000) Simplified amino acid alphabets for protein fold
recognition and implications for folding, Protein Engineering 13 149-152.
Muse, S.V. and Gaut, B.S. (1994) A Likelihood Approach for Comparing Synonymous and Nonsynonymous
Nucleotide Substitution Rates, with Application to the Chloroplast Genome, Molecular Biology and
Evolution 11 715-724.
Manning, C.D. and Schutze, H. (1999) Foundations of Statistical Natural Language Processing, Cambridge,
Massachusetts, MIT Press, 680pp. Reprint: Cambridge, Massachusetts, MIT Press 2003.
Nielsen, R. (2005) Statistical Methods in Molecular Evolution, Springer Verlag, New York, 508pp.
Pál, C., Papp, B. and Lercher, M.J. (2006) An integrated view of protein evolution, Nature Reviews Genetics 7
337-348.
Parisi, G. and Echave, J. (2001) Structural constraints and emergence of sequence patterns in protein
evolution, Molecular Biology and Evolution 18 750–756.
Parkhomchuk, D., Amstislavskiy, V., Soldatov, A. and Ogryzko, V. (2009) Use of high throughput sequencing
to observe genome dynamics at a single cell level, Proceedings of the National Academy of Sciences
106 20830-20835.
Petoukhov, S.V. (1999) Genetic code and the ancient Chinese book of changes, Symmetry: Culture and
Science 10 211-226.
Prince, A. and Smolensky, P. (1997) Optimality: From Neural Networks to Universal Grammar, Science 275
1604-1610.
Sanchez, I.E., Tejero, J., Gomez-Moreno, C., Medina, M. and Serrano, L. (2006) Point Mutations in Protein
Globular Domains: Contributions from Function, Stability and Misfolding, Journal of Molecular
Biology 363 422–432.
Sánchez, R., Morgado, E. and Grau, R. (2004) The Genetic Code Boolean Lattice, Communications in
Mathematical and in Computer Chemistry 52 29-46.
Sander, C. and Schulz, G.E. (1979) Degeneracy of the information contained in amino acid sequences:
Evidence from overlaid genes, Journal of Molecular Evolution 13 245-252.
Sasidharan, R. and Chothia, C. (2007) The selection of acceptable protein mutations, Proceedings of the
National Academy of Sciences 104 10080–10085.
Schmitt A. O., Schuchhardt, J., Ludwig A., Brockmann G. A. (2007) Protein evolution within and between
species, Journal of Theoretical Biology 249 376–383.
Schneider, A., Cannarozzi, G.M. and Gonnet, G.H. (2005) Empirical codon substitution matrix, BMC
Bioinformatics 6 134.
Shannon C. (1948). A Mathematical Theory of Communication, The Bell System Technical Journal 27 379–
423, 623–656, July, October.
Skipper M., Dhand R., Campbell P. (2012) Nature/Encode. 2001 Will always be remembered as the year of
the human genome, Nature 489: 45. The ENCODE Project Consortium (2012). An integrated
encyclopedia of DNA elements in the human genome. Nature 489: 57–74.
http://www.nature.com/encode/
Smith, J.M. (1970) Natural Selection and the Concept of a Protein Space, Nature 225 563–564.
Sneath, P.H.A. (1966) Relations between chemical structure and biological activity in peptides, Journal of
Theoretical Biology 12 157-195.
Solis, A.D. and Rackovsky, S. (2000) Optimized representations and maximal information in proteins,
Proteins: Structure, Function, and Bioinformatics 38 149-164.
Stambuk, N. (2000) Universal metric properties of the genetic code, Croatica Chemica Acta 73 1123-1139.
Swanson, R. (1984) A unifying concept for the amino acid code, Bulletin of Mathematical Biology 46 187-
203.
Thorne, J.L. and Goldman, N. (2001) Probabilistic models for the study of protein evolution. Balding, D.J.,
Bishop, M., Cannings, C. (Eds.), Handbook of Statistical Genetics. John Wiley, Chichester, UK, 67-
82pp.
Venkatarajan, M.S. and Braun, W. (2001) New quantitative descriptors of amino acids based on
multidimensional scaling of a large number of physical–chemical properties, Journal of Molecular
Modeling 7 445–453.
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F. and Carrasco, R.C. (2005) Probabilistic finite-state
machines-Part I, IEEE Trans, Pattern Analysis and Machine Intelligence 27 1013-1025.
Yang, Z. (2006) Computational molecular evolution, Oxford: Oxford University Press 374 pp.
Wagner A. (2012). The Role of Randomness in Darwinian Evolution, Philosophy of Science 79 95-119.
Symmetry: Culture and Science
Vol. 23, No. 3-4, 343-375, 2012

SYMMETRIES IN MOLECULAR-GENETIC SYSTEMS AND MUSICAL HARMONY

G. Darvas*, A.A. Koblyakov**, S.V. Petoukhov***, I.V. Stepanian****

* Physicist, philosopher (b. Budapest, Hungary, 1948).
Address: Symmetrion, 29 Eötvös St., Budapest, H-1067 Hungary; darvasg@iif.hu.
Fields of interest: symmetry in arts and sciences, especially physics; interrelations of sciences and arts.
Publication: Symmetry, Basel: Birkhäuser, (2008), xi+508 p.
** Composer, musicologist (b. Kuibyshev, Russia, 1951).
Address: Dean of the Composition Faculty, Moscow State Conservatory named after P.I. Tchaikovsky, Bolshaya Nikitskaya street 13/6, 125009 Moscow, Russian Federation. E-mail: akoblyakov@list.ru
Fields of interest: Music, interdisciplinary research, logic, mathematics, biology, physics
Awards: Laureate of the International Composers' Competition
Publications: 1) Synergetics and creativity // Synergetic paradigm. Moscow, 2000 (in Russian); 2) From
disjunction to conjunction (the contours of the general theory of creation) // The language of science - the
languages of art. Moscow, 2000 (in Russian); 3) Semantic aspects of self-similarity in music // Symmetry:
culture and science, v.6, number 2, 1995; 4) About one model defining art in its broadest sense // Sustainable
development, science and practice. M., 2003, № 2 (in Russian); 5) Discrete and continuous in the field of
music from the viewpoint of problem-sense approach // Proceedings of the International Conference
"Mathematics and Art", Moscow, 1997 (in Russian).
*** Biophysicist, bioinformatician (b. Moscow, Russia, 1946).
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of Russian
Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail:
spetoukhov@gmail.com.
Fields of interest: genetics, bioinformatics, biosymmetries, multidimensional numbers, musical harmony,
mathematical crystallography (also history of sciences, oriental medicine).
Awards: Gold medal of the Exhibition of Economic Achievements of the USSR, 1974; State Prize of the
USSR, 1986; Honorary diplomas of a few international conferences and organizations, 2005-2012.
Publications: 1) S.V. Petoukhov (1981) Biomechanics, Bionics and Symmetry. Moscow, Nauka, 239 pp. (in
Russian); 2) S.V. Petoukhov (1999) Biosolitons. Fundamentals of Soliton Biology. Moscow, GPKT, 288 pp.
(in Russian); 3) S.V. Petoukhov (2008) Matrix Genetics, Algebras of the Genetic Code, Noise-immunity.
Moscow, RCD, 316 pp. (in Russian); 4) S.V. Petoukhov, M. He (2010) Symmetrical Analysis Techniques for
Genetic Systems and Bioinformatics: Advanced Patterns and Applications, Hershey, USA: IGI Global, 271
pp.; 5) He M., Petoukhov S.V. (2011) Mathematics of Bioinformatics: Theory, Practice, and Applications.
USA: John Wiley & Sons, Inc., 295 pp.
**** Mathematician, biologist (b. Moscow, Russia, 1980).
Fields of interest: algebraic biology, quantum neural computing, DNA music therapy, bioinformatics and bionics.
Address: Laboratory of Biomechanical Systems, Mechanical Engineering Research Institute of the Russian Academy of Sciences; Malyi Kharitonievskiy pereulok, 4, Moscow, 101990, Russia. E-mail: neurocomp.pro@gmail.com
Publications: 1) I.V. Stepanian (2011) Neural network algorithms for acoustic spirometry data
recognition (method of diagnosis of pulmonary occupational diseases). LAP LAMBERT Academic
Publishing GmbH & Co.: Saarbrücken, Germany, 200 pp. (ISBN: 978-3-8473-2767-7); 2) I. V. Stepanian, A.
L. Krugly (2011) An example of the stochastic dynamics of a causal set, in Foundations of Probability and
Physics – 6, Växjö-Kalmar, Sweden, 14-16 June 2011, AIP Conference Proceedings, V. 1424, edited by
Mauro D’Ariano, Shao-Ming Fei, Emmanuel Haven, Beatrix Hiesmayr, Gregg Jaeger, Andrei Khrennikov,
and Jan-Åke Larsson, (2012), pp. 206 -210 (arXiv: 1111.5474 [gr-qc]); 3) S.V.Petoukhov, V.I. Svirin, I.V.
Stepanian (2012) Matrix genetics, hypercomplex numbers and the rules of long genetic sequences.-
Proceedings of the VIII International conference «Finsler extensions of relativity theory», Moscow-Fryazino,
25 June-1 July, 2012, p. 71-72.

Abstract: The Moscow State Conservatory named after P.I. Tchaikovsky has recently created a special “Center for interdisciplinary researches of musical creativity”. One of the main tasks of this center is to study genetic musical scales from different viewpoints, including new opportunities for composers and for music therapy. This article is devoted to scientific aspects of the genetic musical scales, which are based on symmetric features of molecular ensembles of genetic systems. These musical scales were revealed in the course of a symmetrological study of representations of molecular-genetic ensembles in the unified form of mathematical matrices (Kronecker families of genetic matrices). This study has discovered a relation of genetic systems to the golden section and Fibonacci numbers, which play a role in a hierarchical system of these musical scales and which are well known in biological phyllotaxis laws and in the aesthetics of proportions. Some historical and biological aspects of musical harmony are also considered.

Keywords: symmetry, musical harmony, genetic code, golden section, Fibonacci numbers.

1. ABOUT THE GENETIC CODING SYSTEM AND GENETICALLY INHERITED PERCEPTION OF MUSIC

From ancient times, understanding the phenomenon of music and building musical structures have been associated with mathematics. G. Leibniz, the creator of one of the first mechanical calculating machines, wrote: “Music is a secret arithmetical exercise and the person who indulges in it does not realize that he is manipulating numbers” and “music is the pleasure the human mind experiences from counting without being aware that it is counting” (http://thinkexist.com/quotes/g._wilhelm_leibniz/).

The range of human sound perception contains an infinite set of sound frequencies. Pythagoras discovered that certain mathematical rules, based on integers, allow one to separate from this infinite set a discrete set of frequencies that forms a harmonious set of sounds. In other words, certain combinations of sounds from this set are perceived by living organisms as pleasant to hear (consonances). In addition, Pythagoras linked the phenomenon of harmonic sounds with the parameters of a physical object: the oscillation frequencies of a stretched string, the length of which is varied in accordance with appropriate numerical rules. But these discoveries by Pythagoras do not exclude the existence of other discrete sets of sound frequencies that also form harmonious sets of sounds.

This article describes some results of research on molecular ensembles of the genetic coding system. The results reveal that sets of parameters of this molecular-genetic system are related to the well-known Pythagorean musical scale and also to a hierarchy of special mathematical sets. This hierarchy can be interpreted and used as the basis of a new system of musical scales, because the corresponding sets of sound frequencies may possess harmonic properties for human hearing. According to our assumption, it is essential that these musical systems are connected with the molecular-genetic system, because the phenomenon of musical perception is inherited.

Scientific studies of the physiological mechanisms of musical perception began long ago. A review of this topic can be found in the article (Weinberger, 2004). Beginning at the age of 4 months, infants turn towards a source of pleasant sounds (consonances) and turn away from a source of unpleasant sounds (dissonances). The human brain does not possess a special center of music. The feeling of love for music seems to be dispersed throughout the whole organism. Musical sound addresses everything in the person, or the person’s archetypes. There are reports that the first cry of a newborn baby corresponds, as a rule, to the frequency of the musical note “la” (440 Hz), irrespective of its timbre and loudness (http://www.rods.ru/Html/Russian/MoreResonance.html). This frequency is traditionally used for tuning musical instruments by means of a tuning fork. This suggests a certain biological unification of musical sounds. According to statistics, physical reactions to music (in the form of skin reactions, tears, laughter, etc.) arise in 80% of adults. Animals are also not indifferent to human music. All such data indicate that the perception of music has a biological essence and that the feeling of musical harmony is based on inborn mechanisms. Therefore it is necessary to search for connections of the genetic system with musical harmony. This article presents such a search.

It should be mentioned that ideas about the key significance of musical harmony in the organization of the world have existed since ancient times. For example, the Pythagoreans thought about musical intervals in the planetary system and in everything around, and J. Kepler wrote the famous book Harmonices Mundi. Modern atomic physics found harmonic ratios in the spectral series of the hydrogen atom discovered by T. Lyman, which were named the “music of atomic spheres” by A. Einstein and A. Sommerfeld (Voloshinov, 2000). The importance of Pythagorean ideas about the role of musical harmony was also emphasized by the Nobel Prize winner in physics R. Feynman (1963, v. 4, Chapter 50).

Living substance is frequently compared with crystals. For example, E. Schrödinger (1955) named it an “aperiodic crystal”. Do the annals of modern science contain any data about a connection of musical harmony with crystals? Yes, such data exist (see, for example, the book (Berger, 1997, p. 270-281)).

In 1818, C.S. Weiss, who discovered crystallographic systems and was one of the founders of crystallography, emphasized a musical analogy in crystallographic systems. He investigated ratios among segments formed by the faces of crystals of the cubic system. Weiss showed that these ratios are exactly identical to the ratios between musical tones.

In 1829, J. Grassman, who wrote the well-known book “Zur Physischen Kristallonomie und Geometrischen Combinationslehre” and developed many mathematical methods in crystallography, noted impressive musical analogies in the field of crystallography. He described many analogies between the ratios of musical tones and the ratios of segments formed by faces of the same zone of a crystal. According to his figurative expression, “crystal polyhedron is a fallen asleep chord - a chord of the molecular fluctuations made in time of its formation” (from (Berger, 1997, p. 270)).

At the end of the 1890s the outstanding crystallographer V. Goldschmidt returned to the same ideas. The prominent Russian mineralogist and geochemist A.E. Fersman wrote about his publications on this theme: “These works represent the historical page in crystallography, which has led Goldschmidt to revealing by him laws of harmonic ratios. Goldschmidt has extended these laws logically from the world of crystals into the world of other correlations in the regions of paints, colors, sounds and even biological correlations. It has become one of the most favourite themes of philosophical researches by Goldschmidt” (from (Berger, 1997, p. 270)). The list of such historical examples can be continued.

Taking into account that Schrödinger described living substance as an aperiodic crystal and that the classics of crystallography emphasized a connection between crystal structures and musical harmony, it seems natural to try to find traces of musical harmony in living substance as well. The idea of a possible participation of musical harmony in the organization of biological organisms is not new for modern biophysics. For example, the famous Russian biophysicist S. Shnoll (1989) wrote: “From possible consequences of interaction of macromolecules of enzymes, which are carrying out conformational (cyclic) fluctuations, we shall consider pulsations of pressure - sound waves. The range of numbers of turns of the majority of enzymes corresponds to acoustic sound frequencies. We shall consider … a fantastic picture of "musical interactions" among biochemical systems, cells, bodies, and a possible physiological role of these interactions. … It leads to pleasant thoughts about nature of hearing, about an origin of musical perception and about many other things, which already belong to the area of biochemical aesthetics”. The term “biochemical aesthetics”, proposed by Shnoll, reflects the material of our article.

Let us recall some fundamental notions of the theory of musical harmony. Each musical note is characterized by a certain sound frequency. For a musical melody, the ratios between the frequencies of neighboring notes are important, not the absolute values of the frequencies of separate notes. For this reason a melody is easily recognized irrespective of the acoustic range of frequencies in which it is produced, for example, by a child, a woman or an adult man with quite different voices. The set of frequency ratios between sounds in a musical system is named a musical scale. A given note, for example the note “do”, is recognized as the same note if its frequency is doubled or halved, i.e., if it belongs to another octave. The interval of frequencies from some note frequency $f_0$ up to the frequency $2f_0$ is named an octave. Each note “do” is usually considered as the beginning of the corresponding octave. For example, the first octave reaches from a frequency of approximately 260 Hz (the note “do” of the first octave) up to the double frequency of 520 Hz (the note “do” of the second octave).

Only a small number of discrete frequencies within the octave range is traditionally used for musical notes. The notes that correspond to these frequencies form a sequence in ascending order of frequency. A musical scale represents a sequence of numerical values (“interval values”) between the frequencies of adjacent notes (musical tones).

For Europeans the idea of the musical harmony of the universe is connected mainly with the name of Pythagoras and his school. Following ancient thinkers (first of all, ancient Chinese thinkers), the Pythagoreans considered that the world is arranged according to principles of musical harmony. The Pythagorean musical scale, which is based on the quint ratio 3:2, played the main role in these views. One should note that this musical scale was known in Ancient China long before Pythagoras, who presumably became acquainted with it during his life in Egypt and Babylon (these questions are analyzed in detail in the book (Needham, v.4, 1962)). In Ancient China this quint musical scale had a cosmic meaning connected with “The Book of Changes” (“I Ching”): the numbers 2 and 3 were named “numbers of Earth and Heaven” there. Following Ancient China, the Pythagoreans considered the numbers 2 and 3 as the female and male numbers, which can give birth to new musical tones in their interconnection. According to some data, the quint system of the musical scale is the most ancient among the known systems in the history of musical scales (http://www.arbuz.uz/t_octava.html).
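
As an illustration of how the numbers 2 and 3 alone can generate a scale, the following minimal Python sketch stacks quints 3:2 and folds each result back into a single octave by factors of 2. The choice of one quint below and five quints above the base note (which yields a seven-note scale), as well as all function names, are assumptions of this sketch and are not taken from the article.

```python
from fractions import Fraction

QUINT = Fraction(3, 2)  # the quint (perfect fifth) ratio 3:2


def fold_into_octave(ratio):
    """Multiply or divide by 2 until the ratio lies in the octave [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


def pythagorean_scale(quints_down=1, quints_up=5):
    """Stack quints around the base note and sort the folded ratios (assumed construction)."""
    return sorted(fold_into_octave(QUINT ** k) for k in range(-quints_down, quints_up + 1))


scale = pythagorean_scale()
# Interval values between adjacent notes, including the step that closes the octave.
intervals = [upper / lower for lower, upper in zip(scale, scale[1:] + [Fraction(2)])]

print("ratios to the base note:", [str(r) for r in scale])
print("interval values:", [str(i) for i in intervals])
```

Under these assumptions every ratio contains only powers of 2 and 3, and only two interval values occur between adjacent notes, 9/8 and 256/243.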

The ancient Greeks attached extraordinary significance to the search for the quint 3:2 in natural systems because of their ideas about musical harmony in the organization of the world. For example, the great mathematician and mechanician Archimedes considered the detection of the quint 3:2 between the volumes and the surface areas of a cylinder and a sphere inscribed in it (Voloshinov, 2000) to be the best result of his life. These geometrical figures with the quint ratio were depicted on his gravestone in accordance with Archimedes’ testament, and thanks to these figures Cicero found Archimedes’ grave nearly a century and a half after his death. This article demonstrates, in particular, the connection of the Kronecker family of the genomatrices of hydrogen bonds with the Pythagorean musical scale based on the quint ratio 3:2.

2. NUMERIC GENOMATRICES OF HYDROGEN BONDS

One of the effective methods for studying a complex natural system, including the genetic coding system, is the investigation of its symmetries. Modern science knows that deep knowledge of the phenomenological symmetry relations among separate parts of a complex natural system can tell many important things about the evolution and mechanisms of such systems. This article studies some symmetry properties of the genetic coding system by means of matrix representation and analysis of molecular ensembles of the genetic system. The initial choice of this form of presentation of molecular ensembles of the genetic code is explained by the following main reasons.

• Information is usually stored in computers in the form of matrices.

• The genetic coding system provides noise-immunity properties; noise-immunity codes are constructed on the basis of matrices.

• Genetic molecules obey the principles of quantum mechanics, which utilizes matrix operators. A connection between genetic matrices and these matrix operators can be revealed. The significance of the matrix approach is emphasized by the fact that quantum mechanics arose in the form of the matrix mechanics of W. Heisenberg.

• Complex and hypercomplex numbers, which are utilized in physics and mathematics, possess matrix forms of their presentation. The notion of number is the main notion of mathematics and the mathematical natural sciences. In view of this, investigation of a possible connection of the genetic code with multi-dimensional numbers in their matrix presentations can lead to very significant results.

• Matrix analysis is one of the main investigation tools in the mathematical natural sciences. The study of possible analogies between matrices that are specific to the genetic code and famous matrices from other branches of science can be heuristic and useful.

• Matrices, which are a kind of union of many components in a single whole, are subject to certain mathematical operations, which determine substantial connections between collectives of many components. Such connections can be essential for collectives of genetic elements of different levels as well.

In the history of science, the first publication about a matrix representation of molecular ensembles of the genetic coding system was the work (Konopel’chenko, Rumer, 1975), which studied symmetries in the genetic system. It represented the genetic alphabet A (adenine), C (cytosine), G (guanine), T (thymine) and the set of 16 duplets in the form of the two square matrices $[C\ G;\ T\ A]$ and $[C\ G;\ T\ A]^{(2)}$ respectively (here the 2 in brackets denotes the Kronecker power).

In this article we continue to study the molecular-genetic system in its matrix forms of representation, which has given some interesting results in the last 10 years (Petoukhov, 2005, 2008; Petoukhov, He, 2010). In these works we studied the Kronecker family of genetic matrices $[C\ T;\ A\ G]^{(n)}$ (here “n” in brackets denotes the Kronecker power), the first representatives of which are shown in Fig. 1.
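\[
[C\ T;\ A\ G] = \begin{bmatrix} \mathrm{C} & \mathrm{T} \\ \mathrm{A} & \mathrm{G} \end{bmatrix};
\qquad
[C\ T;\ A\ G]^{(2)} = \begin{bmatrix}
\mathrm{CC} & \mathrm{CT} & \mathrm{TC} & \mathrm{TT} \\
\mathrm{CA} & \mathrm{CG} & \mathrm{TA} & \mathrm{TG} \\
\mathrm{AC} & \mathrm{AT} & \mathrm{GC} & \mathrm{GT} \\
\mathrm{AA} & \mathrm{AG} & \mathrm{GA} & \mathrm{GG}
\end{bmatrix}
\]

\[
[C\ T;\ A\ G]^{(3)} = \begin{bmatrix}
\mathrm{CCC} & \mathrm{CCT} & \mathrm{CTC} & \mathrm{CTT} & \mathrm{TCC} & \mathrm{TCT} & \mathrm{TTC} & \mathrm{TTT} \\
\mathrm{CCA} & \mathrm{CCG} & \mathrm{CTA} & \mathrm{CTG} & \mathrm{TCA} & \mathrm{TCG} & \mathrm{TTA} & \mathrm{TTG} \\
\mathrm{CAC} & \mathrm{CAT} & \mathrm{CGC} & \mathrm{CGT} & \mathrm{TAC} & \mathrm{TAT} & \mathrm{TGC} & \mathrm{TGT} \\
\mathrm{CAA} & \mathrm{CAG} & \mathrm{CGA} & \mathrm{CGG} & \mathrm{TAA} & \mathrm{TAG} & \mathrm{TGA} & \mathrm{TGG} \\
\mathrm{ACC} & \mathrm{ACT} & \mathrm{ATC} & \mathrm{ATT} & \mathrm{GCC} & \mathrm{GCT} & \mathrm{GTC} & \mathrm{GTT} \\
\mathrm{ACA} & \mathrm{ACG} & \mathrm{ATA} & \mathrm{ATG} & \mathrm{GCA} & \mathrm{GCG} & \mathrm{GTA} & \mathrm{GTG} \\
\mathrm{AAC} & \mathrm{AAT} & \mathrm{AGC} & \mathrm{AGT} & \mathrm{GAC} & \mathrm{GAT} & \mathrm{GGC} & \mathrm{GGT} \\
\mathrm{AAA} & \mathrm{AAG} & \mathrm{AGA} & \mathrm{AGG} & \mathrm{GAA} & \mathrm{GAG} & \mathrm{GGA} & \mathrm{GGG}
\end{bmatrix}
\]

Figure 1: The first members of the Kronecker family of genetic symbolic matrices $[C\ T;\ A\ G]^{(n)}$. Here A, C, G and T are adenine, cytosine, guanine and thymine, respectively.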
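
The symbolic matrices of Figure 1 can be reproduced with a few lines of code. The sketch below is a minimal illustration, assuming that the "product" of two symbols in the Kronecker product is read as string concatenation; the function names are hypothetical and are not part of the cited works.

```python
def kron_symbolic(a, b):
    """Kronecker product of symbol matrices: entries are joined by string concatenation."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] + b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def kron_power_symbolic(base, n):
    """n-th Kronecker power of a symbol matrix (n >= 1)."""
    result = base
    for _ in range(n - 1):
        result = kron_symbolic(result, base)
    return result


ALPHABET = [["C", "T"],
            ["A", "G"]]

duplets = kron_power_symbolic(ALPHABET, 2)   # 4 x 4 matrix of the 16 duplets
triplets = kron_power_symbolic(ALPHABET, 3)  # 8 x 8 matrix of the 64 triplets

for row in triplets:
    print(" ".join(row))
```

Each Kronecker power doubles the matrix size, so the n-th member of the family is a 2^n x 2^n matrix containing every n-letter multiplet exactly once.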

Numeric genomatrices can be derived by replacing each symbol A, C, G, T of the nitrogenous bases in the symbolic genomatrices $[C\ T;\ A\ G]^{(n)}$ (Figure 1) by quantitative parameters of these bases. For example, let us consider the genomatrices of hydrogen bonds of these nitrogenous bases. Different authors have long suspected that the numbers of hydrogen bonds, 2 and 3, of the complementary letters of the genetic alphabet carry an important informational meaning. In addition, hydrogen plays the main role in the composition of our Universe, where hydrogen atoms make up 93% of all atoms and where “chemical influence of omnipresent hydrogen is the defining factor” (Ponnamperuma, 1972). Thus the investigation of a possible meaning of hydrogen bonds in genetic information deserves special interest.

The complementary letters C and G have 3 hydrogen bonds (C = G = 3) and the complementary letters A and T have 2 hydrogen bonds (A = T = 2). Let us replace each multiplet in the Kronecker family of the genomatrices $[C\ T;\ A\ G]^{(n)}$ by the product of the numbers of hydrogen bonds of its letters. In this case, we get the Kronecker family of numeric matrices $[3\ 2;\ 2\ 3]^{(n)}$. For example, the triplet CAT is replaced by the number 12 ($= 3 \cdot 2 \cdot 2$) in the genomatrix $[3\ 2;\ 2\ 3]^{(3)}$. Figure 2 demonstrates the first three genomatrices of this Kronecker family $[3\ 2;\ 2\ 3]^{(n)}$ constructed in this way. The numeric characteristics of each genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ are connected with the quint ratio 3:2; for this reason we conditionally name such genomatrices quint genomatrices.

\[
Q = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix};
\qquad
Q^{(2)} = \begin{bmatrix}
9 & 6 & 6 & 4 \\
6 & 9 & 4 & 6 \\
6 & 4 & 9 & 6 \\
4 & 6 & 6 & 9
\end{bmatrix}
\]

\[
Q^{(3)} = \begin{bmatrix}
27 & 18 & 18 & 12 & 18 & 12 & 12 & 8 \\
18 & 27 & 12 & 18 & 12 & 18 & 8 & 12 \\
18 & 12 & 27 & 18 & 12 & 8 & 18 & 12 \\
12 & 18 & 18 & 27 & 8 & 12 & 12 & 18 \\
18 & 12 & 12 & 8 & 27 & 18 & 18 & 12 \\
12 & 18 & 8 & 12 & 18 & 27 & 12 & 18 \\
12 & 8 & 18 & 12 & 18 & 12 & 27 & 18 \\
8 & 12 & 12 & 18 & 12 & 18 & 18 & 27
\end{bmatrix}
\]

Figure 2: The beginning of the family of the quint genomatrices $Q^{(n)} = [3\ 2;\ 2\ 3]^{(n)}$, which are based on the product of numbers of hydrogen bonds (C=G=3, A=T=2)
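
The construction of the quint genomatrices can be checked in two independent ways: by replacing every triplet of Figure 1 with the product of its hydrogen-bond numbers, and by taking the numeric Kronecker power of the 2 x 2 matrix with rows (3, 2) and (2, 3) directly. The following minimal Python sketch does both and confirms that they agree; the generic `kron` helper with a `combine` argument is an illustrative assumption, not the authors' implementation.

```python
from math import prod
from operator import add, mul


def kron(a, b, combine):
    """Kronecker product of square matrices; `combine` joins one entry of a with one of b."""
    m = len(b)
    size = len(a) * m
    return [[combine(a[i // m][j // m], b[i % m][j % m]) for j in range(size)]
            for i in range(size)]


def kron_power(base, n, combine):
    """n-th Kronecker power (n >= 1)."""
    result = base
    for _ in range(n - 1):
        result = kron(result, base, combine)
    return result


HYDROGEN_BONDS = {"C": 3, "G": 3, "A": 2, "T": 2}

triplets = kron_power([["C", "T"], ["A", "G"]], 3, add)  # symbolic matrix of Figure 1
quint_2 = kron_power([[3, 2], [2, 3]], 2, mul)           # numeric Kronecker powers
quint_3 = kron_power([[3, 2], [2, 3]], 3, mul)

# Route 1: replace every triplet by the product of its hydrogen-bond numbers.
from_triplets = [[prod(HYDROGEN_BONDS[base] for base in triplet) for triplet in row]
                 for row in triplets]

# Route 2 (the numeric Kronecker power) must give the same matrix.
assert from_triplets == quint_3

for row in quint_3:
    print(" ".join(f"{value:2d}" for value in row))
```

The agreement follows from the fact that the Kronecker product multiplies entries in exactly the positions where the symbolic product concatenates letters.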

3. THE NUMERIC GENOMATRICES AND THE GOLDEN SECTION

In biology, the genetic system provides the self-reproduction of biological organisms from generation to generation. In mathematics, the “golden section” (or the “divine proportion”) and its properties have been a mathematical symbol of self-reproduction since the Renaissance, and they were studied by Leonardo da Vinci, J. Kepler and many other prominent thinkers (see details in (Darvas, 2007; Shubnikov, Koptsik, 2005) and on the website “Museum of Harmony and Golden Section” by A. Stakhov, www.goldenmuseum.com). Is there any connection between these two systems? Yes, and this article demonstrates such an unexpected connection. The golden section is the value $\varphi = (1+\sqrt{5})/2 = 1.618\ldots$ (sometimes the inverse of this value is called the golden section in the literature). If the simplest quint genomatrix $[3\ 2;\ 2\ 3]$ is raised to the power 1/2 in the ordinary sense (that is, if we take the square root), the result is the bi-symmetric matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi] = [3\ 2;\ 2\ 3]^{1/2}$, the matrix elements of which are equal to the golden section and to its inverse value. And if any other quint genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ is raised to the power 1/2 in the ordinary sense, the result is the bi-symmetric matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$, the matrix elements of which are equal to the golden section in various integer powers with elements of symmetry among these powers (Figure 3).

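Let us recall what the square root of a nonsingular square matrix M means. It is a square matrix $M^{1/2}$ whose second power is equal to the initial matrix M: $(M^{1/2})^2 = M$. Many software packages (for example, MATLAB) can compute square roots of nonsingular square matrices. Let us demonstrate, for example, that the golden genomatrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]$ is a square root of the quint genomatrix $[3\ 2;\ 2\ 3]$. Indeed, using the ordinary rules of matrix multiplication together with the identities $\varphi^{2} + \varphi^{-2} = 3$ and $\varphi \cdot \varphi^{-1} = 1$, we get:

\[
\begin{bmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{bmatrix}
\begin{bmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{bmatrix}
=
\begin{bmatrix}
\varphi\cdot\varphi + \varphi^{-1}\cdot\varphi^{-1} & \varphi\cdot\varphi^{-1} + \varphi^{-1}\cdot\varphi \\
\varphi^{-1}\cdot\varphi + \varphi\cdot\varphi^{-1} & \varphi^{-1}\cdot\varphi^{-1} + \varphi\cdot\varphi
\end{bmatrix}
=
\begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}.
\]

Similar results can be checked for the other corresponding pairs of genomatrices: $([\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)})^{2} = [3\ 2;\ 2\ 3]^{(n)}$. Matrices all of whose elements are integer powers of the golden section $\varphi$ can be referred to as “golden matrices”. Figuratively speaking, the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ have a hidden substrate of the golden matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ (below we will explain a deep geometrical relationship between the quint matrices and the golden matrices, which are their square roots).

\[
[3\ 2;\ 2\ 3]^{1/2} = \begin{bmatrix} \varphi & \varphi^{-1} \\ \varphi^{-1} & \varphi \end{bmatrix};
\qquad
\left([3\ 2;\ 2\ 3]^{(2)}\right)^{1/2} = \begin{bmatrix}
\varphi^{2} & \varphi^{0} & \varphi^{0} & \varphi^{-2} \\
\varphi^{0} & \varphi^{2} & \varphi^{-2} & \varphi^{0} \\
\varphi^{0} & \varphi^{-2} & \varphi^{2} & \varphi^{0} \\
\varphi^{-2} & \varphi^{0} & \varphi^{0} & \varphi^{2}
\end{bmatrix}
\]

\[
\left([3\ 2;\ 2\ 3]^{(3)}\right)^{1/2} = \begin{bmatrix}
\varphi^{3} & \varphi^{1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{-1} & \varphi^{-3} \\
\varphi^{1} & \varphi^{3} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-3} & \varphi^{-1} \\
\varphi^{1} & \varphi^{-1} & \varphi^{3} & \varphi^{1} & \varphi^{-1} & \varphi^{-3} & \varphi^{1} & \varphi^{-1} \\
\varphi^{-1} & \varphi^{1} & \varphi^{1} & \varphi^{3} & \varphi^{-3} & \varphi^{-1} & \varphi^{-1} & \varphi^{1} \\
\varphi^{1} & \varphi^{-1} & \varphi^{-1} & \varphi^{-3} & \varphi^{3} & \varphi^{1} & \varphi^{1} & \varphi^{-1} \\
\varphi^{-1} & \varphi^{1} & \varphi^{-3} & \varphi^{-1} & \varphi^{1} & \varphi^{3} & \varphi^{-1} & \varphi^{1} \\
\varphi^{-1} & \varphi^{-3} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{3} & \varphi^{1} \\
\varphi^{-3} & \varphi^{-1} & \varphi^{-1} & \varphi^{1} & \varphi^{-1} & \varphi^{1} & \varphi^{1} & \varphi^{3}
\end{bmatrix}
\]

Figure 3: The beginning of the Kronecker family of the golden matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$, where $\varphi = (1+\sqrt{5})/2 = 1.618\ldots$ is the golden section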
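
The square-root relation between the golden and the quint matrices is easy to verify numerically. The sketch below is a minimal pure-Python check under two assumptions of ours: the closed-form principal square root of a 2 x 2 matrix with rows (a, b) and (b, a), obtained from its eigenvalues a+b and a-b, and a floating-point comparison with a small tolerance. The helper names are illustrative only.

```python
from math import isclose, sqrt

PHI = (1 + sqrt(5)) / 2  # the golden section


def kron(a, b):
    """Numeric Kronecker product of two square matrices."""
    m = len(b)
    size = len(a) * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(size)] for i in range(size)]


def kron_power(base, n):
    """n-th Kronecker power (n >= 1)."""
    result = base
    for _ in range(n - 1):
        result = kron(result, base)
    return result


def matmul(a, b):
    """Ordinary matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def sqrt_bisymmetric_2x2(a, b):
    """Principal square root of [a b; b a] for a > |b|, from its eigenvalues a+b and a-b."""
    s, d = sqrt(a + b), sqrt(a - b)
    return [[(s + d) / 2, (s - d) / 2], [(s - d) / 2, (s + d) / 2]]


def all_close(x, y):
    """Element-wise comparison of two matrices within a small tolerance."""
    return all(isclose(p, q, rel_tol=1e-9, abs_tol=1e-9)
               for row_x, row_y in zip(x, y) for p, q in zip(row_x, row_y))


GOLDEN = [[PHI, 1 / PHI], [1 / PHI, PHI]]
QUINT = [[3, 2], [2, 3]]

# The square root of the quint matrix has the entries phi and 1/phi.
assert all_close(sqrt_bisymmetric_2x2(3, 2), GOLDEN)

# Squaring the n-th Kronecker power of the golden matrix gives the n-th quint matrix.
for n in (1, 2, 3):
    golden_n = kron_power(GOLDEN, n)
    assert all_close(matmul(golden_n, golden_n), kron_power(QUINT, n))
```

The loop works for every n because the ordinary product of two Kronecker products is the Kronecker product of the ordinary products, so squaring commutes with the Kronecker power.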

The matrix elements of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)} = ([3\ 2;\ 2\ 3]^{(n)})^{1/2}$ can be constructed from a combination of $\varphi$ and $\varphi^{-1}$ directly by the following algorithm. We take the corresponding multiplet of the genomatrix $[C\ T;\ A\ G]^{(n)}$ and replace each of its letters C and G by $\varphi$. Then we replace each of the letters A and T in this multiplet by $\varphi^{-1}$. As a result, we obtain a chain with n links, where each link is $\varphi$ or $\varphi^{-1}$. The product of all these links gives the value of the corresponding matrix element in the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. For example, in the case of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(3)}$, let us calculate the matrix element located at the same place as the triplet CAT in the matrix $[C\ T;\ A\ G]^{(3)}$. According to the described algorithm, one should replace the letter C by $\varphi$ and the letters A and T by $\varphi^{-1}$. In this example we obtain the product $\varphi \cdot \varphi^{-1} \cdot \varphi^{-1} = \varphi^{-1}$. This is the desired value of the considered matrix element of the matrix $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(3)}$ in Figure 3.
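
The letter-replacement algorithm described above can be sketched in a few lines of Python. The function and dictionary names are hypothetical; the only rule used is the one stated in the text (C and G contribute the golden section, A and T contribute its inverse), so the exponent of the resulting power is the number of C/G letters minus the number of A/T letters.

```python
from math import isclose, sqrt

PHI = (1 + sqrt(5)) / 2  # the golden section

# Exponent contributed by each letter: C, G -> phi^(+1); A, T -> phi^(-1).
LINK_EXPONENT = {"C": 1, "G": 1, "A": -1, "T": -1}


def golden_exponent(multiplet):
    """Integer power of phi assigned to a multiplet such as 'CAT'."""
    return sum(LINK_EXPONENT[base] for base in multiplet)


def golden_element(multiplet):
    """Product of the chain of links phi / phi^(-1) for the given multiplet."""
    value = 1.0
    for base in multiplet:
        value *= PHI ** LINK_EXPONENT[base]
    return value


for triplet in ("CAT", "CCC", "TTT", "GAC"):
    print(triplet, "-> phi^%d = %.4f" % (golden_exponent(triplet), golden_element(triplet)))
```

For the triplet CAT the chain has one link equal to the golden section and two links equal to its inverse, so the exponent is -1, in agreement with the worked example in the text.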

The ratio between adjacent numbers in the numerical sequences inside each of the matrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ (for example, $\ldots\ \varphi^{3}, \varphi^{1}, \varphi^{-1}, \varphi^{-3}\ \ldots$) is always equal to $\varphi^{2}$ (or $\varphi^{-2}$). The same ratio $\varphi^{2}$ exists in regular five-pointed stars (Figure 4) as the ratio between the sides of adjacent stars inscribed in each other. Below we will use the name “pentagram musical scales” for new musical scales connected with the golden genomatrices. In view of this, let us recall that the pentagram and its metaphysical associations were explored by the Pythagoreans, who considered it an emblem of perfection and health. The Pythagoreans swore by it and used the pentagram as a distinctive sign of belonging to their community. But the pentagram was known long before Pythagoras as a sign that protects from all evil; in Ancient Babylon it was depicted on the doors of stores and warehouses to protect goods from damage and theft. It was also a sign of power and was used on royal seals. The first known images of pentagrams date back to around 3500 BC; they were found on the territory of Ancient Mesopotamia. For early Christians, the pentagram was a reminder of the five wounds of Christ: from the crown of thorns on his forehead and from the nails in his hands and feet.

Figure 4: Sizes of pentagrams inscribed in each other differ by the scale factor $\varphi^{2}$ (or $\varphi^{-2}$)

Let us recall that the value $\varphi^{2}$ (or $\varphi^{-2}$) is also well known in other genetically inherited phenomena of biological organisms, which are united under the term “phyllotaxis laws” (the authors do not know how these two cases of realization of $\varphi^{2}$ are interconnected by biological mechanisms). Hundreds of books and articles around the world are devoted to these genetically inherited laws, which are connected with Fibonacci numbers and the golden section and which describe genetically inherited configurations of a huge number of living bodies at different levels and branches of biological evolution (see the review in the book (Jean, 2010)). For example, leaf arrangements in the cases of a lime tree, an elm and a beech are characterized by the ratio 2/1; in the cases of an alder, a nut tree, a vine and a sedge, by the ratio 3/1; in the cases of a raspberry, a pear tree, a poplar and a barberry, by the ratio 8/3; in the cases of an almond tree and a sea-buckthorn, by the ratio 13/5; cones of coniferous trees correspond to the ratios 21/8, 34/13 and 55/21 in various cases. All of these integers are Fibonacci numbers, and the sequence of these ratios tends to the value $\varphi^{2} = 2.618\ldots$ . The ideal angle in phyllotaxis laws, which is termed “the Fibonacci angle”, is equal to the fraction $\varphi^{-2}$ of a full turn, i.e. $360^{\circ} \cdot \varphi^{-2} \approx 137.5^{\circ}$ (Jean, 2010, section 2.2.1).
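
The convergence of these phyllotaxis ratios can be illustrated numerically. In the minimal sketch below, each ratio is formed from Fibonacci numbers two places apart, which reproduces the values 2/1, 3/1, 8/3, 13/5, 21/8, 34/13 and 55/21 listed above; the number of terms and the variable names are arbitrary choices of this sketch.

```python
from fractions import Fraction
from math import sqrt

PHI = (1 + sqrt(5)) / 2  # the golden section


def fibonacci(count):
    """First `count` Fibonacci numbers, starting 1, 1, 2, 3, 5, ... (count >= 2)."""
    seq = [1, 1]
    while len(seq) < count:
        seq.append(seq[-1] + seq[-2])
    return seq[:count]


fib = fibonacci(32)

# Ratios of Fibonacci numbers two places apart: 2/1, 3/1, 5/2, 8/3, 13/5, 21/8, ...
ratios = [Fraction(fib[k + 2], fib[k]) for k in range(len(fib) - 2)]

for r in ratios[:8]:
    print(f"{r.numerator}/{r.denominator} = {float(r):.6f}")
print(f"phi^2 = {PHI ** 2:.6f}")

# The Fibonacci (golden) angle: the fraction phi^(-2) of a full turn, in degrees.
fibonacci_angle = 360 / PHI ** 2
print(f"Fibonacci angle = {fibonacci_angle:.2f} degrees")
```

The successive ratios alternate around the limit and approach it quickly, which is why even small Fibonacci pairs such as 8/3 or 13/5 already lie close to the limiting value.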

The golden section is present in objects of biological bodies with 5-fold symmetry (flowers, etc.), which are widespread in living nature but forbidden in classical crystallography. It also occurs in many figures of modern generalized crystallography: the quasi-crystals of D. Shechtman, R. Penrose’s mosaics, dodecahedra of ensembles of water molecules, icosahedral figures of viruses, biological phyllotaxis laws, etc. (Darvas, 2007). The article (Carrasco et al., 2009) shows that ice chains about 1 nm wide that nucleate on Cu(110) metal surfaces are built from a face-sharing arrangement of water pentagons. The pentagon structure is favored over others because it maximizes the water–metal bonding while maintaining a strong hydrogen-bonding network. This reveals an unanticipated structural adaptability of water–ice films.

In recent years, unexpected connections have been discovered between the golden section and the micro-world of quantum mechanics, which includes genetic molecules. The article (Coldea et al., 2010) describes how a chain of atoms in certain circumstances acts like a nanoscale guitar string. “Science Daily” gave a special title to its report about this discovery: “Golden Ratio Discovered in Quantum World: Hidden Symmetry Observed for the First Time in Solid State Matter”. The principal author of this paper, R. Coldea, says: “Here the tension comes from the interaction between spins causing them to magnetically resonate. For these interactions we found a series (scale) of resonant notes: the first two notes show a perfect relationship with each other. Their frequencies (pitch) are in the ratio of 1.618…, which is the golden ratio famous from art and architecture. … It reflects a beautiful property of the quantum system - a hidden symmetry. Actually quite a special one called E8 by mathematicians, and this is its first observation in a material” (http://www.sciencedaily.com/releases/2010/01/100107143909.htm).

The new theme of the golden section in genetic matrices seems to be important because many physiological systems and processes are connected with it. It is known that golden-section proportions characterize many physiological processes: cardiovascular processes, respiratory processes, electrical activity of the brain, locomotion, etc. The golden section has also long been described and investigated in phenomena of aesthetic perception. Taking these facts into account, the golden section should be considered a candidate for the role of one of the basic elements in the inherited interlinking of the physiological subsystems that provides the unity of an organism. The matrix relation between the golden section $\varphi$ and significant parameters of the genetic code testifies in favor of a molecular-genetic basis for such physiological phenomena. One can hope that the algebra of bi-symmetric genetic matrices, which are connected with the theme of the golden section, will be useful for the explanation and numerical prediction of particular parameters in different physiological subsystems of biological organisms with their cooperative essence and golden-section phenomena.

One should emphasize the deep geometrical sense of the connection between the quint
genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and the golden genomatrices
$[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. This connection deals with the notion
of the “metric tensor”, which is the main notion of Riemannian geometry (all other notions of
Riemannian geometry, such as the curvature tensor and geodesic lines, can be deduced from this
main notion) (Rashevsky, 1964; http://en.wikipedia.org/wiki/Metric_tensor). The statement is
that the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ are metric tensors, and the golden
genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ are matrices of basic
vectors of the frame of reference on which this tensor is built. Let us explain this in more
detail. By definition, a metric tensor in an n-dimensional affine space, where the operation
of scalar product exists, is determined by means of a nonsingular matrix $\|g_{ij}\|$ with the
symmetry condition $g_{ij} = g_{ji}$ (Rashevsky, 1964, p. 157). The components $g_{ij}$ of the
metric tensor are equal to the scalar products of pairs of the basic vectors $e_i$, $e_j$ of
the frame of reference on which this tensor is built. A symmetric square root of the metric
tensor $\|g_{ij}\|$ gives a square matrix whose columns can be taken as the basic vectors
$e_i$ of such a frame of reference. The quint matrices $[3\ 2;\ 2\ 3]^{(n)}$ satisfy the
definition of metric tensors. Above we took the square root of this quint metric tensor
$[3\ 2;\ 2\ 3]^{(n)}$, and as a result we obtained the golden genomatrices
$[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. It means that the metric tensors
$[3\ 2;\ 2\ 3]^{(n)}$ are built on the corresponding bunches of “golden” vectors (as the basic
vectors of their frames of reference), all components of which are equal to integer powers of
the golden section φ. For example, the genomatrix $[3\ 2;\ 2\ 3]$ can be interpreted as a
metric tensor that is built on a special affine frame of reference. This frame consists of
two basic vectors: the golden vector $e_1$ with coordinates $(\varphi, \varphi^{-1})$ and the
golden vector $e_2$ with coordinates $(\varphi^{-1}, \varphi)$. These two golden vectors
coincide with the columns of the golden genomatrix
$[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]$. Scalar products of pairs of these vectors
are equal to the components of the quint genomatrix $[3\ 2;\ 2\ 3]$:
$\langle e_1, e_1\rangle = \varphi\cdot\varphi + \varphi^{-1}\cdot\varphi^{-1} = 3$;
$\langle e_1, e_2\rangle = \varphi\cdot\varphi^{-1} + \varphi^{-1}\cdot\varphi = 2$;
$\langle e_2, e_1\rangle = \varphi^{-1}\cdot\varphi + \varphi\cdot\varphi^{-1} = 2$;
$\langle e_2, e_2\rangle = \varphi^{-1}\cdot\varphi^{-1} + \varphi\cdot\varphi = 3$.

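The scalar products of the two golden vectors can be checked numerically. The following is a minimal sketch in Python; the names PHI, dot and gram are illustrative and are not taken from the article.

```python
import math

# Golden section; PHI and the helper names below are illustrative assumptions.
PHI = (1 + math.sqrt(5)) / 2

# Columns of the golden genomatrix [phi phi^-1; phi^-1 phi].
e1 = (PHI, 1 / PHI)
e2 = (1 / PHI, PHI)


def dot(u, v):
    """Ordinary scalar product of two coordinate tuples."""
    return sum(a * b for a, b in zip(u, v))


def gram(vectors):
    """Matrix of pairwise scalar products (the metric tensor of the frame)."""
    return [[dot(u, v) for v in vectors] for u in vectors]


G = gram([e1, e2])
for row in G:
    print([round(x, 10) for x in row])
```

Up to floating-point rounding, the printed matrix should reproduce the quint genomatrix with rows 3, 2 and 2, 3.
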
One additional remark is the following. To interpret the matrix $[3\ 2;\ 2\ 3]$ correctly as a
metric tensor of a 2-dimensional plane, one should indicate a group of transformations in
relation to which this matrix plays the role of a tensor. In the considered case, for
example, the group of rotations can be taken, because rotations conserve the values of the
scalar products of the golden frame vectors (though the coordinates of these vectors change
under such transformations). A similar situation holds true for the other corresponding pairs
of the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and golden genomatrices
$[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$. As a result, one can say that the
considered Kronecker families of the quint genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ and golden
genomatrices $[\varphi\ \varphi^{-1};\ \varphi^{-1}\ \varphi]^{(n)}$ are closely connected
from a geometrical point of view or, in other words, they form a geometric organic whole. It
should be added that Riemannian geometry is essential for the study of genetically inherited
curved surfaces and lines of biological bodies: these curvilinear configurations are endowed
with an internal metric that is described by means of Riemannian geometry (in view of this,
some mathematical models of biological morphogenesis can be developed on the basis of this
geometry and its metric tensors).

The molecular system of the genetic alphabet is constructed by nature in such a manner
that not only the numeric parameters of hydrogen bonds lead to the quint and golden
genomatrices: some other significant parameters of genetic molecules lead to quint and
golden matrices by analogy. For example, the quantities of atoms in the molecular rings of
pyrimidines and purines are such parameters: the rings of a purine contain 9 atoms and the
ring of a pyrimidine contains 6 atoms (Figure 7). From the viewpoint of this kind of
parameter, C = T = 6 and A = G = 9. The ratio 9:6 is equal to the quint 3:2. Thus the symbolic
matrices $[A\ C;\ T\ G]^{(n)}$, $[G\ C;\ T\ A]^{(n)}$, $[A\ T;\ C\ G]^{(n)}$,
$[G\ T;\ A\ C]^{(n)}$ become threefold quint matrices in the Kronecker power “n” when their
symbolic elements are replaced by these numbers 9 and 6. The square root of such numeric
matrices is evidently connected with the golden matrices. A biological organism is a master
in the use of a set of parallel information channels. It is enough to recall the many
sensory channels through which we obtain sensory information simultaneously: visual,
acoustic, tactile, etc. It is probable that many kinds of genetic matrices are used by
organisms in parallel information channels as well.

4 THE GENOMATRICES, MUSICAL HARMONY AND PYTHAGOREAN MUSICAL SCALE

The theme of the harmony of living nature is frequently discussed by many authors. The
word “harmony” arose in Ancient Greece in relation to the Pythagorean musical scale. In the
ancient theory of music, the word “harmony” acquired its modern meaning: the concord of the
discordant. Seven musical notes carry names familiar to all: do (C), re (D), mi (E), fa (F),
sol (G), la (A), si (B). These seven notes are not interrelated by their frequencies in an
accidental manner; they form a regular uniform ensemble. Indeed, it is well known that the
seven notes of the Pythagorean musical scale, taken from appropriate octaves, form a regular
sequence: a geometric progression based on the quint ratio 3:2 between the frequencies of
adjacent members of this sequence (Figure 5). The quint 3:2, which is the ratio between the
frequencies of the third and the second harmonics of an oscillating string, plays the role
of the factor of this geometric progression. The frequency 293 Hz of the note re (D1) of the
first octave stands in the middle of this frequency series. The ratios of the frequencies of
all notes to this frequency of the note re (D1) form a series of powers of the quint that is
symmetrical in the signs and sizes of the exponents: from the power “-3” up to the power “+3”.

Note:             fa (F)        do (C)        sol (G)       re (D1)      la (A1)      mi (E2)      si (B2)
Frequency (Hz):   87            130           196           293          440          660          990
Ratio to re (D1): $(3/2)^{-3}$  $(3/2)^{-2}$  $(3/2)^{-1}$  $(3/2)^{0}$  $(3/2)^{1}$  $(3/2)^{2}$  $(3/2)^{3}$

Figure 5: The quint (or perfect fifth) sequence of the 7 notes of the Pythagorean musical scale. The upper
row shows the notes. The second row shows their frequencies. The third row shows the ratios of the
frequencies of these notes to the frequency 293 Hz of the note re (D1). The notes are designated in the
Helmholtz system. Frequency values are rounded to integers.
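The frequency row of Figure 5 can be regenerated from the quint factor alone. The sketch below is illustrative: it assumes that the rounded values of the figure are anchored at la (A1) = 440 Hz, which is an inference from the numbers and is not stated in the text; the names NOTES, ANCHOR_INDEX and quint_frequencies are hypothetical.

```python
# Notes of Figure 5 in quint order.
NOTES = ["fa (F)", "do (C)", "sol (G)", "re (D1)", "la (A1)", "mi (E2)", "si (B2)"]

# Assumption: the series is anchored at la (A1) = 440 Hz.
ANCHOR_INDEX = 4
ANCHOR_HZ = 440.0
QUINT = 3 / 2


def quint_frequencies():
    """Frequencies of the seven notes as a geometric progression with factor 3/2."""
    return [ANCHOR_HZ * QUINT ** (i - ANCHOR_INDEX) for i in range(len(NOTES))]


frequencies = quint_frequencies()
rounded = [round(f) for f in frequencies]
for note, hz in zip(NOTES, rounded):
    print(f"{note:8s} {hz:4d} Hz")
```

Under this assumption the rounded values coincide with the second row of Figure 5.
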

The Kronecker family of the genomatrices $[3\ 2;\ 2\ 3]^{(n)}$ is connected with the
Pythagorean musical scale. Let us consider this more attentively. Each genomatrix of the
family $[3\ 2;\ 2\ 3]^{(n)}$ demonstrates the quint (or perfect fifth) principle in its
structure, because it has the quint ratio 3:2 at different levels: between the numerical sums
in the top and bottom quadrants, sub-quadrants, sub-sub-quadrants, etc., including quint
ratios between neighboring numbers in them. For example, $[3\ 2;\ 2\ 3]^{(3)}$ contains 4
numbers (27, 18, 12, 8) with the quint ratio between them: 27/18 = 18/12 = 12/8 = 3/2.

Each quint genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ contains (n+1) kinds of numbers from a geometric
progression whose factor is equal to the quint 3/2:

$$
\begin{aligned}
[3\ 2;\ 2\ 3]^{(1)} &\to 3,\ 2 \\
[3\ 2;\ 2\ 3]^{(2)} &\to 9,\ 6,\ 4 \\
[3\ 2;\ 2\ 3]^{(3)} &\to 27,\ 18,\ 12,\ 8 \\
&\ \ \vdots \\
[3\ 2;\ 2\ 3]^{(6)} &\to 729,\ 486,\ 324,\ 216,\ 144,\ 96,\ 64 \\
&\ \ \vdots
\end{aligned}
$$
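The list of (n+1) kinds of numbers can be reproduced by forming the Kronecker powers directly. This is a minimal sketch in plain Python; the helper names kron, kron_power and kinds_of_numbers are illustrative and not part of the article.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(matrix, n):
    """n-th Kronecker power of a matrix (n >= 1)."""
    result = matrix
    for _ in range(n - 1):
        result = kron(result, matrix)
    return result


def kinds_of_numbers(matrix):
    """Distinct entries of a matrix, largest first."""
    return sorted({x for row in matrix for x in row}, reverse=True)


QUINT = [[3, 2], [2, 3]]

for n in (1, 2, 3, 6):
    print(n, kinds_of_numbers(kron_power(QUINT, n)))
```

Each printed list has n+1 members, and neighboring members stand in the ratio 3:2.
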

Let us write out these kinds of numbers in columns for each genomatrix [3 2; 2 3](n) to
arrive at the “genetic” triangle, which is shown on the left part of the expression (1):

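$$
\begin{array}{rrrrrl}
3 & 9 & 27 & 81 & 243 & \dots \\
2 & 6 & 18 & 54 & 162 & \dots \\
  & 4 & 12 & 36 & 108 & \dots \\
  &   & 8  & 24 & 72  & \dots \\
  &   &    & 16 & 48  & \dots \\
  &   &    &    & 32  & \dots
\end{array}
\qquad\qquad
\begin{array}{rrrr}
1 & 3 & 9 & 27 \\
2 &   &   &    \\
4 &   &   &    \\
8 &   &   &
\end{array}
\tag{1}
$$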

On the right side of the expression (1), the historically famous numeric triangle of Plato
is shown. This triangle was utilized by the Ancient Greeks to create the Pythagorean
musical scale on the basis of its main proportions. One can see the analogy between the
“genetic” triangle and Plato’s triangle.

Moreover, as Jay Kappraff (USA) informed one of the authors of this article in a private
letter, this genetic triangle, which was obtained from the matrices of the genetic code, was
known many centuries ago: it is identical to the famous triangle that was published 2000
years ago by Nicomachus of Gerasa in his famous book “Introduction to Arithmetic”.
Nicomachus belonged to the Pythagorean society, and this triangle was famous for centuries
as the basis of the Pythagorean theory of musical harmony and aesthetics. The Parthenon
(Kappraff, 2006) and other great architectural objects were created in accordance with this
triangle, because architecture was interpreted as motionless music and music was interpreted
as dynamic architecture. Nicomachus of Gerasa was one of the great figures in the theory of
musical harmony and aesthetics. The Cambridge library holds an ancient picture in which
Nicomachus is shown together with other great figures in this field: Pythagoras, Plato and
Boethius (http://www.jcsparks.com/painted/boethius.html). One can find more details about the
triangle of Nicomachus of Gerasa in the publications (Kappraff, 2000, 2002). This unexpected
connection across the ages lends additional plausibility to the presented matrix approach to
the study of genetic systems and to the assumed connection of genetic systems with the
Pythagorean musical scale, reflected unconsciously in Nicomachus’ triangle.

As we mentioned above, the set of kinds of numbers in each genomatrix $[3\ 2;\ 2\ 3]^{(n)}$
reproduces fragments of geometric progressions with the quint factor. Thus sequences of such
kinds of numbers can be compared to the quint sequences of musical notes from Figure 5. If
one matches the least number of a quint genomatrix with the frequency 87 Hz of the musical
note “fa” (F), which has the lowest frequency in Figure 5, then all sequences of such kinds
of numbers automatically correspond to the series of frequencies of the musical notes: for
example, the sequence of numbers 8, 12, 18, 27 of $[3\ 2;\ 2\ 3]^{(3)}$ is assumed to
correspond to the frequency sequence 87, 130, 196, 293 Hz of the notes fa (F) - do (C) -
sol (G) - re (D1). The genomatrix $[3\ 2;\ 2\ 3]^{(6)}$ contains the sequence of 7 numbers
(64, 96, 144, 216, 324, 486, 729), which is assumed to correspond to the whole quint sequence
of the frequencies 87, 130, 196, 293, 440, 660, 990 Hz of the 7 notes of Figure 5: fa (F) -
do (C) - sol (G) - re (D1) - la (A1) - mi (E2) - si (B2).

For this reason, we assume that each genomatrix $[3\ 2;\ 2\ 3]^{(n)}$ can be presented in the
form of a matrix $P_{MUSIC}^{(n)}$ of frequencies of notes (or a “music-matrix”). For
instance, Figure 6 demonstrates the genomatrix $[3\ 2;\ 2\ 3]^{(3)}$ of the 64 triplets as a
music-matrix $P_{MUSIC}^{(3)}$ of frequencies of the appropriate four notes (the general
factor 293/27 arises to bring the numeric values of the note frequencies into concordance
with the numbers 8, 12, 18, 27 of the genomatrix $[3\ 2;\ 2\ 3]^{(3)}$).

re (D1)  sol (G)  sol (G)  do (C)   sol (G)  do (C)   do (C)   fa (F)
sol (G)  re (D1)  do (C)   sol (G)  do (C)   sol (G)  fa (F)   do (C)
sol (G)  do (C)   re (D1)  sol (G)  do (C)   fa (F)   sol (G)  do (C)
do (C)   sol (G)  sol (G)  re (D1)  fa (F)   do (C)   do (C)   sol (G)
sol (G)  do (C)   do (C)   fa (F)   re (D1)  sol (G)  sol (G)  do (C)
do (C)   sol (G)  fa (F)   do (C)   sol (G)  re (D1)  do (C)   sol (G)
do (C)   fa (F)   sol (G)  do (C)   sol (G)  do (C)   re (D1)  sol (G)
fa (F)   do (C)   do (C)   sol (G)  do (C)   sol (G)  sol (G)  re (D1)

Figure 6: A presentation of the genomatrix [3 2; 2 3](3)*(293/27) in the form of the music-matrix PMUSIC(3) of
the frequencies of the musical notes (see Figure 5)
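The music-matrix of Figure 6 can be regenerated programmatically. The sketch below does not form the Kronecker power explicitly; instead it uses an equivalent shortcut (an assumption of this example, not the authors' procedure): the entry in row i and column j of the third Kronecker power of [3 2; 2 3] equals 3 for every binary digit in which i and j agree, times 2 for every digit in which they differ. The names genomatrix_entry, NOTE_BY_NUMBER and music_matrix are illustrative.

```python
# Notes assigned to the four kinds of numbers, as in Figures 5 and 6.
NOTE_BY_NUMBER = {27: "re (D1)", 18: "sol (G)", 12: "do (C)", 8: "fa (F)"}
SCALE_FACTOR = 293 / 27  # general factor mentioned in the text


def genomatrix_entry(i, j, n=3):
    """Entry (i, j) of the n-th Kronecker power of [3 2; 2 3].

    Shortcut: count the binary digits in which i and j differ.
    """
    differing = bin(i ^ j).count("1")
    return 3 ** (n - differing) * 2 ** differing


size = 2 ** 3
number_matrix = [[genomatrix_entry(i, j) for j in range(size)] for i in range(size)]
music_matrix = [[NOTE_BY_NUMBER[x] for x in row] for row in number_matrix]
frequency_matrix = [[x * SCALE_FACTOR for x in row] for row in number_matrix]

for row in music_matrix:
    print("  ".join(f"{note:8s}" for note in row))
```

The printed rows can be compared with Figure 6 line by line.
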

The four numbers 8=2*2*2, 12=2*2*3, 18=2*3*3, 27=3*3*3, which are presented in the
genomatrix $[3\ 2;\ 2\ 3]^{(3)}$ in Figure 2, characterize the four kinds of triplets that
differ in the numbers of hydrogen bonds of their nitrogenous bases. For instance, the number
18=2*3*3 belongs to those triplets that have one nitrogenous base with 2 hydrogen bonds and
two bases with 3 hydrogen bonds (the mathematics of genomatrices testifies that products of
the numbers of hydrogen bonds, not their sums, should be taken into account here; this has
precedents and justification in information theory, in particular in the theory of parallel
channels of coding and processing of information). Different sequences of these four
numbers, for example 12-8-27-12-8-18-18-…, determine corresponding successions of the musical
ratios $(3:2)^0$, $(3:2)^1$, $(3:2)^2$, $(3:2)^3$ and their inverses between each subsequent
and previous member (in this example, $2:3$ - $(3:2)^3$ - $(2:3)^2$ - $2:3$ - $(3:2)^2$ -
$(3:2)^0$ - …). It is obvious that such a succession can be interpreted as a kind of genetic
music for triplets, which is connected with their hydrogen bonds. Each gene and each part of
DNA and RNA has its own genetic “melody of hydrogen bonds”, which can be played on musical
instruments.
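The passage from a nucleotide sequence to its “melody of hydrogen bonds” can be sketched as follows. The DNA string used here is a hypothetical example chosen only to reproduce the number sequence 12-8-27-12-8-18-18 quoted in the text; the helper names are likewise illustrative.

```python
# Hydrogen bonds per base: A and T form 2 bonds, C and G form 3 bonds.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def triplet_numbers(dna):
    """Product of hydrogen-bond numbers for each complete triplet."""
    numbers = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        a, b, c = dna[start:start + 3]
        numbers.append(H_BONDS[a] * H_BONDS[b] * H_BONDS[c])
    return numbers


def power_of_three(number):
    """How many times 3 divides the number."""
    count = 0
    while number % 3 == 0:
        number //= 3
        count += 1
    return count


def quint_exponents(numbers):
    """Exponent k of each step, where next / previous = (3/2)**k."""
    return [
        power_of_three(nxt) - power_of_three(prev)
        for prev, nxt in zip(numbers, numbers[1:])
    ]


# Hypothetical sequence, not taken from any real gene.
EXAMPLE_DNA = "ATGATAGCCTACTTTCGAGAC"
numbers = triplet_numbers(EXAMPLE_DNA)
exponents = quint_exponents(numbers)
print(numbers)
print(exponents)
```

A negative exponent means a descending step; for instance, the step from 12 to 8 has exponent -1, that is, the ratio 2:3.
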

But the described musical sequence is by no means the only one in the DNA molecule. DNA
can be considered as a set of joint sequences that are very different in their
physical-chemical sense: a sequence of nitrogenous bases; a sequence of hydrogen bonds of
complementary pairs of these bases; a sequence of triplets; a sequence of rings of
nitrogenous bases; a sequence of ensembles of protons in the rings of nitrogenous bases,
etc. One can note the phenomenological fact that many of these sequences are constructed on
quint ratios between quantitative characteristics of their neighboring members, ratios that
are typical of the Pythagorean musical scale (as mentioned above). Correspondingly, each of
these sequences of ratios can be interpreted as a special kind of genetic musical melody.
The whole set of such sequences in DNA can be considered as a polyphonic (coordinated)
musical ensemble. An investigation of this musical ensemble seems to be an important
scientific task.

Let us demonstrate a few additional examples of sequences with musical ratios in DNA. A
sequence of triplets in DNA also carries another kind of genetic music, which is connected
with the quantity of protons in the molecular rings of nitrogenous bases (Figure 7). The
pyrimidines C and T have 40 protons in their rings; the purines A and G have 60 protons in
their rings. (Each complementary pair has precisely 100 protons in its rings.) The ratio
60:40 is equal to the quint 3:2. Let us represent each triplet by the product of the proton
numbers 40 and 60 of its rings (as we did above for the numbers 2 and 3 of the hydrogen bonds
of triplets). Then any triplet has one of four proton numbers: 64000=40*40*40;
96000=40*40*60; 144000=40*60*60; 216000=60*60*60. This proton set of four numbers differs
from the considered set of four numbers 8, 12, 18, 27 of hydrogen bonds in the triplets only
by the factor 8000. In other words, the ratio of the greater to the smaller of any two
different numbers from this proton set again has a quint character and is equal to one of
the values $(3:2)^k$, where k = 1, 2, 3. One can note that a sequence of triplets of one DNA
filament has two different sequences with the same typical ratios: one sequence for the
triplet characteristics of its hydrogen bonds and another sequence for the triplet
characteristics of protons in the triplet rings. Generally speaking, these two sequences
differ from each other in the disposition of these ratios along the DNA filament (Figure 7).
So any triplet sequence carries two different genetic melodies on these two parameters.
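The two melodies carried by one and the same triplet sequence can be compared in a short sketch. The proton and hydrogen-bond numbers follow the text; the DNA string and all helper names are hypothetical and serve only as an illustration.

```python
PURINES = set("AG")       # 60 ring protons each; pyrimidines C, T have 40
STRONG_BASES = set("CG")  # 3 hydrogen bonds each; A, T have 2


def triplets(dna):
    """Split a sequence into complete triplets."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def proton_number(triplet):
    """Product of ring-proton numbers (40 or 60) over the three bases."""
    product = 1
    for base in triplet:
        product *= 60 if base in PURINES else 40
    return product


def hydrogen_number(triplet):
    """Product of hydrogen-bond numbers (2 or 3) over the three bases."""
    product = 1
    for base in triplet:
        product *= 3 if base in STRONG_BASES else 2
    return product


def melody(triplet_list, marked_bases):
    """Exponents k of (3/2)**k between neighboring triplets.

    k is the change in the count of bases that carry the larger value.
    """
    counts = [sum(base in marked_bases for base in t) for t in triplet_list]
    return [b - a for a, b in zip(counts, counts[1:])]


codons = triplets("ATGATAGCC")  # hypothetical fragment
print([hydrogen_number(t) for t in codons], melody(codons, STRONG_BASES))
print([proton_number(t) for t in codons], melody(codons, PURINES))
```

For this fragment the hydrogen-bond melody and the proton melody use the same kind of quint steps but place them differently along the sequence.
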

Figure 7: Top: complementary pairs of the four nitrogenous bases in DNA: A - T and C - G. Dotted lines
indicate hydrogen bonds in these pairs. Black circles are atoms of carbon, small white circles are hydrogen,
circles with the letter N are nitrogen, and circles with the letter O are oxygen. Bottom: numerical
representations of a sequence of complementary base pairs in DNA as a sequence of the numbers of
hydrogen bonds in the given pairs (the middle row, made up of the numbers 2 and 3) and as a numerical
sequence of the protons of the molecular rings of these nitrogenous bases

The sequential dispositions of musical ratios for these two parameters of triplets (and of
nitrogenous bases as well) are different on the two filaments of DNA, but they are connected
in a regular manner owing to the complementarity of base pairs. Figuratively speaking, the
two filaments of DNA carry complementary kinds of genetic music on these parameters.

An atomic parameter of nitrogenous bases should be added: the quantity of non-hydrogen
atoms in the molecular rings of the pyrimidines C and T is equal to 6, and the quantity of
non-hydrogen atoms in the molecular rings of the purines A and G is equal to 9. Their quint
ratio 9:6=3:2 can be considered as a basis for an “atomic” genetic music of the nitrogenous
bases and triplets along DNA. But these sequences of ratios are identical to the sequences
of ratios in the case considered above of 40 and 60 protons in the rings of the pyrimidines
and the purines. For this reason these sequences offer nothing new from a musical viewpoint,
though they can have an important meaning in the ensemble of genetic music because they are
organized on a higher, atomic, level.

A sequence of the numbers 2 and 3 of hydrogen bonds between complementary nitrogenous
bases along DNA (for instance, 3-2-2-3-2-3-…) determines a sequence of ratios between its
neighboring (subsequent and previous) members (in the considered example, 2:3 - 2:2 - 3:2 -
2:3 - …). This simple sequence contains only the ratios $(3/2)^{-1}$, $(3/2)^{0}$ and
$(3/2)^{1}$. From the viewpoint of the musical analogy, this sequence determines a special
kind of very simple genetic music.

The quantities of molecular rings in the purines and the pyrimidines are characterized by
the octave ratio 2:1. This fact gives an additional possibility to consider sequences of
nitrogenous bases and triplets in DNA as genetic melodies. But the sequences of ratios in
these cases contain octave ratios only and are not so interesting from a musical viewpoint,
though they can play an important role in the whole ensemble of genetic music.

The total quantities of protons in both pairs of nitrogenous bases A-T and C-G are the same
and are equal to 136. For this numeric parameter, a sequence of nitrogenous base pairs has
the constant ratio 1:1 along DNA.

The full list of the different kinds of such genetic music for different parameters and
levels of the genetic system permits one to reproduce a polyphonic musical score for each
gene and for other parts of the genetic system. These musical sequences were created by
nature itself. Each gene and each protein has its own genetic musical composition (or
briefly “genomusic”). The natural music of genes can be reproduced in the acoustic range
not only for aesthetic pleasure but, perhaps, also for medical therapy, for theoretical
needs, etc. (applications of genomusic in the field of musical therapy have not been tested
by the authors). This natural genomusic and its compositions can be connected to deep
physiological archetypes, which were introduced into science by the creator of analytical
psychology, Carl Jung. From the viewpoint of musical harmony in the structures of the
molecular-genetic system, outstanding composers are researchers of harmony in the
organization of living substance. According to the famous expression of G. Leibniz, music is
the mysterious arithmetic of the soul, which calculates without being aware of this action
(here one should note the difference in the order of magnitude between the wavelengths of
musical tones and the size of the molecular nitrogenous bases; this fact testifies in favor
of the informational nature of musical action).

It is often reported that some kinds of music stimulate the growth of plants, help to cure
people, etc. The American Music Therapy Association unites a few thousand members, many of
whom are professional therapists. One should emphasize that the “melodies” of the mentioned
genetic music are not formed by any person in an arbitrary way; they are defined by the
natural sequences of parameters in chain genetic molecules (although applications of the
genetic music for musical therapy have not yet been tested, we may suppose that this kind of
music is closer to biological organisms than the former ones). Such genetic melodies are
conditionally named “natural genetic music” to distinguish them from the variants of
“genetic music” sometimes offered by other authors on the basis of obviously arbitrary
approaches without sufficient support from the molecular features of genetic sequences. The
point is that some authors (see for example http://www.youtube.com/watch?v=tQv5Ho8zsKI)
propose their own “genetic music” on the basis of an arbitrary correspondence of the genetic
letters or triplets to musical notes, without sufficient attention to the musical
correspondence of the ratios of natural numeric parameters of adjacent genetic elements.
Such attempts to create arbitrary “genetic music” are related to the long-standing
hypothesis that precisely the genetic system is the carrier of the genetically inherited
connection of biological organisms with the phenomenon of music (see for example
http://discovermagazine.com/2001/aug/featmusic#.UMyvN0I3tXU).

All physiological systems should be structurally coordinated with the genetic code for
their genetic transfer to the next generations and for survival in the course of biological
evolution. For this reason we collect examples of harmonious ratios (first of all, the
quint 3:2) in structures and functions at different levels of biological systems, including
the supra-molecular level. For example, the quint ratio 3:2 exists between:
• durations of the phases of activity and rest in human cardio-cycles (0.6 sec
and 0.4 sec, respectively);
• plasmatic and globular volumes of blood (60% and 40%);
• albumins and globulins of blood (60% and 40%);
• 60S and 40S sub-particles in the composition of ribosomes (from
http://vivovoco.rsl.ru/VV/JOURNAL/NATURE/08_03/KISSELEV.HTM).
Now let us consider a well-known algorithm for the construction of the Pythagorean
musical scale from a geometric progression whose factor is equal to the quint. This
algorithm, which is useful for the theme of the next section, creates the sequence
of the notes do-re-mi-fa-sol-la-si-do on the frequency interval {1, 2} of one
octave, in which the lowest note “do” has the conventional frequency 1
and the lowest note of the next octave has the conventional frequency 2.
This algorithm contains the following steps:
1. Taking the first seven members of the geometric progression with the quint
factor 3/2 that begins from the inverse value of the quint: $(3/2)^{-1}$, $(3/2)^{0}$,
$(3/2)^{1}$, $(3/2)^{2}$, $(3/2)^{3}$, $(3/2)^{4}$, $(3/2)^{5}$;
2. Returning into the octave interval {1, 2} those members of this sequence whose
values overstep the limits of this interval; this return is made by multiplying
or dividing these values by the number 2 as many times as needed. As a result of this
operation, a new sequence appears (this sequence can be named “the geometric
progression with return into the octave”): $2\cdot(3/2)^{-1}$, $(3/2)^{0}$, $(3/2)^{1}$,
$(3/2)^{2}/2$, $(3/2)^{3}/2$, $(3/2)^{4}/4$, $(3/2)^{5}/4$;
3. Permuting these seven members in the order of their increasing values from 1
up to 2 (the number 2 is included in this sequence as the end of the octave):
$(3/2)^{0}$, $(3/2)^{2}/2$, $(3/2)^{4}/4$, $2\cdot(3/2)^{-1}$, $(3/2)^{1}$, $(3/2)^{3}/2$,
$(3/2)^{5}/4$, 2.

In this last sequence, the ratio of a greater number to the adjacent smaller number is
referred to as the interval factor. Only two kinds of interval factors exist in this
sequence: 9/8, which is named the tone-interval T, and 256/243, which is named the
semitone-interval S. One can check that the sequence of interval factors in this case is
T-T-S-T-T-T-S. These five tone-intervals and two semitone-intervals cover the octave
precisely: $(9/8)^5 \cdot (256/243)^2 = 2$. It is known that the name “semitone-interval” in
the Pythagorean musical scale is used by convention only, because the semitone-interval
256/243 = 1.0535… is not equal to half of the tone-interval, that is, to the square root of
the tone-interval: $(9/8)^{0.5}$ = 1.0607… . If one takes not 7, but 6 or 8 members in the
initial quint geometric progression (see the first step of the algorithm), then the same
Pythagorean algorithm does not give a binary sequence of interval factors T and S, because
three kinds of interval factors arise.
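The three steps of the algorithm and the resulting interval factors can be verified with exact rational arithmetic. This is a minimal sketch in Python; the function names fold_into_octave, quint_scale and interval_factors are illustrative and do not appear in the article.

```python
from fractions import Fraction

QUINT = Fraction(3, 2)


def fold_into_octave(value):
    """Step 2: multiply or divide by 2 until 1 <= value < 2."""
    while value >= 2:
        value /= 2
    while value < 1:
        value *= 2
    return value


def quint_scale(count=7):
    """Steps 1-3: take `count` members starting from (3/2)**-1, fold, sort, close with 2."""
    members = [QUINT ** k for k in range(-1, count - 1)]
    return sorted(fold_into_octave(m) for m in members) + [Fraction(2)]


def interval_factors(scale):
    """Ratios of each member to the adjacent smaller one."""
    return [b / a for a, b in zip(scale, scale[1:])]


TONE = Fraction(9, 8)
SEMITONE = Fraction(256, 243)

factors = interval_factors(quint_scale(7))
pattern = "-".join("T" if f == TONE else "S" for f in factors)
print(pattern)

for count in (6, 7, 8):
    kinds = sorted(set(interval_factors(quint_scale(count))))
    print(count, [str(k) for k in kinds])
```

With 7 members only the factors 9/8 and 256/243 occur; with 6 or 8 members a third factor appears (32/27 and 2187/2048, respectively), in agreement with the remark at the end of the paragraph.
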

A similar algorithm will be used in the next section to construct new mathematical
scales on the basis of the described data about the genetic code and its genomatrices.

5 PENTAGRAM MUSICAL SCALES AND FIBONACCI NUMBERS

Many theorists of music have paid attention to the connection of the structure of many
musical compositions of prominent composers with the golden section
$\varphi = (1+5^{0.5})/2$ = 1.618… (see, for example, (Lendvai, 1993) and the web-site about
the Hungarian composer Bela Bartok and the musicologist Erno Lendvai
http://mathcs.holycross.edu/~groberts/Courses/Mont2/Handouts/Lectures/Bartok-web.pdf).
The results of matrix genetics reveal a new direction of thought about the relation between
the golden section, Fibonacci numbers and music, because structures of the genetic code are
also (although in another way) connected with the golden section.

Similarly to a quint genomatrix $[3\ 2;\ 2\ 3]^{(n)}$, which contains a sequence of (n+1)
kinds of numbers from a geometric progression with the quint factor 3/2, the corresponding
golden genomatrix $\Phi^{(n)}$ contains a sequence of (n+1) kinds of numbers from a geometric
progression whose factor is equal to $\varphi^2$ = 2.618…:

$$
\begin{aligned}
\Phi^{(1)} &\to \varphi^{1},\ \varphi^{-1} \\
\Phi^{(2)} &\to \varphi^{2},\ \varphi^{0},\ \varphi^{-2} \\
\Phi^{(3)} &\to \varphi^{3},\ \varphi^{1},\ \varphi^{-1},\ \varphi^{-3} \\
&\ \ \vdots
\end{aligned}
\tag{2}
$$
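The exponents listed in (2) can be generated without floating-point arithmetic by working with the exponents of φ themselves: multiplying powers of φ adds their exponents, so a Kronecker product of golden genomatrices becomes a “Kronecker sum” of exponent matrices. This representation is a device of the sketch below, not the authors' notation, and the helper names are illustrative.

```python
def kron_exponents(a, b):
    """Kronecker product written for exponents: phi**x * phi**y = phi**(x + y)."""
    return [
        [x + y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


# Exponents of phi in the golden genomatrix [phi phi^-1; phi^-1 phi].
GOLDEN_EXPONENTS = [[1, -1], [-1, 1]]


def golden_exponent_kinds(n):
    """Distinct exponents of phi in the n-th Kronecker power, largest first."""
    result = GOLDEN_EXPONENTS
    for _ in range(n - 1):
        result = kron_exponents(result, GOLDEN_EXPONENTS)
    return sorted({x for row in result for x in row}, reverse=True)


for n in (1, 2, 3):
    print(n, golden_exponent_kinds(n))
```

Neighboring exponents always differ by 2, which corresponds to the progression factor φ squared.
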

The previous section demonstrated that the Kronecker family of the quint genomatrices
is connected with the Pythagorean musical scale. Now we turn to the Kronecker family
of the golden genomatrices and to the geometric progressions with the factor $\varphi^2$. Is
it possible to apply the described Pythagorean algorithm to such geometric progressions
with the factor $\varphi^2$ to arrive at a new musical (or mathematical) scale in which only
two interval factors exist, by analogy with the Pythagorean musical scale? Investigation of
this question seems important because such a new scale or scales can be essential for a
theory of musical harmony and for the creation of musical compositions with increased
physiological activity.

Research on this question gives a beautiful positive result: yes, it is possible every
time we take one of the Fibonacci numbers 2, 3, 5, 8, 13 (see Figure 8) as the number of
initial members of such a geometric progression (the situation becomes more difficult for
the higher Fibonacci numbers 21, 34, …). The mathematical scales formed in these cases
contain each of their two interval factors in a quantity that is equal to a Fibonacci
number as well. Moreover, the value of each of these two interval factors is expressed by
means of Fibonacci numbers, too.

n      0   1   2   3   4   5   6   7   8   9   10  11  …
$F_n$  0   1   1   2   3   5   8   13  21  34  55  89  …

Figure 8: The Fibonacci series, where $F_{n+1} = F_n + F_{n-1}$


Such interrelated scales, each of which has only two kinds of interval factors and is
based on the geometric progression with the coefficient $\varphi^2$, are named “the
Fibonacci-stage scales” or “the pentagram scales”. Let us consider the example of the
8-stage pentagram scale. We should construct a new mathematical scale of frequencies, which
fills up the octave {1, 2}, by means of the Pythagorean algorithm on the basis of a
geometric progression with the irrational coefficient $\varphi^2$ (instead of the quint
coefficient 3/2). As a result we should arrive at a scale that possesses only two kinds of
interval factors, by analogy with the Pythagorean musical scale. One can note that the
factor $\varphi^2$ = 2.618… exceeds the considered interval of the octave {1, 2}. Therefore
it is convenient to use from the very beginning the halved factor $\varphi^2/2 = p$ =
1.309…, the value of which belongs to this octave interval. It is easy to check that the
final sequence (3) of the 8-stage pentagram scale does not depend on whether we use the
factor $\varphi^2$ or the factor $\varphi^2/2$, which are equivalent to each other in the
given problem. This factor $p = \varphi^2/2$ has long been known in the field of
investigations of biological symmetries and morphological invariants under the name of the
golden wurf (Petoukhov, 1981; Petoukhov, He, 2010).

Now let us construct the 8-stage pentagram scale by means of the analogue of the
described Pythagorean algorithm, using the factor $p = \varphi^2/2$ in the initial geometric
progression (instead of the quint factor 3/2). All three steps of the Pythagorean
algorithm are reproduced:

1. Taking the first eight (!) members of the geometric progression with the
factor $p = \varphi^2/2$ that begins from the inverse value of this factor: $p^{-1}$,
$p^{0}$, $p^{1}$, $p^{2}$, $p^{3}$, $p^{4}$, $p^{5}$, $p^{6}$;
2. Returning into the octave interval {1, 2} those members of this sequence whose
values overstep the limits of this interval; this return is made by multiplying
or dividing these values by the number 2 as many times as needed. As a result of this
operation, a new sequence is obtained (this sequence can be named “the geometric
progression with return to the octave”): $2p^{-1}$, $p^{0}$, $p^{1}$, $p^{2}$, $p^{3}/2$,
$p^{4}/2$, $p^{5}/2$, $p^{6}/4$;
3. Permuting these eight members in the order of their increasing values from 1
up to 2 (the number 2 is included in this sequence as the end of the octave):

$$
1,\ p^{3}/2,\ p^{6}/4,\ p^{1},\ p^{4}/2,\ 2p^{-1},\ p^{2},\ p^{5}/2,\ 2 \tag{3}
$$

This final sequence (3) satisfies the initial condition concerning the existence of only
two kinds of interval factors. Indeed, it is easy to check directly that all ratios of
adjacent members of this sequence are equal to only two values, which play the role of the
interval factors. For this sequence (3), the first kind of interval is $T = p^{3}/2$ =
1.1215… and the second kind of interval is $S = 4p^{-5}$ = 1.0407… . The sequence of these
interval factors is T-T-S-T-S-T-T-S. This sequence fills the whole octave exactly:
$(p^{3}/2)^5 \cdot (4p^{-5})^3 = 2$. The quantities of the different interval factors are
equal to Fibonacci numbers here: there are 3 intervals S, 5 intervals T and 8 interval
factors in total. It is interesting that if we take a non-Fibonacci number (for example, 4,
6 or 9) as the number of initial members of the geometric progression in the first step of
the Pythagorean algorithm, final sequences arise that have more than two kinds of interval
factors.
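The construction of the pentagram scale and the count of its interval kinds can be checked numerically. The sketch below is illustrative: it uses floating-point arithmetic with rounding to 6 decimal places to group equal interval factors (a choice of this example), and it reads the Fibonacci number as the count of initial members of the progression, as in step 1 above. The function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # golden wurf, about 1.309


def fold_into_octave(value):
    """Multiply or divide by 2 until 1 <= value < 2."""
    while value >= 2:
        value /= 2
    while value < 1:
        value *= 2
    return value


def stage_scale(count, factor=P):
    """Take `count` members starting from factor**-1, fold, sort, close with 2."""
    members = [factor ** k for k in range(-1, count - 1)]
    return sorted(fold_into_octave(m) for m in members) + [2.0]


def interval_factors(scale, digits=6):
    """Rounded ratios of adjacent members (rounding groups equal factors)."""
    return [round(b / a, digits) for a, b in zip(scale, scale[1:])]


def interval_kinds(count):
    """Distinct interval factors of the scale, largest first."""
    return sorted(set(interval_factors(stage_scale(count))), reverse=True)


def interval_pattern(count):
    """T/S pattern of a scale that has exactly two kinds of interval factors."""
    factors = interval_factors(stage_scale(count))
    kinds = interval_kinds(count)
    if len(kinds) != 2:
        raise ValueError("scale does not have exactly two interval kinds")
    return "-".join("T" if f == kinds[0] else "S" for f in factors)


print(interval_pattern(8), interval_kinds(8))
for count in (4, 5, 6, 8, 9):
    print(count, len(interval_kinds(count)))
```

For 8 members the pattern and the two factor values can be compared with the sequence (3) and with the values of T and S given in the text; for 4, 6 or 9 members the number of kinds printed in the loop exceeds two.
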

Let us compare the classical 7-stage Pythagorean musical scale with the obtained
8-stage pentagram scale. Figure 9 shows the minimal difference between the sequences
(musical scales) of the two kinds of intervals inside the octave interval {1, 2} for both
scales. The initial and final parts of both sequences coincide completely, and only one
additional semitone-interval arises in the middle part of the octave. This additional
interval of the second kind S exists because the factor “p” is less than the quint factor.

T  T  S  T  T  T  S
T  T  S  T  S  T  T  S

Figure 9: Sequences of interval factors in the 7-stage Pythagorean scale of C major (the upper row) and in the
8-stage pentagram scale (the bottom row). In each row, the intervals of the first kind are marked by T, and
the intervals of the second kind are marked by S (though the values of T and S in the upper row differ from
the values of T and S in the bottom row).

Using the sequence (3) of the intervals, one can construct the sequence of tones
(musical notes), which is named the “8-stage pentagram scale of C major” by analogy
with the Pythagorean scale of C major (Figure 10). The frequencies of these tones
in the first octave are chosen in such a way that the scale contains the frequency 440 Hz,
which corresponds to the note “la” in the Pythagorean scale and in the equal-temperament
scale and which is traditionally used for tuning musical instruments. Figure 10 compares
the 7-stage Pythagorean scale of C major and the 8-stage pentagram scale for the first octave.
Taking into account the minimal difference between the two scales, the majority of the
notes of the pentagram scale are named by analogy with the corresponding notes of the
Pythagorean scale but with the letter “m” at the end (for instance, "rem" instead of "re").
The additional fifth note is named “pim”.

260.7   293.3   330.0   347.6           391.1   440     495.0   521.5
DO1     RE      MI      FA              SOL     LA      SI      DO2

256.8   288.0   323.0   336.1   376.8   392.3   440     493.5   513.6
DOM1    REM     MIM     FAM     PIM     SOLM    LAM     SIM     DOM2

Figure 10: The upper row demonstrates the frequencies of the tones in the 7-stage Pythagorean scale of C
major in the first octave. The bottom row demonstrates the frequencies of the tones in the 8-stage pentagram
scale of C major in the similar octave. Numbers mean frequencies in Hz. The names of the notes are given.
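
The frequencies in the bottom row of Figure 10 follow from sequence (3) once the note LAM is
fixed at 440 Hz. A minimal Python sketch, assuming p = φ^2/2 ≈ 1.3090 (an assumption consistent
with the interval values given in the text) and a hypothetical helper name:

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # assumed value of the factor p (about 1.3090)

# Sequence (3): ratios of the nine tones to the first tone of the octave.
RATIOS = [1, P**3 / 2, P**6 / 4, P, P**4 / 2, 2 / P, P**2, P**5 / 2, 2]
NAMES = ["DOM1", "REM", "MIM", "FAM", "PIM", "SOLM", "LAM", "SIM", "DOM2"]


def pentagram_frequencies(reference_hz=440.0, reference_note="LAM"):
    """Scale the ratios so that the reference note has the given frequency."""
    base = reference_hz / RATIOS[NAMES.index(reference_note)]
    return {name: base * ratio for name, ratio in zip(NAMES, RATIOS)}


frequencies = pentagram_frequencies()
for name, hz in frequencies.items():
    print(f"{name:5s} {hz:6.1f}")
```

Under this assumption the computed values agree with the bottom row of Figure 10 to within
about 0.2 Hz.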

No.           Stages  Value of T_K          Value of S_K           Sequence of T_K and S_K in the scale
K = 0         1       2                     -                      T0
K = 1 (n=3)   2       2^1*p^-1 = 1.5279…    2^0*p^1 = 1.3090…      T1-S1
K = 2 (n=4)   3       2^0*p^1 = 1.3090…     2^1*p^-2 = 1.1672…     T2-S2-T2
K = 3 (n=5)   5       2^1*p^-2 = 1.1672…    2^-1*p^3 = 1.1215…     S3-T3-T3-S3-T3
K = 4 (n=6)   8       2^-1*p^3 = 1.1215…    2^2*p^-5 = 1.0407…     T4-T4-S4-T4-S4-T4-T4-S4
K = 5 (n=7)   13      2^2*p^-5 = 1.0407…    2^-3*p^8 = 1.0776…     S5-T5-S5-T5-T5-S5-T5-T5-S5-T5-S5-T5-T5
K = 6 (n=8)   21      2^-3*p^8 = 1.0776…    2^5*p^-13 = 0.9657…    T6-T6-S6-T6-T6-S6-T6-S6-T6-T6-
                                                                   S6-T6-S6-T6-T6-S6-T6-T6-S6-T6-
                                                                   S6
K = 7 (n=9)   34      2^5*p^-13 = 0.9657…   2^-8*p^21 = 1.1159…    S7-T7-S7-T7-T7-S7-T7-S7-T7-T7-
                                                                   S7-T7-T7-S7-T7-S7-T7-T7-S7-T7-
                                                                   T7-S7-T7-S7-T7-T7-S7-T7-S7-T7-
                                                                   T7-S7-T7-T7
K = 8 (n=10)  55      2^-8*p^21 = 1.1159…   2^13*p^-34 = 0.8655…   T8-T8-S8-T8-T8-S8-T8-S8-T8-T8-
                                                                   S8-T8-T8-S8-T8-S8-T8-T8-S8-T8-
                                                                   S8-T8-T8-S8-T8-T8-S8-T8-S8-T8-
                                                                   T8-S8-T8-S8-T8-T8-S8-T8-T8-S8-
                                                                   T8-S8-T8-T8-S8-T8-T8-S8-T8-S8-
                                                                   T8-T8-S8-T8-S8

Figure 11: The values and the order of both kinds of intervals T_K and S_K in the first Fibonacci-stage
pentagram scales. “K” means a serial number of pentagram scales; “n” is a serial number of Fibonacci values
from Figure 8.

This pentagram scale (Fig. 10), which was constructed in connection with parameters of
the genetic code, has many analogies with the Pythagorean musical scale in its
internal symmetries and proportions. Its main difference from the Pythagorean scale is
that its interval factors have irrational values. Irrational factors are also used in
the modern equal-temperament scale. According to some data, the ancient Chinese knew
about the equal-temperament scale but neglected it, preferring the Pythagorean scale, in
which they saw cosmic and biological importance.

The history of attempts to create new musical scales includes the names of many
prominent scientists: J. Kepler, R. Descartes, G. Leibnitz, L. Euler, etc. But these
authors could not use data about the genetic code in their attempts. The
data about the genetic code allow one to create new musical scales.

By analogy with the 8-stage pentagram scales, other Fibonacci-stage scales can be
constructed. Figure 11 shows both kinds of interval factors T and S in the pentagram
scales with different Fibonacci stages (see more details in (Petoukhov, 2008)).

Each of the pentagram scales (Figure 11) contains a Fibonacci quantity of each of the interval
factors T_K and S_K: the interval T_K is repeated F_{n-1} times and the interval S_K is repeated
F_{n-2} times. In total, the number of repetitions of the intervals T_K and S_K is equal to F_n, and
they always exhaust the octave interval 2 exactly:

(T_K^F_{n-1}) * (S_K^F_{n-2}) = 2, (4)

where the symbol «^» means exponentiation.
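
Expression (4) can be checked without any numerical value of p, because every interval in
Figure 11 has the form 2^a * p^b with integer exponents. The Python sketch below transcribes
those exponents from the table and assumes n = K + 2, as the pairs of K and n in Figure 11
suggest; the names are illustrative only.

```python
def fibonacci(n):
    """F_0 = 0, F_1 = F_2 = 1, F_3 = 2, ..."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# K: (exponents of T_K, exponents of S_K); a pair (a, b) stands for 2**a * p**b.
FIGURE_11 = {
    1: ((1, -1), (0, 1)),
    2: ((0, 1), (1, -2)),
    3: ((1, -2), (-1, 3)),
    4: ((-1, 3), (2, -5)),
    5: ((2, -5), (-3, 8)),
    6: ((-3, 8), (5, -13)),
    7: ((5, -13), (-8, 21)),
    8: ((-8, 21), (13, -34)),
}


def octave_exponents(k):
    """Exponents (a, b) of the product T_K^F_(n-1) * S_K^F_(n-2), with n = K + 2."""
    n = k + 2
    (ta, tb), (sa, sb) = FIGURE_11[k]
    f1, f2 = fibonacci(n - 1), fibonacci(n - 2)
    return (f1 * ta + f2 * sa, f1 * tb + f2 * sb)


for k in sorted(FIGURE_11):
    # (1, 0) means 2**1 * p**0 = 2, i.e. exactly one octave
    print(k, fibonacci(k + 2), octave_exponents(k))
```

For every row the powers of p cancel and a single factor 2 remains, so the octave is
exhausted exactly, whatever the value of p.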

In addition, the values of each of T_K and S_K are also expressed simply via Fibonacci numbers:
up to sign, the exponents of 2 and p in each value in Figure 11 are Fibonacci numbers.
One can see from the table in Figure 11 that recurrent relations exist for the system
of the pentagram scales:

T_{K+2} = T_K/T_{K+1}; S_K = T_{K+1} (K = 0, 1, 2, …; T0 = 2, T1 = 2*p^-1) (5)

These relations lead to a recurrent algorithm (6) for calculating interval factors in the
pentagram scales. This new algorithm can be considered an alternative to
the Pythagorean algorithm described above. On the basis of the values T1 and T2,
this new algorithm allows calculating the values of the interval factors T_K for K = 3, 4, 5,
6, …, which correspond to the 5-, 8-, 13-, 21-stage and higher-order Fibonacci-stage
scales, without using the Pythagorean algorithm. Indeed, the recurrent relations (5)
generate relations (6) for the values T_K as functions of T1 and T2:

T_K = [(T1^F_{K-2})/(T2^F_{K-1})]^((-1)^(K+1)) (6)

where the symbol «^» means exponentiation; F_{K-2} and F_{K-1} are Fibonacci numbers (F_1 = F_2 = 1).
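
The agreement between the recurrent relations (5) and the closed expression (6) can be checked
numerically. The sketch below assumes p = φ^2/2 ≈ 1.3090 and the Fibonacci indexing
F_1 = F_2 = 1; the function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # assumed value of the factor p (about 1.3090)


def fibonacci(n):
    """F_0 = 0, F_1 = F_2 = 1, F_3 = 2, ..."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def t_recurrent(k):
    """T_K from relation (5): T_0 = 2, T_1 = 2/p, T_(K+2) = T_K / T_(K+1)."""
    t_prev, t_cur = 2.0, 2.0 / P
    for _ in range(k):
        t_prev, t_cur = t_cur, t_prev / t_cur
    return t_prev


def t_closed_form(k):
    """T_K from expression (6), used here for K >= 3."""
    t1, t2 = t_recurrent(1), t_recurrent(2)
    base = t1 ** fibonacci(k - 2) / t2 ** fibonacci(k - 1)
    return base ** ((-1) ** (k + 1))


for k in range(3, 9):
    print(k, round(t_recurrent(k), 4), round(t_closed_form(k), 4))
```

Both columns coincide for K = 3, …, 8 under these assumptions, so T1 and T2 alone determine
the interval factors of the higher scales.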

Due to the expression T_{K+2} = T_K/T_{K+1} (5), this family of Fibonacci-stage scales is
connected with Pascal's triangle and the coefficients of Newton's binomial expansion,
because T0^1 = T1^1*T2^1 = T2^1*T3^2*T4^1 = T3^1*T4^3*T5^3*T6^1 = T4^1*T5^4*T6^6*T7^4*T8^1
= … . The exponents in these products coincide with the binomial coefficients.
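
The chain of products with binomial exponents can be verified directly. The Python sketch
below assumes p = φ^2/2 ≈ 1.3090 and uses illustrative names; row m of Pascal's triangle
supplies the exponents of T_m, T_(m+1), …, T_(2m).

```python
import math

PHI = (1 + math.sqrt(5)) / 2
P = PHI ** 2 / 2  # assumed value of the factor p (about 1.3090)


def interval_factors(count):
    """Return [T_0, ..., T_(count-1)] from T_(K+2) = T_K / T_(K+1)."""
    t = [2.0, 2.0 / P]
    while len(t) < count:
        t.append(t[-2] / t[-1])
    return t[:count]


def binomial_product(t, m):
    """Product of T_(m+j) ** C(m, j) for j = 0..m (row m of Pascal's triangle)."""
    return math.prod(t[m + j] ** math.comb(m, j) for j in range(m + 1))


t = interval_factors(9)
for m in range(5):
    print(m, [math.comb(m, j) for j in range(m + 1)], round(binomial_product(t, m), 9))
```

Each of the five products written out in the text evaluates to T0 = 2.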

Another algorithm exists to determine the order of T_K and S_K in each of the pentagram
scales on the basis of knowledge about their order in the first pentagram scales: T0 and
T1-S1. This algorithm is connected with the classical task by Fibonacci about rabbit
reproduction. The algorithm is based on the fact that under transition from the
pentagram scale K to the next pentagram scale K+1, each interval T_K is replaced by two
intervals T_{K+1} and S_{K+1}, and each interval S_K is replaced by the interval T_{K+1}. It should be
noted that under transition from a scale with odd number K to a scale with even
number (for example, from K=3 to K=4) the interval T_K is replaced by T_{K+1} and S_{K+1}
(the order of T_{K+1} and S_{K+1} is essential here). Conversely, under transition from a scale
with even number K to a scale with odd number (for example, from K=4 to
K=5) the interval T_K is replaced by S_{K+1} and T_{K+1} (the reverse order).

Let us explain this with the example of the sequence S3-T3-T3-S3-T3, giving in brackets
after each of S3 and T3 its algorithmic transformation into T4 and S4 under transition
from the pentagram scale with K=3 to the next scale with K=4: S3(T4)-T3(T4-S4)-
T3(T4-S4)-S3(T4)-T3(T4-S4). Paying attention only to the sequence of T4 and S4 inside
the brackets, we get the familiar sequence T4-T4-S4-T4-S4-T4-T4-S4 for the pentagram
scale with K=4 in Figure 11.
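
The substitution rule described above can be written as a short program. The Python sketch
below starts from the scale T1-S1 and applies the two alternating replacements; the function
names are hypothetical.

```python
def next_sequence(sequence, k):
    """Transition from scale K to scale K+1 for a string of 'T' and 'S'."""
    # Odd K: T -> T,S; even K: T -> S,T; in both cases S -> T.
    t_image = "TS" if k % 2 == 1 else "ST"
    return "".join(t_image if symbol == "T" else "T" for symbol in sequence)


def pentagram_sequence(k):
    """Order of the intervals in scale K >= 1, starting from T1-S1."""
    sequence = "TS"
    for level in range(1, k):
        sequence = next_sequence(sequence, level)
    return sequence


for k in range(1, 6):
    print(k, "-".join(pentagram_sequence(k)))
```

For K = 4 this prints T-T-S-T-S-T-T-S, the sequence obtained in the example above.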

Figure 12 shows the tree of reproduction of the interval factors T_K and S_K, which is
constructed on the basis of this algorithm and which corresponds to the sequences of T_K and
S_K in the table of pentagram scales in Figure 11.

But a similar tree, in which the number of elements of each of the two kinds at each
level is a Fibonacci number, has been known since before the Renaissance. It appeared in
the "biological" task by Fibonacci. This task speaks about breeding rabbits: each pair
gives birth to a new pair every month, but only starting from the second month after its
own birth. The history of the task and applications of Fibonacci numbers in many fields of
contemporary science are described in the book (Vorobiev, 2003).

Figure 12: The tree of reproduction of intervals T_K (black circles) and S_K (white circles) for the ensemble of
the Fibonacci-stage pentagram scales from the table in Figure 11.

Along with the similarities between the trees in Fibonacci's classical task and in our
musical "task of the octave", which is associated with expression (4), the following
mathematical differences exist between these tasks and between their trees:
• Our task of the octave analyzes not only the numbers of T_K and S_K, but also the
value of each of T_K and S_K, which is expressed through Fibonacci
numbers. Fibonacci's classical task doesn't consider parameters of each rabbit
(e.g., weight or size), and rabbits differ from each other only in
sexual maturity.
• The octave task determines the order of T_K and S_K at each level of the tree of
the pentagram scales. Fibonacci's task doesn't consider the order of the two kinds
of rabbits (those which have reached sexual maturity and those which have not) at each
level of the Fibonacci tree.
• Octaves are not considered at all in the frame of the Fibonacci task.

Taking these facts into account, our "task of the octave" can be represented as a
complication or generalization of the classical Fibonacci task.

On the basis of the table in Figure 11, one can construct systems of sound frequencies
for the pentagram scales corresponding to different Fibonacci stages. In this case a new
interesting result appears: all the musical frequencies of any pentagram scale are
repeated in all higher pentagram scales, which have more stages. In other
words, a fractal-like principle exists, which incorporates the set of sound
frequencies of a lower pentagram scale into the sets of frequencies of the higher pentagram
scales. Each subsequent pentagram scale contains information about the musical
frequencies of all previous "generations" of pentagram scales.
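
This nesting of frequencies can be checked exactly. In the Python sketch below each tone is
stored as a pair of integer exponents (a, b) standing for the ratio 2^a * p^b to the first tone
of the octave, so no floating-point comparison is needed. The intervals come from the
relations T_{K+2} = T_K/T_{K+1} and S_K = T_{K+1}, and the order of the intervals from the
substitution algorithm; all names are illustrative.

```python
def interval_exponents(k_max):
    """T_K = 2**a * p**b stored as (a, b); dividing intervals subtracts exponents."""
    t = [(1, 0), (1, -1)]  # T_0 = 2, T_1 = 2/p
    while len(t) < k_max + 2:
        (a0, b0), (a1, b1) = t[-2], t[-1]
        t.append((a0 - a1, b0 - b1))
    return t


def pentagram_sequence(k):
    """Order of the intervals in scale K >= 1 (odd K: T -> T,S; even K: T -> S,T; S -> T)."""
    sequence = "TS"
    for level in range(1, k):
        t_image = "TS" if level % 2 == 1 else "ST"
        sequence = "".join(t_image if s == "T" else "T" for s in sequence)
    return sequence


def tone_set(k):
    """Tones of scale K >= 1 relative to the first tone, as exponent pairs (a, b)."""
    t = interval_exponents(k)
    step = {"T": t[k], "S": t[k + 1]}  # S_K = T_(K+1)
    a, b = 0, 0
    tones = {(a, b)}
    for symbol in pentagram_sequence(k):
        a, b = a + step[symbol][0], b + step[symbol][1]
        tones.add((a, b))
    return tones


for k in range(1, 8):
    print(k, len(tone_set(k)), tone_set(k) <= tone_set(k + 1))
```

Multiplying each ratio 2^a * p^b by the frequency of the first tone gives the frequencies
themselves, so inclusion of the exponent sets means inclusion of the frequency sets.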

The Moscow P.I. Tchaikovsky State Conservatory has recently created a special
“Center for interdisciplinary research of musical creativity” headed by one of the
authors of the article. One of the main tasks of this center is to study the genetic musical
scales described in this article. This study is now being conducted from different
viewpoints, including new opportunities for composers. The study revealed that the
pentagram musical scales possess beautiful and rich harmony for sound perception and
can be used to compose music. The system of the musical
pentagram scales has many more possibilities to produce harmonic sounds in
comparison with the equal-tempered scale, which is widely used now. The authors of
the article have created a few musical instruments and special software in the Python
programming language to produce appropriate musical products, which are used in this study.
A group of specialists from different fields of science, medicine and culture also
participates in these works. In addition, theoretical research in this field is conducted in the
international institute “Symmetrion” (Budapest, Hungary). Initial results of this broad
study testify in favor of the great prospects of this direction for science and culture.

In our opinion, the aesthetic aspects of genetic music are connected not with a
mechanical resonance of molecular structures under the influence of sound waves but with
informational aspects, which provide an effect of recognition (in a way not yet identified)
of a kindred language when listening to genetic music. This effect of recognition
can be provided by biological algorithms of signal processing inside organisms. For
example, in the case of pentagram music from the outside world, our organism can
recognize those ratios on which our genetic system and the whole inherited physiology
are built, and the organism responds positively to this manifestation of a structural
kinship of the outside world with its own genetic physiology. This positive reaction can
be compared with the mutual understanding between two persons when they begin to talk in
the same language (if they talk in different languages, mutual understanding and
interactivity don't arise, though these persons can speak more and more loudly and
energetically). Music is not limited to the relationships of sounds emitted by a system of
stretched strings, which were studied by Pythagoras. The purpose of music is to call up
emotions, associations and living pictures from bio-informational memory. This may be
done effectively not only by sounds from classical musical systems of stretched strings
but also (and more effectively?) by other sets of sounds, which are structured and tuned
on the basis of inherited algorithms of biological processing of genetic information. No
wonder the sense of musical harmony is innate.

6 SOME CONCLUDING REMARKS

The facts described in this article about relations of the genetic systems with musical
harmony are essential for the problem of genetic bases of aesthetics and inborn feeling
of harmony. According to the words of the famous physicist Richard Feynman about
feeling of musical harmony, "we may question whether we are any better off than
Pythagoras in understanding why [stressed] only certain sounds are pleasant to our
ear. The general theory of aesthetics is probably no further advanced now than in the
time of Pythagoras" (Feynman, Leighton, & Sands, 1963, Chapter 50).

A cultural direction of “genetic art” (or briefly “genoart”) can be additionally developed
owing to these data of matrix genetics. Genoart has many patterns, which are revealed by
matrix genetics and can be used to create new works of art, designs, and architectural
and musical compositions. For example, the quint genomatrices can be presented in the
form of color mosaics if the matrix numbers are replaced by colors. One can see a
regular complication of the color mosaics along the family of the genomatrices with an
increase of their Kronecker powers. The discovery of the connection of the genetic code
with the golden section shows the molecular-genetic basis of many known facts about
the aesthetic meaning of the golden section. Specifically, the described facts give new
material for the question about architectural canons, where the golden section has long
been used; for example, the famous modulor by Le Corbusier (1948, 1953,
http://en.wikipedia.org/wiki/Le_Corbusier) is based on the golden section. The
pentagram Fibonacci-stage scales can additionally be utilized for architectural
proportions (in the role of a “pentagram modulor”).

There is no doubt that applications of numeric genetic matrices to investigations of the
various ensembles of parameters of the genetic system can give many unexpected and
useful results in the future as well. This direction of theoretical research will be
developed in parallel with the developing application of matrices in many other branches of
science. The matrix-genetic approach to phenomena of the golden section in genetic
systems and aesthetics can be developed in many theoretical ways and can give new
interesting mathematical models.
According to the described materials, each gene, each DNA molecule, and each protein can be
characterized by its own “musical ensemble”. Sequences of appropriate musical
intervals from such genetic melodies can be reproduced in the form of sequences of
sounds, colors (“color music”), electrical stimuli, impulses of laser beams, etc. for
different needs (though the natural frequencies of these physical media are very different).
Does such "natural genetic music" (or compositions on its basis) possess a special
physiological effectiveness for the treatment of people and animals, stimulation of
the growth of plants and microorganisms, and so forth? Only future experiments can give
a more precise answer. It seems that the creation of a computer bank of genetic music
would be useful for theoretical and practical needs. One can add here that the creator of analytic
psychology Carl Jung, studying archetypes of human consciousness, created the
medical method of amplification. This method is based on an active interaction of his
patients with these archetypes, including the famous tables of the Ancient Chinese “I Ching”,
which are connected with the genetic matrices (Petoukhov, 2005, 2008; Petoukhov, He,
2010; Tusa, 1994). Many composers earlier declared a mysterious connection of music with
the golden section and Fibonacci numbers. In our opinion, this connection is
based on the musical scale tuned according to the described scale, which was constructed by
analogy with the mathematical sequences discovered in the algebraic structure of the
genetic coding. The described facts are related to the problem of the genetic bases of
aesthetics and an inborn feeling of harmony.

Investigations of numeric genetic matrices are an effective scientific instrument to
analyze multi-component and multi-parametric ensembles of the molecular-genetic
systems. The obtained results give a new vision of connections of genetic systems with
well-known mathematical objects and theories from other branches of science and
culture. Owing to the results of matrix genetics new opportunities arise to demonstrate
the close connection between science and culture. One of them is a problem of multi-
dimensional spaces including multi-dimensional musical spaces which need appropriate
algebraic formalisms for their analysis (Kappraff, Petoukhov, 2009; Koblyakov, 1995,
2000a,b).

One should note that our attempt to create the mathematical scale of the golden section,
where the factor of the geometrical progression is equal to the golden section φ (but not
to φ^2), has led to a scale which differs cardinally from the Pythagorean musical scale
and which has been considered not so interesting from the musical viewpoint.
Furthermore, such scales of the golden section had no evident connection with Fibonacci
numbers in their interval factors.

REFERENCES

Berger L. G. (2001). Epistemology of art. Moscow: Isskusstvo (in Russian).


Carrasco, J., Michaelides, A., Forster M., Haq S., Raval R., Hodgson, A. (2009). A one-dimensional ice
structure built from pentagons. Nature Materials, v. 8, 427 - 431
Coldea R., Tennant D. A., E. M. Wheeler, E. Wawrzynska, D. Prabhakaran, M. Telling, K. Habicht, P.
Smeibidl, K. Kiefer (2010). Quantum Criticality in an Ising Chain: Experimental Evidence for
Emergent E8 Symmetry. Science, Jan. 8, Vol. 327, no. 5962, 177-180

Darvas, G. (2007). Symmetry. Basel: Birkhäuser book.


Feynman, R., Leighton, R., Sands, M. (1963) The Feynman lectures. New-York: Pergamon Press.
Jean, R.V. (2006). Phyllotaxis. A systemic study in plant morphogenesis. Cambridge: Cambridge University
Press.
Kappraff, J. (2000). The arithmetic of Nichomachus of Gerasa and its applications to systems of proportions.
Nexus Network Journal, 2(4). Retrieved October 3, 2000, from
http://www.nexusjournal.com/Kappraff.html
Kappraff, J. (2002). Beyond measure: essays in nature, myth, and number. Singapore: World Scientific.
Kappraff, J., Petoukhov S.V. (2009) Symmetries, generalized numbers and harmonic laws in matrix genetics.
Symmetry: Culture and Science, v. 20, 1-4, 23-50
Koblyakov, A.A. (1995) Semantic aspects of self-similarity in music. Symmetry: Culture and Science, v.6, 2,
63-74
Koblyakov, A.A. (2000a) Synergetics and creativity. Synergetic paradigm. Moscow (in Russian)
Koblyakov, A.A. (2000b) From disjunction to conjunction (the contours of the general theory of creation).
The language of science - the languages of art. Moscow, 2000, 75-86 (in Russian)
Konopelchenko, B. G., & Rumer, Yu. B. (1975). Classification of the codons in the genetic code. Doklady
Akademii Nauk SSSR, 223(2), 145-153 (in Russian).
Le Corbusier, Sh. (1948). Modulor. Boulogne: Collection Ascoral.
Le Corbusier, Sh. (1953). Der Modulor. Stuttgart: DVA.
Lendvai, E. (1993). Symmetries of music, an introduction to semantics of music. Kecskemét: Kodály Institute.
Needham, J. (1962). Science and civilization in China. Cambridge: Cambridge University Press.
Petoukhov, S. V. (1981). Biomechanics, bionics and symmetry. Moscow: Nauka.
Petoukhov, S.V. (2005). The rules of degeneracy and segregations in genetic codes. The chronocyclic
conception and parallels with Mendel’s laws. In: He M., Narasimhan G., Petoukhov S. eds. Advances
in Bioinformatics and its Applications, Proceedings of the International Conference (Florida, USA, 16-
19 December 2004), Series in Mathematical Biology and Medicine, v.8, 2005, New Jersey-London-
Singapore-Beijing, World Scientific, ISBN 981-256-148-X
Petoukhov, S.V. (2008). Matrix genetics, algebras of the genetic code, noise-immunity. Moscow: RCD (in
Russian).
Petoukhov, S.V., He, M., (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics:
Advanced Patterns and Applications. Hershey, USA: IGI Global. 271 p.
Ponnamperuma, C. (1972). The origin of life. New York: E.P.Dutton.
Rashevsky P.K. (1964). Riemannian geometry and tensor analysis. Moscow, Nauka (in Russian)
Schrodinger, E. (1955). What is life? The physical aspect of the living cell. Cambridge: University Press.
Shnoll, S.E. (1989). Physical-chemical factors of biological evolution. Moscow: Nauka (in Russian).
Shubnikov, A. V., & Koptsik, V. A. (1974). Symmetry in science and art. New-York: Plenum Press.
Shults, G.E., & Schirmer, R.H. (1979). Principles of protein structure. Berlin: Springer-Verlag.
Tusa, E. (1994). Lambdoma - “I Ging” - Genetic code. Symmetry: Culture and Science, 5(3), 305-310.
Voloshinov, A.V. (2000). Mathematics and arts. Moscow; Prosveschenie (in Russian).
Vorobiev, N.N., (2003). Fibonacci numbers. Birkhäuser Basel
Weinberger, N.M. (2004). Music and brain. Sci. Amer., 291(5), 88-95.

 
Symmetry: Culture and Science
Vol. 23, Nos. 3-4,377-402, 2012

MODELING “COGNITION” WITH NONLINEAR DYNAMIC SYSTEMS

Yuri V. Andreyev*, Alexander S. Dmitriev**

* Physicist (b. Ufa, Russia, 1960).


Address: Laboratory of Information and Communication Technologies based on Dynamic Chaos (Inform-
Chaos Lab.), Institute of Radio Engineering and Electronics of Russian Academy of Sciences; Mokhovaya st.,
11, building 7, Moscow, 125009, Russia. E-mail: yuwa@cplire.ru.
Fields of interest: dynamic chaos, chaos for communications, symmetry of chaos, information theory, chaotic
cryptography.
Publications: Radiotekhnika, 2008, No. 8, p. 83 (in Russian); Int. J. Bifurcation and Chaos (2005) vol. 15, No.
11, pp. 3639; IEEE Trans. Circuits and Systems-I, 2003, vol. 50, No. 5, pp. 613; Chaos, Solitons and Fractals,
2003, vol. 17, No. 2-3, pp. 531; Int. Journal Bifurcation and Chaos. 1999, vol. 9, no. 12, pp. 2165; Nonlinear
Phenomena in Complex Systems, 1999, vol. 2, no. 4, pp. 48.

** Physicist, informatician (b. Kuibyshev, Russia, 1948).


Address: Laboratory of Information and Communication Technologies based on Dynamic Chaos (Inform-
Chaos Lab.), Institute of Radio Engineering and Electronics of Russian Academy of Sciences; Mokhovaya st.,
11, building 7, Moscow, 125009, Russia. E-mail: chaos@cplire.ru.
Fields of interest: dynamic chaos and bifurcation phenomena; generation of dynamic chaos; information
processes in complex dynamics systems; information technologies based on dynamic chaos and nonlinear
phenomena; application of dynamic chaos in information networks and communications.
Awards: State Awards of the USSR Council of Ministries (1984 and 1989); 2 Gold Medals of the 4th Int.
Invention Fair in the Middle East (Kuwait, 2011); IEEE Circuits and Systems Society Chapter-of-the-Year
Award (2001); Medaille d'Honneur of Int. Exhibition of Inventions (2000) Paris, France; Grand Prize of the
Contest of the works on Image Recognition (HP Labs – Bristol, 1992).
Publications: Dmitriev A.S., Efremova E.V., Kuzmin L.V., Miliou A.N., Panas A.I., Starkov S.O.: Chapter
15: “Secure Transmission of Analog Information using Chaos”, in: Chaos Synchronization and Cryptography
for Secure Communications: Applications for Encryption, ed. Santo Banerjee, IGI Global (2010) pp. 337-360;
“Generation of chaos", Tekhnosfera (2012) 424 p. (in Russian); "Dynamic chaos: novel information carriers
for communication systems", Fiz.-Mat. Lit (2002) 252 p. (in Russian); "Dynamic chaos as information
carrier", New in synergetics: Glimpse in the third millenium. Nauka (2002) pp. 82–122. (in Russian); Andreyev,
Yu.V., Dmitriev, A.S., and Kuminov, D.A. Chaotic processors, Advances in Modern Radioelectronics.
(Foreign Radioelectronics), 1997, No. 10, pp.50-79 (in Russian).

Abstract: In this paper we consider realization of information processing and recognition
with dynamic systems. A method for storing and processing information is described,
which is connected with symmetries and in which digital information blocks
are related to dynamic attractors (periodic orbits or chaotic attractors) of a specially
designed nonlinear dynamic system. Such a system has interesting features concerning
information processing, in particular, associative access (retrieval of the whole stored
image when a small part of it is given as a request), search by content, novelty filter, etc.
The designed dynamic systems can be used as storage for texts, pictures, digital
sequences, etc. Practical examples of using this approach to create information search
engines, information archives with associative access to the stored data, and other data
management solutions are given.

Keywords: dynamic chaos, symmetries, attractors, information processing, recognition.

1. INTERRELATION OF DYNAMICS AND INFORMATION

The role that dynamic chaos plays in information processing by human and animal
brains has been extensively investigated in recent decades. The very existence of chaotic
modes in the brain is considered beyond doubt, and the efforts of researchers are now
concentrated on the study of those special functions of the brain for which chaos is either
necessary or has some advantages compared to simple dynamics.

The last few decades have witnessed a sharp growth of interest in processing,
memorizing and storing information in living systems. Unlike the addressed
memory now used in computers, the memory of humans and animals is associative, i.e.,
both storing and retrieval of information are based not on the index of a memory cell
but on the content (Kohonen, 1980).

There exist quite a number of concepts that realize the association principle to one
extent or another. One of the most popular among them uses neural network
models (Grossberg, 1988; Hopfield, 1982; Carpenter, 1989). Such models are described
as dynamic systems, and the objects being memorized or recognized are related to basic
attractors, viz. stable modes. The basin of attraction of each attractor defines the
limits of recognition of the corresponding image.

The functional role of cortical chaos appearing due to thalamo-cortical interaction is
discussed in (Nicolis, 1982; Nicolis & Tsuda, 1985). Chaos is considered as a possible
mechanism of self-referential logic, and as a machine for a short-term memory based on
that logic.

W.J. Freeman observed chaotic activity in the learning process in the rabbit’s olfactory
system (Freeman, 1987; Freeman, Yao & Burke, 1988; Skarda & Freeman, 1987;
Eisenberg, Freeman & Burke, 1989; Yao & Freeman, 1990). He found that the rabbit
remembers a known smell by coding it in spatially coherent and temporally almost
periodic activity of the olfactory potential. When the animal senses a new
smell, the coding mechanism doesn't work and the activity of the olfactory bulb becomes
low-dimensional chaos, as if it were a filter of "novelty", forming the state "I don't
know".

Based on the analysis of human electroencephalograms, a hypothesis was suggested
(Babloyantz, 1986; Babloyantz & Destexhe, 1986; Destexhe, Sepulchre & Babloyantz,
1988) that the functional role of chaos is determined by the property of chaotic
dynamics to increase the resonance capacity of the brain, giving a chance for extremely rich
responses to an external stimulus.

Among other hypotheses about the functional role of chaos we want to note: a nonlinear
pattern classifier (Freeman, Yao & Burke, 1988; Yao & Freeman, 1990), a catalyst of
learning (Skarda & Freeman, 1987), a stimulus interpreter (Tsuda, 1984), a memory
searcher (Tsuda, Koerner & Shimizu, 1987), etc. A more thorough list of possible roles
of cortical chaos in information processing can be found in (Tsuda, 1992), along with a
rich reference base for the studies of the 1970s to 1990s.

Procaccia expressed a few ideas (Procaccia, 1988) that point to a connection between
chaos, unstable periodic orbits and information properties of dynamic systems. First,
chaotic orbits can be organized around the skeleton of unstable periodic orbits. Each
periodic orbit (or a point) can be universally encoded using symbolic dynamics. As for
the symbolic dynamics, there is a “grammar” that defines permitted “words”, or periodic
orbits. Such a grammar can also be universal, which means that different dynamic
systems belonging to the same universal class have the same distribution of periodic
orbits in corresponding space points. Finally, periodic orbits and their eigenvalues can
be derived directly from experimental data. A corresponding algorithm is given in
(Auerbach, Cvitanovic & Eckmann, 1987). Some details of the above ideas are also
discussed in (Gunaratne & Procaccia, 1987; Cvitanovic, 1988).

The problem of information streams in 1-D maps is analyzed in (Wiegrinch &
Tennekes, 1990). The authors refer to the studies (Shaw, 1981; Schuster, 2004; Farmer,
1982), in which information is argued to be a fundamental concept in the theory of
dynamic systems and chaos. In particular, sensitivity to initial conditions is rigorously
related to information production. Further, they consider a dynamic system described by
a mapping f of a segment into itself and study how iteration of the map f produces a special
process, which the authors call an information stream.

In (Matsumoto & Tsuda, 1988) the rates of information streams are estimated, and
restrictions imposed by computers are discussed. The rate of an information stream is
interpreted as the volume of new data per unit time. For coupled 1-D maps derived from
experimental data on the Belousov-Zhabotinsky reaction, the authors showed that the data
rate is equivalent to the Kolmogorov entropy KS.

Another reason to consider dynamic chaos from the information viewpoint is the existence of
natural objects with deterministic chaotic dynamics (Voges, Atmanspacher &
Scheingraber, 1987; Atmanspacher, Scheingraber & Voges, 1988) or with mixed
dynamics, containing both deterministic chaos and a random process. As a rule, there is a
1-D signal, which is to be processed in order to obtain more or less detailed
information about the object dynamics. Such processing is a method of getting information
on the object from a chaotic process that takes place in it.

One can note the known connection of the genetic coding system with dynamic chaos and
symmetric patterns of fractals (see, for example, (Almeida et al., 2001; Basu et al.,
1997; Deschavanne et al., 1999; Dutta & Das, 1992; Fiser, Tusnady & Simon, 1994;
Goldman, 1993; Gutierrez, Rodriguez & Abramson, 2001; Joseph & Sasikumar, 2006; Oliver
et al., 1993; Petoukhov & He, 2010; Wang et al., 2005; Yu, Anh & Lau, 2004)). The
pioneering work in this field was (Jeffrey, 1990), where an application of the known
symmetrologic method “Chaos Game Representation” (CGR) allowed visualization of long DNA
sequences and revealed different fractal patterns in them. Identifying chaos in
experimental data such as biological sequences can include searching for a strange
attractor in the dynamics, identified by its fractal structure. Having found such an
attractor, one can try to estimate its dimension, which is a measure of the number of
active variables and hence of the complexity of the equations required to model the
dynamics. Fractals are to chaos what geometry is to algebra. They are the useful
geometric manifestation of chaotic dynamics and are sometimes called "the fingerprints
of chaos". Revealing the fractal structure of such CGR patterns of DNA and protein
sequences shows hidden connections of these genetic structures with non-linear dynamics
or chaotic dynamical systems.
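
As an illustration of the CGR construction mentioned above, a minimal Python sketch may
be helpful. It assumes the commonly used convention in which the nucleotides A, C, G, T
are assigned to the corners (0,0), (0,1), (1,1), (1,0) of the unit square, and each new
point lies halfway between the previous point and the corner of the current nucleotide.
The function name cgr_points and the starting point are illustrative choices of ours.

```python
# Corner of the unit square assigned to each nucleotide (assumed common CGR convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points for a DNA sequence (hypothetical helper)."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Plotting the points obtained for a long sequence gives the CGR picture in which the
fractal patterns are sought.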

The book “On Intelligence” by Jeff Hawkins is of special interest (Hawkins & Blakeslee,
2004). The author analyzes contemporary knowledge of the human brain from the viewpoint
of technical implementation of the main principles of brain functioning. He claims that
intelligence must be treated as memory, rather than behavior, and points out that the
brain is a purely dynamic system with nothing static in it. In his opinion, the brain
operates as a predictive hierarchical associative memory, in which patterns are stored
as dynamic structures (e.g., cycles), and the input signal to the brain is time-varying.
Associative memory is activated (accessed) by applying time-varying patterns (or parts
of patterns, or distorted patterns) at the input. Here, we would like to stress the idea
that the carriers of information in such a model of the brain are dynamic objects, not
fixed points of any kind.

Thus, experimental investigations of the electric activity of the brain and of certain
of its neural subsystems, simulation of various neural networks, and qualitative analysis
of information processes in the brain have made it possible to suggest and, to some
extent, to prove several hypotheses about the role of chaos in brain activity. The use
of different approaches, models and methods in the study of the functional role of chaos
leads to the idea that there exist general principles of information processing in
chaotic systems, independent of the concrete nature and realization of the systems. This
gives hope that the main relations of information processing can be investigated using
simple models. Here the problem arises of a proper choice of the dynamical system, which
must be convenient, i.e., simple enough to allow thorough description, and at the same
time must exhibit complex and chaotic behavior.

The approach that we follow in this paper implies that information processing in a dy-
namical system is associated with a notion of an attractor in the system phase space
carrying information. Information processing, e.g., recognition, is associated with struc-
tural transformations of the attractors (bifurcations) and essential change in the system’s
behavior.

The first step of information processing, the storing, is coupled with a synthesis of a
nonlinear dynamical system with the phase space of a special structure, i.e., with attrac-
tors corresponding to stored information. This approach is used, for example, in neural
networks where for a given set of images a neural network is synthesized (trained) such
that these images correspond to equilibrium states of this dynamical system. The
simplest type of attractor, a stable point in the system phase space, is used as the
carrier of information.

Attempts are also known (e.g., (Tsuda, 1992, 1994; Baird & Eeckman, 1992)) to use more
complicated attractors, such as cycles (periodic orbits) and strange attractors, for
carrying information in neural networks. But the enormous complexity of the cooperative
motion of the neurons in conventional neural networks makes direct synthesis
(calculation) of these networks very difficult or even practically impossible. Instead,
to design such a dynamical system, one has to use time-consuming training procedures,
which obscures investigation of the general principles of information processing.

Proceeding from the above concept of the existence of general principles of information
processing independent of the concrete dynamical system, we proposed to use a class of
discrete-time one-dimensional systems, namely, piecewise-linear maps of a segment (an
interval) into itself, $x_{n+1} = f(x_n)$. The efforts were concentrated on the synthesis
of dynamical systems with prescribed cycles in the system phase space. As a result, a
method of storing information using stable limit cycles of 1-D maps as information
carriers was proposed (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991).

The use of more complicated attractors, i.e., cycles rather than equilibrium points, offers
new capabilities of information processing, for example, associative memory (Dmitriev,
1991; Dmitriev, Panas & Starkov, 1991). Further investigations have shown that this
method of storing information can be applied in practice to storing pictures, texts, sig-
nals, etc., (Andreyev, Belsky & Dmitriev, 1992; Andreyev et al., 1992; Dmitriev, 1993;
Dmitriev et al., 1993). The method was extended also to storing information in 2-D and
multi-dimensional maps and to storing multi-dimensional information sequences (cycles
of vectors) (Andreyev, Belsky & Dmitriev, 1994).

In this paper we describe the original method and its developments and generalizations,
including the use of chaotic systems and unstable limit cycles for storing information,
and discuss opportunities of information processing in such systems.

In section 2 we present the original method (Dmitriev, 1991; Dmitriev, Panas &
Starkov, 1991). In section 3 we briefly describe dynamics of the map with stored infor-
mation. In section 4 we discuss memory scanning using unstable cycles. In section 5 we
demonstrate that information can be stored as chaotic attractors. In section 6 we
investigate 1-D maps as recognition machines. A model of “long-term” and “short-term”
memory is shown. Finally, we summarize the information processing functions realized in
the proposed dynamical systems, and draw some conclusions on the role of chaos in the
discussed models.

2. STORING INFORMATION AS DYNAMIC ATTRACTORS

Processing information with dynamic systems implies:


• choice of attractor types suitable for processing;
• choice of dynamic phenomena necessary to implement basic operations of information
  processing;
• development of principles of unambiguous relation of information with trajectories
  of the dynamic system;
• development of concrete mathematical models that allow information to be processed
  as map trajectories and dynamic phenomena to be controlled, in order to implement
  basic processing operations;
• development of software to simulate dynamic processors on a PC;
• investigation of dynamic processor models;
• solution, with the dynamic processor, of complex problems that are hard to solve
  with traditional approaches.

Each trajectory of a dynamic system can be treated as an information signal, so, the set
of map trajectories is a certain information “depository”. This “depository” has a num-
ber of interesting features, depending on the type of the dynamic system attractors.

2.1. Storing as synthesis of 1-D map

In (Dmitriev, 1991; Dmitriev, Panas & Starkov, 1991) we proposed to store information
in nonlinear dynamic systems as dynamic attractors (cycles, periodic orbits or even
chaotic attractors). A method of storing information as stable cycles of a map of a
segment into itself was proposed. Since an arbitrary map need not have the required
cycles, this is actually a method for synthesis of a dynamic system (1-D map) in whose
phase space there exist cycles of prescribed structure. This method is based, in part, on the
ideas of symbolic dynamics (Lind & Marcus, 1995; Kitchens, 1998). We partition the
phase space of the dynamic system into adjacent regions and assign each region a sym-
bol. So, when the phase trajectory visits some region, we treat this as “appearance” or
“production” of the corresponding symbol by this dynamic system. In the theory of
symbolic dynamics this partitioning (generatrix) must be precise, in order to provide
unambiguous relation between the system variables and the produced symbols. Here,
we design an artificial dynamic system, so we may partition the phase space at will.
Then we construct the cycles in the system phase space that run through the necessary
space regions, which would mean production of the required symbols in the prescribed
order, and finally, design a dynamic system which has these cycles in its phase space.

The procedure is as follows. Since the cycle is finite and can carry a limited portion
of information, we store finite blocks of information. Let us introduce the main notions
and terms using the example of storing two information blocks, e.g., the 1-D strings
“babe” and “add”. For simplicity, we use a subset of the Latin characters
A = {a, b, c, d, e} as the alphabet. The length of the alphabet is $N_A = 5$. Our aim is
to design the function $f$ of a 1-D map $x_{n+1} = f(x_n)$, such that stable cycles exist
in the phase space of this dynamic system, and each information block of length $n$
stored in the map is unambiguously related to an $n$-period cycle $\gamma_n$. The symbols
of the strings are coded by the amplitude of the mapping variable $x_n$. We will store
the words in a 1-D map using second-level storing, which means that each point of the
cycles is determined by a pair of successive symbols.

We divide the phase space of the dynamical system (the unit interval I = [0, 1]) into
$N_A$ subintervals of the first level (each with the length $1/N_A = 0.2$) and relate
them to the elements of the alphabet. Then we repeat the procedure and divide each of
the subintervals of the first level into subintervals of the second level (with the
length $1/N_A^2 = 0.04$) and also relate them to the alphabet elements, as shown in
Fig. 1.

Now we design two cycles $\gamma_n = \{x_1, x_2, \ldots, x_n\}$ unambiguously related to
the stored information blocks. Three cycle points for the word add are related to the
block fragments (pairs) ad, dd, and da (the information block is mentally closed in a
loop). The cycle point corresponding to the fragment ad is the center of the
second-level subinterval d located within the first-level subinterval corresponding to
the symbol a. Other cycle points are created similarly. The cycle points corresponding
to the block babe are determined by the pairs ba, ab, be, eb.
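
The construction of the cycle points can be sketched in Python as follows. This is a
minimal illustration, assuming that the symbols a, ..., e are numbered 0, ..., 4 and
that each cycle point is the center of the level-q subinterval selected by q successive
symbols of the block closed in a loop. The function name cycle_points is ours.

```python
ALPHABET = "abcde"  # A = {a, b, c, d, e}, so N_A = 5


def cycle_points(block, alphabet=ALPHABET, q=2):
    """Return the cycle points of a block stored at level q (illustrative helper)."""
    n_a = len(alphabet)
    n = len(block)
    points = []
    for i in range(n):
        # q successive symbols, with the block closed in a loop.
        fragment = [block[(i + k) % n] for k in range(q)]
        # Left end of the nested subinterval selected by the fragment.
        left = sum(alphabet.index(c) / n_a ** (k + 1) for k, c in enumerate(fragment))
        # The cycle point is the center of the level-q subinterval.
        points.append(left + 0.5 / n_a ** q)
    return points


if __name__ == "__main__":
    print(cycle_points("add"))   # points for the pairs ad, dd, da
    print(cycle_points("babe"))  # points for the pairs ba, ab, be, eb
```

For the word add this gives points close to 0.14, 0.74 and 0.62, i.e., the centers of
the subintervals [0.12, 0.16], [0.72, 0.76] and [0.60, 0.64].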

Having created the cycles $\gamma_n$ in the 1-D phase space, we construct a dynamical
system possessing a phase space with such a structure. To obtain a map with a cycle of
period $n$ passing through the points $x_1, \ldots, x_n$, it is necessary to plot the
points $(x_1, x_2), (x_2, x_3), \ldots, (x_n, x_1)$ on the plane $(X_m, X_{m+1})$ and
draw a curve $y = f(x)$ passing through these points.

In the plane $(X_m, X_{m+1})$ we plot the pairs of successive points $(x_i, x_{i+1})$
for all the cycles. These points form the “skeleton” of the map function $f(x)$. Through
these points, we then draw short straight-line segments (called information regions),
all with the same fixed slope $s$. We will control the stability of the cycles by
changing the slope of these segments (turning them around the central point, which
coincides with the cycle point lying on this segment).

Figure 1: Storing two information blocks “babe” and “add” in 1-D map. Cycle points are designated with
squares and diamonds, respectively. Storage level q = 2, s = 0.5. (a) The map function. (b) Information
carrying cycles.

As is known, the stability of a cycle is determined by its multiplier $\mu$. In the case
of a 1-D map $x_{n+1} = f(x_n)$, the multiplier (eigenvalue) of the cycle
$\gamma_n = \{x_1, x_2, \ldots, x_n\}$ is equal to

$$\mu = \left(f^{(n)}\right)'(x_1) = f'(x_1)\, f'(x_2) \cdots f'(x_n). \qquad (1)$$

Here, $\mu = s^n$. If $|\mu| < 1$ ($|s| < 1$) the cycle is stable, otherwise it is
unstable. To complete the synthesis of the piecewise-linear map function $f(x)$, we
connect the information regions and the unit interval endpoints in series with
straight-line segments, which we will further call non-information segments. The plot of
the map with the information cycles is shown in Fig. 1.
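
The synthesis of the map can be sketched in Python as follows. This is only an
illustration under stated assumptions: the information regions have the width of a
second-level subinterval (0.04), the values of the map at the endpoints of the unit
interval are taken as f(0) = 0 and f(1) = 1 (the text does not specify them), and the
cycle points 0.14, 0.74, 0.62 (add) and 0.22, 0.06, 0.38, 0.86 (babe) are the centers of
the second-level subintervals with the symbols a, ..., e numbered 0, ..., 4. The names
build_map and multiplier are ours.

```python
from bisect import bisect_right


def build_map(cycles, width, s):
    """Return a piecewise-linear map f with information regions of slope s."""
    knots = [(0.0, 0.0), (1.0, 1.0)]  # assumed values at the interval endpoints
    h = width / 2.0
    for cycle in cycles:
        for i, x in enumerate(cycle):
            y = cycle[(i + 1) % len(cycle)]  # the next point of the cycle
            # Information region: a short segment of slope s centered at (x, y).
            knots += [(x - h, y - s * h), (x + h, y + s * h)]
    knots.sort()
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]

    def f(x):
        x = min(max(x, 0.0), 1.0)
        # Linear interpolation between neighboring knots; the pieces between
        # information regions play the role of non-information segments.
        j = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        return ys[j] + t * (ys[j + 1] - ys[j])

    return f


def multiplier(s, n):
    """Multiplier of an n-period cycle whose information regions have slope s."""
    return s ** n


ADD = [0.14, 0.74, 0.62]
BABE = [0.22, 0.06, 0.38, 0.86]

if __name__ == "__main__":
    f = build_map([ADD, BABE], width=0.04, s=0.5)
    x = 0.15  # a point inside the information region around 0.14
    for _ in range(30):
        x = f(x)
    print(x, multiplier(0.5, len(ADD)))
```

With s = 0.5 a trajectory started inside an information region contracts towards the
corresponding cycle, in agreement with the stability condition above.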

Iterates of the designed map produce the output information stream: an occurrence of the
system variable $x_i$ in a first-level subinterval is treated as “generation” of the
corresponding alphabet element. Mathematically, $m_i = \mathrm{int}(N_A x_i)$, where
$m_i$ is the order number of this element in the alphabet, and int() denotes the integer
part of a number. Thus, the motion of the phase trajectory over a cycle in the system
phase space is accompanied by continuous reproduction of the corresponding information
block.
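
The reading of symbols from a trajectory can be illustrated by a short Python sketch. It
assumes that the alphabet elements a, ..., e are numbered from zero; the function name
decode and the guard for x = 1 are ours.

```python
ALPHABET = "abcde"  # N_A = 5


def decode(trajectory, alphabet=ALPHABET):
    """Translate map iterates into symbols via m_i = int(N_A * x_i)."""
    n_a = len(alphabet)
    # min() keeps the index valid in the limiting case x = 1.
    return "".join(alphabet[min(int(n_a * x), n_a - 1)] for x in trajectory)


if __name__ == "__main__":
    print(decode([0.14, 0.74, 0.62]))        # cycle of the block "add"
    print(decode([0.22, 0.06, 0.38, 0.86]))  # cycle of the block "babe"
```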

Storing information as stable limit cycles allows easy associative access to the stored
information. If an equilibrium point is used as an information carrier, all or most of
the information is necessary to access the point and to retrieve the information. If an
image is stored as a stable cycle, as in our case, each point of the cycle is related to
only a part of the image, and only a piece of the original information is necessary to
get a point near the cycle and to retrieve the whole image by iterating the dynamical
system. Thus, associative access to the stored information becomes possible, though at
the expense of iteration time. Indeed, if we take an excerpt $a_i a_{i+1} \ldots a_j$ of
an information block with the length equal to or greater than the storage level $q$,
then we can apply the same procedure as in creating the cycle points and get a point
lying exactly on the corresponding cycle. Note that this is direct access to
information, because the offered excerpt is not compared to all the images; instead, an
initial point lying on the required cycle is directly calculated, so the access to
information is very fast.

2.2. Storing texts, signals, images in 1-D map

Not only simple letter sequences can be stored in 1-D maps. Any kind of information that
can be represented as a symbolic sequence over a certain alphabet can be stored. Texts,
DNA sequences, vocal sheet music, etc., can be represented by symbols from a finite set,
i.e., they can be stored as cycles of a 1-D map.

Moreover, the method can also be applied to storing more complex data, e.g., 2-D digi-
tal images, because they can also be transformed into 1-D symbol sequences, as is
shown in Fig. 2. As the cycle of period p repeats itself after p iterations (here, p = mn),
all cyclic permutations are equivalent and generate the same trajectory. We can say that
the system loses its initial phase when converging towards the cycle.

This situation is undesirable when working with pictures, since a picture has a definite
beginning and it is difficult to recognize a cyclic permutation of the picture. Therefore,
we mark the beginning of the information block by putting a label (a special symbol of
the alphabet). Correspondingly, when restoring the pattern from the values of the map-
ping variable on the limit cycle, we consider the first element after the label as the be-
ginning of the picture (the label itself is not displayed). Thus the alphabet is augmented
with a special symbol for the label. The label element is used only once in each infor-
mation block.

Figure 2: Representation of the letter E. A label symbol is added to the string of black and white elements

As an example, let us make an information block from a picture of the letter E (Fig. 2).
The picture is represented here by an 8×8 array. Assume that the alphabet consists of
three elements 0, 1 and 2, where 0 corresponds to black, 1 corresponds to white, and 2
corresponds to the label. Then the information block is written as

21111111001100010011010000111100001101000011000101111111000000000. (2)
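
The conversion of a picture into an information block, and the restoration of the
picture from a cyclically shifted stream, can be sketched in Python as follows. The
sketch assumes row-by-row scanning of the 8×8 array, which is consistent with string
(2). The names picture_to_block and block_to_picture are ours.

```python
LABEL = "2"  # label symbol marking the beginning of the block

# 8x8 picture of the letter E, one string per row (0 = black, 1 = white).
PICTURE_E = [
    "11111110",
    "01100010",
    "01101000",
    "01111000",
    "01101000",
    "01100010",
    "11111110",
    "00000000",
]


def picture_to_block(rows, label=LABEL):
    """Scan the picture row by row and put the label in front."""
    return label + "".join(rows)


def block_to_picture(stream, width, height, label=LABEL):
    """Restore the rows from one period of the stream, in any cyclic shift."""
    start = stream.index(label) + 1   # the first element after the label
    doubled = stream + stream         # unroll the cycle once
    pixels = doubled[start:start + width * height]
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    block = picture_to_block(PICTURE_E)
    print(block)
    shifted = block[10:] + block[:10]  # the phase is lost on the cycle
    print("\n".join(block_to_picture(shifted, 8, 8)))
```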

Since the above string contains 2 identical substrings (001101000011) of length 12, the
minimum number of levels necessary to store the image with the original storing meth-
od cannot be less than 13.
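
The requirement on the storage level can be checked with a short Python sketch. It
assumes that a set of blocks can be stored at level q when all length-q fragments of the
blocks, each closed in a loop, are pairwise different. The function names and the bound
max_q are ours.

```python
def cyclic_fragments(block, q):
    """All length-q fragments of a block closed in a loop."""
    n = len(block)
    return [tuple(block[(i + k) % n] for k in range(q)) for i in range(n)]


def is_storable(blocks, q):
    """True if no two length-q fragments of the whole set coincide."""
    fragments = [frag for block in blocks for frag in cyclic_fragments(block, q)]
    return len(fragments) == len(set(fragments))


def min_storage_level(blocks, max_q=64):
    """Smallest level q <= max_q at which the set is storable, else None."""
    for q in range(1, max_q + 1):
        if is_storable(blocks, q):
            return q
    return None


if __name__ == "__main__":
    print(min_storage_level(["babe", "add"]))
```

Applied to block (2), the estimate given above implies that the returned level cannot be
below 13.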

Let us estimate how much information can be stored in a 1-D map at level $q$. Since the
size of an information region is $N_A^{-q}$, no more than $N_A^q$ symbols can be stored
(in that case, all function segments are information regions). Proceeding to more usual
bits, we obtain the utmost capacity limit of

$$E_{\max} = N_A^q \log_2(N_A) \text{ bits}.$$

If we consider, for definiteness, storing of blocks of equal length $l$, the information
capacity $E$ of the method can be estimated as follows, with $N = N_A$ (see reasoning in
Andreyev et al., 1992):

$$E = \begin{cases} N^l / l, & l \le q; \\ N^q / l, & l > q. \end{cases} \qquad (3)$$
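
The two estimates can be evaluated numerically with the following Python sketch. It
relies on our reading of expression (3), in which the exponent is the smaller of l and
q; the function names are ours.

```python
import math


def max_capacity_bits(n_a, q):
    """Upper limit E_max = N_A^q * log2(N_A), in bits."""
    return n_a ** q * math.log2(n_a)


def block_capacity(n_a, q, l):
    """Estimate (3) for blocks of equal length l stored at level q."""
    return n_a ** min(l, q) / l


if __name__ == "__main__":
    # Alphabet of five symbols stored at the second level, as in Section 2.1.
    print(max_capacity_bits(5, 2))
    print(block_capacity(5, 2, 4))
```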

2.4. Storing and coding arbitrary information blocks

As can be seen from the example in Fig. 2, it is difficult to store information blocks
having large identical fragments. The number of storage levels must exceed the length of
the identical fragments, but an increase of the number of storage levels leads to an
exponential decrease of the size of the information regions. Finite accuracy of computer
calculations puts a limit on the potential number of storage levels.

To overcome this difficulty, a development of the original method was proposed in
(Andreyev, Belsky & Dmitriev, 1992; Andreyev et al., 1992). Analysis of information
sequences (pictures, texts, etc.) shows that the main storing difficulties are
associated with repeating pieces of data. This leads us to the idea of compressing
information (eliminating redundancy) before storing it. A compression method matching
the storing requirements (e.g., at level $q$) and using an alphabet of repeating
fragments was proposed (Andreyev et al., 1992; Andreyev, Belsky & Dmitriev, 1994;
Andreyev et al., 1996a).

The idea is to substitute repeating fragments of length $q$ with new symbols of the
alphabet. Thus, shorter information blocks are obtained, while the alphabet is extended.
The procedure is repeated until the whole set of blocks becomes "q-storable", i.e.,
contains no identical fragments of length $q$. Thus, the encoding procedure means
elimination of information redundancy by encoding the repeating fragments with symbols.
Using this coding method, any set of information blocks can be stored at any level,
beginning from the second.
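
The substitution procedure can be sketched in Python as follows. This is a simplified
illustration, not the published algorithm: the order in which repeated fragments are
chosen, the treatment of blocks as open strings rather than loops, and the form of the
new symbols ("<0>", "<1>", ...) are assumptions of ours.

```python
def find_repeat(blocks, q):
    """Return a length-q fragment that occurs more than once, or None."""
    seen = set()
    for block in blocks:
        for i in range(len(block) - q + 1):
            fragment = tuple(block[i:i + q])
            if fragment in seen:
                return fragment
            seen.add(fragment)
    return None


def substitute(block, fragment, symbol):
    """Replace non-overlapping occurrences of the fragment by one new symbol."""
    q, out, i = len(fragment), [], 0
    while i < len(block):
        if tuple(block[i:i + q]) == fragment:
            out.append(symbol)
            i += q
        else:
            out.append(block[i])
            i += 1
    return out


def encode(blocks, q):
    """Repeat the substitution until no length-q fragment occurs twice."""
    blocks = [list(b) for b in blocks]
    table = {}  # additional alphabet: new symbol -> fragment it replaces
    while (fragment := find_repeat(blocks, q)) is not None:
        symbol = "<%d>" % len(table)
        table[symbol] = fragment
        blocks = [substitute(b, fragment, symbol) for b in blocks]
    return blocks, table


def decode(block, table):
    """Expand the additional symbols, possibly several times (lossless)."""
    out = []
    for token in block:
        if token in table:
            out.extend(decode(table[token], table))
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    encoded, table = encode(["abcabc", "xabcx"], q=3)
    print(encoded, table)
    print(["".join(decode(b, table)) for b in encoded])
```

Each substitution shortens the set of blocks, so the loop terminates; decoding restores
the original blocks exactly.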

The coding method is reversible, i.e., lossless. To decode information, symbols of the
additional alphabet are substituted by length-$q$ fragments; this may have to be
repeated several times. From the viewpoint of compressing information, the described
method is very close to the known lossless Lempel-Ziv (LZ) compression methods, so its
compression ratio is very close to that of these methods. The difference is that this
compression method is matched with the method of storing at level $q$.

2.4.1 Associative access to encoded information

To implement associative access, i.e., recovery of a stored image by its arbitrary
fragment, it is necessary to set the initial conditions on the related attractor. For
this, a fragment $(a_j a_{j+1} \ldots a_{j+q-1})$ of length $q$ of the information block
is required, which is used to calculate the initial point $x_0$:

$$x_0 = \sum_{k=1}^{q} \frac{a_{j+k-1}}{N^k}. \qquad (4)$$
Associative access in the case of encoded information is organized as follows. If a
fragment of length $q$ of the encoded information block is given as a request, then the
initial point $x_0$ can be calculated according to expression (4), the encoded
information block recovered, and then the original information block decoded. However,
it is much more interesting to recover information by fragments of the original (not
coded) blocks. To implement associative access in this case, we should be able to
transform a given request in the initial alphabet into the corresponding encoded
fragment. The first step is encoding this request using the present additional alphabet,
i.e., the table of fragments obtained by orthogonalization of the information blocks.
The problem is that the request fragment can begin with an arbitrary symbol of the
original information block, preceded by symbols that do not exist in the encoded block
because they were incorporated into new symbols. For example, if a block (abcdefghijk)
after coding at the third level became (xdyhz), where x = (abc), y = (efg), z = (ijk),
and a fragment of the initial block (cdefghi) is presented as a request, then after
encoding this fragment with the additional alphabet, the sequence (cdyhi) is obtained,
which contains a “correct” piece of the encoded block (dyh) in the center and "garbage"
elements at the beginning and at the end.

To get rid of them, we move a $q$-element window along the encoded request until we find
an initial point $x_0$ that hits an information interval. We can do this because we know
the map function. Then we iterate the map beginning from this initial point, and compare
the generated information stream with the symbols of the encoded request. If there is a
match, at least for a few iterates, then the matching symbols are “correct” and the
other symbols are “garbage”. If the encoded request fragment contains a correct piece of
length $q$ (at least), it will be found. When the initial point $x_0$ is obtained, the
entire information sequence can easily be restored and, consequently, the entire image.
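
The window scan can be sketched in Python for the example above. This is an idealized
illustration: instead of the real-valued map, it works with the numbers of the level-q
subintervals (the integer part of $x_0 N^q$), so that an information interval is
represented by a table entry pointing to the next interval of the cycle, and the
comparison of further generated symbols with the request is omitted. The alphabet
ordering and the names index_of, store and recall are ours.

```python
ALPHABET = "abcdefghijkxyz"  # initial letters plus the additional symbols x, y, z
N = len(ALPHABET)


def index_of(fragment):
    """Number of the level-q subinterval containing x0 of expression (4)."""
    idx = 0
    for symbol in fragment:
        idx = idx * N + ALPHABET.index(symbol)
    return idx


def store(block, q):
    """Map every information interval of the block to the next one on the cycle."""
    n = len(block)

    def fragment(i):
        return [block[(i + k) % n] for k in range(q)]

    return {index_of(fragment(i)): index_of(fragment(i + 1)) for i in range(n)}


def recall(request, table, q):
    """Slide a q-element window over the request; return the block or None."""
    for start in range(len(request) - q + 1):
        first = index_of(request[start:start + q])
        if first not in table:
            continue  # "garbage": this window hits no information interval
        stream, idx = [], first
        while True:
            stream.append(ALPHABET[idx // N ** (q - 1)])  # first-level symbol
            idx = table[idx]
            if idx == first:
                return "".join(stream)
    return None


if __name__ == "__main__":
    table = store("xdyhz", q=3)
    print(recall("cdyhi", table, q=3))  # a cyclic shift of the stored block
    print(recall("abc", table, q=3))    # nothing suitable is stored
```

The block is returned as a cyclic shift beginning with the matched fragment; a label
symbol, as used for pictures, would fix its beginning.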

Thus, the system of associative memory based on the described principles, in response to
the offered information block or a part of it, practically immediately forms one of two
answers: it either returns the initial point $x_0$ on the corresponding attractor (limit
cycle), with which the entire block can be restored, or it gives the answer that the
offered information is insufficient to unambiguously recover the information (perhaps
because there is no such information in the map).

Note that forming the initial point on the attractor and, consequently, recovery of the
initial image by its fragment take place without comparison of the request fragment with
all stored images. After encoding the request and calculating the initial point $x_0$,
just a few iterates are necessary to determine whether the point is on a map attractor,
and the time of each iteration of the map is proportional to the logarithm of the volume
of stored information. Thus, the described associative access operates as a very fast
correlator.

2.5. Storing information in 2-D and multi-dimensional maps

To store information at the $q$th level with the original method, we used nested
subdivisions of the system phase space. Instead, other dimensions of the space $R^q$ can
be used. The method for storing information as cycles of dynamic systems is generalized
to 2-D and higher dimensions (Andreyev, Belsky & Dmitriev, 1994). Regular procedures of
designing multidimensional maps are developed.

Even multi-dimensional signals can be stored in such maps. Storing multi-dimensional
signals with capabilities of quick associative search can find practical application in
geophysical studies, tomography, the construction of hierarchical systems, etc.

3. DYNAMICS OF MAPS WITH STORED INFORMATION

By the construction procedure, cycles with stored information are stable if the slope
$s$ of the information regions of the map is less than 1. Here, we discuss the phenomena
that occur when these limit cycles are made unstable, i.e., in the case $s > 1$
(Andreyev, 1995). To investigate these phenomena, bifurcation diagrams are built for the
parameter $s$.
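
The collection of data for such a bifurcation diagram can be sketched in Python as
follows. The sketch is illustrative only: it assumes a decimal alphabet with second-level
storing (information regions of width 0.01), map values f(0) = 0 and f(1) = 1 at the
endpoints of the unit interval (not specified in the text), and arbitrary transient and
sample lengths. All function names are ours, and the detailed shape of the resulting
diagram depends on these assumptions.

```python
from bisect import bisect_right


def cycle_points(block, n_a=10, q=2):
    """Centers of the level-q subintervals for a block of decimal digits."""
    n = len(block)
    points = []
    for i in range(n):
        left = sum(int(block[(i + k) % n]) / n_a ** (k + 1) for k in range(q))
        points.append(left + 0.5 / n_a ** q)
    return points


def build_map(cycles, width, s):
    """Piecewise-linear map with information regions of slope s."""
    knots = [(0.0, 0.0), (1.0, 1.0)]  # assumed values at the interval endpoints
    h = width / 2.0
    for cycle in cycles:
        for i, x in enumerate(cycle):
            y = cycle[(i + 1) % len(cycle)]
            knots += [(x - h, y - s * h), (x + h, y + s * h)]
    knots.sort()
    xs = [k[0] for k in knots]
    ys = [k[1] for k in knots]

    def f(x):
        x = min(max(x, 0.0), 1.0)
        j = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        return ys[j] + t * (ys[j + 1] - ys[j])

    return f


def bifurcation_data(blocks, slopes, x0, skip=2000, keep=200):
    """For every slope, drop a transient and return the points visited after it."""
    cycles = [cycle_points(b) for b in blocks]
    data = {}
    for s in slopes:
        f = build_map(cycles, width=0.01, s=s)
        x = x0
        for _ in range(skip):
            x = f(x)
        samples = []
        for _ in range(keep):
            x = f(x)
            samples.append(x)
        data[s] = samples
    return data


if __name__ == "__main__":
    slopes = [0.9 + 0.01 * k for k in range(25)]
    data = bifurcation_data(["12345", "97583"], slopes, x0=0.126)
    for s in slopes:
        print(round(s, 2), min(data[s]), max(data[s]))
```

Plotting the sampled points against s gives a diagram of the kind discussed in this
section.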

As is shown in (Maistrenko, Maistrenko, Sushko, 1994a, 1994b), in piecewise-linear maps,
besides the ordinary cycles of points $\gamma_m = \{x_1, x_2, \ldots, x_m\}$, there might
be cycles of intervals $\Gamma_m = \{I_1, I_2, \ldots, I_m\}$, i.e., chaotic attracting
sets composed of a finite number of intervals $I_k$. Each interval is mapped exactly
into the next one, i.e., $I_{n+1} = f(I_n)$. When moving on this attractor, the
trajectory goes through points $x_1, x_2, \ldots, x_m, \ldots$, with $x_{n+1} = f(x_n)$,
located sequentially on the intervals $I_1, I_2, \ldots, I_m$. In (Maistrenko,
Maistrenko, Sushko, 1994a, 1994b) the theory of piecewise-linear maps with a single
extremum was developed, and the birth mechanisms of cycles of chaotic intervals and the
dynamic properties of these attractors were investigated.

For example, consider bifurcation diagrams of the map with two information blocks 12345
and 97583 stored at the second level. The size of the information regions here is 0.01.
When the cycle stability is lost at $s = 1$, two interval cycles are born for each cycle
of points, one at each end of the information regions. Fig. 3(a) depicts the case in
which, at the stability loss of the limit cycle carrying information block 97583, the
phase trajectory is attracted to the right interval cycle in the vicinity of this
information cycle. At $s \approx 1.03$ the interval cycle at the right border loses
stability, and the trajectory leaves it and goes towards the attractor on the left
border. If the latter were stable at that moment, the trajectory would remain there. But
due to a similar instability, the trajectory remains there only for some time, then
returns to the first attractor, and so on. Thus, the birth of a united interval cycle
that embraces entire information segments of cycle 97583 is observed.

Figure 3: Bifurcation diagrams, panels (a) and (b), of the map with two information blocks

The slopes of non-information segments adjacent to different information segments are
different, so the “lifetimes” and stability conditions with respect to the parameter $s$
are different for each of these interval cycles. Fig. 3(b) depicts the case in which,
during the bifurcation of stability loss of the limit cycle corresponding to information
block 12345, the trajectory is attracted to the newly born stable interval cycle that
goes through the right border of the information segments. But it loses stability
practically immediately, and the trajectory goes to the interval cycle at the other
border of the information segments. At $s \approx 1.015$ the united interval cycle
$\Gamma_5$ is born. At the moment when it also loses stability (at $s \approx 1.025$),
there still exists a stable interval cycle connected with the other information block
97583, and the phase trajectory is attracted to it.

At $s = s_2 \approx 1.07$, the united interval cycle for block 97583 loses stability.
Since no stable structures remain in the phase space of the dynamic system, the phase
trajectory starts wandering over the system phase space. At the moment of their
appearance, the holes in the interval cycles are small, and the trajectory spends most
of its time in the vicinity of the interval cycles, but with increasing slopes the holes
extend, and the map variable distribution becomes more uniform.

Analysis of bifurcation diagrams of the maps with stored information shows that at
$s < 1$ the only stable attractors in this oscillation system are information limit
cycles, and at $1 < s < s_2$ they are chaotic interval cycles. At $s > s_2$ a transition
to global chaos through intermittency takes place in the nonlinear dynamic system.

4. RANDOM ACCESS TO STORED IMAGES

In section 2 we described the method for storing information on 1-D maps using stable
limit cycles. We will show in this section the possibility of realizing random access, in
the sense that the trajectory visits all regions which correspond to information stored in
the map. It is clear that we cannot use stable limit cycles for this purpose.

We want the trajectory to visit each region (or just the regions where information is
stored) of the phase space from time to time. For this purpose, we can use a strange
attractor with a nonzero distribution of the system variable in every point on the interval
[0, 1]. However, in general, a strange attractor visits each region only briefly.

For the random access memory, we exploit the phenomenon of intermittency. The idea
of memory scanning using intermittency was discussed qualitatively by Nicolis (1986,
1991). In intermittency mode, the phase trajectory "lingers" in some definite regions of
the phase space. So, we want the trajectory to be in the vicinities of the limit cycles
corresponding to information blocks.

"Random access" can therefore be realized using unstable cycles to store information
blocks. Each cycle can easily be made unstable by letting the absolute value of the cycle
multiplier (1) (i.e., the product of the slopes in the cycle points) be slightly larger than
1. A trajectory starting from the vicinity of a cycle will leave the cycle after some time
and begin to wander in phase space until it gets into the neighborhood of the same limit
cycle or another limit cycle with the same stability property. Then the trajectory will
linger near that limit cycle for some time before leaving it again. If, independent of the
initial condition, the trajectory visits the neighborhoods of all points, intermittency with
respect to all cycles corresponding to stored information will then be observed over
short but random time intervals. It must be noted that intermittency takes place in a
rather narrow region of the parameter space only.

Let us demonstrate random-access memory with an example of 2-D data, i.e., graphic
images. Ten information blocks, corresponding to ten 8×8 patterns (depicting the letters
A, b, C, d, E, f, G, h, I, j), are stored in a 1-D map at the second level (Fig. 4a).
The information blocks were compressed beforehand. The map is designed in the same way
as described earlier, with the only exception that the slopes $s$ of the information
segments are slightly greater than 1. Actually, $s = 1.02$, hence, the cycles are
unstable. The sequence of snapshots in Fig. 4b demonstrates the intermittent appearance
of all ten images.

As can be seen in Fig. 4b, the trajectory visits all regions with the stored information
and lingers in the vicinity of unstable information cycles long enough. By switching the
slopes of information segments and making them less than 1, an unstable cycle can be
made stable.

Figure 4: (a) Set of letters stored in the 1-D map, and (b) intermittency between cycles
representing letters. Snapshots are taken at arbitrary moments

5. STORING INFORMATION AS CHAOTIC ATTRACTORS

Motion over an interval cycle is chaotic, yet the cycle itself is stable and confined in
space. An interval cycle consists of a finite number of continuous intervals that
embrace the information intervals of the corresponding information cycle (or parts of
them) and small parts of the adjacent non-information intervals. The order in which the
intervals of this chaotic attractor are visited coincides with the order of the points
of the information limit cycle located on these intervals. Therefore, if we consider the
information stream $a_i a_{i+1} a_{i+2} \ldots$ (where $a_i = \mathrm{int}(N x_i)$ and
int(x) is the integer part of x) produced by the map $x_{n+1} = f(x_n)$ during the motion
on this chaotic attractor, it appears to be a reproduction of the stored information
block.

Figure 5: Interval cycles of the map with information block 375 stored at the first
level: (a) time series, (b) phase portrait, (c) probability distribution

In Fig. 5, the use of chaotic attractors (interval cycles) for storing information is shown
for the case of information block 375 stored at the first level. Solution for the slope s =
1.125 is depicted.

6. UNSTABLE CYCLES AND RECOGNITION

Storing information as periodic motions in 1-D maps is an easy and efficient method of
organizing associative access. However, an analysis of the papers devoted to
investigation of information processing in natural brains indicates substantially more
complicated behavior in natural neural networks (e.g., see a review in (Tsuda, 1992)).
In particular, periodic motion often points to some “degenerate” states of the brain,
the “usual” state being chaotic (Skarda & Freeman, 1987; Yao & Freeman, 1990; Babloyantz
& Destexhe, 1986). Besides, the existence of stable limit cycles in the 1-D map phase
space leads to competition of cycles, which means that iterates from an arbitrary
initial point can converge to any cycle, because the attraction basins of the cycles
have fractal structure (Dmitriev, 1991; 1993).

In the above described procedure of storing information, the cycles may be easily made
unstable. All that is necessary is to set the slope s of the information regions of the
piecewise-linear map function to a value s > 1.
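
The stability criterion behind this statement can be written out explicitly. The notation below (cycle points $x_1, \ldots, x_n$ and multiplier $\mu$) is assumed here for illustration; it is the standard chain-rule result for a periodic orbit of a 1-D map whose information segments all have the same slope $s$.

```latex
% Multiplier of an n-cycle x_1 -> x_2 -> ... -> x_n -> x_1 of the map x_{k+1} = f(x_k),
% when every cycle point lies on an information segment of slope s:
\mu = \prod_{k=1}^{n} f'(x_k) = s^{n},
\qquad
|\mu| < 1 \iff |s| < 1 \quad \text{(stable cycle)},
\qquad
|\mu| > 1 \iff |s| > 1 \quad \text{(unstable cycle)}.
```

A slope only slightly greater than one therefore gives a multiplier only slightly greater than one for short cycles, which is the weakly unstable regime used for memory scanning.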

6.1. Direct map function control

A number of methods can be used to retrieve information from the map. Here we
demonstrate the method of direct control of the map function (slope switching) to make
the desired cycle stable while retaining others unstable, and apply this method to image
recognition (Andreyev, Belsky & Dmitriev, 1992).

According to the map design procedure, the phase space of the map contains a skeleton
of unstable periodic orbits coupled with the stored information blocks and passing
through information regions of the map function. Unstable periodic orbits coupled with
only non-information segments of the map are ignored in the further discussion.

We want to derive a regular procedure for deforming the map function ƒ such that, if a
presented image coincides with one of the stored images (or is close enough), the
corresponding cycle becomes stable and attracting. The phase trajectory then converges to
this stable periodic orbit, and the stored image is reproduced by the system, which can
be treated as recognition. Otherwise, no stable cycles appear, and the motion in the
dynamical system remains chaotic. Thus, the character of the motion in the modified
dynamical system, regular or chaotic, indicates the result of recognition.

Let M information blocks be stored in the map at the q-th level, and let an image be
presented for recognition in the form of a string of L symbols. The question that the
system is to answer is: does this image correspond to any of the stored information blocks
or not? The procedure is as follows. We look through the L fragments of this string, each q
symbols long, and change the slopes of those information regions of the map function
that correspond to the fragments, so that they become less than one in magnitude, as in
Fig. 6. If the presented image coincides with an image stored in the system, all
information regions of the map coupled with this image become switched, and the cycle
becomes stable. If we now iterate the modified map beginning from an arbitrary initial
point, we find that after some time the phase trajectory converges to this single stable
limit cycle, and the system’s behavior becomes regular.

If the presented image has nothing in common with the stored images, the map function
ƒ is not distorted, and the motion in the dynamical system remains chaotic.
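
The switching logic of this procedure can be sketched in Python. This is a simplified, purely combinatorial illustration, not the authors' implementation: it does not iterate the map, it assumes that every information region is identified by a fragment of q consecutive symbols taken cyclically, and it treats a cycle as stable only when all of its regions have been switched. The function names are hypothetical, and the example blocks are those quoted for Fig. 8.

```python
def fragments(block, q):
    """All cyclic fragments of length q of a symbol string (one per information region)."""
    n = len(block)
    return [tuple(block[(i + k) % n] for k in range(q)) for i in range(n)]


def recognize(stored_blocks, image, q):
    """Return the stored block whose cycle becomes stable, or None if the motion stays chaotic."""
    # Regions whose slope is switched below one in magnitude by the presented image.
    switched = set(fragments(image, q))
    for block in stored_blocks:
        # The cycle is stable only if every one of its information regions is switched.
        if all(fr in switched for fr in fragments(block, q)):
            return block
    return None


stored = ["123", "14568", "97583"]      # blocks stored at the second level (q = 2)
print(recognize(stored, "14568", 2))    # 14568
print(recognize(stored, "14668", 2))    # None
```

A cyclic shift of a stored block, such as 56814, switches the same set of regions and is therefore recognized as 14568. Recognition of images that are only "close enough" to a stored one would need a looser criterion than the strict test used in this sketch.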

Figure 6: Switching the slope of an information region.

It is important that only one attracting cycle appears in the system phase space.
Therefore, if the global chaotic attractor comprises the information regions of the map,
then in the modified map the trajectory will inevitably “fall”, after some time, onto an
information region of the stable cycle and converge to this cycle. Pictorially, this can
be described as the appearance of a “hole” in the chaotic set, through which the trajectory
“leaks” out from the chaotic to the regular mode. The stable limit cycle appearing because
of the crisis is unique, so the recognition process is practically independent of the
initial conditions for the phase trajectory. The choice of an initial point determines only
the duration of the transient process from the metastable chaotic set to the stable limit
cycle, i.e., the time of recognition.

The method of direct map function control that was designed to retrieve information
from the map is based on knowledge of the concrete map construction, but some
general methods, such as cycle stabilization by the OGY procedure (Ott, Grebogi &
Yorke, 1990) or chaotic synchronization (Afraimovich, Verichev & Rabinovich, 1986;
Pecora & Carroll, 1990), also seem applicable for this purpose.

6.2. Adaptive Model and Recognition

The above discussed possibility of retrieving information stored as unstable cycles of a
1-D map is an intermediate step in creating a model of a “living” system that processes a
continuous information stream and is capable of selecting (“recalling”) information
images stored in the system. If there are excerpts of “known” information objects in the
input information stream, the system “recalls” them and reproduces them in full
(because of the associative property) in the output information stream; if there are no
“known” objects in the input stream, the system returns to the initial chaotic state.
These different states of the system can be related to the “short-term” memory
(“inspiration”) and “long-term” memory (“storage”) inherent in brain systems.

Realization of these properties is possible with an adaptive model, a generalization of
the above model, in which information is stored as unstable cycles and the form of the
map function is controlled by an external signal (Fig. 7). The external signal here is an
endless sequence of symbols fed to the system input. The elements of the external signal
all belong to the same alphabet as the stored information blocks. They are fed to the
system input synchronously, one per iteration. The input signal does not influence the
system variable $x_n$; instead, it directly controls the system function f by changing
the slopes of the information regions. The slopes in this model are not switched, but
“slowly” oscillate between two boundaries. Now, since an image at the input is not known
immediately as a whole, the map function is modified continually, at each step, according
to the piece of information available at the moment. If the input contains pieces of
information related to some information regions, their slopes are rapidly decreased with
iterates; if not, they slowly return to the initial value.

Figure 7: block diagram of the adaptive recognition model.

If the storage was made at the q-th level, the input symbols are accumulated to form a
fragment of length q and, consequently, a point X on the map. If point X hits information
region j at the i-th time moment, the slope of this region, $s_i^j$, is decreased according
to some rule and can become less than one in magnitude.

Besides, backward relaxation is introduced: if at a moment i the input signal does not
correspond to this information region j, then the slope $s_i^j$ begins to return to the
initial state, i.e., to the value of the slope in the absence of the external signal.

Thus, at each moment only one information region of the map function can be turned
downward, all the others are turned upward. If the input sequence consists of successive
repetitions of a stored image, the related information regions will eventually become
modified (their slopes set less than one in magnitude), and a stable cycle will appear in
the system. Thus, the system permanently adapts itself to the input signal.

Let us designate the upper boundary for the slopes of the information regions as $S_u$
($> 1$), and the lower boundary as $S_s$ ($< 1$). The general equation describing the
dynamics of $s_i^j$ is then

$s_{i+1}^j = [\alpha s_i^j + (1 - \alpha) S_s]\,\delta_{ij} + [\beta s_i^j + S_u (1 - \beta)](1 - \delta_{ij})$, (5)

where $\delta_{ij} = 1$ if the input signal at moment $i$ corresponds to information region
$j$, otherwise $\delta_{ij} = 0$ (this means that at each time moment each information
region is turned only in one direction); $\beta$ determines the rate of relaxation, and
$\alpha$ the rate of convergence. In numerical experiments we used the values
$\alpha = 0.1$ and $\beta = 0.9$, which provided fast convergence to $S_s$ and slow return
to $S_u$. As follows from (5),
the convergence to $S_s$ in the presence of the corresponding external signal may take
place only if the condition

$0 < |\alpha^n| \ll |\beta| < 1$

is satisfied, where $n$ is the length of the corresponding cycle. The simplest case
$\alpha = 0$ corresponds to one-step convergence.
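
The slope dynamics of Eq. (5) can be tried numerically with a small Python sketch. The coefficient names alpha (convergence, 0.1) and beta (relaxation, 0.9) and the boundary values S_u = 1.5 and S_s = 0.5 are assumptions made for this illustration; the text only requires that the upper boundary exceed one and the lower boundary be below one.

```python
S_u, S_s = 1.5, 0.5       # assumed upper (> 1) and lower (< 1) slope boundaries
alpha, beta = 0.1, 0.9    # convergence and relaxation coefficients (names assumed)


def update_slope(s, hit):
    """One step of Eq. (5) for a single information region.

    hit is True when the current input fragment corresponds to this region.
    """
    if hit:
        return alpha * s + (1 - alpha) * S_s   # fast move towards S_s
    return beta * s + (1 - beta) * S_u         # slow relaxation towards S_u


def simulate(hits, s0=S_u):
    """Apply Eq. (5) along a sequence of hit/miss flags and return the slope history."""
    s, history = s0, []
    for hit in hits:
        s = update_slope(s, hit)
        history.append(s)
    return history


# Input repeating a stored image of length n = 3: the region is hit once per cycle.
periodic = simulate([i % 3 == 0 for i in range(60)])
print(max(periodic[-3:]) < 1)   # slope stays below one over a whole cycle

# No matching input: the slope relaxes back towards S_u.
relaxed = simulate([False] * 100, s0=S_s)
print(relaxed[-1] > 1)
```

With these assumed values the slope of a region that is hit once every three iterations settles between roughly 0.52 and 0.71, so the corresponding cycle stays stable while the input keeps repeating the image.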

6.3. Models of "long-term memory" and "short-term memory"

We will show now how the notions of "long-term memory" and "short-term memory",
widely used in the study of the principles of memory functioning in living systems (e.g.,
(Klatzky, 1975)), are applicable to the behavior of the adaptive system.

Let us begin with the long-term memory. After information blocks are stored, they are
present in the system all the time, and the carriers of information are the unstable
cycles. So, such a system may be interpreted as a long-term memory. Information is
present in the system, but an external stimulus is necessary to retrieve it.

The external stimulus is a signal containing information blocks stored in the system,
either exact or with some errors. In general, the system does not respond to other
information, remaining in the chaotic state. If an external signal carrying stored
information is fed to the system input, a stable limit cycle appears in the place of one
of the unstable cycles. When the external stimulation ceases, this cycle remains stable
until the slopes of the corresponding information regions return to their initial values,
i.e., as long as the condition for cycle stability (cycle multiplier less than one in
magnitude) is fulfilled.

Figure 8: “Online” recognition of the input stream representing repetitions of the stored information blocks.

An example of an “online” system is given in Fig. 8, where the dynamical system with
three information blocks, 123, 14568 and 97583, stored at the second level, demonstrates
transitions between the three stored information blocks. The input signal, the output
information stream and the map trajectory are presented. The input signal (upper plot)
consists of pieces of oscillations representing the stored images, 200 points each. As is
seen from the figure, the phase trajectory $x_n$ follows the input images (though with a
time delay), so the dynamical system successfully recognizes the images in the input
signal. A certain delay in switching between the cycles in the output stream is associated
with the discussed effect of the short-term memory.

Competition of stable information cycles caused by strong dependence on the initial
conditions disappears, because a single attracting cycle exists in the recognition process.
Chaos can be treated here as a reservoir containing “useful” trajectories (along with
many others). The main role of the global chaos in these systems seems to be global
mixing, which provides guaranteed, though random, access to all the stored images:
independently of the initial point, the system phase trajectory will sooner or later arrive
in the vicinity of any required cycle (in a properly designed map).

7. CONCLUSIONS

Models of human cognition using nonlinear dynamic systems with information stored
as dynamic attractors are described. They implement a wide range of information pro-
cessing functions, such as storing and retrieval, associative memory, memory scanning
based on intermittency, image recognition based on storing with unstable cycles and
direct map function control, “novelty filter”, “long-term” and “short-term” memories,
etc.

Associative memory is realized by storing information as a dynamic attractor (stable
cycle), which allows access to a stored image by its fragment: from a small part of the
stored image the system “recalls” the place where the image is stored, finds a point on
the corresponding cycle and retrieves the whole image by means of iteration (going
along the cycle).

Memory scanning, or random access, is meant here as a mode in which the phase trajectory
wanders chaotically over the phase space, comes “at random” into the vicinity of
this or that information cycle, lingers there for some time, sufficient to recover the
stored image, and then leaves this vicinity and goes wandering further. This is a
“slightly chaotic” mode, with information cycles made unstable but with the cycle
multipliers only slightly greater than unity in magnitude.

The designed systems can operate as a “novelty filter”, i.e., when a piece of data is
presented to the system with information blocks stored as stable cycles, the system
immediately answers the question of whether this data piece is known to the system (i.e.,
it is stored) or is “new”. For the presented data piece the system calculates a point in
the phase space (or a set of points) and checks whether they fit any cycle point.
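
A minimal Python sketch of such a check is given below. It is an illustration under assumptions rather than the authors' code: a fragment of q symbols is mapped to an integer region index (its value as a q-digit number in base N) instead of a real-valued phase-space point, and a data piece counts as "known" only if every length-q fragment of it falls on a region visited by a stored cycle. The function names are hypothetical.

```python
def region_index(fragment, N):
    """Index of the information region for a fragment: its value as a base-N number."""
    idx = 0
    for a in fragment:
        idx = idx * N + a
    return idx


def cycle_regions(blocks, q, N):
    """Set of region indices visited by the cycles of all stored blocks (cyclic fragments)."""
    regions = set()
    for block in blocks:
        n = len(block)
        for i in range(n):
            frag = [block[(i + k) % n] for k in range(q)]
            regions.add(region_index(frag, N))
    return regions


def is_known(piece, regions, q, N):
    """True if every length-q fragment of the presented piece fits a stored cycle point."""
    frags = [piece[i:i + q] for i in range(len(piece) - q + 1)]
    return bool(frags) and all(region_index(fr, N) in regions for fr in frags)


blocks = [[1, 2, 3], [1, 4, 5, 6, 8], [9, 7, 5, 8, 3]]
regions = cycle_regions(blocks, q=2, N=10)
print(is_known([4, 5, 6], regions, 2, 10))   # True: part of the stored block 14568
print(is_known([4, 6, 8], regions, 2, 10))   # False: fragment 46 is not stored
```

The check is a set lookup, which matches the remark that the answer is obtained immediately, without iterating the map.
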

Image recognition by means of direct map function control, and the “long-term” and
“short-term” memory described in Section 6, become possible when information is stored as
unstable cycles and the cycle stability is varied depending on the request presented.
Note also the surprisingly high performance of the chaotic processing models, which makes
it possible to solve rather complex and large problems by ordinary computer simulation.
It seems that this is due not only to a good model but also to the properties of
the chaotic systems themselves, e.g., flexibility and quick reaction to external influence.

The discussed models are very simple and allow complete description, yet they possess
considerable information capacity and may be of practical interest from the viewpoint
of information processing technologies using chaos. In addition, one can see that the
described material is closely connected with the methods of symmetrology for studying many
characteristics of cyclic processes, periodic orbits, fractal structures and other typical
objects in the field of dynamic chaos.

REFERENCES
Afraimovich V.S., Verichev N.I., & Rabinovich M.I. (1986) Chaotic synchronization of oscillations in dissi-
pative systems, Izvestiya VUZov. Radiofizika, vol. 29, no. 9, 1050 (in Russian).
Almeida, J. S., Carrico, J. A., Maretzek, A. M., Noble, P. A., & Fletcher, M. (2001). Analysis of genomic
sequences by chaos game representation. Bioinformatics, 17, 429-437.
Andreyev Yu.V. (1995) Attractors and bifurcation phenomena in 1-D dynamical systems with stored informa-
tion. Izvestiya VUZov. Prikladnaya nelineinaya dinamika, vol. 3, no. 5, 3-15 (in Russian).
Andreyev Yu.V., Belsky Yu.L., & Dmitriev A.S. (1992) Information processing in nonlinear systems with
dynamic chaos. Proceedings of Int. Seminar Nonlinear Circuits and Systems, Moscow, vol. 1, 51-60.
Andreyev Yu.V., Dmitriev A.S., Chua L.O., & Wu C.W. (1992) Associative and random access memory using
one-dimensional maps. International Journal of Bifurcation and Chaos, vol. 2, no. 3, 483-504.
Andreyev Yu.V., Belsky Yu.L., & Dmitriev A.S. (1994) Storing and recognition of information using stable
cycles of 2-D and multi-dimensional maps. Radiotekhnika i elektronika, vol. 39, no. 1, 114-123 (in
Russian).
Andreyev Yu.V, Belsky Yu.L., Dmitriev A.S., & Kuminov D.A. (1996a) Information processing using dy-
namical chaos. IEEE Transactions on Neural Networks, vol. 7, 290–291.
Atmanspacher H., Scheingraber H., & Voges W. (1988) Global Scaling Properties of the Chaotic Attractor
Reconstructed from Experimental Data. Physical Review A, vol. 37, 1314–1322.
Auerbach D., Cvitanovic P., & Eckmann J.-P. (1987) Exploring chaotic motion through periodic orbits. Physi-
cal Review Letters, vol. 58, no. 23, 2387.
Babloyantz A. (1986) Evidence of chaotic dynamics of brain activity during the sleep cycle. In Dimensions and
Entropies in Chaotic Systems (G. Mayer-Kress, Ed.), Berlin: Springer-Verlag, 252-259.
Babloyantz A. & Destexhe A. (1986) Low-dimensional chaos in an instance of epilepsy. Proceedings of
National Academy of Sciences USA, vol. 83, 3513-3517.
Baird B. & Eeckman F. (1992) A normal form projection algorithm for associative memory. In: Associative
Neural Memories: Theory and Implementation (Ed. M. H. Hassoun), New York: Oxford University
Press, 33-51.
Basu, S., Pan, A., Dutta, C., & Das, J. (1997). Chaos game representation of proteins. Journal of molecular
graphics & modeling, October, 279-289.

Carpenter G.A. (1989) Neural network models for pattern recognition and associative memory. Neural Net-
works, vol. 2, 243.
Cvitanovic P. (1988) Invariant Measurement of Strange Sets in Terms of Cycles. Physical Review Letters, vol.
61, no. 24, 2729–2732.
Deschavanne, P., Giron, A., Vilain, J., Fagot, G., & Fertil, B. (1999). Genomic signature: characterization and
classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol.
16(October), 1391-1399.
Destexhe A., Sepulchre J.A., & Babloyantz A. (1988) A comparative study of the experimental quantification
of deterministic chaos. Physics Letters A, vol. 132, 101-106.
Dmitriev A.S. (1991) Storing and recognition information in one-dimensional dynamical systems. Radio-
tekhnika i Elektronika, vol. 36, no. 1, 101-108 (in Russian).
Dmitriev A.S. (1993) Chaos and information processing in dynamical systems. Radiotekhnika i elektronika,
vol. 38, no. 1, 1-24 (in Russian).
Dmitriev A.S., Panas A.I., & Starkov S.O. (1991) Storing and recognition information based on stable cycles
of one-dimensional maps. Physics Letters A, vol. 155, no. 8/9, 494-499.
Dmitriev A.S., Kuminov D.A., Pavlov V.V., & Panas A.I. (1993) Storing and processing texts in 1-D dynami-
cal systems. Preprint no. 3 (585), Institute of Radioengineering and Electronics RAS, Moscow, (in Rus-
sian).
Dutta, C., & Das, J. (1992). Mathematical characterization of Chaos Game Representation: New algorithms
for nucleotide sequence analysis. J. Mol. Biol., 228, 715–729.
Eisenberg J., Freeman W.J., & Burke B. (1989) Hardware architecture of a neural network model simulating
pattern recognition by the olfactory bulb. Neural Networks, vol. 2, 315-325.
Farmer, J.D. (1982) Information Dimension and the Probabilistic Structure of Chaos. Zeitschrift fur Naturfor-
schung, vol. 37A, 1304-1325.
Fiser, A., Tusnady, G.E., & Simon, I. (1994). Chaos game representation of protein structures. Journal of
molecular graphics, 12, 302-304.
Freeman W.J. (1987) Simulation of chaotic EEG patterns with a dynamic model of the olfactory system.
Biological Cybernetics, vol. 56, 139-150.
Freeman W.J., Yao Y., & Burke B. (1988) Central pattern generating and recognizing in olfactory bulb.
Neural Networks, vol. 1, 277-278.
Hawkins J., Blakeslee S. (2004) On intelligence, New York, NY: Times Books.
Hopfield J.J. (1982) Neural networks and physical systems with emergent collective computational abilities.
Proceedings of National Academy of Sciences, USA. Apr; vol. 79, n 8, 2554–2558.
Goldman, N. (1993). Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos
game representations of DNA sequences. Nucleic Acids Research, May, 2487-2491.
Grossberg S. (1988) Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Net-
works, vol. 1, 17-61.
Gunaratne G.N. & Procaccia I. (1987) Organization of Chaos. Physical Review Letters, vol. 59, no. 13, 1377–
1380.
Gutierrez, J.M., Rodriguez, M.A., & Abramson, G. (2001). Multifractal analysis of DNA sequences using
novel chaos-game representation. Physica A, 300, 271–284.
Joseph, J., & Sasikumar , R. (2006). Chaos game representation for comparison of whole genomes. BMC
Bioinformatics, 7(May), 243-246.
Kitchens B. (1998) Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Chains. Springer.
Klatzky R.J. (1975) Human Memory. Structures and Processes. W. H. Freeman and Co., San Francisco.
Kohonen T. (1980) Content-addressable memories (Springer, Berlin).

Lind D. & Marcus B. (1995) An Introduction to Symbolic Dynamics and Coding. Cambridge University Press,
Cambridge.
Maistrenko Yu.L., Maistrenko V.L., Sushko I.M. (1994a) Order for the appearance of attractors in piecewise
linear systems. In: Chaos and Nonlinear Mechanics, Series B, vol. 7, World Scientific.
Maistrenko Yu.L., Maistrenko V.L., Sushko I.M. (1994b) Bifurcation phenomena in generators with delay
lines. Radiotekhnika i elektronika, Moscow, No. 8-9, 1367-1380.
Matsumoto K. & Tsuda I. (1988) Calculation of information flow rate from mutual information. Journal of
Physics A: Mathematical and General, vol. 21, 1405–1414.
Nicolis J.S. (1982) Should a reliable information processor be chaotic? Kybernetes, vol. 11, 269-274.
Nicolis, J.S. (1986) Dynamics of Hierarchical Systems: An Evolutionary Approach (Springer-Verlag).
Nicolis, J.S. (1991) Chaos and Information Processing: A Heuristic Outline (World Scientific).
Nicolis J.S. & Tsuda I. (1985) Chaotic dynamics of information processing – The “magic number seven plus-
minus two” revisited. Bulletin of Mathematical Biology, vol. 47, 343-365.
Oliver, J.L., Bernaola-Galvan, P., Guerrero-Garcia, J., & R. Roman-Roldan (1993). Entropic profiles of DNA
sequences through chaos-game-derived images. Journal of theoretical biology, February 21, 457-70.
Ott E., Grebogi C., & Yorke J.A. (1990) Controlling chaos. Physical Review Letters, vol. 64, 1196-1199.
Pecora L.M. & Carroll T.L. (1990) Synchronization in chaotic systems. Physical Review Letters, vol. 64,
821-824.
Petoukhov S.V. & He M. (2010) Symmetrical Analysis Techniques for Genetic Systems and Bioinformatics:
Advanced Patterns and Applications. Hershey, USA: IGI Global.
Procaccia I. (1988) The organization of chaos by periodic orbits: Topological universality of complex sys-
tems. In: Universalities in Condensed Matter. Ed. Jullien R., Springer, 213
Skarda C.A. & Freeman W.J. (1987) How brains make chaos in order to make sense of the world. Behavioral
and Brain Sciences, vol. 10, 161-165.
Shaw, R. (1981) Strange attractors, chaotic behavior and information flow. Zeitschrift fur Naturforschung,
vol. 36a, 80-112.
Schuster H.G. (2004) Deterministic Chaos. An Introduction (Wiley-VCH).
Tsuda I. (1984) A hermeneutic process of the brain. Progr. Theor. Phys., Supplement, vol. 79, 241-259.
Tsuda I. (1992) Dynamic link of memory - chaotic memory map in nonequilibrium neural networks. Neural
Networks, vol. 5, 313-326.
Tsuda I. (1994) Can stochastic renewal of maps be a model for cerebral cortex? Physica D, vol. 75, 165-178.
Tsuda I., Koerner E., & Shimizu H. (1987) Memory dynamics in asynchronous neural networks. Progress of
Theoretical Physics, vol. 78, 51-71.
Voges W., Atmanspacher H., & Scheingraber H. (1987) Deterministic Chaos in Accreting Systems: Analysis
of the X-Ray Variability of Hercules X-1. The Astrophysical Journal, vol. 320, 794–802.
Wang, Y., Hill, K., Singh, S., & Kari, L. (2005). The spectrum of genomic signatures: from dinucleotides to
chaos game representation. GENE, February, 173-185.
Wiegerinck W. & Tennekes H. (1990) On the Information Flow for One-Dimensional Maps. Physics Letters
A, vol. 144, no. 3, 145–152.
Yao Y. & Freeman W.J. (1990) Model of biological pattern recognition with spatially chaotic dynamics.
Neural Networks, vol. 3, no. 2, 153-170.
Yu, Z. G., Anh, V., & Lau, K. S. (2004). Chaos game representation of protein sequences based on the de-
tailed HP model and their multifractal and correlation analyses. Journal of theoretical biology,
226(February), 341-348.
Symmetry: Culture and Science
Vol. 23, Nos. 3-4, 403-426, 2012

THE IRREGULAR (INTEGER) TETRAHEDRON AS
A WAREHOUSE OF BIOLOGICAL INFORMATION

Tidjani Négadi

Address: Physics Department, Faculty of Science, University of Oran, 31100, Oran, Algeria; E-mail:
tnegadi@gmail.com

Abstract: A “variable geometry” classification model of the 20 L-amino acids
and the 20 D-amino acids, based on twenty physically and mathematically
labeled positions on tetrahedrons, and extending Filatov’s recent model, is
presented. We also establish several physical and mathematical identities (or
constraints) that are very useful in applications. The passage from a tetrahedron with
(possibly) maximum symmetry to a tetrahedron with no symmetry at all, here a
distinguished integer Heronian tetrahedron, which could “describe” some kind of
symmetry breaking process, reveals a lot of meaningful biological numerical
information. Before symmetry breaking, and as a first supporting result, we
discover that the L- and D-tetrahedrons together encode the nucleon-content in
the 61 amino acids of the genetic code table and the atom-content in the 64 DNA-
codons. After a (geometric) symmetry breaking, and also an accompanying
(physical) “quantitative symmetry” restoration concerning atom numbers, more
results appear, for example the atom-content in, this time, the 64 RNA-codons (61
amino acids and three stops), the remarkable Downes-Richardson-shCherbak
nucleon-number balance and, most importantly, the structure of the famous
protonated serine octamer Ser₈+H⁺ (L- and D-versions), thought by many people
to be a “key player” in the origin of homochirality in living organisms because of
its unique property of forming exceptionally stable clusters and its strong
preference for homochirality. Using all the labeling possibilities, we find the
more fundamental neutral serine octamer Ser₈ (L- and D-versions). We also
revisit, in this paper, the number 23!, which is at the basis of our recent arithmetic
approach to the structure of the genetic code. New consequences, not yet
published, and also new results, especially in connection with the serine octamer,
are given. Finally, a remark on the inclusion of the “non-standard” versions of the
genetic code in the present formalism is made.

Keywords: integer tetrahedron; amino acids; homochirality; serine octamer

1. INTRODUCTION

This paper is devoted, first, to a new classification of the twenty amino acids based on
the Heronian (integer) tetrahedron, which includes and extends the recent one by Filatov
(Filatov, 2009) based on the usual tetrahedron, and which also includes the twenty
mirror-image D-amino acids. We start with the regular, achiral tetrahedron, the most basic
of all polyhedra (the first of the five Platonic solids), with maximum symmetry, and end up
with the irregular, chiral, integer Heronian tetrahedron, denoted symbolically “117”, with
no symmetry at all. In this way we have, besides a “symmetry-breaking” process
“begetting chirality”, a coherent chiral-framework classification of the amino acids,
distributed over the faces, vertices and edges of the chiral integer Heronian tetrahedron
(and also its mirror image for the 20 D-amino acids). Recall that, in the case of a general
tetrahedron, the framework is not always chiral, whereas the amino acids placed on it
are chiral (except, of course, glycine). The whole object, that is, the chiral
Heronian tetrahedron together with the chiral amino acids on it, is in this way chiral. At
the same time, a slight “quantitative symmetry-breaking” at the level of atom number,
inherent to Filatov’s ordinary tetrahedron (extended) model with only physical
labeling (nucleon, carbon, hydrogen and atom numbers), is cured and the symmetry restored.
Another nice virtue of the above “117” Heronian tetrahedron is that it incorporates
(encodes) naturally and strikingly, through its geometric characteristics, the correct
numeric structure of the famous protonated serine octamer Ser₈+H⁺, thought to be a
promising actor in the origin of homochirality (Cooks et al., 2001; Hodyss et al., 2001).
It is precisely the passage from a regular tetrahedron, with possible maximum
symmetry, to the Heronian (irregular) tetrahedron, with no symmetry at all (no two
edges equal), that is, a symmetry breaking, that leads to the serine octamer. By
introducing a supplementary and concomitant mathematical labeling for the 20
positions (amino acids) in the vertices, edges and faces of the tetrahedron, the so-called
“generalized Plato’s Lambda” numbers or Tetraktys, its more fundamental neutral form
Ser₈ is revealed. This is examined in section 2. Our second aim in this paper, in
section 3, is to revisit the number 23!, which was at the basis of our arithmetic model of
the genetic code (Négadi, 2007, 2008, 2009), but in the context of the present work. We
show, in particular, that it too incorporates the above serine octamer structure and
also encodes several other mathematical characteristics of the genetic code. As this paper
is intended first for the readers of Symmetry: Culture and Science, and maybe other
people at large, we include two Appendices to (i) define the physical (chemical) and
mathematical entities used, (ii) ease reading and (iii) make checks and verifications
by the reader possible. In the first, we collect detailed chemical data concerning the 20
amino acids as well as the DNA (RNA) units Thymine (Uracil), Cytosine, Adenine and
Guanine. In the second, we give some mathematical definitions concerning certain
(elementary) arithmetic functions used in this paper (and in others) that help to unveil
the hidden “biological information”. In particular, Euler’s famous totient function φ is
briefly presented, with interesting applications here.

2. AN EXTENDED CLASSIFICATION MODEL OF THE AMINO
ACIDS AND THE SERINE OCTAMER SER8

Of about seven hundred known amino acids, exactly twenty are coded by the
genetic code. It is also a recognized fact that life uses exclusively the L-form amino
acids to make proteins and exclusively the D-form sugars in the backbones of DNA and
RNA. However, and this is a genuine paradigm shift (or turning point) in biology, the
D-amino acids are also used in many living beings, from bacteria to mammals (see for
example Yang et al., 2003), in various biological strategies. In fact, L or D is a matter of
convention; the two forms (also called enantiomers) exist, and some people say that life
could even have begun “blind” to the D- or L-forms of the amino acids, maybe used
both, and later “chose” one of them, the L-form, by inventing “quality control”,
proofreading mechanisms and “checkpoints”. In the contemporary ribosome, it is
mainly the aminoacyl-tRNA synthetases that play a central role in ensuring the correct
handedness (L-amino acids), because the other components, the Peptidyl Transferase
Center and the codon-anticodon Decoding Center, are “blind” to the handedness of the
amino acids. The Decoding Center guarantees only the identity of an amino acid but
not its handedness. Enantiomers have very similar physical properties and are identical
with respect to ordinary chemical reactions; the differences arise only when they interact.
In recent years, the field of D-amino acid chemistry and biochemistry has grown,
and D-amino acids now have the same status as their L-forms. In the following, therefore,
we shall consider an extended classification comprising the 20 L-amino acids as well as the
20 D-amino acids.

We start from Filatov’s classification model of the twenty amino acids, which is a
consequence of his symmetric table of the genetic code and is based on the tetrahedron.
The author does not specify the geometric nature of the tetrahedron (regular or irregular),
but any tetrahedron has 4 faces, 4 vertices and 6 edges. At one extreme, we have the
regular tetrahedron with 24 isometries and, at the other, the irregular
tetrahedron with 7 possible isometries. The extreme case where all the edges of the
tetrahedron are different corresponds to a complete loss of symmetry, and it is
precisely this case that interests us in this work. The 20 amino acids are placed on the
tetrahedron as shown in Figure 1 below (see Filatov, 2009). Four amino acids, A, N,
L and F (blue), are at the centers of the four faces, four amino acids, G, P, K and Y
(green), are at the four vertices, and the remaining twelve amino acids (red) are
placed on the 6 edges (2 amino acids per edge). Filatov has discovered a
“quantitative” symmetry at the level of the nucleon number (the Hasegawa-Miyata
Parameter (HMP); Hasegawa and Miyata, 1980). The nucleon (proton or neutron) is
the basic building block of (ordinary) matter.

[Figure 1: two mirror-image tetrahedra carrying the 20 amino acids, with G, P, K and Y at the
vertices, A, N, L and F at the face centers and the remaining twelve on the edges.]

Figure 1: The L-amino acids (left) and the D-amino acids (right); the vertical line is the mirror.

The sum of the nucleon numbers of the amino acids (side-chains only) in the two pairs
of faces is (for one tetrahedron, the L-tetrahedron say):

628+627=626+629=1255 (n)

The nucleon numbers of the 20 amino acids are given in Appendix 1. Here, the two
pairs of faces are defined in the figure by (PKY/YKG) and (PGY/PGK), respectively.
We have used Filatov’s tetrahedron model to deduce the existence of the equivalents of
equation (n) for carbon and for hydrogen. For the atom number there is, a priori, no (exact)
quantitative symmetry of the kind found for carbon and hydrogen, but we shall see below
that this “quantitative symmetry-breaking” can be cured, thanks to the mathematical
properties of the Heronian tetrahedron, and the (quantitative) symmetry restored, with
interesting consequences. For carbon, the identity reads (for the respective faces, as in (n)):

33+34=32+35=67 (c)

An elementary (funny) manipulation of numbers, namely adding the “digits” of the
numbers in Eq.(n) written in base 100, gives 34+33=32+35, which is Eq.(c) up to a trivial
permutation. The nucleon and carbon atom numbers therefore appear to be somehow
"linked", as we have already observed earlier (see Négadi, 2009) for the total number
of nucleons in the 20 amino acids (12+55=67). Now, for hydrogen, we have

64+53=62+55=117 (h)

That each pair of faces gives the (correct) nucleon, carbon and hydrogen numbers for
the 20 amino acids is not at all trivial. As a matter of fact, in each pair of faces there are
only 16 different amino acids, four of which are situated on the shared edge and contribute
twice, and these 16 amino acids give the correct nucleon, carbon and
hydrogen numbers for the 20 (different) amino acids. Thus, something non-trivial is
at work. If we now consider atom numbers, we have for the two pairs
108+97=205 and 104+99=203, neither of which is 204, the number of atoms in the 20 amino
acids, so the “quantitative” symmetry is broken. However, their mean gives the correct
value: (205+203)/2=204. We shall see below how the symmetry broken
for atom numbers can be “restored”.
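
The primary identities and the base-100 manipulation described above can be checked with a few lines of Python. This is only a bookkeeping sketch: the face sums are copied from the text, the face order ("628", "627", "626", "629") is assumed to be the same in every tuple, and the helper names are illustrative.

```python
# Face sums quoted in the text, in the order of the faces "628", "627", "626", "629".
nucleons = (628, 627, 626, 629)
carbons = (33, 34, 32, 35)
hydrogens = (64, 53, 62, 55)
atoms = (108, 97, 104, 99)


def pair_sums(values):
    """Return the sums over the two pairs of faces (first two, last two)."""
    return values[0] + values[1], values[2] + values[3]


def base100_digit_sum(n):
    """Add the base-100 'digits' of n, e.g. 628 -> 6 + 28 = 34."""
    total = 0
    while n:
        n, digit = divmod(n, 100)
        total += digit
    return total


print(pair_sums(nucleons))                         # identity (n)
print(pair_sums(carbons))                          # identity (c)
print(pair_sums(hydrogens))                        # identity (h)
print(pair_sums(atoms))                            # unequal: symmetry broken for atoms
print([base100_digit_sum(n) for n in nucleons])    # carbon numbers, up to a permutation
print(sum(pair_sums(atoms)) // 2)                  # mean of the two atom sums
```

The same `base100_digit_sum` applied to 1255 gives 12+55=67, the total carbon number.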
There also exists a secondary “quantitative” symmetry at the level of the carbon (C),
hydrogen (H), atom and nucleon numbers, but now between the four vertices (G, P, Y,
K) and the four face centers (L, A, F, N):

#C-atoms(v) = #C-atoms(f.c.) = 14 (1)
#H-atoms(v) = #H-atoms(f.c.) = 23 (2)
#atoms(v) = #atoms(f.c.) = 39 (3)
#nucleons(v) = #nucleons(f.c.) = 221 (4)

We could also take Eqs.(1)-(4) collectively and represent this “secondary” quantitative
symmetry by the sum of the four numbers, in the following interesting identity:

297=297 (5)

The identities (or constraints) such as (n), (c) and (h) will be called primary in the
following, and those in (1)-(5) secondary. We shall show below the importance of the
relation in Eq.(5) for serine and for other interesting consequences. As
mentioned in the introduction, the above tetrahedron(s) could also be labeled by the
numbers of the 3-D generalization of Plato’s famous Lambda, or 3-D Tetraktys.¹ The
numbers on the four faces, when the latter are taken separately, are as shown
below:
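Rows of each triangular face, from apex to base:
Face 1: 1 / 2 4 / 4 8 16 / 8 16 32 64
Face 2: 1 / 3 4 / 9 12 16 / 27 36 48 64
Face 3: 1 / 2 3 / 4 6 9 / 8 12 18 27
Face 4: 8 / 12 16 / 18 24 32 / 27 36 48 64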

Let us note, first, that the sums of the numbers in the above four number
patterns are 155, 220, 90 and 285, respectively. Now, when the tetrahedron is
reconstructed, that is, when the four “faces” are assembled, some numbers become
redundant, so that the sum of all these numbers is only 350 (see Phillips, ref. in footnote
1, and Figure 2 below). In this way, we can establish, also for the Tetraktys, the
equivalent of Filatov’s nucleon number “quantitative” symmetry (see Eq.(n), same
order):

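[Figure 2: the Heronian tetrahedron with edge lengths 51, 52, 53, 80, 84 and 117. Each of the
20 positions is labeled X(nucleon number, Tetraktys number): G(1,1), P(41,64), K(72,8),
Y(107,27), A(15,6), N(58,8), L(57,24), F(91,12), I(57,4), M(75,16), C(47,2), Q(72,3),
T(45,32), W(130,4), R(100,16), S(31,48), H(81,9), V(43,12), E(73,36), D(59,18).]

Figure 2: The Heronian tetrahedron “117” hosting the 20 amino acids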

285+90=220+155=375 (T)
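
The face sums and the identity (T) can be reproduced with a short script. The sketch below rests on an assumption that is not stated in the text: the 20 positions are taken to carry the products 1^a·2^b·3^c·4^d with a+b+c+d=3, a rule that happens to reproduce the four faces listed above. Function names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = (1, 2, 3, 4)  # cube roots of the four vertex labels 1, 8, 27, 64


def tetraktys_points():
    """Yield (exponents, value) for the 20 positions of the 3-D Tetraktys."""
    for exps in product(range(4), repeat=4):
        if sum(exps) == 3:
            value = 1
            for base, e in zip(BASES, exps):
                value *= base ** e
            yield exps, value


def face_sum(omitted):
    """Sum the 10 numbers on the face opposite to the vertex with index `omitted`."""
    return sum(v for exps, v in tetraktys_points() if exps[omitted] == 0)


points = list(tetraktys_points())
total = sum(v for _, v in points)
faces = [face_sum(i) for i in range(4)]          # faces opposite to 1, 8, 27, 64
counts = Counter(v for _, v in points)
repeated = sorted(v for v, c in counts.items() if c == 2)

print(len(points), total)                        # number of positions and their sum
print(faces)
print(faces[0] + faces[3], faces[1] + faces[2])  # the two sides of identity (T)
print(repeated)                                  # values that occur twice
```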

This Tetraktys-number identity can also be derived from the above
secondary identities (1)-(5). As a matter of fact, by adding to (5) the sum of
the prime factors of the numbers in (1) through (4), that is, their a0-functions (see
Appendix 2), we have (two times)

297+a0(14)+a0(23)+a0(39)+a0(221)=375 (6)

¹ Stephen M. Phillips, Plato’s Lambda - its meaning, generalization and connection to the
Tree of Life (Article 11, http://smphillips.8m.com).
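
Identity (6) is easy to verify numerically. The sketch below assumes, as the text indicates, that a0(n) is the sum of the prime factors of n counted with multiplicity; the full definition is in Appendix 2, which is not reproduced here, and the helper names are illustrative.

```python
def prime_factors(n):
    """Return the prime factors of |n| with multiplicity, by trial division."""
    n = abs(n)
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def a0(n):
    """Sum of the prime factors of n, counted with multiplicity (assumed definition)."""
    return sum(prime_factors(n))


secondary = (14, 23, 39, 221)            # right-hand sides of Eqs. (1)-(4)
a0_values = [a0(k) for k in secondary]
lhs = sum(secondary) + sum(a0_values)    # 297 + a0(14) + a0(23) + a0(39) + a0(221)
print(sum(secondary), a0_values, lhs)
```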

Let us now return to the Heronian tetrahedron, which was mentioned several times
above. This interesting mathematical object (shown in Figure 2 above), also called a
“perfect pyramid” (Buchholz, 1992), has integer edges, face areas and volume. We
have represented in the figure only one Heronian tetrahedron, say the L-Heronian
tetrahedron; its mirror image, the D-Heronian tetrahedron, is also considered but
not represented here. For each amino acid, the first number in parentheses is the
number of nucleons and the second the Tetraktys number. Now, from the large family of
Heronian tetrahedra known in mathematics, the integer Heronian tetrahedron
having the smallest maximum edge length is the one with edge lengths 51, 52, 53, 80,
84, 117. As mentioned in the introduction, this tetrahedron is often denoted “117”, in
reference to this special property (smallest maximum edge length), and it has all its
edges different, so that it is chiral. The faces, in the order of equation (n), have the
(integer) areas 1800, 1170, 1890 and 2016. The volume V is equal to 18144 (see Buchholz,
1992). As a first application, consider the integer volume. Mathematically, the volume of
the mirror image is -V, via the Cayley-Menger determinant, but this has no impact
because the arithmetic functions we use here are insensitive to the sign of an integer, as
they are built only from its prime factors. We have a0(V)=a0(-V)=29, so that
a0(a0(V)+a0(-V))=31, SOD(V)=18 and B0(V)=56, where SOD is the Sum Of Digits and
the functions a0 and B0 are defined in Appendix 2. This is serine: it has 31 nucleons in
its side-chain, 56 in the block of its “residue” and 18 in the water molecule that completes
the block. (Note that 56+18=74 is the number of nucleons in a block, the same for 19
amino acids; as is well known, proline is an exception.) Serine is also “present”
when considering the Tetraktys numbers with sum 350. Looking at the numbers in
detail, we see that the numbers 4, 8, 12 and 16, and only these, appear twice
(multiplicity 2). This gives for the total sum 80+270=350, where 80 is for the above four
numbers, having multiplicity 2, and 270 corresponds to the rest of the numbers, with no
multiplicity. We have B0(270)=31 and B0(270)+B0(80)=56, so that
2B0(270)+B0(80)=31+56=87 is equal to the nucleon number in the “residue” of serine,
105-18, where 18 is for the nucleons in a water molecule, as mentioned above. This
latter number can be found in several ways; two of them are B0(375-350)=18
or, simply, a0(375)=18, where 375 is the sum of the Tetraktys numbers in the two pairs
of faces (see Eq.(T)). These three numbers, 31, 56 and 18, characterizing serine, are
identical to the three obtained above from the Heronian tetrahedron.
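
The integer areas and volume quoted above can be recomputed exactly. The following sketch uses Heron's formula in the form 16·A² = (a+b+c)(-a+b+c)(a-b+c)(a+b-c) and the Cayley-Menger determinant (288·V² = det). The assignment of the six edges to the vertex pairs AB, AC, AD, BC, BD, CD is an assumption inferred from the four faces listed in the text, and the function names are illustrative.

```python
from math import isqrt


def heron_area(a, b, c):
    """Integer area of a Heronian triangle; raises if the area is not an integer."""
    sixteen_area_sq = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)
    root = isqrt(sixteen_area_sq)
    if root * root != sixteen_area_sq or root % 4:
        raise ValueError("area is not an integer")
    return root // 4


def det(m):
    """Exact integer determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * entry * det(minor)
    return total


def volume(ab, ac, ad, bc, bd, cd):
    """Integer volume from the Cayley-Menger determinant, 288 * V**2 = det."""
    ab2, ac2, ad2, bc2, bd2, cd2 = (x * x for x in (ab, ac, ad, bc, bd, cd))
    cayley_menger = [
        [0, 1, 1, 1, 1],
        [1, 0, ab2, ac2, ad2],
        [1, ab2, 0, bc2, bd2],
        [1, ac2, bc2, 0, cd2],
        [1, ad2, bd2, cd2, 0],
    ]
    v_squared, remainder = divmod(det(cayley_menger), 288)
    v = isqrt(v_squared)
    if remainder or v * v != v_squared:
        raise ValueError("volume is not an integer")
    return v


faces = [(53, 80, 117), (51, 52, 53), (51, 84, 117), (52, 80, 84)]
areas = [heron_area(*face) for face in faces]
# Assumed edge assignment: AB=53, AC=117, AD=51, BC=80, BD=52, CD=84.
print(areas, volume(53, 117, 51, 80, 52, 84))
```

The determinant only yields V², so the sign that distinguishes the L- and D-tetrahedra is not visible in this computation.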
The sum of the six integer edges of the Heronian tetrahedron is 437, but when the two
pairs of faces are considered each edge contributes twice, so that the total sum is
2×437=874. This number coincides nicely with the number of nucleons in the
protonated serine octamer Ser₈+H⁺ in a state called “maximum exchange”, where it
“catches” exactly 33 hydrogen nuclei or nucleons. As a matter of fact, we can write, for
the total number of nucleons in eight serines (the octamer) and its “charging” proton,
on the one hand, and the 33 “exchangeable” hydrogens, on the other,
8×(31+74)+1+33=841+33=874. (Each serine has 4 exchangeable hydrogens, in H2N-
(amine group), HO- (hydroxy group) and -COOH (acid group), and there is one
charge-carrying proton, that is, 4×8+1=33 hydrogen atoms.)² In experiments, serine
clusters can appear between the mass-to-charge ratios m/z=841 (no exchange) and m/z=874
(complete exchange). It is interesting, and remarkable, that the sum of all the edges of the
two pairs of faces of the above Heronian (L-)tetrahedron, 874, can be made to reproduce
numerically the details of the protonated octamer and its 33 exchangeable hydrogens. It
suffices to introduce the primary identity (c), for carbon, in the form
33+34-(32+35)=33-33=0, which selects the number 33: 874-33+33=841+33=874. Selecting
instead the number 34 (33+charging proton), we obtain the neutral serine octamer Ser₈⁰
and the rest in protons (hydrogen atoms): 840+1+33=874. Also, selecting from the six edges
the two equal maximum lengths, we get 2×117+640=874 and, introducing this time the
secondary identity for carbon, Eq.(1), and assuming that the first primary identity is
already applied, we end up with 248+593+33=874. The first two numbers are, respectively,
the number of nucleons in the 8 side-chains of the 8 serines forming the octamer (8×31) and
the number of nucleons in their 8 blocks together with the charging proton
(8×74+1=592+1). The serine tetramer (4 serines) is the smallest serine cluster known to
exhibit homochiral preference (Cooks et al., 2001). It is also thought that the octamer
could be made of two tetramers (Nemes, 2005; Costa and Cooks, 2001). As a purely
speculative exercise, and following the reasoning about the serine octamer, the
tetramer (Ser₄+H⁺) would have 4×4=16 “exchangeable hydrogens” and, adding a
“charging” proton, we would end up with a tetramer in “maximum exchange” having a
number of nucleons equal to 4×105+1+16=421+16=437. It is remarkable that this
number is half of 874, that is, the sum of the six edges of the Heronian tetrahedron
(51+52+53+80+84+117). The introduction of the identity (5) in 437 gives
140+297=437, which has a simple meaning: the first term, 140, is the number of
nucleons in the four side-chains and the sixteen “exchangeable hydrogens”, 4×31+16,
and the second, 297, is the number of nucleons in the four blocks and the “charging”
proton, 4×74+1. This is in agreement with the fact that the octamer is thought to be made
of two tetramers. Finally, the D-octamer is reproduced in the same way, using this time the
mirror image, or D-tetrahedron.

² I would like to thank Lars Konermann (Department of Chemistry, The University of Western
Ontario, Canada) for very kind explanations concerning the exchangeable hydrogens.
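
The nucleon bookkeeping for the serine clusters can be written out step by step. The sketch below only repeats the arithmetic of the text, with the side-chain (31) and block (74) nucleon numbers taken from it; variable names are illustrative.

```python
SIDE_CHAIN = 31                    # nucleons in the serine side-chain (from the text)
BLOCK = 74                         # nucleons in the amino acid block (from the text)
EDGES = (51, 52, 53, 80, 84, 117)  # edges of the Heronian tetrahedron

# Protonated octamer: eight serines plus the charging proton.
octamer_no_exchange = 8 * (SIDE_CHAIN + BLOCK) + 1
# The text counts 4 exchangeable hydrogens per serine plus 1, i.e. 33 in total.
octamer_max_exchange = octamer_no_exchange + (4 * 8 + 1)

# Protonated tetramer as written in the text: 16 exchangeable hydrogens, no extra +1.
tetramer_max_exchange = 4 * (SIDE_CHAIN + BLOCK) + 1 + 4 * 4

edge_sum = sum(EDGES)
print(octamer_no_exchange, octamer_max_exchange, tetramer_max_exchange)
print(edge_sum, 2 * edge_sum)
print(4 * SIDE_CHAIN + 16, 4 * BLOCK + 1)   # the split 140 + 297 of 437
```

Note that, as written in the text, the octamer count uses 4×8+1 exchangeable hydrogens while the tetramer count uses 4×4; the script follows the text in both cases rather than imposing one rule.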

Now, we come back to the “quantitative symmetry-breaking” mentioned above. We
had, for the two pairs, 108+97=205 and 104+99=203. In fact, the discrepancy comes
from the NOS-atoms (nitrogen, oxygen and sulfur), since for hydrogen and carbon the
symmetry is obeyed. In detail, for the four faces, we have 108=64+44,
97=53+44, 104=62+42 and 99=55+44, where in each case the first number is the
hydrogen number and the second the number of CNOS-atoms. There exist numerical patterns
that suggest that hydrogen is in some way separated from the other four atoms (see the
next section), and we shall keep this separation in the above four relations,
where the numbers of CNOS-atoms are, respectively, 44, 44, 42 and 44. These latter
correspond to the faces “628”, “627”, “626” and “629”, respectively, and we shall show
below that their atom numbers are rather given by the B0-functions of the areas of these
faces. The three edges of these faces are (53, 80, 117), (51, 52, 53), (51, 84, 117) and
(52, 80, 84), respectively. The (integer) areas of these faces have, as mentioned above,
the values 1800, 1170, 1890 and 2016. The formula for the area (Heron of Alexandria,
10-75 A.D.) is simply the square root of s(s-a)(s-b)(s-c), where s=(a+b+c)/2 is the
semiperimeter. Now, the B0-functions of the areas are calculated to be B0(1800)=42,
B0(1170)=45, B0(1890)=43 and B0(2016)=44, respectively. Adding the respective
hydrogen atom numbers gives 64+42=106, 53+45=98, 62+43=105 and 55+44=99,
respectively. Finally, grouping the two pairs of faces apart gives

106+98=105+99=204. (a)

Thus, by replacing the CNOS-number of each face of the tetrahedron by the B0-function
of the area of that face, the “quantitative symmetry” at the level of atom-number is
restored, as it was mentioned before (remember, the quantitative symmetry for
hydrogen is not broken 64+53=62+55=117).
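
The restoration step can be replayed numerically. The sketch below assumes a definition of B0 inferred from how the function is used in the text (sum of the prime factors, plus the sum of their prime indices, plus the number of prime factors, all with multiplicity); the authoritative definition is in Appendix 2, and the helper names are illustrative.

```python
def prime_factors(n):
    """Prime factors of n >= 2 with multiplicity, by trial division."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def prime_index(p):
    """Position of the prime p in the sequence 2, 3, 5, ... (2 has index 1)."""
    return sum(1 for k in range(2, p + 1) if len(prime_factors(k)) == 1)


def B0(n):
    """Assumed definition: prime factors + their indices + their number."""
    factors = prime_factors(n)
    return sum(factors) + sum(prime_index(p) for p in factors) + len(factors)


areas = (1800, 1170, 1890, 2016)   # faces "628", "627", "626", "629"
hydrogens = (64, 53, 62, 55)       # hydrogen numbers of the same faces
b0_areas = [B0(a) for a in areas]
atoms = [h + b for h, b in zip(hydrogens, b0_areas)]
print(b0_areas)
print(atoms, atoms[0] + atoms[1], atoms[2] + atoms[3])   # the two sides of identity (a)
```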

Now, we come to some interesting applications. Consider, first, both (ordinary)
tetrahedra, the L-tetrahedron and the D-tetrahedron (see Figure 1). The 20 positions
on each one of them are, first, labeled by the nucleon numbers of the amino acids. We
know that there is a “balance” between two pairs of faces (primary identity (n)) and four
“balances” between the four vertices and the four centers of the faces (secondary
identities (1)-(4) or (5)) involving the carbon, hydrogen, atom and nucleon numbers.
Finally, each tetrahedron involves 4 vertices, 4 centers and 6 edges, or 14 “invariant”
objects. Writing down the sum for the two tetrahedra (L- and D-) gives
2×(2×1255+2×297+14)=2×2×(1255+297+7) (7)

The first expression, on the left of the equal sign, means 2 tetrahedra (L- and D-) and the
second, on the right, means twice two pairs of faces where, in each pair, 7 (concomitantly)
stands for 2 faces sharing 5 edges, or 2+5=7. Writing, first, 1255 as 627+628, 297 as
221+76 and 7 as 3+4 (or 2+5) and, second, noting that the different parts are
independent, we could for example group in the parentheses the odd numbers together
and the even ones together, to obtain the following interesting partition:
2×2×(627+221+3)=3404 (8)
2×2×(628+76+4)=2832 (9)
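
Equations (7)-(9) and the two biological totals they are compared with can be checked directly. In this sketch the multiplet data (nucleon sum per class and codons per amino acid) and the atom numbers of the four nucleobases are the values quoted in the paragraph that follows; the dictionary layout and names are illustrative.

```python
# (nucleons summed over the amino acids of the class, codons per amino acid)
multiplets = {
    "quartets": (145, 4),
    "sextets": (188, 6),
    "doublets": (660, 2),
    "triplet": (57, 3),
    "singlets": (205, 1),
}
nucleons_20 = sum(n for n, _ in multiplets.values())
nucleons_61 = sum(n * codons for n, codons in multiplets.values())

# Atoms in the nucleobases T, C, A, G; each occurs 48 times in the 64 DNA codons.
base_atoms = (15, 13, 15, 16)
atoms_64 = 48 * sum(base_atoms)

lhs7 = 2 * (2 * 1255 + 2 * 297 + 14)
rhs7 = 2 * 2 * (1255 + 297 + 7)
eq8 = 2 * 2 * (627 + 221 + 3)
eq9 = 2 * 2 * (628 + 76 + 4)

print(nucleons_20, nucleons_61, atoms_64)
print(lhs7, rhs7, eq8, eq9, eq8 + eq9)
```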

The first equation gives the total number of nucleons in the 61 amino acids and the
second the total number of atoms in the 64 DNA-codons. As a matter of fact, there are 145
nucleons in the 5 quartets, 188 in the 3 sextets, 660 in the 9 doublets, 57 in the triplet
and 75+130=205 in the two singlets (see Appendix 1), and therefore
145×4+188×6+660×2+57×3+205×1=3404 nucleons in the 61 amino acids. For the 64
DNA-codons, each one of the four nucleobases T, C, A and G appears 48 times, and there are
therefore 48×(15+13+15+16)=2832 atoms (see Appendix 1). Now, taking instead
1255=626+629 in (8) and (9), the result would be 3404+8=3412 for the first and 2824
for the second, so that it is essentially the same result as above. Note also that 3412 is
the number of nucleons in the 61 amino acids when the latter are in their “physiological”
state, that is, when a few amino acids are charged (Downes and Richardson, 2002;
shCherbak, 2008). Concerning this number 3412, Downes and
Richardson (2002) have shown the existence of an exact nucleon-number balance
between the side-chains of the 61 amino acid residues (in their “physiological” state)
and the 61 blocks (57×56+4×55), where the unique amino acid proline has 55
nucleons instead of 56, which is the number of nucleons for the remaining 19 amino
acids. (Note that in this case the number of nucleons in the 20 amino acids is 1256, or
1255+1.) To see the balance, we consider the three identities (n), (c) and (a), where in
the latter we add to both members 90+90=90+90 (corresponding to the blocks). We
have

L-tetrahedron → 2(1255+67+384)=3412 (10)

D-tetrahedron → 2(1255+67+384)=3412 (11)

and the equality for the two tetrahedra could “fit” (or encode) the remarkable Downes-
Richardson nucleon number balance. Downes and Richardson (Downes and
Richardson, 2002; see also shCherbak, 2008) considered that a few amino acids (in
their “physiological” state) are charged, as well as the case where proline has 42 nucleons
in its side chain and 73 nucleons in its block, and showed that there are 3412 nucleons in
the 61 side-chains and 3412 (=5774+473) nucleons in the corresponding 61 blocks.
We have also recently established this very numerical balance by using other
mathematical techniques (Négadi, 2011). Now, we have already established the restoration
of the “quantitative” symmetry of the atom number. Consider therefore the sum
Σ=2×(1255+67+117+204), including the nucleon, carbon, hydrogen and atom numbers
in the two pairs of faces of the tetrahedron. We have (Σ=2×31×53)

B0(Σ)=86+31=117 (12)

where 86 corresponds to the sum of the prime factors, a0(Σ), and the rest, 31, to the sum
of the prime indices and the big Omega function (the number of prime factors, here
3; see Appendix 2). The result is the total number of hydrogen atoms in the 20 amino
acids, in agreement with the pattern “16+7=23” (see below): 86 hydrogen atoms in the 5
quartets, 9 doublets and 2 singlets (21+50+7+8) and 31 hydrogen atoms in the 3 sextets and
1 triplet (22+9). Now, we include the blocks in the number of atoms and collect
everything for one pair of faces: Eqs.(n), (5), (c), (h) and (a), to which we add
2×90=180, for the blocks, and also the number of edges (5) and centers of faces (2), that
is, 7. For the sum of all the mentioned quantities, 2127
(=3×709; 709 is the 127th prime), the sum of the prime factors and their indices is

A0(1255+297+67+117+384+7)=841 (13)

This number, again, corresponds to the protonated serine octamer. The selection of the
sub-set comprising the four last numbers above is also interesting. Computing the
quantity

2×2×A0(67+117+384+7)=4×48=192 (14)

for both tetrahedra, L- and D-, that is, for 2×2 pairs of faces, we find this time the total
number of nucleobases in the 64 codons (each of the 4 nucleobases appears 48 times). Next,
we introduce the quantity Q0=2×1255+2×375+(1170+1800+1890+2016), comprising the
nucleon numbers, the Tetraktys numbers and also the (integer) areas for the two pairs of
faces of the Heronian tetrahedron, the L-tetrahedron, say. We have

B0(Q0)=248 (15)
The nucleon numbers and the Tetraktys numbers for the four vertices and the four face
centers give Σ(v, f.c.)=2×221+150=592. Adding the two quantities above gives

B0(Q0)+Σ(v, f.c.)=248+592=840 (16)

The relation (16) corresponds exactly to the neutral serine octamer Ser₈⁰: there are 248
nucleons in its eight side-chains (8×31) and 592 nucleons in its eight blocks (8×74).
B0(Q0) and Σ(v, f.c.) therefore give these two parts. Also, taking the A0-function of these
two numbers, we obtain

A0(248)+A0(592)=112 (17)

Amazingly, 112 is equal to the number of atoms in the neutral serine
octamer, as serine has 14 atoms (side-chain and block) and 8×14=112. Finally,
considering the two tetrahedra (L- and D-), we have B0(2×112)=32=31+1, and this
corresponds to the protonated monomer Ser⁺ (side-chain only). In equations (16)
and (17), nothing prevents us from adding the number of tetrahedra involved, here 1, to
get 840+1=841 (nucleons) and 112+1=113 (atoms) for the protonated serine octamer.
For the two tetrahedra we could form the product [B0(112)+1]×[B0(112)+1], which
is equal to 29²=841, again the protonated serine octamer. Now, we form the two
following quantities

Q1=2×1255+(1170+1800+1890+2016) (18)

Q2=2×375+(1170+1800+1890+2016) (19)

where, in Q1, the nucleon and the Heronian tetrahedron numbers are considered and, in
Q2, the Tetraktys numbers replace those of the nucleons. First, we have
A0(Q1)+A0(Q2)=76+104=180, which could be written as 114+66. This is the carbon
content of the 61 amino acids: 114 in the 5 quartets, 9 doublets and 2 singlets
(4×9+2×33+3+9), on the one hand, and 66 in the 3 sextets and the triplet (6×9+3×4), on
the other. This is also in agreement with the pattern “16+7=23” (see above and below).
Now, we use B0 instead. We have in this case B0(Q1)+B0(Q2)=80+108=188. This is
the number of nucleons in the three sextets serine, leucine and arginine. As
Q1=2×13×19² and Q2=2×3×31×41, we get immediately 31+157, which selects serine’s
number of nucleons, 31. Some little additional manipulation gives the final result
31+57+100, respectively. Taking the total sum, 180+188=368=2⁴×23, we have a0(368)=31.
The sum of the prime indices and the number of factors gives 18, the water molecule.
Equivalently, B0(368)=31+18=49. Finally, crossing the four values, we introduce the
numbers 156=76+80 and 212=104+108, from which we have A0(156)+A0(212)=105.
This is the nucleon number of serine (side-chain and block). As we have already shown
in this paper, these same numbers refer to serine; in particular, the difference
A0(156)+A0(212)-B0(368) gives the right number of nucleons in the block of the
residue of serine (31+74=49+56=87+18=105).
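
The values obtained from Q0, Q1 and Q2 can be recomputed with a small script. As before, the definitions are inferred from the usage in the text rather than taken from Appendix 2: A0 is assumed to be the sum of the prime factors plus the sum of their prime indices, and B0 is assumed to add the number of prime factors to A0. The function `factor_profile` is an illustrative helper.

```python
def factor_profile(n):
    """Return (sum of prime factors, sum of prime indices, number of prime factors)."""
    primes = []                 # primes met so far, in increasing order
    a0 = spi = omega = 0
    p = 2
    while n > 1:
        if all(p % q for q in primes):   # p has no smaller prime divisor: p is prime
            primes.append(p)
            while n % p == 0:
                n //= p
                a0 += p
                spi += len(primes)       # index of p among the primes
                omega += 1
        p += 1
    return a0, spi, omega


def A0(n):
    """Assumed definition: sum of prime factors plus sum of their indices."""
    a0, spi, _ = factor_profile(n)
    return a0 + spi


def B0(n):
    """Assumed definition: A0(n) plus the number of prime factors."""
    return sum(factor_profile(n))


AREAS = 1170 + 1800 + 1890 + 2016
Q0 = 2 * 1255 + 2 * 375 + AREAS
Q1 = 2 * 1255 + AREAS
Q2 = 2 * 375 + AREAS

print(B0(Q0), B0(Q0) + 2 * 221 + 150)        # Eqs. (15) and (16)
print(A0(248) + A0(592))                     # Eq. (17)
print(A0(Q1), A0(Q2), B0(Q1), B0(Q2))
print(A0(156) + A0(212), B0(368))
```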

To end this section, let us return to the two important, and interesting, numbers 248 and
840. Both correspond to the serine octamer; in the first, the block is not included. We
show below that, in fact, the number 248 begets the number 840 and shares with the
Heronian tetrahedron rare mathematical properties. As a matter of fact, the number 248
is the smallest number >1 for which the three Pythagorean means (the arithmetic,
geometric and harmonic means) of Euler’s totient function φ and the divisor function σ
are all integers. We have φ(248)=120 and σ(248)=480. As for the means, we have the
a-mean=(120+480)/2=300, the g-mean=√(120×480)=240 (integer!, see the equation
below) and, finally, the h-mean=(2×120×480)/(120+480)=192. From all these relations
we have

φ(248)+σ(248)+g-mean[φ(248),σ(248)]= (20)

120+480+240=248+592=840
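
The three means and Eq.(20) can be checked with naive implementations of the two arithmetic functions. The sketch below takes φ to be Euler's totient and σ the sum of all divisors, as in the paragraph above; the search helper is illustrative and is provided only so that the minimality statement can be tested independently.

```python
from math import gcd, isqrt


def phi(n):
    """Euler's totient: how many k in 1..n are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def sigma(n):
    """Sum of all positive divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


def integer_means(n):
    """Return (arithmetic, geometric, harmonic) means of phi(n) and sigma(n)
    when all three are integers, otherwise None."""
    p, s = phi(n), sigma(n)
    g = isqrt(p * s)
    if (p + s) % 2 or g * g != p * s or (2 * p * s) % (p + s):
        return None
    return (p + s) // 2, g, 2 * p * s // (p + s)


def smallest_with_integer_means(limit):
    """First n in 2..limit for which integer_means(n) is not None (or None)."""
    for n in range(2, limit + 1):
        if integer_means(n) is not None:
            return n
    return None


a_mean, g_mean, h_mean = integer_means(248)
print(phi(248), sigma(248), a_mean, g_mean, h_mean)
print(phi(248) + sigma(248) + g_mean)        # left-hand side of Eq. (20)
print(phi(840))
```

Calling `smallest_with_integer_means(300)` is one way to test the claim that 248 is the first such number greater than 1.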

In the last step, we isolated the proper divisors of 248 and wrote σ(248)=248+232. This
result is the same as in Eq.(16), the neutral serine octamer. We have thus seen that the
complete serine octamer is derivable from the side-chain part, and the number 248 seems
therefore more fundamental than 840. It is even “hidden” in the L/D-classification based
on the regular tetrahedron. In Eq.(7), the right-hand side is equal to 2×2×1559, which is
the factorization of the sum 6236 (3404+2832). It appears that the sum of the prime
indices is equal to 248 (1559 is the 246th prime). Now, A0(840)=33, and we find that 840
itself begets the 33 “exchangeable” hydrogens (see above). The sum of the three
numbers in the chain 248→840→33 (for one tetrahedron) leads to B0(1121)=105 and,
considering the two tetrahedra, L- and D-, we have A0(2×1121)=106=105+1. These are,
respectively, the neutral monomer of serine and the protonated monomer, experimentally
observed. The number 192, which is the total number of nucleobases in 64 codons,
appears here two times: (i) φ(840)=192 and (ii) h-mean[φ(248), σ(248)]=192 (see
above). This could describe 64 RNA-codons and 64 DNA-codons, inasmuch as 840
could also reveal the four DNA units {T(126), C(111), A(135), G(151)} and the four
RNA units {U(112), C(111), A(135), G(151)}, where the number in parentheses is the
number of nucleons in each nucleobase. As a matter of fact, using Eq.(16), we get, by
adding 840 and its φ-function (see section 3),
840+φ(840)=1032 (21)

This last number is identical with the total nucleon sum of the eight units: 523 for the
DNA units and 509 for the RNA units. The number 1032 could be written as 442+590 by
isolating the nucleon identity (4), that is, the physical term 2×221; next, by applying
the carbon identity (c), 67-67=0, we find (442+67)+(590-67)=509+523, which is the
result. Considering also the two tetrahedra and using the results following Eqs.(18)-
(19), we have

2×1255+2×(B0(Q1)+B0(Q2))=2×(1255+188)=2×1443=2886 (22)

This is also an interesting result, as it gives the total number of atoms in the 61
RNA-codons, 2560, and in the 3 stops, 326 (see Rakočević, 2009). Rakočević used the
nucleotides for the 61 RNA-codons and the ribonucleosides for the 3 stop-codons
(UAA, UAG and UGA): 45U+48C+44A+46G giving 2560 atoms and
3UMP+4AMP+2GMP giving 326 atoms. Let us note that 326 could be written as 128
for the nucleotide-part, or the “side-chain”, and 198 for the ribose/phosphate-part, or the
“block”. Let us also write one copy of 1255 as 245+1010, according to the pattern
“16+7” (see above and below), where 245=188+57 (3 sextets S, L, R, and the triplet I).
Knowing that 188 is also equal to B0(Q1)+B0(Q2)=108+80, we could therefore rewrite
(22) as (108+57)+(2×188+80+1010+1255)=2886. It is now sufficient to apply the
primary identity (c) for carbon to get (108+57+33)+(2721-33)=198+2688. We have
therefore found the “block”-part, 198 (see above). Using now, in 2721, the
decomposition of 188 as 57+131 and using the secondary identity for carbon (1), we
have (198+128)+2560=326+2560, that is, the correct partition into codons and stops (see
above and Rakočević, 2009). It is quite astonishing that we can also reveal the
structure of serine using the nucleon and atom numbers of all the chemical ingredients:
amino acids, DNA and RNA. Take, for example, the two numbers 2886 and 2560. Their
difference, 326, was associated above with the 3 stops, and they are both relevant:
A0(2886)=67+9=76, the number of carbon atoms in the 23 Amino Acid Signals (AASs), and
a0(2560)=18+5=23, precisely these 23 AASs. Also, their difference (the three stops)
gives A0(326)=204, which is the number of atoms in the 20 amino acids (see also
Rakočević, 2009). Consider therefore the quantity Q3=3404+2886+2560=2×3×5²×59.
We have B0(Q3)=74+31=105, precisely the detailed structure of serine (see above). Adding
now to Q3 the number of atoms in the 64 DNA codons, Q4=Q3+2832 (see Eq.(9)), we get
A0(Q4)=105. We also have B0(½Q4)=106, the number of nucleons in the
protonated monomer of serine, 105+1. Finally, Q5=3404+2832+2560=2²×3×733 (733 is the
130th prime) leads to A0(Q5)=874, which corresponds to the protonated serine octamer in
its state of maximum exchange (see above in this section). Finally, the two main
numbers of the tetrahedron, 437 and 874, are also informative by themselves. First,
B0(437), written as A0(437)+Ω(437)=59+2=61, has a nice interpretation with respect to
the mathematical structure of the genetic code: 59 codons for the degenerate amino acids
and one codon for each one of the two singlets (M and W). Interestingly, the sum of the
a0-functions of the four semiperimeters 125, 78, 126 and 108 of the Heronian tetrahedron
also gives 61. As for 874, closely related to the serine octamer and to homochirality, as
we have seen in this paper, we have B0(874)=65, and this last number is conspicuously
identical with the number of what are known as the “biological” space groups, which
survive, out of a total of 230 crystallographic space groups, when considering chiral
molecules (see, for example, Mainzer, 1996).

3. REVISITING 23!

In 2007 (Négadi, 2007), we designed an arithmetic model of the genetic code
based on the number 23!. From its two representations, decimal and prime-factorization,
in equations (II) and (III)

23!=1×2×3×4×…×21×22×23 (I)
23!=25852016738884976640000 (II)
23!=2¹⁹×3⁹×5⁴×7³×11²×13×17×19×23 (III)
(I)-(II)=0 (IV)
(I)-(III)=0 (V)

we have deduced the multiplet structure of the (standard) genetic code and computed
the right degeneracies as well as many other interesting results (Négadi, 2008, 2009).
Every digit from 1 to 9 in Eq.(II) was associated with an amino acid coded by more than
one codon. Two zeros were associated with the two singlets methionine and tryptophan,
and three zeros with the three stop codons. One of these results, for example, gives the
total number of hydrogen atoms in the 61 amino acids as
a0(23!)+Ω(23!)+117=200+41+117=358, where 117 is the sum of all the digits in (II) plus
their number (zeros excluded). This last number is nothing but the number of
hydrogen atoms in the 20 amino acids, and the remaining part, 200+41=241,
corresponds to the 41 degenerate codons (amino acids). It appears that we did not fully
exploit all the numerical facets hidden in 23!. For example, a nice result that went
unnoticed comes from the above two representations, (II) and (III), and consists in adding
all the (47) digits as “individuals”, ignoring the place value (for example, counting 11 as
1+1) and also the factorial sign. In this way we obtain 99+74+2×(2+3)=183, which coincides
nicely with the total number of nucleobases in the 61 codons, inasmuch as the prime
factorization of 183, 3×61, is “tailor-made”: 61 codons and 3 stops, because
a0(183)=61+3=64, and also 18 amino acids with degeneracy and 2 non-degenerate
singlets, because the sum of the prime indices is SPI(183)=18+2=20. This is not the end,
because if we consider the first “representation”, (I), which is in fact the definition of
the factorial, and proceed as above for its (individual) digits, we get 114, which
coincides with the number of nucleobases in 38 codons (3 nucleobases per codon).
Subtracting 114 from 183 gives 69, the number of nucleobases in 23 codons. We therefore
obtain the pattern 23+38 for the 61 codons. Note that the sum 183+114 gives 297.
(In 114, 73 comes from 1 to 16 and 41 from 17 to 23.)

Another interesting result comes from the three representations of 23! written above as
(I), (II) and (III). In the table below, we compute the sum of all the digits, and their
number, needed to write each representation (excluding zeros and ignoring place-values,
exponent positions, etc., just the digits)

Representation    #digits (zeros excluded)    Sum of digits
I                 35                          114 (73+41)
II                18 (11+7)                   99
III               20                          74
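
The entries of the table can be regenerated from the three ways of writing 23!. The sketch below builds each representation as a plain string of digits (operators, exponent positions and zeros are ignored, as in the text); the helper names are illustrative.

```python
from math import factorial

N = 23


def prime_power_digits(n):
    """Digits of the prime factorization of n: each prime followed by its exponent
    (exponents equal to 1 are not written), concatenated in increasing order."""
    parts = []
    p = 2
    while n > 1:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        if e:
            parts.append(str(p) + (str(e) if e > 1 else ""))
        p += 1
    return "".join(parts)


def digit_stats(digits):
    """Return (number of non-zero digits, sum of digits)."""
    nonzero = [int(c) for c in digits if c != "0"]
    return len(nonzero), sum(nonzero)


representations = {
    "I": "".join(str(k) for k in range(1, N + 1)),   # 1x2x3x...x23
    "II": str(factorial(N)),                         # decimal expansion
    "III": prime_power_digits(factorial(N)),         # prime factorization
}
stats = {name: digit_stats(digits) for name, digits in representations.items()}
total = sum(count + digit_sum for count, digit_sum in stats.values())
print(stats)
print(total)
```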

The total sum is 360 and, adding its number of divisors τ(360), we have
360+τ(360)=360+24=384. This number is equal to the number of atoms in the 20 amino
acids (side-chain and block). In the table, the number 114 is also written according to the
partition 73+41, already mentioned. Moreover, the number of digits in the decimal
place-value representation is also partitioned according to the parity of the digits
(even/odd): 18=11+7. Using these two partitions, we have

(i) (73+11)+300=84+300=384
and
(ii) (99+74+7)+(73+41+11+20+35)+τ(360)=180+204=384.

The case (ii) corresponds to the blocks and the side-chains, respectively, and case (i)
describes the partition into the set comprising the 5 quartets, the 9 doublets and the 2
singlets, 16 amino acids having 76+177+20+27=300 atoms, on the one hand, and the 3
sextets and the triplet, 7 amino acids having 62+22=84 atoms, on the other. Let us
denote this pattern by “16+7=23”; we shall meet it again later in the next section (see
Appendix 1). The number of atoms, 384, could be derived otherwise. As a matter of
fact, including the sum of the three 23s in the first members of Eqs.(I)-(III) and the nine
“missing” zeros, including those of Eqs.(IV)-(V) which are “necessary” to express the
IRREGULAR TETRAHEDRON IN BIOLOGICAL INFORMATION 419

equivalence of the three relations (I), (II) and (III), we get 360+(5+5+5)+9=384.
Pushing the computation to its extreme, by including the number of digits in the three
23s, 6, we end up with 390, a very useful number, see below. Above, the total number
of hydrogen atoms in the 61 amino acids was computed as
a0(23!)+Ω(23!)+117=200+41+117=358, where 117 is the sum of all the digits in (II) as
well as their number (zeros excluded). Now, if we include the five zeros, we have
99+23=122 and, adding the sum of all the digits in (I), 114, we get 236. This is the
number of CNOS-atoms (carbon, nitrogen, oxygen and sulfur) in the 61 amino acids. In
this way we find the total number of atoms in the 61 amino acids, 236+358=594, 236 for
CNOS and 358 for hydrogen H. Returning to the above number 390, it appears that it is
just equal to the number of atoms in the 41 degenerate codons (amino acids) and
subtracting it from 594 gives 594-390=204, which is the number of atoms in 20 amino
acids.

We now turn to another type of consequence of the number 23!, not published
before. We introduced (Négadi, 2009) the use of certain simple, information-generating
algorithms known as mathematical “black-holes” to “produce” biological information.
One of them, call it the “black-hole(123)”-algorithm, where the “black-hole” is the
number 123, works as follows: (i) start with any number and count the number of even
digits and the number of odd digits, (ii) write them down next to each other (by
concatenation), followed by their sum. Treat the result as a new number and continue
the process. The process converges very quickly even for big numbers such as 23!.
The first iteration, using (II), with 16 even digits and 7 odd digits, gives 16723
(16+7=23), and this first iteration agrees with the pattern “16+7”, mentioned several
times in the first section. The second iteration gives 235, the third 123 and finally the
fourth (check) also 123.
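
The iteration just described is easy to reproduce. The following is only a minimal sketch of the rule as stated above (count even digits, count odd digits, append their sum); the function names `parity_step` and `black_hole_trajectory` are illustrative and do not come from the paper, and zero is counted as an even digit.

```python
from math import factorial


def parity_step(n: int) -> int:
    """One step of the rule: concatenate (#even digits, #odd digits, #digits)."""
    digits = str(n)
    even = sum(1 for d in digits if int(d) % 2 == 0)  # zero counts as even
    odd = len(digits) - even
    return int(f"{even}{odd}{even + odd}")


def black_hole_trajectory(n: int, max_steps: int = 20) -> list[int]:
    """Apply parity_step repeatedly until the value stops changing."""
    path = [n]
    for _ in range(max_steps):
        nxt = parity_step(path[-1])
        path.append(nxt)
        if nxt == path[-2]:  # fixed point reached
            break
    return path


trajectory = black_hole_trajectory(factorial(23))
print(trajectory[1:])  # [16723, 235, 123, 123]
```

Starting from 23!, the sketch returns the same four values quoted in the text.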

23! → 16723 → csod(16723)=1


↓ ↓
SPI(16723)=359 →235→123→123 → 840+1=841 → 841+A0(840)=874

a0(a0(16723))=603 →235→123→123 → 1443 (+1)

(3)(16723)=1440 → 1443

As the number 16723 is a five-digit number, deserving special consideration (see
above), we shall treat it differently from the remaining iterations, which are all
three-digit numbers. It yields two numbers: (i) 359 as the sum of the prime-indices of
its prime factors and (ii) 603 as the sum of the prime factors of its a0-function (see
the schematic diagram above). We see that the sum of the four numbers in the second
line gives 840, the neutral serine octamer number. Adding the complete sum of the
digits, 840+csod(16723), gives the protonated form 840+1=841. Finally, adding the
A0-function of 840 leads to 840+csod(16723)+A0(840)=841+33=874, the serine octamer
in the state of “maximum exchange” (see above). Now, we use the φ-function and apply
it iteratively to 16723. At the third iteration we get φ^(3)(16723)=1440. If we just add
to this last number the number of iterations, here 3, we have 1440+3=1443. This number
is not unknown; it is the number of nucleons in the 23 AASs, 1255+188 (see above). The
addition of csod(16723)=1, or 1443+1, could even fit the usual case where proline has
41+1 nucleons. The sum
a0(a0(16723))+SPI(16723)+[φ^(3)(16723)+3]+235+2×123=2×1443=2886 gives the same
result as in Eq.(22), that is the total number of atoms in 61 RNA-codons (2560) and in 3
stops (326), on the sole condition that the carbon identity, used several times above,
is introduced into SPI(16723), written as 359-33+33=326+33. Rearranging the terms we
end up with 2560+326=2886.

We have shown in the previous section how the numeric structure of the serine octamer
arises from the mathematical characteristics of the Heronian tetrahedron. It can be
shown that the Tetraktys, too, is able to reveal this strange structure. First, we have
to take the two tetrahedrons in order to include L- as well as D-amino acids. The number
350 is the sum of all the numbers situated at the 20 places, as explained above, and 375
corresponds to the (same) sum on the two pairs of faces (see above). These four numbers
are 155, 220, 90 and 285. The following relation

(2×350+A0(155)+A0(220)+A0(90)+A0(285))+B0(375)
=(700+140+1)+33 (23)
=(840+1)+33
=874

which is constructed from the Tetraktys numbers only, gives again the same result for
the serine octamer as in the first section, from the Heronian tetrahedron, or as above
from the “black-hole(123)” algorithm. We speculate that there may exist a “link”
between these different approaches. For example, the number of partitions of the
number 23 is equal to 1255 (procedure numbpart(.) in software). It also appears
that the number of partitions of the number 20 is equal to 627. The difference leads to
1255-627=628 or 627+628=1255 and this relation is precisely one of the two Filatov
identities mentioned at the beginning of section 2 (see Eq.(n)). Another interesting result
comes when we take the ratio between the number of divisors of 23! and the φ-function
of the number of partitions of 23:

σ0(23!)/φ(1255)=192000/1000 (24)
=192

The result, 192, is equal to the total number of nucleotides in 64 codons and, as φ(840)
is also equal to 192, we get 2×192=384, a nice starting point for establishing the
multiplet structure of the genetic code in another way, that is, from one small number
(Négadi, 2011).

As a last word, let us remark that a close (and strange) relation exists
between serine, which was the focus of this paper, and proline, also known to be a
quite singular amino acid (e.g. “entangled” side-chain and block). Notably, it has also
been shown that chirality has a significant impact on the assembly of proline clusters
(Myung et al., 2006). Looking at the acute apex of the Heronian tetrahedron in Figure
2, where proline resides, we have that the sum of the nucleon number of proline’s
side-chain, 41, and its associated Tetraktys number, 64, is equal to 105=41+64, which is
precisely the number of nucleons in serine (side-chain and block). Also, by calling
again on the carbon identity (c), we get (41+33)+(64-33)=74+31, which is serine in
detail, 31 for the side-chain and 74 for the block. Conversely, and first, the positive
difference between the numbers for serine, 48 and 31, is equal to 17, which fits the
atom-number of proline. Second, take the total sum of the nucleon numbers on the
tetrahedron, 1255, and the Tetraktys numbers, 350, that is 1605 (3×5×107), and
compute the sum of the prime factors of this sum: we find 3+5+107=8+107=115, which
is precisely the total number of nucleons in proline. Introducing the carbon identity
(33-33=0) gives 41+74; this is the correct partition into side-chain and block. Moreover,
adding the sum of the prime-indices of the prime factors and their number (Ω-function)
to the numbers for serine, 31 and 48, gives 31+48+36=115. Finally, writing 36 as 8+28,
we have (48+8+31)+28 and by introducing the atom number (secondary) identity (1),
we get 73+42=115. This is the other form of proline’s nucleon number: 41+1 in the
side-chain and 74-1 in its block. The “manifestation” of serine and its clusters seems not
to be linked only to the tetrahedron classification of the 20 amino acids considered
in this paper. We have recently studied the small set of the amino acid precursors
and found a similar “manifestation” of serine and its clusters, in particular the octamer
(Négadi, 2011).
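
The partition counts and the ratio of Eq.(24) quoted above can be checked independently of Maple. The sketch below is an illustration only: the helper names (`partitions`, `factorize`, `num_divisors`, `phi`) are our own, and plain trial division is used, which is sufficient because 23! has only small prime factors.

```python
from math import factorial


def partitions(n: int) -> int:
    """Number of integer partitions of n (classic dynamic programme)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


def factorize(n: int) -> dict[int, int]:
    """Prime factorization by trial division, as {prime: exponent}."""
    factors: dict[int, int] = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def num_divisors(n: int) -> int:
    """Number of divisors: product of (exponent + 1)."""
    result = 1
    for exponent in factorize(n).values():
        result *= exponent + 1
    return result


def phi(n: int) -> int:
    """Euler's totient via the product formula."""
    result = n
    for p in factorize(n):
        result = result // p * (p - 1)
    return result


p23, p20 = partitions(23), partitions(20)
print(p23, p20, p23 - p20)                         # 1255 627 628
print(num_divisors(factorial(23)) // phi(p23))     # 192
print(phi(840))                                    # 192
```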

We end this paper by considering the possibility of including the “nonstandard” genetic
codes, mentioned at the end of the introduction. We have seen in this paper that a
prominent role is given to the number of amino acids, 61, and also to the 3 stops, i.e., the
standard genetic code. A question raised by a reviewer was whether something could be
said, in the present model formalism, about the experimentally known “nonstandard”
genetic codes, which are however also known to concern only very few living
organisms, compared to the standard, or quasi-universal, genetic code. This is an
interesting question and we shall show that indeed something can be said. As a matter
of fact, we start from the number of nucleotides in the 61 coding (or sense) codons,
183=3×61, i.e., three nucleotides per codon. This number (derived by us several times,
as the one in section 3 also mentioned in Appendix 2) has the prime factorization
3×61, so that its a0-function (sum of the prime factors) is 61+3=64 with, here, an
immediate new “interpretation”: 61 amino acid sense codons and 3 stop-codons. Please
note in this latter case the “new” function of the number 3, as the number of “stops” for
the “standard” genetic code which, we recall, concerns the great majority of living
organisms. Now, some 18 “nonstandard” genetic codes have been discovered in recent
years (see Elzanowski and Ostell, 2010 for a recent updated compilation). Looking at
these genetic code tables, we find that the sense codon number varies between 60
and 63, with only four possible (observed) cases: 60, 61, 62 and 63. Equivalently, the
possible number of stop-codons varies between 1 and 4, with also four possible cases:
4, 3, 2 and 1. In general, in these “re-assignments”, and without entering into the details
of the biochemical machinery, a sense codon could become a stop codon and,
conversely, a stop codon could become a sense codon, coding for some “new” amino
acid (in the same canonical set of 20 amino acids or even comprising the 21st or the
22nd amino acids, Selenocysteine or Pyrrolysine). It is precisely at this point that, one
more time, Euler’s totient function φ, and also σ, the sum of divisors function, could
help to “describe” these features. Take, as the starting point, the numbers 61 and 3 for
the “standard” genetic code case where the former refers to the sense codons and the
latter to the stops. They are both prime (see above) and we have
 φ(61)=60=61-1
 σ(61)=62=61+1
 φ(3)=2=3-1
 σ(3)=4=3+1

For a prime p, recall that φ(p)=p-1 and σ(p)=p+1. It is not difficult to see that by
introducing the above relations in the “standard form” 61+3=64, three and only three
other emerging cases, (ii)-(iv), are possible. In summary, we have
 (i) 64=61+3
 (ii) 64=60+4
 (iii) 64=62+2
 (iv) 64=63+1
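
As a small numerical check of the values used above, the sketch below computes φ and σ by brute force and lists which direct pairings of {61, φ(61), σ(61)} with {3, φ(3), σ(3)} sum to 64. The pairing loop is our own illustration, not a procedure given in the text; it recovers cases (i)-(iii) only, and case (iv), 63+1, is not produced by this direct pairing, since the text does not spell out how it is derived.

```python
from math import gcd


def phi(n: int) -> int:
    """Euler's totient: count of 1 <= k <= n with gcd(k, n) == 1."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def sigma(n: int) -> int:
    """Sum of the divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


SENSE, STOPS = 61, 3  # standard genetic code

sense_values = {SENSE, phi(SENSE), sigma(SENSE)}  # {60, 61, 62}
stop_values = {STOPS, phi(STOPS), sigma(STOPS)}   # {2, 3, 4}

# Keep only the pairings that still account for all 64 codons.
splits = sorted(
    (s, t) for s in sense_values for t in stop_values if s + t == 64
)
print(splits)  # [(60, 4), (61, 3), (62, 2)]
```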

These four relations seemingly describe all of the following 18 observed cases (and also
the cases of Selenocysteine and Pyrrolysine), where the number of stop-codons is
indicated in parentheses (see Elzanowski and Ostell, 2010): The Standard Code (3),
The Vertebrate Mitochondrial Code (4), The Yeast Mitochondrial Code (2), The
Mold, Protozoan, and Coelenterate Mitochondrial Code and the
Mycoplasma/Spiroplasma Code (2), The Invertebrate Mitochondrial Code (2), The
Ciliate, Dasycladacean and Hexamita Nuclear Code (1), The Echinoderm and
Flatworm Mitochondrial Code (2), The Euplotid Nuclear Code (2), The Bacterial,
Archaeal and Plant Plastid Code (3), The Alternative Yeast Nuclear Code (3), The
Ascidian Mitochondrial Code (2), The Alternative Flatworm Mitochondrial Code
(1), Blepharisma Nuclear Code (2), Chlorophycean Mitochondrial Code (2),
Trematode Mitochondrial Code (2), Scenedesmus Obliquus Mitochondrial Code
(3), Thraustochytrium Mitochondrial Code (4) and Pterobranchia Mitochondrial
Code (2). It appears that the inclusion of all these genetic code variants could be
extended also at the level of the number of nucleotides, itself, beyond the number of
codons, thanks again to the -function. We shall develop these new results in a
forthcoming paper.

Appendix 1

In this appendix we give some numeric data concerning the 20 amino acids and the
DNA and RNA units, used in the text.
M  Amino acid           H    C   N/O/S      #Atom  #Nucleon
4  Proline (P)          5    3   0          8      41
4  Alanine (A)          3    1   0          4      15
4  Threonine (T)        5    2   0/1/0      8      45
4  Valine (V)           7    3   0          10     43
4  Glycine (G)          1    0   0          1      1
6  Serine (S)           3    1   0/1/0      5      31
6  Leucine (L)          9    4   0          13     57
6  Arginine (R)         10   4   3/0/0      17     100
2  Phenylalanine (F)    7    7   0          14     91
2  Tyrosine (Y)         7    7   0/1/0      15     107
2  Cysteine (C)         3    1   0/0/1      5      47
2  Histidine (H)        5    4   2/0/0      11     81
2  Glutamine (Q)        6    3   1/1/0      11     72
2  Asparagine (N)       4    2   1/1/0      8      58
2  Lysine (K)           10   4   1/0/0      15     72
2  Aspartic acid (D)    3    2   0/2/0      7      59
2  Glutamic acid (E)    5    3   0/2/0      10     73
3  Isoleucine (I)       9    4   0          13     57
1  Methionine (M)       7    3   0/0/1      11     75
1  Tryptophane (W)      8    9   1/0/0      18     130
   Total                117  67  9/9/2=20   204    1255

The 20 amino acids atomic composition


In the above Table, the detailed atomic composition of the amino acids’ side-chains is
given: H for hydrogen, C for carbon, N for nitrogen, O for oxygen and S for sulfur.
Also, the atom and nucleon numbers are given. The 20 amino acids are organized into
the five known multiplets of the standard genetic code: 5 quartets (M=4), 3 sextets
(M=6), 9 doublets (M=2), 1 triplet (M=3) and 2 singlets (M=1); the multiplicity M gives
the number of codons. The amino acids are given with their usual one-letter code (in
parentheses) and only the numbers for the side chains are given. When one also considers
the blocks, the corresponding number for the (common) block must be added;
for example, for the number of atoms one must add 9, for the nucleons 74, etc. Second,
the number of atoms in the five nucleobases is as follows: Uracil (U, 12)/Thymine
(T, 15), Cytosine (C, 13), Adenine (A, 15) and Guanine (G, 16), see the
picture below (courtesy of Dr. Gary E. Kaiser). When the “block”, made of the ribose
sugar and phosphate, is added to the nucleobases, the ribonucleotides have the following
content in atoms: UMP (C9H13N2O9P, 34), CMP (C9H14N3O8P, 35), AMP (C10H14N5O7P,
37), GMP (C10H14N5O8P, 38) (see Rakočević, 1997).
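
The totals in the table, and the per-multiplet atom counts used in the main text (76, 62, 177, 22 and 20+27), can be recomputed from the side-chain data. The following sketch simply transcribes the #Atom and #Nucleon columns; the tuple layout and the helper `atoms_in_formula` are illustrative choices of ours, and the block size of 9 atoms is the value quoted above.

```python
import re
from collections import defaultdict

BLOCK_ATOMS = 9  # atoms in the common block, as stated in the text

# (one-letter code, multiplicity M, side-chain atoms, side-chain nucleons)
SIDE_CHAINS = [
    ("P", 4, 8, 41), ("A", 4, 4, 15), ("T", 4, 8, 45), ("V", 4, 10, 43),
    ("G", 4, 1, 1),
    ("S", 6, 5, 31), ("L", 6, 13, 57), ("R", 6, 17, 100),
    ("F", 2, 14, 91), ("Y", 2, 15, 107), ("C", 2, 5, 47), ("H", 2, 11, 81),
    ("Q", 2, 11, 72), ("N", 2, 8, 58), ("K", 2, 15, 72), ("D", 2, 7, 59),
    ("E", 2, 10, 73),
    ("I", 3, 13, 57),
    ("M", 1, 11, 75), ("W", 1, 18, 130),
]

side_atoms = sum(atoms for _, _, atoms, _ in SIDE_CHAINS)
side_nucleons = sum(nucleons for _, _, _, nucleons in SIDE_CHAINS)
sense_codons = sum(m for _, m, _, _ in SIDE_CHAINS)

# Atoms per multiplet class, side chain plus block for each amino acid.
atoms_by_multiplet: dict[int, int] = defaultdict(int)
for _, m, atoms, _ in SIDE_CHAINS:
    atoms_by_multiplet[m] += atoms + BLOCK_ATOMS


def atoms_in_formula(formula: str) -> int:
    """Count atoms in a simple molecular formula such as 'C9H13N2O9P'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return sum(int(count or 1) for _, count in pairs)


print(side_atoms, side_nucleons, sense_codons)  # 204 1255 61
print(dict(atoms_by_multiplet))
print(sum(atoms_by_multiplet.values()))         # 384
print([atoms_in_formula(f) for f in
       ("C9H13N2O9P", "C9H14N3O8P", "C10H14N5O7P", "C10H14N5O8P")])
```

The last line reproduces the atom counts 34, 35, 37 and 38 quoted for UMP, CMP, AMP and GMP.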

The nitrogenous nucleobases of RNA and DNA

Appendix 2

In this appendix, we give the definition of some elementary mathematical tools, used in
this paper. First, we use the Fundamental Theorem of Arithmetic, which states that every
natural number n greater than 1 can be written, uniquely up to the order of the factors,
as a product of primes each raised to a given exponent: n=p1^a1×p2^a2×p3^a3×p4^a4…
For a given number n the arithmetic function
a0(n) gives the sum of the prime factors of n, including multiplicity. When the
multiplicities are discarded, the corresponding function is called a1(n). We also define
the function A0(n) to be the sum of a0(n) and the sum of the prime-indices of the prime
factors. The big-Omega function Ω(n) counts the number of the prime factors, including
multiplicity. We also define the function B0(n) as A0(n)+Ω(n). Let us give an example.
Take the number n=183, mentioned in section 3. Its prime decomposition is 3×61, the
prime-indices of the two prime factors 3 and 61 are respectively 2 and 18 and Ω(183)=2.
We have a0(183)=3+61=64, A0(183)=a0(183)+2+18=84 and B0(183)=A0(183)+Ω(183)=86. We
also use Euler’s famous totient or φ-function (a fundamental mathematical object in
cryptography), which gives the number of positive integers not exceeding n that are
co-prime to n (for a prime p, it is simply p-1). In other words, it gives a count of how
many numbers in the set {1, 2, 3, …, n} share no common factor greater than one with n.
For any number n, the general formula φ(n) = p1^(a1-1)(p1-1) p2^(a2-1)(p2-1)…pk^(ak-1)(pk-1)
can be used to compute φ by hand, but one can also, more quickly, use computer
software functions such as phi(n) and sigma(n) in Maple 6, used here.
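
For readers without Maple, the functions defined in this appendix can be sketched in a few lines. This is a minimal illustration under the definitions given above, with prime-indices counted with multiplicity (which is what reproduces the quoted value A0(840)=33); the Python names mirror the symbols of the text but are otherwise our own.

```python
def prime_factors(n: int) -> list[int]:
    """Prime factors of n with multiplicity, by trial division."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def prime_index(p: int) -> int:
    """Position of the prime p in the sequence 2, 3, 5, ... (2 has index 1)."""
    return sum(1 for k in range(2, p + 1) if len(prime_factors(k)) == 1)


def a0(n: int) -> int:
    """Sum of the prime factors, including multiplicity."""
    return sum(prime_factors(n))


def a1(n: int) -> int:
    """Sum of the distinct prime factors."""
    return sum(set(prime_factors(n)))


def big_omega(n: int) -> int:
    """Number of prime factors, including multiplicity."""
    return len(prime_factors(n))


def A0(n: int) -> int:
    """a0(n) plus the sum of the prime-indices of the prime factors."""
    return a0(n) + sum(prime_index(p) for p in prime_factors(n))


def B0(n: int) -> int:
    """A0(n) plus the number of prime factors."""
    return A0(n) + big_omega(n)


def phi(n: int) -> int:
    """Euler's totient via the product formula."""
    result = n
    for p in set(prime_factors(n)):
        result = result // p * (p - 1)
    return result


def sigma(n: int) -> int:
    """Sum of the divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


print(a0(183), A0(183), B0(183))  # 64 84 86
print(A0(840), B0(375))           # 33 33
```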

Acknowledgments: I express my gratitude to the reviewers for their very
constructive comments.

REFERENCES

Abramowitz, M.; Stegun, I. A. (1964), Handbook of Mathematical Functions, New York: Dover
Publications, ISBN 0-486-61272-4. See paragraph 24.3.2. (see also
http://en.wikipedia.org/wiki/Euler's_totient_function)
Buchholz, R. H. (1992) Perfect Pyramids; Bull. Austral. Math. Soc. 45, 3, 353-368 (See
also http://mathworld.wolfram.com/HeronianTetrahedron.html)
Cooks, R. G., Zhang, D., Koch, K. J., Gozzo, F. C., Eberlin, M. N. (2001) Chiroselective self-directed
octamerization of serine: implications for homochirogenesis Anal. Chem. 73, 3646-3655.
Costa, A.A., Cooks, R.G. (2001) Origin of chiral selectivity in gas-phase serine tetramers. Phys Chem Chem
Phys., 877-85.
Downes, A.M., Richardson, B.J. (2002) Relationships between genomic base content distribution of mass in
coded proteins J Mol Evol , 55, 476-490.
Elzanowski, A and Ostell, J (2010)
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
Filatov, F. (2009) A molecular mass gradient is the key parameter of the genetic code organization,
http://arxiv.org: q-bio. OT/0907.3537.
Hasegawa, M. and Miyata, T. (1980) On the antisymmetry of the amino acid code table.
Origins of Life, 10, 265-270.
Hodyss, R., Julian, R.R., Beauchamp, J.L. (2001) Spontaneous chiral separation in noncovalent molecular
clusters, Chirality, 13, 703.
Mainzer, K. (1996) Symmetries of Nature, de Gruyter, Berlin (German Edition: 1988).
Myung, S., Fioroni, M., Julian, R. R., Koeniger, S. L., Baik, M.-H. Clemmer, D. E. (2006) Chirally Directed
Formation of Nanometer-Scale Proline Clusters. J. Am. Chem. Soc., 128, 10833–10839.
Négadi, T. (2007) The genetic code multiplet structure, in one number. Symmetry: Culture and Science, 18,
2-3, 149-160. (Available also at http://arxiv.org: q-bio. OT/0707.2011).
Négadi, T. (2008) The genetic code via Gödel encoding. The Open Physical Chemistry Journal, 2, 1-5.
(Available also at http://arxiv.org: q-bio. OT/0805.0695).
Négadi, T. (2009) The genetic code degeneracy and the amino acids chemical composition are connected.
NeuroQuantology, 7, 1, 181-187 (Available also at http://arxiv.org: q-bio. OT/0903.4131).
Négadi, T. (2009) A taylor-made arithmetic model of the genetic code and applications.
Symmetry: Culture and Science, 20, 1-4, 51-76.
Négadi, T. (2011) The multiplet structure of the genetic code, from one and small number. NeuroQuantology,
9, 4, 767-771.
Négadi, T. (2011) A “quantum-like” approach to the genetic code. NeuroQuantology, 9, 4, 2011, 785-798.
Nemes, P., Schlosser, G., Vékey, K. (2005) Amino acid cluster formation studied by electrospray mass
spectrometry, Journal of mass spectrometry, 40, 43-49.
Rakočević, M. M. (1997) Genetic Code as a unique system, SKC Niš, Appendix 3.
Rakočević, M. M. (2009) Genetic Code Table: A note on the three splittings into amino
acid classes, http://arxiv.org: q-bio. OT/0903.4110.
Scherbak, V. (2008) The Arithmetical Origin of the genetic code in The Codes of Life: The Rules of
Macroevolution, M. Barbieri (ed.), Springer 2008.
Yang, H., Sheng, G., Peng, X., Qiang, B. Yuan, J., (2003) D-Amino acids and D-Tyr-tRNATyr deacylase:
stereospecificity of the translation machine revisited, FEBS Letters, 552, 95-98.
Symmetry: Culture and Science
Vol. 23, Nos.3-4, 427-447, 2012

THEORY OF TOPOLOGICAL CODING OF PROTEINS
AND NATURE OF ANTISYMMETRY OF
THE AMINO ACIDS CANONICAL SET

Vladimir A. Karasev

Biochemist (b. Saint-Petersburg (Leningrad), Russian Federation (USSR), 1947).
Address: Microtechnology Center, St. Petersburg State Electrotechnical University, Prof. Popov str. 5, 197376
St. Petersburg, Russia. E-mail: genetic-code@yandex.ru.
Fields of interest: Biochemistry, bioinformatics, theoretical biology, mathematical biology.
Publications: Karasev, V.A. and Sorokin, S.G. (1997) Topological structure of the genetic code, Russ. J.
Genetics, B33, 622-628; Karasev, V.A. and Stefanov, V.E. (2001) Topological nature of the genetic code, J.
Theor. Biol. B209, 303-317; Karasev V.A., Luchinin V.V. and Stefanov V.E. (2005) Series in Mathematical
Biology and Medicine, B.8. Proceedings of the International Conference “Advances in Bioinformatics and Its
Applications”, New Jersey: World Scientific Publishing Co., pp. 482-493; Karasev V.A., Luchinin V.V. and
Stefanov V.E. (2007) A model of the “molecular vector machine” for protein folding, Proceedings of the 3rd
Moscow conference on computational molecular biology, Moscow, Russia, July 27-31, 2007, pp. 134-136;
Karasev V.A. and Luchinin V.V. (2009) Vvedenie v konstruirovanie boinicheskich nanosistem [Introduction
to the design of bionical nanosystems, in Russian], Moscow: "Fizmatlit", 464 pp.

Abstract: The present review is dedicated to the theory of topological encoding of
proteins. Consistent development of this theory has led to a model of molecular vector
machines (MVM) of proteins, which includes: a protein fragment containing five amino
acids (pentafragment), the dodecahedron containing a group of 20 vectors, the canonical
set of changeable physical operators (side chains of amino acids) and the tetrahedral
Ci-th atom. The action of the vectors is aimed at the formation of the hydrogen bond
NiH…Oi-4=C in the protein pentafragment, the side chains of amino acids realizing this
action as physical operators. It is shown that the group of vectors possesses symmetry,
while the side chains of amino acids manifest antisymmetry. A dodecahedron-based model
for the structure of the canonical set of amino acids is proposed. The model forms part
of the MVM. Antisymmetry of the side chains of amino acids is considered in connection
with their involvement in the MVM structure of the protein.
Keywords: Genetic code, amino acids canonical set, antisymmetry, theory of
topological coding, proteins.

1. INTRODUCTION

The nature of the canonical set of 20 amino acids which are contained in molecules of
proteins and coded by the genetic code is still unclear (Schulz and Schirmer, 1979; Higgs
and Pudritz, 2009). One of the approaches to this problem suggests the development of
classifications for amino acids. There are several ways to classify the amino acids’ side
chains. The most common is classification according to the physicochemical properties
of their radicals (Campbell and Smith, 1994) (Table 1). Chemically, side chains are
rather heterogeneous. For example, Asp and Glu are carboxylic acids and Lys is a primary
amine; five side chains contain cycles: Pro is a saturated imino acid, whereas Phe, His,
Tyr and Trp are aromatic compounds. The physicochemical approach cannot disclose
grouping principles for the side chains.

The genetic code is a natural basis for classification of amino acids. One of the
approaches considers the complementarity of amino acids encoded by complementary
triplets (Mekler and Idlis, 1993; Siemion et al., 2004). A version of the periodic table
of amino acids based on properties of the genetic code was proposed by Biro et al.
(2003).

Table 1: Structure of side chains of canonical amino acids

Nonpolar and weakly polar: Proline (Pro), Glycine (Gly), Alanine (Ala), Valine (Val),
Leucine (Leu), Isoleucine (Ile), Serine (Ser)(-), Threonine (Thr)(-)
Polar: Aspartic acid (Asp)(-), Glutamic acid (Glu)(-), Lysine (Lys)(+), Arginine (Arg)(+)
Neutral and sulfur-containing: Asparagine (Asn), Glutamine (Gln), Cysteine (Cys)(-),
Methionine (Met)
Cyclic: Phenylalanine (Phe), Tyrosine (Tyr), Histidine (His), Tryptophane (Trp)

[Structural formulas of the side chains not reproduced.]

* The side chains are attached to the α-carbon atom, marked with a circle.

Methods of group theory have been applied to genetic code triplets (Balakrishnan,
2002). Four groups of multiplets of triplets coding amino acids with different properties
were attributed to SU(4) group symmetry. There are also spatial models of the genetic
code (Jimenez-Montaño et al., 1996; Karasev and Sorokin, 1997), in which amino acids
play a subordinate role. There are a number of approaches to the classification of amino
acids whose appearance was stimulated by the need to predict protein
structure (Taylor, 1986; Kosiol et al., 2004; Esteve and Falceto, 2005). Amino acids can
also be classified on the basis of their role in the protein structure: passive chains and
active functional modules (Karasev et al., 1994). Another classification considers amino
acid chains as physical operators reconstituting the encoded structure (Karasev and
Stefanov, 2001).
The purpose of the present work is a spatial representation of the structure of the amino
acids’ canonical set based on the theory of topological encoding of proteins. The problem
was addressed earlier in preliminary studies (Karasev et al., 2005; 2007).

2. THE BASIC PRINCIPLES OF THE THEORY FOR TOPOLOGICAL
CODING OF PROTEINS

2.1 Topological code

2.1.1 Definitions

The main elements of the theory for topological coding of proteins are the topological
code, the system of physical operators recreating the encoded structure, and the model of
molecular vector machines (Karasev et al., 2000; Karasev and Stefanov, 2001; Karasev and
Luchinin, 2009,a, b, c).

As an elementary unit of the protein we isolated a fragment of five amino acids
(pentafragment). Our choice is motivated by the fact that it is the minimal fragment of a
protein capable of forming a cycle with one H-bond (between NiH and Oi-4=C),
which has the loose conformation of an α-helix (Schulz and Schirmer, 1979). A link in the
protein, in the framework of our theory, is a fragment of two amino acids linked by
a peptide bond. Accordingly, in the structure of a pentafragment it is possible to
distinguish four links; therefore we also call them 4-unit fragments of the protein. The
mathematical analogues of protein pentafragments are 4-arc chain graphs (Fig. 1,a, 1,b)
(Karasev and Stefanov, 2001).

Figure 1: Cyclic 4-unit fragment of the protein which has an H-bond NiH…Oi-4=C (a); its 4-arc chain
graph with a connectivity edge between the i-th and (i-4)-th vertices (b); its matrix description (c);
matrix of 6 variables (d).
We have used 4-arc chain graphs to construct a model of the topological code. In the
structure of the graph two types of edges are distinguished: the structural edges
connecting adjacent vertices in a linear chain and the connectivity edges connecting
non-adjacent vertices. These edges define a variety of conformations of the graph
(Karasev and Stefanov, 2001).
The location of the edges can be described using upper triangular
matrices of six variables, each taking the value 0 (no connectivity edge) or 1 (presence
of the connectivity edge), as shown in Fig. 1,c and Fig. 1,d. We distinguish two types of
protein pentafragment conformations (connectivity states) and their graphs:

Acyclic – conformations which lack the H-bond NiH…Oi-4=C in the protein fragment and
contain no connection between the i-th and (i-4)-th vertices in the graph (x3 = 0);

Cyclic – conformations in which the hydrogen bond NiH…Oi-4=C in the protein fragment
is formed and a connection between the i-th and (i-4)-th vertices exists (x3 = 1).

2.1.2 Supermatrix conformations of 4-arc graph

All 64 possible conformations of the 4-arc graph were considered (Karasev et al., 2000,
Karasev and Stefanov, 2001). As can be seen in Figure 2, the supermatrix consists of 4
blocks. The main property of each block is a common second pair of
variables, x3x4 (given in the headings of the blocks). Blocks 00 and 01 contain acyclic
graph conformations (x3 = 0), and blocks 10 and 11 contain cyclic conformations (x3 = 1).
The blocks are constructed according to the following rules: rows are generated by the
first pair of variables (x1x2) in the sequence 00, 10, 01, 11 and columns by the third
pair of variables (x5x6) in the sequence 00, 01, 10, 11.
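
The construction rules above can be sketched directly. The snippet below is an illustration only: each conformation is written as the 6-character string x1x2x3x4x5x6, and the dictionary layout and names (`build_block`, `supermatrix`) are our own assumptions rather than the authors' notation.

```python
ROW_ORDER = ["00", "10", "01", "11"]    # first pair x1x2 (rows)
COL_ORDER = ["00", "01", "10", "11"]    # third pair x5x6 (columns)
BLOCK_ORDER = ["00", "01", "10", "11"]  # second pair x3x4 (block heading)


def build_block(x3x4: str) -> list[list[str]]:
    """4x4 block of conformations sharing the same second pair x3x4."""
    return [[row + x3x4 + col for col in COL_ORDER] for row in ROW_ORDER]


supermatrix = {block: build_block(block) for block in BLOCK_ORDER}

all_conformations = [
    conf for block in supermatrix.values() for row in block for conf in row
]
# x3 is the third character; x3 = 1 marks a cyclic conformation.
cyclic = [conf for conf in all_conformations if conf[2] == "1"]
acyclic = [conf for conf in all_conformations if conf[2] == "0"]

print(len(all_conformations), len(acyclic), len(cyclic))  # 64 32 32
print([supermatrix["00"][i][i] for i in range(4)])
# ['000000', '100001', '010010', '110011']
```

The main diagonal of block 00 comes out as 000000, 100001, 010010, 110011, which follows from the row and column sequences being mirror images of each other.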

2.1.3 Symmetry in the supermatrix

One can find two types of symmetry in the supermatrix. One exists within the blocks.
Thus, in the first block (x3x4 = 00) the conformations of the first and third pairs are
symmetrical, i.e. 00 ↔ 00, 10 ↔ 01, etc. The corresponding matrices and graphs
are arranged symmetrically with respect to the main diagonal. A particular case of this
symmetry is the intrinsic symmetry of matrices lying on the main diagonals, e.g.
000000, 100001, 010010, 110011.
Figure 2: Supermatrix conformations of 4-arc graph and their matrix description

The second type of symmetry is related to the structure of the supermatrix as a whole.
Two groups of matrices, in which the 0-elements of a matrix belonging to one group
correspond to the 1-elements of a matrix of the other group and vice versa, e.g. 000000
↔ 111111, 100000 ↔ 011111, 010000 ↔ 101111, occupy positions which are
related by C2 symmetry (separated by a solid line in Fig. 2). This type of symmetry was
called antisymmetry (Karasev et al., 2000), and the transformation 0 ↔ 1 itself the
antisymmetry conversion. Acyclic and cyclic conformations related by the C2 symmetry
group are antisymmetric. Thus the completely disconnected graph, described by the
matrix of six "0"s (located in the upper left corner of the supermatrix), is
antisymmetric to the graph with a cyclic conformation described by the matrix of six
"1"s (located in the lower right corner).
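
The C2 relation between antisymmetric matrices can be checked numerically. The sketch below assumes, purely for illustration, that the four blocks are laid out as a 2×2 grid (00 and 01 on top, 10 and 11 below), giving an 8×8 array; the actual layout of Fig. 2 is not reproduced here. Under this assumption, the 0 ↔ 1 conversion moves every element to the position obtained by a 180° rotation of the array.

```python
ROW_ORDER = ["00", "10", "01", "11"]  # first pair x1x2
COL_ORDER = ["00", "01", "10", "11"]  # third pair x5x6
# Assumed arrangement of the x3x4 blocks (hypothetical 2x2 layout).
BLOCK_GRID = [["00", "01"],
              ["10", "11"]]


def cell(r: int, c: int) -> str:
    """Conformation x1x2x3x4x5x6 at row r, column c of the 8x8 array."""
    block = BLOCK_GRID[r // 4][c // 4]
    return ROW_ORDER[r % 4] + block + COL_ORDER[c % 4]


def complement(bits: str) -> str:
    """Antisymmetry conversion: swap every 0 and 1."""
    return "".join("1" if b == "0" else "0" for b in bits)


grid = [[cell(r, c) for c in range(8)] for r in range(8)]

# C2 symmetry: the complement of each element sits at the rotated position.
is_c2 = all(
    complement(grid[r][c]) == grid[7 - r][7 - c]
    for r in range(8)
    for c in range(8)
)
print(grid[0][0], grid[7][7], is_c2)  # 000000 111111 True
```

Since the conversion flips x3, each acyclic conformation is paired with a cyclic one.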
The spatial representation of the resulting supermatrix, composed of “6 variables”
elements, is the Boolean hypercube B6, which reflects all of the possible single-bit
transitions between elements (Karasev et al., 2000; Karasev and Stefanov, 2001; Karasev
and Luchinin, 2009, a).

2.1.4 Transformation of supermatrix into the triplet genetic code

Information about the graph structure presented in matrix form cannot be used for
transmission, reproduction and copying. It should be transformed into the suitable form
of an unbranched chain (Karasev and Stefanov, 2001). The number of variables in the
matrices describing the conformations of the 4-arc graph is equal to 6, i.e. 3 pairs. For
the pairs of variables xixi+1 we introduce the notation XYZ (Scheme 1):

x1x2 - X    x3x4 - Y    x5x6 - Z    (1)

The values 00, 10, 01, 11, assumed by xixi+1, can be denoted by the symbols C, G, U, A of
the genetic code (Scheme 2):

C - 00    G - 10    U - 01    A - 11    (2)

Using this correspondence, we transform the supermatrix into the code (Fig. 3). Triplets
appear together with the amino acids they code for. Information on the structure of the
4-arc graph in terms of the 4-letter code assumes the form of a linear chain.
The second letter of the triplet (Y), the same for the whole block, codes for the
variables x3 and x4. It contains the main information on the graph structure. As can be
seen in Figure 3, the second pairs of variables were transformed into the second letter
of the triplets in the sequence C, U, G, A. The third letters of the triplets have the
same order, while the first letters are arranged in a different order, C, G, U, A, which
is determined by the rules of the conversion.
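
Schemes 1 and 2 amount to a simple lookup from bit pairs to letters. The following sketch applies the pair-to-letter assignment of Scheme 2 (C = 00, G = 10, U = 01, A = 11); the function names `to_triplet` and `to_bits` are our own and serve only as an illustration.

```python
PAIR_TO_BASE = {"00": "C", "10": "G", "01": "U", "11": "A"}  # Scheme 2
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

ROW_ORDER = ["00", "10", "01", "11"]  # first pair x1x2 (rows)
COL_ORDER = ["00", "01", "10", "11"]  # third pair x5x6 (columns)


def to_triplet(bits: str) -> str:
    """Translate x1x2x3x4x5x6 into the triplet XYZ (Scheme 1)."""
    if len(bits) != 6:
        raise ValueError("expected six binary variables")
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in (0, 2, 4))


def to_bits(triplet: str) -> str:
    """Inverse translation, from a triplet back to six variables."""
    return "".join(BASE_TO_PAIR[base] for base in triplet)


# Order of first and third letters induced by the row/column sequences.
first_letters = [PAIR_TO_BASE[p] for p in ROW_ORDER]
third_letters = [PAIR_TO_BASE[p] for p in COL_ORDER]
print(first_letters, third_letters)
# ['C', 'G', 'U', 'A'] ['C', 'U', 'G', 'A']

# Block with second letter C (x3x4 = 00), row by row.
for row in ROW_ORDER:
    print([to_triplet(row + "00" + col) for col in COL_ORDER])
```

The first-letter order C, G, U, A and the third-letter order C, U, G, A fall out of the row and column sequences, in agreement with the text.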

2.1.5 Symmetry in the table of genetic code

Symmetric matrices describing symmetric conformations of the 4-arc graph are encoded
by triplets arranged symmetrically with respect to the main diagonals of the blocks, for
example CCU – GCC, UCC – CCG (Fig. 3). Antisymmetric matrices, related by
C2 symmetry (separated by a thick line), are encoded by triplets which transform into
each other according to Rumer’s rule (Rumer, 1968): C ↔ A, G ↔ U, for
example, CCC ↔ AAA, CCU ↔ AAG, CCG ↔ AAU and so on.
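
Both statements can be verified exhaustively over the 64 triplets. In the sketch below, which uses the assignment of Scheme 2 and helper names of our own choosing, Rumer's rule is compared with the bitwise 0 ↔ 1 conversion, and the diagonal mirror partner of a triplet in a block is obtained by reversing its six variables (this reversal applies to the blocks with x3x4 = 00 or 11, as in the CCU – GCC example).

```python
PAIR_TO_BASE = {"00": "C", "10": "G", "01": "U", "11": "A"}  # Scheme 2
RUMER = str.maketrans("CAGU", "ACUG")  # C <-> A, G <-> U


def to_triplet(bits: str) -> str:
    """Translate six binary variables into a triplet."""
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in (0, 2, 4))


def complement(bits: str) -> str:
    """Antisymmetry conversion: swap every 0 and 1."""
    return "".join("1" if b == "0" else "0" for b in bits)


def rumer(triplet: str) -> str:
    """Apply Rumer's rule letter by letter."""
    return triplet.translate(RUMER)


all_bits = [format(i, "06b") for i in range(64)]

# Rumer's rule coincides with the 0 <-> 1 conversion for every triplet.
rumer_matches = all(
    to_triplet(complement(b)) == rumer(to_triplet(b)) for b in all_bits
)
print(rumer_matches)  # True

# Mirror partners across the main diagonal of the block x3x4 = 00.
for bits in ("000001", "010000"):
    print(to_triplet(bits), "-", to_triplet(bits[::-1]))
# CCU - GCC
# UCC - CCG
```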
Figure 3: Transformation of the supermatrix describing the conformation of the 4-arc graph, in the triplet
genetic code (according to Karasev and Stefanov, 2001).

Since the code is based, as we have shown, on a description of the conformations
of the 4-arc graph (or, equivalently, protein pentafragments), it is clear that Rumer’s
rule connects antisymmetric conformations of the protein encoded by triplets.

The spatial structure of the supermatrix is the Boolean hypercube B6 (see Section 2.1.3).
It is clear that the triplet genetic code, obtained on the basis of the correspondences
of Schemes 1 and 2, must have a structure that is isomorphic to the hypercube B6
(Jimenez-Montaño et al., 1996; Karasev and Sorokin, 1997).

Thus, we have shown that the genetic code has a topological nature and is associated
with the encoding of conformations of protein pentafragments. The nature of the “triplet –
amino acid assignment” can be explained from the same standpoint (Karasev and
Stefanov, 2001).
2.2 Physical operators and their assignment to genetic code triplets

2.2.1 The definition of "physical operator"

In order for the protein to recreate the encoded graph structure (acyclic or cyclic
conformation), a definite assignment must exist between the coding triplets and the side
chains of the protein (Karasev et al., 2000; Karasev and Stefanov, 2001). The only site
that can be affected by the side chain R of the just-bound i-th amino acid is the area
where the hydrogen bond between the groups NiH...Oi-4 is formed (Fig. 4, a, link shown by
an arrow). This bond is represented by the variable x3 in the matrix. Connectivity and
anti-connectivity operators can be distinguished by their mode of action (Fig. 4, b and 4, c).

To perform their functions, they must meet a number of requirements:

• they must have a group capable of acting on the bond NiH...Oi-4;
• their size should be of the same order as the scope;
• their spatial position should always be the same.

The latter requirement can be met only if the protein links are chiral (D- or L-type).
Thus, in our approach, the chirality of amino acids exists because they participate,
as physical operators, in the reconstruction of the encoded protein conformations.

Connectivity operators are amino acid side chains that provide additional
fixation of pentafragments, e.g. through hydrogen bonds, in accordance with the
encoded cyclic fragment of the 4-arc graph. For this purpose, connectivity operators
should have end groups capable of forming hydrogen bonds. A connectivity
operator is shown in Fig. 4, b. In the matrix, x3 = 1.

Anti-connectivity operators are amino acid side chains that obstruct the formation
of a cyclic protein pentafragment in accordance with the encoded fragment of
the 4-arc graph, thereby recreating the acyclic conformation. The side chains
of anti-connectivity operators should enter the region of the
NiH...Oi-4 bond and prevent the formation of the H-bond (Fig. 4, c). In the
matrix, x3 = 0. As a rule, the side chains of anti-connectivity operators should not
have groups capable of forming hydrogen bonds. It has been shown (Karasev and
Stefanov, 2001; Karasev and Luchinin, 2009a) that the side chains of amino acids
satisfy the above requirements.

Figure 4: The definition of "physical operator". a – the scope of the physical operator; b – connectivity
operator; c – anti-connectivity operator

2.2.2 Assignment of physical operators to blocks of triplets of the genetic code

In the supermatrix of the genetic code, as follows from Figure 3, there are two types of
blocks: two blocks of triplets coding for acyclic conformations of the protein (C = 00 and
U = 01), whose matrices include the variable x3 = 0, and two blocks with cyclic
conformations (G = 10 and A = 11), for which x3 = 1. Anti-connectivity operators
recreate the conformations of the acyclic graph (x3 = 0), so they should
be assigned to the blocks C = 00 and U = 01. Connectivity operators
reconstruct cyclic conformations (x3 = 1), so they should
correspond to the blocks G = 10 and A = 11.

As can be seen from Figure 3, mainly non-polar side chains (Pro, Ala, Val, Leu, Ile, Phe
and Met) correspond to blocks C = 00 and U = 01, which meets the requirements for
anti-connectivity operators. At the same time, blocks G = 10 and A = 11 correspond
to amino acid side chains capable of forming hydrogen bonds (Arg, Ser, Cys, Trp,
His, Gln, Asp, Glu, Tyr, Asn, Lys), which is also consistent with the above
requirements.

It should be noted that the side chain of Ser is present both in block C = 00, containing
acyclic conformations (triplets UCC, UCU, UCG, UCA), and in block G = 10, which
includes cyclic conformations (triplets AGC, AGU). Within our approach, this may be
due to the fact that the C–OH group of Ser can form an H-bond both with the Oi-3=C group
and with the Oi-4=C group. In the first case, the C–OH group contributes to the formation
of the H-bond NiH...Oi-3=C in block C = 00 (the pentafragment here is acyclic, since a
cyclic one must include the bond NiH...Oi-4=C by definition, see Section 2.1.1). In the
second case, the C–OH group promotes the formation of the cyclic pentafragment in
block G = 10 through the bond NiH...Oi-4=C.
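The block rule of this section can be written as a small lookup. In this sketch the block of a triplet is taken to be its middle letter, which is an assumption inferred from the Ser example above (UCC, UCU, UCG, UCA in block C; AGC, AGU in block G). The value of x3 is read as the first bit of the block code. The function names are hypothetical.

```python
# Two-bit block codes from the text: C = 00, U = 01, G = 10, A = 11.
BLOCK_BITS = {"C": (0, 0), "U": (0, 1), "G": (1, 0), "A": (1, 1)}


def block(triplet: str) -> str:
    """Block letter of a triplet; assumed here to be the middle letter."""
    return triplet[1]


def x3(triplet: str) -> int:
    """First bit of the block code: 0 for blocks C and U, 1 for blocks G and A."""
    return BLOCK_BITS[block(triplet)][0]


def operator_type(triplet: str) -> str:
    """Operator type expected for the block that contains the triplet."""
    return "connectivity" if x3(triplet) == 1 else "anti-connectivity"


for t in ("UCC", "UCA", "AGC", "AGU"):
    print(t, block(t), x3(t), operator_type(t))
```

The four Ser triplets of block C come out as anti-connectivity and the two of block G as connectivity, which reproduces the dual placement of Ser discussed above.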

In general, the problem of triplet-amino acid assignment in the genetic code (Crick,
1968, Knight et al., 1999) can be solved in the framework of the concepts of
connectivity and anti-connectivity physical operators.

2.2.3 Recreating symmetric conformation of the protein

We have considered the action of operators that are encoded by different triplets and
recreate the symmetric conformation (Karasev and Stefanov, 2001; Karasev and Luchinin,
2009a). Fig. 5 shows the proposed mechanism of action of these operators.
Suppose that these are connectivity operators with similar properties but of different
size (Fig. 5). Let us denote the functional groups situated at the ends of the chains as
O=CNH2. As seen from Fig. 5, the hydrogen bonds of two side chains of different length
have different slopes and, hence, differently directed field lines. The connectivity of the
i-th and (i-4)-th α-carbon atoms is the same in the two cycles (dotted line), whereas the
connectivity of the other atoms is different.


Figure 5: Assignment of physical operators to triplets, encoding symmetric conformation of the graph.

The longer side chain forms lines of force directed to the left, and there is a
connectivity edge between i-2 and i-4 (variable x6 = 1, Fig. 5, a), while the shorter chain
forms lines of force directed to the right, and there is a connectivity edge between i and
i-2 (variable x1 = 1, Fig. 5, b). The same reasoning apparently applies to the
anti-connectivity operators as well. A consistent analysis of the possible actions of all
side chains on the bond area NiH...Oi-4=C from the standpoint of the theory of topological
coding has led to the concept of the “molecular vector machine” of proteins.

3. MOLECULAR VECTOR MACHINE AND STRUCTURE OF THE AMINO ACIDS CANONICAL SET

3.1 Model of molecular vector machine

3.1.1 The allocation of planes of symmetry and setting the vectors

The molecular vector machine model was described earlier for chain polymers (Karasev
and Luchinin, 2009a, c). Let us consider the area of the NiH...Oi-4=C bond of the
main protein chain in detail (Fig. 6). In addition, let us take into account that the
HNC=O groups in proteins have a partially delocalized double bond, which makes the
group planar (Schulz and Schirmer, 1979). Consequently, all six atoms
of the group between the (i-3)-th and (i-4)-th links (Cαi-3, H, Ni-3, Ci-4, Oi-4, Cαi-4)
are located in one plane.

Due to the partial delocalization, the electron environment of the HNC=O group can be
described by three sp2-hybridized clouds (at Oi-4, Ci-4 and Ni-3). One of them (that of
the oxygen atom) is shown in Figure 6. One can draw three mutually perpendicular
planes (I–III, Fig. 6, a) through the Ci-4=Oi-4 and NiH groups, dividing the sp2-hybridized
clouds into parts (plane I: right and left parts; plane II: front and rear parts; plane III:
upper and lower parts). The possible directions of action of the side chains on the
NiH...Oi-4=Ci-4 bond (Fig. 6, b-d) were considered, which revealed at least 20 vectors of action.

Within plane I two pairs of vectors are allocated: along NiH … Oi-4=Ci-4 bond and
across this bond (Fig. 6, b). On the basis of reflection transformation ( - at transition
through a plane I,  - at transition through plane II) and rotation transformation ( - at
transition through plane III) two subgroups of eight vectors directed, respectively, along
the plane II (Fig. 6, c) and along the plane III (Fig. 6, d) are allocated.

3.1.2 Setting the directions of vectors. Model of the molecular vector machine

The action of the vectors is realized through the side chains of amino acids, which
have real physical dimensions. Therefore, the vectors must form a kind of spatial figure.
The most suitable figure for this purpose is the dodecahedron: it has 20 vertices, which
matches the number of amino acids in the canonical set.
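The geometric facts used here can be checked numerically. The sketch below builds a regular dodecahedron from its standard textbook coordinates, which are an assumption of this illustration since the paper does not give coordinates. It confirms that there are 20 vertices at equal distance from the centre, so that 20 radius vectors of equal length can be drawn, and counts the diameters formed by opposite vertices.

```python
from itertools import product
from math import isclose, sqrt

PHI = (1 + sqrt(5)) / 2  # golden ratio


def dodecahedron_vertices():
    """Standard coordinates of a regular dodecahedron centred at the origin."""
    verts = [tuple(float(c) for c in p) for p in product((-1, 1), repeat=3)]
    for a, b in product((-1 / PHI, 1 / PHI), (-PHI, PHI)):
        # Cyclic permutations of (0, +-1/phi, +-phi).
        verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    return verts


VERTS = dodecahedron_vertices()
radii = [sqrt(x * x + y * y + z * z) for x, y, z in VERTS]


def find(v):
    """Index of the vertex equal to v up to rounding."""
    return next(
        i for i, w in enumerate(VERTS)
        if all(isclose(p, q, abs_tol=1e-9) for p, q in zip(v, w))
    )


# Each vertex has an opposite vertex; together they span a diameter.
diameters = {
    frozenset((i, find(tuple(-c for c in v)))) for i, v in enumerate(VERTS)
}

print(len(VERTS), round(radii[0], 6), len(diameters))
```

The ten diameters found this way correspond in number to the ten pairs listed under complementarity in Table 3, part D.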


Figure 6: The introduction of planes of symmetry (a) and the possible arrangement of vectors in the area of
NiH...Oi-4=Ci-4-bonds (b-d).

Atom Oi-4 is placed in the center of the dodecahedron, the group NiH in one of its vertices,
and the vectors are directed to the vertices of the dodecahedron (Fig. 7).

Dodecahedron vertices and the vectors directed to them are divided into four groups
and designated according to their symmetry. In plane I, the vertex corresponding to atom
Ni is marked with the letter A, and the vertex connected with vertex A via the rotation
operation is marked by adding a minus sign to A (-A). Together they form
subgroup 1. Two other vertices located in plane I, designated B and -B, are
interconnected by the same operation and form subgroup 2.

The vertices symmetric with respect to plane I and located above plane III on the
left and on the right are designated by the letter A with either a lower right or an upper
left index, respectively. The vertices located below plane III have the same notation with
a minus sign in front. Together they form the third subgroup, consisting of eight vertices
connected by symmetry transformations (Table 2). Similar notation is used for the
fourth subgroup, consisting of eight elements designated by the letter B.

Figure 7: Model of molecular vector machines



Subgroups
1    A     -A
2    B     -B
3    A1    ¹A    A2    -A1    ²A    -¹A    -A2    -²A
4    B1    ¹B    B2    -B1    ²B    -¹B    -B2    -²B

Table 2: The subgroups of vectors connected by transformations of symmetry

Thus, the 20 vertices of the dodecahedron and the 20 vectors directed to them form
four subgroups according to the transformations with respect to the planes that are
drawn in the dodecahedron.

To develop the concept of the MVM further, let us add an arrow marked with the letter
Si to introduce an exchangeable physical operator (an amino acid side chain), and a fragment
of the (i+1)-th link with an arrow showing the direction to the (i+1)-th Cα atom (Fig. 7).
Four components can be identified in the structure of the MVM (Karasev et al., 2007;
Karasev and Luchinin, 2009a, b): a protein fragment consisting of five amino acids
(pentafragment); a dodecahedron containing a group of 20 vectors (radii of the
dodecahedron); a canonical set of exchangeable physical operators; and the tetrahedral
i-th Cα atom.

The operation principle of the MVM consists in the consecutive action of amino acid side
chains, attached to the i-th α-carbon atom, on the bond NiH...Oi-4=Ci-4 of the protein
pentafragment during protein synthesis. Each amino acid, in accordance
with the length of its side chain, is represented by the vertex of the dodecahedron to which
the vector corresponding to that amino acid is directed. Connectivity operators exert
their effect through hydrogen bonding of their terminal groups with the group Oi-4=Ci-4
(their vectors are directed mainly upwards), whereas anti-connectivity operators act
via collisions of this group with the electron shells of the side chains' terminal groups
(their vectors are directed downwards). In this case, the side chains associated with the
i-th tetrahedral Cα atom, by changing the direction Cαi → Cαi+1 (see Figure 7), determine
the direction of growth of the polypeptide chain.

3.2 Dodecahedron model of the canonical set of 20 amino acids

This model was proposed in earlier work (Karasev et al., 2005). In its present form
(Figure 8), it is modified to meet the requirements of MVM.

The side chains of amino acids are indicated in circles at the vertices of the
dodecahedron. The alpha-carbon atom, to which the side chains are attached, is situated at
the top, and the side chains are oriented downwards. Side chains are arranged from top to
bottom in increasing order of size. According to the MVM model, the
dodecahedron structure is divided by three planes: I, II and III. Shorter side chains are
situated on the right of plane I, whereas their heavier analogs are on the left. In this
model four groups of antisymmetrical chains can be distinguished: 1) chains
antisymmetrical about plane I (e.g. Ser : Thr), 2) chains antisymmetrical about
plane II (e.g. Ser : Cys), 3) chains antisymmetrical about plane III (e.g. Ser : His)
and 4) chains antisymmetrical about the center of the dodecahedron (e.g. Ser : Trp). If
this structure is placed into the generalized structure of the MVM, leaving only the names
of the amino acids, it turns into the MVM model of proteins (Karasev and Luchinin, 2009a).

Figure 8: Dodecahedron model of the canonical set of amino acids. Subgroups 1 and 2 – dark gray, subgroup
3 – grey, subgroup 4 – white circles


3.3 Properties of the canonical set of amino acids derived from the MVM model

3.3.1 Side chains should have different length

Longer side chains should yield vectors directed downwards, towards the group Oi-4=C,
while shorter chains yield vectors directed towards NiH. Hence, the side
chains of amino acids are located on the dodecahedron in order of increasing length
(Fig. 8).

3.3.2 Quasi-mirror antisymmetry

As follows from Figure 7, the i-th α-carbon atom, to which the side chains are attached, is
located to the right of the dodecahedron (asymmetrically). The side chains yielding
vectors symmetric with respect to plane I should therefore be shorter on the right and
longer on the left.

In the dodecahedron model, side chains possessing similar properties but
different length are arranged symmetrically with respect to plane I. We call this
quasi-mirror antisymmetry. Pairs of amino acids connected by this type of antisymmetry
yield symmetric vectors connected by the reflection through plane I. They are shown in
Table 3, part A.

3.3.3 Non-mirrored antisymmetry

The side chains yielding vectors symmetric with respect to plane II, which are
interrelated by the reflection through plane II, as seen in Fig. 7, should have close values
of chain length but different groups at the end. We call this property non-mirrored
antisymmetry. Side chains of amino acids linked by this type of antisymmetry, as seen in
Figure 8, are located on opposite sides of plane II. They are listed in Table 3, part B.

3.3.4 Rotary antisymmetry

As follows from Figure 7, the side chains yielding vectors symmetric with respect to
plane III, which are connected by the rotation transformation, should have different
lengths. We take into account that the top half of the dodecahedron can be superimposed
on the bottom half only by rotation about an axis located in plane III. With this in mind,
we can assume that rotary antisymmetry of vectors is possible between the side chains of
the upper and lower halves of the dodecahedron.

A. Quasi-mirror    B. Non-mirrored    C. Rotary      D. Complementarity
Thr - Ser          Thr - Met          Thr - Trp      Thr - His
Met - Cys          Ser - Cys          Ser - His      Ser - Trp
Glu - Asp          Glu - Gln          Met - Tyr      Met - Phe
Gln - Asn          Asp - Asn          Cys - Phe      Cys - Tyr
Lys - Arg          Lys - Ile          Glu - Lys      Glu - Arg
Ile - Val          Arg - Val          Asp - Arg      Asp - Lys
Trp - His          Trp - Tyr          Gln - Ile      Gln - Val
Tyr - Phe          His - Phe          Asn - Val      Asn - Ile
                                      Pro - Gly      Pro - Gly
                                      Ala - Leu      Ala - Leu

Table 3: Antisymmetry types for side chains of amino acids

At the same time, the properties of opposing side chains show features of complementarity:
e.g. the short chains Ser and Thr oppose the cyclic ones His and Trp, and the negatively
charged Asp and Glu oppose the positively charged Arg and Lys. Pairs of side chains
associated with this type of antisymmetry can also be seen in Fig. 8. They are shown in
Table 3, part C.

3.3.5 Antisymmetry of complementarity

From Figure 7, it follows that two vectors directed to opposite vertices of the
dodecahedron form its diameter. They have the same angle but opposite directions
of action. Accordingly, the side chains responsible for the effect of these vectors should
be mutually complementary in their properties. Indeed, as follows from Fig. 8 and Table
3, part D, side chains with complementary properties are located at opposite vertices:
Ser, with a short side chain, opposes Trp, with a bulky side chain, while Thr, more massive
than Ser, opposes His, less massive than Trp, etc.

By analogy with the symmetry transformations described for vectors (Table 2), the
antisymmetry transformations connecting amino acid side chains are summarized in
Table 4.

Subgroups
1    Gly    Pro
2    Ala    Leu
3    Ser    Thr    Cys    His    Met    Trp    Phe    Tyr
4    Asp    Glu    Asn    Arg    Gln    Lys    Val    Ile

Table 4: Antisymmetry transformation connecting the side chains of amino acids
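The relation between Tables 2, 3 and 4 can be made explicit with a short sketch. It rests on an assumed reading of the labels in rows 3 and 4 of Table 2 (A1, ¹A, A2, -A1, ²A, -¹A, -A2, -²A) as three binary flags: index position, index value and sign. Each antisymmetry type is then modelled as flipping some of these flags. This encoding and the names `FLAGS`, `FLIPS` and `pairs` are hypothetical constructions for illustration, not notation from the original work.

```python
# Rows 3 and 4 of Table 4, in the column order printed there.
ROWS = {
    3: ["Ser", "Thr", "Cys", "His", "Met", "Trp", "Phe", "Tyr"],
    4: ["Asp", "Glu", "Asn", "Arg", "Gln", "Lys", "Val", "Ile"],
}

# Assumed flags (index position, index value, sign) for the Table 2 labels
# A1, 1A, A2, -A1, 2A, -1A, -A2, -2A, in that order.
FLAGS = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
         (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

# Flags flipped by each antisymmetry type (hypothetical encoding).
FLIPS = {
    "quasi-mirror": (1, 0, 0),
    "non-mirrored": (0, 1, 0),
    "rotary": (0, 0, 1),
    "complementarity": (1, 0, 1),
}


def pairs(kind: str, row: int) -> set:
    """Unordered pairs of side chains exchanged by the given antisymmetry type."""
    flip = FLIPS[kind]
    names = ROWS[row]
    result = set()
    for flags, name in zip(FLAGS, names):
        partner = tuple(f ^ d for f, d in zip(flags, flip))
        result.add(frozenset((name, names[FLAGS.index(partner)])))
    return result


for kind in FLIPS:
    print(kind, sorted(tuple(sorted(p)) for p in pairs(kind, 3)))
```

With this reading, the pairs generated for rows 3 and 4 coincide with the eight-membered parts of Table 3; the two-membered subgroups 1 and 2 (Gly, Pro and Ala, Leu) are not covered by the sketch.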

4. CONCLUSION

The main objective of the present review was a consistent presentation of the theory of
topological coding of proteins, illustrated by a model of the spatial structure of the
canonical set of amino acids on a dodecahedron, developed on the basis of this theory.
Our model, unlike classifications of amino acid side chains based on their
physical and chemical properties (Campbell and Smith, 1994), on principles of
complementarity (Mekler and Idlis, 1993) or on the genetic code (Biro et al., 2003;
Balakrishnan, 2002), emerged in the course of the development of the earlier proposed
theory of topological coding of proteins (Karasev et al., 2000; Karasev and Stefanov,
2001; Karasev and Luchinin, 2009a, b, c).

One of the concluding results derived from the theory is the molecular vector machine
(MVM) of proteins presented in this paper. Within the MVM model, 20 vectors
are introduced in the structure of the dodecahedron, affecting the formation of the hydrogen
bond NiH...Oi-4=C of the protein pentafragment. Side chains of amino acids realize this
action as physical operators. The group of vectors manifests symmetry, whereas the side
chains of amino acids manifest antisymmetry. The structure model for the canonical set of
amino acids on the dodecahedron uses principles of antisymmetry and is a part of the MVM.
Thus, a rational explanation of the antisymmetric nature of amino acid side chains is
provided. Current data on the structure of ribosomes and protein biosynthesis suggest that
the secondary structure of proteins forms within the ribosome co-translationally, i.e. at
the time of biosynthesis (Kramer et al., 2009). The MVM model,
in which the side chains of amino acids act as physical operators, is consistent with these
data.

Further development of the structure model for the canonical set of amino acids is
connected with the two-level scheme (Karasev and Luchinin, 2009a), which explains the
nature of the degeneracy of the genetic code triplets, and with the group-theoretical
approach (Karasev and Luchinin, 2009a, b), which considers amino acid side chains as
irreducible representations of the group composed by the vectors.
More detailed information, as well as applied aspects of the approach, can be found on
the websites: http://genetic-code.narod.ru, http://amino-acids-20.narod.ru and
http://vector-machine.narod.ru.

Acknowledgment: We are grateful to V.V. Luchinin and V.E. Stefanov for useful
discussion of the paper.

REFERENCES

Balakrishnan J. (2002) Symmetry scheme for amino acid codons, Phys. Rev. E, Stat. Nonlin. Soft. Matter.
Phys., B65 (2 Pt 1), 021912.
Biro J.C., Benyo B., Sansom C., Szlavecz A., Fordos G., Micsik T. and Benyo Z. (2003) A common periodic
table of codons and amino acids, Biochem. Biophys. Res. Commun., B306, 408-415.
Campbell, P.N. and Smith, A.D. (1994) Biochemistry Illustrated. Edinburgh: Churchill Livingstone, pp. 8-9.
Crick, F.H.C. (1968) The origin of the genetic code, J.Mol.Biol. B38, 367-379.
Esteve J.G. and Falceto F. (2005) Classification of amino acids induced by their associated matrices, Biophys.
Chem., B115, 177-180.
Jenni S. and Ban N. (2003) The chemistry of protein synthesis and voyage through the ribosomal tunnel,
Curr. Opin. Struct. Biol., B13, 212-219.
Jimenez-Montaño, M.A., de la Mora-Basañez, C.R. and Poschel, Th. (1996) The hypercube structure of the
genetic code explains conservative and non-conserva-tive amino acid substitutions in vivo and in vitro,
Bio Systems, B39, 117-125.
Higgs P. and Pudritz R.E. (2009) A thermodynamic basis for prebiotic amino acid synthesis and the nature of
the first genetic code, Astrobiology, B9, 483-490.
Karasev, V.A., Luchinin, V.V., Stefanov, V.E. (1994) A model of molecular electronics based on the concept
of conjugated ionic-hydrogen bond systems, Adv. Mater. Opt. Electron. B4, 203-218.
Karasev, V.A. and Sorokin, S.G. (1997) Topological structure of the genetic code, Russ. J. Genetics, B33,
622-628.
Karasev V.A., Demchenko E.L. and Stefanov V.E. (2000) Topological coding of polymers and protein
structure prediction. In: Chemical topology: applications and techniques. (D.Bonchev & D.Rouvray
eds). – Ser. Math.Chem., B.6, 295-345, New-York:Gordon&Breach.
Karasev, V.A. and Stefanov, V.E. (2001) Topological nature of the genetic code, J. Theor. Biol. B209, 303-
317.
Karasev V.A., Luchinin V.V. and Stefanov V.E. (2005) Series in Mathematical Biology and Medicine. B.8.
Proceedings of the International Conference. "Advances In Вioinformatics And Its Applications”, New
Jersey: World Scientific Publishing Co., pp.482-493.
Karasev V.A., Luchinin V.V. and Stefanov V.E. (2007) A model of the “molecular vector machine” for
protein folding, Proceedings of the 3-rd Moscow conference on computational molecular biology.
Moscow, Russia, July 27-31, 2007, pp.134-136.
Karasev V.A. and Luchinin V.V. (2009a) Vvedenie v konstruirovanie bionicheskikh nanosistem [Introduction
to the design of bionical nanosystems, in Russian], Moscow: "Fizmatlit", 464 pp.
Karasev V.A. and Luchinin V.V. (2009b) Model topologicheskogo codirovania tsepnikh polimerov dlia
bionicheskoi nanoelectroniki. I. Topologicheskii cod i sootvetstviia fisicheskikh operatorov tripletam
coda (Model of topological coding of chain polymers for bionical nanoelectronics. I. A topological
code and assignment of the physical operators to triplets of a code. In Russian), Biotekhnosfera, No.1,
pp. 2 – 10.
Karasev V.A. and Luchinin V.V. (2009c) Model topologicheskogo codirovania tsepnikh polimerov dlia
bionicheskoi nanoelectroniki. II. Molekuliarnaia vektornaia mashina i struktura kanonicheskogo nabora
phisicheskikh operatorov (Model of topological coding of chain polymers for bionical nanoelectronics.
II. The molecular vector machine and structure of the canonical set of physical operators. In Russian),
Biotekhnosfera, No.2, pp. 6 – 12.
Knight R.D., Freeland S.J. and Landweber L.F. (1999) Selection, history and chemistry: three faces of the
genetic code, Trends Biochem. Sci. B24, 241-247.
Kosiol C., Goldman N. and Buttimore N.H. (2004) A new criterion and method for amino acid classification,
J. Theor. Biol., B228, 97-106.
Kramer G., Boehringer D., Ban N. and Bukau B. (2009) The ribosome as a platform for co-translational
processing, folding and targeting of newly synthesized proteins, Nat. Struct. Mol. Biol. B16, 589-97.
Mekler, L.B. and Idlis, R.G. (1993) Obschii stereokhimicheskii geneticheskii cod – put k biotechnologii i
universalnoi medizine XXI veka uzhe segodnia [General stereochemical genetic code – towards
biotechnology and universal medicine of the XXI century, in Russian], Priroda, No.5, 29-63.
Pauling L. (1960) The Nature of the Chemical Bond, 3rd ed.,Ithaca:Cornell Univ.Press, 644 pp.
Rumer, Yu.B. (1968). Sistematizacija kodonov v geneticheskom code [Systematization of codons in the
genetic code, in Russian], Dokl. Acad. Nauk SSSR B183, 225 – 226.
Schulz, G.E. and Schirmer, R.H. (1979) Principles of Protein Structure. New York: Springer-Verlag, 354 pp.
Siemion I.Z., Cebrat M. and Kluczyk A. (2004) The problem of amino acid complementarity and antisense
peptides, Curr. Protein Pept. Sci. B5, 507-527.
Taylor W.R. (1986) The classification of amino acid conservation, J. Theor. Biol. B119, 205-218.
AIMS AND SCOPE

SYMMETRY: CULTURE AND SCIENCE provides an interdisciplinary forum for
representatives of the various fields of art, science, and technology. According to its
established tradition, it publishes papers by scientists addressed to their colleagues
active in other disciplines, or even in different fields of the arts; and also papers by
artists addressed to the representatives of the sciences and diverse fields of technology.
Symmetry appears in articles of the various disciplinary and art periodicals, however
those tend not to reach scholars in other fields of study. The journal SYMMETRY aims at
conveying to them knowledge, methods, and novelties which are applicable to their
main fields of interest and creative work. Its basic goal is building bridges between
various fields of the arts and sciences, between various disciplines, and between
different cultures.
Symmetry is suitable for such a bridging function. It is a concept, a phenomenon, a
class of properties, and a method. It is present in almost all disciplines and fields of art
and technology. As a concept, it has roots in both science and art. As a phenomenon,
symmetry or its lack is present in all fields of art, science, and technology. Finally,
properties and methods, based on the application and the investigation of symmetry
(and symmetry breaking) are transferred from one field to another.
Symmetry is understood here in a broad sense, and approach to its study will be referred
to as symmetrology. In contrast to the common geometric concept, one can speak about
a more general scientific meaning of symmetry if: (i) under any kind of transformation
(operation), (ii) at least one property, (iii) of an object is left invariant (intact). This
generalised concept of symmetry makes possible the application of symmetry to both
animate and inanimate material objects, as well as to products of our mind. In addition
to geometric (morphological) symmetries (such as reflection, rotation, translation, etc.),
the scope of the journal covers functional symmetries and asymmetries (e.g., in the
human brain), gauge symmetries (of physical phenomena), and properties, like color,
tone, shading, weight, and so on (of artistic objects). The journal focuses not only on the
concept of symmetry, but also on its associates (asymmetry, dissymmetry, and
antisymmetry) and related concepts (such as proportion, harmony, rhythm, and
invariance) in an interdisciplinary and intercultural context.
SYMMETRY publishes original papers on symmetry and related questions which present
new results, or new connections between known results. The papers are addressed to a
broad non-specialist public, without becoming too general, and have an interdisciplinary
character in any of the following senses:
(1) they describe concrete interdisciplinary ‘bridges’ between different fields of art,
science, and technology using the concept or related to the phenomenon of symmetry;
(2) they survey the importance of the application of symmetry (antisymmetry, etc.) in
a concrete field with an emphasis on possible ‘bridges’ to other fields.
The journal also has a special interest in historic and educational questions, as well as in
symmetry-related methods and processes.
