
Bioinformatics

Secondary article

David A Adler, Zymo Genetics Inc. and Department of Pathology, University of Washington, Seattle, Washington, USA
Darrell Conklin, Zymo Genetics Inc., Seattle, Washington, USA

Article Contents
. Introduction
. Scope of Bioinformatics
. Hardware
. Software
. Mapping and Linkage Analysis
. Biosequence Analysis
. Conclusion

Bioinformatics is a discipline at the intersection of computer science, information
technology, mathematics and biology, and includes the study and practice of archiving,
searching, displaying, manipulating and modelling biological data. Bioinformatics is
applied, for example, in the construction of genetic and physical maps of genomes;
nucleotide and amino acid sequence analysis; gene discovery and the prediction of protein
structures.

Introduction
Advances in classical as well as modern biology have often
been achieved by individuals presenting novel perspectives
on previously available observational information. The
elucidation of the genetic code following the publication of
Watson and Crick's model for the structure and replication
of DNA, along with the subsequent codification of the
central dogma of molecular biology (DNA is transcribed
into RNA, which in turn is translated into protein),
exemplifies the concept of biomolecules as information
carriers. This view leads naturally to the application of
computational approaches to the analysis of DNA and
protein sequence. In addition, the development of
high-throughput technologies for generating biological and
biochemical data has contributed to a data explosion,
thereby increasing the difficulty of simply examining all
data pertinent to a biological question. The need to
retrieve, organize and digest very large databases requires
the development of computational tools for data interaction
and analysis. Bioinformatics is a discipline at the
intersection of computer science, information technology,
mathematics and biology and includes the study and
practice of archiving, searching, displaying, manipulating
and modelling biological data. Bioinformatics research
and development not only provides discovery tools for
other biologists but is also making direct intellectual
contributions to biology and medicine.
Bioinformatics is alternatively referred to as biocomputing
or computational biology, the choice of term depending on
the focus of activity. The practitioner may have a
background emphasizing any of the composite fields of
study, and it is only recently that colleges and universities
have developed interdisciplinary programmes with the
goal of training bioinformatics professionals. An essential
part of the infrastructure of bioinformatics is a
communications medium with fast data transfer rates and high
traffic capacity to provide almost simultaneous information
access to thousands of people. In the late twentieth
century the internet became that medium, and the principal
interface is now the web browser. In just a few years the
World Wide Web has taken over the majority of internet
traffic, and the web's hypertext, hypergraphic presentation
and interface have become the primary means of exploring a
vast knowledge base of biological information.
The international Human Genome Project, with the goal
of determining the three billion base pairs of human genetic
information, has resulted in an exponential growth of
genetic data and is driving the rapid growth and maturation
of the discipline of bioinformatics. Moving from the
level of the amino acid sequence encoded in a gene to protein
structure and its associated function touches upon central
questions in biology. Computational approaches are now
making important contributions to our understanding of
the relationship between the structure and function of
biomolecules and their roles in biological processes. This article
introduces some of the concepts, methods and tools of
biocomputing and describes a few examples of areas of
inquiry in bioinformatics.

Scope of Bioinformatics
Bioinformatics encompasses the study of a broad range
of biological data including gene maps, gene and
protein sequences and gene expression profiles. A
primary goal of this data analysis is to unravel
the information content of biomolecules
and to understand how bioinformation directs the
development and function of living organisms. The
analysis of nucleic acid sequence, protein structure/
function relationships, genome organization, regulation
of gene expression, interaction of proteins and mechanisms
of physiological functions can all benefit from a
bioinformatics approach. Nucleic acid and protein sequence data
from many different species and from population samplings
provide a foundation for studies leading to new
understandings of evolution and the natural history of
humans.

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net


Access to the published scientific literature is another
important aspect of bioinformatics and, in the realms of
biotechnology and biomedicine in particular, this includes
invention information contained in patent archives.
Library professionals are now more likely to be found
searching digital databases than the shelves and have
become digital information access experts using a variety
of sophisticated search interfaces. One of the largest
databases in the world is Medline, a vast archive of
biological and biomedical journal references that covers
the period from 1965 to the present. Medline's indices
provide rapid searching of title, author, keywords and
terms, citation and often abstracts of articles. Recently, an
important technological promise has begun to be fulfilled
by delivering, almost instantly, to the desktop, journal
articles, complete with full text and high-resolution
graphics.
Bioinformatics researchers, computer scientists and
information specialists are also working on the conceptual
foundations for the next generation of knowledge
navigators. New hardware and software developments will break
out beyond the limitations of our current windows on the
world of biology and provide tools of discovery in the next
century. The technological innovations in bioinformatics
are accompanied by ethical concerns, particularly
pertaining to data repositories of identifiable genetic information.
The ethical implications of advances in bioinformatics
necessitate investigation, and efforts on the part of
scientists, lawmakers and the public are required to ensure
the privacy of individuals.

Hardware
The computer is the basic tool of bioinformatics, utilized to
store, display and analyse data, and to design and construct
scientific models and simulations. Computer hardware
requirements are commonly dictated by the tasks needing
to be accomplished, the software available to do the job,
the computational intensity of the process and the degree
of interactivity desired. Modern personal computers have a
higher performance than the supercomputers of two
decades ago, so that sophisticated programs and
complex, interactive software can be run on the desktop. For jobs
that require more computational capacity, such as rapid
searches of very large databases, protein modelling,
three-dimensional display and simulating the interaction of large
molecules, it may be necessary to employ current
supercomputer-class machines, which provide high performance by
harnessing multiple central processing units (CPUs).
Hardware solutions (algorithms in silicon) designed to
perform a single computational task, such as extremely fast
searching, have also been developed. However, the high
cost of custom hardware for specific computational tasks
has limited its widespread application.

Software
Software for bioinformatics is as task-driven as the
hardware. Data descriptions, the types of searches and
analysis, how one needs to interact with the computer
(interface), and how the results are presented will all
determine the choice of software for the task at hand.
Programs for the analysis of genetic and physical mapping
data and for drawing pedigrees and evolutionary trees are
available from both commercial and academic sources.
Sequence analysis suites generally include programs for
assembling sequences, pattern or string searching,
restriction analysis, motif identification, base or amino acid
composition analysis and protein characterization. There
are also individual programs for particular tasks such as
multiple sequence alignment, for example Clustal W (see
Table 1), and for similarity searches of databases, such as
BLAST (Table 1). Institutions and schools have often
obtained site licences for software packages, which are
then made accessible for use on networked desktop
computers. Software for a wide variety of computational
tasks, developed at academic institutions,
is often freely available for download via the internet.
Unfortunately sites come and go on the internet, so it is
difficult to maintain lists of resources with associated links.
Instead of presenting a comprehensive list that will become
obsolete almost instantly, the reader is referred to several
stable, well-maintained sites as starting points for finding
biology-related software and documentation on the
internet (Table 1).
A web browser has become another necessary
bioinformatics tool, since the web medium is often the easiest
means of accessing data from remote networked
databases. Particularly for very rapidly growing map and
sequence databases, it is not practical or appropriate to try
to maintain a local copy of the data. To ensure the accuracy
and timeliness of information from databases that are
updated daily it is necessary to be able to access those sites
directly. The search interfaces provided for interacting
with the major data repositories are powerful and fast,
delivering responses within seconds. Network server
software often reports results in hypertext format, facilitating
the further investigation of details and related information
from other databases. Regardless of the particular
software one chooses for a task, it is important to know the
program well enough to use it efficiently, to maximize its
utility and to evaluate the significance of computational
results.

Mapping and Linkage Analysis


The three billion base pairs comprising the human genome
are distributed among 24 different linear DNA molecules,
which in turn are packaged into individual chromosomes
(autosomes 1–22 and the sex chromosomes, X and Y). It is
predicted, very approximately, that there are 100 000 genes
and that each has a distinct location on one of the human
chromosomes. The process of assigning genes and DNA
fragments to locations on particular chromosomes is called
mapping.

Table 1 Starting points for exploring bioinformatics on the web

Site | Description | URLs
Clustal | Multiple sequence alignment software (about ClustalW) | http://bioinformer.ebi.ac.uk/newsletter/archives/2/clustalw17.html
BLAST servers: NCBI (USA) | Sequence database similarity searching | http://www.ncbi.nlm.nih.gov/blast
BLAST servers: EMBL (Germany) | Sequence database similarity searching | http://dove.embl-heidelberg.de/Blast2
NIH software | Software repository and well-maintained link lists | http://molbio.info.nih.gov/molbio/software.html
CMS Molecular Biology Resource | Software repository and well-maintained link lists | http://www.sdsc.edu/ResTools/cmshp.html
EXPASY SIB | Software repository and well-maintained link lists | http://www.expasy.ch/
Weizmann Institute of Science | Software repository and well-maintained link lists | http://bioinformatics.weizmann.ac.il/mb/software.html
Biology Department, Indiana University | Software repository and well-maintained link lists | http://www.bio.indiana.edu/generalinfo/bioresearch.html
The Laboratory of Statistical Genetics, Rockefeller University | Genetic mapping background and resources | http://linkage.rockefeller.edu
WIBR Mapmaker | Mapping software distribution | http://waldo.wi.mit.edu/ftp/distribution/software/mapmaker3
Centre d'Etude du Polymorphisme Humain (CEPH) | Human mapping resources | http://www.cephb.fr
Mouse Genome Database, Jackson Laboratory | Mouse mapping and informatics | http://www.informatics.jax.org
EUCIB | European Collaborative Interspecific Backcross mouse genetic mapping | http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html
Radiation Mapping (EBI, Stanford) | Radiation hybrid mapping tools and resources | http://www.ebi.ac.uk/RHdb ; http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper
Genome Database | Human gene mapping database | http://www.gdb.org
OMIM | Catalogue of human genes and genetic disorders | http://www.ncbi.nlm.nih.gov/omim
PDB | Protein Data Bank protein structure resource | http://www.rcsb.org/pdb
MapManager | Software suite for genetic mapping projects | http://mcbio.med.buffalo.edu/mapmgr.html
Gene maps are of two primary types, genetic and
physical. Genetic maps, determined by family studies in
humans and defined crosses of laboratory organisms such
as mice, provide the chromosomal assignment of a gene
and its position relative to other genetic markers. Synteny
is the term for two genes being on the same chromosome;
the genes are linked if they are sufficiently close on the
chromosome that they appear to be inherited together. The
determination of genetic map distance between genes is
referred to as linkage analysis. Linkage can only be
determined for polymorphic genes, those that have two
or more distinguishable alleles in a population. Genetic
map distances are calculated by assessing the frequency of
recombination between two polymorphic loci on a
chromosome and are usually expressed in units called
morgans (0.01 morgans, or one centimorgan (cM), is equal
to 1.0% recombination). For a simple two-point cross, the
longest genetic distance that can be measured is 50 cM, or
50% recombination, since genes further apart will appear
to act like unlinked genes. However, by combining data
from multiple loci, complete chromosome maps can be
deduced. A general principle of gene mapping is that the
closer two loci are, the less likely it is that a recombinational event,
or breakage in the case of several physical-mapping
methods, will occur between them. Since genetic mapping
is dependent upon probabilistic phenomena, statistical
methods are necessary for the calculation of map distances
based on actual observations of inheritance. The
significance of linkage data is typically reported as a prediction
probability and expressed as a logarithmic odds ratio or
LOD (logarithm of the odds) score. As a practical and
general rule a LOD score of 3 or greater between two
genetic markers is considered significant evidence for
linkage. Examples of software for the analysis of genetic
linkage data and the calculation of linkage and mapping
distances are available (see, for example, The Laboratory
of Statistical Genetics at Rockefeller University, and
WIBR Mapmaker; Table 1).
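
The relationship between recombination frequency and the LOD score can be made concrete with a minimal sketch. It assumes the simplest possible case, which the article does not spell out: fully informative, phase-known meioses in which each offspring can be scored directly as recombinant or non-recombinant. Under that assumption the likelihood of a recombination fraction theta is theta^R (1 - theta)^(N - R), and the LOD score compares it with the likelihood for unlinked loci (theta = 0.5). The function names are hypothetical and are not taken from any of the packages in Table 1.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score: log10 of L(theta) / L(0.5).

    Assumes phase-known, fully informative meioses (an illustrative
    simplification; real pedigree analysis is considerably more involved).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1.0 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def best_lod(recombinants: int, meioses: int) -> tuple[float, float]:
    """Scan theta = 0.01 ... 0.50 and return (maximum LOD, theta at maximum)."""
    grid = [i / 100 for i in range(1, 51)]
    return max((lod_score(recombinants, meioses, t), t) for t in grid)


if __name__ == "__main__":
    lod, theta = best_lod(recombinants=2, meioses=20)
    # theta * 100 is the map distance in cM only for short distances.
    print(f"max LOD = {lod:.2f} at theta = {theta:.2f} (about {theta * 100:.0f} cM)")
```

With two recombinants observed in 20 meioses, the maximum LOD exceeds the conventional threshold of 3 described above, whereas the same 10% recombination observed in only 10 meioses would not.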
It is obviously not practical to control matings in human
populations, so human genetic maps can only be elucidated
by following the segregation of traits, or genetic markers,
in family studies. Such methods have been widely used in
the assignment of inherited disease loci to particular
chromosomes and chromosomal regions but usually
require pedigrees of large families with multiple
generations. Investigators utilizing samples from large family
resources, such as at the Centre d'Etude du Polymorphisme
Humain (CEPH; see Table 1), have made major
contributions to the present density of the human genetic map. The
advent of recombinant nucleic acid technology, with the
ability to clone and visualize particular small fragments of
DNA, and the identification of simple sequence repeat
polymorphisms in human populations have further
expanded the analysis of the large family repositories,
thereby contributing to the density of human genetic maps.
In the laboratory mouse, with a long history of genetic
analysis, genetic maps have been constructed over the years
by following the segregation of alleles in experimental
matings between well-characterized inbred strains. Genetic
heterogeneity of 0.3–1.0% between the laboratory mouse
and the interfertile species, Mus spretus, has been exploited
to produce high-resolution mouse genetic maps.
MapManager (Table 1) is a software package for tracking results of
genetic crosses, calculating map distances and generating
chromosome maps. Starting points for exploring mouse
gene mapping data are Mouse Genome Informatics, from
the Jackson Laboratory, and EUCIB (European Collaborative
Interspecific Mouse Backcross); see Table 1.
Physical maps exploit techniques of molecular and
cellular biology to localize genes and other markers
without the need for family studies or genetic crosses,
and do not require polymorphic genes. Methods of somatic
cell genetics, DNA hybridization and the polymerase chain
reaction (PCR) have provided a streamlined approach to
human gene mapping. Cytogenetic techniques combined
with nucleic acid hybridization provide the only direct
means of localizing genes on chromosomes. Nucleic acid
probes, labelled with fluorescent dyes, are allowed to bind
to their complementary sequences on spread chromosomes
and are detected by fluorescence microscopy. A trained
cytogeneticist, who is able to visually identify each
individual human chromosome, must assess the precise
position of the fluorescent label. This technique, fluorescence
in situ hybridization, or FISH, usually applied to
condensed metaphase chromosomes, has now been
extended to interphase chromatin. Interphase FISH using
two differentially labelled probes has been used to measure
distances between genes in the range of about 2 megabases
down to about 50 kilobases.
Other than FISH, physical mapping methods localize
genes relative to previously mapped markers. The
incidence of random breakage events between markers and
the occurrence of concordant cloning of two genes are both
used to estimate physical distance between loci. Human
cells, irradiated with X-rays to induce chromosome
breakage, can be fused with normal rodent (mouse or
hamster) cells, producing hybrid cell lines. Each individual cell
line will retain only one or a few fragments of human
chromosomal material. The presence of human DNA is
usually detected by PCR assay of an STS (sequence tagged
site: a fragment of genomic DNA that can be uniquely
amplified). The detection of two STSs in the same hybrid
cell line is evidence that the two associated DNA fragments
reside on the same chromosomal fragment. Statistical
analysis of data from assaying many different hybrid cell
lines can estimate the distance between genes and, with a
sufficient number of cell lines, a map of the human genome
can be generated. Once the map framework is created, the
panel of radiation hybrid cell lines can be used to map new
loci. New genes are mapped by assaying each cell line in the
panel for the presence of the new locus (an STS) and then
evaluating the concordance of PCR-positive cell lines with
previous mapping data. Information on the Genebridge
panel (93 cell lines) can be found at the EBI and
Stanford University (Table 1). Software for generating
maps from radiation hybrid scoring data is available from
WIBR (Table 1).
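
The concordance test described above can be illustrated with a deliberately crude sketch. It assumes that each STS has been scored across the panel as a string with one character per hybrid cell line ('1' for PCR-positive, '0' for negative, any other character for an ambiguous assay); this encoding, the function names and the toy data are assumptions made for illustration. Counting discordant cell lines is only a first approximation: dedicated radiation hybrid software uses statistical models of breakage and retention rather than a raw count.

```python
def discordance(scores_a: str, scores_b: str) -> float:
    """Fraction of informative cell lines in which exactly one STS is detected."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both markers must be scored on the same panel")
    informative = [(a, b) for a, b in zip(scores_a, scores_b)
                   if a in "01" and b in "01"]
    if not informative:
        raise ValueError("no cell line was scored unambiguously for both markers")
    discordant = sum(a != b for a, b in informative)
    return discordant / len(informative)


def nearest_framework_marker(new_locus: str, framework: dict[str, str]) -> str:
    """Name of the framework marker whose retention pattern best matches the new locus."""
    return min(framework, key=lambda name: discordance(new_locus, framework[name]))


if __name__ == "__main__":
    # Hypothetical ten-line panel; real panels such as Genebridge are larger.
    framework = {"STS_A": "1100101000", "STS_B": "0011010010"}
    new_locus = "1100100000"
    for name, scores in framework.items():
        print(name, discordance(new_locus, scores))
    print("closest framework marker:", nearest_framework_marker(new_locus, framework))
```

A low discordance suggests that breaks rarely separate the two markers, and therefore that they lie close together; a discordance no lower than that expected for unrelated markers carries no positional information.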
Content mapping is based on recombinant techniques
used to clone DNA fragments into various vectors. The
variety of vectors can be grouped by the size of foreign
DNA insert they can carry. Commonly used cloning
vectors include phage, plasmids, cosmids, BACs (bacterial
artificial chromosomes) and YACs (yeast artificial
chromosomes). If genomic DNA is randomly broken into
appropriately sized pieces and then packaged into one of
these vectors, then individual clones can be analysed to
determine map distance relationships. If, for example,
PCR assays for two genes are both positive in a single
recombinant clone then the two genes must be no further
apart than the size of the insert DNA. Thus the resolution
of this type of STS content mapping is dependent on the
particular vector chosen. Collections of characterized
cloned material are also amenable to creating overlapping
contigs along an entire chromosome (see, for example,
Foote et al., 1992). Reconciliation and integration of maps
derived by different methods, particularly the combining of
physical and genetic maps, contribute to increasing the
accuracy and resolution of mapping data. The
correspondence of different map units can also be approximated
from integrated maps; for example, 1 cM is roughly
equivalent to 1–2 megabases of DNA and similar in size
to a small cytogenetic band. Integrated maps
can be searched and viewed online, for example the CEPH-Genethon
integrated map (see CEPH in Table 1). Comparisons of the gene maps of
different species have proven valuable in evolutionary
studies as well as in the identification of human disease genes.
The functional significance of the conservation of genome
arrangement as evidenced by, for example, the homeotic
genes maintained evolutionarily as clusters, a conserved
linkage from fruitflies to humans, remains unclear.
The development, refinement and application of all these
mapping technologies have produced dense maps of entire
genomes. Human gene maps of individual chromosomes
that could once be reported in graphic form on a single
sheet of paper can now only be displayed with computer
technology due to the exponential increase in the number
of localized markers. The availability of dense gene maps
also greatly facilitates positional cloning of disease loci.
Positional cloning refers to a commonly used strategy that
starts from knowing only the approximate location of a
gene and progressively narrows the critical region until
mutations in a single gene are shown to be associated with
the phenotype. There are many examples of the cloning of
genes for human inherited diseases using this approach, including
Huntington disease, Duchenne muscular dystrophy and
cystic fibrosis. Dense maps also provide the foundation for
the realization of the ultimate map, the complete genome
sequence. In order to ensure the value and accessibility of
mapping data it is essential to maintain authoritative
repositories. The ability to search and display this
information is essential and is thus another important
aspect of bioinformatics.

Biosequence Analysis
Development of the technologies to determine the linear
sequence of amino acids in proteins and of nucleotides in
DNA and RNA creates the need to compile
and analyse sequence data. Sequence analysis is the
process of investigating the information content of raw
linear nucleic acid and protein sequence data.

Nucleic acid sequence analysis

The bulk of genomic DNA does not code for protein, and
the protein-coding regions of human genes are not
contiguous but are arranged as exons interspersed with
introns. Therefore an important question for
computational biology is how to detect protein-coding regions
within genomic DNA. Other common tasks include
translating DNA into protein, assembling partially
overlapping fragments, analysing sequences, comparing
sequences, and DNA motif discovery and recognition.
Current DNA sequencing technologies are not capable of
generating complete sequence for long nucleic acid
molecules in a single sequencing run and so it is necessary
to utilize computational methods to assemble contiguous
sequences from individual short sequence determinations.
If a large DNA molecule is randomly broken into smaller
pieces for the actual sequence determinations then a
contiguous linear sequence can be reconstructed by
aligning the overlapping portions from different random
fragments.
A common question arising when new genes are cloned
and sequenced is whether the sequence is already known or
does not occur in current databases. Answering this
question requires comparing the newly obtained sequence
to every sequence in the database. The algorithm of choice
for this task is the extremely rapid BLASTN algorithm
(Altschul et al., 1990). A list of all W-mers (contiguous
fragments of length W, which is typically set between 11
and 16) in the query sequence is first compiled and then
every sequence in the database is in turn checked against
this list. This can be done rapidly and serves to rule out
most sequences from consideration. The matching regions
that remain are then extended in either direction, using less
stringent matching, to form HSPs (high-scoring segment pairs). The
expectation value of the HSP (the probability that an HSP of a
similar score will occur between two random sequences) is
computed and all database sequences having significant
HSPs are reported. Overall database access time by
BLASTN is minimized by using a compressed form of
the nucleotide data and by using a memory-mapped file. It
is an algorithm highly amenable to parallelism and can be
compiled to run on multiprocessor hardware.

Amino acid sequence analysis

Proteins, the linear chains of amino acids produced by gene
translation, are found in cells folded into functionally
active structures. It is thought that the primary sequence of
the protein determines the ultimate conformation of the
protein and therefore its biological function. However, the
flexibility of long-chain polypeptides can generate an
infinite number of shapes and the computational task of
predicting correct structures is beyond the reach of current
knowledge. Predicting the shape of a protein from its linear
amino acid sequence is one of the holy grails of
computational biology. Solving the protein-folding
problem holds the promise of spawning major advances in
molecular biology, pharmacology and the treatment of
disease.
An indispensable resource for the bioinformatics
scientist is the Protein Data Bank (PDB; Table 1), which is a
repository of solved protein structures, that is, mappings of
each atom in a protein onto three-dimensional space. This
is done using X-ray crystallography; the diffraction pattern
of molecular crystals is interpreted to create maps of
electron density. This technique is very time consuming
and requires the availability of protein crystals. For this
reason, the number of known protein sequences vastly
exceeds the number of sequences with solved structures in
the PDB.

Secondary structure prediction


In a classic paper, Levitt and Chothia (1976) proposed a
classification of protein structures into four different
structural classes: alpha, beta, alpha/beta, and
alpha + beta. Although there is a predictive relationship between
the amino acid composition of a protein and its class
(Chou, 1995), a prediction of protein class is in most
circumstances too broad to be of general use. Knowing, for
example, that a protein is in the alpha class does not
distinguish among a globin, an annexin and an interferon, all
alpha class proteins with different topologies (numbers,
lengths and connectivities of helices) and therefore very
different biochemical functions. The computational
technique that can provide the scientist with information on
protein topology is called secondary structure prediction.
A secondary structure prediction is simply an assignment
of a secondary structure state to every amino acid of the
query protein sequence.
Most secondary structure prediction algorithms derive
their models using proteins in the PDB, modelling
observed relationships between short pieces of contiguous
sequence and secondary structure. Such relationships are
not specific; that is, near identical short stretches of
sequence can have different secondary structures.
Furthermore, the PDB is not large enough to contain sufficient
statistics on longer, more specific stretches of sequence that
may be encountered in a query sequence. Currently the best
secondary structure prediction algorithms (Rost and
Sander, 1993) achieve an accuracy of only about 70%
(the number of secondary structure states correctly
assigned divided by the number of amino acids in the
protein). It is widely agreed that this figure is close to an
upper bound on current methods, because none of them
can adequately model long-range interactions in the
protein sequence.
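
The accuracy measure defined in the parentheses above is straightforward to compute. The sketch below assumes a three-state encoding with one character per residue (H for helix, E for strand, C for everything else); the encoding, the function name and the example strings are illustrative assumptions rather than the output of any particular prediction program.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Number of correctly assigned states divided by the number of residues."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


if __name__ == "__main__":
    observed = "HHHCCCEEEC"   # hypothetical states taken from a solved structure
    predicted = "HHHHCCEEEE"  # hypothetical output of a prediction method
    print(f"accuracy = {per_residue_accuracy(predicted, observed):.0%}")
```

Because the measure counts residues independently, a prediction can score well while still misplacing the ends of helices and strands, which is one reason a figure of about 70% says little about overall topology.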

Comparative modelling
Through the ages the human genome has been the target of
major evolutionary processes such as gene duplication,
gene fusion, gene rearrangement and gene deletion. The
individual gene has been subjected to the more subtle
process of base mutations that often change the protein
sequence of the gene product. Genes have evolved
substantially while still preserving the three-dimensional
structure of their protein. This is because mutations that
substantially alter a protein fold will destroy the normal
function of the protein, and will not persist through
generations. Furthermore, amino acids with similar
hydropathic properties can often be substituted for one
another in a protein without appreciably changing its
structure.
Therefore, the first step in predicting the fold of a new
protein is to determine whether it is evolutionarily related
to some sequence in the PDB. The technique for testing two
protein sequences for an evolutionary relationship is
pairwise alignment using a dynamic programming
algorithm (Needleman and Wunsch, 1970; Smith and
Waterman, 1981). It has become abundantly clear that when two
sequences have a high percentage of identical amino acids
at aligned positions, they will tend to have very similar
folds. Sander and Schneider (1991) quantified the
relationship between percentage identity, alignment length and
structural similarity by studying proteins of known
structure in the PDB. Roughly stated, when a pair of
sequences has at least 25% identity over at least 80 amino
acids, there is a high probability that the sequences have the
same structure in the aligned region. It must be stressed
that the converse of this implication is not true: due to
evolutionary divergence, two related sequences may not
have a significant alignment. There may be a high
probability of finding an alignment of equal or higher
score between two random sequences.
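
A minimal sketch of global pairwise alignment by dynamic programming, in the spirit of Needleman and Wunsch (1970), is given here for illustration. It makes simplifying assumptions that should be kept in mind: a flat score for matches and mismatches instead of an amino acid comparison matrix, a linear gap penalty, and percentage identity computed over all alignment columns including gaps. The parameter values and function names are arbitrary choices for the example.

```python
def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical aligned residues as a percentage of all alignment columns."""
    identical = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    total, top, bottom = global_alignment("ACGT", "AGT")
    print(total, top, bottom, percent_identity(top, bottom))
```

The nested loops fill one cell for every pair of positions, so the work grows with the product of the two sequence lengths. Whether a given percentage identity is meaningful still depends on the alignment length, as the Sander and Schneider threshold quoted above makes clear.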
Dynamic programming algorithms, unless implemented
in massively parallel hardware, are too slow for interactive
application to very large databases. This is because they
require, for every database sequence, the computation of
every cell in a score matrix, the total number of cells being
equal to the product of the query and subject sequence
lengths. Several clever algorithms have been devised to
avoid the computation of the full score matrix; two popular
methods are FASTA and BLASTP. FASTA initially
computes a hash table containing all k-tuples (peptides of
length k) in the query sequence. A target sequence can be
tested very rapidly for the presence of these k-tuples. The
second step of the FASTA algorithm performs a limited
computation of the full score matrix, only in regions which
join selected k-tuples. The BLASTP method initially
compiles a finite state machine, employing an extremely
rapid technique from computer science for finding
common substrings in sequences. Using an amino acid
comparison matrix, all W-mer peptides (W is typically set at
2 or 3) that could possibly attain a score greater than a
threshold score to any W-mer in the query are placed into
the machine. The value of this threshold depends on the
comparison matrix and on other parameters of the
algorithm. Each region of a database sequence that reaches
the final state of the machine is then extended in either
direction to form HSPs. Sequences with statistically
significant HSPs are reported to the scientist. Recent
extensions to BLASTP permitting gaps in the alignment
now make BLASTP and FASTA nearly indistinguishable
in terms of search sensitivity.
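
The first step of the FASTA approach described above, a lookup table of query k-tuples that is then probed with each target sequence, can be sketched in a few lines. This is an illustration of the general idea only, under stated assumptions: it finds exact k-tuple matches and tallies them by diagonal (target position minus query position), and it omits both the limited score-matrix computation that FASTA performs afterwards and the scored neighbourhood words and finite state machine used by BLASTP. Function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_ktuple_index(query: str, k: int = 2) -> dict[str, list[int]]:
    """Map every k-tuple in the query to the positions at which it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for position in range(len(query) - k + 1):
        index[query[position:position + k]].append(position)
    return index


def diagonal_hits(index: dict[str, list[int]], target: str, k: int = 2) -> dict[int, int]:
    """Count shared k-tuples on each diagonal (target offset minus query offset)."""
    counts: dict[int, int] = defaultdict(int)
    for t_pos in range(len(target) - k + 1):
        # .get avoids adding empty entries to the index for unseen k-tuples.
        for q_pos in index.get(target[t_pos:t_pos + k], ()):
            counts[t_pos - q_pos] += 1
    return dict(counts)


if __name__ == "__main__":
    query = "MKVLAAGIV"
    index = build_ktuple_index(query, k=2)
    print(diagonal_hits(index, "XXMKVLAAG", k=2))
```

Each target sequence is examined in a single pass, in time roughly proportional to its length, and only diagonals that collect many hits need to be passed to the more expensive alignment step. This is how the full score matrix is avoided for the great majority of database sequences.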
Protein sequences related by ancient evolutionary events
of gene duplication are said to form a family of sequences.
To accurately delineate a protein family, access to all
ancestral sequences would be needed, which is not possible.
However, it is a convenient fact that protein families are
transitive; if sequence A is related to B, and B is related to
C, it can be inferred that A is related to C, even though A
and C might not have a significant alignment. This fact is
applied routinely to infer the structure of new protein
sequences.
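
The transitive inference described above can be sketched as follows, assuming a list of sequence pairs already judged to be significantly related is available. The function name and the toy identifiers are hypothetical. The sketch uses a union-find structure to merge sequences into families, so that A and C end up together even though only the pairs A-B and B-C were observed.

```python
def group_families(sequence_ids, related_pairs):
    """Group sequences into families by transitive closure of pairwise relations."""
    parent = {s: s for s in sequence_ids}

    def find(s):
        # Follow parent links to the family representative, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for s in sequence_ids:
        families.setdefault(find(s), set()).add(s)
    return sorted(sorted(members) for members in families.values())


# A and C have no direct significant alignment, but both are related to B.
families = group_families(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(families)
```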
Comparative modelling encompasses more than the task
of finding a structurally similar protein. After backbone
coordinates of ungapped regions of the alignment are
transferred onto the target sequence, a full atomic model is
developed. This involves assigning coordinates to gapped
loop regions and to residue side chains. The rough model
can then be improved by energy minimization techniques.

Sequence motifs
Even if a protein family is divergent, it may be possible
to identify short regions that appear to have conserved
sequences and therefore locally conserved structure and
perhaps biochemical function. Each region can be
described using a motif that states, for each position,
the allowed variation in possible amino acids using a
distinct score for each. Dynamic programming algorithms
are used to align motifs to sequences. Motifs can be
viewed as compact expressions for a protein family, an
alternative to representing the family as a list of its
members. Furthermore, a match to a motif may not
be statistically significant yet may still be biologically
significant, because higher scores may not occur when the
motif is applied to other protein families. If a
discriminating motif matches a sequence of unknown
structure it can be inferred that the sequence has the same
protein fold as the family.
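
A motif with a distinct score for each amino acid at each position can be represented as one score table per position. The Python sketch below is a simplification: it scans a sequence for the best ungapped match rather than using the dynamic programming alignment mentioned above. The motif values, the default score for unlisted residues and the function name are all hypothetical.

```python
def scan_motif(motif, sequence, default=-2.0):
    """Return (best_score, start) for the best ungapped match of the motif."""
    width = len(motif)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the position-specific score of each residue in the window.
        score = sum(column.get(residue, default)
                    for column, residue in zip(motif, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical three-position motif: one score table per motif position.
toy_motif = [
    {"C": 4.0},
    {"A": 1.0, "G": 1.0, "S": 1.0},
    {"H": 3.0, "K": 1.0},
]
best_score, best_start = scan_motif(toy_motif, "MKCGHLLCAK")
print(best_score, best_start)
```

Whether a given score indicates family membership depends on how the same motif scores against unrelated families, as noted above; the sketch does not address that calibration.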
The computational techniques used to create motifs
fall into four classes. The standard technique creates
the motif using variation observed in columns of a multiple
sequence alignment of the family. There are several ways
to compute motifs from a multiple alignment (Gribskov
et al., 1987; Tatusov et al., 1994). Other techniques are
machine learning algorithms that attempt to create motifs
without requiring a multiple alignment of the family
(Brazma et al., 1998). The hidden Markov model
techniques (Krogh et al., 1994) try to fit available data to
a sequence of probability distributions using a local
optimization algorithm. Finally, some techniques are
iterative algorithms, which generalize a motif by repeated
searches of a sequence database (Tatusov et al., 1994) using
the evolving motif.
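
As an illustration of the first class, the sketch below derives per-column scores from a multiple alignment as log-odds of observed residue frequencies against a background distribution. It is one simple possibility, not the method of any of the cited papers. The pseudocount, the four-letter toy alphabet, the uniform background and the function name are assumptions; a real profile would cover all twenty amino acids and would usually weight the sequences.

```python
import math
from collections import Counter


def column_log_odds(alignment, background, pseudocount=1.0):
    """Build one log-odds score table per column of a multiple alignment."""
    alphabet = sorted(background)
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background[res])
            for res in alphabet
        })
    return profile


# Toy alignment over a hypothetical four-residue alphabet, uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "H": 0.25}
alignment = ["CAH", "CGH", "CAH"]
profile = column_log_odds(alignment, background)
print(profile[0])
```

Conserved columns give a strongly positive score to the conserved residue and negative scores to the rest, while variable columns give flatter scores.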

Fold recognition
Often a new protein sequence contains no recognizable
motifs, nor can its structure be inferred by comparative
modelling. In such cases, one can resort to fold recognition
approaches. The task of fold recognition is easily defined
but notoriously difficult to solve: for a given sequence,
determine which, if any, structures in the PDB are
compatible with the sequence.
Because the function of a protein is determined by
its three-dimensional structure, mutations causing
amino acid changes that grossly alter the structure of
the protein will usually inactivate the protein function and
will be selected against by evolution. It is for this reason
that, despite the vast space of protein sequences explored
by evolution over the ages, there probably exist only
several thousand unique protein topologies (Hubbard
et al., 1992). As the PDB continues to expand with new
solved protein structures, the chance that a new gene
product folds like a known structure will continue to
increase.
Fold recognition by threading is a new, powerful
technique. Threading methods are based upon the
assumptions that protein structures are in a state of
minimum free energy, and that this energy can be roughly
computed for any given structure. The energy computation
takes into account the compatibility of different amino
acids at each position in the structure. This compatibility
usually reflects the preference of hydrophobic amino acids
in the core environment of the protein, and the potential
energy created when two amino acids are spatially close to
one another.
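
The two ingredients of such an energy computation can be sketched in Python. This is a toy function under stated assumptions, not a published potential: the set of residues treated as hydrophobic, the burial weight, the pair energies and the five-position template are all hypothetical. Lower totals indicate better compatibility between the sequence and the template.

```python
# Assumption: a crude set of residues treated as hydrophobic.
HYDROPHOBIC = set("AVLIMFWC")


def threading_energy(sequence, burial, contacts, contact_energy, burial_weight=1.0):
    """Score a sequence placed position by position onto a structural template.

    burial[i] is True if template position i lies in the core.
    contacts lists (i, j) template positions that are spatially close.
    contact_energy maps a frozenset of two residues to a pair energy.
    """
    energy = 0.0
    # Environment term: reward hydrophobic residues in the core, penalize others.
    for residue, buried in zip(sequence, burial):
        if buried:
            energy += -burial_weight if residue in HYDROPHOBIC else burial_weight
    # Pairwise term: add an energy for each pair of spatially close residues.
    for i, j in contacts:
        pair = frozenset((sequence[i], sequence[j]))
        energy += contact_energy.get(pair, 0.0)
    return energy


# Hypothetical template: positions 1, 2 and 4 are buried; 1-4 and 2-4 are in contact.
burial = [False, True, True, False, True]
contacts = [(1, 4), (2, 4)]
contact_energy = {frozenset("LV"): -0.5, frozenset("KD"): -1.0}

good = threading_energy("KLVDL", burial, contacts, contact_energy)
bad = threading_energy("LKDVK", burial, contacts, contact_energy)
print(good, bad)
```

In this toy case the first sequence places hydrophobic residues at all three buried positions and therefore receives the lower energy.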
Given a function that can evaluate the compatibility of
a sequence with a structural template whose native
sequence has been removed, threading algorithms
attempt to minimize this function by considering various
possible sequence-to-structure alignments. The threading
task is enormously complex since exponentially many
(as a function of sequence and structure sizes) alignments
are possible, and the presence of arbitrarily many
pairwise interactions in a protein structure precludes
the use of dynamic programming alignment algorithms
to produce optimal solutions. There are two interesting
heuristic algorithms for obtaining at least a feasible
solution in the face of this complexity. One is the
approach of Jones et al. (1992) which uses a variant
of the standard dynamic programming algorithm.
Another is the statistical sampling approach of Madej
et al. (1995), which iteratively modifies a working
alignment until a local minimum is reached. Both approaches
have had some success in predicting the fold of unknown
proteins, although low selectivity (proteins of different
structure appearing to be compatible) continues to be an
issue.
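
The idea of modifying a working alignment until a local minimum is reached can be illustrated with a deterministic greedy descent. This is a deliberately simplified stand-in, not the statistical sampling procedure of Madej et al. (1995). In the sketch an alignment is reduced to a single integer offset, and the energy table, the neighbour function and the function names are hypothetical.

```python
def greedy_descent(start, neighbours, energy):
    """Move to the lowest-energy neighbour until no neighbour improves the energy."""
    current, current_energy = start, energy(start)
    while True:
        best, best_energy = current, current_energy
        for candidate in neighbours(current):
            candidate_energy = energy(candidate)
            if candidate_energy < best_energy:
                best, best_energy = candidate, candidate_energy
        if best == current:
            # No neighbouring alignment is better: a local minimum.
            return current, current_energy
        current, current_energy = best, best_energy


# Hypothetical energies for six candidate alignments (offsets 0 to 5).
ENERGIES = [5, 3, 4, 6, 1, 2]


def toy_energy(offset):
    return ENERGIES[offset]


def toy_neighbours(offset):
    # A working alignment may be shifted by one position in either direction.
    return [o for o in (offset - 1, offset + 1) if 0 <= o < len(ENERGIES)]


print(greedy_descent(0, toy_neighbours, toy_energy))
print(greedy_descent(3, toy_neighbours, toy_energy))
```

Starting from offset 0, the search stops at a local minimum that is not the lowest-energy alignment overall, whereas starting from offset 3 it reaches the lowest one. Such dependence on the starting point is one reason why sampling approaches introduce randomness.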

Conclusion
As complete genome sequences become available and
many more protein structures are solved, new challenges
for bioinformatics are appearing. Investigators are just
beginning to address the questions of living organisms as
dynamic systems and these explorations will once again
expand the scope of bioinformatics. Advances in the
various arenas of bioinformatics hold the promise of
revolutionizing biological understanding and thereby
contributing to progress in preventing and treating disease.

References
Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990)
BLAST Basic Local Alignment Search Tool. Journal of Molecular
Biology 215: 403–410.
Brazma A, Jonassen I, Eidhammer I and Gilbert D (1998) Approaches to
the automatic discovery of patterns in biosequences. Journal of
Computational Biology 5(2): 279–305.
Chou KC (1995) A novel approach to predicting protein structural
classes in a (20-1)-D amino acid composition space. Proteins 21(4):
319–344.
Foote S, Vollrath D, Hilton A and Page DC (1992) The human Y
chromosome: overlapping DNA clones spanning the euchromatic
region. Science 258: 60–66.
Gribskov M, McLachlan AD and Eisenberg D (1987) Profile analysis:
detection of distantly related proteins. Proceedings of the National
Academy of Sciences of the USA 84(13): 4355–4358.
Hubbard TJ, Ailey B, Brenner SE et al. (1992) SCOP: a structural
classification of proteins database. Nucleic Acids Research 27:
254–256.
Jones DT, Taylor WR and Thornton JM (1992) A new approach to
protein fold recognition. Nature 358: 86–89.
Krogh A, Brown M, Mian IS, Sjolander K and Haussler D (1994)
Hidden Markov models in computational biology. Applications to
protein modeling. Journal of Molecular Biology 235: 1501–1531.
Levitt M and Chothia C (1976) Structural patterns in globular proteins.
Nature 261: 552–558.
Madej T, Gibrat JF and Bryant SH (1995) Threading a database of
protein cores. Proteins 23(3): 356–369.
Needleman SB and Wunsch CD (1970) A general method applicable to
the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology 48(3): 443–453.
Rost B and Sander C (1993) Prediction of protein secondary structure at
better than 70% accuracy. Journal of Molecular Biology 232: 584–599.
Sander C and Schneider R (1991) Database of homology-derived protein
structures and the structural meaning of sequence alignment. Proteins
9(1): 56–68.
Smith TF and Waterman MS (1981) Identification of common
molecular subsequences. Journal of Molecular Biology 147: 195–197.
Tatusov RL, Altschul SF and Koonin EV (1994) Detection of conserved
segments in proteins: iterative scanning of sequence databases with
alignment blocks. Proceedings of the National Academy of Sciences of
the USA 91: 12091–12095.

Further Reading
Baxevanis A and Ouellette BFF (eds) (1998) Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins. Chichester: John Wiley
and Sons.
Lesk AM (ed.) (1988) Computational Molecular Biology, Sources and
Methods for Sequence Analysis. Oxford: Oxford University Press.
Gribskov M and Devereux J (eds) (1991) Sequence Analysis Primer. New
York: Stockton Press.
Schuler GD, Boguski MS, Stewart EA et al. (1996) A gene map of the
human genome. Science 274: 540–546. [http://www.ncbi.nlm.nih.gov/
genemap/]
Smith CM (1997) The CMS Molecular Biology Resource: Bio-Web
resources organized by analytical function. Trends in Genetics 13: 416.
[(1998) MolyBio. Science 281: 139]
Vogel F and Motulsky AG (1997) Human Genetics: Problems and Approaches.
Berlin: Springer.
von Heijne G (1987) Sequence Analysis in Molecular Biology, Treasure
Trove or Trivial Pursuit. San Diego: Academic Press.

ENCYCLOPEDIA OF LIFE SCIENCES / & 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net
