What is Bioinformatics?
So why is bioinformatics a hot field? One answer to this question is that it is tied to the
human genome project, which has generated a lot of popular interest. Various advances
in molecular biology techniques (such as genome sequencing and microarrays) have led to
a large amount of data that needs to be analyzed. Now that we are close to having the
human genome finished, what does it all mean? That’s where bioinformatics steps in.
Bioinformatics can lead to important discoveries as well as help companies save time and
money in the long run. In addition, there need to be methods to manage large amounts
of data. One of the biggest reasons for bioinformatics being a hot field is the old supply
and demand adage: there are too few people adequately trained in both biology and
computer science to solve the problems that biologists need to have solved.
Introduction to Molecular Biology
Cells
Every organism is made up of tiny structures called cells. Often these cells are too small
to be seen with the naked eye. Each cell is in itself a complex system enclosed in a
membrane. Some organisms, such as bacteria and baker’s yeast, are composed of only a
single cell (i.e. they are unicellular). Other organisms are made up of many different
cells (i.e. they are multicellular). For instance, the human body is composed of around 60
trillion cells. Humans have about 320 different cell types, each having a different type of
function or structural property.
There are two types of organisms: eukaryotes and prokaryotes. Eukaryotes (or, as Bruce
Roe from the University of Oklahoma calls them, the “You and I” Karyotes) represent
most of the organisms which we can see, including plants and animals. Prokaryotes
(such as bacteria) have cells that are smaller than eukaryotic cells and have a simpler
structure. Prokaryotes are single-celled organisms (but not all single-celled organisms are
prokaryotes!).
So what is the difference between the two types of cells? A eukaryotic cell has a nucleus,
which is separated from the rest of the cell by a membrane. Inside the nucleus are the
chromosomes, where all of the genetic information for the organism is stored. In
addition, eukaryotic cells contain membrane-bound organelles with various functions,
such as lysosomes and mitochondria, alongside structures that are not membrane-bound,
such as centrioles and ribosomes.
Contained within the nucleus are one or several long double stranded DNA molecules
organized as chromosomes. For humans, there are 22 pairs of autosomes, as well as one
pair of sex chromosomes. One copy of each pair is inherited from each parent.
DNA
Deoxyribonucleic Acid (DNA) is the basis for the building blocks encoding the
information of life. A single stranded DNA molecule, called a polynucleotide (or an
oligomer, when it is short), is a chain of small molecules called nucleotides. There are
four different nucleotides, distinguished by their bases: adenine (A), cytosine (C),
guanine (G) and thymine (T).
The bases can be separated into two different types: purines (A and G) and pyrimidines
(C and T). The difference between purines and pyrimidines is in the base structure.
Stringing together a simple alphabet of four characters, we can get enough
information to create a complex organism! Different nucleotides can be strung together
to form a polynucleotide. However, the ends of the polynucleotide are different, meaning
that each polynucleotide sequence will have a directionality. The ends of the
polynucleotide are marked either 3’ or 5’. The general convention is to label the coding
strand from 5’ to 3’ (left to right).
For instance, the following is a polynucleotide:
5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’
5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’
| | | | | | | | | | | | | | | |
3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’
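The pairing shown above (A with T, C with G) is easy to express in code. The following is a minimal Python sketch, assuming the strand contains only the uppercase letters A, C, G and T; the function names are chosen here for illustration and are not part of any library.

```python
# Watson-Crick pairing: A <-> T and C <-> G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return strand.translate(PAIRS)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand rewritten in the 5' to 3' direction."""
    return complement(strand)[::-1]

top = "GTAAAGTCCCGTTAGC"          # 5' -> 3', the example strand above
print(complement(top))            # the lower strand as drawn, 3' -> 5'
print(reverse_complement(top))    # the same lower strand read 5' -> 3'
```

Because the two strands run in opposite directions, the reverse complement is usually the form wanted when the second strand is written by the 5’ to 3’ convention.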
Two complementary polynucleotide chains form a stable structure known as the DNA
double helix. This spring represents the 50th anniversary of the discovery of the double
helix structure of DNA by Watson, Crick and Franklin.
Note that in this image, there appear to be two types of grooves: A larger one, which is
called the major groove and a smaller one, known as the minor groove. In addition, there
are roughly 10.5 base pairs in one complete turn of the helix.
RNA
Ribonucleic Acid (RNA) is similar to DNA in that it is constructed from
nucleotides. However, instead of thymine (T), an alternative base, uracil (U), is found in
RNA. RNA can be found as double-stranded or single-stranded, and can also be part of a
hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is
generally found as a single stranded molecule that may form secondary or tertiary
structures due to the complementary bases between parts of the same strand.
RNA folding will be discussed in detail during a later class period. RNA is important in
the cell and contributes in a variety of ways. One of the most important roles of RNA is
in protein synthesis. Two of the major RNA molecules involved in protein synthesis are
messenger RNA (mRNA) and transfer RNA (tRNA).
mRNA
mRNA encodes the genetic information as copied from the DNA molecules.
Transcription is the process in which DNA is copied into an RNA molecule. The
resulting linear molecule is an mRNA transcript. In eukaryotic cells, before the mRNA
can be translated into a protein, it needs to be modified. The nature of most eukaryotic
genes is that the genes are created in pieces, where coding regions, called exons, are
interspersed with noncoding regions, called introns. One of the steps in processing the
mRNA is to remove the intronic regions and to splice together the coding, or exonic
regions. The processed mRNA can then be transported from the nucleus and translated
into a protein sequence.
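Transcription and splicing can be illustrated with a toy Python sketch. It assumes the coding strand is given 5’ to 3’ and that the exon boundaries are already known; the gene sequence and the exon coordinates below are invented for illustration, and real splice sites are not found this simply.

```python
def transcribe(coding_strand: str) -> str:
    """Write the mRNA that matches a 5'->3' DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon slices (0-based, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

gene = "ATGGCCGTAAGTTTTCAGGGATAA"   # hypothetical gene: exon, intron, exon
exons = [(0, 6), (18, 24)]           # hypothetical exon coordinates

pre_mrna = transcribe(gene)
mature = splice(pre_mrna, exons)
print(pre_mrna)
print(mature)
```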
mRNA processing.
tRNA
Genetic Code
Since there are 4 possible bases (A, C, G, U) and 3 bases in the codon, there are 4 * 4 * 4
= 64 possible codon sequences. The codon AUG is also used as a signal to
initiate translation, while the codons UAA, UAG, and UGA are terminal codons
signaling the end of translation. That leaves 61 codon sequences that can code for
amino acids (AUG also codes for an amino acid). However, there are only 20 amino
acids. Therefore the genetic code is redundant, meaning that a single amino acid can
be coded for by several different codons.
                        Second Position of Codon
First                                                                   Third
Position    U              C              A              G              Position
U           UUU Phe [F]    UCU Ser [S]    UAU Tyr [Y]    UGU Cys [C]    U
            UUC Phe [F]    UCC Ser [S]    UAC Tyr [Y]    UGC Cys [C]    C
            UUA Leu [L]    UCA Ser [S]    UAA STOP       UGA STOP       A
            UUG Leu [L]    UCG Ser [S]    UAG STOP       UGG Trp [W]    G
C           CUU Leu [L]    CCU Pro [P]    CAU His [H]    CGU Arg [R]    U
            CUC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]    C
            CUA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]    A
            CUG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]    G
A           AUU Ile [I]    ACU Thr [T]    AAU Asn [N]    AGU Ser [S]    U
            AUC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]    C
            AUA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]    A
            AUG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]    G
G           GUU Val [V]    GCU Ala [A]    GAU Asp [D]    GGU Gly [G]    U
            GUC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]    C
            GUA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]    A
            GUG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]    G
Genetic Code. The initiator codon is AUG (Met), and the terminal codons are marked STOP.
Each entry gives the codon triplet, then the three letter amino acid label, then the one
letter amino acid label.
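The table can be turned into a lookup in a few lines of Python. This is a sketch under stated assumptions: the string AMINO lists the one-letter labels in the order first position, second position, third position (each cycling through U, C, A, G), '*' stands for STOP, and the translate function and its start-at-first-AUG behaviour are simplifications chosen here for illustration.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid labels; the third codon position varies fastest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

stops = [codon for codon, aa in CODON_TABLE.items() if aa == "*"]
print(len(CODON_TABLE), "codons,", len(stops), "stops,",
      len(set(AMINO) - {"*"}), "amino acids")
print(translate("AUGGCCGGAUAA"))
```

Counting the entries confirms the arithmetic in the text: 64 codons in total, 3 of them stops, leaving 61 codons for 20 amino acids.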
Amino Acids
Amino acids are the building blocks from which proteins are made. There are 20
different amino acids that vary from each other by their side chain groups. Amino acids
can be classified into different groups based on their solubility in water. Hydrophilic
amino acids are water soluble, while hydrophobic amino acids are not. This property
becomes important when a protein sequence is made. Amino acids are linked to one
another via a single chemical bond, called a peptide bond. A linear chain of amino acids
can be referred to as a peptide (if it is short, fewer than 30 amino acids long) or a
polypeptide (which can be upwards of 4000 residues long).
One-letter Three-letter Full name
G GLY Glycine
A ALA Alanine
V VAL Valine
L LEU Leucine
I ILE Isoleucine
F PHE Phenylalanine
P PRO Proline
S SER Serine
T THR Threonine
C CYS Cysteine
M MET Methionine
W TRP Tryptophan
Y TYR Tyrosine
N ASN Asparagine
Q GLN Glutamine
D ASP Aspartic acid
E GLU Glutamic acid
K LYS Lysine
R ARG Arginine
H HIS Histidine
Proteins
Proteins are polypeptides that have a three dimensional structure. They can be described
through four different hierarchical levels:
Image source:
http://www.ebi.ac.uk/microarray/biology_intro.html
Calculating the secondary and tertiary structure of a protein given its primary structure is
not an easy task. Protein folding prediction will be covered at some point close to the end
of the semester.
Monomer – Any small molecule that can be linked with others of the same type to form
a polymer. For the purpose of this class, the molecules could be nucleic acids, amino
acids, or proteins.
Oligomer – General term for a short polymer most commonly consisting of nucleotides
or amino acids.
Polymer – Any large molecule consisting of multiple identical or similar subunits linked
by covalent bonds.
Putting it all together, we get the flow of genetic information. That is, DNA directs the
synthesis of RNA, and RNA then in turn directs the synthesis of protein. This flow of
genetic information from nucleic acids to protein has been called the Central Dogma of
Molecular Biology.
Central Dogma of Molecular Biology
DNA
↓
RNA
↓
PROTEIN
Image Source:
http://www.people.virginia.edu/~rjh9u/dnaprot.html
What is a Gene?
Aaah, the million dollar question. In short, a gene can be described as the physical and
functional unit of heredity that carries information from one generation to the next. A
gene can be thought of as the DNA sequence necessary for the synthesis of a functional
protein or RNA molecule.
Whenever the term genome is used, it typically refers to the chromosomal DNA of an
organism, or as far as sequencing is concerned, the euchromatic regions of the
chromosomal DNA. The number of chromosomes and genome size vary quite
significantly from one organism to another. An example list of genome sizes is given
below. Don’t let this table fool you into thinking that the size of the genome or the
number of genes determines the complexity of an organism. In fact, many plant genomes
are much greater in size than the human genome!
ORGANISM                               CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                  23            3,200,000,000      ~30,000
Mus musculus (Mouse)                   20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)    4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)       16            14,000,000         ~6,000
Zea mays (Corn)                        10            2,400,000,000      ???
The term transcriptome refers to the complete collection of all possible mRNAs
(including splice variants) of an organism. This can be thought of as the regions of an
organism’s genome that get transcribed into messenger RNA. In some cases, the
transcriptome can be extended to include all transcribed elements, including non-coding
RNAs used for structural and regulatory purposes.
The term proteome refers to the complete collection of proteins that can be produced by
an organism. The proteome can be studied either as a static (sum of all proteins possible)
or a dynamic (all proteins found at a specific time point) entity.
Molecular Biology Reference Books
Lewin, B (1999), Genes VII (published by Oxford University Press) ISBN: 019879276X
Lodish et al (1995), Molecular Cell Biology, 3rd edition (published by Scientific American Books,
Freeman and Company, New York) ISBN 0 7167 2380 8
Gonick, L & Wheelis, M (1991), The Cartoon Guide to Genetics (published by Harper Perennial, New
York) ISBN 0 06 273099 1
Online tutorials
One site you will be intimately familiar with by the end of the semester:
http://www.ncbi.nlm.nih.gov
Reading assignment
http://www.ebi.ac.uk/microarray/biology_intro.html
Introduction to Bioinformatics
In molecular biology, a common question is to ask whether or not two sequences are
related. The most common way to tell whether or not they are related is to compare them
to one another to see if they are similar. If we look at two words in the English language,
we note that two words that are spelled similarly may mean two completely different
things, such as the words pear and tear.
Biological sequences that are similar (but not exact) provide useful information to help
discover functional, structural, and evolutionary information. One common mistake is to
describe two sequences as having some sort of homology or a percent homology based on
their sequence similarity. This is a misuse of the biological term. Two sequences in
different organisms are homologous if they have been derived from a common ancestor
sequence. Two sequences may or may not be homologous regardless of their sequence
similarity. However, the greater the sequence similarity, the greater chance there is that
they share similar function and/or structure.
Homologs are similar sequences in two different organisms that have been derived from
a common ancestor sequence. Homologs can be described as either orthologous or
paralogous.
Orthologs are similar sequences in two different organisms that have arisen due to a
speciation event. Orthologs typically retain their functionality throughout evolution.
Paralogs are similar sequences within a single organism that have arisen due to a gene
duplication event.
Xenologs are similar sequences that do not share the same evolutionary origin, but rather
have arisen out of horizontal transfer events through symbiosis, viruses, etc.
Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
One method in determining sequence similarity is to determine the edit distance between
two sequences. If we take the example of pear and tear, how similar are these two
words? We notice that if we change the p to a t, and keep the ear, then we can change
pear to tear. Thus, there is a mismatch in the first letter, and matches in the last three. An
alignment of these two is as follows:
P E A R
  | | |
T E A R
One way to score this alignment is to calculate the Hamming distance, which is the
minimum number of letters by which the two words differ. In this example, the
Hamming distance would be 1. The Hamming distance is calculated by summing up the
number of mismatches when two words are aligned to one another.
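As a minimal sketch, the Hamming distance can be computed in Python by walking the two words in step and counting mismatched positions. The function name is chosen here for illustration; it assumes both words have the same length, since the Hamming distance is not defined otherwise.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("PEAR", "TEAR"))
```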
With biological sequences, it is often necessary to align two sequences that are of
different lengths, or that have regions that have been inserted or deleted over time. Thus,
the notion of gaps needs to be introduced. Consider the words alignment and ligament.
One alignment of these two words is as follows:
A L I G N M E N T
  | | |   | | | |
- L I G A M E N T
In this case, a gap is denoted in the alignment by a ‘-’ character. Now an alignment can
produce one of the following: a match between two characters, a mismatch between two
characters (also called a substitution or mutation), a gap in the first sequence (which can
be thought of as the deletion of a character in the first sequence), or a gap in the second
sequence (which can be thought of as the insertion of a character in the first sequence).
Consider the following two nucleic acid sequences: ACGGACT and ATCGGATCT.
The following are two valid alignments:
A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T

A T C G G A T C T
|   | | |     | |
A - C G G - A C T
Which alignment is the better alignment? One way to judge this is to assign a positive
score for each match, a negative score for each mismatch, and a negative score for
each insertion/deletion (collectively referred to as indels). The calculations below use
+2 for each match, -1 for each mismatch, and -2 for each indel.
Using this scoring scheme, the first alignment has 5 matches, 1 mismatch, and 4 indels.
The score for this alignment is: 5(2) - 1(1) - 4(2) = 10 - 1 - 8 = 1.
The second alignment has 6 matches, 1 mismatch, and 2 indels. The score for the second
alignment is 6(2) - 1(1) - 2(2) = 12 - 1 - 4 = 7.
Therefore, using the above scoring scheme, the second alignment is a better alignment,
since it produces a higher alignment score.
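Scoring a finished alignment is a column-by-column sum. The Python sketch below assumes the alignment is given as two equal-length strings with '-' marking gaps, and uses the values implied by the arithmetic above (+2 match, -1 mismatch, -2 indel) as defaults; the function name is illustrative.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 2, mismatch: int = -1, indel: int = -2) -> int:
    """Sum the per-column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel        # a gap in either row is an indel
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("A-C-GG-ACT", "ATCGGAT-CT"))   # first alignment
print(score_alignment("ATCGGATCT", "A-CGG-ACT"))     # second alignment
```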
One of the more basic, yet important techniques for determining the alignment between
two sequences is by using a visual alignment known as dot plots. Dot plots of sequence
similarity are created using a matrix where the rows in the matrix correspond to the
characters in the first sequence and the columns in the matrix correspond to the characters
in the second sequence. The dot plot is created as follows: loop through each row. For
the current row, take the character in that row and compare it to the character in each
column. If they are equal, place a dot in the matrix. Continue until all cells in the
matrix have been considered.
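The procedure just described maps directly onto two nested loops. Here is a small Python sketch, with function names invented for illustration, that builds the matrix of dots and prints it as text ('*' for a dot, '.' for an empty cell).

```python
def dot_plot(rows: str, cols: str) -> list[list[bool]]:
    """Cell [i][j] is True when rows[i] and cols[j] are the same character."""
    return [[r == c for c in cols] for r in rows]

def render(rows: str, cols: str, matrix: list[list[bool]]) -> str:
    """Draw the matrix as text, one labeled line per row."""
    lines = ["  " + " ".join(cols)]
    for r, row in zip(rows, matrix):
        lines.append(r + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

seq = "ACCTGAGCTCACCTGAGTTA"
print(render(seq, seq, dot_plot(seq, seq)))
```

Comparing a sequence against itself always fills the main diagonal; any other diagonal run points to a repeated region.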
[Figure: dot plot matrix with ACCTGAGCTCACCTGAGTTA along both the rows and the columns.]
Results for aligning ACCTGAGCTCACCTGAGTTA to itself using the Dot Matrix option of the AlignX
feature of Informax’s Vector NTI program.
When a dot plot is created to compare nucleic acids, there will be a lot of noise, since one
out of every four positions will match at random if A, C, G, and T occur in equal numbers
in the sequences. Therefore, dot plots can be filtered for stringency by requiring that a
certain percentage of nucleotides match in a given window size. With the example
above, if we filter the plot to only show matches of two or more consecutive
nucleotides, the dot plot looks like the following:
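A filtered plot of this kind can be sketched in Python. The version below is one simple way to do it, assuming the filter keeps a dot only when it belongs to a run of at least min_run consecutive matches along a forward diagonal; it is not the algorithm used by any particular program, and the names are illustrative.

```python
def filtered_dot_plot(rows: str, cols: str, min_run: int = 2) -> list[list[bool]]:
    """Keep only dots that lie on a diagonal run of at least min_run matches."""
    n, m = len(rows), len(cols)
    keep = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Skip non-matches and cells that are not the start of a run.
            if rows[i] != cols[j] or (i > 0 and j > 0 and rows[i - 1] == cols[j - 1]):
                continue
            length = 0
            while (i + length < n and j + length < m
                   and rows[i + length] == cols[j + length]):
                length += 1
            if length >= min_run:
                for k in range(length):
                    keep[i + k][j + k] = True
    return keep

seq = "ACCTGAGCTCACCTGAGTTA"
plot = filtered_dot_plot(seq, seq, min_run=2)
print("  " + " ".join(seq))
for label, row in zip(seq, plot):
    print(label + " " + " ".join("*" if hit else "." for hit in row))
```

Raising min_run removes more of the random single-base matches while leaving the longer diagonals, such as the repeated ACCTGAG segment, in place.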
Information within Dot Plots
Dot plots are useful as a first-level filter for determining an alignment between two
sequences. Regions of similarity will show up as diagonals within the dot plot matrix.
Regions of genomic DNA can contain repetitive regions. For instance, approximately 50
percent of the human genome is composed of repetitive elements, which can be on the
order of a few hundred bases (SINEs, such as Alu elements) or a few thousand bases (LINEs). In
addition, regions of low complexity are present as well. Repetitive elements and methods
to filter them out will be discussed during a later class period. In addition to repetitive
elements, regions of a genome can be duplicated. The duplicated region can be found
either as a direct repeat, meaning that it occurs in the same direction, or as an inverted
repeat, meaning that the sequence of the duplicated region is found in the reverse
complement direction. Dot plots can readily show regions of direct and inverted repeats.
Dot plots show all possible matches of residues between two sequences given a certain
threshold level. Thus, the researcher can decide which alignments are the most
significant.
Example dot plots showing the presence of direct and inverted repeats.
Dot plots can also be used in order to compare two different assemblies of the same
sequence. Below are three dotplots of various chromosomes. The first shows two
separate assemblies of human chromosome 5 compared against each other. The second
shows one assembly of chromosome 5 compared against itself, indicating the presence of
repetitive regions. The final dotplot shows chromosome Y compared against itself,
indicating the presence of inverted repeats.
Comparison of two assemblies of chromosome 5. The figure to the left indicates the alignment of two
separate assemblies, while the figure to the right indicates the alignment of a single assembly against itself.
Self plot of chromosome Y. Indicated are several regions of both direct and inverted repeats.
DNA strider
Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and
amino-acid sequences. Nucl. Acids Res. 10 (9), 2951-2961.
The shortcoming of visual methods is that they do not yield a direct measure of the
similarity between two sequences.
Suppose there are two sequences X and Z to be aligned, where |X| = m and |Z| = n. If
gaps are allowed in the sequences, then the potential length of both the first and second
sequences is m+n. Several methods will be discussed to align these sequences.
If we are interested in determining the optimal alignment (either global or local), then we
note that there are $2^{m+n}$ subsequences with spaces for the sequence X, and $2^{m+n}$
subsequences with spaces for the sequence Z using the power set rules. Thus, a brute
force method of comparing these two sequences for the optimal alignment would require
$2^{m+n} \cdot 2^{m+n} = 2^{2(m+n)} = 4^{m+n}$ comparisons. It doesn’t take long for this
to be an impossible search!
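To get a feel for how quickly this count grows, the short Python sketch below evaluates the bound from the text, 4 raised to the power m+n, for a few sequence lengths, and prints it next to the (m+1) x (n+1) number of cells in the dynamic programming matrix introduced in the next section. The function name and the chosen lengths are illustrative.

```python
def brute_force_comparisons(m: int, n: int) -> int:
    """Bound from the text: 2^(m+n) * 2^(m+n) = 4^(m+n) comparisons."""
    return (2 ** (m + n)) ** 2

for m, n in [(4, 4), (11, 7), (50, 50)]:
    print(f"m={m}, n={n}: brute force {brute_force_comparisons(m, n):,}"
          f" vs {(m + 1) * (n + 1):,} matrix cells")
```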
Dynamic Programming
Dynamic programming techniques align two sequences by beginning at the ends of the
two sequences and attempting to align all possible pairs of characters (one from each
sequence) using a scoring scheme for matches, mismatches, and gaps. The highest set of
scores defines the optimal alignment between the two sequences.
We will first consider dynamic programming in terms of DNA, where only exact matches
are considered for a match score. Later we will discuss how substitution matrices can be
used to score amino acid matches and mismatches.
Now we are ready to go ahead and start creating the dynamic programming matrix. The
first step is to align one of the sequences across the columns of the matrix, and the other
sequence across the rows. Note that an alignment can also begin with a gap in one of the
sequences, so that has to be taken care of as well. Let’s assume that we want to align the
sequence GAATTCAGTTA to GGATCGA. The length of the first sequence is 11
residues, and the length of the second is 7. Since it is possible to begin an alignment
with a gap, the size of the matrix should be 8 x 12. Row 0 and column 0 will represent
gaps. Rows 1-7 will be labeled with the corresponding residue of the sequence
GGATCGA, while columns 1-11 will be labeled with the corresponding residue of the
sequence GAATTCAGTTA. The initial matrix, S, is as follows:
- G A A T T C A G T T A
-
G
G
A
T
C
G
A
Now we need to decide upon the scoring scheme to be used. This requires parameters for
a match score, a mismatch score, and a gap score. The match and mismatch scores will
be combined into a single match/mismatch score, $s(a_i, b_j)$. We’ll see how this can later be
used with a substitution matrix. There will also be a single linear gap penalty score, $w$.
For our first example, we have the following parameters: a match score of +5, a mismatch
score of -3, and a gap penalty of $w = -4$.
Once you have the scoring functions set and the sequences to align, there are three steps
involved in calculating the optimal scoring alignment. The methods used to finish these
three steps are dependent upon whether global or local sequence alignment is desired.
The three steps are as follows:
• Initialization
• Matrix Fill (scoring)
• Traceback (alignment)
In global sequence alignment, an attempt to align the entirety of two different sequences
is made, up to and including the ends of the sequence. Needleman and Wunsch (1970)
were among the first to describe a dynamic programming algorithm for global sequence
alignment.
Initialization Step. In the initialization step of global alignment, each row $S_{i,0}$ is set to
$w \cdot i$. In addition, each column $S_{0,j}$ is set to $w \cdot j$. Remember that $w$ is the gap penalty.
Using the scoring scheme described above, the initialization step results in the following:
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4
G   -8
A  -12
T  -16
C  -20
G  -24
A  -28
Matrix Fill Step. One possible solution of the matrix fill step finds the maximum global
alignment score by starting in the upper left hand corner of the matrix and finding the
maximal score $S_{i,j}$ for each position in the matrix. In order to find $S_{i,j}$ for any i,j it is
necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j.
In terms of matrix positions, it is necessary to know $S_{i-1,j}$, $S_{i,j-1}$ and $S_{i-1,j-1}$.
For each position, $S_{i,j}$ is defined to be the maximum score at position i,j; i.e.
$$S_{i,j} = \max \begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\
S_{i,j-1} + w & \text{(gap in sequence \#1)} \\
S_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{cases}$$
Note that in the example, $S_{i-1,j-1}$ will be red, $S_{i,j-1}$ will be green and $S_{i-1,j}$ will be blue.
Using this information, the score at position 1,1 in the matrix can be calculated. Since the
first residue in both sequences is a G, $s(a_1, b_1) = 5$, and by the assumptions stated earlier,
$w = -4$. Thus, $S_{1,1} = \max[S_{0,0} + 5, S_{1,0} - 4, S_{0,1} - 4] = \max[5, -8, -8] = 5$.
A value of 5 is then placed in position 1,1 of the scoring matrix. Note that there is also an
arrow placed back into the cell that resulted in the maximum score, $S_{0,0}$.
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5
G   -8
A  -12
T  -16
C  -20
G  -24
A  -28
Now we proceed to $S_{1,2}$. Since $a_1$ = G and $b_2$ = A, there is a mismatch. Therefore,
$s(a_1, b_2) = -3$ and by the assumptions stated earlier, $w = -4$. Thus,
$S_{1,2} = \max[S_{0,1} - 3, S_{1,1} - 4, S_{0,2} - 4] = \max[-4 - 3, 5 - 4, -8 - 4] = \max[-7, 1, -12] = 1$.
An arrow is placed back into the cell that resulted in the maximum score, which is the cell $S_{1,1}$.
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1
G   -8
A  -12
T  -16
C  -20
G  -24
A  -28
We can proceed to fill in the rest of the first row in a similar fashion, resulting in the
following matrix:
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
G   -8
A  -12
T  -16
C  -20
G  -24
A  -28
Now we can start to fill in the second row, beginning with $S_{2,1}$. Note that $a_2$ = G and
$b_1$ = G, so $s(a_2, b_1) = 5$ and by the assumptions stated earlier, $w = -4$. Thus,
$S_{2,1} = \max[S_{1,0} + 5, S_{2,0} - 4, S_{1,1} - 4] = \max[-4 + 5, -8 - 4, 5 - 4] = \max[1, -12, 1] = 1$.
Note that in this case, there are two possible paths to the maximum value. Therefore, an
arrow is placed back into each cell resulting in the maximum score, which are cells
$S_{1,0}$ and $S_{1,1}$.
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
G   -8    1
A  -12
T  -16
C  -20
G  -24
A  -28
We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting
matrix is as follows:
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
G   -8    1    2   -2   -6  -10  -14  -18  -14  -18  -22  -26
A  -12   -3    6    7    3   -1   -5   -9  -13  -17  -21  -17
T  -16   -7    2    3   12    8    4    0   -4   -8  -12  -16
C  -20  -11   -2   -1    8    9   13    9    5    1   -3   -7
G  -24  -15   -6   -5    4    5    9   10   14   10    6    2
A  -28  -19  -10   -1    0    1    5   14   10   11    7   11
Each cell has one to three arrows indicating from which cell the maximum score was
obtained. The matrix fill step is now complete.
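The initialization and matrix fill steps can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name nw_fill is chosen here, the first argument labels the rows and the second the columns (as in the matrices above), and the defaults are the example's scores of +5, -3 and -4. It stores only the scores, not the arrows.

```python
def nw_fill(a: str, b: str, match: int = 5, mismatch: int = -3,
            gap: int = -4) -> list[list[int]]:
    """Fill the global alignment score matrix; a labels rows, b labels columns."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Initialization: row 0 and column 0 hold multiples of the gap penalty.
    for i in range(1, rows):
        S[i][0] = gap * i
    for j in range(1, cols):
        S[0][j] = gap * j
    # Matrix fill: each cell takes the best of diagonal, left and above.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # match/mismatch
                          S[i][j - 1] + gap,     # move from the left
                          S[i - 1][j] + gap)     # move from above
    return S

S = nw_fill("GGATCGA", "GAATTCAGTTA")
print("  " + " ".join(f"{c:>4}" for c in "-GAATTCAGTTA"))
for label, row in zip("-GGATCGA", S):
    print(label + " " + " ".join(f"{v:4d}" for v in row))
```

The work is proportional to the number of cells, (m+1) x (n+1), which is what makes this approach practical where brute force is not.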
Traceback Step. After the matrix fill step, the maximum global alignment score for the
two sequences is 11 (the value in the lower right hand cell). The traceback step will
obtain the actual alignment(s) that result in the maximum score.
The traceback begins in position $S_{M,N}$ (the lower right hand cell); i.e. the position
where both sequences are globally aligned.
Since pointers have been kept back to all possible predecessors, the traceback is simple.
At each cell, we look to see where we move next according to the pointers. To begin, the
only possible predecessor is the diagonal match.
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
G   -8    1    2   -2   -6  -10  -14  -18  -14  -18  -22  -26
A  -12   -3    6    7    3   -1   -5   -9  -13  -17  -21  -17
T  -16   -7    2    3   12    8    4    0   -4   -8  -12  -16
C  -20  -11   -2   -1    8    9   13    9    5    1   -3   -7
G  -24  -15   -6   -5    4    5    9   10   14   10    6    2
A  -28  -19  -10   -1    0    1    5   14   10   11    7   11
This gives us an alignment of
A
|
A
Note that the blue letters and gold arrows indicate the path leading to the maximum score.
We can continue to follow the path until we get to the following situation:
     -    G    A    A    T    T    C    A    G    T    T    A
-    0   -4   -8  -12  -16  -20  -24  -28  -32  -36  -40  -44
G   -4    5    1   -3   -7  -11  -15  -19  -23  -27  -31  -35
G   -8    1    2   -2   -6  -10  -14  -18  -14  -18  -22  -26
A  -12   -3    6    7    3   -1   -5   -9  -13  -17  -21  -17
T  -16   -7    2    3   12    8    4    0   -4   -8  -12  -16
C  -20  -11   -2   -1    8    9   13    9    5    1   -3   -7
G  -24  -15   -6   -5    4    5    9   10   14   10    6    2
A  -28  -19  -10   -1    0    1    5   14   10   11    7   11
The resulting global alignment is as follows:
G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A
Remembering that the scoring scheme used was +5 for a match, -3 for a mismatch, and
-4 for a gap, we can double check the score of the alignment:
G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A
+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5
5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
so this alignment results in a global alignment score of 11.
Note that more than one alignment can yield the maximal score, as evidenced by having
multiple ways to obtain the maximal score in a given cell in the scoring matrix. That
happens here: the cell $S_{4,5}$ (score 8) can be reached either diagonally from $S_{3,4}$
(3 + 5) or from the left from $S_{4,4}$ (12 - 4), so the alignment G G A T - C - G - - A also
scores 11. In such a case, the traceback can be accomplished in any manner desired, as
long as the same set of rules is consistently used in order for reproducibility.
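The traceback can be sketched in Python as well. Instead of storing arrows, this illustrative version recomputes at each cell which neighbour could have produced the score, and applies one fixed rule for ties (diagonal first, then left, then above), so it returns a single alignment. The function name nw_align and the tie rule are choices made here, not part of the original algorithm description.

```python
def nw_align(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Global alignment; returns (score, aligned b, aligned a)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = gap * i
    for j in range(1, cols):
        S[0][j] = gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + gap, S[i - 1][j] + gap)

    # Traceback from the lower right cell back to the upper left cell.
    i, j = len(a), len(b)
    top, bottom = [], []          # top: column sequence b, bottom: row sequence a
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:        # diagonal predecessor
                top.append(b[j - 1]); bottom.append(a[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and S[i][j] == S[i][j - 1] + gap:    # predecessor to the left
            top.append(b[j - 1]); bottom.append("-")
            j -= 1
        else:                                         # predecessor above
            top.append("-"); bottom.append(a[i - 1])
            i -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = nw_align("GGATCGA", "GAATTCAGTTA")
print(score)
print(top)
print(bottom)
```

Changing the order in which the three predecessors are tested changes which of the equally scoring alignments is reported, which is why the rule should be applied consistently.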
In 1981, Temple Smith and Mike Waterman proposed a modification to the Needleman-
Wunsch algorithm in order to obtain a local sequence alignment resulting in the highest-
scoring local match between two sequences.
There are only two slight modifications that need to be made to the Needleman-Wunsch
Algorithm in order to make it a local alignment algorithm. The first modification
requires negative scores for mismatches. The second modification requires that when the
dynamic programming scoring matrix value becomes negative, the value is set to zero,
which has the effect of terminating any alignment up to that point. This has the effect of
changing the matrix score to:
$$S_{i,j} = \max \begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\
S_{i,j-1} + w & \text{(gap in sequence \#1)} \\
S_{i-1,j} + w & \text{(gap in sequence \#2)} \\
0
\end{cases}$$
The local alignments are then produced by starting at the highest-scoring positions in the
scoring matrix and following a trace path from those positions up to a box that scores
zero.
Initialization Step. In the initialization step of local alignment, each row $S_{i,0}$ is set to 0.
In addition, each column $S_{0,j}$ is set to 0. Using the scoring scheme described above, the
initialization step results in the following:
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0
G 0
A 0
T 0
C 0
G 0
A 0
Matrix Fill Step. One possible solution of the matrix fill step finds the maximum local
alignment score by starting in the upper left hand corner of the matrix and finding the
maximal score $S_{i,j}$ for each position in the matrix. In order to find $S_{i,j}$ for any i,j it is
necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j.
In terms of matrix positions, it is necessary to know $S_{i-1,j}$, $S_{i,j-1}$ and $S_{i-1,j-1}$.
For each position, $S_{i,j}$ is defined to be the maximum score at position i,j; i.e.
$$S_{i,j} = \max \begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\
S_{i,j-1} + w & \text{(gap in sequence \#1)} \\
S_{i-1,j} + w & \text{(gap in sequence \#2)} \\
0
\end{cases}$$
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5
G 0
A 0
T 0
C 0
G 0
A 0
Note that in the example, $S_{i-1,j-1}$ will be red, $S_{i,j-1}$ will be green and $S_{i-1,j}$ will be blue.
Using this information, the score at position 1,1 in the matrix can be calculated. Since the
first residue in both sequences is a G, $s(a_1, b_1) = 5$, and by the assumptions stated earlier,
$w = -4$. Thus, $S_{1,1} = \max[S_{0,0} + 5, S_{1,0} - 4, S_{0,1} - 4, 0] = \max[5, -4, -4, 0] = 5$.
Now we proceed to $S_{1,2}$. Since $a_1$ = G and $b_2$ = A, there is a mismatch. Therefore,
$s(a_1, b_2) = -3$ and by the assumptions stated earlier, $w = -4$. Thus,
$S_{1,2} = \max[S_{0,1} - 3, S_{1,1} - 4, S_{0,2} - 4, 0] = \max[0 - 3, 5 - 4, 0 - 4, 0] = \max[-3, 1, -4, 0] = 1$.
An arrow is placed back into the cell that resulted in the maximum score, which is the cell $S_{1,1}$.
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1
G 0
A 0
T 0
C 0
G 0
A 0
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0
G 0
A 0
T 0
C 0
G 0
A 0
We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting
matrix is as follows:
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11
Each cell with a nonzero score has one to three arrows indicating from which cell the
maximum score was obtained. The matrix fill step is now complete.
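The matrix fill step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fill_matrix` and its parameter names are made up for this example, and the scoring values (match 5, mismatch -3, gap -4) are the ones used in the worked example. The row sequence is the one written down the left side of the matrix and the column sequence is the one written across the top.

```python
def fill_matrix(row_seq, col_seq, match=5, mismatch=-3, gap=-4):
    """Return the local-alignment score matrix S as a list of rows.

    Row 0 and column 0 are the initialization row/column of zeros.
    """
    n, m = len(row_seq), len(col_seq)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                S[i][j - 1] + gap,    # move left: gap against the column residue
                S[i - 1][j] + gap,    # move up: gap against the row residue
                0,                    # a local alignment may restart here
            )
    return S


S = fill_matrix("GGATCGA", "GAATTCAGTTA")
best = max(max(row) for row in S)  # maximum local alignment score
```

Printing `S` row by row gives a convenient way to check a hand-filled matrix cell by cell.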
Traceback Step. After the matrix fill step, the maximum local alignment score for the
two sequences is 14, which can be found by locating the highest values in the score
matrix. Note that 14 is found in two separate cells, indicating there are multiple
alignments producing the maximal alignment score. The traceback step will find the
actual local alignments resulting in the maximum score.
The traceback begins in the position with the highest value. Since pointers have been
kept back to all possible predecessors, the traceback is simple. At each cell, we look to
see where we move next according to the pointers. When we reach a cell where there is
no pointer to a previous cell, we have reached the beginning of the local
alignment.
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11
Note that the blue letters and gold arrows indicate the path leading to the maximum score.
We can continue to follow the path until we get to the following situation:
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11
At this point, our alignment (which is built starting at the end of the alignment) is as
follows:
C - A
| |
C G A
Now the current cell gets its score either from a match of the T’s or a gap in the second
sequence. We’ll consider both as possibilities: Match of the T’s (1) and gap in second
(2).
- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11
Once we reach the node with 0 and there are no pointers from this node, we are finished.
The two local alignments resulting in a score of 14 in the final row are:
G A A T T C - A
|   | |   |   |
G G A T - C G A
+ - + + - + - +
5 3 5 5 4 5 4 5

G A A T T C - A
|   |   | |   |
G G A - T C G A
+ - + - + + - +
5 3 5 4 5 5 4 5
As you can see, each of these has 5 matches, 1 mismatch, and 2 gaps, so the score is 5(5)
– 1(3) – 2(4) = 25 – 3 – 8 = 14. This coincides with the maximum local alignment score
calculated in the matrix.
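The traceback can be sketched as a recursive walk that follows every predecessor able to produce the current score. This is an illustrative sketch only: instead of storing arrows, it recomputes which neighbours could have produced each score, which is an implementation choice made here to keep the example short. The names `fill_matrix` and `tracebacks` are hypothetical, and the scoring values are those of the worked example.

```python
def fill_matrix(row_seq, col_seq, match=5, mismatch=-3, gap=-4):
    """Local-alignment score matrix with a zero first row and column."""
    S = [[0] * (len(col_seq) + 1) for _ in range(len(row_seq) + 1)]
    for i in range(1, len(row_seq) + 1):
        for j in range(1, len(col_seq) + 1):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + gap,
                          S[i - 1][j] + gap, 0)
    return S


def tracebacks(S, row_seq, col_seq, i, j, match=5, mismatch=-3, gap=-4):
    """Return all (column-sequence, row-sequence) alignments ending at (i, j)."""
    if S[i][j] == 0:          # a zero cell has no pointers: alignment starts here
        return [("", "")]
    found = []
    s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
    if S[i][j] == S[i - 1][j - 1] + s:   # diagonal pointer
        for top, bottom in tracebacks(S, row_seq, col_seq, i - 1, j - 1,
                                      match, mismatch, gap):
            found.append((top + col_seq[j - 1], bottom + row_seq[i - 1]))
    if S[i][j] == S[i][j - 1] + gap:     # left pointer: gap in the row sequence
        for top, bottom in tracebacks(S, row_seq, col_seq, i, j - 1,
                                      match, mismatch, gap):
            found.append((top + col_seq[j - 1], bottom + "-"))
    if S[i][j] == S[i - 1][j] + gap:     # up pointer: gap in the column sequence
        for top, bottom in tracebacks(S, row_seq, col_seq, i - 1, j,
                                      match, mismatch, gap):
            found.append((top + "-", bottom + row_seq[i - 1]))
    return found


rows, cols = "GGATCGA", "GAATTCAGTTA"
S = fill_matrix(rows, cols)
alignments = tracebacks(S, rows, cols, 7, 7)  # the 14 in the final row
```

Starting the same function at the other cell holding 14 gives the alignment that ends in that cell.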
Amino Acids
Certain amino acid substitutions commonly occur in related proteins from different
species. Since the proteins in all of the species are functional, the substitutions maintain
protein structure and function. Often the substitutions result in a chemically similar
amino acid. Other substitutions are relatively rare. Thus, rather than create a dynamic
programming matrix with a single match/mismatch score, it would be better to weight the
score for pairing two residues by the likelihood that such a substitution
would be observed in nature.
In a substitution matrix (whether it is an amino acid or nucleic acid matrix), the residues are
listed both as column and row headings. Each position in the matrix is filled with a
score reflecting how often one residue would be paired with another in an alignment of
related sequences.
Percent Accepted Mutation (PAM) Matrices
Margaret Dayhoff pioneered research into the amino acid substitutions found through
the alignment of common protein sequences. The resulting Percent Accepted Mutation
(PAM) Matrices give the changes expected for a given period of evolutionary time. The
assumption with this evolutionary model is that amino acid substitutions over short
periods of evolutionary history can be extrapolated to longer distances.
Each change in the current amino acid at a particular site is assumed to be independent of
previous mutational events at that site.
• Amino acid substitutions in evolving proteins were estimated using 1572 changes
in 71 groups of protein sequences at least 85% similar. Since the proteins have
similar functions, the mutations are called “accepted” mutations – meaning they
are accepted by natural selection without negatively affecting a protein’s fitness.
• Similar sequences were organized into phylogenetic trees
• The number of changes of each amino acid into every other amino acid was
counted.
• Relative mutabilities were evaluated by counting the number of changes of each
amino acid divided by a normalization factor. This normalized the data for
variations in amino acid composition, mutation rate, and sequence length.
• The amino acid exchange counts and mutability values were used to generate a 20
x 20 mutation probability matrix representing all possible amino acid changes.
Since the changes are independent of previous mutational events, the PAM1 matrix can
be multiplied by itself N times to give the transition matrix for sequences that have
undergone N accepted mutations per 100 amino acids. Thus, the PAM250 matrix can be used
for sequences that are 20% similar, while the PAM120, PAM80, and PAM60 matrices represent
40%, 50%, and 60% similarity. Note that PAM1 is 1 accepted mutation per 100 amino acids;
PAM10 is 10 accepted mutations per 100 amino acids; PAM250 is 250 accepted
mutations per 100 amino acids and so on.
Thus, the substitution matrix chosen when aligning two sequences should take into
account the divergence between the two sequences.
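The extrapolation from PAM1 to PAM N is just repeated matrix multiplication. The sketch below illustrates this on a toy 4-state matrix rather than the full 20 x 20 amino acid matrix. The toy matrix is an assumption made for the example: 0.99 on the diagonal and 0.01/3 elsewhere, so that each row sums exactly to 1. The function names are hypothetical.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Multiply the PAM1 probability matrix by itself n times."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]  # identity matrix = zero PAMs of change
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


p = 0.01 / 3  # toy off-diagonal probability (assumption for this example)
PAM1 = [[0.99 if i == j else p for j in range(4)] for i in range(4)]
PAM2 = pam_n(PAM1, 2)
PAM250 = pam_n(PAM1, 250)
```

Every row of the result still sums to 1, and as N grows the diagonal entries shrink toward the background frequency, which is why a matrix for distant sequences is far more tolerant of substitutions than PAM1.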
Example PAM1 matrix (normalized probabilities multiplied by 10000)
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A R N D C Q E G H I L K M F P S T W Y V
Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1
Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2
Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2
Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2
Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
Taken from:
http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#page7
PAM matrices are usually converted into another form, called log odds matrices. Odds
ratios are converted into logarithms so that the scores may be added, rather than
multiplied. Each cell of the log-odds matrix is calculated by first finding the odds ratio
for each substitution. The odds ratio is calculated by taking the score in the above
matrix, which is the probability of one amino acid mutating to another given amino acid,
and dividing it by the frequency of the first amino acid. Such a ratio gives the relative
frequency of change. The base-10 logarithm of the ratio is then taken, so that the scores are
additive, and the result is multiplied by 10. The log odds for converting from the first amino acid
to the second is averaged with the log odds for converting from the second amino acid to the
first to produce a symmetric matrix, since the direction of
mutation cannot necessarily be inferred. An example of how the log odds score for
changes between Phe and Tyr is calculated is given in Mount, pp 80 – 81. Make sure to work
through this example and raise any questions you have.
log-odds form of PAM250 Scoring matrix.
Image Source: http://www.blc.arizona.edu/courses/bioinformatics/dayhoff.html
(Image in Mount, p82)
One of the arguments against the Dayhoff PAM matrices is that they represent only a
small number of families, and therefore may not truly reflect amino acid distributions that
one is likely to encounter. Therefore, another set of substitution matrices, called
BLOSUM matrices, was developed using a much larger number of protein families.
The BLOSUM matrices were developed by Stephen and Georgia Henikoff by looking at
a large set of approximately 2000 amino acid patterns organized into blocks, which are
conserved regions within protein families as identified by the protein database, Prosite.
The blocks that were studied were also signatures of a protein family, indicating that
members of the family could be found by searching for these blocks.
In order to deal with overrepresentation of amino acid substitutions occurring in the most
closely related members of the family, a consensus sequence of the block is formed.
Sequences that were 60% identical to the consensus were grouped together to form the
BLOSUM60 matrix; sequences 80% identical were grouped together to form the
BLOSUM80 matrix, etc.
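Nucleotide substitutions fall into two classes: transitions (purine to purine or
pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse).
PURINES: A, G
PYRIMIDINES: C, T

PAM1 mutation probabilities, all substitutions equally likely:
     A        G        T        C
A    0.99
G    0.00333  0.99
T    0.00333  0.00333  0.99
C    0.00333  0.00333  0.00333  0.99

PAM1 mutation probabilities, transitions (A-G, C-T) three times as likely as transversions:
     A        G        T        C
A    0.99
G    0.006    0.99
T    0.002    0.002    0.99
C    0.002    0.002    0.006    0.99

Log odds scores for the uniform model:
     A    G    T    C
A    2
G   -6    2
T   -6   -6    2
C   -6   -6   -6    2

Log odds scores for the transition-biased model:
     A    G    T    C
A    2
G   -5    2
T   -7   -7    2
C   -7   -7   -5    2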
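The integer scores in the last two nucleotide tables can be reproduced from the probabilities in the first two. The sketch below assumes a background frequency of 0.25 for each nucleotide and a base-2 logarithm; neither is stated in the notes, but both are consistent with the tabulated values. The function name `log_odds` is hypothetical.

```python
import math


def log_odds(prob, background=0.25):
    """Rounded log2 odds of a substitution probability versus chance."""
    return round(math.log2(prob / background))


match_score = log_odds(0.99)          # diagonal entries
uniform_mismatch = log_odds(0.00333)  # any mismatch in the uniform model
transition = log_odds(0.006)          # A-G or C-T in the biased model
transversion = log_odds(0.002)        # purine-pyrimidine in the biased model
```

Under these assumptions a transition is penalized less than a transversion, which mirrors the higher probability assigned to it in the mutation matrix.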
Gap Penalties
The scoring matrices used to this point assume a linear gap penalty where each gap is
given the same penalty score. However, over evolutionary time, it is more likely that a
contiguous block of residues has become inserted/deleted in a certain region (for
example, it is more likely to have 1 gap of length k than k gaps of length 1). Therefore, a
better scoring scheme to use is an initial higher penalty for opening a gap, and a smaller
penalty for extending the gap. The affine gap penalty can then be formulated as follows:
wx = g + r(x-1)
where wx is the total gap penalty, g is the gap open penalty, r is the gap extend penalty,
and x is the length of the gap.
The gap penalty needs to be chosen relative to the score matrix, so that gaps will not be
excluded from the alignment, or propagate throughout the alignment. Typical values are
–12 for gap opening, and –4 for gap extension.
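As a quick check of the formula, the affine penalty can be written as a one-line function. This is only an illustration; the function name is made up, the penalties are passed as positive magnitudes (12 and 4, the typical values quoted above), and the result is the amount subtracted from the alignment score.

```python
def affine_gap_penalty(length, gap_open=12, gap_extend=4):
    """Total penalty w_x = g + r(x - 1) for one gap of the given length."""
    if length < 1:
        raise ValueError("a gap must have length at least 1")
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(3)          # one gap of length 3
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate gaps of length 1
```

With these values one gap of length 3 costs 20 while three separate single-residue gaps cost 36, so the scheme favours a single contiguous insertion or deletion.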
Affine gap penalties increase the number of matrices (or at least storage space) to be
filled out. The information to be processed is now:
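\[
\begin{aligned}
M_{i,j} &= \max\{\, M_{i-1,j-1} + \mathrm{subst}(A_i,B_j),\; D_{i-1,j-1} + \mathrm{subst}(A_i,B_j),\; I_{i-1,j-1} + \mathrm{subst}(A_i,B_j) \,\} \\
D_{i,j} &= \max\{\, D_{i,j-1} - \mathrm{extend},\; M_{i,j-1} - \mathrm{open} \,\} \\
I_{i,j} &= \max\{\, M_{i-1,j} - \mathrm{open},\; I_{i-1,j} - \mathrm{extend} \,\}
\end{aligned}
\]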
Where M is the match matrix, D is the delete matrix, and I is the insert matrix.
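The three recurrences can be sketched as follows for a global alignment score. Several things here are assumptions, because the notes give only the recurrences: the boundary conditions (a leading gap is charged open plus extensions), the use of a simple match/mismatch score in place of a substitution matrix, and the function name `affine_global_score`. The penalties are positive magnitudes that get subtracted.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=5, mismatch=-3, gap_open=12, gap_extend=4):
    """Best global alignment score of a and b using matrices M, D and I."""
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for j in range(1, m + 1):   # leading gap: only residues of b consumed
        D[0][j] = -(gap_open + gap_extend * (j - 1))
    for i in range(1, n + 1):   # leading gap: only residues of a consumed
        I[i][0] = -(gap_open + gap_extend * (i - 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1]) + s
            D[i][j] = max(D[i][j - 1] - gap_extend, M[i][j - 1] - gap_open)
            I[i][j] = max(M[i - 1][j] - gap_open, I[i - 1][j] - gap_extend)
    return max(M[n][m], D[n][m], I[n][m])


score = affine_global_score("AAAA", "AA")
```

For `AAAA` against `AA` the best alignment has two matches and one gap of length 2, so the score is 2(5) - (12 + 4) = -6.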
When two sequences of length m and n are not obviously similar but show an alignment,
it becomes necessary to assess the significance of the alignment. The scores of alignments
of random sequences have been shown to follow a Gumbel extreme value distribution.
Image source: http://roso.epfl.ch/mbi/papers/discretechoice/node11.html
Using a Gumbel extreme value distribution, the expected number of alignments with a
score at least S (E-value) is:
\[ E = K m n \, e^{-\lambda S} \]
Where:
m, n: lengths of the sequences
K, λ: statistical parameters dependent upon the scoring system and background residue
frequencies
Recall that the log-odds scoring schemes examined to this point normally use a S =
10 log10(x) scoring system. We can normalize the raw scores obtained using these
non-gapped scoring systems to obtain the number of bits of information contained in a score,
or the number of nats of information contained within a score.
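Converting to bit scores
A raw score S is converted to a bit score S' by:
\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]
The E-value corresponding to a given bit score can then be calculated as:
\[ E = m n \, 2^{-S'} \]
Converting to nats is similar. However, we just substitute e for 2 in the above equations.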
Converting scores to either bits or nats gives a standardized unit by which the scores can
be compared.
P-values
\[ P = 1 - e^{-E} \]
If a scoring matrix has been scaled to bit scores, then it can quickly be determined
whether or not an alignment is significant. For a typical amino acid scoring matrix, K =
0.1 and lambda depends on the values of the scoring matrix. If a PAM or BLOSUM
matrix is used, then lambda is precomputed. For instance, if the log odds matrix is in
units of bits, then lambda = ln 2, and the significance cutoff can be calculated as
log2(mn).
F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G
Using the PAM250 matrix (p82), the score for this local alignment can be calculated as:
S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73
\[
\begin{aligned}
S &= 10 \log_{10} x \\
S/10 &= \log_{10} x \\
(S/10)\,\log_2 10 &= \log_{10} x \cdot \log_2 10 = \log_2 x \\
\tfrac{1}{3} S &\approx \log_2 x
\end{aligned}
\]
so S' ≈ S/3
Since the alignment score is above the significance cutoff, this is a significant local
alignment.
Estimation of P and E
When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while lambda is
estimated to be 0.229. Using equations 30 and 31 (Mount), we can convert the score to a
bit score:
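The conversion can be sketched numerically. The formulas used below are the standard ones, bit score S' = (λS - ln K) / ln 2, E = mn 2^(-S') and P = 1 - e^(-E); they are written out here as an assumption because the equations referenced in Mount are not reproduced in these notes. K = 0.09, λ = 0.229 and S = 73 come from the text, while the sequence lengths m = n = 250 are hypothetical values chosen only so that an E-value can be computed.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalize a raw alignment score to bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits, m, n):
    """Expected number of chance alignments scoring at least `bits`."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment, given the E-value."""
    return 1.0 - math.exp(-e)


bits = bit_score(73, K=0.09, lam=0.229)
E = e_value(bits, m=250, n=250)  # hypothetical sequence lengths
P = p_value(E)
```

Computing E from the bit score gives the same number as plugging the raw score directly into E = Kmn e^(-λS), which is a useful consistency check on the conversion.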
Bayesian Statistics
Bayesian statistics are built upon conditional probabilities, which are used to derive the
joint probability of two events or conditions.
Suppose that A can have two states, A1 and A2, and B can have two states, B1 and B2.
Suppose that P(B1) = 0.3 is known. Therefore, P(B2) = 1 – 0.3 = 0.7. These
probabilities are known as marginal probabilities. Now we would like to determine the
probability of A1 and B1 occurring together, which is denoted as: P(A1, B1) and is called
the joint probability. Note that in this case the marginal probabilities of A1 and A2 are
missing. Thus, there is not enough information at this point to calculate the joint
probability. However, if more information about the occurrence of A1 together with B1 is
given, then the joint probabilities may be derived using Bayes Rule:
\[ P(A_1, B_1) = P(B_1)\,P(A_1 \mid B_1) \]
Suppose that we are given P(A1|B1) = 0.8. Then, since there are only two different
possible states for A, P(A2|B1) = 1 – 0.8 = 0.2. If we are also given P(A2|B2) = 0.7, then
P(A1|B2) = 0.3. Using Bayes Rule, the joint probability of having states A1 and B1
occurring at the same time is P(B1)P(A1|B1) = 0.3 * 0.8 = 0.24 and P(A2,B2) =
P(B2)P(A2|B2) = 0.7 * 0.7 = 0.49. The other joint probabilities can be calculated from
these as well.
The calculation of the joint probabilities results in posterior probabilities, since they are
not known initially, but are calculated using prior probabilities and initial information.
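The numbers in this example can be checked with a short script. The dictionary layout and variable names are made up for the illustration; the probabilities are the ones given above. The last line goes one step beyond the text and computes P(B1|A1) from the joint and marginal values, as an example of turning the table around.

```python
# Marginal probabilities of B and conditional probabilities of A given B.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {
    ("A1", "B1"): 0.8, ("A2", "B1"): 0.2,
    ("A1", "B2"): 0.3, ("A2", "B2"): 0.7,
}

# Joint probabilities: P(A, B) = P(B) * P(A | B).
joint = {(a, b): p_b[b] * p for (a, b), p in p_a_given_b.items()}

# Marginal probabilities of A, obtained by summing the joint table over B.
marginal_a = {
    a: sum(p for (x, _), p in joint.items() if x == a) for a in ("A1", "A2")
}

# Conditional probability in the other direction.
p_b1_given_a1 = joint[("A1", "B1")] / marginal_a["A1"]
```

The four joint probabilities sum to 1, which is a quick sanity check that the conditional probabilities were entered consistently.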
Dynamic programming approaches are guaranteed to give the optimal alignment between
two sequences given a scoring scheme. However, the two main drawbacks to DP
approaches are that they are compute and memory intensive, in the cases discussed to this
point taking at least O(n^2) time and space.
Linear space algorithms have been used in order to deal with one drawback to dynamic
programming. The basic idea is to concentrate only on those areas of the matrix more
likely to contain the maximum alignment. The most well-known of these linear space
algorithms is the Myers-Miller algorithm.
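One observation behind linear-space methods is that each row of the score matrix depends only on the row above it. The sketch below is not the Myers-Miller algorithm; it only shows that the optimal local score can be computed while keeping two rows in memory. Recovering the alignment itself in linear space needs the additional divide-and-conquer step that such algorithms provide. The function name and the scoring values (taken from the earlier worked example) are assumptions for this illustration.

```python
def local_score_linear_space(row_seq, col_seq, match=5, mismatch=-3, gap=-4):
    """Maximum local alignment score using O(len(col_seq)) memory."""
    prev = [0] * (len(col_seq) + 1)   # the previous row of the matrix
    best = 0
    for r in row_seq:
        curr = [0]                    # column 0 is always zero
        for j, c in enumerate(col_seq, start=1):
            s = match if r == c else mismatch
            curr.append(max(prev[j - 1] + s,   # diagonal
                            curr[j - 1] + gap, # left
                            prev[j] + gap,     # up
                            0))
        best = max(best, max(curr))
        prev = curr                   # discard everything older than one row
    return best


score = local_score_linear_space("GGATCGA", "GAATTCAGTTA")
```

The running time is still proportional to the product of the two lengths; only the memory requirement drops.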
http://genome.cs.mtu.edu/align/align.html
EMBOSS APPLICATIONS
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.html
BAYESIAN TUTORIAL
http://www.wadsworth.org/resnres/bioinfo/tut1/index.htm
Sim4
est2genome
spidey
########################################
# Program: needle
# Rundate: Wed Jan 22 20:09:50 2003
# Report_file: outfile.align
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: gi
# 2: gi
# Matrix: EDNAFULL
# Gap_penalty: 12.0
# Extend_penalty: 4.0
#
# Length: 1030
# Identity: 537/1030 (52.1%)
# Similarity: 537/1030 (52.1%)
# Gaps: 493/1030 (47.9%)
# Score: 1649.0
#
#
#=======================================
gi 1 0
gi 1 ATACAAAATTTACGTGACTGGAGGGTGAAAGGGAATGTGGGAGGTCAGTG 50
gi 1 GGCAATAATGATACAATGTATCATGCCTCT 30
||||||||||||||||||||||||||||||
gi 51 CATTTAAAACATAAAGAAATGGCAATAATGATACAATGTATCATGCCTCT 100
gi 31 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 80
||||||||||||||||||||||||||||||||||||||||||||||||||
gi 101 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 150
gi 81 CAA---------------------------ATAAATTGTAACTGATGTAA 103
||| ||||||||||||||||||||
gi 151 CAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAA 200
gi 538 537
gi 538 537
gi 538 537
#---------------------------------------
#---------------------------------------
water results
########################################
# Program: water
# Rundate: Wed Jan 22 20:11:48 2003
# Report_file: outfile.align
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: gi
# 2: gi
# Matrix: EDNAFULL
# Gap_penalty: 12.0
# Extend_penalty: 4.0
#
# Length: 660
# Identity: 484/660 (73.3%)
# Similarity: 484/660 (73.3%)
# Gaps: 152/660 (23.0%)
# Score: 1660.0
#
#
#=======================================
gi 1 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 50
||||||||||||||||||||||||||||||||||||||||||||||||||
gi 71 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 120
gi 51 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAA----------------- 83
|||||||||||||||||||||||||||||||||
gi 121 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAAT 170
gi 84 ----------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 123
||||||||||||||||||||||||||||||||||||||||
gi 171 ATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 220
#---------------------------------------
#---------------------------------------
Blast 2 sequences
10 20 30 40 50 60 70
gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC
100 110 120 130 140 150 160 170
(Don't worry about this for this project; it will be part of the second programming
assignment.)
Sometimes genes that are similar in sequence can be mutated or rearranged to perform an
altered function. By looking at multiple alignments of such sequences, we can tell which
changes in the sequence have caused a change in the functionality.
Multiple sequence alignment yields information concerning the structure and function of
proteins, and can help lead to the discovery of important sequence domains or motifs
with biological significance while at the same time uncovering evolutionary relationships
among genes.
In multiple sequence alignment, the idea is to take three or more sequences, and align
them so that the greatest number of similar characters are aligned in the same column of
the alignment.
The difficulty with multiple sequence alignment is that now there are a number of
different combinations of matches, insertions, and deletions that must be considered
when looking at several different sequences. Methods to guarantee the highest scoring
alignment are not feasible. Therefore, approximation methods are put to use in multiple
sequence alignment.
There are four approaches to multiple sequence alignment we will consider: Dynamic
Programming Approach, Progressive alignment, Iterative alignment, and statistical
modeling.
Suppose the length of each sequence is n residues. If there are two such sequences, then
the number of comparisons needed to fill in the scoring matrix is n^2, since it is a
two-dimensional matrix. The number of comparisons needed to fill in the scoring cube when
three sequences are aligned is n^3, and when four sequences are aligned, the number of
comparisons needed is n^4. Thus, as the number of sequences increases, the number of
comparisons needed increases exponentially, i.e. n^N where n is the length of the
sequences, and N is the number of sequences. Thus, without any changes to the dynamic
programming approach, this becomes impractical for even a small number of short
sequences rather quickly.
While this is a heuristic alignment (and is therefore not guaranteed to be optimal), it does
provide a limit to the search space within which optimal alignments are likely to be
found.
Figures 4.2 and 4.3 (Mount) describe how the two dimensional search spaces can be
projected into a three dimensional volume that can be searched.
MSA calculates the multiple alignment score within the lattice by adding the scores of the
corresponding pairwise alignments in the multiple sequence alignment. This measure is
known as the sum of pairs (SP) measure. The optimal alignment is based on the best SP
score.
ECSQ (1)
SNSG (2)
SWKN (3)
SCSN (4)
Since there are four sequences, there will be six different pairwise comparisons to consider
for each column. The pairs, listed by sequence number, are as follows:
1-2
1-3
1-4
2-3
2-4
3-4
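The sum of pairs calculation can be sketched as a loop over columns and over all pairs within each column. The scoring function used here (1 for identical residues, 0 otherwise) is a placeholder assumption so the example stays self-contained; in practice a substitution matrix and gap penalties would be used. The function names are hypothetical.

```python
from itertools import combinations


def sum_of_pairs(alignment, score):
    """Sum the pairwise scores over every column of an alignment."""
    total = 0
    for column in zip(*alignment):             # one tuple of residues per column
        for x, y in combinations(column, 2):   # all unordered pairs in the column
            total += score(x, y)
    return total


def identity_score(x, y):
    """Placeholder scoring scheme: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0


sequences = ["ECSQ", "SNSG", "SWKN", "SCSN"]
pairs = list(combinations(range(1, 5), 2))     # (1, 2), (1, 3), ..., (3, 4)
sp = sum_of_pairs(sequences, identity_score)
```

With four sequences `pairs` holds the six sequence pairs listed above, and each column contributes the sum of its six pairwise scores.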
Problem with this approach: more closely related sequences will have a higher weight
The MSA program gets around this by calculating weights to associate to each sequence
alignment pair. The weights are assigned based on the predicted tree of the aligned
sequences.
The program calculates an e value for each pair of sequences, which acts as the weight
for that sequence pair. Sequences that are more divergent will have a higher e value.
http://www.psc.edu/biomed/genedoc/whygd.htm
The initial pairwise alignments are calculated using an enhanced dynamic programming
algorithm, and the genetic distances used to create the phylogenetic tree are calculated by
dividing the total number of mismatched positions by the total number of matched
positions.
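The distance described here can be sketched directly from a pairwise alignment. How gapped columns are treated is not stated in the notes; the sketch below simply skips them, which is an assumption, and the function name is hypothetical.

```python
def pairwise_distance(aligned_a, aligned_b):
    """Mismatched positions divided by matched positions in an alignment."""
    matches = mismatches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue            # assumption: gapped columns are ignored
        if x == y:
            matches += 1
        else:
            mismatches += 1
    if matches == 0:
        raise ValueError("no matched positions; distance is undefined")
    return mismatches / matches


d = pairwise_distance("GAATTC-A", "GGAT-CGA")
```

For this pair there are 5 matched and 1 mismatched ungapped columns, giving a distance of 0.2. Distances of this kind for every pair of sequences form the table from which the guide tree is built.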
Alignments are assigned a weight based on their distance from the root node.
PILEUP is the multiple sequence alignment program that is part of the Genetics
Computer Group (GCG) package developed at the University of Wisconsin. In PILEUP,
multiple sequence alignment is performed by first aligning each of the sequences in a
pair-wise fashion using a Needleman-Wunsch approach. The resulting scores are used to
produce a tree by the unweighted pair-group method using arithmetic averages (UPGMA).
The resulting tree is then used to guide the alignment of the most closely related
sequences and groups of sequences.
The second issue is that suitable scoring matrices and gap penalties must be chosen to
apply to the sequences as a set.
MultAlin
PRRP
DIALIGN
Genetic Algorithms
2) The initial alignments are scored by the sum of pairs method. Standard amino
acid scoring matrices and gap open, gap extension penalties are used.
3) Initial alignments are replaced to give another generation of multiple sequence
alignments. One half of the multiple sequence alignments are chosen to proceed
to the next generation unchanged (natural selection). This half is chosen by
assigning probabilities to each sequence based on an inverse proportion of their
SP scores (the best alignments, since the SP scores are weighted according to their
distance from the parent). The other half of the alignments are sent to the next
generation, but are first subject to mutation.
4) In the mutation process, gaps are inserted into the sequences subject to mutation
and rearranged in an attempt to create a better scoring alignment. In this step, the
sequences are split into two sets based on an estimated phylogenetic tree, and
gaps of random lengths are inserted into random positions in the alignment.
6) The next generation is evaluated going back to step 2, and steps 2-5 are repeated a
number (100-1000) times. The best scoring multiple sequence alignment is then
obtained (note that it may not be the optimal scoring alignment).
7) The entire process is repeated several times, starting from a different initial
alignment each time. The best scoring multiple sequence alignment is then
chosen and reported to the user.
Simulated Annealing
In the group approach, sequences are clustered into related groups. A consensus
sequence is produced to make alignments between the groups. Examples of programs
implementing the group approach are PIMA and MULTAL.
PIMA
MULTAL
Tree approach
The tree method uses the distance method of phylogenetic analysis to arrange the
sequences. The two closest sequences are then aligned, and the consensus of these two is
aligned with the next best sequence (or group of sequences) until an alignment is
produced that includes all of the sequences. This approach is a popular approach used by
PILEUP, CLUSTALW and ALIGN.
TREEALIGN is a program that uses the tree approach, but rearranges the tree as
sequences are aligned to produce the tree by maximum parsimony of the tree.
Programs to detect localized alignments typically use one of the following three
approaches: Profile Analysis; block analysis; pattern-searching or statistical methods
Profile analysis
Block Analysis
Expectation-Maximization
Gibbs Sampling
Sequence Logos
Alignments based on locally conserved patterns found in the same order in the sequences
(synteny)
Multiple sequence alignments yield information into the evolutionary history of the
sequences – sequences that are most similar are likely to be recently derived from a
common ancestor sequence
If the sequences in a multiple alignment have quite a bit of variation then it is difficult to
create a multiple sequence alignment due to the different combinations of substitutions,
insertions, and deletions that can be used
ECSQ
SNSG
SWKN
SCSN
Motif-Based Approaches
MEME
Meta-MEME
PRRP
SAGA
MSA
ClustalX
ClustalW
Viewing Multiple Alignments
sequence logos
Once an alignment is made, they can be compared using Hidden Markov Models
IBM’s MUSCA
http://cbcsrv.watson.ibm.com/Tmsa.html
ClustalW
http://www.ebi.ac.uk/clustalw/
http://clustalw.genome.ad.jp/
DIALIGN
http://bibiserv.techfak.uni-bielefeld.de/cgi-bin/dialign_submit
Web Logo
http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi
Introduction to Bioinformatics
Lecture 4: Multiple Sequence Alignment
Profile analysis
Profiles are found by first multiply aligning the sequences, determining which regions are
the most highly conserved, and then creating a scoring matrix for the alignment of the
highly conserved region. The profile is composed of columns, and may include matches,
mismatches, insertions, and deletions found in a particular column.
Once a profile is created, it can be used to search a target sequence or database for
possible matches to the profile, using the profile's scores to evaluate the likelihood at each
position.
The drawback of profiles is that the profile is only as representative as the variation in the
sequences used to construct it. Thus, there is a bias in the profile towards the training
data.
For each position in a profile, there is a column for each amino acid, plus a column for an
unknown amino acid (z), and a column for gap opening and gap extension. There is a
row for each position in the multiple alignment.
Calculating profiles
How are the values in a profile created? The value of an individual cell is calculated as
the log odds score of finding a particular residue in a particular location in an alignment
divided by the probability of aligning the two amino acids by random chance using a
particular scoring scheme (such as PAM250, BLOSUM80, …). Additional penalties
must be calculated for gap opening and gap extension in the profile as well. If a gap
exists in the multiple alignment, then the penalties for gaps will be reduced.
One method (average method) weighs the proportion of the amino acids found in a
particular column, and weights the score of matching the consensus residue at a given
position to that particular residue.
Shannon Entropy
One method to calculate the observed column variation given the expected variation in
the evolutionary model is to use an information measure known as entropy. The smaller
the entropy, the more conserved a column is. Entropy for a single column is calculated
by the following formula:
\[ H = -\sum_{\text{residues } a} f_a \log(p_a) \]
Where f_a is the observed proportion of each residue a in the MSA column and p_a is the
expected frequency of the residue when derived from a given ancestor residue.
With an amino acid msa, the entropy measure can be used with several different
evolutionary distances to determine which one minimizes entropy. (See page 164 of
Mount for the discussion).
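The entropy formula can be sketched as follows. The base of the logarithm is not given in the notes; base 2 is assumed here so the result is in bits. The expected frequencies are passed in as a dictionary, and the values used in the example are made up purely for illustration.

```python
import math
from collections import Counter


def column_entropy(column, expected):
    """H = -sum over residues a of f_a * log2(p_a) for one alignment column.

    `column` is a string of residues; `expected` maps residue -> p_a.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(expected[a]) for a, n in counts.items())


conserved = column_entropy("AAAA", {"A": 1.0})           # model expects only A
mixed = column_entropy("AACC", {"A": 0.5, "C": 0.5})     # two residues, equal odds
unexpected = column_entropy("AAAA", {"A": 0.25})         # A is observed but unlikely
```

The value is smallest when the observed residues are the ones the evolutionary model considers most likely, which is what makes it usable for comparing models at different evolutionary distances.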
Block Analysis
Blocks are similar to profiles in the sense that they represent locally conserved regions
within a multiple sequence alignment. However, the difference is that blocks lack indels.
Blocks can be determined either by performing a multiple sequence alignment, or by
searching a database for similar sequences of the same length. Algorithms for searching
for a BLOCK were initially developed by Henikoff and Henikoff (1991).
Statistical approaches to finding the most alike sequences have been proposed, such as
the Expectation-Maximization algorithms and the Gibbs sampler. In any case, once a set
of blocks has been determined, the information contained within the block alignment can
be displayed as a sequence profile.
A global sequence alignment will usually contain ungapped regions that are aligned
between multiple sequences. These regions can be extracted to produce blocks. Two
programs that allow the extraction of blocks from a multiple sequence alignment are
BLOCKS and eMOTIF, both of which can read in a multiple sequence alignment in one
of many different file formats and perform the extraction of the blocks. The websites for
these programs are:
http://www.blocks.fhcrc.org/blocks/process_blocks.html
http://dna.stanford.edu/emotif/
AF1   1 -SEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTPEQFQAFRNEVAVLRKT--RHVNIL
MOS   1 -EQVCLLQRLGAGGFGSVYKATYRG-VPVAIKQVNKCTKNRLASRRSFWAELNVARLRHDNI-
DGM   1 -DQLVLGRTLGSGAFGQVVEATAHG-LSHSQATMKVAVKMLKSTARSSEKQALMSELYGDLV-
GFR   1 -TEFKKIKVLGSGAFGTVYKGLWIP-EGEKVKIPVAIKELREATSPKANKEILDEAYVMASV-
D28   1 -ANYKRLEKVGEGTYGVVYKALDLR--PGQGQRVVALKKIRLESEDEGVPSTAIREISLLKEL
SKH   1 -AKYDIKALIGRGSFSRVVRVEHRA-TRQPYAIKMIETKYREGREVCESELRVLRRVRHANI-
APK   1 -DQFERIKTLGTGSFGRVMLVKHME-TGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNF-
EE1   1 -TRFRNVTLLGSGEFSEVFQVEDPVEKTLKYAVKKLKVKFSGPKERNRLLQEVSIQRALKG--
FES   1 VLNRAVPKDKWVLNHEDLVLGEQIG-RGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKAK--
SVK   1 -MGFTIHGALTPGSEGCVFDSSHPD-YPQRVIVKAGWYTSTSHEARLLRRLDHPAILPLLDL-
cons  1 qf ll lgsgsfg vykg g k i v k r v l i
Taking this alignment, we can generate blocks using the BLOCKS server:
ID x6676xbli; BLOCK
AC x6676xbliA; distance from previous blocks=(1,1)
DE ../tmp/6676.blin
BL UNK motif; width=24; seqs=10; 99.5%=0; strength=0
AF1 ( 1) SEVMLSTRIGSGSFGTVYKGKWHG 41
MOS ( 1) EQVCLLQRLGAGGFGSVYKATYRG 48
DGM ( 1) DQLVLGRTLGSGAFGQVVEATAHG 49
GFR ( 1) TEFKKIKVLGSGAFGTVYKGLWIP 41
D28 ( 1) ANYKRLEKVGEGTYGVVYKALDLR 61
SKH ( 1) AKYDIKALIGRGSFSRVVRVEHRA 54
APK ( 1) DQFERIKTLGTGSFGRVMLVKHME 46
EE1 ( 1) TRFRNVTLLGSGEFSEVFQVEDPV 55
FES ( 1) LNRAVPKDKWVLNHEDLVLGEQIG 100
SVK ( 1) MGFTIHGALTPGSEGCVFDSSHPD 73
//
ID x6676xbli; BLOCK
AC x6676xbliB; distance from previous blocks=(2,2)
DE ../tmp/6676.blin
BL UNK motif; width=28; seqs=10; 99.5%=0; strength=0
AF1 ( 27) AVKILKVVDPTPEQFQAFRNEVAVLRKT 87
MOS ( 27) PVAIKQVNKCTKNRLASRRSFWAELNVA 75
DGM ( 27) SHSQATMKVAVKMLKSTARSSEKQALMS 92
GFR ( 27) GEKVKIPVAIKELREATSPKANKEILDE 83
D28 ( 27) PGQGQRVVALKKIRLESEDEGVPSTAIR 83
SKH ( 27) RQPYAIKMIETKYREGREVCESELRVLR 74
APK ( 27) GNHYAMKILDKQKVVKLKQIEHTLNEKR 85
EE1 ( 27) TLKYAVKKLKVKFSGPKERNRLLQEVSI 77
FES ( 27) GNFGEVFSGRLRADNTLVAVKSCRETLP 100
SVK ( 27) PQRVIVKAGWYTSTSHEARLLRRLDHPA 92
//
Expectation-Maximization
In the expectation-maximization algorithms, the starting point is a set of sequences
expected to have a common sequence pattern that may not be easily detectable. An initial
guess is made as to the location and size of the site of interest in each of the sequences.
These initial sites are then aligned.
Expectation Step In the expectation step, background residue frequencies are calculated
based on those residues that are not in the initially aligned sites. Column-specific residue
frequencies are calculated for each position in the initial motif alignment. Using this information, the
probability of finding the site at any position in the sequences can then be calculated.
Maximization Step In the maximization step, the counts of residues for each position in
the site as found in the expectation step are used to calculate the location within each
sequence that maximally aligns to the motif pattern calculated in the expectation step.
This is done for each of the sequences. Once a new motif location has been calculated,
the expectation step is repeated. This cycle continues until the solution converges.
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
From this alignment, the frequency of each base occurring is calculated. In this case, the
motif we are searching for is six bases wide. Therefore, we need to calculate seven
different sets of frequencies: One for the background, and one for each of the columns in
the motif. Calculating the total counts, we get:
After calculating the observed counts for each of the positions, we can convert these to
observed frequencies:
In the expectation step, the residue frequencies for the motif are used to estimate the
composition of the motif site. The expectation step attempts to maximally discriminate
between sequence positions inside and outside the site. For each sequence, each possible motif
location is considered in order to find the most probable location given the current motif.
The six base site CAGTTA beginning at base 8 is calculated to have the highest odds
probability. Therefore, it is chosen as the new site in sequence 1.
This is repeated for each of the sequences. In the maximization step, the newly chosen
sites for each of the sequences are used to recalculate the frequency table. The
expectation/maximization cycle is then repeated, until the results converge on a set of
motifs.
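The cycle described in the worked example can be sketched in a few lines of Python. This is only a minimal sketch of the "choose the single best site in each sequence" procedure walked through above, not the full probability-weighted EM used by programs such as MEME. The function names, the pseudocount of 1 (added so that no frequency is zero), and the toy sequences are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def normalize(counts):
    """Turn a dict of counts into a dict of frequencies."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def site_frequencies(seqs, starts, width, pseudo=1.0):
    """Column frequencies inside the current sites, background frequencies outside."""
    columns = [{b: pseudo for b in BASES} for _ in range(width)]
    background = {b: pseudo for b in BASES}
    for seq, start in zip(seqs, starts):
        for i, base in enumerate(seq):
            if start <= i < start + width:
                columns[i - start][base] += 1
            else:
                background[base] += 1
    return [normalize(c) for c in columns], normalize(background)

def best_site(seq, columns, background, width):
    """Start position whose window has the highest motif-to-background log-odds."""
    def log_odds(start):
        return sum(math.log(columns[j][seq[start + j]] / background[seq[start + j]])
                   for j in range(width))
    return max(range(len(seq) - width + 1), key=log_odds)

def refine_sites(seqs, starts, width, max_rounds=50):
    """Re-estimate frequencies, re-choose sites, and repeat until nothing changes."""
    starts = list(starts)
    for _ in range(max_rounds):
        columns, background = site_frequencies(seqs, starts, width)
        new_starts = [best_site(seq, columns, background, width) for seq in seqs]
        if new_starts == starts:
            break
        starts = new_starts
    return starts

# Toy data (hypothetical): the word TATCA is planted once in each sequence.
toy = ["GGTATCAG", "CTATCACC", "GGGTATCA"]
print(refine_sites(toy, [0, 0, 0], width=5))
```

Like any EM-style procedure, the result depends on the initial guess, so in practice the refinement is usually restarted from several different starting alignments.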
MEME locates one or more ungapped patterns in a single DNA or protein sequence, or in
a series of sequences. A search is conducted on a variety of motif widths in order to
determine the most likely width for the profile. This likelihood is based on the log
likelihood score calculated after the EM algorithm. One of three types of motif models
can be chosen:
Various prior knowledge can be added to MEME, including the expected number of
motifs, the expected length of the motif, and whether or not the motif is palindromic
(only applicable for DNA sequences).
Gibbs Sampling
Gibbs Sampling is another statistical method similar in nature to the EM algorithms.
Gibbs sampling combines both EM and simulated annealing techniques in order to
determine a maximal local alignment of multiple sequences.
The idea behind Gibbs sampling is to determine the most probable pattern common to all
of the sequences by sliding them back and forth until the ratio of the motif probability to
the background probability is a maximum.
In the first step of Gibbs Sampling, the predictive update step, a random start position for
the motif is chosen in all of the sequences except one that is chosen either in a random or
specified order. The initial alignment of the randomly assigned motifs is used to
calculate the residue frequencies in each position of the motif, and the background
frequencies. This is done in a manner similar to the EM algorithm. In the second step, the
sampling step, the ratio of model to background probabilities is used as the weight for each of the
possible motif starting positions in the left-out sequence. These weights are normalized
by dividing by their sum, resulting in a probability for each motif position. A motif start
position is then chosen based on a random sampling with the given weights. This process
is repeated until the residue frequencies in each column do not change. The sampling
step is then repeated for a different initial random alignment.
Because the Gibbs sampler draws each motif position at random according to its probability,
rather than always taking the most probable position, it can escape local maxima that might
trap EM algorithms.
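The predictive update and sampling steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any published Gibbs sampler: the function names, the pseudocount of 1, the fixed number of sweeps (used instead of a convergence test), and the seeded random generator are all choices made here for clarity.

```python
import random

BASES = "ACGT"

def normalize(counts):
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def frequencies_without(seqs, starts, width, left_out, pseudo=1.0):
    """Predictive update: motif and background frequencies from all but one sequence."""
    columns = [{b: pseudo for b in BASES} for _ in range(width)]
    background = {b: pseudo for b in BASES}
    for k, (seq, start) in enumerate(zip(seqs, starts)):
        if k == left_out:
            continue
        for i, base in enumerate(seq):
            if start <= i < start + width:
                columns[i - start][base] += 1
            else:
                background[base] += 1
    return [normalize(c) for c in columns], normalize(background)

def position_probabilities(seq, columns, background, width):
    """Model:background ratio for every start position, normalized to sum to 1."""
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            base = seq[start + j]
            ratio *= columns[j][base] / background[base]
        weights.append(ratio)
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sample(seqs, width, sweeps=100, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for left_out in range(len(seqs)):
            columns, background = frequencies_without(seqs, starts, width, left_out)
            probs = position_probabilities(seqs[left_out], columns, background, width)
            # Sampling step: draw a start position in proportion to its probability.
            starts[left_out] = rng.choices(range(len(probs)), weights=probs)[0]
    return starts

print(gibbs_sample(["GGTATCAG", "CTATCACC", "GGGTATCA"], width=5))
```

Because a position is drawn rather than maximized, two runs with different seeds may visit different alignments, which is exactly the property that lets the sampler leave a poor local solution.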
One problem with creating a model of a sequence alignment that is then used to search
databases is that there is a bias towards the training data. For instance, one column in a
motif may contain a completely conserved residue. However, such an occurrence will
make it highly unlikely to detect a new member of the family that doesn’t have the same
residue in that position. In addition, the residues found in a specific column may not be
highly representative of the family as a whole, especially if a small training set is used.
In order to get around these problems, the idea of pseudocounts is introduced in order to
estimate the probabilities. So now the estimated probability is changed from a frequency
of counts in the data to the following form:
$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$
where $P_{ca}$ is the probability of seeing residue a in column c; $n_{ca}$ is the count of residue a
in column c; $b_{ca}$ is the number of pseudocounts for residue a in column c; $N_c$ is the number of
residues in column c; and $B_c$ is the total number of pseudocounts in column c.
These probabilities are then converted into a log-odds form (usually $\log_2$, so the
information can be reported in bits) and placed in the PSSM.
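As an illustration, the pseudocount formula and the log-odds conversion can be combined in a short function. The sketch below makes two assumptions that are not stated in the text: the pseudocounts for a column are distributed in proportion to the background frequencies, so that b_ca = B_c × (background frequency of a), and the background is uniform unless one is supplied. The function and parameter names are hypothetical.

```python
import math

def build_pssm(sites, alphabet="ACGT", background=None, total_pseudocounts=1.0):
    """Build a log2-odds PSSM from equal-length, ungapped motif instances."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    n_c = len(sites)                      # N_c: residues observed in each column
    pssm = []
    for c in range(len(sites[0])):
        column = [site[c] for site in sites]
        scores = {}
        for a in alphabet:
            n_ca = column.count(a)                     # observed count
            b_ca = total_pseudocounts * background[a]  # assumed pseudocount split
            p_ca = (n_ca + b_ca) / (n_c + total_pseudocounts)
            scores[a] = math.log2(p_ca / background[a])
        pssm.append(scores)
    return pssm

pssm = build_pssm(["TATCA", "TATCA", "TATCT", "TAGCA"])
for position, scores in enumerate(pssm, start=1):
    print(position, {a: round(s, 2) for a, s in scores.items()})
```

Note that a residue never seen in a column still receives a finite (negative) score, which is the point of adding pseudocounts.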
To score a sequence against a PSSM, the value for the first residue of the
sequence occurring in the first column is looked up in the PSSM. Similarly,
the value for each following residue in its corresponding column is looked up. These values are added
(since they are logarithms) to produce a summed log-odds score, S. This score can be
converted to an odds score using the formula $2^S$. The odds scores for the motif beginning
at each position can be summed together and normalized to produce a probability of the
motif occurring at each location.
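A small sketch of this scanning procedure follows. The two-column PSSM is a made-up toy (its values are not derived from any real alignment), and the function name is hypothetical; the scores are assumed to be log2 odds.

```python
def scan(sequence, pssm):
    """Return the summed log-odds score and the probability for each start position."""
    width = len(pssm)
    log_odds = [sum(pssm[j][sequence[start + j]] for j in range(width))
                for start in range(len(sequence) - width + 1)]
    odds = [2.0 ** s for s in log_odds]        # convert S to an odds score, 2^S
    total = sum(odds)
    probabilities = [o / total for o in odds]  # normalize over all start positions
    return log_odds, probabilities

# Hypothetical PSSM favouring the dinucleotide "AT".
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]
scores, probs = scan("GATC", toy_pssm)
print(scores)
print([round(p, 3) for p in probs])
```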
Information theory can give an appreciation for the amount of information contained
within each sequence.
When there is no information contained within a column, the amount of uncertainty can
be measured as $\log_2 20 \approx 4.32$ bits for amino acids, since there are 20 amino acids. For
nucleic acid sequences, the amount of uncertainty can be measured as $\log_2 4 = 2$ bits. PSSMs
can be used in order to reduce the uncertainty. If only one amino acid is found in a
particular column, then the uncertainty is 0 – there is only one choice. If there are two
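amino acids occurring with equal probability, then the uncertainty in deciding
which residue it is equals 1 bit. In general, the uncertainty of column c is
$$H_c = -\sum_{\text{residues } a} p_{ac} \log_2 p_{ac}$$
where $p_{ac}$ is the probability of residue a in column c, and
the uncertainty for the whole PSSM can be calculated as a sum over all columns:
$$H = \sum_{\text{all columns } c} H_c$$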
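The uncertainty values quoted above can be checked numerically. The sketch below assumes the usual Shannon definition of column uncertainty, H_c = -Σ p log2 p, with p estimated from the raw residue frequencies in the column; the function names are hypothetical.

```python
import math
from collections import Counter

def column_uncertainty(column):
    """Shannon uncertainty (in bits) of one alignment column."""
    n = len(column)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(column).values())

def pssm_uncertainty(sites):
    """Total uncertainty: the sum of the column uncertainties."""
    return sum(column_uncertainty(column) for column in zip(*sites))

print(column_uncertainty("AAAA"))   # one residue only: no uncertainty
print(column_uncertainty("AATT"))   # two equally likely residues
print(column_uncertainty("ACGT"))   # all four bases equally likely
```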
Sequence Logos
One way to examine a particular PSSM is to display it graphically. Sequence logos are one way
to do so, by illustrating the information in each column of a motif. Such a graph can
indicate which residues and which columns are the most important as far as sequence
conservation is concerned. The height of the logo is calculated as the amount by which
uncertainty has been decreased.
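A sketch of the height calculation for a single column is given below. It assumes the common logo convention in which the total column height is the maximum uncertainty (log2 of the alphabet size) minus the observed uncertainty, and each letter's share of that height is proportional to its frequency. Small-sample corrections are ignored, and the function name is hypothetical.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=4):
    """Height in bits of each residue in one logo column."""
    n = len(column)
    freqs = {residue: k / n for residue, k in Counter(column).items()}
    uncertainty = sum(-p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - uncertainty   # decrease in uncertainty
    return {residue: p * information for residue, p in freqs.items()}

print(logo_heights("AAAA"))   # fully conserved DNA column: 2 bits, all for A
print(logo_heights("AATT"))   # 1 bit of information shared by A and T
```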
In addition to the entropy measure given before, a relative entropy measure could be
calculated as well. Relative entropy takes into account not only the data in the columns
of the motif, but also the overall composition of the organism being studied. Relative
entropy can be measured as:
$$R_c = -\sum_{\text{residues } a} f_{ac} \log_2 \left( \frac{p_{ac}}{b_a} \right)$$
Sequence editors allow the user to take a given multiple alignment and manually fix it.
Examples of sequence editors include:
CINEMA
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/kit.html
GeneDoc
http://www.psc.edu/biomed/genedoc/
MACAW
http://ncbi.nlm.nih.gov/pub/schuler/macaw
BoxShade
http://www.ch.embnet.org/software/BOX_form.html
Biological sequence data formats
IUPAC Codes
In order to standardize sequence data, the Nomenclature Committee of the International
Union of Biochemistry and the International Union of Pure and Applied Chemistry has
established a standard code to represent bases that are uncertain or ambiguous. The code,
often referred to as the IUPAC code, is as follows:
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
Any character other than those listed above (with the exception of the gap character
‘-’) is an error that nearly all sequence analysis programs will reject.
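The code table above translates directly into a lookup structure. The following sketch shows one way to check a sequence for illegal characters and to test whether a base is compatible with an ambiguity code; the function names are hypothetical.

```python
# IUPAC nucleotide codes, copied from the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def invalid_characters(sequence):
    """Characters that are neither IUPAC codes nor the gap character."""
    return sorted({ch for ch in sequence.upper() if ch not in IUPAC and ch != "-"})

def matches(code, base):
    """True if the (unambiguous) base is one of the bases the code stands for."""
    return base.upper() in IUPAC[code.upper()]

print(invalid_characters("ACGTRYN-"))  # nothing illegal
print(invalid_characters("ACGJX"))     # J and X are not nucleotide codes
print(matches("R", "a"), matches("Y", "G"))
```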
In addition to the nucleic acid codes, a standard single letter and three letter amino acid
code has been formulated by IUPAC as well. The table for this code is as follows:
1-letter 3-letter description
A Ala Alanine
R Arg Arginine
N Asn Asparagine
D Asp Aspartic acid
C Cys Cysteine
Q Gln Glutamine
E Glu Glutamic acid
G Gly Glycine
H His Histidine
I Ile Isoleucine
L Leu Leucine
K Lys Lysine
M Met Methionine
F Phe Phenylalanine
P Pro Proline
S Ser Serine
T Thr Threonine
W Trp Tryptophan
Y Tyr Tyrosine
V Val Valine
B Asx Aspartic acid or Asparagine
Z Glx Glutamine or Glutamic acid
X Xaa or Xxx Any amino acid
FASTA
Fasta sequence format is one of the most basic and widespread sequence formats. A
sequence in fasta format has as its first line a descriptor beginning with a ‘>’ character.
The following lines contain the sequence (either nucleotide or amino acid) using
standard one-letter symbols. This format is extremely useful for sequence analysis
programs, since it is devoid of numerical and nonsequence characters (with the exception
of the newline character).
Example Fasta Sequence:
>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]
MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF
SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL
AHRYH
Note the first line begins with ‘>’, which in this case is followed by gi, indicating that the
next field, delimited by ‘|’, is the GenInfo identifier (GI number) assigned by NCBI. Following the
GI number is the keyword ‘ref’, indicating that the next field is the RefSeq accession and
version of this sequence. The final field is the description. Note that nearly all sequence
based programs will treat anything following the ‘>’ as a comment and disregard it (or
only use it as a sequence descriptor). There are, however, a few sequence analysis
programs that expect the sequences to be in a strict fasta format.
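Because the format is so simple, a parser takes only a few lines. The sketch below reads FASTA text into (description, sequence) pairs and splits an NCBI-style description line into its ‘|’-delimited fields; the function names are hypothetical, and real-world files may use other identifier conventions.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

def split_ncbi_description(description):
    """Split 'gi|...|ref|...| free text' into a dict of identifiers and the text."""
    identifiers, _, free_text = description.partition(" ")
    fields = identifiers.split("|")
    return dict(zip(fields[0::2], fields[1::2])), free_text

header = "gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]"
print(split_ncbi_description(header))
print(parse_fasta(">seq1 example\nMLTAEEK\nAAVTAFW\n"))
```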
GenBank
GenBank is the National Center for Biotechnology Information’s nucleic acid and protein
sequence database. It is the most widely used source of biological sequence data.
The GenBank file format contains information about the sequence, including literature
references, functions of the sequence, locations of various features, etc.
The information in GenBank records is organized into fields, each with an identifier,
justified to the farthest left column. Some identifiers have additional subfields. The
actual sequence data lies between the identifier ORIGIN and the ‘//’ which signals the
end of a GenBank record.
Example GenBank Sequence:
SwissProt
Databases
GenBank
DDBJ
EMBL
SwissProt
BLOCKS
PFAM
Using Entrez
Complete Process:
>sp|P02023|HBB_HUMAN
--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQR
FFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTF
ATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA
GVANALAHKYH------
>sp|P02062|HBB_HORSE
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQR
FFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTF
AALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVA
GVANALAHKYH------
>sp|P01922|HBA_HUMAN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKT
YFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA
SVSTVLTSKYR------
>sp|P01958|HBA_HORSE
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKT
YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGAL
SNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLS
SVSTVLTSKYR------
>sp|P02185|MYG_PHYCA
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLE
KFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAEL
KPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALE
LFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE
FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKL
RDLSGKHAKSFQVDPQYFKVLAAVIADTVAAG---------DAGFEKLMS
MICILLRSAY-------
>sp|P02240|LGB2_LUPLU
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKD
LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
KNLGSVHVSKGVAD-AHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYD
ELAIVIKKEMNDAA---
3) View alignment using various methods
4) Find Blocks in the alignment (BLOCKS)
Multiple sequence alignments yield insight into the evolutionary history of the
sequences: sequences that are most similar are likely to be recently derived from a
common ancestor sequence.
If the sequences in a multiple alignment have quite a bit of variation, then it is difficult to
create a multiple sequence alignment because of the many different combinations of substitutions,
insertions, and deletions that can be used.
FASTA Format
In Fasta Format, each sequence in the multiple alignment starts with a Fasta description
line (beginning with a ‘>’). Following the description line is the sequence data. The gap
character ‘-’ is found in locations corresponding to gaps in the sequence when the
multiple alignment was created.
>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM
Stockholm Format
Stockholm Format (http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html)
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777
O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS *
#=GR O31699/88-139 IN
//
1 50
IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235 TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236 TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237 TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT
51 100
IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235 GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG
IXI_236 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G
IXI_237 GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G
101 131
IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236 SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237 SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE
PileUp
//
JC2395 .NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE .NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS
JC2395 -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59
FASA_MOUSE -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59
KPEL_DROME MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59
:* *. : :::* :: .::::* :. :. : .: ::* *
JC2395 YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91
FASA_MOUSE YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91
KPEL_DROME GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89
.:.:: * *: ::* : ::
Phylip
3 92
JC2395 -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS
PIR Format
>P1;JC2395
-NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW
YQSHGKTGACQALIQGLRKANRCDIAEEIQAM
*
>P1;FASA_MOUSE
-NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW
YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
*
>P1;KPEL_DROME
MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW
GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY
*
GDE
%JC2395
nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy
qshgktgacqaliqglrkanrcdiaeeiqam
%FASA_MOUSE
nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy
qshgksdayqdlikglkkaecrrtldkfqdm
%KPEL_DROME
--mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni
wggqynhtvqtlfalfkklklhnamrlikdy
Nexus
#NEXUS
BEGIN DATA;
dimensions ntax=3 nchar=91;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=PROTEIN gap= -;
matrix
JC2395 NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE
FASA_MOUSE NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE
KPEL_DROME --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG
JC2395 QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM
FASA_MOUSE QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
KPEL_DROME RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY
;
end;
The General Feature Format (GFF) was developed so that annotations could be readily parsed by
a number of programs to quickly determine the location of various features. Example
uses of GFF include importing data into ACE formats for quick feature viewing, and for
creating sequence images complete with features.
http://www.sanger.ac.uk/Software/formats/GFF/
SeqIO
ReadSeq
Searching Sequence Databases
The most common type of search used is to compare a single query sequence against a
database. Such a search is typically performed to gather information on the potential
function of a gene. This is done by examining the search results and the known functions of the
sequences that are similar to the query at the sequence level. Such a search can be expanded
to find sequences more distantly related to the query (at least at the sequence
level). Such sequence similarity searches can yield information concerning related
proteins that may lead to the discovery of a family that can then be characterized, and
perhaps multiply aligned and profiled.
With all sequence searches, it is important to consider the sensitivity and the selectivity
of the algorithms. Sensitivity refers to the ability to find most of the related members
(reduction of false negatives) while selectivity refers to the ability to detect only members
of the family you are interested in studying (reduction of false positives). This is
important to keep in mind when interpreting alignment results and assigning a function to
a sequence, since this assignment may be given through transitive relationships.
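One common way to put numbers on these two terms is shown below. The formulas are an assumption about how the definitions above would be quantified (sensitivity as the fraction of true family members found, selectivity as the fraction of reported hits that are true members), and the counts are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of the real family members that the search found."""
    return true_positives / (true_positives + false_negatives)

def selectivity(true_positives, false_positives):
    """Fraction of the reported hits that really belong to the family."""
    return true_positives / (true_positives + false_positives)

# Hypothetical search: the family has 50 members; 45 hits are reported, 40 of them correct.
print(sensitivity(40, 10))
print(round(selectivity(40, 5), 3))
```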
In addition, since multiple codon sequences code for the same amino acid, it is possible
that the translated amino acid sequences could be identical, yet the underlying nucleic
acids could be different.
AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
||||| | || || || | || || || | |
AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
which gives 21 identical bases out of 36, for a percent identity of 58%.
MELVISALIVE
The two translated protein sequences are 100% identical. Therefore, if the nucleotide
region we are interested in searching for is known to be in a protein coding region, it
would be beneficial to translate the DNA sequence into a protein sequence. Generally,
both the target (or database) and the query sequences are translated into all six reading
frames and compared to one another. (Recall that there are three reading frames in the
forward direction, and three in the reverse complement). Now, rather than four
comparisons (target forward and reverse complement against query forward and
reverse complement), there are thirty-six comparisons to be made. Therefore, while
translating the sequences into proteins will lead to better results, the search
will take approximately nine times as long.
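The comparison above is easy to reproduce. The sketch below counts identical positions between the two nucleotide sequences, translates them with the standard genetic code, and produces all six reading frames. The helper names are hypothetical, and U is treated as T so that the sequences can be used exactly as printed.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def to_dna(seq):
    return seq.upper().replace("U", "T")

def translate(seq, frame=0):
    """Translate one reading frame; '*' marks a stop codon."""
    dna = to_dna(seq)
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))

def reverse_complement(seq):
    return to_dna(seq).translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(seq):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(seq)
    return [translate(seq, f) for f in range(3)] + [translate(rc, f) for f in range(3)]

def percent_identity(a, b):
    """Number and percentage of identical positions in two equal-length sequences."""
    same = sum(x == y for x, y in zip(a, b))
    return same, 100.0 * same / len(a)

seq1 = "AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA"
seq2 = "AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA"
print(percent_identity(seq1, seq2))
print(translate(seq1), translate(seq2))
print(len(six_frames(seq1)) * len(six_frames(seq2)), "frame-against-frame comparisons")
```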
Scoring matrices
Most database searching utilities will allow the user to change between various scoring
systems. At one time, the default scoring matrix used for amino acid searches was the
Dayhoff PAM250 matrix. In most instances, the PAM250 matrix has been replaced by
the BLOSUM62 matrix, since the BLOSUM matrices were based on more sequence data.
FASTA
FASTA was the first rapid search method developed for database searching. FASTA
uses a heuristic algorithm to speed up the process of locating similar regions. Unlike
dynamic programming, FASTA is not guaranteed to lead to the optimal solution.
However, the search time is roughly 50 times faster than DP solutions.
FASTA Algorithm
In the initial stage of searching for regions of similarity, FASTA uses a hashing approach.
For each of the sequences being compared, a table is constructed showing the positions of
each word of length k, or k-tuple. The relative position (offset) of each word shared by the two
sequences is calculated by subtracting its position in the first sequence from its
position in the second. Words having the same offset are in phase and reveal a longer
region of alignment between the two sequences.
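The hashing step can be sketched as follows. This covers only the k-tuple lookup and the offset tally; the later rescoring, joining, and banded alignment stages are not shown. The function names and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def ktuple_positions(seq, k):
    """Table mapping each word of length k to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def offset_counts(first, second, k):
    """Count shared k-tuples by offset (position in second minus position in first)."""
    first_table = ktuple_positions(first, k)
    second_table = ktuple_positions(second, k)
    offsets = Counter()
    for word, first_positions in first_table.items():
        for p1 in first_positions:
            for p2 in second_table.get(word, []):
                offsets[p2 - p1] += 1
    return offsets

offsets = offset_counts("ACGTAC", "TTACGTAC", k=2)
print(offsets.most_common())   # the most frequent offset marks the best diagonal
```

An offset that collects many words corresponds to a diagonal in a dot plot, which is why words that are "in phase" point to a longer region of similarity.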
Step 2: The ten regions with the highest density of identities are identified. The ends of
each region are trimmed to include only residues contributing to the highest score. Each
resulting region is now a partial alignment without gaps. Each is given a score (the init1
score).
Step 3: If there are several initial regions with scores greater than a cutoff value, check to
see if the trimmed initial regions can be joined to form an approximate alignment with
gaps. A similarity score is calculated as the sum of the init1 scores for each of the initial
regions minus a penalty for each gap (the initn score).
Step 4: Construct a Needleman-Wunsch optimal alignment of the query sequence and the
library sequence, considering only those residues that lie in a band 32 residues wide,
centered on the best initial region found in step 2 (the opt score).
After locating the k-tuples and grouping the ones with the same offset together, an
optimization step is invoked to piece together k-tuple alignments allowing gaps.
Using this approach, the search time increases linearly with the size of the query and
target sequences. Compared to the polynomial increase with dynamic programming,
FASTA presents a much faster alternative, particularly as the sequence size increases.
For DNA and RNA sequences, the typical size of the k-tuple in the FASTA algorithm is
4-6, while in protein sequences it is 1 or 2. The larger the k-tuple, the faster FASTA will
run, but the less thorough it will be in determining regions of similarity.
In order to determine the significance of an alignment for a target database and a query
sequence, FASTA calculates the u and lambda parameters for the extreme value
distribution, which will vary with the length and the composition of the sequences being
compared. The z-score for each alignment score is calculated as follows:
1) The average score for database sequences in the same length range is determined.
2) The average score is plotted against the logarithm of average sequence length in each
length range.
3) The points are then fitted to a straight line by linear regression.
4) A z score, the number of standard deviations from the fitted line, is calculated for
each score.
5) High-scoring, presumably related sequences, and also very low scoring alignments
that do not fit the straight line are removed from consideration.
6) Steps 1-5 are repeated one or more times.
7) The known statistical distribution of alignment scores is used to calculate the
probability that a Z score between unrelated or random sequences of the same lengths
as the query and database sequence could be greater than z. These scores follow an extreme
value distribution such that (Pearson, 2000 ISMB):
$$P(Z > z) = 1 - \exp\left(-e^{-1.2825 z - 0.5772}\right)$$
8) The expectation of observing a Z-score greater than z in a database of D
sequences is:
9) The significance of the alignment score between a sequence and a database can be
further analyzed by aligning a sequence with a shuffled library.
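The probability in step 7 is straightforward to evaluate. In the sketch below the expectation is taken to be E = D × P(Z > z), which is an assumption made here (the usual way an expectation is obtained from a per-sequence probability) rather than a formula given in the text. The function names and the database size are hypothetical.

```python
import math

def z_tail_probability(z):
    """P(Z > z) for the extreme value distribution used in step 7."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is very small.
    return -math.expm1(-math.exp(-1.2825 * z - 0.5772))

def expectation(z, database_size):
    """Assumed form: expected number of unrelated sequences scoring above z."""
    return database_size * z_tail_probability(z)

for z in (0, 3, 6, 10):
    print(z, z_tail_probability(z), expectation(z, database_size=100_000))
```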
One of the items reported in the FASTA output is a histogram showing a graphical
representation of the distribution of the normalized scores when matched with the query
sequence. These scores are expected to fall approximately into a normal distribution, and
any significant matches will fall outside the normal curve.
The first column listed in the fasta score distribution is the z’ score, which is a z score
normalized to a mean of 50 and a standard deviation of 10. The second column lists the
number of optimized scores found in that range. The third column lists the number of
expected sequences to lie within a range, given an extreme value distribution and the
calculated values of u and lambda.
The “=” signs give an approximate curve for the actual distribution, while the “*”
indicates the expected score distribution.
After the statistics is a list of the best scoring hits. Note that FASTA presents at most one
highest scoring hit per sequence, whereas other alignment programs may present many.
Listed in the hits section are the description of the sequence, the z’ score, the initn, init1,
and opt scores (note the initn score is the extended hit score; the init1 score is the initial
hit score; the opt score is the score calculated by stringing together regions with gaps –
see Figure 7.2 of Mount for a more in-depth explanation) and the E score (calculated as
an estimate of the likelihood of a match occurring by chance).
After the list of the highest scoring hits are the Smith-Waterman alignments between the
query and the highest scoring hits. A ‘:’ marks an identical residue; ‘.’ denotes a conservative
substitution:
>>MERR_STAAU mercuric resistance operon regulatory protei (135 aa)
initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14
Smith-Waterman score: 298; 36.923% identity in 130 aa overlap
10 20 30 40 50 60
MerR MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA
. :. .::: :: ::.:.:.::::. : . .. : :.: . ::::.:
MERR_S MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM
10 20 30 40 50
70 80 90 100 110
MerR QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH
..: ::: :: :. . .:: .:.. ... .: :....:. : :.. .: :: :
MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP
60 70 80 90 100 110
FASTA Programs
TFASTA – compares a query protein sequence to a DNA sequence library, after the
DNA sequence library has been translated in all six reading frames.
Example
>mgstm1
MGCEN,MIDYP,MLLAY,MLLGY
LALIGN, LFASTA – Same as the FASTA program, except that multiple aligning regions
may be reported for each sequence.
BLAST
BLAST has supplanted FASTA as the most commonly used database search tool. BLAST
was developed as an improvement in speed from the FASTA suite without a sacrifice in
sensitivity.
The first step of the BLAST algorithm is to locate common words or k-tuples in the query
sequence and the target database sequences. However, BLAST does not search for every
possible k-tuple; it only considers those that are most significant. For the NCBI BLAST
program, the word length is fixed at 3 for proteins and 11 for nucleic acids. This k-tuple size
is referred to as the word length, and is the minimum length needed to achieve a word
score that is high enough to be significant but not so long as to miss short but significant
patterns.
MSP – Maximal Segment Pair: The highest scoring pair of identical length segments
chosen from two sequences. The boundaries of an MSP are chosen to maximize its
score, so an MSP can be of any length.
The number of MSPs with a score greater than a cutoff score S is reported.
BLAST minimizes the time spent on sequence regions where the score is unlikely to
exceed this cutoff score.
The main strategy of BLAST is to seek only segment pairs that contain a word pair with a
score of at least T. Any such hit is extended to determine if it is contained within a
segment pair whose score is greater than or equal to the cutoff score S.
The scanning phase of BLAST locates the words within the sequences in linear time.
One method is to map each possible word to an integer so that it can be used as an index
into an array. For instance, if the word size were 4 and amino acids were used, there would be
$20^4 = 160{,}000$ entries in the array. A second approach is the use of a deterministic
finite state automaton.
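The first approach, mapping each word to an integer, amounts to reading the word as a base-20 number. The sketch below illustrates the idea; the ordering of the amino acids and the function names are arbitrary choices made here, and the peptide is a made-up example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def word_to_index(word):
    """Read the word as a base-20 number, giving a unique array index."""
    index = 0
    for residue in word:
        index = index * len(AMINO_ACIDS) + CODE[residue]
    return index

def index_words(sequence, word_size):
    """Array with one slot per possible word, holding the positions where it occurs."""
    table = [[] for _ in range(len(AMINO_ACIDS) ** word_size)]
    for pos in range(len(sequence) - word_size + 1):
        table[word_to_index(sequence[pos:pos + word_size])].append(pos)
    return table

table = index_words("MKVLAAGMKVL", word_size=4)
print(len(table))                     # one entry for each possible 4-letter word
print(table[word_to_index("MKVL")])   # positions where MKVL starts
```

A single pass over the sequence fills the array, and looking up any word afterwards is one index operation, which is what makes the scanning phase linear in time.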
Hit Extension
Initial hits are then examined, and extended in either direction until they fall below a
certain score threshold.
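Ungapped extension of a hit can be sketched as below. Two details are assumptions made for illustration: the stopping rule is written as a "drop-off" test (extension stops once the running score falls more than x_drop below the best score seen so far), and matches and mismatches are scored +5 and -4. The function name and the sequences are hypothetical.

```python
def extend_hit(query, target, q_start, t_start, word_len,
               match=5, mismatch=-4, x_drop=20):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end) of the best segment found,
    with query_end exclusive.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(q, t, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            running += pair_score(query[q], target[t])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break                      # score has dropped too far; stop extending
            q += step
            t += step
        return best, best_len

    seed = sum(pair_score(query[q_start + i], target[t_start + i])
               for i in range(word_len))
    right, right_len = extend(q_start + word_len, t_start + word_len, +1)
    left, left_len = extend(q_start - 1, t_start - 1, -1)
    return seed + left + right, q_start - left_len, q_start + word_len + right_len

query = "AAACGTACGTTT"
target = "CCACGTACGTGG"
score, start, end = extend_hit(query, target, q_start=4, t_start=4, word_len=3)
print(score, query[start:end])
```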
In order to get around the problem of using uninformative hits, BLAST stores a list of
words that are found much more often than expected by chance. Hits to these words are
discarded from consideration.
1) The sequence is optionally filtered to remove low-complexity regions that will not
lead to meaningful sequence alignments.
2) A list of words of the predefined word length (3 for amino acids; 11 for DNA
sequences) in the query sequence is made.
3) The query words are evaluated for an exact match with a word in any database
sequence, using substitution scores for amino acids, and +5,-4 scoring scheme for
DNA.
4) A cutoff score, called neighborhood word score threshold (T) is selected to reduce
the number of possible matches to the word to be the most significant ones. This
pares down the list of possible matching words to those resulting in the most
significant alignments.
5) The procedure is repeated for each word in the query sequence.
6) The remaining high-scoring words for each possible match to a word are
organized into an efficient search tree.
7) Each database sequence is scanned for an exact match to one of the words in the
search tree, one position at a time. If a match is found, it is used to seed a
possible ungapped alignment between the query and database sequences.
8) (UNGAPPED BLAST – VERSION 1.0) In the original BLAST suite of
programs, an attempt is made to extend an alignment from the matching words in
each direction along the sequences, as long as the score does not drop below a
certain threshold. At this point, a larger stretch of sequence (called the HSP, or
high-scoring segment pair), which has a larger score than the original word, may
have been found.
9) The score of each HSP is compared against a cutoff score S, which is empirically
determined.
10) The statistical significance for each HSP is calculated using the Karlin-Altschul
statistics and the extreme value distribution, as previously discussed with
sequence alignments. Recall that the probability, p, of observing a score S greater
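than or equal to x is given by the equation:
\[ P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right) \]
where
\[ u = \frac{\ln(K m' n')}{\lambda} \]
\[ m' \approx m - \frac{\ln(Kmn)}{H} \]
\[ n' \approx n - \frac{\ln(Kmn)}{H} \]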
where H is the average expected score per aligned pairs of residues in an alignment of
two random sequences; m and n are the length of the query and database; K and lambda
are parameters calculated based on the sequences and the scoring scheme.
These effective, or reduced, lengths are used as a correction factor in order to allow
alignments starting near the end of one of the sequences to be detected.
The expectation, E, of seeing a score S ≥ x in a database of D sequences is
approximately given by the Poisson distribution,
\[ E \approx 1 - e^{-P(S \ge x)\, D} \]
11) Two or more HSP regions may be combined to a longer alignment region, even
though the individual HSPs may result in a lower score.
12) Smith-Waterman type alignments are shown for the query sequence with each of
the matched sequences in the database. BLAST-2 can produce alignments with
gaps, while BLAST-1 cannot.
13) When the expected score for a given database sequence satisfies the threshold for
E, the match score is reported.
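The significance calculation in step 10 can be sketched numerically. This is only an illustration under stated assumptions: u is taken to be ln(K m' n') / λ, and the values of K, λ, and H below are made up for the example rather than taken from any real scoring scheme.

```python
import math

def effective_lengths(m, n, K, H):
    """Reduced query and database lengths m' and n'."""
    correction = math.log(K * m * n) / H
    return m - correction, n - correction

def p_value(x, m, n, K, lam, H):
    """P(S >= x) from the extreme value distribution."""
    m_eff, n_eff = effective_lengths(m, n, K, H)
    u = math.log(K * m_eff * n_eff) / lam   # assumed definition of u
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def expectation(p, D):
    """E as written in these notes: 1 - exp(-p * D)."""
    return 1.0 - math.exp(-p * D)

# Illustrative parameter values only.
K, LAM, H = 0.13, 0.32, 0.40
p = p_value(80, m=250, n=1_000_000, K=K, lam=LAM, H=H)
print(p, expectation(p, D=1000))
```

Raising the score x lowers both P and E, which is why a cutoff on E acts as a cutoff on alignment score.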
BLAST Programs
BLASTP: Compares a protein query sequence against a protein database, allowing for
gaps
BLASTN: Compares a DNA query sequence against a DNA database, allowing for gaps
BLASTX: Compares a DNA query sequence, translated into all six reading frames,
against a protein database, allowing for gaps
TBLASTN: Compares a protein query sequence against a DNA database, translated into
all six reading frames, allowing for gaps
TBLASTX: Compares a DNA query sequence, translated into all six reading frames,
against a DNA sequence database, translated into all six reading frames. TBLASTX does
not allow for gaps.
There are a number of different BLAST options. One list of these options and a
description of them is available through the WU-BLAST home page:
http://blast.wustl.edu/blast/README.html
Another approach to searching databases is the Bayes Block Aligner. The methodology
behind the Bayes Block Aligner is to find all possible blocks located within two
sequences. A larger number of possible alignments between two sequences are generated
by aligning combinations of blocks. Gaps will be present between the blocks.
The Bayes Block Aligner uses Bayesian statistics to derive the posterior probabilities of
each alignment assuming various scoring models and different number of blocks. This
approach has been shown to locate some weak, yet real, similarities between sequences.
SSAHA
SSAHA stands for Sequence Search and Alignment by Hashing Algorithm. It can align
DNA sequences by converting the sequence information into a ‘hash table’ data structure
that can then be searched very rapidly for matches.
SSAHA is best suited towards problems in locating identical or near identical matches.
The hash word length is defined to be 10 bases by default. Example applications include
SNP detection; rapid sequence assembly; detecting order and orientation of contigs.
SSEARCH
While dynamic programming algorithms can be painfully slow when searching against
large databases, they are more likely to discover sequences that are distantly related to
one another. SSEARCH is one program that implements the Smith-Waterman approach
to sequence alignment.
ftp.virginia.edu/pub/fasta
SSEARCH is part of the FASTA suite of programs. This approach compares a protein
sequence to another protein sequence or sequence database (or DNA sequence to a DNA
sequence or database) using enhanced Smith-Waterman local sequence alignments.
BLAT
BLAT (BLAST-Like Alignment Tool), developed by Jim Kent at UCSC, is used to
locate smaller regions of higher identity within genomic assemblies. BLAT on nucleic
acids will quickly identify regions at least 95% similar consisting of 40 bases or more.
More divergent and shorter sequence alignments may be missed. BLAT on amino acids
will find sequences at least 80% similar consisting of at least 20 amino acids.
DNA BLAT works by keeping an index of an entire genome in memory, where the index
consists of all non-overlapping 11-mers except those involved in repetitive elements. For
the human genome, this corresponds to a little less than a gigabyte of RAM. Protein
BLAT works in the same fashion, except that 4-mers are used. The protein index is
slightly larger than 2 gigabytes for humans.
BLAT is a very fast tool for localizing highly similar regions. However, distant
homologies are not detected. The typical use for BLAT is to localize a specific sequence
on a genome. This can be very useful, since the BLAT web interface directly ties to the
UCSC GoldenPath genomic browser.
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human
SEQUENCE FILTERING
Low-Complexity Regions
Low-complexity regions are amino acid or DNA sequence regions that offer very low
information due to their highly biased content. Examples of low complexity regions
include histidine-rich domains in amino acids, poly-A tails in DNA sequences, poly-G
tails in nucleotides, runs of purines, runs of pyrimidines, runs of a single amino acid, etc.
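The complexity K of a window of length L, over an alphabet of N symbols in which
symbol i occurs n_i times, is calculated as:
\[ K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{\text{all } i} n_i!} \right) \]
For example, for a window of four bases in which each base occurs exactly once
(L = 4, N = 4), K = 1/4 log_4(24/1) = 0.573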
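The complexity calculation can be sketched in a few lines. The sketch assumes L is the window length, N the alphabet size, and n_i the number of times symbol i occurs in the window; the function name complexity is chosen for the example.

```python
import math
from collections import Counter

def complexity(window, alphabet_size):
    """K = (1/L) * log_N( L! / product of n_i! ) for one window."""
    L = len(window)
    arrangements = math.factorial(L)
    for count in Counter(window).values():
        arrangements //= math.factorial(count)   # stays an exact integer at every step
    return math.log(arrangements, alphabet_size) / L

print(round(complexity("ACGT", 4), 3))   # a window with four different bases
print(complexity("AAAA", 4))             # a run of a single base
```

A window made of one repeated symbol has only one possible arrangement, so its complexity is 0, while a window that uses every symbol equally scores highest.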
Short, periodic repeats
Another possible source of low information are regions of DNA or amino acid sequences
with repeats with a short periodicity (such as 10 bases long). Examples of such
sequences commonly found in DNA are short tandem repeats.
Fortunately, there are programs out there to remove such sequences from a query and
target database before it is searched.
SEG, PSEG are programs developed at NCBI that are used to mask out low-complexity
regions in amino acid sequences.
NSEG is the NCBI program that masks out low-complexity regions in nucleic acid
sequences.
DUST is another program that removes low-complexity regions in DNA sequences.
Each of these programs calculates the complexity for a given window using the algorithm
defined above and masks out regions based on a given complexity threshold.
XNU is a program that will locate internal repeats with a short periodicity.
Interspersed Repeats
In addition to short, periodic repeats, genomes are filled with longer interspersed
repetitive elements. These can be in the form of short-interspersed elements (SINES) on
the order of 300 bases long, or long-interspersed elements (LINES) on the order of 1-2
KB long. There are other classes of interspersed repeats, including Mammalian-
interspersed repeats (MIRS), and other elements that have been transposed and fixed into
genomes through viral-like events. Transposable elements are numerous in many plant
species, which leads to large genome sizes.
Consider the human genome. Somewhere around 50% of its composition comes from
interspersed repeats. Thus, these regions should be masked out as well.
RepeatMasker is a program that takes a query sequence and compares it against a set of
target repetitive element databases. Those regions in the query which match a repetitive
element are masked out. Run in its native mode, RepeatMasker calls cross-match, which
implements a Smith-Waterman dynamic programming algorithm to locate instances of
repetitive elements. Improvements have been made to the RepeatMasker software so that
it can use the speed of BLAST to speed up the time it takes to locate repeats. This
updated version is called MaskerAid.
The libraries of repeats that RepeatMasker uses are maintained by the Genetic
Information Research Institute in their Repbase database.
Soft versus Hard masking
There are generally two approaches to masking out repetitive elements. The first
approach, in which the repetitive element sequences are replaced by either N or X
characters, is called hard masking. The second approach, which is becoming a more
popular approach to use, maintains the sequence data, but denotes repetitive portions of
sequences with lowercase letters. This can be a preferred approach for many reasons.
However, when using either approach it is important to understand how the sequence
analysis software you are using will treat each of these.
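The relationship between the two conventions can be illustrated with a small sketch that converts a soft-masked sequence into a hard-masked one. The function names are made up for the example; use N as the mask character for DNA and X for protein.

```python
def hard_mask(sequence, mask_char="N"):
    """Replace every lowercase (soft-masked) character with the mask character."""
    return "".join(mask_char if base.islower() else base for base in sequence)

def masked_fraction(sequence):
    """Fraction of a soft-masked sequence that is marked as repetitive."""
    return sum(base.islower() for base in sequence) / len(sequence)

soft = "ACGTacgtACGT"
print(hard_mask(soft), masked_fraction(soft))
```

Soft masking loses no information, since the hard-masked form can always be derived from it, but the reverse is not possible.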
The method to search a database with a PSSM is very similar to seeing whether or not a
sequence belongs to a family that the PSSM defines. Every possible sequence position in
each database sequence is evaluated as a possible match position by sliding the PSSM
along one sequence at a time. Positions with high scores are the best matches, and can be
quickly identified.
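Sliding a PSSM along a sequence can be sketched as below. The three-column matrix and the default score for residues missing from a column are invented for illustration; a real PSSM has a score for every residue in every column.

```python
def scan_with_pssm(sequence, pssm, default=-4):
    """Score every window of the sequence against the PSSM.

    pssm is a list of columns, each a dict mapping residue -> score.
    Returns one score per possible start position.
    """
    width = len(pssm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(col.get(res, default) for col, res in zip(pssm, window)))
    return scores

def best_hit(sequence, pssm):
    """Return (start position, score) of the highest-scoring window."""
    scores = scan_with_pssm(sequence, pssm)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

TOY_PSSM = [{"H": 5}, {"A": 3, "D": 2}, {"K": 4, "R": 3}]   # hypothetical scores
print(best_hit("GGHAKGG", TOY_PSSM))
```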
Certain databases (such as ProSite) allow the databases to be searched using a regular
expression.
PSI-BLAST
Of course, caution should be used with PSI-BLAST since a greedy algorithm is used in
the sense that the most recently added sequences will now influence the next round of
sequences that are to be found.
PHI-BLAST
PHI-BLAST (Pattern-Hit Initiated BLAST) functions in the same manner as PSI-BLAST except
that the query sequence is first searched for a complex pattern, or regular expression,
provided by the user. The subsequent search for similar sequences is then focused on
regions containing the pattern. One example of a regular expression that might be used
is:
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
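A pattern written in this style can be translated into an ordinary regular expression. The sketch below covers only the syntax used in the example above plus {..} exclusions; the function name prosite_to_regex and the test sequence are made up for illustration, and anchors such as < and > are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regular expression syntax."""
    regex = pattern.replace("-", "")                       # '-' only separates elements
    regex = regex.replace("x", ".")                        # x matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")     # {ABC} means anything except A, B, C
    regex = regex.replace("(", "{").replace(")", "}")      # x(5,11) means 5 to 11 repeats
    return regex

pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
regex = prosite_to_regex(pattern)
match = re.search(regex, "MMLGEAGLAAAAARSAALAS")   # invented test sequence
print(regex, match.start() if match else None)
```

The braces must be translated before the parentheses, otherwise the repeat counts would be mistaken for exclusion sets.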
Reference Books:
BLAST
by Ian Korf, Mark Yandell, and Joseph Bedell; edited by Lorrie LeJeune
Sequence Databases
Major Sequence Repositories
Many of the applications in computational biology and bioinformatics are based on the
analysis of nucleotide and protein sequences. There are three major repositories that
contain all of the known nucleotide and protein sequences. They all share their
information with each other through the International Nucleotide Sequence Database
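Collaboration. These three repositories are GenBank (maintained by NCBI), the EMBL
nucleotide sequence database (maintained by EBI), and the DNA Data Bank of Japan (DDBJ).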
Nucleotide sequence information has also been organized in such a manner that it is
stored in genome databases. One of the most widely used resources of genomic data is
the UCSC Genome Browser, which contains genome assemblies and annotation for the
rat, mouse and human genomes. Another widely used resource is the Ensembl genome
browser.
Gene Databases
Once a genome is in place, it is desirable to study the regions that make a particular
organism what it is. One place to look is the genic regions of the organism.
Several databases of genes and related structures exist. Perhaps the largest such database
is the RefSeq database curated at NCBI. This data set contains a non-redundant
collection of naturally occurring molecules. These are typically given as
mRNA sequences annotated with whatever information is known about them. For instance, an
mRNA may be studied and annotated well enough that it is known to come from a genic
region, or it may be a predicted mRNA, where the prediction is based
on either computational methods or the mapping of EST sequences onto the
region.
Other gene and gene structure databases include: AllGenes: Human and mouse gene
index integrating gene, transcript and protein annotation; ASAP: Alternatively spliced
isoforms of genes; ExInt: exon-intron structures of genes; IDB/IEDB: intron sequence
and evolution; SpliceDB: Canonical and non-canonical mammalian splice sites; GDB and
GenAtlas: Human genes and genomic maps; HS3D: Human exon, intron and splice
regions.
SNP Resources
In human sequences, single base changes are thought to occur approximately once every
2000 bases between individuals. While this may not seem like a lot, that still leads to
over 1.6 million SNPs in the human population. SNPs play an important role in
differentiation, but can also be the cause of disease (one example is sickle-cell anemia).
Databases to locate and characterize single nucleotide polymorphisms are available for
use. These include dbSNP, the SNP Consortium database, and rSNP Guide (single nucleotide
polymorphisms in regulatory gene regions).
EST Resources
ESTs are expressed sequence tags, which are partial copies of mRNA found within a
particular cell. Information from ESTs can be used to tell the splicing patterns of genes,
the occurrence of genes, etc.
dbEST http://www.ncbi.nlm.nih.gov/dbEST/
Gene Resource Locator (Alignment of ESTs with finished human sequence)
http://grl.gi.k.u-tokyo.ac.jp
HUNT: Annotated human full-length cDNA sequences http://www.hri.co.jp/HUNT/
Sputnik: Annotation of clustered plant ESTs: http://mips.gsf.de/proj/sputnik
STACK: non-redundant, gene-oriented clusters: http://www.sanbi.ac.za/Dbases.html
TIGR Gene Indices: non-redundant EST clusters: http://www.tigr.org/tdb/tgi.shtml
UniGene: non-redundant EST clusters: http://www.ncbi.nlm.nih.gov/UniGene/
Binding Sites, Promoters, ETC
Besides locating genes within the genome, it is important to understand the signaling
mechanisms that an organism employs in order to turn a gene on or off. Databases of
various factors such as promoters and transcription factor binding sites are available.
Various databases include: DBTBS: Bacillus subtilis binding factors and promoters;
EPD: Eukaryotic POL II Promoters; PromEC: E. coli mRNA promoters; TRANSFAC:
Transcription factors and binding sites.
DBTBS: http://elmo.ims.u-tokyo.ac.jp/dbtbs/
EPD: http://www.epd.isb-sib.ch/
PromEC: http://bioinfo.md.huji.ac.il/marg/promec
TRANSFAC: http://transfac.gbf.de/TRANSFAC/index.html
Protein Databases
The central dogma states that DNA is transcribed into RNA, which in turn
is translated into protein. Since genes code for proteins, it is important to store known
information about proteins inside of databases. There are many different protein
databases, many of them dealing with specific protein families. Databases for curated
proteins include:
In addition to proteins, we can have families of proteins defined with conserved regions
called motifs or domains. Databases that store this information include:
After a protein sequence has been created, it takes on a three dimensional structure.
Various structure databases exist that contain proteins where the structure is known,
typically through NMR and X-ray crystallography. Some of the larger structure
databases include:
ASTRAL http://astral.stanford.edu/
PDB http://www.pdb.org/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop
MMDB http://www.ncbi.nlm.nih.gov/Structure/
Once the location and sequence of genes is known, the next step is to determine their
function. Various biological experiments can be performed on gene data, including the
newer microarray technology which we will cover in class. Databases containing the
results of this experimental data are available. Included might be experimental images,
analysis of results, etc. Examples of experimental Gene Expression and Metabolic
pathway databases are:
ArrayExpress http://www.ebi.ac.uk/arrayexpress
BodyMap http://bodymap.ims.u-tokyo.ac.jp/
HugeIndex http://hugeindex.org/
Mouse Atlas and Gene Expression Database: http://genex.hgu.mrc.ac.uk/
NetAffx http://www.affymetrix.com/
Stanford Microarray Database http://genome-www.stanford.edu/microarray/
KEGG http://www.genome.ad.jp/kegg/
Klotho http://www.ibc.wustl.edu/klotho/
MetaCyc http://ecocyc.org/
Disease Databases
After the function of genes is known, those genes involved in disease are classified.
Mutational databases include:
OMIM: http://www.ncbi.nlm.nih.gov/Omim/
OMIA: http://www.angis.org.au/omia/
HGMD: http://www.hgmd.org/
Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html
Literature References
After all this work has been done, there needs to be a way to do a search through the
literature references for a specific gene, disorder, organism, sequencing project, etc. The
most widely used resource in this regard is the PubMed database:
http://www.ncbi.nlm.nih.gov/PubMed/
Storage of Data
http://otn.oracle.com/oramag/oracle/03-jan/o13science.html
Hidden Markov Models (HMMs)
Hidden Markov Models (HMMs) are probabilistic models for studying sequences of
symbols. In particular, HMMs can model matches, mismatches, insertions and deletions
of symbols. Hidden Markov Models have been deeply rooted in speech recognition
problems.
In speech recognition, the problem is to determine the phonemes (or words) that have been
spoken in a particular time frame. Consider the difficulty. Everyone you meet has a different voice.
Everyone speaks with a slight variation – this might be caused by an accent, the person
having a cold, or differences in physiological development. However, humans are able to
distinguish what the speaker is saying. The idea behind speech recognition is to take in a
spoken word and to try to fit it to a specific model of possible words. This may in fact be
close to what the brain does – just think about the Sprint PCS commercials!
Problems in sequence analysis are similar. For instance, given an amino acid sequence,
we may want to determine the protein family to which it belongs. Now the amino acid
sequence can be treated similarly to the speech signal in a given frame, and the amino
acids can be treated as the phonemes.
Markov Chain
A Markov Chain is a probabilistic model that generates a sequence where the probability
of a symbol depends upon the previous symbol. A traffic light is an example of a
Markov chain. A Markov Chain can be used to model a random DNA sequence, where
there are four states: A, C, G, T, one for each letter in the alphabet. When we are given a
certain state, there is a transition from that state to another state with an associated
probability called a transition probability. An example Markov Chain can be drawn as
follows:
[Figure: Markov chain with a start state, an end state, and one state for each of the
nucleotides A, C, G, and T]
The key property of a Markov chain is that the probability of a symbol S at position p
(Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not on the
entire previous sequence.
Since the probability of a symbol is dependent upon the previous symbol, a prime
example for the use of Markov chains is in the detection of CpG islands, which are rich in
the dinucleotide CG.
The process of methylation in biological systems will typically convert the nucleotide C
to a T with a high probability when a CG dinucleotide is encountered. As a result, there
will be an overabundance of the dinucleotide TG, and an underabundance of the
dinucleotide CG. If we ignore the start and end states for now, we can see that there are
sixteen different transitions. A study of regions of genomic DNA has determined normal
genomic transition probabilities to be the following, where the FROM node is labeled
along the rows to the left, and the TO node is labeled along the columns above:
A C G T
A 0.300 0.205 0.285 0.210
C 0.322 0.298 0.078 0.302
G 0.248 0.246 0.298 0.208
T 0.177 0.239 0.292 0.292
The model shown above can then assign these weights to the edges of the graph.
In some regions of the genome, such as the promoter region of genes, methylation is
suppressed. In these regions, the dinucleotide CG is found in greater quantities. In fact,
the nucleotides C and G are found to a greater degree than elsewhere in the genome. A
study of regions of genomic DNA where CpG islands exist has determined the transition
probabilities to be the following:
A C G T
A 0.180 0.274 0.426 0.120
C 0.171 0.368 0.274 0.188
G 0.161 0.339 0.375 0.125
T 0.079 0.355 0.384 0.182
A new model just like the one above can have its transition probabilities assigned according
to the new table. Now we have two different models: the first where CpG islands are
absent, and the second where CpG islands are present.
Let’s call the first model the non-CpG model and the second model the CpG model.
Given a new sequence, how would we determine whether it belongs to the non-CpG
model or the CpG model?
Remember, the key property of a Markov chain is that the probability of a symbol S at
position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not
on the entire previous sequence.
Therefore, to find the probability that a sequence fits a model, you would multiply all of
the conditional probabilities:
L
P ( x ) = P ( x1 )∏ a xi −1xi
i=2
where
a xi −1xi
is the transition probability from residue at position i-1 to the residue at position i.
Let’s consider for now that in the non-CpG model, P(A) = P(T) = 0.3; P(C) = P(G) = 0.2,
so that A and T are more probable. In the CpG model, consider P(A) = P(C) = P(G) =
P(T) = 0.25.
For example, consider the sequence GGCGACG. Under the non-CpG model, its probability is:
P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)
= (0.20)(0.298)(0.246)(0.078)(0.248)(0.205)(0.078) = 0.00000453499
Under the CpG model, the same product is calculated as:
(0.25)(0.375)(0.339)(0.274)(0.161)(0.274)(0.274) = 0.000105256
Given this information, it is more likely that this sequence fits the CpG model. One thing
to note is how quickly the probabilities approach zero as the sequence gets longer. This
shows the importance of working with log probabilities.
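The calculation can be checked with a short sketch that works in log space. It uses the two transition tables and the initial probabilities given above, and the sequence GGCGACG implied by the factor list P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C); the helper names are chosen for the example.

```python
import math

BASES = "ACGT"

def table(rows):
    """Build trans[from][to] from rows listed in A, C, G, T order."""
    return {frm: dict(zip(BASES, row)) for frm, row in zip(BASES, rows)}

NON_CPG = table([[0.300, 0.205, 0.285, 0.210],
                 [0.322, 0.298, 0.078, 0.302],
                 [0.248, 0.246, 0.298, 0.208],
                 [0.177, 0.239, 0.292, 0.292]])
CPG = table([[0.180, 0.274, 0.426, 0.120],
             [0.171, 0.368, 0.274, 0.188],
             [0.161, 0.339, 0.375, 0.125],
             [0.079, 0.355, 0.384, 0.182]])
NON_CPG_START = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
CPG_START = {base: 0.25 for base in BASES}

def log_probability(seq, start, trans):
    """log P(x) = log P(x1) + sum of log transition probabilities."""
    logp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

seq = "GGCGACG"
print(math.exp(log_probability(seq, NON_CPG_START, NON_CPG)))
print(math.exp(log_probability(seq, CPG_START, CPG)))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sequences; the exponentiation at the end is only for display.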
One question that might arise is how different the non-CpG and CpG models are in
relation to each other. If they are not different enough, then there is not enough
information to determine from which model a particular sequence is derived. In order to
test whether we are able to discriminate between the two models, a log ratio is taken for
each of the scores in the two previous tables to create a third table, where each entry, x, in
the new table is equal to: log2(P(x|CpG model) / P(x| non-CpG model)). The resulting
table is as follows:
A C G T
A -0.740 0.419 0.580 -0.803
C -0.913 0.302 1.812 -0.685
G -0.624 0.461 0.331 -0.730
T -1.169 0.573 0.393 -0.679
Using this log-odds ratio table as the scores, we can then see that a sequence with a
negative total score is more likely to belong to the non-CpG model, while a sequence with
a positive total score is more likely to belong to the CpG model.
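Scoring with the log-odds table can be sketched as follows. The table values are copied from above; the sketch sums one score per adjacent pair of bases and ignores the initial-base probabilities, which is a simplifying assumption, and the function names are chosen for the example.

```python
BASES = "ACGT"
LOG_ODDS_ROWS = [[-0.740, 0.419, 0.580, -0.803],
                 [-0.913, 0.302, 1.812, -0.685],
                 [-0.624, 0.461, 0.331, -0.730],
                 [-1.169, 0.573, 0.393, -0.679]]
# LOG_ODDS[from][to], rows and columns in A, C, G, T order
LOG_ODDS = {frm: dict(zip(BASES, row)) for frm, row in zip(BASES, LOG_ODDS_ROWS)}

def log_odds_score(seq):
    """Sum the log-odds score of every adjacent pair of bases (in bits)."""
    return sum(LOG_ODDS[prev][cur] for prev, cur in zip(seq, seq[1:]))

def classify(seq):
    """Positive total favors the CpG model, negative the non-CpG model."""
    return "CpG" if log_odds_score(seq) > 0 else "non-CpG"

for seq in ("GGCGACG", "ATATAT"):
    print(seq, round(log_odds_score(seq), 3), classify(seq))
```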
The two models we have created can now be used to test which of the two models a
sequence belongs to. However, consider the case where we are given a long genomic
piece of DNA. How do we determine where the regions are where CpG islands are
located? Our models cannot answer these questions as they currently exist.
The solution to this problem is to combine both of our models into a single model where
there are small probabilities in switching back and forth between the two models. The
problem now becomes more complicated, because there are now two states
corresponding to each nucleotide symbol.
The difference between a Markov chain and a hidden Markov model is that a hidden
Markov model does not have a one-to-one correspondence between the states and the
symbols, and therefore it is no longer possible to tell what state the model was in when a
particular symbol was emitted. Therefore, the state is “hidden” from us.
In the example of the CpG islands used thus far, only one symbol is emitted at each state.
However, consider the example of multiple alignments where any one of a number of
amino acids is likely in any given column. In this case, the state of a hidden Markov
model could emit a symbol from a given distribution. The probabilities of emitting a
symbol given that you are in a specific state is referred to as the emission probabilities.
Using emission probabilities, the joint probability of seeing an observed sequence x and a
path π through the hidden Markov model is equal to:
\[ P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \]
(with \pi_{L+1} taken to be the end state 0). Note that in this case, e_{\pi_i}(x_i)
is the probability of emitting the residue x_i found at position i in the sequence, when
you are at the state π_i in the path.
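This equation is not all that useful on its own, since it is often the case that the path
is not known. However, it is important to be able to calculate the most probable path.
The Viterbi algorithm finds it by dynamic programming, using the recursion:
\[ v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right) \]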
Initialization
In the initialization step, before you begin emitting any characters, set the probability of a
path of length 0 ending at state 0 to 1, and the probability of all other paths of length 0
ending at all other states equal to 0:
\[ v_0(0) = 1, \qquad v_k(0) = 0 \ \text{for } k > 0 \]
Recursion
The recursion step goes through the whole length of the input sequence, one symbol at a
time, and calculates the maximum probability of being in a state, l, given the current
input position i:
\[ v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right) \]
In addition, the recursion step keeps track of a pointer for getting to each state, so that a
traceback can be performed to reconstruct the path with the maximum probability:
\[ \mathrm{ptr}_i(l) = \operatorname{argmax}_k \left( v_k(i-1)\, a_{kl} \right) \]
Termination
In the termination step, the probability of the sequence along the most probable path is
set to the maximum value at the final position, and the pointer for the maximal path is
set at that point. This is similar to the recursion step, except that a terminating
transition, a_k0, is introduced:
\[ P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right) \]
\[ \pi^*_L = \operatorname{argmax}_k \left( v_k(L)\, a_{k0} \right) \]
Traceback
Since pointers were kept for the path with the maximum probability using a recursive
dynamic programming approach, traceback continues in a similar fashion to sequence
alignment. For the path with the maximum probability, we start at the final state, and
trace back through the set of transitions that led to that state. We then step back one
position at a time until we get to the first state:
\[ \pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i), \qquad i = L, \ldots, 1 \]
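The four steps can be put together in a short sketch. It works with log probabilities to avoid underflow and assumes there is no explicit end state, so the terminating transition a_k0 is treated as equal for every state. The two-state model (a GC-rich "+" state and an AT-rich "-" state) and all of its numbers are invented for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability)."""
    # Initialization: begin state -> first emitting state
    v = [{k: math.log(start_p[k]) + math.log(emit_p[k][obs[0]]) for k in states}]
    ptr = [{}]
    # Recursion: keep the best predecessor for every state at every position
    for i in range(1, len(obs)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(trans_p[k][l]))
            ptr[i][l] = best_k
            v[i][l] = (math.log(emit_p[l][obs[i]])
                       + v[i - 1][best_k] + math.log(trans_p[best_k][l]))
    # Termination: best final state (a_k0 assumed equal for all k)
    last = max(states, key=lambda k: v[-1][k])
    # Traceback: follow the pointers from the end to the start
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, v[-1][last]

STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path, logp = viterbi("ATATCGCGCGATAT", STATES, START, TRANS, EMIT)
print("".join(path), logp)
```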
Luckily, the probability of the observed sequence up to and including a point xi, where
the path ends at state πi , can be calculated using a dynamic programming approach,
similar to the Viterbi algorithm:
\[ f_k(i) = P(x_1 \ldots x_i, \pi_i = k) \]
The forward algorithm can be described as follows, with an initialization, recursion,
and termination step.
Initialization
The initialization step is identical to the Viterbi algorithm, except now the v’s (maximum
probable path) are replaced by f’s (overall probability): f_0(0) = 1 and f_k(0) = 0 for k > 0.
Recursion
The recursion step goes through the whole length of the input sequence, one at a time,
and calculates the overall probability for being in a state, l, given the current input i. In
the Viterbi algorithm, this recursive step takes the maximum; in the forward algorithm,
we sum over all possibilities:
\[ f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl} \]
Termination
The termination step is an extension of the recursion step with the difference that a
terminating transition is used. The overall probability of the sequence being described by
the HMM is then given in the final cell:
\[ P(x) = \sum_k f_k(L)\, a_{k0} \]
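A minimal sketch of the forward algorithm follows. It assumes there is no explicit end state, so a_k0 is treated as 1 for every state, and it uses an invented two-state model. Plain probabilities are used for readability; on long sequences these underflow, so real implementations work with logarithms or rescale each column.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return (table of f values, P(x)).  f[i][k] covers obs[0..i] ending in state k."""
    # Initialization: begin state -> first emitting state
    f = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    # Recursion: sum over all predecessors instead of taking the maximum
    for i in range(1, len(obs)):
        prev = f[-1]
        f.append({l: emit_p[l][obs[i]] * sum(prev[k] * trans_p[k][l] for k in states)
                  for l in states})
    # Termination: a_k0 assumed to be 1 for every state
    return f, sum(f[-1].values())

STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

f, total = forward("CG", STATES, START, TRANS, EMIT)
print(f, total)
```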
Backward Algorithm
While the Viterbi algorithm finds the most probable path through the model, and the
forward algorithm finds the probability of fitting the sequence to the model, we might
also be interested in calculating the posterior probability that the emitted value came from
a particular state given the observed sequence. Formally, this is written as:
\[ P(\pi_i = k \mid x) \]
The backward algorithm is very similar to the traceback step in pairwise sequence
alignment using dynamic programming. We start at the final step and work our way back
to the beginning.
Initialization
In the initialization step, the backward value of each state at the final position is
assigned the value of its terminating transition probability:
\[ b_k(L) = a_{k0} \ \text{for all } k \]
Recursion
The recursion works backwards (i = L-1, ..., 1). Thus, the recursive step is:
\[ b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \]
Termination
The termination step reports back the same value as the forward algorithm, calculated in
the reverse direction:
\[ P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1) \]
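The backward values combine with the forward values to give the posterior probability through the standard identity P(π_i = k | x) = f_k(i) b_k(i) / P(x). The sketch below assumes there is no explicit end state, so a_k0 is treated as 1, and uses an invented two-state model; the function names are chosen for the example.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """f[i][k]: probability of obs[0..i] with the path ending in state k."""
    f = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for i in range(1, len(obs)):
        f.append({l: emit_p[l][obs[i]] * sum(f[i - 1][k] * trans_p[k][l] for k in states)
                  for l in states})
    return f

def backward(obs, states, trans_p, emit_p):
    """b[i][k]: probability of obs[i+1..] given that position i is in state k."""
    L = len(obs)
    b = [{} for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}          # initialization: a_k0 assumed to be 1
    for i in range(L - 2, -1, -1):               # recursion runs from the end backwards
        b[i] = {k: sum(trans_p[k][l] * emit_p[l][obs[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def posterior(obs, states, start_p, trans_p, emit_p):
    """P(state at position i = k | whole sequence) for every i and k."""
    f = forward(obs, states, start_p, trans_p, emit_p)
    b = backward(obs, states, trans_p, emit_p)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(obs))]

STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

for column in posterior("CGAT", STATES, START, TRANS, EMIT):
    print(column)
```

The posterior probabilities at each position sum to 1 across the states, which is a convenient sanity check on any implementation.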
Parameter Estimation
So now the question becomes, how do the model and the associated probabilities get
specified in the first place? There are two parts to HMM parameter estimation: the
design of the structure (what states there are and how they are connected) and the
assignment of the transition and emission probabilities.
Calculation of Probabilities
In order to deal with insufficient data and overtraining of models, pseudocounts should be
incorporated into these maximum likelihood estimations.
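When the state path of each training sequence is known, the estimate with pseudocounts amounts to counting and normalizing. The sketch below does this for transition probabilities only; emission probabilities can be estimated the same way. The state labels, training paths, and pseudocount value are invented for illustration.

```python
def estimate_transitions(paths, states, pseudocount=1.0):
    """Estimate a_kl = A_kl / sum over l' of A_kl', where every count A_kl
    starts at the pseudocount instead of zero."""
    counts = {k: {l: pseudocount for l in states} for k in states}
    for path in paths:
        for k, l in zip(path, path[1:]):
            counts[k][l] += 1                      # one observed k -> l transition
    return {k: {l: counts[k][l] / sum(counts[k].values()) for l in states}
            for k in states}

a = estimate_transitions(["++--", "+++-"], states="+-")
print(a)
```

In this toy training set the transition from "-" to "+" is never observed. Without a pseudocount its probability would be estimated as zero, and the model could never take that transition.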
In the Baum-Welch algorithm, the transition and emission probabilities are calculated as
the expected number of times each transition or emission is used, given the training data.
Once a model is in place, its overall log likelihood is computed, and transition and
emission probabilities are calculated again based on the values given.
The Baum-Welch algorithm is summarized as follows:
Initialization:
    Pick arbitrary model parameters
Recurrence:
    Set all the transition and emission variables to their pseudocount values
    For each sequence j = 1..n:
        Calculate the forward value for sequence j
        Calculate the backward value for sequence j
        Add the contribution of sequence j to the transition and emission variables
    Calculate the new model parameters using maximum likelihood
    Calculate the new log likelihood of the model
Termination:
    Stop if the change in log likelihood is less than some predefined threshold
Duration Modeling
Complex length distributions can be modeled by introducing several states with
transitions between one another:
Silent States
In some cases, it would be nice to be able to skip over states. This is done by the
introduction of silent states. Silent states will allow for arbitrary deletions of a chain
of states. In order to make the model less cluttered, silent states are introduced as circles in
the model. This will achieve the same effect of forward connecting transitions from each
of the states to each other state in the future.
Using the above HMM model, we can now skip some of the emitting states by traveling
through the silent states. The net effect would be a deletion of emitted sequence.
Insertion States
In addition to skipping over states, it may be necessary to emit residues between
matching states. This is done by introducing insert states which are labeled as diamonds
in the HMMs:
HMMs have many different applications in computational biology. Among them are:
There are two main programs used to calculate Hidden Markov Models:
A finite state machine can be created to show the calculation of a pairwise alignment with
five states: A start state, a stop state, a match/mismatch state, an insertion state (for
sequence 1), and an insertion state (for sequence 2).
[Figure: finite state machine with a BEGIN state, an END state, a match state M, and
insertion states X and Y]
By assigning probabilities for the transitions, and for the emissions at states M, X, and Y,
this finite state machine can be converted into a pair HMM. A pair HMM differs from a
standard HMM in the sense that a pair of sequences is emitted in this case. This pair
HMM will generate an aligned pair of sequences.
Chapter 4 discusses using HMMs to find pairwise alignments; look it over, but we will
not discuss it further in class. Rather, we will concentrate on where the real power of
HMMs lies: in determining families of sequences based on multiple alignments.
The assumption with profile HMMs is that we begin with a multiple alignment that has
been correctly calculated. This multiple alignment can then be used to build a model to
score potential matches to new sequences. For example, consider the following multiple
alignment of globin sequences:
10 20 30 40 50 60
....*....|....*....|....*....|....*....|....*....|....*....|
consensus 1 SAAQKALVKASWGKVKG------NREELGAEILARLFK-------AYPDTKAYFPKFg-D 46
1ASH 1 ANKTRELCMKSLEHAKVdt--snEARQDGIDLYKHMFE-------NYPPLRKYFKSR--E 49
1ITH_A 3 TAAQIKAIQDHWFLNIKg-----CLQAAADSIFFKYLT-------AYPGDLAFFHKF--S 48
gi 1065933 162 DKESCEVVADSWRLVESrssaaeTSACFGLFVFQRVFS-------KIPMLRPLFG-Ls-E 212
gi 3877400 71 nvyEKELLRRTWSDEFD------NLYELGSAIYCYIFD-------HNPNCKQLFP-Fi-S 115
gi 3877381 15 TDEEVTAIRDVWRRA--------KTDNVGKKILQTLIE-------KRPKFAEYFG-IqsE 58
gi 3874505 230 TCAQIHLVRALWRQVYTt----kGPTVIGASIYHRLCFknvmvkeQMKQVE-LPPKFq-N 283
gi 4098133 39 EDRDALRVLQNAFKL--------DDPELVRRFYAHWFA-------LDASVRDLFP-P--- 79
gi 1707914 18 SPADVK--KHTVESMKAvpv-grDKAQNGIDFYKFFFT-------HHKDLRKFFKGA--E 65
gi 2494780 3 TKDEFDSLLHELDPKIDte---eHRMELGLGAYTELFA-------AHPEYIKKFSRL--Q 50
70 80 90 100 110 120
....*....|....*....|....*....|....*....|....*....|....*....|
consensus 47 LSTAAALKSSPKFKAHGKKVLGALDEAVKHL---DDDGNLKAALAKLGAR-HAKRG---H 99
1ASH 50 EYTAEDVQNDPFFAKQGQKILLACHVLCATY---DDRETFNAYTRELLDR-HARDHv--H 103
1ITH_A 49 SVPLYGLRSNPAYKAQTLTVINYLDKVVDAL-----GGNAGALMKAKVPS-HDAMG---- 98
gi 1065933 213 SDDVFDLPDNHPVRRHARLFTSILHISVKNVd--ELEAQVAPTVFKYGER-HYRPDitpH 269
gi 3877400 116 KYQGDEWKESKEFRSQALKFVQTLAQVVKNIyhmERTESFLYMVGQKHVK-FADRG---- 170
gi 3877381 59 SLDIRALNQSKEFHLQAHRIQNFLDTAVGSLg-fCPISSVFDMAHRIGQI-HFYRGv--N 114
gi 3874505 284 --------RDNFIKAHCKAVAELIDQVVENL---DHLDNVTGELMRIGRV-HAKVL---- 327
gi 4098133 80 -------DMGAQRAAFGQALHWVYGELVAQr-----AEEPVAFLAQLGRD-HRKYG---- 122
gi 1707914 66 NFGADDVQKSKRFEKQGTALLLAVHVLANVY---DNQAVFHGFVRELMNR-HEKRGvdpK 121
gi 2494780 51 EATPANVMAQDGAKYYAKTLINDLVELLKAS---TDEATLNTAIARTATKdHKPRN---- 103
130 140 150 160 170
....*....|....*....|....*....|....*....|....*....|....
consensus 100 VDPANFKLFGEALL---VVLAEHLg---DFTPEVKAAWDKALDVVADALKSGYR 147
1ASH 104 MPPEVWTDFWKLFE---EYLGKKT----TLDEPTKQAWHEIGREFAKEINKHGR 150
1ITH_A 99 ITPKHFGQLLKLVG---GVFQEEFs---ADPTTVAAWGDAAGVLVAAMK----- 141
gi 1065933 270 MTEENVRVFCAQIV---CTVFDFLrd-tEATPKCAESWIELMRYLGQKLLDGFD 319
gi 3877400 171 FKHEYWDIFQDAME---FALEHRLsimtDLDDNQKRDAVTVWRTLALYTTVHMR 221
gi 3877381 115 FGADNWLVFKKVTV---DQVTTGTt---DSSKEKEDtnsngtangkvdtdasli 162
gi 3874505 328 RGELTGKLWNTVAEtiiDCTLEWGdr-rCRSETVRKAWALIVAFVIEKIKAGHH 380
gi 4098133 123 VLPTQYDTLRRALY---TTLRDYLg------HPSRGAWTDAVDEAAGQSLNLII 167
gi 1707914 122 LWKIFFDDVWVPFL---ESKGAKLs------GDAKAAWKELNKNFNSEAQHQLE 166
gi 2494780 104 VSGAEFQTGEPIFI---KYFSHVL-----TTPANQAFMEKLLTKIFTGVAGQL- 148
First, consider only the BLOCKS, where there are no insertion and deletion events. For
example, we can consider the block of width 5 that is highlighted above:
HAKRG
HARDH
HDAMG
HYRPD
FADRG
HFYRG
HAKVL
HRKYG
HEKRG
HKPRN
As we have shown with the multiple alignment algorithms, a position specific scoring
matrix (PSSM) can be calculated for the BLOCK above.
Converting a PSSM for a block to a HMM is trivial, due to the absence of insertions and
deletions. There will be a begin state, and five match states, one for each column. Match
states are represented by squares in a diagram of a profile HMM. The transition
probabilities between match states in the HMM would be 1 (we can only transition to the
next match state), and the emission probabilities for each match state would be calculated
based on the PSSM. The diagram for the resulting HMM is shown below:
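[Figure: profile HMM for the block, with a begin state B, five match states drawn as
squares, and an end state E]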
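The ungapped model above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the emission probabilities are estimated with add-one pseudocounts, which is an assumption rather than the PSSM calculation used earlier in the course. Because every transition probability is 1, the log-probability of a sequence is just the sum of the log emission probabilities.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOCK = ["HAKRG", "HARDH", "HDAMG", "HYRPD", "FADRG",
         "HFYRG", "HAKVL", "HRKYG", "HEKRG", "HKPRN"]

def match_emissions(block, pseudocount=1):
    """One emission distribution per column (add-one pseudocounts are an assumption)."""
    emissions = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        emissions.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return emissions

def log_probability(sequence, emissions):
    """Log-probability of an ungapped sequence; all transitions have probability 1."""
    if len(sequence) != len(emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log(column[aa]) for aa, column in zip(sequence, emissions))

emissions = match_emissions(BLOCK)
print(log_probability("HAKRG", emissions))  # a member of the block
print(log_probability("WWWWW", emissions))  # an unrelated sequence scores lower
```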
Insertions
For a profile HMM, insertions and deletions are treated separately. In order to handle
insertions, where a portion of a new sequence does not match anything else in the model,
a new set of insert states is inserted, denoted by diamonds. Whenever an insertion is
possible, a transition is needed from the last match state in a block to the insertion, from
the insertion to itself (to allow for multiple length insertions), and from the insertion to
the first match state of the next block. For instance, if our alignment had been:
HAKVPRG
HAR--DH
HDAV-MG
HYR--PD
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN
We now have an insertion after the third match state. Thus, the HMM now looks like the
following:
[Figure: the same profile HMM with an insert state (diamond) added after the third match
state]
In this case, the score of a gap of length k is equal to the score of the transition from the
match state to the insert state, plus (k-1) times the score of the transition remaining in the
insert state, plus the score of the transition from the insert state to the next match state.
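This can be rewritten as follows, where each transition score is the log of the
corresponding transition probability:
$$\text{score}(k) = \log a_{M_j I_j} + (k-1)\,\log a_{I_j I_j} + \log a_{I_j M_{j+1}}$$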
Deletions are handled in HMMs by introducing silent states between matches. Deletion
states do not emit a residue (thus, the name “silent” state). These states are denoted by
circles in a HMM. An example of a HMM emitting a sequence between 0 and 3 residues
long is as follows:
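[Figure: HMM with a begin state B, three match states, silent delete states (circles), and
an end state E]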
HMM With Matches, Insertions, and Deletions
A HMM with match states, insertion states, and delete states is referred to as a profile
HMM (first introduced by David Haussler, et al and Anders Krogh, et al in the mid-
1990s). An example structure of a profile HMM is as follows:
Profile HMMs can be used to perform generalized pairwise alignments, similar to the
dynamic programming approach, where the emission probabilities are the
match/mismatch probabilities or the probabilities of matching two amino acids in a
scoring matrix. The transition probabilities from a match state to an insert or delete state
are equivalent to a gap-open score, while the transition probabilities within an insertion
state or between delete states are equivalent to a gap-extension penalty.
Multiple sequence alignments can be used to create profile HMMs that act as models
describing the consensus sequence of a sequence family.
HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN
Once the length of the model is chosen, the next problem is to assign the transition and
emission probability parameters. The simplest way to assign probabilities is to
use the observed frequencies.
First, consider the transition frequencies. Note that each state has three possible
transitions leading from it. The transition probability from state k to state l can be
calculated as:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$
where A_{kl} is the number of transitions from state k to state l observed in the alignment.
HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN
First, consider the transitions. From column 1 to column 2, there are 10 transitions to the
next match state, 0 transitions to an insertion state, and 0 transitions to a delete state.
Using Laplace’s rule (adding one to each count), the probabilities would be 11/13, 1/13,
and 1/13 for a_{M1M2}, a_{M1I1}, and a_{M1D2} respectively. The probabilities are the same
for the transitions from the second column to the third column. Note that the fourth and
fifth columns are insertion columns. Therefore, the next set of match transitions will be
from match state three (column 3) to match state four (column 6). Four sequences have gaps
in columns 4 and 5 and continue directly to the next match state (probability of matching:
5/13); one sequence has gaps in columns 4, 5, and 6 (probability of deletion: 2/13); and
five sequences have an insertion in column 4 and/or column 5 (probability of insertion: 6/13).
Remembering that we have a total of five match states, the probabilities can be stored in
the following table:
The emission probabilities can be calculated for the match states in a similar fashion. In
the first column, there are 9 H’s, and 1 F. Using Laplace’s rule, this becomes 10 H’s, 2
F’s, and 1 each of the remaining 18 amino acids. Therefore, the probabilities are 10/30
H; 2/30 F; 1/30 for each of the remaining 18 amino acids.
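The counts in this worked example can be checked with a short script. The sketch below is an illustration only: the function names and the way columns are passed in (0-based indices, with the insert columns given explicitly) are choices made for this example. It applies Laplace's rule by adding one to every count before normalizing.

```python
from collections import Counter
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = ["HAKVPRG", "HAR--DH", "HDAV-MG", "HYR---D", "FAD--RG",
             "HFY--RG", "HAK-PVL", "HRKG-YG", "HEKGGRG", "HKP--RN"]

def laplace(counts, outcomes):
    """Add one to each count, then normalize over the listed outcomes."""
    total = sum(counts[o] for o in outcomes) + len(outcomes)
    return {o: Fraction(counts[o] + 1, total) for o in outcomes}

def transition_probs(alignment, match_col, insert_cols, next_match_col):
    """Transitions leaving one match column (column indices are 0-based)."""
    counts = Counter()
    for row in alignment:
        if row[match_col] == "-":
            continue  # this row is not in the match state for this column
        if any(row[c] != "-" for c in insert_cols):
            counts["insert"] += 1
        elif row[next_match_col] == "-":
            counts["delete"] += 1
        else:
            counts["match"] += 1
    return laplace(counts, ["match", "insert", "delete"])

def emission_probs(alignment, col):
    """Emission probabilities for one match column, ignoring gap characters."""
    counts = Counter(row[col] for row in alignment if row[col] != "-")
    return laplace(counts, AMINO_ACIDS)

print(transition_probs(ALIGNMENT, 0, [], 1))      # column 1 to column 2
print(transition_probs(ALIGNMENT, 2, [3, 4], 5))  # match state 3 to match state 4
print(emission_probs(ALIGNMENT, 0)["H"])          # 10/30, printed in lowest terms
```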
Once the profile HMM is in place, sequences can be searched against the HMM to detect
whether or not they belong to a particular family of sequences described by the profile
HMM. Using a global alignment, the probability of the most probable alignment and
sequence can be determined using the Viterbi algorithm (yielding P(x, π*|M)). In
addition, the full probability of a sequence aligning to the profile HMM can be
determined using the forward algorithm (yielding P(x|M)). The Viterbi equations
specifically designed for profile HMMs are given on page 109, Durbin et al. The forward
equations are given on page 110.
HMMER
http://hmmer.wustl.edu/
SAM
“Sequence Alignment and Modeling System” is a profile-HMM package used for protein
sequences as well. It is very similar to HMMER as far as the functionality is concerned.
http://www.cse.ucsc.edu/research/compbio/sam.html
Meta-meme
Meta-meme is a program that creates Hidden Markov Models of ungapped alignments.
The benefit is that there are fewer parameters to be learned in creating the HMMs. Thus,
Meta-meme is a motif-based hidden Markov approach.
http://metameme.sdsc.edu/
HMMPro
HMMPro is a commercial software package (free academic license) that has been
developed to add additional HMM features, including graphical interfaces, multiple
topologies, and multiple training methods.
http://www.netid.com/html/hmmpro.html
Pfam
Pfam is a large collection of multiple sequence alignments and hidden Markov models
covering many common protein domains and families. Over 73% of all known protein
sequences have at least one match to one of 5,193 different protein families. Pfam
families are extensively hand curated to ensure greater reliability in the results.
Profile HMMs have been created to describe each protein family. In general, the HMMs
are seeded with a training data set of 50 or more sequences that are multiply aligned.
Generally, this alignment is first produced using a multiple alignment program such
as Clustal. After the automatic multiple alignment is generated, it is scrutinized by eye
and adjusted. This assures that the seed alignments that produce the Profile HMM are
more likely to be correct.
After the HMM has been created (using the HMMER suite), additional sequences are
added to the family by comparing the HMM against sequence databases. The resulting
full alignments with additional family members may look worse than the initial seed
alignments.
Pfam families can be broken down into four basic types:
http://pfam.wustl.edu/
http://www.sanger.ac.uk/Software/Pfam/index.shtml
The best way to look at the Pfam families is to jump right into the Pfam program and
view some examples: http://pfam.wustl.edu/
ProDom
ProDom is a database of all protein domain families automatically generated from the
SWISS-PROT and TrEMBL databases. ProDom incorporates Pfam-A families as well as
generating new ProDom alignments using the PSI-BLAST program.
http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
ProSite
ProSite is a database of protein domains that can be searched by either regular expression
patterns or sequence profiles. The ProSite data can be accessed at:
http://www.expasy.org/prosite/
InterPro
InterPro is an integrated resource of protein families, domains, and functional sites
created to handle the data from various protein family sites such as PROSITE, Pfam,
PRINTS, ProDom, SMART and TIGRFAMs into a single, comprehensive resource.
5629 Entries
4280 Families
1239 Domains
95 Repeats
15 post-translational modifications
Represented in InterPro are 74% of all proteins within the SWISS-PROT and TrEMBL
databases.
Computational Gene Prediction
The simplest method of predicting genic regions is to search for open reading frames
(ORFs). As discussed earlier, an open reading frame begins with a start
(AUG) codon and ends with one of three stop codons.
In prokaryotic organisms, DNA sequences coding for proteins generally are transcribed
into mRNA which is translated into protein with very little modification. Therefore, in
prokaryotes, locating an open reading frame from a start codon to a stop codon can give a
strong indication of a protein-coding region. Longer ORFs are more likely to predict
protein-coding regions than shorter ORFs.
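A simple ORF scan can be sketched as follows. This is a minimal illustration, not a production gene finder: it works on the DNA sequence (so the start codon appears as ATG rather than AUG), it scans only the three forward reading frames, and the function name and the min_codons cutoff are assumptions made for this example. Running it again on the reverse complement would cover the other three frames.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the A in ATG, end is one past the stop
    codon, and min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))
```

Raising min_codons reflects the point made above that longer ORFs are more likely to be real protein-coding regions.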
Once a new genomic DNA sequence has been obtained, the first step in gene prediction is
to locate homologous genes. This is accomplished by taking the new DNA sequence,
translating it into all six reading frames, and comparing it to protein sequence databases.
This step will locate known open reading frames, where the translation of the proteins is
known.
We have shown various algorithms for searching databases, and in particular, aspects that
allow translation into all possible reading frames before searching. (Examples are
BLASTX, FASTX, TBLASTN, TFASTX/TFASTY).
It is thought that only about half of the genes can be found by homology searches. The
remaining 50% need to be found using other mechanisms.
If the organism from which the genomic DNA has been extracted has EST or cDNA
sequences available, the next step is to perform a similarity search between the genomic
DNA and these ESTs or cDNAs. In addition, it can be helpful to perform similarity
searches with ESTs and cDNAs from other organisms as well. This step locates potential
exons within genomic sequences.
Programs that take into account not only sequence similarity but also intron/exon
boundary information (such as acceptor/donor sites and branch points) are listed as
follows:
EST-GENOME (http://www.hgmp.mrc.ac.uk/Registered/Option/est_genome.html)
SIM4 (http://pbil.univ-lyon1.fr/sim4.html)
SPIDEY (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/)
Other Programs:
TAP (Transcript Analysis Program)
The third step is to take the genomic DNA and run it through gene prediction programs to
try to locate genes. Various programs exist to predict genes in different organisms,
usually basing the methodology on the observed characteristics of known exons, introns,
splice sites, and other regulatory sites in known genes. One important aspect to consider
is that gene structure varies from one organism to the next, so a program trained on one
organism is not generally useful for finding genes on another organism. Methods for
computationally predicting genes are generally error prone. We will discuss the
important statistics to consider when comparing gene prediction programs.
In general, predicting genes in prokaryotic organisms is much easier. This is due to the
fact that prokaryotic genes generally lack introns, and several highly conserved promoter
regions are found around the start sites of transcription and translation.
There are also three potential binding sites for the lexA product in the promoter region,
indicating that lexA provides a self-feedback loop. When enough lexA has been
produced, these sites will be bound by lexA, which shuts off further production of the protein.
Many other prokaryotic genes operate in this same manner, and it is therefore fairly
straightforward to locate genes in prokaryotic organisms.
The highly conserved features of prokaryotic genes have made computational gene
identification a possibility. One method to detect these genes is to create HMMs based
on the gene structures. One such HMM is given in Mount, p50. This model suggests that
a model can be constructed based on each of the 61 codons in the genetic code, as well as
for the start codon and 3 stop codons.
Since codon usage and intergenic sequences vary from one organism to the next, a model
trained on the genes of one organism may not be useful in detecting genes in a second.
The reliability of the model depends on the accuracy of the information used to train the
initial model.
One program that uses a fifth order HMM (such that hexamers are important) in
modeling E. coli genes is the program GeneMark.hmm.
GeneMark (http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi )
Glimmer (http://www.cs.jhu.edu/labs/compbio/glimmer.html)
The commonly used methods for eukaryotic gene prediction train a computer program to
recognize sequences that are characteristic of known exons in genomic DNA sequences.
Then these programs are used to predict exons in unknown genomic sequences, and then
connect these exons to produce a gene structure.
The patterns used include intron-exon boundaries and upstream promoter sequences.
However, in eukaryotes, the signals for these are poorly defined, and therefore cannot be
searched by a simple pattern-matching technique as used with prokaryotes.
Splice sites
(Image source: http://genio.informatik.uni-stuttgart.de/GENIO/splice/)
Canonical: GT->AG
Alternative: GC->AG
U12 Introns: AT->AC
http://www.fruitfly.org/seq_tools/splice.html
http://www.softberry.com/spldb/SpliceDB.html
Neural networks provide a framework for finding complex yet subtle patterns and
relationships among sequences. Grail II provides analyses of protein-coding regions,
poly(A) sites, and promoters; constructs gene models; and predicts encoded protein
sequences. The underlying algorithm makes a list of the most likely exon candidates and
these are further evaluated using a neural network. Dynamic programming is then
applied to define the most probable gene models.
Input for Grail II includes several indicators of sequence patterns, including a Markov
model for gene recognition and inputs from two additional neural networks that evaluate
the region for potential splice sites. One important indicator in Grail II (and other gene
prediction programs as well) is the in-frame 6-mer preference score, since the occurrence
of codon pairs in coding regions is not random, while in noncoding regions it is more
likely to be random. A higher frequency of 6-mers that are more commonly found in
coding regions can be an indicator of the presence of an exon.
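To make the idea of a 6-mer preference score concrete, here is a rough sketch. It is not the scoring used by Grail II: the log-odds form, the pseudocount, the function names, and the toy training sequences in the usage lines are all assumptions chosen for illustration. In-frame hexamers are counted in known coding sequences, all hexamers are counted in noncoding sequences, and a window is scored by the average log-odds of its in-frame hexamers.

```python
import math
from collections import Counter

def hexamer_counts(sequences, step):
    """Count 6-mers, sampling every `step` bases (step=3 keeps the reading frame)."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    return counts

def preference_scorer(coding, noncoding, pseudocount=1.0):
    """Return a function giving the log-odds of a hexamer, coding versus noncoding."""
    c = hexamer_counts(coding, step=3)
    n = hexamer_counts(noncoding, step=1)
    c_total = sum(c.values()) + pseudocount * 4 ** 6
    n_total = sum(n.values()) + pseudocount * 4 ** 6

    def score(hexamer):
        p_coding = (c[hexamer] + pseudocount) / c_total
        p_noncoding = (n[hexamer] + pseudocount) / n_total
        return math.log(p_coding / p_noncoding)

    return score

def window_score(seq, frame, score):
    """Average log-odds of the in-frame hexamers of seq, starting at `frame`."""
    hexamers = [seq[i:i + 6] for i in range(frame, len(seq) - 5, 3)]
    return sum(score(h) for h in hexamers) / len(hexamers) if hexamers else 0.0

# Toy training data (hypothetical, far too small for real use).
score = preference_scorer(["ATGGCCGCCGCCGCCGCC"], ["ATATATATATATATATAT"])
print(window_score("GCCGCCGCCGCC", 0, score))  # positive: looks like coding sequence
print(window_score("ATATATATATAT", 0, score))  # negative: looks like noncoding sequence
```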
The neural network used for Grail II is trained using a set of known coding sequences.
A schematic for the Grail II system is as follows (Mount, p53):
Gene Prediction Using Dynamic Programming
GeneParser is a program that predicts the most likely combination of exons and introns in
a genomic sequence using a combined neural network and DP approach.
Discrimination methods are statistical methods used to classify the sequence based on a
number of observed sequence patterns, including a 6-mer exon preference score (EPS) and a
3’-flanking splice site score (FSS). The idea behind discrimination analysis is to
plot a pair of scores for known sequences against one another, labeling each point as
either intron or exon. Based on this data, a discriminating curve is drawn to discriminate
between the introns and exons. The scores are then calculated for the new genomic
sequence, and depending on where the score falls, the region is labeled either as an intron
or exon. (See page 356 for an example). Example pattern discrimination methods
include:
HEXON
FGENEH
MZEF
Twinscan
Burset and Guigo (1996) published an important paper describing methods for
testing gene prediction programs. In this paper, they describe a set of known coding
sequences that should be used as data to train the models. In addition, a set of known
coding sequences is provided to evaluate the success of the model. The important
statistics to look at include:
$$CC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(AN)(PP)(AP)(PN)}}$$
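The correlation coefficient can be computed directly from the four counts. The sketch below assumes the usual gene-prediction definitions, which are not spelled out in these notes: AP = TP + FN (actual positives), AN = TN + FP (actual negatives), PP = TP + FP (predicted positives), and PN = TN + FN (predicted negatives). It also reports sensitivity as TP/AP and specificity as TP/PP, following the convention common in the gene-finding literature. The function name is made up for this example.

```python
import math

def prediction_accuracy(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    ap, an = tp + fn, tn + fp  # actual positives and negatives
    pp, pn = tp + fp, tn + fn  # predicted positives and negatives
    if 0 in (ap, an, pp, pn):
        raise ValueError("CC is undefined when any marginal total is zero")
    sensitivity = tp / ap
    specificity = tp / pp  # gene-prediction convention, not TN / AN
    cc = (tp * tn - fp * fn) / math.sqrt(an * pp * ap * pn)
    return sensitivity, specificity, cc

print(prediction_accuracy(tp=40, tn=40, fp=10, fn=10))
```

A perfect prediction gives CC = 1, and a prediction no better than chance gives a CC near 0.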
GeneFinder http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GeneMark http://genemark.biology.gatech.edu/GeneMark/
GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html
GeneScan http://202.41.10.146/GS.html
Genie http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie
GenScan http://genes.mit.edu/GENSCAN.html
Grail http://compbio.ornl.gov/
MZEF http://argon.cshl.org/genefinder/
MetaGene: http://rgd.mcw.edu/METAGENE/
BCM Web Launcher for Gene Predictions: http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
Genie http://www.fruitfly.org/seq_tools/genie.html
Grail EXP http://grail.lsd.ornl.gov/grailexp/
HMMGene http://www.cbs.dtu.dk/services/HMMgene/
NetGene2 http://www.cbs.dtu.dk/services/NetGene2/
geneid http://www1.imim.es/geneid.html
Procrustes http://hto-13.usc.edu/software/procrustes/
Genewise
Twinscan http://genes.cs.wustl.edu/
Burge CB, Karlin S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346-354.
Burge C, Karlin S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol.
Biol. 268, 78-94.
Reese MG, Eeckman FH, Kulp D, Haussler D. (1997). Improved splice site
detection in Genie. Proceedings of the First Annual International Conference on
Computational Molecular Biology (RECOMB) 1997, Santa Fe, NM, ACM Press,
New York.
Snyder EE, Stormo GD. (1993) Identification of coding regions in genomic DNA
sequences: an application of dynamic programming and neural networks. Nucl.
Acids Res. 21(3): 607-613.
CECS694-02
Introduction to Bioinformatics
Lecture 9
Phylogenetic Prediction
Overview of phylogenetics
Phylogenetic analysis gives insight into how a family of related sequences has been
derived during evolution. The evolutionary relationships among the sequences are shown
as branches of a tree. The length and nesting of these branches reflect the degree of
similarity between any two given sequences. The objective of phylogenetic analysis is to
determine the length of the branches and to figure out how the tree should be drawn.
Sequences that are the most closely related are drawn as neighboring branches on a tree.
Given a set of genes (such as a family of genes) phylogenetic analysis can help determine
which genes are likely to have equivalent functions.
Phylogenetic analysis is also used to follow changes occurring in a rapidly changing species such as a virus. Take for
instance influenza. By studying the rapidly changing genes through phylogenetic
analysis, the next year’s strain can be predicted, and a flu vaccination can be developed.
The prediction is not always correct, but it gives a level of protection.
Tree of Life
On one level, it is interesting to understand and study how the evolution of species has
occurred. There are many different resources discussing the evolution of species. This
includes the NCBI taxonomy web sites, and the University of Arizona’s tree of life
project. We’ll take a look at both of these web sites in order to get a better appreciation
for the evolution of species relative to one another.
Evolutionary Trees
An evolutionary tree is composed of outer branches or leaves that represent the taxa and
nodes and branches representing the relationships among the taxa. Two taxa that are
derived from the same common ancestor will share a node in the graph. In general,
approaches to designing evolutionary trees attempt to define the length of each branch to
the next node according to the number of sequence level changes that occurred. One
thing to be careful of in phylogenetic analysis is that this distance may not be in direct
relation to evolutionary time. The assumption that mutations accumulate at a uniform rate
is known as the molecular clock hypothesis.
Rooted Trees
In a rooted tree topology, one sequence (the root) is defined to be the common ancestor
of all of the other sequences. A unique path leads from the root node to any other node,
and the direction of the path indicates evolutionary time. The root is chosen by including
a sequence from an organism that is thought to have branched off earlier than the other
sequences. If the molecular clock hypothesis holds, it is also possible to predict a root.
As the number of sequences increases, the number of possible rooted trees increases very
rapidly. In some cases, a bifurcating (binary) tree is the best model of evolutionary
events, in which one species branches off into two separate species.
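To get a feel for how quickly the number of trees grows, the standard counting formulas for bifurcating trees with labeled leaves can be evaluated directly: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n sequences. These formulas are not derived in these notes, and the function names are made up for this sketch.

```python
def odd_double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 3; returns 1 when k <= 1."""
    result = 1
    for x in range(k, 1, -2):
        result *= x
    return result

def count_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled sequences."""
    return odd_double_factorial(2 * n - 3)

def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled sequences."""
    return odd_double_factorial(2 * n - 5)

for n in (4, 5, 10, 20):
    print(n, count_rooted_trees(n), count_unrooted_trees(n))
```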
There are three methods used to calculate the tree(s) that best account for the observed
variation in a set of sequences. These methods are maximum parsimony, distance, and
maximum likelihood.
Maximum Parsimony
Maximum parsimony methods predict the evolutionary tree that minimizes the number of
steps required to generate the observed variation in the sequences. In order to construct
a tree using maximum parsimony, a multiple sequence alignment must first be obtained.
For each aligned position, phylogenetic trees that require the smallest number of
evolutionary changes to produce the observed sequence changes are identified. This
continues for each position in the alignment. Those trees that produce the smallest
number of changes overall for all sequence positions are identified. This is a rather time
consuming algorithm that only works well if the sequences have a strong sequence
similarity.
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Consider the example above (Mount, 250). There are a total of four sequences, which
gives a possibility of three different unrooted trees. In this case some sites are
informative, and other sites are not. An informative site has at least two different
characters, each of which occurs in at least two of the sequences. Only the informative
sites need to be considered.
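The informative sites in this example can be found with a few lines of code. The sketch below uses the usual parsimony criterion (at least two different characters, each occurring in at least two sequences); the function name is made up for this example, and positions are reported starting from 1.

```python
from collections import Counter

SEQUENCES = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

def informative_sites(sequences):
    """Return the 1-based positions of parsimony-informative columns."""
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        repeated = sum(1 for n in counts.values() if n >= 2)
        if repeated >= 2:  # two or more characters each seen at least twice
            sites.append(position)
    return sites

print(informative_sites(SEQUENCES))
```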
Possible trees (figure): the three unrooted trees pair the sequences as ((1,2),(3,4)),
((1,3),(2,4)), and ((1,4),(2,3)).
In this case, the optimal tree is obtained by adding the number of changes at each
informative site for each tree, and picking the tree requiring the least total number of
changes.
For a large number of sequences the number of trees to examine becomes so large that it
might not be possible to examine all possible trees. Some programs, such as PAUP, add
features that will allow the user to invoke a heuristic that will keep representative trees
that best fit the data.
Let’s go through the possible trees, and figure out the number of rearrangements for each
in the informative sites. (SEE THE POWERPOINT PRESENTATION)
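For four sequences the count can also be done by machine. The sketch below scores the three unrooted trees, written as pairings ((1,2),(3,4)), ((1,3),(2,4)), and ((1,4),(2,3)), at the informative sites only. It uses Fitch's set-intersection rule to count the minimum number of changes per site, which is one standard way to do this and is not necessarily how the slides present it; the function names are made up for this example.

```python
from collections import Counter

SEQUENCES = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

def is_informative(column):
    """At least two characters, each present in at least two sequences."""
    return sum(1 for n in Counter(column).values() if n >= 2) >= 2

def fitch_join(x, y):
    """Fitch's rule: intersect if possible (no change), else take the union (one change)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)

def changes_at_site(site, tree):
    """Minimum number of changes at one 0-based site for a four-sequence tree."""
    (a, b), (c, d) = tree
    left, n1 = fitch_join({SEQUENCES[a][site]}, {SEQUENCES[b][site]})
    right, n2 = fitch_join({SEQUENCES[c][site]}, {SEQUENCES[d][site]})
    _, n3 = fitch_join(left, right)
    return n1 + n2 + n3

def tree_scores():
    """Total changes over the informative sites for each candidate tree."""
    length = len(SEQUENCES[1])
    sites = [s for s in range(length)
             if is_informative([seq[s] for seq in SEQUENCES.values()])]
    return {tree: sum(changes_at_site(s, tree) for s in sites) for tree in TREES}

scores = tree_scores()
print(scores)
print(min(scores, key=scores.get))  # the most parsimonious pairing
```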
One problem with determining evolutionary distance between sequences is that columns
representing greater variation dominate the analysis. One way to overcome this problem
of long branch lengths is to look only at transversion events, which are the
most significant base changes (i.e., changes from a purine to a pyrimidine or vice versa).
This is referred to as Lake’s method of invariants.
Distance Methods
The distance method for construction of phylogenetic trees looks at the number of
changes between each pair in a group of sequences to produce a phylogenetic tree of the
group. The goal of distance methods is to identify a tree that positions neighbors
correctly and that also has branch lengths which reproduce the original data as closely as
possible.
Phylip http://evolution.genetics.washington.edu/phylip.html
FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the
Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining (no molecular clock
assumed) or unweighted pair group method with arithmetic mean (UPGMA) (molecular
clock assumed).
For phylogenetic analysis, the distance score is either the number of mismatched
positions in the alignment or the number of sequence positions that must be changed to
convert one sequence into the other.
The success of distance methods depends on the degree to which the distances among a
set of sequences can be made additive on a predicted evolutionary tree.
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
The distances between these sequences can be shown as a table:
     A    B    C    D
A    -    3    7    8
B    -    -    6    7
C    -    -    -    3
D    -    -    -    -
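The table can be reproduced by counting mismatched positions for every pair of sequences. The sketch below is a minimal illustration with made-up function names; it assumes the sequences are already aligned and of equal length.

```python
from itertools import combinations

SEQUENCES = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}

def mismatches(s, t):
    """Number of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))

def distance_table(sequences):
    """Pairwise mismatch counts, keyed by (name1, name2)."""
    return {(p, q): mismatches(sequences[p], sequences[q])
            for p, q in combinations(sequences, 2)}

for pair, distance in distance_table(SEQUENCES).items():
    print(pair, distance)
```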
Using this information, an unrooted tree showing the relationship between these
sequences can be drawn:
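[Figure: unrooted tree. A (branch length 2) and B (branch length 1) join at one internal
node; C (branch length 1) and D (branch length 2) join at the other; the internal branch
connecting the two nodes has length 4.]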
The Fitch and Margoliash method uses a distance table. The sequences are combined in
threes to define the branches of the predicted tree and to calculate the branch lengths of
the tree.
1) Draw an unrooted tree with three branches originating from a common node and
label the ends:
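[Figure: unrooted tree with leaves A, B, and C joined at a common node by branches of
lengths a, b, and c]

     A    B    C
A   --   22   39
B   --   --   41
C   --   --   --

The distance table gives three equations in the branch lengths:
a + b = 22 (1)
a + c = 39 (2)
b + c = 41 (3)
Subtracting (2) from (3):
b + c = 41
-a – c = -39
__________
b – a = 2 (4)
Adding (1) and (4) gives 2b = 24, so b = 12, a = 10, and c = 29.

[Figure: the resulting tree, with branch lengths a = 10, b = 12, and c = 29]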
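The three branch lengths can also be obtained in closed form by solving the three pairwise-distance equations a + b = d(A,B), a + c = d(A,C), and b + c = d(B,C). The sketch below does exactly that; the function name is made up for this example.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of a three-leaf tree from its pairwise distances."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

print(three_point_branches(22, 39, 41))
```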
Example of Fitch-Margoliash Using Five Sequences
A B C D E
A -- 22 39 39 41
B -- -- 41 41 43
C -- -- -- 18 20
D -- -- -- -- 10
E -- -- -- -- --
Suppose that the initial tree is as follows:
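[Figure: initial tree. A and B (branches a and b) join at one node, and D and E (branches d
and e) join at another. C (branch c) attaches to a central node, which is connected to the
A/B node by branch f and to the D/E node by branch g.]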
1) The first step is to locate the most closely related sequences in the distance table. In
this case, that would be sequences D and E.
2) Now create a new table by combining the remaining sequences. For the distance from
D to A,B,C take the average distance of each of these to D ( (39 + 41 + 18) / 3 = 32.7)
For the distance from E to A,B,C, take the average distance of each of these to E
((41+43+20)/3 = 34.7). The resulting table is as follows:
D E AVG ABC
D -- 10 32.7
E -- -- 34.7
AVG ABC -- -- --
3) The average distances from D to ABC and E to ABC could also be found by
averaging the sum of the appropriate branch lengths:
D to E: d + e = 10 (1)
D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)
Subtracting (3) from (2) gives d – e = -2; combined with (1), d = 4 and e = 6.
4) Now treat D and E as a single sequence, and create a new distance table. The
distance to DE is taken as the average of sequence A to D and A to E. The other
distances are calculated in a similar fashion. The resulting distance table is:
A B C (DE)
A -- 22 39 40
B -- -- 41 42
C -- -- -- 19
(DE) -- -- -- --
5) Identify the most closely related sequences in the table. In this case, it is C to DE.
Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.
6) Repeat the process until all lengths have been identified, in which case there is
only a single composite node left.
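The first two rounds of this example can be checked numerically. The sketch below is only an illustration with made-up function names: it treats each round as a three-point problem between two neighbors and the average of everything else, and it recovers g by subtracting the average of d and e from the branch leading to the DE composite.

```python
from itertools import product

D = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
     ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
     ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}

def dist(x, y):
    """Look up a distance regardless of the order of the two names."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def mean_dist(group1, group2):
    """Average distance between every member of group1 and every member of group2."""
    pairs = list(product(group1, group2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

def three_point(d_xy, d_xz, d_yz):
    """Branch lengths leading to x, y, and z in a three-leaf tree."""
    return ((d_xy + d_xz - d_yz) / 2,
            (d_xy + d_yz - d_xz) / 2,
            (d_xz + d_yz - d_xy) / 2)

# Round 1: D and E are closest; A, B, and C are averaged together.
d, e, _ = three_point(mean_dist("D", "E"), mean_dist("D", "ABC"), mean_dist("E", "ABC"))

# Round 2: C and the composite DE are closest; A and B are averaged together.
c, de_stem, _ = three_point(mean_dist("C", "DE"), mean_dist("C", "AB"), mean_dist("DE", "AB"))
g = de_stem - (d + e) / 2  # remove the average of d and e from the composite's branch

print(round(d, 3), round(e, 3), round(c, 3), round(g, 3))
```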
[Figure: tree diagram of sequences A through E]
The tree is modified by joining pairs of sequences. The pair to be joined is chosen by
calculating the sum of the branch lengths for the corresponding tree. The sum of the
branch lengths is calculated as follows:
$$S_{mn} = \frac{\sum_i (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}$$
where i and j represent all sequences except m and n, with i < j, and N is the number of
sequences.
The pair that results in the smallest branch length is then chosen to be the pair that is
joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the
actual branch lengths.
After the pair has been joined, a new distance table is created with the recently joined
sequences now entered as a composite. The neighbor-joining algorithm chooses the next
pair of sequences to join, and the F-M algorithm computes the branch lengths.
The process continues until the correctly branched tree and distances have been
identified.
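The pair-selection step can be sketched as follows, using the five-sequence distance table from the Fitch-Margoliash example as input. This illustrates only the choice of the first pair to join via the sum of branch lengths; it does not build the whole tree, and the function names are made up for this example.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
D = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
     ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
     ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}

def dist(x, y):
    """Look up a distance regardless of the order of the two names."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def branch_length_sum(m, n, taxa):
    """Sum of branch lengths of the tree in which m and n are joined."""
    others = [t for t in taxa if t not in (m, n)]
    N = len(taxa)
    to_pair = sum(dist(i, m) + dist(i, n) for i in others)
    among_others = sum(dist(i, j) for i, j in combinations(others, 2))
    return to_pair / (2 * (N - 2)) + dist(m, n) / 2 + among_others / (N - 2)

def best_pair(taxa):
    """The pair whose joining gives the smallest sum of branch lengths."""
    return min(combinations(taxa, 2), key=lambda p: branch_length_sum(*p, taxa))

pair = best_pair(TAXA)
print(pair, round(branch_length_sum(*pair, TAXA), 3))
```

Note that the pair chosen here need not be the pair with the smallest distance in the table, which is one way neighbor-joining differs from simply clustering the closest sequences.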
Maximum Likelihood
The choice among these methods depends upon the sequences that are
being compared. If there is strong sequence similarity, then maximum parsimony
methods are best. If there is not strong sequence similarity, but clearly recognizable
sequence similarity, then distance methods work best. For all others, the best approach is
a maximum likelihood model.
Rearrangements of genetic material can also lead to false conclusions with phylogenetic
analysis, especially if two sequences of different evolutionary origins are placed next to
each other.
Gene duplication events also cause problems with phylogenetic analysis, since the
duplicated genes can evolve along separate pathways, leading to different functions.
TreeTop http://www.genebee.msu.su/services/phtree_reduced.html
Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
ATV (A Tree Viewer) http://www.genetics.wustl.edu/eddy/atv/
How to Make a phylogenetic tree http://hiv-web.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html
A Brief Review of Common Tree Making Methods
http://bioinfo.mbb.yale.edu/mbb452a/projects/Patricia-M-Strickler.html
NCBI Primer on Phylogenetics http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
List of Phylogeny Programs
http://evolution.genetics.washington.edu/phylip/software.html
TreeViewer http://www.avl.iu.edu/projects/DNAml/
Phylip http://evolution.genetics.washington.edu/phylip.html
CECS694-02
Introduction to Bioinformatics
Lecture 10
Phylogenetic Prediction
Tree of Life
On one level, it is interesting to understand and study how the evolution of species has
occurred. There are many different resources discussing the evolution of species. This
includes the NCBI taxonomy web sites, and the University of Arizona’s tree of life
project. We’ll take a look at both of these web sites in order to get a better appreciation
for the evolution of species relative to one another.
Evolutionary Trees
An evolutionary tree is composed of outer branches or leaves that represent the taxa and
nodes and branches representing the relationships among the taxa. Two taxa that are
derived from the same common ancestor will share a node in the graph. In general,
approaches to designing evolutionary trees attempt to define the length of each branch to
the next node according to the number of sequence level changes that occurred. One
thing to be careful of in phylogenetic analysis is that this distance may not be in direct
relation to evolutionary time. The assumption that mutations accumulate at a uniform rate
is known as the molecular clock hypothesis.
Rooted Trees
In a rooted tree topology, one sequence (the root) is defined to be the common ancestor
of all of the other sequences. A unique path leads from the root node to any other node,
and the direction of the path indicates evolutionary time. The root is chosen by including
a sequence from an organism that is thought to have branched off earlier than the other
sequences. If the molecular clock hypothesis holds, it is also possible to predict a root.
As the number of sequences increases, the number of possible rooted trees increases very
rapidly. In some cases, a bifurcating binary tree is the best model of
evolutionary events, in which one species branches off into two separate species.
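To give a feel for how quickly the number of trees grows, the following sketch counts bifurcating trees for n labeled sequences. It assumes the standard counting results, (2n - 3)!! rooted trees and (2n - 5)!! unrooted trees; the function names are illustrative and not taken from any package.

```python
def odd_double_factorial(k):
    """Return 1 * 3 * 5 * ... * k for odd k (returns 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return odd_double_factorial(2 * n - 3)


def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return odd_double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (4, 5, 10):
        print(n, count_rooted_trees(n), count_unrooted_trees(n))
```

For four sequences this gives three unrooted trees, which is the count used in the maximum parsimony example in these notes.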
There are three methods used to calculate the tree(s) that best account for the observed
variation in a set of sequences. These methods are maximum parsimony, distance, and
maximum likelihood.
Maximum Parsimony
Maximum parsimony methods predict the evolutionary tree that minimizes the number of
steps required to generate the observed variation in the sequences. In order to construct
a tree using maximum parsimony, a multiple sequence alignment must first be obtained.
For each aligned position, phylogenetic trees that require the smallest number of
evolutionary changes to produce the observed sequence changes are identified. This
continues for each position in the alignment. Those trees that produce the smallest
number of changes overall for all sequence positions are identified. This is a rather
time-consuming algorithm that works well only if the sequences are strongly similar.
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Consider the example above (Mount, 250). There are a total of four sequences, which
gives three possible unrooted trees. In this case some sites are
informative, and other sites are not. An informative site has at least two different
characters, each of which occurs in at least two of the sequences. Only the informative
sites need to be considered; here they are positions 5, 7, and 9.
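Possible trees (each groups the four sequences into two pairs joined by a central branch):
Tree 1: (1,2) and (3,4)
Tree 2: (1,3) and (2,4)
Tree 3: (1,4) and (2,3)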
In this case, the optimal tree is obtained by adding the number of changes at each
informative site for each tree, and picking the tree requiring the least total number of
changes.
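The selection can be sketched in a few lines for the four-sequence example. This is a minimal illustration, not the procedure used by PAUP or any other program: `is_informative` and `quartet_changes` are hypothetical helper names, and the change count uses Fitch-style set intersection, which for a four-leaf tree gives the minimum number of changes at a site.

```python
from collections import Counter

# The four aligned sequences from the example (Mount, 250).
SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

# The three possible unrooted trees, written as two pairs of neighbors.
TREES = {"(1,2),(3,4)": ((1, 2), (3, 4)),
         "(1,3),(2,4)": ((1, 3), (2, 4)),
         "(1,4),(2,3)": ((1, 4), (2, 3))}


def is_informative(column):
    """A site is informative if at least two characters each occur twice or more."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def quartet_changes(a, b, c, d):
    """Minimum changes at one site for the tree that pairs (a, b) and (c, d)."""
    changes = 0
    left = {a} & {b}
    if not left:                 # neighbors differ: one change on this side
        left, changes = {a, b}, changes + 1
    right = {c} & {d}
    if not right:
        right, changes = {c, d}, changes + 1
    if not (left & right):       # the two sides disagree: one more change
        changes += 1
    return changes


length = len(SEQS[1])
informative = [i for i in range(length)
               if is_informative([SEQS[k][i] for k in SEQS])]

scores = {}
for name, ((p, q), (r, s)) in TREES.items():
    scores[name] = sum(quartet_changes(SEQS[p][i], SEQS[q][i],
                                       SEQS[r][i], SEQS[s][i])
                       for i in informative)

best_tree = min(scores, key=scores.get)

if __name__ == "__main__":
    print([i + 1 for i in informative], scores, best_tree)
```

Under these assumptions the tree pairing sequences 1 with 2 and 3 with 4 requires the fewest changes at the informative sites.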
For a large number of sequences, the number of trees to examine becomes so large that it
might not be possible to examine all possible trees. Some programs, such as PAUP, add
features that allow the user to invoke a heuristic that keeps representative trees
that best fit the data.
Let’s go through the possible trees, and figure out the number of rearrangements for each
in the informative sites. (SEE THE POWERPOINT PRESENTATION)
One problem with determining evolutionary distance between sequences is that columns
representing greater variation dominate the analysis. One way to overcome this problem
with long branches is to look only at transversion events, which are the
most significant base changes (i.e., changes from a purine to a pyrimidine or vice versa). This
is referred to as Lake’s method of invariants.
LOOK AT THE MITOCHONDRIAL SEQUENCE ANALYSIS ON P 252
Distance Methods
The distance method for construction of phylogenetic trees looks at the number of
changes between each pair in a group of sequences to produce a phylogenetic tree of the
group. The goal of distance methods is to identify a tree that positions neighbors
correctly and that also has branch lengths which reproduce the original data as closely as
possible.
PHYLIP (http://evolution.genetics.washington.edu/phylip.html) includes the following
distance programs:
FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the
Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining (no molecular clock
assumed) or unweighted pair group method with arithmetic mean (UPGMA) (molecular
clock assumed).
For phylogenetic analysis, the distance score is counted as either the number of mismatched
positions in the alignment or the number of sequence positions that must be changed to
generate the other sequence.
The success of distance methods depends on the degree to which the distances among a
set of sequences can be made additive on a predicted evolutionary tree.
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
The distances between these sequences can be shown as a table:
     A    B    C    D
A    -    3    7    8
B    -    -    6    7
C    -    -    -    3
D    -    -    -    -
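The table can be reproduced by counting mismatched positions between each pair of aligned sequences. The sketch below assumes the sequences are already aligned and of equal length; the function name `mismatch_distance` is illustrative.

```python
from itertools import combinations

# The four aligned sequences from the example above.
ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_distance(seq1, seq2):
    """Count the aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq1, seq2) if x != y)


# Pairwise distance table keyed by (name1, name2).
distances = {(p, q): mismatch_distance(ALIGNMENT[p], ALIGNMENT[q])
             for p, q in combinations(sorted(ALIGNMENT), 2)}

if __name__ == "__main__":
    for pair, d in distances.items():
        print(pair, d)
```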
Using this information, an unrooted tree showing the relationship between these
sequences can be drawn:
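[Figure: unrooted tree. A (branch length 2) and B (branch length 1) join at one internal
node; C (branch length 1) and D (branch length 2) join at a second internal node; the
branch connecting the two internal nodes has length 4.]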
The Fitch and Margoliash method uses a distance table. The sequences are combined in
threes to define the branches of the predicted tree and to calculate the branch lengths of
the tree.
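Draw an unrooted tree with three branches originating from a common node, and label
the ends A, B, and C and the corresponding branch lengths a, b, and c. For example, given
the distance table:
     A    B    C
A    --   22   39
B    --   --   41
C    --   --   --
the branch lengths must satisfy:
a + b = 22 (1)
a + c = 39 (2)
b + c = 41 (3)
Subtracting (2) from (3):
 b + c = 41
-a – c = -39
__________
b – a = 2 (4)
Adding (1) and (4) gives 2b = 24, so b = 12, a = 10, and c = 29. The resulting tree has
branches of length 10 to A, 12 to B, and 29 to C from the common node.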
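The three pairwise distances determine the three branch lengths directly. The sketch below solves the system in closed form; `three_point` is an illustrative name, not a PHYLIP routine.

```python
def three_point(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


if __name__ == "__main__":
    # Distances from the table: A-B = 22, A-C = 39, B-C = 41.
    print(three_point(22, 39, 41))
```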
Example of Fitch-Margoliash Using Five Sequences
A B C D E
A -- 22 39 39 41
B -- -- 41 41 43
C -- -- -- 18 20
D -- -- -- -- 10
E -- -- -- -- --
Suppose that the initial tree is as follows:
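[Figure: unrooted tree. A (branch a) and B (branch b) join at one node, which is connected
by branch f to a central node; C attaches to the central node by branch c; the central node
is connected by branch g to a third node, where D (branch d) and E (branch e) join.]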
1) The first step is to locate the most closely related sequences in the distance table. In
this case, that would be sequences D and E.
2) Now create a new table by combining the remaining sequences. For the distance from
D to A,B,C take the average distance of each of these to D ( (39 + 41 + 18) / 3 = 32.7)
For the distance from E to A,B,C, take the average distance of each of these to E
((41+43+20)/3 = 34.7). The resulting table is as follows:
           D     E     AVG ABC
D          --    10    32.7
E          --    --    34.7
AVG ABC    --    --    --
3) The average distances from D to ABC and E to ABC could also be found by
averaging the sum of the appropriate branch lengths:
D to E: d + e = 10 (1)
D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)
Subtracting (2) from (3) gives e – d = 2, so from (1), d = 4 and e = 6.
4) Now treat D and E as a single sequence, and create a new distance table. The
distance from A to DE is taken as the average of the distances from A to D and A to E. The other
distances are calculated in a similar fashion. The resulting distance table is:
A B C (DE)
A -- 22 39 40
B -- -- 41 42
C -- -- -- 19
(DE) -- -- -- --
5) Identify the most closely related sequences in the table. In this case, it is C and (DE).
Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.
6) Repeat the process until all lengths have been identified, at which point only a
single composite node is left.
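The first two rounds of the five-sequence example can be checked numerically. This is a sketch under the assumptions stated in the steps above: composite distances are plain averages over the member sequences, and the distance from the central node to the composite (DE) is taken as g plus the average of d and e. The helper names are illustrative.

```python
# Pairwise distances from the five-sequence table.
DIST = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
        ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
        ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}


def dist(p, q):
    """Look up a pairwise distance regardless of the order of the names."""
    return DIST[(p, q)] if (p, q) in DIST else DIST[(q, p)]


def average_distance(group1, group2):
    """Average distance over all pairs drawn from two groups of sequences."""
    total = sum(dist(p, q) for p in group1 for q in group2)
    return total / (len(group1) * len(group2))


def three_point(d_xy, d_xz, d_yz):
    """Branch lengths (x, y, z) of a three-branch tree from pairwise distances."""
    return ((d_xy + d_xz - d_yz) / 2,
            (d_xy + d_yz - d_xz) / 2,
            (d_xz + d_yz - d_xy) / 2)


# Round 1: D and E are the closest pair; everything else is averaged as ABC.
d_len, e_len, m_len = three_point(dist("D", "E"),
                                  average_distance("D", "ABC"),
                                  average_distance("E", "ABC"))

# Round 2: C is closest to the composite (DE); A and B are averaged as AB.
c_len, to_de, _ = three_point(average_distance("C", "DE"),
                              average_distance("C", "AB"),
                              average_distance("DE", "AB"))
g_len = to_de - (d_len + e_len) / 2   # remove the average depth of D and E

if __name__ == "__main__":
    print(d_len, e_len, c_len, g_len)
```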
Neighbor-Joining Method
[Figure: tree diagram for sequences A through E]
The tree is modified by joining pairs of sequences. The pair to be joined is chosen by
calculating the sum of the branch lengths for the corresponding tree. The sum of the
branch lengths is calculated as follows:
$$S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}$$
where N is the number of sequences, i and j represent all sequences except m and n, and i < j.
The pair that results in the smallest sum of branch lengths is then chosen to be the pair that is
joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the
actual branch lengths.
After the pair has been joined, a new distance table is created with the recently joined
sequences now entered as a composite. The neighbor-joining algorithm chooses the next
pair of sequences to join, and the F-M algorithm computes the branch lengths.
The process continues until the correctly branched tree and distances have been
identified.
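The pair selection can be sketched by evaluating the sum of branch lengths S_mn for every pair and keeping the smallest. The example reuses the five-sequence distance table from the Fitch-Margoliash example as an assumption for illustration; `branch_length_sum` is a hypothetical helper, not the interface of the NEIGHBOR program.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
DIST = {("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
        ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
        ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}


def dist(p, q):
    """Look up a pairwise distance regardless of the order of the names."""
    return DIST[(p, q)] if (p, q) in DIST else DIST[(q, p)]


def branch_length_sum(m, n, names=NAMES):
    """Sum of branch lengths S_mn for the tree in which m and n are joined."""
    others = [x for x in names if x not in (m, n)]
    count = len(names)
    # Distances from every other sequence to m and to n.
    to_pair = sum(dist(i, m) + dist(i, n) for i in others)
    # Distances among the remaining sequences (i < j).
    among_others = sum(dist(i, j) for i, j in combinations(others, 2))
    return (to_pair / (2 * (count - 2))
            + dist(m, n) / 2
            + among_others / (count - 2))


sums = {pair: branch_length_sum(*pair) for pair in combinations(NAMES, 2)}
best_pair = min(sums, key=sums.get)

if __name__ == "__main__":
    for pair, value in sorted(sums.items(), key=lambda item: item[1]):
        print(pair, round(value, 2))
```

With this table the pair with the smallest sum is not the pair with the smallest raw distance, which is the point of using S_mn rather than the distance table alone.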
UPGMA (unweighted pair group method with arithmetic mean) works by clustering the
sequences, starting with the more similar sequences and working
towards more distant sequences.
The process assembles a tree upwards, with each node being added above the others, and
the edge lengths being determined by the difference in the heights of the nodes.
The distance d_ij between two clusters C_i and C_j is defined to be the average distance
between pairs of sequences from each cluster:
$$d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$$
where |C_i| and |C_j| are the number of sequences in clusters i and j, respectively.
7. Continue steps 3-6 until only two clusters i and j remain, and place the root of the
tree at height dij/2
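A compact sketch of the clustering loop follows. It recomputes the average distance between clusters directly from the definition above rather than updating a table, which is simple but not efficient; the function name `upgma` and the use of the five-sequence table from the Fitch-Margoliash example are assumptions made for illustration.

```python
from itertools import combinations

DIST = {frozenset(pair): d for pair, d in {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10}.items()}


def upgma(dist):
    """Return the list of merges as (cluster_i, cluster_j, node_height)."""
    leaves = sorted({name for pair in dist for name in pair})
    clusters = [(leaf,) for leaf in leaves]   # every sequence starts alone

    def average(ci, cj):
        # Average distance between all pairs of sequences from the two clusters.
        total = sum(dist[frozenset((p, q))] for p in ci for q in cj)
        return total / (len(ci) * len(cj))

    merges = []
    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((ci, cj, average(ci, cj) / 2))   # parent at height d_ij / 2
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci + cj]
    return merges


if __name__ == "__main__":
    for ci, cj, height in upgma(DIST):
        print(ci, cj, round(height, 2))
```

The last merge joins the final two clusters, so its height is the height of the root.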
EXAMPLE OF UPGMA
Consider the case where there are five sequences represented by dots on a graph. The
spacing between each of these is representative of the distance between them:
The first step is to assign each of the sequences to their own cluster, which now gives a
number to each of these. In addition, the tree can be constructed at the base, where each
sequence is a leaf of the tree:
[Figure: five points labeled 1 through 5; each is a leaf of the tree]
Now select the two clusters that are closest to each other. These are sequences 1 and
2. Create a single cluster for these two sequences, and create a parent node in the tree at
height d12/2.
[Figure: sequences 1 and 2 enclosed in one cluster; in the tree, leaves 1 and 2 are joined
under node 6]
Continue on, selecting the two clusters that are closest: in this case, it is 4 and 5. Combine
them into a single cluster, and update the tree:
[Figure: sequences 4 and 5 enclosed in one cluster; in the tree, leaves 4 and 5 are joined
under node 7]
The next two clusters are the one containing 4 and 5, and the one containing 3:
[Figure: the cluster containing 4 and 5 merged with sequence 3; in the tree, node 7 and
leaf 3 are joined under node 8]
There are now only two clusters left, so join them to complete the tree:
[Figure: the completed tree; nodes 6 and 8 are joined under root node 9]
SEE STARTING ON P 262
Maximum Likelihood
Which of these methods to use depends upon the sequences that are
being compared. If there is strong sequence similarity, then maximum parsimony
methods are best. If the sequence similarity is weaker but still clearly recognizable,
then distance methods work best. For all others, the best approach is
a maximum likelihood model.
Difficulties with phylogenetic analysis
Rearrangements of genetic material can also lead to false conclusions in phylogenetic
analysis, especially if two sequences of different evolutionary origins are placed next to
each other.
Gene duplication events also cause problems with phylogenetic analysis, since the
duplicated genes can evolve along separate pathways, leading to different functions.
>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH
GKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV
TAVAS
ALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGK
EFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKV
KAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGK
EFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH
GKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV
TGVAS
ALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKK
HGATVL
TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKAL
ELFR
KDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQ
KVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLLGHCLLVTLARHYPGDFSPA
LQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGS
KVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAE
AHAAWDKFLSVVSSVLTEKYR
CECS694-02
Introduction to Bioinformatics
Lecture 11
RNA Secondary Structure Prediction
RNA molecules are important to study since they are involved in important biochemical
functions, including translation, RNA splicing, processing and editing, cellular
localization, and catalysis.
RNA sequence analysis needs to be treated differently than DNA sequence analysis,
since RNA structures fold and base pair with themselves to form secondary structures.
Therefore, it is not necessarily the sequence but the structure conservation that is most
important in RNA sequence analysis.
Variations in RNA sequence maintain base-pairing patterns that give rise to these
secondary structures. Therefore, when one nucleotide
of a base pair changes, the nucleotide with which it pairs must also change to maintain the same
structure. For instance, if you have the base pair G-C, and the G mutates to an A, then
the C should mutate to a U to maintain base pairing at this location, which preserves the
same secondary structure. Such a variation is referred to as covariation.
In order to determine the secondary structure of the RNA molecule, all possible choices
of complementary sequences are considered, and the sets that provide the most
energetically stable molecules are chosen.
Another method to predict secondary structure in RNA takes into account conserved
patterns of base-pairing. Positions of covariance are studied, and are taken to be
conserved matches, since they maintain the secondary structure. Locating regions of
covariance in sequence data is a computationally challenging task.
G-C and A-U form complementary hydrogen bonded base pairs, with the GC base pairs
being more stable since they form three hydrogen bonds as opposed to the two hydrogen
bonds formed by AU base pairs.
RNA is typically produced as a single stranded molecule (unlike DNA) which folds upon
itself to form base pairs. This structure is referred to as the secondary structure of the
RNA.
RNA secondary structure can be viewed as an intermediary between a linear molecule
and a three-dimensional structure. RNA secondary structure is mainly composed of
double-stranded RNA regions formed by folding the single-stranded RNA molecule back
on itself. There are a number of different secondary structures that can be formed from
this base-pairing, including:
Bulge Loops
Bulge loops occur when bases on one side of the structure cannot form base pairs.
Interior Loops
Interior loops occur when bases on both sides of the structure cannot form base pairs.
Junctions or Multiloops
In addition, tertiary interactions can be present as well. Such tertiary interactions are
located using covariance analysis. The types of tertiary interactions present in RNA
molecules include:
Kissing Hairpins
In kissing hairpins, the unpaired bases of two separate hairpin loops base pair with one
another.
Pseudoknots
Hairpin-Bulge Interactions
One method of representing the base pairs of a secondary structure is to draw the
structure in a circle. An arc is drawn to represent each base pairing found in the
structure. If any of the arcs cross, then a pseudoknot is present.
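The crossing-arc test is easy to state in code. In the sketch below each base pair is a tuple of sequence positions, and two arcs cross when exactly one end of the second pair lies between the ends of the first; `has_pseudoknot` is an illustrative name.

```python
def has_pseudoknot(base_pairs):
    """Return True if any two base pairs (i, j) and (k, l) satisfy i < k < j < l."""
    # Normalize each pair so the smaller position comes first, then sort.
    pairs = sorted((min(i, j), max(i, j)) for i, j in base_pairs)
    for index, (i, j) in enumerate(pairs):
        for k, l in pairs[index + 1:]:
            if i < k < j < l:      # arcs (i, j) and (k, l) cross
                return True
    return False


if __name__ == "__main__":
    nested = [(1, 10), (2, 9), (3, 8)]   # a simple stem: arcs are nested
    crossing = [(1, 5), (3, 8)]          # second arc starts inside, ends outside
    print(has_pseudoknot(nested), has_pseudoknot(crossing))
```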
With RNA sequences, homology is not defined in terms of sequence similarity, but rather
in terms of common secondary structure. Two sequences that do not appear to have
significant sequence similarity can still have conserved secondary structure.
In order to use comparative sequence analysis, the first step is to calculate a multiple
sequence alignment. This requires that the sequences be similar enough so that they can
be initially aligned. At the same time, the sequences should be dissimilar enough so that
covarying substitutions can be detected.
The mutual information gained by aligning two columns that covary is determined by the
function:
$$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$
where f_{x_i} is the frequency of base x_i in column i, and f_{x_i x_j} is the joint (pairwise)
frequency of the base pair x_i, x_j in columns i and j. For RNA, the information ranges
between 0 and 2 bits. If columns i and j are uncorrelated, the mutual information is 0.
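A small sketch of this calculation for two alignment columns is given below. Each column is passed as a string with one character per sequence; gaps and pseudocounts are ignored, which is a simplification, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(column_i) != len(column_j):
        raise ValueError("columns must contain one base per sequence")
    n = len(column_i)
    freq_i = Counter(column_i)                  # counts of bases in column i
    freq_j = Counter(column_j)                  # counts of bases in column j
    freq_ij = Counter(zip(column_i, column_j))  # counts of observed pairs
    total = 0.0
    for (x, y), count in freq_ij.items():
        joint = count / n
        expected = (freq_i[x] / n) * (freq_j[y] / n)
        total += joint * log2(joint / expected)
    return total


if __name__ == "__main__":
    # Perfect covariation over all four Watson-Crick pairs.
    print(mutual_information("GCAU", "CGUA"))
    # Columns that vary independently of one another.
    print(mutual_information("GGCC", "AUAU"))
```

Note that a pair of fully conserved columns also scores 0, since there is no variation from which covariation could be detected.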
An example of a plot for the mutual information of the yeast tRNA-Phe is given below
(Durbin, et al., p 268):
The mutual information from this graph produces the following structure:
One approach to predicting secondary structure looks at finding the structure with the
most base pairs. An efficient dynamic programming approach to this problem was
introduced in the late 1970’s by Nussinov.
According to the Nussinov algorithm, there are four ways to get the best structure from i
to j from the best structures of the smaller subsequences:
1) Add the i,j pair onto the best structure found for subsequence i+1, j-1
2) Add unpaired position i onto the best structure for subsequence i+1, j
3) Add unpaired position j onto the best structure for subsequence i, j-1
4) Combine two optimal structures i,k and k+1, j
The Nussinov RNA folding prediction program works by comparing a sequence against
itself in a dynamic programming matrix with the above rules for scoring the structure at a
particular point. Since the structure is folding upon itself, it is only necessary to calculate
half of the matrix.
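The fill stage can be sketched as follows. This is a minimal version that scores 1 for each Watson-Crick pair (G-C or A-U) and 0 otherwise, with no minimum loop length and no traceback; those choices, and the function name `nussinov_fill`, are assumptions made for illustration rather than details of a particular program.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def nussinov_fill(seq, allowed=WATSON_CRICK):
    """Return the upper-triangular matrix of maximum base-pair counts."""
    n = len(seq)
    # All entries start at 0, which covers the main diagonal and the
    # diagonal just below it.
    M = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):              # subsequence length
        for i in range(n - length + 1):
            j = i + length - 1
            best = max(M[i + 1][j],             # rule 2: i unpaired
                       M[i][j - 1])             # rule 3: j unpaired
            if (seq[i], seq[j]) in allowed:     # rule 1: i pairs with j
                best = max(best, M[i + 1][j - 1] + 1)
            for k in range(i + 1, j):           # rule 4: bifurcation
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M


if __name__ == "__main__":
    matrix = nussinov_fill("GGGAAAUCC")
    for row in matrix:
        print(" ".join(str(value) for value in row))
```

The entry in the top right corner is the maximum number of base pairs for the whole sequence.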
Initialization step:
In the initialization step, the scores along the main diagonal and the diagonal
just below it are set to zero. Formally, the scoring matrix, M, is initialized as follows:
M(i, i) = 0 for i = 1 to L
M(i, i-1) = 0 for i = 2 to L
Using the example in Durbin, et al. with the RNA sequence GGGAAAUCC, the matrix
now looks like the following, such that sequences of length 1 will score 0:
   G G G A A A U C C
G  0
G  0 0
G    0 0
A      0 0
A        0 0
A          0 0
U            0 0
C              0 0
C                0 0
Now the matrix is filled in, starting with subsequences of length 2, and ending at
subsequences of length L. The four rules for filling in the matrix are used:
When looking for subsequences of length 2, the matrix is filled as follows, since A-U is
the only base-pair found:
   G G G A A A U C C
G  0 0
G  0 0 0
G    0 0 0
A      0 0 0
A        0 0 0
A          0 0 1
U            0 0 0
C              0 0 0
C                0 0
After subsequences of length 3 have been considered:
   G G G A A A U C C
G  0 0 0
G  0 0 0 0
G    0 0 0 0
A      0 0 0 0
A        0 0 0 1
A          0 0 1 1
U            0 0 0 0
C              0 0 0
C                0 0
The completed matrix:
   G G G A A A U C C
G  0 0 0 0 0 0 1 2 3
G  0 0 0 0 0 0 1 2 3
G    0 0 0 0 0 1 2 2
A      0 0 0 0 1 1 1
A        0 0 0 1 1 1
A          0 0 1 1 1
U            0 0 0 0
C              0 0 0
C                0 0
Traceback through this matrix (covered on P 271, Durbin et al) leads to the following
structure:
Given the four possibilities for the maximum structure in the Nussinov algorithm, it can
be expressed as a stochastic context-free grammar as follows:
S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → aSu | cSg | uSa | gSc
S → SS
Such a simplistic approach will not give accurate structure predictions, since it does not
take into account important structural features, such as nearest neighbor interactions,
stacking interactions, and loop length preferences.
Since RNA folding is determined by biophysical properties, methods that take
these properties into account are more likely to yield accurate predictions. One widely
used method is energy minimization, which predicts that the correct secondary
structure is the one that minimizes the free energy (∆G).
The free energy of an RNA secondary structure is calculated as the sum of the individual
contributions of loops, base pairs, and other secondary structure elements. Energies of
stems are calculated as the stacking contributions between neighboring base pairs.
In order to find the structure for which the minimum free energy is found, the sequence is
compared against itself using a dynamic programming approach similar to the maximum
base-paired structure approach previously described. However, instead of using a scoring
scheme for the base pairs present, the score is based upon the free energies described
above. Gaps between matches represent some form of a loop, so the gap score is
calculated using the above tables as well. The most widely used software that
incorporates this minimum free energy algorithm is MFOLD.
Suboptimal folds
The correct structure is not necessarily the structure with the minimum free energy, but may
be a structure within a certain threshold of the calculated minimum energy. Therefore, the
MFOLD algorithm has been updated to report suboptimal foldings as well.
Covariance Models
In order to locate covarying sites in RNA sequences, 7 different approaches are offered in
Mount, p225.
The key to covariance is the measure of the mutual information content previously
discussed. The mutual information content can be plotted on a motif logo, which can
give insight into the folding of a particular sequence.
Image source: http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html
A formal covariance model, COVE, was devised by Eddy and Durbin. The model
provides very accurate results, but is extremely slow and unsuitable for large genomes.
Stochastic Context Free Grammars (SCFGs) have also been used to model RNA
secondary structure. Examples of these are tRNAscan-SE, and a program created to find
snoRNAs. Typically, with SCFG approaches, the grammars are created by using a
training set of data, and then the grammars are applied to candidate sequences to see if
they fit into the language.
SCFGs allow the detection of sequences belonging to a family, such as tRNAs, group I
introns, snoRNAs, snRNAs, etc.
With an SCFG approach, base-paired columns are modeled by pairwise-emitting
nonterminals (for example aWu) while single-stranded columns are modeled by leftwise-
emitting nonterminals (such as gW), when possible. Any RNA structure can then be
reduced to an SCFG (see Durbin, et al., p 278-279).
Transformational Grammars
Transformational grammars were first described by the linguist Noam Chomsky in the
1950’s. (Yes, this is the same Noam Chomsky who has expressed various dissident
political views throughout the years!) Transformational grammars are very important in
computer science, most notably in compiler design. Grammars are covered in more
detail in compiler and automaton classes, so we will only briefly touch on them here.
Web site:
http://web.mit.edu/linguistics/www/chomsky.home.html
The idea behind transformational grammars is to take a set of outputs (such as a sentence,
or in our case, an RNA structure) and determine whether or not it can be produced using
a set of rules for the language.
Transformational grammars consist of a set of symbols and production rules by which the
symbols can be put together. The symbols can be either terminal (emitting) symbols or
non-terminal symbols that can be used to create longer strings of symbols.
First, consider the case of palindromic DNA sequences. There are a total of five possible
terminal symbols: {a, c, g, t, ε}, where ε represents the blank terminal symbol. The
production rules for creating a palindromic sequence are as follows, where S and W are
non-terminal symbols:
S→W
W→ aWa | cWc | gWg | tWt
W→ a | c | g | t | ε
Using these production rules, we can create a derivation of the palindromic sequence
acttgttca as follows:
S⇒W⇒aWa⇒acWca⇒actWtca⇒acttWttca⇒acttgttca
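A derivation under this grammar can be generated mechanically by peeling matching symbols off both ends of the string. The sketch below returns the list of sentential forms, or None when the string is not in the language; the function name `derive_palindrome` is illustrative.

```python
TERMINALS = set("acgt")


def derive_palindrome(sequence):
    """Return the derivation steps for a palindrome, or None if there is none."""
    if any(symbol not in TERMINALS for symbol in sequence):
        return None
    steps = ["S", "W"]                     # S -> W
    left, right = "", ""
    i, j = 0, len(sequence) - 1
    while i < j:
        if sequence[i] != sequence[j]:
            return None                    # no rule xWy with x != y
        left += sequence[i]                # apply W -> xWx
        right = sequence[j] + right
        steps.append(left + "W" + right)
        i, j = i + 1, j - 1
    # Finish with W -> x (odd length) or W -> epsilon (even length).
    center = sequence[i] if i == j else ""
    steps.append(left + center + right)
    return steps


if __name__ == "__main__":
    print(" => ".join(derive_palindrome("acttgttca")))
```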
In order to align a context-free grammar to a sequence, a parse tree can be created, where
the root of the tree is the non-terminal start symbol, S. Leaves of the parse tree are the
terminal symbols in the sequence, and internal nodes are the nonterminals. The leaves
can be parsed from left to right to view the results of the production. An example for the
parse tree on the above production is as follows:
[Figure: parse tree rooted at S, with nested W nonterminals as internal nodes]
More information on parse trees can be found in Durbin, et al., Chapter 9.
S→W
W→ WW (bifurcation)
W→ aWu | cWg | gWc | uWa (loops)
W→ gWu | uWg
W→ aW | cW | gW | uW (bulges on one side)
W→ Wa | Wc | Wg | Wu (bulges on opposite side)
W→ a | c | g | u | ε
Using this grammar, the structure predicted by MFOLD for the sequence:
GCUUACGACCAUAUCACGUUGAAUGCACGCCAUCCCGUCCGAUCUGGCAAG
UUAAGCAACGUUGAGUCCAGUUAGUACUUGGAUCGGAGACGGCCUGGGAA
UCCUGGAUGUUGUAAGCU
can be constructed using the following productions (5’ to 3’):
S⇒W⇒Wu⇒gWcu⇒gcWgcu⇒gcuWagcu⇒gcuuWaagcu⇒
gcuuaWuaagcu⇒gcuuacWguaagcu⇒
gcuuacgWuguaagcu⇒gcuuacgaWuuguaagcu⇒
gcuuacgacWguuguaagcu⇒gcuuacgaccWguuguaagcu⇒
gcuuacgaccaWguuguaagcu⇒....
Genes are regions of a genome that code for either a structural or functional protein.
Genes are of interest to biologists due to their association with diseases. In the past,
studying whether a gene was turned on or turned off under a specific condition was an
expensive and time-consuming task. Within the past 10 years, the emergence of a new
technology, called microarrays, has made it possible to study the expression pattern of
thousands of genes simultaneously. Microarrays allow the study of genes (actually any
sequence of interest) under differing conditions.
The ultimate goal of microarray data is to be able to understand how the expression levels
of different genes differ under two separate conditions. By asking and answering such
questions, we can get an idea of which genes are involved in a certain disease, and
potentially, the pathways involved in these diseases.
In order to figure out which genes are expressed in a given condition, cells in that
condition are taken, and the mRNA from these cells is extracted. The mRNA represents
the genes that are turned on in these cells. These mRNA sequences are then labeled. The
manner in which they are labeled depends upon the technique that is being
used.
With single channel microarrays, the genes present under a given condition are labeled
with biotin. The expressed genes are washed over the microarray slide, and the expressed
genes will hybridize at the appropriate locations. What results is a dark spot where the
expressed genes have hybridized. If a clear microscope slide has been used to spot the
microarrays, then light can be passed underneath. Black spots represent genes that are
expressed in a given condition. In order to study two different conditions in single
channel microarrays, two separate slides must be used. An example of a single channel
microarray is given below:
Two Channel Microarrays
With two channel microarrays, the samples under different conditions are labeled
separately. The labels normally incorporated are green and red. For argument’s sake,
assume that the control is labeled green, and the sample is labeled red. Both samples are
washed over the microarray slide, and hybridization occurs. Each spot on the slide is
now one of four colors as shown below:
The colors correspond to the expression of the gene under the different conditions. For
example, spots that are only green are highly expressed in the control, while spots that are
red are highly expressed in the sample. Spots that are yellow are equally expressed in
both sample and control, while black spots are genes that are not expressed in either the
sample or the control.
Once the spots are determined, the difficulty is in quantifying the image signals.
Generally, the images are converted to some sort of matrix of numbers. This step
requires processing. Besides the spot intensity, other measures that might be taken
include measurements of error and measurements of background noise. For instance, you
might ask how green a spot is. Answering this question can give an indication of
the difference in expression levels between the control and the sample.
Such an approach is often referred to as a fold approach. In other words, how does the
expression level change under a given condition? (Two-fold difference? Four-fold
difference?) Besides determining this value, it is important to figure out when a
significant change has been made. One thing to be aware of is that a four-fold observed
difference does not necessarily mean that a gene is expressed four times as much in a
given condition!
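As a rough illustration of the fold approach, the sketch below computes a fold change and its log2 ratio from a pair of spot intensities. The function names, the simple background subtraction, and the example numbers are all assumptions made for illustration; real analysis pipelines apply further normalization and error modeling.

```python
from math import log2


def fold_change(sample, control, background=0.0):
    """Ratio of background-corrected sample to control intensity, or None."""
    corrected_sample = sample - background
    corrected_control = control - background
    if corrected_sample <= 0 or corrected_control <= 0:
        return None          # signal is indistinguishable from background
    return corrected_sample / corrected_control


def log2_ratio(sample, control, background=0.0):
    """log2 of the fold change; positive means higher in the sample."""
    ratio = fold_change(sample, control, background)
    return None if ratio is None else log2(ratio)


if __name__ == "__main__":
    # Hypothetical red (sample) and green (control) intensities for one spot.
    print(fold_change(400, 100))                  # raw intensities
    print(fold_change(400, 100, background=50))   # after background subtraction
    print(log2_ratio(400, 100))
```

The same raw intensities give different fold values once a background level is subtracted, which is one reason an observed fold difference should be interpreted with care.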
Clustering
One might be inclined to ask questions concerning the relationship among sequences in
an experiment. Several approaches have been suggested. Included are:
k-Means Clustering
k-Means clustering attempts to partition the results into groups that have similar
expression patterns, where k is the number of clusters the user believes that the data
should fall into. There are three steps in the k-Means clustering algorithm:
Clustering approaches have several disadvantages, and should be used with extreme
caution (if they are used at all).
Image source:
http://cfpub.epa.gov/ncer_abstracts/index.cfm/fuseaction/display.abstractDetail/abstract/9
75/report/2001
SOMs are a type of neural network approach. A SOM has a set of nodes with a simple
topology and a distance function on the nodes. The nodes are iteratively mapped into a
k-dimensional gene expression space. The steps in assembling a SOM are as follows:
1) Random vectors are constructed and assigned to each partition
2) A gene is picked at random and the reference vector closest to that gene is
identified
3) The reference vector is adjusted to be more similar to the vector of the assigned
gene.
4) Steps 2 and 3 are iterated through, until the reference vectors converge.
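The four steps can be sketched as below. This is a deliberately reduced version: only the winning reference vector is updated (a full SOM also moves its neighbors in the node topology), the learning rate is fixed, and a fixed number of iterations stands in for a convergence test. The function and parameter names are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two expression vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_som(genes, n_nodes, iterations=200, rate=0.5, seed=0):
    """Return n_nodes reference vectors fitted to the gene expression vectors."""
    rng = random.Random(seed)
    dim = len(genes[0])
    # Step 1: random reference vectors, one per node (partition).
    refs = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(iterations):            # Step 4: iterate steps 2 and 3
        gene = rng.choice(genes)           # Step 2: pick a gene at random
        winner = min(refs, key=lambda ref: squared_distance(ref, gene))
        for k in range(dim):               # Step 3: move the winner toward it
            winner[k] += rate * (gene[k] - winner[k])
    return refs


if __name__ == "__main__":
    expression = [[0.1, 0.2], [0.0, 0.3], [5.0, 5.2], [5.1, 4.9]]
    for ref in train_som(expression, n_nodes=2):
        print([round(value, 2) for value in ref])
```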
Support Vector Machines are supervised machine learning techniques. These techniques
organize the data by mapping the gene expression vectors into a higher dimensional
space based on a kernel function. The SVM is trained to discriminate between positive
and negative data points. SVMs find the hyperplane that maximizes the
margin between the positive and negative data points.
CECS694-02
Introduction to Bioinformatics
Lecture 13
Protein Structure Prediction
Proteins are polypeptides that have a three dimensional structure. They can be described
through four different hierarchical levels:
Once the polypeptide sequence (primary structure) of a protein has been determined, the
next step is to determine the secondary and tertiary structure of the protein. The
secondary structures of a protein are packed into a core region with a hydrophobic
environment. Interactions between the amino acid side chains occur within the core
structure. Outside of the core are loops and structural elements that come in contact with
water, other proteins, and other structures.
Proteins are chains of amino acids joined by peptide bonds. Each amino acid is polar,
meaning that it has separate positively and negatively charged regions. Each amino acid
has a free C=O group (CARBOXYL), which can act as a hydrogen bond acceptor, and an
NH group (AMINO), which can act as a hydrogen bond donor. Many conformations of
the chain are possible due to rotation around the bonds to the Alpha-Carbon (Cα) atom. These
conformational changes lead to differences in the three-dimensional structure of the
protein. Within a polypeptide chain, there is a repeated pattern of N-Cα-C. The angle of
rotation about the bond between the amino nitrogen and the Alpha-carbon is the PHI (φ)
angle; the angle of rotation about the bond between the Cα and the carboxyl carbon is
the PSI (ψ) angle.
Image Source: Bioinformatics, Mount
The difference between each of the 20 amino acids is in the R side chains. Amino acids
can be separated into distinct groups based on the chemical properties of the side chains:
hydrophobic: Alanine (A), Valine (V), Phenylalanine (F), Proline (P), Methionine (M),
Isoleucine (I), and Leucine (L); charged: Aspartic acid (D), Glutamic acid (E), Lysine
(K), Arginine (R); polar: Serine (S), Threonine (T), Tyrosine (Y), Histidine (H), Cysteine
(C), Asparagine (N), Glutamine (Q), Tryptophan (W).
Secondary Structures
Image source: http://www.ebi.ac.uk/microarray/biology_intro.html
The core of each protein is made up of regular secondary structures that fold into a three-
dimensional configuration. In these secondary structures, regular patterns of hydrogen
bonds are formed between neighboring amino acids, and the amino acids have similar φ
and ψ angles. These structures act to neutralize the polar groups on each amino acid.
These secondary structures are tightly packed in the protein core in a hydrophobic
environment, and thus, each amino acid side group has a limited space to occupy and
therefore a limited number of possible interactions.
Alpha Helix
The alpha helix (Picture, p 388) is the most abundant type of secondary structure in
proteins. The helix has 3.6 amino acids per turn, with a hydrogen bond formed between
every fourth residue. The average length of an alpha helix is 10 amino acids, or 3 turns,
but it varies from 5 to 40 amino acids.
http://www.hhmi.princeton.edu/sw/2002/psidelsk/scavengerhunt.htm
http://www4.ocn.ne.jp/~bio/biology/protein.htm
Alpha helix structures are normally found on the surface of protein cores, where they
interact with the aqueous environment. The inner-facing side of the helix tends to have
hydrophobic amino acids, while the outer-facing side has hydrophilic amino acids. With
3.6 residues per turn, this means that every third to fourth amino acid will tend to be
hydrophobic. This is a pattern that can be detected computationally. Sequences rich in
alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P),
glycine (G), tyrosine (Y), and serine (S) tend to form alpha helices.
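One way to detect this periodic pattern computationally is a hydrophobic moment calculation. The sketch below is a simplified illustration, not the method of any specific program: it treats each residue as a point on a helical wheel (100 degrees per residue, which follows from 3.6 residues per turn), gives hydrophobic residues a weight of 1 and all others 0, and measures how strongly the hydrophobic residues cluster on one face. The hydrophobic set is the one listed earlier in these notes; the binary weighting is an assumption made for simplicity.

```python
import cmath
import math

# Hydrophobic residues as grouped earlier in these notes.
HYDROPHOBIC = set("AVFPMIL")

def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean helical-wheel vector of the hydrophobic residues (0 = no sidedness)."""
    if not seq:
        return 0.0
    step = math.radians(angle_deg)
    total = sum(cmath.exp(1j * step * i)
                for i, aa in enumerate(seq) if aa in HYDROPHOBIC)
    return abs(total) / len(seq)
```

A segment whose hydrophobic residues all lie on one face of the helix gives a large value, while a uniformly hydrophobic segment of 18 residues (exactly five turns) gives a value near zero.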
Beta Sheet
Beta sheets are formed by hydrogen bonds between an average of 5-10 consecutive
amino acids in one portion of the chain with another 5-10 farther down the chain. The
interacting regions may be adjacent, with a short loop in between, or far apart with other
structures in between. If the chains run in the same direction, they form a parallel sheet.
If they run in opposite directions, they form an antiparallel sheet. A mixed sheet may also
be formed. The pattern of hydrogen bond formation in parallel and anti-parallel sheets is
different. Beta sheets have a slight counterclockwise rotation, and the Alpha carbons (as
well as the R side groups) alternate above and below the sheet in a pleated structure.
Prediction of beta sheets is more difficult, due to the wide range of the PHI and PSI
angles.
http://broccoli.mfn.ki.se/pps_course_96/ss_960723_12.html
http://www4.ocn.ne.jp/~bio/biology/protein.htm
Image Source: Bioinformatics, Mount
Loops
Loops are regions of a protein chain between alpha helices and beta sheets.
They have various lengths and three-dimensional configurations, and they are located on
the surface of the structure. Hairpin loops represent a complete turn in the polypeptide
chain, as is found in anti-parallel beta sheets. Loops are allowed to be more variable as
far as the sequence structure is concerned. They tend to have charged and polar amino
acids and are frequently a component of active sites.
Coils
A region of secondary structure that is not a helix, sheet, or loop is commonly referred to
as a coil.
Classes of Protein Structure:
Sources:
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=3hhb;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics&pdbId=1cd8&page=&pid=
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=2wsy;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1rnb;page=;pid=&opt=show&size=250
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1opf;page=;pid=&opt=show&size=250
There are a number of databases that contain information on three dimensional structures
of proteins, where the structure has been solved using either X-ray crystallography or
nuclear magnetic resonance (NMR) techniques. Examples of the available sequence
databases include:
PDB
SCOP
PIR
Swiss-Prot
The most extensive of these for 3-D structure is the Protein Data Bank (PDB). The
current release of PDB (April 8, 2003) has 20,622 structures.
A full description of the PDB file format can be obtained at:
http://www.rcsb.org/pdb/info.html The full PDB file for the entry 3hhb can be obtained at
http://www.rcsb.org/pdb/cgi/explore.cgi?job=download;pdbId=3HHB;page=0&opt=show&format=PDB&pre=1
In each ATOM record of the file, reading the whitespace-separated columns from left to right:
The second column is the atom serial number
The fourth column indicates the current amino acid
The sixth column indicates the amino acid position in the polypeptide chain
Columns 7, 8, and 9 represent the x, y, and z coordinates (in angstroms)
The 11th column represents the temperature factor, which can be used as a measurement
of uncertainty.
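ATOM records are fixed-width, so reading them by character position is safer than splitting on whitespace (adjacent fields can run together). The following Python sketch reads the fields described above using the character ranges from the PDB format description. The Atom type, the function names, and the sample values are illustrative assumptions; the sample atom is made up for the example and is not copied from the 3hhb file.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int       # atom serial number
    name: str         # atom name, e.g. "CA" for the alpha carbon
    res_name: str     # three-letter amino acid code
    chain: str        # chain identifier
    res_seq: int      # residue position in the chain
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float   # temperature factor

def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM record (0-based slices of the 1-based columns)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )

def format_atom_line(a: Atom) -> str:
    """Write an Atom back out in the same fixed-width layout."""
    name = a.name if len(a.name) == 4 else (" " + a.name).ljust(4)
    return "ATOM  %5d %s %3s %s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        a.serial, name, a.res_name, a.chain, a.res_seq,
        a.x, a.y, a.z, a.occupancy, a.b_factor)

# Illustrative values only.
sample = format_atom_line(Atom(1, "CA", "VAL", "A", 1, 6.204, 16.869, 4.854, 1.0, 49.05))
atom = parse_atom_line(sample)
```

Splitting the sample line on whitespace gives the eleven columns described above, with the residue name fourth and the temperature factor eleventh.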
CATH classifies proteins into hierarchical levels by class, except that α/β and α+β are
considered to be a single class. CATH is located at
http://www.biochem.ucl.ac.uk/bsm/cath/
FSSP is based on structure alignment of all pairwise combinations of the proteins in PDB
using the structural alignment program DALI. Each protein is separated into individual
domains, and the domains are aligned using DALI to find common folds. FSSP is
located at http://www2.embl-ebi.ac.uk/dali/fssp/fssp.html
Molecular Modelling Database (MMDB)
MMDB categorizes structures from PDB into structurally related groups using the VAST
structure alignment program, which looks for similar arrangements of secondary structural
elements. MMDB has been incorporated into ENTREZ at
http://www.ncbi.nlm.nih.gov/Entrez
There are a number of programs available that convert the atomic coordinates of the 3-d
structures into views of the molecule. Viewers also allow the user to manipulate the
molecule by rotation, zooming, etc. Such a viewer can be critical in drug design, since it
yields insight into how the protein might interact with ligands at active sites. The most
popular program for viewing 3-dimensional structures is Rasmol. The following is a list
of the most popular viewers:
Rasmol: http://www.umass.edu/microbio/rasmol/
Chime: http://www.umass.edu/microbio/chime/
Cn3D: http://www.ncbi.nlm.nih.gov/Structure/
Mage: http://kinemage.biochem.duke.edu/website/kinhome.html
Swiss 3D viewer: http://www.expasy.ch/spdbv/mainpage.html
In addition to viewing 3-dimensional structures, there are repositories for still images.
One such site is the swissprot website:
http://www.expasy.ch/databases/swiss-3dimage/IMAGES/
Structural similarity between proteins does not necessarily translate into an evolutionary
relationship between the two.
When structures are compared, the positions of atoms in the two three-dimensional structures are
compared. Typically, structure alignment methods look for the positions of
secondary structural elements (helices and strands) within a protein domain to determine
whether or not the structures are similar. Distances between the carbon atoms are
examined to determine the degree to which the structures may be superimposed.
Additional information about the side chains (such as whether they are buried or visible)
can be used as well.
A local structural environment is created for each residue in each sequence. This
environment is defined by the degree of burial in the hydrophobic core of the protein and
the type of secondary structure to which the residue belongs. One of the environment
variables is a representation of the geometry of the protein, obtained by drawing a series of
vectors from the Cβ atom of an amino acid to the Cβ atoms of all of the other amino acids in
the protein. If the geometric views in two protein structures are similar, the structures
are likely to be similar as well.
1) Calculate vectors from one Cβ of one amino acid to a set of other nearby amino
acids. The resulting vectors from two separate proteins are compared, and a
difference (expressed as an angle) is calculated. A score for this difference is then
computed.
2) A matrix for the scores of vector differences from one protein to the next is
computed.
3) An optimal alignment is found using global dynamic programming, with a
constant gap penalty.
4) The next amino acid residue in one of the sequences is considered, and an optimal
path to align this amino acid to the second sequence is computed using the steps
above.
5) Resulting alignments are then transferred to a summary matrix. If the paths cross
the same matrix position, the scores are summed. If part of the alignment path is
found in both matrices, then there is evidence of similarity between the vectors.
6) When all of the alignments have been placed in the summary matrix, a dynamic
programming alignment is performed for the summary matrix. The final
alignment represents the optimal alignment between the protein structures. The
resulting score is converted such that it can be compared to see how closely
related the two structures are to each other.
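The dynamic programming step used in steps 3 and 6 can be sketched in Python as follows. This is only the alignment step, not the full six-step procedure: it assumes the score matrix (for example, the matrix of vector-difference scores from step 2) has already been computed, and it reads "constant gap penalty" as a fixed penalty for each gapped position. The function name and the default penalty value are illustrative.

```python
def global_align(score, gap=-1.0):
    """Global alignment over a precomputed score matrix.

    score[i][j] is the score for pairing residue i of protein A with
    residue j of protein B. Returns (best total score, aligned index pairs).
    """
    n, m = len(score), len(score[0])
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(f[i - 1][j - 1] + score[i - 1][j - 1],  # pair i with j
                          f[i - 1][j] + gap,                      # gap in B
                          f[i][j - 1] + gap)                      # gap in A
    # Trace back from the bottom-right corner to recover the path.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if f[i][j] == f[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif f[i][j] == f[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return f[n][m], pairs
```

In the procedure above, the paths returned for each residue pair would be accumulated into the summary matrix, and the same function could then be applied to that summary matrix to obtain the final alignment.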
The distance matrix method uses a graphical procedure similar to dot plots to identify the atoms that
lie most closely together in the three-dimensional structure. If two sequences have a
similar structure, then their resulting dot plots can be superimposed. For the dot plot, the
sequence of the protein is listed along both axes. The values in the distance matrix
represent the distance between the corresponding Cα atoms in the three dimensional
structure. The positions of the closest packing atoms are marked with a dot to highlight
regions of interest. Similar groups of secondary structural elements are superimposed as
closely as possible by minimizing the sum of the atomic distances.
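A distance matrix and its dot plot can be sketched in a few lines of Python. The 8 angstrom cutoff and the minimum sequence separation used to decide which pairs are marked with a dot are assumptions chosen for illustration; actual programs choose their own thresholds.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise Cα-Cα distances; ca_coords is a list of (x, y, z) tuples."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def contact_dots(dm, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j) to mark with a dot: close in space, apart in sequence."""
    n = len(dm)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dm[i][j] <= cutoff]
```

Dots that run perpendicular to the main diagonal indicate two antiparallel stretches of chain lying side by side, while dots parallel to the diagonal indicate helices or parallel strands.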
Dali is one example of a program that uses the distance matrix method to align protein
structures. Existing structures that have been compared to one another are organized
into the FSSP database. The assembly step of DALI uses a Monte Carlo simulation
strategy to find submatrices that can be aligned with one another.
One way to quickly compare two structures is to compare the types and arrangements of
the secondary structures within two proteins. If the elements are similarly arranged, the
three-dimensional structures are similar. VAST and SARF are example programs that
use these methods to compare two structures.
Zinc finger motifs can be found by looking at order and spacing of cysteine and histidine
residues in a sequence. Typical zinc finger motifs are composed of two cysteines
followed by two histidines.
Image source: www.bmb.psu.edu/faculty/tan/lab/tanlab_gallery_protdna.html
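Searching for the order and spacing of cysteine and histidine residues can be expressed as a regular expression. The Python sketch below uses a simplified spacing (C, 2 to 4 residues, C, 12 residues, H, 3 to 5 residues, H) modeled on the classical two-cysteine, two-histidine finger. The exact spacing is an assumption for illustration, and curated motif databases use more specific patterns.

```python
import re

# Simplified C2H2 spacing; an illustrative assumption, not a curated pattern.
ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_zinc_fingers(seq):
    """Return (start position, matched segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ZINC_FINGER.finditer(seq)]
```

A match only shows that the spacing is compatible with a zinc finger; it is not evidence that the region actually binds zinc.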
Leucine zippers can be found by looking for two antiparallel alpha helices held together
by interactions between hydrophobic leucine residues found at every seventh position in
the helix.
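The every-seventh-position rule can likewise be written as a pattern search. This Python sketch looks for four leucines spaced seven residues apart (L, six residues, L, six residues, L, six residues, L); requiring four repeats is an assumption, and the lookahead is used so that overlapping candidates are all reported.

```python
import re

# Four leucines at every seventh position; lookahead allows overlapping hits.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")

def find_leucine_repeats(seq):
    """Return the start position of every candidate leucine heptad repeat."""
    return [m.start() for m in LEUCINE_ZIPPER.finditer(seq)]
```

Because leucine is a common amino acid, this spacing occurs by chance fairly often, so hits should be checked against a helix prediction for the same region.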
Coiled-coil structures have two to three alpha helices coiled around each other in a
left-handed supercoil. They may be predicted by searching for a 7-residue periodicity.
COILS2 (http://www.ch.embnet.org/software/COILS_form.html) is a package to detect
coiled-coil regions.
Transmembrane-spanning Proteins
Membrane proteins traverse the membrane back and forth through a series of alpha helices
composed of amino acids with hydrophobic side chains. The typical length of these regions
is 20-30 residues. Therefore, these protein regions can be detected by scanning for
hydrophobic regions around 19 residues in length. Membrane-spanning alpha helices
tend to have hydrophobic residues on the inside-facing portions and hydrophilic residues
on the outside or exposed portions.
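Scanning for hydrophobic stretches can be sketched as a sliding-window average. The Python example below uses the Kyte-Doolittle hydropathy scale with a 19-residue window; the threshold of 1.6 is a commonly quoted cutoff for this window size and should be treated as an assumption rather than a fixed rule. The function names are illustrative.

```python
# Kyte-Doolittle hydropathy values, one per standard amino acid.
KD = dict(zip("ARNDCQEGHILKMFPSTWYV",
              [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
               3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))

def window_hydropathy(seq, window=19):
    """Average hydropathy of every window of the given length."""
    return [sum(KD[aa] for aa in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_starts(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    return [i for i, h in enumerate(window_hydropathy(seq, window))
            if h >= threshold]
```

Runs of consecutive start positions correspond to one candidate membrane-spanning segment. Sequences containing non-standard letters would need to be cleaned first, since the lookup raises a KeyError for any residue outside the 20 standard codes.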
The Chou-Fasman method was based on analyzing the frequency of amino acids in the
different secondary structures. For instance, it was determined that A, E, L, and M are
strong predictors of alpha helices, while P and G are predictors of a break in a helix. A
table of predictive values was created for alpha helices, beta sheets, and turns. The
structure with the greatest overall prediction value greater than 1 is used to determine the
structure for that region.
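The table-lookup idea can be illustrated with a short Python sketch. The propensity values below are the helix and sheet values as commonly tabulated from Chou and Fasman (1978) and should be checked against the original table before being relied on. The decision rule is a deliberate simplification (average the values over a window and take the larger one if it exceeds 1); the published method also uses nucleation and extension rules and a turn table that are not reproduced here.

```python
AA = "ARNDCQEGHILKMFPSTWYV"
# Propensities as commonly tabulated; verify against the original publication.
P_HELIX = dict(zip(AA, [1.42, 0.98, 0.67, 1.01, 0.70, 1.11, 1.51, 0.57, 1.00, 1.08,
                        1.21, 1.16, 1.45, 1.13, 0.57, 0.77, 0.83, 1.08, 0.69, 1.06]))
P_SHEET = dict(zip(AA, [0.83, 0.93, 0.89, 0.54, 1.19, 1.10, 0.37, 0.75, 0.87, 1.60,
                        1.30, 0.74, 1.05, 1.38, 0.55, 0.75, 1.19, 1.37, 1.47, 1.70]))

def call_window(window):
    """Return 'H' (helix), 'E' (sheet), or '-' for one stretch of residues."""
    p_helix = sum(P_HELIX[aa] for aa in window) / len(window)
    p_sheet = sum(P_SHEET[aa] for aa in window) / len(window)
    best, state = max((p_helix, "H"), (p_sheet, "E"))
    return state if best > 1.0 else "-"

def scan(seq, size=6):
    """Call every window of the given size along the sequence."""
    return [call_window(seq[i:i + size]) for i in range(len(seq) - size + 1)]
```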
The GOR method improves upon the Chou-Fasman method by assuming
that the amino acids surrounding the central amino acid influence the secondary structure that
the central amino acid is likely to adopt, rather than each amino acid influencing the
secondary structure individually.
Scoring matrices are used in the GOR method, which incorporates both information
theory and Bayesian statistics.
In the neural network approach, programs are trained to recognize amino acid patterns
that are located in known secondary structures and to distinguish these patterns from
other patterns not located in structures. PHD and NNPREDICT are two programs that
incorporate neural network models.
Nearest-Neighbor Methods
Nearest-neighbor methods are also a type of machine learning method. The secondary
structure conformation of an amino acid in the query is predicted by identifying
sequences of known structures that are similar to the query by looking
at the surrounding amino acids. The programs using the nearest-neighbor methods
include PSSP, Simpa96, SOPM, and SOPMA.
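The nearest-neighbor idea can be sketched in Python as a vote among the most similar windows of known structure. Everything specific here is an assumption for illustration: similarity is plain identity counting, the database is a list of (window, state of the central residue) pairs, and k is 3. The programs named above use their own scoring schemes and databases.

```python
from collections import Counter

def window_identity(a, b):
    """Number of positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b))

def predict_center(query_window, database, k=3):
    """Predict the state of the central residue by majority vote.

    database is a list of (window, state) pairs taken from proteins of
    known structure, where state is e.g. 'H', 'E', or 'C'.
    """
    ranked = sorted(database,
                    key=lambda entry: window_identity(query_window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]
```

To predict a whole sequence, the window would be slid along the query one residue at a time, and each position would be assigned the state returned for its window.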
Prediction of Three-dimensional Protein Structure
Protein classification
nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html)
Sequence:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
P389
RasMol
…
“Protein-folding Problem”
Threading
Programs:
WHAT-IF
LOOK
SWISS-MODEL
VAST
DALI
3Dee
FSSP
PHD
TOPITS
SignalP http://www.cbs.dtu.dk/services/SignalP/
TMpred http://www.isrec.isb.sib.ch/ftp-server/tmpred/www/TMPRED_form.html
CECS694-02
Introduction to Bioinformatics
Lecture 14
Protein Structure Prediction
Prediction of Three-dimensional Protein Structure
Threading
Threading is the most robust of structure prediction techniques. Threading searches for
structures that have a similar fold without apparent sequence similarity.
Threading takes a query sequence whose structure is not known and threads it through the
coordinates of a target protein whose structure has been solved, using either X-ray
crystallography or NMR imaging.
In the environmental template method, the environment of each amino acid in each
known structural core is determined, including the secondary structure, the area of the
side chain that is buried by closeness to other atoms, and types of nearby side chains.
Each position is classified into one of 18 types, 6 representing increasing levels of residue
burial, combined with three classes of secondary structure (alpha helices, beta sheets, and
loops). Each amino acid is then assessed for its ability to fit into that type of structure.
The number and closeness of contacts between amino acids in the core are analyzed. The query
sequence is evaluated for amino acid interactions that will correspond to those in the core
and that will contribute to the stability of the protein. The most energetically stable
conformations are the most likely three-dimensional structures.
Predictions as to which amino acids are able to fit into a structural position are given as a
sequence profile. Substitutions in different structures have different effects; for example,
substitutions in loops do not have as many constraints. A structure profile is created for
each core in the PDB. These profiles are then used to score the query sequence for
compatibility with that core. The structural profile is a table of scores with one row for
each amino acid position in the core and a column for each amino acid substitution at that
position plus two columns for deletion penalties. A dynamic programming algorithm is
used to identify an optimal, best scoring alignment.
Threading Services
123D http://www-lmmb.ncifcrf.gov/~nicka/123D.html
3D-PSSM
Honig lab
Libra I
NCBI structure site
Profit
Threader 2
TOPITS
UCLA-DOE structure prediction Server
DNA Sequencing
Sequencing DNA is a routine molecular biology technique. The most common form of
DNA sequencing used today is the Sanger dideoxynucleotide chain termination method.
In this method, new strands of DNA complementary to a single-stranded DNA template
are synthesized. The template DNA is supplied with a mixture of all four
deoxynucleotides (A, C, G, T) along with four dideoxynucleotides (A, C, G, T) that
terminate the elongation of the DNA strand. Each dideoxynucleotide is labeled with a
different color fluorescent tag. The result is a set of DNA fragments of
different lengths. The fragments are separated by their size using a technique known as
gel electrophoresis. As each labeled DNA fragment passes a laser detector, the color is
recorded. The DNA sequence is then reconstructed from the pattern of colors.
www.ncbi.nlm.nih.gov/About/primer/genetics_molecular.html
http://jcsmr.anu.edu.au/group_pages/brf/services/DNA%20sequencing/Templiphi.html
www.csic.es/mostrar/tecnicas/area2/iib1/abi377.htm
The procedure of determining the actual base that is represented is referred to as
base-calling. Often, automated sequencers have software installed that automatically takes the
trace data and calculates the bases. This information can also be used by programs such
as PHRED. For each base, there is an associated quality value, which reflects the
estimated probability that the base has been called incorrectly. Typically, the beginning and
end of the sequence will have lower values. These low quality regions are usually trimmed from
the final data. The PHRED quality value is calculated by the following formula:
q = -10 × log10(p), where p is the estimated probability that the base call is wrong.
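The PHRED quality value q is related to the estimated error probability p of a base call by q = -10 × log10(p), so a quality of 20 corresponds to a 1 in 100 chance of a wrong call and a quality of 30 to 1 in 1000. The Python sketch below converts between the two and trims low-quality ends. The trimming rule (strip bases from each end until one reaches the threshold) and the default threshold of 20 are simplifying assumptions, not the trimming algorithm used by PHRED itself.

```python
import math

def error_to_phred(p_error):
    """Quality value for an estimated error probability."""
    return -10.0 * math.log10(p_error)

def phred_to_error(q):
    """Estimated error probability for a quality value."""
    return 10.0 ** (-q / 10.0)

def trim_low_quality(quals, min_q=20):
    """Return (start, end) so that quals[start:end] has good bases at both ends."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end
```

The returned indices can be applied to the base sequence as well as to the quality list, since the two have one entry per base.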
PHRED: http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm
TraceTuner http://www.paracel.com/tracetuner/
http://www.phrap.com/
Berno A. (1996) A graph theoretic approach to the analysis of DNA sequencing data.
Genome Res, 6:80-91.
Ewing B, Green P. (1998) Base-calling of automated sequencer traces using phred. II.
Error probabilities. Genome Res, 8(3):186-194.
Genomic Sequencing
Shotgun sequencing
Shotgun sequencing is the process of sequencing a whole genome without relying on map data.
The idea is to sequence both ends of short (2 kb), medium (10 kb), and
long (100 kb) DNA fragments, and to use these end sequences as anchors. The genome is then
randomly broken up into small (500 base) pieces, which are then sequenced. The problem
of sequence assembly is much tougher with shotgun sequencing than with map-based approaches.
Taken from Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the
human genome. PNAS 99(6):3712-3716.
PHRAP (fragment assembly program) is the most widely used program when it comes to
assembling the smaller pieces of each clone together. Other programs that are used to
assemble whole genomes include ARACHNE (MIT’s Whitehead center); GigAssembler
(UCSC), and … The most valuable of these whole genome assembly techniques take
into account various pieces of information concerning BAC ends, polymorphisms, and
mapping markers in order to correctly orient and assemble the pieces of the genome.
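The core idea of fragment assembly, merging reads that overlap at their ends, can be shown with a toy greedy assembler in Python. This is a teaching sketch only and is not how PHRAP, ARACHNE, or GigAssembler work: it assumes error-free reads from a single strand, ignores repeats, quality values, and the paired-end and mapping information described above, and uses a minimum overlap of 3 bases purely for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if too short)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest end overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads
```

Each call returns a list of contigs; a single entry means that all reads were joined into one sequence. Repeated sequences longer than a read are what make the real problem hard, since they produce overlaps between reads that come from different parts of the genome.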
Huang X, Madan A. (1999) CAP3: A DNA sequence assembly program. Genome Res,
9(9):868-877.
Bonfield JK, Smith K, Staden R. (1995) A new DNA sequence assembly program.
Nucleic Acids Res, 23(24):4992-4999.
Mullikin JC, Ning Z. (2003) The phusion assembler. Genome Res, 13(1):81-90.
Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the human genome.
PNAS 99(6):3712-3716.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP,
Lander ES. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Res
12(1):177-189.
Venter C, et al. (2001) The sequence of the human genome. Science 291(5507):1304-
1351.