
Introduction to Bioinformatics

Lecture 1: Overview of Bioinformatics and Molecular Biology

What is Bioinformatics?

Defining the terms bioinformatics and computational biology is not necessarily an
easy task, as evidenced by the multiple definitions available on the web. A recent
Google search for "definition of bioinformatics" returned over 43,000 results! In
the past few years, as the areas have grown, confusion about these two terms has
grown with them. For some, the terms bioinformatics and computational biology
have become completely interchangeable, while for others there is a great
distinction. I'll throw my two cents in, based on what my experience has been
with the consensus use of these two terms.

Computational biology and bioinformatics are multidisciplinary fields, involving
researchers from different areas of specialty, including (but by no means limited
to) statistics, computer science, physics, biochemistry, genetics, molecular
biology and mathematics. The goals of these two fields are as follows:

• Bioinformatics: Typically refers to the field concerned with the collection
and storage of biological information. All matters concerned with biological
databases are considered bioinformatics.
• Computational biology: Refers to the aspect of developing algorithms
and statistical models necessary to analyze biological data through the aid
of computers.

In this respect, my understanding of bioinformatics and computational biology
follows the NIH definitions listed below:

Bioinformatics: Research, development, or application of computational tools and approaches
for expanding the use of biological, medical, behavioral or health data, including those to acquire,
store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation techniques to the study of
biological, behavioral, and social systems.

Others have offered various opinions on these definitions as well:
http://kbrin.kwing.louisville.edu/~rouchka/definition.html
Image Source: http://ccb.wustl.edu/

Bioinformatics = Hot Field

Smart Money: #1 among next hot jobs
http://smartmoney.com/consumer/index.cfm?story=working-june02

Business Week: Among 50 Masters of Innovation
http://www.businessweek.com/bw50/content/mar2001/bf20010323_198.htm

So why is bioinformatics a hot field? One answer to this question is that it is tied to the
Human Genome Project, which has generated a lot of popular interest. Various advances
in molecular biology techniques (such as genome sequencing and microarrays) have led to
a large amount of data that needs to be analyzed. Now that we are close to having the
human genome finished, what does it all mean? That’s where bioinformatics steps in.
Bioinformatics can lead to important discoveries as well as help companies save time and
money in the long run. In addition, there need to be methods to manage large amounts
of data. One of the biggest reasons for bioinformatics being a hot field is the old supply
and demand adage: there are just too few people adequately trained in both biology and
computer science to solve the problems that biologists need to have solved.

Introduction to Molecular Biology

(For a good overview of this topic, please read:
http://www.ebi.ac.uk/microarray/biology_intro.html)

In order to be a good computational biologist, it is important to understand the
terminology and basic processes behind the biological problems. Many interesting
problems arise out of sequence analysis. There are two different types of biological
sequences studied in this class: DNA/RNA and amino acids. But first, let’s make sure
the basics are covered.

Cells

Every organism is made up of tiny structures called cells. Often these cells are too small
to be seen with the naked eye. Each cell is in itself a complex system enclosed in a
membrane. Some organisms, such as bacteria and baker’s yeast are composed of only a
single cell (i.e. they are unicellular). Other organisms are made up of many different
cells (i.e. they are multicellular). For instance, the human body is composed of around 60
trillion cells. Humans have about 320 different cell types, each having a different type of
function or structural property.

Structure of an animal cell.

Image source: www.ebi.ac.uk/microarray/ biology_intro.htm

There are two types of organisms: eukaryotes and prokaryotes. Eukaryotes (or, as Bruce
Roe from the University of Oklahoma calls them, the “You and I” Karyotes) represent
most of the organisms which we can see, including plants and animals. Prokaryotic cells
(such as bacteria) are smaller than eukaryotic cells and have a simpler structure.
Prokaryotes are single-celled organisms (but not all single-celled organisms are
prokaryotes!).

So what is the difference between the two types of cells? A eukaryotic cell has a nucleus,
which is separated from the rest of the cell by a membrane. Inside the nucleus are the
chromosomes, where all of the genetic information for the organism is stored. In
addition, eukaryotic cells contain organelles with various functions. Some, such as
lysosomes and mitochondria, are membrane bound, while others, such as centrioles and
ribosomes, are not.

Contained within the nucleus are one or several long double stranded DNA molecules
organized as chromosomes. For humans, there are 22 pairs of autosomes, as well as one
pair of sex chromosomes. One copy of each pair is inherited from each parent.

Karyotype showing the 23 pairs of human chromosomes.

Image source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html


Image source: www.biotec.or.th/Genome/ whatGenome.html

DNA

Deoxyribonucleic Acid (DNA) is the basis for the building blocks encoding the
information of life. A single stranded DNA molecule, called a polynucleotide or
oligomer, is a chain of small molecules called nucleotides. There are four different
nucleotides, distinguished by their bases: adenine (A), cytosine (C), guanine (G) and
thymine (T).

The bases can be separated into two different types: purines (A and G) and pyrimidines
(C and T). The difference between purines and pyrimidines is in the base structure.

By stringing together a simple alphabet of four characters, we can get enough
information to create a complex organism! Different nucleotides can be strung together
to form a polynucleotide. However, the ends of the polynucleotide are different, meaning
that each polynucleotide sequence will have a directionality. The ends of the
polynucleotide are marked either 3’ or 5’. The general convention is to label the coding
strand from 5’ to 3’ (left to right).

For instance, the following is a polynucleotide:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

DNA can be either single-stranded or double-stranded. When DNA is double-stranded,
the second strand is referred to as the reverse complement strand. This name is derived
from the fact that the second strand runs in the opposite direction to the first, and the
fact that the bases in the second strand are complementary to the bases in the first.
Complementary bases are determined by which pairs of nucleotides can form
bonds between them. In the case of DNA, A binds to T, and C binds to G. For the
polynucleotide given above, the double-stranded polynucleotide is as follows:

5’ G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’
| | | | | | | | | | | | | | | |
3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’
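
The reverse complement rule can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `reverse_complement` and the `COMPLEMENT` lookup table are names chosen here, not part of the lecture. It takes the top strand from the example above and returns the bottom strand, written in the conventional 5’ to 3’ direction.

```python
# Base-pairing rules for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    # Walk the strand backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

top_strand = "GTAAAGTCCCGTTAGC"          # 5' -> 3', from the example above
bottom_strand = reverse_complement(top_strand)
print(bottom_strand)                      # GCTAACGGGACTTTAC (5' -> 3')
```

Reading the printed strand from right to left gives CATTTCAGGGCAATCG, which is the bottom strand of the diagram as drawn from 3’ to 5’.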

Two complementary polynucleotide chains form a stable structure known as the DNA
double helix. This spring represents the 50th anniversary of the discovery of the double
helix structure of DNA by Watson, Crick and Franklin.

DNA double helix structure.


Image source: www.genecrc.org/site/ lc/lc2b.htm

Note that in this image, there appear to be two types of grooves: A larger one, which is
called the major groove and a smaller one, known as the minor groove. In addition, there
are roughly 10.5 base pairs in one complete turn of the helix.

RNA

Ribonucleic Acid (RNA) is similar to DNA in that it is constructed from
nucleotides. However, instead of thymine (T), an alternative base, uracil (U), is found in
RNA. RNA can be found as double-stranded or single-stranded, and can also be part of a
hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is
generally found as a single stranded molecule that may form secondary or
tertiary structures due to the complementary bases between parts of the same strand.
RNA folding will be discussed in detail during a later class period. RNA is important in
the cell and contributes in a variety of ways. One of the most important roles of RNA is
in protein synthesis. Two of the major RNA molecules involved in protein synthesis are
messenger RNA (mRNA) and transfer RNA (tRNA).

Secondary structure for E. coli RNase P RNA.


Image source: www.mbio.ncsu.edu/JWB/MB409/lecture/ lecture05/lecture05.htm

mRNA

mRNA encodes the genetic information as copied from the DNA molecules.
Transcription is the process in which DNA is copied into an RNA molecule. The
resulting linear molecule is an mRNA transcript. In eukaryotic cells, before the mRNA
can be translated into a protein, it needs to be modified. The nature of most eukaryotic
genes is that the genes are created in pieces, where coding regions, called exons, are
interspersed with noncoding regions, called introns. One of the steps in processing the
mRNA is to remove the intronic regions and to splice together the coding, or exonic
regions. The processed mRNA can then be transported from the nucleus and translated
into a protein sequence.
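
As a toy illustration of transcription followed by intron removal, here is a minimal Python sketch. Everything in it is hypothetical: the short "gene", its exon and intron strings, and the function names `transcribe` and `splice` are invented for the example. It also assumes the input is the coding strand, so that the transcript has the same sequence with U in place of T.

```python
def transcribe(coding_strand):
    """Copy a DNA coding strand into RNA by replacing T with U."""
    return coding_strand.upper().replace("T", "U")

def splice(pre_mrna, exons):
    """Join the exon intervals (start, end), end exclusive, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical gene: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GATTGA"
gene = exon1 + intron + exon2

# Exon coordinates within the gene (and therefore within the transcript).
exon_coords = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(gene)),
]

pre_mrna = transcribe(gene)
mature_mrna = splice(pre_mrna, exon_coords)
print(pre_mrna)      # AUGGCCGUAAGUUUUCAGGAUUGA
print(mature_mrna)   # AUGGCCGAUUGA
```

In a real cell the splice sites are recognized by the splicing machinery rather than supplied as coordinates; the coordinate list here simply stands in for that knowledge.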
mRNA processing.

Image source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm

tRNA

tRNA molecules develop a well-defined three-dimensional structure which is critical in
the creation of proteins. Attached to each tRNA molecule is an amino acid (which will
be discussed momentarily). The amino acid to be attached is determined by a three base
sequence called an anticodon sequence, which is complementary to the sequence in the
mRNA. Translation is the process in which the nucleotide base sequence of the
processed mRNA is used to order and join the amino acids into a protein with the help of
ribosomes and tRNA.

tRNA secondary structure and tRNA tertiary structure.
Image sources: www.biology.ucsc.edu/people/areslab/BMB100A/11-26.html and
http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm

Genetic Code

Since there are 4 possible bases (A, C, G, U) and 3 bases in a codon, there are 4 * 4 * 4
= 64 possible codon sequences. The codons UAA, UAG, and UGA are terminal (stop)
codons signaling the end of translation. That leaves 61 codon sequences that can code
for amino acids. One of these, AUG, also serves as the signal to initiate translation in
addition to coding for an amino acid (methionine). However, there are only 20 amino
acids. Therefore the genetic code is redundant, meaning that a single amino acid can
be coded for by several different codons.

                          Second Position of Codon
First                                                               Third
Position   U             C             A             G             Position

U          UUU Phe [F]   UCU Ser [S]   UAU Tyr [Y]   UGU Cys [C]   U
           UUC Phe [F]   UCC Ser [S]   UAC Tyr [Y]   UGC Cys [C]   C
           UUA Leu [L]   UCA Ser [S]   UAA STOP      UGA STOP      A
           UUG Leu [L]   UCG Ser [S]   UAG STOP      UGG Trp [W]   G

C          CUU Leu [L]   CCU Pro [P]   CAU His [H]   CGU Arg [R]   U
           CUC Leu [L]   CCC Pro [P]   CAC His [H]   CGC Arg [R]   C
           CUA Leu [L]   CCA Pro [P]   CAA Gln [Q]   CGA Arg [R]   A
           CUG Leu [L]   CCG Pro [P]   CAG Gln [Q]   CGG Arg [R]   G

A          AUU Ile [I]   ACU Thr [T]   AAU Asn [N]   AGU Ser [S]   U
           AUC Ile [I]   ACC Thr [T]   AAC Asn [N]   AGC Ser [S]   C
           AUA Ile [I]   ACA Thr [T]   AAA Lys [K]   AGA Arg [R]   A
           AUG Met [M]   ACG Thr [T]   AAG Lys [K]   AGG Arg [R]   G

G          GUU Val [V]   GCU Ala [A]   GAU Asp [D]   GGU Gly [G]   U
           GUC Val [V]   GCC Ala [A]   GAC Asp [D]   GGC Gly [G]   C
           GUA Val [V]   GCA Ala [A]   GAA Glu [E]   GGA Gly [G]   A
           GUG Val [V]   GCG Ala [A]   GAG Glu [E]   GGG Gly [G]   G

Genetic Code. The initiator codon is AUG (Met), and the terminal codons are the three entries marked STOP.
Each entry gives the triplet of bases first, then the three letter amino acid label, and then the one letter
amino acid label in brackets.

Adapted from: http://psyche.uthct.edu/shaun/SBlack/geneticd.html
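
The counting argument and the table lookup can be checked with a short Python sketch. It is a sketch only: the names `CODON_TABLE` and `translate` are chosen here for illustration, the table is built by listing the one-letter amino acid codes in the same U, C, A, G order as the table above (with `*` standing for STOP), and the two short mRNA strings are made-up inputs. The `translate` function makes the simplifying assumption that translation starts at the first AUG and stops at the first in-frame stop codon.

```python
from itertools import product

BASES = "UCAG"
# One-letter codes in table order: first position, then second, then third.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first position U
    "LLLLPPPPHHQQRRRR"   # first position C
    "IIIMTTTTNNKKSSRR"   # first position A
    "VVVVAAAADDEEGGGG"   # first position G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no initiator codon found
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":          # terminal codon: stop translating
            break
        protein.append(amino_acid)
    return "".join(protein)

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE), stop_codons)   # 64 ['UAA', 'UAG', 'UGA']
print(translate("AUGGCCUAA"))          # MA
print(translate("GGAUGUUUUGA"))        # MF
```

Counting the entries that are not `*` gives the 61 sense codons mentioned in the text, spread over 20 distinct amino acids.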

Amino Acids

Amino acids are the building blocks from which proteins are made. There are 20
different amino acids that vary from each other by their side chain groups. Amino acids
can be classified into different groups based on their solubility in water. Hydrophilic
amino acids are water soluble, while hydrophobic amino acids are not. This property
becomes important when a protein sequence is made. Amino acids are linked to one
another via a single chemical bond, called a peptide bond. A linear chain of amino acids
can be referred to as a peptide (if it is short, i.e. fewer than 30 amino acids long) or a
polypeptide (which can be upwards of 4000 residues long).

One-letter Three-letter Full name
G GLY Glycine
A ALA Alanine
V VAL Valine
L LEU Leucine
I ILE Isoleucine
F PHE Phenylalanine
P PRO Proline
S SER Serine
T THR Threonine
C CYS Cysteine
M MET Methionine
W TRP Tryptophan
Y TYR Tyrosine
N ASN Asparagine
Q GLN Glutamine
D ASP Aspartic acid
E GLU Glutamic acid
K LYS Lysine
R ARG Arginine
H HIS Histidine

Amino Acid Codes.

Proteins

Proteins are polypeptides that have a three dimensional structure. They can be described
through four different hierarchical levels:

• Primary structure – the sequence of amino acids constituting the polypeptide
chain.
• Secondary structure – the local organization of the parts of the polypeptide
chain into secondary structures such as α helices and β sheets.
• Tertiary structure – the three dimensional arrangements of the amino acids as
they react to one another due to the polarity and resulting interactions between
their side chains.
• Quaternary structure – if a protein consists of several protein subunits held
together, then the protein can be described as well by the number and relative
positions of the subunits.

Visualization of Protein Structures.
Color key: magenta = alpha helix; gold = beta sheets; blue = monomer A; orange = monomer B.
Image source: http://www.ebi.ac.uk/microarray/biology_intro.html

Calculating the secondary and tertiary structure of a protein given its primary structure is
not an easy task. Protein folding prediction will be covered at some point close to the end
of the semester.

Monomer – Any small molecule that can be linked with others of the same type to form
a polymer. For the purpose of this class, the molecules could be nucleotides, amino
acids, or protein subunits.

Dimer - Two small molecules of the same type linked together.

Trimer – Three small molecules of the same type linked together.

Oligomer – General term for a short polymer, most commonly consisting of nucleotides
or amino acids.

Polymer – Any large molecule consisting of multiple identical or similar subunits linked
by covalent bonds.
Putting it all together, we get the flow of genetic information. That is, DNA directs the
synthesis of RNA, and RNA then in turn directs the synthesis of protein. This flow of
genetic information from nucleic acids to protein has been called the Central Dogma of
Molecular Biology.

Central Dogma of Molecular Biology

DNA → RNA → PROTEIN

Image Source:
http://www.people.virginia.edu/~rjh9u/dnaprot.html

What is a Gene?

Aaah, the million dollar question. In short, a gene can be described as the physical and
functional unit of heredity that carries information from one generation to the next. A
gene can be thought of as the DNA sequence necessary for the synthesis of a functional
protein or RNA molecule.

Genome, Transcriptome, Proteome

Whenever the term genome is used, it typically refers to the chromosomal DNA of an
organism, or as far as sequencing is concerned, the euchromatic regions of the
chromosomal DNA. The number of chromosomes and the genome size vary quite
significantly from one organism to another. An example list of genome sizes is given
below. Don’t be fooled by this table into thinking that the size of the genome or the
number of genes determines the complexity of an organism. In fact, many plant
genomes are much greater in size than the human genome!

ORGANISM                              CHROMOSOMES        GENOME SIZE      GENES
                                      (haploid number)   (base pairs)
Homo sapiens (Humans)                 23                 3,200,000,000    ~30,000
Mus musculus (Mouse)                  20                 2,600,000,000    ~30,000
Drosophila melanogaster (Fruit Fly)   4                  180,000,000      ~18,000
Saccharomyces cerevisiae (Yeast)      16                 14,000,000       ~6,000
Zea mays (Corn)                       10                 2,400,000,000    ???

The term transcriptome refers to the complete collection of all possible mRNAs
(including splice variants) of an organism. This can be thought of as the regions of an
organism’s genome that get transcribed into messenger RNA. In some cases, the
transcriptome can be extended to include all transcribed elements, including non-coding
RNAs used for structural and regulatory purposes.

The term proteome refers to the complete collection of proteins that can be produced by
an organism. The proteome can be studied either as a static (sum of all proteins possible)
or a dynamic (all proteins found at a specific time point) entity.
Molecular Biology Reference Books

Lewin, B (1999), Genes VII (published by Oxford University Press) ISBN: 019879276X

Lodish et al (1995), Molecular Cell Biology, 3rd edition (published by Scientific American Books,
Freeman and Co., New York) ISBN 0 7167 2380 8

Gonick, L & Wheelis, M (1991), The Cartoon Guide to Genetics (published by Harper Perennial, New
York) ISBN 0 06 273099 1

Online tutorials

The Learning Center:
http://www.genecrc.org/site/lc/lc1a.htm

On-Line Biology Book:
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html

EMBL-EBI Introduction to Biology:
http://www.ebi.ac.uk/microarray/biology_intro.html

One site you will be intimately familiar with by the end of the semester:
http://www.ncbi.nlm.nih.gov

Reading assignment

http://www.ebi.ac.uk/microarray/biology_intro.html

Chapters 1 & 2 (Durbin, et al.)


Chapters 1 & 3 (Mount)


Lecture 2: Pairwise Sequence Alignment

In molecular biology, a common question is to ask whether or not two sequences are
related. The most common way to tell whether or not they are related is to compare them
to one another to see if they are similar. If we look at two words in the English language,
we note that two words that are spelled similarly may mean two completely different
things, such as the words pear and tear.

Biological sequences that are similar (but not identical) provide useful information to help
discover functional, structural, and evolutionary relationships. One common mistake is to
describe two sequences as having some sort of homology or a percent homology based on
their sequence similarity. This is a misuse of the biological term. Two sequences in
different organisms are homologous if they have been derived from a common ancestor
sequence. Two sequences may or may not be homologous regardless of their sequence
similarity. However, the greater the sequence similarity, the greater the chance that
they share similar function and/or structure.

SEQUENCE SIMILARITY ≠ HOMOLOGY!


Biological Definitions for Related Sequences

Homologs are similar sequences that have been derived from a common ancestor
sequence. Homologs can be described as either orthologous or paralogous.

Orthologs are similar sequences in two different organisms that have arisen due to a
speciation event. Orthologs typically retain their functionality throughout evolution.

Paralogs are similar sequences within a single organism that have arisen due to a gene
duplication event.

Xenologs are similar sequences that do not share the same evolutionary origin, but rather
have arisen out of horizontal transfer events through symbiosis, viruses, etc.
Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Hamming or edit distance

One method in determining sequence similarity is to determine the edit distance between
two sequences. If we take the example of pear and tear, how similar are these two
words? We notice that if we change the p to a t, and keep the ear, then we can change
pear to tear. Thus, there is a mismatch in the first letter, and matches in the last three. An
alignment of these two is as follows:

P E A R
  | | |
T E A R

One way to score this alignment is to calculate the Hamming distance, which is the
number of positions at which two words of equal length differ. In this example, the
Hamming distance would be 1. The Hamming distance is calculated by summing up the
number of mismatches when two words are aligned to one another without gaps.
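
A minimal Python sketch of this calculation is shown below. The function name `hamming_distance` is chosen here for illustration, and the choice to raise an error for words of different lengths is an assumption of this sketch, reflecting the fact that the Hamming distance compares words position by position.

```python
def hamming_distance(word1, word2):
    """Count the positions at which two equal-length words differ."""
    if len(word1) != len(word2):
        raise ValueError("Hamming distance needs words of equal length")
    # Compare the words position by position and count the mismatches.
    return sum(c1 != c2 for c1, c2 in zip(word1, word2))

print(hamming_distance("PEAR", "TEAR"))   # 1
```
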
With biological sequences, it is often necessary to align two sequences that are of
different lengths, or that have regions that have been inserted or deleted over time. Thus,
the notion of gaps needs to be introduced. Consider the words alignment and ligament.
One alignment of these two words is as follows:

A L I G N M E N T
  | | |   | | | |
- L I G A M E N T

In this case, a gap is denoted in the alignment by a ‘-‘ character. Now an alignment can
produce one of the following: a match between two characters, a mismatch between two
characters (also called a substitution or mutation), a gap in the first sequence (which can
be thought of as the deletion of a character in the first sequence), or a gap in the second
sequence (which can be thought of as the insertion of a character in the first sequence).

Consider the following two nucleic acid sequences: ACGGACT and ATCGGATCT.
The following are two valid alignments:

A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T

A T C G G A T C T
|   | | |     | |
A - C G G - A C T

Alignment scoring schemes

Which alignment is the better alignment? One way to judge this is to assign a positive
score for each match, and a negative score for each mismatch, and a negative score for
each insertion/deletion (collectively referred to as indels).

One scoring scheme might assign the following values:

match: +2
mismatch: -1
indel: -2

Using this scoring scheme, the first alignment has 5 matches, 1 mismatch, and 4 indels.
The score for this alignment is: 5 * 2 – 1(1) – 4(2) = 10 – 1 – 8 = 1.

The second alignment has 6 matches, 1 mismatch, and 2 indels. The score for the second
alignment is 6 * 2 – 1(1) – 2 (2) = 12 – 1 – 4 = 7.
Therefore, using the above scoring scheme, the second alignment is a better alignment,
since it produces a higher alignment score.
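
The two scores above can be reproduced with a short Python sketch. The function name `score_alignment` is chosen here for illustration; it assumes the alignment is given as two rows of equal length with `-` marking a gap, and that any column containing a gap counts as one indel.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, indel=-2):
    """Score a gapped alignment column by column."""
    if len(row1) != len(row2):
        raise ValueError("Both rows of an alignment must have the same length")
    score = 0
    for c1, c2 in zip(row1, row2):
        if c1 == "-" or c2 == "-":
            score += indel       # gap in either row
        elif c1 == c2:
            score += match       # identical characters
        else:
            score += mismatch    # substitution
    return score

# The two alignments of ACGGACT and ATCGGATCT shown above.
first = score_alignment("A-C-GG-ACT", "ATCGGAT-CT")
second = score_alignment("ATCGGATCT", "A-CGG-ACT")
print(first, second)   # 1 7
```

Changing the three parameters changes which alignment wins, which is why the scoring scheme has to be stated along with any reported alignment score.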

Visual Alignments -- Dot Plots

One of the more basic, yet important techniques for determining the alignment between
two sequences is by using a visual alignment known as dot plots. Dot plots of sequence
similarity are created using a matrix where the rows in the matrix correspond to the
characters in the first sequence and the columns in the matrix correspond to the characters
in the second sequence. The dot plot is created as follows: loop through each row. For
the current row, take the character in that row and compare it to the character in each
column. If they are equal, place a dot in the matrix. Continue until all cells in the
matrix have been considered.
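
The procedure just described can be sketched in Python as follows. The function names `dot_plot` and `render`, and the use of `*` and `.` as plotting characters, are choices made for this sketch; the sequence is the one used in the self-comparison example in these notes.

```python
def dot_plot(seq1, seq2):
    """Return a matrix with True wherever seq1[i] equals seq2[j]."""
    # Rows correspond to seq1, columns correspond to seq2.
    return [[a == b for b in seq2] for a in seq1]

def render(matrix, seq1, seq2):
    """Draw the matrix as text, with '*' for a dot and '.' for no dot."""
    lines = ["  " + " ".join(seq2)]
    for base, row in zip(seq1, matrix):
        lines.append(base + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)

seq = "ACCTGAGCTCACCTGAGTTA"
matrix = dot_plot(seq, seq)
print(render(matrix, seq, seq))
```

Because the sequence is compared against itself, the main diagonal is completely filled; the repeated segment ACCTGAG shows up as shorter diagonals parallel to it.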

[Figure: dot plot matrix with the sequence ACCTGAGCTCACCTGAGTTA along both the rows and the columns.]
Results for aligning ACCTGAGCTCACCTGAGTTA to itself using the Dot Matrix option of the AlignX
feature of Informax’s Vector NTI program.

When a dot plot is created to compare nucleic acids, there will be a lot of noise, since one
out of every four positions will match at random if there are an equal number of A, C, G,
and T in the sequence. Therefore, dot plots can be filtered for stringency by requiring that a
certain percentage of nucleotides match in a given window size. With the example
above, filtering the sequences to show only matches of two or more consecutive
nucleotides removes the isolated dots and leaves the diagonal runs.

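
A simple version of this filter can be sketched in Python. The function name `filtered_dot_plot` and the parameter `k` (the minimum number of consecutive matching nucleotides) are names chosen for this sketch. It implements only the exact-run filter used in the example, not the more general "percentage of matches in a window" filter.

```python
def filtered_dot_plot(seq1, seq2, k=2):
    """Keep a dot only if it lies on a diagonal run of at least k matches."""
    rows, cols = len(seq1), len(seq2)
    keep = [[False] * cols for _ in range(rows)]
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            # Check the k cells on the diagonal that starts at (i, j).
            if all(seq1[i + d] == seq2[j + d] for d in range(k)):
                for d in range(k):
                    keep[i + d][j + d] = True
    return keep

seq = "ACCTGAGCTCACCTGAGTTA"
raw_dots = sum(a == b for a in seq for b in seq)
filtered = filtered_dot_plot(seq, seq, k=2)
filtered_dots = sum(cell for row in filtered for cell in row)
print(raw_dots, filtered_dots)
```

The filtered count is smaller than the raw count because isolated single-base matches, such as the A in row 1 against the A in column 6, are dropped.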
Information within Dot Plots

Dot plots are useful as a first-level filter for determining an alignment between two
sequences. Regions of similarity will show up as diagonals within the dot plot matrix.

Regions containing insertions/deletions can be readily determined. One potential
application is to determine the number of coding regions (exons) contained within a
processed mRNA.

Example of a Dot Plot showing insertion/deletion events.

Regions of genomic DNA can contain repetitive regions. For instance, approximately 50
percent of the human genome is composed of repetitive elements, which can be on the
order of a few hundred bases (SINEs, such as Alu elements) or a few thousand (LINEs). In
addition, regions of low complexity are present as well. Repetitive elements and methods
to filter them out will be discussed during a later class period. In addition to repetitive
elements, regions of a genome can be duplicated. The duplicated region can be found
either as a direct repeat, meaning that it occurs in the same direction, or as an inverted
repeat, meaning that the sequence of the duplicated region is found in the reverse
complement direction. Dot plots can readily show regions of direct and inverted repeats.

Dot plots show all possible matches of residues between two sequences given a certain
threshold level. Thus, the researcher can decide which alignments are the most
significant.
Example dot plots showing the presence of direct and inverted repeats.

Dot plots can also be used in order to compare two different assemblies of the same
sequence. Below are three dotplots of various chromosomes. The first shows two
separate assemblies of human chromosome 5 compared against each other. The second
shows one assembly of chromosome 5 compared against itself, indicating the presence of
repetitive regions. The final dotplot shows chromosome Y compared against itself,
indicating the presence of inverted repeats.

Comparison of two assemblies of chromosome 5. The figure to the left indicates the alignment of two
separate assemblies, while the figure to the right indicates the alignment of a single assembly against itself.
Self plot of chromosome Y. Indicated are several regions of both direct and inverted repeats.

Available Dot Plot Packages

Vector NTI software package (under AlignX)
Dotlet Java applet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Dotter: http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html

GCG software package:
• Compare: http://www.hku.hk/bruhk/gcgdoc/compare.html
• DotPlot+: http://www.hku.hk/bruhk/gcgdoc/dotplot.html

EMBOSS software package:
• Dotmatcher: http://www.hku.hk/bruhk/emboss/dotmatcher.html
• Dotpath
• Dotup

DNA Strider

Pipmaker: http://bio.cse.psu.edu/pipmaker/ (returns a PDF of the alignment)

Overview of Dotplot techniques:
http://imagebeat.com/dotplot/overview.html

Dot Plot Articles

Gibbs & McIntyre, 1970
Gibbs, A. J. & McIntyre, G. A. (1970). The diagram method for comparing sequences. Its use
with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

Staden, 1982
Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid
and amino-acid sequences. Nucl. Acid. Res. 10 (9), 2951-2961.

The shortcoming of visual methods is that they do not yield a direct measure of the
similarity between two sequences.

In order to get a measure of sequence similarity, dynamic programming can be
employed.

Finding an optimal alignment of two sequences

Suppose there are two sequences X and Z to be aligned, where |X| = m and |Z| = n. If
gaps are allowed in the sequences, then the potential length of both the first and second
sequences is m+n. Several methods will be discussed to align these sequences.

Brute Force Method

If we are interested in determining the optimal alignment (either global or local), then we
note that there are $2^{m+n}$ subsequences with spaces for the sequence X, and $2^{m+n}$
subsequences with spaces for the sequence Z, using the power set rules. Thus, a brute
force method of comparing these two sequences for the optimal alignment would require
$2^{m+n} \times 2^{m+n} = 2^{2(m+n)} = 4^{m+n}$ comparisons. It doesn’t take long for this
to be an impossible search!
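
To get a feel for how quickly this count grows, the short Python sketch below evaluates it for a few lengths. The function name `brute_force_comparisons` is chosen here for illustration; the lengths 11 and 7 are those of the two example sequences aligned later in these notes, and the other lengths are arbitrary.

```python
def brute_force_comparisons(m, n):
    """Number of comparisons for the brute force search: 4 ** (m + n)."""
    return 4 ** (m + n)

for m, n in [(5, 5), (11, 7), (50, 50)]:
    print(m, n, brute_force_comparisons(m, n))

# For m = 11 and n = 7 the count is 4 ** 18 = 68,719,476,736.
```

Even for two sequences of 50 residues each, the count has 61 digits, so an exhaustive search is not practical for sequences of biological length.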

Dynamic Programming

Luckily, sequence alignment has an optimal-substructure property, and therefore there is
a much easier way to consider all of the possible alignments using what is called dynamic
programming (DP). Dynamic programming techniques are used in many different
aspects of computer science. DP algorithms solve optimization problems by dividing the
problem into smaller subproblems that overlap with one another. Each subproblem is
solved only once, and the answer is stored in a table, thus avoiding the work of
recomputing the solution.

With sequence alignment, the subproblems can be thought of as the alignment of the
“prefixes” of the two sequences to a certain point. Therefore, a dynamic programming
matrix is computed. The optimal alignment score for any particular point in the matrix is
built upon the optimal alignment that has been computed to that point.

Dynamic programming techniques align two sequences by beginning at the ends of the
two sequences and attempting to align all possible pairs of characters (one from each
sequence) using a scoring scheme for matches, mismatches, and gaps. The highest set of
scores defines the optimal alignment between the two sequences.

We will first consider dynamic programming in terms of DNA, where only exact matches
are considered for a match score. Later we will discuss how substitution matrices can be
used to score amino acid matches and mismatches.

Dynamic programming approaches are guaranteed to provide the optimal alignment
given a particular scoring scheme. For large sequences, dynamic programming can be
slow and memory intensive, since the matrix has (m+1) x (n+1) cells to fill and store.
Discuss the time and space necessary for microarray analysis.

Setting up the Dynamic Programming Matrix

Now we are ready to go ahead and start creating the dynamic programming matrix. The
first step is to align one of the sequences across the columns of the matrix, and the other
sequence across the rows. Note that an alignment can also begin with a gap in one of the
sequences, so that has to be taken care of as well. Let’s assume that we want to align the
sequence GAATTCAGTTA to GGATCGA. The length of the first sequence is 11
residues, and the length of the second is 7. Since it is possible to begin an alignment
with a gap, the size of the matrix should be 8 x 12. Row 0 and column 0 will represent
gaps. Rows 1-7 will be labeled with the corresponding residue of the sequence
GGATCGA, while columns 1-11 will be labeled with the corresponding residue of the
sequence GAATTCAGTTA. The initial matrix, S, is as follows:

- G A A T T C A G T T A
-
G
G
A
T
C
G
A

Now we need to decide upon the scoring scheme to be used. This requires parameters for
a match score, a mismatch score, and a gap score. The match and mismatch scores will
be combined into a single match/mismatch score, $s(a_i, b_j)$. We’ll see how this can later be
used with a substitution matrix. There will also be a single linear gap penalty score, w.
For our first example, we have the following parameters:

Sequence #1: GAATTCAGTTA; M = 11
Sequence #2: GGATCGA; N = 7

• $s(a_i, b_j) = +5$ if $a_i = b_j$ (match score)
• $s(a_i, b_j) = -3$ if $a_i \neq b_j$ (mismatch score)
• $w = -4$ (gap penalty)

Three steps in dynamic programming

Once you have the scoring functions set and the sequences to align, there are three steps
involved in calculating the optimal scoring alignment. The methods used to finish these
three steps are dependent upon whether global or local sequence alignment is desired.
The three steps are as follows:

• Initialization
• Matrix Fill (scoring)
• Traceback (alignment)

Global Alignment: Needleman-Wunsch Algorithm

In global sequence alignment, an attempt to align the entirety of two different sequences
is made, up to and including the ends of the sequence. Needleman and Wunsch (1970)
were among the first to describe a dynamic programming algorithm for global sequence
alignment.

Initialization Step. In the initialization step of global alignment, each entry $S_{i,0}$ in
column 0 is set to w * i. In addition, each entry $S_{0,j}$ in row 0 is set to w * j. Remember
that w is the gap penalty. Using the scoring scheme described above, the initialization step
results in the following:

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4
G -8
A -12
T -16
C -20
G -24
A -28

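
The initialization step can be sketched as follows. This is an illustrative sketch, not the lecture's own code; the function name init_global is hypothetical, and the matrix is stored as a list of rows so that S[i][j] corresponds to Si,j.

```python
def init_global(n_rows: int, n_cols: int, w: int):
    """Build an (n_rows + 1) x (n_cols + 1) matrix with gap-initialized borders."""
    S = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(1, n_rows + 1):
        S[i][0] = w * i  # first column: i gaps
    for j in range(1, n_cols + 1):
        S[0][j] = w * j  # first row: j gaps
    return S


# Rows are labeled by GGATCGA, columns by GAATTCAGTTA.
S = init_global(len("GGATCGA"), len("GAATTCAGTTA"), w=-4)
print(S[0])                   # first row
print([row[0] for row in S])  # first column
```
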
Matrix Fill Step. One possible solution of the matrix fill step finds the maximum global
alignment score by starting in the upper left hand corner of the matrix and finding the
maximal score Si,j for each position in the matrix. In order to find Si,j for any i,j it is
necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j.
In terms of matrix positions, these are Si,j-1, Si-1,j and Si-1,j-1.

For each position, Si,j is defined to be the maximum score at position i,j; i.e.

Si,j = MAXIMUM[
Si-1,j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #2),
Si-1,j + w (gap in sequence #1)]

Note that in the example, Si-1,j-1 will be red, Si,j-1 will be green and Si-1,j will be blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the
first residue in both sequences is a G, s(a1b1) = 5, and by the assumptions stated earlier, w
= -4. Thus, S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8] = 5.
A value of 5 is then placed in position 1,1 of the scoring matrix. Note that there is also an
arrow placed back into the cell that resulted in the maximum score, S0,0.

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5
G -8
A -12
T -16
C -20
G -24
A -28

Now we proceed to S1,2. Since a1 = G and b2 = A, there is a mismatch. Therefore, s(a1b2)
= -3 and by the assumptions stated earlier, w = -4. Thus, S1,2 = MAX[S0,1 - 3, S1,1 - 4,
S0,2 - 4] = MAX[-4 - 3, 5 - 4, -8 - 4] = MAX[-7, 1, -12] = 1. An arrow is placed back into
the cell that resulted in the maximum score, which is the cell S1,1.

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1
G -8
A -12
T -16
C -20
G -24
A -28

We can proceed to fill in the rest of the first row in a similar fashion, resulting in the
following matrix:

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1 -3 -7 -11 -15 -19 -23 -27 -31 -35
G -8
A -12
T -16
C -20
G -24
A -28

Now we can start to fill in the second row, beginning with S2,1. Note that a2 = G and b1
= G, so s(a2b1) = 5 and by the assumptions stated earlier, w = -4. Thus, S2,1 = MAX[S1,0 + 5,
S2,0 - 4, S1,1 - 4] = MAX[-4 + 5, -8 - 4, 5 - 4] = MAX[1, -12, 1] = 1. Note that in this case,
there are two possible paths to the maximum value. Therefore, an arrow is placed back
into each cell resulting in the maximum score, which are cells S1,0 and S1,1.

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1 -3 -7 -11 -15 -19 -23 -27 -31 -35
G -8 1
A -12
T -16
C -20
G -24
A -28

We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting
matrix is as follows:

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1 -3 -7 -11 -15 -19 -23 -27 -31 -35
G -8 1 2 -2 -6 -10 -14 -18 -14 -18 -22 -26
A -12 -3 6 7 3 -1 -5 -9 -13 -17 -21 -17
T -16 -7 2 3 12 8 4 0 -4 -8 -12 -16
C -20 -11 -2 -1 8 9 13 9 5 1 -3 -7
G -24 -15 -6 -5 4 5 9 10 14 10 6 2
A -28 -19 -10 -1 0 1 5 14 10 11 7 11

Each cell has one to three arrows indicating from which cell the maximum score was
obtained. The matrix fill step is now complete.
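
The initialization and matrix fill steps above can be sketched in a few lines. This is a minimal sketch under the scoring scheme of the example (+5, -3, -4); the function name nw_fill is hypothetical. Sequence #1 labels the columns and sequence #2 labels the rows, as in the matrices shown.

```python
def nw_fill(seq1, seq2, match=5, mismatch=-3, gap=-4):
    """Return the filled global-alignment scoring matrix S (rows: seq2, columns: seq1)."""
    rows, cols = len(seq2), len(seq1)
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        S[i][0] = gap * i
    for j in range(1, cols + 1):
        S[0][j] = gap * j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            diag = S[i - 1][j - 1] + (match if seq2[i - 1] == seq1[j - 1] else mismatch)
            left = S[i][j - 1] + gap  # move from the left neighbor
            up = S[i - 1][j] + gap    # move from the cell above
            S[i][j] = max(diag, left, up)
    return S


S = nw_fill("GAATTCAGTTA", "GGATCGA")
print(S[1][1], S[1][2], S[2][1])  # 5 1 1
print(S[7][11])                   # 11, the lower right-hand cell
```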

Traceback Step. After the matrix fill step, the maximum global alignment score for the
two sequences is 11 (the value in the lower right hand cell). The traceback step will
obtain the actual alignment(s) that result in the maximum score.

The traceback begins in the lower right hand cell (here S7,11); i.e. the position where both
sequences are globally aligned.

Since pointers have been kept back to all possible predecessors, the traceback is simple.
At each cell, we look to see where we move next according to the pointers. To begin, the
only possible predecessor is the diagonal match.

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1 -3 -7 -11 -15 -19 -23 -27 -31 -35
G -8 1 2 -2 -6 -10 -14 -18 -14 -18 -22 -26
A -12 -3 6 7 3 -1 -5 -9 -13 -17 -21 -17
T -16 -7 2 3 12 8 4 0 -4 -8 -12 -16
C -20 -11 -2 -1 8 9 13 9 5 1 -3 -7
G -24 -15 -6 -5 4 5 9 10 14 10 6 2
A -28 -19 -10 -1 0 1 5 14 10 11 7 11

This gives us an alignment of
A
|
A
Note that the blue letters and gold arrows indicate the path leading to the maximum score.

We can continue to follow the path until we get to the following situation:

- G A A T T C A G T T A
- 0 -4 -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G -4 5 1 -3 -7 -11 -15 -19 -23 -27 -31 -35
G -8 1 2 -2 -6 -10 -14 -18 -14 -18 -22 -26
A -12 -3 6 7 3 -1 -5 -9 -13 -17 -21 -17
T -16 -7 2 3 12 8 4 0 -4 -8 -12 -16
C -20 -11 -2 -1 8 9 13 9 5 1 -3 -7
G -24 -15 -6 -5 4 5 9 10 14 10 6 2
A -28 -19 -10 -1 0 1 5 14 10 11 7 11

The resulting global alignment is as follows:

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

Remembering that the scoring scheme used was +5 for a match, -3 for a mismatch, and
-4 for a gap, we can double check the score of the alignment:

G A A T T C A G T T A
|   |   | |   |     |
G G A - T C - G - - A

+ - + - + + - + - - +
5 3 5 4 5 5 4 5 4 4 5

5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
so this alignment results in a global alignment score of 11.
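
The same double check can be done programmatically. The sketch below scores an already-gapped pair of strings column by column; the function name alignment_score is hypothetical, and "-" is assumed to be the gap character.

```python
def alignment_score(top, bottom, match=5, mismatch=-3, gap=-4):
    """Sum the column scores of two equal-length gapped sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GAATTCAGTTA", "GGA-TC-G--A"))  # 11
```
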
Note that the traceback above follows only one path. Cell S4,5 can be reached with the
same score of 8 either diagonally from S3,4 (a match of the T's) or from the left from S4,4
(a gap), so a second alignment, with the gap placed opposite the second T rather than the
first, also scores 11. Multiple alignments yielding the same score are possible whenever
there are multiple ways to obtain the maximal score in a cell along the traceback path. In
such a case, the traceback can be accomplished in any manner desired, as long as the same
set of rules is used consistently for reproducibility.
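
The three steps of global alignment can be put together as in the sketch below. Instead of storing arrows, it recomputes at each cell which predecessor produced the score, and it breaks ties in a fixed order (diagonal, then left, then up); that tie-breaking rule and the name nw_align are assumptions made for this illustration.

```python
def nw_align(seq1, seq2, match=5, mismatch=-3, gap=-4):
    """Global alignment of seq1 (columns) and seq2 (rows): (top, bottom, score)."""
    rows, cols = len(seq2), len(seq1)

    def s(a, b):
        return match if a == b else mismatch

    # Initialization and matrix fill.
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        S[i][0] = gap * i
    for j in range(1, cols + 1):
        S[0][j] = gap * j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(seq2[i - 1], seq1[j - 1]),
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap)

    # Traceback from the lower right-hand cell back to S[0][0].
    top, bottom = [], []
    i, j = rows, cols
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(seq2[i - 1], seq1[j - 1]):
            top.append(seq1[j - 1])
            bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and S[i][j] == S[i][j - 1] + gap:
            top.append(seq1[j - 1])  # residue of seq1 against a gap in seq2
            bottom.append("-")
            j -= 1
        else:
            top.append("-")          # residue of seq2 against a gap in seq1
            bottom.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), S[rows][cols]


top, bottom, score = nw_align("GAATTCAGTTA", "GGATCGA")
print(top)     # GAATTCAGTTA
print(bottom)  # GGA-TC-G--A
print(score)   # 11
```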

Local Alignment: Smith-Waterman Algorithm

In 1981, Temple Smith and Mike Waterman proposed a modification to the Needleman-
Wunsch algorithm in order to obtain a local sequence alignment resulting in the highest-
scoring local match between two sequences.

Why choose a local alignment algorithm?


• More meaningful: points out conserved regions between two sequences
• Allows two sequences of different lengths to be matched
• Aligns two partially overlapping sequences
• Aligns two sequences where one is a subsequence of the other

There are only two slight modifications that need to be made to the Needleman-Wunsch
Algorithm in order to make it a local alignment algorithm. The first modification
requires negative scores for mismatches. The second modification requires that when the
dynamic programming scoring matrix value becomes negative, the value is set to zero,
which has the effect of terminating any alignment up to that point. This has the effect of
changing the matrix score to:

Si,j = MAXIMUM[
Si-1,j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #2),
Si-1,j + w (gap in sequence #1),
0]

The local alignments are then produced by starting at the highest-scoring positions in the
scoring matrix and following a trace path from those positions up to a box that scores
zero.
Initialization Step. In the initialization step of local alignment, each cell Si,0 in the
first column is set to 0, and each cell S0,j in the first row is set to 0. Using the scoring
scheme described above, the initialization step results in the following:

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0
G 0
A 0
T 0
C 0
G 0
A 0

Matrix Fill Step. One possible solution of the matrix fill step finds the maximum local
alignment score by starting in the upper left hand corner of the matrix and finding the
maximal score Si,j for each position in the matrix. In order to find Si,j for any i,j it is
necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j.
In terms of matrix positions, these are Si,j-1, Si-1,j and Si-1,j-1.

For each position, Si,j is defined to be the maximum score at position i,j; i.e.

Si,j = MAXIMUM[
Si-1,j-1 + s(ai,bj) (match/mismatch in the diagonal),
Si,j-1 + w (gap in sequence #2),
Si-1,j + w (gap in sequence #1),
0]

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5
G 0
A 0
T 0
C 0
G 0
A 0
Note that in the example, Si-1,j-1 will be red, Si,j-1 will be green and Si-1,j will be blue.
Using this information, the score at position 1,1 in the matrix can be calculated. Since the
first residue in both sequences is a G, s(a1b1) = 5, and by the assumptions stated earlier, w
= -4. Thus, S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4, 0] = MAX[5, -4, -4, 0] = 5.

Now we proceed to S1,2. Since a1 = G and b2 = A, there is a mismatch. Therefore, s(a1b2)
= -3 and by the assumptions stated earlier, w = -4. Thus, S1,2 = MAX[S0,1 - 3, S1,1 - 4,
S0,2 - 4, 0] = MAX[0 - 3, 5 - 4, 0 - 4, 0] = MAX[-3, 1, -4, 0] = 1. An arrow is placed back
into the cell that resulted in the maximum score, which is the cell S1,1.

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1
G 0
A 0
T 0
C 0
G 0
A 0

Now we proceed to S1,3. Since a1 = G and b3 = A, there is a mismatch. Therefore, s(a1b3)
= -3 and by the assumptions stated earlier, w = -4. Thus, S1,3 = MAX[S0,2 - 3, S1,2 - 4,
S0,3 - 4, 0] = MAX[0 - 3, 1 - 4, 0 - 4, 0] = MAX[-3, -3, -4, 0] = 0. Since the maximum score
is 0 (all other possible scores are negative), no arrow is drawn back from this location.

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0
G 0
A 0
T 0
C 0
G 0
A 0
We can then proceed to fill in the rest of the matrix in a similar fashion. The resulting
matrix is as follows:

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11

Each cell has up to three arrows indicating from which cell the maximum score was
obtained; cells that score 0 have none. The matrix fill step is now complete.
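
The local matrix fill differs from the global one only in the zero borders and the extra 0 inside the maximum. A minimal sketch (the function name sw_fill is hypothetical) that also locates the highest-scoring cells:

```python
def sw_fill(seq1, seq2, match=5, mismatch=-3, gap=-4):
    """Return the filled local-alignment scoring matrix S (rows: seq2, columns: seq1)."""
    rows, cols = len(seq2), len(seq1)
    S = [[0] * (cols + 1) for _ in range(rows + 1)]  # borders stay at 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            diag = S[i - 1][j - 1] + (match if seq2[i - 1] == seq1[j - 1] else mismatch)
            S[i][j] = max(diag, S[i][j - 1] + gap, S[i - 1][j] + gap, 0)
    return S


S = sw_fill("GAATTCAGTTA", "GGATCGA")
best = max(max(row) for row in S)
cells = [(i, j) for i, row in enumerate(S) for j, v in enumerate(row) if v == best]
print(best)   # 14
print(cells)  # [(6, 8), (7, 7)]
```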

Traceback Step. After the matrix fill step, the maximum local alignment score for the
two sequences is 14, which can be found by locating the highest values in the score
matrix. Note that 14 is found in two separate cells, indicating there are multiple
alignments producing the maximal alignment score. The traceback step will find the
actual local alignments resulting in the maximum score.

The traceback begins in the position with the highest value. Since pointers have been
kept back to all possible predacessors, the traceback is simple. At each cell, we look to
see where we move next according to the pointers. When we reach a cell where there is
not a pointer to a previous cell, then we have reached the beginning of the local
alignment.

First, consider the case where the 14 is in the last row.

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11

Note that the blue letters and gold arrows indicate the path leading to the maximum score.

We can continue to follow the path until we get to the following situation:

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11

At this point, our alignment (which is built starting at the end of the alignment) is as
follows:

C - A
|   |
C G A

Now the current cell gets its score either from a match of the T’s or a gap in the second
sequence. We’ll consider both as possibilities: Match of the T’s (1) and gap in second
(2).

- G A A T T C A G T T A
- 0 0 0 0 0 0 0 0 0 0 0 0
G 0 5 1 0 0 0 0 0 5 1 0 0
G 0 5 2 0 0 0 0 0 5 2 0 0
A 0 1 10 7 3 0 0 5 1 2 0 5
T 0 0 6 7 12 8 4 1 2 6 7 3
C 0 0 2 3 8 9 13 9 5 2 3 4
G 0 5 1 0 4 5 9 10 14 10 6 2
A 0 1 10 6 2 1 5 14 10 11 7 11
Once we reach the node with 0 and there are no pointers from this node, we are finished.
The two local alignments resulting in a score of 14 in the final row are:

G A A T T C - A
|   | |   |   |
G G A T - C G A

+ - + + - + - +
5 3 5 5 4 5 4 5

G A A T T C - A
|   |   | |   |
G G A - T C G A

+ - + - + + - +
5 3 5 4 5 5 4 5

As you can see, each of these has 5 matches, 1 mismatch, and 2 gaps, so the score is 5(5)
– 1(3) – 2(4) = 25 – 3 – 8 = 14. This coincides with the maximum local alignment score
calculated in the matrix.
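
The local traceback can be sketched as below. The function recomputes the matrix, starts from a cell supplied by the caller, and stops at the first cell that scores 0. Ties are broken in a fixed order (diagonal, then left, then up), so only one of the equally good alignments is returned; that rule and the name sw_align_from are assumptions made for this illustration.

```python
def sw_align_from(seq1, seq2, i, j, match=5, mismatch=-3, gap=-4):
    """Local alignment ending at cell (i, j); seq1 labels columns, seq2 rows."""
    rows, cols = len(seq2), len(seq1)

    def s(a, b):
        return match if a == b else mismatch

    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            S[r][c] = max(S[r - 1][c - 1] + s(seq2[r - 1], seq1[c - 1]),
                          S[r][c - 1] + gap, S[r - 1][c] + gap, 0)

    score = S[i][j]
    top, bottom = [], []
    # A positive score implies i >= 1 and j >= 1, because the borders are 0.
    while S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(seq2[i - 1], seq1[j - 1]):
            top.append(seq1[j - 1])
            bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i][j - 1] + gap:
            top.append(seq1[j - 1])
            bottom.append("-")
            j -= 1
        else:
            top.append("-")
            bottom.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score


# Start from the 14 in the last row (row 7, column 7).
print(sw_align_from("GAATTCAGTTA", "GGATCGA", 7, 7))
# ('GAATTC-A', 'GGA-TCGA', 14)
```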

Incorporation of Scoring Matrices

Amino Acids

Certain amino acid substitutions commonly occur in related proteins from different
species. Since the proteins in all of the species are functional, the substitutions maintain
protein structure and function. Often the substitutions result in a chemically similar
amino acid. Other substitutions are relatively rare. Thus, rather than create a dynamic
programming matrix with a match/mismatch score, it would be better to weight a
matching score for two residues dependent upon the likelihood that such a substitution
would be observed in nature.

In a substitution matrix (whether it is an amino acid or nucleic acid), the residues are
listed both as column and row headings. Each position in the matrix is filled with a
score reflecting how often one residue would be paired with another in an alignment of
related sequences.
Percent Accepted Mutation (PAM) Matrices

Margaret Dayhoff pioneered the research in amino acid substitutions found through
the alignment of common protein sequences. The resulting Percent Accepted Mutation
(PAM) Matrices give the changes expected for a given period of evolutionary time. The
assumption with this evolutionary model is that amino acid substitutions over short
periods of evolutionary history can be extrapolated to longer distances.

Assumptions in Creating PAM matrices

Each change in the current amino acid at a particular site is assumed to be independent of
previous mutational events at that site.

Calculation of PAM matrices

• amino acid substitutions of evolving proteins were estimated using 1572 changes
in 71 groups of protein sequences at least 85% similar. Since the proteins have
similar functions, the mutations are called “accepted” mutations – meaning they
are accepted by natural selection without negatively affecting a protein’s fitness.
• Similar sequences were organized into phylogenetic trees
• The number of changes of each amino acid into every other amino acid was
counted.
• Relative mutabilities were evaluated by counting the number of changes of each
amino acid divided by a normalization factor. This normalized the data for
variations in amino acid composition, mutation rate, and sequence length.
• The amino acid exchange counts and mutability values were used to generate a 20
x 20 mutation probability matrix representing all possible amino acid changes.

A detailed example of calculating the PAM matrix is located in Mount, p50.

Since each change is assumed to be independent of previous mutational events, the PAM1
matrix can be multiplied by itself N times to give the transition matrix for sequences
separated by an evolutionary distance of N PAM. Thus, the PAM250 matrix can be used for
sequences that are 20% similar, while the PAM120, PAM80, and PAM60 matrices represent
40%, 50%, and 60% similarity. Note that PAM1 is 1 accepted mutation per 100 amino acids;
PAM10 is 10 accepted mutations per 100 amino acids; PAM250 is 250 accepted
mutations per 100 amino acids and so on. Because a single site can change more than once,
250 accepted mutations per 100 amino acids still leaves about 20% of the residues identical.

Thus, the substitution matrix chosen when aligning two sequences should take into
account the divergence between the two sequences.
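
The repeated multiplication can be illustrated with a toy example. The 2 x 2 matrix below is a made-up two-state stand-in for the 20 x 20 PAM1 mutation probability matrix (its values are hypothetical), and the helper names mat_mult and mat_power are chosen for this sketch.

```python
def mat_mult(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M multiplied by itself so that the result is M to the power n (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mult(result, M)
    return result


# Hypothetical two-state "PAM1": each row sums to 1.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam2 = mat_power(toy_pam1, 2)
toy_pam250 = mat_power(toy_pam1, 250)

print(toy_pam2)
print(toy_pam250)  # the diagonal is much smaller than in toy_pam1
```
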
Example PAM1 matrix (normalized probabilities multiplied by 10000)
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A R N D C Q E G H I L K M F P S T W Y V

Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18

Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1

Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1

Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1

Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2

Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1

Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2

Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5

His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1

Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33

Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15

Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1

Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4

Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0

Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2

Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2

Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9

Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0

Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1

Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

Taken from:
http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeE.html#page7

Log Odds matrices

PAM matrices are usually converted into another form, called log odds matrices. Odds
ratios are converted into logarithms in order that the scores may be added, rather than
multiplied. Each cell of the log-odds matrix is calculated by first finding the odds ratio
for each substitution. The odds ratio is calculated by taking the scores in the above
matrix, which is the probability of one amino acid mutating to another given amino acid,
and dividing it by the frequency of the first amino acid. Such a ratio gives the relative
frequency of change. The ratio is then converted to a log10, so that the scores are
additive, and it is multiplied by 10. The log odds for converting from the first amino acid
to the second is added to the log odds for converting from the second amino acid to the
first, and the average is taken to produce a symmetric matrix, since the direction of
mutation cannot necessarily be inferred. An example of how the log odds score for
changes between Phe and Tyr is given in Mount, pp 80 – 81. Make sure to look at this to
see if you have any questions.
log-odds form of PAM250 Scoring matrix.
Image Source: http://www.blc.arizona.edu/courses/bioinformatics/dayhoff.html
(Image in Mount, p82)

Blocks Amino Acid Substitution Matrices (BLOSUM)

One of the arguments against the Dayhoff PAM matrices is that they represent only a
small number of families, and therefore may not truly reflect amino acid distributions that
one is likely to encounter. Therefore, another set of substitution matrices, called
BLOSUM matrices were developed using a much larger number of protein families.

The BLOSUM matrices were developed by Steven and Jorja Henikoff by looking at
a large set of approximately 2000 amino acid patterns organized into blocks, which are
conserved regions within protein families as identified by the protein database, Prosite.
The blocks that were studied were also signatures of a protein family, indicating that
members of the family could be found by searching for these blocks.

In order to deal with overrepresentation of amino acid substitutions occurring in the most
closely related members of the family, sequences within a block that exceed a chosen level
of identity are grouped together and counted as a single sequence. Grouping sequences that
were at least 60% identical produced the BLOSUM60 matrix; grouping sequences that were
at least 80% identical produced the BLOSUM80 matrix, etc.

Nucleic Acid Scoring Matrices

In addition to using a match/mismatch scoring scheme for DNA sequences, nucleotide
mutation matrices can be constructed as well. These matrices are based upon two
different models of nucleotide evolution: the first, the Jukes-Cantor model, assumes
there are uniform mutation rates among nucleotides, while the second, the Kimura model,
assumes that there are two separate mutation rates: one for transitions (where the
structure of purine/pyrimidine stays the same), and one for transversions. Generally, the
rate of transitions is thought to be higher than the rate of transversions.

PURINES: A, G
PYRIMIDINES: C, T

Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T

Jukes-Cantor Model of evolution: α = common rate of base substitution

Kimura Model of Evolution: α = rate of transitions; β = rate of transversions
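
The distinction between the two kinds of substitution can be expressed directly in code. This is a small illustrative sketch; the function name substitution_type is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify the change x -> y as identity, transition, or transversion."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "identity"
    # A transition keeps the base class (purine or pyrimidine) the same.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "C"))  # transversion
```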


http://www.cs.man.ac.uk/~jowh6/phase/node26.html
Tables 3.4 and 3.5 indicate nucleotide substitution matrices with the equivalent distance
of 1 PAM.

Table 3.4 PAM1 Odds Matrices

A. Model of uniform mutation rates among nucleotides.

A G T C
A 0.99
G 0.00333 0.99
T 0.00333 0.00333 0.99
C 0.00333 0.00333 0.00333 0.99

B. Model of 3-fold higher transitions than transversions.

A G T C
A 0.99
G 0.006 0.99
T 0.002 0.002 0.99
C 0.002 0.002 0.006 0.99

Table 3.5 PAM1 Log-Odds Matrices

A. Model of uniform mutation rates among nucleotides.

A G T C
A 2
G -6 2
T -6 -6 2
C -6 -6 -6 2

B. Model of 3-fold higher transitions than transversions.

A G T C
A 2
G -5 2
T -7 -7 2
C -7 -7 -5 2
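
The entries of Table 3.5 can be reproduced from those of Table 3.4. The sketch below assumes equal background frequencies of 0.25 for each base and base-2 logarithms rounded to the nearest integer; the notes do not state these choices explicitly, but they yield the tabulated values. The function name log_odds is hypothetical.

```python
import math


def log_odds(p: float, background: float = 0.25) -> int:
    """Rounded base-2 log of the odds ratio p / background (both choices assumed)."""
    return round(math.log2(p / background))


print(log_odds(0.99))     # 2   (identity)
print(log_odds(0.00333))  # -6  (uniform model, any change)
print(log_odds(0.006))    # -5  (transition, 3-fold model)
print(log_odds(0.002))    # -7  (transversion, 3-fold model)
```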

Gap Penalties

The scoring matrices used to this point assume a linear gap penalty where each gap is
given the same penalty score. However, over evolutionary time, it is more likely that a
contiguous block of residues has become inserted/deleted in a certain region (for
example, it is more likely to have 1 gap of length k than k gaps of length 1). Therefore, a
better scoring scheme to use is an initial higher penalty for opening a gap, and a smaller
penalty for extending the gap. The affine gap penalty can then be formulated as follows:

wx = g + r(x-1)
where wx is the total gap penalty, g is the gap open penalty, r is the gap extend penalty,
and x is the length of the gap.

The gap penalty needs to be chosen relative to the score matrix, so that gaps will not be
excluded from the alignment, or propagate throughout the alignment. Typical values are
–12 for gap opening, and –4 for gap extension.
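
A short sketch of the affine penalty, using the typical values quoted above, shows why one long gap is preferred over several short ones. The function name affine_gap_penalty is hypothetical.

```python
def affine_gap_penalty(x: int, g: int = -12, r: int = -4) -> int:
    """Total penalty w_x = g + r(x - 1) for a single gap of length x >= 1."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


one_gap_of_three = affine_gap_penalty(3)
three_gaps_of_one = 3 * affine_gap_penalty(1)
print(one_gap_of_three)   # -20
print(three_gaps_of_one)  # -36
```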

Affine gap penalties increase the number of matrices (or at least storage space) to be
filled out. The information to be processed is now:

$$
\begin{aligned}
M_{i,j} &= \max\{\, D_{i-1,j-1},\; M_{i-1,j-1},\; I_{i-1,j-1} \,\} + \mathrm{subst}(A_i, B_j) \\
D_{i,j} &= \max\{\, D_{i,j-1} - \mathrm{extend},\; M_{i,j-1} - \mathrm{open} \,\} \\
I_{i,j} &= \max\{\, M_{i-1,j} - \mathrm{open},\; I_{i-1,j} - \mathrm{extend} \,\}
\end{aligned}
$$

Where M is the match matrix, D is the delete matrix, and I is the insert matrix.
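
The three-matrix recurrences can be sketched as follows for the global case. The boundary conditions are not given in the notes, so the ones used here (a leading gap of length k costs open + (k - 1) * extend) are an assumption, as are the function name and the use of positive magnitudes for the two penalties.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=5, mismatch=-3, gap_open=12, gap_extend=4):
    """Best global alignment score of a (rows) and b (columns) with affine gaps.

    gap_open and gap_extend are positive magnitudes that are subtracted,
    mirroring the 'open' and 'extend' terms of the recurrences.
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # match matrix
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # delete matrix
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # insert matrix

    # Assumed boundary conditions: only gaps are possible along the borders.
    M[0][0] = 0
    for j in range(1, m + 1):
        D[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        I[i][0] = -(gap_open + (i - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1]) + subst
            D[i][j] = max(D[i][j - 1] - gap_extend, M[i][j - 1] - gap_open)
            I[i][j] = max(M[i - 1][j] - gap_open, I[i - 1][j] - gap_extend)

    return max(M[n][m], D[n][m], I[n][m])


# A--T against ACGT: two matches and one gap of length 2, 5 - (12 + 4) + 5.
print(affine_global_score("AT", "ACGT"))  # -6
```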

Assessing the significance of sequence alignments

When two sequences of length m and n are not obviously similar but show an alignment,
it becomes necessary to assess the significance of the alignment. The alignment scores
of random sequences have been shown to follow a Gumbel extreme value distribution.
Image source: http://roso.epfl.ch/mbi/papers/discretechoice/node11.html

Using a Gumbel extreme value distribution, the expected number of alignments with a
score at least S (E-value) is:

$$E = K m n \, e^{-\lambda S}$$
where:
m, n: lengths of the sequences
K, λ: statistical parameters dependent upon the scoring system and background residue
frequencies
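
The E-value formula is easy to evaluate directly. In the sketch below the function name expected_alignments is hypothetical, and the parameter values (K = 0.09, lambda = 0.229, two sequences of length 250) are the PAM250 estimates used in the worked example later in these notes.

```python
import math


def expected_alignments(S: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S): expected number of alignments scoring >= S."""
    return K * m * n * math.exp(-lam * S)


e_50 = expected_alignments(50, 250, 250, K=0.09, lam=0.229)
e_73 = expected_alignments(73, 250, 250, K=0.09, lam=0.229)
print(f"{e_50:.2g}")  # E drops exponentially as the score S grows
print(f"{e_73:.2g}")
```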

Recall that the log-odds scoring schemes examined to this point normally use a S =
10*log10x scoring system. We can normalize the raw scores obtained using these
non-gapped scoring systems to obtain the amount of bits of information contained in a
score, or the amount of nats of information contained within a score.
Converting to bit scores

A raw score can be normalized to a bit score using the formula:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

The E-value corresponding to a given bit score can then be calculated as:

$$E = m n \, 2^{-S'}$$

Converting to nats is similar. However, we just substitute e for 2 in the above equations,
giving S' = λS - ln K and E = mn e^(-S'). Converting scores to either bits or nats gives a
standardized unit by which the scores can be compared.

P-values

P-values can be calculated as the probability of obtaining a given score at random.
P-values can be estimated as:

$$P = 1 - e^{-E}$$

which is approximately E when E is small.

A quick determination of significance

If a scoring matrix has been scaled to bit scores, then it can quickly be determined
whether or not an alignment is significant. For a typical amino acid scoring matrix, K =
0.1 and lambda depends on the values of the scoring matrix. If a PAM or BLOSUM
matrix is used, then lambda is precomputed. For instance, if the log odds matrix is in
units of bits, then lambda = ln 2, and the significance cutoff can be calculated as
log2(mn).

Example (p110 Mount)


Suppose we have two sequences, each approximately 250 amino acids long that are
aligned using a Smith-Waterman approach and the PAM250 matrix. The following local
alignment occurs:

F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G

Using the PAM250 matrix (p82), the score for this local alignment can be calculated as:

S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

S is in units of 10 * log10 x, so this should be converted to a bit score:

$$
\begin{aligned}
S &= 10 \log_{10} x \\
\frac{S}{10} &= \log_{10} x \\
\frac{S}{10} \log_2 10 &= \log_{10} x \cdot \log_2 10 = \log_2 x \\
\frac{S}{3} &\approx \log_2 x \qquad (\log_2 10 \approx 3.32)
\end{aligned}
$$

so S' ≈ S/3

In this case, S’ = 1/3 * 73 = 24.3

The significance cutoff is: log2(mn) = log2(250 * 250) = 16 bits

Since the alignment score is above the significance cutoff, this is a significant local
alignment.
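
The quick significance check can be reproduced numerically. This is a minimal sketch of the arithmetic in the example above; the function names raw_to_bits and significance_cutoff are hypothetical.

```python
import math


def raw_to_bits(S: float) -> float:
    """Convert a score in 10 * log10 units to bits: S * log2(10) / 10."""
    return S * math.log2(10) / 10


def significance_cutoff(m: int, n: int) -> float:
    """Cutoff in bits for sequences of length m and n: log2(m * n)."""
    return math.log2(m * n)


bits = raw_to_bits(73)
cutoff = significance_cutoff(250, 250)
print(round(bits, 1))    # 24.3
print(round(cutoff, 1))  # 15.9
print(bits > cutoff)     # True
```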

Estimation of P and E

When a PAM250 scoring matrix is being used, K is estimated to be 0.09, while lambda is
estimated to be 0.229. Using equations 30 and 31 (Mount), we can convert the score to a
bit score:

$$S' = 0.229 \times 73 - \ln(0.09 \times 250 \times 250) = 16.72 - 8.63 = 8.09 \text{ bits}$$

$$P(S' \ge 8.09) = 1 - \exp\left(-e^{-8.09}\right) = 3.1 \times 10^{-4}$$

Therefore, we see that the probability of observing an alignment with a bit score of at
least 8.09 is about 3 in 10,000.

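
The estimate of P can be checked with a few lines of code. This is a sketch of the calculation above with hypothetical function names; computing without rounding the intermediate values gives a normalized score of about 8.08 rather than 8.09, and the same probability to two significant figures.

```python
import math


def normalized_score(S: float, m: int, n: int, K: float, lam: float) -> float:
    """S' = lambda * S - ln(K * m * n)."""
    return lam * S - math.log(K * m * n)


def p_value(s_norm: float) -> float:
    """P(S' >= x) = 1 - exp(-e^(-x)), computed with expm1 for small values."""
    return -math.expm1(-math.exp(-s_norm))


s_norm = normalized_score(73, 250, 250, K=0.09, lam=0.229)
p = p_value(s_norm)
print(f"{s_norm:.2f}")  # 8.08
print(f"{p:.1e}")       # 3.1e-04
```
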
Significance of Gapped Alignments

Gapped alignments make use of the same statistics as ungapped alignments in
determining the statistical significance. However, in gapped alignments, the values for
lambda and K cannot be easily estimated. Empirical estimates for particular combinations
of scoring matrix and gap scores have been determined by looking at the alignments of
randomized sequences.

Bayesian Statistics

Bayesian statistics are built upon conditional probabilities, which are used to derive the
joint probability of two events or conditions.

P(B|A) is the probability of B given condition A is true.


P(B) is the probability of condition B occurring, regardless of conditions A.

Suppose that A can have two states, A1 and A2, and B can have two states, B1 and B2.
Suppose that P(B1) = 0.3 is known. Therefore, P(B2) = 1 – 0.3 = 0.7. These
probabilities are known as marginal probabilities. Now we would like to determine the
probability of A1 and B1 occurring together, which is denoted as P(A1, B1) and is called
the joint probability. Note that in this case the marginal probabilities of A1 and A2 are
missing. Thus, there is not enough information at this point to calculate the joint
probability. However, if more information about the joint occurrence of A1 and B1 is
given, then the joint probabilities may be derived using Bayes' Rule:

P(A1, B1) = P(B1)P(A1|B1)


P(A1, B1) = P(A1)P(B1|A1)

Suppose that we are given P(A1|B1) = 0.8. Then, since there are only two different
possible states for A, P(A2|B1) = 1 – 0.8 = 0.2. If we are also given P(A2|B2) = 0.7, then
P(A1|B2) = 0.3. Using Bayes Rule, the joint probability of having states A1 and B1
occurring at the same time is P(B1)P(A1|B1) = 0.3 * 0.8 = 0.24 and P(A2,B2) =
P(B2)P(A2|B2) = 0.7 * 0.7 = 0.49. The other joint probabilities can be calculated from
these as well.
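
The remaining joint probabilities, and the marginal probabilities of A that were missing at the start, follow from the same numbers. This is a small sketch; the dictionary layout and variable names are chosen here for illustration.

```python
# Known marginals of B and conditionals P(A | B) from the example above.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {
    ("A1", "B1"): 0.8, ("A2", "B1"): 0.2,
    ("A1", "B2"): 0.3, ("A2", "B2"): 0.7,
}

# Joint probabilities: P(A, B) = P(B) * P(A | B).
joint = {(a, b): p_b[b] * p for (a, b), p in p_a_given_b.items()}

# Marginals of A: sum the joint probabilities over the states of B.
p_a = {a: sum(v for (x, _), v in joint.items() if x == a) for a in ("A1", "A2")}

# Reversing the condition: P(B1 | A1) = P(A1, B1) / P(A1).
p_b1_given_a1 = joint[("A1", "B1")] / p_a["A1"]

print(joint)
print(p_a)
print(round(p_b1_given_a1, 3))  # 0.533
```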

The calculation of the joint probabilities results in posterior probabilities, since they are
not known initially, but are calculated using prior probabilities and initial information.

Applications of Bayesian Statistics

Bayesian statistics have many applications in bioinformatics. One application is in
determining the evolutionary distance between two sequences (Agarwal and States, 1996
– covered in Mount, pp 122-124). Another is in sequence alignment algorithms (Zhu et
al, 1998; Mount pp 124-134). The significance of an alignment can also be computed
using a Bayesian framework (Durbin, et al, pp 36-38). More applications using Bayesian
statistics will be examined when the Gibbs Sampling algorithm is discussed during a later
class period.

Drawbacks to Dynamic Programming Approach

Dynamic programming approaches are guaranteed to give the optimal alignment between
two sequences given a scoring scheme. However, the two main drawbacks to DP
approaches are that they are compute and memory intensive, in the cases discussed to this
point taking at least O(n^2) time and space.

Linear space algorithms have been used in order to deal with one drawback to dynamic
programming. The basic idea is to concentrate only on those areas of the matrix more
likely to contain the maximum alignment. The most well-known of these linear space
algorithms is the Myers-Miller algorithm.
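
The memory cost can be illustrated with a simpler observation than the Myers-Miller algorithm itself: if only the optimal score is needed, each row of the matrix depends only on the row above it, so two rows suffice. The sketch below is not Myers-Miller (which also recovers the alignment in linear space); the function name is hypothetical.

```python
def nw_score_linear_space(seq1, seq2, match=5, mismatch=-3, gap=-4):
    """Global alignment score using only two rows of the scoring matrix."""
    prev = [gap * j for j in range(len(seq1) + 1)]  # row 0
    for i, a in enumerate(seq2, start=1):
        curr = [gap * i]                            # column 0 of row i
        for j, b in enumerate(seq1, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, curr[j - 1] + gap, prev[j] + gap))
        prev = curr
    return prev[-1]


print(nw_score_linear_space("GAATTCAGTTA", "GGATCGA"))  # 11
```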

Available pairwise sequence alignment programs

FASTA suite of programs


LALIGN
BESTFIT
SIM
GAP
NAP
LAP2
GAP2

http://genome.cs.mtu.edu/align/align.html

EMBOSS APPLICATIONS
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.html

WEB FORMS FOR EMBOSS APPLICATIONS


http://bioweb.pasteur.fr/seqanal/alignment/intro-uk.html#EMBOSS
http://bioinfo.pbi.nrc.ca:8090/EMBOSS/index.html

BAYESIAN TUTORIAL
http://www.wadsworth.org/resnres/bioinfo/tut1/index.htm

Expressed Sequences to Genomes

Sim4
est2genome
spidey
########################################
# Program: needle
# Rundate: Wed Jan 22 20:09:50 2003
# Report_file: outfile.align
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: gi
# 2: gi
# Matrix: EDNAFULL
# Gap_penalty: 12.0
# Extend_penalty: 4.0
#
# Length: 1030
# Identity: 537/1030 (52.1%)
# Similarity: 537/1030 (52.1%)
# Gaps: 493/1030 (47.9%)
# Score: 1649.0
#
#
#=======================================

gi 1 0

gi 1 ATACAAAATTTACGTGACTGGAGGGTGAAAGGGAATGTGGGAGGTCAGTG 50

gi 1 GGCAATAATGATACAATGTATCATGCCTCT 30
||||||||||||||||||||||||||||||
gi 51 CATTTAAAACATAAAGAAATGGCAATAATGATACAATGTATCATGCCTCT 100

gi 31 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 80
||||||||||||||||||||||||||||||||||||||||||||||||||
gi 101 TTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAG 150

gi 81 CAA---------------------------ATAAATTGTAACTGATGTAA 103
||| ||||||||||||||||||||
gi 151 CAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAA 200

gi 104 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 153


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 201 GAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTT 250

gi 154 TATTTTA---------------------------------------TGGT 164


||||||| ||||
gi 251 TATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGCTTGGT 300

gi 165 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 214


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 301 TGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATC 350

gi 215 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 264


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 351 ATGTTCATACCTCTTATCTTCCTCCCACGGCTCCTGGGCAACGTGCTGGT 400

gi 265 CTGTGTGC--------------------------------CCAGTGCAGG 282


|||||||| ||||||||||
gi 401 CTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGG 450

gi 283 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 332


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 451 CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAG 500

gi 333 TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 382


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 501 TATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTT 550

gi 383 GTTCCCTAAGTCCAACTACTAAAC-------------------------- 406


||||||||||||||||||||||||
gi 551 GTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAATCATGT 600
gi 407 -----------------------TGGGGGATATTATGAAGGGCCTTGAGC 433
|||||||||||||||||||||||||||
gi 601 TCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGC 650

gi 434 ATCTGGATTCTGCCTAATAAAA---------------------------- 455


||||||||||||||||||||||
gi 651 ATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGCATCTGCATAT 700

gi 456 -----------------------------------------TATTTCTGA 464


|||||||||
gi 701 AAATATTTCTGCATATAAATTGTAACATGATGTATTTAAATTATTTCTGA 750

gi 465 ATA-------------------------------TTTTACTAAAAAGGGA 483


||| ||||||||||||||||
gi 751 ATAAGAAATCTTACCACGTTTCTCCGTACTATGTTTTTACTAAAAAGGGA 800

gi 484 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 533


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 801 ATGTGGGAGGTCAGTGCATTTAAAACATAAAGAAATGAAGAGCTAGTTCA 850

gi 534 AACC 537


||||
gi 851 AACCACTTACATCAGTTACAATTTATATGCAGAAATATTTATATGCAGAG 900

gi 538 537

gi 901 ATATTGCTTTAGGTCGGAATAGGGTTGGTATTTTATTTTCGTCTTACCAT 950

gi 538 537

gi 951 CGACCTAACATCGACGATAATAGCAGCTACAATCCAGCTACCATTCTGCT 1000

gi 538 537

gi 1001 TTTATTTTATGGTTGGGATAAGGCTGGATT 1030

#---------------------------------------
#---------------------------------------
water results

########################################
# Program: water
# Rundate: Wed Jan 22 20:11:48 2003
# Report_file: outfile.align
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: gi
# 2: gi
# Matrix: EDNAFULL
# Gap_penalty: 12.0
# Extend_penalty: 4.0
#
# Length: 660
# Identity: 484/660 (73.3%)
# Similarity: 484/660 (73.3%)
# Gaps: 152/660 (23.0%)
# Score: 1660.0
#
#
#=======================================

gi 1 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 50
||||||||||||||||||||||||||||||||||||||||||||||||||
gi 71 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATA 120

gi 51 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAA----------------- 83
|||||||||||||||||||||||||||||||||
gi 121 ACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAAT 170

gi 84 ----------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 123
||||||||||||||||||||||||||||||||||||||||
gi 171 ATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA 220

gi 124 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA------------- 160


|||||||||||||||||||||||||||||||||||||
gi 221 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAG 270

gi 161 --------------------------TGGTTGGGATAAGGCTGGATTATT 184


||||||||||||||||||||||||
gi 271 AAATATTTATATGCAGAGATATTGCTTGGTTGGGATAAGGCTGGATTATT 320

gi 185 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 234


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 321 CTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT 370

gi 235 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC------------ 272


||||||||||||||||||||||||||||||||||||||
gi 371 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACT 420

gi 273 --------------------CCAGTGCAGGCTGCCTATCAGAAAGTGGTG 302


||||||||||||||||||||||||||||||
gi 421 TTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTG 470

gi 303 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 352


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 471 GCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT 520

gi 353 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 402


||||||||||||||||||||||||||||||||||||||||||||||||||
gi 521 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACT 570

gi 403 AAAC---------------------------------------------- 406


||||
gi 571 AAACAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTTCCTC 620

gi 407 ---TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 453


|||||||||||||||||||||||||||||||||||||||||||||||
gi 621 CCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA 670

gi 454 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATT 503


||.|..| | ||||||..|...|...|.||.| ..|..|...|||||.
gi 671 AAAACAT-T---TATTTTCATTGCATCTGCATAT-AAATATTTCTGCATA 715

gi 504 TAAAACATAA 513


||||...|||
gi 716 TAAATTGTAA 725

#---------------------------------------
#---------------------------------------
Blast 2 sequences

Score = 258 bits (134), Expect = 7e-66


Identities = 134/134 (100%)
Strand = Plus / Plus

Query: 273 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 332


||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 441 ccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaag 500

Query: 333 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 392


||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 501 tatcactaagctcgctttcttgctgtccaatttctattaaaggttcctttgttccctaag 560

Query: 393 tccaactactaaac 406


||||||||||||||
Sbjct: 561 tccaactactaaac 574
Score = 216 bits (112), Expect = 4e-53
Identities = 112/112 (100%)
Strand = Plus / Plus

Query: 161 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 220


||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 297 tggttgggataaggctggattattctgagtccaagctaggcccttttgctaatcatgttc 356

Query: 221 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 272


||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 357 atacctcttatcttcctcccacggctcctgggcaacgtgctggtctgtgtgc 408
….
LALIGN
/seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @
resetting to DNA matrix
resetting to DNA matrix
LALIGN finds the best local alignments between two sequences
version 2.1u03 April 2000
Please cite:
X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381

resetting to DNA matrix


alignments < E( 0.05):score: 75 (50 max)
Comparison of:
(A) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 537 nt
(B) @ gi|22758817|gb|AY128651.1| Homo sapiens beta-globi - 1058 nt
using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

73.3% identity in 660 nt overlap (1-513:99-753); score: 1660 E(10000): 3.5e-130

10 20 30 40 50 60 70
gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|227 GGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAGAATAACAGTGATAATTTCTGGGTTAAGGC
100 110 120 130 140 150 160 170

80 90 100 110 120


gi|227 AATAGCAA---------------------------ATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA
:::::::: ::::::::::::::::::::::::::::::::::::::::
gi|227 AATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATTGTAACTGATGTAAGAGGTTTCATATTGCTAATA
180 190 200 210 220 230 240

130 140 150 160


gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTA--------------------------------------
:::::::::::::::::::::::::::::::::::::
gi|227 GCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTAAATTTATATGCAGAAATATTTATATGCAGAGATATTGC
250 260 270 280 290 300 310 320

170 180 190 200 210 220 230


gi|227 -TGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|227 TTGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTT
330 340 350 360 370 380 390

240 250 260 270


gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGC--------------------------------CCAGT
:::::::::::::::::::::::::::::::::::::: :::::
gi|227 CCTCCCACGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGT
400 410 420 430 440 450 460 470

280 290 300 310 320 330 340 350


gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|227 GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCT
480 490 500 510 520 530 540

360 370 380 390 400


gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAAC---------------------
::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|227 TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACAAGCTAGGCCCTTTTGCTAAT
550 560 570 580 590 600 610 620

410 420 430 440 450


gi|227 ----------------------------TGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA
:::::::::::::::::::::::::::::::::::::::::::::::
gi|227 CATGTTCATACCTCTTATCTTCCTCCCATGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAA
630 640 650 660 670 680 690

460 470 480 490 500 510


gi|227 AATATTTCTGAATATTTTACTAAAAAGGGAATGTGGGAGGTCAGTGCATTTAAAACATAA
:: : : : :::::: : : : :: : : : ::::: :::: :::
gi|227 AAAACAT-T---TATTTTCATTGCATCTGCATATAA-ATATTTCTGCATATAAATTGTAA
700 710 720 730 740 750
CECS 694-02
Introduction to Bioinformatics
Lecture 3: Multiple Sequence Alignment

Two Issues with the Programming Project


1. Amino Acid Sequence Alignment
2. Calculating alignment score using affine gap penalties

Amino Acid Sequence alignment


With amino acid sequence alignment, there is no longer a simple match/mismatch score
as there is with DNA sequence alignment, since some amino acids can substitute for one
another while still maintaining the functionality of a protein. Therefore, when aligning
two amino acid sequences, it is necessary to use a lookup table to find the score for
pairing two amino acids. This lookup table is the scoring matrix described in the
previous class, such as a PAM or BLOSUM matrix. Given two residues, you look up their
entry in this matrix to determine their match score. You can use the symmetric PAM250
matrix on page 82 for amino acid sequence alignments for this project.

Pam250 Matrix, P 82 (Mount)


Calculating alignments using affine gap penalties

(Don’t worry about this for this project; it will be part of the second programming
assignment.)

In order to calculate an alignment using affine gap penalties, it is necessary to consider
the possibility of either extending an existing gap or opening a new gap. In order to
calculate the maximum alignment score matrix, V, it is necessary to consider three
separate matrices: a match matrix (M), an insertion matrix (I), and a deletion matrix (D).
The scores for each of these matrices are calculated as follows, where s(x_i, y_j) is the
substitution score for residues x_i and y_j, g is the gap open penalty, and r is the gap
extend penalty:

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + s(x_i, y_j) \\
I_{i-1,j-1} + s(x_i, y_j) \\
D_{i-1,j-1} + s(x_i, y_j)
\end{cases}
$$

$$
I_{i,j} = \max
\begin{cases}
M_{i-1,j} - g & \text{(opening a new gap)} \\
I_{i-1,j} - r & \text{(extending an existing gap)}
\end{cases}
$$

$$
D_{i,j} = \max
\begin{cases}
M_{i,j-1} - g & \text{(opening a new gap)} \\
D_{i,j-1} - r & \text{(extending an existing gap)}
\end{cases}
$$

$$
V_{i,j} = \max \{ M_{i,j},\; I_{i,j},\; D_{i,j} \}
$$
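
The recurrences above can be turned into code fairly directly. The following is a minimal
sketch in Python of a global alignment score, not a reference implementation. The function
names (affine_gap_score, simple_score) and the boundary conditions are assumptions made
for illustration; the notes do not specify how the first row and column are initialized.
Here a leading gap of length k is charged g + (k - 1) * r, which matches the open/extend
convention of the recurrences.

```python
NEG_INF = float("-inf")


def simple_score(a, b):
    """Toy substitution score (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def affine_gap_score(x, y, score, g, r):
    """Best global alignment score of x and y using the M, I, D recurrences.

    g is the gap open penalty and r is the gap extend penalty (both positive).
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    # Boundary conditions (assumed): an empty alignment scores 0, and a
    # leading gap of length k costs g + (k - 1) * r.
    M[0][0] = 0
    for i in range(1, n + 1):
        I[i][0] = -g - (i - 1) * r
    for j in range(1, m + 1):
        D[0][j] = -g - (j - 1) * r

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            # Match state: extend any of the three states diagonally.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            # Gap states: open a new gap from M, or extend the existing gap.
            I[i][j] = max(M[i - 1][j] - g, I[i - 1][j] - r)
            D[i][j] = max(M[i][j - 1] - g, D[i][j - 1] - r)

    # V at the final cell is the maximum over the three matrices.
    return max(M[n][m], I[n][m], D[n][m])
```

For example, with g = 3 and r = 1, aligning ACGT against AGT gives three matches and one
opened gap, so the score is 3 - 3 = 0.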

Multiple Sequence Alignment


Description
Genes that perform similar or identical functions can be conserved across species.
Many genes are represented in highly conserved forms across organisms.

Unique human and mouse genes

By performing a simultaneous alignment of multiple sequences having similar or
identical functions, we can gain information about which regions have been subject to
mutations over evolutionary time and which are evolutionarily conserved. Such
knowledge tells which regions or domains of a gene are critical to its functionality.

Sometimes genes that are similar in sequence can be mutated or rearranged to perform an
altered function. By looking at multiple alignments of such sequences, we can tell which
changes in the sequence have caused a change in the functionality.
Multiple sequence alignment yields information concerning the structure and function of
proteins, and can help lead to the discovery of important sequence domains or motifs
with biological significance while at the same time uncovering evolutionary relationships
among genes.

In multiple sequence alignment, the idea is to take three or more sequences, and align
them so that the greatest number of similar characters are aligned in the same column of
the alignment.

The difficulty with multiple sequence alignment is that there are now many
different combinations of matches, insertions, and deletions that must be considered
when looking at several different sequences. Methods that guarantee the highest scoring
alignment are not computationally feasible for more than a few sequences. Therefore,
approximation methods are put to use in multiple sequence alignment.

Example multiple alignment of 8 immunoglobulin sequences.

There are four approaches to multiple sequence alignment we will consider: Dynamic
Programming Approach, Progressive alignment, Iterative alignment, and statistical
modeling.

Extension of Dynamic Programming Approach


The attractiveness of dynamic programming with two sequences is that it guarantees to
give the optimal alignment of sequences given a specific scoring scheme. In addition, it
is a relatively easy method to implement.
Dynamic programming approaches can be extended to multiple alignment as well.
Consider the example where we have three amino acid sequences VSNS, SNA, and AS to
align. Instead of filling a two dimensional matrix as we did with two sequences, we now
fill a three dimensional space.

Figure source:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node2.html#SECTION00020000000000000000

Suppose the length of each sequence is n residues. If there are two such sequences, then
the number of comparisons needed to fill in the scoring matrix is $n^2$, since it is a
two-dimensional matrix. The number of comparisons needed to fill in the scoring cube when
three sequences are aligned is $n^3$, and when four sequences are aligned, the number of
comparisons needed is $n^4$. Thus, as the number of sequences increases, the number of
comparisons needed increases exponentially, i.e. $n^N$, where n is the length of the
sequences and N is the number of sequences. Thus, without any changes to the dynamic
programming approach, this quickly becomes impractical for even a small number of short
sequences.
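
To get a feel for how fast this grows, the count can be tabulated for a fixed sequence
length. This is only an illustrative sketch; the function name lattice_cells is made up
for this example, and it counts cells in the scoring lattice rather than exact operations.

```python
def lattice_cells(n, num_sequences):
    """Number of cells in the scoring lattice for num_sequences sequences of length n."""
    return n ** num_sequences


if __name__ == "__main__":
    # Sequences of only 100 residues already become expensive beyond a few sequences.
    for num_sequences in range(2, 7):
        print(num_sequences, lattice_cells(100, num_sequences))
```

With n = 100, two sequences need 10,000 cells, three need 1,000,000, and four need
100,000,000.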

Carrillo and Lipman – Sum of Pairs (1988)

MSA – Lipman, et al. 1989

Gupta et al. 1995 – Substantial reduction in memory and number of required steps

Idea for reduction of memory and computations:


Multiple sequence alignment imposes an alignment on each of the pairs of sequences.
Alignments found for each of the pairs of sequences can impose bounds on the location
of the MSA within the cube (three sequences) or N-dimensional space (N sequences).

Step 1: Find the alignment for each pair of sequences.
Step 2: A trial msa is produced by first predicting a phylogenetic tree for the sequences.
Step 3: Sequences are multiply aligned in the order of their relationship on the tree.

While this is a heuristic alignment (and is therefore not guaranteed to be optimal), it does
provide a limit to the search space within which optimal alignments are likely to be
found.

Figures 4.2 and 4.3 (Mount) describe how the two dimensional search spaces can be
projected into a three dimensional volume that can be searched.

MSA calculates the multiple alignment score within the lattice by adding the scores of the
corresponding pairwise alignments in the multiple sequence alignment. This measure is
known as the sum of pairs (SP) measure. The optimal alignment is based on the best SP
score.

Scoring Multiple Sequence Alignments using sum of pairs method

The sum of pairs method scores all possible combinations of pairs of residues in a
column of a multiple sequence alignment. For instance, consider the alignment

ECSQ (1)
SNSG (2)
SWKN (3)
SCSN (4)

Since there are four sequences, there are six different pairs of sequences to consider for
each column. The pairs, identified by sequence number, are as follows:

1-2
1-3
1-4
2-3
2-4
3-4

Pair   Residues  Score   Residues  Score   Residues  Score   Residues  Score
1-2    E-S          0    C-N         -4    S-S          2    Q-G         -1
1-3    E-S          0    C-W         -8    S-K          0    Q-N          1
1-4    E-S          0    C-C         12    S-S          2    Q-N          1
2-3    S-S          2    N-W         -4    S-K          0    G-N          0
2-4    S-S          2    N-C         -4    S-S          2    G-N          0
3-4    S-S          2    W-C         -8    K-S          0    N-N          2
Total               6               -16                 6                 3
Using PAM250 matrix, p250
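
The column totals in the table can be checked with a short script. This is a sketch in
Python; the dictionary holds only the PAM250 entries that appear in the table above (it is
not the full matrix), and the names PAM250_SUBSET, pair_score, and sum_of_pairs are chosen
here for illustration.

```python
from itertools import combinations

# Only the PAM250 entries used in the worked example; keys are sorted residue pairs.
PAM250_SUBSET = {
    ("E", "S"): 0, ("C", "N"): -4, ("C", "W"): -8, ("C", "C"): 12,
    ("S", "S"): 2, ("N", "W"): -4, ("K", "S"): 0, ("G", "Q"): -1,
    ("N", "Q"): 1, ("G", "N"): 0, ("N", "N"): 2,
}


def pair_score(a, b):
    """Look up a symmetric substitution score regardless of residue order."""
    return PAM250_SUBSET[tuple(sorted((a, b)))]


def sum_of_pairs(alignment):
    """Return the SP score of each column of a gap-free multiple alignment."""
    column_scores = []
    for column in zip(*alignment):
        # Score every unordered pair of residues in the column.
        column_scores.append(sum(pair_score(a, b) for a, b in combinations(column, 2)))
    return column_scores


if __name__ == "__main__":
    scores = sum_of_pairs(["ECSQ", "SNSG", "SWKN", "SCSN"])
    print(scores, sum(scores))
```

The per-column scores are 6, -16, 6, and 3, for a total SP score of -1.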

Problem with this approach: more closely related sequences will have a higher weight

The MSA program gets around this by calculating weights to associate to each sequence
alignment pair. The weights are assigned based on the predicted tree of the aligned
sequences.

In summary, the steps of MSA are as follows:

1. Calculate all pairwise alignment scores.
2. Use the scores to predict a tree.
3. Calculate pair weights based on the tree.
4. Produce a heuristic msa based on the tree.
5. Calculate the maximum weight for each sequence pair.
6. Determine the spatial positions that must be calculated to obtain the optimal
alignment.
7. Perform the optimal alignment.
8. Report the weight found compared to the maximum weight previously found.

MSA calculates an ε value for each pair of sequences, which acts as the weight for that
sequence pair. Sequences that are more divergent will have a higher ε value.

Carrillo, H. and Lipman, D. (1988).
The multiple sequence alignment problem in biology.
SIAM J. Appl. Math., 48(5):1073-1082.

Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (June 1989).
A tool for multiple sequence alignment.
Proc. Natl. Acad. Sci. USA, 86:4412-4415.

Description of Carrillo-Lipman technique:


http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/node3.html

http://www.psc.edu/biomed/genedoc/whygd.htm

Progressive alignment methods


The approach of progressive alignment is to begin with an alignment of the most similar
sequences, and then build upon that alignment using the other sequences. Progressive
alignments work by first aligning the most similar sequences using dynamic programming,
and then progressively adding less related sequences to the initial alignment.

CLUSTALW and CLUSTALX are progressive alignment programs that perform the
following steps:

1) Perform pairwise alignments of all of the sequences
2) Use the alignment scores to produce a phylogenetic tree using neighbor-joining
methods
3) Align the sequences sequentially, guided by the phylogenetic relationships
indicated by the tree

The initial pairwise alignments are calculated using an enhanced dynamic programming
algorithm, and the genetic distances used to create the phylogenetic tree are calculated by
dividing the total number of mismatched positions by the total number of matched
positions.
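
The distance described above is simple to compute from a pairwise alignment. The sketch
below is one reading of that description, not CLUSTALW's actual code: the function name
genetic_distance is made up, "-" is assumed to be the gap character, and gapped positions
are assumed to be skipped entirely.

```python
def genetic_distance(aligned_a, aligned_b, gap="-"):
    """Mismatched positions divided by matched positions, ignoring gapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: positions opposite a gap are not counted
        if a == b:
            matches += 1
        else:
            mismatches += 1
    if matches == 0:
        return float("inf")  # no matched positions: treat as maximally distant
    return mismatches / matches
```

For the alignment ACGT-A over ACCTTA there is one mismatch and four matches (the gapped
column is skipped), giving a distance of 0.25.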

Sequences are assigned a weight based on their distance from the root node of the tree.

In a progressive alignment, gaps are added to a profile of an existing multiple sequence
alignment. Statistical tests have been prepared in order to accumulate gaps between
secondary structure elements, which models what is found in nature. Such information is
incorporated into scoring an alignment using CLUSTALW.

PILEUP is the multiple sequence alignment program that is part of the Genetics
Computer Group (GCG) package developed at the University of Wisconsin. In PILEUP,
multiple sequence alignment is performed by first aligning each of the sequences in a
pair-wise fashion using a Needleman-Wunsch approach. The resulting scores are used to
produce a tree by the unweighted pair-group method using arithmetic averages (UPGMA).
The resulting tree is then used to guide the alignment of the most closely related
sequences and groups of sequences.

Problems with progressive alignments


The difficulty with progressive alignments is that they depend upon the initial pair-wise
sequence alignments. If the sequences are closely related, then the likelihood is good that
the initial alignment contains relatively few errors. However, if the initial sequences are
distantly related, then there will be more errors in the alignment, which will propagate
through the rest of the alignments.

The second issue is that suitable scoring matrices and gap penalties must be chosen to
apply to the sequences as a set.

Iterative alignment methods


Iterative alignment methods begin by making an initial alignment of the sequences.
These alignments are then revised to give a more reasonable result. The objective of this
approach is to improve the overall alignment score.

MultAlin
PRRP
DIALIGN

Genetic Algorithms

The goal of genetic algorithms used in sequence alignment is to generate as many
different multiple sequence alignments as possible by rearrangements that simulate gaps
and genetic recombination events. SAGA (Sequence Alignment by Genetic Algorithm) is one
such approach that yields very promising results, but becomes slow when more than 20
sequences are used.

Steps of SAGA (Genetic Algorithm)

1) Up to 20 different sequences are written in a row, allowing for overlaps of a
random length. The ends of these sequences are then padded with gaps.
Typically, upwards of 100 initial alignments are made.

2) The initial alignments are scored by the sum of pairs method. Standard amino
acid scoring matrices and gap open, gap extension penalties are used.
3) The initial alignments are replaced to give another generation of multiple sequence
alignments. One half of the multiple sequence alignments is chosen to proceed
to the next generation unchanged (natural selection). This half is chosen by
assigning each alignment a probability inversely proportional to its
SP score (the best alignments, since the SP scores are weighted according to their
distance from the parent). The other half of the alignments are sent to the next
generation, but are first subject to mutation.

4) In the mutation process, gaps are inserted into the sequences subject to mutation
and rearranged in an attempt to create a better scoring alignment. In this step, the
sequences are split into two sets based on an estimated phylogenetic tree, and
gaps of random lengths are inserted into random positions in the alignment.

5) Recombination of two parent alignments is accomplished ???

6) The next generation is evaluated by going back to step 2, and steps 2-5 are repeated a
number of times (100-1000). The best scoring multiple sequence alignment is then
obtained (note that it may not be the optimal scoring alignment).

7) The entire process is repeated several times, starting from a different initial
alignment each time. The best scoring multiple sequence alignment is then
chosen and reported to the user.

Simulated Annealing

Another approach to sequence alignment that works in a manner similar to genetic
algorithms is simulated annealing. In these approaches, you begin with a heuristically
determined multiple sequence alignment that is then changed using probabilistic models
that identify changes in the alignment that increase the alignment score. The drawback
of simulated annealing approaches is that you can get stuck finding only a locally
optimal alignment rather than the alignment that is globally optimal.
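
The notes do not say which acceptance rule is used. A common choice in simulated
annealing generally is the Metropolis criterion, sketched below in Python as an
assumption rather than as the method of any particular alignment program. Changes that
improve the score are always kept, while changes that lower it are kept with a probability
that shrinks as the temperature is lowered, which is how the method tries to avoid getting
stuck in a local optimum. The function name accept_change and its parameters are
hypothetical.

```python
import math
import random


def accept_change(delta_score, temperature, rng=random):
    """Metropolis-style acceptance test for a proposed change to an alignment.

    delta_score is (new score - old score); temperature must be positive.
    """
    if delta_score >= 0:
        return True  # improvements (and ties) are always accepted
    # Worsening changes are accepted with probability exp(delta / T).
    return rng.random() < math.exp(delta_score / temperature)
```

A typical run would start with a high temperature, so that many score-lowering changes are
accepted, and reduce it gradually so that the search settles on a high scoring alignment.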

Other methods for multiple sequence alignment


Group approach

In the group approach, sequences are clustered into related groups. A consensus
sequence is produced to make alignments between the groups. Examples of programs
implementing the group approach are PIMA and MULTAL.

PIMA
MULTAL
Tree approach

The tree method uses the distance method of phylogenetic analysis to arrange the
sequences. The two closest sequences are then aligned, and the consensus of these two is
aligned with the next best sequence (or group of sequences) until an alignment is
produced that includes all of the sequences. This approach is a popular approach used by
PILEUP, CLUSTALW and ALIGN.

TREEALIGN is a program that uses the tree approach, but rearranges the tree as
sequences are aligned in order to produce a maximum parsimony tree.

Localized Alignments in sequences


Just like with pairwise alignments, we may not be interested in the global alignment of
multiple sequences, but rather only specific regions that are conserved. For instance,
given regions of genomic DNA occurring upstream or before a certain gene, there might
be sequences where transcription factors bind to the DNA so that the gene can be
transcribed. Thus, if we are interested in determining if there is any signal in the regions
upstream of a certain family of genes across several different organisms, it would be
important to only find the conserved region, and not try to align all of the genomic DNA.

Localized alignments of protein sequences can yield information about conserved
domains found in otherwise unrelated proteins.

Programs to detect localized alignments typically use one of the following three
approaches: profile analysis, block analysis, and pattern-searching or statistical methods.

Profile analysis
Profiles are found by first multiply aligning the sequences, determining which regions are
the most highly conserved, and then creating a scoring matrix for the alignment of the
highly conserved region. The profile is composed of columns, and may include matches,
mismatches, insertions, and deletions found in a particular column.

Once a profile is created, it can be used to search a target sequence or database for
possible matches to the profile using the profile's scores to evaluate the likelihood at each
position.

The drawback of profiles is that the profile is only as representative as the variation in the
sequences used to construct it. Thus, there is a bias in the profile towards the training
data.
For each position in a profile, there is a column for each amino acid, plus a column for an
unknown amino acid (z), and a column for gap opening and gap extension. There is a
row for each position in the multiple alignment.

Block Analysis

Expectation-Maximization

Gibbs Sampling

Hidden Markov Models

Position Specific Scoring Matrix

Sequence Logos

Profile: Scores for substitutions and gaps in each column


Blocks: ungapped aligned regions

Alignments based on locally conserved patterns found in the same order in the sequences
(synteny)

Use of statistical methods and probabilistic models of the sequences

Multiple sequence alignments yield information into the evolutionary history of the
sequences – sequences that are most similar are likely to be recently derived from a
common ancestor sequence

If the sequences in a multiple alignment have quite a bit of variation then it is difficult to
create a multiple sequence alignment due to the different combinations of substitutions,
insertions, and deletions that can be used

Local Alignment of proteins

ECSQ
SNSG
SWKN
SCSN

Profiles and Position-Specific scoring matrices

Motif-Based Approaches

Gibbs Sampling Algorithm


Describe this – hand out papers

MEME
Meta-MEME

Hidden Markov Models

Scoring Multiple Alignments

Programs for multiple sequence alignment

Progressive Alignment programs:


CLUSTALW, CLUSTALX
MSA
PRALINE

Iterative Alignment Programs:


DIALIGN
MULTALIGN

PRRP
SAGA

Local Alignment of proteins


Asset
BLOCKS
EMOTIF
Gibbs Sampler
HMMER
MACAW
MEME
SAM

MSA
ClustalX
ClustalW
Viewing Multiple Alignments
sequence logos
Once alignments are made, they can be compared using Hidden Markov Models

IBM’s MUSCA
http://cbcsrv.watson.ibm.com/Tmsa.html

ClustalW
http://www.ebi.ac.uk/clustalw/
http://clustalw.genome.ad.jp/

DIALIGN
http://bibiserv.techfak.uni-bielefeld.de/cgi-bin/dialign_submit

Web Logo
http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi

DNA Sequence Data sets


Fasta File Format
GenBank File Format
ASN.1 File Format
XML File Format

Introduction to Bioinformatics
Lecture 4: Multiple Sequence Alignment

Localized Alignments in sequences


Just like with pairwise alignments, we may not be interested in the global alignment of
multiple sequences, but rather only specific regions that are conserved. For instance,
given regions of genomic DNA occurring upstream or before a certain gene, there might
be sequences where transcription factors bind to the DNA so that the gene can be
transcribed. Thus, if we are interested in determining if there is any signal in the regions
upstream of a certain family of genes across several different organisms, it would be
important to only find the conserved region, and not try to align all of the genomic DNA.

Localized alignments of protein sequences can yield information about conserved
domains found in otherwise unrelated proteins.

Programs to detect localized alignments typically use one of the following three
approaches: profile analysis, block analysis, and pattern-searching or statistical methods.

Profile analysis
Profiles are found by first multiply aligning the sequences, determining which regions are
the most highly conserved, and then creating a scoring matrix for the alignment of the
highly conserved region. The profile is composed of columns, and may include matches,
mismatches, insertions, and deletions found in a particular column.

Once a profile is created, it can be used to search a target sequence or database for
possible matches to the profile using the profile's scores to evaluate the likelihood at each
position.

The drawback of profiles is that the profile is only as representative as the variation in the
sequences used to construct it. Thus, there is a bias in the profile towards the training
data.

For each position in a profile, there is a column for each amino acid, plus a column for an
unknown amino acid (z), and a column for gap opening and gap extension. There is a
row for each position in the multiple alignment.

Calculating profiles

How are the values in a profile created? The value of an individual cell is a log odds
score: the logarithm of the probability of finding a particular residue at a particular
location in the alignment, divided by the probability of aligning the two amino acids by
random chance under a particular scoring scheme (such as PAM250, BLOSUM80, …).
Additional penalties must be calculated for gap opening and gap extension in the profile
as well. If a gap exists at a position in the multiple alignment, then the gap penalties
at that position are reduced.

One method (average method) weighs the proportion of the amino acids found in a
particular column, and weights the score of matching the consensus residue at a given
position to that particular residue.

Shannon Entropy
One method to calculate the observed column variation given the expected variation in
the evolutionary model is to use an information measure known as entropy. The smaller
the entropy, the more conserved a column is. Entropy for a single column is calculated
by the following formula:

$$H = -\sum_{a \,\in\, \text{residues}} f_a \log(p_a)$$

where $f_a$ is the observed proportion of each residue a in the msa column and $p_a$ is the
expected frequency of the residue when derived from a given ancestor residue.
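
As a small illustration, the formula can be evaluated for a single column. This is a
sketch in Python under two assumptions that the notes leave open: the logarithm is taken
base 2, and the expected frequencies are simply passed in as a dictionary (in practice
they would come from an evolutionary model such as a PAM matrix). The function name
column_entropy is chosen for this example.

```python
import math
from collections import Counter


def column_entropy(column, expected):
    """Compute H = -sum_a f_a * log2(p_a) for one alignment column.

    column is a string of residues; expected maps each residue to p_a.
    """
    counts = Counter(column)
    total = len(column)
    # f_a is the observed proportion of residue a in the column.
    return -sum((count / total) * math.log2(expected[a]) for a, count in counts.items())
```

A column split evenly between two residues that each have an expected frequency of 0.5
gives H = 1, while a column whose single residue has an expected frequency of 1 gives
H = 0.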

With an amino acid msa, the entropy measure can be used with several different
evolutionary distances to determine which one minimizes entropy. (See page 164 of
Mount for the discussion).

Another way of creating a profile is to use log-odds scores. In this
method, the $\log_2$ of the ratio of observed to background frequencies is calculated for
each residue at each position. The result is the amount of information available in the
alignment, given in bits. A new sequence can then be searched to see if it possibly
contains the motif.
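
The log-odds calculation can be sketched in a few lines of Python. This is an
illustration only: the function names log_odds_matrix and score_window are made up, and
the pseudocount (added so that residues never seen in a column do not give log of zero) is
an assumption that the notes do not discuss.

```python
import math


def log_odds_matrix(block, background, pseudocount=1.0):
    """Build a list of {residue: log2(observed / background)} dicts, one per column.

    block is a list of equal-length, ungapped sequences; background maps each
    residue to its background frequency.
    """
    num_seqs = len(block)
    matrix = []
    for column in zip(*block):
        row = {}
        for residue, bg in background.items():
            # Observed frequency, smoothed toward the background (assumed scheme).
            observed = (column.count(residue) + pseudocount * bg) / (num_seqs + pseudocount)
            row[residue] = math.log2(observed / bg)
        matrix.append(row)
    return matrix


def score_window(matrix, window):
    """Score a candidate window of a new sequence against the matrix, in bits."""
    return sum(row[residue] for row, residue in zip(matrix, window))
```

Sliding score_window along a new sequence and looking for high scoring windows is one
simple way to ask whether that sequence might contain the motif.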

Block Analysis
Blocks are similar to profiles in the sense that they represent locally conserved regions
within a multiple sequence alignment. However, the difference is that blocks lack indels.
Blocks can be determined either by performing a multiple sequence alignment, or by
searching a database for similar sequences of the same length. Algorithms for searching
for a BLOCK were initially developed by Henikoff and Henikoff (1991).

Statistical approaches to finding the most similar sequences have been proposed, such as
the Expectation-Maximization algorithm and the Gibbs sampler. In any case, once a set
of blocks has been determined, the information contained within the block alignment can
be displayed as a sequence profile.

Extraction of blocks from a multiple sequence alignment

A global sequence alignment will usually contain ungapped regions that are aligned
between multiple sequences. These regions can be extracted to produce blocks. Two
programs that allow for the extraction of a block from a multiple sequence alignment are
BLOCKS and eMOTIF, both of which can read in a multiple sequence alignment in one
of many different file formats and perform the extraction of the blocks. The websites for
these programs are:

http://www.blocks.fhcrc.org/blocks/process_blocks.html
http://dna.stanford.edu/emotif/
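
The basic idea of pulling ungapped regions out of an alignment can be sketched as follows.
This is not how the BLOCKS or eMOTIF servers work internally; it is a minimal Python
illustration that only looks for runs of columns without gaps. The function name
ungapped_blocks, the min_width parameter, and the use of "-" as the gap character are
assumptions.

```python
def ungapped_blocks(alignment, min_width=3, gap="-"):
    """Return (start_column, rows) for each run of gap-free columns.

    alignment is a list of equal-length aligned sequences. Runs shorter
    than min_width columns are discarded.
    """
    length = len(alignment[0])
    blocks = []
    start = None
    # Iterate one column past the end so a run reaching the last column is closed.
    for col in range(length + 1):
        gap_free = col < length and all(row[col] != gap for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, [row[start:col] for row in alignment]))
            start = None
    return blocks
```

For the three aligned sequences ACGT--TTGA, ACGTAATTGA, and AC-TAATTGA, the only gap-free
run of at least three columns starts at column 6 (counting from 0) and contains TTGA in
every sequence.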

Below is a set of 10 truncated kinases:

>D28 CD28 S. CEREVISIAE CELL CYCLE CONTROL PROTEIN KINASE
ANYKRLEKVGEGTYGVVYKALDLRPGQGQR
VVALKKIRLESEDEGVPSTAIREISLLKEL
>SKH SKH HELA MYSTERY PUTATIVE PROTEIN KINASE
AKYDIKALIGRGSFSRVVRVEHRATRQPYA
IKMIETKYREGREVCESELRVLRRVRHANI
>APK CAPK BOVINE CARDIAC MUSCLE CYCLIC AMP-DEPENDENT (ALPHA)
DQFERIKTLGTGSFGRVMLVKHMETGNHYA
MKILDKQKVVKLKQIEHTLNEKRILQAVNF
>EE1 WEE1 S. POMBE MITOTIC INHIBITOR
TRFRNVTLLGSGEFSEVFQVEDPVEKTLKY
AVKKLKVKFSGPKERNRLLQEVSIQRALKG
>GFR EGFR HUMAN EPIDERMAL GROWTH FACTOR RECEPTOR
TEFKKIKVLGSGAFGTVYKGLWIPEGEKVK
IPVAIKELREATSPKANKEILDEAYVMASV
>DGM PDGF RECEPTOR, MOUSE KINASE REGION
DQLVLGRTLGSGAFGQVVEATAHGLSHSQA
TMKVAVKMLKSTARSSEKQALMSELYGDLV
>FES THIS IS VFES TYROSINE KINASE
VLNRAVPKDKWVLNHEDLVLGEQIGRGNFG
EVFSGRLRADNTLVAVKSCRETLPPDIKAK
>AF1 RAF1 HUMAN C-RAF-1 ONCOGENE
SEVMLSTRIGSGSFGTVYKGKWHGDVAVKI
LKVVDPTPEQFQAFRNEVAVLRKTRHVNIL
>MOS CMOS HUMAN C-MOS ONCOGENE
EQVCLLQRLGAGGFGSVYKATYRGVPVAIK
QVNKCTKNRLASRRSFWAELNVARLRHDNI
>SVK HSVK HERPES SIMPLEX VIRUS PUTATIVE PROTEIN KINASE
MGFTIHGALTPGSEGCVFDSSHPDYPQRVI
VKAGWYTSTSHEARLLRRLDHPAILPLLDL

(Multiple Alignment created using ClustalW; Colors Added using BoxShade)

AF1  1 -SEVMLSTRIGSGSFGTVYKGKWHGDVAVKILKVVDPTPEQFQAFRNEVAVLRKT--RHVNIL
MOS  1 -EQVCLLQRLGAGGFGSVYKATYRG-VPVAIKQVNKCTKNRLASRRSFWAELNVARLRHDNI-
DGM  1 -DQLVLGRTLGSGAFGQVVEATAHG-LSHSQATMKVAVKMLKSTARSSEKQALMSELYGDLV-
GFR  1 -TEFKKIKVLGSGAFGTVYKGLWIP-EGEKVKIPVAIKELREATSPKANKEILDEAYVMASV-
D28  1 -ANYKRLEKVGEGTYGVVYKALDLR--PGQGQRVVALKKIRLESEDEGVPSTAIREISLLKEL
SKH  1 -AKYDIKALIGRGSFSRVVRVEHRA-TRQPYAIKMIETKYREGREVCESELRVLRRVRHANI-
APK  1 -DQFERIKTLGTGSFGRVMLVKHME-TGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNF-
EE1  1 -TRFRNVTLLGSGEFSEVFQVEDPVEKTLKYAVKKLKVKFSGPKERNRLLQEVSIQRALKG--
FES  1 VLNRAVPKDKWVLNHEDLVLGEQIG-RGNFGEVFSGRLRADNTLVAVKSCRETLPPDIKAK--
SVK  1 -MGFTIHGALTPGSEGCVFDSSHPD-YPQRVIVKAGWYTSTSHEARLLRRLDHPAILPLLDL-
cons 1 qf ll lgsgsfg vykg g k i v k r v l i

Taking this alignment, we can generate blocks using the BLOCKS server:

ID x6676xbli; BLOCK
AC x6676xbliA; distance from previous blocks=(1,1)
DE ../tmp/6676.blin
BL UNK motif; width=24; seqs=10; 99.5%=0; strength=0
AF1 ( 1) SEVMLSTRIGSGSFGTVYKGKWHG 41
MOS ( 1) EQVCLLQRLGAGGFGSVYKATYRG 48
DGM ( 1) DQLVLGRTLGSGAFGQVVEATAHG 49
GFR ( 1) TEFKKIKVLGSGAFGTVYKGLWIP 41
D28 ( 1) ANYKRLEKVGEGTYGVVYKALDLR 61
SKH ( 1) AKYDIKALIGRGSFSRVVRVEHRA 54
APK ( 1) DQFERIKTLGTGSFGRVMLVKHME 46
EE1 ( 1) TRFRNVTLLGSGEFSEVFQVEDPV 55
FES ( 1) LNRAVPKDKWVLNHEDLVLGEQIG 100
SVK ( 1) MGFTIHGALTPGSEGCVFDSSHPD 73
//
ID x6676xbli; BLOCK
AC x6676xbliB; distance from previous blocks=(2,2)
DE ../tmp/6676.blin
BL UNK motif; width=28; seqs=10; 99.5%=0; strength=0
AF1 ( 27) AVKILKVVDPTPEQFQAFRNEVAVLRKT 87
MOS ( 27) PVAIKQVNKCTKNRLASRRSFWAELNVA 75
DGM ( 27) SHSQATMKVAVKMLKSTARSSEKQALMS 92
GFR ( 27) GEKVKIPVAIKELREATSPKANKEILDE 83
D28 ( 27) PGQGQRVVALKKIRLESEDEGVPSTAIR 83
SKH ( 27) RQPYAIKMIETKYREGREVCESELRVLR 74
APK ( 27) GNHYAMKILDKQKVVKLKQIEHTLNEKR 85
EE1 ( 27) TLKYAVKKLKVKFSGPKERNRLLQEVSI 77
FES ( 27) GNFGEVFSGRLRADNTLVAVKSCRETLP 100
SVK ( 27) PQRVIVKAGWYTSTSHEARLLRRLDHPA 92
//

Expectation-Maximization
In the expectation-maximization algorithms, the starting point is a set of sequences
expected to have a common sequence pattern that may not be easily detectable. An initial
guess is made as to the location and size of the site of interest in each of the sequences.
These initial sites are then aligned.

Expectation Step In the expectation step, background residue frequencies are calculated
from those residues that are not in the initially aligned sites. Column-specific residue
frequencies are calculated for each position in the initial motif alignment. Using this information, the
probability of finding the site at any position in the sequences can then be calculated.

Maximization Step In the maximization step, the site locations found in the expectation
step are used to recount the residues at each position of the motif (and in the
background), giving an updated motif pattern. Once the new frequencies have been calculated,
the expectation step is repeated. This cycle continues until the solution converges.

Example of EM: begin with an initial, random alignment:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

From this alignment, the frequency of each base occurring is calculated. In this case, the
motif we are searching for is six bases wide. Therefore, we need to calculate seven
different sets of frequencies: One for the background, and one for each of the columns in
the motif. Calculating the total counts, we get:

After calculating the observed counts for each of the positions, we can convert these to
observed frequencies:

In the expectation step, the residue frequencies for the motif are used to estimate the
composition of the motif site. The expectation step attempts to maximally discriminate
between sequence within and not within the site. For each sequence, each possible motif
location is considered in order to find the most probable location given the current motif.

Consider the first sequence:


TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

There are a total of 41 residues and the motif is six bases wide, so there are 41 - 6 + 1 = 36 potential sites to consider:

SITE 1 2 3 4 5 6 1*2*3*4*5*6 RANDOM ODDS


TCAGAA .241 .230 .256 .226 .289 .263 0.000244 0.000274 0.89
CAGAAC .263 .296 .246 .256 .289 .256 0.000363 0.000362 1.00
AGAACC .256 .233 .256 .256 .256 .256 0.000256 0.000362 0.71
GAACCA .240 .296 .256 .256 .256 .263 0.000313 0.000362 0.87
AACCAG .256 .296 .243 .256 .289 .233 0.000317 0.000362 0.88
ACCAGT .256 .230 .243 .256 .213 .248 0.000193 0.000274 0.71
CCAGTT .263 .230 .256 .226 .241 .248 0.000209 0.000257 0.81
CAGTTA .263 .296 .246 .261 .241 .263 0.000317 0.000257 1.23
AGTTAT .256 .233 .254 .261 .289 .248 0.000283 0.000241 1.18
GTTATA .240 .241 .254 .256 .241 .263 0.000238 0.000241 0.99
TTATAA .241 .241 .256 .261 .289 .263 0.000295 0.000297 0.99
TATAAA .241 .296 .254 .256 .289 .263 0.000353 0.000297 1.19
ATAAAT .256 .241 .256 .256 .289 .248 0.000290 0.000318 0.91
TAAATT .241 .296 .256 .256 .241 .248 0.000279 0.000297 0.94
AAATTT .256 .296 .256 .261 .241 .248 0.000303 0.000297 1.02
AATTTA .256 .296 .254 .261 .241 .263 0.000318 0.000297 1.07
ATTTAT .256 .241 .254 .261 .289 .248 0.000293 0.000278 1.05
TTTATC .241 .241 .254 .256 .241 .256 0.000233 0.000278 0.84
TTATCA .241 .241 .256 .261 .256 .263 0.000261 0.000297 0.88
TATCAT .241 .296 .254 .256 .289 .248 0.000332 0.000297 1.12
ATCATT .256 .241 .243 .256 .241 .248 0.000229 0.000297 0.77
TCATTT .241 .230 .256 .261 .241 .248 0.000221 0.000278 0.80
CATTTC .263 .296 .254 .261 .241 .256 0.000318 0.000297 1.07
ATTTCC .256 .241 .254 .261 .256 .256 0.000268 0.000297 0.90
TTTCCT .241 .241 .254 .256 .256 .248 0.000240 0.000278 0.86
TTCCTT .241 .241 .243 .256 .241 .248 0.000216 0.000278 0.78
TCCTTC .241 .230 .243 .261 .241 .256 0.000217 0.000297 0.73
CCTTCT .263 .230 .254 .261 .256 .248 0.000255 0.000297 0.86
CTTCTC .263 .241 .254 .256 .241 .256 0.000254 0.000297 0.86
TTCTCC .241 .241 .243 .261 .256 .256 0.000241 0.000297 0.81
TCTCCA .241 .230 .254 .256 .256 .263 0.000243 0.000318 0.76
CTCCAC .263 .241 .243 .256 .289 .256 0.000292 0.000339 0.86
TCCACT .241 .230 .243 .256 .256 .248 0.000219 0.000318 0.69
CCACTC .263 .230 .256 .256 .241 .256 0.000245 0.000339 0.72
CACTCC .263 .296 .243 .261 .256 .256 0.000324 0.000339 0.95
ACTCCT .256 .230 .254 .256 .256 .248 0.000243 0.000318 0.76

The six-base site CAGTTA beginning at base 8 is calculated to have the highest odds
ratio (1.23). Therefore, it is chosen as the new site in sequence 1.
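
The window-by-window calculation in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `site_odds` is hypothetical, and the frequency tables below are made-up placeholders (the notes' actual count and frequency tables are not reproduced here), so the numbers will not match the table above.

```python
from math import prod


def site_odds(seq, motif_freqs, background):
    """For every window of motif width, return (start, window, odds), where
    odds = P(window | motif columns) / P(window | background)."""
    width = len(motif_freqs)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Product of the column-specific frequencies (the "1*2*3*4*5*6" column).
        p_motif = prod(motif_freqs[i][base] for i, base in enumerate(window))
        # Product of the background frequencies (the "RANDOM" column).
        p_random = prod(background[base] for base in window)
        scores.append((start, window, p_motif / p_random))
    return scores


seq = "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT"
# Placeholder frequencies, assumed for illustration only.
background = {b: 0.25 for b in "ACGT"}
motif = [dict(background) for _ in range(6)]
motif[0] = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}

scores = site_odds(seq, motif, background)
best = max(scores, key=lambda s: s[2])
print(len(scores), "windows; best start (0-based):", best[0], best[1])
```

Note that `start` is 0-based here, whereas the text counts bases from 1.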

This is repeated for each of the sequences. In the maximization step, the newly chosen
sites for each of the sequences are used to recalculate the frequency table. The
expectation/maximization cycle is then repeated, until the results converge on a set of
motifs.
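
The cycle described above can be sketched as a loop. The version below is a simplified "hard-assignment" variant that picks the single best site per sequence on each pass, as in the worked example; it is not the full probabilistic EM used by programs such as MEME. All function names, the pseudocount of 1, and the toy sequences are assumptions for the example.

```python
import random
from math import prod

BASES = "ACGT"


def column_freqs(sites, pseudo=1.0):
    """Per-column base frequencies of the aligned sites (with pseudocounts)."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def background_freqs(seqs, starts, width, pseudo=1.0):
    """Base frequencies of everything outside the current sites."""
    counts = {b: pseudo for b in BASES}
    for seq, st in zip(seqs, starts):
        for b in seq[:st] + seq[st + width:]:
            counts[b] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def best_start(seq, motif, bg):
    """Start position whose window has the highest motif/background odds."""
    width = len(motif)

    def odds(st):
        return prod(motif[i][b] / bg[b]
                    for i, b in enumerate(seq[st:st + width]))

    return max(range(len(seq) - width + 1), key=odds)


def hard_em(seqs, width, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial random alignment of one site per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(max_iter):
        sites = [s[st:st + width] for s, st in zip(seqs, starts)]
        motif = column_freqs(sites)                    # recount the motif
        bg = background_freqs(seqs, starts, width)     # recount the background
        new_starts = [best_start(s, motif, bg) for s in seqs]
        if new_starts == starts:                       # converged
            break
        starts = new_starts
    return starts


toy = ["ACGTTATCAGG", "TTATCACCGTA", "GGCTTATCATT"]
print(hard_em(toy, width=6))
```

Because the result depends on the random starting alignment, such a loop is normally run from several different starting points.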

Multiple EM for Motif Elicitation (MEME)

MEME is a program that uses the expectation-maximization methods
described previously. ParaMEME searches for blocks using the EM algorithm, while
MetaMEME searches for profiles using Hidden Markov Models (HMMs).

MEME locates one or more ungapped patterns in a single DNA or protein sequence, or in
a series of sequences. A search is conducted on a variety of motif widths in order to
determine the most likely width for the profile. This likelihood is based on the log
likelihood score calculated after the EM algorithm. One of three types of motif models
can be chosen:

OOPS: One expected occurrence per sequence
ZOOPS: Zero or one expected occurrence per sequence
TCM: Any number of occurrences of the motif

Various prior knowledge can be added to MEME, including the expected number of
motifs, the expected length of the motif, and whether or not the motif is palindromic
(only applicable for DNA sequences).

Gibbs Sampling
Gibbs sampling is another statistical method similar in nature to the EM algorithms.
Gibbs sampling combines both EM and simulated annealing techniques in order to
determine a maximal local alignment of multiple sequences.

The idea behind Gibbs sampling is to determine the most probable pattern common to all
of the sequences by sliding them back and forth until the ratio of the motif probability to
the background probability is a maximum.

In the first step of Gibbs sampling, the predictive update step, a random start position for
the motif is chosen in all of the sequences except one; the sequence that is left out is
chosen either in a random or specified order. The initial alignment of the randomly assigned motifs is used to
calculate the residue frequencies in each position of the motif, and the background
frequencies. This is done in a manner similar to the EM algorithm. In the second step, the
sampling step, the ratio of the model probability to the background probability is used as
the weight for each of the possible motif starting positions in the left-out sequence. These weights are normalized
by dividing by their sum, resulting in a probability for each motif position. A motif start
position is then chosen by random sampling with the given weights. This process
is repeated, leaving out a different sequence each time, until the residue frequencies in
each column do not change. The procedure is then repeated for a different initial random alignment.

Since the Gibbs sampler chooses start positions by sampling according to their probability
rather than always taking the most probable position, it can escape local maxima that might trap EM algorithms.
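
A minimal sketch of the two steps follows, assuming DNA sequences, a fixed motif width, and a pseudocount of 1. The function name `gibbs_motif` and all parameters are hypothetical, and this is not the implementation behind any particular Gibbs sampler program; it omits the shifting routine and the other refinements those programs include.

```python
import random
from math import prod

BASES = "ACGT"


def gibbs_motif(seqs, width, sweeps=200, pseudo=1.0, seed=0):
    """Return one sampled motif start per sequence (needs >= 2 sequences)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for held in range(len(seqs)):
            # Predictive update step: build the model from all other sequences.
            others = [(s, st) for i, (s, st) in enumerate(zip(seqs, starts))
                      if i != held]
            motif = []
            for col in range(width):
                counts = {b: pseudo for b in BASES}
                for s, st in others:
                    counts[s[st + col]] += 1
                total = sum(counts.values())
                motif.append({b: counts[b] / total for b in BASES})
            bg_counts = {b: pseudo for b in BASES}
            for s, st in others:
                for b in s[:st] + s[st + width:]:
                    bg_counts[b] += 1
            bg_total = sum(bg_counts.values())
            bg = {b: bg_counts[b] / bg_total for b in BASES}

            # Sampling step: weight every start in the held-out sequence by
            # its model:background ratio, then draw one start at random.
            seq = seqs[held]
            weights = [
                prod(motif[i][b] / bg[b]
                     for i, b in enumerate(seq[p:p + width]))
                for p in range(len(seq) - width + 1)
            ]
            # random.choices normalizes the weights internally.
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


toy = ["ACGTTATCAGG", "TTATCACCGTA", "GGCTTATCATT", "CATTATCAGCA"]
print(gibbs_motif(toy, width=6))
```

The draw in the sampling step is what distinguishes this from the EM sketch: a lower-scoring position can still be chosen occasionally, which lets the search move away from a poor alignment.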

In order to improve the performance of the Bayesian approach to Gibbs sampling,
Dirichlet priors (pseudocounts) are added into the nucleotide counts. Gibbs sampling
also employs a shifting routine that will take a current multiple motif alignment, and shift
it a few bases to the left or the right, in order to see if only part of the motif is being
found. A range of motif sizes can be explored in Gibbs sampling as well. In addition,
Gibbs sampling can be extended to search for multiple motifs in the same set of
sequences, and to find a pattern in only a fraction of the sequences. Finally, certain
model-specific constraints can be enforced, such as palindromic sequences.

Gibbs Sampler Web interface


http://bayesweb.wadsworth.org/gibbs/gibbs.html

Hidden Markov Models


Hidden Markov models are statistical models that can take into account various
probabilities. We will come back and talk about Hidden Markov models in greater detail
later in the class.

Position Specific Scoring Matrix (PSSM)

Position Specific Scoring Matrices incorporate information theory in order to gain a
measure of how much information is contained within each column of a multiple
alignment. The information contained within a PSSM is a logarithmic transformation of
the frequency of each residue in the motif.

Pseudocounts

One problem with creating a model of a sequence alignment that is then used to search
databases is that there is a bias towards the training data. For instance, one column in a
motif may contain a completely conserved residue. However, such an occurrence will
make it highly unlikely to detect a new member of the family that doesn’t have the same
residue in that position. In addition, the residues found in a specific column may not be
highly representative of the family as a whole, especially if a small training set is used.
In order to get around these problems, the idea of pseudocounts is introduced in order to
estimate the probabilities. So now the estimated probability is changed from a frequency
of counts in the data to the following form:

$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$

where $P_{ca}$ is the probability of seeing residue $a$ in column $c$; $n_{ca}$ is the count of residue $a$
in column $c$; $b_{ca}$ is the pseudocount for residue $a$ in column $c$; $N_c$ is the number of
residues in column $c$; and $B_c$ is the total number of pseudocounts in column $c$.

These probabilities are then converted into a log-odds form (usually using base-2 logarithms, so that the
information can be reported in bits) and placed in the PSSM.
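
The pseudocount estimate and the log-odds conversion can be sketched as follows. This assumes a uniform pseudocount for every residue (so the total per column is the pseudocount times the alphabet size) and a uniform background unless one is supplied; the function name `build_pssm` and the toy sites are assumptions for the example.

```python
from math import log2


def build_pssm(sites, alphabet="ACGT", background=None, pseudo=1.0):
    """Return a list (one dict per column) of log2-odds scores."""
    background = background or {a: 1 / len(alphabet) for a in alphabet}
    n_seqs = len(sites)                     # N_c: residues in each column
    total_pseudo = pseudo * len(alphabet)   # B_c: pseudocounts in each column
    pssm = []
    for c in range(len(sites[0])):
        column = [site[c] for site in sites]
        scores = {}
        for a in alphabet:
            # P_ca = (n_ca + b_ca) / (N_c + B_c)
            p = (column.count(a) + pseudo) / (n_seqs + total_pseudo)
            scores[a] = log2(p / background[a])   # log-odds in bits
        pssm.append(scores)
    return pssm


sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pssm = build_pssm(sites)
for c, scores in enumerate(pssm, start=1):
    print(c, {a: round(s, 2) for a, s in scores.items()})
```

Because of the pseudocounts, a residue that never appears in a column still receives a finite (negative) score instead of an undefined logarithm of zero.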

In order to search a sequence against a PSSM, the value for the first residue in the
sequence occurring in the first column is looked up in the PSSM. Similarly,
the value for the residue occurring in each of the other columns is looked up. These values are added
(since they are logarithms) to produce a summed log-odds score, $S$. This score can be
converted to an odds score using the formula $2^S$. The odds scores for the motif beginning
at each position can be summed together and normalized to produce a probability of the
motif occurring at each location.
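
The search procedure can be sketched as a sliding window. The tiny two-column PSSM below is a made-up example in bits, and the function name `scan` is hypothetical.

```python
def scan(seq, pssm):
    """Return (start, log_odds, probability) for every window of PSSM width."""
    width = len(pssm)
    # Sum the per-column log-odds scores for each window to get S.
    log_odds = [
        sum(pssm[i][r] for i, r in enumerate(seq[p:p + width]))
        for p in range(len(seq) - width + 1)
    ]
    odds = [2 ** s for s in log_odds]       # convert S to an odds score, 2^S
    total = sum(odds)
    # Normalize the odds so they sum to 1 across all start positions.
    return [(p, s, o / total) for p, (s, o) in enumerate(zip(log_odds, odds))]


# Hypothetical 2-column PSSM (scores in bits) favouring the dinucleotide "AC".
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
hits = scan("GACA", toy_pssm)
for start, s, prob in hits:
    print(start, s, round(prob, 3))
```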

Information theory can give an appreciation for the amount of information contained
within each sequence.

When there is no information contained within a column, the amount of uncertainty can
be measured as $\log_2 20 \approx 4.32$ bits for amino acids, since there are 20 amino acids. For
nucleic acid sequences, the amount of uncertainty can be measured as $\log_2 4 = 2$ bits. PSSMs
can be used in order to reduce the uncertainty. If only one amino acid is found in a
particular column, then the uncertainty is 0, since there is only one choice. If there are two
amino acids occurring with equal probability, then the uncertainty in deciding
which residue it is equals $\log_2 2 = 1$ bit.

The amount of uncertainty for a particular column is measured as the entropy, as
introduced previously:

$$H_c = -\sum_{\text{residues } a} f_{ac} \log_2(p_{ac})$$

The uncertainty for the whole PSSM can be calculated as a sum over all columns:

$$H = \sum_{\text{all columns } c} H_c$$
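
As a quick numerical check of these uncertainty values, the sketch below computes the entropy of a column from its observed residue frequencies (so the frequency and the probability in the formula are taken to be the same quantity, with no pseudocounts) and sums it over the columns of a small alignment. The function names and the toy alignment are assumptions for the example.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy in bits of one alignment column, from observed frequencies."""
    n = len(column)
    return -sum((k / n) * log2(k / n) for k in Counter(column).values())


def alignment_entropy(rows):
    """Sum of the column entropies over all columns of an ungapped alignment."""
    return sum(column_entropy(col) for col in zip(*rows))


print(round(log2(20), 2))          # maximum uncertainty for amino acids
print(log2(4))                     # maximum uncertainty for nucleotides
print(column_entropy("AAAA"))      # fully conserved column
print(column_entropy("AACC"))      # two residues, equally likely

toy = ["ACGT", "ACGA", "ACCA", "AGCA"]
print(round(alignment_entropy(toy), 3))
```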

Sequence Logos
One way to examine a particular PSSM is graphically. Sequence logos do this
by illustrating the information in each column of a motif. Such a graph can
indicate which residues and which columns are the most important as far as sequence
conservation is concerned. The height of the logo at each column is calculated as the amount by which
uncertainty has been decreased.

In addition to the entropy measure given before, a relative entropy measure could be
calculated as well. Relative entropy takes into account not only the data in the columns
of the motif, but also the overall composition of the organism being studied. Relative
entropy can be measured as:

$$R_c = \sum_{\text{residues } a} f_{ac} \log_2(p_{ac} / b_a)$$

where $b_a$ is the background frequency of residue $a$ in the organism. If the frequency in
the column is less than the frequency in the background, then a negative information can
be computed, which is shown by an inverted character in the logo.
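
A small sketch of the relative entropy calculation is given below. It uses observed column frequencies and the sign convention under which a residue that is more common in the column than in the background contributes positive information, and one that is less common contributes negative information, as described above. The function name, the background values, and the toy columns are assumptions for the example.

```python
from math import log2


def relative_entropy(column, background):
    """Relative entropy in bits of one column against background frequencies."""
    n = len(column)
    total = 0.0
    for a in set(column):
        p = column.count(a) / n
        # Positive when p > b_a, negative when p < b_a.
        total += p * log2(p / background[a])
    return total


uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}   # hypothetical organism

print(relative_entropy("AAAA", uniform))    # conserved A, uniform background
print(relative_entropy("AAAA", at_rich))    # conserved A is less surprising
print(relative_entropy("CCCC", at_rich))    # conserved C is more surprising
```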

Sequence Editors and formatters

Sequence editors allow the user to take a given multiple alignment and manually fix it.
Examples of sequence editors include:

CINEMA
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/kit.html

GeneDoc
http://www.psc.edu/biomed/genedoc/

MACAW
http://ncbi.nlm.nih.gov/pub/schuler/macaw

BoxShade
http://www.ch.embnet.org/software/BOX_form.html

Biological sequence data formats

IUPAC Codes
In order to standardize sequence data, the Nomenclature Committee of the International
Union of Biochemistry and the International Union of Pure and Applied Chemistry has
established a standard code to represent bases that are uncertain or ambiguous. The code,
often referred to as the IUPAC code, is as follows:

A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C (strong)
W = A T (weak)
B = G T C (not A)
D = G A T (not C)
H = A C T (not G)
V = G C A (not T)
N = A G C T (any)

Any character other than the ones listed above (with the exception of the gap character
‘-’) represents an error and will be rejected by nearly all sequence analysis
programs.
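
A simple way to apply the code table is to store it as a dictionary and check each character against it. The sketch below does this; the names `IUPAC`, `invalid_positions`, and `matches` are assumptions for the example.

```python
# Each IUPAC nucleotide code mapped to the set of bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def invalid_positions(seq):
    """Return (index, character) pairs that are neither IUPAC codes nor gaps."""
    return [(i, ch) for i, ch in enumerate(seq.upper())
            if ch not in IUPAC and ch != "-"]


def matches(code, base):
    """True if the concrete base is one of those allowed by the IUPAC code."""
    return base.upper() in IUPAC[code.upper()]


print(invalid_positions("ACGTN-RYX"))
print(matches("R", "g"), matches("Y", "A"))
```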

Standard Amino Acid Code

In addition to the nucleic acid codes, a standard single letter and three letter amino acid
code has been formulated by IUPAC as well. The table for this code is as follows:
1-letter 3-letter description
A Ala Alanine
R Arg Arginine
N Asn Asparagine
D Asp Aspartic acid
C Cys Cysteine
Q Gln Glutamine
E Glu Glutamic acid
G Gly Glycine
H His Histidine
I Ile Isoleucine
L Leu Leucine
K Lys Lysine
M Met Methionine
F Phe Phenylalanine
P Pro Proline
S Ser Serine
T Thr Threonine
W Trp Tryptophan
Y Tyr Tyrosine
V Val Valine
B Asx Aspartic acid or Asparagine
Z Glx Glutamine or Glutamic acid
X Xaa or Xxx Any amino acid

FASTA

FASTA sequence format is one of the most basic and widespread sequence formats. A
sequence in FASTA format has as its first line a descriptor beginning with a ‘>’ character.
The following lines contain the sequence (either nucleotide or amino acid) using
standard one-letter symbols. This format is extremely useful for sequence analysis
programs, since it is devoid of numerical and nonsequence characters (with the exception
of the newline character).

Example FASTA Sequence:
>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]
MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF
SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL
AHRYH

Note the first line begins with ‘>’, which in this case is followed by gi, indicating that the
next field surrounded by ‘|’ will be the GenBank identifier. Following the GenBank
identifier is the keyword ‘ref’ indicating the next field will be the reference for the
version of this sequence. The final field is the description. Note that nearly all sequence
based programs will treat anything following the ‘>’ as a comment and disregard it (or
only use it as a sequence descriptor). There are, however, a few sequence analysis
programs that expect the sequences to be in a strict fasta format.
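
Reading this format takes only a few lines of code. The sketch below is a minimal parser written for illustration (the function name `parse_fasta` is an assumption, and established libraries provide more robust readers); it is applied to the hemoglobin record shown above, and the description line is split on the ‘|’ character.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:               # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)                  # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|27819608|ref|NP_776342.1| hemoglobin, beta [beta globin] [Bos taurus]
MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKVKAHGKKVLDSF
SNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVANAL
AHRYH
"""

description, sequence = parse_fasta(example)[0]
fields = description.split("|", 4)
print(fields[1], fields[3], fields[4].strip(), len(sequence))
```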

GenBank

GenBank is the National Center for Biotechnology Information’s nucleic acid and protein
sequence database. It is the most widely used source of biological sequence data.
GenBank file format contains information about the sequence, including literature
references, functions of the sequence, locations of various features, etc.

The information in GenBank records is organized into fields, each with an identifier,
justified to the farthest left column. Some identifiers have additional subfields. The
actual sequence data lies between the identifier ORIGIN and the ‘//’ which signals the
end of a GenBank record.
Example GenBank Sequence:

LOCUS HBB 145 aa linear MAM 22-JAN-2003
DEFINITION hemoglobin, beta [beta globin] [Bos taurus].
ACCESSION NP_776342
VERSION NP_776342.1 GI:27819608
DBSOURCE REFSEQ: accession NM_173917.1
KEYWORDS .
SOURCE Bos taurus (cow)
ORGANISM Bos taurus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea;
Bovidae; Bovinae; Bos.
REFERENCE 1 (residues 1 to 145)
AUTHORS Duncan,C.H.
JOURNAL Unpublished (1991)
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from M63453.1.
FEATURES Location/Qualifiers
source 1..145
/organism="Bos taurus"
/db_xref="taxon:9913"
/chromosome="15"
/map="15q22-q27"
/tissue_type="thymus"
/dev_stage="newborn"
Protein 1..145
/product="hemoglobin, beta [beta globin]"
Region 3..145
/region_name="Globin"
/note="globin"
/db_xref="CDD:pfam00042"
CDS 1..145
/gene="HBB"
/coded_by="NM_173917.1:53..490"
/db_xref="LocusID:280813"
ORIGIN
1 mltaeekaav tafwgkvkvd evggealgrl lvvypwtqrf fesfgdlsta davmnnpkvk
61 ahgkkvldsf sngmkhlddl kgtfaalsel hcdklhvdpe nfkllgnvlv vvlarnfgke
121 ftpvlqadfq kvvagvanal ahryh
//
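
Since the sequence lies between the ORIGIN identifier and the ‘//’ terminator, it can be pulled out of a record with a short loop. The sketch below is a minimal illustration that ignores every other field; the function name `genbank_sequence` and the abbreviated record are assumptions for the example.

```python
def genbank_sequence(record_text):
    """Return the sequence found between ORIGIN and '//' in a GenBank record."""
    residues = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break                                # end of the record
        elif in_origin:
            # Drop the position numbers and spaces, keep only the letters.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(residues).upper()


# Abbreviated, hypothetical record used only to exercise the function.
record = """LOCUS       HBB
DEFINITION  hemoglobin, beta [beta globin] [Bos taurus].
ORIGIN
        1 mltaeekaav tafwgkvkvd
       21 evgg
//
"""
print(genbank_sequence(record))
```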

ASN.1
Abstract Syntax Notation One (ASN.1) is a formal description language that was
developed to encode data so that it can be easily exchanged between computer
systems. The ASN.1 format is highly structured and detailed, and it contains all of
the information found in the other formats.

Seq-entry ::= set {
level 1 ,
class nuc-prot ,
descr {
source {
genome genomic ,
org {
taxname "Bos taurus" ,
common "cow" ,
db {
{
db "taxon" ,
tag
id 9913 } } ,
orgname {
name
binomial {
genus "Bos" ,
species "taurus" } ,
lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora;
Bovoidea; Bovidae; Bovinae; Bos" ,
gcode 1 ,
mgcode 2 ,
div "MAM" } } ,
subtype {
{
subtype chromosome ,
name "15" } ,
{
subtype map ,
name "15q22-q27" } ,
{
subtype tissue-type ,
name "thymus" } ,
{
subtype dev-stage ,
name "newborn" } } } ,
user {
type
str "RefGeneTracking" ,
data {
{
label
str "Status" ,
data
str "Provisional" } ,
{
label
str "Assembly" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "M63453.1" } ,
{
label
str "gi" ,
data
int 162741 } } } } } ,
{
label
str "Related" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "X00376.1" } ,
{
label
str "gi" ,
data
int 395 } } } } } ,
{
label
str "Unknown" ,
data
fields {
{
label
id 0 ,
data
fields {
{
label
str "accession" ,
data
str "X03248.1" } ,
{
label
str "gi" ,
data
int 319 } } } } } } } ,
pub {
pub {
gen {
cit "Unpublished" ,
authors {
names
std {
{
name
name {
last "Duncan" ,
initials "C.H." } } } } ,
date
std {
year 1991 } } } ,
comment "simple staff_entry" } ,
update-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
seq-set {
seq {
id {
other {
accession "NM_173917" ,
version 1 } ,
gi 27819607 } ,
descr {
molinfo {
biomol mRNA } ,
title "Bos taurus hemoglobin, beta [beta globin] (HBB), mRNA" ,
create-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
inst {
repr raw ,
mol rna ,
length 821 ,
strand ss ,
seq-data
ncbi2na '11F9F784416EF47241C440484539E1E78A20A796D165FFAA42B80BA382F
AEB8A57A929E7AFB7157A1D22BDFE2D7FAA1FB51E78E7BCE0415C2B82953A420AE723D7F2C3A4E
09376385D0A917F9E678B89E47B8C2793BA35E207D09D7A906E72EBEE7A7643FE90A0F455AE792
9E1FD20AEBA7AEE94395E9512334F09D5FD79FD4A02BFFD35D225408F833A003CE0BBFE24DE977
970C084FCFF4F91EBB3F03CFD1EDDF1D23A913A8A900478213020382A72D885F8803334B37E855
38492EBEC0C9E3BCE80129FE75F25F1DD5F020F40'H } ,
annot {
{
data
ftable {
{
data
gene {
locus "HBB" ,
db {
{
db "LocusID" ,
tag
id 280813 } } } ,
location
int {
from 0 ,
to 820 ,
strand plus ,
id
gi 27819607 } } } } } } ,
seq {
id {
other {
accession "NP_776342" ,
version 1 } ,
gi 27819608 } ,
descr {
molinfo {
biomol peptide } ,
title "hemoglobin, beta [beta globin] [Bos taurus]" ,
create-date
std {
year 2003 ,
month 1 ,
day 22 } } ,
inst {
repr raw ,
mol aa ,
length 145 ,
seq-data
ncbieaa "MLTAEEKAAVTAFWGKVKVDEVGGEALGRLLVVYPWTQRFFESFGDLSTADAVMNNPKV
KAHGKKVLDSFSNGMKHLDDLKGTFAALSELHCDKLHVDPENFKLLGNVLVVVLARNFGKEFTPVLQADFQKVVAGVA
NALAHRYH" } ,
annot {
{
data
ftable {
{
data
prot {
name {
"hemoglobin, beta [beta globin]" } } ,
location
whole
gi 27819608 } ,
{
data
region "Globin" ,
comment "globin" ,
location
int {
from 2 ,
to 144 ,
id
gi 27819608 } ,
ext {
type
str "cddScoreData" ,
data {
{
label
str "definition" ,
data
str "Globin" } ,
{
label
str "short_name" ,
data
str "globin" } ,
{
label
str "score" ,
data
int 327 } ,
{
label
str "evalue" ,
data
real { 813255, 10, -37 } } ,
{
label
str "bit_score" ,
data
real { 130091, 10, -3 } } } } ,
dbxref {
{
db "CDD" ,
tag
str "pfam00042" } } } } } } } } ,
annot {
{
data
ftable {
{
data
cdregion {
frame one ,
code {
id 1 } } ,
product
whole
gi 27819608 ,
location
int {
from 52 ,
to 489 ,
id
gi 27819607 } } } } } }

Sample ASN.1 file

SwissProt

XML File Format

Databases
GenBank
DDBJ
EMBL

SwissProt
BLOCKS
PFAM

Using Entrez

Complete Process:

1) Determine sequences to align (Globins)


>sp|P02023|HBB_HUMAN Hemoglobin beta chain - Homo sapiens (Human), Pan troglodytes
(Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTP
DAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLV
CVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>sp|P02062|HBB_HORSE Hemoglobin beta chain - Equus caballus (Horse).
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNP
GAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLV
VVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>sp|P01922|HBA_HUMAN Hemoglobin alpha chain - Homo sapiens (Human), Pan
troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo).
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHG
SAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLA
AHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>sp|P01958|HBA_HORSE Hemoglobin alpha chains (Slow and fast) - Equus caballus
(Horse).
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHG
SAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAV
HLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>sp|P02185|MYG_PHYCA Myoglobin - Physeter catodon (Sperm whale) (Physeter
macrocephalus).
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTE
AEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHV
LHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA Globin V - Petromyzon marinus (Sea lamprey).
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPK
FKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQ
VDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>sp|P02240|LGB2_LUPLU Leghemoglobin II - Lupinus luteus (Yellow lupine).
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP
QNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVV
KEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

2) Determine multiple alignment (ClustalW)

>sp|P02023|HBB_HUMAN
--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQR
FFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTF
ATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVA
GVANALAHKYH------
>sp|P02062|HBB_HORSE
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQR
FFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTF
AALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVA
GVANALAHKYH------
>sp|P01922|HBA_HUMAN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKT
YFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA
SVSTVLTSKYR------
>sp|P01958|HBA_HORSE
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKT
YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGAL
SNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLS
SVSTVLTSKYR------
>sp|P02185|MYG_PHYCA
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLE
KFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAEL
KPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALE
LFRKDIAAKYKELGYQG
>sp|P02208|GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQE
FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKL
RDLSGKHAKSFQVDPQYFKVLAAVIADTVAAG---------DAGFEKLMS
MICILLRSAY-------
>sp|P02240|LGB2_LUPLU
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKD
LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
KNLGSVHVSKGVAD-AHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYD
ELAIVIKKEMNDAA---
3) View alignment using various methods
4) Find Blocks in the alignment (BLOCKS)

Profile: Scores for substitutions and gaps in each column


Blocks: ungapped aligned regions
Alignments based on locally conserved patterns found in the same order in the sequences
(synteny)

Use of statistical methods and probabilistic models of the sequences

Multiple sequence alignments yield insight into the evolutionary history of the
sequences: sequences that are most similar are likely to be recently derived from a
common ancestor sequence.

If the sequences in a multiple alignment have quite a bit of variation, then it is difficult to
create a multiple sequence alignment because of the many different combinations of substitutions,
insertions, and deletions that can be used.

READ MOUNT, Chapters 2 and 7


CECS 694-02
Introduction to Bioinformatics
Lecture 5: Searching Sequence Databases

Multiple Alignment format


In addition to storing individual sequences in a specified format, the results from a
multiple sequence alignment can be stored in a specified format as well. Various
programs (including the BLOCKS server) can then read in these multiple sequence
alignments and perform analysis on them. The most widely used multiple sequence
alignment file formats are: FASTA, GCG Multiple Sequence Format, and ALN.

FASTA Format
In Fasta Format, each sequence in the multiple alignment starts with a Fasta description
line (beginning with a ‘>’). Following the description line is the sequence data. The gap
character ‘-‘ is found in locations corresponding to gaps in the sequence when the
multiple alignment was created.

>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM
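
Once aligned sequences have been read from such a file, each row should have the same length, and the columns can be examined directly. The sketch below checks the lengths and lists the columns in which every sequence has the same residue and none has a gap. The function name `conserved_columns` and the short rows (not the full alignment above) are assumptions for the example.

```python
def conserved_columns(rows, gap="-"):
    """Return 0-based indices of gap-free columns with one identical residue."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return [i for i, col in enumerate(zip(*rows))
            if gap not in col and len(set(col)) == 1]


rows = [
    "NVSDVNLNK-YIW",
    "NASNLSLSK-YIP",
    "MAIRLLPLPVRAQ",
]
print(conserved_columns(rows[:2]))   # columns identical in the first two rows
print(conserved_columns(rows))       # columns identical in all three rows
```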

Stockholm Format
Stockholm Format (http://www.cgr.ki.se/cgr/groups/sonnhammer/Stockholm.html)

# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777
O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS *
#=GR O31699/88-139 IN

GCG Multiple Sequence Format


!!AA_MULTIPLE_ALIGNMENT 1.0

msf MSF: 131 Type: P 22/01/02 CompCheck: 3003 ..

Name: IXI_234 Len: 131 Check: 6808 Weight: 1.00
Name: IXI_235 Len: 131 Check: 4032 Weight: 1.00
Name: IXI_236 Len: 131 Check: 2744 Weight: 1.00
Name: IXI_237 Len: 131 Check: 9419 Weight: 1.00

//

1 50
IXI_234 TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235 TSPASIRPPAGPSSR.........RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236 TSPASIRPPAGPSSRPAMVSSR..RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237 TSPASLRPPAGPSSRPAMVSSRR.RPSPPGPRRPT....CSAAPRRPQAT

51 100
IXI_234 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAG
IXI_235 GGWKTCSGTCTTSTSTRHRGRSGW..........RASRKSMRAACSRSAG
IXI_236 GGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSR..G
IXI_237 GGYKTCSGTCTTSTSTRHRGRSGYSARTTTAACLRASRKSMRAACSR..G

101 131
IXI_234 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_235 SRPNRFAPTLMSSCITSTTGPPAWAGDRSHE
IXI_236 SRPPRFAPPLMSSCITSTTGPPPPAGDRSHE
IXI_237 SRPNRFAPTLMSSCLTSTTGPPAYAGDRSHE

PileUp

MSF: 92 Type: P Check: 1886 ..

Name: JC2395 oo Len: 92 Check: 8870 Weight: 35.3
Name: FASA_MOUSE oo Len: 92 Check: 527 Weight: 64.6
Name: KPEL_DROME oo Len: 92 Check: 2489 Weight: 41.2

//
JC2395 .NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE .NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

JC2395 EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM
FASA_MOUSE EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM
KPEL_DROME ASN.EFLNIW GGQYNHT..V QTLFALFKKL KLHNAMRLIK DY

ClustalW ALN Format


CLUSTAL W (1.82) multiple sequence alignment

JC2395 -NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW 59
FASA_MOUSE -NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW 59
KPEL_DROME MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW 59
:* *. : :::* :: .::::* :. :. : .: ::* *

JC2395 YQSHGKTGACQALIQGLRKANRCDIAEEIQAM 91
FASA_MOUSE YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM 91
KPEL_DROME GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY 89
.:.:: * *: ::* : ::

CLUSTAL W(1.4) multiple sequence alignment

IXI_234 TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235 TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236 TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237 TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT

IXI_234 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSRSAG
IXI_235 GGWKTCSGTC TTSTSTRHRG RSGW------ ----RASRKS MRAACSRSAG
IXI_236 GGWKTCSGTC TTSTSTRHRG RSGWSARTTT AACLRASRKS MRAACSR--G
IXI_237 GGYKTCSGTC TTSTSTRHRG RSGYSARTTT AACLRASRKS MRAACSR--G

IXI_234 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_235 SRPNRFAPTL MSSCITSTTG PPAWAGDRSH E
IXI_236 SRPPRFAPPL MSSCITSTTG PPPPAGDRSH E
IXI_237 SRPNRFAPTL MSSCLTSTTG PPAYAGDRSH E

Phylip
3 92
JC2395 -NVSDVNLNK YIWRTAEKMK ICDAKKFARQ HKIPESKIDE IEHNSPQDAA
FASA_MOUSE -NASNLSLSK YIPRIAEDMT IQEAKKFARE NNIKEGKIDE IMHDSIQDTA
KPEL_DROME MAIRLLPLPV RAQLCAHLDA LDVWQQLATA VKLYPDQVEQ ISSQKQRGRS

EQKIQLLQCW YQSHGKTGAC QALIQGLRKA NRCDIAEEIQ AM
EQKVQLLLCW YQSHGKSDAY QDLIKGLKKA ECRRTLDKFQ DM
ASN-EFLNIW GGQYNHT--V QTLFALFKKL KLHNAMRLIK DY

PIR Format
>P1;JC2395

-NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAEQKIQLLQCW
YQSHGKTGACQALIQGLRKANRCDIAEEIQAM
*
>P1;FASA_MOUSE

-NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAEQKVQLLLCW
YQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
*
>P1;KPEL_DROME

MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRGRSASN-EFLNIW
GGQYNHT--VQTLFALFKKLKLHNAMRLIKDY
*
GDE
%JC2395
nvsdvnlnkyiwrtaekmkicdakkfarqhkipeskideiehnspqdaaeqkiqllqcwy
qshgktgacqaliqglrkanrcdiaeeiqam
%FASA_MOUSE
nasnlslskyipriaedmtiqeakkfarennikegkideimhdsiqdtaeqkvqlllcwy
qshgksdayqdlikglkkaecrrtldkfqdm
%KPEL_DROME
--mairllplpvraqlcahldaldvwqqlatavklypdqveqissqkqrgrsasneflni
wggqynhtvqtlfalfkklklhnamrlikdy

Nexus
#NEXUS
BEGIN DATA;
dimensions ntax=3 nchar=91;
format missing=?
symbols="ABCDEFGHIKLMNPQRSTUVWXYZ"
interleave datatype=PROTEIN gap= -;

matrix
JC2395 NVSDVNLNKYIWRTAEKMKICDAKKFARQHKIPESKIDEIEHNSPQDAAE
FASA_MOUSE NASNLSLSKYIPRIAEDMTIQEAKKFARENNIKEGKIDEIMHDSIQDTAE
KPEL_DROME --MAIRLLPLPVRAQLCAHLDALDVWQQLATAVKLYPDQVEQISSQKQRG

JC2395 QKIQLLQCWYQSHGKTGACQALIQGLRKANRCDIAEEIQAM
FASA_MOUSE QKVQLLLCWYQSHGKSDAYQDLIKGLKKAECRRTLDKFQDM
KPEL_DROME RSASNEFLNIWGGQYNHTVQTLFALFKKLKLHNAMRLIKDY
;
end;

General Feature Format (GFF)

The general feature format was developed so that annotations could be readily parsed by
a number of programs to quickly determine the location of various features. Example
uses of GFF include importing data into ACE formats for quick feature viewing, and for
creating sequence images complete with features.

http://www.sanger.ac.uk/Software/formats/GFF/

A description of multiple alignment formats is given on the BLOCKS server page:
http://www.blocks.fhcrc.org/blocks/help/blocks_format.html

The sequence formats used by EMBOSS are found at:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Themes/SequenceFormats.html

Sequence Conversion Programs

SeqIO
ReadSeq

Searching Sequence Databases

Sequence Similarity Searches

The most common type of search compares a single query sequence against a database.
Such a search is typically performed to gather information on the potential function of a
gene, by examining the known functions of the database sequences that are similar to the
query. The search can be expanded to find sequences that are more distantly related to
the query (at least at the sequence level). Sequence similarity searches can also reveal
related proteins, which may lead to the discovery of a family that can then be
characterized, and perhaps multiply aligned and profiled.

With all sequence searches, it is important to consider the sensitivity and the selectivity
of the algorithms. Sensitivity refers to the ability to find most of the related members
(reduction of false negatives) while selectivity refers to the ability to detect only members
of the family you are interested in studying (reduction of false positives). This is
important to keep in mind when interpreting alignment results and assigning a function to
a sequence, since this assignment may be given through transitive relationships.

DNA versus protein searches

It is much easier to determine patterns of sequence similarity between protein sequences
than between DNA sequences, because DNA sequences have only four possible characters
per position, while amino acid sequences have 20. To illustrate, consider a sequence of
length four. A given DNA sequence of this length has a chance of 1/4^4 = 1/256 of
matching at random. For a protein sequence, the chance is 1/20^4 = 1/160,000.

In addition, since multiple codon sequences code for the same amino acid, it is possible
that the translated amino acid sequences could be identical, yet the underlying nucleic
acids could be different.

For instance, consider the following sequences:

ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA

blast2sequences reports that no significant alignment is found!

If we look at an ungapped alignment between these two sequences, we get:

ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
|||||  | || ||    ||  | || || || | |
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA

which gives 21 identical residues out of 36, for a percent identity of 58%.

However, translation of both of these sequences yields the protein sequence:

MELVISALIVE

The two translated protein sequences are 100% identical. Therefore, if the nucleotide
region we are interested in searching for is known to be in a protein coding region, it
would be beneficial to translate the DNA sequence into a protein sequence. Generally,
both the target (or database) and the query sequences are translated into all six reading
frames and compared to one another. (Recall that there are three reading frames in the
forward direction, and three in the reverse complement.) Rather than four comparisons
(target forward and reverse complement against query forward and reverse complement),
there are now thirty-six comparisons to be made (six frames against six frames).
Therefore, while translating the sequences into proteins will lead to better results, the
search will take approximately nine times as long.
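
The comparison above can be reproduced with a short Python sketch. It is only an
illustration, assuming the standard genetic code; the function names (translate,
six_frames, percent_identity) are made up for this example and do not come from any
particular package.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a nucleotide string in frame 1; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    dna = dna.upper().replace("U", "T")
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate three forward frames, then three reverse-complement frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(s[frame:]) for s in strands for frame in range(3)]


def percent_identity(a, b):
    """Percent identity of an ungapped, position-by-position comparison."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / min(len(a), len(b))


seq1 = "ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA"
seq2 = "ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA"
print(round(percent_identity(seq1, seq2)))   # nucleotide identity, in percent
print(translate(seq1), translate(seq2))      # the two protein translations
```

A translated search would compare each of the six frames of the query with each of the
six frames of the target, which is where the thirty-six comparisons come from.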

Scoring matrices

Most database searching utilities will allow the user to change between various scoring
systems. At one time, the default scoring matrix used for amino acid searches was the
Dayhoff PAM250 matrix. In most instances, the PAM250 matrix has been replaced by
the BLOSUM62 matrix, since the BLOSUM matrices were based on more sequence data.

FASTA
FASTA was the first rapid search method developed for database searching. FASTA
uses a heuristic algorithm to speed up the process of locating similar regions. Unlike
dynamic programming (DP), FASTA is not guaranteed to find the optimal solution.
However, the search is roughly 50 times faster than DP solutions.

FASTA Algorithm

Step 1: In the initial stage of searching for regions of similarity, FASTA uses a hashing
approach. For each of the sequences being compared, a table is constructed showing the
positions of each word of length k, or k-tuple. The relative position (offset) of each
shared word is calculated by subtracting its position in the first sequence from its
position in the second. Words having the same offset are in phase and reveal a longer
region of alignment between the two sequences.

Step 2: The ten regions with the highest density of identities are identified. The ends of
each region are trimmed to include only residues contributing to the highest score. Each
resulting region is now a partial alignment without gaps. Each is given a score (the init1
score).

Step 3: If there are several initial regions with scores greater than a cutoff value, check to
see if the trimmed initial regions can be joined to form an approximate alignment with
gaps. A similarity score is calculated as the sum of the init1 scores for each of the initial
regions minus a penalty for each gap (the initn score).

Step 4: Construct a Needleman-Wunsch optimal alignment of the query sequence and the
library sequence, considering only those residues that lie in a band 32 residues wide,
centered on the best initial region found in step 2 (the opt score).

After locating the k-tuples and grouping the ones with the same offset together, an
optimization step is invoked to piece together k-tuple alignments allowing gaps.
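
The hashing stage can be sketched in a few lines of Python. This is a minimal
illustration of the k-tuple lookup and offset counting only, not the FASTA source code;
the function names are invented for the example, and the scoring, trimming and joining
of regions are left out.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Build the lookup table: each word of length k -> list of start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_counts(query, target, k):
    """Count shared k-tuples per offset (target position minus query position)."""
    query_table = ktuple_positions(query, k)
    counts = Counter()
    for j in range(len(target) - k + 1):
        # .get() avoids inserting empty entries for words absent from the query
        for i in query_table.get(target[j:j + k], ()):
            counts[j - i] += 1
    return counts


counts = diagonal_counts("ACGTACGT", "TTACGTACGT", k=4)
# The offset shared by the most words marks the most promising diagonal.
print(counts.most_common(1))
```

Because every word is looked up directly in the table, the work grows with the number
of words in the two sequences rather than with the product of their lengths.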

Using this approach, the search time increases linearly with the size of the query and
target sequences. Compared to dynamic programming, whose running time grows with the
product of the two sequence lengths, FASTA presents a much faster alternative,
particularly as the sequence size increases.

For DNA and RNA sequences, the typical size of the k-tuple in the FASTA algorithm is
4-6, while in protein sequences it is 1 or 2. The larger the k-tuple, the faster FASTA will
run, but the less thorough it will be in determining regions of similarity.

Significance of fasta scores

In order to determine the significance of an alignment for a target database and a query
sequence, FASTA calculates the u and lambda parameters for the extreme value
distribution, which will vary with the length and the composition of the sequences being
compared. The z-score for each score is calculated as follows:

1) The average score for database sequences in the same length range is determined.
2) The average score is plotted against the logarithm of average sequence length in each
length range.
3) The points are then fitted to a straight line by linear regression.
4) A z score, the number of standard deviations from the fitted line, is calculated for
each score.
5) High-scoring, presumably related sequences, and also very low scoring alignments
that do not fit the straight line are removed from consideration.
6) Steps 1-5 are repeated one or more times.
7) The known statistical distribution of alignment scores is used to calculate the
probability that a Z score between unrelated or random sequences of the same lengths
as the query and database sequence could be greater than z. These scores follow an
extreme value distribution such that (Pearson, 2000 ISMB):

\[ P(Z > z) = 1 - \exp\left(-e^{-1.2825 z - 0.5772}\right) \]

The expectation of observing a Z-score greater than z in a database of D
sequences is:

\[ E(Z > z) = D \cdot P(Z > z) \]

8) Z scores are then normalized to z' = 50 + 10z, so that an alignment whose score lies
5 standard deviations above the fitted line has a normalized score of 100.

9) The significance of the alignment score between a sequence and a database can be
further analyzed by aligning a sequence with a shuffled library.
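
The formulas in steps 7 and 8 are easy to evaluate directly. The following Python
sketch simply encodes them as written above; the function names are chosen for this
example, and the database size of 66,345 sequences is the one shown in the sample
output later in this section.

```python
import math


def p_value(z):
    """P(Z > z) under the extreme value distribution given in step 7."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-1.2825 * z - 0.5772))


def expectation(z, database_size):
    """E(Z > z) = D * P(Z > z) for a database of D sequences."""
    return database_size * p_value(z)


def normalized(z):
    """The normalized score z' = 50 + 10z from step 8."""
    return 50 + 10 * z


for z in (0, 2, 5, 7):
    print(z, normalized(z), p_value(z), expectation(z, 66345))
```

Note how quickly the expectation falls once z' passes 100: each additional standard
deviation shrinks it by a factor of roughly e^1.2825, or about 3.6.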

HISTOGRAM OF FASTA DATA

One of the items reported in the FASTA output is a histogram showing a graphical
representation of the distribution of the normalized scores obtained when the library
sequences are matched with the query sequence. Scores from unrelated sequences are
expected to follow the extreme value distribution described above, and any significant
matches will fall in the upper tail, well outside the expected curve.

The first column listed in the fasta score distribution is the z’ score, which is a z score
normalized to a mean of 50 and a standard deviation of 10. The second column lists the
number of optimized scores found in that range. The third column lists the number of
expected sequences to lie within a range, given an extreme value distribution and the
calculated values of u and lambda.

The “=” signs give an approximate curve for the actual distribution, while the “*”
indicates the expected score distribution.

The z’-scores greater than 120 are considered to be high-scoring alignments.


opt E()
< 20 188 0:==
22 0 0: one = represents 109 library sequences
24 0 0:
26 2 1:*
28 7 15:*
30 28 91:*
32 200 353:== *
34 841 958:========*
36 2217 1968:==================*==
38 3746 3253:=============================*=====
40 5360 4538:=========================================*========
42 6055 5547:==================================================*=====
44 6496 6119:========================================================*===
46 5820 6232:====================================================== *
48 5469 5966:=================================================== *
50 4820 5444:============================================= *
52 4202 4787:======================================= *
54 3815 4089:=================================== *
56 3271 3415:===============================*
58 2755 2804:=========================*
60 2268 2271:====================*
62 1813 1821:================*
64 1500 1448:=============*
66 1233 1145:==========*=
68 951 900:========*
70 746 706:======*
72 699 551:=====*=
74 460 430:===*=
76 337 335:===*
78 287 260:==*
80 244 202:=*=
82 185 154:=*
84 115 122:=*
86 114 95:*=
88 75 73:* inset = represents 1 library sequences
90 70 57:*
92 48 44:* :=======================================*
94 26 34:* :========================== *
96 33 26:* :=========================*=======
98 14 20:* :============== *
100 10 16:* :========== *
102 7 12:* :======= *
104 6 9:* :====== *
106 5 7:* :===== *
108 2 6:* :== *
110 2 4:* :== *
112 1 3:* := *
114 0 3:* : *
116 0 2:* : *
118 0 2:* : *
>120 27 1:* :*==========================

After the histogram comes the Kolmogorov-Smirnov statistic, which gives some
information about the deviation between the observed and expected distributions. If
the deviation is significant enough, then the alignment should be performed again with
different gap penalties.

After the statistics is a list of the best-scoring hits. Note that FASTA presents at most one
highest-scoring hit per sequence, whereas other alignment programs may present many.
Listed in the hits section are the description of the sequence, the initn, init1, and opt
scores, the z' score, and the E value. The init1 score is the score of the best single
ungapped initial region; the initn score is obtained by joining several initial regions, with
a penalty for each gap; the opt score is the score of the optimized alignment computed
within a band around the best initial region (see Figure 7.2 of Mount for a more in-depth
explanation). The E value estimates the number of matches with a score at least this high
that would be expected by chance.

The best scores are:                                  initn init1  opt   z-sc E(66345)
MERR_PSEAE mercuric resistance operon regu    ( 144)    928   928  928 1129.8 0
MERR_SHIFL mercuric resistance operon regu    ( 144)    871   871  871 1061.3 0
MERR_SERMA mercuric resistance operon regu    ( 144)    810   810  810  988.1 0
MERR_STAAU mercuric resistance operon regu    ( 135)    292   172  298  373.6 3.5e-14
MERR_BACSR (strain rc607). mercuric resist    ( 132)    241   198  289  363.0 1.4e-13
YHDM_ECOLI hypothetical transcriptional re    ( 141)    175   175  276  347.0 1.1e-12

After the list of the highest-scoring hits are the Smith-Waterman alignments between the
query and the highest-scoring hits. A ‘:’ marks an identical residue; ‘.’ denotes a
conservative substitution:
>>MERR_STAAU mercuric resistance operon regulatory protei (135 aa)
initn: 292 init1: 172 opt: 298 Z-score: 373.6 expect() 3.5e-14
Smith-Waterman score: 298; 36.923% identity in 130 aa overlap

10 20 30 40 50 60
MerR MENNLENLTIGVFAKAAGVNVETIRFYQRKGLLLEPDKPYGSIRRYGEADVTRVRFVKSA
. :. .::: :: ::.:.:.::::. : . .. : :.: . ::::.:
MERR_S MGMKISELAKACDVNKETVRYYERKGLIAGPPRNESGYRIYSEETADRVRFIKRM
10 20 30 40 50

70 80 90 100 110
MerR QRLGFSLDEIAELLRL--EDGTHCEEASSLAEHKLKDVREKMADLARMEAVLSELVCACH
..: ::: :: :. . .:: .:.. ... .: :....:. : :.. .: :: :
MERR_S KELDFSLKEIHLLFGVVDQDGERCKDMYAFTVQKTKEIERKVQGLLRIQRLLEELKEKCP
60 70 80 90 100 110

120 130 140
MerR ARRGNVSCPLIASLQGGASLAGSAMP
... .::.: .:.::
MERR_S DEKAMYTCPIIETLMGGPDK
120 130

FASTA Programs

FASTA – compares a query protein sequence to a protein sequence library or a DNA
sequence to a DNA sequence library.

TFASTA – compares a query protein sequence to a DNA sequence library, after the
DNA sequence library has been translated in all six reading frames.

FASTF – compares a set of ordered peptide fragments, obtained from analysis of a
protein by cleavage and sequencing of protein bands resolved by electrophoresis, against
a protein database.

TFASTF – compares a set of ordered peptide fragments, obtained from analysis of a
protein by cleavage and sequencing of protein bands resolved by electrophoresis, against
a DNA database.

FASTS – compares a set of ordered peptide fragments, obtained from mass-spectrometry
analysis of a protein, against a protein database.

TFASTS – compares a set of ordered peptide fragments, obtained from mass-spectrometry
analysis of a protein, against a DNA database.

Example
>mgstm1
MGCEN,MIDYP,MLLAY,MLLGY

FASTX, FASTY – compares a query DNA sequence to a protein sequence database,
translating the DNA sequence in all six reading frames and allowing frameshifts.

TFASTX, TFASTY – compares a protein sequence to a DNA sequence or DNA
sequence library, such that the DNA sequence is translated in all six reading frames, and
the protein query sequence is compared to each of the six derived protein sequences. The
DNA sequence is translated from one end to the other; termination codons are translated
into unknown amino acids.

LALIGN, LFASTA – Same as the FASTA program, except that multiple aligning regions
may be reported for each sequence.

PLALIGN – dot plot algorithm available through the FASTA suite.

FAST-pat, FAST-swap – compares a sequence to a pattern database.

BLAST

Basic Local Alignment Search Tool

BLAST has supplanted FASTA as the most commonly used database search tool. BLAST
was developed as an improvement in speed over the FASTA suite without a sacrifice in
sensitivity.

The first step of the BLAST algorithm is to locate common words or k-tuples in the query
sequence and the target database sequences. However, BLAST does not search for every
possible k-tuple; it only considers those that are most significant. For the NCBI BLAST
program, the word length is fixed at 3 for proteins and 11 for nucleic acids. The size of
the k-tuple is referred to as the word length, and is the minimum length needed to achieve
a word score that is high enough to be significant but not so long as to miss short but
significant patterns.

MSP – Maximal Segment Pair: The highest scoring pair of identical length segments
chosen from two sequences. The boundaries of an MSP are chosen to maximize its
score, so an MSP can be of any length.

Segment pairs whose scores exceed a cutoff score S are reported. BLAST minimizes the
time spent on sequence regions where the score is unlikely to exceed this cutoff score.

The main strategy of BLAST is to seek only segment pairs that contain a word pair with a
score of at least T. Any such hit is extended to determine if it is contained within a
segment pair whose score is greater than or equal to the cutoff score S.

The scanning phase of BLAST locates the words within the sequences in linear time.
One method is to map each possible word to an integer so that it can be used as an index
into an array. For instance, if the word size is 4 and amino acids are used, there are
20^4 = 160,000 entries in the array. A second method is to use a deterministic
finite state automaton.
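
The first method can be sketched in Python as follows. This is a minimal illustration
of treating a word as a base-20 number, not the BLAST implementation; the alphabet
ordering and the function names are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 letters, arbitrary fixed order
RANK = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def word_to_index(word):
    """Read the word as a base-20 number so it can index directly into an array."""
    index = 0
    for aa in word:
        index = index * len(AMINO_ACIDS) + RANK[aa]
    return index


def build_word_index(seq, w):
    """Array with 20**w slots; each slot lists the positions where that word starts."""
    table = [[] for _ in range(len(AMINO_ACIDS) ** w)]
    for pos in range(len(seq) - w + 1):
        table[word_to_index(seq[pos:pos + w])].append(pos)
    return table


table = build_word_index("ACDEACDE", 4)
print(len(table))                       # number of array entries for word size 4
print(table[word_to_index("ACDE")])     # start positions of the word ACDE
```

Looking up a word is then a single array access, which is what makes the scan linear
in the length of the sequence.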

Hit Extension

Initial hits are then examined, and extended in either direction until the score falls
below a certain threshold.

In order to get around the problem of using uninformative hits, BLAST stores a list of
words that are found much more often than expected by chance. Hits to these words are
discarded from consideration.
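
A simplified ungapped extension can be sketched as follows. This is an illustration
only: it assumes the +5/-4 DNA scoring scheme used in these notes and a hypothetical
drop-off parameter x_drop that stops the extension once the running score falls too
far below the best score seen so far. The real BLAST code differs in many details.

```python
def extend_hit(query, target, q_pos, t_pos, word_len,
               match=5, mismatch=-4, x_drop=10):
    """Extend a word hit in both directions without gaps.

    Returns (best_score, query_start, query_end) for the best segment found,
    with query_end exclusive.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair_score(query[q_pos + o], target[t_pos + o])
               for o in range(word_len))

    # Extend to the right, remembering where the best score was reached.
    running, best_right, o = best, 0, 0
    while q_pos + word_len + o < len(query) and t_pos + word_len + o < len(target):
        running += pair_score(query[q_pos + word_len + o],
                              target[t_pos + word_len + o])
        o += 1
        if running > best:
            best, best_right = running, o
        elif best - running > x_drop:
            break

    # Extend to the left, starting from the best right-extended score.
    running, best_left, o = best, 0, 0
    while q_pos - 1 - o >= 0 and t_pos - 1 - o >= 0:
        running += pair_score(query[q_pos - 1 - o], target[t_pos - 1 - o])
        o += 1
        if running > best:
            best, best_left = running, o
        elif best - running > x_drop:
            break

    return best, q_pos - best_left, q_pos + word_len + best_right


hit = extend_hit("GGGACGTACGTTTT", "CCCACGTACGTAAA", q_pos=3, t_pos=3, word_len=4)
print(hit)
```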

Steps Used by BLAST

1) The sequence is optionally filtered to remove low-complexity regions that will not
lead to meaningful sequence alignments.
2) A list of words of the predefined word length (3 for amino acids; 11 for DNA
sequences) in the query sequence is made.
3) Each query word is compared with every possible word of the same length and the
match is scored, using substitution matrix scores for amino acids and a +5/-4
(match/mismatch) scoring scheme for DNA.
4) A cutoff score, called the neighborhood word score threshold (T), is selected to
pare down the list of possible matching words to those most likely to result in
significant alignments.
5) The procedure is repeated for each word in the query sequence.
6) The remaining high-scoring words for each possible match to a word are
organized into an efficient search tree.
7) Each database sequence is scanned for an exact match to one of the words in the
search tree, one position at a time. If a match is found, it is used to seed a
possible ungapped alignment between the query and database sequences.
8) (UNGAPPED BLAST – VERSION 1.0) In the original BLAST suite of
programs, an attempt is made to extend an alignment from the matching words in
each direction along the sequences, as long as the score does not drop below a
certain threshold. At this point, a larger stretch of sequence, called the HSP
(high-scoring segment pair), which has a larger score than the original word, may
have been found.

(GAPPED BLAST – VERSION 2.0) In the newer version of BLAST, the
neighborhood word threshold T is reduced in order to find shorter matching word
hits that can be aligned along the same diagonal.

9) The score of each HSP is compared against a cutoff score S, which is empirically
determined.

10) The statistical significance for each HSP is calculated using the Karlin-Altschul
statistics and the extreme value distribution, as previously discussed with
sequence alignments. Recall that the probability, p, of observing a score S greater
than or equal to x is given by the equation:

\[ P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right) \]

where

\[ u = \frac{\ln (K m' n')}{\lambda} \]

and m' and n' are the effective lengths of the query and database, such that

\[ m' \approx m - \frac{\ln (K m n)}{H} \]

\[ n' \approx n - \frac{\ln (K m n)}{H} \]

where H is the average expected score per aligned pair of residues in an alignment of
two random sequences; m and n are the lengths of the query and database; K and lambda
are parameters calculated based on the sequences and the scoring scheme.

These effective, or reduced, lengths are used as a correction factor in order to allow
alignments starting near the end of one of the sequences to be detected.

The expectation, E, of seeing a score S >= x in a database of D sequences is
approximately given by the Poisson distribution,

\[ E \approx 1 - e^{-P(S \ge x) D} \]

11) Two or more HSP regions may be combined into a longer alignment region, even
though the individual HSPs may result in a lower score.

12) Smith-Waterman type alignments are shown for the query sequence with each of
the matched sequences in the database. BLAST-2 can produce alignments with
gaps, while BLAST-1 cannot.

13) When the expectation E for a given database sequence satisfies the user-selected
threshold for E, the match is reported.
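
The significance calculation in step 10 can be written out as a short Python sketch.
It encodes the equations exactly as given above; the function names are invented for
the example, and the parameter values used in the demonstration (lambda = 0.267,
K = 0.041, H = 0.14, a 250-residue query and a 50-million-residue database) are
illustrative assumptions rather than values taken from these notes.

```python
import math


def effective_lengths(m, n, K, H):
    """m' and n': query and database lengths reduced by ln(Kmn) / H."""
    correction = math.log(K * m * n) / H
    return m - correction, n - correction


def score_p_value(x, m, n, K, lam, H):
    """P(S >= x) under the extreme value distribution, with u = ln(K m' n') / lambda."""
    m_eff, n_eff = effective_lengths(m, n, K, H)
    u = math.log(K * m_eff * n_eff) / lam
    # -expm1(-y) equals 1 - exp(-y) and stays accurate for very small y.
    return -math.expm1(-math.exp(-lam * (x - u)))


def database_expectation(p, D):
    """E as written in the text: 1 - exp(-p * D) for a database of D sequences."""
    return -math.expm1(-p * D)


# Illustrative (assumed) parameter values.
params = dict(m=250, n=50_000_000, K=0.041, lam=0.267, H=0.14)
for x in (60, 80, 100):
    p = score_p_value(x, **params)
    print(x, p, database_expectation(p, D=100_000))
```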

SHOW STEPS OF BLAST


SHOW EXAMPLES OF BLAST OUTPUT USING SEQUENCES FROM THE CLASS

BLAST Programs

BLASTP: Compares a protein query sequence against a protein database, allowing for
gaps
BLASTN: Compares a DNA query sequence against a DNA database, allowing for gaps
BLASTX: Compares a DNA query sequence, translated into all six reading frames,
against a protein database, allowing for gaps
TBLASTN: Compares a protein query sequence against a DNA database, translated into
all six reading frames, allowing for gaps
TBLASTX: Compares a DNA query sequence, translated into all six reading frames,
against a DNA sequence database, translated into all six reading frames. TBLASTX does
not allow for gaps.

There are a number of different BLAST options. One list of these options and a
description of them is available through the WU-BLAST home page:

http://blast.wustl.edu/blast/README.html

BAYES BLOCK ALIGNER

Another approach to searching databases is the Bayes Block Aligner. The methodology
behind the Bayes Block Aligner is to find all possible blocks (ungapped aligned regions)
located within two sequences. A large number of possible alignments between the two
sequences are generated by aligning combinations of blocks. Gaps will be present
between the blocks.

The Bayes Block Aligner uses Bayesian statistics to derive the posterior probabilities of
each alignment, assuming various scoring models and different numbers of blocks. This
approach has been shown to locate some weak, yet real, similarities between sequences.

SSAHA

SSAHA stands for Sequence Search and Alignment by Hashing Algorithm. It can align
DNA sequences by converting the sequence information into a ‘hash table’ data structure
that can then be searched very rapidly for matches.

SSAHA is best suited to locating identical or nearly identical matches. The hash word
length is 10 bases by default. Example applications include SNP detection, rapid
sequence assembly, and detecting the order and orientation of contigs.

SSEARCH

While dynamic programming algorithms can be painfully slow when searching against
large databases, they are more likely to discover sequences that are distantly related to
one another. SSEARCH is one program that implements the Smith-Waterman approach
to sequence alignment.

ftp.virginia.edu/pub/fasta

SSEARCH is part of the FASTA suite of programs. This approach compares a protein
sequence to another protein sequence or sequence database (or DNA sequence to a DNA
sequence or database) using enhanced Smith-Waterman local sequence alignments.

BLAT
BLAT (BLAST-Like Alignment Tool), developed by Jim Kent at UCSC, is used to
locate smaller regions of higher identity within genomic assemblies. BLAT on nucleic
acids will quickly identify regions at least 95% similar consisting of 40 bases or more.
More divergent and shorter sequence alignments may be missed. BLAT on amino acids
will find sequences at least 80% similar consisting of at least 20 amino acids.

DNA BLAT works by keeping an index of an entire genome in memory, where the index
consists of all non-overlapping 11-mers except those involved in repetitive elements. For
the human genome, this corresponds to a little less than a gigabyte of RAM. Protein
BLAT works in the same fashion, except that 4-mers are used. The protein index is
slightly larger than 2 gigabytes for humans.

BLAT is a very fast tool for localizing highly similar regions. However, distant
homologies are not detected. The typical use for BLAT is to localize a specific sequence
on a genome. This can be very useful, since the BLAT web interface directly ties to the
UCSC GoldenPath genomic browser.

The BLAT web server is:

http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=human

SEQUENCE FILTERING

Low-Complexity Regions

Low-complexity regions are amino acid or DNA sequence regions that carry very little
information due to their highly biased composition. Examples of low-complexity regions
include histidine-rich domains in proteins, poly-A tails in DNA sequences, poly-G
tails in nucleotides, runs of purines, runs of pyrimidines, runs of a single amino acid, etc.

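The complexity of a window of size L can be calculated as:

\[ K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_{i} n_i!} \right) \]

where N is the size of the alphabet (4 for nucleic acids, 20 for amino acids) and n_i is
the number of times residue i occurs in the window.

A K value of 0 indicates a region of very low complexity; a value close to 1 indicates
high complexity.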

Consider the sequence AAAA:

L is 4, so L! = 4*3*2*1 = 24
nA = 4; nC = nG = nT = 0
So the product of the factorials is 4!*0!*0!*0! = 24
K = 1/4 * log4(24/24) = 0, so this is a low-complexity region.

Now consider the sequence ACTG:

L is 4, so L! = 4*3*2*1 = 24
nA = nC = nG = nT = 1
So the product of the factorials is 1!*1!*1!*1! = 1
K = 1/4 * log4(24/1) = 0.573

Short, periodic repeats

Another possible source of low information is regions of DNA or amino acid sequence
containing repeats with a short periodicity (such as 10 bases long). Examples of such
sequences commonly found in DNA are short tandem repeats.

Fortunately, there are programs that remove such sequences from a query and a target
database before the search is performed.

SEG, PSEG are programs developed at NCBI that are used to mask out low-complexity
regions in amino acid sequences.
NSEG is the NCBI program that masks out low-complexity regions in nucleic acid
sequences.
DUST is another program that removes low-complexity regions in DNA sequences.

Each of these programs calculates the complexity for a given window using the algorithm
defined above and masks out regions based on a given complexity threshold.
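
The windowed calculation can be sketched in Python. This is a minimal illustration of
the complexity formula and a sliding-window mask, not the code of SEG, NSEG or DUST;
the function names, window size and threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / product of n_i! ) for one window."""
    length = len(window)
    arrangements = math.factorial(length)
    for count in Counter(window).values():
        arrangements //= math.factorial(count)   # each division is exact
    return math.log(arrangements, alphabet_size) / length


def mask_low_complexity(seq, window=4, threshold=0.3,
                        alphabet_size=4, mask_char="N"):
    """Replace every residue covered by a window whose K falls below threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if complexity(seq[start:start + window], alphabet_size) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(complexity("AAAA"))                    # the first worked example
print(round(complexity("ACTG"), 3))          # the second worked example
print(mask_low_complexity("ACGTAAAAACGT"))
```

In practice much longer windows are used; a window of four is kept here only so that
the output can be checked against the two worked examples.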

XNU is a program that will locate internal repeats with a short periodicity.

Interspersed Repeats

In addition to short, periodic repeats, genomes are filled with longer interspersed
repetitive elements. These can be in the form of short interspersed elements (SINEs) on
the order of 300 bases long, or long interspersed elements (LINEs) on the order of 1-2
kb long. There are other classes of interspersed repeats, including mammalian-wide
interspersed repeats (MIRs), and other elements that have been transposed and fixed into
genomes through viral-like events. Transposable elements are numerous in many plant
species, which leads to large genome sizes.

Consider the human genome. Somewhere around 50% of its composition comes from
interspersed repeats. Thus, these regions should be masked out as well.

RepeatMasker is a program that takes a query sequence and compares it against a set of
repetitive element databases. Those regions in the query which match a repetitive
element are masked out. Run in its native mode, RepeatMasker calls cross_match, which
implements a Smith-Waterman dynamic programming algorithm to locate instances of
repetitive elements. An enhancement to the RepeatMasker software, called MaskerAid,
uses BLAST to reduce the time it takes to locate repeats.

The libraries of repeats that RepeatMasker uses are maintained by the Genetic
Information Research Institute in their Repbase database.

Soft versus Hard masking

There are generally two approaches to masking out repetitive elements. The first
approach, in which the repetitive element sequences are replaced by either N or X
characters, is called hard masking. The second approach, soft masking, is becoming
more popular: it maintains the sequence data, but denotes repetitive portions of
sequences with lowercase letters. Soft masking is often preferred because the original
residues are preserved and remain available to later analyses. However, when using
either approach it is important to understand how the sequence analysis software you
are using will treat each of these.
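
The difference between the two approaches is easy to see in code. The sketch below
assumes the repeat locations are already known and are supplied as hypothetical
(start, end) pairs, with the end position exclusive; the function names are invented
for the example.

```python
def hard_mask(seq, regions, mask_char="N"):
    """Overwrite every masked position with N (or X for protein sequences)."""
    chars = list(seq.upper())
    for start, end in regions:
        for i in range(start, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def soft_mask(seq, regions):
    """Keep the residues but write masked positions in lowercase."""
    chars = list(seq.upper())
    for start, end in regions:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


repeat_regions = [(2, 6)]                      # assumed repeat annotation
print(hard_mask("ACGTACGTAC", repeat_regions))
print(soft_mask("ACGTACGTAC", repeat_regions))
```

A hard-masked sequence can no longer be unmasked, whereas a soft-masked sequence can
be converted to either form later on.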

Removal of Vector Sequence

Searching Databases with PSSMs

Searching a database with a PSSM (position-specific scoring matrix) is very similar to
testing whether or not a sequence belongs to the family that the PSSM defines. Every
position in each database sequence is evaluated as a possible match position by sliding
the PSSM along one sequence at a time. Positions with high scores are the best matches,
and can be quickly identified.
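
The sliding procedure can be sketched in Python. The three-column DNA matrix below is
a made-up toy example with invented scores, used only to show the mechanics; a real
PSSM would be derived from a multiple alignment and would normally hold log-odds
scores.

```python
def score_windows(seq, pssm, default=-1):
    """Slide the PSSM along seq; return one (score, start) pair per position."""
    width = len(pssm)
    results = []
    for start in range(len(seq) - width + 1):
        score = sum(column.get(seq[start + i], default)
                    for i, column in enumerate(pssm))
        results.append((score, start))
    return results


def best_hit(seq, pssm):
    """Highest-scoring window; ties are broken in favor of the later start."""
    return max(score_windows(seq, pssm))


# Toy matrix (hypothetical scores): one dictionary per column, favoring A-C-G.
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
print(score_windows("TTACGTT", toy_pssm))
print(best_hit("TTACGTT", toy_pssm))
```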

EXAMPLES: BLOCKS Server; MAST Server (p323 for more)

Searching Databases with Regular Expressions

Certain databases (such as ProSite) allow the databases to be searched using a regular
expression.

PSI-BLAST

PSI-BLAST (Position-Specific Iterated BLAST) is a newer version of BLAST designed to
take in an initial query sequence and find similar sequences, which can then be multiply
aligned to create a scoring matrix that is used to search the database for even more
matches. Any newly found sequences can then be added to the multiple alignment. This
process of iteratively building the multiple alignment continues until the user is
satisfied with the search results.

Caution should be used with PSI-BLAST, since the algorithm is greedy in the sense that
the most recently added sequences influence which sequences are found in the next
round.

PHI-BLAST

PHI-BLAST (Pattern-Hit Initiated BLAST) functions in the same manner as PSI-BLAST, except
that the query sequence is first searched for a complex pattern, or regular expression,
provided by the user. The subsequent search for similar sequences is then focused on
regions containing the pattern. One example of a regular expression that might be used
is:

[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
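
A pattern in this notation can be translated into an ordinary regular expression. The sketch below handles only a subset of the notation (single residues, `x`, bracketed sets, `{...}` exclusions, and repeat counts in parentheses); the function name and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (subset of the syntax) into a Python regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an element such as "x(5,11)" into its core "x" and repeat "5,11".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # x matches any residue
            core = "."
        elif core.startswith("{"):           # {AB} means any residue except A or B
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
regex = prosite_to_regex(pattern)
print(regex)

# Hypothetical sequence containing one occurrence of the pattern.
hit = re.search(regex, "MKLGEAGLAAAAARSAALASTT")
print(hit.start(), hit.group())
```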

Reference Books:

BLAST
by Ian Korf, Mark Yandell, and Joseph Bedell; edited by Lorrie LeJeune (2003)

Sequence Databases
Major Sequence Repositories

Many of the applications in computational biology and bioinformatics are based on the
analysis of nucleotide and protein sequences. There are three major repositories that
contain all of the known nucleotide and protein sequences. They all share their
information with each other through the International Nucleotide Sequence Database
Collaboration. These three repositories are:

DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database http://www.ebi.ac.uk.embl.html
GenBank http://www.ncbi.nlm.nih.gov/

Currently, GenBank contains over 28 billion nucleotide bases, representing over 22
million sequences in over 100,000 species. This represents a large amount of data to be
stored! Looking at the growth of GenBank over the past 20 years, we can see the
explosion of sequence data, particularly in the last five years.

Image source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html


Genome Databases

Nucleotide sequence information has also been organized in such a manner that it is
stored in genome databases. One of the most widely used resources of genomic data is
the UCSC Genome Browser, which contains genome assemblies and annotation for the
rat, mouse and human genomes. Another widely used resource is the Ensembl genome
browser.

Other genome databases include: WormBase, which contains information on the C.
elegans and C. briggsae worm genomes; AceDB, which contains information on the C.
elegans, S. pombe, and H. sapiens genomes; Comprehensive Microbial Resource, which
contains information on 95 completed microbial genomes; FlyBase – Drosophila
melanogaster genome sequence; HIV sequence database; MOsDB: rice genome
database; MGD – Mouse Genome Database; Rat Genome Database; Saccharomyces
Genome Database; The Arabidopsis Information Resource; ArkDB: Genome databases
for animals; along with many other genomic resources.

Ensembl Genome Browser: http://www.ensembl.org
UCSC Genome Browser: http://genome.ucsc.edu/
WormBase: http://www.wormbase.org/
AceDB: http://www.acedb.org/
Comprehensive Microbial Resource: http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
FlyBase: http://flybase.bio.indiana.edu/
HIV Sequence Database: http://hiv-web.lanl.gov/
MOsDB Rice Database http://mips.gsf.de/gams/rice/index.jsp
MGD Mouse Genome Database: http://www.informatics.jax.org/
Rat Genome Database: http://rgd.mcw.edu/
Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
ArkDB: http://thearkdb.org/

Gene Databases

Once a genome is in place, it is desirable to study the regions that make a particular
organism what it is. Much of this information is located in the genic regions of the
organism. Several databases of genes and related structures exist. Perhaps the largest
such database is the RefSeq database curated at NCBI. This data set contains a
non-redundant collection of naturally occurring molecules. These are typically given as
mRNA sequences about which various information is known. For instance, an mRNA could be
studied and annotated well enough that it is known to come from a genic region, or it
could be a predicted mRNA, where the prediction is based either on computational
methods or on the mapping of EST sequences onto the region.

Other gene and gene structure databases include: AllGenes: Human and mouse gene
index integrating gene, transcript and protein annotation; ASAP: Alternatively spliced
isoforms of genes; ExInt: exon-intron structures of genes; IDB/IEDB: intron sequence
and evolution; SpliceDB: Canonical and non-canonical mammalian splice sites; GDB and
GenAtlas: Human genes and genomic maps; HS3D: Human exon, intron and splice
regions.

RefSeq: NCBI Reference Sequence Project http://www.ncbi.nlm.nih.gov/RefSeq/
AllGenes: http://www.allgenes.org
GDB http://www.gdb.org/
GenAtlas: http://www.citi2.fr/GENATLAS/
Genew (Approved gene names): http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl
ASAP: Alternatively spliced genes http://www.bioinformatics.ucla.edu/ASAP
ExInt: http://intron.bic.nus.edu/sg/exint/exint.html
IDB/IEDB: http://nutmeg.bio.indiana.edu/intron/index.html
SpliceDB: http://genomic.sanger.ac.uk/spldb/SpliceDB.html
HS3D: http://www.sci.unisannio.it/docenti/rampone/

SNP Resources

In human sequences, single base changes are thought to occur approximately once every
2000 bases between individuals. While this may not seem like a lot, it still leads to
over 1.6 million single nucleotide polymorphisms (SNPs) in the human population. SNPs
play an important role in differentiation, but can also be the cause of disease (one
example is sickle-cell anemia). Databases to locate and characterize SNPs are available.
These include dbSNP; the SNP Consortium database; and rSNP Guide (single nucleotide
polymorphisms in regulatory gene regions).

dbSNP: database of single nucleotide polymorphisms http://www.ncbi.nlm.nih.gov/SNP/
SNP Consortium database: http://snp.cshl.org/
rSNP Guide: http://util.bionet/nsc.ru/databases/rsnp.html

EST Resources

ESTs are expressed sequence tags, which are partial copies of mRNA found within a
particular cell. Information from ESTs can be used to tell the splicing patterns of genes,
the occurrence of genes, etc.

dbEST http://www.ncbi.nlm.nih.gov/dbEST/
Gene Resource Locator (Alignment of ESTs with finished human sequence)
http://grl.gi.k.u-tokyo.ac.jp
HUNT: Annotated human full-length cDNA sequences http://www.hri.co.jp/HUNT/
Sputnik: Annotation of clustered plant ESTs: http://mips.gsf.de/proj/sputnik
STACK: non-redundant, gene-oriented clusters: http://www.sanbi.ac.za/Dbases.html
TIGR Gene Indices: non-redundant EST clusters: http://www.tigr.org/tdb/tgi.shtml
UniGene: non-redundant EST clusters: http://www.ncbi.nlm.nih.gov/UniGene/

Binding Sites, Promoters, etc.

Besides locating genes within the genome, it is important to understand the signaling
mechanisms that an organism employs in order to turn a gene on or off. Databases of
various factors such as promoters and transcription factor binding sites are available.
These include: DBTBS: Bacillus subtilis binding factors and promoters;
EPD: Eukaryotic POL II Promoters; PromEC: E. coli mRNA promoters; TRANSFAC:
Transcription factors and binding sites.

DBTBS: http://elmo.ims.u-tokyo.ac.jp/dbtbs/
EPD: http://www.epd.isb-sib.ch/
PromEC: http://bioinfo.md.huji.ac.il/marg/promec
TRANSFAC: http://transfac.gbf.de/TRANSFAC/index.html

Protein Databases

The central dogma states that DNA is transcribed into RNA, which in turn is translated
into protein. Since genes code for proteins, it is important to store known
information about proteins in databases. There are many different protein
databases, many of them dealing with specific protein families. Databases for curated
proteins include:

InterPro: Protein families and domains http://www.ebi.ac.uk/interpro
EXProt: proteins with experimentally verified functions: http://www.cmbi.nl/exprot
Protein Information Resource (PIR): http://pir.georgetown.edu/
SWISS-PROT/TrEMBL curated protein sequences: http://www.expasy.ch/sprot

Protein Sequence Motifs (Domains)

In addition to individual proteins, families of proteins can be defined by conserved
regions called motifs or domains. Databases that store this information include:

BLOCKS (Multiple alignments of conserved regions) http://blocks.fhcrc.org/
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
eMOTIF: http://motif.stanford.edu/emotif/
Pfam: http://www.sanger.ac.uk/Software/Pfam/
PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
ProDom: http://www.toulouse.inra.fr/prodom.html
PROSITE: http://www.expasy.org/prosite
ProtoMap: http://protomap.cornell.edu

Structure Databases

After a protein sequence has been created, it takes on a three dimensional structure.
Various structure databases exist that contain proteins where the structure is known,
typically through NMR and X-ray crystallography. Some of the larger structure
databases include:

ASTRAL http://astral.stanford.edu/
PDB http://www.pdb.org/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop
MMDB http://www.ncbi.nlm.nih.gov/Structure/

Gene Expression Databases (Microarray experiments; etc)

Once the location and sequence of genes is known, the next step is to determine their
function. Various biological experiments can be performed on gene data, including the
newer microarray technology which we will cover in class. Databases containing the
results of this experimental data are available. Included might be experimental images,
analysis of results, etc. Examples of experimental Gene Expression and Metabolic
pathway databases are:

ArrayExpress http://www.ebi.ac.uk/arrayexpress
BodyMap http://bodymap.ims.u-tokyo.ac.jp/
HugeIndex http://hugeindex.org/
Mouse Atlas and Gene Expression Database: http://genex.hgu.mrc.ac.uk/
NetAffx http://www.affymetrix.com/
Stanford Microarray Database http://genome-www.stanford.edu/microarray/
KEGG http://www.genome.ad.jp/kegg/
Klotho http://www.ibc.wustl.edu/klotho/
MetaCyc http://ecocyc.org/

Disease Databases

After the function of genes is known, those genes involved in disease are classified.
Mutational databases include:

OMIM: http://www.ncbi.nlm.nih.gov/Omim/
OMIA: http://www.angis.org.au/omia/
HGMD: http://www.hgmd.org/
Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Literature References

After all this work has been done, there needs to be a way to search the
literature for references to a specific gene, disorder, organism, sequencing project, etc.
The most widely used resource in this regard is the PubMed database:
http://www.ncbi.nlm.nih.gov/PubMed/

Factors in Considering Biological Databases

What is important as far as databases are concerned:

Fast retrieval of information
Ability to store large amounts of data
Ability to update data – databases provide a moving target
Choice of paradigm – object oriented or relational?

Storage of Data

Next week – Dr. Chang will talk about storage of data


GenBank – flat file format
ensembl – mySQL ports
XML ports of databases as well

DISCUSS ORACLE PAPER:

http://otn.oracle.com/oramag/oracle/03-jan/o13science.html
Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) are probabilistic models for studying sequences of
symbols. In particular, HMMs can model matches, mismatches, insertions and deletions
of symbols. Hidden Markov Models have their roots in speech recognition
problems.

In speech recognition, the problem is to identify the phonemes (or words) that have been
spoken in a particular time frame. Consider the difficulty. Everyone you meet has a
different voice. Everyone speaks with a slight variation, which might be caused by an
accent, a cold, or differences in physiological development. However, humans are able to
distinguish what the speaker is saying. The idea behind speech recognition is to take in a
spoken word and to try to fit it to a specific model of possible words. This may in fact be
close to what the brain does – just think about the Sprint PCS commercials!

Problems in sequence analysis are similar. For instance, given an amino acid sequence,
we may want to determine the protein family to which it belongs. Now the amino acid
sequence can be treated similarly to the speech signal in a given frame, and the amino
acids can be treated as the phonemes.

Markov Chain

A Markov chain is a probabilistic model that generates a sequence in which the
probability of a symbol depends upon the previous symbol. A traffic light is an example
of a Markov chain. A Markov chain can be used to model a random DNA sequence, where
there are four states, A, C, G, and T, one for each letter in the alphabet. From any
given state, there is a transition to another state (or back to the same state) with an
associated probability called a transition probability. An example Markov chain can be
drawn as follows:

[Figure: Markov chain with states A, C, G, and T, plus start and END states]

The key property of a Markov chain is that the probability of a symbol S at position p
(Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not on the
entire previous sequence.

Since the probability of a symbol is dependent upon the previous symbol, a prime
example for the use of Markov chains is in the detection of CpG islands, which are rich in
the dinucleotide CG.

The process of methylation in biological systems will typically convert the C of a CG
dinucleotide to a T with high probability. As a result, there
will be an overabundance of the dinucleotide TG and an underabundance of the
dinucleotide CG. If we ignore the start and end states for now, we can see that there are
sixteen different transitions. A study of regions of genomic DNA has determined normal
genomic transition probabilities to be the following, where the FROM node is labeled
along the rows to the left, and the TO node is labeled along the columns above:

       A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  0.292

The model shown above can then assign these weights to the edges of the graph.

In some regions of the genome, such as the promoter region of genes, methylation is
suppressed. In these regions, the dinucleotide CG is found in greater quantities. In fact,
the nucleotides C and G are found to a greater degree than elsewhere in the genome. A
study of regions of genomic DNA where CpG islands exist has determined the transition
probabilities to be the following:

       A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

A new model just like the one above can have its transition probabilities assigned
according to the new table. Now we have two different models: the first where CpG
islands are absent, and the second where CpG islands are present.

Let’s call the first model the non-CpG model and the second model the CpG model.

Given a new sequence, how would we determine whether it belongs to the non-CpG
model or the CpG model?

Remember, the key property of a Markov chain is that the probability of a symbol S at
position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1), and not
on the entire previous sequence.

Therefore, to find the probability that a sequence fits a model, you would multiply all of
the conditional probabilities:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)$$

which can be rewritten as:

$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$

where $a_{x_{i-1} x_i}$ is the transition probability from the residue at position $i-1$
to the residue at position $i$.

Let’s consider for now that in the non-CpG model, P(A) = P(T) = 0.3; P(C) = P(G) = 0.2,
so that A and T are more probable. In the CpG model, consider P(A) = P(C) = P(G) =
P(T) = 0.25.

Now consider the sequence: GGCGACG

The probability of this sequence is:

P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)

For the non-CpG model, this is calculated as:

(0.20)(0.298)(0.246)(0.078)(0.248)(0.205)(0.078) = 0.00000453499

For the CpG model, this is calculated as:

(0.25)(0.375)(0.339)(0.274)(0.161)(0.274)(0.274) = 0.00010526

Given this information, it is more likely that this sequence fits the CpG model. One thing
to note is how quickly the probability approaches zero as the sequence gets longer. This
shows the importance of working with log probabilities.
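
The calculation above can be sketched in code. The transition tables and initial probabilities are copied from the text; the function and variable names are illustrative choices. Working in log space avoids the underflow that direct multiplication runs into on long sequences.

```python
import math

BASES = "ACGT"


def table(rows):
    """Build a FROM -> TO -> probability lookup from rows given in A, C, G, T order."""
    return {f: dict(zip(BASES, probs)) for f, probs in zip(BASES, rows)}


NON_CPG = table([[0.300, 0.205, 0.285, 0.210],
                 [0.322, 0.298, 0.078, 0.302],
                 [0.248, 0.246, 0.298, 0.208],
                 [0.177, 0.239, 0.292, 0.292]])
CPG = table([[0.180, 0.274, 0.426, 0.120],
             [0.171, 0.368, 0.274, 0.188],
             [0.161, 0.339, 0.375, 0.125],
             [0.079, 0.355, 0.384, 0.182]])
NON_CPG_START = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
CPG_START = {b: 0.25 for b in BASES}


def log_prob(seq, start, trans):
    """log P(x) = log P(x_1) + sum over i of log a_{x_{i-1} x_i}."""
    lp = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp


seq = "GGCGACG"
lp_non = log_prob(seq, NON_CPG_START, NON_CPG)
lp_cpg = log_prob(seq, CPG_START, CPG)
print(f"non-CpG: {math.exp(lp_non):.6e}  CpG: {math.exp(lp_cpg):.6e}")
print("better fit:", "CpG" if lp_cpg > lp_non else "non-CpG")
```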

Using Markov models for discrimination

One question that might arise is how different the non-CpG and CpG models are in
relation to each other. If they are not different enough, then there is not enough
information to determine from which model a particular sequence is derived. In order to
test whether we are able to discriminate between the two models, a log ratio is taken for
each of the scores in the two previous tables to create a third table, where each entry, x, in
the new table is equal to: log2(P(x|CpG model) / P(x| non-CpG model)). The resulting
table is as follows:

       A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679

Using this log-odds ratio table as the scores, we sum the entries for each transition in
a sequence. A sequence with a negative total score is more likely to belong to the
non-CpG model, while a sequence with a positive total score is more likely to belong to
the CpG model.
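
A minimal sketch of log-odds scoring follows. It recomputes the log-odds entries from the two rounded transition tables given earlier, so individual entries may differ from the printed table in the third decimal place. Only transitions are scored; the probability of the first base is ignored, which is an assumption of this sketch.

```python
import math

BASES = "ACGT"


def table(rows):
    """Build a FROM -> TO -> probability lookup from rows given in A, C, G, T order."""
    return {f: dict(zip(BASES, probs)) for f, probs in zip(BASES, rows)}


NON_CPG = table([[0.300, 0.205, 0.285, 0.210],
                 [0.322, 0.298, 0.078, 0.302],
                 [0.248, 0.246, 0.298, 0.208],
                 [0.177, 0.239, 0.292, 0.292]])
CPG = table([[0.180, 0.274, 0.426, 0.120],
             [0.171, 0.368, 0.274, 0.188],
             [0.161, 0.339, 0.375, 0.125],
             [0.079, 0.355, 0.384, 0.182]])

# Each entry is log2(CpG transition probability / non-CpG transition probability).
LOG_ODDS = {f: {t: math.log2(CPG[f][t] / NON_CPG[f][t]) for t in BASES} for f in BASES}


def log_odds_score(seq):
    """Sum the log-odds entries over all adjacent pairs in the sequence."""
    return sum(LOG_ODDS[a][b] for a, b in zip(seq, seq[1:]))


for s in ("GGCGACG", "ATATATAT"):
    score = log_odds_score(s)
    print(s, round(score, 3), "CpG-like" if score > 0 else "non-CpG-like")
```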

Hidden Markov Models

The two models we have created can now be used to test which of the two models a
sequence belongs to. However, consider the case where we are given a long genomic
piece of DNA. How do we determine where the regions are where CpG islands are
located? Our models cannot answer these questions as they currently exist.

The solution to this problem is to combine both of our models into a single model where
there are small probabilities in switching back and forth between the two models. The
problem now becomes more complicated, because there are now two states
corresponding to each nucleotide symbol.

The difference between a Markov chain and a hidden Markov model is that a hidden
Markov model does not have a one-to-one correspondence between the states and the
symbols, and therefore it is no longer possible to tell what state the model was in when a
particular symbol was emitted. Therefore, the state is “hidden” from us.

In the example of the CpG islands used thus far, only one symbol is emitted at each state.
However, consider the example of multiple alignments where any one of a number of
amino acids is likely in any given column. In this case, the state of a hidden Markov
model could emit a symbol from a given distribution. The probabilities of emitting a
symbol given that you are in a specific state are referred to as the emission probabilities.

Using emission probabilities, the joint probability of seeing an observed sequence x and a
path π through the hidden Markov model is equal to:

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

Note that in this case, $e_{\pi_i}(x_i)$ is the probability of emitting the residue $x_i$
found at position $i$ in the sequence when you are at state $\pi_i$ in the path, and
$\pi_{L+1} = 0$ denotes the end state.

This equation is not all that useful, since it is often the case that the path is not known.
However, it is important to be able to calculate the most probable path.

Viterbi Algorithm – Most Probable Path

The Viterbi algorithm is a dynamic programming algorithm. The most probable path
through the HMM can be calculated recursively. If $v_k(i)$, the probability of the most
probable path ending in state $k$ with observation $i$, is known for all states $k$, then
for the next observation $x_{i+1}$, the probability of the most probable path ending in
state $l$ is equal to the probability of emitting the symbol $x_{i+1}$ while in state $l$,
multiplied by the maximum, over all previous states $k$ that can transition into state
$l$, of $v_k(i)$ times the transition probability $a_{kl}$:

$$v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$$

Initialization

In the initialization step, before you begin emitting any characters, set the probability of a
path of length 0 ending at state 0 to 1, and the probability of all other paths of length 0
ending at all other states equal to 0:

$$v_0(0) = 1; \quad v_k(0) = 0 \text{ for } k > 0$$

Recursion

The recursion step goes through the whole length of the input sequence, one symbol at a
time, and calculates the maximum probability of being in a state $l$ given the current
input position $i$:

$$v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$$

In addition, the recursion step keeps track of a pointer for getting to each state, so that a
traceback can be performed to reconstruct the path with the maximum probability:

$$\mathrm{ptr}_i(l) = \operatorname{argmax}_k \left( v_k(i-1)\, a_{kl} \right)$$

Termination

In the termination step, the probability of the sequence and the most probable path is
set to the maximum value at the final position, and the pointer for the maximal path is
set at that point. This is similar to the recursion step, except that a terminating
transition, $a_{k0}$, is introduced:

$$P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$$

$$\pi_L^* = \operatorname{argmax}_k \left( v_k(L)\, a_{k0} \right)$$

Traceback

Since pointers were kept for the path with the maximum probability, traceback proceeds
in a similar fashion to sequence alignment by dynamic programming. We start at the final
state and trace back through the transitions that led to that state, continuing until we
reach the first state:

$$\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*) \quad \text{for } i = L, \ldots, 1$$
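
The four steps can be sketched in code. The two-state model below (states H and L and all of its probabilities) is a made-up toy example, not the CpG model from the text. The sketch works in log space and folds the initialization into the first recursion step; a brute-force search over all paths is included only as a check, which is feasible because the example is tiny.

```python
import math
from itertools import product

# Hypothetical two-state toy HMM: H = GC-rich, L = AT-rich.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}                  # a_0k, from the begin state
END = {"H": 0.1, "L": 0.1}                    # a_k0, into the end state
TRANS = {"H": {"H": 0.5, "L": 0.4},           # a_kl; each row plus END sums to 1
         "L": {"H": 0.3, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def joint_log_prob(x, path):
    """log P(x, path): start transition, then emission and outgoing transition per position."""
    lp = math.log(START[path[0]])
    for i, (k, sym) in enumerate(zip(path, x)):
        lp += math.log(EMIT[k][sym])
        lp += math.log(TRANS[k][path[i + 1]]) if i + 1 < len(x) else math.log(END[k])
    return lp


def viterbi(x):
    """Return (most probable state path, log P(x, path))."""
    # Initialization folded into the first step: v_l(1) = a_0l * e_l(x_1).
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}
    pointers = []
    for sym in x[1:]:
        new_v, back = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            back[l] = best
            new_v[l] = math.log(EMIT[l][sym]) + v[best] + math.log(TRANS[best][l])
        v = new_v
        pointers.append(back)
    # Termination: include the transition into the end state.
    last = max(STATES, key=lambda k: v[k] + math.log(END[k]))
    log_p = v[last] + math.log(END[last])
    # Traceback: follow the stored pointers from the last position to the first.
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    return "".join(reversed(path)), log_p


x = "GGCACTGAA"
path, log_p = viterbi(x)
brute = max(joint_log_prob(x, p) for p in product(STATES, repeat=len(x)))
print(path, log_p, brute)
```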


Forward Algorithm

In addition to determining the path with the highest probability, it is also necessary to
determine the probability of a sequence given a particular HMM. This could be done by
summing the probability over all possible paths. However, the number of potential paths
increases exponentially with the length of the sequence, so a brute force method is not
practical.

Luckily, the probability of the observed sequence up to and including a point $x_i$, where
the path ends at state $\pi_i$, can be calculated using a dynamic programming approach,
similar to the Viterbi algorithm:

$$f_k(i) = P(x_1 \ldots x_i,\ \pi_i = k)$$

The forward algorithm can be described as follows, with an initialization, recursion,
and termination step.

Initialization

The initialization step is identical to the Viterbi algorithm, except now the v’s (maximum
probable path) are replaced by f’s (overall probability):

$$f_0(0) = 1; \quad f_k(0) = 0 \text{ for } k > 0$$

Recursion

The recursion step goes through the whole length of the input sequence, one symbol at a
time, and calculates the overall probability of being in a state $l$ given the current
input position $i$. In the Viterbi algorithm, this recursive step takes the maximum; in
the forward algorithm, we take the sum over all possibilities:

$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination

The termination step is an extension of the recursion step with the difference that a
terminating transition is used. The overall probability of the sequence being described by
the HMM is then given in the final cell:

$$P(x) = \sum_k f_k(L)\, a_{k0}$$
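
A minimal sketch of the forward algorithm follows, using a made-up two-state toy model (states H and L are hypothetical, as are all probabilities). It multiplies raw probabilities, which is fine for short sequences; longer sequences would need scaling or log-space arithmetic. The brute-force sum over all paths is included only as a check.

```python
import math
from itertools import product

# Hypothetical two-state toy HMM: H = GC-rich, L = AT-rich.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}                  # a_0k
END = {"H": 0.1, "L": 0.1}                    # a_k0
TRANS = {"H": {"H": 0.5, "L": 0.4},
         "L": {"H": 0.3, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def forward(x):
    """Return (P(x), list of forward rows), where row i holds f_k for position i + 1."""
    # First position: f_l(1) = a_0l * e_l(x_1).
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = f[-1]
        # Same recursion as Viterbi, with the max replaced by a sum.
        f.append({l: EMIT[l][sym] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    p_x = sum(f[-1][k] * END[k] for k in STATES)
    return p_x, f


def path_prob(x, path):
    """P(x, path) for one explicit path, used only for the brute-force check."""
    p = START[path[0]]
    for i, (k, sym) in enumerate(zip(path, x)):
        p *= EMIT[k][sym] * (TRANS[k][path[i + 1]] if i + 1 < len(x) else END[k])
    return p


x = "GGCACTGAA"
p_x, f = forward(x)
brute = sum(path_prob(x, p) for p in product(STATES, repeat=len(x)))
print(p_x, brute)
```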

Backward Algorithm

While the Viterbi algorithm finds the most probable path through the model, and the
forward algorithm finds the probability of fitting the sequence to the model, we might
also be interested in calculating the posterior probability that the symbol emitted at
position $i$ came from a particular state $k$ given the observed sequence. Formally, this
is written as:

$$P(\pi_i = k \mid x)$$

The backward algorithm is very similar to the traceback step in pairwise sequence
alignment using dynamic programming. We start at the final step and work our way back
to the beginning, computing the backward value
$b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$. The posterior probability is then
obtained by combining the forward and backward values:
$P(\pi_i = k \mid x) = f_k(i)\, b_k(i) / P(x)$.

Initialization

In the initialization step, the backward value for each state at the final position is
set to the final transition probability:

$$b_k(L) = a_{k0} \text{ for all } k$$

Recursion

The recursion works backwards ($i = L-1, \ldots, 1$). Thus, the recursive step is:

$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Termination

The termination step reports back the same value as the forward algorithm, calculated in
the reverse direction:

$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
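
The backward recursion, and the posterior state probabilities obtained as forward times backward divided by P(x), can be sketched as follows. The two-state model is a made-up toy example (states H and L and all probabilities are hypothetical), and raw probabilities are used, so this is only suitable for short sequences.

```python
import math

# Hypothetical two-state toy HMM: H = GC-rich, L = AT-rich.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}                  # a_0k
END = {"H": 0.1, "L": 0.1}                    # a_k0
TRANS = {"H": {"H": 0.5, "L": 0.4},
         "L": {"H": 0.3, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def forward(x):
    """Return (P(x), forward rows)."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][sym] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return sum(f[-1][k] * END[k] for k in STATES), f


def backward(x):
    """Return (P(x), backward rows), filling the rows from the last position to the first."""
    n = len(x)
    b = [None] * n
    b[n - 1] = {k: END[k] for k in STATES}            # initialization: b_k(L) = a_k0
    for i in range(n - 2, -1, -1):                    # recursion, moving backwards
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    p_x = sum(START[l] * EMIT[l][x[0]] * b[0][l] for l in STATES)   # termination
    return p_x, b


x = "GGCACTGAA"
p_fwd, f = forward(x)
p_bwd, b = backward(x)
# Posterior probability of each state at each position: f_k(i) * b_k(i) / P(x).
posterior = [{k: f[i][k] * b[i][k] / p_fwd for k in STATES} for i in range(len(x))]
for sym, row in zip(x, posterior):
    print(sym, {k: round(p, 3) for k, p in row.items()})
```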

HIDDEN MARKOV MODELS

Parameter Estimation
So now the question becomes: how do the model and the associated probabilities get
specified in the first place? There are two parts to HMM parameter estimation: the
design of the structure (what states there are and how they are connected) and the
assignment of the transition and emission probabilities.

Calculation of Probabilities

Estimation when the state sequence is known

Estimating the probabilities when the state sequence is known is a simple task.
Consider a case where we have a multiple alignment from a protein family that we want to
describe as a hidden Markov model. In this case, the transition probabilities and emission
probabilities can be calculated by maximum likelihood estimation, taking the
observed frequencies of residues in a column and the observed frequencies of
transitioning from one column to the next.

In order to deal with insufficient data and overtraining of models, pseudocounts should be
incorporated into these maximum likelihood estimations.
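
Counting with pseudocounts can be sketched as follows. The training pairs, the state labels H and L, and the choice of adding the same pseudocount to every transition and emission are assumptions made for illustration.

```python
def estimate(pairs, states, alphabet, pseudocount=1.0):
    """Estimate transition and emission probabilities from (sequence, state path) pairs."""
    # Every count starts at the pseudocount, so unseen events keep a nonzero probability.
    trans = {k: {l: pseudocount for l in states} for k in states}
    emit = {k: {s: pseudocount for s in alphabet} for k in states}
    for seq, path in pairs:
        for sym, k in zip(seq, path):
            emit[k][sym] += 1
        for k, l in zip(path, path[1:]):
            trans[k][l] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {key: n / total for key, n in counts.items()}

    return ({k: normalize(trans[k]) for k in states},
            {k: normalize(emit[k]) for k in states})


# Hypothetical labeled training data: each base is paired with the state that emitted it.
training = [("ACGCGT", "LLHHHL"), ("GGCA", "HHHL")]
trans, emit = estimate(training, states="HL", alphabet="ACGT")
print(trans["H"])
print(emit["H"])
```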

Estimation when the Paths are unknown – Baum-Welch

When the state sequence is unknown, it becomes trickier to calculate the probabilities.
This is usually done iteratively until some stopping criterion is reached.

In the Baum-Welch algorithm, the transition and emission probabilities are calculated as
the expected number of times each transition or emission is used, given the training data.
Once a model is in place, its overall log likelihood is computed, and transition and
emission probabilities are calculated again based on the values given.

The Baum-Welch algorithm is summarized as follows:

Initialization:
    Pick arbitrary model parameters

Recurrence:
    Set all the transition and emission variables to their pseudocount values
    For each sequence j = 1..n:
        Calculate the forward value for sequence j
        Calculate the backward value for sequence j
        Add the contribution of sequence j to the transition and emission variables
    Calculate the new model parameters using maximum likelihood
    Calculate the new log likelihood of the model

Termination:
    Stop if the change in log likelihood is less than some predefined threshold
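
The summary above can be sketched for a small model. Everything concrete here is an assumption: the two states H and L, the starting parameters, the training sequences, and the dictionary layout of the model. The pseudocount defaults to 0 in this sketch, in which case the log likelihood should not decrease from one iteration to the next; raw probabilities are used, so only short sequences are practical.

```python
import math

STATES, ALPHABET = "HL", "ACGT"


def forward(x, m):
    f = [{k: m["start"][k] * m["emit"][k][x[0]] for k in STATES}]
    for s in x[1:]:
        f.append({l: m["emit"][l][s] * sum(f[-1][k] * m["trans"][k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1][k] * m["end"][k] for k in STATES)


def backward(x, m):
    b = [{k: m["end"][k] for k in STATES}]
    for s in reversed(x[1:]):          # s is the symbol emitted after the row being filled
        b.insert(0, {k: sum(m["trans"][k][l] * m["emit"][l][s] * b[0][l] for l in STATES)
                     for k in STATES})
    return b


def baum_welch_step(seqs, m, pseudo=0.0):
    """One re-estimation pass. Returns (new model, log likelihood of the old model)."""
    A = {k: {l: pseudo for l in STATES} for k in STATES}     # expected transitions
    A0 = {k: pseudo for k in STATES}                         # expected start transitions
    AE = {k: pseudo for k in STATES}                         # expected end transitions
    E = {k: {s: pseudo for s in ALPHABET} for k in STATES}   # expected emissions
    loglik = 0.0
    for x in seqs:
        (f, px), b = forward(x, m), backward(x, m)
        loglik += math.log(px)
        for k in STATES:
            A0[k] += f[0][k] * b[0][k] / px
            AE[k] += f[-1][k] * m["end"][k] / px
            for i, s in enumerate(x):
                E[k][s] += f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in STATES:
                        A[k][l] += (f[i][k] * m["trans"][k][l]
                                    * m["emit"][l][x[i + 1]] * b[i + 1][l] / px)
    new = {"start": {}, "end": {}, "trans": {}, "emit": {}}
    for k in STATES:
        out = sum(A[k].values()) + AE[k]                     # all ways of leaving state k
        new["start"][k] = A0[k] / sum(A0.values())
        new["trans"][k] = {l: A[k][l] / out for l in STATES}
        new["end"][k] = AE[k] / out
        new["emit"][k] = {s: E[k][s] / sum(E[k].values()) for s in ALPHABET}
    return new, loglik


model = {"start": {"H": 0.5, "L": 0.5}, "end": {"H": 0.1, "L": 0.1},
         "trans": {"H": {"H": 0.5, "L": 0.4}, "L": {"H": 0.3, "L": 0.6}},
         "emit": {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
                  "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}}
seqs = ["GGCGCGAT", "ATTATAGC", "CGCGTTAA"]                  # hypothetical training data
history = []
for _ in range(50):
    model, ll = baum_welch_step(seqs, model)
    history.append(ll)
    if len(history) > 1 and history[-1] - history[-2] < 1e-6:   # termination test
        break
print(len(history), history[0], history[-1])
```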

HMM Model Structure

Duration Modeling
Complex length distributions can be modeled by introducing several states with
transitions between one another:

HMM Resulting in one or more characters

HMM Resulting in six or more sequences


HMM With 2 to 7 emissions

Silent States
In some cases, it would be nice to be able to skip over states. This is done by the
introduction of silent states, which emit no symbols. Silent states allow for arbitrary
deletions of a chain of states. Silent states are drawn as circles in the model. They
achieve the same effect as adding forward transitions from each state to every later
state, but make the model less cluttered.

Using the above HMM model, we can now skip some of the emitting states by traveling
through the silent states. The net effect would be a deletion of emitted sequence.

Insertion States
In addition to skipping over states, it may be necessary to emit residues between
matching states. This is done by introducing insert states, which are drawn as diamonds
in the HMMs.

HMMs have many different applications in computational biology. Among them are:

Multiple sequence alignments
Sequence profiles
Gene prediction
Protein structure prediction
Protein family classification

There are two main programs used to build hidden Markov models:

HMMER home page: http://hmmer.wustl.edu
SAM home page: http://www.cse.ucsd.edu/research/compbio/sam.html

Pairwise Alignments Using HMMs

A finite state machine can be created to show the calculation of a pairwise alignment with
five states: A start state, a stop state, a match/mismatch state, an insertion state (for
sequence 1), and an insertion state (for sequence 2).

The finite state machine can be drawn as follows:

[Figure: finite state machine with Begin, M, X, Y, and End states]

By assigning probabilities for the transitions, and for the emissions at states M, X, and Y,
this finite state machine can be converted into a pair HMM. A pair HMM differs from a
standard HMM in the sense that a pair of sequences is emitted in this case. This pair
HMM will generate an aligned pair of sequences.

Chapter 4 discusses using HMMs to find pairwise alignments; look it over, but we will
not discuss it further in class. Rather, we will concentrate on where the real power of
HMMs lies: in determining families of sequences based on multiple alignments.

Profile HMMs for sequence families

The power of HMMs in computational biology is in creating a model of a sequence family
that allows us to determine the relationship of an individual sequence to the
family. Such a relationship lets you concentrate on conserved features in the
family. Profile HMMs are similar to the sequence profiles already discussed with the
multiple alignment programs.

The assumption with profile HMMs is that we begin with a multiple alignment that has
been correctly calculated. This multiple alignment can then be used to build a model to
score potential matches to new sequences. For example, consider the following multiple
alignment of globin sequences:

10 20 30 40 50 60
....*....|....*....|....*....|....*....|....*....|....*....|
consensus 1 SAAQKALVKASWGKVKG------NREELGAEILARLFK-------AYPDTKAYFPKFg-D 46
1ASH 1 ANKTRELCMKSLEHAKVdt--snEARQDGIDLYKHMFE-------NYPPLRKYFKSR--E 49
1ITH_A 3 TAAQIKAIQDHWFLNIKg-----CLQAAADSIFFKYLT-------AYPGDLAFFHKF--S 48
gi 1065933 162 DKESCEVVADSWRLVESrssaaeTSACFGLFVFQRVFS-------KIPMLRPLFG-Ls-E 212
gi 3877400 71 nvyEKELLRRTWSDEFD------NLYELGSAIYCYIFD-------HNPNCKQLFP-Fi-S 115
gi 3877381 15 TDEEVTAIRDVWRRA--------KTDNVGKKILQTLIE-------KRPKFAEYFG-IqsE 58
gi 3874505 230 TCAQIHLVRALWRQVYTt----kGPTVIGASIYHRLCFknvmvkeQMKQVE-LPPKFq-N 283
gi 4098133 39 EDRDALRVLQNAFKL--------DDPELVRRFYAHWFA-------LDASVRDLFP-P--- 79
gi 1707914 18 SPADVK--KHTVESMKAvpv-grDKAQNGIDFYKFFFT-------HHKDLRKFFKGA--E 65
gi 2494780 3 TKDEFDSLLHELDPKIDte---eHRMELGLGAYTELFA-------AHPEYIKKFSRL--Q 50
70 80 90 100 110 120
....*....|....*....|....*....|....*....|....*....|....*....|
consensus 47 LSTAAALKSSPKFKAHGKKVLGALDEAVKHL---DDDGNLKAALAKLGAR-HAKRG---H 99
1ASH 50 EYTAEDVQNDPFFAKQGQKILLACHVLCATY---DDRETFNAYTRELLDR-HARDHv--H 103
1ITH_A 49 SVPLYGLRSNPAYKAQTLTVINYLDKVVDAL-----GGNAGALMKAKVPS-HDAMG---- 98
gi 1065933 213 SDDVFDLPDNHPVRRHARLFTSILHISVKNVd--ELEAQVAPTVFKYGER-HYRPDitpH 269
gi 3877400 116 KYQGDEWKESKEFRSQALKFVQTLAQVVKNIyhmERTESFLYMVGQKHVK-FADRG---- 170
gi 3877381 59 SLDIRALNQSKEFHLQAHRIQNFLDTAVGSLg-fCPISSVFDMAHRIGQI-HFYRGv--N 114
gi 3874505 284 --------RDNFIKAHCKAVAELIDQVVENL---DHLDNVTGELMRIGRV-HAKVL---- 327
gi 4098133 80 -------DMGAQRAAFGQALHWVYGELVAQr-----AEEPVAFLAQLGRD-HRKYG---- 122
gi 1707914 66 NFGADDVQKSKRFEKQGTALLLAVHVLANVY---DNQAVFHGFVRELMNR-HEKRGvdpK 121
gi 2494780 51 EATPANVMAQDGAKYYAKTLINDLVELLKAS---TDEATLNTAIARTATKdHKPRN---- 103
130 140 150 160 170
....*....|....*....|....*....|....*....|....*....|....
consensus 100 VDPANFKLFGEALL---VVLAEHLg---DFTPEVKAAWDKALDVVADALKSGYR 147
1ASH 104 MPPEVWTDFWKLFE---EYLGKKT----TLDEPTKQAWHEIGREFAKEINKHGR 150
1ITH_A 99 ITPKHFGQLLKLVG---GVFQEEFs---ADPTTVAAWGDAAGVLVAAMK----- 141
gi 1065933 270 MTEENVRVFCAQIV---CTVFDFLrd-tEATPKCAESWIELMRYLGQKLLDGFD 319
gi 3877400 171 FKHEYWDIFQDAME---FALEHRLsimtDLDDNQKRDAVTVWRTLALYTTVHMR 221
gi 3877381 115 FGADNWLVFKKVTV---DQVTTGTt---DSSKEKEDtnsngtangkvdtdasli 162
gi 3874505 328 RGELTGKLWNTVAEtiiDCTLEWGdr-rCRSETVRKAWALIVAFVIEKIKAGHH 380
gi 4098133 123 VLPTQYDTLRRALY---TTLRDYLg------HPSRGAWTDAVDEAAGQSLNLII 167
gi 1707914 122 LWKIFFDDVWVPFL---ESKGAKLs------GDAKAAWKELNKNFNSEAQHQLE 166
gi 2494780 104 VSGAEFQTGEPIFI---KYFSHVL-----TTPANQAFMEKLLTKIFTGVAGQL- 148

First, consider only the BLOCKS, where there are no insertion and deletion events. For
example, we can consider the block of width 5 highlighted above (the columns that read
HAKRG in the consensus row):

HAKRG
HARDH
HDAMG
HYRPD
FADRG
HFYRG
HAKVL
HRKYG
HEKRG
HKPRN

As we have shown with the multiple alignment algorithms, a position specific scoring
matrix (PSSM) can be calculated for the BLOCK above.

(REVIEW CALCULATIONS FOR PSSMs)

Converting a PSSM to a HMM

Converting a PSSM for a block to a HMM is trivial, due to the absence of insertions and
deletions. There will be a begin state, and five match states, one for each column. Match
states are represented by squares in a diagram of a profile HMM. The transition
probabilities between match states in the HMM would be 1 (we can only transition to the
next match state), and the emission probabilities for each match state would be calculated
based on the PSSM. The diagram for the resulting HMM is shown below:

[Diagram: B → M1 → M2 → M3 → M4 → M5 → E]

Alignment to this HMM is also trivial, since there is no choice of transitions.
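
As a rough illustration of the ungapped model, the sketch below builds one emission table per
match state from the block above and scores a sequence against it. It assumes that the
emission probabilities are plain column frequencies with one pseudocount added per residue;
the names BLOCK, emission_table, and log_probability are illustrative only and do not come
from any package.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The width-5 block taken from the alignment above.
BLOCK = ["HAKRG", "HARDH", "HDAMG", "HYRPD", "FADRG",
         "HFYRG", "HAKVL", "HRKYG", "HEKRG", "HKPRN"]


def emission_table(block):
    """Return one {residue: probability} dict per column (one per match state)."""
    table = []
    for column in zip(*block):
        counts = Counter(column)
        # Assumption: add one pseudocount for each of the 20 amino acids.
        total = len(column) + len(AMINO_ACIDS)
        table.append({aa: (counts[aa] + 1) / total for aa in AMINO_ACIDS})
    return table


def log_probability(sequence, table):
    """Log-probability of emitting `sequence` from the chain of match states.

    Every match-to-match transition has probability 1, so only the
    emission terms contribute to the sum.
    """
    if len(sequence) != len(table):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log(column[aa]) for aa, column in zip(sequence, table))


TABLE = emission_table(BLOCK)
```

Because there is no choice of transitions, scoring is a single pass over the sequence, for
example log_probability("HAKRG", TABLE).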

Adding Insert and Delete States to Obtain Profile HMMs

Insertions

For a profile HMM, insertions and deletions are treated separately. In order to handle
insertions, where a portion of a new sequence does not match anything else in the model,
a new set of insert states is added, denoted by diamonds. Whenever an insertion is
possible, a transition is needed from the last match state in a block to the insert state,
from the insert state to itself (to allow insertions of more than one residue), and from
the insert state to the first match state of the next block. For instance, if our alignment
had been:
HAKVPRG
HAR--DH
HDAV-MG
HYR--PD
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN

We now have an insertion after the third match state. Thus, the HMM now looks like the
following:

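[Diagram: B → M1 → M2 → M3 → M4 → M5 → E, with an insert state (diamond) that is entered
from M3, loops back to itself, and leads on to M4]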

In this case, the score of a gap of length k is equal to the score of the transition from the
match state to the insert state, plus (k-1) times the score of the transition remaining in the
insert state, plus the score of the transition from the insert state to the next match state.
This can be rewritten as:

$$\log a_{M_j I_j} + (k-1)\,\log a_{I_j I_j} + \log a_{I_j M_{j+1}}$$
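
The three-term score can be checked numerically. The function below is a minimal sketch;
the argument names a_mi, a_ii, and a_im stand for the match-to-insert, insert-to-insert, and
insert-to-match transition probabilities and are illustrative only.

```python
import math


def insertion_log_score(k, a_mi, a_ii, a_im):
    """Log score of an insertion of k residues (transition terms only).

    a_mi: probability of moving from the match state into the insert state
    a_ii: probability of staying in the insert state
    a_im: probability of leaving the insert state for the next match state
    """
    if k < 1:
        raise ValueError("an insertion must contain at least one residue")
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)
```

Each additional inserted residue adds log(a_ii) to the score, which is why the insert
self-transition behaves like a gap extension penalty.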


Deletions

Deletions are handled in HMMs by introducing silent states between matches. Deletion
states do not emit a residue (thus, the name “silent” state). These states are denoted by
circles in a HMM. An example of a HMM emitting a sequence between 0 and 3 residues
long is as follows:

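[Diagram: begin state B, three match states that can each be bypassed through a silent
delete state (circle), end state E]
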
HMM With Matches, Insertions, and Deletions

A HMM with match states, insertion states, and delete states is referred to as a profile
HMM (first introduced by David Haussler et al. and Anders Krogh et al. in the
mid-1990s). An example structure of a profile HMM is as follows:

Use of Profile HMMs to Generate Pairwise Alignments

Profile HMMs can be used to perform generalized pairwise alignments, similar to the
dynamic programming approach, where the emission probabilities are the
match/mismatch probabilities or the probabilities of matching two amino acids in a
scoring matrix. The transition probabilities from a match state to an insert or delete state
are equivalent to a gap-open score, while the transition probabilities within an insertion
state or between delete states are equivalent to a gap extension penalty.

Deriving profile HMMs from multiple alignments

Multiple sequence alignments can be used to create profile HMMs that act as models
describing the consensus sequence of a sequence family.

Basic Profile HMM parameterization

Choosing the length of the model


The choice of the length of the model corresponds to a decision on which multiple
alignment columns to assign to match states, and which to assign to insert states.
Considering the alignment below, we might consider columns 1, 2, 3, 6, and 7 to be
match states, and columns 4 and 5 to be insert states. A simple rule is that columns that
are more than half gap characters should be modeled by inserts.

HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN
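
The half-gap rule is easy to apply mechanically. The sketch below uses the alignment above;
the function name match_columns is illustrative only.

```python
ALIGNMENT = ["HAKVPRG", "HAR--DH", "HDAV-MG", "HYR---D", "FAD--RG",
             "HFY--RG", "HAK-PVL", "HRKG-YG", "HEKGGRG", "HKP--RN"]


def match_columns(alignment, gap="-"):
    """Return the 1-based columns that are modeled by match states.

    A column in which more than half of the characters are gaps is
    treated as an insert column and is left out of the result.
    """
    n_sequences = len(alignment)
    columns = []
    for index, column in enumerate(zip(*alignment), start=1):
        n_gaps = sum(1 for character in column if character == gap)
        if 2 * n_gaps <= n_sequences:
            columns.append(index)
    return columns
```

For this alignment the rule selects columns 1, 2, 3, 6, and 7, which gives a model with five
match states.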

Assigning the probability parameters

Once the length of the model is chosen, the next problem is to assign the transition and
emission probability parameters. The simplest manner in assigning probabilities is to
take the frequencies.

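First, consider the transition frequencies. Note that each state has three possible
transitions leading from it. The transition probability from state k to state l can be
calculated as:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

where A_{kl} is the number of times the transition from state k to state l is observed.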

Similarly, the emission probability of residue a at state k can be calculated:


$$e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

As we have discussed before, there is a difficulty in using straight frequencies when
considering multiple alignments. The problem is that if not enough sequences are used in
the multiple alignment, then some residues will be underrepresented (or not represented
at all) and others will be overrepresented. In order to overcome this difficulty,
pseudocounts are introduced (much like in problem 3 of the homework). The easiest
form of pseudocounts is Laplace’s rule, which adds one to each count.

Let’s consider our alignment again:

HAKVPRG
HAR--DH
HDAV-MG
HYR---D
FAD--RG
HFY--RG
HAK-PVL
HRKG-YG
HEKGGRG
HKP--RN

First, consider the transitions. From column 1 to column 2, there are 10 transitions to the
next match state, 0 transitions to an insertion state, and 0 transitions to a delete state.
Using Laplace’s rule, the probabilities would be 11/13, 1/13, and 1/13 for aM1M2, aM1I1,
and aM1D2 respectively. The probabilities are the same for the transitions from the second
column to the third column. Note that the fourth and fifth columns are insertion columns.
Therefore, the next set of match transitions will be from column three to column six (the
fourth match state). There are four sequences in which columns four and five are gaps and
column six holds a residue (probability of matching: 5/13); there is one in which columns
4, 5 and 6 are all gaps (probability of deletion: 2/13); and there are five with an
insertion in column 4 and/or column 5 (probability of insertion: 6/13).
Remembering that we have a total of five match states, the probabilities can be stored in
the following table:

Match state     1       2       3       4       5
MATCH         11/13   11/13    5/13   10/13   11/13
INSERTION      1/13    1/13    6/13    1/13    1/13
DELETION       1/13    1/13    2/13    2/13    1/13

The emission probabilities can be calculated for the match states in a similar fashion. In
the first column, there are 9 H’s, and 1 F. Using Laplace’s rule, this becomes 10 H’s, 2
F’s, and 1 each of the remaining 18 amino acids. Therefore, the probabilities are 10/30
H; 2/30 F; 1/30 for each of the remaining 18 amino acids.
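
The counts used above can be reproduced with a short script. This is only a sketch of the
counting described in the text, with one pseudocount per outcome; the function names and
the 0-based column indices passed to them are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = ["HAKVPRG", "HAR--DH", "HDAV-MG", "HYR---D", "FAD--RG",
             "HFY--RG", "HAK-PVL", "HRKG-YG", "HEKGGRG", "HKP--RN"]


def transition_probabilities(alignment, match_col, next_match_col):
    """Laplace-smoothed transitions out of the match state at `match_col`.

    Both arguments are 0-based alignment columns; any columns strictly
    between them are treated as insert columns.
    """
    counts = {"M": 0, "I": 0, "D": 0}
    for sequence in alignment:
        if sequence[match_col] == "-":
            continue  # this sequence is not in the match state
        if any(c != "-" for c in sequence[match_col + 1:next_match_col]):
            counts["I"] += 1
        elif sequence[next_match_col] == "-":
            counts["D"] += 1
        else:
            counts["M"] += 1
    total = sum(counts.values()) + len(counts)  # one pseudocount per transition
    return {state: Fraction(n + 1, total) for state, n in counts.items()}


def emission_probabilities(alignment, col):
    """Laplace-smoothed emission probabilities for the match column `col`."""
    counts = Counter(s[col] for s in alignment if s[col] != "-")
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return {aa: Fraction(counts[aa] + 1, total) for aa in AMINO_ACIDS}
```

Calling transition_probabilities(ALIGNMENT, 2, 5) reproduces the 5/13, 6/13, and 2/13
values for the third match state, and emission_probabilities(ALIGNMENT, 0) gives 10/30 for
H and 2/30 for F.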

(Go through this example on the board)

Searching with Profile HMMs

Once the profile HMM is in place, sequences can be searched against the HMM to detect
whether or not they belong to a particular family of sequences described by the profile
HMM. Using a global alignment, the probability of the most probable alignment and
sequence can be determined using the Viterbi algorithm (yielding P(x, π*|M)). In
addition, the full probability of a sequence aligning to the profile HMM can be
determined using the forward algorithm (yielding P(x|M)). The Viterbi equations
specifically designed for profile HMMs are given on page 109, Durbin et al. The forward
equations are given on page 110.

Local Alignments to HMMs

To be discussed next week

CECS 694-02 Introduction to Bioinformatics


Lecture 8
Protein Family Classification and Gene Prediction

Hidden Markov Model Programs

HMMER

HMMER (“hammer”) is a profile-HMM package used for protein sequences, currently in
version 2.2. Various programs in the HMMER suite can take in a multiple alignment as
input, and produce a profile-HMM based on that alignment. Once a HMM is in place
(either a user-defined HMM or one of the Pfam HMMs to be described later), sequences
can be searched against the HMM to check for membership in a particular family. In
addition, sequences can be emitted from a HMM based on the transition and emission
probabilities. The source code for HMMER is freely available for download from the
following site:

http://hmmer.wustl.edu/

SAM
“Sequence Alignment and Modeling System” is a profile-HMM package used for protein
sequences as well. It is very similar to HMMER as far as the functionality is concerned.

http://www.cse.ucsc.edu/research/compbio/sam.html

Meta-meme
Meta-meme is a program that creates hidden Markov models of ungapped alignments.
The benefit is that there are fewer parameters to be learned in creating the HMMs. Thus,
Meta-meme is a motif-based hidden Markov model approach.

http://metameme.sdsc.edu/

HMMPro
HMMPro is a commercial software package (free academic license) that has been
developed to add additional HMM features, including graphical interfaces, multiple
topologies, and multiple training methods.

http://www.netid.com/html/hmmpro.html

Protein Family Classification

Pfam
Pfam is a large collection of multiple sequence alignments and hidden Markov models
covering many common protein domains and families. Over 73% of all known protein
sequences have at least one match to one of 5,193 different protein families. Pfam
families are extensively hand curated to ensure greater reliability in the results.

Profile HMMs have been created to describe each protein family. In general, the HMMs
are seeded with a training data set of 50 or more sequences that are multiply aligned.
Generally, this alignment first is accomplished using a multiple alignment program such
as Clustal. After the automatic multiple alignment is generated, it is scrutinized by eye
and adjusted. This assures that the seed alignments that produce the Profile HMM are
more likely to be correct.

After the HMM has been created (using the HMMER suite), additional sequences are
added to the family by comparing the HMM against sequence databases. The resulting
full alignments with additional family members may look worse than the initial seed
alignments.

Pfam families can be broken down into four basic types:

• family – default classification, stating members are related
• domain – structural unit found in multiple protein contexts
• repeat – a domain that in itself is not stable, but when combined with multiple
tandem repeats forms a domain or structure
• motif – shorter sequence units found outside of domains

Links to the Pfam software:

http://pfam.wustl.edu/
http://www.sanger.ac.uk/Software/Pfam/index.shtml

The best way to look at the Pfam families is to jump right into the Pfam program and
view some examples: http://pfam.wustl.edu/

Pfam also contains more information concerning the three-dimensional structures of
proteins. We will revisit these properties closer to the end of the semester.

ProDom
ProDom is a database of all protein domain families automatically generated from the
SWISS-PROT and TrEMBL databases. ProDom incorporates Pfam-A families as well as
generating new ProDom alignments using the PSI-BLAST program.

Let’s look at the example with entry PD000039

http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php

ProSite
ProSite is a database of protein domains that can be searched by either regular expression
patterns or sequence profiles. The ProSite data can be accessed at:

http://www.expasy.org/prosite/

InterPro
InterPro is an integrated resource of protein families, domains, and functional sites
created to integrate the data from various protein family sites such as PROSITE, Pfam,
PRINTS, ProDom, SMART and TIGRFAMs into a single, comprehensive resource.

Current release of InterPro:

5629 Entries
4280 Families
1239 Domains
95 Repeats
15 post-translational modifications

Represented in InterPro are 74% of all proteins within the SWISS-PROT and TrEMBL
databases.
Computational Gene Prediction

Locating open reading frames (ORFs)

The simplest method of predicting genic regions is to search for open reading frames
(ORFs). We have already discussed open reading frames, which begin with a start
codon (AUG in the mRNA, ATG in the DNA) and end with one of three stop codons.

In prokaryotic organisms, DNA sequences coding for proteins generally are transcribed
into mRNA which is translated into protein with very little modification. Therefore, in
prokaryotes, locating an open reading frame from a start codon to a stop codon can give a
strong suggestion into protein coding regions. Longer ORFs are more likely to predict
protein-coding regions than shorter ORFs.
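
A minimal ORF scan can be sketched as follows. It looks only at the three forward-strand
frames of a DNA string, uses ATG as the start codon and TAA, TAG, and TGA as stop codons,
and ignores the reverse strand for brevity. The function name and the min_codons parameter
are illustrative assumptions, not part of any program discussed in these notes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    Indices are 0-based and half-open, so dna[start:end] runs from the
    ATG through the stop codon. `min_codons` counts the stop codon too.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

Raising min_codons is a simple way to apply the observation that longer ORFs are more
likely to correspond to real protein-coding regions.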

In eukaryotic organisms, the mRNA undergoes processing to remove intronic regions
before the protein is translated. Therefore, the genomic sequence corresponding to a gene
in a eukaryotic organism may contain stop codons that fall within intronic regions. This
posttranscriptional modification makes gene prediction more difficult.

Locating Homologous Genes

Once a new genomic DNA sequence has been obtained, the first step in gene prediction is
to locate homologous genes. This is accomplished by taking the new DNA sequence,
translating it into all six reading frames, and comparing it to protein sequence databases.
This step will locate known open reading frames, where the translation of the proteins is
known.
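
The six-frame translation step can be sketched in a few lines. The code below assumes the
standard genetic code and writes stop codons as "*"; the helper names are illustrative, and
programs such as BLASTX perform this step internally.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    frames = {}
    for strand, sequence in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames
```

Each of the six translated strings can then be compared against a protein sequence
database.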

We have shown various algorithms for searching databases, and in particular, aspects that
allow translation into all possible reading frames before searching. (Examples are
BLASTX, FASTX, TBLASTN, TFASTX/TFASTY).

It is thought that only about half of the genes can be found by homology searches. The
remaining 50% need to be found using other mechanisms.

Locating ESTs /cDNAs

If the organism from which the genomic DNA has been extracted has EST or cDNA
sequences available, the next step is to perform a similarity search between the genomic
DNA and these ESTs or cDNAs. In addition, it can be helpful to perform similarity
searches with ESTs and cDNAs from other organisms as well. This step locates potential
exons within genomic sequences.

Programs that take into account not only sequence similarity but also intron/exon
boundary information (such as acceptor/donor sites and branch points) are listed as
follows:
EST-GENOME (http://www.hgmp.mrc.ac.uk/Registered/Option/est_genome.html)
SIM4 (http://pbil.univ-lyon1.fr/sim4.html)
SPIDEY (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/)
Other Programs:
TAP (Transcript Analysis Program)

Computationally Predicting Genes

The third step is to take the genomic DNA and run it through gene prediction programs to
try to locate genes. Various programs exist to predict genes in different organisms,
usually basing the methodology on the observed characteristics of known exons, introns,
splice sites, and other regulatory sites in known genes. One important aspect to consider
is that gene structure varies from one organism to the next, so a program trained on one
organism is not generally useful for finding genes on another organism. Methods for
computationally predicting genes are generally error prone. We will discuss the
important statistics to consider when comparing gene prediction programs.

Gene prediction in prokaryotic organisms

In general, predicting genes in prokaryotic organisms is much easier. This is due to the
fact that prokaryotic genes generally lack introns, and several highly conserved promoter
regions are found around the start sites of transcription and translation.

Example: E. coli lexA gene


As an example, consider the E. coli lexA gene, which plays a role in…
There are two promoter sites (at positions -10 and -35 relative to the transcription start
site) that mark positions of interaction with the RNA polymerase (which copies the DNA into
RNA). There is also a ribosomal binding site on the mRNA product that is complementary to
ribosomal RNA. (Note that the ribosome is used to translate the RNA into protein.)

There are also three potential binding sites for the lexA product in the promoter region,
indicating that lexA provides a self-feedback loop. When enough lexA has been
produced, these sites will be bound by lexA, which represses further production of the
protein.

In addition, there is an open reading frame that is devoid of introns.

Many other prokaryotic genes operate in this same manner, and it is therefore fairly
straightforward to locate genes in prokaryotic organisms.

The highly conserved features of prokaryotic genes have made computational gene
identification a possibility. One method to detect these genes is to create HMMs based
on the gene structures. One such HMM is given in Mount, p50. This model suggests that
a model can be constructed based on each of the 61 codons in the genetic code, as well as
for the start codon and 3 stop codons.

Since codon usage and intergenic sequences vary from one organism to the next, a model
trained on the genes of one organism may not be useful in detecting genes in a second.
The reliability of the model depends on the accuracy of the information used to train the
initial model.

One program that uses a fifth order HMM (such that hexamers are important) in
modeling E. coli genes is the program GeneMark.hmm.

GeneMark (http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi )

Interpolated Markov Models (IMMs) address the problem of underrepresented
sequences by falling back on shorter patterns (such as looking at pentamers instead of
hexamers) when a longer pattern is too rare in the training data. Of course, when enough
training data is available, the longer the pattern, the more accurate the prediction. One
example of a program using the IMM method is Glimmer.

Glimmer (http://www.cs.jhu.edu/labs/compbio/glimmer.html)

Prediction of Genes in Eukaryotes

The commonly used methods for eukaryotic gene prediction train a computer program to
recognize sequences that are characteristic of known exons in genomic DNA sequences.
Then these programs are used to predict exons in unknown genomic sequences, and then
connect these exons to produce a gene structure.

The patterns used include intron-exon boundaries and upstream promoter sequences.
However, in eukaryotes, the signals for these are poorly defined, and therefore cannot be
searched by a simple pattern-matching technique as used with prokaryotes.

Splice sites
(Image source: http://genio.informatik.uni-stuttgart.de/GENIO/splice/)
Canonical: GT->AG
Alternative: GC->AG
U12 Introns: AT->AC

Splice site prediction:

SPLICEVIEW
SplicePredictor
BCM-SPL

http://www.fruitfly.org/seq_tools/splice.html
http://www.softberry.com/spldb/SpliceDB.html

Gene Prediction Using Neural Networks

Neural networks provide a framework for finding complex yet subtle patterns and
relationships among sequences. Grail II provides analyses of protein-coding regions,
poly(A) sites, and promoters; constructs gene models; and predicts encoded protein
sequences. The underlying algorithm makes a list of the most likely exon candidates and
these are further evaluated using a neural network. Dynamic programming is then
applied to define the most probable gene models.

Input for Grail II includes several indicators of sequence patterns, including a Markov
model for gene recognition and inputs from two additional neural networks that evaluate
the region for potential splice sites. One important indicator in Grail II (and other gene
prediction programs as well) is the in-frame 6-mer preference score, since the occurrence
of codon pairs in coding regions is not random, while in noncoding regions it is more
likely to be random. A higher frequency of 6-mers that are more commonly found in
coding regions can be an indicator of the presence of an exon.

The neural network used for Grail II is trained using a set of known coding sequences.
A schematic for the Grail II system is as follows (Mount, p53):
Gene Prediction Using Dynamic Programming

GeneParser is a program that predicts the most likely combination of exons and introns in
a genomic sequence using a combined neural network and DP approach.

SEE PAGE 355 FOR A DESCRIPTION

Pattern Discrimination Methods

Discrimination methods are statistical methods used to classify the sequence based on a
number of observed sequence patterns, including a 6-mer exon preference score (EPS) and a
3’-flanking splice site score (FSS). The idea behind discrimination analysis is to
plot a pair of scores for known sequences against one another, labeling each point as
either intron or exon. Based on this data, a discriminating curve is drawn to separate
the introns from the exons. The scores are then calculated for the new genomic
sequence, and depending on where the scores fall relative to the curve, the region is
labeled either as an intron or exon. (See page 356 for an example). Example pattern
discrimination methods include:

HEXON
FGENEH
MZEF

Hidden Markov Models


Genscan

Twinscan

Assessing Gene Prediction Programs

Burset and Guigo (1996) published an important paper describing methods for
testing gene prediction programs. In this paper, they describe a set of known coding
sequences that should be used as data to train the models. In addition, a set of known
coding sequences is provided to evaluate the success of the model. The important
statistics to look at include:

True Positives (TP): Number of correctly predicted coding regions
False Positives (FP): Number of incorrectly predicted coding regions
True Negatives (TN): Number of correctly predicted non-coding regions
False Negatives (FN): Number of incorrectly predicted non-coding regions

Using these measures, the following can be calculated:


Actual Positives (AP) = TP + FN
Actual Negatives (AN) = TN + FP
Predicted Positives (PP) = TP + FP
Predicted Negatives (PN) = TN + FN

Sensitivity (SN) = TP / (TP + FN) (Percentage of coding regions found)
Specificity (SP) = TP / (TP + FP) (Percentage of positive predictions that are correct)

These measures can be combined to form a single correlation coefficient:

$$CC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(AN)(PP)(AP)(PN)}}$$
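
These statistics are straightforward to compute from the four counts. The function below is
a small sketch that follows the definitions above, with the correlation coefficient taken in
its usual form with a square root in the denominator. The function name is illustrative,
and the code does not guard against a zero denominator.

```python
import math


def gene_prediction_stats(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity follows the definition used above, TP / (TP + FP).
    """
    ap = tp + fn  # actual positives
    an = tn + fp  # actual negatives
    pp = tp + fp  # predicted positives
    pn = tn + fn  # predicted negatives
    sensitivity = tp / ap
    specificity = tp / pp
    cc = (tp * tn - fp * fn) / math.sqrt(an * pp * ap * pn)
    return sensitivity, specificity, cc
```

A perfect prediction (FP = FN = 0) gives CC = 1, while a prediction no better than chance
gives a CC close to 0.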

Gene Prediction Programs:

GeneFinder http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GeneMark http://genemark.biology.gatech.edu/GeneMark/
GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html
GeneScan http://202.41.10.146/GS.html
Genie http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie
GenScan http://genes.mit.edu/GENSCAN.html
Grail http://compbio.ornl.gov/
MZEF http://argon.cshl.org/genefinder/

MetaGene: http://rgd.mcw.edu/METAGENE/
BCM Web Launcher for Gene Predictions: http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
Genie http://www.fruitfly.org/seq_tools/genie.html
Grail EXP http://grail.lsd.ornl.gov/grailexp/
HMMGene http://www.cbs.dtu.dk/services/HMMgene/
NetGene2 http://www.cbs.dtu.dk/services/NetGene2/
geneid http://www1.imim.es/geneid.html
Procrustes http://hto-13.usc.edu/software/procrustes/
Genewise
Twinscan http://genes.cs.wustl.edu/

Assessing gene prediction programs


Burset and Guigo http://www1.imim.es/GeneIdentification/Evaluation/Index.html
Rogic et al genome research 2001
Promoter prediction programs

Burge CB, Karlin S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8,
346-354.

Burge C, Karlin S. (1997) Prediction of complete gene structures in human genomic DNA.
J. Mol. Biol. 268, 78-94.

Burset M, Guigo R. (1996) Evaluation of gene structure prediction programs. Genomics 34,
353-357.

Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. (1998) A computer program for
aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967-974.

Kulp D, Haussler D, Reese MG, Eeckman FH. (1996) A generalized Hidden Markov Model
for the recognition of human genes in DNA. ISMB-96, St. Louis, MO, AAAI/MIT Press.

Lukashin A, Borodovsky M. (1998) GeneMark.hmm: new solutions for gene finding. NAR
26(4), 1107-1115.

Reese MG, Eeckman FH, Kulp D, Haussler D. (1997) Improved splice site detection in
Genie. Proceedings of the First Annual International Conference on Computational
Molecular Biology (RECOMB) 1997, Santa Fe, NM, ACM Press, New York.

Snyder EE, Stormo GD. (1995) Identification of Coding Regions in Genomic DNA. J. Mol.
Biol. 248, 1-18.

Snyder EE, Stormo GD. (1993) Identification of coding regions in genomic DNA
sequences: an application of dynamic programming and neural networks. Nucl. Acids
Res. 21(3), 607-613.

CECS694-02
Introduction to Bioinformatics
Lecture 9
Phylogenetic Prediction

Overview of phylogenetics

Phylogenetic analysis gives insight into how a family of related sequences has been
derived during evolution. The evolutionary relationships among the sequences are shown
as branches of a tree. The length and nesting of these branches reflect the degree of
similarity between any two given sequences. The objective of phylogenetic analysis is to
determine the length of the branches and to figure out how the tree should be drawn.

Sequences that are the most closely related are drawn as neighboring branches on a tree.

Phylogenetic analysis is dependent upon good multiple sequence alignment programs.
Given a multiple sequence alignment, phylogenetic analysis tries to group sequences with
similar patterns of substitutions in order to reconstruct a phylogenetic tree. For instance,
consider that we have two sequences that are related. Given these two sequences, an
ancestral sequence can be (partially) derived. With more similar sequences, more
information can be gathered to add to a correct derivation and evolutionary history.

Uses of phylogenetic Analysis

Given a set of genes (such as a family of genes) phylogenetic analysis can help determine
which genes are likely to have equivalent functions.

Phylogenetic analysis is also used to follow changes occurring in a rapidly changing
species such as a virus. Take for instance influenza. By studying the rapidly changing
genes through phylogenetic analysis, the next year’s strain can be predicted, and a flu
vaccination can be developed. The prediction is not always correct, but it gives a level
of protection.

Tree of Life

On one level, it is interesting to understand and study how the evolution of species has
occurred. There are many different resources discussing the evolution of species. This
includes the NCBI taxonomy web sites, and the University of Arizona’s tree of life
project. We’ll take a look at both of these web sites in order to get a better appreciation
for the evolution of species relative to one another.

NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
Tree of Life http://tolweb.org/tree/

Evolutionary Trees

An evolutionary tree is a two dimensional graph showing the evolutionary relationship
among a set of items being compared. This set can be organisms, genes, or DNA
sequences. Consider for the moment that each of the units in the set is referred to as a
taxon. Each taxon will be defined by a distinct unit on the tree.

An evolutionary tree is composed of outer branches or leaves that represent the taxa and
nodes and branches representing the relationships among the taxa. Two taxa that are
derived from the same common ancestor will share a node in the graph. In general,
approaches to designing evolutionary trees attempt to define the length of each branch to
the next node according to the number of sequence level changes that occurred. One
thing to be careful of in phylogenetic analysis is that this distance may not be in direct
relation to evolutionary time. The assumption that mutations accumulate at a uniform
rate is known as the molecular clock hypothesis.

Rooted Trees

In a rooted tree topology, one sequence (the root) is defined to be the common ancestor
of all of the other sequences. A unique path leads from the root node to any other node,
and the direction of the path indicates evolutionary time. The root is chosen by including
a sequence from an organism that is thought to have branched off earlier than the other
sequences. If the molecular clock hypothesis holds, it is also possible to predict a root.
As the number of sequences increases, the number of possible rooted trees increases very
rapidly. In some cases, a bifurcating binary tree is the best model to simulate
evolutionary events, in which one species branches off into two separate species.

Example of a rooted tree:

Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html


Star Topology (Unrooted Trees)

An unrooted tree (sometimes referred to as a star topology) shows the evolutionary
relationship among sequences, without revealing the location of the oldest ancestry.
For the same set of sequences, there are fewer possible unrooted trees than rooted trees.
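
To give a feel for how quickly the number of trees grows, the sketch below counts strictly
bifurcating trees with n labeled leaves using the standard double-factorial formulas:
(2n-5)!! unrooted trees and (2n-3)!! rooted trees. These formulas are not given in the
notes above and are included here as background; the function names are illustrative.

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 or 2; returns 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labeled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labeled leaves."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return double_factorial(2 * n_taxa - 3)
```

For four sequences this gives 3 unrooted and 15 rooted trees; for ten sequences the counts
are already 2,027,025 and 34,459,425.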

Example of an unrooted tree:

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg

Methods for Determining Evolutionary Trees

There are three methods used to calculate the tree(s) that best account for the observed
variation in a set of sequences. These methods are maximum parsimony, distance, and
maximum likelihood.

Maximum Parsimony

Maximum parsimony methods predict the evolutionary tree that minimizes the number of
steps required to generate the observed variation in the sequences. In order to construct
a tree using maximum parsimony, a multiple sequence alignment must first be obtained.
For each aligned position, phylogenetic trees that require the smallest number of
evolutionary changes to produce the observed sequence changes are identified. This
continues for each position in the alignment. Those trees that produce the smallest
number of changes overall for all sequence positions are identified. This is a rather time
consuming algorithm that only works well if the sequences have a strong sequence
similarity.

1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

Consider the example above (Mount, 250). There are a total of four sequences, which
gives three possible unrooted trees. In this case some sites are
informative, and other sites are not. An informative site has at least two different
sequence characters, each of which appears in at least two of the sequences. Only the
informative sites need to be considered.

Possible trees:

Tree 1: (1,2) | (3,4)     Tree 2: (1,3) | (2,4)     Tree 3: (1,4) | (2,3)

In this case, the optimal tree is obtained by adding the number of changes at each
informative site for each tree, and picking the tree requiring the least total number of
changes.

For a large number of sequences the number of trees to examine becomes so large that it
might not be possible to examine all possible trees. Some programs, such as PAUP, add
features that will allow the user to invoke a heuristic that will keep representative trees
that best fit the data.

The informative sites in the example alignment are 5, 7, and 9.

Let’s go through the possible trees, and figure out the number of rearrangements for each
in the informative sites. (SEE THE POWERPOINT PRESENTATION)
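
For the four-sequence example, the whole calculation fits in a short script. The sketch
below finds the informative sites and counts the changes each of the three unrooted trees
needs at those sites, using Fitch's set-based counting as the assumed method (the
PowerPoint walkthrough may present the counting differently). All names are illustrative.

```python
from collections import Counter

SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}
# Each tree is written as the two pairs of leaves separated by the internal branch.
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]


def informative_sites(seqs):
    """1-based columns with at least two characters that each occur twice or more."""
    sites = []
    for position, column in enumerate(zip(*seqs.values()), start=1):
        shared = [base for base, n in Counter(column).items() if n >= 2]
        if len(shared) >= 2:
            sites.append(position)
    return sites


def fitch_join(x, y):
    """Combine two state sets; a union (no common state) costs one change."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def changes_at_site(seqs, tree, site):
    """Minimum number of changes the quartet tree needs at one site."""
    (a, b), (c, d) = tree
    left, cost_left = fitch_join({seqs[a][site - 1]}, {seqs[b][site - 1]})
    right, cost_right = fitch_join({seqs[c][site - 1]}, {seqs[d][site - 1]})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle


def tree_length(seqs, tree):
    """Total changes over all informative sites."""
    return sum(changes_at_site(seqs, tree, s) for s in informative_sites(seqs))
```

With these sequences the three trees need 4, 5, and 6 changes respectively, so the tree
that groups sequences 1 and 2 together (and 3 and 4 together) is the most parsimonious.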

One problem with determining evolutionary distance between sequences is that columns
representing greater variation dominate the analysis. One way to overcome this problem
when determining long branch lengths is to look only at transversion events, which are the
most significant base changes (i.e. changes from a purine to a pyrimidine or vice versa).
This is referred to as Lake’s method of invariants.

LOOK AT THE MITOCHONDRIAL SEQUENCE ANALYSIS ON P 252

Distance Methods

The distance method for construction of phylogenetic trees looks at the number of
changes between each pair in a group of sequences to produce a phylogenetic tree of the
group. The goal of distance methods is to identify a tree that positions neighbors
correctly and that also has branch lengths which reproduce the original data as closely as
possible.

CLUSTALW uses the neighbor-joining method to build the guide tree for multiple sequence
alignments. The PHYLIP suite of programs also includes neighbor-joining methods.

Phylip http://evolution.genetics.washington.edu/phylip.html

Distance analysis programs in PHYLIP

FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the
Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining (no molecular clock
assumed) or unweighted pair group method with arithmetic mean (UPGMA) (molecular
clock assumed).

For phylogenetic analysis, the distance score is counted as either the number of
mismatched positions in the alignment or the number of sequence positions that must be
changed to generate the other sequence.

The success of distance methods depends on the degree to which the distances among a
set of sequences can be made additive on a predicted evolutionary tree.

Consider the alignment:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

The distances between these sequences can be shown as a table:

     A    B    C    D
A    -    3    7    8
B    -    -    6    7
C    -    -    -    3
D    -    -    -    -

Using this information, an unrooted tree showing the relationship between these
sequences can be drawn:

A                     C
 \ 2               1 /
  \        4        /
   +---------------+
  /                 \
 / 1               2 \
B                     D
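As a quick check of the table and the tree above, the pairwise distances can be counted
directly from the alignment. The sketch below is illustrative only: the names `hamming`
and `tree_distance` are made up for this example, and it assumes the distance is simply
the number of mismatched positions.

```python
from itertools import combinations

# The four aligned sequences from the example above.
seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def hamming(s, t):
    """Number of mismatched positions between two aligned sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


# Upper triangle of the distance table.
distances = {
    (x, y): hamming(seqs[x], seqs[y]) for x, y in combinations(sorted(seqs), 2)
}

# Branch lengths read off the unrooted tree: A and B share one internal node,
# C and D share the other, and the internal branch has length 4.
leaf_branch = {"A": 2, "B": 1, "C": 1, "D": 2}
internal_branch = 4


def tree_distance(x, y):
    """Path length between two leaves on the tree drawn above."""
    same_side = {x, y} in ({"A", "B"}, {"C", "D"})
    return leaf_branch[x] + leaf_branch[y] + (0 if same_side else internal_branch)


for pair, d in distances.items():
    print(pair, d, tree_distance(*pair))
```

For this alignment every path length on the tree equals the corresponding table entry,
which is what it means for the distances to be additive on the tree.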

Fitch and Margoliash Method

The Fitch and Margoliash method uses a distance table. The sequences are combined in
threes to define the branches of the predicted tree and to calculate the branch lengths of
the tree.

Example using three sequences:

1) Draw an unrooted tree with three branches originating from a common node and
label the ends:
A --- a ---+
           +--- c --- C
B --- b ---+

2) Calculate the lengths of tree branches algebraically:

A B C
A -- 22 39
B -- -- 41
C -- -- --

distance from A to B = a + b = 22   (1)
distance from A to C = a + c = 39   (2)
distance from B to C = b + c = 41   (3)

Subtracting (2) from (3) yields:

   b + c =  41
  -a - c = -39
  ____________
   b - a =   2   (4)

Adding (1) and (4) yields 2b = 24; b = 12

so a + 12 = 22; a = 10
   10 + c = 39; c = 29

A --- 10 ---+
            +--- 29 --- C
B --- 12 ---+

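The algebra above has a closed form that is handy when checking branch lengths by hand.
The following is a minimal sketch, assuming three sequences joined at a single node; the
function name `three_point_branches` is invented for this illustration.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for the three branch lengths."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Distances from the three-sequence table above.
a, b, c = three_point_branches(22, 39, 41)
print(a, b, c)
```

With the distances 22, 39 and 41 this gives a = 10, b = 12 and c = 29, the same values
found by elimination.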
Example of Fitch-Margoliash Using Five Sequences

The Fitch-Margoliash algorithm can be extended to more than three sequences. Consider
the following table of distances between five separate sequences:

     A    B    C    D    E
A    --   22   39   39   41
B    --   --   41   41   43
C    --   --   --   18   20
D    --   --   --   --   10
E    --   --   --   --   --

Suppose that the initial tree is as follows:

A --a--+               +--c-- C
       |               |
       +------ f ------+
       |               |
B --b--+               +--g--+--d-- D
                             |
                             +--e-- E

1) The first step is to locate the most closely related sequences in the distance table. In
this case, that would be sequences D and E.

2) Now create a new table by combining the remaining sequences. For the distance from
D to A,B,C take the average distance of each of these to D ( (39 + 41 + 18) / 3 = 32.7)
For the distance from E to A,B,C, take the average distance of each of these to E
((41+43+20)/3 = 34.7). The resulting table is as follows:

           D     E     AVG ABC
D          --    10    32.7
E          --    --    34.7
AVG ABC    --    --    --

3) The average distances from D to ABC and E to ABC could also be found by
averaging the sum of the appropriate branch lengths:

D to E: d + e = 10 (1)
D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)

By subtracting the third equation from the second equation we get:


d – e = -2
Adding this result to (1) we get: 2d = 8; d = 4
Substitute back in to get e = 6

4) Now treat D and E as a single sequence, and create a new distance table. The
distance from A to DE is taken as the average of the distances from A to D and from A
to E. The other distances are calculated in a similar fashion. The resulting distance
table is:

        A    B    C    (DE)
A       --   22   39   40
B       --   --   41   42
C       --   --   --   19
(DE)    --   --   --   --

5) Identify the most closely related sequences in the table. In this case, it is C and DE.
Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.

6) Repeat the process until all lengths have been identified, at which point there is
only a single composite node left.
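The first two rounds of the five-sequence example can be reproduced with a few lines of
code. This is only a sketch under stated assumptions: the function names `three_point`
and `fm_join` are hypothetical, the closest pair is compared against the plain average
of its distances to everything else, and a merged pair takes the simple average of its
two members' distances, as in step 4 above.

```python
from itertools import combinations


def three_point(d_xy, d_x_rest, d_y_rest):
    """Branch lengths from x and y to their shared node."""
    x = (d_xy + d_x_rest - d_y_rest) / 2
    y = (d_xy + d_y_rest - d_x_rest) / 2
    return x, y


def fm_join(dist, names):
    """Join the closest pair once; return the pair, its branches, and the new table."""
    x, y = min(combinations(names, 2), key=lambda p: dist[frozenset(p)])
    rest = [n for n in names if n not in (x, y)]
    d_x_rest = sum(dist[frozenset((x, r))] for r in rest) / len(rest)
    d_y_rest = sum(dist[frozenset((y, r))] for r in rest) / len(rest)
    branches = three_point(dist[frozenset((x, y))], d_x_rest, d_y_rest)
    merged = f"({x}{y})"
    new_dist = {frozenset(p): dist[frozenset(p)] for p in combinations(rest, 2)}
    for r in rest:
        new_dist[frozenset((r, merged))] = (
            dist[frozenset((x, r))] + dist[frozenset((y, r))]
        ) / 2
    return (x, y), branches, new_dist, rest + [merged]


names = ["A", "B", "C", "D", "E"]
table = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}
dist = {frozenset(k): v for k, v in table.items()}

pair1, (d, e), dist2, names2 = fm_join(dist, names)    # joins D and E
pair2, (c, c_to_de), _, _ = fm_join(dist2, names2)     # joins C and (DE)
g = c_to_de - (d + e) / 2   # strip the average D/E branch to leave the internal branch
print(pair1, d, e, pair2, c, g)
```

Up to floating-point rounding this gives d = 4, e = 6, c = 9 and g = 5, matching the
values worked out above.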

Summary of Fitch-Margoliash Algorithm


1) Find the most closely related pair of sequences (A, B).
2) Treat the rest of the sequences as a composite. Calculate the average distance
from A to all others, and from B to all others.
3) Use these values to calculate the length of the edges a and b.
4) Treat A and B as a composite. Calculate the average distances between AB and
each of the other sequences. Create a new distance table.
5) Identify the next pair of related sequences and begin as with step 1.
6) Subtract extended branch lengths to calculate lengths of intermediate branches.
7) Repeat the entire process with all possible pairs of sequences.
8) Calculate predicted distances between each pair of sequences for each tree to find
the best tree.

Neighbor-joining algorithm

The neighbor-joining method is very similar to the Fitch-Margoliash method. The
sequences that should be joined are chosen to give the best least-squares estimates of the
branch lengths that most closely reflect the actual distances between the sequences.

The neighbor-joining method begins by creating a star topology in which no neighbors
are joined:

[Figure: star topology in which every sequence is joined directly to a single central node]

The tree is modified by joining pairs of sequences. The pair to be joined is chosen by
calculating the sum of the branch lengths for the corresponding tree. The sum of the
branch lengths is calculated as follows:

    S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}

where N is the number of sequences, i,j represent all sequences except m and n, and i < j.
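To see how the sum behaves, it can be evaluated for every candidate pair. The sketch
below reuses the five-sequence distance table from the Fitch-Margoliash example as input,
which is an assumption made purely for illustration; the helper names `d` and `s_mn`
are likewise invented here.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}


def d(x, y):
    """Symmetric lookup in the upper-triangular distance table."""
    return upper[(x, y)] if (x, y) in upper else upper[(y, x)]


def s_mn(m, n):
    """Sum of branch lengths for the tree in which m and n are joined."""
    big_n = len(names)
    others = [k for k in names if k not in (m, n)]
    to_pair = sum(d(i, m) + d(i, n) for i in others)
    among_others = sum(d(i, j) for i, j in combinations(others, 2))
    return to_pair / (2 * (big_n - 2)) + d(m, n) / 2 + among_others / (big_n - 2)


sums = {pair: s_mn(*pair) for pair in combinations(names, 2)}
best_pair = min(sums, key=sums.get)
print(best_pair, round(sums[best_pair], 3))
```

For this table the smallest sum belongs to the pair A, B (about 67.7), even though D and
E have the smallest pairwise distance, so the pair chosen by this criterion is not always
the closest pair in the table.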

For example, consider the tree when A and B are joined:

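[Figure: tree in which A and B are joined at a new internal node, with the remaining
sequences still attached to the central node]

    S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}
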
The pair that results in the smallest sum of branch lengths is then chosen to be the pair
that is joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the
actual branch lengths.

After the pair has been joined, a new distance table is created with the recently joined
sequences now entered as a composite. The neighbor-joining algorithm chooses the next
pair of sequences to join, and the F-M algorithm computes the branch lengths.

The process continues until the correctly branched tree and distances have been
identified.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

SEE STARTING ON P 262

Maximum Likelihood

SEE P 274 –277 MOUNT

Which Method Do I Choose?

Which of these methods to choose depends upon the sequences that are
being compared. If there is strong sequence similarity, then maximum parsimony
methods are best. If the sequence similarity is not strong but still clearly
recognizable, then distance methods work best. For all others, the best approach is
a maximum likelihood model.

Difficulties with phylogenetic analysis

Phylogenetic analysis would be easier if evolution occurred only in a vertical fashion.
However, horizontal or lateral transfer of genetic material (for instance through viruses)
occurs, which makes it difficult to determine the phylogenetic origin of some
evolutionary events.

If a gene is under selective pressure in different organisms, it can be rapidly evolving.
Such rapid evolution can mask earlier changes that had occurred phylogenetically. In
addition, different regions of a genome are under different pressures, and therefore
different sites within two compared sequences may be evolving at different rates.

Rearrangements of genetic material can also lead to false conclusions with phylogenetic
analysis, especially if two sequences of different evolutionary origins are placed next to
each other.

Gene duplication events also cause problems with phylogenetic analysis, since the
duplicated genes can evolve along separate pathways, leading to different functions.

PAUP (Maximum Parsimony)


MacClade (Maximum Parsimony)
CONSENSE
PHYLIP – (Distance -- neighbor joining)
CLUSTALW – distance-based tree

TreeTop http://www.genebee.msu.su/services/phtree_reduced.html
Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
ATV (A Tree Viewer) http://www.genetics.wustl.edu/eddy/atv/
How to Make a phylogenetic tree http://hiv-web.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html
A Brief Review of Common Tree Making Methods
http://bioinfo.mbb.yale.edu/mbb452a/projects/Patricia-M-Strickler.html
NCBI Primer on Phylogenetics http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
List of Phylogeny Programs
http://evolution.genetics.washington.edu/phylip/software.html
TreeViewer http://www.avl.iu.edu/projects/DNAml/
Phylip http://evolution.genetics.washington.edu/phylip.html

CECS694-02
Introduction to Bioinformatics
Lecture 10
Phylogenetic Prediction

Tree of Life

On one level, it is interesting to understand and study how the evolution of species has
occurred. There are many different resources discussing the evolution of species. This
includes the NCBI taxonomy web sites, and the University of Arizona’s tree of life
project. We’ll take a look at both of these web sites in order to get a better appreciation
for the evolution of species relative to one another.

NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/


Tree of Life http://tolweb.org/tree/

Evolutionary Trees

An evolutionary tree is a two-dimensional graph showing the evolutionary relationships
among a set of items being compared. This set can consist of organisms, genes, or DNA
sequences. Each unit in the set is referred to as a taxon, and each taxon is represented
by a distinct unit on the tree.

An evolutionary tree is composed of outer branches, or leaves, that represent the taxa, and
nodes and branches that represent the relationships among the taxa. Two taxa that are
derived from the same common ancestor will share a node in the graph. In general,
approaches to constructing evolutionary trees attempt to define the length of each branch
to the next node according to the number of sequence-level changes that occurred. One
thing to be careful of in phylogenetic analysis is that this distance may not be in direct
relation to evolutionary time. The assumption that mutations accumulate at a uniform
rate is known as the molecular clock hypothesis.

Rooted Trees

In a rooted tree topology, one sequence (the root) is defined to be the common ancestor
of all of the other sequences. A unique path leads from the root node to any other node,
and the direction of the path indicates evolutionary time. The root is chosen by including
a sequence from an organism that is thought to have branched off earlier than the other
sequences. If the molecular clock hypothesis holds, it is also possible to predict a root.
As the number of sequences increases, the number of possible rooted trees increases very
rapidly. In some cases, a bifurcating binary tree is the best model to simulate
evolutionary events in which one species branches off into two separate species.

Example of a rooted tree:

Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Star Topology (Unrooted Trees)


An unrooted tree (sometimes referred to as a star topology) shows the evolutionary
relationship among sequences, without revealing the location of the oldest ancestry.
There are fewer choices for an unrooted tree than a rooted tree.
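To get a feel for how quickly the number of trees grows, the counts for strictly
bifurcating trees can be computed from the standard double-factorial formulas:
(2n-5)!! unrooted trees and (2n-3)!! rooted trees for n sequences. These formulas are
not derived in these notes, and the function names below are made up for this sketch.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * ... * 3 * 1 for odd k (returns 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences: (2n - 5)!!"""
    if n < 3:
        raise ValueError("need at least three sequences")
    return odd_double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least two sequences")
    return odd_double_factorial(2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Four sequences give 3 unrooted and 15 rooted trees; ten sequences already give
2,027,025 unrooted and 34,459,425 rooted trees.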

Example of an unrooted tree:

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg

Methods for Determining Evolutionary Trees

There are three methods used to calculate the tree(s) that best account for the observed
variation in a set of sequences. These methods are maximum parsimony, distance, and
maximum likelihood.

Maximum Parsimony

Maximum parsimony methods predict the evolutionary tree that minimizes the number of
steps required to generate the observed variation in the sequences. In order to construct
a tree using maximum parsimony, a multiple sequence alignment must first be obtained.
For each aligned position, phylogenetic trees that require the smallest number of
evolutionary changes to produce the observed sequence changes are identified. This
continues for each position in the alignment. Those trees that produce the smallest
number of changes overall for all sequence positions are identified. This is a rather time
consuming algorithm that only works well if the sequences have a strong sequence
similarity.

1   A A G A G T G C A
2   A G C C G T G C G
3   A G A T A T C C A
4   A G A G A T C C G

Consider the example above (Mount, 250). There are a total of four sequences, which
gives three possible unrooted trees. In this case some sites are
informative, and other sites are not. An informative site has at least two different
sequence characters, each present in at least two of the sequences. Only the informative
sites need to be considered.

Possible trees:

[Figure: the three possible unrooted trees, grouping the sequences as (1,2)|(3,4),
(1,3)|(2,4), and (1,4)|(2,3)]

In this case, the optimal tree is obtained by adding the number of changes at each
informative site for each tree, and picking the tree requiring the fewest total
changes.

For a large number of sequences the number of trees to examine becomes so large that it
might not be possible to examine all possible trees. Some programs, such as PAUP, add
features that allow the user to invoke a heuristic that keeps representative trees
that best fit the data.

The informative sites in the example alignment are 5, 7, and 9.

Let’s go through the possible trees, and figure out the number of changes required for
each at the informative sites. (SEE THE POWERPOINT PRESENTATION)
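The walk through the three trees can also be sketched in code. The example below is an
illustration rather than the method used by any particular program: it finds the
informative sites, then counts the minimum number of changes per site on each
four-sequence tree using Fitch's set-intersection rule, which is not described in these
notes. All function names are invented for the example.

```python
from collections import Counter

alignment = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}


def informative_sites(aln):
    """1-based columns with at least two characters that each occur twice or more."""
    seqs = list(aln.values())
    sites = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos + 1)
    return sites


def fitch(x, y):
    """Combine two state sets; a change is counted when they do not overlap."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def changes(aln, pos, tree):
    """Minimum number of changes at 1-based column pos on the tree (a,b)|(c,d)."""
    (a, b), (c, d) = tree
    col = {name: {seq[pos - 1]} for name, seq in aln.items()}
    left, n1 = fitch(col[a], col[b])
    right, n2 = fitch(col[c], col[d])
    _, n3 = fitch(left, right)
    return n1 + n2 + n3


trees = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
sites = informative_sites(alignment)
totals = {t: sum(changes(alignment, p, t) for p in sites) for t in trees}
best_tree = min(totals, key=totals.get)
print(sites, totals, best_tree)
```

Over the informative sites 5, 7, and 9 the three trees need 4, 5, and 6 changes
respectively, so the tree that groups sequences 1 and 2 against 3 and 4 is the most
parsimonious.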

One problem with determining evolutionary distance between sequences is that columns
representing greater variation dominate the analysis. One way to overcome this problem
with long branch lengths is to look only at transversion events, which are the
most significant base changes (i.e., changes from a purine to a pyrimidine or vice versa).
This is referred to as Lake’s method of invariants.

LOOK AT THE MITOCHONDRIAL SEQUENCE ANALYSIS ON P 252

Distance Methods

The distance method for construction of phylogenetic trees looks at the number of
changes between each pair in a group of sequences to produce a phylogenetic tree of the
group. The goal of distance methods is to identify a tree that positions neighbors
correctly and that also has branch lengths which reproduce the original data as closely as
possible.

CLUSTALW uses the neighbor-joining method to build a guide tree for multiple sequence
alignments. The PHYLIP suite of programs also employs neighbor-joining methods.

Phylip http://evolution.genetics.washington.edu/phylip.html

Distance analysis programs in PHYLIP

FITCH: estimates a phylogenetic tree assuming additivity of branch lengths using the
Fitch-Margoliash method.
KITSCH: same as FITCH, but under the assumption of a molecular clock.
NEIGHBOR: estimates phylogenies using the neighbor-joining (no molecular clock
assumed) or unweighted pair group method with arithmetic mean (UPGMA) (molecular
clock assumed).

For phylogenetic analysis, the distance score is either the number of mismatched
positions in the alignment or the number of sequence positions that must be changed to
convert one sequence into the other.

The success of distance methods depends on the degree to which the distances among a
set of sequences can be made additive on a predicted evolutionary tree.

Consider the alignment:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

The distances between these sequences can be shown as a table:

     A    B    C    D
A    -    3    7    8
B    -    -    6    7
C    -    -    -    3
D    -    -    -    -

Using this information, an unrooted tree showing the relationship between these
sequences can be drawn:

A                     C
 \ 2               1 /
  \        4        /
   +---------------+
  /                 \
 / 1               2 \
B                     D

Fitch and Margoliash Method

The Fitch and Margoliash method uses a distance table. The sequences are combined in
threes to define the branches of the predicted tree and to calculate the branch lengths of
the tree.

Example using three sequences:

1) Draw an unrooted tree with three branches originating from a common node and
label the ends:

A --- a ---+
           +--- c --- C
B --- b ---+

2) Calculate the lengths of tree branches algebraically:

A B C
A -- 22 39
B -- -- 41
C -- -- --

distance from A to B = a + b = 22   (1)
distance from A to C = a + c = 39   (2)
distance from B to C = b + c = 41   (3)

Subtracting (2) from (3) yields:

   b + c =  41
  -a - c = -39
  ____________
   b - a =   2   (4)

Adding (1) and (4) yields 2b = 24; b = 12

so a + 12 = 22; a = 10
   10 + c = 39; c = 29

A --- 10 ---+
            +--- 29 --- C
B --- 12 ---+

Example of Fitch-Margoliash Using Five Sequences

The Fitch-Margoliash algorithm can be extended to more than three sequences. Consider
the following table of distances between five separate sequences:

     A    B    C    D    E
A    --   22   39   39   41
B    --   --   41   41   43
C    --   --   --   18   20
D    --   --   --   --   10
E    --   --   --   --   --

Suppose that the initial tree is as follows:

A --a--+               +--c-- C
       |               |
       +------ f ------+
       |               |
B --b--+               +--g--+--d-- D
                             |
                             +--e-- E

1) The first step is to locate the most closely related sequences in the distance table. In
this case, that would be sequences D and E.

2) Now create a new table by combining the remaining sequences. For the distance from
D to A,B,C take the average distance of each of these to D ( (39 + 41 + 18) / 3 = 32.7)
For the distance from E to A,B,C, take the average distance of each of these to E
((41+43+20)/3 = 34.7). The resulting table is as follows:

           D     E     AVG ABC
D          --    10    32.7
E          --    --    34.7
AVG ABC    --    --    --

3) The average distances from D to ABC and E to ABC could also be found by
averaging the sum of the appropriate branch lengths:

D to E: d + e = 10 (1)
D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2)
E to ABC: e + m = 34.7 (3)

By subtracting the third equation from the second equation we get:

d – e = -2

Adding this result to (1) we get: 2d = 8; d = 4
Substitute back in to get e = 6

4) Now treat D and E as a single sequence, and create a new distance table. The
distance from A to DE is taken as the average of the distances from A to D and from A
to E. The other distances are calculated in a similar fashion. The resulting distance
table is:

        A    B    C    (DE)
A       --   22   39   40
B       --   --   41   42
C       --   --   --   19
(DE)    --   --   --   --

5) Identify the most closely related sequences in the table. In this case, it is C and DE.
Using algebra, the distance c can be calculated to be 9, and g is calculated to be 5.

6) Repeat the process until all lengths have been identified, at which point there is
only a single composite node left.

Summary of Fitch-Margoliash Algorithm


1) Find the most closely related pair of sequences (A, B).
2) Treat the rest of the sequences as a composite. Calculate the average distance
from A to all others, and from B to all others.
3) Use these values to calculate the length of the edges a and b.
4) Treat A and B as a composite. Calculate the average distances between AB and
each of the other sequences. Create a new distance table.
5) Identify the next pair of related sequences and begin as with step 1.
6) Subtract extended branch lengths to calculate lengths of intermediate branches.
7) Repeat the entire process with all possible pairs of sequences.
8) Calculate predicted distances between each pair of sequences for each tree to find
the best tree.

Neighbor-joining algorithm

The neighbor-joining method is very similar to the Fitch-Margoliash method. The
sequences that should be joined are chosen to give the best least-squares estimates of the
branch lengths that most closely reflect the actual distances between the sequences.

The neighbor-joining method begins by creating a star topology in which no neighbors
are joined:

[Figure: star topology in which every sequence is joined directly to a single central node]

The tree is modified by joining pairs of sequences. The pair to be joined is chosen by
calculating the sum of the branch lengths for the corresponding tree. The sum of the
branch lengths is calculated as follows:

    S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}

where N is the number of sequences, i,j represent all sequences except m and n, and i < j.

For example, consider the tree when A and B are joined:

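[Figure: tree in which A and B are joined at a new internal node, with the remaining
sequences still attached to the central node]

    S_{mn} = \frac{\sum_{i} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j} d_{ij}}{N-2}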

The pair that results in the smallest sum of branch lengths is then chosen to be the pair
that is joined. Based on this choice, the Fitch-Margoliash algorithm is used to compute the
actual branch lengths.

After the pair has been joined, a new distance table is created with the recently joined
sequences now entered as a composite. The neighbor-joining algorithm chooses the next
pair of sequences to join, and the F-M algorithm computes the branch lengths.

The process continues until the correctly branched tree and distances have been
identified.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

UPGMA works by clustering the sequences, starting with the most similar sequences and
working towards more distant sequences.

The process assembles a tree upwards, with each node being added above the others, and
the edge lengths being determined by the difference in the heights of the nodes.

The distance dij between two clusters Ci and Cj is defined to be the average distance
between pairs of sequences from each cluster:

    d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}

where |Ci| and |Cj| are the number of sequences in clusters i and j, respectively.

The algorithm for UPGMA clustering (Durbin p 166) is as follows:

1. Assign each sequence i to its own cluster Ci.
2. Define one leaf of the tree T for each sequence, and place it at height 0.
3. Determine the two clusters, i and j, for which dij is minimal.
4. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l.
5. Define a node k with daughter nodes i and j, and place it at height dij/2.
6. Add k to the current clusters and remove i and j.
7. Continue steps 3-6 until only two clusters i and j remain, and place the root of the
tree at height dij/2.
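The steps above can be sketched in a few lines of code. This is a minimal illustration,
not the Durbin implementation: the function name `upgma` and the cluster labels are
invented, the input is the five-sequence distance table from the Fitch-Margoliash
example (an assumption made for demonstration), and the distance to a merged cluster is
computed as the size-weighted average of the two old distances, which is equivalent to
the average over all sequence pairs defined above.

```python
from itertools import combinations


def upgma(names, table):
    """Return the list of merges as (cluster_i, cluster_j, node_height)."""
    size = {name: 1 for name in names}                 # cluster label -> member count
    d = {frozenset(k): float(v) for k, v in table.items()}
    merges = []
    while len(size) > 1:
        # Step 3: find the closest pair of current clusters.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        k = f"({i},{j})"
        # Step 4: distance from the new cluster k to every other cluster l.
        for l in size:
            if l not in (i, j):
                d[frozenset((k, l))] = (
                    size[i] * d[frozenset((i, l))] + size[j] * d[frozenset((j, l))]
                ) / (size[i] + size[j])
        # Steps 5 and 6: place node k at height dij / 2 and update the clusters.
        size[k] = size.pop(i) + size.pop(j)
        merges.append((i, j, dij / 2))
    return merges


names = ["A", "B", "C", "D", "E"]
table = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}
merges = upgma(names, table)
for i, j, height in merges:
    print(i, j, round(height, 3))
```

For this table D and E are joined first at height 5, then C joins them at height 9.5,
then A and B are joined at height 11, and the root sits at a height of about 20.3.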

EXAMPLE OF UPGMA

Consider the case where there are five sequences represented by dots on a graph. The
spacing between each of these is representative of the distance between them:

The first step is to assign each of the sequences to its own cluster, which gives a
number to each of them. In addition, the base of the tree can be constructed, with each
sequence as a leaf of the tree:

[Figure: five points labeled 1 to 5, each in its own cluster]

Now select the two clusters that are closest to each other. These are the sequences 1 and
2. Create a single cluster for these two sequences, and create a parent node (node 6) in
the tree at height d12/2.

[Figure: points 1 and 2 enclosed in one cluster; tree with node 6 above leaves 1 and 2]

Continue on, selecting the two clusters that are closest: in this case, it is 4 and 5.
Combine them into a single cluster, and update the tree:

[Figure: points 4 and 5 enclosed in a second cluster; tree with node 7 above leaves 4 and 5]

The next two clusters are the one containing 4 and 5, and the one containing 3:

[Figure: cluster containing 4 and 5 merged with 3; tree with node 8 above node 7 and leaf 3]

There are now only two clusters left, so join them to complete the tree:

[Figure: all five points in one cluster; completed tree with root node 9 above nodes 6 and 8]

SEE STARTING ON P 262

Maximum Likelihood

SEE P 274 –277 MOUNT

Which Method Do I Choose?

Which of these methods to choose depends upon the sequences that are
being compared. If there is strong sequence similarity, then maximum parsimony
methods are best. If the sequence similarity is not strong but still clearly
recognizable, then distance methods work best. For all others, the best approach is
a maximum likelihood model.

Difficulties with phylogenetic analysis

Phylogenetic analysis would be easier if evolution occurred only in a vertical fashion.
However, horizontal or lateral transfer of genetic material (for instance through viruses)
occurs, which makes it difficult to determine the phylogenetic origin of some
evolutionary events.

If a gene is under selective pressure in different organisms, it can be rapidly evolving.
Such rapid evolution can mask earlier changes that had occurred phylogenetically. In
addition, different regions of a genome are under different pressures, and therefore
different sites within two compared sequences may be evolving at different rates.

Rearrangements of genetic material can also lead to false conclusions with phylogenetic
analysis, especially if two sequences of different evolutionary origins are placed next to
each other.

Gene duplication events also cause problems with phylogenetic analysis, since the
duplicated genes can evolve along separate pathways, leading to different functions.

PAUP (Maximum Parsimony)


MacClade (Maximum Parsimony)
CONSENSE
PHYLIP – (Distance -- neighbor joining)
CLUSTALW – distance-based tree

Consider the following list of Globin sequences:

>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH
GKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV
TAVAS
ALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGK
EFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKV
KAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGK
EFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAH
GKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMV
TGVAS
ALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKK
HGATVL
TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKAL
ELFR
KDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQ
KVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLLGHCLLVTLARHYPGDFSPA
LQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGS
KVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAE
AHAAWDKFLSVVSSVLTEKYR

Create a phylogeny from these.

1) Demonstrate using the GCG software (http://kingtut.spd.louisville.edu:999/)

Examples using a phylogenetic program:

http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

TreeTop http://www.genebee.msu.su/services/phtree_reduced.html
Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
ATV (A Tree Viewer) http://www.genetics.wustl.edu/eddy/atv/
How to Make a phylogenetic tree http://hiv-web.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html
A Brief Review of Common Tree Making Methods
http://bioinfo.mbb.yale.edu/mbb452a/projects/Patricia-M-Strickler.html
NCBI Primer on Phylogenetics http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
List of Phylogeny Programs
http://evolution.genetics.washington.edu/phylip/software.html
TreeViewer http://www.avl.iu.edu/projects/DNAml/
Phylip http://evolution.genetics.washington.edu/phylip.html

CECS694-02
Introduction to Bioinformatics
Lecture 11
RNA Secondary Structure Prediction

Introduction to RNA sequence analysis

RNA molecules are important to study since they are involved in important biochemical
functions, including translation, RNA splicing, processing and editing, cellular
localization, and catalysis.

RNA sequence analysis needs to be treated differently than DNA sequence analysis,
since RNA structures fold and base pair with themselves to form secondary structures.
Therefore, it is not necessarily the sequence but the structure conservation that is most
important in RNA sequence analysis.

Variations in RNA sequence maintain base-pairing patterns that give rise to these
secondary structures. Therefore, when one nucleotide of a base pair changes, the
nucleotide with which it pairs must also change to maintain the same
structure. For instance, if you have the base pair G-C, and the G mutates to an A, then
the C should mutate to a U to maintain a base pairing at this location, which promotes the
same secondary structure. Such a variation is referred to as covariation.

In order to determine the secondary structure of the RNA molecule, all possible choices
of complementary sequences are considered, and the sets that provide the most
energetically stable molecules are chosen.

Another method to predict secondary structure in RNA takes into account conserved
patterns of base-pairing. Positions of covariance are studied, and are taken to be
conserved matches, since they maintain the secondary structure. Locating regions of
covariance in sequence data is a computationally challenging task.

Features of RNA Secondary structure

RNA is a polymer composed of a combination of four nucleotides: adenine (A), cytosine
(C), guanine (G), and uracil (U).

G-C and A-U form complementary hydrogen-bonded base pairs, with the GC base pairs
being more stable since they form three hydrogen bonds as opposed to the two hydrogen
bonds formed by AU base pairs.

In addition to the canonical Watson-Crick GC and AU base pairs, non-canonical pairs
can occur in RNA secondary structure as well. The most common of these non-canonical
pairs is GU.

RNA is typically produced as a single-stranded molecule (unlike DNA) which folds upon
itself to form base pairs. This structure is referred to as the secondary structure of the
RNA.

RNA secondary structure can be viewed as an intermediary between a linear molecule
and a three-dimensional structure. RNA secondary structure is mainly composed of
double-stranded RNA regions formed by folding the single-stranded RNA molecule back
on itself. There are a number of different secondary structures that can be formed from
this base-pairing, including:

Stem Loops (Hairpin loops)

Loops are generally at least 4 bases long

Bulge Loops

Bulge loops occur when bases on one side of the structure cannot form base pairs.

Interior Loops

Interior loops occur when bases on both sides of the structure cannot form base pairs.

Junctions or Multiloops

Junctions occur where two or more double-stranded regions converge to form a closed
structure.

In addition, tertiary interactions can be present as well. Such tertiary interactions are
located using covariance analysis. The types of tertiary interactions present in RNA
molecules include:

Kissing Hairpins

In kissing hairpins, the unpaired bases of two separate hairpin loops base pair with one
another.

Pseudoknots
Hairpin-Bulge Interactions

Limitations of Secondary Structure Prediction

Three assumptions are made in secondary structure prediction:

1) The most likely structure is similar to the energetically most stable structure.
2) The energy associated with any position in the structure is only influenced by
local sequence and structure.
3) The structure formed does not produce pseudoknots.

One method of representing the base pairs of a secondary structure is to draw the
structure in a circle. An arc is drawn to represent each base pairing found in the
structure. If any of the arcs cross, then a pseudoknot is present.
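The crossing-arc test is easy to express in code. In the sketch below a structure is
given as a list of base pairs (i, j) by sequence position; two arcs cross exactly when
one pair starts inside the other and ends outside it. The function name
`has_pseudoknot` and the example structures are made up for illustration.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """True if any two base-pair arcs cross when drawn on the circle."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        # After sorting, i <= k, so the arcs cross when k falls inside (i, j)
        # and l falls outside it.
        if i < k < j < l:
            return True
    return False


nested_hairpin = [(1, 10), (2, 9), (3, 8)]     # arcs are nested, none cross
two_hairpins = [(1, 5), (6, 12)]               # arcs sit side by side
crossing = [(1, 6), (4, 10)]                   # second arc starts inside the first

print(has_pseudoknot(nested_hairpin), has_pseudoknot(two_hairpins), has_pseudoknot(crossing))
```

Only the last structure is reported as containing a pseudoknot.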

An example of the circular method is shown below:


Image source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html

RNA sequence evolution

With RNA sequences, homology is not defined in terms of sequence similarity, but rather
in terms of common secondary structure. Two sequences that do not appear to have
significant sequence similarity can still have conserved secondary structure.

Inferring structure by comparative sequence analysis


Comparative sequence analysis is the most reliable computational method for
determining the secondary structure of an RNA sequence. For example, consider the
following example from Durbin, et al., p 266:

In order to use comparative sequence analysis, the first step is to calculate a multiple
sequence alignment. This requires that the sequences be similar enough so that they can
be initially aligned. At the same time, the sequences should be dissimilar enough so that
covarying substitutions can be detected.

The mutual information gained by aligning two columns that covary is determined by the
function:

    M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}

where f_{x_i} is the frequency of base x_i in column i, and f_{x_i x_j} is the joint
(pairwise) frequency of the base pair x_i, x_j in columns i and j. For RNA, the mutual
information ranges between 0 and 2 bits. If columns i and j are uncorrelated, the mutual
information is 0.
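A small numerical example may help make the formula concrete. The sketch below computes
the mutual information of two alignment columns, each given as a string with one
character per sequence; the function name and the toy columns are assumptions made for
this illustration and are not taken from Durbin.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same alignment")
    n = len(col_i)
    f_i = Counter(col_i)                  # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))     # joint counts of (x_i, x_j)
    total = 0.0
    for (a, b), count in f_ij.items():
        joint = count / n
        total += joint * log2(joint / ((f_i[a] / n) * (f_j[b] / n)))
    return total


covarying = mutual_information("GCAU", "CGUA")      # every pair compensates
conserved = mutual_information("GGGG", "CCCC")      # paired, but no variation
independent = mutual_information("GGCC", "CGCG")    # varies without correlation
print(covarying, conserved, independent)
```

The perfectly covarying columns reach the maximum of 2 bits, while the fully conserved
pair scores 0 even though it is base paired, which is why some sequence variation is
needed before covariation can be detected.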

An example of a plot for the mutual information of the yeast tRNA-Phe is given below
(Durbin, et al., p 268):

The mutual information from this graph produces the following structure:

Predicting Structure from a single sequence


Suppose we don’t have a set of similar RNAs from which the structure can be inferred
using covariance methods. A single sequence can fold into a very large number of possible
secondary structures. For example, an RNA molecule only 200 bases
long has 10^50 possible secondary structures, many of which are not plausible. A method
to detect the correct structure is needed.

One of the simplest methods to find self-complementary regions in an RNA sequence is
to perform a dot-plot of the sequence against its complement. The repeat regions that are
found can potentially base pair with each other to form secondary structures. More
advanced dot-plot techniques incorporate free energy measures as well.

Image Source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html

Base Pair Maximization – Nussinov Folding Algorithm

One approach to predicting secondary structure looks at finding the structure with the
most base pairs. An efficient dynamic programming approach to this problem was
introduced in the late 1970’s by Nussinov.

According to the Nussinov algorithm, there are four ways to get the best structure from i
to j from the best structures of the smaller subsequences:

1) add the i,j pair onto the best structure found for subsequence i+1, j-1
2) add unpaired position i onto the best structure for subsequence i+1, j
3) add unpaired position j onto the best structure for subsequence i, j-1
4) combine two optimal structures i,k and k+1, j

The possible structures are shown below (Durbin et al., p 269):

The Nussinov RNA folding prediction program works by comparing a sequence against
itself in a dynamic programming matrix with the above rules for scoring the structure at a
particular point. Since the structure is folding upon itself, it is only necessary to calculate
half of the matrix.

Initialization step:
In the initialization step, the scores along the main diagonal and the diagonal
just below it are set to zero. Formally, the scoring matrix, M, is initialized as follows:

M[i][i] = 0 for i = 1 to L (where L is the length of the sequence)
M[i][i-1] = 0 for i = 2 to L

Using the example in Durbin, et al. with the RNA sequence GGGAAAUCC, the matrix
now looks like the following, such that sequences of length 1 will score 0:

    G  G  G  A  A  A  U  C  C
G   0
G   0  0
G      0  0
A         0  0
A            0  0
A               0  0
U                  0  0
C                     0  0
C                        0  0

Now the matrix is filled in, starting with subsequences of length 2, and ending at
subsequences of length L. The four rules for filling in the matrix are used:

M[i][j] = max of the following four:

M[i+1][j] (ith residue is hanging off by itself)
M[i][j-1] (jth residue is hanging off by itself)
M[i+1][j-1] + S(xi, xj) (ith and jth residues are paired; if xi is the complement of xj,
then S(xi, xj) = 1; otherwise it is 0)
max over i<k<j of (M[i][k] + M[k+1][j]) (merging two substructures)

When looking for subsequences of length 2, the matrix is filled as follows, since A-U is
the only base-pair found:

    G  G  G  A  A  A  U  C  C
G   0  0
G   0  0  0
G      0  0  0
A         0  0  0
A            0  0  0
A               0  0  1
U                  0  0  0
C                     0  0  0
C                        0  0

Filling in for subsequences of length 3, the matrix becomes:

    G  G  G  A  A  A  U  C  C
G   0  0  0
G   0  0  0  0
G      0  0  0  0
A         0  0  0  0
A            0  0  0  1
A               0  0  1  1
U                  0  0  0  0
C                     0  0  0
C                        0  0

The final filled matrix is as follows:

    G  G  G  A  A  A  U  C  C
G   0  0  0  0  0  0  1  2  3
G   0  0  0  0  0  0  1  2  3
G      0  0  0  0  0  1  2  2
A         0  0  0  0  1  1  1
A            0  0  0  1  1  1
A               0  0  1  1  1
U                  0  0  0  0
C                     0  0  0
C                        0  0
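
The initialization, fill, and traceback steps can be sketched as follows. This is a minimal illustration rather than the implementation from Durbin et al.: it assumes 0-based indices, Watson-Crick pairs only, and no minimum loop length, and the function names `nussinov_fill` and `nussinov_traceback` are hypothetical.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def _inner(M, i, j):
    """Score of subsequence i+1..j-1; empty subsequences score 0 (M[i][i-1] = 0)."""
    return M[i + 1][j - 1] if i + 1 <= j - 1 else 0


def nussinov_fill(seq, pairs=WATSON_CRICK):
    """Fill the upper triangle of the base-pair maximization matrix."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]          # M[i][i] = 0 by construction
    for span in range(1, n):                 # subsequences of length span + 1
        for i in range(n - span):
            j = i + span
            score = 1 if (seq[i], seq[j]) in pairs else 0
            best = max(M[i + 1][j], M[i][j - 1], _inner(M, i, j) + score)
            for k in range(i + 1, j):        # merge two substructures
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M


def nussinov_traceback(M, seq, pairs=WATSON_CRICK):
    """Recover one optimal set of base pairs as (i, j) index tuples."""
    found, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i + 1][j] == M[i][j]:
            stack.append((i + 1, j))
        elif M[i][j - 1] == M[i][j]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in pairs and _inner(M, i, j) + 1 == M[i][j]:
            found.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if M[i][k] + M[k + 1][j] == M[i][j]:
                    stack.extend([(k + 1, j), (i, k)])
                    break
    return sorted(found)


seq = "GGGAAAUCC"
M = nussinov_fill(seq)
print(M[0][len(seq) - 1])            # maximum number of base pairs
print(nussinov_traceback(M, seq))    # one optimal set of pairs
```

Several structures can share the maximum score, so the traceback returns only one of them; which one depends on the order in which the four cases are tested.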

Traceback through this matrix (covered on p 271, Durbin et al.) leads to the following
structure:

Given the four possibilities for the maximum structure in the Nussinov algorithm, it can
be expressed as a stochastic context-free grammar as follows:

S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → aSu | cSg | uSa | gSc
S → SS
Such a simplistic approach will not give accurate structure predictions, since it does not
take into account important structural features, such as nearest neighbor interactions,
stacking interactions, and loop length preferences.

Energy Minimization Methods

Since RNA folding is determined by biophysical properties, methods that take these
properties into account are more likely to yield accurate predictions. One widely used
method is energy minimization, which predicts that the correct secondary
structure is the one that minimizes the free energy (∆G).

The free energy of an RNA secondary structure is calculated as the sum of the individual
contributions of loops, base pairs, and other secondary structure elements. Energies of
stems are calculated as the stacking contributions between neighboring base pairs.

The predicted free-energy values (kcal/mol at 37 °C) are calculated as follows:

Stacking Energies for base pairs


A/U C/G G/C U/A G/U U/G
A/U -0.9 -1.8 -2.3 -1.1 -1.1 -0.8
C/G -1.7 -2.9 -3.4 -2.3 -2.1 -1.4
G/C -2.1 -2.0 -2.9 -1.8 -1.9 -1.2
U/A -0.9 -1.7 -2.1 -0.9 -1.0 -0.5
G/U -0.5 -1.2 -1.4 -0.8 -0.4 -0.2
U/G -1.0 -1.9 -2.1 -1.1 -1.5 -0.4

Destabilizing Energies for Loops


Number of Bases 1 5 10 20 30
Internal -- 5.3 6.6 7.0 7.4
Bulge 3.9 4.8 5.5 6.3 6.7
Hairpin -- 4.4 5.3 6.1 6.5
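
As a rough illustration of how the two tables combine, the sketch below sums the stacking energies of a stem and adds the hairpin loop penalty. It rests on assumptions not stated in the tables: the row is taken as the first base pair and the column as the pair stacked on it, only the tabulated loop sizes are supported, and the function name `hairpin_energy` is hypothetical.

```python
# Values copied from the tables above (kcal/mol at 37 °C).
PAIRS = ["A/U", "C/G", "G/C", "U/A", "G/U", "U/G"]
STACKING_ROWS = [
    [-0.9, -1.8, -2.3, -1.1, -1.1, -0.8],  # A/U
    [-1.7, -2.9, -3.4, -2.3, -2.1, -1.4],  # C/G
    [-2.1, -2.0, -2.9, -1.8, -1.9, -1.2],  # G/C
    [-0.9, -1.7, -2.1, -0.9, -1.0, -0.5],  # U/A
    [-0.5, -1.2, -1.4, -0.8, -0.4, -0.2],  # G/U
    [-1.0, -1.9, -2.1, -1.1, -1.5, -0.4],  # U/G
]
# Assumed convention: STACKING[(first_pair, next_pair)].
STACKING = {
    (row, col): energy
    for row, energies in zip(PAIRS, STACKING_ROWS)
    for col, energy in zip(PAIRS, energies)
}
HAIRPIN_LOOP = {5: 4.4, 10: 5.3, 20: 6.1, 30: 6.5}


def hairpin_energy(stem_pairs, loop_size):
    """Sum stacking energies along the stem and add the hairpin loop penalty."""
    stacking = sum(STACKING[(a, b)] for a, b in zip(stem_pairs, stem_pairs[1:]))
    return stacking + HAIRPIN_LOOP[loop_size]


# A three-pair stem closed by a five-base hairpin loop.
print(round(hairpin_energy(["G/C", "C/G", "A/U"], 5), 2))
```

A positive total, as in this small example, means the loop penalty outweighs the stacking contributions, so such a short stem would not be predicted to be stable.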

In order to find the structure for which the minimum free energy is found, the sequence is
compared against itself using a dynamic programming approach similar to the maximum
base-paired structure approach previously described. However, instead of using a scoring
scheme for the base pairs present, the score is based upon the free energies described
above. Gaps between matches represent some form of a loop, so the gap score is
calculated using the above tables as well. The most widely used software that
incorporates this minimum free energy algorithm is MFOLD.

Suboptimal folds

The correct structure is not necessarily the one with the minimum free energy, but may be a
structure within a certain threshold of the calculated minimum energy. Therefore, the
MFOLD algorithm has been updated to report suboptimal foldings as well.

Covariance Models

In order to locate covarying sites in RNA sequences, seven different approaches are offered in
Mount, p 225.

The key to covariance is the measure of the mutual information content previously
discussed. The mutual information content can be plotted on a motif logo, which can
give insight into the folding of a particular sequence.
Image source: http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html

A formal covariance model, COVE, was devised by Eddy and Durbin. The model
provides very accurate results, but is extremely slow and unsuitable for large genomes.

Stochastic Context Free Grammars (SCFGs) have also been used to model RNA
secondary structure. Examples of these are tRNAscan-SE, and a program created to find
snoRNAs. Typically, with SCFG approaches, the grammars are created by using a
training set of data, and then the grammars are applied to potential sequences to see if
they fit into the language.

SCFGs allow the detection of sequences belonging to a family, such as tRNAs, group I
introns, snoRNAs, snRNAs, etc.

With an SCFG approach, base-paired columns are modeled by pairwise-emitting
nonterminals (for example aWu) while single-stranded columns are modeled by leftwise-emitting
nonterminals (such as gW), when possible. Any RNA structure can then be
reduced to an SCFG (see Durbin, et al., p 278-279).

Transformational Grammars

Transformational grammars were first described by the linguist Noam Chomsky in the
1950’s. (Yes, this is the same Noam Chomsky who has expressed various dissident
political views throughout the years!) Transformational grammars are very important in
computer science, most notably in compiler design. Grammars are covered in more
detail in compiler and automaton classes, so we will only briefly touch on them here.

Web site:
http://web.mit.edu/linguistics/www/chomsky.home.html

The idea behind transformational grammars is to take a set of outputs (such as a sentence,
or in our case, an RNA structure) and determine whether or not it can be produced using
a set of rules for the language.

Transformational grammars consist of a set of symbols and production rules by which the
symbols can be put together. The symbols can be either terminal (emitting) symbols or
non-terminal symbols that can be used to create longer strings of symbols.

Grammar for Palindromic sequences

First, consider the case of palindromic DNA sequences. There are a total of five possible
terminal symbols: {a, c, g, t, ε}, where ε represents the blank terminal symbol. The
production rules for creating a palindromic sequence are as follows, where S and W are
non-terminal symbols:

S → W
W → aWa | cWc | gWg | tWt
W → a | c | g | t | ε

Using these production rules, we can create a derivation of the palindromic sequence
acttgttca as follows:

S ⇒ W ⇒ aWa ⇒ acWca⇒actWtca ⇒ acttWttca ⇒ acttgttca
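
The derivation can be reproduced mechanically. The sketch below is a small illustration of applying the three production rules from the outside in; the function name `derive_palindrome` is hypothetical.

```python
def derive_palindrome(seq):
    """Return the sentential forms deriving seq, or None if seq is not in the language."""
    if any(base not in "acgt" for base in seq):
        return None
    steps = ["S", "W"]                      # S -> W
    left, right = "", ""
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        if seq[lo] != seq[hi]:
            return None                     # no rule W -> xWy with x != y
        left += seq[lo]
        right = seq[hi] + right
        steps.append(left + "W" + right)    # W -> xWx
        lo, hi = lo + 1, hi - 1
    middle = seq[lo] if lo == hi else ""    # W -> x, or W -> ε for even lengths
    steps.append(left + middle + right)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
print(derive_palindrome("acttgac"))
```

A sequence is accepted exactly when a derivation exists, which is the question a grammar-based method asks of a candidate sequence.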

In order to align a context-free grammar to a sequence, a parse tree can be created, where
the root of the tree is the non-terminal start symbol, S. Leaves of the parse tree are the
terminal symbols in the sequence, and internal nodes are the nonterminals. The leaves
can be parsed from left to right to view the results of the production. An example for the
parse tree on the above production is as follows:

[Figure: parse tree for the derivation above, with S at the root]

More information on parse trees can be found in Durbin, et al., Chapter 9.

An SCFG for RNA secondary structure can be constructed as follows:

S → W
W → WW (bifurcation)
W → aWu | cWg | gWc | uWa (base pairs)
W → gWu | uWg (G-U wobble pairs)
W → aW | cW | gW | uW (bulges on one side)
W → Wa | Wc | Wg | Wu (bulges on opposite side)
W → a | c | g | u | ε

Using this grammar, the structure for the sequence:

GCUUACGACCAUAUCACGUUGAAUGCACGCCAUCCCGUCCGAUCUGGCAAG
UUAAGCAACGUUGAGUCCAGUUAGUACUUGGAUCGGAGACGGCCUGGGAA
UCCUGGAUGUUGUAAGCU

Produced by MFOLD, can be constructed using the following productions (5’ to 3’):

S⇒W⇒Wu⇒gWcu⇒gcWgcu⇒gcuWagcu⇒gcuuWaagcu⇒
gcuuaWuaagcu⇒gcuuacWguaagcu⇒
gcuuacgWuguaagcu⇒gcuuacgaWuuguaagcu⇒
gcuuacgacWguuguaagcu⇒gcuuacgaccWguuguaagcu⇒
gcuuacgaccaWguuguaagcu⇒....

Read Mount, Chapter 5


Durbin, et al, Chapters 9 and 10
CECS694-02
Introduction to Bioinformatics
Lecture 12
Microarray Image Analysis

Introduction to Microarray Image Analysis

Genes are regions of a genome that code for either a structural or functional protein.
Genes are of interest to biologists due to their association with diseases. In the past,
determining whether a gene was turned on or off under a specific condition was an
expensive and time-consuming task. Within the past 10 years, the emergence of a new
technology, called microarrays, has made it possible to study the expression pattern of
thousands of genes simultaneously. Microarrays allow the study of genes (actually any
sequence of interest) under differing conditions.

Approach to microarray construction

The idea behind microarray construction is to spot up to tens of thousands of DNA/RNA
molecules on a slide, each of which uniquely identifies a certain region. These molecules
can be small, on the order of 25 bp long, or can be somewhat larger. For typical gene
expression experiments, the molecule is around 500 bp long. For the image detection, it
is necessary for the molecules to be a consistent size for each item being studied.

The ultimate goal of microarray data is to be able to understand how the expression levels
of different genes differ under two separate conditions. By asking and answering such
questions, we can get an idea of which genes are involved in a certain disease, and
potentially, the pathways involved in these diseases.

In order to figure out which genes are expressed in a given condition, cells in a given
condition are taken, and the mRNA from these cells is extracted. The mRNA represents
the genes that are turned on in these cells. These mRNA sequences are then labeled. The
manner in which they are labeled depends upon the technique that is being
used.

Single Channel Microarrays

With single channel microarrays, the genes present under a given condition are labeled
with biotin. The expressed genes are washed over the microarray slide, and the expressed
genes will hybridize at the appropriate locations. What results is a dark spot where the
expressed genes have hybridized. If a clear microscope slide has been used to spot the
microarrays, then light can be passed underneath. Black spots represent genes that are
expressed in a given condition. In order to study two different conditions in single
channel microarrays, two separate slides must be used. An example of a single channel
microarray is given below:

Two Channel Microarrays

With two channel microarrays, the samples under different conditions are labeled
separately. The labels normally incorporated are green and red. For argument's sake,
assume that the control is labeled green, and the sample is labeled red. Both samples are
washed over the microarray slide, and hybridization occurs. Each spot on the slide is
now one of four colors as shown below:

The colors correspond to the expression of the gene under the different conditions. For
example, spots that are only green are highly expressed in the control, while spots that are
red are highly expressed in the sample. Spots that are yellow are equally expressed in
both sample and control, while black spots are genes that are not expressed in either the
sample or the control.
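
The four-color interpretation can be written as a simple rule on the two channel intensities. This is a sketch only: the function name `spot_color`, the detection threshold, and the two-fold ratio are hypothetical choices for illustration, not values from any particular scanner or analysis package.

```python
def spot_color(red, green, threshold=100.0, ratio=2.0):
    """Classify a spot from its red (sample) and green (control) intensities."""
    if red < threshold and green < threshold:
        return "black"    # not expressed in either condition
    if red >= ratio * green:
        return "red"      # more highly expressed in the sample
    if green >= ratio * red:
        return "green"    # more highly expressed in the control
    return "yellow"       # roughly equal expression in both


for red, green in [(50, 40), (1000, 100), (100, 1000), (800, 900)]:
    print(red, green, spot_color(red, green))
```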

Determining image intensity

Once the spots are located, the difficulty is in quantifying the image signals.
Generally, the images are converted to a matrix of numbers. This step
requires image processing. Besides the spot intensity, other measures that might be taken
include measurements of error and measurements of background noise. For instance, you
might ask: how green is a spot? Answering this question gives an indication of how large
the difference in expression levels is between the control and the sample.

Such an approach is often referred to as a fold approach. In other words, how does the
expression level change under a given condition? (Two-fold difference? Four-fold
difference?) Besides determining this value, it is important to figure out when a
significant change has been made. One thing to be aware of is that a four-fold observed
difference does not necessarily mean that a gene is expressed four times as much in a
given condition!
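
A fold change is commonly reported on a log2 scale after background subtraction, so that a two-fold increase and a two-fold decrease are symmetric around zero. The sketch below assumes that convention; the function name `log2_ratio`, its parameters, and the intensity floor are hypothetical.

```python
from math import log2


def log2_ratio(sample, control, sample_bg=0.0, control_bg=0.0, floor=1.0):
    """log2 of the background-corrected sample/control intensity ratio."""
    # The floor avoids taking the log of zero or negative corrected intensities.
    corrected_sample = max(sample - sample_bg, floor)
    corrected_control = max(control - control_bg, floor)
    return log2(corrected_sample / corrected_control)


# A corrected four-fold difference corresponds to a log2 ratio of 2.
print(log2_ratio(4400, 1200, sample_bg=400, control_bg=200))
```

The result is still only an observed intensity ratio; as noted above, it does not by itself establish the true difference in expression.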

Clustering

One might be inclined to ask questions concerning the relationship among sequences in
an experiment. Several approaches have been suggested. Included are:

k-Means Clustering

k-Means clustering attempts to partition the results into groups that have similar
expression patterns, where k is the number of clusters the user believes that the data
should fall into. There are three steps in the k-Means clustering algorithm:

1) Randomly assign each of the data points to one of the k clusters
2) Calculate the mean inter- and intraclass distances
3) Minimize the mean intraclass distances and maximize the mean interclass distances using
an iterative approach.
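
One common way to carry out these steps is Lloyd's iteration: start from a random assignment, recompute each cluster's mean, and reassign every point to its nearest mean until nothing changes. The sketch below uses that procedure as a stand-in for the iterative step; the function name `kmeans`, the fixed seed, and the toy expression profiles are assumptions for illustration.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    dim = len(points[0])
    labels = [rng.randrange(k) for _ in points]       # step 1: random assignment
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:                           # re-seed an empty cluster
                members = [rng.choice(points)]
            centroids.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
            )
        new_labels = [
            min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2 for d in range(dim)))
            for p in points
        ]
        if new_labels == labels:                      # converged
            break
        labels = new_labels
    return labels, centroids


profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The result depends on k and on the random start, which is one reason clustering output should be interpreted with care.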

EXAMPLE: rana.lbl.gov/FuzzyK/ images/figure2.html


Hierarchical Clustering

Hierarchical clustering creates a “phylogeny” or hierarchy of the data points by
employing the following algorithm:

1) Generate a gene similarity score for all pairs of genes
2) Place the gene similarity scores in a matrix
3) Join the genes that have the highest score
4) Continue to join the next most similar pairs of genes

Hierarchical clustering methods include: complete-linkage clustering, average-linkage
clustering, weighted pair-group averaging, and within pair-group averaging.
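
The joining procedure can be sketched with average linkage. This illustration makes two assumptions: Euclidean distance between expression profiles stands in for the similarity score (smallest distance means highest similarity), and the function name `average_linkage` and the gene names are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(profiles):
    """Agglomerate genes pairwise; returns the list of merges in order."""
    clusters = [[name] for name in profiles]
    merges = []

    def linkage(a, b):
        # Average distance over all gene pairs drawn from the two clusters.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)               # join the most similar pair
        merges.append((tuple(a), tuple(b)))
    return merges


profiles = {
    "g1": (1.0, 2.0, 3.0),
    "g2": (1.1, 2.1, 3.1),
    "g3": (3.0, 2.0, 1.0),
}
for step in average_linkage(profiles):
    print(step)
```

The order of merges defines the tree: genes joined early are the most similar, and later merges connect increasingly different groups.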

Clustering approaches have several disadvantages, and should be used with extreme
caution (if they are used at all).

Image source:
http://cfpub.epa.gov/ncer_abstracts/index.cfm/fuseaction/display.abstractDetail/abstract/9
75/report/2001

Self-Organizing Maps (SOMs)

SOMs are a type of neural network approach. A SOM has a set of nodes with a simple
topology and a distance function on the nodes. The nodes are iteratively mapped into a
k-dimensional gene expression space. The steps in assembling a SOM are as follows:
1) Random vectors are constructed and assigned to each partition
2) A gene is picked at random and the reference vector closest to that gene is
identified
3) The reference vector is adjusted to be more similar to the vector of the assigned
gene.
4) Steps 2 and 3 are iterated through, until the reference vectors converge.
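
The four steps can be sketched for nodes arranged on a one-dimensional chain. The details below are assumptions for illustration: the learning rate decays linearly, the winner's immediate neighbors move half as far as the winner, and the function name `train_som` is hypothetical.

```python
import random


def train_som(genes, n_nodes, iterations=200, rate=0.5, seed=0):
    """Train reference vectors on a 1-D chain of nodes; returns the vectors."""
    rng = random.Random(seed)
    dim = len(genes[0])
    # Step 1: random reference vectors, one per node.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(iterations):
        gene = rng.choice(genes)                       # step 2: pick a gene
        winner = min(
            range(n_nodes),
            key=lambda n: sum((g - w) ** 2 for g, w in zip(gene, nodes[n])),
        )
        step = rate * (1 - t / iterations)             # decaying learning rate
        for n in range(n_nodes):
            if n == winner:
                influence = 1.0
            elif abs(n - winner) == 1:
                influence = 0.5                        # chain neighbors follow
            else:
                influence = 0.0
            # Step 3: move the reference vector toward the gene.
            nodes[n] = [w + step * influence * (g - w) for g, w in zip(gene, nodes[n])]
    return nodes                                       # step 4: after iterating


genes = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
for vector in train_som(genes, n_nodes=2):
    print([round(v, 2) for v in vector])
```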

Web page for SOMs:


http://staff.aist.go.jp/utsugi-a/Lab/BSOM1/index.html

Support Vector Machines

Support Vector Machines are supervised machine learning techniques. These techniques
organize the data by mapping the gene expression vectors into a higher dimensional
space based on a kernel function. The SVM is trained to discriminate between positive
and negative data points. SVMs find the hyperplane that maximizes the
margin between the positive and negative data points.

Image source: iipl.jaist.ac.jp/ research/svm/

Other Clustering Approaches

Hidden Markov Models


Genetic Algorithms
Artificial Neural Networks

Read Mount, p519-526

Important Microarray Papers

Homework #4: Due 4/17/2003


Project #3: Due 4/14/2003
Final Project: Due 5/1/2003

CECS694-02
Introduction to Bioinformatics
Lecture 13
Protein Structure Prediction

Proteins are polypeptides that have a three dimensional structure. They can be described
through four different hierarchical levels:

• Primary structure – the sequence of amino acids constituting the polypeptide
chain.
• Secondary structure – the local organization of the parts of the polypeptide
chain into secondary structures such as α helices and β sheets.
• Tertiary structure – the three dimensional arrangements of the amino acids as
they react to one another due to the polarity and resulting interactions between
their side chains.
• Quaternary structure – if a protein consists of several protein subunits held
together, then the protein can be described as well by the number and relative
positions of the subunits.

Once the polypeptide sequence (primary structure) of a protein has been determined, the
next step is to determine the secondary and tertiary structure of the protein. The
secondary structures of a protein are packed into a core region with a hydrophobic
environment. Interactions between the amino acid side chains occur within the core
structure. Outside of the core are loops and structural elements that come in contact with
water, other proteins, and other structures.

Review of Protein Structure

Proteins are chains of amino acids joined by peptide bonds. Each amino acid is polar,
meaning that it has separate positively and negatively charged regions. Each amino acid
has a free C=O group (CARBOXYL), which can act as a hydrogen bond acceptor, and an
NH group (AMINYL), which can act as a hydrogen bond donor. Many conformations of
the chain are possible due to the rotation around the Alpha-Carbon (Cα) atom. These
conformational changes lead to differences in the three-dimensional structure of the
protein. Within a polypeptide chain, there is a repeated pattern of N-Cα-C. The angle of
rotation about the bond between the aminyl nitrogen and the Alpha-carbon is the PHI (φ)
angle; the angle of rotation about the bond between the Cα and the carboxyl carbon is the
PSI (ψ) angle.
Image Source: Bioinformatics, Mount

The difference between each of the 20 amino acids is in the R side chains. Amino acids
can be separated into distinct groups based on the chemical properties of the side chains:
hydrophobic: Alanine (A), Valine (V), Phenylalanine (F), Proline (P), Methionine (M),
Isoleucine (I), and Leucine (L); charged: Aspartic acid (D), Glutamic acid (E), Lysine
(K), Arginine (R); polar: Serine (S), Threonine (T), Tyrosine (Y), Histidine (H), Cysteine
(C), Asparagine (N), Glutamine (Q), Tryptophan (W).

Secondary Structures
Image source: http://www.ebi.ac.uk/microarray/biology_intro.html

The core of each protein is made up of regular secondary structures that fold into a three-
dimensional configuration. In these secondary structures, regular patterns of hydrogen
bonds are formed between neighboring amino acids, and the amino acids have similar φ
and ψ angles. These structures act to neutralize the polar groups on each amino acid.
These secondary structures are tightly packed in the protein core in a hydrophobic
environment, and thus, each amino acid side group has a limited space to occupy and
therefore a limited number of possible interactions.

Alpha Helix

The alpha helix (Picture, p 388) is the most abundant type of secondary structure in
proteins. The helix has 3.6 amino acids per turn with a hydrogen bond formed between
every fourth residue. The average length of an alpha helix is 10 amino acids, or 3 turns,
but it varies from 5 to 40 amino acids.

http://www.hhmi.princeton.edu/sw/2002/psidelsk/scavengerhunt.htm
http://www4.ocn.ne.jp/~bio/biology/protein.htm

Alpha helix structures are normally found on the surface of protein cores where they
interact with the aqueous environment. The inner facing side of the helix tends to have
hydrophobic amino acids, while the outer-facing side has hydrophilic amino acids. With
3.6 residues per turn, this means that roughly every third or fourth amino acid will tend
to be hydrophobic. This is a pattern that can
be detected computationally. Sequences rich in alanine (A), glutamic acid (E), leucine
(L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine
(S) tend to form alpha helices.

Program to detect alpha helices:

Beta Sheet

Beta sheets are formed by hydrogen bonds between an average of 5-10 consecutive
amino acids in one portion of the chain with another 5-10 farther down the chain. The
interacting regions may be adjacent, with a short loop in between, or far apart with other
structures in between. If the chains run in the same direction, they form a parallel sheet.
If they run in opposite directions, they form an antiparallel sheet. A mixed sheet may also
be formed. The pattern of hydrogen bond formation in parallel and anti-parallel sheets is
different. Beta sheets have a slight counterclockwise rotation, and the Alpha carbons (as
well as the R side groups) alternate above and below the sheet in a pleated structure.
Prediction of beta sheets is more difficult, due to the wide range of the PHI and PSI
angles.

http://broccoli.mfn.ki.se/pps_course_96/ss_960723_12.html
http://www4.ocn.ne.jp/~bio/biology/protein.htm
Image Source: Bioinformatics, Mount

Loops

Loops are regions of a protein chain between alpha helices and beta sheets.
They have various lengths and three-dimensional configurations, and they are located on
the surface of the structure. Hairpin loops represent a complete turn in the polypeptide
chain, as is found in anti-parallel beta sheets. Loops are allowed to be more variable as
far as the sequence structure is concerned. They tend to have charged and polar amino
acids and are frequently a component of active sites.

Coils

A region of secondary structure that is not a helix, sheet, or loop is commonly referred to
as a coil.

Classes of Protein Structure:

1) Class α: bundles of α helices connected by loops on the surface of the proteins
2) Class β: antiparallel β sheets, usually two sheets in close contact forming a
sandwich (enzymes, transport proteins, antibodies, virus coat proteins)
3) Class α/β: comprised mainly of parallel β sheets with intervening α helices; may
also have mixed β sheets (metabolic enzymes)
4) Class α+β: composed mainly of segregated α helices and antiparallel β sheets
5) Multidomain (α and β): proteins comprising domains representing more than one
of the above four classes.
6) Membrane and cell-surface proteins and peptides, excluding proteins of the
immune system.

alpha class protein (hemoglobin) B-class protein (T-cell receptor CD8)

a/B class protein (tryptophan synthase) a+B class protein (1RNB)


membrane protein (1OPF)

Sources:
http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=3hhb;page=;pid=&opt=
show&size=250

http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics&pdbId=1cd8&page=&pid=

http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=2wsy;page=;pid=&opt
=show&size=250

http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1rnb;page=;pid=&opt=
show&size=250

http://www.rcsb.org/pdb/cgi/explore.cgi?job=graphics;pdbId=1opf;page=;pid=&opt=
show&size=250

Protein Structure Databases

There are a number of databases that contain information on three dimensional structures
of proteins, where the structure has been solved using either X-ray crystallography or
nuclear magnetic resonance (NMR) techniques. Examples of the available
databases include:

PDB
SCOP
PIR
Swiss-Prot

The most extensive of these for 3-D structure is the Protein Data Bank (PDB). The
current release of PDB (April 8, 2003) has 20,622 structures.
A full description of the PDB file format can be obtained at:
http://www.rcsb.org/pdb/info.html

A partial example PDB file for the entry 3hhb is given
below (the full file can be obtained at
http://www.rcsb.org/pdb/cgi/explore.cgi?job=download;pdbId=3HHB;page=0&opt=sho
w&format=PDB&pre=1) :

ATOM 1 N VAL A 1 6.452 16.459 4.843 7.00 47.38 3HHB 162


ATOM 2 CA VAL A 1 7.060 17.792 4.760 6.00 48.47 3HHB 163
ATOM 3 C VAL A 1 8.561 17.703 5.038 6.00 37.13 3HHB 164
ATOM 4 O VAL A 1 8.992 17.182 6.072 8.00 36.25 3HHB 165
ATOM 5 CB VAL A 1 6.342 18.738 5.727 6.00 55.13 3HHB 166
ATOM 6 CG1 VAL A 1 7.114 20.033 5.993 6.00 54.30 3HHB 167
ATOM 7 CG2 VAL A 1 4.924 19.032 5.232 6.00 64.75 3HHB 168
ATOM 8 N LEU A 2 9.333 18.209 4.095 7.00 30.18 3HHB 169
ATOM 9 CA LEU A 2 10.785 18.159 4.237 6.00 35.60 3HHB 170
ATOM 10 C LEU A 2 11.247 19.305 5.133 6.00 35.47 3HHB 171
ATOM 11 O LEU A 2 11.017 20.477 4.819 8.00 37.64 3HHB 172
ATOM 12 CB LEU A 2 11.451 18.286 2.866 6.00 35.22 3HHB 173
ATOM 13 CG LEU A 2 11.081 17.137 1.927 6.00 31.04 3HHB 174
ATOM 14 CD1 LEU A 2 11.766 17.306 .570 6.00 39.08 3HHB 175
ATOM 15 CD2 LEU A 2 11.427 15.778 2.539 6.00 38.96 3HHB 176

The second column is the atom serial number, and the third column is the atom name.
The fourth column indicates the current amino acid, and the sixth column gives its
position in the polypeptide chain.
Columns 7, 8, and 9 represent the x, y, and z coordinates (in angstroms).
The 11th column represents the temperature factor, which can be used as a measurement
of uncertainty.
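
Reading these records is straightforward when the fields are separated by whitespace, as in the excerpt above. The sketch below makes that assumption; real PDB files use fixed column positions, so adjacent fields can run together and whitespace splitting is not reliable in general. The function name `parse_atom_line` and the dictionary keys are hypothetical.

```python
def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record into a dictionary."""
    fields = line.split()
    if len(fields) < 11 or fields[0] != "ATOM":
        raise ValueError("not a whitespace-separated ATOM record")
    return {
        "serial": int(fields[1]),                  # atom serial number
        "atom": fields[2],                         # atom name, e.g. CA
        "residue": fields[3],                      # amino acid, e.g. VAL
        "chain": fields[4],
        "position": int(fields[5]),                # residue position in the chain
        "xyz": tuple(float(v) for v in fields[6:9]),
        "temperature_factor": float(fields[10]),
    }


record = parse_atom_line("ATOM 2 CA VAL A 1 7.060 17.792 4.760 6.00 48.47 3HHB 163")
print(record["residue"], record["position"], record["xyz"])
```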

Protein Structure Classification Databases

Structural Classification of proteins (SCOP)

SCOP is based on expert definition of structural similarities. SCOP classifies by class,
fold, superfamily, and family. SCOP is found at http://scop.mrc-lmb.cam.ac.uk/scop/

Classification by class, architecture, topology, and homology (CATH)

CATH classifies proteins into hierarchical levels by class, architecture, topology, and
homology. The classes are similar to those listed above, except that α/β and α+β are
considered to be a single class. CATH is located at
http://www.biochem.ucl.ac.uk/bsm/cath/

Fold classification based on structure-structure alignment of proteins (FSSP)

FSSP is based on structure alignment of all pairwise combinations of the proteins in PDB
using the structural alignment program DALI. Each protein is separated into individual
domains, and the domains are aligned using DALI to find common folds. FSSP is
located at http://www2.embl-ebi.ac.uk/dali/fssp/fssp.html

Molecular Modelling Database (MMDB)

MMDB categorizes structures from PDB into structurally related groups using the VAST
structure alignment program, which looks for similar arrangements of secondary structural
elements. MMDB has been incorporated into ENTREZ at
http://www.ncbi.nlm.nih.gov/Entrez

Spatial Arrangement of Backbone Fragments (SARF)

SARF provides a protein database categorized on structural similarities, similar to the
MMDB. SARF is found at: http://www-lmmb.ncifcrf.gov/~nicka/sarf2.html

Viewing Protein Structures

There are a number of programs available that convert the atomic coordinates of the 3-D
structures into views of the molecule. Viewers also allow the user to manipulate the
molecule by rotation, zooming, etc. Such a viewer can be critical in drug design, since it
yields insight into how the protein might interact with ligands at active sites. The most
popular program for viewing 3-dimensional structures is Rasmol. The following is a list
of the most popular viewers:

Rasmol: http://www.umass.edu/microbio/rasmol/
Chime: http://www.umass.edu/microbio/chime/
Cn3D: http://www.ncbi.nlm.nih.gov/Structure/
Mage: http://kinemage.biochem.duke.edu/website/kinhome.html
Swiss 3D viewer: http://www.expasy.ch/spdbv/mainpage.html

In addition to viewing 3-dimensional structures, there are repositories for still images.
One such site is the swissprot website:

http://www.expasy.ch/databases/swiss-3dimage/IMAGES/

Alignment of Protein Structures

To perform a structural alignment, the three-dimensional structure of one protein is
compared against the three-dimensional structure of a second protein, fitting together the
atoms as closely as possible to minimize the average deviation.

Structural similarity between proteins does not necessarily translate into an evolutionary
relationship between the two.

When structures are compared, positions of atoms in two three-dimensional structures are
compared. Typically these methods to align structures look for the positions of
secondary structural elements (helices and strands) within a protein domain to determine
whether or not the structures are similar. Distances between the carbon atoms are
examined to determine the degree to which the structures may be superimposed.
Additional information about the side chains (such as whether they are buried or visible)
can be used as well.

Secondary Structure Alignment Program (SSAP)

SSAP uses a method called double dynamic programming to produce a structural
alignment between two proteins.

A local structural environment is created for each residue in each sequence. This
environment is defined by the degree of burial in the hydrophobic core of the protein and
the type of secondary structure to which the residue belongs. One of the environment
variables is a representation of the geometry of the protein, obtained by drawing a series
of vectors from the Cβ atom of an amino acid to the Cβ atoms of all of the other amino acids in
the protein. If the geometric views in two protein structures are similar, the structures
must also be similar.

These structural environments are compared to produce matching residues.

Steps involved in SSAP:

1) Calculate vectors from the Cβ atom of one amino acid to the Cβ atoms of a set of other
nearby amino acids. The resulting vectors from two separate proteins are compared, and a
difference (expressed as an angle) is calculated. A score for this difference is then
computed.
2) A matrix for the scores of vector differences from one protein to the next is
computed.
3) An optimal alignment is found using global dynamic programming, with a
constant gap penalty.
4) The next amino acid residue in one of the sequences is considered, and an optimal
path to align this amino acid to the second sequence is computed using the steps
above.
5) Resulting alignments are then transferred to a summary matrix. If the paths cross
the same matrix position, the scores are summed. If part of the alignment path is
found in both matrices, then there is evidence of similarity between the vectors.
6) When all of the alignments have been placed in the summary matrix, a dynamic
programming alignment is performed for the summary matrix. The final
alignment represents the optimal alignment between the protein structures. The
resulting score is converted such that it can be compared to see how closely
related the two structures are to each other.

Image Source: Bioinformatics, Mount (p420)


Distance Matrix

The distance matrix method uses a graphical procedure similar to dot plots to identify the
atoms that lie most closely together in the three-dimensional structure. If two sequences have a
similar structure, then their resulting dot plots can be superimposed. For the dot plot, the
sequence of the protein is listed along both axes. The values in the distance matrix
represent the distance between the corresponding Cα atoms in the three dimensional
structure. The positions of the closest packing atoms are marked with a dot to highlight
regions of interest. Similar groups of secondary structural elements are superimposed as
closely as possible by minimizing the sum of the atomic distances.
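
A distance matrix of this kind is easy to compute from Cα coordinates. The sketch below is an illustration only: the function names, the 8 Å contact cutoff, and the minimum sequence separation are hypothetical choices, and the two coordinates are the Cα atoms of VAL 1 and LEU 2 from the 3hhb excerpt shown earlier in these notes.

```python
from math import dist


def distance_matrix(ca_coords):
    """Pairwise Cα-Cα distances (in angstroms) for a list of (x, y, z) tuples."""
    return [[dist(a, b) for b in ca_coords] for a in ca_coords]


def close_contacts(matrix, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j) that are far apart in sequence but close in space."""
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if matrix[i][j] <= cutoff
    ]


ca = [(7.060, 17.792, 4.760), (10.785, 18.159, 4.237)]
print(round(distance_matrix(ca)[0][1], 2))   # consecutive Cα atoms, in angstroms
```

The pairs returned by `close_contacts` correspond to the dots that would be marked on the plot.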

Distance Alignment Tool (DALI)

Dali is one example of a program that uses the distance matrix method to align protein
structures. Existing structures that have been compared to one another are organized
into the FSSP database. The assembly step of DALI uses a Monte Carlo simulation
strategy to find submatrices that can be aligned with one another.

Fast Structural Similarity Search

One way to quickly compare two structures is to compare the types and arrangements of
the secondary structures within two proteins. If the elements are similarly arranged, the
three-dimensional structures are similar. VAST and SARF are example programs that
use these methods to compare two structures.

Structural Motifs based on Sequence Analysis

A few structural elements can be determined by looking at the sequence composition.
Examples of such structures include zinc finger motifs, leucine zippers, and coiled-coil
structures.

Zinc finger motifs can be found by looking at the order and spacing of cysteine and
histidine residues in a sequence. Typical zinc finger motifs are composed of two cysteines
followed by two histidines.
Image source: www.bmb.psu.edu/faculty/tan/lab/ tanlab_gallery_protdna.html
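
One way to make the cysteine and histidine spacing concrete is a regular expression. The
sketch below assumes a simplified C2H2 spacing (C, 2-4 residues, C, 12 residues, H, 3-5
residues, H); real motif databases use more detailed patterns, and the test sequence is
made up for the example.

```python
import re

# Simplified C2H2 spacing: C-x(2,4)-C-x(12)-H-x(3,5)-H (an assumption for illustration)
C2H2 = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

def find_c2h2(seq):
    """Return (start, matched_text) for each non-overlapping candidate zinc finger."""
    return [(m.start(), m.group()) for m in C2H2.finditer(seq.upper())]

seq = "AAACPECGKSFSRSDELTRHIRIHTGE"  # hypothetical sequence
for start, hit in find_c2h2(seq):
    print(start, hit)
```

A hit from such a pattern is only a candidate; it says nothing about whether the region
actually folds around a zinc ion.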

Leucine zippers can be found by looking for two alpha helices (parallel in the classic
leucine zipper dimer) held together by interactions between hydrophobic leucine residues
found at every seventh position in each helix.

Image source: ww2.mcgill.ca/biology/undergra/ c200a/sec3-5.htm
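
The "leucine at every seventh position" rule can be sketched as a pattern search. The
example below assumes four leucines spaced exactly seven residues apart and uses a
lookahead so that overlapping candidates are reported; the test sequence is an
illustrative fragment, and a match alone does not establish that a zipper forms.

```python
import re

# Four leucines, each followed by six arbitrary residues: L-x(6)-L-x(6)-L-x(6)-L
# The lookahead lets overlapping candidates be found.
ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")

def leucine_repeats(seq):
    """Return (start, matched_text) for every position that begins a leucine heptad repeat."""
    return [(m.start(), m.group(1)) for m in ZIPPER.finditer(seq.upper())]

seq = "MKLEDKVEELLSKNYHLENEVARLKKLVGE"  # illustrative zipper-like fragment
for start, hit in leucine_repeats(seq):
    print(start, hit)
```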

Coiled-coil structures have two to three alpha helices coiled around each other in a
left-handed supercoil. They may be predicted by searching for a 7-residue periodicity.
COILS2 (http://www.ch.embnet.org/software/COILS_form.html) is a program for detecting
coiled-coil regions.

Transmembrane-spanning Proteins

Membrane proteins traverse the membrane back and forth through a series of alpha helices
composed of amino acids with hydrophobic side chains. These membrane-spanning regions are
typically 20-30 residues long, so they can be detected by scanning for hydrophobic
regions with a window of around 19 residues. Membrane-spanning alpha helices tend to have
hydrophobic residues on the inside-facing portions and hydrophilic residues on the outside,
exposed portions.

Image source: http://www.northwestern.edu/neurobiology/faculty/pinto2/pinto_12big.jpg
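
The scan for hydrophobic stretches can be sketched with a sliding-window average. The
example below assumes the Kyte-Doolittle hydropathy scale, a 19-residue window, and a mean
hydropathy cutoff of 1.6; these choices are assumptions for illustration and are not how
PHDhtm or TMpred work. The test sequence is made up.

```python
# Kyte-Doolittle hydropathy values (one common scale; other scales exist)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping high-scoring windows into (start, end) half-open segments."""
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score < threshold:
            continue
        if segments and i <= segments[-1][1]:
            # Window overlaps the previous segment, so extend it
            segments[-1] = (segments[-1][0], i + window)
        else:
            segments.append((i, i + window))
    return segments

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues
seq = "DKRESDKRES" + "LIVALIVALIVALIVALIV" + "DKRESDKRES"
print(candidate_tm_segments(seq))
```

Because windows that only partly overlap the hydrophobic stretch can still pass the
cutoff, the reported segment is somewhat wider than the stretch itself.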


PHDhtm is a program that is used to predict membrane spanning helices. PHDhtm
employs a neural network approach, where the neural network is trained to recognize
sequence patterns and variations of helices in transmembrane proteins of known
structures. The details for training PHDhtm are given in Mount, p437-439.

TMpred is another program that predicts alpha helices of transmembrane proteins. It
functions by searching a protein against a sequence scoring matrix that has been obtained
by aligning the sequences of all the transmembrane alpha helix regions that are known.

Secondary Structure Prediction Approaches

Chou-Fasman and GOR methods

The Chou-Fasman method is based on analyzing the frequency of amino acids in the
different secondary structures. For instance, it was determined that A, E, L, and M are
strong predictors of alpha helices, while P and G are predictors of a break in a helix. A
table of predictive values was created for alpha helices, beta sheets, and turns. For each
region, the structure with the greatest overall prediction value, provided it is greater
than 1, is assigned to that region.
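
A heavily simplified sketch of the propensity idea follows. It covers helices only, uses
approximate helix propensities for just the six residues named above, and treats every
other residue as neutral (1.0), which is a placeholder and not published data. The window
size of 6 is likewise an assumption; the full method also uses sheet and turn tables and
extension rules that are omitted here.

```python
# Approximate helix propensities for the residues named in the text.
# All other residues default to a neutral 1.0 in this sketch (placeholder, not real data).
HELIX_PROPENSITY = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "P": 0.57, "G": 0.57}

def window_means(seq, table, window=6, default=1.0):
    """Mean propensity of every window of `window` consecutive residues."""
    values = [table.get(aa, default) for aa in seq.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def helix_favoring_windows(seq, table=HELIX_PROPENSITY, window=6, cutoff=1.0):
    """Start positions of windows whose mean helix propensity exceeds the cutoff."""
    return [i for i, mean in enumerate(window_means(seq, table, window))
            if mean > cutoff]

print(helix_favoring_windows("AEELLMPGPGPG"))
```

Passing a complete propensity table through the `table` argument would replace the
placeholder defaults.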

The GOR method improves upon the Chou-Fasman method by assuming that the amino acids
surrounding the central amino acid influence the secondary structure that the central
amino acid is likely to adopt, rather than each amino acid influencing the secondary
structure individually.

Scoring matrices are used in the GOR method, which incorporates both information
theory and Bayesian statistics.

Details of the GOR method are provided in Mount, p450-451.

Neural Network Models

In the neural network approach, programs are trained to recognize amino acid patterns
that are located in known secondary structures and to distinguish these patterns from
other patterns not located in structures. PHD and NNPREDICT are two programs that
incorporate neural network models.

Nearest-Neighbor Methods

Nearest-neighbor methods are also a type of machine learning method. The secondary
structure conformation of an amino acid in the query is calculated by identifying
sequences of known structures that are similar to the query by looking at the surrounding
amino acids. The programs using the nearest-neighbor methods include PSSP, Simpa96, SOPM,
and SOPMA.

Prediction of Three-dimensional Protein Structure

Retrieve Examples of each of these:


a: hemoglobin
b: T-cell receptor CD8

Protein classification

Programs to predict secondary structure:

nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html)

nnpredict uses a two-layer, feed-forward neural network to determine the secondary
structure classification.

Results for nnpredict:

Sequence:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR

Secondary structure prediction (H = helix, E = strand, - = no prediction):
-------H--HHHHHHH---H-HHHHHHHHHHH--------------------HEH----
HHHHHHHHHHH------HHHHHHHHHHH---------HHHHHHHHHHHHHH-------HH
HHHHHHHHHHHHHEEE-----

P389

Programs for viewing protein structure

RasMol

Predicting Secondary Structure

Predicting Tertiary Structure

“Protein-folding Problem”
Threading

Threading is the most robust of the structure prediction techniques. It searches for
structures that have a similar fold without apparent sequence similarity.
Threading takes a query sequence whose structure is not known and threads it through the
coordinates of a target protein whose structure has been solved, using either X-ray
crystallography or NMR spectroscopy.

The sequence is moved position by position through the structure, subject to predetermined
constraints. Thermodynamic calculations are made to determine the most energetically
favorable and conformationally stable alignment of the query sequence against the target
structure.

Threading is a computationally intensive task.

Programs:

Protein Structure Prediction Center http://predictioncenter.llnl.gov/


PIR
Quaternary structure prediction:
http://msd.ebi.ac.uk/Services/Quaternary/quaternary.html

WHAT-IF
LOOK
SWISS-MODEL
VAST
DALI
3Dee
FSSP
PHD
TOPITS

SignalP http://www.cbs.dtu.dk/services/SignalP/
TMpred http://www.isrec.isb.sib.ch/ftp-server/tmpred/www/TMPRED_form.html


CECS694-02
Introduction to Bioinformatics
Lecture 14
Protein Structure Prediction

Prediction of Three-dimensional Protein Structure

Threading

Threading is the most robust of the structure prediction techniques. Threading searches for
structures that have a similar fold without apparent sequence similarity. It takes a query
sequence whose structure is not known and threads it through the coordinates of a target
protein whose structure has been solved, using either X-ray crystallography or NMR
spectroscopy.

The sequence is moved position by position through the structure, subject to predetermined
constraints. Thermodynamic calculations are made to determine the most energetically
favorable and conformationally stable alignment of the query sequence against the target
structure.

Threading is a computationally intensive task, and requires a great deal of knowledge
about protein structure.

Environmental template method

In the environmental template method, the environment of each amino acid in each
known structural core is determined, including the secondary structure, the area of the
side chain that is buried by closeness to other atoms, and types of nearby side chains.
Each position is classified into one of 18 types, 6 representing increasing levels of residue
burial, combined with three classes of secondary structure (alpha helices, beta sheets, and
loops). Each amino acid is then assessed for its ability to fit into that type of structure.

Residue contact potential

The number and closeness of contacts between amino acids in the core are analyzed. The
query sequence is evaluated for amino acid interactions that will correspond to those in
the core and that will contribute to the stability of the protein. The most energetically
stable conformations are the most likely three-dimensional structures.

Structure profile method

Predictions as to which amino acids are able to fit into a structural position are given as a
sequence profile. Substitutions in different structures have different effects; for example,
substitutions in loops do not have as many constraints. A structure profile is created for
each core in the PDB. These profiles are then used to score the query sequence for
compatibility with that core. The structural profile is a table of scores with one row for
each amino acid position in the core and a column for each amino acid substitution at that
position plus two columns for deletion penalties. A dynamic programming algorithm is
used to identify an optimal, best scoring alignment.
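
The scoring step can be sketched as a small dynamic programming routine. The profile
below is a toy with hypothetical scores, and for brevity each position carries a single
gap penalty instead of the two deletion-penalty columns described above; unlisted amino
acids receive an assumed default score. It is a sketch of the general idea, not the
scoring scheme of any particular threading program.

```python
def align_to_profile(seq, profile, default_score=-1.0):
    """Best global alignment score of `seq` against a non-empty structure profile.

    profile: list of {"scores": {amino_acid: score}, "gap": penalty}, one per core position.
    """
    n, m = len(seq), len(profile)
    # dp[i][j]: best score for the first i residues against the first j profile positions
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - profile[j - 1]["gap"]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - profile[0]["gap"]
        for j in range(1, m + 1):
            pos = profile[j - 1]
            match = dp[i - 1][j - 1] + pos["scores"].get(seq[i - 1], default_score)
            skip_position = dp[i][j - 1] - pos["gap"]  # core position left unfilled
            skip_residue = dp[i - 1][j] - pos["gap"]   # query residue left unaligned
            dp[i][j] = max(match, skip_position, skip_residue)
    return dp[n][m]

# Hypothetical three-position profile: the middle (loop-like) position is cheap to skip
toy_profile = [
    {"scores": {"L": 2.0}, "gap": 3.0},
    {"scores": {"G": 2.0}, "gap": 1.0},
    {"scores": {"V": 2.0}, "gap": 3.0},
]
print(align_to_profile("LGV", toy_profile))
print(align_to_profile("LV", toy_profile))
```

The lower gap penalty at the loop position reflects the point made above that loops
tolerate changes more readily than core positions.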

Threading Services
123D http://www-lmmb.ncifcrf.gov/~nicka/123D.html
3D-PSSM
Honig lab
Libra I
NCBI structure site
Profit
Threader 2
TOPITS
UCLA-DOE structure prediction Server

DNA Sequencing
Sequencing DNA is a routine molecular biology technique. The most common form of
DNA sequencing used today is the Sanger dideoxynucleotide chain termination method.
In this method, new strands of DNA complementary to a single-stranded DNA template
are synthesized. The template DNA is supplied with a mixture of all four
deoxynucleotides (A, C, G, T) along with the four corresponding dideoxynucleotides, which
terminate the elongation of the growing strand wherever they are incorporated. Each
dideoxynucleotide is labeled with a different color fluorescent tag. The result is a set
of DNA fragments of different lengths, each ending in a labeled base. The fragments are
separated by size using a technique known as gel electrophoresis. As each labeled DNA
fragment passes a laser detector, the color is recorded. The DNA sequence is then
reconstructed from the pattern of colors.

www.ncbi.nlm.nih.gov/About/primer/ genetics_molecular.html
http://jcsmr.anu.edu.au/group_pages/brf/services/DNA%20sequencing/Templiphi.html
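
The reconstruction from colors can be sketched in a few lines. The example assumes each
fragment is reported as a (length, dye color) pair, with length counted from the primer,
and uses a hypothetical color-to-base assignment; real instruments work from continuous
trace data rather than clean pairs like these.

```python
# Hypothetical dye assignment; actual dye chemistries differ between kits
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def read_synthesized_strand(fragments):
    """Order fragments by length and read the terminal base of each from its dye color."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(DYE_TO_BASE[colour] for _, colour in ordered)

def template_strand(synthesized):
    """Reverse complement of the synthesized strand, giving the template sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(synthesized))

# Fragments as they might come off the gel, in arbitrary order
fragments = [(3, "yellow"), (1, "green"), (4, "blue"), (2, "red")]
strand = read_synthesized_strand(fragments)
print(strand, template_strand(strand))
```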

Automated Sequencing Machine

www.csic.es/mostrar/tecnicas/ area2/iib1/abi377.htm
The procedure of determining the actual base that is represented is referred to as base-
calling. Often, automated sequencers have software installed that automatically takes the
trace data and calculates the bases. This information can also be used by programs such
as PHRED. For each base, there is an associated quality value, which represents the
probability that the base has been called correctly. Typically, the beginning and end of
the sequence will have lower values. These low quality regions are usually trimmed from
the final data. The PHRED quality value is calculated by the following formula:

QualityValue = -10 log10(P_e)

where P_e is the probability that the base call is an error.
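
The formula and the trimming step mentioned above can be sketched as follows. The cutoff
of 20 (a 1 in 100 error probability) is an assumed threshold for the example, and the
end-trimming shown here is a simple illustration, not the procedure used by PHRED or any
other named tool.

```python
import math

def phred_quality(p_error):
    """Quality value from the probability that a base call is wrong."""
    return -10.0 * math.log10(p_error)

def error_probability(quality):
    """Inverse of phred_quality: error probability implied by a quality value."""
    return 10.0 ** (-quality / 10.0)

def trim_low_quality(qualities, min_q=20):
    """Return (start, end), half-open, after dropping low-quality bases from both ends."""
    start, end = 0, len(qualities)
    while start < end and qualities[start] < min_q:
        start += 1
    while end > start and qualities[end - 1] < min_q:
        end -= 1
    return start, end

print(round(phred_quality(0.001)))                     # error rate of 1 in 1000
print(trim_low_quality([5, 10, 30, 40, 35, 12, 8]))    # hypothetical quality values
```

Each step of 10 in the quality value corresponds to a tenfold drop in the error
probability.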

PHRED: http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm
TraceTuner http://www.paracel.com/tracetuner/
http://www.phrap.com/

Berno, A. 1996. A graph theoretic approach to the analysis of DNA sequencing data.
Genome Res. 6:80-91 .

DNA sequence assembly


http://icb.ime.usp.br/tdr/material/arthur/assembly.pdf

Ewing B, Green P. (1998) Base-calling of automated sequencer traces using phred II.
Genome Res, 8(3):186-194.

Genomic Sequencing

In order to sequence large molecules, such as chromosomes, the region to be sequenced
must be purified and broken into 100-kb or slightly larger random fragments, which are
cloned into vectors such as yeast artificial chromosomes (YACs) or bacterial artificial
chromosomes (BACs). The library of clones is screened for contigs, which are
overlapping regions. Building such a map of overlapping clones is a very laborious
procedure. Once a map is obtained, unique overlapping clones are chosen for
sequencing. However, these molecules are too large for direct sequencing. In order to
sequence each clone, the clone is broken down into subclones, with some level of
redundancy (typically 4x – 10x coverage). The subclones, on the order of 500 bases
long, are then sequenced. It is then necessary to assemble these subclones based on
overlapping sequences.
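
To give a feel for what 4x to 10x coverage means in practice, the sketch below computes
how many reads a clone needs for a chosen coverage. It also estimates the fraction of
bases left unsequenced using a Poisson approximation, which assumes reads land uniformly
at random; that assumption is added here for illustration and is not part of the notes.

```python
import math

def reads_needed(target_length, read_length, coverage):
    """Number of reads required so that total read length is coverage x target length."""
    return math.ceil(coverage * target_length / read_length)

def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)

clone_length = 100_000   # a 100-kb clone, as in the text
read_length = 500        # subclone reads of about 500 bases
for coverage in (4, 10):
    print(coverage,
          reads_needed(clone_length, read_length, coverage),
          expected_uncovered_fraction(coverage))
```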

Shotgun sequencing

Shotgun sequencing is the process of sequencing a whole genome without relying on map data.
The idea is to sequence both ends of short (2 kb), medium (10 kb) and long (100 kb) DNA
fragments, and use these end sequences as anchors. The genome is then randomly broken up
into small (500 base) pieces which are then sequenced. The problem of sequence assembly
is much tougher with shotgun sequencing.
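
Assembly from overlapping reads can be sketched with a greedy merge: repeatedly join the
two pieces with the longest suffix-to-prefix overlap. This is a toy under strong
assumptions (error-free reads, a single strand, no repeats, no reads contained inside
others) and is not the algorithm used by PHRAP or the whole-genome assemblers; the reads
are made up.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by at least min_len."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]  # hypothetical overlapping reads
print(greedy_assemble(reads))
```

Repeated sequence is what breaks this simple strategy, which is one reason the anchoring
information from paired end sequences matters for whole-genome assembly.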

Comparison of sequencing strategies

Taken from Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the
human genome. PNAS 99(6):3712-3716.

Sequence Assembly Programs

PHRAP (fragment assembly program) is the most widely used program when it comes to
assembling the smaller pieces of each clone together. Other programs that are used to
assemble whole genomes include ARACHNE (MIT’s Whitehead center); GigAssembler
(UCSC), and … The most valuable of these whole genome assembly techniques take
into account various pieces of information concerning BAC ends, polymorphisms, and
mapping markers in order to correctly orient and assemble the pieces of the genome.

Huang X, Madan A. (1999) CAP3: A DNA sequence assembly program. Genome Res,
9(9):868-877.

Bonfield JK, Smith K, Staden R. (1995) A new DNA sequence assembly program.
Nucleic Acids Res, 23(24):4992-4999.

Mullikin JC, Ning Z. (2003) The phusion assembler. Genome Res, 13(1):81-90.

Waterston RH, Lander ES, Sulston JE. (2002) On the sequencing of the human genome.
PNAS 99(6):3712-3716.

Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP,
Lander ES. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Res
12(1):177-189.

Venter C, et al. (2001) The sequence of the human genome. Science 291(5507):1304-1351.

Adams, et al. (2000) The genome sequence of Drosophila melanogaster. Science
287(5461):2185-2195.

Myers EW, et al. (2000) A whole-genome assembly of Drosophila. Science
287(5461):2196-2204.

Genome sequence assembly process:


http://www.ncbi.nlm.nih.gov/genome/guide/build.html

Predicting Structural Features


Modeller http://guitar.rockefeller.edu/modeller/modeller.html
Swiss-model http://www.expasy.ch/swissmod/SWISS-MODEL.html
Whatif http://www.cmbi.kun.nl/whatif/

DNA Sequencing and Assembly

Whole genome assemblers


Arachne
GigAssembler
