
Introduction to Bioinformatics

Lecture 5 & 6

Sequence Alignment
What is sequence alignment?
A procedure for comparing sequences:
Two sequences (pair-wise sequence alignment)
More than two (multiple sequence alignment)

When sequences are aligned, each paired position is either a:

Match
Mismatch

Global VS Local*****
Global Alignment
Attempts to align the entire length of both sequences
Suitable for similar sequences of roughly equal length

CTGTCG-CTGCACG
-TGC-CG-TG----    (global alignment)

CTGTCGCTGCACG
-----TGC-CGTG     (local alignment)

Local Alignment
Gathers islands of matches
Stretches of sequence with the highest density of matches are aligned
Suitable for partially similar sequences, sequences of different lengths, and sequences containing conserved regions

How can we tell if two sequences are similar?

Similarity judgments should be based on:

The types of changes or mutations that occur within sequences
The characteristics of those different types of mutations

Relative frequency of mutations:
Substitution > Insertion, Deletion >> Duplication > Inversion

Common mutations in DNA***


Substitution:
A C G T T G A C
A C G A T G A C

Deletion:

A C G T T G A C
A C G A C
Insertion:

A C G T T G A C
A C G C A A T T G A C

Common mutations***
Duplication:
A C G T T G A C
A C G T T G A T T G A C

Inversion (double-stranded DNA shown):

A C G T T G A C
T G C A A C T G

Terminology *****
Homolog
A gene related to a second gene by descent
from a common ancestral DNA sequence

Ortholog
Orthologs are genes in different species that
evolved from a common ancestral gene by
speciation

Paralog
Paralogs are genes related by duplication
within a genome

Terminology

Analogous: different structure but similar feature
Xenologous: related through transfer of genetic material between species

Global Alignment *****

[Worked example: a dynamic-programming scoring matrix for globally aligning two short DNA sequences over the letters C, A and T. The first row and column accumulate gap penalties of -5 (-5, -10, -15, -20, ...); each interior cell holds the best running score (values such as 10, 13, 15, 18, 20, 23, 26 and 28 appear), and the bottom-right cell holds the optimal global score, 33.]

Traceback can yield both optimum alignments
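As a sketch of how such a matrix is filled, the following Python computes the global (Needleman-Wunsch) score. The scoring values are assumptions chosen to mirror the numbers visible in the slide's matrix (match +10, mismatch -2, gap -5), not a recovered specification.

# Minimal Needleman-Wunsch sketch. Scoring values are assumptions
# (match +10, mismatch -2, gap -5), not taken verbatim from the slide.
def global_alignment_score(v, w, match=10, mismatch=-2, gap=-5):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: cumulative gap penalties
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):          # first row: cumulative gap penalties
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s[n][m]   # optimal global score; traceback over s recovers the alignment(s)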

Local VS Global Alignment ***

Both use the dynamic programming method
Main difference
The rules for calculating the scoring matrix are slightly different
The scoring system must include negative scores for mismatches
Only non-negative values are kept in the scoring matrix
This has the effect of terminating the alignment

Local Alignment***

[Worked example: a local-alignment scoring matrix for two short sequences over the letters C and A, scored with +1 for a match, -1 for a mismatch, and -5 for a space.]
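A minimal local (Smith-Waterman) scoring sketch using exactly the slide's scheme (+1 match, -1 mismatch, -5 space); note how clamping cells at zero is what terminates local alignments.

# Minimal Smith-Waterman sketch using the slide's scoring:
# +1 match, -1 mismatch, -5 space.
def local_alignment_score(v, w, match=1, mismatch=-1, gap=-5):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # Clamping at 0 terminates poor stretches, keeping the alignment local.
            s[i][j] = max(0, diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            best = max(best, s[i][j])
    return best   # traceback starts from the highest-scoring cell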

How can we know?

The alignment is global if

Matched regions are long
They cover most of the aligned sequences
Many gaps are present

This is very subjective

The matrix will give a global alignment (GA) if it
Gives an average positive score to each aligned position
Has a small gap penalty

The matrix will give a local alignment (LA) if it

Gives an average negative value to the mismatched positions
Has a large gap penalty

Introduction to Bioinformatics
Lecture 7

Why Multiple Sequence Alignment?

Up until now we have only tried to align two sequences
A faint similarity between two sequences becomes significant if it is present in many
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Multiple Sequence Alignment: Approaches

Optimal global alignments - generalization of dynamic programming
Find the alignment that maximizes a score function
Computationally expensive: time grows as the product of the sequence lengths
Global progressive alignments - match closely related sequences first, using a guide tree
Global iterative alignments - multiple re-building attempts to find the best alignment
Local alignments
Profile analysis
Block analysis
Pattern searching and/or statistical methods

Global MSA: Challenges

Computationally expensive
If the MSA includes matches, mismatches and gaps, and also accounts for the degree of variation, then global MSA can be applied to only a few sequences

Difficult to score
Multiple comparisons are necessary in each column of the MSA for a cumulative score
Placement of gaps and scoring of substitutions is more difficult

Difficulty increases with diversity
Relatively easy for a set of closely related sequences
Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging
Even more difficult if some members are more alike than others

Multiple Alignment: Dynamic Programming*********

$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no in/dels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonals: one in/del} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonals: two in/dels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) &
\end{cases}
$$

where $\delta(x, y, z)$ is an entry in the 3-D scoring matrix
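A sketch of this 3-D recurrence in Python. The sum-of-pairs delta below (+1 per matching pair, -1 per mismatched or gapped pair, gap-gap pairs free) is an assumed scoring scheme, since the slide leaves delta abstract.

from itertools import product

# Assumed sum-of-pairs scoring: +1 per matching pair, -1 per mismatched
# or gapped pair; a gap paired with a gap contributes 0.
def delta(x, y, z):
    score = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == '-' and b == '-':
            continue
        score += 1 if (a == b and a != '-') else -1
    return score

def msa3_score(v, w, u):
    n, m, l = len(v), len(w), len(u)
    s = [[[0] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        if i == j == k == 0:
            continue
        cands = []
        if i and j and k:  # cube diagonal: no in/dels
            cands.append(s[i-1][j-1][k-1] + delta(v[i-1], w[j-1], u[k-1]))
        if i and j:        # face diagonals: one in/del
            cands.append(s[i-1][j-1][k] + delta(v[i-1], w[j-1], '-'))
        if i and k:
            cands.append(s[i-1][j][k-1] + delta(v[i-1], '-', u[k-1]))
        if j and k:
            cands.append(s[i][j-1][k-1] + delta('-', w[j-1], u[k-1]))
        if i:              # edge diagonals: two in/dels
            cands.append(s[i-1][j][k] + delta(v[i-1], '-', '-'))
        if j:
            cands.append(s[i][j-1][k] + delta('-', w[j-1], '-'))
        if k:
            cands.append(s[i][j][k-1] + delta('-', '-', u[k-1]))
        s[i][j][k] = max(cands)
    return s[n][m][l]

Note the triple loop: the running time grows as the product of the three sequence lengths, which is why optimal global MSA scales to only a few sequences.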

Introduction to Bioinformatics
Lecture 8

Sensitivity and Selectivity***

Sensitivity: the percentage of true homologs that are identified by the database search
(true positives) / (all true homologs)

Selectivity: the percentage of non-homologs that are correctly not identified as homologs
(true negatives) / (all non-homologs)

For sequence database similarity search methods, there is usually a trade-off between sensitivity and selectivity
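Expressed as code with the usual confusion-matrix counts (a minimal sketch; the variable names are mine):

# tp = homologs found, fn = homologs missed,
# tn = non-homologs rejected, fp = non-homologs reported as hits.
def sensitivity(tp, fn):
    return tp / (tp + fn)   # fraction of all true homologs identified

def selectivity(tn, fp):
    return tn / (tn + fp)   # fraction of all non-homologs rejected

Loosening a score threshold typically raises sensitivity while lowering selectivity, and vice versa.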

Database searching
Smith-Waterman is slower, but more sensitive
Instead, faster heuristic approaches are used:
FASTA [Pearson & Lipman, 1988]
BLAST [Altschul et al., 1990]

FASTA

W. R. Pearson and D. J. Lipman (1988)

FASTA was the first widely used program for sequence database similarity search
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
Based on the dot plot idea
Better than BLAST for nucleotide sequence searches

Hashing Example

Query sequence: WATSNANDCRICK, word length k = 1

Hash table (1-based position(s) of each word in the query):

A: 2, 6    C: 9, 12    D: 8    I: 11    K: 13
N: 5, 7    R: 10       S: 4    T: 3     W: 1

Target sequence: BASEBALLANDCRICKET

Target table (positions, 1-18, of each shared word in the target):

A: 2, 6, 7, 9    C: 12, 15    D: 11    I: 14    K: 16
N: 10            R: 13        S: 3     T: 18

For every word common to both sequences, an offset is computed as (query position - target position). For example, the lone T gives offset 3 - 18 = -15, and the pairings of the A's give offsets 0, -4, -5, -7, 4, 0, -1 and -3.

Tallying the offsets across all shared words, offset -3 occurs 8 times (once each for A, N, D, C, R, I, C and K), far more often than any other value. Shifting the query by this dominant offset lines up the common region ANDCRICK:

   WATSNANDCRICK
BASEBALLANDCRICKET
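A minimal sketch of this k = 1 hashing-and-offset step (the function and variable names are mine):

from collections import Counter, defaultdict

def word_positions(seq, k=1):
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)   # 1-based positions, as in the slides
    return table

query, target = "WATSNANDCRICK", "BASEBALLANDCRICKET"
q_table, t_table = word_positions(query), word_positions(target)

offsets = Counter()
for word, q_positions in q_table.items():
    for qp in q_positions:
        for tp in t_table.get(word, []):
            offsets[qp - tp] += 1   # offset = query position - target position

print(offsets.most_common(1))   # [(-3, 8)]: ANDCRICK lines up

FASTA then extends the best diagonals (dominant offsets) into full local alignments, rather than running dynamic programming over the whole matrix.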

Introduction to Bioinformatics
Lecture 9

A Markov Chain Model

Nucleotide frequencies in the human genome (%):

A: 29.5    C: 20.4    G: 20.5    T: 29.6
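Such frequencies come from simple counting; a tiny sketch (toy input, not genome data):

from collections import Counter

def nucleotide_frequencies(seq):
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: round(100 * counts[b] / total, 1) for b in "ACGT"}

print(nucleotide_frequencies("ACGTACGGTTAACT"))   # toy sequence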

Markov Chain Model: Definition

A Markov chain model is defined by:

a set of states
some states emit symbols
other states are silent (e.g. the begin and end states)

a set of transitions with associated probabilities
the transitions emanating from a given state define a distribution over the possible next states

Markov Chain Model: Property

Given some sequence $x$ of length $L$, we can ask how probable the sequence is given our model

For any probabilistic model of sequences, we can write this probability as

$$\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1) = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\,\Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$$

Key property of a (1st-order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$, so

$$\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$

The Probability of a Sequence for a Given Markov Chain Model

$$\Pr(\text{cggt}) = \Pr(c)\,\Pr(g \mid c)\,\Pr(g \mid g)\,\Pr(t \mid g)\,\Pr(\text{end} \mid t)$$
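A sketch of this computation; the transition values below are made-up placeholders (a real model would estimate them from data), and the optional end-state factor is omitted:

# Placeholder first-order Markov chain over {a, c, g, t}; each row of
# `trans` sums to 1. These numbers are illustrative, not estimates.
begin = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}   # Pr(x1)
trans = {
    'a': {'a': 0.30, 'c': 0.20, 'g': 0.25, 't': 0.25},
    'c': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
    'g': {'a': 0.25, 'c': 0.20, 'g': 0.30, 't': 0.25},
    't': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
}

def sequence_probability(x):
    p = begin[x[0]]                 # Pr(x1) = a_{B x1}
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]       # Pr(x_i | x_{i-1}) = a_{x_{i-1} x_i}
    return p

print(sequence_probability("cggt"))   # Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)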

Markov Chain Model: Notation

The transition parameters can be denoted by $a_{x_{i-1} x_i}$, where

$$a_{x_{i-1} x_i} = \Pr(x_i \mid x_{i-1})$$

Similarly, we can denote the probability of a sequence $x$ as

$$\Pr(x) = a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i} = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$

where $a_{B x_1}$ represents the transition from the begin state

HMM:

Goal: find the most likely explanation for the observed variables

CpG Islands

(Written CpG to distinguish the dinucleotide from a C-G base pair)

CpG dinucleotides are rarer than would be expected from the independent probabilities of C and G
Reason: when CpG occurs, C is typically chemically modified by methylation, and there is a relatively high chance of methyl-C mutating into T
A CpG island is a region where CpG dinucleotides are much more abundant than elsewhere
High CpG frequency may be biologically significant; e.g., it may signal a promoter region (start of a gene)

Markov Chain for Discrimination

Parameters are estimated for '+' (CpG island) and '-' (background) models
from human sequences containing 48 CpG islands (about 60,000 nucleotides)
Transition probabilities are calculated for both models
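Discrimination then scores a sequence by the log-odds ratio of the two models; a minimal sketch, assuming a_plus and a_minus are the estimated '+' and '-' transition tables (the names are mine):

import math

def log_odds(x, a_plus, a_minus):
    # Sum of per-transition log-odds; > 0 favours the '+' (island) model.
    score = 0.0
    for prev, cur in zip(x, x[1:]):
        score += math.log2(a_plus[prev][cur] / a_minus[prev][cur])
    return score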

The occasionally dishonest casino

A casino uses a fair die most of the time, but occasionally switches to a loaded one
Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = 1/2
These are the emission probabilities

Transition probabilities
Prob(Fair → Loaded) = 0.01
Prob(Loaded → Fair) = 0.2
Transitions between states obey a Markov process

An HMM for the occasionally dishonest casino

[State diagram, shown here as a table:]

State    Emission probabilities e_k(b)                           Transitions a_kl
Fair     1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6         Fair→Fair 0.99, Fair→Loaded 0.01
Loaded   1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2    Loaded→Loaded 0.80, Loaded→Fair 0.2

Three Important Questions

How likely is a given sequence?
the Forward algorithm

What is the most probable path for generating a given sequence?
the Viterbi algorithm

How can we learn the HMM parameters given a set of sequences?
the Baum-Welch (Forward-Backward) algorithm

The occasionally dishonest casino

$$x = (x_1, x_2, x_3) = (6, 2, 6)$$

$\pi^{(1)} = FFF$:
$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

$\pi^{(2)} = LLL$:
$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

$\pi^{(3)} = LFL$:
$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6)\, a_{L0} = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$
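These three products are easy to check in code; a minimal sketch, assuming the begin state ('0') picks Fair or Loaded with probability 0.5 each and taking the end transition a_{k0} as 1:

a = {'0': {'F': 0.5, 'L': 0.5},            # begin-state transitions
     'F': {'F': 0.99, 'L': 0.01},
     'L': {'F': 0.2, 'L': 0.8}}
e = {'F': {r: 1 / 6 for r in range(1, 7)},                # fair die
     'L': {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}} # loaded die

def joint_probability(x, path):
    p = a['0'][path[0]] * e[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= a[path[i - 1]][path[i]] * e[path[i]][x[i]]
    return p

x = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability(x, path))   # ~0.00227, 0.008, ~0.0000417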

The Viterbi Algorithm

Initialization ($i = 0$):

$$v_0(0) = 1, \qquad v_k(0) = 0 \text{ for } k > 0$$

Recursion ($i = 1, \ldots, L$): for each state $k$,

$$v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$$

Termination:

$$\Pr(x, \pi^*) = \max_k \left[ v_k(L)\, a_{k0} \right]$$

To find $\pi^*$, use trace-back, as in dynamic programming
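A minimal Viterbi sketch for the casino HMM (end transitions a_{k0} taken as 1); it reproduces the values worked out on the next slide:

# Self-contained Viterbi for the two-state casino HMM.
def viterbi(x, states, a, e, start):
    v = [{k: start[k] * e[k][x[0]] for k in states}]   # initialization
    back = []
    for obs in x[1:]:                                  # recursion
        prev, cur, ptr = v[-1], {}, {}
        for k in states:
            r = max(states, key=lambda s: prev[s] * a[s][k])
            cur[k] = e[k][obs] * prev[r] * a[r][k]
            ptr[k] = r
        v.append(cur)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])         # termination
    path = [last]
    for ptr in reversed(back):                         # trace-back
        path.append(ptr[path[-1]])
    return v[-1][last], ''.join(reversed(path))

a = {'F': {'F': 0.99, 'L': 0.01}, 'L': {'F': 0.2, 'L': 0.8}}
e = {'F': {r: 1 / 6 for r in range(1, 7)},
     'L': {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}
print(viterbi([6, 2, 6], 'FL', a, e, {'F': 0.5, 'L': 0.5}))
# -> (0.008, 'LLL'): the loaded-die path is most probable here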

Viterbi: Example

Using the casino HMM above ($a_{FF} = 0.99$, $a_{LL} = 0.80$, $a_{FL} = 0.01$, $a_{LF} = 0.2$, start probabilities 1/2 each) and the recursion $v_k(i) = e_k(x_i) \max_r [v_r(i-1)\, a_{rk}]$:

x = 6:
v_F(1) = (1/6)(1/2) = 1/12
v_L(1) = (1/2)(1/2) = 1/4

x = 2:
v_F(2) = (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375
v_L(2) = (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02

x = 6:
v_F(3) = (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
v_L(3) = (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008

The final Loaded entry, 0.008, is the larger of the two, so the most probable path ends in the Loaded state.

THANKS A LOT...
