
Introduction to Bioinformatics

Lecture 5 & 6

Sequence Alignment
What is sequence alignment?
A procedure for comparing sequences:
Two sequences (pair-wise sequence alignment)
More than two (multiple sequence alignment)

When sequences are aligned, each paired position is either a:

Match
Mismatch

Global VS Local*****
Global Alignment
Attempts to align the entire length of both sequences
Suitable for similar sequences of roughly equal length

CTGTCG-CTGCACG
-TGC-CG-TG----    (global alignment)

CTGTCGCTGCACG
-----TGC-CGTG     (local alignment)

Local Alignment
Gathers islands of matches
Stretches of sequence with the highest density of matches are aligned
Suitable for partially similar sequences, sequences of different lengths, and sequences containing conserved regions

How can we tell if two sequences are similar?

Similarity judgments should be based on:

The types of changes or mutations that occur within sequences
The characteristics of those different types of mutations

Relative frequency of mutations:
Substitution > Insertion, Deletion >> Duplication > Inversion

Common mutations in DNA***


Substitution:
A C G T T G A C
A C G A T G A C

Deletion:

A C G T T G A C
A C G A C
Insertion:

A C G T T G A C
A C G C A A T T G A C

Common mutations***
Duplication:
A C G T T G A C
A C G T T G A T T G A C

Inversion (double-stranded DNA shown):

A C G T T G A C
T G C A A C T G

Terminology *****
Homolog
A gene related to a second gene by descent
from a common ancestral DNA sequence

Ortholog
Orthologs are genes in different species that
evolved from a common ancestral gene by
speciation

Paralog
Paralogs are genes related by duplication
within a genome

Terminology

Analogous: different structure but similar feature
Xenologous: related through transfer of genetic material between species

Global Alignment *****

[Worked example: a dynamic-programming scoring matrix for globally aligning two short DNA sequences over the letters C, A and T. The first row and column accumulate gap penalties of -5 (-5, -10, -15, -20, ...); each interior cell holds the best running score (values such as 10, 13, 15, 18, 20, 23, 26 and 28 appear), and the bottom-right cell holds the optimal global score, 33.]

Traceback can yield both optimum alignments
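As a sketch of how such a matrix is filled, the following Python computes the global (Needleman-Wunsch) score. The scoring values are assumptions chosen to mirror the numbers visible in the slide's matrix (match +10, mismatch -2, gap -5), not a recovered specification.

# Minimal Needleman-Wunsch sketch. Scoring values are assumptions
# (match +10, mismatch -2, gap -5), not taken verbatim from the slide.
def global_alignment_score(v, w, match=10, mismatch=-2, gap=-5):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: cumulative gap penalties
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):          # first row: cumulative gap penalties
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s[n][m]   # optimal global score; traceback over s recovers the alignment(s)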

Local VS Global Alignment ***

Both use the dynamic programming method
Main difference
The rules for calculating the scoring matrix are slightly different
The scoring system must include negative scores for mismatches
Only non-negative values are kept in the scoring matrix
This has the effect of terminating the alignment

Local Alignment***

[Worked example: a local-alignment scoring matrix for two short sequences over the letters C and A, scored with +1 for a match, -1 for a mismatch, and -5 for a space.]
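A minimal local (Smith-Waterman) scoring sketch using exactly the slide's scheme (+1 match, -1 mismatch, -5 space); note how clamping cells at zero is what terminates local alignments.

# Minimal Smith-Waterman sketch using the slide's scoring:
# +1 match, -1 mismatch, -5 space.
def local_alignment_score(v, w, match=1, mismatch=-1, gap=-5):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # Clamping at 0 terminates poor stretches, keeping the alignment local.
            s[i][j] = max(0, diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            best = max(best, s[i][j])
    return best   # traceback starts from the highest-scoring cell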

How can we know?

The alignment is global if

Matched regions are long
They cover most of the aligned sequences
Many gaps are present

This is very subjective

The matrix will give a global alignment (GA) if it
Gives an average positive score to each aligned position
Has a small gap penalty

The matrix will give a local alignment (LA) if it

Gives an average negative value to the mismatched positions
Has a large gap penalty

Introduction to Bioinformatics
Lecture 7

Why Multiple Sequence Alignment?

Up until now we have only tried to align two sequences
A faint similarity between two sequences becomes significant if it is present in many
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Multiple Sequence Alignment: Approaches

Optimal global alignments - generalization of dynamic programming
Find the alignment that maximizes a score function
Computationally expensive: time grows as the product of the sequence lengths
Global progressive alignments - match closely related sequences first, using a guide tree
Global iterative alignments - multiple re-building attempts to find the best alignment
Local alignments
Profile analysis
Block analysis
Pattern searching and/or statistical methods

Global MSA: Challenges

Computationally expensive
If the MSA includes matches, mismatches and gaps, and also accounts for the degree of variation, then global MSA can be applied to only a few sequences

Difficult to score
Multiple comparisons are necessary in each column of the MSA for a cumulative score
Placement of gaps and scoring of substitutions is more difficult

Difficulty increases with diversity
Relatively easy for a set of closely related sequences
Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging
Even more difficult if some members are more alike than others

Multiple Alignment: Dynamic Programming*********

$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no in/dels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonals: one in/del} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonals: two in/dels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) &
\end{cases}
$$

where $\delta(x, y, z)$ is an entry in the 3-D scoring matrix
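A sketch of this 3-D recurrence in Python. The sum-of-pairs delta below (+1 per matching pair, -1 per mismatched or gapped pair, gap-gap pairs free) is an assumed scoring scheme, since the slide leaves delta abstract.

from itertools import product

# Assumed sum-of-pairs scoring: +1 per matching pair, -1 per mismatched
# or gapped pair; a gap paired with a gap contributes 0.
def delta(x, y, z):
    score = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == '-' and b == '-':
            continue
        score += 1 if (a == b and a != '-') else -1
    return score

def msa3_score(v, w, u):
    n, m, l = len(v), len(w), len(u)
    s = [[[0] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        if i == j == k == 0:
            continue
        cands = []
        if i and j and k:  # cube diagonal: no in/dels
            cands.append(s[i-1][j-1][k-1] + delta(v[i-1], w[j-1], u[k-1]))
        if i and j:        # face diagonals: one in/del
            cands.append(s[i-1][j-1][k] + delta(v[i-1], w[j-1], '-'))
        if i and k:
            cands.append(s[i-1][j][k-1] + delta(v[i-1], '-', u[k-1]))
        if j and k:
            cands.append(s[i][j-1][k-1] + delta('-', w[j-1], u[k-1]))
        if i:              # edge diagonals: two in/dels
            cands.append(s[i-1][j][k] + delta(v[i-1], '-', '-'))
        if j:
            cands.append(s[i][j-1][k] + delta('-', w[j-1], '-'))
        if k:
            cands.append(s[i][j][k-1] + delta('-', '-', u[k-1]))
        s[i][j][k] = max(cands)
    return s[n][m][l]

Note the triple loop: the running time grows as the product of the three sequence lengths, which is why optimal global MSA scales to only a few sequences.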

Introduction to Bioinformatics
Lecture 8

Sensitivity and Selectivity***

Sensitivity: the percentage of true homologs that are identified by the database search
(true positives) / (all true homologs)

Selectivity: the percentage of non-homologs that are correctly not identified as homologs
(true negatives) / (all non-homologs)

For sequence database similarity search methods, there is usually a trade-off between sensitivity and selectivity
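Expressed as code with the usual confusion-matrix counts (a minimal sketch; the variable names are mine):

# tp = homologs found, fn = homologs missed,
# tn = non-homologs rejected, fp = non-homologs reported as hits.
def sensitivity(tp, fn):
    return tp / (tp + fn)   # fraction of all true homologs identified

def selectivity(tn, fp):
    return tn / (tn + fp)   # fraction of all non-homologs rejected

Loosening a score threshold typically raises sensitivity while lowering selectivity, and vice versa.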

Database searching
Smith-Waterman is slower, but more sensitive
Instead, faster heuristic approaches are used:
FASTA [Pearson & Lipman, 1988]
BLAST [Altschul et al., 1990]

FASTA

W. R. Pearson and D. J. Lipman (1988)

FASTA was the first widely used program for sequence database similarity search
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
Based on the dot plot idea
Better than BLAST for nucleotide sequence searches

Hashing Example

Query sequence: WATSNANDCRICK, word length k = 1

Hash table (1-based position(s) of each word in the query):

A: 2, 6    C: 9, 12    D: 8    I: 11    K: 13
N: 5, 7    R: 10       S: 4    T: 3     W: 1

Target sequence: BASEBALLANDCRICKET

Target table (positions, 1-18, of each shared word in the target):

A: 2, 6, 7, 9    C: 12, 15    D: 11    I: 14    K: 16
N: 10            R: 13        S: 3     T: 18

For every word common to both sequences, an offset is computed as (query position - target position). For example, the lone T gives offset 3 - 18 = -15, and the pairings of the A's give offsets 0, -4, -5, -7, 4, 0, -1 and -3.

Tallying the offsets across all shared words, offset -3 occurs 8 times (once each for A, N, D, C, R, I, C and K), far more often than any other value. Shifting the query by this dominant offset lines up the common region ANDCRICK:

   WATSNANDCRICK
BASEBALLANDCRICKET
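A minimal sketch of this k = 1 hashing-and-offset step (the function and variable names are mine):

from collections import Counter, defaultdict

def word_positions(seq, k=1):
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)   # 1-based positions, as in the slides
    return table

query, target = "WATSNANDCRICK", "BASEBALLANDCRICKET"
q_table, t_table = word_positions(query), word_positions(target)

offsets = Counter()
for word, q_positions in q_table.items():
    for qp in q_positions:
        for tp in t_table.get(word, []):
            offsets[qp - tp] += 1   # offset = query position - target position

print(offsets.most_common(1))   # [(-3, 8)]: ANDCRICK lines up

FASTA then extends the best diagonals (dominant offsets) into full local alignments, rather than running dynamic programming over the whole matrix.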

Introduction to Bioinformatics
Lecture 9

A Markov Chain Model

Nucleotide frequencies in the human genome (%):

A: 29.5    C: 20.4    G: 20.5    T: 29.6
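Such frequencies come from simple counting; a tiny sketch (toy input, not genome data):

from collections import Counter

def nucleotide_frequencies(seq):
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: round(100 * counts[b] / total, 1) for b in "ACGT"}

print(nucleotide_frequencies("ACGTACGGTTAACT"))   # toy sequence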

Markov Chain Model: Definition

A Markov chain model is defined by:

a set of states
some states emit symbols
other states are silent (e.g. the begin and end states)

a set of transitions with associated probabilities
the transitions emanating from a given state define a distribution over the possible next states

Markov Chain Model: Property

Given some sequence $x$ of length $L$, we can ask how probable the sequence is given our model

For any probabilistic model of sequences, we can write this probability as

$$\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1) = \Pr(x_L \mid x_{L-1}, \ldots, x_1)\,\Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$$

Key property of a (1st-order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$, so

$$\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$

The Probability of a Sequence for a Given Markov Chain Model

$$\Pr(\text{cggt}) = \Pr(c)\,\Pr(g \mid c)\,\Pr(g \mid g)\,\Pr(t \mid g)\,\Pr(\text{end} \mid t)$$
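A sketch of this computation; the transition values below are made-up placeholders (a real model would estimate them from data), and the optional end-state factor is omitted:

# Placeholder first-order Markov chain over {a, c, g, t}; each row of
# `trans` sums to 1. These numbers are illustrative, not estimates.
begin = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}   # Pr(x1)
trans = {
    'a': {'a': 0.30, 'c': 0.20, 'g': 0.25, 't': 0.25},
    'c': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
    'g': {'a': 0.25, 'c': 0.20, 'g': 0.30, 't': 0.25},
    't': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
}

def sequence_probability(x):
    p = begin[x[0]]                 # Pr(x1) = a_{B x1}
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]       # Pr(x_i | x_{i-1}) = a_{x_{i-1} x_i}
    return p

print(sequence_probability("cggt"))   # Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)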

Markov Chain Model: Notation

The transition parameters can be denoted by $a_{x_{i-1} x_i}$, where

$$a_{x_{i-1} x_i} = \Pr(x_i \mid x_{i-1})$$

Similarly, we can denote the probability of a sequence $x$ as

$$\Pr(x) = a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i} = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$

where $a_{B x_1}$ represents the transition from the begin state

HMM:

Goal: find the most likely explanation for the observed variables

CpG Islands

(Written CpG to distinguish the dinucleotide from a C-G base pair)

CpG dinucleotides are rarer than would be expected from the independent probabilities of C and G
Reason: when CpG occurs, C is typically chemically modified by methylation, and there is a relatively high chance of methyl-C mutating into T
A CpG island is a region where CpG dinucleotides are much more abundant than elsewhere
High CpG frequency may be biologically significant; e.g., it may signal a promoter region (start of a gene)

Markov Chain for Discrimination

Parameters are estimated for '+' (CpG island) and '-' (background) models
from human sequences containing 48 CpG islands (about 60,000 nucleotides)
Transition probabilities are calculated for both models
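Discrimination then scores a sequence by the log-odds ratio of the two models; a minimal sketch, assuming a_plus and a_minus are the estimated '+' and '-' transition tables (the names are mine):

import math

def log_odds(x, a_plus, a_minus):
    # Sum of per-transition log-odds; > 0 favours the '+' (island) model.
    score = 0.0
    for prev, cur in zip(x, x[1:]):
        score += math.log2(a_plus[prev][cur] / a_minus[prev][cur])
    return score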

The occasionally dishonest casino

A casino uses a fair die most of the time, but occasionally switches to a loaded one
Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = 1/2
These are the emission probabilities

Transition probabilities
Prob(Fair → Loaded) = 0.01
Prob(Loaded → Fair) = 0.2
Transitions between states obey a Markov process

An HMM for the occasionally dishonest casino

[State diagram, shown here as a table:]

State    Emission probabilities e_k(b)                           Transitions a_kl
Fair     1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6         Fair→Fair 0.99, Fair→Loaded 0.01
Loaded   1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2    Loaded→Loaded 0.80, Loaded→Fair 0.2

Three Important Questions

How likely is a given sequence?
the Forward algorithm

What is the most probable path for generating a given sequence?
the Viterbi algorithm

How can we learn the HMM parameters given a set of sequences?
the Baum-Welch (Forward-Backward) algorithm

The occasionally dishonest casino

$$x = (x_1, x_2, x_3) = (6, 2, 6)$$

$\pi^{(1)} = FFF$:
$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

$\pi^{(2)} = LLL$:
$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

$\pi^{(3)} = LFL$:
$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6)\, a_{L0} = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$
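These three products are easy to check in code; a minimal sketch, assuming the begin state ('0') picks Fair or Loaded with probability 0.5 each and taking the end transition a_{k0} as 1:

a = {'0': {'F': 0.5, 'L': 0.5},            # begin-state transitions
     'F': {'F': 0.99, 'L': 0.01},
     'L': {'F': 0.2, 'L': 0.8}}
e = {'F': {r: 1 / 6 for r in range(1, 7)},                # fair die
     'L': {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}} # loaded die

def joint_probability(x, path):
    p = a['0'][path[0]] * e[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= a[path[i - 1]][path[i]] * e[path[i]][x[i]]
    return p

x = [6, 2, 6]
for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability(x, path))   # ~0.00227, 0.008, ~0.0000417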

The Viterbi Algorithm

Initialization ($i = 0$):

$$v_0(0) = 1, \qquad v_k(0) = 0 \text{ for } k > 0$$

Recursion ($i = 1, \ldots, L$): for each state $k$,

$$v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$$

Termination:

$$\Pr(x, \pi^*) = \max_k \left[ v_k(L)\, a_{k0} \right]$$

To find $\pi^*$, use trace-back, as in dynamic programming
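A minimal Viterbi sketch for the casino HMM (end transitions a_{k0} taken as 1); it reproduces the values worked out on the next slide:

# Self-contained Viterbi for the two-state casino HMM.
def viterbi(x, states, a, e, start):
    v = [{k: start[k] * e[k][x[0]] for k in states}]   # initialization
    back = []
    for obs in x[1:]:                                  # recursion
        prev, cur, ptr = v[-1], {}, {}
        for k in states:
            r = max(states, key=lambda s: prev[s] * a[s][k])
            cur[k] = e[k][obs] * prev[r] * a[r][k]
            ptr[k] = r
        v.append(cur)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])         # termination
    path = [last]
    for ptr in reversed(back):                         # trace-back
        path.append(ptr[path[-1]])
    return v[-1][last], ''.join(reversed(path))

a = {'F': {'F': 0.99, 'L': 0.01}, 'L': {'F': 0.2, 'L': 0.8}}
e = {'F': {r: 1 / 6 for r in range(1, 7)},
     'L': {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}
print(viterbi([6, 2, 6], 'FL', a, e, {'F': 0.5, 'L': 0.5}))
# -> (0.008, 'LLL'): the loaded-die path is most probable here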

Viterbi: Example

Using the casino HMM above ($a_{FF} = 0.99$, $a_{LL} = 0.80$, $a_{FL} = 0.01$, $a_{LF} = 0.2$, start probabilities 1/2 each) and the recursion $v_k(i) = e_k(x_i) \max_r [v_r(i-1)\, a_{rk}]$:

x = 6:
v_F(1) = (1/6)(1/2) = 1/12
v_L(1) = (1/2)(1/2) = 1/4

x = 2:
v_F(2) = (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375
v_L(2) = (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02

x = 6:
v_F(3) = (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
v_L(3) = (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008

The final Loaded entry, 0.008, is the larger of the two, so the most probable path ends in the Loaded state.

THANKS A LOT...
