
Sequence Alignment

Kun-Mao Chao
Department of Computer Science
and Information Engineering
National Taiwan University, Taiwan
E-mail: kmchao@csie.ntu.edu.tw
WWW: http://www.csie.ntu.edu.tw/~kmchao

Bioinformatics

Bioinformatics and Computational Biology-Related Journals:
Bioinformatics (previously called CABIOS)
Bulletin of Mathematical Biology
Genome Research
Genomics
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Journal of Bioinformatics and Computational Biology
Journal of Computational Biology
Journal of Molecular Biology
Nature
Nucleic Acids Research
Science

Bioinformatics and Computational Biology-Related Conferences:
Intelligent Systems for Molecular Biology (ISMB)
Pacific Symposium on Biocomputing (PSB)
The Annual International Conference on Research in Computational Molecular Biology (RECOMB)
Workshop on Algorithms in Bioinformatics (WABI)
The IEEE Computer Society Bioinformatics Conference (CSB)

Bioinformatics and Computational Biology-Related Books:
Calculating the Secrets of Life: Applications of the Mathematical Sciences in Molecular Biology, by Eric S. Lander and Michael S. Waterman (1995)
Introduction to Computational Biology: Maps, Sequences, and Genomes, by Michael S. Waterman (1995)
Introduction to Computational Molecular Biology, by Joao Carlos Setubal and Joao Meidanis (1996)
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)
Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)
Introduction to Bioinformatics, by Arthur M. Lesk (2002)

Useful Websites
MIT Biology Hypertextbook: http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html
The International Society for Computational Biology: http://www.iscb.org/
National Center for Biotechnology Information (NCBI, NIH): http://www.ncbi.nlm.nih.gov/
European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/
DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

Sequence Alignment

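Dot Matrix
Sequence A: CTTAACT (rows)
Sequence B: CGGATCAT (columns)
A * marks each cell where the two characters are identical.

    C G G A T C A T
C   * . . . . * . .
T   . . . . * . . *
T   . . . . * . . *
A   . . . * . . * .
A   . . . * . . * .
C   * . . . . * . .
T   . . . . * . . *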
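The dot matrix can be produced mechanically. The following is a minimal sketch, assuming a plain exact-match rule; the function name `dot_matrix` is introduced here for illustration only.

```python
def dot_matrix(a: str, b: str) -> list[str]:
    """Return one text row per character of `a`.

    Column j of row i is '*' when a[i] == b[j], otherwise '.'.
    """
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    seq_a, seq_b = "CTTAACT", "CGGATCAT"
    print("  " + seq_b)
    for ch, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        print(ch + " " + row)
```

Diagonal runs of `*` correspond to stretches where the two sequences agree without gaps.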

Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT   (Sequence A)
CGGATCA--T   (Sequence B)

Each column is one of:
Match: identical characters (e.g. C over C)
Mismatch: different characters (e.g. T over C)
Insertion gap: "-" in the Sequence A row (e.g. --- over GGA)
Deletion gap: "-" in the Sequence B row (e.g. AC over --)

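Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: alignment graph, a grid with Sequence B across the top and Sequence A down the side; the path through the grid corresponds to the alignment below.]
C---TTAACT
CGGATCA--T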

A simple scoring scheme
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8

Alignment score: 8 - 3 - 3 - 3 + 8 - 5 + 8 - 3 - 3 + 8 = +12
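As a quick check of the arithmetic, a given alignment can be scored column by column. This is a small sketch under the scoring scheme on this slide; the helper names `w` and `alignment_score` are illustrative.

```python
def w(x: str, y: str, match: int = 8, mismatch: int = -5, gap: int = -3) -> int:
    """Score one alignment column; '-' denotes a gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```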
An optimal alignment
-- the alignment of maximum score
Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows.

\[
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]
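The recurrence translates directly into a table-filling loop. The sketch below assumes the simple scoring scheme (+8, -5, -3) and initializes row 0 and column 0 with accumulated gap penalties; the name `global_alignment_matrix` is illustrative.

```python
def global_alignment_matrix(a: str, b: str, match: int = 8,
                            mismatch: int = -5, gap: int = -3) -> list[list[int]]:
    """Fill S where S[i][j] is the best score for a[:i] versus b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initializations: a prefix aligned against nothing is all gap symbols.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,       # a_i against a gap
                          S[i][j - 1] + gap,       # b_j against a gap
                          S[i - 1][j - 1] + pair)  # a_i against b_j
    return S


if __name__ == "__main__":
    S = global_alignment_matrix("CTTAACT", "CGGATCAT")
    print(S[7][8])  # 14
```

The table takes time and space proportional to m times n.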
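Computing $S_{i,j}$
[Figure: the dynamic-programming matrix. Cell (i, j) is reached by a diagonal edge from (i-1, j-1) with weight w(a_i, b_j), a vertical edge from (i-1, j) with weight w(a_i, -), and a horizontal edge from (i, j-1) with weight w(-, b_j). The optimal score $S_{m,n}$ is in the bottom-right cell.]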

Initializations

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

$S_{3,5}$ = ?

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

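$S_{3,5}$ = 5

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

Optimal score: $S_{7,8}$ = 14 (bottom-right cell)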

Tracing back from the bottom-right cell (score 14) to the top-left cell gives an optimal alignment:

 C  T  T  A  A  C  -  T
 C  G  G  A  T  C  A  T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
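The traceback can be sketched as follows. This is one possible implementation, not necessarily the one used in the lecture; it assumes the simple scoring scheme and breaks ties by preferring the diagonal move, then the vertical move. The name `global_align` is illustrative.

```python
def global_align(a: str, b: str, match: int = 8, mismatch: int = -5,
                 gap: int = -3) -> tuple[int, str, str]:
    """Return (optimal score, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap

    def pair(x: str, y: str) -> int:
        return match if x == y else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + gap, S[i][j - 1] + gap,
                          S[i - 1][j - 1] + pair(a[i - 1], b[j - 1]))

    # Walk back from (m, n), re-deriving which move produced each cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(global_align("CTTAACT", "CGGATCAT"))
```

Other tie-breaking orders may return a different alignment with the same optimal score.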
Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal alignment?

Initializations

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
A   -6
A   -9
T  -12
T  -15
G  -18
A  -21

$S_{4,2}$ = ?

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14    ?
T  -15
G  -18
A  -21

$S_{5,5}$ = ?

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16    ?
G  -18
A  -21

$S_{5,5}$ = 14

          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16   14   24   21   18
G  -18   -7   -9    2   13   11   21   32   29
A  -21  -10    1   -1   10    8   18   29   27

Optimal score: $S_{7,8}$ = 27 (bottom-right cell)

Tracing back from the bottom-right cell (score 27) gives an optimal alignment:

 C  A  A  T  -  T  G  A
 G  A  A  T  C  T  G  C
-5 +8 +8 +8 -3 +8 +8 -5 = 27

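Global Alignment vs. Local Alignment
global alignment: the two sequences are aligned over their entire lengths.
local alignment: a substring of one sequence is aligned with a substring of the other.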

An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at $a_i$ and $b_j$.
With proper initializations, $S_{i,j}$ can be computed as follows.

\[
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]
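A minimal sketch of the local recurrence with traceback follows. It assumes the simple scoring scheme, zero-initialized borders, and a traceback that starts at the highest-scoring cell and stops at the first cell with score 0; the name `local_align` and the tie-breaking order are illustrative choices.

```python
def local_align(a: str, b: str, match: int = 8, mismatch: int = -5,
                gap: int = -3) -> tuple[int, str, str]:
    """Return (best local score, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    best, best_i, best_j = 0, 0, 0

    def pair(x: str, y: str) -> int:
        return match if x == y else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap,
                          S[i - 1][j - 1] + pair(a[i - 1], b[j - 1]))
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    row_a, row_b = [], []
    i, j = best_i, best_j
    while S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + pair(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(local_align("CTTAACT", "CGGATCAT"))
```

The only differences from the global version are the extra 0 in the maximum, the zero borders, and where the traceback starts and stops.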

local alignment
Match: 8
Mismatch: -5
Gap symbol: -3

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0
A    0
C    0
T    0

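local alignment
Match: 8
Mismatch: -5
Gap symbol: -3

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18

The best score: 18 (bottom-right cell)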

Tracing back from the best score (18) until a cell with score 0 is reached gives an optimal local alignment:

 A  -  C  -  T
 A  T  C  A  T
+8 -3 +8 -3 +8 = 18

Now try this example in class
Sequence A: CAATTGA
Sequence B: GAATCTGC
Their optimal local alignment?

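Did you get it right?

          G    A    A    T    C    T    G    C
     0    0    0    0    0    0    0    0    0
C    0    0    0    0    0    8    5    2    8
A    0    0    8    8    5    5    3    0    5
A    0    0    8   16   13   10    7    4    2
T    0    0    5   13   24   21   18   15   12
T    0    0    2   10   21   19   29   26   23
G    0    8    5    7   18   16   26   37   34
A    0    5   16   13   15   13   23   34   32

The best score: 37 (row G, column G)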

Tracing back from the best score (37) gives an optimal local alignment:

 A  A  T  -  T  G
 A  A  T  C  T  G
+8 +8 +8 -3 +8 +8 = 37

Affine gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
Each gap is charged an extra gap-open penalty: -4.

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12

Gap-open penalties: -4 for the gap "---" and -4 for the gap "--".
Alignment score: 12 - 4 - 4 = 4

Affine gap penalties
A gap of length k is penalized x + ky, where x is the gap-open penalty and y is the gap-symbol penalty.
Three cases for alignment endings:
1. ...x
   ...x    an aligned pair
2. ...x
   ...-    a deletion
3. ...-
   ...x    an insertion

Affine gap penalties
Let D(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with a deletion.
Let I(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with an insertion.
Let S(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.

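Affine gap penalties

\[
D(i,j) = \max
\begin{cases}
D(i-1,j) - y \\
S(i-1,j) - x - y
\end{cases}
\]

\[
I(i,j) = \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases}
\]

\[
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
\]

(A gap of length k is penalized x + ky.)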
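The three recurrences can be sketched as three tables filled together. The code below assumes x = 4 and y = 3 (the values from the affine example slide), uses negative infinity for states that cannot occur, and initializes the borders so that a leading gap of length k costs x + ky. The name `affine_global_score` is illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str, x: int = 4, y: int = 3,
                        match: int = 8, mismatch: int = -5):
    """Optimal global score when a gap of length k costs x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # a[:i] against nothing: one deletion gap
        D[i][0] = -x - i * y
        S[i][0] = D[i][0]
    for j in range(1, n + 1):          # nothing against b[:j]: one insertion gap
        I[0][j] = -x - j * y
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y,        # extend the deletion gap
                          S[i - 1][j] - x - y)    # open a new deletion gap
            I[i][j] = max(I[i][j - 1] - y,        # extend the insertion gap
                          S[i][j - 1] - x - y)    # open a new insertion gap
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


if __name__ == "__main__":
    print(affine_global_score("CTTAACT", "CGGATCAT"))  # 10
```

For the two example sequences the best affine score is 10: the alignment CTTAAC-T over CGGATCAT scores 14 under the simple scheme and contains a single gap, so it pays the gap-open penalty only once.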


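Affine gap penalties
[Figure: transitions between the D, I and S tables. Extending a gap (D to D, or I to I) costs -y; opening a gap (S to D, or S to I) costs -x-y; S at (i, j) is reached from S at (i-1, j-1) with weight w(a_i, b_j), or takes the value of D or I at (i, j).]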

Constant gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: 0 (w(-, x) = w(x, -) = 0)
Each gap is charged a constant penalty: -4.

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8  0  0  0 +8 -5 +8  0  0 +8 = +27

Gap penalties: -4 for the gap "---" and -4 for the gap "--".
Alignment score: 27 - 4 - 4 = 19

Constant gap penalties
Let D(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with a deletion.
Let I(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with an insertion.
Let S(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.

Constant gap penalties

\[
D(i,j) = \max
\begin{cases}
D(i-1,j) \\
S(i-1,j) - x
\end{cases}
\]

\[
I(i,j) = \max
\begin{cases}
I(i,j-1) \\
S(i,j-1) - x
\end{cases}
\]

\[
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
\]

where x is a constant gap penalty for a gap
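A sketch of the constant-penalty recurrences follows, assuming x = 4 and the match/mismatch scores from the slides. Extending a gap is free, so only opening one subtracts x. The name `constant_gap_global_score` is illustrative.

```python
NEG_INF = float("-inf")


def constant_gap_global_score(a: str, b: str, x: int = 4,
                              match: int = 8, mismatch: int = -5):
    """Optimal global score when every gap costs x regardless of its length."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # one leading deletion gap, any length
        D[i][0] = -x
        S[i][0] = -x
    for j in range(1, n + 1):          # one leading insertion gap, any length
        I[0][j] = -x
        S[0][j] = -x
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j],        # extend the gap at no cost
                          S[i - 1][j] - x)    # open a new gap
            I[i][j] = max(I[i][j - 1],
                          S[i][j - 1] - x)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, D[i][j], I[i][j])
    return S[m][n]


if __name__ == "__main__":
    print(constant_gap_global_score("ACGT", "AT"))  # 12: two matches, one gap
```

The alignment shown on the example slide scores 19 under this scheme, so the optimal score for those two sequences is at least 19.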


Restricted affine gap penalties
A gap of length k is penalized x + f(k)y,
where f(k) = k for k <= c and f(k) = c for k > c.
Five cases for alignment endings:
1. ...x
   ...x    an aligned pair
2. ...x
   ...-    a deletion
3. ...-
   ...x    an insertion
4. and 5. for long gaps

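Restricted affine gap penalties

\[
D(i,j) = \max
\begin{cases}
D(i-1,j) - y \\
S(i-1,j) - x - y
\end{cases}
\qquad
D'(i,j) = \max
\begin{cases}
D'(i-1,j) \\
S(i-1,j) - x - cy
\end{cases}
\]

\[
I(i,j) = \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases}
\qquad
I'(i,j) = \max
\begin{cases}
I'(i,j-1) \\
S(i,j-1) - x - cy
\end{cases}
\]

\[
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j);\ D'(i,j) \\
I(i,j);\ I'(i,j)
\end{cases}
\]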
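The five tables can be filled side by side. The sketch below assumes x = 4, y = 3 and c = 2 as illustrative defaults (the slides do not fix a value for c) and writes D' and I' as `D2` and `I2`. The name `restricted_affine_score` is illustrative.

```python
NEG_INF = float("-inf")


def restricted_affine_score(a: str, b: str, x: int = 4, y: int = 3, c: int = 2,
                            match: int = 8, mismatch: int = -5):
    """Optimal global score when a gap of length k costs x + min(k, c) * y."""
    m, n = len(a), len(b)

    def grid():
        return [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    S, D, D2, I, I2 = grid(), grid(), grid(), grid(), grid()
    S[0][0] = 0
    for i in range(1, m + 1):
        D[i][0] = -x - i * y       # leading gap charged per symbol
        D2[i][0] = -x - c * y      # leading gap charged as a long gap
        S[i][0] = max(D[i][0], D2[i][0])
    for j in range(1, n + 1):
        I[0][j] = -x - j * y
        I2[0][j] = -x - c * y
        S[0][j] = max(I[0][j], I2[0][j])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            D2[i][j] = max(D2[i - 1][j], S[i - 1][j] - x - c * y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            I2[i][j] = max(I2[i][j - 1], S[i][j - 1] - x - c * y)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,
                          D[i][j], D2[i][j], I[i][j], I2[i][j])
    return S[m][n]


if __name__ == "__main__":
    print(restricted_affine_score("AAAAA", ""))  # -10: x + c*y, not x + 5*y
```

Each of D and D' over-penalizes one kind of gap (long and short respectively), so taking the maximum of the two charges every gap exactly x + min(k, c)y.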

D(i, j) vs. D'(i, j)
Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length <= c
D(i, j) >= D'(i, j)
Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c
D(i, j) <= D'(i, j)

[Figure: Max{S(i,j)-x-ky, S(i,j)-x-cy} plotted against the gap length k.]

k best local alignments


Smith-Waterman
(Smith and Waterman, 1981; Waterman and Eggert, 1987)

FASTA
(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)

BLAST
(Altschul et al., 1990; Altschul et al., 1997)


FASTA
1) Find runs of identities, and identify
regions with the highest density of
identities.
2) Re-score using PAM matrix, and keep top
scoring segments.
3) Eliminate segments that are unlikely to be
part of the alignment.
4) Optimize the alignment in a band.
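Step 1 can be illustrated by counting shared k-tuples on each diagonal of the dot matrix. This is a simplified sketch, not FASTA's actual implementation; the word length `k`, the function name `diagonal_hit_counts`, and the use of a plain count as the "density" measure are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def diagonal_hit_counts(a: str, b: str, k: int = 2) -> Counter:
    """Count identical k-tuples per diagonal d = j - i (0-based positions)."""
    index = defaultdict(list)                # k-tuple -> start positions in a
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):  # .get avoids adding empty entries
            counts[j - i] += 1
    return counts


if __name__ == "__main__":
    counts = diagonal_hit_counts("ACGTACGT", "TTACGTAC", k=3)
    print(counts.most_common(1))  # [(2, 4)]: diagonal 2 holds four 3-tuple hits
```

Diagonals with many hits are the candidate regions that the later steps re-score and join.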

FASTA
Step 1: Find runs of identities, and identify regions with the highest density of identities.
[Figure: dot plot with Sequence B on one axis and Sequence A on the other.]

FASTA
Step 2: Re-score using PAM matrix, and keep top scoring segments.

FASTA
Step 3: Eliminate segments that are unlikely to be part of the alignment.

FASTA
Step 4: Optimize the alignment in a band.

BLAST
Basic Local Alignment Search Tool
(by Altschul, Gish, Miller, Myers and Lipman)

The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest scoring pair of identical-length segments chosen from two sequences.
(for DNA: Identities: +5; Mismatches: -4)

The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?) An exact procedure is too time consuming, so BLAST heuristically attempts to calculate the MSP score.
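One way to answer the "How?" is a dynamic program over diagonals: the best ungapped segment pair ending at (i, j) either extends the one ending at (i-1, j-1) or starts afresh. The sketch below assumes the DNA scores on this slide (+5, -4) and treats the empty segment pair as score 0; the name `msp_score` is illustrative.

```python
def msp_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Highest score of any ungapped pair of equal-length segments.

    Runs in O(len(a) * len(b)) time and keeps only one previous row.
    """
    best = 0
    prev = [0] * (len(b) + 1)   # best segment-pair scores ending in the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Extend along the diagonal, or restart from an empty segment.
            cur[j] = max(0, prev[j - 1] + step)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(msp_score("GGACGTCC", "TTACGTAA"))  # 20: the shared segment ACGT
```

This is the local-alignment recurrence with the two gap cases removed, which is why the cost is proportional to the product of the lengths.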


BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.


BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)

For DNA sequences:
Seq. A = AGATCGAT
         12345678

3-tuple   start positions in A
AAA       (none)
AAC       (none)
..
AGA       1
..
ATC       3
..
CGA       5
..
GAT       2, 6
..
TCG       4
..
TTT       (none)

For protein sequences:
Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
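For the DNA case, the hash table is a map from each 3-tuple to its start positions. The sketch below covers exact DNA words only; the protein case additionally needs a substitution matrix and the threshold T, which are not specified here. The name `build_word_index` is illustrative, and positions are 1-based to match the slide.

```python
from collections import defaultdict


def build_word_index(seq: str, k: int = 3) -> dict[str, list[int]]:
    """Map every k-tuple of `seq` to the list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start + 1)
    return dict(index)


if __name__ == "__main__":
    for word, positions in sorted(build_word_index("AGATCGAT").items()):
        print(word, positions)
```

Words that never occur in Sequence A (AAA, AAC, TTT, and so on) simply have no entry in the dictionary.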

BLAST
Step 2: Scan Sequence B for hits.

Step 3: Extend hits.
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
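The termination rule can be sketched as an ungapped extension with a drop-off threshold. This is an illustrative sketch, not BLAST's actual code: the threshold name `x_drop`, its default of 8, the +5/-4 DNA scores, and the assumption that the seed is an exact word match are all choices made here.

```python
def _extend(a, b, i, j, step, x_drop, match, mismatch):
    """Walk from (i, j) in direction `step`; return (best gain, columns kept)."""
    best = score = 0
    best_len = length = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:   # the score has faded too far below the best
            break
        i += step
        j += step
    return best, best_len


def extend_hit(a: str, b: str, i: int, j: int, k: int, x_drop: int = 8,
               match: int = 5, mismatch: int = -4) -> tuple[int, int, int, int]:
    """Extend an exact k-word hit at a[i:i+k], b[j:j+k] (0-based) in both directions.

    Returns (score, start in a, start in b, segment length).
    """
    right, r_len = _extend(a, b, i + k, j + k, +1, x_drop, match, mismatch)
    left, l_len = _extend(a, b, i - 1, j - 1, -1, x_drop, match, mismatch)
    return k * match + left + right, i - l_len, j - l_len, l_len + k + r_len


if __name__ == "__main__":
    # Seed "CG" at 0-based position 3 in both sequences.
    print(extend_hit("GGACGTCC", "TTACGTAA", 3, 3, 2))  # (20, 2, 2, 4)
```

Only the columns up to the best-scoring point in each direction are kept, so trailing mismatches seen before termination do not lower the reported score.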

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

Remarks
Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.
The idea of filtration was used in both FASTA and BLAST.
