Kun-Mao Chao
Department of Computer Science
and Information Engineering
National Taiwan University, Taiwan
E-mail: kmchao@csie.ntu.edu.tw
WWW: http://www.csie.ntu.edu.tw/~kmchao
Bioinformatics
Useful Websites
MIT Biology Hypertextbook
http://www.mit.edu:8001/afs/athena/course/other/esgbio/www/7001main.html
Sequence Alignment
Dot Matrix
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (CGGATCAT) across the top and Sequence A (CTTAACT) down the side]
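As a rough illustration, the dot matrix for these two sequences can be computed in a few lines of Python. The function name `dot_matrix` and the `*` / `.` rendering are assumptions made for this sketch; they do not come from the slides.

```python
def dot_matrix(a, b):
    """Return one text row per letter of a; '*' marks positions where a[i] == b[j]."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


if __name__ == "__main__":
    seq_a, seq_b = "CTTAACT", "CGGATCAT"
    print("  " + seq_b)                      # Sequence B across the top
    for letter, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        print(letter + " " + row)            # Sequence A down the side
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree.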
Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT
CGGATCA--T
Column types in this alignment:
Match: two identical letters (e.g. C over C)
Mismatch: two different letters (T over C)
Insertion gap: gap symbols in Sequence A (---)
Deletion gap: gap symbols in Sequence B (--)
Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
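[Figure: alignment graph with Sequence B across the top and Sequence A down the side, shown together with the alignment below]
C---TTAACT
CGGATCA--T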
Alignment score
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
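The column-by-column sum above can be checked with a short Python sketch. The function name `alignment_score` is hypothetical, and the default values (match +8, mismatch -5, gap symbol -3) are read off the example.

```python
def alignment_score(top, bottom, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for p, q in zip(top, bottom):
        if p == "-" or q == "-":
            total += gap          # a letter paired with a gap symbol
        elif p == q:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


if __name__ == "__main__":
    print(alignment_score("C---TTAACT", "CGGATCA--T"))
```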
An optimal alignment
$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
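Computing Si,j
[Figure: cell (i, j) is reached from cell (i-1, j-1) with score w(ai, bj), from cell (i-1, j) with score w(ai, -), and from cell (i, j-1) with score w(-, bj); the last cell holds Sm,n]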
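A minimal Python sketch of filling the table with the recurrence above follows. The function name `fill_global_table` is an assumption, and the default scores (match +8, mismatch -5, gap symbol -3) are taken from the alignment score example.

```python
def fill_global_table(a, b, match=8, mismatch=-5, gap=-3):
    """Return the (len(a)+1) x (len(b)+1) table of global alignment scores."""
    m, n = len(a), len(b)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: a[:i] against gaps only
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, n + 1):            # first row: b[:j] against gaps only
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + gap,      # a_i aligned with a gap
                          s[i][j - 1] + gap,      # b_j aligned with a gap
                          s[i - 1][j - 1] + w)    # a_i aligned with b_j
    return s


if __name__ == "__main__":
    table = fill_global_table("CTTAACT", "CGGATCAT")
    print(table[-1][-1])   # score of an optimal global alignment
```

The table takes time and space proportional to the product of the two sequence lengths.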
Initializations
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21
S3,5 = ?
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21
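S3,5 = 5
S3,5 = max{S2,5 - 3, S3,4 - 3, S2,4 + 8} = max{4, -8, 5} = 5
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
Optimal score: 14 (bottom-right cell)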
 C  T  T  A  A  C  -  T
 C  G  G  A  T  C  A  T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
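An optimal alignment can be recovered by tracing back from the bottom-right cell of the table. The sketch below is one way to do this, under stated assumptions: the function name `global_align` is hypothetical, and ties are broken by preferring an aligned pair, then a gap in Sequence B, then a gap in Sequence A. A different tie-breaking order may return a different alignment with the same score.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = i * gap
    for j in range(1, n + 1):
        s[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + w, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Trace back from (m, n) to (0, 0), collecting columns in reverse order.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        w = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + w:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
    print(score)
    print(row_a)
    print(row_b)
```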
Initializations
Sequence A: CAATTGA
Sequence B: GAATCTGC
          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
A   -6
A   -9
T  -12
T  -15
G  -18
A  -21
S4,2 = ?
          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14    ?
T  -15
G  -18
A  -21
S5,5 = ?
          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16    ?
G  -18
A  -21
S5,5 = 14
S5,5 = max{S4,5 - 3, S5,4 - 3, S4,4 - 5} = max{13, 13, 14} = 14
          G    A    A    T    C    T    G    C
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3   -5   -8  -11  -14   -4   -7  -10  -13
A   -6   -8    3    0   -3   -6   -9  -12  -15
A   -9  -11    0   11    8    5    2   -1   -4
T  -12  -14   -3    8   19   16   13   10    7
T  -15  -17   -6    5   16   14   24   21   18
G  -18   -7   -9    2   13   11   21   32   29
A  -21  -10    1   -1   10    8   18   29   27
Optimal score: 27 (bottom-right cell)
C A A T - T G A
G A A T C T G C
-5 +8 +8 +8 -3 +8 +8 -5 = 27
$$
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
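The recurrence with the extra 0 term can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `local_align` is hypothetical, the default scores (match +8, mismatch -5, gap symbol -3) are assumptions consistent with the rest of the slides, and the traceback simply starts at the first cell holding the best score and stops when it reaches a cell with score 0.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    m, n = len(a), len(b)
    s = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0, s[i - 1][j] + gap, s[i][j - 1] + gap, s[i - 1][j - 1] + w)
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while s[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + w:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_align("CTTAACT", "CGGATCAT"))
```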
Local alignment
Match: 8
Mismatch: -5
Gap symbol: -3
          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18
The best score: 18
 A  -  C  -  T
 A  T  C  A  T
+8 -3 +8 -3 +8 = 18
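Local alignment of CAATTGA and GAATCTGC
          G    A    A    T    C    T    G    C
     0    0    0    0    0    0    0    0    0
C    0    0    0    0    0    8    5    2    8
A    0    0    8    8    5    5    3    0    5
A    0    0    8   16   13   10    7    4    2
T    0    0    5   13   24   21   18   15   12
T    0    0    2   10   21   19   29   26   23
G    0    8    5    7   18   16   26   37   34
A    0    5   16   13   15   13   23   34   32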
 A  A  T  -  T  G
 A  A  T  C  T  G
+8 +8 +8 -3 +8 +8 = 37
Affine gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-,x) = w(x,-) = -3)
Each gap is charged an extra gap-open penalty: -4.
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
The alignment contains two gaps (--- and --), each charged -4.
Alignment score: 12 - 4 - 4 = 4
An alignment ends in one of three ways:
1. ...x
   ...x
   an aligned pair
2. ...x
   ...-
   a deletion
3. ...-
   ...x
   an insertion
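$$
D(i,j) = \max
\begin{cases}
D(i-1,j) - y \\
S(i-1,j) - x - y
\end{cases}
$$
$$
I(i,j) = \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases}
$$
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
$$
Here x is the gap-open penalty and y is the penalty for each gap symbol (x = 4 and y = 3 in the example above).
[Figure: transitions between the S, D and I tables. Extending a gap within D or within I costs -y, opening a gap from S costs -x-y, and an aligned pair scores w(ai, bj).]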
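A minimal Python sketch of these three recurrences follows. The function name `affine_global_score` is hypothetical, and the boundary values (a leading gap of length k costs x + ky) are an assumption of this sketch, since the slides do not show the initialization. The defaults x = 4 and y = 3 follow the scoring example above.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, x=4, y=3):
    """Optimal global alignment score with gap-open penalty x and per-symbol penalty y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # alignments ending with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # alignments ending with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):        # assumed boundary: one leading gap of length i
        D[i][0] = -x - i * y
        S[i][0] = D[i][0]
    for j in range(1, n + 1):        # assumed boundary: one leading gap of length j
        I[0][j] = -x - j * y
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


if __name__ == "__main__":
    print(affine_global_score("CTTAACT", "CGGATCAT"))
```

Keeping D and I as separate tables is what lets the sketch tell "extend the current gap" apart from "open a new gap" at each cell.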
Constant gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: 0 (w(-,x) = w(x,-) = 0)
Each gap is charged a constant penalty: -4.
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8  0  0  0 +8 -5 +8  0  0 +8 = +27
The alignment contains two gaps (--- and --), each charged -4.
Alignment score: 27 - 4 - 4 = 19
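Both scoring examples, the one with a gap-open penalty and the one with a constant penalty per gap, can be checked with the same small Python function. The name `gapped_score` and its parameters are assumptions of this sketch; a "gap" is taken to be a maximal run of `-` within one row of the alignment.

```python
import re


def gapped_score(top, bottom, match=8, mismatch=-5, gap_symbol=-3, per_gap=-4):
    """Score an alignment with a per-symbol gap score plus a fixed charge per gap."""
    total = 0
    for p, q in zip(top, bottom):
        if p == "-" or q == "-":
            total += gap_symbol
        elif p == q:
            total += match
        else:
            total += mismatch
    # Count maximal runs of '-' separately in each row.
    gaps = len(re.findall(r"-+", top)) + len(re.findall(r"-+", bottom))
    return total + gaps * per_gap


if __name__ == "__main__":
    top, bottom = "C---TTAACT", "CGGATCA--T"
    print(gapped_score(top, bottom))                 # gap symbol -3, each gap -4
    print(gapped_score(top, bottom, gap_symbol=0))   # gap symbol 0, each gap -4
```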
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + w(a_i, b_j) \\
D(i,j) \\
I(i,j)
\end{cases}
$$
1. ...x
   ...x
   an aligned pair
2. ...x
   ...-
   a deletion
3. ...-
   ...x
   an insertion
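$$
I(i,j) = \max
\begin{cases}
I(i,j-1) - y \\
S(i,j-1) - x - y
\end{cases}
$$
$$
I'(i,j) = \max\{\, I'(i,j-1),\; \ldots \,\}
$$
$$
S(i,j) = \max\{\, \ldots,\; D(i,j),\; D'(i,j),\; \ldots \,\}
$$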
$$
\max\{\, S(i,j) - x - ky,\; S(i,j) - x - cy \,\}
$$
FASTA
(Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
BLAST
(Altschul et al., 1990; Altschul et al., 1997)
FASTA
1) Find runs of identities, and identify
regions with the highest density of
identities.
2) Re-score using PAM matrix, and keep top
scoring segments.
3) Eliminate segments that are unlikely to be
part of the alignment.
4) Optimize the alignment in a band.
FASTA
Step 1: Find runs of identities, and identify regions
with the highest density of identities.
[Figure: plot with Sequence B and Sequence A on its two axes]
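One common way to locate runs of identities is to look up short exact matches (k-tuples) and count them per diagonal. The Python sketch below illustrates that idea only; it is a simplification and not the FASTA implementation. The function name `diagonal_hits`, the choice k = 2, and the test sequences are assumptions of this example.

```python
from collections import Counter, defaultdict


def diagonal_hits(a, b, k=2):
    """Count exact k-tuple matches per diagonal, where diagonal = i - j (0-based starts)."""
    positions = defaultdict(list)
    for i in range(len(a) - k + 1):
        positions[a[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(b) - k + 1):
        for i in positions.get(b[j:j + k], ()):
            counts[i - j] += 1
    return counts


if __name__ == "__main__":
    counts = diagonal_hits("CAATTGA", "GAATCTGC")
    print(counts.most_common(1))   # the diagonal with the most k-tuple matches
```

Diagonals with many matches are candidates for the high-density regions that the later steps re-score and refine.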
FASTA
Step 2: Re-score using PAM matrix, and
keep top scoring segments.
FASTA
Step 3: Eliminate segments that are unlikely to be part
of the alignment.
FASTA
Step 4: Optimize the alignment in a band.
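Restricting the dynamic programming to a band can be sketched as follows. This is a simplified illustration under stated assumptions: it computes a global score, it centers the band on the main diagonal (cells with |i - j| <= band), and it reuses the match +8, mismatch -5, gap symbol -3 scores from the earlier slides. FASTA itself works with a band around the region found in the previous steps. The function name `banded_global_score` is hypothetical.

```python
NEG_INF = float("-inf")


def banded_global_score(a, b, band, match=8, mismatch=-5, gap=-3):
    """Global alignment score using only cells with |i - j| <= band."""
    m, n = len(a), len(b)
    s = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    s[0][0] = 0
    for i in range(m + 1):
        for j in range(max(0, i - band), min(n, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                w = match if a[i - 1] == b[j - 1] else mismatch
                best = s[i - 1][j - 1] + w
            if i > 0:
                best = max(best, s[i - 1][j] + gap)   # cells outside the band stay -inf
            if j > 0:
                best = max(best, s[i][j - 1] + gap)
            s[i][j] = best
    return s[m][n]   # -inf if the band is too narrow to reach the last cell


if __name__ == "__main__":
    print(banded_global_score("CTTAACT", "CGGATCAT", band=1))
```

Only about (2 * band + 1) cells per row are filled, so the work grows with the band width rather than with the full table.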
BLAST
Basic Local Alignment Search Tool
(by Altschul, Gish, Miller, Myers and Lipman)
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.
BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences:
Seq. A = AGATCGAT
         12345678
3-tuple  Positions in Seq. A
AAA
AAC
..
AGA      1
..
ATC      3
..
CGA      5
..
GAT      2, 6
..
TCG      4
..
TTT
For protein sequences:
Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
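For the DNA case, the hash table can be sketched with a Python dictionary that maps each 3-tuple to its 1-based start positions in Sequence A. The function name `build_kmer_index` is an assumption of this example, and the protein case with a score threshold T is not covered here.

```python
from collections import defaultdict


def build_kmer_index(seq, k=3):
    """Map every k-tuple occurring in seq to the list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start + 1)
    return dict(index)


if __name__ == "__main__":
    for kmer, positions in sorted(build_kmer_index("AGATCGAT").items()):
        print(kmer, positions)
```

Only the 3-tuples that actually occur in Sequence A are stored; all other entries of the conceptual table (AAA, AAC, ..., TTT) are simply absent.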
BLAST
Step 2: Scan Sequence B for hits.
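Scanning can be sketched as sliding a window of length k over Sequence B and looking each k-tuple up in the table built for Sequence A. In the Python sketch below, the function name `find_hits` and the sample Sequence B (`CGATCA`) are hypothetical; only Sequence A (`AGATCGAT`) comes from the slides.

```python
from collections import defaultdict


def find_hits(seq_a, seq_b, k=3):
    """Return (position in A, position in B) pairs, 1-based, for every shared k-tuple."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):          # hash table for Sequence A
        index[seq_a[i:i + k]].append(i + 1)
    hits = []
    for j in range(len(seq_b) - k + 1):          # scan Sequence B left to right
        for i in index.get(seq_b[j:j + k], []):
            hits.append((i, j + 1))
    return hits


if __name__ == "__main__":
    print(find_hits("AGATCGAT", "CGATCA"))   # "CGATCA" is a made-up Sequence B
```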
BLAST
Step 3: Extend hits.
[Figure: a hit being extended in both directions]
Terminate if the score of the extension fades
away. (That is, when we reach a segment pair
whose score falls a certain distance below the
best score found for shorter extensions.)
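The termination rule can be illustrated with a gap-free extension that stops once the running score drops a fixed distance below the best score seen so far. This is a sketch under stated assumptions, not the BLAST implementation: the function name `extend_hit`, the drop-off value of 10, the match +8 / mismatch -5 scores, and the sample Sequence B are all hypothetical choices for this example.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, drop=10):
    """Extend the hit a[i:i+k] / b[j:j+k] (0-based) without gaps; return the best score."""
    def w(p, q):
        return match if p == q else mismatch

    seed = sum(w(a[i + t], b[j + t]) for t in range(k))

    # Extend to the right of the hit.
    best_right = run = 0
    t = k
    while i + t < len(a) and j + t < len(b):
        run += w(a[i + t], b[j + t])
        best_right = max(best_right, run)
        if best_right - run >= drop:     # score has faded too far below the best
            break
        t += 1

    # Extend to the left of the hit.
    best_left = run = 0
    t = 1
    while i - t >= 0 and j - t >= 0:
        run += w(a[i - t], b[j - t])
        best_left = max(best_left, run)
        if best_left - run >= drop:
            break
        t += 1

    return seed + best_left + best_right


if __name__ == "__main__":
    # Hit "ATC" at 0-based position 2 in both sequences; "CGATCA" is a made-up Sequence B.
    print(extend_hit("AGATCGAT", "CGATCA", 2, 2, 3))
```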
Remarks
Filtering is based on the observation
that a good alignment usually includes
short identical or very similar
fragments.
The idea of filtration was used in
both FASTA and BLAST.