Sources of annotation
for the UniProt
Knowledgebase
Protein Domain Family database
InterPro
http://www.ebi.ac.uk/interpro
Classifying proteins into families and identifying important domains and sites
is invaluable for helping biologists to identify distantly related proteins and to
predict their functions.
Domain Repeats
T E F
D A
T K F
D S
Gaps
MCDQTKHSKCCPAK---GNQCCPP--TDEAF---QQNQCCQSKGNQCCPPKQNQCCQPKG-- TDEAF
M D +K ++CCP CCPP TD F Q++ CC + CCPPK + CC PK
MSDSSKTNQCCPTPCCPPKPCCPPKPTDKSFCCLQKSPCCPK--SPCCPPK-SPCCTPKVCP TDKSF
Indel
Deletion Insertion
Sequence alignment
2) Some similarity score criteria that tell us how good an alignment is and that
allow the creation of an alignment during the construction phase.
Visualizing alignments
• Various software tools can visualize DNA and protein sequence alignments.
• One of the most widespread free tools, used for example by EBI, that we
will use during practicals is:
• It comes in two flavours: a desktop version and a Java applet for browsers.
Alignment:
General concepts
Pair-wise alignment
(proteins or nucleic acids)
seq1: TCATG
seq2: CATTG
Algorithm
(dynamic
programming)
• Alignment algorithm
- dynamic programming
- The alignment scheme can be local, global or semiglobal
• Similarity matrix
- Contains a value for each possible substitution
- Various methods exist to build one, e.g. PAM and BLOSUM
• Gap cost
- minimal model with a single value for both gap opening and gap extension
- However: evolution tends to group indels
Dynamic programming
[Figure: dot matrix with the sequence LAMIASEQUENZAALLINEASEMPREPERCHE on the horizontal axis and GQGPTCGLAMIASIGGTDPREPGKN on the vertical axis]
Alignment starting from a matrix
Each line connecting top-left to bottom-right represents a possible alignment.
Dynamic programming
Sequences: x = x1x2x3... xn and y = y1y2y3... ym
Notation:
– xi – i-th element of the sequence x
– yj – j-th element of the sequence y
– x1..i – prefix of x from 1 to i
– F – optimal score matrix
• F(i,j) is the score of the optimal alignment of x1..i with y1..j
– d – gap penalty
– s – scoring matrix
Global Alignment
F(i,j) is reached from one of three neighbouring cells:
• from F(i-1,j-1): move ahead on both sequences, +s(xi,yj)
• from F(i,j-1): yj aligned to a gap, -d
• from F(i-1,j): xi aligned to a gap, -d
• Build F
• Initialize: F(0,0) = 0; F(i,0) = -d*i; F(0,j) = -d*j
• Fill the table from the top-left to the bottom-right corner using the recursive
relationship

$$
F(i,j) = \max \begin{cases}
F(i-1,j-1) + s(x_i,y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$
Shifts:
• Diagonal – both residues aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence
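The initialization, recurrence and shifts above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `global_align` and the `SCORES` dictionary are assumptions made here. `SCORES` holds only the BLOSUM50 values needed for the slide example (HEAGAWGHEE against PAWHEAE), and ties in the traceback are broken in favour of the diagonal move.

```python
# BLOSUM50 values for the residues of the slide example only
# (outer key: residue of y, inner key: residue of x).
SCORES = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def global_align(x, y, d=8):
    """Fill F with the global recurrence, then trace back from the last cell."""
    n, m = len(x), len(y)
    # F[j][i] is the best score for aligning x[:i] with y[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i] = -d * i
    for j in range(1, m + 1):
        F[j][0] = -d * j
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                F[j - 1][i - 1] + SCORES[y[j - 1]][x[i - 1]],  # diagonal
                F[j][i - 1] - d,  # xi aligned to a gap
                F[j - 1][i] - d,  # yj aligned to a gap
            )
    # Traceback: diagonal is tried first, then "xi to gap", then "yj to gap".
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[j][i] == F[j - 1][i - 1] + SCORES[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[j][i] == F[j][i - 1] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, top, bottom = global_align("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])
print(top)
print(bottom)
```

A different tie-breaking order in the traceback can return a different alignment with the same optimal score.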
Example
First step is filling the table keeping in consideration a substitution matrix (here
BLOSUM50).

BLOSUM50 substitution matrix
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6
Second step is the recursive compilation of the table (GAP: -d = -8)
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56
[Slides: the table is filled in step by step ("Table almost complete") using the same BLOSUM50 substitution matrix and GAP -d = -8]
Resulting global alignment:
HEAGAWGHE-E
--P-AW-HEAE
Summary: global alignment
Local alignment: scores below 0 are reset to 0

$$
F(i,j) = \max \begin{cases}
0 \\
F(i-1,j-1) + s(x_i,y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$
Example
BLOSUM50 substitution matrix
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
Traceback
Traceback starts from the highest score in the table and proceeds backwards until the first 0
met along the path.
Resulting local alignment:
AWGHE
AW-HE
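The local variant differs from the global one in three places: the first row and column are 0, every cell is floored at 0, and the traceback runs from the highest cell back to the first 0. The sketch below illustrates this under the same assumptions as the slide example (BLOSUM50 values for these residues only, d = 8); `local_align` and `SCORES` are names chosen here for illustration.

```python
# BLOSUM50 values for the residues of the slide example only
# (outer key: residue of y, inner key: residue of x).
SCORES = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def local_align(x, y, d=8):
    """Return (best score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                0,  # start a new local alignment here
                F[j - 1][i - 1] + SCORES[y[j - 1]][x[i - 1]],
                F[j][i - 1] - d,
                F[j - 1][i] - d,
            )
            if F[j][i] > best:
                best, best_i, best_j = F[j][i], i, j
    # Traceback from the highest cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_i, best_j
    while F[j][i] > 0:
        if F[j][i] == F[j - 1][i - 1] + SCORES[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[j][i] == F[j][i - 1] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("HEAGAWGHEE", "PAWHEAE"))
```

For the slide sequences this reproduces the highest cell of the table (28) and the alignment AWGHE / AW-HE.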
Summary: local alignment
OVERLAP ZONE
OVERHANG ZONE
Traceback starts from the highest score on the last row or column and proceeds backwards
through the table until it reaches the first row or column.
The matrix is initialized (first row and column) with 0s, as in “local” alignment.
The matrix is filled as in “global” alignment.
         H    E    A    G    A    W    G    H    E    E
A    0  -2   -3    4   -1    3   -4   -4   -4   -3   -2
W    0  -3   -5   -4    1   -4   18   10    2   -6   -6
H    0  10    2   -6   -6   -1   10   16   20   12    4
E    0   2   16    8    0   -7    2    8   16   26   18
A    0  -2    8   21   13    5   -3    2    8   18   25
E    0   0    4   13   18   12    4   -4    2   14   24
Resulting alignment:
HEAGAWGHEE
PAW-HEAE
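The semiglobal rules above (0s on the first row and column, global-style filling, best score taken from the last row or column) can be sketched as follows. This is only an illustration under the slide's assumptions (BLOSUM50 values for these residues, d = 8); the function name `semiglobal_fill` is hypothetical, and the sketch returns the table and the best end cell without performing the traceback.

```python
# BLOSUM50 values for the residues of the slide example only
# (outer key: residue of y, inner key: residue of x).
SCORES = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def semiglobal_fill(x, y, d=8):
    """Return (F, best score, (i, j)) with the best cell on the last row/column."""
    n, m = len(x), len(y)
    # Leading overhangs are free: first row and column are left at 0.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            # Same recurrence as in global alignment (no floor at 0).
            F[j][i] = max(
                F[j - 1][i - 1] + SCORES[y[j - 1]][x[i - 1]],
                F[j][i - 1] - d,
                F[j - 1][i] - d,
            )
    # Trailing overhangs are free: look only at the last row and last column.
    ends = [(F[m][i], i, m) for i in range(n + 1)]
    ends += [(F[j][n], n, j) for j in range(m + 1)]
    score, i, j = max(ends, key=lambda cell: cell[0])
    return F, score, (i, j)


F, score, end = semiglobal_fill("HEAGAWGHEE", "PAWHEAE")
print(score, end)
```

A traceback from the returned cell, stopping at the first row or column, would yield the aligned overlap.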
Summary: semiglobal alignment
First local alignment:
heagAWGHEe
pAW-HEae
         H    E    A    G    A    W    G    H    E    E
     0   0    0    0    0    0    0    0    0    0    0
P    0   0    0    0    0    0    0    0    0    0    0
A    0   0    0    5    0    5    0    0    0    0    0
W    0   0    0    0    2    0   20   12    4    0    0
H    0  10    2    0    0    0   12   18   22   14    6
E    0   2   16    8    0    0    4   10   18   28   20
A    0   0    8   21   13    5    0    4   10   20   27
E    0   0    6   13   18   12    4    0    4   16   26
Suboptimal alignments (Example II)
[BLOSUM50 substitution matrix for HEAGAWGHEE (columns) and PAWHEAE (rows), same values as in the previous examples]
Second local alignment:
HEAgawghee
pawHEAe
         H    E    A    G    A    W    G    H    E    E
     0   0    0    0    0    0    0    0    0    0    0
P    0   0    0    0    0    0    0    0    0    0    0
A    0   0    0    5    0    0    0    0    0    0    0
W    0   0    0    0    2    0    0    0    0    0    0
H    0  10    2    0    0    0    0    0    0    0    0
E    0   2   16    8    0    0    0    0    0    0    0
A    0   0    8   21   13    5    0    0    0    0    0
E    0   0    6   13   18   12    4    0    0    6    6
Suboptimal alignments
Sometimes it may be useful to have more than one alignment for two
sequences. These alignments are called suboptimal.
– E.g. to find similarity regions not yet considered in local alignments.
– Repeated regions
However, the most common matrices are based on statistical methods indicating
the substitution frequency among amino acids in homologous protein families.
Similarity matrices
Let p_a be the probability of finding residue a, with

$$\sum_x p_x = 1$$

The probability that a and b are aligned by chance is:

$$C(a,b) = p_a \times p_b$$

The probability of a mutation between amino acids a and b (match) is observed on a
set of homologous sequences:

$$M(a,b) = p_{a,b} \qquad \left(\sum_x p_{a,x} = 1;\; p_{a,b} = p_{b,a}\right)$$
Example: HEAGawghee
pawHEAE
Similarity matrices
Odds ratio:

$$\frac{M(a,b)}{C(a,b)} = \frac{p_{a,b}}{p_a \times p_b}$$

To keep the odds ratio more stable and get larger numbers, we take the logarithm
(log-odds ratio):

$$s(a,b) = \log\left(\frac{M(a,b)}{C(a,b)}\right)$$

Typically, values are multiplied by ten and only the integer part is kept.
Similarity matrices
•S = 0 random
•S < 0 less than random, less probable (unfavourable)
•S > 0 more than random, more probable (favourable)
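The log-odds score and its sign can be checked numerically with a small sketch. The slides do not state the base of the logarithm, so base 10 is an assumption here, as are the function name `log_odds_score` and the toy probabilities in the usage lines.

```python
import math


def log_odds_score(p_ab, p_a, p_b, scale=10):
    """s(a,b) = log(M(a,b) / C(a,b)), multiplied by `scale`, integer part kept.

    Assumption: base-10 logarithm (the base is not given in the slides).
    """
    odds = p_ab / (p_a * p_b)  # M(a,b) / C(a,b)
    return int(scale * math.log10(odds))


# Toy probabilities (not taken from any real matrix):
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as by chance
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as by chance
```

A pair observed more often than chance gets a positive score, a pair observed less often gets a negative one, which matches the S > 0 and S < 0 cases above.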
BLOSUM matrices
(Henikoff & Henikoff, 1992)
• Gap costs
- We have seen a minimal model with a single gap open/extension cost
- But: evolution tends to aggregate indels in a few points
- So: affine gap costs
Non-biological solutions are discouraged, e.g. (left) should get a lower score than (right)
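The difference between the two gap models can be made concrete with a short sketch. The affine form used here, an opening penalty d plus an extension penalty e for each additional position, is the common textbook formulation; the symbol e and the default values are assumptions, not taken from the slides.

```python
def linear_gap_cost(length, d=8):
    """Minimal model: every gap position costs the same."""
    return -d * length


def affine_gap_cost(length, d=12, e=2):
    """Affine model: d to open a gap, e for each further position (e < d)."""
    if length <= 0:
        return 0
    return -(d + (length - 1) * e)


# One indel of length 3 versus three scattered indels of length 1:
print(linear_gap_cost(3), 3 * linear_gap_cost(1))
print(affine_gap_cost(3), 3 * affine_gap_cost(1))
```

With the linear model the two scenarios cost the same, while the affine model penalizes the three scattered gaps more heavily, which is the behaviour the slide asks for.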
Local alignment, on the other hand, aligns only the most similar regions of the two
sequences.
Semiglobal alignment tries to combine both methods.
GLOBAL
D L G P S S K Q T G K G S S M D I W D N G M
D - I - - T K S A G K G A I M R L - - E - M
SEMIGLOBAL / FREESHIFT
D L G P S S K Q T G K G S S M D I W D N G M
- - - D I T K S A G K G A I M R L E M - - -
LOCAL
- - - - - - - - - G K G - - - - - - - - - -
- - - - - - - - - G K G - - - - - - - - - -
Alignment and domains
A B
A=C
B’ C
– T-COFFEE,
• URL: http://www.tcoffee.org/
– MAFFT,
• URL: http://www.ebi.ac.uk/Tools/msa/mafft/
–…
Similarity searches in databases
One of the most common problems in bioinformatics is finding homologous
sequences by querying databases.
When functional elements are not available, the choice is made on the basis of statistical features.
• On the basis of similarity, a score is given to each single sequence alignment;
• Ties are evaluated by computing the probability of getting the same score by chance.
• The lower the probability, the more significant the alignment.
The main tools for database querying, such as FASTA and BLAST, differ mainly
in their approach to these problems.
BLAST
BLAST (Basic Local Alignment Search Tool) is a program that looks for local
sequence similarity using the algorithm by Altschul et al., 1990. BLAST, like
FASTA, works by:
2. Similar words are searched in the sequence database looking for exact matches; once
found, they are extended to the right and left of the alignment up to a certain depth
defined by the X parameter. Pairs of segments on the same sequence with a
statistically significant similarity score greater than a threshold S are called HSPs
(High-scoring Segment Pairs).
3. The same pair may contain more than one HSP, for which it is possible to
compute the occurrence probability (Karlin & Altschul, 1993).
The algorithm then considers only cases where two hits exist on the same diagonal at a
distance lower than a parameter A before looking for HSPs.
Moreover, in the current implementation, BLAST considers gaps when trying to merge
ungapped HSPs that are spatially related in the alignment matrix. They are joined into a single
fragment (containing gaps and insertions) when this improves the final score
rather than worsening it.
This step uses additional parameters that set the costs and penalties for the presence of a
gap in the alignment.
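The idea of word hits lying on the same diagonal can be illustrated with a deliberately simplified sketch. It uses exact word matches only (no similar words, no extension step, no scoring), and the function names, the word size w and the toy sequences are assumptions made for this illustration; it is not how BLAST itself is implemented.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, subject, w=3):
    """Group exact w-letter word matches by diagonal (subject pos - query pos)."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    diagonals = defaultdict(list)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            diagonals[j - i].append((i, j))
    return dict(diagonals)


def two_hit_diagonals(diagonals, A):
    """Keep diagonals holding two hits whose distance is lower than A."""
    selected = []
    for diagonal, hits in diagonals.items():
        positions = sorted(i for i, _ in hits)
        if any(0 < b - a < A for a, b in zip(positions, positions[1:])):
            selected.append(diagonal)
    return sorted(selected)


hits = word_hits_by_diagonal("ACGTACGT", "TTACGTAA", w=4)
print(hits)
print(two_hit_diagonals(hits, A=5))
```

Only the diagonals returned by `two_hit_diagonals` would be passed on to the extension step that looks for HSPs.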
BLAST
two hit method
The E-value is the number of different alignments with a score (x)
equal to or better than the score of my alignment (S)
that are expected to occur by chance in a database search (so: false positives). The
lower the value, the more significant the alignment.

$$E(x \geq S) \rightarrow E(S) = K m n \, e^{-\lambda S}$$

K and λ depend on the database and on its size; m and n are the lengths of
the two sequences. S is the score of my alignment.
[Figure: distribution Yev plotted against X, with X from -2 to 5]
E-value E(S)
Different algorithms differ a lot in how they define a random sequence.
BLAST computes a priori the probability of a given score being significant on the basis of the
size and composition of the database, applying:

$$E(S) = K m n \, e^{-\lambda S}$$

where m is the length of the query sequence and n is the length of the subject sequence from
the database. Unlike FASTA, λ and K are precomputed according to an internal
standard distribution. The final score is similar to the FASTA one.
The lower the value of E, the more significant the alignment. A value of 1.0e-5, for example,
means that 1.0e-5 alignments with a score equal to or better than that of my query are
expected by chance; in other words, we expect to observe one such alignment every
100,000 observations (alignments).
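The formula for E(S) is easy to evaluate directly. In the sketch below the default values of K and λ are placeholders chosen for illustration only (real values depend on the scoring system and are not given in the slides), and the function name `expected_hits` is hypothetical.

```python
import math


def expected_hits(score, m, n, K=0.041, lam=0.267):
    """E(S) = K * m * n * exp(-lambda * S).

    K and lam are illustrative placeholder values, not taken from the slides.
    """
    return K * m * n * math.exp(-lam * score)


# A higher score gives a lower E-value; a longer subject gives a higher one.
print(expected_hits(50, m=250, n=1_000_000))
print(expected_hits(80, m=250, n=1_000_000))
print(expected_hits(80, m=250, n=2_000_000))
```

The E-value falls exponentially with the score and grows linearly with the sequence lengths m and n.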
Bit-score
The bit-score allows direct comparison of searches made in databases of different
sizes, since it does not depend on the λ and K parameters, while the raw score does. In
this way two searches in different databases are comparable. The bit-score S' is
computed from the raw score S and is normalized as:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

From which the E-value can be computed:

$$E = m n \, 2^{-S'}$$

Given S', E depends only on the sequence lengths m and n.
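The two expressions for E can be checked against each other numerically: converting a raw score to a bit-score and then applying E = mn 2^(-S') should give the same value as K m n e^(-λS). The K and λ values below are illustrative placeholders and the function names are assumptions.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


K, lam = 0.041, 0.267  # placeholder values, not taken from the slides
m, n, S = 250, 1_000_000, 60

from_raw = K * m * n * math.exp(-lam * S)
from_bits = e_value_from_bits(bit_score(S, K, lam), m, n)
print(from_raw, from_bits)
```

Both routes give the same E-value up to floating-point rounding, since K and λ are absorbed into the bit-score.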
DB Accession Number
Name
Similarity values
Similarity values
Query
Subject
Residue number
Identity (letter),
Similarity (+)
Profiles
Frequency matrix
Matrices used to show the frequency of each residue in a multiple sequence
alignment
• how many times a residue is found at a certain position in the alignment
Position
Example:
Frequency
AA
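Counting how many times each residue occurs at each position can be sketched as below. The toy alignment is invented for illustration, the function name `frequency_matrix` is hypothetical, and gap characters are ignored here, which is a choice of this sketch.

```python
from collections import Counter


def frequency_matrix(alignment):
    """Return one Counter per alignment column (gaps '-' are not counted)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        Counter(row[pos] for row in alignment if row[pos] != "-")
        for pos in range(length)
    ]


msa = ["ACD", "ACE", "GC-"]  # toy alignment, not from the slides
for position, counts in enumerate(frequency_matrix(msa), start=1):
    print(position, dict(counts))
```

Dividing each count by the number of sequences in the column turns the counts into relative frequencies.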
Weighted matrices (PSSM) or profiles
(Gribskov)
Question: how to deal with “similar” AA never observed at a certain position?
We can use weighted matrices (PSSM, position-specific scoring matrix) that take into
account the “weight” of the amino acids, that is, their “substitution likelihood” calculated
with PAM or BLOSUM substitution matrices.
Multiple alignment
Frequency matrix
Profile
(PSSM)
“weighted” similarity
Weighted matrices (PSSM) or profiles
(Gribskov)
The PSSM is the product of the substitution matrix and the frequency matrix calculated from a
query sequence against the hits that exceed the fixed threshold (profiles).
Then, in order to find a new hit, we use the same procedure but with an L x 20 matrix, where L is
the length of the query sequence.
PSSM calculation requires normalizing sequences that are redundant or
overrepresented in order to reduce the risk of a biased matrix.
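The product of the frequency matrix and the substitution matrix can be sketched as follows: the score of residue a at a position is the frequency-weighted average of its substitution scores against the residues observed there. This is a simplified illustration without sequence weighting; the two-letter alphabet, the toy substitution values and the name `pssm_from_alignment` are all assumptions.

```python
def pssm_from_alignment(alignment, subst, alphabet):
    """pssm[pos][a] = sum over b of freq[pos][b] * subst[b][a]."""
    length = len(alignment[0])
    pssm = []
    for pos in range(length):
        column = [row[pos] for row in alignment if row[pos] != "-"]
        total = len(column) or 1  # avoid division by zero on all-gap columns
        freqs = {b: column.count(b) / total for b in alphabet}
        pssm.append(
            {a: sum(freqs[b] * subst[b][a] for b in alphabet) for a in alphabet}
        )
    return pssm


# Toy two-letter alphabet and substitution values (hypothetical):
alphabet = "AG"
subst = {"A": {"A": 4, "G": 0}, "G": {"A": 0, "G": 6}}
for row in pssm_from_alignment(["AA", "AG"], subst, alphabet):
    print(row)
```

With the real 20-letter alphabet and a BLOSUM or PAM matrix, the result is the L x 20 matrix mentioned above, and a residue never observed at a position still gets a score through its similarity to the observed ones.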
1) Input sequence.
2) BLAST search.
3) Creation of a PSSM from a multiple alignment of all the hits that exceed
a fixed threshold.
7) Repeat from 4 to 6
8) The process ends when convergence is reached, in other
words when no further sequences exceed the
threshold
PSI-BLAST: Drift problems
We generally carry out 4-6 iterations in order to avoid drift (profile wander)
events: the input sequence can be lost over the iterations if there is a large protein
family similar to this sequence (crowding out).
Example: [Figure: sequence A surrounded by many B sequences] Sequence A has been removed from the PSSM.
To avoid these issues we can use databases in which each protein has at most
X% (50 <= X < 100) similarity with each of the other sequences.
• e.g. NR90 contains all of the known sequences with a maximum of 90% similarity.
• one of the most popular programs for clustering sequences: CD-HIT (Li et al., 2001)
Beyond PSI-BLAST…
Given a protein family, how can we capture the information in the multiple
sequence alignment in order to look for other sequences that are still
unknown?
• the most common alignment methods, even if they use profiles, do not
evaluate indel positions.
Idea: creation of an HMM (Hidden Markov Model) that best represents the
reality.
1YEA AKESTGFKPGSAKKGATLFKTRCQQCHTIEE-------GGPNKVGPNLHGIFGRHSGQVK
1YCC ----TEFKAGSAKKGATLFKTRCLQCHTVEK-------GGPHKVGPNLHGIFGRHSGQAE
2PCBB ---------GDVEKGKKIFVQKCAQCHTVEK-------GGKHKTGPNLHGLFGRKTGQAP
5CYTR ---------GDVAKGKKTFVQKCAQCHTVEN-------GGKHKVGPNLWGLFGRKTGQAE
1CCR -ASFSEAPPGNPKAGEKIFKTKCAQCHTVDK-------GAGHKQGPNLNGLFGRQSGTTP
1CRY ---------QDAASGEQVFK-QCLVCHSIGP-------GAKNKVGPVLNGLFGRHSGTIE
1HROA -----SAPPGDPVEGKHLFHTICITCHTDIK-------G-ANKVGPSLYGVVGRHSGIEP
1CXC -------QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQA
1C2RA ---------GDAAKGEKEFN-KCKTCHSIIAPDGTEIVKG-AKTGPNLYGVVGRTAGTYP
155C -------NEGDAAKGEKEFN-KCKACHMIQAPD-GTDIKG-GKTGPNLYGVVGRKIASEE
2C2C --------EGDAAAGEKVSK-KCLACHTFDQ-------GGANKVGPNLFGVFENTAAHKD
2mtac -----APQFFNIIDGSPLNFDD-----AMEEGRDTEAVKHFLETGENVYNEDPEILPEAE
. * : * : . .
1YEA GYS-YTDANINK-----NVKWDEDSMSEYLTNPKKYIP--------GTKMAFAGLKKEKD
1YCC GYS-YTDANIKK-----NVLWDENNMSEYLTNPKKYIP--------GTKMAFGGLKKEKD
2PCBB GFT-YTDANKNK-----GITWKEETLMEYLENPKKYIP--------GTKMIFAGIKKKTE
5CYTR GYS-YTDANKSK-----GIVWNNDTLMEYLENPKKYIP--------GTKMIFAGIKKKGE
1CCR GYS-YSTADKNM-----AVIWEENTLYDYLLNPKKYIP--------GTKMVFPGLKKPQE
1CRY GFA-YSDANKNS-----GITWTEEVFREYIRDPKAKIP--------GTKMIFAGVKDEQK
1HROA GYN-YSEANIKS-----GIVWTPDVLFKYIEHPQKIVP--------GTKMGYPGQPDPQK
1CXC DFKGYGEGMKEAGAK--GLAWDEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEAD
1C2RA EFK-YKDSIVALGAS--GFAWTEEDIATYVKDPGAFLKEKLDDKKAKTGMAF-KLAK--G
155C GFK-YGEGILEVAEKNPDLTWTEANLIEYVTDPKPLVKKMTDDKGAKTKMTF-KMGK--N
2C2C NYA-YSESYTEMKAK--GLTWTEANLAAYVKNPKAFVLEKSGDPKAKSKMTF-KLTKDDE
2mtac EL--YAGMCSGCHGHYAEGKIGPGLNDAYWTYPGNETDVGLFSTLYGG--ATGQMGPMWG
* * *
HMM – Hidden Markov Model
An HMM represents a generalization of the profile concept.
• AA substitution probabilities are different for each position (PSSM).
• Insertion and deletion probabilities are different at each position.
• HMMs are used e.g. in the Pfam database of protein domains.
There are technical problems that (until now!) prevent the replacement of
PSI-BLAST with a large-scale profile-profile method
– Size of the database of precalculated profiles