L02 PDF

Sequence databases
Protein sequence Database

Uniprot
http://www.uniprot.org/
Database of protein sequences
Uniprot
The UniProt Knowledgebase (UniProtKB) consists of two sections:
• Swiss-Prot, which is manually annotated and reviewed.
• TrEMBL, which is automatically annotated and is not reviewed.

UniProt is a resource of protein sequences and functional information.
[Figure: sources of annotation for the UniProt Knowledgebase]
Protein Domain Family database
InterPro
http://www.ebi.ac.uk/interpro
Classifying proteins into families and identifying important domains and sites
is invaluable for helping biologists to identify distantly related proteins and to
predict their functions.
Proteins can be classified into different groups based on:
• the FAMILIES to which they belong
• the DOMAINS they contain
• the SEQUENCE FEATURES they possess

InterPro
A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure.
InterPro
Domains are distinct functional and/or structural units in a protein.
Usually they are responsible for a particular function or interaction, contributing to the overall role of a protein.
Domains may exist in a variety of biological contexts, where similar domains can be found in proteins with different functions.
[Figure: proteins with different arrangements of SH3 and SH2 domains]

InterPro
Sequence features are groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function.
[Figure: examples of sequence features: domain, repeats, active site, binding site, PTM site]
InterPro
The following databases make up the InterPro Consortium:
• CATH/Gene3D at University College, London, UK

• PANTHER at University of Southern California, CA, USA
• PIRSF at the Protein Information Resource, Georgetown University Medical Centre,
Washington DC, USA
• Pfam at the Wellcome Trust Sanger Institute, Hinxton, UK
• PRINTS at the University of Manchester, UK
• ProDom at PRABI Villeurbanne, France
• PROSITE and HAMAP at the Swiss Institute of Bioinformatics (SIB), Geneva,
Switzerland
• SMART at EMBL, Heidelberg, Germany
• SUPERFAMILY at the University of Bristol, UK
• TIGRFAMs at the J. Craig Venter Institute, Rockville, MD, US
• MobiDB at University of Padua, Italy
Pfam
http://www.sanger.ac.uk/Software/Pfam
• Collection of multiple sequence alignments and Hidden Markov Models (HMMs).
• Features over 1500 families.
• Includes sequences from SwissProt and TrEMBL.
• Comprises two sections:
– Pfam-A, containing manually curated MSAs.
– Pfam-B (discontinued), in which sequences not included in Pfam-A are automatically clustered.
Sequence alignment
Why
Let us imagine comparing human hemoglobin to mouse hemoglobin

– How can we do it?
– How do we quantify similarity?
– Which residues have been mutated?
Sequence alignment is the single most fundamental technique in bioinformatics.
Evolution
The study of changes occurring in DNA and in its products is the
object of study of Molecular Evolution.
Final goal: structural alignment
[Figure: structural alignment of two short fragments, with residues labelled T, E, F, D, A and T, K, F, D, S]
Gaps
MCDQTKHSKCCPAK---GNQCCPP--TDEAF---QQNQCCQSKGNQCCPPKQNQCCQPKG-- TDEAF
M D +K ++CCP CCPP TD F Q++ CC + CCPPK + CC PK
MSDSSKTNQCCPTPCCPPKPCCPPKPTDKSFCCLQKSPCCPK--SPCCPPK-SPCCTPKVCP TDKSF
Indel
Deletion Insertion
Sequence alignment
Pair-wise sequence alignment defines the reciprocal similarity between two sequences; homology relationships, structural conformation and function can be inferred from it.
Evolution operates by consecutive mutations over time (point mutations, insertions, deletions, inversions). Reconstructing this sequence of events allows us to infer relationships between sequences.
To achieve this goal two elements are necessary:
1) An efficient algorithm, which should represent the actual similarity between sequences as accurately as possible.
2) Similarity scoring criteria that tell us how good an alignment is and that guide the construction of the alignment.
Visualizing alignments
• Various software packages can visualize DNA and protein sequence alignments.
• One of the most widespread free programs, used for example by EBI, which we will use during practicals, is:
• Jalview (Java ALignment VIEWer)
URL: http://www.jalview.org/
• It comes in two flavours: a desktop version and a Java applet for browsers.
Alignment:
General concepts
Pair-wise alignment
(proteins or nucleic acids)
What is the meaning of «pair-wise alignment»?
Writing two sequences horizontally, one above the other, so that as many identical (or similar) symbols as possible fall in the same columns, even if some gaps (insertions/deletions, indels) have to be introduced.
seq1: TCATG
seq2: CATTG
TCAT-G
.CATTG
4 identical characters, 1 indel, 1 non-aligned position
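The column counts quoted for this small example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `alignment_stats` and the convention that `.` marks a non-aligned (overhanging) position are assumptions taken from the way the example is drawn, not part of any standard tool.

```python
def alignment_stats(row1: str, row2: str) -> dict:
    """Count column types in a pairwise alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    stats = {"identical": 0, "mismatch": 0, "indel": 0, "unaligned": 0}
    for a, b in zip(row1, row2):
        if a == "." or b == ".":
            stats["unaligned"] += 1   # terminal overhang, not counted as an indel
        elif a == "-" or b == "-":
            stats["indel"] += 1       # gap introduced inside the alignment
        elif a == b:
            stats["identical"] += 1
        else:
            stats["mismatch"] += 1
    return stats


stats = alignment_stats("TCAT-G", ".CATTG")
print(stats)
```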
[Figure: an alignment is determined by three components: a similarity matrix, a gap cost and an algorithm (dynamic programming)]
• Alignment algorithm
- dynamic programming
- the alignment scheme can be local, global or semiglobal
• Similarity matrix
- Contains values associated to each substitution
- Various methods exist to build one, e.g. PAM and BLOSUM
• Gap cost
- minimal model with only one value for both gap open and gap extension
- However: evolution tends to group indels
Dynamic programming
[Figure: dot matrix with the sequence LAMIASEQUENZAALLINEASEMPREPERCHE along the top and the sequence GQGPTCGLAMIASIGGTDPREPGKN down the side]
Alignment starting from a matrix.
Each line connecting top-left to bottom-right represents a possible alignment.
The optimal alignment is found by building similarity matrices that associate a value to each possible amino acid pair and by developing algorithms that identify the highest scoring paths.
LAMIASEQUENZAALLINEASEMPREPERCHE
GQGPTCGLAMIASIGGTD-------------PREPGKN
LAMIASEQUENZ-AALLINEASEMPREPERCHE
GQGPTCGLAMIASIGGTDPREPGKN
Dynamic programming
The alignment is computed in two steps:
1. Computation of the best solution in every box.
2. Backtracking: choice of the optimal path on the basis of the data computed in the boxes.
The difference between global and local alignment lies in how the boxes are filled and in the choice of the backtracking starting point.
Dynamic programming
• Gives optimal alignment between two sequences.
• Simple algorithm variations produce global, local or semiglobal
alignments
– Global (Needleman & Wunsch, 1970)
– Local (Smith & Waterman, 1981)
• Alignment depends on the choice of some parameters
• Based on recursion: each partial result depends on alignment scores previously computed and stored in a table. The table is eventually used to build the optimal path.
• Alignment computation must be decomposable into a series of individual optimal steps.
– Bellman optimality principle (1957)
Global
Alignment
Global Alignment
Given two sequences x and y to be aligned:
x = x1x2x3... xn
y = y1y2y3... ym
Notation:
– xi – i-th element of the sequence x
– yj – j-th element of the sequence y
– x1..i – Prefix of x from 1 to i
– F – optimal score matrix
•F(i,j) represent the optimal alignment x1..i with y1..j
– d – gap penalty
– s – scoring matrix
Global Alignment
[Figure: the cell F(i,j) is computed from its three neighbours]
• from F(i-1,j-1): move ahead on both sequences, +s(xi,yj)
• from F(i-1,j): xi aligned to a gap, -d
• from F(i,j-1): yj aligned to a gap, -d
While filling the table, you can keep track of the path, that is, of which direction has been taken (inverted arrows).
Global Alignment
• Build F
• Initialize: F(0,0) = 0; F(i,0) = -d*i; F(0,j) = -d*j
• Fill the table from the top-left to the bottom-right corner using the recursive relationship
$$
F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}
$$
Traceback
Alignment is computed in two steps:
1. Computation of the best solution in every box.
2. Backtracking: choice of the optimal path on the basis of the data computed in the boxes.
The backwards path follows the arrows. It always starts from the last cell, F(n,m), and by definition it ends at the first cell, F(0,0).
Shifts:
• Diagonal – both residues are aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence
Example
The first step is filling in a table with the scores of a substitution matrix (here BLOSUM50) for each pair of residues.
BLOSUM50 substitution matrix:
      H   E   A   G   A   W   G   H   E   E
P    -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    -2  -1   5   0   5  -3   0  -2  -1  -1
W    -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H    10   0  -2  -2  -2  -3  -2  10   0   0
E     0   6  -1  -3  -1  -3  -3   0   6   6
A    -2  -1   5   0   5  -3   0  -2  -1  -1
E     0   6  -1  -3  -1  -3  -3   0   6   6
The second step is the recursive compilation of the table (gap: -d = -8):
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56
Table almost complete (BLOSUM50 substitution matrix, gap: -d = -8):
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13
E  -40  -22   -8  -16
A  -48  -30  -16   -3
E  -56  -38  -24  -11

Table complete (BLOSUM50 substitution matrix, gap: -d = -8):
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Traceback
The reverse path follows the arrows through the completed table. It always starts from the last cell (here the bottom-right cell, with score 1) and by definition it gets back to the first cell, which holds 0.
Shifts:
• Diagonal – both residues are aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence
Resulting alignment:
HEAGAWGHE-E
--P-AW-HEAE
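The table filling and traceback described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `needleman_wunsch` and `BLOSUM50_SUBSET` are hypothetical, the score dictionary contains only the residue pairs of this example (rows are residues of PAWHEAE, columns residues of HEAGAWGHEE), and ties in the traceback are resolved by preferring the diagonal move.

```python
# BLOSUM50 scores for the residue pairs of the example only
# (outer key: residue of y = PAWHEAE, inner key: residue of x = HEAGAWGHEE).
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    # F[i][j] = best score of aligning the prefixes x[:i] and y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s[y[j - 1]][x[i - 1]],  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
    # Traceback from the last cell back to F[0][0]
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, aligned_x, aligned_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", BLOSUM50_SUBSET, 8)
print(score)
print(aligned_x)
print(aligned_y)
```

Other tie-breaking rules in the traceback may return a different alignment with the same optimal score.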
Summary: global alignment
• Table values initialization (-d*j and -d*i)
• Use of recursion to fill intermediate cells of the table (dynamic programming)
• Traceback from the last cell (n,m)
• Use of O(nm) space and time
– O(n^2) algorithm
– It is feasible for small sequences but not for whole genomes

Local
Alignment
Local Alignment
• Smith-Waterman (1981)
• Another solution based on dynamic programming, very similar to (semi-)global alignment in the table filling.
• A “0” is introduced in the computation of the cell scores, so the table contains no negative values.
$$
F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}
$$
Example
BLOSUM50 substitution scores:
      H   E   A   G   A   W   G   H   E   E
P    -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    -2  -1   5   0   5  -3   0  -2  -1  -1
W    -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H    10   0  -2  -2  -2  -3  -2  10   0   0
E     0   6  -1  -3  -1  -3  -3   0   6   6
A    -2  -1   5   0   5  -3   0  -2  -1  -1
E     0   6  -1  -3  -1  -3  -3   0   6   6
Local alignment table:
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
Traceback
It starts from the highest score in the table (here 28) and proceeds backwards to the first 0 met along the path.
Resulting alignment:
AWGHE
AW-HE
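The local variant differs from the global one only in the extra 0 in the recurrence and in where the traceback starts and stops. A minimal Python sketch, under the same assumptions as the tables above (BLOSUM50 scores restricted to the residue pairs of the example, linear gap penalty 8); the names `smith_waterman` and `BLOSUM50_SUBSET` are illustrative, not taken from any library:

```python
# BLOSUM50 scores for the residue pairs of the example only
# (outer key: residue of y = PAWHEAE, inner key: residue of x = HEAGAWGHEE).
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def smith_waterman(x, y, s, d):
    """Local alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # start a new local alignment
                F[i - 1][j - 1] + s[y[j - 1]][x[i - 1]],  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j
    # Traceback from the highest-scoring cell until a cell holding 0 is reached
    ax, ay = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, aligned_x, aligned_y = smith_waterman("HEAGAWGHEE", "PAWHEAE", BLOSUM50_SUBSET, 8)
print(score, aligned_x, aligned_y)
```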
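Summary: local alignment
• Table initialization (0)
• Use of recursion to fill intermediate cells of the table (dynamic programming), with the possibility of always using the value 0
• Traceback from the cell with the highest value
• Use of O(nm) space and time
– O(n^2) algorithm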

Semiglobal
Alignment
ANOTHER KIND OF ALIGNMENT
Semiglobal, freeshift or “glocal” alignment
It is used when a complete or partial overlap is expected.
The goal is a global alignment with no penalties for terminal overhangs.
[Figure: two overlapping sequences, with the overlap zone and the overhang zones marked]
The traceback starts from the highest score found on the last row or column and runs backwards through the whole table (to the first row or column).
Matrix initialization (first row and column) is done with 0s, as in the “local” alignment.
Matrix compilation is done as in the “global” alignment.
Hence the name “semiglobal” alignment.
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   0  -2  -3   4  -1   3  -4  -4  -4  -3  -2
W   0  -3  -5  -4   1  -4  18  10   2  -6  -6
H   0  10   2  -6  -6  -1  10  16  20  12   4
E   0   2  16   8   0  -7   2   8  16  26  18
A   0  -2   8  21  13   5  -3   2   8  18  25
E   0   0   4  13  18  12   4  -4   2  14  24
Resulting alignment:
HEAGAWGHEE
   PAW-HEAE
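The semiglobal scheme combines the zero initialization of the local algorithm with the recurrence of the global one, and reads the result from the last row or column. The sketch below computes only the best score and the cell where the traceback would start; it is an illustration under the same assumptions as the table above (BLOSUM50 scores for the example pairs, linear gap penalty 8), and the function name `semiglobal_best` is hypothetical.

```python
# BLOSUM50 scores for the residue pairs of the example only
# (outer key: residue of y = PAWHEAE, inner key: residue of x = HEAGAWGHEE).
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def semiglobal_best(x, y, s, d):
    """Return (best score, i, j): the traceback start on the last row or column."""
    n, m = len(x), len(y)
    # First row and column stay 0: leading overhangs are not penalized.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s[y[j - 1]][x[i - 1]],  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
    # Trailing overhangs are free: look at every cell where x or y is exhausted.
    candidates = [(F[n][j], n, j) for j in range(m + 1)]
    candidates += [(F[i][m], i, m) for i in range(n + 1)]
    return max(candidates)


best_score, end_i, end_j = semiglobal_best("HEAGAWGHEE", "PAWHEAE", BLOSUM50_SUBSET, 8)
print(best_score, end_i, end_j)
```

Here `end_i` and `end_j` count how many residues of x and y are consumed at the traceback start; the remaining residues form the terminal overhang.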
Summary: semiglobal alignment
• Table initialization (0)
• Use of recursion to fill intermediate cells of the table (dynamic programming)
• Traceback from the last row or column
• Use of O(nm) space and time
– O(n^2) algorithm

Alignment:
Advanced techniques
Suboptimal alignments (Example II)
First local alignment (BLOSUM50 substitution matrix as in the previous examples):
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
heagAWGHEe
   pAW-HEae
Suboptimal alignments (Example II)
Second local alignment (BLOSUM50 substitution matrix as in the previous examples):
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   0   0   0   0   0   0
W   0   0   0   0   2   0   0   0   0   0   0
H   0  10   2   0   0   0   0   0   0   0   0
E   0   2  16   8   0   0   0   0   0   0   0
A   0   0   8  21  13   5   0   0   0   0   0
E   0   0   6  13  18  12   4   0   0   6   6
   HEAgawghee
pawHEAe
Suboptimal alignments
Sometimes it may be useful to have more than one alignment for two sequences. These additional alignments are called suboptimal.
– E.g. to find similarity regions not yet covered by previous local alignments.
– Repeated regions
They can be computed with small variations of the backtracking phase, which is repeated to obtain different alignments as long as their score exceeds a threshold T (T >> 0 for meaningful alignments).
– In (Waterman & Eggert, 1987) the values of the cells used by the previous alignments are set to zero.
• Previous example
– In (Vingron, 1996) cell values are decreased by a weight ε (ε can be additive or multiplicative)
Similarity Matrices
The 20 amino acids
Similarity matrices
For nucleic acids, each pair of aligned bases is simply given a value of 1 or 0 (identity or non-identity).
Proteins have 20 amino acids and single substitutions do not all have the same weight.
• Substitution of Serine (S) with Threonine (T) or of Glutamate (E) with Aspartate (D) is usually tolerated by proteins, given the similarity of these amino acids.
Similarity matrices are tables associating a similarity value to each substitution.
Matrices can be based on the physico-chemical properties of residues.
However, the most common matrices are based on statistical methods that measure the substitution frequency among amino acids in homologous protein families.
Similarity matrices
Let p_a be the probability of finding residue a, with
$$\sum_x p_x = 1$$
The probability of a random substitution between residues is computed as the probability of two independent events:
$$C(a,b) = p_a \times p_b$$
The probability of a mutation between amino acids a and b (match) is observed on a set of homologous sequences:
$$M(a,b) = p_{a,b} \qquad \left(\sum_x p_{a,x} = 1;\; p_{a,b} = p_{b,a}\right)$$
Example: HEAGawghee
pawHEAE
Similarity matrices
The relationship between match and random can be expressed as a ‘‘likelihood’’ or ‘‘odds ratio’’:
$$\frac{M(a,b)}{C(a,b)} = \frac{p_{a,b}}{p_a \times p_b}$$
To keep the odds ratio more stable and get larger numbers, we take the logarithm (log odds ratio):
$$s(a,b) = \log\left(\frac{M(a,b)}{C(a,b)}\right)$$
Typically, values are multiplied by ten and only the integer part is kept.
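As a worked illustration of the log odds computation, the sketch below turns a pair of frequencies into an integer score. The frequencies are invented for the example, the base of the logarithm (10) is an assumption since it is not specified here, and `log_odds_score` is a hypothetical helper name.

```python
import math


def log_odds_score(p_ab, p_a, p_b, scale=10):
    """Integer log odds score: log of observed over random pair probability."""
    odds = p_ab / (p_a * p_b)              # M(a,b) / C(a,b)
    return int(scale * math.log10(odds))   # int() keeps only the integer part


# Hypothetical frequencies: both residues occur with probability 0.1.
favourable = log_odds_score(p_ab=0.02, p_a=0.1, p_b=0.1)     # pair seen twice as often as by chance
unfavourable = log_odds_score(p_ab=0.005, p_a=0.1, p_b=0.1)  # pair seen half as often as by chance
print(favourable, unfavourable)
```

A pair observed more often than expected by chance gets a positive score, a pair observed less often a negative one.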
Example: HEAGawghee
pawHEAE
Similarity matrices
Meaning of odds ratio:
•S = 0 random
•S < 0 less than random, less probable (unfavourable)
•S > 0 more than random, more probable (favourable)
An alignment score can be written as:
$$S = \sum_{i=1}^{n} s(a_i, b_i)$$
Example: HEAGawghee
pawHEAE
BLOSUM matrices
(Henikoff & Henikoff, 1992)
• Blocks Amino Acid Substitution Matrices = BLOSUM

• Substitution observed in ~2,000 blocks of conserved sequences.
• Extracted from a database of 500 protein families.
• Counts swaps observed in each column.
BLOSUM matrices
As for PAMs, the matrix is computed from observed substitutions:
• M(a,b) is computed from alignments of protein families from the BLOCKS database.
The BLOCKS database has alignments of sequences with homology above a threshold P varying from 35% to 95% of sequence identity.
• BLOCKS 50 holds aligned sequences with <= 50% sequence identity
Affine Gap Costs
Alignments: Components
• Similarity matrix
- Gives score for each substitution
- Different building methods exist, e.g. PAM and BLOSUM
• Alignment algorithm
- Local, global or freeshift schemes define how the sequences are aligned
• Gap costs
- We have seen a minimal model with unique gap open/extension cost
- But: evolution tends to aggregate indels in few points
- So: affine gap costs
Non-biological solutions are discouraged, e.g. the first alignment should get a lower score than the second:
Seq 1 MNALSDR---T
Seq 2 M-G-SDRTTET

Seq 1 MNALSDRT---
Seq 2 MG--SDRTTET
Affine gap costs
Two parameters are used for gaps:
• gap open (γ) = opening of the first gap of an indel.
• gap extension (δ) = extending an already existing indel, of course γ > δ
• typical costs are: gap open = 10, gap extension = 2
• exact values can differ
Gap penalty: d = γ + δ * (length(i) – 1)
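The effect of the affine model can be checked on the two alignments shown earlier. The sketch below is illustrative only: the helper names are hypothetical, the costs are the typical values quoted above (gap open = 10, gap extension = 2), and the linear penalty of 8 is the value used in the earlier dynamic programming examples.

```python
import re


def affine_gap_cost(row, gap_open=10, gap_extend=2):
    """Total affine cost of all gap runs in one aligned row."""
    runs = re.findall(r"-+", row)  # each run of dashes is one indel
    return sum(gap_open + gap_extend * (len(run) - 1) for run in runs)


def linear_gap_cost(row, d=8):
    """Total cost when every gap position costs the same."""
    return d * row.count("-")


scattered = ("MNALSDR---T", "M-G-SDRTTET")  # three separate indels
grouped = ("MNALSDRT---", "MG--SDRTTET")    # two indels

for name, rows in (("scattered", scattered), ("grouped", grouped)):
    affine = sum(affine_gap_cost(r) for r in rows)
    linear = sum(linear_gap_cost(r) for r in rows)
    print(name, "affine:", affine, "linear:", linear)
```

With a linear cost the two alignments pay the same gap penalty, while the affine cost penalizes the alignment with more scattered indels more heavily.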
Filling the alignment matrix with affine gap costs requires a more complex procedure:
• the gap open cost has to be used only the first time; afterwards the gap extension cost has to be used.
• for each element of the matrix the optimal value of the gap needs to be saved both in the horizontal and in the vertical direction.
• details can be found in Biological Sequence Analysis by Durbin et al. (Cambridge University Press)
• this computation is not part of the final exam
Alignments: differences
Differences between global and local alignments
Global alignment imposes an alignment that involves all the residues of the two sequences, no matter their similarity.
Local alignment, on the other hand, aligns only the most similar regions of the two sequences.
Semiglobal alignment tries to combine both methods.
[Figure: dot matrix with the sequence LAMIASEQUENZAALLINEASEMPREPERCHE along the top and the sequence GQGPTCGLAMIASIGGTDPREPGKN down the side]
GLOBAL
D L G P S S K Q T G K G S S M D I W D N G M
D - I - - T K S A G K G A I M R L - - E - M
SEMIGLOBAL / FREESHIFT
D L G P S S K Q T G K G S S M D I W D N G M
- - - D I T K S A G K G A I M R L E M - - -
LOCAL
- - - - - - - - - G K G - - - - - - - - - -
- - - - - - - - - G K G - - - - - - - - - -
Alignment and domains
[Figure: domain layouts labelled A B, A=C and B’ C, shown for a global alignment and for a local alignment]
But: a local alignment might lose terminal residues (e.g. divergent repeats)…
Hence a third method, semiglobal (or freeshift or glocal), tries to combine the advantages of the other two.
Multiple sequence alignments
Multiple alignments
• Demonstrate homology
• Molecular phylogeny
• Structural prediction
• Functional prediction
• Identification of functionally important sites.
• Algorithms that search for the optimal alignment between two sequences are hard to generalize to three or more sequences at the same time.
If L is the length of the sequences, it would take O(L^N) units of time to align N sequences. This is unfeasible.
• Heuristic or progressive methods are used instead, based on the hypothesis that the sequences to be aligned are phylogenetically related.
CLUSTALW
(Higgins & Sharp, 1988)
1. Pairwise alignment of all the starting sequences with:
• Approximate methods (n-tuple)
• Dynamic algorithm by Myers & Miller, 1988
2. Scoring of the alignments, used to build the phylogenetic tree (neighbour-joining, NJ).
3. Progressive alignment of the sequences according to the tree order (most similar sequences come first).
[Figure: panels 1. to 3. illustrating the three steps]
CLUSTALW
Progressive methods: disadvantages
• Once an alignment is fixed, it is not modified in the subsequent steps. In particular, the gap location cannot change (once a gap, always a gap).
• Initial errors are propagated in subsequent steps. If an error is introduced in the initial alignment it cannot be corrected; on the contrary, it gets "fixed".
• Initial phylogenetic trees are derived from distance matrices between pairs of independently aligned sequences. These are less reliable than phylogenetic trees derived from complete multiple sequence alignments.
• Alignment errors depend on sequence similarities. Care must be taken to select input sequences that are real homologs and of comparable length, to avoid the insertion of too many gaps.
• If sequences are too divergent (< 25-30% sequence identity) progressive methods become unreliable.
CLUSTALW alignment for GPx
Multiple alignments: more methods
Many methods have been published since CLUSTALW.
– CLUSTAL-OMEGA, which has replaced CLUSTALW,
• URL: http://www.ebi.ac.uk/Tools/msa/clustalo/
– T-COFFEE,
• URL: http://www.tcoffee.org/
– MAFFT,
• URL: http://www.ebi.ac.uk/Tools/msa/mafft/
–…
Similarity searches in
databases
Similarity searches in databases
One of the most common problems faced with bioinformatics methods is finding homologous sequences by querying databases.
The main idea is that homologous proteins have a common ancestor and therefore share extensive regions of similarity.
By comparing our query sequence with all the others contained in a database it is possible to estimate the percentage of similarity and from this to infer possible homology.
Similarity searches in databases
For very similar sequences homology is obvious. In most cases, however, low levels of similarity have to be dealt with.
Real functional homologs can have low similarity levels.
• It is not trivial to distinguish real and false homologs.
When functional information is not available, the choice is made considering statistical features.
• On the basis of similarity, a score is given to each single sequence alignment;
• Scores are evaluated by computing the probability of getting the same score by chance.
• The lower the probability value, the more significant the alignment.
There are two problems to solve:
1. The development of algorithms able to identify sequences similar to the query among millions of different target sequences, and
2. The choice of the statistical methods used to define which sequences are significant.
The main tools for database querying, such as FASTA and BLAST, differ basically in their approach to these problems.
BLAST
BLAST (Basic Local Alignment Search Tool) is a program which looks for local sequence similarity using the algorithm by Altschul et al., 1990. Like FASTA, BLAST works by:
1. Decomposing the query sequence into words of a few amino acids, usually 2 or 3 (W parameter), and, differently from FASTA, generating a list of similar ("affine") words using the BLOSUM substitution matrix. Affine words must have a score greater than a fixed threshold T.
2. Searching the sequence database for exact matches to the affine words. Once found, matches are extended to the right and to the left of the alignment up to a certain depth defined by the X parameter. Pairs of segments from the query and the same database sequence with a statistically significant similarity score, greater than a threshold S, are called HSPs (High-scoring Segment Pairs).
3. The same pair of sequences may contain more than one HSP, for which it is possible to compute the occurrence probability (Karlin & Altschul, 1993).
W = word size, X = elongation, T = threshold, S = HSP threshold
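Step 1 (word decomposition and generation of affine words) can be sketched as follows. This is a toy illustration, not BLAST's implementation: the scoring function is a hypothetical stand-in for BLOSUM (+5 for identity, -2 otherwise), the names `query_words` and `neighbourhood` are invented for the example, and the threshold value is arbitrary.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Hypothetical scoring used only for illustration; a real search would use BLOSUM.
    return 5 if a == b else -2


def query_words(query, w=3):
    """All overlapping words of length w in the query (W parameter)."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighbourhood(word, threshold):
    """All words of the same length whose score against `word` is greater than T."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score > threshold:
            hits.append("".join(candidate))
    return hits


words = query_words("HEAGAWGHEE", w=3)
affine_words = neighbourhood(words[0], threshold=7)
print(words)
print(len(affine_words))
```

With this toy scoring, the neighbourhood of a word contains the word itself plus every word that differs from it at a single position; with a real substitution matrix, conservative substitutions are favoured over others.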
BLAST
two hit method
Current BLAST versions use the two-hit method, which comes from the observation that the execution time of the algorithm is mainly due to the extension of the hits into HSPs.
Before extending, the algorithm therefore considers only cases where two hits exist on the same diagonal at a distance lower than a parameter A.
To avoid losing sensitivity, the T threshold has been lowered.
The algorithm is faster and does not lose precision.
Moreover, in the current implementation, BLAST considers gaps when trying to merge ungapped HSPs which are spatially related in the alignment matrix. Their union into a single fragment (containing insertions and deletions) yields an overall improvement of the final score rather than a worsening.
New parameters regulate the costs and penalties for the presence of a gap in the alignment.
BLAST
two hit method
An MSP (Maximal-scoring Segment Pair) is a pair of segments of equal length that gets the highest similarity score in the comparison of two sequences; the algorithm rigorously evaluates its statistical significance (Karlin & Altschul 1990, 1993).
E-value E(S)
The E-value indicates the number of different alignments with a score (x) equivalent to or better than the score of my alignment (S) that can occur by chance in a database search (so: false positives). The lower the value, the more significant the alignment.
$$E(x \ge S) \rightarrow E(S) = K m n \, e^{-\lambda S}$$
K and λ depend on the database and on its size; m and n are the lengths of the two sequences. S is the score of my alignment.
[Figure: extreme value distribution of alignment scores, plotted against the score x]
E-value E(S)
Different algorithms differ a lot in how they define a random sequence.
BLAST computes the probability of a given score being significant a priori, on the basis of the size and composition of the database, applying:
$$E(S) = K m n \, e^{-\lambda S}$$
where m is the length of the query sequence and n is the length of the subject sequence from the database. Unlike FASTA, λ and K are precomputed according to an internal standard distribution. The final score is similar to the FASTA one.
The significance of a result is expressed as an expectation value E(S).
The lower the value of E, the more significant the alignment. An E-value of 1.0e-5, for example, means that 1.0e-5 alignments with an equal or better score are expected by chance in one search; in other words, we expect to observe one such alignment every 100,000 searches.
Bit-score
The bit-score makes searches in databases of different sizes directly comparable, since it does not depend on the λ and K parameters, while the raw score does. The bit-score S' is computed from the raw score S and normalized as:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
From which the E-value can be computed:
$$E = m n \, 2^{-S'}$$
Given S', E depends only on the sequence lengths m and n.
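The two ways of computing the E-value, from the raw score and from the bit-score, can be compared numerically. In this sketch the values of K, λ, the lengths and the raw score are illustrative assumptions chosen only to have plausible magnitudes, and the function names are hypothetical.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K m n e^(-lambda S), from the raw score S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m n 2^(-S'), from the bit-score S'."""
    return m * n * 2.0 ** (-bits)


# Illustrative (assumed) parameters, not taken from a real search.
K, lam = 0.041, 0.267
m, n = 250, 1_000_000
S = 100

bits = bit_score(S, K, lam)
print(bits)
print(e_value(S, m, n, K, lam))
print(e_value_from_bits(bits, m, n))
```

The last two printed values agree up to floating-point rounding, which shows that K and λ are absorbed into the bit-score.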
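[Figure: annotated BLAST output showing the database accession number, the sequence name, the similarity values, the query and subject lines with their residue numbers, and the middle line marking identities (letter) and similarities (+)]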
Profiles
Frequency matrix
Matrices used to show the frequency of each residue in a multiple sequence
alignment
• how many times a residue is found at a certain position in the alignment
Example:
[Figure: frequency matrix, with axes labelled Position and AA and cells holding frequencies]
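A frequency matrix can be built by counting residues column by column. The sketch below uses a tiny invented alignment; the function name `frequency_matrix` is hypothetical.

```python
from collections import Counter


def frequency_matrix(msa):
    """One Counter per alignment column: how many times each residue occurs there."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(row[pos] for row in msa) for pos in range(length)]


# Hypothetical toy alignment of four sequences and three columns.
msa = ["MKV", "MRV", "LKI", "MKA"]
freq = frequency_matrix(msa)
for position, counts in enumerate(freq, start=1):
    print(position, dict(counts))
```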
Weighted matrices (PSSM) or profiles
(Gribskov)
Question: how to deal with “similar” AA never observed at a certain position?
We can use weighted matrices (PSSM, position-specific scoring matrix) that take into account the “weight” of the amino acids, that is, their “substitution likelihood” calculated with PAM or BLOSUM substitution matrices.
[Figure: a multiple alignment gives a frequency matrix (“observed” similarity); combined with a substitution matrix (“general” similarity) it yields the profile, or PSSM (“weighted” similarity)]
Weighted matrices (PSSM) or profiles
(Gribskov)
In the third column of this weighted matrix we can notice that the amino acid A, found only once, has a lower score (-1) than the amino acid M (+10), which is not found at all.
M is more similar to the other amino acids, such as L, I, V and F, found in the other sequences at this position.
Now we have a complete profile of the multiple sequence alignment, which encodes the probability of each AA at every position.
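The weighting idea can be sketched by averaging, for each candidate amino acid, its similarity to the residues observed in a column. Everything in this example is an assumption made for illustration: the similarity function is a toy stand-in for PAM or BLOSUM (identity 5, two hydrophobic residues 2, anything else -1), the alignment column is invented, and `profile` is a hypothetical helper name.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("LIVMF")


def toy_similarity(a, b):
    # Hypothetical substitution scores, standing in for PAM or BLOSUM.
    if a == b:
        return 5
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 2
    return -1


def profile(msa, s=toy_similarity, alphabet=AMINO_ACIDS):
    """Weighted matrix: for each column, the average similarity of every
    amino acid to the residues observed in that column."""
    n_seq = len(msa)
    pssm = []
    for pos in range(len(msa[0])):
        counts = Counter(row[pos] for row in msa)
        column = {a: sum(counts[b] * s(a, b) for b in counts) / n_seq for a in alphabet}
        pssm.append(column)
    return pssm


# Hypothetical single-column alignment: L, I, V, F and one A are observed; M never is.
msa = ["L", "I", "V", "F", "A"]
pssm = profile(msa)
print(pssm[0]["M"], pssm[0]["A"])
```

In this toy column M is never observed, yet it scores higher than the observed A because it resembles L, I, V and F, which mirrors the situation described above.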
PSI-BLAST
PSI-BLAST
Position Specific Iterated BLAST
It makes use of an iterative procedure in which every sequence above a fixed threshold is used to create a model known as a PSSM (Position Specific Scoring Matrix), used in the following iterations to detect evolutionarily distant sequences.
The PSSM is the product of the substitution matrix with the frequency matrix calculated from the alignment of the query sequence with the hits that exceed the fixed threshold (profile).
Then, to find new hits, we use the same search procedure but with an L x 20 matrix, where L is the length of the query sequence.
The PSSM calculation requires normalizing sequences that are redundant or overrepresented, to reduce the risk of a biased matrix.
The gap penalties are the same in every iteration.

PSI-BLAST procedure
1) Input sequence.
2) BLAST search.
3) Creation of a PSSM from a multiple alignment of all the hits that exceed
a fixed threshold.
4) Search with the obtained PSSM (search for similarities between the profile and the database sequences).
5) Creation of a pairwise alignment, based on the profile, between the input sequence and the new hits.
PSI-BLAST procedure
6) Creation of a new PSSM if there are new hits that exceed the threshold.
The score of some sequences that were not above the threshold may increase and exceed it, because they could be distant homologs within a gene family that now share only a few conserved amino acids (e.g. those in an active site), which raise the score.
7) Repeat steps 4 to 6.
8) The process ends when convergence is reached, in other words when no other sequences exceed the threshold.
PSI-BLAST: Drift problems
We generally carry out 4-6 iterations in order to avoid drift (profile wander) events: the input sequence can be lost over the iterations if there is a large protein family similar to this sequence (crowding out).
Example: [Figure: sequence A surrounded by many B sequences] Sequence A has been removed from the PSSM.
To avoid these issues we can use databases in which each protein has a maximum similarity of X% (50 <= X < 100) with each of the other sequences.
• e.g. NR90 contains all of the known sequences with a maximum of 90% similarity.
• one of the most popular programs for clustering sequences is CD-HIT (Li et al., 2001)
Beyond PSI-BLAST…
Beyond PSI-BLAST…
Given a protein family, how can we capture the information in the multiple sequence alignment in order to look for other sequences that are still unknown?
• the most common alignment methods, even if they use profiles, do not evaluate the positions of indels.
Idea: creation of an HMM (Hidden Markov Model) that best represents the reality.
1YEA AKESTGFKPGSAKKGATLFKTRCQQCHTIEE-------GGPNKVGPNLHGIFGRHSGQVK
1YCC ----TEFKAGSAKKGATLFKTRCLQCHTVEK-------GGPHKVGPNLHGIFGRHSGQAE
2PCBB ---------GDVEKGKKIFVQKCAQCHTVEK-------GGKHKTGPNLHGLFGRKTGQAP
5CYTR ---------GDVAKGKKTFVQKCAQCHTVEN-------GGKHKVGPNLWGLFGRKTGQAE
1CCR -ASFSEAPPGNPKAGEKIFKTKCAQCHTVDK-------GAGHKQGPNLNGLFGRQSGTTP
1CRY ---------QDAASGEQVFK-QCLVCHSIGP-------GAKNKVGPVLNGLFGRHSGTIE
1HROA -----SAPPGDPVEGKHLFHTICITCHTDIK-------G-ANKVGPSLYGVVGRHSGIEP
1CXC -------QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQA
1C2RA ---------GDAAKGEKEFN-KCKTCHSIIAPDGTEIVKG-AKTGPNLYGVVGRTAGTYP
155C -------NEGDAAKGEKEFN-KCKACHMIQAPD-GTDIKG-GKTGPNLYGVVGRKIASEE
2C2C --------EGDAAAGEKVSK-KCLACHTFDQ-------GGANKVGPNLFGVFENTAAHKD
2mtac -----APQFFNIIDGSPLNFDD-----AMEEGRDTEAVKHFLETGENVYNEDPEILPEAE
. * : * : . .
1YEA GYS-YTDANINK-----NVKWDEDSMSEYLTNPKKYIP--------GTKMAFAGLKKEKD
1YCC GYS-YTDANIKK-----NVLWDENNMSEYLTNPKKYIP--------GTKMAFGGLKKEKD
2PCBB GFT-YTDANKNK-----GITWKEETLMEYLENPKKYIP--------GTKMIFAGIKKKTE
5CYTR GYS-YTDANKSK-----GIVWNNDTLMEYLENPKKYIP--------GTKMIFAGIKKKGE
1CCR GYS-YSTADKNM-----AVIWEENTLYDYLLNPKKYIP--------GTKMVFPGLKKPQE
1CRY GFA-YSDANKNS-----GITWTEEVFREYIRDPKAKIP--------GTKMIFAGVKDEQK
1HROA GYN-YSEANIKS-----GIVWTPDVLFKYIEHPQKIVP--------GTKMGYPGQPDPQK
1CXC DFKGYGEGMKEAGAK--GLAWDEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEAD
1C2RA EFK-YKDSIVALGAS--GFAWTEEDIATYVKDPGAFLKEKLDDKKAKTGMAF-KLAK--G
155C GFK-YGEGILEVAEKNPDLTWTEANLIEYVTDPKPLVKKMTDDKGAKTKMTF-KMGK--N
2C2C NYA-YSESYTEMKAK--GLTWTEANLAAYVKNPKAFVLEKSGDPKAKSKMTF-KLTKDDE
2mtac EL--YAGMCSGCHGHYAEGKIGPGLNDAYWTYPGNETDVGLFSTLYGG--ATGQMGPMWG
* * *
HMM – Hidden Markov Model
An HMM is a generalization of the profile concept.
• Amino acid substitution probabilities are different at each position (as in a PSSM).
• Insertion and deletion probabilities are also different at each position.
• HMMs are used, e.g., in the Pfam database of protein domains.
• The most popular program is HMMER (Eddy, 1995).
Mj = column j of the PSSM (match state)
Ij = insertion probability at position j
Dj = deletion probability at position j
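A rough sketch of how Mj, Ij and Dj could be estimated from a multiple sequence alignment follows. It is an assumption-laden illustration, not how HMMER builds its models: columns where at least half of the sequences have a residue are treated as match columns, the values are raw frequencies without pseudocounts, and the function name estimate_profile_hmm is hypothetical.

```python
from collections import Counter


def estimate_profile_hmm(msa, gap="-", min_occupancy=0.5):
    """Return per-match-column emissions (M), deletion (D) and insertion (I) frequencies."""
    n_seqs = len(msa)
    match, delete, insert = [], [], []
    inserting = set()  # sequences with residues in insert columns after the last match column
    for j in range(len(msa[0])):
        column = [row[j] for row in msa]
        present = [i for i, c in enumerate(column) if c != gap]
        if len(present) / n_seqs >= min_occupancy:
            # Match column: residue frequencies, and the share of sequences that skip it.
            counts = Counter(column[i] for i in present)
            match.append({a: n / len(present) for a, n in counts.items()})
            delete.append(1 - len(present) / n_seqs)
            insert.append(0.0)
            inserting = set()
        elif match:
            # Insert column: attributed to the preceding match column.
            # Insert columns before the first match column are ignored in this sketch.
            inserting.update(present)
            insert[-1] = len(inserting) / n_seqs
    return match, delete, insert


msa = ["AC-DE",
       "AC-DE",
       "ACGDE",
       "A--DE"]
M, D, I = estimate_profile_hmm(msa)
print(D, I)
```

In the example one sequence lacks the C (a deletion at match column 2) and one sequence carries an extra G after it (an insertion), so both frequencies are non-zero only at that position, which is exactly the position-specific information a plain PSSM cannot store.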
Profile – profile alignment
PSI-BLAST extended the similarity search from a sequence – sequence
alignment to a profile – sequence alignment.
– This considerably improved the performance for sequences of the same
protein family with a low degree of conservation.
Performance could be improved further with a profile – profile
alignment.
– This can be achieved with the latest generation of alignment methods.
– It is especially important for structure prediction in cases of remote homology.
There are technical problems that (until now!) prevent replacing
PSI-BLAST with a large-scale profile – profile method:
– Size of the database of precalculated profiles
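In a profile – profile alignment each cell of the dynamic programming table compares two columns instead of a column and a residue. The slides do not specify a scoring function; the sketch below assumes one simple choice, the dot product of the two residue-frequency vectors, and uses the hypothetical names column_score and score_matrix.

```python
def column_score(p, q):
    """Dot product of two profile columns given as dicts residue -> frequency."""
    return sum(freq * q.get(res, 0.0) for res, freq in p.items())


def score_matrix(profile_a, profile_b):
    """Score every column of profile_a against every column of profile_b."""
    return [[column_score(p, q) for q in profile_b] for p in profile_a]


profile_a = [{"A": 1.0}, {"C": 0.5, "D": 0.5}]
profile_b = [{"A": 0.5, "G": 0.5}, {"D": 1.0}]
print(score_matrix(profile_a, profile_b))
```

The resulting matrix can be used in place of a substitution matrix lookup in the usual alignment recursion. Every database entry now has to be a precalculated profile rather than a plain sequence, which is the storage problem listed above.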


