
Sequence databases

Protein sequence Database


UniProt
http://www.uniprot.org/
Database of protein sequences

UniProt is a resource of protein sequences and functional information.

The UniProt Knowledgebase (UniProtKB) consists of two sections:

• Swiss-Prot, which is manually annotated and reviewed.

• TrEMBL, which is automatically annotated and is not reviewed.

[Figure: sources of annotation for the UniProt Knowledgebase]
Protein Domain Family database
InterPro
http://www.ebi.ac.uk/interpro

Classifying proteins into families and identifying important domains and sites
is invaluable for helping biologists to identify distantly related proteins and to
predict their functions.

Proteins can be classified into different groups based on:

• the FAMILIES to which they belong

• the DOMAINS they contain

• the SEQUENCE FEATURES they possess


A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure.

Domains are distinct functional and/or structural units in a protein.

Usually they are responsible for a particular function or interaction, contributing to the overall role of a protein.

Domains may exist in a variety of biological contexts, where similar domains can be found in proteins with different functions.

[Figure: example domain architecture with SH3 and SH2 domains]

Sequence features are groups of amino acids that confer certain characteristics upon a protein, and may be important for its overall function.

[Figure labels: Domain, Repeats, Active site, Binding site, PTM site]

The following databases make up the InterPro Consortium:

• CATH/Gene3D at University College, London, UK


• PANTHER at University of Southern California, CA, USA
• PIRSF at the Protein Information Resource, Georgetown University Medical Centre,
Washington DC, USA
• Pfam at the Wellcome Trust Sanger Institute, Hinxton, UK
• PRINTS at the University of Manchester, UK
• ProDom at PRABI Villeurbanne, France
• PROSITE and HAMAP at the Swiss Institute of Bioinformatics (SIB), Geneva,
Switzerland
• SMART at EMBL, Heidelberg, Germany
• SUPERFAMILY at the University of Bristol, UK
• TIGRFAMs at the J. Craig Venter Institute, Rockville, MD, US
• MobiDB at University of Padua, Italy
Pfam
http://www.sanger.ac.uk/Software/Pfam

• Collection of multiple sequence alignments based on Hidden Markov Models (HMMs).
• Features over 1500 families.
• Includes sequences from Swiss-Prot and TrEMBL.

• Comprises two sections:
– Pfam-A, containing manually curated MSAs.
– Pfam-B (discontinued), in which sequences not included in Pfam-A are automatically clustered.
Sequence alignment
Why

Let us imagine comparing human hemoglobin to mouse hemoglobin:
– How can we do it?
– How do we quantify similarity?
– Which residues have been mutated?

Sequence alignment is the single most fundamental technique in bioinformatics.
Evolution
Molecular Evolution studies the changes occurring in DNA and in its products.
Final goal: structural alignment

[Figure: the segments TDEAF and TDKSF superimposed residue by residue]
Gaps

MCDQTKHSKCCPAK---GNQCCPP--TDEAF---QQNQCCQSKGNQCCPPKQNQCCQPKG-- TDEAF
M D +K ++CCP CCPP TD F Q++ CC + CCPPK + CC PK
MSDSSKTNQCCPTPCCPPKPCCPPKPTDKSFCCLQKSPCCPK--SPCCPPK-SPCCTPKVCP TDKSF

Indel: deletion or insertion
Sequence alignment

Pairwise sequence alignment quantifies the similarity between two sequences; from it, homology relationships, structural conformation and function can be inferred.

Evolution operates by consecutive mutations over time (point mutations, insertions, deletions, inversions). Reconstructing this sequence of events allows us to infer relationships between sequences.

To achieve this goal two elements are necessary:

1) An efficient algorithm, which should represent the actual similarity between the sequences as accurately as possible.

2) Similarity scoring criteria that tell us how good an alignment is and that guide the construction of the alignment.
Visualizing alignments
• Various software tools allow DNA and protein sequence alignments to be visualized.

• One of the most widespread free tools, used for example by EBI, which we will use during the practicals, is:

• Jalview (Java ALignment VIEWer)
URL: http://www.jalview.org/

• It comes in two flavours: a desktop version and a Java applet for browsers.
Alignment:
General concepts
Pair-wise alignment
(proteins or nucleic acids)

What is the meaning of «pair-wise alignment»?


To write two sequences horizontally, one above the other, so that as many identical (or similar) symbols as possible fall in the same columns, even if this requires introducing gaps (insertions/deletions, i.e. indels).

seq1: TCATG
seq2: CATTG

TCAT-G
.CATTG

4 identical characters
1 indel
1 non-aligned position
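The column counts of the example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `column_stats` and the convention that "." marks a non-aligned terminal position while "-" marks an indel are assumptions taken from the way the slide writes the alignment.

```python
def column_stats(top: str, bottom: str) -> dict:
    """Classify each column of a pairwise alignment written as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    stats = {"identical": 0, "mismatch": 0, "indel": 0, "unaligned": 0}
    for a, b in zip(top, bottom):
        if "." in (a, b):        # terminal position without a partner
            stats["unaligned"] += 1
        elif "-" in (a, b):      # gap inside the alignment
            stats["indel"] += 1
        elif a == b:
            stats["identical"] += 1
        else:
            stats["mismatch"] += 1
    return stats


print(column_stats("TCAT-G", ".CATTG"))
```

For the two rows above this reports 4 identical columns, 1 indel and 1 non-aligned position, as listed in the example.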
Alignment
An alignment requires three components: a similarity matrix, a gap cost and an algorithm (dynamic programming).

• Alignment algorithm
- Dynamic programming
- Local, global or semiglobal: the sequence alignment scheme
• Similarity matrix
- Contains a value for each substitution
- Various methods exist to build one, e.g. PAM and BLOSUM
• Gap cost
- Minimal model: a single value for both gap opening and gap extension
- However: evolution tends to cluster indels
Dynamic programming
Alignment starting from a matrix

[Figure: matrix with the sequence LAMIASEQUENZAALLINEASEMPREPERCHE on the horizontal axis and the sequence GQGPTCGLAMIASIGGTDPREPGKN on the vertical axis]

Each line connecting the top-left to the bottom-right corner represents a possible alignment.

The optimal alignment is found by building similarity matrices that associate a value with each possible amino acid pair and by developing algorithms that identify the highest-scoring paths.

LAMIASEQUENZAALLINEASEMPREPERCHE
GQGPTCGLAMIASIGGTD-------------PREPGKN

LAMIASEQUENZ-AALLINEASEMPREPERCHE
GQGPTCGLAMIASIGGTDPREPGKN
Dynamic programming

The alignment is computed in two steps:

1. Computation of the best solution in every box.
2. Backtracking: choice of the optimal path on the basis of the data computed in the boxes.

Global and local alignment differ in how the boxes are filled and in the choice of the starting point for backtracking.
Dynamic programming
• Gives the optimal alignment between two sequences.
• Simple variations of the algorithm produce global, local or semiglobal alignments:
– Global (Needleman & Wunsch, 1970)
– Local (Smith & Waterman, 1981)
• The alignment depends on the choice of some parameters.

• Based on recursion: each partial result depends on previously computed alignment scores stored in a table. The table is eventually used to build the optimal path.

• The alignment computation must be decomposable into a series of individual optimal steps.
– Bellman optimality principle (1957)
Global Alignment
Given two sequences x = x1x2x3...xn and y = y1y2y3...ym to be aligned.

Notation:
– xi – i-th element of the sequence x
– yj – j-th element of the sequence y
– x1..i – prefix of x from 1 to i
– F – optimal score matrix
• F(i,j) is the score of the optimal alignment of x1..i with y1..j
– d – gap penalty
– s – scoring matrix
Global Alignment

F(i,j) is obtained from one of three neighbouring cells:
• F(i-1,j-1) + s(xi,yj): move ahead on both sequences
• F(i-1,j) - d: xi aligned to a gap
• F(i,j-1) - d: yj aligned to a gap

While filling the table, you can keep track of the path, i.e. of which direction was taken (arrows pointing backwards).
Global Alignment

• Build F
• Initialize: F(0,0) = 0; F(i,0) = -d*i; F(0,j)= -d*j
• Fill the table from top-left to bottom-right corner using the recursive
relationship

\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
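As a worked instance of the recurrence, consider the first inner cell of the example that follows, where x1 = H, y1 = P, the BLOSUM50 score is s(H,P) = -2 and the gap penalty is d = 8. The three candidate values come from the initialized first row and column:

```latex
F(1,1) = \max \begin{cases}
  F(0,0) + s(x_1, y_1) = 0 - 2 = -2 \\
  F(0,1) - d = -8 - 8 = -16 \\
  F(1,0) - d = -8 - 8 = -16
\end{cases} = -2
```

The diagonal move wins, so the backward arrow of this cell points to F(0,0).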
Traceback
The backward path follows the arrows: it always starts from the last cell and, by definition, ends at the first (top-left) cell, F(0,0).

Shifts:
• Diagonal – both residues aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence
Example
The first step is filling the table, taking into account a substitution matrix (here BLOSUM50) and the gap penalty (-d = -8).

BLOSUM50 substitution matrix (excerpt):

     H    E    A    G    A    W    G    H    E    E
P   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A   -2   -1    5    0    5   -3    0   -2   -1   -1
W   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H   10    0   -2   -2   -2   -3   -2   10    0    0
E    0    6   -1   -3   -1   -3   -3    0    6    6
A   -2   -1    5    0    5   -3    0   -2   -1   -1
E    0    6   -1   -3   -1   -3   -3    0    6    6

The second step is the recursive compilation of the table:

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56
Table almost complete (BLOSUM50 substitution matrix, gap -d = -8):

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13
E  -40  -22   -8  -16
A  -48  -30  -16   -3
E  -56  -38  -24  -11

Table complete (BLOSUM50 substitution matrix, gap -d = -8):

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Traceback
Reverse path following the arrows: the path always starts from the last cell (score 1 in this example). By definition it gets back to the initial 0.

Shifts:
• Diagonal – both residues aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence

HEAGAWGHE-E
--P-AW-HEAE
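The fill and traceback steps of this example can be sketched in Python as follows. This is an illustrative implementation, not a reference one: the names `ROWS`, `s` and `needleman_wunsch` are hypothetical, the score table contains only the BLOSUM50 values shown in the example, and resolving traceback ties in the order diagonal, up, left is an assumption.

```python
# BLOSUM50 scores for the residue pairs of this example only
# (rows: residues of PAWHEAE, columns: residues of HEAGAWGHEE).
COLS = "HEAGW"
ROWS = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(y_res, x_res):
    return ROWS[y_res][COLS.index(x_res)]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap penalty d; x runs along columns, y along rows."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):
        F[0][c] = -d * c
    for r in range(1, m + 1):
        F[r][0] = -d * r
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            F[r][c] = max(
                F[r - 1][c - 1] + s(y[r - 1], x[c - 1]),  # both residues aligned
                F[r - 1][c] - d,                          # gap in the upper sequence x
                F[r][c - 1] - d,                          # gap in the lower sequence y
            )
    # Traceback from the last cell back to F[0][0].
    top, bottom = [], []
    r, c = m, n
    while r > 0 or c > 0:
        if r > 0 and c > 0 and F[r][c] == F[r - 1][c - 1] + s(y[r - 1], x[c - 1]):
            top.append(x[c - 1])
            bottom.append(y[r - 1])
            r, c = r - 1, c - 1
        elif r > 0 and F[r][c] == F[r - 1][c] - d:
            top.append("-")
            bottom.append(y[r - 1])
            r -= 1
        else:
            top.append(x[c - 1])
            bottom.append("-")
            c -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])
print(top)
print(bottom)
```

With these inputs the last cell holds the score 1 and the traceback yields the alignment shown above.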
Summary: global alignment

• Table initialization (-d*j and -d*i)

• Use of recursion to fill the intermediate cells of the table (dynamic programming)

• Traceback from the last cell F(n,m)

• Uses O(nm) space and time
– O(n^2) algorithm for sequences of similar length
– Feasible for short sequences but not for whole genomes

Local Alignment
• Smith-Waterman (1981)
• Another solution based on dynamic programming, very similar to (semi-)global alignment in the table filling.
• A "0" is introduced in the computation of the cell scores, so that no cell takes a negative value.

\[ F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases} \]
Example (BLOSUM50 substitution matrix, gap -d = -8):

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26
Traceback
Starts from the highest score in the table (28 in this example) and proceeds backwards to the first 0 met along the path.

AWGHE
AW-HE
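The local variant can be sketched in the same way. Again this is only a sketch under assumptions: `smith_waterman`, `s` and `ROWS` are hypothetical names, the score table holds only the BLOSUM50 values of the example, and ties in the traceback are resolved diagonal first.

```python
# BLOSUM50 scores for the residue pairs of this example only
# (rows: residues of PAWHEAE, columns: residues of HEAGAWGHEE).
COLS = "HEAGW"
ROWS = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(y_res, x_res):
    return ROWS[y_res][COLS.index(x_res)]


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap penalty d; returns (score, top, bottom)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay at 0
    best, best_cell = 0, (0, 0)
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            F[r][c] = max(
                0,                                        # restart the alignment here
                F[r - 1][c - 1] + s(y[r - 1], x[c - 1]),
                F[r - 1][c] - d,
                F[r][c - 1] - d,
            )
            if F[r][c] > best:
                best, best_cell = F[r][c], (r, c)
    # Traceback from the highest score to the first 0 along the path.
    top, bottom = [], []
    r, c = best_cell
    while F[r][c] > 0:
        if F[r][c] == F[r - 1][c - 1] + s(y[r - 1], x[c - 1]):
            top.append(x[c - 1])
            bottom.append(y[r - 1])
            r, c = r - 1, c - 1
        elif F[r][c] == F[r - 1][c] - d:
            top.append("-")
            bottom.append(y[r - 1])
            r -= 1
        else:
            top.append(x[c - 1])
            bottom.append("-")
            c -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

For the example sequences this returns the score 28 with AWGHE aligned to AW-HE.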
Summary: local alignment

• Table initialization (0)

• Use of recursion to fill the intermediate cells of the table (dynamic programming), where the value 0 can always be chosen

• Traceback from the cell with the highest value

• Uses O(nm) space and time
– O(n^2) algorithm for sequences of similar length
– Feasible for short sequences but not for whole genomes


Semiglobal Alignment
ANOTHER KIND OF ALIGNMENT

Semiglobal, freeshift or "glocal" alignment

It is used when a complete or partial overlap is expected.

The goal is a global alignment with no penalties for terminal overhangs.

[Figure: two overlapping sequences, with the overlap zone and the overhang zones marked]

The traceback starts from the highest score found in the last row or column and proceeds backwards through the table until the first row or column is reached.
The matrix is initialized (first row and column) with 0s, as in the "local" alignment.
The matrix is filled as in the "global" alignment.
Hence the name "semiglobal" alignment.

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A    0   -2   -3    4   -1    3   -4   -4   -4   -3   -2
W    0   -3   -5   -4    1   -4   18   10    2   -6   -6
H    0   10    2   -6   -6   -1   10   16   20   12    4
E    0    2   16    8    0   -7    2    8   16   26   18
A    0   -2    8   21   13    5   -3    2    8   18   25
E    0    0    4   13   18   12    4   -4    2   14   24
Resulting alignment:

HEAGAWGHEE
   PAW-HEAE
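A possible sketch of the semiglobal scheme in Python: the first row and column are initialized to 0, the cells are filled with the global recurrence, and the traceback starts from the best cell of the last row or column. The names `semiglobal`, `s` and `ROWS` are hypothetical, the score table contains only the BLOSUM50 values of the example, and only the overlapping part of the alignment is returned (the overhangs are left out).

```python
# BLOSUM50 scores for the residue pairs of this example only
# (rows: residues of PAWHEAE, columns: residues of HEAGAWGHEE).
COLS = "HEAGW"
ROWS = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(y_res, x_res):
    return ROWS[y_res][COLS.index(x_res)]


def semiglobal(x, y, d=8):
    """Overlap alignment with free terminal overhangs and linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay at 0
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            F[r][c] = max(
                F[r - 1][c - 1] + s(y[r - 1], x[c - 1]),
                F[r - 1][c] - d,
                F[r][c - 1] - d,
            )
    # Start the traceback from the best cell of the last row or last column.
    border = [(m, c) for c in range(n + 1)] + [(r, n) for r in range(m + 1)]
    r, c = max(border, key=lambda cell: F[cell[0]][cell[1]])
    score = F[r][c]
    top, bottom = [], []
    while r > 0 and c > 0:                      # stop at the first row or column
        if F[r][c] == F[r - 1][c - 1] + s(y[r - 1], x[c - 1]):
            top.append(x[c - 1])
            bottom.append(y[r - 1])
            r, c = r - 1, c - 1
        elif F[r][c] == F[r - 1][c] - d:
            top.append("-")
            bottom.append(y[r - 1])
            r -= 1
        else:
            top.append(x[c - 1])
            bottom.append("-")
            c -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


print(semiglobal("HEAGAWGHEE", "PAWHEAE"))
```

For the example sequences the best border cell has score 25, and the overlapping region pairs GAWGHEE with PAW-HEA; the final E of PAWHEAE is an unpenalized overhang.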
Summary: semiglobal alignment

• Table initialization (0)

• Use of recursion to fill the intermediate cells of the table (dynamic programming)

• Traceback from the highest score in the last row or column

• Uses O(nm) space and time
– O(n^2) algorithm for sequences of similar length
– Feasible for short sequences but not for whole genomes


Alignment:
Advanced techniques
Suboptimal alignments (Example II)

First local alignment:

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26

heagAWGHEe
   pAW-HEae
Suboptimal alignments (Example II)

Second local alignment:

          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    0    0    0    0    0    0
W    0    0    0    0    2    0    0    0    0    0    0
H    0   10    2    0    0    0    0    0    0    0    0
E    0    2   16    8    0    0    0    0    0    0    0
A    0    0    8   21   13    5    0    0    0    0    0
E    0    0    6   13   18   12    4    0    0    6    6

   HEAgawghee
pawHEAe
Suboptimal alignments

Sometimes it is useful to have more than one alignment for two sequences. The additional alignments are called suboptimal.
– E.g. to find regions of similarity not covered by the best local alignment.
– Repeated regions

They can be computed with small variations of the backtracking phase, which is then repeated to obtain different alignments, as long as their score stays above a threshold T (T >> 0 for meaningful alignments).

– In (Waterman & Eggert, 1987) cell values are set to zero.
• Previous example
– In (Vingron, 1996) cell values are decreased by a weight ε (ε can be additive or multiplicative).
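The zeroing idea can be sketched as a loop around a local alignment. This is a simplified illustration in the spirit of the first approach, not the published algorithm: the cells of every reported alignment are kept at zero and the whole table is recomputed before the next traceback. The names `local_alignments`, `s` and `ROWS` are hypothetical, and the score table holds only the BLOSUM50 values of the example.

```python
# BLOSUM50 scores for the residue pairs of the example only
# (rows: residues of PAWHEAE, columns: residues of HEAGAWGHEE).
COLS = "HEAGW"
ROWS = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(y_res, x_res):
    return ROWS[y_res][COLS.index(x_res)]


def local_alignments(x, y, d=8, threshold=20):
    """Report successive local alignments whose score is at least `threshold`."""
    n, m = len(x), len(y)
    used = set()     # cells of already reported alignments, kept at zero
    found = []
    while True:
        F = [[0] * (n + 1) for _ in range(m + 1)]
        best, cell = 0, None
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                if (r, c) in used:
                    continue
                F[r][c] = max(
                    0,
                    F[r - 1][c - 1] + s(y[r - 1], x[c - 1]),
                    F[r - 1][c] - d,
                    F[r][c - 1] - d,
                )
                if F[r][c] > best:
                    best, cell = F[r][c], (r, c)
        if cell is None or best < threshold:
            return found
        top, bottom = [], []
        r, c = cell
        while F[r][c] > 0:
            used.add((r, c))
            if F[r][c] == F[r - 1][c - 1] + s(y[r - 1], x[c - 1]):
                top.append(x[c - 1])
                bottom.append(y[r - 1])
                r, c = r - 1, c - 1
            elif F[r][c] == F[r - 1][c] - d:
                top.append("-")
                bottom.append(y[r - 1])
                r -= 1
            else:
                top.append(x[c - 1])
                bottom.append("-")
                c -= 1
        found.append((best, "".join(reversed(top)), "".join(reversed(bottom))))


for hit in local_alignments("HEAGAWGHEE", "PAWHEAE"):
    print(hit)
```

With the assumed threshold T = 20 the loop reports AWGHE/AW-HE (score 28) and then HEA/HEA (score 21), the two alignments of Example II, and stops because no remaining cell reaches the threshold.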
Similarity Matrices
The 20 amino acids
Similarity matrices
For nucleic acids, a pair of aligned bases is scored 1 or 0 (identity or non-identity).

Proteins have 20 amino acids, and single substitutions do not all have the same weight.
• Substitutions of Serine (S) with Threonine (T) or of Glutamate (E) with Aspartate (D) are usually tolerated by proteins, given the similarity of these amino acids.

Similarity matrices are tables associating a similarity value with each substitution.

Matrices can be based on the physico-chemical properties of the residues.

However, the most common matrices are based on statistical methods that measure the substitution frequency among amino acids in homologous protein families.
Similarity matrices
Let p_a be the probability of finding residue a:

\[ \sum_x p_x = 1 \]

The probability of a random substitution between two residues is computed as the probability of two independent events:

\[ C(a,b) = p_a \times p_b \]

The probability of a substitution between amino acids a and b (match) is observed on a set of homologous sequences:

\[ M(a,b) = p_{a,b}, \qquad \sum_x p_{a,x} = 1, \qquad p_{a,b} = p_{b,a} \]

Example: HEAGawghee
pawHEAE
Similarity matrices

The relationship between match and random can be expressed as a "likelihood ratio" or "odds ratio":

\[ \frac{M(a,b)}{C(a,b)} = \frac{p_{a,b}}{p_a \times p_b} \]

To make the odds ratio more stable and to obtain scores that can be added along an alignment, we take the logarithm (log-odds ratio):

\[ s(a,b) = \log\left( \frac{M(a,b)}{C(a,b)} \right) \]

Typically, values are multiplied by ten and only the integer part is kept.
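The log-odds computation can be tried on made-up numbers. In this sketch the function name `log_odds_score` is hypothetical, the base-10 logarithm is an assumption (the base is not stated above), and the probabilities in the usage lines are invented for illustration, not taken from any real matrix.

```python
import math


def log_odds_score(p_ab: float, p_a: float, p_b: float, scale: float = 10.0) -> int:
    """Scaled log-odds score: integer part of scale * log10(M(a,b) / C(a,b))."""
    odds = p_ab / (p_a * p_b)            # M(a,b) / C(a,b)
    return int(scale * math.log10(odds)) # int() keeps only the integer part


# Invented probabilities: both residues have background frequency 0.05.
print(log_odds_score(0.005, 0.05, 0.05))     # pair seen twice as often as by chance
print(log_odds_score(0.00125, 0.05, 0.05))   # pair seen half as often as by chance
```

A pair observed more often than expected by chance gets a positive score, a pair observed less often gets a negative one.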

Similarity matrices

Meaning of the log-odds score:

• s = 0: random
• s < 0: less probable than random (unfavourable)
• s > 0: more probable than random (favourable)

The score of an alignment with n aligned pairs can be written as:

\[ S = \sum_{i=1}^{n} s(a_i, b_i) \]
BLOSUM matrices
(Henikoff & Henikoff, 1992)

• Blocks Amino Acid Substitution Matrices = BLOSUM
• Substitutions observed in ~2,000 blocks of conserved sequences.
• Extracted from a database of 500 protein families.
• Counts the substitutions observed in each column.
BLOSUM matrices

As for PAMs, the matrix is computed from observed substitutions:
• M(a,b) is computed from alignments of protein families from the BLOCKS database.

The BLOCKS database holds alignments of sequences grouped by a sequence identity threshold P varying from 35% to 95%.
• BLOCKS 50 holds aligned sequences with <= 50% sequence identity
Affine Gap Costs
Alignments: Components
• Similarity matrix
- Gives a score for each substitution
- Different building methods exist, e.g. PAM and BLOSUM
• Alignment algorithm
- Local, global or freeshift: defines the sequence alignment scheme

• Gap costs
- We have seen a minimal model with a single gap open/extension cost
- But: evolution tends to concentrate indels in a few points
- So: affine gap costs

Non-biological solutions are discouraged, e.g. the alignment on the left should get a lower score than the one on the right:

Seq 1 MNALSDR---T        Seq 1 MNALSDRT---
Seq 2 M-G-SDRTTET        Seq 2 MG--SDRTTET
Affine gap costs
Two parameters are used for gaps:
• gap open (γ) = cost of opening the first gap position of an indel.
• gap extension (δ) = cost of extending an already existing indel; of course γ > δ.
• typical costs are: gap open = 10, gap extension = 2
• exact values can differ

Gap penalty: d = γ + δ * (length – 1), where length is the number of gap positions in the indel
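The effect of the affine penalty can be checked on the two candidate alignments of MNALSDRT and MGSDRTTET shown earlier. This is a small sketch: the function names are hypothetical, the costs are the typical values quoted above (gap open = 10, gap extension = 2), and the linear cost of 8 per gap position is the value used in the dynamic programming examples.

```python
import re


def affine_gap_cost(row: str, gap_open: int = 10, gap_ext: int = 2) -> int:
    """Sum of gap_open + gap_ext * (length - 1) over every run of '-' in one aligned row."""
    return sum(gap_open + gap_ext * (len(run) - 1) for run in re.findall(r"-+", row))


def linear_gap_cost(row: str, d: int = 8) -> int:
    """Minimal model: every gap position costs d."""
    return d * row.count("-")


left = ("MNALSDR---T", "M-G-SDRTTET")    # three separate indels
right = ("MNALSDRT---", "MG--SDRTTET")   # two indels

for name, rows in (("left", left), ("right", right)):
    affine = sum(affine_gap_cost(r) for r in rows)
    linear = sum(linear_gap_cost(r) for r in rows)
    print(name, "affine:", affine, "linear:", linear)
```

Both alignments contain five gap positions, so the linear model cannot tell them apart (40 each), while the affine model charges 34 for the scattered gaps on the left and 26 for the grouped gaps on the right.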

To fill the alignment matrix with affine gap costs, a more complex procedure is required:
• the gap open cost has to be used only for the first position of a gap; afterwards the gap extension cost has to be used.
• for each element of the matrix, the optimal gap value needs to be saved for both the horizontal and the vertical direction.
• details can be found in Biological Sequence Analysis by Durbin et al. (Cambridge University Press)
• this computation is not part of the final exam
Alignments: differences
Differences between global and local alignments
Global alignment imposes an alignment that involves all the residues of the two sequences, regardless of their similarity.

Local alignment, on the other hand, aligns only the most similar regions of the two sequences.
Semiglobal alignment tries to combine both methods.

[Figure: matrix with the sequence LAMIASEQUENZAALLINEASEMPREPERCHE on the horizontal axis and the sequence GQGPTCGLAMIASIGGTDPREPGKN on the vertical axis]
GLOBAL

D L G P S S K Q T G K G S S M D I W D N G M
D - I - - T K S A G K G A I M R L - - E - M

SEMIGLOBAL / FREESHIFT

D L G P S S K Q T G K G S S M D I W D N G M
- - - D I T K S A G K G A I M R L E M - - -

LOCAL

- - - - - - - - - G K G - - - - - - - - - -
- - - - - - - - - G K G - - - - - - - - - -
Alignment and domains
[Figure: domain diagrams labelled A, B, A=C, B' and C, with the panels "Global alignment" and "Local alignment"]

But: a local alignment might lose terminal residues (e.g. divergent repeats)…
Hence a third method, semiglobal (or freeshift or glocal), tries to combine the advantages of the other two.
Multiple sequence alignments
Multiple alignments
• Demonstrate homology
• Molecular phylogeny
• Structural prediction
• Functional prediction
• Identification of functionally important sites.

• Generalizing the algorithms that find the optimal alignment between two sequences to three or more sequences at the same time is problematic.
If L is the length of the sequences, it would take O(L^N) units of time to align N sequences. This is unfeasible.

• Hence the use of heuristic or progressive methods, based on the hypothesis that the sequences to be aligned are phylogenetically related.
CLUSTALW
(Higgins & Sharp, 1988)

1. Pairwise alignment of all the starting sequences with:
– Approximate methods (n-tuple)
– Dynamic programming algorithm by Myers & Miller, 1988

2. Scoring of the alignments, used to build the phylogenetic tree (neighbour-joining, NJ).

3. Progressive alignment of the sequences according to the tree order (most similar sequences come first).
CLUSTALW
Progressive methods: disadvantages
• Once an alignment is fixed, it is not modified in the subsequent steps. In particular, the gap locations cannot change (once a gap, always a gap).

• Initial errors are propagated to the subsequent steps. If an error is introduced in the initial alignment, it cannot be corrected; on the contrary, it gets "fixed".

• Initial phylogenetic trees are derived from distance matrices between pairs of independently aligned sequences. These are less reliable than phylogenetic trees derived from complete multiple sequence alignments.

• Alignment errors depend on sequence similarity. Care must be taken to select input sequences that are real homologs and of comparable length, to avoid the insertion of too many gaps.

• If the sequences are too divergent (< 25-30% sequence identity), progressive methods become unreliable.
CLUSTALW alignment for GPx
Multiple alignments: more methods
Many methods have been published since CLUSTALW.

– CLUSTAL-OMEGA has replaced CLUSTALW,
• URL: http://www.ebi.ac.uk/Tools/msa/clustalo/

– T-COFFEE,
• URL: http://www.tcoffee.org/

– MAFFT,
• URL: http://www.ebi.ac.uk/Tools/msa/mafft/

–…
Similarity searches in databases
One of the most common problems in bioinformatics is finding homologous sequences by querying databases.

The main idea is that homologous proteins have a common ancestor and therefore share extensive regions of similarity.
By comparing our query sequence with all the sequences contained in a database, it is possible to estimate the percentage of similarity and from this to infer possible homology.
Similarity searches in databases
For very similar sequences homology is obvious. In most cases, however, low levels of similarity have to be dealt with.
Real functional homologs can have low similarity levels.
• It is not trivial to distinguish real from false homologs.

When functional information is not available, the choice is made on statistical grounds.
• On the basis of similarity, a score is given to each sequence alignment;
• Scores are evaluated by computing the probability of getting the same score by chance.
• The lower the probability, the more significant the alignment.

There are two problems to solve:
1. The development of algorithms able to identify sequences similar to the query among millions of different target sequences, and
2. The choice of the statistical methods used to define which hits are significant.

The main tools for database querying, such as FASTA and BLAST, differ basically in their approach to these problems.
BLAST
BLAST (Basic Local Alignment Search Tool) is a program that looks for local sequence similarity using the algorithm by Altschul et al., 1990. Like FASTA, BLAST works in steps:

1. The query sequence is decomposed into words of a few amino acids, usually 2 or 3 (W parameter), and, unlike in FASTA, a list of similar (neighbourhood) words is generated using the BLOSUM substitution matrix. Similar words must have a score greater than a fixed threshold T.

2. The similar words are searched in the sequence database looking for exact matches; once found, the matches are extended to the right and to the left of the alignment up to a certain depth defined by the X parameter. The pairs of segments having a statistically significant similarity score greater than a threshold S are called HSPs (High-scoring Segment Pairs).

3. The same pair of sequences may contain more than one HSP, for which it is possible to compute the occurrence probability (Karlin & Altschul, 1993).

W = word size        X = elongation
T = threshold        S = HSP threshold
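Step 1 can be illustrated with a toy word generator. This sketch is deliberately simplified: the function names are hypothetical, the alphabet is reduced to four letters, and a made-up scoring scheme (+5 for identity, -1 otherwise) stands in for the BLOSUM matrix, so the word lists are not those BLAST would produce.

```python
from itertools import product


def query_words(query: str, W: int = 3) -> list:
    """All overlapping words of length W in the query."""
    return [query[i:i + W] for i in range(len(query) - W + 1)]


def toy_score(a: str, b: str) -> int:
    """Made-up substitution score standing in for a real matrix."""
    return 5 if a == b else -1


def neighbourhood(word: str, alphabet: str, T: int) -> list:
    """Words over `alphabet` whose score against `word` is at least T."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, candidate)) >= T
    ]


print(query_words("HEAG"))
print(neighbourhood("HEA", alphabet="AEGH", T=9))
```

With this toy scoring and T = 9, the neighbourhood of HEA contains the word itself plus every word that differs from it at exactly one position.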
BLAST
two-hit method
Current BLAST versions use the two-hit method, which comes from the observation that the execution time of the algorithm is mainly due to the extension of the hits into HSPs.

The algorithm therefore extends a hit into an HSP only when two hits lie on the same diagonal within a distance A of each other.

To avoid losing sensitivity, the T threshold has been lowered.

The algorithm is faster and does not lose precision.

Moreover, in the current implementation, BLAST considers gaps when trying to merge ungapped HSPs that are close to each other in the alignment matrix. They are merged into a single fragment (containing insertions and deletions) when this improves the final score.

New parameters regulate the costs and penalties for the presence of gaps in the alignment.

An MSP (Maximal-scoring Segment Pair) is a pair of segments of equal length that gets the highest similarity score in the comparison of two sequences; its statistical significance can be evaluated rigorously (Karlin & Altschul 1990, 1993).
E-value E(S)

The E-value is the number of different alignments with a score X equal to or better than the score S of my alignment that are expected to occur by chance in a database search (that is, false positives). The lower the value, the more significant the alignment.

\[ E(X \ge S) \rightarrow E(S) = K m n \, e^{-\lambda S} \]

K and λ depend on the database and on its size; m and n are the lengths of the two sequences. S is the score of my alignment.

[Figure: extreme value distribution, density Yev plotted against X]
E-value E(S)
Different algorithms differ considerably in how they define a random sequence.
BLAST estimates a priori the significance of a given score on the basis of the size and composition of the database, applying:

\[ E(S) = K m n \, e^{-\lambda S} \]

where m is the length of the query sequence and n is the length of the subject sequence from the database. Unlike in FASTA, λ and K are precomputed according to an internal standard distribution. The final score is similar to the FASTA one.

The significance of a result is expressed as an expectation value E(S).

The lower the value of E, the more significant the alignment. A value of 1.0e-5, for example, means that 1.0e-5 alignments with a score equal to or better than S are expected by chance; in other words, we expect to observe such an alignment by chance about once every 100,000 searches.
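The formula is easy to explore numerically. In the sketch below the function name `e_value` is hypothetical and the values of K and λ are illustrative placeholders, not the parameters BLAST actually uses.

```python
import math


def e_value(S: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of chance alignments with score at least S: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)


# Illustrative placeholder parameters (assumptions, not BLAST defaults).
K, lam = 0.1, 0.3
for S in (40, 60, 80):
    print(S, e_value(S, m=250, n=1_000_000, K=K, lam=lam))
```

The E-value drops exponentially as the score grows and scales linearly with the search space m·n, so the same score is less significant in a larger database.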
Bit-score
The bit-score makes searches in databases of different sizes directly comparable: unlike the raw score, it is normalized with respect to the λ and K parameters. The bit-score S' is computed from the raw score S as:

\[ S' = \frac{\lambda S - \ln K}{\ln 2} \]

from which the E-value can be computed:

\[ E = m n \, 2^{-S'} \]

Given S', E depends only on the sequence lengths m and n.
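That the two routes to the E-value agree can be verified numerically. The function names are hypothetical and the values of K and λ are illustrative placeholders.

```python
import math


def bit_score(S: float, K: float, lam: float) -> float:
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def e_from_bits(bits: float, m: int, n: int) -> float:
    """E-value from the bit-score: m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative placeholder parameters (assumptions, not BLAST defaults).
K, lam, S, m, n = 0.1, 0.3, 60, 250, 1_000_000
bits = bit_score(S, K, lam)
print(bits)
print(e_from_bits(bits, m, n))
print(K * m * n * math.exp(-lam * S))   # same value from the raw-score formula
```

The last two printed values coincide up to floating-point rounding, because 2^(-S') equals K·e^(-λS) by construction.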
[Figure: annotated BLAST output showing the DB accession number, the name, the similarity values, the query and subject lines with their residue numbers, and the middle line marking identity (letter) and similarity (+)]
Profiles
Frequency matrix
Matrices used to show the frequency of each residue in a multiple sequence
alignment
• how many times a residue is found at a certain position in the alignment
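[Figure: example of a frequency matrix, with the alignment positions on one axis, the amino acids (AA) on the other, and the frequencies as entries]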
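A frequency matrix of this kind can be built in a few lines. The sketch below uses a made-up toy alignment; the function name `frequency_matrix` and the choice to ignore gap characters are assumptions.

```python
from collections import Counter


def frequency_matrix(alignment: list) -> list:
    """One Counter per alignment column, counting how often each residue occurs (gaps ignored)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        Counter(row[i] for row in alignment if row[i] != "-")
        for i in range(length)
    ]


# Toy alignment invented for illustration.
toy_alignment = ["MKV", "MRV", "LKV", "M-I"]
for position, counts in enumerate(frequency_matrix(toy_alignment), start=1):
    print(position, dict(counts))
```

Each column is summarized by how many times every residue is found at that position.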
Weighted matrices (PSSM) or profiles
(Gribskov)
Question: how do we deal with "similar" amino acids never observed at a certain position?
We can use weighted matrices (PSSM, position-specific scoring matrix), which take into account the "weight" of the amino acids, that is their "substitution likelihood" calculated with PAM or BLOSUM substitution matrices.

Multiple alignment → frequency matrix ("observed" similarity)
Substitution matrix ("general" similarity)
Frequency matrix + substitution matrix → profile (PSSM) ("weighted" similarity)
In the third column of this weighted matrix we can notice that the amino acid A, observed only once, has a lower score (-1) than the amino acid M (+10), which is never observed.

M is more similar to the other amino acids found at this position in the other sequences, such as L, I, V and F.

We now have a complete profile of the multiple sequence alignment, which encodes the probability of each amino acid at every position.
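The weighting idea can be sketched for a single alignment column: the profile score of an amino acid is the average of its substitution scores against the residues observed in the column, weighted by their frequencies. The function names, the five-letter alphabet, the toy column and the substitution values below are all invented for illustration and are not taken from an actual PAM or BLOSUM matrix.

```python
from collections import Counter

# Made-up symmetric substitution scores for a five-letter alphabet (illustrative only).
TOY_SCORES = {
    ("A", "A"): 4, ("A", "I"): -1, ("A", "L"): -1, ("A", "M"): -1, ("A", "V"): 0,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "M"): 1, ("I", "V"): 3,
    ("L", "L"): 4, ("L", "M"): 2, ("L", "V"): 1,
    ("M", "M"): 5, ("M", "V"): 1,
    ("V", "V"): 4,
}


def score(a: str, b: str) -> int:
    """Look up a pair in either order."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def profile_column(column: str, alphabet: str = "AILMV") -> dict:
    """Frequency-weighted substitution score of every amino acid for one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {
        a: sum(counts[b] / total * score(a, b) for b in counts)
        for a in alphabet
    }


# Toy column: L twice, I, V and A once each; M is never observed.
weights = profile_column("LIVLA")
print(weights["M"], weights["A"])
```

In this toy column M is never observed, yet it scores higher than A, which is observed once, because M is similar to the L, I and V residues that dominate the column.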
PSI-BLAST
Position-Specific Iterated BLAST
It uses an iterative procedure in which every sequence scoring above a fixed threshold is used to build a model known as a PSSM (Position-Specific Scoring Matrix), which is used in the following iterations to detect evolutionarily distant sequences.

The PSSM is the product of the substitution matrix and the frequency matrix computed from the alignment of the query sequence with the hits that exceed the fixed threshold (profile).

To find new hits, the same search procedure is then used with an L x 20 matrix, where L is the length of the query sequence.

The PSSM calculation requires down-weighting of redundant or over-represented sequences, to reduce the risk of biasing the matrix.

The gap penalty is the same in each iteration.


PSI-BLAST procedure

1) Input sequence.
2) BLAST search.
3) Creation of a PSSM from a multiple alignment of all the hits that exceed a fixed threshold.
4) Search with the obtained PSSM (profile compared with database sequences).
5) Creation of a profile-based pairwise alignment between the input sequence and the new hits.
6) Creation of a new PSSM if there are new hits that exceed the threshold.
The score of some sequences that were not above the threshold may increase and exceed it, because they may be distant homologs of a gene family that now share only a few conserved amino acids (e.g. those in an active site), which increase the score.
7) Repeat steps 4 to 6.
8) The process ends when convergence is reached, in other words when no other sequences exceed the threshold.
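The control flow of the procedure can be summarized as a loop. The sketch below is only a skeleton of the iteration: `psi_blast_loop`, `search` and `build_pssm` are hypothetical names, and the two callables stand in for the actual database search and profile construction, which are not implemented here.

```python
def psi_blast_loop(query, search, build_pssm, max_iterations=5):
    """Iterate profile searches until no new hits exceed the threshold.

    `search(model)` must return the hits above threshold for a query or a PSSM;
    `build_pssm(query, hits)` must return a profile built from the current hits.
    Both are placeholders supplied by the caller.
    """
    hits = set(search(query))                  # step 2: first search with the plain query
    for _ in range(max_iterations):
        pssm = build_pssm(query, hits)         # steps 3 and 6: profile from the current hits
        new_hits = set(search(pssm)) - hits    # step 4: search with the profile
        if not new_hits:                       # step 8: convergence, nothing new above threshold
            break
        hits |= new_hits                       # step 7: repeat with the enlarged hit set
    return hits


# Toy stand-ins: the "database" maps each model to the hits it would retrieve.
toy_database = {
    "query": {"hit1"},
    frozenset({"hit1"}): {"hit1", "hit2"},
    frozenset({"hit1", "hit2"}): {"hit1", "hit2"},
}
print(psi_blast_loop("query", toy_database.get, lambda q, hits: frozenset(hits)))
```

Capping the number of iterations is a design choice of this sketch; it also limits how far the profile can move away from the original query.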
PSI-BLAST: Drift problems
We generally carry out 4-6 iterations in order to avoid drift (profile wander) events: the input sequence can be lost during the iterations if there is a large protein family similar to this sequence (crowding out).

[Figure: example in which a single sequence A is surrounded by many sequences B; sequence A has been removed from the PSSM]

To avoid these issues we can use databases in which each protein has at most X% (50 <= X < 100) similarity with each of the other sequences.
• e.g. NR90 contains all the known sequences with at most 90% similarity to each other.
• one of the most popular programs for clustering sequences is CD-HIT (Li et al., 2001)
Beyond PSI-BLAST…
Given a protein family, how can we capture the information in the multiple sequence alignment in order to look for other sequences that are still unknown?
• the most common alignment methods, even those that use profiles, do not evaluate the positions of indels.

Idea: creation of an HMM (Hidden Markov Model) that best represents the reality.
1YEA AKESTGFKPGSAKKGATLFKTRCQQCHTIEE-------GGPNKVGPNLHGIFGRHSGQVK
1YCC ----TEFKAGSAKKGATLFKTRCLQCHTVEK-------GGPHKVGPNLHGIFGRHSGQAE
2PCBB ---------GDVEKGKKIFVQKCAQCHTVEK-------GGKHKTGPNLHGLFGRKTGQAP
5CYTR ---------GDVAKGKKTFVQKCAQCHTVEN-------GGKHKVGPNLWGLFGRKTGQAE
1CCR -ASFSEAPPGNPKAGEKIFKTKCAQCHTVDK-------GAGHKQGPNLNGLFGRQSGTTP
1CRY ---------QDAASGEQVFK-QCLVCHSIGP-------GAKNKVGPVLNGLFGRHSGTIE
1HROA -----SAPPGDPVEGKHLFHTICITCHTDIK-------G-ANKVGPSLYGVVGRHSGIEP
1CXC -------QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQA
1C2RA ---------GDAAKGEKEFN-KCKTCHSIIAPDGTEIVKG-AKTGPNLYGVVGRTAGTYP
155C -------NEGDAAKGEKEFN-KCKACHMIQAPD-GTDIKG-GKTGPNLYGVVGRKIASEE
2C2C --------EGDAAAGEKVSK-KCLACHTFDQ-------GGANKVGPNLFGVFENTAAHKD
2mtac -----APQFFNIIDGSPLNFDD-----AMEEGRDTEAVKHFLETGENVYNEDPEILPEAE
. * : * : . .

1YEA GYS-YTDANINK-----NVKWDEDSMSEYLTNPKKYIP--------GTKMAFAGLKKEKD
1YCC GYS-YTDANIKK-----NVLWDENNMSEYLTNPKKYIP--------GTKMAFGGLKKEKD
2PCBB GFT-YTDANKNK-----GITWKEETLMEYLENPKKYIP--------GTKMIFAGIKKKTE
5CYTR GYS-YTDANKSK-----GIVWNNDTLMEYLENPKKYIP--------GTKMIFAGIKKKGE
1CCR GYS-YSTADKNM-----AVIWEENTLYDYLLNPKKYIP--------GTKMVFPGLKKPQE
1CRY GFA-YSDANKNS-----GITWTEEVFREYIRDPKAKIP--------GTKMIFAGVKDEQK
1HROA GYN-YSEANIKS-----GIVWTPDVLFKYIEHPQKIVP--------GTKMGYPGQPDPQK
1CXC DFKGYGEGMKEAGAK--GLAWDEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEAD
1C2RA EFK-YKDSIVALGAS--GFAWTEEDIATYVKDPGAFLKEKLDDKKAKTGMAF-KLAK--G
155C GFK-YGEGILEVAEKNPDLTWTEANLIEYVTDPKPLVKKMTDDKGAKTKMTF-KMGK--N
2C2C NYA-YSESYTEMKAK--GLTWTEANLAAYVKNPKAFVLEKSGDPKAKSKMTF-KLTKDDE
2mtac EL--YAGMCSGCHGHYAEGKIGPGLNDAYWTYPGNETDVGLFSTLYGG--ATGQMGPMWG
* * *
HMM – Hidden Markov Model
An HMM represents a generalization of the profile concept.
• AA substitution probabilities are different for each position (PSSM).
• Insertion and deletion probabilities are different at each position.

• HMMs are used, e.g., in the Pfam database of protein domains.

• The most popular program is HMMER (Eddy, 1995).

Mj = j-th column of the PSSM
Ij = insertion probability at j
Dj = deletion probability at j
Profile – profile alignment

PSI-BLAST extended the similarity search from sequence-sequence alignment to profile-sequence alignment.
– This considerably improved the performance for sequences of the same protein family with a low degree of conservation.

The performance can be improved further with profile-profile alignment.
– This can be achieved with the latest generation of alignment methods.
– It is really important for structure prediction in cases of remote homology.

There are technical problems that (until now!) prevent the replacement of PSI-BLAST with a large-scale profile-profile method:
– Size of the database of precalculated profiles
