
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

Distance Matrix Methods

Anders Gorm Pedersen


Molecular Evolution Group
Center for Biological Sequence Analysis
Technical University of Denmark
gorm@cbs.dtu.dk
Distance Matrix Methods

1. Construct multiple alignment of sequences

   Gorilla   : ACGTCGTA
   Human     : ACGTTCCT
   Chimpanzee: ACGTTTCG

2. Construct table listing all pairwise differences (distance matrix)

        Go   Hu   Ch
   Go   -    4    4
   Hu        -    2
   Ch             -

3. Construct tree from pairwise distances

   [Figure: tree in which Ch and Hu each sit on a branch of length 1, joined by an internal branch of length 1 to Go on a branch of length 2]
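Step 2 can be sketched in a few lines of Python. This is only an illustration of counting differing positions; the function names `count_differences` and `distance_matrix` are made up for this example and are not part of the slides.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def distance_matrix(alignment):
    """Map each unordered pair of names to its observed difference count."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# The three aligned sequences from the slide.
alignment = {
    "Go": "ACGTCGTA",
    "Hu": "ACGTTCCT",
    "Ch": "ACGTTTCG",
}
matrix = distance_matrix(alignment)
```

Here `matrix[("Go", "Hu")]`, `matrix[("Go", "Ch")]` and `matrix[("Hu", "Ch")]` hold the three entries of the table in step 2.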
Finding optimal branch lengths

Observed distances:

        S1   S2   S3   S4
   S1   -    D12  D13  D14
   S2        -    D23  D24
   S3             -    D34
   S4                  -

[Figure: four-taxon tree; S1 (branch a) and S3 (branch d) on one side of the internal branch b, S2 (branch c) and S4 (branch e) on the other]

Goal: observed distance ≈ distance along tree (patristic distance)

   D12 ≈ d12 = a + b + c
   D13 ≈ d13 = a + d
   D14 ≈ d14 = a + b + e
   D23 ≈ d23 = d + b + c
   D24 ≈ d24 = c + e
   D34 ≈ d34 = d + b + e
Exercise (handout)

• Construct distance matrix (count different positions)



• Reconstruct tree and find best set of branch lengths


Optimal Branch Lengths for a Given Tree: Least Squares

[Figure: the four-taxon tree from the previous slide, with patristic distances d12 to d34 given by sums of the branch lengths a to e]

• Fit between given tree and observed distances can be expressed as "sum of squared differences":

   Q = \sum_{j>i} (D_{ij} - d_{ij})^2

• Find branch lengths that minimize Q. This is the optimal set of branch lengths for this tree.
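As a rough sketch of how Q is evaluated for one candidate set of branch lengths on the four-taxon tree, the snippet below hard-codes the six path sums from the slide. The observed distances and the two candidate sets of branch lengths are hypothetical numbers chosen only for illustration.

```python
# Pair order used throughout: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)


def patristic_distances(a, b, c, d, e):
    """Distances along the slide's tree (S1, S3 on one side of b; S2, S4 on the other)."""
    return [a + b + c, a + d, a + b + e, d + b + c, c + e, d + b + e]


def sum_of_squares(observed, fitted):
    """Q: sum over all pairs of (observed - tree distance) squared."""
    return sum((D - t) ** 2 for D, t in zip(observed, fitted))


# Hypothetical observed distances D12, D13, D14, D23, D24, D34.
observed = [21, 17, 27, 12, 14, 18]

# Two hypothetical candidate sets of branch lengths for the same tree.
q_good = sum_of_squares(observed, patristic_distances(a=13, b=4, c=4, d=4, e=10))
q_poor = sum_of_squares(observed, patristic_distances(a=10, b=4, c=4, d=4, e=10))
```

The candidate with the smaller Q fits the observed distances better; finding the minimum over all branch lengths is the least squares problem described on the slide.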
Optimal Branch Lengths: Least Squares

[Figure: the same four-taxon tree, with patristic distances d12 to d34 given by sums of the branch lengths a to e]

• Longer distances are associated with larger errors

• Squared deviations may be weighted so that longer distances contribute less to Q:

   Q = \sum_{j>i} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^n}

• Power (n) is typically 1 or 2
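The weighting can be illustrated with a small function. This is a sketch under the assumption that each squared deviation is divided by the observed distance raised to the power n; the function name and the example numbers are hypothetical.

```python
def weighted_sum_of_squares(observed, fitted, power=2):
    """Q with each squared deviation divided by the observed distance to the given power.

    power=0 gives the unweighted sum of squares.
    """
    total = 0.0
    for D, t in zip(observed, fitted):
        if D <= 0:
            raise ValueError("observed distances must be positive to be used as weights")
        total += (D - t) ** 2 / D ** power
    return total


# Hypothetical example: both pairs deviate by 2, but the second distance is longer.
observed = [10.0, 20.0]
fitted = [8.0, 18.0]

q_unweighted = weighted_sum_of_squares(observed, fitted, power=0)
q_power_1 = weighted_sum_of_squares(observed, fitted, power=1)
q_power_2 = weighted_sum_of_squares(observed, fitted, power=2)
```

With power 1 or 2 the same absolute deviation counts for less on the longer distance than on the shorter one.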
Finding Optimal Branch Lengths

Observed distances:

        A    B    C
   A    -    DAB  DAC
   B         -    DBC
   C              -

[Figure: three-taxon tree; branches v1 (to A), v2 (to B) and v3 (to C) meet at a single internal node]

Goal: observed distance ≈ distance along tree

   DAB ≈ dAB = v1 + v2
   DAC ≈ dAC = v1 + v3
   DBC ≈ dBC = v2 + v3

Sum of squared differences:

   Q = (D_{AB} - d_{AB})^2 + (D_{AC} - d_{AC})^2 + (D_{BC} - d_{BC})^2

Expand the squares:

   Q = D_{AB}^2 + d_{AB}^2 - 2 D_{AB} d_{AB}
     + D_{AC}^2 + d_{AC}^2 - 2 D_{AC} d_{AC}
     + D_{BC}^2 + d_{BC}^2 - 2 D_{BC} d_{BC}

Substitute the branch lengths:

   Q = D_{AB}^2 + (v_1 + v_2)^2 - 2 D_{AB} (v_1 + v_2)
     + D_{AC}^2 + (v_1 + v_3)^2 - 2 D_{AC} (v_1 + v_3)
     + D_{BC}^2 + (v_2 + v_3)^2 - 2 D_{BC} (v_2 + v_3)

Expand again:

   Q = D_{AB}^2 + v_1^2 + v_2^2 + 2 v_1 v_2 - 2 D_{AB} v_1 - 2 D_{AB} v_2
     + D_{AC}^2 + v_1^2 + v_3^2 + 2 v_1 v_3 - 2 D_{AC} v_1 - 2 D_{AC} v_3
     + D_{BC}^2 + v_2^2 + v_3^2 + 2 v_2 v_3 - 2 D_{BC} v_2 - 2 D_{BC} v_3

Collect the terms in v_1:

   Q = 2 v_1^2
     + (2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC}) v_1
     + 2 v_2^2 + 2 v_3^2 + 2 v_2 v_3 - 2 D_{AB} v_2 - 2 D_{AC} v_3 - 2 D_{BC} v_2 - 2 D_{BC} v_3 + D_{AB}^2 + D_{AC}^2 + D_{BC}^2

   Q = 2 v_1^2 + C_1 v_1 + C_2

Minimum: \frac{dQ}{dv_1} = 0

   \frac{dQ}{dv_1} = 4 v_1 + C_1 = 4 v_1 + 2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC}

   4 v_1 + 2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC} = 0
Finding Optimal Branch Lengths

• Setting the derivative of Q with respect to each branch length to zero gives a system of n linear equations with n unknowns

• Can be solved using substitution method or matrix-based methods
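For the three-taxon tree the system is small enough to solve by hand. The slides show only the equation for v1; assuming the equations for v2 and v3 follow by symmetry, substitution gives a closed form, sketched below. The function name is made up for this example, and the input distances are the Gorilla/Human/Chimpanzee counts from the first example.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Least squares branch lengths (v1, v2, v3) for a three-taxon tree.

    Solves the three equations of the form
        4*v1 + 2*v2 + 2*v3 - 2*D_AB - 2*D_AC = 0
    (and the two symmetric ones, assumed here) by substitution.
    """
    v1 = (d_ab + d_ac - d_bc) / 2
    v2 = (d_ab + d_bc - d_ac) / 2
    v3 = (d_ac + d_bc - d_ab) / 2
    return v1, v2, v3


# A = Gorilla, B = Human, C = Chimpanzee
v1, v2, v3 = three_taxon_branch_lengths(d_ab=4, d_ac=4, d_bc=2)

# Left-hand side of the slide's equation for v1; it should vanish at the optimum.
residual_v1 = 4 * v1 + 2 * v2 + 2 * v3 - 2 * 4 - 2 * 4
```

With three taxa there are three distances and three branches, so the fit is exact and Q is zero; with more taxa the system generally has to be solved numerically and Q stays positive.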


Least Squares Optimality Criterion

• Search through all (or many) tree topologies

• For each investigated tree, find best branch lengths using least
squares criterion (solve N equations with N unknowns)

• Among all investigated trees, the best tree is the one with the smallest
sum of squared errors.

• Least squares criterion used both for finding branch lengths on individual trees, and for finding best tree.
Minimum Evolution Optimality Criterion

• Search through all (or many) tree topologies

• For each investigated tree, find best branch lengths using least
squares criterion (solve N equations with N unknowns)

• Among all investigated trees, the best tree is the one with the smallest
sum of branch lengths (the shortest tree).

• Least squares criterion used for finding branch lengths on individual trees, minimum tree length used for finding best tree.
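Both this criterion and the least squares criterion can be sketched for four taxa, where only three unrooted topologies exist. The code below is an illustrative sketch, not a practical implementation: it fits unconstrained least squares branch lengths through the normal equations (so negative branch lengths are possible), and all function and variable names are hypothetical. The distances are those of the neighbor joining example later in the slides.

```python
from fractions import Fraction
from itertools import combinations


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    aug = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def fit_four_taxon_tree(names, dist, cherry):
    """Fit the tree that groups the two taxa in `cherry` against the other two.

    Unknowns: one terminal branch per taxon, then the internal branch.
    Returns (Q, total tree length).
    """
    pairs = list(combinations(names, 2))
    rows = []
    for x, y in pairs:
        row = [1 if name in (x, y) else 0 for name in names]
        # The path crosses the internal branch when x and y lie on different sides.
        row.append(1 if (x in cherry) != (y in cherry) else 0)
        rows.append(row)
    observed = [dist[frozenset(p)] for p in pairs]
    k = len(names) + 1
    # Normal equations: (A^T A) v = A^T D
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atd = [sum(r[i] * o for r, o in zip(rows, observed)) for i in range(k)]
    branches = solve(ata, atd)
    fitted = [sum(c * v for c, v in zip(r, branches)) for r in rows]
    q = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return q, sum(branches)


names = ["A", "B", "C", "D"]
raw = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
       ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}
dist = {frozenset(pair): value for pair, value in raw.items()}

# Each topology is identified by the taxon grouped with A.
results = {partner: fit_four_taxon_tree(names, dist, {"A", partner}) for partner in "BCD"}
best_least_squares = min(results, key=lambda p: results[p][0])
best_minimum_evolution = min(results, key=lambda p: results[p][1])
```

In this example both criteria select the same topology, the one grouping A with B. In general the two criteria can disagree, because one ranks trees by Q and the other by the sum of branch lengths.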
Superimposed Substitutions

[Figure: ancestral sequence ACGGTGC and descendant sequence GCGGTGA, with intermediate states C and T marked along the way]

• Actual number of evolutionary events: 5

• Observed number of differences: 2

• Distance is (almost) always underestimated
Model-based correction for superimposed substitutions

• Goal: try to infer the real number of evolutionary events (the real distance) based on

1. Observed data (sequence alignment)

2. A model of how evolution occurs


Jukes and Cantor Model

• Four nucleotides assumed to be equally frequent (f=0.25)

• All 12 substitution rates assumed to be equal

         A     C     G     T
   A    -3α    α     α     α
   C     α    -3α    α     α
   G     α     α    -3α    α
   T     α     α     α    -3α

• Under this model the corrected distance is:

   D_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D_{OBS}\right)

• For instance: DOBS=0.43 => DJC=0.64
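The correction is a one-line computation. The sketch below assumes DOBS is the proportion of differing sites (differences per aligned position, not a raw count) and uses the exact constants 3/4 and 4/3; the function name is made up for this example.

```python
import math


def jukes_cantor_distance(d_obs):
    """Jukes-Cantor corrected distance for an observed proportion of differing sites."""
    if not 0 <= d_obs < 0.75:
        # The logarithm is undefined once 1 - (4/3) * d_obs reaches zero.
        raise ValueError("observed distance must lie in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * d_obs)


d_jc = jukes_cantor_distance(0.43)
```

The corrected value is always at least as large as the observed one, and the gap widens as DOBS approaches 0.75, where the correction is no longer defined.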
Clustering Algorithms

• Starting point: Distance matrix



• Cluster the two nearest nodes:


– Tree: connect pair of nodes to common ancestral node, compute branch lengths from ancestral
node to both descendants

– Distance matrix: replace the two joined nodes with the new (ancestral) node. Compute new distance
matrix, by finding distance from new node to all other nodes

• Repeat until all nodes are linked in tree

• Results in only one tree; there is no measure of tree-goodness.


Neighbor Joining Algorithm
• For each tip compute ui = Σj Dij/(n-2)
(essentially the average distance to all other tips, except the denominator is n-2 instead of n-1)

• Find the pair of tips, i and j, where Dij-ui-uj is smallest

• Connect the tips i and j, forming a new ancestral node. The branch lengths from the ancestral node
to i and j are:
vi = 0.5 Dij + 0.5 (ui-uj)
vj = 0.5 Dij + 0.5 (uj-ui)

• Update the distance matrix: Compute distance between new node and each remaining tip as
follows:
Dij,k = (Dik+Djk-Dij)/2

• Replace tips i and j by the new node which is now treated as a tip

• Repeat until only two nodes remain.
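The steps above can be sketched as a short Python function. This is a minimal illustration rather than a reference implementation: the data layout (a dict keyed by `frozenset` pairs), the internal node names `node0`, `node1`, ... and the tie-breaking rule (first minimal pair found) are all choices made for this example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges.

    names: list of tip labels
    dist:  dict mapping frozenset({x, y}) to the distance between x and y
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # u_i = sum_j D_ij / (n - 2)
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Pair with the smallest D_ij - u_i - u_j (ties: first pair found).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        d_ij = d[frozenset((i, j))]
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new ancestral node to i and j.
        edges.append((i, new, 0.5 * d_ij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * d_ij + 0.5 * (u[j] - u[i])))
        # Distance from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (d[frozenset((i, k))]
                                          + d[frozenset((j, k))] - d_ij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


raw = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
       ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}
edges = neighbor_joining(["A", "B", "C", "D"],
                         {frozenset(pair): value for pair, value in raw.items()})
tip_lengths = {node: length for node, _, length in edges if node in "ABCD"}
internal_length = edges[-1][2]
```

For this matrix the pairs A,B and C,D tie in the first round, so the order of joining depends on the tie-breaking rule; either order leads to the same unrooted tree and the same branch lengths.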


Neighbor Joining Algorithm

Worked example, starting from this distance matrix:

        A    B    C    D
   A    -    17   21   27
   B         -    12   18
   C              -    14
   D                   -

Compute ui for each tip (n = 4, so the denominator is n-2 = 2):

   i    ui
   A    (17+21+27)/2 = 32.5
   B    (17+12+18)/2 = 23.5
   C    (21+12+14)/2 = 23.5
   D    (27+18+14)/2 = 29.5

Compute Dij-ui-uj for each pair:

        A    B    C    D
   A    -   -39  -35  -35
   B         -   -35  -35
   C              -   -39
   D                   -

The smallest value is -39 (the pairs A,B and C,D tie). C and D are joined, forming the new node X:

   vC = 0.5 x 14 + 0.5 x (23.5-29.5) = 4
   vD = 0.5 x 14 + 0.5 x (29.5-23.5) = 10
Neighbor Joining Algorithm

Update the distance matrix: compute the distance from the new node X to each remaining tip:

   DXA = (DCA + DDA - DCD)/2 = (21 + 27 - 14)/2 = 17
   DXB = (DCB + DDB - DCD)/2 = (12 + 18 - 14)/2 = 8

Replace tips C and D by X:

        A    B    X
   A    -    17   17
   B         -    8
   X              -

Tree so far: C and D joined at X, with branch lengths vC = 4 and vD = 10.
Neighbor Joining Algorithm

Compute ui for the remaining nodes (n = 3, so the denominator is n-2 = 1):

   i    ui
   A    (17+17)/1 = 34
   B    (17+8)/1 = 25
   X    (17+8)/1 = 25

Compute Dij-ui-uj for each pair:

        A    B    X
   A    -   -42  -42
   B         -   -42
   X              -

With three nodes left all pairs tie. A and B are joined, forming the new node Y:

   vA = 0.5 x 17 + 0.5 x (34-25) = 13
   vB = 0.5 x 17 + 0.5 x (25-34) = 4
Neighbor Joining Algorithm

Update the distance matrix: compute the distance from the new node Y to the remaining node X:

   DYX = (DAX + DBX - DAB)/2 = (17 + 8 - 17)/2 = 4

        X    Y
   X    -    4
   Y         -

Only two nodes remain, so X and Y are connected by a branch of length 4.
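Neighbor Joining Algorithm

        A    B    C    D
   A    -    17   21   27
   B         -    12   18
   C              -    14
   D                   -

[Figure: final tree; A (branch length 13) and B (4) joined at one internal node, C (4) and D (10) at the other, with an internal branch of length 4 between them]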
