Mole Vol Class 05

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
Distance Matrix Methods
Anders Gorm Pedersen

Molecular Evolution Group
Center for Biological Sequence Analysis
Technical University of Denmark
gorm@cbs.dtu.dk
Distance Matrix Methods
1. Construct multiple alignment of sequences

   Gorilla   : ACGTCGTA
   Human     : ACGTTCCT
   Chimpanzee: ACGTTTCG

2. Construct table listing all pairwise differences (distance matrix)

        Go   Hu   Ch
   Go   -    4    4
   Hu        -    2
   Ch             -

3. Construct tree from pairwise distances

   [Tree figure. Branch lengths: Ch 1, Hu 1, branch leading to the common ancestor of Hu and Ch 1, Go 2]
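Step 2 can be checked with a few lines of Python. This is a minimal sketch: the function names `count_differences` and `distance_matrix` are illustrative and not from the slides, and the distance is simply the number of aligned positions that differ.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its difference count."""
    return {
        (name_a, name_b): count_differences(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# The three aligned sequences from the slide.
alignment = {
    "Gorilla": "ACGTCGTA",
    "Human": "ACGTTCCT",
    "Chimpanzee": "ACGTTTCG",
}
matrix = distance_matrix(alignment)
for pair, distance in matrix.items():
    print(pair, distance)
```

The resulting counts (4, 4, and 2) match the distance matrix shown on the slide.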
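Finding optimal branch lengths

Observed distances:

        S1   S2   S3   S4
   S1   -    D12  D13  D14
   S2        -    D23  D24
   S3             -    D34
   S4                  -

[Tree figure: S1 (branch a) and S3 (branch d) meet at one internal node, S2 (branch c) and S4 (branch e) meet at the other; the two internal nodes are connected by branch b]

Goal: each observed distance should match the distance along the tree (patristic distance):

   D12 ≈ d12 = a + b + c
   D13 ≈ d13 = a + d
   D14 ≈ d14 = a + b + e
   D23 ≈ d23 = d + b + c
   D24 ≈ d24 = c + e
   D34 ≈ d34 = d + b + e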
Exercise (handout)
• Construct distance matrix (count different positions)

• Reconstruct tree and find best set of branch lengths

Optimal Branch Lengths for a Given Tree: Least Squares

[Same tree and patristic distances d12 ... d34 as on the previous slide]

• Fit between given tree and observed distances can be expressed as "sum of squared differences":

   $$Q = \sum_{j>i} (D_{ij} - d_{ij})^2$$

• Find the branch lengths that minimize Q; this is the optimal set of branch lengths for this tree.
Optimal Branch Lengths: Least Squares

[Same tree and patristic distances d12 ... d34 as on the previous slides]

• Longer distances are associated with larger errors
• The squared deviations may therefore be weighted so that longer distances contribute less to Q:

   $$Q = \sum_{j>i} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^{n}}$$

• Power (n) is typically 1 or 2
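To make the two versions of Q concrete, here is a small sketch that evaluates them for the four-taxon tree above. The observed distances and branch lengths in the example are made-up illustration values, and the function names are assumptions, not part of the slides. Setting `power=0` gives the unweighted sum of squares.

```python
def patristic_distances(a, b, c, d, e):
    """Path lengths on the slide's tree: S1 (a) and S3 (d) on one side of
    the internal branch b, S2 (c) and S4 (e) on the other."""
    return {
        (1, 2): a + b + c,
        (1, 3): a + d,
        (1, 4): a + b + e,
        (2, 3): d + b + c,
        (2, 4): c + e,
        (3, 4): d + b + e,
    }


def least_squares_q(observed, fitted, power=0):
    """Sum over pairs of (D_ij - d_ij)^2 / D_ij^power."""
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** power
        for pair in observed
    )


# Hypothetical observed distances (illustration only, not from the slides).
observed = {(1, 2): 8, (1, 3): 3, (1, 4): 8, (2, 3): 6, (2, 4): 5, (3, 4): 6}
# Hypothetical candidate branch lengths for the same tree.
fitted = patristic_distances(a=2, b=3, c=2, d=1, e=3)

print(least_squares_q(observed, fitted))           # unweighted
print(least_squares_q(observed, fitted, power=1))  # weighted, n = 1
print(least_squares_q(observed, fitted, power=2))  # weighted, n = 2
```

With these values only D12 and D34 deviate from the tree (by 1 each), so the unweighted Q is 2, and the weighted versions shrink each term by the corresponding observed distance raised to the power n.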
Finding Optimal Branch Lengths

Observed distances:

        A    B    C
   A    -    DAB  DAC
   B         -    DBC
   C              -

[Tree figure: tips A, B, and C joined at a single internal node by branches v1, v2, and v3 respectively]

Goal: each observed distance should match the distance along the tree:

   DAB ≈ dAB = v1 + v2
   DAC ≈ dAC = v1 + v3
   DBC ≈ dBC = v2 + v3

Sum of squared differences:

$$Q = (D_{AB} - d_{AB})^2 + (D_{AC} - d_{AC})^2 + (D_{BC} - d_{BC})^2$$

Expand the squares:

$$Q = D_{AB}^2 + d_{AB}^2 - 2 D_{AB} d_{AB} + D_{AC}^2 + d_{AC}^2 - 2 D_{AC} d_{AC} + D_{BC}^2 + d_{BC}^2 - 2 D_{BC} d_{BC}$$

Substitute the branch lengths for the patristic distances:

$$Q = D_{AB}^2 + (v_1 + v_2)^2 - 2 D_{AB} (v_1 + v_2) + D_{AC}^2 + (v_1 + v_3)^2 - 2 D_{AC} (v_1 + v_3) + D_{BC}^2 + (v_2 + v_3)^2 - 2 D_{BC} (v_2 + v_3)$$

Multiply out:

$$Q = D_{AB}^2 + v_1^2 + v_2^2 + 2 v_1 v_2 - 2 D_{AB} v_1 - 2 D_{AB} v_2$$
$$\quad + D_{AC}^2 + v_1^2 + v_3^2 + 2 v_1 v_3 - 2 D_{AC} v_1 - 2 D_{AC} v_3$$
$$\quad + D_{BC}^2 + v_2^2 + v_3^2 + 2 v_2 v_3 - 2 D_{BC} v_2 - 2 D_{BC} v_3$$

Collect the terms in v1:

$$Q = 2 v_1^2 + (2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC}) v_1 + 2 v_2^2 + 2 v_3^2 + 2 v_2 v_3 - 2 D_{AB} v_2 - 2 D_{AC} v_3 - 2 D_{BC} v_2 - 2 D_{BC} v_3 + D_{AB}^2 + D_{AC}^2 + D_{BC}^2$$

$$Q = 2 v_1^2 + C_1 v_1 + C_2$$

where C1 and C2 do not depend on v1. At the minimum, dQ/dv1 = 0:

$$\frac{dQ}{dv_1} = 4 v_1 + C_1 = 4 v_1 + 2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC}$$

⇓

$$4 v_1 + 2 v_2 + 2 v_3 - 2 D_{AB} - 2 D_{AC} = 0$$

Doing the same for v2 and v3 gives one such equation per branch length.
• System of n linear equations with n unknowns
• Can be solved using substitution method or matrix-based methods
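As a sketch of the matrix-based route for the three-taxon case: the slides derive only the equation for v1, so the equations for v2 and v3 below are obtained by the same differentiation (by symmetry) and should be read as an assumption of this example. The helper names and the input distances 5, 7, and 8 are hypothetical. Exact fractions avoid rounding noise.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions.

    Assumes the system has a unique solution (no zero pivot column).
    """
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(r)]
            for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (v1, v2, v3) from dQ/dv1 = dQ/dv2 = dQ/dv3 = 0."""
    coefficients = [
        [4, 2, 2],  # 4 v1 + 2 v2 + 2 v3 = 2 DAB + 2 DAC
        [2, 4, 2],  # 2 v1 + 4 v2 + 2 v3 = 2 DAB + 2 DBC
        [2, 2, 4],  # 2 v1 + 2 v2 + 4 v3 = 2 DAC + 2 DBC
    ]
    constants = [2 * d_ab + 2 * d_ac, 2 * d_ab + 2 * d_bc, 2 * d_ac + 2 * d_bc]
    return solve_linear_system(coefficients, constants)


# Hypothetical observed distances, for illustration only.
v1, v2, v3 = three_taxon_branch_lengths(d_ab=5, d_ac=7, d_bc=8)
print(v1, v2, v3)
```

For three taxa the fit is exact: the solution reproduces all three observed distances (v1 + v2 = 5, v1 + v3 = 7, v2 + v3 = 8), so Q = 0.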

Least Squares Optimality Criterion
• Search through all (or many) tree topologies
• For each investigated tree, find best branch lengths using least
squares criterion (solve N equations with N unknowns)
• Among all investigated trees, the best tree is the one with the smallest
sum of squared errors.
• Least squares criterion used both for finding branch lengths on individual trees, and for finding best tree.
Minimum Evolution Optimality Criterion
• Search through all (or many) tree topologies
• For each investigated tree, find best branch lengths using least
squares criterion (solve N equations with N unknowns)
• Among all investigated trees, the best tree is the one with the smallest
sum of branch lengths (the shortest tree).
• Least squares criterion used for finding branch lengths on individual trees, minimum tree length used for finding best tree.
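The two criteria can be compared on a four-taxon problem, where only three unrooted topologies exist. The sketch below uses the four-taxon distance matrix (A, B, C, D) that appears in the neighbor joining example later in these slides. Instead of a general solver it uses the closed-form least-squares solution for a quartet, which follows from setting the derivatives of Q to zero; that shortcut and all function names are assumptions of this example, not something the slides state.

```python
def fit_quartet(dist, i, j, k, l):
    """Least-squares fit of the unrooted tree (i,j)|(k,l).

    Returns (Q, total tree length, branch lengths).
    """
    def D(x, y):
        return dist[x, y] if (x, y) in dist else dist[y, x]

    cross = D(i, k) + D(i, l) + D(j, k) + D(j, l)
    skew_ij = (D(i, k) + D(i, l) - D(j, k) - D(j, l)) / 4
    skew_kl = (D(i, k) + D(j, k) - D(i, l) - D(j, l)) / 4
    lengths = {
        i: D(i, j) / 2 + skew_ij,
        j: D(i, j) / 2 - skew_ij,
        k: D(k, l) / 2 + skew_kl,
        l: D(k, l) / 2 - skew_kl,
        "internal": cross / 4 - (D(i, j) + D(k, l)) / 2,
    }

    def path(x, y):
        # Tips on the same side of the internal branch do not cross it.
        same_side = {x, y} in ({i, j}, {k, l})
        return lengths[x] + lengths[y] + (0 if same_side else lengths["internal"])

    pairs = [(i, j), (i, k), (i, l), (j, k), (j, l), (k, l)]
    q = sum((D(x, y) - path(x, y)) ** 2 for x, y in pairs)
    return q, sum(lengths.values()), lengths


dist = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
        ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}
topologies = {
    "AB|CD": ("A", "B", "C", "D"),
    "AC|BD": ("A", "C", "B", "D"),
    "AD|BC": ("A", "D", "B", "C"),
}
fits = {name: fit_quartet(dist, *taxa) for name, taxa in topologies.items()}

best_least_squares = min(fits, key=lambda name: fits[name][0])
best_minimum_evolution = min(fits, key=lambda name: fits[name][1])
for name, (q, length, _) in fits.items():
    print(name, "Q =", q, "tree length =", length)
```

On this matrix both criteria pick AB|CD (Q = 0, tree length 35), while the other two topologies have Q = 16 and tree length 37. Note that this unconstrained fit gives the two rejected topologies a negative internal branch, which real programs may handle differently.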
Superimposed Substitutions
[Figure: sequence ACGGTGC evolving into GCGGTGA, with intermediate nucleotides C and T marked along the way]

• Actual number of evolutionary events: 5
• Observed number of differences: 2
• Distance is (almost) always underestimated
Model-based correction for superimposed substitutions
• Goal: try to infer the real number of evolutionary events (the real
distance) based on
1. Observed data (sequence alignment)
2. A model of how evolution occurs

Jukes and Cantor Model
• Four nucleotides assumed to be equally frequent (f=0.25)
• All 12 substitution rates assumed to be equal (α):

        A     C     G     T
   A   -3α    α     α     α
   C    α    -3α    α     α
   G    α     α    -3α    α
   T    α     α     α    -3α

• Under this model the corrected distance is:

   $$D_{JC} = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} D_{OBS}\right)$$

• For instance: DOBS=0.43 => DJC=0.64
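The correction is a one-line function. A minimal sketch follows, with `jukes_cantor_distance` as an illustrative name; the range check reflects the fact that the logarithm is undefined once the observed proportion of differences reaches 0.75.

```python
import math


def jukes_cantor_distance(d_obs):
    """Jukes-Cantor corrected distance for an observed proportion of differences."""
    if not 0 <= d_obs < 0.75:
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * d_obs)


# The example from the slide: DOBS = 0.43.
print(round(jukes_cantor_distance(0.43), 2))
```

For small observed distances the correction is negligible; it grows rapidly as the observed distance approaches 0.75.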
Clustering Algorithms
• Starting point: Distance matrix

• Cluster the two nearest nodes:

– Tree: connect pair of nodes to common ancestral node, compute branch lengths from ancestral
node to both descendants
– Distance matrix: replace the two joined nodes with the new (ancestral) node. Compute new distance
matrix, by finding distance from new node to all other nodes
• Repeat until all nodes are linked in tree
• Results in only one tree; there is no measure of tree-goodness.

Neighbor Joining Algorithm
• For each tip compute ui = Σj Dij/(n-2)
(essentially the average distance to all other tips, except the denominator is n-2 instead of n-1)
• Find the pair of tips, i and j, where Dij-ui-uj is smallest
• Connect the tips i and j, forming a new ancestral node. The branch lengths from the ancestral node
to i and j are:
vi = 0.5 Dij + 0.5 (ui-uj)
vj = 0.5 Dij + 0.5 (uj-ui)
• Update the distance matrix: Compute distance between new node and each remaining tip as
follows:
Dij,k = (Dik+Djk-Dij)/2
• Replace tips i and j by the new node which is now treated as a tip
• Repeat until only two nodes remain.
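The steps above can be sketched in Python as follows. This is an illustrative implementation, not the author's code: the function name, the `N1`, `N2`, ... labels for new ancestral nodes, and the edge-list output format are all assumptions. Ties are broken by taking the first minimal pair in iteration order, and the last two remaining nodes are connected by a branch equal to their distance.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, node, branch_length) edges.

    dist maps (a, b) pairs of names to distances; each pair is given once.
    """
    d = {}
    for (a, b), value in dist.items():
        d[a, b] = d[b, a] = float(value)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # u_i = sum of distances to all other nodes, divided by n - 2.
        u = {i: sum(d[i, k] for k in nodes if k != i) / (n - 2) for i in nodes}
        # Pair with the smallest D_ij - u_i - u_j (first one wins on ties).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p] - u[p[0]] - u[p[1]])
        counter += 1
        new = f"N{counter}"
        edges.append((i, new, 0.5 * d[i, j] + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * d[i, j] + 0.5 * (u[j] - u[i])))
        # Distance from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[new, k] = d[k, new] = (d[i, k] + d[j, k] - d[i, j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges


# The four-taxon matrix used in the slides' worked example.
example = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
           ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}
tree = neighbor_joining("ABCD", example)
for edge in tree:
    print(edge)
```

On this matrix the pairs A,B and C,D tie in the first round. This sketch joins A and B first, whereas the worked example joins C and D first; both orders give the same unrooted tree with branch lengths A 13, B 4, C 4, D 10, and an internal branch of 4.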

Neighbor Joining Algorithm: Example

Starting distance matrix:

        A    B    C    D
   A    -    17   21   27
   B         -    12   18
   C              -    14
   D                   -

Step 1. Compute ui for each tip (n = 4, so the denominator is 2):

   i    ui
   A    (17+21+27)/2 = 32.5
   B    (17+12+18)/2 = 23.5
   C    (21+12+14)/2 = 23.5
   D    (27+18+14)/2 = 29.5

Step 2. Compute Dij-ui-uj for every pair:

        A    B    C    D
   A    -    -39  -35  -35
   B         -    -35  -35
   C              -    -39
   D                   -

   The smallest value, -39, is shared by the pairs A,B and C,D. The example joins C and D.

Step 3. Connect C and D to a new ancestral node X and compute the branch lengths:

   vC = 0.5 x 14 + 0.5 x (23.5-29.5) = 4
   vD = 0.5 x 14 + 0.5 x (29.5-23.5) = 10

Step 4. Compute the distance from X to each remaining tip, then replace C and D by X:

   DXA = (DCA + DDA - DCD)/2 = (21 + 27 - 14)/2 = 17
   DXB = (DCB + DDB - DCD)/2 = (12 + 18 - 14)/2 = 8

        A    B    X
   A    -    17   17
   B         -    8
   X              -

Step 5. Compute ui again (n = 3, so the denominator is 1):

   i    ui
   A    (17+17)/1 = 34
   B    (17+8)/1 = 25
   X    (17+8)/1 = 25

   Dij-ui-uj:

        A    B    X
   A    -    -42  -42
   B         -    -42
   X              -

   With only three nodes left, every pair gives the same value (-42). The example joins A and B.

Step 6. Connect A and B to a new ancestral node Y and compute the branch lengths:

   vA = 0.5 x 17 + 0.5 x (34-25) = 13
   vB = 0.5 x 17 + 0.5 x (25-34) = 4

Step 7. Compute the distance between the two remaining nodes:

   DYX = (DAX + DBX - DAB)/2 = (17 + 8 - 17)/2 = 4

        X    Y
   X    -    4
   Y         -

   Only two nodes remain; X and Y are connected by a branch of length 4.

Final tree:

   [Tree figure: C (4) and D (10) attach to node X, A (13) and B (4) attach to node Y, and X and Y are connected by an internal branch of length 4]
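As a check on the worked example, the sketch below sums branch lengths along the final tree (C and D attached to X with lengths 4 and 10, A and B attached to Y with lengths 13 and 4, X and Y joined by a branch of length 4) and compares each tip-to-tip path with the original distance matrix. The edge-dictionary representation and the `path_length` helper are assumptions made for this illustration.

```python
from itertools import combinations

# Final tree from the worked example, as undirected edges with lengths.
tree_edges = {("C", "X"): 4, ("D", "X"): 10,
              ("A", "Y"): 13, ("B", "Y"): 4,
              ("X", "Y"): 4}

# Original observed distance matrix.
observed = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
            ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}


def path_length(edges, start, goal):
    """Sum of branch lengths on the unique path between two nodes of a tree."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    # Depth-first walk; remembering the parent is enough to avoid
    # backtracking because a tree has no cycles.
    stack = [(start, None, 0)]
    while stack:
        node, parent, total = stack.pop()
        if node == goal:
            return total
        for neighbor, length in adjacency[node]:
            if neighbor != parent:
                stack.append((neighbor, node, total + length))
    raise ValueError("nodes are not connected")


for a, b in combinations("ABCD", 2):
    print(a, b, path_length(tree_edges, a, b), observed[a, b])
```

Every patristic distance equals the corresponding observed distance, so for this particular matrix the neighbor joining tree fits the data exactly.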
