
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS

Guided by
Dr. G. Sudha Sadasivam
Asst. Professor
Dept. of CSE

Presented by
C. Geetha Jini (07MW03)
D. Komagal Meenakshi (07MW05)

What is Sequence Alignment?


The procedure of comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the
same order in the sequences.

Types of Sequence Alignment

Pair-wise Alignment
Alignment of two sequences.

Global, using the Needleman-Wunsch algorithm:
LGPSSKQTGKGS_SRAWDN
|    |  |||   |  |
LN_ATKSAGKGAIMRLGDA

Local, using the Smith-Waterman algorithm:
_________TGKG__________
          |||
_________AGKG__________

Multiple Sequence Alignment


Alignment of more than two sequences

NEEDLEMAN WUNSCH ALGORITHM

Initialization (d is the gap penalty)
F(0, 0) = 0
F(i, 0) = -i * d
F(0, j) = -j * d

Main Iteration
Case 1: xi aligns to yj
Case 2: xi aligns to gap
Case 3: yj aligns to gap

For each i = 1..M and j = 1..N

F(i, j) = max of
    F(i-1, j-1) + s(xi, yj),  case 1
    F(i-1, j) - d,            case 2
    F(i, j-1) - d,            case 3

Ptr(i, j) =
    DIAG, if case 1
    UP,   if case 2
    LEFT, if case 3

s(xi, yj) =
    +1, match
    -1, mismatch

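Needleman Wunsch Algorithm: Example

x = ATA (index i), y = AGTA (index j)
s(xi, yj) = +1, match; -1, mismatch
d = 1

Case 1: xi aligns to yj
Case 2: xi aligns to gap
Case 3: yj aligns to gap

F(1,1) = max of
    F(0,0) + s(x1, y1) =  1
    F(0,1) - 1         = -2
    F(1,0) - 1         = -2
       = 1 (case 1)

F(1,2) = max of
    F(0,1) + s(x1, y2) = -2
    F(0,2) - 1         = -3
    F(1,1) - 1         =  0
       = 0 (case 3)

F(i,j)    j=0     A     G     T     A
i=0         0    -1    -2    -3    -4
A          -1     1     0    -1    -2
T          -2     0     0     1     0
A          -3    -1    -1     0     2

PTR =
    DIAG, if case 1
    UP,   if case 2
    LEFT, if case 3

Optimal Alignment
A_TA
AGTA
Traceback path: F(3,4) = 2, F(2,3) = 1, F(1,2) = 0, F(1,1) = 1
Score: F(3,4) = 2 (three matches and one gap: +1 -1 +1 +1)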
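
The recurrence and traceback for global alignment can be sketched in Python as follows. This is a minimal illustration under the slide's scoring (match +1, mismatch -1, d = 1), not the Hadoop implementation from the presentation; the function name and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # F holds the scores, ptr the traceback direction of each cell.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: xi aligns to yj
                (F[i - 1][j] - d, "UP"),        # case 2: xi aligns to gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: yj aligns to gap
            ]
            # max() keeps the first maximal entry, so ties prefer DIAG, then UP.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Follow the pointers from the bottom-right cell back to F(0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "UP":
            ax.append(x[i - 1])
            ay.append("_")
            i -= 1
        else:
            ax.append("_")
            ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


result = needleman_wunsch("ATA", "AGTA")
print(result)  # (2, 'A_TA', 'AGTA')
```

The alignment score is the value of the bottom-right cell, not the sum of the cells visited during traceback.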

Smith Waterman Algorithm

Initialization:
F(0, j) = F(i, 0) = 0
Iteration:
F(i, j) = max of
    0
    F(i-1, j-1) + s(xi, yj),  case 1
    F(i-1, j) - d,            case 2
    F(i, j-1) - d,            case 3

Smith Waterman Algorithm: Example

x = ATA (index i), y = AGTA (index j)
s(xi, yj) = +1, match; -1, mismatch
d = 1

Case 1: xi aligns to yj
Case 2: xi aligns to gap
Case 3: yj aligns to gap

F(1,1) = max of
    F(0,0) + s(x1, y1) =  1
    F(0,1) - 1         = -1
    F(1,0) - 1         = -1
    0
       = 1 (case 1)

F(1,3) = max of
    F(0,2) + s(x1, y3) = -1
    F(0,3) - 1         = -1
    F(1,2) - 1         = -1
    0
       = 0

PTR =
    DIAG, if case 1
    UP,   if case 2
    LEFT, if case 3

Optimal Alignment
A_TA
__TA
Score: 2 (the highest cell, F(3,4); traceback passes F(2,3) = 1 and stops at the first 0 cell, giving the local alignment TA / TA)
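
A minimal Python sketch of local alignment under the same scoring (match +1, mismatch -1, d = 1) is given below. It is an illustration only: the function name is hypothetical, the traceback recomputes the winning case instead of storing pointers, and ties between equally good cells are resolved by taking the first one found in row order.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0, as in the initialization step.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,  # case 1
                          F[i - 1][j] - d,      # case 2
                          F[i][j - 1] - d)      # case 3
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the highest cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("_")
            i -= 1
        else:
            ax.append("_")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


local = smith_waterman("ATA", "AGTA")
print(local)  # (2, 'TA', 'TA')
```

The only differences from the global version are the 0 in the maximum, the zero initialization, and the traceback that starts at the best cell instead of the bottom-right corner.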

Proposed system

Input: one query file and a set of sequence files
Put all files in DFS
Map
    Set File Name as Key
    Pass Entire File contents as Value
    Do Sequence alignment of query file with the target files in DFS
    Return (Filename as key, Score as Value)
Reduce
    Combine all the (K,V) pairs
Output: (Filename, Score)
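
The map and reduce steps of this flow can be simulated in plain Python as sketched below. This does not use the Hadoop API: the `files` dictionary is a hypothetical stand-in for files stored in DFS, `map_phase` and `reduce_phase` are illustrative names, and taking the maximum per key in the reducer is an assumption (each file name is emitted once here, so any combiner would give the same result).

```python
from collections import defaultdict


def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping one matrix row at a time."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xc in enumerate(x, start=1):
        cur = [-i * d]
        for j, yc in enumerate(y, start=1):
            s = match if xc == yc else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def map_phase(filename, contents, query):
    """Map: key = file name, value = file contents; emits (file name, score)."""
    yield filename, global_score(query, contents.strip())


def reduce_phase(pairs):
    """Reduce: combine all (K, V) pairs into one score per file name."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: max(values) for key, values in grouped.items()}


# Hypothetical stand-in for the sequence files placed in DFS.
files = {"seq1.txt": "AGTA\n", "seq2.txt": "ATA\n", "seq3.txt": "GGGG\n"}
query = "ATA"

emitted = [kv for name, text in files.items() for kv in map_phase(name, text, query)]
scores = reduce_phase(emitted)
print(scores)  # {'seq1.txt': 2, 'seq2.txt': 3, 'seq3.txt': -4}
```

Because each target file is aligned against the query independently, the map calls have no shared state and can run in parallel on different nodes.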

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.

In general, the input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.

From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

Methods for producing MSA

Dynamic programming
The most direct method for producing an MSA; it identifies the globally optimal alignment solution.
Computational complexity: for n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment.
The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length.

Progressive alignment construction
Uses a heuristic search.
Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
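
The exponential growth of the dynamic-programming approach can be made concrete with a rough count. The estimate below assumes n sequences that all have the same length L and ignores the cost of scoring each alignment column; it is a back-of-the-envelope bound, not a figure from the presentation.

```latex
\[
\text{cells} = (L+1)^{n}, \qquad
\text{predecessors per cell} = 2^{n} - 1, \qquad
\text{time} = O\!\left(2^{n} L^{n}\right)
\]
```

For n = 2 this gives (L+1)^2 cells with 3 predecessors each, which are the three cases of the pairwise recurrence. For 10 sequences of length 100 the lattice already has about 10^20 cells, which is why heuristic methods such as progressive alignment are used in practice.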

The most popular progressive alignment method has been ClustalW.

All progressive alignment methods require two stages:
A first stage in which the relationships between the sequences are represented as a tree, called a guide tree.
A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.

First stage: the guide tree is computed from pairwise alignment scores by an efficient clustering method such as the neighbor-joining method.

Second stage: the two most similar sequences are aligned first; additional sequences (or groups of sequences) are added later, following the guide tree. This requires a method to optimally align a sequence with an alignment, or an alignment with an alignment.

[Figure: guide tree with leaves sequence 1, sequence 2, sequence 3, sequence 4]

Example: According to the guide tree, first align sequences 1 and 2, then align sequence 3 to the alignment of sequences 1 and 2, then align sequence 4 to the alignment of sequences 1, 2, and 3.
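
The order in which a guide tree dictates the alignment steps can be derived with a post-order traversal, sketched below. The nested-tuple representation of the tree and the function name `merge_order` are assumptions made for illustration; the sketch only lists which groups are aligned at each step and does not perform the profile alignment itself.

```python
def merge_order(tree):
    """Return the alignment steps implied by a guide tree given as nested 2-tuples.

    Each step is a pair (left_group, right_group) of tuples of sequence names.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):      # leaf: a single sequence
            return (node,)
        left, right = (visit(child) for child in node)
        steps.append((left, right))    # align the two groups below this node
        return left + right

    visit(tree)
    return steps


# Guide tree matching the example: ((1, 2), 3), 4
guide_tree = ((("sequence 1", "sequence 2"), "sequence 3"), "sequence 4")
for left, right in merge_order(guide_tree):
    print(" + ".join(left), "<->", " + ".join(right))
```

Children are visited before their parent, so the most similar pair at the bottom of the tree is aligned first and each later step adds a sequence or group to an existing alignment.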

Neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees.
Neighbor-joining is an iterative algorithm. Each iteration consists of the following steps:

Based on the current distance matrix, calculate the matrix Q.
For example, if we have four taxa (A, B, C, D) and the following distance matrix:
[Figure: distance matrix for taxa A, B, C, D]
We obtain the following values for the Q matrix:
[Figure: resulting Q matrix]

Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e. join the closest neighbors, as the algorithm name implies).

Calculate the distance of each of the taxa in the pair to this new node.

Calculate the distance of all taxa outside of this pair to the new node.

Start the algorithm again, considering the pair of joined neighbors as a single taxon and using the distances calculated in the previous step.
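
One iteration of these steps can be sketched in Python as shown below. The formulas are the standard neighbor-joining ones, which the slides do not spell out: Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), the branch length from f to the new node u is d(f,g)/2 + (sum_k d(f,k) - sum_k d(g,k)) / (2(n-2)), and d(u,k) = (d(f,k) + d(g,k) - d(f,g)) / 2. The five-taxon distance matrix is illustrative data, not the four-taxon matrix from the missing figure, and the function name and data layout are assumptions.

```python
def nj_step(taxa, dist):
    """One neighbor-joining iteration (requires at least 3 taxa).

    taxa: list of names; dist: dict mapping frozenset({a, b}) to a distance.
    Returns (joined_pair, branch_lengths, new_taxa, new_dist, q).
    """
    n = len(taxa)

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    total = {a: sum(d(a, k) for k in taxa) for a in taxa}

    # Step 1: compute the Q matrix from the current distance matrix.
    q = {(a, b): (n - 2) * d(a, b) - total[a] - total[b]
         for i, a in enumerate(taxa) for b in taxa[i + 1:]}

    # Step 2: the pair with the lowest Q value is joined.
    f, g = min(q, key=q.get)

    # Step 3: distance from each member of the pair to the new node.
    df = d(f, g) / 2 + (total[f] - total[g]) / (2 * (n - 2))
    dg = d(f, g) - df

    # Step 4: distance from every other taxon to the new node.
    u = f + g  # hypothetical label for the new internal node
    rest = [k for k in taxa if k not in (f, g)]
    new_dist = {frozenset((a, b)): d(a, b)
                for i, a in enumerate(rest) for b in rest[i + 1:]}
    for k in rest:
        new_dist[frozenset((u, k))] = (d(f, k) + d(g, k) - d(f, g)) / 2

    # Step 5 would call nj_step again on ([u] + rest, new_dist).
    return (f, g), {f: df, g: dg}, [u] + rest, new_dist, q


# Illustrative distance matrix (assumed data, not taken from the slides).
taxa = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
dist = {frozenset((taxa[i], taxa[j])): rows[i][j]
        for i in range(5) for j in range(i + 1, 5)}

pair, branches, new_taxa, new_dist, q = nj_step(taxa, dist)
print(pair, branches)  # ('a', 'b') {'a': 2.0, 'b': 3.0}
```

Note that the pair with the lowest Q value is not necessarily the pair with the smallest raw distance: here d and e are closest (distance 3), but a and b are joined first.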

The primary problem with progressive alignment is that when errors are made at any stage in growing the MSA, these errors are propagated through to the final result.

Performance is also particularly poor when all of the sequences in the set are rather distantly related.

Phylogenetic Analysis
An investigation of evolutionary relationships among a group of related sequences by producing a tree representation of those relationships.
Significant use: making predictions concerning the tree of life.

Structure
Outer branches -> sequences
Inner part -> reflects the degree to which sequences are related
Alike sequences -> located at neighboring outer branches
Less related sequences -> more distant from each other

Proposed System
Implementation of sequence alignment and phylogenetic prediction using the MapReduce programming model in Hadoop
Algorithms used for Alignment
Global: Needleman-Wunsch algorithm
Local: Smith-Waterman algorithm

Proposed system

Input: set of sequence files
Put all files in DFS
Map
    Set File Name as Key
    Pass Entire File contents as Value
    Do Sequence alignment of all the files with all possible combinations and find the alignment scores
    Return (Filename as key, Score as Value)
Reduce
    Combine all the (K,V) pairs
Output: (Filename, Score)
Phylogenetic Analysis
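
The all-combinations step can be sketched in Python as below. The sketch runs locally, not on Hadoop, and the file names are hypothetical. The conversion from alignment score to distance (length of the longer sequence minus the score) is an assumption introduced only to show how the scores could feed a distance-based tree method; the presentation does not state which conversion it uses.

```python
from itertools import combinations


def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping one matrix row at a time."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xc in enumerate(x, start=1):
        cur = [-i * d]
        for j, yc in enumerate(y, start=1):
            s = match if xc == yc else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def all_pairs_scores(files):
    """Align every unordered pair of files once; key = (name_a, name_b)."""
    return {(a, b): global_score(files[a], files[b])
            for a, b in combinations(sorted(files), 2)}


def scores_to_distances(scores, files):
    """Assumed conversion: identical sequences get distance 0."""
    return {(a, b): max(len(files[a]), len(files[b])) - s
            for (a, b), s in scores.items()}


# Hypothetical stand-in for the sequence files in DFS.
files = {"s1": "ATA", "s2": "AGTA", "s3": "GGGG"}
scores = all_pairs_scores(files)
distances = scores_to_distances(scores, files)
print(scores)     # {('s1', 's2'): 2, ('s1', 's3'): -4, ('s2', 's3'): -2}
print(distances)  # {('s1', 's2'): 2, ('s1', 's3'): 8, ('s2', 's3'): 6}
```

For n files there are n(n-1)/2 pairs, and each pair is independent of the others, which is what makes this step a natural fit for the map phase.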

The MapReduce algorithm for pairwise sequence alignment, both local and global, was completed using the Needleman-Wunsch and Smith-Waterman algorithms in Hadoop.

This can be extended to do multiple sequence alignment and to perform phylogenetic analysis in Hadoop for predicting possible evolutionary relationships among a group of related sequences.


Thank you
