Seq Hadoop

Sequence Alignment and Phylogenetic Prediction using the MapReduce Programming Model in Hadoop DFS
Guided by
Dr. G. Sudha Sadasivam
Asst. Professor
Dept. of CSE
Presented by
C. Geetha Jini (07MW03)
D. Komagal Meenakshi (07MW05)
What is Sequence Alignment?

The procedure of comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the
same order in the sequences.
Types of Sequence Alignment

Pair-wise Alignment
Alignment of two sequences.
Global, using the Needleman-Wunsch algorithm:
    L G P S S K Q T G K G S _ S R A W D N
    |         |     | | |       |     |
    L N _ A T K S A G K G A I M R L G D A
Local, using the Smith-Waterman algorithm:
    _________TGKG__________
              |||
    _________AGKG__________
Multiple Sequence Alignment
Alignment of more than two sequences.
NEEDLEMAN WUNSCH ALGORITHM

Initialization
F(0, 0) = 0
F(0, j) = -j * d
F(i, 0) = -i * d
Main Iteration
Case 1: xi aligns to yj
Case 2: xi aligns to a gap
Case 3: yj aligns to a gap
For each i = 1..M and j = 1..N
F(i, j) = max of
    F(i-1, j-1) + s(xi, yj), case 1
    F(i-1, j) - d, case 2
    F(i, j-1) - d, case 3
Ptr(i, j) =
    DIAG, if case 1
    UP, if case 2
    LEFT, if case 3
s(xi, yj) =
    +1, match
    -1, mismatch
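Needleman Wunsch Algorithm

Example: x = ATA (rows, index i), y = AGTA (columns, index j)
s(xi, yj) = +1 for a match, -1 for a mismatch; d = 1

F(1, 1) = max of
    F(0, 0) + s(x1, y1) = 0 + 1 = 1
    F(0, 1) - 1 = -2
    F(1, 0) - 1 = -2
  = 1 (case 1)
F(1, 2) = max of
    F(0, 1) + s(x1, y2) = -1 - 1 = -2
    F(0, 2) - 1 = -3
    F(1, 1) - 1 = 0
  = 0 (case 3)

F(i, j)    j=0    A    G    T    A
i=0          0   -1   -2   -3   -4
A           -1    1    0   -1   -2
T           -2    0    0    1    0
A           -3   -1   -1    0    2

PTR =
    DIAG, if case 1
    UP, if case 2
    LEFT, if case 3
Optimal Alignment
A_TA
AGTA
Score: F(3, 4) = 2 (three matches and one gap; the cells on the traceback path hold 1, 0, 1, 2)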
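As a rough illustration, the global-alignment recurrence and traceback can be sketched in Python. This is a minimal sketch and not the presenters' Hadoop code: the function name `needleman_wunsch`, its parameters, and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for illustration. The first row and column are filled with -j*d and -i*d so that leading gaps are penalized, and the alignment score is read from F(M, N).

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment; returns (score, aligned_x, aligned_y, F)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]

    # Initialization: leading gaps cost d each.
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration over the three cases.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: xi aligns to yj
                (F[i - 1][j] - d, "UP"),        # case 2: xi aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: yj aligns to a gap
            ]
            # max() keeps the first maximum, so ties prefer DIAG (assumption).
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from (M, N) to (0, 0), following the pointers.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay)), F


score, ax, ay, F = needleman_wunsch("ATA", "AGTA")
print(ax)     # A_TA
print(ay)     # AGTA
print(score)  # 2
```

For x = ATA and y = AGTA the sketch gives F(1, 1) = 1 and F(1, 2) = 0, and the traceback yields A_TA over AGTA.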
Smith Waterman Algorithm

Initialization:
F(0, j) = F(i, 0) = 0
Iteration:
F(i, j) = max of
    0
    F(i-1, j-1) + s(xi, yj), case 1
    F(i-1, j) - d, case 2
    F(i, j-1) - d, case 3
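Smith Waterman Algorithm

Example: x = ATA (rows, index i), y = AGTA (columns, index j)
s(xi, yj) = +1 for a match, -1 for a mismatch; d = 1

F(1, 1) = max of
    F(0, 0) + s(x1, y1) = 1
    F(0, 1) - 1 = -1
    F(1, 0) - 1 = -1
    0
  = 1 (case 1)
F(1, 3) = max of
    F(0, 2) + s(x1, y3) = -1
    F(0, 3) - 1 = -1
    F(1, 2) - 1 = -1
    0
  = 0
PTR =
    DIAG, if case 1
    UP, if case 2
    LEFT, if case 3
Optimal Alignment (aligned region: TA with TA)
A_TA
__TA
Score: F(3, 4) = 2 (the cells on the traceback path hold 1, 2)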
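The local-alignment variant can be sketched the same way. This is a minimal Python sketch under stated assumptions, not the presenters' implementation: the function name `smith_waterman` and its parameters are hypothetical, the traceback starts from the first cell holding the maximum value, and it stops at the first cell whose value is 0.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Local alignment; returns (score, aligned_x, aligned_y, F)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                    # start a new local alignment
                F[i - 1][j - 1] + s,  # case 1
                F[i - 1][j] - d,      # case 2
                F[i][j - 1] - d,      # case 3
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), F


score, ax, ay, F = smith_waterman("ATA", "AGTA")
print(ax, ay, score)  # TA TA 2
```

With x = ATA and y = AGTA, the largest cell is F(3, 4) = 2 and the traceback covers only the shared suffix TA.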
Proposed system
Input: one query file and a set of sequence files.
Put all files in DFS.
Map
Set the file name as key.
Pass the entire file contents as value.
Do sequence alignment of the query file with the target files in DFS.
Return (filename as key, score as value).
Reduce
Combine all the (K, V) pairs.
Output: (Filename, Score)
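The map and reduce steps above can be imitated locally in plain Python. This is only a sketch of the data flow and does not use the Hadoop API: the names `global_score`, `map_fn`, `reduce_fn`, and `run_job` are hypothetical, an in-memory dictionary stands in for files stored in DFS, and the global (Needleman-Wunsch) score with +1 match, -1 mismatch, d = 1 is assumed as the alignment score.

```python
def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping one row of F at a time."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xc in enumerate(x, start=1):
        cur = [-i * d]
        for j, yc in enumerate(y, start=1):
            s = match if xc == yc else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def map_fn(filename, contents, query):
    """Map: key = file name, value = whole file; emit (filename, score)."""
    yield filename, global_score(query, contents.strip())


def reduce_fn(pairs):
    """Reduce: combine all (K, V) pairs into the final output."""
    result = {}
    for filename, score in pairs:
        result[filename] = score
    return result


def run_job(files, query):
    """Local stand-in for the job: run every map call, then one reduce."""
    mapped = []
    for filename, contents in files.items():
        mapped.extend(map_fn(filename, contents, query))
    return reduce_fn(mapped)


# Hypothetical target files; real input would be read from DFS.
files = {"target1.txt": "AGTA\n", "target2.txt": "ATA\n"}
print(run_job(files, "ATA"))  # {'target1.txt': 2, 'target2.txt': 3}
```

Because each target file is scored independently of the others, the map calls can run in parallel on different nodes, which is the point of using file name and file contents as the (key, value) pair.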
A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
In general, the input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Methods for producing MSA
Dynamic programming
Progressive alignment construction
Dynamic programming
The most direct method for producing an MSA; it identifies the globally optimal alignment solution.
Computational complexity
For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment.
The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length.
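To make the growth concrete, the size of the n-dimensional matrix can be computed directly. The short Python sketch below assumes all n sequences have the same length L, so the matrix has (L + 1)^n cells, and each cell is filled from up to 2^n - 1 neighboring cells (every combination of "advance or gap" per sequence except "gap in all"). The function names are hypothetical.

```python
def dp_cells(n, length):
    """Number of cells in the n-dimensional alignment matrix."""
    return (length + 1) ** n


def predecessors_per_cell(n):
    """Neighboring cells examined when filling one cell."""
    return 2 ** n - 1


for n in (2, 3, 4):
    print(n, dp_cells(n, 100), predecessors_per_cell(n))
# 2 10201 3
# 3 1030301 7
# 4 104060401 15
```

Going from two to four sequences of length 100 multiplies the number of cells by roughly ten thousand, which is why heuristic methods are used in practice.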
Progressive alignment construction
Uses a heuristic search.
Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
The most popular progressive alignment method has been ClustalW.
All progressive alignment methods require two stages:
A first stage in which the relationships between the sequences are represented as a tree, called a guide tree.
A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
First step: the guide tree is computed from pairwise alignment scores by an efficient clustering method such as neighbor-joining.
Second step: the two most similar sequences are aligned first; additional sequences (or groups of sequences) are added later, following the guide tree.
This requires a method to optimally align a sequence with an alignment, or an alignment with an alignment.
[Figure: guide tree with leaves sequence 1, sequence 2, sequence 3, sequence 4]
Example: According to the guide tree, first align sequences 1 and 2, then align sequence 3 to the alignment of sequences 1 and 2, then sequence 4 to the alignment of sequences 1, 2, and 3.
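The order in which a guide tree dictates the merges can be read off with a post-order traversal. The Python sketch below is illustrative only: it assumes the guide tree is written as nested pairs (a hypothetical representation), and it lists the merge steps without performing any alignment.

```python
def merge_order(tree):
    """Return the list of (left, right) merges implied by a guide tree.

    A leaf is any non-tuple value; an internal node is a (left, right) pair.
    """
    steps = []

    def visit(node):
        if not isinstance(node, tuple):
            return str(node)
        left, right = node
        left_label = visit(left)    # children are resolved first,
        right_label = visit(right)  # so deeper merges come earlier
        steps.append((left_label, right_label))
        return f"({left_label}+{right_label})"

    visit(tree)
    return steps


# Guide tree from the example: ((1, 2), 3), 4
tree = ((("seq1", "seq2"), "seq3"), "seq4")
for left, right in merge_order(tree):
    print("align", left, "with", right)
# align seq1 with seq2
# align (seq1+seq2) with seq3
# align ((seq1+seq2)+seq3) with seq4
```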
Neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees.
Neighbor-joining is an iterative algorithm. Each iteration consists of the following steps:
Based on the current distance matrix, calculate the matrix Q.
For example, if we have four taxa (A, B, C, D) and the following distance matrix:
[Figure: distance matrix for taxa A, B, C, D]
We obtain the following values for the Q matrix:
[Figure: Q matrix for taxa A, B, C, D]
Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e. join the closest neighbors, as the algorithm name implies).
Calculate the distance of each of the taxa in the pair to this new node.
Calculate the distance of all taxa outside of this pair to the new node.
Start the algorithm again, considering the pair of joined neighbors as a single taxon and using the distances calculated in the previous step.
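One iteration of these steps can be sketched in Python using the standard neighbor-joining formulas: Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k); the distance from taxon f to the new node u is d(f, g) / 2 + (sum_k d(f, k) - sum_k d(g, k)) / (2 (n - 2)); and the distance from u to any other taxon k is (d(f, k) + d(g, k) - d(f, g)) / 2. The distance matrix below is a made-up example, not the one from the slides, and the function names are hypothetical.

```python
def q_matrix(d):
    """Q(i, j) = (n - 2) * d(i, j) - rowsum(i) - rowsum(j)."""
    n = len(d)
    totals = [sum(row) for row in d]
    return [
        [0 if i == j else (n - 2) * d[i][j] - totals[i] - totals[j]
         for j in range(n)]
        for i in range(n)
    ]


def closest_pair(q):
    """Pair (i, j), i < j, with the lowest Q value (first one on ties)."""
    n = len(q)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda p: q[p[0]][p[1]])


def join_pair(d, f, g):
    """Distances from f and g to the new node, and from the node to the rest."""
    n = len(d)
    totals = [sum(row) for row in d]
    dist_f = 0.5 * d[f][g] + (totals[f] - totals[g]) / (2 * (n - 2))
    dist_g = d[f][g] - dist_f
    others = {
        k: 0.5 * (d[f][k] + d[g][k] - d[f][g])
        for k in range(n) if k not in (f, g)
    }
    return dist_f, dist_g, others


# Hypothetical distances between taxa A, B, C, D (indices 0..3).
d = [
    [0, 5, 8, 5],
    [5, 0, 9, 6],
    [8, 9, 0, 5],
    [5, 6, 5, 0],
]
q = q_matrix(d)
f, g = closest_pair(q)
print(q[0][1], (f, g))     # -28 (0, 1)
print(join_pair(d, f, g))  # (2.0, 3.0, {2: 6.0, 3: 3.0})
```

In this example A and B are joined first, at distances 2 and 3 from the new node, and the new node replaces them in a 3 x 3 matrix for the next iteration.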
The primary problem with progressive alignment is that when errors are made at any stage in growing the MSA, these errors are propagated through to the final result.
Performance is also particularly bad when all of the sequences in the set are rather distantly related.
Phylogenetic Analysis
An investigation of evolutionary relationships among a group of related sequences by producing a tree representation of those relationships.
Significant use: making predictions concerning the tree of life.
Structure
Outer branches -> sequences
Inner part -> reflects the degree to which sequences are related
Alike sequences -> located at neighboring outer branches
Less related sequences -> more distant from each other
Proposed System
Implementation of sequence alignment and phylogenetic prediction using the MapReduce programming model in Hadoop.
Algorithms used for alignment
Global: Needleman-Wunsch algorithm
Local: Smith-Waterman algorithm
Proposed system
Input: set of sequence files.
Put all files in DFS.
Map
Set the file name as key.
Pass the entire file contents as value.
Do sequence alignment of all the files in all possible combinations and find the alignment scores.
Return (filename as key, score as value).
Reduce
Combine all the (K, V) pairs.
Output: (Filename, Score)
Phylogenetic Analysis
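The all-combinations step can be sketched locally with `itertools.combinations`. This is a plain-Python illustration, not Hadoop code: the names `local_score` and `all_pairs_scores` are hypothetical, a dictionary stands in for files in DFS, the Smith-Waterman score (+1 match, -1 mismatch, d = 1) is assumed as the alignment score, and the key is taken to be the pair of file names, since each score belongs to two files.

```python
from itertools import combinations


def local_score(x, y, match=1, mismatch=-1, d=1):
    """Smith-Waterman score only: the largest cell value in F."""
    prev = [0] * (len(y) + 1)
    best = 0
    for xc in x:
        cur = [0]
        for j, yc in enumerate(y, start=1):
            s = match if xc == yc else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        best = max(best, max(cur))
        prev = cur
    return best


def all_pairs_scores(files):
    """Score every unordered pair of files; key = (name_a, name_b)."""
    scores = {}
    for a, b in combinations(sorted(files), 2):
        scores[(a, b)] = local_score(files[a].strip(), files[b].strip())
    return scores


# Hypothetical sequence files; real input would be read from DFS.
files = {"a.txt": "ATA", "b.txt": "AGTA", "c.txt": "GGG"}
for pair, score in all_pairs_scores(files).items():
    print(pair, score)
# ('a.txt', 'b.txt') 2
# ('a.txt', 'c.txt') 0
# ('b.txt', 'c.txt') 1
```

A table of pairwise scores like this is the kind of input a guide-tree or neighbor-joining step needs, once the scores are converted to distances.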
The MapReduce algorithm for pairwise sequence alignment, both local and global, was completed using the Needleman-Wunsch and Smith-Waterman algorithms in Hadoop.
This can be extended to do multiple sequence alignment and to perform phylogenetic analysis in Hadoop for predicting possible evolutionary relationships among a group of related sequences.
Thank you
