CSE 6411 Read Mapping Notes

CSE 6411
Computational Biology
Read mapping (Data structures)
Atif Hasan Rahman

CSE, BUET
April, 2019
Short read mapping
• Short read
– Second generation or mainly Illumina reads
– 35-300bp
• Mapping
– Finding the positions in the genome reads are coming from
• Applications
– Reference guided genome assembly
– Genotyping
– Analysis of *-seq data
Reference guided assembly
Exact pattern matching
• Given a pattern (read) and a text (genome), find all
positions in the text that matches the pattern?
• This leads us to the Pattern Matching Problem.

Pattern Matching Problem
• Goal: Find all occurrences of a pattern in a text.
• Input:
– Pattern p = p1…pm of length m
– Text t = t1…tn of length n
• Output: All positions 1< i < (n – m + 1) such that the

m-letter substring of text t starting at i matches the
pattern p.
Adapted from slides available at
www.bioalgorithms.info
Exact Pattern Matching: Brute Force
PatternMatching(p,t)
1 m  length of pattern p
2 n  length of text t
3 for i  1 to (n – m + 1)
4 if ti…ti+m-1 = p
5 output i

Exact Pattern Matching: An Example
• PatternMatching algorithm for:
– Pattern GCAT
– Text AGCCGCATCT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

Exact Pattern Matching: Running Time
• PatternMatching runtime: O(nm)
– In the worst case, we have to check m characters at each of
the n letters of the text.
– Example: Text = AAAAAAAAAAAAAAAA, Pattern = AAAC
• This is rare; on average, the run time is more like

O(m).
– Rarely will there be close to m comparisons at each step.

Knuth-Morris-Pratt Algorithm
• Linear time algorithm
• Avoids comparisons with positions in the text that
we have already checked
• AGCAGCTAGCTAGT
• AGCTAGT
• Linear time algorithm
• Avoids comparisons with positions in the text that
we have already checked
• AGCAGCTAGCTAGT
• AGCTAGT
Pattern Matched so far
AGCTAGT AGCTAG
Proper prefix Proper suffix
• Compute length of the longest proper prefix of the

subpattern that matches a proper suffix of the
subpattern
Short read mapping
• KMP runs in O(n+m)
– Preprocesses the pattern
– To map k-reads O(k(n+m)) time is needed
• But typically we need to map many reads to a single
genome
• Better to index the genome
– Hash tables
– Suffix trees
– Suffix arrays
– BWT-FM index
Multiple Pattern Matching (MPM)
Problem
• Goal: Given a set of patterns and a text, find all occurrences of
any of patterns in text.
• Input: k patterns p1,…,pk, and text of n characters t = t1…tn
• Output: Positions 1 < i < n where the substring of text t

starting at i matches one of the patterns pj for 1 < j < k.

Keyword Trees: Example
• Keyword tree:
– Apple

• Keyword tree:
– Apple
– Apropos

• Keyword tree:
– Apple
– Apropos
– Banana

• Keyword tree:
– Apple
– Apropos
– Banana
– Bandana

• Keyword tree:
– Apple
– Apropos
– Banana
– Bandana
– Orange

Keyword Trees: Properties
– Stores a set of keywords in a
rooted labeled tree.
– Each edge is labeled with a
letter from an alphabet.
– Any two edges coming out of
the same vertex have distinct
labels.
– Every keyword stored can be
spelled on a path from root to
some leaf.
– Furthermore, every path from
root to leaf gives a keyword.
Keyword Trees: Threading
• Thread “appeal”
– appeal

– appeal

– appeal

– appeal

• Thread “apple”
– apple

– apple

– apple

– apple

– apple

Suffix Trees = Collapsed Keyword Trees
• Suffix Tree: Similar to keyword tree,
– Suffixes of the text are keywords
– Edges that form paths are collapsed
– Each edge is labeled with a substring of the text
– All internal edges have at least two outgoing edges.
– Leaves are labeled by the index of the pattern.

Suffix Tree of a Text
• The suffix trees of a text is constructed for all its
suffixes.
• Example: Suffix tree of ATCATG:
ATCATG Keyword Suffix

TCATG Tree Tree
CATG
ATG
TG
G
Suffix Tree of a Text
Keyword Tree Suffix Tree

Suffix Trees: Example
• Example: Suffix tree of ATGCATACATGG

Suffix Trees: Runtime
• Say the length of the text is n.
• Construction of the keyword tree takes O(n2) time.
• Constructing the suffix tree takes O(n) time.
• So our method takes O(n2 + n) = O(n2) time.
• However, there exists a method of calculating the suffix tree in O(n) time
without needing the keyword tree for the suffixes.
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Suffix Trees: Runtime
• Say the length of the text is n.
• Construction of the keyword tree takes O(n2) time.
• Constructing the suffix tree takes O(n) time.
• So our method takes O(n2 + n) = O(n2) time.
• However, there exists a method of calculating the suffix tree in O(n) time
without needing the keyword tree for the suffixes.
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Pattern Matching with Suffix Trees
• To find any pattern in a text:
1. Build suffix tree for text of length n—O(n) time
2. Thread the pattern of length m through the suffix tree of
the text—O(m) time.
• Therefore the runtime of the Pattern Matching

Problem is only
O(m + n)!

Pattern Matching with Suffix Trees
SuffixTreePatternMatching(p,t)
1. Build suffix tree for text t
2. Thread pattern p through suffix tree
3. if threading is complete
4. output positions of all p-matching leaves in the tree
5. else
6. output “Pattern does not appear in text”

Threading ATG through a Suffix Tree
ATGCATACATGG

Suffix Array
• More space efficient than suffix tree
– Suffix tree index for human genome is about 47GB
• Lexicographically sort all the suffixes
• Store the starting indices of the suffixes along with
the original string
Suffix Array
• ATCATG
1 ATCATG$ 7 $
2 TCATG$ Sort the suffixes 1 ATCATG$
3 CATG$ lexicographically 4 ATG$
4 ATG$ 3 CATG$
5 TG$ 6 G$
6 G$ 2 TCATG$
7 $ 5 TG$
Suffix Array
• There exists algorithms to construct suffix array in
O(n) time
• Searching is done using binary search
– Basic binary search would take O(mlogn)
– Can be done in O(m+logn)
Burrows-Wheeler Transform (BWT)
• Lexicographically sort all the rotations
Figure from slide by Ben Langmead

• Sort order is the same for rotations and suffixes
– Since $ precedes others

• Original motivation was
compression
• Rotating and sorting
tend to increase run
lengths
Figure from original paper on BWT

LF mapping
• Last to first mapping
$abaaba
a$abaab
aaba$ab
aba$aba
abaaba$
ba$abaa
baaba$a
LF mapping
• Assign numbers arbitrarily to distinguish characters
– a b0 a a b1 a $
$abaaba
a $ a b a a b1
a a b a $ a b0
aba$aba
abaaba$
b1 a $ a b a a
b0 a a b a $ a
LF mapping
• They will be in same order in L and F. Why?
– a b0 a a b1 a $
$abaaba
a $ a b a a b1
a a b a $ a b0
aba$aba
abaaba$
b1 a $ a b a a
b0 a a b a $ a
LF mapping
• Assign numbers to distinguish characters
– a0 b0 a1 a2 b1 a3 $
$ a b a a b a3
a3 $ a b a a b1
a1 a b a $ a b0
a2 b a $ a b a1
a0 b a a b a $
b1 a $ a b a a2
b0 a a b a $ a0
LF mapping

LF mapping

BWT reversing
• Reverse BWT from right to left
$ a0
a0 b0
a1 b1
a2 a1
a3 $
b0 a2
b1 a3
a3b1a1a2b0a0$
BWT reversing
• Reverse BWT from right to left
$ a
a b
a b
a a
a $
b a
b a
abaaba$
FM index
• FM index
– Combines the BWT with a few small auxiliary data
structures
– Ferragina and Manzini
• F and L from BWT
– Both can be compressed
FM index querying

FM index querying

FM index querying

FM index issues
• Where are the matches?
– Store positions as in suffix array

FM index issues
• How to get numbers (ranks) for LF mapping?
– Calculate numbers of each character up to that row

FM index
• FM index for human genome can be as small as 1.5
GB
– With some space-time tradeoff
Short read mapping tools
• Hash based
– MAQ
– SOAP
• BWT based
– Bowtie, Bowtie2
– SOAP2
– BWA
References
• Suffix Tree
– An Introduction to Bioinformatics Algorithms by Jones and
Pevzner (Chapter 9)
– Part I of Algorithms on Strings, Trees and Sequences by
Dan Gusfield
• BWT, FM index
– http://www.cs.jhu.edu/~langmea/resources/bwt_fm.pdf

CSE 6411 Read Mapping Notes

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

CSE 6411 Read Mapping Notes

Încărcat de

Drepturi de autor:

Formate disponibile

CSE 6411

Read mapping (Data structures)

Atif Hasan Rahman

• This leads us to the Pattern Matching Problem.

• Output: All positions 1< i < (n – m + 1) such that the

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

• This is rare; on average, the run time is more like

Adapted from slides available at

• Compute length of the longest proper prefix of the

• Input: k patterns p1,…,pk, and text of n characters t = t1…tn

• Output: Positions 1 < i < n where the substring of text t

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

• Example: Suffix tree of ATCATG:

ATCATG Keyword Suffix

Keyword Tree Suffix Tree

Adapted from slides available at

• Therefore the runtime of the Pattern Matching

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from original paper on BWT

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

S-ar putea să vă placă și