UMD Bioscience Nov2008

Ultrafast and memory-efficient alignment of short
DNA sequences to the human genome

Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg
Center for Bioinformatics and Computation Biology, University of Maryland, College Park, MD, USA
http://bowtie-bio.sourceforge.net
Abstract
Bowtie is an ultrafast, memory-efficient program for aligning short reads to
mammalian genomes. Burrows-Wheeler indexing allows Bowtie to align
more than 25 million reads per CPU hour to the human genome in a small
memory footprint: approximately 1.3 gigabytes. Bowtie extends previous
Burrows-Wheeler techniques with a novel quality-aware backtracking
algorithm that permits mismatches. Multiple processor cores can be used
simultaneously to achieve greater alignment speed. Bowtie is free, open
source software available for download from http://bowtie-bio.sourceforge.net
Bowtie Index
Bowtie builds a genome index using a scheme based on the Burrows-Wheeler Transform
(BWT) [1] and the FM Index [2]. The Burrows-Wheeler Transformation of a text T,
BWT(T), is constructed as shown below. The Burrows-Wheeler Matrix of T is the matrix
whose rows are all distinct cyclic rotations of T$ sorted lexicographically. BWT(T) is the
sequence of characters in the rightmost column of the matrix.
Burrows-Wheeler Transform (BWT)
Bowties search algorithm is based on the FM Index exact matching algorithm [2]. This
algorithm proceeds in rounds where progressive rounds calculate the range of matrix rows
beginning with progressively longer suffixes of the query string.
Results
The chart below shows performance for aligning 8.84 million 35-base-pair reads from the
1000 Genomes pilot (SRA Accession SRR001115) to the human genome (NCBI build 36.3).
351x
The range of rows under consideration at each step is delimited by the red and green
arrows. As the algorithm considers additional characters in the query (purple), the arrows
are updated using the LF mapping (blue). Rows still under consideration when all query
characters are exhausted correspond to reference positions matching the query (orange).
59 x
Sequence: G C C A T A C G G A T T A G C C
Phred Quals: 40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40
To allow mismatches in alignments,

Bowtie extends on the above algorithm
by backtracking to previously
matched query characters and making
substitutions. Bowtie searches the space
of legal alignments using a qualityaware, greedy, depth-first search.
(higher number =
higher confidence)
G C C A T A C G G A C T A G C C
40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40
G C C A T A C G G G C T A G C C
40 40 35 40 40 40 40 30 30 20 15 15 40 40 40 40
Bowtie aligns reads dramatically faster than Maq (http://maq.sf.net) and SOAP
(http://soap.genomics.org.cn), two competing short read alignment tools. The chart shows
throughput in reads/CPU hour, running Bowtie 0.9.6, SOAP 1.10, and Maq 0.6.6. Each
program was set to report at most one hit per read, a typical setting for resequencing
applications. Maq and SOAP were run with default parameters, and Bowtie was run with
parameters to mimic both programs reporting constraints. Bowtie and Maq were compared
on a 2.4GHz Intel Core 2 Duo workstation with 2GB of RAM.
Burrows-Wheeler Matrix
A Bowtie genome index consists of the BWT plus some auxiliary data structures. All told,
the index is only about 65% larger than the input genome. This is radically smaller than
other indexing schemes such as suffix trees, suffix arrays, and seed hash tables. This
allows Bowtie to run easily and efficiently on a typical computer with 2 gigabytes of
RAM.
Human genome memory footprint

Bowtie Index
1.3 gigabytes
Suffix Tree
>35 gigabytes
Suffix Array
>12 gigabytes
Hash Tables
>12 gigabytes
c
a
acaacg
||
aaa
acaacg
| |
aaa
acaacg
||
aaa
Above: Backtracking scenarios leading to 3 distinct 1-mismatch alignments of aaa
the ith occurrence of

character X in the
last column
The Burrows-Wheeler Matrix

has a property called the LF
mapping: the ith occurrence of
character X in the last column
corresponds to the same text
character as the ith occurrence
of X in the first column. This
property underlies algorithms
that use the BWT to navigate
or search the text.
A Bowtie index of the human genome can be

built on a computer with 2 GB of RAM in
less than a day (right), though indexing is
faster with more RAM (left).
Rightmost query positions are the likeliest backtracking targets because shorter suffixes are
likely to occur in the genome by chance. When more than one mismatch is allowed,
excessive backtracking can make search very slow. To avoid this, Bowtie uses double
indexing: Bowtie indexes the genome as well as the reverse (mirror image) of the genome.
Searching the mirror index with a mirror query is equivalent to a normal search that
proceeds left-to-right instead of right-to-left. Normal search combined with mirrored
search avoids most or all excessive backtracking without missing valid alignments.
reverse
Allow mismatches in
this part of the read
Dont backtrack to
positions in this region
of the read
gacgtacacggataccactggagacttagatcca
Matching
Above: double indexing applied to a 1-mismatch search. Searches allowing more than one
mismatch follow a similar pattern, though some excessive backtracking is unavoidable.
When aligning on computers with multiple

processor cores, Bowties throughput
increases significantly.
Reads aligned
(%)
Matching
acctagattcagaggtcaccataggcacatgcag
Bowtie Search Algorithm
corresponds to the
same text character as
the ith occurrence of X
in the first column
Bowtie index of the reference is substantially

smaller than SOAPs, which is too large to
run on most workstations. Maq achieves a
small footprint by indexing small batches of
reads.
Bowtie - SOAP-mode
67.4
SOAP
67.3
Bowtie - Maq-mode
71.9
Maq
74.7
Bowtie has comparable sensitivity

SOAP and Maq. Maq finds some
mismatch alignments when asked for
mismatch alignments, accounting for
slightly higher sensitivity.
to
32its
Because Bowtie builds a compact index of the genome, Bowtie indexes are easy to
distribute over the Internet and easy to reuse. Reuse allows the cost of building or
downloading the index to be amortized over many runs. Pre-built indexes for model
organisms including human, chimp, mouse and dog can be downloaded from http://bowtiebio.sourceforge.net
References
1. Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo
Alto, CA 1994, Technical Report 124
2. Ferragina P, Manzini G: Opportunistic data structures with applications. In: Proceedings of the 41st Annual
Symposium on Foundations of Computer Science. IEEE Computer Society; 2000

UMD Bioscience Nov2008

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

UMD Bioscience Nov2008

Încărcat de

Drepturi de autor:

Formate disponibile

Ultrafast and memory-efficient alignment of short

DNA sequences to the human genome

Burrows-Wheeler Transform (BWT)

To allow mismatches in alignments,

Human genome memory footprint

Above: Backtracking scenarios leading to 3 distinct 1-mismatch alignments of aaa

the ith occurrence of

The Burrows-Wheeler Matrix

A Bowtie index of the human genome can be

When aligning on computers with multiple

Bowtie Search Algorithm

Bowtie index of the reference is substantially

Bowtie has comparable sensitivity

S-ar putea să vă placă și