
Arc Diagrams and the Increasing Dot Plot: Visualizing Repetitions in Genomic Sequences

MXMLLN

Abstract
The Increasing Dot Plot is introduced as an alternative implementation method for Martin Wattenberg's "Arc
Diagrams: Visualizing Structure in Strings" [ARC02]. The technique can handle significantly larger sequences than
the suffix tree approach, while sacrificing only an uncommon use case that is more relevant for music visualizations.
The two techniques are compared, and the Increasing Dot Plot is used to visualize a diverse set of genomes,
chromosomes, genes, and proteins.
Keywords: visualization, text visualization, dot plot, bioinformatics, computational biology
Introduction and Previous Work
Wattenberg [ARC02] introduces an approach that has standardized how to visualize repetitions in strings.
Although the paper does apply the technique to DNA sequences, Wattenberg proposes that point mutations make the
approach ill-suited for this application. Nevertheless, a number of publications explore repetitions in genomic
sequences, including [SZC08] and Micropeats (1995). This conflicting evidence suggests that further exploration of this
space is needed.
Wattenberg implements Arc Diagrams using a suffix tree. However, this approach may not be well suited to
DNA sequences, whose individual chromosomes contain hundreds of millions of base pairs (Mb < n < Gb). Nevertheless,
its robustness has made it the standard algorithm for identifying repetitions in Bioinformatics. As a result, much research
has gone into finding more efficient ways of storing the strings at each suffix tree node to be able to handle full genomes
[HUO07].

The Increasing Dot Plot


Despite the success of suffix trees, their implementation is far more complex than the main technique they are
replacing: the Dot Plot. The Dot Plot visualizes point similarities in a sequence by creating an n x n matrix,
where n is the length of the input string, and comparing each character to every other character in the string. If two
characters match, a 1 is stored in the cell; otherwise, a 0 is stored. The Dot Plot is usually visualized in black and white,
for 1 and 0 respectively, and gives some impression of the similarity between different parts of the sequence. Although
the technique is still used in practice, its actual utility is very small. Nevertheless, the gains in insight from the suffix tree
are not proportionate to its increased complexity over the Dot Plot. Thus, a new technique was created.
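As an illustration of the classic Dot Plot described above, the following is a minimal sketch in Python (the paper's own implementation used Processing; the function name dot_plot and the text rendering are assumptions made for this example only):

```python
def dot_plot(seq):
    """Return an n x n matrix holding 1 where seq[x] == seq[y], else 0."""
    n = len(seq)
    return [[1 if seq[x] == seq[y] else 0 for y in range(n)] for x in range(n)]


# Render matches as '#' and mismatches as '.', one matrix row per line.
for row in dot_plot("ABAB"):
    print("".join("#" if cell else "." for cell in row))
```

Because every character is compared with every other character, both the running time and the stored matrix grow with the square of the sequence length.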
Inspired by the Needleman-Wunsch dynamic programming algorithm for sequence alignment, a variation
of the Dot Plot was created. Instead of simply entering a binary number at each matrix position, a simple function is
used that utilizes previous cell values (comparison(a) is the character comparison, where a is the binary result):
matrix[x, y] =
    0                          if comparison(0), i.e. the characters differ
    1 + matrix[x-1, y-1]       if comparison(1), i.e. the characters match

In other words, if two characters are equal, the comparison result is added to the previous entry on the diagonal;
otherwise a 0 is stored. Thus, in contrast to a binary matrix, the matrix holds a range of non-negative integer values,
where longer repetitions result in larger consecutive numbers along a single diagonal. This Increasing Dot Plot
technique can be visualized with a heat map. Figure 2 compares the Dot Plot with the Increasing Dot Plot variation using
protein d1btea_ 7.7.1.4.1 Extracellular domain of the type II activin receptor {Mouse (Mus musculus)}. Notice that the
heat map on the right, with the default parameters, reveals 4 different values, including the two visualized in the dot
plot on the left. The distracting main diagonal has also been removed in the Increasing Dot Plot implementation.
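The recurrence can be sketched as follows. This is a hedged Python illustration rather than the author's Processing code: the name increasing_dot_plot is hypothetical, and skipping cells where x == y is only one assumed way of removing the main diagonal mentioned above.

```python
def increasing_dot_plot(seq):
    """Full n x n Increasing Dot Plot: each match extends its diagonal by 1."""
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            # Assumption: the main diagonal (x == y) is left at 0.
            if x != y and seq[x] == seq[y]:
                prev = matrix[x - 1][y - 1] if x > 0 and y > 0 else 0
                matrix[x][y] = 1 + prev
    return matrix


for row in increasing_dot_plot("ABCABC"):
    print(" ".join(str(cell) for cell in row))
```

For the input "ABCABC" the repeated "ABC" appears as the values 1, 2, 3 running down one diagonal on each side of the main diagonal, which is what a heat map would render as an increasingly intense streak.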

Implementation: Dot Plot Space Limitations and Efficiencies


Dot Plots require O(n^2) for both time and space. Thus, without efficient memory management, Dot Plots are limited to
several thousand characters [Matlab was not able to handle sequences longer than 20KB]. Thankfully,
Increasing Dot Plots can drastically reduce their memory footprint to allow for sequences over 60MB, a 3,000-fold
increase.
Storing only the top half of the matrix, a common trick in symmetric matrix operations, allows for a small gain. Since the
top half of the matrix exactly mirrors the bottom half, only one needs to be computed. However, this method really only
improves time efficiency, reducing it by a constant factor of 1/2. The key to space efficiency is that, for repetitions, the
final number in an increasing chain has all the information needed to create an arc in the final visualization. For a
repetition of length 20, the diagonal increases from 0 to 20 in steps of one, and subtracting the final number from the
end index gives the starting index. As a result, an arc diagram can be created from an array storing start and end index
pairs. As long as the arc array's size is insignificant compared to the time complexity, its additional memory
requirements are negligible. This constraint can be encoded by setting a minimum repetition length, which also serves
to remove noise and clean the data.
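To make the index arithmetic concrete, here is a small Python sketch that recovers repetitions from the values along a single diagonal. The function name runs_from_diagonal, the 0-based indexing, and the use of an exclusive end index (the first position after the run) are assumptions chosen for this example; with an inclusive end index the formula would need an extra + 1.

```python
def runs_from_diagonal(diagonal, min_length):
    """Recover (start, length) pairs from the values along one diagonal.

    The last value of an increasing chain equals the length of the run, so
    with an exclusive end index the start is simply end - length.
    """
    arcs = []
    for i, value in enumerate(diagonal):
        is_last = i + 1 == len(diagonal)
        run_ends_here = value > 0 and (is_last or diagonal[i + 1] == 0)
        if run_ends_here and value >= min_length:
            end = i + 1  # exclusive end index of the run
            arcs.append((end - value, value))
    return arcs


# Two runs: length 3 starting at index 1, and length 2 starting at index 6.
print(runs_from_diagonal([0, 1, 2, 3, 0, 0, 1, 2], min_length=2))
```

Raising min_length to 3 in this example drops the shorter run, which is the noise-filtering effect of the minimum repetition length.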
In order to utilize only the repetition information, all other data is discarded. The comparison function only
requires information from the previous row, namely the value along the diagonal. When a match is found, the previous
cell is added. When a match is not found and the previous cell is at least the minimum repetition size, its data is added
to the arc array. In addition, the final cell in the row must also be checked for a repetition, since it can no longer continue.
At this point, all the information from the previous row has been used, is no longer needed, and can be swapped with
the current row to save space. Consequently, the Increasing Dot Plot really only uses 3 n-sized elements: two
one-dimensional matrices and the original string. This results in O(n) [linear] space complexity.
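The two-row scheme described above might look like the following Python sketch. It is an illustration under stated assumptions, not the original Processing source: the name find_repetitions, the 0-based indices, and the (first_start, second_start, length) tuple layout are all hypothetical.

```python
def find_repetitions(seq, min_length):
    """Scan the upper triangle row by row, keeping only two rows in memory.

    Returns (first_start, second_start, length) for each repeated run of at
    least min_length characters.
    """
    n = len(seq)
    prev = [0] * n  # row x-1 of the conceptual matrix
    curr = [0] * n  # row x of the conceptual matrix
    arcs = []
    for x in range(n):
        for y in range(x + 1, n):
            diag = prev[y - 1]  # value at matrix[x-1][y-1]
            if seq[x] == seq[y]:
                curr[y] = diag + 1
                # Final cell in the row: the run cannot continue further.
                if y == n - 1 and curr[y] >= min_length:
                    arcs.append((x - curr[y] + 1, y - curr[y] + 1, curr[y]))
            else:
                curr[y] = 0
                # The run on this diagonal ended at (x-1, y-1).
                if diag >= min_length:
                    arcs.append((x - diag, y - diag, diag))
        # The previous row is no longer needed, so reuse its storage.
        prev, curr = curr, prev
    return arcs


print(sorted(find_repetitions("ABCxABCyABC", min_length=3)))
```

A substring that occurs three times yields three pairwise arcs, as with the triple repetition listed in the appendix. Only the two rows, the arc list, and the input string are held in memory, while the running time remains quadratic.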
Other Implementation Details
The Increasing Dot Plot was implemented in Processing, chosen for its rapid prototyping prowess, with available
memory increased to 1GB. The program was tested on a low-end laptop (2.13 GHz processor with 4 GB of RAM). Arc
diagrams were drawn with the specifications outlined in [ARC02], though red was used as the arc fill color to
differentiate the results. Unlike Wattenberg's version, sequence characters are shown if n <= 125 using a standard font
size (16). Larger sequences show index labels instead, and the sequence is automatically scaled down to fit the maximum
window size. Additionally, all arc data was output to a supplemental text file, including the repeating sequence, its
length, and the start indices of both occurrences. This data file is essential for referencing repetition data when the
sequence is not shown.
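A supplemental file of this kind could be produced along the following lines. This Python sketch mirrors the layout of the example file in the appendix (a base count, a repetition count, then one line per arc); the function name format_arc_report, the tuple layout, and the 0-based indices are assumptions, since the paper does not state which index base the original program used.

```python
def format_arc_report(seq, arcs):
    """Format arcs as 'repeat(length): first_start, second_start' lines.

    arcs is a list of (first_start, second_start, length) tuples.
    """
    lines = [f"Bases: {len(seq)}", f"Repetitions: {len(arcs)}", ""]
    for first, second, length in sorted(arcs):
        repeat = seq[first:first + length]
        lines.append(f"{repeat}({length}): {first}, {second}")
    return "\n".join(lines)


print(format_arc_report("ABCxABC", [(0, 4, 3)]))
```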

Results: Biological Arc Diagrams


The original plan was to survey genomes of viruses and the biological kingdoms. Although the program was able
to completely store a sequence of over 60MB (chromosome 11 of Equus caballus: NC_009154.2), actually creating a
diagram would take too long. For example, the complete virus genome of Murid Herpesvirus 1 (230,278 base pairs (bp))
takes approximately 8 minutes to run (Figure gi|21716071 above). Sequences of even 1 or 2 million characters already
take many hours to complete, due to the O(n^2) time complexity. Thus, the original plan of sampling a large set of
genomes was too ambitious. Instead, a small set of genomes, genes, and proteins is shown, where genome and gene
inputs are mostly nucleotide bases and protein inputs are exclusively amino acids.

As with many bioinformatics programs, choosing an appropriate minimum repetition length parameter is
extremely important. The length largely determines how many arcs the diagram will include. Another parameter
was introduced to make sure a sufficient number of repetitions are found. If the minimum number of repetitions is not
reached, then the program decrements the minimum length by one and restarts. Smaller protein sequences (n < 1,000)
are run with minimum:3, repetitions:0 (Sequence d1ush_2). For these amino acid sequences, the program often cannot
find repetitions on the first pass and automatically reduces the minimum to length 2, for which there are too many
results. Consequently, proteins might not be a good target for this application. All larger sequences (n > 1,000) were run
with minimum:20, repetitions:10. These parameters usually work, except in the case below of the fungus
Encephalitozoon intestinalis ATCC 50506 chromosome I (NC_014415.1, 160332 bp), where the program had to reduce
the minimum down to length 15, after which 27 repetitions were found. Creating a function to compute the default
parameters for various sequence lengths would have been useful, but was outside the scope of the project.
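The decrement-and-restart behavior can be sketched as below. This is a hedged Python illustration: search_with_fallback and the brute-force stand-in naive_repeats are hypothetical names, and the stopping rule (continue until the count exceeds the repetitions parameter) is an assumption inferred from the protein example, where repetitions:0 still triggers a restart when nothing is found.

```python
def naive_repeats(seq, min_length):
    """Hypothetical stand-in finder: brute-force maximal repeated runs."""
    n = len(seq)
    arcs = []
    for x in range(n):
        for y in range(x + 1, n):
            # Only start a run where the preceding pair does not match.
            if x > 0 and seq[x - 1] == seq[y - 1]:
                continue
            length = 0
            while y + length < n and seq[x + length] == seq[y + length]:
                length += 1
            if length >= min_length:
                arcs.append((x, y, length))
    return arcs


def search_with_fallback(seq, min_length, min_repetitions, find=naive_repeats):
    """Lower min_length by one and restart until enough repetitions appear."""
    while min_length >= 1:
        arcs = find(seq, min_length)
        # Assumption: the count must exceed the threshold to stop.
        if len(arcs) > min_repetitions:
            return min_length, arcs
        min_length -= 1
    return 0, []


print(search_with_fallback("ABCxxABC", min_length=5, min_repetitions=0))
```

Each restart repeats the full quadratic scan, so a poorly chosen starting length multiplies the running time; this is one reason a length-dependent default would have been useful.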

Discussion: The Increasing Dot Plot vs. Arc Diagrams


As described in [ARC02], traditional Arc Diagrams do not visualize every pair of repetitions. Their
implementation and definitions were a bit confusing, but would probably amount to not showing the middle two arcs
connecting the first and third set of AABB, as well as the second and fourth set (Sequence 01010101 below).
Wattenberg describes making two passes through the suffix tree to get the final visualization. Making another pass
through the arc array may eliminate this difference between the techniques.

The Shape of Song visualizations produced by Arc Diagrams with music scores as input are clearly different
from those of biological sequences. Firstly, natural strings do not have the layering seen in music, where a song has
short consecutive repetitions as well as much larger repetitions. Biological text seems to have far fewer repetitions and
very few repetitions between multiple pairs. What results is simply a less attractive, purely practical image. In addition,
the significantly greater length of genomes may hide elements of the visualization: Murid Herpesvirus 1 (Figure
gi|21716071, previously shown) has 51 repetitions, but only two larger arcs are visible. Additionally, variation in
repetition length is mostly invisible for long sequences and large repetition lengths. Regardless, both pictures do present
a starting point for analysis.

Conclusion
Arc Diagrams are very relevant to genomic sequences, contrary to what Wattenberg suggested. With respect to
the new technique introduced, Increasing Dot Plots are significantly more informative than standard Dot Plots. In
practice, hopefully Increasing Dot Plots will replace their predecessors altogether, except for a few small applications.
Nevertheless, Increasing Dot Plots are just one additional tool in the much larger Bioinformatics toolbox.
References:

[ARC02]: Wattenberg, M. (2002) Arc Diagrams: Visualizing Structure in Strings. Proceedings of the IEEE
Symposium on Information Visualization. IEEE Computer Society. (http://www.turbulence.org/works/Song)

[SZC08]: Szczesny, P., and A. Lupas. 2008. Domain annotation of trimeric autotransporter adhesins - daTAA.
Bioinformatics 24:1251-1256.

[HUO07]: Hongwei Huo and Vojislav Stojkovic, A Suffix Tree Construction Algorithm for DNA Sequences, IEEE
7th International Symposium on BioInformatics & BioEngineering. Harvard School of Medicine, Boston, MA,
October 14-17, Vol. II, pp. 1178-1182, 2007.

Appendix: Example arc array reference file (minimum repetition length: 3) with the triple repetition lls
d1ush_2 4.145.1.2.1 (26-362) 5'-nucleotidase (syn. UDP-sugar hydrolase), N-terminal domain {Escherichia coli}

Bases: 337
Repetitions: 12

tvl(3): 9, 98
eyg(3): 24, 27
aae(3): 45, 234
lls(3): 53, 111
lls(3): 53, 322
ign(3): 88, 152
lls(3): 111, 322
lfk(3): 125, 131
efr(3): 162, 275
kpd(3): 182, 250
nge(3): 197, 278
aen(3): 235, 314
