Abstract
The Increasing Dot Plot is introduced, an alternative implementation method to Martin Wattenberg's Arc
Diagrams: Visualizing Structure in Strings [ARC02]. The technique is able to handle significantly larger sequences than
the suffix tree approach, while sacrificing only an uncommon use case more relevant for music visualizations. The two
techniques are compared and the Increasing Dot Plot is used to visualize a diverse set of genomes, chromosomes, genes,
and proteins.
Keywords: visualization, text visualization, dot plot, bioinformatics, computational biology
Introduction and Previous Work
Wattenberg [ARC02] introduced an approach that has become a standard way to visualize repetitions in strings.
Although the paper does apply the technique to DNA sequences, the author suggests that point mutations make the
approach ill-suited for this application. Nevertheless, a number of publications explore repetitions in genomic
sequences, including [SZC08] and Micropeats (1995). This conflicting evidence suggests that further exploration of this
space is needed.
Wattenberg implements Arc Diagrams using a suffix tree. However, this approach may not be well suited to
DNA sequences, whose individual chromosomes contain hundreds of millions of base pairs (Mb < n < Gb). Nevertheless,
its robustness has made it the standard algorithm for identifying repetitions in Bioinformatics. As a result, much research
has gone into finding more efficient ways of storing the strings at each suffix tree node to be able to handle full genomes
[HUO07].
comparison(0): 0
comparison(1): 1 + matrix[x-1, y-1]
In other words, if the two characters being compared are equal, the entry is one plus the previous entry on the same
diagonal; otherwise a 0 is stored. Thus, in contrast to a binary matrix, the matrix holds a range of non-negative integer
values, where longer repetitions produce increasing consecutive numbers along a single diagonal. This Increasing Dot Plot
technique can be visualized with a heat
map. Figure 2 compares the Dot Plot with the Increasing Dot Plot variation using protein d1btea_ 7.7.1.4.1 Extracellular
domain of the type II activin receptor {Mouse (Mus musculus)}. Notice the heat map on the right with the default
parameters reveals 4 different values including the two visualized in the dot plot on the left. The distracting main
diagonal has also been removed in the Increasing Dot Plot implementation.
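The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the figures: the function name increasing_dot_plot and the choice to store the full n-by-n matrix are assumptions made for clarity, and the full matrix is only practical for short sequences.

```python
def increasing_dot_plot(seq):
    """Build the Increasing Dot Plot matrix for a sequence compared with itself.

    Hypothetical helper: an entry is 1 + the previous entry on the same
    diagonal when the two characters match, and 0 otherwise.
    The main diagonal is left at 0, mirroring its removal in the text.
    """
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y:
                continue  # suppress the trivial self-match diagonal
            if seq[x] == seq[y]:
                previous = matrix[x - 1][y - 1] if x > 0 and y > 0 else 0
                matrix[x][y] = 1 + previous
    return matrix


if __name__ == "__main__":
    for row in increasing_dot_plot("abcab"):
        print(row)
```

In this sketch a repetition of length k shows up as the values 1, 2, ..., k along one diagonal, which is what the heat map renders as increasing intensity.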
As with many bioinformatics programs, choosing an appropriate minimum repetition length parameter is
extremely important. The length largely determines how many arcs the diagram will include. Another parameter
was introduced to make sure a sufficient number of repetitions are found. If the minimum number of repetitions is not
reached, then the program decrements the minimum length by one and restarts. Smaller protein sequences (n < 1,000)
are run with minimum:3, repetitions:0 (Sequence d1ush_2). These amino acid sequences often yield no
repetitions on the first pass and are automatically reduced to length 2, for which there are too many results.
Consequently, proteins might not be a good target for this application. All larger sequences (n > 1,000) were run with
minimum:20, repetitions:10. These parameters usually work, except in the case below of the fungus Encephalitozoon
intestinalis ATCC 50506 chromosome I (NC_014415.1, 160332 bp), where the program had to reduce the minimum down
to length 15, after which 27 repetitions were found. Creating a function to compute the default parameters for various
sequence lengths would have been useful, but was outside the scope of the project.
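The parameter fallback described above can be sketched as follows. The names find_repetitions and find_with_fallback are hypothetical, positions are 0-based, and two details are assumptions rather than facts from the project: a repetition is reported once per maximal diagonal run, and the threshold counts as "reached" only when strictly more repetitions than the threshold are found (one reading consistent with the repetitions:0 protein runs falling back to length 2).

```python
def find_repetitions(seq, min_len):
    """Return (substring, start_a, start_b) for each maximal repeated run.

    Only the upper triangle (y > x) is scanned, so the main diagonal is
    skipped and each pair is reported once. Two rows are kept in memory
    instead of the full matrix.
    """
    n = len(seq)
    prev = [0] * n
    found = []
    for x in range(n):
        cur = [0] * n
        for y in range(x + 1, n):
            if seq[x] != seq[y]:
                continue
            cur[y] = 1 + prev[y - 1]  # extend the run on this diagonal
            # The run ends here if the next diagonal cell is out of range
            # or holds a mismatch.
            run_ends = y + 1 >= n or seq[x + 1] != seq[y + 1]
            if run_ends and cur[y] >= min_len:
                length = cur[y]
                start_a = x - length + 1
                start_b = y - length + 1
                found.append((seq[start_a:start_a + length], start_a, start_b))
        prev = cur
    return found


def find_with_fallback(seq, min_len, min_reps):
    """Decrement the minimum length until enough repetitions are found."""
    length = min_len
    while True:
        reps = find_repetitions(seq, length)
        # Assumption: "reached" means strictly more than min_reps.
        if len(reps) > min_reps or length <= 1:
            return length, reps
        length -= 1


if __name__ == "__main__":
    print(find_with_fallback("abcxabcyab", 4, 0))
```

Each returned tuple corresponds to one arc between two occurrences of the same substring. Overlapping occurrences (for example in a run such as "aaaa") are not filtered in this sketch.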
The Shape of Song visualizations produced by the Arc Diagrams with music scores as input are clearly different
from those of biological sequences. Firstly, natural strings do not have the layering seen in music, where the song has
short consecutive repetitions, as well as much larger repetitions. Biological text seems to have far fewer repetitions and
very few repetitions between multiple pairs. What results is simply a less attractive, purely practical image. In addition,
the significantly longer length of genomes may hide elements of the visualization: Murid Herpesvirus 1 (Figure
gi|21716071, previously shown) has 51 repetitions, but only two larger arcs are visible. Additionally, variation in
repetition length is mostly invisible for long sequences and large repetition lengths. Regardless, both pictures do present
a starting point for analysis.
Conclusion
Arc Diagrams are very relevant to genomic sequences, contrary to what Wattenberg suggested. With respect to
the new technique introduced, Increasing Dot Plots are significantly more informative than standard Dot Plots. In
practice, hopefully Increasing Dot Plots will replace their predecessors altogether, except for a few small applications.
Nevertheless, Increasing Dot Plots are just one additional tool in the much larger Bioinformatics toolbox.
References:
[ARC02]: Wattenberg, M. (2002) Arc Diagrams: Visualizing Structure in Strings. Proceedings of the IEEE
Symposium on Information Visualization. IEEE Computer Society. (http://www.turbulence.org/works/Song)
[SZC08]: Szczesny, P., and A. Lupas. 2008. Domain annotation of trimeric autotransporter adhesins - daTAA.
Bioinformatics 24:1251-1256.
[HUO07]: Hongwei Huo and Vojislav Stojkovic, A Suffix Tree Construction Algorithm for DNA Sequences, IEEE
7th International Symposium on BioInformatics & BioEngineering. Harvard School of Medicine, Boston, MA,
October 14-17, Vol. II, pp. 1178-1182, 2007.
Appendix: Example arc array reference file (minimum repetitions: 3) with the triple repetition lls
d1ush_2 4.145.1.2.1 (26-362) 5'-nucleotidase (syn. UDP-sugar hydrolase), N-terminal domain {Escherichia coli}
Bases: 337
Repetitions: 12
tvl(3): 9, 98
eyg(3): 24, 27
aae(3): 45, 234
lls(3): 53, 111
lls(3): 53, 322
ign(3): 88, 152
lls(3): 111, 322
lfk(3): 125, 131
efr(3): 162, 275
kpd(3): 182, 250
nge(3): 197, 278
aen(3): 235, 314
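The listing above can be read back into arc endpoints with a short parser. This is a sketch that assumes each arc line has the form substring(length): position, position exactly as shown, and the name parse_arcs is hypothetical. Header lines such as "Bases" and "Repetitions" are ignored because they do not match the pattern.

```python
import re

# Matches lines such as "lls(3): 53, 111"
ARC_LINE = re.compile(r"^(\w+)\((\d+)\):\s*(\d+),\s*(\d+)$")


def parse_arcs(text):
    """Return a list of (substring, length, position_a, position_b) tuples."""
    arcs = []
    for line in text.splitlines():
        match = ARC_LINE.match(line.strip())
        if match:
            substring, length, pos_a, pos_b = match.groups()
            arcs.append((substring, int(length), int(pos_a), int(pos_b)))
    return arcs


if __name__ == "__main__":
    sample = "Bases: 337\nlls(3): 53, 111\nlls(3): 53, 322\nlls(3): 111, 322\n"
    arcs = parse_arcs(sample)
    # Collect every position at which "lls" occurs.
    positions = sorted({p for sub, _, a, b in arcs if sub == "lls" for p in (a, b)})
    print(positions)
```

Collecting the positions per substring makes the triple repetition visible: the three pairwise lls entries share three distinct positions.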