A Tile-based Parallel Global Algorithm for Biological Sequence Alignment on Multi-core Architecture
D. D. Shrimankar and S. R. Sathe, Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440010 (India)
Abstract: The Global algorithm is a compute-intensive sequence alignment method. In this paper, we investigate extending several parallel methods, such as the wave-front method, to the Global algorithm to achieve a significant speed-up on a multi-core processor. The simple parallel wave-front method can take advantage of the computing power of the multi-core processor, but it cannot handle long sequences because of the physical main memory limit. On the other hand, the tiling method can process long sequences, but with increased overhead due to the increased data transmission between main memory and secondary memory. To further improve performance on a multi-core processor, we propose a new tile-based parallel algorithm. We take advantage of homological segments to divide long sequences into many short pieces, so that each piece pair (tile) can be fully held in main memory. To work within the main memory limit, only the required values, rather than all computed values, are stored in a buffer, and an offset is used to retrieve them when needed. The experimental results show that our new tile-based parallel algorithm outperforms the simple parallel wave-front method. In particular, for the long-sequence alignment problem, the best performance of the tile-based algorithm is on average about an order of magnitude faster than the serial Global algorithm.

Keywords: global algorithm, OpenMP, tile, multi-core processor
I. INTRODUCTION
Biological sequence alignment is very important in homology modeling, phylogenetic tree reconstruction, sub-family classification, and identification of critical residues. When aligning multiple sequences, the cost of computation and storage of traditional dynamic programming algorithms becomes unacceptable on current computers. Many heuristic algorithms for the multiple sequence alignment problem run in a reasonable time with some loss of accuracy. However, with the growth of the volume of sequence data, the heuristic algorithms are still very costly in performance. There are generally two approaches to parallelizing sequence alignment. One is a coarse-grained method, which tries to run multiple sequence alignment subtasks on different processors (for example, see [1, 2, 3]) or to optimize a single pair-wise alignment consisting of multiple parallel sub-tasks, such as [4, 5, 6]. The communication among those sub-tasks is critical for the performance of the coarse-grained method. The other is a finer-grained method, which focuses on parallelizing the operations on smaller data components. Typical implementations of fine-grained methods for sequence alignment include parallelizing the Global algorithm (see [7]) and using the OpenMP model for parallelization [8].
OpenMP uses multithreaded execution, in which the parallelism of the simple parallel wave-front method can be efficiently exploited. Under this model, the computation of an element of the similarity matrix can be assigned to one thread while respecting the data dependencies.
Tiling [9][10] has been used as an effective compiler optimizing technique to generate high performance scientific codes. Tiling not only can improve data locality for both the sequential and parallel programs [11], but also can help the compiler to maximize parallelism and minimize synchronization [12] for programs running on parallel machines. Thus, sometimes, it is used by the programmers to hand-tune their scientific programs to get better performance.
General-Purpose Computation on Cluster [13] provides a powerful platform to implement the parallel fine-grained Biological Sequence Alignment Global algorithm. With the rapidly increasing power and programmability of the Cluster, these chips are capable of performing a much broader range of tasks. The Cluster technology is employed in several fine-grained parallel sequence alignment applications, such as in [14]. In our study, we employ a new tile-based mechanism in the wave-front method [15] to accelerate the Global algorithm on a single Cluster.
The Global algorithm works by computing the so-called similarity matrix. The computation at each element in this matrix depends on the results of three other elements: its nearest west, northwest, and north neighbors in the matrix. Such fine-grain data dependences present serious challenges for efficient parallel execution on current parallel computers. To meet these challenges, we exploit the power of the multi-core processor.
Our contributions are as follows: 1) We provide the design, implementation, and experimental study of a new tile-based mechanism based on the wave-front method. This method enables the alignment of very long sequences. For comparison, we also introduce the wave-front method used in parallelizing the Global algorithm into the multi-core processor implementation of DNA sequence alignment. 2) We divide the basic computational kernel of the Global algorithm for each tile into two parts: an independent part and a dependent part. The parallel and balanced execution of independent computing units in a coalesced memory-accessing manner can significantly improve the performance of the Global algorithm on a multi-core processor.
The rest of this paper is organized as follows: In Section 2, we describe the Global algorithm for biological sequence alignment in detail. In Section 3, we introduce the basic idea of the simple wave-front method used with the OpenMP model and related work on different architectures that accelerate sequence alignment applications. In Section 4, a parallel implementation of the same algorithm using the traditional tiling transformation is discussed. In Section 5, our design and implementation of the simple wave-front and tile-based algorithms on a cluster are presented. The experiments are presented in Section 6. In Section 7, we conclude and discuss future work.
First, the OpenMP extensions for generic libraries [21, 22] and the associated code generation do not follow canonical C syntax. Second, due to the non-trivial dynamic overhead of the generic techniques, generic libraries are not widely used in programming high-performance scientific and engineering algorithms. Finally, there are no experimental data in [23]. CellSwat [23] presents a tiling mechanism, but with this method there are data dependencies among the tiles. To find homological segments, there are algorithms such as Fast Fourier Transform (FFT) based algorithms [24] and k-mer based algorithms [25]. Compute Unified Device Architecture (CUDA), introduced by NVIDIA [26], is a programming environment for writing and running general-purpose applications on NVIDIA GPUs, such as the GeForce, Quadro, and Tesla hardware architectures. None of the above-mentioned papers mention or handle the memory limitation, which is the main problem when aligning longer sequences. In this paper we introduce our own algorithm, which efficiently handles the memory limitation and can align longer sequences.
The other elements of the matrix are calculated by finding the maximum value among the following three values: the left element plus the gap penalty, the upper-left element plus the score of substituting the horizontal symbol for the vertical symbol, and the upper element plus the gap penalty. For the general case where X = x1..xn and Y = y1..ym, for i = 1..n and j = 1..m, the similarity matrix SM[1..n, 1..m] is built by applying the following recurrence equation, where gp is the gap penalty and ss is the substitution score:

SM[i, j] = max( SM[i, j-1] + gp,  SM[i-1, j-1] + ss,  SM[i-1, j] + gp )    -------(1)

In our example, gp is -1, and ss is 1 if the elements match and 0 otherwise. However, other general values can be used instead. Following this recurrence equation (1), the matrix is filled from top left to bottom right, with entry [i, j] requiring the entries [i, j-1], [i-1, j-1], and [i-1, j]. Notice that SM[i, j] corresponds to the best score of the subsequences x1..xi and y1..yj. Since global alignment takes into account the entire sequences, the final score will always be found in the bottom right-hand corner of the matrix. In our example, the final score 4 gives us a measure of how similar the two sequences are. Figure 1 shows the similarity matrix and the two possible alignments (arrows going up and left).
Given the data dependences presented by the algorithm, the similarity matrix can be filled row by row, column by column, or anti-diagonal by anti-diagonal (i.e., all elements (i, j) for which i + j is a fixed value). The problem with the first two approaches is that most of the elements in a row or column depend on other elements in the same row (or column). This means the row (or column) cannot be computed in parallel. On the other hand, the elements in an anti-diagonal depend only on previously calculated anti-diagonals. This means that parallel computation can proceed as a wave front across the similarity matrix, i.e., by computing successive anti-diagonals of the matrix simultaneously during successive time steps. Although it exposes parallelism, the anti-diagonal approach faces a few challenges when it comes to an efficient parallel implementation. First, the sizes of the anti-diagonals vary during the computation, which leads to unbalanced work among processors. Another challenge has to do with the number of elements to be computed by each processor in each time step. In the previous example we assumed that each processor would calculate one element of the matrix at a time. However, this fine-grain computation would require a very large number of processors if real biological data is to be considered. An additional problem is the high communication overhead of such an implementation, which would require data exchange among all active processors at every time step.
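For concreteness, equation (1) with the example scoring (gp = -1, ss = 1 on a match and 0 otherwise) can be sketched in C as follows. This is our illustrative sketch, not the authors' code; the function name and the row-major fill order are our choices.

```c
#include <stdlib.h>
#include <string.h>

/* Global alignment score following equation (1):
 * SM[i][j] = max(SM[i][j-1] + gp, SM[i-1][j-1] + ss, SM[i-1][j] + gp),
 * with gp = -1 and ss = 1 on a match, 0 otherwise, as in the text's example. */
int global_score(const char *x, const char *y)
{
    const int gp = -1;
    int n = (int)strlen(x), m = (int)strlen(y);
    int **SM = malloc((size_t)(n + 1) * sizeof *SM);
    for (int i = 0; i <= n; i++)
        SM[i] = malloc((size_t)(m + 1) * sizeof **SM);

    /* First row and column correspond to aligning against gaps only. */
    for (int i = 0; i <= n; i++) SM[i][0] = i * gp;
    for (int j = 0; j <= m; j++) SM[0][j] = j * gp;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int ss = (x[i - 1] == y[j - 1]) ? 1 : 0;
            int best = SM[i][j - 1] + gp;                               /* west      */
            if (SM[i - 1][j - 1] + ss > best) best = SM[i - 1][j - 1] + ss; /* northwest */
            if (SM[i - 1][j] + gp > best) best = SM[i - 1][j] + gp;         /* north     */
            SM[i][j] = best;
        }
    }
    int score = SM[n][m];   /* the final score sits in the bottom-right corner */
    for (int i = 0; i <= n; i++) free(SM[i]);
    free(SM);
    return score;
}
```

Identical sequences of length n score n under this scheme, since the matched diagonal contributes 1 per symbol.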
The load-balancing problem can be addressed by putting several rows of blocks (or strips) on the same processor. Figure 2(b) illustrates this approach when four processors are used. The first and fifth strips are assigned to processor 1, the second and sixth strips are assigned to processor 2, and so on. This helps to keep all processors busy through most of the computation. For example, processor 1 initially works with the first strip, then simultaneously with the first and fifth strips, and finally only with the fifth strip. The corresponding pseudo code is shown in Figure 2(c).
The Quad-Processor Multithreaded Architecture. Our platform, the multi-core processor [27], supports a multithreaded program execution model in which a program is viewed as a collection of threads whose execution ordering is determined by data and control dependences explicitly identified in the program. Threads, in turn, are further divided into fibers, which are non-preemptive and scheduled according to dataflow-like firing rules, i.e., all needed data must be available before a fiber becomes ready for execution. Programs structured using this two-level hierarchy can take advantage of local synchronization and communication between fibers within the same thread, exploiting data locality. In addition, an effective overlapping of communication and computation is made possible by providing a pool of ready-to-run fibers from which the processor can fetch new work as soon as the current fiber ends and the necessary communication is initiated. The multi-core model defines a common set of primitive operations required for the management, synchronization, and data communication of threads. Each core in a quad system consists of an execution unit (EU), a synchronization unit (SU), queues linking the EU and SU, local memory, and an interface to the interconnection network.
While the EU merely executes fibers, i.e., does the computation, the SU is responsible for scheduling and synchronizing threads, handling remote accesses, and performing dynamic load balancing. Although designed to deal with multiple threads per node, the multi-core machine does not require any support for rapid context switching (since fibers are non-preemptive) and is well suited to running on off-the-shelf processors. Multi-core processor systems have been implemented on a number of platforms: MANNA and Power MANNA, IBM SP2, Sun SMP clusters, and Beowulf. Multi-core programs are written using the programming language Threaded-C [28][29]. This is an extension of the ANSI C programming language which, by incorporating multi-core operations, allows the user to indicate parallelism explicitly.
Generally speaking, our implementation assigns the computation of each strip to a thread, with two independent threads per node. However, in order to better overlap computation and communication, blocks on a strip are actually calculated by two fibers within a thread. These fibers are repeatedly instantiated to compute one block at a time, and only one of the two fibers of each thread can be active at a particular time. The decision to have two alternating fibers within each thread was based on the following reasoning. It would be a waste of resources to have one separate fiber for each block in each strip, since only one block can be calculated at a time. Having just one fiber for all blocks is also not a good idea, because this fiber would get delayed by the synchronization signal coming from the fiber immediately below. This signal acknowledges the receipt of data; without it, the fiber, when re-instantiated, would be allowed to overwrite the previous data. Thus, with just one fiber, computation would not be allowed to proceed until this acknowledgment signal is received. With the addition of an extra fiber we can further overlap computation and communication, since one of the fibers can wait for the acknowledgment while the other starts working on the following block. A snapshot of the computation of the similarity matrix using our implementation is illustrated in Figure 3. A thread is assigned to each horizontal strip and the actual computation is done by fibers labeled E(ven) and O(dd). The figure shows the computation of the main anti-diagonal of the matrix. The arrows indicate data and synchronization signals. For example, processor 2 sends data (downward arrows) to processor 3 and receives data from processor 1, i.e., fibers E of strips 2 and 6 send data to fibers E of strips 3 and 7, and fibers O of strips 1 and 5 send data to fibers O of strips 2 and 6.
Fibers within the same thread, that is, associated with the same strip, send only a synchronization signal (horizontal arrows), since they share data local to the thread to which they belong. Finally, dotted upward arrows acknowledge the receipt of data so that the fiber receiving this signal can be re-instantiated to calculate another block of the same strip. During the initialization phase, each thread grabs a piece of the input sequence X. This piece is all a thread needs from sequence X, so the whole sequence need not be stored. Moreover, after computing a block, each fiber sends the fiber beneath it a piece of the sequence Y being compared. By doing so, we minimize the initialization delay that occurs when the nodes are reading sequence X from the server. Furthermore, since subsequent pieces of sequence Y can be stored in the same memory area, the demands for space are considerably reduced.
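The cyclic strip assignment described above (strips 1 and 5 to processor 1, strips 2 and 6 to processor 2, and so on) amounts to a one-line mapping; the helper name below is ours, for illustration only.

```c
/* Cyclic (round-robin) strip-to-processor mapping: with P processors,
 * strip s (1-based) goes to processor ((s - 1) mod P) + 1, so processor 1
 * owns strips 1, 5, 9, ... when P = 4. */
int strip_owner(int strip, int num_procs)
{
    return (strip - 1) % num_procs + 1;
}
```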
Figure 4(a) shows the execution of tiles corresponding to the tiled code shown in Figure 4(b). Tiles are of rectangular shape and of size n1 x n2. The index set is of size N1 x N2, so the number of tiles is (N1/n1) x (N2/n2). In the first time step, tile (0,0) is executed by processor P1. At the end of execution it sends the computed data to processor P2 and to itself, i.e., processor P1. In the second time step, processors P1 and P2 execute tiles (0,1) and (1,0) respectively. Similarly, in the third time step, processors P1, P2, and P3 execute tiles (0,2), (1,1), and (2,0) respectively, and so on.
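The schedule above can be captured by two tiny helpers (hypothetical names, ours for illustration): tile (i, j), counting from zero, runs in time step i + j (so step 0 is the text's "first time step"), and a T1 x T2 grid of tiles finishes in T1 + T2 - 1 steps.

```c
/* Wave-front schedule for tiles: tile (i, j) (0-based) can run in time
 * step i + j, because its west, north, and northwest neighbours finish in
 * earlier steps.  A T1 x T2 tile grid therefore needs T1 + T2 - 1 steps. */
int tile_time_step(int i, int j)       { return i + j; }
int total_time_steps(int t1, int t2)   { return t1 + t2 - 1; }
```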
In Figure 4(b) all loops are parallel and the tile size is n1 x n2. Suppose we run on a quad core: before tiling, loop i has N iterations and each processor executes N/4 iterations; after tiling, the tiled loop has N/(n1 * n2) iterations, which are divided evenly among the processors. Thus the tile size is calculated by computing the number of memory references, as the parallelizing compiler needs to tune the tile size so that each processor executes nearly the same number of iterations. For example, if the tile size is 8 x 8 = 64, then after tiling the cross-strip ii loop of Figure 4(b) has N/64 iterations, and each processor gets N/(4 * 64) iterations.
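The iteration counts in this example can be checked with a couple of helper functions (illustrative only; exact divisibility of N by the tile area and processor count is assumed):

```c
/* After tiling with an n1 x n2 tile, the cross-strip loop has
 * N / (n1 * n2) iterations, shared evenly by the p processors. */
int tiled_iterations(int N, int n1, int n2)            { return N / (n1 * n2); }
int iterations_per_proc(int N, int n1, int n2, int p)  { return N / (n1 * n2) / p; }
```

For N = 1024 with 8 x 8 tiles on a quad core, the tiled loop has 16 iterations and each processor executes 4 of them.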
The main drawback of the methods discussed above is the limited main memory size for longer DNA sequences and the high dependency on the computational resources of the host side. The simple Global algorithm requires additional storage and communication bandwidth. Therefore, we introduce the new tile-based method, which simplifies the computational model and also handles very long sequences with less memory transmission. The tile-based method can be described as follows: if a matrix can be fully loaded into memory, do so; otherwise divide the large matrix into tiles to ensure that each tile fits in the processor's memory as a whole, and then calculate the tiles one by one or in parallel. Thus, for the 4x4 matrix shown in the left part of Figure 5(a), the numbers of rows (r) and columns (c) are 4 and 4 respectively. In our algorithm the rows become diagonal rows, so the number of rows becomes (r + c - 1), i.e., 7, as shown in the right part of Figure 5(a). Each matrix element corresponds to one tile. For each diagonal row, one file is created in which computed values are stored and destroyed synchronously. This new parallel implementation is shown in Figure 5(b).
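One in-memory way to realize "store only the required values" is to observe that equation (1) reads only the previous row of the matrix (plus the previous cell of the current row). The sketch below is ours, not the authors' per-diagonal file implementation; it keeps just two row buffers instead of the full matrix, illustrating the same storage-reduction idea.

```c
#include <stdlib.h>
#include <string.h>

/* Reduced-memory score for equation (1): two buffers of length m+1
 * replace the full (n+1) x (m+1) matrix, since each row depends only on
 * the row above it.  gp = -1; ss = 1 on a match, 0 otherwise. */
int global_score_two_rows(const char *x, const char *y)
{
    const int gp = -1;
    int n = (int)strlen(x), m = (int)strlen(y);
    int *prev = malloc((size_t)(m + 1) * sizeof *prev);
    int *cur  = malloc((size_t)(m + 1) * sizeof *cur);

    for (int j = 0; j <= m; j++) prev[j] = j * gp;
    for (int i = 1; i <= n; i++) {
        cur[0] = i * gp;
        for (int j = 1; j <= m; j++) {
            int ss = (x[i - 1] == y[j - 1]) ? 1 : 0;
            int best = cur[j - 1] + gp;                            /* west      */
            if (prev[j - 1] + ss > best) best = prev[j - 1] + ss;  /* northwest */
            if (prev[j] + gp > best) best = prev[j] + gp;          /* north     */
            cur[j] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;   /* reuse the buffers */
    }
    int score = prev[m];
    free(prev); free(cur);
    return score;
}
```

The result is identical to the full-matrix fill; only the score is retained, which is what the buffered values in our method are used for between diagonal rows.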
The wave-front algorithm is a very important method used in a variety of scientific applications. The computing procedure resembles the frontier of a wave filling a matrix, where each block's value is calculated from the values of previously calculated blocks. The left-hand side of Figure 5(a) shows the wave-front structure for parallelizing the Global algorithm. Since the value of each block in the matrix depends on the left, upper, and upper-left blocks, blocks with the same color in the figure are put in the same parallel computing wave-front round. The process for this wave-front algorithm is shown in Figure 5(a).
Fig 5(a). Wave-front structure of the Global algorithm for sequence alignment.

The outer loop walks over the anti-diagonal rounds, and the inner OpenMP loop computes the tiles of one round in parallel:

    calcSteps = (tileYCount + tileXCount) - 1;
    for (row = 0; row < calcSteps; row++) {
        #pragma omp parallel shared(row, n, chunk) private(col)
        {
            #pragma omp for schedule(static, chunk) nowait
            for (col = 0; col <= row; col++) {
                /* compute the col-th tile of anti-diagonal 'row' */
            }
        }
    }
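The anti-diagonal traversal can also be sketched as a self-contained C routine at element (rather than tile) granularity. This is our illustrative sketch, not the authors' code; the function name is an assumption, and the pragma is simply ignored when compiled without OpenMP, leaving a correct serial program.

```c
#include <stdlib.h>
#include <string.h>

/* Element-granularity wave-front fill of equation (1): anti-diagonal d
 * holds the cells (i, j) with i + j == d.  Cells on one anti-diagonal are
 * mutually independent, so they may be computed in parallel once the two
 * previous anti-diagonals are done.  gp = -1; ss = 1 on match, else 0. */
int global_score_wavefront(const char *x, const char *y)
{
    const int gp = -1;
    int n = (int)strlen(x), m = (int)strlen(y);
    int *SM = malloc((size_t)(n + 1) * (m + 1) * sizeof *SM);
#define AT(i, j) SM[(i) * (m + 1) + (j)]
    for (int i = 0; i <= n; i++) AT(i, 0) = i * gp;
    for (int j = 0; j <= m; j++) AT(0, j) = j * gp;

    for (int d = 2; d <= n + m; d++) {          /* successive anti-diagonals */
        int ilo = (d - m > 1) ? d - m : 1;      /* clip the diagonal to the  */
        int ihi = (d - 1 < n) ? d - 1 : n;      /* interior of the matrix    */
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            int ss = (x[i - 1] == y[j - 1]) ? 1 : 0;
            int best = AT(i, j - 1) + gp;                                   /* west      */
            if (AT(i - 1, j - 1) + ss > best) best = AT(i - 1, j - 1) + ss; /* northwest */
            if (AT(i - 1, j) + gp > best) best = AT(i - 1, j) + gp;         /* north     */
            AT(i, j) = best;
        }
    }
    int score = AT(n, m);
#undef AT
    free(SM);
    return score;
}
```

The clipping of ilo and ihi reflects the varying anti-diagonal lengths discussed earlier, which is exactly the source of the load imbalance that the strip assignment addresses.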
The left-hand side of the figure shows the wave-front process, and the right-hand side shows the tile data skewing used to implement the wave-front. The experimental results (Figure 8) show that all the curves are parabolic. Hence, for all the results, the minimum time required with different numbers of threads will be

t_min = (4ac - b^2) / (4a)    -------------(2)

where a, b, and c are the constants in the equation of the parabola; these values are calculated using the least-squares method. The calculated theoretical values matched our experimental results.
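As a small worked check of equation (2), assuming the fitted parabola has the form t(k) = a*k^2 + b*k + c in the number of threads k (our reading of the text), the minimum value and its location follow directly from the vertex formula:

```c
/* Vertex of the fitted parabola t(k) = a*k^2 + b*k + c (a > 0):
 * the minimum value is (4ac - b^2) / (4a), attained at k = -b / (2a). */
double parabola_min_value(double a, double b, double c)
{
    return (4.0 * a * c - b * b) / (4.0 * a);
}
double parabola_min_pos(double a, double b, double c)
{
    return -b / (2.0 * a);
}
```

For example, t(k) = k^2 - 4k + 7 has its minimum value 3 at k = 2 threads.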
6. EXPERIMENTAL RESULTS
6.1 General Test
The experiments were performed on a quad-core Intel 2.83 GHz CPU platform with 3 GB of memory and a 500 GB hard disk. We employed an automatic sequence-generating program to generate different test cases.
The general test was executed on four implementations of the Global algorithm for biological sequence alignment: the first is the serial implementation; the second is the simple wave-front implementation using OpenMP; the third is the traditional tile-based implementation; and the fourth is our new tile-based implementation. For long sequences, the simple wave-front version cannot load the entire dynamic programming matrix into the processor memory. Therefore, we select groups of sequences of variable lengths to test on the Linux version. The experimental results are shown in Table 1.
In the serial version, the compiler's optimization methods have a great effect on the manner of accessing memory; however, on a multi-core processor this is not true, because the multi-core processor has a special memory hierarchy and the access times of its levels vary greatly. For example, global memory has an access time of about 500 cycles, while on-chip memory such as a register takes only 1 cycle. This is because the multi-core processor does not provide a good mechanism for optimizing memory accesses at the compiler level. We can also see that, among the four versions of the algorithm, our tile-based method is the best. In fact, since the length of the test sequences is not long, the only difference between the simple and tile-based algorithms is that the tile-based algorithm introduces some optimization methods. The results show that our optimization methods are effective.

6.2 Test of Tile-based Global Algorithm on 12-core processor
The experiments are performed on a platform with a 12-core Intel 2.83 GHz CPU, 3 GB of memory, and 500 GB of disk. The selected tile size and the size of the sequence significantly affect the algorithm's efficiency. In our test, we focus on how the number of threads affects the final performance. Our tests are divided into two parts: the time of computing each tile and the time of sequence alignment using our tile-based algorithm. The test results are shown in Figure 7. The time for computing each tile is essentially constant, so we do not include it; we want to show the time changes for computing the sequences separately. From Figure 7, we see that with the growth of the average sequence length, the performance of our method improves massively. We select the tile-size parameters, which are used to decide how to partition sequences of average length, to keep the residue length at 200, because when the average sequence length exceeds 32,000 the time of sequence alignment increases much faster.

Seq. Len. | Serial (s) | Simple wave-front (s / speedup) | Traditional tiling (s / speedup) | New tile method (s / speedup)
16000     | 21.17      | 18.2564 / 1.1591                | 12.4578 / 1.6987                | 1.8564 / 11.399
32000     | 84.86      | 51.2563 / 1.6555                | 35.2457 / 2.4075                | 7.8572 / 10.799
64000     | 336.1      | 245.3256 / 1.3696               | 185.4273 / 1.8121               | 30.437 / 11.0396
128000    | 777.3      | 547.2568 / 1.4201               | 356.8794 / 2.1777               | 72.773 / 10.6798
256000    | 1470       | 800.214 / 1.8378                | 496.2589 / 2.9634               | 144.18 / 10.2001

Table 2: Performance assessment of three different global implementations on 12-core processor with tile size 1000 x 1000
[Figure 7: Execution time (seconds) versus sequence length (16000 to 256000) for the Serial, Simple Wavefront OpenMP, Traditional Tile with OpenMP, and Our method of tile with OpenMP implementations.]
The Global algorithm solves the sequence alignment problem in parallel in several versions, and our new tile-based algorithm exhibits the best performance. This is emphasized in Figure 8, which shows a pairwise comparison between the overall speedups for all the versions of the parallel Global algorithm. The comparison is done for each pair of sequences. Figure 8 shows that the tile-based implementation is an order of magnitude faster than the simple wave-front implementation, because it only needs to calculate a portion of the dynamic programming matrix. In addition, the time growth of the tile-based implementation is lower. From Figure 8, we see that as the residue length increases from 32000 to 256000, the time for the tile-based method increases by only about 190 seconds under the Linux system, while the other methods grow by at least 500 and 800 seconds respectively. As the sequence length increases, the growth of the calculation time needed by our tile-based method is very low.
Fig. 9 shows the performance evaluation on the 64-core multithreaded processor: computational time in seconds versus sequence length. Figures 10 and 11 show the speedup variation with sequence length for all versions of the Global algorithm.
Fig 8. Comparison of the overall speedups for all versions of the Parallel Global Algorithm on a 12-core processor.
Seq. Len. | Simple wave-front (s / speedup) | Traditional tiling (s / speedup)
16000     | 10.7190 / 1.9082                | 6.38930 / 3.2013
32000     | 22.6239 / 3.6536                | 16.3214 / 5.0644
64000     | 98.3412 / 3.1581                | 85.5439 / 3.6305
128000    | 254.5421 / 2.5560               | 143.7619 / 4.5250
256000    | 400.871 / 3.4675                | 220.4846 / 6.3043

Table 3: Performance assessment of three different global implementations on 64-core processor with tile size 1000 x 1000

Fig 10. Comparison of the overall speedups for the simple wave-front and traditional tiling mechanisms on a 64-core processor.

Fig 11. Comparison of the overall speedups for our new tile-based version (Our Method of Tile with OpenMP) on a 64-core processor.
REFERENCES
[1] K. Jiang, O. Thorston, A. Peters, B. Smith, C.P. Sosa. An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence-Search on a Massively Parallel System. IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 1 (2008).
[2] M. Ishikawa, T. Toya, M. Hoshida, K. Nitta, A. Ogiwara, M. Kanehisa. Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci. 1993 Jun; 9(3):267-73.
[3] S.-I. Tate, I. Yoshihara, K. Yamamori, M. Yasunaga. A parallel hybrid genetic algorithm for multiple protein sequence alignment. In: Evolutionary Computation, 2002 (CEC '02), Proceedings of the 2002 Congress, 12-17 May 2002, Volume 1, pp. 309-314.
[4] Y. Liu, W. Huang, J. Johnson, and S. Vaidya. GPU Accelerated Smith-Waterman. Proc. Int'l Conf. Computational Science (ICCS '06), pp. 188-195, 2006.
[5] R. Horn, M. Houston, P. Hanrahan. ClawHMMer: A streaming HMMer search implementation. Proc. Supercomputing (2005).
[6] W. Liu, B. Schmidt, G. Voss, W. Muller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE TPDS, Vol. 18, No. 9 (2007), pp. 1270-1281.
[7] Rong X, Jan 2003. Pairwise Alignment - CS262 - Lecture 1 Notes [online], Stanford University. Available: http://ai.stanford.edu/~serafim/cs262/Spring2003/Notes/1.pdf
[8] OpenMP Architecture Review Board: The OpenMP specification for parallel programming (2008), http://www.openmp.org
[9] Anderson, J.M., Amarasinghe, S.P., Lam, M.S.: Data and computation transformations for multiprocessors. In: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, Santa Barbara, California (July 19-21, 1995), pp. 166-178. SIGPLAN Notices, 30(8), August 1995.
[10] Anderson, J.M., Lam, M.S.: Global optimizations for parallelism and locality on scalable parallel machines. In: Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, New Mexico (June 23-25, 1993), pp. 112-125. SIGPLAN Notices, 28(6), June 1993.
[11] Wolf, M.E., Lam, M.S.: A data locality optimizing algorithm. In: Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario (June 26-28, 1991), pp. 30-44. SIGPLAN Notices, 26(6), June 1991.
[12] Lim, A.W., Lam, M.S.: Maximizing parallelism and minimizing synchronization with affine transforms. In: Conference Record of POPL '97: The 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Paris (January 15-17, 1997), pp. 201-214.
[13] Mark Baker and Rajkumar Buyya. Cluster Computing at a Glance (Chapter 1). Division of Computer Science, University of Portsmouth, Southsea, Hants, UK; School of Computer Science and Software Engineering, Monash University, Melbourne, Australia. Email: Mark.Baker@port.ac.uk, rajkumar@dgs.monash.edu.au
[14] Yang Chen, Songnian Yu and Ming Leng. 2006. In: International Federation for Information Processing (IFIP), Volume 207, Knowledge Enterprise: Intelligent Strategies in Product Design, Manufacturing and Management, eds. K. Wang, G. Kovacs, M. Wozny, M. Fang (Boston: Springer), pp. 311-321.
[15] S. Aji, F. Blagojevic, W. Feng, D.S. Nikolopoulos. CellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine. Proceedings of the 2008 ACM International Conference on Computing Frontiers, pp. 13-22.
[16] Viswanathan, G., Larus, J.R.: User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison (Jan 1996).
[17] Scholz, S.B.: On defining application-specific high-level array operations by means of shape invariant programming facilities. In: APL '98: Proceedings of the APL98 Conference on Array Processing Language, New York, NY, USA, ACM (1998), pp. 32-38.
[18] Deitz, S.J., Chamberlain, B.L., Snyder, L.: High-level language support for user-defined reductions. J. Supercomput. 23(1) (2002), pp. 23-37.
[19] UPC Consortium: UPC Collective Operations Specifications V1.0. A publication of the UPC Consortium (2003).
[20] MPI Forum: MPI: A message-passing interface standard (version 1.0). Technical report (May 1994). URL http://www.mcs.anl.gov/mpi/mpi-report.ps
[21] Kambadur, P., Gregor, D., Lumsdaine, A.: OpenMP extensions for generic libraries. In: Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism, IWOMP '08, International Workshop on OpenMP, Volume 5004/2008, Springer Berlin/Heidelberg (2008), pp. 123-133.
[22] Kambadur, P., Gregor, D., Lumsdaine, A.: OpenMP extensions for generic libraries. In: Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism, IWOMP '08, International Workshop on OpenMP, Volume 5004/2008, Springer Berlin/Heidelberg (2008), pp. 123-133.
[23] S. Aji, F. Blagojevic, W. Feng, D.S. Nikolopoulos. CellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine. Proceedings of the 2008 ACM International Conference on Computing Frontiers, pp. 13-22.
[24] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066, 2002.
[25] F. Jiang. Multiple Sequence Alignment based on k-mer and FFT. Graduation thesis, Xidian University, 2006 (in Chinese).
[26] CUDA Compute Unified Device Architecture, Programming Guide, Version 2.0. NVIDIA Corp., 2008.
[27] Guangming Tan, Ninghui Sun, Guang R. Gao. A Parallel Dynamic Programming Algorithm on a Multicore Architecture. Key Lab of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Graduate School of Chinese Academy of Sciences, Beijing, China; Computer Architecture and Parallel Systems Lab, Department of Electrical & Computer Engineering, University of Delaware, Newark, DE, USA. 2007.
[28] Kevin Bryan Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, Montréal, Québec, May 1999.
[29] Herbert H. J. Hum, Olivier Maquelin, Kevin B. Theobald, Xinmin Tian, Guang R. Gao, and Laurie J. Hendren. A study of the EARTH-MANNA multithreaded system. International Journal of Parallel Programming, 24(4):319-347, August 1996.