
JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGIES, VOLUME 2, ISSUE 2, FEBRUARY 2012 1

A Tile-based Parallel Global Algorithm for Biological Sequence Alignment on Multi-core Architecture
D.D.Shrimankar and S.R.Sathe Department of Computer Science and Engineering Visvesvaraya National Institute of Technology Nagpur 440010 (India)

Abstract: The Global algorithm is a compute-intensive sequence alignment method. In this paper, we investigate extending several parallel methods, such as the wave-front method for the Global algorithm, to achieve a significant speed-up on a multi-core processor. The simple parallel wave-front method can take advantage of the computing power of the multi-core processor, but it cannot handle long sequences because of the physical main-memory limit. On the other hand, the tiling method can process long sequences, but with increased overhead due to the increased data transmission between main memory and secondary memory. To further improve the performance on multi-core processors, we propose a new tile-based parallel algorithm. We take advantage of homological segments to divide long sequences into many short pieces, so that each piece pair (tile) can be fully held in main memory. To work within the main-memory limitation, only the required values, instead of the complete set of computed values, are stored in a buffer, and an offset is used to pick them up when required. The experimental results show that our new tile-based parallel algorithm outperforms the simple parallel wave-front method. Especially for the long-sequence alignment problem, the best performance of the tile-based algorithm is on average about an order of magnitude faster than the serial Global algorithm.

Keywords: global algorithm, OpenMP, tile, multi-core processor

I. INTRODUCTION
Biological Sequence alignment is very important in homology modeling, phylogenetic tree reconstruction, sub-family classification, and identification of critical residues. When aligning multiple sequences, the cost of computation and storage of traditional dynamic programming algorithms becomes unacceptable on current computers. Many heuristic algorithms for the multiple sequence alignment problems run in a reasonable time with some loss of accuracy. However, with the growth of the volume of sequence data, the heuristic algorithms are still very costly in performance. There are generally two approaches to parallelize sequence alignment. One is a coarse-grained method, which tries to run multiple sequence alignment subtasks on different processors, for example, see [1, 2, 3] or optimize single pair-wise alignment consisting of multiple parallel sub-tasks, such as [4, 5, 6]. The communication among those sub-tasks is critical for the performance of coarse-grained method. The other is a more fine-grained method, which focuses on parallelizing the operations on smaller data components. Typical implementations of fine-grained methods for sequence alignment include parallelizing the Global algorithm (see [7]) and OpenMP model for parallelization [8].

OpenMP uses multithreaded execution, in which parallelism can be efficiently exploited by the simple parallel wave-front method. Under this model, the computation of an element of the similarity matrix can be assigned to one thread while respecting the data dependencies.
Tiling [9][10] has been used as an effective compiler optimizing technique to generate high performance scientific codes. Tiling not only can improve data locality for both the sequential and parallel programs [11], but also can help the compiler to maximize parallelism and minimize synchronization [12] for programs running on parallel machines. Thus, sometimes, it is used by the programmers to hand-tune their scientific programs to get better performance.

General-Purpose Computation on Cluster [13] provides a powerful platform to implement the parallel fine-grained Global algorithm for biological sequence alignment. With the rapidly increasing power and programmability of the Cluster, these chips are capable of performing a much broader range of tasks. The Cluster technology is employed in several fine-grained parallel sequence alignment applications, such as in [14]. In our study, we employ a new tile-based mechanism within the wave-front method [15] to accelerate the Global algorithm on a single Cluster. Our contributions are as follows: 1) We provide the design, implementation, and experimental study of a new tile-based mechanism based on the wave-front method. This method enables the alignment of very long sequences. For comparison, we also introduce the wave-front method used in parallelizing the Global algorithm into the multi-core processor implementation of DNA sequence alignment. 2) We divide the basic computational kernel of the Global algorithm for each tile into two parts, the independent and the dependent part. The parallel and balanced execution of independent computing units in a coalesced memory-accessing manner can significantly improve the performance of the Global algorithm on a multi-core processor.

The Global algorithm works by computing the so-called similarity matrix. The computation at each element of this matrix depends on the results of three other elements: its nearest west, northwest, and north neighbors in the matrix. Such fine-grained data dependences present serious challenges for efficient parallel execution on current parallel computers. To meet such challenges, we exploit the power of the multi-core processor.
The rest of this paper is organized as follows: In Section 2, we describe the Global algorithm for biological sequence alignment in detail. In Section 3, we introduce the basic idea of the simple wave-front method used with the OpenMP model, and related work on the different architectures that have accelerated sequence alignment applications. In Section 4, a parallel implementation of the same algorithm using the traditional tiling transformation is discussed. In Section 5, our design and implementation of the simple wave-front and tile-based algorithms on a cluster are presented. The experiments are presented in Section 6. In Section 7, we conclude and discuss future work.

canonical C syntax. Second, due to the non-trivial dynamic overhead of the generic techniques, generic libraries are not widely used in programming high-performance scientific and engineering algorithms. Finally, there are no experimental data in [23]. CellSwat [23] presents a tiling mechanism, but with this method there are data dependencies among the tiles. To find homological segments, there are algorithms such as Fast Fourier Transform (FFT) based algorithms [24] and k-mer based algorithms [25]. Compute Unified Device Architecture (CUDA), introduced by NVIDIA [26], is a programming environment for writing and running general-purpose applications on NVIDIA GPUs such as the GeForce, Quadro, and Tesla hardware architectures. None of the above-mentioned papers mentions or handles the memory limitation, which is the main problem when aligning longer sequences. In this paper we introduce our own algorithm, which efficiently handles the memory limitation and can align longer sequences.

2. Global Algorithm for Biological Sequence Alignment


The algorithm consists of two parts: the calculation of the total score indicating the similarity between the two given sequences, and the identification of the alignment(s) that lead to that score. In this paper we concentrate on the calculation of the score, since this is the most computationally expensive part. The idea behind using dynamic programming is to build up the solution by using previous solutions for smaller subsequences. The comparison of the two sequences X and Y using the dynamic programming algorithm is illustrated in Figure 1. This algorithm finds global alignments by comparing entire sequences. The sequences are placed along the left margin (X) and along the top (Y). The similarity matrix is initialized with decreasing values (0, -1, -2, -3, ...) along the first row and first column to penalize consecutive gaps (insertions or deletions).

II. BACKGROUND AND RELATED WORK


OpenMP-style parallelization directives are supported in many parallel programming languages, including C** [16], SAC [17], ZPL [18], UPC [19], and MPI [20]. Most of them support user-defined reduction operations, either through language constructs or through library routines. A user-defined reduction operation provides a flexible way to implement tile reduction. However, programmers need to change both data structures and algorithms, which sometimes is not a trivial job. Another piece of work that we need to mention is [21]. In [21], the authors propose to extend the OpenMP clause to parallelize C++ generic algorithms. They propose to support user-defined types, overloaded operators, and function objects in the same way as the built-ins supported in the current OpenMP clause. Their work is very close to that presented in this paper. However, we study the parallelization problem from a different angle. We propose a tiled parallelizing technique for OpenMP, while [21] proposes user-defined reduction operations to complete their OpenMP extensions for parallelizing generic libraries. In our tiled parallelization technique, we are concerned with data partitioning, locality, and a more flexible and efficient way to parallelize dense matrix programs written in canonical C syntax, while the purpose of [23] is to allow people to parallelize programs written in modern C++ idioms such as iterators and function objects, which are not in canonical C syntax.

Fig 1: Similarity matrix and global alignments

The other elements of the matrix are calculated by finding the maximum value among the following three


values: the left element plus the gap penalty, the upper-left element plus the score of substituting the horizontal symbol for the vertical symbol, and the upper element plus the gap penalty. For the general case where X = x1..xi and Y = y1..yj, for i = 1..n and j = 1..m, the similarity matrix SM[n : m] is built by applying the following recurrence equation, where gp is the gap penalty and ss is the substitution score:


3. Parallelizing with simple wave-front method


Using this method, we show experimentally that the dynamic programming algorithm can be implemented efficiently on a multi-core machine with high performance, good programmability, and reasonable cost. Here we are particularly interested in parallel machines made mainly of commodity off-the-shelf microprocessors and stock hardware. As mentioned, computing the anti-diagonal element by element would lead to expensive communication overheads. For each element, the program would compute a single maximum, yet would have to send the result to three processors (though one of these may be the same processor). One solution to this problem is to divide the similarity matrix into rectangular blocks, as shown in Figure 2(a). In this example, the program would compute block 1 first, followed by 2 and 5, and so on. If each block has q rows and r columns, then the computation of a given block requires only the row segment immediately above the block, the column segment to its immediate left, and the element above and to the left: a total of q + r + 1 elements. For instance, if each block has 4 rows and 4 columns, then each block computes 16 maxima after receiving 9 input values. The communication-to-computation ratio drops from 3:1 to 9:16, an 81% reduction.

SM[i, j] = max { SM[i, j-1] + gp,
                 SM[i-1, j-1] + ss,
                 SM[i-1, j] + gp }            -------(1)
In our example, gp is -1, and ss is 1 if the elements match and 0 otherwise; however, other general values can be used instead. Following recurrence equation (1), the matrix is filled from top left to bottom right, with entry [i, j] requiring the entries [i, j-1], [i-1, j-1], and [i-1, j]. Notice that SM[i, j] corresponds to the best score of the subsequences x1, .., xi and y1, .., yj. Since global alignment takes into account the entire sequences, the final score will always be found in the bottom right-hand corner of the matrix. In our example, the final score 4 gives us a measure of how similar the two sequences are. Figure 1 shows the similarity matrix and the two possible alignments (arrows going up and left).
Given the data dependences presented by the algorithm, the similarity matrix can be filled row by row, column by column, or anti-diagonal by anti-diagonal (i.e., all elements (i, j) for which i + j is a fixed value). The problem with the first two approaches is that most of the elements in a row or column depend on other elements in the same row (or column). This means the row (or column) cannot be computed in parallel. On the other hand, the elements in an anti-diagonal depend only on previously calculated anti-diagonals. This means that parallel computation can proceed as a wave front across the similarity matrix, i.e., by computing successive anti-diagonals of the matrix simultaneously, during successive time steps. Although it exposes parallelism, the anti-diagonal approach faces a few challenges when it comes to an efficient parallel implementation. First, the sizes of the anti-diagonals vary during the computation, which leads to unbalanced work among processors. Another challenge has to do with the number of elements to be computed by each processor in each time step. In the previous example we assumed that each processor would calculate one element of the matrix at a time. However, this fine-grain computation would require a very large number of processors if real biological data is to be considered. An additional problem is the high communication overhead of such an implementation, which would require data exchange among all active processors at every time step.

#pragma omp parallel shared(data, data1, c) private(id)
{
    id = omp_get_thread_num();
    #pragma omp for
    for (i = 1; i < N; i++) {
        for (j = 1; j < M; j++) {
            /* compute element (i, j) of the similarity matrix */
        }
    }
}

(c) Pseudo code
Fig 2: Partition of the similarity matrix


The load-balancing problem can be addressed by putting several rows of blocks (or strips) on the same processor. Figure 2(b) illustrates this approach when four processors are used. The first and fifth strips are assigned to processor 1, the second and sixth strips are assigned to processor 2, and so on. This helps to keep all processors busy through most of the computation. For example, processor 1 initially works with the first strip, then simultaneously with the first and fifth strips, then finally only with the fifth strip. The corresponding pseudo code is shown in Figure 2(c).

The Quad-Processor Multithreaded Architecture. Our platform, the multi-core processor [27], supports a multithreaded program execution model in which a program is viewed as a collection of threads whose execution ordering is determined by data and control dependences explicitly identified in the program. Threads, in turn, are further divided into fibers, which are non-preemptive and scheduled according to dataflow-like firing rules, i.e., all needed data must be available before a fiber becomes ready for execution. Programs structured using this two-level hierarchy can take advantage of both local synchronization and communication between fibers within the same thread, exploiting data locality. In addition, an effective overlapping of communication and computation is made possible by providing a pool of ready-to-run fibers from which the processor can fetch new work as soon as the current fiber ends and the necessary communication is initiated. The multi-core model defines a common set of primitive operations required for the management, synchronization, and data communication of threads. Each core in a quad system consists of an execution unit (EU), a synchronization unit (SU), queues linking the EU and SU, local memory, and an interface to the interconnection network.
While the EU merely executes fibers, i.e., does the computation, the SU is responsible for scheduling and synchronizing threads, handling remote accesses, and performing dynamic load balancing. Although designed to deal with multiple threads per node, the multi-core machine does not require any support for rapid context switching (since fibers are non-preemptive) and is well suited to running on off-the-shelf processors. Multi-core processor systems have been implemented on a number of platforms: MANNA and Power MANNA, IBM SP2, Sun SMP clusters, and Beowulf. Multi-core programs

Fig 3: Computation of the similarity matrix on Quad Core

are written using the programming language Threaded-C [28][29]. This is an extension of the ANSI C programming language which, by incorporating multi-core operations, allows the user to indicate parallelism explicitly.
Generally speaking, the implementation assigns the computation of each strip to a thread, having two independent threads per node. However, in order to better overlap computation and communication, the blocks of a strip are actually calculated by two fibers within a thread. These fibers are repeatedly instantiated to compute one block at a time, and only one of the two fibers of each thread can be active at a particular time. The decision to have two alternating fibers within each thread was based on the following reasoning. It would be a waste of resources to have one separate fiber for each block in each strip, since only one block can be calculated at a time. Having just one fiber for all blocks is also not a good idea, because this fiber would get delayed by the synchronization signal coming from the fiber immediately below. This signal acknowledges the receipt of data; without it, the fiber, once re-instantiated, would be allowed to overwrite the previous data. Thus, with just one fiber, computation would not be allowed to proceed until this acknowledgment signal is received. With the addition of an extra fiber we can further overlap computation and communication, since one of the fibers can wait for the acknowledgment while the other starts working on the following block. A snapshot of the computation of the similarity matrix using our implementation is illustrated in Figure 3. A thread is assigned to each horizontal strip and the actual computation is done by fibers labeled E(ven) and O(dd). The figure shows the computation of the main anti-diagonal of the matrix. The arrows indicate data and synchronization signals. For example, processor 2 sends data (downward arrows) to processor 3 and receives data from processor 1, i.e., fibers E of strips 2 and 6 send data to fibers E of strips 3 and 7, and fibers O of strips 1 and 5 send data to fibers O of strips 2 and 6.
Fibers within a same thread, that is, associated with the same strip, send only a synchronization signal (horizontal arrows) since they


share data local to the thread to which they belong. Finally, dotted upward arrows acknowledge the receipt of data so that the fiber receiving this signal can be re-instantiated to calculate another block of the same strip. During the initialization phase, each thread grabs a piece of the input sequence X. This piece is all a thread needs from sequence X so the whole sequence need not be stored. Moreover, after computing a block, each fiber sends to the fiber beneath a piece of the sequence Y being compared. By doing so, we minimize the initialization delay that occurs when the nodes are reading sequence X from the server. Furthermore, since subsequent pieces of sequence Y can be stored in the same memory area, the demands for space are considerably reduced.

Figure 4(a) shows the execution of tiles corresponding to the tiled code shown in Figure 4(b). Tiles are of rectangular shape and of size n1 x n2; the index set is of size N1 x N2, so the number of tiles is (N1/n1) x (N2/n2). In the first time step, tile (0,0) is executed by processor P1. At the end of execution it sends the computed data to processor P2 and to itself, i.e., processor P1. In the second time step, processors P1 and P2 execute tiles (0,1) and (1,0) respectively. Similarly, in the third time step, processors P1, P2 and P3 execute tiles (0,2), (1,1) and (2,0) respectively, and so on.

4. Parallel implementation using traditional tiling transformation


The traditional method of tiling consists of computing successive anti-diagonals of the matrix simultaneously, during successive time steps. This approach faces a few challenges when it comes to an efficient parallel implementation. First, the sizes of the anti-diagonals vary during the computation, which leads to unbalanced work among processors. For example, assume quad-core processors, one per symbol (row) of sequence X, are available to compute the matrix in Figure 1. The computation would start with processor 1 calculating the element (0,0) (assuming rows and columns are numbered 0, 1, 2, ...). Then, in the next time step, processors 1 and 2 would calculate the elements (0,1) and (1,0) respectively. Similarly, in the third time step, processors P1, P2 and P3 would compute (2,0), (1,1) and (0,2) respectively, and so on. Thus one can imagine a wave front sweeping across the tiles, as shown in Fig. 4: at time step i, all the tiles on the i-th wave front execute simultaneously. In this research work, a wave-front approach is used for the analysis of tiling. This way, for instance, processor 4 would have to wait 4 time steps before starting to work, and by the time we get to time step 6, processor 1 would already be idle. In the worst case, where X and Y are the same length, each processor would only be used half of the time on average. Another challenge has to do with the number of elements to be computed by each processor in each time step. In the previous example we assumed that each processor would calculate one element of the matrix at a time. However, this fine-grain computation would require a very large number of processors if real biological data is to be considered. An additional problem is the high communication overhead of such an implementation, which would require data exchange among all active processors at every time step.

In Fig. 4(b) all loops are parallel and the tile size is n1 x n2. Suppose we run on a quad core. Before tiling, loop i has N iterations and each processor executes N/4 of them. After tiling, the cross-strip ii loop of Figure 4(b) has N/(n1 * n2) iterations, which are divided evenly among the processors. The tile size is therefore calculated by counting the number of memory references, as the parallelizing compiler needs to tune the tile size so that each processor executes nearly the same number of iterations. For example, if the tile size is 8 x 8 = 64, then after tiling the cross-strip ii loop of Figure 4(b) has N/64 iterations, and each processor gets N/(64 * 4) iterations. The appropriate tile size for optimized performance is given in our experimental results.

5. Tile Based OpenMP Method to Harness the Power of Multi-core Processor


Here we introduce our new algorithm, tile reduction: an OpenMP tile-aware parallelization technique that applies reduction on multi-dimensional arrays. This method uses a tile reduction implementation, including the required OpenMP API extension and the associated code-generation technique, for aligning DNA sequences.


Pseudo code 1; we call it a simple implementation of the wave-front algorithm.

The right-hand side of Figure 5(a) shows the data-skewing strategy used to implement this wave-front pattern. This memory structure is useful because blocks in the same parallelizing group are adjacent to each other.

SeqLen  | Serial (s) | Simple wave-front (s / speedup) | Traditional tile (s / speedup) | New tile method (s / speedup)
10000   | 0.73       | 0.37 / 1.97                     | 0.38 / 1.92                    | 0.28 / 2.61
20000   | 2.34       | 0.39 / 6.00                     | 0.44 / 5.32                    | 0.39 / 6.00
30000   | 5.89       | 0.42 / 14.02                    | 0.46 / 12.80                   | 0.43 / 13.70
40000   | 9.93       | 0.50 / 19.86                    | 0.52 / 19.10                   | 0.45 / 22.07
50000   | 15.9       | 0.54 / 29.44                    | 0.54 / 29.44                   | 0.52 / 30.58
100000  | 62.1       | 0.99 / 62.73                    | 1.10 / 56.45                   | 0.86 / 72.21

Table 1: Performance comparison of three different Global implementations on quad processor (execution time in seconds / speedup).

The main drawback of the methods discussed above is the limited main memory size for longer DNA sequences and the high dependency on the computational resources of the host side. The simple Global algorithm requires additional storage and communication bandwidth. Therefore, we introduce the new tile-based method, which simplifies the computational model and also handles very long sequences with less memory transmission. The tile-based method can be described as follows: if a matrix can be fully loaded into memory, do so; otherwise divide the large matrix into tiles to ensure that each tile fits in the processor's memory as a whole, and then calculate the tiles one by one or in parallel. Thus, for the matrix of size 4x4 shown in the left part of Figure 5(a), the numbers of rows (r) and columns (c) are 4 and 4 respectively. In our algorithm, the rows become diagonal rows, so the number of rows is now (r + c - 1), i.e., 7, as shown in the right part of Figure 5(a). Each matrix element corresponds to one tile, and for each diagonal row one file is created in which computed values are stored and destroyed synchronously. This new parallel implementation is shown in Figure 5(b).
The wave-front algorithm is a very important method used in a variety of scientific applications. The computing procedure resembles the frontier of a wave filling a matrix, where each block's value is calculated from the values of previously calculated blocks. On the left-hand side of Figure 5(a) we show the wave-front structure for parallelizing the Global algorithm. Since the value of each block in the matrix depends on the left, upper, and upper-left blocks, blocks with the same color in the figure are put in the same parallel computing wave-front round. The process for this wave-front algorithm is shown in Pseudo code 1.

Fig 5(a): Wave-front structure of the global algorithm for sequence alignment.

calcSteps = (tileYCount + tileXCount) - 1;
for (row = 0; row < calcSteps; row++) {
    #pragma omp parallel shared(col, row, diagonalRow, n, sum, chunk) private(i)
    {
        #pragma omp for schedule(static, chunk) nowait
        for (col = 0; col <= row; col++) {
            diagonalRow = row - col;
            if (diagonalRow < tileYCount && col < tileXCount) {
                /* compute tile (diagonalRow, col) */
            }
        }
    }
}

Fig 5(b): Pseudo code corresponding to Fig 5(a)

The left-hand side of the figure shows the wave-front process, and the right-hand side shows the tile data skewing used to implement the wave-front. The experimental results (Figure 8) show that all the graphs are parabolic curves. Hence, for all the results, the minimum time required over different numbers of threads is

    t_min = (4ac - b^2) / (4a)            -------------(2)

where a, b, c are the constants in the equation of the parabola; these values are calculated using the least-squares method. When calculated, the theoretical values matched our experimental results.


6. EXPERIMENTAL RESULTS
6.1 General Test The experiments were performed on a quad-core Intel 2.83 GHz CPU platform with 3 GB memory and a 500 GB hard disk. We employ an automatic sequence-generating program to generate different test cases.
The general test was executed on four implementations of the Global algorithm for biological sequence alignment. The first is the serial implementation; the second is the simple wave-front implementation using OpenMP; the third is the traditional tile-based implementation; and the fourth is our new tile-based implementation. For long sequences, the simple wave-front version cannot load the entire dynamic programming matrix into the processor memory. Therefore, we select groups of sequences of variable lengths to test on the Linux platform. The experimental results are shown in Table 1.

In the serial version, the compiler's optimization methods have a great effect on the manner of accessing memory; on a multi-core processor, however, this is not true, because the multi-core processor has a special memory hierarchy and the access time for the different levels of the hierarchy varies greatly. For example, global memory has an access time of about 500 cycles, while on-chip memory such as a register takes only 1 cycle. This is because the multi-core processor does not provide a good mechanism for optimizing memory accesses at the compiler level. We can also see that, among the four versions of the algorithm, our tile-based method is the best. In fact, since the length of the test sequences is not long, the only difference between the simple and tile-based algorithms is that the tile-based algorithm introduces some optimization methods. The results show that our optimization methods are effective.

6.2 Test of Tile-based Global Algorithm on 12-core processor
The experiments are performed on a platform that has a 12-core Intel 2.83 GHz CPU with 3 GB memory and 500 GB of disk storage. The selected tile size and the size of the sequence significantly affect the algorithm's efficiency. In our test, we focus on how the number of threads affects the final performance. Our tests are divided into two parts: the time of computing each tile, and the time of sequence alignment using our tile-based algorithm. The test results are shown in Figure 7. The time for computing each tile is linearly constant; hence we do not include it, because we want to show the time changes for computing the sequences separately. From Figure 7, we see that with the growth of the average sequence length, the performance increases massively with our method. We select the tile-size parameters, which are used to decide how to partition sequences of average length, to keep the residue length at 200, because when the average sequence length exceeds 32,000 the time of sequence alignment increases much faster.

Seq. Len. | Serial (s) | Simple wave-front (s / speedup) | Traditional tiling (s / speedup) | New tile method (s / speedup)
16000     | 21.17      | 18.2564 / 1.1591                | 12.4578 / 1.6987                 | 1.8564 / 11.399
32000     | 84.86      | 51.2563 / 1.6555                | 35.2457 / 2.4075                 | 7.8572 / 10.799
64000     | 336.1      | 245.3256 / 1.3696               | 185.4273 / 1.8121                | 30.437 / 11.0396
128000    | 777.3      | 547.2568 / 1.4201               | 356.8794 / 2.1777                | 72.773 / 10.6798
256000    | 1470       | 800.214 / 1.8378                | 496.2589 / 2.9634                | 144.18 / 10.2001

Table 2: Performance assessment of three different global implementations on 12-core processor with tile size 1000 x 1000 (execution time in seconds / speedup).

[Figure 7: execution time in seconds versus sequence length (16000 to 256000) for the Serial, Simple Wavefront OpenMP, Traditional Tile with OpenMP, and Our method of tile with OpenMP versions.]

Fig 7. Experimental results on 12-core processor for variable sequence length.

We solved the sequence alignment problem in parallel with different versions of the global algorithm, and our new tile-based algorithm exhibits the best performance. This is emphasized in Figure 8, which shows a pairwise comparison of the overall speedups for all the versions of the parallel global algorithm. The comparison is done for each pair of sequences. Figure 8 shows that the tile-based implementation is an order of magnitude faster than the simple wave-front implementation, because it only needs to calculate a portion of the dynamic programming matrix. In


addition, the time growth of the tile-based implementation is lower. From Figure 8, we see that as the residue length increases from 32000 to 256000, the time for the tile-based method increases by only about 190 seconds under the Linux system, while the other methods grow by at least 500 and 800 seconds respectively. As the sequence length increases, the growth of the calculation time needed by our tile-based method is very low.
12

Fig. 9 shows the performance evaluation on a 64-core multithreaded processor, plotting sequence length against computation time in seconds. Figures 10 and 11 show the speedup variation with sequence length for all versions of the global algorithm.

[Figure 8: speedup (0-12) vs. sequence length (16000-256000) for simple wave-front using OpenMP, traditional tile with OpenMP, and new method of tile with OpenMP.]

Fig 8. Comparison of the overall speedups for all versions of Parallel Global Algorithm on 12-core processor.

[Figure 9: time (seconds, 0-1600) vs. sequence length (16000-256000) for Serial, Simple Wavefront OpenMP, Traditional Tile with OpenMP, and Our method of tile with OpenMP.]

Fig 9. Experimental results on 64-core processor for variable sequence length.

6.3 Test of Tile-based Global Algorithm on 64-core processor.


The experiments are performed on a platform with a 64-core Intel 2.53 GHz CPU and a 12 MB cache.
Table 3: Performance assessment of three different global implementations on 64-core processor with tile size 1000 x 1000 (execution time in seconds; speedup is relative to the serial version).

Seq. Len.   Serial     Simple wave-front      Traditional tiling     New tile method
            Time       Time       Speedup     Time        Speedup    Time      Speedup
16000       20.46      10.7190    1.9082      6.38930     3.2013     0.0005    40907.2
32000       82.66      22.6239    3.6536      16.3214     5.0644     0.0043    19222.65
64000       310.6      98.3412    3.1581      85.5439     3.6305     0.0173    17951.82
128000      650.5      254.5421   2.5560      143.7619    4.5250     1.4739    441.36
256000      1389       400.871    3.4675      220.4846    6.3043     1.9486    713.33

[Figure 10: speedup vs. sequence length (16000-256000) for Simple Wavefront with OpenMP and Traditional Tile with OpenMP.]

Fig. 10. Comparison of the overall speedups for simple wave-front and traditional tiling mechanisms on 64-core processor.

[Figure 11: speedup (0-45000) vs. sequence length (16000-256000) for Our Method of Tile with OpenMP.]

Fig 11. Comparison of the overall speedups for our new tile-based version on 64-core processor.


7. Conclusion and future work


We efficiently applied our new tile-based nested parallelization with OpenMP to increase scalability on multi-core processors. With OpenMP, the performance increase is nearly linear in the number of processors and requires minimal modifications to the source code, affecting neither the algorithmic structure nor the portability of the code. This clearly shows that shared-memory parallelization is a suitable way to achieve better runtime performance for sequence alignment applications. The experimental results show that our tile-based parallel algorithm achieves better speedup compared with the serial, simple wave-front, and traditional tiling implementations of the global algorithm. In future work, we will implement our new tile-based global algorithm on distributed-memory architectures, using the MPI model and a hybrid model, for biological sequence alignment in the multiple sequence alignment and sequence database search problems. This new implementation will require more computing resources. When heterogeneous processor types are added, the times required to compute the tile matrices will differ between them. Theoretically, the tile-based algorithm should achieve scalable performance, because it requires little communication between the participating hosts.

REFERENCES
[1] K. Jiang, O. Thorston, A. Peters, B. Smith, C.P. Sosa. An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence-Search on a Massively Parallel System. IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 1, 2008.
[2] M. Ishikawa, T. Toya, M. Hoshida, K. Nitta, A. Ogiwara, M. Kanehisa. Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci., 9(3):267-273, June 1993.
[3] S.-I. Tate, I. Yoshihara, K. Yamamori, M. Yasunaga. A parallel hybrid genetic algorithm for multiple protein sequence alignment. In Proceedings of the 2002 Congress on Evolutionary Computation (CEC '02), 12-17 May 2002, Vol. 1, pp. 309-314.
[4] Y. Liu, W. Huang, J. Johnson, S. Vaidya. GPU Accelerated Smith-Waterman. In Proc. Int'l Conf. Computational Science (ICCS '06), pp. 188-195, 2006.
[5] R. Horn, M. Houston, P. Hanrahan. ClawHMMer: A streaming HMMer search implementation. In Proc. Supercomputing, 2005.
[6] W. Liu, B. Schmidt, G. Voss, W. Muller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE TPDS, Vol. 18, No. 9, pp. 1270-1281, 2007.
[7] Rong X. Pairwise Alignment - CS262 - Lecture 1 Notes [online], Stanford University, Jan 2003. Available: http://ai.stanford.edu/~serafim/cs262/Spring2003/Notes/1.pdf
[8] OpenMP Architecture Review Board. The OpenMP specification for parallel programming, 2008. http://www.openmp.org
[9] J.M. Anderson, S.P. Amarasinghe, M.S. Lam. Data and computation transformations for multiprocessors. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, Santa Barbara, California, July 19-21, 1995, pp. 166-178. SIGPLAN Notices, 30(8), August 1995.
[10] J.M. Anderson, M.S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, New Mexico, June 23-25, 1993, pp. 112-125. SIGPLAN Notices, 28(6), June 1993.
[11] M.E. Wolf, M.S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, June 26-28, 1991, pp. 30-44. SIGPLAN Notices, 26(6), June 1991.
[12] A.W. Lim, M.S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Conference Record of POPL '97: The 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Paris, January 15-17, 1997, pp. 201-214.
[13] M. Baker, R. Buyya. Cluster Computing at a Glance (Chapter 1). Division of Computer Science, University of Portsmouth, Southsea, Hants, UK; School of Computer Science and Software Engineering, Monash University, Melbourne, Australia. Email: Mark.Baker@port.ac.uk, rajkumar@dgs.monash.edu.au
[14] Yang Chen, Songnian Yu, Ming Leng. In International Federation for Information Processing (IFIP), Volume 207, Knowledge Enterprise: Intelligent Strategies in Product Design, Manufacturing and Management, eds. K. Wang, G. Kovacs, M. Wozny, M. Fang (Boston: Springer), pp. 311-321, 2006.
[15] S. Aji, F. Blagojevic, W. Feng, D.S. Nikolopoulos. CellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine. In Proceedings of the 2008 ACM International Conference on Computing Frontiers, pp. 13-22.
[16] G. Viswanathan, J.R. Larus. User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison, January 1996.
[17] S.B. Scholz. On defining application-specific high-level array operations by means of shape-invariant programming facilities. In APL '98: Proceedings of the APL98 Conference on Array Processing Language, New York, NY, USA, ACM, 1998, pp. 32-38.
[18] S.J. Deitz, B.L. Chamberlain, L. Snyder. High-level language support for user-defined reductions. J. Supercomput., 23(1):23-37, 2002.
[19] UPC Consortium. UPC Collective Operations Specifications V1.0. A publication of the UPC Consortium, 2003.
[20] MPI Forum. MPI: A message-passing interface standard (version 1.0). Technical report, May 1994. http://www.mcs.anl.gov/mpi/mpi-report.ps
[21] P. Kambadur, D. Gregor, A. Lumsdaine. OpenMP extensions for generic libraries. In Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism (IWOMP '08, International Workshop on OpenMP), Vol. 5004/2008, Springer Berlin/Heidelberg, 2008, pp. 123-133.
[22] P. Kambadur, D. Gregor, A. Lumsdaine. OpenMP extensions for generic libraries. In Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism (IWOMP '08, International Workshop on OpenMP), Vol. 5004/2008, Springer Berlin/Heidelberg, 2008, pp. 123-133.
[23] S. Aji, F. Blagojevic, W. Feng, D.S. Nikolopoulos. CellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine. In Proceedings of the 2008 ACM International Conference on Computing Frontiers, pp. 13-22.
[24] M. Katoh, M. Kuma. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30:3059-3066, 2002.
[25] F. Jiang. Multiple Sequence Alignment based on k-mer and FFT. Graduation thesis, Xidian University, 2006 (in Chinese).
[26] CUDA: Compute Unified Device Architecture, Programming Guide, Version 2.0. NVIDIA Corp., 2008.
[27] Guangming Tan, Ninghui Sun, Guang R. Gao. A Parallel Dynamic Programming Algorithm on a Multicore Architecture. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; CAPSL, Department of Electrical & Computer Engineering, University of Delaware, Newark, DE, USA, 2007.
[28] K.B. Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, Montréal, Québec, May 1999.
[29] H.H.J. Hum, O. Maquelin, K.B. Theobald, X. Tian, G.R. Gao, L.J. Hendren. A study of the EARTH-MANNA multithreaded system. International Journal of Parallel Programming, 24(4):319-347, August 1996.
