
2010 International Conference on Distributed Framework for Multimedia Applications (DFmA)

A Parallel Implementation of Hybridized Merge-Quicksort Algorithm on MPICH


Rahmadi Trimananda #1, Christoforus Yoga Haryanto *2

#, * Computer Engineering Department, Universitas Pelita Harapan,
MH Thamrin Boulevard, Lippo Karawaci 15811, Indonesia

1 rahmadi.trimananda@staff.uph.edu

Abstract: Mergesort and quicksort are two well-known sorting algorithms that have, in fact, great potential for parallel execution. This paper shows how the two are combined for sorting large groups of data elements. Hierarchically, data elements are distributed to processors, sorted in smaller groups in parallel on each processor using the quicksort algorithm, and later merged in parallel using the mergesort algorithm. The implementation results on the MPICH platform show potential speedups, provided that the communication channel is adequate for large groups of data elements; otherwise, longer execution times become the problem when the data elements have to be transferred between processors. Nevertheless, the hybridized merge-quicksort algorithm is worth considering for parallel sorting implementations.

Keywords: mergesort, quicksort, message passing, parallel.

I. INTRODUCTION

Data sorting is one of the tasks most frequently performed by computers. Sorting algorithms are used not only in their own right, but also in the pre-processing stages of many other algorithms. Some examples of algorithms that deploy data sorting [1] [2] are:
- Kruskal's algorithm for constructing a minimum spanning tree,
- Graham scan for creating a convex hull, and
- data sorting for binary search.

Nowadays, computer systems based on symmetric multi-processing (SMP) are widely used. With SMP, a single computer can run a number of processes simultaneously. Several methods can be used to exercise that potential, i.e. multithreading, multi-processing with inter-process communication, or particular schemes that support multi-processing such as OpenMP and MPI, e.g. MPICH.

A. Sorting Algorithms

Some well-known data sorting algorithms commonly used for sequential processing are: bubblesort, selectionsort, insertionsort, mergesort, quicksort, heapsort, and radixsort [1] [2]. The first three have a complexity of O(n^2) and the next three of O(n log2 n). The last one has a complexity of O(kn), where k is the length of the chain. In this project, we deploy quicksort in each process for parallel execution, and we modify the merging operation into mergesort to collect the sorting results from each process.

B. MPICH

MPICH is a portable implementation of the Message Passing Interface (MPI), which is freely available for distributed-memory applications in parallel computation [3]. The first MPICH implementation, MPICH1, implements the MPI-1.1 standard [3]; the latest, MPICH2, implements the MPI-2.0 standard. In this project, we use MPICH1. MPI provides many functions for implementing a parallel program. The following six main functions suffice to realize a complete parallel program [3] [4], as the sketch below illustrates:
- MPI_INIT: starting the MPI process,
- MPI_FINALIZE: ending the MPI process,
- MPI_COMM_SIZE: getting the number of processes,
- MPI_COMM_RANK: getting the process ID,
- MPI_SEND: sending messages, and
- MPI_RECV: receiving messages.

Aside from supporting distributed-memory applications across many computers, MPICH can also be used indirectly for running applications in parallel on one computer with a number of processors.
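To make the roles of these six functions concrete, the following is a minimal sketch of an MPICH program in C, in which process 0 sends one integer to process 1. This is an illustration only, not the project's source code:

    /* Minimal use of the six core MPI calls listed above. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int size, rank, value = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);               /* starting the MPI process */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of processes      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's ID        */

        if (rank == 0 && size > 1) {
            /* sending one integer to process 1 with message tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receiving the integer from process 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process %d received %d\n", rank, value);
        }

        MPI_Finalize();                       /* ending the MPI process   */
        return 0;
    }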

C. Motivation of Parallel Merge-Quicksort

In this project, we make several assumptions to form a hypothesis concerning the speedups that may result from sorting data elements in parallel. The assumptions are as follows:
- each processor is unicore and connected to the others by a network,
- the processor speed when reading and writing data is ignored,
- the cost of each operation for sorting and merging data is considered equal, and
- the overhead of the algorithm is considered minimal and complexity constants are ignored.

In parallel processing, the quicksort algorithm has a time complexity of O((n/m) log2(n/m)) and the merging process has O(n log2 m). In reality, the value of n will be much greater than m, so log2(n/m) ≈ log2 n. Therefore, a meeting point should occur between the time complexity curve of data sorting and the time complexity curve of data merging. As a result, diminishing returns occur when the number of processors is increased beyond what the processor speed and the interconnection speed of the network can sustain. In Fig. 1, the total number of operations executed by each process decreases rapidly for small numbers of processes. However, when the number of processes exceeds 8, the number of operations used for sorting data elements is less than the number of operations used for merging.
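Under these assumptions, the crossover in Fig. 1 can be checked with a worked instance of the operation-count model; the numbers below are our own calculation with n = 10^8, matching the 100 million elements of Fig. 1:

\[
T_{\text{sort}}(m) = \frac{n}{m}\log_2\frac{n}{m}, \qquad
T_{\text{merge}}(m) = n\log_2 m .
\]

For n = 10^8 and m = 8,

\[
T_{\text{sort}} = \frac{10^8}{8}\log_2\bigl(1.25\times10^{7}\bigr)
\approx 1.25\times10^{7}\times 23.6 \approx 2.9\times10^{8},
\qquad
T_{\text{merge}} = 10^8\log_2 8 = 3\times10^{8},
\]

so the two curves indeed meet near m = 8.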


Fig. 1 Graph of the number of operations per process with 100 million data elements

Fig. 2 Distribution of data by using the distribution algorithm (Section III-D)

II. RELATED WORKS

Numerous research projects in this area have shown quite appealing results. Researchers keep striving for more efficient sorting algorithms, as many applications need them with less execution time. Parallel implementations of mergesort are shown in [5] and [11], which have been proven to accomplish better performance in parallel execution, while analysis and enhancement of parallel radix- and quicksort algorithms for multithreaded architectures are accomplished in [6]. Other groups are looking for new algorithms as well, such as the ZZ-sort algorithm [7], the AA-sort algorithm [10], and the psort routine [8]. The parallel potential of the GPU is also quite appealing to those who look for cheaper parallel solutions, such as count sort in [9]. In this project, however, we combine two well-known sorting algorithms, mergesort and quicksort, to achieve better performance through parallel processing. The motivation is clear (Section I-C): the hybrid algorithm gains advantage from the nature of the two sorting algorithms, both of which have potential for parallel execution.

III. DESIGN AND IMPLEMENTATION

As explained in Section I, the parallel implementation in this experiment combines the quicksort and mergesort algorithms to optimize the advantage gained from executing sorting processes in parallel.

D. Algorithm

In this project, a set of m processes (m >= 1) is generated, and each process is placed on one of m processors. The first process gets pid = 0 and plays the role of master process: it gets n elements of data as input, distributes tasks, and collects the sorting results. The master process also takes part in the sorting itself. Apart from the master process, the remaining (m - 1) processes are slaves that obtain parts of the data to be processed and help move data in parallel between the master process and the slave processes. The per-process quicksort results are later combined with the mergesort merge step, sketched below.
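The merge step referred to above, and illustrated in Fig. 3, is the classic two-run merge of mergesort. The following C sketch is our illustration; the function name merge_runs and its signature are not from the paper:

    /* Merge two sorted integer runs a[0..na) and b[0..nb) into out.
     * This is the O(na + nb) merge step used to combine sorted
     * sub-arrays produced by the per-process quicksort. */
    void merge_runs(const int *a, int na, const int *b, int nb, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++]; /* drain the rest of a */
        while (j < nb) out[k++] = b[j++]; /* drain the rest of b */
    }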

Fig. 3 Merging of quicksort results by implementing mergesort

After the data has been read by the master process, r is initialized to m on the master process, and the data is distributed in a generic way with time complexity O(n) by the master and slave processes, as described in the following algorithm:

    r <- m
    while r - pid > 1 do
        c <- (pid + r) / 2
        send_to_process(c, r)
        r <- c
    end while

in which send_to_process(c, r) is a function that distributes n((r - c)/m) data elements to another process, given the parameters c and r.
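The following C sketch is our reconstruction of this distribution scheme, under the simplifying assumptions that n is divisible by m and that every process already knows n and m; the matching MPI_Recv on the slave side, which also conveys r, is omitted for brevity:

    #include <mpi.h>

    /* Send the blocks owned by processes c..r-1 to process c.
     * In this process's local buffer, the block of process j
     * starts at offset (j - pid) * (n / m). */
    static void send_to_process(int *data, int n, int m,
                                int pid, int c, int r)
    {
        int count = (r - c) * (n / m);
        MPI_Send(data + (c - pid) * (n / m), count, MPI_INT,
                 c, 0, MPI_COMM_WORLD);
    }

    /* Recursive-halving distribution: the master calls
     * distribute(data, n, m, 0, m); a slave calls it with the
     * (data, r) pair it has just received. On exit, each process
     * keeps its own n/m elements in data[0 .. n/m). */
    void distribute(int *data, int n, int m, int pid, int r)
    {
        while (r - pid > 1) {
            int c = (pid + r) / 2;
            send_to_process(data, n, m, pid, c, r);
            r = c;
        }
    }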

With this division of tasks, the program itself executes in parallel if m > 1 or serially if m = 1. Each process, i.e. the master process and the slave processes, uses the quicksort algorithm on its (n/m) elements. The sorting results from the processes are then collected and amalgamated by applying mergesort; the complexity of this merging process is O(n log2 m).

E. Implementation

This parallel application is implemented in the C programming language and compiled with Microsoft Visual C++ 2008. MPICH 1.0 [3] is used as the parallel subsystem. The data inputs that undergo the sorting process are generated by a random number generator as 32-bit integers. The program is verbose: it displays the steps executed and the execution times. There are also parameters that can be set to display the sorting results, to verify that the program works properly. The program is designed without any standard input, so the parameters are keyed in when the command mpirun is invoked. The parameters are strung together in the invocation of the executable file as follows:

    MPIMergeQuicksort.exe [-p] r n seed

where p indicates the printing of the sorting results and r indicates the random number generation. The program is compiled with the following compilation parameters:
- /O2 : maximize speed,
- /Ob2 : expand any suitable inline function,
- /Oi : enable intrinsic functions,
- /Ot : favor fast code,
- /GL : enable link-time code generation,
- /MD : multi-threaded DLL run-time library, and
- /arch:SSE2 : use Streaming SIMD Extensions 2.

The linker uses a library called mpich.lib with x86-32 as the target architecture. To simplify execution and analysis, a simple shell script is written and executed on the MS-DOS command line of the Windows operating system. For profiling, the program uses the performance counter from the Windows API, which has a resolution finer than 1 ms. For each phase, i.e. data distribution, sorting, and merging, the slowest execution time across processes is taken, to accommodate the different computer specifications in the parallel execution environment.

F. Methods and Specifications

The minimum computer specifications used in this experiment are as follows:
- Processor type : Intel Core 2 Duo T7200,
- Clock speed : 2.0 GHz,
- L2 cache : 4 MB,
- RAM : 1 GB DDR2 667 MHz,
- Network : Gigabit Ethernet via Cat6, and
- Operating system : Windows XP SP3.

In this experiment, each computer's processor consists of two cores (a dual-core processor). Consequently, only two processes per computer are allowed to execute simultaneously. The experiments are conducted in the following setups:
- 1 computer with 1 process (one core active),
- 1 computer with 2 processes (two cores active),
- 2 computers with 2 processes on the first computer and 1 process on the second (three cores active), and
- 2 computers with 2 processes on each (four cores active).

The experiment is run three times for each combination of task distribution and for each amount of data. The results are averaged and then plotted on graphs.
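To illustrate the profiling approach described in Section III-E, the following C sketch times a phase with the Windows high-resolution performance counter; the helper name elapsed_ms is ours, and the slowest time across processes can then be selected with MPI_Reduce using the MPI_MAX operation:

    #include <windows.h>

    /* Milliseconds elapsed between two performance-counter readings. */
    static double elapsed_ms(LARGE_INTEGER start, LARGE_INTEGER end)
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq); /* counter ticks per second */
        return 1000.0 * (double)(end.QuadPart - start.QuadPart)
                      / (double)freq.QuadPart;
    }

    /* Usage around one phase, e.g. the local quicksort:
     *
     *   LARGE_INTEGER t0, t1;
     *   QueryPerformanceCounter(&t0);
     *   ... sorting or merging phase ...
     *   QueryPerformanceCounter(&t1);
     *   double ms = elapsed_ms(t0, t1);
     */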

IV. ANALYSIS

The analysis of the performance of the parallel implementation of quicksort and mergesort is presented in two parts: an individual analysis of the four different setups, and an overall analysis.

A. Analysis of Four Different Setups

Fig. 4 Execution time (ms) for 1 computer with 1 process (logarithmic vs logarithmic graph)

Fig. 5 Execution time (ms) for 1 computer with 2 processes (logarithmic vs logarithmic graph)

Fig. 4 and Fig. 5 show the trend of the execution times. In this case, there is no distribution or merging time for 1 computer with only 1 process. Both plots indicate a linear increase on the logarithmic plot; a straight line on a log-log plot corresponds to near power-law growth, as expected from the O(n log2 n) complexity, which is close to linear in n.


Fig. 8 The execution speeds of the quicksort algorithm on the 4 setups (linear vs logarithmic graph)

Fig. 6 Execution time (ms) for 2 computers with 2 processes on the first computer and 1 process on the second (logarithmic vs logarithmic graph)

Fig. 9 The distribution speeds of the 4 setups (linear vs logarithmic graph)

Fig. 7 Execution time (ms) for 2 computers with 2 processes on each (logarithmic vs logarithmic graph)

Fig. 6 and Fig. 7 show non-linear increases of the execution times against the number of data values. Fig. 6 shows an exponential increase for data distribution and merging; data merging increases exponentially with a greater factor. Fig. 7, in fact, shows that the total execution times are longer than in the other setups. These numbers indicate a problem that occurs when using two computers, i.e. four cores: it most likely involves communication overhead, as the data values have to be transferred between the two computers.

B. Overall Analysis

Fig. 8 shows the speeds of the quicksort algorithm in all setups. It is evident that for a small amount of data (below 1,000,000 elements), 1 computer with 1 or 2 processes is faster than 2 computers. However, the 2-computer setup prevails for large amounts of data. This indicates that the data communication channel and load balancing are the two most important factors in parallel processing.

Fig. 9 illustrates the importance of the data communication channel in parallel processing. In this figure, it is obvious that the distribution speed on 1 computer (with 2 cores) is never exceeded by the 2-computer setups: the internal communication bandwidth is far higher (42.7 Gbps half-duplex for DDR2 667 MHz, as opposed to 1 Gbps full-duplex for Gigabit Ethernet). Additionally, the glitches in Fig. 8 appear due to the limited timer resolution (ms) and the overhead of the profiling procedure. In Fig. 10, the merging speed of the 1-computer setups, i.e. with 1 or 2 cores, is faster than that of the 2-computer setups. The nature of the merging process requires the processors to compare data elements. In the merging process that involves 2 computers, the maximum speed is 525 Mbps for 20,000,000 elements (2 computers with 2 processes on the first computer and 1 process on the second computer), whereas the 1-computer setup is up to 6 times faster. The pit at the position of 1,000,000 elements is believed to be due to the size of the L2 cache (4 MB); in this case, 1,000,000 elements means 1,000,000 integers, and 1 integer is 4 bytes.
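Both of the figures quoted above can be verified with simple arithmetic (our calculation):

\[
667\times10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 64\ \tfrac{\text{bits}}{\text{transfer}} \approx 42.7\ \text{Gbps},
\qquad
10^{6}\ \text{integers} \times 4\ \tfrac{\text{bytes}}{\text{integer}} = 4\ \text{MB},
\]

and the 4 MB working set exactly fills the L2 cache, which explains the dip once the data no longer fits.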


Fig. 10 The execution speeds of the mergesort algorithm (merging process) on the 4 setups (linear vs logarithmic graph)

Fig. 11 The total speeds of the 4 setups (linear vs logarithmic graph)

Fig. 11 shows the total speed of the whole parallel processing. The fastest execution is achieved by 2 processes (2 cores) on 1 computer. This again indicates the ripple effect caused by the bandwidth difference between the data communication channels: the slow channel between the 2 computers has held the system back from gaining an advantage from parallelism.

V. CONCLUSIONS

Some important thoughts for concluding this paper are as follows:
- The experiments show promising results for gaining parallelism from the hybridized merge-quicksort algorithm, as can be observed primarily from the execution of the parallel algorithm on dual-core processors, where the communication channel is not a problem. This is, of course, aside from the fact that the gain from parallel processing decreases significantly as the data travels between computers.
- The division of tasks and the data distribution cannot be completely controlled, as they are handled by MPICH. It seems that, once in a while, the system experiences excessive data transfer in the network as the program runs. There is also a possibility that the data elements circulate around the network of processors, as the parallel processing is done in a master-slave hierarchy.
- Additionally, it is suspected that the MPICH system is not efficient enough to take advantage of the 1 Gbps network capacity, as the trend of the total speed decreases gradually for larger amounts of data.

Regarding those facts, the experiments should be conducted on other platforms, e.g. MPICH2, to compare the results with the ones obtained here. A faster communication channel and better traffic management in the network are also of great interest for better performance of parallel execution.

ACKNOWLEDGMENT

This project has been accomplished through the good teamwork of Computer Engineering students at Universitas Pelita Harapan and the priceless support from the department.

REFERENCES

[1] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Upper Saddle River, NJ: Prentice-Hall, 1999.
[2] K. A. Berman and J. L. Paul, Algorithms: Sequential, Parallel, and Distributed. USA: Course Technology-Thomson, 2005.
[3] D. Ashton, W. Gropp, and E. Lusk, Installation and User's Guide to MPICH, a Portable Implementation of MPI Version 1.2.5: The ch_nt Device for Workstations and Clusters of Microsoft Windows Machines. Chicago, USA: Mathematics and Computer Science Division, Argonne National Laboratory, University of Chicago, 2001.
[4] P. S. Pacheco, A User's Guide to MPI. San Francisco, USA: Department of Mathematics, University of San Francisco, 1998.
[5] G. G. Zheng, S. H. Teng, W. Zhang, and X. F. Fu, "A Cooperative Sort Algorithm Based on Indexing," in Proc. 13th ICCSCWD '09, 2009, pp. 704-709.
[6] L. Rashid, W. M. Hassanein, and M. A. Hammad, "Analyzing and Enhancing the Parallel Sort Operation on Multithreaded Architectures," J. Supercomputing, Springer-Verlag, 2009.
[7] S. Q. Zheng, B. Calidas, and Y. J. Zhang, "An Efficient General In-Place Parallel Sorting Scheme," J. Supercomputing, Netherlands: Springer-Verlag, 1999, pp. 6-17.
[8] D. Man, Y. Ito, and K. Nakano, "An Efficient Parallel Sorting Compatible with the Standard qsort," in Proc. ICPDCAT '09, 2009, pp. 512-517.
[9] W. D. Sun and Z. M. Ma, "Count Sort for GPU Computing," in Proc. ICPDS '09, 2009, pp. 919-924.
[10] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, "AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors," in Proc. ICPACT '07, 2007.
[11] M. S. Jeon and D. S. Kim, "Parallel Merge Sort with Load Balancing," International Journal of Parallel Programming, vol. 31, no. 1, pp. 21-33, Feb. 2003.
