
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 1, Number 2, Sept - Oct (2010), pp. 57-68
IAEME, http://www.iaeme.com/ijcet.html

A COMPREHENSIVE STUDY OF NON-BLOCKING JOINING TECHNIQUES


Glory Birru
Computer Science and Engineering, Karunya University, Tamil Nadu
E-Mail: Glory.Birru@live.com

Silja Varghese
Computer Science and Engineering, Karunya University, Tamil Nadu
E-Mail: varghesesilja287@gmail.com

Ms. G. Hemalatha
Assistant Professor, CSE Dept, Karunya University, Coimbatore, India
E-Mail: hema_latha207@yahoo.com

ABSTRACT:
The huge amount of available data requires that it be stored at different locations with minimal memory requirements and easy retrieval. This need gave rise to databases and database management systems (DBMS). Retrieval is simple and quick when the data is stored at a single location (logical or physical); it becomes non-trivial when the data is spread across several places. The technique of bringing data from different locations (here, tables) together for use is called joining. Joins have been used since the development of databases, and many techniques have since been introduced, some as modifications of existing ones and some with a different approach altogether. In a real-time query execution environment, when the number of tuples is large, the join takes the largest share of time and CPU usage. In this paper we explain and compare the non-blocking joining techniques and their approaches. The techniques are compared on execution time, flushing policy, memory requirements, I/O complexity and other factors that make one algorithm preferable to another in a given environment. The ability of a technique to handle multiple inputs and continuously arriving tuples, and to give good results with the available resources, is of particular significance.

Keywords: Blocking, Non-blocking, CPU usage, Memory usage, Execution time.

INTRODUCTION:
The state-of-the-art joining techniques all share the basic assumption that the tuples or relations to be joined are available in memory before the join begins. This assumption, though simple, cannot always be met. The availability of large amounts of real-time data requires that the join be performed as the tuples arrive. This introduces the distinction between blocking and non-blocking join algorithms: the former require all of the input beforehand, while the latter do not. Blocking algorithms, though popular, cannot be used in real-time environments, and so non-blocking algorithms came into existence. This paper explains some of the non-blocking techniques for joining the tuples in a relation.

1. SYMMETRIC HASH JOIN


The symmetric hash join (SHJ) algorithm is non-blocking. The operator maintains two hash tables, one for each relation, and each hash table uses a different hash function. It supports the traditional demand-pull pipeline interface. The algorithm alternates between two steps. First, read a tuple from the inner relation and insert it into the inner relation's hash table using the inner relation's hash function; then use the new tuple to probe the outer relation's hash table for matches, using the outer relation's hash function. Second, when probing with the inner tuple finds no more matches, read a tuple from the outer relation and insert it into the outer relation's hash table using the outer relation's hash function; then use the outer tuple to probe the inner relation's hash table for matches, using the inner relation's hash function. These two steps are repeated until there are no more tuples to be read from either input relation.

Figure 1 Symmetric hash joins



SHJ aims at producing its output tuples as early as possible in the process of calculating the join, without decreasing the performance of the join operation itself.
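The insert-then-probe cycle of SHJ can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: Python dictionaries stand in for the two hash tables (so both use Python's built-in hashing rather than two distinct hash functions), the arrival order is given by a single interleaved `stream`, and the names `symmetric_hash_join`, `key_r` and `key_s` are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Yield (r, s) pairs as soon as both matching tuples have arrived.

    `stream` is an iterable of (source, tuple) pairs, where source is
    "R" or "S". `key_r` and `key_s` extract the join key of each relation.
    """
    # One in-memory hash table per relation (dicts are an assumption here).
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for source, tup in stream:
        other = "S" if source == "R" else "R"
        k = keys[source](tup)
        # Step 1: insert the new tuple into its own relation's table.
        tables[source][k].append(tup)
        # Step 2: probe the other relation's table and emit matches at once.
        for match in tables[other].get(k, []):
            yield (tup, match) if source == "R" else (match, tup)
```

Because every arriving tuple is both inserted and probed, a result is emitted as soon as its second half arrives, whichever relation it comes from.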

2. XJOIN: A REACTIVELY-SCHEDULED PIPELINED JOIN OPERATOR


XJoin is a non-blocking join operator based on the symmetric hash join algorithm. XJoin is optimized to produce initial results quickly and to hide intermittent delays in data arrival by reactively scheduling background processing. XJoin is based on two fundamental principles:
1. It is optimized for producing results incrementally as they become available.
2. It allows progress to be made even when one or more sources experience delays.

Algorithm Details:
XJoin works in three stages. The first and second stages run while there are still tuples coming from either source, and the third stage is a cleanup executed after all the tuples have been received. The first stage hashes tuples into partitions and then probes the complementary memory partition for a match. If the memory allocated to the join has been exhausted, tuples are flushed to disk to make room for more incoming tuples.

Figure 2 handling the partitions


If both sources become blocked, the first stage yields to the second. This stage chooses a disk partition, reads the tuples it contains into memory and probes the corresponding memory partition of the other relation. The tuples in this disk partition cannot be discarded at this point because they may still join with inputs that have not yet arrived. The second and third stages avoid producing spurious duplicates by keeping timestamps of when the second stage was run for a particular disk partition.

Figure 3 Memory to memory joins

Thus, if a tuple in a disk partition is repeatedly run against the same tuples in a memory partition, the timestamps show that the two have already matched and the match is dropped. The third stage is a cleanup stage. For each set of partitions, it loads all of one into memory and then streams the corresponding disk and memory partitions by it. Once all the partitions have been processed, the join is complete. The three stages of XJoin run as separate threads.
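One way a timestamp check of this kind could work is sketched below in Python. The text above only states that timestamps are kept; the representation used here (an arrival time and a flush time per tuple, with an interval-overlap test) is an assumption made for illustration, and the names `Stamped` and `joined_in_memory` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamped:
    value: tuple
    arrived: int  # time the tuple entered memory (assumed bookkeeping)
    flushed: int  # time the tuple was flushed to disk (assumed bookkeeping)


def joined_in_memory(a, b):
    """Return True if a and b were memory resident at the same time.

    If their residency intervals overlap, the later arrival probed the
    earlier one in the first stage, so the pair was already output and a
    later stage should drop it.
    """
    return a.arrived <= b.flushed and b.arrived <= a.flushed
```

A later stage would call such a check before emitting a pair and skip the pair when it returns True.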

MEMORY OVERFLOW HANDLING:


When memory overflows, XJoin flushes the largest single partition (bucket), taken from only one source. The flushing policy affects the duplicate detection strategy of the join algorithm. It also affects performance in two ways:
Join output rate: the number of results generated while input is being received. This depends on the tuples in memory.
Overall execution time: the total time may change depending on the cost of flushing and post-join cleanup.
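The flush-largest policy can be sketched in a few lines of Python. This is an illustration only: partitions are modelled as dictionaries of lists, `disk` is a plain dictionary standing in for disk storage, at least one partition is assumed to exist, and the function name `flush_largest` is hypothetical.

```python
def flush_largest(partitions_r, partitions_s, disk):
    """Flush the single largest in-memory partition of either source.

    partitions_r / partitions_s map a partition id to its in-memory tuples.
    `disk` maps (source, partition id) to the tuples flushed so far.
    Returns (source, partition id, number of tuples freed).
    """
    candidates = [("R", pid, tuples) for pid, tuples in partitions_r.items()]
    candidates += [("S", pid, tuples) for pid, tuples in partitions_s.items()]
    # Pick the partition holding the most tuples, from one source only.
    source, pid, tuples = max(candidates, key=lambda c: len(c[2]))
    disk.setdefault((source, pid), []).extend(tuples)
    freed = len(tuples)
    tuples.clear()  # the memory partition is now empty
    return source, pid, freed
```

Because only one side of a partition pair leaves memory, later stages need the timestamp bookkeeping described earlier to avoid reporting a match twice.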

3. PROGRESSIVE MERGE JOIN: GENERIC APPROACH AND NON-BLOCKING SORT-BASED JOIN ALGORITHM
Progressive Merge Join (PMJ) is derived from sort-merge join. PMJ computes results already during the sorting phase. It does so by sorting both input sets simultaneously and by joining data items that are in main memory at the same time. The first result can therefore be produced before sorting completes.


PMJ ALGORITHM
PMJ consists of two phases. In the first phase, PMJ reads as much data as possible from the two input sets into the available memory. Both subsets are then sorted using an internal sorting algorithm such as Quicksort. The sorted sequences are joined using an in-memory join algorithm. After that, both sequences are temporarily written to external memory. PMJ continues loading subsets of the remaining input into memory, sorting and joining them, until the input is completely processed. In the second phase, PMJ generates longer runs by merging the sequences that were temporarily written to external memory.
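The first phase can be sketched as follows in Python. The sketch rests on several assumptions: memory is split evenly between the two inputs, Python's built-in `sorted` is used in place of Quicksort, a list of runs stands in for external memory, and the names `merge_join` and `pmj_phase_one` are hypothetical.

```python
from itertools import islice


def merge_join(run_r, run_s, key):
    """Join two lists that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(run_r) and j < len(run_s):
        kr, ks = key(run_r[i]), key(run_s[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the block of S tuples sharing this key.
            j_end = j
            while j_end < len(run_s) and key(run_s[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key with every S tuple in the block.
            while i < len(run_r) and key(run_r[i]) == kr:
                out.extend((run_r[i], s) for s in run_s[j:j_end])
                i += 1
            j = j_end
    return out


def pmj_phase_one(input_r, input_s, memory_tuples, key=lambda t: t[0]):
    """Return (early results, sorted runs left for the merging phase)."""
    it_r, it_s = iter(input_r), iter(input_s)
    half = memory_tuples // 2  # assumed: equal share of memory per input
    early_results, runs = [], []
    while True:
        chunk_r = sorted(islice(it_r, half), key=key)
        chunk_s = sorted(islice(it_s, half), key=key)
        if not chunk_r and not chunk_s:
            break
        # Join what is in memory right now, before anything is written out.
        early_results.extend(merge_join(chunk_r, chunk_s, key))
        runs.append((chunk_r, chunk_s))  # stands in for writing runs to disk
    return early_results, runs
```

Matches between tuples that land in different runs are not found here; they are left for the second phase, which merges the stored runs.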

MEMORY OVERFLOW HANDLING:


In PMJ, memory overflow is handled by the flushing policy: when memory is full, the whole memory is flushed to disk in large buckets. Because of this kind of flushing, PMJ has good I/O performance.

4. HASH MERGE JOIN: A NON-BLOCKING JOIN ALGORITHM FOR PRODUCING FAST AND EARLY JOIN RESULTS.
The Hash Merge Join (HMJ) algorithm deals with data items that arrive from remote sources over unpredictable, slow, or bursty network connections. The HMJ algorithm is designed with two goals in mind: (1) minimize the time to produce the first few results, and (2) produce join results even if the two sources of the join operator occasionally get blocked.

HMJ ALGORITHM
The hash-merge join algorithm has two phases: the hashing phase and the merging phase. The hashing phase employs an in-memory hash-based join algorithm that produces join results as quickly as data arrives. Once the memory is full, certain parts of the memory are flushed to disk storage to free space for newly incoming tuples. If one of the sources is blocked for any reason, e.g., due to slow or bursty network traffic, the hashing phase can still produce join results from the unblocked source. If both input sources are blocked, the HMJ algorithm starts its merging phase. In the merging phase, the parts previously flushed to disk are joined together using a sort-merge-like join algorithm. Thus, the HMJ algorithm can produce join results even if the two sources are blocked. Once either of the two sources is unblocked, the HMJ algorithm switches back to the hashing phase. The HMJ algorithm switches back and forth between the two phases until all data items have been received from the remote sources. Then, the whole memory is flushed to disk storage and the merging phase takes place to produce the final part of the join result. The hash merge join algorithm is shown diagrammatically in Figure 4.

Figure 4 Hash Merge Join

MEMORY OVERFLOW HANDLING


In the HMJ algorithm, an adaptive flushing policy is used to handle memory overflow. The adaptive flushing policy aims to keep the memory balanced, so that it holds a similar number of tuples from each source. The policy also lets us set the acceptable bucket size. It flushes partitions in pairs: it chooses two victim buckets, one from each source, with the same hash value. Because a pair of partitions is flushed together, timestamps are not required to prevent duplicates. A flushed partition is sorted before being written to disk, because the merging phase, which runs when both sources are blocked, performs a modified progressive merge join to produce results.
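A possible reading of the adaptive flushing policy is sketched below in Python. The selection rule is an assumption made for illustration: among bucket pairs whose combined size reaches an acceptable threshold, it picks the pair whose removal leaves the two sources most evenly balanced. The names `choose_victim_pair`, `flush_pair` and `min_pair_size` are hypothetical, and a list stands in for disk storage.

```python
def choose_victim_pair(buckets_r, buckets_s, min_pair_size):
    """Return the hash value of the bucket pair to flush, or None.

    buckets_r / buckets_s are lists of lists indexed by hash value.
    """
    total_r = sum(len(b) for b in buckets_r)
    total_s = sum(len(b) for b in buckets_s)
    best, best_imbalance = None, None
    for h, (br, bs) in enumerate(zip(buckets_r, buckets_s)):
        if len(br) + len(bs) < min_pair_size:
            continue  # pair too small to be worth a disk write (assumed rule)
        # Difference in tuple counts between the sources after flushing h.
        imbalance = abs((total_r - len(br)) - (total_s - len(bs)))
        if best is None or imbalance < best_imbalance:
            best, best_imbalance = h, imbalance
    return best


def flush_pair(buckets_r, buckets_s, h, disk, key=lambda t: t[0]):
    """Sort both buckets with hash value h and append them to `disk`."""
    disk.append((h, sorted(buckets_r[h], key=key), sorted(buckets_s[h], key=key)))
    buckets_r[h], buckets_s[h] = [], []
```

Flushing both buckets of a pair at the same moment means every match between them was either produced in memory before the flush or is still to be produced on disk, which is why no timestamps are needed.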

5. RPJ: PRODUCING FAST JOIN RESULTS ON STREAMS THROUGH RATE-BASED OPTIMIZATION


Rate-based Progressive Join (RPJ) maximizes the output rate by optimizing its execution according to the characteristics of the join relations, for example the data distribution and the tuple arrival pattern. The objectives are to (i) generate the first result as early as possible (soon after data transmission begins), and (ii) output the remaining results at a fast rate (as tuples continuously arrive). The existing algorithms assume that the memory is not large enough to accommodate all the tuples received from the input streams, so that part of the data must be migrated to disk.

RPJ ALGORITHM
During the online phase, RPJ performs as HMJ does. When memory is full, it applies its flushing policy. When both relations become blocked, RPJ begins its reactive phase, which combines the XJoin and HMJ reactive phases. The tuples from one of the disk buckets of either relation can join with the corresponding memory bucket of the opposite relation, as in the case of HMJ and PMJ. The algorithm chooses the task that has the highest output rate. During its cleanup phase, RPJ joins the disk buckets. The duplicate avoidance strategy is similar to the one applied by join. The algorithm is depicted in Figure 5.

Figure 5 Rate based progressive join

MEMORY OVERFLOW HANDLING


RPJ uses an optimal flushing policy for memory overflow handling. When memory is full, it tries to estimate which tuples have the smallest chance of participating in joins. The flushing policy is based on an estimate of the probability that a new incoming tuple belongs to a given relation and falls into a given bucket. Once all probabilities are calculated, the flushing policy is applied. If the victim bucket does not contain enough tuples, the bucket with the next smallest probability is chosen. All the tuples that are flushed together from the same relation form a sorted segment, as in HMJ.


6. MAXIMIZING THE OUTPUT RATE OF MULTI-WAY JOIN QUERIES OVER STREAMING INFORMATION SOURCES
This work explores the complementary approach of allowing non-binary trees, that is, generalizing existing streaming binary join algorithms to produce a multi-way streaming join operator, called MJoin, that works over more than two inputs. Using a single multi-way join, an arrival from any input source can be used to generate and propagate results in a single step, without having to pass these results through a multi-stage binary execution pipeline. Furthermore, since the operator is completely symmetric with respect to its inputs, there is no need to restructure a query plan in response to changing input arrival rates.

MULTI-WAY JOIN ALGORITHM


The algorithm first creates as many hash tables as there are inputs. When a new tuple arrives at an input, it is inserted into the corresponding hash table and used to probe the remaining hash tables. This generates every possible result tuple that can be produced by joining the new arrival with the memory-resident tuples of the other relations. Not all hash tables are probed for every arrival, because the sequence of probes stops whenever a probe of a hash table finds no matches (in this case it cannot produce answer tuples). For instance, for the second probe operation to execute, the first one has to produce matches. The probing sequence is different for each input and is organized so that the most selective predicates are evaluated first. This ensures that the smallest number of temporary tuples is generated.
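The probing sequence with early termination can be sketched in Python as follows. The sketch assumes that all inputs are joined on one shared key, that the per-input probe order is supplied by the caller (by default simply the other inputs in index order, not a selectivity-based order), and that everything fits in memory. The class name `MJoin` follows the text, while its methods and parameters are hypothetical.

```python
from collections import defaultdict


class MJoin:
    def __init__(self, num_inputs, probe_order=None):
        # One hash table per input.
        self.tables = [defaultdict(list) for _ in range(num_inputs)]
        # probe_order[i] lists the other inputs in the order input i probes them.
        self.probe_order = probe_order or {
            i: [j for j in range(num_inputs) if j != i] for i in range(num_inputs)
        }

    def insert(self, source, key, tup):
        """Insert a tuple and return the complete results it produces.

        Each result is a dict mapping input index to the joined tuple.
        """
        self.tables[source][key].append(tup)
        partial = [{source: tup}]
        for other in self.probe_order[source]:
            matches = self.tables[other].get(key, [])
            if not matches:
                return []  # a probe found nothing: stop, no result is possible
            # Extend every partial result with every match from this table.
            partial = [{**p, other: m} for p in partial for m in matches]
        return partial
```

Placing the input least likely to match first in each probe order makes the early exit fire sooner, which is the effect the selectivity-based ordering aims for.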

MEMORY OVERFLOW HANDLING


Coordinated flushing is a technique that can improve the output rate in the presence of memory overflow. It also addresses the problem of deciding how best to partition a large multi-way join into a set of one or more MJoin operators. With coordinated flushing, when a new tuple arrives on any input stream and falls into an in-memory partition, it is immediately probed against the in-memory partitions of the other streams; if it falls into a disk-resident partition, it is added to an output buffer for that partition and is not probed against the other streams.


7. EARLY HASH JOIN: A CONFIGURABLE ALGORITHM FOR THE EFFICIENT AND EARLY PRODUCTION OF JOIN RESULTS.
Early hash join is a hash-based join algorithm specifically designed for interactive query processing that has a fast response time like other early join algorithms with an overall execution time that is significantly shorter. It is a customizable hash join algorithm, which produces results early without a major penalty in total execution time. Early hash join reduces the total execution time and number of I/O operations by biasing the reading strategy and flushing policy to the smaller relation.

EARLY HASH JOIN (EHJ) ALGORITHM


The early hash join (EHJ) algorithm allows the optimizer to dynamically customize its performance, trading off early production of results against minimal total execution time. Early hash join is based on symmetric hash join. It uses one hash table for each input. A hash table consists of P partitions, and each partition consists of B buckets. A bucket can store a linked list of pages, where each page can store a fixed number of tuples. When a tuple from an input arrives, it is first used to probe the hash table of the other input to generate matches. Then, it is placed in the hash table for its own input.

In this first, in-memory phase, alternate reading is used by default, as it was shown to be the best fixed reading strategy. However, it is possible to select different reading strategies (that favor R, the smaller relation) if the bias is to minimize total execution time. At any time, the user or optimizer can change the reading policy and know the expected output rate.

Once memory is full, the algorithm enters its second phase (called the flushing phase). In the flushing phase, the algorithm uses biased flushing to favor buffering as much of R in memory as possible. By default, it increases the reading rate to favor reading more of R. This reduces the expected output rate, but decreases the total execution time.

In both phases, the algorithm applies the optimizations that discard tuples when performing one-to-many joins and many-to-many joins once all of R has been read. Note that for one-to-many joins, if a tuple from R matches tuple(s) in S in the hash table, then those tuples must be deleted from the hash table. For mediator joins, a concurrent background process can be activated if the inputs are slow. After all of R and S have been read, the algorithm performs a cleanup join to generate all possible join results.


MEMORY OVERFLOW HANDLING:


The biased flushing policy favors flushing partitions of S before partitions of R, and transitions the algorithm into a form of dynamic hash join. The biased flushing policy uses these rules to select a victim partition whenever memory must be freed:
1. Select the largest, non-frozen partition of S.
2. If no such partition of S exists, then select the smallest, non-frozen partition of R.
Once a partition is flushed, all buckets of its hash table are removed and replaced by a single page buffer. This partition is considered frozen (non-replacement): it cannot buffer any tuples in memory (except in the single page buffer) and cannot be probed. If a tuple hashes to this partition, it is placed in the page buffer, which is flushed when filled. If a tuple of the other input hashes to this partition index, then no probe is performed.
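The two victim-selection rules translate directly into a short Python sketch. The data layout is an assumption: partitions are dictionaries from partition index to the list of in-memory tuples, and frozen partitions are tracked in sets. The function name `select_victim` is hypothetical.

```python
def select_victim(partitions_r, partitions_s, frozen_r, frozen_s):
    """Return ("S" or "R", partition index) of the victim, or None.

    partitions_* map a partition index to its in-memory tuples;
    frozen_* are the sets of partition indexes already flushed.
    """
    # Rule 1: the largest non-frozen partition of S.
    live_s = {p: t for p, t in partitions_s.items() if p not in frozen_s}
    if live_s:
        return "S", max(live_s, key=lambda p: len(live_s[p]))
    # Rule 2: otherwise the smallest non-frozen partition of R.
    live_r = {p: t for p, t in partitions_r.items() if p not in frozen_r}
    if live_r:
        return "R", min(live_r, key=lambda p: len(live_r[p]))
    return None  # everything is already frozen
```

After flushing, the caller would add the returned index to the matching frozen set, so that the partition is neither chosen again nor probed.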

READING STRATEGY
The reading policies are configurable by the optimizer, and can also be changed interactively as the join is progressing or after a certain number of output results have been generated. During the flushing phase, a 5:1 reading strategy is used to continue to produce results while lowering overall execution time. It is also possible to minimize total execution time by reading all of R once memory is full. These settings are chosen because in interactive querying the priority of the first few results is much higher than later query results. Further, early hash join can behave exactly as dynamic hash join by using a reading policy that reads all of R before any of S.

CONCLUSION:
With the increase in the number of users of the World Wide Web and of various real-world applications, a huge amount of data is available that requires processing. Joining the tuples in a relation has become common practice in most applications, and it now takes a significant place in a transaction. Responding to queries in real time requires speeding up query processing, of which the join takes the most time. Hence speeding up the joining of relations has become of prime importance. This paper surveys some of these techniques. Table 1 below shows the comparison of these techniques. Reducing the execution time of a join query is an issue that is still open for improvement. As observed from the studied techniques for joining tuples in a relation, reducing CPU usage requires more memory, and saving memory requires more CPU time or more input/output operations. The best join technique depends on the environment in which the join is applied and on which resource is more valuable. Future work can also address performing join operations on streams of continuous inputs instead of relations.

Table 1 Comparison of the non-blocking joining techniques.

| | SHJ | XJoin | PMJ | HMJ | RPJ | MJoin | EHJ |
|---|---|---|---|---|---|---|---|
| FLUSHING POLICY | No Flushing | Flush Largest | Flush All | Adaptive | Optimal | Coordinated | Biased Flushing |
| I/O COMPLEXITY | Not applicable | High | Less | Less | Less | Reduced By Reading Strategy | Moderate |
| DUPLICATE HANDLING | No Duplicates | Time Stamp | Additional Check | No Duplicates | Time Stamp | Time Stamps | Time Stamps |
| RANGE PREDICATES | Not Allowed | Not Allowed | Not Allowed | Not Allowed | Not Allowed | Not Allowed | Allowed |
| MEMORY REQUIREMENT | High | Less | Less | Comparatively Less | Optimum | Not Efficient | Efficient Use Of Available Memory |
| EXECUTION TIME | High | High | Less I/O so less execution time | Lower than XJoin and PMJ | Lower than XJoin and HMJ | High since recomputation required | Fast but more than DHJ |

REFERENCES:
1. J. Dittrich, B. Seeger, and D. Taylor. Progressive merge join: A generic and non-blocking sort-based join algorithm. In Proceedings of VLDB, 2002.
2. T. Urhan and M. J. Franklin. XJoin: A Reactively-Scheduled Pipelined Join Operator. IEEE Data Eng. Bull., 23(2), 2000.
3. M. F. Mokbel, M. Lu, and W. G. Aref. Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results. In ICDE Conf., 2004.
4. Y. Tao, M. L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis. RPJ: Producing Fast Join Results on Streams Through Rate-based Optimization. In Proceedings of ACM SIGMOD Conference, 2005.
5. S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In VLDB 2003: Proceedings of the 29th international conference on Very large data bases, pages 285-296. VLDB Endowment, 2003.
6. Rahman, Nurazzah Abd Saad, Tareq Salahi. Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results. ITSim 2008. International Symposium on 28 Aug. 2008.
