Join Ordering in Fragment Queries: Approach I: Ordering Joins Without Using Semi-Joins

Join Ordering in Fragment Queries
Join ordering is important in a centralized DBMS. It is even more important in a distributed DBMS, because it also determines how much data is transferred between sites.
Notation: R -> site j means that relation R is transferred to site j.
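[Figure: join graph. EMP is stored at site 1, ASG at site 2, PROJ at site 3. EMP and ASG are linked by ENO; ASG and PROJ are linked by PNO.]
Alternative strategies for joining the three relations:
1. EMP -> site 2; site 2 computes EMP' = EMP >< ASG; EMP' -> site 3; site 3 computes the result EMP' >< PROJ.
2. ASG -> site 1; site 1 computes EMP' = EMP >< ASG; EMP' -> site 3; site 3 computes the result EMP' >< PROJ.
3. ASG -> site 3; site 3 computes ASG' = ASG >< PROJ; ASG' -> site 1; site 1 computes the result ASG' >< EMP.
4. PROJ -> site 2; site 2 computes PROJ' = PROJ >< ASG; PROJ' -> site 1; site 1 computes the result PROJ' >< EMP.
5. EMP -> site 2; PROJ -> site 2; site 2 computes the join of all three relations.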
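The five strategies differ only in which relations cross the network, so they can be compared by the number of bytes shipped. The sketch below does this in Python. All sizes, including the sizes of the intermediate results, are made-up numbers chosen for illustration; they do not come from the slides.

```python
# Hypothetical sizes in bytes (assumptions, not taken from the slides).
size = {
    "EMP": 400_000,
    "ASG": 1_000_000,
    "PROJ": 200_000,
    "EMP><ASG": 1_200_000,   # intermediate result of strategies 1 and 2
    "ASG><PROJ": 1_100_000,  # intermediate result of strategies 3 and 4
}

# For each strategy, the relations that are shipped between sites.
strategies = {
    1: ["EMP", "EMP><ASG"],    # EMP -> site 2, then the join result -> site 3
    2: ["ASG", "EMP><ASG"],    # ASG -> site 1, then the join result -> site 3
    3: ["ASG", "ASG><PROJ"],   # ASG -> site 3, then the join result -> site 1
    4: ["PROJ", "ASG><PROJ"],  # PROJ -> site 2, then the join result -> site 1
    5: ["EMP", "PROJ"],        # EMP and PROJ -> site 2
}


def transfer_cost(shipped):
    """Total number of bytes sent over the network for one strategy."""
    return sum(size[name] for name in shipped)


costs = {number: transfer_cost(shipped) for number, shipped in strategies.items()}
best = min(costs, key=costs.get)

for number, cost in costs.items():
    print(f"strategy {number}: {cost} bytes")
print("cheapest strategy under these assumed sizes:", best)
```

With different sizes a different strategy wins, which is why the optimizer needs size estimates for both base relations and intermediate results.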
Approach I: Ordering joins without using semi-joins

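Used by Distributed INGRES and R* (the distributed version of System R).
To compute R >< S when R and S are stored at different sites, transfer the smaller relation:
If size(R) < size(S): R -> site of S
If size(S) < size(R): S -> site of R
size(R) is the total number of bytes of R. The time to transfer the result to the result site is ignored.
Consider the costs of all strategies and choose the best one.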
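The size rule for a single join can be sketched in a few lines of Python. The function name and the site identifiers are hypothetical, and shipping R when the sizes are equal is an arbitrary choice made here; the slides do not say what to do on a tie.

```python
def choose_transfer(size_r, size_s, site_r, site_s):
    """Decide which operand of R >< S to ship, given sizes in bytes.

    Returns (relation_to_ship, destination_site).
    Assumption: on a tie, R is shipped.
    """
    if size_r <= size_s:
        return ("R", site_s)   # R is smaller: move it to the site of S
    return ("S", site_r)       # S is smaller: move it to the site of R


print(choose_transfer(100, 500, site_r=1, site_s=2))
print(choose_transfer(900, 500, site_r=1, site_s=2))
```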
Approach II: Use semi-joins

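Semijoin: the join of R and S on attribute A can be computed in several equivalent ways (SJ denotes the semijoin):
R ><_A S
= (R SJ_A S) ><_A S
= R ><_A (S SJ_A R)
= (R SJ_A S) ><_A (S SJ_A R)
Semijoin program, with R stored at site 1 and S stored at site 2:
1. Site 2 computes S' = Π_A(S) and sends S' to site 1.
2. Site 1 computes R' = R SJ_A S' and sends R' to site 2.
3. Site 2 computes the result R' ><_A S.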
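A small in-memory sketch of this semijoin program may help: site 2 projects S on the join attribute, site 1 uses the projection to reduce R, and site 2 joins the reduced R with S. Relations are modeled as lists of dictionaries, and the function name and sample data are invented for illustration.

```python
def semijoin_program(R, S, attr):
    """Simulate the three steps of the semijoin program on in-memory data."""
    # Step 1 (site 2): project S on the join attribute; this is what gets shipped.
    s_proj = {t[attr] for t in S}
    # Step 2 (site 1): keep only the R tuples whose join value appears in s_proj.
    r_reduced = [t for t in R if t[attr] in s_proj]
    # Step 3 (site 2): join the reduced R with S.
    result = [{**r, **s} for r in r_reduced for s in S if r[attr] == s[attr]]
    return s_proj, r_reduced, result


# Sample data (hypothetical).
R = [{"A": 1, "x": "a"}, {"A": 2, "x": "b"}, {"A": 3, "x": "c"}]
S = [{"A": 2, "y": "p"}, {"A": 2, "y": "q"}, {"A": 9, "y": "r"}]

s_proj, r_reduced, result = semijoin_program(R, S, "A")
print("shipped to site 1:", sorted(s_proj))
print("shipped to site 2:", r_reduced)
print("join result:", result)
```

Only one of the three R tuples is shipped in step 2, which is the saving the semijoin is meant to achieve.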
The semijoin is better than the join if size(Π_A(S)) + size(R SJ_A S) < size(R), i.e., only a few tuples of R participate in the join.
S' can be made smaller by encoding it as a bit array (BA): BA[i] = 1 if h(v) = i for some value v of S.A, and BA[i] = 0 otherwise, where h() is a hash function.
R' then consists of the tuples of R for which BA[h(value of R.A)] = 1.
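The bit-array encoding can be sketched as follows. The array length M and the modulo hash function are assumptions picked to keep the example small and deterministic; a real system would choose both differently.

```python
M = 8  # assumed length of the bit array


def h(value):
    """Assumed hash function: maps an integer join value to a slot in 0..M-1."""
    return value % M


def build_bit_array(join_values):
    """Set BA[h(v)] = 1 for every join value v of S.A."""
    ba = [0] * M
    for v in join_values:
        ba[h(v)] = 1
    return ba


# Join values of S.A (hypothetical data).
S_A = [2, 9]
BA = build_bit_array(S_A)

# Reduce R at its own site using only the bit array.
R = [{"A": 1}, {"A": 2}, {"A": 3}, {"A": 10}]
R_reduced = [t for t in R if BA[h(t["A"])] == 1]

print(BA)
print(R_reduced)
```

Here the tuples with A = 1 and A = 10 survive although S contains neither value, because they hash to the same slots as 9 and 2. Such false positives cost some extra transfer but do not affect correctness, since the final join with S discards them.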
Some Semi-Joins Alternatives

[Figure: join graph of EMP, ASG, and PROJ. EMP and ASG are linked by ENO; ASG and PROJ are linked by PNO.]
ASG' = ASG SJ EMP
ASG'' = (ASG SJ PROJ) SJ EMP
Because nested semijoins add complexity, most algorithms use single semijoins rather than nested semijoins.
Comparison of Query Optimization Algorithms

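| Algorithm            | Dist. INGRES                | SDD-1                    | R*                       |
|----------------------|-----------------------------|--------------------------|--------------------------|
| Optimization timing  | Dynamic                     | Static                   | Static                   |
| Objective function   | Response time or total cost | Total cost               | Total cost               |
| Optimization factors | Msg size, I/O, CPU          | Msg size                 | #msg, msg size, I/O, CPU |
| Network topology     | General or broadcast        | Wide area point-to-point | General or local         |
| Semijoins            | No                          | Yes                      | No                       |
| Statistics           | 1                           | 1, 3, 4, 5               | 1, 2                     |
| Fragments            | Horizontal                  | No                       | No                       |

Statistics: 1 = relation cardinality, 2 = number of unique values per attribute, 3 = join selectivity factor, 4 = size of the projection on each join attribute, 5 = attribute size and tuple size.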
R*
Input: a localized query tree, the locations of the relations, and database statistics.
Tasks:
1. Select the join ordering, the join algorithm, and the access path for each fragment.
2. Select the sites of the join results and the method for transferring data between sites.
To join two relations, there are three candidate sites: the site of the first relation, the site of the second relation, or a third site (for example, the site of a third relation to be joined with).
R* Intersite Data Transfer (1)

1. Ship-whole. The entire relation is shipped to the join site.
Pros and cons: a larger data transfer, but fewer messages.
Ship-whole is the better choice for smaller relations.
R* Intersite Data Transfer (2)

2. Fetch-as-needed (use semi-join)
The outer relation is scanned. The join value of each tuple is sent to the site of the inner relation, which selects the matching tuples and sends them back to the site of the outer relation.
Good when the relations are large and only a few tuples match.
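The message pattern of fetch-as-needed can be illustrated with a small simulation. It assumes one request message and one reply message per outer tuple, with a reply sent even when nothing matches; the function name and the data are hypothetical.

```python
def fetch_as_needed(outer, inner, attr):
    """Join outer with inner by asking the inner site for matches, tuple by tuple.

    Returns (join_result, number_of_messages).
    """
    messages = 0
    result = []
    for r in outer:
        messages += 1  # request: the join value of r goes to the inner site
        matches = [s for s in inner if s[attr] == r[attr]]
        messages += 1  # reply: the matching inner tuples come back
        result.extend({**r, **s} for s in matches)
    return result, messages


outer = [{"A": 1}, {"A": 2}, {"A": 3}]
inner = [{"A": 2, "y": "p"}, {"A": 2, "y": "q"}]

joined, messages = fetch_as_needed(outer, inner, "A")
print(joined)
print("messages exchanged:", messages)
```

The number of messages grows with the cardinality of the outer relation, while the amount of inner data shipped depends only on how many tuples match.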
R (outer) joins S (inner) on attribute A

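LT(): local processing time (I/O + CPU time)
CT(): communication time
s: average number of tuples of S that match one tuple of R
card(R): number of tuples of R; size(R): number of bytes of R; length(X): length in bytes of attribute or tuple X
Assumption: the cost of writing the result of the join is ignored.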
Strategy 1: Ship-whole of the entire outer relation to the site of the inner relation; the outer tuples can be joined as they arrive.
[Figure: outer R at site 1 is shipped to the site of inner S, site 2.]
Cost = LT(retrieve card(R) tuples from R)
     + CT(size(R))
     + LT(retrieve s tuples from S) * card(R)
Strategy 2: Ship-whole of the entire inner relation to the site of the outer relation. The inner tuples cannot be joined as they arrive; they have to be stored in a temporary relation T.
[Figure: inner S is shipped to the site of outer R and stored there.]
Cost = LT(retrieve card(S) tuples from S)
     + CT(size(S))
     + LT(store card(S) tuples in T)
     + LT(retrieve card(R) tuples from R)
     + LT(retrieve s tuples from T) * card(R)
Strategy 3: For each outer tuple, fetch-as-needed of the inner tuples.
Cost = LT(retrieve card(R) tuples from R)
     + CT(length(A)) * card(R)
     + LT(retrieve s tuples from S) * card(R)
     + CT(s * length(S)) * card(R)
Strategy 4: Ship-whole of both relations to a third site, where the join is computed.
The inner relation is sent to the third site first and stored as a temporary relation T. The outer relation is sent to the third site later, and its tuples are joined with T as they arrive.
Cost = LT(retrieve card(S) tuples from S)
     + CT(size(S))
     + LT(store card(S) tuples in T)
     + LT(retrieve card(R) tuples from R)
     + CT(size(R))
     + LT(retrieve s tuples from T) * card(R)
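The four cost formulas can be compared numerically once LT() and CT() are given concrete forms. The sketch below assumes a linear model that is not part of the slides: local time is proportional to the number of tuples touched, and communication time is a fixed per-message cost plus a per-byte cost. It also charges the same per-tuple cost for probing S through an index and for probing the temporary relation T, which is a simplification. All parameter values are invented.

```python
from dataclasses import dataclass


@dataclass
class Params:
    card_r: int      # number of tuples in the outer relation R
    card_s: int      # number of tuples in the inner relation S
    len_r: int       # bytes per tuple of R
    len_s: int       # bytes per tuple of S
    len_a: int       # bytes of the join attribute A
    s: float         # average number of S tuples matching one R tuple
    lt_tuple: float  # assumed local cost per tuple retrieved or stored
    ct_msg: float    # assumed fixed cost per message
    ct_byte: float   # assumed cost per byte transferred


def LT(p, n_tuples):
    return p.lt_tuple * n_tuples


def CT(p, n_bytes):
    return p.ct_msg + p.ct_byte * n_bytes


def strategy1(p):  # ship outer R to the site of S
    return LT(p, p.card_r) + CT(p, p.card_r * p.len_r) + LT(p, p.s) * p.card_r


def strategy2(p):  # ship inner S to the site of R, store it as T
    return (LT(p, p.card_s) + CT(p, p.card_s * p.len_s) + LT(p, p.card_s)
            + LT(p, p.card_r) + LT(p, p.s) * p.card_r)


def strategy3(p):  # fetch-as-needed, one request and one reply per outer tuple
    return (LT(p, p.card_r) + CT(p, p.len_a) * p.card_r
            + LT(p, p.s) * p.card_r + CT(p, p.s * p.len_s) * p.card_r)


def strategy4(p):  # ship both relations to a third site
    return (LT(p, p.card_s) + CT(p, p.card_s * p.len_s) + LT(p, p.card_s)
            + LT(p, p.card_r) + CT(p, p.card_r * p.len_r)
            + LT(p, p.s) * p.card_r)


# Hypothetical workload: a small outer relation and a large inner relation.
p = Params(card_r=1000, card_s=100_000, len_r=50, len_s=100, len_a=4,
           s=1, lt_tuple=1.0, ct_msg=10.0, ct_byte=0.1)

costs = {1: strategy1(p), 2: strategy2(p), 3: strategy3(p), 4: strategy4(p)}
for number, cost in costs.items():
    print(f"strategy {number}: {cost:.0f}")
print("cheapest under these assumptions:", min(costs, key=costs.get))
```

Under this model, strategy 4 always costs exactly CT(size(R)) more than strategy 2, because it performs the same work and additionally ships R.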
Example
R is ASG2, stored at site 1. S is EMP2, stored at site 2.
There is an index on S.ENO. Assume communication cost is dominant.
[Figure: a localized query tree joining R (site 1) and S (site 2) on attribute A.]
Strategy 1: Ship whole R to the site of S. Good when R << S; the index on ENO can be used.
Strategy 2: Ship whole S to the site of R. Good when S << R; the index on ENO cannot be used, and S has to be stored.
Strategy 3: Fetch S as needed for each tuple of R. Good when length(A) is small and few tuples match; the index on ENO can be used.
Strategy 4: Ship whole R and S to a third site. Most costly, because there are no other operations after the join.
