Join ordering is important in a centralized DBMS. It is even more important in a distributed DBMS, where joins between relations at different sites add communication cost.
Notation: R -> site j means that relation R is transferred to site j.
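[Figure: join graph. EMP (Site 1) joins ASG (Site 2) on ENO; ASG joins PROJ (Site 3) on PNO.]
Five candidate execution strategies:
1. EMP -> site 2; site 2 computes EMP' = EMP ⋈ ASG; EMP' -> site 3; site 3 computes the result EMP' ⋈ PROJ.
2. ASG -> site 1; site 1 computes EMP' = EMP ⋈ ASG; EMP' -> site 3; site 3 computes the result EMP' ⋈ PROJ.
3. ASG -> site 3; site 3 computes ASG' = ASG ⋈ PROJ; ASG' -> site 1; site 1 computes the result ASG' ⋈ EMP.
4. PROJ -> site 2; site 2 computes PROJ' = PROJ ⋈ ASG; PROJ' -> site 1; site 1 computes the result PROJ' ⋈ EMP.
5. EMP -> site 2; PROJ -> site 2; site 2 computes the join EMP ⋈ ASG ⋈ PROJ.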
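The five strategies can be compared by the number of bytes each one ships between sites. The sketch below assumes that communication cost is proportional to bytes transferred and that the sizes of the intermediate results (EMP ⋈ ASG and ASG ⋈ PROJ) are already estimated. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def transfer_costs(size, size_emp_asg, size_asg_proj):
    """Return bytes shipped by each of the five strategies.

    size: dict with the sizes in bytes of EMP, ASG and PROJ.
    size_emp_asg, size_asg_proj: assumed sizes of the intermediate joins.
    """
    return {
        1: size["EMP"] + size_emp_asg,    # EMP -> site 2, then (EMP join ASG) -> site 3
        2: size["ASG"] + size_emp_asg,    # ASG -> site 1, then (EMP join ASG) -> site 3
        3: size["ASG"] + size_asg_proj,   # ASG -> site 3, then (ASG join PROJ) -> site 1
        4: size["PROJ"] + size_asg_proj,  # PROJ -> site 2, then (PROJ join ASG) -> site 1
        5: size["EMP"] + size["PROJ"],    # EMP and PROJ both -> site 2
    }


# Made-up sizes, for illustration only.
sizes = {"EMP": 40_000, "ASG": 100_000, "PROJ": 20_000}
costs = transfer_costs(sizes, size_emp_asg=60_000, size_asg_proj=80_000)
best = min(costs, key=costs.get)  # strategy with the fewest bytes shipped
```

With these particular numbers strategy 5 ships the least data, but the ranking depends entirely on the relation sizes and on how selective the intermediate joins are.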
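Join of two relations R ⋈ S stored at different sites:
If size(R) < size(S), transfer R to the site of S.
If size(S) < size(R), transfer S to the site of R.
Ignore the transfer time for producing data at the result site.
size(R): total number of bytes of R.
Consider the costs of all strategies and choose the best one.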
R' = R SJ_A S', where S' is the projection of S on the join attribute A.
The semijoin is better than the join if size(S') + size(R') < size(R), i.e., only a few tuples of R participate in the join.
S' can be minimized by encoding it in a bit array (BA): BA[i] = 1 if h(value of S.A) = i for some tuple of S, and BA[i] = 0 otherwise, where h() is the hash function.
R' then consists of the tuples of R for which BA[h(value of R.A)] = 1.
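The bit-array reduction can be sketched as follows. This is a minimal illustration, not a definitive implementation: the hash function `h` (integer values modulo the array length), the helper names, and the sample data are all assumptions made for the example.

```python
def h(value, m):
    # Hypothetical hash function: integer join values modulo the array length m.
    return value % m


def build_bit_array(s_join_values, m):
    """BA[i] = 1 if some value of S.A hashes to i, otherwise 0."""
    ba = [0] * m
    for v in s_join_values:
        ba[h(v, m)] = 1
    return ba


def reduce_r(r_tuples, key, ba):
    """Keep the tuples of R whose join value hashes to a set bit."""
    m = len(ba)
    return [t for t in r_tuples if ba[h(key(t), m)] == 1]


S_A = [3, 7, 12]                                # values of S.A (made up)
R = [(3, "x"), (4, "y"), (5, "z"), (12, "w")]   # tuples of R, join value first
ba = build_bit_array(S_A, m=8)                  # only these 8 bits are shipped
r_reduced = reduce_r(R, key=lambda t: t[0], ba=ba)
```

Here the tuple with join value 4 survives although 4 does not occur in S.A, because 4 and 12 hash to the same position. The bit array therefore yields a superset of the true semijoin result, and the final join removes such false positives.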
[Figure: join graph over EMP, ASG, and PROJ, with join attributes ENO and PNO.]
ASG1 = ASG SJ EMP
ASG11 = (ASG SJ PROJ) SJ EMP
Complex: most algorithms use single semijoins rather than nested semijoins.
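[Table: comparison of distributed query optimization algorithms. Only the following cells survive; algorithm names and column headers are missing.]
Row 1: Static | Yes | 1,3,4,5 | No
Row 2: Static | Total cost | No | 1,2 | No
Statistics key: 1 = relation cardinality, 2 = number of unique values per attribute, 3 = join selectivity factor, 4 = size of projection on each join attribute, 5 = attribute size and tuple size.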
R*
Input: a localized query tree, the locations of the relations, and database statistics.
Tasks:
Select the join ordering, the join algorithm, and the access path for each fragment.
Select the sites of the join results and the method used to transfer data between sites.
To join two relations, there are three candidate sites:
1. The site of the first relation.
2. The site of the second relation.
3. A third site (e.g., the site of a third relation to be joined with the result).
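Strategy 1: Ship the entire outer relation to the site of the inner relation (ship-whole); the outer tuples can be joined as they arrive.
[Figure: outer relation R at Site 1, inner relation S at Site 2.]
Cost = LT(retrieve card(R) tuples from R) + CT(size(R)) + LT(retrieve s tuples from S) * card(R)
Here LT is local processing time, CT is communication time, and s is the average number of tuples of S that match one tuple of R.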
Strategy 2: Ship the entire inner relation to the site of the outer relation (ship-whole). The inner tuples have to be stored in a temporary relation T.
Cost = LT(retrieve card(S) tuples from S) + CT(size(S)) + LT(store card(S) tuples in T) + LT(retrieve card(R) tuples from R) + LT(retrieve s tuples from T) * card(R)
Strategy 3: For each outer tuple, fetch the matching inner tuples as needed (fetch-as-needed).
Cost = LT(retrieve card(R) tuples from R) + CT(length(A)) * card(R) + LT(retrieve s tuples from S) * card(R) + CT(s * length(S)) * card(R)
Strategy 4: Ship both relations whole to a third site and compute the join there.
The inner relation is sent to the third site first and stored in a temporary relation T. The outer relation is sent afterwards, and its tuples are joined with T as they arrive.
Cost = LT(retrieve card(S) tuples from S) + CT(size(S)) + LT(store card(S) tuples in T) + LT(retrieve card(R) tuples from R) + CT(size(R)) + LT(retrieve s tuples from T) * card(R)
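The four cost formulas can be evaluated side by side once LT and CT are given concrete shapes. The sketch below assumes, purely for illustration, that LT is linear in the number of tuples handled and that CT is a fixed per-message cost plus a per-byte cost. The `Stats` class, the coefficients, and the sample statistics are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    card_r: int   # card(R), outer relation
    card_s: int   # card(S), inner relation
    len_r: int    # tuple length of R in bytes
    len_s: int    # tuple length of S in bytes
    len_a: int    # length of the join attribute A in bytes
    s: float      # average number of S tuples matching one R tuple


def LT(tuples, per_tuple=1.0):
    # Assumed local processing cost: linear in the number of tuples.
    return per_tuple * tuples


def CT(nbytes, per_msg=10.0, per_byte=0.01):
    # Assumed communication cost: fixed message overhead plus per-byte cost.
    return per_msg + per_byte * nbytes


def strategy_costs(st):
    size_r = st.card_r * st.len_r
    size_s = st.card_s * st.len_s
    probe = LT(st.s) * st.card_r  # LT(retrieve s tuples) * card(R)
    return {
        1: LT(st.card_r) + CT(size_r) + probe,
        2: LT(st.card_s) + CT(size_s) + LT(st.card_s) + LT(st.card_r) + probe,
        3: (LT(st.card_r) + CT(st.len_a) * st.card_r
            + probe + CT(st.s * st.len_s) * st.card_r),
        4: (LT(st.card_s) + CT(size_s) + LT(st.card_s)
            + LT(st.card_r) + CT(size_r) + probe),
    }


# A small outer relation and a large inner relation (made-up statistics).
stats = Stats(card_r=100, card_s=10_000, len_r=50, len_s=100, len_a=4, s=1.0)
costs = strategy_costs(stats)
best = min(costs, key=costs.get)
```

With a small outer relation and a large inner relation, shipping the outer relation whole (strategy 1) comes out cheapest under these assumed coefficients. Different coefficients or statistics can change the ranking.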
Example
[Figure: R at Site 1 is joined with S at Site 2 on attribute A.]
R is ASG2. S is EMP2.
There is an index on S.ENO. Assume communication cost is dominant.
Strategy 1: Ship R whole to the site of S (good when R << S; the index on ENO can be used).
Strategy 2: Ship S whole to the site of R (good when S << R; the index on ENO cannot be used, and S must be stored).
Strategy 3: Fetch tuples of S as needed for each tuple of R (good when length(A) is small and few tuples match; the index on ENO can be used).
Strategy 4: Ship both R and S whole to a third site. This is the most costly option here, because no further operation follows the join that would justify moving both relations.