Assume that |E| = m = 400, |G| = n = 1000, and |σ_{RESP="Manager"}(G)| = t = 20, where |X| denotes the number of tuples in relation X. The cost of the first query, in time units, is:
|E| × |G| + |E × G| = 400 × 1000 + 400 × 1000 = 800,000

Note: E × G costs |E| × |G| = 400 × 1000 units of time, and σ_{RESP="Manager" ∧ E.ENO=G.ENO}(E × G) costs the other |E × G| = 400 × 1000 units of time.

The cost of the second query is:

|E| × log2|G| + |E × G| = 400 × log2 1000 + 400 × 1000 ≈ 400 × 10 + 400,000 = 404,000

Note: E ⋈_{ENO} G costs |E| × log2|G| ≈ 4,000 units of time, and σ_{RESP="Manager"}(E ⋈_{ENO} G) costs |E × G| = 400,000 units of time.

The cost of the third query is:

|G| + |E| × log2|σ_{RESP="Manager"}(G)| = 1000 + 400 × log2 20 ≈ 1000 + 400 × 5 = 3,000

Note: σ_{RESP="Manager"}(G) costs |G| = 1000 units of time, and E ⋈_{ENO} σ_{RESP="Manager"}(G) costs |E| × log2|σ_{RESP="Manager"}(G)| = 400 × log2 20 units of time.

The third query clearly consumes far fewer computing resources than the first and the second, so the third query should be used.

Distributed DB. In a distributed DB, things become more complicated. Consider again the following query:

E ⋈_{ENO} σ_{RESP="Manager"}(G)

We assume that E and G are horizontally fragmented as follows:

E1 = σ_{ENO≤"E3"}(E), E2 = σ_{ENO>"E3"}(E)
G1 = σ_{ENO≤"E3"}(G), G2 = σ_{ENO>"E3"}(G)

Fragments G1, G2, E1, and E2 are stored at sites 1, 2, 3, and 4, respectively, and the result is expected at site 5. Two strategies are shown in Figure 1.
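[Figure 1 diagram. Strategy A: σ_{RESP="Manager"} is applied to G1 at site 1 and to G2 at site 2, and the results are joined with E1 at site 3 and E2 at site 4. Strategy B: all fragments are shipped to the result site, where σ_{RESP="Manager"}(G1 ∪ G2) is joined with E.]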
Figure 1: Equivalent distributed execution strategies

We have the following assumptions:
- a tuple access (tupacc) costs 1 unit;
- a tuple transfer (tuptrans) costs 10 units;
- E and G have 400 and 1000 tuples, respectively;
- there are 20 managers in relation G;
- data is uniformly distributed among sites, and G and E are locally clustered (i.e., locally indexed) on attributes RESP and ENO, respectively.

The cost of strategy A is:
1. Produce G' requires 20*tupacc = 20
2. Transfer G' to the sites of E requires 20*tuptrans = 200
3. Produce E' requires 2*tupacc*(10+10) = 40
4. Transfer E' to the result site requires 20*tuptrans = 200
Total cost: 460

The cost of strategy B is:
1. Transfer E to site 5 requires 400*tuptrans = 4000
2. Transfer G to site 5 requires 1000*tuptrans = 10000
3. Produce G' by selecting G requires 1000*tupacc = 1000
4. Join E and G' requires 400*20*tupacc = 8000
Total cost: 23000
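The two totals can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses the unit costs and cardinalities stated in the assumptions; the function and constant names are illustrative and not part of the original notes.

```python
# Unit costs and cardinalities taken from the assumptions in the text.
TUPACC = 1      # cost of one tuple access
TUPTRANS = 10   # cost of one tuple transfer
E_SIZE, G_SIZE, MANAGERS = 400, 1000, 20


def strategy_a_cost():
    """Select at the G sites, join at the E sites, ship only the result."""
    select_g = MANAGERS * TUPACC       # A.1: indexed selection touches 20 tuples
    ship_g = MANAGERS * TUPTRANS       # A.2: send the selected tuples to the E sites
    join_e = 2 * TUPACC * (10 + 10)    # A.3: as given in the text
    ship_e = MANAGERS * TUPTRANS       # A.4: send the 20 result tuples to site 5
    return select_g + ship_g + join_e + ship_e


def strategy_b_cost():
    """Ship everything to the result site and process it there."""
    ship_e = E_SIZE * TUPTRANS              # B.1
    ship_g = G_SIZE * TUPTRANS              # B.2
    select_g = G_SIZE * TUPACC              # B.3: full scan, no index available
    join = E_SIZE * MANAGERS * TUPACC       # B.4: nested comparison, no index
    return ship_e + ship_g + select_g + join


cost_a = strategy_a_cost()
cost_b = strategy_b_cost()
print(cost_a, cost_b, cost_b / cost_a)
```

Under these assumptions strategy B is roughly fifty times more expensive than strategy A, almost entirely because of the tuple transfers.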
The difference between A.3 and B.4 is due to the local indexes available in A. This indexing information is lost in B.

Objectives

- To map a high-level query into a sequence of database operations on fragments stored in local databases.
- The execution strategy should be optimised: the total cost, namely the CPU, I/O and communication costs, should be minimised.
- Centralised query processing: choose the best algebraic query among all equivalent ones.
- Distributed query processing: choose the best query, select the best way of transferring data, and select the best sites to process data.
Complexity of relational operators

Table 1: Complexity of relational algebra operations

Operation                                         Complexity
Select, Project (without duplicate elimination)   O(n)
Project (with duplicate elimination), Group       O(n*log n)
Join, Semijoin, Division, Set operators           O(n*log n)
Cartesian product                                 O(n^2)
In the example of Section 4.2, the third query is the best because it reduces the cardinality first (σ_{RESP="Manager"}(G)) and avoids the Cartesian product. The second query is better than the first because it avoids the Cartesian product. These key principles will be reflected in Section 4.4.
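The effect of these two principles on the Section 4.2 example can be checked numerically. The sketch below evaluates the three cost expressions with |E| = 400, |G| = 1000 and 20 managers. Rounding log2 up to the next integer is an assumption made here so that the figures match the rounded values used in the notes (log2 1000 ≈ 10, log2 20 ≈ 5); the helper name `log2_ceil` is illustrative.

```python
import math

E, G, MANAGERS = 400, 1000, 20


def log2_ceil(n):
    """log2 rounded up to an integer (assumption: matches the rounding in the notes)."""
    return math.ceil(math.log2(n))


# Query 1: Cartesian product, then selection over the product.
query1 = E * G + E * G
# Query 2: indexed join, then selection over the join result.
query2 = E * log2_ceil(G) + E * G
# Query 3: selection on G first, then indexed join with the 20 selected tuples.
query3 = G + E * log2_ceil(MANAGERS)

print(query1, query2, query3)
```

Avoiding the Cartesian product roughly halves the cost, while reducing the cardinality of G before the join lowers it by more than two orders of magnitude.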
Use of semijoins: semijoins reduce the size of the operand relation, and therefore can reduce the size of data exchanged between sites.
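As an illustration of why a semijoin reduces the data exchanged, the following minimal sketch computes E ⋉ G on a small in-memory example. The sample tuples and the `semijoin` helper are hypothetical and only show the idea: only the E tuples that have a join partner in G need to be shipped.

```python
def semijoin(r, s, attr):
    """Return the tuples of r that have at least one join partner in s on attr."""
    keys = {t[attr] for t in s}          # only the join-attribute values of s are needed
    return [t for t in r if t[attr] in keys]


# Hypothetical sample data.
E = [{"ENO": f"E{i}", "ENAME": f"emp{i}"} for i in range(1, 9)]
G = [
    {"ENO": "E2", "JNO": "J1"},
    {"ENO": "E5", "JNO": "J2"},
    {"ENO": "E5", "JNO": "J3"},
]

reduced_E = semijoin(E, G, "ENO")
print(len(E), len(reduced_E))  # tuples to ship without and with the semijoin
```

In this example only two of the eight E tuples would have to be transferred to the site holding G.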
[Figure 2 diagram, partially recovered: the optimised fragment query with communication operations is passed to the local sites, where local optimisation, using the local schema, produces optimised local queries.]
Figure 2: Distributed query processing layers

Query decomposition

Overview:
- Function: this layer decomposes the distributed calculus query into an algebraic query on global relations.
- Information needed: description of global relations. Data distribution information is not used here.
- Techniques: similar to those of a centralised DBMS.

Steps:
- Normalisation: rewrite the calculus query into a normalised form.
- Analysis: semantically analyse the normalised query and reject incorrect queries as early as possible.
- Simplification: simplify the correct queries to eliminate redundant predicates.
- Restructuring: restructure the calculus query as a good algebraic query.

Data localisation

- Goal: localise the query's data using data distribution information.
- Function: determine which fragments are involved in the query and transform the distributed query into a fragment query.
- Information needed: fragmentation schema.

Steps:
- Reconstruct the distributed query by applying the fragmentation rules, and then derive a localisation program (which uses relational algebra operations on fragments).
- Simplify the fragment query to produce another good query.
Global query optimisation

- Goal: find an execution strategy (relational algebra operations and communication primitives) for the query.
- Function: find the best ordering of operations in the fragment query, including communication operations, that minimises a cost function.
- Information needed: fragment statistics.

Local query optimisation

This layer is performed by all the sites having fragments involved in the query. A local query is optimised using the local schema of the site.
Normalisation transforms the query predicates using the well-known equivalence rules for the logical operations (∧, ∨, and ¬):

1. p1 ∧ p2 ⇔ p2 ∧ p1
2. p1 ∨ p2 ⇔ p2 ∨ p1
3. p1 ∧ (p2 ∧ p3) ⇔ (p1 ∧ p2) ∧ p3
4. p1 ∨ (p2 ∨ p3) ⇔ (p1 ∨ p2) ∨ p3
5. p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3)
6. p1 ∨ (p2 ∧ p3) ⇔ (p1 ∨ p2) ∧ (p1 ∨ p3)
7. ¬(p1 ∧ p2) ⇔ ¬p1 ∨ ¬p2
8. ¬(p1 ∨ p2) ⇔ ¬p1 ∧ ¬p2
9. ¬(¬p) ⇔ p
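Each of the nine equivalence rules can be confirmed by enumerating all truth assignments. The following sketch does this for three Boolean variables; the `RULES` table and the `holds` helper are illustrative names introduced here.

```python
from itertools import product

# Each entry maps a rule number to (left-hand side, right-hand side).
RULES = {
    1: (lambda p1, p2, p3: p1 and p2, lambda p1, p2, p3: p2 and p1),
    2: (lambda p1, p2, p3: p1 or p2, lambda p1, p2, p3: p2 or p1),
    3: (lambda p1, p2, p3: p1 and (p2 and p3), lambda p1, p2, p3: (p1 and p2) and p3),
    4: (lambda p1, p2, p3: p1 or (p2 or p3), lambda p1, p2, p3: (p1 or p2) or p3),
    5: (lambda p1, p2, p3: p1 and (p2 or p3), lambda p1, p2, p3: (p1 and p2) or (p1 and p3)),
    6: (lambda p1, p2, p3: p1 or (p2 and p3), lambda p1, p2, p3: (p1 or p2) and (p1 or p3)),
    7: (lambda p1, p2, p3: not (p1 and p2), lambda p1, p2, p3: (not p1) or (not p2)),
    8: (lambda p1, p2, p3: not (p1 or p2), lambda p1, p2, p3: (not p1) and (not p2)),
    9: (lambda p1, p2, p3: not (not p1), lambda p1, p2, p3: p1),
}


def holds(lhs, rhs):
    """True if both sides agree on every assignment of the three variables."""
    return all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))


results = {number: holds(lhs, rhs) for number, (lhs, rhs) in RULES.items()}
print(results)
```

Rules 5 and 6 (distributivity) are the ones that drive the conversion to disjunctive and conjunctive normal form, respectively.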
Example: Let us consider the following query on the example database: Find the names of employees who have been working on project J1 for 12 or 24 months.

The SQL query:

select ENAME from E, G where E.ENO=G.ENO and G.JNO="J1" and (DUR=12 or DUR=24);

The conjunctive normal form:

E.ENO=G.ENO ∧ G.JNO="J1" ∧ (DUR=12 ∨ DUR=24)

The disjunctive normal form:

(E.ENO=G.ENO ∧ G.JNO="J1" ∧ DUR=12) ∨ (E.ENO=G.ENO ∧ G.JNO="J1" ∧ DUR=24)

Analysis

The purpose of analysis is to reject normalised queries for which further processing is impossible or unnecessary (e.g., queries that are type incorrect or semantically incorrect).

Type incorrect query: its attribute or relation names are not defined in the global schema, or its operations are applied to attributes of the wrong type.

Example: The following SQL query on the example database is type incorrect:

select E# from E where ENAME>200

The attribute E# is not defined in the schema, and the operation >200 is not compatible with the type of ENAME.

Semantically incorrect query: its components do not contribute in any way to the generation of the result. The query graph can be used to determine semantic correctness: a conjunctive query without negation is semantically incorrect if its query graph is not connected.
Here the query graph (relation connection graph) is defined as follows: one node represents the result relation, and every other node represents an operand relation. An edge between two nodes that are not the result represents a join, whereas an edge whose destination node is the result represents a projection. A non-result node may be labelled with a select or a self-join predicate.
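The connectivity test described above can be sketched as a simple graph traversal. In the following example the node names E, G, J and RESULT follow the example database, while the two edge lists are illustrative assumptions: one contains a join between G and J, the other omits it.

```python
def is_connected(nodes, edges):
    """Return True if every node is reachable from any other via the edges."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                      # depth-first traversal
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


nodes = ["E", "G", "J", "RESULT"]
# Joins E-G and G-J, projections from E and G to the result.
with_join = [("E", "G"), ("G", "J"), ("E", "RESULT"), ("G", "RESULT")]
# Same graph without the G-J join: J is isolated.
without_join = [("E", "G"), ("E", "RESULT"), ("G", "RESULT")]

print(is_connected(nodes, with_join), is_connected(nodes, without_join))
```

A query whose graph fails this test contains a relation that contributes nothing to the result, which is exactly the situation the analysis step is meant to detect.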
Example: Query: Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.

The SQL:

select ENAME, RESP from E, G, J where E.ENO=G.ENO and G.JNO=J.JNO and JNAME="CAD/CAM" and DUR>=36 and TITLE="Programmer";

Figure 3: Relation graphs

An important subgraph of the relation connection graph is the join graph, in which only the joins are considered.

However, if the SQL query is:

select ENAME, RESP from E, G, J where E.ENO=G.ENO and JNAME="CAD/CAM" and DUR>=36 and TITLE="Programmer";

then the query graph is:

[Query graph diagram: node G (DUR>=36) and node E (TITLE="Programmer") are linked by the join edge E.ENO=G.ENO; ENAME and RESP label the edges into RESULT; node J (JNAME="CAD/CAM") is not linked to any other node.]
Because the query graph is disconnected, the query is semantically incorrect.

Redundancy elimination

The following well-known idempotency rules can be applied to simplify a query:

1. p ∧ p ⇔ p
2. p ∨ p ⇔ p
3. p ∧ true ⇔ p
4. p ∨ false ⇔ p
5. p ∧ false ⇔ false
6. p ∨ true ⇔ true
7. p ∧ ¬p ⇔ false
8. p ∨ ¬p ⇔ true
9. p1 ∧ (p1 ∨ p2) ⇔ p1
10. p1 ∨ (p1 ∧ p2) ⇔ p1
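A simplification obtained with these rules can always be double-checked by comparing truth tables. As a minimal sketch, take a qualification of the form (¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3: the first disjunct is contradictory, so the whole expression should reduce to p3. The function name `qualification` is illustrative.

```python
from itertools import product


def qualification(p1, p2, p3):
    """(not p1 and (p1 or p2) and not p2) or p3, before simplification."""
    return ((not p1) and (p1 or p2) and (not p2)) or p3


# The simplified form is just p3; compare both on all eight assignments.
equivalent = all(
    qualification(p1, p2, p3) == p3
    for p1, p2, p3 in product([False, True], repeat=3)
)
print(equivalent)
```

The exhaustive check is only practical for a handful of predicates, but it is a convenient way to validate a hand derivation.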
Example: The SQL query:

select TITLE from E where (not (TITLE="Programmer") and (TITLE="Programmer" or TITLE="Elect.Eng.") and not (TITLE="Elect.Eng.")) or ENAME="J. Doe";

can be simplified to:

select TITLE from E where ENAME="J. Doe";

Let p1 = (TITLE="Programmer"), p2 = (TITLE="Elect.Eng."), p3 = (ENAME="J. Doe"). The query qualification is:

(¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3
= (((¬p1 ∧ p1) ∨ (¬p1 ∧ p2)) ∧ ¬p2) ∨ p3   (applying transformation rule 5)
= (¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2) ∨ p3   (applying transformation rule 5 again)

So the disjunctive normal form is:

(¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2) ∨ p3
= (false ∧ ¬p2) ∨ (¬p1 ∧ false) ∨ p3   (applying idempotency rule 7)
= false ∨ false ∨ p3   (applying idempotency rule 5)
= p3   (applying idempotency rule 4)

Thus, the qualification can be reduced to p3.

Rewriting

Relational algebra tree of a query: a tree in which a leaf node is a relation stored in the database, and a non-leaf node is an intermediate relation produced by a relational algebra operation. The sequence of operations is directed from the leaves to the root, which represents the answer to the query.

Example: Query: Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.

SQL statement:

select ENAME from J, G, E where G.ENO=E.ENO and G.JNO=J.JNO and ENAME != "J. Doe" and JNAME="CAD/CAM" and (DUR=12 or DUR=24);

Its algebra tree is:
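[Figure 4 diagram: from root to leaves, Project π_{ENAME}; Select σ_{DUR=12 ∨ DUR=24}; σ_{JNAME="CAD/CAM"}; σ_{ENAME≠"J. Doe"}; Join ⋈_{JNO}; Join ⋈_{ENO}; with the relations J, G and E as leaves.]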
Figure 4: An algebra tree example

Transformation rules: There are 6 rules that can be used to generate many equivalent trees. Let R, S and T be relations, where R is defined over attributes A = {A1, ..., An} and S is defined over B = {B1, ..., Bn}.

1. Commutativity of binary operations.

R × S ⇔ S × R
R ⋈ S ⇔ S ⋈ R

2. Associativity of binary operations.

(R × S) × T ⇔ R × (S × T)
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)

3. Idempotence of unary operations.

π_{A'}(π_{A''}(R)) ⇔ π_{A'}(R), where A' ⊆ A'' ⊆ A

E.g., let A' = {ENO}, A'' = {ENO, ENAME}, and A = {ENO, ENAME, TITLE}; then π_{ENO}(π_{ENO, ENAME}(E)) ⇔ π_{ENO}(E).

If pi is a predicate applied to attribute Ai, then

σ_{p1(A1)}(σ_{p2(A2)}(R)) ⇔ σ_{p1(A1) ∧ p2(A2)}(R)

E.g., let p1 = (JNO="J1") and p2 = (BUDGET>160,000); then σ_{JNO="J1"}(σ_{BUDGET>160,000}(J)) ⇔ σ_{JNO="J1" ∧ BUDGET>160,000}(J).

4. Commuting selection with projection.

π_{A1, ..., An}(σ_{p(Ap)}(R)) ⇔ π_{A1, ..., An}(σ_{p(Ap)}(π_{A1, ..., An, Ap}(R)))

E.g., for SELECT ENO, DUR FROM G WHERE DUR > 12; we have

π_{ENO, DUR}(σ_{DUR>12}(G)) ⇔ σ_{DUR>12}(π_{ENO, DUR}(G)).

But for SELECT ENO FROM G WHERE DUR > 12; the selection attribute must be kept by the inner projection:

π_{ENO}(σ_{DUR>12}(G)) ⇔ π_{ENO}(σ_{DUR>12}(π_{ENO, DUR}(G))).

5. Commuting selection with binary operations.

σ_{p(Ai)}(R × S) ⇔ (σ_{p(Ai)}(R)) × S
σ_{p(Ai)}(R ⋈_{p(Aj, Bk)} S) ⇔ (σ_{p(Ai)}(R)) ⋈_{p(Aj, Bk)} S
σ_{p(Ai)}(R ∪ T) ⇔ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(T)

6. Commuting projection with binary operations. If C = A' ∪ B', where A' ⊆ A and B' ⊆ B, then

π_{C}(R × S) ⇔ π_{A'}(R) × π_{B'}(S)
π_{C}(R ⋈_{p(Aj, Bk)} S) ⇔ π_{A'}(R) ⋈_{p(Aj, Bk)} π_{B'}(S)
π_{C}(R ∪ S) ⇔ π_{C}(R) ∪ π_{C}(S)

The above 6 transformation rules demonstrate that a query can be executed in different orders while producing exactly the same result. However, the complexities of the alternatives differ. Consider the transformation rule

σ_{p(Ai)}(R ⋈_{p(Aj, Bk)} S) ⇔ (σ_{p(Ai)}(R)) ⋈_{p(Aj, Bk)} S.

The left side performs the join first (R ⋈_{p(Aj, Bk)} S), then the selection σ_{p(Ai)}. The right side performs the selection first (σ_{p(Ai)}(R)), then the join. The right side is more efficient because it reduces cardinality first. Recall the two principles mentioned in Section 4.2 for reducing the complexity of a query:
- reduce cardinality first, and/or
- delay or avoid the Cartesian product.

Wherever possible, apply selections first. Another frequently used transformation rule is

π_{C}(R ⋈_{p(Aj, Bk)} S) ⇔ π_{A'}(R) ⋈_{p(Aj, Bk)} π_{B'}(S).

The left side performs the join first, then the projection. The right side performs the projections first, then the join. The right side is more efficient than the left side. Based on these 6 transformation rules and the above principles, query 1 can easily be transformed into query 2, and then into query 3, in the example of Section 4.2.
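To make the "selection first" principle concrete, the sketch below evaluates both sides of the rule that commutes selection with join on a small in-memory example. The sample tuples and the helper functions `select` and `join` are hypothetical; the join is a plain nested loop, so the number of tuple pairs compared is simply the product of the input sizes.

```python
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def join(r, s, attr):
    """Nested-loop equijoin of r and s on a common attribute."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]


# Hypothetical sample data.
E = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
G = [
    {"ENO": "E1", "JNO": "J1", "RESP": "Manager"},
    {"ENO": "E2", "JNO": "J1", "RESP": "Analyst"},
    {"ENO": "E2", "JNO": "J2", "RESP": "Analyst"},
    {"ENO": "E3", "JNO": "J3", "RESP": "Manager"},
]


def is_manager(t):
    return t["RESP"] == "Manager"


# Left side of the rule: join first, then select.
join_first = select(join(E, G, "ENO"), is_manager)
# Right side of the rule: select first, then join.
select_first = join(E, select(G, is_manager), "ENO")

# Tuple pairs compared by the nested-loop join in each plan.
pairs_join_first = len(E) * len(G)
pairs_select_first = len(E) * len(select(G, is_manager))
print(join_first == select_first, pairs_join_first, pairs_select_first)
```

Both plans return the same tuples, but the plan that selects first feeds a smaller operand into the join.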
The transformation rules are useful in four ways:
- They allow the separation of unary operations, which simplifies the query expression.
- Unary operations on the same relation can be grouped together so that they are performed only once.
- Unary operations can be commuted with binary operations so that some operations can be done first.
- Binary operations can be ordered.
[Figure diagram: equivalent algebra trees for the example query over relations G, E and J; panel (a) shows a "bad" equivalent tree.]
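Example: Assume that relation E is horizontally fragmented into E1 = σ_{ENO≤"E3"}(E), E2 = σ_{"E3"<ENO≤"E6"}(E) and E3 = σ_{ENO>"E6"}(E). The localisation program is E = E1 ∪ E2 ∪ E3.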
Reduction for primary horizontal fragmentation

After restructuring the subtrees, determine those that will produce empty relations and remove them.

Reduction with selection: selections on fragments whose qualification contradicts the qualification of the fragmentation rule generate empty relations.

Example: The selection predicate ENO="E5" conflicts with the predicates of fragments E1 and E3 above. An example query:

select * from E where ENO="E5";

Figure 6(a) is the generic query. The selection predicate contradicts the predicates of E1 and E3, so the reduced query is the one in Figure 6(b).
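Reduction with selection amounts to testing the selection value against each fragment's predicate. The following is a minimal sketch under stated assumptions: the fragment predicates are taken to be ENO ≤ "E3" for E1, "E3" < ENO ≤ "E6" for E2 and ENO > "E6" for E3 (consistent with the join example later in these notes), each predicate is encoded as a (low, high] range, and employee numbers are compared as plain strings.

```python
# Assumed fragment predicates, encoded as (exclusive low, inclusive high);
# None means unbounded on that side.
FRAGMENTS = {
    "E1": (None, "E3"),   # ENO <= "E3"
    "E2": ("E3", "E6"),   # "E3" < ENO <= "E6"
    "E3": ("E6", None),   # ENO > "E6"
}


def may_contain(bounds, value):
    """True if a tuple with this ENO value could be stored in the fragment."""
    low, high = bounds
    return (low is None or value > low) and (high is None or value <= high)


def relevant_fragments(value):
    """Fragments whose predicate does not contradict ENO = value."""
    return [name for name, bounds in FRAGMENTS.items() if may_contain(bounds, value)]


print(relevant_fragments("E5"))
```

For ENO = "E5" only E2 survives, so the selections on E1 and E3 can be dropped from the generic query.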
[Figure 6 diagram: (a) generic query σ_{ENO="E5"}(E1 ∪ E2 ∪ E3); (b) reduced query σ_{ENO="E5"}(E2).]
Figure 6: Reduction with selection

Reduction with join: if the joined relations are fragmented according to the join attribute, the simplification consists of distributing joins over unions and eliminating useless joins. A join is useless when the qualifications of the joined fragments contradict each other.

The distribution operation:

(R1 ∪ R2) ⋈ R3 = (R1 ⋈ R3) ∪ (R2 ⋈ R3)

Example: Assume that E is fragmented as above and G is fragmented as

G1 = σ_{ENO≤"E3"}(G)
G2 = σ_{ENO>"E3"}(G)

Then G = G1 ∪ G2.

E1 and G1 are defined by the same predicate. Furthermore, the predicate defining G2 is the union of the predicates defining E2 and E3. Consider the join query:

select * from E, G where E.ENO=G.ENO;

Figure 7(a) shows the generic query and Figure 7(b) shows the reduced query.
[Figure 7 diagram: (a) generic query (E1 ∪ E2 ∪ E3) ⋈_{ENO} (G1 ∪ G2); (b) reduced query (E1 ⋈_{ENO} G1) ∪ (E2 ⋈_{ENO} G2) ∪ (E3 ⋈_{ENO} G2).]
Figure 7: Query reduction example (horizontal fragmentation)

Reason:

E ⋈ G = (E1 ∪ E2 ∪ E3) ⋈ (G1 ∪ G2)
= (E1 ⋈ G1) ∪ (E2 ⋈ G1) ∪ (E3 ⋈ G1) ∪ (E1 ⋈ G2) ∪ (E2 ⋈ G2) ∪ (E3 ⋈ G2)
= (E1 ⋈ G1) ∪ (E2 ⋈ G2) ∪ (E3 ⋈ G2)
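The elimination of useless joins can be sketched as a pairwise overlap test on the fragment predicates. The example below assumes that every predicate is a range on ENO, encoded as (exclusive low, inclusive high) with None for an open end, and that E2 and E3 are split at "E6"; the encoding and the helper `overlaps` are illustrative.

```python
# Assumed fragment predicates on ENO as (exclusive low, inclusive high) ranges.
E_FRAGMENTS = {"E1": (None, "E3"), "E2": ("E3", "E6"), "E3": ("E6", None)}
G_FRAGMENTS = {"G1": (None, "E3"), "G2": ("E3", None)}


def overlaps(x, y):
    """True if the two ranges can share a value, i.e. the predicates do not contradict."""
    lows = [bound for bound in (x[0], y[0]) if bound is not None]
    highs = [bound for bound in (x[1], y[1]) if bound is not None]
    if not lows or not highs:
        return True                      # unbounded on a common side
    return max(lows) < min(highs)


# Distribute the join over the unions and keep only the non-contradicting pairs.
useful_joins = [
    (e_name, g_name)
    for e_name, e_range in E_FRAGMENTS.items()
    for g_name, g_range in G_FRAGMENTS.items()
    if overlaps(e_range, g_range)
]
print(useful_joins)
```

Of the six fragment joins produced by the distribution, only three remain, which matches the reduced expression above.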
Note: E2 ⋈ G1 is eliminated because the fragmentation predicates, "E3"<ENO≤"E6" and ENO≤"E3", contradict each other. E3 ⋈ G1 and E1 ⋈ G2 are eliminated for similar reasons.

Reduction for vertical fragmentation

Queries on vertical fragments can be reduced by determining the useless intermediate relations and removing the subtrees that produce them.

Example: Relation E is divided into two vertical fragments:

E1 = π_{ENO, ENAME}(E), E2 = π_{ENO, TITLE}(E)

The localisation program is E = E1 ⋈_{ENO} E2.

The SQL query:

select ENAME from E;

The generic query is shown in Figure 8(a) and the reduced query is shown in Figure 8(b).
[Figure 8 diagram: (a) generic query π_{ENAME}(E1 ⋈_{ENO} E2); (b) reduced query π_{ENAME}(E1).]
Figure 8: Reduction for vertical fragmentation

Reduction for derived fragmentation

Typically, if relation R is subject to derived horizontal fragmentation due to relation S, the fragments of R and S that have the same join attribute values are located at the same site. Derived fragmentation is usually used for one-to-many relationships of the form S → R.

Example: Given a one-to-many relationship from E to G, G can be indirectly fragmented according to the following rules:

G1 = G ⋉ E1
G2 = G ⋉ E2

Relation E is horizontally fragmented as:

E1 = σ_{TITLE="Programmer"}(E)
E2 = σ_{TITLE≠"Programmer"}(E)

The localisation program is: G = G1 ∪ G2

The SQL query:

select * from E, G where G.ENO=E.ENO and TITLE="Mech. Eng.";

Figure 9(a) shows the generic query. By pushing down selections to E1 and E2, the query can be reduced to that of Figure 9(b). Further reduction of the indirect fragmentation is shown in Figure 9(c) and (d).
[Figure 9 diagram: (a) generic query; (b) query after pushing the selection TITLE="Mech. Eng." down; (c) and (d) further reduction, where (d) is the reduced query after eliminating the left subtree: G2 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(E2).]
Figure 9: Reduction for indirect fragmentation

Reduction for hybrid fragmentation

The optimisation of an operation or a combination of operations is always done at the expense of other operations. E.g., hybrid fragmentation based on selection-projection will make selection only, or projection only, less efficient than horizontal fragmentation (or vertical fragmentation).

Queries on hybrid fragments can be reduced by combining the rules used in primary horizontal, vertical, and derived horizontal fragmentation:
1. Remove empty relations generated by contradicting selections on horizontal fragments.
2. Remove useless relations generated by projections on vertical fragments.
3. Distribute joins over unions in order to isolate and remove useless joins.

Example: Hybrid fragmentation of E:

E1 = σ_{ENO≤"E4"}(π_{ENO, ENAME}(E))
E2 = σ_{ENO>"E4"}(π_{ENO, ENAME}(E))
E3 = π_{ENO, TITLE}(E)

Localisation program: E = (E1 ∪ E2) ⋈_{ENO} E3

SQL query:

select ENAME from E where ENO="E5";

Figure 10(a) shows the generic query. It can be reduced by first pushing the selection down, eliminating fragment E1, and then pushing the projection down, eliminating fragment E3. The formal presentation is as follows:

π_{ENAME}(σ_{ENO="E5"}(E))
= π_{ENAME}(σ_{ENO="E5"}((E1 ∪ E2) ⋈_{ENO} E3))
= π_{ENAME}((σ_{ENO="E5"}(E1) ∪ σ_{ENO="E5"}(E2)) ⋈_{ENO} σ_{ENO="E5"}(E3))
= π_{ENAME}(σ_{ENO="E5"}(E2) ⋈_{ENO} σ_{ENO="E5"}(E3))
= π_{ENAME}(σ_{ENO="E5"}(E2) ⋈_{ENO} σ_{ENO="E5"}(π_{ENAME, ENO}(E3)))
= π_{ENAME}(σ_{ENO="E5"}(E2))

Figure 10(b) shows the reduced query.
[Figure 10 diagram: (a) generic query π_{ENAME}(σ_{ENO="E5"}((E1 ∪ E2) ⋈_{ENO} E3)); (b) reduced query π_{ENAME}(σ_{ENO="E5"}(E2)).]
The objective of the optimiser is to find a strategy close to optimal and to avoid bad strategies. The output of the optimiser is an optimised schedule consisting of the algebraic query specified on fragments and the communication operations that support the execution of the query over the fragment sites.

Cost model

Total_cost = C_CPU * #insts + C_I/O * #I/Os + C_MSG * #MSGs + C_TR * #bytes

- C_CPU: the cost of a CPU instruction
- C_I/O: the cost of a disk I/O
- C_MSG: the fixed cost of initiating and receiving a message
- C_TR: the cost of transmitting a data unit (a byte here) from one site to another

Typical ratio of communication cost to I/O cost:
- WAN (such as the Internet): 20:1
- LAN (such as 10 Mbps Ethernet): 1:1.6
When the response time of the query is the objective function of the optimiser, parallel local processing and parallel communications must be considered:

Response_time = C_CPU * seq_#insts + C_I/O * seq_#I/Os + C_MSG * seq_#MSGs + C_TR * seq_#bytes

where seq_#x, in which x can be instructions, I/Os, messages, or bytes, is the maximum number of x that must be done sequentially for the execution of the query.

Example: The example computes the answer to a query at site 3 with data from site 1 and site 2. Only communication cost is considered.
[Figure diagram: Site 1 sends x data units and Site 2 sends y data units to Site 3.]
Assume C_MSG and C_TR are expressed in time units. Then

Total_time = 2 * C_MSG + C_TR * (x + y)
Response_time = max{C_MSG + C_TR * x, C_MSG + C_TR * y}
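The difference between the two measures is easy to see with concrete numbers. The sketch below implements both formulas; the parameter values (C_MSG = 5, C_TR = 1, x = 100, y = 300) are hypothetical and chosen only for illustration.

```python
def total_time(c_msg, c_tr, x, y):
    """Sum of all communication work: two messages plus all transmitted data."""
    return 2 * c_msg + c_tr * (x + y)


def response_time(c_msg, c_tr, x, y):
    """The two transfers run in parallel, so the slower one determines the elapsed time."""
    return max(c_msg + c_tr * x, c_msg + c_tr * y)


# Hypothetical values: message cost 5, unit transfer cost 1, x = 100, y = 300.
total = total_time(5, 1, 100, 300)
response = response_time(5, 1, 100, 300)
print(total, response)
```

With these values the total time is 410 units while the response time is 305 units, because the transfer from site 1 is completely overlapped by the longer transfer from site 2.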