Xufeng Zhang, Weiwei Sun, Wei Wang, Yahui Feng, and Baile Shi
Department of Computing and Information Technology, Fudan University, 200433, China
011021380@fudan.edu.cn
Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06)
0-7695-2581-4/06 $20.00 © 2006 IEEE
Aggregation is the most complicated case: its incremental maintenance depends on the concrete aggregation formula. SUM and COUNT are self-maintainable. AVG is self-maintainable when a counting attribute is added to the output relation [6]. Incremental maintenance of MAX or MIN needs to access the base relation, but there are techniques that reduce the number of accesses to the base relation [7].

Other work discusses the incremental maintenance of composite operators. [4] discusses how to implement the incremental maintenance of SPJ (Selection, Projection, Join) views, and [9] does the same for ASPJ (Aggregation, Selection, Projection, Join) views. [12] does not consider specific combinations of relational operators but studies how to select the optimal materialized views within a given storage space, aiming to minimize the total cost of the queries and of the incremental maintenance of views.

[8] discusses how to calculate the net increment of the various basic relational operators (including the difference operator), but does not address the problem of incremental maintenance.

[11] presents the canonical form for general views, but does not itself consider incremental maintenance.

No existing work addresses the automatic generation of incremental ETL processes, which is the subject of this paper. Our research can take advantage of existing results on the incremental maintenance of materialized views, but those results do not cover the incremental maintenance of the difference operator. In data warehouse environments, the difference operator is seldom used in the definition of materialized views, but in the definition of ETL processes it is used as frequently as the other basic relational operators. So, as a precondition of our research, the incremental maintenance of the difference operator is also part of our work.

3. Basic Conceptions

3.1. The Basic Conceptions of Incremental Maintenance of Materialized Views

In the discussion of incremental ETL processes, we need some existing concepts, which we introduce briefly here; details can be found in [1] [4] [9].

Materialized View [1]: a view whose tuples are stored in the database.
Base Relation [1]: an existing relation in the database, which is used for creating views and materialized views.
Incremental View Maintenance [1]: a process that computes only the changes in the view to update its materialization.
Irrelevant Updates [4]: a set of updates that modifies base relations but has no effect on the state of a view.
Self-maintainable Views [1]: views that can be maintained using only the materialized view and the incremental data.
Net Increment (Minimum Increment) [4]: an increment of a materialized view containing only data of relevant updates.
Primary Views [9]: the original materialized views stored in the warehouse.
Auxiliary Views [9]: additional materialized views that store auxiliary data to reduce the overall maintenance cost of the primary views.

3.2. Canonical Form for ETL Processes

An ETL process that includes selection, projection, union, join, difference and aggregation operators can be canonicalized [11]. The canonical form is composed of two types of segments: AUSPJ-segments and D-segments. More details can be found in [11]; we only give the results.

A D-segment contains a single difference operator. An AUSPJ-segment, which is an operator sequence of the form α-∪-π-σ-⋈, may omit any of the operators in the sequence, but it must satisfy one of the following:
• It has an aggregation operator on top.
• It is directly below a D-segment.
• It is below a join operator, and has a union operator on top.
• It is the top segment in the view definition, which means that no other segment lies above it.

Example 1 (Canonical Form for an ETL Process)
Suppose we have an ETL process that stores customers' total consumption in the data warehouse, but the target table does not contain VIP customers' data. There are four tables in the data source and one target table named Total_Consume in the data warehouse. The five tables are listed below.

Customer & VIP:
    C_ID int PRIMARY KEY,
    C_NAME char(20)
Order_A & Order_B:
    ORDER_ID int,
    C_ID int REFERENCES Customer(C_ID),
    PRODUCT_ID int,
    P_NUM int,
    P_PRICE int,
    PRIMARY KEY (ORDER_ID, PRODUCT_ID)
Total_Consume:
    C_NAME char(20),
    TOTAL_CONSUME int

The definition of the ETL process is shown in fig1(a), and the canonical form is shown in fig1(b).

3.3. Self-maintainable Materialized Views and Self-maintainable Net Increment

The self-maintainability of materialized views has been discussed in detail in the literature. All these discussions rely on one precondition: the input
relation(s) of a relational operator (or a set of relational operators) are base relation(s), and the output relation is a materialized view. In an ETL process, the input and output relations of a relational operator (or a segment) may all be views. Using the results on self-maintainability of materialized views, we can judge whether to materialize the input relation(s), but cannot judge whether to materialize the output relation. In order to judge simultaneously whether to materialize the input and output relations, we put forward the following definition.

Definition 1 (self-maintainable net increment): if a net increment of a view or a materialized view can be obtained using only the increment(s) of the base relation(s), the net increment is called a self-maintainable net increment.

In an ETL process, if the net increment of a segment's output relation is self-maintainable, none of the input and output relations of the segment needs to be materialized, which helps to lower the execution cost of the incremental ETL process. Obviously, if the net increment of a view is self-maintainable, the view must be self-maintainable, but the converse is not necessarily true.

[Figure 1: two operator trees over the base tables Customer, VIP, Order_A and Order_B, with intermediate relations Temp1 and Temp2. Operator labels: π C_ID, P_NUM*P_PRICE as T_PRICE; join on C_ID; π C_NAME, T_PRICE; α C_NAME, SUM(T_PRICE).]
Figure 1. Standardization of an ETL process: (a) a common ETL process and (b) the canonical form for the ETL process

4. Incremental Maintenance of One Segment

According to Section 3.2, the canonical form for an ETL process is composed of AUSPJ-segments and D-segments. In order to generate an incremental ETL process from the ETL process, we should first consider how to implement the incremental maintenance of one segment.

4.1. Incremental Maintenance of AUSPJ-Segment

We can use existing methods to implement the incremental maintenance of an AUSPJ-segment that contains all five kinds of operators. The methods can be found in [1,3,4,5,7]; here we only give the result. To implement the incremental maintenance of a complete AUSPJ-segment, we must maintain the materialized views listed below:
• All the input relations of the join operator.
• The output relation of the projection and/or union operator: a counting attribute is added to this relation, or duplicate tuples are allowed to be stored. But if we can be sure that each tuple in the output relation is unique, no materialized view is needed.
• The output relation of the aggregation operator; if the formula is AVG, a counting attribute should be added to the output relation. If the formula is MAX or MIN, the input relation should also be materialized.

According to the definition of the AUSPJ-segment, an AUSPJ-segment may omit some operators, in which case the corresponding materialized views are omitted as well.

4.2. Incremental Maintenance of D-Segment

According to the definition, a D-segment contains only a single difference operator. Without loss of generality, consider a difference operation $A - B$, and suppose its output relation is $C$. According to the formulas in [8], $C$'s net increments are:

$\Delta C = (\Delta A - (B \cup \Delta B)) \cup (((A \cup \Delta A) \cap \nabla B) - \nabla A)$    (1)
$\nabla C = (\nabla A - B) \cup (A \cap \Delta B)$    (2)

In formulas (1) and (2), relations $A$ and $B$ are used to calculate the net increment of $C$, so $A$ and $B$ must be materialized. If a view definition contains only one difference operator, the input relations are base relations. So the method of calculating $C$'s increment using $A$ and $B$ is called BRA (Base Relations Access).

The formula $C = A - B$ can be converted to $C = A - (A \cap B)$. For an arbitrary relation $R$, we have $R \cap \Delta R = \emptyset$ and $R \cap \nabla R = \nabla R$. Let $D = A \cap B$ and $E = B - (A \cap B)$; formulas (1) and (2) are then converted to

$\Delta C = (\Delta A - (E \cup \Delta B)) \cup (((D \cup \Delta A) \cap \nabla B) - \nabla A)$    (3)
$\nabla C = (\nabla A - D) \cup (C \cap \Delta B)$    (4)

So when relations $C$, $D$ and $E$ are materialized, the incremental maintenance of $C$ can also be implemented. This method is named SRA (Split Relations Access). The SRA method is more efficient than the BRA method. But in the SRA method, in order to maintain relations $D$ and $E$, we need to calculate their increments. Using $\Delta C$ and $\nabla C$, the formulas for the increment of $D$ are:

$\Delta D = (\Delta A - \Delta C) \cup (\nabla C - \nabla A)$    (5)
$\nabla D = (\Delta C - \Delta A) \cup (\nabla A - \nabla C)$    (6)

Using $\Delta D$ and $\nabla D$, the formulas for the increment of $E$ are:

$\Delta E = (\Delta B - \Delta D) \cup (\nabla D - \nabla B)$    (7)
$\nabla E = (\Delta D - \Delta B) \cup (\nabla B - \nabla D)$    (8)

In a data warehouse environment, the increment is very [...]

[...] practice, an ETL process contains many AUSPJ-segments and D-segments, in which the selection of auxiliary views is more complicated. We know that adding auxiliary views can lower the cost of calculating the increment, but it increases the maintenance cost of the auxiliary views. Moreover, the storage space of a database is generally finite, so auxiliary views cannot be added without limit.

Similar to the method in [12], we abstract the problem of selecting auxiliary views into a mathematical model. Given a full ETL process G, storage space S and auxiliary views M, the execution cost of the incremental ETL process is:

$W(G, M) = \sum_{i=1}^{k} D(D_i, M) + \sum_{j=1}^{m} U(V_j, M) + C$
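
The D-segment formulas of Section 4.2 can be checked on small sets. The following is a minimal sketch under stated assumptions, not the implementation used in the paper: relations are modelled as Python sets (set semantics, no duplicates), ΔR and ∇R are read as the tuples inserted into and deleted from R, insertions are assumed to be disjoint from the old relation and deletions to be contained in it, and the function names (`apply_increment`, `bra`, `sra`, `check`) are hypothetical. It compares the BRA formulas (1)-(2) and the SRA formulas (3)-(8) with full recomputation on random inputs.

```python
import random


def apply_increment(rel, ins, dele):
    """New state of a relation after deleting `dele` and inserting `ins`."""
    return (rel - dele) | ins


def bra(A, B, dA, nA, dB, nB):
    """BRA, formulas (1)-(2): net increment of C = A - B from the full A and B."""
    dC = (dA - (B | dB)) | (((A | dA) & nB) - nA)   # (1)
    nC = (nA - B) | (A & dB)                        # (2)
    return dC, nC


def sra(C, D, E, dA, nA, dB, nB):
    """SRA, formulas (3)-(8): uses only C = A - B, D = A & B and E = B - D."""
    dC = (dA - (E | dB)) | (((D | dA) & nB) - nA)   # (3)
    nC = (nA - D) | (C & dB)                        # (4)
    dD = (dA - dC) | (nC - nA)                      # (5)
    nD = (dC - dA) | (nA - nC)                      # (6)
    dE = (dB - dD) | (nD - nB)                      # (7)
    nE = (dD - dB) | (nB - nD)                      # (8)
    return (dC, nC), (dD, nD), (dE, nE)


def random_case(rng, universe=12):
    """Random A, B with valid increments (insertions new, deletions existing)."""
    def pick(pool):
        return {x for x in pool if rng.random() < 0.5}

    U = set(range(universe))
    A, B = pick(U), pick(U)
    return A, B, pick(U - A), pick(A), pick(U - B), pick(B)


def check(trials=500, seed=0):
    """Return True if BRA and SRA agree with full recomputation on all trials."""
    rng = random.Random(seed)
    for _ in range(trials):
        A, B, dA, nA, dB, nB = random_case(rng)
        C, D, E = A - B, A & B, B - (A & B)
        A2 = apply_increment(A, dA, nA)
        B2 = apply_increment(B, dB, nB)
        C2, D2, E2 = A2 - B2, A2 & B2, B2 - (A2 & B2)
        expected = ((C2 - C, C - C2), (D2 - D, D - D2), (E2 - E, E - E2))
        if bra(A, B, dA, nA, dB, nB) != expected[0]:
            return False
        if sra(C, D, E, dA, nA, dB, nB) != expected:
            return False
    return True
```

For example, with A = {1, 2, 3}, B = {3}, ΔA = {4}, ∇A = {1}, ΔB = {2} and ∇B = {3}, both functions yield ΔC = {3, 4} and ∇C = {1, 2}; `sra` obtains this from C, D and E alone, without reading A or B.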
Using the method in [11], a canonicalized ETL process can be described by a tree named the ETL tree. Using the ETL tree, we give an algorithm named MCCI (minimal cost of calculating increment) to generate the incremental ETL process.

MCCI Algorithm CreateIncrementalETL(t0, V0)
Input: a canonicalized full ETL process tree t0, a set of view definitions V0 (each one associated with a segment in t0)
Output: a set of changed view definitions V with materialized views, a set of SQL statements S for the incremental ETL process
begin
  V ← ∅, S ← ∅;
  do {
    get a topmost unprocessed segment si from t0;
    get vi associated with si from V0;
    if (vi is an AUSPJ-segment)
      for each operation in vi, add materialized views into vi;  (footnote 1)
    else
      for the difference operation in vi, add materialized views into vi;  (footnote 2)
    generate SQL statements Si for incremental maintenance of vi;
    V ← V ∪ vi;
    S ← S ∪ Si;
  } until (there is no unprocessed segment in t0);
  return (V, S);
end;

Footnote 1: The method to select materialized views for an AUSPJ-segment has been discussed in Section 4.1.
Footnote 2: We choose the SRA method to maintain a D-segment incrementally.

The algorithm scans the ETL tree only once when generating the incremental ETL process. If there are N segments in the ETL tree, its running time is O(N). Suppose that when the MCCI algorithm selects the method of incremental maintenance, it ensures that the cost of calculating the increment of each segment is minimal. Then, when the incremental ETL process generated by the MCCI algorithm is executed according to the first step of the TSU method, the execution cost is minimal. In fact, however, whether the incremental maintenance cost is minimal depends on the instance of the increment [1], so we can only ensure that the incremental ETL process generated by the MCCI algorithm has good execution efficiency in general cases.

6. Experiments

We implemented the MCCI algorithm in an ETL tool and compared the running time of the generated incremental ETL processes with that of the full ETL processes.

We chose Example 1 from Section 3 for the experiment because it includes join, projection, union, difference and aggregation operators (all but selection), so it can exercise all the methods of incremental maintenance. We used Windows Server 2003 as the operating system, running on a PC with a Pentium IV 2.0 GHz CPU and 512 MB of memory, and the Kingbase database as the DBMS. Of the four base tables, Customer, Order_A and Order_B each contain 100000 tuples, and VIP contains 10000 tuples.

In the experiment, we also consider the optimization of the incremental ETL process. Because the attributes (ORDER_ID, PRODUCT_ID) are the primary key of Order_A union Order_B, we keep these attributes in relation TEMP_2, and we do not need to attach an additional counting attribute to TEMP_2. Because VIP is a subset of Customer, we use formulas (9) and (10) to incrementally maintain the relation TEMP_1. The result of the experiment is shown in fig2.

[Figure 2: plot of time of execution (ms) against ratio of increment to full, with series "full", "increment" and "optimized".]
Figure 2. Time of executing full or incremental ETL processes

The results show that when the increment is very small, the running time of the incremental ETL processes generated by the MCCI algorithm is lower than that of the full ETL process, and the optimized incremental ETL processes take even less time.

7. Conclusion

This paper mainly discusses how to automatically generate an incremental ETL process from an existing full ETL process. As an ETL process is similar to a view definition, we draw on existing methods for the incremental maintenance of materialized views to implement the automatic creation of incremental ETL processes.

First we use the method in [11] to canonicalize the ETL process; the canonicalized ETL process contains only AUSPJ-segments and D-segments. Using existing results on the incremental maintenance of materialized views, we give the method to incrementally maintain AUSPJ-segments.

We also discuss the incremental maintenance of the D-segment in detail. First we put forward the plain method BRA, which uses the full information of the base relations. Second, according to the hypothesis that "in a data warehouse environment, the scale of a general increment is much smaller than that of the base relation", we put forward the improved method SRA. In the SRA method,
two auxiliary views are introduced to lower the total cost of incremental maintenance. Then we discuss the self-maintainability of materialized views defined by the difference operator. Such views are not always self-maintainable. Through analysis, we obtain three conditions under which the materialized views are self-maintainable. We also discuss a common use of the difference operator in ETL processes, in which one input relation is a subset of the other. In this situation, we prove that the net increment of the output relation is self-maintainable.

Finally we discuss how to automatically generate the incremental ETL process. In order to implement an incremental ETL process, we must select the necessary auxiliary views, and the selection of optimal auxiliary views within a given storage space is NP-hard. Since the storage space of a data warehouse is vast, we drop the corresponding constraint and put forward the TSU method, which postpones the execution of updates on auxiliary views. We then simplify the formula used to select auxiliary views and, based on the simplified formula, give the algorithm for automatically generating the incremental ETL process.

Incremental ETL processes used to be designed manually. Our work enables them to be generated automatically, and the generated incremental ETL process performs well in most situations.

Acknowledgements

References

[1] A. Gupta and I. Mumick, "Maintenance of materialized views: problems, techniques, and applications," IEEE Data Eng. Bulletin, Vol. 18, No. 2, 1995, pp. 3-18.
[2] A. E. Abbadi et al., "Performance issues in incremental warehouse maintenance," in Proc. VLDB 2000, Cairo, Egypt, 2000, pp. 461-472.
[3] I. S. Mumick and O. Shmueli, "Finiteness properties of database queries," Advances in Database Research: Proc. of the 4th Australian Database Conference, Brisbane, Australia, 1993, pp. 274-288.
[4] J. Blakeley, P. N. Larson, and F. Tompa, "Efficiently updating materialized views," ACM SIGMOD International Conference on Management of Data, Washington, USA, 1986, pp. 269-400.
[5] A. Gupta, H. Jagadish, and I. Mumick, "Data integration using self-maintainable views," The 5th International Conference on Extending Database Technology (EDBT), Avignon, France, 1996, pp. 140-144.
[6] T. Palpanas, R. Sidle, R. Cochrane, and H. Pirahesh, "Incremental maintenance for non-distributive aggregate functions," in Proc. VLDB 2002, Hong Kong, 2002, pp. 802-813.
[7] K. Yi et al., "Efficient maintenance of materialized top-k views," in Proc. ICDE 2003, Bangalore, India, 2003, pp. 189-200.
[8] T. Griffin, L. Libkin, and H. Trickey, "An improved algorithm for the incremental recomputation of active relational expressions," IEEE Trans. Knowl. Data Eng., 9(3), 1997, pp. 508-511.
[9] Y. Cui and J. Widom, "Storing auxiliary data for efficient maintenance and lineage tracing of complex views," in Proc. DMDW 2000, Stockholm, Sweden, 2000.
[10] D. Agrawal et al., "Efficient view maintenance at data warehouses," SIGMOD Conference 1997, Tucson, USA, 1997, pp. 417-427.
[11] Y. Cui, J. Widom, and J. L. Wiener, "Tracing the lineage of view data in a warehousing environment," ACM Transactions on Database Systems, 25(2), 2000, pp. 179-227.
[12] H. Gupta, "Selection of views to materialize in a data warehouse," in Proc. ICDT 1997, Delphi, Greece, 1997, pp. 98-112.