
Generating Incremental ETL Processes Automatically

Xufeng Zhang, Weiwei Sun, Wei Wang, Yahui Feng, and Baile Shi
Department of Computing and Information Technology, Fudan University, 200433, China
011021380@fudan.edu.cn

Abstract

Incremental ETL processes are used for the incremental maintenance of data warehouses and are generally designed by users with ETL tools. Drawing on existing methods for the incremental maintenance of materialized views, this paper puts forward an approach that generates an incremental ETL process automatically from the full ETL process. Existing research focuses on the incremental maintenance of materialized views defined with selection, projection, join and aggregation operators, but excludes difference operators. Since difference operators are used frequently in ETL processes, we first discuss in detail the incremental maintenance of materialized views defined with difference operators.

1. Introduction

Data warehouses collect data from multiple distributed sources and integrate the information for querying and analysis. The process of moving data from the data sources into a data warehouse is called the ETL (Extraction, Transformation and Loading) process.

When a data warehouse is first built, all the data must be loaded into it. This is called the full ETL process. In daily maintenance, when changes occur in the data sources, it is inefficient to reload all the data with the full ETL process. Only the data newly created in the data sources needs to be loaded, so an incremental ETL process is more efficient. At present, both full and incremental ETL processes are designed manually.

In [10], the authors propose that a data warehouse can be regarded as a set of materialized views over the data sources. Correspondingly, the full ETL process can be thought of as the view definition, and the incremental ETL process as the incremental maintenance of materialized views, a topic that has received much attention. The results of that research can be used to generate an incremental ETL process automatically from the full ETL process.

However, existing research on the incremental maintenance of materialized views focuses on data warehouse environments. Data in a data warehouse is organized by theme. Queries there frequently use aggregation, selection, projection and join, while difference operators are seldom used. Existing research therefore addresses the incremental maintenance of materialized views involving selection, projection, join and aggregation, but excludes difference operators. In an ETL process, by contrast, difference operators are frequently used to eliminate useless data, which distinguishes it from the maintenance of materialized views in a data warehouse.

This paper studies how to automatically generate an incremental ETL process that involves the aggregation, selection, projection, union, join and difference operators. As a precondition, we first discuss the incremental maintenance of the individual operators, with emphasis on the difference operator. We then discuss how to select auxiliary views so as to minimize the cost of incremental maintenance. Finally, we present an algorithm that generates an incremental ETL process automatically from the full ETL process.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces basic concepts used in this paper, including the canonical form for ETL processes. Section 4 presents methods for the incremental maintenance of AUSPJ-segments and D-segments, both of which are in canonical form. Section 5 presents the method for generating an incremental ETL process from the full ETL process. Section 6 reports the experiments, and Section 7 gives a summary and outlook.

2. Related Work

In relational databases, different operators call for different methods of incremental maintenance.

The incremental maintenance of projection and union can be implemented with the same method. Since several input tuples may correspond to one output tuple, when an input tuple is deleted we must check whether other input tuples produce the same output as the deleted one. [2] uses bag (duplicate) semantics to solve this problem. Another solution is to attach an additional attribute to the output relation that counts the corresponding input tuples [3].

The incremental maintenance of join generally requires the full data of the base relations. But when a modification to the base relations includes only deletions and updates, the output relation is self-maintainable [1]. [5] discusses the incremental maintenance of outer join, in which null values must be considered.
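The counting approach to projection maintenance described above can be illustrated with a minimal Python sketch. The class name `CountedProjection` and its methods are hypothetical and are not taken from [2] or [3]; the sketch only assumes that each output tuple carries a count of the input tuples that produce it.

```python
from collections import Counter


class CountedProjection:
    """Projection output that stores, per output tuple, how many input tuples map to it."""

    def __init__(self, columns):
        self.columns = columns      # indexes of the attributes kept by the projection
        self.counts = Counter()     # output tuple -> number of supporting input tuples

    def _project(self, row):
        return tuple(row[i] for i in self.columns)

    def insert(self, row):
        """Apply one inserted input tuple; return the output tuple if it is new, else None."""
        out = self._project(row)
        self.counts[out] += 1
        return out if self.counts[out] == 1 else None

    def delete(self, row):
        """Apply one deleted input tuple; return the output tuple if it disappears, else None."""
        out = self._project(row)
        if self.counts[out] == 0:
            raise KeyError(f"no supporting input tuple for {out}")
        self.counts[out] -= 1
        if self.counts[out] == 0:
            del self.counts[out]
            return out
        return None

    def rows(self):
        return set(self.counts)


# Hypothetical (ORDER_ID, C_ID) rows: two orders by customer 7 project to the same tuple (7,).
view = CountedProjection(columns=[1])
added = [view.insert(r) for r in [(100, 7), (101, 7), (102, 9)]]
removed_first = view.delete((100, 7))   # (7,) is still supported by order 101
removed_second = view.delete((101, 7))  # the last supporting tuple is gone
```

Only the second deletion changes the output relation, so the deletion check never has to rescan the input relation.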

Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06)
0-7695-2581-4/06 $20.00 © 2006 IEEE
Aggregation is the most complicated case; its incremental maintenance depends on the specific aggregate function. SUM and COUNT are self-maintainable. AVG is self-maintainable when a counting attribute is added to the output relation [6]. Incremental maintenance of MAX or MIN needs access to the base relation, but the number of accesses to the base relation can be reduced [7].

Other work discusses the incremental maintenance of composite operators. [4] discusses the incremental maintenance of SPJ (Selection, Projection, Join) views, and [9] discusses the incremental maintenance of ASPJ (Aggregation, Selection, Projection, Join) views. [12] does not consider specific combinations of relational operators; instead it considers how to select the optimal materialized views within a given storage space so as to minimize the total cost of queries and of incremental view maintenance.

[8] discusses how to calculate the net increment of the basic relational operators (including the difference operator), but does not address the problem of incremental maintenance.

[11] presents a canonical form for general views, but does not itself consider incremental maintenance.

No existing work addresses the automatic generation of incremental ETL processes, which is the subject of this paper. Our research can take advantage of existing results on the incremental maintenance of materialized views, but those results do not cover the difference operator. In data warehouse environments, the difference operator is seldom used in the definition of materialized views; in the definition of an ETL process, however, it is used as frequently as the other basic relational operators. As a precondition of our research, the incremental maintenance of the difference operator is therefore also part of our work.

3. Basic Conceptions

3.1. The Basic Conceptions of Incremental Maintenance of Materialized Views

The discussion of incremental ETL processes relies on some existing concepts, which we introduce briefly here; details can be found in [1] [4] [9].

Materialized View [1]: a view whose tuples are stored in the database.
Base Relation [1]: an existing relation in the database that is used for creating views and materialized views.
Incremental View Maintenance [1]: a process that computes only the changes to a view in order to update its materialization.
Irrelevant Updates [4]: a set of updates that modify base relations but have no effect on the state of a view.
Self-maintainable Views [1]: views that can be maintained using only the materialized view and the incremental data.
Net Increment (Minimum Increment) [4]: an increment of a materialized view that contains only the data of relevant updates.
Primary Views [9]: the original materialized views stored in the warehouse.
Auxiliary Views [9]: additional materialized views that store auxiliary data to reduce the overall maintenance cost of the primary views.

3.2. Canonical Form for ETL Processes

An ETL process that includes selection, projection, union, join, difference and aggregation operators can be canonicalized [11]. The canonical form is composed of two types of segments: AUSPJ-segments and D-segments. Details can be found in [11]; we give only the results.

A D-segment contains a single difference operator. An AUSPJ-segment is an operator sequence of the form $\alpha$-$\cup$-$\pi$-$\sigma$-$\bowtie$ (aggregation, union, projection, selection, join). It may omit any of the operators in the sequence, but it must satisfy one of the following:
• It has an aggregation operator on top.
• It is directly below a D-segment.
• It is below a join operator and has a union operator on top.
• It is the top segment in the view definition, meaning that no other segment lies above it.

Example 1 (Canonical Form for an ETL Process)
Suppose we have an ETL process that stores customers' total consumption in the data warehouse, where the target table does not contain VIP customers' data. There are four tables in the data source and one target table named Total_Consume in the data warehouse. The five tables are listed below.

Customer & VIP:
    C_ID int PRIMARY KEY,
    C_NAME char(20)
Order_A & Order_B:
    ORDER_ID int,
    C_ID int REFERENCES Customer(C_ID),
    PRODUCT_ID int,
    P_NUM int,
    P_PRICE int,
    PRIMARY KEY (ORDER_ID, PRODUCT_ID)
Total_Consume:
    C_NAME char(20),
    TOTAL_CONSUME int

The definition of the ETL process is shown in Figure 1(a), and its canonical form is shown in Figure 1(b).

3.3. Self-maintainable Materialized Views and Self-maintainable Net Increment

The self-maintainability of materialized views has been discussed in detail in the literature. All these discussions rely on one precondition: the input
relation(s) of a relational operator (or a set of relational operators) are base relation(s), and the output relation is a materialized view. In an ETL process, however, the input and output relations of a relational operator (or a segment) may all be views. Using the results on self-maintainability of materialized views, we can decide whether to materialize the input relation(s), but not whether to materialize the output relation. To decide both at once, we put forward the following definition.

Definition 1 (self-maintainable net increment): if a net increment of a view or a materialized view can be obtained using only the increment(s) of the base relation(s), the net increment is called a self-maintainable net increment.

In an ETL process, if the net increment of a segment's output relation is self-maintainable, none of the input and output relations of the segment needs to be materialized, which lowers the execution cost of the incremental ETL process. Obviously, if the net increment of a view is self-maintainable, the view must be self-maintainable, but the converse is not necessarily true.

Figure 1. Standardization of an ETL process: (a) a common ETL process and (b) the canonical form for the ETL process.
[Figure 1 shows two operator trees over the leaves Customer, VIP, Order_A and Order_B. Tree (a) contains, from the top, an aggregation on C_NAME with SUM(T_PRICE), a projection on C_NAME and T_PRICE, a join on C_ID, a projection on C_ID and P_NUM*P_PRICE as T_PRICE, and a difference and a union above the leaves. Tree (b) contains the same aggregation, projection and join, the intermediate relations Temp1 and Temp2, the union and the difference, and the projection on C_ID and P_NUM*P_PRICE as T_PRICE applied twice above the leaves.]

4. Incremental Maintenance of One Segment

According to Section 3.2, the canonical form for an ETL process is composed of AUSPJ-segments and D-segments. To generate an incremental ETL process from the full ETL process, we first consider how to implement the incremental maintenance of a single segment.

4.1. Incremental Maintenance of AUSPJ-Segment

We can use existing methods to implement the incremental maintenance of an AUSPJ-segment that contains all five kinds of operators. The methods can be found in [1,3,4,5,7]; here we give only the result. To implement the incremental maintenance of a complete AUSPJ-segment, we must maintain the materialized views listed below:
• All the input relations of the join operator.
• The output relation of the projection and/or union operator: either a counting attribute is added to this relation or duplicate tuples are allowed to be stored. If we can be sure that each tuple in the output relation is unique, no materialized view is needed.
• The output relation of the aggregation operator. If the aggregate function is AVG, a counting attribute should be added to the output relation. If the aggregate function is MAX or MIN, the input relation should also be materialized.

According to the definition of an AUSPJ-segment, some operators may be omitted, in which case the corresponding materialized views are omitted as well.

4.2. Incremental Maintenance of D-Segment

By definition, a D-segment contains only a single difference operator. Without loss of generality, consider a difference operation $A - B$ and suppose its output relation is $C$. Here $\Delta R$ denotes the set of tuples inserted into a relation $R$, and $\nabla R$ the set of tuples deleted from it. According to the formulas in [8], the net increments of $C$ are:

$$\Delta C = (\Delta A - (B \cup \Delta B)) \cup ((A \cup \Delta A) \cap \nabla B) - \nabla A \tag{1}$$

$$\nabla C = (\nabla A - B) \cup (A \cap \Delta B) \tag{2}$$

In formulas (1) and (2), relations $A$ and $B$ are used to calculate the net increment of $C$, so $A$ and $B$ must be materialized. If a view definition contains only one difference operator, the input relations are base relations, so the method of calculating the increment of $C$ using $A$ and $B$ is called BRA (Base Relations Access).

The formula $C = A - B$ can be rewritten as $C = A - (A \cap B)$. For an arbitrary relation $R$, we have $R \cap \Delta R = \emptyset$ and $R \cap \nabla R = \nabla R$. Let $D = A \cap B$ and $E = B - (A \cap B)$; formulas (1) and (2) are then converted to

$$\Delta C = (\Delta A - (E \cup \Delta B)) \cup ((D \cup \Delta A) \cap \nabla B) - \nabla A \tag{3}$$

$$\nabla C = (\nabla A - D) \cup (C \cap \Delta B) \tag{4}$$

So when relations $C$, $D$ and $E$ are materialized, the incremental maintenance of $C$ can also be implemented. This method is named SRA (Split Relations Access). The SRA method is more efficient than the BRA method, but in order to maintain relations $D$ and $E$ we need to calculate their increments as well. Using $\Delta C$ and $\nabla C$, the formulas for the increment of $D$ are:

$$\Delta D = (\Delta A - \Delta C) \cup (\nabla C - \nabla A) \tag{5}$$

$$\nabla D = (\Delta C - \Delta A) \cup (\nabla A - \nabla C) \tag{6}$$

Using $\Delta D$ and $\nabla D$, the formulas for the increment of $E$ are:

$$\Delta E = (\Delta B - \Delta D) \cup (\nabla D - \nabla B) \tag{7}$$

$$\nabla E = (\Delta D - \Delta B) \cup (\nabla B - \nabla D) \tag{8}$$

In a data warehouse environment the increment is very small, so the cost of maintaining $D$ and $E$ can be neglected. We therefore choose the SRA method to implement the incremental maintenance of a D-segment.

The Analysis of Self-maintainability of D-segment. According to formulas (1) and (2), the output relation of a difference operation is in general not self-maintainable. Under some conditions, however, it is self-maintainable, as the following theorem states.

Theorem 1. Given a materialized view $C$ with view definition $C = A - B$, $C$ is self-maintainable when one of the following three conditions is satisfied:
i) the increment of $A$ contains only deletions $\nabla A$ and the increment of $B$ contains only insertions $\Delta B$;
ii) the increment of $A$ contains only deletions $\nabla A$ and the increment of $B$ is empty;
iii) the increment of $A$ is empty and the increment of $B$ contains only insertions $\Delta B$.

Theorem 1 can be easily proved using formulas (1) and (2).

Now consider another situation: a difference operation $A - B$ in which $A$ and $B$ satisfy $B \subseteq A$. This situation occurs frequently in ETL processes, and we have the following theorem.

Theorem 2. Given a materialized view $C$ with view definition $C = A - B$, when $A$ and $B$ satisfy $B \subseteq A$, the net increments of $C$ are self-maintainable.

Proof: Using $B \subseteq A$ and formulas (3) and (4), we get

$$\Delta C = (\Delta A - \Delta B) \cup (\nabla B - \nabla A) \tag{9}$$

$$\nabla C = (\nabla A - \nabla B) \cup (\Delta B - \Delta A) \tag{10}$$

Formulas (9) and (10) contain only increments, so the net increments of $C$ are self-maintainable.

According to Theorem 2, if the input relations $A$ and $B$ of a D-segment satisfy $B \subseteq A$, the input and output relations need not be materialized.

5. Automatic Generation of Incremental ETL Process

The implementation of an incremental ETL process depends on the auxiliary views created when the full ETL process is executed.

Section 4 considered the selection of auxiliary views only for a single AUSPJ-segment or D-segment. In practice, an ETL process contains many AUSPJ-segments and D-segments, which makes the selection of auxiliary views more complicated. Adding auxiliary views lowers the cost of calculating increments, but it increases the cost of maintaining the auxiliary views themselves. Moreover, the storage space of a database is generally finite, so auxiliary views cannot be added without limit.

Similar to the method in [12], we abstract the problem of selecting auxiliary views into a mathematical model. Given a full ETL process $G$, storage space $S$ and auxiliary views $M$, the execution cost of the incremental ETL process is:

$$W(G, M) = \sum_{i=1}^{k} D(D_i, M) + \sum_{j=1}^{m} U(V_j, M) + C \tag{11}$$

Given the auxiliary views $M$, the function $D(D_i, M)$ represents the cost of calculating increment $D_i$, the function $U(V_j, M)$ represents the maintenance cost of auxiliary view $V_j$, and $C$ represents the maintenance cost of the primary views, which is a constant.

Meanwhile, the auxiliary views must fit within the storage space $S$:

$$\sum_{j=1}^{m} V_j \le S \tag{12}$$

The original problem now becomes: subject to constraint (12), select the auxiliary views $M$ that minimize the value of formula (11). This is an NP-hard problem [12].

However, there are differences between the incremental ETL process and the incremental maintenance of materialized views. First, the storage space of a data warehouse is vast, so we can drop the constraint in formula (12).

Second, queries in the data warehouse access only the primary views. To further reduce the time the data warehouse is locked while the incremental ETL process executes, we partition the process into two steps: first, calculate the net increments of the output relation of each operator in the ETL process and update only the primary views; second, update all the auxiliary views. We call this method TSU (Two Step Update). With the TSU method, the data warehouse needs to be locked only during the first step, so we need only consider how to lower the execution cost of the first step. Formula (11) is simplified as follows:

$$W(G, M) = \sum_{i=1}^{k} D(D_i, M) + C \tag{13}$$

Now we need only select the auxiliary views $M$ that minimize the value of formula (13). To achieve this, it suffices to ensure that the cost of calculating the net increment of the output relation of each operator is minimal. We can then solve the problem by selecting appropriate auxiliary views for each operator according to the methods discussed in the previous section.
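Since the per-operator increment formulas determine the cost term D(Di, M), it may help to see the D-segment formulas of Section 4.2 in executable form. The following Python sketch evaluates formulas (1), (2), (9) and (10) on sets. The function names and the sample customer IDs are illustrative assumptions, and the sketch assumes net increments under set semantics: inserted tuples are not already present, deleted tuples are present, and no tuple is both inserted and deleted.

```python
def bra_increment(a, b, ins_a, del_a, ins_b, del_b):
    """Net increment of C = A - B via formulas (1) and (2); reads full A and B (BRA)."""
    ins_c = ((ins_a - (b | ins_b)) | ((a | ins_a) & del_b)) - del_a   # formula (1)
    del_c = (del_a - b) | (a & ins_b)                                   # formula (2)
    return ins_c, del_c


def subset_increment(ins_a, del_a, ins_b, del_b):
    """Net increment of C = A - B when B is a subset of A, formulas (9) and (10)."""
    ins_c = (ins_a - ins_b) | (del_b - del_a)   # formula (9)
    del_c = (del_a - del_b) | (ins_b - ins_a)   # formula (10)
    return ins_c, del_c


def apply_increment(rel, ins, dele):
    """Return the relation after removing deleted tuples and adding inserted ones."""
    return (rel - dele) | ins


# Hypothetical customer IDs: A plays Customer, B plays VIP (B is a subset of A).
customer, vip = {1, 2, 3, 4}, {2, 3}
ins_a, del_a = {5, 6}, {1, 3}   # customers 5 and 6 are added, 1 and 3 are removed
ins_b, del_b = {4, 6}, {3, 2}   # 4 and 6 become VIP, 3 and 2 stop being VIP

c_old = customer - vip
# Reference result: recompute the difference from the updated inputs.
c_new = apply_increment(customer, ins_a, del_a) - apply_increment(vip, ins_b, del_b)

bra = bra_increment(customer, vip, ins_a, del_a, ins_b, del_b)
sub = subset_increment(ins_a, del_a, ins_b, del_b)
```

On this data both functions return the same increments, but `subset_increment` never reads `customer` or `vip`, which matches the statement of Theorem 2 that the inputs need not be materialized.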

Using the method in [11], a canonicalized ETL process can be described by a tree called the ETL tree. Using the ETL tree, we give an algorithm named MCCI (minimal cost of calculating increment) to generate the incremental ETL process.

MCCI Algorithm CreateIncrementalETL(t0, V0)
Input: a canonicalized full ETL process tree t0, and a set of view definitions V0 (each one associated with a segment in t0)
Output: a set of changed view definitions V with materialized views, and a set of SQL statements S for the incremental ETL process
begin
    V ← ∅, S ← ∅;
    do {
        get a topmost unprocessed segment si from t0;
        get vi associated with si from V0;
        if (vi is an AUSPJ-segment)
            for each operation in vi, add materialized views into vi ¹;
        else
            for the difference operation in vi, add materialized views into vi ²;
        generate SQL statements Si for incremental maintenance of vi;
        V ← V ∪ vi;
        S ← S ∪ Si;
    }
    until (there is no unprocessed segment in t0);
    return (V, S);
end;

¹ The method to select materialized views for an AUSPJ-segment is discussed in Section 4.1.
² We choose the SRA method to maintain a D-segment incrementally.

The algorithm scans the ETL tree only once when generating the incremental ETL process. If there are N segments in the ETL tree, its running time is O(N). Suppose that, when selecting the method of incremental maintenance, the MCCI algorithm ensures that the cost of calculating the increment of each segment is minimal. Then, when the incremental ETL process generated by the MCCI algorithm is executed according to the first step of the TSU method, its execution cost is minimal. In fact, however, whether the incremental maintenance cost is minimal depends on the instance of the increment [1], so we can only ensure that the incremental ETL process generated by the MCCI algorithm executes efficiently in general cases.

6. Experiments

We implemented the MCCI algorithm in an ETL tool and compared the running time of the generated incremental ETL processes with that of full ETL processes.

We chose Example 1 from Section 3 for the experiment because it includes join, projection, union, difference and aggregation operators (every operator except selection), so it can demonstrate the methods of incremental maintenance. We used Windows Server 2003 as the operating system, running on a PC with a Pentium IV 2.0 GHz CPU and 512 MB of memory, and the Kingbase database as the DBMS. Of the four base tables, Customer, Order_A and Order_B each contain 100000 tuples, and VIP contains 10000 tuples.

In the experiment, we also consider the optimization of the incremental ETL process. Because the attributes (ORDER_ID, PRODUCT_ID) are the primary key of Order_A union Order_B, we keep these attributes in relation TEMP_2, so we do not need to attach an additional counting attribute to TEMP_2. Because VIP is a subset of Customer, we use formulas (9) and (10) to incrementally maintain relation TEMP_1. The result of the experiment is shown in Figure 2.

Figure 2. Time of executing full or incremental ETL processes
[Figure 2 plots time of execution (ms) against the ratio of increment to full, with three series: full, increment and optimized.]

The results show that when the increment is very small, the running time of the incremental ETL process generated by the MCCI algorithm is lower than that of the full ETL process, and the optimized incremental ETL process takes even less time.

7. Conclusion

This paper discusses how to automatically generate an incremental ETL process from an existing full ETL process. Since an ETL process is similar to a view definition, we draw on existing methods for the incremental maintenance of materialized views to implement the automatic creation of incremental ETL processes.

First, we use the method in [11] to canonicalize the ETL process, so that the canonicalized ETL process contains only AUSPJ-segments and D-segments. Using existing results on the incremental maintenance of materialized views, we give a method to incrementally maintain AUSPJ-segments.

We also discuss the incremental maintenance of D-segments in detail. First, we put forward the plain method BRA, which uses the full information of the base relations. Second, based on the hypothesis that “in a data warehouse environment, the increment is generally much smaller than the base relation”, we put forward the improved method SRA. In the SRA method,
two auxiliary views are introduced to lower the total cost of incremental maintenance. We then discuss the self-maintainability of materialized views defined by the difference operator. Such views are not always self-maintainable; through analysis, we obtain three conditions under which they are. We also discuss a common form of the difference operator in ETL processes, in which one input relation is a subset of the other. In this situation, we prove that the net increment of the output relation is self-maintainable.

Finally, we discuss how to automatically generate an incremental ETL process. Implementing an incremental ETL process requires selecting the necessary auxiliary views, and selecting the optimal auxiliary views within a given storage space is NP-hard. Since the storage space of a data warehouse is vast, we drop the corresponding constraint and put forward the TSU method, which postpones the updating of auxiliary views. We then simplify the formula used to select auxiliary views and, based on the simplified formula, give an algorithm that automatically generates the incremental ETL process.

Incremental ETL processes used to be designed manually. Our work enables them to be generated automatically, and the generated incremental ETL process performs well in most situations.

Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) under grant 2005CB321905 and the National Science Foundation of China (NSFC) under grant 60503035. We thank all members (Yuan Hong, Yongrui Qin, et al.) of the ETL TOOL research team for their contributions.

References
[1] A. Gupta and I. Mumick, “Maintenance of materialized views: problems, techniques, and applications,” IEEE Data Eng. Bulletin, Vol. 18, No. 2, 1995, pp. 3-18.
[2] A. E. Abbadi et al., “Performance issues in incremental warehouse maintenance,” in Proc. VLDB 2000, Cairo, Egypt, 2000, pp. 461-472.
[3] I. S. Mumick and O. Shmueli, “Finiteness properties of database queries,” Advances in Database Research: Proc. of the 4th Australian Database Conference, Brisbane, Australia, 1993, pp. 274-288.
[4] J. Blakeley, P. N. Larson, and F. Tompa, “Efficiently updating materialized views,” ACM SIGMOD International Conference on Management of Data, Washington, USA, 1986, pp. 269-400.
[5] A. Gupta, H. Jagadish, and I. Mumick, “Data integration using self-maintainable views,” The 5th International Conference on Extending Database Technology (EDBT), Avignon, France, 1996, pp. 140-144.
[6] T. Palpanas, R. Sidle, R. Cochrane, and H. Pirahesh, “Incremental maintenance for non-distributive aggregate functions,” in Proc. VLDB 2002, Hong Kong, 2002, pp. 802-813.
[7] K. Yi et al., “Efficient maintenance of materialized top-k views,” in Proc. ICDE 2003, Bangalore, India, 2003, pp. 189-200.
[8] T. Griffin, L. Libkin, and H. Trickey, “An improved algorithm for the incremental recomputation of active relational expressions,” IEEE Trans. Knowl. Data Eng., 9(3), 1997, pp. 508-511.
[9] Y. Cui and J. Widom, “Storing auxiliary data for efficient maintenance and lineage tracing of complex views,” in Proc. DMDW 2000, Stockholm, Sweden, 2000.
[10] D. Agrawal et al., “Efficient view maintenance at data warehouses,” SIGMOD Conference 1997, Tucson, USA, 1997, pp. 417-427.
[11] Y. Cui, J. Widom, and J. L. Wiener, “Tracing the lineage of view data in a warehousing environment,” ACM Transactions on Database Systems, 25(2), 2000, pp. 179-227.
[12] H. Gupta, “Selection of views to materialize in a data warehouse,” in Proc. ICDT 1997, Delphi, Greece, 1997, pp. 98-112.

