
(IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 2, February 2010, pp. 143-148

Automation of Data Warehouse, Extraction, Transformation and Loading Update Cycle

Atif Amin1 and Abdul Aziz2

1 Faculty of Information Technology,
University of Central Punjab, Lahore, Pakistan
atif.amin@ucp.edu.pk

2 Faculty of Information Technology,
University of Central Punjab, Lahore, Pakistan
aziz@ucp.edu.pk

Abstract: Business enterprises invest a lot of money to develop a data warehouse that gives them real, constant and up-to-date data for decision making. Traditionally, data warehouses are kept up to date through periodic updates. Periodic updates introduce a delay between the operational data and the warehouse data. These updates are triggered at a set time; some organizations set it to the evening, when there is no workload on the systems. Fixing the time in this way does not work in every case. Many companies run day and night without any break, and in these situations periodic updates leave the warehouse stale. The delay depends on the periodic interval: as the interval increases, the difference between operational and warehouse data also increases. The most recent data is unavailable for analysis because it still resides in the operational data sources. For timely and effective decision making, the warehouse should be updated as soon as possible. Extraction, Transformation and Loading (ETL) tools are designed for updating the warehouse. When the warehouse is refreshed, it often gets stuck because its resources are overloaded. A suitable time should be chosen for updating the warehouse so that resources are utilized efficiently. The warehouse is not updated just once; updating is a cyclic process. We introduce automation for ETL: our proposed framework selects the best time to complete the process, so that the warehouse is updated automatically as soon as resources are available, without compromising data warehouse usage.

Keywords: ETL, Updating, Loading, Data Warehouse.

1. Introduction

Computers were used for data transactions and to provide information to support decision making. From early on, the merits of placing information on a separate platform for decision making were exploited. This approach gives easy access to the needed data, improves system response time, and assures security and data integrity. These systems were pioneers of this approach. Their end users saw many applications, for example executive summaries, that relied on specially prepared data.

Two decades ago, organizations developed data warehouses to provide users with decision support. These differ from earlier systems in several ways. One is the use of a special-purpose system whose task is to clean the data, extract the useful data, and load it into the data warehouse. Depending on the application's needs, many kinds of software can be used to store the data. Enhanced data access tools make it easy for end users to access, transform, analyze and display computed information without writing queries.

Many organizations are becoming customer focused. They use data mining to provide information for business advantage. In many organizations the goal for users is to obtain timely information for correct decisions. The fundamental role of a data warehouse is to support correct decision making. To deliver this kind of information, applications and tools are required for its implementation. Their requirements include easy access to data, scope and accuracy. They also provide on-line analytical processing (OLAP) based on relational databases.

Decision support systems (DSS) was the name given in the 1970s to information systems designed to help managerial staff make decisions. Managerial problems range from assigning budgets to choosing the correct site locations for a business. The basic idea behind developing these systems was that managers could create and operate them on their own. In the 1980s, a number of organizations called such systems executive information systems (EIS) [3].

The basic idea behind EIS was that managers need standard information about their firms and the external environment related to the business. This information includes the time history of problems and their outcomes for predicting their future state, so that a manager can instantly know what is going on. An EIS does not have the analytical advantages of a DSS. Some writers say that EIS is used by senior managers and DSS by junior staff.

Although very useful, EIS and DSS often lacked a strong database. Generally, information gathered for one database cannot be used for another database. Managerial decision making requires consideration of the past and the future, not just the present. As a result, DSS and EIS had to create their own databases, an area which was not their prime expertise. This activity is demanding and time consuming.

Figure 1. ETL flow in Data Warehouse

The data of interest is spread across different sources; for a centralized warehouse it is collected from heterogeneous sources. Data is first extracted from the desired source, then transformed and cleansed according to global rules. Figure 1 shows the basic cycle of data warehouse updating. ETL is the process of making operational source data uniform so that it can be loaded into the warehouse. To keep the data warehouse up to date, operational source data must be loaded into the warehouse in a timely manner. Loading into the warehouse is done by an ETL tool. The warehouse should be intelligent enough to decide when to update.

2. Common Problem

A data warehouse is a repository of business data drawn from multiple sources across heterogeneous data stores. The warehouse implements a materialized view of the data: data is stored in the warehouse and information is extracted from it. Whenever data arrives in the warehouse, the extracted materialized information should also be brought in line with the current data.

The data warehouse is a rising technique for retrieving data from distributed and heterogeneous data sources. A data warehouse is a repository of integrated information, available for queries and analysis (e.g. decision support and data mining) [1]. When relevant information becomes part of the warehouse, the information extracted from the source is translated into a common model, the relational model, and integrated with the existing data of the warehouse. Data mining and querying of the data warehouse can be done quickly and efficiently because the data is already in a uniform format and all differences have been resolved in the warehouse.

Figure 2. The ETL process

We can think of a data warehouse as defining and storing an integrated materialized view over multiple data sources. Figure 2 shows a cyclic process; it is executed each time the warehouse is refreshed. The warehouse should be refreshed with new operational source data in a timely manner so that analysis can be performed over it. Refreshing the warehouse requires maximum utilization of computing resources. In a business environment, computing resources are used to gain efficiency in the business. If updating starts in peak hours, it will take over these resources and slow down the business system. These resources are not in use all the time. Sometimes they are used at an average level and sometimes hardly at all. We have to use the periods of below-average resource utilization for updating the warehouse; these periods are selected from historical records of resource usage.

3. Related Work

Many researchers [2], [3], [7] have done a lot of work on the ETL life cycle. They have proposed different techniques regarding the updating time. The data warehouse stalls while ETL is being performed, because the Extract-Transform-Load process makes the warehouse unstable. When this process finishes, the warehouse is updated not only with data but also with the metadata repository, as shown in figure 2. It becomes stable once the complete update has been performed. We review different techniques that have been used and proposed earlier. Many proposals consider that the development phase of a data warehouse is totally different from the development of an RDBMS. They include extra stages, such as workload refinement [10] or the ETL process [11, 12]. Moreover, they provide different methods for the requirements analysis phase. Several authors argue the importance of metadata as an integral part of the data warehouse design process [13, 14, 15].

3.1. Offline ETL Process

This technique has become obsolete; it is not used by any modern data warehouse repository, but we would still like to discuss it. Data was loaded into the warehouse offline. During the updating process, the data warehouse was shut down. All concerned authorities in the organization were informed that updating of the data warehouse was in progress and that the warehouse would not be working for some time. When the data in the warehouse had been updated, all concerned authorities were notified that they could use it again.

This technique was also called manual updating of the data warehouse. If any problem occurred in the warehouse, all of its users suffered. During the maintenance phase the warehouse could not perform its function. Updating a warehouse is not a one-time activity; it must be updated every time new requirements and more data need to be captured. Much of the time the warehouse was not functioning because of its maintenance process. This technique was very wasteful of resources, because many resources were left unused much of the time while the warehouse was unavailable.

3.2. Periodic ETL Process

When the offline ETL process did not serve the organization well, a periodic ETL process was introduced. This technique does not fully automate the updating process, but it semi-automates the task. It uses off-hours, when the warehouse is not needed, for the updating process. It also stops the functionality of the warehouse while updating. This does not matter, because in off-peak hours there is no need for the data warehouse. The warehouse is mostly needed when management staff are present and have to make decisions. The problem that now arises is how to select the off-peak hours.

3.2.1 Daily Periodic Activity

Daily periodic activity is best for those organizations which need up-to-date data for correct decision making and these

organizations work in one shift only, that is, they work in the morning and close in the evening. All their resources remain unused while the office is closed, from the evening until the new day starts. Office closing time marks the start of the off-peak hours for an organization that works one shift only. Its warehouse administrator will choose the evening for updating the warehouse, using the resources while they are not in use. Our primary goal is to maximize the use of resources in order to save cost. This is online updating, because with this technique there is no need to shut down the activities of the warehouse. The administrator does not need to be present while the warehouse is being updated. When the warehouse has been updated, it automatically sends a confirmation report to the concerned person.

3.2.2 Weekly Periodic Activity

Weekly periodic activity is best for those organizations which need up-to-date data for correct decision making but work all shifts on working days, i.e. they work round the clock and close only for the weekend holidays. All their resources remain unused while the office is closed. The warehouse administrator will choose the weekly holidays for updating the warehouse, using the resources while they are not in use. Our goal is always to maximize the use of resources in order to save cost. The situation is reversed in an organization that works at weekends and on holidays and remains closed during weekdays. It varies from organization to organization according to their structural needs; updating is performed on their respective weekly holidays. This is online updating, because with this technique there is no need to shut down the activities of the warehouse. The administrator does not need to be present while the warehouse is being updated. When the warehouse has been updated, it automatically sends a confirmation report to the concerned person.

3.2.3 Monthly Periodic Activity

Monthly periodic activity is best for those organizations which need up-to-date data for correct decision making but work all shifts on working days as well as on holidays, i.e. they do not close even on weekends and holidays. Their warehouse administrator will choose the closing date of the month for updating the warehouse. It varies from organization to organization according to their structural needs; updating is performed on those month-end days when there is less use of warehouse resources, because normally these organizations do not make decisions at the close of the month.

3.2.4 Quarterly Periodic Activity

Quarterly periodic activity is best for those organizations which need up-to-date data for correct decision making but work all shifts on working days as well as on holidays. However, some extra time is required for updating the data warehouse. During this time updating is performed offline, and the data warehouse is also maintained to accommodate extra business needs. During maintenance no user can communicate with the warehouse to get results out of it.

3.2.5 Yearly Periodic Activity

There are some organizations which update their data warehouse on an annual basis. Mostly these companies make their decisions and policies at the beginning of every year. Once they have decided their work plan, it remains in effect for the rest of the year. Different organizations in the market have different plans and policies. They use the strategy which suits them best. Moreover, it also depends on the nature of the business and market trends.

3.3. Online periodic Queue Activity

An extensive study on the modeling of ETL jobs has been published by Simitsis and Vassiliadis [5][9]. Online periodic queue is best suited for environments where different heterogeneous sources update the warehouse. Business situations do not remain the same for long; they change quite frequently. One operational source may be in one time zone and another in a different time zone. Thus one source may be using the warehouse while another needs to update it online, because this is the off-peak time of that organization. In this scenario, which is the best time to update the warehouse?

None of the solutions given above works for this kind of problem, where operational data stores are at different places and in different time zones. Researchers therefore introduced online periodic queue activity. This activity allows the warehouse to have a queue attached to it.

Figure 3. Online Periodic queue activity (four operational sources, a queue, and the warehouse)

Figure 3 shows the queue attached to the warehouse. Each operational source sends its ETL output to the warehouse for updating. The warehouse stores the output in the queue and applies the updates one by one. This technique allows operational data sources to write as many times as they need to. The warehouse keeps updating itself whenever it gets off-hours. There are some problems with this technique. When operational data sources keep sending their update requests to the warehouse queue and the warehouse does not get time for updating, the buffer overflows, and the warehouse is not updated as desired. Lastly, it has another problem: if more and more updating

requests keep arriving from the different data sources, the warehouse is kept busy mostly with updating, which reduces the productive work obtained from it. This decreases the efficiency of the warehouse, because updating stalls the warehouse system.

4. Proposed Solution

All of the above techniques wait for off-hours, so that resources become free and ETL can be performed easily. In reality this does not work for multinational organizations, because these organizations run round the clock. There are no holidays, even on Christmas, Eid and Easter. Although employees take time off individually, these organizations maintain a rota so that when one employee is on holiday another works in his or her place; on top of that, customer satisfaction and quality matter to them. In this type of working environment, where organizations remain busy round the clock, it is very difficult to find any off-hours, and shutting down the warehouse also creates a problem for them. Our proposed framework identifies the times for updating the warehouse using a prediction technique applied to historical data.

The framework identifies all the resources that are used by the warehouse while it is being maintained and updated. It keeps a record of each resource's utilization with and without warehouse updating. It also identifies those machines that make the most use of the warehouse. We attach a histogram to each resource and to each machine that needs data. The histograms observe the utilization of the resources. The framework calculates the threshold limit for each resource and machine. This observation keeps checking at what level of load the resource performs in a timely manner and at what stage it starts sticking or malfunctioning.

Once we have identified these things, we will notice that there are some times when updating can be performed easily. The framework identifies those times when the office is working but the concerned resources are not being used. It applies the updating process at that time and observes the load of the overall working environment for as long as everything functions well. When the load returns to its normal threshold level, we stop updating the warehouse at that stage and resume from that point later, until updating is completed.

Figure 4. Minimum and maximum of resource utilization

These are the maximum and minimum load limits within which the resource can work properly, as shown in figure 4 by the limits. We call the average of the maximum and the minimum the threshold.

Figure 5. Threshold of Resource

Figure 4 and figure 5 explain the overall load on a resource. When the load is near or under the threshold, more load can be applied, as is done in the off-hours in the periodic updating activity.

Now we look at the utilization of a resource that participates both in the working of the organization and in the updating of the warehouse. This resource can be any server that is designed for both kinds of work. We anticipate that a resource that participates in dual actions remains overburdened and causes the warehouse system to malfunction. In figure 6 we maintain a graph that checks the load on that resource during and after office hours. It maintains a history record set of these activities on a daily basis. Table 1 shows a history record set for Monday. We have picked those records whose time and duration stay below the average threshold.

Table 1: Resources duration with their status

Resources   Day   Time       Duration   Status
Res 1       Mon   6:00 AM    4 hours    Below
Res 2       Mon   5:00 AM    4 hours    Below
Res 3       Mon   12:00 AM   10 hours   Below
Res 5       Mon   7:00 AM    3 hours    Below

We now show a calculation that finds time for warehouse updating during office hours. We take the overlapped time in which all the resources are free. Table 1 is the history of one day, and it gives us two hours free for updating the warehouse. This overlap runs from 7:00 AM to 9:00 AM, when all the resources are free and the warehouse updating activity will not cause the system to get stuck. The recording of resource durations and their status is a continuous process. The longer this process runs, the more reliable its predictions of the time for warehouse updating become: as the statistical population (data size) grows, the prediction becomes more and more accurate.

There will be situations where the ETL update cycle predicts that the warehouse can be updated but the current status does not allow it, because updating in this situation would cause warehouse failure. We skip that time and wait for another time.
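
The threshold rule and the overlap calculation on Table 1 can be illustrated with a minimal sketch. The function names, the representation of a record as (start hour, duration in hours), and the restriction to whole hours within a single day are assumptions made for this illustration only; the paper does not prescribe an implementation.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not part of the proposed framework's specification.

def threshold(min_load, max_load):
    """Threshold of a resource: the average of its minimum and maximum load."""
    return (min_load + max_load) / 2


def common_free_window(records):
    """Return (start_hour, end_hour) during which every resource is below
    its threshold, or None if the free periods do not overlap.

    Each record is (start_hour, duration_hours), hours counted from midnight.
    """
    start = max(s for s, _ in records)      # latest start of a free period
    end = min(s + d for s, d in records)    # earliest end of a free period
    return (start, end) if end > start else None


# Monday history from Table 1: resource -> (start hour, duration in hours)
TABLE_1 = {
    "Res 1": (6, 4),    # 6:00 AM, 4 hours
    "Res 2": (5, 4),    # 5:00 AM, 4 hours
    "Res 3": (0, 10),   # 12:00 AM, 10 hours
    "Res 5": (7, 3),    # 7:00 AM, 3 hours
}

print(common_free_window(list(TABLE_1.values())))  # (7, 9)
```

The result (7, 9) corresponds to the two-hour window from 7:00 AM to 9:00 AM derived in the text. In a running system the records would come from the continuously recorded history rather than from a fixed table.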

Figure 6. Resource utilization graph below and above threshold point

Lastly, there may be a situation where the current utilization of resources and the predicted time for updating allow the warehouse to be updated smoothly, but once updating starts the utilization of the resource increases because of external use. At that time we stop the updating process, hand the resource back to the system environment, and wait for the next suitable time. This is completely automatic, with self-activation and self-stoppage.

Figure 7 shows the utilization of a resource that is below its threshold. We consider such a resource free. We record the history of those resources that remain mostly below average, together with the reason for their being free. If this continues to happen, we remove them from the list of resources that we inspect, because these resources are in dual use by the system and the warehouse.

Figure 7. Resource Utilization below threshold point

There may be a situation in which no time frame exists where resources are below the threshold, yet they are not at the maximum level of utilization either. In that case we calculate the approximate load of the ETL process together with its duration. We predict the load on the resource after adding the ETL update cost. If this remains at a normal level, that is, below the maximum level, we start updating the warehouse and continue until any of the resources goes beyond the maximum load limit.

Our warehouse architecture is also modified with a few changes, as shown in figure 8. It adds a queue to each operational data source. This queue sends an update request only once it has received a response from the warehouse that its previous request has been applied. This approach has several benefits. It avoids warehouse buffer overflow, because the number of update requests is limited. Each data source has at most one job in the warehouse queue. Operational data sources hold their update requests in their own queues until the warehouse asks for them.

Figure 8. ETL update cycle architecture

These queues give an equal opportunity to each operational data source and protect them from starvation. In previous architectures some operational data sources might be starved because of rapid updates from another source.

5. Conclusion

In this paper we discussed the issue of the ETL update cycle. We argued that the update cycle method is very efficient compared to complete loading of data from operational data sources, because in multinational organizations it is very difficult to find off-hours. The ETL update cycle method is preferable in general because it gives maximum utilization of resources, obtains maximum throughput from them, and keeps the warehouse available so that different users can run as many queries as possible. The ETL update cycle requires an extra check of resource utilization every time, which increases the cost, but this cost becomes minimal when organizations gain more business through the availability of the warehouse.

Our main contribution is to provide certain rules for the ETL update cycle that make it a fully automated and efficient process, without human interaction with the data warehouse. It automatically selects the best time for loading data into the warehouse without compromising its functionality and response time to users. If the warehouse receives data from its connected operational data sources, it gives first priority to users and second priority to updating; otherwise our algorithms do not occupy the resources unnecessarily. The warehouse does not need to update every time an update is called for by some predefined time schedule. Our proposed mechanism saves time and cost compared to periodic updates and other existing ETL techniques.
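
As an illustration of two rules from Section 4, the sketch below models (a) the per-source queue that keeps at most one outstanding job per source in the warehouse queue, and (b) the check that the predicted load after adding the ETL cost stays below the maximum load limit. The class names `SourceQueue` and `Warehouse`, the method names, and the use of integer load percentages are hypothetical choices for this example; the paper does not specify an implementation.

```python
from collections import deque


class SourceQueue:
    """Local queue held at one operational data source (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()   # update jobs waiting at the source
        self.in_flight = False   # True while one job sits in the warehouse queue

    def add(self, job):
        self.pending.append(job)


class Warehouse:
    """Warehouse with a bounded queue: at most one job per source."""

    def __init__(self, max_load):
        self.max_load = max_load  # maximum load limit of the resource (percent)
        self.queue = deque()
        self.applied = []

    def offer(self, source):
        """Accept the next job of a source only if it has none outstanding."""
        if source.in_flight or not source.pending:
            return False
        self.queue.append((source, source.pending.popleft()))
        source.in_flight = True
        return True

    def can_update(self, current_load, etl_load):
        """Predicted load after adding the ETL cost must stay below the maximum."""
        return current_load + etl_load < self.max_load

    def update_once(self, current_load, etl_load):
        """Apply one queued job if the predicted load allows it, else skip."""
        if not self.queue or not self.can_update(current_load, etl_load):
            return None
        source, job = self.queue.popleft()
        self.applied.append(job)
        source.in_flight = False  # acknowledgement: the source may send again
        return job
```

A short usage example under the same assumptions: with `wh = Warehouse(max_load=90)` and a source holding jobs "a1" and "a2", the first `wh.offer(a)` is accepted and the second is refused until "a1" has been applied. `wh.update_once(80, 20)` skips the update because the predicted load of 100 exceeds the limit, while `wh.update_once(50, 20)` applies "a1" and frees the source to send "a2". Because every source can occupy only one slot, the warehouse queue never grows beyond the number of sources, which is the overflow and starvation argument made in Section 4.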

References

[1] Gupta, A., Jagadish, H. V., Mumick, "Data Integration using Self-Maintainable Views," EDBT, pp. 140-144, 1996
[2] Jörg, T., Dessloch, "Towards generating ETL processes for incremental loading," IDEAS, pp. 101-110, 2008
[3] Jörg, T., Dessloch, "Formalizing ETL Jobs for Incremental Loading of Data Warehouses," BTW, pp. 327-346, 2009
[4] Kimball, R., Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, John Wiley & Sons, 2004
[5] Panos Vassiliadis, Alkis Simitsis, and Spiros Skiadopoulos, "Conceptual modeling for ETL processes," In DOLAP, pp. 14-21, 2002
[6] Alkis Simitsis, "Mapping conceptual to logical models for ETL processes," In DOLAP, pp. 67-76, 2005
[7] Alkis Simitsis, Panos Vassiliadis, and Timos K. Sellis, "Optimizing ETL Processes in Data Warehouses," In ICDE, pp. 564-575, 2005
[8] Alkis Simitsis, Panos Vassiliadis, Manolis Terrovitis, and Spiros Skiadopoulos, "Graph-Based Modeling of ETL Activities with Multi-level Transformations and Updates," In DaWaK, pp. 43-52, 2005
[9] Labio, W., Garcia-Molina, "Efficient Snapshot Differential Algorithms for Data Warehousing," VLDB, pp. 63-74, 1996
[10] M. Golfarelli and S. Rizzi. A methodological framework for data warehouse design. In I.-Y. Song and T.J. Teorey, editors, Proceedings of the 1st ACM International Workshop on Data Warehousing and OLAP, DOLAP'98, pp. 3-9. ACM Press, 1998.
[11] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. Wiley, 1998.
[12] S. Luján-Mora and J. Trujillo. A comprehensive method for data warehouse design. Proceedings of the 5th International Workshop on Design and Management of Data Warehouses, DMDW'03. CEUR Workshop Proceedings, 2003.
[13] C. Ballard, D. Herreman, D. Schau, R. Bell, E. Kim, and A. Valencic. Data Modeling Techniques for Data Warehousing. IBM Redbooks SG24-2238-00, 1998.
[14] L. Carneiro and A. Brayner. X-META: A methodology for data warehouse design with metadata management. In [157], pp. 13-22.
[15] F. Paim, A. Carvalho, and J. Castro. Towards a methodology for requirements analysis of data warehouse systems. In Proceedings of the 16th Brazilian Symposium on Software Engineering, SBES'02, pp. 1-16, 2002.

Authors Profile

Atif Amin received his B.S. in Computer Science degree from University of Central Punjab in 2008. He has been the winner of All Pakistan Software Competition, Softcom '08 in 2008 where he won first position. He has been chairman IEEE from 2007-2008. He is now doing M.S. in Computer Science from University of Central Punjab.

Prof. Dr. Abdul Aziz did his M.Sc. from University of the Punjab, Pakistan in 1989; M.Phil and Ph.D in Computer Science from University of East Anglia, UK. He secured many honors and awards during his academic career from various institutions. He is currently working as full Professor at the University of Central Punjab, Lahore, Pakistan. He is the founder and Chair of Data Mining Research Group at UCP. Dr. Aziz has delivered lectures at many universities as guest speaker. He has published text books and large number of research papers in different refereed international journals and conferences. His research interests include Knowledge Discovery in Databases (KDD) - Data Mining, Pattern Recognition, Data Warehousing and Machine Learning. He is member of editorial board and referee for various well known international journals / conferences including IEEE publications. (e-mail: aziz@ucp.edu.pk).
