Extract, Transform and Load (ETL)
Eduardo Almeida
Master Alma, Université de Nantes
{eduardo.almeida@univ-nantes.fr}
Goal
To present the general concepts of the
Extract, Transform and Load (ETL) process
To ETL
Bibliography
Berson, Alex and Smith, Stephen J.
Data Warehousing, Data Mining & OLAP
Kimball, Ralph
The Data Warehouse Toolkit
Inmon, William H.
Building the Data Warehouse
Noirault, Claire
Business Intelligence avec Oracle 10g
http://asktom.oracle.com
Donsez, Didier (presentations)
Université Joseph Fourier
DW Overall architecture
[Figure: Extract, Transform and Load (ETL)]
DW Overall architecture (staging area)
[Figure: ETL]
Extract
Production data
Heterogeneous data sources
Heterogeneous representations
Incremental vs. full loading
Extract
Extraction
Logical (Full, Incremental)
Physical
Full Extraction
Export of one table or a set of tables from the source (e.g., )
Extract using programs (e.g., PL/SQL, Java, etc.)
Advantages
No need to keep track of changes
No additional information needed on the source
Drawbacks
Large amount of data
Performance impact on the data sources and the ETL process
Incremental Extraction
Requires a mechanism to identify modified data
A DATE attribute (last-modification timestamp)
Triggers
Comparison of original and current values (e.g., MINUS operator)
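The first and last mechanisms above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database in place of the production source, the tables `customers` and `customers_snapshot` and their columns are hypothetical, and SQLite spells Oracle's MINUS operator as EXCEPT.

```python
import sqlite3

def extract_incremental(conn, last_extract):
    """Return rows modified after the previous extraction timestamp."""
    cur = conn.execute(
        "SELECT id, name, last_modified FROM customers "
        "WHERE last_modified > ? ORDER BY id",
        (last_extract,),
    )
    return cur.fetchall()

def extract_by_difference(conn):
    """Return current rows that are absent from the previous snapshot."""
    # EXCEPT is SQLite's equivalent of Oracle's MINUS set operator.
    cur = conn.execute(
        "SELECT id, name FROM customers "
        "EXCEPT SELECT id, name FROM customers_snapshot ORDER BY id"
    )
    return cur.fetchall()

# Hypothetical source data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
    CREATE TABLE customers_snapshot (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES
        (1, 'Ana', '2024-01-01'),
        (2, 'Bruno', '2024-02-15'),
        (3, 'Chloe', '2024-03-10');
    INSERT INTO customers_snapshot VALUES (1, 'Ana'), (2, 'Bruno');
""")

changed = extract_incremental(conn, "2024-02-01")
new_rows = extract_by_difference(conn)
```

The timestamp approach relies on the source maintaining the date column on every update; the set difference needs no cooperation from the source but compares the full table against a stored copy.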
Physical Extraction
Requires a mechanism to identify modified data
Log files
Dump files
Flat files
Partitioning (source tables are partitioned along a date key)
Transform
Integration
Cleansing
Standardizing
Enrichment
Sort
Filter
...
Transform
Data integration
Transform
Data Cleansing
Data Cleansing is the act of detecting and
correcting (or removing) corrupt or inaccurate
records.
São Paulo
S. Paulo
SP
DW
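A minimal sketch of this kind of correction, assuming the known spellings are kept in a lookup table. The dictionary `CITY_VARIANTS` and the record layout are hypothetical; records whose city cannot be resolved are set aside rather than loaded.

```python
# Hypothetical lookup table: known spellings -> canonical value.
CITY_VARIANTS = {
    "são paulo": "São Paulo",
    "sao paulo": "São Paulo",
    "s. paulo": "São Paulo",
    "sp": "São Paulo",
}

def cleanse_city(raw):
    """Map a raw city string to its canonical form, or None if unknown."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return CITY_VARIANTS.get(key)

def cleanse_records(records):
    """Split records into corrected ones and rejected ones."""
    clean, rejected = [], []
    for rec in records:
        city = cleanse_city(rec["city"])
        if city is None:
            rejected.append(rec)
        else:
            clean.append({**rec, "city": city})
    return clean, rejected

clean, rejected = cleanse_records(
    [{"id": 1, "city": "S. Paulo"}, {"id": 2, "city": "Nantes"}]
)
```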
Transform
Standardizing
Address
number, street, city, country, zip
street, number, neighborhood, city, country, zip
Phone
+33 (0) 2 40 55 66 77
330240556677
Name
Johnny Hallyday
Hallyday, Johnny
JOHNNY HALLYDAY
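The phone and name cases above can be sketched as follows. The target formats chosen here (digits only for phones, "First Last" in title case for names) are assumptions made for the illustration, and the function names are hypothetical.

```python
import re

def standardize_phone(raw):
    """Keep digits only, dropping '+', spaces and parentheses."""
    return re.sub(r"\D", "", raw)

def standardize_name(raw):
    """Return 'First Last' in title case from 'Last, First' or any casing."""
    raw = raw.strip()
    if "," in raw:
        # 'Last, First' -> 'First Last'
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    return " ".join(word.capitalize() for word in raw.split())

phone = standardize_phone("+33 (0) 2 40 55 66 77")
names = [standardize_name(n) for n in
         ("Johnny Hallyday", "Hallyday, Johnny", "JOHNNY HALLYDAY")]
```

All three name variants collapse to the same value, which is what allows the records to be matched in the warehouse.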
Load
Large amount of data
Significant processing load
Schedule during periods of low system use
Verify referential integrity after the load
From fact tables to dimensions
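One way to verify referential integrity after a load is to look for fact rows whose foreign key has no matching dimension row. The sketch below assumes hypothetical tables `sales_fact` and `products_dim` and uses an in-memory SQLite database for the illustration.

```python
import sqlite3

def find_orphan_facts(conn):
    """Return fact rows whose product key has no matching dimension row."""
    cur = conn.execute(
        "SELECT f.sale_id, f.product_id FROM sales_fact f "
        "LEFT JOIN products_dim d ON d.product_id = f.product_id "
        "WHERE d.product_id IS NULL ORDER BY f.sale_id"
    )
    return cur.fetchall()

# Hypothetical warehouse tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products_dim (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO products_dim VALUES (1, 'Tea'), (2, 'Coffee');
    INSERT INTO sales_fact VALUES (10, 1, 5.0), (11, 2, 3.5), (12, 9, 1.0);
""")

orphans = find_orphan_facts(conn)
```

An empty result means every fact row points to an existing dimension row; here sale 12 references a product that was never loaded.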
Extract
Oracle 'exp' command
exp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
exp scott/tiger file=emp.dmp tables=(emp,dept)
exp scott/tiger tables=emp query="where deptno=10"
exp scott/tiger file=abc.dmp tables=abc query=\"where sex=\'f\'\" rows=yes
Extract
Extracting into Flat Files Using SQL*Plus
SET echo off
SET pagesize 0
SPOOL country_city.dat
SELECT distinct t1.country_name ||'|'|| t2.cust_city
FROM countries t1, customers t2
WHERE t1.country_id = t2.country_id
AND t1.country_name= 'United States of America';
SPOOL off
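The same extraction can be sketched outside SQL*Plus. The example below is an analogue, not the Oracle procedure: it assumes an in-memory SQLite database with hypothetical sample rows, and writes the pipe-delimited lines to any file-like object.

```python
import csv
import io
import sqlite3

def extract_to_flat_file(conn, out):
    """Write distinct country|city pairs to a pipe-delimited file."""
    rows = conn.execute(
        "SELECT DISTINCT t1.country_name, t2.cust_city "
        "FROM countries t1 JOIN customers t2 ON t1.country_id = t2.country_id "
        "WHERE t1.country_name = ? ORDER BY t2.cust_city",
        ("United States of America",),
    )
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count

# Hypothetical source data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_city TEXT, country_id INTEGER);
    INSERT INTO countries VALUES (1, 'United States of America'), (2, 'France');
    INSERT INTO customers VALUES (1, 'Boston', 1), (2, 'Austin', 1), (3, 'Boston', 1), (4, 'Nantes', 2);
""")

buffer = io.StringIO()  # stands in for country_city.dat
written = extract_to_flat_file(conn, buffer)
```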
Load
Oracle 'imp' command
imp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
imp scott/tiger file=emp.dmp tables=(emp,dept)
imp scott/tiger file=abc.dmp tables=abc rows=yes
Load
Scenario
My system has both a clients table and a clients_dim table
I want to load the clients_dim table from an export of clients
Load
Using SQL*Loader
sqlldr user control=control.ctl
The control.ctl file has the load information:
load data
infile 'country_city.dat'
into table country_city
fields terminated by "|" optionally enclosed by '"'
( country_name, cust_city )
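To make the meaning of the control file concrete, here is a rough analogue of the same load. It is a sketch under assumptions, not SQL*Loader itself: the target is an in-memory SQLite table, and the input lines are hypothetical. Fields are separated by "|" and may optionally be enclosed in double quotes, as the control file specifies.

```python
import csv
import io
import sqlite3

def load_flat_file(conn, infile):
    """Load pipe-delimited, optionally quoted fields into country_city."""
    reader = csv.reader(infile, delimiter="|", quotechar='"')
    rows = [tuple(fields) for fields in reader if fields]  # skip blank lines
    conn.executemany(
        "INSERT INTO country_city (country_name, cust_city) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_city (country_name TEXT, cust_city TEXT)")

# Hypothetical content of country_city.dat: one plain line, one quoted line.
data = io.StringIO(
    'United States of America|Austin\n'
    '"United States of America"|"Boston"\n'
)
loaded = load_flat_file(conn, data)
cities = [r[0] for r in conn.execute(
    "SELECT cust_city FROM country_city ORDER BY cust_city")]
```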
Load
Using PL/SQL
DECLARE
  -- PL/SQL variables holding the category looked up for each product
  nom_cat VARCHAR2(25);
  descr   VARCHAR2(100);
  -- Cursor over the source products
  CURSOR cur IS
    SELECT ref_produit, nom_produit, code_categorie
    FROM produits;
BEGIN
  FOR crec IN cur LOOP
    -- Look up the category of the current product
    SELECT nom_categorie, description
    INTO nom_cat, descr
    FROM categories
    WHERE code_categorie = crec.code_categorie;
    -- Insert the denormalized row into the dimension table
    INSERT INTO products_dim (ref_produit, nom_produit,
                              nom_categorie, description)
    VALUES (crec.ref_produit, crec.nom_produit,
            nom_cat, descr);
  END LOOP;
  COMMIT;
END;
/
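The row-by-row load above can also be written as a single set-based statement. The sketch below is an alternative formulation, not the method shown on the slides: it assumes that produits carries a code_categorie column (as the loop's lookup implies), uses an in-memory SQLite database, and the sample rows are hypothetical.

```python
import sqlite3

def load_products_dim(conn):
    """Fill products_dim with one INSERT ... SELECT joining both sources."""
    conn.execute(
        "INSERT INTO products_dim "
        "(ref_produit, nom_produit, nom_categorie, description) "
        "SELECT p.ref_produit, p.nom_produit, c.nom_categorie, c.description "
        "FROM produits p "
        "JOIN categories c ON c.code_categorie = p.code_categorie"
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM products_dim").fetchone()[0]

# Hypothetical source and target tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (code_categorie INTEGER PRIMARY KEY, nom_categorie TEXT, description TEXT);
    CREATE TABLE produits (ref_produit INTEGER PRIMARY KEY, nom_produit TEXT, code_categorie INTEGER);
    CREATE TABLE products_dim (ref_produit INTEGER, nom_produit TEXT, nom_categorie TEXT, description TEXT);
    INSERT INTO categories VALUES (1, 'Boissons', 'Drinks'), (2, 'Desserts', 'Sweets');
    INSERT INTO produits VALUES (100, 'Chai', 1), (101, 'Tarte', 2);
""")

loaded = load_products_dim(conn)
dim_rows = conn.execute(
    "SELECT * FROM products_dim ORDER BY ref_produit").fetchall()
```

One behavioral difference is worth noting: in PL/SQL, a SELECT ... INTO that finds no category raises NO_DATA_FOUND, whereas the inner join silently skips such products.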
Kettle
Open source ETL tool
http://kettle.pentaho.org/
Kettle
Kettle is designed to help you with your ETTL
needs, which include the Extraction,
Transformation, Transportation and Loading of
data.
Kettle Tutorial
Open a terminal
$ spoon.sh
Transformation
Kettle Tutorial
1 - Explorer
2 - Connections
3 - Configuration
4 - Test the connection
Kettle Tutorial
1 - Design (creation palette)
2 - Drag and drop
Kettle Tutorial
1 - Step name
2 - Write the SQL
Kettle Tutorial
1 - Run
Kettle Tutorial
1 - Filter
Kettle Tutorial
1 - Run
Kettle Tutorial
1 - Aggregation
Kettle Tutorial
1 - Step name
2 - Group field
3 - Aggregate field
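What the filter and aggregation steps compute can be sketched in plain Python. This is not Kettle's API, only an analogue of the data flow: rows pass through a filter, then are summed per group field. The field names `country` and `amount` and the sample rows are hypothetical.

```python
from collections import defaultdict

def filter_rows(rows, predicate):
    """Keep only the rows that satisfy the predicate (the filter step)."""
    return [row for row in rows if predicate(row)]

def aggregate(rows, group_field, agg_field):
    """Sum agg_field for each distinct value of group_field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[agg_field]
    return dict(totals)

# Hypothetical rows, as they might come out of the SQL input step.
rows = [
    {"country": "France", "amount": 10.0},
    {"country": "Brazil", "amount": 5.0},
    {"country": "France", "amount": 7.5},
    {"country": "Brazil", "amount": 0.0},
]

kept = filter_rows(rows, lambda r: r["amount"] > 0)
totals = aggregate(kept, group_field="country", agg_field="amount")
```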