Sunteți pe pagina 1din 42

A Pentaho Data Integration tool

Sreenivas K
04/17/14

Introduction
ETL Process
Pentahos Kettle

Data Integration Challenges


Prerequisites and Recent Releases
Pentaho DI Components
JDBC
Spoon

Transformations
Jobs

MaxQDPro: Kettle- ETL Tool

04/17/14

4 major components:
Extracting
Gathering raw data from source systems and storing it in ETL
staging environment
Data profiling
Identifying data that changed since last load
Transforming- Cleaning and Conforming
Processing data to improve its quality, format it, merge from
multiple sources, enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts

MaxQDPro: Kettle- ETL Tool

04/17/14

Data filtering
Is not null, greater than, less than, includes

Field manipulation
Trimming, padding, upper and lowercase conversion

Data calculations
+ - X / , average, absolute value, arctangent, natural logarithm

Date manipulation
First day of month, Last day of month, add months, week of year,
day of year

Data type conversion


String to number, number to string, date to number

Merging fields & splitting fields

Looking up date
Look up in a database, in a text file, an excel sheet,

MaxQDPro: Kettle- ETL Tool

04/17/14

Loading
Loading data into data warehouse tables

Managing hierarchies in dimensions


Managing special dimensions such as date and
time, junk, mini, shrunken, small static, and usermaintained dimensions
Fact table loading
Building and maintaining bridge dimension tables
Handling late arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environment for
specific purposes
MaxQDPro: Kettle- ETL Tool

04/17/14

MaxQDPro: Kettle- ETL Tool

04/17/14

Complexity and significant operational


problems.
Exceeds the designers expectations
Data Profilingof a source.
Data warehouses typically grow
asynchronously.
Establishing thescalability of an ETL system
across the lifetime .

MaxQDPro: Kettle- ETL Tool

04/17/14

Many off-the-shelf tools exist


High-end tools may not justify value for
smaller warehouses
Proprietary ETL

High upfront cost


Long term maintenance

Custom Code
Low upfront cost
Support grows as business requirements changes

MaxQDPro: Kettle- ETL Tool

04/17/14

Tool

Vendor

Oracle Warehouse Builder (OWB)

Oracle

Data Integrator (BODI)

Business Objects

IBM Information Server (Ascential)

IBM

SAS Data Integration Studio

SAS Institute

PowerCenter

Informatica

Oracle Data Integrator (Sunopsis)

Oracle

Data Migrator

Information Builders

Integration Services

Microsoft

Talend Open Studio

Talend

DataFlow

Group 1 Software (Sagent)

Data Integrator

Pervasive

Transformation Server

DataMirror

Transformation Manager

ETL Solutions Ltd.

Data Manager

Cognos

DT/Studio

Embarcadero Technologies

ETL4ALL

IKAN

DB2 Warehouse Edition

IBM

Jitterbit

Jitterbit

Pentaho Data Integration

Pentaho
MaxQDPro: Kettle- ETL Tool

04/17/14

Kettle Kettle Extraction Transformation


Transportation & Loading tool
Its open source business intelligence suite
for powerful data integration by Pentaho.
Founded in 2004.
Products of Pentaho

Mondrain OLAP server written in Java


Kettle ETL tool
Weka Machine learning and Data mining tool

MaxQDPro: Kettle- ETL Tool

04/17/14

10

Data is everywhere
Data is inconsistent

Records are different in each system

Performance issues
Running queries to summarize data for stipulated
long period takes operating system for task
Brings the OS on max load

Data is never all in Data Warehouse


Excel sheet, acquisition, new application

MaxQDPro: Kettle- ETL Tool

04/17/14

11

Meta data , model driven approach


What to do? And how to do?
Complex transformation with zero code
Graphically design data transformation and jobs

100% Java with cross-platform support


Extensible architecture
Repository-based
Full featured ETL
Integration with Pentaho Open BI Platform

MaxQDPro: Kettle- ETL Tool

04/17/14

12

Prerequisites

Java Runtime
Environment 1.5 and
above
Compatible with
almost any platform
Compatible with wide
range of Databases
technologies.

Recent Releases

4/25 Data Integration 3.0.3 GA

4/18 Data Integration 3.1


Milestone

2/8 Data Integration 3.0.2 GA

12/12 Data Integration 3.0.1 GA

11/15 Data Integration 3.0 GA

10/31 Data Integration 3.0 RC2

10/24 Data Integration 2.5.2 GA

10/08 Data Integration 3.0 RC1

08/24 Data Integration 2.5.1 GA


MaxQDPro: Kettle- ETL Tool

04/17/14

13

Pan
A program to execute transformations designed by Spoon
in XML or database repository.
Transformations are scheduled in batch mode to be run
automatically at regular intervals
Carte
Simple web server to execute transformations and jobs
remotely.
Accept an XML (small servlet) that contains
transformation to execute and the execution
configuration.
Allows to remotely monitor, start and stop the
transformations and jobs
Server running in Carte is a Slave Server
MaxQDPro: Kettle- ETL Tool

04/17/14

14

Spoon
GUI that allows you to design transformations and
jobs that can be run with the Kettle tools Pan and
Kitchen
Transformations and Jobs can describe themselves
using an XML file or can be put in a Kettle database
repository.
Spoon is available as executable script and batch file
to make use of tool in heterogeneous environment.
Latest version of Spoon is 3.2 beta version.
Kitchen
Execute jobs designed by Spoon in XML or database
repository
MaxQDPro: Kettle- ETL Tool

04/17/14

15

Installing
Ensure JRE 1.5 is
installed.
Unzip the binary
distribution in any folder

Launching
spoon.bat in windows
platform
spoon.sh in Unix like
platform

Create Shortcut with


spoon.ico pointing to bat file

Supported platform
Microsoft Windows
including Vista
Linux GTK: on i386 and
x86_64 processors
Apple's OSX: works both
on PowerPC and Intel
machines
Solaris: using a Motif
interface
AIX, HP-UX, FreeBSD

Works on most of OS
MaxQDPro: Kettle- ETL Tool

04/17/14

16

JDBC -Database
connectivity Java tool.
Comes in four different
types
Type1: JDBC-ODBC Bridge
Type 2 : Native API partial
Java driver
Type 3 : Middleware Java
Drivers
Type 4: Direct to DB Java
Drivers

Microsoft Based DB
like MS Access rely on
Type 1drivers
Oracle, Mysql can be
connected with other
types. But traditionally
used is the Type 4
driver.
JDBC can also operate
in Distributed
environment.

Latest JDBC 3.0


MaxQDPro: Kettle- ETL Tool

04/17/14

17

MaxQDPro: Kettle- ETL Tool

04/17/14

18

MaxQDPro: Kettle- ETL Tool

04/17/14

19

Key Improvement
Execution Results Pane for logs, metrics and
performance graph
Improved Database Connection dialog
Snap to grid (graphical workspace)
Zoom (Graphical Workspace)
Easier to use left panel for the objects palette
Over 30 new or improved Transformation Steps
13 new or improved Job Entries
Support for four new database types - MonetDB,
KingbaseES, Vertica, and HP NeoView
Improved translations
MaxQDPro: Kettle- ETL Tool

04/17/14

20

Repository Connection establishment


Auto login

By setting manually KETTLE_REPOSITORY,


KETTLE_USER and KETTLE_PASSWORD
environmental variables.

Login
By default PDI provides login username and
password ad admin.
It strictly advised to change default password to
avoid any security vulnerablity.

MaxQDPro: Kettle- ETL Tool

04/17/14

21

MaxQDPro: Kettle- ETL Tool

04/17/14

22

MaxQDPro: Kettle- ETL Tool

04/17/14

23

MaxQDPro: Kettle- ETL Tool

04/17/14

24

Transformation
Value: Values are part of a row
and can contain any type of
data
Row: a row exists of 0 or more
values
Output stream: an output
stream is a stack of rows that
leaves a step.
Input stream: an input
stream is a stack of rows that
enters a step.
Hop: A hop is a graphical
representation of one or more
data streams between 2 steps.
Note: A note is a piece of
information that can be added
to a transformation

Engine capable of performing a


multitude of functions such as reading,
manipulating and writing data to and
from various data sources.

MaxQDPro: Kettle- ETL Tool

04/17/14

25

Jobs
Job Entry: A job entry is
one part of a job and
performs a certain
Hop: A hop is a
graphical representation
of one or more data
streams between 2 steps

A way of calling transformations and


controlling the sequence of their
execution. Usually jobs are
scheduled in batch mode to be run
automatically at regular intervals.

Note: a note is a piece of


information that can be added
to a job

MaxQDPro: Kettle- ETL Tool

04/17/14

26

Input Steps
Output Steps
Lookup Steps
Transformation
Steps

Join Steps

Job Steps

DW Steps

Mapping Steps

MaxQDPro: Kettle- ETL Tool

04/17/14

27

MaxQDPro: Kettle- ETL Tool

04/17/14

28

MaxQDPro: Kettle- ETL Tool

04/17/14

29

MaxQDPro: Kettle- ETL Tool

04/17/14

30

MaxQDPro: Kettle- ETL Tool

04/17/14

31

MaxQDPro: Kettle- ETL Tool

04/17/14

32

MaxQDPro: Kettle- ETL Tool

04/17/14

33

MaxQDPro: Kettle- ETL Tool

04/17/14

34

Table Output Step

MaxQDPro: Kettle- ETL Tool

04/17/14

35

Insert / Update Output Step

MaxQDPro: Kettle- ETL Tool

04/17/14

36

Besides the execution order, it specifies the condition for next job entry
Unconditional - next job entry will be executed regardless of the result
of the originating job entry.
Follow when result is true - next job entry will only be executed when
the result of the originating job entry is true,
Follow when result is false - next job entry will only be executed when
the result of the originating job entry was false

MaxQDPro: Kettle- ETL Tool

04/17/14

37

MaxQDPro: Kettle- ETL Tool

04/17/14

38

MaxQDPro: Kettle- ETL Tool

04/17/14

39

MaxQDPro: Kettle- ETL Tool

04/17/14

40

Brief Introduction to ETL process


JDBC Repository Connection
Pentaho Data Integration Tool

Components

Pan
Carte
Kitchen
Spoon

Transformation with different Input Data Source


Jobs

MaxQDPro: Kettle- ETL Tool

04/17/14

41

kettle.pentaho.org
Kettle project homepage

kettle.javaforge.com
Kettle community website: forum, source, documentation, tech tips,
samples,

www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey

www.mysql.com
White paper http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.htm
l
Kettle Webinar http://www.mysql.com/news-and-events/on-demand-webinars/penta
ho-2006-09-19.php
Roland Bouman blog on Pentaho Data Integration and MySQL
http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-k
ettle-turns.html
MaxQDPro: Kettle- ETL Tool

04/17/14

42

S-ar putea să vă placă și