
Infosys Technologies Limited

Lab Guide For Pentaho Data Integration 4.0.1 (also known as Kettle)

Version No: 3.0

Table of Contents
Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE
Assignment 1: The Kettle Repository
Assignment 2: My first Data transfer using Kettle
Assignment 3: Using the Add constants, Calculator and Select Values transformations
Assignment 4: Creating an ODBC data source
Assignment 5: Using the Database Lookup transformation

Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE


Learning Objective: To download and install Pentaho Data Integration 4.0.1, and open the PDI interface.

Step 1: Install Java Runtime Environment (version 1.4 or higher) on your system.

Step 2: Go to http://www.pentaho.com and download Pentaho Data Integration 4.0.1.

Step 3: Unzip the downloaded PDI zip file. Open the data-integration folder, and double-click on the spoon.bat file to open the PDI IDE.

Assignment 1: The Kettle Repository


Learning Objective: To learn the concept of a repository in PDI (Kettle), and to learn how to create, connect to, or disconnect from a repository.

Concept of Repository: The Kettle repository is the workspace that the data integrator works in. This workspace is a physical region of the hard drive that is designated exclusively for Kettle. All information about transformations, jobs, schedules, etc. is stored in the repository. The repository concept promotes re-usability, which in turn saves time and effort.

A repository may be created in two ways:
1) Kettle database repository
2) Kettle file repository

When Kettle is started, the Repository Connection dialog box appears, asking you to select a repository from the list of existing repositories, or to create a new one.

To create a file repository:

Step 1: In the Repository Connection dialog box, click on the + button. The Select the repository type dialog box will appear.

Step 2: Select Kettle file repository and click on OK.

Step 3: In the File repository settings dialog box, click on the Browse button and select a folder that shall exclusively be your file repository space. Fill in the ID and Name, and click on the OK button. Then click on the OK button of the Repository Connection dialog box to select the newly created repository.

You are now ready to create transformations and jobs on this workspace.

To disconnect from the current working repository, go to the Tools menu: Tools -> Repository -> Disconnect repository or alternatively, press Ctrl+D.

NOTE: In the course of working with Kettle, if you want to change your repository or create a new one, you can do so by first disconnecting from the current working repository. Then, open the Repository Connection dialog box from: Tools -> Repository -> Connect or alternatively, press Ctrl+R. The Repository Connection dialog box appears.
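
To make the idea of a file repository concrete, the following minimal Python sketch lists what such a folder contains. It assumes that each transformation is saved as a .ktr file and each job as a .kjb file inside the repository folder. The file names used here are hypothetical, and a temporary folder stands in for a real repository.

```python
import tempfile
from pathlib import Path

# Assumption: a file repository keeps each transformation as a .ktr file
# and each job as a .kjb file inside the chosen folder.
KIND_BY_SUFFIX = {".ktr": "transformation", ".kjb": "job"}


def list_repository(folder):
    """Return sorted (kind, relative path) pairs for every Kettle object found."""
    root = Path(folder)
    found = []
    for path in root.rglob("*"):
        kind = KIND_BY_SUFFIX.get(path.suffix.lower())
        if path.is_file() and kind:
            found.append((kind, path.relative_to(root).as_posix()))
    return sorted(found)


# A throwaway folder stands in for a real repository (file names are made up).
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "Employee.ktr").write_text("<transformation/>")
    Path(repo, "OrderDetails.ktr").write_text("<transformation/>")
    Path(repo, "nightly_load.kjb").write_text("<job/>")
    Path(repo, "notes.txt").write_text("not a Kettle object")
    contents = list_repository(repo)

for kind, name in contents:
    print(kind, name)
```

Because everything lives in one folder, backing up or sharing the repository amounts to copying that folder.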

Assignment 2: My first Data transfer using Kettle


Learning Objective: To create a simple transformation that involves data transfer from a flat file to an Access database destination.

Step 1: In the Kettle IDE file menu, open File -> New -> Transformation, or alternatively, press Ctrl+N.

Step 2: To save your transformation file with a name of your choice, press Ctrl+S. The Transformation properties dialog box opens up. Give the transformation a name of your choice, and then click on OK.

Step 3: On the Design pane on the left of the IDE, expand the Input group. Drag and drop the Text file input on the transformation design surface.

Step 4: Double-click on the Text file input. The Text file input properties dialog box opens up. Click on Browse to select the flat file to be used as an input.

Select the Products.txt flat file that will be used as input for the transformation. After clicking on Open, click on the button Add to add the file to the list of selected files.

Step 5: Go to the Content tab. Since this is a Comma separated values (CSV) flat file, specify the separator as comma (,).

Step 6: Open the Fields tab and click on Get fields. When asked for the number of lines to sample, enter 0, review the scan results for the flat file, and click on the Close button. You can also view the contents of the text file by clicking on the Preview rows button.

Step 7: Once done, click on OK to complete the process of defining a flat file input.
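
The scan performed by Get fields can be pictured as reading the header row and guessing a data type for each column from the sampled values. The Python sketch below illustrates that idea only; it is not Kettle's actual algorithm. The sample rows are made up (the real layout of Products.txt may differ), and treating a sample size of 0 as "scan every line" is an assumption of this sketch.

```python
import csv
import io

# Made-up sample data; the real Products.txt may have different columns.
SAMPLE = """ProductID,ProductName,UnitPrice
1,Chai,18.00
2,Chang,19.00
3,Aniseed Syrup,10.00
"""


def guess_type(values):
    """Pick the narrowest type that fits every sampled value."""
    for type_name, cast in (("Integer", int), ("Number", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "String"


def get_fields(text, separator=",", sample_rows=0):
    """Scan the rows (0 means all rows) and return [(field name, guessed type)]."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)
    rows = list(reader)
    if sample_rows > 0:
        rows = rows[:sample_rows]
    columns = zip(*rows)  # transpose rows into columns
    return [(name, guess_type(column)) for name, column in zip(header, columns)]


fields = get_fields(SAMPLE, separator=",", sample_rows=0)
for name, type_name in fields:
    print(name, type_name)
```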

Step 8: Expand the Output group on the Design pane, and drag and drop Access output on the transformation surface.

To determine the sequence of data flow from one transformation item to another, a Hop is used. To create the hop:

a) Click on the Text file input, then hold down the <SHIFT> key and draw a line to the Access output.

OR

b) Place the mouse pointer on the Text file input until the hover menu appears, and then drag the Output connector to the Access output.

OR

c) Place the mouse pointer on the Text file input, press the middle mouse button, then drag the hop pointer and release it on the Access output.
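
Conceptually, a hop hands each row produced by one item to the next item in the flow. The Python sketch below models that with a generator feeding a consumer. It is a simplification for illustration, not a description of how Kettle schedules its steps, and the function names and sample lines are hypothetical.

```python
def text_file_input(lines, separator=","):
    """Source step: turn delimited lines into rows (dicts keyed by field name)."""
    header = lines[0].split(separator)
    for line in lines[1:]:
        yield dict(zip(header, line.split(separator)))


def table_output(rows, table):
    """Target step: append every incoming row to `table`, return the row count."""
    count = 0
    for row in rows:
        table.append(row)
        count += 1
    return count


# Made-up sample lines standing in for the flat file.
lines = ["ProductID,ProductName", "1,Chai", "2,Chang"]
target_table = []

# The "hop": the output of the source step is passed straight to the target step.
rows_written = table_output(text_file_input(lines), target_table)
print(rows_written, target_table)
```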

Step 9: Double-click on the Access output to open its properties dialog box. Since the Access database does not currently exist, enter the file name along with the full path in the database filename field. Also enter the name of the target table in the Target table field. Keep the Create database and Create table checkboxes selected, so that the database and the table are created if they do not already exist. After this is done, click on OK.

Step 10: To run the transformation, click on the green-coloured triangular button.

The Execute a transformation dialog box opens up. Click on Launch to execute the transformation.

The Execution Results pane appears. In the Step Metrics tab, the column Active shows Finished if the transformation was executed successfully.

Open the Northwind Access database file. You will see that the data has been successfully populated in the Products table.
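
For readers who want to see the whole assignment as code, the following Python sketch performs the same kind of transfer: it reads a comma-separated file, creates the target table if it does not exist, and inserts the rows. It is only an analogy under stated assumptions: an in-memory SQLite database stands in for the Access file, the sample data is made up, and the function name load_products is hypothetical.

```python
import csv
import io
import sqlite3

# Made-up sample data; the real Products.txt may have different columns.
PRODUCTS_TXT = """ProductID,ProductName,UnitPrice
1,Chai,18.0
2,Chang,19.0
"""


def load_products(text, connection, table="Products"):
    """Create the table if needed and insert every row of the delimited text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    # Mirrors the "Create table" option: only create it when it is missing.
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
    rows = [tuple(row[c] for c in columns) for row in reader]
    connection.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    connection.commit()
    return len(rows)


# SQLite in memory is a stand-in for the Access database file.
connection = sqlite3.connect(":memory:")
written = load_products(PRODUCTS_TXT, connection)
names = [
    r[0]
    for r in connection.execute(
        'SELECT "ProductName" FROM "Products" ORDER BY "ProductID"'
    )
]
print(written, names)
```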

Assignment 3: Using the Add constants, Calculator and Select Values transformations
Learning Objective: To learn how to use the Calculator to calculate a new column from existing column values, and how to select specific fields to be populated in the destination using the Select Values transformation.

Requirements:

i. The columns from the employee Excel sheet that are required to be sent to an Excel worksheet are: EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, HireDate, City, Country, HomePhone, Extension and ReportsTo.

ii. In the Employee table, the FirstName and LastName columns should be stored as a single column in the destination.

Step 1: Create a new transformation called Employee. Drag and drop Excel input on the transformation surface. Double-click the Excel input to open its properties dialog box. Click on Browse.

Select the excel workbook that contains the source data for the Employee table, and click on the Add button to add it to the list of selected files.

Step 2: Go to the Sheets tab, and click on Get sheetnames to get the list of the names of the sheets that you wish to include in the data flow. A dialog appears that asks you to select the sheets you want.

Select the sheet named employee and click on the > button to include it in the list of selected sheets. Then click on OK.

Step 3: Next, go to the Fields tab and click on Get fields from header row button to get a list of the field names from the first row of the excel sheet employee.

Click on Preview rows and enter the number of rows that you would like to preview (this facility lets the developer confirm that the connection fetches the data from the Excel sheet correctly).

Step 4: Click on OK to complete the task of defining a connection to the excel sheet data source.

Step 5: From the Transform group in the Design pane of Kettle, drag and drop Add constants transformation on the transformation surface. Double-click on it to open its properties dialog box. Name the new field as space, specify data-type as String and length as 1. The value should be given as a space.

After this is done, click on OK. The Add constants will now add a new field called space in the data flow.

Step 6: From the Transform group in the Design pane of Kettle, drag and drop Calculator transformation on the transformation surface. Create a hop from Add constants to Calculator.

Step 7: Double-click on the Calculator to open its properties dialog box.

i. Specify the new field name as FullName.

ii. Select the calculation type as A+B+C.

iii. Specify Field A as FirstName, Field B as space, Field C as LastName, Value type as String and Length as 70. Click on OK.
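
The combined effect of Add constants and the A+B+C calculation can be sketched in Python as two small functions applied in sequence. This is an illustration of the logic only; the function names and the sample rows are hypothetical.

```python
def add_constant(rows, field, value):
    """Add constants: attach the same fixed value to every row."""
    for row in rows:
        yield {**row, field: value}


def calculator_a_plus_b_plus_c(rows, new_field, a, b, c):
    """Calculator (A+B+C): for strings, '+' concatenates the three fields."""
    for row in rows:
        yield {**row, new_field: row[a] + row[b] + row[c]}


# Made-up sample rows standing in for the employee sheet.
employees = [
    {"FirstName": "Nancy", "LastName": "Davolio"},
    {"FirstName": "Andrew", "LastName": "Fuller"},
]

with_space = add_constant(employees, "space", " ")
with_full_name = calculator_a_plus_b_plus_c(
    with_space, "FullName", "FirstName", "space", "LastName"
)
full_names = [row["FullName"] for row in with_full_name]
print(full_names)
```

The constant field is needed because the calculation only combines fields that already exist in the data flow, so the separator has to be supplied as a field of its own.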

Step 8: From the Transform group in the Design pane of Kettle, drag and drop Select values transformation on the transformation surface. Create a hop from Calculator to Select values.

[NOTE: The Select values transformation is used to remove the columns that are not required further in the data flow. The existing columns that are required may also be renamed and cast to another data type, if needed.]

Step 9: Double-click on the Select values transformation to open its properties dialog box. Click on the Get fields to select button to fetch the fields that are presently in the data flow.

Step 10: Go to the Remove tab. This is where the columns that have to be excluded from the data flow are specified. Under the Fieldname column, click on the drop-down. It will show a list of the available fields in the data flow. Click on the name of the column you wish to exclude. For example, click on Address, since it is not required further in the data flow.

Do the same for all other fields that have to be excluded.

Step 11: Under the Metadata tab, click on the Get fields to change button. Remove the fields that are not required in the data flow. Specify the alternative name, data-type, length, precision, etc. for each of the input fields (if required).

Once done, click on OK.
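
The Remove and Metadata settings of Select values amount to dropping some fields and optionally renaming or casting others. The Python sketch below shows that logic under stated assumptions: the function signature, the rename of HomePhone to Phone, and the sample row are hypothetical and chosen only for illustration.

```python
def select_values(rows, remove=(), metadata=None):
    """Drop the fields in `remove`; rename/cast fields listed in `metadata`.

    `metadata` maps an input field name to (new name, cast function or None).
    """
    metadata = metadata or {}
    for row in rows:
        out = {}
        for name, value in row.items():
            if name in remove:
                continue  # Remove tab: the field leaves the data flow
            new_name, cast = metadata.get(name, (name, None))
            out[new_name] = cast(value) if cast else value
        yield out


# Made-up sample row; values are placeholders, not real data.
rows = [
    {
        "EmployeeID": "1",
        "FullName": "Nancy Davolio",
        "Address": "(sample address)",
        "HomePhone": "(sample phone)",
    }
]

selected = list(
    select_values(
        rows,
        remove=("Address",),
        metadata={"EmployeeID": ("EmployeeID", int), "HomePhone": ("Phone", None)},
    )
)
print(selected)
```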

Step 12: From the Output group in the Design pane of Kettle, drag and drop Excel output on the transformation surface. Create a hop from Select values to Excel output. Double-click on Excel output to open its properties dialog box. Click on the Browse button.

Step 13: Select the folder where you want to save the excel destination workbook. Specify the name of the file, and click on Save.

Step 14: In the Content tab, specify the sheet name as Employee.

Step 15: In the Fields tab, click on the Get Fields button to fetch the fields that have to be included in the Employee worksheet. Specify # as format for integer fields. Once done, click on OK.

Step 16: Your transformation is now complete and ready to be executed. Run the transformation by clicking on the green triangular button, and then clicking on the Launch button after that. After execution, the destination Excel sheet looks like this:

Assignment 4: Creating an ODBC data source


Step 1: Click on Start->Control Panel->Administrative Tools->Data Sources (ODBC). In the ODBC Data Source Administrator dialog box, select the User DSN tab and click on Add.

Step 2: Select Microsoft Access driver (*.mdb, *.accdb) and click on Finish.

Step 3: Specify the data source name and description, and then click on Select to select the Access database to be used.

Step 4: Select Northwind.accdb from its location and click on OK.

Step 5: Click on OK again.

Step 6: Click on OK again.

The ODBC data source has now been created.
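
Once a data source name (DSN) exists, a client program can refer to the database by that name instead of by file path. The Python sketch below only builds the two usual forms of ODBC connection string so the difference is visible; it does not open a connection. The DSN name NorthwindDSN and the file path are hypothetical, and the driver name is the one selected in Step 2.

```python
def dsn_connection_string(dsn_name):
    """Connection string that relies on a data source defined in the ODBC administrator."""
    return f"DSN={dsn_name}"


def dsn_less_connection_string(database_path):
    """Equivalent string that names the driver and the database file directly."""
    driver = "Microsoft Access Driver (*.mdb, *.accdb)"
    return f"Driver={{{driver}}};DBQ={database_path}"


# Hypothetical DSN name and path, for illustration only.
via_dsn = dsn_connection_string("NorthwindDSN")
direct = dsn_less_connection_string(r"C:\data\Northwind.accdb")
print(via_dsn)
print(direct)
```

With the DSN form, moving the database file only requires updating the data source definition, not every program that uses it.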

Assignment 5: Using the Database Lookup transformation


Learning Objective: To learn how to look up values from a referenced table using key-value pairs, and include the value field(s) in the data flow.

Requirements:

i. The OrderDetails sheet from the Excel workbook Northwind contains product-wise data about orders. Replace the ProductID field by the ProductName and populate the data into the Northwind.accdb Access database, into a table named OrderDetails.

Step 1: Create a new transformation file, and save it as OrderDetails.

Step 2: Drag and drop an Excel input on the transformation surface. Edit the properties of the Excel input.

i. Select the data source as Northwind.xls.

ii. Select the source sheet as orderdetails.

iii. Click on Get fields from header row to fetch the fields for the data flow. Click on OK, once done.

Step 3: Drag and drop Database lookup on the transformation surface. Create a hop from Excel input to the Database lookup.

Step 4: Double-click on Database lookup to open its properties dialog box. For creating a new connection to the Access database table Products that belongs to the Northwind.accdb database, click on New.

Step 5: Give the connection a name. Select connection type as MS Access. Specify the name of the ODBC connection to the Northwind.accdb database. Click on Test to test the connection.

If connection is successful, the following message is displayed:

Click on OK.

Step 6: Click on Browse to select the lookup table.

Step 7: Select the Products table as the table to be looked up for value fields.

Step 8: To equate the key values between the source table and the lookup table, specify Table field as ProductID, comparator as = and Field1 as ProductID. Select the Values to return from the lookup table as ProductName.
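
The lookup configured above can be sketched in Python as a keyed query per incoming row. This is an illustration under stated assumptions, not Kettle's implementation: an in-memory SQLite database stands in for the Access database, the sample rows are made up, the function name is hypothetical, and returning None for an unmatched key is a choice made by this sketch.

```python
import sqlite3


def database_lookup(rows, connection, table, key_field, stream_field, return_field):
    """For each row, fetch `return_field` where table.key_field = row[stream_field]."""
    query = f'SELECT "{return_field}" FROM "{table}" WHERE "{key_field}" = ?'
    for row in rows:
        match = connection.execute(query, (row[stream_field],)).fetchone()
        # Unmatched keys get None in this sketch.
        yield {**row, return_field: match[0] if match else None}


# SQLite in memory is a stand-in for Northwind.accdb; the data is made up.
connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "Products" ("ProductID" INTEGER, "ProductName" TEXT)')
connection.executemany(
    'INSERT INTO "Products" VALUES (?, ?)', [(1, "Chai"), (2, "Chang")]
)

order_details = [
    {"OrderID": 10248, "ProductID": 1, "Quantity": 12},
    {"OrderID": 10249, "ProductID": 99, "Quantity": 5},
]

looked_up = list(
    database_lookup(
        order_details, connection, "Products", "ProductID", "ProductID", "ProductName"
    )
)
print(looked_up)
```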

Step 9:

i. Drag and drop Select Values on the transformation surface. Create a hop from Database lookup to Select values.

ii. In the Remove tab, select the field ProductID to be removed.

iii. In the Metadata tab, specify the data types of the fields that are included in the data flow.

Step 10: Drag and drop Access output on the transformation surface. Create a hop from Select values to the Access output.

i. Specify the database as the existing Northwind.accdb database.

ii. Give the table name as OrderDetails.

iii. Click on OK.

Step 11: Your transformation is now complete and ready to be executed. Run the transformation by clicking on the green triangular button, and then clicking on the Launch button after that. After execution, the destination table looks like this:

