Pentaho Data Integration Beginner's Guide

Ebook, 1,066 pages, 7 hours

Name: Pentaho Data Integration Beginner's Guide
Brand: Packt Publishing
Rating: 4.0 (1 review)

By Maria Carina Roldan

Rating: 4 out of 5 stars

About this ebook

In Detail

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that can simplify the work.

Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing.

"Pentaho Data Integration Beginner's Guide, Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.

"Pentaho Data Integration Beginner's Guide, Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce everything you have learned by implementing a simple datamart.

With "Pentaho Data Integration Beginner's Guide, Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Approach

This book focuses on teaching you by example. The book walks you through every aspect of Pentaho Data Integration, giving systematic instructions in a friendly style so that you can learn in front of your computer while playing with the tool. The extensive use of drawings and screenshots makes learning Pentaho Data Integration easy. Throughout the book, numerous tips and helpful hints are provided that you will not find anywhere else.

Who this book is for

This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. Those who have never used Pentaho Data Integration will benefit most from the book, but those who have will also find it useful.

This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

Language: English

Publisher: Packt Publishing

Release date: Oct 24, 2013

ISBN: 9781782165057

Author

Maria Carina Roldan

Maria Carina Roldan was born in Esquel, Argentina, and earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP) and then moved to Buenos Aires, where she has lived since 1994. She has worked as a BI consultant for almost fifteen years. She started working with Pentaho technology back in 2006. Over the last three and a half years, she has been devoted to working full time for Webdetails, a company acquired by Pentaho in 2013, as an ETL specialist. Carina is the author of Pentaho 3.2 Data Integration Beginner's Book, Packt Publishing, April 2009, and the co-author of Pentaho Data Integration 4 Cookbook, Packt Publishing, June 2011.

Related to Pentaho Data Integration Beginner's Guide

Related ebooks

Talend Open Studio Cookbook
Ebook by Rick Barton. Rating: 2 out of 5 stars

Mastering Tableau
Ebook by David Baldwin. Rating: 3 out of 5 stars

Pentaho 3.2 Data Integration Beginner's Guide
Ebook by Maria Carina Roldan. 0 ratings

Microsoft SQL Server 2014 Business Intelligence Development Beginner’s Guide
Ebook by Reza Rad. 0 ratings

Learning pandas
Ebook by Heydt Michael. Rating: 4 out of 5 stars

Python Business Intelligence Cookbook
Ebook by Dempsey Robert. 0 ratings

Learning Hadoop 2
Ebook by Garry Turkington. Rating: 4 out of 5 stars

Hadoop: Data Processing and Modelling
Ebook by Garry Turkington. 0 ratings

Oracle Business Intelligence Enterprise Edition 12c - Second Edition
Ebook by Christian Screen. 0 ratings

Mastering Python for Data Science
Ebook by Samir Madhavan. Rating: 3 out of 5 stars

Hadoop Beginner's Guide
Ebook by Garry Turkington. Rating: 4 out of 5 stars

Mastering QlikView
Ebook by Stephen Redmond. Rating: 5 out of 5 stars

Pentaho Data Integration Cookbook - Second Edition
Ebook by María Carina Roldán. 0 ratings

Data Warehouse Architecture A Complete Guide - 2021 Edition
Ebook by Gerardus Blokdyk. 0 ratings

Data Processing and Modeling with Hadoop: Mastering Hadoop Ecosystem Including ETL, Data Vault, DMBok, GDPR, and Various Data-Centric Tools
Ebook by Vinicius Aquino do Vale. 0 ratings

Data warehouse Complete Self-Assessment Guide
Ebook by Gerardus Blokdyk. Rating: 4 out of 5 stars

Logical Data Warehouse A Complete Guide - 2019 Edition
Ebook by Gerardus Blokdyk. 0 ratings

Learning Tableau 10 - Second Edition
Ebook by Joshua N. Milligan. Rating: 4 out of 5 stars

Data Lake Development with Big Data
Ebook by Pasupuleti Pradeep. 0 ratings

Star Schema The Complete Reference
Ebook by Christopher Adamson. 0 ratings

Data Architecture: A Primer for the Data Scientist
Ebook by W.H. Inmon. Rating: 5 out of 5 stars

Tabular Modeling with SQL Server 2016 Analysis Services Cookbook
Ebook by Derek Wilson. Rating: 4 out of 5 stars

Microsoft Tabular Modeling Cookbook
Ebook by Paul te Braak. 0 ratings

MDM for Customer Data: Optimizing Customer Centric Management of Your Business
Ebook by Kelvin K. A. Looi. 0 ratings

Learning Tableau
Ebook by Joshua N. Milligan. 0 ratings

Enterprise Data Warehouse Third Edition
Ebook by Gerardus Blokdyk. 0 ratings

Big Data: Principles and best practices of scalable realtime data systems
Ebook by James Warren. Rating: 4 out of 5 stars

Database Modeling and Design: Logical Design
Ebook by Toby J. Teorey. 0 ratings

Information Structure Design for Databases: A Practical Guide to Data Modelling
Ebook by Andrew J. Mortimer. Rating: 5 out of 5 stars

Mastering Data Warehouse Design: Relational and Dimensional Techniques
Ebook by Claudia Imhoff. Rating: 4 out of 5 stars

Computers For You

Slenderman: Online Obsession, Mental Illness, and the Violent Crime of Two Midwestern Girls
Ebook by Kathleen Hale. Rating: 4 out of 5 stars

101 Awesome Builds: Minecraft® Secrets from the World's Greatest Crafters
Ebook by Triumph Books. Rating: 4 out of 5 stars

The Invisible Rainbow: A History of Electricity and Life
Ebook by Arthur Firstenberg. Rating: 4 out of 5 stars

Elon Musk
Ebook by Walter Isaacson. Rating: 4 out of 5 stars

The ChatGPT Millionaire Handbook: Make Money Online With the Power of AI Technology
Ebook by TJ Books. 0 ratings

Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics
Ebook by Gary Smith. Rating: 4 out of 5 stars

Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
Ebook by Cea West. Rating: 5 out of 5 stars

SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Ebook by Walter Shields. Rating: 4 out of 5 stars

Dawn of the New Everything: Encounters with Reality and Virtual Reality
Ebook by Jaron Lanier. Rating: 4 out of 5 stars

Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Ebook by Seth Stephens-Davidowitz. Rating: 4 out of 5 stars

YouTube: How to Build and Optimize Your First YouTube Channel, Marketing, SEO, Tips and Strategies for YouTube Channel Success
Ebook by Tommy Swindali. Rating: 4 out of 5 stars

The Simulation Hypothesis: An MIT Computer Scientist Shows Why AI, Quantum Physics and Eastern Mystics All Agree We Are In a Video Game
Ebook by Rizwan Virk. Rating: 5 out of 5 stars

Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook by Steven Cooper. Rating: 4 out of 5 stars

Grokking Algorithms: An illustrated guide for programmers and other curious people
Ebook by Aditya Bhargava. Rating: 4 out of 5 stars

CompTIA IT Fundamentals (ITF+) Study Guide: Exam FC0-U61
Ebook by Quentin Docter. 0 ratings

CompTIA Security+ Practice Questions
Ebook by IP Specialist. Rating: 2 out of 5 stars

Alan Turing: The Enigma: The Book That Inspired the Film The Imitation Game - Updated Edition
Ebook by Andrew Hodges. Rating: 4 out of 5 stars

Remote/WebCam Notarization : Basic Understanding
Ebook by Jeannie Eunice Franks. Rating: 3 out of 5 stars

Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
Ebook by Cea West. Rating: 4 out of 5 stars

Machine Learning for Beginners: An Introduction for Beginners, Why Machine Learning Matters Today and How Machine Learning Networks, Algorithms, Concepts and Neural Networks Really Work
Ebook by Steven Cooper. Rating: 4 out of 5 stars

Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook by Arthur T. Brooks. 0 ratings

Ultimate Guide to Mastering Command Blocks!: Minecraft Keys to Unlocking Secret Commands
Ebook by Triumph Books. Rating: 5 out of 5 stars

Deep Search: How to Explore the Internet More Effectively
Ebook by Alan Pearce. Rating: 5 out of 5 stars

How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
Ebook by Alex Parkinson. Rating: 4 out of 5 stars

Web Designer's Idea Book, Volume 4: Inspiration from the Best Web Design Trends, Themes and Styles
Ebook by Patrick McNeil. Rating: 4 out of 5 stars

Procreate for Beginners: Introduction to Procreate for Drawing and Illustrating on the iPad
Ebook by Aaron Smith. 0 ratings

People Skills for Analytical Thinkers
Ebook by Gilbert Eijkelenboom. Rating: 5 out of 5 stars

Summary of Dotcom Secrets: by Russell Brunson - The Underground Playbook for Growing Your Company Online with Sales Funnels - A Comprehensive Summary
Ebook by Alexander Cooper. Rating: 5 out of 5 stars

Practical Lock Picking: A Physical Penetration Tester's Training Guide
Ebook by Deviant Ollam. Rating: 5 out of 5 stars

Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
Ebook by Dexter Jackson. Rating: 4 out of 5 stars

Related podcast episodes

[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
Podcast episode by DataFramed. 0 ratings

Simplifying Data Integration Through Eventual Connectivity - Episode 91: An interview about a new pattern for data integration that reduces the amount of effort required to find connections in numerous data sets
Podcast episode by Data Engineering Podcast. 0 ratings

2155: Databricks - The Story Behind the Lakehouse Company: Many are citing open source as the future. The UK Government's National Data Strategy even talks about the importance of opening public sector datasets to form the backbone of innovation, efficiency, and growth. This is a trend that Databricks...
Podcast episode by The Tech Talks Daily Podcast. 0 ratings

Unlocking The Power of Data Lineage In Your Platform with OpenLineage: An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.
Podcast episode by Data Engineering Podcast. 0 ratings

SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110: An interview about how SnowflakeDB was built to provide a performant and flexible data platform for the cloud era
Podcast episode by Data Engineering Podcast. 0 ratings

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42: A Whirlwind Tour Of The PostgreSQL Database (Interview)
Podcast episode by Data Engineering Podcast. 100% found this useful

Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Podcast episode by Data Engineering Podcast. 0 ratings

#464: Diving deep into Amazon MWAA: Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Air
Podcast episode by AWS Podcast. 0 ratings

Easily Build Advanced Similarity Search With The Pinecone Vector Database: An interview with Edo Liberty about the Pinecone vector database and how it makes it easy to build a similarity search service.
Podcast episode by Data Engineering Podcast. 0 ratings

Becoming a Data-led Professional - Arpit Choudhury
Podcast episode by DataTalks.Club. 0 ratings

Preparing for a Data Science Interview - Luke Whipps
Podcast episode by DataTalks.Club. 0 ratings

Humans in the Loop - Lina Weichbrodt
Podcast episode by DataTalks.Club. 0 ratings

106: Visual Testing : How IDEs can make software testing easier - Paul Everitt: IDEs can help people with automated testing. In this episode, Paul Everitt and Brian discuss ways IDEs can encourage testing and make it easier for everyone, including beginners. We discuss features that exist and are great, as well as what is missing. The conversation also includes topics around being welcoming to new contributors for both open source and professional projects.
Podcast episode by Test and Code. 0 ratings

Storytime for DataOps - Christopher Bergh
Podcast episode by DataTalks.Club. 0 ratings

129: How to Test Anything - David Lord: I asked people on twitter to fill in "How do I test _____?" to find out what people want to know how to test. Lots of responses. David Lord agreed to answer them with me. In the process, we come up with lots of great general advice on how to test just about anything.
Podcast episode by Test and Code. 0 ratings

Defining Success: Metrics and KPIs - Adam Sroka
Podcast episode by DataTalks.Club. 0 ratings

Analytics for a Better World - Parvathy Krishnan
Podcast episode by DataTalks.Club. 0 ratings

Value-Stream Mapping
Podcast episode by The Cloudcast. 0 ratings

Dataprep with Eric Anderson: Eric Anderson joins the podcast to talk about how Dataprep is simplifying data wrangling!
Podcast episode by Google Cloud Platform Podcast. 0 ratings

How to Get Better Traffic, Free with Semantic Data: Everyone wants better ranking but that’s not really what they want. Stores want more traffic. They want that because it leads to sales. Sales is what really matters. All of SEO is just a marketing channel to get more visitors who turn into customers. The main FIX for SEO is better rankings What if there was a way to get more sales with the existing rankings you already have? How? By convincing more searchers to click on your search listings. More clicks == more traffic == more sales. What we’re talking about today is a search enhancement in Google called Rich Snippets, specifically Product Rich Snippets. They enhance your existing search results with more data and pixels. Eric Davis joins us today to walk us through it in easy to understand terms. Eric is founder of Little Stream Software, which helps Shopify entrepreneurs customize their Shopify stores using public and private Shopify Apps.
Podcast episode by The Unofficial Shopify Podcast: Entrepreneur Tales. 0 ratings

Hasty Treat - Refactoring: In this Hasty Treat, Scott and Wes discuss refactoring, what it is, why you should do it, when to do it, as well as best practices and much more. Netlify - Sponsor is the best way to deploy and host a front-end website. All the features...
Podcast episode by Syntax - Tasty Web Development Treats. 0 ratings

Composable Data Analytics
Podcast episode by The Cloudcast. 0 ratings

Hasty Treat - Hiring an Assistant: In this Hasty Treat, Scott and Wes talk about how to hire an assistant, something that can make your life a lot easier as a solo developer. LogRocket - Sponsor LogRocket lets you replay what users do on your site, helping you reproduce bugs and...
Podcast episode by Syntax - Tasty Web Development Treats. 0 ratings

[Bite] Documenting Data Science Projects
Podcast episode by DataCafé. 0 ratings

Platform Engineering at a FAANG Company
Podcast episode by The Cloudcast. 0 ratings

Beating GPT-4 with Open Source LLMs - with Michael Royzen of Phind
Podcast episode by Latent Space: The AI Engineer Podcast - Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0. 0 ratings

Spanner Myths Busted with Pritam Shah and Vaibhav Govil: This week, we’re busting myths around Google Cloud Spanner with our guests Pritam Shah and Vaibhav Govil. and host this episode and learn about the fantastic capabilities of Cloud Spanner. Our guests give us a quick run-down of Spanner database...
Podcast episode by Google Cloud Platform Podcast. 0 ratings

Google AI with Jeff Dean: Mark and Melanie are joined by Jeff Dean today to discuss AI at Google.
Podcast episode by Google Cloud Platform Podcast. 0 ratings

BONUS: Unleashing Agile Experimentation, Accelerating Learning Cycles With 24h Experiments | Vasco Duarte: BONUS: Unleashing Agile Experimentation, Accelerating Learning Cycles With 24h Experiments, With Vasco Duarte Read the and search through the world’s largest audio library on Scrum directly on the . Merry Christmas, everyone! In today's...
Podcast episode by Scrum Master Toolbox Podcast: Agile storytelling from the trenches. 0 ratings

062: February 2018 MPT - Going Through the File and Library: Understanding how the information in the file and library relates to the problem we have to resolve
Podcast episode by The Bar Exam Toolbox Podcast: Pass the Bar Exam with Less Stress. 0 ratings

Understanding ELT & ETL
Article in Techfastly, Apr 1, 2021. 8 min read

The Future Of The Database
Article in Linux Format, Aug 27, 2019. 7 min read

Build A Search And Analytic Engine
Article in Linux Format, Mar 10, 2020. 7 min read

How Image Recognition Works
Article in APC, Nov 4, 2019. 4 min read

Google Answer Box Strategy
Article in Techfastly, Sep 21, 2020. 7 min read
Leveraging the Google PAA (People Also Ask) element on a Search Results Page for Targeted Content Creation with a Python Scraper All businesses that are online today are creating content at a furious pace. According to Technavio, a research firm, con

14 Organization Projects You Can DO IN A DAY
Article in Family Tree, Dec 20, 2022. 1 min read
Standardize file names (either digital or physical) Standardize place names, dates, or personal names in your online tree/software program Color-code folders (either digital or physical) Fill out new, “clean” research forms to outline the scope of yo

Finish Your Cataloguing App
Article in Linux Format, Jan 10, 2023. 7 min read
Matt Holder has been a fan of the open source methodology for over two decades and uses Linux and other tools where possible. In his spare time, Matt enjoys listening to music and reading. More featurepacked source code for this project can be downlo

HoudahSpot 5
Article in MacLife, Jun 25, 2019. 2 min read

What Should I Download?
Article in Computeractive, Sep 27, 2023. 2 min read
Q Is there an alternative to File Explorer that lets you and a complex structure is very difficult to cope with. Colin Gray A Files App, which you can download for free from the Microsoft Store (www.snipca.com/47540), is an attractive alternative to

The 10 Must-Have Utilities for macOS Sierra
Article in MacWorld, Jan 24, 2017. 12 min read

Find It Faster!
Article in APC, May 22, 2023. 4 min read

GENEALOGY GADGETS & APPS FOR ALL OCCASIONS!
Article in Family Tree UK, Dec 9, 2022. 4 min read

Perfect Backup: Perfect? No, But Darn Close
Article in PCWorld, Jan 11, 2023. 3 min read

Organize Files Automatically
Article in MacLife, Jul 24, 2018. 2 min read

DJANGO Create A Database-driven Website
Article in Linux Format, Jun 4, 2019. 7 min read
The Django web framework was named after the famous guitarist Django Reinhardt and was first created by web developers at a small newspaper in Kansas. The main goals of Django is to enable fast development of complex websites with database needs. It

“There’s No Single ‘Best’ Language To Learn. I Think The Real Key Is To Learn How To Write Code”
Article in PC Pro Magazine, Oct 8, 2022. 9 min read

Drill Down Deeper
Article in MacLife, Aug 16, 2022. 2 min read

Scikit-Learn: The Ultimate Python Library
Article in APC, Jul 15, 2019. 4 min read

Manage And Read Your System Logs
Article in Linux Format, Mar 10, 2020. 10 min read

Help Yourself To Avoid These Pitfalls
Article in MacLife, Dec 11, 2018. 2 min read
GETTING UP TO full speed with the Shortcuts app takes time, and you’ll inevitably make a few mistakes along the way. Having to troubleshoot your efforts doesn’t mean you’ve failed; with years of experience, even professional programmers do this. Tak

How We Tested…
Article in Linux Format, Aug 24, 2021
To explore the capabilities of these terminal-based browsers, we used them to access popular websites and tried to download data where applicable. We also attempted tasks that you would normally carry out on those sites. Once we had accessed a site,
1 min read
Search Desktop File Contents Instantly
Linux Format
Article
Search Desktop File Contents Instantly
May 30, 2023
9 min read
Family History Software: An Introduction
Family Tree UK
Article
Family History Software: An Introduction
Feb 11, 2020
5 min read
Answers
Linux Format
Article
Answers
Mar 5, 2024
10 min read
Using and Transferring Metadata in Digital Photos
Family Tree
Article
Using and Transferring Metadata in Digital Photos
Apr 21, 2020
2 min read
Usability
Linux Format
Article
Usability
Oct 19, 2021
3 min read
Manipulate Data Like A Pro With Pandas
Linux Format
Article
Manipulate Data Like A Pro With Pandas
Jul 27, 2021
7 min read
No-nonsense Backups
Linux Format
Article
No-nonsense Backups
Aug 25, 2020
2 min read
Stop Using Windows Tools
Computeractive
Article
Stop Using Windows Tools
Apr 26, 2023
16 min read
Accurate, Open Source IP-based Localisation
Linux Format
Article
Accurate, Open Source IP-based Localisation
Dec 14, 2021
8 min read

Related categories

Skip carousel

Book preview

Pentaho Data Integration Beginner's Guide - Maria Carina Roldan

Pentaho Data Integration Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

How to read this book

What this book covers

What you need for this book

Who this book is for

Conventions

Time for action – heading

What just happened?

Pop quiz – heading

Have a go hero – heading

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Exploring the Pentaho Demo

Pentaho Data Integration

Using PDI in real-world scenarios

Loading data warehouses or datamarts

Integrating data

Data cleansing

Migrating information

Exporting data

Integrating PDI along with other Pentaho tools

Pop quiz – PDI data sources

Installing PDI

Time for action – installing PDI

What just happened?

Pop quiz – PDI prerequisites

Launching the PDI graphical designer – Spoon

Time for action – starting and customizing Spoon

What just happened?

Spoon

Setting preferences in the Options window

Storing transformations and jobs in a repository

Creating your first transformation

Time for action – creating a hello world transformation

What just happened?

Directing Kettle engine with transformations

Exploring the Spoon interface

Designing a transformation

Running and previewing the transformation

Pop quiz – PDI basics

Installing MySQL

Time for action – installing MySQL on Windows

What just happened?

Time for action – installing MySQL on Ubuntu

What just happened?

Have a go hero – installing a visual software for administering and querying MySQL

Summary

2. Getting Started with Transformations

Designing and previewing transformations

Time for action – creating a simple transformation and getting familiar with the design process

What just happened?

Getting familiar with editing features

Using the mouseover assistance toolbar

Working with grids

Understanding the Kettle rowset

Looking at the results in the Execution Results pane

The Logging tab

The Step Metrics tab

Have a go hero – calculating the achieved percentage of work

Have a go hero – calculating the achieved percentage of work (second version)

Running transformations in an interactive fashion

Time for action – generating a range of dates and inspecting the data as it is being created

What just happened?

Adding or modifying fields by using different PDI steps

The Select values step

Getting fields

Date fields

Pop quiz – generating data with PDI

Have a go hero – experiencing different PDI steps

Have a go hero – generating a rowset with dates

Handling errors

Time for action – avoiding errors while converting the estimated time from string to integer

What just happened?

The error handling functionality

Time for action – configuring the error handling to see the description of the errors

What just happened?

Personalizing the error handling

Have a go hero – trying out different ways of handling errors

Summary

3. Manipulating Real-world Data

Reading data from files

Time for action – reading results of football matches from files

What just happened?

Input files

Input steps

Reading several files at once

Time for action – reading all your files at a time using a single text file input step

What just happened?

Time for action – reading all your files at a time using a single text file input step and regular expressions

What just happened?

Regular expressions

Troubleshooting reading files

Have a go hero – exploring your own files

Pop quiz – providing a list of text files using regular expressions

Have a go hero – measuring the performance of input steps

Sending data to files

Time for action – sending the results of matches to a plain file

What just happened?

Output files

Output steps

Have a go hero – extending your transformations by writing output files

Have a go hero – generate your custom matches.txt file

Getting system information

Time for action – reading and writing matches files with flexibility

What just happened?

The Get System Info step

Running transformations from a terminal window

Time for action – running the matches transformation from a terminal window

What just happened?

Have a go hero – finding out system information

XML files

Time for action – getting data from an XML file with information about countries

What just happened?

What is XML?

PDI transformation files

Getting data from XML files

XPath

Configuring the Get data from the XML step

Kettle variables

How and when you can use variables

Have a go hero – exploring XML files

Summary

4. Filtering, Searching, and Performing Other Useful Operations with Data

Sorting data

Time for action – sorting information about matches with the Sort rows step

What just happened?

Have a go hero – listing the last match played by each team

Calculations on groups of rows

Time for action – calculating football match statistics by grouping data

What just happened?

Group by Step

Numeric fields

Have a go hero – formatting 99.55

Pop quiz – formatting output fields

Have a go hero – listing the languages spoken by a country

Filtering

Time for action – counting frequent words by filtering

What just happened?

Time for action – refining the counting task by filtering even more

What just happened?

Filtering rows using the Filter rows step

Have a go hero – playing with filters

Looking up data

Time for action – finding out which language people speak

What just happened?

The Stream lookup step

Have a go hero – selecting the most popular of the official languages

Have a go hero – counting words more precisely

Data cleaning

Time for action – fixing words before counting them

What just happened?

Cleansing data with PDI

Have a go hero – counting words by cleaning them first

Summary

5. Controlling the Flow of Data

Splitting streams

Time for action – browsing new features of PDI by copying a dataset

What just happened?

Copying rows

Have a go hero – recalculating statistics

Distributing rows

Time for action – assigning tasks by distributing

What just happened?

Pop quiz – understanding the difference between copying and distributing

Splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Filter rows step

What just happened?

PDI steps for splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Switch/Case step

What just happened?

Have a go hero – listing languages and countries

Pop quiz – deciding between a Number range step and a Switch/Case step

Merging streams

Time for action – gathering progress and merging it all together

What just happened?

PDI options for merging streams

Time for action – giving priority to Bouchard by using the Append Stream

What just happened?

Have a go hero – sorting and merging all tasks

Treating invalid data by splitting and merging streams

Time for action – treating errors in the estimated time to avoid discarding rows

What just happened?

Treating rows with invalid data

Have a go hero – trying to find missing countries

Summary

6. Transforming Your Data by Coding

Doing simple tasks with the JavaScript step

Time for action – counting frequent words by coding in JavaScript

What just happened?

Using the JavaScript language in PDI

Inserting JavaScript code using the Modified JavaScript Value Step

Adding fields

Modifying fields

Using transformation predefined constants

Testing the script using the Test script button

Reading and parsing unstructured files with JavaScript

Time for action – changing a list of house descriptions with JavaScript

What just happened?

Looping over the dataset rows

Have a go hero – enhancing the houses file

Doing simple tasks with the Java Class step

Time for action – counting frequent words by coding in Java

What just happened?

Using the Java language in PDI

Inserting Java code using the User Defined Java Class step

Adding fields

Modifying fields

Sending rows to the next step

Data types equivalence

Testing the Java Class using the Test class button

Have a go hero – parameterizing the Java Class

Transforming the dataset with Java

Time for action – splitting the field to rows using Java

What just happened?

Avoiding coding by using purpose built steps

Pop quiz – choosing a scripting language for coding inside a transformation

Summary

7. Transforming the Rowset

Converting rows to columns

Time for action – enhancing the films file by converting rows to columns

What just happened?

Converting row data to column data by using the Row Denormaliser step

Have a go hero – houses revisited

Aggregating data with a Row Denormaliser step

Time for action – aggregating football matches data with the Row Denormaliser step

What just happened?

Using Row Denormaliser for aggregating data

Have a go hero – calculating statistics by team

Normalizing data

Time for action – enhancing the matches file by normalizing the dataset

What just happened?

Modifying the dataset with a Row Normaliser step

Summarizing the PDI steps that operate on sets of rows

Have a go hero – verifying the benefits of normalizing

Have a go hero – normalizing the Films file

Generating a custom time dimension dataset by using Kettle variables

Time for action – creating the time dimension dataset

What just happened?

Getting variables

Time for action – parameterizing the start and end date of the time dimension dataset

What just happened?

Using the Get Variables step

Have a go hero – enhancing the time dimension

Summary

8. Working with Databases

Introducing the Steel Wheels sample database

Connecting to the Steel Wheels database

Time for action – creating a connection to the Steel Wheels database

What just happened?

Connecting with Relational Database Management Systems

Pop quiz – connecting to a database in several transformations

Have a go hero – connecting to your own databases

Exploring the Steel Wheels database

Time for action – exploring the sample database

What just happened?

A brief word about SQL

Exploring any configured database with the database explorer

Have a go hero – exploring the sample data in depth

Have a go hero – exploring your own databases

Querying a database

Time for action – getting data about shipped orders

What just happened?

Getting data from the database with the Table input step

Using the SELECT statement for generating a new dataset

Making flexible queries using parameters

Time for action – getting orders in a range of dates using parameters

What just happened?

Adding parameters to your queries

Making flexible queries by using Kettle variables

Time for action – getting orders in a range of dates by using Kettle variables

What just happened?

Using Kettle variables in your queries

Pop quiz – interpreting data types coming from a database

Have a go hero – querying the sample data

Sending data to a database

Time for action – loading a table with a list of manufacturers

What just happened?

Inserting new data into a database table with the Table output step

Inserting or updating data by using other PDI steps

Time for action – inserting new products or updating existing ones

What just happened?

Time for action – testing the update of existing products

What just happened?

Inserting or updating with the Insert/Update step

Have a go hero – populating a films database

Have a go hero – populating the products table

Pop quiz – replacing an Insert/Update step with a Table Output step followed by an Update step

Eliminating data from a database

Time for action – deleting data about discontinued items

What just happened?

Deleting records of a database table with the Delete step

Have a go hero – deleting old orders

Have a go hero – creating the time dimension

Summary

9. Performing Advanced Operations with Databases

Preparing the environment

Time for action – populating the Jigsaw database

What just happened?

Exploring the Jigsaw database model

Looking up data in a database

Doing simple lookups

Time for action – using a Database lookup step to create a list of products to buy

What just happened?

Looking up values in a database with the Database lookup step

Have a go hero – preparing the delivery of the products

Have a go hero – refining the transformation

Performing complex lookups

Time for action – using a Database join step to create a list of suggested products to buy

What just happened?

Joining data from the database to the stream data by using a Database join step

Have a go hero – rebuilding the list of customers

Introducing dimensional modeling

Loading dimensions with data

Time for action – loading a region dimension with a Combination lookup/update step

What just happened?

Time for action – testing the transformation that loads the region dimension

What just happened?

Describing data with dimensions

Loading Type I SCD with a Combination lookup/update step

Have a go hero – adding regions to the Region dimension

Have a go hero – loading the manufacturers dimension

Storing a history of changes

Time for action – keeping a history of changes in products by using the Dimension lookup/update step

What just happened?

Time for action – testing the transformation that keeps history of product changes

What just happened?

Keeping an entire history of data with a Type II slowly changing dimension

Loading Type II SCDs with the Dimension lookup/update step

Have a go hero – storing a history just for the theme of a product

Have a go hero – loading the Regions dimension as a Type II SCD

Pop quiz – implementing a Type III SCD in PDI

Have a go hero – loading a mini dimension

Summary

10. Creating Basic Task Flows

Introducing PDI jobs

Time for action – creating a folder with a Kettle job

What just happened?

Executing processes with PDI jobs

Using Spoon to design and run jobs

Pop quiz – defining PDI jobs

Designing and running jobs

Time for action – creating a simple job and getting familiar with the design process

What just happened?

Changing the flow of execution on the basis of conditions

Looking at the results in the Execution results window

The Logging tab

The Job metrics tab

Running transformations from jobs

Time for action – generating a range of dates and inspecting how things are running

What just happened?

Using the Transformation job entry

Have a go hero – loading the dimension tables

Receiving arguments and parameters in a job

Time for action – generating a hello world file by using arguments and parameters

What just happened?

Using named parameters in jobs

Have a go hero – backing up your work

Running jobs from a terminal window

Time for action – executing the hello world job from a terminal window

What just happened?

Have a go hero – experiencing Kitchen

Using named parameters and command-line arguments in transformations

Time for action – calling the hello world transformation with fixed arguments and parameters

What just happened?

Have a go hero – saying hello again and again

Have a go hero – loading the time dimension from a job

Deciding between the use of a command-line argument and a named parameter

Have a go hero – analyzing the use of arguments and named parameters

Summary

11. Creating Advanced Transformations and Jobs

Re-using part of your transformations

Time for action – calculating statistics with the use of a subtransformation

What just happened?

Creating and using subtransformations

Have a go hero – calculating statistics for all subjects

Have a go hero – counting words more precisely (second version)

Creating a job as a process flow

Time for action – generating top average scores by copying and getting rows

What just happened?

Transferring data between transformations by using the copy/get rows mechanism

Have a go hero – modifying the flow

Iterating jobs and transformations

Time for action – generating custom files by executing a transformation for every input row

What just happened?

Executing for each row

Have a go hero – building lists of products to buy

Enhancing your processes with the use of variables

Time for action – generating custom messages by setting a variable with the name of the examination file

What just happened?

Setting variables inside a transformation

Running a job inside another job with a Job job entry

Understanding the scope of variables

Have a go hero – processing several files at once

Have a go hero – enhancing the jigsaw database update process

Have a go hero – executing the proper jigsaw database update process

Pop quiz – deciding the scope of variables

Summary

12. Developing and Implementing a Simple Datamart

Exploring the sales datamart

Deciding the level of granularity

Loading the dimensions

Time for action – loading the dimensions for the sales datamart

What just happened?

Extending the sales datamart model

Have a go hero – loading the dimensions for the puzzle star model

Loading a fact table with aggregated data

Time for action – loading the sales fact table by looking up dimensions

What just happened?

Getting the information from the source with SQL queries

Translating the business keys into surrogate keys

Obtaining the surrogate key for Type I SCD

Obtaining the surrogate key for Type II SCD

Obtaining the surrogate key for the Junk dimension

Obtaining the surrogate key for the Time dimension

Pop quiz – creating a product type dimension

Have a go hero – loading a puzzles fact table

Getting facts and dimensions together

Time for action – loading the fact table using a range of dates obtained from the command line

What just happened?

Time for action – loading the SALES star

What just happened?

Have a go hero – enhancing the loading process of the sales fact table

Have a go hero – loading the puzzle sales star

Have a go hero – loading the facts once a month

Automating the administrative tasks

Time for action – automating the loading of the sales datamart

What just happened?

Have a go hero – creating a backup of your work automatically

Have a go hero – enhancing the automation process by sending an email if an error occurs

Summary

A. Working with Repositories

Creating a database repository

Time for action – creating a PDI repository

What just happened?

Creating a database repository to store your transformations and jobs

Working with the repository storage system

Time for action – logging into a database repository

What just happened?

Logging into a database repository using credentials

Creating transformations and jobs in repository folders

Creating database connections, users, servers, partitions, and clusters

Designing jobs and transformations

Backing up and restoring a repository

Examining and modifying the contents of a repository with the Repository Explorer

Migrating from file-based system to repository-based system and vice versa

Summary

B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line

Running transformations and jobs stored in files

Running transformations and jobs from a repository

Specifying command-line options

Kettle variables and the Kettle home directory

Checking the exit code

Providing options when running Pan and Kitchen

Summary

C. Quick Reference – Steps and Job Entries

Transformation steps

Job entries

Summary

D. Spoon Shortcuts

General shortcuts

Designing transformations and jobs

Grids

Repositories

Database wizards

Summary

E. Introducing PDI 5 Features

Welcome page

Usability

Solutions to commonly occurring situations

Backend

Summary

F. Best Practices

Summary

G. Pop Quiz Answers

Chapter 1, Getting Started with Pentaho Data Integration

Pop quiz – PDI data sources

Pop quiz – PDI prerequisites

Pop quiz – PDI basics

Chapter 2, Getting Started with Transformations

Pop quiz – generating data with PDI

Chapter 3, Manipulating Real-world Data

Pop quiz – providing a list of text files using regular expressions

Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data

Pop quiz – formatting output fields

Chapter 5, Controlling the Flow of Data

Pop quiz – deciding between a Number range step and a Switch/Case step

Pop quiz – understanding the difference between copying and distributing

Chapter 6, Transforming Your Data by Coding

Pop quiz – choosing a scripting language for coding inside a transformation

Chapter 8, Working with Databases

Pop quiz – connecting to a database in several transformations

Pop quiz – interpreting data types coming from a database

Chapter 9, Performing Advanced Operations with Databases

Pop quiz – implementing a Type III SCD in PDI

Chapter 10, Creating Basic Task Flows

Pop quiz – defining PDI jobs

Chapter 11, Creating Advanced Transformations and Jobs

Pop quiz – deciding the scope of variables

Chapter 12, Developing and Implementing a Simple Datamart

Pop quiz – creating a product type dimension

Index

Pentaho Data Integration Beginner's Guide

Second Edition

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2010

Second Edition: October 2013

Production Reference: 1171013

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78216-504-0

www.packtpub.com

Cover Image by Suresh Mogre (<suresh.mogre.99@gmail.com>)

Credits

Author

María Carina Roldán

Reviewers

Tomoyuki Hayashi

Gretchen Moran

Acquisition Editors

Usha Iyer

Greg Wild

Lead Technical Editor

Azharuddin Sheikh

Technical Editors

Sharvari H. Baet

Aparna K

Kanhucharan Panda

Vivek Pillai

Project Coordinator

Navu Dhillon

Proofreaders

Simran Bhogal

Ameesha Green

Indexer

Mariammal Chettiyar

Graphics

Ronak Dhruv

Yuvraj Mannari

Production Coordinator

Conidon Miranda

Cover Work

Conidon Miranda

About the Author

María Carina Roldán was born in Esquel, Argentina, and earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP). She then moved to Buenos Aires, where she has lived since 1994.

She has worked as a BI consultant for almost fifteen years. She started working with Pentaho technology back in 2006. Over the last three and a half years, she has been devoted to working full time for Webdetails—a company acquired by Pentaho in 2013—as an ETL specialist.

Carina is the author of Pentaho 3.2 Data Integration: Beginner's Guide, Packt Publishing, April 2010, and the co-author of Pentaho Data Integration 4 Cookbook, Packt Publishing, June 2011.

I'd like to thank those who have encouraged me to write this book: firstly, the Pentaho community. They have given me such rewarding feedback after my other two books on PDI; it is because of them that I feel compelled to pass my knowledge on to those willing to learn. I also want to thank my friends! Especially Flavia, Jaqui, and Marce for their encouraging words throughout the writing process; Silvina for clearing up my questions about English; Gonçalo for helping with the use of PDI on Mac systems; and Hernán for helping with ideas and examples for this new edition.

I would also like to thank the technical reviewers—Gretchen, Tomoyuki, Nelson, and Paula—for the time and dedication that they have put in to reviewing the book.

About the Reviewers

Tomoyuki Hayashi is a system engineer who mainly works at the intersection of open source and enterprise software. He has developed a CMIS-compliant and CouchDB-based ECM software named NemakiWare (http://nemakiware.com/).

He is currently working with Aegif, Japan, which provides advisory services for content-oriented applications, collaboration improvement, and ECM in general. It is one of the most experienced companies in Japan that supports the introduction of foreign-made software to the Japanese market.

Gretchen Moran works as an independent Pentaho consultant on a variety of business intelligence and big data projects. She has 15 years of experience in the business intelligence realm, developing software and providing services for a number of companies including Hyperion Solutions and the Pentaho Corporation.

Gretchen continues to contribute to Pentaho Corporation's latest and greatest software initiatives while managing the daily adventures of her two children, Isabella and Jack, with her husband, Doug.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt

Copy and paste, print and bookmark content

On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Pentaho Data Integration (also known as Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but is also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but will have also built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't have to. The book allows you to tailor the PDI learning process according to your particular needs.

The first five chapters along with Chapter 10, Creating Basic Task Flows, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters will suffice. If you need to work with databases, you could include Chapter 8, Working with Databases, in the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading Chapter 6, Transforming Your Data by Coding, Chapter 7, Transforming the Rowset, and Chapter 11, Creating Advanced Transformations and Jobs.

If you already know PDI and want to learn how to use it to load or maintain a data warehouse or datamart, you will find all that you need in Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Finally, all the appendices are valuable resources for anyone reading this book.

What this book covers

Chapter 1, Getting Started with Pentaho Data Integration, serves as the most basic introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a transformation.

Chapter 3, Manipulating Real-world Data, explains how to apply the concepts learned in the previous chapter to real-world data that comes from different sources. It also explains how to save the results to different destinations: plain files, Excel files, and more. As real data is very prone to errors, this chapter also explains the basics of handling errors and validating data.

Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, expands the set of operations learned in previous chapters by teaching the reader a great variety of essential features such as filtering, sorting, or looking for data.

Chapter 5, Controlling the Flow of Data, explains different options that PDI offers to combine or split flows of data.

Chapter 6, Transforming Your Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.

Chapter 7, Transforming the Rowset, explains the ability of PDI to deal with some sophisticated problems—for example, normalizing data from pivoted tables—in a simple fashion.

Chapter 8, Working with Databases, explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases, explains how to perform advanced operations with databases, including those especially designed to load data warehouses. A primer on data warehouse concepts is also given in case you are not familiar with the subject.

Chapter 10, Creating Basic Task Flows, serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs, deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.

Chapter 12, Developing and Implementing a Simple Datamart, presents a simple datamart project, and guides you to build the datamart by using all the concepts learned throughout the book.

Appendix A, Working with Repositories, is a step-by-step guide to the creation of a PDI database repository and then gives instructions on how to work with it.

Appendix B, Pan and Kitchen – Launching Transformations and Jobs from the Command Line, is a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference – Steps and Job Entries, serves as a quick reference to steps and job entries used throughout the book.

Appendix D, Spoon Shortcuts, is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 5 Features, quickly introduces you to the architectural and functional features included in Kettle 5—the version that was under development when this book was written.

Appendix F, Best Practices, gives a list of PDI best practices and recommendations.

Appendix G, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means that no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.6 installed. It is also useful to have Excel or Calculator, along with a nice text editor.
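
As a quick way to confirm the Java prerequisite before installing PDI, the following is a minimal sketch of a version check. The helper names extract_java_version and java_version_ok are hypothetical and chosen only for this illustration; the sketch assumes that java -version prints a line containing version "x.y.z", which is the usual format but may vary between JVM vendors. Older JVMs report versions such as 1.6.0_45, while newer ones report versions such as 17.0.2, so both styles are handled.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of PDI.

# Reads `java -version` output on stdin and prints the first quoted version string.
extract_java_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeeds when the given version string is 1.6 or newer.
# Strings that cannot be parsed as numbers are conservatively rejected.
java_version_ok() {
  ver="$1"
  major="${ver%%.*}"   # text before the first dot
  rest="${ver#*.}"     # text after the first dot
  minor="${rest%%.*}"  # second numeric component
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  case "$minor" in ''|*[!0-9]*) return 1 ;; esac
  if [ "$major" -eq 1 ]; then
    [ "$minor" -ge 6 ]   # old scheme: 1.6, 1.7, 1.8
  else
    [ "$major" -ge 2 ]   # new scheme: 9, 11, 17, ...
  fi
}

# Run the check only when a java executable is available on the PATH.
if command -v java >/dev/null 2>&1; then
  detected=$(java -version 2>&1 | extract_java_version)
  if java_version_ok "$detected"; then
    echo "Java $detected found: meets the 1.6 minimum."
  else
    echo "Java version '$detected' is older than 1.6 or could not be parsed."
  fi
else
  echo "No java executable found on the PATH."
fi
```

The check writes to the terminal only and changes nothing on the system, so it can be run safely on any of the platforms that PDI supports, provided a POSIX shell is available.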

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Additionally, there is the PDI forum, where you can search for answers or post questions if you get stuck.

Who this book is for

This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful.

This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.

Conventions

In this book, you will find several headings that appear frequently.

To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading

Action 1

Action 2

Action 3

Instructions often need some extra explanation so that they make sense, so they are followed with:

What just happened?

This heading explains the working of tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop quiz – heading

These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command.

A block of code is set as follows:

# * Fine Tuning
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# * Fine Tuning
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

Any command-line input or output is written as follows:

cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: On the Select Destination Location screen, click on Next to accept the default destination.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we

Table of Contents

Pentaho Data Integration Beginner's Guide

Second Edition

Credits

About the Author

About the Reviewers

Support files, eBooks, discount offers and more

Why Subscribe?

Preface

How to read this book

What this book covers

What you need for this book

Who this book is for

Conventions

What just happened?

Have a go hero – heading

Note

Tip

Reader feedback

Customer support

Downloading the example code

Errata