
Module-10

APACHE OOZIE And Hadoop Project

www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1: Understanding Big Data and Hadoop
 Module 2: Hadoop Architecture and HDFS
 Module 3: Hadoop MapReduce Framework
 Module 4: Advanced MapReduce
 Module 5: PIG
 Module 6: HIVE
 Module 7: Advanced HIVE and HBase
 Module 8: Advanced HBase
 Module 9: Processing Distributed Data with Apache Spark
 Module 10: Oozie and Hadoop Project

Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:

 Implement Flume and Sqoop

 Understand Oozie

 Schedule jobs in Oozie

 Implement an Oozie Workflow

 Implement an Oozie Coordinator

 Understand the project discussion and implement the project

Slide 3 www.edureka.co/big-data-and-hadoop
Flume and Sqoop

Demo on Flume and Sqoop

 For detailed steps of Flume installation on the Edureka VM: Click Here

 For detailed steps of importing data from MySQL to HDFS using Sqoop on the Edureka VM: Click Here

Flume and Sqoop are already installed in the Edureka VM. A sketch of a Sqoop import appears below.
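To give a flavour of the Sqoop demo, here is a minimal import sketch; the database name (testdb), table (employees), and credentials are illustrative assumptions, not the demo's actual values:

    sqoop import \
      --connect jdbc:mysql://localhost/testdb \
      --username root --password hadoop \
      --table employees \
      --target-dir /user/edureka/employees \
      -m 1

This pulls the employees table from MySQL into HDFS under /user/edureka/employees using a single map task (-m 1).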

Slide 4 www.edureka.co/big-data-and-hadoop
Oozie
 Oozie is a workflow/coordination system that you can use to manage Apache Hadoop jobs.

 The Oozie server is a web application that runs in a Java servlet container (the standard Oozie distribution uses Tomcat).

 This server supports reading and executing Workflow, Coordinator, and Bundle definitions.

Slide 5 www.edureka.co/big-data-and-hadoop
Oozie
Oozie Functional Components

 Oozie Workflow: Provides support for defining and executing a controlled sequence of MapReduce, Hive, and Pig jobs.

 Oozie Coordinator: Provides support for the automatic execution of workflows based on time and data availability.

 Oozie Bundles: Facilitates packaging multiple coordinator and workflow jobs, and makes it easier to manage the life cycle of those jobs.

Slide 6 www.edureka.co/big-data-and-hadoop
Oozie Overview
 Main Features

» Execute and monitor workflows in Hadoop
» Periodic scheduling of workflows
» Trigger execution by data availability
» HTTP and command-line interface + web console

 Adoption

» ~100 users on the mailing list since launch on GitHub
» In production at Yahoo!, running >200K jobs/day

Slide 7 www.edureka.co/big-data-and-hadoop
Apache Oozie

Oozie – Workflow

Slide 8 www.edureka.co/big-data-and-hadoop
Oozie Workflow
A workflow job can be in any of the following states:

PREP: When a workflow job is first created it is in PREP state. The workflow job is defined but not running.

RUNNING: When a PREP workflow job is started, it goes into RUNNING state. It remains in RUNNING state until it reaches its end state, ends in error, or is suspended.

SUSPENDED: A RUNNING workflow job can be suspended; it remains in SUSPENDED state until it is resumed or killed.
[State diagram: PREP → RUNNING → SUCCEEDED / KILLED / FAILED, with RUNNING ⇄ SUSPENDED]

Slide 9 www.edureka.co/big-data-and-hadoop
Oozie Workflow ( Contd.)
A workflow job can be in any of the following states:

SUCCEEDED: When a RUNNING workflow job reaches the end node, it ends in the SUCCEEDED final state.

KILLED: When a PREP, RUNNING or SUSPENDED workflow job is killed by an administrator or the owner via a request to Oozie, the workflow job ends in the KILLED final state.

FAILED: When a RUNNING workflow job fails due to an unexpected error, it ends in the FAILED final state.
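These transitions are driven through the Oozie command line; a hedged sketch (the job ID is illustrative, following the format used later in this module):

    $ oozie job -oozie http://localhost:11000/oozie -suspend 1-20090525161321-oozie-xyz-W
    $ oozie job -oozie http://localhost:11000/oozie -resume 1-20090525161321-oozie-xyz-W
    $ oozie job -oozie http://localhost:11000/oozie -kill 1-20090525161321-oozie-xyz-W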

[State diagram: PREP → RUNNING → SUCCEEDED / KILLED / FAILED, with RUNNING ⇄ SUSPENDED]

Slide 10 www.edureka.co/big-data-and-hadoop
Scheduling with Oozie

[Diagram: an Oozie Coordinator job launches MapReduce jobs at regular intervals, reading from and writing to HDFS]
Slide 11 www.edureka.co/big-data-and-hadoop
Annie’s Question

The ____________ is specialized in running workflows based on time and data triggers.

Slide 12 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Coordinator Engine

Slide 13 www.edureka.co/big-data-and-hadoop
Oozie - workflow.xml
 The bare minimum workflow XML defines a name, a starting point, and an end point.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="process1"/>
    <end name="end"/>
</workflow-app>

The workflow definition language is XML-based and is called hPDL (Hadoop Process Definition Language).

 Flow-control nodes: Provide a way to control the Workflow execution path.

 Start node (start): Specifies the starting point of an Oozie Workflow.

 End node (end): Specifies the end point of an Oozie Workflow.

Slide 14 www.edureka.co/big-data-and-hadoop
Oozie – workflow.xml
 To this we need to add an action, and within it we specify the map-reduce parameters.

<action name="process1">
    <map-reduce>
        <job-tracker>localhost:8032</job-tracker>
        <name-node>hdfs://localhost:9000</name-node>
        <prepare>
            <delete path="hdfs://localhost:9000/WordCountTest/out1"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>

Action nodes provide a way for a Workflow to initiate the execution of a computation/processing task. This one runs a Hadoop MapReduce job.

Remember: actions require <ok> and <error> tags to direct the next action on success or failure.
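The error transition above points to a node named "fail", which must also be defined in the workflow, typically as a kill node. A minimal sketch (the message text is illustrative):

<kill name="fail">
    <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>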

Slide 15 www.edureka.co/big-data-and-hadoop
Oozie - job.properties and lib
 The job.properties file provides another place where job arguments can be specified.

 All of the properties specified are available in the job execution context and can consequently be used throughout the job.

nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/WordCountTest

 There is a lib directory which contains the libraries used in the workflow (such as jar files (.jar) or shared object files (.so)).

Slide 16 www.edureka.co/big-data-and-hadoop
Annie’s Question

The job.properties file needs to be a local file during submission, not an HDFS path.

True / False

Slide 17 www.edureka.co/big-data-and-hadoop
Annie’s Answer

True

Slide 18 www.edureka.co/big-data-and-hadoop
DEMO ON OOZIE WORKFLOW

Slide 19 www.edureka.co/big-data-and-hadoop
Running Oozie Application
 Create Application

Step 1: Create a directory for the Oozie job (WordCountTest).

Step 2: Write the application and create the jar (for example, a MapReduce jar). Move this jar to the lib folder in the WordCountTest directory.

Step 3: Place job.properties and workflow.xml inside the WordCountTest directory.

Step 4: Move this directory to HDFS.

 Running the Application

oozie job -oozie http://localhost:11000/oozie -config job.properties -run

(job.properties should be read from a local path)
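A consolidated shell sketch of these steps; the jar name WordCount.jar is an assumption:

    mkdir -p WordCountTest/lib
    cp WordCount.jar WordCountTest/lib/            # hypothetical jar name
    cp workflow.xml job.properties WordCountTest/
    hdfs dfs -put WordCountTest /WordCountTest     # workflow definition must live in HDFS
    oozie job -oozie http://localhost:11000/oozie -config WordCountTest/job.properties -run

Note that only the workflow definition and lib directory need to be in HDFS; job.properties is read from the local path at submission time.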

Slide 20 www.edureka.co/big-data-and-hadoop
Monitoring an Oozie Workflow Job
 Workflow Job Status:

$ oozie job -info 1-20090525161321-oozie-xyz-W

Workflow Name : WorkflowRunnerTest
App Path      : hdfs://localhost:9000/WordCountTest
Status        : RUNNING

 Workflow Job Log:

$ oozie job -log 1-20090525161321-oozie-xyz-W

 Workflow Job Definition:

$ oozie job -definition 1-20090525161321-oozie-xyz-W

Slide 21 www.edureka.co/big-data-and-hadoop
Oozie – Coordinator

Slide 22 www.edureka.co/big-data-and-hadoop
Oozie Coordinator
 The Oozie Coordinator supports the automated starting of Oozie Workflow processes.

 It is typically used for the design and execution of recurring invocations of Workflow
processes triggered by time and/or data availability.

Slide 23 www.edureka.co/big-data-and-hadoop
Oozie Coordinator Properties and XML
 We will start with a Coordinator which schedules the wordcount example every 60 minutes.

 Oozie coordinators can be parameterized using variables like ${inputDir}, ${startTime}, etc. within the coordinator definition.

 When submitting a coordinator job, values for the parameters must be provided as input. As parameters are key-value pairs, they can be written in a coordinator.properties file or an XML file.

coordinator.properties:

frequency=60
startTime=2014-02-04T10\:00Z
endTime=2015-02-04T11\:00Z
timezone=UTC
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
queueName=default
workflowPath=${nameNode}/WordCountTest_TimeBased
oozie.coord.application.path=${nameNode}/WordCountTest_TimeBased

coordinator.xml:

<coordinator-app name="coordinator1" frequency="${frequency}"
    start="${startTime}" end="${endTime}" timezone="${timezone}"
    xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>

Slide 24 www.edureka.co/big-data-and-hadoop
Oozie Application Lifecycle
[Diagram: between its start and end times, a Coordinator job materializes actions 0, 1, 2, …, n at times 0*f, 1*f, 2*f, …, N*f (f = frequency); the Oozie Coordinator Engine creates each action, which in turn starts a Workflow (WF) in the Oozie Workflow Engine]

Slide 25 www.edureka.co/big-data-and-hadoop
Use Case 1: Time Triggers

 Execute your workflow every 15 minutes (CRON)

[Timeline: 00:15, 00:30, 00:45, 01:00, …]

Slide 26 www.edureka.co/big-data-and-hadoop
Example 1: Run Workflow every 15 mins
<coordinator-app name="coord1"
    start="2009-01-08T00:00Z"
    end="2010-01-01T00:00Z"
    frequency="15"
    xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property> <name>key1</name><value>value1</value> </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

Slide 27 www.edureka.co/big-data-and-hadoop
Use Case 2: Time and Data Triggers
 Materialize your workflow every hour, but only run it when the input data is ready.

[Diagram: at 01:00, 02:00, 03:00, 04:00, … the coordinator checks whether the input data exists in Hadoop before running]

Slide 28 www.edureka.co/big-data-and-hadoop
Example 2: Data Triggers
<coordinator-app name="coord1" frequency="${1*HOURS}" …>
    <datasets>
        <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
            <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="inputLogs" dataset="logs">
            <instance>${current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://localhost:9000/WordCountTest_TimeBased</app-path>
            <configuration>
                <property> <name>inputData</name><value>${dataIn('inputLogs')}</value> </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

Slide 29 www.edureka.co/big-data-and-hadoop
Use Case 3: Rolling Windows
 Access 15-minute datasets and roll them up into hourly datasets (see the sketch below)

[Timeline: the 00:15, 00:30, 00:45, and 01:00 quarter-hour datasets roll up into the 01:00 hourly dataset; 01:15 through 02:00 roll up into 02:00]
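A hedged sketch of the input-events for such a roll-up, reusing the notation from Example 2; the dataset name logs15min is an assumption and would be declared elsewhere in the coordinator with a 15-minute frequency:

<input-events>
    <data-in name="quarterHourLogs" dataset="logs15min">
        <start-instance>${current(-3)}</start-instance>
        <end-instance>${current(0)}</end-instance>
    </data-in>
</input-events>

${current(-3)} through ${current(0)} select the four most recent quarter-hour instances, which an hourly coordinator action can then aggregate.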

Slide 30 www.edureka.co/big-data-and-hadoop
Monitoring an Oozie Coordinator Job
 Coordinator Job Status:

$ oozie job -info 1-20090525161321-oozie-xyz-C

Job Name : WordCountTest_TimeBased
App Path : hdfs://localhost:9000/WordCountTest_TimeBased
Status   : RUNNING

 Coordinator Job Log:

$ oozie job -log 1-20090525161321-oozie-xyz-C

 Coordinator Job Definition:

$ oozie job -definition 1-20090525161321-oozie-xyz-C

Slide 31 www.edureka.co/big-data-and-hadoop
Some Oozie Commands
 Checking the Status of multiple Workflow Jobs

$ oozie jobs -oozie http://localhost:11000/oozie -localtime -len 2 -filter status=RUNNING

 Checking the Status of multiple Coordinator Jobs

$ oozie jobs -oozie http://localhost:11000/oozie -jobtype coordinator

 Killing a Workflow, Coordinator or Bundle Job

$ oozie job -oozie http://localhost:11000/oozie -kill 14-20090525161321-oozie-joe

 Checking the Status of a Workflow, Coordinator or Bundle Job or a Coordinator Action

$ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe

 Checking the version

$ oozie admin -oozie http://localhost:11000/oozie -version

Slide 32 www.edureka.co/big-data-and-hadoop
Oozie Web Console: List Jobs

Slide 33 www.edureka.co/big-data-and-hadoop
Oozie Web Console: Job Details

Slide 34 www.edureka.co/big-data-and-hadoop
Oozie Web Console: Failed Actions

Slide 35 www.edureka.co/big-data-and-hadoop
Oozie Web Console: Error Messages

Slide 36 www.edureka.co/big-data-and-hadoop
Project

Slide 37 www.edureka.co/big-data-and-hadoop
Use Case – How do I find out the best

Slide 38 www.edureka.co/big-data-and-hadoop
Use Case – The type of data we are dealing with!

Slide 39 www.edureka.co/big-data-and-hadoop
Abstract Flow Diagram

[Diagram: huge raw XML files with unstructured review data flow into the system; the output is a user interface to search the top-rated links per category]

Slide 40 www.edureka.co/big-data-and-hadoop
Flow Diagram

[Diagram: huge raw XML files with unstructured review data are loaded into HDFS, processed with PIG and HIVE, and exported via SQOOP to the user interface for searching the top-rated links per category]

Slide 41 www.edureka.co/big-data-and-hadoop
Revision: Map-Reduce Phase
[Diagram: huge raw XML files with unstructured review data are loaded into HDFS and processed by a Map-Reduce job, producing rows of the form: category, hash, url, +tive, -tive, total — a sketch of such a mapper follows]
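As a hedged illustration of the "decide if a review is good or bad" step described later in this module, a minimal mapper sketch; the input format (category<TAB>url<TAB>reviewText) and the word lists are assumptions:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReviewMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Naive sentiment word lists; the real job's logic may differ.
    private static final Set<String> GOOD = new HashSet<>(Arrays.asList("good", "great", "excellent"));
    private static final Set<String> BAD  = new HashSet<>(Arrays.asList("bad", "poor", "terrible"));

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 3);
        if (parts.length < 3) return;               // skip malformed lines
        int plus = 0, minus = 0;
        for (String word : parts[2].toLowerCase().split("\\W+")) {
            if (GOOD.contains(word)) plus++;
            if (BAD.contains(word)) minus++;
        }
        // key: category + url; value: positive count, negative count, total hits
        context.write(new Text(parts[0] + "\t" + parts[1]),
                      new Text(plus + "\t" + minus + "\t" + (plus + minus)));
    }
}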

Slide 42 www.edureka.co/big-data-and-hadoop
Map-Reduce to Pig Phase

[Diagram: PIG joins the Map-Reduce output (hash, url, +tive, -tive, total) with category data (hash, category) — a sketch of the join follows]
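A hedged sketch of that join in Pig Latin; the input paths, delimiters, and field names are assumptions for illustration (Pig identifiers cannot contain '+' or '-', so the +tive/-tive columns are rendered as ptive/ntive):

reviews = LOAD '/output/mr' USING PigStorage('\t')
          AS (hash:chararray, url:chararray, ptive:int, ntive:int, total:int);
categories = LOAD '/data/categories' USING PigStorage('\t')
          AS (hash:chararray, category:chararray);
joined = JOIN reviews BY hash, categories BY hash;
result = FOREACH joined GENERATE categories::category, reviews::url,
         reviews::ptive, reviews::ntive, reviews::total, reviews::hash;
STORE result INTO '/output/pig' USING PigStorage('\t');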

Slide 43 www.edureka.co/big-data-and-hadoop
Pig to Hive Phase

[Resulting Hive table: category, url, +tive, -tive, total, hash — a sketch of a query over it follows]
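A hedged sketch of the kind of query the next phase runs over this table; the table name review_ratings and the ptive/ntive column names are assumptions:

SELECT category, url, (ptive - ntive) AS rating
FROM review_ratings
ORDER BY category, rating DESC;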

Slide 44 www.edureka.co/big-data-and-hadoop
Hive to Sqoop Phase: Dumping Data to MySQL

[Diagram: SQOOP exports the Hive results to MySQL, which backs the web interface — a sketch of the export follows]
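A hedged sketch of the export; the database, table, credentials, and HDFS directory are illustrative assumptions:

    sqoop export \
      --connect jdbc:mysql://localhost/reviews \
      --username root --password hadoop \
      --table top_links \
      --export-dir /user/hive/warehouse/top_links \
      --input-fields-terminated-by '\001'

The '\001' delimiter matches Hive's default field separator for managed tables.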
Slide 45 www.edureka.co/big-data-and-hadoop
In a Nutshell

[Diagram, built up across Slides 46–53:]

» Huge raw XML files with unstructured review data come in.
» An MR job reads the reviews, uses some dumb logic, and decides whether each review is good or bad, producing structured output data.
» Pig combines that output with category/ratings data into a single structured dataset.
» A Hive SQL query ("write a fancy query") extracts the top-rated links per category.
» Sqoop reads the result and dumps it to MySQL.
» A user interface searches the top-rated links per category.

Slides 46–53 www.edureka.co/big-data-and-hadoop


Essence
 When to use Map Reduce
» To handle extremely unstructured data like XML files.
» Works on both structured and unstructured data.
» Good for writing complex business logic.

 When NOT to use Map Reduce
» Scarcity of time.
» When join functionality is required.

 When to use Pig
» To handle semi-structured data, like taking a substring from every line.
» Structured and unstructured data.

 When NOT to use Pig
» A new language to learn.
» When you have a small dataset.

 When to use Hive
» Very similar to SQL; comes in handy while analyzing data.
» Less development time.
» Suitable for ad-hoc analysis.

 When NOT to use Hive
» Not easy for complex business logic.
» Deals only with structured data.

Slide 54 www.edureka.co/big-data-and-hadoop
Hadoop Ecosystem

[Diagram, layered top to bottom:]

» Monitoring and deployment: Ambari
» Workflow: Oozie
» Applications: Batch (MR, PIG, Hive, Tez), NoSQL (HBase), Streaming (Storm), In-memory (Spark), Machine Learning (Mahout), and others (Impala, Solr, HIPI)
» Cluster management and co-ordination: YARN and ZooKeeper
» Storage: HDFS
» Data loading techniques: Flume, Sqoop

Slide 55 www.edureka.co/big-data-and-hadoop
Assignment
Execute the Oozie practicals

Slide 56 www.edureka.co/big-data-and-hadoop
Edureka Certification
To achieve the Edureka Certification, you need to complete a project, which helps you apply all the concepts you have learnt during your Hadoop classes.

 Please use the Data Set and Problem Statement given in the POC. To download it, click on "Download the POC now".

 Following are the steps you need to complete to apply for the Certification:

» Submit your final project within 2 weeks from the day you start, for final review.

» You will receive your Final Certification on successful completion of the project.

Slide 57 www.edureka.co/big-data-and-hadoop
What Next??

Slide 58 www.edureka.co/big-data-and-hadoop
Big Data in 10 minutes
Learn Big Data not in months but in minutes!! Sounds too good? But it's true.

[Logos: MapR, Hortonworks, Cloudera, Hadoop] Go from zero to big data in under 10 minutes.

Talend Open Studio for Big Data dramatically simplifies the process of loading data into Hadoop, transforming it there, and extracting processed data from Hadoop to other destination systems.

Slide 59 www.edureka.co/big-data-and-hadoop
Why Talend?
 Talend is the only Graphical User Interface tool capable of "translating" an ETL job into a MapReduce job. Thus, a Talend ETL job is executed as a MapReduce job on Hadoop, getting the big data work done in minutes.
 This is a key innovation which helps to reduce entry barriers in Big Data technology and allows ETL job developers (beginners and advanced) to carry out Data Warehouse offloading to a greater extent.
 With its Eclipse-based graphical workspace, Talend Open Studio for Big Data enables the developer and data scientist to leverage Hadoop loading and processing technologies like HDFS, HBase, Hive, and Pig without having to write Hadoop application code.
 Hadoop applications, seamlessly integrated within minutes using Talend.

Slide 60 www.edureka.co/big-data-and-hadoop
Why Talend? (Contd.)
 By simply selecting graphical components from a palette, then arranging and configuring them, you can create Hadoop jobs. For example:

1. Load data into HDFS (Hadoop Distributed File System)
2. Use Hadoop Pig to transform data in HDFS
3. Load data into a Hadoop Hive based data warehouse
4. Perform ELT (extract, load, transform) aggregations in Hive
5. Leverage Sqoop to integrate relational databases and Hadoop

Slide 61 www.edureka.co/big-data-and-hadoop
Talend Hadoop Integration (Contd.)
 For Hadoop applications to be truly accessible to your organization, they need to be smoothly integrated into your overall data flows.
 Talend Open Studio for Big Data is the ideal tool for integrating Hadoop applications into your broader data architecture.
 Talend provides more built-in connector components than any other data integration solution available, with more than 800 connectors that make it easy to read from or write to any major file format, database, or packaged enterprise application.
 For example, in Talend Open Studio for Big Data, you can use drag-and-drop configurable components to create data integration flows that move data from delimited log files into Hadoop Hive, perform operations in Hive, and extract data from Hive into a MySQL database (or Oracle, Sybase, SQL Server, and so on).

Slide 62 www.edureka.co/big-data-and-hadoop
Who can use “Talend for Big Data”!!

Slide 63 www.edureka.co/big-data-and-hadoop
References
 https://www.talend.com/resource/hadoop-applications.html

 http://www.edureka.co/blog/big-data-and-etl-are-family/

Slide 64 www.edureka.co/big-data-and-hadoop
Thank you for being with Edureka!!
We would like to remind you that your association with edureka does not stop here!

Remember:

» Lifetime Access: You have lifetime access to the courses you are registered for!
» Lifetime Support: You get lifetime support for your courses.
» Free Additional Resources: You get free access to edureka! webinars for ALL courses.
» Discounts: You can get a discount on every next course you buy from edureka! For more details keep checking your LMS.
» Referrals: You are eligible for referral benefits. Earn when you refer anyone to edureka!

Please post all your reviews here: http://www.quora.com/Reviews-of-Edureka-online-education

Slide 65 www.edureka.co/big-data-and-hadoop
