
Hadoop Pig

By
Ravikrishna Adepu

Overview
What is Pig?
Motivation
How is it being used
Data Model/Architecture
Components
Pig Latin By Example

What is Pig?
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig Latin
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming.
Optimization opportunities.
Extensibility: users can create their own functions to do special-purpose processing.

Hadoop Pig Architecture

[Diagram: flow of a Pig job]
Client machine (Pig job submission)
-> Pig-to-MapReduce transformations
-> MapReduce jobs
-> HDFS (Hadoop Distributed File System)

Features
A simple, easy-to-understand dataflow language for analysts familiar with scripting languages
A fast, iterative language with a strong MapReduce compilation engine
Rich, nested, multivalued operations performed on large datasets

Pig v/s SQL


Pig
Pig is procedural
Nested relational data model (no constraints on data types)
Schema is optional
Scan-centric analytic workloads (no random reads or writes)
Limited query optimization

SQL
SQL is declarative
Flat relational data model (data is tied to a specific data type)
Schema is required
OLTP + OLAP workloads
Significant opportunity for query optimization

Pig procedural v/s SQL declarative


PIG
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
SQL
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;

Motivation behind Pig


Challenges:
MapReduce requires a Java programmer
MapReduce can require multiple stages to reach a solution
Users have to reinvent common functionality (join, filter, etc.)
Long development cycle with rigorous testing stages

Solution:
Opens the system to users familiar with PHP, Ruby, Python
4 hours in Java -> 15 minutes in Pig Latin
Provides common operations like join, group, filter, and sort
Pig Latin increases productivity roughly tenfold

How is Pig being used


Web log processing
Data processing for web search platforms
Ad hoc queries across large data sets
Rapid prototyping of algorithms for large data sets
Quick fact: 70% of production Hadoop jobs at Yahoo! are run through Pig

Pig Processing:
Grunt, the Pig shell
Submitting a script directly
PigServer, a Java class with a JDBC-like interface
PigPen, which allows textual & graphical scripting, samples data & shows an example data flow

Components:
Pig resides on the user machine
No need to install anything extra on the cluster
Jobs are submitted to the cluster & executed there

First look at the program:

Let's first look at the programming language itself, so you can see how it's significantly easier than having to write mapper and reducer programs.
The first step in a Pig program is to LOAD the data you want to manipulate from HDFS.
Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks).
Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.
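A minimal sketch of that LOAD -> transform -> STORE flow (the file names, schema, and filter condition here are illustrative, not from the original slides):

raw = LOAD 'input_data' AS (user:chararray, score:int); -- read from HDFS
good = FILTER raw BY score > 0;                         -- a transformation
DUMP good;                                              -- display on screen
STORE good INTO 'output_data';                          -- or persist to a file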

Starting Grunt:
cd /usr/share/doc/pig0.11.0+44/examples/data
ls
$ pig -x local
You should see a prompt like:
grunt>
We can run Pig in two modes:
Stand-alone mode (local mode)
Distributed mode (MapReduce mode)

Execution Modes
Pig has two execution modes:
Local Mode:
To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
MapReduce Mode:
To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
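For example, the same script can be run in either mode from the command line (the script name here is hypothetical):

$ pig -x local wordcount.pig       # run against the local file system
$ pig -x mapreduce wordcount.pig   # run on the Hadoop cluster (the default)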

Loading Data:
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
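As a quick illustration (the file names and schema are invented for this sketch): with no USING clause, Pig falls back to PigStorage with tab delimiters, and a different delimiter can be passed explicitly:

A = LOAD 'students' AS (name:chararray, age:int, gpa:float);  -- default PigStorage('\t')
B = LOAD 'students.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);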

Storing Final Results:


Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).
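A matching sketch (the output paths are illustrative):

STORE A INTO 'output_dir';                        -- default tab-delimited PigStorage
STORE A INTO 'output_csv' USING PigStorage(',');  -- comma-delimited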

Continued:
Pig Latin provides operators that can help you debug your Pig Latin statements:
Use the DUMP operator to display results on your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans used to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
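For example, given a relation A like the one loaded in the earlier sketch:

DUMP A;       -- print A's tuples to the terminal
DESCRIBE A;   -- show A's schema
EXPLAIN A;    -- show the plans Pig would use to compute A
ILLUSTRATE A; -- step through execution on sampled data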

Pig Latin data types:


Basic data types :
INT
LONG
FLOAT
DOUBLE
CHARARRAY
BYTEARRAY
BOOLEAN

Continued:
Complex data types:
BAG
TUPLE
MAP

Syntax:
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[])} field
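To make these concrete, here is a minimal sketch of declaring complex types in a LOAD schema (the file and field names are illustrative):

A = LOAD 'data' AS (
  t:tuple(a:int, b:chararray),   -- TUPLE: an ordered set of fields
  bg:bag{row:tuple(c:int)},      -- BAG: a collection of tuples
  m:map[]                        -- MAP: a set of key#value pairs
);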

Usage:
Cast operators enable you to cast or convert data from one type to another, as long as the conversion is supported (the Pig documentation lists the supported conversions). For example, suppose you have an integer field, myint, which you want to convert to a string. You can cast this field from int to chararray using (chararray)myint.
A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int:
B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless of the underlying data) and $1 is cast to double:
B = FOREACH A GENERATE $0 + 1, $1 + 1.0;

Tuple construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
(joe smith,20)
(amy chen,22)
(leo allen,18)

Bag Construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate {(name, age)}, {name, age};
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
{(joe smith,20)} {(joe smith),(20)}
{(amy chen,22)} {(amy chen),(22)}
{(leo allen,18)} {(leo allen),(18)}

Map construction
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate [name, gpa];
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]

Pig Latin: UDFs
Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig.
All UDFs are case sensitive.

UDF: Types
Eval Functions (EvalFunc)
Ex: StringConcat (built-in): generates the concatenation of the first two fields of a tuple.
Aggregate Functions (EvalFunc & Algebraic)
Ex: COUNT, AVG (both built-in)
Filter Functions (FilterFunc)
Ex: IsEmpty (built-in)
Load/Store Functions (LoadFunc / StoreFunc)
Ex: PigStorage (built-in)
Note: the built-in functions are listed at
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
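A short sketch of wiring a UDF into a script, in the style of the Pig UDF manual (the jar name myudfs.jar and class myudfs.UPPER are hypothetical):

REGISTER myudfs.jar;                        -- make the UDF's jar visible to Pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name);  -- apply the eval UDF to each tuple
DUMP B;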

How It Works
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

pig.jar:
parses
checks
optimizes
plans execution
submits a jar to Hadoop
monitors job progress

Execution Plan
Map:
Filter
Count
Combine/Reduce:
Sum

Project
Word count using Hadoop Pig:
Preparing a text file:
It's definitely a little more interesting if you can work with some data you know or at least have an interest in.
I used the sample data provided by Cloudera for Hadoop Pig.

Import the file into the Sandbox


Go to the File Browser tab and upload the .txt file. Take note of the default location it is loading to (/user/hue).
Write a Pig script to parse the data and store the counts to a file:
-- script starts here (this is a single-line comment)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
/* multi-line comments look like this */
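When the job finishes, each output record under /user/hue/pig_wordcount is a (count, word) tuple. An illustrative sample (the values are invented for this sketch):

(42,hadoop)
(17,pig)
(9,mapreduce)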

RESULTS


QUESTIONS?

Thank you!
