By
Ravikrishna Adepu
Overview
What is Pig?
Motivation
How is it being used
Data Model/Architecture
Components
Pig Latin By Example
What is Pig?
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Pig Latin
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming.
Optimization opportunities.
Extensibility: Users can create their own functions to do special-purpose processing.
[Diagram: Pig Latin transformations are compiled into MapReduce jobs that run over HDFS]
Features
Simple-to-understand data flow language, familiar to users of scripting languages
Fast, iterative development backed by a strong MapReduce compilation engine
Rich, multivalued, nested operations performed on large datasets
SQL
SQL is declarative
Flat relational data model
(Data is tied to a specific
Data Type)
Schema is required
OLTP + OLAP workloads
Significant opportunity for
query optimization
SQL
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks
    on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;
Solution:
Opens the system to users familiar with PHP, Ruby, and Python
4 hours in Java -> 15 minutes in Pig Latin
Provides common operations like join, group, filter, and sort
Pig provides Pig Latin, which increases productivity roughly tenfold
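As a sketch, the SQL query on the earlier slide could be written in Pig Latin roughly as follows, using the join, group, and filter operations listed above (file names and schemas are assumed for illustration):

```pig
-- Load the three datasets (schemas assumed for illustration)
Users   = LOAD 'users'   AS (name:chararray, ipaddr:chararray);
Clicks  = LOAD 'clicks'  AS (user:chararray, value:int);
Geoinfo = LOAD 'geoinfo' AS (ipaddr:chararray, dma:chararray);

-- filter, join, group, count
ValuableClicks = FILTER Clicks BY value > 0;
UserClicks     = JOIN Users BY name, ValuableClicks BY user;
Located        = JOIN UserClicks BY Users::ipaddr, Geoinfo BY ipaddr;
ByDMA          = GROUP Located BY Geoinfo::dma;
Result         = FOREACH ByDMA GENERATE group AS dma, COUNT(Located);
STORE Result INTO 'ValuableClicksPerDMA';
```

Each step names an intermediate relation, which is what makes the script easy to build up and debug incrementally.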
Pig Processing:
Grunt, the Pig shell
Submit a script directly
The PigServer Java class, a JDBC-like interface
PigPen, which allows textual & graphical scripting; samples data & shows an example data flow
Components:
Pig resides on the user machine
No need to install anything extra on the cluster
Jobs are submitted to the cluster & executed on the cluster
Starting grunt:
cd /usr/share/doc/pig-0.11.0+44/examples/data
ls
$ pig -x local
You should see a prompt like
grunt>
We can run Pig in two modes:
Standalone mode (local mode)
Distributed mode (MapReduce mode)
Execution Modes
Pig has two execution modes:
Local Mode:
To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
MapReduce Mode:
To run Pig in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
Loading Data:
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
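A minimal sketch (file name and schema are assumed for illustration):

```pig
-- PigStorage is the default loader; tab is its default delimiter
A = LOAD 'students' USING PigStorage('\t')
    AS (name:chararray, age:int, gpa:float);
```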
Continued:
Pig Latin provides operators that can help you debug your Pig Latin statements:
Use the DUMP operator to display results to your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
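For instance, against a loaded relation (the relation and file names here are illustrative), the four operators are invoked as:

```pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;    -- shows the schema of A
DUMP A;        -- runs the pipeline and prints A's tuples
EXPLAIN A;     -- shows the logical/physical/MapReduce plans
ILLUSTRATE A;  -- shows sample data flowing through each step
```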
Continued :
Complex data types
BAG
TUPLE
MAP
Syntax
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[])} field
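As a sketch, all three complex types can appear in a LOAD schema (file and field names are illustrative):

```pig
A = LOAD 'data' AS (
    t:  tuple(a:int, b:chararray),   -- TUPLE: an ordered set of fields
    bg: bag{tp: tuple(x:int)},       -- BAG: a collection of tuples
    m:  map[]                        -- MAP: a set of key#value pairs
);
```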
Usage :
Cast operators enable you to cast or convert data from one
type to another, as long as conversion is supported (see the
table above). For example, suppose you have an integer
field, myint, which you want to convert to a string. You can
cast this field from int to chararray using (chararray) myint.
A field can be explicitly cast. Once cast, the field remains
that type (it is not automatically cast back). In this
example $0 is explicitly cast to int.
B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this
example $0 is cast to int (regardless of underlying data)
and $1 is cast to double.
B = FOREACH A GENERATE $0 + 1, $1 + 1.0;
Tuple construction
A = load 'students' as (name:chararray,
age:int,gpa:float);
B = foreach A generate (name, age);
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
(joe smith,20)
(amy chen,22)
(leo allen,18)
Bag Construction
A = load 'students' as (name:chararray,
age:int, gpa:float);
B = foreach A generate {(name, age)},
{name, age};
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
{(joe smith,20)} {(joe smith),(20)}
{(amy chen,22)} {(amy chen),(22)}
{(leo allen,18)} {(leo allen),(18)}
Map construction
A = load 'students' as (name:chararray,
age:int, gpa:float);
B = foreach A generate [name, gpa];
store B into 'results';
Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results):
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
Pig Latin: UDF
Pig provides extensive support
for user-defined functions
(UDFs) as a way to specify
custom processing. Functions
can be a part of almost every
operator in Pig
All UDFs are case sensitive
UDF: Types
Eval Functions (EvalFunc)
Ex: StringConcat (built-in) : Generates the concatenation of
the first two fields of a tuple.
Aggregate Functions (EvalFunc & Algebraic)
Ex: COUNT, AVG ( both built-in)
Filter Functions (FilterFunc)
Ex: IsEmpty (built-in)
Load/Store Functions (LoadFunc/ StoreFunc)
Ex: PigStorage (built-in)
Note: URL for built-in functions:
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
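A sketch of how UDFs plug into a script: the built-ins named above need no registration, while a custom Java UDF is registered from a jar first (the jar and class names below are hypothetical):

```pig
-- Built-in aggregate and filter functions
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
G = GROUP A BY age;
H = FILTER G BY NOT IsEmpty(A);                      -- filter function
S = FOREACH H GENERATE group, COUNT(A), AVG(A.gpa);  -- aggregate functions

-- A custom UDF (hypothetical jar/class), then called like a built-in
REGISTER myudfs.jar;
U = FOREACH A GENERATE myudfs.Normalize(name);
```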
How It Works
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
pig.jar:
parses
checks
optimizes
plans execution
submits jar
to Hadoop
monitors job progress
Execution Plan
Map:
Filter
Count
Combine/Reduce:
Sum
Project
Word count using Hadoop Pig:
Preparing a text file:
It's definitely a little more interesting if you can work with some data you know, or at least have an interest in.
I used sample data provided by Cloudera for Hadoop Pig.
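A minimal word-count sketch in Pig Latin (input and output paths are illustrative):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE splits a line into a bag of words; FLATTEN turns the bag into rows
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```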
RESULTS
References
1. http://en.wikipedia.org/wiki/Pig_(programming_tool)
2. http://pig.apache.org/
3. http://hortonworks.com/hadoop/pig/
4. http://www-01.ibm.com/software/data/infosphere/hadoop/pig/
5. https://github.com/romainr/yelp-data-analysis
6. http://www.cloudera.com/content/cloudera/en/resources/library/training/introduction-to-apache-pig.html
7. https://github.com/romainr/hadoop-tutorials-examples
QUESTIONS?
Thank you !