
Pig

August 2013

What is Pig?
Definition:
Apache Pig is a data-processing framework for manipulating and
analyzing large data sets in a parallel environment.
Pig Latin is the high-level dataflow language of the Pig environment.

History:
Originally developed at Yahoo! (2006).
Became a subproject of Apache Hadoop (2007; Pig v0.6).
Used for roughly 30% of all MapReduce jobs at Yahoo!.
Widely used across the Hadoop ecosystem since 2007.

Provides:
A high-level abstraction for implementing MapReduce.
Freedom from the details of the Hadoop/MapReduce Java API.
Lets developers focus on their application and its dataflow, as the snippet below illustrates.
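
A taste of that abstraction (a hypothetical three-line pipeline, not from the original slide; the file names are made up):

users  = LOAD 'users.txt' AS (name:chararray, age:int);  -- read and name the fields
adults = FILTER users BY age >= 18;                      -- keep only adult users
STORE adults INTO 'adults_out';                          -- write the results out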



So Why Pig & Not MapReduce?

Like SQL: familiar, with faster development time.

Dataflow over programming logic.

Many standard data operations built in: JOIN, GROUP, FILTER, etc.

Manages the nitty-gritty of connecting jobs and data flow.



Pig Features
No or flexible schema
Nested data types (tuples, bags, maps)
Procedural, step-by-step execution
Checkpoints
Split operations
Built-in and user-defined functions (UDFs)
Support for external binaries
Parallelism
A few of these features are sketched right after this list.
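
A hypothetical snippet (not from the slide) touching three of these features: the optional schema, SPLIT, and explicit parallelism:

nums    = LOAD 'numbers.txt' AS (n:int);          -- schema is optional; named here for clarity
SPLIT nums INTO neg IF n < 0, pos IF n >= 0;      -- one input split into two relations
grouped = GROUP pos BY n PARALLEL 10;             -- request 10 reducers for this step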



Apache Pig Philosophy
Pigs Eat Anything:
Pig can operate on any type of data.
Metadata is optional.
Data-format flexibility through UDFs.

Pigs Live Anywhere:
Pig's high level of abstraction means implementation freedom.

Pigs Are Domesticated:
Extensible by users.

Pigs Fly!
Built for speed, in both development and at runtime.



Pig Latin First Look

Problem statement:
Available data: user records and pages browsed.
Question: what are the 5 most visited pages by users aged 18 to 25?

Solution strategy: load users, load pages, filter by age, join on name,
group on url, count clicks, order by clicks, take the top 5.

Implementation (pseudo):
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'top5sites';

Data Types

Pig Type    Java Class
bytearray   DataByteArray
chararray   String
int         Integer
long        Long
float       Float
double      Double
tuple       Tuple
bag         DataBag
map         Map<Object, Object>
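
A hedged illustration (not on the slide) of how the three complex types appear in a LOAD schema; the field and file names are made up:

complex = LOAD 'input' AS (
    t:  tuple(a:int, b:chararray),    -- tuple: an ordered set of fields
    bg: bag{ tt:tuple(x:int) },       -- bag: a collection of tuples
    m:  map[]                         -- map: a set of key-value pairs
);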



No/Flexible Schema: Example

records = LOAD '/sample.txt' AS (year:int, temperature:int, quality:int);  -- full schema with types

records = LOAD '/sample.txt' AS (year, temperature, quality);              -- names only; types default to bytearray

records = LOAD '/sample.txt' AS (year, temperature:int, quality:int);      -- partial typing is fine too

records = LOAD '/sample.txt';                                              -- no schema at all

projected_records = FOREACH records GENERATE year, temperature, quality;   -- by name (needs a schema)

projected_records = FOREACH records GENERATE $0, $1, $2;                   -- by position (works without one)



Why Use Pig: A Compelling Visual Comparison
Java/MapReduce code for a simple web-log user and page-ranking join:

(The slide reproduces the complete Java source of MRExample.java in three dense columns; only its structure is summarized here.)

LoadPages and LoadAndFilterUsers: mappers that split each input line on the first comma, tag the value with its source ("1" for pages, "2" for users), and drop users outside ages 18 to 25.
Join: a reducer that separates the tagged values into two lists and emits their cross product.
LoadJoined: a mapper that extracts the url as key and emits a count of 1 for the combiner/reducer to sum.
ReduceUrls: a combiner/reducer that sums the click counts per url.
LoadClicks and LimitClicks: a mapper that inverts each (url, count) pair and a reducer that passes through only the first 100 records.
main(): five chained JobConf jobs (Load Pages, Load and Filter Users, Join Users and Pages, Group URLs, Top 100 sites) over paths under /user/gates/, wired into a JobControl named "Find top 100 sites for users 18 to 25" and run as one pipeline.



Why Use Pig: A Compelling Visual Comparison (continued)
The same user and page-ranking operation in Pig Latin:
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages  = load '/data/pages' as (url, pagerank);
VP     = join Visits by url, Pages by url;
UserVisits    = group VP by user;
UserPageranks = foreach UserVisits generate group as user,
                AVG(VP.pagerank) as avgpr;
GoodUsers     = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
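
To see how Pig compiles a script like this into MapReduce jobs, the grunt shell's EXPLAIN command prints the logical, physical, and MapReduce plans for an alias (a sketch, run after the script above):

grunt> EXPLAIN GoodUsers;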



Pig Latin Basic Operators

Pig Command     Description                          Example
LOAD            Read data from a file                Log = LOAD 'small.log' AS (user, time, query);
STORE           Write data into a file               STORE Log INTO 'output.log';
FILTER          Apply a predicate and remove         Adult = FILTER Users BY age >= 18;
                records that fail it
FOREACH         Apply an expression to each record   Names = FOREACH Users GENERATE name;
GROUP/COGROUP   Collect records with matching keys   AdultUrls = GROUP Adult BY url;
JOIN            Join >=2 inputs by key               Joined = JOIN Adult BY name, Page BY user;
ORDER           Sort records by key                  Sorted = ORDER Adult BY Count;
DISTINCT        Remove duplicate records             UniqUsers = DISTINCT Users;
UNION           Merge 2 data sets                    C = UNION A, B;
SPLIT           Split into >=2 sets, each based      SPLIT A INTO Neg IF $0 < 0, Pos IF $0 > 0;
                on a filter
STREAM          Send records through an external     HighDivs = STREAM divs THROUGH `highdiv.pl`;
                binary
DUMP            Write to stdout                      DUMP HighDivs;
LIMIT           Limit the number of records          First50 = LIMIT Users 50; DUMP First50;
Example: A Simple Word Count
Compact version, with filtering added and results ordered high to low:

A = LOAD 'user/training/input/shakes.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = FILTER B BY word MATCHES '\\w+';
D = GROUP C BY word;
E = FOREACH D GENERATE COUNT(C) AS count, group AS word;
F = ORDER E BY count DESC;

Detailed version (more strongly typed and descriptive):

lines = LOAD '/input/shakes.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered = FILTER words BY word MATCHES '\\w+';
wordgroups = GROUP filtered BY word;
wordcount = FOREACH wordgroups GENERATE COUNT(filtered) AS count, group AS word;
orderedwordcount = ORDER wordcount BY count DESC;
STORE orderedwordcount INTO 'output/shakes_freq';
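
To spot-check the result before committing to the STORE (a hypothetical addition to the script):

top10 = LIMIT orderedwordcount 10;
DUMP top10;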



Example: Practical Lessons Learned
Pig Latin is a dataflow language:
No explicit flow control.

Pig Latin is gently typed:

LOAD <filename>;  -- Defaults are OK for simple examples. But...
LOAD <filename> AS (count:int, word:chararray);  -- is preferred.

In the grunt> shell, DESCRIBE is very useful; so is DUMP, if you can handle the
output flood (a short DESCRIBE sketch follows below).
Pig execution can produce tons of unnecessary console output. To tame it:
grunt> set debug off;  (and grunt> set debug on; to turn it back)
$ pig -4 ~/workshop/conf/nolog.conf <exec cmd>
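
A quick sketch of DESCRIBE in grunt (a hypothetical session; the alias and file are made up):

grunt> records = LOAD 'sample.txt' AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}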

Each execution produces a log file. Use them!

pig_1358834310797.log



Real Data Formats and How to Handle Them
Simplified Weblog
43.60.688.623 03/Jun/2012:09:15:30 -0500 03 Jun 06 2012 09 15 30 -0500 GET /feeds/press 200 0 -
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 33 subscribers; feed-id=4404395209182797140)
612.57.72.653 03/Jun/2012:09:14:50 -0500 03 Jun 06 2012 09 14 50 -0500 GET /product/product3 200
0 /product/product2 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727;
.NET CLR 3.0.04506.30) . . .

Simple PigStorage LOAD:

weblog = LOAD <filename> USING PigStorage('\t') AS (client_ip:chararray, full_request_date:chararray,
day:int, month:chararray, month_num:int, year:int, hour:int, minute:int, second:int, timezone:chararray, . . .,
user_agent:chararray);

weblog_group = GROUP weblog BY (client_ip, year, month_num);

(Diagram: one group per (client_ip, year, month_num) key, e.g. a separate bag of records per client_ip for 2012/06, 2012/07, 2012/08, ...)
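
A typical next step, not shown on the slide, is to aggregate each group, e.g. hits per client per month:

monthly_hits = FOREACH weblog_group GENERATE
    FLATTEN(group) AS (client_ip, year, month_num),  -- unpack the group key
    COUNT(weblog) AS hits;                           -- records in each bag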



Questions & Answers



References

Hadoop: The Definitive Guide
Apache Pig website and docs



Thanks

