August 2013
Copyright 2013 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
What is PIG?
Definition:
Apache Pig is a data processing framework designed for the manipulation and
analysis of large data sets in a parallel environment.
Pig Latin is the high-level dataflow language of the Pig environment.
History:
Originally developed at Yahoo! (2006).
Became a subproject of Apache Hadoop (2007) (Pig v0.6).
Used for 30% of all MapReduce jobs at Yahoo!.
Widely used in the Hadoop ecosystem since 2007.
Provides:
A high-level abstraction for implementing MapReduce.
Freedom from the details of the Hadoop/MapReduce Java API.
Enables developers to focus on their app or dataflow.
Pigs Fly!
Built for speed (development and runtime).
Pig type     Java class
chararray    String
int          Integer
long         Long
float        Float
double       Double
tuple        Tuple
bag          DataBag
Operator       Description                                 Example
LOAD           Read data from a file                       Log = LOAD 'small.log' AS (user, time, query);
STORE          Write data to a file                        STORE Log INTO 'output.log';
FILTER         Keep only records that satisfy a predicate  Adult = FILTER Users BY age >= 18;
FOREACH        Apply an expression to each record          Count = FOREACH group GENERATE ;
GROUP/COGROUP  Collect records with matching keys          AdultUrls = GROUP Adult BY Url;
JOIN           Join two or more inputs by key              Joined = JOIN Adult BY name, Page BY user;
ORDER          Sort records by key                         Sorted = ORDER Adult BY Count;
DISTINCT       Remove duplicate records                    UniqUsers = DISTINCT Users;
UNION          Merge two data sets                         C = UNION A, B;
SPLIT          Split into two or more sets by condition    SPLIT a INTO Neg IF $0 < 0, Pos IF $0 > 0;
STREAM         Send records through an external program    HighDivs = STREAM divs THROUGH `highdiv.pl`;
DUMP           Write to stdout                             DUMP HighDivs;
LIMIT          Limit the number of records                 First50 = LIMIT users 50;
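To make the semantics of a few of these operators concrete, here is a minimal plain-Python sketch of what DISTINCT, FILTER, and GROUP compute on a small in-memory data set. This is not how Pig executes them (Pig compiles the dataflow into MapReduce jobs); the sample records and the helper names `pig_distinct`, `pig_filter`, and `pig_group` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample records (name, age, url); not data from the slides.
users = [
    ("ann", 34, "a.com"),
    ("bob", 17, "b.com"),
    ("cat", 25, "a.com"),
    ("ann", 34, "a.com"),  # exact duplicate of the first record
]


def pig_distinct(records):
    """DISTINCT: remove duplicate records (first-seen order kept for readability)."""
    return list(dict.fromkeys(records))


def pig_filter(records, predicate):
    """FILTER: keep only the records for which the predicate is true."""
    return [r for r in records if predicate(r)]


def pig_group(records, key):
    """GROUP: collect records that share a key into (key, bag) pairs."""
    bags = defaultdict(list)
    for r in records:
        bags[key(r)].append(r)
    return list(bags.items())


# Roughly: UniqUsers = DISTINCT Users; Adult = FILTER UniqUsers BY age >= 18;
adults = pig_filter(pig_distinct(users), lambda r: r[1] >= 18)
# Roughly: AdultUrls = GROUP Adult BY Url;
adult_urls = pig_group(adults, key=lambda r: r[2])
```

Note that GROUP yields one record per key, holding the key and a bag of all matching records; this is why a later FOREACH can apply an aggregate such as COUNT to each bag.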
Example: A Simple Word Count
Filled in, with filtering and results ordered high to low
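A = LOAD 'user/training/input/shakes.txt';                        -- one record per line of text
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;  -- split each line into words
C = FILTER B BY word MATCHES '\\w+';                              -- keep tokens made only of word characters
D = GROUP C BY word;                                              -- one group per distinct word
E = FOREACH D GENERATE COUNT(C) AS count, group AS word;          -- count the records in each group
F = ORDER E BY count DESC;                                        -- most frequent words first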
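For readers who want to check what the script above computes, the following plain-Python sketch mirrors steps B through F on an in-memory list of lines. It is an illustration only, under stated assumptions: the function name `word_count` and the sample input are hypothetical, tokens are split on whitespace only (Pig's TOKENIZE also splits on some punctuation), and ties are broken alphabetically so the result is deterministic, which Pig's ORDER does not promise.

```python
import re
from collections import Counter


def word_count(lines):
    """Return (count, word) pairs, highest count first."""
    # B: TOKENIZE + FLATTEN, giving one word per record.
    words = [w for line in lines for w in line.split()]
    # C: MATCHES must match the whole string, so use fullmatch.
    words = [w for w in words if re.fullmatch(r"\w+", w)]
    # D + E: GROUP BY word, then COUNT the records in each group.
    counts = Counter(words)
    # F: ORDER BY count DESC (alphabetical tie-break is an added assumption).
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda t: (-t[0], t[1]))


# Hypothetical sample input; not the contents of shakes.txt.
sample = ["to be or not to be", "that is the question: to be"]
result = word_count(sample)
```

With this sample, the token `question:` is dropped by the filter because the trailing colon is not a word character, which is the same effect the MATCHES '\\w+' step has in the Pig script.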
At the grunt> prompt, DESCRIBE is very useful. DUMP is too, if you can handle the flood of output.
Pig execution can produce a lot of unnecessary console output!
grunt> set debug off;   (re-enable with: grunt> set debug on;)
$ pig -4 ~/workshop/conf/nolog.conf <exec cmd>