
MapReduce flow chart :-

Here I am taking a small example, the WordCount job, which is like the "hello world"
example of MapReduce.
Here I take a sample file, file.txt, and we assume that its size is 200MB.
If we store this file into HDFS, it is split into 4 input splits, as the default block
size is 64MB. Since 200MB = 3 x 64MB + 8MB, we get three 64MB splits and one 8MB split.
file.txt :- (200MB)

hi how are you
how is your job
                                    <-- split 1 (64MB)
how is your family
how is your sister
                                    <-- split 2 (64MB)
how is your brother
what is the time now
                                    <-- split 3 (64MB)
What are your strengths in hadoop
                                    <-- split 4 (64MB block, but only 8MB used)

In this job I have to count the number of occurrences of each word in this file, so I
need to get output like this:

hi      1
how     5
.......
We can execute the above job by giving the following command :-
$> hadoop jar test.jar DriverCode file.txt TestOutput
--> As the Mapper and Reducer understand only key-value pairs, the given input must be
converted into key-value pairs. Hadoop provides a RecordReader that does this
conversion, so we don't need to write extra logic.
--> The RecordReader takes one line at a time from the input split and converts it into
a key-value pair, based on the file format given by the user.
--> If you are not specifying any file format, then by default it is TextInputFormat,
and the RecordReader generates the key-value pair as (byte offset, entire line).
--> The four file input formats are
1) TextInputFormat
2) KeyValueTextInputFormat
3) SequenceFileInputFormat
4) SequenceFileAsTextInputFormat
--> If you are not specifying the file format, then by default it is TextInputFormat.
--> If the input file format is TextInputFormat, then it takes the key-value pair as
(byte offset, entire line).
--> '\n' is also counted as one character.
--> next byte offset = current byte offset + number of characters in the previous line
(including its '\n').
Ex :- (0, hi how are you)
      (15, how is your job)      ("hi how are you" is 14 characters, plus 1 for '\n')
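This offset arithmetic is easy to check. Below is a minimal sketch in plain Java (not
the actual Hadoop RecordReader), deriving the (byte offset, line) keys for the two
example lines, assuming '\n' line endings:

public class ByteOffsetDemo {
    public static void main(String[] args) {
        String[] lines = { "hi how are you", "how is your job" };
        long offset = 0;                 // byte offsets can exceed int range, so use long
        for (String line : lines) {
            System.out.println("(" + offset + ", " + line + ")");
            offset += line.length() + 1; // previous line's characters plus one for '\n'
        }
    }
}

This prints (0, hi how are you) followed by (15, how is your job).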
--> This is parallel processing: if the entire job were submitted to a single system it
might take a long time, but Hadoop solves this problem because the job is submitted to
multiple systems that work together to give the output in less time.
Primitive type    Wrapper class    Box class

int               Integer          IntWritable
float             Float            FloatWritable
long              Long             LongWritable
double            Double           DoubleWritable
String            String           Text
char              Character        Text

We have many versions available in Java, and each version brings some enhancements. We
have had primitive types in Java since the very first JDK versions. The Collections
framework supports only object types, not primitive types, so one version of Java
introduced wrapper classes for the primitive types.
--> So in order to convert primitives to objects, or objects to primitives, we have
some methods to convert them.
--> For example, if you want to convert an "int" to an IntWritable, then we do the
following:
new IntWritable(1);
--> If you want to convert from "IntWritable" to "int", then we should go for the
get() method.
--> For example, if you want to convert a "float" to a "FloatWritable", then we do the
following:
new FloatWritable(1.8f);
--> If you want to convert from "FloatWritable" to "float", then we should go for the
get() method.
--> For example, if you want to convert a "String" to a "Text", then we do the
following:
new Text("sai");
--> If you want to convert from "Text" to "String", then we should go for the
toString() method.
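Putting the three conversions together, here is a small self-contained sketch (the
values 1, 1.8f and "sai" are just the illustrative ones from above; it needs the
hadoop-common jar on the classpath):

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxingDemo {
    public static void main(String[] args) {
        IntWritable iw = new IntWritable(1);        // int   -> IntWritable
        int i = iw.get();                           // IntWritable   -> int
        FloatWritable fw = new FloatWritable(1.8f); // float -> FloatWritable
        float f = fw.get();                         // FloatWritable -> float
        Text t = new Text("sai");                   // String -> Text
        String s = t.toString();                    // Text   -> String
        System.out.println(i + " " + f + " " + s);  // 1 1.8 sai
    }
}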
Q) Which is the feasible type for the byte offset?
--> The Long type is feasible, because the byte offset counts characters from the start
of the file, and for a large file this count can exceed what an int can hold.
--> These box classes are important because the Mapper and Reducer classes work on
them as key-value pairs.
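A quick check of the ranges involved shows why (plain Java):

public class OffsetRangeDemo {
    public static void main(String[] args) {
        // An int offset caps out at about 2GB worth of byte positions...
        System.out.println(Integer.MAX_VALUE); // 2147483647
        // ...while a long comfortably covers any file HDFS can store.
        System.out.println(Long.MAX_VALUE);    // 9223372036854775807
    }
}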
For example we take this file (each line has a tab after the leading number):
11	sai 1
12	raghav 2
13	arram 3
--> If the RecordReader identifies that your file is "TextInputFormat", then it takes
the key-value pair as (byte offset, entire line).
--> In the DriverCode, if you specify the file format as "KeyValueTextInputFormat",
then your RecordReader will read one line from the input split, check for the "first
tab space", and split the line there: everything before the tab becomes the key and
the "remaining" content becomes the value.
Ex :-

Assume that the RecordReader converts the above file into key-value pairs as...
(11, sai 1)
(12, raghav 2)
(13, arram 3)
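With the old org.apache.hadoop.mapred API used in the code later in this document,
choosing this behaviour is a one-line change in the DriverCode (conf being the JobConf
object from the driver shown below):

import org.apache.hadoop.mapred.KeyValueTextInputFormat;
// in the driver, instead of TextInputFormat:
conf.setInputFormat(KeyValueTextInputFormat.class);

Note that the Mapper's input key type then becomes Text (the part before the first
tab) instead of LongWritable.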
Coming back to our 200MB file.txt:
--> According to the HDFS default block size, it splits the file into three 64MB
blocks and one 8MB block.
--> So we will have 4 Mappers here: as many Mappers are created as there are input
splits, and there are as many RecordReaders as there are Mappers.
1st Mapper output is :-
(hi, 1)
(how, 1)
(are, 1)
(you, 1)
(how, 1)
(is, 1)
(your, 1)
(job, 1)
2nd Mapper output is :-
(how, 1)
(is, 1)
(your, 1)
(family, 1)
(how, 1)
(is, 1)
(your, 1)
(sister, 1)
3rd Mapper output is :-
(how, 1)
(is, 1)
(your, 1)
(brother, 1)
(what, 1)
(is, 1)
(the, 1)
(time, 1)
(now, 1)
4th Mapper output is :-
(what, 1)
(are, 1)
(your, 1)
(strengths, 1)
(in, 1)
(hadoop, 1)
--> So we got the output from all 4 Mappers as key-value pairs. Our job is still not
completed; it will be completed when we combine the outputs from all the Mappers. This
is done by the Reducer.
--> The Reducer is a class which combines all the outputs of the Mappers and gives the
output as key-value pairs.
--> The data which is generated between the Mapper and the Reducer is called the
Intermediate Data.
Diagrammatic representation of a MapReduce job :-

Input file placed in HDFS (200MB)
    -> 4 Input Splits
    -> 4 RecordReaders, each emitting (byte offset, entire line) pairs,
       eg (0, hi how are you), (15, how is your job)
    -> 4 Mappers, emitting the (word, 1) pairs listed above
As the keys are duplicated in the above output, ie in the Intermediate Data, it will be
processed by 2 more phases, called the shuffling and sorting phases.
Shuffling phase :-
In this phase all the values associated with a single identical key are combined.
Ex :-
(hi, [1])
(how, [1,1,1,1,1])
(you, [1])
(your, [1,1,1,1,1])
..........
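The real shuffling happens inside the framework, across the machines of the cluster,
but conceptually it is just grouping by key. Here is a toy, in-memory sketch in plain
Java (the pairs are the 1st Mapper's output from above; the TreeMap also previews the
sorting phase described next):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {
    public static void main(String[] args) {
        // (word, 1) pairs as emitted by the 1st Mapper.
        String[] mapperOutput = { "hi", "how", "are", "you", "how", "is", "your", "job" };
        // Group every 1 under its word, as the shuffling phase does;
        // a TreeMap keeps the keys in natural (ascending) order.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String word : mapperOutput) {
            shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        System.out.println(shuffled); // {are=[1], hi=[1], how=[1, 1], is=[1], job=[1], you=[1], your=[1]}
    }
}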
Sorting Phase :-
Here all the keys will be sorted into the default natural sorting order, ie ascending
order in this case.
If you are not developing your own Reducer class, then Hadoop will execute its default
Reducer class, IdentityReducer, which is responsible for doing only sorting; it doesn't
do shuffling.
If you are developing your own Reducer, then Hadoop is responsible for both the
shuffling and sorting phases and gives the resulting key-value pairs to the Reducer;
the Reducer then writes its output as key-value pairs into an output file and keeps it
in HDFS.
MapReduce Job (Frequency Count Example) :-
WordCountMapper.java :- (Mapper implementation)
package com.arrams.practise.mapred;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Reusable Writable objects: the count is always 1 per word occurrence.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per (byte offset, line) pair produced by the RecordReader.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            // Emit (word, 1) for every token in the line.
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
WordCountReducer.java :- (Reducer implementation)
package com.arrams.practise.mapred;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Called once per key with all of that key's shuffled values, eg (how, [1,1,1,1,1]).
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();  // unbox each IntWritable and accumulate
        }
        // Emit the final (word, total count) pair.
        output.collect(key, new IntWritable(sum));
    }
}
WordCount.java :- (Driver code implementation)
package com.arrams.practise.mapred;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the final (word, count) output pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        // The reducer is reused as a combiner to pre-aggregate on the map side.
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] = input path, args[1] = output path (see the run command below).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
--> In order to run the WordCount job we need the above 3 .java files, kept in the
same package, with the Java project built as a jar file.
In order to run the above MapReduce job we need to do the following steps :-
1) Start all the daemons of hadoop, ie namenode, job tracker, task tracker, secondary
namenode, ..., using start-all.sh, which is in hadoop's "bin" folder.
2) Ensure that all the processes have started by entering the following command on
the terminal:
$> jps
raghav@raghav-pc:~$ jps
7726 JobTracker
7633 SecondaryNameNode
835 Jps
7368 DataNode
7985 TaskTracker
3) Create a sample file in the local file system and name the file file.txt.
4) Move the file to HDFS using the following command:
$> hadoop fs -put /Desktop/file.txt /user/raghav/input/file.txt
so it creates the file in the home directory of HDFS.
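You can verify that the file reached HDFS by listing the directory used above:
$> hadoop fs -ls /user/raghav/input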
5) Create the Mapper, Reducer and Driver classes and compile them. They will compile
successfully when you place the hadoop lib jars on the classpath.
6) Place the jar file anywhere in the file system and execute the Job class (ie the
Driver code).
7) Run the WordCount class from the jar by giving the following command:
$> hadoop jar WordCount.jar WordCount /user/raghav/input/file.txt
/user/raghav/TestOutput
8) Then the output will be generated in the HDFS output directory, and the filename is
given as part-00000.
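To inspect the result, the part file can be printed straight from HDFS (paths as used
in step 7); each line is one (word, count) pair, eg "how	5":
$> hadoop fs -cat /user/raghav/TestOutput/part-00000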
