
Copyright 2010-2011 Cloudera, Inc. All rights reserved.

Not to be reproduced without prior written consent.




Cloudera's Introduction to
Apache Hadoop:
Hands-On Exercises

General Notes
Hands-On Exercise: Using HDFS
Hands-On Exercise: Run a MapReduce Job
General Notes
Cloudera's training courses use a Virtual Machine running the CentOS 5.6 Linux
distribution. This VM has Cloudera's Distribution including Apache Hadoop version
3 (CDH3) installed in Pseudo-Distributed mode. Pseudo-Distributed mode is a
method of running Hadoop whereby all five Hadoop daemons run on the same
machine. It is, essentially, a cluster consisting of a single machine. It works just like a
larger Hadoop cluster, the only key difference (apart from speed, of course!) being
that the block replication factor is set to 1, since there is only a single DataNode
available.
Points to note while working in the VM
1. The VM is set to automatically log in as the user training. Should you log out
at any time, you can log back in as the user training with the password
training.
2. Should you need it, the root password is training. You may be prompted for
this if, for example, you want to change the keyboard layout. In general, you
should not need this password since the training user has unlimited sudo
privileges.
3. In some command-line steps in the exercises, you will see lines like this:
$ hadoop fs -put shakespeare \
/user/training/shakespeare
The backslash at the end of the first line signifies that the command is not
completed, and continues on the next line. You can enter the code exactly as
shown (on two lines), or you can enter it on a single line. If you do the latter, you
should not type in the backslash.
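The equivalence of the two forms can be checked without touching Hadoop. The following is a minimal sketch that uses echo as a stand-in for the real command, so nothing is uploaded; the variable names two_line and one_line are made up for illustration.

```shell
# Two physical lines joined by a trailing backslash...
two_line=$(echo hadoop fs -put shakespeare \
  /user/training/shakespeare)

# ...and the same command typed on a single line, without the backslash.
one_line=$(echo hadoop fs -put shakespeare /user/training/shakespeare)

# The shell passes identical arguments either way.
if [ "$two_line" = "$one_line" ]; then
  echo "same command"
fi
```

The shell removes the backslash and the line break before it splits the command into arguments, which is why both forms behave the same.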
Hands-On Exercise: Using HDFS
In this exercise you will begin to get acquainted with the Hadoop tools. You
will manipulate files in HDFS, the Hadoop Distributed File System.
Hadoop
Hadoop is already installed, configured, and running on your virtual machine.
Hadoop is installed in the /usr/lib/hadoop directory. You can refer to this using
the environment variable $HADOOP_HOME, which is automatically set in any
terminal you open on your desktop.
Most of your interaction with the system will be through a command-line wrapper
called hadoop. If you start a terminal and run this program with no arguments, it
prints a help message. To try this, run the following command:
$ hadoop
(Note: although your command prompt is more verbose, we use '$' to indicate the
command prompt for brevity's sake.)
The hadoop command is subdivided into several subsystems. For example, there is
a subsystem for working with files in HDFS and another for launching and managing
MapReduce processing jobs.
Step 1: Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called
FsShell. This subsystem can be invoked with the command hadoop fs.
1. Open a terminal window (if one is not already open) by double-clicking the
Terminal icon on the desktop.
2. In the terminal window, enter:
$ hadoop fs
You see a help message describing all the commands associated with this
subsystem.
3. Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be
multiple entries, one of which is /user. Individual users have a "home"
directory under this directory, named after their username; your home
directory is /user/training.
4. Try viewing the contents of the /user directory by running:
$ hadoop fs -ls /user
You will see your home directory in the directory listing.
5. Try running:
$ hadoop fs -ls /user/training
There are no files, so the command silently exits. This is different from running
hadoop fs -ls /foo, which refers to a directory that doesn't exist and
therefore displays an error message.
Note that the directory structure in HDFS has nothing to do with the directory
structure of the local filesystem; they are completely separate namespaces.
Step 2: Uploading Files
Besides browsing the existing filesystem, another important thing you can do with
FsShell is to upload new data into HDFS.
1. Change directories to the directory containing the sample data we will be using
in the course.
$ cd ~/training_materials/developer/data
If you perform a 'regular' ls command in this directory, you will see a few files,
including two named shakespeare.tar.gz and
shakespeare-stream.tar.gz. Both of these contain the complete works of
Shakespeare in text format, but with different formats and organizations. For
now we will work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz by running:
$ tar zxvf shakespeare.tar.gz
This creates a directory named shakespeare/ containing several files on your
local filesystem.
3. Insert this directory into HDFS:
$ hadoop fs -put shakespeare /user/training/shakespeare
This copies the local shakespeare directory and its contents into a remote
HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
$ hadoop fs -ls /user/training
You should see an entry for the shakespeare directory.
5. Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls
You should see the same results. If you don't pass a directory name to the -ls
command, it assumes you mean your home directory, i.e. /user/training.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory. For example, you can see the contents of the uploaded
shakespeare directory by running:
$ hadoop fs -ls shakespeare
You could also have uploaded the Shakespeare files into HDFS by running the
following, although you should not do this now, as the directory has already
been uploaded:
$ hadoop fs -put shakespeare shakespeare
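The resolution rule described above can be sketched as a small shell function. This is only an illustration of the rule, not FsShell's own code: resolve_hdfs_path and HDFS_HOME are hypothetical names, and /user/training is the home directory used in this exercise.

```shell
# Hypothetical helper that mimics how a path argument is interpreted.
# HDFS_HOME is an assumed variable name, set to this exercise's home directory.
HDFS_HOME="/user/training"

resolve_hdfs_path() {
  case "$1" in
    "")  printf '%s\n' "$HDFS_HOME" ;;       # no path given: home directory
    /*)  printf '%s\n' "$1" ;;               # absolute path: used as-is
    *)   printf '%s\n' "$HDFS_HOME/$1" ;;    # relative path: under home
  esac
}

resolve_hdfs_path shakespeare   # relative
resolve_hdfs_path /user         # absolute
resolve_hdfs_path ""            # empty, as with a bare "hadoop fs -ls"
```

Under these assumptions the relative name shakespeare and the absolute path /user/training/shakespeare refer to the same HDFS directory.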
Step 3: Viewing and Manipulating Files
Now let's view some of the data copied into HDFS.
1. Enter:
$ hadoop fs -ls shakespeare
This lists the contents of the /user/training/shakespeare directory,
which consists of the files comedies, glossary, histories, poems, and
tragedies.
2. The glossary file included in the tarball you began with is not strictly a work
of Shakespeare, so let's remove it:
$ hadoop fs -rm shakespeare/glossary
Note that you could leave this file in place if you so wished. If you did, then it
would be included in subsequent computations across the works of
Shakespeare, and would skew your results slightly. As with many real-world big
data problems, you make trade-offs between the labor to purify your input data
and the precision of your results.
3. Enter:
$ hadoop fs -cat shakespeare/histories | tail -n 50
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command
is handy for viewing the output of MapReduce programs. Very often, an
individual output file of a MapReduce program is very large, making it
inconvenient to view the entire file in the terminal. For this reason, it's often a
good idea to pipe the output of the fs -cat command into head, tail, more,
or less.
Note that when you pipe the output of the fs -cat command to a local UNIX
command, the full contents of the file are still extracted from HDFS and sent to
your local machine. Once on your local machine, the file contents are then
modified before being displayed.
4. If you want to download a file and manipulate it in the local filesystem, you can
use the fs -get command. This command takes two arguments: an HDFS path
and a local path. It copies the HDFS contents into the local filesystem:
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt
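The note in step 3 about piping can be illustrated locally. In this minimal sketch the function emit_file is a made-up stand-in for hadoop fs -cat: it writes a 1000-line "file" to standard output, and the filtering happens only on the receiving side of the pipe.

```shell
# Stand-in for "hadoop fs -cat <file>": emits every line of a 1000-line file.
emit_file() {
  seq 1 1000
}

# The producer writes all 1000 lines into the pipe; tail keeps only the last 3.
emit_file | tail -n 3

# Counting what crosses the pipe confirms nothing was filtered upstream.
emit_file | wc -l | tr -d ' '
```

The same holds for the real command: tail, head, more, and less only decide what is displayed, not how much data leaves HDFS.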

Other Commands
There are several other commands associated with the FsShell subsystem, to
perform most common filesystem manipulations: rmr (recursive rm), mv, cp,
mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands within FsShell. Try
playing around with a few of these commands if you like.
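As a starting point for experimenting, here is a hedged sketch of the commands named above. The run wrapper is a hypothetical helper that only prints each command unless RUN=1 is set, and the HDFS names scratch and poems-copy are made up for illustration.

```shell
# Hypothetical dry-run wrapper: prints each command, and only executes it
# when RUN=1 is set (for example inside the exercise VM).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# "scratch" and "poems-copy" are made-up names used for illustration.
run hadoop fs -mkdir scratch                          # create a directory
run hadoop fs -cp shakespeare/poems scratch/poems     # copy within HDFS
run hadoop fs -mv scratch/poems scratch/poems-copy    # rename (move)
run hadoop fs -rmr scratch                            # recursive delete
```

Running the block with RUN=1 in the VM would execute the commands for real; the final -rmr removes everything the earlier lines created.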
This is the end of the Exercise
Hands-On Exercise: Run a MapReduce Job
In this exercise you will compile Java files, create a JAR, and run MapReduce
jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to
launch MapReduce jobs. The code for a job is contained in a compiled JAR file.
Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the
individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of
each word in a file or set of files. In this lab you will compile and submit a
MapReduce job to count the number of occurrences of every word in the works of
Shakespeare.
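To get a feel for the computation before running it on Hadoop, the same count can be sketched as a local shell pipeline. This is an analogy only, not how Hadoop executes the job: the function name word_count is made up, and splitting words on non-letters is an assumption that may differ from what the course's Java classes do.

```shell
# Local analogy of word count; not how Hadoop executes the job.
word_count() {
  tr -cs '[:alpha:]' '\n' |       # "map": emit one word per line
    grep -v '^$' |                # drop the empty line a leading non-letter produces
    sort |                        # "shuffle": bring identical words together
    uniq -c |                     # "reduce": count each group of equal words
    awk '{ print $2 "\t" $1 }'    # print as word<TAB>count
}

printf 'to be or not to be\n' | word_count
```

The three stages loosely mirror the map, shuffle, and reduce phases, except that everything here runs in one process chain on one machine.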
Compiling and Submitting a MapReduce Job
1. In a terminal window, change to the working directory, and take a directory
listing:
$ cd ~/training_materials/developer/exercises/wordcount
$ ls
This directory contains a README file and the following Java files:
WordCount.java: A simple MapReduce driver class.
WordCountWTool.java: A driver class that accepts generic options.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this
directory while you execute the following commands.
2. Compile the four Java classes:
$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java
Your command includes the classpath for the Hadoop core API classes. The
compiled (.class) files are placed in your local directory. These Java files use
the 'old' mapred API package, which is still valid and in common use: ignore
any notes about deprecation of the API which you may see.
3. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar *.class
4. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences
of each word in Shakespeare:
$ hadoop jar wc.jar WordCount shakespeare wordcounts
This hadoop jar command names the JAR file to use (wc.jar), the class
whose main method should be invoked (WordCount), and the HDFS input and
output directories to use for the MapReduce job.
Your job reads all the files in your HDFS shakespeare directory, and places its
output in a new HDFS directory called wordcounts.
5. Try running this same command again without any change:
$ hadoop jar wc.jar WordCount shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails
if your job tries to write its output into an existing directory. This is by design:
since the result of a MapReduce job may be expensive to reproduce, Hadoop
tries to prevent you from accidentally overwriting previously existing files.
6. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there
should be one file, named part-00000, along with a _SUCCESS file and a _logs
directory.)
7. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-00000 | less
You can page through a few screens to see words and their frequencies in the
works of Shakespeare. Note that you could have specified wordcounts/* just
as well in this command.
8. Try running the WordCount job against a single file:
$ hadoop jar wc.jar WordCount shakespeare/poems pwords
When the job completes, inspect the contents of the pwords directory.
9. Clean up the output files produced by your job runs:
$ hadoop fs -rmr wordcounts pwords
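The existing-output failure seen in step 5 can be caught before submitting. The following is a minimal sketch, not part of the course materials: safe_wordcount is a hypothetical wrapper name, and it assumes your FsShell version offers a -test -e option that exits with status 0 when the path exists (check the usage report from hadoop fs).

```shell
# Hypothetical wrapper; the function name is ours, not part of Hadoop.
# Assumes "hadoop fs -test -e <path>" exits 0 when the path exists.
safe_wordcount() {
  input="$1"
  output="$2"
  if hadoop fs -test -e "$output"; then
    echo "Output directory '$output' already exists; remove it or pick another name." >&2
    return 1
  fi
  hadoop jar wc.jar WordCount "$input" "$output"
}

# Example call (needs the exercise VM, so it is left commented out here):
# safe_wordcount shakespeare wordcounts
```

Hadoop performs its own check at submission time regardless; the wrapper only replaces the exception with a shorter message.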
Stopping MapReduce Jobs
It is important to be able to stop jobs that are already running. This is useful if, for
example, you accidentally introduced an infinite loop into your Mapper. An
important point to remember is that pressing ^C to kill the current process (which
is displaying the MapReduce job's progress) does not actually stop the job itself. The
MapReduce job, once submitted to the Hadoop daemons, runs independently of any
initiating process.
Losing the connection to the initiating process does not kill a MapReduce job.
Instead, you need to tell the Hadoop JobTracker to stop the job.
1. Start another word count job like you did in the previous section:
$ hadoop jar wc.jar WordCount shakespeare count2
2. While this job is running, open another terminal window and enter:
$ hadoop job -list
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
$ hadoop job -kill jobid
The JobTracker kills the job, and the program running in the original terminal,
reporting its progress, informs you that the job has failed.
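Instead of copying the job id by hand, it can be picked out of the listing by its pattern. This is a hedged sketch: extract_job_ids is a hypothetical helper, and the sample text only stands in for real hadoop job -list output, whose exact column layout is an assumption here.

```shell
# Hypothetical helper: reads "hadoop job -list" output on stdin
# and prints every token that looks like a job id.
extract_job_ids() {
  grep -Eo 'job_[0-9]+_[0-9]+'
}

# Sample text standing in for real output; the column layout is an assumption.
printf 'JobId\tState\njob_200902131742_0002\t1\n' | extract_job_ids
```

In the VM, the helper would be fed from the real command (hadoop job -list | extract_job_ids), and each printed id could then be passed to hadoop job -kill.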
This is the end of the Exercise
