Introduction to the Hadoop Software Ecosystem
Via A. Griffins
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.
Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and layman-accessible, but the ones here were chosen because they provide core functionality and speed in Hadoop.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.
Additional Links
Cloudstory.com: 3-part series on the Hadoop ecosystem
Part 1
Part 2
Part 3
HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files. HDFS is designed to be fault-tolerant due to replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes.
Via Paul Krzyzanowski
At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. When data is loaded into HDFS, the data is replicated and split into blocks that are distributed across the DataNodes. The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.
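The block-splitting and replication scheme described above can be sketched in a few lines of Python. This is an illustrative toy, not Hadoop code: the tiny block size, the round-robin placement, and all names below are assumptions (real HDFS uses much larger blocks and rack-aware placement).

```python
# Illustrative sketch (not Hadoop code): split a file into blocks,
# replicate each block across DataNodes, and record the NameNode-style
# metadata as a block -> locations map.

BLOCK_SIZE = 64   # bytes here, purely for illustration; HDFS defaulted to 64 MB
REPLICATION = 3   # HDFS's default replication factor

def load_into_hdfs(data: bytes, datanodes: list[str]):
    """Split data into blocks and assign each block to REPLICATION nodes.

    Assumes len(datanodes) >= REPLICATION so replicas land on distinct nodes.
    """
    metadata = {}  # what the NameNode tracks: block id -> list of DataNodes
    blocks = {}    # what the DataNodes store:  block id -> bytes
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"blk_{i // BLOCK_SIZE}"
        blocks[block_id] = data[i:i + BLOCK_SIZE]
        # Round-robin placement; real HDFS placement is rack-aware.
        start = (i // BLOCK_SIZE) % len(datanodes)
        metadata[block_id] = [datanodes[(start + r) % len(datanodes)]
                              for r in range(REPLICATION)]
    return metadata, blocks
```

With four DataNodes, a 200-byte file becomes four blocks, each stored on three nodes; losing any one node still leaves two copies of every block, which is the fault-tolerance property described above.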
HDFS Architecture
Via Computer Technology Review
One significant drawback to HDFS is that it has a single point of failure (SPOF), which lies in the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which takes periodic snapshots of the NameNode and updates it, is not itself a backup NameNode.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers," which are tracked by the Container Location Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture
Via MapR
The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.
Architecture of the HDFS High Availability framework
Via Cloudera
MapReduce
The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs and the desired function is executed over each key/value pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined together according to the reduce code provided by the user (for example, summing). It is also possible that no reduce phase is required, given the type of operation coded by the user.
Via Artificial Intelligence in Motion
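The two phases can be sketched in plain, non-distributed Python; word count is the canonical example. The function names here are illustrative, and a real MapReduce framework would run many mappers and reducers in parallel across the cluster.

```python
# Minimal, single-machine sketch of the MapReduce paradigm: word count.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit an intermediate (key, value) pair for every word seen."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group intermediate pairs by key and combine values (here, sum)."""
    result = {}
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        result[key] = sum(value for _, value in group)
    return result

counts = reduce_phase(map_phase(["the cat", "the dog"]))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

The sort-then-group step stands in for the "shuffle" that a real framework performs between the two phases to bring all values for a given key to the same reducer.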
At the cluster level, the MapReduce processes are divided between two applications, JobTracker and TaskTracker. JobTracker runs on only one node of the cluster, while TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on that node. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster. JobTracker oversees the progress of each TaskTracker as they complete their individual tasks.
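The data-locality idea behind that task assignment (send the computation to the node that already holds the data) can be illustrated with a toy scheduler. This is an assumption-laden sketch, not Hadoop's actual JobTracker logic, which also weighs rack locality, slot availability, and failures.

```python
# Toy sketch of locality-aware task assignment: prefer a node that
# already stores the task's input block; fall back to any node otherwise.

def assign_tasks(tasks, block_locations, nodes):
    """tasks: {task_id: input block id}
    block_locations: {block_id: [nodes holding a replica]}
    nodes: all TaskTracker nodes in the cluster."""
    assignments = {}
    for task_id, block_id in tasks.items():
        local = block_locations.get(block_id, [])
        # Data-local assignment when possible; arbitrary node otherwise.
        assignments[task_id] = local[0] if local else nodes[0]
    return assignments
```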
MapReduce Architecture
Via Computer Technology Review
YARN
As Hadoop became more widely adopted and used on clusters with up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had issues with scalability, memory usage, and synchronization, as well as its own SPOF issues. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject in the Apache Hadoop project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common.
YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service. Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster.
YARN Architecture
Via Apache
MapReduce 2.0
MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource-management framework of YARN.
YARN, contrary to widespread misconception, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.
Additional Links
Cloudera blog: MR2 and YARN Briefly Explained
Hortonworks blog: Apache Hadoop YARN: Background and an Overview
Interview with Arun Murthy, co-founder of Hortonworks, about YARN
Hadoop-Related Projects at Apache
With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all other Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects, and also illustrates the range of capabilities being developed around Hadoop. To name just a couple, Whirr and Crunch are other Hadoop-related Apache projects not described here.
Avro
Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given data set.
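The idea of resolving a record written with one schema against a newer reader schema can be illustrated with a simplified sketch. This is not the real Avro library or its schema-resolution rules in full; the schema representation and the `resolve` helper below are invented for illustration.

```python
# Simplified illustration of schema evolution (not the Avro API):
# a record written under an old schema is read under a newer reader
# schema that adds a field with a default value.

writer_schema = {"fields": ["name", "age"]}
reader_schema = {"fields": ["name", "age", "email"],
                 "defaults": {"email": "unknown"}}

def resolve(record, writer, reader):
    """Project a written record onto the reader schema, filling in
    defaults for fields the writer schema did not have."""
    out = {}
    for field in reader["fields"]:
        if field in writer["fields"]:
            out[field] = record[field]
        else:
            out[field] = reader["defaults"][field]
    return out
```

Real Avro performs this resolution using the writer's schema stored alongside the data, which is why older records stay readable as schemas evolve.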
Additional Links
Avro in 3 minutes
BigTop
BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but it has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: Zookeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu, and Hue.
Additional Links
Apache blog post: What is BigTop?
Chukwa
Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing, and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.
Drill
Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.
Flume
Flume is a tool for harvesting, aggregating, and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks," and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).
Additional Links
Flume in 3 minutes
HBase
Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on Zookeeper and runs a Zookeeper instance by default.
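Why column orientation helps with "all rows, few columns" workloads can be shown with a toy store that groups values by column rather than by row. This is a deliberately simplified sketch, not HBase's actual storage format (which organizes data into column families, regions, and HFiles).

```python
# Toy sketch of column-oriented storage: values are grouped by column,
# so scanning one column across every row never touches other columns.

class ColumnStore:
    def __init__(self):
        self.columns = {}  # column name -> {row key: value}

    def put(self, row, column, value):
        self.columns.setdefault(column, {})[row] = value

    def scan_column(self, column):
        """Read one column across all rows: the access pattern that
        column-oriented stores make cheap."""
        return dict(self.columns.get(column, {}))
```

A row-oriented store would have to read every full row to answer the same scan; here the scan touches exactly one column's data regardless of how many other columns exist.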
Additional Links
HBase in 3 minutes
HCatalog
An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig (with plans to expand to HBase), using a common data model. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.
Additional Links
HCatalog in 3 minutes
Hive
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
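The table/partition/bucket hierarchy maps directly onto HDFS paths, which a short sketch can make concrete. The warehouse path, bucket count, and hash function below are illustrative assumptions; Hive's real bucketing uses its own hash of the clustering column, not the byte sum shown here.

```python
# Illustrative sketch of Hive's physical layout: a table is an HDFS
# directory, each partition a subdirectory, and each bucket a file
# chosen by hashing the bucketing column's value.

NUM_BUCKETS = 4  # assumed bucket count for this sketch

def hdfs_path(table, partition_col, partition_val, bucket_col_value):
    """Return the HDFS path a row would land in.
    The hash is simplified (sum of bytes); Hive uses its own hash."""
    bucket = sum(str(bucket_col_value).encode()) % NUM_BUCKETS
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"
```

Because partitions are plain subdirectories, a query filtered on the partition column can skip whole directories instead of scanning the entire table.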
Additional Links
Hive in 3 minutes
Mahout
Mahout is a scalable machine learning and data mining library. There are currently four main groups of algorithms in Mahout:
recommendations, a.k.a. collaborative filtering
classification, a.k.a. categorization
clustering
frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable, that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce.
Additional Links
Mahout and Machine Learning in 3 minutes
Oozie
Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must occur in a sequence and are subject to certain constraints.
Additional Links
Oozie in 3 minutes
Pig
Pig is a framework consisting of a high-level scripting language (Pig Latin) and a runtime environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats, due to its data model. Via the Pig wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semi-structured data, e.g. JSON or XML."
Additional Links
Pig in 3 minutes
Sqoop
Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.
According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."
ZooKeeper
ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system." ZooKeeper itself is a distributed service with "master" and "slave" nodes, and stores configuration information, etc. in memory on ZooKeeper servers.
Additional Links
Zookeeper in 3 minutes
HadoopRelatedProjectsOutsideApache
TherearealsoprojectsoutsideofApachethatbuildonorparallelthemajorHadoopprojectsatApache.
Severalofinterestaredescribedhere.
Spark (UC Berkeley)
Spark is a parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMP Lab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data
The first feature is the key to doing iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk), and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets. Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth-round results to HDFS and reading them back out for the (n+1)th round.
Additional Links
http://www.youtube.com/watch?v=N3ITxQcf6uQ
Shark (UC Berkeley)
Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but it gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). This way, Shark users can execute queries in HiveQL over the same HDFS datasets, but receive results in near-real-time fashion.
Impala (Cloudera)
Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)