
Hadoop Ecosystem

Introduction to the Hadoop Software Ecosystem

Via A. Griffins
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.

Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.

The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and layman-accessible, but the ones here were chosen because they provide core functionality and speed in Hadoop.

The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not all meant to be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.

Additional Links
Cloudstory.com: 3-part series on the Hadoop ecosystem
Part 1
Part 2
Part 3

HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files. HDFS is designed to be fault-tolerant due to data replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes.

Via Paul Krzyzanowski

At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. When data is loaded into HDFS, the data is replicated and split into blocks that are distributed across the DataNodes. The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.
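The load path described above (file to blocks, replicas placed on DataNodes, locations tracked by the NameNode) can be sketched as a toy model. The block size, replication factor, and node names below are illustrative stand-ins, not real HDFS defaults or APIs:

```python
# Toy model of how HDFS splits a file into blocks and replicates them
# across DataNodes. Values are tiny for demonstration; real HDFS uses
# 64/128 MB blocks and a default replication factor of 3.

BLOCK_SIZE = 4          # bytes (illustrative only)
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a byte string into fixed-size blocks, as HDFS does on load."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Return NameNode-style metadata: block id -> list of DataNodes.

    Real HDFS placement is rack-aware; round-robin is used here only to
    illustrate that each block lives on several nodes."""
    metadata = {}
    for block_id, _ in enumerate(blocks):
        metadata[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                              for r in range(replication)]
    return metadata

blocks = split_into_blocks(b"hello hadoop ecosystem")
metadata = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
# The NameNode holds only `metadata`; the DataNodes hold the blocks.
```

Losing any one DataNode in this sketch leaves every block reachable on two other nodes, which is the fault-tolerance property the paragraph above describes.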

HDFS Architecture
Via Computer Technology Review
One significant drawback to HDFS is that it has a single point of failure (SPOF), which lies in the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which takes periodic snapshots of the NameNode's metadata, is, despite its name, not a backup NameNode.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "containers," which are tracked by the Container Location Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture
Via MapR
The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.

Architecture of HDFS High Availability framework
Via Cloudera

MapReduce
The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs and the desired function is executed over each key/value pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined together according to the reduce code provided by the user, for example summing. It is also possible that no reduce phase is required, given the type of operation coded by the user.
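The two phases can be illustrated with the classic word-count example in a few lines of plain Python. The function names are illustrative; a real Hadoop job would implement Mapper and Reducer classes in Java:

```python
from itertools import groupby
from operator import itemgetter

# A minimal, single-process sketch of the MapReduce paradigm:
# word count, the canonical example.

def map_phase(records):
    """Emit an intermediate (key, value) pair for every word."""
    for _, line in records:                 # input: (key, value) pairs
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group intermediate pairs by key and combine the values (summing)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

records = [(0, "the quick brown fox"), (1, "the lazy dog")]
counts = dict(reduce_phase(map_phase(records)))
# counts == {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The `sorted` call plays the role of Hadoop's shuffle step, which delivers all intermediate pairs sharing a key to the same reducer.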

Via Artificial Intelligence in Motion
At the cluster level, the MapReduce processes are divided between two applications, JobTracker and TaskTracker. JobTracker runs on only one node of the cluster, while TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on that node. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster. JobTracker oversees the progress of each TaskTracker as they complete their individual tasks.

MapReduce Architecture
Via Computer Technology Review

YARN
As Hadoop became more widely adopted and used on clusters with up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had issues with scalability, memory usage, and synchronization, and had its own SPOF issues. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject in the Apache Hadoop project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common. YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service.
Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster.

YARN Architecture
Via Apache
MapReduce 2.0
MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource-management framework of YARN.
YARN, contrary to widespread misconception, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.

Additional Links

Cloudera blog: MR2 and YARN Briefly Explained
Hortonworks blog: Apache Hadoop YARN: Background and an Overview
Interview with Arun Murthy, co-founder of Hortonworks, about YARN

Hadoop-Related Projects at Apache
With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all other Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects, and also illustrates the range of capabilities being developed around Hadoop. To name just a couple, Whirr and Crunch are other Hadoop-related Apache projects not described here.

Avro

Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given data set.
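The idea of schema evolution can be sketched without the Avro library itself: a record written under an old schema is resolved against a newer reader schema that supplies a default for the added field. The schemas and field names below are hypothetical, and the resolution logic is a plain-Python imitation of Avro's rules, not Avro code:

```python
# Toy illustration of "schema evolution": the writer's schema lacks a
# field that the reader's schema declares with a default value.
# This mimics Avro's resolution rules in plain Python.

writer_schema = {"fields": {"name": None, "age": None}}
reader_schema = {"fields": {"name": None, "age": None, "email": "unknown"}}

def resolve(record, reader):
    """Fill in any field the writer did not know about with the reader's
    default, and drop fields the reader no longer declares."""
    return {field: record.get(field, default)
            for field, default in reader["fields"].items()}

old_record = {"name": "Ada", "age": 36}      # written before `email` existed
print(resolve(old_record, reader_schema))
# {'name': 'Ada', 'age': 36, 'email': 'unknown'}
```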

Additional Links
Avro in 3 minutes

BigTop

BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but it has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: ZooKeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu, and Hue.

Additional Links
Apache blog post: What is BigTop?

Chukwa

Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing, and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.

Drill

Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.

Flume

Flume is a tool for harvesting, aggregating, and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks," and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so that there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).
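The source, channel, and sink flow just described can be mimicked in miniature. The components below are plain-Python stand-ins for the concept, not Flume's actual Java API:

```python
from collections import deque

# A minimal sketch of Flume's source -> channel -> sink flow, with a
# per-event transform applied before data reaches the sink.

def source():
    """Pretend log harvester (a real source might tail a system log)."""
    yield from ["GET /index", "POST /login", "GET /logout"]

channel = deque()                 # buffers events between source and sink
sink = []                         # stand-in for an HDFS or HBase sink

for event in source():            # source side: harvest into the channel
    channel.append(event)

while channel:                    # sink side: drain, transform, store
    event = channel.popleft()
    sink.append(event.lower())    # transform each event before delivery
```

The channel decouples the two sides, which is what lets real Flume harvest on a schedule or on events independently of how fast the sink can absorb data.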

Additional Links
Flume in 3 minutes

HBase

Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive datasets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on ZooKeeper and runs a ZooKeeper instance by default.
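Why a column-oriented layout helps with "all rows, few columns" workloads can be seen in a toy comparison. This is a conceptual sketch only, not HBase's actual storage format:

```python
# The same small dataset laid out by row and by column. In the column
# layout, each column's values sit together, so scanning one column
# never touches the others.

row_store = [
    {"id": 1, "name": "alice", "city": "NYC", "score": 90},
    {"id": 2, "name": "bob",   "city": "LA",  "score": 75},
]

column_store = {
    "id":    [1, 2],
    "name":  ["alice", "bob"],
    "city":  ["NYC", "LA"],
    "score": [90, 75],
}

# Averaging one column over all rows reads only that column's values:
avg = sum(column_store["score"]) / len(column_store["score"])
```

In the row layout, the same average would have to walk every record and skip past the unrelated fields, which at massive scale is exactly the cost a column store avoids.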

Additional Links
HBase in 3 minutes

HCatalog

An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig, with plans to expand to HBase, using a common data model. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.

Additional Links
HCatalog in 3 minutes

Hive

Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets. Tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
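The table, partition, bucket hierarchy can be sketched as a path-building exercise. The table name, partition column, and bucket count below are hypothetical, and Python's built-in hash merely stands in for Hive's own bucketing hash function:

```python
# Sketch of how Hive's table -> partition -> bucket hierarchy maps onto
# HDFS paths: the table is a directory, each partition value a
# subdirectory, and each bucket a file chosen by hashing a column.

N_BUCKETS = 4  # hypothetical CLUSTERED BY ... INTO 4 BUCKETS

def storage_path(table, partition_col, partition_val, bucket_key):
    """Build the HDFS-style path where a row would land."""
    bucket = hash(bucket_key) % N_BUCKETS
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket}"

path = storage_path("page_views", "dt", "2013-01-01", 42)
# path == "/warehouse/page_views/dt=2013-01-01/bucket_2"
```

Partition pruning falls out of this layout: a query filtered on `dt` only needs to open the matching subdirectory rather than scan the whole table.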

Additional Links
Hive in 3 minutes

Mahout

Mahout is a scalable machine learning and data mining library. There are currently four main groups of algorithms in Mahout:
recommendations, a.k.a. collaborative filtering
classification, a.k.a. categorization
clustering
frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable: that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce.
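The scalability distinction can be made concrete with a small example: a mean decomposes into per-chunk partial results that combine associatively, which is exactly the shape MapReduce requires. This sketch is purely illustrative and unrelated to Mahout's actual Java API:

```python
# Why some algorithms scale: each node can summarize its own chunk of
# the data independently (map side), and the summaries merge into the
# exact global answer (reduce side).

def partial(chunk):
    """Map side: summarize one chunk as (count, sum)."""
    return (len(chunk), sum(chunk))

def combine(parts):
    """Reduce side: merge partial summaries into the global mean."""
    n = sum(c for c, _ in parts)
    s = sum(t for _, t in parts)
    return s / n

chunks = [[1, 2, 3], [4, 5], [6]]          # data split across three nodes
assert combine([partial(c) for c in chunks]) == 3.5
```

An algorithm whose steps each depend on seeing the entire dataset at once has no such decomposition, which is what makes it intrinsically hard to distribute.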

Additional Links
Mahout and Machine Learning in 3 minutes

Oozie

Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must occur in a sequence and are subject to certain constraints.
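The DAG model behind such a workflow can be sketched with Python's standard-library topological sorter. The action names are hypothetical, and Oozie itself defines workflows in XML rather than code:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A workflow as a DAG: each action maps to the actions it depends on.
# A valid run order is any topological sort of this graph.

workflow = {
    "import-data":  [],                          # e.g. a Sqoop action
    "clean-data":   ["import-data"],             # e.g. a Pig action
    "build-tables": ["clean-data"],              # e.g. a Hive action
    "run-report":   ["build-tables"],
}

order = list(TopologicalSorter(workflow).static_order())
# order == ['import-data', 'clean-data', 'build-tables', 'run-report']
```

The acyclicity constraint is what guarantees such an order exists; a workflow with a dependency cycle could never be scheduled.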

Additional Links
Oozie in 3 minutes

Pig

Pig is a framework consisting of a high-level scripting language (Pig Latin) and a runtime environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats, due to its data model. Via the Pig wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semi-structured data, e.g. JSON or XML."
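That data model translates naturally into plain Python terms. The field names and values below are purely illustrative:

```python
# Pig's data model in Python terms: a "bag" is a collection of tuples,
# a tuple field may itself hold a bag (nesting), and a map field suits
# semi-structured data.

# A bag of tuples; the third field of each tuple is itself a bag.
visits_by_user = [
    ("alice", 2, [("page1", "09:00"), ("page2", "09:05")]),
    ("bob",   1, [("page1", "11:30")]),
]

# A map-typed field, handy for semi-structured data such as JSON.
record = ("alice", {"browser": "firefox", "os": "linux"})

inner_bag = visits_by_user[0][2]   # a "table inside a tuple", as above
```

A strict relational model would force the inner visits out into a second table joined by key; Pig's nesting keeps them attached to the row they belong to.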

Additional Links
Pig in 3 minutes

Sqoop

Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.

According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."
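The import half of that round trip can be mimicked in miniature, with SQLite as the relational side and an in-memory buffer standing in for an HDFS file. Sqoop itself generates parallel MapReduce jobs over JDBC, so this is only a conceptual sketch:

```python
import csv
import io
import sqlite3

# "Import": pull rows out of a relational table and write them as
# delimited text, the form they would take as a file in HDFS.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

hdfs_file = io.StringIO()          # stand-in for an HDFS output file
writer = csv.writer(hdfs_file, lineterminator="\n")
for row in db.execute("SELECT id, name FROM users ORDER BY id"):
    writer.writerow(row)           # each row becomes one delimited line

print(hdfs_file.getvalue())
# 1,alice
# 2,bob
```

The export direction is the mirror image: read delimited lines and issue INSERTs back into the relational table.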

ZooKeeper

ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system." ZooKeeper itself is a distributed service with "master" and "slave" nodes, and stores configuration information, etc. in memory on ZooKeeper servers.
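The shared hierarchical namespace can be modeled as a small in-memory tree of registers. The function names below are illustrative and differ from ZooKeeper's real client API:

```python
# Toy model of ZooKeeper's namespace: small data registers ("znodes")
# addressed by filesystem-like paths and held in memory.

znodes = {"/": b""}                          # the in-memory register tree

def create(path, data):
    """Create a znode; like ZooKeeper, the parent must already exist."""
    parent = path.rsplit("/", 1)[0] or "/"
    if parent not in znodes:
        raise FileNotFoundError(f"no parent znode {parent}")
    znodes[path] = data

def get_children(path):
    """List the immediate children of a znode, directory-style."""
    prefix = path.rstrip("/") + "/"
    return sorted(p[len(prefix):] for p in znodes
                  if p.startswith(prefix) and "/" not in p[len(prefix):])

create("/config", b"")
create("/config/db_host", b"10.0.0.5")       # processes coordinate through
create("/config/db_port", b"5432")           # shared registers like these

print(get_children("/config"))
# ['db_host', 'db_port']
```

In real ZooKeeper, clients can also watch a znode and be notified when its data or children change, which is the hook that makes the namespace useful for coordination.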

Additional Links
ZooKeeper in 3 minutes

Hadoop-Related Projects Outside Apache
There are also projects outside of Apache that build on or parallel the major Hadoop projects at Apache. Several of interest are described here.

Spark (UC Berkeley)

Spark is a parallel computing framework which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
Spark holds intermediate results in memory, rather than writing them to disk, which drastically reduces query return time.
Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data.
The first feature is the key to doing iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk), and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets. Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth round's results to HDFS and reading them back out for the (n+1)th round.
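The disk-versus-memory difference can be made concrete by counting reads in a toy loop. The counter stands in for I/O cost, and none of this is the Spark API:

```python
# Compare iteration styles: MapReduce-style re-reads its input from
# storage every cycle, while a Spark-style loop reads once and iterates
# on the cached in-memory result.

disk_reads = 0

def read_from_hdfs():
    """Stand-in for an HDFS read; counts how often storage is touched."""
    global disk_reads
    disk_reads += 1
    return list(range(5))

# MapReduce style: each of 3 iterations round-trips through storage.
for _ in range(3):
    data = read_from_hdfs()
    data = [x + 1 for x in data]     # some per-iteration computation

mapreduce_reads = disk_reads

# Spark style: read once, then iterate on the in-memory result,
# like an RDD cached across computations.
disk_reads = 0
cached = read_from_hdfs()
for _ in range(3):
    cached = [x + 1 for x in cached]

# mapreduce_reads == 3, disk_reads == 1
```

With thousands of iterations and terabytes per round, that 3-vs-1 gap is the "drastically reduces query return time" claim above.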

Additional Links
http://www.youtube.com/watch?v=N3ITxQcf6uQ

Shark (UC Berkeley)

Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but it gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). This way, Shark users can execute queries in HiveQL over the same HDFS datasets, but receive results in near-real-time fashion.

Impala (Cloudera)

Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)
