Introduction to the Hadoop Software Ecosystem
Via A. Griffins
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.
Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and layman-accessible, but the ones here were chosen because they provide core functionality and speed in Hadoop.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.
Additional Links
Cloudstory.com: 3-part series on the Hadoop ecosystem
Part 1
Part 2
Part 3
HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the summed total size of the files. HDFS is designed to be fault-tolerant due to replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes.
Via Paul Krzyzanowski
At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. When data is loaded into HDFS, the data is replicated and split into blocks that are distributed across the DataNodes. The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.
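The block-splitting and replication scheme described above can be sketched in a few lines of Python. This is an illustrative toy, not Hadoop code: the tiny block size, the round-robin placement, and all names below are assumptions (real HDFS uses much larger blocks and rack-aware placement).

```python
# Illustrative sketch (not Hadoop code): split a file into blocks,
# replicate each block across DataNodes, and record the NameNode-style
# metadata as a block -> locations map.

BLOCK_SIZE = 64   # bytes here, purely for illustration; HDFS defaulted to 64 MB
REPLICATION = 3   # HDFS's default replication factor

def load_into_hdfs(data: bytes, datanodes: list[str]):
    """Split data into blocks and assign each block to REPLICATION nodes.

    Assumes len(datanodes) >= REPLICATION so replicas land on distinct nodes.
    """
    metadata = {}  # what the NameNode tracks: block id -> list of DataNodes
    blocks = {}    # what the DataNodes store:  block id -> bytes
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"blk_{i // BLOCK_SIZE}"
        blocks[block_id] = data[i:i + BLOCK_SIZE]
        # Round-robin placement; real HDFS placement is rack-aware.
        start = (i // BLOCK_SIZE) % len(datanodes)
        metadata[block_id] = [datanodes[(start + r) % len(datanodes)]
                              for r in range(REPLICATION)]
    return metadata, blocks
```

With four DataNodes, a 200-byte file becomes four blocks, each stored on three nodes; losing any one node still leaves two copies of every block, which is the fault-tolerance property described above.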
HDFS Architecture
Via Computer Technology Review
One significant drawback to HDFS is that it has a single point of failure (SPOF), which lies in the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which takes periodic snapshots of the NameNode and updates it, is not itself a backup NameNode.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers," which are tracked by the Container Location Database (CLDB).
Regular NameNode architecture vs. MapR's distributed NameNode architecture
Via MapR
The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.
Architecture of the HDFS High Availability framework
Via Cloudera
MapReduce
The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs and the desired function is executed over each key/value pair in order to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined together according to the reduce code provided by the user (for example, summing). It is also possible that no reduce phase is required, given the type of operation coded by the user.
Via Artificial Intelligence in Motion
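The two phases can be sketched in plain, non-distributed Python; word count is the canonical example. The function names here are illustrative, and a real MapReduce framework would run many mappers and reducers in parallel across the cluster.

```python
# Minimal, single-machine sketch of the MapReduce paradigm: word count.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit an intermediate (key, value) pair for every word seen."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group intermediate pairs by key and combine values (here, sum)."""
    result = {}
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        result[key] = sum(value for _, value in group)
    return result

counts = reduce_phase(map_phase(["the cat", "the dog"]))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

The sort-then-group step stands in for the "shuffle" that a real framework performs between the two phases to bring all values for a given key to the same reducer.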
At the cluster level, the MapReduce processes are divided between two applications, JobTracker and TaskTracker. JobTracker runs on only one node of the cluster, while TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers depending on which data is stored on that node. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster. JobTracker oversees the progress of each TaskTracker as they complete their individual tasks.
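The data-locality idea behind that task assignment (send the computation to the node that already holds the data) can be illustrated with a toy scheduler. This is an assumption-laden sketch, not Hadoop's actual JobTracker logic, which also weighs rack locality, slot availability, and failures.

```python
# Toy sketch of locality-aware task assignment: prefer a node that
# already stores the task's input block; fall back to any node otherwise.

def assign_tasks(tasks, block_locations, nodes):
    """tasks: {task_id: input block id}
    block_locations: {block_id: [nodes holding a replica]}
    nodes: all TaskTracker nodes in the cluster."""
    assignments = {}
    for task_id, block_id in tasks.items():
        local = block_locations.get(block_id, [])
        # Data-local assignment when possible; arbitrary node otherwise.
        assignments[task_id] = local[0] if local else nodes[0]
    return assignments
```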
MapReduce Architecture
Via Computer Technology Review
YARN
As Hadoop became more widely adopted and used on clusters with up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had issues with scalability, memory usage, and synchronization, as well as its own SPOF issues. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject in the Apache Hadoop project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common.
YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service. Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster.
YARN Architecture
Via Apache
MapReduce 2.0
MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource-management framework of YARN.
YARN, contrary to widespread misconception, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.
Additional Links
Cloudera blog: MR2 and YARN Briefly Explained
Hortonworks blog: Apache Hadoop YARN: Background and an Overview
Interview with Arun Murthy, co-founder of Hortonworks, about YARN
Hadoop-Related Projects at Apache
With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all other Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects, and also illustrates the range of capabilities being developed around Hadoop. To name just a couple, Whirr and Crunch are other Hadoop-related Apache projects not described here.
Avro
Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given data set.
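The idea of resolving a record written with one schema against a newer reader schema can be illustrated with a simplified sketch. This is not the real Avro library or its schema-resolution rules in full; the schema representation and the `resolve` helper below are invented for illustration.

```python
# Simplified illustration of schema evolution (not the Avro API):
# a record written under an old schema is read under a newer reader
# schema that adds a field with a default value.

writer_schema = {"fields": ["name", "age"]}
reader_schema = {"fields": ["name", "age", "email"],
                 "defaults": {"email": "unknown"}}

def resolve(record, writer, reader):
    """Project a written record onto the reader schema, filling in
    defaults for fields the writer schema did not have."""
    out = {}
    for field in reader["fields"]:
        if field in writer["fields"]:
            out[field] = record[field]
        else:
            out[field] = reader["defaults"][field]
    return out
```

Real Avro performs this resolution using the writer's schema stored alongside the data, which is why older records stay readable as schemas evolve.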
Additional Links
Avro in 3 minutes
BigTop
BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but it has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: Zookeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu, and Hue.
Additional Links
Apache blog post: What is BigTop?
Chukwa
Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing, and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.
Drill
Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.
Flume
Flume is a tool for harvesting, aggregating, and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks," and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).
Additional Links
Flume in 3 minutes
HBase
Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on Zookeeper and runs a Zookeeper instance by default.
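Why column orientation helps with "all rows, few columns" workloads can be shown with a toy store that groups values by column rather than by row. This is a deliberately simplified sketch, not HBase's actual storage format (which organizes data into column families, regions, and HFiles).

```python
# Toy sketch of column-oriented storage: values are grouped by column,
# so scanning one column across every row never touches other columns.

class ColumnStore:
    def __init__(self):
        self.columns = {}  # column name -> {row key: value}

    def put(self, row, column, value):
        self.columns.setdefault(column, {})[row] = value

    def scan_column(self, column):
        """Read one column across all rows: the access pattern that
        column-oriented stores make cheap."""
        return dict(self.columns.get(column, {}))
```

A row-oriented store would have to read every full row to answer the same scan; here the scan touches exactly one column's data regardless of how many other columns exist.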
Additional Links
HBase in 3 minutes
HCatalog
An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig (with plans to expand to HBase), using a common data model. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.
Additional Links
HCatalog in 3 minutes
Hive
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
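The table/partition/bucket hierarchy maps directly onto HDFS paths, which a short sketch can make concrete. The warehouse path, bucket count, and hash function below are illustrative assumptions; Hive's real bucketing uses its own hash of the clustering column, not the byte sum shown here.

```python
# Illustrative sketch of Hive's physical layout: a table is an HDFS
# directory, each partition a subdirectory, and each bucket a file
# chosen by hashing the bucketing column's value.

NUM_BUCKETS = 4  # assumed bucket count for this sketch

def hdfs_path(table, partition_col, partition_val, bucket_col_value):
    """Return the HDFS path a row would land in.
    The hash is simplified (sum of bytes); Hive uses its own hash."""
    bucket = sum(str(bucket_col_value).encode()) % NUM_BUCKETS
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"
```

Because partitions are plain subdirectories, a query filtered on the partition column can skip whole directories instead of scanning the entire table.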
Additional Links
Hive in 3 minutes
Mahout
Mahout is a scalable machine learning and data mining library. There are currently four main groups of algorithms in Mahout:
recommendations, a.k.a. collaborative filtering
classification, a.k.a. categorization
clustering
frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable, that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce.
Additional Links
Mahout and Machine Learning in 3 minutes
Oozie
Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), which is a common model for tasks that must occur in a sequence and are subject to certain constraints.
Additional Links
Oozie in 3 minutes
Pig
Pig is a framework consisting of a high-level scripting language (Pig Latin) and a runtime environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats, due to its data model. Via the Pig wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semi-structured data, e.g. JSON or XML."
Additional Links
Pig in 3 minutes
Sqoop
Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.
According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."
ZooKeeper
ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (we call these registers znodes), much like a file system." ZooKeeper itself is a distributed service with "master" and "slave" nodes, and stores configuration information, etc. in memory on ZooKeeper servers.
Additional Links
Zookeeper in 3 minutes
HadoopRelatedProjectsOutsideApache
TherearealsoprojectsoutsideofApachethatbuildonorparallelthemajorHadoopprojectsatApache.
Severalofinterestaredescribedhere.
Spark (UC Berkeley)
Spark is a parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMP Lab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data
The first feature is the key to doing iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk), and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets. Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth-round results to HDFS and reading them back out for the (n+1)th round.
Additional Links
http://www.youtube.com/watch?v=N3ITxQcf6uQ
Shark (UC Berkeley)
Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but it gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). This way, Shark users can execute queries in HiveQL over the same HDFS datasets, but receive results in near-real-time fashion.
Impala (Cloudera)
Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)