Hadoop Operations
Managing Petabytes with Open Source
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
June 22, 2009
HDFS
... redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.
Figure: HDFS distributes file blocks among servers
Monday, June 22, 2009
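The replication idea on the HDFS slide can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the server names, and the tiny block size are all hypothetical, and the round-robin placement is a simple stand-in for HDFS's actual rack-aware placement policy. The only property it demonstrates is the one stated above: each piece of a file ends up on three different servers.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers.

    Round-robin placement is an assumption made for this sketch;
    real HDFS chooses replica locations with a rack-aware policy.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Rotate the starting server so blocks spread evenly across the pool.
        placement[block_id] = [
            servers[(block_id + r) % len(servers)] for r in range(replication)
        ]
    return placement


# Hypothetical five-server pool and a 1000-byte "file" with 256-byte blocks.
servers = ["server-1", "server-2", "server-3", "server-4", "server-5"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), servers)

for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on three distinct servers, losing any single server (or any two) still leaves at least one copy of each block readable.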
Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
    ▪ Java
    ▪ C++
    ▪ Streaming
    ▪ Pig
    ▪ SQL (Hive)
Figure: Hadoop pushes work out to the data
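The key/value data model on the MapReduce slide can be illustrated with a word count that runs in a single Python process. This is a sketch of the programming model only, not the Hadoop API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and the in-memory grouping step stands in for the distributed shuffle that Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Input records are (line number, line text) key/value pairs.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_job(enumerate(lines), map_words, reduce_counts)
print(counts)
```

The mapper and reducer see only key/value pairs and share no state, which is what lets a framework run many copies of each in parallel and rerun a failed copy without affecting the others.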
Hadoop Subprojects
▪ Avro
    ▪ Cross-language serialization for RPC and persistent storage
▪ HBase
    ▪ Table storage on top of HDFS, modeled after Google’s Bigtable
▪ Hive
    ▪ SQL interface to structured data stored in HDFS
▪ Pig
    ▪ Language for dataflow programming
▪ ZooKeeper
    ▪ Coordination service for distributed systems
Hadoop Tier