Apache Hadoop is being used in three broad types of systems: as a warehouse for web analytics, as storage for a distributed database, and for MySQL database backups.
Data warehousing clusters are largely batch-processing systems where the emphasis is on scalability and throughput. We have enhanced the NameNode's locking model to scale to 30 petabytes, possibly the largest single Hadoop cluster in the world. We have multiple clusters at separate data centers, and our largest warehouse cluster currently spans three thousand machines. Our jobs scan around 2 petabytes per day in this warehouse, and more than 300 people throughout the company query it every month. Our primary bottleneck is raw disk capacity, closely followed by CPU during peak load times. The code that runs on these types of clusters is at http://github.com/facebook/hadoop-20-warehouse.
Realtime application serving is a more recent use of Apache Hadoop for us. We have enhanced Apache Hadoop to reduce latencies and improve response times. One of our newest applications, Messages, uses a distributed database called Apache HBase to store data in Apache Hadoop clusters. Another use case is to collect click logs in near real time from our web servers and stream them directly into our Hadoop clusters. We use Scribe to make this easier; it is well integrated with Apache Hadoop and stores data directly into it. Most of these realtime clusters are smaller, around a hundred nodes each. The code that runs on these clusters is at http://github.com/facebook/hadoop-20-append.
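To make the HBase part of this concrete, here is a minimal sketch of writing a record with the classic HTable/Put client API of that era. The table name "messages", column family "m", row key, and qualifier are hypothetical illustrations, not the actual Messages schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageWriter {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();

        // "messages" table and column family "m" are hypothetical names.
        HTable table = new HTable(conf, "messages");

        Put put = new Put(Bytes.toBytes("user42"));        // row key
        put.add(Bytes.toBytes("m"),                        // column family
                Bytes.toBytes("2011-01-01T00:00:00"),      // qualifier
                Bytes.toBytes("hello"));                   // value
        table.put(put);                                    // durable once the write-ahead log reaches HDFS

        table.close();
    }
}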
The third use case is medium-term archiving of MySQL databases. This solution provides fast backup and recovery from data stored in the Hadoop Distributed File System and reduces our
2) RaidNode
The warehouse codebase includes HDFS-RAID, which uses parity files to decrease storage usage. Disk is cheap, but 30 petabytes of storage is quite costly! Hadoop, by default, keeps three copies of data. HDFS-RAID creates parity blocks from data blocks, allowing us to store fewer copies of data while maintaining the same probability of data loss. We have deployed the XOR parity generator, saving about 5 petabytes of raw storage in our production deployment. We are actively developing a Reed-Solomon variant of parity generation, which will allow us to store even fewer copies of data. We have contributed this feature to Apache Hadoop, and you can find it in the Hadoop MapReduce sub-project.
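To illustrate the idea (this is not the HDFS-RAID code itself), here is a minimal sketch of XOR parity over a stripe of equally sized blocks: the parity block is the byte-wise XOR of the data blocks, so any single missing block can be rebuilt from the parity and the surviving blocks.

/** Toy illustration of XOR parity over fixed-size blocks; block contents are made up. */
public class XorParityDemo {

    // Compute a parity block as the byte-wise XOR of all data blocks in a stripe.
    static byte[] xorParity(byte[][] stripe) {
        byte[] parity = new byte[stripe[0].length];
        for (byte[] block : stripe) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Reconstruct a single missing block from the parity and the surviving blocks.
    static byte[] reconstruct(byte[] parity, byte[][] survivors) {
        byte[] missing = parity.clone();
        for (byte[] block : survivors) {
            for (int i = 0; i < missing.length; i++) {
                missing[i] ^= block[i];
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        byte[][] stripe = { "block-A".getBytes(), "block-B".getBytes(), "block-C".getBytes() };
        byte[] parity = xorParity(stripe);

        // Pretend block B was lost; recover it from the parity and the other two blocks.
        byte[] recovered = reconstruct(parity, new byte[][] { stripe[0], stripe[2] });
        System.out.println(new String(recovered));  // prints "block-B"
    }
}

With one parity block per stripe, any single lost block is recoverable, which is why replication can be lowered without raising the probability of data loss; a Reed-Solomon code generalizes this to tolerate multiple simultaneous losses per stripe.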
3) Appends
The realtime codebase has full support for reliable HDFS file-append operations, which is a basic requirement for Apache HBase to store its indexes reliably on HDFS. Earlier versions of Apache HBase were susceptible to data loss because HDFS did not have an API to sync data to stable storage. We have added a new interface to Apache Hadoop 0.20 to reliably sync data to stable storage. This is the first known production instance of Apache Hadoop supporting zero data loss under Apache HBase. We have contributed this feature to the 0.20-append branch of Apache Hadoop.
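As a rough sketch of how a client forces written data to stable storage, assuming the sync() call on FSDataOutputStream in the 0.20-append line (renamed hflush() in later Hadoop releases); the file path and record contents are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical write-ahead-log style file; append requires the 0.20-append build.
        Path logPath = new Path("/logs/wal-000001");
        FSDataOutputStream out = fs.exists(logPath) ? fs.append(logPath) : fs.create(logPath);

        // Write a record, then force it to stable storage so readers (and crash
        // recovery) can see it even if this writer process dies immediately after.
        out.write("put row-123 cf:col=value\n".getBytes("UTF-8"));
        out.sync();  // hflush() on newer Hadoop releases

        out.close();
        fs.close();
    }
}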
We will update the JIRAs and GitHub repositories with these features and many more in the next few months. You'll also find Yahoo!'s tree of Hadoop source code at https://github.com/yahoo/hadoop-common. Stay tuned!