Mastering Hadoop
About this ebook
Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise.
This book explores industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. Then, it dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.
This book is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.
Mastering Hadoop - Sandeep Karanth
Table of Contents
Mastering Hadoop
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2. Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's small files problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
Reduce-side joins
Map-side joins
Summary
3. Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4. Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
Dynamic partitions
Semantics for dynamic partitioning
Indexes on Hive tables
Hive query optimizers
Advanced DML
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5. Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6. YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
Monitoring YARN
Job scheduling in YARN
CapacityScheduler
FairScheduler
YARN commands
User commands
Administration commands
Summary
7. Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
Storm on YARN
Installing Apache Storm-on-YARN
Prerequisites
Installation procedure
Summary
8. Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9. HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
10. HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
Benefits of HDFS Federation
Deploying federated NameNodes
HDFS high availability
Secondary NameNode, Checkpoint Node, and Backup Node
High availability – edits sharing
Useful HDFS tools
Three-layer versus four-layer network topology
HDFS block placement
Pluggable block placement policy
Summary
11. Hadoop Security
The security pillars
Authentication in Hadoop
Kerberos authentication
The Kerberos architecture and workflow
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12. Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
Term frequency
Document frequency
Term frequency – inverse document frequency
Tf-Idf in Pig
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
A. Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
Index
Mastering Hadoop
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2014
Production reference: 1221214
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-364-3
www.packtpub.com
Cover image by Poonam Nayak (<pooh.graphics@gmail.com>)
Credits
Author
Sandeep Karanth
Reviewers
Shiva Achari
Pavan Kumar Polineni
Uchit Vyas
Yohan Wadia
Commissioning Editor
Edward Gordon
Acquisition Editor
Rebecca Youé
Content Development Editor
Ruchita Bhansali
Technical Editors
Bharat Patil
Rohit Kumar Singh
Parag Topre
Copy Editors
Sayanee Mukherjee
Vikrant Phadkay
Project Coordinator
Kranti Berde
Proofreaders
Simran Bhogal
Maria Gould
Ameesha Green
Paul Hindle
Indexer
Mariammal Chettiyar
Graphics
Abhinash Sahu
Valentina Dsilva
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Sandeep Karanth is a technical architect who specializes in building and operationalizing software systems. He has more than 14 years of experience in the software industry, working on a gamut of products ranging from enterprise data applications to newer-generation mobile applications. He has worked primarily at Microsoft Corporation in Redmond and at Microsoft Research in India, and he is currently a cofounder at Scibler, where he architects data intelligence products.
Sandeep has special interest in data modeling and architecting data applications. In his area of interest, he has successfully built and deployed applications, catering to a variety of business use cases such as vulnerability detection from machine logs, churn analysis from subscription data, and sentiment analyses from chat logs. These applications were built using next generation big data technologies such as Hadoop, Spark, and Microsoft StreamInsight and deployed on cloud platforms such as Amazon AWS and Microsoft Azure.
Sandeep is also experienced and interested in areas such as green computing and the emerging Internet of Things. He frequently trains professionals and gives talks on topics such as big data and cloud computing. Sandeep believes in inculcating skill-oriented and industry-related topics in the undergraduate engineering curriculum, and his talks are geared with this in mind. Sandeep has a Master's degree in Computer and Information Sciences from the University of Minnesota, Twin Cities.
Sandeep's twitter handle is @karanths. His GitHub profile is https://github.com/Karanth, and he writes technical snippets at https://gist.github.com/Karanth.
Acknowledgments
I would like to dedicate this book to my loving daughter, Avani, who has taught me many a lesson in effective time management. I would like to thank my wife and parents for their constant support that has helped me complete this book on time. Packt Publishing have been gracious enough to give me this opportunity, and I would like to thank all individuals who were involved in editing, reviewing, and publishing this book. Questions and feedback from curious audiences at my lectures have driven much of the content of this book. Some of the subtopics are from experiences I gained working on a wide variety of projects throughout my career. I would like to thank my audience and also my employers for indirectly helping me write this book.
About the Reviewers
Shiva Achari has over 8 years of extensive industry experience and is currently working as a Big Data architect at Teradata. Over the years, he has architected, designed, and developed multiple innovative, high-performing, large-scale solutions, such as distributed systems, data centers, Big Data management, SaaS cloud applications, Internet applications, and data analytics solutions.
He is currently writing a book on Hadoop essentials, which is based on Hadoop, its ecosystem components, and how we can leverage the components in different phases of the Hadoop project life cycle.
Achari has experience in designing Big Data and analytics applications, ingestion, cleansing, transformation, correlating different sources, data mining, and user experience using Hadoop, Cassandra, Solr, Storm, R, and Tableau.
He specializes in developing solutions for the Big Data domain and has sound hands-on experience with projects involving migration to the Hadoop world, new development, product consulting, and POCs. He also has hands-on expertise in technologies such as Hadoop, YARN, Sqoop, Hive, Pig, Flume, Solr, Lucene, Elasticsearch, ZooKeeper, Storm, Redis, Cassandra, HBase, MongoDB, Talend, R, Mahout, Tableau, Java, and J2EE.
Shiva has expertise in requirement analysis, estimations, technology evaluation, and system architecture, with domain experience of telecom, Internet applications, document management, healthcare, and media.
Currently, he supports presales activities such as writing technical proposals (RFPs), providing technical consultation to customers, and managing deliveries of the Big Data practice group at Teradata.
He is active on LinkedIn at http://in.linkedin.com/in/shivaachari/.
I would like to thank Packt Publishing for the opportunity to review this book; it was a great experience. I wish the publisher and the author the best of luck with the success of the book.
Pavan Kumar Polineni works as an Analytics Manager at Fantain Sports. He has experience in the fields of information retrieval and recommendation engines. He is a Cloudera Certified Hadoop Administrator. He is interested in machine learning, data mining, and visualization.
He has a Bachelor's degree in Computer Science from Koneru Lakshmaiah College of Engineering and is about to complete his Master's degree in Software Systems from BITS, Pilani. He has worked at organizations such as IBM and Ctrls Datacenter. He can be found on Twitter as @polinenipavan.
Uchit Vyas is an open source specialist and a hands-on DevOps lead at Clogeny Technologies, where he is responsible for the delivery of solutions and services and for product development. He explores new enterprise open source technologies and defines architecture, roadmaps, and best practices. He has consulted and provided training on various open source technologies, including cloud computing (AWS Cloud, Rackspace, Azure, CloudStack, OpenStack, and Eucalyptus), Mule ESB, Chef, Puppet, Liferay Portal, Alfresco ECM, and JBoss, to corporations around the world.
He has a degree in Computer Science Engineering from Gujarat University. He worked in the education and research team of Infosys Limited as a senior associate, during which time he worked on SaaS, private clouds, and virtualization; he now works on cloud system automation.
He has also published a book on Mule ESB, and he is writing various books on open source technologies and AWS.
He hosts a blog named Cloud Magic World, cloudbyuchit.blogspot.com, where he posts tips and phenomena about open source technologies, mostly cloud technologies. He can also be found on Twitter as @uchit_vyas.
I am thankful to Riddhi Thaker (my colleague) for helping me a lot in reviewing this book.
Yohan Wadia is a client-focused virtualization and cloud expert with 5 years of experience in the IT industry.
He has been involved in conceptualizing, designing, and implementing large-scale solutions for a variety of enterprise customers based on VMware vCloud, Amazon Web Services, and Eucalyptus Private Cloud.
His community-focused involvement enables him to share his passion for virtualization and cloud technologies with peers through social media engagements, public speaking at industry events, and through his personal blog at yoyoclouds.com.
He is currently working with Virtela Technology Services, an NTT Communications company, as a cloud solutions engineer, and is involved in managing the company's in-house cloud platform. He works on various open source and enterprise-level cloud solutions for internal as well as external customers. He is also a VMware Certified Professional and a vExpert (2012, 2013).
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
We are in an age where data is the primary driver of decision-making. With storage costs declining, network speeds increasing, and everything around us becoming digital, we do not hesitate to download, store, or share data with those around us. About 20 years ago, a camera was a device used to capture pictures on film. Every photograph had to be captured almost perfectly. Film negatives were stored carefully lest they get damaged. Taking prints of these photographs carried a high cost. The time between clicking a picture and viewing it was almost a day. All of this meant that less data was captured; these factors discouraged people from recording each and every moment of their lives unless it was very significant.
However, with cameras becoming digital, this has changed. We do not hesitate to photograph almost anything at any time. We do not worry about storage, as our external disks of terabyte capacity provide a reliable backup. We seldom carry dedicated cameras, as we have mobile devices that can take photographs. We have applications such as Instagram that can be used to add effects to our pictures and share them. We gather opinions and information about the pictures we click, and we base some of our decisions on them. We capture almost every moment, of great significance or not, and push it into our memory books. The era of Big Data has arrived!
This era of Big Data has brought similar changes to businesses as well. Almost everything in a business is logged. Every action a user takes on an e-commerce site is recorded to improve the quality of service, and every item the user buys is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by gathering every possible piece of data they can get about those customers. Businesses are not particular about the format of the data; they are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user. The more the data, the higher the degree of personalization and the better the experience for the user.
We saw that we are ready, in some respects, to take on this Big Data challenge. However, what about the tools used to analyze this data? Can they handle the volume, velocity, and variety of the incoming data? Theoretically, all this data could reside on a single machine, but what would such a machine cost? Would it be able to cater to variations in load? We know that supercomputers are available, but there are only a handful of them in the world, and supercomputers don't scale. The alternative is to build a team of machines: a cluster of individual computing units that work in tandem to achieve a task. Such a team of machines is interconnected via a very fast network and provides better scaling and elasticity, but that is not enough: these clusters have to be programmed. A larger number of machines, just like a larger team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way, easing the burden on the programmer? The answer is systems such as Hadoop.
Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. Today, it is the most sought-after job skill in the data sciences space. Hadoop has become the go-to tool for handling and analyzing Big Data. Hadoop 2.0 is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs and will soon become a mandatory skill for every engineer across verticals.
This book covers optimizations and advanced features of MapReduce, Pig, and Hive. It also covers Hadoop 2.0 and illustrates how it can be used to extend the capabilities of Hadoop.
Hadoop, in its 2.0 release, has evolved to become a general-purpose cluster-computing platform. The book will explain the platform-level changes that enable this. Industry guidelines to optimize MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0 are covered. Some advanced job patterns and their applications are also discussed. These topics will empower the Hadoop user to optimize existing jobs and migrate them to Hadoop 2.0. Subsequently, it will dive deeper into Hadoop 2.0-specific features such as YARN (Yet Another Resource Negotiator) and HDFS Federation, along with examples. Replacing HDFS with other filesystems is another topic that will be covered in the latter half of the book. Understanding these topics will enable Hadoop users to extend Hadoop to other application paradigms and data stores, making efficient use of the available cluster resources.
This book is a guide focusing on advanced concepts and features in Hadoop. Foundations of every concept are explained with code fragments or schematic illustrations. The data processing flow dictates the order of the concepts in each chapter.
What this book covers
Chapter 1, Hadoop 2.X, discusses the improvements in Hadoop 2.X in comparison to its predecessor generation.
Chapter 2, Advanced MapReduce, helps you understand the best practices and patterns for Hadoop MapReduce, with examples.
Chapter 3, Advanced Pig, discusses the advanced features of Pig, a framework to script MapReduce jobs on Hadoop.
Chapter 4, Advanced Hive, discusses the advanced features of a higher-level SQL abstraction on Hadoop MapReduce called Hive.
Chapter 5, Serialization and Hadoop I/O, discusses the I/O capabilities of Hadoop. Specifically, this chapter covers serialization and deserialization support and its necessity within Hadoop; Avro, an external serialization framework; the data compression codecs available within Hadoop and their tradeoffs; and finally, the special file formats in Hadoop.
Chapter 6, YARN – Bringing Other Paradigms to Hadoop, discusses YARN (Yet Another Resource Negotiator), a new resource manager that has been included in Hadoop 2.X, and how it is generalizing the Hadoop platform to include other computing paradigms.
Chapter 7, Storm on YARN – Low Latency Processing in Hadoop, discusses the opposite paradigm, that is, moving data to the compute, and compares and contrasts it with batch processing systems such as MapReduce. It also discusses the Apache Storm framework and how to develop applications in Storm. Finally, you will learn how to install Storm on Hadoop 2.X with YARN.
Chapter 8, Hadoop on the Cloud, discusses the characteristics of cloud computing and Hadoop's Platform as a Service offering across cloud computing service providers. Further, it delves into Amazon's managed Hadoop services, also known as Elastic MapReduce (EMR) and looks into how to provision and run jobs on a Hadoop EMR cluster.
Chapter 9, HDFS Replacements, discusses the strengths and drawbacks of HDFS when compared to other file systems. The chapter also draws attention to Hadoop's support for Amazon's S3 cloud storage service. At the end, the chapter illustrates Hadoop HDFS extensibility features by implementing Hadoop's support for S3's native file system to extend Hadoop.
Chapter 10, HDFS Federation, discusses the advantages of HDFS Federation and its architecture. Block placement strategies, which are central to the success of HDFS in the MapReduce environment, are also discussed in the chapter.
Chapter 11, Hadoop Security, focuses on the security aspects of a Hadoop cluster. The main pillars of security are authentication, authorization, auditing, and data protection. We will look at Hadoop's features in each of these pillars.
Chapter 12, Analytics Using Hadoop, discusses higher-level analytic workflows, techniques such as machine learning, and their support in Hadoop. We take document analysis as an example to illustrate analytics using Pig on Hadoop.
Appendix, Hadoop for Microsoft Windows, explores the Microsoft Windows operating system's native support for Hadoop, introduced in Hadoop 2.0. In this chapter, we look at how to build and deploy Hadoop on Microsoft Windows natively.
What you need for this book
The following software suites are required to try out the examples in the book:
Java Development Kit (JDK 1.7 or later): This is free software from Oracle that provides a JRE (Java Runtime Environment) and additional tools for developers. It can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
The IDE for editing Java code: IntelliJ IDEA is the IDE that has been used to develop the examples. Any other IDE of your choice can also be used. The community edition of the IntelliJ IDE can be downloaded from https://www.jetbrains.com/idea/download/.
Maven: Maven is the build tool that has been used to build the samples in this book. Maven can automatically pull build dependencies and specify configurations via XML files. The code samples in the chapters can be built into a JAR using two simple Maven commands:
mvn compile
mvn assembly:single
The first command compiles the code; the second creates a consolidated JAR containing the program along with all its dependencies. It is important to change the mainClass reference in pom.xml to the driver class name when building the consolidated JAR file.
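As an illustrative sketch (the plugin configuration shown here is a standard maven-assembly-plugin setup, and the driver class name MasteringHadoop.MyDriver is a hypothetical placeholder, not taken from the book's sources), the mainClass setting typically lives in the plugin's manifest configuration in pom.xml:

```xml
<!-- Hypothetical sketch: replace MasteringHadoop.MyDriver with your own driver class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <!-- The driver class picked up when the consolidated JAR is run -->
        <mainClass>MasteringHadoop.MyDriver</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <!-- Produces a single JAR bundling the program and all its dependencies -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>
```

With this in place, mvn assembly:single produces a file named like artifactId-version-jar-with-dependencies.jar under target/, which can then be passed to the hadoop jar command.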
Hadoop-related consolidated JAR files can be run using the command:
hadoop jar
This command directly picks the driver program from the mainClass that was specified in the pom.xml. Maven can be downloaded and installed from http://maven.apache.org/download.cgi. The Maven XML template file used to build the samples in this book is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <!-- Project coordinates, dependencies, and build plugin configuration go here.
         IDE-specific settings stored in this file have no influence on the Maven build itself. -->
</project>
Hadoop 2.2.0: Apache Hadoop is required to try out the examples in general. Appendix, Hadoop for Microsoft Windows, has the details on Hadoop's single-node installation on a Microsoft Windows machine. The steps are similar and easier for other operating systems such as Linux or Mac, and they can be found at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
Who this book is for
This book is meant for a gamut of readers. A novice user of Hadoop can use this book to upgrade their skill level in the technology. People with existing experience of Hadoop can enhance their knowledge to solve the challenging data processing problems they might encounter in their profession. People who use Hadoop, Pig, or Hive at their workplace can use the tips provided in this book to make their jobs faster and more efficient. A curious Big Data professional can use this book to understand the expanding horizons of Hadoop and how it is broadening its scope by embracing paradigms other than MapReduce. Finally, a Hadoop 1.X user can get insights into the repercussions of upgrading to Hadoop 2.X. The book assumes familiarity with Hadoop, but the reader need not be an expert. Access to a Hadoop installation, either in your organization, on the cloud, or on your desktop/notebook, is recommended for trying out some of the concepts.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "The FileInputFormat subclass and associated classes are commonly used for jobs taking inputs from HDFS."
A