
Hadoop Architecture

Apache Hadoop offers a scalable, flexible and reliable distributed computing framework for big data, running on a cluster of systems that contribute storage capacity and local computing power by leveraging commodity hardware. Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm. The three important Hadoop components that play a vital role in the Hadoop architecture are:

i. Hadoop Distributed File System (HDFS) – Patterned after the UNIX file system
ii. Hadoop MapReduce
iii. Yet Another Resource Negotiator (YARN)

Hadoop follows a master-slave design for data storage and distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes are the other machines in the Hadoop cluster, which store data and perform the actual computations. Every slave node runs a TaskTracker daemon and a DataNode that synchronize with the JobTracker and the NameNode respectively. In a Hadoop deployment, the master and slave systems can be set up in the cloud or on-premise.
Features of Hadoop HDFS
Fault Tolerance

Fault tolerance in HDFS refers to how well the system keeps working under unfavorable conditions and how it handles such situations. HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster (the number of replicas is configurable). If any machine in the cluster goes down, a client can still access its data from another machine that holds the same copy of the data blocks. HDFS also maintains the replication factor by placing replicas of data blocks on another rack, so even if a machine or an entire rack fails, a user can access the data from slaves in another rack.

High Availability

HDFS is a highly available file system. Data is replicated among the nodes in the HDFS cluster by creating copies of each block on other slaves in the cluster. Whenever users want to access their data, they can read it from whichever slave holds the blocks and is nearest in the cluster. During unfavorable situations such as the failure of a node, users can still access their data from the other nodes, because duplicate copies of the blocks containing their data exist elsewhere in the HDFS cluster.

Data Reliability

HDFS is a distributed file system that provides reliable data storage and can store data in the range of hundreds of petabytes across a cluster of nodes. HDFS divides the data into blocks, stores those blocks on the nodes of the cluster, and creates a replica of every block, which provides fault tolerance. If a node containing data goes down, the user can still access that data from other nodes in the cluster that hold a copy of it. By default HDFS creates 3 copies of each block, so data remains quickly available and users do not face data loss. Hence HDFS is highly reliable.

Replication

Data replication is one of the most important and distinctive features of Hadoop HDFS. Data is replicated to guard against data loss under unfavorable conditions such as a node crash or hardware failure. Data is divided into blocks and these blocks are replicated across a number of machines in the cluster. HDFS maintains the replication factor continuously, re-creating replicas of user data on different machines in the cluster whenever needed. So whenever any machine in the cluster crashes, the user can access the data from other machines that hold copies of its blocks, and there is no loss of user data.
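A minimal sketch of how a client could adjust the replication factor for a single file using the Hadoop FileSystem API follows; the class name, file path, and target factor are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // The cluster-wide default replication factor normally comes from
        // dfs.replication in hdfs-site.xml (3 by default).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path: request a per-file override of the replication factor.
        Path file = new Path("/data/example/events.log");
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}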

Scalability

As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when requirements grow. Two scalability mechanisms are available: vertical scalability, which adds more resources (CPU, memory, disk) to the existing nodes of the cluster, and horizontal scalability, which adds more machines to the cluster. The horizontal approach is preferred because the cluster can grow from tens of nodes to hundreds of nodes on the fly, without any downtime.

Distributed Storage

All of these HDFS features rest on distributed storage and replication. In HDFS, data is stored in a distributed manner across the nodes of the cluster: the data is divided into blocks, the blocks are stored on the cluster nodes, and replicas of every block are created and stored on other nodes in the cluster. So if a single machine in the cluster crashes, the data can still be read from the other nodes that hold its replicas.
Function of NameNode and DataNode
NameNode

All the files and directories in the HDFS namespace are represented on the NameNode by Inodes
that contain various attributes like permissions, modification timestamp, disk space quota,
namespace quota and access times. NameNode maps the entire file system structure into
memory. Two files fsimage and edits are used for persistence during restarts.

 Fsimage file contains the Inodes and the list of blocks which define the metadata.
 The edits file contains any modifications that have been performed on the content of the
fsimage file.

When the NameNode starts, the fsimage file is loaded and the contents of the edits file are then applied to recover the latest state of the file system. The problem is that over time the edits file grows, consuming disk space and slowing down the restart process; if the Hadoop cluster has not been restarted for months, the edits file will have grown very large and the downtime will be substantial. This is where the Secondary NameNode comes to the rescue. The Secondary NameNode fetches the fsimage and edits log from the primary NameNode at regular intervals, loads both into main memory, and applies each operation from the edits log to the fsimage. It then copies the new fsimage file back to the primary NameNode and records the modification time of the fsimage in the fstime file to track when the fsimage was last updated.

DataNode

The DataNode manages the state of an HDFS slave node and serves its blocks. A slave node can perform CPU-intensive jobs such as semantic and language analysis, statistics, and machine learning tasks, as well as I/O-intensive jobs like clustering, data import, data export, search, decompression, and indexing. A DataNode needs a lot of I/O for data processing and transfer.

On startup every DataNode connects to the NameNode and performs a handshake to verify the
namespace ID and the software version of the DataNode. If either of them does not match then
the DataNode shuts down automatically. A DataNode verifies the block replicas in its ownership
by sending a block report to the NameNode. As soon as the DataNode registers, the first block
report is sent. DataNode sends heartbeat to the NameNode every 3 seconds to confirm that the
DataNode is operating and the block replicas it hosts are available.

What is Hadoop?
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

The Hadoop framework includes the following four modules:

 Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. They provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
 Hadoop YARN: This is a framework for job scheduling and cluster resource management.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
 Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

Benefits

Among the reasons organizations use Hadoop are its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.

 Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
 Reliability – large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
 Flexibility – unlike traditional relational database management systems, you don’t have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read.
 Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.

Difference Between Hadoop and Traditional RDBMS
Unlike Hadoop, a traditional RDBMS cannot be used when it comes to processing and storing a large amount of data, or simply big data. Following are some differences between Hadoop and a traditional RDBMS.
Data Volume-

Data volume means the quantity of data that is being stored and processed. An RDBMS works better when the volume of data is low (in gigabytes), but when the data size is huge, i.e. in terabytes and petabytes, an RDBMS fails to give the desired results.

Hadoop, on the other hand, works better when the data size is big: it can process and store large amounts of data far more effectively than a traditional RDBMS.

Architecture-

If we talk about the architecture, Hadoop has the following core components: HDFS (Hadoop Distributed File System), Hadoop MapReduce (a programming model to process large data sets) and Hadoop YARN (used to manage computing resources in computer clusters).

A traditional RDBMS possesses the ACID properties: Atomicity, Consistency, Isolation, and Durability.

These properties maintain and ensure data integrity and accuracy when a transaction takes place in a database. Such transactions may relate to banking systems, the manufacturing industry, the telecommunications industry, online shopping, the education sector, etc.

Throughput-

Throughput means the total volume of data processed in a particular period of time. An RDBMS fails to achieve the throughput of the Apache Hadoop framework.

This is one of the reasons Hadoop is used more heavily than the traditional Relational Database Management System for big data workloads.

Data Variety-

Data Variety generally means the type of data to be processed: it may be structured, semi-structured or unstructured.

Hadoop has the ability to process and store all varieties of data, whether structured, semi-structured or unstructured, although it is mostly used to process large amounts of unstructured data.

A traditional RDBMS is used only to manage structured and semi-structured data; it cannot be used to manage unstructured data. So for varied data, Hadoop is much better suited than the traditional Relational Database Management System.
Latency/ Response Time –

Hadoop has higher throughput: you can access batches of large data sets more quickly than with a traditional RDBMS, but you cannot access a particular record from the data set very quickly. Thus Hadoop is said to have high latency.

The RDBMS is comparatively faster at retrieving information from a data set. It takes very little time to perform the same function, provided that the amount of data is small, so it has lower latency.

Scalability-

An RDBMS provides vertical scalability, which is also known as ‘scaling up’ a machine: you can add more resources or hardware, such as memory and CPU, to a machine in the computer cluster.

Hadoop, by contrast, provides horizontal scalability, also known as ‘scaling out’: more machines are added to the existing computer cluster. As a result Hadoop becomes fault tolerant, with no single point of failure. Due to the presence of more machines in the cluster, you can easily recover data irrespective of the failure of one of the machines.

Data Processing-

Apache Hadoop supports OLAP (Online Analytical Processing), which is used in data mining techniques.

OLAP involves very complex queries and aggregations. The data processing speed depends on the amount of data, and a job can take several hours. The database design is de-normalized, having fewer tables; OLAP uses star schemas.

On the other hand, an RDBMS supports OLTP (Online Transaction Processing), which involves comparatively fast query processing. The database design is highly normalized, having a large number of tables. OLTP generally uses a 3NF (third normal form) schema.

Cost-

Hadoop is a free and open source software framework; you don’t have to pay to buy a license for the software.

An RDBMS, in contrast, is licensed software, and you have to pay to buy the complete software license.
Machine learning is closely related to (and often overlaps with) computational statistics, which
also focuses on prediction-making through the use of computers. It has strong ties to
mathematical optimization, which delivers methods, theory and application domains to the field.

Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning "signal" or "feedback" available to a learning system. These are:

1. Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

2. Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

3. Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). The program is provided feedback in terms of rewards and punishments as it navigates its problem space.

Traditional analytics tools are not well suited to capturing the full value of big data.

The volume of data is too large for comprehensive analysis, and the range of potential
correlations and relationships between disparate data sources — from back end customer
databases to live web based clickstreams —are too great for any analyst to test all hypotheses
and derive all the value buried in the data.

Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, simple averages and running SQL queries. Online analytical processing is merely a systematized extension of these basic analytics that still relies on a human to direct activities and specify what should be calculated.

Machine learning is ideal for exploiting the opportunities hidden in big data.

It delivers on the promise of extracting value from big and disparate data sources with far less
reliance on human direction. It is data driven and runs at machine scale. It is well suited to the
complexity of dealing with disparate data sources and the huge variety of variables and amounts
of data involved. And unlike traditional analysis, machine learning thrives on growing datasets.
The more data fed into a machine learning system, the more it can learn, and the higher the quality of the insights it can produce.

Freed from the limitations of human scale thinking and analysis, machine learning is able to
discover and display the patterns buried in the data.

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark is written in the Scala programming language and runs in a Java Virtual Machine (JVM) environment. It currently supports the following languages for developing applications using Spark:

 Scala
 Java
 Python
 Clojure
 R

Features of Spark

 Spark gives us a comprehensive, unified framework to manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch versus real-time streaming data).
 Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.
 Spark lets you quickly write applications in Java, Scala, or Python (see the sketch after this list). It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell.
 In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
 Spark also supports lazy evaluation of big data queries, which helps with optimization of the steps in data processing workflows. It provides a higher-level API to improve developer productivity and a consistent architecture model for big data solutions.
 Spark holds intermediate results in memory rather than writing them to disk, which is very useful when you need to work on the same dataset multiple times. It is designed as an execution engine that works both in memory and on disk: Spark operators perform external operations when data does not fit in memory, so Spark can be used for processing datasets that are larger than the aggregate memory of a cluster.
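The sketch referred to in the list above is a minimal Java word count on the RDD API; the class name is made up, and the input and output paths are assumed to arrive as command-line arguments.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SparkWordCount").getOrCreate();

        // Read text lines, split into words, and count occurrences of each word.
        JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile(args[1]);
        spark.stop();
    }
}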

Spark provides a machine learning API called Spark ML. The Spark ML (spark.ml) package provides a machine learning API built on DataFrames, which are becoming the core part of the Spark SQL library. This package can be used for developing and managing machine learning pipelines. It also provides feature extractors, transformers, and selectors, and it supports machine learning techniques like classification, regression, and clustering. All of these are critical for developing machine learning solutions.
Spark ML provides you with a toolset to create "pipelines" of different machine learning related transformations on your data. It makes it easy, for example, to chain feature extraction, dimensionality reduction, and the training of a classifier into one model, which as a whole can later be used for classification.
Spark machine learning uses the Spark SQL DataFrame as its dataset. A DataFrame can hold various data types; for example, a dataset can contain different columns that store feature vectors, predictions, true labels, and text.
The APIs for machine learning algorithms are standardized by Spark ML. With this, it is easy to combine various algorithms in a single workflow or pipeline. The key concepts of the Spark ML API are listed below.
A transformer is defined as an algorithm that is capable of transforming one DataFrame into another.
An estimator is another algorithm that can produce a transformer by fitting on a DataFrame.
A pipeline specifies an ML workflow by chaining various transformers and estimators together.
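A minimal Java sketch of these three concepts, loosely modeled on the common text-classification example, follows; the class name and column names are illustrative, and training and test DataFrames with "text" and "label" columns are assumed to exist.

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkMlPipelineSketch {
    // 'training' and 'test' are assumed DataFrames with "text" and "label" columns.
    public static Dataset<Row> trainAndPredict(Dataset<Row> training, Dataset<Row> test) {
        // Transformers: turn raw text into words, then into feature vectors.
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
        // Estimator: a classifier that will be fitted on the features.
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);

        // Pipeline: chain the transformers and the estimator into one workflow.
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

        // Fitting the pipeline produces a PipelineModel, itself a transformer.
        PipelineModel model = pipeline.fit(training);
        return model.transform(test);
    }
}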

The different steps involved in a machine learning pipeline process are listed in Table 1.

Step   Name                 Description
ML1    Data Ingestion       Loading the data from different data sources.
ML2    Data Cleaning        The data is pre-processed to get it ready for the machine learning analysis.
ML3    Feature Extraction   Also known as Feature Engineering, this step is about extracting the features from the data sets.
ML4    Model Training       The machine learning model is trained using the training data sets.
ML5    Model Validation     The model is evaluated on different prediction parameters for its effectiveness and tuned during validation; this step is used to pick the best model.
ML6    Model Testing        The selected model is tested before it is deployed.
ML7    Model Deployment     The final step is to deploy the selected model to execute in the production environment.

Table 1. Machine learning pipeline process steps


Microsoft took yet another step towards market leadership in Big Data through the public preview release of Azure Machine Learning (also known as "Azure ML"). Taking predictive analytics to the public cloud seems like the next logical step towards large-scale consumerization of machine learning, and Azure ML does just that while making it significantly easier for developers. The service runs on the Azure public cloud, which means that users need not buy any hardware or software, and need not worry about deployment and maintenance.

Through an integrated development environment called ML Studio, people without data science
background can also build data models through drag-and-drop gestures and simple data flow
diagrams. This not only minimizes coding, but also saves a lot of time through ML Studio's
library of sample experiments. On the other hand, seasoned data scientists will be glad to notice
how strongly Azure ML supports R. You can just drop existing R code directly into Azure ML,
or develop your own code using more than 350 R packages supported by ML Studio.

Reporting: The process of organizing data into informational summaries in order to monitor
how different areas of a business are performing.

Analysis: The process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.

Reporting translates raw data into information. Analysis transforms data and information into
insights. Reporting helps companies monitor their online business and be alerted when
data falls outside of expected ranges. Good reporting should raise questions about the business
from its end users. The goal of analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations. Through the process of performing
analysis you may raise additional questions, but the goal is to identify answers, or at least
potential answers that can be tested. In summary, reporting shows you what is happening while
analysis focuses on explaining why it is happening and what you can do about it.

Five differences between reporting and analysis:


1. Purpose

Reporting has helped companies monitor their data since before digital technology boomed. Organizations have long depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis outputs), analysis interprets this information and provides recommendations on actions.

2. Tasks

As reporting and analysis are divided by a very fine line, it is easy to label a task as analysis when all it really does is reporting. Hence, ensure that your analytics team keeps a healthy balance of both.

Here is a good differentiator to keep in mind when deciding whether what you are doing is reporting or analysis:

Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It is very similar to the activities mentioned above, such as turning data into charts and graphs and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have a push and pull effect on their users through their outputs. Reporting has a push approach: it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst pulls information to probe further and to answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations are composed of insights, recommended actions, and a forecast of their impact on the company, all in a language that is easy to understand at the level of the user who will be reading and deciding on it.

This matters if organizations are to realize the true value of data: a standard report is not the same as a meaningful analysis.

4. Delivery

Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It is not surprising that data entry services are among the first things to be outsourced, since outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and data scientists are in such demand these days, as organizations depend on them to come up with recommendations on which leaders or business executives can base decisions about their businesses.
5. Value

This isn’t about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.

The big data technologies based on Forrester’s analysis:
1. Predictive analytics: software and/or hardware solutions that allow firms to discover, evaluate,
optimize, and deploy predictive models by analyzing big data sources to improve business
performance or mitigate risk.
2. NoSQL databases: key-value, document, and graph databases.
3. Search and knowledge discovery: tools and technologies to support self-service extraction of
information and new insights from large repositories of unstructured and structured data that
resides in multiple sources such as file systems, databases, streams, APIs, and other platforms
and applications.
4. Stream analytics: software that can filter, aggregate, enrich, and analyze a high throughput of
data from multiple disparate live data sources and in any data format.
5. In-memory data fabric: provides low-latency access and processing of large quantities of data
by distributing data across the dynamic random access memory (DRAM), Flash, or SSD of a
distributed computer system.
6. Distributed file stores: a computer network where data is stored on more than one node, often
in a replicated fashion, for redundancy and performance.
7. Data virtualization: a technology that delivers information from various data sources, including
big data sources such as Hadoop and distributed data stores in real-time and near-real time.
8. Data integration: tools for data orchestration across solutions such as Amazon Elastic
MapReduce (EMR), Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop,
and MongoDB.
9. Data preparation: software that eases the burden of sourcing, shaping, cleansing, and sharing
diverse and messy data sets to accelerate data’s usefulness for analytics.
10. Data quality: products that conduct data cleansing and enrichment on large, high-velocity data
sets, using parallel operations on distributed data stores and databases.

A BI system helps the user community understand business trends and the health of the organization.
Hadoop Architecture
"Hadoop is a framework that enables the distributed processing of large data sets across clusters of
commodity servers. It is designed to scale up from a single server to thousands of machines, with a very
high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters
comes from the software's ability to detect and handle failures at the application layer."

The Hadoop framework has four core capabilities: Hadoop Common, the Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce.

 Hadoop Common: Hadoop provides a set of common utilities that support other Hadoop
functionalities.
 Hadoop Distributed File System (HDFS): HDFS is a file system that spans all the nodes in a
Hadoop cluster for data storage. It links together the file systems on many local nodes to make
them into one big file system.
 Hadoop YARN: Yet Another Resource Negotiator (YARN) assigns CPU, memory, and storage to applications running on a Hadoop cluster. The first generation of Hadoop could only run MapReduce applications; YARN's pluggable architecture and resource management allow other data processing engines to interact with data stored in HDFS.
 Hadoop MapReduce: A YARN-based batch processing system which supports parallel processing
of large unstructured data sets.

Data Storage In Hadoop


The Hadoop Distributed File System (HDFS) provides an optimal and reliable way to store data within the overall Hadoop framework; HDFS works as the backbone of a Hadoop implementation. In HDFS, the actual data is stored on a cluster of less expensive commodity servers. An HDFS cluster has one NameNode and one or more DataNodes. The NameNode is a master node that maintains the file system's metadata, and a DataNode is a slave node where the actual data is stored. In a Hadoop application, whenever source data reaches HDFS, the client interacts with the NameNode first to get the file metadata; the data read/write then happens directly against the DataNodes.
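A minimal Java sketch of this client interaction using the FileSystem API follows; the class name and file path are illustrative, and the cluster configuration is assumed to be available on the classpath.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        // The client asks the NameNode for metadata; the block data itself is
        // streamed directly to and from the DataNodes by the FileSystem client.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-example.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
        fs.close();
    }
}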

Figure 2, reproduced from the HDFS user guide, helps illustrate the HDFS architecture and data flow within a Hadoop application.
Figure 2: The HDFS architecture and data flow

HDFS is a highly configurable, distributed data storage and processing unit. It provides key architectural characteristics such as scalability, the flexibility to store any type of data (structured and unstructured), and fault tolerance.

Understanding Hadoop Components


The Hadoop framework consists of various components that give you multiple ways to implement a solution with available technical expertise or to build new expertise. These components help manage data storage, high-performance I/O, data analytics, and system monitoring. Some of the components of the Hadoop framework, in alphabetical order, are Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Sqoop, Tez, and Zookeeper.

 Ambari: Ambari is a web-based, user-friendly tool for provisioning, monitoring, and managing Hadoop clusters. It shows system health status on dashboards and sends alerts whenever attention is needed.
 Avro: Avro is a data serialization system that comes with a rich data structure and provides a compact, fast, binary data format. It supports remote procedure calls.
 Cassandra: Cassandra is a highly scalable and highly available database that is a useful platform for mission-critical data. Cassandra achieves very high performance using column indexes, materialized views, built-in caching, and de-normalization.
 Chukwa: Chukwa is a data collection system for monitoring and analyzing large distributed systems. It is built on top of Hadoop (the HDFS file system and the MapReduce implementation) and inherits Hadoop's scalability and robustness. It also provides a flexible toolkit for displaying, monitoring, and analyzing results to make the best use of collected data.
 HBase: HBase is a non-relational database that can be used when random, real-time read/write access to Big Data is needed. HBase runs on top of HDFS and provides fault-tolerant storage.
 Hive: Hive is the data warehouse software of the Hadoop world. It is built on top of the MapReduce framework and gives a way to execute interactive SQL queries over massive amounts of data in Hadoop. Hive supports a SQL-like query language called HiveQL.
 Mahout: Mahout is a tool in the Hadoop framework that helps to find meaningful patterns in huge data sets. Mahout is "a scalable machine learning and data mining library."
 Pig: Pig is an extensible, optimized high-level language (Pig Latin) to process and analyze huge amounts of data. Pig Latin defines a set of data transformations such as aggregation and sort, and it can be extended with User Defined Functions; Pig Latin programs are executed on top of the MapReduce framework.
 Spark: Spark provides a data computing facility; it implements fast, iterative algorithms through a robust set of APIs and can process large-scale data up to 100x faster than Hadoop MapReduce when working in memory. Spark supports many types of applications, including machine learning, stream processing, graph computation, and large ETL jobs.
 Sqoop: Sqoop is a tool for transferring data between Hadoop and relational databases (structured data stores). Sqoop imports from relational databases into HDFS or related systems such as Hive and HBase, and it supports parallel data loading for enterprise data sources.
 Tez: Tez provides a powerful and flexible engine to execute an arbitrary directed acyclic graph (DAG) of tasks to process data in both batch and interactive workloads. Tez generalizes the MapReduce paradigm to execute complex tasks for big data processing and is built on Hadoop YARN.
 Zookeeper: Zookeeper is an operational tool that provides distributed configuration services, group services, and a naming registry for distributed systems through a fast and reliable interface.

With these sets of components, we can understand the Hadoop paradigm and how these
components can help develop an end-to-end BI solution to handle massive data.

Analytics Using Hadoop


The Hadoop framework not only provides a way to store and manage high volumes of data but also opens new dimensions in the data analytics world. Data analysts use Hadoop components to explore, structure, and analyze the data, and then turn it into business insight. Hadoop lets analysts store data in any format and apply a schema only when the data is read for analysis (schema-on-read), rather than transforming it into a specified schema at load time (schema-on-write) as a conventional BI solution does.

Hadoop extends conventional business decision making with solutions that increase the use and business value of analytics throughout the organization. It also represents a shift that increases elasticity and provides a faster time to value, because data doesn't have to be modeled or integrated into an EDW before it can be analyzed.

Answer 4
Direct Batch Reporting is best for executives and operational managers who want summarized, pre-
built daily reports on Big Data content. This approach uses a medium latency architecture whereby
native file access is still an advantage (where possible), but traditional SQL-based querying can be
sufficient.

Some Big Data sources have specialized interfaces for query access (SQL-based languages like
CQL for Cassandra and HiveQL for Hadoop Hive) that make this use-case more straightforward,
although the time required to compile, query and re-assemble the data for this approach
lengthens the response time to minutes or perhaps hours.

Also, an analytic DBMS could be a popular choice in this architecture, as long as it supports the
scale-out, high-volume read access so critical in a Big Data environment. The user could interact
with a variety of data discovery tools, from simple or interactive reports and dashboards to richer
analytic views -- simply dependent on the project requirements.

This architecture can drive a more complete and relevant set of insight for a wider set of end
users.

Answer 5

 Live Exploration is best for data analysts and data scientists who want to discover
real-time patterns as they emerge from their Big Data content. This approach requires a low-latency
architecture that uses native connections to the Big Data source. Within Hadoop, for instance, this
means direct analytic and reporting access to HBase (or perhaps HDFS). For MongoDB or Cassandra, it
would mean direct access to their equivalent underlying schema-less data stores.

Such access allows data to be potentially analyzed within seconds of being transacted and
probably across more than a few dimensions. Initial construction of these exploratory queries
requires the BI tool to provide a mechanism for complex (non-SQL) queries or its own metadata
layer to map objects to native API calls, or both of these.

In this case, an in-memory engine combined with a suitable multi-dimensional analysis (end
user) tool would work with the (ideally) de-serialized, filtered data to enable drag-and-drop
pivots, swivels and analyses. Such a combination of intelligent data connection and BI platform
architecture would enable an analytic experience very much like what is available working with
traditional, relational (SQL-based) data sources.

For the sophisticated user, this approach can yield powerful insight from a variety of very high-
volume, high-velocity data.

Answer 1
The Mapper task is the first phase of processing: it processes each input record (delivered by the RecordReader) and generates an intermediate key-value pair. The Hadoop Mapper stores its intermediate output on the local disk. The key questions here are what a mapper is in MapReduce, how key-value pairs are generated in Hadoop, what InputSplit and RecordReader are, and how the mapper works in Hadoop. Related considerations are how many mappers Hadoop MapReduce runs for a program and how to calculate the number of mappers required for a given data set.

Answer 2

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a
subset of the key) is used to derive the partition, typically by a hash function. The total number
of partitions is the same as the number of reduce tasks for the job. Hence this controls which of
the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
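As a hedged illustration, a custom Partitioner might look like the sketch below, which routes keys to reduce tasks by their first character rather than by hash; the class name and routing rule are hypothetical. It would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys starting with the same letter go to the
// same reduce task, instead of the default hash-based routing.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask to keep the index non-negative, then map into [0, numPartitions).
        return (Character.toUpperCase(key.charAt(0)) & Integer.MAX_VALUE) % numPartitions;
    }
}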

Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process. The job configuration supplies the map and reduce analysis functions, and the Hadoop framework provides the scheduling, distribution, and parallelization services.

The top level unit of work in MapReduce is a job. A job usually has a map and a reduce phase,
though the reduce phase can be omitted. For example, consider a MapReduce job that counts the
number of times each word is used across a set of documents. The map phase counts the words
in each document, then the reduce phase aggregates the per-document data into word counts
spanning the entire collection.

During the map phase, the input data is divided into input splits for analysis by map tasks
running in parallel across the Hadoop cluster. By default, the MapReduce framework gets input
data from the Hadoop Distributed File System (HDFS).

The reduce phase uses results from map tasks as input to a set of parallel reduce tasks. The
reduce tasks consolidate the data into final results. By default, the MapReduce framework stores
results in HDFS.
Although the reduce phase depends on output from the map phase, map and reduce processing is
not necessarily sequential. That is, reduce tasks can begin as soon as any map task completes. It
is not necessary for all map tasks to complete before any reduce task can begin.

MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through map and reduce functions. The map tasks produce an intermediate set of key-value pairs that the reduce tasks use as input. At a high level the progression is from input key-value pairs (KV1), to the intermediate key-value pairs produced by the map phase (KV2), to the output key-value pairs of the reduce phase (KV3).

Though each set of key-value pairs is homogeneous, the key-value pairs in each step need not have the same type. For example, the key-value pairs in the input set (KV1) can be (string, string) pairs, with the map phase producing (string, integer) pairs as intermediate results (KV2), and the reduce phase producing (integer, string) pairs for the final results (KV3).

The keys in the map output pairs need not be unique. Between the map processing and the reduce
processing, a shuffle step sorts all map output values with the same key into a single reduce input
(key, value-list) pair, where the 'value' is a list of all values sharing the same key. Thus, the input
to a reduce task is actually a set of (key, value-list) pairs.

The key and value types at each stage determine the interfaces to your map and reduce functions.
Therefore, before coding a job, determine the data types needed at each stage in the map-reduce
process. For example:

1. Choose the reduce output key and value types that best represent the desired outcome.
2. Choose the map input key and value types best suited to represent the input data from
which to derive the final result.
3. Determine the transformations necessary to get from the map input to the reduce output,
and choose the intermediate map output/reduce input key and value types to match (declared on the job as sketched below).
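A brief sketch of how such choices are declared on the job follows; the class name and job name are hypothetical, and the types match the word-count style used later in this document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeDeclarationSketch {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "type-declaration-example");
        // Intermediate (KV2) types produced by the map and consumed by the reduce.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Final (KV3) types written by the reduce.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}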

MapReduce is composed of several components, including:

 JobTracker -- the master node that manages all jobs and resources in a cluster
 TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce
tasks
 JobHistoryServer -- a component that tracks completed jobs, and is typically deployed as
a separate function or with JobTracker
When we run a MapReduce job on very large data sets, the mappers produce large amounts of intermediate output that must be sent to the reducers, which can cause heavy network congestion.
To increase efficiency, users can optionally specify a Combiner, via Job.setCombinerClass(Reducer.class), to perform local aggregation of the intermediate outputs; this helps cut down the amount of data transferred from the Mapper to the Reducer.

The Combiner acts as a mini-reducer: it processes the output of the Mapper and does local aggregation before passing it on to the Reducer.
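A short sketch of wiring this up, assuming the WordCount classes shown under Answer 1 below (package PackageDemo), might look like this; the wrapper class and job name are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import PackageDemo.WordCount;

public class CombinerWiringSketch {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.MapForWordCount.class);
        // Reuse the reducer as a combiner: summing counts is associative and
        // commutative, so partial sums on the map side do not change the result.
        job.setCombinerClass(WordCount.ReduceForWordCount.class);
        job.setReducerClass(WordCount.ReduceForWordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}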

Resources for Hadoop MapReduce, word count and data loading

https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-tutorial-

https://dzone.com/articles/word-count-hello-word-program-in-mapreduce

https://blogs.oracle.com/datawarehousing/data-loading-into-hdfs-part1

http://tutorial.techaltum.com/hadoop-commands.html

Answer 1

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        // First remaining argument is the input path, second is the output path.
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);

        Job j = Job.getInstance(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each comma-separated line into words and emits (WORD, 1).
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
MapReduce is mainly used for parallel processing of large sets of data stored in a Hadoop cluster. It was originally a programming model designed by Google to provide parallelism, data distribution and fault tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping element between two linked data items: a key and its value.

The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is a pair where the key is a node id and the value is its properties, including neighbor nodes, predecessor node, etc. The MR API provides features such as batch processing, parallel processing of huge amounts of data, and high availability.

MR comes into the picture when large sets of data must be processed. Programmers write MR applications suited to their business scenarios; they have to understand the MR workflow, and according to that flow, applications are developed and deployed across Hadoop clusters. Hadoop is built on Java APIs and provides MR APIs that deal with parallel computing across nodes.

The MR workflow undergoes different phases, and the end result is stored in HDFS with replication. The JobTracker takes care of all MR jobs running on the various nodes of the Hadoop cluster; it plays a vital role in scheduling jobs and keeps track of all the map and reduce tasks. The actual map and reduce tasks are performed by the TaskTrackers.
Hadoop Map Reduce architecture

The MapReduce architecture consists of two main processing stages: the map stage and the reduce stage. The actual MR processing happens in the TaskTrackers. Between the map and reduce stages an intermediate process takes place, which shuffles and sorts the mapper output data; this intermediate data is stored on the local file system.

Steps Hadoop takes to run a job


At the highest level, there are five independent entities:
 The client, which submits the MapReduce job.
 The YARN resource manager, which coordinates the allocation of compute resources on
the cluster.
 The YARN node managers, which launch and monitor the compute containers on
machines in the cluster.
 The MapReduce application master, which coordinates the tasks running the MapReduce
job. The application master and the MapReduce tasks run in containers that are scheduled
by the resource manager and managed by the node managers.
 The distributed filesystem, which is used for sharing job files between the other entities.

Map Phase
In the map phase, MapReduce gives the user an opportunity to operate on every record in the
data set individually. This phase is commonly used to project out unwanted fields, transform
fields, or apply filters. Certain types of joins and grouping can also be done in the map (e.g.,
joins where the data is already sorted or hash-based aggregation). There is no requirement that
for every input record there should be one output record. Maps can choose to remove records or
explode one record into multiple records.
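As a hypothetical example of such a filtering map, the sketch below keeps only CSV records whose third field exceeds a threshold and projects away the remaining fields; the record layout, field meaning, and threshold are all assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only filter: keep CSV records whose third field (an
// assumed "amount" column) exceeds a threshold, emitting only the record id.
// Zero or many output records per input record are both allowed.
public class HighValueFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 3) {
            try {
                if (Double.parseDouble(fields[2]) > 1000.0) {
                    // Project: emit only the record id (field 0); drop the rest.
                    context.write(new Text(fields[0]), NullWritable.get());
                }
            } catch (NumberFormatException ignored) {
                // Malformed record: filter it out (no output record).
            }
        }
    }
}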

Every MapReduce job specifies an InputFormat. This class is responsible for determining how
data is split across map tasks and for providing a RecordReader.

In order to specify how data is split across tasks, an InputFormat divides the input data into a
set of InputSplits. Each InputSplit is given to an individual map. In addition to information
on what to read, the InputSplit includes a list of nodes that should be used to read the data. In
this way, when the data resides on HDFS, MapReduce is able to move the computation to the
data.

The RecordReader provided by an InputFormat reads input data and produces key-value pairs
to be passed into the map. This class controls how data is decompressed (if necessary), and how
it is converted to Java types that MapReduce can work with.

The Reduce Phase


The three main sub phases in the Reduce phase are-

1. Shuffle
2. Merge/Sort
3. Invocation of the reduce() method.

The Shuffle phase

The Shuffle phase ensures that the partitions reach the appropriate Reducers. The Shuffle phase is a component of the Reduce phase. During the Shuffle phase, each Reducer uses the HTTP protocol to retrieve its own partition from the Mapper nodes. By default each Reducer uses five threads to pull its partitions from the Mapper nodes; the number of threads is controlled by the property mapreduce.reduce.shuffle.parallelcopies.

But how do the Reducers know which nodes to query to get their partitions? This happens through the Application Master. As each Mapper instance completes, it notifies the Application Master about the partitions it produced during its run. Each Reducer periodically queries the Application Master for Mapper hosts until it has received a final list of the nodes hosting its partitions.

The reduce phase begins when the fraction of completed mapper instances exceeds the value set in the property mapreduce.job.reduce.slowstart.completedmaps. This does not mean that the reduce() invocations can begin; only the download of partitions from the Mapper nodes to the Reducer nodes is initiated.
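A small sketch of tuning the two properties mentioned above from a job driver follows; the class name and the values are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Number of parallel fetch threads each Reducer uses to copy map output (default 5).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // Start the shuffle only after 80% of the map tasks have completed.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
        return Job.getInstance(conf, "shuffle-tuning-example");
    }
}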

Merge/Sort

At this point the Reducer has received all the partitions it needs to process by downloading them from the Mapper nodes using the HTTP protocol. The key-value pairs received from the Mappers need to be sorted by the Mapper output key (the Reducer input key).

Each partition downloaded from a Mapper is already sorted by the Mapper output key, but all the partitions received from all the Mappers need to be merged into one sequence sorted by that key. This is the Merge/Sort phase.
The end result of this phase is that all the records meant to be processed by the Reducer are sorted by the Mapper output key (the Reducer input key). Only when this phase completes are we ready to make the reduce() calls.

Invocation of the reduce() method

MapReduce developers often wonder why we have the sort phase. The reduce method in the
Reducer handles all the values for one key in a single reduce method invocation.

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Note the interface to the reduce call: the values are received through an Iterable instance. But are all values for a given key held in memory during the execution of a reduce() call? No! That would easily overwhelm the JVM memory when one key has millions of values. In fact the Iterable is just that, an interface; in reality the framework is simply iterating over the file of Reducer input key-value pairs sorted by reducer key.

When a new key is encountered, a new reduce call is made. During iteration, the same value instance is reused and updated with each new value (the main reason why you should not rely on references to an earlier value in the iteration). If the iteration of the values is skipped or breaks out early, the framework ensures that the pointer moves on to the next key in the partition file being processed by the reducer. This is the file produced by the Merge/Sort process described earlier. It is the sort phase that allows millions of values for a given key to be processed efficiently, in one pass over a file whose records are sorted by the Reducer input key, through the familiar Iterable interface, without running out of memory or making multiple passes over the reducer input file.
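The sketch below is a hypothetical reducer illustrating this object reuse: collecting references to the value instance goes wrong, while copying the primitive value is safe.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer demonstrating value-object reuse during iteration.
public class ObjectReuseDemoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<IntWritable> references = new ArrayList<>(); // all end up pointing at one reused object
        List<Integer> copies = new ArrayList<>();          // copying the primitive value is safe
        int sum = 0;
        for (IntWritable val : values) {
            references.add(val);
            copies.add(val.get());
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}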

In general, a computer cluster is a collection of various computers that work collectively as a single
system.

“A hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource.“

“A hadoop cluster can be referred to as a computational computer cluster for storing and analysing big
data (structured, semi-structured and unstructured) in a distributed environment.”

“A computational computer cluster that distributes data analysis workload across various cluster nodes
that work collectively to process the data in parallel.”

Hadoop clusters are also known as “Shared Nothing” systems because nothing is shared between the
nodes in a hadoop cluster except for the network which connects them. The shared nothing paradigm of
a hadoop cluster reduces the processing latency so when there is a need to process queries on huge
amounts of data the cluster-wide latency is completely minimized.

Advantages of a Hadoop Cluster Setup


 As big data grows exponentially, the parallel processing capabilities of a Hadoop cluster help increase the speed of the analysis process. If the processing power of a hadoop cluster becomes inadequate with increasing volumes of data, the cluster can be scaled out easily to keep up with the speed of analysis by adding extra cluster nodes, without having to make modifications to the application logic.
 Hadoop cluster setup is inexpensive because clusters are built from cheap commodity hardware. Any organization can set up a powerful hadoop cluster without having to spend on expensive server hardware.
 Hadoop clusters are resilient to failure: whenever data is sent to a particular node for analysis, it is also replicated to other nodes of the hadoop cluster. If the node fails, the replicated copy of the data on another node in the cluster can be used for analysis.

Components of a Hadoop Cluster


Hadoop cluster consists of three components -

 Master Node – The master node in a hadoop cluster is responsible for storing data in HDFS and executing parallel computation on the stored data using MapReduce. The master node runs 3 daemons – NameNode, Secondary NameNode and JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function with HDFS. The NameNode keeps track of all the information on files (i.e. the metadata on files), such as the access time of a file, which user is accessing a file at a given time, and where in the hadoop cluster the file is stored. The Secondary NameNode keeps a backup of the NameNode data.
 Slave/Worker Node – This component of a hadoop cluster is responsible for storing the data and performing computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node of the cluster. The DataNode service is subordinate to the NameNode and the TaskTracker service is subordinate to the JobTracker.
 Client Nodes – A client node has hadoop installed with all the required cluster configuration settings and is responsible for loading data into the hadoop cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and then retrieves the output once the job processing is completed.

Adding a dedicated Hadoop system user


We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is
recommended because it helps to separate the Hadoop installation from other software
applications and user accounts running on the same machine (think: security, permissions,
backups, etc).

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine
if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our
single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the
hduser user we created in the previous section.

First, we have to generate an SSH key for the hduser user.

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

The ssh-keygen command creates an RSA key pair with an empty password. Generally, using an empty
password is not recommended, but in this case it is needed to unlock the key without your
interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user.
The step is also needed to save your local machine’s host key fingerprint to the hduser user’s
known_hosts file. If you have any special SSH configuration for your local machine like a non-
standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man
ssh_config for more information).

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$

If the SSH connect should fail, these general tips might help:

 Enable debugging with ssh -vvv localhost and investigate the error in detail.
 Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options
PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is
active, add the hduser user to it). If you made any changes to the SSH server configuration file,
you can force a configuration reload with sudo /etc/init.d/ssh reload.
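Another frequent cause of public-key login failures, not specific to Hadoop, is overly permissive file permissions: sshd refuses to use a key if the .ssh directory or the authorized_keys file is writable by other users. A quick, harmless fix to try is:

hduser@ubuntu:~$ chmod 700 $HOME/.ssh
hduser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys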

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related
Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu
box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are
not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your
mileage may vary.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and
add the following lines to the end of the file:

/etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).

Alternative

You can also disable IPv6 only for Hadoop, rather than system-wide, by adding the following line to conf/hadoop-env.sh:

conf/hadoop-env.sh

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Hadoop Installation

Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop
package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the
owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell
other than bash, you should of course update its appropriate configuration files instead of
.bashrc.

$HOME/.bashrc

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

You can repeat this exercise also for other users who want to use Hadoop.
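After reloading the shell configuration you can quickly confirm that the environment is in place; hadoop version prints the installed release (the exact version string depends on the tarball you unpacked), and the fs and hls aliases defined above will become useful once the cluster is running later in this tutorial:

hduser@ubuntu:~$ source $HOME/.bashrc
hduser@ubuntu:~$ hadoop version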

Excursus: Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. HDFS is highly fault-tolerant
and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS relaxes a few
POSIX requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache
Hadoop project, which began as a subproject of Apache Lucene and is now a top-level Apache project.

Configuration

In file conf/hdfs-site.xml:

conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
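A single-node setup usually also needs the default file system and the JobTracker address configured. The snippets below are a minimal sketch of conf/core-site.xml and conf/mapred-site.xml, assuming the commonly used localhost ports 54310 and 54311 (any free ports work, as long as the values are used consistently):

<!-- conf/core-site.xml (sketch) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>

<!-- conf/mapred-site.xml (sketch) -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce JobTracker runs at.</description>
</property>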

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which
is implemented on top of the local filesystem of your “cluster” (which includes only your local
machine if you followed this tutorial). You need to do this the first time you set up a Hadoop
cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir
variable), run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

Starting your single-node cluster

Run the command:

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a NameNode, a DataNode, a JobTracker and a TaskTracker on your machine.

If there are any errors, examine the log files in the logs/ directory of your Hadoop installation (here /usr/local/hadoop/logs/).
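A convenient way to check whether the expected Hadoop processes are actually running is the jps tool that ships with the JDK. On a healthy single-node setup it should list the five Hadoop daemons; the process IDs shown below are only illustrative and will differ on your machine:

hduser@ubuntu:~$ jps
1788 NameNode
1938 DataNode
2085 SecondaryNameNode
2149 JobTracker
2287 TaskTracker
2349 Jps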

Stopping your single-node cluster

Run the command

hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

ETL Process in Hadoop


An architecture for setting up a Hadoop data store for ETL typically involves the following steps:

1. Set up a Hadoop cluster
2. Connect data sources
3. Define the metadata
4. Create the ETL jobs
5. Create the workflow

Set Up a Hadoop Cluster

This step can be really simple or quite difficult depending on where you want the cluster to be.
On the public cloud, you can create a Hadoop cluster with just a few clicks using Amazon EMR,
Rackspace CBD, or other cloud Hadoop offerings. If the data sources are already on the same
public cloud, then this is obviously the no-brainer solution.

Connect Data Sources

The Hadoop ecosystem includes several technologies such as Apache Flume and Apache Sqoop
to connect various data sources such as log files, machine data and RDBMS. Depending on the
amount of data and the rate of new data generation, a data ingestion architecture and topology
must be planned. Start small and iterate just like any other development project. The goal is to
move the data into Hadoop at a frequency that meets analytics requirements.
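As an illustration of this step, a single Sqoop command can pull a relational table into HDFS. The connection string, credentials, table name and target directory below are placeholders, not values taken from this document:

$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user -P \
    --table orders \
    --target-dir /data/raw/orders \
    --num-mappers 4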

Define the Metadata

Hadoop is a “schema-on-read” platform and there is no need to create a schema before loading
data as databases typically require. That does not mean one can throw in any kind of data and
expect some magic to happen. It is still important to clearly define the semantics and structure of
data (the “metadata”) that will be used for analytics purposes. This definition will then help in
the next step of data transformation.

Going back to our example of the customer ID, define how exactly this ID will be stored in the
warehouse. Is it a 10 digit numeric key that will be generated by some algorithm or is it simply
appending a four digit sequence number to an existing ID?

With the metadata defined, it can be easily transposed to Hadoop using Apache HCatalog, a
technology that provides a relational table view of data in Hadoop. HCatalog also allows this view
to be shared by different types of ETL jobs written in Pig, Hive, or MapReduce.
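For example, once the metadata for the hypothetical customer data is agreed on, it could be registered once through the HCatalog command line and then reused from Pig, Hive or MapReduce jobs alike; the table and column names here are purely illustrative:

$ hcat -e "CREATE TABLE customers (customer_id STRING, name STRING, signup_date STRING)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           STORED AS TEXTFILE;"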

Create the ETL Jobs

We can finally focus on the process of transforming the various sources of data. MapReduce,
Cascading and Pig are some of the most commonly used frameworks for developing ETL jobs.
Which technology to use and how to create the jobs really depends on the data set and what
transformations are needed. Many organizations use a combination of Pig and MapReduce while
others use Cascading exclusively.

Create the Workflow

Data cleansing and transformations are easier when multiple jobs cascade into a workflow, each
performing a specific task. Often data mappings/transformations need to be executed in a
specific order and/or there may be dependencies to check. These dependencies and sequences are
captured in workflows – parallel flows allow parallel execution that can speed up the ETL
process. Finally, the entire workflow needs to be scheduled; workflows may have to run weekly,
nightly, or perhaps even hourly.

A smooth workflow will result in the source data being ingested and transformed based on the
metadata definition and stored in Hadoop. At this point, the data is ready for analysis. Hive,
Impala and Lingual provide SQL-on-Hadoop functionality while several commercial BI tools
can connect to Hadoop to explore the data visually and generate reports.

What is a workflow engine?


Once beyond the very early stages of a Hadoop deployment, the flow of data through the system
is complex. Data is coming into the cluster from multiple different sources and being processed
through many analytics and data science pipelines. In the beginning these can be managed via
cron and shell scripts, but this is not sufficiently robust and does not scale to larger teams. To
handle complex data pipelines at scale, a workflow engine is necessary. A workflow engine
allows the user to define and configure their data pipelines and then handles scheduling these
pipelines. They also have mechanisms for monitoring the progress of the pipelines and for
recovering from failure. Once that point is reached, how does one choose a workflow engine from
the many available? The major workflow engines for Hadoop are Oozie, Airflow, Luigi, and Azkaban.

Apache® Oozie™ is an open source project that simplifies workflow and coordination between
jobs. It provides users with the ability to define actions and dependencies between actions. Oozie
will then schedule actions to execute when the required dependencies have been met.

A workflow in Oozie is defined in what is called a Directed Acyclic Graph (DAG). Acyclic
means there are no loops in the graph (in other words, there's a starting point and an ending point
to the graph), and all tasks and dependencies point from start to end without going back.

A DAG is made up of action nodes and dependency nodes. An action node can be a MapReduce
job, a Pig application, a file system task, or a Java application. Flow control in the graph is
represented by node elements that provide logic based on the input from the preceding task in the
graph. Examples of flow control nodes are decision, fork, and join nodes.

An Oozie Workflow
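A minimal sketch of what such a workflow definition might look like is shown below; it wires a start node to a single Pig action and then to an end node, with a kill node for the error path. The workflow name, action name, script path and parameter names are illustrative and would need to be adapted to a real deployment.

<workflow-app name="etl-example" xmlns="uri:oozie:workflow:0.4">
  <start to="transform"/>
  <action name="transform">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>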

Query to create a table on Hadoop


Create Table is a statement used to create a table in Hive. The syntax and example are as
follows:

Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]


[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
The following query creates a table named employee:

hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Loading Data into a table on Hadoop


1. Using Insert

We can load data into a table using the Insert command in two ways: one using the Values clause and the other using queries.

1.1 Using Values


Using the Values clause, we can append more rows of data to an existing table. For example, we can
add an extra row 15,Bala,150000,35 to the employee table above, as shown below:

Insert into table employee values (15,'Bala',150000,35);

After this, you can run a select command to see the newly added row.
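For instance, a simple select on the table is enough to confirm the insert (this assumes the employee table created earlier in this section):

hive> SELECT * FROM employee;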

1.2 Using Queries

You can also load the output of a query into a table. For example, assuming you have an emp table,
you can load data from it into the employee table like below:

Insert into table employee Select * from emp where dno=45;

After this too, you can fire a select query to see the loaded rows.
2. Using Load

You can load data into a Hive table using the Load statement in two ways:
one from the local file system and the other from HDFS.

2.1 From LFS to Hive Table

Assume we have data like below in a local file system (LFS) file called /data/empnew.csv.

15,Bala,150000,35

Now we can use the load statement like below.

Load data local inpath '/data/empnew.csv' into table emp;

2.2 From HDFS to Hive Table

If we do not use the local keyword, the path is treated as an HDFS path.

Load data inpath '/data/empnew.csv' into table emp;

After these two statements, you can fire a select query to see the loaded rows in the table.

3. Using HDFS command


Assume you have data in a local file; you can simply upload the data using HDFS commands.

Run the describe command to get the location of the table, like below.

describe formatted employee;

It will display the location of the table. Assuming you got the location as /data/employee, you can
upload data into the table by using one of the commands below.

hadoop fs -put /path/to/localfile /data/employee

hadoop fs -copyFromLocal /path/to/localfile /data/employee

hadoop fs -moveFromLocal /path/to/localfile /data/employee
