
What is Hadoop?

Apache Hadoop is a software framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is an open-source framework, licensed
under the Apache v2 license, that supports data-intensive distributed applications running on large
clusters of commodity hardware. Hadoop derives from Google's MapReduce and Google File System papers.

Describe the history of Hadoop.


Apache Hadoop was born out of a need to process an avalanche of Big Data. The web was generating
more and more information on a daily basis, and it was becoming very difficult to index over one billion
pages of content. To cope, Google invented a new style of data processing known as MapReduce. A year
after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike
Cafarella, inspired by that paper, created Hadoop to apply these concepts in an open-source framework
supporting distribution for the Nutch search engine project. Given that original use case, Hadoop was
designed with a simple write-once storage infrastructure.
Hadoop has since moved far beyond its beginnings in web indexing and is now used across many
industries, including finance, media and entertainment, government, healthcare, information services,
and retail, for tasks that share a common theme: large variety, volume, and velocity of data, both
structured and unstructured. The limitations of the original storage infrastructure, however, remain.
Hadoop is increasingly becoming the go-to framework for large-scale, data-intensive deployments.
Hadoop is built to process large amounts of data, from terabytes to petabytes and beyond. With this much
data, it is unlikely to fit on a single computer's hard drive, much less in memory. The beauty of
Hadoop is that it is designed to efficiently process huge amounts of data by connecting many commodity
computers together to work in parallel. Using the MapReduce model, Hadoop can take a query over a
dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the
problem of having data that is too large to fit onto a single machine.
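As a concrete illustration of the model, the sketch below is the classic word-count job written against the Hadoop Java MapReduce API: the map phase emits (word, 1) pairs in parallel across input splits, and the reduce phase sums the counts per word. The input and output paths are command-line placeholders, not paths from this document.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional: pre-aggregate locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The combiner line is a design choice rather than a requirement: it merely pre-aggregates counts on each node before the shuffle to cut network traffic.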
Apache Hadoop includes the Hadoop Distributed File System (HDFS), which breaks up input data and stores it
on the compute nodes. This makes it possible for data to be processed in parallel using all of the
machines in the cluster. HDFS is written in Java and runs on different operating systems.
Hadoop was designed from the beginning to accommodate multiple file system implementations and
there are a number available. HDFS and the S3 file system are probably the most widely used, but many
others are available, including the MapR file system.
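To illustrate the pluggable file system layer, the sketch below opens a file system through the org.apache.hadoop.fs.FileSystem abstraction, where the concrete implementation is selected purely by URI scheme. The hostname and bucket name are placeholders, and the commented S3 line additionally assumes the hadoop-aws module is on the classpath.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The same FileSystem API resolves to different implementations by URI scheme.
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020/"), conf);
    // FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf); // requires hadoop-aws

    // List the root directory of the cluster's HDFS namespace.
    for (FileStatus status : hdfs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    hdfs.close();
  }
}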

How is Hadoop Different from Past Techniques?

Hadoop can handle data in a very fluid way. It is more than just a faster, cheaper database and
analytics tool. Unlike databases, Hadoop doesn't insist that you structure your data: data may be
unstructured and schemaless, and users can dump their data into the framework without needing to
reformat it. By contrast, relational databases require that data be structured and schemas be defined
before the data is stored.
Hadoop has a simplified programming model that allows users to quickly write and test distributed
systems. Performing computation on large volumes of data has been done before, usually in a distributed
setting, but writing distributed systems is notoriously hard. By trading away some programming
flexibility, Hadoop makes it much easier to write distributed programs.
Hadoop is easy to administer. Alternative high-performance computing (HPC) systems allow programs
to run on large collections of computers, but they typically require rigid program configuration and
generally require that data be stored on a separate storage area network (SAN). Schedulers on HPC
clusters require careful administration, and program execution is sensitive to node failure; by
comparison, administering a Hadoop cluster is much easier.
Hadoop invisibly handles job control issues such as node failure. If a node fails, Hadoop makes sure the
computations are run on other nodes and that data stored on that node is recovered from other nodes.
Hadoop is agile. Relational databases are good at storing and processing data sets with predefined,
rigid data models. For unstructured data, they lack the agility and scalability that are needed.
Apache Hadoop makes it possible to cheaply process and analyze huge amounts of both structured and
unstructured data together, and to process data without defining all of the structure ahead of time.

Why use Apache Hadoop?


It's cost-effective. Apache Hadoop controls costs by storing data more affordably per terabyte than other
platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers compute and
storage for hundreds of dollars per terabyte.
It's fault-tolerant. Fault tolerance is one of the most important advantages of using Hadoop. Even if
individual nodes experience high rates of failure when running jobs on a large cluster, data is replicated
across the cluster so that it can be recovered easily in the face of disk, node, or rack failures, as the
sketch after these points illustrates.
It's flexible. The flexible way that data is stored in Apache Hadoop is one of its biggest assets, enabling
businesses to generate value from data that was previously considered too expensive to store and
process in traditional databases. With Hadoop, you can use all types of data, both structured and
unstructured, to extract more meaningful business insights from more of your data.
It's scalable. Hadoop is a highly scalable storage platform because it can store and distribute very large
data sets across clusters of hundreds of inexpensive servers operating in parallel. The problem with
traditional relational database management systems (RDBMS) is that they can't scale to process massive
volumes of data.
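As a rough illustration of how the replication behind fault tolerance is controlled, the Java sketch below sets a default replication factor and then raises the replication of one file through the HDFS FileSystem API. The file path is hypothetical, and the NameNode address is assumed to come from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");   // default replication factor for files created by this client

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/important.log");   // hypothetical file

    // Ask the NameNode how many copies of this file currently exist.
    System.out.println("current replication: " + fs.getFileStatus(file).getReplication());

    // Request that HDFS keep five copies of this file's blocks.
    fs.setReplication(file, (short) 5);
    fs.close();
  }
}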

Describe the Hadoop architecture.


The Hadoop cluster building blocks are as follows:
NameNode: The centerpiece of HDFS. It stores the file system metadata and directs the slave DataNode
daemons to perform the low-level I/O tasks; in a classic (Hadoop 1.x) cluster, the master node typically
also runs the JobTracker process.
Secondary NameNode: Periodically checkpoints the NameNode metadata by merging the transaction (edit)
log into the file system image; it is not a hot standby.
DataNodes: The slave nodes that store the actual HDFS block data and, in a classic cluster, also run
the TaskTracker process.
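As an illustration of the split between NameNode metadata and DataNode block storage, the sketch below asks the NameNode for the block locations of a file through the Hadoop FileSystem API. The file path is hypothetical; the blocks it reports physically live on the DataNodes listed as hosts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());   // NameNode taken from fs.defaultFS
    Path file = new Path("/data/sample.txt");               // hypothetical file
    FileStatus status = fs.getFileStatus(file);

    // The NameNode answers this metadata query; the block data itself sits on DataNodes.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + "  length " + block.getLength()
          + "  hosts " + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}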

How does an RDBMS differ from MapReduce?

                Traditional RDBMS            MapReduce
Data size       Gigabytes                    Petabytes
Access          Interactive and batch        Batch
Updates         Read and write many times    Write once, read many times
Structure       Static schema                Dynamic schema
Integrity       High                         Low
Scaling         Nonlinear                    Linear

Oracle is suitable for transactional data, while Hadoop is suited to historical, analytical, or temporary data.

Hadoop is not suitable for real-time data; for real-time access it is better to use Oracle.

[Diagram: real-time data arrives from a Java application via CISCO JMS and is processed separately; its output is combined with historical data, kept in HDFS storage and processed with MapReduce, to produce the final result.]

Horizontal scalability
Don't bring the data to the computation; bring the code to the data.

Big Data
Hadoop
Hadoop Installation (planning, configuration, installation, authentication, management)
Cluster Maintenance
HDFS
MapReduce
Ecosystems (Pig, Hive, HBase, MongoDB, Sqoop, Flume, Oozie, Zookeeper)
===============================
Common cluster tasks
===============================
Managing logs
Starting and Stopping Processes
Adding a Datanode
Decommissioning a Datanode
Checking Filesystem Integrity
Balancing HDFS Block Data
Dealing with a Failed Disk
Adding a Tasktracker
Decommissioning a Tasktracker
Killing a MapReduce Job
Killing a MapReduce Task
Dealing with a Blacklisted Tasktracker
Backup
Recovery
Upgrade
============================

Pig - Dataflow scripting
Hive - Data warehouse and query language
HBase - Hadoop NoSQL database
MongoDB - Data storage
Sqoop - Data transfer system (ETL)
Flume - Streams data into HDFS and HBase
Oozie - Workflow scheduler/engine
Zookeeper - Coordination service
