Sunteți pe pagina 1din 28

A Seminar Report

On

HADOOP

By Varun Narang

MA 399 Seminar
IIT Guwahati

Roll Number: 09012332


Index of Topics:

1. Abstract

2. Introduction

3. What is MapReduce?

4. HDFS
Assumptions
Design
Concepts
The Communication Protocols
Robustness
Cluster Rebalancing
Data Integrity
Metadata disk failure
Snapshots

5. Data Organisation
Data Blocks
Staging
Replication Pipelining

6. Accessibility

7. Space Reclaimation
File Deletes and Undeletes
Decrease Replication Factor
Hadoop Filesystems

7. Hadoop Archives
Bibliography
1)Hadoop- The Definitive Guide, OReilly 2009, Yahoo! Press

2)MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay
Ghemawat

3)Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-


Reduce, Delip Rao, David Yarowsky, Dept. of Computer Science, Johns Hopkins
University

4)Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia,


Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, University of
California, Berkeley

5)MapReduce in a Week By Hannah Tang, Albert Wong, Aaron Kimball


Winter 2007
Abstract

Problem Statement:
The amount total digital data in the world has exploded in recent years. This has
happened primarily due to information (or data) generated by various enterprises all over
the globe. In 2006, the universal data was estimated to be 0.18 zettabytes in 2006, and is
forecasting a tenfold growth by 2011 to 1.8 zettabytes.
1 zettabyte = 1021 bytes

The problem is that while the storage capacities of hard drives have increased
massively over the years, access speedsthe rate at which data can be read from drives
have not kept up. One typical drive from 1990 could store 1370 MB of data and had a
transfer speed of 4.4 MB/s, so we could read all the data from a full drive in around 300
seconds. In 2010, 1 Tbdrives are the standard hard disk size, but the transfer speed is
around 100 MB/s, so it takes more than two and a half hours to read all the data off the
disk.

Solution Proposed:
Parallelisation:
A very obvious solution to solving this problem is parallelisation. The input data is
usually large and the computations have to be distributed across hundreds or thousands of
machines in order to finish in a reasonable amount of time.
Reading 1 Tb from a single hard drive may take a long time, but on parallelizing this over
100 different machines can solve the problem in 2 minutes.

The key issues involved in this Solution:


Hardware failure
Combine the data after analysis (i.e reading)

Apache Hadoop is a framework for running applications on large cluster built of


commodity hardware. The Hadoop framework transparently provides applications both
reliability and data motion.
It solves the problem of Hardware Failure through replication. Redundant copies of the
data are kept by the system so that in the event of failure, there is another copy available.
(Hadoop Distributed File System)
The second problem is solved by a simple programming model- Mapreduce. This
programming paradigm abstracts the problem from data read/write to computation over a
series of keys.

Even though HDFS and MapReduce are the most significant features of Hadoop, other
subprojects provide complementary services:
The various subprojects of hadoop includes:-
Core
Avro
Pig
HBase
Zoo Keeper
Hive
Chukwa
Introduction
Hadoop is designed to efficiently process large volumes of information by connecting many
commodity computers together to work in parallel. A 1000 CPU single machine (i.e a
supercomputer with a vast memory storage) would cost a lot. Thus Hadoop parallelizes the
computation by tying smaller and more reasonably priced machines together into a single cost-
effective compute cluster.
The features of hadoop that stand out are its simplified programming model and its efficient,
automatic distribution of data and work across machines.

Now we take a deeper look into these two main features of Hadoop and list their important
characteristics and description.

1. Data Distribution:

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The
Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed
by different nodes in the cluster. In addition to this each chunk is replicated across several
machines, so that a single machine failure does not result in any data being unavailable. In case of
a system failure, the data is re-replicated which can result in partial storage. Even though the file
chunks are replicated and distributed across several machines, they form a single namespace, so
their contents are universally accessible.

Data is conceptually record-oriented in the Hadoop programming


framework. Individual input files are broken into segments and each segment
is processed upon by a node. The Hadoop framework schedules the
processes to be run in proximity to the location of data/records using
knowledge from the distributed file system. Each computation process
running on a node operates on a subset of the data. Which data operated on
by which node is decide based on its proximity to the node: i.e:
Most data is read from the local disk straight into the CPU, alleviating strain
on network bandwidth and preventing unnecessary network transfers. This
strategy of moving computation to the data, instead of moving the data to the
computation allows Hadoop to achieve high data locality which in turn results in high
performance.
2. MapReduce: Isolated Processes

Hadoop limits the amount of communication which can be performed by the processes, as each
individual record is processed by a task in isolation from one another. It makes the whole
framework much more reliable. Programs must be written to conform to a particular
programming model, named "MapReduce."

MapReduce is composed of two chief elements: Mappers and Reducers.


1. Data segments or records are processed in isolation by tasks called Mappers.
2. The output from the Mappers is then brought together by Reducers, where results from
different mappers are merged together.
Separate nodes in a Hadoop cluster communicate implicitly. Pieces of data can be tagged with
key names which inform Hadoop how to send related bits of information to a common destination
node. Hadoop internally manages all of the data transfer and cluster topology issues.
By restricting the communication between nodes, Hadoop makes the distributed system much
more reliable. Individual node failures can be worked around by restarting tasks on other
machines. The other workers continue to operate as though nothing went wrong, leaving the
challenging aspects of partially restarting the program.

What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users
specify a map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the same
intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large
cluster of commodity machines. The run-time system takes care of the details of partitioning the
input data, scheduling the program's execution across a set of machines, handling machine
failures, and managing the required inter-machine communication (i.e this procedure is abstracted
or hidden from the user who can focus on the computational problem)

Note: This abstraction was inspired by the map and reduces primitives present in Lisp
and many other functional languages.

The Programming Model:


The computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two
functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key Iand passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key Iand a
set of values for that key. It merges together these values to form a possibly smaller
set of values. Typically just zero or one output value is produced per Reduce
invocation. The intermediate values
are supplied to the user's reduce function via an iterator.
Map and Reduce (Associated Types):
The input keys and values are drawn from a different domain than the output keys
and values. Also, the intermediate keys and values are from the same domain as the
output keys and values.

Hadoop Map-Reduce is a software framework for easily writing applications which


process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Analyzing the Data with Hadoop MapReduce:


MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. It splits the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are
then input to the reduce tasks.
Both the input and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.
Data Locality Optimisation:
Typically the compute nodes and the storage nodes are the same. The Map-Reduce framework
and the Distributed File System run on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data is already present, resulting in
very high aggregate bandwidth across the cluster.

There are two types of nodes that control the job execution process:
1. jobtrackers
2. tasktrackers

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.

Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the
overall progress of each job.
If a tasks fails, the jobtracker can reschedule it on a different tasktracker.
Input splits: Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the userdefined map
function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained. On the other
hand, if splits are too small, then the overhead of managing the splits and of map task creation
begins to dominate the total job execution time. For most jobs, a good split size tends to be the
size of a HDFS block, 64 MB by default.
WHY?
Map tasks write their output to local disk, not to HDFS. Map output is intermediate output: its
processed by reduce tasks to produce the final output, and once the job is complete the map
output can be thrown away. So storing it in HDFS, with replication, would be a waste of time. It
is also possible that the node running the map task fails before the map output has been consumed
by the reduce task.

Reduce tasks dont have the advantage of data localitythe input to a single reduce task is
normally the output from all mappers.

In case we have a single reduce task that is fed by all of the map tasks: The sorted map outputs
have to be transferred across the network to the node where the reduce task is running, where they
are merged and then passed to the user-defined reduce function.

The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the
reduce output, the first replica is stored on the local node, with other replicas being stored on off-
rack nodes.
MapReduce data flow with a single reduce task

When there are multiple reducers:


The map tasks partition their output, each creating one partition for each reduce task. There can
be many keys (and their associated values) in each partition, but the records for every key are all
in a single partition.

MapReduce data flow with multiple reduce task.

It is also possible to have zero reduce tasks as illustrated in the figure below.
MapReduce data flow with no reduce tasks

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster. In order to
minimize the data transferred between the map and reduce tasks, combiner functions are
introduced. Hadoop allows the user to specify a combiner function to be run on the map output
the combiner functions output forms the input to the reduce function. Combiner finctions can
help cut down the amount of data shuffled between the maps and the reduces.

Hadoop Streaming:

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions
in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and your program, so you can use any language that can read standard input and
write to standard output to write your MapReduce program.

Hadoop Pipes:

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which
uses standard input and output to communicate with the map and reduce code, Pipes uses sockets
as the channel over which the tasktracker communicates with the process running the C++ map or
reduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)

Filesystems that manage the storage across a network of machines are called distributed
filesystems. They are network-based, and thus all the complications of network programming are
also present in distributed file system. Hadoop comes with a distributed filesystem called HDFS,
which stands for Hadoop Distributed Filesystem.

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very
large amounts of data (terabytes or even petabytes), and provide high-throughput access to this
information.

ASSUMPTIONS AND GOALS:

1. Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part
of the file systems data. In case of such a large number of nodes, the probability of one of them
failing becomes substantial.

2. Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. HDFS is designed more
for batch processing rather than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access.

3. Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate
data bandwidth and scale to hundreds of nodes in a single cluster.

4. Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created,
written, and closed need not be changed. This assumption simplifies data coherency issues and
enables high throughput data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.

5. Moving Computation is Cheaper than Moving Data

A computation requested by an application is much more efficient if it is executed near the data
it operates on. This is especially true when the size of the data set is huge. This minimizes
network congestion and increases the overall throughput of the system. HDFS provides interfaces
for applications to move themselves closer to where the data is located.

6. Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
The Design of HDFS:

Very large files:

Very large in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access:

HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time. Each analysis will involve a large
proportion, if not all, of the dataset, so the time to read the whole dataset is more important
than the latency in reading the first record.

Commodity hardware:

Hadoop doesnt require expensive, highly reliable hardware to run on. Its designed to run on
clusters of commodity hardware for which the chance of node failure across the cluster is
high for large clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the face of such failure. The use of commodity hardware restricts
the effectiveness of Hadoop in some applications. These applications have the following
common characteristics:

Low-latency data access

Applications that require low-latency access to data, in the tens of milliseconds range, will
not work well with HDFS.

Lots of small files:

Since the master node (or namenode) holds file-system metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the namenode. As a
rule of thumb, each file, directory, and block takes about 150 bytes.

Multiple writers, arbitrary file modifications:

Files in HDFS may be written to by a single writer. Writes are always made at the end of the
file. There is no support for multiple writers, or for modifications at arbitrary offsets in the
file.
A few Important Concepts of Hadoop Distributed File System:
1. Blocks:
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an integral
multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk
blocks are normally 512 bytes

HDFS too has the concept of a block, but it is a much larger unit64 MB by default. Like in a
filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored
as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a
single block does not occupy a full blocks worth of underlying storage.
HDFS blocks are large compared to disk blocks.

Having a block abstraction for a distributed filesystem brings several benefits:

A file can be larger than any single disk in the network. Theres nothing that requires the
blocks from a file to be stored on the same disk, so they can take advantage of any of the
disks in the cluster.

Making the unit of abstraction a block rather than a file simplifies the storage subsystem.

Blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small
number of physically separate machines (typically three).

2. Namenodes and Datanodes:

A HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the
master) and a number of datanodes (workers).

The namenode has two chief functions:


To manage the filesystem namespace.
To maintains the filesystem tree and the metadata for all the files and directories in the
tree.
. The namenode also knows the datanodes on which all the blocks for a given file are
located
This information is stored persistently on the local disk in the form of two files: the namespace
image and the edit log. The namenode however does not store block locations persistently, since
this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and
datanodes.

Datanodes are the work horses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if
the machine running the namenode were obliterated, all the files on the filesystem would be lost
since there would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop
provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata.
Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems.
These writes are synchronous and atomic. The usual configuration Choice is to write to local disk
as well as a remote NFS mount.

Another approach is to run a secondary namenode. It does not act as a namenode. Its main role
is to periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large. The secondary namenode usually runs on a separate physical machine, since
it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a
copy of the merged namespace image, which can be used in the event of the namenode failing.
However, the state of the secondary namenode lags that of the primary, so in the event of total
failure of the primary data, loss is almost guaranteed.

3. The File System Namespace:

HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas or access
permissions.
The Namenode maintains the file system namespace. Any change to the file system namespace
or its properties is recorded by the Namenode.

4. Data Replication:

HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the Datanodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
Datanode.

5. Replica Placement:

Optimizing replica placement distinguishes HDFS from most other distributed file systems.. The
purpose of a rack-aware replica placement policy is to improve data reliability, availability, and
network bandwidth utilization.
Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth
between machines in different racks.
For the common case, when the replication factor is three, HDFSs placement policy is to put
one replica on one node in the local rack, another on a different node in the local rack, and the last
on a different node in a different rack. This policy cuts the inter-rack write traffic which generally
improves write performance.

6. Replica Selection:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read
request from a replica that is closest to the reader.

7. Safemode:

On startup, the NameNode enters a special state called Safemode. Replication of data blocks
does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat
and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that
a DataNode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely replicated when the minimum number of replicas of that data block has checked
in with the NameNode. After a configurable percentage of safely replicated data blocks checks in
with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It
then determines the list of data blocks (if any) that still have fewer than the specified number of
replicas. The NameNode then replicates these blocks to other DataNodes.

8. The Persistence of File System Metadata:

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called
the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog. The entire file system
namespace, including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNodes local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in
memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of
RAM is plenty to support a huge number of files and directories
The DataNode stores HDFS data in files in its local file system. The DataNode has no
knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file
system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to
determine the optimal number of files per directory and creates subdirectories appropriately. It is
not optimal to create all local files in the same directory because the local file system might not
be able to efficiently support a huge number of files in a single directory. When a DataNode starts
up, it scans through its local file system, generates a list of all HDFS data blocks that correspond
to each of these local files and sends this report to the NameNode: this is the Blockreport
The Communication Protocols:
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client
establishes a connection to a configurable TCP port on the NameNode machine. It talks the
ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode
Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the
DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds
to RPC requests issued by DataNodes or clients.
Robustness:
The primary objective of HDFS is to store data reliably even in the presence of failures. The
three common types of failures are NameNode failures, DataNode failures and network
partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition
can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode
detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes
without recent Heartbeats as dead and does not forward any new IO requests to them. Any data
that was registered to a dead DataNode is not available to HDFS any more. DataNode death may
cause the replication factor of some blocks to fall below their specified value. The NameNode
constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become
unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the
replication factor of a file may be increased.
Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might
automatically move data from one DataNode to another if the free space on a DataNode falls
below a certain threshold. In the event of a sudden high demand for a particular file, a scheme
might dynamically create additional replicas and rebalance other data in the cluster. These types
of data rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can
occur because of faults in a storage device, network faults, or buggy software. The HDFS client
software implements checksum checking on the contents of HDFS files. When a client creates an
HDFS file, it computes a checksum of each block of the file and stores these checksums in a
separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies
that the data it received from each DataNode matches the checksum stored in the associated
checksum file. If not, then the client can opt to retrieve that block from another DataNode that has
a replica of that block.

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can
cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured
to support maintaining multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously.
This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate
of namespace transactions per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and
EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode
machine fails, manual intervention is necessary. Currently, automatic restart and failover of the
NameNode software to another machine is not supported.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot
feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
HDFS does not currently support snapshots but will in a future release.
Data Organization
Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are
those that deal with large data sets. These applications write their data only once but they read it
one or more times and require these reads to be satisfied at streaming speeds. HDFS supports
write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an
HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a
different DataNode.

Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the
HDFS client caches the file data into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file accumulates data worth over one HDFS
block size, the client contacts the NameNode. The NameNode inserts the file name into the file
system hierarchy and allocates a data block for it. The NameNode responds to the client request
with the identity of the DataNode and the destination data block. Then the client flushes the block
of data from the local temporary file to the specified DataNode. When a file is closed, the
remaining un-flushed data in the temporary local file is transferred to the DataNode. The client
then tells the NameNode that the file is closed. At this point, the NameNode commits the file
creation operation into a persistent store. If the NameNode dies before the file is closed, the file is
lost.
The above approach has been adopted after careful consideration of target applications that run
on HDFS. These applications need streaming writes to files. If a client writes to a remote file
directly without any client side buffering, the network speed and the congestion in the network
impacts throughput considerably. This approach is not without precedent. Earlier distributed file
systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement
has been relaxed to achieve higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained
in the previous section. Suppose the HDFS file has a replication factor of three. When the local
file accumulates a full block of user data, the client retrieves a list of DataNodes from the
NameNode. This list contains the DataNodes that will host a replica of that block. The client then
flushes the data block to the first DataNode. The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local repository and transfers that portion to the second
DataNode in the list. The second DataNode, in turn starts receiving each portion of the data
block, writes that portion to its repository and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be
receiving data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

Accessibility:

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a
java API for applications to use. A C language wrapper for this Java API is also available. In
addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in
progress to expose HDFS through the WebDAV protocol.

Space Reclamation:
1. File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as
long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the
expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion
of a file causes the blocks associated with the file to be freed. Note that there could be an
appreciable time delay between the time a file is deleted by a user and the time of the
corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user
wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and
retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The
/trash directory is just like any other directory with one special feature: HDFS applies specified
policies to automatically delete files from this directory. The current default policy is to delete
files from /trash that are more than 6 hours old. In the future, this policy will be configurable
through a well defined interface.
2. Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can
be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then
removes the corresponding blocks and the corresponding free space appears in the cluster. Once
again, there might be a time delay between the completion of the setReplication API call and the
appearance of free space in the cluster.

3. Hadoop Filesystems

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there
are several concrete implementations, which are described in following table.

A filesystem for a locally connected


disk with client-side checksums.
Local file Use RawLocalFileSys
fs.LocalFileSystem tem for a local filesystem with no
checksums.
Hadoops distributed filesystem.
HDFS is designed to work efficiently
HDFS hdfs hdfs.DistributedFileSystem in conjunction with Map-
Reduce.
A filesystem providing read-only
access to HDFS over HTTP. (Despite
HFTP hftp its name, HFTP has no connection
hdfs.HftpFileSystem with FTP.) Often used with distcp
(Parallel Copying with
A filesystem providing read-only
access to HDFS over HTTPS. (Again,
HSFTP hsftp Hdfs.HsftpFileSystem this has no connection with FTP.)
A filesystem layered on another
filesystem for archiving files.
HAR har Fs.HarFileSystem Hadoop
Archives are typically used
for archiving files in HDFS to
reduce
the namenodes memory usage.
CloudStore (formerly Kosmos
filesystem)
KFS(Clou Kfs fs.kfs.KosmosFileSystem is a distributed filesystem
d Store) like HDFS or Googles GFS,
written in C++.
A filesystem backed by an FTP
FTP ftp fs.ftp.FtpFileSystem server.
A filesystem backed by Amazon
S3(Nativ s3n fs.s3native.NativeS3FileSyste S3.
e) m
A filesystem backed by Amazon
S3, which stores files in blocks
S3(Block S3 fs.s3.S3FileSystem A (much like HDFS) to overcome
Based) S3s 5 GB file size limit.
Hadoop Archives:

HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is
held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory
on the namenode. (Note, however, that small files do not take up any more disk space than is
required to store the raw contents of the file. For example, a 1 MB file stored with a block size of
128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file
archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode
memory usage while still allowing transparent access to files. In particular, Hadoop Archives can
be used as input to MapReduce.

Using Hadoop Archives

A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a
MapReduce job to process the input files in parallel, so to run it, you need a MapReduce cluster
running to use it.

Limitations

There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of
the original files, so you need as much disk space as the files you are archiving to create the
archive (although you can delete the originals once you have created the archive). There is
currently no support for archive compression, although the files that go into the archive can be
compressed (HAR files are like tar files in this respect). Archives are immutable once they have
been created. To add or remove files, you must recreate the archive. In practice, this is not a
problem for files that dont change after being written, since they can be archived in batches on a
regular basis, such as daily or weekly. As noted earlier, HAR files can be used as input to
MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a
single MapReduce split, so processing lots of small files, even in a HAR file, can still be
inefficient.

S-ar putea să vă placă și