
Hadoop Distributed File System

Presented by

MOHAMMAD SUFIYAN
NAGARAJU KOLA
PRUDHVI KRISHNA KAMIREDDY
CONTENTS
What is Hadoop?
Hadoop Distributed File System
Basic Concepts of HDFS
HDFS Architecture
Data Storage Reliability
HDFS API
Advantages
Disadvantages
Real Time Examples
Conclusion
HADOOP
What is Hadoop?
It's a framework for running applications on large clusters of
commodity hardware, designed to store and process huge
amounts of data.
Apache Software Foundation project
Open source
Runs on commodity clusters and on cloud infrastructure such as
Amazon's EC2
Hadoop includes:
HDFS, a distributed file system
Map/Reduce, a programming model for offline (batch) computation,
implemented on top of HDFS
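To make the Map/Reduce model concrete, here is a minimal, self-contained Python sketch. It is a simulation of the map, shuffle, and reduce phases for word counting, not the Hadoop API itself; all names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores big data", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"], counts["data"])  # 2 2
```

In real Hadoop the map and reduce tasks run in parallel on the nodes holding the data; the shuffle moves intermediate pairs between them over the network.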
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file
system designed to run on commodity hardware. It has many
similarities with existing distributed file systems, but the
differences are significant:
It is highly fault-tolerant and designed to be deployed on low-cost
hardware.
It provides high-throughput access to application data and is
suitable for applications that have large data sets.
It relaxes a few POSIX requirements to enable streaming access to
file system data.
Basic Concepts of HDFS
HDFS is a file system written in Java, based on Google's GFS
Provides redundant storage for massive amounts of data
HDFS works best with a smaller number of large files
Millions, as opposed to billions, of files
Typically 100 MB or more per file
Files in HDFS are write-once
Optimized for streaming reads of large files, not random reads
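The preference for fewer, larger files follows from the NameNode keeping every file and block object in memory. A back-of-the-envelope sketch in Python (the block size and per-object overhead are illustrative assumptions, not HDFS constants):

```python
BLOCK_SIZE = 128 * 1024**2     # assume a 128 MB block size
BYTES_PER_OBJECT = 150         # assumed rough NameNode memory cost per file/block object

def namenode_bytes(num_files, file_size):
    # Each file costs one file object plus one object per block
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Same 1 TB of data: a million 1 MB files vs. ~8000 full-block files
small = namenode_bytes(1_000_000, 1 * 1024**2)
large = namenode_bytes(8_192, 128 * 1024**2)
print(small // large)  # many small files cost over 100x more NameNode memory
```

The data on disk is identical in both cases; only the metadata burden on the single NameNode differs.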
HDFS Architecture
HDFS is composed of interconnected clusters of nodes on which
files and directories reside. An HDFS cluster consists of a
single master node, known as the NameNode, which manages the
file system namespace and regulates client access to files. In
addition, worker nodes (DataNodes) store the actual file data as
blocks.
NameNode
The NameNode is software that runs on commodity hardware
with the GNU/Linux operating system. The system running the
NameNode acts as the master server and performs the following
tasks:
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and
opening files and directories.
Data Node
A DataNode runs on commodity hardware with the GNU/Linux
operating system and the DataNode software. Every worker node
(commodity hardware/system) in a cluster runs a DataNode.
These nodes manage the data storage of their system.
DataNodes perform read and write operations on the file system,
as per client requests.
They also perform operations such as block creation, deletion, and
replication according to instructions from the NameNode.
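The division of labor described above can be sketched as a toy simulation: the NameNode only issues instructions, while DataNodes carry out block creation, deletion, and replication. The class and method names here are illustrative, not Hadoop classes:

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}        # block_id -> bytes: this node's local storage

    # Block operations are executed on NameNode instruction
    def create_block(self, block_id, data):
        self.blocks[block_id] = data

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)

    def replicate_to(self, block_id, other):
        # Copy a block to another DataNode (re-replication)
        other.create_block(block_id, self.blocks[block_id])

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.create_block("blk_1", b"payload")
dn1.replicate_to("blk_1", dn2)   # e.g. NameNode detected under-replication
dn1.delete_block("blk_1")        # e.g. NameNode rebalancing the cluster
print(sorted(dn2.blocks))        # ['blk_1']
```

In a real cluster DataNodes report their block lists to the NameNode via periodic heartbeats, and the instructions arrive in the heartbeat replies.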
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or
an application can create directories and store files inside these
directories. The file system namespace hierarchy is similar to most
other existing file systems; one can create and remove files, move a
file from one directory to another, or rename a file.
The NameNode maintains the file system namespace. Any change to
the file system namespace or its properties is recorded by the
NameNode. An application can specify the number of replicas of a
file that should be maintained by HDFS.
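The namespace operations above can be sketched as a tiny in-memory tree of the kind the NameNode maintains. This is a minimal illustration, assuming absolute paths and no error handling:

```python
class Namespace:
    def __init__(self):
        self.entries = {"/": set()}          # directory path -> set of child names

    def mkdir(self, path):
        parent, name = path.rsplit("/", 1)
        self.entries[parent or "/"].add(name)
        self.entries[path] = set()

    def create_file(self, path):
        parent, name = path.rsplit("/", 1)
        self.entries[parent or "/"].add(name)

    def rename(self, old, new):
        # Renaming only touches namespace metadata, never the block data
        parent_old, name_old = old.rsplit("/", 1)
        parent_new, name_new = new.rsplit("/", 1)
        self.entries[parent_old or "/"].remove(name_old)
        self.entries[parent_new or "/"].add(name_new)

ns = Namespace()
ns.mkdir("/user")
ns.create_file("/user/data.txt")
ns.rename("/user/data.txt", "/user/archive.txt")
print(sorted(ns.entries["/user"]))  # ['archive.txt']
```

Because all of this lives on the NameNode, namespace changes like rename are cheap metadata updates; no blocks move on the DataNodes.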
Data Replication
HDFS is designed to reliably store very large files across machines
in a large cluster. It stores each file as a sequence of blocks; all
blocks in a file except the last block are the same size. The blocks of
a file are replicated for fault tolerance. The block size and
replication factor are configurable per file. An application can
specify the number of replicas of a file. The replication factor can be
specified at file creation time and changed later. Files in HDFS are
write-once and have strictly one writer at any time.
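A minimal sketch of the scheme just described: a file is cut into fixed-size blocks (all full-size except the last), and each block is placed on several DataNodes. Sizes are shrunk for illustration, and the round-robin placement is a simplification; real HDFS uses rack-aware placement:

```python
def split_into_blocks(data, block_size):
    # All blocks except possibly the last are exactly block_size bytes
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    # Toy round-robin placement: each block gets `replication` distinct nodes
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=300)   # toy 300-byte blocks
print([len(b) for b in blocks])                    # [300, 300, 300, 100]
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement[0])                                # ['dn1', 'dn2', 'dn3']
```

With a replication factor of 3, losing any single DataNode leaves at least two live copies of every block, and the NameNode can schedule re-replication from the survivors.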
HDFS API
Most common file and directory operations are supported:
Create, open, close, read, write, seek, list, delete, etc.
Files are write-once and have exclusively one writer
Some operations peculiar to HDFS:
Set replication, get block locations
Support for owners and permissions
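The write-once, single-writer semantics can be sketched with a toy file store. This is a simulation of the behavior, not the actual Java `FileSystem` API; the class and method names are illustrative:

```python
class WriteOnceFS:
    def __init__(self):
        self.files = {}          # path -> bytes: closed, immutable files
        self.open_writers = {}   # path -> buffer for the single active writer

    def create(self, path):
        if path in self.files or path in self.open_writers:
            raise IOError("file exists or is being written")
        self.open_writers[path] = bytearray()

    def write(self, path, data):
        self.open_writers[path] += data

    def close(self, path):
        # After close, the file is immutable: read-many, write-never
        self.files[path] = bytes(self.open_writers.pop(path))

    def read(self, path):
        return self.files[path]

fs = WriteOnceFS()
fs.create("/logs/a")
fs.write("/logs/a", b"hello")
fs.close("/logs/a")
print(fs.read("/logs/a"))  # b'hello'
```

Enforcing a single writer and forbidding updates after close is what lets HDFS avoid the complex concurrency control a general-purpose file system needs.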
Advantages
Designed to store terabytes or petabytes of data
Data is spread across a large number of machines
Supports much larger file sizes than NFS
Stores data reliably (replication)
Provides fast, scalable access
Can serve more clients by adding more machines
Integrates with MapReduce for local computation
Disadvantages
Not as general-purpose as NFS
Its design restricts use to a particular class of applications
Optimized for streaming read performance; not good at
random access
Write-once, read-many model
Updating a file after it has been closed is not supported (can't
append data)
Does not provide a mechanism for local caching of data
Who Uses HDFS?
Conclusion
Hadoop is an Apache Software Foundation distributed file system
and data management project with the goal of storing and managing
large amounts of data. Hadoop uses a storage system called HDFS to
connect commodity computers, known as nodes, contained
within clusters across which data blocks are distributed.
HDFS shares many common features with other distributed file
systems while supporting some important differences. One
significant difference is HDFS's write-once-read-many model that
relaxes concurrency control requirements, simplifies data coherency,
and enables high-throughput access.
