Mohammad Sufiyan (K00353565)
Nagaraju Kola (K00351882)
Abstract
1. HDFS
The Hadoop Distributed File System (HDFS), a subproject of the Apache Hadoop project, is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. This article explores the primary features of HDFS and provides a high-level view of the HDFS architecture.
1.1 Overview of HDFS
HDFS has many similarities with other distributed file
systems, but is different in several respects. One noticeable
difference is HDFS's write-once-read-many model that relaxes
concurrency control requirements, simplifies data coherency,
and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is
usually better to locate processing logic near the data rather
than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a
time. Bytes are always appended to the end of a stream, and
byte streams are guaranteed to be stored in the order written.
HDFS has many goals; the most notable, including tolerance of hardware failure and streaming data access, are discussed in Section 5. Conversely, HDFS should not be used for workloads that require low-latency access to data or that manipulate large numbers of small files, a limitation examined in the related work below.
2. Related Work
As the application of the HDFS increases, its pitfalls are also being discovered and studied. Among them is poor performance when the HDFS handles small, frequently accessed files. As Jiang et al. [9] pointed out, the HDFS is designed for big-data transmission rather than for transferring a large number of small files, hence it is not suitable for interaction-intensive tasks.
Shvachko et al. [15] note that the HDFS sacrifices the immediate response time of individual I/O requests to gain better throughput for the entire system. In other words, the HDFS enlarges the data volume of a single transmission to decrease the share of transmission-initialization overhead in the overall operation. Transmission initialization includes permission verification, access authentication, locating existing file blocks for reading tasks, allocating storage space for incoming writing tasks, and establishing or maintaining transmission connections.
Another issue with the current HDFS architecture is its single
namenode structure which can quickly become the bottleneck
in the presence of interaction-intensive tasks and hence cause
other incoming I/O requests to wait for a long period of time
before being processed [1].
Efforts have been made to overcome the shortcomings of the HDFS. For instance, Liu [13] combines several small files into a large file and then uses an extra index to record their offsets; the number of files is thus reduced and the efficiency of processing access requests is increased. To enhance the HDFS's ability to handle small files, Chandrasekar et al. [3] also use a strategy that merges small files into a larger file and records the location of file blocks in the client's cache. This strategy also allows the client application to circumvent the namenode and contact the datanodes directly, hence alleviating the bottleneck issue on the namenode. In addition, the strategy uses pre-fetching in its I/O process to further increase the system throughput when dealing with small files. Its drawback is that it requires client-side code changes.
Preventing access requests from frequently interacting with datanodes at the bottom layer of the HDFS can also significantly improve the I/O performance of interaction-intensive tasks. As illustrated in [9], one way to achieve this is to add caches to the original HDFS structure. In [17], a shared cache strategy named HDCache, built on top of the HDFS, is presented to improve the performance of small file access.
Liao et al. [12] adapt the conventional hierarchical structure to the HDFS to increase the capability of the namenode component. Borthakur [2] discusses the architecture of the HDFS in detail and presents several possible ways to apply the hierarchical concept to it. Fesehaye et al. provide a solution called EDFS [5] which introduces middleware and hardware components between the client application and the HDFS to implement a hierarchical storage structure using multiple layers of resources.
Assume that there are n storage nodes and, for node i, denote the available I/O speed as S_i. Although both the access pattern and the data distribution can impact the performance of the hard disk, for simplicity, S_i is taken to be the average case. The storage nodes are ordered by their available I/O speed, i.e., S_i < S_j if i < j. In addition, assume that at time instance t, file f of size F_s is at storage node j; then, within the constant time quantum Q, U denotes the number of unprocessed requests.
As given in Eq. (9), for blocks in FMI, the value domain of their corresponding entries in VM is [1, 2n], which indicates that blocks in FMI may be stored in a cache. For the rest of the blocks, as their domain is [1, n], they can only be allocated to datanodes in a rack.
As an example, assume that there are 5 racks, i.e., n = 5, and an incoming block QF[i] is in set FMI; then the value domain of entry VM[i] is [1, 10]. If VM[i] = 3, the incoming block QF[i] is allocated to rack 3, while if VM[i] = 9, the block is allocated to the cache of rack 4.
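To make this mapping concrete, here is a minimal Python sketch of how a VM entry could be decoded into a storage target; the function name decode_allocation and the string return values are illustrative, not from the paper.

```python
def decode_allocation(vm_value: int, n: int) -> str:
    """Decode a VM entry: values in [1, n] address racks of datanodes,
    values in (n, 2n] address the cache attached to rack (value - n)."""
    if not 1 <= vm_value <= 2 * n:
        raise ValueError("VM entry outside the [1, 2n] value domain")
    if vm_value <= n:
        return f"rack {vm_value}"
    return f"cache of rack {vm_value - n}"

# The example above, with n = 5 racks:
print(decode_allocation(3, 5))  # -> rack 3
print(decode_allocation(9, 5))  # -> cache of rack 4
```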
To compare the interaction-intensity of incoming files with that of files already in the cache, an indicator is introduced for each incoming file to represent the ratio of the file's current I value to the average I value of all cached files. The definition of the indicator for file f is given below.
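Reconstructed from that prose description (writing I_f for the current I value of file f and C for the set of cached files), the indicator is plausibly

$$\mathrm{indicator}(f) \;=\; \frac{I_f}{\frac{1}{|C|}\sum_{g \in C} I_g}$$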
where EW(i) and ER(i) are the predicted writing and reading speeds on node i; they are given by the auto-regressive integrated moving average prediction model presented in [16]. k3 and k4 are the weight coefficients for the reading and writing time costs. Y_i records the number of blocks allocated to datanode i. For datanode i, let arrays Rw and Rr record the average writing and reading speeds, respectively, after each allocation, and let two further arrays record the CPU occupation rate and the available network bandwidth of datanode i, respectively, after each allocation. For the jth allocation of datanode i, EW(i) is then calculated from these records.
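As a rough sketch of this prediction step, the code below fits an off-the-shelf ARIMA model to a node's recorded speed history; statsmodels and the (1, 1, 1) order are assumptions of the sketch, not choices made in the paper.

```python
# Sketch: one-step-ahead I/O speed prediction in the spirit of the
# ARIMA model the paper cites [16]. The order (1, 1, 1) is assumed.
from statsmodels.tsa.arima.model import ARIMA

def predict_next_speed(history):
    """Fit ARIMA on past speeds (e.g., the Rw array for writes) and
    forecast the speed expected at the next allocation."""
    fitted = ARIMA(history, order=(1, 1, 1)).fit()
    return float(fitted.forecast(steps=1)[0])

print(predict_next_speed([95.0, 98.2, 91.5, 97.0, 96.1, 94.3]))
```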
Turning to the PSO component, let d_i denote the mean distance of particle i to all other particles, where N and |FM| are the swarm size and the number of dimensions, respectively. Denote the d_i of the globally best particle as d_g. Comparing all d_i values determines the maximum and minimum distances d_max and d_min. Then, the evolutionary factor f is defined accordingly.
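Following the adaptive PSO of Zhan et al. [18], which this section builds on, d_i and the evolutionary factor take the form

$$d_i = \frac{1}{N-1}\sum_{\substack{j=1\\ j\neq i}}^{N}\sqrt{\sum_{k=1}^{|FM|}\left(x_i^{k}-x_j^{k}\right)^{2}}, \qquad f = \frac{d_g - d_{\min}}{d_{\max} - d_{\min}} \in [0,1].$$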
The MNN is in charge of updating a file's I-I value. Since all incoming requests are handled by the MNN first, the MNN knows the access status of all existing files and can easily calculate the I-I value for each of them; recording the access count for a file causes only negligible overhead. As the workload of the MNN is greatly alleviated in the extended structure, the MNN has enough computing capacity to support this updating. In addition, the MNN calculates the re-allocation bounds for an interaction-intensive file at the beginning of each time quantum. The MNN treats a rack of datanodes as a single storage node, so in its view there are 2n nodes, i.e., n caches and n racks of datanodes. Through its heartbeat report, each SNN submits both its cache's status and the average I/O speed of the datanodes in its rack to the MNN, so the MNN knows the available I/O speed of all 2n nodes. As calculating both bounds for an interaction-intensive file requires the I/O speed information of all the storage nodes, the bound calculation procedure can only be implemented on the MNN; it adds a negligible workload to the MNN.
For the PSO-based storage allocation algorithm, both the MNN and the SNNs are involved. In each iteration of the PSO procedure, the MNN is in charge of calculating the Cost value depicted in Eq. (11) and the velocity V in Eq. (16) for each particle. To do so, the quantities defined in Eq. (12) and Eq. (13) are both needed. Since the MNN can obtain the running status of the caches on the SNNs, the former can be calculated locally on the MNN. However, since the datanodes send their heartbeat reports only to their SNN, the MNN is not aware of individual datanodes; hence each SNN is in charge of calculating the latter quantity for each particle. When a new iteration of PSO begins, the MNN broadcasts the location of each particle to all the SNNs while calculating its own quantity for each particle; in the meantime, the SNNs calculate theirs using the particle locations supplied by the MNN. The MNN and the SNNs therefore work in parallel. Moreover, since one block can only be allocated to one storage place, the calculation on each SNN is independent, which means all the SNNs also work in parallel. As a result, the PSO-based storage allocation algorithm runs very efficiently on the extended HDFS structure. Once both quantities have been worked out for all the particles, the MNN calculates the Cost value, the velocity V, and the new position of each particle. A new iteration is then launched if the stop condition is not satisfied.
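As a minimal, self-contained sketch of one such iteration, the code below collapses the MNN/SNN message exchange into plain function calls and substitutes a generic quadratic cost and the classic PSO velocity rule for the paper's Eq. (11) and Eq. (16); none of these stand-ins come from the paper itself.

```python
import random

def pso_iteration(positions, velocities, pbest, gbest, cost,
                  w=0.7, c1=1.5, c2=1.5):
    """Update every particle once and refresh the personal bests."""
    for i in range(len(positions)):
        r1, r2 = random.random(), random.random()
        # Classic velocity rule standing in for Eq. (16).
        velocities[i] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                         for v, x, pb, gb in
                         zip(velocities[i], positions[i], pbest[i], gbest)]
        positions[i] = [x + v for x, v in zip(positions[i], velocities[i])]
        if cost(positions[i]) < cost(pbest[i]):
            pbest[i] = positions[i][:]
    return min(pbest, key=cost)  # new global best position

# Toy usage: 4 particles in 3 dimensions minimizing a quadratic cost.
cost = lambda p: sum(x * x for x in p)
positions = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
velocities = [[0.0] * 3 for _ in range(4)]
pbest = [p[:] for p in positions]
gbest = min(pbest, key=cost)
for _ in range(20):
    gbest = pso_iteration(positions, velocities, pbest, gbest, cost)
```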
5. Assumptions and Goals
5.1 Hardware Failure
Hardware failure is the norm rather than the exception. An
HDFS instance may consist of hundreds or thousands of server
machines, each storing part of the file system's data. The fact
that there are a huge number of components and that each
component has a non-trivial probability of failure means that
some component of HDFS is always non-functional. Therefore,
detection of faults and quick, automatic recovery from them is
a core architectural goal of HDFS.
5.2 Streaming Data Access
Applications that run on HDFS need streaming access to their data sets; HDFS is designed more for batch processing than for interactive use, and the emphasis is on high throughput of data access rather than low latency [19].
5.3 The Persistence of File System Metadata
Creating a new file in HDFS, for example, causes an event that asks the NameNode to insert a record into the Edit Log representing this action. Likewise, changing the replication factor of a file causes another event that asks the NameNode to insert a new record into the Edit Log representing this action. The NameNode uses a file in its local host OS file system to store the Edit Log. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too [19].
The image of the entire file system namespace and file Blockmap is kept in the NameNode's system memory (RAM). This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and Edit Log from disk, applies all the transactions from the Edit Log to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old Edit Log because its transactions have been applied to the persistent FsImage. This procedure is known as a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up [19].
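As a concrete illustration of the checkpoint just described, the sketch below replays a log over an in-memory image and rewrites it; the JSON-lines format and the operation names are hypothetical, not HDFS's real on-disk formats.

```python
import json

def checkpoint(fsimage_path, editlog_path):
    """Apply all Edit Log transactions to the FsImage, persist the
    merged image, and truncate the old log."""
    # Load the last persisted namespace image.
    with open(fsimage_path) as f:
        namespace = json.load(f)
    # Apply every logged transaction to the in-memory representation.
    with open(editlog_path) as f:
        for line in f:
            record = json.loads(line)
            if record["op"] == "create":
                namespace[record["path"]] = {"replication": record["repl"]}
            elif record["op"] == "set_replication":
                namespace[record["path"]]["replication"] = record["repl"]
    # Flush the new version to disk, then truncate the old Edit Log.
    with open(fsimage_path, "w") as f:
        json.dump(namespace, f)
    open(editlog_path, "w").close()
```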
The Datanode stores HDFS data in files in its local file system.
The Datanode has no knowledge about HDFS files. It stores
each block of HDFS data in a separate file in its local file
system. The Datanode does not create all files in the same
directory. Instead, it uses heuristics to determine the optimal
number of files per directory and creates subdirectories
appropriately. It is not optimal to create all local files in the
same directory because the local file system might not be able
to efficiently support a huge number of files in a single
directory. When a Datanode starts up, it scans through its local
file system, generates a list of all HDFS data blocks that
correspond to each of these local files and sends this report to
the NameNode: this is the Blockreport [19].
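A minimal sketch of such a startup scan follows, assuming blocks are stored as files whose names carry the blk_ prefix (HDFS's naming convention) under nested subdirectories; the reporting format is invented for illustration.

```python
import os

def scan_block_report(data_dir):
    """Walk the datanode's local storage and list (name, size) for
    every HDFS block file found, skipping checksum .meta files."""
    blocks = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_") and not name.endswith(".meta"):
                path = os.path.join(root, name)
                blocks.append((name, os.path.getsize(path)))
    return blocks
```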
6. Data Organization
6.1 Data Blocks
HDFS is designed to support very large files. Applications that
are compatible with HDFS are those that deal with large data
sets. These applications write their data only once but they read
it one or more times and require these reads to be satisfied at
streaming speeds. HDFS supports write-once-read-many
semantics on files. A typical block size used by HDFS is 64
MB. Thus, an HDFS file is chopped up into 64 MB chunks,
and if possible, each chunk will reside on a different DataNode.
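The chunking arithmetic is straightforward; a short illustration:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64 MB block size cited above

def num_blocks(file_size: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size / BLOCK_SIZE)

print(num_blocks(200 * 1024 * 1024))  # a 200 MB file spans 4 blocks
```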
6.2 Staging
A client request to create a file does not reach the NameNode
immediately. In fact, initially the HDFS client caches the file
data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the
local file accumulates data worth over one HDFS block size,
the client contacts the NameNode. The NameNode inserts the
file name into the file system hierarchy and allocates a data
block for it. The NameNode responds to the client request with
the identity of the DataNode and the destination data block.
Then the client flushes the block of data from the local
temporary file to the specified DataNode. When a file is closed,
the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the
NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the
NameNode dies before the file is closed, the file is lost. The
above approach has been adopted after careful consideration of
target applications that run on HDFS. These applications need
streaming writes to files. If a client writes to a remote file directly without any client-side buffering, the network speed and the congestion in the network impact throughput considerably. This approach is not without precedent: earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
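A minimal sketch of this client-side staging is given below; flush_block is a hypothetical stand-in for the NameNode/DataNode interaction described above, and the 64 MB threshold matches the block size from Section 6.1.

```python
import os
import tempfile

BLOCK_SIZE = 64 * 1024 * 1024  # one HDFS block

class StagedWriter:
    """Buffer application writes in a temporary local file and hand
    off a block's worth of data at a time, as described above."""

    def __init__(self, flush_block):
        self.tmp = tempfile.NamedTemporaryFile(delete=False)
        self.flush_block = flush_block  # hypothetical transfer callback
        self.buffered = 0

    def write(self, data: bytes):
        self.tmp.write(data)
        self.buffered += len(data)
        if self.buffered >= BLOCK_SIZE:
            self._flush()

    def _flush(self):
        self.tmp.flush()
        self.flush_block(self.tmp.name)  # contact NameNode, ship block
        self.tmp.seek(0)
        self.tmp.truncate()
        self.buffered = 0

    def close(self):
        if self.buffered:
            self._flush()  # remaining un-flushed data goes out on close
        self.tmp.close()
        os.unlink(self.tmp.name)
```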
6.3 Replication Pipelining
When a client is writing data to an HDFS file, its data is first
written to a local file as explained in the previous section.
Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode; this list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode, which starts receiving the data in small portions, writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode does the same and forwards the data to the third. Thus, the data is pipelined from one DataNode to the next [19].
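A toy model of this pipeline follows, with replicas represented as in-memory buffers rather than real DataNodes; the 4 KB portion size is illustrative.

```python
PORTION = 4 * 1024  # data moves through the pipeline in small portions

def pipeline_write(block: bytes, replicas):
    """Push each portion through the replica chain in order, so every
    node stores the portion and passes it along to the next one."""
    for offset in range(0, len(block), PORTION):
        portion = block[offset:offset + PORTION]
        for node in replicas:
            node.extend(portion)

replicas = [bytearray(), bytearray(), bytearray()]  # replication factor 3
pipeline_write(b"x" * (10 * PORTION), replicas)
assert all(bytes(r) == b"x" * (10 * PORTION) for r in replicas)
```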
9. Space Reclamation
9.1 File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time; after the expiry of its life there, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory: he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
9.2 Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode, which then removes the corresponding blocks, and the corresponding free space appears in the cluster [19].
[9] L. Jiang, B. Li, M. Song, The optimization of HDFS based on small files, in: 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT), IEEE, 2010, pp. 912–915.
[10] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 4, IEEE, 1995, pp. 1942–1948.
[11] J. Kennedy, W.M. Spears, Matching algorithms to problems: an experimental test of the particle swarm and some genetic algorithms on the multimodal problem generator, in: Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, IEEE, 1998, pp. 78–83.
[12] H. Liao, J. Han, J. Fang, Multi-dimensional index on Hadoop distributed file system, in: 2010 IEEE Fifth International Conference on Networking, Architecture and Storage (NAS), IEEE, 2010, pp. 240–249.
[13] X. Liu, J. Han, Y. Zhong, C. Han, X. He, Implementing WebGIS on Hadoop: a case study of improving small file I/O performance on HDFS, in: 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), IEEE, 2009, pp. 1–8.
[14] Y. Shi, R.C. Eberhart, Parameter selection in particle swarm optimization, in: Evolutionary Programming VII, Springer, 1998, pp. 591–600.
[15] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2010, pp. 1–10.
[16] S. Vazhkudai, J.M. Schopf, I. Foster, Predicting the performance of wide area data transfers, in: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2002), IEEE, 2002, pp. 34–43.
[17] J. Zhang, G. Wu, X. Hu, X. Wu, A distributed cache for Hadoop distributed file system in real-time cloud services, in: 2012 ACM/IEEE 13th International Conference on Grid Computing (GRID), IEEE, 2012, pp. 12–21.
[18] Z.-H. Zhan, J. Zhang, Y. Li, H.S.-H. Chung, Adaptive particle swarm optimization, IEEE Trans. Syst. Man Cybern. B 39 (6) (2009) 1362–1381.
[19] http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html