
Technical Brief

MapR Direct Access NFS

2014 MapR, Inc. All Rights Reserved.



Introduction

The Network File System (NFS) protocol provides remote access to shared disks across networks. An
NFS-enabled server can share directories and files with clients, allowing users and programs to access
files on remote systems as if they were stored locally. NFS has become a well-established industry
standard and a widely used interface that provides numerous benefits, including avoiding duplication of
data across multiple users and applications, and enabling better administration and security of data.

NFS Benefits with the MapR Distribution for Hadoop

MapR is the only distribution for Apache™ Hadoop that leverages the full power of NFS. The MapR
POSIX-compliant platform can be exported via NFS to perform fully random read-write operations on
files stored in Hadoop.
MapR Direct Access NFS™ makes Hadoop radically easier and less expensive to use. MapR allows files
to be modified and overwritten, and enables multiple concurrent reads and writes on any file. Here are
some examples of how MapR customers have leveraged NFS in their production environments:
Easy data ingestion

A popular online gaming company replaced a complex Flume cluster for data ingestion with a 17-line
Python script.
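With the cluster exported over NFS, ingestion reduces to ordinary file operations. A minimal sketch of what such a script might look like, assuming the cluster is mounted at a path like /mapr (here temporary directories stand in for the spool and mount directories; the file layout and names are hypothetical, not the customer's actual script):

```python
import shutil
import tempfile
from pathlib import Path

def ingest(source_dir: Path, cluster_dir: Path) -> int:
    """Copy new files from a local spool directory into an NFS-mounted
    cluster directory. Returns the number of files copied."""
    cluster_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in sorted(source_dir.glob("*.log")):
        dst = cluster_dir / src.name
        if not dst.exists():  # skip files already ingested
            shutil.copy2(src, dst)
            copied += 1
    return copied

# Demo: temp directories stand in for a local spool and an NFS mount
# such as /mapr/my.cluster.com/ingest (hypothetical path).
spool = Path(tempfile.mkdtemp())
mount = Path(tempfile.mkdtemp())
(spool / "events-0001.log").write_text("click,2014-01-01,user42\n")
print(ingest(spool, mount))  # copies the one new file
```

Because the mount point behaves like a local directory, the script is idempotent and needs no Hadoop-specific libraries.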
Database bulk import/export with standard vendor tools

A Fortune 100 company saved millions in data warehouse costs by using MapR to pre-process data
before loading it into its data warehouses, performing the bulk imports via NFS with standard vendor tools.
Ability to use existing applications/tools

A large credit card company uses MapR volumes as user home directories on its Hadoop gateway
servers, allowing users to continue to leverage standard Linux commands and utilities to access and
process data.


MapR Technologies, Inc.


www.mapr.com


MapR NFS Implementation

Combining HDFS APIs with NFS


Each node in the MapR cluster has a FileServer service, whose role is similar in many ways to the
DataNode in HDFS. In addition, there can be one or more NFS Gateway services running in the
cluster. In many deployments the NFS Gateway service runs on every node in the cluster, alongside the
FileServer service.
A MapR cluster can be accessed either through the Hadoop FileSystem API or through NFS:
Hadoop FileSystem API

To access a MapR cluster via the Hadoop FileSystem API, the MapR Hadoop client must be installed
on the client machine. MapR provides easy-to-install clients for Linux, Mac, and Windows. The Hadoop
FileSystem API is in Java, so in most cases client applications are developed in Java and linked against the
hadoop-core-*.jar library.
[Figure: Two access paths from a client to a MapR cluster. A Hadoop application (e.g., hadoop fs put)
goes through hadoop-core-*.jar (the HDFS FileSystem API), while any file-based application (e.g., cp,
emacs) goes through the NFS client included in the OS to an NFS Gateway. Each node in the MapR
cluster runs a FileServer and an NFS Gateway service.]

NFS

To access a MapR cluster over NFS, the client mounts any of the NFS Gateway servers. There is no need
to install any software on the client, because every common operating system includes an NFS client.
In Windows, the MapR cluster becomes a drive letter (e.g., M:, Z:, etc.), whereas in Linux and Mac
the cluster is accessible as a directory in the local file system (e.g., /mapr). Note that some lower-end
Windows versions do not include the NFS client.
The Hadoop FileSystem API is designed for MapReduce (with functions such as getFileBlockLocations),
so MapReduce jobs normally read and write data through that API. However, the NFS interface is often
more suitable for applications that are not specific to Hadoop. For example, an application server can
use the NFS interface to write its log files directly into the cluster and also to perform random read-write operations.
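Because the underlying storage supports random writes, that application server needs nothing beyond ordinary POSIX file calls. A minimal sketch, with a temporary file standing in for a log file under an NFS-mounted cluster path such as /mapr (the path is an assumption for illustration):

```python
import os
import tempfile

# Stand-in for a log file under an NFS mount, e.g. /mapr/my.cluster.com/logs/app.log
path = os.path.join(tempfile.mkdtemp(), "app.log")

# Append log records as they arrive -- ordinary sequential writes into the cluster.
with open(path, "a") as f:
    f.write("INFO request handled\n")
    f.write("ERROR request failed\n")

# Random read-write: overwrite bytes in place, which a write-once
# filesystem would not allow.
with open(path, "r+b") as f:
    f.seek(0)            # jump back to the first record
    f.write(b"WARN ")    # overwrite the 5-byte "INFO " prefix in place
    f.seek(0)
    first = f.readline()

print(first)
```

The same file can be read concurrently by other processes through the mount while the server keeps writing.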


NFS High Availability


MapR provides high availability for NFS in the M5 and M7 editions. The administrator uses a simple
MapR interface to allocate a pool of Virtual IP addresses (VIPs), which the cluster then automatically
assigns to the NFS Gateway servers. A VIP automatically migrates from one NFS Gateway service
to another in the event of a failure, so that all clients who mounted the cluster through that VIP can
continue reading and writing data with virtually no impact. In a typical deployment, a simple load-balancing scheme such as DNS round-robin is used to distribute clients uniformly among the different
NFS Gateway servers (i.e., VIPs).
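The round-robin idea is simple to model: DNS hands back the VIP list in rotating order, so successive client mounts land on different gateways. A toy model of that rotation (the VIP addresses are invented for illustration):

```python
from itertools import islice

def round_robin(vips):
    """Yield VIPs in rotating order, as a DNS round-robin record set would."""
    i = 0
    while True:
        yield vips[i % len(vips)]
        i += 1

vips = ["10.10.0.1", "10.10.0.2", "10.10.0.3"]  # hypothetical VIP pool
mounts = list(islice(round_robin(vips), 6))     # six clients mounting in turn
print(mounts)
```

Each client keeps the VIP it mounted; on a gateway failure the VIP itself migrates, so the client's mount point stays valid.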

Random Read/Write
The MapR Distribution for Apache Hadoop includes an underlying storage system that supports
random reads and writes, with support for multiple simultaneous readers and writers. This provides a
significant advantage over other distributions, which only provide a write-once storage system (similar
to FTP).
[Figure: MapR NFS allows direct deposit. Web, database, and application servers write directly into the
MapR data platform, which provides random read/write, compression, and distributed HA; no connectors
are needed, and there are no extra scripts or clusters to deploy and maintain.]

Support for random reads and writes is necessary to provide true NFS access, and more generally, any
kind of access for non-Hadoop applications. NFS is a simple protocol in which the client sends the
server requests to write or read n bytes at offset m in a given file. In a MapR cluster, the NFS Gateway
service receives these requests from the client and translates them into the corresponding RPCs to
the FileServer services. The server side of the NFS protocol is mostly stateless; there is no concept of
opening or closing files.
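That request shape — write or read n bytes at offset m, with no open/close state per request — maps directly onto positional I/O. A minimal illustration using POSIX positional calls (a local temporary file stands in for a file in the cluster):

```python
import os
import tempfile

# A local file stands in for a file stored in the MapR cluster.
fd = os.open(os.path.join(tempfile.mkdtemp(), "data"), os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, 16)  # pre-size the file to 16 bytes

# Each NFS WRITE request is just "these n bytes at offset m" -- no session state.
os.pwrite(fd, b"hello", 0)   # 5 bytes at offset 0
os.pwrite(fd, b"world", 8)   # 5 bytes at offset 8

# An NFS READ request is likewise "n bytes at offset m".
chunk = os.pread(fd, 5, 8)
os.close(fd)
print(chunk)
```

The NFS Gateway's job is essentially to forward each such positional request as an RPC to the FileServer holding the data.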


HDFS-based Distributions and NFS

HDFS file operations typically involve a file open(), followed by sequential writes, and end with a file
close() operation. The file must be closed explicitly for HDFS to pick up the changes that were made,
and no new writes are permitted until the file is reopened.
The NFS protocol, on the other hand, follows a different model for working with files, creating a technology
mismatch with HDFS.
First, the NFS protocol on the server side is stateless and does not include a file open/close primitive that
can be used to indicate to HDFS that a write operation is complete. Therefore, in order to make the
data permanent on HDFS, the NFS Gateway on HDFS has to guess, artificially closing the file after a
specified timeout. After the file is closed, however, any write arriving from the
NFS client is not written to HDFS, making the system susceptible to data loss.
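That failure mode can be caricatured in a few lines. The toy class below is an illustration of write-once semantics with an explicit close, not the HDFS API itself; a WRITE arriving after the gateway's artificial close simply has nowhere to go:

```python
class WriteOnceFile:
    """Toy model of write-once-with-close semantics (illustration only)."""
    def __init__(self):
        self.data = bytearray()
        self.closed = False

    def append(self, b: bytes):
        if self.closed:
            raise IOError("file is closed; no further writes accepted")
        self.data.extend(b)

    def close(self):
        self.closed = True

f = WriteOnceFile()
f.append(b"first chunk")
f.close()                 # the gateway's timeout-driven "artificial close"
try:
    f.append(b"late chunk")   # a WRITE that arrives after the close...
except IOError as e:
    print("lost write:", e)   # ...is rejected, so its data is lost
```

A storage layer that permits reopening and rewriting files, as MapR's does, removes the need for the timeout guess entirely.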
Second, even if the end application on the NFS client side writes in sequential order, the local operating
system and NFS client typically reorder the writes before passing them to the NFS server. Packets that
the NFS server receives from the client are therefore almost always out of sequence, which does not fit
HDFS's requirement that writes be sequential. To re-sequence incoming data, the NFS gateway has to be
tweaked again to temporarily stage all the data on its local disk (/tmp/.hdfs-nfs) before writing it to
HDFS. Such a setup can quickly become impractical, as one needs to make sure the NFS gateway's local
directory has enough space at all times. For example, if an application uploads 10 files of 100 MB each,
this directory needs 1 GB of space in case a worst-case write reorder happens to every file.
Because of this bottleneck, HDFS NFS cannot truly support multiple users: the gateway may run out of
local disk space very quickly, and performance suffers because all NFS traffic is staged on the gateway's
local disks. In fact, the HDFS NFS documentation recommends using the HDFS API or WebHDFS when
performance matters.
These limitations, coupled with the fact that existing applications cannot perform random read-write
operations on HDFS, make NFS support on HDFS impractical.
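The contrast can be made concrete: with a storage layer that accepts writes at arbitrary offsets, out-of-order WRITE requests need no local staging area — each request is applied directly at its stated offset, and the file still assembles correctly. A small simulation (offsets and payloads invented for illustration):

```python
import os
import random
import tempfile

data = b"abcdefghijklmnopqrstuvwxyz"
chunk = 4
# Split the payload into (offset, bytes) WRITE requests, as an NFS client would.
writes = [(i, data[i:i + chunk]) for i in range(0, len(data), chunk)]
random.shuffle(writes)  # the OS/NFS client may reorder requests arbitrarily

fd = os.open(os.path.join(tempfile.mkdtemp(), "f"), os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, len(data))
for offset, payload in writes:      # apply each request at its stated offset:
    os.pwrite(fd, payload, offset)  # no /tmp staging, no re-sequencing needed
result = os.pread(fd, len(data), 0)
os.close(fd)
print(result == data)
```

This is the essential property a random-write storage layer gives the NFS Gateway: reordering by the client stack becomes harmless rather than something to buffer around.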

Summary

The MapR Distribution for Apache Hadoop uniquely provides a robust, enterprise-class storage service
that supports random reads and writes, and exposes the standard NFS interface so that clients can
mount the cluster and read and write data directly. This capability makes Hadoop much easier to use,
and enables new classes of applications.

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL, database and
streaming applications in one unified big data platform. MapR is used by more than 500 customers across financial services, retail, media,
healthcare, manufacturing, telecommunications and government organizations as well as by leading Fortune 100 and Web 2.0 companies.
Amazon, Cisco, Google and HP are part of the broad MapR partner ecosystem. Investors include Lightspeed Venture Partners, Mayfield
Fund, NEA, and Redpoint Ventures. MapR is based in San Jose, CA. Connect with MapR on Facebook, LinkedIn, and Twitter.
2014 MapR Technologies. All rights reserved. Apache Hadoop, HBase and Hadoop are trademarks of the Apache Software Foundation
and not affiliated with MapR Technologies. All other trademarks are the property of their respective owners.
