
Hadoop Configuration Instruction Manual

I. Introduction

This document explains the installation and configuration of a Hadoop cluster on Linux
machines. Following are the versions of Linux for which these set-up instructions have
been tested:

• Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)


• openSuSE version 11.2

II. Setting up a single-node Hadoop cluster

Configuration Pre-requisites

1. Make sure that Sun’s JDK is the default Java on your machine and is correctly set up.

user@linux:~# java -version


java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed
mode, sharing)

2. Add a dedicated Hadoop system user

This helps to separate the Hadoop installation from other software applications and user
accounts running on the same machine.
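For example, on Debian/Ubuntu systems a hadoop user in a dedicated hadoop group can be
created as follows (addgroup/adduser are the Debian-style commands; other distributions
use groupadd/useradd instead):

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop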

3. SSH should be up and running on the machine and it should be configured to allow
SSH public key authentication.

(i) Log in as the hadoop user, for whom we will generate an SSH key.

user@linux:~$ su - hadoop

(ii) Create an RSA key pair with an empty password.

hadoop@linux:~$ ssh-keygen -t rsa -P ""


Generating public/private rsa key pair.
Enter file in which to save the key
(/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in
/home/hadoop/.ssh/id_rsa.
Your public key has been saved in
/home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:

9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hadoop@linux
The key's randomart image is:
[...snipp...]
hadoop@linux:~$

(iii) Enable SSH access to the local machine with this newly created key.

hadoop@linux:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

(iv) Test the SSH setup by connecting to your local machine with the hadoop user.

hadoop@linux:~$ ssh localhost


The authenticity of host 'localhost (::1)' can't be
established.
RSA key fingerprint is
d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)?
yes
Warning: Permanently added 'localhost' (RSA) to the
list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr
28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hadoop@linux:~$

Hadoop single-node configuration

1. Download Hadoop from the Apache download mirrors and extract the contents of the
Hadoop package to a location of your choice; we use /usr/local.
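For example, the 0.20.2 tarball used below can be fetched with wget; the archive URL is
an assumption, and any Apache download mirror will work just as well:

$ sudo wget -P /usr/local http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz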

$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop

2. Change the owner of all the files to the hadoop user and group previously created.

$ sudo chown -R hadoop:hadoop hadoop

3. Make changes to the following configuration files:

(i) conf/hadoop-env.sh

In this file, set the JAVA_HOME environment variable to the installation path of the
JDK 6 directory.
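For example, on Ubuntu with the Sun JDK the line might look as follows (the exact path
is an assumption and depends on where your JDK is installed):

# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun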

(ii) conf/core-site.xml

<property>
<name>hadoop.tmp.dir</name>
<value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and
authority determine the FileSystem implementation. The uri's scheme
determines the config property (fs.SCHEME.impl) naming the FileSystem
implementation class. The uri's authority is used to determine the
host, port, etc. for a filesystem.</description>
</property>

(iii) conf/mapred-site.xml

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce
job tracker runs at. If "local", then jobs are
run in-process as a single map and reduce task.
</description>
</property>

(iv) conf/hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The
actual number of replications can be specified
when the file is created. The default is used
if replication is not specified in create time.
</description>
</property>

4. Formatting the Hadoop file system

The first step in starting up your Hadoop installation is to format the Hadoop
filesystem, which is implemented on top of the local filesystem of your “cluster”
(which includes only your local machine when setting up a single node). You need
to do this the first time you set up a Hadoop cluster.

Note: Do not format a running Hadoop filesystem; this will erase all your data.

To format the filesystem (which simply initializes the directory specified by the
dfs.name.dir variable), run the following command:

hadoop@linux:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

hadoop@linux:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hadoop/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hadoop@linux:/usr/local/hadoop$

5. Starting the single-node cluster

To start the single-node cluster, run the following command:

hadoop@linux:~$ /usr/local/hadoop/bin/start-all.sh

This will start a NameNode, a DataNode, a SecondaryNameNode, a JobTracker and a
TaskTracker on your machine.

The output will look like this:

hadoop@linux:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-linux.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-linux.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-linux.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-linux.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-linux.out
hadoop@linux:/usr/local/hadoop$

To check if the hadoop processes are running, use the following command:

hadoop@linux:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

You can also check with netstat if Hadoop is listening on the configured ports.

hadoop@linux:~$ sudo netstat -plten | grep java

tcp  0  0  0.0.0.0:50070    0.0.0.0:*  LISTEN  1001  9236  2471/java
tcp  0  0  0.0.0.0:50010    0.0.0.0:*  LISTEN  1001  9998  2628/java
tcp  0  0  0.0.0.0:48159    0.0.0.0:*  LISTEN  1001  8496  2628/java
tcp  0  0  0.0.0.0:53121    0.0.0.0:*  LISTEN  1001  9228  2857/java
tcp  0  0  127.0.0.1:54310  0.0.0.0:*  LISTEN  1001  8143  2471/java
.....

In case of any errors, examine the log files in the logs/ directory of the Hadoop
installation (/usr/local/hadoop/logs/).
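For example, the NameNode log can be inspected like this (the log file name is an
assumption based on the hadoop user and the host name "linux" used above):

hadoop@linux:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hadoop-namenode-linux.log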

6. Stopping the single-node cluster

Run the following command to stop all the daemons running on your machine:

hadoop@linux:~$ /usr/local/hadoop/bin/stop-all.sh

The output will look like this:

hadoop@linux:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@linux:/usr/local/hadoop$

III. Setting up a multi-node Hadoop cluster

Configuration pre-requisites

1. Configure a single-node Hadoop cluster on two Linux machines using the
instructions provided above.

2. In the following steps, we will merge these two single-node clusters into one
multi-node cluster in which one box will become the master and the other the slave.

3. Both machines must be connected over the LAN, e.g. via a single hub or switch,
and their network interfaces must be configured to use a common subnet such as
192.168.0.x/24.

4. Update /etc/hosts on both machines with the following lines:

# /etc/hosts (for master AND slave)


192.168.0.1 master
192.168.0.2 slave

5. SSH access
The hadoop user on the master (hadoop@master) must be able to connect:
a) to its own user account on the master – i.e. ssh master in this context and not
necessarily ssh localhost – and
b) to the hadoop user account on the slave (hadoop@slave) via a password-less
SSH login. For this, just add hadoop@master's public SSH key (which should
be in $HOME/.ssh/id_rsa.pub) to the “authorized_keys” file of
hadoop@slave (in this user's $HOME/.ssh/authorized_keys). You can do this
manually or use the following SSH command:

hadoop@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave

This command will prompt you for the login password for user hadoop on slave,
then copy the public SSH key for you, creating the correct directory and fixing the
permissions as necessary.

The final step is to test the SSH setup by connecting with user hadoop from the
master to the user account hadoop on the slave. This step is also needed to save the
slave's host key fingerprint to hadoop@master's “known_hosts” file.

Connecting from master to master:

hadoop@master:~$ ssh master


The authenticity of host 'master (192.168.0.1)' can't
be established.
RSA key fingerprint is
3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)?
yes
Warning: Permanently added 'master' (RSA) to the list
of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC
2007 i686
...
hadoop@master:~$

Connecting from master to slave:

hadoop@master:~$ ssh slave


The authenticity of host 'slave (192.168.0.2)' can't
be established.
RSA key fingerprint is
74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)?
yes
Warning: Permanently added 'slave' (RSA) to the list
of known hosts.
Ubuntu 8.04
...
hadoop@slave:~$

Hadoop multi-node cluster configuration

Introduction

In this section, we will describe how to configure one Linux box as a master node
and the other Linux box as a slave node. The master node will also act as a slave
because we only have two machines available in our cluster but still want to
spread data storage and processing to multiple machines.

The master node will run the “master” daemons for each layer: namenode for the
HDFS storage layer, and jobtracker for the MapReduce processing layer. Both
machines will run the “slave” daemons: datanode for the HDFS layer, and
tasktracker for the MapReduce processing layer. Basically, the “master” daemons are
responsible for coordination and management of the “slave” daemons while the
latter will do the actual data storage and data processing work.

Configuration

1. conf/masters (master only)

The conf/masters file defines the machines on which Hadoop starts the secondary
NameNode daemons in our multi-node cluster. In our case, this is just the master
machine.

On the master, update conf/masters so that it looks like this:

master

2. conf/slaves (master only)

The conf/slaves file lists the hosts, one per line, where the Hadoop slave
daemons (datanodes and tasktrackers) will be run. We want both the master box
and the slave box to act as Hadoop slaves because we want both of them to store
and process data.

On the master, update conf/slaves so that it looks like this:

master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one
per line (do this on all machines in the cluster).

master
slave
anotherslave01
anotherslave02
anotherslave03

3. conf/*-site.xml (all machines)

Note: Change the configuration files conf/core-site.xml, conf/mapred-site.xml
and conf/hdfs-site.xml on ALL machines as follows.

(i) Change the fs.default.name variable (in conf/core-site.xml), which specifies the
NameNode (the HDFS master) host and port. In our case, this is the master machine.

<!-- In: conf/core-site.xml -->


<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A
URI whose scheme and authority determine the
FileSystem implementation. The uri's scheme
determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's
authority is used to determine the host, port, etc.
for a filesystem.</description>
</property>

(ii) Change the mapred.job.tracker variable (in conf/mapred-site.xml)
which specifies the JobTracker (MapReduce master) host and port, which is the
master machine in our case.

<!-- In: conf/mapred-site.xml -->


<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job
tracker runs at. If "local", then jobs are run in-
process as a single map and reduce task.
</description>
</property>

(iii) Change the dfs.replication variable (in conf/hdfs-site.xml), which specifies
the default block replication. It defines how many machines a single file should
be replicated to before it becomes available. Since we only have two nodes
available, we set dfs.replication to 2.

<!-- In: conf/hdfs-site.xml -->


<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. The actual
number of replications can be specified when the file
is created. The default is used if replication is
not specified in create time. </description>
</property>

4. Formatting the Namenode

Before we start our new multi-node cluster, we have to format Hadoop’s distributed
filesystem (HDFS) for the namenode. This needs to be done the first time we set up
a Hadoop cluster.

Note: Do not format a running Hadoop namenode; this will erase all the data in
the HDFS filesystem.

To format the filesystem (which simply initializes the directory specified by the
dfs.name.dir variable on the namenode), run the command:

hadoop@master:/usr/local/hadoop$ bin/hadoop namenode -format

... INFO dfs.Storage: Storage directory /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name has been successfully formatted.
hadoop@master:/usr/local/hadoop$

5. Starting the Multi-node cluster

To start the multi-node cluster, run the command:

hadoop@master:/usr/local/hadoop$ bin/start-all.sh

This command starts both the HDFS and MapReduce daemons on both the master
and the slave machines.

The following Java processes should now be running:

(i) On master:

hadoop@master:/usr/local/hadoop$ jps

16017 Jps

14799 NameNode

15686 TaskTracker

14880 DataNode

15596 JobTracker

14977 SecondaryNameNode

hadoop@master:/usr/local/hadoop$

(ii) On slave:

hadoop@slave:/usr/local/hadoop$ jps

15183 DataNode

15897 TaskTracker

16284 Jps

hadoop@slave:/usr/local/hadoop$

In case they don’t start up properly, refer to the logs.

6. Stopping the Multi-node cluster

To stop the multi-node cluster, run the command:

hadoop@master:/usr/local/hadoop$ bin/stop-all.sh

This will stop the HDFS daemons and the MapReduce daemons on all the slaves
and the master machines.

IV. Some useful Hadoop Web Interfaces

Hadoop comes with several web interfaces, which are by default (see conf/hadoop-
default.xml) available at these locations:

• http://localhost:50030/ - web UI for the MapReduce job tracker(s)
• http://localhost:50060/ - web UI for the task tracker(s)
• http://localhost:50070/ - web UI for the HDFS name node(s)

They provide concise tracking information about the Hadoop cluster.
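As a quick sanity check from the command line, you can verify that the NameNode web UI
responds (this assumes curl is installed; opening the URL in a browser works just as
well):

hadoop@linux:~$ curl -s http://localhost:50070/ | head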

V. HDFS – Hadoop Distributed File System

a) Configuration

The HDFS configuration is located in a set of XML files in the Hadoop configuration
directory: conf/ under the main Hadoop install directory. The conf/hadoop-defaults.xml
file contains default values for every parameter in Hadoop. This file is considered
read-only. This configuration can be overridden by setting new values in
conf/hadoop-site.xml (in Hadoop 0.20 the overrides are split across conf/core-site.xml,
conf/hdfs-site.xml and conf/mapred-site.xml, as used earlier in this manual). The
override file(s) should be replicated consistently across all machines in the cluster.

The following settings are necessary to configure HDFS:

key               value
fs.default.name   protocol://servername:port
dfs.data.dir      pathname
dfs.name.dir      pathname

fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes
the NameNode for the cluster. Each node in the system on which Hadoop is expected to
operate needs to know the address of the NameNode. The DataNode instances will
register with this NameNode, and make their data available through it. Individual client
programs will connect to this address to retrieve the locations of actual file blocks.

dfs.data.dir - This is the path on the local file system in which the DataNode instance
should store its data. It is not necessary that all DataNode instances store their data under
the same local path prefix, as they will all be on separate machines; it is acceptable that
these machines are heterogeneous. However, it will simplify configuration if this
directory is standardized throughout the system. By default, Hadoop will place this under
/tmp.

dfs.name.dir - This is the path on the local file system of the NameNode instance where
the NameNode metadata is stored. It is only used by the NameNode instance to find its
information, and does not exist on the DataNodes.
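As an illustrative sketch only: with the 0.20-style layout used earlier in this manual,
these three settings would be spread over conf/core-site.xml and conf/hdfs-site.xml, and
the paths below are placeholders, not required values.

<!-- In: conf/core-site.xml -->
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>

<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop-datastore/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop-datastore/dfs/data</value>
</property>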

b) Starting HDFS

Once the namenode has been formatted in a hadoop cluster, we can start the distributed
file system using the following command:

user@namenode:hadoop$ bin/start-dfs.sh

Note that this step is already included when the whole Hadoop cluster is started with
the command (as discussed above):

user@master:hadoop$ bin/start-all.sh

c) Interacting with HDFS

This section will contains the basic commands necessary to interact with HDFS, loading
and retrieving data, as well as manipulating files.

(i) Listing files

someone@anynode:hadoop$ bin/hadoop dfs -ls /

Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp

This example output assumes that "hadoop" is the username under which the Hadoop
daemons (NameNode, DataNode, etc) were started. "supergroup" is a special group
whose membership includes the username under which the HDFS instances were
started (e.g., "hadoop").

(ii) Uploading a file

Step 1: Create your home directory if it does not already exist.

someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/yourowndirectory

Step 2: Upload a file. To insert a single file into HDFS, run:

someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/File1.txt /user/yourowndirectory/

Another synonym for -put is -copyFromLocal. The syntax and functionality are
identical.
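For example, the upload from Step 2 could equivalently be written as:

someone@anynode:hadoop$ bin/hadoop dfs -copyFromLocal /home/someone/File1.txt /user/yourowndirectory/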

Step 3: Verify the file is in HDFS. We can verify that the operation worked with either
of the two following (equivalent) commands:

someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourowndirectory

someone@anynode:hadoop$ bin/hadoop dfs -ls

(With no path given, -ls lists your HDFS home directory, /user/<your username>.)

d) Shutting down HDFS

user@namenode:hadoop$ bin/stop-dfs.sh

Note that this step is already included when the whole Hadoop cluster is stopped with
the command (as discussed above):

user@master:hadoop$ bin/stop-all.sh

VI. Troubleshooting

1. java.io.IOException: Incompatible namespaceIDs

This error might be seen in the logs of a datanode
(logs/hadoop-hadoop-datanode-<hostname>.log) if the datanode does not start up
properly. The full error looks like this:

... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible
namespaceIDs in /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data:
namenode namespaceID = 308967713; datanode namespaceID = 113030094
    at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
    at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
    at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:199)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)

Solution:

1. Stop the datanode.
2. Edit the value of namespaceID in each datanode's dfs/data/current/VERSION file
(on both the master and the slave machines) to match the value reported by the
current namenode (on the master), as sketched after the file paths below.
3. Restart the datanode.

The full paths of the relevant files are:

• namenode: /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/VERSION
• datanode: /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION
  (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set
  hadoop.tmp.dir in this tutorial to /usr/local/hadoop-datastore/hadoop-hadoop).
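The following sketch illustrates the fix, assuming the default paths above and the
example namespaceID (308967713) from the error message shown earlier; substitute the
value reported by your own namenode:

# On the master (namenode), look up the current namespaceID:
hadoop@master:~$ grep namespaceID /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/VERSION
namespaceID=308967713

# On each affected datanode (after stopping it), patch its VERSION file to match:
hadoop@slave:~$ sed -i 's/^namespaceID=.*/namespaceID=308967713/' /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION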

2. Empty files created on HDFS when using the -put or -copyFromLocal command to
upload files

The full error from the datanode log looks as follows:

INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Operation not supported
INFO hdfs.DFSClient: Abandoning block blk_-1884214035513073759_1010
INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available
INFO hdfs.DFSClient: Abandoning block blk_5533397873275401028_1010
INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available
INFO hdfs.DFSClient: Abandoning block blk_-237603871573204731_1011
INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available
INFO hdfs.DFSClient: Abandoning block blk_-8668593183126057334_1011
WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
...

Solution:

This problem can be solved by installing a different version of Java (Sun Java 1.6.0)
and updating conf/hadoop-env.sh to use it.
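A hedged sketch of this fix on Ubuntu 10.04, assuming the Sun JDK package is available
from the partner repository and installs under /usr/lib/jvm/java-6-sun:

$ sudo apt-get install sun-java6-jdk
# then point Hadoop at it in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun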
