I. Introduction
This document explains the installation and configuration of a Hadoop cluster on Linux machines. The following are the versions of Linux on which these set-up instructions have been tested:
II. Setting up a single-node Hadoop cluster
Configuration Pre-requisites
1. Make sure that Sun's JDK is the default Java on your machine and is correctly set up.
2. Use a dedicated hadoop user account (and group) for running Hadoop. This helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
3. SSH should be up and running on the machine, and it should be configured to allow SSH public key authentication.
user@linux:~$ su - hadoop
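(i) Generate an SSH key for the hadoop user. The exact command is not reproduced in this copy of the document; a typical invocation (with an empty passphrase, so that Hadoop can later log in over SSH without prompting) would be:
hadoop@linux:~$ ssh-keygen -t rsa -P ""
The tail of the key-generation output looks like this: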
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2
hadoop@linux
The key's randomart image is:
[...snipp...]
hadoop@linux:~$
(ii) Enable SSH access to the local machine with this newly created key.
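A typical way to do this is to append the new public key to the hadoop user's own authorized_keys file (the exact command below is an assumption, not taken from the original text):
hadoop@linux:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys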
(iii) Test the SSH setup by connecting to your local machine with the hadoop user.
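For example, a minimal check (the first connection will also add the host's fingerprint to the known_hosts file):
hadoop@linux:~$ ssh localhost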
Hadoop installation and configuration
1. Download Hadoop from the Apache download mirrors and extract the contents of the Hadoop package to a location of your choice, for example /usr/local:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop
2. Change the owner of all the files to the hadoop user and group previously created.
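Assuming the dedicated user and group are both named hadoop (as used throughout this document), the command would be something like:
$ sudo chown -R hadoop:hadoop hadoop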
3. Update the following configuration files:
(i) conf/hadoop-env.sh
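The only change usually required here is to point JAVA_HOME at the Sun JDK installation; the exact path below is an assumption and depends on your distribution:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun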
(ii) conf/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation.
  The uri's scheme determines the config property (fs.SCHEME.impl)
  naming the FileSystem implementation class. The uri's authority
  is used to determine the host, port, etc. for a filesystem.
  </description>
</property>
(iii) conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker
  runs at. If "local", then jobs are run in-process as a single
  map and reduce task.
  </description>
</property>
(iv) conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of
  replications can be specified when the file is created. The
  default is used if replication is not specified in create time.
  </description>
</property>
4. Formatting the Hadoop file system
The first step to starting up your Hadoop installation is formatting the Hadoop
filesystem which is implemented on top of the local filesystem of your “cluster”
(which includes only your local machine when setting up the single node). You
need to do this the first time you set up a Hadoop cluster.
Note: Do not format a running Hadoop filesystem; doing so will erase all of your data.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following command:
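The command is run as the hadoop user from the Hadoop installation directory; the invocation below is the standard one for this Hadoop version, and the tail of its output is shown after it:
hadoop@linux:/usr/local/hadoop$ bin/hadoop namenode -format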
SHUTDOWN_MSG: Shutting down NameNode at
ubuntu/127.0.1.1
******************************************************
******/
hadoop@linux:/usr/local/hadoop$
Run the following command to start all the Hadoop daemons on your machine:
hadoop@linux:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-linux.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-linux.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-linux.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-linux.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-linux.out
hadoop@linux:/usr/local/hadoop$
To check whether the Hadoop processes are running, use the following command:
hadoop@linux:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
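The exact netstat invocation is not shown in this copy; a common way to list the listening Java processes (matching the output format below) is:
hadoop@linux:~$ sudo netstat -plten | grep java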
tcp  0  0  0.0.0.0:50070    0.0.0.0:*  LISTEN  1001  9236  2471/java
tcp  0  0  0.0.0.0:50010    0.0.0.0:*  LISTEN  1001  9998  2628/java
tcp  0  0  0.0.0.0:48159    0.0.0.0:*  LISTEN  1001  8496  2628/java
tcp  0  0  0.0.0.0:53121    0.0.0.0:*  LISTEN  1001  9228  2857/java
tcp  0  0  127.0.0.1:54310  0.0.0.0:*  LISTEN  1001  8143  2471/java
...
In case of any errors, examine the log files in the logs/ directory under the Hadoop installation (here /usr/local/hadoop/logs/).
Run the following command to stop all the daemons running on your machine:
hadoop@linux:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@linux:/usr/local/hadoop$
III. Setting up a multi-node Hadoop cluster
Configuration pre-requisites
1. We assume that two single-node clusters have been set up as described above, one on each of two machines, which we will refer to as master and slave.
2. In the following steps, we will merge these two single-node clusters into one multi-node cluster, in which one box will become the master and the other the slave.
3. Both machines must be connected over the LAN, e.g. via a single hub or switch, and their network interfaces must be configured to use a common network such as 192.168.0.x/24.
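As an illustration (the addresses below are assumptions, not values from the original text), if the master is assigned 192.168.0.1 and the slave 192.168.0.2, the /etc/hosts file on both machines could contain entries such as:
192.168.0.1    master
192.168.0.2    slave
This lets the hostnames master and slave, used in the rest of this section, resolve on both boxes.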
5. SSH access
The hadoop user on the master (hadoop@master) must be able to connect
a) To its own user account on the master – i.e. ssh master in this context and not
necessarily ssh localhost and
b) To the hadoop user account on the slave (hadoop@slave) via a password-less
SSH login. For this, just add the hadoop@master‘s public SSH key (which should
be in $HOME/.ssh/id_rsa.pub) to the “authorized_keys” file of
hadoop@slave (in this user’s $HOME/.ssh/authorized_keys). You can do this
manually or use the following SSH command:
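The command itself is not reproduced in this copy of the document; a command that matches the description below is OpenSSH's ssh-copy-id, for example:
hadoop@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave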
This command will prompt you for the login password for user hadoop on slave,
then copy the public SSH key for you, creating the correct directory and fixing the
permissions as necessary.
The final step is to test the SSH setup by connecting with user hadoop from the master to the user account hadoop on the slave. This step is also needed to save the slave's host key fingerprint to hadoop@master's “known_hosts” file.
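For example, assuming the hostnames resolve as configured earlier:
hadoop@master:~$ ssh master
hadoop@master:~$ ssh slave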
Hadoop multi-node cluster configuration
Introduction
In this section, we will describe how to configure one Linux box as a master node
and the other Linux box as a slave node. The master node will also act as a slave
because we only have two machines available in our cluster but still want to
spread data storage and processing to multiple machines.
The master node will run the “master” daemons for each layer: the namenode for the HDFS storage layer, and the jobtracker for the MapReduce processing layer. Both machines will run the “slave” daemons: a datanode for the HDFS layer, and a tasktracker for the MapReduce processing layer. Basically, the “master” daemons are responsible for coordination and management of the “slave” daemons, while the latter do the actual data storage and data processing work.
Configuration
The “conf/masters” file (on the master machine) defines the hosts on which Hadoop will start secondary namenodes; the primary namenode and the jobtracker are started on the machine on which the start scripts are run. In our case, this is just the master machine.
master
The “conf/slaves” file lists the hosts, one per line, on which the Hadoop slave daemons (datanodes and tasktrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves, because we want both of them to store and process data.
master
slave
If you have additional slave nodes, just add them to the conf/slaves file, one
per line (do this on all machines in the cluster).
master
slave
anotherslave01
anotherslave02
anotherslave03
(i) Change the fs.default.name variable (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host and port; in our case, this is the master machine.
(ii) Change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port; again, this is the master machine in our case.
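A sketch of the corresponding values, assuming the master's hostname is master and the same ports as in the single-node setup, would be:
In conf/core-site.xml (on all machines):
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
In conf/mapred-site.xml (on all machines):
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>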
Note: Do not format a running Hadoop namenode; doing so will erase all the data in the HDFS filesystem.
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the namenode), run the command:
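As in the single-node case, the standard invocation (run as the hadoop user on the master) is:
hadoop@master:/usr/local/hadoop$ bin/hadoop namenode -format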
hadoop@master:/usr/local/hadoop$
5. Starting the multi-node cluster
On the master machine, run:
hadoop@master:/usr/local/hadoop$ bin/start-all.sh
This command starts both the HDFS and MapReduce daemons on both the master
and the slave machines.
To verify that the expected Java processes are running on each machine, use jps:
(i) On master:
hadoop@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode
hadoop@master:/usr/local/hadoop$
(ii) On slave:
hadoop@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
hadoop@slave:/usr/local/hadoop$
6. Stopping the Multi-node cluster
hadoop@master:/usr/local/hadoop$ bin/stop-all.sh
This will stop the HDFS daemons and the MapReduce daemons on all the slaves
and the master machines.
IV. Hadoop Web Interfaces
Hadoop comes with several web interfaces, which are by default (see conf/hadoop-default.xml) available at these locations:
• http://localhost:50070/ (NameNode web UI, HDFS layer)
• http://localhost:50030/ (JobTracker web UI, MapReduce layer)
• http://localhost:50060/ (TaskTracker web UI, MapReduce layer)
V. HDFS – Hadoop Distributed File System
a) Configuration
The HDFS configuration is located in a set of XML files in the Hadoop configuration
directory: conf/ under the main Hadoop install directory. The conf/hadoop-
defaults.xml file contains default values for every parameter in Hadoop. This file is
considered read-only. This configuration can be overridden by setting new values in
conf/hadoop-site.xml. This file should be replicated consistently across all
machines in the cluster.
key               value
fs.default.name   protocol://servername:port
dfs.data.dir      pathname
dfs.name.dir      pathname
fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes
the NameNode for the cluster. Each node in the system on which Hadoop is expected to
operate needs to know the address of the NameNode. The DataNode instances will
register with this NameNode, and make their data available through it. Individual client
programs will connect to this address to retrieve the locations of actual file blocks.
dfs.data.dir - This is the path on the local file system in which the DataNode instance
should store its data. It is not necessary that all DataNode instances store their data under
the same local path prefix, as they will all be on separate machines; it is acceptable that
these machines are heterogeneous. However, it will simplify configuration if this
directory is standardized throughout the system. By default, Hadoop will place this under
/tmp.
dfs.name.dir - This is the path on the local file system of the NameNode instance where
the NameNode metadata is stored. It is only used by the NameNode instance to find its
information, and does not exist on the DataNodes.
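A sketch of a conf/hadoop-site.xml overriding these three parameters could look as follows; the host name, port, and paths are placeholders, not values taken from the original text:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.namenode.host:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
  </property>
</configuration>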
b) Starting HDFS
Once the namenode has been formatted in a hadoop cluster, we can start the distributed
file system using the following command:
user@namenode:hadoop$ bin/start-dfs.sh
This step is also performed implicitly when the whole Hadoop cluster is started with the command discussed above:
user@master:hadoop$ bin/start-all.sh
c) Interacting with HDFS
This section contains the basic commands necessary to interact with HDFS: loading and retrieving data, as well as manipulating files.
This example output assumes that "hadoop" is the username under which the Hadoop
daemons (NameNode, DataNode, etc) were started. "supergroup" is a special group
whose membership includes the username under which the HDFS instances were
started (e.g., "hadoop").
Another synonym for -put is -copyFromLocal. The syntax and functionality are
identical.
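For example, the following two commands have the same effect (the file name is purely illustrative):
hadoop@linux:/usr/local/hadoop$ bin/hadoop dfs -put /tmp/example.txt example.txt
hadoop@linux:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/example.txt example.txt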
Step 3: Verify the file is in HDFS. We can verify that the operation worked with either
of the two following (equivalent) commands:
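The commands themselves are not reproduced in this copy; with the hadoop user they would typically list the user's HDFS home directory by relative and by absolute path:
hadoop@linux:/usr/local/hadoop$ bin/hadoop dfs -ls
hadoop@linux:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hadoop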
d) Shutting down HDFS
user@namenode:hadoop$ bin/stop-dfs.sh
This step is also performed implicitly when the whole Hadoop cluster is stopped with the command discussed above:
user@master:hadoop$ bin/stop-all.sh
VI. Troubleshooting
Solution:
The full paths of the relevant files are:
• namenode: /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/VERSION
• datanode: /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir in this tutorial to /usr/local/hadoop-datastore/hadoop-hadoop).
...….
Solution:
This problem can be solved by installing a different version of Java (Sun Java 1.6.0) and updating conf/hadoop-env.sh to use it.