
Hadoop: Fully-Distributed Cluster Setup

OS & Tools to use

OS: Ubuntu
JVM: Sun JDK
Hadoop: Apache Hadoop

Note: Create a dedicated user on the Linux machines (the master as well as the slaves) for the Hadoop installation and configuration, and give it administrative (root/sudo) rights. Suppose we have created a user called cluster; log in to the cluster account and start the configuration.

Hadoop: Prerequisite for Hadoop Setup in Ubuntu

1. Installing Java 1.6 (Sun JDK) in Ubuntu


1. sudo apt-get install python-software-properties
2. sudo add-apt-repository ppa:ferramroberto/java
3. sudo apt-get update
4. sudo apt-get install sun-java6-jdk
5. sudo update-java-alternatives -s java-6-sun
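A quick sanity check that the Sun JDK is now the active Java (the prompt below uses the example cluster user and master host; the exact version string will vary with the installed update level):

cluster@shashwat:~$ java -version

This should report a Sun JVM, e.g. java version "1.6.0_xx".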

2. Installing SSH
Apache Hadoop's startup scripts (start-all.sh & stop-all.sh) use SSH to connect to the slave machines and start Hadoop there. So, to install SSH, follow the step below.

Step-1: Install SSH from the Ubuntu repository.

user1@ubuntu-server:~$ sudo apt-get install ssh
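Once installed, you can confirm the SSH daemon is up before moving on (the service is called ssh on Ubuntu; on other distributions it may be sshd, and the exact status wording differs):

cluster@shashwat:~$ sudo service ssh status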

3. Hosts File configuration


The original hosts file (/etc/hosts) will look like this:

127.0.0.1   localhost
127.0.1.1   localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

We just need to make some changes:

127.0.0.1       localhost
#127.0.1.1      localhost
192.168.2.118   shashwat.blr.pointcross.com shashwat
192.168.2.117   chethan
192.168.2.116   tariq
192.168.2.56    alok
192.168.2.69    sandish
192.168.2.121   moses
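A quick way to confirm the new entries resolve is to ping each node by name from the master (the host names are the example ones above; repeat for the rest of the slaves):

cluster@shashwat:~$ ping -c 1 chethan
cluster@shashwat:~$ ping -c 1 tariq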

4. Configure Passwordless ssh


Why ssh-keygen? Hadoop nodes use SSH keys (not passwords) to communicate with each other. The master's public key should be added to the ~/.ssh/authorized_keys file of all the slaves, so that the master can connect to every slave without a password. In pseudo-distributed mode, master and slave are the same machine, so the machine's public key is simply added to its own ~/.ssh/authorized_keys file. Start a terminal and issue the following commands:
1. ssh-keygen -t rsa -P ""
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

then try
ssh localhost

Why ssh localhost? To check whether steps 1 & 2 were done correctly. ssh localhost should connect without asking for a password, because SSH uses the public key for authentication and we have already added that key to the authorized_keys file. Now copy the content of id_rsa.pub into the ~/.ssh/authorized_keys file (inside the .ssh folder) on each of the slave machines as well; this is required so that the master can reach the slaves without a password, as shown below.
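One convenient way to do this is ssh-copy-id, a standard OpenSSH helper that appends your public key to the remote authorized_keys file; piping the key over ssh works equally well (the cluster user and the tariq host are the example names from /etc/hosts, so repeat for every slave):

cluster@shashwat:~$ ssh-copy-id cluster@tariq

or, equivalently:

cluster@shashwat:~$ cat ~/.ssh/id_rsa.pub | ssh cluster@tariq 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'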

5. Download and configure Hadoop


1. cd ~ (the home directory of the cluster user, i.e. /home/cluster)
2. wget http://www.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
3. sudo tar -xzf hadoop-0.20.2.tar.gz
4. After extracting, issue these commands:
   1. chown -R cluster hadoop-0.20.2/
   2. chmod -R 755 hadoop-0.20.2
   3. Set JAVA_HOME in hadoop/conf/hadoop-env.sh

Open hadoop-env.sh and put this line in it:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
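The rest of this guide refers to the installation as /home/cluster/hadoop, so renaming the extracted folder keeps the paths consistent (an optional convenience, not part of the original steps); bin/hadoop version then confirms the unpacked release works with the configured JDK:

cluster@shashwat:~$ mv hadoop-0.20.2 hadoop
cluster@shashwat:~$ cd hadoop
cluster@shashwat:~/hadoop$ bin/hadoop version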

6. Configure Hadoop in Fully Distributed (or Cluster) Mode


1. Edit the config file hadoop/conf/masters as shown below:

localhost

2. Edit hadoop/conf/slaves as follows:

shashwat
chethan
tariq
alok

3. Edit the core-site.xml file (hadoop/conf/core-site.xml) and put the following lines inside the configuration tag:
<property>
  <name>hadoop.tmp.dir</name>
  <value>tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://shashwat:9000</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

4. Edit the mapred-site.xml file (hadoop/conf/mapred-site.xml) and put the following lines inside the configuration tag:

<property>
  <name>mapred.job.tracker</name>
  <value>shashwat:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

5. Edit the hdfs-site.xml file (hadoop/conf/hdfs-site.xml) and put the following lines inside the configuration tag:

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication (set according to the number of nodes).</description>
</property>
Then copy the same hadoop folder to the slaves, at the same path as on the master: suppose the hadoop folder is /home/cluster/hadoop; it should exist on the slaves too. First ssh to all the slaves from the master (to confirm the passwordless login works), e.g.:

ssh alok
ssh tariq
ssh chethan
ssh moses

Then use the following commands to copy the folder from the master to the slaves:

scp -r /home/cluster/hadoop cluster@tariq:/home/cluster
scp -r /home/cluster/hadoop cluster@alok:/home/cluster
scp -r /home/cluster/hadoop cluster@chethan:/home/cluster
scp -r /home/cluster/hadoop cluster@moses:/home/cluster
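If the slave list grows, a small shell loop saves repeating the scp command (the host names are the example ones above; adjust them to your cluster):

for host in tariq alok chethan moses; do
  scp -r /home/cluster/hadoop cluster@$host:/home/cluster
done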

6. After all the above steps, issue the following commands:


bin/hadoop namenode -format

and then

bin/start-all.sh

Check whether all the daemons (namenode, datanode, tasktracker, jobtracker, secondarynamenode) are running; if so, the configuration of the master is complete. Issue the jps command to see the running Java processes:

The master should list NameNode, JobTracker and SecondaryNameNode.
All slaves should list DataNode and TaskTracker.
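Typical jps output looks like the following (the process IDs are illustrative; only the daemon names matter):

On the master:

cluster@shashwat:~/hadoop$ jps
4821 NameNode
4985 SecondaryNameNode
5063 JobTracker
5342 Jps

On a slave:

cluster@tariq:~$ jps
3124 DataNode
3257 TaskTracker
3418 Jps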

Screenshots from a running Hadoop machine (images omitted): core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, masters, slaves.

1. Where to find the logs? In hadoop/logs (e.g. /home/cluster/hadoop/logs).

2. How to check whether Hadoop is running or not? Use the jps command, or go to http://localhost:50070 for more information on HDFS and to http://localhost:50030 for more information on MapReduce.
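To follow a specific daemon while troubleshooting, tail its log file; Hadoop names the files hadoop-<user>-<daemon>-<hostname>.log, so the example below assumes the cluster user on the shashwat master:

cluster@shashwat:~/hadoop$ tail -f logs/hadoop-cluster-namenode-shashwat.log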
