
sudo apt update

sudo apt-get install default-jdk


java -version
wget "mirror site / hadoop 3.2.1.tar.gz"
extract hadoop and set java path
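A minimal sketch of the extraction step, assuming the archive is named hadoop-3.2.1.tar.gz, sits in the current directory, and Hadoop is to be installed under /usr/local/hadoop (the HADOOP_HOME path used later):

tar -xzf hadoop-3.2.1.tar.gz                    # unpack the downloaded archive
sudo mv hadoop-3.2.1 /usr/local/hadoop          # move it to the install location
sudo chown -R $USER:$USER /usr/local/hadoop     # make your user the owner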

1. sudo apt-get install ssh


2. ssh-keygen (generate a key with no passphrase)
3. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys (Copy id_rsa.pub to
authorized_keys under ~/.ssh folder.)
4. chmod 700 ~/.ssh/authorized_keys (change permissions of the authorized_keys
to have all permissions for the user)
5. sudo /etc/init.d/ssh restart
6. ssh localhost (connect to localhost itself, since multiple nodes are not set up yet;
type "yes" when prompted to accept the host key)
7. nano ~/.bashrc   # used to store environment variables for the shell

export HADOOP_HOME="/usr/local/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
source ~/.bashrc (now load the environment variables into the current session; a quick check follows)
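As a quick sanity check (a sketch that only assumes the variables above were loaded correctly), the Hadoop binaries should now be on the PATH:

echo $HADOOP_HOME    # should print /usr/local/hadoop
hadoop version       # should print the installed version, e.g. 3.2.1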

8. sudo rm -r /usr/local/hadoop_tmp
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
sudo chown user:user -R /usr/local/hadoop_tmp   # replace user:user with your own username
sudo chmod 700 /usr/local/hadoop_tmp/hdfs/datanode
9. start-dfs.sh and start-yarn.sh (start the HDFS and YARN daemons; a quick check follows)
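To confirm the single-node daemons came up (a sketch, assuming the single-node configuration and namenode format were done beforehand), jps from the JDK lists the running Java processes:

jps    # should show NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager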

For multi-node clusters:

10. Now go to VMware and find the directory where your virtual machine is installed.
11. Shut down your virtual machine and head to the installation directory.
12. Copy the Ubuntu folder (the virtual machine folder) three times and rename the copies to
master, slave1 and slave2.
13. Start VMware and select the "Open a Virtual Machine" option.
14. Open all three virtual machines in VMware and rename them to master, slave1 and slave2
respectively.
15. Start all three virtual machines in VMware.

16. sudo vim /etc/hostname (on all three machines)


Set the hostnames as follows (a way to apply the change without rebooting is shown after this list):
1) master for master VM
2) slave1 for slave1 VM
3) slave2 for slave2 VM
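On a systemd-based Ubuntu the new hostname can also be applied without a reboot (a sketch; run the matching command on each VM):

sudo hostnamectl set-hostname master    # on the master VM; use slave1 / slave2 on the other two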
17. sudo apt-get install net-tools (install net-tools to get the ifconfig command)
18. Run ifconfig on all the VMs and note down their IP addresses.
For me they were:
1) 192.168.203.136 for master
2) 192.168.203.133 for slave1
3) 192.168.203.135 for slave2
19. sudo vim /etc/hosts (configure the hosts file on all the systems)
127.0.1.1 ubuntu   # remove this line
Add these lines to the file:
192.168.203.136 master
192.168.203.133 slave1
192.168.203.135 slave2

20. Try to ping your slaves and master using their hostnames:

ping slave1
ping slave2
ping master

21. Try to connect to the systems using the ssh command:

ssh master
exit
ssh slave1
exit
ssh slave2
exit

22. If JAVA_HOME is not already set, set it in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
(see the note after the export line if you are unsure of the path):


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
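If you are unsure of the exact JDK directory on your machine (it may differ from the path above), these commands help locate it; the second is the same trick used in ~/.bashrc and may print a path ending in /jre on Java 8 installs:

ls /usr/lib/jvm/                                   # lists the installed JDK directories
readlink -f /usr/bin/java | sed "s:/bin/java::"    # directory of the currently active java binary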

23. sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml (the core-site.xml file tells the
Hadoop daemons where the NameNode runs in the cluster, i.e. its host name and port).
Add this property (a complete example of the file follows the snippet):

<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
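For reference, the property has to sit inside the <configuration> root element (the same holds for the other *-site.xml files below); a minimal complete core-site.xml for this setup could look like this (fs.defaultFS is the newer name for the same setting, but fs.default.name still works):

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>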

24. sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml (the hdfs-site.xml file contains the
configuration settings for the HDFS daemons, i.e. the NameNode and the DataNodes, and the
replication factor)
1) For the master node, only the NameNode is configured. Add these lines:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
2) On the slaves (slave1 and slave2), configure only the DataNode. Add these lines:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>

25. sudo vim /usr/local/hadoop/etc/hadoop/yarn-site.xml


# tells the NodeManagers that there will be an auxiliary service called mapreduce_shuffle
# that they need to implement
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
# the class that implements the mapreduce_shuffle service
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
# address of the ResourceManager
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8050</value>
</property>
# address of the scheduler interface
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
# address of the resource tracker interface
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
26. sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml (on all nodes, i.e. master + slaves)

# the runtime framework for executing MapReduce jobs; "yarn" tells MapReduce
# that it will run as a YARN application
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
# the address of the JobHistory server, which keeps the details of jobs after they are
# dumped from memory (not strictly necessary, but some software such as Pig asks for it)
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>

27. Delete and recreate the NameNode directory on the master and the DataNode directory on the slaves:

1) on master
sudo rm -r /usr/local/hadoop_tmp
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
sudo chown user:user -R /usr/local/hadoop_tmp/hdfs
sudo chmod 700 /usr/local/hadoop_tmp/hdfs/namenode
2) on slaves
sudo rm -r /usr/local/hadoop_tmp
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
sudo chown user:user -R /usr/local/hadoop_tmp   # run on both slaves, using each slave's own username
sudo chmod 700 /usr/local/hadoop_tmp/hdfs/datanode
28. Edit the masters and workers files (example contents are shown after this list):
1) sudo vim /usr/local/hadoop/etc/hadoop/masters
add master and save
2) sudo vim /usr/local/hadoop/etc/hadoop/workers
add slave1 and slave2 and save
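For clarity, both files simply list hostnames, one per line; with the hostnames used above they would contain:

masters:
master

workers:
slave1
slave2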
29. Configuration is done; now format your NameNode:

hdfs namenode -format

30. Now start your Hadoop cluster:


1) start-dfs.sh    # starts the distributed file system:
                   # NameNode and SecondaryNameNode on the master, DataNode on the slaves
2) start-yarn.sh   # starts the processing framework:
                   # ResourceManager on the master, NodeManager on the slaves
This starts the DataNodes and NodeManagers on the slaves remotely through SSH (Secure Shell).
A quick way to verify that everything came up follows.
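To check that the cluster is healthy (a sketch; the ports are the Hadoop 3.x defaults), inspect the Java processes on each node and ask the NameNode for a report:

jps                     # master: NameNode, SecondaryNameNode, ResourceManager
                        # slaves: DataNode, NodeManager
hdfs dfsadmin -report   # should list two live datanodes

The NameNode web UI is at http://master:9870 and the ResourceManager UI at http://master:8088 by default.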

hdfs dfs -ls /

hdfs dfs -ls /user

hdfs dfs -mkdir /mydata

hdfs dfs -ls /

hdfs dfs -mkdir /mydata/testfolder

hdfs dfs -ls /

sudo nano testfile.txt

write some data

ls *.txt

hdfs dfs -put testfile.txt /mydata/testfolder/

hdfs dfs -ls /

hdfs dfs -get /mydata/testfolder/testfile.txt newfile.txt

ls *.txt

Note that there is no "cd" command in the HDFS shell: every hdfs dfs invocation is independent,
so HDFS keeps no notion of a current working directory. Paths that do not start with "/" are
resolved relative to the user's HDFS home directory, /user/<username>, as the short example
below shows.
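A short sketch of how relative paths behave (assuming the HDFS home directory does not exist yet):

hdfs dfs -mkdir -p /user/$USER    # create the HDFS home directory for the current user
hdfs dfs -mkdir reports           # relative path: actually creates /user/<username>/reports
hdfs dfs -ls                      # no path given: lists /user/<username>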
