
Configuring a First

Hadoop Cluster
On Amazon EC2
Harun ELKIRAN - 515215014
Lecturer : Dr. Muhammad FAHIM

Department of Computer Engineering

Istanbul S. Zaim University, Istanbul, Turkey

Notes & Assumptions

How to set up a small 4-node Hadoop cluster on the Amazon
EC2 cloud.
I am new to Hadoop and to Linux, and documentation on the web is
limited and text-dense, so I try to keep these slides simple and clear.
The slides assume basic familiarity with Linux, Java and SSH.
The cluster will be set up manually to demonstrate the concepts of
Hadoop. In real life, there are lots of configuration management
tools such as Cloudera, Puppet, Chef etc. to manage and
automate larger clusters.
These slides are not production ready; real Hadoop clusters need
additional configuration beyond what is shown here.

Recap: What is Hadoop?

An open source framework for reliable, scalable, distributed
computing;
It gives the ability to process and work with large datasets that are
distributed across clusters of commodity hardware;
It allows you to parallelize computation and move processing to the
data using the MapReduce framework.
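The MapReduce model in the last bullet can be sketched with ordinary shell pipes: `tr` plays the mapper, `sort` plays the shuffle, and `uniq -c` plays the reducer. This is only an analogy for how Hadoop splits the work, not Hadoop itself:

```shell
# Word count as a map -> shuffle -> reduce pipeline, sketching the
# MapReduce model with plain shell tools (an analogy, not Hadoop):
#   map:     emit one word per line
#   shuffle: sort so identical keys end up adjacent
#   reduce:  count each group of identical keys
echo "to be or not to be" | tr ' ' '\n' | sort | uniq -c
```

On a real cluster, Hadoop runs the map and reduce stages on many machines in parallel and moves the code to wherever the data blocks live.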

Recap: What is Amazon EC2?

A cloud web host that allows you to dynamically add and remove
compute server resources as you need them, so you pay
for only the capacity that you need;
It is well suited to Hadoop computation: we can bring up
enormous clusters within minutes and then spin them down when we've
finished, to reduce costs.
EC2 is quick and cost-effective for experimental and learning
purposes, as well as being proven as a production Hadoop host.

Part 1

Access to
EC2 instances
Part 2

and Cluster
Part 3

1. Installing Java

2. Download Hadoop
I am going to use the Hadoop 1.2.1 stable version.
1. wget
(download Hadoop)
2. tar -xzvf hadoop-1.2.1.tar.gz (unpack Hadoop)
3. mv hadoop-1.2.1 hadoop (rename the Hadoop directory)
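Steps 2 and 3 can be rehearsed offline. The sketch below fakes the download (the wget URL is not shown on the slide) by building a dummy hadoop-1.2.1.tar.gz locally, then runs the unpack and rename exactly as listed:

```shell
# Build a stand-in for the downloaded release so steps 2-3 can be
# tried without a network (in real use, step 1's wget fetches this):
mkdir -p hadoop-1.2.1 && touch hadoop-1.2.1/README.txt
tar -czf hadoop-1.2.1.tar.gz hadoop-1.2.1 && rm -r hadoop-1.2.1

tar -xzvf hadoop-1.2.1.tar.gz   # step 2: unpack the release
mv hadoop-1.2.1 hadoop          # step 3: rename to a version-free path
ls hadoop                       # the tree now lives under ./hadoop
```

Renaming to a version-free `hadoop/` path keeps the environment variables in the next step stable when the Hadoop version changes.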

3. Setup Environment Variables

I used WinSCP to update the .bashrc file to add the
important Hadoop paths and directories:
export HADOOP_CONF=/home/ubuntu/hadoop/conf
export HADOOP_PREFIX=/home/ubuntu/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to path
export PATH=$PATH:$HADOOP_PREFIX/bin
For reference, the Java installation commands from step 1:
1. sudo apt-get update (update the package lists)
2. sudo add-apt-repository ppa:webupd8team/java (add the repository
that provides the latest Java)
3. sudo apt-get update && sudo apt-get install oracle-jdk7-
Then reload the shell configuration:
source ~/.bashrc

4. Setup Password-less SSH on Servers

1. The master server remotely starts services on the slave nodes,
which requires password-less access to the slave servers. The AWS
Ubuntu server comes with the OpenSSH server pre-installed.
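The slide states the requirement but not the commands. A minimal sketch, assuming the default `ubuntu` user and an RSA key (run on the master; `slave1` is a placeholder hostname):

```shell
# Generate a passphrase-less key pair on the master
# (-N "" means no passphrase; the file path is the OpenSSH default):
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize the key locally and on every slave; 'ubuntu@slave1' is a
# placeholder -- repeat for each slave in the cluster:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# ssh-copy-id ubuntu@slave1
```

After this, `ssh slave1` from the master should log in without a password, which is what start-up scripts on the master rely on to launch slave daemons.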

5. Hadoop Cluster Setup

In this section we need to modify several configuration files.
The first is conf/hadoop-env.sh: it contains some environment
variable settings used by Hadoop. You can use these to
affect some aspects of Hadoop daemon behavior, such as
where log files are stored, the maximum amount of heap
used, etc. The only variable you should need to change in
this file at this point is JAVA_HOME, which specifies the
path to the Java 1.7.x installation used by Hadoop.
core-site.xml: key property for the
namenode configuration, e.g. hdfs://namenode/
hdfs-site.xml: key property dfs.replication, 3 by default
mapred-site.xml: key property mapred.job.tracker for the
jobtracker configuration, e.g. jobtracker:8021
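A minimal sketch of the three files, assuming the Hadoop 1.x property names (`fs.default.name`, `dfs.replication`, `mapred.job.tracker`) and the placeholder hostnames `namenode` and `jobtracker` used on this slide:

```xml
<!-- core-site.xml: where the NameNode lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml: block replication factor (3 by default) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the JobTracker lives -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
  </property>
</configuration>
```

These files must be pushed to every node in the cluster so that all daemons agree on where the NameNode and JobTracker are.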

6. Configure Masters & Slaves

Every Hadoop distribution comes with masters and
slaves files. By default each contains one entry for
localhost; we have to modify these two files on both the
master (HadoopNameNode) and slave
machines. We have a dedicated machine for the
Hadoop Secondary NameNode.
I used WinSCP to update the files.
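A sketch of what the two files might contain for this 4-node layout (the hostnames are assumptions; in Hadoop 1.x, `masters` lists the Secondary NameNode host and `slaves` lists the DataNode/TaskTracker hosts):

```shell
# Write the two files in the conf/ directory; the hostnames below are
# placeholders for the four EC2 machines described in the slides:
cat > masters <<'EOF'
HadoopSecondaryNameNode
EOF

cat > slaves <<'EOF'
HadoopSlave1
HadoopSlave2
HadoopSlave3
EOF
```

The master reads these files at startup and uses password-less SSH (from step 4) to launch the appropriate daemon on each listed host.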

7. Hadoop Daemon Startup

The first step in starting up your Hadoop installation
is formatting the Hadoop filesystem (HDFS), which is
implemented on top of the local filesystems of your
cluster. You need to do this the first time you set up
a Hadoop installation. Do not format a running
Hadoop filesystem; this will cause all your data to be
erased.
1. $ hadoop namenode -format (format the HDFS filesystem)
2. $ cd $HADOOP_CONF
3. $
This will start Hadoop.
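The command truncated from step 3 is, in Hadoop 1.x, normally `start-all.sh`. A sketch of the whole sequence under that assumption, shown as a transcript since it needs a live cluster:

```
$ hadoop namenode -format         # one-time HDFS format; erases existing data
$ cd $HADOOP_CONF
$ $HADOOP_PREFIX/bin/start-all.sh # starts NameNode/JobTracker locally and,
                                  # via password-less SSH, the slave daemons
$ jps                             # lists the running Java daemons to verify
```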


Configuring Master & Slaves with WinSCP

Part 4

Status &
Running one toy
example on my
deployed system

To quickly verify my setup, I'm going to run the
Hadoop pi example.
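On the cluster the pi example is launched from the examples jar shipped with the release, along the lines of `hadoop jar $HADOOP_PREFIX/hadoop-examples-1.2.1.jar pi <maps> <samples>`. As a local sketch of what that job computes, here is the same Monte Carlo estimate in awk (the seed and sample count are arbitrary choices):

```shell
# Monte Carlo estimate of pi: sample random points in the unit square
# and count the fraction falling inside the quarter circle. The Hadoop
# pi example distributes exactly this across map tasks.
awk 'BEGIN {
  srand(42); n = 100000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1) inside++
  }
  printf "pi ~ %.3f\n", 4 * inside / n
}'
```

With 100,000 samples the estimate lands near 3.14; on the cluster, each map task draws its own samples and a single reduce task combines the counts.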

What Have I Done?

Set up EC2, requested machines, configured the network and password-less SSH;
Downloaded Java and Hadoop;
Configured MapReduce and pushed the configuration around the cluster;
Started MapReduce;
Compiled a MapReduce job using the Hadoop pi example;
Submitted the job, ran it successfully, and viewed the output.
Hopefully you can see how this model of computation would be useful for
very large datasets that we wish to process.
I'm also sold on EC2 as a distributed, fast, cost-effective platform for
using Hadoop for big-data work.

Thank you

Department of Computer Engineering

Istanbul S. Zaim University, Istanbul, Turkey