
1. Installing Hadoop on Ubuntu as a single-node cluster


1.1. Install the JDK (Oracle JDK is preferred)
a. sudo apt-add-repository ppa:webupd8team/java
b. sudo apt-get update
c. sudo apt-get install oracle-java8-installer
1.2. sudo addgroup hadoop
1.3. sudo adduser --ingroup hadoop hduser
1.3a. sudo visudo
(inside the file, add the line below under the existing line "root ALL=(ALL:ALL) ALL",
then save and exit the editor)
hduser ALL=(ALL:ALL) ALL
1.3b. sudo apt-get install openssh-server


1.4. su - hduser
1.5. ssh-keygen -t rsa -P ""
1.6. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
1.7. ssh localhost (this is for testing ssh keys)
1.8. wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
1.9. tar xvzf hadoop-2.7.2.tar.gz and cd hadoop-2.7.2, then run: sudo mkdir -p /usr/local/hadoop && sudo mv * /usr/local/hadoop (before this, add hduser to sudoers as in step 1.3a)
1.10. sudo chown -R hduser:hadoop /usr/local/hadoop
1.11. Edit the following files
~/.bashrc
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
/usr/local/hadoop/etc/hadoop/core-site.xml
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
1.12. update-alternatives --config java (to find the JAVA_HOME path used in the next step)
1.13. Add the following lines to ~/.bashrc:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

#HADOOP VARIABLES END


1.14. vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
1.15. vi /usr/local/hadoop/etc/hadoop/core-site.xml
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
1.16. vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
cd /usr/local/hadoop/etc/hadoop/
sudo mv mapred-site.xml.template mapred-site.xml
(By default, the /usr/local/hadoop/etc/hadoop/ folder contains a
mapred-site.xml.template file, which has to be renamed/copied to
mapred-site.xml)
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
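Step 1.11 also lists hdfs-site.xml, whose single-node contents are not shown above. A
minimal sketch, assuming the same /usr/local/hadoop_store directories that the
multi-node section of this guide uses later:
sudo mkdir -p /usr/local/hadoop_store/hdfs/{namenode,datanode}
sudo chown -R hduser:hadoop /usr/local/hadoop_store
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>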
1.17. From the home directory of the hadoop user (hduser), execute the following
commands
source .bashrc
hdfs namenode -format
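Once the format succeeds, the single-node cluster can be brought up and checked; a
minimal sketch (the process names are what jps is expected to show on Hadoop 2.x):
start-dfs.sh
start-yarn.sh
jps
(jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and
NodeManager)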

2. Installing Hadoop on a multi-node cluster


2.1. Configure hostnames for each node in the cluster (this entry has to be made on
every node). In this example the hostname of the namenode is masternode and the
hostname of the datanode is slavenode1.
/etc/hosts
192.168.0.1 masternode
192.168.0.2 slavenode1
2.2. Create ssh keys on the namenode and transfer the public key to each node,
including the namenode itself
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hostname_of_each_node
2.3. Define masters and slaves in the configuration directory. This step needs to
be carried out only on the namenode.
Create a file named $HADOOP_HOME/etc/hadoop/masters and enter the hostname
of the namenode:
masternode
Create a file named $HADOOP_HOME/etc/hadoop/slaves and enter the hostnames of
all datanodes (in this example the namenode also works as a datanode):
masternode
slavenode1
You must change the configuration files $HADOOP_HOME/etc/hadoop/core-site.xml, $HADOOP_HOME/etc/hadoop/mapred-site.xml and
$HADOOP_HOME/etc/hadoop/hdfs-site.xml on ALL machines as follows.
core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://masternode:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
********************************************************************
mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>masternode:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
**************************************************************************
hdfs-site.xml
(create the local directories for HDFS data on every node first)
sudo mkdir -p /usr/local/hadoop_store/hdfs/{namenode,datanode}
sudo chown -R hduser:hadoop /usr/local/hadoop_store
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>masternode:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>masternode:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>masternode:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>masternode:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>masternode:8033</value>
</property>
bin/hdfs namenode -format

Run the command sbin/start-dfs.sh on the machine you want the (primary)
NameNode to run on. This will bring up HDFS with the NameNode running on
the machine you ran the previous command on, and DataNodes on the
machines listed in the $HADOOP_HOME/etc/hadoop/slaves file.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
stop multi-node cluster
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
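To confirm that every datanode has joined the cluster, a quick check from the
namenode (a sketch, run as hduser):
jps (on masternode it should list NameNode, DataNode, ResourceManager, NodeManager)
jps (on slavenode1 it should list DataNode and NodeManager)
hdfs dfsadmin -report (both datanodes should be reported as live)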
3. Installing Pig on the Hadoop machine
3.1. Download from pig.apache.org
su hduser
cd /usr/local
sudo wget http://www-eu.apache.org/dist/pig/pig-0.15.0/pig-0.15.0.tar.gz
sudo tar zxvf pig-0.15.0.tar.gz
sudo mv pig-0.15.0 pig
3.2. Set path variable for pig in .bashrc
export PATH=$PATH:/usr/local/pig/bin
3.3. Pig can be run in two modes
Interactive (Grunt shell) mode
pig -x local (local mode: can access the local file system, i.e. ext4 etc.)
pig (MapReduce mode: can access only the HDFS file system)
Batch mode
pig -x local script.pig
pig script.pig
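As an illustration of batch mode, a script is just a file of Pig Latin statements; a
small sketch using a hypothetical file named medals.pig that reads the CSV used in
the next section:
-- medals.pig
athletes = LOAD 'OlympicAthletes.csv' USING PigStorage(',') AS (athlete:chararray,
country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
first5 = LIMIT athletes 5;
DUMP first5;
Run it with "pig -x local medals.pig" (local file) or "pig medals.pig" (file in HDFS).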
4. Pig example
Commands used in pig
LOAD
DUMP
LIMIT
GROUP BY
DESCRIBE
FOREACH GENERATE
GROUP ALL
DISTINCT
ORDER
FILTER

athletes = LOAD '/test/OlympicAthletes.csv' USING
org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE',
'SKIP_INPUT_HEADER') AS (athlete:chararray, country:chararray, year:int,
sport:chararray, gold:int, silver:int, bronze:int, total:int);
(The multiline/header options come from CSVExcelStorage in the piggybank jar, which
has to be REGISTERed first; plain PigStorage only takes a delimiter, as below.)
athletes = LOAD 'OlympicAthletes.csv' USING PigStorage(',') AS
(athlete:chararray, country:chararray, year:int, sport:chararray, gold:int,
silver:int, bronze:int, total:int);
athletes is a relation and its name is its alias (something like a table in a database).
A relation contains rows or entries, which in Pig are represented as tuples made up of
fields. In the relation 'athletes', the fields are 'athlete', 'country', 'year', etc.
To see the content of athletes
DUMP athletes
Using LIMIT to restrict the number of rows read
athletes_lim = LIMIT athletes 10;
DUMP athletes_lim;
(To find out which country has secured the most medals, group the data country-wise
and then find the count.)
GROUP BY
data_grp_field = GROUP data BY col;
athletes_grp_country = GROUP athletes BY country;
It can be hard to visualize what the data looks like after performing a GROUP
operation.
The data structures that result are more akin to Python than to SQL.
That second field is basically a list of tuples; each distinct group value has its
own list of corresponding data.
In Pig, this list is called a DataBag.
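DESCRIBE makes this structure visible; a sketch of the shape it is expected to print
for the grouping above:
DESCRIBE athletes_grp_country;
-- athletes_grp_country: {group: chararray, athletes: {(athlete: chararray,
-- country: chararray, year: int, sport: chararray, gold: int, silver: int,
-- bronze: int, total: int)}}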
FOREACH..GENERATE
(To view count of medals obtained by a given country)
data = LOAD 'my-file.csv' USING PigStorage(',') AS (field1:int, field2:chararray,
field3:long);
new_data = FOREACH data GENERATE field1, field2;
data_grp = GROUP data BY field2;
new_data = FOREACH data_grp GENERATE group AS field2, SUM(data.field3) AS
field_sum;
medal_sum = FOREACH athletes_grp_country GENERATE group AS country,
SUM(athletes.total) as medal_count;
DUMP medal_sum;

GROUP ALL
To group the entire data set:
data_grp = GROUP my_data ALL;
new_data = FOREACH data_grp GENERATE MIN(my_data.field1) AS min_field1;
-- finds the minimum of a specific field, say year in our case
data_range = FOREACH (GROUP athletes ALL) GENERATE MIN(athletes.year) AS
min_year, MAX(athletes.year) AS max_year;
DUMP data_range;

DISTINCT
distinct_countries = DISTINCT (FOREACH athletes GENERATE country);
DUMP distinct_countries;
ORDER BY
ordered_medals = ORDER medal_sum BY medal_count DESC;
ordered_medals_lim = LIMIT ordered_medals 1;
DUMP ordered_medals_lim;
FILTER
athletes_filter = FILTER athletes by sport != 'Swimming';
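Putting the pieces together, a short end-to-end sketch that writes the top
medal-winning countries back to HDFS (the STORE step and the 'medal_report' output
path are illustrative additions, not part of the original example):
athletes = LOAD 'OlympicAthletes.csv' USING PigStorage(',') AS (athlete:chararray,
country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
athletes_grp_country = GROUP athletes BY country;
medal_sum = FOREACH athletes_grp_country GENERATE group AS country,
SUM(athletes.total) AS medal_count;
ordered_medals = ORDER medal_sum BY medal_count DESC;
top10 = LIMIT ordered_medals 10;
STORE top10 INTO 'medal_report' USING PigStorage(',');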
5. Mapreduce example using python
5.1. mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

5.2. reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
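Since Hadoop Streaming simply pipes data through these scripts, they can be tested
locally before submitting the job; sort stands in for the shuffle phase:
echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py
(expected output: bar 1, foo 3, labs 1, quux 2, one word per line, tab-separated)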
5.3. Execute the following commands as hduser from the user's home directory
a. wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
b. hadoop dfs -mkdir wordcount
c. hadoop dfs -copyFromLocal ./pg2701.txt wordcount/sample.txt
5.4. Start the mapreduce job by executing the following command (in Hadoop 2.7.2
the streaming jar is under /usr/local/hadoop/share/hadoop/tools/lib/)
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar
-mapper "python $PWD/mapper.py" -reducer "python $PWD/reducer.py"
-input "wordcount/sample.txt" -output "wordcount/output"

wordcount is a directory in the HDFS file system. The input data for the map reduce
job is stored in wordcount/sample.txt. A new directory named wordcount/output will
be created after executing the above command and the results will be stored below
that directory.
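Once the job finishes, the results can be listed and inspected directly from HDFS,
for example:
hadoop dfs -ls wordcount/output
hadoop dfs -cat wordcount/output/part-00000 | head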
6. Commissioning a datanode
6.1. Create a file "includes" under the $HADOOP_HOME/etc/hadoop directory.
6.2. Include the IP of the datanode in this file.
6.3. Add the property below to hdfs-site.xml
<property>
<name>dfs.hosts</name>
<value>/usr/local/hadoop/etc/hadoop/includes</value>
<final>true</final>
</property>
6.4. Add the property below to mapred-site.xml
<property>
<name>mapred.hosts</name>
<value>/usr/local/hadoop/etc/hadoop/includes</value>
</property>
6.5. In Namenode, execute
hdfs dfsadmin -refreshNodes
6.6. On the ResourceManager (Jobtracker) node (in our example the namenode), execute
yarn rmadmin -refreshNodes
6.7. Login to the new slave node and execute:
$ cd path/to/hadoop
$ sbin/hadoop-daemon.sh start datanode
$ yarn nodemanager
6.8. Add the IP/hostname of the new datanode to the slaves file.
Finally, execute the command below during a non-peak hour:
$ sbin/start-balancer.sh
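To verify that the new datanode has registered, a quick check from the namenode (a
sketch):
hdfs dfsadmin -report (the new node should appear as a live datanode)
yarn node -list (the new NodeManager should be listed as RUNNING)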

7. De-Commissioning a datanode
7.1. Create a file "excludes" under the $HADOOP_HOME/etc/hadoop directory.
7.2. Include the IP of the datanode to be de-commissioned in this file.
7.3. hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/usr/local/hadoop/etc/hadoop/excludes</value>
<final>true</final>
</property>
7.4. mapred-site.xml
<property>
<name>mapred.hosts.exclude</name>
<value>/usr/local/hadoop/etc/hadoop/excludes</value>
<final>true</final>
</property>

7.5. In Namenode, execute


hdfs dfsadmin -refreshNodes
7.6. On the ResourceManager (Jobtracker) node, execute
yarn rmadmin -refreshNodes

7.7. Remove the IP/hostname of the de-commissioned datanode from the slaves file.


Finally, execute the command below during a non-peak hour:
$ sbin/start-balancer.sh
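Decommissioning is gradual because the blocks on the node must be re-replicated
first; its progress can be watched from the namenode, for example:
hdfs dfsadmin -report (the node moves from "Decommission in progress" to
"Decommissioned"; only then should it be shut down)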
8. Secure Hadoop using Kerberos
This involves two steps: first configure a Kerberos server, and then configure
every node in the cluster (including the namenode) as a Kerberos client. The
Kerberos server will be a dedicated machine and will not run the namenode,
secondary namenode or a datanode.

8.1 Configure kerberos server


8.1.1. sudo apt-get install krb5-kdc krb5-admin-server
8.1.2. sudo dpkg-reconfigure krb5-kdc
disable Kerberos 4 compatibility mode
do not run krb524d (daemon to convert Kerberos tickets between versions)
defaults for the other settings are acceptable
8.1.3. edit /etc/krb5.conf (on Ubuntu the [kdcdefaults] block and the KDC-specific
realm settings may instead live in /etc/krb5kdc/kdc.conf)
[kdcdefaults]
kdc_ports = 750,88
default_realm = EXAMPLE.COM
[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
EXAMPLE.COM = {
database_name = /var/lib/krb5kdc/principal
admin_keytab = FILE:/etc/krb5kdc/kadm5.keytab
acl_file = /etc/krb5kdc/kadm5.acl
key_stash_file = /etc/krb5kdc/stash
kdc_ports = 750,88
max_life = 10h 0m 0s
max_renewable_life = 7d 0h 0m 0s
master_key_type = des3-hmac-sha1
supported_enctypes = des3-hmac-sha1:normal des-cbc-crc:normal des:normal des:v4
des:norealm des:onlyrealm
default_principal_flags = +preauth
}

8.1.4. krb5_newrealm
8.1.5. Kerberos uses an Access Control List (ACL) to specify the per-principal
access rights to the Kerberos admin daemon. This file's default location is
/etc/krb5kdc/kadm5.acl
# This file is the access control list for krb5 administration.
# When this file is edited run /etc/init.d/krb5-admin-server restart to activate
# One common way to set up Kerberos administration is to allow any principal
# ending in /admin is given full administrative rights.
# To enable this, uncomment the following line:
*/admin@EXAMPLE.COM *

8.1.6. create password policies


kadmin.local
kadmin.local: add_policy -minlength 8 -minclasses 3 admin
kadmin.local: add_policy -minlength 8 -minclasses 4 host
kadmin.local: add_policy -minlength 8 -minclasses 4 service
kadmin.local: add_policy -minlength 8 -minclasses 2 user

8.1.7. create an admin user


kadmin.local: addprinc -policy admin admin/admin
8.1.8. The Kerberos realm is administered using the kadmin utility. The kadmin
utility is an interactive interface that allows the administrator to create,
retrieve, update, and delete realm principals. kadmin can be run on any
computer that is part of the Kerberos realm, provided the user has the proper
credentials. However, for security reasons, it is best to run kadmin on a KDC.
kadmin -p admin/admin
8.1.9. common tasks that can be performed with kadmin utility (Shown as
example. The following commands need not be executed for hadoop)
Add a user:
kadmin: addprinc user
The default realm name is appended to the principal's name by default
Delete a user:
kadmin: delprinc user
List principals:
kadmin: listprincs
Add a service:
kadmin: addprinc service/server.fqdn
The default realm name is appended to the principal's name by default
Delete a service:
kadmin: delprinc service/server.fqdn
8.1.10. add principal for hduser
kadmin -p admin/admin
addprinc -policy user hduser@EXAMPLE.COM

8.1.11. starting the kerberos service


service krb5kdc start
service kadmin start
8.1.12. configure ssh to use kerberos authentication.
edit /etc/ssh/sshd_config
GSSAPIAuthentication yes
GSSAPICleanupCredentials yes
9. Configuring kerberos client (To be done on all nodes of hadoop
cluster)
9.1. Packages to be installed (sudo apt-get install krb5-user krb5-config)
krb5-user: Basic programs to authenticate using MIT Kerberos.
krb5-config: Configuration files for Kerberos Version 5.
libkadm55: MIT Kerberos administration runtime libraries. (No longer available
in Karmic)
Supply your realm name when prompted to enter a default realm
What are the Kerberos servers for your realm?
kdc-server.example.com
What is the administrative server for your Kerberos realm?
kerberos.example.com
9.2. Edit /etc/krb5.conf (For manual configuration)
[logging]
default = FILE:/var/log/krb5.log
[libdefaults]
default_realm = EXAMPLE.COM
kdc_timesync = 1
ccache_type = 4
forwardable = true
proxiable = true
[realms]
EXAMPLE.COM = {
kdc = kdc-server.example.com
admin_server = kdc-server.example.com
default_domain = EXAMPLE.COM
}
[domain_realm]
.example.com = EXAMPLE.COM
example.com = EXAMPLE.COM

9.3. To create principals for the user and then hadoop services and apache
sudo kadmin -p admin/admin
addprinc hduser@EXAMPLE.COM
addprinc hdfs/masternode.example.com (repeat with the FQDN of each
datanode)
addprinc mapred/masternode.example.com
addprinc yarn/masternode.example.com
addprinc HTTP/masternode.example.com
9.4. To authenticate via Kerberos with human interaction, you use the kinit
command to request tickets.
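For example, to obtain and inspect a ticket for hduser (the principal created in
8.1.10):
kinit hduser@EXAMPLE.COM
klist
(klist should show a krbtgt/EXAMPLE.COM@EXAMPLE.COM ticket; kdestroy discards it)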
In kerberos terminology Hadoop services such as yarn and hdfs are referred to
as service principals.
For each service principal you create encrypted kerberos keys referred to as
keytabs.
These keytabs are required for passwordless communication and
authentication, in a similar way to how SSH keys are used.
Keytabs are created with the kadmin utility, so all keytab-creation commands
are run from that shell.
To create a keytab, specify the name of the file that will store it and
the principal or principals that it will contain:
kadmin -p admin/admin
ktadd -k hdfs.keytab hdfs/masternode.example.com HTTP/masternode.example.com
ktadd -k mapreduce.keytab mapreduce/masternode.example.com HTTP/masternode.example.com
ktadd -k yarn.keytab yarn/masternode.example.com HTTP/masternode.example.com
sudo mkdir /usr/local/hadoop/etc/conf
sudo mv hdfs.keytab /usr/local/hadoop/etc/conf
sudo mv mapreduce.keytab /usr/local/hadoop/etc/conf
sudo mv yarn.keytab /usr/local/hadoop/etc/conf
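The contents of a keytab can be checked with klist before the daemons are pointed at
it, and it is common to restrict the keytab permissions; a sketch:
sudo klist -k -t /usr/local/hadoop/etc/conf/hdfs.keytab
(it should list the hdfs/masternode.example.com and HTTP/masternode.example.com entries)
sudo chown hduser:hadoop /usr/local/hadoop/etc/conf/*.keytab
sudo chmod 400 /usr/local/hadoop/etc/conf/*.keytab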

9.5. Configure Hadoop for Kerberos security: add these lines to core-site.xml on
every machine in the cluster
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value> <!-- A value of "simple" would disable security. -->
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
9.6. Add these lines to hdfs-site.xml on every node in the cluster (Hadoop expands
_HOST to the FQDN of the node at runtime, so it can be left as-is or replaced with
the relevant hostname)
<!-- General HDFS security config -->

<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- NameNode security config -->
<property>
<name>dfs.namenode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to the HDFS
keytab -->
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<!-- Secondary NameNode security config -->
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to the HDFS
keytab -->
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<!-- DataNode security config -->
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to
the HDFS keytab -->

</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
9.7. Add these lines to yarn-site.xml on every node in the cluster (again, _HOST is
expanded by Hadoop to the hostname of the relevant node)
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/usr/local/hadoop/etc/conf/yarn.keytab</value>
</property>
<!-- remember the principal for the node manager is the principal for the host
this yarn-site.xml file is on -->
<!-- these (next four) need only be set on node manager nodes -->
<property>
<name>yarn.nodemanager.principal</name>
<value>yarn/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/usr/local/hadoop/etc/conf/yarn.keytab</value>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>yarn</value>
</property>
9.8. Add these lines to mapred-site.xml on every node in the cluster
<!-- job history server secure configuration info -->
<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/usr/local/hadoop/etc/conf/mapreduce.keytab</value>

</property>
<property>
<name>mapreduce.jobhistory.principal</name>
<value>mapred/_HOST@EXAMPLE.COM</value>
</property>
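After distributing these files, restart the cluster and confirm that authentication
is actually enforced; a minimal sketch of the check (the exact error text varies):
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh
kdestroy
hdfs dfs -ls / (should fail with a GSS/Kerberos "no valid credentials" error)
kinit hduser@EXAMPLE.COM
hdfs dfs -ls / (succeeds once a ticket is held)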
Good luck.
Baskar
baskar910@gmail.com
Posted under a Creative Commons license
