<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
**************************************************************************
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replicas can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
bin/hdfs namenode -format
Run the command sbin/start-dfs.sh on the machine you want the (primary)
NameNode to run on. This will bring up HDFS with the NameNode running on
that machine, and DataNodes on the machines listed in the
$HADOOP_HOME/etc/hadoop/slaves file.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
To stop the multi-node cluster:
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
3. Installing pig on the hadoop machine
3.1. Download from pig.apache.org
su hduser
cd /usr/local
sudo wget http://www-eu.apache.org/dist/pig/pig-0.15.0/pig-0.15.0.tar.gz
sudo tar zxvf pig-0.15.0.tar.gz
sudo mv pig-0.15.0 pig
3.2. Set path variable for pig in .bashrc
export PATH=$PATH:/usr/local/pig/bin
3.3. Pig can be run in two modes:
Interactive mode
pig -x local - can access the local file system (e.g., ext4)
pig - can access only the HDFS file system
Batch mode
pig -x local script
pig script
4. Pig example
Commands used in pig
LOAD
DUMP
LIMIT
GROUP BY
DESCRIBE
FOREACH GENERATE
GROUP ALL
DISTINCT
ORDER
FILTER
GROUP ALL
To group the entire data set:
data_grp = GROUP my_data ALL;
new_data = FOREACH data_grp GENERATE MIN(my_data.field1) AS min_field1; -- find
the minimum of a specific field, say year in our case
data_range = FOREACH (GROUP athletes ALL) GENERATE MIN(athletes.year) AS
min_year, MAX(athletes.year) AS max_year;
DUMP data_range;
DISTINCT
distinct_countries = DISTINCT (FOREACH athletes GENERATE country);
DUMP distinct_countries;
ORDER BY
ordered_medals = ORDER medal_sum BY medal_count DESC;
ordered_medals_lim = LIMIT ordered_medals 1;
DUMP ordered_medals_lim;
FILTER
athletes_filter = FILTER athletes by sport != 'Swimming';
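To see concretely what each Pig operator computes, the statements above can be mimicked on a plain Python list of tuples. The sample rows below are invented for illustration and are not part of the course data set:

```python
# Each tuple is (country, sport, year, medals) -- an assumed schema
# matching the fields the Pig examples reference.
athletes = [
    ("USA", "Swimming", 2008, 8),
    ("USA", "Athletics", 2012, 1),
    ("JAM", "Athletics", 2008, 3),
    ("JAM", "Athletics", 2012, 4),
]

# GROUP athletes ALL; GENERATE MIN(athletes.year), MAX(athletes.year)
years = [row[2] for row in athletes]
data_range = (min(years), max(years))  # (2008, 2012)

# DISTINCT (FOREACH athletes GENERATE country)
distinct_countries = sorted({row[0] for row in athletes})  # ['JAM', 'USA']

# GROUP by country, sum medals, ORDER ... DESC, LIMIT 1
medal_sum = {}
for country, _, _, medals in athletes:
    medal_sum[country] = medal_sum.get(country, 0) + medals
top = max(medal_sum.items(), key=lambda kv: kv[1])  # ('USA', 9)

# FILTER athletes BY sport != 'Swimming'
athletes_filter = [row for row in athletes if row[1] != "Swimming"]

print(data_range, distinct_countries, top)
```

In Pig each of these steps produces a new relation; in Python we simply bind each intermediate result to a new name, which makes the dataflow style of Pig Latin easy to see.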
5. Mapreduce example using python
5.1. mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
5.2. reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
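Before submitting anything to the cluster, the mapper/reducer logic can be sanity-checked locally. The sketch below reimplements both scripts as plain Python functions and simulates the `cat input | mapper | sort | reducer` pipeline that Hadoop Streaming performs; the sample lines are invented test data:

```python
def map_words(lines):
    """Emit (word, 1) pairs, like mapper.py writes to STDOUT."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_counts(pairs):
    """Sum the counts for each word. The input must be sorted by word,
    which is what Hadoop's shuffle phase guarantees."""
    out = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                out.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        out.append((current_word, current_count))
    return out

# Simulate "cat input | mapper | sort | reducer":
pairs = sorted(map_words(["the whale the sea", "the whale"]))
print(reduce_counts(pairs))  # [('sea', 1), ('the', 3), ('whale', 2)]
```

The `sorted()` call stands in for Hadoop's sort-and-shuffle step; without it the reducer's "IF-switch" logic would merge counts incorrectly, which is exactly why the streaming reducer relies on sorted input.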
5.3. Execute the following commands as hduser from the user's home
directory:
a. wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
b. hdfs dfs -mkdir wordcount
c. hdfs dfs -copyFromLocal ./pg2701.txt wordcount/sample.txt
5.4. Start the MapReduce job by executing the following command:
hadoop jar /path/to/hadoop-streaming-1.0.3.jar -mapper "python
$PWD/mapper.py" -reducer "python $PWD/reducer.py" -input "wordcount/sample.txt"
-output "wordcount/output"
wordcount is a directory in the HDFS file system. The input data for the
MapReduce job is stored in wordcount/sample.txt. A new directory named
wordcount/output will be created after executing the above command, and the
results will be stored under that directory.
6. Commissioning a datanode
6.1. Create a file "includes" under the $HADOOP_HOME/etc/hadoop directory.
6.2. Include the IP of the datanode in this file.
6.3. Add the property below to hdfs-site.xml
<property>
<name>dfs.hosts</name>
<value>/usr/local/hadoop/etc/hadoop/includes</value>
<final>true</final>
</property>
6.4. Add the property below to mapred-site.xml
<property>
<name>mapred.hosts</name>
<value>/usr/local/hadoop/etc/hadoop/includes</value>
</property>
6.5. On the NameNode, execute:
hdfs dfsadmin -refreshNodes
6.6. On the ResourceManager node (in our example, the NameNode), execute:
yarn rmadmin -refreshNodes
6.7. Log in to the new slave node and execute:
$ cd /path/to/hadoop
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start nodemanager
6.8. Add the IP/hostname of the new datanode to the slaves file.
Finally, execute the command below during non-peak hours:
$ sbin/start-balancer.sh
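The commissioning steps above amount to: list every permitted DataNode in an "includes" file, one IP or hostname per line, and point dfs.hosts at that file. A small sketch of generating the file contents; the IPs are illustrative assumptions, not taken from any real cluster:

```python
def includes_file_text(datanode_ips):
    """Render the dfs.hosts "includes" file: one entry per line,
    which is the format the NameNode expects."""
    return "".join(ip + "\n" for ip in datanode_ips)

text = includes_file_text(["192.168.1.11", "192.168.1.12"])
print(text)
# Write `text` to $HADOOP_HOME/etc/hadoop/includes, then run
# `hdfs dfsadmin -refreshNodes` on the NameNode as described above.
```

Keeping this file generated from one authoritative host list avoids the common failure mode where dfs.hosts and the slaves file drift out of sync.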
7. De-Commissioning a datanode
7.1. Create a file "excludes" under the $HADOOP_HOME/etc/hadoop directory.
7.2. Include the IP of the datanode to be de-commissioned in this file.
7.3. hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/usr/local/hadoop/etc/hadoop/excludes</value>
<final>true</final>
</property>
7.4. mapred-site.xml
<property>
<name>mapred.hosts.exclude</name>
<value>/usr/local/hadoop/etc/hadoop/excludes</value>
<final>true</final>
</property>
des-cbc-crc:normal
8.1.4. krb5_newrealm
8.1.5. Kerberos uses an Access Control List (ACL) to specify the per-principal
access rights to the Kerberos admin daemon. This file's default location is
/etc/krb5kdc/kadm5.acl
# This file is the access control list for krb5 administration.
# When this file is edited, run /etc/init.d/krb5-admin-server restart to activate it.
# One common way to set up Kerberos administration is to give any principal
# ending in /admin full administrative rights.
# To enable this, uncomment the following line:
*/admin@EXAMPLE.COM *
addpol -minlength 8 -minclasses 3 admin
addpol -minlength 8 -minclasses 4 host
addpol -minlength 8 -minclasses 4 service
addpol -minlength 8 -minclasses 2 user
9.3. To create principals for the user, the Hadoop services, and Apache:
sudo kadmin -p admin/admin
addprinc hduser@EXAMPLE.COM
addprinc hdfs/masternode.example.com (replace the hostname with the FQDN of
each datanode)
addprinc mapred/masternode.example.com
addprinc yarn/masternode.example.com
addprinc HTTP/masternode.example.com
9.4. To authenticate via Kerberos with human interaction, you use the kinit
command to request tickets.
In Kerberos terminology, Hadoop services such as YARN and HDFS are referred
to as service principals.
For each service principal you create encrypted Kerberos keys, referred to as
keytabs.
These keytabs are required for passwordless communication and
authentication, in a similar way to how SSH keys are used.
To create keytabs you use the kadmin utility, so all keytab creation commands
are run from that shell.
To create a keytab, you specify the name of the file that will store the keytab
and the principal or principals that it will contain:
kadmin -p admin/admin
ktadd -k hdfs.keytab hdfs/masternode.example.com HTTP/masternode.example.com
ktadd -k mapreduce.keytab mapreduce/masternode.example.com HTTP/masternode.example.com
ktadd -k yarn.keytab yarn/masternode.example.com HTTP/masternode.example.com
sudo mkdir /usr/local/hadoop/etc/conf
sudo mv hdfs.keytab /usr/local/hadoop/etc/conf
sudo mv mapreduce.keytab /usr/local/hadoop/etc/conf
sudo mv yarn.keytab /usr/local/hadoop/etc/conf
9.5. To configure Hadoop for Kerberos security, add these lines to core-site.xml on
every machine in the cluster:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value> <!-- A value of "simple" would disable security. -->
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
9.6. Add these lines to hdfs-site.xml of every node in the cluster (Replace
_HOST with the hostname of the relevant node)
<!-- General HDFS security config -->
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- NameNode security config -->
<property>
<name>dfs.namenode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to the HDFS
keytab -->
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<!-- Secondary NameNode security config -->
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to the HDFS
keytab -->
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<!-- DataNode security config -->
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/usr/local/hadoop/etc/conf/hdfs.keytab</value> <!-- path to
the HDFS keytab -->
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
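Hadoop replaces the literal token _HOST in a Kerberos principal with the node's own fully qualified hostname when it reads these files, which is why the same hdfs-site.xml can be distributed to every node. A minimal sketch of that substitution (the FQDN below is an assumed example value):

```python
def resolve_principal(principal, fqdn):
    """Substitute the _HOST placeholder with this node's FQDN,
    mirroring what Hadoop does when loading *-site.xml."""
    return principal.replace("_HOST", fqdn)

print(resolve_principal("hdfs/_HOST@EXAMPLE.COM", "masternode.example.com"))
# hdfs/masternode.example.com@EXAMPLE.COM
```

This is also why each node still needs its own host-specific principal (hdfs/<fqdn>@EXAMPLE.COM) in the shared keytab, even though the configuration text is identical everywhere.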
9.7. Add these lines to yarn-site.xml on every node in the cluster (replace _HOST
with the hostname of the relevant node):
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/usr/local/hadoop/etc/conf/yarn.keytab</value>
</property>
<!-- remember the principal for the node manager is the principal for the host
this yarn-site.xml file is on -->
<!-- these (next four) need only be set on node manager nodes -->
<property>
<name>yarn.nodemanager.principal</name>
<value>yarn/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/usr/local/hadoop/etc/conf/yarn.keytab</value>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>yarn</value>
</property>
9.8. Add these lines to mapred-site.xml on every node in the cluster:
<!-- job history server secure configuration info -->
<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/usr/local/hadoop/etc/conf/mapreduce.keytab</value>
</property>
<property>
<name>mapreduce.jobhistory.principal</name>
<value>mapred/_HOST@EXAMPLE.COM</value>
</property>
Good luck.
Baskar
baskar910@gmail.com
Posted under a Creative Commons license