Contents

2. Serengeti Overview
    2.1 Serengeti
    2.2 Hadoop
4. Quick Start
    4.1 Set up the Serengeti CLI
5. Using Serengeti
    5.1 Manage Serengeti Users
    5.2.4 View Datastores
    5.3.4 Using a Distro
7.1 connect
7.2 cluster
    7.2.5 cluster limit
7.3 datastore
7.4 distro
7.5 disconnect
7.6 fs
    7.6.1 fs cat
    7.6.2 fs chgrp
    7.6.3 fs chmod
    7.6.4 fs chown
    7.6.5 fs copyFromLocal
    7.6.6 fs copyToLocal
    7.6.7 fs copyMergeToLocal
    7.6.8 fs count
    7.6.9 fs cp
    7.6.10 fs du
    7.6.11 fs expunge
    7.6.12 fs get
    7.6.13 fs ls
    7.6.14 fs mkdir
    7.6.15 fs moveFromLocal
    7.6.16 fs mv
    7.6.17 fs put
    7.6.18 fs rm
    7.6.19 fs setrep
    7.6.20 fs tail
    7.6.21 fs text
    7.6.22 fs touchz
7.7 hive
    7.7.1 hive cfg
    7.7.2 hive script
7.8 mr
    7.8.1 mr jar
    7.8.4 mr job history
7.9 network
    7.9.1 network add
    7.9.3 network list
7.10 pig
    7.10.1 pig cfg
7.11 resourcepool
    7.11.1 resourcepool add
7.12 topology
    7.12.1 topology upload
8. vSphere Settings
    8.1 vSphere Cluster Configuration
9. Appendix A: Create Local Yum Repository for MapR
    9.2 Create local yum repository
    9.3 Configure http proxy for the VMs created by Serengeti Server
10. Appendix B: Create Local Yum Repository for CDH4
    10.2 Create local yum repository
    10.3 Configure http proxy for the VMs created by Serengeti Server
2. Serengeti Overview
2.1 Serengeti
The Serengeti virtual appliance is a management service that you can use to deploy Hadoop clusters on
VMware vSphere systems. It is a one-click deployment toolkit that allows you to leverage the VMware
vSphere platform to deploy a highly available Hadoop cluster in minutes, including common Hadoop
components such as HDFS, MapReduce, Pig, and Hive on a virtual platform. Serengeti supports multiple
Hadoop 0.20 based distributions, CDH4 (except YARN), and MapR M5.
Hadoop configuration
Serengeti automatically adjusts Hadoop configurations according to the virtual machine specification. After creation, you can export the Hadoop cluster spec and tune the Hadoop configuration without impacting unrelated Hadoop nodes.
Serengeti provides both cluster-level and node-group-level configuration, so you can set different parameters for different node groups.
2.1.1.6 Data Compute Separation
Serengeti allows you to deploy a data and compute separated Hadoop cluster. You can specify the number of compute nodes for each data node, and place compute nodes and their related data node on the same physical host.
Serengeti also allows you to deploy a compute-only cluster, either to provide performance isolation between different MapReduce clusters or to consume an existing HDFS:
Deploy a Hadoop cluster with only JobTracker and TaskTracker roles to consume an existing Apache Hadoop 0.20-based HDFS.
Deploy a Hadoop cluster with only JobTracker and TaskTracker roles to consume a third-party HDFS.
Greenplum HD 1.2
Hortonworks HDP-1
CDH3
CDH4
MapR M5
You can add your preferred distribution to Serengeti and deploy Hadoop clusters accordingly.
2.2 Hadoop
Apache Hadoop is open source software for distributed storage and computing. Apache Hadoop includes HDFS and MapReduce: HDFS is a distributed file system, and MapReduce is a software framework for distributed data processing. You can find more information about Apache Hadoop at http://hadoop.apache.org/.
Ease of Provisioning: VMware virtualization encapsulates an application into an image that can
be duplicated or moved, greatly reducing the cost of application provisioning and deployment.
Manageability: Virtual machines may be moved from server to server with no downtime using
VMware vMotion, which simplifies common operations like hardware maintenance and reduces
planned downtime.
Availability: Unplanned downtime can be reduced and higher service levels can be provided to an
application. VMware High Availability (HA) ensures that in the case of an unplanned hardware
failure, any affected virtual machines are restarted on another host in a VMware cluster.
Software
  o SSH client
Network
  o DNS Server
Resource requirements
  o 300GB is required for your first Hadoop cluster. You can reduce the disk space requirements by specifying the storage size in a cluster specification.
  o Shared storage is required if you use HA or FT for the Hadoop master node.
Others
  o All ESXi hosts should have time synchronized using the Network Time Protocol (NTP).
OS
  o Windows
  o Linux
Software
  o Unzip tool
Network
  o Can access the Serengeti Management Server through HTTP in order to download the CLI package
5. Select a datastore.
6. Select a format for the virtual disks.
7. Map the networks used in the OVF template to the networks in your inventory.
8. Set the properties for this Serengeti deployment.
10. Click Next to deploy the virtual appliance. It takes several minutes to deploy.
After the Serengeti virtual appliance is deployed successfully, two virtual machines are installed in vSphere: one is the Serengeti Management Server virtual machine, and the other is the virtual machine template for Hadoop nodes.
11. Power on the Serengeti vApp and open the console of the Serengeti Management Server VM; you will see the initial OS login password for the root/serengeti user. After logging in to the VM, update the password with the command sudo /opt/serengeti/sbin/set-password -u, and the initial password will disappear from the welcome screen.
4. Quick Start
4.1 Set up the Serengeti CLI
The Serengeti command line shell can run on Windows or Linux. You need Java installed on the machine. You can download VMware-Serengeti-cli-0.8.0.0-<build number>.zip from the Serengeti Management Server (http://your-serengeti-server/cli).
Unzip the downloaded package to a directory. To run the Serengeti CLI, go to the cli directory inside it and enter java -jar serengeti*.jar.
Please refer to the troubleshooting document if you have any issues.
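For example, on Linux the setup takes only a few shell commands. This is a minimal sketch, assuming the zip was downloaded to the current directory; the target directory name is arbitrary:

$ unzip VMware-Serengeti-cli-0.8.0.0-<build number>.zip -d serengeti-cli
$ cd serengeti-cli/cli
$ java -jar serengeti*.jar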
To deploy a Hadoop cluster with the default settings, run the cluster create command and specify a cluster name:
serengeti>cluster create --name myHadoop
This command deploys a Hadoop cluster with one master node virtual machine, three worker node virtual machines, and one client node virtual machine. The master node virtual machine contains the NameNode and JobTracker services. The worker node virtual machines contain the DataNode and TaskTracker services. The client node virtual machine contains a Hadoop client environment, including the Hadoop client shell, Pig, and Hive.
After the deployment is complete, you can view the IP addresses of the Hadoop node virtual machines.
Hint
Use the tab key for auto-completion and to get help for commands and parameters.
By default, Serengeti may use any added resources to deploy a Hadoop cluster. To limit the scope of resources for the cluster, you can specify resource pools, datastores, or a network in the cluster create command:
serengeti>cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
In this example, myRP is the resource pool where the Hadoop cluster is deployed, myDS is the datastore where the virtual machine images are stored, and myNW is the network the virtual machines will use.
Hint
You can use the resourcepool list, datastore list, and network list commands to see which resources are in Serengeti.
Once you have a Hadoop cluster deployed, you can execute Hadoop commands directly in the CLI. In this section we describe how to copy files from the local file system to HDFS and then run a MapReduce job.
1. Start the Serengeti CLI and connect to Serengeti Management Server as described in section 4.1
2. Run the cluster list command to show all the available clusters
$serengeti>cluster list
3. Run the cluster target --name command to connect to the cluster you want to move data in or out of. The --name value is the name of the cluster you want to connect to.
$serengeti>cluster target --name cluster1
4. Run the fs put command to upload data to HDFS
$serengeti>fs put --from /etc/inittab --to /tmp/input/inittab
5. Run the fs get command to download data from HDFS
$serengeti>fs get --from /tmp/input/inittab --to /tmp/local-inittab
6. Run the mr jar command to run a MapReduce job
$serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"
7. Run the fs cat command to show the output of the MR job
$serengeti> fs cat /tmp/output/part-r-00000
8. Run the fs get command to download the output of the MR job
$serengeti> fs get --from /tmp/output/part-r-00000 --to /tmp/wordcount
Another way to use Hadoop is through the client VM. By default, Serengeti deploys a VM named client VM, with the Hadoop client, Pig, and Hive installed and the OS configured ready to use Hadoop. You can see the IP of the client VM after a cluster is deployed, or use the cluster list command to see the IP. Follow these steps to verify that the Hadoop cluster is working properly.
1. Use ssh to log in to the client VM.
Use "joe" as the user name. The password is "password".
2. Create your own home directory.
$ hadoop fs -mkdir /user/joe
3. Or run a sample Hadoop mapreduce job.
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 10000000
Feel free to submit other MapReduce, Pig, or Hive jobs as well.
5. Using Serengeti
5.1 Manage Serengeti Users
Spring Security in-memory authentication is used for Serengeti authentication and user management. You can modify the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file to manage Serengeti users, and then restart the Tomcat service using the command "sudo service tomcat restart".
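For reference, an in-memory user entry in that file typically looks like the following. This is a minimal sketch of the standard Spring Security user-service element; the user name, password, and authority shown are placeholders, not values from the shipped file:

<authentication-manager>
  <authentication-provider>
    <user-service>
      <!-- each <user> element defines one Serengeti login -->
      <user name="newuser" password="newpassword" authorities="ROLE_USER"/>
    </user-service>
  </authentication-provider>
</authentication-manager>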
CDH3
HDP1
1. Download the three packages (Hadoop/Pig/Hive) in tarball format from the distro vendor's site.
2. Upload them to the Serengeti Management Server virtual machine.
3. Put the packages in /opt/serengeti/www/distros/. The hierarchy should be DISTRO_NAME/VERSION_NUMBER/TARBALLS. For example, place the Apache Hadoop distro as shown below.
- apache/
- 1.0.1/
- hadoop-1.0.1.tar.gz
- hive-0.8.1.tar.gz
- pig-0.9.2.tar.gz
4. Edit /opt/serengeti/www/distros/manifest in the Serengeti Management Server virtual machine to add the mapping between Hadoop roles and the tarball packages of the distro, adding JSON text to the manifest file as in the following example:
{
"name" : "cdh",
"version" : "3u3",
"packages" : [
{
"roles" : ["hadoop_namenode", "hadoop_jobtracker",
"hadoop_tasktracker", "hadoop_datanode",
"hadoop_client"],
"tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
},
{
"roles" : ["hive"],
"tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
},
{
"roles" : ["pig"],
"tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
}
]
},
In this example, the CDH tarballs are put in the directory /opt/serengeti/www/distros/cdh/3u3.
Please note that if a distro supports HVE, add "hveSupported" : true, after the "version" line in the example above.
5. Restart the Tomcat server on the Serengeti Management Server so that it reads the new manifest file.
$ sudo service tomcat restart
If the commands are successful, issuing the command "distro list" in the Serengeti shell shows the distro that you added. Otherwise, make sure the JSON text in the manifest is correct.
5.3.2.2 Using yum repository to deploy Hadoop cluster
Serengeti uses yum repositories to deploy the following Hadoop distros:
CDH4
MapR M5
The two yum repo files (mapr-m5.repo and cloudera-cdh4.repo) point to the official yum repositories of MapR and CDH4 on the Internet. You can copy the sample file /opt/serengeti/www/distros/manifest.sample to /opt/serengeti/www/distros/manifest.
When you create a MapR or CDH4 cluster, the Hadoop nodes download rpm packages from the official MapR/CDH4 yum repository on the Internet.
If the VMs in clusters created by the Serengeti Management Server do not have access to the Internet, or the bandwidth to the Internet is low, we strongly suggest creating a local yum repository for MapR and CDH4. Please read Appendix A: Create Local Yum Repository for MapR and Appendix B: Create Local Yum Repository for CDH4 to create one.
2. Configure the local yum repository URL in the manifest file.
Once the local yum repository for MapR/CDH4 is created, open /opt/serengeti/www/distros/manifest and add the distro configuration (use the sample from the previous step and change the "package_repos" attribute to the URL of the local yum repository file).
3. Restart the Tomcat server on the Serengeti Management Server so that it reads the new manifest file.
$ sudo service tomcat restart
If the commands are successful, issuing the command "distro list" in the Serengeti shell shows the distro that you added. Otherwise, make sure the JSON text in the manifest is correct.
"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"instanceType": "SMALL"
}
]
}
In this example, you get one master virtual machine of MEDIUM size, five worker virtual machines of SMALL size, and one client virtual machine of SMALL size. You can also specify the number of CPUs, the RAM size, the disk size, and so on for each node group.
2. Specify the spec file when creating the cluster. You need to use the full path to the file.
serengeti>cluster create --name myHadoop --specFile /home/serengeti/mySpec.txt
CAUTION
Changing the roles of node groups might render the deployed Hadoop cluster unworkable.
Deploy a CDH4 Hadoop Cluster
You can create a default CDH4 Hadoop cluster by executing the following command in the Serengeti CLI:
serengeti>cluster create --name mycdh --distro cdh4
You can also create a customized CDH4 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mycdh --distro cdh4 --specFile /opt/serengeti/samples/default_cdh4_ha_hadoop_cluster.json
/opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json is a sample spec file for CDH4. You can make a copy of it and modify the parameters in the file before creating the cluster. In this example, nameservice0 and nameservice1 are federated with each other, and the name nodes inside the nameservice0 node group (with instanceNum set to 2) are HDFS2 HA enabled. In Serengeti, the name node group names become the name service names of HDFS2.
5.4.1.1.1 Deploy a MapR Hadoop Cluster
You can create a default MapR M5 Hadoop cluster by executing the following command in Serengeti CLI:
serengeti>cluster create --name mymapr --distro mapr
You can also create a customized MapR M5 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mymapr --distro mapr --specFile /opt/serengeti/samples/default_mapr_cluster.json
/opt/serengeti/samples/default_mapr_cluster.json is a sample spec file for MapR. You can make a copy of it and modify the parameters in the file before creating the cluster.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, four data nodes and eight compute nodes are created and placed in individual VMs. By default, Serengeti uses a round-robin algorithm to spread the VMs/nodes evenly across ESX hosts.
2. A data compute separated cluster, with instancePerHost constraint.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
"placementPolicies": {
"instancePerHost": 2
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, the data and compute node groups have placementPolicies constraints. After a successful provision, four data nodes and eight compute nodes are created and placed in individual VMs. With the instancePerHost=1 constraint, the four data nodes are placed on four ESX hosts. The eight compute nodes are also placed on those four ESX hosts, two nodes on each.
Note that it is not guaranteed that the two compute nodes stay collocated with the data node on each of the four ESX hosts. To ensure that this is the case, create a VM-VM affinity rule between each host's compute nodes and data node, or disable DRS on the compute nodes.
3. A data compute separated cluster, with instancePerHost and groupAssociations constraints for the compute node group and a groupRacks constraint for the data node group.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1,
"groupRacks": {
"type": "ROUNDROBIN",
"racks": ["rack1", "rack2", "rack3"]
}
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
"placementPolicies": {
"instancePerHost": 2,
"groupAssociations": [
{
"reference": "data",
"type": "STRICT"
}
]
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, after a successful provision, the four data nodes and eight compute nodes are placed on exactly the same four ESX hosts: each ESX host has one data node and two compute nodes, and the four ESX hosts are selected fairly from rack1, rack2, and rack3.
Here, as the compute node group definition says, the placement of compute nodes strictly follows the placement of the data nodes. That means compute nodes are only placed on ESX hosts that have data nodes.
For example:
{
"externalHDFS": "hdfs://hostname-of-namenode:8020",
"nodeGroups": [
{
"name": "master",
"roles": [
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "worker",
"roles": [
"hadoop_tasktracker",
],
"instanceNum": 4,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration" : {
}
}
In this example, the externalHDFS field points to an existing HDFS. You should also specify node groups with the hadoop_jobtracker and hadoop_tasktracker roles. Note that the externalHDFS field conflicts with node groups that have the hadoop_namenode or hadoop_datanode role. The sample cluster spec can also be found in samples/compute_only_cluster.json in the Serengeti CLI directory.
2. Specify the spec file when creating the cluster. You need to use the full path to the file.
serengeti>cluster create --name computeOnlyCluster --specFile /home/serengeti/coSpec.txt
{
"name": "group_name",
"placementPolicies": {
"instancePerHost": 2,
"groupRacks": {
"type": "ROUNDROBIN",
"racks": ["rack1", "rack2", "rack3"]
},
"groupAssociations": [{
"reference": "another_group_name",
"type": "STRICT" // or "WEAK"
}]
}
}
As this example shows, the placementPolicies field contains three optional items: instancePerHost, groupRacks, and groupAssociations.
As the name implies, instancePerHost indicates how many VM nodes or instances should be placed on each physical ESX host; this constraint is aimed at balancing the workload.
The groupRacks item controls how VM nodes are placed across the racks you specify. In this example, the rack type equals ROUNDROBIN, and the racks item indicates which racks in the topology map (refer to chapter 5.8 to see how to configure topology map information and make a Hadoop cluster rack aware) are used for this placement policy. If the racks item is omitted, Serengeti uses all racks in the topology map. ROUNDROBIN here means the candidates are selected fairly when determining which rack each node goes to.
On the other hand, if you specify both instancePerHost and groupRacks in a placement policy, you should make sure the number of available hosts is sufficient. You can get the rack-host information by using the command topology list.
groupAssociations means the node group has associations with target node groups; each association has reference and type fields. The reference field is the name of a target node group, and type can be STRICT or WEAK. STRICT means the node group must be placed on the same set, or a subset, of the ESX hosts used by the target group, while WEAK means the node group tries to be placed on the same set or subset of ESX hosts as the target group, but with no guarantee.
A typical scenario for applying groupRacks and groupAssociations is deploying a Hadoop cluster with data and compute nodes separated. In this case, you might want to put compute nodes and data nodes on the same set of physical hosts for better performance, especially throughput. Refer to section 5.3.3 for practical examples of deploying a Hadoop cluster with placement policies.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"instanceType": "LARGE",
"cpuNum": 2,
"memCapacityMB": 7500,
"haFlag": "on"
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "TEMPFS"
},
"placementPolicies": {
"instancePerHost": 2,
"groupAssociations": [
{
"reference": "data",
"type": "STRICT"
}
]
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
]
}
In this example, the cluster is data-compute separated, and the compute nodes are strictly associated with the data nodes. By setting the storage type of the compute node group to TEMPFS, Serengeti installs an NFS server on the associated data nodes, installs an NFS client on the compute nodes, and mounts the data nodes' disks on the compute nodes. Serengeti does not assign disks to the compute nodes, and all temporary files generated while running MapReduce jobs are saved on the NFS disks.
"configuration": {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes; here is a sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": "",
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG, DRFA ",
// "hadoop.security.logger": "DEBUG, DRFA ",
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
}
}
}
Serengeti provides a tool to convert the Hadoop configuration files of your existing cluster into the above JSON format, so you don't need to write this JSON file manually. Please read the section Tool for converting Hadoop Configuration.
Some Hadoop distributions have their own Java jar files which are not put in $HADOOP_HOME/lib, so by default the Hadoop daemons can't find them. In order to use these jars, you need to add a cluster configuration that includes the full path of the jar files in $HADOOP_CLASSPATH.
Here is a sample cluster configuration that configures a Cloudera CDH3 Hadoop cluster with the Fair Scheduler (the Fair Scheduler jar files are placed in /usr/lib/hadoop/contrib/fairscheduler/):
"configuration": {
"hadoop": {
"hadoop-env.sh": {
"HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
},
"mapred-site.xml": {
"mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
},
"fair-scheduler.xml": {
}
}
}
hdfs-site.xml
mapred-site.xml
hadoop-env.sh
  o JAVA_HOME
  o PATH
  o HADOOP_CLASSPATH
  o HADOOP_HEAPSIZE
  o HADOOP_NAMENODE_OPTS
  o HADOOP_DATANODE_OPTS
  o HADOOP_SECONDARYNAMENODE_OPTS
  o HADOOP_JOBTRACKER_OPTS
  o HADOOP_TASKTRACKER_OPTS
  o HADOOP_LOG_DIR
log4j.properties
  o hadoop.root.logger
  o hadoop.security.logger
  o log4j.appender.DRFA.MaxBackupIndex
  o log4j.appender.RFA.MaxBackupIndex
  o log4j.appender.RFA.MaxFileSize
fair-scheduler.xml
  o text
capacity-scheduler.xml
  o all attributes described on http://hadoop.apache.org/docs/stable/capacity_scheduler.html

net.topology.impl
net.topology.nodegroup.aware
dfs.block.replicator.classname

hdfs-site.xml
  o dfs.http.address
  o dfs.name.dir
  o dfs.data.dir
  o topology.script.file.name
mapred-site.xml
  o mapred.job.tracker
  o mapred.local.dir
  o mapred.task.cache.levels
  o mapred.jobtracker.jobSchedulable
  o mapred.jobtracker.nodegroup.awareness
hadoop-env.sh
  o HADOOP_HOME
  o HADOOP_COMMON_HOME
  o HADOOP_MAPRED_HOME
  o HADOOP_HDFS_HOME
  o HADOOP_CONF_DIR
  o HADOOP_PID_DIR
log4j.properties
  o None
fair-scheduler.xml
  o None
capacity-scheduler.xml
  o None
mapred-queue-acls.xml
  o None
3) Open the cluster spec file and replace the cluster-level configuration or group-level configuration with the content printed out in step 2.
4) Execute cluster config --name <cluster_name> --specFile <spec_file_path> to apply the new configuration to the existing cluster, or execute cluster create --name <cluster_name> --specFile <spec_file_path> to create a new cluster with your configuration.
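For example, assuming the modified spec was saved to /home/serengeti/mySpec.txt as in the earlier example:

serengeti>cluster config --name myHadoop --specFile /home/serengeti/mySpec.txt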
5.4.2.2 Scale Out a Hadoop Cluster
You can scale out to more Hadoop worker nodes or client nodes after a Hadoop cluster is provisioned. In the following example, the number of instances in the worker node group of the myHadoop cluster is increased to 10.
serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
You cannot set a number smaller than the current instance number in this version of the Serengeti virtual appliance.
5.4.2.3 Scale TaskTracker Nodes Rapidly
You can change the number of active TaskTracker nodes rapidly in a running Hadoop cluster or node
group. The selection of TaskTrackers to be enabled or disabled is done with the goal of balancing the
number of TaskTrackers enabled per host in the specified Hadoop cluster or node group.
In this example, the number of active TaskTracker nodes in worker node group in myHadoop cluster is
set to 8:
serengeti>cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
If fewer than 8 TaskTracker nodes were running in the worker node group of myHadoop cluster,
additional TaskTracker nodes are enabled (re-commissioned and powered-on), up to the number
provisioned in the worker node group. If more than 8 TaskTrackers were running in the worker node
group, excess TaskTracker nodes are disabled (decommissioned and powered-off). No action is
performed if the number of active TaskTrackers already equals 8.
If the node group is not specified, the TaskTracker nodes are enabled/disabled such that the total number
of active TaskTrackers is 8 across all the compute node groups in the myHadoop cluster:
serengeti>cluster limit --name myHadoop --activeComputeNodeNum 8
To enable all the TaskTrackers in the myHadoop cluster, use the cluster unlimit command:
serengeti>cluster unlimit --name myHadoop
This command is especially useful to fix any potential mismatch between the number of active TaskTrackers as seen by Hadoop and the number of powered-on TaskTracker nodes as seen by vCenter.
To enable all TaskTrackers within only one compute node group, specify the name of the node group
using the --nodeGroup option, similar to the cluster limit command.
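For example, to re-enable all provisioned TaskTrackers in just the worker node group of the myHadoop cluster:

serengeti>cluster unlimit --name myHadoop --nodeGroup worker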
5.4.2.4 Start/Stop Hadoop Cluster
In the Serengeti shell, you can start (or stop) a whole Hadoop cluster:
serengeti>cluster start --name mycluster
5.4.2.5 View Hadoop Clusters Deployed by Serengeti
In the Serengeti shell, you can list Hadoop clusters deployed by Serengeti.
serengeti>cluster list
Make sure you have chosen a cluster as target first in Serengeti CLI. See Chapter 7.2.10.
"", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/test_hive_server.txt";
sql = "load data local inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
38
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(res.getString(1));
}
}
}
3. Running the JDBC Sample Code
a. Compile and run the client on the command line (with the Hive and Hadoop jars on your classpath):
$ javac HiveJdbcClient.java
$ java -cp $CLASSPATH HiveJdbcClient
b. Alternatively, you can run the following bash script, which will seed the data file and build your
classpath before invoking the client.
#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt
HADOOP_CORE=`ls $HADOOP_HOME/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
CLASSPATH=$CLASSPATH:$jar_file_name
done
java -cp $CLASSPATH HiveJdbcClient
}
},
{
"name" : "hbasemaster",
"roles" : [
"hbase_master"
],
"instanceNum" : 1,
"instanceType" : "MEDIUM",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 2,
"memCapacityMB" : 7500,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "worker",
"roles" : [
"hadoop_datanode",
"hadoop_tasktracker",
"hbase_regionserver"
],
"instanceNum" : 3,
"instanceType" : "SMALL",
"storage" : {
"type" : "local",
"sizeGB" : 50
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "off",
"configuration" : {
}
},
{
"name" : "client",
"roles" : [
"hadoop_client",
"hbase_client"
],
"instanceNum" : 1,
"instanceType" : "SMALL",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "off",
"configuration" : {
}
}
],
// we suggest running convert-hadoop-conf.rb to generate the "configuration" section and paste the output here
"configuration" : {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes; here is a sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": ""
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG,DRFA",
// "hadoop.security.logger": "DEBUG,DRFA"
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
},
"mapred-queue-acls.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
// "mapred.queue.queue-name.acl-submit-job": "",
// "mapred.queue.queue-name.acl-administer-jobs": ""
}
},
"hbase": {
"hbase-site.xml": {
// check for all settings at http://hbase.apache.org/configuration.html#hbase.site
},
"hbase-env.sh": {
// "JAVA_HOME": "",
// "PATH": "",
// "HBASE_CLASSPATH": "",
// "HBASE_HEAPSIZE": "",
// "HBASE_OPTS": "",
// "HBASE_USE_GC_LOGFILE": "",
// "HBASE_JMX_BASE": "",
// "HBASE_MASTER_OPTS": "",
// "HBASE_REGIONSERVER_OPTS": "",
// "HBASE_THRIFT_OPTS": "",
// "HBASE_ZOOKEEPER_OPTS": "",
// "HBASE_REGIONSERVERS": "",
// "HBASE_SSH_OPTS": "",
// "HBASE_NICENESS": "",
// "HBASE_SLAVE_SLEEP": ""
},
"log4j.properties": {
// "hbase.root.logger": "DEBUG,DRFA"
}
},
"zookeeper": {
"java.env": {
// "JVMFLAGS": "-Xmx2g"
},
"log4j.properties": {
// "zookeeper.root.logger": "DEBUG,DRFA"
}
}
}
}
In this example, the cluster has JobTracker and TaskTracker roles in addition to the template mentioned in section 4.4, which means you can launch HBase MapReduce jobs. It separates the Hadoop NameNode and HBase Master roles, and the two HBase Master instances are protected by HBase's internal HA function.
3. The RESTful web service is running on the client node, listening on port 8080:
>curl -I http://<client_node_ip>:8080/status/cluster
4. The Thrift gateway is also enabled, listening on port 9090.
HVE stands for Hadoop Virtualization Extensions. HVE refines Hadoop's replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on virtualized infrastructure have full awareness of the topology on which they are running. Thus, the reliability and performance of these clusters are enhanced. For more information about HVE, you can refer to https://issues.apache.org/jira/browse/HADOOP-8468.
RACK_AS_RACK stands for the standard topology in existing Hadoop 1.0.x, where only rack and host
information are exposed to Hadoop.
HOST_AS_RACK is a simplified version of RACK_AS_RACK for when all the physical hosts for Serengeti are on a single rack. In this case, each physical host is treated as a rack, to avoid all HDFS data replicas being placed on a single physical host in worst cases.
HVE is the recommended topology in Serengeti if a distro supports HVE. Otherwise, we recommend using the RACK_AS_RACK topology in multi-rack environments. HOST_AS_RACK is used only when a single rack exists for Serengeti or when no rack information is available at all.
In addition, when you decide to enable HVE or RACK_AS_RACK, you need to upload the rack and physical host information to Serengeti with the CLI command below before you create a topology-aware cluster.
serengeti>topology upload --fileName name_of_rack_hosts_mapping_file
Here is a sample of the rack and physical hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
In this sample, physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.
After a cluster is created with the selected topology option, you can view the allocated nodes on each
rack with:
serengeti>cluster list --name cluster-name --detail
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
10
"instanceType": "LARGE",
11
"cpuNum": 2,
12
"memCapacityMB":4096,
13
"storage": {
14
"type": "SHARED",
15
"sizeGB": 20
16
},
17
"haFlag":"on",
18
"rpNames": [
19
"rp1"
20
21
},
22
23
"name": "data",
24
"roles": [
25
"hadoop_datanode"
26
],
27
"instanceNum": 3,
28
"instanceType": "MEDIUM",
29
"cpuNum": 2,
30
"memCapacityMB":2048,
46
31
"storage": {
32
"type": "LOCAL",
33
"sizeGB": 50
34
35
"placementPolicies": {
36
"instancePerHost": 1,
37
"groupRacks": {
38
"type": "ROUNDROBIN",
39
40
41
42
},
43
44
"name": "compute",
45
"roles": [
46
"hadoop_tasktracker"
47
],
48
"instanceNum": 6,
49
"instanceType": "SMALL",
50
"cpuNum": 2,
51
"memCapacityMB":2048,
52
"storage": {
53
"type": "LOCAL",
54
"sizeGB": 10
55
56
"placementPolicies": {
57
"instancePerHost": 2,
58
"groupAssociations": [{
59
"reference": "data",
60
"type": "STRICT"
61
}]
62
63
},
64
65
"name": "client",
47
66
"roles": [
67
"hadoop_client",
68
"hive",
69
"hive_server",
70
"pig"
71
],
72
"instanceNum": 1,
73
"instanceType": "SMALL",
74
"memCapacityMB": 2048,
75
"storage": {
76
"type": "LOCAL",
77
"sizeGB": 10,
78
79
80
81 ],
82 "configuration": {
83 }
84 }
The spec defines four node groups.
Lines 3 to 21 are an object that defines the master node group. The attributes are as follows.
Line 4 defines the name of the node group. The attribute name is name; the value is master.
Lines 5 to 8 define the roles of the node group. The attribute name is roles; the values are hadoop_namenode and hadoop_jobtracker. It means hadoop_namenode and hadoop_jobtracker will be deployed to the virtual machine in the group.
Line 9 defines the number of instances in the node group. The attribute name is instanceNum; the value is 1. It means there'll be only one virtual machine created for the group.
You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive, but you can have only one instance for hadoop_namenode and hadoop_jobtracker.
Line 10 defines the instance type of the node group. The attribute name is instanceType; the value is LARGE. The instance types are predefined virtual machine specs: combinations of number of CPUs, RAM size, and storage size. The predefined numbers can be overridden by the cpuNum, memCapacityMB, and storage attributes specified in the file.
Line 11 defines the number of CPUs per virtual machine. The attribute name is cpuNum; the value is 2. It overrides the number of CPUs in the predefined virtual machine spec.
Line 12 defines the RAM size per virtual machine. The attribute name is memCapacityMB; the value is 4096. It overrides the RAM size of the predefined virtual machine spec.
Lines 13 to 16 define the storage requirement of the node group. It's an object named storage.
  o Line 14 defines the storage type. It's an attribute of the storage object named type; the value is SHARED. It means Hadoop data must be stored on shared storage.
  o Line 15 defines the storage size. It's an attribute of the storage object named sizeGB; the value is 20. It means there'll be a 20GB disk for Hadoop to use.
Line 17 defines whether HA applies to the node. The attribute name is haFlag; the value is on. It means the virtual machine in the group is protected by vSphere HA.
Lines 18 to 20 define the resource pools which the node group must be associated with. The attribute name is rpNames; the value is an array, which contains one resource pool, rp1.
You can see the same structure for the other three node groups. In addition, for the data and compute groups, we specify a pair of placement constraints:
Lines 35 to 41 define the placement constraints for the data node group. The attribute name is placementPolicies and the value is a hash which contains instancePerHost and groupRacks. The constraint means you need at least three ESX hosts, because this group requires three instances and forces putting one instance on each host; furthermore, this group is provisioned on hosts in rack1, rack2, and rack3 using the ROUNDROBIN algorithm.
Lines 56 to 62 define the placement constraints for the compute node group, which contains instancePerHost and groupAssociations. The constraint means you also need at least three ESX hosts, for the same reason, and this group is STRICT associated with the data node group for better performance.
You can customize the Hadoop configuration with the configuration attribute on lines 82 to 83, which happens to be empty in this sample.
You can modify the values of the attributes, and you can also remove optional attributes you don't care about.
Following is the definition of the outermost attributes in a cluster spec:
Attribute      Type    Mandatory/Optional
nodeGroups     object  Mandatory
configuration  object  Optional
externalHDFS   string  Optional
Following is the definition of the objects and attributes for a particular node group.
Attribute          Type            Mandatory/Optional  Description
name               string          Mandatory
roles              list of string  Mandatory
instanceNum        integer         Mandatory
instanceType       string          Optional
cpuNum             integer         Optional
memCapacityMB      integer         Optional
storage            object          Optional            Storage settings
  type             string          Optional
  sizeGB           integer         Optional
  dsNames          list of string  Optional
rpNames            list of string  Optional
haFlag             string          Optional
placementPolicies  object          Optional
Instance type  RAM     Storage
SMALL          3.75GB  25GB / 50GB / 50GB
MEDIUM         7.5GB   50GB / 100GB / 100GB
LARGE          15GB    100GB / 200GB / 200GB
EXTRA_LARGE    30GB    200GB / 400GB / 400GB

When creating virtual machines, Serengeti tries to allocate datastores of the preferred type: SHARED storage is preferred for masters and clients, and LOCAL storage is preferred for workers. Separate disks are created for the OS and swap.
7.1 connect
Connect and log in to a remote Serengeti server.
Parameter   Mandatory/Optional  Description
--host      Mandatory
--username  Optional
--password  Optional
The command reads the username and password in interactive mode. Section 5.1 describes how to manage Serengeti users.
If the connection fails, or if you do not run the connect command, no other Serengeti command is allowed to execute.
7.2 cluster
7.2.1 cluster config
Modify Hadoop configuration of an existing default or customized Hadoop cluster in Serengeti.
Parameter   Mandatory/Optional  Description
--name      Mandatory
--specFile  Optional
--yes       Optional
7.2.2 cluster create
Create a Hadoop cluster in Serengeti.
Parameter                        Mandatory/Optional  Description
--name                           Mandatory
--distro                         Optional
--specFile                       Optional
--dsNames <datastore names>      Optional
--networkName <network name>     Optional
--rpNames <resource pool name>   Optional
--resume                         Optional
--topology <topology type>       Optional
--yes                            Optional
--skipConfigValidation           Optional
If the cluster spec does not include required nodes, for example a master node, Serengeti generates them with a default configuration.
Parameter  Mandatory/Optional  Description
--name     Mandatory
--output   Optional
7.2.5 cluster limit
Parameter                        Mandatory/Optional  Description
--name <cluster_name>            Mandatory
--nodeGroup <node_group_name>    Optional
--activeComputeNodeNum <number>  Mandatory
Parameter                           Mandatory/Optional  Description
--name <cluster name in Serengeti>  Optional
--detail                            Optional
For example:
Parameter                             Mandatory/Optional  Description
--name                                Mandatory
--nodeGroup <name of the node group>  Mandatory
--instanceNum <instance number>       Mandatory
Example:
cluster resize --name foo --nodeGroup slave --instanceNum 10
Parameter  Mandatory/Optional  Description
--info     Optional
Parameter                      Mandatory/Optional  Description
--name <cluster_name>          Mandatory
--nodeGroup <node_group_name>  Optional
7.3 datastore
7.3.1 datastore add
Add a datastore to Serengeti for deployment.
Parameter                             Mandatory/Optional  Description
--name <datastore name in Serengeti>  Mandatory
--spec <datastore name in VCenter>    Mandatory
7.3.2 datastore delete
Delete a datastore from Serengeti.
Parameter  Mandatory/Optional  Description
--name     Mandatory
7.3.3 datastore list
Parameter  Mandatory/Optional  Description
--name     Optional
--detail   Optional
All datastores that are added to Serengeti are listed if the name is not specified.
For example:
7.4 distro
7.4.1 distro list
Show the roles offered in a distro.
Parameter
Mandatory/Optional Description
For example:
7.5 disconnect
Disconnect and log out from the remote Serengeti server. After disconnecting, the user is not allowed to run any CLI commands.
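For example, to end the current session:

serengeti>disconnect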
7.6 fs
7.6.1 fs cat
Copy source paths to stdout.
Parameter Mandatory/Optional Description
<file name> Mandatory
7.6.2 fs chgrp
Change group association of files.
Parameter               Mandatory/Optional  Description
--recursive true|false  Optional
<file name>             Mandatory
7.6.3 fs chmod
Change the permissions of files.
Parameter               Mandatory/Optional  Description
--recursive true|false  Optional
<file name>             Mandatory
7.6.4 fs chown
Change the owner of files.
Parameter    Mandatory/Optional  Description
<file name>  Mandatory
7.6.5 fs copyFromLocal
Copy a single source file, or multiple source files, from the local file system to the destination file system. It is the same as put.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.6 fs copyToLocal
Copy files to the local file system. It is the same as get.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.7 fs copyMergeToLocal
Takes a source directory and a destination file as input and concatenates the files in the HDFS directory
into the local file system.
Parameter               Mandatory/Optional  Description
--from                  Mandatory
--to                    Mandatory
--endline <true|false>  Optional
7.6.8 fs count
Count the number of directories, files, bytes, quota, and remaining quota.
Parameter             Mandatory/Optional  Description
<path name>           Mandatory
--quota <true|false>  Optional
7.6.9 fs cp
Copy files from source to destination. This command allows multiple sources as well in which case the
destination must be a directory.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.10 fs du
Displays the sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Parameter Mandatory/Optional Description
<file name> Mandatory
7.6.11 fs expunge
Empty the trash bin in the HDFS.
7.6.12 fs get
Copy files to the local file system.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.13 fs ls
List files in the directory.
Parameter                 Mandatory/Optional  Description
<path name>               Mandatory
--recursive <true|false>  Optional
7.6.14 fs mkdir
Create a new directory.
Parameter   Mandatory/Optional  Description
<dir name>  Mandatory
7.6.15 fs moveFromLocal
Similar to the put command, except that the local source file is deleted after it is copied.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.16 fs mv
Move source files to destination in the HDFS.
Parameter                Mandatory/Optional  Description
--from                   Mandatory           One or more source files, such as /path/file1 /path/file2.
--to <target file path>  Mandatory
7.6.17 fs put
Copy a single source, or multiple sources, from the local file system to HDFS.
Parameter  Mandatory/Optional  Description
--from     Mandatory
--to       Mandatory
7.6.18 fs rm
Remove files in the HDFS.
Parameter                 Mandatory/Optional  Description
<file name>               Mandatory
--recursive <true|false>  Optional
--skipTrash <true|false>  Optional            Bypass trash.
7.6.19 fs setrep
Change the replication factor of a file.
Parameter                 Mandatory/Optional  Description
<file path>               Mandatory
<replica number>          Mandatory           Number of replicas.
--recursive <true|false>  Optional
--waiting <true|false>    Optional
7.6.20 fs tail
Display last kilobyte of the file to stdout.
Parameter            Mandatory/Optional  Description
<file path>          Mandatory
--file <true|false>  Optional
7.6.21 fs text
Take a source file and output the file in text format.
Parameter Mandatory/Optional Description
<file path> Mandatory
7.6.22 fs touchz
Create a file of zero length.
Parameter Mandatory/Optional Description
<file path> Mandatory
7.7 hive
7.7.1 hive cfg
Configure Hive.
Parameter  Mandatory/Optional  Description
--timeout  Optional
7.7.2 hive script
Run a Hive script.
Parameter  Mandatory/Optional  Description
7.8 mr
7.8.1 mr jar
Run a MapReduce job located inside the provided jar.
Parameter     Mandatory/Optional  Description
--jarfile     Mandatory
--mainclass   Mandatory
--args <arg>  Optional
Mandatory/Optional Description
Mandatory
Mandatory
Mandatory
Mandatory/Optional Description
Mandatory
Mandatory
Mandatory
Parameter           Mandatory/Optional  Description
--all <true|false>  Optional            Whether to list all jobs.
Parameter                                        Mandatory/Optional  Description
--jobid <jobid>                                  Mandatory
--priority <VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW>  Mandatory

Parameter        Mandatory/Optional  Description
--jobid <jobid>  Mandatory

Parameter            Mandatory/Optional  Description
--jobfile <jobfile>  Mandatory
<configuration>
<property>
<name>mapred.input.dir</name>
<value>/user/hadoop/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/hadoop/output</value>
</property>
<property>
<name>mapred.job.name</name>
<value>wordcount</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.WordCount.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount.IntSumReducer</value>
</property>
</configuration>
Parameter          Mandatory/Optional  Description
--taskid <taskid>  Mandatory

Parameter          Mandatory/Optional  Description
--taskid <taskid>  Mandatory
7.9 network
7.9.1 network add
Add a network to Serengeti.
Parameter                       Mandatory/Optional  Description
--name                          Mandatory
--portGroup                     Mandatory
--dhcp                          Combination 1
--ip, --dns, --gateway, --mask  Combination 2
For example:
>network add --name ipNetwork --ip 192.168.1.1-100,192.168.1.120-180 --portGroup pg1 --dns 202.112.0.1 --gateway 192.168.1.254 --mask 255.255.255.0
>network add --name dhcpNetwork --dhcp --portGroup pg1
Parameter  Mandatory/Optional  Description
--detail   Optional
For example:
7.10 pig
7.10.1 pig cfg
Configure Pig.
Parameter                Mandatory/Optional  Description
--props                  Optional
--jobName                Optional
--jobPriority            Optional
--jobTracker             Optional
--execType               Optional
--validateEachStatement  Optional
7.10.2 pig script
Run a Pig script.
Parameter  Mandatory/Optional  Description
7.11 resourcepool
7.11.1 resourcepool add
Add a resource pool in vSphere to Serengeti.
Parameter                       Mandatory/Optional  Description
--name                          Mandatory
--vcrp <vSphere resource pool>  Mandatory
Parameter  Mandatory/Optional  Description
--name     Mandatory
Parameter  Mandatory/Optional  Description
--name     Optional
--detail   Optional
All resource pools that are added to Serengeti are listed if a name is not specified. For each resource pool, NAME and PATH are listed: NAME is the name in Serengeti; PATH is the combination of the vSphere cluster name and the resource pool name, separated by /.
For example:
7.12 topology
7.12.1 topology upload
Upload a rack-hosts mapping topology file to Serengeti. The newly uploaded file overwrites the existing file. The accepted file format is, per line: rackname: hostname1, hostname2, where hostname1 and hostname2 are host names as displayed in vSphere.
Parameter   Mandatory/Optional  Description
--fileName  Mandatory
--yes       Optional
8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the Datacenter and select "New Cluster...". Follow the New Cluster Wizard, using the following settings:
Enable Admission Control and set the desired policy. (The default policy is to tolerate 1 host failure.)
The Management Network (VMkernel Port) has vMotion and "Fault Tolerance Logging" enabled.
Virtual machine disks are thick provisioned, without snapshots, and located on shared storage.
In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the virtual machine and select Fault Tolerance, Turn On Fault Tolerance.
Either a vSwitch or a vSphere Distributed Switch (vDS) can be used to provide the Port Group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the Port Group to be configured manually.
First open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user
sudo su
export http_proxy=http://<proxy_server:port>
Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to ensure the default test page of the Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo
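Next, mirror the remote repositories with reposync. This is a sketch under the assumption that the MapR yum repositories are configured on this machine with repo ids maprtech and maprecosystem (reposync names its output folders after the repo ids):

reposync -r maprtech
reposync -r maprecosystem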
Downloading all the RPMs in the remote repository takes several minutes (depending on the network bandwidth); the RPMs are put in new folders named maprtech and maprecosystem.
9.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ by default for Apache; if you use the Serengeti Management Server to set up the yum server, the folder is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/mapr/2
mv maprtech/ maprecosystem/ $doc_root/mapr/2/
cd $doc_root/mapr/2
createrepo .
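The repository is then typically exposed to cluster nodes through a mapr-m5.repo file placed under the document root. The following is a minimal sketch; the section name, baseurl layout, and flags are illustrative assumptions rather than the exact shipped file:

[mapr-m5]
name=MapR local yum repository
baseurl=http://<ip_of_webserver>/mapr/2/maprtech/
enabled=1
gpgcheck=0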
Please replace <ip_of_webserver> with the IP address of the web server. Ensure you can download http://<ip_of_webserver>/mapr/2/mapr-m5.repo from another machine.
9.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and only applies if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.
First open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user
sudo su
export http_proxy=http://<proxy_server:port>
Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to ensure the default test page of the Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo
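Next, mirror the remote repository with reposync; a sketch, assuming the Cloudera CDH4 yum repository is configured on this machine with repo id cloudera-cdh4 (reposync names its output folder after the repo id):

reposync -r cloudera-cdh4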
Downloading all the RPMs in the remote repository takes several minutes (depending on the network bandwidth); the RPMs are put in a new folder named cloudera-cdh4.
10.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ by default for Apache; if you use the Serengeti Management Server to set up the yum server, the folder is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/cdh/4/
mv cloudera-cdh4/RPMS $doc_root/cdh/4/
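After moving the RPMs, index them with createrepo (as in section 9.2) and publish a cloudera-cdh4.repo file under the document root. The repo file below is a minimal sketch; the section name, baseurl layout, and flags are illustrative assumptions:

cd $doc_root/cdh/4
createrepo .

[cloudera-cdh4]
name=Cloudera CDH4 local yum repository
baseurl=http://<ip_of_webserver>/cdh/4/RPMS/
enabled=1
gpgcheck=0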
Please replace <ip_of_webserver> with the IP address of the web server. Ensure you can download http://<ip_of_webserver>/cdh/4/cloudera-cdh4.repo from another machine.
10.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and only applies if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.