Serengeti User Guide - 0.8

VMware, Inc.
Serengeti Users Guide

Serengeti 0.8
Contents
1.
Serengeti Users Guide ......................................................................................................................................6

1.1
2.
Serengeti Overview ............................................................................................................................................6

2.1
3.
4.
5.
Intended Audience ........................................................................................................................................6

Serengeti ........................................................................................................................................................6
2.1.1
Serengeti Features ................................................................................................................................ 6
2.1.2
Serengeti Architecture Overview.........................................................................................................8
2.2
Hadoop ...........................................................................................................................................................8
2.3
VMware Virtual Infrastructure .....................................................................................................................9
2.4
Serengeti Virtual Appliance Requirements ............................................................................................... 9
2.5
Serengeti CLI Requirements .......................................................................................................................9
Installing the Serengeti Virtual Appliance ...................................................................................................... 10

3.1
Download ..................................................................................................................................................... 10
3.2
Deploy Serengeti ........................................................................................................................................ 10
Quick Start ......................................................................................................................................................... 13

4.1
Set up the Serengeti CLI ........................................................................................................................... 13
4.2
Deploy a Hadoop Cluster .......................................................................................................................... 13
4.3
Deploy a HBase Cluster ............................................................................................................................ 15
Using Serengeti ................................................................................................................................................. 15

5.1
Manage Serengeti Users ........................................................................................................................... 15
5.1.1
Add/Delete a User in Serengeti ......................................................................................................... 15
5.1.2
Modify User Password ........................................................................................................................ 16
5.2
Manage Resources in Serengeti .............................................................................................................. 16
5.2.1
Add a Datastore ................................................................................................................................... 16
5.2.2
Add a Network ..................................................................................................................................... 16
5.2.3
Add a Resource Pool .......................................................................................................................... 17
5.2.4
View Datastores................................................................................................................................... 17
5.2.5
View Networks ..................................................................................................................................... 17
5.2.6
View Resource Pools .......................................................................................................................... 17
5.2.7
Remove a Datastore ........................................................................................................................... 18
5.2.8
Remove a Network .............................................................................................................................. 18
5.2.9
Remove a Resource Pool .................................................................................................................. 18
5.3
Manage Distros ........................................................................................................................................... 18
5.3.1
Supported Distros ................................................................................................................................ 18
5.3.2
Add a Distro to Serengeti ................................................................................................................... 18
5.3.3
List Distros ............................................................................................................................................ 21
5.3.4
Using a Distro....................................................................................................................................... 21
5.4
Hadoop Clusters ......................................................................................................................................... 21
5.4.1
Deploy Hadoop Clusters .................................................................................................................... 21

2
5.4.2
Manage Hadoop Clusters .................................................................................................................. 30
5.4.3
Use Hadoop Clusters .......................................................................................................................... 36
5.5
HBase Clusters ........................................................................................................................................... 40
5.5.1
Deploy HBase Clusters ...................................................................................................................... 40
5.5.2
Manage HBase Clusters .................................................................................................................... 43
5.5.3
Use HBase Clusters ............................................................................................................................ 43
5.6
Monitoring Cluster Deployed by Serengeti ............................................................................................. 44
5.7
Make Hadoop Master Node HA/FT .......................................................................................................... 44
5.8
Hadoop Topology Awareness ................................................................................................................... 45
5.9
Start and Stop Serengeti Services ........................................................................................................... 45
6.
Cluster Specification Reference ..................................................................................................................... 46
7.
Serengeti Command Reference ..................................................................................................................... 51

7.1
connect ......................................................................................................................................................... 51
7.2
cluster ........................................................................................................................................................... 51
7.2.1
cluster config ........................................................................................................................................ 51
7.2.2
cluster create ........................................................................................................................................ 52
7.2.3
cluster delete ........................................................................................................................................ 53
7.2.4
cluster export ........................................................................................................................................ 53
7.2.5
cluster limit............................................................................................................................................ 53
7.2.6
cluster list .............................................................................................................................................. 54
7.2.7
cluster resize ........................................................................................................................................ 55
7.2.8
cluster start ........................................................................................................................................... 55
7.2.9
cluster stop ........................................................................................................................................... 56
7.2.10
cluster target ...................................................................................................................................... 56
7.2.11
cluster unlimit ..................................................................................................................................... 56
7.3
datastore ...................................................................................................................................................... 56
7.3.1
datastore add ....................................................................................................................................... 56
7.3.2
datastore delete ................................................................................................................................... 57
7.3.3
datastore list ......................................................................................................................................... 57
7.4
distro ............................................................................................................................................................. 58
7.4.1
distro list ................................................................................................................................................ 58
7.5
disconnect .................................................................................................................................................... 58
7.6
fs.................................................................................................................................................................... 58
7.6.1
fs cat ...................................................................................................................................................... 58
7.6.2
fs chgrp ................................................................................................................................................. 58
7.6.3
fs chmod ............................................................................................................................................... 58
7.6.4
fs chown ................................................................................................................................................ 59
7.6.5
fs copyFromLocal ................................................................................................................................ 59
7.6.6
fs copyToLocal ..................................................................................................................................... 59
3
7.6.7
fs copyMergeToLocal ......................................................................................................................... 60
7.6.8
fs count.................................................................................................................................................. 60
7.6.9
fs cp ....................................................................................................................................................... 60
7.6.10
fs du..................................................................................................................................................... 60
7.6.11
fs expunge .......................................................................................................................................... 60
7.6.12
fs get.................................................................................................................................................... 61
7.6.13
fs ls ...................................................................................................................................................... 61
7.6.14
fs mkdir ............................................................................................................................................... 61
7.6.15
fs moveFromLocal............................................................................................................................. 61
7.6.16
fs mv .................................................................................................................................................... 61
7.6.17
fs put.................................................................................................................................................... 62
7.6.18
fs rm .................................................................................................................................................... 62
7.6.19
fs setrep .............................................................................................................................................. 62
7.6.20
fs tail .................................................................................................................................................... 62
7.6.21
fs text................................................................................................................................................... 63
7.6.22
fs touchz ............................................................................................................................................. 63
7.7
hive ............................................................................................................................................................... 63
7.7.1
hive cfg .................................................................................................................................................. 63
7.7.2
hive script.............................................................................................................................................. 63
7.8
mr .................................................................................................................................................................. 63
7.8.1
mr jar ..................................................................................................................................................... 63
7.8.2
mr job counter ...................................................................................................................................... 64
7.8.3
mr job events ........................................................................................................................................ 64
7.8.4
mr job history........................................................................................................................................ 64
7.8.5
mr job kill ............................................................................................................................................... 64
7.8.6
mr job list .............................................................................................................................................. 65
7.8.7
mr job set priority ................................................................................................................................. 65
7.8.8
mr job status ......................................................................................................................................... 65
7.8.9
mr job submit ........................................................................................................................................ 65
7.8.10
mr task fail .......................................................................................................................................... 66
7.8.11
mr task kill .......................................................................................................................................... 66
7.9
network ......................................................................................................................................................... 66
7.9.1
network add .......................................................................................................................................... 66
7.9.2
network delete ...................................................................................................................................... 67
7.9.3
network list............................................................................................................................................ 67
7.10
pig script ..................................................................................................................................................... 68
7.10.1
pig cfg.................................................................................................................................................. 68
7.10.2
pig script ............................................................................................................................................. 68
7.11
resourcepool.............................................................................................................................................. 68
4
7.11.1
resourcepool add ............................................................................................................................... 68
7.11.2
resourcepool delete .......................................................................................................................... 69
7.11.3
resourcepool list ................................................................................................................................ 69
7.12
8.
topology...................................................................................................................................................... 70
7.12.1
topology upload ................................................................................................................................. 70
7.12.2
topology list ........................................................................................................................................ 70
vSphere Settings ............................................................................................................................................... 70

8.1
vSphere Cluster Configuration .................................................................................................................. 70
8.1.1
Setup Cluster ....................................................................................................................................... 70
8.1.2
Enable DRS/HA on an existing cluster ............................................................................................. 71
8.1.3
Add Hosts to Cluster ........................................................................................................................... 71
8.1.4
DRS/FT Settings .................................................................................................................................. 71
8.1.5
Enable FT on specific virtual machine ............................................................................................. 71
8.2
Network Settings ......................................................................................................................................... 71
8.2.1
Setup Port Group - Option A (vSphere Distributed Switch) .......................................................... 72
8.2.2
Setup Port Group - Option B (vSwitch) ............................................................................................ 72
8.3
9.
Storage Settings ......................................................................................................................................... 72
8.3.1
Shared Storage Setting ...................................................................................................................... 72
8.3.2
Local Storage Settings ....................................................................................................................... 72
Appendix A: Create Local Yum Repository for MapR ................................................................................. 74

9.1
Install a web server to server as yum server .......................................................................................... 74
9.1.1
Configure http proxy ............................................................................................................................ 74
9.1.2
Install Apache Web Server ................................................................................................................ 74
9.1.3
Install yum related packages ............................................................................................................. 75
9.1.4
Sync the remote MapR yum repository ............................................................................................ 75
9.2
Create local yum repository ...................................................................................................................... 75
9.3
Configure http proxy for the VMs created by Serengeti Server ........................................................... 76
10.
Appendix B: Create Local Yum Repository for CDH4 ............................................................................... 76
10.1
Install a web server to server as yum server ........................................................................................ 76
10.1.1
Configure http proxy.......................................................................................................................... 76
10.1.2
Install Apache Web Server .............................................................................................................. 77
10.1.3
Install yum related packages ........................................................................................................... 77
10.1.4
Sync the remote CDH4 yum repository ......................................................................................... 77
10.2
Create local yum repository .................................................................................................................... 77
10.3
Config http proxy for the VMs created by Serengeti Server ............................................................... 78
1. Serengeti Users Guide

The Serengeti Users Guide provides information about installing and using the Serengeti to deploying
and scaling Hadoop clusters on vSphere.
To help you start with Serengeti, this information includes descriptions of Serengeti concepts and features.
In addition, this information provides a set of usage examples and sample scripts.
1.1 Intended Audience

This book is intended for anyone who needs to install and use Serengeti. The information in this book is
written for administrators and developers who are familiar with VMware vSphere.
2. Serengeti Overview
2.1 Serengeti
The Serengeti virtual appliance is a management service that you can use to deploy Hadoop clusters on
VMware vSphere systems. It is a one-click deployment toolkit that allows you to leverage the VMware
vSphere platform to deploy a highly available Hadoop cluster in minutes, including common Hadoop
components such as HDFS, MapReduce, Pig, and Hive on a virtual platform. Serengeti supports multiple
Hadoop 0.20 based distributions, CDH4 (except YARN), and MapR M5.
2.1.1 Serengeti Features

2.1.1.1 Rapid Provisioning
Serengeti can deploy Hadoop clusters with HDFS, MapReduce, HBase, Pig, Hive client and Hive server
in your vSphere system easily and quickly.
Serengeti includes a provisioning engine, the Apache Hadoop distribution, and a virtual machine template.
Serengeti is preconfigured to automate Hadoop cluster deployment and configuration. With Serengeti,
you can save time in getting started with Hadoop because you do not need to install and configure an
operating system, or download, install and configure each software package on each machine.
2.1.1.2 High Availability
Serengeti takes advantage of vSphere high availability to protect the Hadoop master node virtual
machine. The master node virtual machine can be monitored by vSphere. When Hadoop namenode or
jobtracker service stops unexpectedly, vSphere will restart master node for recovery. When the virtual
machine stops unexpectedly by host failover or cannot access due to poor network, vSphere will leverage
FT to start another standby virtual machine automatically to reduce the unplanned down time.
6
2.1.1.3 Local Disk Management

Serengeti allows you to use both shared storage and local storage. After the disks are formatted to
datastores in vSphere, you can add the datastores to Serengeti easily. You can specify whether the
datastores are shared storage (SHARED) or local storage (LOCAL). Serengeti automatically allocates the
datastores to Hadoop clusters when you deploy a Hadoop cluster.
By default, Serengeti allocates Hadoop master nodes and client nodes on SHARED datastores, and
data/compute nodes on LOCAL datastores, including both system disk and data disks of those nodes. If
you specify only local storage or shared storage, Serengeti allocates all Hadoop nodes on the available
datastores for a default cluster.
2.1.1.4 Easy Scale Out
With Serengeti you can add more nodes to a Hadoop cluster with a single command after it has been
deployed. You can start with a small Hadoop cluster and scale out as needed.
2.1.1.5 Configuration
Serengeti allows you to customize the following:
Number of virtual machines
CPU, RAM, storage for the virtual machines
Software packages for the virtual machines
Hadoop configuration.
Serengeti automatically adjusts Hadoop configurations according to the virtual machine specification.
After creation, you can export Hadoop clusters spec and tune Hadoop configuration without impacting
irrelevant Hadoop node.
Serengeti provides both cluster level and node group level configuration. You can set different
parameters for different node groups.
2.1.1.6 Data Compute Separation
Serengeti allows you to deploy a data and computer separated Hadoop cluster.
You can specify number of data nodes per host.
You can specify the number of compute nodes for one data node and specify compute node and
related data node on the same physical host.
Serengeti also allows you to deploy a compute-only cluster to performance isolation between
different MapReduce clusters or consume the existing HDFS.
Deploy a Hadoop cluster with only JobTracker and TaskTracker to consume an existing apache
0.20 based HDFS.
Deploy a Hadoop cluster with only job tracker and task tracker to consume an 3rd party HDFS.
2.1.1.7 Remote CLI

You can remotely access Serengeti Management Server by installing CLI client in your environment.,
which is a one-stop-shop shell to deploy, manage and use Hadoop.
2.1.1.8 Hadoop Distribution Management
Serengeti allows you to use any of the following Hadoop distributions
7
Apache Hadoop 1.0.x
Greenplum HD 1.2
Hortonworks HDP-1
CDH3
CDH4
MapR M5
You can add your preferred distribution to Serengeti and deploy Hadoop clusters accordingly.
2.1.2 Serengeti Architecture Overview

The Serengeti virtual appliance runs on top of vSphere system and includes a Serengeti Management
Server virtual machine and a Hadoop Template virtual machine. The Hadoop Template virtual machine
includes an agent.
Serengeti performs these major steps to deploy a Hadoop cluster:

1. Serengeti Management Server searches for ESXi hosts with sufficient resources.
2. Serengeti Management Server selects ESXi hosts on which to place Hadoop virtual machines.
3. Serengeti Management Server sends a request to vCenter to clone and reconfigure virtual
machines.
4. Agent configures the OS parameters and network configurations.
5. Agent downloads Hadoop software packages from the Serengeti Management sServer.
6. Agent installs Hadoop software.
7. Agent configures Hadoop parameters.
Provisioning is performed in parallel, which reduces deployment time.
2.2 Hadoop
Apache Hadoop is open source software for distributed storage and computing. Apache Hadoop includes
HDFS and MapReduce. The HDFS is a distributed file system. The MapReduce is a software framework
for distributed data processing.
You can find more information about Apache Hadoop on http://hadoop.apache.org/ for more information.
2.3 VMware Virtual Infrastructure

VMwares leading virtualization solutions provide multiple benefits to IT administrators and users. VMware
virtualization creates a layer of abstraction between the resources required by an application and
operating system, and the underlying hardware that provides those resources. A summary of the value of
this abstraction layer includes the following:
Consolidation: VMware technology allows multiple application servers to be consolidated onto

one physical server, with little or no decrease in overall performance.
Ease of Provisioning: VMware virtualization encapsulates an application into an image that can
be duplicated or moved, greatly reducing the cost of application provisioning and deployment.
Manageability: Virtual machines may be moved from server to server with no downtime using
VMware vMotion, which simplifies common operations like hardware maintenance and reduces
planned downtime.
Availability: Unplanned downtime can be reduced and higher service levels can be provided to an
application. VMware High Availability (HA) ensures that in the case of an unplanned hardware
failure, any affected virtual machines are restarted on another host in a VMware cluster.
2.4 Serengeti Virtual Appliance Requirements
Software
o
VMware vSphere 5.0 Enterprise or VMware vSphere 5.1 Enterprise
VMWare vSphere Client 5.0 or VMWare vSphere Client 5.1
SSH client
Network
o
DNS Server
DHCP Server or Static IP Address Block
Resource requirements
o
Resource pool with at least 27.5GB RAM
Port group with at least 6 uplink ports
350G or more disk spaces are suggested.
17GB is for Serengeti virtual appliance,
300GB is for your first Hadoop cluster. You can reduce the disk space
requirements by specifying the storage size in a cluster specification.
The remaining disk space is reserved for swap space.
Shared storage is required if you use HA or FT for the Hadoop master node.
Others
o
All ESXi hosts should have time synchronized using the Network Time Protocol (NTP)
2.5 Serengeti CLI Requirements
OS
9
Windows
Linux
Software
o
Java 1.6.26 or later
Unzip tool
Network
o
Can access Serengeti Management Server through HTTP in order to download CLI
package
3. Installing the Serengeti Virtual Appliance

3.1 Download
Download a Serengeti Virtual Appliance OVA from VMware site.
3.2 Deploy Serengeti

Serengeti runs in VMWare vSphere system. You can use the vSphere client to connect VMware vCenter
Server and deploy Serengeti.
1. In the vSphere Client, Select menu File -> Deploy OVF Template
2. Select the OVA file location of Serengeti Virtual Appliance. vSphere client will verify the OVA file and
show you the brief information.
3. Specify the Serengeti virtual appliance name and inventory location.
Only alphabetic letters (a-z, A-Z), numbers (0-9), space ( ), hyphen (-) and underscore
(_) can be used for virtual appliance name and resource pool name. For datastore name, it can
be the above ones plus parenthesis ((, )) and period (.).
4. Select the resource pool on which to deploy the template.

You MUST deploy Serengeti in a top-level resource pool.
10
5. Select a datastore.
6. Select a format for the virtual disks.
7. Map the networks used in the OVF template to the networks in your inventory.
8. Set the properties for this Serengeti deployment.
11
Serengeti Management Server Network Settings

Network Type
Select DHCP or Static IP.
IP Address
Enter IP address for the Serengeti Management Server virtual machine.
Net mask
Enter the subnet mask of the network.
Gateway
Enter the IP address for the network gateway.
DNS Server 1
Enter the DNS server IP address.
DNS Server 2
Enter a second DNS server IP address.
Hadoop Resource Settings
Initialize Resources Keep this option selected to add the resource pool, datastore and network
to Serengeti Management Server database. Users can deploy Hadoop
clusters in the resource pool, datastore and network in which the Serengeti
virtual appliance is deployed. Hadoop node virtual machines attempt to
obtain IP address by using DHCP on the network.
9. Verify binding to vCenter Extension Service.
10. Click Next to deploy the virtual appliance. Itll take several minutes to deploy the virtual appliance.
After Serengeti virtual appliance is deployed successfully, two virtual machines will be installed in
12
vSphere. One is the Serengeti Management Server virtual machine another is the virtual machine
template for Hadoop nodes.
11. Power on the Serengeti vApp and open the console of Serengeti Management Server VM, you see
the initial OS login password for root/serengeti user. Update the password with command sudo
/opt/serengeti/sbin/set-password -u after login to the VM, and the initial password will disappear on
the welcome screen.
4. Quick Start
4.1 Set up the Serengeti CLI
Serengeti command line shell can run in Windows or Linux. You need Java installed on the machine.
You can download VMware-Serengeti-cli-0.8.0.0-<build number>.zip from the Serengeti Management
Server (http://your-serengeti-server/cli).
Unzip the downloaded package to a directory. Run Serengeti CLI by going to this directory, under cli,
and enter java jar serengeti*.jar.
Please refer to the troubleshooting document if you have any issues.
4.2 Deploy a Hadoop Cluster

You can use Serengeti CLI to perform actions such as creating and customizing Hadoop clusters. You
have two ways to access Serengeti CLI: from the Serengeti Management Server virtual machine or install
CLI on any machine and use it.
1. Enter the Serengeti shell.
>java jar serengeti*.jar
2. Run "connect command to connect to the Serengeti server.
serengeti>connect --host xx.xx.xx.xx:8080 --username xxx --password xxx
A user named serengeti with password password is created by default.
3. Run "cluster create command to deploy a Hadoop cluster on vSphere.
serengeti>cluster create --name myHadoop
In the example, myHadoop is the name of the Hadoop cluster you deploy. The Serengeti command
continually updates the progress of the deployment.
Only alphabetic letters (a-z, A-Z), numbers (0-9), and underscore (_) can be used cluster
13
name.
This command will deploy a Hadoop cluster with one master node virtual machine, three worker node
virtual machines, and one client node virtual machine. The master node virtual machine contains
NameNode and JobTracker in it. The worker node virtual machines contain datanode and TaskTracker
services. The client node virtual machines contain a Hadoop client environment, including Hadoop client
shell, Pig, and Hive.
After the deployment is complete, you can view the IP addresses of the Hadoop node virtual machines.
Hint
Use the tab key for auto-completion and to get help for commands and parameters.
By default, Serengeti might use any resources added to deploy a Hadoop Cluster. To limit the scope of
resources for the cluster, you can specify resource pools, datastores, or a network in the cluster create
command
serengeti>cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
In this example myRP is the resource pool where the Hadoop cluster is deployed on, myDS is the
datastore where the virtual machine images is stored, myNW is the network which virtual machines will
use.
Hint
You can use resourcepool list, datastore list, network list command to see what resources are in
Serengeti.
Once you have a Hadoop cluster deployed you can execute Hadoop command directly in the CLI. In this
section we will describe how you can copy files from the local file system to HDFS and then run a
MapReduce job.
1. Start the Serengeti CLI and connect to Serengeti Management Server as described in section 4.1
2. Run the cluster list command to show all the available clusters
$serengeti>cluster list
3. Run the cluster target --name command to connect to the cluster you want to get data in/out.
The --name value is the cluster name that you want to connect.
$serengeti>cluster target --name cluster1
4. Run the fs put command to upload data to HDFS
$serengeti>fs put from /etc/inittab to /tmp/input/inittab
5. Run the fs get command to download data from HDFS
$serengeti>fs get from /tmp/input/inittab to /tmp/local-inittab
6. Run the mr jar command to run a MapReduce job
$serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass
org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"
7. Run the fs cat command to show the output of the MR job
$serengeti> fs cat /tmp/output/part-r-00000
8. Run the fs get command to download the output of the MR job
$serengeti> fs get from /tmp/output/part-r-00000 to /tmp/wordcount
14
Another way to use Hadoop is through the client VM. By default, Serengeti will deploy a VM named client
VM. It has Hadoop client, pig and Hive installed. The OS is configured ready to use Hadoop. You can see
the IP of client VM after a cluster is deployed or use cluster list command to see the IP. Following are the
steps to follow in order to verify that the Hadoop cluster is working properly.
1. Use ssh to login to the client VM.
use "joe" for user name. Password is "password".
2. Create your own home directory.
$ hadoop fs -mkdir /usr/joe
3. Or run a sample Hadoop mapreduce job.
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 10000000
Feel free to use submit other MapReduce, Pig or Hive jobs as well.
4.3 Deploy a HBase Cluster

Serengeti also supports deploying HBase cluster on HDFS. The easiest way to deploy a HBase cluster is
running the following command:
serengeti>cluster create --name myHBase --type hbase
In the example, myHBase is the name of the HBase cluster you deployed, --type hbase implies you
want to deploy a HBase cluster based on a default template Serengeti provides. This command will
deploy one master node virtual machine which runs NameNode and HBaseMaster daemon, three
zookeeper nodes running ZooKeeper daemon, three data nodes running Hadoop DataNode and HBase
Regionserver daemon, and one client node from which you can launch Hadoop or HBase Jobs.
When deployment finished, you can access HBase through a few ways as you expected:
1. Login client VM to run hbase shell commands;
2. Launch a HBase job like hbase org.apache.hadoop.hbase.PerformanceEvaluation nomapred
randomWrite 3;
Default HBase cluster does not contain Hadoop JobTracker or Hadoop TaskTracker daemon. So
you need to deploy a customized cluster in case you want to run a HBase mapr job.
3. Access HBase through Rest-ful Web Service or Thrift gateway, HBase Rest and Thrift service are
configured on the HBase client node, and Rest service listens on port 8080, Thrift service listens
on port 9090.
5. Using Serengeti
5.1 Manage Serengeti Users
Spring security In-Memory Authentication is used for Serengeti Authentication and user management.
You can modify /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file to
manage Serengeti users. And then restart tomcat service using command "sudo service tomat restart".
5.1.1 Add/Delete a User in Serengeti

Add or delete user at /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file,
user-service element.
Following is a sample to add one user into user-service.
<authentication-manager alias="authenticationManager">
<authentication-provider>
<user-service>
15
<user name="serengeti" password="password" authorities="ROLE_ADMIN"/>

<user name="joe" password="password" authorities="ROLE_ADMIN"/>
</user-service>
</authentication-provider>
</authentication-manager>
The authorities value should define user role in Serengeti, but in M2, its not used, so its OK to have any
value here.
5.1.2 Modify User Password

Modify the password value in user-service element at the same file. Following is a sample.
<authentication-manager alias="authenticationManager">
<authentication-provider>
<user-service>
<user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
<user name="joe" password="welcome1" authorities="ROLE_ADMIN"/>
</user-service>
</authentication-provider>
</authentication-manager>
5.2 Manage Resources in Serengeti

When deploying Serengeti.OVA, VI admin might allow you to use the same resources in which Serengeti
virtual appliance is using. You can also add more resources to Serengeti for your Hadoop cluster. You
can list resources in Serengeti and delete them if its no longer needed.
You must add resource pool, datastore and network before deploying a Hadoop cluster if VI
admin does not allow you to deploy Hadoop cluster in the same set of resources as Serengeti
server.
5.2.1 Add a Datastore

You can use datastore add command to add a vSphere datastore to Serengeti.
serengeti>datastore add --name myLocalDS --spec local* --type LOCAL
In this example, myLocalDS is the name you used to create the Hadoop cluster.
local* is a wildcard specifying a set of datastores. All datastores whose name starts with local will be
added and managed as a whole.
LOCAL specifies that the datastores are local storage.
In this version, Serengeti does not check if the datastore really exists. If you use a nonexistent
datastore, cluster creation will fail.
5.2.2 Add a Network

You can use network add command to add a network to Serengeti. A network is a port group and a way
to get ip on the port group.
serengeti>network add --name myNW --portGroup 10GPG --dhcp
In this example, myNW is the name you used to create the Hadoop cluster.
10GPG is the name of the port group created by VI Admin in vSphere.
Virtual machines using this network will use DHCP to obtain IP.
You can also add networks using a static IP.
serengeti>network add --name myNW --portGroup 10GPG --ip 192.168.1.2-100 --dns 10.111.90.2 -16
gateway 192.168.1.1 --mask 255.255.255.0

In this example, 192.168.1.2-100 is the IP address range Hadoop nodes can use.
10.111.90.2 is the DNS server IP.
192.168.1.1 is the gateway.
255.255.255.0 is the subnet mask.
In this version, Serengeti does not check if the added network is correct. If you use a wrong
network, cluster creation will fail.
5.2.3 Add a Resource Pool

You can use resourcepool add command to add a vSphere resource pool to Serengeti.
serengeti>resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
In this example, myRP is the name you used to create the Hadoop cluster.
cluster1 is the vSphere cluster name and rp1 is vSphere resource pool name.
In this version, Serengeti does not check if the resource pool really exists. If you use a
nonexistent resource pool, cluster creation will fail.
vSphere nested resource pools are not supported in current version. The resource pools must be
one that is located directly under a cluster.
5.2.4 View Datastores

In the Serengeti shell, you can list datastores added to Serengeti.
serengeti>datastore list
You can see details of datastores.
serengeti> datastore list --detail
You can specify which datastore to list.
seretenti> datastore list --name myDS --detail
5.2.5 View Networks

In the Serengeti shell, you can list networks added to Serengeti.
serengeti>network list
You can see details of networks.
serengeti> network list --detail
You can specify which network to list.
seretenti> network list --name myNW --detail
5.2.6 View Resource Pools

In the Serengeti shell, you can list resource pools added to Serengeti.
serengeti>resourcepool list
You can see details of resource pools.
serengeti>resourcepool list --detail
17
You can specify which resource pool to list.

seretenti>resourcepool list --name myRP --detail
5.2.7 Remove a Datastore

You can use the datastore delete command to remove a datastore from Serengeti.
serengeti>datastore delete --name myDS
In this example, myDS is the name you specified when you added the datastore.
You cannot remove a datastore from Serengeti if it is referenced by a Hadoop cluster.
5.2.8 Remove a Network

You can use the network delete command to remove a network from Serengeti.
serengeti>network delete --name myNW
In this example, myNW is the name you specified when you added the network.
You cannot remove a network from Serengeti if it is referenced by a Hadoop cluster.

You can use network list command to see which cluster is referencing the network.
5.2.9 Remove a Resource Pool

You can use the resoucepool delete command to remove a resource pool from Serengeti.
serengeti>resourcepool delete --name myRP
In this example, myRP is the name you specified when you added the resource pool.
You cannot remove a resource pool from Serengeti if the resource pool is referenced by a
Hadoop cluster.
5.3 Manage Distros

5.3.1 Supported Distros
Serengeti Management Server includes the Apache Hadoop 1.0.1, but you can use your preferred
1
Hadoop distro as well. Greenplum HD1, CDH3, CDH4 , HDP1 and MapR M5 are also supported.
Serengeti now supports Hadoop cluster, Pig and Hive instance deployment.
5.3.2 Add a Distro to Serengeti

Serengeti uses tar ball or yum repository to deploy Hadoop cluster for different Hadoop distributions.
5.3.2.1 Using tar ball to deploy Hadoop cluster
Serengeti uses tar ball to deploy the following Hadoop distros:
Apache Hadoop 1.0.x
Greenplum HD 1
1
YARN is not supported at this moment.

18
CDH3
HDP1
1. Download the three packages (hadoop/pig/hive) in tar ball format from the distro vendor's site.
2. Upload them to Serengeti Management Server virtual machine.
3. Put the packages in /opt/serengeti/www/distros/. The hierarchy should be
DISTRO_NAME/VERSION_NUMBER/TARBALLS. For example, place the Apache Hadoop distro as
shown in the following way.
- apache/
- 1.0.1/
- hadoop-1.0.1.tar.gz
- hive-0.8.1.tar.gz
- pig-0.9.2.tar.gz
4. Edit the /opt/serengeti/www/distros/manifest in Serengeti Management Server virtual machine to
add the mapping between Hadoop roles and the tar ball package of the distro. As the following
example, add JSON text to the manifest file:
{
"name" : "cdh",
"version" : "3u3",
"packages" : [
{
"roles" : ["hadoop_namenode", "hadoop_jobtracker",
"hadoop_tasktracker", "hadoop_datanode",
"hadoop_client"],
"tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
},
{
"roles" : ["hive"],
"tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
},
{
"roles" : ["pig"],
"tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
}
]
},
In this example, the CDH tar balls are put in directory /opt/serengeti/www/distros/cdh/3u3.
Please note if a distro supports HVE, please add hveSupported : true, after the line related to version
in the above example.
5. Restart the tomcat server in Serengeti Management Server to allow the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart
If the commands are successful, in the Serengeti shell, issue the command "distro list", you can see the
distro that you added appears. Otherwise, make sure you write the correct JSON text in the manifest.
5.3.2.2 Using yum repository to deploy Hadoop cluster
Serengeti uses yum repository to deploy the following Hadoop distros:
19
CDH4
MapR M5
1. Open the sample manifest file /opt/serengeti/www/distros/manifest.sample in Serengeti

Management Server virtual machine, you will see the following distro configuration for MapR and
CDH4:
{
"name" : "mapr",
"vendor" : "MAPR",
"version" : "2.1.1",
"packages" : [
{
"roles" : ["mapr_zookeeper", "mapr_cldb", "mapr_jobtracker", "mapr_tasktracker",
"mapr_fileserver", "mapr_nfs", "mapr_webserver", "mapr_metrics", "mapr_client", "mapr_pig",
"mapr_hive", "mapr_hive_server", "mapr_mysql_server"],
"package_repos" : ["http://<ip_of_serengeti_server>/mapr/2/mapr-m5.repo"]
}
]
},
{
"name" : "cdh4",
"vendor" : "CDH",
"version" : "4.1.2",
"packages" : [
{
"roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker",
"hadoop_datanode", "hadoop_journalnode", "hadoop_client", "hive", "hive_server", "pig",
"hbase_master", "hbase_regionserver", "hbase_client", "zookeeper"],
"package_repos" : ["http://<ip_of_serengeti_server>/cdh/4/cloudera-cdh4.repo"]
}
]
}
The two yum repo files (mapr-m5.repo and cloudera-cdh4.repo) point to the official yum repository of
MapR and CDH4 on the Internet. You can copy this sample file
/opt/serengeti/www/distros/manifest.sample to /opt/serengeti/www/distros/manifest.
When you create a MapR or CDH4 cluster, Hadoop nodes will download rpm packages from the
MapR/CDH4 official yum repository on the Internet.
If your VMs in the cluster created by Serengeti Management Server do not have access to the Internet or
the bandwidth to the Internet is not fast, we strongly suggest create a local yum repository for MapR
and CDH4. Please read the Appendix A: Create Local Yum Repository for MapR and Appendix B: Create
Local Yum Repository for CDH4 to create a yum repository.
2. Config the local yum repository url in manifest file
Once the local yum repository for MapR/CDH4 is created, open /opt/serengeti/www/distros/manifest
and add the distro configuration (use the sample in previous step and modify attribute
"package_repos" to the url of the local yum repository file).
3. Restart the tomcat server in Serengeti Management Server to allow the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart
20
If the commands are successful, in the Serengeti shell, issue the command "distro list", you can see the
distro that you added. Otherwise, make sure you write the correct JSON text in the manifest.
5.3.3 List Distros

You can use the "distro list" command to see available distros.
serengeti> distro list
You can see packages in each of the distro and make sure it includes services you want to deploy.
5.3.4 Using a Distro

You can choose which distro you use when deploying a cluster.
serengeti>cluster create --name myHadoop --distro cdh
5.4 Hadoop Clusters

5.4.1 Deploy Hadoop Clusters
5.4.1.1 Deploy a Customized Hadoop Cluster
You can customize the number of nodes, and size of virtual machines etc. when you create a cluster.
In Serengeti Management Server you can find sample specs in /opt/serengeti/samples/. If you are using
Serengeti CLI from your desktop you can find the sample specs in the client folder.
1. Edit a cluster spec file.
For example:
{
"nodeGroups" : [
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"instanceType": "MEDIUM"
},
{
"name": "worker",
"roles": [
"hadoop_datanode", "hadoop_tasktracker"
],
"instanceNum": 5,
"instanceType": "SMALL"
},
{
"name": "client",
"roles": [
21
"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"instanceType": "SMALL"
}
]
}
In this example, you want 1 master virtual machine MEDIUM size, 5 worker virtual machines in SMALL
size, 1 client virtual machine in SMALL size. You can also specify number of CPUs, RAM, disk size etc.
for each of node groups.
2. Specify the spec when creating the cluster. You need use the full path to specify the file.
serengeti>cluster create --name myHadoop --specFile /home/serengeti/mySpec.txt
CAUTION
Changing the role of node groups might cause the deployed Hadoop cluster not workable.
Deploy a CDH4 Hadoop ClusterYou can create a default CDH4 Hadoop cluster by executing the following
command in Serengeti CLI:
serengeti>cluster create --name mycdh --distro cdh4
You can also create a customized CDH4 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mycdh --distro cdh4 --specFile
/opt/serengeti/samples/default_cdh4_ha_hadoop_cluster.json
/opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json is a sample spec file for
CDH4. You can make a copy of it and modify the parameters in the file before creating the cluster. In this
example, nameservice0 and nameservice1 are federated with each other, the name nodes inside
nameservice0 node group (with instanceNum set as 2) are HDFS2 HA enabled. In Serengeti, name node
group names will be the name service names of HDFS2.
5.4.1.1.1 Deploy a MapR Hadoop Cluster
You can create a default MapR M5 Hadoop cluster by executing the following command in Serengeti CLI:
serengeti>cluster create --name mymapr --distro mapr
You can also create a customized MapR M5 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mycdh --distro mapr --specFile /opt/serengeti/samples/
default_mapr_cluster.json
/opt/serengeti/samples/ default_mapr_cluster.json is a sample spec file for MapR, you can make a copy
of it and modify the parameters in the file before creating the cluster.
5.4.1.2 Separating Data and Compute nodes

You can separate data and compute nodes in a cluster and apply more fined control of node placement
among ESX hosts. For example, you can use Serengeti to deploy the following clusters:
1. A data and compute separated cluster, without any node placement constraints.
22
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"storage": {
"type": "LOCAL",
"sizeGB": 20
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
23
}
In this example, four data nodes and eight compute nodes will be created and put into individual VMs. By
default, Serengeti uses Round Robin algorithm to put VM/node across ESX hosts evenly.
2. A data compute separated cluster, with instancePerHost constraint.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1
}
},
{
"name": "compute",
"roles": [
],
"instanceNum": 8,
"cpuNum": 2,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
24
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, data and compute node group have placementPolicy constraint. After a successful
provision, four data nodes and eight compute nodes will be created and put into individual VMs. With the
instancePerHost=1 constraint, the four data nodes will be placed on four ESX hosts. The eight compute
nodes will be put onto four ESX hosts as well, two nodes for each.
Note that it is not guaranteed that the two compute nodes will stay collocated with each data node on
each of the four ESX hosts. To ensure that this is the case, create a VM-VM affinity rule between each
hosts compute nodes and data node, or disable DRS on the compute nodes.
3. A data compute separated cluster, with instancePerHost , groupAssociations constraints for compute
node group and groupRacks constraint for data node group.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"instancePerHost": 1,
"groupRacks": {
"type": "ROUNDROBIN",
"racks": ["rack1", "rack2", "rack3"]
},
25
}
},
{
"name": "compute",
"roles": [
],
"instanceNum": 8,
"cpuNum": 2,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
"groupAssociations": [
{
"reference": "data",
"type": "STRICT"
}
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, after a successful provision, the four data nodes and eight compute nodes will be placed
on exactly the same four ESX hosts, each ESX host has one data node and two compute nodes, and
these four ESX hosts are selected from rack1, rack2 and rack3 fairly.
Here, as the definition of compute node group says, the placement of compute nodes should strictly
refer to the placement result of data node. That means, compute nodes should only be placed on ESX
hosts that have data nodes.
5.4.1.3 Deploy a Compute Only Cluster

You can create a compute only cluster that refers to an existing HDFS cluster with the following steps:
1. Edit a cluster spec file and save it, for example, as /home/serengeti/coSpec.txt.
26
For example:
{
"externalHDFS": "hdfs://hostname-of-namenode:8020",
"nodeGroups": [
{
"name": "master",
"roles": [
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
},
{
"name": "worker",
"roles": [
"hadoop_tasktracker",
],
"instanceNum": 4,
"cpuNum": 2,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
}
],
configuration : {
}
}
In this example, the externalHDFS field points to an existing HDFS. You should also specify the node
group with role hadoop_jobtracker and hadoop_tasktracker. Note that the externalHDFS field conflicts
with node groups that have hadoop_namenode and hadoop_datanode roles. The sample cluster spec
can also be found in file in samples/compute_only_cluster.json in the Serengeti CLI directory,
2. Specify the spec when creating the cluster. You need use the full path to specify the file.
serengeti>cluster create --name computeOnlyCluster --specFile /home/serengeti/coSpec.txt
27
5.4.1.4 Control Hadoop Virtual Machine Placement

Serengeti provides a way for user to control how Hadoop virtual machines to be placed. Generally, its
implemented by specifying the placementPolicies field inside a node group, like:
{
"nodeGroups":[
{
"name": "group_name",
"groupRacks": {
},
"groupAssociations": [{
"reference": "another_group_name",
"type": "STRICT" // or "WEAK"
}]
}
},
}
As this example shows, the palcementPolicy field contains three optional items: instancePerHost,
groupRacks and groupAssociations.
As the name implies, instancePerHost indicates how many VM nodes or instances should be placed for
each physical ESX host, this constraint is aimed at balancing the workload.
The groupRacks controls how VM nodes should be put across the racks you specified. In this example,
the rack type equals ROUNDROBIN, and the racks item indicates which racks in the topology map
(refer to chapter 5.8 to see how to configure topology map information and enable Hadoop cluster to be
rack awareness) will be used for this placement policy. If racks item is ignored, Serengeti will use all
racks in the topology map. ROUNDROBIN here means the candidates will be fairly selected when
determining which rack should be selected for each node.
On the other side, if you specify both the InstancePerHost and groupRacks for placement policy, you
should make sure the number of available hosts is enough. You can get the rack-hosts information by
using the command topology list.
groupAssociations means the node group has associations with target node groups, and each
association has reference and type fields. The field reference is the name of a target node group,
and type can be STRICT or WEAK. STRICT means the node group must be placed on the same
set or subset of ESX hosts relevant to the target group, while WEAK means the node group tries to be
placed on the same set or subset of ESX hosts relevant to the target group but no guarantee.
A typical scenario of applying groupRacks and groupAssociations is deploying a Hadoop cluster with
data and compute nodes separated. In this case, user might tend to put compute nodes and data nodes
on the same set of physical hosts for better performance, especially the throughput. You can refer to
5.3.3 for the practical examples of how to deploy Hadoop cluster by applying placement policy.
28
5.4.1.5 Use NFS as Compute Nodes Local Directory

Serengeti allows user to specify NFS for compute nodes. There are several benefits 1) increase the
capacity of each compute node; 2) return storage resource back when some compute nodes stopped.
Here is an example to show how to deploy a cluster whose compute nodes have only NFS storage:
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"instanceType": "LARGE",
"cpuNum": 2,
"haFlag": "on"
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
}
},
{
"name": "compute",
"roles": [
],
"instanceNum": 8,
"cpuNum": 1,
"storage": {
"type": "TEMPFS"
},
"groupAssociations": [
{
29
"type": "STRICT"
}
]
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
]
}
In this example, the cluster is D/C separated. Compute nodes are strictly associated with data nodes. By
specifying the Storage field of compute node group to type: TEMPFS, Serengeti will install NFS server
on associated data nodes, install NFS client on compute nodes, and mount data nodes disks on compute
nodes. Serengeti will not assign disks to compute nodes, and all temp files generated during running
MapReduce jobs are saved on the NFS disks.
5.4.2 Manage Hadoop Clusters

5.4.2.1 Modify Hadoop
Serengeti provides a simple and easy way to tune the Hadoop cluster configuration including attributes in
core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml,
capacity-scheduler.xml, etc.
In addition to modifying Hadoop configuration of an existing Hadoop cluster created by
Serengeti, you can also define Hadoop configuration in the cluster spec file when creating a new
cluster.
5.4.2.1.1 Cluster Level Configuration

You can modify the Hadoop configuration of an existing cluster by following these steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop output /home/serengeti/myHadoop.json
2. Modify the configuration section at the bottom of /home/serengeti/myHadoop.json with the following
content and add the customized Hadoop configuration in this configuration section:
30
"configuration": {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a
sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": "",
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG, DRFA ",
// "hadoop.security.logger": "DEBUG, DRFA ",
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
}
}
}
Serengeti provides a tool to convert the Hadoop configuration files of your existing cluster into
the above json format, so you dont need to write this json file manually. Please read section
Tool for converting Hadoop Configuration.
Some Hadoop Distributions have their own java jar files which are not put in
$HADOOP_HOME/lib, so by default Hadoop daemons cant find it. In order to use these jars,
you need to add a cluster configuration to include the full path of the jar file in
$HADOOP_CLASSPATH.
Here is a sample cluster configuration to configure Cloudera CDH3 Hadoop cluster with Fair
Scheduler (the jar files of Fair Scheduler is put in /usr/lib/hadoop/contrib/fairscheduler/):
31
"configuration": {
"hadoop": {
"hadoop-env.sh": {
"HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
},
"mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
},
}
}
}
3. Run cluster config command to apply the new Hadoop configuration

serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
4. If you want to reset an existing configuration attribute to the Hadoop default value, simply remove it or
comment it out using // in configuration section in cluster spec file, and run cluster config command.
5.4.2.1.2 Group Level Configuration
You can also modify the Hadoop configuration within a node group in an existing cluster by following
steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json
2. Modify the configuration section within the node group in /home/serengeti/myHadoop.json with the
same content as in Cluster Level Configuration and add the customized Hadoop configuration for this
node group.
The Hadoop configuration in Group Level Configuration will override the configuration with the
same name in Cluster Level Configuration.
3. Run cluster config command to apply the new Hadoop configuration
serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
5.4.2.1.3 Black List and White List in Hadoop Configuration
Almost all the configuration attributes provided in Apache Hadoop are configurable in Serengeti, and
these attributes belong to White List. However a few attributes are not configurable in Serengeti and
these attributes belongs to Black List.
If you set an attribute in the cluster spec file and it is in the Black List or not in the White List, then run
cluster config command, Serengeti will detect these attributes and give a warning, you need to answer
yes to continue or no to abort.
Usually you dont need to configure fs.default.name' or dfs.http.address if there is a NameNode
or JobTracker in your cluster, because Serengeti will automatically configure these 2 attributes.
For example, when you create a default cluster in Serengeti, it will contains a NameNode and
JobTracker, and you dont need to explicitly configure fs.default.name' and dfs.http.address.
However you can set fs.default.name' to the uri of another NameNode if you really want to.
32
5.4.2.1.3.1 White List

core-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/core-default.html
exclude attributes defined in Black List
hdfs-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/hdfs-default.html
mapred-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/mapred-default.html
hadoop-env.sh
JAVA_HOME
PATH
HADOOP_CLASSPATH
HADOOP_HEAPSIZE
HADOOP_NAMENODE_OPTS
HADOOP_DATANODE_OPTS
HADOOP_SECONDARYNAMENODE_OPTS
HADOOP_JOBTRACKER_OPTS
HADOOP_TASKTRACKER_OPTS
HADOOP_LOG_DIR
log4j.properties
hadoop.root.logger
hadoop.security.logger
log4j.appender.DRFA.MaxBackupIndex
log4j.appender.RFA.MaxBackupIndex
log4j.appender.RFA.MaxFileSize
fair-scheduler.xml
text
all attributes described on http://hadoop.apache.org/docs/stable/fair_scheduler.html , which can

be put inside text field
capacity-scheduler.xml
all attributes described on http://hadoop.apache.org/docs/stable/capacity_scheduler.html
5.4.2.1.3.2 Black List

core-site.xml
33
net.topology.impl
net.topology.nodegroup.aware
dfs.block.replicator.classname
hdfs-site.xml
dfs.http.address
dfs.name.dir
dfs.data.dir
topology.script.file.name
mapred-site.xml
mapred.job.tracker
mapred.local.dir
mapred.task.cache.levels
mapred.jobtracker.jobSchedulable
mapred.jobtracker.nodegroup.awareness
hadoop-env.sh
HADOOP_HOME
HADOOP_COMMON_HOME
HADOOP_MAPRED_HOME
HADOOP_HDFS_HOME
HADOOP_CONF_DIR
HADOOP_PID_DIR
log4j.properties
None
fair-scheduler.xml
None
capacity-scheduler.xml
None
mapred-queue-acls.xml
None
5.4.2.1.4 Tool for converting Hadoop Configuration

In case you have a lot of Hadoop configuration in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoopenv.sh, log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, mapred-queue-acls.xml, etc. for your
existing Hadoop cluster, you can use a tool provided by Serengeti to convert the Hadoop xml
configuration files into the json format used in Serengeti.
1) Copy the directory $HADOOP_HOME/conf/ in your existing Hadoop cluster to the Serengeti
Server.
2) Execute convert-hadoop-conf.rb /path/to/hadoop_conf/ in bash shell and it will print out all the
converted Hadoop configuration attributes in json format.
34
3) Open the cluster spec file and replace the Cluster Level Configuration or Group Level
Configuration with the content printed out step 2.
4) Execute cluster config --name --specFile to apply the new configuration to the existing
clusteror execute cluster create --name --specFile to create a new cluster with your
configuration.
5.4.2.2 Scale Out a Hadoop Cluster
You can scale out to have more Hadoop worker nodes or client nodes after Hadoop cluster is provisioned.
In the following example, the number of instances in worker node group in myHadoop cluster will
increase to 10.
serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
You cannot set a number smaller than current instance number in this version of the Serengeti
virtual appliance.
5.4.2.3 Scale TaskTracker Nodes Rapidly
You can change the number of active TaskTracker nodes rapidly in a running Hadoop cluster or node
group. The selection of TaskTrackers to be enabled or disabled is done with the goal of balancing the
number of TaskTrackers enabled per host in the specified Hadoop cluster or node group.
In this example, the number of active TaskTracker nodes in worker node group in myHadoop cluster is
set to 8:
serengeti>cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
If fewer than 8 TaskTracker nodes were running in the worker node group of myHadoop cluster,
additional TaskTracker nodes are enabled (re-commissioned and powered-on), up to the number
provisioned in the worker node group. If more than 8 TaskTrackers were running in the worker node
group, excess TaskTracker nodes are disabled (decommissioned and powered-off). No action is
performed if the number of active TaskTrackers already equals 8.
If the node group is not specified, the TaskTracker nodes are enabled/disabled such that the total number
of active TaskTrackers is 8 across all the compute node groups in the myHadoop cluster:
serengeti>cluster limit --name myHadoop activeComputeNodeNum 8
To enable all the TaskTrackers in the myHadoop cluster, use the cluster unlimit command:
serengeti>cluster unlimit --name myHadoop
This command is especially useful to fix any potential mismatch between the number of active
TaskTrackers as seen by Hadoop and the number of powered on TaskTracker nodes as seen by the
vCenter.
To enable all TaskTrackers within only one compute node group, specify the name of the node group
using the --nodeGroup option, similar to the cluster limit command.
5.4.2.4 Start/Stop Hadoop Cluster
In the Serengeti shell, you can start (or stop) a whole Hadoop cluster:
serengeti>cluster start --name mycluster
5.4.2.5 View Hadoop Clusters Deployed by Serengeti
In the Serengeti shell, you can list Hadoop clusters deployed by Serengeti.
serengeti>cluster list
35
You can specify which cluster to list.

serengeti>cluster list --name mycluster
You can see details of Hadoop clusters.
serengeti>cluster list --detail
5.4.2.6 Login to Hadoop Nodes
You can login to Hadoop nodes including master, worker, and client nodes with password-less SSH from
Serengeti Management Server using SSH client tools like SSH, PDSH, ClusterSSH, Mussh and etc. to do
trouble shooting or run your own management automation scripts.
Serengeti Management Server is configured to be able to SSH to Hadoop cluster nodes without
password. Other clients or machines can use user name and password to SSH to the Hadoop cluster
nodes.
All of these deployed nodes have random passwords protection. If you want to login to each Hadoop
node directly, please login each node from vSphere client in order to change the password by following
the step in Section 3.2 step 11. Please press Ctrl + D in order to get the login information with the
original random password.
5.4.2.7 Delete a Hadoop Cluster
You can delete a Hadoop cluster you no longer needed.
serengeti>cluster delete --name myHadoop
In this example, myHadoop is the name of the Hadoop cluster you want to delete.
When a Hadoop cluster is deleted, all virtual machines in the cluster are destroyed.
You can delete a Hadoop cluster even though it is running.
5.4.3 Use Hadoop Clusters

5.4.3.1 Run Pig Scripts
You can run Pig script in the Serengeti CLI. For example, you have a Pig script in /tmp/data.pig.
serengeti> pig cfg
serengeti> pig script --location /tmp/data.pig
5.4.3.2 Run Hive Scripts

You can run Hive script in the Serengeti CLI. For example, you have a Hive script in tmp/data.hive.
serengeti>hive cfg
serengeti>hive script location /tmp/data.hive
5.4.3.3 Run HDFS command
You can run HDFS command in the Serengeti CLI. For example, you have file in /home/serengeti/data
and want to put it in your HDFS path /tmp.
serengeti> fs put from /home/serengeti/data to /tmp
36
5.4.3.4 Run Map Reduce job

You can run Map Reduce job in the Serengeti CLI. For example, you get example jar file in
/opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar and want to run pi.
serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass
org.apache.hadoop.examples.PiEstimator --args "10 10"
Make sure you have chosen a cluster as target first in Serengeti CLI. See Chapter 7.2.10.
5.4.3.5 Using Data through JDBC

Using Data through Hive JDBC, you can execute SQL in different programming language, such as Java,
Python and PHP, and so on. The following is a JDBC Client sample of Java code.
1. SSH to the node contains hive server role.
2. Create a Java file HiveJdbcClient.java which contains the Java Sample Code for connecting to the
Hive Server:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
/**
* @param args
* @throws SQLException
**/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e){
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default",
37
"", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value
string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/test_hive_server.txt";
sql = "load data local inpath '" + filepath + "' into table " + tableName;
// select * query
sql = "select * from " + tableName;
while (res.next()){
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
38
}
// regular hive query
sql = "select count(1) from " + tableName;
while (res.next()){
System.out.println(res.getString(1));
}
}
}
3. Running the JDBC Sample Code
a. Then on the command-line
$ javac HiveJdbcClient.java
b. Alternatively, you can run the following bash script, which will seed the data file and build your
classpath before invoking the client.
#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt
HADOOP_CORE=`ls $HADOOP_HOME/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
CLASSPATH=$CLASSPATH:$jar_file_name
done
java -cp $CLASSPATH HiveJdbcClient
For more information of Hive client please visit https://cwiki.apache.org/Hive/hiveclient.html.
39
5.4.3.6 Using Data through ODBC

You can use specified out-of-box ODBC server for Hadoop Hive such as MapR Hive ODBC connector,
Apache Hadoop Hive ODBC Driver, etc.
Take MapR ODBC Connector as an example:
1. Install the MapR Hive ODBC Connector on your Windows 7 Professional or Windows 2008 R2.
2. Create a Data Source Name (DSN) with the ODBC Connectors Data Source Administrator to
connect your remote Hive server.
3. Import rows of HIVE_SYSTEM table in Hive server into excel by connecting to this DSN.
More information about Hive ODBC, please refer to https://cwiki.apache.org/Hive/hiveodbc.html
More information about MapR Hive ODBC Connector, please refer to
www.mapr.com/doc/display/MapR/Hive+ODBC+Connector.
5.5 HBase Clusters

5.5.1 Deploy HBase Clusters
You can customize a HBase cluster by specifying your own spec file. The following is an example:
{
"nodeGroups" : [
{
"name" : "zookeeper",
"roles" : [
"zookeeper"
],
"instanceNum" : 3,
"instanceType" : "SMALL",
"storage" : {
"type" : "shared",
"sizeGB" : 20
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "hadoopmaster",
"roles" : [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum" : 1,
"instanceType" : "MEDIUM",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 2,
"haFlag" : "on",
"configuration" : {
40
}
},
{
"name" : "hbasemaster",
"roles" : [
"hbase_master"
],
"instanceNum" : 1,
"instanceType" : "MEDIUM",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 2,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "worker",
"roles" : [
"hadoop_datanode",
"hadoop_tasktracker",
"hbase_regionserver"
],
"instanceNum" : 3,
"storage" : {
"type" : "local",
"sizeGB" : 50
},
"cpuNum" : 1,
"haFlag" : "off",
"configuration" : {
}
},
{
"name" : "client",
"roles" : [
"hadoop_client",
"hbase_client"
],
"instanceNum" : 1,
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 1,
"haFlag" : "off",
"configuration" : {
}
41
}
],
// we suggest running convert-hadoop-conf.rb to generate "configuration" section and paste the output
here
"configuration" : {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a
sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": ""
},
// "hadoop.root.logger": "DEBUG,DRFA",
// "hadoop.security.logger": "DEBUG,DRFA"
},
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
},
"mapred-queue-acls.xml": {
// check for all settings at
http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
// "mapred.queue.queue-name.acl-submit-job": "",
// "mapred.queue.queue-name.acl-administer-jobs", ""
}
},
"hbase": {
"hbase-site.xml": {
// check for all settings at http://hbase.apache.org/configuration.html#hbase.site
},
"hbase-env.sh": {
// "JAVA_HOME": "",
// "PATH": "",
// "HBASE_CLASSPATH": "",
// "HBASE_HEAPSIZE": "",
42
// "HBASE_OPTS": "",
// "HBASE_USE_GC_LOGFILE": "",
// "HBASE_JMX_BASE": "",
// "HBASE_MASTER_OPTS": "",
// "HBASE_REGIONSERVER_OPTS": "",
// "HBASE_THRIFT_OPTS": "",
// "HBASE_ZOOKEEPER_OPTS": "",
// "HBASE_REGIONSERVERS": "",
// "HBASE_SSH_OPTS": "",
// "HBASE_NICENESS": "",
// "HBASE_SLAVE_SLEEP": ""
},
// "hbase.root.logger": "DEBUG,DRFA"
}
},
"zookeeper": {
"java.env": {
// "JVMFLAGS": "-Xmx2g"
},
// "zookeeper.root.logger": "DEBUG,DRFA"
}
}
}
}
In the example, it has JobTracker and TaskTracker roles compared to the template we mentioned in
section 4.4, which means you can launch a HBase mapreduce job. It separate Hadoop NameNode and
HBase Master roles. The two HBase Master instances,will be protected by HBase internal HA function.
5.5.2 Manage HBase Clusters

HBase cluster has a few more configurable files compared to Hadoop cluster, including hbase-site.xml,
hbase-env.sh, log4j.properties and java.env for Zookeeper nodes. You can refer to HBase official site to
tune your HBase clusters.
Most operations and advanced specifications on Hadoop cluster can also apply to HBase cluster, like
scale out node group, separate data and compute nodes, control placement policy and so on with
following exceptions:
1. Zookeeper nodes are not allowed to scale out in this version;
2. You cannot deploy a compute-only cluster pointing to a HBase cluster to run HBase
mapreduce jobs.
5.5.3 Use HBase Clusters

Serengeti supports most of ways that HBase provides to access the database, including:
1. Do operations through HBase shell;
2. If the HBase cluster deployed has Hadoop JobTracker and TaskTracker roles, you can develop a
HBase mapreduce job to access HBase from the client node. Here is an example:
>hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 3
43
3. Rest-ful Web Service is running on client node and listening on port 8080
>curl I http://<client_node_ip>:8080/status/cluster
4.Thrift gateway is also enabled and listening on port 9090.
5.6 Monitoring Cluster Deployed by Serengeti

Serengeti will create one VM folder for each deployed Serengeti Server. The folder name is SERENGETIvApp-<vApp name>. The vApp name is specified during Serengeti deployment.
For each cluster, two level folders will be created under Serengeti instance folder. First level is the cluster
name, and second level is the node group name.
Node group folder contains all nodes in that node group.
To browse the VM and check VM status in vCenter client, you may select Inventory, VMs and
Templates. The Serengeti folder is listed in the left panel. And then you can check VM nodes following
the folder structure.
If you have installed vCOPs, you can also fetch VM-level metrics including clusters health state, workload,
resource allocation, hardware status and etc. Please refer to vCOPs manual guide for more details.
5.7 Make Hadoop Master Node HA/FT

You can leverage vSphere HA and FT to address the SPOF problem of Hadoop.
1. Make sure you enabled the HA for the cluster where the Hadoop cluster is deployed. Please refer to
for detailed setting steps as needed.
2. Make sure you provide a shared storage for Hadoop to deploy on.
3. By default, Hadoop master node is configured to be protected by vSphere HA.
After doing this, once the master node virtual machine not reachable by vSphere. vSphere will start a new
instance on another available ESXi host to serve Hadoop cluster automatically.
Theres a short downtime when doing the recovery. If you want eliminate the down time, you can use
vSphere FT to protect the master node.
Serengeti support configure FT feature for master nodes. In cluster spec file, you can specify haFlag to
ft to enable FT protection.
...
"name": "master",
"cpuNum": 1,
"haFlag": ft
"storage": {
"type": "SHARED",
}
By using the cluster spec, master node of the Hadoop cluster is protected by vSphere FT. When one
master is not reachable, vSphere will switch traffic to the standby virtual machine immediately. So theres
no failover downtime.
Please refer to Apache Hadoop 1.0 High Availability Solution on VMware vSphere for more
information.
44
5.8 Hadoop Topology Awareness

You can make the Hadoop cluster topology aware when you create a cluster with the option of --topology
from CLI. By --topology, we support 3 types of topology awareness: HVE, RACK_AS_RACK,
HOST_AS_RACK.
Here is an example to create a cluster with the topology of HVE.
serengeti>cluster create --name myHadoop --topology HVE --distro HVE-supported_Distro
2
HVE stands for Hadoop Virtualization Extensions . HVE refines Hadoops replica placement, task
scheduling and balancer policies. Hadoop clusters implemented on virtualized infrastructure have full
awareness of the topology on which they are running. Thus, the reliability and performance of these
clusters
are
enhanced.
For
more
information
about
HVE,
you
can
refer
to
https://issues.apache.org/jira/browse/HADOOP-8468.
RACK_AS_RACK stands for the standard topology in existing Hadoop 1.0.x, where only rack and host
information are exposed to Hadoop.
HOST_AS_RACK is a simplified topology of RACK_AS_RACK when all the physical hosts for Serengeti
are on a single rack. In this case, each physical host will be treated as a rack in order to avoid all HDFS
data replicas are placed in a physical host in some worst cases.
HVE is the recommended topology in Serengeti if a distro supports HVE. Otherwise, we recommend
using RACK_AS_RACK topology in multiple rack environments. HOST_AS_RACK is used only when one
rack exists for Serengeti or no rack information at all.
In addition, when you decide to enable HVE, or RACK_AS_RACK, you need to upload the rack and
physical host information to Serengeti through CLI command below before you create a topology
awareness cluster.
serengeti>topology upload --fileName name_of rack_hosts_mapping_file
Here is a sample of the rack and physical hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
In this sample, physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.
After a cluster is created with the selected topology option, you can view the allocated nodes on each
rack with:
serengeti>cluster list --name cluster-name --detail
5.9 Start and Stop Serengeti Services

You can stop and start Serengeti service to make a configuration take effect or to recover from an
abnormal situation.
You can run the following command in a Linux shell to stop the Serengeti service.
$ sudo serengeti-stop-services.sh
You can run the following command in a Linux shell to start the Serengeti service.
$ sudo serengeti-start-services.sh
2
HVE is currently supported on Greenplum HD 1.2.

45
6. Cluster Specification Reference

Cluster specification is a JSON text file. Heres a longer example with line number. Same file without line
number is attached as appendix.
1{
2 "nodeGroups" : [
3
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
10
"instanceType": "LARGE",
11
"cpuNum": 2,
12
"memCapacityMB":4096,
13
"storage": {
14
"type": "SHARED",
15
"sizeGB": 20
16
},
17
"haFlag":"on",
18
"rpNames": [
19
"rp1"
20
21
},
22
23
"name": "data",
24
"roles": [
25
"hadoop_datanode"
26
],
27
"instanceNum": 3,
28
"instanceType": "MEDIUM",
29
"cpuNum": 2,
30
46
31
"storage": {
32
"type": "LOCAL",
33
"sizeGB": 50
34
35
36
37
"groupRacks": {
38
39
40
41
42
},
43
44
"name": "compute",
45
"roles": [
46
47
],
48
"instanceNum": 6,
49
"instanceType": "SMALL",
50
"cpuNum": 2,
51
52
"storage": {
53
"type": "LOCAL",
54
"sizeGB": 10
55
56
57
58
"groupAssociations": [{
59
60
"type": "STRICT"
61
}]
62
63
},
64
65
"name": "client",
47
66
"roles": [
67
"hadoop_client",
68
"hive",
69
"hive_server",
70
"pig"
71
],
72
"instanceNum": 1,
73
"instanceType": "SMALL",
74
75
"storage": {
76
"type": "LOCAL",
77
"sizeGB": 10,
78
"dsNames": [ds1, ds2]
79
80
81 ],
82 "configuration": {
83 }
84 }
It defines 4 node groups.
Line 3 to 21 defines a node group named master.
Line 22 to 42 defines a data node group named data.
Line 43 to 63 defines a compute node group named compute.
Line 64 to 83 defines a client node group.
Line 3 to 21 is an object defines the master node group. The attributes are as follows.
Line 4 defines the name of the node group. Attribute name is name. Value is master.
Line 5 to 8 defines role of the node group. Attribute name is role. Value is hadoop_ namenode
and hadoop_jobtracker. It means hadoop_namenode and hadoop_jobtracker will be deployed
to the virtual machine in the group.
You can see available roles by distro list command.
Line 9 defines number of instances in the node group. Attribute name is instanceNum. Attribute
value is 1. It means therell be only one virtual machine created for the group.
You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig,
and hive. But you can have only one instance for hadoop_namenode and hadoop_jobtracker.
Line 10 defines the instance type in the node group. Attribute name is instanceType. Value is
LARGE. The instance types are predefined virtual machine spec. They are combinations of
48
number of CPUs, RAM sizes, and storage size. The predefined number can be overridden by the
cpuNum, memCapacityMB and storage specified in the file.
Line 11 defines number of CPUs per virtual machine. Attribute name is cpuNum. Value is 2. Itll
override the number of CPUs of the predefined virtual machine spec.
Line 12 defines RAM size per virtual machine. Attribute name is "memCapacityMB". Value is
4096. It will override the RAM size of the predefined virtual machine spec.
Line 13 to 16 defines the storage requirement of the node group. Its an object. Object name is
storage.
o
Line 14 defines the storage type. Its an attribute of storage object. Attribute name is
type. Value is SHARED. It means it is required that Hadoop data must be stored in
shared storage.
Line 15 defines the storage size. Its an attribute of storage object. Attribute name is
sizeGB. Value is 20. It means therell be 20GB disk for Hadoop to use.
Line 17 defines if HA applies to the node. The attribute name is haFlag. The value is on. It
means the virtual machine in the group is protected by vSphere HA.
Line 18 to 20 defines the resourcepools which the node group must be associated with. The
attribute name is rpNames. The value is an array, which contains one resourcepool rp1.
You can see same structure for other 3 node groups. One more thing is for data and compute groups,
we specify a pair of comprehensive placement constraints:
Line 35 to 41 defines the placement constraints for the data node group. The attribute name is
placementPolicies and the value is a hash which contains instancePerHost and groupRacks.
The contraint means you need at least 3 esx hosts because this group requires 3 instances and
forces putting 1 instance on each one host, furthermore, this group will be provisioned on hosts
on rack1, rack2 and rack3 by using ROUNDROBIN algorithm.
Line 56 to 62 defines the placement constraints for the compute node group which contains
instancePerHost and groupAssociations. The contraint means you also need at least 3 esx
hosts for the same reason and this group is STRICT associated to node group data for better
performance.
You can customize Hadoop configuration by configuration attribute on line 82 to 83, which happens to
be empty in the sample.
You can modify value of the attributes, and you can also remove the optional value if you dont care.
Following is definition for the outer most attributes in a cluster spec:
Attribute
Type
Mandatory/optional
Description
nodeGroups
object
Mandatory
It contains one or more group specification, and

the details can be found in below table.
configuration
object
Optional
Customizable Hadoop configuration key/value

pairs.
externalHDFS
string
Optional
URI of external HDFS (only valid for a compute

only cluster)
49
Following is the definition of the objects and attributes for a particular node group.
Attribute
Type
Mandatory/Optional
Description
name
string
Mandatory
User defined node group name.
roles
list of
string
Mandatory
A list of software packages or services will be

installed in the virtual machines in the node
group. The item must be exactly the same as
you saw by distro list
instanceNumber
integer
Mandatory
How many virtual machines in the node group.

It must be a positive integer. For
hadoop_namenode and hadoop_jobtracker, it
must be 1.
instanceType
string
Optional
Size of virtual machines in the node group. Its

the name of predefined virtual machine
template. It can be SMALL, MEDIUM,
LARGE, and EXTRA_LARGE.
The cpuNum, memCapacityMb, and
Storage.sizeGB will overwrite this attribute if
they are all be defined in the same node group.
cpuNum
integer
Optional
Number of vCPUs per virtual machine
memCapacityMb
integer
Optional
Number of RAMs in MB per virtual machine
Storage
object
Optional
Storage settings
type
string
Optional
It can be LOCAL or SHARED.
sizeGB
integer
Optional
Data storage size. It must be a positive integer.
dsNames
list of
string
Optional
Datastores the node group can use.
rpNames
list of
string
Optional
Resourcepools the node group can use.
haFlag
string
Optional
It can be on, off or ft. on means use HA to

protect the node group, ft means use vSphere
FT to protect the node group.
By default, name node and job tracker are
protected by vSphere HA.
placementPolicies
object
Optional
It can contains three optional constraints:

"instancePerHost", "groupRacks" and
"groupAssociations", refer to 5.3.2 for details.
50
Serengeti comes with predefined virtual machine specification.

SMALL
MEDIUM
LARGE
EXTRA_LARGE
Number of vCPU
RAM
3.75GB
7.5GB
15GB
30GB
Disk size for Hadoop master data
25GB
50GB
100GB
200GB
Disk size for Hadoop worker data
50GB
100GB
200GB
400GB
Disk size for Hadoop client data
50GB
100GB
200GB
400GB
When creating virtual machine, Serengeti will try to allocate datastore on the preferred type. SHARED
storage is preferred for master and clients. LOCAL storage is preferred for workers.
Separate disks are created for OS and swap.
7. Serengeti Command Reference

7.1 connect
Connect and login to remote Serengeti server.
Parameter Mandatory/Optional Description
--host
Mandatory
Specify the Serengeti web service URL in format <Serengeti

Management Server ip or host>:<port>. By default, the Serengeti web
service is started at port 8080.
--username Optional
The Serengeti user name
--password Optional
The Serengeti password
The command will read username and password in interactive mode. Section 5.1 describes how to
manage Serengeti users.
If connect failed, or do not run connect command, the other Serengeti command is not allowed to be
executed.
7.2 cluster
7.2.1 cluster config
Modify Hadoop configuration of an existing default or customized Hadoop cluster in Serengeti.
Parameter
Type
Description
--name <cluster name in

Serengeti>
Mandatory Specify the Hadoop cluster name in Serengeti.
--specFile <spec file path>
Optional
Specify the Hadoop cluster's specification in a customized file.
51
--yes
Optional
Answer y to Y/N confirmation. If not specified, the users need

to answer y or n explicitly.
7.2.2 cluster create

Create a default/customized Hadoop cluster in Serengeti.
Parameter
Mandatory/Optional Description
--name <cluster name

in Serengeti>
Mandatory
Specify the Hadoop cluster name in Serengeti.
--type <cluster type>
Optional
Specify the cluster type. Hadoop or HBase is supported.

The default one is Hadoop.
--specFile <spec file

path>
Optional
Specify the Hadoop cluster's specification in a customized

file
--distro <Hadoop distro Optional

name>
Specify which distro will be used to deploy Hadoop

cluster. The distros includes Apache Hadoop, Greenplum
HD, CDH3 and HDP1.
--dsNames <datastore
names>
Optional
Specify which datastore will be used to deploy Hadoop

cluster in Serengeti. By default, it will use the same one
with Serengeti virtual machine. Multiple datastores can be
used, separated by ,.
--networkName
<network name>
Optional
Specify which network will be used to deploy Hadoop

cluster in Serengeti. By default, it will use the same one
with Serengeti virtual machine.
--rpNames <resource
pool name>
Optional
Specify which resource pool will be used to deploy

Hadoop cluster Serengeti. By default, it will use the same
one with Serengeti virtual machine. Multiple resource
pools can be used, separated by ,.
--resume
Optional
If resume is specified, this command will recover a

creation process which cluster is deployed failed.
--topology <topology
type>
Optional
Specify which topology type will be used for rack

awareness: HVE, RACK_AS_RACK, or
HOST_AS_RACK.
--yes
Optional
Answer y to Y/N confirmation. If not specified, the users

need to answer y or n explicitly.
--skipConfigValidation
Optional
Skip cluster configuration validation.
If the cluster spec does not include required nodes, for example master node, Serengeti will generate
them with a default configuration.
52
7.2.3 cluster delete

Delete a Hadoop cluster in Serengeti.
Parameter
--name <cluster name> Mandatory
Delete a specified Hadoop cluster in Serengeti.
7.2.4 cluster export

Export cluster information.
--spec
Mandatory
Export cluster specification. The exported cluster specification can be

used in cluster create or cluster config command.
--output
Optional
Specify the output file name for exported cluster information.

If not specified, the output will be displayed in the console.
7.2.5 cluster limit

Enable or disable provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti
to reach the limit specified by activeComputeNodeNum. Compute nodes are re-commissioned and
powered-on, or decommissioned and powered-off to reach the specified number of active compute nodes.
Parameter
--name <cluster_name>
Mandatory
Name of the Hadoop cluster in Serengeti
--nodeGroup
<node_group_name>
Optional
Name of a node group in the specified Hadoop cluster

in Serengeti (supports node groups with task tracker
role only)
-activeComputeNodeNum
<number>
Mandatory
Number of active compute nodes for the specified

Hadoop cluster or node group within that cluster.
The valid value range is integers larger or equal to
zero.
- For zero value, all the nodes in the specific
Hadoop cluster or the specific node group (if -nodeGroup value is specified) will be
decommissioned and powered off.
- For integer value between 1 and the max
node number of a Hadoop cluster or the node
group (if --nodeGroup value is specified), the
specific number of nodes will stay
53
commissioned and powered on, other nodes

will be decommissioned.
- For integer value larger than the max node
number of a Hadoop cluster or the node group
(if --nodeGroup value is specified), all the
nodes in the specific Hadoop cluster or the
specific node group (if --nodeGroup value is
specified) will be re-commissioned and
powered on.
7.2.6 cluster list

List all Hadoop clusters in Serengeti.
Parameter
--name <cluster
name in
Serengeti>
Optional
List the specified Hadoop cluster in Serengeti including name,

distro, status, each role's information. For each role, it will list
instance count, CPU, memory, type and size.
--detail
Optional
List all the Hadoop clusters' details including name in

Serengeti, distro, deploy status, each nodes information in
different roles.
Note: with this option specified, Serengeti will query from
vCenter server to get the latest node status. That operation
may take a few seconds for each cluster.
For example:
54
7.2.7 cluster resize

Change the number of nodes in a node group.
Parameter
--name <cluster name in

Serengeti>
Mandatory
Specify the target Hadoop cluster in Serengeti.
--nodeGroup <name of
the node group>
Mandatory
Specify the target role which will be scaled out in

Hadoop cluster deployed by Serengeti.
--instanceNum <instance
number>
Mandatory
Specify the target count which will be scaled out to.

The target count needs to be more that original.
Example:
Cluster resize --name foo --nodeGroup slave --instanceCount 10
7.2.8 cluster start

Start a Hadoop cluster in Serengeti.
Parameter
Start a specified Hadoop cluster in Serengeti.
55
7.2.9 cluster stop

Stop a Hadoop cluster in Serengeti.
Parameter
Stop a specified Hadoop cluster in Serengeti.
7.2.10 cluster target

Connect to one Hadoop cluster to interact with it by Serengeti CLI, including run fs, mr, pig, and hive
commands.
Parameter
Mandatory/Optional
Description
--name <cluster name> Optional
The name of the cluster to connect to. If user dont specify

this parameter, the first cluster listed by cluster list
command will be used
--info
Show to targeted cluster information, such as the HDFS

URL, Job Tracker URL and Hive server URL.
Optional
Note: --name and info can not be used together.
7.2.11 cluster unlimit

Enable all of the provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti.
Compute nodes are re-commissioned and powered-on as necessary.
Parameter
--name <cluster_name>
Mandatory
Name of the Hadoop cluster in Serengeti
--nodeGroup
<node_group_name>
Optional
Name of a node group in the specified Hadoop cluster

in Serengeti (only supports node groups with task
tracker role)
7.3 datastore
7.3.1 datastore add
Add a datastore to Serengeti for deploying.
Parameter
--name <datastore
name in Serengeti>
Mandatory
Specify the name of datastore added to Serengeti
--spec <datastore
name in VCenter>
Mandatory
Specify datastore name in vSphere. User can use wild

card to specify multiple vmfs store. * and ? are
56
supported in wild card.

--type <datastore type:
LOCAL|SHARE>
Mandatory
Specify datastore type in vSphere: local storage or

shared storage.
7.3.2 datastore delete

Delete a datastore from Serengeti.
Parameter
--name <datastore name in

Serengeti>
Mandatory
Delete a specified datastore in

Serengeti.
7.3.3 datastore list

List datastores added to Serengeti.
Parameter
--name <Name of datastore name

in Serengeti>
Optional
List the specified datastore information

including name, type.
--detail
Optional
List the datastore details including datastore

path in vSphere.
All datastores that are added to Serengeti are listed if the name is not specified.
For example:
57
7.4 distro
7.4.1 distro list
Show what are the roles offered in a distro.
Parameter
--name <distro name> Optional
List the specified distro information.
For example:
7.5 disconnect
Disconnect and logout from remote Serengeti server. After disconnect, user is not allowed to run any CLI
commands.
7.6 fs
7.6.1 fs cat
Copy source paths to stdout.
<file name> Mandatory
The file to be showed in the console. Multiple files must be quoted,

such as /path/file1 /path/file2
7.6.2 fs chgrp
Change group association of files.
Parameter
Mandatory/Optional
Description
--group <group name> Mandatory
The group name of the file
--recursive true|false
Optional
make the change recursively through the directory

structure
<file name>
Mandatory
The file whose group to be changed. Multiple files

must be quoted, such as /path/file1 /path/file2
7.6.3 fs chmod
Change the permissions of files.
Parameter
Mandatory/Optional
Description
58
--mode <permission mode>
Mandatory
The file permission mode, such as 755
--recursive true|false
Optional
make the change recursively through the directory

structure
<file name>
Mandatory
The file whose permission to be changed. Multiple

files must be quoted, such as /path/file1
/path/file2
7.6.4 fs chown
Change the owner of files.
Parameter
--owner <permission Mandatory

mode>
The file owner name
--recursive true|false Optional
make the change recursively through the directory structure
<file name>
The file whose owner to be changed. Multiple files must be

quoted, such as /path/file1 /path/file2
Mandatory
7.6.5 fs copyFromLocal
Copy single source file, or multiple source files from local file system to the destination file system. It is
the same as put.
Parameter
--from <local file

path>
Mandatory
The file path in local. Multiple files must be quoted, such as

/path/file1 /path/file2
--to <HDFS file

path>
Mandatory
The file path in HDFS. If --from is multiple files, --to is

directory name.
7.6.6 fs copyToLocal
Copy files to the local file system. It is the same as get.
Parameter
--from < HDFS file path >
Mandatory
The file path in HDFS. Multiple files must be quoted,

such as /path/file1 /path/file2
--to < local file path >
Mandatory
The file path in local. If --from is multiple files, --to is

directory name.
59
7.6.7 fs copyMergeToLocal
Takes a source directory and a destination file as input and concatenates the files in the HDFS directory
into the local file system.
Parameter
Mandatory/Optional
Description
--from < HDFS file path >
Mandatory
The file path in HDFS. Multiple files must be quoted,

such as /path/file1 /path/file2.
--to < local file path >
Mandatory
The file path in local.
--endline <true|false>
Optional
Whether add end line character.
7.6.8 fs count
Count the number of directories, files, bytes, quota, and remaining quota.
Parameter
Mandatory/Optional
Description
--path < HDFS path >
Mandatory
The path to be counted.
--quota <true|false>
Optional
Whether with quota information.
7.6.9 fs cp
Copy files from source to destination. This command allows multiple sources as well in which case the
destination must be a directory.
Parameter
--from <HDFS source

file path>
Mandatory
The file path in local. Multiple files must be quoted, such

as /path/file1 /path/file2
--to <HDFS destination

file path>
Mandatory

directory name.
7.6.10 fs du
Displays sizes of files and directories contained in the given directory or the length of a file in case its just
a file.
<file name> Mandatory
The file to be showed in the console. Multiple files must be quoted,

7.6.11 fs expunge
Empty the trash bin in the HDFS.
60
7.6.12 fs get
Copy files to the local file system.
Parameter
--from < HDFS file

path >
Mandatory
The file path in HDFS. Multiple files must be quoted, such as

/path/file1 /path/file2.
--to < local file

path >
Mandatory
The file path in local. If --from is multiple files, --to is

directory name.
7.6.13 fs ls
List files in the directory.
Parameter
Mandatory/Optional
Description
<path name>
Mandatory
The path to be listed. Multiple files must be quoted,

--recursive <true|false>
Optional
Whether list the directory with recursion.
7.6.14 fs mkdir
Create a new directory.
Parameter
<dir name>
Mandatory
The directory name to be created.
7.6.15 fs moveFromLocal
Similar to put command, except that the source local file is deleted after it is copied.
Parameter
Mandatory/Optional
Description
--from <local file path> Mandatory

as /path/file1 /path/file2.
--to <HDFS file path>

directory name.
Mandatory
7.6.16 fs mv
Move source files to destination in the HDFS.
Parameter
Mandatory/Optional
Description
--from <dest file path>
Mandatory
61
--to <source file path>

directory name.
Mandatory
7.6.17 fs put
Copy single src, or multiple srcs from local file system to the HDFS.
Parameter
Mandatory/Optional
Description
--from <local file path> Mandatory

--to <HDFS file path>

directory name.
Mandatory
7.6.18 fs rm
Remove files in the HDFS.
Parameter
< file path>
Mandatory
The file to be removed.
Optional
Remove files with recursion.
--skipTrash <true|false>
Optional
Bypass trash.
7.6.19 fs setrep
Change the replication factor of a file
Parameter
Mandatory/Optional
Description
--path < file path>
Mandatory
The path to be changed replication factor.
--replica <replica number>
Mandatory
Number of replicas.
Optional
Whether set replica with recursion.
--waiting <true|false>
Optional
Whether wait for the replica number is equal to the

number.
7.6.20 fs tail
Display last kilobyte of the file to stdout.
Parameter
Mandatory/Optional
Description
<file path>
Mandatory
The file path to be displayed.

62
--file <true|false>
Optional
Show content while the file grows.
7.6.21 fs text
Take a source file and output the file in text format.
<file path> Mandatory
The file path to be displayed.
7.6.22 fs touchz
Create a file of zero length.
<file path> Mandatory
The file name to be created.
7.7 hive
7.7.1 hive cfg
Configure Hive.
Parameter
Mandatory/Optional
Description
--host <server host >
Optional
The server host.
--port <server port>
Optional
The server port.
--timeout
Optional
The timeout in milliseconds.
7.7.2 hive script

Execute a Hive script. Note: You need to run hive cfg before running a hive script.
Parameter
--location <script path>
Mandatory
The hive script file name to be executed.
7.8 mr
7.8.1 mr jar
Run a MapReduce job located inside the provided jar.
Parameter
Mandatory/Optional
Description
63
--jarfile <jar file path>
Mandatory
The jar file path.
--mainclass <main class name>
Mandatory
The class which have main() method.
--args <arg>
Optional
The arguments to the main class. If there are

multiple arguments, they must be double
quoted.
7.8.2 mr job counter

Print the counter value of the MR job.
Parameter
--jobid <job id>
Mandatory
The MR job id.
--groupname <group name>
Mandatory
The counters group name.
--countername <counter name>
Mandatory
The counters name.
7.8.3 mr job events

Print the events' detail received by JobTracker for the given range.
Parameter
--jobid <job id>
Mandatory
The MR job id.
--from < from-event-#>
Mandatory
The start number of events to be printed.
--number < #-of-events>
Mandatory
The total number of events to be printed.
7.8.4 mr job history

Print job details, failed and killed job details.
Parameter
Mandatory/Optional
Description
<job history directory>
Mandatory
The directory where job history files put.
--all <true|false>
Optional
Print all jobs information.
7.8.5 mr job kill

Kill the MR job.
Parameter
Mandatory/Optional
Description
--jobid <job id>
Mandatory
The job id.
64
7.8.6 mr job list

List MR jobs.
Parameter
Mandatory/Optional
--all <true|false> Optional
Description
Whether list all jobs.
7.8.7 mr job set priority

Change the priority of the job.
Parameter
--jobid <jobid>
Mandatory
--priority
Mandatory
<VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW>
The job id.

The jobs priority.
7.8.8 mr job status

Query MR job status.
Parameter
Mandatory/Optional
Description
--jobid <jobid>
Mandatory
The job id.
7.8.9 mr job submit

Submit a MR job defined in the job file.
Parameter
Mandatory/Optional
Description
--jobfile <jobfile>
Mandatory
Specify the file which define the MR job. The file is

standard Hadoop configuration. One example configuration
file is as following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.jar</name>
<value>/home/hadoop/hadoop-1.0.1/hadoop-examples1.0.1.jar</value>
</property>
<property>
65
<name>mapred.input.dir</name>
<value>/user/hadoop/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/hadoop/output</value>
</property>
<property>
<name>mapred.job.name</name>
<value>wordcount</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.WordCount.Tokeni
zerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount.IntSum
Reducer</value>
</property>
</configuration>
7.8.10 mr task fail

Fail the Map Reduce task.
Parameter
Mandatory/Optional
Description
--taskid <taskid>
Mandatory
Specify the task id.
7.8.11 mr task kill

Kill the Map Reduce task.
Parameter
Mandatory/Optional
Description
--taskid <taskid>
Mandatory
Specify the task id.
7.9 network
7.9.1 network add
Add a network to Serengeti.
Parameter
--name <network name in Serengeti>
Mandatory
Specify the name of network resource

added to Serengeti
66
--portGroup <port group name in

vSphere>
Mandatory
Specify the name of port group in vSphere

which user wants to add to Serengeti
--dhcp
Combination 1
Specify the IP address assignment type,

DHCP.
--ip <IP Spec, an IP address range

looks like xx.xx.xx.xx-xx[,xx]*>
--dns <dns server ip>
--secondaryDNS <dns server ip>
--gateway <gateway IP>
--mask <network mask>
Combination 2
Specify the IP address assignment type,

static IP.
For example:
>network add --name ipNetwork --ip 192.168.1.1-100,192.168.1.120-180 --portGroup pg1 --dns
202.112.0.1 --gateway 192.168.1.255 --mask 255.255.255.1
>network add --name dhcpNetwork --dhcp --portGroup pg1
7.9.2 network delete

Delete a network in Serengeti.
Parameter
--name <network name in Serengeti> Mandatory
Delete the specified network in Serengeti.
7.9.3 network list

List available networks in Serengeti.
Parameter
--name <network name in Serengeti> Optional
List the specified network in Serengeti

including name, port group in vSphere, IP
address assignment type, assigned IP
address and so on.
--detail
List the network detail information in

Serengeti including Hadoop cluster node's
network information.
Optional
For example:
67
7.10 pig script

7.10.1 pig cfg
Configure Pig.
Parameter
--props
Optional
Specify the Pig properties file location.
--jobName
Optional
Specify the job name.
--jobPriority
Optional
Specify the job priority.
--jobTracker
Optional
Specify the job tracker.
--execType
Optional
Specify the execution type.
--validateEachStatement
Optional
Validation of each statement or not.
7.10.2 pig script

Execute a Pig script. Note: You need to run pig cfg before running this command.
Parameter
--location <script path> Mandatory
Specify the name of the script to be executed.
7.11 resourcepool
7.11.1 resourcepool add
Add a resource pool in vSphere to Serengeti.
Parameter
Mandatory/Optional
Description
--name <resource pool name in Serengeti>
Mandatory
Specify the name of resource pool

added to Serengeti.
--vccluster <vSphere cluster of the resource Mandatory
Specify the vSphere cluster name in

68
pool>
--vcrp <vSphere resource pool name>
vSphere where the resource pool is

in.
Mandatory
Specify the vSphere resource pool

in vSphere which is added to
Serengeti for deploying. The
vSphere resource pool must be
directly under a cluster.
Parameter
Mandatory/Optional
Description
Mandatory
Remove specified resource pool

from Serengeti.
Parameter
Mandatory/Optional
Description
Optional
List the specific resource pool

name, path.
--detail
Optional
List each resource pool's general

information and Hadoop cluster'
node in this resource pool.
7.11.2 resourcepool delete

Remove a resource pool from Serengeti.
7.11.3 resourcepool list

List resource pools added to Serengeti.
All resource pools that are added to Serengeti are listed if a name is not specified. For each resource
pool, NAME, PATH are listed. NAME is the name in Serengeti. PATH is the combination of the vSphere
cluster name and resource pool name, separated by /.
For example:
69
7.12 topology
7.12.1 topology upload
Upload a rack-hosts mapping topology file to Serengeti. The new uploaded file will overwrite the existing
file. The accepted file format looks like: for each line, rackname: hostname1, hostname2
Hostname1,hostname2, stands for the host name displayed in vSphere.
Parameter
--fileName <topology file name> Mandatory
Specify the topology file name.
--yes
Answer y to Y/N confirmation.
Optional
7.12.2 topology list

List rack-hosts mapping topology stored in Serengeti.
8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the Datacenter
and select "New Cluster..." Follow new Cluster Wizard using the following settings:
Enable vSphere HA and vSphere DRS
Enable Host Monitoring
Enable Admission Control and set desired policy. (Default policy is to tolerate 1 host failure)
Virtual machine restart priority High
Virtual machine Monitoring virtual machine and Application Monitoring
Monitoring sensitivity High

70
8.1.2 Enable DRS/HA on an existing cluster

If DRS or HA is not already enabled on an existing cluster, it can be enabled by right-clicking the cluster
and selecting Edit Settings. Under Cluster Features, select "Turn On vSphere DRS" and "Turn On
vSphere HA". Use settings specified in "Setup Cluster" above.
8.1.3 Add Hosts to Cluster

In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the Cluster
that was just created and select "Add Host...". Follow the Add Host Wizard to add a Host. Repeat for each
additional Host.
8.1.4 DRS/FT Settings

In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, click a host in the
cluster. On the right side there will be a row of tabs near the top of the window, click on Configuration
then click on Networking. The window will display vSwitch port groups. By default A VMkernel Port called
Management Network is pre-configured. Click Properties... of the vSwitch, choose the Management
Network and click the Edit button. Enable vMotion and Fault Tolerance Logging from the
Management Network Properties window.
To verify the FT status of a host, click on the Summary tab and locate Host Configured for FT in the
general section. If there are any issues with FT they will be shown here.
8.1.5 Enable FT on specific virtual machine

Fault Tolerance runs one virtual machine on two separate hosts, it allows for instant failover in a variety of
situations. Before enabling FT ensure the necessary requirements are met:
Host hardware is listed in the VMware Hardware Compatibility List (HCL)
All hosts in the cluster have Hardware VT enabled in the BIOS
The Management Network (VMkernel Port) has vMotion and "Fault Tolerance Logging"
enabled
Available capacity in the cluster
Virtual machine disks are thick provisioned, without snapshots and located on shared storage
Virtual machine is single vCPU
In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right click the virtual
machine and select Fault Tolerance, Turn On Fault Tolerance.
8.2 Network Settings

Serengeti currently deploys using a single network. Virtual machines are deployed with one NIC which is
attached to a specific Port Group. How this Port Group is configured and the network backing the Port
Group depends on the environment. Here we will cover a basic network configuration that may be
customized as needed.
71
Either a vSwitch or vSphere Distributed Switch can be used to provide the Port Group backing a
Serengeti cluster. vDS acts as a single virtual switch across all attached hosts while a vSwitch is per-host
and requires the Port Group to be configured manually.
8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)

In the vCenter Client, select Inventory, Networking. Right Click the Datacenter and select New
vSphere Distributed Switch.
Using the Create vSphere Distributed Switch wizard. Choose Switch Version 5.0. Enter a name and
number of uplink ports (physical adapters) you require.
On the Add Hosts and Physical Adapters step, select the adapter(s) on each host that will carry traffic to
the switch.
On the last step it will create a default Port Group. You can rename this Port Group after it is created and
the wizard is completed.
8.2.2 Setup Port Group - Option B (vSwitch)

In the vCenter Client, select Inventory, Hosts and Clusters. Navigate to the Networking section of the
Configuration Tab. Make sure the vSphere Standard Switch view is selected.
There is already vSwitch0 created by default. You may add a Port Group to this vSwitch or create a new
vSwitch that binds to different physical adapters.
To create a Port Group on the existing vSwitch click Properties on that vSwitch and then click the
Add button. Follow the wizard to create the Port Group.
To create a new vSwitch, click on Add Networking and follow the Add Network Wizard.
8.3 Storage Settings

Serengeti provisions virtual machines on shared storage to enable vSphere HA, FT and DRS features.
Local datastores are attached to virtual machines to be used for data.
8.3.1 Shared Storage Setting

Create LUN on Shared Storage (SAN/NAS) and verify it is accessible by all hosts in the cluster. For
vSphere HA Datastore Heartbeat feature two datastores are required.
8.3.2 Local Storage Settings

8.3.2.1 Configure DAS on Physical Hosts
Direct Attached Storage should be attached and configured on the physical controller to present each
disk separately to the OS. This configuration is commonly described as JBOD (Just A Bunch Of Disks) or
single disk RAID0.
8.3.2.2 Provision VMFS Datastores on DAS of Each Host
Create VMFS Datastores on Direct Attached Storage. This can be done in either of the following two
ways.
Manually using the vSphere Client, the vSphere Management Assistant
Automation by vSphere PowerCLI
72
8.3.2.2.1 Manually Using vSphere Client (Manual per disk):

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Expand Cluster => Select Host

Go to "Configuration" Tab
Choose "Storage"
Click "Add Storage..."
This will start the Add a Storage Wizard. In the wizard, continue the steps.
Select "Disk/LUN" for Storage Type => Next
Select a Local Disk from the list => Next
Select "VMFS-5" for File System Version => Next => Next
Enter Datastore Name => Next
"Maximum Available Space" => Next
Finish
8.3.2.2.2 Automation by vSphere PowerCLI

This method requires you have a vSphere PowerCLI installed. You can refer to vSphere PowerCLI site to
download and install PowerCLI.
Once the PowerCLI is installed, you can use it to format many Direct Attached Storages to VMFS at a
time.
1. Select Start > Programs > VMware > VMware vSphere PowerCLI.
The VMware vSphere PowerCLI console window opens.
2. In the VMware vSphere PowerCLI console window, run PowerCLI commands to format the disks.
CAUTION
The commands will apply to multiple ESXi hosts at a time. Make sure the scope is what you intended
to before you run a command.
Heres a sample script of provisioning datastores. You can type the commands line by line in
PowerCLI shell.
In this example, it formats local disks in all hosts in a vSphere cluster named My Cluster. The disks
are formatted to VMFS datastores. The prefix of datastore name is abcde.
vSphere PowerCLI - Create Local Datastores for Cluster

# Connect to a vCenter Server.
Connect-VIServer -Server 10.23.112.235 -Protocol https -User admin -Password pass
# Prepare variables.
$i = 0
$localDisks = @{}
$clusterName = "My Cluster"
$datastoreName = "abcde"
# Select Hosts
$vmHosts = Get-VMHost -Location $clusterName
# Get Local Disks
$ldArray = $vmHosts | Get-VMHostDisk | select -ExpandProperty ScsiLun | where {$_.IsLocal -eq
"True"}
73
# Get Primary Disks

$pdArray = $vmHosts | Get-VMHostDiagnosticPartition
# Add Local Disks to Hashtable keyed by CName
foreach($ld in $ldArray) {$localDisks.Add($ld.CanonicalName,$ld)}
# Remove Primary Disks from Local Disk Hashtable
foreach($pd in $pdArray) {$localDisks.Remove($pd.CanonicalName)}
# Create Datastores. Will fail to create for any local disks that are in-use.
foreach ($ld in $localDisks.Values) {$i++; New-Datastore -Vmfs -Name ($datastoreName +
$i.ToString("D3")) -Path $ld.CanonicalName -vmHost $ld.vmHost}
9. Appendix A: Create Local Yum Repository for MapR

9.1 Install a web server to server as yum server
Find a machine or virtual machine with CentOS 5.x 64 bits (or RHEL 5.x 64 bits) which
has Internet access, and install a web server such as Apache/lighttpd on the machine.
Or you can use the Serengeti Management Server if you dont have another machine.
This web server will serve as the yum server. This guide will take installing Apache web
server as an example.
9.1.1 Configure http proxy
First open a bash shell terminal. If the machine needs a http proxy server to connect to
the Internet, set http_proxy env :
# switch to root user
sudo su
export http_proxy=http://< proxy_server:port>
9.1.2 Install Apache Web Server

yum install -y httpd
/sbin/service httpd start
Make sure the firewall on the machine doesn't block the network port 80 used by
Apache web server. You can open a web browser on another machine and navigate to
http://<ip_of _webserver>/ to ensure the default test page of Apache web server shows
up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
74
9.1.3 Install yum related packages
Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo
9.1.4 Sync the remote MapR yum repository
1) Create a new file /etc/yum.repos.d/mapr-m5.repo using vi or other editors with the

following content:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.1/redhat/
enabled=1
gpgcheck=0
protect=1
[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1
2) Mirror the remote yum repository to the local machine:

reposync -r maprtech
reposync -r maprecosystem
This will take several minutes (depending on the network bandwidth) to download all the
RPMs in the remote repository, and all the RPMs are put in new folders named
maprtech and maprecosystem.
9.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache
Web Server. The Document Root folder is /var/www/html/ for Apache by default, and if
you use Serengeti Management Server to set up the yum server, the folder is
/opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/mapr/2
mv maprtech/ maprecosystem/ $doc_root/mapr/2/
2) Create a yum repository for the RPMs:
75
cd $doc_root/mapr/2
createrepo .
3) Create a new file /var/www/html/mapr/2/mapr-m5.repo with the following content:

[mapr-m5]
name=MapR Version 2
baseurl=http://<ip_of_webserver>/mapr/2
enabled=1
gpgcheck=0
protect=1
Please replace the <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/mapr/2/mapr-m5.repo from another
machine.
9.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and only applies if the VMs created by Serengeti Management
Server need a http proxy to connect to the yum repository. You need to configure http
proxy for the VMs as this: on Serengeti Server, add the following content into
/opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.
10. Appendix B: Create Local Yum Repository for CDH4

10.1 Install a web server to server as yum server
Find a machine or virtual machine with CentOS 5.x 64 bits (or RHEL 5.x 64 bits) which
has Internet access, and install a web server such as Apache/lighttpd on the machine.
Or you can use the Serengeti Management Server if you dont have another machine.
This web server will serve as the yum server. This guide will take installing Apache web
server as an example.
10.1.1 Configure http proxy
First open a bash shell terminal. If the machine needs a http proxy server to connect to
the Internet, set http_proxy env :
# switch to root user
76
sudo su
export http_proxy=http://<proxy_server:port>
10.1.2 Install Apache Web Server

yum install -y httpd
/sbin/service httpd start
Make sure the firewall on the machine doesn't block the network port 80 used by
Apache web server. You can open a web browser on another machine and navigate to
http://<ip_of_webserver>/ to ensure the default test page of Apache web server shows
up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
10.1.3 Install yum related packages
Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo
10.1.4 Sync the remote CDH4 yum repository
1) Create a new file /etc/yum.repos.d/cloudera-cdh4.repo using vi or other editors

with the following content:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.1.2/
gpgkey = http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
2) Mirror the remote yum repository to the local machine:

reposync -r cloudera-cdh4
This will take several minutes (depending on the network bandwidth) to download all the
RPMs in the remote repository, and all the RPMs are put in new folder named clouderacdh4.
10.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache
Web Server. The Document Root folder is /var/www/html/ for Apache by default, and if
you use Serengeti Management Server to set up the yum server, the folder is
/opt/serengeti/www/ .
77
doc_root=/var/www/html
mkdir -p $doc_root/cdh/4/
mv cloudera-cdh4/RPMS $doc_root/cdh/4/
2) Create a yum repository for the rpms:

cd $doc_root/cdh/4
createrepo .
3) Create a new file /var/www/html/cdh/4/cloudera-cdh4.repo with the following content:

[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://<ip_of_webserver>/cdh/4/
enabled=1
gpgcheck=0
Please replace the <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/cdh/4/cloudera-cdh4.repo from
another machine.
10.3 Config http proxy for the VMs created by Serengeti Server
This step is optional and only apply if the VMs created by Serengeti Management
Server need a http proxy to connect to the yum repository. You need to config http
proxy for the VMs as this: on Serengeti Server, add the following content into
/opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://< proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.
78

Serengeti User Guide - 0.8

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Serengeti User Guide - 0.8

Încărcat de

Drepturi de autor:

Formate disponibile

VMware, Inc.

Serengeti Users Guide

Serengeti Users Guide

Serengeti Users Guide ......................................................................................................................................6

Serengeti Overview ............................................................................................................................................6

Intended Audience ........................................................................................................................................6

Serengeti Features ................................................................................................................................ 6

Serengeti Architecture Overview.........................................................................................................8

VMware Virtual Infrastructure .....................................................................................................................9

Serengeti Virtual Appliance Requirements ............................................................................................... 9

Serengeti CLI Requirements .......................................................................................................................9

Installing the Serengeti Virtual Appliance ...................................................................................................... 10

Deploy Serengeti ........................................................................................................................................ 10

Quick Start ......................................................................................................................................................... 13

Set up the Serengeti CLI ........................................................................................................................... 13

Deploy a Hadoop Cluster .......................................................................................................................... 13

Deploy a HBase Cluster ............................................................................................................................ 15

Using Serengeti ................................................................................................................................................. 15

Manage Serengeti Users ........................................................................................................................... 15

Add/Delete a User in Serengeti ......................................................................................................... 15

Modify User Password ........................................................................................................................ 16

Manage Resources in Serengeti .............................................................................................................. 16

Add a Datastore ................................................................................................................................... 16

Add a Network ..................................................................................................................................... 16

Add a Resource Pool .......................................................................................................................... 17

View Networks ..................................................................................................................................... 17

View Resource Pools .......................................................................................................................... 17

Remove a Datastore ........................................................................................................................... 18

Remove a Network .............................................................................................................................. 18

Remove a Resource Pool .................................................................................................................. 18

Manage Distros ........................................................................................................................................... 18

Supported Distros ................................................................................................................................ 18

Add a Distro to Serengeti ................................................................................................................... 18

List Distros ............................................................................................................................................ 21

Hadoop Clusters ......................................................................................................................................... 21

Deploy Hadoop Clusters .................................................................................................................... 21

Serengeti Users Guide

Manage Hadoop Clusters .................................................................................................................. 30

Use Hadoop Clusters .......................................................................................................................... 36

HBase Clusters ........................................................................................................................................... 40

Deploy HBase Clusters ...................................................................................................................... 40

Manage HBase Clusters .................................................................................................................... 43

Use HBase Clusters ............................................................................................................................ 43

Monitoring Cluster Deployed by Serengeti ............................................................................................. 44

Make Hadoop Master Node HA/FT .......................................................................................................... 44

Hadoop Topology Awareness ................................................................................................................... 45

Start and Stop Serengeti Services ........................................................................................................... 45

Cluster Specification Reference ..................................................................................................................... 46

Serengeti Command Reference ..................................................................................................................... 51

cluster config ........................................................................................................................................ 51

cluster create ........................................................................................................................................ 52

cluster delete ........................................................................................................................................ 53

cluster export ........................................................................................................................................ 53

cluster list .............................................................................................................................................. 54

cluster resize ........................................................................................................................................ 55

cluster start ........................................................................................................................................... 55

cluster stop ........................................................................................................................................... 56

cluster target ...................................................................................................................................... 56

cluster unlimit ..................................................................................................................................... 56

datastore add ....................................................................................................................................... 56

datastore delete ................................................................................................................................... 57

datastore list ......................................................................................................................................... 57

distro list ................................................................................................................................................ 58