
Serengeti Users Guide

Serengeti 0.8

VMware, Inc.

Contents

1. Serengeti Users Guide
   1.1 Intended Audience
2. Serengeti Overview
   2.1 Serengeti
       2.1.1 Serengeti Features
       2.1.2 Serengeti Architecture Overview
   2.2 Hadoop
   2.3 VMware Virtual Infrastructure
   2.4 Serengeti Virtual Appliance Requirements
   2.5 Serengeti CLI Requirements
3. Installing the Serengeti Virtual Appliance
   3.1 Download
   3.2 Deploy Serengeti
4. Quick Start
   4.1 Set up the Serengeti CLI
   4.2 Deploy a Hadoop Cluster
   4.3 Deploy an HBase Cluster
5. Using Serengeti
   5.1 Manage Serengeti Users
       5.1.1 Add/Delete a User in Serengeti
       5.1.2 Modify User Password
   5.2 Manage Resources in Serengeti
       5.2.1 Add a Datastore
       5.2.2 Add a Network
       5.2.3 Add a Resource Pool
       5.2.4 View Datastores
       5.2.5 View Networks
       5.2.6 View Resource Pools
       5.2.7 Remove a Datastore
       5.2.8 Remove a Network
       5.2.9 Remove a Resource Pool
   5.3 Manage Distros
       5.3.1 Supported Distros
       5.3.2 Add a Distro to Serengeti
       5.3.3 List Distros
       5.3.4 Using a Distro
   5.4 Hadoop Clusters
       5.4.1 Deploy Hadoop Clusters
       5.4.2 Manage Hadoop Clusters
       5.4.3 Use Hadoop Clusters
   5.5 HBase Clusters
       5.5.1 Deploy HBase Clusters
       5.5.2 Manage HBase Clusters
       5.5.3 Use HBase Clusters
   5.6 Monitoring Clusters Deployed by Serengeti
   5.7 Make Hadoop Master Node HA/FT
   5.8 Hadoop Topology Awareness
   5.9 Start and Stop Serengeti Services
6. Cluster Specification Reference
7. Serengeti Command Reference
   7.1 connect
   7.2 cluster
       7.2.1 cluster config
       7.2.2 cluster create
       7.2.3 cluster delete
       7.2.4 cluster export
       7.2.5 cluster limit
       7.2.6 cluster list
       7.2.7 cluster resize
       7.2.8 cluster start
       7.2.9 cluster stop
       7.2.10 cluster target
       7.2.11 cluster unlimit
   7.3 datastore
       7.3.1 datastore add
       7.3.2 datastore delete
       7.3.3 datastore list
   7.4 distro
       7.4.1 distro list
   7.5 disconnect
   7.6 fs
       7.6.1 fs cat
       7.6.2 fs chgrp
       7.6.3 fs chmod
       7.6.4 fs chown
       7.6.5 fs copyFromLocal
       7.6.6 fs copyToLocal
       7.6.7 fs copyMergeToLocal
       7.6.8 fs count
       7.6.9 fs cp
       7.6.10 fs du
       7.6.11 fs expunge
       7.6.12 fs get
       7.6.13 fs ls
       7.6.14 fs mkdir
       7.6.15 fs moveFromLocal
       7.6.16 fs mv
       7.6.17 fs put
       7.6.18 fs rm
       7.6.19 fs setrep
       7.6.20 fs tail
       7.6.21 fs text
       7.6.22 fs touchz
   7.7 hive
       7.7.1 hive cfg
       7.7.2 hive script
   7.8 mr
       7.8.1 mr jar
       7.8.2 mr job counter
       7.8.3 mr job events
       7.8.4 mr job history
       7.8.5 mr job kill
       7.8.6 mr job list
       7.8.7 mr job set priority
       7.8.8 mr job status
       7.8.9 mr job submit
       7.8.10 mr task fail
       7.8.11 mr task kill
   7.9 network
       7.9.1 network add
       7.9.2 network delete
       7.9.3 network list
   7.10 pig
       7.10.1 pig cfg
       7.10.2 pig script
   7.11 resourcepool
       7.11.1 resourcepool add
       7.11.2 resourcepool delete
       7.11.3 resourcepool list
   7.12 topology
       7.12.1 topology upload
       7.12.2 topology list
8. vSphere Settings
   8.1 vSphere Cluster Configuration
       8.1.1 Setup Cluster
       8.1.2 Enable DRS/HA on an existing cluster
       8.1.3 Add Hosts to Cluster
       8.1.4 DRS/FT Settings
       8.1.5 Enable FT on a specific virtual machine
   8.2 Network Settings
       8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)
       8.2.2 Setup Port Group - Option B (vSwitch)
   8.3 Storage Settings
       8.3.1 Shared Storage Settings
       8.3.2 Local Storage Settings
9. Appendix A: Create Local Yum Repository for MapR
   9.1 Install a web server to serve as yum server
       9.1.1 Configure http proxy
       9.1.2 Install Apache Web Server
       9.1.3 Install yum related packages
       9.1.4 Sync the remote MapR yum repository
   9.2 Create local yum repository
   9.3 Configure http proxy for the VMs created by Serengeti Server
10. Appendix B: Create Local Yum Repository for CDH4
   10.1 Install a web server to serve as yum server
       10.1.1 Configure http proxy
       10.1.2 Install Apache Web Server
       10.1.3 Install yum related packages
       10.1.4 Sync the remote CDH4 yum repository
   10.2 Create local yum repository
   10.3 Configure http proxy for the VMs created by Serengeti Server


1. Serengeti Users Guide


The Serengeti Users Guide provides information about installing and using Serengeti to deploy and scale
Hadoop clusters on vSphere.
To help you start with Serengeti, this information includes descriptions of Serengeti concepts and features.
In addition, this information provides a set of usage examples and sample scripts.

1.1 Intended Audience


This book is intended for anyone who needs to install and use Serengeti. The information in this book is
written for administrators and developers who are familiar with VMware vSphere.

2. Serengeti Overview
2.1 Serengeti
The Serengeti virtual appliance is a management service that you can use to deploy Hadoop clusters on
VMware vSphere systems. It is a one-click deployment toolkit that allows you to leverage the VMware
vSphere platform to deploy a highly available Hadoop cluster in minutes, including common Hadoop
components such as HDFS, MapReduce, Pig, and Hive on a virtual platform. Serengeti supports multiple
Hadoop 0.20-based distributions, CDH4 (except YARN), and MapR M5.

2.1.1 Serengeti Features


2.1.1.1 Rapid Provisioning
Serengeti can deploy Hadoop clusters with HDFS, MapReduce, HBase, Pig, Hive client and Hive server
in your vSphere system easily and quickly.
Serengeti includes a provisioning engine, the Apache Hadoop distribution, and a virtual machine template.
Serengeti is preconfigured to automate Hadoop cluster deployment and configuration. With Serengeti,
you can save time in getting started with Hadoop because you do not need to install and configure an
operating system, or download, install and configure each software package on each machine.
2.1.1.2 High Availability
Serengeti takes advantage of vSphere high availability to protect the Hadoop master node virtual
machine. The master node virtual machine can be monitored by vSphere. If the Hadoop NameNode or
JobTracker service stops unexpectedly, vSphere restarts the master node to recover it. If the virtual
machine stops unexpectedly because of a host failure, or becomes unreachable because of a network
problem, vSphere uses FT to switch to a standby virtual machine automatically, reducing unplanned downtime.

2.1.1.3 Local Disk Management


Serengeti allows you to use both shared storage and local storage. After the disks are formatted to
datastores in vSphere, you can add the datastores to Serengeti easily. You can specify whether the
datastores are shared storage (SHARED) or local storage (LOCAL). Serengeti automatically allocates the
datastores to Hadoop clusters when you deploy a Hadoop cluster.
By default, Serengeti allocates Hadoop master nodes and client nodes on SHARED datastores, and
data/compute nodes on LOCAL datastores, including both system disk and data disks of those nodes. If
you specify only local storage or shared storage, Serengeti allocates all Hadoop nodes on the available
datastores for a default cluster.
2.1.1.4 Easy Scale Out
With Serengeti you can add more nodes to a Hadoop cluster with a single command after it has been
deployed. You can start with a small Hadoop cluster and scale out as needed.
2.1.1.5 Configuration
Serengeti allows you to customize the following:

Number of virtual machines

CPU, RAM, storage for the virtual machines

Software packages for the virtual machines

Hadoop configuration.

Serengeti automatically adjusts the Hadoop configuration according to the virtual machine specification.
After a cluster is created, you can export its cluster specification and tune the Hadoop configuration
without affecting unrelated Hadoop nodes.
Serengeti provides both cluster level and node group level configuration. You can set different
parameters for different node groups.
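For illustration only, the following fragment sketches how a cluster specification can combine per-node-group
sizing with a cluster-level Hadoop configuration override. The nodeGroups fields match the samples later in
this guide; the property placed under "configuration" is an assumed example, so check the Cluster
Specification Reference for the authoritative format.
{
  "nodeGroups": [
    {
      "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
      "instanceNum": 5,
      "cpuNum": 2,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ],
  "configuration": {
    "hadoop": {
      "mapred-site.xml": {
        "mapred.tasktracker.map.tasks.maximum": "4"
      }
    }
  }
}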
2.1.1.6 Data Compute Separation
Serengeti allows you to deploy a Hadoop cluster with separated data and compute nodes.

You can specify the number of data nodes per host.

You can specify the number of compute nodes per data node and place a compute node and its
related data node on the same physical host.

Serengeti also allows you to deploy a compute-only cluster, either to provide performance isolation
between different MapReduce clusters or to consume an existing HDFS:

Deploy a Hadoop cluster with only JobTracker and TaskTracker to consume an existing Apache
0.20-based HDFS.

Deploy a Hadoop cluster with only JobTracker and TaskTracker to consume a third-party HDFS.

2.1.1.7 Remote CLI


You can remotely access the Serengeti Management Server by installing the CLI client in your
environment. The CLI is a one-stop shell for deploying, managing, and using Hadoop.
2.1.1.8 Hadoop Distribution Management
Serengeti allows you to use any of the following Hadoop distributions:

Apache Hadoop 1.0.x

Greenplum HD 1.2

Hortonworks HDP-1

CDH3

CDH4

MapR M5

You can add your preferred distribution to Serengeti and deploy Hadoop clusters accordingly.

2.1.2 Serengeti Architecture Overview


The Serengeti virtual appliance runs on top of vSphere system and includes a Serengeti Management
Server virtual machine and a Hadoop Template virtual machine. The Hadoop Template virtual machine
includes an agent.

Serengeti performs these major steps to deploy a Hadoop cluster:


1. Serengeti Management Server searches for ESXi hosts with sufficient resources.
2. Serengeti Management Server selects ESXi hosts on which to place Hadoop virtual machines.
3. Serengeti Management Server sends a request to vCenter to clone and reconfigure virtual
machines.
4. Agent configures the OS parameters and network configurations.
5. Agent downloads Hadoop software packages from the Serengeti Management sServer.
6. Agent installs Hadoop software.
7. Agent configures Hadoop parameters.
Provisioning is performed in parallel, which reduces deployment time.

2.2 Hadoop
Apache Hadoop is open source software for distributed storage and computing. Apache Hadoop includes
HDFS and MapReduce. HDFS is a distributed file system, and MapReduce is a software framework for
distributed data processing.
You can find more information about Apache Hadoop at http://hadoop.apache.org/.

2.3 VMware Virtual Infrastructure


VMware's leading virtualization solutions provide multiple benefits to IT administrators and users. VMware
virtualization creates a layer of abstraction between the resources required by an application and
operating system, and the underlying hardware that provides those resources. The value of this
abstraction layer includes the following:

Consolidation: VMware technology allows multiple application servers to be consolidated onto
one physical server, with little or no decrease in overall performance.

Ease of Provisioning: VMware virtualization encapsulates an application into an image that can
be duplicated or moved, greatly reducing the cost of application provisioning and deployment.

Manageability: Virtual machines may be moved from server to server with no downtime using
VMware vMotion, which simplifies common operations like hardware maintenance and reduces
planned downtime.

Availability: Unplanned downtime can be reduced and higher service levels can be provided to an
application. VMware High Availability (HA) ensures that in the case of an unplanned hardware
failure, any affected virtual machines are restarted on another host in a VMware cluster.

2.4 Serengeti Virtual Appliance Requirements

Software
  - VMware vSphere 5.0 Enterprise or VMware vSphere 5.1 Enterprise
  - VMware vSphere Client 5.0 or VMware vSphere Client 5.1
  - SSH client

Network
  - DNS server
  - DHCP server or a static IP address block

Resource requirements
  - Resource pool with at least 27.5GB RAM
  - Port group with at least 6 uplink ports
  - 350GB or more of disk space is suggested:
    - 17GB for the Serengeti virtual appliance
    - 300GB for your first Hadoop cluster. You can reduce the disk space requirement by
      specifying the storage size in a cluster specification.
    - The remaining disk space is reserved for swap space.
  - Shared storage is required if you use HA or FT for the Hadoop master node.

Others
  - All ESXi hosts should have time synchronized using the Network Time Protocol (NTP)

2.5 Serengeti CLI Requirements

OS
  - Windows
  - Linux

Software
  - Java 1.6.26 or later
  - Unzip tool

Network
  - Can access the Serengeti Management Server through HTTP in order to download the CLI
    package

3. Installing the Serengeti Virtual Appliance


3.1 Download
Download the Serengeti Virtual Appliance OVA from the VMware site.

3.2 Deploy Serengeti


Serengeti runs in a VMware vSphere system. You can use the vSphere Client to connect to VMware vCenter
Server and deploy Serengeti.
1. In the vSphere Client, select File -> Deploy OVF Template.
2. Select the OVA file location of the Serengeti Virtual Appliance. The vSphere Client verifies the OVA file
and shows brief information about it.
3. Specify the Serengeti virtual appliance name and inventory location.
Only alphabetic letters (a-z, A-Z), numbers (0-9), spaces ( ), hyphens (-) and underscores
(_) can be used in the virtual appliance name and resource pool name. Datastore names can
additionally contain parentheses ((, )) and periods (.).

4. Select the resource pool on which to deploy the template.


You MUST deploy Serengeti in a top-level resource pool.


5. Select a datastore.
6. Select a format for the virtual disks.
7. Map the networks used in the OVF template to the networks in your inventory.
8. Set the properties for this Serengeti deployment.

Serengeti Management Server Network Settings
  Network Type: Select DHCP or Static IP.
  IP Address: Enter the IP address for the Serengeti Management Server virtual machine.
  Netmask: Enter the subnet mask of the network.
  Gateway: Enter the IP address of the network gateway.
  DNS Server 1: Enter the DNS server IP address.
  DNS Server 2: Enter a second DNS server IP address.

Hadoop Resource Settings
  Initialize Resources: Keep this option selected to add the resource pool, datastore, and network
  to the Serengeti Management Server database. Users can deploy Hadoop clusters in the resource
  pool, datastore, and network in which the Serengeti virtual appliance is deployed. Hadoop node
  virtual machines attempt to obtain an IP address by using DHCP on the network.
9. Verify binding to vCenter Extension Service.

10. Click Next to deploy the virtual appliance. It takes several minutes to deploy the virtual appliance.
After the Serengeti virtual appliance is deployed successfully, two virtual machines are installed in
vSphere: one is the Serengeti Management Server virtual machine, and the other is the virtual machine
template for Hadoop nodes.
11. Power on the Serengeti vApp and open the console of the Serengeti Management Server VM. You see
the initial OS login password for the root/serengeti user. After logging in to the VM, update the password
with the command sudo /opt/serengeti/sbin/set-password -u; the initial password then no longer appears
on the welcome screen.

4. Quick Start
4.1 Set up the Serengeti CLI
The Serengeti command-line shell runs on Windows or Linux. You need Java installed on the machine.
You can download VMware-Serengeti-cli-0.8.0.0-<build number>.zip from the Serengeti Management
Server (http://your-serengeti-server/cli).
Unzip the downloaded package to a directory, go to the cli directory under it, and start the Serengeti CLI
by entering java -jar serengeti*.jar.
Refer to the troubleshooting document if you have any issues.
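A typical setup on a Linux machine looks like the following sketch; the exact download URL path and the
package file name depend on your Serengeti server and build, so treat them as placeholders.
# Download the CLI package from the Serengeti Management Server
wget http://your-serengeti-server/cli/VMware-Serengeti-cli-0.8.0.0-<build number>.zip
# Unzip it, go to the cli directory, and start the shell
unzip VMware-Serengeti-cli-0.8.0.0-<build number>.zip
cd cli
java -jar serengeti*.jar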

4.2 Deploy a Hadoop Cluster


You can use Serengeti CLI to perform actions such as creating and customizing Hadoop clusters. You
have two ways to access Serengeti CLI: from the Serengeti Management Server virtual machine or install
CLI on any machine and use it.
1. Enter the Serengeti shell.
>java -jar serengeti*.jar
2. Run the connect command to connect to the Serengeti server.
serengeti>connect --host xx.xx.xx.xx:8080 --username xxx --password xxx
A user named "serengeti" with password "password" is created by default.
3. Run the cluster create command to deploy a Hadoop cluster on vSphere.
serengeti>cluster create --name myHadoop
In the example, myHadoop is the name of the Hadoop cluster you deploy. The Serengeti command
continually updates the progress of the deployment.
Only alphabetic letters (a-z, A-Z), numbers (0-9), and underscores (_) can be used in the cluster
name.
This command deploys a Hadoop cluster with one master node virtual machine, three worker node
virtual machines, and one client node virtual machine. The master node virtual machine runs the
NameNode and JobTracker services. The worker node virtual machines run the DataNode and TaskTracker
services. The client node virtual machine contains a Hadoop client environment, including the Hadoop
client shell, Pig, and Hive.
After the deployment is complete, you can view the IP addresses of the Hadoop node virtual machines.
Hint
Use the tab key for auto-completion and to get help for commands and parameters.
By default, Serengeti might use any of the added resources to deploy a Hadoop cluster. To limit the scope of
resources for the cluster, you can specify resource pools, datastores, or a network in the cluster create
command:
serengeti>cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
In this example, myRP is the resource pool where the Hadoop cluster is deployed, myDS is the
datastore where the virtual machine images are stored, and myNW is the network that the virtual machines
will use.
Hint
You can use the resourcepool list, datastore list, and network list commands to see what resources are in
Serengeti.
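For example, you can review the resources currently registered with Serengeti before creating the cluster:
serengeti>resourcepool list
serengeti>datastore list
serengeti>network list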
Once you have a Hadoop cluster deployed, you can execute Hadoop commands directly in the CLI. This
section describes how to copy files from the local file system to HDFS and then run a MapReduce job.
1. Start the Serengeti CLI and connect to the Serengeti Management Server as described in section 4.1.
2. Run the cluster list command to show all the available clusters.
$serengeti>cluster list
3. Run the cluster target --name command to connect to the cluster you want to move data in or out of.
The --name value is the name of the cluster that you want to connect to.
$serengeti>cluster target --name cluster1
4. Run the fs put command to upload data to HDFS.
$serengeti>fs put --from /etc/inittab --to /tmp/input/inittab
5. Run the fs get command to download data from HDFS.
$serengeti>fs get --from /tmp/input/inittab --to /tmp/local-inittab
6. Run the mr jar command to run a MapReduce job.
$serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass
org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"
7. Run the fs cat command to show the output of the MR job.
$serengeti> fs cat /tmp/output/part-r-00000
8. Run the fs get command to download the output of the MR job.
$serengeti> fs get --from /tmp/output/part-r-00000 --to /tmp/wordcount


Another way to use Hadoop is through the client VM. By default, Serengeti deploys a client VM with the
Hadoop client, Pig, and Hive installed, and with the OS configured and ready to use Hadoop. You can see
the IP address of the client VM after a cluster is deployed, or use the cluster list command to see the IP.
Follow these steps to verify that the Hadoop cluster is working properly.
1. Use ssh to log in to the client VM.
Use "joe" for the user name. The password is "password".
2. Create your own home directory.
$ hadoop fs -mkdir /usr/joe
3. Or run a sample Hadoop MapReduce job.
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 10000000
Feel free to submit other MapReduce, Pig, or Hive jobs as well.

4.3 Deploy an HBase Cluster


Serengeti also supports deploying an HBase cluster on HDFS. The easiest way to deploy an HBase cluster
is to run the following command:
serengeti>cluster create --name myHBase --type hbase
In the example, myHBase is the name of the HBase cluster you deploy, and --type hbase indicates that you
want to deploy an HBase cluster based on the default template Serengeti provides. This command
deploys one master node virtual machine running the NameNode and HBase Master daemons, three
zookeeper nodes running the ZooKeeper daemon, three data nodes running the Hadoop DataNode and
HBase RegionServer daemons, and one client node from which you can launch Hadoop or HBase jobs.
When the deployment finishes, you can access HBase in several ways:
1. Log in to the client VM and run hbase shell commands.
2. Launch an HBase job, for example hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred
randomWrite 3.
The default HBase cluster does not contain the Hadoop JobTracker or Hadoop TaskTracker daemons, so
you need to deploy a customized cluster if you want to run an HBase MapReduce job.
3. Access HBase through the RESTful web service or the Thrift gateway. The HBase REST and Thrift
services are configured on the HBase client node; the REST service listens on port 8080 and the Thrift
service listens on port 9090.
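For example, assuming the REST gateway is running on the client node as described above, a standard HBase
REST request such as the following (the IP address is a placeholder) returns the cluster status:
curl http://<hbase-client-node-ip>:8080/status/cluster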

5. Using Serengeti
5.1 Manage Serengeti Users
Spring Security in-memory authentication is used for Serengeti authentication and user management.
You can modify the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file to
manage Serengeti users, and then restart the tomcat service using the command "sudo service tomcat restart".

5.1.1 Add/Delete a User in Serengeti


Add or delete users in the user-service element of the
/opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file.
The following sample adds one user to user-service.
<authentication-manager alias="authenticationManager">
<authentication-provider>
<user-service>


<user name="serengeti" password="password" authorities="ROLE_ADMIN"/>


<user name="joe" password="password" authorities="ROLE_ADMIN"/>
</user-service>
</authentication-provider>
</authentication-manager>
The authorities value is intended to define the user's role in Serengeti, but it is not used in this release, so
any value is acceptable here.

5.1.2 Modify User Password


Modify the password value in the user-service element of the same file. Following is a sample.
<authentication-manager alias="authenticationManager">
<authentication-provider>
<user-service>
<user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
<user name="joe" password="welcome1" authorities="ROLE_ADMIN"/>
</user-service>
</authentication-provider>
</authentication-manager>

5.2 Manage Resources in Serengeti


When deploying the Serengeti OVA, the VI admin might allow you to use the same resources that the
Serengeti virtual appliance is using. You can also add more resources to Serengeti for your Hadoop
clusters. You can list resources in Serengeti and delete them if they are no longer needed.
You must add a resource pool, datastore, and network before deploying a Hadoop cluster if the VI
admin does not allow you to deploy Hadoop clusters in the same set of resources as the Serengeti
server.

5.2.1 Add a Datastore


You can use datastore add command to add a vSphere datastore to Serengeti.
serengeti>datastore add --name myLocalDS --spec local* --type LOCAL
In this example, myLocalDS is the name by which the datastore is known in Serengeti; you use this name when you create a Hadoop cluster.
local* is a wildcard specifying a set of datastores. All datastores whose name starts with local will be
added and managed as a whole.
LOCAL specifies that the datastores are local storage.
In this version, Serengeti does not check if the datastore really exists. If you use a nonexistent
datastore, cluster creation will fail.
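A shared datastore can be added in the same way; the name and wildcard below are illustrative:
serengeti>datastore add --name mySharedDS --spec share* --type SHARED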

5.2.2 Add a Network


You can use the network add command to add a network to Serengeti. A network is a port group plus a way
to obtain IP addresses on that port group.
serengeti>network add --name myNW --portGroup 10GPG --dhcp
In this example, myNW is the name by which the network is known in Serengeti; you use this name when you create a Hadoop cluster.
10GPG is the name of the port group created by VI Admin in vSphere.
Virtual machines using this network will use DHCP to obtain IP.
You can also add networks using a static IP.
serengeti>network add --name myNW --portGroup 10GPG --ip 192.168.1.2-100 --dns 10.111.90.2
--gateway 192.168.1.1 --mask 255.255.255.0


In this example, 192.168.1.2-100 is the IP address range Hadoop nodes can use.
10.111.90.2 is the DNS server IP.
192.168.1.1 is the gateway.
255.255.255.0 is the subnet mask.
In this version, Serengeti does not check if the added network is correct. If you use a wrong
network, cluster creation will fail.

5.2.3 Add a Resource Pool


You can use resourcepool add command to add a vSphere resource pool to Serengeti.
serengeti>resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
In this example, myRP is the name by which the resource pool is known in Serengeti; you use this name when you create a Hadoop cluster.
cluster1 is the vSphere cluster name and rp1 is vSphere resource pool name.
In this version, Serengeti does not check if the resource pool really exists. If you use a
nonexistent resource pool, cluster creation will fail.
vSphere nested resource pools are not supported in the current version. The resource pool must be
located directly under a cluster.

5.2.4 View Datastores


In the Serengeti shell, you can list datastores added to Serengeti.
serengeti>datastore list
You can see details of datastores.
serengeti> datastore list --detail
You can specify which datastore to list.
serengeti> datastore list --name myDS --detail

5.2.5 View Networks


In the Serengeti shell, you can list networks added to Serengeti.
serengeti>network list
You can see details of networks.
serengeti> network list --detail
You can specify which network to list.
serengeti> network list --name myNW --detail

5.2.6 View Resource Pools


In the Serengeti shell, you can list resource pools added to Serengeti.
serengeti>resourcepool list
You can see details of resource pools.
serengeti>resourcepool list --detail

You can specify which resource pool to list.


serengeti>resourcepool list --name myRP --detail

5.2.7 Remove a Datastore


You can use the datastore delete command to remove a datastore from Serengeti.
serengeti>datastore delete --name myDS
In this example, myDS is the name you specified when you added the datastore.
You cannot remove a datastore from Serengeti if it is referenced by a Hadoop cluster.

5.2.8 Remove a Network


You can use the network delete command to remove a network from Serengeti.
serengeti>network delete --name myNW
In this example, myNW is the name you specified when you added the network.

You cannot remove a network from Serengeti if it is referenced by a Hadoop cluster.


You can use network list command to see which cluster is referencing the network.

5.2.9 Remove a Resource Pool


You can use the resourcepool delete command to remove a resource pool from Serengeti.
serengeti>resourcepool delete --name myRP
In this example, myRP is the name you specified when you added the resource pool.
You cannot remove a resource pool from Serengeti if the resource pool is referenced by a
Hadoop cluster.

5.3 Manage Distros


5.3.1 Supported Distros
The Serengeti Management Server includes Apache Hadoop 1.0.1, but you can use your preferred
Hadoop distro as well. Greenplum HD 1, CDH3, CDH4 (YARN is not supported at this moment), HDP1,
and MapR M5 are also supported.
Serengeti now supports deployment of Hadoop clusters as well as Pig and Hive instances.

5.3.2 Add a Distro to Serengeti


Serengeti uses tar balls or yum repositories to deploy Hadoop clusters for the different Hadoop distributions.
5.3.2.1 Using tar ball to deploy Hadoop cluster
Serengeti uses tar balls to deploy the following Hadoop distros:
- Apache Hadoop 1.0.x
- Greenplum HD 1
- CDH3
- HDP1

1. Download the three packages (hadoop/pig/hive) in tar ball format from the distro vendor's site.
2. Upload them to Serengeti Management Server virtual machine.
3. Put the packages in /opt/serengeti/www/distros/. The hierarchy should be
DISTRO_NAME/VERSION_NUMBER/TARBALLS. For example, place the Apache Hadoop distro as
shown in the following way.
- apache/
- 1.0.1/
- hadoop-1.0.1.tar.gz
- hive-0.8.1.tar.gz
- pig-0.9.2.tar.gz
4. Edit /opt/serengeti/www/distros/manifest in the Serengeti Management Server virtual machine to
add the mapping between Hadoop roles and the tar ball packages of the distro. For example, add
JSON text like the following to the manifest file:
{
"name" : "cdh",
"version" : "3u3",
"packages" : [
{
"roles" : ["hadoop_namenode", "hadoop_jobtracker",
"hadoop_tasktracker", "hadoop_datanode",
"hadoop_client"],
"tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
},
{
"roles" : ["hive"],
"tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
},
{
"roles" : ["pig"],
"tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
}
]
},
In this example, the CDH tar balls are put in the directory /opt/serengeti/www/distros/cdh/3u3.
Note that if a distro supports HVE, add "hveSupported" : true, after the "version" line in the example
above (see the fragment at the end of this procedure).
5. Restart the tomcat server in Serengeti Management Server to allow the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart
If the commands are successful, issue the "distro list" command in the Serengeti shell and the distro that
you added appears. Otherwise, make sure the JSON text you wrote in the manifest is correct.
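The following fragment sketches where the hveSupported flag mentioned in step 4 goes; the distro name and
version are illustrative, and the packages list is abbreviated:
{
  "name" : "cdh",
  "version" : "3u3",
  "hveSupported" : true,
  "packages" : [ ... ]
}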
5.3.2.2 Using yum repository to deploy Hadoop cluster
Serengeti uses yum repositories to deploy the following Hadoop distros:
- CDH4
- MapR M5

1. Open the sample manifest file /opt/serengeti/www/distros/manifest.sample in the Serengeti
Management Server virtual machine. You will see the following distro configuration for MapR and
CDH4:
{
"name" : "mapr",
"vendor" : "MAPR",
"version" : "2.1.1",
"packages" : [
{
"roles" : ["mapr_zookeeper", "mapr_cldb", "mapr_jobtracker", "mapr_tasktracker",
"mapr_fileserver", "mapr_nfs", "mapr_webserver", "mapr_metrics", "mapr_client", "mapr_pig",
"mapr_hive", "mapr_hive_server", "mapr_mysql_server"],
"package_repos" : ["http://<ip_of_serengeti_server>/mapr/2/mapr-m5.repo"]
}
]
},
{
"name" : "cdh4",
"vendor" : "CDH",
"version" : "4.1.2",
"packages" : [
{
"roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker",
"hadoop_datanode", "hadoop_journalnode", "hadoop_client", "hive", "hive_server", "pig",
"hbase_master", "hbase_regionserver", "hbase_client", "zookeeper"],
"package_repos" : ["http://<ip_of_serengeti_server>/cdh/4/cloudera-cdh4.repo"]
}
]
}

The two yum repo files (mapr-m5.repo and cloudera-cdh4.repo) point to the official yum repositories of
MapR and CDH4 on the Internet. You can copy this sample file
/opt/serengeti/www/distros/manifest.sample to /opt/serengeti/www/distros/manifest.
When you create a MapR or CDH4 cluster, the Hadoop nodes download rpm packages from the official
MapR/CDH4 yum repository on the Internet.
If the VMs in the cluster created by the Serengeti Management Server do not have access to the Internet, or
the bandwidth to the Internet is limited, we strongly suggest creating a local yum repository for MapR
and CDH4. Read Appendix A: Create Local Yum Repository for MapR and Appendix B: Create
Local Yum Repository for CDH4 to create a yum repository.
2. Configure the local yum repository URL in the manifest file.
Once the local yum repository for MapR/CDH4 is created, open /opt/serengeti/www/distros/manifest
and add the distro configuration (use the sample in the previous step and change the
"package_repos" attribute to the URL of the local yum repository file).
3. Restart the tomcat server in Serengeti Management Server to allow the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart


If the commands are successful, issue the "distro list" command in the Serengeti shell and the distro that
you added appears. Otherwise, make sure the JSON text you wrote in the manifest is correct.

5.3.3 List Distros


You can use the "distro list" command to see available distros.
serengeti> distro list
You can see the packages in each distro and make sure it includes the services you want to deploy.

5.3.4 Using a Distro


You can choose which distro you use when deploying a cluster.
serengeti>cluster create --name myHadoop --distro cdh

5.4 Hadoop Clusters


5.4.1 Deploy Hadoop Clusters
5.4.1.1 Deploy a Customized Hadoop Cluster
You can customize the number of nodes, the size of the virtual machines, and so on when you create a cluster.
On the Serengeti Management Server you can find sample specs in /opt/serengeti/samples/. If you are using
the Serengeti CLI from your desktop, you can find the sample specs in the client folder.
1. Edit a cluster spec file.
For example:
{
"nodeGroups" : [
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"instanceType": "MEDIUM"
},
{
"name": "worker",
"roles": [
"hadoop_datanode", "hadoop_tasktracker"
],
"instanceNum": 5,
"instanceType": "SMALL"
},
{
"name": "client",
"roles": [

"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"instanceType": "SMALL"
}
]
}
In this example, you want one MEDIUM master virtual machine, five SMALL worker virtual machines,
and one SMALL client virtual machine. You can also specify the number of CPUs, RAM, disk size, and so on
for each node group.
2. Specify the spec when creating the cluster. You must use the full path to the file.
serengeti>cluster create --name myHadoop --specFile /home/serengeti/mySpec.txt
CAUTION
Changing the roles of node groups might make the deployed Hadoop cluster unworkable.
Deploy a CDH4 Hadoop Cluster
You can create a default CDH4 Hadoop cluster by executing the following command in the Serengeti CLI:
serengeti>cluster create --name mycdh --distro cdh4
You can also create a customized CDH4 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mycdh --distro cdh4 --specFile
/opt/serengeti/samples/default_cdh4_ha_hadoop_cluster.json
/opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json is a sample spec file for
CDH4. You can make a copy of it and modify the parameters in the file before creating the cluster. In this
example, nameservice0 and nameservice1 are federated with each other, and the name nodes inside the
nameservice0 node group (with instanceNum set to 2) are HDFS2 HA enabled. In Serengeti, the name node
group names become the name service names of HDFS2.
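For orientation, a fragment along the following lines defines the HA-enabled name node group; the exact set of
fields is illustrative, so use the sample spec file above as the authoritative reference:
{
  "name": "nameservice0",
  "roles": [ "hadoop_namenode" ],
  "instanceNum": 2
}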
5.4.1.1.1 Deploy a MapR Hadoop Cluster
You can create a default MapR M5 Hadoop cluster by executing the following command in the Serengeti CLI:
serengeti>cluster create --name mymapr --distro mapr
You can also create a customized MapR M5 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mymapr --distro mapr --specFile
/opt/serengeti/samples/default_mapr_cluster.json
/opt/serengeti/samples/default_mapr_cluster.json is a sample spec file for MapR. You can make a copy
of it and modify the parameters in the file before creating the cluster.

5.4.1.2 Separating Data and Compute nodes


You can separate data and compute nodes in a cluster and apply more fine-grained control of node placement
among ESX hosts. For example, you can use Serengeti to deploy the following clusters:
1. A data-compute separated cluster, without any node placement constraints.

{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}

}
In this example, four data nodes and eight compute nodes are created and put into individual VMs. By
default, Serengeti uses a round-robin algorithm to distribute the VMs/nodes evenly across ESX hosts.
2. A data-compute separated cluster, with an instancePerHost constraint.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
"placementPolicies": {
"instancePerHost": 2
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, the data and compute node groups have placementPolicies constraints. After a successful
provision, four data nodes and eight compute nodes will be created and put into individual VMs. With the
instancePerHost=1 constraint, the four data nodes will be placed on four ESX hosts. The eight compute
nodes will be put onto four ESX hosts as well, two nodes on each.
Note that it is not guaranteed that the two compute nodes will stay collocated with each data node on
each of the four ESX hosts. To ensure that this is the case, create a VM-VM affinity rule between each
host's compute nodes and data node, or disable DRS on the compute nodes.
3. A data and compute separated cluster, with instancePerHost and groupAssociations constraints for the
compute node group and a groupRacks constraint for the data node group.
{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500,
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1,
"groupRacks": {
"type": "ROUNDROBIN",
"racks": ["rack1", "rack2", "rack3"]
}
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
},
"placementPolicies": {
"instancePerHost": 2,
"groupAssociations": [
{
"reference": "data",
"type": "STRICT"
}
]
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, after a successful provision, the four data nodes and eight compute nodes will be placed
on exactly the same four ESX hosts; each ESX host has one data node and two compute nodes, and
these four ESX hosts are selected fairly from rack1, rack2 and rack3.
Here, as the definition of the compute node group says, the placement of compute nodes strictly
refers to the placement result of the data nodes. That means compute nodes are only placed on ESX
hosts that have data nodes.

5.4.1.3 Deploy a Compute Only Cluster


You can create a compute only cluster that refers to an existing HDFS cluster with the following steps:
1. Edit a cluster spec file and save it, for example, as /home/serengeti/coSpec.txt.

For example:
{
"externalHDFS": "hdfs://hostname-of-namenode:8020",
"nodeGroups": [
{
"name": "master",
"roles": [
"hadoop_jobtracker"
],
"instanceNum": 1,
"cpuNum": 2,
"memCapacityMB": 7500
},
{
"name": "worker",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 4,
"cpuNum": 2,
"memCapacityMB": 7500,
"storage": {
"type": "LOCAL",
"sizeGB": 20
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
],
"configuration": {
}
}
In this example, the externalHDFS field points to an existing HDFS. You should also specify node
groups with the hadoop_jobtracker and hadoop_tasktracker roles. Note that the externalHDFS field conflicts
with node groups that have the hadoop_namenode and hadoop_datanode roles. The sample cluster spec
can also be found in samples/compute_only_cluster.json in the Serengeti CLI directory.
2. Specify the spec when creating the cluster. You need to use the full path to specify the file.
serengeti>cluster create --name computeOnlyCluster --specFile /home/serengeti/coSpec.txt

5.4.1.4 Control Hadoop Virtual Machine Placement


Serengeti provides a way for users to control how Hadoop virtual machines are placed. Generally, this is
implemented by specifying the placementPolicies field inside a node group, for example:
{
"nodeGroups":[

{
"name": "group_name",

"placementPolicies": {
"instancePerHost": 2,
"groupRacks": {
"type": "ROUNDROBIN",
"racks": ["rack1", "rack2", "rack3"]
},
"groupAssociations": [{
"reference": "another_group_name",
"type": "STRICT" // or "WEAK"
}]
}
},

}
As this example shows, the placementPolicies field contains three optional items: instancePerHost,
groupRacks and groupAssociations.
As the name implies, instancePerHost indicates how many VM nodes or instances should be placed on
each physical ESX host; this constraint is aimed at balancing the workload.
The groupRacks item controls how VM nodes should be placed across the racks you specified. In this example,
the rack type equals ROUNDROBIN, and the racks item indicates which racks in the topology map
(refer to chapter 5.8 to see how to configure topology map information and make a Hadoop cluster
rack aware) will be used for this placement policy. If the racks item is omitted, Serengeti will use all
racks in the topology map. ROUNDROBIN here means the candidates will be selected fairly when
determining which rack should be used for each node.
On the other hand, if you specify both instancePerHost and groupRacks for a placement policy, you
should make sure the number of available hosts is sufficient. You can get the rack-to-hosts information by
using the topology list command.
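For example:
serengeti>topology list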
groupAssociations means the node group has associations with target node groups, and each
association has reference and type fields. The reference field is the name of a target node group,
and type can be STRICT or WEAK. STRICT means the node group must be placed on the same
set or a subset of the ESX hosts used by the target group, while WEAK means the node group tries to be
placed on the same set or a subset of the ESX hosts used by the target group, but with no guarantee.
A typical scenario for applying groupRacks and groupAssociations is deploying a Hadoop cluster with
data and compute nodes separated. In this case, users might want to put compute nodes and data nodes
on the same set of physical hosts for better performance, especially throughput. You can refer to
5.3.3 for practical examples of how to deploy a Hadoop cluster by applying placement policies.

5.4.1.5 Use NFS as Compute Nodes Local Directory


Serengeti allows users to specify NFS storage for compute nodes. There are several benefits: 1) it increases the
capacity of each compute node; 2) storage resources are returned when some compute nodes are stopped.
Here is an example showing how to deploy a cluster whose compute nodes have only NFS storage:

{
"nodeGroups":[
{
"name": "master",
"roles": [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum": 1,
"instanceType": "LARGE",
"cpuNum": 2,
"memCapacityMB": 7500,
"haFlag": "on"
},
{
"name": "data",
"roles": [
"hadoop_datanode"
],
"instanceNum": 4,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
},
"placementPolicies": {
"instancePerHost": 1
}
},
{
"name": "compute",
"roles": [
"hadoop_tasktracker"
],
"instanceNum": 8,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "TEMPFS"
},
"placementPolicies": {
"instancePerHost": 2,
"groupAssociations": [
{
"reference": "data",
"type": "STRICT"
}
]
}
},
{
"name": "client",
"roles": [
"hadoop_client",
"hive",
"hive_server",
"pig"
],
"instanceNum": 1,
"cpuNum": 1,
"memCapacityMB": 3748,
"storage": {
"type": "LOCAL",
"sizeGB": 50
}
}
]
}
In this example, the cluster is data/compute separated and compute nodes are strictly associated with data nodes. By
setting the storage field of the compute node group to type TEMPFS, Serengeti will install an NFS server
on the associated data nodes, install an NFS client on the compute nodes, and mount the data nodes' disks on the compute
nodes. Serengeti will not assign disks to the compute nodes, and all temporary files generated while running
MapReduce jobs are saved on the NFS disks.

5.4.2 Manage Hadoop Clusters


5.4.2.1 Modify Hadoop Cluster Configuration
Serengeti provides a simple and easy way to tune the Hadoop cluster configuration including attributes in
core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml,
capacity-scheduler.xml, etc.
In addition to modifying Hadoop configuration of an existing Hadoop cluster created by
Serengeti, you can also define Hadoop configuration in the cluster spec file when creating a new
cluster.

5.4.2.1.1 Cluster Level Configuration


You can modify the Hadoop configuration of an existing cluster by following the steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json
2. Modify the configuration section at the bottom of /home/serengeti/myHadoop.json with the following
content and add the customized Hadoop configuration in this configuration section:

"configuration": {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a
sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": "",
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG, DRFA ",
// "hadoop.security.logger": "DEBUG, DRFA ",
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
}
}
}

Serengeti provides a tool to convert the Hadoop configuration files of your existing cluster into
the above JSON format, so you don't need to write this JSON file manually. Please read the section
"Tool for converting Hadoop Configuration".

Some Hadoop distributions have their own Java jar files which are not put in
$HADOOP_HOME/lib, so by default the Hadoop daemons can't find them. In order to use these jars,
you need to add a cluster configuration that includes the full path of the jar files in
$HADOOP_CLASSPATH.
Here is a sample cluster configuration to configure a Cloudera CDH3 Hadoop cluster with the Fair
Scheduler (the jar files of the Fair Scheduler are put in /usr/lib/hadoop/contrib/fairscheduler/):

"configuration": {
"hadoop": {
"hadoop-env.sh": {
"HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
},
"mapred-site.xml": {
"mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"

},
"fair-scheduler.xml": {

}
}
}

3. Run the cluster config command to apply the new Hadoop configuration:

serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
4. If you want to reset an existing configuration attribute to the Hadoop default value, simply remove it or
comment it out using // in the configuration section of the cluster spec file, and run the cluster config command again.
5.4.2.1.2 Group Level Configuration
You can also modify the Hadoop configuration within a node group in an existing cluster by following
the steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json
2. Modify the configuration section within the node group in /home/serengeti/myHadoop.json with the
same content as in Cluster Level Configuration and add the customized Hadoop configuration for this
node group.
The Hadoop configuration in Group Level Configuration will override the configuration with the
same name in Cluster Level Configuration.
3. Run cluster config command to apply the new Hadoop configuration
serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
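For example, a worker node group carrying its own configuration section could look roughly like the following; the node group layout and the attribute value shown here are illustrative only, and any White List attribute can be used in the same way:
{
"name": "worker",
"roles": [
"hadoop_datanode",
"hadoop_tasktracker"
],
"instanceNum": 4,
"configuration": {
"hadoop": {
"mapred-site.xml": {
"mapred.tasktracker.map.tasks.maximum": "4"
}
}
}
}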
5.4.2.1.3 Black List and White List in Hadoop Configuration
Almost all of the configuration attributes provided in Apache Hadoop are configurable in Serengeti; these
attributes belong to the White List. However, a few attributes are not configurable in Serengeti, and these
attributes belong to the Black List.
If you set an attribute in the cluster spec file that is in the Black List or not in the White List and then run the
cluster config command, Serengeti will detect these attributes and give a warning; you need to answer
yes to continue or no to abort.
Usually you do not need to configure fs.default.name or dfs.http.address if there is a NameNode
or JobTracker in your cluster, because Serengeti will automatically configure these 2 attributes.
For example, when you create a default cluster in Serengeti, it will contain a NameNode and a
JobTracker, and you do not need to explicitly configure fs.default.name and dfs.http.address.
However, you can set fs.default.name to the URI of another NameNode if you really want to (see the sketch below).
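For example, explicitly overriding fs.default.name in the cluster-level configuration section would look like the following; the URI is a hypothetical placeholder:
"configuration": {
"hadoop": {
"core-site.xml": {
"fs.default.name": "hdfs://another-namenode:8020"
}
}
}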

5.4.2.1.3.1 White List


core-site.xml
- all attributes listed on http://hadoop.apache.org/common/docs/stable/core-default.html
- excluding attributes defined in the Black List
hdfs-site.xml
- all attributes listed on http://hadoop.apache.org/common/docs/stable/hdfs-default.html
- excluding attributes defined in the Black List
mapred-site.xml
- all attributes listed on http://hadoop.apache.org/common/docs/stable/mapred-default.html
- excluding attributes defined in the Black List
hadoop-env.sh
- JAVA_HOME
- PATH
- HADOOP_CLASSPATH
- HADOOP_HEAPSIZE
- HADOOP_NAMENODE_OPTS
- HADOOP_DATANODE_OPTS
- HADOOP_SECONDARYNAMENODE_OPTS
- HADOOP_JOBTRACKER_OPTS
- HADOOP_TASKTRACKER_OPTS
- HADOOP_LOG_DIR
log4j.properties
- hadoop.root.logger
- hadoop.security.logger
- log4j.appender.DRFA.MaxBackupIndex
- log4j.appender.RFA.MaxBackupIndex
- log4j.appender.RFA.MaxFileSize
fair-scheduler.xml
- text
- all attributes described on http://hadoop.apache.org/docs/stable/fair_scheduler.html, which can be put inside the text field
- excluding attributes defined in the Black List
capacity-scheduler.xml
- all attributes described on http://hadoop.apache.org/docs/stable/capacity_scheduler.html
- excluding attributes defined in the Black List

5.4.2.1.3.2 Black List


core-site.xml
- net.topology.impl
- net.topology.nodegroup.aware
- dfs.block.replicator.classname
hdfs-site.xml
- dfs.http.address
- dfs.name.dir
- dfs.data.dir
- topology.script.file.name
mapred-site.xml
- mapred.job.tracker
- mapred.local.dir
- mapred.task.cache.levels
- mapred.jobtracker.jobSchedulable
- mapred.jobtracker.nodegroup.awareness
hadoop-env.sh
- HADOOP_HOME
- HADOOP_COMMON_HOME
- HADOOP_MAPRED_HOME
- HADOOP_HDFS_HOME
- HADOOP_CONF_DIR
- HADOOP_PID_DIR
log4j.properties
- None
fair-scheduler.xml
- None
capacity-scheduler.xml
- None
mapred-queue-acls.xml
- None

5.4.2.1.4 Tool for converting Hadoop Configuration


In case you have a lot of Hadoop configuration in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh,
log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, mapred-queue-acls.xml, etc. for your
existing Hadoop cluster, you can use a tool provided by Serengeti to convert the Hadoop XML
configuration files into the JSON format used in Serengeti.
1) Copy the directory $HADOOP_HOME/conf/ in your existing Hadoop cluster to the Serengeti
Server.
2) Execute convert-hadoop-conf.rb /path/to/hadoop_conf/ in bash shell and it will print out all the
converted Hadoop configuration attributes in json format.
3) Open the cluster spec file and replace the Cluster Level Configuration or Group Level
Configuration with the content printed out in step 2.
4) Execute cluster config --name --specFile to apply the new configuration to the existing
cluster, or execute cluster create --name --specFile to create a new cluster with your
configuration.
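For example, assuming the configuration directory was copied to /tmp/myconf (a hypothetical path) and you want to keep the converted output for editing:
$ convert-hadoop-conf.rb /tmp/myconf/ > /tmp/myconf.json
You can then paste the content of /tmp/myconf.json into the configuration section of the cluster spec file.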
5.4.2.2 Scale Out a Hadoop Cluster
You can scale out to have more Hadoop worker nodes or client nodes after the Hadoop cluster is provisioned.
In the following example, the number of instances in the worker node group of the myHadoop cluster will
increase to 10.
serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
You cannot set a number smaller than the current instance number in this version of the Serengeti
virtual appliance.
5.4.2.3 Scale TaskTracker Nodes Rapidly
You can change the number of active TaskTracker nodes rapidly in a running Hadoop cluster or node
group. The selection of TaskTrackers to be enabled or disabled is done with the goal of balancing the
number of TaskTrackers enabled per host in the specified Hadoop cluster or node group.
In this example, the number of active TaskTracker nodes in worker node group in myHadoop cluster is
set to 8:
serengeti>cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
If fewer than 8 TaskTracker nodes were running in the worker node group of myHadoop cluster,
additional TaskTracker nodes are enabled (re-commissioned and powered-on), up to the number
provisioned in the worker node group. If more than 8 TaskTrackers were running in the worker node
group, excess TaskTracker nodes are disabled (decommissioned and powered-off). No action is
performed if the number of active TaskTrackers already equals 8.
If the node group is not specified, the TaskTracker nodes are enabled/disabled such that the total number
of active TaskTrackers is 8 across all the compute node groups in the myHadoop cluster:
serengeti>cluster limit --name myHadoop --activeComputeNodeNum 8
To enable all the TaskTrackers in the myHadoop cluster, use the cluster unlimit command:
serengeti>cluster unlimit --name myHadoop
This command is especially useful to fix any potential mismatch between the number of active
TaskTrackers as seen by Hadoop and the number of powered on TaskTracker nodes as seen by the
vCenter.
To enable all TaskTrackers within only one compute node group, specify the name of the node group
using the --nodeGroup option, similar to the cluster limit command.
5.4.2.4 Start/Stop Hadoop Cluster
In the Serengeti shell, you can start (or stop) a whole Hadoop cluster:
serengeti>cluster start --name mycluster
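Similarly, you can stop it (see section 7.2.9 for the command reference):
serengeti>cluster stop --name mycluster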
5.4.2.5 View Hadoop Clusters Deployed by Serengeti
In the Serengeti shell, you can list Hadoop clusters deployed by Serengeti.
serengeti>cluster list
You can specify which cluster to list.


serengeti>cluster list --name mycluster
You can see details of Hadoop clusters.
serengeti>cluster list --detail
5.4.2.6 Login to Hadoop Nodes
You can log in to Hadoop nodes, including master, worker, and client nodes, with password-less SSH from
the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, Mussh, etc., to do
troubleshooting or run your own management automation scripts.
The Serengeti Management Server is configured to be able to SSH to the Hadoop cluster nodes without
a password. Other clients or machines can use a user name and password to SSH to the Hadoop cluster
nodes.
All of these deployed nodes are protected with random passwords. If you want to log in to a Hadoop
node directly, please log in to the node from the vSphere client in order to change the password by following
the steps in Section 3.2, step 11. Please press Ctrl + D in order to get the login information with the
original random password.
5.4.2.7 Delete a Hadoop Cluster
You can delete a Hadoop cluster that you no longer need.
serengeti>cluster delete --name myHadoop
In this example, myHadoop is the name of the Hadoop cluster you want to delete.
When a Hadoop cluster is deleted, all virtual machines in the cluster are destroyed.
You can delete a Hadoop cluster even if it is running.

5.4.3 Use Hadoop Clusters


5.4.3.1 Run Pig Scripts
You can run a Pig script in the Serengeti CLI. For example, suppose you have a Pig script in /tmp/data.pig.
serengeti> pig cfg
serengeti> pig script --location /tmp/data.pig

5.4.3.2 Run Hive Scripts


You can run a Hive script in the Serengeti CLI. For example, suppose you have a Hive script in /tmp/data.hive.
serengeti>hive cfg
serengeti>hive script --location /tmp/data.hive
5.4.3.3 Run HDFS command
You can run HDFS commands in the Serengeti CLI. For example, suppose you have a file in /home/serengeti/data
and want to put it in your HDFS path /tmp.
serengeti> fs put --from /home/serengeti/data --to /tmp
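Other HDFS commands follow the same pattern. For example, to list the target HDFS directory afterwards:
serengeti> fs ls /tmp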

5.4.3.4 Run Map Reduce job


You can run a MapReduce job in the Serengeti CLI. For example, suppose you have the examples jar file in
/opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar and want to run pi.
serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass
org.apache.hadoop.examples.PiEstimator --args "10 10"

Make sure you have chosen a cluster as target first in Serengeti CLI. See Chapter 7.2.10.

5.4.3.5 Using Data through JDBC


With Hive JDBC, you can execute SQL from different programming languages, such as Java,
Python, PHP, and so on. The following is a JDBC client sample in Java code.
1. SSH to the node that contains the hive server role.
2. Create a Java file HiveJdbcClient.java which contains the Java sample code for connecting to the
Hive server:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
/**
* @param args
* @throws SQLException
**/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e){
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value
string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/test_hive_server.txt";
sql = "load data local inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(res.getString(1));
}
}
}
3. Run the JDBC sample code.
a. Compile the client on the command line:
$ javac HiveJdbcClient.java
b. Alternatively, you can run the following bash script, which will seed the data file and build your
classpath before invoking the client.
#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt
HADOOP_CORE=`ls $HADOOP_HOME/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
CLASSPATH=$CLASSPATH:$jar_file_name
done
java -cp $CLASSPATH HiveJdbcClient

For more information about the Hive client, please visit https://cwiki.apache.org/Hive/hiveclient.html.

5.4.3.6 Using Data through ODBC


You can use an off-the-shelf ODBC driver for Hadoop Hive, such as the MapR Hive ODBC Connector,
the Apache Hadoop Hive ODBC Driver, etc.
Take the MapR ODBC Connector as an example:
1. Install the MapR Hive ODBC Connector on your Windows 7 Professional or Windows 2008 R2 machine.
2. Create a Data Source Name (DSN) with the ODBC Connector's Data Source Administrator to
connect to your remote Hive server.
3. Import rows of the HIVE_SYSTEM table in the Hive server into Excel by connecting to this DSN.
For more information about Hive ODBC, please refer to https://cwiki.apache.org/Hive/hiveodbc.html.
For more information about the MapR Hive ODBC Connector, please refer to
www.mapr.com/doc/display/MapR/Hive+ODBC+Connector.

5.5 HBase Clusters


5.5.1 Deploy HBase Clusters
You can customize a HBase cluster by specifying your own spec file. The following is an example:
{
"nodeGroups" : [
{
"name" : "zookeeper",
"roles" : [
"zookeeper"
],
"instanceNum" : 3,
"instanceType" : "SMALL",
"storage" : {
"type" : "shared",
"sizeGB" : 20
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "hadoopmaster",
"roles" : [
"hadoop_namenode",
"hadoop_jobtracker"
],
"instanceNum" : 1,
"instanceType" : "MEDIUM",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 2,
"memCapacityMB" : 7500,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "hbasemaster",
"roles" : [
"hbase_master"
],
"instanceNum" : 1,
"instanceType" : "MEDIUM",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 2,
"memCapacityMB" : 7500,
"haFlag" : "on",
"configuration" : {
}
},
{
"name" : "worker",
"roles" : [
"hadoop_datanode",
"hadoop_tasktracker",
"hbase_regionserver"
],
"instanceNum" : 3,
"instanceType" : "SMALL",
"storage" : {
"type" : "local",
"sizeGB" : 50
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "off",
"configuration" : {
}
},
{
"name" : "client",
"roles" : [
"hadoop_client",
"hbase_client"
],
"instanceNum" : 1,
"instanceType" : "SMALL",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "off",
"configuration" : {
}
}
],
// we suggest running convert-hadoop-conf.rb to generate "configuration" section and paste the output
here
"configuration" : {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a
sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": ""
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG,DRFA",
// "hadoop.security.logger": "DEBUG,DRFA"
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
},
"mapred-queue-acls.xml": {
// check for all settings at
http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
// "mapred.queue.queue-name.acl-submit-job": "",
// "mapred.queue.queue-name.acl-administer-jobs", ""
}
},
"hbase": {
"hbase-site.xml": {
// check for all settings at http://hbase.apache.org/configuration.html#hbase.site
},
"hbase-env.sh": {
// "JAVA_HOME": "",
// "PATH": "",
// "HBASE_CLASSPATH": "",
// "HBASE_HEAPSIZE": "",
// "HBASE_OPTS": "",
// "HBASE_USE_GC_LOGFILE": "",
// "HBASE_JMX_BASE": "",
// "HBASE_MASTER_OPTS": "",
// "HBASE_REGIONSERVER_OPTS": "",
// "HBASE_THRIFT_OPTS": "",
// "HBASE_ZOOKEEPER_OPTS": "",
// "HBASE_REGIONSERVERS": "",
// "HBASE_SSH_OPTS": "",
// "HBASE_NICENESS": "",
// "HBASE_SLAVE_SLEEP": ""
},
"log4j.properties": {
// "hbase.root.logger": "DEBUG,DRFA"
}
},
"zookeeper": {
"java.env": {
// "JVMFLAGS": "-Xmx2g"
},
"log4j.properties": {
// "zookeeper.root.logger": "DEBUG,DRFA"
}
}
}
}
Compared to the template mentioned in section 4.4, this example has JobTracker and TaskTracker roles,
which means you can launch an HBase MapReduce job. It separates the Hadoop NameNode and
HBase Master roles. The HBase Master instances are protected by HBase's internal HA function.
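To deploy an HBase cluster from such a spec file, pass it to cluster create together with the HBase cluster type described in section 7.2.2; the spec file path below is a hypothetical example:
serengeti>cluster create --name myHBase --type hbase --specFile /home/serengeti/myHBaseSpec.json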

5.5.2 Manage HBase Clusters


An HBase cluster has a few more configurable files compared to a Hadoop cluster, including hbase-site.xml,
hbase-env.sh, log4j.properties and java.env for Zookeeper nodes. You can refer to the official HBase site to
tune your HBase clusters.
Most operations and advanced specifications for Hadoop clusters also apply to HBase clusters, such as
scaling out a node group, separating data and compute nodes, controlling the placement policy, and so on, with the
following exceptions:
1. Zookeeper nodes are not allowed to scale out in this version;
2. You cannot deploy a compute-only cluster pointing to an HBase cluster to run HBase
MapReduce jobs.

5.5.3 Use HBase Clusters


Serengeti supports most of the ways that HBase provides to access the database, including:
1. Operations through the HBase shell (see the sample session after this list);
2. If the deployed HBase cluster has the Hadoop JobTracker and TaskTracker roles, you can develop an
HBase MapReduce job to access HBase from the client node. Here is an example:
>hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 3
3. A RESTful web service is running on the client node and listening on port 8080:
>curl -I http://<client_node_ip>:8080/status/cluster
4. A Thrift gateway is also enabled and listening on port 9090.
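As a quick illustration of item 1, a typical HBase shell session on the client node (standard HBase shell commands, not specific to Serengeti) looks like this:
>hbase shell
hbase> create 'mytable', 'cf'
hbase> put 'mytable', 'row1', 'cf:col1', 'value1'
hbase> scan 'mytable'
hbase> exit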

5.6 Monitoring Cluster Deployed by Serengeti


Serengeti creates one VM folder for each deployed Serengeti Server. The folder name is SERENGETI-vApp-<vApp name>. The vApp name is specified during Serengeti deployment.
For each cluster, two levels of folders are created under the Serengeti instance folder. The first level is the cluster
name, and the second level is the node group name.
A node group folder contains all nodes in that node group.
To browse the VMs and check VM status in the vCenter client, you may select Inventory, VMs and
Templates. The Serengeti folder is listed in the left panel, and you can then check the VM nodes by following
the folder structure.
If you have installed vCOps, you can also fetch VM-level metrics including the cluster's health state, workload,
resource allocation, hardware status, etc. Please refer to the vCOps manual for more details.

5.7 Make Hadoop Master Node HA/FT


You can leverage vSphere HA and FT to address the SPOF problem of Hadoop.
1. Make sure you have enabled HA for the vSphere cluster where the Hadoop cluster is deployed. Please refer to
the vSphere documentation for detailed setup steps as needed.
2. Make sure you provide shared storage for Hadoop to deploy on.
3. By default, the Hadoop master node is configured to be protected by vSphere HA.
After doing this, once the master node virtual machine is not reachable by vSphere, vSphere will automatically start a new
instance on another available ESXi host to serve the Hadoop cluster.
There is a short downtime during the recovery. If you want to eliminate the downtime, you can use
vSphere FT to protect the master node.
Serengeti supports configuring the FT feature for master nodes. In the cluster spec file, you can set haFlag to
"ft" to enable FT protection.
...
"name": "master",
"cpuNum": 1,
"haFlag": "ft",
"storage": {
"type": "SHARED"
}
With this cluster spec, the master node of the Hadoop cluster is protected by vSphere FT. When the primary
master is not reachable, vSphere switches traffic to the standby virtual machine immediately, so there is
no failover downtime.
Please refer to Apache Hadoop 1.0 High Availability Solution on VMware vSphere for more
information.

5.8 Hadoop Topology Awareness


You can make the Hadoop cluster topology-aware when you create a cluster with the --topology option
from the CLI. With --topology, three types of topology awareness are supported: HVE, RACK_AS_RACK, and
HOST_AS_RACK.
Here is an example to create a cluster with the HVE topology.
serengeti>cluster create --name myHadoop --topology HVE --distro HVE-supported_Distro
HVE stands for Hadoop Virtualization Extensions (currently supported on Greenplum HD 1.2). HVE refines Hadoop's replica placement, task
scheduling and balancer policies. Hadoop clusters implemented on virtualized infrastructure have full
awareness of the topology on which they are running. Thus, the reliability and performance of these clusters
are enhanced. For more information about HVE, you can refer to
https://issues.apache.org/jira/browse/HADOOP-8468.
RACK_AS_RACK stands for the standard topology in existing Hadoop 1.0.x, where only rack and host
information is exposed to Hadoop.
HOST_AS_RACK is a simplified form of RACK_AS_RACK for when all the physical hosts for Serengeti
are on a single rack. In this case, each physical host is treated as a rack, in order to avoid all HDFS
data replicas being placed on a single physical host in some worst cases.
HVE is the recommended topology in Serengeti if a distro supports HVE. Otherwise, we recommend
using the RACK_AS_RACK topology in multiple-rack environments. HOST_AS_RACK is used only when a single
rack exists for Serengeti or there is no rack information at all.
In addition, when you decide to enable HVE or RACK_AS_RACK, you need to upload the rack and
physical host information to Serengeti through the CLI command below before you create a topology-aware
cluster.
serengeti>topology upload --fileName name_of_rack_hosts_mapping_file
Here is a sample of the rack and physical hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
In this sample, physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.
After a cluster is created with the selected topology option, you can view the allocated nodes on each
rack with:
serengeti>cluster list --name cluster-name --detail

5.9 Start and Stop Serengeti Services


You can stop and start Serengeti service to make a configuration take effect or to recover from an
abnormal situation.
You can run the following command in a Linux shell to stop the Serengeti service.
$ sudo serengeti-stop-services.sh
You can run the following command in a Linux shell to start the Serengeti service.
$ sudo serengeti-start-services.sh
6. Cluster Specification Reference


The cluster specification is a JSON text file. Here's a longer example with line numbers. The same file without line
numbers is attached as an appendix.
1  {
2    "nodeGroups" : [
3      {
4        "name": "master",
5        "roles": [
6          "hadoop_namenode",
7          "hadoop_jobtracker"
8        ],
9        "instanceNum": 1,
10       "instanceType": "LARGE",
11       "cpuNum": 2,
12       "memCapacityMB": 4096,
13       "storage": {
14         "type": "SHARED",
15         "sizeGB": 20
16       },
17       "haFlag": "on",
18       "rpNames": [
19         "rp1"
20       ]
21     },
22     {
23       "name": "data",
24       "roles": [
25         "hadoop_datanode"
26       ],
27       "instanceNum": 3,
28       "instanceType": "MEDIUM",
29       "cpuNum": 2,
30       "memCapacityMB": 2048,
31       "storage": {
32         "type": "LOCAL",
33         "sizeGB": 50
34       },
35       "placementPolicies": {
36         "instancePerHost": 1,
37         "groupRacks": {
38           "type": "ROUNDROBIN",
39           "racks": ["rack1", "rack2", "rack3"]
40         }
41       }
42     },
43     {
44       "name": "compute",
45       "roles": [
46         "hadoop_tasktracker"
47       ],
48       "instanceNum": 6,
49       "instanceType": "SMALL",
50       "cpuNum": 2,
51       "memCapacityMB": 2048,
52       "storage": {
53         "type": "LOCAL",
54         "sizeGB": 10
55       },
56       "placementPolicies": {
57         "instancePerHost": 2,
58         "groupAssociations": [{
59           "reference": "data",
60           "type": "STRICT"
61         }]
62       }
63     },
64     {
65       "name": "client",
66       "roles": [
67         "hadoop_client",
68         "hive",
69         "hive_server",
70         "pig"
71       ],
72       "instanceNum": 1,
73       "instanceType": "SMALL",
74       "memCapacityMB": 2048,
75       "storage": {
76         "type": "LOCAL",
77         "sizeGB": 10,
78         "dsNames": ["ds1", "ds2"]
79       }
80     }
81   ],
82   "configuration": {
83   }
84 }
It defines 4 node groups.

Line 3 to 21 defines a node group named master.

Line 22 to 42 defines a data node group named data.

Line 43 to 63 defines a compute node group named compute.

Line 64 to 83 defines a client node group.

Line 3 to 21 is an object that defines the master node group. The attributes are as follows.

Line 4 defines the name of the node group. The attribute name is "name". The value is "master".

Line 5 to 8 defines the roles of the node group. The attribute name is "roles". The values are "hadoop_namenode"
and "hadoop_jobtracker". It means hadoop_namenode and hadoop_jobtracker will be deployed
to the virtual machines in the group.

You can see the available roles with the distro list command.

Line 9 defines the number of instances in the node group. The attribute name is "instanceNum". The attribute
value is 1. It means there will be only one virtual machine created for the group.
You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig,
and hive. But you can have only one instance for hadoop_namenode and hadoop_jobtracker.

Line 10 defines the instance type of the node group. The attribute name is "instanceType". The value is
"LARGE". The instance types are predefined virtual machine specs; they are combinations of the
number of CPUs, RAM size, and storage size. The predefined numbers can be overridden by the
cpuNum, memCapacityMB and storage attributes specified in the file.

Line 11 defines the number of CPUs per virtual machine. The attribute name is "cpuNum". The value is 2. It will
override the number of CPUs of the predefined virtual machine spec.

Line 12 defines the RAM size per virtual machine. The attribute name is "memCapacityMB". The value is
4096. It will override the RAM size of the predefined virtual machine spec.

Line 13 to 16 defines the storage requirement of the node group. It is an object. The object name is
"storage".

Line 14 defines the storage type. It is an attribute of the storage object. The attribute name is
"type". The value is "SHARED". It means that the Hadoop data must be stored on
shared storage.

Line 15 defines the storage size. It is an attribute of the storage object. The attribute name is
"sizeGB". The value is 20. It means there will be a 20GB disk for Hadoop to use.

Line 17 defines whether HA applies to the node. The attribute name is "haFlag". The value is "on". It
means the virtual machine in the group is protected by vSphere HA.

Line 18 to 20 defines the resource pools which the node group must be associated with. The
attribute name is "rpNames". The value is an array, which contains one resource pool, "rp1".

You can see the same structure for the other 3 node groups. One more thing: for the data and compute groups,
we specify a pair of comprehensive placement constraints:

Line 35 to 41 defines the placement constraints for the data node group. The attribute name is
"placementPolicies" and the value is a hash which contains instancePerHost and groupRacks.
The constraint means you need at least 3 ESX hosts, because this group requires 3 instances and
forces putting 1 instance on each host; furthermore, this group will be provisioned on hosts
in rack1, rack2 and rack3 using the ROUNDROBIN algorithm.

Line 56 to 62 defines the placement constraints for the compute node group, which contains
instancePerHost and groupAssociations. The constraint means you also need at least 3 ESX
hosts for the same reason, and this group is STRICT associated to the node group "data" for better
performance.

You can customize the Hadoop configuration through the configuration attribute on lines 82 to 83, which happens to
be empty in the sample.
You can modify the values of the attributes, and you can also remove the optional attributes that you do not care about.
Following is the definition of the outermost attributes in a cluster spec:

nodeGroups (object, Mandatory): Contains one or more node group specifications; the details can be found in the table below.
configuration (object, Optional): Customizable Hadoop configuration key/value pairs.
externalHDFS (string, Optional): URI of an external HDFS (only valid for a compute-only cluster).

Following is the definition of the objects and attributes for a particular node group.

name (string, Mandatory): User-defined node group name.
roles (list of string, Mandatory): A list of software packages or services that will be installed on the virtual machines in the node group. Each item must be exactly the same as shown by the distro list command.
instanceNum (integer, Mandatory): How many virtual machines are in the node group. It must be a positive integer. For hadoop_namenode and hadoop_jobtracker, it must be 1.
instanceType (string, Optional): Size of the virtual machines in the node group. It is the name of a predefined virtual machine template and can be SMALL, MEDIUM, LARGE, or EXTRA_LARGE. The cpuNum, memCapacityMB, and storage.sizeGB attributes override this attribute if they are defined in the same node group.
cpuNum (integer, Optional): Number of vCPUs per virtual machine.
memCapacityMB (integer, Optional): Amount of RAM in MB per virtual machine.
storage (object, Optional): Storage settings.
  type (string, Optional): It can be LOCAL or SHARED.
  sizeGB (integer, Optional): Data storage size. It must be a positive integer.
  dsNames (list of string, Optional): Datastores the node group can use.
rpNames (list of string, Optional): Resource pools the node group can use.
haFlag (string, Optional): It can be "on", "off" or "ft". "on" means use vSphere HA to protect the node group; "ft" means use vSphere FT to protect the node group. By default, the name node and job tracker are protected by vSphere HA.
placementPolicies (object, Optional): It can contain three optional constraints: "instancePerHost", "groupRacks" and "groupAssociations"; refer to 5.3.2 for details.


Serengeti comes with predefined virtual machine specifications:

                                    SMALL     MEDIUM    LARGE     EXTRA_LARGE
Number of vCPUs                     1         2         4         8
RAM                                 3.75GB    7.5GB     15GB      30GB
Disk size for Hadoop master data    25GB      50GB      100GB     200GB
Disk size for Hadoop worker data    50GB      100GB     200GB     400GB
Disk size for Hadoop client data    50GB      100GB     200GB     400GB

When creating virtual machines, Serengeti tries to allocate datastores of the preferred type. SHARED
storage is preferred for masters and clients; LOCAL storage is preferred for workers.
Separate disks are created for the OS and swap.

7. Serengeti Command Reference


7.1 connect
Connect and login to remote Serengeti server.
--host (Mandatory): Specify the Serengeti web service URL in the format <Serengeti Management Server IP or host>:<port>. By default, the Serengeti web service is started at port 8080.
--username (Optional): The Serengeti user name.
--password (Optional): The Serengeti password.

The command reads the username and password in interactive mode. Section 5.1 describes how to
manage Serengeti users.
If the connection fails, or the connect command has not been run, no other Serengeti command is allowed to be
executed.

7.2 cluster
7.2.1 cluster config
Modify Hadoop configuration of an existing default or customized Hadoop cluster in Serengeti.
--name <cluster name in Serengeti> (Mandatory): Specify the Hadoop cluster name in Serengeti.
--specFile <spec file path> (Optional): Specify the Hadoop cluster's specification in a customized file.
--yes (Optional): Answer y to the Y/N confirmation. If not specified, the user needs to answer y or n explicitly.

7.2.2 cluster create


Create a default/customized Hadoop cluster in Serengeti.
--name <cluster name in Serengeti> (Mandatory): Specify the Hadoop cluster name in Serengeti.
--type <cluster type> (Optional): Specify the cluster type. Hadoop or HBase is supported. The default is Hadoop.
--specFile <spec file path> (Optional): Specify the Hadoop cluster's specification in a customized file.
--distro <Hadoop distro name> (Optional): Specify which distro will be used to deploy the Hadoop cluster. The distros include Apache Hadoop, Greenplum HD, CDH3 and HDP1.
--dsNames <datastore names> (Optional): Specify which datastores will be used to deploy the Hadoop cluster in Serengeti. By default, it will use the same one as the Serengeti virtual machine. Multiple datastores can be used, separated by ",".
--networkName <network name> (Optional): Specify which network will be used to deploy the Hadoop cluster in Serengeti. By default, it will use the same one as the Serengeti virtual machine.
--rpNames <resource pool name> (Optional): Specify which resource pools will be used to deploy the Hadoop cluster in Serengeti. By default, it will use the same one as the Serengeti virtual machine. Multiple resource pools can be used, separated by ",".
--resume (Optional): If --resume is specified, this command will resume a creation process for a cluster whose deployment failed.
--topology <topology type> (Optional): Specify which topology type will be used for rack awareness: HVE, RACK_AS_RACK, or HOST_AS_RACK.
--yes (Optional): Answer y to the Y/N confirmation. If not specified, the user needs to answer y or n explicitly.
--skipConfigValidation (Optional): Skip cluster configuration validation.

If the cluster spec does not include required nodes, for example master node, Serengeti will generate
them with a default configuration.
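For example, a hypothetical invocation that combines several of these options (the datastore, resource pool and network names are placeholders):
serengeti>cluster create --name myCluster --distro cdh4 --dsNames ds1,ds2 --rpNames rp1 --networkName net1 --topology RACK_AS_RACK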
7.2.3 cluster delete


Delete a Hadoop cluster in Serengeti.
--name <cluster name> (Mandatory): Delete the specified Hadoop cluster in Serengeti.

7.2.4 cluster export


Export cluster information.
--spec (Mandatory): Export the cluster specification. The exported cluster specification can be used in the cluster create or cluster config command.
--output (Optional): Specify the output file name for the exported cluster information. If not specified, the output is displayed in the console.

7.2.5 cluster limit


Enable or disable provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti
to reach the limit specified by activeComputeNodeNum. Compute nodes are re-commissioned and
powered-on, or decommissioned and powered-off to reach the specified number of active compute nodes.
--name <cluster_name> (Mandatory): Name of the Hadoop cluster in Serengeti.
--nodeGroup <node_group_name> (Optional): Name of a node group in the specified Hadoop cluster in Serengeti (supports node groups with the task tracker role only).
--activeComputeNodeNum <number> (Mandatory): Number of active compute nodes for the specified Hadoop cluster or node group within that cluster. The valid values are integers greater than or equal to zero.
- For a zero value, all the nodes in the specified Hadoop cluster or the specified node group (if a --nodeGroup value is specified) will be decommissioned and powered off.
- For an integer value between 1 and the maximum node number of the Hadoop cluster or node group (if a --nodeGroup value is specified), the specified number of nodes will stay commissioned and powered on, and the other nodes will be decommissioned.
- For an integer value larger than the maximum node number of the Hadoop cluster or node group (if a --nodeGroup value is specified), all the nodes in the specified Hadoop cluster or the specified node group will be re-commissioned and powered on.

7.2.6 cluster list


List all Hadoop clusters in Serengeti.
--name <cluster name in Serengeti> (Optional): List the specified Hadoop cluster in Serengeti, including name, distro, status, and each role's information. For each role, it lists instance count, CPU, memory, type and size.
--detail (Optional): List all the Hadoop clusters' details, including the name in Serengeti, distro, deploy status, and each node's information for the different roles. Note: with this option specified, Serengeti queries the vCenter server to get the latest node status. That operation may take a few seconds for each cluster.


7.2.7 cluster resize


Change the number of nodes in a node group.
--name <cluster name in Serengeti> (Mandatory): Specify the target Hadoop cluster in Serengeti.
--nodeGroup <name of the node group> (Mandatory): Specify the target node group to be scaled out in the Hadoop cluster deployed by Serengeti.
--instanceNum <instance number> (Mandatory): Specify the target instance count to scale out to. The target count needs to be more than the original.

Example:
cluster resize --name foo --nodeGroup slave --instanceNum 10

7.2.8 cluster start


Start a Hadoop cluster in Serengeti.
--name <cluster name> (Mandatory): Start the specified Hadoop cluster in Serengeti.


7.2.9 cluster stop


Stop a Hadoop cluster in Serengeti.
--name <cluster name> (Mandatory): Stop the specified Hadoop cluster in Serengeti.

7.2.10 cluster target


Connect to one Hadoop cluster to interact with it through the Serengeti CLI, including running the fs, mr, pig, and hive
commands.

--name <cluster name> (Optional): The name of the cluster to connect to. If the user does not specify this parameter, the first cluster listed by the cluster list command is used.
--info (Optional): Show the targeted cluster's information, such as the HDFS URL, Job Tracker URL and Hive server URL.

Note: --name and --info cannot be used together.

7.2.11 cluster unlimit


Enable all of the provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti.
Compute nodes are re-commissioned and powered-on as necessary.
--name <cluster_name> (Mandatory): Name of the Hadoop cluster in Serengeti.
--nodeGroup <node_group_name> (Optional): Name of a node group in the specified Hadoop cluster in Serengeti (only supports node groups with the task tracker role).

7.3 datastore
7.3.1 datastore add
Add a datastore to Serengeti for deploying.
--name <datastore name in Serengeti> (Mandatory): Specify the name of the datastore added to Serengeti.
--spec <datastore name in vCenter> (Mandatory): Specify the datastore name in vSphere. Users can use wildcards to specify multiple VMFS stores; "*" and "?" are supported as wildcards.
--type <datastore type: LOCAL|SHARED> (Mandatory): Specify the datastore type in vSphere: local storage or shared storage.

7.3.2 datastore delete


Delete a datastore from Serengeti.
--name <datastore name in Serengeti> (Mandatory): Delete the specified datastore in Serengeti.

7.3.3 datastore list


List datastores added to Serengeti.
--name <datastore name in Serengeti> (Optional): List the specified datastore's information, including name and type.
--detail (Optional): List the datastore details, including the datastore path in vSphere.
All datastores that are added to Serengeti are listed if the name is not specified.

7.4 distro
7.4.1 distro list
Show the roles offered in a distro.

--name <distro name> (Optional): List the specified distro's information.


7.5 disconnect
Disconnect and log out from the remote Serengeti server. After disconnecting, the user is not allowed to run any CLI
commands.

7.6 fs
7.6.1 fs cat
Copy source paths to stdout.
<file name> (Mandatory): The file to be shown in the console. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.2 fs chgrp
Change group association of files.
--group <group name> (Mandatory): The group name for the files.
--recursive true|false (Optional): Make the change recursively through the directory structure.
<file name> (Mandatory): The file whose group is to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.3 fs chmod
Change the permissions of files.
--mode <permission mode> (Mandatory): The file permission mode, such as 755.
--recursive true|false (Optional): Make the change recursively through the directory structure.
<file name> (Mandatory): The file whose permissions are to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.4 fs chown
Change the owner of files.
--owner <owner name> (Mandatory): The file owner name.
--recursive true|false (Optional): Make the change recursively through the directory structure.
<file name> (Mandatory): The file whose owner is to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.5 fs copyFromLocal
Copy a single source file, or multiple source files, from the local file system to the destination file system. It is
the same as put.

--from <local file path> (Mandatory): The file path on the local file system. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS file path> (Mandatory): The file path in HDFS. If --from specifies multiple files, --to is a directory name.
7.6.6 fs copyToLocal
Copy files to the local file system. It is the same as get.
--from <HDFS file path> (Mandatory): The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path> (Mandatory): The file path on the local file system. If --from specifies multiple files, --to is a directory name.


7.6.7 fs copyMergeToLocal
Takes a source directory and a destination file as input and concatenates the files in the HDFS directory
into a file on the local file system.

--from <HDFS file path> (Mandatory): The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path> (Mandatory): The file path on the local file system.
--endline <true|false> (Optional): Whether to add an end-of-line character.

7.6.8 fs count
Count the number of directories, files, bytes, quota, and remaining quota.
--path <HDFS path> (Mandatory): The path to be counted.
--quota <true|false> (Optional): Whether to include quota information.

7.6.9 fs cp
Copy files from source to destination. This command allows multiple sources as well in which case the
destination must be a directory.
--from <HDFS source file path> (Mandatory): The source file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS destination file path> (Mandatory): The destination file path in HDFS. If --from specifies multiple files, --to is a directory name.

7.6.10 fs du
Displays the sizes of files and directories contained in the given directory, or the length of a file in case it is just
a file.

<file name> (Mandatory): The file or directory to be shown in the console. Multiple paths must be quoted, such as "/path/file1 /path/file2".

7.6.11 fs expunge
Empty the trash bin in the HDFS.


7.6.12 fs get
Copy files to the local file system.
--from <HDFS file path> (Mandatory): The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path> (Mandatory): The file path on the local file system. If --from specifies multiple files, --to is a directory name.

7.6.13 fs ls
List files in the directory.
<path name> (Mandatory): The path to be listed. Multiple paths must be quoted, such as "/path/file1 /path/file2".
--recursive <true|false> (Optional): Whether to list the directory recursively.

7.6.14 fs mkdir
Create a new directory.
<dir name> (Mandatory): The directory name to be created.

7.6.15 fs moveFromLocal
Similar to put command, except that the source local file is deleted after it is copied.
Parameter | Mandatory/Optional | Description
--from <local file path> | Mandatory | The local file path. Multiple files must be quoted, such as /path/file1 /path/file2.
--to <HDFS file path> | Mandatory | The HDFS file path. If --from specifies multiple files, --to must be a directory name.
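For example, to move a local file into HDFS, deleting the local copy (illustrative paths):
>fs moveFromLocal --from /tmp/staging.txt --to /user/serengeti/staging.txt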

7.6.16 fs mv
Move source files to destination in the HDFS.
Parameter | Mandatory/Optional | Description
--from <source file path> | Mandatory | The HDFS source file path. Multiple files must be quoted, such as /path/file1 /path/file2.
--to <dest file path> | Mandatory | The HDFS destination file path. If --from specifies multiple files, --to must be a directory name.
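For example, to move a file into another HDFS directory (illustrative paths):
>fs mv --from /user/serengeti/old.log --to /user/serengeti/archive/old.log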

7.6.17 fs put
Copy a single source file, or multiple source files, from the local file system to HDFS.
Parameter | Mandatory/Optional | Description
--from <local file path> | Mandatory | The local file path. Multiple files must be quoted, such as /path/file1 /path/file2.
--to <HDFS file path> | Mandatory | The HDFS file path. If --from specifies multiple files, --to must be a directory name.
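For example, to upload two local files into an HDFS directory, quoting the multiple sources (illustrative paths):
>fs put --from "/tmp/file1 /tmp/file2" --to /user/serengeti/input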

7.6.18 fs rm
Remove files in the HDFS.
Parameter | Mandatory/Optional | Description
<file path> | Mandatory | The file to be removed.
--recursive <true|false> | Optional | Whether to remove files recursively.
--skipTrash <true|false> | Optional | Whether to bypass the trash.
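For example, to remove a directory recursively and bypass the trash (the path is illustrative):
>fs rm /user/serengeti/tmp --recursive true --skipTrash true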

7.6.19 fs setrep
Change the replication factor of a file.
Parameter | Mandatory/Optional | Description
--path <file path> | Mandatory | The path whose replication factor is to be changed.
--replica <replica number> | Mandatory | The number of replicas.
--recursive <true|false> | Optional | Whether to set the replication factor recursively.
--waiting <true|false> | Optional | Whether to wait until the actual replica count equals the requested number.
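For example, to set a replication factor of 3 recursively and wait for replication to finish (the path is illustrative):
>fs setrep --path /user/serengeti/data --replica 3 --recursive true --waiting true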

7.6.20 fs tail
Display the last kilobyte of the file to stdout.
Parameter | Mandatory/Optional | Description
<file path> | Mandatory | The file path to be displayed.
--file <true|false> | Optional | Whether to keep showing content as the file grows.
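For example, to follow a growing log file (the path is illustrative):
>fs tail /user/serengeti/logs/app.log --file true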

7.6.21 fs text
Take a source file and output the file in text format.
Parameter | Mandatory/Optional | Description
<file path> | Mandatory | The file path to be displayed.

7.6.22 fs touchz
Create a file of zero length.
Parameter | Mandatory/Optional | Description
<file path> | Mandatory | The file name to be created.

7.7 hive
7.7.1 hive cfg
Configure Hive.
Parameter | Mandatory/Optional | Description
--host <server host> | Optional | The server host.
--port <server port> | Optional | The server port.
--timeout | Optional | The timeout in milliseconds.

7.7.2 hive script


Execute a Hive script. Note: You need to run hive cfg before running a hive script.
Parameter | Mandatory/Optional | Description
--location <script path> | Mandatory | The Hive script file name to be executed.
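For example, a possible session (the host, port, and script path are illustrative values, not defaults defined by this guide):
>hive cfg --host 192.168.1.10 --port 10000
>hive script --location /home/serengeti/queries/report.hql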

7.8 mr
7.8.1 mr jar
Run a MapReduce job located inside the provided jar.
Parameter | Mandatory/Optional | Description
--jarfile <jar file path> | Mandatory | The jar file path.
--mainclass <main class name> | Mandatory | The class that contains the main() method.
--args <arg> | Optional | The arguments to the main class. If there are multiple arguments, they must be double quoted.
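For example, to run the Hadoop WordCount example (the jar path and argument values are illustrative):
>mr jar --jarfile /tmp/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/user/serengeti/input /user/serengeti/output"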

7.8.2 mr job counter


Print the counter value of the MR job.
Parameter | Mandatory/Optional | Description
--jobid <job id> | Mandatory | The MR job id.
--groupname <group name> | Mandatory | The counter's group name.
--countername <counter name> | Mandatory | The counter's name.

7.8.3 mr job events


Print the details of the events received by the JobTracker for the given range.
Parameter | Mandatory/Optional | Description
--jobid <job id> | Mandatory | The MR job id.
--from <from-event-#> | Mandatory | The starting event number to print.
--number <#-of-events> | Mandatory | The total number of events to print.

7.8.4 mr job history


Print job details, failed and killed job details.
Parameter | Mandatory/Optional | Description
<job history directory> | Mandatory | The directory where the job history files are stored.
--all <true|false> | Optional | Whether to print information for all jobs.

7.8.5 mr job kill


Kill the MR job.
Parameter | Mandatory/Optional | Description
--jobid <job id> | Mandatory | The job id.


7.8.6 mr job list


List MR jobs.
Parameter | Mandatory/Optional | Description
--all <true|false> | Optional | Whether to list all jobs.

7.8.7 mr job set priority


Change the priority of the job.
Parameter | Mandatory/Optional | Description
--jobid <jobid> | Mandatory | The job id.
--priority <VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW> | Mandatory | The job's priority.

7.8.8 mr job status


Query MR job status.
Parameter | Mandatory/Optional | Description
--jobid <jobid> | Mandatory | The job id.
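For example (the job id is illustrative; use an id reported by mr job list):
>mr job status --jobid job_201303221234_0001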

7.8.9 mr job submit


Submit a MR job defined in the job file.
Parameter | Mandatory/Optional | Description
--jobfile <jobfile> | Mandatory | Specify the file that defines the MR job. The file is a standard Hadoop configuration file. An example configuration file follows:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.jar</name>
<value>/home/hadoop/hadoop-1.0.1/hadoop-examples-1.0.1.jar</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/hadoop/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/hadoop/output</value>
</property>
<property>
<name>mapred.job.name</name>
<value>wordcount</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.WordCount.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.WordCount.IntSumReducer</value>
</property>
</configuration>

7.8.10 mr task fail


Fail the Map Reduce task.
Parameter | Mandatory/Optional | Description
--taskid <taskid> | Mandatory | Specify the task id.

7.8.11 mr task kill


Kill the Map Reduce task.
Parameter | Mandatory/Optional | Description
--taskid <taskid> | Mandatory | Specify the task id.

7.9 network
7.9.1 network add
Add a network to Serengeti.
Parameter | Mandatory/Optional | Description
--name <network name in Serengeti> | Mandatory | Specify the name of the network resource added to Serengeti.
--portGroup <port group name in vSphere> | Mandatory | Specify the name of the port group in vSphere that you want to add to Serengeti.
--dhcp | Combination 1 | Specify the IP address assignment type, DHCP.
--ip <IP spec, an IP address range such as xx.xx.xx.xx-xx[,xx]*> --dns <DNS server IP> --secondaryDNS <DNS server IP> --gateway <gateway IP> --mask <network mask> | Combination 2 | Specify the IP address assignment type, static IP.

For example:
>network add --name ipNetwork --ip 192.168.1.1-100,192.168.1.120-180 --portGroup pg1 --dns
202.112.0.1 --gateway 192.168.1.255 --mask 255.255.255.0
>network add --name dhcpNetwork --dhcp --portGroup pg1

7.9.2 network delete


Delete a network in Serengeti.
Parameter | Mandatory/Optional | Description
--name <network name in Serengeti> | Mandatory | Delete the specified network in Serengeti.

7.9.3 network list


List available networks in Serengeti.
Parameter | Mandatory/Optional | Description
--name <network name in Serengeti> | Optional | List the specified network in Serengeti, including its name, port group in vSphere, IP address assignment type, assigned IP addresses, and so on.
--detail | Optional | List detailed network information in Serengeti, including the network information of Hadoop cluster nodes.

For example:
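>network list --detail
This lists detailed information for all networks added to Serengeti, including the network information of Hadoop cluster nodes.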


7.10 pig


7.10.1 pig cfg
Configure Pig.
Parameter | Mandatory/Optional | Description
--props | Optional | Specify the Pig properties file location.
--jobName | Optional | Specify the job name.
--jobPriority | Optional | Specify the job priority.
--jobTracker | Optional | Specify the job tracker.
--execType | Optional | Specify the execution type.
--validateEachStatement | Optional | Whether to validate each statement.

7.10.2 pig script


Execute a Pig script. Note: You need to run pig cfg before running this command.
Parameter | Mandatory/Optional | Description
--location <script path> | Mandatory | Specify the name of the script to be executed.
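For example, a possible session (the job name and script path are illustrative):
>pig cfg --jobName dailyReport
>pig script --location /home/serengeti/scripts/report.pig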

7.11 resourcepool
7.11.1 resourcepool add
Add a resource pool in vSphere to Serengeti.
Parameter | Mandatory/Optional | Description
--name <resource pool name in Serengeti> | Mandatory | Specify the name of the resource pool added to Serengeti.
--vccluster <vSphere cluster of the resource pool> | Mandatory | Specify the name of the vSphere cluster in which the resource pool resides.
--vcrp <vSphere resource pool name> | Mandatory | Specify the vSphere resource pool that is added to Serengeti for deploying. The vSphere resource pool must be directly under a cluster.

7.11.2 resourcepool delete

Remove a resource pool from Serengeti.
Parameter | Mandatory/Optional | Description
--name <resource pool name in Serengeti> | Mandatory | Remove the specified resource pool from Serengeti.

7.11.3 resourcepool list

List resource pools added to Serengeti. All resource pools that are added to Serengeti are listed if a name is not specified. For each resource pool, NAME and PATH are listed. NAME is the name in Serengeti. PATH is the combination of the vSphere cluster name and the resource pool name, separated by /.
Parameter | Mandatory/Optional | Description
--name <resource pool name in Serengeti> | Optional | List the specified resource pool's name and path.
--detail | Optional | List each resource pool's general information and the Hadoop cluster nodes in the resource pool.
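For example, a possible session (the resource pool, cluster, and vSphere resource pool names are illustrative):
>resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
>resourcepool list --detail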


7.12 topology
7.12.1 topology upload
Upload a rack-hosts mapping topology file to Serengeti. The newly uploaded file overwrites the existing file. The accepted file format has one line per rack: rackname: hostname1, hostname2. hostname1 and hostname2 stand for the host names displayed in vSphere.
Parameter | Mandatory/Optional | Description
--fileName <topology file name> | Mandatory | Specify the topology file name.
--yes | Optional | Answer y to the Y/N confirmation.
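For example (the file name is illustrative):
>topology upload --fileName /home/serengeti/rack_topology.data --yes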

7.12.2 topology list


List the rack-hosts mapping topology stored in Serengeti.

8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the Datacenter
and select "New Cluster..." Follow new Cluster Wizard using the following settings:

Enable vSphere HA and vSphere DRS

Enable Host Monitoring

Enable Admission Control and set desired policy. (Default policy is to tolerate 1 host failure)

Virtual machine restart priority High

Virtual machine Monitoring virtual machine and Application Monitoring

Monitoring sensitivity High



8.1.2 Enable DRS/HA on an existing cluster


If DRS or HA is not already enabled on an existing cluster, it can be enabled by right-clicking the cluster
and selecting Edit Settings. Under Cluster Features, select "Turn On vSphere DRS" and "Turn On
vSphere HA". Use settings specified in "Setup Cluster" above.

8.1.3 Add Hosts to Cluster


In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right-click the Cluster
that was just created and select "Add Host...". Follow the Add Host Wizard to add a Host. Repeat for each
additional Host.

8.1.4 DRS/FT Settings


In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, click a host in the cluster. On the right side there is a row of tabs near the top of the window; click Configuration and then click Networking. The window displays the vSwitch port groups. By default, a VMkernel Port called Management Network is pre-configured. Click Properties... on the vSwitch, choose the Management Network, and click the Edit button. Enable vMotion and Fault Tolerance Logging from the Management Network Properties window.
To verify the FT status of a host, click the Summary tab and locate Host Configured for FT in the General section. If there are any issues with FT, they are shown there.

8.1.5 Enable FT on specific virtual machine


Fault Tolerance runs one virtual machine on two separate hosts, which allows for instant failover in a variety of situations. Before enabling FT, ensure the necessary requirements are met:
- Host hardware is listed in the VMware Hardware Compatibility List (HCL)
- All hosts in the cluster have hardware virtualization (VT) enabled in the BIOS
- The Management Network (VMkernel Port) has vMotion and "Fault Tolerance Logging" enabled
- There is available capacity in the cluster
- Virtual machine disks are thick provisioned, without snapshots, and located on shared storage
- The virtual machine has a single vCPU

In the vCenter Client, select Inventory, Hosts and Clusters. In the left column, right click the virtual
machine and select Fault Tolerance, Turn On Fault Tolerance.

8.2 Network Settings


Serengeti currently deploys using a single network. Virtual machines are deployed with one NIC which is
attached to a specific Port Group. How this Port Group is configured and the network backing the Port
Group depends on the environment. Here we will cover a basic network configuration that may be
customized as needed.


Either a vSwitch or vSphere Distributed Switch can be used to provide the Port Group backing a
Serengeti cluster. vDS acts as a single virtual switch across all attached hosts while a vSwitch is per-host
and requires the Port Group to be configured manually.

8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)


In the vCenter Client, select Inventory, Networking. Right Click the Datacenter and select New
vSphere Distributed Switch.
In the Create vSphere Distributed Switch wizard, choose Switch Version 5.0, then enter a name and the number of uplink ports (physical adapters) you require.
On the Add Hosts and Physical Adapters step, select the adapter(s) on each host that will carry traffic to
the switch.
On the last step it will create a default Port Group. You can rename this Port Group after it is created and
the wizard is completed.

8.2.2 Setup Port Group - Option B (vSwitch)


In the vCenter Client, select Inventory, Hosts and Clusters. Navigate to the Networking section of the
Configuration Tab. Make sure the vSphere Standard Switch view is selected.
There is already vSwitch0 created by default. You may add a Port Group to this vSwitch or create a new
vSwitch that binds to different physical adapters.
To create a Port Group on the existing vSwitch click Properties on that vSwitch and then click the
Add button. Follow the wizard to create the Port Group.
To create a new vSwitch, click on Add Networking and follow the Add Network Wizard.

8.3 Storage Settings


Serengeti provisions virtual machines on shared storage to enable vSphere HA, FT and DRS features.
Local datastores are attached to virtual machines to be used for data.

8.3.1 Shared Storage Setting


Create a LUN on shared storage (SAN/NAS) and verify that it is accessible by all hosts in the cluster. For the vSphere HA datastore heartbeating feature, two datastores are required.

8.3.2 Local Storage Settings


8.3.2.1 Configure DAS on Physical Hosts
Direct Attached Storage should be attached and configured on the physical controller to present each
disk separately to the OS. This configuration is commonly described as JBOD (Just A Bunch Of Disks) or
single disk RAID0.
8.3.2.2 Provision VMFS Datastores on DAS of Each Host
Create VMFS Datastores on Direct Attached Storage. This can be done in either of the following two
ways.

- Manually, using the vSphere Client or the vSphere Management Assistant
- Automation by vSphere PowerCLI


8.3.2.2.1 Manually Using vSphere Client (Manual per disk):


1. Expand the Cluster => Select a Host.
2. Go to the "Configuration" tab.
3. Choose "Storage".
4. Click "Add Storage...". This starts the Add Storage Wizard; continue with the following steps in the wizard.
5. Select "Disk/LUN" for the Storage Type => Next.
6. Select a Local Disk from the list => Next.
7. Select "VMFS-5" for the File System Version => Next => Next.
8. Enter a Datastore Name => Next.
9. Accept "Maximum Available Space" => Next.
10. Finish.

8.3.2.2.2 Automation by vSphere PowerCLI


This method requires vSphere PowerCLI to be installed. Refer to the vSphere PowerCLI site to download and install PowerCLI.
Once PowerCLI is installed, you can use it to format many direct attached disks to VMFS at a time.
1. Select Start > Programs > VMware > VMware vSphere PowerCLI.
The VMware vSphere PowerCLI console window opens.
2. In the VMware vSphere PowerCLI console window, run PowerCLI commands to format the disks.
CAUTION
The commands apply to multiple ESXi hosts at a time. Make sure the scope is what you intend before you run a command.
Here is a sample script for provisioning datastores. You can type the commands line by line in the PowerCLI shell.
In this example, the script formats the local disks of all hosts in a vSphere cluster named My Cluster. The disks are formatted as VMFS datastores, and the datastore name prefix is abcde.

vSphere PowerCLI - Create Local Datastores for Cluster


# Connect to a vCenter Server.
Connect-VIServer -Server 10.23.112.235 -Protocol https -User admin -Password pass
# Prepare variables.
$i = 0
$localDisks = @{}
$clusterName = "My Cluster"
$datastoreName = "abcde"
# Select Hosts
$vmHosts = Get-VMHost -Location $clusterName
# Get Local Disks
$ldArray = $vmHosts | Get-VMHostDisk | select -ExpandProperty ScsiLun | where {$_.IsLocal -eq "True"}
# Get diagnostic partitions so they can be excluded
$pdArray = $vmHosts | Get-VMHostDiagnosticPartition
# Add local disks to a hashtable keyed by CanonicalName
foreach($ld in $ldArray) {$localDisks.Add($ld.CanonicalName,$ld)}
# Remove the diagnostic partitions from the local disk hashtable
foreach($pd in $pdArray) {$localDisks.Remove($pd.CanonicalName)}
# Create datastores. Creation fails for any local disks that are in use.
foreach ($ld in $localDisks.Values) {$i++; New-Datastore -Vmfs -Name ($datastoreName + $i.ToString("D3")) -Path $ld.CanonicalName -VMHost $ld.VMHost}

9. Appendix A: Create Local Yum Repository for MapR


9.1 Install a web server to serve as the yum server
Find a machine or virtual machine with CentOS 5.x 64-bit (or RHEL 5.x 64-bit) that has Internet access, and install a web server such as Apache or lighttpd on the machine. Or you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses installing the Apache web server as an example.
9.1.1 Configure http proxy

First open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user
sudo su
export http_proxy=http://<proxy_server:port>

9.1.2 Install Apache Web Server


yum install -y httpd
/sbin/service httpd start

Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to ensure the default test page of the Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop

9.1.3 Install yum related packages

Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo

9.1.4 Sync the remote MapR yum repository

1) Create a new file /etc/yum.repos.d/mapr-m5.repo using vi or other editors with the


following content:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.1/redhat/
enabled=1
gpgcheck=0
protect=1
[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1

2) Mirror the remote yum repository to the local machine:


reposync -r maprtech
reposync -r maprecosystem

This will take several minutes (depending on the network bandwidth) to download all the
RPMs in the remote repository, and all the RPMs are put in new folders named
maprtech and maprecosystem.
9.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache
Web Server. The Document Root folder is /var/www/html/ for Apache by default, and if
you use Serengeti Management Server to set up the yum server, the folder is
/opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/mapr/2
mv maprtech/ maprecosystem/ $doc_root/mapr/2/

2) Create a yum repository for the RPMs:


cd $doc_root/mapr/2
createrepo .

3) Create a new file /var/www/html/mapr/2/mapr-m5.repo with the following content:


[mapr-m5]
name=MapR Version 2
baseurl=http://<ip_of_webserver>/mapr/2
enabled=1
gpgcheck=0
protect=1

Please replace the <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/mapr/2/mapr-m5.repo from another
machine.
9.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and applies only if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.

10. Appendix B: Create Local Yum Repository for CDH4


10.1 Install a web server to serve as the yum server
Find a machine or virtual machine with CentOS 5.x 64-bit (or RHEL 5.x 64-bit) that has Internet access, and install a web server such as Apache or lighttpd on the machine. Or you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses installing the Apache web server as an example.
10.1.1 Configure http proxy

First open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user

sudo su
export http_proxy=http://<proxy_server:port>

10.1.2 Install Apache Web Server


yum install -y httpd
/sbin/service httpd start

Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to ensure the default test page of the Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop

10.1.3 Install yum related packages

Install the yum-utils and createrepo packages if they are not already installed (yum-utils
includes the reposync command):
yum install -y yum-utils createrepo

10.1.4 Sync the remote CDH4 yum repository

1) Create a new file /etc/yum.repos.d/cloudera-cdh4.repo using vi or other editors


with the following content:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.1.2/
gpgkey = http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1

2) Mirror the remote yum repository to the local machine:


reposync -r cloudera-cdh4

This will take several minutes (depending on the network bandwidth) to download all the RPMs in the remote repository, and all the RPMs are put in a new folder named cloudera-cdh4.
10.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache
Web Server. The Document Root folder is /var/www/html/ for Apache by default, and if
you use Serengeti Management Server to set up the yum server, the folder is
/opt/serengeti/www/ .

doc_root=/var/www/html
mkdir -p $doc_root/cdh/4/
mv cloudera-cdh4/RPMS $doc_root/cdh/4/

2) Create a yum repository for the rpms:


cd $doc_root/cdh/4
createrepo .

3) Create a new file /var/www/html/cdh/4/cloudera-cdh4.repo with the following content:


[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://<ip_of_webserver>/cdh/4/
enabled=1
gpgcheck=0

Please replace the <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/cdh/4/cloudera-cdh4.repo from
another machine.
10.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and applies only if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
/opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of Serengeti Management Server and the local yum repository servers for
'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.
