Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.
Copyright Notice
© Copyright 2006–2015 Hewlett-Packard Development Company, L.P.
Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
Page 2 of 15
Contents

Cluster Layout 5
    Example 6
Hardware Selection 8
Installing and Configuring Hadoop 10
    webHDFS 10
    YARN 10
    Hadoop Balancer 10
    Replication Factor 11
If you are already using HP Vertica Enterprise Edition, skip this guide and refer to the rest of the documentation set.

This guide explains how to set up your HP Vertica and Hadoop clusters to work together. You also need to consult the following documents for how to set up and use these products together:

- Installation Guide: See this document for instructions on installing and verifying HP Vertica.
- Administrator's Guide: See, in particular, the sections on configuring the database and managing storage locations.
- Hadoop Integration Guide: See this document for detailed instructions on configuring HP Vertica to use HDFS storage.
Cluster Layout
Hadoop and HP Vertica each operate on a cluster of nodes for distributed processing. In the
Enterprise Edition product, these clusters are on distinct nodes. In the HP Vertica for SQL on
Hadoop product, HP Vertica nodes are co-located on some Hadoop nodes. Thus, you run your HP
Vertica cluster on a subset of your Hadoop cluster, while continuing to run Hadoop on those nodes.
The HP Vertica nodes use a private network in addition to the public network used by all Hadoop
nodes, as the following figure shows:
Example
Assume that you:

- Have 10 racks with 10 physical nodes each, all configured for Hadoop.
- Have 20 higher-end nodes that you have configured for HP Vertica.

HP Vertica works best if all of its nodes have the same hardware configuration. Given these assumptions, you should distribute the 20 higher-end nodes so that there are two in each rack. Spreading the HP Vertica nodes across racks improves I/O performance and fault tolerance.
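The layout in this example can be sketched with a small helper. This is an illustrative sketch only; the rack and node names are hypothetical placeholders, not HP Vertica or Hadoop tooling:

```python
from itertools import cycle

def assign_vertica_nodes(racks, vertica_node_count):
    """Round-robin HP Vertica nodes across racks so each rack holds
    an equal share (hypothetical planning helper)."""
    assignment = {rack: [] for rack in racks}
    rack_cycle = cycle(racks)
    for i in range(vertica_node_count):
        # Assign the next higher-end node to the next rack in turn.
        assignment[next(rack_cycle)].append(f"vertica{i + 1:02d}")
    return assignment

racks = [f"rack{r:02d}" for r in range(1, 11)]   # 10 racks
layout = assign_vertica_nodes(racks, 20)         # 20 higher-end nodes
# Each of the 10 racks ends up with exactly 2 HP Vertica nodes.
```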
For more information about configuring Kerberos for use with HP Vertica and Hadoop, see Using
Kerberos-Enabled Hadoop.
Hardware Selection
Hadoop clusters frequently do not have identical provisioning requirements or hardware
configurations. However, HP Vertica nodes should be equivalent in size and capability, per the
best-practice standards recommended in General Hardware and OS Requirements and
Recommendations in the Installation Guide.
Because Hadoop cluster specifications do not always meet these standards, Hewlett-Packard
recommends the following specifications for HP Vertica nodes in your Hadoop cluster.
Specifications

For...       Recommendation

Processor
- Two-socket servers with 8–14 core CPUs clocked at or above 2.6 GHz for clusters over 10 TB
- Single-socket servers with 8–12 core CPUs clocked at or above 2.6 GHz for clusters under 10 TB

Memory
Distribute the memory appropriately across all memory channels in the server:
Storage
Read/write:

- Storage post RAID: Each node should have 1–9 TB. For a production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable.

Because of the heavy compression and encoding that HP Vertica does, SSDs are not required. In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs.

If you intend to use RAID 50 for your data partition, keep a spare node in every rack to allow manual failover of an HP Vertica node in the case of a drive failure. An HP Vertica node recovery is faster than a RAID 50 rebuild. Also, never put more than 10 TB of compressed data on any node, to keep node recovery times acceptable.
Network
webHDFS
Hadoop has two services that can provide web access to HDFS:

- webHDFS
- httpFS
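Both services expose an HTTP REST interface over HDFS. As an illustration, a webHDFS URL has the form `http://<host>:<port>/webhdfs/v1<path>?op=<OPERATION>`. The sketch below builds such a URL; the host name is a placeholder, and the default port (50070 for the NameNode web UI in Hadoop 2.x) varies by distribution and configuration:

```python
def webhdfs_url(namenode_host, path, operation, port=50070):
    """Build a webHDFS REST URL. Host and port are placeholders;
    check your distribution for the actual webHDFS endpoint."""
    return f"http://{namenode_host}:{port}/webhdfs/v1{path}?op={operation}"

url = webhdfs_url("namenode.example.com", "/user/dbadmin", "LISTSTATUS")
# http://namenode.example.com:50070/webhdfs/v1/user/dbadmin?op=LISTSTATUS
```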
YARN
This service is available in newer releases of Hadoop and provides resource management for
Hadoop clusters. Because HP Vertica does not currently support YARN integration, you cannot
use YARN to control resource contention between database and Hadoop services.
Important: In a co-located cluster, do not run the YARN NodeManager on nodes that are
running HP Vertica. You can run YARN on the Hadoop-only nodes.
You can use Ambari or another Hadoop management tool to disable YARN on the database nodes.
This removes access to Hadoop services that otherwise would run processes on these nodes.
HDFS, however, should run on all nodes in the cluster, including the database nodes.
Hadoop Balancer
The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for HBase and HP Vertica this feature has negative consequences:

- The HDFS write chain normally writes a copy of the data locally and then replicates the data blocks to other data nodes.
- If the data is left in this original state, the data blocks stored in HDFS are generally co-located with the HP Vertica nodes. The HP Vertica nodes then attempt to read those data blocks.
- The balancer can move these data blocks away from the HP Vertica nodes and degrade database performance.

To prevent the undesired movement of data blocks across the HDFS cluster, HDP 2.2 provides the ability to exclude or include specific data nodes while rebalancing. See the Hadoop documentation for details.
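One way to apply the exclusion described above is to keep the HP Vertica host names in a file and pass it to the balancer with its exclude option. The helper below writes such a file; the host names are hypothetical, and you should confirm that your Hadoop version's balancer supports `-exclude -f <file>`:

```python
def write_balancer_excludes(vertica_hosts, out_path):
    """Write one hostname per line, the format expected by
    'hdfs balancer -exclude -f <file>' (option availability
    depends on your Hadoop version)."""
    with open(out_path, "w") as f:
        for host in vertica_hosts:
            f.write(host + "\n")

# Hypothetical co-located HP Vertica node names:
write_balancer_excludes(["hdp01.example.com", "hdp02.example.com"],
                        "balancer-excludes.txt")
# Then run, for example:
#   hdfs balancer -exclude -f balancer-excludes.txt
```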
Replication Factor
By default, HDFS stores three copies of each data block. HP Vertica is generally set up to store
two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save
space and still provide data protection.
To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in
Troubleshooting HDFS Storage Locations.
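As a sketch, the parameter can be set at the database level with Vertica's SET_CONFIG_PARAMETER function. This assumes the parameter name given in this guide; verify the exact syntax and parameter against the documentation for your Vertica version:

```sql
-- Lower the replication factor HP Vertica requests when writing
-- to HDFS storage locations (a value of 2 matches the two copies
-- K-safety typically maintains).
SELECT SET_CONFIG_PARAMETER('HadoopFSReplication', 2);
```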