
HP Vertica for SQL on Hadoop

HP Vertica Analytic Database


Software Version: 7.1.x

Document Release Date: 7/21/2016

Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.

Restricted Rights Legend


Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer
Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial
license.

Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.

Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.


Contents

Introduction To HP Vertica for SQL on Hadoop
Cluster Layout
  Node Sharing Between Hadoop and HP Vertica
  Example
  Adding Kerberos Support
Hardware Selection
Installing and Configuring Hadoop
  webHDFS
  YARN
  Hadoop Balancer
  Replication Factor
  Disk Space for Non-HDFS Use
Choosing How to Connect HP Vertica To HDFS
  Creating an HDFS Storage Location
  Using the ORC Reader
  Using the HCatalog Connector
  Using the HDFS Connector
We appreciate your feedback!

Introduction To HP Vertica for SQL on Hadoop
Using HP Vertica for SQL on Hadoop, you can build an HP Vertica database that uses HDFS
exclusively for storage. The license imposes no limits on the volume of data you can use. Except for
local on-disk storage, you still have access to most of the powerful features of HP Vertica
Enterprise Edition. For specific license restrictions, see Understanding HP Vertica Licenses.
This guide makes recommendations for architecture and configuration to optimize the performance
of your database.
• If you are already using HP Vertica Enterprise Edition, you should skip this guide and refer to the
  rest of the documentation set.

• If you are new to HP Vertica, read the Concepts Guide.

This guide explains how to set up your HP Vertica and Hadoop clusters to work together. You also
need to consult the following documents for how to set up and use these products together:

• Installation Guide: See this document for instructions on installing and verifying HP Vertica.

• Administrator's Guide: See, in particular, the sections on configuring the database and
  managing storage locations.

• Hadoop Integration Guide: See this document for detailed instructions on configuring HP Vertica
  to use HDFS storage.

Cluster Layout
Hadoop and HP Vertica each operate on a cluster of nodes for distributed processing. In the
Enterprise Edition product, these clusters are on distinct nodes. In the HP Vertica for SQL on
Hadoop product, HP Vertica nodes are co-located on some Hadoop nodes. Thus, you run your HP
Vertica cluster on a subset of your Hadoop cluster, while continuing to run Hadoop on those nodes.
The HP Vertica nodes use a private network in addition to the public network used by all Hadoop
nodes, as the following figure shows:


Node Sharing Between Hadoop and HP Vertica


Normally, both Hadoop and HP Vertica use the entire node. Because this configuration runs on shared
nodes, you must address potential resource contention in your configuration. For more information,
see "Best Practices for SQL on Hadoop" in Managing Storage Locations.

Important: HP Vertica for SQL on Hadoop does not currently support the YARN resource
manager. When you use HP Vertica for SQL on Hadoop, run YARN only on the nodes that
are designated as Hadoop-only.

To accommodate the shared nodes, you need to make the configuration changes this guide
describes. You can contain the Hadoop and HP Vertica clusters within a single rack, or you can span
many racks and nodes. Spreading node types across racks can improve efficiency.

Example
Assume that you:

• Have 10 racks with 10 physical nodes each, all configured for Hadoop.

• Want to designate 20 nodes for use by HP Vertica.

• Have planned to create 20 higher-end nodes of equivalent capability to support HP Vertica.

HP Vertica works best if all of its nodes have the same hardware configuration. Given these
assumptions, you should distribute the 20 higher-end nodes, configured for HP Vertica, so that
there are two in each rack. Spreading the HP Vertica nodes across racks can improve I/O
performance and fault tolerance.

Adding Kerberos Support


If you use Kerberos authentication, both HP Vertica and Hadoop require access to the same
Kerberos server. The following figure illustrates how identities and tickets move through the
system. The division between HP Vertica and Hadoop nodes is logical, not physical.


For more information about configuring Kerberos for use with HP Vertica and Hadoop, see Using
Kerberos-Enabled Hadoop.


Hardware Selection
Hadoop clusters frequently do not have identical provisioning requirements or hardware
configurations. However, HP Vertica nodes should be equivalent in size and capability, per the
best-practice standards recommended in General Hardware and OS Requirements and
Recommendations in the Installation Guide.
Because Hadoop cluster specifications do not always meet these standards, Hewlett-Packard
recommends the following specifications for HP Vertica nodes in your Hadoop cluster.
Processor
For best performance, run:
• Two-socket servers with 8–14 core CPUs, clocked at or above 2.6 GHz, for clusters over 10 TB
• Single-socket servers with 8–12 cores, clocked at or above 2.6 GHz, for clusters under 10 TB

Memory
Distribute the memory appropriately across all memory channels in the server:
• Minimum: 8 GB of memory per physical CPU core in the server
• High-performance applications: 12–16 GB of memory per physical core
• Type: at least DDR3-1600, preferably DDR3-1866

Storage
Read/write:
• Minimum: 40 MB/s per physical core of the CPU
• For best performance: 60–80 MB/s per physical core

Storage post-RAID: each node should have 1–9 TB. For a production setting, RAID 10 is
recommended. In some cases, RAID 50 is acceptable.
Because of the heavy compression and encoding that HP Vertica does, SSDs are not required.
In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs.
If you intend to use RAID 50 for your data partition, keep a spare node in every rack, allowing
for manual failover of an HP Vertica node in the case of a drive failure. An HP Vertica node
recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed
on any node, to keep node recovery times at an acceptable rate.

Network
10 Gigabit Ethernet (10 GbE) in almost every case. With the introduction of 10 GbE over
Cat 6a (Ethernet), the cost difference is minimal.

Installing and Configuring Hadoop


Begin by following the installation instructions for your Hadoop distribution.
After you install Hadoop, you can configure the following Hadoop components for use with HP
Vertica.

webHDFS
Hadoop has two services that can provide web access to HDFS:

• webHDFS

• httpFS

For HP Vertica, you must use the webHDFS service.
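
If your distribution does not enable webHDFS by default, the standard Apache Hadoop property
for doing so is shown below. This is an illustrative hdfs-site.xml fragment; a management tool
such as Ambari may manage this setting for you:

```xml
<!-- hdfs-site.xml: enable the webHDFS REST interface
     on the NameNode and DataNodes -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```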

YARN
This service is available in newer releases of Hadoop and provides resource management for
Hadoop clusters. Because HP Vertica does not currently support YARN integration, you cannot
use YARN to control resource contention between database and Hadoop services.
Important: In a co-located cluster, do not run the YARN NodeManager on nodes that are
running HP Vertica. You can run YARN on the Hadoop-only nodes.
You can use Ambari or another Hadoop management tool to disable YARN on the database nodes.
This removes access to Hadoop services that otherwise would run processes on these nodes.
HDFS, however, should run on all nodes in the cluster, including the database nodes.
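
If you manage the cluster without a tool such as Ambari, one way to stop the NodeManager on a
database node is with the stock Hadoop 2.x daemon script. This is a sketch only; the script name
and location vary by distribution and release:

```shell
# On each co-located HP Vertica node only; leave the HDFS DataNode running.
yarn-daemon.sh stop nodemanager
```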

Hadoop Balancer
The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop
services, this feature is useful. However, for HBase and HP Vertica this feature has negative
consequences:

• The HDFS write chain normally writes a copy of the data locally and then replicates the data
  blocks to other data nodes.

• If the data is left in this original state, the data blocks stored in HDFS are generally co-located
  with the HP Vertica nodes. The HP Vertica nodes then attempt to read those data blocks.

• The balancer can move these data blocks away from the HP Vertica nodes and degrade
  database performance.

To prevent the undesired movement of data blocks across the HDFS cluster, HDP-2.2 provides the
ability to exclude or include specific data nodes while rebalancing. See the Hadoop documentation
for details.
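
As a sketch of what such an invocation might look like (the -exclude option is per the Apache
Hadoop 2.6 balancer shipped with HDP 2.2; the host-list file path is a placeholder):

```shell
# Rebalance HDFS without moving blocks off the HP Vertica nodes.
# vertica-nodes.txt lists one co-located hostname per line (placeholder path).
hdfs balancer -exclude -f /etc/hadoop/conf/vertica-nodes.txt
```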

Replication Factor
By default, HDFS stores three copies of each data block. HP Vertica is generally set up to store
two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save
space and still provide data protection.
To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in
Troubleshooting HDFS Storage Locations.
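
A minimal sketch of the setting, assuming the SET_CONFIG_PARAMETER function described in the
Administrator's Guide:

```sql
-- Ask HDFS for 2 replicas (instead of the default 3) when
-- HP Vertica writes to an HDFS storage location.
SELECT SET_CONFIG_PARAMETER('HadoopFSReplication', '2');
```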

Disk Space for Non-HDFS Use


You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari,
set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file.
Setting this parameter preserves space for non-HDFS files that HP Vertica requires.
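
An illustrative hdfs-site.xml fragment; the property name is standard Apache Hadoop, and the
value (10 GB, expressed in bytes) is only an example:

```xml
<!-- Reserve space on each DataNode disk for non-HDFS files,
     including the local files HP Vertica requires. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>
```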

Choosing How to Connect HP Vertica To HDFS


HP Vertica provides several ways to use data stored in HDFS. These interfaces are described in
detail in the Hadoop Integration Guide. This section explains how to choose among them for the
HP Vertica for SQL on Hadoop product.

Creating an HDFS Storage Location


Using a storage location to store data in the HP Vertica native file format (ROS) delivers the best
query performance among the available options. However, doing so requires more disk space on
the database nodes in your cluster, because HP Vertica maintains its own copy in addition to the
source files already stored in HDFS. If you can trade disk space for performance, use this option.
You can use the HDFS Connector to load data that is already in HDFS into HP Vertica.
See Using the HP Vertica Storage Location for HDFS and Using the HP Vertica Connector for HDFS.
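
As a sketch, creating such a storage location might look like the following; the webHDFS URL,
path, and usage label are placeholders, and the exact clauses for your release are given in Using
the HP Vertica Storage Location for HDFS:

```sql
-- Create an HDFS-backed storage location on every node
-- (namenode host, port, and path are placeholders).
CREATE LOCATION 'webhdfs://namenode.example.com:50070/vertica/db1'
  ALL NODES USAGE 'data';
```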

Using the ORC Reader


If your data is stored in the Optimized Row Columnar (ORC) format, an open format supported by most
Hadoop providers, HP Vertica can query that data directly from HDFS. This approach is faster than
using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the
database. The ORC Reader reads the data in place; no extra copies are made.
See Reading ORC Files Directly.
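
A sketch of defining an external table over ORC data; the table columns, host, and path are
placeholders, and the exact syntax is given in Reading ORC Files Directly:

```sql
-- Query ORC files in place from HDFS (columns and path are examples).
CREATE EXTERNAL TABLE sales (id INT, amount FLOAT)
  AS COPY FROM 'webhdfs://namenode.example.com:50070/data/sales/*' ORC;
```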

Using the HCatalog Connector


The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in
HDFS. Like the ORC Reader, it reads data in place rather than making copies. Using this interface
you can read all file formats supported by Hadoop, including Parquet and ORC, and HP Vertica can
use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog
Connector is also sensitive to changes in the Hadoop libraries on which it depends; upgrading your
Hadoop cluster might affect your HCatalog connections.
See Using the HCatalog Connector.
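
A sketch of mapping a Hive schema into the database; the host name, port, schema, and user are
placeholders, and the full parameter list is given in Using the HCatalog Connector:

```sql
-- Expose the Hive 'default' schema through HCatalog
-- (metastore host, port, and user are placeholders).
CREATE HCATALOG SCHEMA hcat
  WITH HOSTNAME='hcatalog.example.com' PORT=9083
       HCATALOG_SCHEMA='default' HCATALOG_USER='hcatuser';
```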


Using the HDFS Connector


You can use the HDFS Connector to create and query external tables, reading the data in place
rather than making copies. The HDFS Connector works with any data format for which a
parser is available. It does not use Hive data; you must define the table yourself. Its performance
can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of
columnar file formats.
See Using the HP Vertica Connector for HDFS.
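
A sketch of loading files from HDFS through the connector; the webHDFS URL, path glob, target
table, and user name are placeholders, and the source parameters are documented in Using the
HP Vertica Connector for HDFS:

```sql
-- Load files from HDFS over webHDFS into an existing table
-- (namenode host, path, and username are placeholders).
COPY messages SOURCE Hdfs(url='http://namenode.example.com:50070/webhdfs/v1/logs/*',
                          username='hadoopuser');
```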


We appreciate your feedback!


If you have comments about this document, you can contact the documentation team by email. If
an email client is configured on this system, click the link above and an email window opens with
the following information in the subject line:
Feedback on HP Vertica for SQL on Hadoop (Vertica Analytic Database 7.1.x)
Just add your feedback to the email and click send.
If no email client is available, copy the information above to a new message in a web mail client,
and send your feedback to vertica-docfeedback@hp.com.

