
HP Vertica for SQL on Hadoop

HP Vertica Analytic Database


Software Version: 7.1.x

Document Release Date: 7/21/2016

Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.

Restricted Rights Legend


Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer
Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial
license.

Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.

Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.


Contents

Introduction To HP Vertica for SQL on Hadoop
Cluster Layout
  Node Sharing Between Hadoop and HP Vertica
  Example
  Adding Kerberos Support
Hardware Selection
Installing and Configuring Hadoop
  webHDFS
  YARN
  Hadoop Balancer
  Replication Factor
  Disk Space for Non-HDFS Use
Choosing How to Connect HP Vertica To HDFS
  Creating an HDFS Storage Location
  Using the ORC Reader
  Using the HCatalog Connector
  Using the HDFS Connector
We appreciate your feedback!

Introduction To HP Vertica for SQL on Hadoop
Using HP Vertica for SQL on Hadoop, you can build an HP Vertica database that uses HDFS
exclusively for storage. The license imposes no limits on the volume of data you can use. Except for
local on-disk storage, you still have access to most of the powerful features of HP Vertica
Enterprise Edition. For specific license restrictions, see Understanding HP Vertica Licenses.
This guide makes recommendations for architecture and configuration to optimize the performance
of your database.
• If you are already using HP Vertica Enterprise Edition, you should skip this guide and refer to the
  rest of the documentation set.

• If you are new to HP Vertica, read the Concepts Guide.

This guide explains how to set up your HP Vertica and Hadoop clusters to work together. You also
need to consult the following documents for how to set up and use these products together:

• Installation Guide: See this document for instructions on installing and verifying HP Vertica.

• Administrator's Guide: See, in particular, the sections on configuring the database and
  managing storage locations.

• Hadoop Integration Guide: See this document for detailed instructions on configuring HP Vertica
  to use HDFS storage.

Cluster Layout
Hadoop and HP Vertica each operate on a cluster of nodes for distributed processing. In the
Enterprise Edition product, these clusters are on distinct nodes. In the HP Vertica for SQL on
Hadoop product, HP Vertica nodes are co-located on some Hadoop nodes. Thus, you run your HP
Vertica cluster on a subset of your Hadoop cluster, while continuing to run Hadoop on those nodes.
The HP Vertica nodes use a private network in addition to the public network used by all Hadoop
nodes, as the following figure shows:


Node Sharing Between Hadoop and HP Vertica


Normally, both Hadoop and HP Vertica use the entire node. Because this configuration runs on shared
nodes, you must address potential resource contention in your configuration. For more information,
see "Best Practices for SQL on Hadoop" in Managing Storage Locations.

Important: HP Vertica for SQL on Hadoop does not currently support the YARN resource
manager. When you use HP Vertica for SQL on Hadoop, run YARN only on the nodes that
are designated as Hadoop-only.

To accommodate the shared nodes, you need to make the configuration changes this guide
describes. You can contain the Hadoop and HP Vertica clusters within a single rack, or you can span
many racks and nodes. Spreading node types across racks can improve efficiency.

Example
Assume that you:

• Have 10 racks with 10 physical nodes each, all configured for Hadoop.

• Want to designate 20 nodes for use by HP Vertica.

• Have planned to create 20 higher-end nodes of equivalent capability to support HP Vertica.

HP Vertica works best if all of its nodes have the same hardware configuration. Given these
assumptions, you should distribute the 20 higher-end nodes, configured for HP Vertica, so that
there are two in each rack. Spreading the HP Vertica nodes across racks can improve I/O
performance and fault tolerance.

Adding Kerberos Support


If you use Kerberos authentication, both HP Vertica and Hadoop require access to the same
Kerberos server. The following figure illustrates how identities and tickets move through the
system. The division between HP Vertica and Hadoop nodes is logical, not physical.


For more information about configuring Kerberos for use with HP Vertica and Hadoop, see Using
Kerberos-Enabled Hadoop.


Hardware Selection
Hadoop clusters frequently do not have identical provisioning requirements or hardware
configurations. However, HP Vertica nodes should be equivalent in size and capability, per the
best-practice standards recommended in General Hardware and OS Requirements and
Recommendations in the Installation Guide.
Because Hadoop cluster specifications do not always meet these standards, Hewlett-Packard
recommends the following specifications for HP Vertica nodes in your Hadoop cluster.
Processor
For best performance, run:
• Two-socket servers with 8–14 core CPUs, clocked at or above 2.6 GHz, for clusters over 10 TB
• Single-socket servers with 8–12 cores, clocked at or above 2.6 GHz, for clusters under 10 TB

Memory
Distribute the memory appropriately across all memory channels in the server:
• Minimum: 8 GB of memory per physical CPU core in the server
• High-performance applications: 12–16 GB of memory per physical core
• Type: at least DDR3-1600, preferably DDR3-1866

Storage
Read/write:
• Minimum: 40 MB/s per physical core of the CPU
• For best performance: 60–80 MB/s per physical core

Storage post-RAID: each node should have 1–9 TB. For a production setting, RAID 10 is
recommended. In some cases, RAID 50 is acceptable.
Because of the heavy compression and encoding that HP Vertica does, SSDs are not required.
In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs.
If you intend to use RAID 50 for your data partition, keep a spare node in every rack, allowing
for manual failover of an HP Vertica node in the case of a drive failure. An HP Vertica node
recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed
on any node, to keep node recovery times at an acceptable rate.

Network
10 Gigabit Ethernet (10 GbE) in almost every case. With the introduction of 10 GbE over
Cat 6a (Ethernet), the cost difference is minimal.

Installing and Configuring Hadoop


Begin by following the installation instructions for your Hadoop distribution.
After you install Hadoop, you can configure the following Hadoop components for use with HP
Vertica.

webHDFS
Hadoop has two services that can provide web access to HDFS:

• webHDFS

• httpFS

For HP Vertica, you must use the webHDFS service.
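
If your distribution does not enable webHDFS by default, the standard Apache Hadoop property
for doing so is shown below. This is an illustrative hdfs-site.xml fragment; a management tool
such as Ambari may manage this setting for you:

```xml
<!-- hdfs-site.xml: enable the webHDFS REST interface
     on the NameNode and DataNodes -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```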

YARN
This service is available in newer releases of Hadoop and provides resource management for
Hadoop clusters. Because HP Vertica does not currently support YARN integration, you cannot
use YARN to control resource contention between database and Hadoop services.
Important: In a co-located cluster, do not run the YARN NodeManager on nodes that are
running HP Vertica. You can run YARN on the Hadoop-only nodes.
You can use Ambari or another Hadoop management tool to disable YARN on the database nodes.
This removes access to Hadoop services that otherwise would run processes on these nodes.
HDFS, however, should run on all nodes in the cluster, including the database nodes.
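
If you manage the cluster without a tool such as Ambari, one way to stop the NodeManager on a
database node is with the stock Hadoop 2.x daemon script. This is a sketch only; the script name
and location vary by distribution and release:

```shell
# On each co-located HP Vertica node only; leave the HDFS DataNode running.
yarn-daemon.sh stop nodemanager
```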

Hadoop Balancer
The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop
services, this feature is useful. However, for HBase and HP Vertica this feature has negative
consequences:

• The HDFS write chain normally writes a copy of the data locally and then replicates the data
  blocks to other data nodes.

• If the data is left in this original state, the data blocks stored in HDFS are generally co-located
  with the HP Vertica nodes. The HP Vertica nodes then attempt to read those data blocks.

• The balancer can move these data blocks away from the HP Vertica nodes and degrade
  database performance.

To prevent the undesired movement of data blocks across the HDFS cluster, HDP-2.2 provides the
ability to exclude or include specific data nodes while rebalancing. See the Hadoop documentation
for details.
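
As a sketch of what such an invocation might look like (the -exclude option is per the Apache
Hadoop 2.6 balancer shipped with HDP 2.2; the host-list file path is a placeholder):

```shell
# Rebalance HDFS without moving blocks off the HP Vertica nodes.
# vertica-nodes.txt lists one co-located hostname per line (placeholder path).
hdfs balancer -exclude -f /etc/hadoop/conf/vertica-nodes.txt
```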

Replication Factor
By default, HDFS stores three copies of each data block. HP Vertica is generally set up to store
two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save
space and still provide data protection.
To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in
Troubleshooting HDFS Storage Locations.
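
A minimal sketch of the setting, assuming the SET_CONFIG_PARAMETER function described in the
Administrator's Guide:

```sql
-- Ask HDFS for 2 replicas (instead of the default 3) when
-- HP Vertica writes to an HDFS storage location.
SELECT SET_CONFIG_PARAMETER('HadoopFSReplication', '2');
```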

Disk Space for Non-HDFS Use


You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari,
set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file.
Setting this parameter preserves space for non-HDFS files that HP Vertica requires.
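
An illustrative hdfs-site.xml fragment; the property name is standard Apache Hadoop, and the
value (10 GB, expressed in bytes) is only an example:

```xml
<!-- Reserve space on each DataNode disk for non-HDFS files,
     including the local files HP Vertica requires. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>
```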

Choosing How to Connect HP Vertica To HDFS


HP Vertica provides several ways to use data stored in HDFS. These interfaces are described in
detail in the Hadoop Integration Guide. This section explains how to choose among them for the
HP Vertica for SQL on Hadoop product.

Creating an HDFS Storage Location


Using a storage location to store data in the HP Vertica native file format (ROS) delivers the best
query performance among the available options. However, doing so requires more disk space on
the database nodes in your cluster, because HP Vertica maintains its own copy in addition to the
source files already stored in HDFS. If you can trade disk space for performance, use this option.
You can use the HDFS Connector to load data that is already in HDFS into HP Vertica.
See Using the HP Vertica Storage Location for HDFS and Using the HP Vertica Connector for HDFS.
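
As a sketch, creating such a storage location might look like the following; the webHDFS URL,
path, and usage label are placeholders, and the exact clauses for your release are given in Using
the HP Vertica Storage Location for HDFS:

```sql
-- Create an HDFS-backed storage location on every node
-- (namenode host, port, and path are placeholders).
CREATE LOCATION 'webhdfs://namenode.example.com:50070/vertica/db1'
  ALL NODES USAGE 'data';
```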

Using the ORC Reader


If your data is stored in the Optimized Row Columnar (ORC) format, an open format supported by most
Hadoop providers, HP Vertica can query that data directly from HDFS. This approach is faster than
using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the
database. The ORC Reader reads the data in place; no extra copies are made.
See Reading ORC Files Directly.
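
A sketch of defining an external table over ORC data; the table columns, host, and path are
placeholders, and the exact syntax is given in Reading ORC Files Directly:

```sql
-- Query ORC files in place from HDFS (columns and path are examples).
CREATE EXTERNAL TABLE sales (id INT, amount FLOAT)
  AS COPY FROM 'webhdfs://namenode.example.com:50070/data/sales/*' ORC;
```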

Using the HCatalog Connector


The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in
HDFS. Like the ORC Reader, it reads data in place rather than making copies. Using this interface
you can read all file formats supported by Hadoop, including Parquet and ORC, and HP Vertica can
use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog
Connector is also sensitive to changes in the Hadoop libraries on which it depends; upgrading your
Hadoop cluster might affect your HCatalog connections.
See Using the HCatalog Connector.
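
A sketch of mapping a Hive schema into the database; the host name, port, schema, and user are
placeholders, and the full parameter list is given in Using the HCatalog Connector:

```sql
-- Expose the Hive 'default' schema through HCatalog
-- (metastore host, port, and user are placeholders).
CREATE HCATALOG SCHEMA hcat
  WITH HOSTNAME='hcatalog.example.com' PORT=9083
       HCATALOG_SCHEMA='default' HCATALOG_USER='hcatuser';
```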


Using the HDFS Connector


You can use the HDFS Connector to create and query external tables, reading the data in place
rather than making copies. The HDFS Connector works with any data format for which a
parser is available. It does not use Hive data; you must define the table yourself. Its performance
can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of
columnar file formats.
See Using the HP Vertica Connector for HDFS.
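
A sketch of loading files from HDFS through the connector; the webHDFS URL, path glob, target
table, and user name are placeholders, and the source parameters are documented in Using the
HP Vertica Connector for HDFS:

```sql
-- Load files from HDFS over webHDFS into an existing table
-- (namenode host, path, and username are placeholders).
COPY messages SOURCE Hdfs(url='http://namenode.example.com:50070/webhdfs/v1/logs/*',
                          username='hadoopuser');
```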


We appreciate your feedback!


If you have comments about this document, you can contact the documentation team by email. If
an email client is configured on this system, click the link above and an email window opens with
the following information in the subject line:
Feedback on HP Vertica for SQL on Hadoop (Vertica Analytic Database 7.1.x)
Just add your feedback to the email and click send.
If no email client is available, copy the information above to a new message in a web mail client,
and send your feedback to vertica-docfeedback@hp.com.

