Teradata Connector for Hadoop Tutorial
Version: 1.4
December 2015
Table of Contents
1 Introduction
1.1 Overview
1.2 Audience
1.3 Architecture
1.3.1 MapReduce
1.3.2 Controlling the Degree of Parallelism
1.3.3 Plugin Architecture
1.3.4 Stages of the TDCH job
1.4 TDCH Plugins and Features
1.4.1 Defining Plugins via the Command Line Interface
1.4.2 HDFS Source and Target Plugins
1.4.3 Hive Source and Target Plugins
1.4.4 HCatalog Source and Target Plugins
1.4.5 Teradata Source Plugins
1.4.6 Teradata Target Plugins
1.5 Teradata Plugin Space Requirements
1.5.1 Space Required by Teradata Target Plugins
1.5.2 Storage Space Required for Extracting Data from Teradata
1.6 Teradata Plugin Privilege Requirements
2 Supported Plugin Properties
2.1 Source Plugin Definition Properties
2.2 Target Plugin Definition Properties
2.3 Common Properties
2.4 Teradata Source Plugin Properties
2.5 Teradata Target Plugin Properties
2.6 HDFS Source Plugin Properties
2.7 HDFS Target Properties
2.8 Hive Source Properties
2.9 Hive Target Properties
2.10 HCat Source Properties
2.11 HCat Target Properties
3 Installing Connector
3.1 Prerequisites
3.2 Software Download
3.3 RPM Installation
3.4 ConfigureOozie Installation
4 Launching TDCH Jobs
4.1 TDCH's Command Line Interface
4.2 Runtime Dependencies
4.3 Launching TDCH with Oozie workflows
4.4 TDCH's Java API
5 Use Case Examples
5.1 Environment Variables for Runtime Dependencies
5.2 Use Case: Import to HDFS File from Teradata Table
5.2.1 Setup: Create a Teradata Table with Data
5.2.2 Run: ConnectorImportTool command
5.3 Use Case: Export from HDFS File to Teradata Table
5.3.1 Setup: Create a Teradata Table
5.3.2 Setup: Create an HDFS File
5.3.3 Run: ConnectorExportTool command
5.4 Use Case: Import to Existing Hive Table from Teradata Table
5.4.1 Setup: Create a Teradata Table with Data
5.4.2 Setup: Create a Hive Table
5.4.3 Run: ConnectorImportTool Command
5.4.4 Run: ConnectorImportTool Command
5.5 Use Case: Import to New Hive Table from Teradata Table
5.5.1 Setup: Create a Teradata Table with Data
5.5.2 Run: ConnectorImportTool Command
5.6 Use Case: Export from Hive Table to Teradata Table
5.6.1 Setup: Create a Teradata Table
5.6.2 Setup: Create a Hive Table with Data
5.6.3 Run: ConnectorExportTool Command
5.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table
5.7.1 Setup: Create a Teradata PPI Table with Data
5.7.2 Setup: Create a Hive Partitioned Table
5.7.3 Run: ConnectorImportTool Command
5.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table
5.8.1 Setup: Create a Teradata PPI Table
5.8.2 Setup: Create a Hive Partitioned Table with Data
5.8.3 Run: ConnectorExportTool command
5.9 Use Case: Import to HCatalog Table from Teradata Table
5.9.1 Setup: Create a Teradata Table with Data
5.9.2 Setup: Create a Hive Table
5.9.3 Run: ConnectorImportTool Command
5.10 Use Case: Export from HCatalog Table to Teradata Table
5.10.1 Setup: Create a Teradata Table
5.10.2 Setup: Create a Hive Table with Data
5.10.3 Run: ConnectorExportTool Command
5.11 Use Case: Import to ORC File Hive Table from Teradata Table
5.11.1 Run: ConnectorImportTool Command
5.12 Use Case: Export from ORC File HCat Table to Teradata Table
5.12.1 Setup: Create the Source HCatalog Table
5.12.2 Run: ConnectorExportTool Command
5.13 Use Case: Import to Avro File in HDFS from Teradata Table
5.13.1 Setup: Create a Teradata Table
5.13.2 Setup: Prepare the Avro Schema File
5.13.3 Run: ConnectorImportTool Command
5.14 Use Case: Export from Avro to Teradata Table
5.14.1 Setup: Prepare the Source Avro File
5.14.2 Setup: Create a Teradata Table
5.14.3 Run: ConnectorExportTool Command
6 Performance Tuning
6.1 Selecting the Number of Mappers
6.1.1 Maximum Number of Mappers on the Hadoop Cluster
6.1.2 Mixed Workload Hadoop Clusters and Schedulers
6.1.3 TDCH Support for Preemption
6.1.4 Maximum Number of Sessions on Teradata
6.1.5 General Guidelines and Measuring Performance
6.2 Selecting a Teradata Target Plugin
6.3 Selecting a Teradata Source Plugin
6.4 Increasing the Batchsize Value
6.5 Configuring the JDBC Driver
7 Troubleshooting
7.1 Troubleshooting Requirements
7.2 Troubleshooting Overview
7.3 Functional: Understand Exceptions
7.4 Functional: Data Issues
7.5 Performance: Back of the Envelope Guide
7.6 Console Output Structure
7.7 Troubleshooting Examples
7.7.1 Database doesn't exist
7.7.2 Internal fast load server socket time out
7.7.3 Incorrect parameter name or missing parameter value in command line
7.7.4 Hive partition column can not appear in the Hive table schema
7.7.5 String will be truncated if its length exceeds the Teradata String length (VARCHAR or CHAR) when running export job
7.7.6 Scaling number of Timestamp data type should be specified correctly in JDBC URL in internal.fastload method
7.7.7 Existing Error table error received when exporting to Teradata in internal.fastload method
7.7.8 No more room in database error received when exporting to Teradata
7.7.9 No more spool space error received when exporting to Teradata
7.7.10 Separator is wrong or absent
7.7.11 Date / Time / Timestamp format related errors
7.7.12 Japanese language problem
8 FAQ
8.1 Do I need to install the Teradata JDBC driver manually?
8.2 What authorization is necessary for running the TDCH?
8.3 How do I use User Customized Text Format Parameters?
8.4 How to use Unicode character as the separator?
8.5 Why is the actual number of mappers less than the value of -nummappers?
8.6 Why don't decimal values in Hadoop exactly match the value in Teradata?
8.7 When should charset be specified in the JDBC URL?
8.8 How do I configure the capacity scheduler to prevent task skew?
8.9 How can I build my own ConnectorDataTypeConverter
9 Limitations & known issues
9.1 Teradata Connector for Hadoop
9.2 Teradata JDBC Driver
9.3 Teradata Database
9.4 Hadoop Map/Reduce
9.5 Hive
9.6 Avro data type conversion and encoding
1 Introduction
1.1 Overview
The Teradata Connector for Hadoop (TDCH) is a map-reduce application that supports high-
performance parallel bi-directional data movement between Teradata systems and various Hadoop
ecosystem components.
TDCH can function as an end user tool with its own command-line interface, can be included in and
launched with custom Oozie workflows, and can also be integrated with other end user tools via its
Java API.
1.2 Audience
TDCH is designed and implemented for the Hadoop user audience. Users in this audience are
familiar with the Hadoop Distributed File System (HDFS) and MapReduce. They are also familiar
with other widely used Hadoop ecosystem components such as Hive, Pig and Sqoop. They should be
comfortable with the command line style of interfaces many of these tools support. Basic knowledge
about the Teradata database system is also assumed.
[Figure: the Teradata and Hadoop environments — Teradata Tools and Teradata SQL on top of the Teradata DB, alongside Pig, Sqoop, and MapReduce on top of HDFS]
1.3 Architecture
TDCH is a bi-directional data movement utility which runs as a MapReduce application inside the
Hadoop cluster. TDCH employs an abstracted plugin architecture which allows users to easily
configure, extend and debug their data movement jobs.
1.3.1 MapReduce
TDCH utilizes MapReduce as its execution engine. MapReduce is a framework designed for
processing parallelizable problems across huge datasets using a large number of computers (nodes).
When run against files in HDFS, MapReduce can take advantage of locality of data, processing data
on or near the storage assets to decrease transmission of data. MapReduce supports other distributed
filesystems such as Amazon S3. MapReduce is capable of recovering from partial failure of servers
or storage at runtime. TDCH jobs get submitted to the MapReduce framework, and the distributed
processes launched by the MapReduce framework make JDBC connections to the Teradata database;
the scalability and fault tolerance properties of the framework are key features of TDCH data
movement jobs.
[Figure: a TDCH job running as multiple TDCH mappers in the Hadoop cluster, coordinated by the Namenode and the JobTracker / ResourceManager]
1.3.2 Controlling the Degree of Parallelism
Both Teradata and Hadoop systems employ extremely scalable architectures, and thus it is very
important to be able to control the degree of parallelism when moving data between the two systems.
Because TDCH utilizes the MapReduce framework as its execution engine, the degree of parallelism
for TDCH jobs is defined by the number of mappers used by the MapReduce job. The number of
mappers used by the MapReduce framework can be configured via the command line parameter
-nummappers, or via the tdch.num.mappers configuration property. More information about
general TDCH command line parameters and their underlying properties can be found in Section 2.
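For example, either of the following forms would cap a hypothetical job at 20 mappers; the first uses the command line argument and the second sets the underlying property directly (the -D form must appear before the other job arguments in the hadoop jar invocations shown in section 5):

-nummappers 20

-Dtdch.num.mappers=20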
[Figure: stages of a TDCH job — the ConnectorImportTool/ConnectorExportTool builds the Input/OutputPlugInConfiguration, the ConnectorJobRunner drives the preprocessing and postprocessing stages through the JobContext, and the source and target plugins supply the input/output formats, record readers/writers, serdes, and converters used during the transfer]
1.4 TDCH Plugins and Features
1.4.2 HDFS Source and Target Plugins
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator. TextFile is easier for humans to read.
Avro
Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for
defining data types and protocols, and serializes data in a compact binary format. TDCH jobs
that read or write Avro files require an Avro schema to be specified inline or via a file.
1.4.3 Hive Source and Target Plugins
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator. TextFile is easier for humans to read.
SequenceFile
SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in
MapReduce as input/output formats.
RCFile
RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based
data warehouse systems, such as Hive. RCFile applies the concept of "first horizontally partition,
then vertically partition". It combines the advantages of both row-store and column-store.
RCFile guarantees that data in the same row are located on the same node, and it can exploit
column-wise data compression and skip unnecessary column reads.
ORCFile
ORCFile (Optimized Row Columnar File) file format provides a highly efficient way to store
Hive data. It is designed to overcome limitations of the other Hive file formats. Using ORC files
improves performance when Hive is reading, writing, and processing data. ORC file support is
only available on Hadoop systems with Hive 0.11.0 or above installed.
1.4.4 HCatalog Source and Target Plugins
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator. TextFile is easier for humans to read.
1.4.5 Teradata Source Plugins
split.by.hash
The Teradata split.by.hash source plugin utilizes each mapper in the TDCH job to retrieve rows
in a given hash range of the specified split-by column from a source table in Teradata. If the user
doesn't define a split-by column, the first column of the table's primary index is used by default.
The split.by.hash plugin supports more data types than the split.by.value plugin.
split.by.value
The Teradata split.by.value source plugin utilizes each mapper in the TDCH job to retrieve rows
in a given value range of the specified split-by column from a source table in Teradata. If the
user doesn't define a split-by column, the first column of the table's primary index is used by
default. The split.by.value plugin supports fewer data types than the split.by.hash plugin.
split.by.partition
The Teradata split.by.partition source plugin utilizes each mapper in the TDCH job to retrieve
rows in a given partition from a source table in Teradata. The split.by.partition plugin is used by
default when the source data set is defined by a query. The plugin creates a PPI (partitioned
primary index) stage table with data from the source table when the source table is not already
a PPI table or when a query defines the source data set. To enable the creation of a staging table,
the split.by.partition plugin requires that the associated Teradata user has create table and
create view privileges, as well as free perm space equivalent to the size of the source table.
split.by.amp
The Teradata split.by.amp source plugin utilizes each mapper in the TDCH job to retrieve rows
associated with one or more amps from a source table in Teradata. The split.by.amp plugin
delivers the best performance due to its use of a special table operator available only in Teradata
14.10+ database systems.
1.4.6 Teradata Target Plugins
batch.insert
The Teradata batch.insert target plugin associates an SQL JDBC session with each mapper in the
TDCH job when loading a target table in Teradata. The batch.insert plugin is the most flexible as
it supports most Teradata data types, requires no coordination between the TDCH mappers, and
can recover from mapper failure. If the target table is not NOPI, a NOPI stage table is created
and loaded as an intermediate step before moving the data to the target via a single insert-select
SQL operation. To enable the creation of a staging table, the batch.insert plugin requires that the
associated Teradata user has create table privileges as well as free perm space equivalent to the
size of the source data set.
internal.fastload
The Teradata internal.fastload target plugin associates a FastLoad JDBC session with each
mapper in the TDCH job when loading a target table in Teradata. The internal.fastload method
utilizes a FastLoad slot on Teradata, and implements coordination between the TDCH mappers
and a TDCH coordinator process (running on the edge node where the job was submitted) as is
defined by the Teradata FastLoad protocol. The internal.fastload plugin delivers exceptional load
performance; however, it supports fewer data types than batch.insert and cannot recover from
mapper failure. If the target table is not NOPI, a NOPI stage table is created and loaded as an
intermediate step before moving the data to the target via a single insert-select SQL operation.
To enable the creation of a staging table, the internal.fastload plugin requires that the associated
Teradata user has create table privileges as well as free perm space equivalent to the size of the
source data set.
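As an illustration, the Teradata source and target plugins are selected on the command line; the sketch below assumes the plugin choice is exposed through a -method argument (that argument name is not defined in this excerpt and should be checked against section 2), and the connection and table values are placeholders:

hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hdfs
-method split.by.amp
-sourcetable example_td
-targetpaths /user/mapred/example_hdfs
-nummappers 4

An export job would select a Teradata target plugin the same way, for example -method internal.fastload on the ConnectorExportTool.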
1.5 Teradata Plugin Space Requirements
batch.insert
When the target table is not NoPI or is non-empty, the Teradata batch.insert target plugin creates
a temporary NoPI stage table. The Teradata batch.insert target plugin loads the source data into
the temporary stage table before executing an INSERT-SELECT operation to move the data
from the stage table into the target table. To support the use of a temporary staging table, the
target database must have enough permanent space to accommodate data in the stage table. In
addition to the permanent space required by the temporary stage table, the Teradata batch.insert
target plugin requires spool space equivalent to the size of the source data to support the
INSERT-SELECT operation between the temporary staging and target tables.
internal.fastload
When the target table is not NOPI or is non-empty, the Teradata internal.fastload target plugin
creates a temporary NOPI stage table. The Teradata internal.fastload target plugin loads the
source data into the temporary stage table before executing an INSERT-SELECT operation to
move the data from the stage table into the target table. To support the use of a temporary staging
table, the target database must have enough permanent space to accommodate data in the stage
table. In addition to the permanent space required by the temporary stage table, the Teradata
internal.fastload target plugin requires spool space equivalent to the size of the source data to
support the INSERT-SELECT operation between the temporary staging and target tables.
split.by.value
The Teradata split.by.value source plugin associates data in value ranges of the source table with
distinct mappers from the TDCH job. Each mapper retrieves the associated data via a SELECT
statement, and thus the Teradata split.by.value source plugin requires that the source database
have enough spool space to support N SELECT statements, where N is the number of mappers in
use by the TDCH job.
split.by.hash
The Teradata split.by.hash source plugin associates data in hash ranges of the source table with
distinct mappers from the TDCH job. Each mapper retrieves the associated data via a SELECT
statement, and thus the Teradata split.by.hash source plugin requires that the source database
have enough spool space to support N SELECT statements, where N is the number of mappers in
use by the TDCH job.
split.by.partition
When the source table is not partitioned, the Teradata split.by.partition source plugin creates a
temporary partitioned staging table and executes an INSERT-SELECT to move data from the
source table into the stage table. To support the use of a temporary partitioned staging table, the
source database must have enough permanent space to accommodate the source data set in the
stage table as well as in the source table. In addition to the permanent space required by the
temporary stage table, the Teradata split.by.partition source plugin requires spool space
equivalent to the size of the source data to support the INSERT-SELECT operation between the
source table and the temporary partitioned stage table.
Once a partitioned source table is available, the Teradata split.by.partition source plugin
associates partitions from the source table with distinct mappers from the TDCH job. Each
mapper retrieves the associated data via a SELECT statement, and thus the Teradata
split.by.partition source plugin requires that the source database have enough spool space to
support N SELECT statements, where N is the number of mappers in use by the TDCH job.
split.by.amp
The Teradata split.by.amp source plugin does not require any space on the source database due
to its use of the tdampcopy table operator.
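Before launching a job that needs a staging table, it can be useful to confirm that the relevant database has enough perm and spool space. A minimal sketch using BTEQ and the DBC.DiskSpaceV system view (the system, user, and database names are placeholders, and the view name should be confirmed against the installed Teradata release):

bteq <<EOF
.LOGON testsystem/testuser
SELECT SUM(MaxPerm) AS TotalPerm,
       SUM(CurrentPerm) AS UsedPerm,
       SUM(MaxSpool) AS TotalSpool
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'testdb';
.LOGOFF
EOF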
1.6 Teradata Plugin Privilege Requirements

Teradata Plugin: split.by.hash
  Requires Create Table Privilege: No
  Requires Create View Privilege: No
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.INDICESX
    usexviews argument disabled: DBC.COLUMNS, DBC.INDICES

Teradata Plugin: split.by.value
  Requires Create Table Privilege: No
  Requires Create View Privilege: No
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.INDICESX
    usexviews argument disabled: DBC.COLUMNS, DBC.INDICES

Teradata Plugin: split.by.partition
  Requires Create Table Privilege: Yes
  Requires Create View Privilege: Yes
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.INDICESX
    usexviews argument disabled: DBC.COLUMNS, DBC.INDICES

Teradata Plugin: split.by.amp
  Requires Create Table Privilege: No
  Requires Create View Privilege: No
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.TABLESX
    usexviews argument disabled: DBC.COLUMNS, DBC.TABLES

Teradata Plugin: batch.insert
  Requires Create Table Privilege: Yes
  Requires Create View Privilege: No
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.INDICESX, DBC.TABLESX
    usexviews argument disabled: DBC.COLUMNS, DBC.INDICES, DBC.TABLES

Teradata Plugin: internal.fastload
  Requires Create Table Privilege: Yes
  Requires Create View Privilege: No
  Select privilege required on system views/tables:
    usexviews argument enabled: DBC.COLUMNSX, DBC.INDICESX, DBC.TABLESX, DBC.DATABASESX, DBC.TABLE_LEVELCONSTRAINTSX, DBC.TRIGGERSX
    usexviews argument disabled: DBC.COLUMNS, DBC.INDICES, DBC.TABLES, DBC.DATABASES, DBC.TABLE_LEVELCONSTRAINTS, DBC.TRIGGERS
NOTE: Create table privileges are only required by the batch.insert and internal.fastload Teradata
plugins when staging tables are required.
2 Supported Plugin Properties
TDCH jobs are configured by associating a set of properties and values with a Hadoop configuration
object. A TDCH job's source and target plugins should be defined in the Hadoop configuration
object using TDCH's ConnectorImportTool and ConnectorExportTool command line utilities, while
other common and plugin-centric attributes can be defined either by command line arguments or
directly via their java property names. The table below provides some metadata about the
configuration property definitions in this section.
Java Property tdch.plugin.input.processor
tdch.plugin.input.format
tdch.plugin.input.serde
CLI Argument jobtype + fileformat
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The three tdch.plugin.input properties define the source plugin. When
using the ConnectorExportTool, the source plugin will always be one of
the plugins that interface with components in the Hadoop cluster.
Submitting a valid value to the ConnectorExportTool's jobtype and
fileformat command line arguments will cause the three
tdch.plugin.input properties to be assigned values associated with the
selected Hadoop source plugin. At this point, users should not define the
tdch.plugin.input properties directly.
Required no
Supported Values The following combinations of values are supported by the
ConnectorExportTool's jobtype and fileformat arguments: hdfs +
textfile | avrofile, hive + textfile | sequencefile | rcfile | orcfile, hcat +
textfile
Default Value hdfs + textfile
Case Sensitive yes
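As an example, an export whose source is an RCFile-backed Hive table selects the corresponding source plugin with the following pair of arguments (taken from the supported combinations above); the remainder of the command follows the worked examples in section 5:

-jobtype hive
-fileformat rcfile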
CLI Argument throttlemappers
Description Force the TDCH job to only use as many mappers as the queue
associated with the job can handle concurrently, overriding the user-
defined nummappers value.
Required no
Supported Values true | false
Default Value false
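For instance, a hypothetical job that requests more mappers than its queue can run concurrently can be told to shrink itself by adding the argument below to any of the section 5 commands:

-nummappers 100
-throttlemappers true

With throttlemappers enabled, the requested 100 mappers are reduced to whatever the queue can handle concurrently.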
Java Property tdch.input.date.format
CLI Argument sourcedateformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The parse pattern to apply to all input string columns during conversion
to the output column type, where the output column type is
determined to be a date column.
Required no
Supported Values string
Default Value yyyy-MM-dd
Case Sensitive yes
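For example, if a hypothetical source file carries dates such as 12/31/2015, a job would declare the parse pattern explicitly (the pattern letters are assumed to follow the usual Java date-format conventions, consistent with the yyyy-MM-dd default above):

-sourcedateformat "MM/dd/yyyy"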
Java Property tdch.input.teradata.database
CLI Argument
Tool Class
Description The name of the database in the Teradata system from which the source
Teradata plugins will read data; this property gets defined by specifying
a fully qualified table name for the tdch.input.teradata.table property.
Required no
Supported Values string
Default Value
Case Sensitive no
2.5 Teradata Target Plugin Properties
Supported Values the name of a database in the Teradata system
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.input.hdfs.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The field separator that the HDFS textfile source plugin uses when
parsing files from HDFS.
Required no
Supported Values string
Default Value \t (tab character)
Case Sensitive yes
target plugin.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The field separator that the HDFS textfile target plugin uses when
writing files to HDFS.
Required no
Supported Values string
Default Value \t
Case Sensitive yes
Supported Values string
Default Value
Case Sensitive yes
CLI Argument avroschemafile
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The path to an Avro schema file in HDFS. This schema is used when
generating the output Avro file in HDFS by the HDFS Avro target
plugin.
Required no
Supported Values string
Default Value
Case Sensitive yes
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.line.separator
CLI Argument lineseparator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The line separator that the Hive textfile source plugin uses when reading
from Hive delimited tables.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
write data.
Required no
Supported Values string
Default Value default
Case Sensitive no
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.line.separator
CLI Argument lineseparator
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The line separator that the Hive textfile target plugin uses when writing
to Hive delimited tables.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
need to match the order of the target field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
3 Installing Connector
3.1 Prerequisites
Teradata Database 13.0+
Hadoop cluster running a supported Hadoop distribution
o HDP
o CDH
o IBM
o MapR
TDCH continuously certifies against the latest Hadoop distributions from the most prominent
Hadoop vendors. See the SUPPORTLIST files available with TDCH for more information about the
distributions and versions of Hadoop supported by a given TDCH release.
After RPM installation, the following directory structure should be created (teradata-connector-
1.4.1-hadoop2.x used as example):
/usr/lib/tdch/1.4/:
/usr/lib/tdch/1.4/conf:
teradata-export-properties.xml.template
teradata-import-properties.xml.template
/usr/lib/tdch/1.4/lib:
/usr/lib/tdch/1.4/scripts:
configureOozie.sh
The README and SUPPORTLIST files contain information about the features and fixes included
in a given TDCH release, as well as information about what versions of relevant systems (Teradata,
Hadoop, etc) are supported by a given TDCH release.
The conf directory contains a set of xml files that can be used to define default values for common
TDCH properties. To use these files, specify default values for the desired properties in Hadoop
configuration format, remove the .template extension, and copy them into the Hadoop conf
directory.
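As a sketch, a site that always wants eight mappers could set that default before copying the file into place (the property name is taken from section 1.3.2; any other defaults should be checked against section 2):

cd /usr/lib/tdch/1.4/conf
cp teradata-import-properties.xml.template teradata-import-properties.xml

and then add an entry in Hadoop configuration format to teradata-import-properties.xml:

<configuration>
  <property>
    <name>tdch.num.mappers</name>
    <value>8</value>
  </property>
</configuration>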
The lib directory contains the TDCH jar, as well as the Teradata GSS and JDBC jars. Only the
TDCH jar is required when launching TDCH jobs via the command line interface, while all three
jars are required when launching TDCH jobs via Oozie Java actions.
The scripts directory contains the configureOozie.sh script which can be used to install TDCH into
HDFS such that TDCH jobs can be launched by other Teradata products via custom Oozie Java
actions; see the following section for more information.
Once the configureOozie.sh script has been run, the following directory structure should exist in
HDFS:
/teradata/tdch/1.3/lib/teradataconnector-<version>.jar
/teradata/tdch/1.3/lib/tdgssconfig.jar
/teradata/tdch/1.3/lib/terajdbc4.jar
4 Launching TDCH Jobs
The tool class to be used depends on whether the TDCH job is exporting data from the Hadoop
cluster to Teradata or importing data into the Hadoop cluster from Teradata.
For exports from Hadoop, reference the ConnectorExportTool main class via the path
com.teradata.connector.common.tool.ConnectorExportTool
For imports to Hadoop, reference the ConnectorImportTool main class via the path
com.teradata.connector.common.tool.ConnectorImportTool
When running TDCH jobs which utilize the Hive or HCatalog source or target plugins, a set of
dependent jars must be distributed with the TDCH jar to the nodes on which the TDCH job will be
run. These runtime dependencies should be defined in comma-separated format using the -libjars
command line option; see the following section for more information about runtime dependencies.
Job and plugin-specific properties can be defined via the -D<property>=value format, or via their
associated command line interface arguments. See section 2 for a full list of the properties and
arguments supported by the plugins and tool classes, and see section 5.1 for examples which utilize
the ConnectorExportTool and the ConnectorImportTool classes to launch TDCH jobs via the
command line interface.
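Putting these pieces together, a launch has the general shape below (a skeleton only; the angle-bracketed values are placeholders, and the plugin-specific arguments come from section 2 and the worked examples in section 5):

hadoop jar $USERLIBTDCH com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-Dtdch.num.mappers=4
-url jdbc:teradata://<dbs-host>/database=<database>
-username <user>
-password <password>
-jobtype hive
-fileformat textfile
-sourcetable <teradata-table>
-targettable <hive-table>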
TDCH jobs which utilize the Hive plugins as sources or targets are dependent on the following Hive
jar files:
antlr-runtime-3.4.jar
commons-dbcp-1.4.jar
commons-pool-1.5.4.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
hive-cli-1.2.1.jar
hive-exec-1.2.1.jar
hive-jdbc-1.2.1.jar
hive-metastore-1.2.1.jar
jdo-api-3.0.1.jar
libfb303-0.9.2.jar
libthrift-0.9.2.jar
TDCH jobs which utilize the HCatalog plugins as sources or targets are dependent on all of the jars
associated with the Hive plugins (defined above), as well as the following HCatalog jar files:
hive-hcatalog-core-1.2.1.jar
which utilize the TDCH Java API is limited. The TDCH Java API is composed of the following sets
of classes:
Utility classes
o These classes can be used to fetch information and modify the state of a given data
source or target.
com.teradata.connector.common.utils
o ConnectorJobRunner
com.teradata.connector.common.utils
o ConnectorConfiguration
o ConnectorMapredUtils
o ConnectorPlugInUtils
com.teradata.connector.hcat.utils
o HCatPlugInConfiguration
com.teradata.connector.hdfs.utils
o HdfsPlugInConfiguration
com.teradata.connector.hive.utils
o HivePlugInConfiguration
o HiveUtils
com.teradata.connector.teradata.utils
o TeradataPlugInConfiguration
o TeradataUtils
More information about TDCH's Java API and Javadocs detailing the above classes are available
upon request.
5 Use Case Examples
5.1 Environment Variables for Runtime Dependencies
export LIB_JARS=
/path/to/avro/jars/avro-1.7.4.jar,
/path/to/avro/jars/avro-mapred-1.7.4-hadoop2.jar,
$HIVE_HOME/conf,
$HIVE_HOME/lib/antlr-runtime-3.4.jar,
$HIVE_HOME/lib/commons-dbcp-1.4.jar,
$HIVE_HOME/lib/commons-pool-1.5.4.jar,
$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar,
$HIVE_HOME/lib/datanucleus-core-3.2.10.jar,
$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar,
$HIVE_HOME/lib/hive-cli-1.2.1.jar,
$HIVE_HOME/lib/hive-exec-1.2.1.jar,
$HIVE_HOME/lib/hive-jdbc-1.2.1.jar,
$HIVE_HOME/lib/hive-metastore-1.2.1.jar,
$HIVE_HOME/lib/jdo-api-3.0.1.jar,
$HIVE_HOME/lib/libfb303-0.9.2.jar,
$HIVE_HOME/lib/libthrift-0.9.2.jar,
$HCAT_HOME/hive-hcatalog-core-1.2.1.jar
export HADOOP_CLASSPATH=
/path/to/avro/jars/avro-1.7.4.jar:
/path/to/avro/jars/avro-mapred-1.7.4-hadoop2.jar:
$HIVE_HOME/conf:
$HIVE_HOME/lib/antlr-runtime-3.4.jar:
$HIVE_HOME/lib/commons-dbcp-1.4.jar:
$HIVE_HOME/lib/commons-pool-1.5.4.jar:
$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar:
$HIVE_HOME/lib/datanucleus-core-3.2.10.jar:
$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar:
$HIVE_HOME/lib/hive-cli-1.2.1.jar:
$HIVE_HOME/lib/hive-exec-1.2.1.jar:
$HIVE_HOME/lib/hive-jdbc-1.2.1.jar:
$HIVE_HOME/lib/hive-metastore-1.2.1.jar:
$HIVE_HOME/lib/jdo-api-3.0.1.jar:
$HIVE_HOME/lib/libfb303-0.9.2.jar:
$HIVE_HOME/lib/libthrift-0.9.2.jar:
$HCAT_HOME/hive-hcatalog-core-1.2.1.jar
export USERLIBTDCH=/usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar
5.2 Use Case: Import to HDFS File from Teradata Table
5.2.1 Setup: Create a Teradata Table with Data
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
5.2.2 Run: ConnectorImportTool command
Execute the following on the Hadoop edge node
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
5.3 Use Case: Export from HDFS File to Teradata Table
5.3.1 Setup: Create a Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
5.3.2 Setup: Create an HDFS File
Execute the following on the Hadoop edge node.
rm /tmp/example2_hdfs_data
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
Set job type as hdfs
-jobtype hdfs
Set source HDFS path
-sourcepaths /user/mapred/example2_hdfs
5.4 Use Case: Import to Existing Hive Table from Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
) STORED AS RCFILE;
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
Set job type as hive
-jobtype hive
5.5 Use Case: Import to New Hive Table from Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example4_td (
,c3 FLOAT
);
.LOGOFF
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
Set job type as hive
-jobtype hive
5.6 Use Case: Export from Hive Table to Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
h1 INT
, h2 STRING
) stored as textfile;
echo "4,acme">/tmp/example5_hive_data
hive -e "INSERT OVERWRITE TABLE example6_hive SELECT * FROM
example5_hive;"
rm /tmp/example5_hive_data
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
Set job type as hive
-jobtype hive
5.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 DATE
PARTITION BY RANGE_N(c2 BETWEEN DATE '2006-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH);
.LOGOFF
h1 INT
STORED AS RCFILE;
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example6_td
-sourcefieldnames "c1,c2"
Specify both source and target
-nummappers 1 field names so TDCH knows how to
map Teradata column Hive
-targettable example6_hive partition columns.
-targetfieldnames "h1,h2"
5.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 DATE
.LOGOFF
Execute the following through the Hive command line interface on the Hadoop edge node.
h1 INT
STORED AS RCFILE;
INSERT INTO TABLE example7_hive PARTITION (h2='2012-02-18') SELECT h1
FROM example7_tmp;
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example7_hive
-sourcefieldnames "h1,h2"
Specify both source and target
-nummappers 1 field names so TDCH knows how to
map Hive partition column to
-targettable example7_td Teradata column.
-targetfieldnames "c1,c2"
5.9 Use Case: Import to HCatalog Table from Teradata Table
5.9.1 Setup: Create a Teradata Table with Data
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
) STORED AS RCFILE;
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-sourcetable example8_td
-nummappers 1
-targettable example8_hive
5.10 Use Case: Export from HCatalog Table to Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
echo "8,acme">/tmp/example9_hive_data
rm /tmp/example9_hive_data
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
Set job type as hcat
-jobtype hcat
-sourcetable example9_hive
-nummappers 1
-targettable example9_td
5.11 Use Case: Import to ORC File Hive Table from Teradata Table
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-jobtype hive
-targettable import_hive_fun22
-sourcetable import_hive_fun2
5.12 Use Case: Export from ORC File HCat Table to Teradata Table
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-sourcedatabase default
-sourcetable export_hcat_fun1
-nummappers 2
-separator ','
-targettable export_hcat_fun1
5.13 Use Case: Import to Avro File in HDFS from Teradata Table
insert into tdtbl(null, null, null, null, null, null, null, null,
null, null);
"type" : "record",
"name" : "xxre",
"fields" : [ {
"name" : "col1",
}, {
"name" : "col2",
"type" : "long"
}, {
"name" : "col3",
"type" : "float"
}, {
"name" : "col4",
"type" : "double", "default":1.0
}, {
"name" : "col5",
} ]
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-jobtype hdfs
-targetpaths /user/hduser/avro_import
-nummappers 2
-sourcetable tdtbl
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-targetfieldnames "col2,col3"
-sourcefieldnames "i,s"
5.14 Use Case: Export from Avro to Teradata Table
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-fileformat avrofile
-jobtype hdfs
-sourcepaths /user/hduser/avro_export
-nummappers 2
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-sourcefieldnames "col2,col3"
-targetfieldnames "i,s"
6 Performance Tuning
The scheduler's queue definition for the queue associated with the TDCH job; the queue
definition will include information about the minimum and maximum number of containers
offered by the queue, as well as whether the scheduler supports preemption.
Whether the given TDCH job supports preemption if the associated YARN scheduler and
queue have enabled preemption.
To determine the maximum number of mappers that can be run on a given scheduler-enabled YARN
cluster, see the Hadoop documentation for the scheduler that has been implemented in YARN on the
given cluster. See the following section for more information on which TDCH jobs support
preemption.
6.1.3 TDCH Support for Preemption
In some cases, the queues associated with a given YARN scheduler will be configured to support
elastic scheduling. This means that a given queue can grow in size to utilize the resources associated
with other queues when those resources are not in use; if these inactive queues become active while
the original job is running, containers associated with the original job will be preempted, and these
containers will be restarted when resources associated with the elastic queue become available. All
of TDCH's source plugins, and all of TDCH's target plugins except the TDCH internal.fastload
Teradata target plugin, support preemption. This means that all TDCH jobs, with the exception of
jobs which utilize the TDCH internal.fastload target plugin, can be run with more mappers than the
maximum number of containers associated with the given queue on scheduler-enabled,
preemption-enabled YARN clusters. Jobs which utilize the TDCH internal.fastload target plugin can
also be run in this environment, but may not utilize elastically-available resources. Again, see the
Hadoop documentation for the given scheduler to determine the maximum number of mappers
supported by a given queue.
6.2 Selecting a Teradata Target Plugin
This section provides suggestions on how to select a Teradata target plugin, and provides some
information about the performance of the various plugins.
batch.insert
The Teradata batch.insert target plugin utilizes uncoordinated SQL sessions when connecting
with Teradata. This plugin should be used when loading a small amount of data, or when there
are complex data types in the target table which are not supported by the Teradata
internal.fastload target plugin. This plugin should also be used for long running jobs on YARN
clusters where preemptive scheduling is enabled or regular failures are expected.
internal.fastload
The Teradata internal.fastload target plugin utilizes coordinated FastLoad sessions when
connecting with Teradata, and thus this plugin is more performant than the Teradata batch.insert
target plugin. This plugin should be used when transferring large amounts of data from large
Hadoop systems to large Teradata systems. This plugin should not be used for long running jobs
on YARN clusters where preemptive scheduling could cause mappers to be restarted or where
regular failures are expected, as the FastLoad protocol does not support restarted sessions and the
job will fail in this scenario.
6.3 Selecting a Teradata Source Plugin
split.by.value
The Teradata split.by.value source plugin performs the best when the split-by column has more
distinct values than the TDCH job has mappers, and when the distinct values in the split-by
column evenly partition the source dataset. The plugin has each mapper submit a range-based
SELECT query to Teradata, fetching the subset of data associated with the mapper's designated
range. Thus, when the source data set is not evenly partitioned by the values in the split-by
column, the work associated with the data transfer will be skewed between the mappers, and the
job will take longer to complete.
split.by.hash
The Teradata split.by.hash source plugin performs the best when the split-by column has more
distinct hash values than the TDCH job has mappers, and when the distinct hash values in the
split-by column evenly partition the source dataset. The plugin has each mapper submit a range-
based SELECT query to Teradata, fetching the subset of data associated with the mapper's
designated range. Thus, when the source data set is not evenly partitioned by the hash values in
the split-by column, the work associated with the data transfer will be skewed between the
mappers, and the job will take longer to complete.
split.by.partition
The Teradata split.by.partition source plugin performs the best when the source table is evenly
partitioned, the partition column(s) are also indexed, and the number of partitions in the
source table is equal to the number of mappers used by the TDCH job. The plugin has each
mapper submit a range-based SELECT query to Teradata, fetching the subset of data associated
with one or more partitions. The plugin is the only Teradata source plugin to support defining the
source data set via an arbitrarily complex select query; in this scenario a staging table is used.
The number of partitions associated with the staging table created by the Teradata
split.by.partition source plugin can be explicitly defined by the user, so the plugin is the most
tunable of the four Teradata source plugins.
split.by.amp
The Teradata split.by.amp source plugin performs the best when the source data set is evenly
distributed on the amps in the Teradata system, and when the number of mappers used by the
TDCH job is equivalent to the number of amps in the Teradata system. The plugin has each
mapper submit a table operator-based SELECT query to Teradata, fetching the subset of data
associated with the mapper's designated amps. The plugin's use of the table operator makes it
the most performant of the four Teradata source plugins, but the plugin can only be used against
Teradata systems which have the table operator available (14.10+).
7 Troubleshooting
NOTE: In some cases, it is also useful to enable DEBUG messages in the console and mapper logs.
To enable DEBUG messages in the console logs, update the HADOOP_ROOT_LOGGER
environment variable with the command export HADOOP_ROOT_LOGGER=DEBUG,console.
To enable DEBUG messages in the mapper logs, add the following property definition to the TDCH
command: -Dmapred.map.log.level=DEBUG.
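For example, a single troubleshooting run can combine both settings (the connection and table values are placeholders carried over from the earlier examples):

export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop jar $USERLIBTDCH com.teradata.connector.common.tool.ConnectorImportTool
-Dmapred.map.log.level=DEBUG
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hdfs
-sourcetable example1_td
-targetpaths /user/mapred/example1_hdfs
-nummappers 2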
7.2 Troubleshooting Overview
The troubleshooting flow first distinguishes the job direction (import or export), then the issue type:
functional issues (look for exceptions) or performance issues (go through the checklist), and finally
narrows the problem down to a problem area.
7.3 Functional: Understand Exceptions
NOTE: The example console output contained in this section has not been updated to reflect the latest
version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though not
identical.
Look in the console output for
The very last error code
o 10000: runtime (look for database error code, or JDBC error code, or back trace)
o Others: pre-defined (checked) errors by TDCH
The very first instance of exception messages
Examples:
com.teradata.hadoop.exception.TeradataHadoopSQLException:
(omitted)
(omitted)
7.4 Functional: Data Issues
This category of issues occurs at runtime (most often with the internal.fastload method), and usually it is
not obvious what the root cause is. We suggest checking the following:
Does the schema match the data?
Is the separator correct?
Does the table DDL have time or timestamp columns?
o Check whether the tnano/tsnano setting has been specified in the JDBC URL (see the example URL after this list)
Does the table DDL have Unicode columns?
o Check whether the CHARSET setting has been specified in the JDBC URL
Does the table DDL have decimal columns?
o Before release 1.0.6, this may cause issues
Check the FastLoad error tables to see what's inside
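For example, a JDBC URL that sets both the timestamp scale and the session character set might look
like the following (the host and database names are illustrative):
jdbc:teradata://dbserver/DATABASE=testdb,TSNANO=6,CHARSET=UTF8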
Therefore we should:
Watch out for node-level CPU saturation (including core saturation), because without CPU no
work can be done.
If all nodes are saturated on either the Hadoop or the Teradata side, consider expanding the
system footprint and/or lowering concurrency.
If one node is much busier than the other nodes within either Hadoop or Teradata, try to
balance the workload skew.
If both Hadoop and Teradata are mostly idle, look for obvious user mistakes or configuration
issues, and if possible, increase concurrency.
Here is the checklist to go through in case of slow performance:
User Settings
Teradata JDBC URL
o Connecting to MPP system name? (and not single-node)
o Connecting through correct (fast) network interface?
/etc/hosts
ifconfig
Using the best-performing methods?
Using the optimal number of mappers? (a small number of mapper sessions can significantly
impact performance)
Is the batch size too small or too large? (see the example options after this list)
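For example, the mapper count and batch size can be adjusted with the following options (the values
shown are illustrative starting points, not recommendations):
    -nummappers 24 \
    -batchsize 10000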
Database
Is database CPU or IO saturated?
o iostat, mpstat, sar, top
Is there any TDWM setting limiting the # of concurrent sessions or the users' query priority?
o tdwmcmd -a
DBSControl settings
o AWT tasks: maxawttask, maxloadtask, maxloadawt
o Compression settings
Is database almost out of room?
Is there high skew on some AMPs (skew on the PI column or the split-by column)?
Network
Are Hadoop network interfaces saturated?
o Could be high replication factor combined with slow network between nodes
Are Teradata network interfaces saturated?
o Could be slow network between systems
o Does the network have bad latency?
Hadoop
Are Hadoop data nodes CPU or IO saturated?
o iostat, mpstat, sar, top, using ganglia or other tools
o Could be that the Hadoop configuration is too small for the job's size
Are there settings limiting # of concurrent mappers?
o mapred-site.xml
o scheduler configuration
Are mapper tasks skewed to a few nodes?
o use ps | grep java on multiple nodes to see if tasks have skew
o In capacity-scheduler.xml, set maxtasksperheartbeat to force even distribution
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input conditions are page_language like '%9sz6n6%' or
page_name is not null
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input field names are [page_name, page_hour, page_view]
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 3802] [SQLState 42S02] Database 'testdb' does not exist.
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
    at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:302)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:183)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:121)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:112)
    at com.teradata.jdbc.jdbc_4.TDSession.executeSessionRequest(TDSession.java:624)
    at com.teradata.jdbc.jdbc_4.TDSession.<init>(TDSession.java:288)
    at com.teradata.jdbc.jdk6.JDK6_SQL_Connection.<init>(JDK6_SQL_Connection.java:30)
    at com.teradata.jdbc.jdk6.JDK6ConnectionFactory.constructConnection(JDK6ConnectionFactory.java:22)
    at com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFactory.java:130)
    at com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFactory.java:120)
    at com.teradata.jdbc.TeraDriver.doConnect(TeraDriver.java:228)
    at com.teradata.jdbc.TeraDriver.connect(TeraDriver.java:154)
    at java.sql.DriverManager.getConnection(DriverManager.java:582)
    at java.sql.DriverManager.getConnection(DriverManager.java:185)
    at com.teradata.hadoop.db.TeradataConnection.connect(TeradataConnection.java:274)
This error occurs when the number of currently available map tasks is less than the number of map
tasks specified on the command line via the "-nummappers" parameter. This error can occur in the
following conditions:
(1) There are other map/reduce jobs running concurrently in the Hadoop cluster, so there are not
enough resources to allocate the specified number of map tasks to the export job.
(2) The maximum number of map tasks in the Hadoop cluster is smaller than the number of existing
map tasks plus the expected map tasks of the export job.
When the above error occurs, please try to increase the maximum number of map tasks of the
Hadoop cluster, or decrease the number of map tasks for the export job.
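For example, re-running the export with a smaller mapper count (the value is illustrative):
    -nummappers 4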
When this error occurs, please double check the input parameters and their values.
7.7.4 Hive partition column cannot appear in the Hive table schema
When running an import job with the 'hive' job type, the columns defined in the target partition schema
cannot appear in the target table schema. Otherwise, the following exception will be thrown:
In this case, please check the provided schemas for the Hive table and the Hive partition.
7.7.5 String will be truncated if its length exceeds the Teradata string length
(VARCHAR or CHAR) when running an export job
When running an export job, if the length of the source string exceeds the maximum length of the
Teradata string type (CHAR or VARCHAR), the source string will be truncated, resulting in data
inconsistency.
To prevent this from happening, please carefully set the data schema for the source and target data.
7.7.6 The scaling of the Timestamp data type should be specified correctly in the JDBC
URL with the internal.fastload method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
When loading data into Teradata using the internal.fastload method, the following error may occur:
com.teradata.hadoop.exception.TeradataHadoopException: java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
    at java.io.DataInputStream.readUTF(DataInputStream.java:572)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.beginLoading(TeradataInternalFastloadOutputProcessor.java:889)
    at com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.run(TeradataInternalFastloadOutputProcessor.java:173)
    at com.teradata.hadoop.job.TeradataExportJob.runJob(TeradataExportJob.java:75)
    at com.teradata.hadoop.tool.TeradataJobRunner.runExportJob(TeradataJobRunner.java:192)
    at com.teradata.hadoop.tool.TeradataExportTool.run(TeradataExportTool.java:39)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.teradata.hadoop.tool.TeradataExportTool.main(TeradataExportTool.java:395)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Usually the error is caused by setting a wrong tsnano value in the JDBC URL. In Teradata DDL, the
default length of a timestamp is 6, which is also the maximum allowed value, but the user can specify
a lower value.
The behavior depends on how tsnano is set (see the example URL after this list):
The same as the specified timestamp length in the Teradata table: no problem.
Not set: no problem; the length specified in the Teradata table is used.
Less than the specified length: an error table will be created in Teradata, but no exception will
be shown.
Greater than the specified length: the quoted error message will be received.
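For example, if the target column is defined as TIMESTAMP(6), a matching JDBC URL might look like
the following (the host and database names are illustrative):
jdbc:teradata://dbserver/DATABASE=testdb,TSNANO=6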
7.7.7 Existing Error table error received when exporting to Teradata in internal.fastload
method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the following error occurs when exporting to Teradata using the internal.fastload method:
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 2634] [SQLState HY000] Existing ERROR table(s) or
Incorrect use of export_hdfs_fun1_054815 in Fast Load operation.
This is caused by the existence of an error table. If an export task is interrupted or aborted while
running, an error table is generated and remains in the Teradata database. When you then try to run
another export job, the above error occurs.
In this case, the user needs to drop the existing error table manually, and then rerun the export job.
com.teradata.hadoop.exception.TeradataHadoopSQLException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.01] [Error 2644] [SQLState HY000] No more room in database testdb.
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
    at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:298)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:179)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:120)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:111)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:372)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:314)
    at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecute(TDStatement.java:277)
    at com.teradata.jdbc.jdbc_4.TDStatement.execute(TDStatement.java:1087)
    at com.teradata.hadoop.db.TeradataConnection.executeDDL(TeradataConnection.java:364)
    at com.teradata.hadoop.mapreduce.TeradataMultipleFastloadOutputProcessor.getRecordWriter(TeradataMultipleFastloadOutputProcessor.java:315)
This is caused by the permanent (perm) space of the database in Teradata being set too low. Please
increase it to resolve the issue.
java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException:
[Teradata Database] [TeraJDBC 14.00.00.21] [Error 2646] [SQLState HY000]
No more spool space in example_db.
This is caused by the spool space of the database in Teradata being set too low. Please increase it to
resolve the issue.
7.7.10 Separator is wrong or absent
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the -separator parameter is not set or is set incorrectly, you may run into the following errors.
Please make sure the separator parameter's name and value are specified correctly.
java.lang.IllegalArgumentException
    at java.sql.Date.valueOf(Date.java:138)
java.lang.IllegalArgumentException
    at java.sql.Time.valueOf(Time.java:89)
This error is reported by the Teradata database. One possible cause is that the database has Japanese
language support enabled. When the connector retrieves the table schema from the database, it uses the
following statement:
SELECT TRIM(TRAILING FROM COLUMNNAME) AS COLUMNNAME, CHARTYPE FROM
DBC.COLUMNS WHERE DATABASENAME = (SELECT DATABASE) AND TABLENAME =
$TABLENAME;
The internal process in the database encounters an invalid character during processing, which may be
an issue in Teradata 14. The workaround is to set the DBS Control flag acceptreplacementCharacters
to true.
8 FAQ
8.5 Why is the actual number of mappers less than the value of -nummappers?
When you specify the number of mappers using the -nummappers parameter but find during execution
that the actual number of mappers is less than the specified value, this is expected behavior. TDCH
uses the getSplits() method of Hadoop's CombineFileInputFormat class to decide the number of
partitioned splits, and the number of mappers used to run the job equals the number of splits.
8.6 Why don't decimal values in Hadoop exactly match the values in Teradata?
When exporting data to Teradata, if the precision of the decimal type is greater than that of the target
Teradata column type, the decimal value will be rounded when stored in Teradata. On the other hand,
if the precision of the decimal type is less than the definition of the column in the Teradata table,
zeros will be appended to the scale.
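For example, exporting the value 3.14159 into a DECIMAL(5,2) column stores 3.14, while exporting 3.1
into a DECIMAL(5,4) column stores 3.1000.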
As an example, here's a user-defined converter which replaces occurrences of the term 'foo' in a
source string with the term 'bar':
public class FooBarConverter extends ConnectorDataTypeConverter {
    // Single-argument constructor; the argument is not used (see the note below).
    public FooBarConverter(String unused) {}

    // Replace all occurrences of "foo" with "bar" in the incoming string value.
    public Object convert(Object object) {
        if (object == null)
            return null;
        return ((String) object).replaceAll("foo", "bar");
    }
}
This user-defined converter extends the ConnectorDataTypeConverter class, and thus requires an
implementation of the convert(Object) method. At the time of the 1.4 release, user-defined
converters with no-arg constructors were not supported (this bug is being tracked by TDCH-775; see
the known issues list); thus this user-defined converter has a single-argument constructor, where the
input argument is not used. To compile this user-defined converter, use the following syntax:
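(A minimal sketch, assuming the source file is FooBarConverter.java and the TDCH jar, which provides
the ConnectorDataTypeConverter class, is referenced by $TDCH_JAR:)
javac -classpath $TDCH_JAR FooBarConverter.java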
To run using this user-defined converter, first create a new jar which contains the user-defined
converter's class files:
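(For example, assuming the compiled class files are in the current directory:)
jar cf user-defined-converter.jar FooBarConverter*.class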
Then add the new jar onto the HADOOP_CLASSPATH and LIB_JARS environment variables:
export HADOOP_CLASSPATH=/path/to/user-defined-
converter.jar:$HADOOP_CLASSPATH
export LIB_JARS=/path/to/user-defined-converter.jar,$LIB_JARS
Finally, reference the user-defined converter in your TDCH command. As an example, the following
TDCH job would export two columns from an HDFS file into a Teradata table with one int column and
one string column. The second column in the HDFS file will have the FooBarConverter applied to it
before the record is sent to the Teradata table:
hadoop jar $TDCH_JAR
com.teradata.connector.common.tool.ConnectorExportTool
-libjars=$LIB_JARS
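The remainder of the command is not reproduced above. A hypothetical completion is sketched below;
the connection values and paths are illustrative, and the use of the -sourcerecordschema option to
reference the converter is an assumption rather than confirmed syntax:
    -url jdbc:teradata://dbserver/database=testdb \
    -username dbc \
    -password dbc \
    -jobtype hdfs \
    -sourcepaths /user/hduser/example_data \
    -targettable example_table \
    -separator "," \
    -sourcerecordschema "int, FooBarConverter(unused)"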
9 Limitations & known issues
9.2 Teradata JDBC Driver
a) Row length (of all columns selected) must be 64KB or less
e) Number of rows in each batch.insert request needs to be less than 13668.
f) PERIOD (TIME) with custom format is not supported
g) PERIOD (TIMESTAMP) with custom format is not supported
h) JDBC Batch Insert max parcel size is 1MB
i) JDBC Fastload max parcel size is 64KB
9.5 Hive
a) "-hiveconf" option is used to specify the path of a Hive configuration file (see Section 3.1.) It is
required for a hive or hcat job.
With version 1.0.7, the file can be located in HDFS (hdfs://) or in a local file system (file://). Without the
URI schema (hdfs:// or file://) specified, the default schema name is "hdfs". Without the "-hiveconf"
parameter specified, the "hive-site.xml" file should be located in $HADOOP_CLASSPATH, a local
Teradata Connector for Hadoop Tutorial v1.4.docxdata Connector for Hadoop Tutorial Page 95
path, before running the TDCH job. For example, if the file "hive-site.xml" is in "/home/hduser/", a user
should export the path using the following command before running the TDCH job:
export HADOOP_CLASSPATH=/home/hduser/conf:$HADOOP_CLASSPATH
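For example, a hive-site.xml stored in HDFS could instead be referenced directly on the command line
(the HDFS path is illustrative):
    -hiveconf hdfs:///user/hduser/conf/hive-site.xml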