
1. Introduction
Strong data warehouse performance is critical to keeping users satisfied, attaining service level
agreements (SLAs) and maximizing the return on investment (ROI) in the Teradata system.
Sometimes queries that perform unnecessary full-table scans or other operations that consume too
many system resources are submitted to the data warehouse. Application tuning is a process to
identify and tune target applications for performance improvements and proactively prevent
application performance problems.
Some data warehouses handle millions of queries in a day. This makes it difficult for DBAs to identify
suspect queries. A suspect query is one that either consumes too many system resources or is not
taking advantage of Teradata's parallelism. Identifying and documenting the frequency of problem
queries offers a more comprehensive view of the queries affecting data warehouse performance and
helps prioritize tuning efforts.
DBQL is a rich resource for performance data, as it provides full SQL text, CPU and I/O by query,
number of active AMPs in a query, spool use, number of query steps and full explain text. It also
offers information to calculate suspect query indicators such as large-table scans, skewing (when the
Teradata system is not using all the AMPs in parallel) and large-table-to-large-table product joins.

Whitepaper | TERADATA PERFORMANCE TUNING


2. Teradata Architecture

2.1 The Parsing Engine is responsible for:


Managing individual sessions (up to 120).
Parsing and Optimizing your SQL requests.
Dispatching the optimized plan to the AMPs.
Sending the answer set response back to the requesting client.
2.2 The Message Passing Layer is responsible for:
Carrying messages between the AMPs and PE.
Point-to-Point and Broadcast communications.
Merging answer sets back to the PE.
Making Teradata parallelism possible.
2.3 The AMPs are responsible for:
Finding the rows requested.
Lock management.
Sorting rows.
Aggregating columns.
Join processing.
Output conversion and formatting.
2.4 Storing Rows:
The rows of every table are distributed among all AMPs.
Each AMP is responsible for a subset of the rows of each table.
Ideally, each table will be evenly distributed among all AMPs.
Evenly distributed tables result in evenly distributed workloads.
The uniformity of distribution of the rows of a table depends on the choice of the
Primary Index.


3. Data Distribution
3.1 Primary Index
The value of the Primary Index for a specific row determines the AMP assignment
for that row.
This is done using a hashing algorithm.
Accessing a row by its Primary Index value is always a one-AMP operation and is the
most efficient way to access a row.
There are two types: UPI (Unique Primary Index) and NUPI (Non-Unique Primary Index).
3.1.1 Accessing Via a Unique Primary Index
A UPI access is a one-AMP operation that accesses at most a single row.

3.1.2 Row Distribution Using a UPI

Often, but not always, the PK column(s) will be used as a UPI.


Teradata will distribute different index values evenly across all AMPs.
Resulting row distribution among AMPs is very uniform.
Assures maximum efficiency for parallel operations.

3.1.3 Row Distribution Using a NUPI


Rows with the same PI value distribute to the same AMP.
Row distribution is less uniform and may be skewed.


Customer_Number may be the preferred access column for the ORDER table, and thus a good
index candidate.
Values for Customer_Number are somewhat non-unique.
Choosing Customer_Number therefore results in a NUPI.

3.1.4 Row Distribution Using a Highly Non-Unique Primary Index (NUPI)


Table will not perform well in parallel operations.
Highly non-unique columns are poor PI choices generally.
The degree of uniqueness is critical to efficiency.


Values for Order_Status are highly non-unique.


Choice of Order_Status column is a NUPI.
Only two values exist, so at most two AMPs will ever be used for this table.
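The effect of Primary Index uniqueness on row distribution can be illustrated with a small simulation. This is only a sketch: MD5 stands in for Teradata's row-hash function (which is not reproduced here), and the 4-AMP system size and the status values "O" and "C" are hypothetical.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size, chosen only for illustration


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a Primary Index value to an AMP number.

    MD5 is a stand-in for the real row-hash function; the only property
    that matters here is that equal values always map to the same AMP.
    """
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def rows_per_amp(pi_values, num_amps=NUM_AMPS):
    """Count how many rows land on each AMP for a list of PI values."""
    counts = Counter(amp_for(value, num_amps) for value in pi_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


# UPI-like column: 10,000 distinct order numbers.
order_numbers = list(range(1, 10001))
# Highly non-unique column: only two distinct status values.
order_status = ["O" if number % 2 else "C" for number in order_numbers]

upi_counts = rows_per_amp(order_numbers)
nupi_counts = rows_per_amp(order_status)

print("rows per AMP, PI = Ord_number  :", upi_counts)
print("rows per AMP, PI = Order_Status:", nupi_counts)
```

With the unique column every AMP receives a share of the rows, while with the two-valued column at most two AMPs hold all of them, however many AMPs the system has.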


3.2 Partitioned Primary Index


PPI is a Teradata feature that lets a query access only a portion of the data of a large table.
Rows are still hashed to the AMPs by their Primary Index value; within each AMP they are stored by partition.
PPI does not alter data distribution; it only creates partitions on data already distributed
based on the PI.
PPIs are defined on a table in order to increase query efficiency by avoiding full table
scans.
If we specify NO RANGE or NO CASE, all values that fall outside the defined ranges or cases are placed in a single
partition.
If we specify UNKNOWN, all rows for which the partitioning expression evaluates to NULL are placed in that partition.
Partitions are usually defined based on Range or Case.
Partition by Case
CREATE TABLE Orders  -- ORDER is a reserved word, so the table is named Orders here
(
Ord_number INTEGER NOT NULL,
Customer_number INTEGER NOT NULL,
Order_date DATE,
Order_total INTEGER
)
PRIMARY INDEX (Customer_number)
PARTITION BY CASE_N (
Order_total < 10000,
Order_total < 20000,
Order_total < 30000,
NO CASE OR UNKNOWN);

Partition by Range
CREATE TABLE Orders  -- ORDER is a reserved word, so the table is named Orders here
(
Ord_number INTEGER NOT NULL,
Customer_number INTEGER NOT NULL,
Order_date DATE,
Order_total INTEGER
)
PRIMARY INDEX (Customer_number)
PARTITION BY RANGE_N (
Order_date BETWEEN DATE '2013-01-01' AND DATE '2013-12-01'
EACH INTERVAL '1' MONTH,
NO RANGE OR UNKNOWN);
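How a row is assigned to a partition, and why a query on the partitioning column can skip partitions, can be sketched for the Partition by Case example above. This is a minimal model, not Teradata's implementation; the function names are hypothetical, and numbering the NO CASE OR UNKNOWN partition last is an assumption of the sketch.

```python
def case_n_partition(order_total):
    """Return the partition number for an Order_total value.

    Conditions are tested in order and the first true one wins.
    None models SQL NULL, for which every comparison is unknown.
    """
    conditions = [
        lambda total: total < 10000,  # partition 1
        lambda total: total < 20000,  # partition 2
        lambda total: total < 30000,  # partition 3
    ]
    no_case_or_unknown = len(conditions) + 1  # partition 4 (assumed numbering)
    if order_total is None:
        return no_case_or_unknown
    for number, condition in enumerate(conditions, start=1):
        if condition(order_total):
            return number
    return no_case_or_unknown


def partitions_to_scan(low, high):
    """Partitions that can hold Order_total values between low and high.

    Valid only because the conditions are increasing thresholds and
    low <= high; both bounds must be non-NULL.
    """
    return list(range(case_n_partition(low), case_n_partition(high) + 1))


for total in (500, 10000, 25000, 30000, None):
    print(total, "-> partition", case_n_partition(total))

print("partitions scanned for 12000..15000:", partitions_to_scan(12000, 15000))
```

A condition such as Order_total BETWEEN 12000 AND 15000 only needs partition 2 in this model, which is the sense in which a PPI avoids a full table scan.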

4. Performance Tuning Thumb Rules


These are some best practices to follow to get the best performance from Teradata. To
determine the best tuning options, it is important to baseline existing performance conditions, pilot
potential solutions through experimentation and analyze the results.

4.1 Run explain plan

Check for no or low confidence.
Check for "by way of an all-rows scan" (FTS).
Check for translate operations.
Check for product join conditions.
Check for DISTINCT/GROUP BY.
Check for IN/NOT IN keywords.

4.1.1 In case of product join scenarios check for

Proper usage of aliases.
Joining on matching columns.
Usage of join keywords, such as specifying the type of join (e.g. INNER or OUTER).
Use UNION in case of "OR" scenarios.
Ensure statistics are collected on join columns; this is especially
important if the columns you are joining on are not unique.

4.2 Collect Stats

Run the command "DIAGNOSTIC HELPSTATS ON FOR SESSION;".
Gather information on the columns on which stats should be collected (the suggestions appear at the end of the EXPLAIN output).
Collect stats on the suggested columns.
Also check for stats missing on the PI, SI or columns used in joins with "HELP STATISTICS
<databasename>.<tablename>;".


Make sure stats are re-collected when at least 10% of the data changes.
Remove unwanted stats, or stats that hardly improve the performance of the queries.
Collect stats on columns instead of indexes, since dropping an index drops its stats as well.
Collect stats on multi-column indexes; this might be helpful when these columns are
used in join conditions.
Check that stats are re-created for tables whose structures have changed.
Example 1: Explain before collecting stats

Example 2: Explain after collecting stats


Collecting statistics gathers the following information:

1. The number of rows in the table
2. The average row size
3. Information on all indexes on which statistics were collected
4. The range of values for the column(s) on which statistics were collected
5. The number of rows per value for the column(s) on which statistics were collected
6. The number of NULLs for the column(s) on which statistics were collected

Columns on which stats should be collected:

1. Primary Index of a join index
2. Secondary Indexes defined on any join index
3. Non-indexed columns used in joins
4. The Unique Primary Index of small tables (fewer than 1,000 rows per AMP)
5. All Non-Unique Primary Indexes and all Non-Unique Secondary Indexes
6. Any additional join index columns that frequently appear in WHERE search conditions
7. Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins

Examples:
COLLECT STATISTICS on Emp_Table ;
COLLECT STATISTICS on Emp_Table COLUMN Dept_no ;
COLLECT STATISTICS on Emp_Table COLUMN(Emp_no, Dept_no);
COLLECT STATISTICS on Emp_Table INDEX (Emp_no);
COLLECT STATISTICS on Emp_Table INDEX (First_name, Last_name);

Table-level statistics known as "summary statistics" are collected whenever column or index
statistics are collected.
SHOW SUMMARY STATISTICS VALUES ON Employee_Table;

4.3 Avoid Full table scan scenarios


Try to avoid FTS scenarios, as it might take a very long time to access all the data on every AMP
in the system.
Make sure an SI is defined on the columns that are used in joins or as an alternate access
path.
Collect stats on SI columns; otherwise the optimizer might go for an FTS even
when an SI is defined on that particular column.
If intermediate tables are used to store results, make sure that they have the same PI as the source and
destination tables.
For a large list of values, avoid using IN/NOT IN in SQL. Write the large list of values to a temporary
table and use the temporary table for computations.
Be careful when using EXISTS/NOT EXISTS conditions, since they ignore unknown comparisons
(e.g. a NULL value in the column results in unknown). This can lead to unexpected results.
Some examples of when a Full Table Scan is performed:
- The SQL statement does not contain a WHERE clause.
- The WHERE clause does not use the Primary or Secondary index.
- The SQL statement uses a partial value (LIKE, ...) in the WHERE clause.
- The SQL statement uses inequality operators (<, >, ...) in the WHERE clause.
Example 1: Explain without condition


Example 2: Explain with condition


5. Teradata Performance Tuning Tips


Tip 1: What are the criteria for choosing the best Primary Index?
Be careful when choosing the primary index because it affects both data storage and
performance.
The following are the important criteria when choosing the primary index.
1. Data distribution
Analyze the number of distinct values in the candidate column. A primary index with
few NULL values and many distinct values distributes rows more evenly and gives better
performance.
2. Access frequency
The column should be frequently used in the WHERE clause for row selection.
The column should be one that is frequently used in joins.
3. Volatility
The column values should not change frequently.
Tip 2: Join columns must be of the same data type
When joining columns from two tables, the optimizer checks that the data types match.
If they do not, the optimizer translates the column in the driving table to match that of the derived table.
Example:
TABLE employee deptno (char)
TABLE dept deptno (integer)

Make sure you are joining columns that have the same data type to avoid translation.

Tip 3: Do not use functions like SUBSTR, COALESCE, CASE ... on the indices used as part of a join
Avoid applying functions such as SUBSTR, COALESCE or CASE to the index columns used in a join.
The optimizer cannot use the stats collected on a column once a function is applied to
it.
This might result in product joins and spool-out issues, because the optimizer has to make its decisions
without usable stats on the column.
Tip 4: Not Null columns
NULL values in a nullable column all hash to the same AMP, so joining on such a column
can overload one poor AMP and result in the infamous "NO SPOOL
SPACE" error when that AMP cannot accommodate any more rows.
It is recommended to add an IS NOT NULL condition when joining on the nullable columns of a table so
that skew can be avoided.

Tip 5: Usage of the LIKE clause
Example:
LIKE '%SUBIN%' will be processed differently from LIKE 'SUBIN%'.
In the former, the optimizer needs to do a full table scan, which reduces performance.
In the latter, the optimizer can make use of an index to run the query, thereby improving
performance.
If LIKE is used in a WHERE clause, it is better to use one or more leading characters in
the pattern, if at all possible.
Hence it is suggested to go for '%SUBIN%' only if SUBIN can appear in the middle of the
value being searched.
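The general reason a leading-character pattern is cheaper can be sketched outside the database. The example below assumes an ordered access structure, modeled here as a sorted Python list; the names in the list and the two helper functions are hypothetical, and the sketch does not claim to show how Teradata itself evaluates LIKE.

```python
from bisect import bisect_left

# Hypothetical sorted list standing in for an ordered index on a name column.
names = sorted(["ANIL", "SUBIN", "SUBINA", "SUBINRAJ", "SUBRATA", "VISUBIN", "ZUBIN"])


def like_prefix(sorted_values, prefix):
    """Model of LIKE 'prefix%': jump to the first candidate, stop at the first mismatch."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees that no later value can match
        matches.append(value)
    return matches


def like_contains(values, fragment):
    """Model of LIKE '%fragment%': every value has to be inspected (a full scan)."""
    return [value for value in values if fragment in value]


print(like_prefix(names, "SUBIN"))    # only the neighbouring entries are visited
print(like_contains(names, "SUBIN"))  # the whole list is visited
```

The prefix search touches only the contiguous run of matching entries, whereas the substring search cannot use the ordering at all and also finds values such as VISUBIN that the prefix pattern would never return.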
Tip 6: DISTINCT vs GROUP BY
Both return the same rows, but their execution times can differ.
GROUP BY sorts and de-duplicates the data locally on each vproc (AMP), while DISTINCT redistributes the data first and then sorts
it.
When data is nearly unique in a table, GROUP BY will spend more time attempting to
eliminate duplicates that do not exist at all.
DISTINCT redistributes the rows immediately, so more data may move between the AMPs,
whereas GROUP BY sends only unique values between the AMPs.
Steps used in each case for elimination of duplicates:
GROUP BY
1. It reads all the rows that are part of the GROUP BY.
2. It removes all duplicates on each AMP for the given set of values using the
"BUCKETS" concept.
3. It hashes the unique values on each AMP.
4. It redistributes them to the appropriate AMPs.
5. Once redistribution is completed, it
a. sorts the data to group duplicates on each AMP
b. removes all the duplicates on each AMP and returns the
unique values
DISTINCT
1. It reads each row on the AMP.
2. It hashes the column value identified in the DISTINCT clause of the SELECT statement.
3. It redistributes the rows according to the row value to the appropriate AMP.
4. Once redistribution is completed, it sorts the data to group duplicates on each
AMP.
5. It removes all the duplicates on each AMP and returns the unique
values.
Hence it is better to go for:
GROUP BY: when there are many duplicates
DISTINCT: when there are few or no duplicates
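The difference in data movement between the two strategies can be sketched with a toy model that counts how many rows leave their AMP. It follows the steps listed above but is only an illustration: the 4-AMP layout, the round-robin placement and the function names are assumptions, and real costs also depend on sorting and hashing work that the model ignores.

```python
def distinct_strategy(rows_by_amp):
    """Redistribute every row first, then remove duplicates."""
    rows_moved = sum(len(rows) for rows in rows_by_amp)
    result = set()
    for rows in rows_by_amp:
        result.update(rows)
    return sorted(result), rows_moved


def group_by_strategy(rows_by_amp):
    """Remove duplicates on each AMP first, then redistribute the survivors."""
    local_unique = [set(rows) for rows in rows_by_amp]
    rows_moved = sum(len(values) for values in local_unique)
    result = set().union(*local_unique)
    return sorted(result), rows_moved


def spread(values, num_amps=4):
    """Deal values round-robin across a hypothetical number of AMPs."""
    return [values[amp::num_amps] for amp in range(num_amps)]


many_duplicates = spread([n % 5 for n in range(1000)])  # 5 distinct values
nearly_unique = spread(list(range(1000)))               # 1000 distinct values

for label, data in (("many duplicates", many_duplicates),
                    ("nearly unique", nearly_unique)):
    _, moved_distinct = distinct_strategy(data)
    _, moved_group_by = group_by_strategy(data)
    print(label, "- rows moved: DISTINCT", moved_distinct, "GROUP BY", moved_group_by)
```

With many duplicates the local de-duplication shrinks the redistributed data from 1000 rows to 20; with nearly unique data both strategies move all 1000 rows, so the local pass of GROUP BY is extra work with no benefit.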
Tip 7: Which is faster: SELECT * FROM table or SELECT <all columns> FROM table?
With "SELECT * FROM table", an extra stage is added in which Teradata replaces * with the
column names before it fetches the data.
Using "SELECT <all columns> FROM table" eliminates this extra stage of resolving
the columns of the table.
Hence it is recommended to use "SELECT <all columns> FROM table".

Tip 8: Difference between delete, delete all?


Both return the same result.
Delete all will truncate the data as well as index table.
Delete will truncate the data but maintain the index table.

Tip 9: Variable length columns


The use of variable length columns should be minimized.
Fixed length columns should always be used to define tables.
VARCHARs are bad for read performance because each record can be of variable length and
that makes it more costly to find fields in a record.
Tables with fixed-length rows are easier to reconstruct if you have a table crash.
If you are choosing between CHAR and VARCHAR columns, the tradeoff is one of time versus
space. If speed is your primary concern, use CHAR columns to get the performance benefits
of fixed-length columns. If space is at a premium, use VARCHAR columns.

Tip 10: Union Vs Union All


The UNION command can be used to break up a large SQL process or statement into
several smaller SQL processes or statements, which can run in parallel.
But these could then cause spool space limit problems.
UNION ALL executes the SQL statements single-threaded.
A UNION query, by definition, eliminates all duplicate rows (as opposed to UNION ALL) and
is slower.
Avoid unnecessary UNIONs; they are a huge performance drain. As a rule of
thumb, use UNION ALL whenever duplicate rows cannot occur or do not need to be removed.
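The difference in results between the two operators can be checked with a few rows. The sketch below uses SQLite through Python's standard library as a stand-in for Teradata, since the duplicate-elimination rule is standard SQL; the table names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2012 (customer_number INTEGER);
    CREATE TABLE orders_2013 (customer_number INTEGER);
    INSERT INTO orders_2012 VALUES (1000), (1001), (1001);
    INSERT INTO orders_2013 VALUES (1001), (1002);
""")

# UNION removes duplicate rows, which requires extra comparison work.
union_rows = conn.execute("""
    SELECT customer_number FROM orders_2012
    UNION
    SELECT customer_number FROM orders_2013
    ORDER BY customer_number
""").fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute("""
    SELECT customer_number FROM orders_2012
    UNION ALL
    SELECT customer_number FROM orders_2013
    ORDER BY customer_number
""").fetchall()

print("UNION    :", union_rows)
print("UNION ALL:", union_all_rows)
```

UNION returns three rows here and UNION ALL returns five, so the two are interchangeable only when the inputs cannot overlap or when duplicates are acceptable.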

Tip 11: Strategic Semicolon
At the end of every SQL statement, there is a semicolon.
In some cases, the strategic placement of this semicolon can improve the run time of a group
of SQL statements.
But this will not improve an individual SQL statement's time.
Example:
1) The group's SQL time could be improved if a group of SQL statements
share the same tables (or spool files).
2) The group's SQL time could be improved if several SQL statements use the
same Unix input file.
Tip 12: Unix split OR Unix concatenation
Split
A large input Unix file can be split into several smaller Unix files, which can then be
input in series, or in parallel, to create smaller SQL processing steps.
Concatenation
A large query can be broken up into smaller independent queries, whose output is
written to several smaller Unix files.
These smaller files are then concatenated to provide a single Unix file.

Tip 13: IN Vs EXISTS in Teradata SQL


Performance-wise, both should be the same with a small number of records.
With a large number of records, EXISTS is faster than IN.
IN is mostly used with non-correlated subqueries, and EXISTS with correlated
subqueries.

Tip 14: NOT IN Vs NOT EXISTS


There is a huge difference between NOT IN and NOT EXISTS.
NOT EXISTS simply ignores NULLs.
Never use NOT IN on NULLable columns, because this might return unexpected result
sets. (The result set might be empty when the subquery returns a NULL, and even if it's correct, there's a lot of work for the
database.)
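The NULL behaviour described above follows from standard SQL three-valued logic and can be reproduced on a tiny data set. The sketch uses SQLite through Python's standard library as a stand-in for Teradata; the customer and orders tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_number INTEGER);
    CREATE TABLE orders (customer_number INTEGER);  -- nullable column
    INSERT INTO customer VALUES (1000), (1001), (1002);
    INSERT INTO orders VALUES (1000), (NULL);
""")

# NOT IN: comparing against the NULL yields "unknown", so no row qualifies.
not_in = conn.execute("""
    SELECT customer_number FROM customer
    WHERE customer_number NOT IN (SELECT customer_number FROM orders)
    ORDER BY customer_number
""").fetchall()

# NOT EXISTS: the NULL row never satisfies the correlation and is ignored.
not_exists = conn.execute("""
    SELECT c.customer_number FROM customer c
    WHERE NOT EXISTS (SELECT 1 FROM orders o
                      WHERE o.customer_number = c.customer_number)
    ORDER BY c.customer_number
""").fetchall()

print("NOT IN    :", not_in)
print("NOT EXISTS:", not_exists)
```

A single NULL in the subquery empties the NOT IN result, while NOT EXISTS still returns the two customers without orders.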

Tip 15: Top Vs SAMPLE
TOP 10 means "first 10 rows in sorted order".
The optimizer is free to select the cheapest plan it can find and stop processing as soon as
it has found enough rows to return.
SAMPLE does extra processing to try to randomize the result.
At a very simple level, for example, it could pick a random point at which to start scanning
the table and a number of rows to skip between rows that are returned.
TOP is most useful when you are dealing with larger tables and queries:
rather than running the entire query and then returning sample records,
TOP simply picks the first (or "top") 10 records that have been returned
from any node as the query runs, and then stops the query.

Tip 16: Global temporary table vs volatile table


Whenever we create a global temporary table (GTT), its definition is stored in the Data Dictionary.
Whenever we create a volatile table (VTT), its definition is stored in the system cache.
Whenever we insert data into a GTT, the data is stored in temp space. The definition of the
table remains active until we drop it with the DROP TABLE statement, while the data remains
active only up to the end of the session.
Whenever we insert data into a VTT, the data is stored in spool space. Both the table definition
and the data remain active only up to the end of the session.
We can collect statistics on global temporary tables.
Before Teradata 13, statistics could not be collected on volatile tables. (Teradata 13 and above
allow you to collect stats on volatile tables.)

Tip 17: Use Same PI in Source & Target


If the Source and Target have the same PI, data can be copied very efficiently from
source to target.

Tip 18: DROPPING volatile tables explicitly


Once volatile tables are no longer required, you can drop them. Don't wait for the complete
procedure to be over.
This will free some spool space immediately and could prove to be very helpful in avoiding the
"No More Spool Space" error.

Tip 19: NO LOG for volatile tables
Create volatile tables with NO LOG option.

Tip 20: UPDATE clause and replacing UPDATE with DELETE & INSERT
Do not write an UPDATE statement with just a SET clause and no WHERE condition.
Even if the Target/Source has just one row, add a WHERE clause on the PI column.
Sometimes replacing UPDATE with DELETE & INSERT can save a good amount of AMP CPU.
Check if this holds good for your query.

Tip 21: Unnecessary casting for DATE columns


Avoid unnecessary casting for DATE columns.
Once defined as DATE, you can compare date columns against each other even when they
are in different formats.
Internally, DATE is stored as INTEGER. CAST is required mainly when you have to compare a
VARCHAR value as DATE.

Tip 22: Avoid UDF


Most of the functions needed for data manipulation are available in Teradata. So avoid User
Defined Functions unless there is no other way.

Tip 23: Use COMPRESS


Use COMPRESS on whichever attributes possible in the table creation statement.
This helps in reducing I/O and hence improves performance, especially for attributes having
lots of NULL values or known recurring values.

Tip 24: IN or BETWEEN


In case of a choice of using the IN or the BETWEEN clauses in the query, it is advantageous to
use the BETWEEN clause, as it is much more efficient.
Example:
SELECT customer_number, customer_name
FROM customer
WHERE customer_number in (1000, 1001, 1002, 1003, 1004);

is much less efficient than:
SELECT customer_number, customer_name
FROM customer
WHERE customer_number BETWEEN 1000 and 1004;
Assuming there is a useful index on customer_number, the Query Optimizer can locate a range
of numbers much faster (using BETWEEN) than it can find a series of numbers using the IN clause.
Tip 25: MultiLoad delete or Delete command
MultiLoad delete is faster than the normal DELETE command, since the deletion happens in data
blocks of 64 Kbytes, whereas the DELETE command deletes data row by row.
The Transient Journal maintains entries only for the DELETE command, since Teradata utilities do not
support Transient Journal loading.


6. Glossary

Acronym   Expansion
SLA       Service Level Agreement
ROI       Return On Investment
DBQL      Database Query Log
AMP       Access Module Processor
PE        Parsing Engine
UPI       Unique Primary Index
NUPI      Non-Unique Primary Index
PPI       Partitioned Primary Index
FTS       Full Table Scan
PI        Primary Index
SI        Secondary Index
