
HANDS ON

Traverse Hierarchies
Analyze data and prevent problems with recursive queries.
by Ralph Meira

We find hierarchies in our everyday lives: bills of
materials, organizational structures, genealogy charts, etc. In
theory, hierarchies are easy to understand.
However, they can be quite a challenge
when you attempt to represent and manipulate them using a database and SQL.
The efficient handling of hierarchies can
lead to the detection of anomalies in complex
systems, the uncovering of unknown relationships and much more. Unfortunately, in
the real world, they are often plagued by data
quality problems that can negatively affect
their analysis. The Teradata Database can
shield you from these problems through the
use of recursive queries, as long as you take
a few precautions.

Analyze Levels of Data


Hierarchies are best represented in the form
of a treelike structure with several levels.
Their in-database representation typically
relies on parent-child relationships that, in
the case of an organizational chart, identify
the relationship between employer and
employee. Due to the nature of hierarchical structures, data quality issues such
as missing links or cyclic child-parent-child relationships can
often complicate the analysis
of data.

Querying Hierarchies
The recursive query syntax was created to enable SQL to query hierarchical data of an unknown depth. Using the WITH clause, you
can define derived tables before the main
query instead of within it. A recursive
query contains at least one reference
to its name within its own definition
and is composed of these parts:
> A seed query UNIONed to an iterative (recursive) segment
> At least one logical condition to prevent infinite loops
from occurring
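
The two parts listed above can be tried outside Teradata. The following is a minimal sketch, assuming Python's built-in sqlite3 module and SQLite's WITH RECURSIVE dialect rather than Teradata's. The function name walk_hierarchy, its parameters and the sample rows are hypothetical and serve only as illustration.

```python
import sqlite3


def walk_hierarchy(edges, root=0, max_level=10):
    """Return (level, parent_id, child_id) rows reachable from `root`.

    `edges` is a list of (parent_id, child_id) pairs (hypothetical data).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (parent_id INTEGER, child_id INTEGER)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE tree (level, parent_id, child_id) AS (
            -- Seed query: the links that start at the root node
            SELECT 0, e.parent_id, e.child_id
            FROM table_1 e
            WHERE e.parent_id = ?
            UNION ALL
            -- Recursive segment: the query refers to itself by name
            SELECT tree.level + 1, s.parent_id, s.child_id
            FROM tree JOIN table_1 s ON tree.child_id = s.parent_id
            -- Logical condition that keeps the recursion finite
            WHERE tree.level < ?
        )
        SELECT level, parent_id, child_id
        FROM tree
        ORDER BY level, parent_id, child_id
        """,
        (root, max_level),
    ).fetchall()
    conn.close()
    return rows
```

Called with the rows (0, 1), (1, 11) and (11, 111), the function returns one row per link, labeled with levels 0, 1 and 2. With cyclic rows such as (0, 1) and (1, 0), the level condition alone is what ends the recursion.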

PAGE 1 l Teradata Magazine l Q3/2011 l 2011 Teradata Corporation l AR-6393

Simple Hierarchy

figure 1

Navigate this straightforward hierarchy by following the parent-child relationships.

Hierarchy With Problems

figure 2

This hierarchy has some data problems: The green lines highlight
(a) a cyclic loop (1, 11, 111, 1, ...) and (b) a lower-level child item, 11111, as
a parent to 222, which is higher up in the hierarchy. There is also a
missing link between 22 and 222.

You can easily visualize how data quality
problems occur by comparing figure 1 to
figure 2. Notice how the link between 22
and 222 is missing in figure 2, perhaps due
to input error. Also notice that 1, 11 and
111 can lock you into a cyclic loop that
never ends.
Recursive queries can be used to analyze
hierarchical data while guarding against data
quality problems. As an example, the
recursive SQL syntax shown in SQL sample 1 has
safety features that help identify cyclic
data loops and keep the recursion depth to
a maximum of 10 levels. This syntax can be
repurposed to unravel any hierarchy that
follows the format in table 1, and the limit
can be raised to let you go much deeper
than just 10 levels.
Some key elements in the syntax are:
> A materialized PATH is composed of
concatenated values that show the route
taken by each level of recursion.
> POSITION determines whether a node
from the hierarchy is being repeated in
the PATH, thus identifying cyclic data.
> The TREE.LEVEL < 10 condition effectively
stops the query from going any deeper
than 10 levels.
> The WHERE E.PARENT_ID = 0
clause in the SQL is responsible
for defining node 0 as the starting
point. Each step of all possible
routes initiated at node 0 is shown
following the PARENT-CHILD
row order.
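
The PATH and POSITION elements above can be sketched in a form that runs anywhere. This is an illustration under stated assumptions, not the article's Teradata syntax: it uses Python's sqlite3 module, SQLite's instr() in place of POSITION, and a '/' delimiter around each ID in the path. The name find_paths and the sample rows are hypothetical, and the MIN/GROUP BY step of the original is omitted for brevity.

```python
import sqlite3

CYCLE_QUERY = """
WITH RECURSIVE tree (level, parent_id, child_id, path) AS (
    -- Seed: start at node 0 and open the materialized path
    SELECT 0, e.parent_id, e.child_id, '/' || e.parent_id || '/'
    FROM table_1 e
    WHERE e.parent_id = 0
    UNION ALL
    -- Each step appends the parent it passes through to the path
    SELECT tree.level + 1, s.parent_id, s.child_id,
           tree.path || s.parent_id || '/'
    FROM tree JOIN table_1 s ON tree.child_id = s.parent_id
    WHERE tree.level < 10
      -- Do not expand a node that is already on the path
      AND instr(tree.path, '/' || s.parent_id || '/') = 0
)
SELECT parent_id, child_id, level + 1 AS depth,
       path || child_id || '/' AS full_path,
       -- A child that already appears on its own path closes a loop
       CASE WHEN instr(path, '/' || child_id || '/') > 0
            THEN 'CYCLIC' ELSE '' END AS cyclic
FROM tree
ORDER BY full_path
"""


def find_paths(edges):
    """Return (parent_id, child_id, depth, full_path, cyclic) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (parent_id INTEGER, child_id INTEGER)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)", edges)
    rows = conn.execute(CYCLE_QUERY).fetchall()
    conn.close()
    return rows
```

For the rows (0, 1), (1, 11), (11, 111) and (111, 1), the last result row carries the flag CYCLIC and the recursion stops there instead of looping. The '/' delimiters are a choice made for this sketch so that the substring test matches whole IDs only; without them, node 1 would be found inside a path that contains only 11.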
It's not uncommon to be confused
by the syntax of recursive queries, so
it is useful to employ recursive VIEWs
to simplify the analysis of hierarchies.
Using the RECURSIVE VIEW format
enables the use of set operators such
as MINUS and INTERSECT that allow
comparative analysis between two similar hierarchies.

table 1

Parent-Child Rules (columns: PARENT_ID, CHILD_ID)

SQL SAMPLE 1
WITH RECURSIVE TREE
  (LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( -- Seed query: links that start at node 0
  SELECT 0 AS LEVEL,
         E.PARENT_ID,
         E.CHILD_ID,
         CAST(E.PARENT_ID AS VARCHAR(200)) AS PATH
  FROM TABLE_1 E
  WHERE E.PARENT_ID = 0
  UNION ALL
  -- Recursive segment: follow each child to its own children
  SELECT TREE.LEVEL + 1,
         S.PARENT_ID,
         S.CHILD_ID,
         CAST(TREE.PATH || S.PARENT_ID AS VARCHAR(200)) AS PATH
  FROM TREE, TABLE_1 S
  WHERE TREE.CHILD_ID = S.PARENT_ID
    AND TREE.LEVEL < 10                           -- depth limit
    AND POSITION(S.PARENT_ID IN TREE.PATH) < 1    -- node not yet in PATH
)
SELECT PARENT_ID, CHILD_ID,
       MIN(LEVEL) + 1 AS DEPTH,
       PATH || CHILD_ID AS PATH,
       CASE
         WHEN (POSITION(CHILD_ID IN PATH) > 0)
         THEN 'CYCLIC'
         ELSE ''
       END AS CYCLIC
FROM TREE
GROUP BY 1, 2, 4, 5
ORDER BY 4, 3, 1, 2;


PARENT_ID    CHILD_ID
             11
             22
             33
11           111
11           1111
22           222
33           333
222          2222
1111         11111

By using REPLACE RECURSIVE VIEW,
you can start to analyze changes or anomalies
introduced by data quality issues. ANSI
SQL recursive queries are also available to
Teradata Database users and have been for
many years.
SQL sample 2 shows what syntax with
recursive VIEWs can look like.
If the appropriate row changes are made
to table 1, it's possible to have the table
represent the hierarchy shown in figure 2.
Executing the recursive SQL syntax
against a revised table 1 will produce the
results shown in table 2.
You can compare the hierarchy of nodes in
figures 1 and 2 using SQL sample 3.
Note that the SELECT before and after the
MINUS is identical to the final SELECT used
in SQL sample 1 to
extract, group and order the results of
the recursive query. A simpler analysis of
changes to a hierarchy can often be carried
out without the need for recursive queries
at all. For example, to find out which nodes
in a hierarchy are considered to be top
nodes, that is, nodes without parents, you
only need to SELECT all PARENT_IDs
MINUS the SELECT of all CHILD_IDs for
the same hierarchy.
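
The top-node check described above can be sketched as follows, assuming Python's sqlite3 module as a stand-in for Teradata. SQLite has no MINUS keyword, so the sketch uses EXCEPT, the standard spelling of the same set difference. The function name top_nodes and the sample rows are hypothetical.

```python
import sqlite3


def top_nodes(edges):
    """Return the IDs that appear as a parent but never as a child."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (parent_id INTEGER, child_id INTEGER)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        SELECT parent_id FROM table_1
        EXCEPT                      -- set difference, as MINUS does
        SELECT child_id FROM table_1
        ORDER BY 1
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

A complete hierarchy returns only its starting node. If a link is missing, the parent of the orphaned branch shows up as an extra top node: for the rows (0, 2), (2, 22) and (222, 2222), the result is 0 and 222.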
The results shown in table 3 assume
that TREE_FIG2_V and TREE_FIG1_V are recursive
VIEWs in the format shown in SQL sample 2,
whose underlying tables
contain the parent-child relationships that
correspond to figures 2 and 1, respectively.
Table 3 shows the differences between figures 1
and 2, as originally intended.
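
A comparison of this kind can be tried on a small scale with the sketch below. It rests on several assumptions: Python's sqlite3 module instead of Teradata, two common table expressions instead of recursive VIEWs (which SQLite does not offer), EXCEPT instead of MINUS, and '/'-delimited paths. The table names table_fig1 and table_fig2, the function diff_hierarchies and the sample rows are hypothetical.

```python
import sqlite3

# One recursive definition, instantiated once per hierarchy below.
TREE_CTE = """
{name} (level, parent_id, child_id, path) AS (
    SELECT 0, e.parent_id, e.child_id, '/' || e.parent_id || '/'
    FROM {table} e
    WHERE e.parent_id = 0
    UNION ALL
    SELECT {name}.level + 1, s.parent_id, s.child_id,
           {name}.path || s.parent_id || '/'
    FROM {name} JOIN {table} s ON {name}.child_id = s.parent_id
    WHERE {name}.level < 10
      AND instr({name}.path, '/' || s.parent_id || '/') = 0
)"""

DIFF_QUERY = (
    "WITH RECURSIVE"
    + TREE_CTE.format(name="tree_fig2", table="table_fig2")
    + ","
    + TREE_CTE.format(name="tree_fig1", table="table_fig1")
    + """
SELECT parent_id, child_id, level + 1 AS depth,
       path || child_id || '/' AS full_path
FROM tree_fig2
EXCEPT
SELECT parent_id, child_id, level + 1, path || child_id || '/'
FROM tree_fig1
ORDER BY 4, 3, 1, 2
"""
)


def diff_hierarchies(fig1_edges, fig2_edges):
    """Return the path rows found in the second hierarchy but not the first."""
    conn = sqlite3.connect(":memory:")
    for table, edges in (("table_fig1", fig1_edges), ("table_fig2", fig2_edges)):
        conn.execute(f"CREATE TABLE {table} (parent_id INTEGER, child_id INTEGER)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", edges)
    rows = conn.execute(DIFF_QUERY).fetchall()
    conn.close()
    return rows
```

If the second hierarchy equals the first plus the row (111, 1), the only row returned is the one for that new link, with its full path from node 0. Swapping the two arguments lists the paths that were lost instead.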

Data Quality Protection


Hierarchies, though easy to traverse based
on their parent-child rules, often come
with data quality issues that can lead
to infinite loops. Recursive queries can
protect you from data problems such as
missing links and cyclic loops.
The WITH RECURSIVE syntax can
guide you through hierarchies to find their
depth, detect infinite loops and more. You'll
be surprised how easy it is to use SQL to
navigate the structures and identify data
quality problems.
Ralph Meira is a Teradata senior
solution architect focused primarily
on manufacturing accounts.

SQL SAMPLE 2
REPLACE RECURSIVE VIEW TREE_V
  (LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( SELECT 0 AS LEVEL,
         E.PARENT_ID,
         E.CHILD_ID,
         CAST(E.PARENT_ID AS VARCHAR(200)) AS PATH
  FROM TABLE_1 E
  WHERE E.PARENT_ID = 0
  UNION ALL
  SELECT TREE_V.LEVEL + 1,
         S.PARENT_ID,
         S.CHILD_ID,
         CAST(TREE_V.PATH || S.PARENT_ID AS VARCHAR(200)) AS PATH
  FROM TREE_V, TABLE_1 S
  WHERE TREE_V.CHILD_ID = S.PARENT_ID
    AND TREE_V.LEVEL < 10
    AND POSITION(S.PARENT_ID IN TREE_V.PATH) < 1
);

SQL SAMPLE 3
-- Rows produced by the figure 2 hierarchy ...
SELECT PARENT_ID, CHILD_ID,
       MIN(LEVEL) + 1 AS DEPTH,
       PATH || CHILD_ID AS PATH,
       CASE
         WHEN (POSITION(CHILD_ID IN PATH) > 0)
         THEN 'CYCLIC' ELSE '' END AS CYCLIC
FROM TREE_FIG2_V GROUP BY 1, 2, 4, 5
MINUS
-- ... minus the rows produced by the figure 1 hierarchy
SELECT PARENT_ID, CHILD_ID,
       MIN(LEVEL) + 1 AS DEPTH,
       PATH || CHILD_ID AS PATH,
       CASE
         WHEN (POSITION(CHILD_ID IN PATH) > 0)
         THEN 'CYCLIC' ELSE '' END AS CYCLIC
FROM TREE_FIG1_V GROUP BY 1, 2, 4, 5
ORDER BY 4, 3, 1, 2;

Paths From Node 0

table 2

Columns: PARENT_ID, CHILD_ID, DEPTH, PATH, CYCLIC

There are many available paths when starting at Node 0. The cyclic path is correctly
flagged in the last column.

Different Paths

table 3

Columns: PARENT_ID, CHILD_ID, DEPTH, PATH, CYCLIC

This table shows how the paths differ between figures 1 and 2.
