
Invaluable Data Modeling Rules to Implement Your Data Vault

Dan Linstedt
Inventor of the Data Vault

ISBN: 978-0-9866757-1-3

Super Charge Your Data Warehouse

Page 2 of 152

Super Charge Your Data Warehouse

Invaluable Data Modeling Rules to Implement Your Data Vault
Copyright Dan Linstedt, 2008-2011
http://LearnDataVault.com
All rights reserved.
All images are the property of Dan Linstedt, unless an image source is otherwise noted.
No part of this book may be reproduced in any form or by any electronic or mechanical means
including information storage and retrieval systems, without permission in writing from the
author. The only exception is by a reviewer, who may quote short excerpts in a review.
Printed in the United States of America
First Printing: December, 2010

Co-Editor: Kent Graziano


Special Thanks: Tom Breur for additional editing
Abstract:
The purpose of this book is to present and discuss the technical components of the Data Vault Data Model.
The examples in this book provide a strong foundation for how to build and design structures when using the
Data Vault modeling technique. This book is the second in the series of books surrounding the Data Vault
model and methodology (approach). The target audience is anyone wishing to implement a Data Vault model
for integration purposes, whether it be an Enterprise Data Warehouse, Operational Data Warehouse, or
Dynamic Data Integration Store.

Daniel Linstedt, 2008-2011


Dan Linstedt 2010-2011, all rights reserved

http://LearnDataVault.com

http://SuperChargeYourEDW.com


Table of Contents
Acknowledgements .......................................................................................................... 10
1.0 Introduction and Terminology .................................................................................... 11
  1.1 Do I need to be a Data Modeler to read this book? .................................................. 11
  1.2 Review of Basic Terminology ................................................................................... 11
  1.3 Data Modeling Notations Used in This Text ............................................................. 15
  1.4 Data Models as Ontologies ...................................................................................... 15
  1.5 Data Model Naming Conventions and Abbreviations ............................................... 17
  1.6 Introduction to Hubs, Links, and Satellites ............................................................... 19
  1.7 Flexibility of the Data Vault Model ........................................................................... 22
  1.8 Data Vault Basis of Commutative Properties and Set Based Math .......................... 25
  1.9 Data Vault and Parallel Processing Mathematics .................................................... 27
  1.10 Introduction to Complexity and the Data Vault ...................................................... 32
  1.11 Loading Processes: Batch Versus Real Time ........................................................ 35
2.0 Architectural Definitions ............................................................................................ 37
  2.1 Staging Area ............................................................................................................ 37
  2.2 EDW Data Vault ...................................................................................................... 38
  2.3 Metrics Vault ........................................................................................................... 39
  2.4 Meta Vault ............................................................................................................... 39
  2.5 Report Collections ................................................................................................... 40
  2.6 Data Marts .............................................................................................................. 40
  2.7 Business Data Vault ................................................................................................ 40
  2.8 Operational Data Vault ............................................................................................ 41
  2.9 Dynamic Data Vault ................................................................................................. 42
3.0 Common Attributes ................................................................................................... 43
  3.1 Sequence Numbers ................................................................................................. 45
  3.2 Sub Sequence Numbers (Item Numbering) ............................................................. 46
  3.3 Load Dates .............................................................................................................. 47
  3.4 Load End Dates ....................................................................................................... 49
  3.5 Last Seen Dates ...................................................................................................... 50
  3.6 Extract Dates ........................................................................................................... 53
  3.7 Record Creation Dates ............................................................................................ 54
  3.8 Record Sources ....................................................................................................... 54
  3.9 Process IDs ............................................................................................................. 55
4.0 Hub Entities ............................................................................................................... 56
  4.1 Hub Definition and Purpose ..................................................................................... 58
  4.2 What is a Business Key? ......................................................................................... 59
  4.3 Where do we find Business Keys? ........................................................................... 60
  4.4 Why are Business Keys Important? ......................................................................... 61
  4.5 How do Business Keys tie to Hubs and Business Processes? ................................. 63
  4.6 Why not Surrogate Keys as Master Keys? ............................................................... 64
  4.7 Hub Smart Keys, Intelligent Keys ............................................................................. 64
  4.8 Hub Composite Business Keys ................................................................................ 65
  4.9 Hub Entity Structure ................................................................................................ 66
  4.10 Hub Examples ....................................................................................................... 67
  4.11 Dependent and Non-dependent Child Keys ........................................................... 69
  4.12 Mining patterns in the Hub Entity ........................................................................... 71
  4.13 Process of Building a Hub Table ............................................................................ 73
  4.14 Modeling Rules and Standards for Hub Tables ..................................................... 74
  4.15 What Happens when the Hub Standards Are Broken ............................................ 75
5.0 Link Entities ............................................................................................................... 77
  5.1 Link Definition and Purpose ..................................................................................... 77
  5.2 Reasons for Many To Many Relationships .............................................................. 77
  5.3 Flexibility ................................................................................................................. 80
  5.4 Granularity ............................................................................................................... 83
  5.5 Dynamic Adaptability ............................................................................................... 86
  5.6 Scalability ................................................................................................................ 87
  5.7 Link Entity Structure ................................................................................................ 90
  5.8 Link Driving Key ...................................................................................................... 90
  5.9 Link Examples ......................................................................................................... 92
  5.10 Degenerate Fields In Links .................................................................................... 94
  5.11 Multi-Temporal Date Structures ............................................................................. 94
  5.12 Link-To-Link (Parent/Child Relationships) .............................................................. 96
  5.13 Link Applications ................................................................................................... 99
  5.14 Hierarchical Links .................................................................................................. 99
  5.15 Same-As Links .................................................................................................... 101
  5.16 Begin and End Dating Links ................................................................................ 103
  5.17 Low Value Links .................................................................................................. 106
  5.18 Transactional Links ............................................................................................. 106
  5.19 Computed Aggregate Links ................................................................................. 108
  5.20 Strength and Confidence Ratings in Links ........................................................... 110
  5.21 Exploration Links ................................................................................................. 111
6.0 Satellite Entities ...................................................................................................... 112
  6.1 Satellite Definition and Purpose ............................................................................ 112
  6.2 Satellite Entity Structure ........................................................................................ 113
  6.3 Satellite Examples ................................................................................................. 114
  6.4 Importance of Keeping History .............................................................................. 115
  6.5 Splitting Satellites by Classification or Type of Data .............................................. 116
  6.6 Splitting Satellites by Rate of Change ................................................................... 118
  6.7 Satellites Arranged by Source System .................................................................. 120
  6.8 Overloaded Satellites (The Flip-Flop Effect) .......................................................... 122
  6.9 Satellite Applications ............................................................................................. 124
    6.9.1 Effectivity Satellites ......................................................................................... 124
    6.9.2 Record Tracking Satellites .............................................................................. 125
    6.9.3 Status Tracking Satellites ............................................................................... 128
    6.9.4 Computed Satellites (Quality Generated) ........................................................ 129
    6.9.5 Multiple Active Satellite Rows ......................................................................... 129
  6.10 Splitting Satellites ................................................................................................ 132
  6.11 Consolidating Satellites ....................................................................................... 136
7.0 Query Assistant Tables ........................................................................................... 140
  7.1 Point in Time Tables ............................................................................................. 140
  7.2 Bridge Tables ........................................................................................................ 143
8.0 Reference Tables .................................................................................................... 146
  8.1 No-History Reference Tables ................................................................................ 147
  8.2 History Based Reference Tables ........................................................................... 148
  8.3 Code and Descriptions .......................................................................................... 150
9.0 Conclusions ............................................................................................................. 151


Table of Figures
Figure 1-1: Example E-R Diagram (Elmasri/Navathe) ............................................................................ 13
Figure 1-2: Crow's Foot and Arrow Notation Example ............................................................ 15
Figure 1-3: Small Example: Ontology for Vehicle.................................................................................... 16
Figure 1-4: Example Abbreviations and Naming Conventions .............................................................. 18
Figure 1-5: Example Data Vault ............................................................................................................... 20
Figure 1-6: Flexibility of Adapting to Change .......................................................................................... 23
Figure 1-7: 3rd Normal Form Product and Supplier Example ................................................................ 24
Figure 1-8: Applied Set Theory for the Data Vault .................................................................................. 27
Figure 1-9: Parallel Computing Simplified .............................................................................................. 28
Figure 1-10: Logical Data Vault Hyper Cube........................................................................................... 29
Figure 1-11: Physical Data Vault Layout (Starting point) ....................................................................... 30
Figure 1-12: Physical Data Vault Layout (Partitioned) ........................................................................... 31
Figure 2-1: Enterprise BI Architectural Components ............................................................................. 37
Figure 3-1: Time Series Batch Loaded Data ........................................................................................... 43
Figure 3-2 Real-Time Arrival, Data Geology ............................................................................................ 44
Figure 3-3: Load Date Time Stamp and Record Source ........................................................................ 47
Figure 3-4: Example Load Date Time Stamp Data ................................................................................. 48
Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle ................................................ 49
Figure 3-6: Structures containing Last Seen Dates ............................................................................... 51
Figure 3-7: Scan all data in EDW............................................................................................................. 51
Figure 3-8: Reduced Scan Set after Applying Last Seen Date .............................................................. 53
Figure 4-1: Business Key Changing Across Line of Business ................................................................ 57
Figure 4-2: Hub Example Images ............................................................................................................ 58
Figure 4-3: Hub Example Data ................................................................................................................ 59
Figure 4-4: Smart Key Example ............................................................................................................... 65
Figure 4-5: Composite Business Key Hub Example ............................................................................... 66
Figure 4-6: Example Hub Entity Structure .............................................................................................. 67
Figure 4-7: Example Hubs from Adventure Works 2008 ....................................................................... 68
Figure 4-8: Example of National Drug Code Data Vault ......................................................................... 69
Figure 4-9: Dependent Child Relationship Modeling ............................................................................. 70
Figure 4-10: Typical Hub Row Sizing ....................................................................................................... 75
Figure 5-1: Relationship Changes Over Time ......................................................................................... 78
Figure 5-2: Link Table Structure Housing Multiple Relationships ......................................................... 79
Figure 5-3: Starting Model Before Changes ........................................................................................... 81

Figure 5-4: Data Vault After Modification ............................................................................... 81
Figure 5-5: Additional Data Vault Model - More Changes...................................................... 82
Figure 5-6: Global Data Vault Linking ..................................................................................................... 83
Figure 5-7: Uncovering Fact Table Grain ................................................................................................ 84
Figure 5-8: Data Vault Grain, Representing Star Schema ..................................................................... 84
Figure 5-9: Traditional Data Vault Storage Layout ................................................................................. 87
Figure 5-10: Performance Physical Split Version 1 ................................................................................ 88
Figure 5-11: Performance Physical Split Version 2 ................................................................................ 89
Figure 5-12: Performance Physical Split Version 3 ................................................................................ 89
Figure 5-13: Sample Link Structure ........................................................................................................ 90
Figure 5-14: Example Driving Key for Link ............................................................................................. 91
Figure 5-15: Example of Link Satellite with Driving Key ........................................................................ 91
Figure 5-16: Insert to Link/Sat Based on Driving Key ........................................................................... 92
Figure 5-17: Link Driving Key/Satellite End Dated ................................................................................ 92
Figure 5-18: Example of Link Tables From Adventure Works 2008 Data Vault................................... 93
Figure 5-19: Example of Link To Link Relationships .............................................................................. 96
Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy ......................................................................... 97
Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy ......................................................................... 98
Figure 5-22: Example Organization Structure ...................................................................................... 100
Figure 5-23: Hierarchical Link for Offices ............................................................................................. 100
Figure 5-24: Example Hierarchical Link of Employees ........................................................................ 101
Figure 5-25: Same-As Link Example, Business Data ........................................................................... 102
Figure 5-26: Same-As Link Data Vault Model....................................................................................... 102
Figure 5-27: Incorrect Link with Begin/End Date ................................................................................. 103
Figure 5-28: Begin & End Dates in Links .............................................................................................. 104
Figure 5-29: Example of Poorly Constructed Link ................................................................................ 105
Figure 5-30: Satellite Effectivity on a Link ............................................................................................ 105
Figure 5-31: Transactional Link Example ............................................................................................. 106
Figure 5-32: Transactional Link, No Satellite ....................................................................................... 108
Figure 5-33: Example of Computed Aggregate Link ............................................................................ 109
Figure 6-1: Example Satellite Entity ...................................................................................................... 114
Figure 6-2: Example Satellite Entities ................................................................................................... 115
Figure 6-3: Satellites Split by Type Of Data Option 1 ........................................................................... 117
Figure 6-4: Satellite Data Rate of Change Example ............................................................................. 118
Figure 6-5: Satellite Split by Rate Of Change ....................................................................................... 119
Figure 6-6: Customer Satellites Split by Source System ..................................................................... 121
Figure 6-7: Satellite Overload from Many Sources .............................................................................. 122

Figure 6-8: Satellite Effectivity ............................................................................................... 124
Figure 6-9: Denormalized Record Source Tracking Satellite ............................................... 126
Figure 6-10: Normalized Record Source Tracking Satellite................................................................. 127
Figure 6-11: Status Tracking Satellite .................................................................................................. 128
Figure 6-12: Multi-Active Satellite Rows ............................................................................................... 130
Figure 6-13: Multi-Active Satellite Row Data ........................................................................................ 131
Figure 6-14: Multi-Active Satellite with Business Sub-Sequence........................................................ 132
Figure 6-15: Step 1: Identify Satellite Split Columns ........................................................................... 133
Figure 6-16: Step 2: Split Satellite Columns, Design New tables ....................................................... 134
Figure 6-17: Step 3: Copy Data From Original to New Satellites ........................................................ 134
Figure 6-18: Step 4: Eliminate Duplicates ............................................................................................ 135
Figure 6-19: Step 4: Alternate Elimination of Duplicates .................................................................... 135
Figure 6-20: Step 5: End Dates Adjusted After Satellite Split ............................................................. 136
Figure 6-21: Consolidating Satellite Data ............................................................................................. 137
Figure 6-22: Load End Dates Calculated in Consolidated Satellite .................................................... 139
Figure 7-1: Structure of PIT Table ......................................................................................................... 140
Figure 7-2: PIT Table Architecture Overview ......................................................................................... 141
Figure 7-3: Example PIT Table with Snapshot Dates ........................................................................... 142
Figure 7-4: Bridge Table Structure ........................................................................................................ 143
Figure 7-5: Bridge Table Architectural Overview .................................................................................. 144
Figure 7-6: Bridge Table Example Data ................................................................................................ 145
Figure 8-1: Non-History Reference Table .............................................................................................. 148
Figure 8-2: Standard History Based Reference Table.......................................................................... 149
Figure 8-3: Hub/Sat History Based Reference Table ........................................................................... 150
Figure 8-4: Group Code and Description .............................................................................................. 151


Acknowledgements
I wish to personally thank Kent Graziano for sticking by me all this time. His relentless editing skills
have truly helped to shape and hone this book. It's taken me two years to put this book together,
and countless hours of writing and creating graphics and examples in high-quality print and color.
In addition to Kent, Tom Breur also assisted me in the editing process; he helped me to draw out
important points. And yes, he wanted me to change to single spacing, but that's one thing I just
didn't compromise on.
Then there is Sanjay Pande, an IT veteran turned marketing expert who knows his stuff inside
and out. He's been an inspiration to me to try new things and create new titles for the book. He's
also helping me with many other aspects of marketing that I wasn't even aware of.
I wish to thank my wife Julie for putting up with me spending hours editing my book (even on my
vacations), which I really shouldn't do. My wife also helped me re-formulate the cover art and pick a
cool-looking design.
I'd also like to thank God for blessing me with this knowledge and then finally urging me to trust Him
and write it down for others!
Finally, I'd really like to thank YOU, the reader. Many of you know me, or have seen me teach in
person; without you, there would be no Data Vault successes in the world today. I love to hear
about your trials, as well as your successes with the Data Vault. If you'd like to help me write (yet
another) book of case studies, then I want to hear from you!
Of course, if you're ever in Saint Albans or even Burlington, Vermont, drop me an email or call me;
I'd be delighted to meet you for lunch.
Sincerely,
Daniel Linstedt
DanL@DanLinstedt.com


1.0 Introduction and Terminology


Welcome to the technical book about Data Vault Modeling. This book is for practitioners and
implementers. The content of this book is focused on the Data Vault structures, definitions of the
structures, metadata, and data modeling; it does not cover the loading, querying, or partitioning
of the Data Vault. Those feature sets will be covered in the next set of technical books.
1.1 Do I need to be a Data Modeler to read this book?

No, it is not necessary to be a data modeler to read this book. While a data modeling background is
helpful, it is not required. The writing covers the basic components of the Data Vault Model, and
also introduces concepts utilized by nearly all relational database systems. Experience with
RDBMS engines can also be applied to the concepts and knowledge presented here. This book
also assumes you are familiar with the basics of data warehousing as defined by W.H. Inmon and
Dr. Ralph Kimball.
A common understanding of fields/columns, tables, and key structures (such as referential
integrity) is helpful. The next section provides descriptions of common terms used throughout this
book.
1.2 Review of Basic Terminology

The terminology in this book consists of basic entity-relationship (E-R) diagramming and data
modeling terms. Terms such as Table, Entity, Attribute, Column, Field, Primary Key, Foreign
Key, and Unique Index are utilized throughout. For reference purposes, the following basic
definitions are provided.
Table: A composite grouping of data elements instantiated in a database, making up a concept.

Entity: A table, as referred to in a logical format (e.g., customer, account, etc.).

Attribute: A single data element comprised of a name, data type, length, precision, null flag, and possibly a default value.

Column: An ordered attribute within a table.

Field: Same as Column. See Column definition.

Primary Key: Main set of one or more attributes indicating a unique method for identifying data stored within a table.

Foreign Key: One or more attributes associated with the primary key in another table. Often used as lookup values; may be optional (nullable) or mandatory (non-null). When enabled in a database, foreign keys ensure referential integrity.

Unique Index: One or more attributes combined to form a single unique list of data spanning all rows within a single table.

Business Key: Component used by the business users, business processes, or operational code to access, identify, and associate information within a business operational life-cycle. This key may be represented by one or more attributes.

Natural Key: See Business Key.

Relationship: An association between or across exactly two tables.

Many to 1: A notation used to describe the number of records in the left-hand table as related to the number of records in the right-hand table. Example: many customer records may have 1 and only 1 contact record.

Many to Many: An open-ended notation. For example: where many customer records may have many contact records.

1 to 1: A notation dictating singular cardinality: 1 customer record may have 1 and only 1 contact record.

Cardinality: In mathematics, the cardinality of a set is a measure of the "number of elements of the set". For example, the set A = {1, 2, 3} contains 3 elements, and therefore A has a cardinality of 3. There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers. Reference: http://en.wikipedia.org/wiki/Cardinality

Constraint: A relationship between two tables that enforces the existence of data in a parent table, or an indicator of a uniqueness or not-null column. A constraint may also indicate basic rules such as defaults, functions to check values, or possibly ranges of data.

Weak Relationship: A constraint that is optional: when the data is not null, the constraint is checked for validity; when the data is null, the constraint is not checked for validity.

Strong Relationship: A constraint that is non-optional. Data is required (not-null) in the child table at all times, and is therefore always checked for validity.

Associative Entity: An associative entity is an element of the entity-relationship model. The relational database model doesn't offer direct support for many-to-many relationships, even though such relationships happen frequently in normal usage. The solution to this problem is the creation of another table to hold the necessary information for this relationship. This new table is called an associative entity. http://en.wikipedia.org/wiki/Associative_entity

Data models are diagrammatic representations of information and classes of information to be held within a mechanical storage mechanism such as a database engine; except in the case of a conceptual/business model, which should be independent of technology. Common database engines today include: DB2 UDB, Teradata, MySQL, PostgreSQL, Oracle, SQLServer, and Sybase ASE. There are several main notations used for E-R diagrams (e.g., Chen, Barker, IDEF, etc.). An example of an E-R diagram using Elmasri/Navathe notation is below:

Figure 1-1: Example E-R Diagram (Elmasri/Navathe)
Image: http://cisnet.baruch.cuny.edu/holowczak/classes/9440/entityrelationship/

Data models (such as E-R diagrams) house linguistic representations of concepts tied together through associations. These associations can also be thought of as ontologies. There are many types of data modeling notations available in the world today. Two main types are focused on in this document: 3rd normal form and Star Schema. For reference purposes, simple definitions of both styles are included below.

3rd Normal Form is defined as follows:
The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd [1] in 1971. Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

- The relation R (table) is in second normal form (2NF)
- Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every key of R.

A non-prime attribute of R is an attribute that does not belong to any candidate key of R. [2] A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X). [3]

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:

- X contains A (that is, X → A is a trivial functional dependency), or
- X is a superkey, or
- A is a prime attribute (i.e., A is contained within a candidate key) [4]

Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce-Codd normal form (BCNF). BCNF simply eliminates the third alternative ("A is a prime attribute").
http://en.wikipedia.org/wiki/Third_normal_form
Star schema modeling is defined as follows:

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.
http://en.wikipedia.org/wiki/Star_schema

The Star Schema modeling approach is called a "star" because it appears to look similar to a star-like shape. Star Schema modeling is championed by Dr. Ralph Kimball.

1.3 Data Modeling Notations Used in This Text

Crow's Foot notation is utilized throughout this text to represent raw data models; in addition to the crow's-foot notation, this text introduces arrows to represent data migration paths (vectors/direction of data flow). It is occasionally easier to describe the vector notation to business users when compared with describing crow's-foot notation.

The "Crow's Foot" notation represents relationships with connecting lines between entities, and pairs of symbols at the ends of those lines to represent the cardinality of the relationship. Crow's Foot notation is used in Barker's Notation and in methodologies such as SSADM and Information Engineering. http://en.wikipedia.org/wiki/Entity-relationship_model

Figure 1-2: Crow's Foot and Arrow Notation Example

Arrow notation is less descriptive (mathematically) and shows only the direction or flow of the parent table primary key (e.g. Artist) into the child table (e.g. Song).
1.4 Data Models as Ontologies

Data models function as ontologies in this world. They seek to organize a hierarchy of information into a classification system. Ontologies are extremely powerful notions that can capture augmented or enhanced metadata (information about the data model) that is not represented by the model itself.

In both computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it.
http://en.wikipedia.org/wiki/Ontology_(computer_science)


Ontologies are one way to represent terms beyond data modeling; much of the Data Vault model is based on ontology concepts. When the Data Vault model form is combined with the functions of data mining and structure mining, then new relationships can be discovered, created, and dropped over time. The ontology can be morphed or dynamically altered into new relationships going forward. There is more discussion on this topic in different sections of this book around the flexibility of the Data Vault model.

Figure 1-3: Small Example: Ontology for Vehicle
The ontology in Figure 1-3 is extremely simple and small. It represents the notion of the parent term vehicle, which contains the sub-classes: Car and Truck. Car and Truck are both types of vehicles; however each has potentially different descriptors. Trucks generally contain larger frames, larger motors, larger wheels, and are capable of towing and hauling heavy loads, where cars generally have a smaller turning radius, use less gas, and can house more people.
Ontologies are powerful categorization and organization techniques. Imagine a set of music on a mobile computing device. Now imagine that there are many different categorizations for that music, ranging from year, to composer, to album, to band, to artist, lead vocalist, etc. Now stack these categorizations in different orders or hierarchies; they function as indexes into the data set. At the end of the index are the same music files; they are simply categorized differently. This is the basic makeup of ontologies.
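The stacked-categorization idea can be sketched in code. The following is a minimal, hypothetical Python sketch (the song list, file names, and category values are all invented for illustration): two different hierarchies (indexes) are built over the same leaf files, showing that only the ordering of the categories differs.

```python
# Hypothetical sketch: the same music files categorized two different ways.
# The leaf items are identical; only the index (hierarchy) above them differs.
songs = [
    {"file": "track1.mp3", "year": 1975, "genre": "rock", "artist": "Band A"},
    {"file": "track2.mp3", "year": 1992, "genre": "rock", "artist": "Band B"},
    {"file": "track3.mp3", "year": 1992, "genre": "jazz", "artist": "Band C"},
]

def build_index(items, *keys):
    """Stack categorizations in a chosen order to form a hierarchy (index)."""
    index = {}
    for item in items:
        node = index
        for key in keys[:-1]:
            node = node.setdefault(item[key], {})
        node.setdefault(item[keys[-1]], []).append(item["file"])
    return index

by_genre_year = build_index(songs, "genre", "year")  # genre -> year -> files
by_year_genre = build_index(songs, "year", "genre")  # year -> genre -> files

# Different hierarchies, same leaves:
print(by_genre_year["rock"][1992])  # ['track2.mp3']
print(by_year_genre[1992]["rock"])  # ['track2.mp3']
```

Either hierarchy reaches the same files; the categorization, not the data, is what changes.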
However, this description can go deeper; switch the existing categories out for business terms, and begin to describe each category. For instance: Genre. Different people might define what is classified as "rock and roll" differently, but they are both right. Categorization is in the eye of the beholder, and is based on the individual's belief system and knowledge set (or context) surrounding the information at the bottom of the stack; which in this case are the music files.


The deeper into the ontology (or index), the more specialized and differentiated the definition becomes. For example: underneath Rock, there might be 70's, 80's, and 90's rock, or there might be "classic rock and roll"; an individual who grew up in the 60's considers 60's and 70's music to be a part of classic rock, while an individual who grew up in the 90's or later considers any music earlier than 1985 to be classic rock. This is just one of the issues that the Data Vault Model and implementation methodology provides a solution to. This book will uncover the key to modeling ontologies in enterprise data warehouses for use with Business Intelligence systems.
In fact, learning, warehousing, applying, and using ontologies is a critical success factor for handling, managing, and applying unstructured data to a structured data warehouse. It is also a major component of operational data warehousing, along with business rule definition and dissemination of the data within an Enterprise Data Vault.

These are general descriptions of ontologies as used throughout this book. In addition to ontologies, data models typically contain short-hand notations for names of fields, known as abbreviations. These abbreviations can have similar meaning within the same context (i.e. industry vertical) but may have different meanings across different contexts. For example: the abbreviation CONT in health care may mean contagious, while in a legal system it may mean continuation. Abbreviations are best separated by vertical industry.
1.5 Data Model Naming Conventions and Abbreviations

Physical data models often contain abbreviations for classifying tables and fields, as many RDBMS engines impose length limits on object names. The desire is to carry metadata meaning within the abbreviations, which results in a data dictionary being created. The naming conventions usually start from the left-hand side of the object name and move to the right in a logical flow, with the different parts of the abbreviations separated by an underscore. The typical abbreviation is made up of multiple components, as shown in Figure 1-4:


Figure 1-4: Example Abbreviations and Naming Conventions

NOTE: THESE NAMING CONVENTIONS ARE FOR THE PHYSICAL DATA MODEL. SOME OF THE EXAMPLES ON THE ENSUING PAGES ARE PHYSICAL, AND USE THIS APPROACH, WHILE OTHER EXAMPLES ARE LOGICAL AND DISPLAY BUSINESS TERMS. PLEASE TAKE NOTE OF THE DIFFERENCES BETWEEN THE LOGICAL AND PHYSICAL DATA MODELS.

Vowels may be kept in order to increase readability; however, in general vowels are removed from the abbreviations for shortening reference names. There are some notations which do not include underscores; rather, they utilize text case to indicate the start of new terms (e.g., camel case used with SQL Server). The Data Vault physical modeling methodology recommends underscores as the best practice for abbreviations. The Data Vault logical modeling components recommend utilizing full business names, which are demonstrated in this book.

NOTE: Naming conventions are one form of ontology and metadata that can be applied actively within the confines of the Business Intelligence world.

For the ontology listed in Figure 1-3 the abbreviations might be as follows:

Vehicle       = VEH
Car           = CAR
Truck         = TRK
2 Wheel Drive = TWOWHDRV, TWDRV
4 Wheel Drive = FOURWHDRV or AWD, FORWDRV
The suggested table naming conventions for the Data Vault are as follows:

HUB, or H           for Hub tables
LNK, or L           for standard Link tables
HLNK, or HL         for hierarchical Links
LSA, SAL, SLNK, SL  for "same as" Links
TLNK, TL            for transactional Links
SAT, or S           for generic Satellites
HSAT                for Hub Satellites
LSAT                for Link Satellites
PIT, or P           for point-in-time tables
BR, or B            for Bridge tables
REF, or R           for reference tables

Within each of the Data Vault tables there are standardized fields (more on this later). The naming convention for these fields is as follows:

LDTS, LDT      for load date time stamps
LEDTS, LEDT    for load end date time stamps
SQN            for sequence numbers
REC_SRC, RSRC  for record sources
LSD, LSDT      for last seen dates
SSQN           for sub-sequencing identifiers
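As a hypothetical illustration of these conventions, the sketch below creates a single Hub table using the HUB prefix and the SQN, LDTS, and REC_SRC field abbreviations. The entity name (CUSTOMER) and business key column (CUST_NUM) are invented for the example; this is not prescriptive DDL, just the naming pattern applied.

```python
import sqlite3

# Illustrative DDL applying the suggested conventions: a HUB prefix for the
# table name, and the standard field abbreviations SQN, LDTS, and REC_SRC.
ddl = """
CREATE TABLE HUB_CUSTOMER (
    CUST_SQN  INTEGER PRIMARY KEY,   -- SQN: sequence number (surrogate key)
    CUST_NUM  TEXT NOT NULL UNIQUE,  -- the business key
    LDTS      TEXT NOT NULL,         -- load date time stamp
    REC_SRC   TEXT NOT NULL          -- record source
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
conn.execute(
    "INSERT INTO HUB_CUSTOMER (CUST_NUM, LDTS, REC_SRC) VALUES (?, ?, ?)",
    ("C-1001", "2011-01-01 00:00:00", "CRM"),
)
print(conn.execute("SELECT CUST_NUM, REC_SRC FROM HUB_CUSTOMER").fetchall())
# [('C-1001', 'CRM')]
```

Because every object name starts with its type prefix and carries the standard field abbreviations, the whole model can be monitored and managed programmatically, which is the point of the data dictionary.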

Always document the naming convention and the abbreviations chosen through a data dictionary in order to convey meaning to the business and the IT team. Naming conventions are vital to the success and measurement of the project. Naming conventions allow management, identification, and monitoring of the entire system no matter how large it grows. Once the naming convention is chosen, it must be adhered to (stick to it at all times). One way to ensure this is to conduct frequent data model reviews and require non-conforming objects to be renamed.
1.6 Introduction to Hubs, Links, and Satellites

The Data Vault model consists of three basic entity types: Hubs, Links, and Satellites (see Figure 15). The Hubs are comprised of unique lists of business keys. The Links are comprised of unique
lists of associations (commonly referred to as transactions, or intersections of 2 or more business
keys). The Satellites are comprised of descriptive data about the business key OR about the
association. The flexibility of the Data Vault model is based in the normalization (of or separation of)
data fields in to corresponding tables.


Figure 1-5: Example Data Vault


Data Vault models are representative of business processes and are tied to the business through the business keys. Business keys indicate how the businesses integrate, connect, and access information in their systems. Data Vault models are built based on the conceptual understanding of the business.

Concepts such as customer, product, order, email, sale, inventory, part, service, account, and portfolio are used to represent ideas that cross lines of business. Examples of lines of business may include: sales, finance, marketing, contracting, manufacturing, planning, production, and delivery. These concepts can be represented with business keys that should cross lines of business. The Hubs carry the unique list of these keys and are defined by semantic grain (granularity) and business utilization.

The Links represent associations across the keys (in Figure 1-5 they show an Order actually being invoiced to a customer for a specific product). The associations change over time; some have direction (akin to mathematical vectors), others are directionless. Links are physical representations of foreign keys, or in data modeling terms: an associative entity.

Hubs and Links do not contain context. Satellites provide the context defining the keys and associations for a specific point in time. The Satellites contain the descriptive data attributes about a Hub or Link that can change over time. Satellites are the data warehousing portion of the Data Vault model.
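To make the three entity types concrete, here is a minimal, illustrative sketch of two Hubs, a Link between them, and a Satellite on one Hub. The table and column names are assumptions for the example, not a prescribed physical design from this book.

```python
import sqlite3

# A minimal sketch of the three Data Vault entity types (names illustrative):
# a Hub holds unique business keys, a Link holds the association between two
# Hubs, and a Satellite holds descriptive context that changes over time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_CUSTOMER (
    CUST_SQN INTEGER PRIMARY KEY,
    CUST_NUM TEXT NOT NULL UNIQUE,           -- business key
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL);

CREATE TABLE HUB_PRODUCT (
    PROD_SQN INTEGER PRIMARY KEY,
    PROD_NUM TEXT NOT NULL UNIQUE,           -- business key
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL);

CREATE TABLE LNK_CUSTOMER_PRODUCT (          -- association of the two keys
    LINK_SQN INTEGER PRIMARY KEY,
    CUST_SQN INTEGER NOT NULL REFERENCES HUB_CUSTOMER,
    PROD_SQN INTEGER NOT NULL REFERENCES HUB_PRODUCT,
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL,
    UNIQUE (CUST_SQN, PROD_SQN));

CREATE TABLE SAT_CUSTOMER (                  -- descriptive context over time
    CUST_SQN INTEGER NOT NULL REFERENCES HUB_CUSTOMER,
    LDTS TEXT NOT NULL,
    CUST_NAME TEXT,
    REC_SRC TEXT NOT NULL,
    PRIMARY KEY (CUST_SQN, LDTS));
""")
print([r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
# ['HUB_CUSTOMER', 'HUB_PRODUCT', 'LNK_CUSTOMER_PRODUCT', 'SAT_CUSTOMER']
```

Notice that all descriptive attributes live in the Satellite (keyed by business key plus load date), while the Hubs and the Link carry only keys, load metadata, and structure.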
You see, Hubs and Links are like the skeleton and ligaments of the human body; without them we have no structure. Without them, our Data Warehouses are blobs of data loosely coupled with each other. But WITH them we have definition, structure, height, depth, and specific features. We as humans couldn't survive without a skeleton. The Data Warehouse cannot survive without Hubs and Links. They form the foundations of how we hook the data together.

For instance, the Hubs are like the different bones in the body: the arm, the leg, the toes, the head, etc. The Links are similar to the ligaments that hold the bones together, give them flexibility, and attach them in a specific order. Finally, the Satellites are added. Satellites are like the skin, muscle, and organs. They add color, hair, eyes, and all the other components we need to be described.

By separating the concepts of descriptive data from structural data, and structural data from linkage data, we can easily begin to assemble a picture or an image of what our companies look like. The Hubs provide the working constructs to which everything is anchored. The Links provide the assembly structure to how these anchors interact, and the Satellites define the context (like hair color, eye color, skin type, etc.) of all of these components.
member this: the Data Vau
ult is targeted
d to be an En
nterprise Datta Warehousee. Its job is tto
integgrate dispara
ate data from
m many differrent sources, and to Link it all togetheer while main
ntaining
sourrce system co
ontext.
It sitts between th
he source sysstems and th
he data martss. Why? Beccause the da
ata marts are
e the
interrpretation layyer of the inte
egrated data
a warehouse data. In hum
man terms th
hink about it tthis way:
thinkk about a cerrtain event th
hat occurred in your life th
hat you shareed with anoth
her person. Do you
both
h remember it
i the same way?
w
Do you both remem
mber the exacct details? Or is your interpretation
of th
he event sligh
htly different than that of your friend?
Exacctly my point,, interpretatio
on depends on
o context, and
a the conteext you use too remember the event
is diffferent than the context your
y
friend usses. Even if the
t facts or tthe event itseelf is exactly the same
(you were both th
here, you botth saw the ecclipse of the sun,
s
but you both experieenced it diffe
erently).
This is why its so
o important to
t separate in
nterpretation
n from the faccts.
Da
an Linstedt 2010-2011,
2
all
a rights rese
erved
http:///LearnData
aVault.com


Let your Data Vault house the facts, and build your data marts to house the interpretation.
1.7 Flexibility of the Data Vault Model

The Data Vault model is built for extreme flexibility and extreme scalability. The Link table separates
the relationships from the business key structures (the Hubs). The Link table provides for the
representation of the relationship to change over time. The Satellites provide the descriptive
characteristics about the Hubs or Links as they change over time.
For instance, suppose you own a car and you are the registered driver. You currently have two
relationships to the car: one as a driver, and one as an owner. Now suppose you hired a driver.
Well, you still own the car right? Now, you have one relationship with the car as the owner, but the
person you hired now has a relationship with the car as the driver. However, the description of the
car has not changed.
What if you sold the car to someone else? Then your relationship with the car as an owner would
END, and the buyers relationship with the car would begin. This information about the relationship
between business keys is what we keep in the Link structures. Again, the basic description of the car
remains unchanged so the Satellite data is untouched.
The Link table may also be applied to information association discovery. Business changes frequently, redefining relationships and the cardinality of relationships. The Data Vault model
approach responds favorably because the designer can quickly change the Link tables with little to
no impact to the surrounding data model and load routines.
MAJOR FUNDAMENTAL TENET: THE DATA VAULT MODEL IS FLEXIBLE IN ITS CORE DESIGN. IF THE DESIGN OR THE ARCHITECTURE IS COMPROMISED (THE STANDARDS/RULES ARE BROKEN) THEN THE MODEL BECOMES INFLEXIBLE AND BRITTLE. BY BREAKING THE STANDARDS/RULES AND CHANGING THE ARCHITECTURE, RE-ENGINEERING BECOMES NECESSARY IN ORDER TO HANDLE BUSINESS CHANGES. ONCE THIS HAPPENS, TOTAL COST OF OWNERSHIP OVER THE LIFECYCLE OF THE DATA WAREHOUSE RISES, COMPLEXITY RISES, AND THE ENTIRE VALUE PROPOSITION OF APPLYING THE DATA VAULT CONCEPTS BREAKS DOWN.

For example, suppose a data warehouse is constructed to house parts; then, after 3 months in operation, the business would like to track suppliers. The Data Vault can quickly be adapted by adding a Supplier Hub and Supplier Satellites, followed by a Link table between parts and suppliers; the impact is minimal (if any) to existing loading routines and the existing history held within.
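That parts-and-suppliers scenario can be sketched as follows. This is a simplified, hypothetical illustration (table names invented, Satellites omitted for brevity); the point is that the extension only adds new tables alongside the existing ones.

```python
import sqlite3

# Sketch of the parts/suppliers scenario: the existing warehouse (HUB_PART)
# is untouched; tracking suppliers only ADDS new tables next to it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- existing structure, already loaded with history
CREATE TABLE HUB_PART (PART_SQN INTEGER PRIMARY KEY, PART_NUM TEXT UNIQUE,
                       LDTS TEXT, REC_SRC TEXT);
INSERT INTO HUB_PART (PART_NUM, LDTS, REC_SRC)
    VALUES ('P-1', '2010-01-01', 'MRP');
""")

# Three months later: extend the model without altering HUB_PART at all.
conn.executescript("""
CREATE TABLE HUB_SUPPLIER (SUPP_SQN INTEGER PRIMARY KEY, SUPP_NUM TEXT UNIQUE,
                           LDTS TEXT, REC_SRC TEXT);
CREATE TABLE LNK_PART_SUPPLIER (LINK_SQN INTEGER PRIMARY KEY,
    PART_SQN INTEGER REFERENCES HUB_PART,
    SUPP_SQN INTEGER REFERENCES HUB_SUPPLIER,
    LDTS TEXT, REC_SRC TEXT, UNIQUE (PART_SQN, SUPP_SQN));
""")

# Existing structure and its history are unchanged:
print(conn.execute("SELECT PART_NUM FROM HUB_PART").fetchall())  # [('P-1',)]
```

No ALTER statements touch the original tables, so the loading routines that feed HUB_PART keep running as-is.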


Figure 1-6: Flexibility of Adapting to Change
In Figure 1-6 we see the existing data warehouse in purple and the new sections in yellow and orange (to the right of the red dotted line). Placing the associations in a Link table enables the data warehouse design to be flexible in this manner, where new components do not affect existing components.

One difference between the Data Vault model and a 3rd normal form model is the use of Links to represent associations across concepts. 3rd normal form represents most relationships by tying parent keys to child tables directly (without an extrapolated association table).

Relationships such as 1 to Many, Many to 1, and 1 to 1 are represented in 3rd normal form directly by embedding the parent fields in the child tables. This leads to inflexibility of the model. When the business rules change, and the cardinality of the data must change to meet business needs, the model is altered, as is the operational application using the model.

An example of a 3rd normal form model is shown below in Figure 1-7.


Figure 1-7: 3rd Normal Form Product and Supplier Example
In Figure 1-7, a Product can have 1 and only 1 Supplier, but a Supplier can supply many Products. With a model like this, the business rule may be: a Product can only be supplied by a single Supplier, which means that the operational system that collects the information is coded accordingly. When or if the business changes its rule to say: a Product can be supplied by more than one Supplier, then the application must change, as must the underlying data model structure. While this appears to be a small change, it may affect all kinds of underlying information in the operational system; especially if the Product is a PARENT to other tables.

For data warehouses (except Data Vaults) this structure leads to even more complexity. In a data warehouse that contains foreign keys embedded in child tables, this leads to cascading change impacts. In other words, any changes made to parent keys will cascade all the way down in to every single child table. The end result?

- You have to rebuild ALL your ETL loading routines
- You have to rebuild ALL your queries against the dependent structures
- You have to re-model ALL your parent and child tables

The end result is massive re-engineering efforts, and that's not all! The problem gets exponentially harder to handle with larger and larger data warehouse models.

This is the #1 reason why Data Warehouse/BI projects are torn down, stopped, halted, burned, and ripped apart or labeled failures! The growing and already high cost of re-engineering is caused by poor architectural design and dependencies built in to your data warehouse model! Don't let this happen to you! Use a Data Vault and avoid this mess up-front.

This is evident when loading history to a data warehouse, where the cardinality that exists in today's model did not exist in the past. Data warehousing teams run a high risk of re-engineering the architecture and loading process if the enterprise data warehouse model enforces current relationships. The Data Vault model mitigates this risk by providing Link tables for all relationships, regardless of cardinality.
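As a small, hypothetical sketch of that mitigation (all names invented): because the part-to-supplier association lives in its own Link table rather than as a supplier key embedded in the part table, a change in cardinality is absorbed as data, not as a structural change.

```python
import sqlite3

# Sketch: the association is a Link table, so moving from "one supplier per
# part" to "many suppliers per part" needs no ALTER TABLE, no re-modeling.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LNK_PART_SUPPLIER (
    PART_SQN INTEGER NOT NULL,
    SUPP_SQN INTEGER NOT NULL,
    LDTS     TEXT NOT NULL,
    UNIQUE (PART_SQN, SUPP_SQN));
""")
# Yesterday's rule: part 1 has a single supplier.
conn.execute("INSERT INTO LNK_PART_SUPPLIER VALUES (1, 10, '2010-06-01')")
# Today's rule: part 1 gains a second supplier -- just another Link row.
conn.execute("INSERT INTO LNK_PART_SUPPLIER VALUES (1, 11, '2011-01-01')")

print(conn.execute(
    "SELECT COUNT(*) FROM LNK_PART_SUPPLIER WHERE PART_SQN = 1").fetchone()[0])
# 2
```

Had the supplier key been embedded in the part table instead, the same rule change would have forced a table redesign and a rebuild of every dependent load and query.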
1.8 Data Vault Basis of Commutative Properties and Set Based Math

The Data Vault is based on raw data sets arriving at the warehouse (with little to no alteration of the data within). This is commonly referred to as the "raw data warehouse". There is a notion for constructing a business based Data Vault that will be discussed later in this book.

One of the founding principles of the Data Vault is: enable re-creation of a source system's data for a specific point in time. The Data Vault achieves this by loading raw data, passively integrating it by business key, and time-stamping the load cycles with the arrival dates of the data set.

There is a law in mathematics called the commutative property. The commutative property is defined below:

In mathematics, commutativity is the ability to change the order of something without changing the end result. It is a fundamental property in most branches of mathematics and many proofs depend on it. The commutativity of simple operations was for many years implicitly assumed and the property was not given a name or attributed until the 19th century when mathematicians began to formalize the theory of mathematics. http://en.wikipedia.org/wiki/Commutative

The basic notion is that A = B = C at a specific point in time, where A = a source system/source application, B = staging area, and C = enterprise Data Vault; such that A can be reconstituted for any point in time contained within C. This preserves the auditability of the data set housed within the Data Vault while offering base level integration across lines of business (see the previous discussion on Hub based business keys).



In some cases B can represent the Data Vault while C represents "as-is" raw level star schemas. Raw level star schemas are utilized to show the business what the source systems are collecting, and where the gaps may be between the business rules, business operations, and source system applications. Information quality (IQ) can be improved through resolution of the identified gaps.

To find out more about gap analysis, please read the book: The Next Business Supermodel, The Business of Data Vault Modeling.

Another founding principle behind the Data Vault architecture is the use of set logic or set based math. The Hubs and Links are loaded based on union sets of information, while the Satellites are loaded based on delta changes inclusive of the union functionality. Set logic is applied to the loading processes for restartability, scalability, and partitioning of the components.

Standard set theory is defined as follows:

Set theory, formalized using first-order logic, is the most common foundational system for mathematics. The language of set theory is used in the definitions of nearly all mathematical objects, such as functions, and concepts of set theory are integrated throughout the mathematics curriculum. Elementary facts about sets and set membership can be introduced in primary school, along with Venn diagrams, to study collections of commonplace physical objects. Elementary operations such as set union and intersection can be studied in this context. More advanced concepts such as cardinality are a standard part of the undergraduate mathematics curriculum. http://en.wikipedia.org/wiki/Set_theory



In the Data Vault approach, set theory is applied to incoming data sets. The set theory applied in loading routines is depicted in Figure 1-8:

Figure 1-8: Applied Set Theory for the Data Vault

The set theory is applied again for Hub and Link loading, where only new data (not previously inserted) is applied or loaded. Set-based logic is applied when single distinct lists of keys are loaded to the target table, inserting only those keys that have not yet been loaded.
1.9 Data Vault and Parallel Processing Mathematics

The main purpose of introducing set-theory concepts and the mathematics behind the Data Vault is to provide you with a glimpse of the actual engineering effort behind the Data Vault architecture itself. The architecture is not merely just another design of tables strung together; no, it is engineered with specific tolerance levels so that it can scale, and be flexible as necessary. These concepts are foundational to understanding why the architecture is designed the way it is, and what the specific purposes of the design elements are. I hope you find this section enlightening, as it explains some of the background reasons as to why you should stick with the original structures (unmodified) as you implement your Data Vault.

The Data Vault Modeling components are based on parallel processing mathematics (versus serial processing). Massively Parallel Systems typically use a shared-nothing design. The Data Vault Modeling components make use of this design technique to split data in a vertical format: aka vertical partitioning. The vertical partitioning of data is applied to Hub, Link and Satellite structures and is a base part of the architecture. The objective of vertical partitioning within the Data Vault Model is to split the work, so that the database engines can optimize the following:

Index Coverage
Data Redundancy (minimize this)
Parallel Query
Resource Utilization (split over hardware platforms)

If you are not familiar with the mathematical principles of Massively Parallel Processing (MPP) you can read about generic parallel processing rules and performance speed up (tasking) on Wikipedia. Figure 1-9 is a graph drawn from Wikipedia that introduces the concepts of Parallel Processing:

Figure 1-9: Parallel Computing Simplified
Image: http://en.wikipedia.org/wiki/Parallel_computing

The principles at work express themselves in the form of design and processing. The mathematics behind the Data Vault Model can be found by reading about parallel processing. Specifically: Parallel Data Processing, Parallel Task Processing, and MPP systems design and architecture.

The topology of the computing cluster (database engine) can be any of the desired pieces including: star, ring, tree, hyper-cube, fat hyper-cube, or n-dimensional mesh. The Data Vault splits out the Business Keys, the relationships (associations), and the descriptive data (repetitive).

52

Business keys by nature are generally non-repetitive, therefore increasing index coverage dramatically. Business keys by nature are also specific to a set or tuple of data as an identifying marker. The Hubs therefore act as independent sources of information, making it easy (for instance) to split different Hubs across different computing platforms; in other words, applying vertical partitioning at the hardware level.
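The vertical split of business keys, associations, and descriptive data can be shown in miniature: one source row is divided into Hub, Link, and Satellite parts that can then be loaded and scaled independently. The record layout and field names below are hypothetical:

```python
# A hypothetical source row: one order line from an operational system.
source_row = {
    "customer_num": "C100",     # business key  -> Hub
    "order_num": "O-500",       # business key  -> Hub
    "order_total": 25.0,        # descriptive   -> Satellite
    "order_status": "SHIPPED",  # descriptive   -> Satellite
}

# Vertical partitioning: keys, associations, and context are separated so
# each structure can be indexed, partitioned, and placed independently.
hub_customer = {"customer_num": source_row["customer_num"]}
hub_order = {"order_num": source_row["order_num"]}
link_customer_order = {"customer_num": source_row["customer_num"],
                       "order_num": source_row["order_num"]}
sat_order = {"order_num": source_row["order_num"],
             "order_total": source_row["order_total"],
             "order_status": source_row["order_status"]}
```

Each of the four targets could now live on its own platform; only the business key (or its surrogate) is repeated across them.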
This is the nature of MPP and is known as "scale-out." Scale-out technology allows the model to grow as large as the data set or the business demands, while keeping near linear performance gains (in relation to the scale of the hardware). The Links or associations also follow across multiple platforms, and are sometimes replicated in shared-nothing environments for ease of joins. One term for this is: "join-indexes."
This is just one way to view the Data Vault Model; it is essentially based on the principles of a scale-free tree, all the way down to the individual table structures built within the model. Multiple scale-free trees are nothing more than more Hubs, Links, and Satellites within the Data Vault, thus producing a cube-like structure if desired. An example of a logical design or conceptual view of the Data Vault in a Hyper Cube might look something like this:

Figure 1-10: Logical Data Vault Hyper Cube
Image: http://clanbase.ggl.com/img.php?url=fc07.deviantart.com/fs14/i/2007/076/a/b/Hypercube__by_Meninx.jpg


In the physical model, Hubs are connected to Link structures; Links become a physical notion for an association. In the physical Data Vault Model nodes are connected through Links to each other; they are not directly related. This is a conceptual basis for establishing the premise of the vision. Hubs provide the keys, while Satellites around Hubs describe the key for any given point in time. Hyper Cubes can be created, as can trees. A simpler vision or view of the Data Vault Model split for parallelism is in Figure 1-11:

Figure 1-11: Physical Data Vault Layout (Starting point)

This is where it starts, quite simple enough: no real partitioning of the data, because the size of the data set is not yet large enough. All of the tables go through one, two, or three I/O connections to a SAN or a NAS drive. When the data set grows, physical partitioning (or split-off of tables) can occur. The end result (to an extreme) might be as shown in Figure 1-12:


Figure 1-12: Physical Data Vault Layout (Partitioned)

In this case, each table has I/O channels bound to it, along with dedicated disk (DASD) sitting on Raid 0+1 format. This is the ultimate in separation for relational database engines. This allows the relational database engines to operate their parallel query engines without disk wait state dependencies across each table. Truly independent hardware levels can achieve very high performance. The next step might be separating the processing power out into different nodes, reaching the MPP level of architecture in hardware.

The Data Vault Model follows a scale-free topology. Scale-free topology is defined as follows:
A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as P(k) ~ k^-γ, where γ is a constant whose value is typically in the range 2 < γ < 3, although occasionally it may lie outside these bounds. Scale-free networks are noteworthy because many empirically observed networks appear to be scale-free, including the protein networks, citation networks, and some social networks.[1]


Source: http://en.wikipedia.org/wiki/Scale-free_network

The mathematics behind scale-free networks applies to the Data Vault Model. Any physical model built from the Data Vault principles will carry the same mathematical properties. Using a spring-graph, or a weighted graph in either 2 dimensions or 3 dimensions, it becomes apparent which of the nodes are the most important in the Data Model. The most interconnected nodes are centralized within the graph, and have the most neighbors.
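The "most interconnected node" observation can be made concrete: treat each Link as an edge between two Hubs and count neighbors (degree). The Hub names and Link pairs in this sketch are hypothetical:

```python
from collections import Counter

# Hypothetical Link table: (hub_a, hub_b) associations in the model.
links = [
    ("CUSTOMER", "ORDER"),
    ("CUSTOMER", "ADDRESS"),
    ("ORDER", "PRODUCT"),
    ("CUSTOMER", "INVOICE"),
]

# Degree = number of neighbors; the highest-degree Hub is the most
# interconnected node, and would sit central in a spring-graph layout.
degree = Counter()
for a, b in links:
    degree[a] += 1
    degree[b] += 1

most_connected = degree.most_common(1)[0]
```

Here CUSTOMER carries the highest degree (3), marking it as the central node of this small model.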
1.10 Introduction to Complexity and the Data Vault

This topic is really deserving of an entire chapter, perhaps even an entire book. However, in the interest of time, and due to the fact that this concept must be brought to light, I will make a small introduction to this concept. The Data Vault model and methodology make a tremendous contribution to lowering the overall complexity of the systems involved in data warehousing.

Current data warehousing systems try to do too much in their loading cycle. They try to address ALL of the following problems in a single load pattern:


Sourcing Problems:
o Synchronization / Source Data Availability time windows
o Cross-System Joins
o Cross-System Filters
o Cross-System aggregates
o Indexing issues, leading to performance problems
o Disjoint or missing source data sets
o Missing source keys
o Bad source data, out of range source data
o Source system password issues
o Source system Availability for loading windows
o Source system CPU, RAM, and Disk Load
o Source System structure complexity
o Source system I/O performance
o Source System transactional record locks
Transformation problems (often IN STREAM)
o Cleansing
o Quality and Alignment
o Joins
o Consolidation
o Aggregation
o Filtering
o Sequence Assignment - often leading to lack of parallelism
o Data type correction
o Error handling (when the database kicks it back)
o Error handling (data is: out of bounds, out of range)
o Size of Memory
o Lookup issues (more sourcing problems, caching problems, Memory problems)
o Sorting issues (large caches, disk overflows, huge keys)
o BUSINESS RULES, especially across SPLIT data streams
o Multiple targets
o Multiple target errors
o Multiple sources
o Single transformation bottleneck (performance, relationship joins, and so on)


Target Problems
o Lack of database tuning
o Index updates (deadlocking)
o Update, Insert, and Delete mixed statements forcing data ORDER to be specific, cutting off
possibilities for executing in parallel
o Block size issues
o Multi-target issues (too many connections, error handling in one stream holding up all other
targets in the same data stream)
o WIDE targets (due to business rules being IN-STREAM)
o Indexes ON TARGETS (because targets ARE the data marts)
o Lack of control over target partitioning

Along with many more issues. This is the traditional view of issues that data integration specialists are left to solve. You are expected to construct load after load that answers ALL of these problems in a SINGLE data stream, right? Well, this is no way to do business. It increases complexity to an unimaginable level, and contributes to the ultimate downfall of the data warehousing project!
Quality Software Management Vol. 1, Gerald M. Weinberg, pp. 135-139:
When you develop your ETL for a star schema EDW, you essentially get a sequential set of (big T) transformations. As that sequence grows in size and complexity, the difficulty of testing it, and tracing errors back to the source, grows exponentially; hence as your (S-S) EDW grows, you get haunted by ever growing development cycles, and increasingly less control over the testing process, until your EDW has developed into yet another legacy system. And then you know what its fate will be...

The Data Vault Motto: DIVIDE AND CONQUER!


Believe me, you can win every time with this strategy. Let's analyze this for a minute:
Sourcing Problems: Nearly every problem can be addressed through a few simple rules:


Separate each source load, and land the data in the target: make each load a very simple copy operation where the data is pulled from the source, and landed directly in the target; in this case, a STAGING AREA (as defined in the next chapter). Yes, you may source a CDC (Change Data Capture) operation if you so desire.
Run the staging loads when the data is ready! Don't wait for other systems, don't perform any other system joins, and don't force the data to conform or align with specific rules or datatypes.

These two simple rules ensure that you can get in, get the data from the source, and get out when the source data is ready to go. No waiting! No joining! No timing complexity! No performance problems! You can always take a copy operation and partition the target, and partition the load for MAXIMUM throughput!
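Under these two rules a staging load is nothing more than a straight truncate-and-copy. A minimal sketch in Python, with SQLite standing in for both the source system and the staging area (table names are hypothetical); note that no joins, lookups, or datatype conformance are applied in flight, so even bad source data lands as-is:

```python
import sqlite3

# Two hypothetical databases: a source system and the staging area.
src = sqlite3.connect(":memory:")
stg = sqlite3.connect(":memory:")

src.execute("CREATE TABLE orders (order_num TEXT, amount TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [("O-1", "10.00"), ("O-2", "bad-data")])  # bad data lands as-is

stg.execute("CREATE TABLE stg_orders (order_num TEXT, amount TEXT, "
            "load_dts TEXT, record_source TEXT)")

# Truncate-and-load: the staging area is transient, purged each cycle.
stg.execute("DELETE FROM stg_orders")
rows = src.execute("SELECT order_num, amount FROM orders").fetchall()
stg.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, '2010-12-02 01:00:00', 'ORDERS_SYS')",
    rows)

staged = stg.execute("SELECT order_num, amount FROM stg_orders "
                     "ORDER BY order_num").fetchall()
```

Because the copy is this simple, it can run the moment the source is ready, and both the read and the write can be partitioned for throughput.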
Transformation problems: Divide and conquer. The following rules make it much easier to deal with this part of
the loading cycle:

Move the business rules downstream. This includes all the joins, filters, aggregations, quality,
cleansing, and alignments that need to happen between the Data Vault and the Data Marts. This
also allows you to effectively target the PROPER data mart with the PROPER rule set (as deemed
appropriate by the business).
Load raw data into the Data Vault area; this provides SIMPLE, maintainable, and easy-to-use loading
code that meets the needs of the business. It also prevents you from having to re-engineer loading
routines to add new systems, or add new data. Sure you end up with a lot more routines, BUT each
one is a thousand times less complex, and easier to manage.

The end result for this? You can PARALLELIZE the loading routines to your data warehouse AND you can load
data to your Data Vault in REAL-TIME at the SAME TIME as your batch loads are running. Just try that with your
standard star-schema!
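The Satellite side of this pattern loads only delta changes: a new row is inserted only when the descriptive attributes differ from the most recent row for that key. A minimal sketch (table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sat_order (order_num TEXT, load_dts TEXT, "
             "order_status TEXT)")
conn.execute("INSERT INTO sat_order VALUES ('O-1', '2010-12-01', 'OPEN')")

def load_satellite_delta(conn, order_num, load_dts, order_status):
    """Insert a new Satellite row only if the descriptive attribute changed."""
    cur = conn.execute(
        "SELECT order_status FROM sat_order WHERE order_num = ? "
        "ORDER BY load_dts DESC LIMIT 1", (order_num,))
    row = cur.fetchone()
    if row is None or row[0] != order_status:
        conn.execute("INSERT INTO sat_order VALUES (?, ?, ?)",
                     (order_num, load_dts, order_status))

load_satellite_delta(conn, "O-1", "2010-12-02", "OPEN")     # no change: skipped
load_satellite_delta(conn, "O-1", "2010-12-03", "SHIPPED")  # delta: inserted

n_rows = conn.execute("SELECT COUNT(*) FROM sat_order").fetchone()[0]
```

The table ends up with two rows (the original OPEN row and the SHIPPED delta); unchanged feeds add nothing, which keeps the history lean and the load insert-only.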
Targeting problems: They all but disappear. Why? Because once you divide and conquer, your loading routines will be built for inserts only (high speed inserts at that), and they generally will contain only one or two target tables for loading purposes! No more locking problems, no more worries about wide rows (except when you get to loading data marts, but that's another story). High degrees of parallelism, high degrees of partitioning, high performance, and really low complexity scores: what more could you ask for?
1.11 Loading Processes: Batch Versus Real Time

This book introduces the concepts with a small bit of background; it is meant to be only an introduction to the loading patterns and processes used within the Data Vault. The purpose of this entry is to define the basic terms of batch loading and real-time loading.


Batch Loading: usually occurring on a scheduled basis, loading any number of rows in a batch. The execution timing will vary from every 5 minutes to every 24 hours, to weekly, to monthly, and so on. Any load cycle running every 20 seconds or less tends to fall close to the real-time loading category. All other scheduled cycles tend to be labeled mini-batches.
Real-Time Loading: there is a grey area of definition between what a batch load is and what a real-time load is. For the purposes of this book, real-time loading is any loading cycle that runs continuously (never ends), and loads data from a web-service or queuing service (usually) whenever the transactions appear.

Neither loading paradigm has any effect on the data modeling constructs within the Data Vault. The Hub, Link, and Satellite definitions remain the same and are capable of handling extremely large batches of data, and/or extremely fast (millisecond feed) loads of data.


2.0 Architectural Definitions

The Data Vault approach (project/methodology) has common architectural components defined. The components are referred to throughout this book. The purpose of this book is to define the Data Vault data model structures. Context for those structures is a necessary foundational component of understanding the Data Vault. The common architectural components utilized in the Data Vault approach are defined in Figure 2-1:

Figure 2-1: Enterprise BI Architectural Components

The Data Vault methodology includes each of these components. The architectural components discussed in this book (in detail) include the Staging area and the Data Vault. This section briefly introduces the other sections as part of the architecture for you to consider.
2.1 Staging Area

The staging area consists of tables in the database to house incoming data 1:1 with the source system (with some additional system driven elements). The staging area is refreshed (purged) prior to each batch load cycle; in other words, it should not ever house history of loads. This is often called a transient staging area. Staging tables house no referential integrity and no foreign keys. They house a sequence number which is reset and cycled for each table with each batch cycle. They house a load date stamp and a record source for each table. These components are described in Chapter 3.0 Common Data Elements.

These tables do not carry any foreign keys, or original primary key definitions. Exceptions: when loading a de-normalized COBOL-based file and executing normalization (splitting it into multiple tables), the staging tables will carry parent ID references; the same applies when loading a de-normalized XML-based file and executing normalization.

The staging area may be partitioned in any manner desired. The format is owned and maintained by the data warehousing team. The staging area tables may also contain any indexes needed (post-load) in order to provide the data warehouse/Data Vault loads with the proper performance downstream. Staging area data should be backed up at regular intervals (if the data arrives in real-time), otherwise it will be backed up at scheduled intervals.
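A staging table following these rules might be declared as below. This is a minimal sketch with a hypothetical source layout; it shows the system-driven elements (sequence number, load date stamp, record source) carried alongside the 1:1 source columns, with no foreign keys and no source primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table: source columns copied 1:1, plus the
# system-driven elements. No foreign keys, no referential integrity.
conn.execute("""
    CREATE TABLE stg_customer (
        sequence_num   INTEGER,  -- reset and cycled each batch load
        load_dts       TEXT,     -- stamped by the EDW load process
        record_source  TEXT,     -- which source system supplied the row
        customer_num   TEXT,     -- source columns, 1:1, types as-received
        customer_name  TEXT
    )
""")

conn.execute("INSERT INTO stg_customer VALUES "
             "(1, '2010-12-02 01:00:00', 'CRM', 'C100', 'ACME Corp')")
stg_cols = [c[1] for c in conn.execute("PRAGMA table_info(stg_customer)")]
```

The source columns are deliberately left loose (TEXT) so out-of-range or malformed values can still land; conformance happens downstream.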
The future need for a staging area is in question. In fact, within the operational Data Vault and
100% real-time feeds there appears to be no real need for a staging area. There are already a
few Operational Data Vaults built using the principles of by-passing the staging area, and loading
data directly (from the real-time feeds/web-services) to the Data Vault. The only reasons for staging
areas to continue to exist (as of 2010) include the following:

Data Synchronization with other static lookup data
Hot-Data Backup: continuous backup in case the queuing engine dies (the transactional feed engine)
Batch data Delivery: reformatting and consolidation
File Format adjustments / alignment

2.2 EDW Data Vault

The EDW (enterprise data warehouse), or core historical data repository, consists of the Data Vault
modeled tables. The EDW holds data over time at a granular level (raw data sets). The Data Vault is
comprised of Hubs, Links, and Satellites (defined in section 1.6 and further defined throughout this
book). The Enterprise Data Warehousing Layer is comprised of a Data Vault Model where all raw
granular history is stored. Unlike many existing data warehouses today, referential integrity is
complete across the model and is enforced at all times. The Data Vault model is a highly normalized
architecture. Some Satellites in the Data Vault may be denormalized to a degree under specific
circumstances.


The Data Vault model follows all definitions of the Data Warehouse (as defined by Bill Inmon) except one: the Data Vault is functionally based, not subject oriented, meaning that the business keys are horizontal in nature and provide visibility across lines of business.

The Data Vault modeling architecture has been likened to 3rd normal form. The business keys in the Hub appear to be 6th normal form, while the load date and record source are 3rd normal form. The Data Vault model should represent the lowest possible grain. The Hubs and Links in the Data Vault model provide the back-bone structure to which context (the Satellites) is applied.
2.3 Metrics Vault

A component for capturing technical metrics about the load process: loading time-lines, completion rates, amount of data moved, and growth of tables, files, and indexes. This Data Vault captures the technical metadata for the processes and the database. By capturing growth rate actuals along with run-times, insert numbers, update numbers, and row counts, projections of future storage requirements can be created and managed. This allows the business to monitor their needs, and budget 6 months to 1 year in advance for future hardware.

The Metrics Vault can also be crafted to include information about CPU utilization, RAM access, I/O throughput and I/O wait times. The additional information in the Metrics Vault begins to provide a consistent and concise view of the utilization of the system in conjunction with the growth of the data sets and the hot spots on disk. From all of these metrics, a nearly complete technical management dashboard can be presented to monitor the EDW effort.
2.4 Meta Vault

The Meta Vault contains business metadata (ontologies/taxonomies/definitions) and physical data model attribute names, functions (for translation), and technically implemented business rules that ETL/ELT follows to interpret the data. The Meta Vault allows business to produce, maintain, and deliver metadata across the board from within their EDW/BI solution set. The Meta Vault is in fact one form of Operational Data Warehouse.

The Meta Vault contains metadata for the staging area, EDW Data Vault, Report Collections, Data Marts, and Metrics Vault areas. The metadata is defined through IT, business, and process technologies.

2.5 Report Collections

Report collections are defined as flat-wide denormalized structures, used for high-speed reporting or
flat file output access; they may also be used by data mining tools. They are a form of data mart
where end-user access is direct. Report collections provide the business users with pre-computed
totals at the end of each row. These pre-computed totals allow high speed filtering against patterns
of rows that are out of the normal zone (in other words, breaking business requirements).
2.6 Data Marts

Data marts are defined as: any point at which generic users directly access the structures and the
data for ad-hoc reporting, or drill-down analysis. This may or may not be a Star Schema. It may also
include normalized and denormalized tables. Data Marts may be virtualized; for example: in-RAM
cubes, and dynamically altered information sets. A form of a data mart is an Excel spreadsheet that
communicates directly with the Data Vault through an interactive metadata layer (possibly
something like Microsoft SharePoint direct to the Data Vault back-end). Direct communication
between the user, the metadata management, and the Data Vault is the beginnings of an
Operational Data Warehouse.
For purposes of auditability and accountability the data is separated into two physical layers:
corporate marts, and error marts. Corporate marts serve as the standard data marts, where data
that meets soft business rules is contained. Error marts serve as the landing zone for bad data,
that is: data that does not meet soft business rules. The definition of hard and soft business rules
is covered in the book: The Next Business Supermodel, the Business of Data Vault Modeling.
2.7 Business Data Vault

There is a new component in the architecture (not shown in Figure 2-1). The component is called
the Business Data Vault. Business users and IT alike are seeing the benefits of the flexibility,
scalability, and adaptability of the Data Vault model. They want the benefits, but with the business
data embedded. Downstream of the raw Data Vault, (between the Data Vault and the Data Marts in
the Figure 2-1) they are building a new store called the Business Data Vault.
The Business Data Vault (BDV) is a concept: a grouping of specific tables fashioned using Data Vault modeling concepts, but not necessarily following all the Raw Data Vault modeling rules. A
Business Data Vault (also known as EDW+) can be a group of tables inside the raw Data Vault
(where the record source has changed), or can be a completely separate data store. Either way, the
data that exists in the BDV has been altered, cleansed and changed to meet the rules of the
business and is downstream of the raw Data Vault. You may be able to dual-purpose the BDV and
apply master data rules as well, thus making the BDV a starting point for a Master Data System.

The Business Data Vault contains all business data, all altered data, aggregated, and cleansed
information. IT staff are executing the business transformations once, assigning more metadata
(including master data definitions), and then releasing (through simple copy) the data needed in the
marts. The Business Data Vault is considered an extra copy of the information; however it is paired
with the business metadata and all of the transformations needed to make virtual cubes and high
speed delivery possible. The argument received from the business is that the data (post-transformation) is used on the financial reports, and as such, must also be accountable and
auditable. Therefore a second copy of the data (post-transformation) is necessary as another
system of record.
The technical argument provided is that the IT staff only wishes to do the transformation once, or
that they have a standing order to provide virtual marts; which in this case translates to RAM
based cubes, and views that look like dimensions and facts.
2.8 Operational Data Vault

The nature of the Raw Data Vault (EDW as depicted in Figure 2-1) is changing to include operational
data. The need to combine/consolidate operational data with the raw Data Vault is being driven by
Master Data Initiatives, and business needs. The business wants more historical data mixed with
current transactions at their finger-tips.
In order to meet this demand the Data Warehousing teams are loading operational data (real-time
loading) directly in to the Raw Data Vault, thus creating an Operational Data Vault. The entire
discussion of Operational Data Vaults is outside the scope of this text, and will be defined elsewhere
in articles and discussion forums.
WARNING: AN ODV INHERITS ALL THE ISSUES, PROBLEMS, AND RELIABILITY CONCERNS OF AN OPERATIONAL SYSTEM. ITEMS SUCH AS GOVERNANCE, UP-TIME (6 x 9's), 24x7x365 SUPPORT, ALL COME TO BEAR WITH AN OPERATIONAL DATA VAULT. THE DECISION TO BUILD ONE SHOULD NOT BE TAKEN LIGHTLY.

What is an Operational Data Vault? The Operational Data Vault is part data warehouse, and part online transactional data store (operational data store). The Operational Data Vault stores all changes
to data as inserts (as does a traditional data warehouse), however at the same time it also offers
update/edit access to the operational applications sitting directly on top of the data warehouse.


In case you are wondering: Has this ever been done successfully? The answer is yes, it has
several times already. A company called Cendant Timeshare Resource Group (Cendant TRG) rebuilt
their entire operational layer in Java directly on top of the Data Vault, consolidating data
warehousing directly with operational applications. There were no separate systems for reporting,
no separate systems for operational data or OLTP applications, simply the Data Vault and the Java
OLTP application. This is one example which has been in use since 2001.
Another example is a drug manufacturing traceability warehouse that was built in 2008 for the US
Congress. This Data Vault had operational applications that were driven by drug packaging
machines which assigned unique IDs to every drug package from every manufacturer around the
world. These machines fed the data over remote web-services connections directly to the Data Vault
every 10 minutes, where the data was encrypted, secured, and stored only to be accessed every
time the drug was scanned at different points in the supply chain. At which time the warehouse
would provide different web-service access points to retrieve audit trails of all points where the drug
was scanned. In this manner, you (the consumer) could log in to a web-site after purchasing a drug,
type in its bar-coded number, and check its authenticity. It was called: the Drug Track And Trace anti-counterfeit operation.
2.9 Dynamic Data Vault

The Dynamic Data Vault is an operational Data Vault with dynamic adaptation to the structure. In
other words, the tables, columns, indexes, and keys are all subject to change automatically. Of
course to achieve this state requires a constant vigilant watch on the metadata, including but not
limited to incoming structures. The incoming structures may include XSD, XML, staging tables, or
other metadata (including queue based or process metadata) that describe the structure of the
incoming data set.
The dynamic nature of the Data Vault means: new attributes may be added to Satellites, new Links
and new Hubs may be formed on the fly. ETL/ELT loading code will be adjusted automatically, and
BI Query views will also inherit certain changes. At the end of all the automatic model changes,
emails of the changes are sent to the IT staff for review in the morning.


3.0 Common Attributes

The Data Vault structures (tables) contain standard attributes that assist with the construction, tracking, and querying. The common attributes in the Data Vault are defined here and are applied throughout the Hubs, Links and Satellites. The common attributes include: sequence numbers, sub-sequence numbers (line item numbers), load dates, load end-dates, last seen dates, extract dates, record creation dates, and record sources.

Most of these fields are EDW (enterprise data warehouse) system defined, and EDW system generated/maintained; as a result, the data in these columns are reference data and are non-auditable, as they do not exist in the source system. However, record creation dates and line-item numbers are two cases that are auditable, particularly when they exist in the source system.

The Data Vault works on principles similar to geological layering, where data arriving in the warehouse (in a single batch) is stamped with a geological time based layer (a load date time stamp). The load dates enforce audit trails and record history based on the one and only controllable system date time available to the EDW loading routines. The only point at which this principle does not apply is during real-time feed processing.

Figure 3-1: Time Series Batch Loaded Data


The Data Vault assists with auditability and recoverability by stamping all participating rows in a
single batch with the same load date time stamp. If the loading process fails mechanically (for any
reason) it is necessary to examine all rows that were loaded during that process; resulting in
removal, replacement, or augmentation to the data set. This is the only mechanism available to
recreate the audit trail of the data for that date time stamp. As a side note, these mechanical
problems are not often discovered for weeks or months after they have occurred.

Real-time data loads are treated differently. Real-time data loads are stamped based on message
arrival time. Real-time latency is typically defined as message arrival in a data loading queue with
latency of arrival being less than one minute. Real-time loading is commonly defined in terms of:
transactions per second. An example image of real-time data stamps is shown in Figure 3-2.

Figure 3-2: Real-Time Arrival, Data Geology

Real-time data arrival time-stamping appears similar to layers of pebbles on the beach. Data is not
congruent with time intervals or time-spans. It can be grouped together for analytic purposes, but a
single time constant does not represent any fixed layer of information in the enterprise data
warehouse. As data loads shift to incorporate real-time data feeds (also known as trickle feeds), the
lines between constant time (batch loads) and continuous time (real-time loads) blur.

3.1 Sequence Numbers

Sequence numbers are required by relational database management systems (RDBMS) in order to
process joins quickly and efficiently. Without sequence numbers the joins across huge amounts of
information would operate comparatively slowly (compared to character based joins). The use of
sequence numbers as primary keys for Hubs and Links also eliminates any possible issues
maintaining multi-part cascading keys in Satellites or nested Link tables.
Staging area sequences are stored within the staging area. These sequences should be restarted
and set to cycle over for each load to a specific table. Staging sequence numbers are utilized only to
identify loaded duplicates. Staging area sequences should not ever leave the staging area, and
should not be moved forward into the Data Vault.
Duplicates are rows that have 100% completely the same data - from the keys, to the nulls, to the
descriptive fields. When the data is 100% duplicate, there needs to be a way to delete the rows
from the staging table in order to proceed with loading only one unique copy to the target Data Vault.
Without a sequence number, there is no unique identifier on each row. With a sequence number it
is easy to pick the first or last row as the candidate to leave in place and delete the rest.
Before deleting the duplicates, the Metrics Vault should record a history of how many duplicates
there are per staging table per business key. By counting the duplicates, auditability can be
maintained if the IT staff is ever asked to reproduce the source load. Multiplying the single
remaining row by the recorded duplicate count provides an accurate picture for the recreation. In
other words, a Cartesian join product is applied in order to reproduce the original duplicate row set.
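As a hedged sketch of this process (the staging table STG_CUSTOMER and its columns are hypothetical; in practice the grouping must cover every column of the row, since duplicates are 100% identical):

```sql
-- 1) Record the duplicate counts for auditability before any deletes:
SELECT customer_acct_num, COUNT(*) AS dup_count
FROM   STG_CUSTOMER
GROUP  BY customer_acct_num        -- in practice: group by ALL columns
HAVING COUNT(*) > 1;

-- 2) Keep the lowest staging sequence number per duplicate set and delete
--    the rest (self-referencing DELETE syntax varies by RDBMS):
DELETE FROM STG_CUSTOMER
WHERE  SQN NOT IN (
         SELECT MIN(SQN)
         FROM   STG_CUSTOMER
         GROUP  BY customer_acct_num   -- in practice: group by ALL columns
       );
```

The staging sequence number SQN exists only to make this selection possible; it is dropped before the data moves forward into the Data Vault.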
Hub and Link sequence numbers are created 1 for 1 with each unique business key and unique
association inserted to the respective table. Satellite sequence numbers are generally parent table
sequence numbers, in other words they are inherited from the Hub or Link parent table.
It is a recommended practice to set up sequence numbers as number(12). In Oracle there
appears to be no byte-storage difference between a number(12) and a number(38). Most sequence
numbers will fit within this length, and will not require double or floating point math to resolve at
query time.


Sequence numbers in the Data Vault should never be shown to business users, and must not leave
the Data Vault going forward. First, sequence numbers are meaningless numbers which are there
simply to provide uniqueness to the rows they represent. Second, the numbers are there merely for
JOIN purposes at high rates of speed. Third, if I ask you: please tell me what the number 5 means to
you? Can you define it? Can you make sense of it? No. It's a meaningless NUMBER. There is no
context.

The sin of this is that once you expose the sequence number to the business they will forever
attach that customer/product/employee/service or whatever-it-is to the number you give them.
Meaning that they give it context; they force it to mean something to the business! Now, you (as IT)
no longer have the right or the ability to change/alter/destroy and rebuild that number, nor are you
allowed to attach different rows to that number.

This will cause problems for future re-loading, re-building, or even fixing the Data Warehouse,
regardless of the data modeling technique you choose! DON'T DO THIS; DON'T EXPOSE SEQUENCE
NUMBERS TO THE BUSINESS, EVER!
3.2 Sub Sequence Numbers (Item Numbering)

Sub sequences depend on parent tables for context, and within context have business meaning;
however, as stand-alone attributes they hold no business meaning whatsoever. In this regard
sub-sequences do not work well as independent Hub keys. Sub sequences may also be defined as
"ghost" Hub tables if logically modeled, but should not ever be physically implemented. For
example: a line-item number 5 has no context; however it is required when discussing a particular
detail item on an invoice. Sub sequence numbers are utilized to order Link or Satellite rows. In Link
tables they are part of the unique index; in Satellite tables they can be included in order to provide
context, called: multiple active Satellite rows.

Sub-sequences simply allow multiple rows to be active for a single master key. It is a best practice
to avoid sub-sequence numbers if at all possible. When used in a Link table they can cause
re-engineering of the loads in the future (if the Link structure changes).

WARNING: IF SUB-SEQUENCES APPEAR IN THE MODEL, IT MAY BE A CALL TO RESEARCH FURTHER. TAKE THE TIME TO
INVESTIGATE IF A LINK AND NEW HUB TABLES NEED TO BE DEFINED. IT IS COMMON TO MISTAKE THE NEED FOR A
SUB-SEQUENCE WHEN THE CORRECT MODEL WILL HAVE ONE OR MORE NEW HUBS WITH A LINK IN PLACE, EXCEPT IN
REAL-TIME MILLISECOND SYSTEMS.

3.3 Load Dates

Load dates are system generated, system maintained fields. This attribute is applied to the arriving
data set in both real-time and in batch modes. Load dates represent the date time stamp
(according to the EDW machine clock) of the arriving data. Load dates are applied to data sets
arriving in the staging area of the Data Vault.

Load dates for real-time data are applied based on the clock time arrival of the transactions housed
in the incoming queue. Load dates for batch based data are set once per batch. They can be
thought of as a date-time-stamp equivalent to a batch load process identifier. As described above
(see Figure 3-1) the Data Vault relies on the notion that load dates are consistently applied per
batch for tracking purposes. The Load Date should not be set by repetitive system calls throughout
the life-cycle of a single load, nor should it be changed from one set of staging data to another. The
load date time stamp is the identifier that indicates which geological layer (in time series) this
data applies to.

Load dates should be looked up from a single table called CONTROL_DATE which is housed within
the staging area and contains a single column, single row of information. The loads to the staging
tables look up the load-date (LOAD_DTS), and hard-code the record source. For example, if the batch
window is a nightly batch that begins at 22:00 hours, and completes at 06:00 hours the following
morning, then the LOAD_DTS should be set to 00:00 hours for the following morning day's start
across all data in the staging area.

Figure 3-3: Load Date Time Stamp and Record Source
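A minimal sketch of this pattern follows (table and column names beyond CONTROL_DATE and LOAD_DTS are hypothetical):

```sql
-- Set the control date once, before the batch window opens:
UPDATE CONTROL_DATE
SET    LOAD_DTS = TIMESTAMP '2000-10-15 00:00:00';

-- Every staging load then reads the same value, so the whole batch shares
-- one geological layer; the record source is hard-coded per feed:
INSERT INTO STG_CUSTOMER (customer_acct_num, customer_name, LOAD_DTS, REC_SOURCE)
SELECT src.cust_acct,
       src.cust_name,
       ctl.LOAD_DTS,              -- one consistent stamp for the entire batch
       'SAP.FINANCE.GL'           -- hard-coded record source
FROM   SOURCE_CUSTOMER src
CROSS JOIN CONTROL_DATE ctl;
```

Because every staging load joins to the same single-row table, no loading routine can drift onto its own system-clock value mid-batch.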


By keeping a consistent load date time stamp on the information it becomes possible to trace errors
and find technical load problems (affecting the data) months after the load has occurred. It also
becomes possible to remove that layer of geology and identify how far the problem data has spread.
The resulting load cycles become repeatable, consistent, and restartable for any given load cycle
across all time.

Figure 3-4: Example Load Date Time Stamp Data
Load dates are also utilized to perform gap analysis between arriving data in the warehouse (from
all parts of the world) and the extract dates or creation dates of the data set. By analyzing the gap
between the load stamp (when it arrived at the EDW machine) and extract date, the business can
quickly see if there is an unacceptable delay in data arriving at the warehouse.

For example: suppose the load date is 10-14-2000 as in Figure 3-4, but the extract date is
10-02-2000. The business in this case has a service level agreement (SLA) in place to provide the
data within 5 days of extraction from the source. By storing both in the Data Vault, the business can
quickly determine that their SLA is not being met. Extract dates are discussed in a later section; the
bottom line is that neither extract nor creation dates on data sets should be utilized to represent the
load date time stamp in the data warehouse.
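A hedged sketch of such a gap-analysis query (the Satellite name and columns are hypothetical; date arithmetic syntax varies by RDBMS):

```sql
-- Flag rows that arrived more than 5 days after extraction, i.e. rows
-- violating the 5-day SLA described above:
SELECT customer_sqn,
       load_dts,
       extract_dts
FROM   SAT_CUSTOMER_DETAILS
WHERE  load_dts > extract_dts + INTERVAL '5' DAY;
```

Storing both dates side by side is what makes this a simple filter instead of a cross-system reconciliation exercise.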
Historical treatment of the load date time stamp may differ slightly. Load date time stamps for
historical data will vary based on the availability of dates in the historical data. It is recommended to
state an assumption that: "IF the system existed at that historical point, THEN the load date time
stamp would have been X." Based on this premise, set the control date, load the history to the
staging tables, and then the processes follow precisely for loading historical geology layers in to the
Data Vault.

Note: to assist in managing voluminous data sets, set the compression flag on the Load
Date Time stamp column. This will help queries tremendously; however, keep in mind
that on most RDBMS engines compression adds overhead to the loading cycle. It may
be unwise in a true real-time solution.


On occasion the historical date time stamp must coincide with the creation date of the data, and
sometimes the granularity of the creation date may be monthly where the current loads occur daily.
For these reasons, controlling the load date time stamp as a single unit provides full flexibility of
historical loads for specific grains of data as well, providing snapshot availability.

3.4 Load End Dates

Load end dates are system computed attributes. These are mechanical attributes that exist solely
to make queries against the Data Vault easier. Load end dates are NOT necessary for the
architecture; they are query attributes only. These attributes indicate the end of the data lifecycle
within the loading time-frame of the Data Vault. Time-series based database engines are capable of
computing data life-cycles without resorting to load end date columns.

Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle
Load end dates are set according to the next current row load date. They may be exclusive (as
indicated here), where 1 second has been subtracted from the next most current load date, or
inclusive (not indicated here), where they are equal to the load date from the next most recent row.
Load end dates of the current row are shown in Figure 3-5 to be NULL. It is optional to configure
them to be future dated if desired. Load end dates that are future dated do not need to be relocated
on disk when updated (the end date is reset).

Load end dates which are NULL do not cause the row to migrate to another disk block when
updated or end-dated; most RDBMS engines make date/time data types take the same amount of
bytes whether NULL or not, which is known as a fixed-length column in the database engine. Load
end dates are not auditable as they are system computed values. Load end dates must be updated
in the row-set in order to ensure time-line consistency. Figure 3-5 depicts a Satellite entity with
customer names. Satellites are defined in detail in Chapter 5.0.
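The exclusive end-dating described above can be sketched as follows (the Satellite SAT_CUSTOMER_NAME and its key columns are hypothetical; correlated UPDATE and interval syntax vary by RDBMS):

```sql
-- Close each open row 1 second before its successor's load date;
-- the most current row (no successor) keeps a NULL LOAD_END_DTS:
UPDATE SAT_CUSTOMER_NAME s
SET    LOAD_END_DTS =
       ( SELECT MIN(s2.LOAD_DTS) - INTERVAL '1' SECOND
         FROM   SAT_CUSTOMER_NAME s2
         WHERE  s2.CUSTOMER_SQN = s.CUSTOMER_SQN
         AND    s2.LOAD_DTS     > s.LOAD_DTS )
WHERE  s.LOAD_END_DTS IS NULL
AND    EXISTS ( SELECT 1
                FROM   SAT_CUSTOMER_NAME s3
                WHERE  s3.CUSTOMER_SQN = s.CUSTOMER_SQN
                AND    s3.LOAD_DTS     > s.LOAD_DTS );
```

Dropping the "- INTERVAL '1' SECOND" term yields the inclusive variant, where the end date equals the next row's load date.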
3.5 Last Seen Dates

Last seen dates are a particular component of the architecture that enables source-system hard
delete monitoring without resorting to complete scans of the data set currently in the Data Vault.
Last seen dates are optional within the Hubs and Links. The last seen metadata can be tracked in
alternative Satellites for better resolution at a lower level of detail.
Last seen dates are not required by the architecture of the Data Vault to stand up and work properly.
There are other manners in which to track data (discussed in the Satellite chapter) that may provide
more information than the last-seen-date; an alternative architecture is a status-tracking-Satellite, or
a record source tracking Satellite.
NOTE: LAST SEEN DATES SHOULD NOT BE USED IF THERE IS AN AUDIT TRAIL
AVAILABLE. AN AUDIT TRAIL IS MORE ACCURATE FROM THE SOURCE SYSTEM AND
ELIMINATES THE NEED TO IMPLEMENT LAST SEEN DATES. AUDIT TRAILS MAY ALSO BE
UTILIZED IF GENERATED FROM CHANGE-DATA-CAPTURE (CDC) UPSTREAM IN THE SOURCE
SYSTEM.

The problem faced by enterprise data warehouses is: detecting hard-deletes of source data while
the set of EDW data is continuously growing. The case is as follows: a source system does not
provide an audit trail, nor does it provide any event or transaction indicating which rows are being
deleted or removed. During every load cycle the entire source table/xml file is simply dumped and
loaded to the staging area of the Data Vault.
Traditional set theory dictates that in order to find missing rows that have been hard-deleted from
the source feed, a process takes place that scans everything in the Data Vault that does not exist on
any of the source feeds. This is an extremely expensive operation, and cannot be mathematically
sustained for high volume data warehouses. At some point running the full scan on the Data Vault
becomes impossible. It is at this point that the set can be contained or limited to a finite point by
introducing a system maintained date stamp called a last seen date. An example of the structure
can be seen in Figure 3-6 below.


Figure 3-6: Structures containing Last Seen Dates

The following section addresses technical implementation, which is out of scope for this
document. However, the following information is necessary to assist in the explanation of this
concept of last-seen-date; therefore it will be included in this text.

The last seen date can be found within Hubs and Links. The last seen date's functionality is to track
the last time the Hub/Link saw the key/relationship on any incoming feed. The last seen date is a
system generated and system maintained field, therefore it is a non-auditable attribute existing
within the data warehouse. It may be updated in place without affecting the auditability of the
underlying warehouse data.

Figure 3-7: Scan all data in EDW
Last seen dates provide a mechanism to reduce the data set scanned to detect missing rows on the
source feed. A different architecture known as Status Tracking Satellites can provide more detailed
information in the appearance and disappearance of the keys. Status Tracking Satellites may be
used in place of Last Seen Dates. These Satellites are covered in the Satellite chapter.

For example, suppose the Hub_Customer had 800 million customer keys. The source feed has 30
million on a nightly basis. The customer keys arriving on the source feeds are originating from three
applications: manufacturing, sales, and contracts. The SQL query / code for detecting hard-deleted
keys (without utilizing a last seen date) is as follows:
<Mark status as deleted for records in the following set:>
Select *
from HUB_CUSTOMER where
Customer_Acct_Num NOT IN
(
Select cust_acct from STG_MANUFACTURING
UNION ALL
Select customer_acct_num from STG_SALES
UNION ALL
Select cust_num from STG_CONTRACTS
)

First, a last seen date column must be added to the HUB_CUSTOMER table. Second, a new business
rule is created and signed-off on by the business in a service level agreement (SLA). The new rule is:
data is aged, and not marked as deleted until it hasn't been seen for more than 3 weeks. The keys
in HUB_CUSTOMER are tracked by reversing the set logic in the following code (which presumes
Last Seen Date is a column in the Hub):
Update HUB_CUSTOMER set Last_Seen_Date = Load_DTS
Where
Customer_Acct_Num IN
(
Select cust_acct from STG_MANUFACTURING
UNION ALL
Select customer_acct_num from STG_SALES
UNION ALL
Select cust_num from STG_CONTRACTS
)


This completely reverses the mathematical set operation so that the database now scans only 30
million incoming records and direct joins against the 800 million (an equal-join means hitting only 30
million rows in the Data Vault Hub). A secondary operation is then run against any HUB_CUSTOMER rows
that are older than 3 weeks; the resulting set scan is typically a finite 10% of the total data or less.
The set size generally stays small (in this case around 80 million rows are scanned compared to a
scan of 800 million rows without a last seen date). The reduced scan set is depicted in Figure 3-8.

Figure 3-8: Reduced Scan Set after Applying Last Seen Date

Inserts to a Satellite are made for soft-deletes of keys that are deactivated. The rows already
marked as deleted are ignored from the follow-on scans, reducing the data set size again from 80
million down to 20 or 30 million at most. These are the averages that have been experienced by
implementing this solution at large global companies.
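A hedged sketch of the secondary aging pass (the status Satellite and its columns are hypothetical; the 21-day threshold follows the 3-week rule agreed in the SLA above):

```sql
-- Soft-delete keys not seen on any feed for more than 3 weeks by inserting
-- a status row, leaving the Hub row itself untouched for auditability:
INSERT INTO SAT_CUSTOMER_STATUS (CUSTOMER_SQN, LOAD_DTS, REC_SOURCE, STATUS)
SELECT h.CUSTOMER_SQN,
       CURRENT_TIMESTAMP,          -- in practice: the batch CONTROL_DATE value
       'SYSTEM',
       'DELETED'
FROM   HUB_CUSTOMER h
WHERE  h.Last_Seen_Date < CURRENT_DATE - INTERVAL '21' DAY
AND    NOT EXISTS ( SELECT 1
                    FROM   SAT_CUSTOMER_STATUS st
                    WHERE  st.CUSTOMER_SQN = h.CUSTOMER_SQN
                    AND    st.STATUS = 'DELETED' );
```

The NOT EXISTS clause is what keeps already-deleted keys out of the follow-on scans, producing the 20 to 30 million row working set described above.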
3.6 Extract Dates

Extract dates are wonderful to capture if they are available on the source systems. Extract dates
represent the date and time that the data is extracted or written to flat-file on the source system.
Extract dates typically are not available when direct SQL access is utilized. Extract date and time for
SQL extracts is usually stored in the metadata (process logs of the ETL performing the extract) so it
is not required to store in the Data Vault structures directly. Extract dates for flat-files are extremely
helpful, particularly if the data set is pulled from several areas around the world.


Extract dates are not reliable, as in some cases the extract may be created on a PC where the clock
and system date time are in question. In other cases the server performing the extract may be in a
different time zone than either the source system or the data warehouse server. Bottom line: the
EDW team generally has no control over the extract date and time on the source system, therefore it
is non-auditable data. It is reference data in a manner of speaking, and as such should be stored
as just another attribute of the Satellites in the Data Vault.
3.7 Record Creation Dates

Record creation dates are wonderful if they are available on the data set. If they are available they
should be recorded as attributes in Satellites. They should not be a part of the Hub nor the Link
structures as they are not reflective of the key structures or associations. Record creation dates
generally represent the date and time of creation of the source system row (in its entirety). In some
cases these date time stamps may be edited by the business users on the source system (which
means they can change over time).
Regardless of the case, the EDW team has no governance to cover the management and
consistency of record creation dates. Furthermore even if governance procedures existed, it would
be a great undertaking to ensure governance over 100% of the source system data; resulting in a
non-auditable field which must be treated the same as any other source system data as an
attribute in a Satellite.
3.8 Record Sources

Record source columns are row-based metadata that represent where the row originated. These are
hard-coded values applied to maintain traceability of the arriving data set. Record sources can be
codified with the descriptions residing within reference tables. Record sources should be
architected to the lowest level of granularity. For example: SAP.FINANCE.GL (indicating an SAP
source system, followed by a financial application, followed by General Ledger).
Record sources are metadata that must be carried in to the staging area (are hard-coded in the
staging loads to the Data Vault). They can be created as lookup codes or lookup sequences to avoid
duplication of the data set in high volume situations. They are then resolved on the way from the
Data Vault to the data marts. If they are created as lookup codes, they then are placed in a
reference table. The reference tables are covered in the reference data chapter, chapter 8.0.


Record sources must remain on the row level as a part of a 100% compliant and auditable solution.
These fields are used to answer questions about the data: where it came from and, more specifically,
which application. Traceability of the data from the AS-IS data marts all the way back to the source
systems provides compliance that meets regulatory standards. Developers, auditors, and business
users benefit from having a record source in each row of data across the entire model.
Tech Tip: To manage volume or repeating groups without joining (resolving to a code),
compress the column in database engines that support compression. Record source codes
are highly repeatable and redundant data. Record sources may be comprised of reference
codes; resolved on the way out of the Data Vault by joining to reference data. Reference
codes as record sources allow the data set to be compressed from the start.
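A minimal sketch of the codified approach (table names are hypothetical; reference tables themselves are covered in chapter 8.0):

```sql
-- Reference table resolving compact record source codes to descriptions:
CREATE TABLE REF_RECORD_SOURCE
( REC_SOURCE_CODE  INTEGER       NOT NULL PRIMARY KEY
, REC_SOURCE       VARCHAR(100)  NOT NULL    -- e.g. 'SAP.FINANCE.GL'
);

-- Resolved on the way out of the Data Vault to the data marts:
SELECT s.customer_name, r.REC_SOURCE
FROM   SAT_CUSTOMER_NAME s
JOIN   REF_RECORD_SOURCE r
  ON   r.REC_SOURCE_CODE = s.REC_SOURCE_CODE;
```

Storing the small code instead of the repeated text keeps high-volume tables narrow from the start, while the join restores the full description for the marts.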
3.9 Process IDs

Process IDs are a tracking mechanism for the loading process that brought the data into the
warehouse. They are not part of the core-architectural components of the Data Vault. Process IDs
may be used as a means to track the data set back to the individual loading process. They are
augmentative metadata only. Process ID columns are repetitive in nature, and as such should be
setup for column compression.
Process IDs may replace both record sources, and load-dates. If process IDs are tied to technical
metadata stored in the Meta Vault, they can replace the two items in Hubs and Links (load dates will
always be needed as part of the key for a Satellite). In this situation, the process metadata must be
tagged with a record source and a run/load date.


4.0 Hub Entities


Hubs are defined by a unique list of business keys. They are surrounded with additional technical
metadata elements such as the load date time stamp, last seen date (optional), record source and
sequence number. Business keys may be composite (made up of more than one field), intelligent
(smart-keys containing meaning across parts of the key), or sequential in nature.
Though not ideal or desirable, over time operational application developers have made the mistake
of producing, displaying, and reporting on sequence numbers. The definition of a business key will
be discussed in detail starting in section 4.2.
Tech Tip: Meaningless sequence numbers in operational systems can be a
design/architecture hazard. If sequence numbers in operational systems are exposed to
business users then they become (by default) business keys. Hub Tables are meant as a
consolidation point for horizontal business functions. For example: Customer Account
Numbers should span multiple lines of business. At the end of the day having a single
customer account number from customer inception to delivery is what the business process
needs to provide corporate level answer sets.

Unfortunately in the real-world, customer keys (as with so many other business keys) change
depending on the system being used. The keys change from one state to another as the customer
information passes from one system to another. These changes are typically a manual process
resulting in little to no visibility at the corporate level for where a customer is in the life-cycle of
business.


Figure 4-1: Business Key Changing Across Line of Business

In Figure 4-1 above, the key changes through an Excel managed process when the customer is
transferred from the sales system to the procurement system. The ideal would be for the same key
to be used horizontally across all lines of business regardless of the system of origin and the system
of transfer. What business doesn't realize is just how much money they are losing by changing the
business key from one line of business to the next.
They also frequently allow this to happen by implementing off-the-shelf products which expose
sequence numbers as business keys. Clearly, sequence numbers from Oracle Financials will never
match sequence numbers in Siebel or PeopleSoft or SAP, etc. Because the sequence numbers are
exposed, the business begins to use them as business keys, automatically losing traceability (and
money) when the sequence number for the same customer differs across multiple systems.

One of the jobs that a good data warehouse should perform is gap analysis; that is: provide the
business with a view of the GAP between the way the business believes they are operating their
business, and the way the systems are collecting the data. By examining this gap, the business can
quickly locate where they are hemorrhaging money.

4.1 Hub Definition and Purpose

The job of a Hub is to track the first time the Data Vault sees a business key arrive in the
warehousing load, and where it came from. The Hub is a business key recording device. The
business keys in a Hub should be defined at the same semantic granularity. For example: Customer
Individual is a different grain than Customer Corporation. Each of these types of customers should
be respectively modeled in two different Hubs, as shown in Figure 4-2.

Figure 4-2: Hub Example Images
Hubs have several of the standard fields including sequence number (SQN), Load Date
(_LOAD_DTS), and Record Source (REC_SOURCE). In special cases, a Hub will also include an
encryption key (ENCR_KEY) and potentially a Last Seen Date (LAST_SEEN_DTS). The encryption key
is a part of the Hubs when the data set is encrypted. It may be one half of a two-part public key.
The encryption key is not standard, which is why it is not listed in Chapter 3.0.

Last seen dates are not required, and are not a part of the core architecture. Last seen dates assist
in tracking deleted rows/aging business keys. Business keys in Hubs may be tracked through status
tracking Satellites, which are covered in the Satellite chapter. Required in the architecture are the
sequence number, load date, and record source.
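Putting the required and optional fields together, a Hub might be declared as follows (a sketch with Oracle-flavored types, following the number(12) guidance from section 3.1; the column names are illustrative):

```sql
CREATE TABLE HUB_CUSTOMER
( CUSTOMER_SQN       NUMBER(12)    NOT NULL   -- surrogate sequence key
, CUSTOMER_ACCT_NUM  VARCHAR2(30)  NOT NULL   -- the business key
, LOAD_DTS           DATE          NOT NULL   -- first time the key was seen
, REC_SOURCE         VARCHAR2(50)  NOT NULL   -- where the key came from
, LAST_SEEN_DTS      DATE                     -- optional, see section 3.5
, CONSTRAINT PK_HUB_CUSTOMER PRIMARY KEY (CUSTOMER_SQN)
, CONSTRAINT UK_HUB_CUSTOMER UNIQUE (CUSTOMER_ACCT_NUM)
);
```

The unique constraint on the business key is what makes the Hub a singular list of keys; the sequence number exists only for fast joins and never leaves the Data Vault.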
The purpose of the Hub is to provide a soft-integration point of raw data that is not altered from the
source system, but is supposed to have the same semantic meaning. The resulting singular list of
keys assists in the discovery of patterns across systems. The Hub key also allows corporate
business to track their information across lines of business; this provides a consistent view of the
current state of application systems. These systems are supposed to synchronize, but often don't;
when they don't synchronize, business keys begin to be replicated and, worse yet, are then applied
to different contextual data sets.

Some examples of Hubs and their data are shown in Figure 4-3:


Figure 4-3: Hub Example Data

In the HUB_CUST_ACCT (Hub Customer Account) it is easy to spot similar patterns, fat-fingered data,
and errors in entry, possibly a lack of edit masks. The typical requirement in this case is as follows:

The business says: "We always create our customers in contracts. You will always get
your customer numbers from contracts first because they are responsible for closing
the deals and getting the money."

When the patterns in the data are examined, it is clear that Sales has produced keys (as has Finance) that are not in Contracts. It's up to the business to figure out why; it's the job of the data warehouse to point out the pattern. With this type of analysis, the data warehouse can provide the needed gap analysis between the business requirements and the source systems. In this case there may be broken source system synchronization routines, or worse: a loop-hole in the business process that incentivizes people in sales to enter new customers. All of this is speculation until the business figures out why it's happening and moves to fix it in the source systems or primary processes of the business.

There are ways that these keys can be rolled together for BI reporting purposes. The notion of hierarchical Links and same-as Links is discussed in the Link chapter (Chapter 5.0). The data itself stays intact in order to re-constitute the source system as necessary for auditability.
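The gap analysis described here can be sketched as a simple set comparison over the Hub's consolidated key list. This is an illustrative sketch only; the system names and key values below are hypothetical, not taken from the book's figures.

```python
# Sketch: compare customer business keys collected per source system
# to surface keys that exist in Sales or Finance but not in Contracts.
# System names and key values are hypothetical.

def key_gap(master_keys, other_keys):
    """Return keys present in another system but missing from the master."""
    return sorted(set(other_keys) - set(master_keys))

contracts = {"C-1001", "C-1002", "C-1003"}
sales     = {"C-1001", "C-1004"}   # C-1004 never came from Contracts
finance   = {"C-1002", "C-1005"}   # C-1005 never came from Contracts

# Keys that violate the stated rule "customers are created in Contracts first":
print(key_gap(contracts, sales))    # ['C-1004']
print(key_gap(contracts, finance))  # ['C-1005']
```

The warehouse does not judge which system is "right"; it only surfaces the pattern so the business can investigate.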

4.2 What is a Business Key?

A business key is something that the business uses to track, locate, and identify information. It is a best practice to have unique business keys. Business keys are also known as natural keys.

Business keys should be unique, but often are not. Business keys may actually be source system sequence IDs that have been released to business users and are now embedded in business processes. Business keys are supposed to have meaning to the business. Examples of business keys include:

• Vehicle Identification Number (VIN)
• Auto License Plate Number
• Driver's License Number
• Account Number
• Portfolio Number
• Part Number
• Work Order Number
• Employee Badge Number
• Invoice Number
• Ticket Number
• Bar Code
• Product Number

Each of these keys stands alone in business, and in the operational systems they are usually surrounded with descriptive context to give them meaning. In data modeling terms these keys are parents, and do not require any additional keys to provide them with the grain of definition. There are times when business keys are composite keys (such as VIN numbers or bar codes); these are also known as intelligent keys. Business keys may also include the natural key plus the corresponding source system surrogate sequence key, because the business failed to make the natural key truly unique and the source system surrogate is now needed for traceability within the EDW.

4.3 Where do we find Business Keys?

Business keys can be found in source system applications, on-line lookup screens, report headers, source system data models, XML, XSD (schemas), and source COBOL copybooks. Business keys can also be found in business process engines, SQL joins, and source code (COBOL, Java, stored procedures, etc.). Business keys may also be found in Excel spreadsheets used to group items together and label elements used in reports, or listed in OLAP cubes as part of dimensions used for drill down.

The best place to find business keys is within the business process layers. Businesses often identify and track their information sets through business keys. The business process layers allow business users to communicate from one person to another and translate, send, or attach the information to the business process flow. Business keys may indicate hierarchies, groupings, cross-mapping (from one system to another), physical identification tags, and globally traceable information.


NOTE: just because a surrogate is used within a source system does not automatically qualify it as a business key. It must be presented, printed, displayed, or searched on (made known to the business user) in order to qualify as a business key. It should also be clearly defined by the business to represent a noun or an object that has context, or be defined as the key to contextual information, in order to become a business key.

However there are a number of surrogate keys (like Order Number and Invoice Number) which are
true surrogate numbers and have business value. Both Order Number and Invoice Number qualify
as business keys, as they are used by the business to uniquely identify (and track) data in the
source systems. In these cases, it is a hopeful thought that only one system maintains and
produces these surrogate numbers; that would be the optimal solution.

4.4 Why are Business Keys Important?

Business keys are the most important component of all information systems. Business keys provide
Links between business processes and the context that drives decision making. Business keys are
the most stable of data elements used by the business. They should be consistent throughout and
across lines of business. Through listing the business keys of the same semantic grain together,
patterns of inconsistencies and consistencies begin to emerge. Typing mistakes are more easily
caught, domain overload (domain chaos) is more easily visible, and missed punctuation becomes
clearer.
At the time of this writing it is extremely rare to find common business keys that transcend lines of business and the applications in which they are generated, stored, and utilized. Businesses must begin to identify, through metadata, their need for common business keys. This is a sign of true business architecture. The end result of common business keys is board-level visibility of the end-to-end business process through which their data travels. By tracking the data sets and their business keys, business users can begin to optimize the business processes. This simple notion is the root of master data management. Master Data will not succeed without the proper identification and management of business keys.

Because the business key is (in theory) supposed to be static and stable, it should consistently be the smallest and most unchanging component of the business, regardless of the business units in which it is applied. This is separate from the business metadata that defines the element and the functionality of how the element is applied in business. This is a technical definition of "same semantic grain" that associates this business key with the corresponding context surrounding it.

For example: an automobile's VIN (vehicle identification number) should not change. However, the color of the car, number of doors, windows, seats, length of the car, and size of the engine may all change over time. These are examples of descriptive attributes, which are covered in Chapter 6: Satellites.

Because business keys are supposed to be the most stable component, separating them into a Hub stabilizes the model itself over time. At the same time as we stabilize individual structures, we can also adapt easily to new business keys at different grains or defined by different criteria. Thus, the adoption of new structures to meet new business operating procedures becomes easier (without losing history in the current system).
Without business keys, IT will not be able to build a master data system and properly tie the data set (context) back to the business processes. Business keys make up master record locators that are embedded for information visibility across lines of business. Business keys should never change and should never be re-used. It is, however, a well-known fact that business keys do change and are re-used; this has major implications for business life-cycles and will cost the business significant money on a year-over-year basis.

Figure 4-1 demonstrates the nature of the business key changing from one line of business to another. The end result is no consistent visibility at the corporate level for maximum optimization. Businesses that have this problem, without tracking across the change, will not be able to answer the following questions:

• How many customers does my business have today?
• Where in the business life-cycle are my customers?
• How many customers are in Sales, and not yet in Contracts?
• How long does each customer spend in different business units?
• Which business unit takes the longest to process customers through?
• Which customers are most profitable?
• Which customers take the shortest time to process through our business?

These are all master data questions that require consistent and tracked business keys. This does not require stable business keys, as long as the business key changes are tracked across multiple alterations. The bottom line is: business keys are the only way to create auditable and traceable information back to the root business processes and source systems.

4.5 How do Business Keys tie to Hubs and Business Processes?

Business keys are the heartbeat of the data that travels through business processes. Think about it: when you access a source system application to look up a customer, what do you type in? When you look for a part, a product, or an employee in a source system, what do you search on? If you guessed business keys, you guessed correctly!

Business keys are a part of every-day life. We use computers and their data stores to remember
and track all the possible information that we collect. We are then left to focus on a product, a
portfolio, or a set of customers. From these activities we have to identify, define, trace, and
manipulate all of this information within the business processes.
These business processes include manual efforts (we print a report and hand it to someone else), source system applications (think data entry), or a dashboard of our top customers that we have to touch every day to see if there's anything we can do for them.

Without business keys involved in these processes, there would be chaos. Without business keys identifying all this information, it would all be of ZERO VALUE to us, which in fact is exactly what happens to the data in the systems if or when the keys to that data are lost. You know the old saying: out of sight, out of mind. If we can't track, edit, retrieve or manage all the information in our operational systems, then the value of that information drops to zero.

Business keys are tied to the data set in the source application. Business keys are likewise tied to every business process that the data flows through, thus ensuring traceability at the business process level. Business keys are the foundation of the Data Vault, which means that your data warehouse is centered on business keys. These keys are the life-blood of the data warehouse, which is how we can tie the value of the data assets back to the business.

Centering your data warehouse around business keys provides you with a huge advantage in data warehousing valuation as an asset to your business. It gives you the ability to track and trace all the information back to the point in the business processes where it makes the most sense.

THIS, MY FRIENDS, IS CALLED: GAP ANALYSIS. THIS IS OUR TRUE JOB AS BUSINESS INTELLIGENCE EXPERTS. WE ARE SUPPOSED TO POINT OUT THE GAPS, AND HELP THE BUSINESS CLOSE THEM!

4.6 Why not Surrogate Keys as Master Keys?

Surrogate keys are helpful and useful to a machine, particularly when it comes to speeding up joins and processing data sets in order of creation. However, that's where the helpfulness and usefulness stops. Surrogate keys should never be shown to business users. They should never be placed on reports, search screens, or operational application screens. They should never be mistaken for business keys by the business users. They invariably cause never-ending problems that cost businesses large sums of money over the life of the source system and data warehouse.

Problems begin to arise when the data needs to be re-loaded and new surrogates must be generated for the rows. This causes confusion in the auditability of the data set, and even calls into question any previously exposed surrogate keys that were printed on reports. These old surrogates no longer match up with the newly generated data! So much for the system-of-record source system!

Surrogate keys should remain within the confines of the systems in which they are applied. However, in modeling a Data Vault for source systems (especially those without business keys today), the Data Vault model must accommodate the surrogate keys and (unfortunately for business) treat them as the business key to that source. The loading routines must deal with collision, semantic meaning, and the definitional aspects of simple numbers. Surrogate keys mean nothing to the business, and the business should not be asked to memorize or embed meaningless data into their business operations.

Note: As mentioned in section 4.3 some business keys are in fact surrogate keys.
These include keys such as Order Number and Invoice Number. These keys are used
as meaningful business keys and should be represented as Hubs when necessary.

4.7 Hub Smart Keys, Intelligent Keys

Some business keys, like bar codes, are called smart keys or intelligent keys, meaning the key is comprised of multiple parts. All parts must be kept together as a UOW (unit of work). The business utilizes the entire key as one unit (one identifier) to represent other information.


Figure 4-4: Smart Key Example
In Figure 4-4 the key represents a single manufacturing work order number. The components underneath are the rules supplied by the business as documentation for the make-up of the number. Smart keys are helpful only when they are entered properly in the source systems.

Smart keys may also be known as intelligent keys; that is to say, the data within the keys have business meaning by position, value, and format. When intelligent keys are used by the business they must be kept together in a single Hub within the Data Vault. This is consistent with the definition and context of the business processes that search and index this key for purposes of discovering additional context.

Furthermore, it may be that this key is exposed to end-users. For example: a two-dimensional bar code. Bar codes are printed on drink products. If the user wanted to know who the manufacturer is, where it was manufactured, and on what date, they may want to scan or look up the bar code. Some bar codes also house batch information, so that certain products may be recalled by batch and by date. Since these bar codes are kept as a single identifier within the business, they are to be kept as a single Hub business key within the Data Vault model. It is possible (for exploratory reasons) to break the parts of the bar code into separate Hubs; however, it is mandatory that the original bar-code Hub remain in place as well.
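The rule that a smart key stays whole in its Hub, while its parts may still be decoded for exploration, can be sketched as follows. The 12-character positional layout below is a hypothetical illustration, not an actual work order format from any source system.

```python
# Sketch: a smart (intelligent) key is stored whole as the Hub business key;
# its parts may be decoded for exploration, but the full key stays intact.
# The positional layout below is hypothetical.

def decode_work_order(key: str) -> dict:
    """Decode a hypothetical 12-character work order number by position."""
    return {
        "plant":    key[0:3],   # positions 1-3: manufacturing plant code
        "year":     key[3:5],   # positions 4-5: two-digit year
        "sequence": key[5:12],  # positions 6-12: running sequence
    }

work_order = "DEN210004711"       # stored whole in the Hub
parts = decode_work_order(work_order)
print(parts["plant"])             # DEN

# Re-assembling the parts must always yield the original Hub key:
assert parts["plant"] + parts["year"] + parts["sequence"] == work_order
```

Breaking the parts into separate Hubs for exploration is optional; the whole-key Hub is not.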

4.8 Hub Composite Business Keys

Hubs are not required to house their business keys within a single field. There may be times when the data is well-defined enough that it allows the different components of an intelligent key to be split apart into separate fields. The business key is then made up of multiple fields, resulting in a composite identifier. The composite identifier is still the single business key for the Hub, meaning that the key is uniquely indexed.

There may be reasons hidden in performance, indexing, partitioning, or searching to split a business key across multiple fields. Two examples of composite business key Hubs are shown in Figure 4-5.

Figure 4-5: Composite Business Key Hub Example

Tech Tip: In Hub_Bar_Code the composite key (when concatenated) makes up the full barcode that is printed on a container. Each constituent part is a piece of the whole. Since the business uses the entire bar code to track the container, the entire bar code is itself a business key. Multiple fields are simply split apart to represent the composite whole. In other words, the ENTIRE BAR CODE is used as the business key by the business, therefore it is part of a single Hub.

Can we also have Vendor, Product Code, and Production Date in their own Hubs? Yes, of course, as they most likely represent unique data by themselves; however, the nature of a BAR CODE is to be a conjugation of all its constituent parts and as such, it will remain a single Hub with all composite fields in its own right.

The source system for HUB_DOCTOR is written to be run in different states. The application was then set up in Colorado, Denver, and New York. Each application assigned doctor ID = 1 to a different doctor in its own state. In order to avoid collisions upon data load, the state ID or state code must be loaded as a composite with the Doctor ID. This maintains traceability back to the source application in each state.
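The HUB_DOCTOR scenario can be sketched in a few lines: treating (state code, doctor ID) as the composite business key prevents the collisions that a bare doctor ID would cause. The state codes and IDs below are hypothetical.

```python
# Sketch: HUB_DOCTOR-style collision avoidance. Doctor ID 1 exists in every
# state's copy of the application, so the state code must be part of the
# composite business key. Values are hypothetical.

hub_doctor = set()  # holds composite business keys (state_code, doctor_id)

def add_doctor(state_code: str, doctor_id: int) -> bool:
    """Insert a composite key; returns False if the key already exists."""
    key = (state_code, doctor_id)
    if key in hub_doctor:
        return False
    hub_doctor.add(key)
    return True

# The same doctor_id = 1 loads from three state applications without colliding:
assert add_doctor("CO", 1)
assert add_doctor("NY", 1)
assert add_doctor("CA", 1)
# A true duplicate from the same state is rejected:
assert not add_doctor("CO", 1)
print(len(hub_doctor))  # 3
```

Had the Hub keyed on doctor ID alone, the second and third loads would have collided with the first, silently merging three different doctors.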

4.9 Hub Entity Structure

The Hub entity structure consists of these required elements: a surrogate sequence id, a business key, a load date stamp, and a record source. There are additional components that are necessary and helpful in order to meet the applied needs of the data set. Items such as last seen date, confidence rating, strength rating, encryption key, and possibly other metadata elements may be added for query purposes, performance purposes, and discovery purposes as the business requires.

The Hub entity must NEVER contain foreign keys. If the Hub structure is compromised (i.e., the modeling standards are not adhered to), then the integrity of the data and the flexibility of the model are immediately compromised.

Hubs must stand alone (be a parent to all other tables); they must never be children. Figure 4-6 is an example of the Hub Entity Structure.

Figure 4-6: Example Hub Entity Structure

Any compromise made in the structure will lead directly to re-engineering, high maintenance costs, difficulty in growth, lack of flexibility, and problematic real-time in the near future. Never alter the raw structural definitions of the Data Vault.
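A minimal physical Hub following the required elements above might be sketched like this, using SQLite only because it ships with Python. Real platforms differ in data types, sequence generators, and index syntax, and the table and column names here are hypothetical, so treat the DDL as illustrative.

```python
# Sketch: minimal physical Hub structure in SQLite (Python stdlib). Column
# suffixes follow the SQN/LDTS/RSRC conventions used in this chapter.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        cust_sqn   INTEGER PRIMARY KEY,   -- surrogate sequence key
        cust_num   TEXT NOT NULL UNIQUE,  -- business key, uniquely indexed
        cust_ldts  TEXT NOT NULL,         -- load date time stamp (attribute only)
        cust_rsrc  TEXT NOT NULL          -- record source (not in the PK)
    )
""")
conn.execute(
    "INSERT INTO hub_customer (cust_num, cust_ldts, cust_rsrc) VALUES (?, ?, ?)",
    ("ABC123", "2011-01-01 00:00:00", "CONTRACTS"),
)

# The unique index on the business key rejects duplicates from other feeds:
try:
    conn.execute(
        "INSERT INTO hub_customer (cust_num, cust_ldts, cust_rsrc) VALUES (?, ?, ?)",
        ("ABC123", "2011-01-02 00:00:00", "SALES"),
    )
except sqlite3.IntegrityError:
    print("duplicate business key rejected")
```

Note there is no foreign key in the structure, the load date is an attribute rather than part of the primary key, and the record source is outside the primary key, all per the rules in this section.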

4.10 Hub Examples

For the examples, there are several different data models that have been utilized. The examples here are created based on the Adventure Works model (circa 2008 from Microsoft). There are a few Hub examples and data sets in Figure 4-7. You can find additional examples and downloads on http://danLinstedt.com
Da
an Linstedt 2010-2011,
2
all
a rights rese
erved

http:///LearnData
aVault.com

Supe
er Charge Yo
our Data Ware
ehouse

P
Page 68 of 15
52

Figure 4-7: Example Hubs from Adventure Works 2008

Legend:
• Field extension SQN = Sequence number, surrogate identity used to manage uniqueness.
• Field extension LDTS = Load Date Time Stamp
• Field extension RSRC = Record Source
• The remaining field(s) are the business keys.

This model has a mix of data types for business keys. There are a few, like DocumentNode and ProductNumber, which match character-based business keys. The rest of these keys were derived because of the following reasons:
• There is no source system application to check the rules against.
• There are no source system business users to ask (the Adventure Works model was created by programmers).
• There is only a single system that is integrated here. In multiple-system cases, the process of modeling a Data Vault generally represents alpha-numeric business keys.


Figure 4-8: Example of National Drug Code Data Vault

Figure 4-8 represents the NDC (National Drug Code) Data Vault. More information about NDC source data and the operational system can be found at: http://www.fda.gov/cder/ndc/ (note: if the Link no longer works, search Google for "NDC drugs" and click on the Link available from www.fda.gov). One large difference between this system and the Adventure Works model is that this system has real business users, along with defined metadata for each of the business keys.

The fact that a business uses surrogate keys as business keys dictates that those source system surrogate keys are chosen as business keys for defining their Hubs. As noted in these examples, it is absolutely vital to annotate assumptions, questions, and reasons for designing the Data Vault architecture as the model is built. In the case of Adventure Works there are no business users to speak with, and there is no source system to consult (application logic is missing). Once a standard is chosen, it should be adhered to throughout the life of the design.

4.11 Dependent and Non-dependent Child Keys

Hub business keys may be composite for another reason: dependent business keys. A dependent business key only has context when included with a parent key. However, a dependent business key is important enough to warrant uniqueness and, when coupled with a parent key, uniquely identifies additional data. Dependent business keys are another source for creating composite or multi-field Hub business keys. In Figure 4-9 below, please remember that the table on the left is a source system table represented in 3rd normal form; the dependent child key is the Hub Line Item Number.


A prime example of a dependent child key would be line-item-number. Line-item-numbers exist only within the context of an invoice or an order. They are important in keeping the proper ordering of the line-items on the invoice. Without line-item-numbers, every time the system printed the invoice the line-items would be printed in different ordered sets. Line-item-numbers by themselves make no sense; an attempt to find line item 5 (five) by itself would be difficult, if not impossible. Line-item-numbers depend on parent context (such as order number) to exist.

Figure 4-9: Dependent Child Relationship Modeling

NOTE: THE HUB LINE ITEM IS IN RED AND DOTTED-LINED BECAUSE IT HAS NO CONTEXT, NO MEANING BY ITSELF. THE LINE-ITEM NUMBER IS WHOLLY DEPENDENT ON THE SURROUNDING KEYS FOR CONTEXTUAL RESOLUTION. THEREFORE, THE HUB LINE ITEM SHOULD NOT BE MODELED IN THE PHYSICAL DATA MODEL.

Line-item numbers are known as a dependent child. They are important as a business key, but not by themselves. They must accompany an additional business key to make sense. The Data Vault modeling standards allow multiple representations of the dependent child keys. They can be included in the same Hub with the parent or stand-alone business key, or they can be modeled within a Link table.
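The dependency described here can be sketched as a keyed lookup: a line item number alone is ambiguous, while the (order number, line item) pair is unique. The order data below is hypothetical.

```python
# Sketch: a dependent child key (line item number) has no meaning alone;
# it only identifies data when paired with its parent business key (the
# order number). Values are hypothetical.

order_lines = {
    ("ORD-7001", 1): "widget",
    ("ORD-7001", 2): "gadget",
    ("ORD-7002", 1): "widget",   # line 1 repeats under a different order
}

# Line item 1 by itself is ambiguous; it appears under two different orders:
matches = [key for key in order_lines if key[1] == 1]
print(len(matches))  # 2

# Coupled with the parent order number, it is unique:
assert order_lines[("ORD-7002", 1)] == "widget"
```

This is why the line item either rides along inside the parent's composite key or lives in a Link, but never gets a stand-alone Hub.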


Another example of a dependent child key may be a sub-typed business key representation. The most important question to ask is: does the key stand on its own? Does it have meaning by itself? If the answer is no, and it remains a business key, then it may very well be a dependent child key. Dependent child business keys are not allowed to be modeled explicitly within the model. If they are combined in another parent key's Hub, then they shall not be modeled logically either; Hubs are not allowed to contain foreign keys. However, if they are included in a Link structure (explained in the next chapter), they can be represented logically; this notation is called a weak Hub. In the Link chapter (Chapter 5) they are also referred to as degenerate fields.
4.12 Mining patterns in the Hub Entity

The Hub table brings together previously disassociated business keys. It represents lists of these
business keys in a single common table. For example: a list of all part numbers that appear across
the enterprise. Patterns can be mined from the single list of business keys. By coagulating the
business keys from multiple source systems into a single component, it becomes possible to extract
business value and meaning.
Hubs can be mined for the following information:

• Entry patterns and format masks
• Source system key creation/generation patterns over time
• Possible ontological relationships or hierarchies within keys

By mining the Hub's data it is possible to discover practical associations and ties across business keys. Hierarchies and ontologies can be discovered, which translates into added business value. The results always need to be checked against the business to see if they are false positives. Complex inter-relationships across the internal data patterns and shifts in entry can be discovered. It is interesting to note that the longer the business key, the more likely it is to make these discoveries.

Entry patterns and format masks can also be established. The percentage of data that meets particular patterns can be assigned. The greater the percentage, the more likely the business rule is out there somewhere being utilized. It is possible to tie strength and confidence ratings to percentages of data meeting specific patterns that have been discovered. Just as with the last case, the more data involved in the discovery, the higher the confidence that the discovered pattern is an applicable business pattern.
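Format-mask mining of the kind described here can be sketched as a character-class reduction over the key list; a real profiler would do far more. The keys below are hypothetical.

```python
# Sketch: derive entry-pattern masks from a list of Hub business keys and
# measure what percentage of keys match the dominant mask. Letters reduce
# to 'A', digits to '9'; other characters are kept as-is.

from collections import Counter

def format_mask(key: str) -> str:
    """Reduce a business key to its entry-pattern mask."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in key
    )

keys = ["ABC123", "XYZ789", "QRS456", "12-999"]   # hypothetical customer keys
masks = Counter(format_mask(k) for k in keys)

dominant, count = masks.most_common(1)[0]
print(dominant, round(100 * count / len(keys)))   # AAA999 75
```

The 75% match rate on one mask suggests an edit rule exists somewhere; the outlier "12-999" is exactly the kind of entry the business should be asked about.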


Source system key creation (or broken business requirements) can be discovered as the data set is loaded. Confidence ratings increase as additional data (new keys) arrives to demonstrate that the business rule truly is broken. For example, the business requirement is: "contracts always create new customer keys", but when the data is loaded the pattern states otherwise.

Contracts is responsible for the creation and inflow of customer accounts, and the negotiation of these accounts, before the organization can begin building product for the customers. The data set in the Hub shows that 40% of the new business keys are being created by a financial system. Further discovery shows that it takes 20 days before the customers are synchronized and moved into the contracts system. The business then needs to ask the following questions about their business processes:

• What does this say about the business requirement?
• Can Finance negotiate contracts with the customer?
• How long does it take for the data to move from the contracts system to the financial system?
• How much is it costing the business to NOT have the customers created in contracts first?

What happens when the business key for a certain customer changes as it is passed from Finance to Contracts? What if the programmatic code that changes the key does not record the from-to mapping, or the business user does not record the from-to change when they key it in to the contracts system? What impact does this have on the business? It can be huge, it can be costly, and it can range from the $10 mark to the $10 million mark.

Mining Hub keys for patterns can be a powerful way to validate the data against the business requirements. It provides insight into the gap between the vision the business assumes it is operating under, and the reality of its operational systems coupled with the business processes in place today. This is the fundamental idea behind process improvement, monitoring, and measuring.


4.13 Process of Building a Hub Table

The process is simple in nature; however, it requires a consistent check with the business users, the business application, and the source system data set. At the end of the day, the business application collecting the data has the overriding decision. It is the responsibility of the Data Vault to enable reproduction of the source system as it stood as of a specific point in time; otherwise the commutative property is broken, and the system of record that exists within the Data Vault is compromised.
1) Find the business key
a. Go to the business users and watch how they interact with the operational systems. View their print-outs and application screens. Locate the "find" mechanisms, headers of reports, and dimensional groups they use in their MS Excel spreadsheets.
b. Determine which business keys are truly used in which business units. Do NOT worry or
consider HOW to define the business keys, leave that to the business users later in the
project.
c. Locate the business keys in the source system by examining the record join/find code.
d. Look for business keys in the dusty old data model that is supposed to represent the source
system, look for the primary keys and secondary unique indexes.
e. Pry open the physical data stores on the source systems, look for alternate unique indexes
and primary keys.
2) Validate the Business Keys
a. Check with the business units, balance the data sets and unique indexes that are physically
printed or seen by the business users. Eliminate those keys that are internal only. Many
times the internal keys are there for performance reasons.
b. Validate the business key data by profiling the data set. Discover the consistency, actual
uniqueness; develop metrics against the business keys, their patterns, and their associations
to other records in other systems.
3) Check Business keys against multiple source systems
a. Develop profiling patterns across multiple source systems that are within scope, discover
where the collisions are. Work on resolving the multiple entry patterns that occur. Again, the
focus is not to define these keys, but rather simply to identify the business keys.
4) Finally, build the Hub
a. Define the systems that feed the Hub. Develop data flows that identify potential collisions.
b. Define what to do in case of a collision. Get this answer from the business users by ASKING
them to define which system is the first master, the second master, the third master and so
on.
c. Implement loading paradigms from a staging area to the Hub in the Data Vault
d. Profile the results to produce metrics and measurements about the patterns of the data sets.
e. Publish the results to the entire IT team, the business users, and anyone interested in the
Data Warehouse. BEGIN the data quality improvement process as early as possible.


5) Validate the results in the Hub
a. Reconstruct the LIST of business keys for each system, and balance the keys against each of the source systems to ensure integrity has not been lost.

These are the fundamental steps to building a single Hub within the Data Vault model.
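The loading paradigm mentioned in step 4c can be sketched as an insert-only load: only unseen business keys enter the Hub, each stamped with a load date and record source, and existing keys are never touched. The feed names and key values below are hypothetical.

```python
# Sketch of an insert-only Hub load from a staged key list. Only new
# business keys are added; existing keys keep their original load date
# and record source for auditability. Values are hypothetical.

from datetime import datetime

hub = {}  # business key -> (load_dts, record_source)

def load_hub(staged_keys, record_source, load_dts=None):
    """Insert new business keys only; existing keys are left untouched."""
    load_dts = load_dts or datetime.utcnow().isoformat()
    inserted = 0
    for key in staged_keys:
        if key not in hub:
            hub[key] = (load_dts, record_source)
            inserted += 1
    return inserted

assert load_hub(["P-1", "P-2"], "CONTRACTS", "2011-01-01") == 2
# A second feed re-sends P-2; only the new key P-3 is inserted:
assert load_hub(["P-2", "P-3"], "FINANCE", "2011-01-02") == 1
# The original record source of P-2 is preserved:
assert hub["P-2"][1] == "CONTRACTS"
```

Counting the keys each feed actually inserts (the return value above) is also the raw material for the profiling and publishing called for in steps 4d and 4e.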
4.14 Modeling Rules and Standards for Hub Tables

The Data Vault model is a repeatable, consistent, scalable and flexible technique. There are rules
and standards around each of the table structures that must be followed, or the resulting model will
not qualify as a Data Vault model and will be subject to the risks it was designed to avoid. Below are
the modeling rules and standards that surround a Hub Table.

- A Hub must have at least 1 business key.
- A Hub should not contain a composite set of business keys. ** exception below
- A Hub SHOULD support at least one Satellite. Hubs without Satellites usually indicate "bad source data", poorly defined source data, or business keys that are missing valuable metadata. However, a Hub's Satellites may be hidden because of security restrictions or information hiding paradigms.
- A Hub business key CAN be composite when: two of the same operational systems are using the same keys to mean different things AND these keys collide when integrated back together again. In this case, the record source becomes part of the business key. Please be aware: BAD DATA CAUSES BREAKS IN THESE RULES - THESE ARE GUIDING PRINCIPLES. Exceptions to this rule should not happen (but do); also be aware, bad architecture in source systems causes breaks in these rules too.
- A Hub business key MAY also be composite because the key is utilized as a composite key within the business.
- A Hub's business key must stand alone in the environment - either a system-created key, or a true business key that is the single basis for "finding" information in the source system. A true business key is often referred to as a NATURAL KEY.
- A Hub should contain a surrogate sequence key (if the database doesn't work well with natural keys).
- A Hub's load-date-time stamp or observation start date must be an attribute in the Hub, and not a part of the Hub's primary key structure.
- A Hub's PRIMARY KEY cannot contain a record source (though the business key may, as noted above).
- A Hub may contain a Last-Seen-Date if that grain of tracking is needed.

The rules for Data Vault modeling have not changed (architecturally) since 1997; which makes the
architecture itself stable and easy to use. The rules and standards for modeling are kept up to date
on the following web-site: http://DanLinstedt.com.
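To make the standards concrete, here is one way they could translate into DDL, shown through SQLite for the sake of a runnable sketch; the table and column names (hub_customer, customer_num) are illustrative assumptions, not prescribed names. The surrogate sequence is the primary key, the business key stands alone under a unique constraint, and the load date and record source are plain attributes outside the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_sqn  INTEGER PRIMARY KEY,   -- surrogate sequence key
        customer_num  TEXT NOT NULL UNIQUE,  -- business key stands alone
        load_dts      TEXT NOT NULL,         -- attribute, NOT part of the PK
        record_source TEXT NOT NULL          -- audit attribute, NOT in the PK
    )
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2010-12-01', 'SAP')")

# The unique constraint rejects a second copy of the same business key,
# even when it arrives later from a different source system.
try:
    conn.execute("INSERT INTO hub_customer VALUES (2, 'CUST-001', '2010-12-02', 'CRM')")
except sqlite3.IntegrityError as err:
    print("rejected duplicate business key:", err)
```

The unique constraint is what enforces the "one business key, one Hub row" standard at the database level rather than only in the loading code.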


4.15 What Happens when the Hub Standards Are Broken

The standards, the design, and the architecture of the Hub are based on mathematics: finite complexity, measurable maintenance effort, and the number of rows per block. If the Hub standards are broken (such as introducing a foreign key directly into the Hub) then the flexibility of the model breaks. The adaptability to future business requirements breaks. The ability to load past history (which may not match the relationship definition) breaks. When the rules and standards are broken, it also introduces high levels of re-engineering upstream of the Data Warehouse. It forces business requirements to creep back into the upstream loads. Eventually the business requirements change, and thus force re-engineering to occur in the loading, querying and structuring of the Data Vault. The current architecture of the Data Vault avoids all re-engineering if the rules and standards are adhered to.
If descriptive data is introduced to a Hub, then data over time becomes more difficult to manage.
The complexity of the loading cycle increases. The staging area requires additional copies of the
data set to synchronize it with the final image. It becomes impossible to split data by rate of change
or type of information.
It is neither recommended nor condoned to break the standards of the Data Vault. The engineering work has been done in order to avert pitfalls encountered on typical enterprise data warehousing projects. In fact, if the standards are broken, the model will not qualify as a Data Vault model.
The only risk a pure Hub design has is the width of the business key. If the business key is
comprised of multiple fields (is a composite business key), then it may be possible that the number
of rows per block exceeds the desired count. When this happens, the number of I/Os increases
dramatically to search through the Hub structure and locate the proper business key.
The average Hub row size is accounted for as follows:

    Field                    Average Bytes
    Sequence                             8
    Business Key                        25
    Load Date Time Stamp                 8
    Record Source                       12
    TOTAL                         53 bytes

Figure 4-10: Typical Hub Row Sizing



If the block size is 16,384 bytes (16k) then it can fit approximately 309 rows per disk I/O. If the
block size is 32k, then the Hub can fit approximately 618 rows per disk I/O. With a block size at 64k
the Hub can fit approximately 1236 rows per disk I/O. The best average is around 1000 rows per
block. The Data Vault implementation book covers the mathematics in detail, along with the loading
mechanisms, block sizes, and row widths.
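The rows-per-block arithmetic here is simply the block size divided by the average row width from Figure 4-10; a quick check of the quoted numbers (no new assumptions, just the book's own figures):

```python
ROW_BYTES = 53  # sequence (8) + business key (25) + load date (8) + record source (12)

for block_kb in (16, 32, 64):
    block_size = block_kb * 1024
    rows_per_block = block_size // ROW_BYTES  # whole rows per disk I/O
    print(f"{block_kb}k block: ~{rows_per_block} rows per disk I/O")
```

A wider composite business key increases ROW_BYTES and drops the rows-per-block count, which is exactly the risk described above.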
NOTE: THIS INFORMATION IS TECHNICAL IN NATURE, AND WILL BE COVERED IN DEPTH IN THE DATA VAULT IMPLEMENTATION BOOK, AND IN THE COACHING AREA. THIS INFORMATION IS HERE TO CLARIFY THE PRESENTED TOPIC.

Do not break the rules of the design or architecture. If the rules are broken, the design will suffer re-engineering in the near future. It also breaks the ability to keep costs down from a maintenance perspective. The Data Vault model is based on scalability mathematics involved in computing near-linear scalability from an MPP (massively parallel processing) perspective.


5.0 Link Entities


Link entities act as the flexibility component of the Data Vault model. They are the glue that pulls
together any related association of two or more business keys. Where business keys interact, Links
are created. Link entities are generated as a result of a transaction, discovery, relationship, or
interaction between business units, business processes, or business keys themselves.
Links provide flexibility to the Data Vault model by allowing change to the structure over time. Mutability of the model without loss of history is critical to the success and long-term viability of the enterprise data warehouse. In other words, the model itself can now be adapted, morphed, and changed at the speed of business without loss of auditability or compliance. The Data Vault model gains this flexibility from the Link entity. The Link entity (in data modeling terms) is commonly referred to as an associative entity.
5.1 Link Definition and Purpose

A Link entity is an intersection of business keys. It contains the surrogate IDs that represent the business keys of its parent Hubs and Links. A Link must have more than one parent table. A Link table's grain is defined by the number of parent keys it contains. Each Link represents a unit-of-work (UOW) based on source system analysis and business analysis.

The purpose of the Link is to capture and record the past, present, and future relationship (intersection) of data elements at the lowest possible grain. The Link entity also provides flexibility and scalability to the Data Vault modeling technique. Typical examples of Links include: transactions, associations, hierarchies, and re-definition of business terms.
WARNING: ANY CHANGE TO THE LINK STRUCTURE (LIKE ADDING BEGIN/END DATES, ADDING BUSINESS KEYS), OR CHANGING THE ARCHITECTURAL DEFINITION OF THE LINK, WILL RESULT IN THE NEED FOR RE-ENGINEERING LATER. CHANGES TO THE ARCHITECTURE COMPROMISE THE AGILITY AND FLEXIBILITY OF THE DATA MODEL. DO NOT MAKE CHANGES TO THE ARCHITECTURAL DEFINITIONS.


5.2 Reasons for Many-To-Many Relationships

Within the Data Vault modeling constructs a Link is formed any time there is a 1 to 1, 1 to many, many to 1, or many to many relationship between data elements (business keys). The resulting physical Data Vault can capture what the relationship was, while it captures what the relationship is, and can adapt to what the relationship will be in the future.
Many-to-Many relationships provide the following benefits:
1. Flexibility
2. Granularity
3. Dynamic adaptability
4. Scalability

Many-to-many relationships allow the physical model to absorb data changes and business rule changes with little to no impact to both existing data sets (history) and existing processes (load and query). Businesses must change at the speed of business, and IT must become more agile and responsive to handling those changes. More and more business rules are changing, faster and faster.
Through the Link entity the Data Vault mitigates the need to restructure/redesign the EDW model because the relationship changes. For example: today the business states 1 portfolio can handle many customers, but each customer must be handled by 1 and only 1 portfolio. If the model is designed in a rigid fashion (that is to say with parent-child dependencies) then it represents the current business rules quite well. All is well until the business (tomorrow, next year, or 2 years ago) decides to change their business rule: now, a customer may be handled by 3 or 4 different portfolios. Figure 5-1 demonstrates relationship change over time.

Figure 5-1: Relationship Changes Over Time

One of the problems of modeling today's relationship in any data warehouse is that it makes the structures static. It forces the structures to represent today's relationship rules. These relationships have changed in the past, and will change again in the future. This is the dynamic changing nature of the business: grow, change, or die. The problem with introducing static relationships into the model is that it also re-introduces business rules to the loading processes. It also introduces static relationship enforcement into the loading routines. When the relationship does change, IT is forced to re-engineer the loading routines, the modeling architecture, and the queries to get the data set into the Data Warehouse. This is an unacceptable and un-maintainable cost going forward.

The Data Vault must remain flexible, and not introduce the need for re-engineering as the model grows. By modeling the Links as a many-to-many relationship, we can easily accomplish this goal. The Link table functions to future-proof the model and provide maximum flexibility. Figure 5-2 demonstrates the reason for using a Link table:

Figure 5-2: Link Table Structure Housing Multiple Relationships



Many-to-many relationships ensure that the business associations (past, present, and future) can be added to the warehouse without altering the model or the load routines. The metadata that is currently lost is the nature of the relationship (e.g., 1:1, 1:M, M:1) as documented in the source system (what exactly did the operational model look like?). This must be documented in the metadata of the Link table, hopefully in the Meta Vault. By capturing the metadata in the Meta Vault (including computational functions that create the relationship, along with how it's used) the business can begin to track changes to business knowledge as they relate to the data set and operational systems over time.

The resulting power of this capture mechanism enables the business to monitor the impact of their decisions. Data mining on the Meta Vault and the data set can then perform gap analysis in regards to the quality of the decision and the resulting impact (pre- and post-decision process). If the business adapts its business process, adding a new Link table can be done easily and quickly without re-engineering the entire existing data warehouse. Load routines are isolated from the impact, as are queries and BI processes.

If the model is rigid, then the loading (ETL) designs are also rigid. If a changed business rule meets a rigid architecture, then the result is forced re-engineering. The extent of the impact may cascade into other child tables; thus, the larger the EDW model grows, the larger the possibility for impact, the less agile IT can be in response to business rule changes, and the more it costs (over time) to continue to adjust the EDW architecture to meet business needs. This is the common design pattern that occurs in traditionally modeled warehouses. This impact is completely mitigated by building a Link entity into the Data Vault. The Data Vault therefore is highly scalable, flexible, and now, agile. The Link entity allows the structure to handle changes to business rules without the impact of re-engineering (aka re-factoring), and without the ever increasing cost curve. However, it is suggested that the business rule itself, along with any calculation that produces this data set, be recorded within a Meta Vault. To learn more about the Meta Vault, check out the one-on-one coaching area at: http://danLinstedt.com
5.3 Flexibility

Many-to-Many relationships provide maximum flexibility and agility. The more flexible the model is,
the faster it is to adapt or change. The faster the model can adapt, the less time it takes IT to
respond to business changes. The less time it takes for IT to respond to business changes, the
more work can be done in a shorter amount of time, leading to increased productivity of the IT staff
in the data warehousing environment. Adding new tables (especially Link tables) to the Data Vault is
easy.

Figure 5-3: Starting Model Before Changes

Suppose the model starts out with an order tracking system that knows the customer, order, product, and line-items (see Figure 5-3). Time passes and now the business wishes to add a set of product categories and suppliers. How does the model evolve?

Figure 5-4: Data Vault After Modification


As seen in Figure 5-4, changing the model or adding new structures is a simple process. Not much time or effort is required to make the changes occur and it has no impact on the existing portions of the warehouse. We can add Links to represent the new relationships without having to revise existing structures (to add new foreign key columns) or reload any data. Do not confuse this with the time and effort required to find and establish appropriate business keys. Creating the Hub structures is the first step, and the most important step to take. Suppose the business now wishes to add sales regions to manage both customers and orders as a combined component; how hard might it be to extend the model again?

Figure 5-5: Additional Data Vault Model - More Changes

It is easy to add new entities to the Data Vault model due to the Link structure (see Figure 5-5). There is a discussion in the Business of Data Vault Modeling book about classified information systems. This type of model flexibility lends itself well to protected environments as the secured or encrypted data can be stored in separate entities and easily Linked to unsecured data entities.

Keep in mind that adding new Links across global environments is also possible. This is especially helpful in managing distributed (yet connected) systems. As indicated in Figure 5-6, the Link tables are stored in different global systems; it is up to the application to keep the keys synchronized for query use.


Figure 5-6: Global Data Vault Linking

In this situation, there are many different applications that are synchronizing the operational Data Vault. Most likely the application loading data to the global Data Vault is comprised of web-services and a business rules engine. It is quite possible to have more than one global controller, and to farm out different components of the access depending on the security, geo-location, or other criteria. The control over the data, the loading and querying are beyond the scope of this book, and will be discussed in the book titled: Data Vault Implementation.
5.4 Granularity

Granularity is vital to an EDW; the Data Vault is no different. Grain can be measured by the number of parent tables a Link contains. For each parent, there is a new (lower) level of grain introduced. The same mode of thinking applies when considering fact tables in a Star Schema. For example, what is the grain of the following fact table (see Figure 5-7), and how can it be accurately described?


Figure 5-7: Uncovering Fact Table Grain

The grain of this fact table can be read as: Customer by Product by Sales by Week/Year/Month, etc. Each dimensional key creates a new level of grain for the facts. Grain as defined by this example simply means detailed level of data. Grain in the Data Vault Link tables is no different. The Link tables represent the level of detail that the data is stored at. After converting this Star Schema to a Data Vault, the grain would look like Figure 5-8.

Figure 5-8: Data Vault Grain, Representing Star Schema


When the business requirements indicate a need to record data at a different grain, new Links should be added to the existing Data Vault; old ones are simply no longer fed incoming data (but are retained as they contain historical data). The alternative option is to re-engineer the existing Link to add the new Hub-surrogate-key. Re-engineering is the enemy of flexibility and auditability, and can quickly cause an EDW project to scale out of control. In regards to auditability, the question is: once a new Hub-surrogate-key is added to the existing table, how should it be defined to the business? Especially if the definition has to apply to past-historical data that is stored in the Link already.

The very same question plagues changes to star-schema fact tables; adding a dimensional surrogate to a fact table causes the grain of all the data to change. Then the business asks the next question: can we reproduce a report from last year and compare it to data from this year? Of course, the answer is: technically yes, but what has to happen to the code that drives that report? It has to split into two parts: one part of the code for grabbing history, and a second part of the code for grabbing current data with the new key. Now the project is beginning to take on a much greater cost in terms of maintenance. As changes continue to alter the structure, more code forks are necessary to satisfy the business users' desire for reporting; until one day, the business wakes up and says to IT: "We can't afford any more changes, and why is the system such a mess already?" This is one of the reasons we advocate using the Data Vault model for your core EDW instead of a dimensional architecture; this kind of change will not break a Data Vault.

What lurks in the shadows is even more troubling. Suppose it's the first change, all is well and everyone is happy (as long as access to each data set is governed). Then one day, another business unit decides they need to roll up the data, or summarize the recent data that has the new key. They then combine these results with the old data that doesn't have the new key, and the numbers no longer match. Now they ask IT: why is the data reporting bad numbers?

Accountability has just been destroyed. As stated above, in the situation of new relationships and with the added needs of a data warehouse, it is best to always create new Links for these changes and leave the old ones be. A hint from the implementation book: as data degrades in value (gets older), there's a good chance that the old Link and its data will be backed up, and the old Link will no longer be necessary within the warehouse. This is the beginning of a Data Vault model that truly changes with the business needs.


5.5 Dynamic Adaptability

Link structures enable dynamic adaptability; that is: the ability to define associations or correlated
data sets on the fly. Dynamic adaptability leads to a fluid modeling structure. A data mining tool
with specialized algorithms that mine both the metadata (data model ontology) and the data set is
capable of discovering new relationships that are not yet represented in the model. The data mining
algorithm must include the metadata definitions of terminology that explain data models in order to
apply appropriate context when deciding to Link different data sets (i.e., Hubs) together.
Relationships (i.e., new Links in the Data Vault) created in this fashion must include two additional
attributes: confidence and strength. In other words, how confident is the mining engine (neural
network) that this relationship actually exists and is real, and how strong are the correlations across
the data sets? These two metrics are applied to every row of data that is loaded to the newly formed
Link.
A fluid Data Vault model is constantly adapting and self-learning. Like any neural network, the alterations and learning must be a guided and corrected process; otherwise the neural network may drive the model to an undesired state (possibly unusable). Before these notions are dismissed as theoretical in nature, they must be considered as reality. A company known as NetQuote in Denver, Colorado applied this technique (human-based mining) to build an up-sell Linkage resulting in a 40% profitability increase in the first week.

Learning systems, intelligence systems, and military-grade systems may see the most benefit from this technique. It allows testing of hypotheses without losing any of the historical data which has been captured. Taking advantage of the fluid model requires automated changes to the loading and querying routines; in addition, it requires automated changes to the data marts down-stream.
It is possible to create a learning system that is capable of discovering relationships across data
sets where none existed previously. It is possible to create a system that adapts to newly arriving
elements on the XML feeds, or web-service transactions. It is possible to create a system that
arrives at potential high impact information without the need of up-front human intervention. How to
build these systems is way beyond the scope of this book, but a well-designed Data Vault is a
prerequisite to even starting down this path.
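The two per-row metrics described above (confidence and strength) can be pictured with a small sketch. The structure and threshold below are illustrative assumptions, not a prescribed design; the point is simply that every row of a mined Link carries both ratings, so a guided-learning step can review or filter discovered relationships.

```python
from dataclasses import dataclass

@dataclass
class MinedLinkRow:
    """A row in a discovered (mined) Link carrying the two extra
    attributes the text calls for. Names are illustrative only."""
    link_sqn: int
    cust_sqn: int
    prod_sqn: int
    confidence: float  # how sure the mining engine is the relationship is real (0..1)
    strength: float    # how strongly the data sets correlate (0..1)

rows = [
    MinedLinkRow(1, 11, 25, confidence=0.92, strength=0.74),
    MinedLinkRow(2, 11, 31, confidence=0.41, strength=0.22),
]

# A guided correction step might keep only relationships above a
# review threshold, leaving the rest for human inspection.
accepted = [r for r in rows if r.confidence >= 0.8]
print([r.link_sqn for r in accepted])
```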


5.6 Scalability

The Link entity also provides the model with unlimited scale-out; the same way MPP (massively parallel processing) relies on scale-out (adding new independent processing nodes) to reach larger and larger data sets. The Link entity enables data to sit in different geographical locations, yet be Linked or associated at run-time (see Figure 5-6 above). The Data Vault scalability is limited only by the imagination and the hardware components applied underneath the model.

What is MPP?
Short for Massively Parallel Processing, a type of computing that uses many separate CPUs running in parallel to execute a single program. MPP is similar to symmetric processing (SMP), with the main difference being that in SMP systems all the CPUs share the same memory, whereas in MPP systems, each CPU has its own memory. MPP systems are therefore more difficult to program because the application must be divided in such a way that all the executing segments can communicate with each other. On the other hand, MPP systems don't suffer from the bottleneck problems inherent in SMP systems when all the CPUs attempt to access the same memory at once. http://www.webopedia.com/TERM/M/MPP.html

Physical location of the tables on specific storage devices can be optimized for maximum performance. Figure 5-9 indicates a traditional starting point for the Data Vault architecture on RAID 5 (SAN or NAS disk). This type of architecture provides the lowest cost entry point for a single system. The Data Vault model is flexible enough to grow with the corresponding needs. As performance grows, as data sets grow, as real-time data arrives, the Data Vault model can scale as desired.

Figure 5-9: Traditional Data Vault Storage Layout


When the performance of this architecture falls below expectations, it can be easily adjusted to a new physical architecture as shown in Figure 5-10. Assuming in this case that Link-Customer-Order and Satellite-Customer-Order are growing at an unprecedented rate, or that they contain a massive volume of information, they can be split off physically on to a specialized DASD (direct attached storage disk) with multiple I/O channels. This allows the business management and IT to provide an SLA (service level agreement) which specifies performance-driven metrics around certain queries or processes that load down-stream marts.

Figure 5-10: Performance Physical Split Version 1

The tables can then be further partitioned across multiple I/O channels, additional disk components, or hardware. The architecture allows the performance to be tightly coupled to the physical storage, while allowing the model to be de-coupled from the physical layers. This is an optimal situation for an MPP design. For further performance, additional RAID 0+1 configurations and DASD can be introduced to other table structures, as seen in Figure 5-11.


Figure 5-11: Performance Physical Split Version 2

This process can be repeated again and again across each individual table, and down to the partition level of each individual table. This enables full scale-out MPP style architecture to be executed at the physical level of the Data Vault. This type of design is geared for extremely large systems, and for the flexibility of breaking off parts of the model on to slower equipment, while other parts of the model are placed on high-speed, high-cost equipment.

Figure 5-12: Performance Physical Split Version 3

Additional I/O channels can be added to each disk device for additional parallel access capacity. Partitioning of the tables further enhances performance and parallelism. Further discussions of physical table structuring can be found in the Data Vault implementation book.


5.7 Link Entity Structure

The Link entity structure consists of basic required elements: surrogate sequence id, multiple business sequence keys, load date stamp, and record source. There are additional components that are necessary and helpful in order to meet the applied needs of the data set. Items such as last seen date, confidence rating, strength rating, encryption key, and possibly other metadata elements may be added for query purposes, performance purposes, and discovery purposes as business requires. As technology advances, items such as last seen dates, metadata (including record source), and encryption key may be swallowed by the database functionality.

The Link entity must NEVER contain business keys, or begin and end dates. If a Link structure is compromised, then the flexibility of the model is immediately compromised. If the structure of the Link is compromised then you are sure to need re-engineering in the future. Adding business keys to a Link table ensures that it depends on the business logic for loading; this raises the complexity of the loading routines. Links must contain two or more key sequence fields (from either Hubs or Links) in order to be considered valid; a Link with a single Hub sequence key is considered a "peg leg" Link and is invalid. Figure 5-13 is an example of the Link Entity Structure.

Figure 5-13: Sample Link Structure

Warning: Any compromise made in the structure will lead directly to re-engineering, high maintenance costs, difficulty in growth, lack of flexibility, and problematic real-time in the near future. Never alter the raw structural definitions of the Data Vault.
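The Link structure can be sketched as a key-resolution step: the composite of parent Hub surrogate keys either resolves to an existing Link sequence or drives a new insert. This is a minimal in-memory illustration with assumed names; a real implementation would rely on the database's unique index over the Hub sequence columns.

```python
def get_or_create_link(link_index, hub_keys, next_sqn):
    """Resolve a composite of parent Hub surrogate keys to a Link
    sequence, inserting a new Link row when the combination is unseen."""
    composite = tuple(hub_keys)  # e.g. (cust_sqn, acct_sqn)
    if composite in link_index:
        return link_index[composite], next_sqn
    link_index[composite] = next_sqn
    return next_sqn, next_sqn + 1

links = {}
sqn_a, nxt = get_or_create_link(links, (11, 25), 1)   # new combination -> new Link
sqn_b, nxt = get_or_create_link(links, (11, 25), nxt) # same combination -> same Link
sqn_c, nxt = get_or_create_link(links, (11, 31), nxt) # new combination -> new Link
print(sqn_a, sqn_b, sqn_c)
```

Note that no business keys, begin dates, or end dates appear anywhere in the Link row itself; only surrogate sequences participate.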

5.8 Link Driving Key

In every Link there is a notion called a "driving key". The driving key is the main key that drives the rest of the relationship. The driving key is necessary to identify so that Satellites based on the Link can be appropriately end-dated when the relationship changes. In Figure 5-14, the driving key has arbitrarily been assigned to CUST_SQN.

Figure 5-14: Example Driving Key for Link


What this means in this example is: the account and employee sequences can be re-assigned to a specific customer. For instance: when the warehouse sees CUST=11, ACCT=25, and EMP_SQN=12 on October 14, 2000, it's the first time for this relationship. An insert occurs to the Link, establishing LINK_SQN = 1. Figure 5-15 adds a Satellite (discussed in chapter 6) for illustrative purposes.

Figure 5-15: Example of Link Satellite with Driving Key

In this case, the Link record 1 has 1 Satellite record. What happens when the operational system changes the account number that the customer is associated with? What if the operational system changes the employee that deals with the customer? In each of these cases, we see the following insert (intermediate step) occur. Figure 5-16 depicts the post-insert of the new row in the Link and Satellite.
ellite.

Da
an Linstedt 2010-2011,
2
all
a rights rese
erved

http:///LearnData
aVault.com

Supe
er Charge Yo
our Data Ware
ehouse

P
Page 92 of 15
52

Figure 5-16: Insert to Link/Sat Based on Driving Key

To restore order to the data, row 1 in the Satellite requires that it be end-dated. Why? Because the customer sequence in the Link is the driving key, and the Link received a new record that supersedes the old version. The ETL processing must take into account the driving key in order to make the proper determination of Link records and associations to Satellite rows.

Note: the Driving Key may be a composite. It may be representative of the source system Primary Key, but not always. Sometimes it is a combined view (super-set view) of multiple systems. Choose what makes sense to the business.

The final Figure 5-17 below shows the Satellite row properly end-dated by using the Driving Key to detect changes.

Figure 5-17: Link Driving Key/Satellite End Dated
5.9 Link Examples

For the examples of the Links we have used several different models, including the Microsoft Adventure Works data model and a health-care model. These Link structures do not carry last seen dates nor strength/confidence ratings. Figure 5-18 contains example Link structures found in the current version of the Adventure Works 2008 Data Vault.

Figure 5-18: Example of Link Tables From Adventure Works 2008 Data Vault

In each of these examples, focus on the particular grain of the data set. The last Link (seen bottom right) has a grain of 3 different Hubs: Hub Product, Hub Category, Hub Sub Category. In business terminology this would be read as: Product by Category by Sub Category. These would be known as the dimensions. In technical terms, this is deemed to be the grain of the data which is represented by the Link table.

The combination of Hub Sequences (composite) must form a unique index. This unique index must match 1 to 1 with the generated Link Sequence. The Link Sequence is the primary key of the Link table. This standard is enforced so that the sequences in the Data Vault model can be re-built at any time.

Examine the Link Lnk_WOID_LocID (which stands for: Link Work Order ID by Location ID). Notice its associated keys. It contains the Hub Work Order ID sequence and the Hub Location ID sequence, but it also contains an Oper_Seq. The Oper_Seq turns out to be an operational sequencing number used by the operational system (in this case the Adventure Works application, if there was one) to order all the child records. It also serves to make the combination of the two keys unique. This particular element of a Link table is called a degenerate field.
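A minimal sketch of the Lnk_WOID_LocID standard described above, expressed as SQLite DDL run through Python's sqlite3 module: the composite of Hub sequences (plus the degenerate Oper_Seq) carries a unique index that pairs 1-to-1 with the generated Link Sequence. The exact column names and the sample values are assumptions for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lnk_woid_locid (
    lnk_woid_locid_sqn  INTEGER PRIMARY KEY,   -- generated Link Sequence
    hub_workorder_sqn   INTEGER NOT NULL,
    hub_location_sqn    INTEGER NOT NULL,
    oper_seq            INTEGER NOT NULL,      -- degenerate field
    load_dts            TEXT NOT NULL,
    record_source       TEXT NOT NULL
);
-- Unique composite over the Hub sequences plus the degenerate field,
-- enforced so the surrogate sequences can be re-built at any time.
CREATE UNIQUE INDEX ux_lnk_woid_locid
    ON lnk_woid_locid (hub_workorder_sqn, hub_location_sqn, oper_seq);
""")

con.execute("INSERT INTO lnk_woid_locid VALUES (1, 100, 7, 1, '2010-01-01', 'ADVWORKS')")
con.execute("INSERT INTO lnk_woid_locid VALUES (2, 100, 7, 2, '2010-01-01', 'ADVWORKS')")

# A duplicate of the composite key is rejected by the unique index.
try:
    con.execute("INSERT INTO lnk_woid_locid VALUES (3, 100, 7, 2, '2010-01-02', 'ADVWORKS')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note how the Oper_Seq participates in the unique index: without it, the two rows for work order 100 at location 7 would collide.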


5.10 Degenerate Fields In Links

The degenerate field is also known as a child attribute. It is a degenerate field because it depends on the combination of both of the parents' fields in order to make the relationship unique. The data in this field is meaningless outside the context of the relationship; in other words, the field is not a business key. The field will not function as a Hub. Oper_SEQ as a Hub would only contain sequential integers for ordering data.

This field may also be a date value. In the case of an oil well there may be a need to capture a physical date as to when the well was turned on, because until it is turned on it is not assigned an actual well-number. Another example may be an expiration date on a prescription drug bottle. This date is generally worked into the bar-code, making it a part of a larger business key. In this case, it is also part of the relationship between the drug itself and the packaging material. These degenerate keys are necessary in describing relationships to a higher level of detail; however, by themselves they do not provide significant information to cause the creation of a business key.

Degenerate fields have the following rules:

- They cannot stand on their own (as Hubs)
- They have no business meaning
- They are dependent on other context in order to be defined
- They give meaning and uniqueness to additional relationship information
- They have no descriptors of their own

Examples of degenerate Link fields include sequencing or numbering information; for instance, line-item-sequence on a purchase order or invoice may be called a degenerate Link key. Dates (on occasion) are also degenerate Link fields. However, this case must be carefully examined, as not all dates (such as start/stop, begin/end, and other descriptive dates) should end up as a part of the Link. We discuss begin and end-dating Links in a section below. The degenerate field that is a date is generally a rare case that should be applied sparingly and with caution. It usually is also an indicator or a composite of a business key, making the relationship unique.
5.11 Multi-Temporal Date Structures

The Data Vault is enabled to house multi-temporal views of the information. Multiple date-time
stamps must be defined as data attributes in Satellite structures (defined in the Satellite chapter).
Utilizing the data in a multi-temporal state is accomplished through the query designs.
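Multi-temporal access through query design can be sketched as follows: the Satellite carries two timelines as plain attributes, the load date (when the warehouse learned the value) and a business effective date, and the query filters on both. The Satellite name, columns, and sample rows here are hypothetical; only the pattern of keeping both dates as attributes rather than keys reflects the rule above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE sat_product_price (
    hub_product_sqn INTEGER NOT NULL,
    load_dts        TEXT NOT NULL,   -- when the warehouse learned it
    effective_date  TEXT NOT NULL,   -- when the business says it applies
    price           REAL NOT NULL,
    PRIMARY KEY (hub_product_sqn, load_dts)
)""")
rows = [
    (1, '2010-01-05', '2010-01-01', 10.0),  # learned Jan 5, applies from Jan 1
    (1, '2010-02-01', '2010-01-15', 12.0),  # back-dated correction
    (1, '2010-03-01', '2010-03-01', 15.0),
]
con.executemany("INSERT INTO sat_product_price VALUES (?,?,?,?)", rows)

# "What price did we believe applied on 2010-01-20, using only what the
# warehouse knew as of 2010-02-15?"  Both timelines live in the WHERE clause;
# the table structure never changes.
price = con.execute("""
SELECT price FROM sat_product_price
WHERE hub_product_sqn = 1
  AND load_dts       <= '2010-02-15'
  AND effective_date <= '2010-01-20'
ORDER BY effective_date DESC, load_dts DESC
LIMIT 1""").fetchone()[0]
print(price)
```

Shifting either cut-off date answers a different temporal question against the same, unaltered structure.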


WARNING: DO NOT ALTER THE ARCHITECTURE, NOR MODIFY THE STRUCTURES OF THE
ARCHITECTURE TO GAIN A MULTI-TEMPORAL VIEW OF THE DATA. AS STATED
PREVIOUSLY, ANY DEVIATION FROM THE STRUCTURE WILL CAUSE A SERIOUS BREAKDOWN
OF THE VALUE OF THE DATA VAULT MODEL TO THE BUSINESS. THE REASONS ARE
LISTED IN THE RE-ENGINEERING STATEMENTS THAT ARE MADE THROUGHOUT THIS BOOK.

The structures and standards have been built and tested for over 15 years. The standards have been built to avoid the pitfalls and problems that existing data warehousing models suffer today, including but not limited to: cascading change impacts, scalability issues, flexibility problems, absorption of new systems, and so on. By breaking the standards, you will experience many of the same problems that you have today; you will negate the whole reason for moving to a new data modeling structure!

It is fine to add attributes to Satellites; it is not okay to change the primary keys of the Link or Hub structures. There is a tendency by designers to want to add temporality (date/time keys) to Links' primary key structures and Hub key structures. The original Data Vault design in 1993 allowed this as an option. By 1995, flaws in this design began to appear; as with cracks in the foundation of a home, these flaws were significant enough to warrant a re-definition of the Link structure.

The finalized design was tested and passed with significantly better results: the finalized design allows no temporal date/time elements as part of the primary key of the Links. Allowing temporality as a part of the primary key of the Links caused re-engineering 3 to 6 months later. It is the view of this author that *any* cause of re-engineering should be eliminated if possible, and if not possible, the impacts of changes should be reduced to a minimum. Otherwise, the results are disastrous; akin to an invasive wall-climbing vine that anchors its roots deep in the structure it is climbing. Eventually that structure must be torn down and completely re-built. It is the same for the Link table where the primary key introduces temporality.

One of the foundations of the Data Vault is to enable iterative development by consistently minimizing rework and re-engineering. This not only future-proofs the Data Vault, but also facilitates rapid development, because slight omissions or oversights at initial stages of the Data Vault design can be coped with in an elegant and extremely economic way. Not so when temporal structures are incorporated into the primary key structure, because the primary key changes must be migrated down-stream to ALL child tables, leading to cascading change impacts. The cascading change impacts affect re-engineering efforts of ALL the child tables; from loading routines to queries, everything must be changed.

That said: introducing temporality as a reason to produce a new sequence in the Link, and treating the date/time as attributes within the Link table, exhibits the exact same cause and effect. This practice of altering the Link Entity should never happen.

**CAVEAT** The only time a Link may contain temporality is when it functions as a FACT table; in other words, it is a transactional Link where the data is time-stamped, and the data set cannot be updated or changed in any way (legally by the business).

5.12 Link-To-Link (Parent/Child Relationships)

A Link-to-Link relationship indicates a parent-child arrangement, or a hierarchy of some sort. In this case, it is equivalent to a nested relationship with different levels of grain. An example of a Link-To-Link is below, in Figure 5-19.

Figure 5-19: Example of Link To Link Relationships

For this example, Hub Product and Hub Supplier are both parents to Link A. Link A and Hub Sales Person are both parents to Link B. Link B and Hub Territory are both parents to Link C. This forces a rise in complexity in loading and querying, and it becomes most evident when Satellites are found as children of each Link.


Data modelers often feel the need to put a Link to Link relationship in the model. While these relationships are interesting and may be easy to read logically, they are extremely difficult to implement. From a logical standpoint it is easy to see the parent-child relationships when modeling a Link to Link architecture. Implementing this kind of model is difficult because of the parent-child dependencies during the loading cycle. The following is a discussion on best practices for removing Link to Link from the physical implementation.

Note: it is fine to logically model Link to Link relationships; the problem is when they are expressed in the physical data model. To avert issues and problems, the Link structures should be flattened out, and the hierarchy dependencies removed.

The consequence (borne by the loading processes) of physically implementing this type of structure is as follows:

- It requires the ETL to load sequentially to each child Link; it removes the ability to load all Links in parallel all the time

Mathematically speaking, if we denormalize the Link structures so each Link is connected to its parent Hubs, we can represent the same data set in a flattened manner. The structure becomes simpler to maintain (going forward, it is "future proofed", especially if the relationship in the source system changes) and the structure is easier to load as well as query. Figure 5-20 shows the first flattened hierarchy, and the new structure.

Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy


The first step is to remove the Link A sequence from Link B. The second step is to put the Hubs from Link A directly into Link B. This moves Link B to a peer status of Link A; however, Link B is a different grain. Remember, Link B always had the following grain: Product by Supplier by Sales Person. In order to simplify the model, the loading and the querying processes, the dependency on Link A was removed or flattened. Now, to finish the job, the process is repeated for Link C. Figure 5-21 below shows the completed Link structures which should be implemented physically.

Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy

The final step unhooks the dependency of Link C to Link B. The Links are now successfully flattened (denormalized). This new structure allows all Links to be loaded in parallel, and allows all queries direct access to the data set through the Hubs. This means that if the queries need access to other Links, it will be available based on direct request. It also has the following effect: there can be records in Link C which do not exist in Link B! There can be records which exist in Link B which do not exist in Link A.

This is absolutely vital to having a Data Warehouse capable of absorbing 100% of the data 100% of the time (within scope). If the other dependencies were in place, we would be forced to create parent records for the entries in Link B and Link C just to load the appropriate data set.

Please note: the process of removing Link-To-Link relationships should happen after the logical model has been built, as it is easier to accomplish once the correct relationships have been established.
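The flattening described above can be sketched as DDL: Link B repeats the Hub sequences that Link A carries, instead of referencing Link A's surrogate. The table and column names are illustrative assumptions based on Figures 5-19 through 5-21.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Link A: Product by Supplier
CREATE TABLE lnk_a (
    lnk_a_sqn        INTEGER PRIMARY KEY,
    hub_product_sqn  INTEGER NOT NULL,
    hub_supplier_sqn INTEGER NOT NULL
);
-- Link B, flattened: Product by Supplier by Sales Person.
-- It repeats the Hub sequences from Link A instead of carrying lnk_a_sqn,
-- so both Links can be loaded in parallel with no dependency.
CREATE TABLE lnk_b (
    lnk_b_sqn           INTEGER PRIMARY KEY,
    hub_product_sqn     INTEGER NOT NULL,
    hub_supplier_sqn    INTEGER NOT NULL,
    hub_salesperson_sqn INTEGER NOT NULL
);
""")

# A row can now exist in Link B with no matching row in Link A at all.
con.execute("INSERT INTO lnk_b VALUES (1, 10, 20, 30)")
orphans = con.execute("""
SELECT COUNT(*) FROM lnk_b b
LEFT JOIN lnk_a a
  ON a.hub_product_sqn  = b.hub_product_sqn
 AND a.hub_supplier_sqn = b.hub_supplier_sqn
WHERE a.lnk_a_sqn IS NULL""").fetchone()[0]
print(orphans)
```

The orphan count of 1 demonstrates the effect called out above: Link B rows can be absorbed even when no corresponding Link A row was ever recorded.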


5.13 Link Applications

Link structures are all defined the same way. There are several different applications of Links which require discussion, introduction, and definition. It is these types of applications that are discussed in the following sections. The different types of Links include the following:

- Hierarchical Links
- Same-As Links
- Transactional Links
- Exploration Links
- Low-Value Links
- Computed and Aggregate Links

In some cases, the data within the Links is derived or computed by one or more business processes, thus resulting in a Link which contains non-auditable data; or at the very least, data which never existed in the source system. If this is the case, mark those rows with the appropriate system-generated record source, or process name. Some of these cases include utilizing a Data Quality engine to produce similarity across names and households, business names, product names, etc. Other cases include using aggregate functions to produce corporate vision information that is used to drive the business in a day-to-day decision making function.

A majority of the time, these computations and aggregations or results of processing belong in a business Data Vault, which is defined in the upcoming book: Quick Start Guide to Business Data Vaults.

5.14 Hierarchical Links

Hierarchical Links are just what their name implies: a Link structure which contains N levels of hierarchical data from the same Hub. For example, consider the case of an employee who reports to a manager. Their manager is also an employee who happens to report to a director, and so on. The Hierarchical Link allows roll-ups and aggregation of lower level data into a tree topology, or tree-like organization.

Hierarchies are a form of ontologies. Let's look at an organizational chart and an example of a Hierarchical Link. Remember, the Hierarchical Link is an application of the standard Link structure. It does not change the Link structure nor violate the rules in any fashion.


Figure 5-22: Example Organization Structure

The following assumptions are made about the organization shown in Figure 5-22:

- Each division office is not a store-front
- The executive office is not a store-front
- Each office has its own office business key identifier
- Each store has its own store business key identifier

In this case, different stores report to different divisional offices, and the division offices report to an executive office. The Data Vault model would appear as Figure 5-23:

Figure 5-23: Hierarchical Link for Offices

The Hierarchical Link is shown in purple above the Hub Office table. The hierarchical Link contains the office sequence twice; once for the "root" office, once for the "parent" office. There can be as many office roll-ups as needed. The reason for extrapolating the hierarchy to a many-to-many relationship is that relationships in business change over time. So the representation of the hierarchy today may not be the same as yesterday, or even the same as tomorrow.

The Data Vault model is isolated or decoupled from the impact of business change!

Another example of a hierarchical Link might be the employee structure described earlier. In this case, all employees have a badge number; all employees are just that: employees. Since the tree would look very similar to the business tree above, it will not be repeated here. The Data Vault model for this structure would appear in Figure 5-24:

Figure 5-24: Example Hierarchical Link of Employees

Remember: the application of the Link does not change the structure of the Link. It merely defines a new use for the Link. Do NOT add begin and end dates to Link structures, including the applications of the Links discussed here. Changing the structure of the Link can break the flexibility of the Data Vault model and introduce re-engineering in the near future (as soon as the business changes).
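A minimal sketch of the office hierarchy above: the Link references Hub Office twice, once in the child role and once in the parent role. All names and sample business keys here are hypothetical illustrations of Figure 5-23, not the book's standard naming.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_office (
    hub_office_sqn INTEGER PRIMARY KEY,
    office_bk      TEXT NOT NULL UNIQUE   -- office business key
);
-- Hierarchical Link: the office sequence appears twice, once for the
-- child ("root") office and once for its parent office.
CREATE TABLE lnk_office_hierarchy (
    lnk_sqn            INTEGER PRIMARY KEY,
    hub_office_sqn     INTEGER NOT NULL REFERENCES hub_office,
    hub_office_par_sqn INTEGER NOT NULL REFERENCES hub_office,
    UNIQUE (hub_office_sqn, hub_office_par_sqn)
);
""")
con.executemany("INSERT INTO hub_office VALUES (?,?)",
                [(1, 'EXEC'), (2, 'DIV-EAST'), (3, 'STORE-42')])
con.executemany("INSERT INTO lnk_office_hierarchy VALUES (?,?,?)",
                [(1, 2, 1),   # division reports to the executive office
                 (2, 3, 2)])  # store reports to the division

# Roll a store up to its parent office's business key.
parent = con.execute("""
SELECT p.office_bk
FROM lnk_office_hierarchy l
JOIN hub_office c ON c.hub_office_sqn = l.hub_office_sqn
JOIN hub_office p ON p.hub_office_sqn = l.hub_office_par_sqn
WHERE c.office_bk = 'STORE-42'""").fetchone()[0]
print(parent)
```

Because the hierarchy lives in Link rows rather than in the Hub, re-organizing the business only inserts new Link rows; the Hub and the Link structure itself never change.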
5.15 Same-As Links

Same-As Links are another type of application of the Link structure. In this case, the data set is applied as resolution information. In other words, all data exists at the same semantic grain or has the same meaning to the business. All data are peers to one another. In these cases, the differently spelled names of companies all represent the same company. Figure 5-25 below demonstrates business data that identifies the same-as concept.


Figure 5-25: Same-As Link Example, Business Data

There are as many names for corporations as there are corporations in the world today. Why? Because no-one can seem to (or wants to) spell them the same way. There are many different reasons why names are multiplied across the company (ranging from incorrect business incentives to simple mistyping), especially if there are different systems, or internal and external feeds bringing data together. Whatever the case may be, there are just a few basic rules that need to be followed when preparing the data set to be loaded to the Same-As Link:

- A business user (not IT) must pick a MASTER spelling to which all similar spellings will map
- OR: a neural-net data mining engine must pick a best-guess spelling to which all similar spellings will map

In each case for the example above, a master spelling has been chosen. This can be thought of as a step in the direction of defining master data for use in the operational systems. The Data Vault model for this example would appear in Figure 5-26:

Figure 5-26: Same-As Link Data Vault Model

Remember that even though the application (or usage) of the Link varies, the Link structure stays the same.
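The Same-As pattern can be sketched as a Link whose two legs both reference the same Hub: one leg in the "master" role, one in the "variant" role. The company names and column names below are hypothetical examples, not drawn from the book's figures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_company (
    hub_company_sqn INTEGER PRIMARY KEY,
    company_bk      TEXT NOT NULL UNIQUE
);
-- Same-As Link: both sides point to Hub Company; one is the MASTER
-- spelling, the other a variant spelling that resolves to it.
CREATE TABLE lnk_company_same_as (
    lnk_sqn             INTEGER PRIMARY KEY,
    hub_company_mst_sqn INTEGER NOT NULL REFERENCES hub_company,
    hub_company_dup_sqn INTEGER NOT NULL REFERENCES hub_company,
    UNIQUE (hub_company_mst_sqn, hub_company_dup_sqn)
);
""")
con.executemany("INSERT INTO hub_company VALUES (?,?)",
                [(1, 'IBM'), (2, 'I.B.M.'), (3, 'Intl Business Machines')])
con.executemany("INSERT INTO lnk_company_same_as VALUES (?,?,?)",
                [(1, 1, 2), (2, 1, 3)])  # both variants map to the master

# List every variant spelling that resolves to the chosen master.
variants = [r[0] for r in con.execute("""
SELECT d.company_bk
FROM lnk_company_same_as l
JOIN hub_company m ON m.hub_company_sqn = l.hub_company_mst_sqn
JOIN hub_company d ON d.hub_company_sqn = l.hub_company_dup_sqn
WHERE m.company_bk = 'IBM'
ORDER BY d.company_bk""")]
print(variants)
```

All spellings remain as auditable Hub rows; the resolution to a master is expressed purely as Link rows, so it can be revised without touching the raw keys.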

5.16 Begin and End Dating Links

There is an urge by many data architects to add the notion of time to Link structures. It feels like a natural thing to do; however it will break the flexibility of the Data Vault, and sooner or later (when the business changes its mind on how the relationship is defined), it will require re-engineering of the load processes and the SQL access processes. Figure 5-27 below depicts a Link with begin/end-dates embedded.

Figure 5-27: Incorrect Link with Begin/End Date

The process of putting a date in the Link won't hurt anything technically. However, it increases the chance that the data set will be utilized the wrong way. In other words, the IT person will now have to answer questions like:

- What does the date mean?
- How is it documented?
- How is it computed?
- Can the date change?
- What effect does it have on the associative key structure?
- What does it do to the meaning of the surrogate key?

All of these questions arise, along with complications in loading, querying and mining, when the structure of the Link is compromised. Every field in the Data Vault has a specific purpose, and exists in a specific place for one or more business reasons. Begin and End dates describe when a relationship is active / inactive. The purpose of a Link is to establish the fact that a relationship exists (remember: right, wrong or indifferent, with no regard to time).


Example: A patient checks in to a hospital and receives an ID tag with a number on it. Immediately their tag is scanned and the system associates the patient data with the location of the hospital where they checked in. ID is paired up by Location. Whether or not the ID number is right or wrong, or the location in the computer is right or wrong, it is the data that was generated by the check-in event. The Link would model ID by Location.

Once the association has been established, it is a fact that it existed at that point in time in the operational system. The fact stands for all time; the fact is neither right nor wrong, nor does it start or stop. It is a relationship that the source system recorded; therefore the Data Vault records it as well.

Remember, a patient can have many different interactions with that single location at different times. This means that the dates and times in this example are in fact descriptive in nature. Therefore the temporality of the Link data must be described in a Satellite in order to maintain the proper structures. Remember this: adding begin and end dates to Links changes the grain of the data (or business key) in the Link. Figure 5-28 below shows the effect of adding begin and end dates to the Link structure.

Figure 5-28: Begin & End Dates in Links
In this example, the driving key comes in to question. Some of the many questions that appear as a result: What do the BEGIN and END dates mean? What do they represent? How are they generated? In this case, if the source system creates begin and end dates, then the business user has complete control over these, and they cannot accurately depict a proper time-line for the system-driven relationship. Why? Because the business users can back-date the begin and end date sequences.

There's a more technical reason that works against this structure. In Figure 5-29 below, it is clear to see that the same relationship can be represented multiple times over time, which extends the meaning of the unique business key sequence.

Figure 5-29: Example of Poorly Constructed Link

Links with begin and end dates cause problems IF and WHEN there are Satellites attached as children. They can also cause queries to produce Cartesian results when joins are made that ignore "current" or "single record" access. Results can be disastrous across joins, and performance will slow to a crawl (just as it does in HUGE Fact Tables) due to the lack of unique entries in the Link table.

In order to track begin and end cycles of relationships, the best practice solution is to place them in an Effectivity Satellite. These are discussed in the Satellite Chapter (Chapter 6). However, Figure 5-30 below shows an example of an Effectivity Satellite off the Link. Note: other Satellite data (Sat_Cust_Acct_Emp_Details) has been shortened due to screen real-estate.

Figure 5-30: Satellite Effectivity on a Link
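The Effectivity Satellite approach can be sketched as follows: begin and end dates live in a Satellite hanging off the Link, never in the Link's key. Table and column names are illustrative assumptions based on Figure 5-30.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lnk_cust_acct_emp (
    link_sqn INTEGER PRIMARY KEY,
    cust_sqn INTEGER NOT NULL,
    acct_sqn INTEGER NOT NULL,
    emp_sqn  INTEGER NOT NULL
);
-- Effectivity Satellite: begin/end dates are decoupled from the Link's
-- key associations, so the Link grain never changes.
CREATE TABLE sat_cust_acct_emp_eff (
    link_sqn   INTEGER NOT NULL REFERENCES lnk_cust_acct_emp,
    load_dts   TEXT NOT NULL,
    begin_date TEXT NOT NULL,
    end_date   TEXT,              -- NULL = relationship still active
    PRIMARY KEY (link_sqn, load_dts)
);
""")
con.execute("INSERT INTO lnk_cust_acct_emp VALUES (1, 11, 25, 12)")
con.execute("INSERT INTO sat_cust_acct_emp_eff VALUES (1, '2000-10-14', '2000-10-14', NULL)")

# When the relationship is retired, only the Satellite is touched;
# the Link row remains an unaltered statement of fact.
con.execute("""UPDATE sat_cust_acct_emp_eff
               SET end_date = '2000-11-02' WHERE link_sqn = 1""")
active = con.execute(
    "SELECT COUNT(*) FROM sat_cust_acct_emp_eff WHERE end_date IS NULL").fetchone()[0]
print(active)
```

Front-end applications or mining logic can set these Satellite dates freely, exactly because they carry no structural weight in the Link itself.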


Notice that the begin and end dates are now unhooked or decoupled from the key associations. These dates can now be set and controlled by front-end applications as well as data mining logic.

5.17 Low Value Links

Low value Links are another application of the Link structure. Low value Links provide associations without any context. They exist as an association for the sake of joining two or more business keys together. They may even be categorized (in some cases) as exploration Links (see below). An example of a Low Value Link might be something that joins part number to secondary supplier ID, in the cases where primary suppliers are important but secondary suppliers are rarely used. Low value Links may also be called computed aggregate Links (see below). They may supply roll-up aggregation points at higher levels of grain.

5.18 Transactional Links

Transactional Links are defined to be a data set which cannot legally change; in other words, it is transactional history. Any transaction that cannot legally be edited qualifies for the transactional Link. The easiest qualification for a transactional Link would be to call it an unalterable fact. In other words, once issued, the record stays intact as auditable history forever.

There are two ways to model the application of this data within the Data Vault. The first is a traditional method: Link and Satellite (modified to have no history); the second is to place all the data in the Link structure itself. Figure 5-31 indicates the first method for setting up a transactional Link.

Figure 5-31: Transactional Link Example

Transactional data is loaded directly to the transactional tables (both the Link and the Satellite). In
the above example, the transaction number is included in the Link for unique key structuring, while
the transactional date and time is included in the Satellite. It is possible in this circumstance to use
the transactional date as the Load date if and only if the time at which the transaction is loaded to
the Data Vault is relatively close (within seconds) of the actual transaction date itself. Otherwise it is
important to separate the data set to accurately represent it.
Although it is not pictured here, the transactional representation of the Satellite does not need to
store the Load Date, as in most cases it will match the Load Date housed in the Link parent. There
is however an exception to this rule: in some specific cases, the transaction is delivered in two parts
from two different streams, just milliseconds apart. In this type of real-time case, the Load Date
should be modeled and stored in both the Link parent and the Satellite in order to properly
represent the different arrival timings.
Figure 5-31 above also indicates a slightly modified Satellite structure (again, discussed in the
Satellite Chapter). In this case, there is no load-end-date in the Satellite, indicating there is no
history; in other words once the data has been added to the Data Vault it cannot be changed or
superseded with new information.
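The traditional method above (Link plus a no-history Satellite) can be sketched as follows. The transaction number sits in the Link for unique key structuring, the transaction date/time sits in the Satellite, and the Satellite deliberately has no load-end-date because rows are insert-only and never superseded. All names are hypothetical illustrations of Figure 5-31.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Transactional Link: the transaction number is part of the key structure.
CREATE TABLE lnk_transaction (
    lnk_txn_sqn     INTEGER PRIMARY KEY,
    hub_account_sqn INTEGER NOT NULL,
    transaction_nbr TEXT NOT NULL,
    load_dts        TEXT NOT NULL,
    UNIQUE (hub_account_sqn, transaction_nbr)
);
-- Its Satellite carries NO load-end-date: insert-only, never superseded.
CREATE TABLE sat_transaction (
    lnk_txn_sqn INTEGER NOT NULL REFERENCES lnk_transaction,
    txn_dts     TEXT NOT NULL,     -- transaction date/time (descriptive)
    amount      REAL NOT NULL,
    PRIMARY KEY (lnk_txn_sqn, txn_dts)
);
""")
con.execute("INSERT INTO lnk_transaction VALUES (1, 500, 'TXN-0001', '2010-06-01')")
con.execute("INSERT INTO sat_transaction VALUES (1, '2010-06-01 09:30:00', 125.50)")

# Confirm the no-history shape: the Satellite has no end-dating column.
cols = [r[1] for r in con.execute("PRAGMA table_info(sat_transaction)")]
print('load_end_dts' in cols)
```

The absence of the end-date column is the structural signal that this Satellite holds unalterable facts rather than versioned descriptions.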
There is another option within the Data Vault for modeling transactional data, where the information
is housed directly in the Link structure. This architecture is not preferred as it changes the
architectural design by introducing decisions to the design process. Therefore it increases the
complexity of the maintenance cost and loading routines. In certain circumstances where
performance is absolutely required to the millisecond level (or lower), it may be necessary to
structure the transactional Link as shown in Figure 5-32:


Figure 5-32: Transactional Link, No Satellite

The only issue to watch for with this type of Link is the width of the data set. The width can easily become too large, and quickly cut down on the number of rows per block. If the Link becomes too wide, the performance of both the load and the queries will decrease. Transactional Links are generally built to house insert-only, rapid-fire transactions which arrive on a continuous multi-stream basis directly from the operational systems into the Data Vault. The decision to adopt this modeling structure must be made on a case by case basis.
5.19 Computed Aggregate Links

Computed aggregate Links are similar to Fact tables in a dimensional model. Computed aggregate Links have a record source that is labeled "system generated". Computed aggregate Links are utilized to house pre-computed data sets like totals, summaries, averages, minimums and maximums. They are part of the multi-layer (scale free) architecture that the Data Vault offers. Typically computed aggregate Links are found only in the architectural component known as the business vault. However, there may be times when they provide value to the raw data sets, and hence will be found in the raw Data Vault.


Figu
ure 5-33: Exa
ample of Com
mputed Aggreegate Link
The data found in computed aggregate Links are gene
erally not aud
ditable, as th
hey are mach
hine
computed within the Data Vault and are not
n part of an
ny source sysstem. Caution: if the resu
ults
houssed in the co
omputed aggrregate Links are found in financial rep
ports, or on a corporate e
executives
deskktop, they ma
ay become auditable ass they are utilized by busi ness users too run the bussiness.
For further exploration: In the example
e above (Figurre 5-33), the ssuppliers are interested in
knowing th
heir total saless of each prod
duct by store te
erritory. Ratheer than producce a separate
data mart,, the architects decided to in
nclude the pre
e-computed agggregate direcctly in the Data
a
Vault. The
e function F(x) determines th
he business ru
ules for aggreggation and posssibly cleansin
ng;
which mayy or may not in
nclude productt roll-ups to higher level asssemblies.
This type of
o structural co
omponent is an
a add-on, and
d is not consid
dered to be pa
art of the core
Data Vaultt model. The add-on
a
is similar to other qu
uery assist tab
bles that provid
de pre-built
answer se
ets to routines that load dow
wn-stream data
a marts.

The implementation book will cover the details of how to load and query such a Link; for the purposes of explanation, implementing a computed aggregate Link directly in the Data Vault assists with the query aspects of virtual data marts.
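As a minimal sketch of the idea, the following Python/SQLite fragment builds a computed aggregate Link from a sales Link and its Satellite. All table and column names (lnk_sale, sat_sale, lnk_sales_agg) are invented for illustration and are not taken from the book's models; the business-rule function F(x) is reduced here to a plain SUM().

```python
import sqlite3

# Hypothetical structures; names are assumptions, not the book's model.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lnk_sale (sale_sqn INTEGER PRIMARY KEY,
                       product_sqn INT, territory_sqn INT);
CREATE TABLE sat_sale (sale_sqn INT, load_dts TEXT, amount REAL,
                       PRIMARY KEY (sale_sqn, load_dts));
""")
con.executemany("INSERT INTO lnk_sale VALUES (?,?,?)",
                [(1, 10, 100), (2, 10, 100), (3, 11, 200)])
con.executemany("INSERT INTO sat_sale VALUES (?,?,?)",
                [(1, "2011-01-01", 50.0), (2, "2011-01-01", 25.0),
                 (3, "2011-01-01", 70.0)])

# F(x) here is a plain SUM(); a real business vault load could also
# cleanse or roll products up to higher-level assemblies.
con.executescript("""
CREATE TABLE lnk_sales_agg AS
SELECT l.product_sqn, l.territory_sqn, SUM(s.amount) AS total_sales
FROM lnk_sale l JOIN sat_sale s ON s.sale_sqn = l.sale_sqn
GROUP BY l.product_sqn, l.territory_sqn;
""")
rows = con.execute(
    "SELECT * FROM lnk_sales_agg ORDER BY product_sqn").fetchall()
```

Because the aggregate is materialized as its own table, down-stream marts can query it directly instead of re-running the roll-up on every request.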


5.20 Strength and Confidence Ratings in Links

The study of the Data Vault as a neural network introduces a number of concepts, one of which is the idea that the tables in the Data Vault act as the data storage, or nodes, in a fuzzy logic algorithm. To do so, the neural network needs to establish associations with strength and confidence ratings. In querying the Link structures, the neural network can learn the context housed within, and determine whether the relationship needs to be improved, whether it is the strongest relationship, or whether it is the weakest.
The strength rating (when added to the Link) is the result of data mining efforts to establish a correlation across the related data sets. In other words, if there are two Hubs with Satellites that both describe cars (maybe different types of cars), then an association or relationship can be formed with a fairly high strength of 90% or above (for example). But it is more specific than that: it is based on EACH BUSINESS KEY association. In this example, there are two cars with different VIN numbers that were recorded by two different systems. They each describe a blue, front-wheel-drive car with 250k miles and a matching 1998 make and model. They are each connected to similar owners/drivers in different states. The inference might be a 90% chance that these are the same car.
The confidence rating must be added to the Link in conjunction with the strength rating so that we know how confident the knowledge engine is in the rating it has provided. In the example of the cars above, the confidence may only be 60% because the drivers might have different names and never have had the same address. However, maybe the confidence is 90% because a mining effort across the drivers reveals a family relationship between the drivers.
Strength and confidence can be added to the Link structure on a row by row basis, and are utilized by analytics routines to filter out important correlations. Of course, these strength and confidence ratings may change depending on the question being asked; at that point, the ratings may need to be recalculated so that they make sense for the knowledge being sought. The neural net engine that is making the assumptions and assigning these calculations should utilize an industry vertical ontology that describes the business terms. Otherwise, spotting the context for associations will be difficult if not impossible.
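A toy illustration of how strength and confidence might be scored for a candidate business-key match: this is not a published Data Vault algorithm, just a sketch in which strength is the share of comparable attributes that agree, and confidence is the share of attributes that were comparable at all. A real engine would use data mining and an industry ontology.

```python
# Hedged sketch only; the scoring rules and attribute names are invented.
def strength(attrs_a, attrs_b):
    """Fraction of shared descriptive attributes that agree (0.0 to 1.0)."""
    shared = set(attrs_a) & set(attrs_b)
    if not shared:
        return 0.0
    hits = sum(1 for k in shared if attrs_a[k] == attrs_b[k])
    return hits / len(shared)

def confidence(attrs_a, attrs_b, total_attrs):
    """How much evidence backed the strength rating: the share of known
    attributes that were actually comparable on both sides."""
    return len(set(attrs_a) & set(attrs_b)) / total_attrs

# Two car records captured by two different systems (different VINs):
car_a = {"color": "blue", "drive": "FWD", "year": 1998, "miles": "250k"}
car_b = {"color": "blue", "drive": "FWD", "year": 1998, "owner_state": "CO"}

s = strength(car_a, car_b)       # 3 of 3 shared attributes agree -> 1.0
c = confidence(car_a, car_b, 5)  # only 3 of 5 known attributes comparable -> 0.6
```

Storing the pair (s, c) on the Link row lets analytics routines filter associations by both how well they match and how much evidence supports the match.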


Note: this type of activity brings into focus another application of the Link structure called a Dynamic Link. Dynamic Links are discovered and created by machine learning algorithms. The data in the Dynamic Links is generally not auditable (as fuzzy logic rarely produces the same result twice). They are very similar in nature to exploration Links (described below); however, the difference is that Dynamic Links are machine driven, while exploration Links are manually created.

What you can do with Dynamic Linking is limitless. The Data Vault model is a scale free architecture, which allows you to explore different Linking constructs until you find the right one that represents the business. It is also the very same reason that the Data Vault model is future proof, in that it can absorb any future change without changing the nature of the historical data that has already been collected.
5.21 Exploration Links

Exploration Links are short-circuits to the joins across the Data Vault and are placed into the Data Warehouse for business reasons only. They are manually generated and maintained; however, if an exploration Link proves to be valuable to the business, the loading cycle can be automated. Exploration Links are a form of computed aggregate Links. They may or may not contain computed attributes.
Exploration Links are not auditable. The architect and BI team implement exploration Links to cross several different parts of the model. They may span two different Hubs which are spread across the model, and not directly Linked by the source systems. A small company in Denver called NetQuote installed an exploration Link to determine up-sell potential for targeted ads and discounts to their web-customers as they clicked through the system.
This company saw a 40% increase in profitability as a result of the exploration Link, and so found a reason to implement it on a consistent basis. This company also built an Operational Data Vault that was loaded at the time each transaction was generated from the web front-end. The Operational Data Vault was hooked to the message bus for both incoming and outgoing message routing.
Exploration Links are encouraged once the base Data Vault has been constructed and is in operation. These Links can be created, queried, and destroyed at will without the destruction of history within the Data Warehouse. Exploration Links give the business user and the IT architect a chance to play with the data set within the Data Vault, hopefully resulting in new questions and answers that can be viewed from the data warehouse.
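To make the "created, queried, and destroyed at will" point concrete, here is a hedged sketch (all Hub, staging, and Link names are invented for illustration) of an exploration Link built as a short-circuit join between two Hubs the source systems never linked, then dropped without touching the Hubs or their history.

```python
import sqlite3

# Sketch only: structures are hypothetical, not the NetQuote design.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (cust_sqn INTEGER PRIMARY KEY, cust_key TEXT);
CREATE TABLE hub_product  (prod_sqn INTEGER PRIMARY KEY, prod_key TEXT);
CREATE TABLE clickstream  (cust_key TEXT, prod_key TEXT); -- staged behaviour
""")
con.executemany("INSERT INTO hub_customer VALUES (?,?)", [(1, "C1"), (2, "C2")])
con.executemany("INSERT INTO hub_product VALUES (?,?)", [(1, "P1")])
con.executemany("INSERT INTO clickstream VALUES (?,?)", [("C1", "P1")])

# Short-circuit join between two Hubs the sources never linked directly:
con.executescript("""
CREATE TABLE lexp_upsell AS
SELECT c.cust_sqn, p.prod_sqn
FROM clickstream s
JOIN hub_customer c ON c.cust_key = s.cust_key
JOIN hub_product  p ON p.prod_key = s.prod_key;
""")
found = con.execute("SELECT * FROM lexp_upsell").fetchall()

# Destroyed at will; the Hubs (and their history) are untouched.
con.execute("DROP TABLE lexp_upsell")
hubs_intact = con.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
```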


6.0 Satellite Entities


Satellite entities are the warehousing portion of the Data Vault. Satellites store data over time. Satellites are comprised of descriptive data that provide context to the keys and associations at a point in time or over a time period. Descriptive data in warehouses often changes; the purpose of the Satellite is to capture all deltas (all changes) to any of the descriptive data which occurs.

Satellites are typically arranged by type or classification of data, and rate of change. There are many different manners in which to set up classifications of data within a Satellite. For example, the attributes could be classified by data type, or by content, or by context, each of which will yield the same result physically but a different result in the understanding or interpretation of the model.

Rate of change is yet another classification of Satellite data. Rate of change allows the Satellite to split away groups of fields that change more quickly than others. This prevents or removes the need for column data replication (of the slower changing attributes). By splitting the Satellites by rate of change, the rows are also reduced in size, allowing the data to insert more quickly and be more responsive to real-time feeds. The lower the latency of arrival, the faster the database must respond with insert speed; the nature of these mechanics will be covered in the Data Vault implementation book.
6.1 Satellite Definition and Purpose

A Satellite is a time-dimensional table housing detailed information about the Hub's or Link's business keys. The purpose of the Satellite is to provide context to the business keys. Satellites are the data warehouse portion of the Data Vault. The Satellite tracks data by delta, and only allows data to be loaded if there is at least one change to the record (other than the system fields: sequence, load-date, load-end-date, and record source). A Satellite can have one and only one parent table.
Satellites provide the descriptive data about the business key, or about the relationship of the keys, and they describe how that relationship changes over time. Their job is to record the information as it is loaded from the source system. They use load dates and load-end-dates to indicate record lifecycles, because most database systems today are not capable of internally representing time-series properly. Satellites often provide data normalization for future proofing, scalability, and auditability of the data sets. How normalized a Satellite gets is a function of the design and a choice made by the designer.
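The delta rule above can be sketched in a few lines of Python. This is a simplified, in-memory illustration (not production loader code, and the row layout is an assumption): a new Satellite row is stored only when a non-system attribute changed, and the prior version is end-dated.

```python
# Minimal sketch of the Satellite delta rule. System columns never count
# as a change; only the descriptive payload is compared.
SYSTEM_COLS = {"parent_sqn", "load_dts", "load_end_dts", "rec_src"}

def load_satellite(sat_rows, incoming):
    """Insert incoming only if its descriptive data differs from the
    current (open) row for the same parent key. Returns True if stored."""
    current = next((r for r in reversed(sat_rows)
                    if r["parent_sqn"] == incoming["parent_sqn"]
                    and r["load_end_dts"] is None), None)
    def descr(row):
        return {k: v for k, v in row.items() if k not in SYSTEM_COLS}
    if current is not None and descr(current) == descr(incoming):
        return False                                   # no delta: skip
    if current is not None:
        current["load_end_dts"] = incoming["load_dts"]  # close old version
    sat_rows.append(dict(incoming, load_end_dts=None))
    return True

sat = []
row1 = {"parent_sqn": 1, "load_dts": "2011-01-01", "load_end_dts": None,
        "rec_src": "CRM", "name": "ABC Corp"}
load_satellite(sat, row1)                              # first version stored
row2 = dict(row1, load_dts="2011-01-02")
inserted_dup = load_satellite(sat, row2)               # same data: skipped
row3 = dict(row1, load_dts="2011-01-03", name="ABC Inc")
inserted_chg = load_satellite(sat, row3)               # real change: stored
```

Only two versions end up stored for the three feeds, and the first version's load-end-date now marks exactly when the change arrived.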


WARNING: ANY CHANGE TO THE SATELLITE STRUCTURE WILL DAMAGE THE FLEXIBILITY OF THE DATA VAULT MODEL. FOR INSTANCE, ADDING A FOREIGN KEY TO A SATELLITE IS NOT ALLOWED; IF A FOREIGN KEY IS ADDED, IT IS NO LONGER A DATA VAULT MODEL. ON THE OTHER HAND, BECAUSE SATELLITES ARE CHILD TABLES, THEIR PRIMARY KEY STRUCTURE IS SOMEWHAT FLEXIBLE WITHOUT TOO MUCH HARM TO THE MODELING ASPECT.

Remember: all models must serve a purpose and a function. The rules and standards described in this book are golden guidelines, most of which should be adhered to; however, there are some circumstances in which the principles of the Data Vault can be preserved while altering the structure of the data model to fit the needs.
6.2 Satellite Entity Structure

The Satellite entity structure consists of basic required elements: surrogate sequence id (from the parent table), load date stamp, load end date stamp, and record source. Database engines today do not currently support (natively) time-series based table structures. Due to this limitation, the architecture is forced to compensate with Load Date Stamps and Load End Date Stamps. These date stamps have been described in the common attributes chapter (Chapter 3) of this book.
The Satellite entity must NEVER contain foreign keys (except for the single parent on which it relies). If a Satellite structure is compromised, then the flexibility of the model is immediately compromised; in other words, all possible hope of future proofing the data model is immediately lost. You are then forced to reengineer the data model in the near future when the business changes the way relationships are structured. Satellites may contain unknown or not-yet-identified business keys until such time as the business keys become identifiable.
While this is not a general practice, it is acceptable. When applying this standard rule, the business key housed in the Satellite is treated in the same manner as the rest of the descriptive data set: as just another descriptive element with changes tracked. However, once the business key becomes identifiable, it will be necessary (at that time) to split the key out to its own Hub and add a Link association to the current parent of the Satellite. Then, the Satellite data must be re-formulated without losing history. The process of reformulation of Satellites is covered in the one-on-one coaching section and the Data Vault Implementation book. Satellites must have one and only one parent table; no others are allowed. Figure 6-1 below shows a standard structure of a Satellite Entity.


Figure 6-1: Example Satellite Entity
The primary key of the Satellite is most often the Parent Sequence combined with the Load Date time stamp. If the Satellite is loading in real-time, then it may be necessary to add a sub-sequence number or, if the database allows it, possibly a millisecond timer on the load date. By adding a sub-sequence or millisecond timer, real-time data can easily flow directly in to the Satellite without creating duplicate primary keys (as a result of load date collisions).
There are even situations where the sub-sequence may represent the parallel pipe number that is feeding the Satellite with data (this of course is true if the data is arriving at 10,000 transactions per second across 10 parallel pipes); and in that case, it may even be necessary to split the Satellite in to 10 Satellites (1 for each pipelined insert). Otherwise, the database may be overwhelmed by the loading speed. However, that is beyond the scope of this book and is discussed both in the one-on-one coaching section and the Implementation of the Data Vault book.
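The structure and primary-key discussion above might translate into DDL roughly as in the sketch below (shown via Python and SQLite for portability). The table and column names are generic placeholders rather than the book's official DDL, and the sub_sqn column illustrates the optional sub-sequence tie-breaker for real-time load-date collisions.

```python
import sqlite3

# Structural sketch only; names are assumed, not prescribed by the book.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hub_customer (customer_sqn INTEGER PRIMARY KEY, "
            "customer_key TEXT)")
con.execute("""
CREATE TABLE sat_customer (
    customer_sqn  INTEGER NOT NULL,           -- surrogate from parent Hub
    load_dts      TEXT    NOT NULL,           -- load date stamp
    sub_sqn       INTEGER NOT NULL DEFAULT 0, -- tie-breaker for real-time
    load_end_dts  TEXT,                       -- load end date stamp
    rec_src       TEXT    NOT NULL,           -- record source
    cust_name     TEXT,                       -- descriptive payload
    PRIMARY KEY (customer_sqn, load_dts, sub_sqn),
    FOREIGN KEY (customer_sqn) REFERENCES hub_customer (customer_sqn)
)""")

# Two deltas arriving in the same load-date instant remain unique rows
# because the sub-sequence differs:
con.execute("INSERT INTO sat_customer VALUES "
            "(1,'2011-01-01 00:00:00',0,NULL,'CRM','ABC Corp')")
con.execute("INSERT INTO sat_customer VALUES "
            "(1,'2011-01-01 00:00:00',1,NULL,'CRM','ABC Inc')")
n = con.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
```

Note the single foreign key back to the parent: per the warning above, it is the only foreign key a Satellite is ever allowed to carry.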
6.3 Satellite Examples

Figure 6-2 shows several Satellite examples from the Adventure Works 2008 Data Vault model. Cult = Culture, ProdDesc = Product Description, Prod = Product, Mod = Model, Prod_Loc = Product Location.


Figure 6-2: Example Satellite Entities

The data in each of these formulates the warehouse. Keep in mind that for this particular Data Vault model we are 1) dealing with a single source system and 2) a model which has no identifiable or meaningful business keys.
6.4 Importance of Keeping History

History is partly what a data warehouse is all about. The Data Vault is no different, except that in the Data Vault, history is raw data. Satellite structures, being what they are, can be changed, altered, and re-designed (as is documented later in this chapter). It is important to remember: when a Satellite changes its design, 100% of the historical data must be preserved or the Data Vault will no longer pass an audit.



As you continue through this chapter, please be mindful of this principle. Be thinking of how the history can be preserved throughout the different changes. History serves as the audit trail of the source systems, and the only record available is in the Data Vault; which means that the Data Vault you build is now a system of record. There are differing opinions about what a system of record really is; however, as the business retires old sources, or as you implement operational data warehousing, the data warehouse is relied upon to make financial decisions. It is at this point that the data warehouse is a system of record.
6.5 Splitting Satellites by Classification or Type of Data

There are many different ways to define type of data. One way is to define type as data type. In this
manner, the Satellites can be divided into different pieces based on their data types. History has
shown that the benefits of this approach are as follows:

Create a fixed width row for bits, integers, dates/times (all non-varchar components)
Create variable width rows for all varchar / char attributes
Create variable length BLOB / CLOB / LOB objects
Dramatically increase compression rates for data sets
Decrease overall storage needs (by reducing the potential for chained rows)
Easier management and maintenance
No guess work involved in defining new Satellites
Easier indexing strategies
Easier partitioning strategies
Easier Query Parallelism
End Result? Increased performance
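As a minimal sketch of the classification step above, the following function groups column metadata into type-of-data Satellites. The type lists, grouping rules, and column names are assumptions for illustration only, not part of the Data Vault standard.

```python
# Toy splitter: assign each (column, data type) pair to a Satellite group
# along the lines described above (fixed width, variable width, LOBs).
FIXED  = {"int", "bigint", "bit", "date", "datetime", "decimal", "float"}
VARIED = {"varchar", "nvarchar", "char"}
LOBS   = {"blob", "clob", "text"}

def split_by_type(columns):
    sats = {"sat_fixed": [], "sat_varchar": [], "sat_lob": []}
    for name, dtype in columns:
        if dtype in FIXED:
            sats["sat_fixed"].append(name)
        elif dtype in VARIED:
            sats["sat_varchar"].append(name)
        elif dtype in LOBS:
            sats["sat_lob"].append(name)
    return sats

cols = [("list_price", "decimal"), ("sell_start", "date"),
        ("color", "nvarchar"), ("photo", "blob")]
groups = split_by_type(cols)
```

The fixed-width group yields constant row sizes (better compression and indexing), while the LOB group isolates the large objects that would otherwise cause chained rows.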

Of course we can't ignore the nature of the query set. When classifying attributes into different Satellites by data type, it is important to consider the queries that will be grabbing the data sets, and put them in context with the platform that the queries are running on. For instance, if the platform is Teradata, or IBM DB2 UDB EEE / MPP, then the queries and parallelism will work quite well. Likewise, if the platform is SQLServer 2008 R2/MPP, or Oracle SMP Big Iron with Partitioning and Parallel Query, then the queries will work quite well.
If the platform is a DB2-based AS/400, then normalizing the Satellite goes against the performance principles of the HFS (hierarchical file system). Also, if the hardware is under-powered, under-sized, or the database hasn't been tuned appropriately, then the queries might not run so well.


Furthermore, the database industry is changing (and by the time this is published, will have changed). The rise of NOSQL solutions (like HADOOP) and columnar databases will change the way we physically look at partitioning data sets. This is all beyond the scope of this book, and will be covered in the Data Vault Implementation book, or in the one-on-one on-line coaching section of my web-site at: http://danLinstedt.com
In the interest of discussion, and for the purposes of demonstration, the figure below (Figure 6-3) shows a split of SAT_PROD from Figure 6-2 (above) into multiple Satellites.

Figure 6-3: Satellites Split by Type Of Data, Option 1


This is known as vertical partitioning; that is, the act of splitting columns up (or normalizing) the tables into different groups tied together by the same key. This is a concept that the columnar databases follow for performance, but they take it to the extreme by splitting (normalizing) every single column in to its own table structure.
One of the benefits of this technique is that it begins to show potential mismatches for data types and naming conventions. For instance, I would argue that size nvarchar(5) might be better off as a numeric, either decimal or float, unless of course the size includes a measurement like cm (centimeters) or in (inches). But in that case, the size and size measurement should be separate columns, and both should be stored in the numeric Satellite.

Upon further inspection, DaysToManufacture could be seen as a numeric (as it is now), or quite possibly as part of the date data types Satellite. It all depends on how the business desires to set up and manage its metadata ontology. Ontological definitions/classifications for fields in the table are as numerous as the stars in the sky. In other words, there are millions of combinations (ways) to split your data by type; spend some time in the design phase deciding which manner of classification / grouping suits your needs best.
6.6 Splitting Satellites by Rate of Change

Rate of change is a similar topic to type or classification of data. Rate of change can be described in many different ways. However, in this particular case the term rate of change refers to how fast each element or group of elements changes in relationship to the others. In other words, the rate of change of cell phone numbers for an individual may be exponentially higher/faster than the rate of change for that person's address. Lumping all this quickly changing data with other, slower changing data causes data space explosion. Figure 6-4 shows an example of data that is denormalized in to a single Satellite, and changes at different rates.

Figure 6-4: Satellite Data Rate of Change Example

Although compression in the database offers a little bit of relief from the repetitive information, it still piles up over time. All this extra data is recording little to no changes to the information as it flows in. The impact can be seen in longer loading times, longer backup times, bigger log spaces, larger temp areas (needed for queries), and slower queries overall (more I/O going on at the hardware level).


In the above example, the cell phone changes every day. The phone number may change every other day, but the name and address may change once a year or less. The rigor required for loading a new row in this instance can be painful. Not only does the data set in the table explode, the indexes for the fields also explode. What's worse is that the coverage (that's the internal database rating for which index is the best to use for SQL) is dramatically reduced by duplicate data; which means that if you index the Name or the Address columns, their selectivity will become very poor very quickly.
Note: if you think it looks OK now, or you don't see the harm in this (because you've been doing this for years with a type 2 dimension), then just try to imagine this happening over 10x the amount of data that you currently have. In other words, imagine this table with 100 million rows in it, when in reality you only have 6.6 million names and addresses (assuming each changed about 15 times total), and 100 million cell phone number changes. The performance is dreadful to think about because of the width or row size of the table.

This in turn leads to all kinds of attempts at database optimization, all because of a poor choice in the design of the table. This can be mitigated very easily by splitting the Satellite up by rate of change; this allows the table to settle in to predictable patterns of repetition.

Figure 6-5: Satellite Split by Rate Of Change

In Figure 6-5, we see a 5x disk space compression immediately take place in the contact name Satellite (because only 5 rows were shown). As it turns out, storing the information once is much more efficient and practical. The contact phone Satellite still has all the changes and all the history, but the row-size is much smaller, the index coverage is better, and the performance is faster.
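The space argument can be made concrete with a toy calculation: count the delta rows each design stores for the same five-day history. The data and attribute groupings below are invented for illustration; only versions that actually changed are kept, per the delta rule.

```python
# Toy history: (load_dts, name, address, cell_phone). Only the phone moves.
history = [
    ("d1", "Jane", "12 Elm St", "555-0001"),
    ("d2", "Jane", "12 Elm St", "555-0002"),
    ("d3", "Jane", "12 Elm St", "555-0003"),
    ("d4", "Jane", "12 Elm St", "555-0004"),
    ("d5", "Jane", "12 Elm St", "555-0005"),
]

def delta_rows(rows):
    """Rows a delta-driven Satellite keeps: only versions that changed."""
    kept, last = 0, object()
    for r in rows:
        if r != last:
            kept, last = kept + 1, r
    return kept

single_sat = delta_rows([(n, a, p) for _, n, a, p in history])  # 5 rows
name_sat   = delta_rows([(n, a) for _, n, a, _ in history])     # 1 row
phone_sat  = delta_rows([(p,) for *_, p in history])            # 5 rows
```

The combined design repeats the wide name/address payload five times; the split design stores it once, exactly the 5x compression in the name Satellite that Figure 6-5 illustrates.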


6.7 Satellites Arranged by Source System

When designing Satellites, sometimes the first instinct is to try to combine multiple systems' data directly in to a single structure. This can be good and bad. By combining multiple systems' data in to a single structure, there are a lot of considerations to be made; these are covered in the next section: Overloaded Satellites (the Flip-Flop Effect).
The best practice, and the easiest to get comfortable with, is to split the Satellite data into separate Satellites, one per source system. The next question that comes to mind then is: why include a record source? Well, to be honest, because the source of the data may still need to be geographically identified, or possibly application identified. For instance, the source may be the SAP sales module, but the SAP sales module may have been implemented across more than one source system (physical machine). It is also possible that it may be implemented in different geographic regions. The key to using a single Satellite per application is to ensure there is a match across the structures, and that the metadata is defined the same way by the business.
Note: The best practice is to split the Satellites across each source system.

What benefits does this provide?

Allows the designer to add new systems as they come in the door without impacting existing designs and existing data sets.
Removes the need to fight over what the data means, how to integrate it, and whether or not it needs to be split, concatenated, lengthened, shortened, or otherwise manipulated.
Allows different data sets from different sources to populate their audit trail in accordance with their rate of change and type of data, where in this case type represents the source system.
Solves the problem of disparate data arrival times. In other words, if or when the data arrives, it is inserted directly in to its Satellite for that system; there's little to no competition (at the I/O or database level) for that resource (table). This allows us to maximize load parallelism.
Allows real-time data to flow from one system while batch data flows from another; this limits the exposure to the risk of having to merge data sets on the fly, and eliminates the dependencies across multiple systems that would force those systems to have the data ready at the same time.

They say a picture is worth a thousand words; Figure 6-6 provides a generic example of what this might look like in a Data Vault model.


Figure 6-6: Customer Satellites Split by Source System
In this example, there are some overlaps for customers, including Name and Phone Number, but that's where the similarities appear to stop. Each system probably has its own unique way of defining what customer means! But the business stated in their requirements very clearly: if the customer record has the same business key in each system, then it is supposed to represent the same customer. Very rarely does this data ever line up in the beginning, especially once history is loaded to the Satellite.
The job of a good Data Warehouse is to point out or make known the discrepancies (the gap analysis) between the way the business believes it is operating, and the way the source systems are truly running. The job of the Data Warehouse is not to filter the information or to alter it in any way; to do so would violate compliance and auditability rules. Here, the discrepancies are plain to see, and if you look closely you'll notice that statistics can be run across the source systems to see how far out of alignment they are with each other. In other words: what is it costing my business to have broken business rules in different source systems? This question can finally be answered with metrics and measurements.
It's quite possible that from profiling the data in this example, one might learn that contacts really should have their own business keys, because they are totally and uniquely distinct from the notion of customer. Or one might learn the opposite, that all contacts are customers. The point is: the Data Vault should assist in telling the story, and splitting the source systems across multiple Satellites makes it easier to spot these erroneous patterns.

6.8 Overloaded Satellites (The Flip-Flop Effect)

Now suppose there is a need to see all the data in a single Satellite; what do we do then? There are specific reasons why and why not to do this. There are inherent risks as well; some of those risks were covered in the last section. Take a minute to check the last section to ensure you didn't miss anything important. This technique is called overloading because it allows multiple definitions of source system data to insert multiple rows of data to the same table. The hope is that the metadata definitions are the same for the fields, but there's no way to enforce that. Therefore, the data frequently becomes messy very quickly.
We can see the effects of overloading when viewing data sets in legacy systems. That is: a single field used for multiple purposes, with multiple meanings based on character position and appearance. In other words, smart-key data where no edit checks are in place in the application; tacked on to the Cobol copybook are multiple re-defines and programmatic logic to re-define what the data should represent.
Overloading a Satellite is not necessary given today's technology, and brings with it many risks, such as misinterpretation, misunderstanding, inability to see patterns, difficulty discovering problems in the data, and risk of audit problems. Another issue that overloading a structure bubbles to the surface is the question of: what do we do with the data set? Do we join it all together to make one best looking row for insert? Do we run rules against it to coalesce it together? It's a slippery slope that leads right back to where we started: with business rules being implemented up-stream of the EDW. This is not what we want.
However, all that said, Figure 6-7 represents what an overloaded Satellite might look like (from a data perspective), and the following paragraphs explain what (if any) good uses there might be for this type of design.

Figure 6-7: Satellite Overload from Many Sources


First, notice the record source. Each record source indicates a different source system (for the purposes of this example the sources are lines of business). It is not entirely clear which source system should be the master system. Of course there are all kinds of questions that arise from this example:

Does one row supersede another?
Which row is the right most current row?
The primary key is duplicated; should we include the Record Source as part of the Primary Key to resolve this?
What about the rest of the fields? What if they are NULL or not available in other sources?
Should we create a combined/merged row during the load to this table?

These questions and many more will begin to pop out with additional overloaded Satellites. So what's the benefit of overloading if there are so many issues? The only benefit that I've personally experienced in the past is: to get the business to deal with the source of the problem, because the Data Vault ran out of disk space. The load cycle would load 15 million x 5 source feeds rows on every load, because the loading mechanism detected a delta. Which brings up another point: when a Satellite is overloaded, the loading cycle begins to take a turn toward the serial path.
Loading the Satellite must be done in reverse order (from least important to most important), whereby the last row to delta (be inserted) becomes the most current, and all the others get end-dated. Again, implementation is in the other book; this explanation is necessary to show the gravity (risk) of this design. The better design is to split the Satellites by source system. This allows each business unit to define which system is their master system, and when building the data marts each Satellite will then provide the most current row to the process. Furthermore, by splitting the Satellites out (as described in the previous section), the load can happen in parallel. Subsequent to the load, a reporting structure could be built that attempts to merge the multiple Satellite data into one table for the purposes of doing the cross-system data quality checking.
Let it be known: overloading a Satellite incurs many risks. From metadata to understanding, from load performance to indexing, from data quality to merging, all of these take a toll on the business and on IT.

6.9 Satellite Applications

The Satellite can be utilized in different manners. There are some specific applications for Satellites which are fairly common to Data Vault implementations world-wide. These applications are documented in this book to assist you. These uses include: effectivity tracking, record tracking, status tracking, and computed Satellites, just to name a few.
These types of Satellites are typically called system driven Satellites and are not included as a part of the audit trail for the data warehouse. The reason is that application logic determines the contents, and in general (except for computed Satellites) the data in these Satellites can be backed up, rolled off, deleted, and re-built without harm to the end-user or business data.
The data in these Satellites is generally for IT utilization: to track the data itself, to assist with performance, or to determine start and stop control over the Linking mechanisms. The following sections cover the application usages of Satellites.
6.9.1 Effectivity Satellites

One of the applications for Satellites is called effectivity. This type of Satellite is most often found hanging off a Link table (see Figure 6-8). Its purpose is to track when the Link is active according to the business. It is a temporal based structure that houses begin and end dates for the association or for the business key it represents.

Figure 6-8: Satellite Effectivity

Keep in mind that these dates are not system computed. These dates must arrive on a source feed, or be set in the source system via a source application, an audit trail, or a change data capture (CDC) program. These dates are auditable, and must be traceable back to the source in order to pass audit. If you are looking (in contrast) to guess when data appears/disappears from a feed, then the status tracking Satellite (discussed below) is a better fit.
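The begin/end-date filtering an effectivity Satellite enables can be sketched in a few lines. This is a minimal illustration, not the book's implementation; the row layout, column names, and the open-ended high date are assumptions for the example.

```python
from datetime import date

# Hypothetical effectivity Satellite rows for a Link: each row carries the
# business-supplied begin and end dates that mark when the association is
# active. These dates arrive from the source; they are never computed here.
EFFECTIVITY_ROWS = [
    {"link_sqn": 1, "begin_date": date(2000, 1, 1), "end_date": date(2000, 6, 30)},
    {"link_sqn": 1, "begin_date": date(2000, 7, 1), "end_date": date(9999, 12, 31)},
    {"link_sqn": 2, "begin_date": date(2000, 3, 1), "end_date": date(2000, 3, 31)},
]

def active_links(rows, as_of):
    """Return the Link sequences whose association is active on a given date,
    according to the business-supplied effectivity dates."""
    return sorted({r["link_sqn"] for r in rows
                   if r["begin_date"] <= as_of <= r["end_date"]})
```

A query against the data marts would apply the same predicate; here, `active_links(EFFECTIVITY_ROWS, date(2000, 3, 15))` reports both associations as active, while only Link 1 remains active later in the year.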
Note: all children (all Satellites hanging off this Link or Hub) are ended when the effectivity of the association is ended. This status of "ended" must be forwarded to the data marts that are fed from the Data Vault. This is business user / source system data, and must be included in all queries that access this information.

In addition, not all Links and not all Hubs are a good fit for effectivity. These Satellites are strictly for business user (source system) based data. It is not necessary to create an effectivity Satellite for every Hub and Link in the model unless the source system delivers the data set for every business key and every relationship, and that would become a data miner's gold!
6.9.2 Record Tracking Satellites
There is another application of Satellites called Record Tracking. Record tracking is a system generated set of data. This data is not auditable (generally speaking). The purpose of the record tracking Satellite is to identify which source applications are feeding which keys and associations on what load cycles. It originated as a need to capture changes (missing rows) from a source feed, because we received a full dump of a legacy system every day, and rows would disappear for three days, and then re-appear. We were told that just because they disappear for three days, it doesn't mean they are deleted.

Furthermore, we didn't have any CDC in place, so when a record was truly deleted, it went missing for an extended period of time. The business wanted a way of identifying the difference between "missing for a few days" and "was deleted". They settled on a rule that said: for Data X, if the key doesn't show up from this application for 7 consecutive days, then mark it deleted. They had other rules for other data, i.e.: for Data Y it was 30 days, Data Z = 5 days.
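That business rule is simple enough to sketch directly. The threshold values come from the text above; the function name and row shape are illustrative assumptions, not part of the Data Vault standard.

```python
from datetime import date

# Per-data-set thresholds from the business rule in the text: if a key is
# absent from the feed for N consecutive days, mark it deleted.
THRESHOLDS = {"X": 7, "Y": 30, "Z": 5}

def mark_deleted(last_seen, today, data_set):
    """Apply the business rule: the key has been missing for at least the
    configured number of consecutive days, so treat it as deleted."""
    return (today - last_seen).days >= THRESHOLDS[data_set]
```

The record tracking Satellite supplies `last_seen` (the most recent load cycle on which the key appeared); the rule itself stays soft business logic, applied downstream rather than baked into the structure.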
This discussion is similar in nature (and related in concept) to the LAST SEEN DATE discussion that was depicted in section 3.5. Record source tracking Satellites indicate each system's arrival, on a load-cycle basis. For each key in each source system, or for each association on each source system, an insert is made to the record tracking Satellite indicating that it was present on the feed during the current load-cycle. The load-cycle is identified by load date, or load-cycle-id where load date has been replaced.

Because this Satellite is non-auditable (other than IT metrics), its rules for use and definition can be bent without breaking the architecture; the structure itself doesn't change; what does change is the way the data is treated. The following rules apply to record source tracking Satellites:
• A row is inserted (regardless of delta) for every day the key / association appears on the feed. In other words, it is not subject to delta processing.
• To avoid data explosion, each column (or the table itself) must be compressed.
• Because it's system driven, old load-cycle information may be summarized, and rolled off or deleted without harm. By rolled off, you may choose to back it up or move it to slower storage.

Figure 6-9: Denormalized Record Source Tracking Satellite

The data in Figure 6-9 is stored in a denormalized format. The Legend is as follows:

RS_MFG = Record Source Manufacturing
RS_FIN = Record Source Finance
RS_SLS = Record Source Sales
RS_CONTR = Record Source Contracts
Notice that this Satellite has no load-end-date, and has no record source itself. The reason is that all rows are SYSGEN (system generated). End-dates are not necessary because, as described above, an entry is inserted (regardless of delta) for each load cycle that is tracked. Row 1 and Row 2 are identical in nature. This is a great format for partitioning, filtering, and querying; it provides extremely fast access to these components and the discovery as to which feeds they appear and disappear on. However, be warned: it introduces an insert followed by an update for each record source that is added to the table.

Also note that each record source is specifically named in the metadata (the structure) of the table. This makes the table structurally driven as opposed to data driven. This may or may not be the right approach for your Data Vault, and does add a level of complexity, since the structure (and loads) would have to change if a source system is added. An alternative method of storage and tracking is available, and yet again requires a minor change to the structure of a Satellite and its corresponding primary key. The alternative is commonly known as a pivot of the above table, in this case, normalized. Figure 6-10 represents the same data in a more dynamic format:
Figure 6-10: Normalized Record Source Tracking Satellite

In this case, it is easy to add new record sources dynamically. There is no limit to parallel inserts; therefore there is no limit to the scalability of this table. It pushes the complexity downstream to the query (for interpretation and pivoting). If necessary, review the record source column definition in section 3.9.

Assuming this RS Satellite is a child of Customer, then the data might be interpreted as follows. Assume Sequence 1 = Customer Key ABC123.

• On 10-14-2000, ABC123 (the key) appeared on the Manufacturing feed; however it did NOT appear on the finance, contracts, nor sales feeds.
• On 10-15-2000, ABC123 appeared on the Manufacturing feed, and did not appear on finance, contracts, nor sales.
• On 10-16-2000, ABC123 appeared on all feeds EXCEPT contracts.

If the business provides detailed record sources (that might even indicate the point of origin within a business process), then they might be able to begin tracing the keys through the business processes. An astute data miner could make good use of this information to help the business understand how and when the data is moving through the systems. Someone who misses this concept sees no value in utilizing record source tracking Satellites.
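The downstream pivot that the normalized form pushes onto the query can be sketched as follows. The row tuples mirror the ABC123 walk-through above; the function and feed codes are illustrative assumptions.

```python
from datetime import date

# Hypothetical normalized record-source tracking rows: one insert per key,
# per feed, per load cycle on which the key appeared on that feed.
RS_ROWS = [
    (1, date(2000, 10, 14), "MFG"),
    (1, date(2000, 10, 15), "MFG"),
    (1, date(2000, 10, 16), "MFG"),
    (1, date(2000, 10, 16), "FIN"),
    (1, date(2000, 10, 16), "SLS"),
]

def feeds_for(rows, sqn, load_date):
    """Pivot the normalized rows back: which feeds carried this key on the
    given load cycle? Absent feeds simply produce no row."""
    return sorted(rs for s, d, rs in rows if s == sqn and d == load_date)
```

Interpreting the sample data this way reproduces the narrative: on 10-14-2000 only Manufacturing carried the key, and on 10-16-2000 every feed except contracts did.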

6.9.3 Status Tracking Satellites

Status tracking Satellites are put in place to track CRUD (create, update, delete) from the source system. These statuses are generally produced by audit trails or change data capture systems. If the status is available and produced by either a source system or a source application (maybe a business process application), then it can and should be tracked. The best way to do this is to track the status separately of the rest of the information (i.e., put it in its own Satellite).

These statuses should represent the state of the business key. In other words, when a business key or relationship is created, a CDC record would arrive in the Data Vault indicating "Create" or "Insert". When it is updated, a new status would be issued; and of course when it's deleted, an insert to the status tracking Satellite would indicate a delete had occurred on the source.

Status Tracking Satellites (see Figure 6-11) allow visibility in to the disappearance and re-appearance of business keys and relationships. The business book of the Data Vault discusses the notion of business key re-use, and how when businesses re-use their keys (to represent different data) they are actually prone to lose money. This is because they lose audit trail capacity at that point, and it becomes more difficult to trace the data set back over time. Business users also get confused when the business keys are re-used. However this topic is out of scope for this book.

Figure 6-11: Status Tracking Satellite

Status Tracking Satellites should be normalized, and should follow the standard Satellite layout and rules, such as insert only when changes are detected. They should have compression turned on to make best use of the storage space, and if you're lucky they will help you identify which source application or source business process doesn't match business requirements and which do.

Statuses may be inserted from multiple sources during the same load cycle. This may or may not lead to multiple active Satellite rows (which are described in section 6.9.5 below).

The best practice is to insert only the master system status, or to split the status Satellite by source systems. If the Satellite is split by source systems, then you have simply postponed the decision to assign the master system (for query purposes of selecting current status) until you access the information (loading to a data mart).
6.9.4 Computed Satellites (Quality Generated)
At first glance, everything in the Data Vault looks as though it has to be raw data. For the most part, this is true (and is one of the fundamental premises of the approach). Often there is a need to process raw data through quality routines, cleansing routines, and address correction routines; generally the desire is to run these routines once and then distribute all the information downstream to the data marts.

Within the Data Vault methodology and architecture there is a place for this data. It's called a computed Satellite. The computed Satellite is a standard Satellite structure (with all the same rules, formats, and structural integrity). The difference is that the record source is SYSGEN (system generated information) or potentially the name of the application that is performing the data alterations. Computed Satellites are not auditable (generally). I say "generally" because when or if the data is used to run the business or make financial decisions, there is a good possibility that an auditor will come back and expect to see how, when, and what the data was.
From an implementation perspective, it is suggested that you split the computed Satellites off to
their own disk storage area. It may be wise to place them on SSD (solid state disk) if they are highly
accessed and need to be extremely fast. At a minimum, they should be placed on their own I/O
channels and their own storage so they do not compete for read/write resources with the raw data
sets.
6.9.5 Multiple Active Satellite Rows
Multiple active Satellite rows are similar to Satellite overloading. Satellite overloading was discussed previously, in section 6.8 above. The concept here indicates that there are several rows per key that are alive, active, and valid all at the same time. In most cases, they would be arriving in the Satellite from different systems. However, there are times when the data is normalized (as in the example below) that make it a better choice to have multi-active Satellite rows.
For instance, consider a part number which is assigned a status flag; in the manufacturing system it's an ACTIVE part; in the planning system it's an inactive part. This part number may have multiple statuses from multiple systems, and they may or may not be valid (depending on the view point of the user and depending on the definition of the flag). Multiple active Satellite rows can be averted easily (most of the time) by splitting the data by source system (in most cases), although in some cases, you may want to split it further by application within each source system.

Suppose however, that you have a list of phone numbers on incoming data, and that you never know just how many phone number columns will arrive. Some days your loading process may see 3 phone numbers; other days, it may see 5; and even within the same load batch the number of phone numbers is variable. In this case, it is extremely difficult to architect the "right" set of phone number columns in a Satellite, and the last thing that should be considered is phone_1, phone_2, phone_3, etc., causing wide rows, sparse population, and a bunch of null column values. It is precisely for these reasons that multi-active Satellite rows exist!

Figure 6-12 below demonstrates the loading of hierarchical XML data; it could also represent a hierarchical Cobol data set. Any hierarchical structured information with variable list length is a candidate for this technique. By normalizing the structure, the architecture is well-suited to absorb an unknown number of elements per parent record. The normalized Satellite has an additional element to the primary key known as a sub-sequence number. Sub-sequence numbers are discussed in section 3.2 of this book. They basically provide a mechanism with which to uniquely identify the data.

Figure 6-12: Multi-Active Satellite Rows
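Exploding a variable-length list into multi-active rows with an ordinal sub-sequence can be sketched as below. The function and column names are illustrative assumptions, not the book's load code.

```python
def normalize_phones(customer_sqn, load_date, phones):
    """Explode a variable-length phone list into multi-active Satellite rows.
    The sub-sequence number extends the primary key so every row stays unique,
    however many elements the feed delivers on a given day."""
    return [
        {"sqn": customer_sqn, "load_date": load_date,
         "sub_sqn": i, "phone": phone}
        for i, phone in enumerate(phones, start=1)
    ]
```

Whether the feed carries three numbers or five, the structure absorbs the list without any phone_1, phone_2, phone_3 columns, sparse population, or nulls.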

In some cases it might make sense to replace the sub-sequence number with an actual piece of information that the business users understand. In this very special example (not shown here), the architect replaced the sub-sequence number with a copy of the phone number. This technique allowed them to overcome a difficulty in tracking the Satellite data from load to load.

While loading and implementation is not a focus of this book, this idea will be briefly discussed here as it has relevance to the structure choices made by the architect. One of the issues of utilizing sub-sequence numbers is that it introduces order-dependency to the load cycle. In other words, from one load to the next, if the order of the phone numbers changes, then it's seen as an entire new delta for the employee, which means all the phone numbers are re-inserted as delta rows, even if the phone numbers themselves did not change.

This can be mitigated in two ways: one (as described above), using the phone number as the sub-sequence number (removes the order dependency during delta checking), or two: including the existence of the phone number in the Satellite as a currently active row before inserting. Option #1 destroys any chance of reproducing the data set in the proper order as it arrived (if this is important to you, then sub-sequencing is the only way). Option #2 doesn't check the deleted phone numbers that may have disappeared from the incoming data set.

In cases where there is no number column alternative, replacing the sub-sequence with an alphanumeric causes great problems with performance. Unfortunately this is one of the cases where choosing the best worst-case scenario seems to be the ideal. In such a case, sub-sequencing is always the architectural fall-back; please couple that choice with turning on compression of duplicate data across the table. This will assist with maintaining the integrity while allowing the flexibility of an unknown number of elements to flow through the Data Vault.

Figure 6-13: Multi-Active Satellite Row Data

In Figure 6-13, it is easy to see that customer 1 loaded four phone numbers on 10-14-2000. They are all superseded by the load on 10-20-2000, as the loading mechanism doesn't make any attempt to detect deltas; it simply captures changes to the ordering along with new and deleted phone numbers. This is because the business decided that keeping the order of the phone numbers has a business importance. This is a prime case for turning on table compression.

Figure 6-14: Multi-Active Satellite with Business Sub-Sequence

In Figure 6-14, the phone number has replaced the actual ordering column that was used in Figure 6-13. This allows detection of existing rows and discovery of rows that are deleted from the source system. This is not the preferred technique, as it requires a unique column to be available that is in numeric format.
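The order-dependency trade-off between the two figures can be sketched as two delta checks. This is an illustration of the behavior described above, under the assumption that each load delivers the full phone list; the function names are hypothetical.

```python
def delta_by_order(old, new):
    """Ordinal sub-sequence (Figure 6-13): the lists compare positionally, so
    a pure re-ordering counts as a brand new delta and the whole set of phone
    numbers is re-inserted."""
    return old != new

def delta_by_value(old, new):
    """Phone-as-sub-sequence (Figure 6-14): ordering no longer matters, only
    genuinely new numbers register. Note the blind spot called out in the
    text: numbers that disappeared are not flagged by this check alone."""
    return bool(set(new) - set(old))
```

A re-shuffled list fires `delta_by_order` but not `delta_by_value`; a deleted number fires neither, which is exactly why a separate existence check (or a status tracking Satellite) is still needed.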
6.10 Splitting Satellites

Splitting a Satellite occurs when one or more columns within the Satellite begin changing at different rates (when compared to other columns within the Satellite). The reasons why we split Satellites are discussed in section 6.5 (classification or type) and 6.6 (rate of change). This section introduces you to how to split the Satellite, and what the concepts are to make it work properly.

There are a set of standards around splitting Satellites, and while most of this information is about loading, and moving data around, it is within the purview of this book to discuss; mostly because it also involves architectural changes. The steps to splitting a Satellite are below:
• Identify data in a Satellite that changes more rapidly than the other data in the Satellite
• Group those common elements together by rate of change
• Split the Satellite architecturally by creating new Satellites and moving those elements
• Load the data to the new Satellites by simply copying the existing columns, load dates, and sequences
• Run another process that begins removing duplicates (looks for deltas)
• Run a final process that updates the load-end dates after the dupes are removed

The most important concept to hold to when splitting a Satellite is to maintain 100% of the history of the data. If any of the history is lost by deleting rows that contain deltas, then the new Satellite must be truncated and re-loaded from the original Satellite. Maintaining the audit trail is vitally important. Once the audit trail has been checked and verified in both new Satellites, then the old Satellite can be deleted/removed.

It is recommended that you run queries (in parallel to the old Satellite) for a few weeks against the new Satellites to match the results. Once a balance has been established, and they are both showing equal results, then and only then can you delete the old Satellite and replace any affected downstream processes or extracts.

Figure 6-15: Step 1: Identify Satellite Split Columns

In Figure 6-15, the column that is changing most frequently is the cell phone. The phone number column also changes, not quite as fast as the cell phone, but still more frequently than the name and address. Therefore, we will split the Satellite in to two different structures (see Figure 6-16) and group the phone and cell phone together in one structure, while moving the rest of the data set to another structure.
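The copy-then-reduce portion of the split can be sketched in miniature. This is an illustrative sketch only, assuming simple dict rows and a single parent key column; the real process runs as set-based SQL against the tables.

```python
def split_satellite(rows, fast_cols, slow_cols):
    """Sketch of the split: copy every original row into two new Satellites
    (100% of history preserved), then drop consecutive duplicates per parent
    key so each new Satellite holds only true deltas for its own columns."""
    def project(cols):
        out = []
        for row in sorted(rows, key=lambda r: (r["sqn"], r["load_date"])):
            img = {c: row[c] for c in cols}
            prev = out[-1] if out and out[-1]["sqn"] == row["sqn"] else None
            # Keep the row only when it is the first for the key, or carries
            # a delta in the projected columns.
            if prev is None or {c: prev[c] for c in cols} != img:
                out.append({"sqn": row["sqn"], "load_date": row["load_date"], **img})
        return out
    return project(fast_cols), project(slow_cols)
```

The slow-changing Satellite collapses to far fewer rows than the fast-changing one, which is the whole point of the split: each structure now deltas at its own natural rate.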

Figure 6-16: Step 2: Split Satellite Columns, Design New Tables

Now that the columns are properly split apart, and the new structures have been built, the next step is to handle the data (Figure 6-17).

Figure 6-17: Step 3: Copy Data From Original to New Satellites
It is a replicated copy, and at this point it is easy to see that the customer name Satellite contains duplicate entries, which does not follow the standards for deltas to be maintained within the Satellites. The next step (Figure 6-18) is to delete / remove the pure duplicates. This process of duplicate elimination must be run against each new Satellite. Otherwise, you might miss subtle duplicates that reside in the new Satellites.

Figure 6-18: Step 4: Eliminate Duplicates

This particular example does not show all the details of eliminating duplicates, and in fact by simply eliminating the duplicates in this example, we no longer need to run the next processing step: adjusting the begin and end dates. However, these steps will run regardless, as there may be other rows of information that require fine-tuning. The example below (Figure 6-19) shows the data after copy-in, but before running reduction and delta processing.

Figure 6-19: Step 4 Alternate: Elimination of Duplicates
After we have eliminated the duplicates, and selected the earliest load date in a series, and the last load-end-date of the series, we have succeeded in properly adjusting the Satellite. Figure 6-20 represents the final stage of a Satellite after processing is complete.

Figure 6-20: Step 5: End Dates Adjusted After Satellite Split

Once the Satellites have been successfully split, and the data set made available, it is recommended to release the data in a parallel path; meaning, set up a duplicated data mart area, re-direct copies of the loading routines to the new data mart, then spend a week or so comparing the new information to the old. Once reconciliation has been accomplished, you can switch over to the new structures and drop the old ones, releasing quite a bit of disk space.

Don't forget! Sometimes normalizing the data set and increasing the parallelism leads to higher numbers of I/O calls; however, when run in parallel, it leads to overall reduction in processing time. A balance needs to be achieved for the best performance possible.
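The final end-date adjustment of Step 5 can be sketched as follows. This assumes one common convention (each row's load-end-date equals the next row's load date, with an open high date on the newest row); the book's figures may use a slightly different inclusive convention, and the names here are illustrative.

```python
HIGH_DATE = "9999-12-31"  # assumed open-ended marker for the current row

def set_load_end_dates(rows):
    """Step 5 sketch: after duplicate elimination, re-derive every row's
    load-end-date from its successor for the same parent key; the newest row
    per key stays open with the high date."""
    rows = sorted(rows, key=lambda r: (r["sqn"], r["load_date"]))
    for cur, nxt in zip(rows, rows[1:]):
        cur["load_end_date"] = (nxt["load_date"]
                                if nxt["sqn"] == cur["sqn"] else HIGH_DATE)
    if rows:
        rows[-1]["load_end_date"] = HIGH_DATE
    return rows
```

Because the end dates are derived entirely from the load dates, this step can be re-run safely at any time without touching the audited content of the rows.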
6.11 Consolidating Satellites

After splitting the Satellites, occasionally (throughout time), the data set change rate begins to slow down again, or in some cases "disappear". If the source system is retired, or the source data feed is switched over, or the business focus changes, then the rate at which the data changes may slow down or stop changing altogether. In these situations you may consider consolidating two or more Satellites together. This is commonly known as denormalization.

BEWARE! Over-denormalization leads to rows that are very wide, and unwieldy. If you denormalize too far, the performance of the entire system will suffer tremendously, because the I/O count rises, and fragmentation and chained rows will appear all across the processes.
So why would you want to do this?

1. Identify data sets that are changing at the same rates of speed in one or more Satellites related to the same parent
2. Join the Satellites across parent sequence numbers, ordered by earliest load date; note: this will likely also produce a Cartesian join product, which is something we actually want to start with
3. Design the consolidated structure consisting of a combination of all user-based Satellite defined fields
4. Find the earliest load date from all the Satellites for each row produced by the join, select that row for insert using the load date that was chosen
5. Run a post-insert process to clean up end-dates

Figure 6-21 contains an example of split data that has settled and needs to be re-consolidated. This section will walk through the process of putting the data back together.

Figure 6-21: Consolidating Satellite Data

This example shows the data set; please note: this example does not show record source propagation. I do however discuss this component in the following paragraphs.
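The join-and-select core of the consolidation (steps 2 through 4) can be sketched as below. This is a simplified two-Satellite illustration with assumed dict rows, not the set-based SQL a real load would use; record source propagation is omitted here just as it is in the figure.

```python
def consolidate(sat_a, sat_b):
    """Drive the consolidated Satellite off the union of (parent key, load
    date) pairs from both split Satellites; each output row carries the most
    recent value from each source as of that load date. Attributes with no
    prior row stay absent/NULL, like the empty name and cell in Figure 6-21."""
    def as_of(sat, sqn, ld):
        hits = [r for r in sat if r["sqn"] == sqn and r["load_date"] <= ld]
        return max(hits, key=lambda r: r["load_date"]) if hits else None

    keys = {(r["sqn"], r["load_date"]) for r in sat_a + sat_b}
    out = []
    for sqn, ld in sorted(keys):
        row = {"sqn": sqn, "load_date": ld}
        for src in (as_of(sat_a, sqn, ld), as_of(sat_b, sqn, ld)):
            if src:
                row.update({k: v for k, v in src.items()
                            if k not in ("sqn", "load_date")})
        out.append(row)
    return out
```

The earliest load date per key naturally becomes the first consolidated row, with the other Satellite's columns still empty, which matches the walk-through of the 08/01/2000 row below.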

The date in purple (highlighted/bold) in SAT_CUST_CONTACT_ADDR is the first and earliest date for this particular customer (08/01/2000). Therefore it results in the first row inserted into the consolidated Satellite. The name and cell are NULL because there is no data for that customer available as of that date in those Satellites. The record sources are all "??" because resolving them can be a matter of interpretation or a decision made by the business user as to which system is the MASTER system.
Please note: if the data in the original Satellites was previously split apart, then there's a chance that the record sources for all rows across all split Satellites would be the same. In this case it is OK to select the one available record source value and assign it to the newly combined row in the consolidated Satellite. If this is not the case, please see the discussion below.

If record sources vary across the multiple split Satellites (from row to row within a given parent key), then a decision must be made in consolidation: which record source to use? This decision should be put forward to the business users for complete resolution and sign-off; however that is not always possible. For the cases where the business users won't make the decision, the following rule of thumb is provided:

First, select the record source from the same table that houses the earliest load date that is being selected. If this does not produce the desired outcome, then select the master system record source from the Satellite in which it appears. Unfortunately, during consolidation of multi-system Satellites you may lose metadata. Please be aware that if the metadata is lost, the only way to correctly audit the system will be to restore that day's load to the staging area for further review.

At the end of this process of consolidation, run the assignment of Load-End-Dates to properly adjust the dates and times of the data set. In Figure 6-22, I've included Load-End-Date calculations after they've been set.
Figure 6-22: Load End Dates Calculated in Consolidated Satellite

It is important to reconcile the information all the way through to the Data Marts before deleting or destroying the previously split Satellites. Run the data set in parallel for a week or so through to a new data mart in order to balance the information and reconcile the results to the old data mart. When buy-off is achieved, then it will be safe to backup and roll-off the old split Satellites.

Splitting and consolidating can happen at any time during the life-cycle of the Data Vault. It's a judgment call based on the rates of change in the data set, and the width of the rows.
7.0 Query Assistant Tables

Query assistant tables are built for one reason: performance. The architecture of the Data Vault does not require these tables to survive (i.e., they are optional components). Due to restrictions in current database engines and hardware, these tables are sometimes necessary in order to gain additional performance from the Data Vault models.

These tables have a second reason for existence: buffering queries from streaming real-time data. The business may have an SLA (service level agreement) in place that states they can accommodate data changes every 5 minutes on their dashboard. If data is arriving every 10 seconds, the query must be buffered to a 5 minute increment so that the dashboard is updated in accordance with business needs. The query assistant tables can function as scheduled snap-shot tables to buffer real-time feeds.

There are two basic types of query assistant tables: point in time tables (PIT), and Bridge tables. These table types are described below. These tables are not needed in columnar databases, or flat-wide devices such as Netezza, and generally are not necessary in Teradata either.
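The 5-minute buffering idea amounts to rounding arriving timestamps down to the snapshot increment. A minimal sketch, with an assumed function name:

```python
from datetime import datetime

def snapshot_bucket(event_time, minutes=5):
    """Round a real-time arrival timestamp down to its snapshot increment,
    so a dashboard on a 5-minute SLA sees one consistent image per bucket
    even when data streams in every few seconds."""
    return event_time.replace(minute=event_time.minute - event_time.minute % minutes,
                              second=0, microsecond=0)
```

Every event inside the same bucket maps to the same snapshot time, which is what lets the query assistant table present a stable image between refreshes.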
7.1 Point in Time Tables

Point in time tables (PIT tables) are structures which surround a single Hub (or Link) and its corresponding Satellites. It is defined as: a structure which sustains integrity of joins across time to all the Satellites that are connected to the Hub. It is a specialized form of Satellite. There is a single PIT table built for each Hub. These tables cannot and should not span multiple Hubs and Links. Figure 7-1 shows the basic structure of a PIT table.

Figure 7-1: Structure of PIT Table
Supe
er Charge Yo
our Data Ware
ehouse

P
Page 141 of 1
152

Reco
ord sources are
a not necesssary, as the
e PIT table is a system gen
nerated tablee. Should yo
ou wish to
inclu
ude record so
ource you ma
ay, and the only reason fo
or doing so w
would be beca
ause you havve
multtiple loading processes po
opulating the
e PIT table. End-dates
E
aree not necesssary unless yo
ou wish to
enab
ble BETWEEN
N queries aga
ainst the sna
apshot inform
mation. PIT ttables provide equal-join access to
table
es around a Hub rather th
han focusingg on outer-join
n queries to aanswer quesstions. This iss why PIT
table
es are a query assistant table
t
only.
PIT Tables should not be created until or unless there is a performance problem with accessing the Satellites around a single Hub. Figure 7-2 shows an architectural depiction of where PIT tables fit within the Data Vault.

Figure 7-2: PIT Table Architecture Overview
While it's not shown here, you may add PIT tables to help join Link Satellites as well. In general, PIT tables are not added, and not usually needed, until there are 3 or more Satellites off a parent. A data example for one of the PIT tables is shown in Figure 7-3 to help elaborate on the notion of a PIT table.
Figure 7-3: Example PIT Table with Snapshot Dates
One of the key points to this Figure is: you can schedule the snapshot process as frequently as you need it; the preferred schedule is to load all PIT tables as the very last step in the load. For example, you can run the process as fast as every second, or as slow as every six or 12 hours. Another function of both the PIT and Bridge tables is to buffer real-time input feeds from alerting processes in tactical reports. It will keep the image of the data consistent between snapshot dates/times to ensure that a consistent view of the data can be retrieved, even though the real-time data may be streaming in to other Satellites around the Hub. This requires all extract/retrieval processes to query the Data Vault using the PIT tables.
PIT Tables may contain the Hub business key if so desired (to avoid joining back to the Hub in order to retrieve the business key). PIT Tables may not contain any other Satellite data. If you wish to construct your own version of a PIT table you may do so under the guise of computed Satellites (discussed in the last chapter). PIT Tables are specifically architected for enhancing the performance of queries.
It is best however to keep the rows as short as possible (width wise) in order to keep the performance as fast as possible. It is also suggested that database columnar compression be turned on (where available) for the PIT tables to make them work more efficiently. As a final note, it is recommended that the data be contained within a managed window; meaning that the number of rows be kept to a maximum threshold. When the data becomes too old it should be rolled off the back (deleted) so that new data can be added. By keeping this rolling window of data you will be able to easily tune performance.
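The end-of-load snapshot step and the rolling-window purge described above can be sketched together. Again this is an illustrative SQLite sketch with assumed names; a real implementation would be one snapshot routine per Hub, run after all Satellite loads complete.

```python
import sqlite3

# Minimal schema: one Hub, one Satellite, one PIT table (names illustrative).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer  (customer_sqn INTEGER PRIMARY KEY);
CREATE TABLE sat_cust_name (customer_sqn INTEGER, load_dts TEXT);
CREATE TABLE pit_customer  (customer_sqn INTEGER, snapshot_dts TEXT, name_load_dts TEXT);
""")
con.execute("INSERT INTO hub_customer VALUES (1)")
con.executemany("INSERT INTO sat_cust_name VALUES (?,?)",
                [(1, '2011-01-01'), (1, '2011-03-01')])

def snapshot(con, snapshot_dts):
    # Last step of the load: for each Hub key, capture the Satellite load
    # date that is current as of the snapshot date.
    con.execute("""
        INSERT INTO pit_customer
        SELECT h.customer_sqn, ?,
               (SELECT MAX(load_dts) FROM sat_cust_name s
                 WHERE s.customer_sqn = h.customer_sqn AND s.load_dts <= ?)
        FROM hub_customer h""", (snapshot_dts, snapshot_dts))

def purge(con, keep_after):
    # Managed window: roll old snapshots off the back.
    con.execute("DELETE FROM pit_customer WHERE snapshot_dts < ?", (keep_after,))

for dts in ('2011-02-01', '2011-03-15', '2011-04-01'):
    snapshot(con, dts)
purge(con, '2011-03-01')   # keep roughly the most recent window only
rows = con.execute(
    "SELECT snapshot_dts, name_load_dts FROM pit_customer ORDER BY snapshot_dts"
).fetchall()
print(rows)  # [('2011-03-15', '2011-03-01'), ('2011-04-01', '2011-03-01')]
```

The purge keeps the PIT table at a bounded row count, which is what makes its performance easy to tune.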
7.2 Bridge Tables
Bridge tables are nearly the same as PIT tables: they serve the same purposes, and have the same goals: to improve performance of the queries. They should not be constructed until and unless performance of the parallel joins across the Data Vault is too slow. The difference between this and the PIT table is the Bridge construct focuses on joining across multiple Hubs and Links. It is therefore a specialized form of a Link table. The Bridge table can be thought of as a higher-level factless fact (something like that anyhow). It too contains no Satellite data (due to width issues), but contains keys from multiple Hubs and Links. The Bridge table is also a system generated / process generated table, and therefore is not auditable.
Remember, the architecture does not need these tables in order to function properly; these tables exist solely for the purpose of query performance.
The basic structure of a Bridge table is provided in Figure 7-4:

Figure 7-4: Bridge Table Structure
The Bridge table may contain Hub business keys; however be careful, as raising the number of bytes per row will dramatically slow the performance of the table, especially in very large data sets. Another consequence of having this table become too wide is the introduction of chained rows, fragmentation, and over-indexing. Keep in mind that the purpose of this table is to enhance join performance, not kill it.
In Bridge tables you may choose to have business keys; in others you might leave them out. Even within a Bridge table, it is not necessary to contain all the business keys. It might be best to contain only those business keys which are most frequently queried from a filter or a like clause. The Bridge table must contain Hub sequences (surrogate keys) and Link sequences (surrogate keys), and it is recommended that a Bridge table only be utilized when you have two or more Links to join with a single query.
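Building a Bridge across two Links can be sketched as a single pre-join that is materialized into one table. This is an illustrative SQLite sketch; the Link and Bridge names (lnk_seller_product, lnk_product_part, brg_seller_product_part) are assumptions echoing the seller/product/part example later in this chapter, not a prescribed schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lnk_seller_product (seller_sqn INTEGER, product_sqn INTEGER);
CREATE TABLE lnk_product_part   (product_sqn INTEGER, part_sqn INTEGER);
CREATE TABLE brg_seller_product_part (
    sqn         INTEGER PRIMARY KEY,   -- insertion order only, no other meaning
    load_dts    TEXT,
    seller_sqn  INTEGER,
    product_sqn INTEGER,
    part_sqn    INTEGER
);
""")
con.executemany("INSERT INTO lnk_seller_product VALUES (?,?)", [(10, 100), (11, 100)])
con.executemany("INSERT INTO lnk_product_part   VALUES (?,?)", [(100, 500), (100, 501)])

# Pre-join the two Links once; downstream queries then hit a single table
# instead of repeating the multi-Link join.
con.execute("""
    INSERT INTO brg_seller_product_part (load_dts, seller_sqn, product_sqn, part_sqn)
    SELECT '2011-03-15', sp.seller_sqn, sp.product_sqn, pp.part_sqn
    FROM lnk_seller_product sp
    JOIN lnk_product_part  pp ON pp.product_sqn = sp.product_sqn
""")
n = con.execute("SELECT COUNT(*) FROM brg_seller_product_part").fetchone()[0]
print(n)  # 4 rows: 2 sellers x 1 shared product x 2 parts, lowest grain kept
```

A group by in the SELECT would raise the grain, which (as noted below) turns the Bridge into something closer to an exploration Link.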
Note: the Bridge table is not required to be at the same grain as the Links it is covering. In those cases, a group by might be utilized to construct a Bridge table with data at a higher grain than that of the Link beneath it. Please keep in mind that by changing the grain, the Bridge table really is representative of an exploration Link as described previously in section 5.21.

Figure 7-5: Bridge Table Architectural Overview
As indicated, this Bridge table spans data from three Hubs, and two Links. This example (see Figure 7-5) maintains the lowest possible grain by keeping the Cartesian product intact; no group by has been executed prior to load of the data set. The data in the Bridge (see Figure 7-6) might be read as: seller by product by parts. Only the keys of the product which have a seller, and are used in the manufactured parts, will be listed in the Bridge table, unless you choose to populate some of the keys with NULL values.
Figure 7-6: Bridge Table Example Data
The SQN (or sequence) column is the primary key of this table; it's merely meant to keep the insertion order of the records and is unique. It has no meaning otherwise. The Load date is the date/time of insertion (record creation or snapshot date). This example of the Bridge table has grouped the sequence with the business key for readability purposes. For performance: it is best to group all sequences at the head of the table, followed by the business key columns. This provides maximum access to fixed-width numeric columns (business keys), while the variable columns are placed at the end of the row.
8.0 Reference Tables


Reference tables are just that, tables utilized across the data warehouse as descriptors. To simplify
matters, my definition of reference data can be found below. Should reference tables be a part of
the Data Vault Model? Yes and no.
Note: I spoke to Bill Inmon about reference data, and I asked him why we have to have it in
the Data Warehouse. His answer was: it is necessary in order to store the lookup data
without redundancy.
He continued on to share with me the following: if we lived in a perfect world, a world without
disk space limits, then the reality would be (should be) that all reference data is resolved on
the way in to the warehouse.
This way, the descriptions would be stored as deemed necessary, possibly in addition to the
encoded or keyed information. However since we do not live in a perfect world, it is
sometimes necessary to house reference data in separate lookup tables to avoid redundancy
across the entire warehouse.

I define reference data as follows: any information deemed necessary to resolve descriptions from
codes, or to translate keys in a consistent manner. Many of these fields are descriptive in
nature and describe a specific state of the other more important information. As such, reference
data lives in separate tables from the raw Data Vault tables.
Generally speaking, reference data is neither a business key, nor purely descriptive. It lives in a grey
area and covers a number of facets to resolve the information in the warehouse to a better context.
For instance, ICD9 / ICD10 (medical drug diagnosis codes) are an example of reference data. They
may be external sources of data governed (in this case) by a world body. These codes are often
found used as descriptors of other business keys.
While building separate tables in the Data Vault, and adding the codes to Satellites appears to
constitute foreign keys in Satellites, I will tell you that it should be a logical representation only. If
you physically create the foreign keys in the Satellites to reference tables, you can a) blacken your
model (too complex to maintain, too many relationships) b) destroy flexibility, and of course c) you
would not be following the perfect world scenario set up above.

Dan Linstedt 2010-2011, all rights reserved

http://LearnDataVault.com

Super Charge Your Data Warehouse

Page 147 of 152

In a truly perfect world, we would resolve all reference data on the way in to the warehouse, thus
making reference data obsolete, and then, there would be no need for foreign keys in the
Satellites to begin with. In any case, reference data can and does exist as a part of the Data Vault
model, and should be defined as: external data outside your control, data that is commonly used to
setup context/describe other business keys, or quite simply put: standard codes and descriptions or
classifications of information.
The structure of the reference tables can vary from 3rd normal form to star-like, to Hubs, Links and
Satellites from the Data Vault. So there is no need to worry or fret about the type of structure that
you want to use; just choose the one that works best for you and move on. Some options and
scenarios are described in the following sections.
8.1

No-History Reference Tables

Sometimes there is no need to store history of reference data changes; in this case we use a typical
3NF or 2NF type table. The nature of a data warehouse is in fact to store history, but when the
business signs off on the expected no-history requirement then the EDW team has the go-ahead.
A no-history reference table is a structure that has no history! Imagine that!
Ok, enough kidding aside: it's a table with no begin-dates and no end-dates. Before I go on, I'll say this:
reference tables can be designed as Hubs and Links, or as simple 3rd normal form tables, that is:
flat and/or wide; it's up to you. You need to decide what's best and what fits, then load it and go.
What types of data might you see in a no-history reference table? Well, that all depends of course,
but here are some examples of what I've run into in my career:
- Medical drug prescription codes and definitions
- Stock exchange symbols
- Medical diagnosis codes
- VIN number codes and definitions (manufacturer codes)
- Calendar dates
- Calendar times
- International currency codes
- US state code abbreviations
And so on. If you'd like to add to this list, I'd love to have your feedback. Just put the example in an
email and send it off. An example of a non-history based 3rd normal form reference table is shown
in Figure 8-1.
Figure 8-1: Non-History Reference Table
In addition to capturing the code, short description and long description, I like to capture the date the reference row was loaded. Sequence numbers are optional. There are times when I use sequence numbers for join performance, but most times I simply use the CODE as the primary key. That's the whole point of reference tables: to use the code, and leave the code in place across the model. The CODE is the natural key of the table.
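A no-history reference table of this shape can be sketched in a few lines. This is an illustrative SQLite sketch; the table name and columns (ref_state, short_desc, long_desc) are assumptions mirroring Figure 8-1's code/description layout.

```python
import sqlite3

# Minimal no-history reference table: the code itself is the natural
# primary key; no sequence number is needed.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ref_state (
        code       TEXT PRIMARY KEY,   -- natural key
        load_dts   TEXT,               -- date the reference row was loaded
        short_desc TEXT,
        long_desc  TEXT
    )""")
con.executemany("INSERT INTO ref_state VALUES (?,?,?,?)",
                [('CA', '2011-01-01', 'Calif.', 'California'),
                 ('CO', '2011-01-01', 'Colo.',  'Colorado')])
# Satellites keep the raw code; descriptions are resolved at query time.
desc = con.execute("SELECT long_desc FROM ref_state WHERE code = 'CA'").fetchone()[0]
print(desc)  # California
```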
Remember that a non-history reference table will only and forever show the current value, and hence you can no longer ask the question: what was the description of the CA code last year?
8.2 History Based Reference Tables
History based reference tables are reference data with a requirement or a business need to store the history of descriptions. In other words, we or the business wants to track what the description was last year, last month, and so on. The history may become important for certain reference data, especially if the reference data relates to financial reports. Particularly when old reports are re-printed in the future, sometimes the business or the auditor wants to see what the code and description was at a particular point in time.
In this case, I would strongly urge you to create Hubs, Links, and Satellites to house the historical reference data. However, I would discourage you from using SEQUENCE numbers in this situation. Natural keys tend to be much more consistent over time (in the case of reference data), and typically it's the natural keys which appear in the rest of the raw Data Vault model (EDW Model), particularly in the Satellites.
Adding sequence numbers to the history based reference tables usually adds no value since the codes tend to be static (i.e., ever see the abbreviation for the state of California change?). On the other hand, if you have a valid reason to do so, then don't be shy. Document the reason, and proceed to use the sequences all across your model. An example of a history based reference table is shown in Figure 8-2:

Figure 8-2: Standard History Based Reference Table
In this Figure, you can plainly see the previous code, the time-line for the validity of the previous code, along with the historical values of the previous code. We are however using the natural key (the code) as the primary key, but the load date must be included for uniqueness. Figure 8-3 shows a different approach (using a Hub and Satellite) as a history based reference table:
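The "natural key plus load date" pattern can be sketched as follows. This is an illustrative SQLite sketch, not the book's exact Figure 8-2 layout; the table and helper names are assumptions. It shows how the composite key makes the "what was the description last year?" question answerable.

```python
import sqlite3

# History-based reference table: primary key is (code, load_dts),
# so past descriptions remain queryable.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ref_state_hist (
        code      TEXT,
        load_dts  TEXT,
        long_desc TEXT,
        PRIMARY KEY (code, load_dts)
    )""")
con.executemany("INSERT INTO ref_state_hist VALUES (?,?,?)",
                [('CA', '2009-01-01', 'Calif. (old style)'),
                 ('CA', '2010-06-01', 'California')])

def description_as_of(con, code, as_of):
    # Most recent description loaded on or before the requested date.
    return con.execute("""
        SELECT long_desc FROM ref_state_hist
        WHERE code = ? AND load_dts <= ?
        ORDER BY load_dts DESC LIMIT 1""", (code, as_of)).fetchone()[0]

print(description_as_of(con, 'CA', '2009-12-31'))  # Calif. (old style)
print(description_as_of(con, 'CA', '2011-01-01'))  # California
```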
Figure 8-3: Hub/Sat History Based Reference Table
8.3 Code and Descriptions
Codes and descriptions are commonly found in reference data. If you have a lot of codes to model, take the most efficient route, that is: one that makes logical sense. I like to group many of the similar codes together into a single master code table. In these cases, I have to also assign a unique group code to help make the underlying code unique. Often times the group code is a made-up or manufactured column (hard coded data in the ETL routine).
Because the group-code is system generated, and it has no formal business meaning outside of the EDW (generally), I usually try to keep the group code inside the EDW for joining and uniqueness reasons only. The example in Figure 8-4 is made-up data, but shows how you can apply a master code or a group code to use a single structure and house all your information.
Figure 8-4: Group Code and Description
As you can see, this solution has a few flaws: if the group code changes or the code itself changes, this leads to confusion with interpretation of the surrogate keys. However, as long as the codes and group codes are consistent, it shouldn't be such a problem.
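The master code table with a manufactured group code can be sketched as follows. This is an illustrative SQLite sketch with assumed names (ref_codes, group_code); the made-up data echoes Figure 8-4's idea that the group code is what keeps otherwise-colliding codes unique.

```python
import sqlite3

# Master code table keyed by (group_code, code); the group code is a
# manufactured value, hard coded in the ETL routine.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ref_codes (
        group_code TEXT,
        code       TEXT,
        descr      TEXT,
        PRIMARY KEY (group_code, code)
    )""")
# 'CA' alone is ambiguous; the group code disambiguates it.
con.executemany("INSERT INTO ref_codes VALUES (?,?,?)",
                [('STATE',   'CA', 'California'),
                 ('COUNTRY', 'CA', 'Canada')])
row = con.execute(
    "SELECT descr FROM ref_codes WHERE group_code = 'STATE' AND code = 'CA'"
).fetchone()
print(row[0])  # California
```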

9.0 Conclusions
The Data Vault Model and methodology are highly versatile when the standards and rules are followed. It's when you break the rules and standards that you can get into trouble, and I hope I've shown you enough insight to see how to apply the appropriate and proper design. It's by following the rules and standards that you can take advantage of the years of research and design I've put into this, allowing you to overcome and avoid the potential pitfalls and project issues.
I would like nothing more than to help you succeed, and to hear from you about your concerns, questions, or comments. I'm always interested in hearing about customer successes as well as the challenges you face in your day to day job.
If you become a Data Vault fan along the way, feel free to let me know!
Sincerely,
Dan Linstedt
INDEX
adaptability, 38, 73, 76, 83
Architectural. See Architecture
Architecture, 9, 139
Basic Terminology, 10
Business Key, 4, 7, 11, 55, 57, 64, 72, 73
Business Keys, 4, 27, 58, 59, 61, 63, 71
Data Vault, 2, 3, 7, 8, 10, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 51, 52, 53, 56, 61,
62, 63, 65, 66, 67, 68, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 87,
88, 90, 91, 92, 93, 97, 98, 99, 100, 101,
102, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 118, 119, 121,
122, 123, 124, 126, 127, 129, 137, 138,
139, 144, 145, 146, 149
Data Vault Modeling. See Data Vault
EDW, 3, 7, 33, 36, 37, 38, 39, 41, 45, 46, 48,
49, 52, 58, 76, 78, 81, 82, 83, 120, 145,
146, 148
Flexibility, 3, 4, 7, 21, 22, 76, 78
HUB, 18, 44, 50, 51, 57, 64, 140
Hubs. See Hub
Link, 4, 5, 7, 8, 18, 20, 21, 22, 24, 26, 28,
34, 43, 44, 49, 52, 57, 69, 75, 76, 77, 78,
80, 81, 82, 83, 85, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110,
111, 122, 123, 139, 142
Links. See Link
Load Date, 7, 45, 46, 56, 66, 73, 105, 111,
112
Load End Date, 7, 47, 111
Record Source, 7, 9, 45, 56, 66, 73, 121,
124, 125
Reference, 6, 9, 11, 53, 144, 145, 146, 147
Satellite, 5, 6, 8, 9, 26, 34, 43, 44, 47, 48,
49, 52, 53, 56, 85, 89, 90, 92, 102, 103,
104, 105, 106, 110, 111, 112, 113, 114,
116, 117, 118, 119, 120, 121, 122, 123,
124, 125, 126, 127, 128, 129, 130, 131,
132, 133, 134, 135, 136, 137, 140, 141,
147
Satellites. See Satellite
Scalability, 4, 76, 84
Sequence, 3, 9, 32, 43, 44, 66, 73, 91, 112,
125, 130, 146
sequences. See Sequence
SQN. See Sequence