Super Charge Your Data Warehouse
Invaluable Data Modeling Rules to Implement Your Data Vault

Dan Linstedt
Inventor of the Data Vault

ISBN: 978-0-9866757-1-3

http://LearnDataVault.com
http://SuperChargeYourEDW.com
Table of Contents
Acknowledgements .................................................................................................................................. 10
1.8
Data Vault Basis of Commutative Properties and Set Based Math ....................................... 25
2.3
Metrics Vault.............................................................................................................................. 39
3.3
Load Dates ................................................................................................................................ 47
5.3
Flexibility .................................................................................................................................... 80
5.4
Granularity ................................................................................................................................. 83
5.5
Dynamic Adaptability ................................................................................................................ 86
5.6
Scalability................................................................................................................................... 87
6.9.2
Record Tracking Satellites .................................................................................................. 125

Dan Linstedt 2010-2011, all rights reserved
Table of Figures
Figure 1-1: Example E-R Diagram (Elmasri/Navathe) ............................................................................ 13
Figure 1-2: Crow's Foot and Arrow Notation Example ............................................................................ 15
Figure 1-3: Small Example: Ontology for Vehicle.................................................................................... 16
Figure 1-4: Example Abbreviations and Naming Conventions .............................................................. 18
Figure 1-5: Example Data Vault ............................................................................................................... 20
Figure 1-6: Flexibility of Adapting to Change .......................................................................................... 23
Figure 1-7: 3rd Normal Form Product and Supplier Example ................................................................ 24
Figure 1-8: Applied Set Theory for the Data Vault .................................................................................. 27
Figure 1-9: Parallel Computing Simplified .............................................................................................. 28
Figure 1-10: Logical Data Vault Hyper Cube........................................................................................... 29
Figure 1-11: Physical Data Vault Layout (Starting point) ....................................................................... 30
Figure 1-12: Physical Data Vault Layout (Partitioned) ........................................................................... 31
Figure 2-1: Enterprise BI Architectural Components ............................................................................. 37
Figure 3-1: Time Series Batch Loaded Data ........................................................................................... 43
Figure 3-2: Real-Time Arrival, Data Geology ............................................................................................ 44
Figure 3-3: Load Date Time Stamp and Record Source ........................................................................ 47
Figure 3-4: Example Load Date Time Stamp Data ................................................................................. 48
Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle ................................................ 49
Figure 3-6: Structures containing Last Seen Dates ............................................................................... 51
Figure 3-7: Scan all data in EDW............................................................................................................. 51
Figure 3-8: Reduced Scan Set after Applying Last Seen Date .............................................................. 53
Figure 4-1: Business Key Changing Across Line of Business ................................................................ 57
Figure 4-2: Hub Example Images ............................................................................................................ 58
Figure 4-3: Hub Example Data ................................................................................................................ 59
Figure 4-4: Smart Key Example ............................................................................................................... 65
Figure 4-5: Composite Business Key Hub Example ............................................................................... 66
Figure 4-6: Example Hub Entity Structure .............................................................................................. 67
Figure 4-7: Example Hubs from Adventure Works 2008 ....................................................................... 68
Figure 4-8: Example of National Drug Code Data Vault ......................................................................... 69
Figure 4-9: Dependent Child Relationship Modeling ............................................................................. 70
Figure 4-10: Typical Hub Row Sizing ....................................................................................................... 75
Figure 5-1: Relationship Changes Over Time ......................................................................................... 78
Figure 5-2: Link Table Structure Housing Multiple Relationships ......................................................... 79
Figure 5-3: Starting Model Before Changes ........................................................................................... 81
Acknowledgements
I wish to personally thank Kent Graziano for sticking by me all this time. His relentless editing skills have truly helped to shape and hone this book. It's taken me two years to put this book together, and countless hours of writing, creating graphics and examples in high quality print and color.
In addition to Kent, Tom Breur also assisted me in the editing process; he helped me to draw out important points. And yes, he wanted me to change to single spacing, but that's one thing I just didn't compromise on.
Then, there is Sanjay Pande; he's an IT veteran turned marketing expert who knows his stuff inside and out. He's been an inspiration to me to try new things, and create new titles for the book. He's also helping me with many other aspects of marketing that I wasn't even aware of.
I wish to thank my wife Julie for putting up with me spending hours editing my book (even on my vacations), which I really shouldn't do. My wife also helped me re-formulate the cover art and pick a cool looking design.
I'd also like to thank God for blessing me with this knowledge and then finally urging me to trust Him and write it down for others!
Finally, I'd really like to thank YOU, the reader. Many of you know me, or have seen me teach in person; without you, there would be no Data Vault successes in the world today. I love to hear about your trials, as well as your successes with the Data Vault; if you'd like to help me write (yet another) book of case-studies, then I want to hear from you!
Of course, if you're ever in Saint Albans or even Burlington, Vermont, drop me an email or call me; I'd be delighted to meet you for lunch.
Sincerely,
Daniel Linstedt
DanL@DanLinstedt.com
No, it is not necessary to be a data modeler to read this book. While a data modeling background is
helpful, it is not required. The writing covers the basic components of the Data Vault Model, and
also introduces information about the concepts utilized by nearly all relational database systems.
Experience with RDBMS engines also can be applied to the concepts and knowledge presented
here. This book also assumes you are familiar with the basics of data warehousing as defined by
W.H. Inmon and Dr. Ralph Kimball.
A common understanding of fields / columns, tables, and key structures (such as referential
integrity) is helpful. In the next section are descriptions of common terms used throughout this
book.
1.2
The terminology in this book consists of basic entity-relationship (E-R) diagramming and data
modeling terms. Terminology such as Table, Entity, Attribute, Column, Field, Primary Key, Foreign
Key, and Unique Index are utilized throughout. For reference purposes the following basic level
definitions of the terms are provided.
Table: A composite grouping of data elements instantiated in a database, making up a concept.

Entity: A table, as referred to in a logical format (e.g.: customer, account, etc.).

Attribute: A single data element comprised of a name, data type, length, precision, null flag, and possibly a default value.

Column: An ordered attribute within a table.

Field: Same as Column. See Column definition.

Primary Key: Main set of one or more attributes indicating a unique method for identifying data stored within a table.

Foreign Key: One or more attributes associated with the primary key in another table. Often used as lookup values; may be optional (nullable) or mandatory (non-null). When enabled in a database, foreign keys ensure referential integrity.

Unique Index: One or more attributes combined to form a single unique list of data spanning all rows within a single table.

Business Key: Component used by the business users, business processes, or operational code to access, identify, and associate information within a business operational life-cycle. This key may be represented by one or more attributes.

Natural Key: See Business Key.

Relationship: An association between or across exactly two tables.

Many to 1: A notation used to describe the number of records in the left-hand table as related to the number of records in the right-hand table. Example: many customer records may have 1 and only 1 contact record.

Many to Many: An open-ended notation. For example: where many customer records may have many contact records.

1 to 1: A notation dictating singular cardinality: 1 customer record may have 1 and only 1 contact record.

Cardinality: In mathematics, the cardinality of a set is a measure of the "number of elements of the set". For example, the set A = {1, 2, 3} contains 3 elements, and therefore A has a cardinality of 3. There are two approaches to cardinality: one which compares sets directly using bijections and injections, and another which uses cardinal numbers. Reference: http://en.wikipedia.org/wiki/Cardinality

Additional terms used in this book include: Constraint, Weak Relationship, Strong Relationship, and Associative Entity.
Data models are diagrammatic representations of information and classes of information to be held within a mechanical storage mechanism such as a database engine; except in the case of a conceptual/business model, which should be independent of technology. Common database engines today include: DB2 UDB, Teradata, MySQL, PostgreSQL, Oracle, SQL Server, and Sybase ASE. There are several main notations used for E-R diagrams (e.g., Chen, Barker, IDEF, etc.). An example of an E-R diagram using Elmasri/Navathe notation is below:

Figure 1-1: Example E-R Diagram (Elmasri/Navathe)

Data models (such as E-R diagrams) house linguistic representations of concepts tied together through associations. These associations can also be thought of as Ontologies. There are many types of data modeling notations available in the world today. Two main types are focused on in this document: 3rd normal form and Star Schema. For reference purposes, simple definitions of both styles are included below.

3rd Normal Form is defined as follows:

The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd[1] in 1971. Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

The relation R (table) is in second normal form (2NF).
Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every key of R.

A non-prime attribute of R is an attribute that does not belong to any candidate key of R.[2] A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X).[3]

A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:

X contains A (that is, X → A is a trivial functional dependency),
X is a superkey, or
A is a prime attribute (A is contained within a candidate key).

Zaniolo's definition gives a clear sense of the difference between 3NF and the more stringent Boyce-Codd normal form (BCNF). BCNF simply eliminates the third alternative ("A is a prime attribute").
http://en.wikipedia.org/wiki/Third_normal_form
1.3

Crow's Foot notation is utilized throughout this text to represent raw data models; in addition to the crow's-foot notation, this text introduces arrows to represent data migration paths (vectors/direction of data flow). It is occasionally easier to describe the vector notation to business users when compared with describing crow's-foot notation.

Figure 1-2: Crow's Foot and Arrow Notation Example

The "Crow's Foot" notation represents relationships with connecting lines between entities, and pairs of symbols at the ends of those lines to represent the cardinality of the relationship. Crow's Foot notation is used in Barker's Notation and in methodologies such as SSADM and Information Engineering. http://en.wikipedia.org/wiki/Entity-relationship_model
Data models function as ontologies in this world. They seek to organize a hierarchy of information into a classification system. Ontologies are extremely powerful notions that can capture augmented or enhanced metadata (information about the data model) that is not represented by the model itself.

In both computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it.
http://en.wikipedia.org/wiki/Ontology_(computer_science)
Ontologies are one way to represent terms beyond data modeling; much of the Data Vault model is based on the Ontology concepts. When the Data Vault model form is combined with the function of data mining and structure mining, then new relationships can be discovered, created and dropped over time. The ontology can be morphed or dynamically altered into new relationships going forward. There is more discussion on this topic in different sections of this book around the flexibility of the Data Vault model.

Figure 1-3: Small Example: Ontology for Vehicle

The ontology in Figure 1-3 is extremely simple and small. It represents the notion of the parent term vehicle, which contains the sub-classes: Car and Truck. Car and Truck are both types of vehicles; however, each has potentially different descriptors. Trucks generally contain larger frames, larger motors, larger wheels, and are capable of towing and hauling heavy loads, where cars generally have a smaller turning radius, use less gas, and can house more people.

Ontologies are powerful categorization and organization techniques. Imagine a set of music on a mobile computing device. Now imagine that there are many different categorizations for that music, ranging from year, to composer, to album, to band, to artist, lead vocalist, etc. Now stack these categorizations in different orders or hierarchies; they function as indexes into the data set. At the end of the index are the same music files; they are simply categorized differently. This is the basic makeup of ontologies.

However, this description can go deeper; switch the existing categories out for business terms, and begin to describe each category. For instance: Genre. Different people might define what is classified as "rock and roll" differently, but they are both right. Categorization is in the eye of the beholder, and is based on the individual's belief system and knowledge set (or context) surrounding the information at the bottom of the stack; which in this case are the music files.
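The idea of stacked categorizations acting as interchangeable indexes over the same underlying files can be sketched in a few lines of Python. The songs, category names, and hierarchy orders below are made-up illustrations, not anything prescribed by the book:

```python
# Sketch: the same music files, indexed by two different hierarchies.
songs = [
    {"file": "a.mp3", "genre": "Rock", "decade": "70s", "artist": "Band X"},
    {"file": "b.mp3", "genre": "Rock", "decade": "90s", "artist": "Band Y"},
    {"file": "c.mp3", "genre": "Jazz", "decade": "70s", "artist": "Trio Z"},
]

def build_index(items, keys):
    """Nest items under the given category keys, in order; the leaves
    are always the same underlying files."""
    if not keys:
        return [s["file"] for s in items]
    head, tail = keys[0], keys[1:]
    groups = {}
    for s in items:
        groups.setdefault(s[head], []).append(s)
    return {k: build_index(v, tail) for k, v in groups.items()}

# Two different hierarchies (orders of categorization) over the SAME files:
by_genre_decade = build_index(songs, ["genre", "decade"])
by_decade_genre = build_index(songs, ["decade", "genre"])
print(by_genre_decade["Rock"]["70s"])  # ['a.mp3']
print(by_decade_genre["70s"]["Rock"])  # ['a.mp3']
```

Either hierarchy reaches the identical leaf files; only the path (the categorization) differs, which is the point made above.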
The deeper into the ontology (or index), the more specialized and differentiated the definition becomes. For example: underneath Rock, there might be 70s, 80s and 90s rock, or there might be classic rock and roll; an individual who grew up in the 60s considers 60s and 70s music to be a part of classic rock, while an individual who grew up in the 90s or later considers any music earlier than 1985 to be classic rock. This is just one of the issues that the Data Vault Model and implementation methodology provides a solution to. This book will uncover the key to modeling Ontologies in enterprise data warehouses for use with Business Intelligence systems.
In fact, learning, warehousing, applying, and using ontologies is a critical success factor for handling, managing and applying unstructured data to a structured data warehouse. It is also a major component for operational data warehousing, along with business rule definition and dissemination of the data within an Enterprise Data Vault.
These are general descriptions of ontologies as used throughout this book. In addition to ontologies, data models typically contain short-hand notations for names of fields known as abbreviations. These abbreviations can have similar meaning within the same context (i.e. industry vertical) but may have different meanings across different contexts. For example: the abbreviation CONT in health care may mean contagious; in a legal system it may mean continuation. Abbreviations are best separated by vertical industry.
1.5
Physical data models often contain abbreviations for classifying tables and fields as many RDBMS
engines impose length limits on object names. The desire is to carry metadata meaning within the
abbreviations which results in a data dictionary being created. The naming conventions usually start
from the left hand side of the object name and move to the right with a logical flow with different
parts of the abbreviations separated by an underscore. The typical abbreviation is made up of
multiple components as shown in Figure 1-4:
Figure 1-4: Example Abbreviations and Naming Conventions

Vehicle = VEH
Car = CAR
Truck = TRK
2 Wheel Drive = TWOWHDRV, TWDRV
4 Wheel Drive = FOURWHDRV or AWD, FORWDRV

The suggested table naming conventions for the Data Vault are as follows:
Hub = HUB, or H
Link = LNK, or L
Hierarchical Link = HLNK, or HL
Same-As Link = LSA, SAL, SLNK, SL
Transactional Link = TLNK, TL
Satellite = SAT, or S
Hub Satellite = HSAT
Link Satellite = LSAT
Point-In-Time = PIT, or P
Bridge = BR, or B
Reference = REF, or R
Within each of the Data Vault tables there are standardized fields (more on this later). The naming
convention for these fields is as follows:
Load Date Time Stamp = LDTS, LDT
Load End Date Time Stamp = LEDTS, LEDT
Sequence Number = SQN
Record Source = REC_SRC, RSRC
Last Seen Date = LSD, LSDT
Sub-Sequence Number = SSQN
Always document the naming convention and the abbreviations chosen through a data dictionary in order to convey meaning to the business and the IT team. Naming conventions are vital to the success and measurement of the project. Naming conventions allow management, identification, and monitoring of the entire system no matter how large it grows. Once the naming convention is chosen, it must be adhered to (stick to it at all times). One way to ensure this is to conduct frequent data model reviews and require non-conforming objects to be renamed.
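Part of such a review can be automated. A minimal sketch of a convention checker follows; the prefix list and the example table names are illustrative assumptions, not the book's required set:

```python
import re

# Illustrative prefixes drawn from the table conventions above;
# adjust to whatever convention your data dictionary documents.
PREFIXES = ("HUB_", "LNK_", "HLNK_", "LSA_", "TLNK_",
            "SAT_", "HSAT_", "LSAT_", "PIT_", "BR_", "REF_")
NAME_RE = re.compile(r"^[A-Z][A-Z0-9_]*$")  # upper-case, underscore-separated

def non_conforming(table_names):
    """Return the names that violate the prefix or character convention."""
    return [n for n in table_names
            if not n.startswith(PREFIXES) or not NAME_RE.match(n)]

print(non_conforming(["HUB_CUSTOMER", "SAT_CUST_NAME", "customer_dim"]))
# ['customer_dim']
```

Running a script like this against the catalog of each environment makes the "rename non-conforming objects" rule enforceable rather than aspirational.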
1.6
The Data Vault model consists of three basic entity types: Hubs, Links, and Satellites (see Figure 1-5). The Hubs are comprised of unique lists of business keys. The Links are comprised of unique lists of associations (commonly referred to as transactions, or intersections of 2 or more business keys). The Satellites are comprised of descriptive data about the business key OR about the association. The flexibility of the Data Vault model is based on the normalization (or separation) of data fields into corresponding tables.
Figure 1-5: Example Data Vault
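The three entity types can be sketched as physical tables. This is a minimal sketch using SQLite; the table and column names (HUB_CUSTOMER, CUST_SQN, and so on) are illustrative assumptions following the naming conventions in section 1.5, not structures prescribed here:

```python
import sqlite3

# Hubs hold unique business keys; Links hold unique associations between
# Hub keys; Satellites hold descriptive data over time. LDTS = load date
# time stamp, REC_SRC = record source, SQN = sequence (surrogate) number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_CUSTOMER (
    CUST_SQN     INTEGER PRIMARY KEY,   -- surrogate sequence
    CUST_NUM     TEXT NOT NULL UNIQUE,  -- business key
    LDTS         TEXT NOT NULL,
    REC_SRC      TEXT NOT NULL
);
CREATE TABLE HUB_PRODUCT (
    PROD_SQN     INTEGER PRIMARY KEY,
    PROD_NUM     TEXT NOT NULL UNIQUE,
    LDTS         TEXT NOT NULL,
    REC_SRC      TEXT NOT NULL
);
-- Link: a unique list of associations between two (or more) Hub keys.
CREATE TABLE LNK_CUST_PRODUCT (
    CP_SQN       INTEGER PRIMARY KEY,
    CUST_SQN     INTEGER NOT NULL REFERENCES HUB_CUSTOMER(CUST_SQN),
    PROD_SQN     INTEGER NOT NULL REFERENCES HUB_PRODUCT(PROD_SQN),
    LDTS         TEXT NOT NULL,
    REC_SRC      TEXT NOT NULL,
    UNIQUE (CUST_SQN, PROD_SQN)
);
-- Satellite: descriptive data about the Hub key, keyed by load date.
CREATE TABLE SAT_CUSTOMER_NAME (
    CUST_SQN     INTEGER NOT NULL REFERENCES HUB_CUSTOMER(CUST_SQN),
    LDTS         TEXT NOT NULL,
    CUST_NAME    TEXT,
    REC_SRC      TEXT NOT NULL,
    PRIMARY KEY (CUST_SQN, LDTS)
);
""")
```

Note the separation: the business key, the association, and the descriptive attributes each live in their own table, which is the normalization the paragraph above describes.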
Let your Data Vault house the facts, and build your data marts to house the interpretation.
1.7
The Data Vault model is built for extreme flexibility and extreme scalability. The Link table separates
the relationships from the business key structures (the Hubs). The Link table provides for the
representation of the relationship to change over time. The Satellites provide the descriptive
characteristics about the Hubs or Links as they change over time.
For instance, suppose you own a car and you are the registered driver. You currently have two relationships to the car: one as a driver, and one as an owner. Now suppose you hired a driver. Well, you still own the car, right? Now, you have one relationship with the car as the owner, but the person you hired now has a relationship with the car as the driver. However, the description of the car has not changed.
What if you sold the car to someone else? Then your relationship with the car as an owner would END, and the buyer's relationship with the car would begin. This information about the relationship between business keys is what we keep in the Link structures. Again, the basic description of the car remains unchanged, so the Satellite data is untouched.
The Link table may also be applied to information association discovery. Businesses change frequently, redefining relationships and the cardinality of relationships. The Data Vault model approach responds favorably because the designer can quickly change the Link tables with little to no impact to the surrounding data model and load routines.
MAJOR FUNDAMENTAL TENET: THE DATA VAULT MODEL IS FLEXIBLE IN ITS CORE DESIGN. IF THE DESIGN OR THE ARCHITECTURE IS COMPROMISED (THE STANDARDS/RULES ARE BROKEN) THEN THE MODEL BECOMES INFLEXIBLE AND BRITTLE. BY BREAKING THE STANDARDS/RULES AND CHANGING THE ARCHITECTURE, RE-ENGINEERING BECOMES NECESSARY IN ORDER TO HANDLE BUSINESS CHANGES. ONCE THIS HAPPENS, TOTAL COST OF OWNERSHIP OVER THE LIFECYCLE OF THE DATA WAREHOUSE RISES, COMPLEXITY RISES, AND THE ENTIRE VALUE PROPOSITION OF APPLYING THE DATA VAULT CONCEPTS BREAKS DOWN.
For example, suppose a data warehouse is constructed to house parts; then, after 3 months in operation, the business would like to track suppliers. The Data Vault can quickly be adapted by adding a Supplier Hub, Supplier Satellites, followed by a Link table between parts and suppliers; the impact is minimal (if any) to existing loading routines and existing history held within.
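The parts-then-suppliers scenario can be sketched concretely. This sketch assumes SQLite, and the table and column names are illustrative only; the point is that the change consists purely of new tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Month 1: the warehouse only tracks parts.
conn.execute("""CREATE TABLE HUB_PART (
    PART_SQN INTEGER PRIMARY KEY,
    PART_NUM TEXT NOT NULL UNIQUE,
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL)""")
conn.execute("""INSERT INTO HUB_PART (PART_NUM, LDTS, REC_SRC)
                VALUES ('P-100', '2011-01-01', 'MRP')""")

# Month 3: the business asks to track suppliers. Only NEW tables are
# added; HUB_PART, its history, and its load routine are untouched.
conn.executescript("""
CREATE TABLE HUB_SUPPLIER (
    SUP_SQN INTEGER PRIMARY KEY,
    SUP_NUM TEXT NOT NULL UNIQUE,
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL);
CREATE TABLE LNK_PART_SUPPLIER (
    PS_SQN   INTEGER PRIMARY KEY,
    PART_SQN INTEGER NOT NULL REFERENCES HUB_PART(PART_SQN),
    SUP_SQN  INTEGER NOT NULL REFERENCES HUB_SUPPLIER(SUP_SQN),
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL,
    UNIQUE (PART_SQN, SUP_SQN));
""")
# Existing keys and history are still intact:
assert conn.execute("SELECT COUNT(*) FROM HUB_PART").fetchone()[0] == 1
```

Because the Link is many-to-many by construction, a later rule change such as "a part may have several suppliers" also requires no structural change.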
Figure 1-6: Flexibility of Adapting to Change

Figure 1-7: 3rd Normal Form Product and Supplier Example
In Figure 1-7, a Product can have 1 and only 1 Supplier, but a Supplier can supply many Products. With a model like this, the business rule may be: a Product can only be supplied by a single supplier, which means that the operational system that collects the information is coded accordingly. When or if the business changes its rule to say: a product can be supplied by more than one supplier, then the application must change, as must the underlying data model structure. While this appears to be a small change, it may affect all kinds of underlying information in the operational system; especially if the product is a PARENT to other tables.

For data warehouses (except Data Vaults) this structure leads to even more complexity. In a data warehouse that contains foreign keys embedded in child tables, this leads to cascading change impacts. In other words, any changes made to parent keys will cascade all the way down in to every single child table. The end result?

The end result is massive re-engineering efforts, and that's not all! The problem gets exponentially harder to handle with larger and larger data warehouse models.

This is the #1 reason why Data Warehouse/BI Projects are torn down, stopped, halted, burned, and ripped apart or labeled failures! The growing and already high cost of re-engineering, which is caused by poor architectural design and dependencies built in to your data warehouse model!

Don't let this happen to you! Use a Data Vault and avoid this mess up-front.
1.8 Data Vault Basis of Commutative Properties and Set Based Math

The basic notion is that A = B = C at a specific point in time, where A = a source system/source application, B = staging area, and C = enterprise Data Vault; such that A can be reconstituted for any point in time contained within C. This preserves the auditability of the data set housed within the Data Vault while offering base level integration across lines of business (see previous discussion on Hub based business keys).
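The reconstitution claim (A can be rebuilt from C for any point in time) can be sketched with a small satellite-style history table. The schema, dates, and names here are illustrative assumptions; the mechanism is simply that satellite rows are only ever inserted with a load date time stamp, so the source image at time T is the latest row per key at or before T:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SAT_CUSTOMER (
    CUST_SQN INTEGER, LDTS TEXT, CUST_NAME TEXT, REC_SRC TEXT,
    PRIMARY KEY (CUST_SQN, LDTS))""")
rows = [
    (1, "2011-01-01", "ABC Corp",     "CRM"),
    (1, "2011-03-01", "ABC Corp Ltd", "CRM"),  # name changed at the source
    (2, "2011-02-01", "XYZ Inc",      "CRM"),
]
conn.executemany("INSERT INTO SAT_CUSTOMER VALUES (?,?,?,?)", rows)

def as_of(t):
    """Reconstitute the source image as it looked at time t."""
    return conn.execute("""
        SELECT s.CUST_SQN, s.CUST_NAME
        FROM SAT_CUSTOMER s
        WHERE s.LDTS = (SELECT MAX(LDTS) FROM SAT_CUSTOMER
                        WHERE CUST_SQN = s.CUST_SQN AND LDTS <= ?)
        ORDER BY s.CUST_SQN""", (t,)).fetchall()

print(as_of("2011-02-15"))  # [(1, 'ABC Corp'), (2, 'XYZ Inc')]
print(as_of("2011-04-01"))  # [(1, 'ABC Corp Ltd'), (2, 'XYZ Inc')]
```

Because nothing is updated in place, every historical image of A remains recoverable from C, which is exactly what preserves auditability.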
In some cases B can represent the Data Vault while C represents as-is raw level star schemas. Raw level star schemas are utilized to show the business what the source systems are collecting, and where the gaps may be between the business rules, business operations, and source system applications. Information quality (IQ) can be improved through resolution of the identified gaps.

To find out more about gap analysis, please read the book: The Next Business Supermodel, The Business of Data Vault Modeling.
Another founding principle behind the Data Vault architecture is the use of set logic or set based math. The Hubs and Links are loaded based on union sets of information, while the Satellites are loaded based on delta changes inclusive of the union functionality. Set logic is applied to the loading processes for restartability, scalability, and partitioning of the components.

Standard set theory is defined as follows:

Set theory, formalized using first-order logic, is the most common foundational system for mathematics. The language of set theory is used in the definitions of nearly all mathematical objects, such as functions, and concepts of set theory are integrated throughout the mathematics curriculum. Elementary facts about sets and set membership can be introduced in primary school, along with Venn diagrams, to study collections of commonplace physical objects. Elementary operations such as set union and intersection can be studied in this context. More advanced concepts such as cardinality are a standard part of the undergraduate mathematics curriculum. http://en.wikipedia.org/wiki/Set_theory
In the Data Vault approach, set theory is applied to incoming data sets. The set theory applied in loading routines is depicted in Figure 1-8:

Figure 1-8: Applied Set Theory for the Data Vault

The set theory is applied again for Hub and Link loading, where only new data (not previously inserted) is applied or loaded. Set-based logic is applied when single distinct lists of keys are loaded to the target table where they haven't yet been loaded.
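A set-based Hub load can be sketched as a single statement: take the distinct business keys from staging and insert only the set difference against the Hub. This sketch assumes SQLite and illustrative table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE HUB_CUSTOMER (
    CUST_SQN INTEGER PRIMARY KEY,
    CUST_NUM TEXT NOT NULL UNIQUE,
    LDTS TEXT NOT NULL, REC_SRC TEXT NOT NULL)""")
conn.execute("CREATE TABLE STG_ORDERS (CUST_NUM TEXT, ORDER_NUM TEXT)")
conn.executemany("INSERT INTO STG_ORDERS VALUES (?,?)",
                 [("C1", "O1"), ("C1", "O2"), ("C2", "O3")])

def load_hub(ldts, rec_src):
    # Distinct keys from staging, minus the keys already in the Hub:
    # one set-based statement, and restartable, because re-running it
    # finds an empty difference and inserts nothing.
    conn.execute("""
        INSERT INTO HUB_CUSTOMER (CUST_NUM, LDTS, REC_SRC)
        SELECT DISTINCT CUST_NUM, ?, ? FROM STG_ORDERS
        WHERE CUST_NUM NOT IN (SELECT CUST_NUM FROM HUB_CUSTOMER)""",
        (ldts, rec_src))

load_hub("2011-01-01", "ORDERS")
load_hub("2011-01-01", "ORDERS")  # re-run: no duplicates
print(conn.execute("SELECT COUNT(*) FROM HUB_CUSTOMER").fetchone()[0])  # 2
```

Because the load is expressed over whole sets rather than row-by-row, the database engine is free to parallelize and partition the work, which is the scalability argument made above.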
1.9
Index Coverage
Data Redundancy (minimize this)
Parallel Query
Resource Utilization (split over hardware platforms)
The principles at work express themselves in the form of design and processing. The mathematics behind the Data Vault Model can be found by reading about parallel processing; specifically: Parallel Data Processing, Parallel Task Processing, and MPP systems design and architecture.

The topology of the computing cluster (database engine) can be any of the desired pieces including: star, ring, tree, hyper-cube, fat hyper-cube, or n-dimensional mesh. The Data Vault splits out the Business Keys, the relationships (associations), and the descriptive data (repetitive).
This is just one way to view the Data Vault Model; it is essentially based on the principles of a scale-free tree, all the way down to the individual table structures built within the model. Multiple scale-free trees are nothing more than more Hubs, Links, and Satellites within the Data Vault, thus producing a cube-like structure if desired. An example of a logical design or conceptual view of the Data Vault in a Hyper Cube might look something like this:

Figure 1-10: Logical Data Vault Hyper Cube
In the physical model, Hubs are connected to Link structures; Links become a physical notion for an association. In the physical Data Vault Model nodes are connected through Links to each other; they are not directly related. This is a conceptual basis for establishing the premise of the vision. Hubs provide the keys, while Satellites around Hubs describe the key for any given point in time. Hyper Cubes can be created, as can trees. A simpler vision or view of the Data Vault Model split for parallelism is in Figure 1-11:

Figure 1-11: Physical Data Vault Layout (Starting point)
This is where it starts: quite simple enough, no real partitioning of the data because the size of the data set is not yet large enough. All of the tables go through one, two, or three I/O connections to a SAN or a NAS drive. When the data set grows, physical partitioning (or split-off of tables) can occur. The end-result (to an extreme) might be as shown in Figure 1-12:
Source: http://en.wikipedia.org/wiki/Scale-free_network
Sourcing Problems:
o Synchronization / Source Data Availability time windows
o Cross-System Joins
o Cross-System Filters
o Cross-System aggregates
o Indexing issues, leading to performance problems
o Disjoint or missing source data sets
o Missing source keys
o Bad source data, out of range source data
o Source system password issues
o Source system Availability for loading windows
o Source system CPU, RAM, and Disk Load
o Source System structure complexity
o Source system I/O performance
o Source System transactional record locks
Transformation problems often IN STREAM
o Cleansing
o Quality and Alignment
o Joins
o Consolidation
o Aggregation
o Filtering
o Sequence Assignment - often leading to lack of parallelism
o Data type correction
o Error handling (when the database kicks it back)
o Error handling (data is: out of bounds, out of range)
o Size of Memory
o Lookup issues (more sourcing problems, caching problems, Memory problems)
o Sorting issues (large caches, disk overflows, huge keys)
o BUSINESS RULES, especially across SPLIT data streams
o Multiple targets
o Multiple target errors
o Multiple sources
o Single transformation bottleneck (performance, relationship joins, and so on)
Target Problems
o Lack of database tuning
o Index updates (deadlocking)
o Update, Insert, and Delete mixed statements forcing data ORDER to be specific, cutting off
possibilities for executing in parallel
o Block size issues
o Multi-target issues (too many connections, error handling in one stream holding up all other
targets in the same data stream)
o WIDE targets (due to business rules being IN-STREAM)
o Indexes ON TARGETS (because targets ARE the data marts)
o Lack of control over target partitioning
Along with many more issues. This is the traditional view of the issues that data integration specialists are left to solve. You are expected to construct load after load that answers ALL of these problems in a SINGLE data stream, right? Well, this is no way to do business. It increases complexity to an unimaginable level, and it contributes to the ultimate downfall of the data warehousing project!
Quality Software Management, Vol. 1, Gerald M. Weinberg, pp. 135-139:
When you develop your ETL for a star schema EDW, you essentially get a sequential set of (big T) transformations. As that sequence grows in size and complexity, the difficulty of testing it, and tracing errors back to the source, grows exponentially. Hence as your (S-S) EDW grows, you get haunted by ever-growing development cycles, and increasingly less control over the testing process, until your EDW has developed into yet another legacy system. And then you know what its fate will be.
Separate each source load, and land the data in the target. Make each load a very simple copy operation where the data is pulled from the source and landed directly in the target; in this case, a STAGING AREA (as defined in the next chapter). Yes, you may source a CDC (Change Data Capture) operation if you so desire.

Run the staging loads when the data is ready! Don't wait for other systems, don't perform any other systems' joins, and don't force the data to conform or align with specific rules or datatypes.
These two simple rules ensure that you can get in, get the data from the source, and get out when the source
data is ready to go. No waiting! No Joining! No timing complexity! No performance problems! You can always
take a copy operation and partition the target, and partition the load for MAXIMUM throughput!
Transformation problems: Divide and conquer. The following rules make it much easier to deal with this part of
the loading cycle:
Move the business rules downstream. This includes all the joins, filters, aggregations, quality,
cleansing, and alignments that need to happen between the Data Vault and the Data Marts. This
also allows you to effectively target the PROPER data mart with the PROPER rule set (as deemed
appropriate by the business).
Load raw data into the Data Vault area; this provides SIMPLE, maintainable, and easy-to-use loading code that meets the needs of the business. It also prevents you from having to re-engineer loading routines to add new systems, or add new data. Sure, you end up with a lot more routines, BUT each one is a thousand times less complex, and easier to manage.
The end result for this? You can PARALLELIZE the loading routines to your data warehouse AND you can load
data to your Data Vault in REAL-TIME at the SAME TIME as your batch loads are running. Just try that with your
standard star-schema!
Targeting problems: They all but disappear. Why? Because once you divide and conquer, your loading routines will be built for inserts only (high speed inserts at that), and they generally will contain only one or two target tables for loading purposes! No more locking problems, no more worries about wide rows (except when you get to loading data marts; that's another story). High degrees of parallelism, high degrees of partitioning, high performance, and really low complexity scores: what more could you ask for?
1.11 Loading Processes: Batch Versus Real Time
This book introduces the concepts with a small bit of background; it is meant to be only an introduction to the loading patterns and processes used within the Data Vault. The purpose of this entry is to define the basic terms of batch loading and real-time loading.
Batch Loading: usually occurring on a scheduled basis, loading any number of rows
in a batch. The execution timing will vary from every 5 minutes to every 24 hours,
to weekly, to monthly, and so on. Any load cycle running every 20 seconds or less tends to fall close to the real-time loading category. All other scheduled cycles tend to be labeled mini-batches.
Real-Time Loading: there is a grey area of definition between what a batch load is and what a real-time load is. For the purposes of this book, real-time loading is any loading cycle that runs continuously (never ends) and loads data from a web-service or queuing service (usually) whenever the transactions appear.
Neither loading paradigm has any effect on the data modeling constructs within the Data Vault. The Hub, Link, and Satellite definitions remain the same and are capable of handling extremely large batches of data, and/or extremely fast (millisecond feed) loads of data.
Figure 2-1: Enterprise BI Architectural Components

The Data Vault methodology includes each of these components. The architectural components discussed in this book (in detail) include the Staging area and the Data Vault. This section briefly introduces the other sections as part of the architecture for you to consider.

2.1 Staging Area
These tables do not carry any foreign keys, or original primary key definitions. Exceptions: Loading a
de-normalized COBOL based file, and executing normalization (splitting into multiple tables), the
staging tables will carry parent ID references. Loading a denormalized XML based file and executing
normalization, the staging tables will carry parent ID references.
The staging area may be partitioned in any manner desired. The format is owned and maintained by the data warehousing team. The staging area tables may also contain any indexes needed (post-load) in order to provide the data warehouse/Data Vault loads with the proper performance downstream. Staging area data should be backed up at regular intervals (if the data arrives in real-time); otherwise it will be backed up at scheduled intervals.
The future need for a staging area is in question. In fact, within the operational Data Vault and
100% real-time feeds there appear to be no real needs to have a staging area. There are already a
few Operational Data Vaults built using the principles of by-passing the staging area, and loading
data directly (from the real-time feeds/web-services) to the Data Vault. The only reasons for staging
areas to continue to exist (as of 2010) include the following:
2.2 Enterprise Data Warehouse (EDW)
The EDW (enterprise data warehouse), or core historical data repository, consists of the Data Vault
modeled tables. The EDW holds data over time at a granular level (raw data sets). The Data Vault is
comprised of Hubs, Links, and Satellites (defined in section 1.6 and further defined throughout this
book). The Enterprise Data Warehousing Layer is comprised of a Data Vault Model where all raw
granular history is stored. Unlike many existing data warehouses today, referential integrity is
complete across the model and is enforced at all times. The Data Vault model is a highly normalized
architecture. Some Satellites in the Data Vault may be denormalized to a degree under specific
circumstances.
The Data Vault model follows all definitions of the Data Warehouse (as defined by Bill Inmon) except one: the Data Vault is functionally based, not subject oriented; meaning that the business keys are horizontal in nature and provide visibility across lines of business.
2.3 Metrics Vault
A component for capturing technical metrics about the: load process, loading time-lines, completion rates, amount of data moved, growth of tables, files, and indexes. This Data Vault captures the technical metadata for the processes and the database. By capturing growth rate actuals along with run-times, insert numbers, update numbers, and row counts, projections of future storage requirements can be created and managed. This allows the business to monitor their needs, and budget 6 months to 1 year in advance for future hardware.

The Metrics Vault can also be crafted to include information about CPU utilization, RAM access, I/O throughput and I/O wait times. The additional information in the Metrics Vault begins to provide a consistent and concise view of the utilization of the system in conjunction with the growth of the data sets and the hot spots on disk. From all of these metrics, a nearly complete technical management dashboard can be presented to monitor the EDW effort.

2.4 Meta Vault
2.5 Report Collections
Report collections are defined as flat-wide denormalized structures, used for high-speed reporting or
flat file output access; they may also be used by data mining tools. They are a form of data mart
where end-user access is direct. Report collections provide the business users with pre-computed
totals at the end of each row. These pre-computed totals allow high speed filtering against patterns
of rows that are out of the normal zone (in other words, breaking business requirements).
2.6 Data Marts
Data marts are defined as: any point at which generic users directly access the structures and the
data for ad-hoc reporting, or drill-down analysis. This may or may not be a Star Schema. It may also
include normalized and denormalized tables. Data Marts may be virtualized; for example: in-RAM
cubes, and dynamically altered information sets. A form of a data mart is an Excel spreadsheet that
communicates directly with the Data Vault through an interactive metadata layer (possibly
something like Microsoft SharePoint direct to the Data Vault back-end). Direct communication
between the user, the metadata management, and the Data Vault is the beginnings of an
Operational Data Warehouse.
For purposes of auditability and accountability the data is separated into two physical layers:
corporate marts, and error marts. Corporate marts serve as the standard data marts, where data
that meets soft business rules is contained. Error marts serve as the landing zone for bad data,
that is: data that does not meet soft business rules. The definition of hard and soft business rules
is covered in the book: The Next Business Supermodel, the Business of Data Vault Modeling.
2.7 Business Data Vault
There is a new component in the architecture (not shown in Figure 2-1). The component is called
the Business Data Vault. Business users and IT alike are seeing the benefits of the flexibility,
scalability, and adaptability of the Data Vault model. They want the benefits, but with the business
data embedded. Downstream of the raw Data Vault, (between the Data Vault and the Data Marts in
the Figure 2-1) they are building a new store called the Business Data Vault.
The Business Data Vault (BDV) is a concept: a grouping of specific tables fashioned using Data Vault modeling concepts, but not necessarily following all the Raw Data Vault modeling rules. A
Business Data Vault (also known as EDW+) can be a group of tables inside the raw Data Vault
(where the record source has changed), or can be a completely separate data store. Either way, the
data that exists in the BDV has been altered, cleansed and changed to meet the rules of the
business and is downstream of the raw Data Vault. You may be able to dual-purpose the BDV and
apply master data rules as well, thus making the BDV a starting point for a Master Data System.
Dan Linstedt 2010-2011, all rights reserved
http://LearnDataVault.com
Page 41 of 152
The Business Data Vault contains all business data, all altered data, aggregated, and cleansed
information. IT staff are executing the business transformations once, assigning more metadata
(including master data definitions), and then releasing (through simple copy) the data needed in the
marts. The Business Data Vault is considered an extra copy of the information; however it is paired
with the business metadata and all of the transformations needed to make virtual cubes and high
speed delivery possible. The argument received from the business is that the data (post-transformation) is used on the financial reports, and as such, must also be accountable and
auditable. Therefore a second copy of the data (post-transformation) is necessary as another
system of record.
The technical argument provided is that the IT staff only wishes to do the transformation once, or that they have a standing order to provide "virtual marts"; which in this case translates to RAM-based cubes, and views that look like dimensions and facts.
2.8 Operational Data Vault
The nature of the Raw Data Vault (EDW as depicted in Figure 2-1) is changing to include operational
data. The need to combine/consolidate operational data with the raw Data Vault is being driven by
Master Data Initiatives, and business needs. The business wants more historical data mixed with
current transactions at their finger-tips.
In order to meet this demand the Data Warehousing teams are loading operational data (real-time
loading) directly in to the Raw Data Vault, thus creating an Operational Data Vault. The entire
discussion of Operational Data Vaults is outside the scope of this text, and will be defined elsewhere
in articles and discussion forums.
WARNING: AN ODV INHERITS ALL THE ISSUES, PROBLEMS, AND RELIABILITY CONCERNS OF AN OPERATIONAL SYSTEM. ITEMS SUCH AS GOVERNANCE, UP-TIME (6x9'S), 24x7x365 SUPPORT, ALL COME TO BEAR WITH AN OPERATIONAL DATA VAULT. THE DECISION TO BUILD ONE SHOULD NOT BE TAKEN LIGHTLY.
What is an Operational Data Vault? The Operational Data Vault is part data warehouse, and part on-line transactional data store (operational data store). The Operational Data Vault stores all changes to data as inserts (as does a traditional data warehouse); however at the same time it also offers update/edit access to the operational applications sitting directly on top of the data warehouse.
In case you are wondering: has this ever been done successfully? The answer is yes, it has, several times already. A company called Cendant Timeshare Resource Group (Cendant TRG) rebuilt
their entire operational layer in Java directly on top of the Data Vault, consolidating data
warehousing directly with operational applications. There were no separate systems for reporting,
no separate systems for operational data or OLTP applications, simply the Data Vault and the Java
OLTP application. This is one example which has been in use since 2001.
Another example is a drug manufacturing traceability warehouse that was built in 2008 for the US
Congress. This Data Vault had operational applications that were driven by drug packaging
machines which assigned unique IDs to every drug package from every manufacturer around the
world. These machines fed the data over remote web-services connections directly to the Data Vault
every 10 minutes, where the data was encrypted, secured, and stored only to be accessed every
time the drug was scanned at different points in the supply chain. At which time the warehouse
would provide different web-service access points to retrieve audit trails of all points where the drug
was scanned. In this manner, you (the consumer) could log in to a web-site after purchasing a drug, type in its bar-coded number, and check its authenticity. It was called: Drug Track And Trace, an anti-counterfeit operation.
2.9 Dynamic Data Vault
The Dynamic Data Vault is an operational Data Vault with dynamic adaptation to the structure. In
other words, the tables, columns, indexes, and keys are all subject to change automatically. Of
course to achieve this state requires a constant vigilant watch on the metadata, including but not
limited to incoming structures. The incoming structures may include XSD, XML, staging tables, or
other metadata (including queue based or process metadata) that describe the structure of the
incoming data set.
The dynamic nature of the Data Vault means: new attributes may be added to Satellites, and new Links and new Hubs may be formed on the fly. ETL/ELT loading code will be adjusted automatically, and BI query views will also inherit certain changes. At the end of all the automatic model changes, emails of the changes are sent to the IT staff for review in the morning.
3.0 Common Attributes
The Data Vault structures (tables) contain standard attributes that assist with the construction, tracking, and querying. The common attributes in the Data Vault are defined here and are applied throughout the Hubs, Links and Satellites. The common attributes include: sequence numbers, sub-sequence numbers (line item numbers), load dates, load end-dates, last seen dates, extract dates, record creation dates, and record sources.

Most of these fields are EDW (enterprise data warehouse) system defined, and EDW system generated/maintained; as a result, the data in these columns are reference data and are non-auditable as they do not exist in the source system. However, record creation dates and line-item numbers are two cases that are auditable, particularly when they exist in the source system.

The Data Vault works on principles similar to geological layering where data arriving in the warehouse (in a single batch) is stamped with a geological time based layer (a load date time stamp). The load dates enforce audit trails and record history based on the one and only controllable system date time available to the EDW loading routines. The only point at which this principle does not apply is during real-time feed processing.

Figure 3-1: Time Series Batch Loaded Data
Sequence Numbers
Sequence numbers are required by relational database management systems (RDBMS) in order to process joins quickly and efficiently. Without sequence numbers, the joins across huge amounts of information would be character-based joins, which operate comparatively slowly. The use of sequence numbers as primary keys for Hubs and Links also eliminates any possible issues maintaining multi-part cascading keys in Satellites or nested Link tables.
Staging area sequences are stored within the staging area. These sequences should be restarted
and set to cycle over for each load to a specific table. Staging sequence numbers are utilized only to
identify loaded duplicates. Staging area sequences should not ever leave the staging area, and
should not be moved forward into the Data Vault.
Duplicates are rows that have 100% completely the same data - from the keys, to the nulls, to the
descriptive fields. When the data is 100% duplicate, there needs to be a way to delete the rows
from the staging table in order to proceed with loading only one unique copy to the target Data Vault.
Without a sequence number, there is no unique identifier on each row. With a sequence number it
is easy to pick the first or last row as the candidate to leave in place and delete the rest.
Before deleting the duplicates the Metrics Vault should record a history of how many duplicates
there are per staging table per business key. By counting the duplicates auditability can be
maintained if the IT staff is ever asked to reproduce the source load. The number of duplicates
multiplied by one row provides the recreation with an accurate picture. In other words, a Cartesian
join product is applied in order to reproduce the original duplicate row set.
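A hedged sketch of this staging dedup step (Python sqlite3; the STG_ORDERS table and its columns are invented for illustration). The staging sequence exists only to tell otherwise 100%-identical rows apart, and the duplicate counts are recorded first so the original source load could be reproduced for an audit:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# SEQ_ID is the staging-area sequence; it never leaves the staging area.
cur.execute("""CREATE TABLE STG_ORDERS (
    seq_id INTEGER PRIMARY KEY, order_num TEXT, status TEXT)""")
cur.executemany("INSERT INTO STG_ORDERS (order_num, status) VALUES (?, ?)",
                [("O1", "open"), ("O1", "open"), ("O1", "open"), ("O2", "shipped")])

# Record duplicate counts first (feeding a Metrics Vault), preserving
# auditability: count-per-key times one row recreates the source picture.
dup_counts = cur.execute("""
    SELECT order_num, status, COUNT(*) FROM STG_ORDERS
    GROUP BY order_num, status HAVING COUNT(*) > 1""").fetchall()

# Keep the first (lowest-sequence) row of each identical set; delete the rest.
cur.execute("""
    DELETE FROM STG_ORDERS
    WHERE seq_id NOT IN (
        SELECT MIN(seq_id) FROM STG_ORDERS GROUP BY order_num, status)""")
con.commit()

print(dup_counts)  # [('O1', 'open', 3)]
print(cur.execute("SELECT COUNT(*) FROM STG_ORDERS").fetchone()[0])  # 2
```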
Hub and Link sequence numbers are created 1 for 1 with each unique business key and unique
association inserted to the respective table. Satellite sequence numbers are generally parent table
sequence numbers, in other words they are inherited from the Hub or Link parent table.
It is a recommended practice to set up sequence numbers as number(12). In Oracle there appears to be no byte-storage difference between a number(12) and a number(38). Most sequence numbers will fit within this length, and will not require double or floating point math to resolve at query time.
Sub-sequences depend on parent tables for context, and within context have business meaning; however as stand-alone attributes they hold no business meaning whatsoever. In this regard sub-sequences do not work well as independent Hub keys. Sub-sequences may also be defined as "ghost" Hub tables if logically modeled, but should not ever be physically implemented. For example: a line-item number 5 has no context; however it is required when discussing a particular detail item on an invoice. Sub-sequence numbers are utilized to order Link or Satellite rows. In Link tables they are part of the unique index; in Satellite tables they can be included in order to provide context, called: multiple active Satellite rows.

Sub-sequences simply allow multiple rows to be active for a single master key. It is a best practice to avoid sub-sequence numbers if at all possible. When used in a Link table they can cause re-engineering of the loads in the future (if the Link structure changes).
WARNING: IF SUB-SEQUENCES APPEAR IN THE MODEL, IT MAY BE A CALL TO RESEARCH FURTHER. TAKE THE TIME TO INVESTIGATE IF A LINK AND NEW HUB TABLES NEED TO BE DEFINED. IT IS COMMON TO MISTAKE THE NEED FOR A SUB-SEQUENCE WHEN THE CORRECT MODEL WILL HAVE ONE OR MORE NEW HUBS WITH A LINK IN PLACE, EXCEPT IN REAL-TIME MILLISECOND SYSTEMS.
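Where a sub-sequence is genuinely warranted, the "multiple active Satellite rows" shape can be sketched as follows (Python sqlite3; the SAT_CUSTOMER_PHONE structure and its columns are invented for illustration, not taken from the text). The sub-sequence joins the parent key and load date in the primary key, letting several rows be active at once for one parent:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# SUB_SEQ participates in the key so one customer can carry several
# active rows (e.g., multiple phone numbers) within the same load.
cur.execute("""CREATE TABLE SAT_CUSTOMER_PHONE (
    customer_sqn INTEGER,
    load_dts     TEXT,
    sub_seq      INTEGER,
    phone        TEXT,
    PRIMARY KEY (customer_sqn, load_dts, sub_seq))""")

cur.executemany("INSERT INTO SAT_CUSTOMER_PHONE VALUES (?, ?, ?, ?)", [
    (1, "2010-01-02 00:00:00", 1, "555-0100"),
    (1, "2010-01-02 00:00:00", 2, "555-0199"),  # second active row, same parent/load
])
con.commit()

active = cur.execute("""SELECT phone FROM SAT_CUSTOMER_PHONE
                        WHERE customer_sqn = 1 ORDER BY sub_seq""").fetchall()
print([p[0] for p in active])  # ['555-0100', '555-0199']
```

Without the sub-sequence in the key, the second insert would violate the primary key, which is exactly the constraint the multi-active pattern relaxes.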
3.3 Load Dates
Load dates are system generated, system maintained fields. This attribute is applied to the arriving data set in both real-time and in batch modes. Load dates represent the date time stamp (according to the EDW machine clock) of the arriving data. Load dates are applied to data sets arriving in the staging area of the Data Vault.

Load dates for real-time data are applied based on the clock time arrival of the transactions housed in the incoming queue. Load dates for batch based data are set once per batch. They can be thought of as a date-time-stamp equivalent to a batch load process identifier. As described above (see Figure 3-1), the Data Vault relies on the notion that load dates are consistently applied per batch for tracking purposes. The Load Date should not be set by repetitive system calls throughout the life-cycle of a single load, nor should it be changed from one set of staging data to another. The load date time stamp is the identifier that indicates which geological layer (in time series) this data applies to.

Load dates should be looked up from a single table called CONTROL_DATE, which is housed within the staging area and contains a single column, single row of information. The loads to the staging tables look up the load-date (LOAD_DTS), and hard-code the record source. For example, if the batch window is a nightly batch that begins at 22:00 hours and completes at 06:00 hours the following morning, then the LOAD_DTS should be set to 00:00 hours of the following morning's day start across all data in the staging area.

Figure 3-3: Load Date Time Stamp and Record Source
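A minimal sketch of the single-control-date pattern (Python sqlite3; STG_SALES and its record source value are illustrative assumptions). Every staging load reads one LOAD_DTS from CONTROL_DATE and stamps it, unchanged, on all rows of the batch, rather than calling the system clock per row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# CONTROL_DATE: single column, single row, set once before the batch window.
cur.execute("CREATE TABLE CONTROL_DATE (load_dts TEXT)")
cur.execute("INSERT INTO CONTROL_DATE VALUES ('2010-06-15 00:00:00')")

cur.execute("""CREATE TABLE STG_SALES (
    sale_id INTEGER, load_dts TEXT, record_source TEXT)""")

# Look up the load date ONCE; hard-code the record source.
(load_dts,) = cur.execute("SELECT load_dts FROM CONTROL_DATE").fetchone()
for sale_id in (101, 102, 103):
    cur.execute("INSERT INTO STG_SALES VALUES (?, ?, 'SALES_SYSTEM')",
                (sale_id, load_dts))
con.commit()

stamps = cur.execute("SELECT DISTINCT load_dts FROM STG_SALES").fetchall()
print(stamps)  # [('2010-06-15 00:00:00',)] -- one geological layer for the batch
```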
By keeping a consistent load date time stamp on the information, it becomes possible to trace errors and find technical load problems (affecting the data) months after the load has occurred. It also becomes possible to remove that layer of geology and identify how far the problem data has spread. The resulting load cycles become repeatable, consistent, and restartable for any given load cycle across all time.
On occasion the historical date time stamp must coincide with the creation date of the data, and sometimes the granularity of the creation date may be monthly where the current loads occur daily. For these reasons, controlling the load date time stamp as a single unit provides full flexibility of historical loads for specific grains of data as well, providing snapshot availability.

3.4 Load End Dates
Load end dates are system computed attributes. These are mechanical attributes that exist solely to make queries against the Data Vault easier. Load end dates are NOT necessary for the architecture; they are query attributes only. These attributes indicate the end of the data lifecycle within the loading time-frame of the Data Vault. Time-series based database engines are capable of computing data life-cycles without resorting to load end date columns.

Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle

Load end dates are set according to the next current row load date. They may be exclusive (as indicated here) where 1 second has been subtracted from the next most current load date, or inclusive (not indicated here) where they are equal to the load date from the next most recent row. Load end dates of the current row are shown in Figure 3-5 to be NULL. It is optional to configure them to be future dated if desired. Load end dates that are future dated do not need to be relocated on disk when updated (the end date is reset).

Load end dates which are NULL do not cause the row to migrate to another disk block when updated or end-dated; most RDBMS engines make date/time data types take the same amount of bytes whether NULL or not, which is known as a fixed-length column in the database engine. Load end dates are not auditable as they are system computed values. Load end dates must be updated in the row-set in order to ensure time-line consistency. Figure 3-5 depicts a Satellite entity with customer names. Satellites are defined in detail in Chapter 5.0.
Last seen dates are a particular component of the architecture that enable source-system hard
delete monitoring without resorting to complete scans of the data set currently in the Data Vault.
Last seen dates are optional within the Hubs and Links. The last seen metadata can be tracked in
alternative Satellites for better resolution at a lower level of detail.
Last seen dates are not required by the architecture of the Data Vault to stand up and work properly.
There are other manners in which to track data (discussed in the Satellite chapter) that may provide
more information than the last-seen-date; an alternative architecture is a status-tracking-Satellite, or
a record source tracking Satellite.
NOTE: LAST SEEN DATES SHOULD NOT BE USED IF THERE IS AN AUDIT TRAIL AVAILABLE. AN AUDIT TRAIL IS MORE ACCURATE FROM THE SOURCE SYSTEM AND ELIMINATES THE NEED TO IMPLEMENT LAST SEEN DATES. AUDIT TRAILS MAY ALSO BE UTILIZED IF GENERATED FROM CHANGE-DATA-CAPTURE (CDC) UPSTREAM IN THE SOURCE SYSTEM.
The problem faced by enterprise data warehouses is: detecting hard-deletes of source data while
the set of EDW data is continuously growing. The case is as follows: a source system does not
provide an audit trail, nor does it provide any event or transaction indicating which rows are being
deleted or removed. During every load cycle the entire source table/xml file is simply dumped and
loaded to the staging area of the Data Vault.
Traditional set theory dictates that in order to find missing rows that have been hard-deleted from
the source feed, a process takes place that scans everything in the Data Vault that does not exist on
any of the source feeds. This is an extremely expensive operation, and cannot be mathematically
sustained for high volume data warehouses. At some point running the full scan on the Data Vault
becomes impossible. It is at this point that the set can be contained or limited to a finite point by
introducing a system maintained date stamp called a last seen date. An example of the structure
can be seen in Figure 3-6 below.
Figu
ure 3-6: Strucctures containing Last Seeen Dates
The following section addresses technical implementation, which is out of scope for this
document. However, the following information is necessary to assist in the explanation of this
concept of last-seen-date; therefore it will be included in this text.
Figure 3-7: Scan all data in EDW
Last seen dates provide a mechanism to reduce the data set scanned to detect missing rows on the
source feed. A different architecture known as Status Tracking Satellites can provide more detailed
information on the appearance and disappearance of the keys. Status Tracking Satellites may be
used in place of Last Seen Dates. These Satellites are covered in the Satellite chapter.
Dan Linstedt 2010-2011, all rights reserved
For example, suppose the Hub_Customer had 800 million customer keys. The source feed delivers 30
million keys on a nightly basis. The customer keys arriving on the source feeds originate from three
applications: finance, sales, and contracts. The SQL query / code for detecting hard-deleted keys
(without utilizing a last seen date) is as follows:
<Mark status as deleted for records in the following set: >
Select *
from HUB_CUSTOMER where
Customer_Acct_Num not in
(
Select cust_acct from STG_MANUFACTURING
UNION ALL
Select customer_acct_num from STG_SALES
UNION ALL
Select cust_num from STG_CONTRACTS
)
First, a last seen date column must be added to the HUB_CUSTOMER table. Second, a new business
rule is created and signed off on by the business in a service level agreement (SLA). The new rule is:
data is aged, and not marked as deleted, until it hasn't been seen for more than 3 weeks. The keys
in HUB_CUSTOMER are tracked by reversing the set logic in the following code (which presumes
Last Seen Date is a column in the Hub):
Update HUB_CUSTOMER set Last_Seen_Date = Load_DTS
Where
Customer_Acct_Num in
(
Select cust_acct from STG_MANUFACTURING
UNION ALL
Select customer_acct_num from STG_SALES
UNION ALL
Select cust_num from STG_CONTRACTS
)
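To make the two-step logic concrete, here is a minimal sketch in Python using SQLite: refresh the last seen date for keys that arrive on the feed, then flag keys not seen within the 3-week SLA window. The tables are cut down to the columns the example needs (a single staging feed, ISO-8601 date strings); this is an illustration of the aging rule, not production loader code.

```python
import sqlite3
from datetime import date, timedelta

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE HUB_CUSTOMER (
        Customer_Acct_Num TEXT PRIMARY KEY,
        Last_Seen_Date    TEXT
    );
    CREATE TABLE STG_SALES (customer_acct_num TEXT);
""")
con.executemany("INSERT INTO HUB_CUSTOMER VALUES (?, ?)",
                [("ABC-123", "2011-02-27"),
                 ("DEF-456", "2011-01-15")])  # DEF-456 stopped arriving weeks ago
con.execute("INSERT INTO STG_SALES VALUES ('ABC-123')")  # tonight's feed

load_dts = date(2011, 3, 1)

# Step 1: stamp the last seen date for every key present on the feed
con.execute("""
    UPDATE HUB_CUSTOMER
       SET Last_Seen_Date = ?
     WHERE Customer_Acct_Num IN (SELECT customer_acct_num FROM STG_SALES)
""", (load_dts.isoformat(),))

# Step 2: apply the 3-week aging rule; only rows whose last seen date fell
# outside the window are delete candidates -- no full Hub-vs-feed scan needed
cutoff = (load_dts - timedelta(weeks=3)).isoformat()
aged = con.execute(
    "SELECT Customer_Acct_Num FROM HUB_CUSTOMER WHERE Last_Seen_Date < ?",
    (cutoff,)).fetchall()
print(aged)  # [('DEF-456',)]
```

Note that the expensive NOT IN scan from the first query never runs here; the aged-key query touches only the Hub's last-seen column, which bounds the set regardless of how large the warehouse grows.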
Extract Dates
Extract dates are wonderful to capture if they are available on the source systems. Extract dates
represent the date and time that the data is extracted or written to flat-file on the source system.
Extract dates typically are not available when direct SQL access is utilized. Extract date and time for
SQL extracts is usually stored in the metadata (process logs of the ETL performing the extract), so it
is not required to store in the Data Vault structures directly. Extract dates for flat-files are extremely
helpful, particularly if the data set is pulled from several areas around the world.
Extract dates are not reliable, as in some cases the extract may be created on a PC where the clock
and system date time are in question. In other cases the server performing the extract may be in a
different time zone than either the source system or the data warehouse server. Bottom line: the
EDW team generally has no control over the extract date and time on the source system; therefore it
is non-auditable data. It is reference data in a manner of speaking, and as such should be stored
as just another attribute of the Satellites in the Data Vault.
3.7 Record Creation Dates
Record creation dates are wonderful if they are available on the data set. If they are available they
should be recorded as attributes in Satellites. They should not be a part of the Hub nor the Link
structures as they are not reflective of the key structures or associations. Record creation dates
generally represent the date and time of creation of the source system row (in its entirety). In some
cases these date time stamps may be edited by the business users on the source system (which
means they can change over time).
Regardless of the case, the EDW team has no governance to cover the management and
consistency of record creation dates. Furthermore, even if governance procedures existed, it would
be a great undertaking to ensure governance over 100% of the source system data. The result is a
non-auditable field which must be treated the same as any other source system data: as an
attribute in a Satellite.
3.8 Record Sources
Record source columns are row-based metadata that represent where the row originated. These are
hard-coded values applied to maintain traceability of the arriving data set. Record sources can be
codified with the descriptions residing within reference tables. Record sources should be
architected to the lowest level of granularity. For example: SAP.FINANCE.GL (indicating an SAP
source system, followed by a financial application, followed by General Ledger).
Record sources are metadata that must be carried into the staging area (they are hard-coded in the
staging loads to the Data Vault). They can be created as lookup codes or lookup sequences to avoid
duplication of the data set in high volume situations. They are then resolved on the way from the
Data Vault to the data marts. If they are created as lookup codes, they are placed in a
reference table. Reference tables are covered in the reference data chapter, chapter 8.0.
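As a sketch of the lookup-code idea: store a compact code on each row and resolve it against a reference table on the way out to the marts. The table and column names here (REF_REC_SOURCE, REC_SRC_CODE) are hypothetical, not the chapter 8.0 reference design; SQLite stands in for the warehouse engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Reference table resolving compact codes to full record source strings
    CREATE TABLE REF_REC_SOURCE (
        REC_SRC_CODE INTEGER PRIMARY KEY,
        REC_SOURCE   TEXT NOT NULL
    );
    -- Hub rows carry only the small integer code, not the repeated string
    CREATE TABLE HUB_CUSTOMER (
        CUST_ACCT_NUM TEXT PRIMARY KEY,
        REC_SRC_CODE  INTEGER REFERENCES REF_REC_SOURCE
    );
""")
con.executemany("INSERT INTO REF_REC_SOURCE VALUES (?, ?)",
                [(1, "SAP.FINANCE.GL"), (2, "SAP.SALES.ORDERS")])
con.executemany("INSERT INTO HUB_CUSTOMER VALUES (?, ?)",
                [("ABC-123", 1), ("DEF-456", 2)])

# Resolve codes to full record source strings on the way out of the Data Vault
rows = con.execute("""
    SELECT h.CUST_ACCT_NUM, r.REC_SOURCE
      FROM HUB_CUSTOMER h
      JOIN REF_REC_SOURCE r ON r.REC_SRC_CODE = h.REC_SRC_CODE
     ORDER BY h.CUST_ACCT_NUM
""").fetchall()
print(rows)  # [('ABC-123', 'SAP.FINANCE.GL'), ('DEF-456', 'SAP.SALES.ORDERS')]
```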
Record sources must remain on the row level as a part of a 100% compliant and auditable solution.
These fields are used to answer questions about the data, where it came from and more specifically
which application. Traceability of the data from the AS-IS data marts all the way back to the source
systems provides compliance that meets regulatory standards. Developers, auditors, and business
users benefit from having a record source in each row of data across the entire model.
Tech Tip: To manage volume or repeating groups without joining (resolving to a code),
compress the column in database engines that support compression. Record source codes
are highly repeatable and redundant data. Record sources may be comprised of reference
codes; resolved on the way out of the Data Vault by joining to reference data. Reference
codes as record sources allow the data set to be compressed from the start.
3.9 Process IDs
Process IDs are a tracking mechanism for the loading process that brought the data into the
warehouse. They are not part of the core-architectural components of the Data Vault. Process IDs
may be used as a means to track the data set back to the individual loading process. They are
augmentative metadata only. Process ID columns are repetitive in nature, and as such should be
set up for column compression.
Process IDs may replace both record sources, and load-dates. If process IDs are tied to technical
metadata stored in the Meta Vault, they can replace the two items in Hubs and Links (load dates will
always be needed as part of the key for a Satellite). In this situation, the process metadata must be
tagged with a record source and a run/load date.
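A sketch of that substitution, with a hypothetical META_PROCESS table standing in for the Meta Vault: each run is tagged with a record source and a run/load date, and the Hub row carries only the process ID.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Process metadata (Meta Vault side): each load run is tagged with a
    -- record source and a run/load date
    CREATE TABLE META_PROCESS (
        PROCESS_ID INTEGER PRIMARY KEY,
        REC_SOURCE TEXT NOT NULL,
        LOAD_DTS   TEXT NOT NULL
    );
    -- Hub carries only the process ID instead of REC_SOURCE + LOAD_DTS
    CREATE TABLE HUB_CUSTOMER (
        CUST_ACCT_NUM TEXT PRIMARY KEY,
        PROCESS_ID    INTEGER REFERENCES META_PROCESS
    );
""")
con.execute("INSERT INTO META_PROCESS VALUES (42, 'SAP.FINANCE.GL', '2011-03-01')")
con.execute("INSERT INTO HUB_CUSTOMER VALUES ('ABC-123', 42)")

# Record source and load date are recovered by joining to the process metadata
row = con.execute("""
    SELECT h.CUST_ACCT_NUM, m.REC_SOURCE, m.LOAD_DTS
      FROM HUB_CUSTOMER h JOIN META_PROCESS m USING (PROCESS_ID)
""").fetchone()
print(row)  # ('ABC-123', 'SAP.FINANCE.GL', '2011-03-01')
```

The trade-off is an extra join whenever record source or load date is needed, in exchange for one small repeated integer on the row.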
Unfortunately in the real-world, customer keys (as with so many other business keys) change
depending on the system being used. The keys change from one state to another as the customer
information passes from one system to another. These changes are typically a manual process
resulting in little to no visibility at the corporate level for where a customer is in the life-cycle of
business.
Figure 4-1: Business Key Changing Across Line of Business
In Figure 4-1 above, the key changes through an Excel-managed process when the customer is
transferred from the sales system to the procurement system. The ideal would be for the same key
to be used horizontally across all lines of business regardless of the system of origin and the system
of transfer. What business doesn't realize is just how much money they are losing by changing the
business key from one line of business to the next.
They also frequently allow this to happen by implementing off-the-shelf products which expose
sequence numbers as business keys. Clearly, sequence numbers from Oracle Financials will never
match sequence numbers in Siebel or PeopleSoft or SAP, etc. Because the sequence numbers are
exposed, the business begins to use them as business keys, automatically losing traceability (and
money) when the sequence number for the same customer differs across multiple systems.
One of the jobs that a good data warehouse should perform is gap
analysis; that is: provide the business with a view of the GAP between the
way the business believes they are operating their business, and the way
the systems are collecting the data. By examining this gap, the business
can quickly locate where they are hemorrhaging money.
4.1 Hub Definition and Purpose
Figure 4-2: Hub Example Images
Hubs have several of the standard fields including sequence number (SQN), Load Date
(LOAD_DTS), and Record Source (REC_SOURCE). In special cases, a Hub will also include an
encryption key (ENCR_KEY) and potentially a Last Seen Date (LAST_SEEN_DTS). The encryption key
is a part of the Hubs when the data set is encrypted. It may be one half of a two-part public key.
The encryption key is not standard, which is why it is not listed in Chapter 3.0.
Last seen dates are not required, and are not a part of the core architecture. Last seen dates assist
in tracking deleted rows/aging business keys. Business keys in Hubs may be tracked through status
tracking Satellites, which are covered in the Satellite chapter. Required in the architecture are the
sequence number, load date, and record source.
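As a sketch, a minimal Hub carrying the required fields plus the optional last seen date might be declared as follows. The DDL is generic and the column names are illustrative, not a physical standard; SQLite is used so the example runs anywhere.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE HUB_CUSTOMER (
        CUST_SQN      INTEGER PRIMARY KEY,   -- sequence number (SQN), required
        CUST_ACCT_NUM TEXT NOT NULL UNIQUE,  -- the business key itself
        LOAD_DTS      TEXT NOT NULL,         -- load date, required
        REC_SOURCE    TEXT NOT NULL,         -- record source, required
        LAST_SEEN_DTS TEXT                   -- optional; not core architecture
    )
""")
con.execute(
    "INSERT INTO HUB_CUSTOMER (CUST_ACCT_NUM, LOAD_DTS, REC_SOURCE)"
    " VALUES ('ABC-123', '2011-03-01', 'SAP.FINANCE.GL')")

# The sequence number is generated by the warehouse, never by the source
result = con.execute("SELECT CUST_SQN, CUST_ACCT_NUM FROM HUB_CUSTOMER").fetchall()
print(result)  # [(1, 'ABC-123')]
```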
The purpose of the Hub is to provide a soft-integration point of raw data that is not altered from the
source system, but is supposed to have the same semantic meaning. The resulting singular list of
keys assists in the discovery of patterns across systems. The Hub key also allows corporate
business to track their information across lines of business; this provides a consistent view of the
current state of application systems. These systems are supposed to synchronize, but often don't.
When they don't synchronize, business keys begin to be replicated and, worse yet, are then applied
to different contextual data sets.
Some examples of Hubs and their data are shown in Figure 4-3:
Figure 4-3: Hub Example Data
In the HUB_CUST_ACCT (Hub Customer Account) it is easy to spot similar patterns, fat-fingered data,
and errors in entry, possibly a lack of edit masks. The typical requirement in this case is as follows:
The business says: "We always create our customers in contracts. You will always get
your customer numbers from contracts first because they are responsible for closing
the deals and getting the money."
When the patterns in the data are examined, it is clear that Sales has produced keys (as has
finance) that are not in contracts. It's up to the business to figure out why; it's the job of the data
warehouse to point out the pattern. With this type of analysis, the data warehouse can provide the
needed gap analysis between the business requirements and the source systems. In this case there
may be broken source system synchronization routines, or worse: a loop-hole in the business
process that incentivizes people in sales to enter new customers. All of this is speculation until the
business figures out why it's happening and moves to fix it in the source systems or primary
processes of the business.
There are ways that these keys can be rolled together for BI reporting purposes. The notion of
hierarchical Links and same-as Links is discussed in the Link chapter (chapter 5.0). The data itself
stays intact in order to re-constitute the source system as necessary for auditability.
4.2 What is a Business Key?
Each of these keys stands alone in business, and in the operational systems they usually are
surrounded with descriptive context to give them meaning. In data modeling terms these keys are
parents, and do not require any additional keys to provide them with the grain of definition. There
are times when business keys are composite keys (such as VINs, or bar-codes); these are also
known as intelligent keys. Business keys may also include the natural key plus the corresponding
source system surrogate sequence key, because the business failed to make the natural key truly
unique and the source system surrogate is now needed for traceability within the EDW.
4.3 Where to Find Business Keys
Business keys can be found in source system applications, on-line lookup screens, report headers,
source system data models, XML, XSD (schemas), and source COBOL copybooks. Business keys
can also be found in business process engines, SQL joins, and source code (COBOL, Java, stored
procedures, etc.). Business keys may also be found in Excel spreadsheets used to group items
together and label elements used in reports. Business keys may also be found listed in OLAP cubes
as part of dimensions used for drill down.
The best place to find business keys is within the business process layers. Businesses often
identify and track their information sets through business keys. The business process layers allow
business users to communicate from one person to another and translate, send, or attach the
information to the business process flow. Business keys may indicate hierarchies, groupings,
cross-mapping (from one system to another), physical identification tags, and global traceable
information.
NOTE: just because a surrogate is used within a source system does not automatically qualify it as a
business key. It must be presented, printed, displayed, or searched on (made known to the
business user) in order to qualify as a business key. It should also be clearly defined by the business
to represent a noun or an object that has context, or be defined as the key to contextual
information, in order to become a business key.
However there are a number of surrogate keys (like Order Number and Invoice Number) which are
true surrogate numbers and have business value. Both Order Number and Invoice Number qualify
as business keys, as they are used by the business to uniquely identify (and track) data in the
source systems. In these cases, it is a hopeful thought that only one system maintains and
produces these surrogate numbers; that would be the optimal solution.
4.4 The Importance of Business Keys
Business keys are the most important component of all information systems. Business keys provide
Links between business processes and the context that drives decision making. Business keys are
the most stable of data elements used by the business. They should be consistent throughout and
across lines of business. Through listing the business keys of the same semantic grain together,
patterns of inconsistencies and consistencies begin to emerge. Typing mistakes are more easily
caught, domain overload (domain chaos) is more easily visible, and missed punctuation becomes
clearer.
At the time of this writing it is known to be an extremely rare circumstance to acquire or locate
common business keys that transcend lines of business and the applications in which they are
generated, stored, and utilized. Businesses must begin to identify through metadata their need for
common business keys. This is a sign of true business architecture. The end result of common
business keys gives rise to board level visibility of the end-to-end business process in which their
data travels. By tracking the data set and their business keys, business users can begin to optimize
the business processes. This simple notion is the root of master data management. Master Data
will not succeed without the proper identification and management of business keys.
Because the business key (in theory) is supposed to be static and stable, it should be consistently
the smallest portion of the business and the most unchanging component of the business
regardless of the business units in which it is applied. This is separate from the business metadata
that defines the element and the functionality of how the element is applied in business. This is a
technical definition of same semantic grain that associates this business key with the
corresponding context surrounding it.
For example: an automobile's VIN (vehicle identification number) should not change.
However, the color of the car, number of doors, windows, seats, length of the car, and size of the
engine may all change over time. These are examples of descriptive attributes which are covered in
Chapter 6: Satellites.
Because business keys are supposed to be the most stable component, by separating them in the
model to a Hub, we can therefore stabilize the model itself over time. At the same time as we
stabilize individual structures, we also can adapt easily to new business keys at different grains or
defined by different criteria. Thus, the adoption of new structures to meet new business operating
procedures becomes easier (without losing history in the current system).
Without business keys, IT will not be able to build a master data system and properly tie the data set
(context) back to the business processes. Business keys make up master record locators that are
embedded for information visibility across lines of business. Business keys should never change
and should never be re-used. It is a well-known fact, however, that business keys do change and are
re-used; this has major implications in business life-cycles and will cost the business
significant money on a year-over-year basis.
Figure 4-1 demonstrates the nature of the business key changing from one line of business to
another. The end result is: no consistent visibility at the corporate level for maximum optimization.
The businesses that have this problem without tracking across the change will not be able to answer
the following questions:
These are all master data questions that require consistent and tracked business keys. It does not require
stable business keys, as long as the business key changes are tracked across multiple alterations. Bottom line
is: business keys are the only way to create auditable and traceable information back to the root business
processes and source systems.
Business Keys are the heartbeat of the data that travels through business processes. Think about
it, when you access a source system application to look up a customer, what do you type in? When
you look for a part, or a product, or an employee in a source system, what do you search on? Well, if
you guessed business keys, you guessed correctly!
Business keys are a part of every-day life. We use computers and their data stores to remember
and track all the possible information that we collect. We are then left to focus on a product, a
portfolio, or a set of customers. From these activities we have to identify, define, trace, and
manipulate all of this information within the business processes.
These business processes include manual efforts (we print a report and hand it to someone else),
source system applications (think data entry), or a dashboard of our top customers that we have to
touch every day to see if there's anything we can do for them.
Without business keys involved in these processes, there would be chaos. Without business keys
identifying all this information, it would all be of ZERO VALUE to us, which in fact is exactly what
happens to the data in the systems if or when the keys to that data are lost. You know the old
saying: out of sight, out of mind. If we can't track, edit, retrieve or manage all the information in
our operational systems, then the value of that information drops to zero.
Business keys are tied to the data set in the source application. Business keys are likewise tied to
every business process that the data flows through, thus ensuring traceability at the business
process level. Business keys are the foundation of the Data Vault, which means that your data
warehouse is centered on business keys. These keys are the life-blood of the data warehouse,
which is how we can tie the value of the data assets back to the business.
Centering your data warehouse around business keys provides you with a huge advantage in data
warehousing valuation as an asset to your business. It gives you the ability to track, and trace, all
the information back to the point in the business processes where it makes the most sense.
THIS, MY FRIENDS, IS CALLED: GAP ANALYSIS. THIS IS OUR TRUE JOB AS BUSINESS INTELLIGENCE
EXPERTS. WE ARE SUPPOSED TO POINT OUT THE GAPS, AND HELP THE BUSINESS CLOSE THEM!
Surrogate keys are helpful and useful to a machine, particularly when it comes to speeding up joins
and processing data sets in order of creation. However, that's where the helpfulness and
usefulness stops. Surrogate keys should never be shown to business users. They should never be
placed on reports, search screens, or operational application screens. They should never be
mistaken for business keys by the business users. They invariably cause problems (never-ending
problems) that cost businesses large sums of money over the life of the source system and data
warehouse.
Problems begin to arise when the data needs to be re-loaded and new surrogates must be
generated for the rows. This causes confusion in the auditability of the data set, and even calls into
question any previously exposed surrogate keys that were printed on reports. These old
surrogates no longer match up with the newly generated data! So much for the system-of-record
source system!
Surrogate keys should remain within the confines of the systems in which they are applied. However
in modeling a Data Vault for source systems (especially those without business keys today), the Data
Vault model must accommodate the surrogate keys and (unfortunately for business) treat them as
the business key to that source. The loading routines must deal with collision, semantic meaning,
and definitional aspects of simple numbers. Surrogate keys mean nothing to the business, and
the business should not be asked to memorize or embed meaningless data into their business
operations.
Note: As mentioned in section 4.3 some business keys are in fact surrogate keys.
These include keys such as Order Number and Invoice Number. These keys are used
as meaningful business keys and should be represented as Hubs when necessary.
4.7 Smart Keys (Intelligent Keys)
Some business keys, like Bar Codes, are called Smart Keys or intelligent keys, meaning the key is
comprised of multiple parts. All parts must be kept together as a UOW (unit of work). The business
utilizes the entire key as one unit (one identifier) to represent other information.
Figure 4-5: Composite Business Key Hub Example
Tech Tip: In Hub_Bar_Code the composite key (when concatenated) makes up the full
barcode that is printed on a container. Each constituent part is a piece of the whole. Since
the business uses the entire bar code to track the container, the entire bar code is itself a
business key. Multiple fields are simply split apart to represent the composite whole. In
other words, the ENTIRE BAR CODE is used as the business key by the business, therefore
it is part of a single Hub.
Can we also have Vendor, Product Code, and Production Date in their own Hubs? Yes, of course, as
they most likely represent unique data by themselves; however, the nature of a BAR CODE is to be a
conjugation of all its constituent parts and as such, will remain a single Hub with all composite
fields in its own right.
The source system for HUB_DOCTOR is written to be run in different states. The application was
then set up in Colorado, Denver, and New York. Each application assigned doctor ID = 1 to a
different doctor in its own state. In order to avoid collisions upon data load, the state ID or state
code must be loaded as a composite with the Doctor ID. This maintains traceability back to the
source application in each state.
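The collision problem can be sketched with a composite uniqueness constraint over state code plus doctor ID. The HUB_DOCTOR structure below is hypothetical (column names invented for illustration), with SQLite standing in for the warehouse engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE HUB_DOCTOR (
        DOCTOR_SQN INTEGER PRIMARY KEY,
        STATE_CODE TEXT NOT NULL,     -- composite part 1: source state
        DOCTOR_ID  INTEGER NOT NULL,  -- composite part 2: source doctor ID
        UNIQUE (STATE_CODE, DOCTOR_ID)
    )
""")
# Doctor ID 1 exists in every state's application; the state code
# keeps the keys from colliding in the integrated Hub
con.executemany("INSERT INTO HUB_DOCTOR (STATE_CODE, DOCTOR_ID) VALUES (?, ?)",
                [("CO", 1), ("NY", 1)])

# Re-loading the same composite key is rejected, so a duplicate from the
# same state's feed cannot silently create a second Hub row
try:
    con.execute("INSERT INTO HUB_DOCTOR (STATE_CODE, DOCTOR_ID) VALUES ('CO', 1)")
except sqlite3.IntegrityError as e:
    print("collision rejected:", e)
```

Without the STATE_CODE column in the key, the two legitimate "doctor 1" rows above would collapse into one, destroying traceability back to each source application.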
4.9
The Hub entity must NEVER contain foreign keys. If the Hub structure is
compromised (i.e., the modeling standards are not adhered to), then the
integrity of the data and the flexibility of the model are immediately
compromised.
4.10 Hub Examples
Figure 4-7: Example Hubs from Adventure Works 2008
This model has a mix of data types for business keys. There are a few, like DocumentNode and
ProductNumber, which match character-based business keys. The rest of these keys were derived
because of the following reasons:
There is no source system application to check the rules against.
There are no source system business users to ask (the Adventure Works model was created by
programmers).
There is only a single system that is integrated here. In multiple-system cases, the process of
modeling a Data Vault generally represents alpha-numeric business keys.
Figure 4-8: Example of National Drug Code Data Vault
Figure 4-8 represents the NDC (National Drug Code) Data Vault. More information about NDC
source data and the operational system can be found at: http://www.fda.gov/cder/ndc/ (note: if
the Link no longer works, search Google for "NDC drugs" and click on the Link available from
www.fda.gov). One large difference between this system and the Adventure Works model is that this
system has real business users, along with defined metadata for each of the business keys.
The fact that a business uses surrogate keys as business keys dictates that those source system
surrogate keys are chosen as business keys for defining their Hubs. As noted in these examples, it is
absolutely vital to annotate assumptions, questions, and reasons for designing the Data Vault
architecture as the model is built. In the case of Adventure Works there are no business users to
speak with, and there is no source system to consult (application logic is missing). Once a standard
is chosen, it should be adhered to throughout the life of the design.
4.11 Dependent and Non-dependent Child Keys
Hub business keys may be composite for another reason: dependent business keys. A dependent
business key only has context when included with a parent key. However, a dependent business key
is important enough to warrant uniqueness and, when coupled with a parent key, uniquely identifies
additional data. Dependent business keys are another source for creating composite or multi-field
Hub business keys. In Figure 4-9 below, please remember that the table on the left is a source
system table represented in 3rd normal form; the dependent child key is the Hub Line Item Number.
A prime example of a dependent child key would be line-item-number. Line-item-numbers exist only
within the context of an invoice or an order. They are important in keeping the proper ordering of
the line-items on the invoice. Without line-item-numbers, every time the system would print the
invoice the line-items would be printed in different ordered sets. Line-item-numbers by themselves
make no sense; an attempt to find line item 5 (five) by itself would be difficult if not impossible.
Line-item-numbers depend on parent context (such as order number) to exist.
THE HUB LINE ITEM IS IN RED AND DOTTED LINES BECAUSE IT HAS NO
CONTEXT, NO MEANING BY ITSELF. THE LINE ITEM REQUIRES THE SURROUNDING
KEYS FOR CONTEXTUAL RESOLUTION. THEREFORE, THE HUB LINE ITEM SHOULD
NOT BE MODELED IN THE PHYSICAL DATA MODEL.
Line-item numbers are known as a dependent child. They are important as a business key, but not
by themselves. They must accompany an additional business key to make sense. The Data Vault
modeling standards allow multiple representations of the dependent child keys. They can be
included in the same Hub with the parent or stand-alone business key, or they can be modeled
within a Link table.
Another example of a dependent child key may be a sub-typed business key representation. The
most important question to ask is: does the key stand on its own? Does it have meaning by itself?
If the answer is no, and it remains a business key, then it may very well be a dependent child key.
Dependent child business keys are not allowed to be modeled explicitly within the model. If they are
combined in another parent key's Hub, then they shall not be modeled logically either; Hubs are
not allowed to contain foreign keys. However, if they are included in a Link structure (explained in
the next chapter), they can be represented logically; this notation is called a weak Hub. In the Link
chapter (Chapter 5) they are also referred to as degenerate fields.
4.12 Mining patterns in the Hub Entity
The Hub table brings together previously disassociated business keys. It represents lists of these
business keys in a single common table. For example: a list of all part numbers that appear across
the enterprise. Patterns can be mined from the single list of business keys. By coagulating the
business keys from multiple source systems into a single component, it becomes possible to extract
business value and meaning.
Hubs can be mined for the following information:
By mining the Hub's data it is possible to discover practical associations and ties across business
keys. Hierarchies and ontologies can be discovered, which translates into added business value. The
results always need to be checked against the business to see if they are false positives. Complex
inter-relationships across the internal data patterns and shifts in entry can be discovered. It is
interesting to note that the longer the business key, the more likely it is to make these discoveries.
Entry patterns and format masks can also be established. The percentage of data that meets
particular patterns can be assigned. The greater the percentage, the more likely the business rule is
out there somewhere being utilized. It is possible to tie strength and confidence ratings to
percentages of data meeting specific patterns that have been discovered. Just as with the last case,
the more data involved in the discovery, the higher the confidence that the discovered pattern is an
applicable business pattern.
http://LearnDataVault.com
Page 72 of 152
Source system key creation (or broken business requirements) can be discovered as the data set is
loaded. Confidence ratings increase as additional data (new keys) arrives to demonstrate that the
business rule truly is broken. For example, the business requirement is: contracts always create
new customer keys, but when the data is loaded the pattern states otherwise.
Contracts is responsible for the creation and inflow of customer accounts, and for the negotiation of
these accounts before the organization can begin building product for the customers. The data set in the
Hub shows that 40% of the new business keys are being created by a financial system. Further
discovery shows that it takes 20 days before the customers are synchronized and moved into the
contracts system. The business then needs to ask the following questions about their business
processes:
What happens when the business key for a certain customer changes when it is passed from
Finance to Contracts? What if the programmatic code that changes the key does not record the
from-to, or the business user does not record the from-to change when they key it in to the
contracts system? What impact to the business does this have? It can be huge, it can be costly, and
it can range from the $10 mark to the $10 million mark.
Mining Hub keys for patterns can be a powerful way to validate the data against the business
requirements. It provides insight into the gap between the vision that the business assumes it is
operating under, and the reality of its operational systems, coupled with the business processes in
place today. This is the fundamental idea behind process improvement, monitoring and measuring.
The process is simple in nature; however, it requires a consistent check with the business users,
business application, and source system data set. At the end of the day the business application
collecting the data has the overriding decision. It is the responsibility of the Data Vault to enable
reproduction of the source system as-it-stood as of a specific point in time; otherwise the
commutative property is broken, and the system of record that exists within the Data Vault is
compromised.
1) Find the business key
a. Go to the business users and watch how they interact with the operational systems. View
their print-outs, application screens. Locate the find mechanisms, headers of reports, and
dimensional groups they use in their MS Excel Spreadsheets.
b. Determine which business keys are truly used in which business units. Do NOT worry or
consider HOW to define the business keys, leave that to the business users later in the
project.
c. Locate the business keys in the source system by examining the record join/find code.
d. Look for business keys in the dusty old data model that is supposed to represent the source
system, look for the primary keys and secondary unique indexes.
e. Pry open the physical data stores on the source systems, look for alternate unique indexes
and primary keys.
2) Validate the Business Keys
a. Check with the business units, balance the data sets and unique indexes that are physically
printed or seen by the business users. Eliminate those keys that are internal only. Many
times the internal keys are there for performance reasons.
b. Validate the business key data by profiling the data set. Discover the consistency, actual
uniqueness; develop metrics against the business keys, their patterns, and their associations
to other records in other systems.
3) Check Business keys against multiple source systems
a. Develop profiling patterns across multiple source systems that are within scope, discover
where the collisions are. Work on resolving the multiple entry patterns that occur. Again, the
focus is not to define these keys, but rather simply to identify the business keys.
4) Finally, build the Hub
a. Define the systems that feed the Hub. Develop data flows that identify potential collisions.
b. Define what to do in case of a collision. Get this answer from the business users by ASKING
them to define which system is the first master, the second master, the third master and so
on.
c. Implement loading paradigms from a staging area to the Hub in the Data Vault
d. Profile the results to produce metrics and measurements about the patterns of the data sets.
e. Publish the results to the entire IT team, the business users, and anyone interested in the
Data Warehouse. BEGIN the data quality improvement process as early as possible.
These are the fundamental steps to building a single Hub within the Data Vault model.
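Step 2b above (profiling the business key data) can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the sample key list and the format-mask convention (A for letters, 9 for digits) are assumptions made for the example.

```python
# Sketch of step 2b: profile candidate business keys for uniqueness and
# entry-pattern (format mask) distribution. The key list and the mask
# convention (A = letter, 9 = digit) are assumptions for illustration.
import re
from collections import Counter

def profile_keys(keys):
    """Return (uniqueness ratio, share of keys per format mask)."""
    total = len(keys)
    unique_ratio = len(set(keys)) / total
    # Reduce each key to a format mask: digits become 9, letters become A.
    masks = Counter(re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", k)) for k in keys)
    pattern_share = {mask: count / total for mask, count in masks.items()}
    return unique_ratio, pattern_share

keys = ["CUST-001", "CUST-002", "CUST-002", "9912-AB"]
uniq, patterns = profile_keys(keys)
# uniq -> 0.75; patterns -> {'AAAA-999': 0.75, '9999-AA': 0.25}
```

The higher the share held by a single mask, the more likely a real entry rule sits behind it, which is exactly the strength/confidence reasoning described earlier in this chapter.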
4.14 Modeling Rules and Standards for Hub Tables
The Data Vault model is a repeatable, consistent, scalable and flexible technique. There are rules
and standards around each of the table structures that must be followed, or the resulting model will
not qualify as a Data Vault model and will be subject to the risks it was designed to avoid. Below are
the modeling rules and standards that surround a Hub Table.
The rules for Data Vault modeling have not changed (architecturally) since 1997; which makes the
architecture itself stable and easy to use. The rules and standards for modeling are kept up to date
on the following web-site: http://DanLinstedt.com.
The standards, the design, and the architecture of the Hub are based on mathematics, including
finite complexity and measurable maintenance effort, such as the number of rows per block. If the Hub
standards are broken (such as introducing a foreign key directly in to the Hub) then the flexibility of
the model breaks. The adaptability to future business requirements breaks. The ability to load past
history (which may not match the relationship definition) breaks. When the rules and standards are
broken, it also introduces high levels of re-engineering upstream of the Data Warehouse. It forces
business requirements to creep back into the upstream loads. Eventually the business
requirements change, and thus force re-engineering to occur in the loading, querying and
structuring of the Data Vault. The current architecture of the Data Vault avoids all re-engineering if
the rules and standards are adhered to.
If descriptive data is introduced to a Hub, then data over time becomes more difficult to manage.
The complexity of the loading cycle increases. The staging area requires additional copies of the
data set to synchronize it with the final image. It becomes impossible to split data by rate of change
or type of information.
It is neither recommended nor condoned to break the standards of the Data Vault. The engineering
work has been done in order to avert pitfalls encountered on typical enterprise data warehousing
projects. In fact, if the standards are broken, the model will not qualify as a Data Vault model.
The only risk a pure Hub design has is the width of the business key. If the business key is
comprised of multiple fields (is a composite business key), then it may be possible that the number
of rows per block exceeds the desired count. When this happens, the number of I/Os increases
dramatically to search through the Hub structure and locate the proper business key.
The average Hub row size is accounted for as follows:
Field                     Average Bytes
----------------------    -------------
Sequence                  8
Business Key              25
Load Date Time Stamp      8
Record Source             12
TOTAL                     53 bytes
If the block size is 16,384 bytes (16k) then it can fit approximately 309 rows per disk I/O. If the
block size is 32k, then the Hub can fit approximately 618 rows per disk I/O. With a block size at 64k
the Hub can fit approximately 1236 rows per disk I/O. The best average is around 1000 rows per
block. The Data Vault implementation book covers the mathematics in detail, along with the loading
mechanisms, block sizes, and row widths.
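The arithmetic can be checked directly. This sketch ignores per-block overhead (page headers, fill factors), which is why the figures above are approximate:

```python
# Rows-per-block arithmetic for the 53-byte average Hub row shown above.
# Per-block overhead (page headers, fill factor) is ignored here, which is
# why the book's numbers are approximations.
ROW_BYTES = 8 + 25 + 8 + 12  # sequence + business key + load date + record source

def rows_per_block(block_bytes, row_bytes=ROW_BYTES):
    """Whole rows that fit in one disk block."""
    return block_bytes // row_bytes

for kb in (16, 32, 64):
    print(f"{kb}k block: ~{rows_per_block(kb * 1024)} rows")
```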
NOTE: Do not break the rules of the design or architecture. If the rules are broken, the design will
suffer re-engineering in the near future. It also breaks the ability to keep costs down from a
maintenance perspective. The Data Vault model is based on the scalability mathematics involved in
computing near-linear scalability from an MPP (massively parallel processing) perspective.
A Link Entity is an intersection of business keys. It contains the surrogate IDs that represent the
parent Hubs' and Links' business keys. A Link must have more than one parent table. A Link
table's grain is defined by the number of parent keys it contains. Each Link represents a unit-of-work
(UOW) based on source system analysis and business analysis.
The purpose of the Link is to capture and record the past, present, and future relationship
(intersection) of data elements at the lowest possible grain. The Link Entity also provides flexibility
and scalability to the Data Vault modeling technique. Typical examples of Links include:
transactions, associations, hierarchies, and re-definition of business terms.
WARNING: ANY CHANGE TO THE LINK STRUCTURE (LIKE CHANGES TO THE LINK) …
Within the Data Vault modeling constructs a Link is formed any time there is a 1 to 1, 1 to many,
many to 1, or many to many relationship between data elements (business keys). The resulting
physical Data Vault can capture what the relationship was, while it captures what the relationship
is, and can adapt to what the relationship will be in the future.
Many-to-Many relationships provide the following benefits:
1. Flexibility
2. Granularity
3. Dynamic adaptability
4. Scalability
Many-to-many relationships allow the physical model to absorb data changes and business rule
changes with little to no impact to both existing data sets (history) and existing processes (load and
query). Businesses must change at the speed of business, and IT must become more agile and
responsive to handling those changes. More and more business rules are changing, faster and
faster.
Through the Link entity the Data Vault mitigates the need to restructure/redesign the EDW model
because the relationship changes. For example: today the business states 1 portfolio can handle
many customers, but each customer must be handled by 1 and only 1 portfolio. If the model is
designed in a rigid fashion (that is to say with parent-child dependencies) then it represents the
current business rules quite well. All is well until the business (tomorrow, next year, or 2 years ago)
decides to change their business rule: now, a customer may be handled by 3 or 4 different
portfolios. Figure 5-1 demonstrates relationship change over time.
Figure 5-1: Relationship Changes Over Time
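The portfolio example can be sketched with a many-to-many Link table. Table and column names here are illustrative, not a mandated standard; the point is that the rule change arrives as new rows, never as an ALTER TABLE:

```python
# Sketch: the Hub_Customer-to-Hub_Portfolio relationship captured as a
# many-to-many Link. When the rule changes from "1 portfolio per customer"
# to "3 or 4 portfolios per customer", only new rows arrive -- no ALTER
# TABLE, no reload. Table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE link_customer_portfolio (
    link_sk      INTEGER PRIMARY KEY,
    customer_sk  INTEGER NOT NULL,
    portfolio_sk INTEGER NOT NULL,
    load_dts     TEXT NOT NULL,
    record_src   TEXT NOT NULL,
    UNIQUE (customer_sk, portfolio_sk)
)""")

# Today: customer 101 is handled by exactly one portfolio.
con.execute("INSERT INTO link_customer_portfolio VALUES (1, 101, 7, '2010-01-01', 'CRM')")

# After the rule change: the same customer gains additional portfolios.
con.execute("INSERT INTO link_customer_portfolio VALUES (2, 101, 8, '2011-06-01', 'CRM')")
con.execute("INSERT INTO link_customer_portfolio VALUES (3, 101, 9, '2011-06-01', 'CRM')")

rows = con.execute(
    "SELECT COUNT(*) FROM link_customer_portfolio WHERE customer_sk = 101"
).fetchone()[0]
```

Both the 1:1 era and the M:M era live side by side in the same structure; history is never disturbed.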
Many-to-many relationships ensure that the business associations (past, present, and future) can be
added to the warehouse without altering the model or the load routines. The metadata that is
currently lost is the nature of the relationship (e.g., 1:1, 1:M, M:1) as documented in the source
system (what exactly did the operational model look like?). This must be documented in the
metadata of the Link table, hopefully in the Meta Vault. By capturing the metadata in the Meta Vault
(including computational functions that create the relationship, along with how it is used) the
business can begin to track changes to business knowledge as they relate to the data set and
operational systems over time.
The resulting power of this capture mechanism enables the business to monitor the impact of their
decisions. Data mining on the Meta Vault and the data set can then perform gap analysis in regards
to the quality of the decision and the end resulting impact (pre and post decision process). If the
business adapts its business process, adding a new Link table can be done easily and quickly
without reengineering the entire existing data warehouse. Load routines are isolated from the
impact, as are queries and BI processes.
If the model is rigid, then the loading (ETL) designs are also rigid. If the business rule changes, and
meets a rigid architecture, then the result of the impact is: forced re-engineering. The extent of the
impact may cascade into other child tables; thus, the larger the EDW model grows, the larger the
possibility for impact, the less agile IT can be in response to business rule changes, and conversely
the more it costs (over time) to continue to adjust the EDW architecture to meet business needs.
This is the common design pattern that occurs in traditionally modeled warehouses. This impact is
completely mitigated by building a Link entity into the Data Vault. The Data Vault therefore is highly
scalable, flexible, and now, agile. The Link entity allows the structure to handle changes to business
rules without the impact of re-engineering (aka re-factoring), and without the ever increasing cost
curve. However, it is suggested that the business rule itself, along with any calculation that
produces this data set be recorded within a Meta Vault. To learn more about Meta Vault, check out
the one-on-one coaching area at: http://danLinstedt.com
5.3 Flexibility
Many-to-Many relationships provide maximum flexibility and agility. The more flexible the model is,
the faster it is to adapt or change. The faster the model can adapt, the less time it takes IT to
respond to business changes. The less time it takes for IT to respond to business changes, the
more work can be done in a shorter amount of time, leading to increased productivity of the IT staff
in the data warehousing environment. Adding new tables (especially Link tables) to the Data Vault is
easy.
As seen in Figure 5-4, changing the model or adding new structures is a simple process. Not much
time or effort is required to make the changes occur and it has no impact on the existing portions of
the warehouse. We can add Links to represent the new relationships without having to revise
existing structures (to add new foreign key columns) or reload any data. Do not confuse this with
the time and effort required to find and establish appropriate business keys. Creating the Hub
structures is the first step, and the most important step to take. Suppose the business now wishes
to add sales regions to manage both customers and orders as a combined component; how hard
might it be to extend the model again?
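The extension just described amounts to nothing but CREATE TABLE statements. The hub and link names below are hypothetical; the assertion of interest is that the existing table definitions are untouched:

```python
# Sketch: extending the model with sales regions means only CREATE TABLE
# statements -- the existing Hubs are never altered. All names are
# hypothetical; only the shape of the change matters here.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_num TEXT UNIQUE,
                           load_dts TEXT, record_src TEXT);
CREATE TABLE hub_order (order_sk INTEGER PRIMARY KEY, order_num TEXT UNIQUE,
                        load_dts TEXT, record_src TEXT);
""")
before = {name: sql for name, sql in con.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'")}

# The extension: one new Hub, one new Link. No ALTER, no reload.
con.executescript("""
CREATE TABLE hub_sales_region (region_sk INTEGER PRIMARY KEY, region_code TEXT UNIQUE,
                               load_dts TEXT, record_src TEXT);
CREATE TABLE link_region_customer_order (
    link_sk INTEGER PRIMARY KEY,
    region_sk INTEGER NOT NULL, customer_sk INTEGER NOT NULL, order_sk INTEGER NOT NULL,
    load_dts TEXT, record_src TEXT,
    UNIQUE (region_sk, customer_sk, order_sk)
);
""")
after = {name: sql for name, sql in con.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'")}
```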
Figure 5-6: Global Data Vault Linking
In this situation, there are many different applications that are synchronizing the operational Data
Vault. Most likely the application loading data to the global Data Vault is comprised of web-services
and a business rules engine. It is quite possible to have more than one global controller, and to
farm out different components of the access depending on the security, geo-location, or other
criteria. The control over the data, the loading and querying are beyond the scope of this book, and
will be discussed in the book titled: Data Vault Implementation.
5.4 Granularity
Granularity is vital to an EDW; the Data Vault is no different. Grain can be measured by the number
of parent tables a Link contains. For each parent, there is a new (lower) level of grain introduced.
The same mode of thinking applies when considering fact tables in a Star Schema. For example,
what is the grain of the following fact table (see Figure 5-7), and how can it be accurately described?
When the business requirements indicate a need to record data at a different grain, new Links
should be added to the existing Data Vault; old ones are simply no longer fed incoming data (but
are retained, as they contain historical data). The alternative option is to re-engineer the existing
Link to add the new Hub-surrogate-key. Re-engineering is the enemy of flexibility and auditability,
and can quickly cause an EDW project to scale out of control. In regards to auditability, the
question is: once a new Hub-surrogate-key is added to the existing table, how should it be defined to
the business? Especially if the definition has to apply to past historical data that is stored in the
Link already.
The very same question plagues changes to star-schema fact tables; adding a dimensional
surrogate to a fact table causes the grain of all the data to change. Then the business asks the
next question: can we reproduce a report from last year and compare it to data from this year? Of
course, the answer is: technically yes, but what has to happen to the code that drives that report?
It has to split into two parts, one part of the code for grabbing history, and a second part of the code
for grabbing current data with the new key; now the project is beginning to take on a much greater
cost in terms of maintenance. As changes continue to alter the structure, more code forks are
necessary to mitigate the business users' desire for reporting; until one day, the business wakes up
and says to IT: "We can't afford any more changes, and why is the system such a mess already?" This
is one of the reasons we advocate using a Data Vault model for your core EDW instead of a
dimensional architecture; this kind of change will not break a Data Vault.
What lurks in the shadows is even more troubling. Suppose it's the first change; all is well and
everyone is happy (as long as access to each data set is governed). Then one day, another business
unit decides they need to roll up the data, or summarize the recent data that has the new key.
They then combine these results with the old data that doesn't have the new key, and the numbers
no longer match. Now they ask IT: why is the data reporting bad numbers?
Accountability has just been destroyed. As stated above, in the situation of new relationships and
with the added needs of a data warehouse, it is best to always create new Links for these changes
and leave the old ones be. A hint from the implementation book: as data degrades in value (gets
older), there's a good chance that the old Link and its data will be backed up, and the old Link will
no longer be necessary within the warehouse. This is the beginning of a Data Vault model that truly
changes with the business needs.
5.5 Dynamic Adaptability
Link structures enable dynamic adaptability; that is: the ability to define associations or correlated
data sets on the fly. Dynamic adaptability leads to a fluid modeling structure. A data mining tool
with specialized algorithms that mine both the metadata (data model ontology) and the data set is
capable of discovering new relationships that are not yet represented in the model. The data mining
algorithm must include the metadata definitions of terminology that explain data models in order to
apply appropriate context when deciding to Link different data sets (i.e., Hubs) together.
Relationships (i.e., new Links in the Data Vault) created in this fashion must include two additional
attributes: confidence and strength. In other words, how confident is the mining engine (neural
network) that this relationship actually exists and is real, and how strong are the correlations across
the data sets? These two metrics are applied to every row of data that is loaded to the newly formed
Link.
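One possible way to attach these two metrics to a mined relationship is sketched below. The strength and confidence formulas are illustrative assumptions for the example, not the algorithm of any particular mining engine:

```python
# Illustrative scoring of a candidate (mined) relationship between two
# business keys. The strength and confidence formulas below are assumptions
# for the sketch, not the algorithm of any particular mining engine.
def score(observations, key_a, key_b):
    """strength: share of observations pairing key_a with key_b.
    confidence: rises toward 1.0 as supporting evidence accumulates."""
    total = len(observations)
    support = sum(1 for a, b in observations if a == key_a and b == key_b)
    strength = support / total
    confidence = support / (support + 1)
    return strength, confidence

observations = [("CUST-1", "PORT-7")] * 8 + [("CUST-1", "PORT-8")] * 2
strength, confidence = score(observations, "CUST-1", "PORT-7")
# strength -> 0.8; confidence -> 8/9
```

In line with the text above, both values would be stored on every row loaded into the newly formed Link, so the business can later filter or discard weakly supported relationships.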
A fluid Data Vault model is constantly adapting, self-learning. Like any neural network, the
alterations and learning must be a guided and corrected process; otherwise the neural network may
drive the model to an undesired state (possibly un-usable). Before these notions are dismissed as
theoretical in nature, consider that they are already reality: a company known as NetQuote in Denver,
Colorado applied this technique (human-based mining) to build an up-sell Linkage, resulting in a 40%
profitability increase in the first week.
Learning systems, intelligence systems, and military-grade systems may actually see the most
benefit from this technique. It allows testing of hypotheses without losing any of the historical data
which has been captured. To take advantage of the fluid model requires automated changes to
apply to loading and querying routines; in addition, it requires automated changes to the data marts
down-stream.
It is possible to create a learning system that is capable of discovering relationships across data
sets where none existed previously. It is possible to create a system that adapts to newly arriving
elements on the XML feeds, or web-service transactions. It is possible to create a system that
arrives at potential high impact information without the need of up-front human intervention. How to
build these systems is way beyond the scope of this book, but a well-designed Data Vault is a
prerequisite to even starting down this path.
5.6 Scalability
Physical location of the tables on specific storage devices can be optimized for maximum
performance. Figure 5-9 indicates a traditional starting point for the Data Vault architecture on Raid
5 (SAN or NAS disk). This type of architecture provides the lowest cost entry point for a single
system. The Data Vault model is flexible enough to grow with the corresponding needs. As
performance grows, as data sets grow, as real-time data arrives, the Data Vault model can scale as
desired.
Figure 5-9: Traditional Data Vault Storage Layout
When the performance of this architecture falls below expectations, it can be easily adjusted to a
new physical architecture as shown in Figure 5-10. Assuming in this case that Link-Customer-Order
and Satellite-Customer-Order are growing at an unprecedented rate, or that they contain a massive
volume of information, they can be split off physically on to a specialized DASD (direct attached
storage disk) with multiple I/O channels. This allows the business management and IT to provide an
SLA (service level agreement) which specifies performance driven metrics around certain queries or
processes that load down-stream marts.
Figure 5-10: Performance Physical Split Version 1
The tables can then be further partitioned across multiple I/O channels, additional disk components,
or hardware. The architecture allows the performance to be tightly coupled to the physical storage,
while allowing the model to be de-coupled from the physical layers. This is an optimal situation for
an MPP design. For further performance, additional RAID 0+1 configurations and DASD can be
introduced to other table structures, as seen in Figure 5-11.
Figure 5-11: Performance Physical Split Version 2
This process can be repeated again and again across each individual table, and down to the
partition level of each individual table. This enables full scale-out MPP style architecture to be
executed at the physical level of the Data Vault. This type of design is geared for extremely large
systems, and for the flexibility of breaking off parts of the model on to slower equipment, while other
parts of the model are placed on high-speed, high-cost equipment.
Figure 5-12: Performance Physical Split Version 3
Additional I/O channels can be added to each disk device for additional parallel access capacity.
Partitioning of the tables further enhances performance and parallelism. Further discussions of
physical table structuring can be found in the Data Vault implementation book.
5.7
5.8
Figure 5-15: Example of Link Satellite with Driving Key
In this case, the Link record 1 has 1 Satellite record. What happens when the operational system
changes the account number that the customer is associated with? What if the operational system
changes the employee that deals with the customer? In each of these cases, we see the following
insert (intermediate step) occur. Figure 5-16 depicts the post-insert of the new row in the Link and
Satellite.
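The driving-key behavior can be sketched as follows, assuming the customer is the driving key of the customer-account Link. For brevity the old Satellite row is closed with a simple end-date UPDATE; the names and this shortcut are illustrative only, not a mandated loading pattern:

```python
# Sketch of the driving-key pattern: the customer is assumed to be the
# driving key of the customer-account Link. For brevity the old Satellite
# row is end-dated with an UPDATE; names and this shortcut are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE link_customer_account (
    link_sk INTEGER PRIMARY KEY, customer_sk INTEGER, account_sk INTEGER);
CREATE TABLE sat_customer_account (
    link_sk INTEGER, load_dts TEXT, end_dts TEXT);
INSERT INTO link_customer_account VALUES (1, 101, 500);
INSERT INTO sat_customer_account VALUES (1, '2010-01-01', NULL);
""")

def move_driving_key(con, customer_sk, new_account_sk, load_dts):
    """The driving key switches accounts: close the old Satellite rows,
    then insert the new Link row and its open Satellite row."""
    con.execute("""
        UPDATE sat_customer_account SET end_dts = ?
        WHERE end_dts IS NULL AND link_sk IN (
            SELECT link_sk FROM link_customer_account WHERE customer_sk = ?)""",
        (load_dts, customer_sk))
    cur = con.execute(
        "INSERT INTO link_customer_account (customer_sk, account_sk) VALUES (?, ?)",
        (customer_sk, new_account_sk))
    con.execute("INSERT INTO sat_customer_account VALUES (?, ?, NULL)",
                (cur.lastrowid, load_dts))

move_driving_key(con, 101, 501, '2011-03-15')
active = con.execute(
    "SELECT COUNT(*) FROM sat_customer_account WHERE end_dts IS NULL").fetchone()[0]
```

Both Link rows survive (the history of the old association is intact); only one Satellite row remains open, which is what makes the "current" relationship queryable.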
Link Examples
For the examples of the Links we have used several different models, including the Microsoft Adventure
Works data model and a health-care model. These Link structures do not carry last seen dates nor
strength/confidence ratings. Figure 5-18 contains example Link structures found in the current
version of the Adventure Works 2008 Data Vault.
The degenerate field is also known as a child attribute. It is a degenerate field because it depends
on the combination of both of the parents' fields in order to make the relationship unique. The data
in this field is meaningless outside the context of the relationship; in other words, the field is not a
business key. The field will not function as a Hub. Oper_SEQ as a Hub would only contain sequential
integers for ordering data.
This field may also be a date value. In the case of an oil well there may be a need to capture a
physical date as to when the well was turned on, because until it's turned on it is not assigned
an actual well-number. Another example may be an expiration date on a prescription drug bottle.
This date is generally worked into the bar-code, making it a part of a larger business key. In this
case, it is also part of the relationship between the drug itself, and the packaging material. These
degenerate keys are necessary in describing relationships to a higher level of detail; however, by
themselves they do not provide significant information to cause the creation of a business key.
Degenerate fields have the following rules:
Examples of degenerate Link fields include sequencing or numbering information; for instance,
line-item-sequence on a purchase order or invoice may be called a degenerate Link key. Dates (on
occasion) are also degenerate Link fields. However, this case must be carefully examined, as not
all dates (such as start/stop, begin/end, and other descriptive dates) should end up as a part of the
Link. We discuss begin- and end-dating Links in a section below. The degenerate field that is a
date is generally a rare case that should be applied sparingly and with caution. It usually is also an
indicator or a composite of a business key, making the relationship unique.
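A sequence-style degenerate field can be sketched as part of a Link's unique constraint. The table and column names below are hypothetical:

```python
# Sketch: a sequence-style degenerate field (line_seq) participating in a
# Link's unique constraint. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE link_order_line (
    link_sk    INTEGER PRIMARY KEY,
    order_sk   INTEGER NOT NULL,
    product_sk INTEGER NOT NULL,
    line_seq   INTEGER NOT NULL,  -- degenerate: meaningless outside this relationship
    load_dts   TEXT, record_src TEXT,
    UNIQUE (order_sk, product_sk, line_seq)
)""")

# The same order/product pair may legitimately occur on two line items;
# only the degenerate line_seq keeps the two relationship rows distinct.
con.execute("INSERT INTO link_order_line VALUES (1, 10, 77, 1, '2010-01-01', 'ERP')")
con.execute("INSERT INTO link_order_line VALUES (2, 10, 77, 2, '2010-01-01', 'ERP')")
n = con.execute("SELECT COUNT(*) FROM link_order_line").fetchone()[0]
```

Note that `line_seq` never becomes a Hub of its own: as the text says, a Hub of sequence numbers would carry no business meaning.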
5.11 Multi-Temporal Date Structures
The Data Vault is enabled to house multi-temporal views of the information. Multiple date-time
stamps must be defined as data attributes in Satellite structures (defined in the Satellite chapter).
Utilizing the data in a multi-temporal state is accomplished through the query designs.
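A query-side approach can be sketched as follows; the Satellite layout and names are placeholders, and the point is that temporality lives in the query, not in the key structure:

```python
# Sketch of query-driven temporality: date-time stamps sit in Satellite rows
# as ordinary attributes, and an "as of" view is produced by the query
# alone -- the key structures are never touched. Names are placeholders.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_customer (customer_sk INTEGER, load_dts TEXT, address TEXT);
INSERT INTO sat_customer VALUES (101, '2010-01-01', 'Old Street');
INSERT INTO sat_customer VALUES (101, '2011-05-01', 'New Street');
""")

def as_of(con, customer_sk, point_in_time):
    """Most recent Satellite row loaded on or before the requested time."""
    return con.execute("""
        SELECT address FROM sat_customer
        WHERE customer_sk = ? AND load_dts <= ?
        ORDER BY load_dts DESC LIMIT 1""",
        (customer_sk, point_in_time)).fetchone()[0]

then = as_of(con, 101, '2010-12-31')  # the data as it stood in 2010
now = as_of(con, 101, '2011-12-31')   # the current view
```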
WARNING: DO NOT ALTER THE ARCHITECTURE, NOR MODIFY THE STRUCTURES OF THE
ARCHITECTURE TO GAIN A MULTI-TEMPORAL VIEW OF THE DATA. AS STATED
PREVIOUSLY, ANY DEVIATION FROM THE STRUCTURE WILL CAUSE A SERIOUS BREAKDOWN
OF THE VALUE OF THE DATA VAULT MODEL TO THE BUSINESS. THE REASONS ARE
LISTED IN THE RE-ENGINEERING STATEMENTS THAT ARE MADE THROUGHOUT THIS BOOK.
The structures and standards have been built and tested for over 15 years. The standards have
been built to avoid the pitfalls and problems that existing data warehousing models suffer today,
including but not limited to: cascading change impacts, scalability issues, flexibility problems,
absorption of new systems, and so on. By breaking the standards, you will experience many of the
same problems that you have today; you will negate the whole reason for moving to a new data
modeling structure!
It is fine to add attributes to Satellites; it is not okay to change the primary keys of the Link or Hub
structures. There is a tendency by designers to want to add temporality (date/time keys) to Links'
primary key structures and Hub key structures. The original Data Vault design in 1993 allowed this
as an option. By 1995 flaws in this design began to appear; as with cracks in the foundation of a
home, these flaws were significant enough to warrant a re-definition of the Link structure.
The finalized design was tested and passed with significantly better results; the finalized design
allows no temporal date/time elements as part of the primary key of the Links. Allowing temporality
as a part of the primary key of the Links caused re-engineering 3 to 6 months later. It is the view of
this author that *any* cause of re-engineering should be eliminated if possible, and if not possible,
the impacts of changes should be reduced to a minimum. Otherwise, the results are disastrous; akin
to an invasive wall-climbing vine that anchors its roots deep in the structure it's climbing. Eventually
that structure must be torn down and completely re-built. It is the same for the Link table where the
primary key introduces temporality.
One of the foundations of the Data Vault is to enable iterative development by consistently
minimizing rework and re-engineering. This not only future-proofs the Data Vault, but also facilitates
rapid development, because slight omissions or oversights at initial stages of the Data Vault design
can be coped with in an elegant and extremely economic way. Not so when temporal structures are
incorporated into the primary key structure, because the primary key changes must be migrated
down-stream to ALL child tables, leading to cascading change impacts. The cascading change
impacts affect re-engineering efforts of ALL the child tables; from loading routines to queries,
everything must be changed.
5.12 Link-To-Link (Parent/Child Relationships)
A Link-to-Link relationship indicates a parent-child arrangement, or a hierarchy of some sort. In this
case, it is equivalent to a nested relationship with different levels of grain. An example of a Link-To-Link
is below, in Figure 5-19.
Figure 5-19: Example of Link To Link Relationships
For this example, Hub Product and Hub Supplier are both parents to Link A. Link A and Hub Sales Person are both parents to Link B. Link B and Hub Territory are both parents to Link C. This forces a rise in complexity in loading and querying, and it becomes most evident when Satellites are found as children of each Link.
Data modelers often feel the need to put a Link to Link relationship in the model. While these relationships are interesting and may be easy to read logically, they are extremely difficult to implement. From a logical standpoint it is easy to see the parent-child relationships when modeling a Link to Link architecture. Implementing this kind of model is difficult because of the parent-child dependencies during the loading cycle. The following is a discussion on best practices for removing Link to Link from the physical implementation.
Note: it is fine to logically model Link to Link relationships; the problem is when they are expressed in the physical data model. To avert issues and problems, the Link structures should be flattened out, and the hierarchy dependencies removed.
Keeping Link-to-Link in the physical model requires the ETL to load sequentially to each child Link; it removes the ability to load all Links in parallel all the time.
Mathematically speaking, if we denormalize the Link structures so each Link is connected to its parent Hubs, we can represent the same data set in a flattened manner. The structure becomes simpler to maintain (going forward, it is future-proof, especially if the relationship in the source system changes), and the structure is easier to load as well as query. Figure 5-20 shows the first flattened hierarchy, and the new structure.
Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy
Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy
The final step unhooks the dependency of Link C to Link B. The Links are now successfully flattened (denormalized). This new structure allows all Links to be loaded in parallel, and allows all queries direct access to the data set through the Hubs. This means that if the queries need access to other Links, it will be available based on direct request. It also has the following effect: there can be records in Link C which do not exist in Link B! There can be records which exist in Link B which do not exist in Link A. This is absolutely vital to having a Data Warehouse capable of absorbing 100% of the data 100% of the time (within scope). If the other dependencies were in place, we would be forced to create parent records for the entries in Link B and Link C just to load the appropriate data set.
Please note: the process of removing Link-To-Link relationships should happen after the logical model has been built, as it is easier to accomplish once the correct relationships have been established.
Link structures are all defined the same way. There are several different applications of Links which
require discussion, introduction, and definition. It is these types of applications that are discussed
in the following sections. The different types of Links include the following:
Hierarchical Links
Same-As Links
Transactional Links
Exploration Links
Low-Value Links
Computed and Aggregate Links
In some cases, the data within the Links is derived or computed by one or more business processes, thus resulting in a Link which contains non-auditable data; or at the very least, data which never existed in the source system. If this is the case, mark those rows with the appropriate system-generated record source, or process name. Some of these cases include utilizing a Data Quality engine to produce similarity across names and households, business names, product names, etc. Other cases include using aggregate functions to produce corporate vision information that is used to drive the business in day-to-day decision making.
A majority of the time, these computations and aggregations or results of processing belong in a
business Data Vault which is defined in the upcoming book: Quick Start Guide to Business Data
Vaults.
5.14 Hierarchical Links
Hierarchical Links are just what their name implies: a Link structure which contains N levels of hierarchical data from the same Hub. For example, consider the case of an employee who reports to a manager. That manager is also an employee, who happens to report to a director, and so on. The Hierarchical Link allows roll-ups and aggregation of lower level data into a tree topology, or tree-like organization.
Hierarchies are a form of ontology. Let's look at an organizational chart, and an example of a Hierarchical Link. Remember, the Hierarchical Link is an application of the standard Link structure. It does not change the Link structure nor violate the rules in any fashion.
Figure 5-22: Example Organization Structure
The following assumptions are made about the organization shown in Figure 5-22: in this case, different stores report to different divisional offices, and the division offices report to an executive office. The Data Vault model would appear as Figure 5-23:
Figure 5-23: Hierarchical Link for Offices
The Hierarchical Link is shown in purple above the Hub Office table. The hierarchical Link contains the office sequence twice; once for the root office, once for the parent office. There can be as many office roll-ups as needed. The reason for extrapolating the hierarchy to a many-to-many relationship is that relationships in business change over time. So the representation of the hierarchy today may not be the same as yesterday, or even the same as tomorrow.
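A minimal sketch of this structure, using sqlite3 with invented table, column, and key names: the Hub Office sequence appears twice in the Hierarchical Link, once as the office itself and once as its parent, so any depth of roll-up is representable as plain rows:

```python
import sqlite3

# Hypothetical Hierarchical Link: the Hub_Office sequence appears twice,
# once as the child office and once as the office it reports to.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Link_Office_Hierarchy (
    Link_SQN          INTEGER PRIMARY KEY,
    Hub_Office_SQN    INTEGER NOT NULL,   -- the office itself
    Hub_Office_Parent INTEGER NOT NULL,   -- the office it reports to
    Load_DTS          TEXT, Record_Source TEXT,
    UNIQUE (Hub_Office_SQN, Hub_Office_Parent)
)""")
rows = [(1, 101, 201), (2, 102, 201),   # stores 101, 102 -> division 201
        (3, 201, 301)]                  # division 201    -> executive 301
conn.executemany(
    "INSERT INTO Link_Office_Hierarchy VALUES (?,?,?,'2011-01-01','HR')", rows)

def roll_up(office):
    """Walk the hierarchy upward until an office has no recorded parent."""
    cur = conn.execute(
        "SELECT Hub_Office_Parent FROM Link_Office_Hierarchy "
        "WHERE Hub_Office_SQN = ?", (office,)).fetchone()
    return roll_up(cur[0]) if cur else office

top = roll_up(101)  # store 101 rolls up through division 201 to executive 301
```

Because the hierarchy is just rows in a many-to-many Link, tomorrow's re-organization is a new set of rows, not a structural change.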
5.15 Same-As Links
Same-As Links are another type of application of the Link structure. In this case, the data set is applied as resolution information. In other words, all data exists at the same semantic grain, or has the same meaning to the business. All data are peers to one another. In these cases, the differently spelled names of companies all represent the same company. Figure 5-25 below demonstrates business data that identifies the same-as concept.
In each case for the example above, a master spelling has been chosen. This can be thought of as a step in the direction of defining master data for use in the operational systems. The Data Vault model for this example would appear in Figure 5-26:
Figure 5-26: Same-As Link Data Vault Model
Remember that even though the application (or usage) of the Link varies, the Link structure stays the same.
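The resolution behavior of a Same-As Link can be sketched as follows. This is a hypothetical illustration; the company names and the choice of master spelling are invented for the example:

```python
# Minimal sketch of a Same-As Link: every row pairs a chosen master business
# key with one of its duplicate spellings, all at the same semantic grain.
same_as = [
    ("IBM", "I.B.M."),
    ("IBM", "Intl Business Machines"),
    ("IBM", "International Business Machines"),
]

def resolve(name, links):
    """Return the chosen master spelling for any known variant."""
    for master, duplicate in links:
        if name in (master, duplicate):
            return master
    return name  # unknown names pass through unchanged

master = resolve("I.B.M.", same_as)
```

The Link structure itself is unchanged; only its usage (resolving peer spellings to a master) differs from an ordinary relationship Link.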
5.16 Begin and End Dating Links
Figure 5-27: Incorrect Link with Begin/End Date
The process of putting a date in the Link won't hurt anything technically. However, it increases the chance that the data set will be utilized the wrong way. In other words, the IT person will now have to answer questions like:
All of these questions arise, along with complications in loading, querying and mining, when the structure of the Link is compromised. Every field in the Data Vault has a specific purpose, and exists in a specific place for one or more business reasons. Begin and End dates describe when a relationship is active / inactive. The purpose of a Link is to establish the fact that a relationship exists (remember: right, wrong or indifferent, with no regard to time).
Once the association has been established, it is a fact that it existed at that point in time in the operational system. The fact stands for all time; the fact is neither right nor wrong, nor does it start or stop. It is a relationship that the source system recorded; therefore the Data Vault records it as well.
Remember, a patient can have many different interactions with that single location at different times. This means that the dates and times in this example are in fact descriptive in nature. Therefore the temporality of the Link data must be described in a Satellite in order to maintain the proper structures. Remember this: adding begin and end dates to Links changes the grain of the data (or business key) in the Link. Figure 5-28 below shows the effect of adding begin and end dates to the Link structure.
Figure 5-28: Begin & End Dates in Links
In this example, the driving key comes into question. Some of the many questions that appear as a result: What do the BEGIN and END dates mean? What do they represent? How are they generated? In this case, if the source system creates begin and end dates, then the business user has complete control over these, and they cannot accurately depict a proper time-line for the system-driven relationship. Why? Because the business users can back-date the begin and end date sequences.
There's a more technical reason that works against this structure. In Figure 5-29 below, it is clear to see that the same relationship can be represented multiple times over time, which extends the meaning of the unique business key sequence.
Figure 5-29: Example of Poorly Constructed Link
Links with begin and end dates cause problems IF and WHEN there are Satellites attached as children. They can also cause queries to produce Cartesian results when joins are made that ignore current or single record access. Results can be disastrous across joins, and performance will slow to a crawl (just as it does in HUGE Fact tables) due to the lack of unique entries in the Link table.
In order to track begin and end cycles of relationships, the best practice solution is to place them in an Effectivity Satellite. These are discussed in the Satellite Chapter (Chapter 6). However, Figure 5-30 below shows an example of an Effectivity Satellite off the Link. Note: other Satellite data (Sat_Cust_Acct_Emp_Details) has been shortened due to screen real-estate.
Transactional Links are defined to be a data set which cannot legally change; in other words, transactional history. Any transaction that cannot legally be edited qualifies for the transactional Link. The easiest qualification for a transactional Link would be to call it an unalterable fact. In other words, once issued, the record stays intact as auditable history forever.
There are two ways to model the application of this data within the Data Vault. The first is the traditional method: Link and Satellite (modified to have no history); the second is to place all the data in the Link structure itself. Figure 5-31 indicates the first method for setting up a transactional Link.
Transactional data is loaded direct to the transactional tables (both the Link and the Satellite). In
the above example, the transaction number is included in the Link for unique key structuring, while
the transactional date and time is included in the Satellite. It is possible in this circumstance to use
the transactional date as the Load date if and only if the time at which the transaction is loaded to
the Data Vault is relatively close (within seconds) of the actual transaction date itself. Otherwise it is
important to separate the data set to accurately represent it.
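The "if and only if" rule above can be sketched as a small decision function. This is a hypothetical illustration; the five-second tolerance is an invented value standing in for "within seconds":

```python
from datetime import datetime, timedelta

# Sketch of the rule: reuse the transaction date as the Load Date only when
# the warehouse receives the transaction within seconds of it occurring;
# otherwise keep the two dates separate. TOLERANCE is an assumed value.
TOLERANCE = timedelta(seconds=5)

def choose_load_date(txn_dts, arrival_dts):
    if abs(arrival_dts - txn_dts) <= TOLERANCE:
        return txn_dts       # near-real-time: the dates are interchangeable
    return arrival_dts       # batch latency: record the true arrival time

txn  = datetime(2011, 3, 1, 12, 0, 0)
fast = choose_load_date(txn, datetime(2011, 3, 1, 12, 0, 2))   # 2 seconds later
slow = choose_load_date(txn, datetime(2011, 3, 1, 23, 0, 0))   # hours later
```

In the slow case the transaction date remains a descriptive attribute in the Satellite, while the Load Date accurately represents arrival in the warehouse.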
Although it is not pictured here, the transactional representation of the Satellite does not need to
store the Load Date, as in most cases it will match the Load Date housed in the Link parent. There
is however an exception to this rule: in some specific cases, the transaction is delivered in two parts
from two different streams, just milliseconds apart. In this type of real-time case, the Load Date
should be modeled and stored in both the Link parent and the Satellite in order to properly
represent the different arrival timings.
Figure 5-31 above also indicates a slightly modified Satellite structure (again, discussed in the
Satellite Chapter). In this case, there is no load-end-date in the Satellite, indicating there is no
history; in other words once the data has been added to the Data Vault it cannot be changed or
superseded with new information.
There is another option within the Data Vault for modeling transactional data, where the information is housed directly in the Link structure. This architecture is not preferred, as it changes the architectural design by introducing decisions into the design process; therefore it increases the complexity of the maintenance and loading routines. In certain circumstances where performance is absolutely required to the millisecond level (or lower), it may be necessary to structure the transactional Link as in Figure 5-32:
Figure 5-32: Transactional Link, No Satellite
The only issue to watch for with this type of Link is the width of the data set. The width can easily become too large, and quickly cut down on the number of rows per block. If the Link becomes too wide, the performance of both the load and the queries will decrease. Transactional Links are generally built to house insert-only, rapid-fire transactions which arrive on a continuous multi-stream basis, direct from the operational systems in to the Data Vault. The decision to adopt this modeling structure must be made on a case by case basis.
5.19 Computed Aggregate Links
Computed aggregate Links are similar to Fact tables in a dimensional model. Computed aggregate Links have a record source that is labeled system generated. Computed aggregate Links are utilized to house pre-computed data sets like totals, summaries, averages, minimums and maximums. They are part of the multi-layer (scale free) architecture that the Data Vault offers. Typically, Computed Aggregate Links are found only in the architectural component known as the Business Vault. However, there may be times when they provide value to the raw data sets, and hence will be found in the raw Data Vault.
Figure 5-33: Example of Computed Aggregate Link
The data found in computed aggregate Links are generally not auditable, as they are machine computed within the Data Vault and are not part of any source system. Caution: if the results housed in the computed aggregate Links are found in financial reports, or on a corporate executive's desktop, they may become auditable as they are utilized by business users to run the business.
For further exploration: In the example above (Figure 5-33), the suppliers are interested in knowing their total sales of each product by store territory. Rather than produce a separate data mart, the architects decided to include the pre-computed aggregate directly in the Data Vault. The function F(x) determines the business rules for aggregation and possibly cleansing, which may or may not include product roll-ups to higher level assemblies.
This type of structural component is an add-on, and is not considered to be part of the core Data Vault model. The add-on is similar to other query assist tables that provide pre-built answer sets to routines that load down-stream data marts.
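The population of such an aggregate can be sketched as follows. The sales rows, key values, and the process-name string are invented for this illustration; the point is that the aggregate rows carry a system-generated record source rather than a source system name:

```python
# Minimal sketch of populating a Computed Aggregate Link: totals by product
# and territory, stamped with a system-generated record source so the rows
# are clearly marked as machine-derived, non-auditable data.
sales = [
    ("prod-1", "terr-A", 100.0),
    ("prod-1", "terr-A",  50.0),
    ("prod-2", "terr-B",  75.0),
]

agg_link = {}
for product, territory, amount in sales:
    key = (product, territory)
    agg_link[key] = agg_link.get(key, 0.0) + amount

# Each aggregate row carries the process name, not a source system name.
rows = [(p, t, total, "SYSGEN:F(x)-sales-rollup")
        for (p, t), total in sorted(agg_link.items())]
```

Marking the record source this way keeps the non-auditable, derived rows distinguishable from raw data if they ever surface in reports.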
The study of the Data Vault as a neural network introduces a number of concepts. One of these is the idea that the tables in the Data Vault act as the data storage, or nodes, in a fuzzy logic algorithm. In doing so, the neural network needs to establish associations with strength and confidence ratings. In querying the Link structures, the neural network can learn the context housed within, and determine if the relationship needs to be improved, if it's the strongest relationship, or if it's the weakest.
The strength rating (when added to the Link) is the result of data mining efforts to establish a
correlation across the related data sets. In other words, if there are two Hubs with Satellites that
both describe cars (maybe different types of cars), then an association or relationship can be
formed with a fairly high strength of 90% or above (for example). But it's more specific than that: it's based on EACH BUSINESS KEY association. In this example, there are two cars with different VIN numbers that were recorded by two different systems. They each describe a blue, front-wheel-drive car with 250k miles and a matching 1998 make and model. They are each connected to similar owners/drivers in different states. The inference might be a 90% chance that these are the same car.
The confidence rating must be added to the Link in conjunction with the strength rating, so that we know how confident the knowledge engine is in the rating it has provided. In the example of the cars above, the confidence may only be 60%, because the drivers might have different names and never have had the same address. However, maybe the confidence is 90%, because a mining effort across the drivers sees a family relationship between the drivers.
Strength and confidence can be added to the Link structure on a row by row basis, and are utilized
by analytics routines to filter out important correlations. Of course, these strength and confidence
ratings may change depending on the question being asked. At that point, the knowledge that is
sought may need to recalculate these ratings so that they make sense. The neural net engine that
is making the assumptions and assigning these calculations should utilize an industry vertical
ontology that describes the business terms. Otherwise, spotting the context for associations will be
difficult if not impossible.
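The row-by-row strength and confidence idea can be sketched as follows. The VIN keys, rating values, and filter thresholds are all invented for this illustration; real thresholds would depend on the question being asked:

```python
# Sketch of strength and confidence carried row by row on a Link.
# All values and thresholds below are illustrative assumptions.
dynamic_link = [
    # (hub_key_1, hub_key_2, strength, confidence)
    ("VIN-111", "VIN-999", 0.90, 0.60),  # probably the same car, weak evidence
    ("VIN-222", "VIN-888", 0.95, 0.90),  # strong match, strong evidence
    ("VIN-333", "VIN-777", 0.40, 0.95),  # confidently NOT the same car
]

def important_correlations(rows, min_strength=0.85, min_confidence=0.80):
    """Filter the associations an analytics routine would keep."""
    return [r for r in rows
            if r[2] >= min_strength and r[3] >= min_confidence]

kept = important_correlations(dynamic_link)
```

As the text notes, a different question may warrant recalculating these ratings, so the filter thresholds belong to the analysis, not to the Link itself.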
Note: this type of activity brings into focus another application of the Link structure called a Dynamic Link. Dynamic Links are discovered and created by machine learning algorithms. The data in the Dynamic Links are generally not auditable (as fuzzy logic rarely produces the same result twice). They are very similar in nature to Exploration Links (described below); however, the difference is that Dynamic Links are machine driven, while Exploration Links are manually created.
What you can do with Dynamic Linking is limitless. The Data Vault model is a scale-free architecture, which allows you to explore different Linking constructs until you find the right one that represents the business. It's also the very same reason that the Data Vault model is future-proof, in that it can absorb any future change without changing the nature of the historical data that has already been collected.
5.21 Exploration Links
Exploration Links are short-circuits to the joins across the Data Vault, and are placed in to the Data Warehouse for business reasons only. They are manually generated and maintained; however, if an exploration Link proves to be valuable to the business, the loading cycle can be automated. Exploration Links are a form of computed aggregate Links. They may or may not contain computed attributes.
Exploration Links are not auditable. The architect and BI team implement exploration Links to cross
several different parts of the model. They may span between 2 different Hubs which are spread
across the model, and not directly Linked by the source systems. A small company in Denver called
NetQuote installed an exploration Link to determine up-sell potential for targeted ads and discounts
to their web-customers as they clicked through the system.
This company saw a 40% increase in profitability as a result of the exploration Link. They found a
reason to implement it on a consistent basis. This company also built an Operational Data Vault
that was loaded at the time the transaction was generated from the web-front end. The operational
Data Vault was hooked to the message bus for both incoming and outgoing message routing.
Exploration Links are encouraged once the base Data Vault has been constructed and is in
operation. These Links can be created, queried, and destroyed at will without the destruction of
history within the Data Warehouse. Exploration Links give the business user and the IT architect a
chance to play with the data set within the Data Vault; hopefully resulting in new questions and
answers that can be viewed from the data warehouse.
A Satellite is a time-dimensional table housing detailed information about the Hubs' or Links' business keys. The purpose of the Satellite is to provide context to the business keys. Satellites are the data warehouse portion of the Data Vault. The Satellite tracks data by delta, and only allows data to be loaded if there is at least one change to the record (other than the system fields: sequence, load-date, load-end-date, and record source). A Satellite can have one and only one parent table.
Satellites provide the descriptive data about the business key, or about the relationship of the keys. They describe how the relationship changes over time. Their job is to record the information as it is loaded from the source system. They use load dates and load-end-dates to indicate record life-cycles, because most database systems today are not capable of internally representing time-series properly. Satellites often provide data normalization for future proofing, scalability, and auditability of the data sets. How normalized a Satellite gets is a function of the design, and a choice made by the designer.
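The delta rule above (load only when a non-system attribute changed) can be sketched as a comparison function. The field names and sample rows are invented for this illustration:

```python
# Sketch of delta-driven Satellite loading: a new row is written only when
# at least one descriptive attribute differs from the current row, ignoring
# the system fields (sequence, load dates, record source).
SYSTEM_FIELDS = {"sqn", "load_dts", "load_end_dts", "record_source"}

def has_delta(current_row, incoming_row):
    keys = (set(current_row) | set(incoming_row)) - SYSTEM_FIELDS
    return any(current_row.get(k) != incoming_row.get(k) for k in keys)

current = {"sqn": 1, "load_dts": "2011-01-01", "record_source": "CRM",
           "name": "Jane", "city": "Denver"}
same    = {"sqn": 1, "load_dts": "2011-02-01", "record_source": "CRM",
           "name": "Jane", "city": "Denver"}
changed = {"sqn": 1, "load_dts": "2011-02-01", "record_source": "CRM",
           "name": "Jane", "city": "Boulder"}

skip = has_delta(current, same)      # False: only system fields moved
load = has_delta(current, changed)   # True: city changed, write a new row
```

Ignoring the system fields is what keeps the Satellite from filling up with duplicate rows every load cycle.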
Remember: all models must serve a purpose and a function. The rules and standards described in this book are golden guidelines, most of which should be adhered to; however, there are some circumstances in which the principles of the Data Vault can be preserved while altering the structure of the data model to fit the needs.
6.2
The Satellite entity structure consists of basic required elements: surrogate sequence id (from the parent table), load date stamp, load end date stamp, and record source. Database engines today do not currently support (natively) time-series based table structures. Due to this limitation, the architecture is forced to compensate with Load Date Stamps and Load End Date Stamps. These date stamps have been described in the common attributes chapter (Chapter 3) of this book.
The Satellite entity must NEVER contain foreign keys (except for the single parent on which it relies). If a Satellite structure is compromised, then the flexibility of the model is immediately compromised; in other words, all possible hope of future proofing the data model is immediately lost. You are then forced to re-engineer the data model in the near future when the business changes the way relationships are structured. Satellites may contain unknown or not-yet-identified business keys until such time as the business keys become identifiable.
While this is not a general practice, it is acceptable. When applying this standard rule, the business key housed in the Satellite is treated in the same manner as the rest of the descriptive data set: as just another descriptive element with changes tracked. However, once the business key becomes identifiable, it will be necessary (at that time) to split the key out to its own Hub, and add a Link association to the current parent of the Satellite. Then, the Satellite data must be re-formulated without losing history. The process of reformulation of Satellites is covered in the one-on-one coaching section and the Data Vault Implementation book. Satellites must have one and only one parent table; no others are allowed. Figure 6-1 below shows a standard structure of a Satellite entity.
Satellite Examples
Figure 6-2: Example Satellite Entities
The data in each of these formulates the warehouse. Keep in mind that for this particular Data Vault model, we are 1) dealing with a single source system, and 2) a model which has no identifiable or meaningful business keys.
6.4 Importance of Keeping History
History is partly what a data warehouse is all about. The Data Vault is no different, except that in the Data Vault, history is raw data. Satellite structures, being what they are, can be changed, altered, and re-designed (as is documented later in this chapter). It's important to remember: when a Satellite changes its design, 100% of the historical data must be preserved, or the Data Vault will no longer pass an audit.
As you continue through this chapter, please be mindful of this principle. Be thinking of how the history can be preserved throughout the different changes. History serves as the audit trail of the source systems, and the only record available is in the Data Vault; which means that the Data Vault you build is now a system of record. There are differing opinions about what a system of record really is; however, as the business retires old sources, or as you implement operational data warehousing, the data warehouse is relied upon to make financial decisions. It is at this point that the data warehouse becomes a system of record.
6.5
There are many different ways to define the type of data. One way is to define type as data type. In this manner, the Satellites can be divided into different pieces based on their data types. History has shown that the benefits of this approach are as follows:
Create a fixed width row for bits, integers, dates/times (all non-varchar components)
Create variable width rows for all varchar / char attributes
Create variable length BLOB / CLOB / LOB objects
Dramatically increase compression rates for data sets
Decrease overall storage needs (by reducing the potential for chained rows)
Easier management and maintenance
No guess work involved in defining new Satellites
Easier indexing strategies
Easier partitioning strategies
Easier Query Parallelism
End Result? Increased performance
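The classification described above can be sketched as follows. The column names, data types, and target Satellite names are invented for this illustration; the grouping rule mirrors the first three bullets:

```python
# Sketch of classifying Satellite columns by data type: fixed-width types,
# variable-width character types, and large objects each go to their own
# Satellite. All names below are hypothetical.
columns = {
    "is_active":  "BIT",     "unit_count": "INTEGER",
    "ship_date":  "DATE",    "price":      "DECIMAL",
    "prod_name":  "VARCHAR", "notes":      "VARCHAR",
    "spec_sheet": "BLOB",
}

def classify(dtype):
    if dtype in ("BLOB", "CLOB", "LOB"):
        return "sat_lob"       # variable-length large objects
    if dtype in ("VARCHAR", "CHAR"):
        return "sat_varchar"   # variable-width rows
    return "sat_fixed"         # fixed-width: bits, integers, dates, decimals

satellites = {}
for col, dtype in columns.items():
    satellites.setdefault(classify(dtype), []).append(col)
```

The fixed-width Satellite gets predictable row sizes (good for compression and rows per block), while the LOB Satellite isolates the data most likely to cause chained rows.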
Of course we can't ignore the nature of the query set. When classifying attributes into different Satellites by data type, it is important to remember the queries that will be grabbing the data sets, and put it in context with the platform that the queries are running on. For instance, if the platform is Teradata, or IBM DB2 UDB EEE / MPP, then the queries and parallelism will work quite well. Or if the platform is SQLServer 2008 R2/MPP, or Oracle SMP Big Iron with Partitioning and Parallel Query, then the queries will work quite well.
If the platform is a DB2-based AS/400, then normalizing the Satellite goes against the performance principles of the HFS (hierarchical file system). Also, if the hardware is under-powered, under-sized, or the database hasn't been tuned appropriately, then the queries might not run so well.
Furthermore, the database industry is changing (and by the time this is published, will have changed). The rise of NOSQL (like HADOOP) solutions, and columnar databases, will change the way we physically look at partitioning data sets. This is all beyond the scope of this book, and will be covered in the Data Vault Implementation book, or in the one-on-one on-line coaching section of my web-site at: http://danLinstedt.com
In the interest of discussion, and for the purposes of demonstration, the Figure below (Figure 6-3) shows a split of SAT_PROD from Figure 6-2 (above) into multiple Satellites.
Rate of change is a similar topic to type or classification of data. Rate of change can be described in many different ways. However, in this particular case, the term rate of change refers to how fast each element or group of elements changes in relationship to the others. In other words, the rate of change of cell phone numbers for an individual may be exponentially higher/faster than the rate of change for that person's address. Lumping all this quickly changing data with other, slower changing data causes data space explosion. Figure 6-4 shows an example of data that is denormalized in to a single Satellite, and changes at different rates.
Figure 6-4: Satellite Data Rate of Change Example
Although compression in the database offers a little bit of relief from the repetitive information, it still piles up over time. All this extra data is recording little to no changes to the information as it flows in. The impact can be seen in longer loading times, longer backup times, bigger log spaces, larger temp areas (needed for queries), and slower queries overall (more I/O going on at the hardware level).
In the above example, the cell phone changes every day. The phone number may change every other day, but the name and address may change once a year or less. The rigor required for loading a new row in this instance can be painful. Not only does the data set in the table explode, the indexes for the fields also explode. What's worse, the coverage (that's the internal database rating for which index is the best to use for SQL) is dramatically reduced by duplicate data; which means that if you index the Name or the Address columns, their selectivity will become very poor very quickly.
Note: if you think it looks OK now, or you don't see the harm in this (because you've been doing this for years with a type 2 dimension), then just try to imagine this happening over 10x the amount of data that you currently have. In other words, imagine this table with 100 million rows in it, when in reality you only have 6.6 million names and addresses (assuming each changed about 15 times total), and 100 million cell phone number changes. The performance is dreadful to think about because of the width or row size of the table.
Figure 6-5: Satellite Split by Rate of Change
In Figure 6-5, we see a 5x disk space compression immediately take place in the contact name Satellite (because only 5 rows were shown). As it turns out, storing the information once is much more efficient and practical. The contact phone Satellite still has all the changes and all the history, but the row size is much smaller, the index coverage is better, and the performance is faster.
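The storage effect of splitting by rate of change can be sketched with a small simulation. This is an illustrative Python sketch (not from the book); the attribute names and one-year history are hypothetical:

```python
# Sketch: compare rows stored when fast- and slow-changing attributes
# share one Satellite vs. when they are split by rate of change.
# A Satellite inserts a new row only when the tracked attributes change.

def rows_stored(history, columns):
    """Count delta rows for the given subset of columns."""
    rows, last = 0, None
    for snapshot in history:
        current = tuple(snapshot[c] for c in columns)
        if current != last:          # delta detected -> insert a row
            rows += 1
            last = current
    return rows

# One year of daily snapshots: cell changes daily, name/address never.
history = [{"name": "Jane", "address": "12 Elm St", "cell": f"555-{d:04d}"}
           for d in range(365)]

combined = rows_stored(history, ["name", "address", "cell"])
split = (rows_stored(history, ["name", "address"])
         + rows_stored(history, ["cell"]))

print(combined)  # 365 wide rows: name/address repeated on every delta
print(split)     # 1 + 365 rows, but the 365 are narrow cell-only rows
```

The combined Satellite stores the same number of rows, but every row repeats the wide, unchanging columns; the split design stores the slow-moving attributes exactly once.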
When designing Satellites, sometimes the first instinct is to try to combine multiple systems' data directly into a single structure. This can be good and bad. By combining multiple systems' data into a single structure, there are a lot of considerations to be made; these are covered in the next section: Overloaded Satellites (the Flip-Flop Effect).
The best practice, and the easiest to get comfortable with, is to split the Satellite data into separate Satellites, one per source system. The next question that comes to mind then is: why include a record source? Well, to be honest, because the source of the data may still need to be geographically identified, or possibly application identified. For instance, the source may be the "SAP Sales module", but the SAP Sales module may have been implemented across more than one source system (physical machine). It is also possible that it may be implemented in different geographic regions. The key to using a single Satellite per application is to ensure there is a match across the structures, and that the metadata is defined the same way by the business.
Note: The best practice is to split the Satellites across each source system.
- Allows the designer to add new systems as they come in the door without impacting existing designs and existing data sets.
- Removes the need to fight over what the data means, how to integrate it, and whether or not it needs to be split, concatenated, lengthened, shortened, or otherwise manipulated.
- Allows different data sets from different sources to populate their audit trail in accordance with their rate of change and type of data. Where in this case, type represents the source system.
- Solves the problem of disparate data arrival times. In other words, if or when the data arrives, it is inserted directly into its Satellite for that system; there's little to no competition (at the I/O or database level) for that resource (table). This allows us to maximize load parallelism.
- Allows real-time data to flow from one system while batch data flows from another; limits the exposure to the risk of having to merge data sets on the fly. Eliminates the dependencies across multiple systems that would force those systems to have the data ready at the same time.
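The load-parallelism point above can be sketched in a few lines: when each source system writes to its own Satellite table, the loads have no shared target and can run concurrently. A minimal Python sketch of that idea (the table names and feeds are hypothetical):

```python
# Sketch: one Satellite per source system means each feed loads
# into its own table, so the loads can run fully in parallel.
from concurrent.futures import ThreadPoolExecutor

FEEDS = {
    "SAT_CUST_SAP": [("CUST1", "Jane Doe"), ("CUST2", "Bob Ray")],
    "SAT_CUST_CRM": [("CUST1", "J. Doe")],
    "SAT_CUST_WEB": [("CUST3", "Ann Lee")],
}

tables = {name: [] for name in FEEDS}   # stand-in for physical tables

def load(table_name):
    # No other worker touches this table, so there is no I/O contention.
    tables[table_name].extend(FEEDS[table_name])
    return table_name, len(FEEDS[table_name])

with ThreadPoolExecutor(max_workers=len(FEEDS)) as pool:
    results = dict(pool.map(load, FEEDS))

print(results)  # each Satellite loaded independently
```

In an overloaded (single-table) design, the same three feeds would contend for one insert target and tend toward a serial load path.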
They say a picture is worth a thousand words; Figure 6-6 provides a generic example of what this might look like in a Data Vault model.
Figure 6-6: Customer Satellites Split by Source System
In this example, there are some overlaps for customers, including Name and Phone Number, but that's where the similarities appear to stop. Each system probably has its own unique way of defining what customer means! But the business stated in their requirements very clearly: if the customer record has the same business key in each system, then it is supposed to represent the same customer. Very rarely does this data ever line up in the beginning, especially once history is loaded to the Satellite.
The job of a good Data Warehouse is to point out or make known the discrepancies (the gap analysis) between the way the business believes it's operating, and the way the source systems are truly running. The job of the Data Warehouse is not to filter the information or to alter it in any way; to do so would violate compliance and auditability rules. Here, the discrepancies are plain to see, and if you look closely you'll notice that statistics can be run across the source systems to see how far out of alignment they are with each other. In other words: what is it costing my business to have broken business rules in different source systems? This question can finally be answered with metrics and measurements.
It's quite possible that from profiling the data in this example, one might learn that contacts really should have their own business keys, because they are totally and uniquely distinct from the notion of customer. Or they might learn the opposite: that all contacts are customers. The point is: the Data Vault should assist in telling the story, and splitting the source systems across multiple Satellites makes it easier to spot these erroneous patterns.
6.8 Overloaded Satellites (The Flip-Flop Effect)
Now suppose there is a need to see all the data in a single Satellite; what do we do then? There are specific reasons why and why not to do this. There are inherent risks as well; some of those risks were covered in the last section. Take a minute to check the last section to ensure you didn't miss anything important. This technique is called overloading because it allows multiple definitions of source system data to insert multiple rows of data into the same table. The hope is that the metadata definitions are the same for the fields, but there's no way to enforce that. Therefore, the data frequently becomes messy very quickly.
We can see the effects of overloading when viewing data sets in legacy systems. That is: a single field used for multiple purposes, with multiple meanings based on character position and appearance. In other words, smart-key data where no edit checks are in place in the application; tacked on to the Cobol copybook are multiple re-defines and programmatic logic to re-define what the data should represent.
Overloading a Satellite is not necessary given today's technology, and brings with it many risks, such as misinterpretation, misunderstanding, inability to see patterns, difficulty discovering problems in the data, and risk of audit problems. One of the other issues that overloading a structure bubbles to the surface is the question of: what do we do with the data set? Do we join it all together to make one best looking row for insert? Do we run rules against it to coalesce it together? It's a slippery slope that leads right back to where we started: with business rules being implemented up-stream of the EDW. This is not what we want.
However, all that said, Figure 6-7 represents what an overloaded Satellite might look like (from a data perspective), and the following paragraphs explain what (if any) good uses there might be for this type of design.
First, notice the record source. Each record source indicates a different source system (for purposes
of this example the sources are lines of business). It is not entirely clear which source system
should be the master system. Of course there are all kinds of questions that arise from this
example:
These questions and many more will begin to pop out with additional overloaded Satellites. So what's the benefit of overloading if there are so many issues? The only benefit that I've personally experienced in the past is: to get the business to deal with the source of the problem because the Data Vault ran out of disk space. The load cycle would load 15 million x 5 source feeds rows on every load, because the loading mechanism detected a delta. Which brings up another point: when a Satellite is overloaded, the loading cycle begins to take a turn toward the serial path.
Loading the Satellite must be done in reverse order (from least important to most important), whereby the last row to delta (be inserted) becomes the most current, and all the others get end-dated. Again, implementation is in the other book; this explanation is necessary to show the gravity (risk) of this design. The better design is to split the Satellites by source system. This allows each business unit to define which system is their master system, and when building the data marts each Satellite will then provide the most current row to the process. Furthermore, by splitting the Satellites out (as described in the previous section), the load can happen in parallel. Subsequent to the load, a reporting structure could be built that attempts to merge the multiple Satellite data into one table for the purposes of doing the cross-system data quality checking.
Let it be known: overloading a Satellite incurs many risks. From metadata to understanding, from load performance to indexing, from data quality to merging; all of which take a toll on the business and on IT.
6.9 Satellite Applications

6.9.1 Effectivity Satellites
Note: all children (all Satellites hanging off this Link or Hub) are ended when the
effectivity of the association is ended. This status of ended must be forwarded to
the data marts that are fed from the Data Vault. This is business user / source system
data, and must be included in all queries that access this information.
In addition, not all Links and not all Hubs are a good fit for effectivity. These Satellites are strictly for business user (source system) based data. It is not necessary to create an effectivity Satellite for every Hub and Link in the model unless the source system delivers the data set for every business key and every relationship; that would become a data miner's gold mine!
6.9.2 Record Tracking Satellites
There is another application of Satellites called Record Tracking. Record tracking is a system generated set of data. This data is not auditable (generally speaking). The purpose of the record tracking Satellite is to identify which source applications are feeding which keys and associations on what load cycles. It originated as a need to capture changes (missing rows) from a source feed, because we received a full dump of a legacy system every day, and rows would disappear for three days and then re-appear. We were told that just because they disappear for three days, it doesn't mean they are deleted.
Furthermore, we didn't have any CDC in place, so when a record was truly deleted, it went missing for an extended period of time. The business wanted a way of identifying the difference between missing for a few days and was deleted. They settled on a rule that said: for Data X, if the key doesn't show up from this application for 7 consecutive days, then mark it deleted. They had other rules for other data, i.e.: for Data Y it was 30 days, Data Z = 5 days.
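The consecutive-days rule described above can be sketched in a few lines of Python. This is an illustrative sketch only (the implementation book covers the real mechanics); the threshold and feed contents are hypothetical:

```python
# Sketch: mark a key "deleted" only after it has been absent from the
# source feed for N consecutive load cycles (e.g. 7 days for "Data X").

def mark_deletions(daily_feeds, threshold):
    """daily_feeds: list of sets of keys seen on each load cycle."""
    all_keys = set().union(*daily_feeds)
    deleted = set()
    for key in all_keys:
        run = 0                      # current consecutive-missing streak
        for feed in daily_feeds:
            run = 0 if key in feed else run + 1
            if run >= threshold:     # absent N cycles in a row
                deleted.add(key)
                break
    return deleted

feeds = [{"A", "B"}] * 3           # days 1-3: both keys present
feeds += [{"A"}] * 2               # days 4-5: B missing (only 2 days)
feeds += [{"A", "B"}]              # day 6: B re-appears, streak resets
feeds += [{"A"}] * 7               # days 7-13: B missing 7 straight days

print(mark_deletions(feeds, 7))    # only B crossed the 7-day threshold
```

Note how the day-6 re-appearance resets B's streak; that is exactly the "missing for a few days" case the business did not want flagged as deleted.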
This discussion is similar in nature (and related in concept) to the LAST SEEN DATE discussion that was depicted in section 3.5. Record source tracking Satellites indicate each system's arrival, on a load-cycle basis. For each key in each source system, or for each association on each source system, an insert is made to the record tracking Satellite indicating that it was present on the feed during the current load-cycle. The load-cycle is identified by load date, or load-cycle-id where load date has been replaced.

Because this Satellite is non-auditable (other than IT metrics), its rules for use and definition can be bent without breaking the architecture; the structure itself doesn't change. What does change is the way the data is treated. The following rules apply to record source tracking Satellites:
- A row is inserted (regardless of delta) for every day the key / association appears on the feed. In other words, it is not subject to delta processing.
- To avoid data explosion, each column (or the table itself) must be compressed.
- Because it's system driven, old load-cycle information may be summarized, and rolled off or deleted without harm. By rolled off, you may choose to back it up or move it to slower storage.
Figure 6-9: Denormalized Record Source Tracking Satellite

The data in Figure 6-9 is stored in a denormalized format. The legend is as follows:
Figure 6-10: Normalized Record Source Tracking Satellite
In this case, it is easy to add new record sources dynamically. There is no limit to parallel inserts; therefore there is no limit to the scalability of this table. It pushes the complexity downstream to the query (for interpretation and pivoting). If necessary, review the record source column definition in section 3.9.
Assuming this RS Satellite is a child of Customer, the data might be interpreted as follows. Assume Sequence 1 = Customer Key ABC123:

- On 10-14-2000, ABC123 (the key) appeared on the Manufacturing feed; however it did NOT appear on the finance, contracts, nor sales feeds.
- On 10-15-2000, ABC123 appeared on the Manufacturing feed, and did not appear on finance, contracts, nor sales.
- On 10-16-2000, ABC123 appeared on all feeds EXCEPT contracts.
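A sketch of how such a denormalized record-tracking row might be pivoted into statements like those above. The column names and flag encoding (1 = appeared on that feed) are hypothetical, not from the book:

```python
# Sketch: pivot denormalized record-tracking flags (one column per
# source feed) into per-load-cycle "appeared on" statements for a key.

rows = [
    {"load_date": "10-14-2000", "mfg": 1, "finance": 0, "contracts": 0, "sales": 0},
    {"load_date": "10-15-2000", "mfg": 1, "finance": 0, "contracts": 0, "sales": 0},
    {"load_date": "10-16-2000", "mfg": 1, "finance": 1, "contracts": 0, "sales": 1},
]
FEEDS = ["mfg", "finance", "contracts", "sales"]

def appearances(rows):
    """Map each load date to the feeds the key appeared on."""
    return {r["load_date"]: [f for f in FEEDS if r[f] == 1] for r in rows}

for day, seen in appearances(rows).items():
    print(f"{day}: appeared on {', '.join(seen)}")
```

This is the "interpretation and pivoting" complexity that the normalized form pushes downstream to the query.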
If the business provides detailed record sources (that might even indicate the point of origin within a business process), then they might be able to begin tracing the keys through the business processes. An astute data miner could make good use of this information to help the business understand how and when the data is moving through the systems. Someone who misses this concept sees no value in utilizing record source tracking Satellites.
6.9.3 Status Tracking Satellites
Figure 6-11: Status Tracking Satellite
Status Tracking Satellites should be normalized, and should follow the standard Satellite layout and rules, such as insert only when changes are detected. They should have compression turned on to make best use of the storage space, and if you're lucky they will help you identify which source application or source business process doesn't match business requirements and which do.

Statuses may be inserted from multiple sources during the same load cycle. This may or may not lead to multiple active Satellite rows (which are described in section 6.14 below).
The best practice is to insert only the master system status, or to split the status Satellite by source systems. If the Satellite is split by source systems, then you have simply postponed the decision to assign the master system (for query purposes of selecting current status) until you access the information (loading to a data mart).
6.9.4 Computed Satellites
At first glance, everything in the Data Vault looks as though it has to be raw data. For the most part,
this is true (and is one of the fundamental premises of the approach). Often there is a need to
process raw data through quality routines, cleansing routines, and address correction routines;
generally the desire is to run these routines once and then distribute all the information downstream
to the data marts.
Within the Data Vault methodology and architecture there is a place for this data. It's called a computed Satellite. The computed Satellite is a standard Satellite structure (with all the same rules, formats, and structural integrity). The difference is that the record source is SYSGEN (system generated information) or potentially the name of the application that is performing the data alterations. Computed Satellites are not auditable (generally). I say generally because when or if the data is used to run the business or make financial decisions, there is a good possibility that an auditor will come back and expect to see how, when, and what the data was.
From an implementation perspective, it is suggested that you split the computed Satellites off to
their own disk storage area. It may be wise to place them on SSD (solid state disk) if they are highly
accessed and need to be extremely fast. At a minimum, they should be placed on their own I/O
channels and their own storage so they do not compete for read/write resources with the raw data
sets.
6.9.5 Multi-Active Satellite Rows
Multiple active Satellite rows are similar to Satellite overloading. Satellite overloading is discussed
previously, in section 6.8 above. The concept here indicates that there are several rows per key that
are alive, active, and valid all at the same time. In most cases, they would be arriving in the
Satellite from different systems. However, there are times when the data is normalized (as in the
example below) that make it a better choice to have multi-active Satellite rows.
For instance, consider a part number which is assigned a status flag; in the manufacturing system it's an ACTIVE part, while in the planning system it's an inactive part. This part number may have multiple statuses from multiple systems, and they may or may not be valid (depending on the viewpoint of the user and depending on the definition of the flag). Multiple active Satellite rows can be averted easily (most of the time) by splitting the data by source system, although in some cases you may want to split it further by application within each source system.
Suppose, however, that you have a list of phone numbers on incoming data, and that you never know just how many phone number columns will arrive. Some days your loading process may see 3 phone numbers, other days it may see 5, and even within the same load batch the number of phone numbers is variable. In this case, it is extremely difficult to architect the right set of phone number columns in a Satellite, and the last thing that should be considered is phone_1, phone_2, phone_3, etc., causing wide rows, sparse population, and a bunch of null column values. It is precisely for these reasons that multi-active Satellite rows exist!
Figure 6-12 below demonstrates the loading of hierarchical XML data; it could also represent a hierarchical Cobol data set. Any hierarchical structured information with variable list length is a candidate for this technique. By normalizing the structure, the architecture is well-suited to absorb an unknown number of elements per parent record. The normalized Satellite has an additional element to the primary key known as a sub-sequence number. Sub-sequence numbers are discussed in section 3.2 of this book. They basically provide a mechanism with which to uniquely identify the data.
Figure 6-12: Multi-Active Satellite Rows
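The normalization described above might be sketched as follows. This is an illustration only; the row layout and names are assumptions, not the book's implementation:

```python
# Sketch: normalize a variable-length phone list into multi-active
# Satellite rows by adding a sub-sequence number to the primary key.

def to_satellite_rows(parent_seq, load_date, phones):
    """One row per phone; (parent_seq, load_date, sub_seq) is unique."""
    return [
        {"sqn": parent_seq, "load_date": load_date,
         "sub_seq": i, "phone": phone}
        for i, phone in enumerate(phones, start=1)
    ]

# Day 1 the feed carries 3 numbers, day 2 only 2 - no schema change needed.
day1 = to_satellite_rows(1, "2000-10-14", ["555-0001", "555-0002", "555-0003"])
day2 = to_satellite_rows(1, "2000-10-15", ["555-0001", "555-0002"])

print(len(day1), len(day2))
```

The structure absorbs any list length because extra elements become extra rows, never extra columns.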
In some cases it might make sense to replace the sub-sequence number with an actual piece of information that the business users understand. In this very special example (not shown here), the architect replaced the sub-sequence number with a copy of the phone number. This technique allowed them to overcome a difficulty in tracking the Satellite data from load to load.
While loading and implementation is not a focus of this book, this idea will be briefly discussed here as it has relevance to the structure choices made by the architect. One of the issues of utilizing sub-sequence numbers is that it introduces order-dependency to the load cycle. In other words, from one load to the next, if the order of the phone numbers changes then it's seen as an entire new delta for the employee; which means all the phone numbers are re-inserted as delta rows, even if the phone numbers themselves did not change.
This can be mitigated in two ways: one (as described above), using the phone number as the sub-sequence number (removes the order dependency during delta checking), or two, checking for the existence of the phone number in the Satellite as a currently active row before inserting. Option #1 destroys any chance of reproducing the data set in the proper order as it arrived (if this is important to you, then sub-sequencing is the only way). Option #2 doesn't check the deleted phone numbers that may have disappeared from the incoming data set.
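The order-dependency problem, and how keying by the value itself sidesteps it, can be sketched like this (an illustration under assumed structures, not the book's load code):

```python
# Sketch: delta checking with positional sub-sequence numbers vs.
# value-keyed sub-sequences, when the feed merely reorders the list.

yesterday = ["555-0001", "555-0002", "555-0003"]
today = ["555-0002", "555-0001", "555-0003"]   # same numbers, new order

# Positional sub-sequencing: (sub_seq, value) pairs differ -> full delta,
# so every phone number is re-inserted even though nothing changed.
positional_delta = (list(enumerate(yesterday)) != list(enumerate(today)))

# Value-keyed sub-sequencing: compare as sets -> no delta detected.
value_keyed_delta = (set(yesterday) != set(today))

print(positional_delta)   # True: spurious delta from reordering
print(value_keyed_delta)  # False: nothing actually changed
```

The set comparison is also why Option #1 loses arrival order: the positional information is exactly what it discards.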
In cases where there is no numeric column alternative, replacing the sub-sequence with an alphanumeric causes great problems with performance. Unfortunately this is one of the cases where choosing the best worst-case scenario seems to be the ideal. In such a case, sub-sequencing is always the architectural fall-back; please couple that choice with turning on compression of duplicate data across the table. This will assist with maintaining the integrity while allowing flexibility of an unknown number of elements to flow through the Data Vault.
Figure 6-13: Multi-Active Satellite Row Data
Figure 6-14: Multi-Active Satellite with Business Sub-Sequence

In Figure 6-14, the phone number has replaced the actual ordering column that was used in Figure 6-13. This allows detection of existing rows and discovery of rows that are deleted from the source system. This is not the preferred technique, as it requires a unique column to be available that is in numeric format.
6.10 Splitting Satellites
- Identify data in a Satellite that changes more rapidly than the other data in the Satellite.
- Group those common elements together by rate of change.
- Split the Satellite architecturally by creating new Satellites and moving those elements.
- Load the data to the new Satellite by simply copying the existing columns, load dates, and sequences.
- Run another process that begins removing duplicates (looks for deltas).
- Run a final process that updates the load-end dates after the dupes are removed.
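The copy-then-dedupe steps above can be sketched roughly as follows. This is an illustrative Python sketch with a hypothetical row layout; the real work would be done in set-based SQL:

```python
# Sketch of the copy and dedupe steps: after copying the split-off
# columns into the new Satellite, remove consecutive duplicates so
# only true deltas remain (one parent key shown for brevity).

old_sat = [  # (sequence, load_date, name, address) copied verbatim
    (1, "2000-01-01", "Jane", "12 Elm St"),
    (1, "2000-01-02", "Jane", "12 Elm St"),   # no change -> dupe
    (1, "2000-01-03", "Jane", "99 Oak Ave"),  # address delta -> keep
    (1, "2000-01-04", "Jane", "99 Oak Ave"),  # dupe
]

def remove_dupes(rows):
    """Keep a row only when its payload differs from the prior row."""
    kept, last = [], None
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        payload = row[2:]
        if payload != last:
            kept.append(row)
            last = payload
    return kept

new_sat = remove_dupes(old_sat)
print(len(new_sat))  # only the original insert and the address delta
```

Note that only duplicates are dropped; every row that carries a delta survives, which is what preserves 100% of the history.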
The most important concept to hold to when splitting a Satellite is to maintain 100% of the history of the data. If any of the history is lost by deleting rows that contain deltas, then the new Satellite must be truncated and re-loaded from the original Satellite. Maintaining the audit trail is vitally important. Once the audit trail has been checked and verified in both new Satellites, then the old Satellite can be deleted/removed.
It is recommended that you run queries (in parallel to the old Satellite) for a few weeks against the new Satellites to match the results. Once a balance has been established, and they are both showing equal results, then and only then can you delete the old Satellite and replace any affected downstream processes or extracts.
Figure 6-16: Step 2: Split Satellite Columns, Design New Tables
Now that the columns are properly split apart, and the new structures have been built, the next step is to handle the data (Figure 6-17).
Figure 6-18: Step 4: Eliminate Duplicates
This particular example does not show all the details of eliminating duplicates; in fact, by simply eliminating the duplicates in this example, we no longer need to run the next processing step: adjusting the begin and end dates. However, these steps will run regardless, as there may be other rows of information that require fine-tuning. The example below (Figure 6-19) shows the data after copy-in, but before running reduction and delta processing.
So why would you want to do this?

1. Identify data sets that are changing at the same rates of speed in one or more Satellites related to the same parent
Figure 6-21: Consolidating Satellite Data
This example shows the data set; please note: this example does not show record source propagation. I do, however, discuss this component in the following paragraphs.
The date in purple (highlighted/bold) in SAT_CUST_CONTACT_ADDR is the first and earliest date for this particular customer (08/01/2000). Therefore it results in the first row inserted into the consolidated Satellite. The name and cell are NULL because there is no data for that customer available as of that date in those Satellites. The record sources are all ?? because resolving them can be a matter of interpretation, or a decision made by the business user as to which system is the MASTER system.
Please note: if the data in the original Satellites was previously split apart, then there's a chance that the record sources for all rows across all split Satellites would be the same. In this case it is OK to select the one available record source value and assign it to the newly combined row in the consolidated Satellite. If this is not the case, please see the discussion below.
If record sources vary across the multiple split Satellites (from row to row within a given parent key), then a decision must be made in consolidation: which record source to use? This decision should be put forward to the business users for complete resolution and sign-off; however, that is not always possible. For the cases where the business users won't make the decision, the following rule of thumb is provided:
First, select the record source from the same table that houses the earliest load date that is being
selected. If this does not produce the desired outcome, then select the master system record
source from the Satellite in which it appears. Unfortunately during consolidation of multi-system
Satellites you may lose metadata. Please be aware that if the metadata is lost, the only way to
correctly audit the system will be to restore that days load to the staging area for further review.
At the end of this process of consolidation, run the assignment of Load-End-Dates to properly adjust the dates and times of the data set. In Figure 6-22, I've included Load-End-Date calculations after they've been set.
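Load-end-date assignment is typically just: each row is end-dated by the load date of the next row for the same parent key, and the newest row stays open-ended. A hypothetical Python sketch of that calculation (the open-ended sentinel value is an assumption):

```python
# Sketch: assign load-end-dates in a consolidated Satellite.
# Each row's end date is the load date of the NEXT row for the same
# parent key; the newest row stays open (sentinel "9999-12-31").

OPEN_ENDED = "9999-12-31"

def assign_end_dates(rows):
    """rows: list of (parent_key, load_date); ISO dates sort correctly."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for i, (key, load_date) in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        end = nxt[1] if nxt and nxt[0] == key else OPEN_ENDED
        out.append((key, load_date, end))
    return out

rows = [(1, "2000-08-01"), (1, "2000-09-15"), (1, "2001-01-02")]
for r in assign_end_dates(rows):
    print(r)
```

In set-based SQL this is the classic LEAD-over-load-date pattern partitioned by the parent key.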
Figure 6-22: Load End Dates Calculated in Consolidated Satellite
It is important to reconcile the information all the way through to the Data Marts before deleting or destroying the previously split Satellites. Run the data set in parallel for a week or so through to a new data mart in order to balance the information and reconcile the results to the old data mart. When buy-off is achieved, then it will be safe to backup and roll-off the old split Satellites.
Splitting and consolidating can happen at any time during the life-cycle of the Data Vault. It's a judgment call based on the rates of change in the data set, and the width of the rows.
7.1 Point in Time Tables
Point in time tables (PIT tables) are structures which surround a single Hub (or Link) and its corresponding Satellites. A PIT table is defined as: a structure which sustains integrity of joins across time to all the Satellites that are connected to the Hub. It is a specialized form of Satellite. There is a single PIT table built for each Hub. These tables cannot and should not span multiple Hubs and Links. Figure 7-1 shows the basic structure of a PIT table.
Figure 7-1: Structure of PIT Table
Record sources are not necessary, as the PIT table is a system generated table. Should you wish to include record source you may, and the only reason for doing so would be because you have multiple loading processes populating the PIT table. End-dates are not necessary unless you wish to enable BETWEEN queries against the snapshot information. PIT tables provide equal-join access to tables around a Hub rather than focusing on outer-join queries to answer questions. This is why PIT tables are a query assistant table only.
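The equal-join idea can be sketched as follows: for each snapshot date, a PIT row records, per Satellite, the load date of the row that was current at that moment, so queries equi-join on those load dates instead of resolving time ranges with outer joins. A hypothetical sketch (Satellite names and dates are illustrative):

```python
# Sketch: build one PIT row for a Hub key. For a snapshot date, look
# up the most recent load date in each surrounding Satellite; those
# load dates become equal-join keys back into the Satellites.

sat_name_loads = ["2000-01-01", "2000-06-01"]            # load dates only
sat_addr_loads = ["2000-01-01", "2000-03-15", "2000-09-01"]

def current_load(load_dates, as_of):
    """Latest load date <= the snapshot date, or None if none exists."""
    eligible = [d for d in load_dates if d <= as_of]     # ISO strings sort
    return max(eligible) if eligible else None

def pit_row(hub_key, as_of):
    return {"hub": hub_key, "snapshot": as_of,
            "name_load_date": current_load(sat_name_loads, as_of),
            "addr_load_date": current_load(sat_addr_loads, as_of)}

row = pit_row("CUST1", "2000-07-01")
print(row)
```

A query for the 2000-07-01 snapshot then joins each Satellite on its stored load date with a plain equality predicate.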
PIT tables should not be created until or unless there is a performance problem with accessing the Satellites around a single Hub. Figure 7-2 shows an architectural depiction of where PIT tables fit within the Data Vault.
7.2 Bridge Tables
Figure 7-4: Bridge Table Structure
The Bridge table may contain Hub business keys; however, be careful, as raising the number of bytes per row will dramatically slow the performance of the table, especially in very large data sets. Another consequence of having this table become too wide is the introduction of chained rows, fragmentation, and over-indexing. Keep in mind that the purpose of this table is to enhance join performance, not kill it.
Figure 7-5: Bridge Table Architectural Overview
As indicated, this Bridge table spans data from three Hubs and two Links. This example (see Figure 7-5) maintains the lowest possible grain by keeping the Cartesian product intact; no GROUP BY has been executed prior to loading the data set. The data in the Bridge (see Figure 7-6) might be read as: seller by product by parts. Only the keys of the products which have a seller and are used in the manufactured parts will be listed in the Bridge table, unless you choose to populate some of the keys with NULL values.
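The load described above can be sketched as a plain equal-join across the two Links, with no GROUP BY, so the Cartesian grain survives. This is a hypothetical seller/product/part schema built with Python's sqlite3; all names and values are invented, not taken from Figure 7-6.

```python
import sqlite3

# Illustrative sketch: a Bridge spanning Seller, Product, and Part Hubs
# through two Links. All table names and key values are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lnk_seller_product (seller_sqn INT, product_sqn INT);
CREATE TABLE lnk_product_part   (product_sqn INT, part_sqn INT);
CREATE TABLE brg_seller_product_part
    (seller_sqn INT, product_sqn INT, part_sqn INT);
""")
con.executemany("INSERT INTO lnk_seller_product VALUES (?,?)",
                [(10, 1), (11, 1), (10, 2)])
# Product 3 has a part but no seller; product 2 has a seller but no parts.
con.executemany("INSERT INTO lnk_product_part VALUES (?,?)",
                [(1, 100), (1, 101), (3, 102)])

# Populate the Bridge at the lowest grain: an inner equal-join of the
# two Links, no GROUP BY, so the full key combination stays intact.
con.execute("""
    INSERT INTO brg_seller_product_part
    SELECT sp.seller_sqn, sp.product_sqn, pp.part_sqn
    FROM lnk_seller_product sp
    JOIN lnk_product_part   pp ON pp.product_sqn = sp.product_sqn
""")
rows = con.execute("""
    SELECT * FROM brg_seller_product_part
    ORDER BY seller_sqn, product_sqn, part_sqn
""").fetchall()
print(rows)
# [(10, 1, 100), (10, 1, 101), (11, 1, 100), (11, 1, 101)]
```

Note that products 2 and 3 drop out of the Bridge, matching the text: only product keys that have a seller and appear in the parts Link are listed, unless you switch to outer joins and accept NULL key values.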
I define reference data as follows: any information deemed necessary to resolve descriptions from codes, or to translate keys in a consistent manner. Many of these fields are descriptive in nature and describe a specific state of the other, more important information. As such, reference data lives in separate tables from the raw Data Vault tables.

Generally speaking, reference data is neither a business key nor purely descriptive. It lives in a grey area and covers a number of facets that resolve the information in the warehouse into a better context. For instance, ICD9 / ICD10 codes (medical diagnosis codes) are an example of reference data. They may be external sources of data governed (in this case) by a world body. These codes are often used as descriptors of other business keys.

While building separate tables in the Data Vault and adding the codes to Satellites appears to call for foreign keys in the Satellites, I will tell you that this should be a logical representation only. If you physically create the foreign keys from the Satellites to the reference tables, you can a) blacken your model (too complex to maintain, too many relationships), b) destroy flexibility, and of course c) you would not be following the perfect-world scenario set up above.
In a truly perfect world, we would resolve all reference data on the way into the warehouse, thus making reference data obsolete, and then there would be no need for foreign keys in the Satellites to begin with. In any case, reference data can and does exist as a part of the Data Vault model, and should be defined as: external data outside your control, data that is commonly used to set up context for or describe other business keys, or, quite simply put, standard codes and descriptions or classifications of information.

The structure of the reference tables can vary from 3rd normal form, to star-like, to Hubs, Links and Satellites from the Data Vault. So there is no need to worry or fret about the type of structure that you want to use; just choose the one that works best for you and move on. Some options and scenarios are described in the following sections.
8.1 No-History Reference Tables
Sometimes there is no need to store the history of reference data changes; in this case we use a typical 3NF or 2NF type table. The nature of a data warehouse is in fact to store history, but when the business signs off on the expected no-history requirement, then the EDW team has the go-ahead.

A no-history reference table is a structure that has no history! Imagine that!

OK, kidding aside, it's a table with no begin-dates and no end-dates. Before I go on, I'll say this: reference tables can be designed as Hubs and Links, or as simple 3rd normal form tables, that is: flat and / or wide; it's up to you. You need to decide what's best and what fits, then load it and go.

What types of data might you see in a no-history reference table? Well, that all depends of course, but here are some examples of what I've run into in my career:

And so on. If you'd like to add to this list, I'd love to have your feedback. Just put the example in an email and send it off. An example of a non-history based 3rd normal form reference table is shown in Figure 8-1.
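A no-history reference table of this flat 3NF style can be sketched as follows, using Python's sqlite3. The table, columns, and data are hypothetical, not the contents of Figure 8-1.

```python
import sqlite3

# Minimal sketch of a flat, no-history reference table: just a code and
# its description, with no begin-date or end-date columns. All names
# and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE ref_state (
        state_code TEXT PRIMARY KEY,   -- natural key, e.g. 'CA'
        state_name TEXT NOT NULL
    )
""")
con.executemany("INSERT INTO ref_state VALUES (?,?)",
                [("CA", "California"), ("CO", "Colorado")])

# Resolving a code to its description is a single equal-join / lookup;
# there is no point-in-time logic because no history is kept.
name = con.execute(
    "SELECT state_name FROM ref_state WHERE state_code = ?", ("CA",)
).fetchone()[0]
print(name)  # California
```

Because the code itself is the primary key and no dates are stored, an update simply overwrites the description in place, which is exactly the behavior the business signed off on.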
8.2 History Based Reference Tables
History based reference tables are reference data with a requirement or a business need to store the history of descriptions. In other words, we or the business want to track what the description was last year, last month, and so on. The history may become important for certain reference data, especially if the reference data relates to financial reports. Particularly when old reports are reprinted in the future, sometimes the business or the auditor wants to see what the code and description were at a particular point in time.
In this case, I would strongly urge you to create Hubs, Links, and Satellites to house the historical reference data. However, I would discourage you from using SEQUENCE numbers in this situation. Natural keys tend to be much more consistent over time (in the case of reference data), and typically it's the natural keys which appear in the rest of the raw Data Vault model (EDW Model), particularly in the Satellites.
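A history based reference structure keyed on the natural code can be sketched like this, again with Python's sqlite3. The code value, description text, and dates below are invented examples (loosely inspired by the ICD codes mentioned earlier), not the contents of Figure 8-2.

```python
import sqlite3

# Sketch: history-based reference data as a Hub plus Satellite, keyed by
# the natural code rather than a sequence number. All names and data
# values are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_ref_code (code TEXT PRIMARY KEY, load_date TEXT);
CREATE TABLE sat_ref_code (code TEXT, load_date TEXT, description TEXT,
                           PRIMARY KEY (code, load_date));
""")
con.execute("INSERT INTO hub_ref_code VALUES ('ICD-250', '2010-01-01')")
con.executemany("INSERT INTO sat_ref_code VALUES (?,?,?)",
                [("ICD-250", "2010-01-01", "Diabetes mellitus"),
                 ("ICD-250", "2011-01-01", "Diabetes mellitus (revised)")])

# Point-in-time lookup for a reprinted report: the description in force
# on a given date is the Satellite row with the latest load date at or
# before that date.
desc = con.execute("""
    SELECT description FROM sat_ref_code
    WHERE code = 'ICD-250' AND load_date <= '2010-06-30'
    ORDER BY load_date DESC LIMIT 1
""").fetchone()[0]
print(desc)  # Diabetes mellitus
```

Because the Satellite is keyed on the natural code, the raw Data Vault Satellites that carry this code can be resolved against it directly, with no sequence-number translation step in between.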
Adding sequence numbers to the history based reference tables usually adds no value since the codes tend to be static (i.e., ever see the abbreviation for the state of California change?). On the other hand, if you have a valid reason to do so, then don't be shy. Document the reason, and proceed to use the sequences all across your model. An example of a history based reference table is shown in Figure 8-2:
8.3 Code and Descriptions
Codes and descriptions are commonly found in reference data. If you have a lot of codes to model, take the most efficient route, that is: one that makes logical sense. I like to group many of the similar codes together into a single master code table. In these cases, I have to also assign a unique group code to help make the underlying code unique. Often times the group code is a made-up or manufactured column (hard coded data in the ETL routine).
Because the group-code is system generated, and it has no formal business meaning outside of the EDW (generally), I usually try to keep the group code inside the EDW for joining and uniqueness reasons only. The example in Figure 8-4 is made-up data, but shows how you can apply a master code or a group code to use a single structure and house all your information.
9.0 Conclusions
The Data Vault Model and methodology are highly versatile when the standards and rules are followed. It's when you break the rules and standards that you can get into trouble, and I hope I've shown you enough insight to see how to apply the appropriate and proper design. It's by following the rules and standards that you can take advantage of the years of research and design I've put into this, allowing you to overcome and avoid the potential pitfalls and project issues.

I would like nothing more than to help you succeed, and to hear from you about your concerns, questions, or comments. I'm always interested in hearing about customer successes as well as the challenges you face in your day-to-day job.

If you become a Data Vault fan along the way, feel free to let me know!
Sincerely,
Dan Linstedt
INDEX
adaptability, 38, 73, 76, 83
Architectural. See Architecture
Architecture, 9, 139
Basic Terminology, 10
Business Key, 4, 7, 11, 55, 57, 64, 72, 73
Business Keys, 4, 27, 58, 59, 61, 63, 71
Data Vault, 2, 3, 7, 8, 10, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 51, 52, 53, 56, 61,
62, 63, 65, 66, 67, 68, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 87,
88, 90, 91, 92, 93, 97, 98, 99, 100, 101,
102, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 118, 119, 121,
122, 123, 124, 126, 127, 129, 137, 138,
139, 144, 145, 146, 149
Data Vault Modeling. See Data Vault
EDW, 3, 7, 33, 36, 37, 38, 39, 41, 45, 46, 48,
49, 52, 58, 76, 78, 81, 82, 83, 120, 145,
146, 148
Flexibility, 3, 4, 7, 21, 22, 76, 78
HUB, 18, 44, 50, 51, 57, 64, 140
Hubs. See Hub