Chapter 3 Data Warehousing and OLAP Technology

Focus
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
3.1 What is a Data Warehouse?
Defined in many different ways, but not rigorously
o A decision support database that is maintained separately from the organization's operational database
o Supports information processing by providing a solid platform of consolidated, historical data for analysis
We use: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
o The process of constructing and using data warehouses
The 4 main features:
1. Subject-Oriented
o Organized around major subjects, such as customer, product, sales
o Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
o Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
2. Integrated
o Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
o Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted
3. Time-Variant
o The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
o Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a "time element"
4. Non-Volatile
o A physically separate store of data transformed from the operational environment
o Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data
3.1.1 Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
o Major task of data warehouse system
o Data analysis and decision making
Distinct features (OLTP vs. OLAP):
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries
3.1.2 Why Separate Data Warehouse?
High performance for both systems
o DBMS (tuned for OLTP): access methods, indexing, concurrency control, recovery
o Warehouse (tuned for OLAP): complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
o missing data: decision support requires historical data which operational DBs do not typically maintain
o data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
o data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Data Modeling for a warehouse
3.2 A Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube is typically organized around a central theme, such as sales, stored in a fact (numeric measures) table, which allows data to be modeled and viewed in multiple dimensions
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Table3-3.xls: Multidimensional fact tables in two-dimensional format
Figure 3-1: 3D cube
Figure 3-2: 4D cube
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year), provide additional information about the dimensions
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
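As a minimal sketch of the lattice of cuboids, the Python below enumerates every cuboid (group-by) for a small set of dimensions. The dimension names and the `cuboids` helper are illustrative assumptions, and concept hierarchies are ignored here.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of the dimensions, from the 0-D apex to the n-D base cuboid."""
    result = []
    for k in range(len(dimensions) + 1):
        for combo in combinations(dimensions, k):
            result.append(combo)
    return result

# Hypothetical dimensions, chosen only for illustration.
lattice = cuboids(["time", "item", "location"])
apex = lattice[0]    # the 0-D cuboid: highest level of summarization
base = lattice[-1]   # the 3-D base cuboid: no summarization
for c in lattice:
    print(f"{len(c)}-D cuboid: {c}")
```

With n dimensions and no hierarchies the lattice holds 2^n cuboids, so three dimensions give eight group-bys.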
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
o Star schema: a fact table in the middle connected to a set of dimension tables
o Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
3.2.3 Examples for defining Star, Snowflake, and Fact Constellation
A Data Mining Query Language, DMQL: Language Primitives
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as "cube definition"
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Example 3.4: Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
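One possible relational rendering of the sales_star definition is sketched below with Python's built-in sqlite3 module. The table and column names follow the DMQL example; the SQL column types, the foreign-key layout of the fact table, and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

# In-memory database; column types are guesses, not part of the DMQL example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the raw numeric fact.
CREATE TABLE sales (
    time_key         INTEGER REFERENCES time(time_key),
    item_key         INTEGER REFERENCES item(item_key),
    branch_key       INTEGER REFERENCES branch(branch_key),
    location_key     INTEGER REFERENCES location(location_key),
    sales_in_dollars REAL
);
""")

tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(sales)")]
print(tables)
print(fact_columns)
```

The measures dollars_sold, avg_sales and units_sold are not stored as columns in this sketch; they would be computed by aggregate queries over sales_in_dollars.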
3.2.4 Measures: Their Categorization and Computation
How is a multidimensional point in a data cube space defined?

How is a multidimensional table stored in a 2-D DB?


Data cube measure:
o A numeric function that can be evaluated at each point in the data cube space
o A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point
Example 3-7: translating Example 3.4 into SQL
Select s.time_unit, s.item_key, s.branch_key, s.location_key, Sum(dollars_sold)
From time t, item i, branch b, location l, sales s
Where s.time_key = t.time_key AND s.item_key = i.item_key AND …
Group By s.time_key, s.item_key, …
This SQL selects all sales data from the fact table and sums up the facts based on the time_unit (e.g., data may be stored with a time_key of days while the time_unit here could be quarter; more about this later)
We can save this result in a cube permanently
Since this cube has ALL the dimensions, it is called the base fact table
We can also create a cube of fewer dimensions, e.g., display sales from ALL branches based on time (quarter), item, and location.
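The reduced cube described above (sales from ALL branches by quarter, item, and location) can be sketched with Python's built-in sqlite3 module. The tiny tables and their values are invented for illustration; only the column names come from the notes, and the time table is cut down to the two columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time  (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                    location_key INTEGER, sales_in_dollars REAL);
INSERT INTO time VALUES (1, 'Q1'), (2, 'Q1'), (3, 'Q2');
INSERT INTO sales VALUES
    (1, 10, 100, 1000, 5.0),
    (2, 10, 101, 1000, 7.0),
    (3, 10, 100, 1000, 4.0),
    (3, 11, 100, 1000, 9.0);
""")

# branch_key is left out of the GROUP BY, so sales are summed over ALL branches;
# joining to the time table lifts daily time keys to the quarter level.
rows = conn.execute("""
    SELECT t.quarter, s.item_key, s.location_key, SUM(s.sales_in_dollars)
    FROM sales s JOIN time t ON s.time_key = t.time_key
    GROUP BY t.quarter, s.item_key, s.location_key
    ORDER BY t.quarter, s.item_key, s.location_key
""").fetchall()
for row in rows:
    print(row)
```

The two Q1 rows for item 10 come from different branches and collapse into a single cell of the smaller cube.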

Three categories of measure:
distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
o E.g., count(), sum(), min(), max()
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
o E.g., avg(), min_N(), standard_deviation()
holistic: if there is no constant bound on the storage size needed to describe a subaggregate
o E.g., median(), mode(), rank()
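The three categories can be illustrated with a small Python sketch. The partitioned data set is hypothetical, and the median-of-medians line is there only to show what goes wrong when a holistic measure is treated as if it were distributive.

```python
from statistics import median

partitions = [[1, 2, 3], [10, 20], [100]]   # hypothetical partitioning of one data set
all_data = [x for part in partitions for x in part]

# Distributive: the sum of per-partition sums equals the sum over all the data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg() is derived from two distributive results, sum() and count().
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the true median in general.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_data)

print(total, average, median_of_medians, true_median)
```

Here the combined partition medians give 15.0 while the true median is 6.5, which is why a holistic measure cannot be assembled from a bounded set of subaggregates.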
3.2.5 Concept Hierarchy
A concept hierarchy can define the relationship between attributes or values of one attribute
Figure 3.7
Specification of hierarchies
o Schema hierarchy
  Either
  Total order: true hierarchy
  Ex., location
  Partial order: a lattice
  day < {month < quarter; week} < year
o Set-grouping hierarchy: defined by discretization or grouping values for a given attribute
  {1..10} < inexpensive
Figure 3-9
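The two kinds of hierarchy can be sketched in Python as lookup tables and a discretization function. The city data, the level names, and the price bounds are assumptions for illustration and are not taken from the figures.

```python
# Schema hierarchy (total order) for a location dimension; the city data is hypothetical.
CITY_TO_PROVINCE = {"Vancouver": "British Columbia", "Toronto": "Ontario"}
PROVINCE_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada"}

def roll_up_city(city, level):
    """Map a city to its ancestor at the requested level of the hierarchy."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province_or_state":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")

def price_group(price):
    """Set-grouping hierarchy: discretize a price into a named range (bounds are assumed)."""
    if 1 <= price <= 10:
        return "inexpensive"
    return "other"

print(roll_up_city("Vancouver", "country"))
print(price_group(7))
```

A schema hierarchy follows the attributes of the dimension table, while a set-grouping hierarchy is defined directly on the values of one attribute.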
3.2.6 Typical OLAP Operations in the Multidimensional Data Model
See Figure 3.10
Roll up (drill-up): summarize data
o by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
o from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
o project and select
Pivot (rotate):
o reorient the cube, visualization, 3D to series of 2D planes
Other operations
o drill across: involving (across) more than one fact table
o drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
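The operations listed above can be sketched on a tiny cube held in a Python dictionary. The cube contents, the city-to-country mapping, and all function names are hypothetical; a real OLAP server would not store a cube this way.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, item, city) -> dollars_sold.
cube = {
    ("Q1", "phone", "Vancouver"): 100,
    ("Q1", "phone", "Toronto"): 150,
    ("Q1", "tv", "Vancouver"): 200,
    ("Q2", "phone", "Vancouver"): 120,
    ("Q2", "tv", "Toronto"): 80,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_by_dropping(cube, axis):
    """Roll up by dimension reduction: sum out the dimension at position `axis`."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

def roll_up_city_to_country(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, item, city), value in cube.items():
        out[(quarter, item, CITY_TO_COUNTRY[city])] += value
    return dict(out)

def slice_(cube, axis, wanted):
    """Slice: select one value on one dimension and drop that dimension."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == wanted}

def dice(cube, quarters, items, cities):
    """Dice: select on several dimensions at once, keeping all dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in quarters and k[1] in items and k[2] in cities}

def pivot(cube2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in cube2d.items()}

by_quarter_item = roll_up_by_dropping(cube, 2)
by_country = roll_up_city_to_country(cube)
q1 = slice_(cube, 0, "Q1")
diced = dice(cube, {"Q1"}, {"phone"}, {"Vancouver", "Toronto"})
rotated = pivot(q1)
```

Drill-down is the reverse direction: it needs the more detailed cuboid to be available, since detail cannot be recovered from a summary.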
3.2.7 A Star-Net Query Model
Querying of multidimensional databases can be based on a starnet model
A starnet model consists of radial lines emanating from a center point, where each line represents a concept hierarchy for a dimension
3.3 Data Warehouse Architecture
3.3.1 Steps for design and construction of data warehouses
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view
o allows selection of the relevant information necessary for the data warehouse
Data source view
o exposes the information being captured, stored, and managed by operational systems
Data warehouse view
o consists of fact tables and dimension tables
Business query view
o sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process, many approaches:
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, short turnaround time, quick turnaround
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Ex., individual sales, or sales during an entire day for an item at a branch for a particular location
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
3.3.2
See also Figure 3-12 (page 131)
Three Data Warehouse Models
Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
Data Mart
o a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
o Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
3.3.4 Metadata Repository
Metadata is the data defining warehouse objects. It has the following kinds:
Description of the structure of the warehouse
  schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational metadata
  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
  warehouse schema, view and derived data definitions
Business data
  business terms and definitions, ownership of data, charging policies
Data Warehouse Back-End Tools and Utilities
Data extraction:
  get data from multiple, heterogeneous, and external sources
Data cleaning:
  detect errors in the data and rectify them when possible
Data transformation:
  convert data from legacy or host format to warehouse format
Load:
  sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh:
  propagate the updates from the data sources to the warehouse
3.3.5 Types of OLAP Server
Relational OLAP (ROLAP)
o Use relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
o Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
o Greater scalability
Multidimensional OLAP (MOLAP)
o Array-based multidimensional storage engine (sparse matrix techniques)
o Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
o User flexibility, e.g., low level: relational, high level: array
Specialized SQL servers (e.g., Redbricks)
o Specialized support for SQL queries over star/snowflake schemas
3.4 Data Warehouse Implementation
3.4.1 Efficient Data Cube Computation
Example 3.11: Assume the data cube for AllElectronics sales contains 3 dimensions: item, time (quarter), location. What is the total number of cuboids (or group-by's) that can be computed for the data cube?
Sample queries:
"Compute the sum of sales, group by item"
"Compute the sum of sales, group by item and time"
…
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
$T = \prod_{i=1}^{n} (L_i + 1)$
where L_i is the number of levels associated with dimension i
Ex., if a cube has 10 dimensions, with 4 levels in each dimension, the total number is 5^10, around 9.8 x 10^6
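The count of cuboids, the product of (L_i + 1) over all n dimensions, can be checked with a few lines of Python. The function name is an illustrative assumption; the extra 1 per dimension stands for the virtual top level "all".

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Product over dimensions of (L_i + 1), where L_i is the number of levels of dimension i."""
    return prod(levels + 1 for levels in levels_per_dimension)

print(total_cuboids([4] * 10))    # 10 dimensions with 4 levels each: 5**10 = 9765625
print(total_cuboids([1, 1, 1]))   # 3 dimensions without hierarchies: 2**3 = 8
```

The second call matches the three-dimension case of item, time and location when each dimension has a single level.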
Materialization (precomputation) of data cube
Materialize
o every cuboid (full materialization),
o none (no materialization), or
o some (partial materialization)
Selection of which cuboids to materialize
o Based on size, sharing, access frequency, etc.
Efficient cube computation methods
ROLAP-based cubing algorithms (Agarwal et al.'96)
Array-based cubing algorithm (Zhao et al.'97)
Bottom-up computation method (Beyer & Ramakrishnan'99)
ROLAP-based cubing algorithms
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
Grouping is performed on some subaggregates as a "partial grouping step"
Aggregates may be computed from previously computed aggregates, rather than from the base fact table
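The ROLAP ideas above (sort to cluster related tuples, then reuse an aggregate that has already been computed) can be sketched as follows. The fact rows are hypothetical, and this is not the algorithm of any of the cited papers, only an illustration of the two steps.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical base fact rows: (item, quarter, dollars_sold).
facts = [("tv", "Q2", 80), ("phone", "Q1", 100), ("tv", "Q1", 200),
         ("phone", "Q1", 150), ("phone", "Q2", 120)]

# Sort so that tuples with the same (item, quarter) become adjacent, then group them.
facts_sorted = sorted(facts, key=lambda row: (row[0], row[1]))
by_item_quarter = {
    key: sum(row[2] for row in rows)
    for key, rows in groupby(facts_sorted, key=lambda row: (row[0], row[1]))
}

# Reuse the (item, quarter) aggregate to get the (item) aggregate,
# instead of scanning the base fact rows again.
by_item = defaultdict(int)
for (item, _quarter), subtotal in by_item_quarter.items():
    by_item[item] += subtotal
by_item = dict(by_item)

print(by_item_quarter)
print(by_item)
```

Deriving the coarser group-by from the finer one works here because sum() is a distributive measure.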
Multi-way Array Aggregation for Cube Computation
Partition arrays into chunks (a chunk is a small subcube which fits in memory)
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost
What is the best traversing order to do multi-way aggregation?
Method: the planes should be sorted and computed according to their size in ascending order (Figure 3.16)
See the details of Example 3.12 (pp. 75-78)
Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it computes well only for a small number of dimensions
If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored
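A simplified sketch of the chunked, simultaneous aggregation is given below. The array sizes, chunk sizes, and cell values are invented, and the code does not model the memory-minimizing traversal order: it only shows that one pass over the chunks can feed all three 2-D planes while visiting each cell once.

```python
from itertools import product

# Hypothetical dense 3-D array with dimensions A (4), B (4), C (2), split into 2x2x1 chunks.
SIZE = (4, 4, 2)
CHUNK = (2, 2, 1)
cells = {(a, b, c): a + 10 * b + 100 * c for a, b, c in product(*map(range, SIZE))}

ab = {}   # plane AB: aggregated over C
ac = {}   # plane AC: aggregated over B
bc = {}   # plane BC: aggregated over A
visits = 0

# Visit chunk by chunk; every cell is read once and contributes to all three planes.
chunk_starts = product(*(range(0, size, step) for size, step in zip(SIZE, CHUNK)))
for a0, b0, c0 in chunk_starts:
    for a, b, c in product(range(a0, a0 + CHUNK[0]),
                           range(b0, b0 + CHUNK[1]),
                           range(c0, c0 + CHUNK[2])):
        value = cells[(a, b, c)]
        visits += 1
        ab[(a, b)] = ab.get((a, b), 0) + value
        ac[(a, c)] = ac.get((a, c), 0) + value
        bc[(b, c)] = bc.get((b, c), 0) + value

print(visits, len(ab), len(ac), len(bc))
```

In this sketch all three planes stay in memory for the whole pass; the method in the notes orders the chunk scan so that only part of the larger planes has to be held at any one time.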
3.4.2 Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high cardinality domains
Example: Figure 3-15 to 3-17
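A bitmap index can be sketched in Python by using one integer per distinct value as its bit vector. The column contents and helper names below are hypothetical.

```python
def build_bitmap_index(column):
    """Return {value: bit vector}, with bit i set when row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_of(bitmap):
    """Decode a bit vector into the list of row ids whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical base table columns.
region = ["Asia", "Europe", "Asia", "America", "Europe", "Asia"]
rtype = ["Retail", "Dealer", "Dealer", "Retail", "Dealer", "Retail"]

region_idx = build_bitmap_index(region)
type_idx = build_bitmap_index(rtype)

# A conjunctive selection becomes a single bitwise AND of two bit vectors.
asia_retail = region_idx["Asia"] & type_idx["Retail"]
print(rows_of(asia_retail))
```

Each distinct value needs its own vector as long as the table, which is why the approach suits low-cardinality columns.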
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
Traditional indices map the values to a list of record ids
o It materializes relational join in JI file and speeds up relational join, a rather costly operation
In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table
o E.g., fact table: Sales and two dimensions city and product
o A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
o Join indices can span multiple dimensions
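The city example can be sketched in Python as a mapping from dimension values to fact-table row ids. The Sales rows and the `build_join_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table Sales: row id -> (city, product, dollars_sold).
sales = {
    0: ("Vancouver", "tv", 200),
    1: ("Toronto", "phone", 150),
    2: ("Vancouver", "phone", 100),
    3: ("Toronto", "tv", 80),
}

def build_join_index(fact_rows, *positions):
    """Map each distinct dimension value (or combination) to the fact row ids that hold it."""
    index = defaultdict(list)
    for row_id, row in sorted(fact_rows.items()):
        index[tuple(row[p] for p in positions)].append(row_id)
    return dict(index)

city_index = build_join_index(sales, 0)              # join index on city
city_product_index = build_join_index(sales, 0, 1)   # join index spanning two dimensions

# The index gives the matching fact rows directly, without scanning the whole table.
vancouver_total = sum(sales[r][2] for r in city_index[("Vancouver",)])
print(city_index)
print(vancouver_total)
```

The lookup replaces the join between the dimension and the fact table with a precomputed list of row ids.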
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied
Exploring indexing structures and compressed vs. dense array structures in MOLAP
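Choosing which materialized cuboid should answer a query can be sketched as picking the smallest cuboid whose dimensions cover the query. The cuboid sizes and the `pick_cuboid` function are assumptions; the sketch ignores concept-hierarchy levels and index availability, which a real optimizer would also weigh.

```python
def pick_cuboid(query_dims, materialized):
    """Return the smallest materialized cuboid whose dimensions cover the query.

    `materialized` maps a frozenset of dimension names to an estimated row count.
    Returns None when no materialized cuboid can answer the query.
    """
    candidates = [(size, dims) for dims, size in materialized.items()
                  if set(query_dims) <= dims]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]

# Hypothetical materialized cuboids with assumed sizes.
materialized = {
    frozenset({"item", "time", "location"}): 1_000_000,
    frozenset({"item", "time"}): 50_000,
    frozenset({"item", "location"}): 20_000,
}

choice = pick_cuboid({"item"}, materialized)
fallback = pick_cuboid({"time", "location"}, materialized)
missing = pick_cuboid({"branch"}, materialized)
print(sorted(choice), sorted(fallback), missing)
```

A query on item alone can be answered from either 2-D cuboid, and the smaller one is preferred; a query on time and location has to fall back to the base cuboid.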
3.5 From Data Warehousing to Data Mining
3.5.1 Data Warehouse Usage
Three kinds of data warehouse applications
o Information processing
  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
o Analytical processing
  multidimensional analysis of data warehouse data
  supports basic OLAP operations, slice-dice, drilling, pivoting
o Data mining
  knowledge discovery from hidden patterns
  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
3.5.2 From OLAP to OLAM
Why online analytical mining?
High quality of data in data warehouses
o DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
o ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
o Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
o Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM Architecture (see also Figure 3.18)
