Focus
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
2.1 What is a Data Warehouse?
Defined in many different ways, but not rigorously
o A decision support database that is maintained separately from the organization's operational database
o Supports information processing by providing a solid platform of consolidated, historical data for analysis
We use: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
o The process of constructing and using data warehouses
The 4 main features:
1. Subject-Oriented
o Organized around major subjects, such as customer, product, sales
o Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
o Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
2. Integrated
o Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
o Data cleaning and data integration techniques are applied
    Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    E.g., hotel price: currency, tax, breakfast covered, etc.
    When data is moved to the warehouse, it is converted
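
The hotel-price case above can be sketched as a small conversion step. This is only an illustration: the exchange rates, tax rate, breakfast charge, and the names `HotelPrice` and `to_warehouse_price` are hypothetical and do not come from the notes. The warehouse convention assumed here is "USD, tax included, breakfast excluded".

```python
from dataclasses import dataclass

# Hypothetical constants, chosen only for illustration.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}
TAX_RATE = 0.10
BREAKFAST_USD = 15.0


@dataclass
class HotelPrice:
    """A price as one source system reports it."""
    amount: float
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to the assumed warehouse convention:
    USD, tax included, breakfast excluded."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + TAX_RATE      # add the missing tax
    if p.breakfast_included:
        usd -= BREAKFAST_USD     # strip the bundled breakfast
    return round(usd, 2)


source_a = HotelPrice(100.0, "USD", tax_included=False, breakfast_included=False)
source_b = HotelPrice(100.0, "EUR", tax_included=True, breakfast_included=True)
converted = [to_warehouse_price(source_a), to_warehouse_price(source_b)]
```

Once every source passes through one such conversion, prices from different systems become directly comparable in the warehouse.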
3. Time-Variant
o The time horizon for the data warehouse is significantly longer than that of operational systems
    Operational database: current value data
    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
o Every key structure in the data warehouse
    Contains an element of time, explicitly or implicitly
    But the key of operational data may or may not contain a "time element"
4. Non-Volatile
o A physically separate store of data transformed from the operational environment
o Operational update of data does not occur in the data warehouse environment
    Does not require transaction processing, recovery, and concurrency control mechanisms
    Requires only two operations in data accessing: initial loading of data and access of data
2.1.1 Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
o Major task of data warehouse system
o Data analysis and decision making
Distinct features (OLTP vs. OLAP):
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries
2.1.2 Why Separate Data Warehouse?
High performance for both systems
o DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
o Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
Different functions and different data:
o Missing data: decision support requires historical data which operational DBs do not typically maintain
o Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
o Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Data Modeling for a Warehouse
2.2 A Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube is typically organized around a central theme, such as sales, stored in a fact (numeric measures) table, which allows data to be modeled and viewed in multiple dimensions
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Table2-2.xls: Multidimensional fact tables in two-dimensional format
Figure 2-1: 3D cube
Figure 2-2: 4D cube
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year), provide additional information about the dimensions
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
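
The lattice of cuboids can be enumerated with a few lines of Python. This is a minimal sketch that assumes each dimension has a single level (no concept hierarchies), so every cuboid is simply a subset of the dimensions; the function name `cuboids` and the three dimension names are chosen for illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every cuboid (group-by) of the cube, ordered from the
    0-D apex cuboid up to the n-D base cuboid."""
    dims = list(dimensions)
    return [combo
            for k in range(len(dims) + 1)
            for combo in combinations(dims, k)]


lattice = cuboids(["time", "item", "location"])
apex, base = lattice[0], lattice[-1]

for cuboid in lattice:
    print(len(cuboid), "-D cuboid:", cuboid)
```

With n single-level dimensions the lattice contains 2^n cuboids, so the three dimensions here yield eight.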
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
o Star schema: A fact table in the middle connected to a set of dimension tables
o Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
2.2.3 Examples for Defining Star, Snowflake, and Fact Constellation
A Data Mining Query Language, DMQL: Language Primitives
Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as "cube definition"
    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Example 2.4: Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars),
        avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
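
For readers without a DMQL engine, the same star layout can be approximated in SQL. The sketch below uses Python's built-in `sqlite3` module and is an assumption-laden illustration, not a translation of DMQL: it keeps only the `item` and `location` dimensions for brevity, the fact table name `sales` and all sample rows are made up, and the three measures are computed per city with ordinary SQL aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (time and branch omitted for brevity).
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw measure.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    sales_in_dollars REAL);
""")

# Made-up sample data.
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "TV", "Acme", "electronics", "wholesale"),
    (2, "Radio", "Acme", "electronics", "retail"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Vancouver", "BC", "Canada"),
    (2, "2 King St", "Toronto", "ON", "Canada"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 1, 400.0), (2, 1, 100.0), (1, 2, 300.0),
])

# The three measures of sales_star, grouped by one dimension attribute.
rows = conn.execute("""
    SELECT l.city,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Each fact row joins to exactly one row per dimension table, which is what gives the schema its star shape.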
2.2.4 Measures: Their Categorization and Computation
How is a multidimensional point in a data cube space defined?
Grouping is performed on some subaggregates as a "partial grouping step"
Aggregates may be computed from previously computed aggregates, rather than from the base fact table
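
Reusing a previously computed aggregate can be shown on a toy example. The sketch assumes a sum measure (sums can be combined from partial sums; an average cannot be rolled up this way without also carrying counts). The helper `aggregate` and the sample rows are hypothetical.

```python
from collections import defaultdict


def aggregate(rows, keep):
    """Sum the measure over (key, value) rows, grouping by the key
    positions listed in `keep`."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)


# Base fact rows keyed by (time, item, location); made-up data.
base = [
    (("Q1", "TV", "Vancouver"), 400.0),
    (("Q1", "TV", "Toronto"), 300.0),
    (("Q1", "Radio", "Vancouver"), 100.0),
    (("Q2", "TV", "Vancouver"), 250.0),
]

# First compute the (time, item) cuboid from the base table ...
by_time_item = aggregate(base, keep=(0, 1))
# ... then derive the (time) cuboid from that smaller cuboid.
by_time_from_cuboid = aggregate(by_time_item.items(), keep=(0,))
# Computing it from the base table gives the same totals.
by_time_from_base = aggregate(base, keep=(0,))
```

The second aggregation scans three rows of `by_time_item` instead of four base rows; on real fact tables the gap is far larger.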
Multi-way Array Aggregation for Cube Computation
Partition arrays into chunks (a chunk is a small subcube which fits in memory)
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost
What is the best traversing order to do multi-way aggregation?
Method: the planes should be sorted and computed according to their size in ascending order (Figure 2.1Q)
See the details of Example 2.12 (pp. J5-JU)
Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it works well only for a small number of dimensions
If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored
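
The core of multi-way aggregation, updating several group-bys during a single scan of the chunks, can be sketched as follows. This is a simplified illustration under stated assumptions: the cube is a small dense 3-D nested list, all three 2-D planes are kept fully in memory, and the memory savings from choosing a good chunk order are not modeled. The function name `multiway_aggregate` is made up.

```python
from itertools import product


def multiway_aggregate(cube, chunk):
    """Scan a dense 3-D array chunk by chunk, visiting each cell once
    and updating the AB, AC and BC planes at the same time."""
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * nb for _ in range(na)]   # aggregated over C
    ac = [[0] * nc for _ in range(na)]   # aggregated over B
    bc = [[0] * nc for _ in range(nb)]   # aggregated over A
    chunk_starts = product(range(0, na, chunk),
                           range(0, nb, chunk),
                           range(0, nc, chunk))
    for a0, b0, c0 in chunk_starts:
        for a in range(a0, min(a0 + chunk, na)):
            for b in range(b0, min(b0 + chunk, nb)):
                for c in range(c0, min(c0 + chunk, nc)):
                    v = cube[a][b][c]
                    ab[a][b] += v
                    ac[a][c] += v
                    bc[b][c] += v
    return ab, ac, bc


# A 2 x 2 x 2 cube whose cells hold the values 0..7.
cube = [[[a * 4 + b * 2 + c for c in range(2)] for b in range(2)]
        for a in range(2)]
ab, ac, bc = multiway_aggregate(cube, chunk=1)
```

A naive approach would scan the cube three times, once per plane; here every cell contributes to all three planes on its single visit.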
2.4.2 Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit operations are fast
The length of the bit vector: number of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains
Example: Figures 2-15 to 2-1J
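
A bitmap index is easy to prototype with Python integers acting as bit vectors. The sketch below is illustrative only: the helper names and the sample columns are made up, and bit i of each vector corresponds to row i of the base table.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose i-th bit is set when
    row i of the base table holds that value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def rows_of(bitmap):
    """Decode a bit vector back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two columns of the same base table (made-up data).
city = ["Vancouver", "Toronto", "Vancouver", "Montreal"]
item = ["TV", "TV", "Radio", "TV"]

city_idx = build_bitmap_index(city)
item_idx = build_bitmap_index(item)

# "city = Vancouver AND item = TV" becomes one bitwise AND.
hits = rows_of(city_idx["Vancouver"] & item_idx["TV"])
```

One bit vector is stored per distinct value, which is why the structure becomes wasteful when a column has many distinct values.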
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
Traditional indices map the values to a list of record ids
o A join index materializes the relational join in a JI file and speeds up the relational join, a rather costly operation
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
o E.g., fact table: Sales and two dimensions city and product
o A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
o Join indices can span multiple dimensions
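
The Sales example can be sketched in a few lines. The assumptions here: fact rows are plain tuples, an R-ID is simply the row's position in the list, and `build_join_index` is a hypothetical helper rather than any library's API.

```python
from collections import defaultdict


def build_join_index(fact_rows, key):
    """Map each dimension value (or combination of values) to the list
    of R-IDs of the fact rows that carry it."""
    index = defaultdict(list)
    for rid, row in enumerate(fact_rows):
        index[key(row)].append(rid)
    return dict(index)


# Fact table Sales: (city, product, dollars_sold); made-up rows.
sales = [
    ("Vancouver", "TV", 400.0),
    ("Toronto", "TV", 300.0),
    ("Vancouver", "Radio", 100.0),
]

# Join index on a single dimension.
city_ji = build_join_index(sales, key=lambda r: r[0])
# Join index spanning two dimensions.
city_product_ji = build_join_index(sales, key=lambda r: (r[0], r[1]))

# Use the index to reach the matching fact rows without scanning.
vancouver_total = sum(sales[rid][2] for rid in city_ji["Vancouver"])
```

Because the R-ID lists are precomputed, a query on a city touches only the matching fact rows instead of performing the join at query time.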
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied
Exploring indexing structures and compressed vs. dense array structures in MOLAP
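
The rewrite "dice = selection + projection" can be illustrated on a cuboid held as a list of dictionaries. This is a sketch under stated assumptions: the function `dice`, its parameters, and the sample cells are invented for illustration and are not an OLAP server's interface.

```python
def dice(rows, conditions, keep):
    """Dice as selection followed by projection.

    conditions: maps a dimension name to the set of allowed values.
    keep: the columns to retain in the result.
    """
    # Selection: keep rows that satisfy every dimension condition.
    selected = [r for r in rows
                if all(r[d] in allowed for d, allowed in conditions.items())]
    # Projection: keep only the requested columns.
    return [{col: r[col] for col in keep} for r in selected]


# Cells of a (time, item, location) cuboid; made-up data.
cells = [
    {"time": "Q1", "item": "TV", "location": "Vancouver", "dollars_sold": 400.0},
    {"time": "Q1", "item": "Radio", "location": "Toronto", "dollars_sold": 100.0},
    {"time": "Q3", "item": "TV", "location": "Toronto", "dollars_sold": 300.0},
]

result = dice(
    cells,
    conditions={"time": {"Q1", "Q2"}, "location": {"Toronto", "Vancouver"}},
    keep=("item", "dollars_sold"),
)
```

In SQL terms the conditions correspond to a WHERE clause and `keep` to the SELECT list.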
2.5 From Data Warehousing to Data Mining
2.5.1 Data Warehouse Usage
Three kinds of data warehouse applications
o Information processing
    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
o Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations, slice-dice, drilling, pivoting
o Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
2.5.2 From OLAP to OLAM
Why online analytical mining?
High quality of data in data warehouses
o DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
o ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
o Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
o Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM Architecture (see also figure 2.1U)