
DATA WAREHOUSING AND DATA MINING NOTES UNIT I

Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data
analysis as a data warehouse. Table design, dimensions and organization should be consistent
throughout a data warehouse so that reports or queries across the data warehouse are consistent.
A data warehouse can also be viewed as a database for historical data from different functions
within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the
following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision making process". He defined the terms in
the sentence as follows:
Subject Oriented: Data that gives information about a particular subject instead of about a
company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged
into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
Benefits of data warehousing
Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases primarily designed to handle lots of transactions.
Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.
Queries that would be complex in very normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non-uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from lots of users.
Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.

Operational and informational Data

Operational Data:
Focusing on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data

Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non updateable

Data Warehouse Architecture

Data warehouse architecture and its seven components
1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system

A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data.
The central information repository is surrounded by a number of key components designed to make
the environment functional, manageable and accessible.
The data source for the data warehouse is the operational applications. The data entered
into the data warehouse is transformed into an integrated structure and format. The transformation
process involves conversion, summarization, filtering and condensation. The data warehouse
must be capable of holding and managing large volumes of data as well as different data
structures over time.
1 Data warehouse database
This is the central part of the data warehousing environment. This is item number 2 in the
above architecture diagram. It is implemented based on RDBMS technology.

2 Sourcing, Acquisition, Cleanup, and Transformation Tools
This is item number 1 in the above architecture diagram. These tools perform conversions,
summarization, key changes, structural changes and condensation. The data transformation is
required so that the information can be used by decision support tools. The transformation
produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code, etc.,
to move the data into the data warehouse from multiple operational systems.
The functionalities of these tools are listed below:
Removing unwanted data from operational databases
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
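The functions listed above can be illustrated with a minimal sketch. The record layout, the field names (cust_nm, amt, region, debug_flag), the name mapping and the default value are all hypothetical and chosen only for illustration; real transformation tools generate far more elaborate programs.

```python
from collections import defaultdict

# Hypothetical operational records; field names are assumptions for this sketch.
source_rows = [
    {"cust_nm": "Alice", "amt": 120.0, "region": "North", "debug_flag": 1},
    {"cust_nm": "Bob", "amt": 80.0, "region": None, "debug_flag": 0},
    {"cust_nm": "Alice", "amt": 50.0, "region": "North", "debug_flag": 0},
]

# Assumed mapping from source field names to common warehouse names.
# Fields that are not listed here (e.g. debug_flag) are treated as unwanted data.
NAME_MAP = {"cust_nm": "customer_name", "amt": "sale_amount", "region": "region"}

# Assumed defaults for missing data, keyed by warehouse field name.
DEFAULTS = {"region": "UNKNOWN"}


def transform(row):
    """Drop unwanted fields, rename to common names, and fill in defaults."""
    out = {}
    for source_name, target_name in NAME_MAP.items():
        value = row.get(source_name)
        if value is None:
            value = DEFAULTS.get(target_name)
        out[target_name] = value
    return out


def summarize(rows):
    """Calculate a derived summary: total sale amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_name"]] += row["sale_amount"]
    return dict(totals)


clean_rows = [transform(row) for row in source_rows]
totals_by_customer = summarize(clean_rows)
```

Keeping the name mapping and the defaults in data rather than in code is one way to accommodate source data definition changes: when a source field is renamed, only the mapping needs to be updated.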
Issues to be considered during data sourcing, cleanup, extraction and transformation:
Database heterogeneity: It refers to the different nature of the DBMSs involved. They may use
different data models, different access languages, and different data navigation methods,
operations, concurrency, integrity and recovery processes, etc.
Data heterogeneity: It refers to the different ways the data is defined and used in different
modules.
3 Metadata
It is data about data. It is used for maintaining, managing and using the data warehouse.
It is classified into two types:
Technical Metadata: It contains information about data warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes:
Information about data stores
Transformation descriptions, that is, mapping methods from the operational database to the warehouse database
Warehouse object and data structure definitions for target data
The rules used to perform cleanup and data enhancement
Data mapping operations
Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.
Business Metadata: It contains information that describes, for users, the information stored in the data warehouse. It includes:
Subject areas and information object types, including queries, reports, images, video, audio clips, etc.
Internet home pages
Information related to the information delivery system
Data warehouse operational information such as ownerships, audit trails, etc.

Metadata helps users to understand the content and find the data. Metadata is stored in
a separate data store, known as the information directory or metadata repository,
which helps to integrate, maintain and view the contents of the data warehouse. The
following lists the characteristics of the information directory/metadata repository:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and availability
It should be searchable by business-oriented keywords
It should act as a launch platform for end users to access data and analysis tools
It should support the sharing of information
It should support scheduling options for requests
It should support and provide interfaces to other applications
It should support end user monitoring of the status of the data warehouse environment
4 Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting
tools:
Production reporting tools, used to generate regular operational reports
Desktop report writers, which are inexpensive desktop tools designed for end users
Managed query tools: used to generate SQL queries. They use a metalayer of software between users
and databases, which offers point-and-click creation of SQL statements. These tools are a preferred
choice of users to perform segment identification, demographic analysis, territory management,
preparation of customer mailing lists, etc.
Application development tools: This is a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to multidimensional
databases and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from the data warehouse data; they can also be
used for data visualization and data correction purposes.
5 Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a dedicated
user group. They are used for rapid delivery of enhanced decision support functionality to end
users. A data mart is used in the following situations:
Extremely urgent user requirements
The absence of a budget for a full-scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mind-sized project
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it the organization has to pay more attention to system scalability, consistency
and manageability issues
2. Data integration
6 Data warehouse admin and management
The management of a data warehouse includes:
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data warehouse storage management, which includes capacity planning, hierarchical storage management, purging of aged data, etc.
7 Information delivery system

It is used to enable the process of subscribing to data warehouse information.
It delivers to one or more destinations according to a specified scheduling algorithm

Building a Data warehouse
There are two reasons why organizations consider data warehousing a critical need. In other
words, there are two factors that drive you to build and use a data warehouse. They are:
Business factors:
Business users want to make decisions quickly and correctly using all available data.
Technological factors:
To address the incompatibility of operational data stores
IT infrastructure is changing rapidly. Its capacity is increasing and its cost is decreasing, so that building a data warehouse is easy
There are several things to be considered while building a successful data warehouse.
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following
two approaches:

Top-Down Approach (Suggested by Bill Inmon)
Bottom-Up Approach (Suggested by Ralph Kimball)

Top-Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house
corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The
data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate-wide data helps us maintain one version of truth of the data.
The data in the EDW is stored at the most detailed level. The reasons to build the EDW at the most
detailed level are:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented, we start building subject-area-specific data marts which contain
data in a denormalized form, also called a star schema. The data in the marts is usually
summarized based on the end users' analytical requirements.
The reason to denormalize the data in the mart is to provide faster access to the data for the end
users' analytics. If we were to query a normalized schema for the same analytics, we would
end up with complex multi-level joins that would be much slower than the query on
the denormalized schema.
We should implement the top-down approach when
1. The business has complete clarity on the data warehouse requirements for all or multiple
subject areas.
2. The business is ready to invest considerable time and money.
The advantage of using the top-down approach is that we build a centralized repository to cater
for one version of truth for business data. This is very important for the data to be reliable and
consistent across subject areas, and for reconciliation in case of data-related contention between
subject areas.

The disadvantage of using the top-down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by the building of
the data marts, before it can access its reports.
Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data
warehouse. Here we build the data marts separately at different points in time, as and when the
specific subject area requirements are clear. The data marts are integrated or combined together
to form a data warehouse. Separate data marts are combined through the use of conformed
dimensions and conformed facts. Conformed dimensions and conformed facts are ones that can be
shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent
values across separate data marts. A conformed dimension means exactly the same thing with every
fact table it is joined to.
A conformed fact has the same definition of measures, the same dimensions joined to it and the
same granularity across data marts.
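As a rough illustration of what "conformed" means for a dimension, the sketch below compares two copies of a product dimension held by different data marts. The mart names, keys and attributes are hypothetical, and the check is deliberately simplified: it only requires identical attribute names and identical rows for the keys both copies share.

```python
# Hypothetical copies of a product dimension: dimension key -> attributes.
sales_mart_product = {
    1: {"product_name": "Widget", "category": "Hardware"},
    2: {"product_name": "Gadget", "category": "Hardware"},
}
inventory_mart_product = {
    1: {"product_name": "Widget", "category": "Hardware"},
    2: {"product_name": "Gadget", "category": "Hardware"},
}
finance_mart_product = {
    1: {"product_name": "Widget", "category": "Tools"},  # conflicting value for key 1
    2: {"product_name": "Gadget", "category": "Hardware"},
}


def attribute_names(dimension):
    """Collect the set of attribute names used by any row of the dimension."""
    names = set()
    for attributes in dimension.values():
        names.update(attributes)
    return names


def is_conformed(dim_a, dim_b):
    """Simplified check: same attribute names, and same values for shared keys."""
    if attribute_names(dim_a) != attribute_names(dim_b):
        return False
    shared_keys = dim_a.keys() & dim_b.keys()
    return all(dim_a[key] == dim_b[key] for key in shared_keys)


sales_vs_inventory = is_conformed(sales_mart_product, inventory_mart_product)
sales_vs_finance = is_conformed(sales_mart_product, finance_mart_product)
```

Here the sales and inventory copies agree, while the finance copy assigns a different category to the same key, so fact tables joined to it would not be comparable with the others.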
The bottom-up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We don't have to wait until we know
the overall requirements of the warehouse. We should implement the bottom-up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity on only one data
mart.

The advantage of using the bottom-up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier
compared to the top-down approach.
The disadvantage of using the bottom-up approach is that it stores data in the denormalized
format, hence there would be high space usage for detailed data. We have a tendency of not
keeping detailed data in this approach, hence losing out on the advantage of having detailed data,
i.e. flexibility to easily cater to future requirements. The bottom-up approach is more realistic,
but the complexity of the integration may become a serious obstacle.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach, that is, consider all
data warehouse components as parts of a single complex system and take into account all
possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:

Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency

A data warehouse is difficult to build for the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to
data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse
framework. The data warehouse data must be detailed data. It must be formatted, cleaned up
and transformed to fit the warehouse data model.
Metadata
It defines the location and contents of data in the warehouse. Metadata is searchable by users to
find definitions or subject areas. In other words, it must provide decision-support-oriented
pointers to warehouse data and thus provide a logical link between warehouse data and decision
support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know how
the data should be divided across multiple servers and which users should get access to which
types of data. The data can be distributed based on the subject area, location (geographical
region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the implementation of
the data warehouse. All selected tools must be compatible with the given data warehouse
environment and with each other. All tools must be able to use a common metadata repository.

Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing precalculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects data marts, operational systems and end users
The hardware and software to support the metadata repository
The systems management framework that enables administration of the entire environment

DBMS schemas for decision support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents
business items or business transactions. A dimension is a collection of data that describes one
business dimension. Dimensions determine the contextual background for the facts; they are the
parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact,
representing the performance or behavior of the business relative to the dimensions.
Considering the relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational database semantics is
provided by the database schema design called star schema. The basic premise of the star schema is
that information can be classified into two groups:

Facts
Dimensions
A star schema has one large central table (fact table) and a set of smaller tables (dimensions)
arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, accessible tools and project team preferences.

What is star schema? The star schema architecture is the simplest data warehouse schema. It is
called a star schema because the diagram resembles a star, with points radiating from a center.
The center of the star consists of the fact table and the points of the star are the dimension tables.
Usually the fact tables in a star schema are in third normal form (3NF) whereas dimension tables
are denormalized. Although the star schema is the simplest architecture, it is the most
commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables, and measures,
which contain numeric facts. A fact table can contain fact data at a detailed or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a
fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension
(profit by country, state, city), Product dimension (profit for product1, product2).
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a
dimension has no hierarchies and levels it is called a flat dimension or list. The primary keys
of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally descriptive,
textual values. Dimension tables are generally smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times, channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data which end
users are interested in. E.g. a sales fact table may contain a profit measure which represents profit
on each sale.
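The fact table, dimension tables and measure described above can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_time, dim_region, dim_product), the columns and the sample rows are hypothetical and only illustrate the layout: a fact table whose composite primary key is made of foreign keys to the dimension tables, plus one numeric measure (profit).

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

-- Fact table: foreign keys to every dimension plus the profit measure.
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    region_id INTEGER REFERENCES dim_region(region_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    profit REAL,
    PRIMARY KEY (time_id, region_id, product_id)
);

INSERT INTO dim_time VALUES (1, 'Jan', 1996), (2, 'Feb', 1996);
INSERT INTO dim_region VALUES (1, 'Palo Alto', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO dim_product VALUES (1, 'product1'), (2, 'product2');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 100.0), (1, 2, 1, 40.0), (2, 1, 2, 60.0), (2, 2, 2, 25.0);
""")

# View the profit measure by the Region dimension (profit by country).
profit_by_country = conn.execute("""
    SELECT r.country, SUM(f.profit)
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.country
    ORDER BY r.country
""").fetchall()

# View the same measure by the Time dimension (profit by month).
profit_by_month = conn.execute("""
    SELECT t.month, SUM(f.profit)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY t.time_id
    ORDER BY t.time_id
""").fetchall()
```

Each analytical query needs only one join per dimension it uses, which is the main reason the star layout is considered easy to query.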
Aggregations are precalculated numeric data. By calculating and storing the answers to a query
before users ask for it, the query processing time can be reduced. This is key in providing fast
query performance in OLAP.
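A minimal sketch of this idea, assuming a hypothetical list of detail-level sales facts: the totals per city and year are computed once when the data is loaded, and later queries are answered by a lookup instead of a rescan of the detail rows.

```python
from collections import defaultdict

# Hypothetical detail-level facts: (city, year, sales amount).
facts = [
    ("Palo Alto", 1995, 10.0),
    ("Palo Alto", 1996, 15.0),
    ("Toronto", 1996, 7.0),
    ("Toronto", 1996, 3.0),
]


def build_aggregate(rows):
    """Precalculate total sales per (city, year) once, at load time."""
    totals = defaultdict(float)
    for city, year, amount in rows:
        totals[(city, year)] += amount
    return dict(totals)


# The stored aggregation; in a real system this would be refreshed on each load.
sales_by_city_year = build_aggregate(facts)


def total_sales(city, year):
    """Answer a query from the stored aggregate without touching the detail rows."""
    return sales_by_city_year.get((city, year), 0.0)
```

The trade-off is extra storage and load-time work in exchange for faster queries.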
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to
clients.
The main characteristics of star schema:

Simple structure -> easy to understand schema
High query efficiency -> small number of tables to join
Relatively long time of loading data into dimension tables -> denormalization and data redundancy mean that the size of the tables could be large
The most commonly used in data warehouse implementations -> widely supported by a large number of business intelligence tools

Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of
dimensions very well.
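The decomposition can be sketched as follows, assuming a hypothetical denormalized region dimension whose attributes form the many-to-one chain city -> state -> country. The function and table names are illustrative only.

```python
# Hypothetical denormalized region dimension, as it would appear in a star schema.
dim_region = [
    {"region_id": 1, "city": "Palo Alto", "state": "California", "country": "USA"},
    {"region_id": 2, "city": "San Jose", "state": "California", "country": "USA"},
    {"region_id": 3, "city": "Toronto", "state": "Ontario", "country": "Canada"},
]


def snowflake(rows):
    """Split the chain city -> state -> country into three linked tables."""
    countries = {}  # country name -> country_id
    states = {}     # (state name, country_id) -> state_id
    cities = []
    for row in rows:
        country_id = countries.setdefault(row["country"], len(countries) + 1)
        state_id = states.setdefault((row["state"], country_id), len(states) + 1)
        cities.append(
            {"region_id": row["region_id"], "city": row["city"], "state_id": state_id}
        )
    state_table = [
        {"state_id": state_id, "state": name, "country_id": country_id}
        for (name, country_id), state_id in states.items()
    ]
    country_table = [
        {"country_id": country_id, "country": name}
        for name, country_id in countries.items()
    ]
    return cities, state_table, country_table


city_table, state_table, country_table = snowflake(dim_region)
```

Each state and country is now stored once, at the price of extra joins when a query needs attributes from the higher levels of the hierarchy.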
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example, by splitting the original star schema into several star schemas, each of which
describes facts at another level of the dimension hierarchies). The fact constellation architecture
contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because
many variants for particular kinds of aggregation must be considered and selected. Moreover,
dimension tables are still large.

The Multidimensional data Model
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the
queries are complex. The multidimensional data model is designed to solve complex queries in
real time.
One way to understand the multidimensional data model is to view it as a cube. The table at the
left contains detailed sales data by product, market and time. The cube on the right associates
sales numbers (units sold) with the dimensions product type, market and time, with the unit
variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with all or only
some dimensions. As the number of dimensions increases, the number of cube cells increases
exponentially.
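The growth in the number of cells can be made concrete with a small calculation. In a dense cube the number of cells is the product of the dimension sizes; the dimension sizes used below are hypothetical.

```python
from math import prod


def cube_cells(cardinalities):
    """Number of cells in a dense cube: the product of the dimension sizes."""
    return prod(cardinalities)


# Hypothetical cube: 3 product types x 4 markets x 12 months.
sales_cube_cells = cube_cells([3, 4, 12])

# With 10 members per dimension, each added dimension multiplies the count by 10.
growth = [cube_cells([10] * n) for n in range(1, 5)]
```

With n dimensions of 10 members each the cube has 10 to the power n cells, which is the exponential growth referred to above.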
Dimensions are hierarchical in nature, i.e. the time dimension may contain hierarchies for years,
quarters, months, weeks and days. GEOGRAPHY may contain country, state, city, etc.

In this cube we can observe that each side of the cube represents one of the elements of the
question. The x-axis represents the time, the y-axis represents the products and the z-axis
represents different centers. The cells in the cube represent the number of products sold or can
represent the price of the items.

This figure also gives a different understanding of the drilling down operations. The relations
defined need not be directly related; they may be related indirectly.
As the size of the dimensions increases, the size of the cube also increases; it grows
exponentially with the number of dimensions. The response time of the cube depends on the size of
the cube.
Operations in Multidimensional Data Model:

Aggregation (roll-up)
  dimension reduction: e.g., total sales by city
  summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
Selection (slice) defines a subcube
  e.g., sales where city = Palo Alto and date = 1/15/96
Navigation to detailed data (drill-down)
  e.g., (sales - expense) by city, top 3% of cities by average income
Visualization Operations (e.g., Pivot or dice)
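These operations can be sketched on a tiny cube held in a Python dictionary. The cube contents, the city-to-region mapping and all function names are hypothetical; OLAP products implement these operations over much larger, specially indexed structures.

```python
from collections import defaultdict

# Hypothetical hierarchy and cube cells: (city, year) -> total sales.
CITY_TO_REGION = {"Palo Alto": "West", "San Jose": "West", "Boston": "East"}
cube = {
    ("Palo Alto", 1995): 10.0,
    ("Palo Alto", 1996): 15.0,
    ("San Jose", 1996): 5.0,
    ("Boston", 1995): 8.0,
    ("Boston", 1996): 12.0,
}


def roll_up_drop_year(cells):
    """Dimension reduction: total sales by city (the year dimension is removed)."""
    out = defaultdict(float)
    for (city, _year), sales in cells.items():
        out[city] += sales
    return dict(out)


def roll_up_to_region(cells):
    """Summarization over a hierarchy: (city, year) -> (region, year)."""
    out = defaultdict(float)
    for (city, year), sales in cells.items():
        out[(CITY_TO_REGION[city], year)] += sales
    return dict(out)


def slice_cube(cells, city=None, year=None):
    """Selection: keep only the cells matching the given dimension values."""
    return {
        (c, y): s
        for (c, y), s in cells.items()
        if (city is None or c == city) and (year is None or y == year)
    }


def drill_down(cells, region, year):
    """Navigate from a (region, year) total back to its city-level cells."""
    return {
        (c, y): s
        for (c, y), s in cells.items()
        if CITY_TO_REGION[c] == region and y == year
    }


def pivot(cells):
    """Visualization: swap the axes so that cells are keyed by (year, city)."""
    return {(y, c): s for (c, y), s in cells.items()}
```

For example, `roll_up_to_region(cube)` gives the total for ("West", 1996), and `drill_down(cube, "West", 1996)` returns the two city-level cells that make up that total.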

OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables)
to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP
technology could provide management with fast answers to complex queries on their operational
data or enable them to analyze their company's historical data for trends and patterns. Online
Analytical Processing (OLAP) applications and tools are those that are designed to ask complex
queries of large multidimensional collections of data. Because of that, OLAP is usually
accompanied by data warehousing.
Need
The key driver of OLAP is the multidimensional nature of the business problem. These problems
are characterized by retrieving a very large number of records, which can reach gigabytes and
terabytes, and summarizing this data into a form of information that can be used by business
analysts.

One of the limitations of SQL is that it cannot represent these complex problems. A query will be
translated into several SQL statements. These SQL statements will involve multiple joins,
intermediate tables, sorting, aggregations and a huge amount of temporary memory to store these
tables. These procedures require a lot of computation, which takes a long time. The
second limitation of SQL is its inability to use mathematical models in these SQL statements. Even
if an analyst could create these complex statements using SQL, there would still be a large
amount of computation and a huge amount of memory needed. Therefore the use of OLAP is preferable
for solving this kind of problem.
Categories of OLAP Tools
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats.
That is, data is stored in array-based structures.
Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
Examples of ROLAP (relational OLAP) tools: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
HOLAP (MQE: Managed Query Environment)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP (relational OLAP, which
keeps the data in relational tables). For summary-type information, HOLAP leverages cube
technology for faster performance. It stores only the indexes and aggregations in the
multidimensional form while the rest of the data is stored in the relational database.

Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services
OLAP Guidelines
Dr. E. F. Codd, the father of the relational model, created a list of rules for OLAP
systems. Users should prioritize these rules according to their needs to match their business
requirements (reference 3). These rules are:
1) Multidimensional conceptual view: The OLAP tool should provide an appropriate
multidimensional business model that suits the business problems and requirements.
2) Transparency: The OLAP tool should provide transparency to the input data for the users.
3) Accessibility: The OLAP tool should access only the data required for the analysis
needed.
4) Consistent reporting performance: The size of the database should not affect the performance
in any way.
5) Client/server architecture: The OLAP tool should use the client/server architecture to
ensure better performance and flexibility.
6) Generic dimensionality: Data entered should be equivalent to the structure and operation
requirements.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse
matrix and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work together
concurrently.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform
operations across the dimensions of the cube.
10) Intuitive data manipulation: Consolidation path reorientation, drilling down across
columns or rows, zooming out, and other manipulation inherent in the consolidation path
outlines should be accomplished via direct action upon the cells of the analytical model,
and should neither require the use of a menu nor multiple trips across the user
interface. (Reference 4)
11) Flexible reporting: It is the ability of the tool to present the rows and columns in a manner
suitable to be analyzed.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business,
where multiple dimensions and defining hierarchies can be made.

In addition to these guidelines an OLAP system should also support:

Comprehensive database management tools: These give database management the ability to control distributed businesses.
The ability to drill down to detail source record level: This requires that the OLAP tool allow smooth transitions in the multidimensional database.
Incremental database refresh: The OLAP tool should provide partial refresh.
Structured Query Language (SQL) interface: The OLAP system should be able to integrate effectively in the surrounding enterprise environment.

OLTP vs OLAP
OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used
to facilitate and manage usual business applications. Most of the applications you see and use are
OLTP based. OLTP technology is used to perform updates on operational or transactional systems
(e.g., point of sale systems).
OLAP stands for On-Line Analytic Processing and is an approach to answering multidimensional
queries. OLAP was conceived for Management Information Systems and Decision Support
Systems. OLAP technology is used to perform complex analysis of the data in a data warehouse.
The following table summarizes the major differences between OLTP and OLAP system design.
| | OLTP System: Online Transaction Processing (Operational System) | OLAP System: Online Analytical Processing (Data Warehouse) |
|---|---|---|
| Source of data | Operational data; OLTPs are the original source of the data. | Consolidation data; OLAP data comes from the various OLTP databases |
| Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support |
| What the data reveals | A snapshot of ongoing business processes | Multidimensional views of various kinds of business activities |
| Inserts and updates | Short and fast inserts and updates initiated by end users | Periodic long-running batch jobs refresh the data |
| Queries | Relatively standardized and simple queries returning relatively few records | Often complex queries involving aggregations |
| Processing speed | Typically very fast | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes |
| Space requirements | Can be relatively small if historical data is archived | Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP |
| Database design | Highly normalized with many tables | Typically denormalized with fewer tables; use of star and/or snowflake schemas |
| Backup and recovery | Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability | Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method |
