Sunteți pe pagina 1din 73

Data Warehouse

&
Data Mining

1
Overview

❚ Part 1: Data Warehouses


❚ Part 2: OLAP
❚ Part 3: Data Mining

2
Part 1: Data Warehouses

3
Data, Data everywhere
yet ...
❚ I can’t find the data I need
❙ data is scattered over the
network
❚ manyget
I ❙can’t versions, subtle
the data I need
❙ differences
need an expert to get the data
❚ I can’t understand the data I
found
❙ available data poorly
documented
❚ I can’t use the data I found
❙ results are unexpected
❙ data needs to be transformed
from one form to other 4
What is a Data Warehouse?

A single, complete and


consistent store of data
obtained from a variety
of different sources
made available to end
users in a what they
can understand and use
in a business context.

[Barry Devlin]
5
Why Data Warehousing?
Which are our
lowest/highest margin
customers ?
Who are my customers
What is the most and what products
effective distribution are they buying?
channel?

What product prom- Which customers


-otions have the biggest are most likely to go
impact on revenue? to the competition ?
What impact will
new products/services
have on revenue
and margins? 6
Decision Support

❚ Used to manage and control


business
❚ Data is historical or point-in-time
❚ Optimized for inquiry rather than
update
❚ Used by managers and end-users to
understand the business and make
judgements
7
The Evol uti on of Da ta
Warehous ing

❚ Since 1970s, organizations gained


competitive advantage through
systems that automate business
processes to offer more efficient and
cost-effective services to the
customer.

❚ This resulted in accumulation of


growing amounts of data in
operational databases.
8
Data Wa rehousi ng
Concept s

❚ A subject-oriented, integrated,
time-variant, and non-volatile
collection of data in support of
management’s decision-making
process (Inmon, 1993).

9
Sub ject- ori ente d Data
❚ The warehouse is organized around
the major subjects of the enterprise
(e.g. customers, products, and sales)
rather than the major application
areas (e.g. customer invoicing, stock
control, and product sales).

❚ This is reflected in the need to store


decision-support data rather than
application-oriented data.

10
Integrated Dat a

❚ The data warehouse integrates


corporate application-oriented data
from different source systems, which
often includes data that is
inconsistent.

❚ The integrated data source must be


made consistent to present a unified
view of the data to the users.

11
Time -v ariant Data

❚ Data in the warehouse is only


accurate and valid at some point in
time or over some time interval.

❚ Time-variance is also shown in the


extended time that the data is held,
the implicit or explicit association of
time with all data, and the fact that
the data represents a series of
snapshots.
12
Non- vol ati le D ata

❚ Data in the warehouse is not


updated in real-time but is
refreshed from operational
systems on a regular basis.

❚ New data is always added as a


supplement to the database,
rather than a replacement.

13
Benef its of D ata
Warehous ing

❚ Potential high returns on


investment

❚ Competitive advantage

❚ Increased productivity of
corporate decision-makers

14
Com pa rison of OLT P
Systems and Data
Warehous ing

15
Data Wa rehouse
Queri es
❚ The types of queries that a data
warehouse is expected to answer
ranges from the relatively simple to
the highly complex and is dependent
on the type of end-user access tools
used.

❚ End-user access tools include:


❙ Reporting, query, and application
development tools
❙ Executive information systems (EIS)
❙ OLAP tools 16
Exampl es of Typi cal D ata
War eho use Qu er ies
❚ What was the total revenue for Scotland in the third quarter of
2004?
❚ What was the total revenue for property sales for each type of
property in Great Britain in 2003?
❚ What are the three most popular areas in each city for the renting
of property in 2004 and how does this compare with the figures
for the previous two years?
❚ What is the monthly revenue for property sales at each branch
office, compared with rolling 12-monthly prior figures?
❚ What would be the effect on property sales in the different
regions of Britain if legal costs went up by 3.5% and Government
taxes went down by 1.5% for properties over £100,000?
❚ Which type of property sells for prices above the average selling
price for properties in the main cities of Great Britain and how
does this correlate to demographic data?
❚ What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?

© Pearson Education Limited 1995, 2005 17


Probl ems of Da ta
Warehous ing
❚ Underestimation of resources for
data loading

❚ Hidden problems with source


systems

❚ Required data not captured

❚ Increased end-user demands

❚ Data homogenization
18
Probl ems of Da ta
Warehous ing
❚ High demand for resources

❚ Data ownership

❚ High maintenance

❚ Long duration projects

❚ Complexity of integration
19
Typic al Archi tecture of
a Data War eh ouse

20
Data Mart
❚ A subset of a data warehouse that
supports the requirements of a
particular department or business
function.

❚ Characteristics include
❙ Focuses on only the requirements of one
department or business function.
❙ Do not normally contain detailed
operational data unlike data warehouses.
❙ More easily understood and navigated.

21
Reas on s for Cr eati ng a
Data Mart
❚ To give users access to the data they need
to analyze most often.

❚ To provide data in a form that matches the


collective view of the business function
area. data by a group of users in a
department or

❚ To improve end-user response time due to


the reduction in the volume of data to be
accessed.

22
Reas on s for Cr eati ng a
Data Mart
❚ To provide appropriately structured
data as dictated by the requirements
of the end-user access tools.

❚ Building a data mart is simpler


compared with establishing a
corporate data warehouse.

❚ The cost of implementing data marts


is normally less than that required to
establish a data warehouse. 23
Reas on s for Cr eati ng a
Data Mart

❚ The potential users of a data


mart are more clearly defined
and can be more easily targeted
to obtain support for a data mart
project rather than a corporate
data warehouse project.

24
From the Data Warehouse
to Data Marts
Information

Individually Less
Structured

Departmentally History
Structured Normalized
Detailed

Organizationally More
Structured Data Warehouse

Data
25
Part 2: OLAP

26
Nature of OLAP Analysis

❚ Aggregation -- (total sales, percent-


to-total)
❚ Comparison -- Budget vs. Expenses
❚ Ranking -- Top 10, quartile analysis
❚ Access to detailed and aggregate
data
❚ Complex criteria specification
❚ Visualization
27
❚ Need interactive response to aggregate
Bus iness Int ell ig ence
Technolo gie s
OLAP & Data Mining
❚ Accompanying the growth in data
warehousing is an ever-increasing
demand by users for more powerful
access tools that provide advanced
analytical capabilities.

❚ There are two main types of access


tools available to meet this
demand, namely Online Analytical
Processing (OLAP) and data
mining. 28
Bus ines s Int elligence
Technol ogi es
❚ OLAP and Data Mining differ in
what they offer the user and
because of this they are
complementary technologies.

❚ An environment that includes a


data warehouse (or more
commonly one or more data marts)
together with tools such as OLAP
and /or data mining are collectively
referred to as Business Intelligence
29
Onl ine A nal yti cal
Processi ng (OLA P)
❚ The dynamic synthesis, analysis,
and consolidation of large
volumes of multi-dimensional
data, Codd (1993).

❚ Describes a technology that uses


a multi-dimensional view of
aggregate data to provide quick
access to strategic information
for the purposes of advanced 30
Onl ine A nal yti cal
Processi ng (OLA P)
❚ Enables users to gain a deeper
understanding and knowledge about
various aspects of their corporate
data through fast, consistent,
interactive access to a wide variety of
possible views of the data.

❚ Allows users to view corporate data


in such a way that it is a better model
of the true dimensionality of the
enterprise.
31
Onl ine A nal yti cal
Processi ng (OLA P)
❚ Can easily answer ‘who?’ and ‘what?’
questions, however, ability to answer
‘what if?’ and ‘why?’ type questions
distinguishes OLAP from general-
purpose query tools.

❚ Types of analysis ranges from basic


navigation and browsing (slicing and
dicing) to calculations, to more
complex analyses such as time series
and complex modeling.
32
Exa mple s of OLAP
appl ication s i n var ious
functio nal areas

33
OLAP Appl icati ons

❚ Although OLAP applications are


found in widely divergent
functional areas, they all have
the following key features:
❙ multi-dimensional views of data
❙ support for complex calculations
❙ time intelligence

34
OLAP Appl icati ons -
support for compl ex
cal cula tion s
❚ Must provide a range of
powerful computational methods
such as that required by sales
forecasting, which uses trend
algorithms such as moving
averages and percentage
growth.

35
OLAP Appl icati ons –
ti me intel ligence
❚ Key feature of almost any analytical
application as performance is almost
always judged over time.

❚ Time hierarchy is not always used in


the same manner as other
hierarchies.

❚ Concepts such as year-to-date and


period-over-period comparisons
should be easily defined. 36
OLAP Benef its
❚ Increased productivity of end-users.
❚ Reduced backlog of applications
development for IT staff.
❚ Retention of organizational control
over the integrity of corporate data.
❚ Reduced query drag and network
traffic on OLTP systems or on the
data warehouse.
❚ Improved potential revenue and
profitability.
37
Rep resentati on of
Mu lti- di mensi onal Data
❚ Example of two-dimensional query.
❘ What is the total revenue generated by
property sales in each city, in each quarter of
2004?’

❚ Choice of representation is based on


types of queries end-user may ask.

❚ Compare representation - three-field


relational table versus two-
dimensional matrix.
38
Mu lti-dimensional D ata as
Three-field table ve rsus
Two-dimens io nal Matr ix

39
Rep resentati on of
Mu lti- di mensi onal Data
❚ Example of three-dimensional
query.
❙ ‘What is the total revenue generated by
property sales for each type of property
(Flat or House) in each city, in each
quarter of 2004?’

❚ Compare representation - four-


field relational table versus
three-dimensional cube.
40
Mul ti -di mensi onal D ata
as Four- fi el d Tab le
versu s Three-
di mensi onal Cube

41
Rep resentati on of
Mu lti- di mensi onal Data
❚ Cube represents data as cells in
an array.

❚ Relational table only represents


multi-dimensional data in two
dimensions.

42
Multi-dimensional Data
❚ Measure - sales (actual, plan,
variance) Dimensions: Product, Region, Time
Hierarchical summarization paths
n
io

W
eg

S
R

N Product Region Time


Product

Juice Industry Country Year


Cola
Milk
Cream Category Region Quarter
Toothpaste
Soap
1 2 34 5 6 7 Product City Month week

Month Office Day


43
Strengths of OLAP
❚ It is a powerful
visualization tool
❚ It provides fast,
interactive response
times
❚ It is good for analyzing
time series
❚ It can be useful to find
some clusters and
outliners
❚ Many vendors offer OLAP
44
tools
OLAP and Executive
Information Systems
❚ Andyne Computing -- ❚ Oracle -- Express
Pablo ❚ Pilot -- LightShip
❚ Arbor Software -- ❚ Planning Sciences --
Essbase Gentium
❚ Cognos -- PowerPlay ❚ Platinum Technology
❚ Comshare -- -- ProdeaBeacon,
Commander OLAP Forest & Trees
❚ Holistic Systems -- ❚ SAS Institute --
Holos SAS/EIS, OLAP++
❚ Information Advantage ❚ Speedware -- Media
-- AXSYS, WebOLAP
❚ Informix -- Metacube
❚ Microstrategies -- 45
DSS/Agent
Part 3: Data Mining

46
Data Min ing
❚ The process of extracting valid,
previously unknown, comprehensible,
and actionable information from large
databases and using it to make
crucial business decisions,
(Simoudis,1996).

❚ Involves the analysis of data and the


use of software techniques for
finding hidden and unexpected
patterns and relationships in sets of
data. 47
Data Min ing
❚ Reveals information that is hidden
and unexpected, as little value in
finding patterns and relationships
that are already intuitive.

❚ Patterns and relationships are


identified by examining the
underlying rules and features in the
data.

48
Data Min ing
❚ Most accurate results normally
require large volumes of data to
deliver reliable conclusions.

❚ Starts by developing an optimal


representation of structure of
sample data

49
Data Min ing
❚ Data mining can provide huge
paybacks for companies who
have made a significant
investment in data warehousing.

❚ Relatively new technology,


however already used in a
number of industries.

50
Exa mple s of
Appl icati ons of D ata
Mini ng
❚ Retail / Marketing
❙ Identifying buying patterns of
customers
❙ Finding associations among
customer demographic
characteristics
❙ Predicting response to mailing
campaigns
❙ Market basket analysis
51
Exa mple s of
Appl icati ons of D ata
Mini ng

❚ Banking
❙ Detecting patterns of fraudulent
credit card use
❙ Identifying loyal customers
❙ Predicting customers likely to
change their credit card affiliation
❙ Determining credit card spending
by customer groups

52
Exa mple s of
Appl icati ons of D ata
Mini ng
❚ Insurance
❙ Claims analysis
❙ Predicting which customers will buy new
policies

❚ Medicine
❙ Characterizing patient behavior to
predict surgery visits
❙ Identifying successful medical therapies
for different illnesses

53
Data Min ing Operat ions
❚ Four main operations include:
❙ Predictive modeling
❙ Database segmentation
❙ Link analysis
❙ Deviation detection

❚ There are recognized associations


between the applications and the
corresponding operations.
❙ e.g. Direct marketing strategies use
database segmentation.

54
Data Min ing
Tec hniques
❚ Techniques are specific
implementations of the data
mining operations.

❚ Each operation has its own


strengths and weaknesses.

55
Data Mi ning Operat ions
and Ass ocia ted
Techniq ue s

56
Predi ctive Model ing
❚ Similar to the human learning
experience
❙ uses observations to form a model of the
important characteristics of some
phenomenon.

❚ Uses generalizations of ‘real world’


and ability to fit new data into a
general framework.
❚ Can analyze a database to determine
essential characteristics (model)
about the data set. 57
Predi ctive Model ing

❚ Model is developed using a


supervised learning approach, which
has two phases: training and testing.
❙ Training builds a model using a large
sample of historical data called a training
set.
❙ Testing involves trying out the model on
new, previously unseen data to
determine its accuracy and physical
performance characteristics.

58
Predi ctive Model ing
❚ Applications of predictive modeling
include customer retention
management, credit approval, cross
selling, and direct marketing.

❚ There are two techniques associated


with predictive modeling:
classification and value prediction,
which are distinguished by the nature
of the variable being predicted.

59
Exampl e of Cl assi ficati on
usi ng Tree I nducti on

60
Predi ctive Model ing -
Val ue Predi ctio n
❚ Used to estimate a continuous
numeric value that is associated with
a database record.

❚ Uses the traditional statistical


techniques of linear regression and
nonlinear regression.

❚ Relatively easy-to-use and


understand.
61
Predi ctive Model ing -
Val ue Predi ctio n
❚ Linear regression attempts to fit a
straight line through a plot of the
data, such that the line is the best
representation of the average of all
observations at that point in the plot.

❚ Problem is that the technique only


works well with linear data and is
sensitive to the presence of outliers
(that is, data values, which do not
conform to the expected norm).
62
Predi ctive Model ing -
Val ue Predi ctio n
❚ Data mining requires statistical
methods that can accommodate
non-linearity, outliers, and non-
numeric data.

❚ Applications of value prediction


include credit card fraud
detection or target mailing list
identification.
63
Data bas e Segmenta tion
❚ Aim is to partition a database
into an unknown number of
segments, or clusters, of similar
records.

❚ Uses unsupervised learning to


discover homogeneous sub-
populations in a database to
improve the accuracy of the
profiles. 64
Data bas e Segmenta tion
❚ Less precise than other
operations thus less sensitive
to redundant and irrelevant
features.

❚ Applications of database
segmentation include
customer profiling, direct
marketing, and cross selling.

65
Exampl e of Datab ase
Segmentati on usi ng a
Scatterpl ot

66
Li nk Anal ysi s
❚ Aims to establish links
(associations) between records, or
sets of records, in a database.

❚ There are three specializations


❙ Associations discovery
❙ Sequential pattern discovery
❙ Similar time sequence discovery

❚ Applications include product


affinity analysis, direct marketing,
and stock price movement. 67
Li nk Anal ysi s -
As soci ations Di scovery
❚ Finds items that imply the presence
of other items in the same event.

❚ Affinities between items are


represented by association rules.
❙ e.g. ‘When a customer rents property for
more than 2 years and is more than 25
years old, in 40% of cases, the customer
will buy a property. This association
happens in 35% of all customers who rent
properties’.
68
Li nk Anal ysi s - Sequent ial
Pattern Di sc overy
❚ Finds patterns between events
such that the presence of one
set of items is followed by
another set of items in a
database of events over a period
of time.
❙ e.g. Used to understand long term
customer buying behavior.

69
Li nk Anal ysi s - Simi lar
Tim e Sequence Di scover y
❚ Finds links between two sets of
data that are time-dependent,
and is based on the degree of
similarity between the patterns
that both time series
demonstrate.
❙ e.g. Within three months of buying
property, new home owners will
purchase goods such as cookers,
freezers, and washing machines.
70
Devi ati on Detecti on
❚ Relatively new operation in terms of
commercially available data mining
tools.

❚ Often a source of true discovery


because it identifies outliers, which
express deviation from some
previously known expectation and
norm.

71
Devi ati on Detecti on

❚ Can be performed using


statistics and visualization
techniques or as a by-product of
data mining.

❚ Applications include fraud


detection in the use of credit
cards and insurance claims,
quality control, and defects
tracing. 72
Exa mple of Databas e
Seg mentati on usi ng a
Vi sual izatio n

73