
Data Warehouse

Data Mining


❚ Part 1: Data Warehouses

❚ Part 2: OLAP
❚ Part 3: Data Mining

Part 1: Data Warehouses

Data, Data everywhere
yet ...
❚ I can’t find the data I need
❙ data is scattered over the
❙ many versions, subtle differences
❚ I can’t get the data I need
❙ need an expert to get the data
❚ I can’t understand the data I
❙ available data poorly
❚ I can’t use the data I found
❙ results are unexpected
❙ data needs to be transformed from one form to other
What is a Data Warehouse?

A single, complete and

consistent store of data
obtained from a variety
of different sources
made available to end
users in a way they
can understand and use
in a business context.

[Barry Devlin]
Why Data Warehousing?
❚ Which are our lowest/highest margin customers?
❚ Who are my customers and what products are they buying?
❚ What is the most effective distribution
❚ What product promotions have the biggest impact on revenue?
❚ Which customers are most likely to go to the competition?
❚ What impact will new products/services have on revenue and margins?
Decision Support

❚ Used to manage and control

❚ Data is historical or point-in-time
❚ Optimized for inquiry rather than
❚ Used by managers and end-users to
understand the business and make
The Evolution of Data Warehousing

❚ Since 1970s, organizations gained

competitive advantage through
systems that automate business
processes to offer more efficient and
cost-effective services to the

❚ This resulted in accumulation of

growing amounts of data in
operational databases.
Data Warehousing Concepts

❚ A subject-oriented, integrated,
time-variant, and non-volatile
collection of data in support of
management’s decision-making
process (Inmon, 1993).

Subject-oriented Data
❚ The warehouse is organized around
the major subjects of the enterprise
(e.g. customers, products, and sales)
rather than the major application
areas (e.g. customer invoicing, stock
control, and product sales).

❚ This is reflected in the need to store

decision-support data rather than
application-oriented data.

Integrated Data

❚ The data warehouse integrates

corporate application-oriented data
from different source systems, which
often includes data that is

❚ The integrated data source must be

made consistent to present a unified
view of the data to the users.

Time-variant Data

❚ Data in the warehouse is only

accurate and valid at some point in
time or over some time interval.

❚ Time-variance is also shown in the

extended time that the data is held,
the implicit or explicit association of
time with all data, and the fact that
the data represents a series of
Non-volatile Data

❚ Data in the warehouse is not

updated in real-time but is
refreshed from operational
systems on a regular basis.

❚ New data is always added as a

supplement to the database,
rather than a replacement.
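
The refresh-by-appending behaviour described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory list standing in for a warehouse table; the names `warehouse` and `refresh` and the sample figures are invented for the example.

```python
from datetime import date

# Append-only store of (snapshot_date, customer_id, balance) rows.
warehouse = []

def refresh(warehouse, snapshot_date, operational_rows):
    """Append a dated snapshot; existing rows are never updated or deleted."""
    for customer_id, balance in operational_rows:
        warehouse.append((snapshot_date, customer_id, balance))

# Two periodic refreshes from a (hypothetical) operational system.
refresh(warehouse, date(2004, 1, 31), [("C1", 100.0)])
refresh(warehouse, date(2004, 2, 29), [("C1", 150.0)])

# Every snapshot is kept, so the history of customer C1 stays queryable.
history = [(d, b) for d, c, b in warehouse if c == "C1"]
```

Because each row carries its snapshot date, the same structure also illustrates the time-variant property: a value is valid as of the date it was loaded.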

Benefits of Data Warehousing

❚ Potential high returns on


❚ Competitive advantage

❚ Increased productivity of
corporate decision-makers

Comparison of OLTP Systems and Data Warehousing

Data Warehouse Queries
❚ The types of queries that a data
warehouse is expected to answer
range from the relatively simple to
the highly complex and depend
on the type of end-user access tools

❚ End-user access tools include:

❙ Reporting, query, and application
development tools
❙ Executive information systems (EIS)
❙ OLAP tools
Examples of Typical Data Warehouse Queries
❚ What was the total revenue for Scotland in the third quarter of
❚ What was the total revenue for property sales for each type of
property in Great Britain in 2003?
❚ What are the three most popular areas in each city for the renting
of property in 2004 and how does this compare with the figures
for the previous two years?
❚ What is the monthly revenue for property sales at each branch
office, compared with rolling 12-monthly prior figures?
❚ What would be the effect on property sales in the different
regions of Britain if legal costs went up by 3.5% and Government
taxes went down by 1.5% for properties over £100,000?
❚ Which type of property sells for prices above the average selling
price for properties in the main cities of Great Britain and how
does this correlate to demographic data?
❚ What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?
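
The simpler queries in this list amount to filtering or grouping fact rows by dimension attributes and summing a measure. A minimal sketch, assuming a hypothetical list of sales records; the field layout and the figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, quarter, revenue). Illustrative values only.
sales = [
    ("Scotland", 2004, 3, 120.0),
    ("Scotland", 2004, 3, 80.0),
    ("Scotland", 2004, 4, 50.0),
    ("England", 2004, 3, 200.0),
]

def total_revenue(rows, region, year, quarter):
    """Sum revenue for one region in one quarter."""
    return sum(rev for reg, yr, q, rev in rows
               if (reg, yr, q) == (region, year, quarter))

def revenue_by_region_quarter(rows):
    """Group by (region, year, quarter) and sum the revenue measure."""
    totals = defaultdict(float)
    for reg, yr, q, rev in rows:
        totals[(reg, yr, q)] += rev
    return dict(totals)

scotland_q3 = total_revenue(sales, "Scotland", 2004, 3)
by_group = revenue_by_region_quarter(sales)
```

The "what if" and correlation questions further down the list need more than a single aggregation, which is where OLAP and data mining tools come in.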

© Pearson Education Limited 1995, 2005

Problems of Data Warehousing
❚ Underestimation of resources for
data loading

❚ Hidden problems with source


❚ Required data not captured

❚ Increased end-user demands

❚ Data homogenization
Problems of Data Warehousing
❚ High demand for resources

❚ Data ownership

❚ High maintenance

❚ Long duration projects

❚ Complexity of integration
Typical Architecture of a Data Warehouse

Data Mart
❚ A subset of a data warehouse that
supports the requirements of a
particular department or business

❚ Characteristics include
❙ Focuses on only the requirements of one
department or business function.
❙ Does not normally contain detailed
operational data unlike data warehouses.
❙ More easily understood and navigated.

Reasons for Creating a Data Mart
❚ To give users access to the data they need
to analyze most often.

❚ To provide data in a form that matches the

collective view of the data by a group of users in a
department or business function area.

❚ To improve end-user response time due to

the reduction in the volume of data to be

Reasons for Creating a Data Mart
❚ To provide appropriately structured
data as dictated by the requirements
of the end-user access tools.

❚ Building a data mart is simpler

compared with establishing a
corporate data warehouse.

❚ The cost of implementing data marts

is normally less than that required to
establish a data warehouse.
Reasons for Creating a Data Mart

❚ The potential users of a data

mart are more clearly defined
and can be more easily targeted
to obtain support for a data mart
project rather than a corporate
data warehouse project.

From the Data Warehouse to Data Marts

[Figure: three levels, Individually Structured, Departmentally Structured, and Organizationally Structured (the data warehouse), along a scale running from Less to More history and normalization]

Part 2: OLAP

Nature of OLAP Analysis

❚ Aggregation -- (total sales, percent-

❚ Comparison -- Budget vs. Expenses
❚ Ranking -- Top 10, quartile analysis
❚ Access to detailed and aggregate
❚ Complex criteria specification
❚ Visualization
❚ Need interactive response to aggregate
Business Intelligence Technologies
OLAP & Data Mining
❚ Accompanying the growth in data
warehousing is an ever-increasing
demand by users for more powerful
access tools that provide advanced
analytical capabilities.

❚ There are two main types of access

tools available to meet this
demand, namely Online Analytical
Processing (OLAP) and data
mining.
Business Intelligence Technologies
❚ OLAP and Data Mining differ in
what they offer the user and
because of this they are
complementary technologies.

❚ An environment that includes a

data warehouse (or more
commonly one or more data marts)
together with tools such as OLAP
and/or data mining is collectively
referred to as Business Intelligence
Online Analytical Processing (OLAP)
❚ The dynamic synthesis, analysis,
and consolidation of large
volumes of multi-dimensional
data, Codd (1993).

❚ Describes a technology that uses

a multi-dimensional view of
aggregate data to provide quick
access to strategic information
for the purposes of advanced
Online Analytical Processing (OLAP)
❚ Enables users to gain a deeper
understanding and knowledge about
various aspects of their corporate
data through fast, consistent,
interactive access to a wide variety of
possible views of the data.

❚ Allows users to view corporate data

in such a way that it is a better model
of the true dimensionality of the
Online Analytical Processing (OLAP)
❚ Can easily answer ‘who?’ and ‘what?’
questions; however, the ability to answer
‘what if?’ and ‘why?’ type questions
distinguishes OLAP from general-purpose
query tools.

❚ Types of analysis range from basic

navigation and browsing (slicing and
dicing) to calculations, to more
complex analyses such as time series
and complex modeling.
Examples of OLAP applications in various functional areas

OLAP Applications

❚ Although OLAP applications are

found in widely divergent
functional areas, they all have
the following key features:
❙ multi-dimensional views of data
❙ support for complex calculations
❙ time intelligence

OLAP Applications - support for complex calculations
❚ Must provide a range of
powerful computational methods
such as those required by sales
forecasting, which uses trend
algorithms such as moving
averages and percentage
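
A moving average, one of the trend calculations mentioned above, can be sketched in a few lines. This is an illustrative sketch only; the function name `moving_average` and the sample sales figures are assumptions, not part of any particular OLAP product.

```python
def moving_average(values, window):
    """Return the simple moving average over each full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Hypothetical monthly sales figures.
monthly_sales = [10, 20, 30, 40, 50]

# Three-month moving average; the first value covers months 1 to 3.
smoothed = moving_average(monthly_sales, 3)
```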

OLAP Applications - time intelligence
❚ Key feature of almost any analytical
application as performance is almost
always judged over time.

❚ Time hierarchy is not always used in

the same manner as other

❚ Concepts such as year-to-date and

period-over-period comparisons
should be easily defined.
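
As a rough illustration of these two time-intelligence concepts, the sketch below computes a year-to-date running total and a period-over-period percentage change. The dictionary `monthly_revenue` and its figures are hypothetical, and a real OLAP tool would normally expose such calculations declaratively rather than as hand-written code.

```python
from itertools import accumulate

# Hypothetical revenue for January to March of two consecutive years.
monthly_revenue = {2003: [100, 110, 120], 2004: [120, 121, 150]}

def year_to_date(values):
    """Running total from the start of the year."""
    return list(accumulate(values))

def period_over_period(current, previous):
    """Percentage change of each period against the same period a year earlier."""
    return [(c - p) / p * 100 for c, p in zip(current, previous)]

ytd_2004 = year_to_date(monthly_revenue[2004])
growth = period_over_period(monthly_revenue[2004], monthly_revenue[2003])
```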
OLAP Benefits
❚ Increased productivity of end-users.
❚ Reduced backlog of applications
development for IT staff.
❚ Retention of organizational control
over the integrity of corporate data.
❚ Reduced query drag and network
traffic on OLTP systems or on the
data warehouse.
❚ Improved potential revenue and
Representation of Multi-dimensional Data
❚ Example of two-dimensional query.
❙ What is the total revenue generated by
property sales in each city, in each quarter of

❚ Choice of representation is based on

types of queries end-user may ask.

❚ Compare representation - three-field

relational table versus two-
dimensional matrix.
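
The two representations hold the same data, which a short pivot makes concrete. The sketch below assumes hypothetical (city, quarter, revenue) rows with invented values and turns the three-field table into a two-dimensional matrix.

```python
# Three-field relational representation: (city, quarter, revenue).
# Values are illustrative only.
rows = [
    ("Glasgow", "Q1", 100), ("Glasgow", "Q2", 120),
    ("London", "Q1", 200), ("London", "Q2", 210),
]

def to_matrix(rows):
    """Pivot (row_key, col_key, value) triples into a two-dimensional matrix."""
    cities = sorted({c for c, _, _ in rows})
    quarters = sorted({q for _, q, _ in rows})
    cell = {(c, q): v for c, q, v in rows}
    # One matrix row per city, one column per quarter; None marks a missing cell.
    matrix = [[cell.get((c, q)) for q in quarters] for c in cities]
    return cities, quarters, matrix

cities, quarters, matrix = to_matrix(rows)
```

In the matrix form, a question such as "revenue for London in Q1" becomes a direct cell lookup rather than a scan over rows.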
Multi-dimensional Data as Three-field Table versus Two-dimensional Matrix

Representation of Multi-dimensional Data
❚ Example of three-dimensional
❙ ‘What is the total revenue generated by
property sales for each type of property
(Flat or House) in each city, in each
quarter of 2004?’

❚ Compare representation - four-

field relational table versus
three-dimensional cube.
Multi-dimensional Data as Four-field Table versus Three-dimensional Cube

Representation of Multi-dimensional Data
❚ Cube represents data as cells in
an array.

❚ Relational table only represents

multi-dimensional data in two

Multi-dimensional Data
❚ Measure - sales (actual, plan, variance)
❚ Dimensions: Product, Region, Time
❚ Hierarchical summarization paths
❙ Product: Industry, Category, Product
❙ Region: Country, Region, City, Office
❙ Time: Year, Quarter, Month/Week, Day

[Figure: data cube with Product, Region, and Month axes]
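
Summarizing along one of these hierarchical paths (often called a roll-up) can be sketched as below. The city-to-region mapping and the sales figures are hypothetical, chosen only to show the mechanics of aggregating one level up.

```python
from collections import defaultdict

# Hypothetical mapping from City to its parent Region, plus city-level sales.
city_to_region = {"Glasgow": "Scotland", "Aberdeen": "Scotland", "London": "England"}
city_sales = {"Glasgow": 100, "Aberdeen": 50, "London": 200}

def roll_up(values, parent_of):
    """Aggregate a measure one level up a dimension hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent_of[member]] += value
    return dict(totals)

region_sales = roll_up(city_sales, city_to_region)
```

Applying `roll_up` again with a region-to-country mapping would move one more step up the same path.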

Strengths of OLAP
❚ It is a powerful
visualization tool
❚ It provides fast,
interactive response
❚ It is good for analyzing
time series
❚ It can be useful to find
some clusters and
❚ Many vendors offer OLAP
OLAP and Executive
Information Systems
❚ Andyne Computing -- Pablo
❚ Arbor Software -- Essbase
❚ Cognos -- PowerPlay
❚ Comshare -- Commander OLAP
❚ Holistic Systems --
❚ Information Advantage
❚ Informix -- Metacube
❚ Microstrategies --
❚ Oracle -- Express
❚ Pilot -- LightShip
❚ Planning Sciences -- Gentium
❚ Platinum Technology -- ProdeaBeacon, Forest & Trees
❚ SAS Institute --
❚ Speedware -- Media
Part 3: Data Mining

Data Mining
❚ The process of extracting valid,
previously unknown, comprehensible,
and actionable information from large
databases and using it to make
crucial business decisions,

❚ Involves the analysis of data and the

use of software techniques for
finding hidden and unexpected
patterns and relationships in sets of
data.
Data Mining
❚ Reveals information that is hidden
and unexpected, as there is little value in
finding patterns and relationships
that are already intuitive.

❚ Patterns and relationships are

identified by examining the
underlying rules and features in the

Data Mining
❚ Most accurate results normally
require large volumes of data to
deliver reliable conclusions.

❚ Starts by developing an optimal

representation of structure of
sample data

Data Mining
❚ Data mining can provide huge
paybacks for companies who
have made a significant
investment in data warehousing.

❚ Relatively new technology;
however, it is already used in a
number of industries.

Examples of Applications of Data Mining
❚ Retail / Marketing
❙ Identifying buying patterns of
❙ Finding associations among
customer demographic
❙ Predicting response to mailing
❙ Market basket analysis
Examples of Applications of Data Mining

❚ Banking
❙ Detecting patterns of fraudulent
credit card use
❙ Identifying loyal customers
❙ Predicting customers likely to
change their credit card affiliation
❙ Determining credit card spending
by customer groups

Examples of Applications of Data Mining
❚ Insurance
❙ Claims analysis
❙ Predicting which customers will buy new

❚ Medicine
❙ Characterizing patient behavior to
predict surgery visits
❙ Identifying successful medical therapies
for different illnesses

Data Mining Operations
❚ Four main operations include:
❙ Predictive modeling
❙ Database segmentation
❙ Link analysis
❙ Deviation detection

❚ There are recognized associations

between the applications and the
corresponding operations.
❙ e.g. Direct marketing strategies use
database segmentation.

Data Mining Techniques
❚ Techniques are specific
implementations of the data
mining operations.

❚ Each operation has its own

strengths and weaknesses.

Data Mining Operations and Associated Techniques

Predictive Modeling
❚ Similar to the human learning
❙ uses observations to form a model of the
important characteristics of some

❚ Uses generalizations of ‘real world’

and ability to fit new data into a
general framework.
❚ Can analyze a database to determine
essential characteristics (model)
about the data set.
Predictive Modeling

❚ Model is developed using a

supervised learning approach, which
has two phases: training and testing.
❙ Training builds a model using a large
sample of historical data called a training set.
❙ Testing involves trying out the model on
new, previously unseen data to
determine its accuracy and physical
performance characteristics.
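
The two phases can be illustrated with a deliberately small example. The sketch below uses a one-nearest-neighbour classifier purely as a stand-in technique (it is not named in these slides), with hypothetical (income, decision) records; the first six records play the role of historical training data and the last two the role of previously unseen test data.

```python
def nearest_neighbour(train, x):
    """Predict the label of the training record whose value is closest to x."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def accuracy(train, test):
    """Fraction of test records whose label is predicted correctly."""
    hits = sum(nearest_neighbour(train, x) == label for x, label in test)
    return hits / len(test)

# Hypothetical records: (income in thousands, decision). Illustrative only.
records = [(10, "rent"), (12, "rent"), (15, "rent"), (40, "buy"),
           (45, "buy"), (50, "buy"), (11, "rent"), (48, "buy")]

# Training phase uses historical data; testing uses records held back from it.
train, test = records[:6], records[6:]
score = accuracy(train, test)
```

Keeping the test records out of the training data is the point of the split: accuracy measured on records the model has already seen would overstate how well it generalizes.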

Predictive Modeling
❚ Applications of predictive modeling
include customer retention
management, credit approval, cross
selling, and direct marketing.

❚ There are two techniques associated

with predictive modeling:
classification and value prediction,
which are distinguished by the nature
of the variable being predicted.

Example of Classification using Tree Induction

Predictive Modeling - Value Prediction
❚ Used to estimate a continuous
numeric value that is associated with
a database record.

❚ Uses the traditional statistical

techniques of linear regression and
nonlinear regression.

❚ Relatively easy-to-use and

Predictive Modeling - Value Prediction
❚ Linear regression attempts to fit a
straight line through a plot of the
data, such that the line is the best
representation of the average of all
observations at that point in the plot.

❚ Problem is that the technique only

works well with linear data and is
sensitive to the presence of outliers
(that is, data values, which do not
conform to the expected norm).
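
Both points, the straight-line fit and the sensitivity to outliers, show up in a small ordinary least squares sketch. The data values are invented for illustration; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]          # lies exactly on y = 2x
with_outlier = [2, 4, 6, 8, 40]   # last observation replaced by an outlier

slope_clean, intercept_clean = fit_line(xs, clean)
slope_outlier, intercept_outlier = fit_line(xs, with_outlier)
```

On the clean data the fitted slope is 2; changing a single observation moves the slope to 8, which is the outlier sensitivity the slide refers to.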
Predictive Modeling - Value Prediction
❚ Data mining requires statistical
methods that can accommodate
non-linearity, outliers, and non-
numeric data.

❚ Applications of value prediction

include credit card fraud
detection or target mailing list
Database Segmentation
❚ Aim is to partition a database
into an unknown number of
segments, or clusters, of similar

❚ Uses unsupervised learning to

discover homogeneous sub-
populations in a database to
improve the accuracy of the
profiles.
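
One simple way to see segmentation into a number of clusters that is not fixed in advance is a distance-threshold ("leader") clustering pass. This is only one illustrative technique, chosen here for brevity and not taken from the slides; the rent values and the threshold are hypothetical.

```python
def leader_clusters(values, threshold):
    """Assign each value to the first cluster whose leader is within
    `threshold`; otherwise start a new cluster with that value as leader.
    The number of clusters is not fixed in advance."""
    clusters = []  # list of (leader, members)
    for v in values:
        for leader, members in clusters:
            if abs(v - leader) <= threshold:
                members.append(v)
                break
        else:
            # No existing leader is close enough, so v starts a new cluster.
            clusters.append((v, [v]))
    return [members for _, members in clusters]

# Hypothetical monthly rents.
rents = [300, 320, 310, 700, 720, 1500]
segments = leader_clusters(rents, threshold=50)
```

No labels are supplied, which is what makes this unsupervised: the groups emerge from the similarity of the records alone. The result does depend on the chosen threshold and on the order of the records.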
Database Segmentation
❚ Less precise than other
operations thus less sensitive
to redundant and irrelevant

❚ Applications of database
segmentation include
customer profiling, direct
marketing, and cross selling.

Example of Database Segmentation using a Scatterplot

Link Analysis
❚ Aims to establish links
(associations) between records, or
sets of records, in a database.

❚ There are three specializations

❙ Associations discovery
❙ Sequential pattern discovery
❙ Similar time sequence discovery

❚ Applications include product

affinity analysis, direct marketing,
and stock price movement.
Link Analysis - Associations Discovery
❚ Finds items that imply the presence
of other items in the same event.

❚ Affinities between items are

represented by association rules.
❙ e.g. ‘When a customer rents property for
more than 2 years and is more than 25
years old, in 40% of cases, the customer
will buy a property. This association
happens in 35% of all customers who rent
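
In the usual terminology for association rules, the first percentage in a rule like this is its confidence and the second is its support. A minimal sketch of how both are computed, assuming hypothetical customer records encoded as sets of attribute labels (the labels and data are invented and do not reproduce the 40% and 35% figures):

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support: share of all transactions containing both item sets.
    Confidence: share of transactions containing the antecedent that
    also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical customer records, illustrative only.
customers = [
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25"},
    {"rents>2y", "age>25", "buys"},
    {"age>25"},
]
support, confidence = support_and_confidence(
    customers, {"rents>2y", "age>25"}, {"buys"})
```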
Link Analysis - Sequential Pattern Discovery
❚ Finds patterns between events
such that the presence of one
set of items is followed by
another set of items in a
database of events over a period
of time.
❙ e.g. Used to understand long term
customer buying behavior.

Link Analysis - Similar Time Sequence Discovery
❚ Finds links between two sets of
data that are time-dependent,
and is based on the degree of
similarity between the patterns
that both time series
❙ e.g. Within three months of buying
property, new home owners will
purchase goods such as cookers,
freezers, and washing machines.
Deviation Detection
❚ Relatively new operation in terms of
commercially available data mining

❚ Often a source of true discovery

because it identifies outliers, which
express deviation from some
previously known expectation and

Deviation Detection

❚ Can be performed using

statistics and visualization
techniques or as a by-product of
data mining.

❚ Applications include fraud

detection in the use of credit
cards and insurance claims,
quality control, and defects
tracing.
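
A basic statistical form of deviation detection flags values that lie far from the mean, measured in standard deviations (a z-score). The sketch below is one simple approach under assumed data; the claim amounts and the threshold of 2.0 are hypothetical choices.

```python
from statistics import mean, pstdev

def outliers(values, z_threshold=2.0):
    """Return values whose distance from the mean exceeds
    z_threshold population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical insurance claim amounts; one is far larger than the rest.
claims = [100, 105, 98, 102, 101, 99, 500]
flagged = outliers(claims)
```

A flagged value is only a candidate for investigation: it deviates from expectation, but whether it is fraud or a defect still has to be checked.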
Example of Database Segmentation using a Visualization