
Introduction to Spatial Data Mining

Instructor: Wei Ding


CS 697 Spatial Data Mining
Spring 2009

Spatial Data Mining:

Find interesting, potentially useful, non-trivial patterns in spatial data.
Patterns:
• classification
• association analysis
• cluster analysis
Spatial attribute – geographically referenced
• Neighborhood and extent
• Location (longitude, latitude, elevation, etc.)

Illustrating Classification Task

Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification
• Given a collection of records (training set)
  • Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
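A minimal sketch of the hold-out procedure described above, in Python. It reuses the Class column of records 1 to 10 from the training table. The 7/3 split and the majority-class "model" are illustrative assumptions, not something the slides prescribe; any real classifier could take the place of `majority_class`.

```python
from collections import Counter

# Class labels of Tid 1-10, copied from the training table on the slides.
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]


def split(records, n_train):
    """Hold-out split: the first n_train records train, the rest test."""
    return records[:n_train], records[n_train:]


def majority_class(train_labels):
    """A deliberately trivial model: always predict the most common class."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Assumption: a 7/3 split chosen only for illustration.
train, test = split(labels, 7)
model = majority_class(train)
predictions = [model] * len(test)
test_accuracy = accuracy(predictions, test)
print(model, test_accuracy)
```

Because the model never sees the test labels, `test_accuracy` estimates how it would behave on previously unseen records.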

Classification: Example of a Decision Tree

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
  Yes → NO
  No → MarSt
    Married → NO
    Single, Divorced → TaxInc
      < 80K → NO
      > 80K → YES
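The decision tree on this slide can be written as a small Python function and checked against the ten training records. This is only a sketch: the function name `classify` is hypothetical, and the slide labels the income branches "< 80K" and "> 80K" without saying where exactly 80K belongs, so sending 80K to the YES branch is an assumption.

```python
def classify(refund, marital_status, taxable_income_k):
    """Predict the Cheat label by walking the slide's tree from the root."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: an income of exactly 80K goes to the YES branch.
    return "No" if taxable_income_k < 80 else "Yes"


# (Refund, Marital Status, Taxable Income in K, Cheat) for Tid 1-10.
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Records whose predicted label disagrees with the recorded Cheat value.
errors = [row for row in training if classify(*row[:3]) != row[3]]
print(len(errors))
```

An empty `errors` list means the tree reproduces every label in the training data, which says nothing yet about its accuracy on unseen records.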



Decision Tree Classification Task

Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Model: Decision Tree

Test set (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Apply Model to Test Data
Start from the root of the tree.

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund
  Yes → NO
  No → MarSt
    Married → NO
    Single, Divorced → TaxInc
      < 80K → NO
      > 80K → YES


Apply Model to Test Data

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund
  Yes → NO
  No → MarSt
    Married → NO
    Single, Divorced → TaxInc
      < 80K → NO
      > 80K → YES

The record follows Refund = No, then MarSt = Married, and reaches the leaf NO.
Assign Cheat to “No”
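The walk from the root to a leaf can be sketched in Python with the tree stored as data, so that the path taken by the test record is recorded along the way. The node layout, the name `apply_model`, and the ">= 80K" branch key are assumptions made for this illustration (the slide does not say which branch an income of exactly 80K takes).

```python
# Internal node: (attribute, function mapping a value to a branch key, children).
# Leaf: a plain string holding the class label.
tax_node = (
    "TaxInc",
    # Assumption: exactly 80K is sent to the YES branch.
    lambda income_k: "< 80K" if income_k < 80 else ">= 80K",
    {"< 80K": "NO", ">= 80K": "YES"},
)
marst_node = (
    "MarSt",
    lambda status: "Married" if status == "Married" else "Single, Divorced",
    {"Married": "NO", "Single, Divorced": tax_node},
)
root = ("Refund", lambda refund: refund, {"Yes": "NO", "No": marst_node})


def apply_model(node, record):
    """Start from the root and follow one branch per test until a leaf."""
    path = []
    while not isinstance(node, str):
        attribute, branch_of, children = node
        branch = branch_of(record[attribute])
        path.append(f"{attribute} = {branch}")
        node = children[branch]
    return node, path


# The test record from the slide: Refund = No, Married, 80K.
record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = apply_model(root, record)
print(path, label)
```

For this record the TaxInc test is never reached, so the 80K boundary assumption does not affect the result.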

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},

Implication means co-occurrence, not causality!
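Co-occurrence can be counted directly from the five transactions. The sketch below computes support and confidence for the three example rules; both are standard measures from association analysis that this slide does not define, so treat the definitions in the comments as an added assumption rather than the lecture's own notation.

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return support_count(lhs | rhs) / len(transactions)


def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support_count(lhs | rhs) / support_count(lhs)


rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

On this data {Diaper} → {Beer} holds in 3 of the 4 transactions that contain Diaper, while {Milk, Bread} → {Eggs, Coke} is never satisfied: no transaction contains all four items.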


Interpret an Association Rule

• {Diaper} → {Beer}
  • The rule suggests that a strong relationship exists between the sale of diapers and beer because many customers who buy diapers also buy beer.
  • Retailers can use this type of rule to help them identify new opportunities for cross-selling their products to customers.

Spatial Association Rules
• Earth science data
  • Association patterns may reveal interesting connections among the ocean, land, and atmospheric processes

• Criminal data
  • Association patterns may reveal criminal behavior between people and their environment


Evaluation of Association Patterns

• Association analysis algorithms have the potential to generate a large number of patterns
  • Thousands or even millions of patterns from real commercial databases
• Problem: How to identify the most interesting ones?
  • One person’s trash might be another person’s treasure
• Objective interestingness measure: statistical arguments
• Subjective interestingness measure: subjective arguments
  • {Butter} → {Bread}, not interesting because it is expected
  • {Diaper} → {Beer}, interesting because it is quite unexpected
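As one possible example of an objective, statistics-based measure, the sketch below computes lift on the market-basket transactions from the earlier slide. Lift is not named in these slides; it is used here as an assumption to make "statistical arguments" concrete. A lift above 1 means the two sides co-occur more often than they would if they were independent, and a lift below 1 means less often.

```python
from fractions import Fraction

# Market-basket transactions from the Association Rule Mining slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def prob(itemset):
    """Exact fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def lift(lhs, rhs):
    """Observed co-occurrence divided by co-occurrence expected under independence."""
    return prob(lhs | rhs) / (prob(lhs) * prob(rhs))


diaper_beer = lift({"Diaper"}, {"Beer"})
beer_bread_milk = lift({"Beer", "Bread"}, {"Milk"})
print(diaper_beer, beer_bread_milk)
```

Exact fractions are used instead of floats so the results can be compared without rounding error. A subjective judgment such as "expected" versus "unexpected" cannot be computed this way; it depends on what the analyst already knows.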

Example of Cluster Analysis: Spatial Cluster
The 1854 Asiatic Cholera in London

A cluster whose centroid is a water pump


Cluster Analysis
• Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.

• Clustering for understanding (be meaningful): the clusters should capture the natural structure of the data.

• Clustering for utility (be useful): cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside.

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
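The two quantities can be measured on a toy example. In the Python sketch below, the six 2-D points and their assignment to clusters "A" and "B" are made up for illustration, and averaging pairwise Euclidean distances is just one of several reasonable ways to summarize intra-cluster and inter-cluster distance.

```python
from itertools import combinations
from math import hypot

# Hypothetical 2-D points, already assigned to two clusters.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return hypot(p[0] - q[0], p[1] - q[1])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_intra(points):
    """Average distance over all pairs of points inside one cluster."""
    return mean(distance(p, q) for p, q in combinations(points, 2))


def mean_inter(points_a, points_b):
    """Average distance over all pairs with one point in each cluster."""
    return mean(distance(p, q) for p in points_a for q in points_b)


intra_a = mean_intra(clusters["A"])
intra_b = mean_intra(clusters["B"])
inter_ab = mean_inter(clusters["A"], clusters["B"])
print(intra_a, intra_b, inter_ab)
```

A good grouping shows small intra-cluster values relative to the inter-cluster value, which is the case for these two well-separated groups.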


Distance
• Distance is not necessarily the Euclidean distance.
  • Euclidean distance is the “ordinary” distance between two points that one would measure with a ruler.

• There are different measures of distance with respect to different views.
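A short Python sketch of how the choice of measure changes which point counts as "closest". Manhattan (city-block) distance is used as the alternative to Euclidean distance; the slides do not name a specific alternative, so that choice and the three sample points are assumptions for illustration.

```python
from math import hypot


def euclidean(p, q):
    """Straight-line ("ruler") distance between two 2-D points."""
    return hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


# Hypothetical points chosen so the two measures disagree.
origin = (0.0, 0.0)
a = (3.0, 3.0)
b = (0.0, 4.5)

nearest_euclidean = min((a, b), key=lambda p: euclidean(origin, p))
nearest_manhattan = min((a, b), key=lambda p: manhattan(origin, p))
print(nearest_euclidean, nearest_manhattan)
```

Point `a` is nearer to the origin by Euclidean distance, while point `b` is nearer by Manhattan distance, so a clustering built on one measure can group points differently from a clustering built on the other.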

Notion of Clusters Can be Ambiguous

How many clusters?
[Figure: the same points grouped as Two Clusters, Four Clusters, and Six Clusters]

Different measures of distance!


Applications of Cluster Analysis

• Understanding
  • Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered Clusters (Industry Group in parentheses):
1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

• Summarization
  • Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

Summary
• Spatial Data Mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of spatial data.

• It has opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways.

