Illustrating Classification Task
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Learn Model

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Apply Model
Classification
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
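
The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten labeled records from "Illustrating Classification Task" are reused as the data set, the 7/3 split is arbitrary, and the majority-class "model" is a hypothetical placeholder, not the model the slides go on to build.

```python
from collections import Counter

# (Attrib1, Attrib2, Attrib3 in thousands, Class), copied from the labeled
# table in "Illustrating Classification Task".
records = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]

def split(data, n_train):
    """Use the first n_train records for training and hold out the rest."""
    return data[:n_train], data[n_train:]

def learn_majority_model(train):
    """Placeholder model (an assumption): always predict the most common class."""
    majority = Counter(row[-1] for row in train).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(row[:-1]) == row[-1] for row in test)
    return correct / len(test)

train_set, test_set = split(records, n_train=7)
model = learn_majority_model(train_set)
print(f"accuracy on held-out records: {accuracy(model, test_set):.2f}")
```

The point of the sketch is the separation of roles: the model only sees `train_set` while it is built, and `test_set` is touched only when measuring accuracy.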
Classification: Example of a Decision Tree
Splitting Attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
Apply Model to Test Data

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Decision tree
Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO

Start from the root of the tree and, at each node, follow the branch that matches the test record. Refund = No leads to the MarSt node, and MarSt = Married leads to the leaf NO.

Assign Cheat to “No”
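
The root-to-leaf walk on this slide can be written as a small function. This is a minimal sketch, assuming the record is a dictionary keyed by the node names on the slide (`Refund`, `MarSt`, `TaxInc`), that income is given in thousands, and that the leaf labels are returned as the strings "No" and "Yes". The slide labels the income branches "< 80K" and "> 80K", so the handling of exactly 80K below is an assumption.

```python
def classify(record):
    """Walk the decision tree from the root to a leaf and return the leaf label."""
    # Root node: Refund.
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status next.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K is sent to the YES leaf; the slide leaves
    # this boundary case unspecified.
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # prints "No"
```

For the test record on the slide, the income test is never reached, so the boundary assumption does not affect the result.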
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},

Implication means co-occurrence, not causality!
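
To make "co-occurrence" concrete, the sketch below counts how often each example rule holds in the five market-basket transactions. Support and confidence are the usual measures for association rules; the slides do not name them, so treat their use here, and the helper names `count`, `support`, and `confidence`, as assumptions added for illustration.

```python
# The five market-basket transactions from the table, as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", confidence(lhs, rhs))
```

Both measures are plain counts of co-occurrence, which is why a rule with high confidence still says nothing about causality.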
Spatial Association Rules
• Earth science data
• Association patterns may reveal interesting connections among the ocean, land, and atmospheric processes
• Criminal data
• Association patterns may reveal criminal behavior between people and their environment
Example of Cluster Analysis: Spatial Cluster
The 1854 Asiatic Cholera in London
A cluster whose centroid is a water pump
Cluster Analysis
• Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized
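
The two quantities named above can be computed directly for a toy example. This is a minimal sketch on hypothetical 2-D points made up for illustration; averaging pairwise Euclidean distances is one common way to summarize intra-cluster and inter-cluster distance, and is an assumption here, not a definition taken from the slides.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Hypothetical points: two well-separated groups in the plane.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))

def mean_inter_distance(c1, c2):
    """Average distance between points taken from two different clusters."""
    return mean(dist(p, q) for p, q in product(c1, c2))

print("intra A:", mean_intra_distance(cluster_a))
print("intra B:", mean_intra_distance(cluster_b))
print("inter A-B:", mean_inter_distance(cluster_a, cluster_b))
```

A good grouping of these points shows small intra-cluster values and a much larger inter-cluster value.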
Distance
• The distance between objects is not necessarily the Euclidean distance.
• Euclidean distance is the “ordinary” distance between two points that one would measure with a ruler.
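
As a small illustration of why the choice of distance matters, the sketch below computes the Euclidean distance next to one alternative. The Manhattan (city-block) distance is not mentioned in the slides; it is used here only as an assumed example of a non-Euclidean measure.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line ("ruler") distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

The same pair of points is 5 units apart under one measure and 7 under the other, so a clustering can change when the distance function changes.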
Notion of Clusters Can be Ambiguous
• Group related documents for … fluctuations

1 (Technology1-DOWN): Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

• Summarization
• Reduce the size of large data sets

Clustering precipitation in Australia
Summary
• Spatial Data Mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of spatial data.