Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Zhao Li
Spring 2009
Outline
Introduction to Clustering
Main Techniques in Clustering
Hybrid Algorithm: BIRCH
Example of the BIRCH Algorithm
Experimental Results
Conclusions
Clustering: Introduction
Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space.
Main methods:
Partitioning: K-Means
Hierarchical: BIRCH, ROCK, ...
Density-based: DBSCAN, ...
A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity
Main Techniques (1)
Partitioning Clustering (K-Means)
K-Means Example, Step 1: [Figure: three initial centers are chosen.]
K-Means Example, Step 2: [Figure: new centers after the 1st iteration.]
K-Means Example, Step 3: [Figure: new centers after the 2nd iteration.]
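To make the three steps above concrete, here is a minimal K-Means sketch in Python (NumPy assumed); the toy data, the choice of k = 3, and the fixed iteration count are illustrative and not part of the original slides.

import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd-style K-Means: pick initial centers, then alternate between
    assigning points to the closest center and recomputing the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# toy usage: three well-separated blobs
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(X, k=3)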
Main Techniques (2)
Hierarchical Clustering
Multilevel clustering: level 1 has n clusters, level n has one cluster, or upside down.
Agglomerative HC: starts with singletons and merges clusters (bottom-up).
Divisive HC: starts with one sample and splits clusters (top-down).
[Figure: dendrogram]
Agglomerative HC Example
Nearest Neighbor, Level 2, k = 7 clusters.
Nearest Neighbor, Level 3, k = 6 clusters.
Nearest Neighbor, Level 4, k = 5 clusters.
Nearest Neighbor, Level 5, k = 4 clusters.
Nearest Neighbor, Level 6, k = 3 clusters.
Nearest Neighbor, Level 7, k = 2 clusters.
Nearest Neighbor, Level 8, k = 1 cluster.
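The nearest-neighbor merging shown above can be illustrated with a naive single-linkage sketch in Python; the helper name single_linkage and the random toy data are assumptions for illustration, not the code behind these slides.

import numpy as np

def single_linkage(X, k):
    """Naive agglomerative clustering: start with each point as its own cluster
    and repeatedly merge the two clusters whose closest members are nearest to
    each other, until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters

# merging from n singleton clusters down to k = 2, as in the Level 7 example above
X = np.random.rand(9, 2)
print(single_linkage(X, k=2))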
Remarks

Partitioning Clustering
Time complexity: O(n)
Pros: easy to use and relatively efficient.
Cons: sensitive to initialization; bad initialization might lead to bad results; needs to store all data in memory.

Hierarchical Clustering
Time complexity: O(n^2 log n)
Pros: outputs a dendrogram that is desired in many applications.
Cons: higher time complexity; needs to store all data in memory.
Introduction to BIRCH
Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of the data is necessary
Does not need the whole data set in advance
Two key phases:
Scans the database to build an in-memory CF tree
Applies a clustering algorithm to cluster the leaf nodes
Similarity Metric (1)
Given a cluster of instances, we define:
Centroid: the mean of the member points
Radius: average distance from member points to the centroid
Diameter: average pairwise distance within a cluster
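The formulas themselves did not survive extraction; for reference, the usual definitions from the BIRCH paper for a cluster of N points \vec{X_i} with centroid \vec{X_0} are:

\vec{X_0} = \frac{1}{N} \sum_{i=1}^{N} \vec{X_i}
R = \left( \frac{1}{N} \sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2 \right)^{1/2}
D = \left( \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2 \right)^{1/2}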
Similarity Metric (2)
D0: centroid Euclidean distance
D1: centroid Manhattan distance
D2: average inter-cluster distance
D3: average intra-cluster distance
D4: variance increase distance
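The slide's formulas were likewise lost; the following are reconstructed from the BIRCH paper (treat the exact notation as a reference rather than the slide's own), for two clusters with N_1 and N_2 points, centroids \vec{X_{01}} and \vec{X_{02}}, and merged centroid \vec{X_0^{1+2}}:

D0 = \left( (\vec{X_{01}} - \vec{X_{02}})^2 \right)^{1/2}
D1 = \sum_{i=1}^{d} \left| X_{01}^{(i)} - X_{02}^{(i)} \right|
D2 = \left( \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2 \right)^{1/2}
D3 = \left( \frac{1}{(N_1+N_2)(N_1+N_2-1)} \sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2 \right)^{1/2}
D4 = \sum_{k=1}^{N_1+N_2} (\vec{X_k} - \vec{X_0^{1+2}})^2 \; - \; \sum_{i=1}^{N_1} (\vec{X_i} - \vec{X_{01}})^2 \; - \; \sum_{j=N_1+1}^{N_1+N_2} (\vec{X_j} - \vec{X_{02}})^2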
Clustering Feature
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined as follows:
LS = \sum_{i=1}^{N} \vec{P_i}, \qquad SS = \sum_{i=1}^{N} \vec{P_i}^{\,2}
Properties of Clustering Feature
A CF entry is more compact:
Stores significantly less than all of the data points in the subcluster.
A CF entry has sufficient information to calculate D0-D4.
The additivity theorem allows us to merge subclusters incrementally and consistently (sketched below).
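A minimal Python sketch of a CF entry and the additivity theorem; the class name CF, the scalar representation of SS (sum of squared norms), and the helper d0 are illustrative choices rather than the paper's own data structures.

import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) for a subcluster of d-dimensional points."""
    def __init__(self, d):
        self.n = 0                 # N: number of points
        self.ls = np.zeros(d)      # LS: linear sum of the points
        self.ss = 0.0              # SS: sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, computable from (N, LS, SS) alone
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

def d0(a, b):
    """Centroid Euclidean distance between two CF entries."""
    return float(np.linalg.norm(a.centroid() - b.centroid()))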
CF Tree
Each non-leaf node has at most B entries.
Each leaf node has at most L CF entries, each of which satisfies threshold T.
Node size is determined by the dimensionality of the data space and the input parameter P (page size).
CF Tree Insertion
Recurse down from the root to find the appropriate leaf:
Follow the "closest" CF path, w.r.t. D0, ..., D4.
Modify the leaf:
If the closest CF leaf entry cannot absorb the new object, make a new CF entry. If there is no room for the new leaf entry, split the parent node.
Traverse back:
Update the CFs on the path, or split nodes. (A sketch of the leaf-level rule follows.)
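A sketch of the leaf-level absorb-or-create rule, reusing the CF class and d0 helper from the earlier sketch; node splitting and the upward CF updates are only reported to the caller here, so this is a simplified illustration rather than the paper's full insertion algorithm.

def try_insert_into_leaf(leaf_entries, point, threshold, max_entries):
    """Absorb the point into the closest CF entry if the resulting radius stays
    within threshold T, otherwise open a new CF entry; return True if the leaf
    overflowed (in which case the caller would split it and update the parents)."""
    new_cf = CF(d=len(point))
    new_cf.add_point(point)
    if leaf_entries:
        closest = min(leaf_entries, key=lambda cf: d0(cf, new_cf))
        trial = CF(d=len(point))
        trial.merge(closest)
        trial.merge(new_cf)
        if trial.radius() <= threshold:    # the closest CF can absorb the point
            closest.merge(new_cf)
            return False                   # no overflow, nothing to split
    leaf_entries.append(new_cf)            # otherwise start a new CF entry
    return len(leaf_entries) > max_entries # True => caller splits this leaf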
CF Tree Rebuilding
If we run out of space, increase the threshold T:
By increasing the threshold, CF entries absorb more data.
Rebuilding "pushes" CF entries over: the larger T allows different CF entries to group together.
Reducibility theorem:
Increasing T will result in a CF tree smaller than the original.
Rebuilding needs at most h extra pages of memory, where h is the height of the tree. (A sketch follows.)
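A simplified view of the effect of raising T, again reusing the CF sketch above: re-inserting the existing leaf entries with a larger threshold lets nearby entries absorb each other, so the rebuilt set is never larger than the original. The real algorithm rebuilds the tree path by path within h extra pages; this flat version only illustrates the idea.

def rebuild(leaf_entries, new_threshold):
    """Re-insert existing CF entries under a larger threshold T; with the larger
    T, close-by CF entries merge, shrinking (or at worst preserving) the size."""
    rebuilt = []
    for cf in leaf_entries:
        if rebuilt:
            closest = min(rebuilt, key=lambda other: d0(other, cf))
            trial = CF(d=len(cf.ls))
            trial.merge(closest)
            trial.merge(cf)
            if trial.radius() <= new_threshold:
                closest.merge(cf)          # absorbed by an existing entry
                continue
        rebuilt.append(cf)                 # kept as its own entry
    return rebuilt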
Example of BIRCH
[Figure: a new subcluster sc8 is inserted into a CF tree whose root points to leaf nodes LN1, LN2, and LN3 holding subclusters sc1-sc7; sc8 is closest to LN1.]
Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
[Figure: LN1 is split into two leaf nodes, and the root now holds entries for both new leaves as well as LN2 and LN3.]
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
[Figure: the root's entries are redistributed under two new non-leaf nodes, NLN1 and NLN2, which become children of a new root.]
BIRCH Overview
[Figure: the phases of BIRCH. Phase 1: scan the data and build the initial in-memory CF tree; Phase 2 (optional): condense the tree into a smaller one; Phase 3: global clustering of the leaf entries; Phase 4 (optional): cluster refinement with additional passes over the data.]
Experimental Results
Input parameters:
Memory (M): 5% of the data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Experimental Results

KMEANS clustering
DS    Time    D       #Scan        DS    Time    D       #Scan
1     43.9    2.09    289          1o    33.8    1.97    197
2     13.2    4.43    51           2o    12.7    4.20    29
3     32.9    3.66    187          3o    36.0    4.35    241

BIRCH clustering
DS    Time    D       #Scan        DS    Time    D       #Scan
1     11.5    1.87    2            1o    13.6    1.87    2
2     10.7    1.99    2            2o    12.1    1.99    2
3     11.4    3.95    2            3o    12.2    3.99    2
Conclusions
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
Given a limited amount of main memory, BIRCH can minimize the time required for I/O.
BIRCH is a scalable clustering algorithm with respect to the number of objects, and produces good quality clusters.
Exam Questions
What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
Exam Questions
Name the two algorithms in BIRCH clustering:
CF Tree Insertion
CF Tree Rebuilding
What is the purpose of Phase 4 in BIRCH?
To do additional passes over the data set and reassign data points to the closest centroids.
Q & A
Thank you for your patience.
Good luck on the final exam!