BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Zhao Li
Spring 2009

Outline
Introduction to Clustering
Main Techniques in Clustering
Hybrid Algorithm: BIRCH
Example of the BIRCH Algorithm
Experimental Results
Conclusions
September 29, 2016

Clustering
Introduction

Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.

Main methods
  Partitioning: K-Means
  Hierarchical: BIRCH, ROCK, ...
  Density-based: DBSCAN, ...

A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity

Main Techniques (1)
Partitioning Clustering (K-Means)
Step 1
[Figure: data points with three initial centers]

K-Means Example
Step 2
[Figure: the three new centers after the 1st iteration]

K-Means Example
Step 3
[Figure: the three new centers after the 2nd iteration]
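
The K-Means steps (choose initial centers, assign points, recompute centers) can be sketched as follows. This is a minimal pure-Python illustration, not code from the original slides; the function names `assign` and `kmeans` and the stopping rule (stop once the centers no longer move) are assumptions made for the example.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center updates, starting from `centers`."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New center = coordinate-wise mean of the assigned points.
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)
```

Every iteration reads all points again, which is why the scan counts reported for K-Means in the experiments are far higher than those for BIRCH.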

Main Techniques (2)
Hierarchical Clustering
Multilevel clustering: level 1 has n clusters and level n has one cluster, or the other way around.
Agglomerative HC: starts with singleton clusters and merges them (bottom-up).
Divisive HC: starts with one cluster containing all samples and splits it (top-down).

[Figure: dendrogram]

Agglomerative HC Example
[Figures: nearest-neighbor merging, one level per slide]
Nearest Neighbor, Level 2, k = 7 clusters.
Nearest Neighbor, Level 3, k = 6 clusters.
Nearest Neighbor, Level 4, k = 5 clusters.
Nearest Neighbor, Level 5, k = 4 clusters.
Nearest Neighbor, Level 6, k = 3 clusters.
Nearest Neighbor, Level 7, k = 2 clusters.
Nearest Neighbor, Level 8, k = 1 cluster.
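
The level-by-level merging above can be sketched as a naive nearest-neighbor (single-linkage) agglomeration, where each merge reduces the cluster count by one. This is an illustrative sketch only; the function name `single_linkage` is hypothetical, and the brute-force pair search is chosen for clarity, not efficiency.

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # level 1: every point is a singleton
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest-neighbor distance: closest pair across the two clusters.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # one merge per level
    return clusters
```

The search compares every pair of points at every level and keeps all of them in memory, which illustrates the cost that hierarchical clustering incurs on large data sets.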

Remarks
                  Partitioning Clustering                  Hierarchical Clustering
Time complexity   O(n)                                     O(n^2 log n)
Pros              Easy to use and relatively efficient.    Outputs a dendrogram that is desired in many applications.
Cons              Sensitive to initialization; bad         Higher time complexity.
                  initialization might lead to bad         Need to store all data in memory.
                  results.
                  Need to store all data in memory.

Introduction to BIRCH
Designed for very large data sets
  Time and memory are limited
  Incremental and dynamic clustering of incoming objects
  Only one scan of data is necessary
  Does not need the whole data set in advance

Two key phases:
  Scans the database to build an in-memory tree
  Applies a clustering algorithm to cluster the leaf nodes

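Similarity Metric (1)
Given a cluster of N instances $P_1, \dots, P_N$, we define:
Centroid: $P_0 = \frac{1}{N} \sum_{i=1}^{N} P_i$
Radius: average distance from member points to the centroid, $R = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert P_i - P_0 \rVert^2}$
Diameter: average pairwise distance within a cluster, $D = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \lVert P_i - P_j \rVert^2}$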
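
As a worked illustration of these three quantities, the sketch below computes them directly from the raw points. It assumes the root-mean-square forms of radius and diameter used in the original BIRCH paper; the function names are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def radius(points):
    """Root-mean-square distance from the member points to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Root-mean-square pairwise distance (needs at least two points)."""
    n = len(points)
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))
```

For the two points (0, 0) and (2, 0), the centroid is (1, 0), the radius is 1, and the diameter is 2.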

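Similarity Metric (2)
Given two clusters with $N_1$ points $P_1, \dots, P_{N_1}$ and $N_2$ points $P_{N_1+1}, \dots, P_{N_1+N_2}$, and centroids $P_{01}$ and $P_{02}$:
D0, centroid Euclidean distance: $D_0 = \lVert P_{01} - P_{02} \rVert_2$
D1, centroid Manhattan distance: $D_1 = \lVert P_{01} - P_{02} \rVert_1$
D2, average inter-cluster distance: $D_2 = \sqrt{\frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} \lVert P_i - P_j \rVert^2}$
D3, average intra-cluster distance: $D_3 = \sqrt{\frac{1}{(N_1+N_2)(N_1+N_2-1)} \sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} \lVert P_i - P_j \rVert^2}$
D4, variance increase: $D_4 = \sum_{k=1}^{N_1+N_2} \lVert P_k - P_0 \rVert^2 - \sum_{i=1}^{N_1} \lVert P_i - P_{01} \rVert^2 - \sum_{j=N_1+1}^{N_1+N_2} \lVert P_j - P_{02} \rVert^2$, where $P_0$ is the centroid of the merged cluster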
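
To make the first three inter-cluster distances concrete, here is a small sketch that evaluates the two centroid distances and the average inter-cluster distance from raw points. The function names `d0`, `d1`, `d2` are hypothetical, and the formulas follow the definitions in the original BIRCH paper as I understand them.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def d0(a, b):
    """Centroid Euclidean distance between clusters a and b."""
    return math.dist(centroid(a), centroid(b))

def d1(a, b):
    """Centroid Manhattan distance between clusters a and b."""
    return sum(abs(x - y) for x, y in zip(centroid(a), centroid(b)))

def d2(a, b):
    """Average inter-cluster distance: RMS distance over all cross pairs."""
    total = sum(math.dist(p, q) ** 2 for p in a for q in b)
    return math.sqrt(total / (len(a) * len(b)))
```

Computing D2 this way touches every cross pair of points; the clustering feature introduced next avoids that cost.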

Clustering Feature
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined as follows:
$LS = \sum_{i=1}^{N} P_i$
$SS = \sum_{i=1}^{N} \lVert P_i \rVert^2$
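
The sketch below shows how a CF triple could be built from a set of points and how the centroid and radius can then be recovered from the triple alone. The function names are hypothetical; the radius identity used, R^2 = SS/N - ||LS/N||^2, follows from expanding the squared distance to the centroid.

```python
import math

def clustering_feature(points):
    """Summarize the points as the triple (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))   # linear sum (a vector)
    ss = sum(x * x for p in points for x in p)     # sum of squared norms (a scalar)
    return n, ls, ss

def cf_centroid(cf):
    """Centroid = LS / N."""
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    """Radius from the CF alone: sqrt(SS/N - ||LS/N||^2)."""
    n, ls, ss = cf
    squared = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors
```

The triple has a fixed size regardless of how many points it summarizes, which is what lets the tree stay in memory.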

Properties of Clustering Feature
A CF entry is more compact
  Stores significantly less than all of the data points in the subcluster
A CF entry has sufficient information to calculate D0 through D4
The additivity theorem allows us to merge subclusters incrementally and consistently
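
A minimal sketch of the additivity property: merging two subclusters amounts to adding their CF triples component-wise, and distances such as D0 and D2 can be computed from the triples without revisiting the points. The function names are hypothetical, and CF triples are assumed to be plain tuples `(N, LS, SS)` with a scalar SS.

```python
import math

def merge(cf1, cf2):
    """Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

def cf_d0(cf1, cf2):
    """Centroid Euclidean distance computed from two CF triples."""
    n1, ls1, _ = cf1
    n2, ls2, _ = cf2
    return math.dist([x / n1 for x in ls1], [x / n2 for x in ls2])

def cf_d2(cf1, cf2):
    """Average inter-cluster distance computed from two CF triples."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    dot = sum(a * b for a, b in zip(ls1, ls2))
    squared = (n2 * ss1 + n1 * ss2 - 2 * dot) / (n1 * n2)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors
```

For example, the clusters {(0, 0), (0, 2)} and {(3, 0), (3, 2)} have CF triples (2, (0, 2), 4) and (2, (6, 2), 22); their merged CF is (4, (6, 4), 26), the same triple obtained by summarizing all four points directly.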

CF Tree
Each nonleaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies threshold T
Node size is determined by the dimensionality of the data space and the input parameter P (page size)

CF Tree Insertion

Recurse down from the root, find the appropriate leaf
  Follow the "closest" CF path, w.r.t. D0 / ... / D4

Modify the leaf
  If the closest CF leaf entry cannot absorb the new point, make a new CF entry. If there is no room for a new leaf, split the parent node

Traverse back
  Update the CFs on the path or split nodes
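
The "modify the leaf" step can be sketched for a single leaf as follows. This is a simplified illustration, not the full tree algorithm: it omits the descent through nonleaf nodes and the actual split. It assumes D0 is used to pick the closest entry and that the threshold T bounds the radius of an entry (the paper allows either radius or diameter). All function names are hypothetical.

```python
import math

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    n, ls, ss = cf
    squared = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(squared, 0.0))

def merge(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

def insert_point(leaf, point, threshold):
    """Absorb `point` into the closest CF entry of `leaf` if the merged
    entry still satisfies the threshold; otherwise start a new entry."""
    new_cf = (1, tuple(point), sum(x * x for x in point))
    if leaf:
        closest = min(range(len(leaf)),
                      key=lambda i: math.dist(cf_centroid(leaf[i]), point))
        merged = merge(leaf[closest], new_cf)
        if cf_radius(merged) <= threshold:
            leaf[closest] = merged   # absorbed: only the CF is updated
            return leaf
    leaf.append(new_cf)              # cannot absorb: make a new CF entry
    return leaf

def needs_split(leaf, max_entries):
    """True when the leaf holds more than L entries and must be split."""
    return len(leaf) > max_entries
```

With T = 0.5, inserting (0, 0), (0.2, 0), and (5, 5) into an empty leaf yields two entries: the first two points are absorbed into one CF, and the third starts its own.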

CF Tree Rebuilding
If we run out of space, increase the threshold T
  By increasing the threshold, CFs absorb more data
Rebuilding "pushes" CFs over
  The larger T allows different CFs to group together
Reducibility theorem
  Increasing T will result in a CF tree smaller than the original
  Rebuilding needs at most h extra pages of memory, where h is the height of the original tree
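
The effect of a larger threshold can be illustrated with a flat simplification: reinsert the existing leaf entries (CF triples, not raw points) under the new T and let close entries group together. The real rebuilding procedure works path by path on the tree; this sketch, with the hypothetical name `rebuild`, only shows why the number of entries cannot grow.

```python
import math

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    n, ls, ss = cf
    squared = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(squared, 0.0))

def merge(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

def rebuild(entries, threshold):
    """Reinsert CF entries under `threshold`, merging an entry into its
    closest neighbor whenever the merged radius stays within the threshold."""
    rebuilt = []
    for cf in entries:
        if rebuilt:
            closest = min(range(len(rebuilt)),
                          key=lambda i: math.dist(cf_centroid(rebuilt[i]),
                                                  cf_centroid(cf)))
            merged = merge(rebuilt[closest], cf)
            if cf_radius(merged) <= threshold:
                rebuilt[closest] = merged
                continue
        rebuilt.append(cf)
    return rebuilt
```

Because only CF triples are reinserted, rebuilding does not require rescanning the original data.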

Example of BIRCH
New subcluster: sc8
[Figure: subclusters sc1 to sc8 grouped into leaf nodes LN1, LN2, and LN3 under Root]

Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
[Figure: LN1 split into two leaf nodes; Root now points to both halves of LN1, LN2, and LN3]

If the branching factor of a nonleaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
[Figure: Root split into nonleaf nodes NLN1 and NLN2, which point to the leaf nodes holding sc1 to sc8]

BIRCH Overview

Experimental Results
Input parameters:
  Memory (M): 5% of data set
  Disk space (R): 20% of M
  Distance equation: D2
  Quality equation: weighted average diameter (D)
  Initial threshold (T): 0.0
  Page size (P): 1024 bytes

Experimental Results
K-Means clustering
DS   Time   D      #Scan    DS   Time   D      #Scan
1    43.9   2.09   289      1o   33.8   1.97   197
2    13.2   4.43   51       2o   12.7   4.20   29
3    32.9   3.66   187      3o   36.0   4.35   241

BIRCH clustering
DS   Time   D      #Scan    DS   Time   D      #Scan
1    11.5   1.87   2        1o   13.6   1.87   2
2    10.7   1.99   2        2o   12.1   1.99   2
3    11.4   3.95            3o   12.2   3.99

Conclusions
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
Given a limited amount of main memory, BIRCH can minimize the time required for I/O.
BIRCH is a clustering algorithm that scales with the number of objects and produces good-quality clustering of the data.

Exam Questions

What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

Exam Questions

Name the two algorithms in BIRCH clustering:
  CF Tree Insertion
  CF Tree Rebuilding

What is the purpose of phase 4 in BIRCH?
Do additional passes over the data set and reassign data points to the closest centroid.
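
One such additional pass can be sketched as follows: every original data point is reassigned to its closest centroid, and the centroids are then recomputed from the new assignment. The function name `refine` is hypothetical, and the centroids passed in are assumed to come from the earlier clustering of the leaf entries.

```python
import math

def refine(points, centroids):
    """One refinement pass: relabel each point by its closest centroid,
    then recompute each centroid from the points assigned to it."""
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(col) / len(members)
                                       for col in zip(*members)))
        else:
            new_centroids.append(tuple(old))  # no members: keep the old centroid
    return labels, new_centroids
```

Each call costs one scan of the data, so the number of refinement passes trades running time against clustering quality.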

Q&A
Thank you for your patience.
Good luck on the final exam!
