
K in K-Means?

SP-2155
Andrés Mora Zúñiga
K-means
Is: a method.

Goal: to partition a set of n observations into k groups, where each observation belongs to the group whose mean value is closest.

Use: clustering.

Drawback: the value of K must be provided.
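The method above can be sketched as a minimal Lloyd's-algorithm implementation in pure Python. This is an illustrative sketch, not a production implementation: the naive "first k points" initialization and the tuple-based point representation are assumptions for the example (real implementations use smarter seeding such as k-means++).

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate between assigning each point to the
    nearest centre and moving each centre to the mean of its points."""
    # Naive initialization with the first k points (illustrative only;
    # k-means++ seeding is preferred in practice).
    centres = [tuple(map(float, p)) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the centre with the smallest squared distance
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            clusters[j].append(p)
        # Recompute each centre as the mean of its assigned points
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]
        if new == centres:  # converged: centres no longer move
            break
        centres = new
    return centres, clusters

# Two well-separated blobs are recovered with K = 2:
centres, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
print(sorted(centres))  # [(0.0, 0.5), (10.0, 10.5)]
```

Note that K itself is an input, which is exactly the drawback the rest of these slides address.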


K-means
The term "k-means" was first used by James MacQueen in 1967.

The idea goes back to Hugo Steinhaus in 1957.

The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, although it was not published outside Bell Labs until 1982.

In 1965, E. W. Forgy published essentially the same method, which is why it is sometimes also called Lloyd-Forgy.

A more efficient version was proposed and published in Fortran by Hartigan and Wong in 1975 and 1979.
Metrics for K
Unsupervised

Data driven

Probability

Supervised

Require providing ground truth

Unsupervised metrics
Elbow Method

$$D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \lVert x_i - x_j \rVert^2 = 2 n_k \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 \quad (11)$$

$$W_k = \sum_{j=1}^{k} \frac{1}{2 n_j} D_j \quad (12)$$

Plot W_k against k and choose the value at the "elbow", where increasing k no longer reduces W_k substantially.
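The two forms of D_k in equation (11), the all-pairs sum and the centroid form, can be checked numerically. A small sketch (the cluster coordinates are made up for illustration):

```python
def pairwise_dk(cluster):
    # D_k as the sum of squared distances over all ordered pairs (x_i, x_j)
    return sum(sum((a - b) ** 2 for a, b in zip(p, q))
               for p in cluster for q in cluster)

def centroid_dk(cluster):
    # Equivalent form: 2 * n_k * sum of squared distances to the cluster mean
    n = len(cluster)
    mu = [sum(c) / n for c in zip(*cluster)]
    return 2 * n * sum(sum((a - m) ** 2 for a, m in zip(p, mu)) for p in cluster)

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(pairwise_dk(cluster), centroid_dk(cluster))  # both 48.0
```

The centroid form is what makes W_k cheap to compute from a K-means run, since the cluster means are already available.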
Calinski-Harabaz Index

$$D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \lVert x_i - x_j \rVert^2 = 2 n_k \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

$$T = \sum_{i=1}^{K} \lVert \mu_i - \mu \rVert^2$$

$$J = \frac{1}{T} \sum_{k=1}^{K} D_k$$
Gap Method

$$W_k = \sum_{j=1}^{k} \frac{1}{2 n_j} D_j \quad (14)$$

$$\mathrm{Gap}_n(k) = E_n^* \{\log W_k\} - \log W_k \quad (15)$$

$$s_k = \sqrt{1 + 1/B}\; \mathrm{sd}(k) \quad (16)$$

$$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \quad (17)$$

where E_n^* is estimated from B reference data sets and sd(k) is the standard deviation of their log W_k values. Choose the smallest k for which (17) holds.
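Criterion (17) can be applied mechanically once Gap(k) and s_k have been estimated from reference data sets. A sketch of just the selection step (the gap and s_k values below are made-up illustrations, not computed from data):

```python
def choose_k_by_gap(gaps, sks):
    """Smallest k (1-indexed) with Gap(k) >= Gap(k+1) - s_{k+1}, per eq. (17).
    gaps[k-1] holds Gap(k); sks[k-1] holds s_k."""
    for k in range(len(gaps) - 1):
        if gaps[k] >= gaps[k + 1] - sks[k + 1]:
            return k + 1
    return len(gaps)  # no k satisfied the criterion; fall back to the largest

# Hypothetical gap curve that flattens after k = 2:
print(choose_k_by_gap([0.10, 0.60, 0.62, 0.63], [0.05, 0.05, 0.05, 0.05]))  # 2
```

The s_{k+1} slack keeps the rule from reacting to sampling noise in the reference distribution.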


k=1

K
X 1
Gap Method
Wk =
k=1
2nk
Dk (14)


Gapn (k) = En {log Wk } log Wk (15)

p
sk = 1 + 1/Bsd(k) (16)

Gap(k) Gap(k + 1) sk+1 (17)


Gap Method++
In K-means clustering, the distortion of a cluster is a function of the data population and the distances between the objects and the cluster centre:

$$I_j = \sum_{t=1}^{N_j} d(x_{jt}, w_j)^2$$

where I_j is the distortion of cluster j, w_j is the centre of cluster j, N_j is the number of objects belonging to cluster j, x_{jt} is the t-th object belonging to cluster j, and d(x_{jt}, w_j) is the distance between object x_{jt} and the centre w_j.

Each cluster is represented by its distortion, and its impact on the entire data set is assessed by its contribution to the sum of all distortions, S_K = \sum_{j=1}^{K} I_j, where K is the specified number of clusters.

The evaluation function f(K) of D T Pham, S S Dimov, and C D Nguyen is defined using S_K:

$$f(K) = \begin{cases} 1 & \text{if } K = 1 \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\ 1 & \text{if } S_{K-1} = 0,\ \forall K > 1 \end{cases}$$

$$\alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \end{cases}$$

where N_d is the number of attributes (dimensions) of the data set and α_K is a weight factor. Values of K at which f(K) is small (well below 1) indicate clustering structure in the data.
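The f(K) and α_K recurrences translate directly into code. A minimal sketch (the distortion values S_K in the example are made-up numbers, not computed from a data set):

```python
def f_of_k(S, nd):
    """Evaluate f(K) for K = 1..len(S), given total distortions S[K-1] = S_K
    and the number of dimensions nd (evaluation function of Pham, Dimov,
    and Nguyen)."""
    fs = [1.0]                      # f(1) = 1 by definition
    a = 1.0 - 3.0 / (4.0 * nd)      # alpha_2
    for k in range(2, len(S) + 1):
        prev = S[k - 2]             # S_{K-1}
        fs.append(1.0 if prev == 0 else S[k - 1] / (a * prev))
        a = a + (1.0 - a) / 6.0     # alpha recurrence for the next K
    return fs

# A small f(K) flags a K with clustering structure (here, K = 2):
print(f_of_k([100.0, 20.0, 18.0], nd=2))
```

Compared with the elbow and gap statistics, f(K) needs only the distortions already produced by the K-means runs themselves.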
Information Criterion

Bayesian (BIC)

Akaike (AIC)
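As a reminder of how these criteria trade goodness of fit against model complexity, here is a minimal sketch of the generic BIC and AIC formulas (the log-likelihood and parameter counts in the example are made-up numbers for illustration):

```python
from math import log

def bic(log_likelihood, n_params, n_samples):
    # Bayesian Information Criterion: penalty of log(n) per parameter
    return n_params * log(n_samples) - 2.0 * log_likelihood

def aic(log_likelihood, n_params):
    # Akaike Information Criterion: constant penalty of 2 per parameter
    return 2.0 * n_params - 2.0 * log_likelihood

# With equal fit, both criteria prefer the smaller model (lower is better):
print(bic(-100.0, 5, 50) < bic(-100.0, 8, 50))  # True
print(aic(-100.0, 5) < aic(-100.0, 8))          # True
```

In model-selection approaches to K, each candidate K defines a model whose parameter count grows with K, so the criterion stops rewarding extra clusters once the likelihood gain is too small.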
X-means
Extends K-means with efficient estimation of the number of clusters: each centre is tested for a split, and splits are kept when they improve the Bayesian Information Criterion (BIC) of the model (Pelleg and Moore).

G-means
To specify the G-means algorithm fully, we need a test to detect whether the data assigned to a centre are sampled from a Gaussian. The hypotheses of the test are:

H0: the data around the center are sampled from a Gaussian.
H1: the data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test used is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been standardized to mean 0 and variance 1, let x_(i) be the i-th ordered value and z_i = F(x_(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is

$$A^2(Z) = -\frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left[\log(z_i) + \log(1 - z_{n+1-i})\right] - n \quad (1)$$

For the case where \mu and \sigma are estimated from the data, the statistic is corrected according to

$$A_*^2(Z) = A^2(Z)\left(1 + 4/n - 25/n^2\right) \quad (2)$$

The test also accounts for the number of data points n tested, by incorporating it into the calculation of the critical value (see equation (2)). This prevents the algorithm from making bad decisions about clusters with few data points.

G-means algorithm (X, α):
1. Initialize the centroids.
2. Run K-means with those centroids.
3. Once the labels have been assigned, use a statistical test (the Anderson-Darling statistic) to detect whether each cluster follows a Gaussian distribution at confidence level α.
4. If the test holds for a cluster, keep its centroid; otherwise, replace the centroid with two centres.
5. Repeat from step 2 until no more centroids are split.
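Equations (1) and (2) need nothing beyond the standard library, using math.erf for the standard normal CDF. A sketch, with two made-up samples: one roughly normal, one clearly bimodal (the critical value against which G-means would compare the statistic depends on α and is not reproduced here):

```python
from math import erf, log, sqrt

def anderson_darling(xs):
    """Corrected Anderson-Darling statistic A*^2 against N(0, 1) after
    standardizing xs, per equations (1) and (2)."""
    n = len(xs)
    mu = sum(xs) / n
    sd = sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    # z_i = F(x_(i)) with F the standard normal CDF, computed via erf
    z = sorted(0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0)))) for x in xs)
    a2 = -sum((2 * i + 1) * (log(z[i]) + log(1.0 - z[n - 1 - i]))
              for i in range(n)) / n - n
    return a2 * (1.0 + 4.0 / n - 25.0 / (n * n))  # small-sample correction

roughly_normal = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 2.0]
bimodal = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
print(anderson_darling(roughly_normal) < anderson_darling(bimodal))  # True
```

A bimodal cluster scores a much larger statistic, which is exactly the case where G-means replaces one centre with two.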
Silhouette Method
[Figure slides: example silhouette plots for different values of K]
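Since the silhouette slides are figure-only, here is a minimal sketch of the coefficient behind them, s(i) = (b − a) / max(a, b) (the example clusters are made up for illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

def mean_silhouette(clusters):
    """Mean of s(i) = (b - a) / max(a, b), where a is the mean distance from
    point i to its own cluster and b the mean distance to the nearest other
    cluster. Values near 1 indicate well-separated clusters."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(dist(p, q) for j, q in enumerate(cluster)
                    if j != idx) / (len(cluster) - 1)
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

well_separated = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
print(mean_silhouette(well_separated))  # close to 1
```

To choose K, this score is computed for each candidate K and the K with the highest mean silhouette is kept.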
Supervised metrics

Completeness.

Homogeneity.

V-measure: harmonic mean of completeness and homogeneity:

v = 2 * (homogeneity * completeness) / (homogeneity + completeness)
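The formula above can be checked with an entropy-based implementation. A minimal sketch using the standard definitions h = 1 − H(C|K)/H(C) and c = 1 − H(K|C)/H(K), where C are the ground-truth classes and K the predicted clusters:

```python
from collections import Counter, defaultdict
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    # H(labels | given): entropy of `labels` within each group of `given`,
    # weighted by group size
    groups = defaultdict(list)
    for lab, g in zip(labels, given):
        groups[g].append(lab)
    n = len(labels)
    return sum((len(grp) / n) * entropy(grp) for grp in groups.values())

def v_measure(truth, pred):
    h_c, h_k = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred, truth) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# A perfect clustering scores 1.0 even though the cluster ids are permuted:
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # (1.0, 1.0, 1.0)
```

Because only label co-occurrence matters, the metric is invariant to how the cluster ids are numbered, which is what makes it usable for evaluating K directly against ground truth.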


Conclusions
The literature has not yet given a definitive answer for finding the K in K-Means.

Several of the tools leave the selection of the optimal K ambiguous, so a bias remains on the part of whoever analyses the result.

Unsupervised metrics choose an optimal K based on information obtained purely from the data and their statistics.

Supervised metrics are the only ones that truly allow evaluating how faithfully the fit matches a real model.
Bibliography
Pattern Classification (2nd Ed.) (Richard O. Duda, Peter E. Hart, David G. Stork). Wiley, 2001.

Selection of K in K-means clustering (D T Pham, S S Dimov, and C D Nguyen). Manufacturing Engineering Centre, Cardiff University, Cardiff, UK, 2004.

X-means: Extending K-means with Efficient Estimation of the Number of Clusters (Dan Pelleg and Andrew Moore). School of Computer Science, Carnegie Mellon University, Pittsburgh, USA.

Learning the K in K-means (Greg Hamerly, Charles Elkan). Department of Computer Science and Engineering, University of California, San Diego, USA.
