SP-2155
Andrés Mora Zúñiga
K-means
What it is: a method
Uses: clustering
Data-driven
Probability
Supervised
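\[ D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \|x_i - x_j\|^2 = 2 n_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \qquad (11) \]
\[ W_k = \sum_{k=1}^{K} \frac{1}{2 n_k} D_k \qquad (12) \]
\[ s_k = \sqrt{1 + 1/B} \, \mathrm{sd}(k) \qquad (14) \]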
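The identity behind D_k (the sum over all ordered pairs equals 2 n_k times the squared distances to the cluster mean) can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the toy clusters are hypothetical, chosen for illustration.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of the points (mu_k)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def pairwise_dispersion(points):
    """D_k as a double sum over all ordered pairs (x_i, x_j) in the cluster."""
    return sum(sq_dist(a, b) for a, b in product(points, repeat=2))


def centroid_dispersion(points):
    """Right-hand side of the identity: 2 * n_k * sum ||x_i - mu_k||^2."""
    mu = centroid(points)
    return 2 * len(points) * sum(sq_dist(p, mu) for p in points)


def within_dispersion(clusters):
    """Pooled within-cluster dispersion: sum over clusters of D_k / (2 * n_k)."""
    return sum(pairwise_dispersion(c) / (2 * len(c)) for c in clusters)


# Toy data (hypothetical): two small clusters in the plane.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(10.0, 10.0), (12.0, 10.0)],
]

for c in clusters:
    print(pairwise_dispersion(c), centroid_dispersion(c))
print(within_dispersion(clusters))
```

Both forms give the same D_k for each cluster, so in practice the centroid form, which costs O(n_k) rather than O(n_k^2), is the one worth computing.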
Calinski-Harabasz Index
\[ D_k = \sum_{x_i \in C_k} \sum_{x_j \in C_k} \|x_i - x_j\|^2 = 2 n_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \]
\[ T = \sum_{i=1}^{K} \|\mu_i\|^2 \]
\[ J = \frac{1}{T} \sum_{k=1}^{K} D_k \]
k=1
K
X 1
k=1
K
X 1
Gap Method
Wk =
k=1
2nk
Dk (14)
Gapn (k) = En {log Wk } log Wk (15)
p
sk = 1 + 1/Bsd(k) (16)
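Equations (15) and (16) can be sketched in a few lines once log W_k is available for the data and for B reference datasets. The sketch below assumes those values were computed elsewhere; the function names and all numbers are hypothetical. The selection rule in `choose_k` (smallest k with Gap(k) >= Gap(k+1) - s_{k+1}) is the one usually associated with the gap statistic and is not stated on the slides, so treat it as an assumption.

```python
import math


def gap_and_error(log_w_data, log_w_refs):
    """Return (Gap(k), s_k) for one value of k.

    log_w_data: log W_k computed on the real data.
    log_w_refs: log W_k computed on each of the B reference datasets.
    """
    b = len(log_w_refs)
    mean_ref = sum(log_w_refs) / b           # estimate of E*_n{log W_k}
    gap = mean_ref - log_w_data              # equation (15)
    sd = math.sqrt(sum((v - mean_ref) ** 2 for v in log_w_refs) / b)
    s_k = math.sqrt(1 + 1 / b) * sd          # equation (16)
    return gap, s_k


def choose_k(gaps, errors):
    """Assumed rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.

    gaps[i] and errors[i] correspond to k = i + 1.
    """
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1
    return len(gaps)


# Hypothetical log W_k values for k = 1, 2, 3 (not real measurements).
log_w_data = [5.0, 3.0, 2.8]
log_w_refs = [
    [5.1, 5.3, 5.2, 5.2],
    [4.6, 4.8, 4.7, 4.7],
    [4.4, 4.6, 4.5, 4.5],
]

results = [gap_and_error(d, r) for d, r in zip(log_w_data, log_w_refs)]
gaps = [g for g, _ in results]
errors = [s for _, s in results]
best_k = choose_k(gaps, errors)
print(gaps, errors, best_k)
```

With these made-up numbers the gap jumps between k = 1 and k = 2 and then flattens, so the rule settles on k = 2.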
Bayesian (BIC)
Akaike (AIC)
X-means
G-means

Algorithm G-means(X, α)
1. Initialize the centroids.
2. Run K-means with those centroids.
3. Once the labels are assigned, use a statistical test (Anderson-Darling statistic) to detect whether each cluster follows a Gaussian distribution at confidence level α.
4. If the test passes for a cluster, keep its centroid; otherwise replace the centroid with two centers.
5. Repeat from step 2 until no centroids are split.

The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are

H0: The data around the center are sampled from a Gaussian.
H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been converted to mean 0 and variance 1, let x_(i) be the ith ordered value. Let z_i = F(x_(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is

\[ A^2(Z) = -\frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \log(z_i) + \log(1 - z_{n+1-i}) \right] - n \qquad (1) \]

For the case where μ and σ are estimated from the data, the statistic is corrected according to

\[ A^2_{*}(Z) = A^2(Z) \left( 1 + 4/n - 25/n^2 \right) \qquad (2) \]
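Equations (1) and (2) and the split decision in step 4 can be sketched with the standard library alone. The sketch assumes the cluster's points have already been reduced to one dimension, uses the sample (n - 1) standard deviation when standardizing, and takes the critical value as a caller-supplied argument; the function names, the toy samples, and the threshold 2.0 are hypothetical, and the proper critical value depends on the chosen α.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling(values):
    """A^2(Z) from equation (1) for one-dimensional values."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)  # assumption: sample (n - 1) std
    cdf = NormalDist().cdf                   # N(0, 1) cumulative distribution
    # z_i = F(x_(i)) for the standardized, sorted values.
    z = [cdf((v - mu) / sigma) for v in sorted(values)]
    # Clip away from 0 and 1 so the logarithms stay finite.
    eps = 1e-12
    z = [min(max(zi, eps), 1.0 - eps) for zi in z]
    total = sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    )
    return -total / n - n


def corrected_anderson_darling(values):
    """A^2_*(Z) from equation (2), for mean and variance estimated from data."""
    n = len(values)
    return anderson_darling(values) * (1 + 4 / n - 25 / n ** 2)


def should_split(values, critical_value):
    """Reject H0 (and split the cluster) when the statistic exceeds the threshold."""
    return corrected_anderson_darling(values) > critical_value


# Toy samples (hypothetical): evenly spaced normal quantiles, and two far-apart groups.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
two_modes = [-5 - 0.1 * j for j in range(25)] + [5 + 0.1 * j for j in range(25)]

HYPOTHETICAL_CRITICAL_VALUE = 2.0  # placeholder, not taken from the slides

print(corrected_anderson_darling(gaussian_like))
print(corrected_anderson_darling(two_modes))
print(should_split(gaussian_like, HYPOTHETICAL_CRITICAL_VALUE))
print(should_split(two_modes, HYPOTHETICAL_CRITICAL_VALUE))
```

The Gaussian-like sample yields a small statistic and keeps its single center, while the two-mode sample yields a large one and would be split into two centers.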
Silhouette Method
Silhouette Method example
Supervised Metrics
Homogeneity