
Semi-supervised Locality-weight Fuzzy C-Means Clustering

Based on Seeds and One Novel Decision Rule


Lei Gu (1, 2)
1. Key Laboratory of Advanced Process Control
for Light Industry, Jiangnan University
Wuxi, China
2. School of Computer Science and Technology,
Nanjing University of Posts and Telecommunications
Nanjing, China
gulei@njupt.edu.cn
Xianling Lu (3)
3. School of Internet of Things Engineering,
Jiangnan University
Wuxi, China
jnluxl@gmail.com
Abstract: Because semi-supervised clustering can use some labeled data, called seeds, to guide the clustering of unlabeled data, this paper proposes a semi-supervised clustering method based on a locality-weight fuzzy c-means clustering algorithm. The presented method uses some seeds for the initialization and applies one novel decision rule to reassign the class label of each data point. To investigate the effectiveness of our approach, several experiments are done on one artificial dataset and three real datasets. Experimental results show that the proposed method can improve the clustering performance significantly compared to some unsupervised and semi-supervised clustering algorithms.
Keywords: semi-supervised clustering; fuzzy c-means; seeds; k-means
I. INTRODUCTION
With the fast development of computer science, data clustering has come to play an important role in our life. It has been used in a wide variety of fields, ranging from machine learning, biometric recognition, image processing and image retrieval to electrical engineering, mechanical engineering, remote sensing and genetics. The aim of data clustering methods is to divide data into several homogeneous groups called clusters, such that data within the same cluster are more similar (or less dissimilar) to each other than to data belonging to different clusters [1]. Unsupervised clustering partitions all unlabeled data into a certain number of groups on the basis of one chosen similarity or dissimilarity measure [2,3]. Different similarity or dissimilarity measures lead to different clustering methods such as k-means [4], fuzzy c-means [5], mountain clustering, subtractive clustering [6] and neural gas [7].
Semi-supervised clustering also divides a collection of unlabeled data into several groups. Unlike unsupervised clustering, however, it allows a small amount of labeled data to aid and bias the clustering of the unlabeled data, and so a significant increase in clustering performance can be obtained [8]. The popular semi-supervised clustering methods fall into two categories, the similarity-based and the search-based approaches [9]. In similarity-based methods, an existing clustering algorithm employs a specific similarity measure trained by labeled data. In search-based methods, the clustering algorithms modify the objective function with the aid of labeled data such that better clusters are found [9]. A number of the semi-supervised clustering approaches published until now belong to the search-based methods. For example, [10] presented a semi-supervised clustering with pairwise constraints and [9] gave an active semi-supervised fuzzy clustering.
Among the traditional unsupervised clustering algorithms, K-Means (KM), which can be easily implemented, is the best-known squared error-based clustering algorithm. Notably, a semi-supervised KM clustering by seeding was proposed in [8]: a semi-supervised variant of k-means called Seed-KMeans (SeedKM), which applies some labeled data called seeds to the initialization of KM. Recently, a Locality-weight Fuzzy C-Means clustering method (LFCM) was proposed in [11]. Although a semi-supervised LFCM is also given in [11], this semi-supervised variant of LFCM only uses some pairwise must-link and cannot-link constraints.
In this paper, a semi-supervised clustering technique inspired by SeedKM is introduced into the LFCM algorithm. The novel method, called SeedLFCM, applies some seeds to the initialization of LFCM and uses one new decision rule to relabel the data after LFCM ends. Experimental results show that our approach can obtain better clustering performance than KM, LFCM and SeedKM.
The remainder of this paper is organized as follows. In Section 2 the LFCM clustering algorithm is reported. Section 3 formulates the proposed SeedLFCM clustering method. Experimental results are shown in Section 4, and Section 5 gives our conclusions.
978-1-4673-0915-8/12/$31.00 2012 IEEE
2012 3rd International Conference on System Science, Engineering Design and Manufacturing Informatization
II. THE LFCM CLUSTERING
Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of $N$ unlabeled data in the $d$-dimensional space $\mathbb{R}^d$. Suppose that $X$ is divided into $k$ clusters and $c_i$ ($i = 1, 2, \ldots, k$) represents the cluster centers. The LFCM clustering is outlined as follows [11]:
Step1. Initialize $k$ cluster centers $c_i$ ($i = 1, 2, \ldots, k$) by the random method.
Step2. Let $H = 1$ and $o_i = c_i$ ($i = 1, 2, \ldots, k$).
Step3. For each data $x_j$, compute the weight $W_{ij}$ between $x_j$ and the cluster center $c_i$ by the following formula, which describes the neighbourhood structure of the data points:
$$W_{ij} = e^{-\|x_j - c_i\|^2 / t_i} \tag{1}$$
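where $t_i$ is a parameter and can be obtained as follows:
$$t_i = \begin{cases} \upsilon_i^2, & x_j \in S_i \\ \frac{1}{k} \sum_{h=1}^{k} \upsilon_h^2, & \text{otherwise} \end{cases} \tag{2}$$
where $\upsilon_i = \frac{1}{q} \sum_{r=1}^{q} \|x_r - c_i\|_2$ ($x_r \in S_i$), $q$ is the number of the neighbors of $c_i$ and $S_i$ is the set of the $q$ nearest neighbors of $c_i$.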
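Step4. For each data $x_j$, obtain the membership $E_{ij}$ with the following equation:
$$E_{ij} = \frac{\left(W_{ij} \|x_j - c_i\|^2\right)^{-\frac{1}{\beta - 1}}}{\sum_{h=1}^{k} \left(W_{hj} \|x_j - c_h\|^2\right)^{-\frac{1}{\beta - 1}}} \tag{3}$$
where $\beta > 2$.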
Step5. Update each cluster center $c_i$ as follows:
$$c_i = \frac{\sum_{j=1}^{N} (E_{ij})^{\beta} W_{ij} x_j}{\sum_{j=1}^{N} (E_{ij})^{\beta} W_{ij}} \tag{4}$$
Step6. If $\max_{i=1}^{k} \|c_i - o_i\| < \varepsilon$, then goto Step8; otherwise goto Step7.
Step7. If $H \le H_{\max}$, then let $H = H + 1$, $o_i = c_i$ ($i = 1, 2, \ldots, k$) and goto Step3; otherwise goto Step8. $H_{\max}$ is the maximum number of iterations of LFCM.
Step8. End LFCM.
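As a rough illustration, the LFCM steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation of [11]: it reads Eq. (1) as W_ij = exp(-||x_j - c_i||^2 / t_i), takes t_i in Eq. (2) as the squared mean distance from c_i to its q nearest neighbours (and as the average of these values over all centers for points outside S_i), uses the exponent -1/(beta - 1) in Eq. (3), and defaults to beta = 2, a threshold of 1e-5 and 300 iterations as in Section IV. The function names and the TINY guard are hypothetical and only serve the example.

```python
import numpy as np

TINY = 1e-12  # hypothetical guard against division by zero, not part of the paper


def sq_dists(X, c):
    """d2[i, j] = ||x_j - c_i||^2 for centers c (k, d) and data X (N, d)."""
    return ((c[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)


def locality_weights(d2, q):
    """Eqs. (1)-(2), under the reading stated in the lead-in (an assumption)."""
    k, n = d2.shape
    nn = np.argsort(d2, axis=1)[:, :min(q, n)]           # indices of S_i
    ups = np.sqrt(np.take_along_axis(d2, nn, axis=1)).mean(axis=1)
    in_s = np.zeros((k, n), dtype=bool)
    np.put_along_axis(in_s, nn, True, axis=1)            # mark x_j in S_i
    t = np.where(in_s, ups[:, None] ** 2, np.mean(ups ** 2))
    return np.exp(-d2 / np.maximum(t, TINY))


def memberships(W, d2, beta=2.0):
    """Eq. (3): each column of E sums to 1 over the k clusters."""
    inv = np.maximum(W * d2, TINY) ** (-1.0 / (beta - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)


def update_centers(E, W, X, beta=2.0):
    """Eq. (4): weighted mean of the data with weights E_ij^beta * W_ij."""
    w = (E ** beta) * W
    return (w @ X) / np.maximum(w.sum(axis=1, keepdims=True), TINY)


def lfcm(X, centers, q=5, beta=2.0, eps=1e-5, max_iter=300):
    """Steps 2-8, starting from the given centers; returns (centers, E)."""
    X = np.asarray(X, dtype=float)
    c = np.array(centers, dtype=float)
    for _ in range(max_iter):
        old = c                                           # o_i = c_i
        d2 = sq_dists(X, old)
        W = locality_weights(d2, q)                       # Step 3
        E = memberships(W, d2, beta)                      # Step 4
        c = update_centers(E, W, X, beta)                 # Step 5
        if np.linalg.norm(c - old, axis=1).max() < eps:   # Step 6
            break
    return c, E
```

A call such as `centers, E = lfcm(X, initial_centers, q=10)` returns the final centers and the k-by-N membership matrix. Taking `E.argmax(axis=0)` would then give one hard label per data point, which is one possible way (an assumption here) to assign cluster labels.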
III. THE PROPOSED SEEDLFCM
Firstly, the generation of seeds is given. Given the number of clusters $k$ and a nonempty set $X = \{x_1, x_2, \ldots, x_N\}$ of all unlabeled data in the $d$-dimensional space $\mathbb{R}^d$, the clustering algorithms can partition $X$ into $k$ clusters. Let $L$, called the seed set, be a subset of $X$, and for each $x_m$ ($x_m \in L$) let the label be given by means of supervision [8]. We assume that $L$ can be divided into $k$ groups on the basis of data labels and that each subgroup contains at least two labeled data for the implementation of SeedKM. Therefore, we can acquire a $k$-partitioning $\{L_1, L_2, \ldots, L_k\}$ of the seed set $L$.
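One way to draw such a seed set at random can be sketched as follows. The text only requires that every class contributes at least two seeds; how this is enforced is not described, so the strategy below (take two points per class first, then fill up at random) and the name `draw_seeds` are assumptions made for illustration.

```python
import numpy as np


def draw_seeds(y, percent, rng=None, min_per_class=2):
    """Return sorted indices of a random seed set covering every class.

    y holds the true labels; percent is the requested share of seeds.
    At least min_per_class points per class are always selected, even
    when percent alone would ask for fewer seeds.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    n_seeds = int(round(len(y) * percent / 100.0))
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = min(min_per_class, len(idx))
        chosen.extend(rng.choice(idx, size=take, replace=False))
    chosen = np.array(chosen, dtype=int)
    # fill the remaining quota from the points not selected so far
    rest = np.setdiff1d(np.arange(len(y)), chosen)
    extra = min(max(n_seeds - len(chosen), 0), len(rest))
    if extra > 0:
        chosen = np.concatenate([chosen, rng.choice(rest, size=extra, replace=False)])
    return np.sort(chosen)
```

The returned indices select the seed set L from X; grouping them by label then yields the subsets L_1, ..., L_k.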
Secondly, the procedure for the proposed clustering method SeedLFCM is as follows:
Step1. Let the seed set be $L = \{L_1, L_2, \ldots, L_k\}$.
Step2. For each subset $L_i$ of $L$, compute the initial center $c_i$ ($i = 1, 2, \ldots, k$) using the following equation:
$$c_i = \frac{1}{G} \sum_{g=1}^{G} x_g \tag{5}$$
where $x_g \in L_i$ and $G$ is the number of all data belonging to $L_i$.
Step3. Use the $k$ initial centers $c_1, c_2, \ldots, c_k$ for the initialization of SeedLFCM.
Step4. Carry out Step2 through Step8 of LFCM according to the algorithm flow of LFCM.
Step5. When LFCM ends, give each data its cluster label.
Step6. Assume that one data $x$ is assigned to the $i$th cluster. Compute the distance $d_i$ between $x$ and the cluster center $c_i$.
Step7. Calculate the distance $d_m$ between $x$ and the seed $x_m$ nearest to $x$.
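Step8. Apply the following novel decision rule to reassign $x$ to one new cluster label $Q$:
$$Q = \begin{cases} m, & \text{if } d_m < d_i \\ i, & \text{otherwise} \end{cases} \tag{6}$$
where $m$ in the first case stands for the label of the nearest seed $x_m$.
Step9. SeedLFCM ends.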
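The two SeedLFCM-specific parts, the seed-based initialization of Eq. (5) and the decision rule of Eq. (6), can be sketched in Python as below. This is an illustrative sketch rather than the authors' code: it assumes that class labels are the integers 0 to k-1, that cluster i keeps the label of the seed group L_i it was initialised from, and that distances are Euclidean. The function names are hypothetical.

```python
import numpy as np


def seed_centers(seeds, seed_labels, k):
    """Eq. (5): c_i is the mean of the seeds carrying label i."""
    seeds = np.asarray(seeds, dtype=float)
    seed_labels = np.asarray(seed_labels)
    return np.stack([seeds[seed_labels == i].mean(axis=0) for i in range(k)])


def reassign(X, labels, centers, seeds, seed_labels):
    """Eq. (6): take the nearest seed's label if that seed is closer
    to x than the center of the cluster x was assigned to."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    seed_labels = np.asarray(seed_labels)
    # d_i: distance from each x to its assigned cluster center (Step 6)
    d_i = np.linalg.norm(X - centers[labels], axis=1)
    # d_m: distance from each x to its nearest seed (Step 7)
    d_seed = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    nearest = d_seed.argmin(axis=1)
    d_m = d_seed[np.arange(len(X)), nearest]
    return np.where(d_m < d_i, seed_labels[nearest], labels)
```

For example, with seeds at (0,0) and (1,0) labelled 0 and seeds at (9,0) and (10,0) labelled 1, the initial centers are (0.5,0) and (9.5,0). A point at (6,0) that ended up in cluster 0 lies 5.5 from its center but only 3 from the seed (9,0), so the rule moves it to label 1.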
The three main differences between the previous LFCM and the proposed SeedLFCM are as follows:
a. SeedLFCM is a semi-supervised clustering algorithm, but LFCM is an unsupervised clustering method.
b. SeedLFCM uses some labeled seeds for the initialization, but LFCM employs the random method.
c. SeedLFCM applies the novel decision rule Eq.(6) to labeling data on the basis of the results of the LFCM algorithm.
IV. EXPERIMENTAL RESULTS
To demonstrate the effectiveness of the proposed SeedLFCM semi-supervised clustering algorithm, we compared it with several traditional unsupervised and semi-supervised clustering methods, namely KM, LFCM and SeedKM, on one artificial dataset [12] and three UCI real datasets [13], referred to as DUNN, Iris, BUPA and Sonar respectively. As shown in Figure.1, the artificial dataset DUNN is a 2-dimensional dataset with 90 instances of two classes. The Iris dataset contains 150 cases with 4-dimensional features from three classes. The BUPA dataset collects 345 6-dimensional cases belonging to two classes. The Sonar dataset contains 208 60-dimensional data samples divided into two classes. All experiments were done in Matlab on the Windows XP operating system.
Figure.1 The DUNN dataset
For LFCM and SeedLFCM, on each dataset, we set the maximum number of iterations $H_{\max} = 300$, the convergence threshold $\varepsilon = 10^{-5}$ and the exponent $\beta = 2$, and $q$ was set to the better-performing value selected from the set $\{5, 10, [N/3], [N/2], [3N/4], [4N/5]\}$, where $N$ is the number of all data and $[\cdot]$ denotes rounding to an integer. For the semi-supervised SeedKM and SeedLFCM, on each dataset, we randomly generated $P\%$ ($P = 0, 10, 20, \ldots, 100$) of the dataset as seeds. Since true labels are known, the clustering accuracy $Q\%$ on the unlabeled data, which is the remaining $(100 - P)\%$ of the dataset, could be quantitatively assessed. Therefore, the clustering accuracy $Y\%$ of the whole dataset, consisting of unlabeled data and labeled seeds, could be calculated by $Q\% \times (100 - P)\% + P\%$. When $P = 0$, the semi-supervised clustering methods became unsupervised algorithms, and when $P = 100$, all data of the whole dataset were considered as labeled seeds and $Y = 100$. On each dataset SeedKM and SeedLFCM were run 20 times for each $P$ ($P = 10, 20, \ldots, 90$), and we report in Figure.2, Figure.3, Figure.4 and Figure.5 the average accuracy $Y\%$ of the whole dataset. Furthermore, the results of KM and LFCM over 20 independent runs on each dataset are shown in the same figures.
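The whole-dataset accuracy described above is a simple weighted combination, which a short Python helper makes concrete. The function name is hypothetical; the formula is the one stated in the text, with the seeds counted as correctly labeled.

```python
def overall_accuracy(q_percent, p_percent):
    """Y = Q * (100 - P) / 100 + P, with Q, P and Y all given in percent.

    q_percent: accuracy on the unlabeled data (Q).
    p_percent: share of the dataset used as labeled seeds (P).
    """
    return q_percent * (100.0 - p_percent) / 100.0 + p_percent
```

For instance, with 10% seeds and 80% accuracy on the remaining 90% of the data, the overall accuracy is 80 x 0.9 + 10 = 82%.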
Figure.2 Comparison of clustering accuracies on the DUNN dataset
Figure.3 Comparison of clustering accuracies on the Iris dataset
Figure.4 Comparison of clustering accuracies on the BUPA dataset
Figure.5 Comparison of clustering accuracies on the Sonar dataset
Firstly, because KM and LFCM never employ the seeds, their performance is not affected by the seeds. Secondly, because the seeds and the novel decision rule are applied to SeedLFCM, its results differ drastically from those of KM and LFCM. Finally, although the clustering accuracy of SeedKM improves as the number of seeds increases, we can see from Figure.2, Figure.3, Figure.4 and Figure.5 that the proposed SeedLFCM achieves better performance than SeedKM when an equal amount of labeled seeds is used.
V. CONCLUSIONS
In this paper, we propose a new semi-supervised locality-weight fuzzy c-means clustering method, SeedLFCM, inspired by SeedKM. The proposed SeedLFCM not only employs some seeds in the initialization, but also uses one novel decision rule for assigning one class label to a data point. Experiments are carried out on one artificial dataset and three real datasets. In comparison to KM, LFCM and SeedKM, our proposed approach has shown its superiority.
ACKNOWLEDGMENT
This research is supported by the Foundation of Key Laboratory of Advanced Process Control for Light Industry (Jiangnan University), Ministry of Education, China (No. APCLI1003), and by the Scientific Research Foundation of Nanjing University of Posts and Telecommunications (No.NY21 0078).
REFERENCES
[1] M. Filippone, F. Camastra, F. Masulli, et al., A survey of kernel and spectral methods for clustering. Pattern Recognition, 2008, Vol.41, No.1, pp.176-190.
[2] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review. ACM Computing Surveys, 1999, Vol.31, No.3, pp.256-323.
[3] R. Xu, D. Wunsch, Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, Vol.16, No.3, pp.645-678.
[4] J.T. Tou, R.C. Gonzalez, Pattern recognition principles. Addison-Wesley, London, 1974.
[5] J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York, 1981.
[6] D.W. Kim, K.Y. Lee, D. Lee, K.H. Lee, A kernel-based subtractive clustering method. Pattern Recognition Letters, 2005, Vol.26, No.7, pp.879-891.
[7] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, Neural-gas network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 1993, Vol.4, No.4, pp.558-569.
[8] S. Basu, A. Banerjee, R.J. Mooney, Semi-supervised clustering by seeding. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp.27-34.
[9] N. Grira, M. Crucianu, N. Boujemaa, Active semi-supervised fuzzy clustering. Pattern Recognition, 2008, Vol.41, No.5, pp.1834-1844.
[10] S. Basu, A. Banerjee, R.J. Mooney, Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining, 2004, pp.333-344.
[11] P.F. Huang, D.Q. Zhang, Locality sensitive c-means clustering algorithms. Neurocomputing, 2010, Vol.73, No.16-18, pp.2935-2943.
[12] J. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well separated clusters. Journal of Cybernetics, 1974, No.3, pp.32-57.
[13] UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLSummary.html.