
International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS-2017)

A Survey on Methodologies used for Semantic Document Clustering

Aditi Gupta, Post Graduate Student, Department of Computer Science & Engineering, JSS Academy of Technical Education, Noida, aditi.gupta1693@gmail.com
Dr. Jyoti Gautam, Head of Department, Department of Computer Science & Engineering, JSS Academy of Technical Education, Noida, jyotig@jssaten.ac.in
Mr. Ajay Kumar, Assistant Professor, Department of Computer Science & Engineering, JSS Academy of Technical Education, Noida, ajay.itcs@gmail.com

Abstract— Document clustering is a traditional technique used in multiple fields such as data mining, information retrieval, knowledge discovery from data and pattern recognition. The large volumes of textual data being created in the modern world have increased the importance of document clustering techniques. Although various document-clustering techniques have been studied in recent years, clustering quality remains an area of concern. In particular, the majority of present document clustering methods do not account for semantic relationships and as a result give unsatisfactory clustering results. Semantic relationships consider the context in which a term is used and do not rely solely on its isolated meaning. In recent years, a lot of effort has gone into applying semantics to document clustering. This paper presents a survey of various research papers that have been studied and highlights the merits and demerits of each clustering algorithm. This will give direction to future research in a more focused manner.

Keywords— document clustering; evaluation measures; semantic; wordnet; word sense disambiguation

I. INTRODUCTION

Clustering [11] is a process of organizing data objects into a set of non-disjoint classes known as clusters such that objects in the same cluster are similar to each other and dissimilar to the objects in other clusters. Document clustering [17] follows a similar approach: documents are organized into meaningful clusters in such a way that documents in the same cluster represent the same topic and those in different clusters represent different topics. The traditional clustering [8] approach uses the words (terms) of a document as its feature vector but does not consider meaningful relationships between these words. Hence, it is not able to produce meaningful clusters and faces problems such as synonymy, polysemy, and ambiguity.

Semantic document clustering [27] addresses these problems by considering the meaning of a term in its context before clustering. Approaches that can be used for this include word sense disambiguation [20] and WordNet [10].

Here, a survey of various semantic document-clustering techniques is presented. Section II highlights the fundamentals of document clustering, which include document clustering, the traditional approach, the drawbacks of the traditional approach and the semantic approach. Section III describes the related work: the work already done by researchers in semantic document clustering, the observations of each study and the resulting research gaps. Section IV compares the various methods used for clustering and analyses their performance. Section V describes the merits and demerits of the various approaches used for document clustering. Section VI presents the summary of the survey in the form of a conclusion.

II. FUNDAMENTALS OF DOCUMENT CLUSTERING

A. Document Clustering

Document clustering is a traditional approach used in various areas of research such as data mining, information retrieval, knowledge discovery from data and pattern recognition. Document clustering [7] divides documents into groups based on similarities of their content. Each cluster consists of documents that are similar to one another within the group (high intra-cluster similarity) but different from the documents of other groups (low inter-cluster similarity).

Document clustering is different from document classification. In document classification, the classes are known a priori and documents are assigned to those classes, whereas in document clustering neither the number nor the membership of the classes is known beforehand. Thus, classification can be compared to supervised machine learning, while clustering relates to unsupervised machine learning. For the clustering process, the vector space model is used to represent documents. Its major characteristic is the high dimensionality of the feature space, which can hinder the performance of clustering algorithms.
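The vector space model mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a raw term count multiplied by log(N/df) as the tf-idf weight, and cosine similarity as the document similarity. The function names and the toy corpus are hypothetical and are not taken from any of the surveyed papers.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one tf-idf weight per vocabulary term."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


docs = [
    "clustering groups similar documents",
    "document clustering groups documents by topic",
    "stock prices fell sharply today",
]
vocabulary, vectors = tf_idf_vectors(docs)
print(len(vocabulary))                            # one dimension per distinct term
print(cosine_similarity(vectors[0], vectors[1]))  # documents sharing terms
print(cosine_similarity(vectors[0], vectors[2]))  # documents sharing no terms
```

Even this three-document corpus needs one dimension per distinct term, and "document" and "documents" occupy separate dimensions because the sketch compares surface forms only.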

978-1-5386-1887-5/17/$31.00 ©2017 IEEE


The document preprocessing procedure is depicted in Figure 1.

Fig. 1: Procedure of Document Preprocessing. Image Source [11]

From the above figure we can conclude that documents are first preprocessed and then clustered. The steps followed in preprocessing [6] are tokenization, stop word removal and stemming. The tokenization process separates the characters into tokens, using white space and punctuation marks as separators. Stop word removal removes words that do not convey any meaning, such as 'a' and 'the'. In the stemming process, morphological forms are normalized into canonical forms. After preprocessing, the processed documents are obtained, and these can then be clustered using clustering algorithms.

Document clustering uses two approaches: the traditional approach [8] and the semantic approach [12]. We now look in detail at traditional document clustering and at the drawbacks that led to semantic document clustering.

B. Traditional document clustering

In this approach, a document is considered to be a set of words. The main drawback of this approach is that it does not consider the meaning of the words; that is, it ignores the semantic relationships between words. It uses words or sequences to form clusters, and a concept weighting [23] approach is further used to cluster the documents. Because of this, it is not able to find meaningful clusters or to differentiate between different clusters.

The traditional approach of document clustering is shown in Figure 2.

Fig. 2: Traditional document-clustering. Image Source [16]

From the above figure of traditional document clustering, we can conclude that in this approach keywords are used to compute similarity and the meanings of the terms are not taken into consideration. Therefore, this approach has some limitations, which are described below.

C. Drawbacks of traditional document clustering

1) Synonymy and Polysemy: Synonymy means that the same meaning may be expressed by different words. Such synonyms may occur in the documents, but because the traditional approach does not consider meaning, these words may end up in different clusters. Polysemy means that the same word possesses different meanings. In this approach all occurrences of such a word end up in the same cluster even though they have different meanings.

2) Ambiguity: This is slightly different from polysemy. Here the same word possesses different meanings in different contexts. This problem is not resolved in the traditional approach, since semantic relations are not taken into consideration.

3) High dimensionality: A word may possess different meanings, and when these words are mapped onto the feature space they may cause the problem of high dimensionality.

D. Semantic document clustering

Semantics is concerned with the study of meaning. It focuses on the relation between signifiers such as words, phrases and terms. The meaning of "semantic" is related to meaning in language. Semantic document clustering [11] is concerned with partitioning the
documents into clusters in such a way that documents in the same cluster are similar to each other but documents in different clusters are dissimilar to each other. Various techniques such as word sense disambiguation [20] and WordNet mapping [10] are used to find the dominant sense of a word and cluster the documents accordingly. The semantic based document clustering approach is shown in Figure 3.

Fig. 3: Semantic based document clustering. Image Source [15]

From the above figure, we can see that semantic document clustering includes document preprocessing, concept weighting [23] according to the dominant sense using a domain ontology, and clustering of the documents. The document clusters obtained in this way are semantically related.

Using WordNet and word sense disambiguation, one can address the problems of synonymy, polysemy and ambiguity.

WordNet [29] is the lexical database developed by Miller et al. It is used to include background information for each word, such as its synset IDs, which constitute its different possible senses, and also different levels of hypernyms, that is, more general terms for the word. Synsets are sets of synonyms that gather lexical items having similar meanings. Word sense disambiguation [10] is a way of determining the correct sense of a word in a context, using two types of method. Some methods use the definition of each sense in a dictionary such as WordNet, in which the definition is the gloss assigned to each synset. Other approaches use the semantic relatedness in an existing semantic network.

III. RELATED WORK

This section presents a survey of various semantic document clustering approaches that have already been used.

Gaby G. Dagher and Benjamin C.M. Fung [6] proposed a subject based semantic document clustering approach for digital forensic investigations, in which data mining techniques are used to support investigations. In this approach clustering is done using WordNet, word sense disambiguation and the subject vector space model (SVSM). However, it considers the subjects of all clusters to be independent, which results in a high-dimensional clustering model.

Ruksana Aktar and Yoojin Chung [21] proposed an evolutionary approach for document clustering in which a combination of the Genetic Algorithm and K-Means is used in such a way that there is no requirement to pre-specify the desired number of clusters. The data set is partitioned into small sets on which the genetic algorithm is applied, and hence it avoids the problem of local minima. Its future work focuses on making the algorithm fully automatic, so that no parameter specification is required.

Vivek Kumar Singh, Nisha Tiwari and Shekhar Garg [25] proposed a document clustering approach using K-means, Heuristic K-means and fuzzy C-means (FCM). It uses different representation schemes such as tf, tf-idf and Boolean, and concludes that tf is better than Boolean but worse than tf-idf. Of the three clustering algorithms, Heuristic K-means produces better results than K-means, and FCM proves to be the robust algorithm.

Rishiraj Saha Roy and Durga Toshniwal [19] proposed an approach that uses the Naïve Bayesian concept for fuzzy clustering of textual documents. In this approach, datasets are represented as a weighted term matrix using the vector space model, and the co-occurrence of semantically linked terms is then observed. Each term is uniquely assigned to a single cluster using the Naïve Bayesian concept, and documents are then assigned to different clusters. This approach produces better results than traditional algorithms and does not require any user input for the clustering. As future work, indexing can be used to store the cluster information and retrieve details relevant to only a few clusters, which will help in reducing computational space and time.

Thaung Thaung Win and Lin Mon [24] proposed an approach that uses the Fuzzy C-Means algorithm for document clustering. Unlike algorithms that use crisp partitioning, fuzzy clustering allows a data item to belong to different clusters with a degree of membership. PBM and F-Measure are used for cluster validation. As future work, Hyperspherical Fuzzy C-Means can also be used for document clustering to provide different levels of fuzziness.

Xiaohui Cui and Thomas E. Potok [26] proposed a document clustering technique using particle swarm optimization (PSO). It uses K-means, PSO and a hybrid of both to cluster the documents and evaluates the results. It concludes that the hybrid KPSO algorithm can generate more compact clusters than PSO or K-means alone.
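Several of the studies above use K-means as a baseline or as one half of a hybrid. The following is a minimal sketch of plain K-means on document vectors, assuming Euclidean distance, a fixed number of clusters and the first k vectors as initial centroids. The function names and the toy vectors are hypothetical, and this is not the implementation used in any of the surveyed papers.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100):
    """Cluster dense vectors into k groups; returns (labels, centroids)."""
    # Assumption: the first k vectors serve as initial centroids. This is
    # deterministic, but the final clusters depend on this choice.
    centroids = [list(v) for v in vectors[:k]]
    labels = [None] * len(vectors)
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


# Toy term-weight vectors: two documents on one topic, two on another.
doc_vectors = [
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.9, 1.0],
]
labels, centroids = k_means(doc_vectors, k=2)
print(labels)
```

The result depends on the initial centroids, which is the sensitivity that hybrids of K-means with a genetic algorithm or PSO, as surveyed here, aim to reduce.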


Ajit Kumar and I-Jen Chiang [1] proposed a document clustering technique that uses semantic cliques aggregation. In this approach, features are extracted from the documents and associations are defined among frequently occurring words, known as a semantic clique. In the semantic clique each connected component represents a theme, and documents are then clustered according to the theme. This approach proves to be immensely useful for automatic document clustering.

Yogesh Kumar Meena, Shashank and Vibhav Prakash Singh [28] proposed a text document clustering approach using a genetic algorithm (GA) and discrete differential evolution (DDE). The genetic algorithm is an optimization method that finds the best cluster centers, and DDE clusters the documents. GA and DDE on their own require more iterations, but their hybrid clusters the documents in fewer iterations. As future work, a more efficient algorithm can be designed for large datasets.

Amy J.C. Trappey, Charles V. Trappey, Fu-Chiang Hsu and David W. Hsiao [2] proposed a fuzzy ontological knowledge document clustering methodology (FODC). It automatically interprets and clusters knowledge documents using an ontology schema. It also uses fuzzy logic control to match suitable document clusters, compares the results with the K-Means approach and concludes that the FODC approach outperforms the K-Means approach.

IV. COMPARISON OF METHODS

In this section, Table I describes the different methods used for document clustering and analyses their performance based on different performance evaluation parameters.

TABLE I Comparison of Various Methods of Document Clustering

| S. No. | Clustering Method | Evaluation Method | Performance Analysis |
|---|---|---|---|
| 1. | K-Means [13] | Precision, Recall | Precision and recall give almost accurate values and show that the obtained clusters are similar to each other. |
| 2. | Bisecting K-Means [10] | Precision, Purity, Entropy | Precision and purity have high values whereas entropy has a lower value, and hence the clustering is better. |
| 3. | K-Means + PSO [22] | Intra-Cluster, Inter-Cluster | The hybrid of PSO and K-Means gives a low intra-cluster value and a high inter-cluster value compared to K-Means or PSO alone. |
| 4. | K-Means + Genetic [21] | F-Measure | K-Means + GA achieves a 50% higher precision value than K-Means alone and a 12.5% higher value than GA alone. |
| 5. | DDE + Genetic [28] | Fitness Value | Better fitness values are found by applying the combination of GA and DDE, in fewer iterations. |
| 6. | Ontological Semantic + Fuzzy [2] | F-Measure, Precision, Recall, Entropy | The hybrid achieves an 11% higher recall value, a 25% higher precision value, an 11% higher F-Measure value and a 33% higher entropy value compared to K-Means alone. |
| 7. | Semantic clustering + K-Means [5] | Precision, Recall, F-Measure | SSA with bisecting K-Means achieves a 10% higher precision value, a 20% higher recall value and a 40% higher F-measure value than SSA. |
| 8. | Frequent Concept Based Document Clustering [18] | F-Measure | When the data size is small FCDC gives good results, but when the data size is large the obtained clusters are not accurate. |
| 9. | Naïve Bayes [9] | F-Measure, Recall, Precision | It gives high precision, recall and F-Measure values, resulting in good clusters. |

V. MERITS AND DEMERITS

In this section, Table II gives a brief view of the merits and demerits of different document clustering methods.

TABLE II Merits and Demerits of Clustering Algorithms

| S. No. | Method | Merits | Demerits |
|---|---|---|---|
| 1. | K-Means | K-means tightens the clusters and is useful when the data size is large. | Different initial partitionings result in different final clusters, and the K value is difficult to predict. |
| 2. | Bisecting K-Means | It has a low computational cost and produces satisfactory clusters. | It requires the initial number of clusters (K value) to be given by the user. |
| 3. | PSO | It is stable and allows parallel computation. | It is computationally slow. |
| 4. | Genetic Algorithm | It can solve multidimensional problems and also solves problems with multiple solutions. | It does not guarantee optimized results, and there is no assurance that it will find the global optimum solution. |
| 5. | Fuzzy Algorithm | It gives the best results for overlapped datasets, and a term may belong to more than one cluster. | It requires prior specification of the clusters. |
| 6. | Self-Organizing Map | They are very simple, efficient and easy to understand. | They are computationally expensive, and every SOM is different and finds different similarities among the term vectors. |

VI. CONCLUSION

In this paper, a detailed survey of semantic document clustering is presented. It includes a survey of the traditional approach and of various semantic document clustering approaches. Steps such as stemming, stop word removal, tokenization and document clustering are followed in both techniques. Moreover, it is observed that the semantic based approach for document clustering
provides better accuracy, results and cluster quality than the traditional approach because word sense disambiguation and WordNet are used. Various clustering methodologies have been compared and analyzed with the help of performance metrics, and their merits and demerits have also been discussed. This survey paper will be used to give direction to the proposed work.

REFERENCES

[1] Ajit Kumar and I-Jen Chiang, "Document Clustering Using Semantic Cliques Aggregation", International Journal of Computer and Communications, 2015, 3, pp. 28-42.
[2] Amy J.C. Trappey, Charles V. Trappey, Fu-Chiang Hsu and David W. Hsiao, "A fuzzy ontological knowledge document clustering methodology", IEEE Transactions on Systems, Man, Cybernetics, Volume 39, No. 3, 2009.
[3] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics", The 11th International World Wide Web Conference, WWW2002, Honolulu, Hawaii, USA, 2002.
[4] D. Buttler, "A short survey of document structure similarity algorithms", International Conference on Internet Computing, pp. 3-9, 2004.
[5] G. Thilagavathi and J. Anitha, "Document clustering in forensic investigation by Hybrid approach", International Journal of Computer Applications, Volume 91, No. 3, April 2014.
[6] Gaby G. Dagher and Benjamin C.M. Fung, "Subject based Semantic Document Clustering for Digital Forensic Investigations", Data & Knowledge Engineering, Elsevier, 2013.
[7] H. Gaudani, K. Lakhani, and R. Chhatrala, "Survey of Document Clustering", International Journal of Computer Science and Mobile Computing, vol. 3, issue 5, 2014, pp. 871-874.
[8] H. Tar and T. Nyaunt, "Enhancing Traditional Text Documents Clustering based on Ontology", International Journal of Computer Applications, vol. 33, no. 10, 2011, pp. 38-42.
[9] How Jing, Yu Tsao, Kuan-Yu Chen and Hsin-Min Wang, "Semantic Naïve Bayes Classifier for Document Classification", International Joint Conference on Natural Language Processing, pp. 1117-1123, October 2013.
[10] Julian Sedding, "Wordnet based text document clustering", Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, Association for Computational Linguistics, pp. 104-113, 2004.
[11] K. Sathiyakumari, G. Maninekalai, V. Preamsudha, and M. Phil Scholar, "A survey on various approaches in document clustering", International Journal of Computer Technology and Application (IJCTA) 2, no. 5, CiteseerX (2011): 1534-1539.
[12] K. Shaban, "A Semantic Approach for Document Clustering", Journal of Software 4.5 (2009), pp. 391-404.
[13] M. S. Anbarasi, "Ontology Oriented Concept Based Clustering", International Journal of Research in Eng. and Technology, Vol. 3, Issue 2, Feb-2014.
[14] Mamta Mahilane and K. L. Sinha, "A Survey Paper On Different Techniques Of Document Clustering", International Journal of Current Engineering and Scientific Research (IJCESR), Volume 2, Issue 1, 2015.
[15] Maitri P. Naik, Harshadkumar B. Prajapati and Vipul K. Dabhi, "A Survey on Semantic Document Clustering", Electrical, Computer and Communication Technologies (ICECCT), IEEE International Conference, 2015.
[16] Nagma Y. Saiyad, H.B. Prajapati and V.K. Dabhi, "A survey of document clustering using semantic approach", International Conference on Electrical, Electronics and Optimization Techniques (ICEEOT), IEEE, 2016.
[17] Pankaj Jajoo, "Document clustering", M.Tech thesis, IIT Kharagpur, 2008.
[18] Rekha Baghel and Dr. Renu Dhir, "A frequent concepts based document clustering algorithm", International Journal of Computer Applications, 4.5 (2010): 6-12.
[19] Rishiraj Saha Roy and Durga Toshniwal, "Fuzzy clustering of text documents using Naïve Bayesian concept", International Conference on Recent Trends in Information, Telecommunication and Computing, IEEE, 2010.
[20] Roberto Navigli, "Word sense disambiguation: A survey", ACM Computing Surveys (CSUR), Volume 41, Issue 2, 2009.
[21] Ruksana Aktar and Yoojin Chung, "An evolutionary approach for document clustering", International Conference in Electronic Engineering and Computer Science, Elsevier, 2013.
[22] S. Sarkar, "A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet", International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 3, June 2014.
[23] Sapna Gupta, Vikrant Chole, and A. Mahajan, "A Review on Document Clustering Using Concept Weight", International Journal of Scientific & Engineering Research, Volume 5, Issue 3, March 2014.
[24] Thaung Thaung Win and Lin Mon, "Document clustering by Fuzzy C-Mean algorithm", International Conference on Advanced Computer Control (ICACC), 2010.
[25] Vivek Kumar Singh, Nisha Tiwari and Shekhar Garg, "Document clustering using K-Means, Heuristic K-Means and Fuzzy C-Means", International Conference on Computational Intelligence and Communication Systems, 2011.
[26] Xiaohui Cui, Thomas E. Potok and Paul Palathingal, "Document clustering using Particle Swarm Optimization", Swarm Intelligence Symposium, IEEE Proceedings, 2005.
[27] Y. Wang and J. Hodges, "Document clustering with semantic analysis", Proceedings of the 39th Annual Hawaii International Conference on System Science (HICSS'06), IEEE, vol. 3, 2006.
[28] Yogesh Kumar Meena, Shashank and Vibhav Prakash Singh, "Text document clustering using Genetic algorithm and discrete differential evolution", International Journal of Computer Applications, Volume 43, No. 1, April 2012.
[29] Zakaria Elberrichi, Abdellatif Rahmoun, and Mohamed Amine Bentaallah, "Using WordNet for Text Categorization", International Arab Journal of Information Technology 5, no. 1 (2008): 16-24.
