
Intelligent Systems Conference 2017

7-8 September 2017 | London, UK

Improving Arabic Document Clustering Using K-Means Algorithm and Particle Swarm Optimization
Abdullah S. Daoud
Sinai University
North Sinai, Egypt
am.said72@yahoo.com

Ahmed Sallam*
Faculty of Computers and Informatics
Suez Canal University
Ismailia, Egypt
sallam_ah@ci.suez.edu.eg

Mohamed E. Wheed
Faculty of Computers and Informatics
Suez Canal University
Ismailia, Egypt
mohamed_badawy@ci.suez.edu.eg

Abstract—Document clustering plays a vital role in text mining fields such as information retrieval, sentiment analysis, and text organizing. Document clustering aims to automatically divide a collection of documents, based on some aspects of similarity, into groups that are meaningful, useful, or both. This paper aims to improve the clustering task for Arabic documents. Recent studies show that partitioning algorithms are the most suitable for the clustering process, and among them k-means is the most commonly used because of its simplicity and speed. However, k-means can only generate an arbitrary solution, because its results depend on the initial centers chosen for the desired clusters ("the seeds"). In this paper, a new modified k-means algorithm called PSO k-means, supported by Particle Swarm Optimization (PSO), is applied to enhance the Arabic document clustering process, and an intensive comparative study between the proposed model and the standard k-means algorithm is carried out. The stemming algorithms used in Arabic language processing are also assessed. Through the experiments, the new algorithm is evaluated on three different Arabic data sets. The results demonstrate that the proposed model produces more accurate results than the standard k-means algorithm for Arabic language documents, and that the Arabic light stemmer is the more suitable choice for the stemming step.

Keywords—Arabic document clustering; NLP; swarm intelligence; PSO; k-means; features selection; features extraction

I. INTRODUCTION

Nowadays, text is a common medium for information exchange. The importance of this medium has led many researchers to look for suitable methods to analyze natural language texts and extract useful information. Text mining can be defined as the process of detecting meaningful and interesting linguistic patterns in natural language texts [1] [2] [3]. To aid the process of text mining, many techniques were developed; these techniques are grouped into two categories, clustering and classification. Document clustering is a machine learning technique used to identify the similarity between text documents based on their content. Unlike text classification, document clustering is an unsupervised method, in which there is no predefined class [1]. Clustering has various applications, such as search engines, identifying cancerous data, student academics analysis, and sentiment analysis (SA) [4] [5] [6]. One of the most widely used techniques in clustering is the k-means algorithm. Following a partitioning methodology, the algorithm divides N data objects (points) into K groups (clusters) according to some similarity between the data points and the cluster centers [7]. Clusters are formed such that each data point in a cluster has a short distance to its cluster center. Although the k-means algorithm is simple, fast, and useful for large data sizes, it has a crucial issue that greatly affects its results: k-means is sensitive to the initial seed selection. In this paper, a solution to this problem is proposed by hybridizing the standard k-means algorithm with Particle Swarm Optimization (PSO).

A. Challenges, motivation and problem statement

a) Challenges
Arabic document clustering in particular, and the complexity of Arabic natural language processing in general, are associated with the nature of the Arabic language. Some of the challenges:

• The Arabic language has a more complex morphology. E.g., the word "writing" (كتابة) has different forms like (كتب، كتبت، يكتب، تكتب، يكتبون، يكتبا، يكتبن).

• The preprocessing stage (stemming) is more complex, since the Arabic language contains a large degree of inflection (كتاب - كتب - مكتبة), word gender (masculine (يكتب) and feminine (تكتب)), and pluralities (singular (يكتب), dual (يكتبا), and plural (يكتبون)).

• Some words have different semantic meanings that are context dependent. E.g., the word (حلق) could be the verb "shaved" (حَلَقَ), the verb "flew" (حَلَّقَ), or the noun "earring" (حَلَق).

• Few algorithms are available for analyzing the Arabic language, and this paper highlights the problems of these algorithms.

• Few datasets with training and testing portions are available for making comparisons between techniques.

• Only a few studies have been done for this purpose.
b) Motivation
The Arabic language is one of the oldest languages, and the number of Arabic speakers in the world is 168.4 million, the majority of whom live in the Middle East and North Africa (source: http://www.internetworldstats.com/stats7.htm). Moreover, for religious reasons, millions of Muslims around the world learn Arabic to understand the holy Quran, which is written in Arabic. Consequently, there is a huge amount of Arabic data on the internet, including feedback, tweets, and blogs, in addition to millions of queries submitted to search engines. For these reasons, and because only a few studies have addressed this purpose, this paper looks to enhance the Arabic document clustering process.

c) Problem statement
Given (i) a data set of Arabic documents $D = \{d_1, d_2, \ldots, d_N\}$, (ii) a desired number of clusters $K$, and (iii) a fitness function that evaluates the quality of clustering, the goal is to compute an assignment of documents to clusters, $D \rightarrow \{1, 2, \ldots, K\}$, that minimizes the fitness function.

B. Paper Contributions
The main contributions of this paper are as follows:
1) Demonstrate high-performance document clustering for Arabic text documents, compared with the previous work done on the same data sets.
2) Evaluate a new modified k-means algorithm, enhanced by Particle Swarm Optimization, that generates more accurate clustering results for Arabic resources.
3) Adopt a features selection process for the Arabic text clustering problem.
4) Present a new study for clustering Arabic text documents, reporting performance on a common corpus under common criteria so that the results are comparable.
5) On the basis of the selected features, introduce a comparative study of different types of stemmers to reveal their effects on the clustering process.

II. RELATED WORKS

The document clustering process has become an interesting topic for many researchers, especially for European languages; on the other side, few studies serve the Arabic language. A review of what has been done for document clustering techniques in general, and for the Arabic language in particular, is presented here.

Many studies have been presented, such as [9], where an experimental study showed that partitioning algorithms are more suitable and highly accurate for clustering large data sets, since they have a linear time complexity. In [9] [10] [11], the k-means algorithm and its variants are implemented for document clustering to enhance accuracy and efficiency; these studies proved that the k-means algorithm is more efficient for clustering high-dimensional text documents. In [12] [13] [14], experiments investigated the effect of similarity measures like Jaccard, Euclidean, and cosine, where Euclidean and cosine are more accurate than the others. In [15] [16] [17], other semantic similarity measures are based on a popular tool, WordNet [18] [19], that provides a semantic relationship between the documents.

Fuzzy algorithms like fuzzy C-means have been examined for the document clustering purpose [20]. These algorithms permit a document to belong to more than one cluster, which is useful for highly overlapped document datasets. Other techniques were also applied, such as the self-organizing map (SOM) [21].

Other issues for clustering Arabic documents are the preprocessing phase and the high dimensionality of the documents: a high feature space is a challenging task for a clustering algorithm. Dimension reduction techniques are divided into two approaches: features extraction (FE) and features selection (FS). In FE, the original feature space is reduced into a lower space using some linear transformation, such as Latent Semantic Analysis (LSA) [22] and Principal Components Analysis (PCA) [23]. FS, in contrast, selects features that hold adequate information about the text document dataset by assessing the performance of the selected features under some fitness function [1]. Examples of such algorithms are genetic algorithms (GA) [24] and swarm intelligence (SI) algorithms like particle swarm optimization (PSO) [25], ant colony optimization (ACO), and bee colony optimization (BCO) [26]. Practically, FS is better than FE, because FE may lose some useful features in the transformation process.

A hybridization of the k-means algorithm and PSO has been introduced in many studies for European-language document clustering [27] [3] [28]. These studies proved that this hybridization overcomes the standard k-means issue with the initialization of the seeds and achieves more powerful and accurate results in the natural language processing (NLP) stage.

PSO has been utilized by many researchers for clustering text documents, each investigating PSO for his own language and region. So, this work aims to measure this hybridization for Arabic text document clustering.

For the Arabic language, few studies have been done on the document clustering process, and most never use common data sets under common performance criteria. In [8], a probabilistic model using Latent Dirichlet Allocation (LDA) was used for clustering Arabic datasets, and a comparative study between this model and standard k-means was carried out. Four Arabic datasets were examined, where the LDA results are superior to standard k-means for all used datasets.

In [29], a comparative study was applied between stemming algorithms designed for Arabic language processing. They checked the effectiveness of raw, root, and light stemming algorithms in clustering Arabic datasets using the standard k-means algorithm, and demonstrated that the light stemmer is more accurate and powerful than the others.

Some disadvantages of [8] and other works that tested the k-means algorithm over Arabic datasets:

• They do not address the problem of the initial selection of cluster centers, whereas this work demonstrates its importance in the clustering process.
• They used various feature extraction techniques, like the model in [29], whereas these models affect the original feature set.

This paper is organized as follows: Section III presents the methods for representing documents in clustering algorithms and for computing the similarity between documents; it also describes the limitations of the k-means algorithm and the swarm intelligence approach. Section IV provides a solution to the limitation problem using particle swarm optimization. Section V provides the detailed experimental setup and results comparing the performance of the PSO algorithm with the k-means approaches, together with a discussion of the experimental results. The conclusion is in Section VI.

III. BACKGROUND

A. K-means Clustering Algorithm
The k-means clustering algorithm belongs to the family of partitioning algorithms, which split data into a set of disjoint partitions (clusters). Given a data set with N points, a partitioning method constructs K groups of data, where each group represents a cluster, such that K <= N [7]. So, it classifies the data into K groups meeting the following conditions:
1) Each cluster contains at least one point.
2) Each point belongs to exactly one cluster.

K-means represents each cluster by the mean (center) of its objects [3], where the center is calculated using (1):

    $C_j = \frac{1}{n_j} \sum_{d_j \in s_j} d_j$        (1)

where $d_j$ denotes a document vector that belongs to cluster $s_j$, $C_j$ stands for the center of cluster $j$, and $n_j$ is the number of documents belonging to cluster $s_j$.

Based on a distance function, e.g., the Euclidean distance [27], k-means assigns each object to the one of the predetermined clusters whose center is at the minimum distance from the object. Given a data set $D$ with a very large dimension $d$, the Euclidean distance between an object $X = [x_1, x_2, \ldots, x_d]$ and the center of a cluster $C = [c_1, c_2, \ldots, c_d]$ is calculated using (2):

    $dist(X, C) = \sqrt{\sum_{i=1}^{d} (x_i - c_i)^2}$        (2)

The quality of the clustering can be measured by a fitness function (sometimes called an evaluation function), which computes the Sum of Squared Errors (SSE), i.e., a measure of how well the selected seeds represent their documents [2] [28]. Minimizing the SSE is our goal. The SSE is measured as below:

    $SSE = \sum_{i=1}^{K} \sum_{x \in c_i} (x - c_i)^2$        (3)

where $K$ is the number of clusters, $x$ ranges over the objects belonging to the $i$-th cluster, and $c_i$ is the $i$-th cluster center [27].

K-means results depend on the first selection of the cluster centers, which is done randomly: the first step of the algorithm is to select K documents at random as initial cluster centers (seeds). Then the algorithm repeats two steps, re-assigning each document vector to the closest center and recalculating each center from its cluster members, until a stopping criterion is met. As shown in Fig. 1, the k-means results depend on the initial seeds. For example, for seeds D5 and D7, k-means converges to {{D2,D3,D4,D5,D8}, {D1,D6,D7,D9,D10}}, while for seeds D4 and D10, k-means converges to {{D1,D2,D3,D4,D5,D6,D8}, {D7,D9,D10}}, with K = 2 in both cases.

Fig. 1. K-means results depend on the initial seeds.

So, to obtain good results, it is common to run the k-means algorithm multiple times with different seeds to achieve a more compact clustering. K-means can be summarized as in Table 1:

TABLE I. STANDARD K-MEANS ALGORITHM

Algorithm 1
1) Randomly choose K document vectors to set as initial centers.
2) Assign each document vector to the closest cluster center based on (2).
3) Update the cluster centers using equation (1).
4) Repeat steps (2) to (3) until either the cluster centers no longer change or the maximum number of iterations is met.
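Since the paper's experiments were run in a Java/WEKA environment (Section V), the following is only an illustrative Python sketch of Algorithm 1 applied to document vectors; the function name, the convergence test, and the seeding mechanism are our own choices, not the authors' implementation.

```python
import numpy as np

def kmeans(docs, k, max_iter=100, rng=None):
    """Standard k-means (Algorithm 1): random seeds, assign, re-center."""
    rng = np.random.default_rng(rng)
    # 1) Randomly choose K document vectors as the initial centers ("seeds").
    centers = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each document to the closest center, Euclidean distance (2).
        dists = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Update each center as the mean of its members, equation (1).
        new_centers = np.array([
            docs[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4) Stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Fitness: Sum of Squared Errors, equation (3).
    sse = sum(((docs[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, sse

# Toy usage: 10 random 2-D "documents", two clusters, two different seedings.
docs = np.random.default_rng(0).random((10, 2))
labels_a, _, _ = kmeans(docs, k=2, rng=1)
labels_b, _, _ = kmeans(docs, k=2, rng=2)
```

Running the sketch with two different rng values can yield two different final partitions of the same documents, which is exactly the seed sensitivity illustrated by Fig. 1.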
B. Swarm Intelligence
Swarm intelligence models are inspired by natural swarm systems. Several swarm intelligence models have been successfully applied in many real-life applications, each based on a different natural swarm system. Examples of swarm intelligence models are Ant Colony Optimization, Particle Swarm Optimization, and Artificial Bee Colony [26] [30]. This paper primarily focuses on one of the most popular swarm intelligence models, Particle Swarm Optimization.

a) Particle Swarm Optimization
PSO was introduced by Russell Eberhart and James Kennedy in 1995. PSO draws inspiration from the social behavior associated with bird flocks [25].

Particles are individual solutions, while the swarm is a collection of particles that represents the solution space. Each particle in the swarm is a moving object that can move through the search space according to certain rules, adjusting its motion (i.e., position and velocity) and being attracted to better positions, in order to find the best solution.
PSO contains a fitness function to evaluate each particle in the swarm at a specific position. The goal is to optimize the fitness value (i.e., minimize or maximize it) to determine the best position in the whole swarm. The fitness function depends on the problem to be solved.

b) The Original PSO Algorithm
Each particle moves through the solution space with a velocity, which represents the particle's speed in a specific direction, and a memory, which stores both its personal best solution ($P_{best}$), achieved by this particle, and the global best solution ($G_{best}$), achieved by any particle in the swarm [26].

The original version of the PSO algorithm [27] [26] [25] is described by the two simple velocity and position update equations shown in (4) and (5), respectively:

    $V_{id}(t+1) = W \cdot V_{id}(t) + c_1 r_1 (p_{id}(t) - x_{id}(t)) + c_2 r_2 (p_{gd}(t) - x_{id}(t))$        (4)

    $x_{id}(t+1) = x_{id}(t) + V_{id}(t+1)$        (5)

where:

• $t$ denotes the iteration counter.

• $W$ is the inertia weight factor. It ensures that the velocity of each particle does not change suddenly; rather, the previous velocity of the particle is taken into consideration.

• $V_{id}$ represents the rate of position change (velocity) of the $i$-th particle in the $d$-th dimension.

• $x_{id}$ represents the position of the $i$-th particle in the $d$-th dimension.

• $p_{id}$ represents the best position of the $i$-th particle in the $d$-th dimension.

• $p_{gd}$ represents the position of the global best particle ($x_g$) in the swarm in the $d$-th dimension (i.e., the position that gives the global best fitness value).

• $c_1$ and $c_2$ are the cognitive and social constants, respectively, known as acceleration coefficients.

• $r_1$ and $r_2$ are random values in the range (0, 1).

According to (4) and (5), the basic flow of the original PSO algorithm can be described as shown in Table 2:

TABLE II. ORIGINAL PSO ALGORITHM

Algorithm 2
1) Initialize each particle with a random velocity and position.
2) For each particle's position (P), repeat steps (3) to (4).
3) Evaluate fitness(P) according to a fitness function.
4) If fitness(P) is better than the fitness of Pbest, then Pbest = P.
5) Set the best of the Pbests as Gbest.
6) Update the velocities of all the particles using (4).
7) Move each particle to its new position using (5).
8) Repeat steps (2) to (7) until a stopping criterion is met (e.g., the maximum number of allowed iterations is reached or a sufficiently good fitness value is achieved).
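As a concrete reading of (4) and (5), here is a minimal Python sketch of one PSO iteration. The inertia weight and acceleration coefficients shown are common illustrative defaults from the PSO literature, not values prescribed or tuned by this paper.

```python
import numpy as np

def pso_step(positions, velocities, pbest, gbest,
             w=0.72, c1=1.49, c2=1.49, rng=None):
    """One iteration of the original PSO updates, equations (4) and (5)."""
    rng = np.random.default_rng(rng)
    r1 = rng.random(positions.shape)   # r1, r2 ~ U(0, 1), drawn per dimension
    r2 = rng.random(positions.shape)
    # (4): inertia + cognitive pull toward pbest + social pull toward gbest.
    velocities = (w * velocities
                  + c1 * r1 * (pbest - positions)
                  + c2 * r2 * (gbest - positions))
    # (5): move each particle using its new velocity.
    positions = positions + velocities
    return positions, velocities
```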
IV. PSO K-MEANS ALGORITHM

The dependency of the k-means algorithm on the selection of the initial seeds is a big problem, which has a great effect on the results [27] [3]. So, we need to perform a global search over the entire solution space (data set) for the seeds that give the best results. Therefore, we merge PSO with the standard k-means algorithm to get the optimal solution.

Each particle in the swarm represents one possible solution for clustering the document data set. Therefore, each particle represents a set of cluster centers, $X_i = [C_1, C_2, \ldots, C_K]$, where $K$ is the number of clusters and $C$ is a cluster center. Then, at each iteration, the fitness of each particle is measured by (3) to identify the local best position ($P_{best}$) and the global best position ($G_{best}$).

Fitness function (3) should be minimized: the smaller the value of the SSE, the more accurate the clustering.

The PSO k-means clustering algorithm can be summarized as in Table 3:

TABLE III. PROPOSED MERGED K-MEANS-PSO CLUSTERING ALGORITHM

Algorithm 3
1) Set the swarm size, the number of iterations (N), and the number of clusters (K).
2) At the initial stage, each particle randomly selects K different document vectors from the document data set as its initial cluster center vectors.
3) For each particle's position, repeat steps (4) to (7).
4) Assign each document to the closest cluster center according to (2).
5) Update the centers according to (1).
6) Calculate the fitness value for each particle in the swarm according to (3).
7) Compare the particle's fitness value with its best solution value $p_{id}$; if the current value is better, set $p_{id}$ equal to the current value.
8) Set the global best position $p_{gd}$ to the best value encountered by all particles (the minimum SSE value).
9) Update the velocity and position of each particle using the new $p_{id}$ and $p_{gd}$ according to equations (4) and (5), respectively.
10) Repeat steps (3) to (9) until the number of iterations is reached.
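The following condensed Python sketch puts Algorithm 3 together, reusing the pso_step helper from the previous sketch. Encoding each particle as a flattened vector of K candidate centers follows $X_i = [C_1, \ldots, C_K]$ above; the parameter defaults are illustrative assumptions, not the settings used in Section V.

```python
import numpy as np

def pso_kmeans(docs, k, n_particles=30, n_iter=50, rng=None):
    """Sketch of Algorithm 3: each particle is a flattened set of K centers.
    Assumes pso_step from the previous sketch is in scope."""
    rng = np.random.default_rng(rng)
    n, dim = docs.shape
    # Step 2: every particle starts from K distinct random documents.
    pos = np.stack([docs[rng.choice(n, k, replace=False)].ravel()
                    for _ in range(n_particles)])
    vel = np.zeros_like(pos)

    def refine_and_score(flat):
        centers = flat.reshape(k, dim)
        d = np.linalg.norm(docs[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                        # step 4: closest center (2)
        centers = np.array([docs[labels == j].mean(axis=0)   # step 5: equation (1)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
        fit = sum(((docs[labels == j] - centers[j]) ** 2).sum()
                  for j in range(k))                     # step 6: SSE, equation (3)
        return centers.ravel(), fit

    pbest = pos.copy()
    pbest_fit = np.full(n_particles, np.inf)
    for _ in range(n_iter):                              # step 10: iterate
        for i in range(n_particles):
            pos[i], fit = refine_and_score(pos[i])
            if fit < pbest_fit[i]:                       # step 7: update pbest
                pbest[i], pbest_fit[i] = pos[i], fit
        gbest = pbest[pbest_fit.argmin()]                # step 8: minimum SSE
        pos, vel = pso_step(pos, vel, pbest, gbest)      # step 9: (4) and (5)
    return pbest[pbest_fit.argmin()].reshape(k, dim)
```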
V. EXPERIMENT SETUP, RESULTS AND DISCUSSION

A. Setup
In this section, we describe the data sets being used and the preprocessing stage, and reveal the best practice for selecting PSO parameters that are suitable for processing Arabic documents. We carried out our experiments using the Eclipse Java environment and WEKA (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/ml/weka/).

a) Data set
In this research, we have used three different Arabic datasets to evaluate the standard k-means and the merged k-means PSO algorithm, as shown in Table 4. These data sets were collected by Saad (https://sites.google.com/site/motazsite/corpora/osac) from multiple sites.

TABLE IV. DATA SET DESCRIPTION

Data set | Number of documents | Number of classes
BBC      | 4,763               | 7
CNN      | 5,170               | 6
OSAC     | 22,429              | 10
b) Text preprocessing
The document preprocessing stage is an important step in text mining and has a great effect on the results. The goal of the preprocessing stage is to transform text documents into a format suitable for automatic processing [1]. Arabic documents are preprocessed as follows:

• Tokenization: Documents are tokenized by removing punctuation marks, digits, numbers, symbols, and other non-Arabic words. The remaining words are considered terms or tokens.

• Stop words removal: Words such as prepositions, pronouns, and articles that occur frequently and are not helpful for distinguishing the documents are called stop-words [33]. Stop-words are usually removed with the help of a predefined list containing common stop-words such as (ال، من، في، على، فوق).

• Stemming: The process of reducing words to their stem or root, which is used to match different variants of words. There are two types of stemming [34]:
1) Root-Based Stemming: In this type, we reduce words to their roots. In the Arabic language, several words such as (التعليم - العلماء - المعلم), which mean "the education", "the scientists", and "the teacher", respectively, are reduced to one root (علم), which means "science". Many algorithms are used for root extraction; in our experiments, we use the Khoja stemming algorithm.
2) Light Stemming: Light stemming does not affect the semantics of words. It removes prefixes and suffixes from words instead of extracting the original root. For example, the Arabic word (المعلم), which means "the teacher", is reduced to (معلم), which means "teacher". It also normalizes words by removing diacritics (tashkeel) and the stretching character (tatweel); see the sketch after this list.

• Minimum Term Frequency: Low-frequency terms are not useful for discriminating between categories. It was found that removing these words helps in distinguishing between the clusters.
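To make the light-stemming step concrete, here is a toy Python sketch in the spirit of an Arabic light stemmer: normalization (removing tashkeel and tatweel) followed by prefix/suffix stripping. The affix lists and the three-letter minimum stem length are illustrative assumptions, not the rules of the Khoja stemmer or of the exact light stemmer evaluated in Section V.

```python
import re

TATWEEL = "\u0640"
TASHKEEL = re.compile("[\u064B-\u0652]")            # fathatan .. sukun
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "و")   # illustrative list
SUFFIXES = ("ون", "ات", "ان", "ين", "ها", "ية", "ة", "ه")

def normalize(word: str) -> str:
    """Remove diacritics (tashkeel) and the stretching character (tatweel)."""
    return TASHKEEL.sub("", word).replace(TATWEEL, "")

def light_stem(word: str) -> str:
    """Strip one prefix and one suffix, keeping at least a 3-letter stem."""
    word = normalize(word)
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المعلم"))   # -> معلم ("teacher"), as in the example above
```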
c) Document representation
After completing the preprocessing steps, documents are represented in the vector space model (VSM) [35] [2]. A term or feature in the VSM is a single word, and each document is represented as a vector of weights $d_j = [W_{1j}, W_{2j}, \ldots, W_{Nj}]$, where $N$ denotes the number of distinct terms and $W$ denotes the Term Frequency-Inverse Document Frequency (TF-IDF) weight. TF-IDF is calculated using (6):

    $W_{ij} = N_{ij} \cdot \log(N / N_j)$        (6)

where $N_{ij}$ is the term frequency, i.e., the number of times term $T_i$ occurs in document $D_j$; $N$ is the total number of documents; and $N_j$ denotes the number of documents in which term $T_i$ appears.
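A small Python sketch of the VSM weighting in (6) follows; treating N as the total number of documents and using the base-10 logarithm are assumptions, since the paper leaves both implicit.

```python
import math
from collections import Counter

def tfidf_vectors(docs_tokens):
    """Vector space model with TF-IDF weights, equation (6):
    W_ij = N_ij * log(N / N_j), with N_ij the count of term T_i in
    document D_j, N the total number of documents (assumed), and N_j
    the number of documents containing T_i."""
    N = len(docs_tokens)
    vocab = sorted({t for doc in docs_tokens for t in doc})
    df = Counter(t for doc in docs_tokens for t in set(doc))   # N_j per term
    vectors = []
    for doc in docs_tokens:
        tf = Counter(doc)                                      # N_ij per term
        vectors.append([tf[t] * math.log10(N / df[t]) for t in vocab])
    return vocab, vectors

# Example: three tiny "documents" that are already tokenized and stemmed.
vocab, vecs = tfidf_vectors([["علم", "كتب"], ["كتب"], ["علم", "علم"]])
```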
d) Parameters
The performance of the PSO algorithm is affected by several parameters:

• Swarm size: A larger number of particles implies that more of the search space is covered, but increasing the number of particles is time-consuming. The swarm size also depends on the size of the data set to be analyzed. In this research, we tried different numbers of particles on different data sets; based on our experiments, a recommended swarm size would be in the range 25 to 35.

• Number of iterations: How many passes are applied to solve the problem and get an optimal solution. With too few iterations, the search stops prematurely; furthermore, a large number of iterations is time-consuming. The number of iterations depends on the data size.

B. Results
First, the standard k-means algorithm was applied to each data set, and then we applied PSO k-means to the same data sets to compare the results of both algorithms. Next, we repeated the experiments with different preprocessing options to measure the stemming effectiveness, as follows:

• Stemming using the Khoja algorithm for root extraction.

• Stemming using the light stemming technique, where normalization is performed.

• No stemming at all (raw).

a) Clustering evaluation
Minimizing the objective (fitness) function, equation (3), achieves higher intra-cluster similarity and lower inter-cluster similarity, such that documents within the same cluster are as similar as possible and documents from different clusters are as dissimilar as possible. This is called an internal criterion for evaluating clustering quality. Nonetheless, good scores on an objective function do not necessarily produce efficient results [36]. The alternative to internal criteria is external criteria, which measure the clustering quality by matching the cluster structure to some a priori knowledge, e.g., Precision, Recall, F-measure, and Accuracy, as in (7), (8), (9), and (10), respectively:

    $precision(i, j) = N_{ij} / N_j$        (7)

    $recall(i, j) = N_{ij} / N_i$        (8)

where $N_{ij}$ is the number of documents of class $i$ in cluster $j$, $N_j$ is the number of documents in cluster $j$, and $N_i$ is the number of documents of class $i$. The F-measure is computed from precision and recall as below:

    $F(i, j) = \frac{2 \cdot precision(i, j) \cdot recall(i, j)}{precision(i, j) + recall(i, j)}$        (9)

    $Accuracy = \frac{\text{total correctly classified documents}}{\text{total number of documents}} \times 100$        (10)
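The external measures (7)-(10) can be sketched in Python as follows; mapping each cluster to its majority class before computing accuracy is our assumption, since the paper does not state how clusters are matched to classes for (10).

```python
from collections import Counter

def external_measures(classes, clusters):
    """Precision (7), Recall (8), and F-measure (9) per (class i, cluster j),
    plus Accuracy (10) under a majority-label mapping (our assumption)."""
    pairs = Counter(zip(classes, clusters))          # N_ij
    class_n = Counter(classes)                       # N_i
    cluster_n = Counter(clusters)                    # N_j
    f = {}
    for (i, j), n_ij in pairs.items():
        p = n_ij / cluster_n[j]                      # precision(i, j), eq (7)
        r = n_ij / class_n[i]                        # recall(i, j),    eq (8)
        f[(i, j)] = 2 * p * r / (p + r)              # F(i, j),         eq (9)
    # Accuracy (10): label each cluster with its majority class, count hits.
    majority = {}
    for (i, j), n_ij in pairs.items():
        if n_ij > pairs.get((majority.get(j), j), 0):
            majority[j] = i
    correct = sum(c == majority[k] for c, k in zip(classes, clusters))
    return f, 100.0 * correct / len(classes)

# Usage: gold classes vs. predicted cluster ids, as parallel lists.
f_scores, acc = external_measures(["a", "a", "b"], [0, 0, 1])
```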
Tables 5 to 8 show the Precision, Recall, F-measure, and Accuracy, respectively, for the standard k-means and PSO k-means on each data set with different stemming options and Minimum Term Frequency = 5:

TABLE V. PRECISION OVER BOTH ALGORITHMS

Data set | Standard k-means (No / Light / Root) | PSO k-means (No / Light / Root)
BBC      | 0.15 / 0.28 / 0.25                   | 0.50 / 0.46 / 0.27
CNN      | 0.59 / 0.61 / 0.45                   | 0.62 / 0.61 / 0.60
OSAC     | 0.6  / 0.6  / 0.5                    | 0.6  / 0.7  / 0.6

TABLE VI. RECALL OVER BOTH ALGORITHMS

Data set | Standard k-means (No / Light / Root) | PSO k-means (No / Light / Root)
BBC      | 0.26 / 0.39 / 0.31                   | 0.47 / 0.41 / 0.46
CNN      | 0.47 / 0.45 / 0.44                   | 0.53 / 0.55 / 0.55
OSAC     | 0.5  / 0.4  / 0.4                    | 0.5  / 0.6  / 0.5

TABLE VII. F-MEASURE OVER BOTH ALGORITHMS

Data set | Standard k-means (No / Light / Root) | PSO k-means (No / Light / Root)
BBC      | 0.15 / 0.27 / 0.22                   | 0.33 / 0.33 / 0.25
CNN      | 0.44 / 0.41 / 0.39                   | 0.51 / 0.52 / 0.51
OSAC     | 0.4  / 0.5  / 0.4                    | 0.5  / 0.4  / 0.5

TABLE VIII. ACCURACY OVER BOTH ALGORITHMS

Data set | Standard k-means (No / Light / Root) | PSO k-means (No / Light / Root)
BBC      | 31.1 / 34.3 / 28                     | 40.7 / 41.3 / 31
CNN      | 47   / 51.1 / 40.7                   | 52.2 / 53.2 / 52.7
OSAC     | 53   / 52.1 / 50                     | 62   / 64.5 / 55

C. Discussion
We evaluated each data set according to (7), (8), (9), and (10). From the experimental results for each data set, it is clear that the PSO k-means algorithm gives better results than the standard k-means algorithm. In the new algorithm, the result does not depend on the initial centers. Moreover, the light stemmer gives a more accurate clustering solution than the other stemming options.

Also, the number of instances in a data set has a great effect on the accuracy: CNN's results are superior to BBC's, and OSAc's results are better than CNN's. This implies that there is a relationship between data size and accuracy, as shown in Fig. 2.

Fig. 2. Relationship between data size and accuracy.

VI. CONCLUSION & FUTURE WORK

A. Conclusion
In this paper, one of the crucial issues of the k-means algorithm was investigated: its dependency on the selection of the cluster centers, and the effect of that dependency on clustering Arabic documents. We use Particle Swarm Optimization (PSO) as a search strategy to widen the search over the entire data set, i.e., a broad area that all particles try to cover to find the best solution. Merging the k-means algorithm with PSO thus avoids the standard k-means limitation regarding the initial selection of cluster centers.

Using three different Arabic corpora (BBC, CNN, and OSAc), the PSO k-means algorithm gives better results than applying the standard algorithm to the same data sets. This implies that globalizing the search for the cluster centers is an important task in the Arabic clustering process. Also, the Arabic stemming step has a great effect on the whole task: the Arabic light stemmer algorithm is more accurate than the root stemmer.

These research results are more accurate than those in [8] on the same data sets, both in comparison with applying k-means individually and with their proposed technique, according to the parameter settings of the proposed algorithm. Also, the Arabic light stemmer algorithm is powerful for the Arabic document clustering process because it does not affect the semantics of words. On the other hand, the results are affected by errors introduced in the stemming step, where some modern words are not stemmed correctly.
Therefore, the Arabic stemming algorithms should be enhanced to solve this problem, given its effect on the similarity measure.

B. Future work
In the future, we can concentrate on these points:

• Dealing with the Arabic dialects problem.

• Using another distance function for the k-means algorithms, one which supports semantic methods.

• Using other stemming algorithms for the Arabic language that solve the stemming problem for modern words.

REFERENCES
[1] H. Khalifa, "New Techniques for Arabic Document Classification," 2013.
[2] Ismael and N. Hashimah, "Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering," in International Conference on Artificial Intelligence with Applications in Engineering and Technology, 2014.
[3] Kaur, "Evaluation of Document Clustering Approach based on Particle Swarm Optimization Method for Twitter Dataset," International Journal of Computer Science and Information Technologies, Vol. 6, 2015.
[4] X. Y. Wang and J. M. Garibaldi, "A Comparison of Fuzzy and Non-Fuzzy Clustering Techniques in Cancer Diagnosis."
[5] O. Oyelade and Obagbuwa, "Application of k-Means Clustering Algorithm for Prediction of Students' Academic Performance," International Journal of Computer Science and Information Security, Vol. 7, No. 1, 2010.
[6] L. Bijuraj, "Clustering and its Applications," 2013.
[7] T. Kanungo, D. M. Mount, N. S. Netanyahu, et al., "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, 2002.
[8] Kelaiaia and H. Merouani, "Clustering with Probabilistic Topic Models on Arabic Texts: A Comparative Study of LDA and K-Means," The International Arab Journal of Information Technology, Vol. 13, No. 2, 2016.
[9] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques."
[10] Jadon and A. Khunteta, "A New Approach of Document Clustering," International Journal of Advanced Research in Computer Science and Software Engineering, 2013.
[11] H. Anaya-Sánchez, "A Document Clustering Algorithm for Discovering and Describing Topics," ScienceDirect, 2010.
[12] M. C. Ling Shyu and H. Rubin, "Affinity-Based Similarity Measure for Web Document Clustering," IEEE, 2004.
[13] S. A. Ghosh, "Impact of Similarity Measures on Web-page Clustering," AAAI Technical Report WS-00-01, 2000.
[14] Leuski, "Evaluating Document Clustering for Interactive Information Retrieval," 2010.
[15] K. Nigam and T. Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM," Springer-Verlag, 2006.
[16] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang, "Ontology-based Distance Measure for Text Clustering."
[17] Sridevi and Nagaveni, "Ontology Based Similarity Measure in Document Ranking," International Journal of Computer Applications (0975-8887), Vol. 1, No. 26, 2010.
[18] O. Rajendra and Sahay, "Web Document Clustering and Ranking using Tf-Idf based Apriori Approach," in Proceedings of the 2005 International Conference on Computational Intelligence for Modelling.
[19] Rekha and R. D., "A Frequent Concepts Based Document Clustering Algorithm," International Journal of Computer Applications (0975-8887), Vol. 4, No. 5, 2010.
[20] M. Thangamani and P. Thangaraj, "Fuzzy Clustering of Documents," Modern Applied Science, Vol. 4, No. 7, July 2010.
[21] R. T. Freeman, H. Yin, and N. M. Allinson, "Self-Organising Maps for Tree View Based Hierarchical Document Clustering," IEEE, 2002.
[22] R. S. and P. A., "Space and Cosine Similarity Measures for Text Document Clustering," International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 2, 2013.
[23] C. Ding and X. He, "K-means Clustering via Principal Component Analysis."
[24] K. Premalatha and Natarajan, "Discrete PSO with GA Operators for Document Clustering," International Journal of Recent Trends in Engineering, Vol. 1, No. 1, 2009.
[25] J. Kennedy and R. Eberhart, "Particle Swarm Optimization," in IEEE International Conference on Neural Networks, 1995.
[26] H. Ahmed and J. Glasgow, "Swarm Intelligence: Concepts, Models and Applications," 2012.
[27] X. Cui, T. E. Potok, and P. Palathingal, "Document Clustering using Particle Swarm Optimization," Applied Software Engineering Research Group, Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6085.
[28] M. N. Singh, "The Improved K-Means with Particle Swarm Optimization," Journal of Information Engineering and Applications, Vol. 3, No. 11, 2013.
[29] O. Ghanem and W. Ashour, "Stemming Effectiveness in Clustering of Arabic Documents," International Journal of Computer Applications (0975-8887), 2012.
[30] R. Jensi, "A Survey on Optimization Approaches to Text Document Clustering," International Journal on Computational Sciences & Applications (IJCSA), Vol. 3, No. 6, 2013.
[31] H. Aliwy, "Arabic Morphosyntactic Raw Text Part of Speech Tagging System," 2012.
[32] S. R. El-Beltagy and A. Rafea, "An Accuracy-Enhanced Light Stemmer for Arabic Text," ACM Transactions on Speech and Language Processing, Vol. 6, No. 3, Article 9, 2010.
[33] S. Alghamdi and Shahriza, "Arabic Web Pages Clustering and Annotation Using Semantic Class Features," Journal of King Saud University - Computer and Information Sciences, 2014.
[34] C. D. Manning, P. Raghavan, and H. Schütze, "Flat Clustering," in Introduction to Information Retrieval, Cambridge University Press, 2008.
[35] M. Lashkari and A. Rostami, "Extended PSO Algorithm for Improvement Problems K-Means Clustering Algorithm," International Journal of Managing Information Technology (IJMIT), Vol. 6, No. 3, 2014.
[36] S. P. Deshpande and V. M. Thakare, "Data Mining System and Applications: A Review," International Journal of Distributed and Parallel Systems (IJDPS), Vol. 1, No. 1, 2010.
[37] M. N. Al-Gedawy, "Detecting Egyptian Dialect Microblogs using a Boosted PSO-based Fuzzier," Egyptian Computer Science Journal, Vol. 39, No. 1, 2015.