
Document Clustering Based on Term Frequency and Inverse Document Frequency

S. Suseela, Assistant Professor/CSE, Periyar Maniammai University, Vallam
suseelass@gmail.com

Abstract- In organizations such as offices, colleges, industries, and libraries, we maintain large amounts of information: medical documents, financial documents, and various other document collections. These collections contain billions of items and have been growing exponentially. One way to deal with this extraordinary flood of data is cluster analysis. This has led to the need to organize large sets of documents into categories through clustering. It is believed that grouping similar documents into clusters will help users find relevant information more quickly and will allow them to focus their search in the appropriate direction. Clustering is used to divide large unstructured document corpora into groups of more or less closely related documents. We propose a new similarity measure to compute the similarity of text-based documents based on the Term Frequency and Inverse Document Frequency using the Vector Space Model. We apply the new similarity measure to the HAC (Hierarchical Agglomerative Clustering) algorithm and develop a new clustering approach (DC_TFIDF). These models provide accurate document similarity calculation and also improve the effectiveness of the clustering technique over traditional methods.

Keywords: Cluster Analysis, Similarity Measure, Vector Space Model, Term Frequency and Inverse Document Frequency

I. INTRODUCTION

Since most information content is still available in textual form, text is an important basis for information retrieval. Natural language text carries a great deal of meaning, which still cannot be fully captured computationally. Information retrieval systems are therefore based on strongly simplified models of text, ignoring most of the grammatical structure and reducing texts essentially to the terms they contain [1]. This approach, called full text retrieval, is a simplification that has proven to be very successful. Nowadays this approach is gradually being extended by taking into account other features of documents, such as the document or link structure.

Text mining shares many concepts with traditional data mining methods. Data mining includes many techniques that can unveil inherent structure in underlying data; one of these techniques is clustering. When applied to textual data, clustering methods try to identify natural groupings of the text documents so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity [2]. Generally, text document clustering methods attempt to segregate the documents into groups, where each group represents a topic that is different from the topics represented by the other groups [3]. Applications of document clustering include clustering of retrieved documents to present organized and understandable results to the user, clustering documents in a collection (e.g., digital libraries), automated (or semi-automated) creation of document taxonomies (e.g., Yahoo and Open Directory styles), and efficient information retrieval by focusing on relevant subsets (clusters) rather than whole collections.

In general, clustering techniques are based on four concepts: a data representation model, a similarity measure, a clustering model, and a clustering algorithm. Most of the current document clustering methods are based on the Vector Space Document (VSD) model [4]. The common framework of this data model starts with a representation of any document as a feature vector of the words that appear in the documents of a data set. A distinct word appearing in the documents is usually considered to be an atomic feature term in the VSD model, because words are the basic units in most natural languages (including English) for representing semantic concepts. In particular, the term weights (usually tf-idf: term frequencies and inverse document frequencies) of the words are also contained in each feature vector [6]. The similarity between two documents is computed with one of several similarity measures based on the two corresponding feature vectors, e.g., the cosine measure, the Jaccard measure, and the Euclidean distance. To achieve more accurate document clustering, a more informative feature term, the phrase, has been considered in recent research work and literature. A phrase of a document is an ordered sequence of one or more words [7]. Bigrams and trigrams are commonly used to extract and identify meaningful phrases in statistical natural language processing [8].

II. DOCUMENT CLUSTERING

Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. Unlike document classification, no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Text clustering has been approached with decision trees [9], statistical analysis [10], neural networks [11], inductive logic programming, and rule-based systems, and it draws on research areas such as databases, information retrieval (IR), artificial intelligence (AI), and natural language processing.

Any clustering technique relies on four concepts:

1. A data representation model
2. A similarity measure
3. A cluster model, and
4. A clustering algorithm that builds the clusters using the data model and the similarity measure.

Similarity between documents is measured using one of several similarity measures that are based on such a feature vector.
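To make these four concepts concrete, the following is a minimal Python sketch of a document clustering pipeline using the scikit-learn library (assumed to be available). It is not part of the proposed system, its sample sentences are illustrative only, and scikit-learn's built-in tf-idf weighting differs slightly from the scheme defined in Section V.

# Minimal sketch: the four clustering concepts with scikit-learn.
# Data representation: tf-idf vectors; similarity: cosine;
# cluster model and algorithm: hierarchical agglomerative clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

documents = [
    "The cat ate cheese too.",
    "The cat ate milk also.",
    "The dog chased the mouse.",
]

# 1. Data representation model: each document becomes a tf-idf feature vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

# 2. Similarity measure: pairwise cosine similarity, converted to a distance.
distance = 1.0 - cosine_similarity(vectors)

# 3 and 4. Cluster model and algorithm: agglomerative clustering over the
# precomputed cosine distances (the parameter is named "affinity" instead of
# "metric" in older scikit-learn versions).
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(distance)
print(labels)  # e.g. [0, 0, 1]: the two "cat" sentences end up together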
Fig. 2.1 shows how the clustering of the documents is achieved.

[Figure 2.1: a dendrogram; the vertical axis shows values from 0 to 0.2 and the leaves are documents 1, 3, 2, 5, 4, and 6.]

Figure 2.1. Agglomerative and divisive clustering

The clustering of the documents can be visualized as a dendrogram, as in Fig. 2.1: a tree-like diagram that records the sequence of merges or splits. A clustering of the data objects is obtained by cutting the dendrogram at the desired level.

III. VECTOR SPACE MODEL

The vector space model (or term vector model) is an algebraic model for representing text documents (and, in general, any objects) as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. A document is represented as a vector, and each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed; one of the best known schemes is tf-idf weighting. The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). A similarity measure, as a computable numeric value, is defined over the query vector and the document vector to express the likeness (closeness) between a query and a document. The similarity measure typically has the following three basic properties: (i) it usually takes on values between 0 and 1; (ii) its value does not depend on the order in which the query and the document are compared; and (iii) it is equal to 1 when the query and document vectors are equal. Those documents for which the similarity measure exceeds a threshold value are said to be retrieved in response to a query. The main advantages of tf-idf are that term weighting improves the quality of the answer set, partial matching allows retrieval of documents that approximate the query conditions, and the cosine ranking formula sorts the results. The vector space model procedure can be divided into three stages. The first stage is document indexing, where content-bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure.

IV. HIERARCHICAL AGGLOMERATIVE CLUSTERING

After finding the cosine similarity, we apply the similarity values in the HAC algorithm and find the clusters. The following procedure is used to automatically group related documents into clusters. The first step is to create an N x N doc-doc similarity matrix. Second, each document starts as a cluster of size one. Then the two clusters with the greatest similarity are combined, and the process continues until there is only one cluster. Fig. 4.1 shows how the clustering is achieved through the simple HAC algorithm.

SimpleHAC(d1, ..., dN)
Step 1:  for n ← 1 to N
Step 2:    do for i ← 1 to N
Step 3:      do C[n][i] ← Sim(dn, di)
Step 4:    I[n] ← 1
Step 5:  A ← []
Step 6:  for k ← 1 to N-1
Step 7:    do <i,m> ← argmax{<i,m>: i ≠ m and I[i] = 1 and I[m] = 1} C[i][m]
Step 8:    A.Append(<i,m>)
Step 9:    for j ← 1 to N
Step 10:     do C[i][j] ← Sim(j, i, m)   (C[i][j]: similarity between clusters i and j)
Step 11:     C[j][i] ← Sim(j, i, m)
Step 12:   I[m] ← 0   (I indicates which clusters are still available to be merged)
Step 13: return A   (A: list of clusters)

Figure 4.1. Simple HAC algorithm
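As an illustration of the pseudocode in Fig. 4.1, the following is a small self-contained Python sketch that repeatedly merges the two most similar available clusters given a precomputed similarity matrix and records the sequence of merges. It is one possible reading of the figure, not the authors' implementation; in particular, the cluster-to-cluster similarity update Sim(j, i, m) is not spelled out in the paper, so an average-linkage update is assumed here.

# Illustrative sketch of SimpleHAC: agglomerative clustering driven by a
# precomputed document-document similarity matrix.
def simple_hac(sim):
    """sim: N x N list of lists of pairwise document similarities.
    Returns the list of merges <i, m> in the order they are performed."""
    n = len(sim)
    C = [row[:] for row in sim]          # C[i][j]: current cluster similarity
    members = [[i] for i in range(n)]    # documents contained in each cluster
    I = [1] * n                          # 1 while the cluster can still be merged
    A = []                               # sequence of merges

    for _ in range(n - 1):
        # Find the most similar pair of available clusters.
        i, m = max(
            ((a, b) for a in range(n) for b in range(n)
             if a != b and I[a] and I[b]),
            key=lambda p: C[p[0]][p[1]],
        )
        A.append((i, m))
        members[i].extend(members[m])
        I[m] = 0                         # cluster m is no longer available

        # Update similarities of the merged cluster i to all other clusters as
        # the average pairwise document similarity (average linkage, assumed).
        for j in range(n):
            if j != i and I[j]:
                total = sum(sim[x][y] for x in members[i] for y in members[j])
                C[i][j] = C[j][i] = total / (len(members[i]) * len(members[j]))
    return A


if __name__ == "__main__":
    similarity = [
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0],
    ]
    print(simple_hac(similarity))  # e.g. [(0, 1), (0, 2)]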
V. DOCUMENT SIMILARITY BASED ON TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY (CSDC_TFIDF)

In this model, each document d is considered to be a vector in the M-dimensional term space. In particular, we usually employ the tf-idf term weighting scheme, in which each document can be represented as

D = {w(1, d), w(2, d), ..., w(M, d)},    (5.1)

where

w(i, d) = (1 + log tf(i, d)) · log(1 + N/df(i)),    (5.2)

tf(i, d) is the frequency of the ith term in the document d, and df(i) is the number of documents containing the ith term. In the VSD model, the cosine similarity is the most commonly used measure to compute the pairwise similarity of two documents di and dj. With the two weight vectors written as

D1 = w11, w12, ..., w1t    (5.3)
D2 = w21, w22, ..., w2t    (5.4)

the cosine similarity is their dot product divided by the product of their lengths.
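The following Python sketch illustrates the weighting scheme of Eq. (5.2) together with the cosine similarity over the resulting weight vectors. It is only an illustration: the paper does not state the logarithm base, so the natural logarithm is assumed here, and the example terms are adapted from Section VII.

# Sketch of Eq. (5.2) weights and cosine similarity over weight vectors.
import math
from collections import Counter


def tfidf_vector(doc_terms, vocabulary, df, n_docs):
    """w(i, d) = (1 + log tf(i, d)) * log(1 + N / df(i)) over a fixed
    vocabulary; terms absent from the document get weight 0."""
    tf = Counter(doc_terms)
    return [
        (1 + math.log(tf[t])) * math.log(1 + n_docs / df[t]) if tf[t] > 0 else 0.0
        for t in vocabulary
    ]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = [["cat", "ate", "cheese"], ["cat", "ate", "milk"]]
    vocabulary = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocabulary}
    vectors = [tfidf_vector(d, vocabulary, df, len(docs)) for d in docs]
    print(round(cosine_similarity(vectors[0], vectors[1]), 4))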
[Figure 5.1: flow of the proposed approach. Document sets → removing the stop words → vector space model → cosine similarity calculation → clustering (CSDC_TFIDF) → clustered documents.]

Fig. 5.1. DC_TFIDF

Before document clustering, a document "cleaning" procedure is executed for all documents in the document sets. First, all non-word tokens are stripped off. Second, the text is parsed into words. Third, all stop words are identified and removed. Finally, all stemmed words are concatenated into new documents. Using the vector space model, the words are represented as feature vectors. Equations 5.1 and 5.2 are then applied to find the corresponding weight for each term in the documents. After obtaining the term weights, we apply the cosine similarity and find the similar documents. If the cosine similarity is equal to one, the two documents are similar; otherwise they are not similar.

VI. PROCEDURE CSDC_TFIDF()

Procedure CSDC_TFIDF()
Input: document sets
Output: similar documents

1: N ← number of documents
2: Idf ← inverse document frequency
3: Df ← document frequency
4: di ← document identifiers
5: for each D do
6:   for each di do
7:     Read the document from left to right over the document D.
8:     Calculate the term frequency for each term di ∈ D.
9:     Calculate the inverse document frequency for each term di ∈ D.
10:    Calculate the cosine similarity:
         Cos_sim = (dx · dy) / (|dx| * |dy|)
11:    if (cos_sim == 1) then
         the two documents are similar
       else
         the two documents are not similar
       end if
     end for
   end for

VII. EXPERIMENTAL RESULT

Step 1: Take two documents, D1 and D2, over the document sets.

D1                             D2
X1. The cat ate cheese too.    Y1. The cat ate cheese too.
X2. The cat ate milk also.     Y2. The cat ate milk also.
X3. The cat ate mouse too.     Y3. The cat ate mouse too.

Table 7.1. Sample documents

Step 2: After removing the stop words:

D1                   D2
X1. cat ate cheese   Y1. cat ate cheese
X2. cat ate milk     Y2. cat ate milk
X3. cat ate mouse    Y3. cat ate mouse

Table 7.2. Documents after stop-word removal

Step 3: Draw the term-document matrix for D1 and D2. Xi1, Xi2, ..., Xin and Yi1, Yi2, ..., Yin are the term identifiers of D1 and D2, respectively.

Document D1
Term    Term Frequency    Document Frequency
X11     1                 3
X12     1                 3
X13     1                 1
X21     1                 3
X22     1                 3
X23     1                 3
X31     1                 3
X32     1                 3
X33     1                 1

Table 7.3. Term-document matrix for D1

Document D2
Term    Term Frequency    Document Frequency
Y11     1                 3
Y12     1                 3
Y13     1                 1
Y21     1                 3
Y22     1                 3
Y23     1                 3
Y31     1                 3
Y32     1                 3
Y33     1                 1

Table 7.4. Term-document matrix for D2
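As a small illustration of Steps 1 to 3 (and of the "cleaning" procedure of Fig. 5.1), the following Python sketch strips non-word tokens, removes stop words, and tabulates term and document frequencies for D1, treating each sentence of D1 as a separate document, which is one reading of Tables 7.3 and 7.4. The stop-word list is only an assumption; the paper does not give one.

# Sketch of the cleaning and counting steps: strip non-word tokens, remove
# stop words, and tabulate term and document frequencies for D1.
import re
from collections import Counter

STOP_WORDS = {"the", "too", "also", "a", "an", "of"}  # assumed stop-word list

def clean(sentence):
    """Lower-case, keep word tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

D1 = ["The cat ate cheese too.", "The cat ate milk also.", "The cat ate mouse too."]

cleaned = [clean(s) for s in D1]                                   # Step 2
term_frequency = Counter(t for s in cleaned for t in s)            # Step 3
document_frequency = Counter(t for s in cleaned for t in set(s))   # sentences as "documents"

print(cleaned)             # [['cat', 'ate', 'cheese'], ['cat', 'ate', 'milk'], ...]
print(term_frequency)      # cat: 3, ate: 3, cheese: 1, milk: 1, mouse: 1
print(document_frequency)  # same counts here, since each term occurs once per sentence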
Step 4: Find the weight for each term over D1, using

w(i, d) = (1 + log tf(i, d)) · log(1 + N/df(i)).

For example,

W(x11, D1) = (1 + log tf(x11, D1)) · log(1 + N/df(x11)) = (1 + log 1) · log(1 + 3/3) = 1.287682.

Term          Weight
W(x11, D1)    1.287682
W(x12, D1)    1.287682
W(x13, D1)    2.386294
W(x21, D1)    1.287682
W(x22, D1)    1.287682
W(x23, D1)    1.287682
W(x31, D1)    1.287682
W(x32, D1)    1.287682
W(x33, D1)    2.386294

Table 7.5. Term weights over D1

Likewise, find the weight for each term over D2.

Term          Weight
W(y11, D2)    1.287682
W(y12, D2)    1.287682
W(y13, D2)    2.386294
W(y21, D2)    1.287682
W(y22, D2)    1.287682
W(y23, D2)    1.287682
W(y31, D2)    1.287682
W(y32, D2)    1.287682
W(y33, D2)    2.386294

Table 7.6. Term weights over D2

Step 5: Apply the cosine similarity:

Cos_sim = (x11·y11 + x12·y12 + x13·y13 + x21·y21 + x22·y22 + x23·y23 + x31·y31 + x32·y32 + x33·y33) / (sqrt(x11² + x12² + x13² + x21² + x22² + x23² + x31² + x32² + x33²) · sqrt(y11² + y12² + y13² + y21² + y22² + y23² + y31² + y32² + y33²))

Cos_sim = 1.

Thus the two documents D1 and D2 are similar documents. Likewise, the values are found for all the terms in the term-document matrix; the hierarchical agglomerative algorithm is then applied to find the clustered documents.
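The following Python sketch ties Steps 1 to 5 together for the toy documents: it cleans the text, weights the terms with Eq. (5.2), and compares D1 and D2 with the cosine similarity. The stop-word list and the natural-logarithm base are assumptions, so the individual weight values need not match Tables 7.5 and 7.6, but two identical documents still obtain a cosine similarity of 1, as in Step 5.

# Sketch of Steps 1-5 for the toy documents: clean, weight with Eq. (5.2),
# then compare D1 and D2 with cosine similarity.
import math
import re

STOP_WORDS = {"the", "too", "also"}   # assumed stop-word list

def terms(doc_sentences):
    words = []
    for s in doc_sentences:
        words += [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
    return words

def weights(doc_terms, all_docs):
    n = len(all_docs)
    vocab = sorted({w for d in all_docs for w in d})
    return [
        (1 + math.log(doc_terms.count(w))) * math.log(1 + n / sum(w in d for d in all_docs))
        if w in doc_terms else 0.0
        for w in vocab
    ]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

D1 = ["The cat ate cheese too.", "The cat ate milk also.", "The cat ate mouse too."]
D2 = ["The cat ate cheese too.", "The cat ate milk also.", "The cat ate mouse too."]

docs = [terms(D1), terms(D2)]
w1, w2 = weights(docs[0], docs), weights(docs[1], docs)
print(cosine(w1, w2))   # ~1.0 for identical documents, as in Step 5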
VIII. EVALUATION METRICS

A. Entropy

One external measure is entropy, which provides a measure of "goodness" for un-nested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is: the higher the homogeneity of a cluster, the lower its entropy, and vice versa. The entropy of a cluster containing only one object (perfect homogeneity) is zero. Let P be a partition result of a clustering algorithm consisting of m clusters. For every cluster j in P we compute pij, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is calculated using the standard formula

Ej = - Σi pij · log(pij),

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster:

E = Σj (Nj / N) · Ej,

where Nj is the size of cluster j and N is the total number of data objects. We would like to generate clusters of lower entropy, which is an indication of the homogeneity (or similarity) of objects in the clusters. The weighted overall entropy formula avoids favoring smaller clusters over larger clusters.

B. F-Measure

The second external quality measure is the F-measure, which combines the precision and recall ideas from the information retrieval literature. The precision and recall of a cluster j with respect to a class i are defined as

P = Precision(i, j) = Nij / Nj
R = Recall(i, j) = Nij / Ni,

where Nij is the number of members of class i in cluster j, Nj is the number of members of cluster j, and Ni is the number of members of class i. The F-measure of a class i is defined as

F(i) = 2PR / (P + R).
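The following is a small Python sketch of both external measures, computed from true class labels and predicted cluster labels. The label data is made up for illustration, and the overall F score takes, for each class, the best F value over all clusters and weights it by class size, a common convention that the section above does not spell out.

# Sketch of the two external quality measures: size-weighted cluster entropy
# and the F-measure, from true class labels and predicted cluster labels.
import math
from collections import Counter

def total_entropy(classes, clusters):
    n = len(classes)
    by_cluster = {}
    for c, k in zip(classes, clusters):
        by_cluster.setdefault(k, []).append(c)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        e_j = -sum((m / len(members)) * math.log(m / len(members))
                   for m in counts.values())
        total += (len(members) / n) * e_j          # weight by cluster size
    return total

def f_measure(classes, clusters):
    """F(i) = 2PR/(P+R) per class, maximized over clusters, then size-weighted."""
    n = len(classes)
    class_sizes, cluster_sizes = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))
    score = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = joint[(i, j)]
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i
            best = max(best, 2 * p * r / (p + r))
        score += (n_i / n) * best
    return score

classes  = ["sports", "sports", "finance", "finance", "finance"]  # made-up labels
clusters = [0, 0, 1, 1, 0]
print(total_entropy(classes, clusters), f_measure(classes, clusters))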
IX. RELATED WORK

In traditional document models such as the VSD model, words or characters are considered to be the basic terms in statistical feature analysis and extraction. To achieve more accurate document clustering, incorporating more informative features has become increasingly important in the information retrieval literature. Bigrams, trigrams, and much longer n-grams have been commonly used in statistical natural language processing. The most closely related recent work compares four similarity measures on a collection of documents: the Dice coefficient, the Jaccard measure, cosine similarity, and Euclidean distance.

X. CONCLUSIONS

The traditional VSD model plays an important role in text-based information retrieval. We have proposed vector-based document models with cosine similarity measures as a highly accurate and efficient practical document clustering solution. The CSDC_TFIDF system improves efficiency and effectiveness in an information retrieval system, and it also reduces the search space. The proposed system clusters the documents and also gives users an overview of the contents of a document collection. If a collection is well clustered, we can search only the cluster that will contain relevant documents; searching a smaller collection should improve both effectiveness and efficiency.

XI. REFERENCES

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval (ACM Press Series), Addison Wesley, 1999.
[2] K. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods for Knowledge Discovery. Boston: Kluwer Academic Publishers, 1998.
[3] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, N.J.: Prentice Hall, 1992.
[4] M. Porter, "New Models in Probabilistic Information Retrieval," British Library Research and Development Report, no. 5587, 1980.
[5] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[6] E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
[7] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[8] M. Yamamoto and K. W. Church, "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus," Computational Linguistics, vol. 27, no. 1, pp. 1-30, 2001.
[9] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization," Proc. Seventh Int'l Conf. Information and Knowledge Management, pp. 148-155, Nov. 1998.
[10] D. Freitag and A. McCallum, "Information Extraction with HMMs and Shrinkage," Proc. AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 31-36, 1999.
[11] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, "WEBSOM: Self-Organizing Maps of Document Collections," Proc. WSOM '97, Workshop on Self-Organizing Maps, pp. 310-315, June 1997.
[12] O. Zamir and O. Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results," Computer Networks, vol. 31, nos. 11-16, pp. 1361-1374, 1999.
[13] E. Charniak, Statistical Language Learning. MIT Press, 1993.
