
DOCUMENT AND TERM CLUSTERING

• Introduction to Clustering
• Thesaurus Generation
• Item Clustering
• Hierarchy of Clustering
Introduction
• Cluster analysis is a technique for multivariate analysis that
assigns items to automatically created groups based on a
calculation of the degree of association between items and
groups.
• In the information retrieval (IR) field, cluster analysis has
been used to create groups of documents with the goal of
improving the efficiency and effectiveness of retrieval, or
to determine the structure of the literature of a field.
• The terms in a document collection can also be clustered to
show their relationships.
Introduction (cont.)
• The two main types of cluster analysis methods are the
non-hierarchical, which divide a data set of N items into
M clusters, and the hierarchical, which produce a nested
data set in which pairs of items or clusters are
successively linked.
• The non-hierarchical methods, such as the single pass and
reallocation methods, are heuristic in nature and require
less computation than the hierarchical methods.
Introduction (cont.)
• Term clustering: used to create a statistical thesaurus.
It increases recall by expanding searches with related terms (query
expansion).
• Document clustering: used to create document clusters.
The search can retrieve items similar to an item of interest, even
if the query would not have retrieved the item (resultant set
expansion). Also used for result-set clustering.
Process of Clustering
• Define the domain for clustering.
   - Thesaurus: for example, "medical terms"
   - Documents: the set of items to be clustered
   - Identify the objects to be used in the clustering process and reduce the potential
     for erroneous data that could induce errors in the clustering process
• Determine the attributes of the objects to be clustered.
   - Thesaurus: determine the specific words in the objects to be used
   - Documents: may focus on specific zones within the items that are to be used to
     determine similarity
   - Reduce erroneous associations
Complete Term Relation Method
• In this method, the similarity between every term pair is
calculated as a basis for determining the clusters.
• The vector model is used for clustering.
• The vector model is represented by a matrix where the
rows are individual items and the columns are the
unique words (processing tokens) in the items.
• Each value in the matrix represents how strongly that
particular word represents a concept in the item.
Complete Term Relation Method (cont.)
• The following example shows a database with 5 items and 8 terms.

• To determine the relationship between terms, a similarity measure is
required. The measure calculates the similarity between two terms.
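Similarity Measure Formula in CTRM

$$\mathrm{SIM}(\mathrm{Term}_i, \mathrm{Term}_j) = \sum_{k} \mathrm{Term}_{k,i} \cdot \mathrm{Term}_{k,j}$$

• Here k is summed across the set of all items, and Term_{k,i} is the
value for term i in item k. The formula takes the two columns of the
two terms being analyzed, multiplying and accumulating the values in
each row. The results can be placed in a resultant M × M matrix, called
the Term-Term matrix, where M is the number of columns (terms) in the
original matrix.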
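
As a minimal sketch of the similarity calculation described above, the following Python computes a Term-Term matrix from an item-term matrix. The weights in `ITEM_TERM` are hypothetical values chosen for illustration (they are not the 5-item, 8-term example from the slides), and the function name `term_term_matrix` is likewise an assumption.

```python
# Hypothetical item-term weights (NOT the slide example):
# rows are items, columns are terms.
ITEM_TERM = [
    [0, 4, 0, 3],
    [3, 1, 4, 0],
    [3, 0, 0, 2],
    [0, 1, 0, 3],
]


def term_term_matrix(item_term):
    """Return an M x M matrix of SIM(Term_i, Term_j) values.

    Each off-diagonal entry is the sum over all items k of
    item_term[k][i] * item_term[k][j]. The diagonal is left as None
    because a term's similarity to itself is not used.
    """
    n_terms = len(item_term[0])
    sim = [[None] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            if i != j:
                sim[i][j] = sum(row[i] * row[j] for row in item_term)
    return sim


TERM_TERM = term_term_matrix(ITEM_TERM)
for row in TERM_TERM:
    print(row)
```

The resulting matrix is symmetric, so in practice only the upper triangle needs to be computed; the full loop is kept here for readability.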

• The following is the Term-Term matrix for the example given above.
Term-Term Matrix

• There is no value on the diagonal, since that represents the
autocorrelation of a word with itself.
• The next step is to select a threshold that determines whether two terms are
considered similar enough to be in the same class.
• In this example the threshold value is 10. This produces a new binary
matrix, called the Term Relationship Matrix, that defines which terms are
similar.
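
The thresholding step can be sketched as follows. The similarity values in `TERM_TERM` are hypothetical, and treating a value equal to the threshold as "similar" (`>=`) is an assumption, since the slides only state that the threshold is 10.

```python
# Hypothetical Term-Term matrix (None on the diagonal), for illustration only.
TERM_TERM = [
    [None, 3, 12, 6],
    [3, None, 4, 15],
    [12, 4, None, 0],
    [6, 15, 0, None],
]


def term_relationship_matrix(term_term, threshold):
    """Return a binary matrix: 1 where two distinct terms are similar.

    Assumption: a similarity equal to the threshold counts as similar.
    The diagonal is set to 0 because self-similarity is not used.
    """
    n = len(term_term)
    return [
        [1 if i != j and term_term[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


RELATED = term_relationship_matrix(TERM_TERM, threshold=10)
for row in RELATED:
    print(row)
```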
Term Relationship Matrix

• The final step in creating clusters is to determine when two objects (words) are
in the same cluster. Many algorithms are available, including cliques, single
link, stars, and connected components.
Cliques Algorithm
• Cliques require all items in a cluster to be within the threshold of all other items.
Cliques Algorithm (cont.)
• Applying the algorithm to the Term Relationship Matrix creates the
following classes. Notice that Term 1 and Term 6 are in more than
one class.
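
A minimal sketch of clique-based clustering follows. It does not reproduce the algorithm from the slides; it uses the standard Bron-Kerbosch recursion (without pivoting) to enumerate maximal cliques. The relationship matrix `REL` is hypothetical, and terms are numbered from 0 in the code.

```python
# Hypothetical binary Term Relationship Matrix for 6 terms (0-based indices).
REL = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def maximal_cliques(rel):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    Every pair of terms inside a returned class is related, and no
    further term could be added without breaking that property.
    """
    n = len(rel)
    neighbors = [{j for j in range(n) if j != i and rel[i][j]} for i in range(n)]
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in sorted(candidates):
            expand(current | {v}, candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}  # v has been fully explored
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return sorted(cliques)


print(maximal_cliques(REL))
# [[0, 1, 2], [2, 3], [3, 4], [5]]
```

In this hypothetical input, terms 2 and 3 each appear in two classes, which illustrates how cliques can overlap.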
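Single Link Algorithm
• Here the strong constraint that every term in a class is similar to
every other term is relaxed. The rule for generating single link clusters
is that any term that is similar to any term in the cluster can be added
to the cluster.
• It is impossible for a term to be in two different clusters. The
algorithm is:
   1. Select a term that is not in a class and place it in a new class.
   2. Place in that class all other terms that are related to it.
   3. For each term entered into the class, perform step 2.
   4. When no new terms can be identified in step 2, go to step 1.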
Single Link Algorithm (cont.)
• Applying this algorithm to the Term Relationship Matrix creates the
following classes.
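
The single link rule described above amounts to finding the connected components of the relationship graph. The sketch below assumes a hypothetical relationship matrix `REL` with 0-based term indices; the function name is illustrative.

```python
# Hypothetical binary Term Relationship Matrix for 6 terms (0-based indices).
REL = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def single_link_clusters(rel):
    """Group terms so that any term related to a cluster member joins it."""
    n = len(rel)
    assigned = [False] * n
    clusters = []
    for seed in range(n):
        if assigned[seed]:
            continue
        # Start a new class from a term that is not yet in any class.
        cluster = [seed]
        assigned[seed] = True
        frontier = [seed]
        while frontier:
            term = frontier.pop()
            # Pull in every unassigned term related to a term already in the class.
            for other in range(n):
                if rel[term][other] and not assigned[other]:
                    assigned[other] = True
                    cluster.append(other)
                    frontier.append(other)
        clusters.append(sorted(cluster))
    return clusters


print(single_link_clusters(REL))
# [[0, 1, 2, 3, 4], [5]]
```

Because each term is marked as assigned the moment it joins a class, no term can end up in two clusters.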
Star Technique Algorithm
• Select a term and then place in the class all terms that are related to that term.
• Terms not yet in classes are selected as new seeds until all terms are assigned
to a class.
• Many different sets of classes can be created using the star technique, depending
on which terms are chosen as seeds. One deterministic choice is to always start a
class from the lowest-numbered term not already in a class.
• Applying this algorithm to the Term Relationship Matrix with that choice creates
the following classes. The technique allows a term to be in multiple clusters,
for example Term 4.
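
A minimal sketch of the star technique with the lowest-numbered-seed rule is shown below. The relationship matrix `REL` is hypothetical, terms are numbered from 0, and the sketch assumes that a class collects every term related to its seed, including terms that already belong to an earlier class.

```python
# Hypothetical binary Term Relationship Matrix for 6 terms (0-based indices).
REL = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def star_clusters(rel):
    """Build classes around seeds, taking the lowest unassigned term each time."""
    n = len(rel)
    assigned = set()
    classes = []
    for seed in range(n):
        if seed in assigned:
            continue  # only terms not yet in a class may become seeds
        # A class is the seed plus every term directly related to it,
        # even if that term is already in another class.
        members = [seed] + [t for t in range(n) if t != seed and rel[seed][t]]
        classes.append(sorted(members))
        assigned.update(members)
    return classes


print(star_clusters(REL))
# [[0, 1, 2], [2, 3, 4], [5]]
```

Term 2 lands in two classes here, which mirrors the overlap noted for the star technique. Choosing seeds in a different order would generally produce different classes.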
