
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NANOSCIENCE, ENGINEERING & ADVANCED COMPUTING (ICNEAC-2011)

Clustering with Lower Bound on Similarity


T. A. RAVIKUMAR@, D. D. D. SURIBABU*, ANAND KUMAR DEVA$, D. CHITTI BABU#

@ PG Student, Department of CSE, Swarnandhra College of Engineering & Technology, Seetharampuram, Narsapur, (A.P.), India
* Department of CSE, Swarnandhra Institute of Engineering & Technology, Seetharampuram, Narsapur, (A.P.), India
$ Department of CSE, Swarnandhra Institute of Engineering & Technology, Seetharampuram, Narsapur, (A.P.), India
# Department of IT, Swarnandhra College of Engineering & Technology, Seetharampuram, Narsapur, (A.P.), India
Email: @rk.thadi@gmail.com, *siet_csehod@yahoo.com, $anand.k.deva@gmail.com, #itchittibabu@gmail.com

ABSTRACT
Traditional text-based document classifiers tend to perform poorly on the Web. Text in web documents is usually noisy and often does not contain enough information to classify them. However, the Web provides a different source of evidence that can be useful for document classification: its hyperlink structure. In this paper we propose a new method, called SimClus, for clustering web documents with a lower bound on similarity. Instead of accepting k, the number of clusters to find, this similarity-based approach imposes a lower bound on the similarity between an object and its corresponding cluster representative. This automatically imposes a similarity bound among the members of a cluster, supports overlapping clusters, and can easily be adapted to work in a dynamic setting, where new objects are added or existing objects are removed. Under this formulation, the clustering objective is to cover all the objects with as few representative objects as possible.

KEYWORDS: Clustering, Hyperlink, Representative Objects, Overlapping Clusters.

1. INTRODUCTION

The World Wide Web has become a main focus of research in Information Retrieval (IR). Its unique characteristics, such as the increasing volume of data, the volatility of its documents, and the wide array of user interests, make it a challenging environment for traditional IR solutions. On the other hand, the Web provides ground to explore a new set of possibilities. Multimedia documents, semi-structured data, user behavior logs, and many other sources of information allow a whole new range of IR algorithms to be tested. This work focuses on one such source of information, widely available on the Web: its link structure.

It is possible to infer at least two different meanings from links between web pages. First, if two pages are linked, we can assume that their subjects are related. Second, if a page is pointed to by many others, we can assume that its content is important. These two assumptions have been used successfully in web IR for tasks such as document classification. Web documents are usually noisy and contain little text, often consisting of images, scripts and other types of data unusable by text classifiers. Furthermore, they can be created by many different authors, with no coherence in style, language or structure. Thus, link information can be a useful complement for classification. In this paper, we evaluate how link structure can be used to determine a measure of similarity between web documents. A good similarity measure will be able to accurately determine whether two web pages are topic-related. Thus, we expect such a measure to be effective in classifying documents into a set of pre-defined categories.

In many application domains that involve clustering, it is difficult to guess the number of clusters. It is even more challenging in dynamic domains such as newsgroup and blogosphere clustering, where the number of topics is typically unknown and may even change. An alternative to the parameter k that defines the number of clusters is to provide a lower bound that defines the desired (minimum) similarity between a cluster member and a representative object in that cluster. This similarity-based formulation has several benefits. Firstly, the lower bound automatically imposes a similarity bound among the members of a cluster. Secondly, it supports overlapping clusters in a very natural way. Thirdly, this paradigm can easily be adapted to work in a dynamic setting, where new objects are added or existing objects are removed.

2. LINK-BASED SIMILARITY MEASURES

The issue of document similarity is of central importance to IR. Although the most widely used measure is still the cosine similarity in the vector space model, it is known that using different approaches will influence retrieval effectiveness. For this reason, many alternatives which use link information have been proposed as a way of finding similarity among web documents. In this paper, we evaluate link-based similarity measures by applying them to a classification algorithm. To determine how related two web pages are, we use five different similarity measures derived from their link structure: co-citation, bibliographic coupling, Amsler, Companion with authority degrees, and Companion with hub degrees. Here, we use these measures to provide a value of similarity between documents.
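As a point of reference for the link-based alternatives that follow, the short Python sketch below computes the cosine similarity baseline over simple term-frequency vectors. It is an illustration only; the function name and the toy documents are ours, not part of the paper.

import math
from collections import Counter

def cosine_similarity(doc_a_terms, doc_b_terms):
    # Cosine of the angle between the term-frequency vectors of two documents.
    a, b = Counter(doc_a_terms), Counter(doc_b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity(["web", "link", "cluster"], ["web", "cluster", "graph"]))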


There are many differences between citations and web links. Citations between scientific papers are commonly used to provide background information, give credit to the authors of an idea, and discuss or criticize existing work, among others. Web links, on the other hand, can be seen as a generalized form of citation. Besides the same functionality, they are also used for advertising, in-site navigation, and providing access to databases, among others. These extra roles can make links a less reliable source of evidence when used as an indicator of similarity between web pages. However, we argue that there is enough functionality in common between links and citations to allow the application of the measures proposed here. In the following sections, we describe each of these measures in detail.

2.1 Co-citation

Co-citation was first proposed as a similarity measure between scientific papers. Two papers are co-cited if a third paper has citations to both of them. This reflects the assumption that the author of a scientific paper will cite only papers related to his own work. As discussed above, although web links have many differences from citations, we can assume that many of them have the same meaning, i.e., a web page author will insert links to pages related to his own page. In this case, we can apply co-citation to web documents by treating links as citations. We say that two pages are co-cited if a third page has links to both of them. Let d be a web page and let Pd be the set of pages that link to d, called the parents of d. The co-citation similarity between two pages d1 and d2 is defined as:

    cocitation(d1, d2) = |Pd1 ∩ Pd2| / |Pd1 ∪ Pd2|    (1)

Eq. (1) shows that the more parents d1 and d2 have in common, the more related they are. This value is normalized by the total set of parents, so that the co-citation similarity varies between 0 and 1. If both Pd1 and Pd2 are empty, we define the co-citation similarity as zero.

2.2 Bibliographic Coupling

In this similarity measure, two documents share one unit of bibliographic coupling if they both cite the same paper. The idea is based on the notion that paper authors who work on the same subject tend to cite the same papers. As for co-citation, we can apply this principle to the Web. We assume that two authors of web pages on the same subject tend to insert links to the same pages. Thus, we say that two pages have one unit of bibliographic coupling between them if they link to the same page. More formally, let d be a web page. We define Cd as the set of pages that d links to, also called the children of d. Bibliographic coupling between two pages d1 and d2 is defined as:

    bibcoupling(d1, d2) = |Cd1 ∩ Cd2| / |Cd1 ∪ Cd2|    (2)

According to Eq. (2), the more children page d1 has in common with page d2, the more related they are. This value is normalized by the total set of children, to fit between 0 and 1. If both Cd1 and Cd2 are empty, we define the bibliographic coupling similarity as zero.

2.3 Amsler

This measure of similarity combines both co-citation and bibliographic coupling. According to Amsler, two papers A and B are related if (1) A and B are cited by the same paper, (2) A and B cite the same paper, or (3) A cites a third paper C that cites B. As for the previous measures, we can apply the Amsler similarity measure to web pages, replacing citations by links. Let d be a web page, let Pd be the set of parents of d, and let Cd be the set of children of d. The Amsler similarity between two pages d1 and d2 is defined as:

    amsler(d1, d2) = |(Pd1 ∪ Cd1) ∩ (Pd2 ∪ Cd2)| / |(Pd1 ∪ Cd1) ∪ (Pd2 ∪ Cd2)|    (3)

Eq. (3) tells us that the more links (either parents or children) d1 and d2 have in common, the more they are related. The measure is normalized by the total number of links. If neither d1 nor d2 has any children or parents, the similarity is defined as zero.
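As a concrete illustration of Eqs. (1)-(3), the following Python sketch computes the three measures directly from parent and child link sets. The helper names and the toy link data are ours, and the sketch assumes the union-normalized forms given above.

def _jaccard(a, b):
    # |a ∩ b| / |a ∪ b|, defined as 0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cocitation(parents, d1, d2):        # Eq. (1)
    return _jaccard(parents.get(d1, set()), parents.get(d2, set()))

def bibcoupling(children, d1, d2):      # Eq. (2)
    return _jaccard(children.get(d1, set()), children.get(d2, set()))

def amsler(parents, children, d1, d2):  # Eq. (3)
    links1 = parents.get(d1, set()) | children.get(d1, set())
    links2 = parents.get(d2, set()) | children.get(d2, set())
    return _jaccard(links1, links2)

# Toy link structure: parents[p] = pages linking to p, children[p] = pages p links to.
parents = {"a": {"x", "y"}, "b": {"y", "z"}}
children = {"a": {"m"}, "b": {"m", "n"}}
print(cocitation(parents, "a", "b"), bibcoupling(children, "a", "b"), amsler(parents, children, "a", "b"))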

2.4 Companion

In this approach, given a web page d, the algorithm finds a set of pages related to d by examining its link structure. Companion is able to return a degree of how related each page is to d. This degree can be used as a similarity measure between d and other pages. To find a set of pages related to a page d, the Companion algorithm has two main steps: (1) build a vicinity graph of d and (2) compute the degrees of similarity. In step 1, pages that are linked to d are retrieved. We build the set V, the vicinity of d, so that it contains the parents of d, the children of the parents of d, the children of d, and the parents of the children of d. This is the set of pages related to d. In step 2 we compute the degree to which the pages in V are related to d. To do this, we consider the pages in V and the links among them as a graph, called the vicinity graph of d. This graph is then processed by the HITS algorithm. The HITS algorithm returns the degree of authority and hub of each page in V. Intuitively, a good authority is a page with important information on a given subject. A good hub is a page that links to many good authorities. In practice, the degrees of authority and hub are computed recursively: a page is a good hub if it links to many good authorities, and a good authority if it is linked to by many good hubs. Once HITS is applied, we can choose to use the degree of authority or hub (or a combination of both) as a measure of similarity between d and each page in V. We define the similarity between d and any page that is not in V as zero. The Companion algorithm can use either the authority or the hub degree in isolation as a similarity measure.

All the measures described here can be used to calculate the similarity between any two web documents. To be useful, these measures should be able to correctly determine if two web pages are on the same subject.
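The following sketch shows the HITS iteration that step 2 relies on, run over a small vicinity graph given as an adjacency list. The toy graph, the function name, and the normalization details are ours and only illustrate the idea.

def hits(graph, iterations=50):
    # graph: page -> list of pages it links to (the vicinity graph of d).
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is linked to by many good hubs.
        auth = {n: sum(hub[u] for u in nodes if n in graph.get(u, ())) for n in nodes}
        # A good hub links to many good authorities.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize so the scores stay bounded.
        a_norm = sum(s * s for s in auth.values()) ** 0.5 or 1.0
        h_norm = sum(s * s for s in hub.values()) ** 0.5 or 1.0
        auth = {n: s / a_norm for n, s in auth.items()}
        hub = {n: s / h_norm for n, s in hub.items()}
    return auth, hub

# Toy vicinity graph around a page d with two parents (p1, p2) and one child (c1).
vicinity = {"p1": ["d", "p2"], "p2": ["d"], "d": ["c1"], "c1": []}
authority, hubness = hits(vicinity)
print(authority["d"], hubness["p1"])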


3. SimClus: Lower Bound Similarity Clustering

Consider a set of objects, O, and a similarity function, sim: O × O → [0, 1], such that for any x ∈ O: sim(x, x) = 1. Assume that the objective is to cluster the objects in O such that the objects in a cluster are at least β-similar, for a user-defined β ∈ [0, 1], with a minimum number of clusters. This formulation leads to an interesting graph problem, given as follows. Let G(V, E) be a graph whose vertices are the objects, and where an edge e(u, v) ∈ E implies that the similarity between vertices u and v is at least β. Below we shall refer to it as the β-similarity graph. Now, any clique in this graph can be taken as one cluster in some clustering, since the elements in a clique satisfy the required pair-wise similarity constraints. The clustering objective then becomes to cover the entire graph G by a minimum number of cliques.

A relaxation of the above problem can be obtained which requires the similarity bound β to hold only between the cluster elements and a fixed object belonging to that cluster. This center object is the representative for the corresponding cluster. We call this relaxed formulation the lower bound similarity clustering (LBSC) problem. LBSC seeks exactly one cluster center (representative object) for every cluster. Thus a center object c, together with all the objects s such that sim(c, s) ≥ β, forms a cluster. All objects s that satisfy this inequality are called β-similar objects with respect to the object c. If s is β-similar to multiple centers, it belongs to multiple clusters. Thus this model naturally supports overlapping clusters. Since every object s belongs to at least one cluster, s is β-similar to at least one representative object, say c. In that case, we say that c covers s. Thus, the clustering objective is to cover all the objects with the smallest number of center objects. A center always covers itself.

Star clustering is the leading algorithm in this paradigm. It sorts the vertices in descending order of degree and then selects the first vertex in the sorted order as one of the cluster centers. Any other vertex covered by this one is deleted from the sorted list, and the procedure is repeated until all the vertices are covered. The main difference between Star clustering and SimClus is that, instead of choosing the uncovered object with the highest degree (as done in Star), SimClus chooses the object that can cover the most uncovered elements. This drastically reduces the number of clusters; the clusters are denser and hence more informative.

In SimClus, at the beginning, none of the objects is covered and the center-set is empty. We then map each vertex u of the β-similarity graph to a set s_u that contains u itself and all objects that it can cover. To facilitate the greedy heuristic, the set s_u, u ∈ V, contains only those objects that are uncovered. So, we call the set s_u the uncovered cover-set of u, which intuitively means that it holds those objects that are uncovered and can be covered by choosing u as a center. Hence, once an object is chosen as a center, the uncovered cover-sets of all the non-center objects are updated by removing any object that is covered by the newly chosen center. In every iteration a new center is selected using the above greedy criterion, and the process is repeated until all the objects are covered. If there is a tie in the above criterion, we break the tie by selecting the object that has the largest degree in the similarity graph. If there is still a tie, we break it arbitrarily.
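A minimal sketch of this greedy selection, assuming the β-similarity graph is given as a dictionary of adjacency sets. The data structures and names are ours, not the authors' implementation; the uncovered cover-set is obtained here by intersecting each cover-set with the current set of uncovered objects.

def simclus_static(graph):
    # graph: object -> set of objects that are at least β-similar to it.
    uncovered = set(graph)
    cover_set = {u: {u} | graph[u] for u in graph}   # u covers itself and its neighbors
    centers = []
    while uncovered:
        # Greedy criterion: the object covering the most uncovered elements;
        # ties are broken by degree in the similarity graph.
        c = max(graph, key=lambda u: (len(cover_set[u] & uncovered), len(graph[u])))
        centers.append(c)
        uncovered -= cover_set[c]
    # Each center, with all objects β-similar to it, forms one (possibly overlapping) cluster.
    return centers, {c: cover_set[c] for c in centers}

# Toy symmetric β-similarity graph.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(simclus_static(g))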

4. Dynamic SimClus

The static algorithm provided in the previous section requires that the entire β-similarity graph is available before the algorithm is applied. However, in many practical application scenarios, this requirement does not hold. For information retrieval, new documents can be added or old documents may be deleted from the repositories, and the clustering may need to be updated. One option is to re-cluster the objects by running the static clustering algorithm for every change in the collection. But, in most cases, the changes are only local, so re-computing the entire clustering is wasteful. Also note that re-computation dramatically changes the cluster centers; so, if the objective of LBSC is to find representative objects in a dynamic setting, one may prefer the dynamic algorithm over the static algorithm, since the former retains a large portion of the representative set. It is also useful for adopting a lazy update scheme for the static version: since the dynamic version is generally worse in terms of number of clusters compared to the static version, a static update may follow after a batch of dynamic updates to re-optimize the clustering.

In a dynamic environment, we assume that we have an initial solution for the lower bound similarity clustering. New requests for insertion or deletion of an object come in an arbitrary manner. We need to satisfy the requests efficiently while maintaining the minimality of the representative set as much as possible. We are allowed to change the status of an object in either direction (a center can be made a non-center and vice versa).


Since the dynamic model assumes that a valid solution to the lower bound similarity clustering exists, addition of new objects cannot be based on the greedy criterion of the static algorithm, as only one object (the new object) is uncovered and all vertices adjacent to it have exactly the same size (one) for the uncovered cover-set. So, we propose to maintain cluster centers in such a way that the following three conditions are satisfied:

1. Every object is either a cluster center or is adjacent to at least one cluster center.
2. A cluster center with degree > 0 must be adjacent to at least one non-center object of smaller or equal degree.
3. No cluster center is redundant, i.e., every cluster center covers at least one object exclusively.

Note that the first of the above conditions is from the definition of LBSC. The second and third conditions are chosen so that LBSC has a reasonably good solution in a dynamic setting.

Inserting a new object: A center is called illegal if it does not satisfy condition (2) above, i.e., all its adjacent vertices have degree strictly higher than it. A center is called redundant if its removal does not change (increase) the number of nodes that need to be covered. The dynamic SimClus accepts as input the similarity graph and the covered vector, which stores the current coverage of each existing vertex. Once a new object v is added, the adjacency list of the similarity graph G is updated. If v is adjacent to any existing center, this is reflected in the covered vector. Now, we have two different cases to consider. In the first case, when v is covered, condition 1 is already satisfied. Then, we check condition 2 (illegal center) for all the centers that are adjacent to v. Since the addition of new vertices can change the adjacency lists of some of the existing vertices, this may change a legal center into an illegal one. If this check succeeds (some illegal neighboring center is found), we make the illegal center a non-center and v becomes a center. This step is called recursively on success. In case the above step fails (v is not a center yet), we test whether v should be a center because it has strictly higher degree than any of its neighboring centers. Note that this step is not essential for correctness according to our conditions, but this heuristic has the potential to generate better centers. For the second case, when v is not covered, none of its adjacent vertices is a center and, to fulfill the coverage requirements, at least one of these vertices should become a center. We first test whether it should be some x ∈ adj(v) by checking whether x has any illegal neighbors. If not, then we choose the vertex with the highest degree as a center. When a node is made a center, some auxiliary updates are required. First, a redundancy check is required for all other centers that are adjacent to it, to satisfy condition 3. Moreover, some adjacent non-centers can also become centers as one of their neighboring centers becomes illegal.
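A rough sketch of the insertion case follows, using the same adjacency-set representation as before. The helper names are ours, and the recursive promotion and redundancy checks described above are omitted for brevity; this is an outline of the two cases, not the authors' full procedure.

def is_illegal(graph, c):
    # A center is illegal if every adjacent vertex has strictly higher degree.
    return bool(graph[c]) and all(len(graph[n]) > len(graph[c]) for n in graph[c])

def insert_object(graph, centers, v, neighbors):
    # Add v to the β-similarity graph and update the adjacency sets of its neighbors.
    graph[v] = set(neighbors)
    for n in neighbors:
        graph[n].add(v)

    if graph[v] & centers:
        # Case 1: v is already covered. If a neighboring center has become illegal,
        # demote it and promote v instead (the full algorithm repeats this recursively).
        for c in list(graph[v] & centers):
            if is_illegal(graph, c):
                centers.discard(c)
                centers.add(v)
                break
    else:
        # Case 2: v is uncovered; promote the highest-degree vertex in its closed
        # neighborhood so that every object is again adjacent to a center.
        centers.add(max({v} | graph[v], key=lambda u: len(graph[u])))

# Toy usage: start from an existing solution, then a new object 6 similar to 4 and 5 arrives.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
centers = {3, 4}
insert_object(g, centers, 6, {4, 5})
print(sorted(centers))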

Deleting an existing object: Deletion is comparatively easy. As for insertion, we first update the adjacency lists and the covered vector as necessary. Then, we consider the following two cases. The first is when the deleted node v was a center. In that case, we need to check whether any of its adjacent vertices becomes isolated; all of those become centers. For the remaining vertices, if they are still covered, we return immediately. If some x is not covered, we cover x by making the highest-degree vertex (among x itself and its neighbors) a center. The second case is when v was not a center. Its removal does not violate condition (1), so we check conditions (2) and (3) by calling the same methods as in the insert routine.

5. CONCLUSIONS

In this paper we used link-based similarity measures to find similarity among web documents while classifying a set of web pages using the SimClus algorithm. This clustering algorithm uses a lower bound on similarity to cluster a set of objects from the similarity matrix. The algorithm is faster and produces higher quality clusters in comparison to existing popular algorithms. Furthermore, it provides representative centers for every cluster; hence, it is effective in summarization or semi-supervised classification. It is also suitable for multi-label or dynamic clustering.

6. REFERENCES

[1] Pavel Calado, Marco Cristo, Marcos Andre Goncalves, Edleno S. de Moura, Berthier Ribeiro-Neto, and Nivio Ziviani. Link-based similarity measures for the classification of web documents. Journal of the American Society for Information Science and Technology, 57(2):208-221, January 2006.
[2] Mohammad Al Hasan. Mining interesting subgraphs by output space sampling. SIGKDD Explorations, 12(1):73-74, 2010.
[3] J. Aslam, J. E. Pelekhov, and D. Rus. The star clustering algorithm for static and dynamic information organization. Journal of Graph Algorithms and Applications, 8(1):95-129, 2004.
[4] R. Gil-Garcia, J. Badia-Contelles, and A. Pons-Porrata. Extended Star Clustering Algorithm. LNCS 2905, Springer, 2003.
