Academic Documents
Professional Documents
Cultural Documents
By:
Chaithra K.V.
4AI08CS020
OVERVIEW
Introduction
Motivation & Objectives
Feature Reduction
Feature Clustering
Fuzzy Feature Clustering (FFC)
Text Classification
An Example
Applications
Conclusion
References
INTRODUCTION
Text Classification:
Process of classifying documents into predefined classes.
Documents -> class 1, class 2, . . ., class n
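The mapping above can be sketched with a toy nearest-centroid classifier over bags of words. The corpus, class names, and overlap measure below are invented for illustration and are not the deck's method:

```python
from collections import Counter

# Toy corpus: each training document is labeled with one of the predefined
# classes. Documents and class names here are hypothetical.
train = [
    ("the team won the match", "sports"),
    ("stocks fell on weak earnings", "finance"),
    ("the player scored a goal", "sports"),
    ("the market rallied today", "finance"),
]

def bag_of_words(text):
    # Represent a document by its word counts (its feature vector).
    return Counter(text.split())

# Build one centroid per class by summing the word counts of its documents.
centroids = {}
for text, label in train:
    centroids.setdefault(label, Counter()).update(bag_of_words(text))

def classify(text):
    # Assign the document to the class whose centroid overlaps it the most.
    words = bag_of_words(text)
    def overlap(label):
        return sum(min(words[w], centroids[label][w]) for w in words)
    return max(centroids, key=overlap)

print(classify("the striker scored"))  # -> sports
```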
Feature Reduction
Purpose:
Reduce the computational load
Increase data consistency
Technique:
To eliminate redundant data
To reduce the dimensionality of the feature set
To find the set of features that best separates the patterns
Two ways:
Feature selection
Feature extraction
Feature Reduction
Feature Selection:
The process of selecting a subset of the relevant features. It improves classification accuracy by eliminating noisy features from the corpus.
Feature Extraction:
Converts the representation of the original high-dimensional data set into a lower-dimensional one. Generally more effective than feature selection.
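The difference between the two approaches can be sketched on a toy document-term matrix. The relevance ranking and projection matrix below are assumed, not taken from the slides:

```python
# Toy document-term matrix: 3 documents x 4 features (word counts).
D = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 2, 0],
]

# Feature selection: keep a subset of the original columns (here an assumed
# relevance ranking picks columns 0 and 1; the other columns are dropped).
selected = [0, 1]
D_sel = [[row[j] for j in selected] for row in D]

# Feature extraction: map all original features to fewer new ones through a
# projection matrix T (m x k); each new feature mixes several old ones.
T = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]
D_ext = [[sum(row[i] * T[i][j] for i in range(4)) for j in range(2)]
         for row in D]

print(D_sel)  # [[2, 0], [0, 3], [1, 1]]
print(D_ext)  # [[2, 1], [3, 1], [2, 2]]
```

Selection discards columns outright, while extraction keeps a trace of every original feature in the reduced representation.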
Feature Clustering
An efficient approach to feature reduction. All features are grouped into clusters such that the features within a cluster are similar to each other. That is, if D is the set of all original documents, each with m features, we obtain D' as the set of converted documents, each with k features, where k < m.
Sort the patterns in order of their xi values, then perform the self-constructing clustering algorithm.
(Diagram: the feature-reduction pipeline, ending with the classified documents.)
An Example
Here we illustrate how the fuzzy self-constructing clustering method works. Let D be a simple document set containing nine documents d1, d2, . . . , d9 of two classes c1 and c2, with 10 words in the feature vector W, as shown in Table 1. For simplicity, we denote the ten words as w1, w2, . . . , w10.
We calculate the ten word patterns x1, x2, . . . , x10, one for each word wi:
xi = <xi1, xi2, . . . , xip> = <P(c1|wi), P(c2|wi), . . . , P(cp|wi)>.
For example, for the document set above,
P(c2|w6) = (1x0 + 2x0 + 0x0 + 1x0 + 1x1 + 1x1 + 1x1 + 1x1 + 0x1) / (1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 4/8 = 0.50.
The resulting word patterns are shown in Table 2. Since two classes are involved in D, each word pattern is a two-dimensional vector.
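The computation of P(c2|w6) can be reproduced in a few lines. The class assignment (d1-d4 in c1, d5-d9 in c2) is inferred from the 0/1 indicators in the formula above:

```python
# Word counts of w6 in documents d1..d9, and each document's class,
# taken from the worked example in the text.
counts = [1, 2, 0, 1, 1, 1, 1, 1, 0]
labels = ["c1", "c1", "c1", "c1", "c2", "c2", "c2", "c2", "c2"]

def p_class_given_word(cls):
    # P(cls | w) = occurrences of w in documents of cls / all occurrences of w
    in_class = sum(c for c, l in zip(counts, labels) if l == cls)
    return in_class / sum(counts)

# The two-dimensional word pattern x6 = <P(c1|w6), P(c2|w6)>.
pattern_w6 = (p_class_given_word("c1"), p_class_given_word("c2"))
print(pattern_w6)  # (0.5, 0.5)
```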
We run the self-constructing clustering algorithm on the word patterns, setting the initial deviation σ0 = 0.5 and the similarity threshold ρ = 0.64, and obtain three clusters G1, G2, and G3, shown in Table 3.
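A simplified sketch of the self-constructing step, assuming a Gaussian fuzzy similarity and keeping the deviation fixed at σ0 (the full algorithm also updates each cluster's deviation incrementally, which is omitted here). The five word patterns are hypothetical:

```python
import math

def membership(x, center, sigma):
    # Gaussian fuzzy similarity of word pattern x to a cluster: a product of
    # per-dimension Gaussians, all with a shared fixed deviation here.
    return math.prod(math.exp(-((xj - cj) / sigma) ** 2)
                     for xj, cj in zip(x, center))

def self_constructing(patterns, sigma0=0.5, rho=0.64):
    clusters = []  # list of (center, size) pairs
    for x in patterns:
        sims = [membership(x, center, sigma0) for center, _ in clusters]
        if not sims or max(sims) < rho:
            # No existing cluster is similar enough: start a new one at x.
            clusters.append((list(x), 1))
        else:
            # Join the most similar cluster and update its mean incrementally.
            best = sims.index(max(sims))
            center, n = clusters[best]
            new_center = [(cj * n + xj) / (n + 1) for cj, xj in zip(center, x)]
            clusters[best] = (new_center, n + 1)
    return clusters

# Five hypothetical two-dimensional word patterns: two near <1, 0>, two near
# <0, 1>, and one in between that ends up alone.
patterns = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.5, 0.5)]
clusters = self_constructing(patterns)
print(len(clusters))  # 3
```

The number of clusters is not fixed in advance: it emerges from the threshold ρ, which is why the method "determines the features automatically."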
The fuzzy similarity of each word pattern to each cluster is shown in Table 4.
The weighting matrices TH, TS, and TM obtained by hard-weighting, soft-weighting, and mixed-weighting (with the user-defined constant γ = 0.8), respectively, are shown in Table 5.
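The three weighting schemes can be sketched from a table of fuzzy similarities. The similarity values below are hypothetical stand-ins for Table 4, and the mixed rule is one common blend of the other two, assumed here rather than quoted from the deck:

```python
# Hypothetical fuzzy similarities of each word pattern to each cluster:
# rows = word patterns, columns = clusters (a stand-in for Table 4).
mu = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [0.1, 0.4, 0.7],
]

def hard(row):
    # Hard-weighting: 1 for the most similar cluster, 0 elsewhere.
    best = row.index(max(row))
    return [1.0 if j == best else 0.0 for j in range(len(row))]

def soft(row):
    # Soft-weighting: use the fuzzy similarities themselves as weights.
    return list(row)

def mixed(row, gamma=0.8):
    # Mixed-weighting: blend gamma * hard + (1 - gamma) * soft.
    h, s = hard(row), soft(row)
    return [gamma * hj + (1 - gamma) * sj for hj, sj in zip(h, s)]

TH = [hard(r) for r in mu]
TS = [soft(r) for r in mu]
TM = [mixed(r) for r in mu]
print(TM[0])  # approximately [0.98, 0.02, 0.0]
```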
The transformed data sets DH, DS, and DM are obtained as D' = D T, where
D = [d1, d2, . . . , dn]T, D' = [d1', d2', . . . , dn']T,
di = [di1 di2 . . . dim], di' = [di1' di2' . . . dik'],
and T is the m x k weighting matrix with entries tij, i = 1, . . . , m, j = 1, . . . , k. These transformed data sets for the different cases of weighting are shown in Table 6.
Table 6: Transformed Data Sets: Hard DH, Soft DS, and Mixed DM
Based on DH, DS, or DM, a classifier with two SVMs is built. Suppose d is an unknown document with d = <0, 1, 1, 1, 1, 1, 0, 1, 1, 1>. We first convert d to its word-pattern representation and then apply the weighting matrix, obtaining dH = d TH = <2, 4, 2>, dS = d TS = <2.5591, 4.3478, 3.9964>, or dM = d TM = <2.1118, 4.0696, 2.3993>. The transformed document is then fed to the classifier, which, for this example, concludes that d belongs to c2.
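The hard-weighted transformation dH = d TH is a plain matrix product. The cluster assignment below is invented for illustration (the deck's actual TH is in Table 5, which is not reproduced here); it happens to yield the <2, 4, 2> of the example:

```python
# Unknown document's converted pattern vector, from the example in the text.
d = [0, 1, 1, 1, 1, 1, 0, 1, 1, 1]

# Hypothetical hard weighting matrix TH (10 words x 3 clusters): row i has a
# single 1 in the column of the cluster word wi was assigned to. This
# assignment is assumed, not taken from Table 5.
assign = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]  # cluster index of each word
TH = [[1 if j == a else 0 for j in range(3)] for a in assign]

# d' = d TH collapses the 10 word-level values into 3 cluster-level values.
d_h = [sum(d[i] * TH[i][j] for i in range(10)) for j in range(3)]
print(d_h)  # [2, 4, 2]
```

With hard weighting each new feature is simply the sum of the original features in one cluster, which is why the entries stay integers.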
Applications
Document organization
Spam filtering
Filtering pornographic content
Web page prediction
Identity-based access & reporting
Mobile SMS classification
CONCLUSION
The FFC algorithm is an incremental clustering approach to reduce the dimensionality of the feature set.
It determines the number of extracted features automatically.
It runs faster than comparable methods.
It produces better extracted features than other methods.
The word patterns in a cluster have a high degree of similarity to each other.
References
[1] Umarani Pappuswamy, Dumisizwe Bhembe, Pamela W. Jordan, and Kurt VanLehn, "A Supervised Clustering Method for Text Classification," Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA, USA.
[2] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "Fuzzy Similarity-based Feature Clustering for Document Classification," Department of Electrical Engineering, National Sun Yat-sen University, Taiwan; 2009 Conference on Information Technology and Applications in Outlying Islands.
[3] Shalini Puri, "A Fuzzy Similarity Based Concept Mining Model for Text Classification," International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 11, 2011.
[4] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 3, March 2011.
[5] Antonia Kyriakopoulou, "Text Classification Aided by Clustering: A Literature Review," Athens University of Economics and Business, Greece; Tools in Artificial Intelligence.
[6] Thomas Lippincott and Rebecca Passonneau, "Semantic Clustering for a Functional Text Classification Task," Department of Computer Science, Columbia University, New York.
[7] Antonia Kyriakopoulou and Theodore Kalamboukis, "Using Clustering to Enhance Text Classification," Department of Informatics, Athens University of Economics and Business; SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands.
Thank You!!!