
TEXT CLASSIFICATION

Using Fuzzy Self-Constructing Feature Clustering Algorithm

Under the guidance of:
Mr. S. J. Prashanth, B.E., M.Tech., LMISTE
Asst. Professor, CS & E Dept.

By:
Chaithra K.V.
4AI08CS020

25/04/2012

OVERVIEW
Introduction
Motivation & Objectives
Feature Reduction
Feature Clustering
Fuzzy Feature Clustering (FFC)
Text Classification
An Example
Applications
Conclusion
References

INTRODUCTION
Text Classification:
The process of classifying documents into predefined classes.

Documents -> class 1, class 2, ..., class n

Text Classification is also called:
Text Categorization
Document Classification
Document Categorization

Motivation and Objective


In text classification, the dimensionality of the feature space is very large. The current problems with feature clustering algorithms are:
The number of extracted features must be specified in advance.
Variance is not considered when comparing word patterns.

Hence the need to reduce the dimensionality and make classification run faster.

Feature Reduction
Purpose:
Reduce the computational load
Increase data consistency

Technique:
Eliminate redundant data
Reduce the dimensionality of the feature set
Find the set of vectors that best separates the patterns

Two ways:
Feature selection
Feature extraction

Feature Reduction
Feature Selection:
The process of selecting a subset of relevant features.
Improves classification accuracy by eliminating noise features from various corpora.

Feature Extraction:
Converts the representation of the original high-dimensional data set into a lower-dimensional one.
More efficient than feature selection.
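The contrast between the two approaches can be sketched with a toy document-term matrix (all values below are illustrative, not from the slides' example): selection keeps a subset of the original columns, while extraction mixes all columns into fewer new ones via a matrix.

```python
import numpy as np

# Toy document-term matrix: 3 documents, 5 original features (word counts).
D = np.array([
    [2, 0, 1, 0, 3],
    [0, 1, 0, 2, 1],
    [1, 1, 2, 0, 0],
])

# Feature selection: keep a subset of the original columns (e.g. words 0, 2, 4).
selected = [0, 2, 4]
D_sel = D[:, selected]          # still original features, just fewer of them

# Feature extraction: map all 5 features into 2 new ones via a matrix T (5 x 2).
T = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 1],
])
D_ext = D @ T                   # each new feature mixes several original ones

print(D_sel.shape)  # (3, 3)
print(D_ext.shape)  # (3, 2)
```

The clustering algorithm described next belongs to the extraction family: T is built from the obtained word clusters.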

Feature Clustering
An efficient approach to feature reduction.
Groups all features into clusters, where every feature in a cluster is similar to the others.
That is, let D be the set of all original documents with m features; we then obtain D' as the set of converted documents with k features, where k < m.

Fuzzy Feature Clustering


Process:
Start with a document set D of n documents d1, d2, ..., dn.
Find the feature vector W of m words w1, w2, ..., wm and the p classes c1, c2, ..., cp.
Construct the pattern for each word in W: xi = <xi1, xi2, ..., xip>.
Let G be a cluster containing q word patterns x1, x2, ..., xq, with xj = <xj1, xj2, ..., xjp>, 1 <= j <= q.
Find the mean and deviation of G.
Find the fuzzy similarity of a word pattern x to cluster G, i.e., muG(x).
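The word-pattern construction above can be sketched in a few lines. Each pattern is xi = <P(c1|wi), ..., P(cp|wi)>, the share of word wi's occurrences that fall in each class; the toy counts and labels below are illustrative, not the slides' Table 1.

```python
import numpy as np

# Toy corpus: rows are documents, columns are word counts (m = 4 words);
# labels give each document's class (p = 2 classes). Values are illustrative.
D = np.array([
    [1, 0, 2, 1],
    [0, 2, 1, 0],
    [1, 1, 0, 2],
])
labels = np.array([0, 0, 1])   # classes c1, c2 encoded as 0, 1

def word_patterns(D, labels, p):
    """One p-dimensional pattern per word: xi = <P(c1|wi), ..., P(cp|wi)>."""
    m = D.shape[1]
    X = np.zeros((m, p))
    for j in range(p):
        in_class = (labels == j).astype(float)   # indicator per document
        X[:, j] = D.T @ in_class                 # class-wise count of each word
    totals = D.sum(axis=0)                       # total count of each word
    return X / totals[:, None]

X = word_patterns(D, labels, p=2)
print(X)   # each row sums to 1: a word's occurrence shares across classes
```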

Fuzzy Feature Clustering


A word pattern close to the mean of a cluster is regarded as very similar to that cluster.
Predefine a threshold rho, 0 <= rho <= 1.
Check whether muG(x) >= rho. Two cases may occur:
No existing fuzzy cluster on which xi has passed the similarity test: create a new cluster Gh.
There are existing clusters on which xi has passed the test: update the existing cluster.

Sort the patterns in order by their xi values, then perform the self-constructing algorithm.
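A minimal sketch of this self-constructing loop follows, using a Gaussian membership as the fuzzy similarity (the product over dimensions of exp(-((xi - mi)/sigma_i)^2)). It is simplified: the full algorithm also recomputes each cluster's deviation on update, which is omitted here for brevity.

```python
import math

def similarity(x, cluster):
    """Fuzzy (Gaussian) similarity of pattern x to a cluster, in [0, 1]."""
    m, s = cluster["mean"], cluster["sigma"]
    return math.prod(math.exp(-((xi - mi) / si) ** 2)
                     for xi, mi, si in zip(x, m, s))

def self_constructing_cluster(patterns, rho=0.64, sigma0=0.5):
    """One-pass sketch: create a cluster when no similarity test passes,
    otherwise fold the pattern into the best-matching cluster."""
    clusters = []
    for x in patterns:
        sims = [similarity(x, G) for G in clusters]
        if not clusters or max(sims) < rho:
            # Case 1: no cluster passes the test -> new cluster centred on x.
            clusters.append({"mean": list(x), "sigma": [sigma0] * len(x),
                             "members": [x]})
        else:
            # Case 2: update the best-matching cluster's mean
            # (deviation update omitted for brevity).
            G = clusters[sims.index(max(sims))]
            G["members"].append(x)
            n = len(G["members"])
            G["mean"] = [sum(p[i] for p in G["members"]) / n
                         for i in range(len(x))]
    return clusters

clusters = self_constructing_cluster([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)])
print(len(clusters))  # the first two nearby patterns merge into one cluster
```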

Fuzzy Feature Clustering


Find the data transformation D' = DT, where T is a weighting matrix.
Perform the weighting using one of {hard, soft, mixed}:
Hard: each word is allowed to belong to only one cluster, so it contributes to only one new extracted feature.
Soft: each word is allowed to contribute to all new extracted features.
Mixed: a combination of the hard-weighting and soft-weighting approaches.
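The three weighting schemes can be sketched from a membership matrix U, whose entries play the role of the fuzzy similarities muG(x) (the values below are hypothetical, not the slides' Table 4):

```python
import numpy as np

# Hypothetical fuzzy memberships of m = 4 word patterns to k = 2 clusters
# (rows: words, columns: clusters).
U = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.9],
])

def hard_T(U):
    """Hard: each word contributes only to its best-matching cluster."""
    T = np.zeros_like(U)
    T[np.arange(U.shape[0]), U.argmax(axis=1)] = 1.0
    return T

def soft_T(U):
    """Soft: each word contributes to every new feature via its memberships."""
    return U

def mixed_T(U, gamma=0.8):
    """Mixed: blend of the two; gamma is a user-defined constant."""
    return gamma * hard_T(U) + (1 - gamma) * soft_T(U)

D = np.array([[1, 0, 2, 1]])   # one document with 4 word counts
print(D @ hard_T(U))           # reduced from 4 features to 2
print(D @ mixed_T(U))
```

With hard weighting the reduced features are simple sums of each cluster's word counts; soft and mixed spread each count across clusters.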

Overall flow of Text Classification


[Flow diagram] The training set of documents passes through feature reduction, yielding a training data set for each of the p classes; these are used to train p classifiers (SVMs), one per class. An unknown pattern goes through the same feature reduction and is fed to the trained classifiers, which output the classified documents.

Here we illustrate how the Fuzzy Self-Constructing clustering algorithm works. Let D be a simple document set, containing 9 documents d1, d2, . . . , d9 of two classes c1 and c2, with 10 words in the feature vector W, as shown in Table 1. For simplicity, we denote the ten words as w1, w2, . . . , w10, respectively.

An Example

Table 1: Sample Document set

We calculate the ten word patterns x1, x2, . . . , x10, one for each word wi, as xi = <xi1, xi2, . . . , xip> = <P(c1|wi), P(c2|wi), . . . , P(cp|wi)>.

For example, for the above document set,
P(c2|w6) = (1x0 + 2x0 + 0x0 + 1x0 + 1x1 + 1x1 + 1x1 + 1x1 + 0x1) / (1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 0.50.
The resulting word patterns are shown in Table 2. Since there are two classes involved in D, each word pattern is a two-dimensional vector.
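The arithmetic of this example can be checked directly; the counts of w6 in d1..d9 and the class-membership indicators are taken from the worked example above.

```python
# Counts of word w6 in documents d1..d9 and each document's c2 membership
# (from the worked example); P(c2|w6) is the class-2 share of w6's occurrences.
counts = [1, 2, 0, 1, 1, 1, 1, 1, 0]
in_c2  = [0, 0, 0, 0, 1, 1, 1, 1, 1]   # 1 if the document belongs to c2

p_c2_w6 = sum(c * b for c, b in zip(counts, in_c2)) / sum(counts)
print(p_c2_w6)  # 0.5
```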

Table 2: Word Patterns of W

We run our self-constructing clustering algorithm, setting sigma0 = 0.5 (initial deviation) and rho = 0.64 (threshold), on the word patterns and obtain 3 clusters G1, G2, and G3, which are shown in Table 3.

Table 3: Obtained clusters

The fuzzy similarity of each word pattern to each cluster is shown in Table 4.

Table 4: Fuzzy Similarities of Word Patterns to Three Clusters

The weighting matrices TH, TS, and TM obtained by hard weighting, soft weighting, and mixed weighting (with gamma = 0.8, a user-defined constant), respectively, are shown in Table 5.

Table 5: Weighting Matrices: Hard TH, Soft TS, and Mixed TM

The transformed data sets DH, DS, and DM are obtained as D' = DT, where D = [d1, d2, ..., dn]^T, D' = [d1', d2', ..., dn']^T, and T is the m x k weighting matrix with entries tij, with di = [di1 di2 ... dim] and di' = [di1' di2' ... dik']. These transformed data sets for the different weighting cases are shown in Table 6.

Table 6: Transformed Data Sets: Hard DH, Soft DS, and Mixed DM

Based on DH, DS, or DM, a classifier with two SVMs is built. Suppose d is an unknown document and d = <0, 1, 1, 1, 1, 1, 0, 1, 1, 1>. We first convert d to d' by d' = dT. The transformed document is then obtained as dH = dTH = <2, 4, 2>; dS = dTS = <2.5591, 4.3478, 3.9964>; or dM = dTM = <2.1118, 4.0696, 2.3993>. The transformed unknown document is fed to the classifier. For this example, the classifier concludes that d belongs to c2.
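The final step, classifying an already-transformed document with an SVM, can be sketched as below. scikit-learn's LinearSVC stands in for the SVMs; the training vectors and labels are illustrative toy values, not the slides' Table 6.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy reduced training data (documents already transformed by D' = DT);
# values and labels are illustrative.
D_train = np.array([[2.0, 0.5], [1.8, 0.7], [0.3, 2.1], [0.2, 1.9]])
y_train = np.array([0, 0, 1, 1])          # classes c1, c2 encoded as 0, 1

clf = LinearSVC(random_state=0).fit(D_train, y_train)

d_unknown = np.array([[0.4, 2.0]])        # already transformed: d' = dT
print(clf.predict(d_unknown))             # predicted class of the document
```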

Applications
Document Organization
Spam Filtering
Filtering Pornographic Content
Web Page Prediction
Identity-Based Access & Reporting
Mobile SMS Classification

CONCLUSION
The FFC algorithm is an incremental clustering approach that reduces the dimensionality of the features.
It determines the number of extracted features automatically.
It runs faster than comparable methods.
It yields better extracted features than other methods.
The word patterns in a cluster have a high degree of similarity to each other.

References
[1] Umarani Pappuswamy, Dumisizwe Bhembe, Pamela W. Jordan, and Kurt VanLehn, "A Supervised Clustering Method for Text Classification," Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA 15260, USA.
[2] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "Fuzzy Similarity-based Feature Clustering for Document Classification," Department of Electrical Engineering, National Sun Yat-sen University, Taiwan, 2009 Conference on Information Technology and Applications in Outlying Islands.
[3] Shalini Puri, "A Fuzzy Similarity Based Concept Mining Model for Text Classification," (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 11, 2011.
[4] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 3, March 2011.
[5] Antonia Kyriakopoulou, "Text Classification Aided by Clustering: a Literature Review," Athens University of Economics and Business, Greece, Tools in Artificial Intelligence.
[6] Thomas Lippincott and Rebecca Passonneau, "Semantic Clustering for a Functional Text Classification Task," Columbia University, Department of Computer Science, Center for Computational Learning Systems, New York.
[7] Antonia Kyriakopoulou and Theodore Kalamboukis, "Using Clustering to Enhance Text Classification," Department of Informatics, Athens University of Economics and Business, 76 Patission St., Athens, GR 104.34, SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

Thank You!!!
