
TEXT CLASSIFICATION

Using Fuzzy Self-Constructing Feature Clustering Algorithm

Under the guidance of:
Mr. S. J. Prashanth, B.E., M.Tech., LMISTE
Asst. Professor, CS & E Dept.

By:
Chaithra K.V.
4AI08CS020

25/04/2012

OVERVIEW
Introduction
Motivation & Objectives
Feature Reduction
Feature Clustering
Fuzzy Feature Clustering (FFC)
Text Classification
An Example
Applications
Conclusion
References

INTRODUCTION
Text Classification:
The process of classifying documents into predefined classes.

Documents -> class 1, class 2, ..., class n

Text Classification is also called:
Text Categorization
Document Classification
Document Categorization

Motivation and Objective


In text classification, the dimensionality of the feature space is very large. The current problems with feature clustering algorithms are:
The number of extracted features must be specified in advance.
Variance is not considered when comparing word patterns.

Hence the need to reduce the dimensionality and make classification run faster.

Feature Reduction
Purpose:
Reduce the computational load
Increase data consistency

Technique:
Eliminate redundant data
Reduce the dimensionality of the feature set
Find the set of vectors that best separates the patterns

Two ways:
Feature selection
Feature extraction

Feature Reduction
Feature Selection:
The process of selecting a subset of relevant features.
Improves classification accuracy by eliminating noise features from various corpora.

Feature Extraction:
Converts the representation of the original high-dimensional data set into a lower-dimensional one.
More efficient than feature selection.
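The contrast between the two approaches can be sketched with a toy document-term matrix (all values below are illustrative, not from the slides' example): selection keeps a subset of the original columns, while extraction mixes all columns into fewer new ones via a matrix.

```python
import numpy as np

# Toy document-term matrix: 3 documents, 5 original features (word counts).
D = np.array([
    [2, 0, 1, 0, 3],
    [0, 1, 0, 2, 1],
    [1, 1, 2, 0, 0],
])

# Feature selection: keep a subset of the original columns (e.g. words 0, 2, 4).
selected = [0, 2, 4]
D_sel = D[:, selected]          # still original features, just fewer of them

# Feature extraction: map all 5 features into 2 new ones via a matrix T (5 x 2).
T = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 1],
])
D_ext = D @ T                   # each new feature mixes several original ones

print(D_sel.shape)  # (3, 3)
print(D_ext.shape)  # (3, 2)
```

The clustering algorithm described next belongs to the extraction family: T is built from the obtained word clusters.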

Feature Clustering
An efficient approach to feature reduction.
Groups all features into clusters, where every feature in a cluster is similar to the others.
That is, let D be the set of all original documents with m features; we then obtain D' as the set of converted documents with k features, where k < m.

Fuzzy Feature Clustering


Process:
Start with a document set D of n documents d1, d2, ..., dn.
Find the feature vector W of m words w1, w2, ..., wm and the p classes c1, c2, ..., cp.
Construct the pattern for each word in W: xi = <xi1, xi2, ..., xip>.
Let G be a cluster containing q word patterns x1, x2, ..., xq, with xj = <xj1, xj2, ..., xjp>, 1 <= j <= q.
Find the mean and deviation of G.
Find the fuzzy similarity of a word pattern x to cluster G, i.e., muG(x).
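The word-pattern construction above can be sketched in a few lines. Each pattern is xi = <P(c1|wi), ..., P(cp|wi)>, the share of word wi's occurrences that fall in each class; the toy counts and labels below are illustrative, not the slides' Table 1.

```python
import numpy as np

# Toy corpus: rows are documents, columns are word counts (m = 4 words);
# labels give each document's class (p = 2 classes). Values are illustrative.
D = np.array([
    [1, 0, 2, 1],
    [0, 2, 1, 0],
    [1, 1, 0, 2],
])
labels = np.array([0, 0, 1])   # classes c1, c2 encoded as 0, 1

def word_patterns(D, labels, p):
    """One p-dimensional pattern per word: xi = <P(c1|wi), ..., P(cp|wi)>."""
    m = D.shape[1]
    X = np.zeros((m, p))
    for j in range(p):
        in_class = (labels == j).astype(float)   # indicator per document
        X[:, j] = D.T @ in_class                 # class-wise count of each word
    totals = D.sum(axis=0)                       # total count of each word
    return X / totals[:, None]

X = word_patterns(D, labels, p=2)
print(X)   # each row sums to 1: a word's occurrence shares across classes
```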

Fuzzy Feature Clustering


A word pattern close to the mean of a cluster is regarded as very similar to that cluster.
Predefine a threshold rho, 0 <= rho <= 1.
Check whether muG(x) >= rho. Two cases may occur:
No existing fuzzy cluster on which xi has passed the similarity test: create a new cluster Gh.
There are existing clusters on which xi has passed the test: update the existing cluster.

Sort the patterns in order by their xi values, then perform the self-constructing algorithm.
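A minimal sketch of this self-constructing loop follows, using a Gaussian membership as the fuzzy similarity (the product over dimensions of exp(-((xi - mi)/sigma_i)^2)). It is simplified: the full algorithm also recomputes each cluster's deviation on update, which is omitted here for brevity.

```python
import math

def similarity(x, cluster):
    """Fuzzy (Gaussian) similarity of pattern x to a cluster, in [0, 1]."""
    m, s = cluster["mean"], cluster["sigma"]
    return math.prod(math.exp(-((xi - mi) / si) ** 2)
                     for xi, mi, si in zip(x, m, s))

def self_constructing_cluster(patterns, rho=0.64, sigma0=0.5):
    """One-pass sketch: create a cluster when no similarity test passes,
    otherwise fold the pattern into the best-matching cluster."""
    clusters = []
    for x in patterns:
        sims = [similarity(x, G) for G in clusters]
        if not clusters or max(sims) < rho:
            # Case 1: no cluster passes the test -> new cluster centred on x.
            clusters.append({"mean": list(x), "sigma": [sigma0] * len(x),
                             "members": [x]})
        else:
            # Case 2: update the best-matching cluster's mean
            # (deviation update omitted for brevity).
            G = clusters[sims.index(max(sims))]
            G["members"].append(x)
            n = len(G["members"])
            G["mean"] = [sum(p[i] for p in G["members"]) / n
                         for i in range(len(x))]
    return clusters

clusters = self_constructing_cluster([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)])
print(len(clusters))  # the first two nearby patterns merge into one cluster
```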

Fuzzy Feature Clustering


Find the data transformation D' = DT, where T is a weighting matrix.
Perform the weighting using one of {hard, soft, mixed}:
Hard: each word is allowed to belong to only one cluster, so it contributes to only one new extracted feature.
Soft: each word is allowed to contribute to all new extracted features.
Mixed: a combination of the hard-weighting and soft-weighting approaches.
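The three weighting schemes can be sketched from a membership matrix U, whose entries play the role of the fuzzy similarities muG(x) (the values below are hypothetical, not the slides' Table 4):

```python
import numpy as np

# Hypothetical fuzzy memberships of m = 4 word patterns to k = 2 clusters
# (rows: words, columns: clusters).
U = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.1, 0.9],
])

def hard_T(U):
    """Hard: each word contributes only to its best-matching cluster."""
    T = np.zeros_like(U)
    T[np.arange(U.shape[0]), U.argmax(axis=1)] = 1.0
    return T

def soft_T(U):
    """Soft: each word contributes to every new feature via its memberships."""
    return U

def mixed_T(U, gamma=0.8):
    """Mixed: blend of the two; gamma is a user-defined constant."""
    return gamma * hard_T(U) + (1 - gamma) * soft_T(U)

D = np.array([[1, 0, 2, 1]])   # one document with 4 word counts
print(D @ hard_T(U))           # reduced from 4 features to 2
print(D @ mixed_T(U))
```

With hard weighting the reduced features are simple sums of each cluster's word counts; soft and mixed spread each count across clusters.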

Overall flow of Text Classification


[Flow diagram] The training set of documents passes through feature reduction, yielding a training data set for each of the p classes; these are used to train p classifiers (SVMs), one per class. An unknown pattern goes through the same feature reduction and is fed to the trained classifiers, which output the classified documents.

Here we illustrate how the Fuzzy Self-Constructing clustering algorithm works. Let D be a simple document set, containing 9 documents d1, d2, . . . , d9 of two classes c1 and c2, with 10 words in the feature vector W, as shown in Table 1. For simplicity, we denote the ten words as w1, w2, . . . , w10, respectively.

An Example

Table 1: Sample Document set

We calculate the ten word patterns x1, x2, . . . , x10, one for each word wi, as xi = <xi1, xi2, . . . , xip> = <P(c1|wi), P(c2|wi), . . . , P(cp|wi)>.

For example, for the above document set,
P(c2|w6) = (1x0 + 2x0 + 0x0 + 1x0 + 1x1 + 1x1 + 1x1 + 1x1 + 0x1) / (1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 0.50.
The resulting word patterns are shown in Table 2. Since there are two classes involved in D, each word pattern is a two-dimensional vector.
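The arithmetic of this example can be checked directly; the counts of w6 in d1..d9 and the class-membership indicators are taken from the worked example above.

```python
# Counts of word w6 in documents d1..d9 and each document's c2 membership
# (from the worked example); P(c2|w6) is the class-2 share of w6's occurrences.
counts = [1, 2, 0, 1, 1, 1, 1, 1, 0]
in_c2  = [0, 0, 0, 0, 1, 1, 1, 1, 1]   # 1 if the document belongs to c2

p_c2_w6 = sum(c * b for c, b in zip(counts, in_c2)) / sum(counts)
print(p_c2_w6)  # 0.5
```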

Table 2: Word Patterns of W

We run our self-constructing clustering algorithm, setting sigma0 = 0.5 (initial deviation) and rho = 0.64 (threshold), on the word patterns and obtain 3 clusters G1, G2, and G3, which are shown in Table 3.

Table 3: Obtained clusters

The fuzzy similarity of each word pattern to each cluster is shown in Table 4.

Table 4: Fuzzy Similarities of Word Patterns to Three Clusters

The weighting matrices TH, TS, and TM obtained by hard weighting, soft weighting, and mixed weighting (with gamma = 0.8, a user-defined constant), respectively, are shown in Table 5.

Table 5: Weighting Matrices: Hard TH, Soft TS, and Mixed TM

The transformed data sets DH, DS, and DM are obtained as D' = DT, where D = [d1, d2, ..., dn]^T, D' = [d1', d2', ..., dn']^T, and T is the m x k weighting matrix with entries tij, with di = [di1 di2 ... dim] and di' = [di1' di2' ... dik']. These transformed data sets for the different weighting cases are shown in Table 6.

Table 6: Transformed Data Sets: Hard DH, Soft DS, and Mixed DM

Based on DH, DS, or DM, a classifier with two SVMs is built. Suppose d is an unknown document and d = <0, 1, 1, 1, 1, 1, 0, 1, 1, 1>. We first convert d to d' by d' = dT. The transformed document is then obtained as dH = dTH = <2, 4, 2>; dS = dTS = <2.5591, 4.3478, 3.9964>; or dM = dTM = <2.1118, 4.0696, 2.3993>. The transformed unknown document is fed to the classifier. For this example, the classifier concludes that d belongs to c2.
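The final step, classifying an already-transformed document with an SVM, can be sketched as below. scikit-learn's LinearSVC stands in for the SVMs; the training vectors and labels are illustrative toy values, not the slides' Table 6.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy reduced training data (documents already transformed by D' = DT);
# values and labels are illustrative.
D_train = np.array([[2.0, 0.5], [1.8, 0.7], [0.3, 2.1], [0.2, 1.9]])
y_train = np.array([0, 0, 1, 1])          # classes c1, c2 encoded as 0, 1

clf = LinearSVC(random_state=0).fit(D_train, y_train)

d_unknown = np.array([[0.4, 2.0]])        # already transformed: d' = dT
print(clf.predict(d_unknown))             # predicted class of the document
```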

Applications
Document Organization
Spam Filtering
Filtering Pornographic Content
Web Page Prediction
Identity-Based Access & Reporting
Mobile SMS Classification

CONCLUSION
The FFC algorithm is an incremental clustering approach that reduces the dimensionality of the features.
It determines the number of extracted features automatically.
It runs faster than comparable methods.
It yields better extracted features than other methods.
The word patterns in a cluster have a high degree of similarity to each other.

References
[1] Umarani Pappuswamy, Dumisizwe Bhembe, Pamela W. Jordan, and Kurt VanLehn, "A Supervised Clustering Method for Text Classification," Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA 15260, USA.
[2] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "Fuzzy Similarity-based Feature Clustering for Document Classification," Department of Electrical Engineering, National Sun Yat-sen University, Taiwan, 2009 Conference on Information Technology and Applications in Outlying Islands.
[3] Shalini Puri, "A Fuzzy Similarity Based Concept Mining Model for Text Classification," (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 11, 2011.
[4] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 3, March 2011.
[5] Antonia Kyriakopoulou, "Text Classification Aided by Clustering: a Literature Review," Athens University of Economics and Business, Greece, Tools in Artificial Intelligence.
[6] Thomas Lippincott and Rebecca Passonneau, "Semantic Clustering for a Functional Text Classification Task," Columbia University, Department of Computer Science, Center for Computational Learning Systems, New York.
[7] Antonia Kyriakopoulou and Theodore Kalamboukis, "Using Clustering to Enhance Text Classification," Department of Informatics, Athens University of Economics and Business, 76 Patission St., Athens, GR 104.34, SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

Thank You!!!
