
International Journal of Computational Intelligence and Information Security, February 2012 Vol. 3, No. 2

A Term Frequency-Inverse Document Frequency Based Prototype Model for Easing Text Categorization Effort for Conference Organizing Committee
Sukanya Ray¹ and Nidhi Chandra²

¹,² Amity School Of Engineering & Technology, Amity University, Noida (U.P.), India

¹ sukanyaray007@gmail.com, ² nsrivastava5@amity.edu

Abstract

The Vector Space Model (VSM) is an important technique for categorizing text content or text documents on the basis of their features. Term weighting plays a major role in achieving high performance in text categorization and information retrieval, because it assigns each term a count-based score from which categorization decisions become easy. Term Frequency (tf) and Inverse Document Frequency (idf) are used to calculate the weight of each term and to categorize documents. In this paper a new approach to text categorization is proposed that addresses a known drawback: tf-idf is not efficient for short documents.

Keywords: VSM, tf-idf, text categorization

1. Introduction
Text categorization means assigning a given document to one or more pre-defined categories. It is widely used for document indexing, text retrieval, classifying and cataloging documents, and filtering web resources, and with the rapid growth of text in digital form it is gaining immense importance [1].

For example, when a national or international conference is organized, thousands of research papers are submitted for publication, and categorizing those papers manually is time-consuming and error-prone; automating the process solves this. Given pre-defined categories matching the topics of the conference, the digital documents can be sorted into those categories, which saves time and is far more likely to be error-free. The same idea applies at admission time, where submitted forms can be digitized and then categorized by subject and degree. Industries can likewise filter résumés with text categorization, since reading every résumé only to find it does not meet the criteria wastes a great deal of time. Text categorization can also help a newcomer to a domain gain knowledge of it by listing all the available documents [8]; no prior knowledge of the domain is needed, as the domain name alone is sufficient for such a search. In short, text categorization enables fast and easy access to the right documents in every domain.

The Vector Space Model (VSM) is an important method for document categorization, web resource filtering, and information retrieval [1]. By applying VSM together with text categorization, a prototype vector for each category can be constructed from a pre-defined data set. For an automated categorization scheme, each document to be categorized must be represented by a vector of pertinent features that distinguish between the pre-defined categories or classes (for a conference, the classes might be computer networks, operating systems, mobile computing, artificial intelligence, etc.) or genres (for web filtering: surveys, reviews, research papers, editorials, travelogues, blogs, etc.).

In Term Frequency-Inverse Document Frequency (tf-idf), the term frequency (tf) is calculated first: the frequency with which each individual term appears in the document. The inverse document frequency (idf) is then obtained by counting the number of documents in which the term appears out of the total number of documents in the corpus:



idf(t) = log( D / df(t) )

where D is the total number of documents in the corpus and df(t) is the number of documents in which term t appears. If the term does not occur anywhere in the corpus, df(t) is zero and the division is mathematically impossible, so in that case the denominator is adjusted to 1 + df(t). The combined weight is

tf-idf(t, d) = tf(t, d) * idf(t)

where d is a document and t is an individual term.

The main aim of this paper is to design a prototype for automatic text categorization, so that when a document is submitted to an organization in digital form, the prototype can automatically assign it to one of several pre-defined categories or domains. The paper is divided into four sections: Section 2 focuses on existing approaches to prototype design in text categorization using term frequency-inverse document frequency, Section 3 presents the proposed method, and Section 4 concludes with further scope for research in this domain.
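To make these definitions concrete, here is a minimal Python sketch, assuming a toy corpus of pre-tokenized documents; the data and tokenization are illustrative only, not part of the proposed prototype.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Raw term frequency: how often the term occurs in the document.
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    # df(t): number of documents in the corpus containing the term.
    df = sum(1 for doc in corpus if term in doc)
    # The 1 + df adjustment avoids division by zero when the
    # term is absent from the corpus, as described above.
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus of pre-tokenized documents (hypothetical data).
corpus = [["routing", "protocol", "network"],
          ["neural", "network", "training"],
          ["process", "scheduling", "kernel"]]
print(tf_idf("routing", corpus[0], corpus))  # log(3/2), about 0.405
```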

2. Literature Review
N-Gram Based Text Classification: In this paper text representation and feature selection are performed with the Naïve Bayes classifier, Support Vector Machine, Neural Network, and K-Nearest Neighbor [4]. The classifier treats text categorization as a standard classification problem and thereby reduces the learning process to two simple steps: feature engineering, then classification learning over the feature space. The n-gram modeling is based on the Markov model.

Classification Based on Specific Vocabulary: In this paper the text representation is based on lemmas; a lemma groups all inflected word forms together (goes, gone, and going all come under the lemma go). In the Naïve Bayes method two hypotheses are compared, the category corresponding to the maximized one is selected, and the underlying probability is then estimated. The Support Vector Machine (SVM) method uses features derived from the vector space model and finds the weight of each term by applying the tf-idf formula. The Z-score-based classification model uses the binomial distribution [3]: when the Z score is positive and above the threshold the term is heavily used, while a negative score indicates the term is under-used. The main drawback of this paper is that all the documents used fell into just two categories divided by year, so the proposed method was not tested on randomly chosen topics, and the time period itself helped the classification.
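The survey above does not reproduce the exact Z score from [3]; the sketch below assumes the common binomial formulation, in which a term's observed count inside a category is compared with the count expected from the category's overall share of the corpus, and should be read as an illustration rather than the paper's precise definition.

```python
import math

def z_score(a, n, p):
    """Binomial Z score for term usage in a category (assumed form).

    a: occurrences of the term inside the category
    n: occurrences of the term in the whole corpus
    p: the category's share of the corpus
    """
    expected = n * p
    variance = n * p * (1 - p)
    return (a - expected) / math.sqrt(variance)

# A term seen 30 times overall, 20 of them in a category holding
# 40% of the corpus, is heavily over-used there (Z ~ 2.98 > 0).
print(z_score(20, 30, 0.4))
```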



R-tfidf, a Variety of tf-idf Term Weighting Strategy in Document Categorization: This is an improvement over the tf-idf method in which the tf-idf formula is multiplied by an adjusting factor [5]. The factor increases the importance of term frequency within a document and penalizes terms that appear in fewer documents while carrying relatively high term frequency weight. The drawback of this paper is that the data set employed is very small, tested on only 25 documents.

A Novel Term Weighting Scheme for Automated Text Categorization: In this paper Gain Ratio is used mainly for term selection, though it would arguably have served better for weighting terms, since its value reflects the importance of a term. Another method is the confidence weight method, which considers category information: it compares the proportion of documents in category c that contain a term t with the proportion of documents outside c that contain t [6]. If the two proportions differ greatly, t is important and should receive a high weight.

Term Weighting Scheme for Automatic Text Categorization: This is a method where the probability of occurrence of a term in a document is evaluated first, with every document normalized to length 1 with degree 1. To weight a term, two probabilities must be estimated, P(cj) and P(cj | ti). P(cj) can be evaluated from the total number of words in category cj of the whole corpus, and P(cj | ti) can be calculated using Bayes' theorem, from which the probability P(ti | cj) follows. The major drawback of this approach is that every category is assumed to have a fixed term distribution, and it is not tested on a random number of terms in every category.

Fast String Matching using an n-gram Algorithm: The baseline here is the Straight Forward (SF) method [2]. The pattern is placed over the text at its extreme left and scanned to the right until a mismatch occurs; on a mismatch, the pattern is shifted one position to the right and the scan restarts at the leftmost position of the pattern. Since all N - M + 1 positions in the text may be searched, the worst-case time complexity is O(mN); this backtracking causes the quadratic behavior, which is the drawback of the approach. (A sketch of SF appears at the end of this section.)

An evaluation of statistical approaches to text categorization: In this approach, for each pre-defined category the vectors of the documents belonging to that category are given positive weight and the vectors of the remaining documents are given negative weight; summing these positively and negatively weighted vectors yields the prototype vector [9]. The main weakness of this method is the assumption of one centroid per category, and performance degrades when documents belonging to the same category form separate clusters.

Effect of term distributions on centroid-based text categorization: In this paper inter-class standard deviation (icsd), class standard deviation (csd), and standard deviation (sd) were introduced into the tf-idf model to improve its performance [7].

Random walk term weighting for improved text classification: This literature proposes a method where term co-occurrence is used to calculate the dependency between word features and how a word feature contributes to a given context; this scheme can improve the accuracy of the text classifier [10].
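As noted above, the SF method is simple enough to transcribe directly; this sketch follows the description of the baseline in [2] (not the n-gram acceleration the paper actually proposes), so the restarts that cause the O(mN) worst case are visible.

```python
def sf_match(pattern, text):
    """Straightforward (SF) matching: align the pattern at the
    leftmost position, scan right until a mismatch, then shift
    the pattern one position and restart the scan."""
    m, n = len(pattern), len(text)
    matches = []
    for i in range(n - m + 1):       # the N - M + 1 alignments
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                   # scan right within the window
        if j == m:
            matches.append(i)        # full match at position i
    return matches

print(sf_match("aba", "abababa"))  # [0, 2, 4]
```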

3. Proposed Methodology
So far a lot of work has been done on improving the tf-idf method, but each new and improved process still has some drawbacks. The method proposed here addresses the problems faced while organizing a national or international conference or publishing magazines and journals: thousands of papers are submitted each day and must be categorized by domain before they are reviewed. Done manually, the whole process is very time-consuming, so it is better if the text categorization is done automatically.

For this, a pre-defined data set, or corpora, is required, based on the topic domains of the conference. If the conference is on computer science engineering, the domains can be networking, artificial intelligence, mobile computing, algorithms, operating systems, etc. If networking and mobile computing are to be kept as two separate domains, the feature selection for building the corpora must be done carefully, as these two subjects can share almost the same keywords. The corpora are stored in .txt format, and the documents to be categorized are likewise taken in digital .txt form.

Next, all stop words are removed from the documents. Stop words are words that occur very frequently in every English document (e.g. a, an, the, which, of, in, there); English has roughly 635 of them. Their removal is necessary because otherwise the term frequency of a stop word can exceed that of the keywords needed to categorize the document, in which case categorization could never succeed.

After stop-word removal, the words are stemmed for a more accurate result. Stemming groups all inflected forms under the same root (e.g. go, goes, going, and gone all reduce to the root word go): every child word is stemmed back to its parent or root word.

The next step is term frequency: counting the frequency of each distinct term appearing in the document. Once all frequencies have been counted, the term with the highest frequency is matched against all the pre-defined corpora. Here the inverse document frequency comes into play: the document's highest-frequency term is matched against every document in each corpus, and the number of documents containing the term is counted out of the total. When a match is found, the document is declared to belong to the domain whose corpus matched the highest-frequency term. The term frequency-inverse document frequency (tf-idf) is calculated as the product of the term frequency (tf) and the inverse document frequency (idf). A minimal sketch of this pipeline is given below.
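The sketch assumes a toy stop-word list, a crude suffix-stripping stemmer, and in-memory corpora; a real implementation would use the full ~635-word stop list, a proper stemmer such as Porter's, and .txt corpora read from disk.

```python
import re
from collections import Counter

# A tiny stand-in for the ~635-word English stop list.
STOP_WORDS = {"a", "an", "the", "which", "of", "in", "there",
              "is", "and", "to", "for"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer
    # (e.g. Porter); irregular forms like "gone" are not handled.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenize, drop stop words, stem the rest.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def categorize(document_text, corpora):
    # corpora maps a domain name to a list of pre-processed
    # documents; the domain whose corpus contains the document's
    # highest-frequency term in the most documents wins.
    tokens = preprocess(document_text)
    if not tokens:
        return None
    top_term, _ = Counter(tokens).most_common(1)[0]
    best_domain, best_df = None, 0
    for domain, docs in corpora.items():
        df = sum(1 for d in docs if top_term in d)
        if df > best_df:
            best_domain, best_df = domain, df
    return best_domain

corpora = {  # hypothetical pre-defined corpora, already tokenized
    "networking": [["router", "packet"], ["packet", "protocol"]],
    "operating system": [["kernel", "process"], ["scheduler"]],
}
print(categorize("The routers drop packets in the network", corpora))
```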



The proposed idea is shown as a flow diagram in Fig. 1.

Figure 1: Flow diagram showing the steps of the proposed idea.

It may often happen that the same term belongs to more than one category. If the situation does not permit multiple categorization, the term with the second-highest frequency is then matched against all the domains of the pre-defined corpora. If the two results together make the categorization clear, it is done; otherwise the term with the third-highest frequency is considered and matched against all the corpora. Where multiple categorization is not allowed, matching against the corpora may be extended to the five (or more) highest-frequency terms to obtain a proper categorization; a hedged sketch of this tie-breaking rule is given below.

One advantage of this method is that it is simple and straightforward: no complex formulas or complex quadratic equations are used, and the prototype can be used for all domains, depending on the corpora built for it.
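This sketch reuses token lists like those produced by the pipeline above; the voting rule and the default of five terms are illustrative assumptions, since the text leaves the exact resolution procedure open.

```python
from collections import Counter

def categorize_top_k(tokens, corpora, k=5):
    # Each of the k highest-frequency terms votes for every domain
    # whose corpus contains it in at least one document; the domain
    # with the most votes wins, resolving terms shared by domains.
    top_terms = [t for t, _ in Counter(tokens).most_common(k)]
    votes = Counter()
    for term in top_terms:
        for domain, docs in corpora.items():
            if any(term in d for d in docs):
                votes[domain] += 1
    return votes.most_common(1)[0][0] if votes else None

corpora = {  # hypothetical corpora; "packet" appears in both domains
    "networking": [["router", "packet"], ["packet", "protocol"]],
    "mobile computing": [["handover", "packet"], ["router", "cell"]],
}
tokens = ["packet", "packet", "router", "handover", "cell"]
print(categorize_top_k(tokens, corpora, k=3))  # "mobile computing"
```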

4. Conclusion
This paper focuses on automated text categorization so that precious time can be saved while organizing any national or international conference, during admissions, during the publication of magazines and journals, or in any other situation where many documents are submitted daily and must be sorted by category before being worked on. Though this prototype is designed mainly with national and international conference organizing committees in mind, the design can work in any domain, depending on the type of corpora or pre-defined data set built for that purpose. The proposed prototype design enables accurate, automated categorization of documents, but building proper corpora with appropriate features plays a very important role in the successful functioning of this prototype.

References
[1] Ma Zhanguo, Feng Jing, Chen Liang, Hu Xiangyi, Shi Yanqin, "An Improved Approach to Terms Weighting in Text Classification", 978-1-4244-9283-1/11, 2011 IEEE.
[2] Jong Yong Kim, John Shawe-Taylor, "Fast String Matching using an n-gram Algorithm", Software: Practice and Experience, Vol. 24(1), 79-88, January 1994.
[3] Jacques Savoy, Olena Zubaryeva, "Classification Based on Specific Vocabulary", 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.



[4] Mojgan Farhoodi, Alireza Yari, Ali Sayah, "N-Gram Based Text Classification for Persian Newspaper Corpus", 2010 IEEE.
[5] Dengya Zhu, Jitian Xiao, "R-tfidf, a Variety of tf-idf Term Weighting Strategy in Document Categorization", 2011 Seventh International Conference on Semantics, Knowledge and Grids.
[6] Hua Jiang, Ping Li, Xin Hu, Shuyan Wang, "An Improved Method of Term Weighting for Text Classification", Centre for Development of Advanced Computing, March 11, 2010.
[7] Ma Zhanguo, Feng Jing, Chen Liang, Hu Xiangyi, Shi Yanqin, "Improved Terms Weighting Algorithm of Text", 2011 International Conference on Network Computing and Information Security, 978-0-7695-4355-0/11, 2011 IEEE.
[8] Li-Ping Jing, Hou-Kuan Huang, Hong-Bo Shi, "Improved Feature Selection Approach TFIDF in Text Mining", Proceedings of the First International Conference on Machine Learning and Cybernetics, Beijing, 4-5 November 2002.
[9] Verayuth Lertnattee, Thanaruk Theeramunkong, "Analysis of Inverse Class Frequency in Centroid-based Text Classification", International Symposium on Communications and Information Technologies 2004 (ISCIT 2004), Sapporo, Japan.
[10] Md. Rafiqul Islam, Md. Rakibul Islam, "An Effective Term Weighting Method Using Random Walk Model for Text Classification", Proceedings of the 11th International Conference on Computer and Information Technology (ICCIT 2008), 25-27 December 2008, Khulna, Bangladesh.

