
International Journal of Innovative Computing, Information and Control Volume 5, Number 12, December 2009

ICIC International ©2009 ISSN 1349-4198



THE APPLICATION OF LATENT SEMANTIC INDEXING AND ONTOLOGY IN TEXT CLASSIFICATION

Xi-Quan Yang¹, Na Sun¹,², Tie-Li Sun¹, Xue-Ya Cao¹ and Xiao-Juan Zheng³

¹School of Computer, ²School of Software
Northeast Normal University
Changchun, Jilin 130117, P. R. China
yangxq375@nenu.edu.cn

³Department of Basic Teaching
Jiangsu University of Science and Technology
Zhangjiagang, Jiangsu 215600, P. R. China
sunna2004@126.com

Received July 2008; revised December 2008


Abstract. The use of ontology to improve text classification performance has been an active research topic in recent years. Existing work includes using domain ontology, or combining the WordNet general ontology with the Latent Semantic Indexing (LSI) algorithm, to realize text classification. However, these methods suffer from problems such as high-dimensional, sparse feature spaces or inapplicability to professional fields of text classification. To solve these problems, this paper proposes a general framework for text classification that exploits a full-fledged domain ontology as a knowledge base to support semantic-based text classification. Applying the LSI algorithm to reduce the feature vector and using ontology knowledge as background to classify texts achieves higher performance in professional fields.
Keywords: Latent semantic indexing, Domain ontology, Text classification

1. Introduction. The development of ontology has been one of the motivations of the semantic web since it was envisioned. Thanks to its richer semantic information, the use of ontology can make up for the limitations of traditional text classification methods, so many scholars have focused on incorporating ontology ideas into traditional methods, and considerable progress has been made in the last few years. Reference [13] proposed an approach that puts ontology into text representation as background knowledge and realizes automatic classification of XML texts. Reference [5] put forward an approach to classify web documents by using domain ontologies, which were constructed from concepts extracted from web information, and realized a mapping between the concepts in the ontology and the glossary in web documents. These methods achieve good classification results, but they do not take dimension reduction into account. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually leads to sparse feature spaces of large dimensionality, making it hard to achieve high classification accuracy and good time efficiency. The time and computational complexity of classification increase rapidly with the size of the feature vector. Noise data and irrelevant features may even play a counteractive role in text classification. Reducing the dimension can enhance computing efficiency and improve the precision of classification.


The most classical of the many dimension-reduction methods proposed so far is feature selection [6,14], with criteria such as document frequency, mutual information, information gain and the chi-square statistic. Yang Yiming and others conducted contrast experiments on these four methods and found that the chi-square statistic is superior to the others. However, none of these methods considers the latent semantic structures between words in documents; applying a semantic dictionary together with the LSI algorithm can reduce the dimension of the feature space and improve the performance of text classification. As reference [7] showed, applying the WordNet ontology and the LSI algorithm to reduce the feature vector offers better results and makes dimension reduction feasible even for huge document collections. Reference [4] proposed an English text classification method based on a semantic set index, derived from the WordNet thesaurus and the LSI model; this method effectively incorporates linguistic knowledge and conceptual indexing into the text vector space representation, and acquires better results. Reference [10] also adopts WordNet to realize text classification for Indian languages. The WordNet ontology covers a broad general vocabulary, but it lacks the special terminology of specialized domains, so it may get poor results in professional fields. Although there is much research that uses the LSI algorithm to reduce the high-dimensional and sparse vector space, and much that uses the WordNet general ontology to realize a classifier, research on how to use both domain ontology and the LSI algorithm in text classification is still scarce. This paper proposes a general framework for text classification that exploits a full-fledged domain ontology as a knowledge base to support semantic-based text classification. Experiments show that using the LSI algorithm to reduce the feature vector and exploiting four domain ontologies to realize text classification obtains better results.

The remainder of this paper is structured as follows. The research basis is provided in Section 2. We describe our specific approach to classifying texts based on domain ontologies in detail in Section 3. The description of an experiment we performed and the analysis of the experimental results are provided in Section 4. Conclusions and future research are provided in Section 5.

2. The Research Basis.

2.1. LSI. Latent Semantic Indexing (LSI) [3,9,11,15], proposed by Landauer and Dumais, is an adaptation of the classical vector model. The basic idea is that there are latent semantic structures between the words in documents, so a system can discover latent semantic relations from the documents automatically and organize the documents into a semantic space structure by statistical analysis. We compute the correlation of index terms and texts by decomposing the measurement matrix with singular value decomposition (SVD) [1], and organize them into the same semantic space, represented as a $t \times d$ matrix $A = (a_{ij})$, where $t$ and $d$ are the total numbers of index terms and texts, respectively. A row vector records the occurrences of one word across the texts, a column vector records the word occurrences of one document in the set, and $a_{ij}$ is the number of times index term $i$ appears in document $j$. The decomposition $A = U\Sigma V^T$ is called the singular value decomposition of matrix $A$: there exist orthogonal matrices $U$ and $V$ and a diagonal matrix $\Sigma$, where $U$ is a $t \times t$ orthogonal matrix, $V$ is a $d \times d$ orthogonal matrix, and the columns of $U$ (resp. $V$) are called the left (resp. right) singular vectors of matrix $A$. The diagonal matrix $\Sigma$ is a $t \times d$ matrix whose diagonal elements $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(t,d)} \geq 0$ are the singular values of the matrix $A$. Let $\mathrm{rank}(A) = r$; then for any $k$ with $k \leq r$ and $k \ll \min(t,d)$, we can take only the $k$ greatest singular values and the corresponding singular vectors to create a $k$-reduced singular value decomposition of $A$.


We call $A_k = U_k \Sigma_k V_k^T$ the $k$-reduced singular value decomposition (rank-$k$ SVD). By computing the SVD, we get a $k$-reduced matrix built from the $k$ largest nonzero singular values. This matrix expresses the same latent semantic relations as those in the whole document collection, while eliminating the noise caused by polysemy and synonymy. The value of $k$ is experimentally determined to be several tens or hundreds; the exact value of $k$ remains an open question, as it depends on the number of topics in the collection. This paper applies the LSI algorithm to overcome the high-dimensional and sparse vector space problems in text classification. The LSI algorithm alleviates the problems of polysemy and synonymy, and provides a solid foundation for more accurate text classification.

2.2. Domain ontology. Ontology, as a structural knowledge representation, provides a modeling tool for describing knowledge models at the semantic and knowledge level, and also provides a semantic framework for text information. It includes vocabularies that are generally acknowledged by experts and scholars in a specific domain, and these vocabularies are organized in the form of a directed acyclic graph to describe concepts, properties and the relationships between them [12]. The higher the level, the more abstract the concepts; the lower the level, the more specific the concepts. We can form a normalized or standardized architecture through the ontology description, and enable machines to realize knowledge sharing and reuse according to the restrictions in the ontology. There are many definitions of ontology, and with the deepening understanding of ontology the definitions are becoming more and more complete. The most widely acknowledged definition was proposed by Studer [8] in 1998: an ontology is a formal, explicit specification of a shared conceptualization. In fact, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. This paper provides semantic information for classification by constructing a category vector from the concepts in a domain ontology, which yields better classification precision. The category vector is $c = \{c_1, c_2, \ldots, c_i, \ldots, c_n\}$, where $c$ denotes a predefined category and each $c_i$ is a concept in the ontology.

3. Classification Process.

3.1. Feature vector reduction.

3.1.1. Text preprocessing. First, we preprocess each text by removing stop words and stemming words. Stopping is the process of removing the most frequent words that occur in text, such as "and", "the", "it", etc. Removing these words saves space for storing text contents and reduces the time taken during the classification process. Stemming is the process of reducing each word extracted from a text to a possible root word; for example, "likes", "liked" and "liking" have a meaning similar to the word "like". After the stemming and stopping of the words in each text, we represent the texts as a term-text frequency matrix, taking tf-idf (term frequency times inverted document frequency) to compute the weight of each term as an element of the matrix. The formula for each term's weight is $w_{ij} = tf_{ij} \cdot idf_j$. The term frequency $tf_{ij}$ is the number of times the distinct word $j$ occurs in text $i$. The inverted document frequency weighs the frequency of a term with a factor that discounts its importance when it appears in almost all $N$ texts; its formula is $idf_j = \log(N/n_j + 0.01)$, where $N$ is the total number of texts and $n_j$ is the number of texts containing term $j$. Therefore, terms that appear too rarely or too frequently play a less important role in text classification. Considering that the length of a text may affect the weights, we normalize the weight as shown in formula (1), so that the weight values lie between zero and one:


$$w_{ij} = \frac{tf_{ij} \cdot \log\left(\frac{N}{n_j} + 0.01\right)}{\sqrt{\sum_{j=1}^{N} \left(tf_{ij}\right)^2 \left(\log\left(\frac{N}{n_j} + 0.01\right)\right)^2}} \qquad (1)$$
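As a concrete illustration, the following is a minimal sketch of the weighting scheme in formula (1). It assumes the texts have already been stopped and stemmed and their raw term counts collected; the class and method names are ours, not part of the original system.

```java
public class TfIdfWeighting {
    /**
     * Computes the normalized tf-idf weight matrix of formula (1).
     * counts[i][j] = raw frequency of term j in text i;
     * docFreq[j]   = number of texts containing term j (n_j).
     */
    public static double[][] weight(int[][] counts, int[] docFreq) {
        int numDocs = counts.length;           // N in formula (1)
        int numTerms = docFreq.length;
        double[][] w = new double[numDocs][numTerms];
        for (int i = 0; i < numDocs; i++) {
            double norm = 0.0;
            for (int j = 0; j < numTerms; j++) {
                double idf = Math.log((double) numDocs / docFreq[j] + 0.01);
                w[i][j] = counts[i][j] * idf;  // unnormalized tf * idf
                norm += w[i][j] * w[i][j];
            }
            norm = Math.sqrt(norm);
            if (norm > 0) {                    // normalize the row to unit length
                for (int j = 0; j < numTerms; j++) {
                    w[i][j] /= norm;
                }
            }
        }
        return w;
    }
}
```

Each row of the returned matrix is a unit-length text vector, matching the normalization in formula (1).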

3.1.2. Reducing the high-dimensional vector. Text preprocessing removes the stop words that play a less important role in text classification, but since no word appears in every text, the resulting matrix is still high-dimensional and sparse, and it is hard to classify texts effectively with it. In this paper, we adopt the LSI algorithm described in Section 2.1 to decompose the matrix into three matrices $U$, $\Sigma$ and $V$; the process of decomposition is shown in Figure 1. The row vectors of matrix $U_k$ represent the word vectors, and the row vectors of matrix $V_k$ represent the text vectors.

Figure 1. The process of decomposition

We replace the original vector of each text with a new vector obtained from the SVD, and thus get a low-dimensional matrix which is used for text classification. The new vector of each text is computed by the following formula:

$$d_{new} = d^T U_k \Sigma_k^{-1} \qquad (2)$$

where $d_{new}$ is the new text vector obtained by the dimension-reduction process and $d$ is the original text vector.
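The paper states that its implementation is in Java but does not name a matrix library; the sketch below uses Apache Commons Math as one plausible choice, so the library choice and class names are our assumptions. It computes the rank-k truncation of Section 2.1 and the fold-in of formula (2).

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class LsiReducer {
    private final RealMatrix uk;        // t x k, truncated left singular vectors
    private final RealMatrix sigmaKInv; // k x k, inverse of truncated Sigma

    /** Builds a rank-k LSI space from the t x d term-document matrix A. */
    public LsiReducer(double[][] termDoc, int k) {
        SingularValueDecomposition svd =
                new SingularValueDecomposition(new Array2DRowRealMatrix(termDoc));
        // Keep only the k largest singular values and their singular vectors.
        this.uk = svd.getU().getSubMatrix(0, termDoc.length - 1, 0, k - 1);
        double[] s = svd.getSingularValues();  // sorted in descending order
        double[][] inv = new double[k][k];
        for (int i = 0; i < k; i++) {
            inv[i][i] = 1.0 / s[i];
        }
        this.sigmaKInv = new Array2DRowRealMatrix(inv);
    }

    /** Folds a t-dimensional text vector d into the k-dimensional LSI space,
     *  implementing d_new = d^T * U_k * Sigma_k^{-1} from formula (2). */
    public double[] fold(double[] d) {
        RealMatrix dRow = new Array2DRowRealMatrix(new double[][] { d }); // 1 x t
        return dRow.multiply(uk).multiply(sigmaKInv).getRow(0);           // 1 x k
    }
}
```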

3.2. Classification process. The purpose of classification is to discover the common characteristics of data objects belonging to the same type, and to design a classifier that discriminates which type an unknown sample belongs to. Ontology-based text classification can obtain knowledge according to the natural distribution of the data, without prior knowledge. In this paper, we realize text classification by computing the similarity of the text vector and the category vector. The principle of text classification is shown in Figure 2.

3.2.1. Parsing ontology. Jena, which is open source and grew out of work at HP Labs on semantic web programming, is a Java framework for building semantic web applications. It provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine. Jena can acquire data according to the correspondence between resource classes in its general model and the classes, properties and instances in an ontology model. In this paper, we adopt the Jena 2.5.3 API to parse ontologies described in the OWL language, and extract concepts from each ontology to construct the category vector. The concepts include classes, subclasses, properties and instances of the ontology.
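A sketch of the concept extraction with Jena follows. The paper used Jena 2.5.3, whose classes lived in the com.hp.hpl.jena packages; the sketch below uses the org.apache.jena package names of current Jena releases, and the class name is illustrative.

```java
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.ontology.OntProperty;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.util.iterator.ExtendedIterator;

import java.util.ArrayList;
import java.util.List;

public class OntologyConceptExtractor {
    /** Collects the local names of the classes and properties of an OWL
     *  ontology, to be used as the concept terms of a category vector.
     *  Instances could be added the same way via model.listIndividuals(). */
    public static List<String> extractConcepts(String owlFileUrl) {
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read(owlFileUrl);                // e.g. "file:wine.owl"
        List<String> concepts = new ArrayList<>();
        ExtendedIterator<OntClass> classes = model.listClasses();
        while (classes.hasNext()) {
            OntClass c = classes.next();
            if (c.getLocalName() != null) {    // skip anonymous classes
                concepts.add(c.getLocalName());
            }
        }
        ExtendedIterator<OntProperty> props = model.listAllOntProperties();
        while (props.hasNext()) {
            concepts.add(props.next().getLocalName());
        }
        return concepts;
    }
}
```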


Figure 2. Text classification process

This paper uses four domain ontologies to construct category vectors: the wine, pizza, computer and tea ontologies. The wine, pizza and computer ontologies were chosen from a standard ontology library, while the tea ontology is our previous work, constructed by ourselves.

3.2.2. Computing similarity. Using the concepts from a domain ontology we construct the category vector $c = \{(c_1, w_1), (c_2, w_2), \ldots, (c_i, w_i), \ldots\}$, where $c_i$ is a concept in the ontology and $w_i$ is the weight of the concept. The text vector is $d = \{(I_1, v_1), (I_2, v_2), \ldots, (I_j, v_j), \ldots\}$, where $I_j$ is a key word in the text and $v_j$ is the weight of the key word. After applying the LSI algorithm, we get the new text vector described in detail in Section 3.1.2. Using formula (3), we compute the similarity between the category vector and the new text vector, and then determine which category the text belongs to.
$$Sim(c, d) = \cos(c, d) = \frac{\sum_{k=1}^{n} w_{ik} v_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2} \cdot \sqrt{\sum_{k=1}^{n} v_{jk}^2}} \qquad (3)$$
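The cosine measure of formula (3) reduces to a few lines of code. The following sketch (class and method names ours) also shows the surrounding decision rule of assigning a text to the most similar category.

```java
public class CosineSimilarity {
    /** Cosine similarity of formula (3): category weights w against text
     *  weights v; both vectors must live in the same k-dimensional space. */
    public static double sim(double[] w, double[] v) {
        double dot = 0.0, wNorm = 0.0, vNorm = 0.0;
        for (int k = 0; k < w.length; k++) {
            dot += w[k] * v[k];
            wNorm += w[k] * w[k];
            vNorm += v[k] * v[k];
        }
        if (wNorm == 0 || vNorm == 0) {
            return 0.0;                        // treat empty vectors as dissimilar
        }
        return dot / (Math.sqrt(wNorm) * Math.sqrt(vNorm));
    }

    /** Assigns a text to the category whose vector is most similar to it. */
    public static int classify(double[][] categoryVectors, double[] textVector) {
        int best = 0;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < categoryVectors.length; i++) {
            double s = sim(categoryVectors[i], textVector);
            if (s > bestSim) {
                bestSim = s;
                best = i;
            }
        }
        return best;
    }
}
```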

4. Experiment.

4.1. Evaluation index. This paper uses the information-retrieval evaluation indexes of precision and recall to evaluate the results. The formulas are shown below:

$$Precision = \frac{TC}{TC + FC} \qquad (4)$$

$$Recall = \frac{TC}{TC + MC} \qquad (5)$$

where $TC$ is the number of texts that belong to category $c_i$ and are correctly classified to category $c_i$ by the classifier, $FC$ is the number of texts that do not belong to category $c_i$ but are wrongly classified to category $c_i$ by the classifier, and $MC$ is the number of texts that belong to category $c_i$ but are wrongly classified to other categories by the classifier. Precision and recall reflect the results in two different aspects, but we also need to consider both indexes together in order to get balanced results, taking a balance point between them. So we also adopt the F1 measure [2] to evaluate the experiment results:

$$F1\ Measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (6)$$
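As a quick sketch (the class name is ours), formulas (4)-(6) translate directly into code:

```java
public class Evaluation {
    /** Precision, recall and F1 for one category, per formulas (4)-(6).
     *  tc = correctly assigned, fc = wrongly assigned, mc = missed texts. */
    public static double[] evaluate(int tc, int fc, int mc) {
        double precision = tc + fc == 0 ? 0.0 : (double) tc / (tc + fc);
        double recall = tc + mc == 0 ? 0.0 : (double) tc / (tc + mc);
        double f1 = precision + recall == 0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}
```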


4.2. Experiment results. The experiment platform is the Windows XP operating system with a 2.50 GHz CPU and 768 MB of memory. The experiment environment is Eclipse SDK 3.3, and the programming language is Java. In this experiment, we randomly collected 361 texts from the Yahoo website and other sources, including 111 texts about tea, 106 texts about computers, 80 texts about wine and 64 texts about pizza. We compared the performance of our method with the Naive Bayesian algorithm; the results of the ontology classifier with the LSI algorithm, the ontology classifier alone, and the Naive Bayesian classifier are given in Table 1, Table 2 and Table 3, respectively.

Table 1. The data of text classification with ontology and LSI algorithm

Table 2. The data of text classification with ontology

From Figure 6, we can see that the classifier with domain ontology and the LSI algorithm achieves higher performance than both the classifier with domain ontology alone and the Naive Bayesian classifier. Combining the LSI algorithm and ontology in the classifier improved the recall and F1 measure markedly, while the precision stayed at the same level as the ontology classifier, increasing slightly.


Table 3. The data of text classification with Naive Bayesian

Figure 3. The precision comparison of NB and our approach

Figure 4. The recall comparison of NB and our approach

Figure 5. The F1 measure comparison of NB and our approach


Figure 6. The average performance comparisons of experiments

The precisions of the computer and wine categories were improved, while the precisions of tea and pizza decreased. The reason is that the tea ontology, which we constructed ourselves, is not as complete as the ones in the ontology library; fewer classes and instances may lead to lower precision. Compared with the wine ontology, the pizza ontology has more instances and properties, and adding instances and properties that have little effect on text classification instead produced noise data that affected the classification results. The results shown in Figure 3, Figure 4 and Figure 5 prove that the evaluation indexes of the classifier with ontology and LSI were significantly higher than those of the Naive Bayesian classifier, except for the precision of pizza. The Naive Bayesian classifier can get good results in a small sample space, but its performance decreased as the number of texts increased. In contrast, the classifier with ontology and LSI obtained better results as the number of texts increased and the ontologies became more complete.

5. Conclusion and Future Development. In this paper we propose a general framework for text classification and apply the LSI algorithm and domain ontology within the framework, which solves the problem of a high-dimensional and sparse vector space and can improve the precision of text classification. The next step is to improve the LSI algorithm to adapt to ontology feature vectors and to realize flexible calculation in order to obtain better precision.

Acknowledgment. The authors gratefully acknowledge the support of the National Natural Science Foundation, China (No. 60473042), and the Science and Technology Development Project, Jilin Province, China (No. 20080323).
REFERENCES

[1] C. C. Chang, C. C. Lin and Y. S. Hu, An SVD oriented watermark embedding scheme with high qualities for the restored images, International Journal of Innovative Computing, Information and Control, vol.3, no.3, pp.609-620, 2007.
[2] D. D. Lewis, Evaluating and optimizing autonomous text classification systems, Proc. of the 18th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, New York, pp.246-254, 1995.
[3] D. Q. Miao and Z. H. Wei, The Principle and Application of Chinese Information Processing, Tsinghua University Press, Beijing, 2007.
[4] L. Lv, Y. S. Liu and Y. Liu, Realizing English text classification with semantic set index method, Journal of Beijing University of Posts and Telecommunications, vol.29, no.2, pp.22-25, 2006.
[5] M. H. Song, S. Y. Lim, D. J. Kang and S. J. Lee, Automatic classification of web pages based on the concept of domain ontology, Proc. of the 12th Asia-Pacific Software Engineering Conf., IEEE Computer Society, 2005.
[6] M. Rogati and Y. M. Yang, High-performing feature selection for text classification, Proc. of the 11th ACM Int'l Conf. on Information and Knowledge Management, ACM Press, pp.659-661, 2002.


[7] P. Moravec, M. Kolovrat and V. Snášel, LSI vs. wordnet ontology in dimension reduction for information retrieval, Proc. of the Dateso 2004 Workshop, V. Snášel, J. Pokorný and K. Richta (eds.), pp.18-26, 2004.
[8] R. Studer, V. R. Benjamins and D. Fensel, Knowledge engineering: Principles and methods, Data and Knowledge Engineering, vol.25, no.1-2, pp.161-197, 1998.
[9] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol.41, no.6, pp.391-407, 1990.
[10] S. Mohanty, P. K. Santi, R. Mishra, R. N. Mohapatra and S. Swain, Semantic based text classification using wordnets: Indian languages perspective, Proc. of the 3rd International Global WordNet Conf., P. Sojka, K.-S. Choi, C. Fellbaum and P. Vossen (eds.), South Jeju Island, Korea, pp.321-324, 2006.
[11] T. K. Landauer and S. T. Dumais, A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, vol.104, no.2, pp.211-240, 1997.
[12] T. R. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human-Computer Studies, vol.43, no.5-6, pp.907-928, 1995.
[13] M. Theobald, R. Schenkel and G. Weikum, Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data, Proc. of the 6th International Workshop on the Web and Databases, San Diego, California, 2003.
[14] Y. M. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, Proc. of the 14th Int'l Conf. on Machine Learning, Nashville, Morgan Kaufmann Publishers, pp.412-420, 1997.
[15] Z. T. Yu, X. Z. Fan, J. Y. Guo and Z. M. Gen, Answer extracting for Chinese question answering system based on latent semantic analysis, Chinese Journal of Computers, vol.29, no.10, 2006.

APPENDIX. Test Sets:

1. Tea:
http://www.pcworld.com/
http://chinese-tea.net/?s=101
http://en.wikipedia.org/wiki/Tea
http://chinesefood.about.com/library/weekly/aa011400a.htm

2. Wine:
http://en.wikipedia.org/wiki/Wine
http://www.healthcastle.com/redwine-heart.shtml
http://wine.about.com/od/winebasic1/a/winelegs.htm

3. Pizza:
http://en.wikipedia.org/wiki/Pizza
http://www.cooks.com/rec/search/0,10,pizza_toppings,FF.html
http://italianfood.about.com/library/weekly/aa021305.htm

4. Computer:
http://www.newegg.com/
http://www.cyberguys.com/
http://en.wikibooks.org/wiki/Computer_Hardware
http://www.devhardware.com/
