
A New Approach to Intelligent Text Filtering Based on Novelty Detection

Randa Kassab Jean-Charles Lamirel

LORIA - INRIA Lorraine, Campus scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy, France. Email: {kassabr,lamirel}@loria.fr

Abstract

This paper presents an original approach to modelling the user's information need in a text filtering environment. The approach relies on a specific novelty detection model which allows both accurate learning of the user's profile and evaluation of the coherency of the user's behaviour during his interaction with the system. Thanks to an online learning algorithm, the novelty detection model is also able to track changes in the user's interests over time. The proposed approach has been successfully tested on the Reuters-21578 benchmark. The experimental results show that this approach significantly outperforms the well-known Rocchio learning algorithm.

Keywords: novelty detection, information filtering, personalization, profile, online learning.

1 Introduction

As a consequence of the constant increase of information available on the Web, in digital libraries and in other similar resources, new techniques for personalized information access have become more and more important. Information filtering is one of the most useful and challenging tasks for effective information access. It is concerned with dynamically adapting the distribution of information, where both evolving user interests and new incoming information are taken into account. For building up a model of the user's information need, also called the user's profile, information filtering relies on the user's positive examples, represented by documents he likes, and possibly on negative examples, represented by documents he dislikes. This profile is then used to automatically separate relevant from irrelevant documents in an incoming stream. A wide range of machine learning algorithms and information retrieval techniques have been applied to the text filtering task, including Rocchio's linear classifier, k-nearest neighbours, Bayesian classifiers, neural networks, Support Vector Machines and boosting (Ault & Yang 2001, Schapire & Singer 1998, Schutze, Hull & Pedersen 1995, Dumais, Platt, Heckerman & Sahami 1998, Shankar & Karypis 2000). All these techniques focus on learning the user's profile exactly, in order to filter the content that would be interesting for the user as accurately as possible. This practice
Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the Seventeenth Australasian Database Conference (ADC2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 49. Gillian Dobbie and James Bailey, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

is very suitable considering the fact that the performance of the filtering process is heavily dependent on the accuracy of the user's profile. However, these techniques cannot provide any information about the user's behaviour during his interaction with the filtering system. In our opinion, the user's information need can be of different types, e.g. precise, exploratory or thematic (Lamirel & Créhange 1994), and according to these types the system must adapt the filtering results in a specific way. Moreover, most of the existing methods operate in an off-line mode where all the training documents have to be stored. Consequently, each time a new training example needs to be added, the system has to restart the training from the beginning. Such a learning mode is not appropriate in online applications where memory space is limited and real-time filtering response is crucial. This paper presents a new approach to text filtering based on the novelty detection principle. The main objective of novelty detection is to emphasize the novelty in a yet unseen document with respect to previously learned ones. The basic idea is to learn a model of the available documents and to use it for identifying dissimilar documents (novelty). In the filtering context, the novelty detection principle is mainly applied in a reverse way, i.e. the documents that are similar to a model learned from positive examples of the user's need will be selected. The specific novelty detector filter (NDF) we use is an adaptation of a former novelty detector model proposed by Kohonen (Kohonen 1984) and based on orthogonal projection operators. The most powerful feature of this filter is its ability to accurately learn the user's profile and to evaluate, in a parallel way, the coherency of the user's behaviour during his interaction with the system. Thanks to an online learning algorithm, the novelty filter is also able to track changes in the user's interests over time. The novelty detector filter is investigated on the Reuters-21578 benchmark.
We compare its performance experimentally with the widely used Rocchio algorithm; our experiments clearly highlight that the novelty detector filter is more effective than the latter. The paper is organized as follows: Section 2 introduces the basic concepts of the novelty detector filter. Section 3 describes our adaptation of this filter to the text filtering task. Section 4 recalls the Rocchio algorithm used as a baseline, and Section 5 reports on our experiments. The paper ends with a description of our future work.

2 The Novelty Detector Filter

The novelty detector filter is a linear adaptive system which acts, after being trained on reference data, as a projection operator onto the vector space that is orthogonal to the space spanned by the reference data. Consider figure 1: the NDF only passes through the novelty component of a data item with respect to the previously learned reference data. The residual component, which is orthogonal to the novelty component, is known as the habituation component. Such a system has a transfer function equivalent to a square matrix:

Φ_k = I − X_k X_k⁺

where X_k = [x_1, x_2, ..., x_k] is the reference data matrix in which each x_i is an n-dimensional vector, X_k⁺ is the Moore-Penrose pseudo-inverse of X_k, and I denotes the identity matrix. The inputs and outputs are thus related by x̃ = Φ_k x.

Figure 1: The model of the novelty detector filter

The mathematical learning model of the novelty detector is based on Greville's theorem (Kohonen 1984), which yields a recursive expression for calculating the transfer function of the filter. After simplification, the theorem can be expressed as:

Φ_k = Φ_{k−1} − x̃_k x̃_kᵀ / ‖x̃_k‖²   (1)
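The recursion of Eq. (1) is straightforward to implement. The following is a minimal numerical sketch (not the authors' code), assuming a numpy environment; `train_ndf` is a hypothetical helper name:

```python
import numpy as np

def train_ndf(X):
    """Learn the NDF transfer matrix over reference vectors (Eq. 1).

    X: array of shape (k, n) whose rows are the reference vectors x_i.
    Returns Phi_k, the orthogonal projector onto the complement of the
    space spanned by the rows of X.
    """
    n = X.shape[1]
    phi = np.eye(n)                      # the recursion starts with Phi_0 = I
    for x in X:
        x_tilde = phi @ x                # novelty component of x
        norm2 = x_tilde @ x_tilde
        if norm2 > 1e-12:                # a redundant vector carries no update
            phi -= np.outer(x_tilde, x_tilde) / norm2
    return phi
```

Applied to a learned vector, or to any linear combination of learned vectors, the resulting matrix outputs zero, as stated in the text.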

proved much harder than expected for the NDF. This can be mainly explained by the fact that the learning rule (Eq. 1), which is essentially designed for novelty detection, is not directly applicable to the filtering task. In fact, the novelty detection learning rule is mainly intended to distinguish the novelty parts from the old, or habituated, parts in the input data with respect to the previously learned reference data, regardless of whether one learned data item occurs more often in the reference data than another. Hence, such a learning rule does not allow cumulative learning of the user's need. In other words, both the training examples regarded as redundant (i.e. those that carry no novelty, that is, those for which x̃ is null) and the features that have been totally learned (i.e. those for which the novelty vector is null when the unit vector associated with the feature is applied as input) are no longer taken into account during training. Therefore, it is quite possible to learn discriminating and non-discriminating features with the same degree of relevancy. Accordingly, a good separation between relevant and irrelevant documents will not always be easy to achieve. An adaptation of the learning rule to filtering is then required. Our proposal is to introduce the identity matrix into the learning formula so that all training examples, and consequently all their features, are considered separately during the learning phase. The new learning rule is defined as:

Φ_k = I + Φ_{k−1} − x̃_k x̃_kᵀ / ‖x̃_k‖²   (2)
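As a sketch of how Eq. (2) differs in practice from Eq. (1), the loop below (a hypothetical helper name, numpy assumed) starts from a null matrix and re-adds the identity at every step, so repeated examples keep reinforcing their features:

```python
import numpy as np

def train_frule(X):
    """Modified NDF learning rule (Eq. 2):
    Phi_k = I + Phi_{k-1} - x_tilde x_tilde^T / ||x_tilde||^2,
    with x_tilde = (I + Phi_{k-1}) x_k and Phi_0 the null matrix."""
    n = X.shape[1]
    phi = np.zeros((n, n))               # Phi_0 is a zero matrix
    for x in X:
        x_tilde = (np.eye(n) + phi) @ x
        phi = np.eye(n) + phi - np.outer(x_tilde, x_tilde) / (x_tilde @ x_tilde)
    return phi
```

Unlike Eq. (1), features that occur frequently in the training documents become more and more habituated relative to rare ones, which is the behaviour the appendix example exploits.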

where x̃_k = Φ_{k−1} x_k represents the orthogonal projection of the vector x_k onto the space orthogonal to the space spanned by the k−1 previously learned vectors; ‖x‖ denotes the length (Euclidean norm) of the vector x; and the recursion starts with Φ_0 = I. Once the learning phase described above is over, the filter has become habituated to the reference data. Thus, if one of the reference data items, or an arbitrary linear combination of them, is applied to the filter input, the novelty output will be zero. On the other hand, if a novel data item not belonging to the space spanned by the reference data is chosen as input, the corresponding output will be nonzero and can be seen as representative of the new features extracted from the input data with respect to the reference data.

3 The Novelty Detector Filter in Text Filtering

where x̃_k = (I + Φ_{k−1}) x_k and Φ_0 is a zero, or null, matrix. As learning progresses, features which frequently appear in the training documents become more and more habituated compared to less frequent ones. This typically helps to discriminate the relevant and irrelevant documents more accurately. An example and further details about the defects of the original learning rule of the NDF (Eq. 1) when applied to the filtering task, and about how our modified rule (Eq. 2) can improve performance, are provided in the appendix. Since learning from only positive examples is suitable for several applications, two strategies are considered and tested for using the novelty detector filter in a text filtering system, according to the kind of available training examples.

3.1 Positive training examples only

The principle of novelty detection is particularly interesting for text filtering. In this context, a data item corresponds to a text document which is itself represented by a weighted feature vector in an n-dimensional description space, where n is the total number of features extracted from the training documents (i.e. all the documents currently available in the system). The reference data are the documents marked by the user as examples of his information need, i.e. the positive training examples. Our previous evaluations of the NDF for text filtering (Kassab, Lamirel & Nauer 2005a, 2005b) enabled us to observe that the performance of the NDF is quite effective in the case of single-label datasets where each document belongs to exactly one category. However, it fails in the multi-label case where a document may belong to more than one category. Although the latter case is relatively more difficult to process for all existing approaches, since high correlation between relevant and irrelevant documents is strongly probable, it has

Training an NDF on a set of positive training examples permits the calculation of the transfer function of the filter. This function is a projection matrix that represents a space orthogonal to the space spanned by the positive training examples. The projection of each yet unseen document, say d, on the filter matrix then generates a novelty vector d̃ = Φ d. Two proportions can thus be computed:

The novelty proportion, which quantifies the amount of novelty in the document under consideration with respect to the documents seen during training:

N_d = ‖d̃‖ / (n ‖d‖)   (3)

where n is the number of training documents.

The habituation proportion, which quantifies the similarity of the document to the previously learned ones:

H_d = 1 − ‖d̃‖ / (n ‖d‖)   (4)

This latter proportion can be considered as the relevance score of the document d and thus be used for ranking documents: the higher the habituation proportion, the more relevant the document.

3.2 Positive and negative training examples

When the training set consists of positive and negative examples of the user's need, two novelty detector filters should be used: the acceptance filter A, learned from the positive examples, and the rejection filter R, learned from the negative examples. After learning, the relevance score of each new document d is computed using the following formula:

R_d = α H_{A,d} − β H_{R,d}   (5)
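A minimal sketch of the scoring defined by Eqs. (4) and (5), with hypothetical helper names and numpy assumed; the acceptance filter below is the FRule matrix computed in the appendix example, while the rejection filter is an illustrative one-step FRule filter over a single negative example:

```python
import numpy as np

def habituation(phi, d, n_train):
    """H_d = 1 - ||phi d|| / (n ||d||)  (Eq. 4)."""
    return 1.0 - np.linalg.norm(phi @ d) / (n_train * np.linalg.norm(d))

def relevance(phi_A, n_A, phi_R, n_R, d, alpha=2.0, beta=1.0):
    """R_d = alpha * H_{A,d} - beta * H_{R,d}  (Eq. 5)."""
    return alpha * habituation(phi_A, d, n_A) - beta * habituation(phi_R, d, n_R)

# Acceptance filter: the FRule matrix of the appendix example (2 positive docs)
phi_A = np.array([[ 0.8, 0, -0.4, 0, 0],
                  [ 0.0, 2,  0.0, 0, 0],
                  [-0.4, 0,  1.2, 0, 0],
                  [ 0.0, 0,  0.0, 2, 0],
                  [ 0.0, 0,  0.0, 0, 2]])
# Rejection filter: one FRule step on a single negative example d3
d3 = np.array([0.0, 0, 1, 1, 0])
phi_R = np.eye(5) - np.outer(d3, d3) / (d3 @ d3)

d7 = np.array([1.0, 0, 0, 0, 0])
score_d7 = relevance(phi_A, 2, phi_R, 1, d7)   # high H_A, zero H_R for d7
```

Documents close to the positive examples and far from the negative ones receive the highest combined score.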

Figure 2: Evolution of the saturation when learning from all 10 categories and when learning from only one category (e.g. Interest)

4 The Rocchio Algorithm

where H_{A,d} is the habituation proportion of the document d using the acceptance filter; H_{R,d} is the habituation proportion of the document d using the rejection filter; α and β are positive parameters which control the relative importance of the positive and negative examples respectively.

Concept of saturation

Since most filtering systems use all the features assigned to positive or negative examples for modelling the user's profile, this can naturally lead to increased noise in the profile and consequently to reduced filtering accuracy. Nevertheless, if the accuracy becomes too low this phenomenon can surely be imputed to the user, who has not been able to correctly formulate his need. Unfortunately, to the best of our knowledge, this fact is not taken into account in most existing systems, despite its great importance for choosing a well-suited strategy for adjusting a dissemination threshold. The saturation of an NDF (S_NDF) may be understood as the inability of this filter to extract new features with respect to the learned documents. In other words, it corresponds to the filter having learned all the features of the description space. This case can occur if the number of learned documents is such that they generate a subspace whose dimension is equal to the dimension of the description space. We define the saturation of an NDF as the ratio of the number of learned features¹ to the total number of features in the description space. The saturation value is useful for assessing the type of the user's information need. This need can be considered as precise when the saturation is quite low, and then the filtering threshold must be set to a high value, while the user's need can be regarded as exploratory when the saturation is very high, or maximal, and in this case the filtering threshold must be set to a low value² (see figure 2).
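Under the definition above, the saturation can be sketched as follows (a hypothetical helper, numpy assumed); since Φ e_f is simply column f of the filter matrix, the per-feature habituation reduces to column norms:

```python
import numpy as np

def saturation(phi, n_train, threshold=0.0):
    """S_NDF = (# learned features) / (total # features), a feature f
    counting as learned when H_f = 1 - ||phi e_f|| / n exceeds
    `threshold` (see footnote 1)."""
    h = 1.0 - np.linalg.norm(phi, axis=0) / n_train   # ||phi e_f|| = norm of column f
    return float(np.sum(h > threshold)) / phi.shape[1]

# Saturation of the appendix FRule filter (2 training documents):
# only f1 and f3 (2 of 5 features) are partially learned
phi = np.array([[ 0.8, 0, -0.4, 0, 0],
                [ 0.0, 2,  0.0, 0, 0],
                [-0.4, 0,  1.2, 0, 0],
                [ 0.0, 0,  0.0, 2, 0],
                [ 0.0, 0,  0.0, 0, 2]])
s = saturation(phi, 2)
```

A dissemination threshold of (1 − S_NDF), as suggested in footnote 2, would then follow directly from this value.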

Rocchio's algorithm is a relevance feedback algorithm that has been widely used for improving the performance of information retrieval systems (Rocchio 1971) and that has since been adapted to the text filtering task (Schapire et al. 1998, Ault et al. 2001). It computes a prototype profile P_c as the weighted difference of the centroid vectors of the positive and negative examples:

P_c = α (1/|R|) Σ_{d∈R} d − β (1/|N|) Σ_{d∈N} d   (6)

The parameters α and β control the relative impact of the positive and negative examples; |R| and |N| are respectively the number of positive and negative examples. The profile P_c is restricted to non-negative values. Ranking is then achieved by computing the cosine similarity between the prototype profile and each new document.

5 Experiments
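The Rocchio prototype of Eq. (6) and the cosine ranking can be sketched as below (hypothetical helper names, numpy assumed); the clipping to non-negative values follows the restriction stated above:

```python
import numpy as np

def rocchio_profile(pos, neg, alpha=2.0, beta=1.0):
    """P_c = alpha * centroid(pos) - beta * centroid(neg)  (Eq. 6),
    restricted to non-negative components."""
    p = alpha * np.mean(pos, axis=0) - beta * np.mean(neg, axis=0)
    return np.clip(p, 0.0, None)

def rank_by_cosine(profile, docs):
    """Return document indices sorted by cosine similarity to the profile."""
    sims = docs @ profile / (np.linalg.norm(profile) * np.linalg.norm(docs, axis=1))
    return np.argsort(-sims)
```

Setting the negative components to zero is precisely the behaviour discussed later in the appendix comparison, where it shields Rocchio from over-penalizing documents shared between categories.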

In this section, we describe our experiments for testing the performance of the novelty detector filter in a text filtering environment. Throughout the present study we focus our attention on evaluating the quality of the learned profile, i.e. the filter's matrix, for representing the user's information need and thus for ranking documents by relevancy. Hence, we did not use any dissemination threshold, but we plan to investigate this strategy in the near future. Experimental results on the Reuters-21578 collection are presented, demonstrating that our approach is more effective than the Rocchio learning algorithm.

5.1 Reuters collection

1 By learned features we mean here the features for which the habituation proportion is higher than zero, or higher than a predefined threshold. 2 We expect that a threshold set to the value (1 − S_NDF) would be suitable, but we have not yet compared it with other thresholding strategies.

We conducted our experiments on the Reuters-21578 collection³. The documents of this collection are divided into training and test sets and are labelled with 135 categories. Our experimental results are reported for the set of the top 10 categories with the highest number of positive training documents.
3 The Reuters-21578 collection is publicly available at http://www.daviddlewis.com/resources/testcollections/reuters21578/

Table 1: # training and test examples in the top 10 categories of the Reuters collection

Category name   # train   # test
acq             1650      719
corn            181       56
crude           389       189
earn            2877      1087
grain           433       149
interest        347       131
money-fx        538       179
ship            197       89
trade           369       117
wheat           212       71

Table 1 gives the number of training and test documents in each of the categories. Usually, the training documents assigned to each category are used as positive examples of the user's need, and the rest of the training documents, in the other categories, as negative examples. Due to the relatively high quantity of negative examples in the Reuters collection, we selected the negative examples using the query zoning method (Singhal, Mitra & Buckley 1997, Schapire et al. 1998), where only the |R| top-ranking documents retrieved from the negative training examples by the centroid vector of the positive examples are used. This method has proved to be efficient for Rocchio's algorithm.

5.2 Preprocessing

Table 2: Comparison of performance results for the original learning rule (NDRule) and our modified learning rule (FRule) using only positive examples

Category name   NDRule   FRule
acq             74.50    95.65
corn            61.09    81.34
crude           40.49    91.65
earn            84.45    91.95
grain           64.48    97.82
interest        58.16    78.83
money-fx        50.95    76.70
ship            54.60    88.10
trade           33.30    93.35
wheat           59.34    87.61
mean AvgP       58.14    88.30

thresholds (Robertson & Soboroff 2001). It is defined as the sum of the precision values at each point where a relevant document appears in the ranked list, divided by the total number of relevant documents. Relevant documents which are not retrieved within the top 1000 receive a precision of zero.

5.4 Results and discussion
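The average uninterpolated precision described above can be sketched as follows (a hypothetical helper, pure Python):

```python
def avg_uninterpolated_precision(ranked, relevant, cutoff=1000):
    """Sum of the precision at each rank where a relevant document
    occurs within the top `cutoff`, divided by the TOTAL number of
    relevant documents; relevant documents outside the top `cutoff`
    thus contribute a precision of zero."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

Dividing by the total number of relevant documents (rather than the number retrieved) is what penalizes relevant documents missing from the top of the ranking.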

We performed standard preprocessing steps: document parsing, tokenization and stop-word removal. Only single words were used for content representation. The highly discriminating word features were selected using the chi-square statistic (Yang & Pedersen 1997), as follows. Given a set of candidate features, we compute the goodness of each feature for each category as:
χ²(f, c) = N (AD − CB)² / [(A + C)(B + D)(A + B)(C + D)]   (7)

where N is the total number of training documents; A is the number of times f and c co-occur; B is the number of times f occurs without c; C is the number of times c occurs without f; D is the number of times neither f nor c occurs. The values were then sorted in descending order and the top 50 features were chosen for each category⁴ and used to form a global description space of 379 features. These features were then used for representing both training and test documents by vectors of feature weights. The weights are calculated using tf·idf weighting (Salton & Buckley 1988) for the training examples and only tf weights for the test examples⁵. The vectors are normalized using the cosine normalization method (Salton et al. 1988), so that each document has a length of 1.

5.3 Performance measure
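The chi-square computation of Eq. (7) from the four cell counts can be sketched as (a hypothetical helper name):

```python
def chi_square(a, b, c, d):
    """Chi-square goodness of feature f for category c (Eq. 7).
    a = #(f with c), b = #(f without c), c = #(c without f),
    d = #(neither f nor c); N = a + b + c + d."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Scoring each (feature, category) pair individually, rather than averaging over categories, is the per-category selection strategy defended in the results discussion.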

Evaluation is achieved using the average uninterpolated precision metric, which is widely used for TREC⁶ routing tasks without dissemination
4 Selecting the top 50 features is solely based on the fact that the average length of the training documents after preprocessing is 47 words. 5 The tf (term frequency) formula used in our study is TF = 1 + log₂(tf), where tf is the feature frequency within a document. The idf (inverse document frequency) is defined as IDF = log₂(N/n) + 1, where N is the total number of documents in the training collection and n the number of documents containing the feature. 6 Text REtrieval Conference, TREC. http://trec.nist.gov/

For the purpose of demonstration, we first present the results from the original learning rule (Eq. 1) and our modified learning rule (Eq. 2) for the 10 most frequent categories of the Reuters collection. As can be seen in Table 2, our modified learning rule (FRule) substantially outperforms the original learning rule (NDRule). More specifically, NDRule yields 58.14% accuracy over the ten categories, while FRule yields 88.30%, resulting in an overall improvement of 30%. The explanation of the substantial difference between the two learning rules again has to do with the fact that the original learning rule is not able to distinguish between discriminating and non-discriminating features in the training documents (see Section 3). Consequently, the failure of the NDRule becomes obvious when documents are represented by a few highly discriminating features against too many non-discriminating ones. Indeed, when we looked at the ranking performed by the NDRule, we found considerable confusion between the categories. As an example, the NDRule produces particularly bad results for trade because of its confusion with earn. Since trade shares almost all its features with earn, some of these features are certainly more relevant to trade than to earn. Nevertheless, the inability of the NDRule to discriminate such features led it to consider most earn documents as relevant, and even more relevant to trade than trade's own documents. On the other hand, earn is less sensitive to this phenomenon because of the higher number of test documents associated with this category as compared to trade. In the rest of this discussion we evaluate the performance of the NDF using our modified learning rule (FRule), with and without negative examples. Table 3 summarizes our results using only positive training examples. Compared to Rocchio's method, the NDF produces superior average precision on almost all categories, with a +3.25% overall improvement.
Although Rocchio is slightly better than the NDF for some categories (viz. ship, trade, wheat), we believe that NDF learning is consistently better than Rocchio learning, especially for distinguishing between discriminating and non-discriminating features in the training examples. This is especially obvious for the crude category, whose documents are represented with a high ratio of discriminating features. In our opinion, one

Figure 3: Comparison of NDF and Rocchio on the Reuters categories using only positive examples

reason for the small advantage of Rocchio's method on a few categories might be the presence of some outliers among the test documents which are extreme in comparison with the training documents, i.e. their discriminating features are quite different from those of the training documents. Thus, the NDF will favour the outliers of the other categories that are similar to the positive training documents over the outliers of the training category. As a consequence, these outliers might disturb the ranking of the last relevant documents. On the other hand, Rocchio avoids this problem because its separation between discriminating and non-discriminating features is less precise. We would also like to mention that Rocchio achieved much better effectiveness in our experiments than the current state-of-the-art Rocchio results on the Reuters collection, see (Dumais et al. 1998, Shankar et al. 2000). We think that the results are quite sensitive to the indexing and feature selection methods. For example, unlike the claims reported in (Ault et al. 2001), we found that the chi-square test leads to a considerable improvement in performance on the Reuters collection. Nevertheless, this improvement depends on how the selection criterion is applied. A common practice is to use a global measure that averages the chi-square values over the number of categories (Liu, Liu, Chen & Ma 2003, Yang et al. 1997). This strategy makes it possible to eliminate the most common features, but not to extract the most discriminating features between categories. For this reason, we have measured the relevancy of a feature individually for each category and then selected the top 50 features of each category in order to form our global feature space.
To get a better understanding of the behaviour of the NDF, we have calculated the average uninterpolated precision in relation to the number of positive training examples for the 10 categories. Considering the results we obtained, shown in figure 3, we observe that the NDF and Rocchio are closely competitive when training from very few positive examples. However, as

Table 3: AvgP using only positive examples

Category name   Rocchio   NDF
acq             92.46     95.65
corn            78.88     81.34
crude           83.88     91.65
earn            86.04     91.95
grain           97.02     97.82
interest        73.73     78.83
money-fx        66.75     76.70
ship            88.55     88.10
trade           94.48     93.35
wheat           88.72     87.61
mean AvgP       85.05     88.30

the number of positive training examples increases, the NDF is consistently superior to Rocchio for almost all categories. As expected, exploiting negative examples yields better results for both the Rocchio and NDF approaches. Table 4 shows the average precision for α = 2, β = 1 (see equations 5 and 6). Despite the significant improvement in the performance of Rocchio's method, we observe that the NDF remains more effective across all categories. In order to explore the sensitivity of our proposed approach to the setting of the control parameters, we conducted experiments using different values of these parameters. The mean uninterpolated average precision over the categories is reported in Table 5. Conversely to (Singhal et al. 1997), our results show that α = β is unsuitable for a multi-label collection, especially when both query zoning and feature selection techniques are jointly applied. Our hypothesis is that for the most similar categories, like corn, grain and wheat, which have many features in common, such parameter settings could yield the elimination of the most expressive features of the relevant category, and thus the irrelevant documents would be favoured over the relevant ones. This effect becomes evident when fea-

Table 4: AvgP using positive and negative examples

Category name   Rocchio   NDF
acq             95.29     97.27
corn            95.45     95.94
crude           92.96     93.27
earn            90.02     92.66
grain           97.80     98.50
interest        84.04     90.92
money-fx        80.41     88.99
ship            88.38     89.32
trade           95.73     96.87
wheat           91.75     91.95
mean AvgP       91.18     93.57

Table 5: Mean average precision for different α, β

α      β      Rocchio   NDF
1      1      78.57     58.01
16     4      88.90     91.62
0.75   0.25   89.77     92.50
2      1      91.18     93.57

References

Ault, T. & Yang, Y. (2001), 'kNN, Rocchio and Metrics for Information Filtering at TREC-10', in The Tenth Text REtrieval Conference (TREC 10).

Dumais, S., Platt, J., Heckerman, D. & Sahami, M. (1998), 'Inductive Learning Algorithms and Representations for Text Categorization', in Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM), pp. 148-155.

Kassab, R., Lamirel, J.-C. & Nauer, E. (2005a), 'Novelty Detection for Modeling User's Profile', in Proceedings of the 18th International Florida Artificial Intelligence Research Society Conference (FLAIRS 05), Clearwater Beach, Florida, AAAI Press, pp. 830-831.

Kassab, R., Lamirel, J.-C. & Nauer, E. (2005b), 'Une nouvelle approche pour la modélisation du profil de l'utilisateur dans les systèmes de filtrage d'information : le modèle de filtre détecteur de nouveauté', in Deuxième Conférence en Recherche d'Information et Applications (CORIA'05), France, pp. 185-200.

Kohonen, T. (1984), Self-Organization and Associative Memory, Springer-Verlag, New York, USA.

Lamirel, J.-C. & Créhange, M. (1994), 'Application of a Symbolico-Connectionist Approach for the Design of a Highly Interactive Documentary Database Interrogation System with On-Line Learning Capabilities', in Proceedings of the Third International Conference on Information and Knowledge Management (CIKM 94), pp. 155-163.

Liu, T., Liu, S., Chen, Z. & Ma, W. (2003), 'An Evaluation on Feature Selection for Text Clustering', in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).

Robertson, S. E. & Soboroff, I. (2001), 'The TREC-9 Filtering Track Final Report', in Proceedings of the 9th Text REtrieval Conference (TREC 9), pp. 25-40.

Rocchio, J. J. (1971), 'Relevance Feedback in Information Retrieval', in The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey.

Salton, G. & Buckley, C. (1988), 'Term-Weighting Approaches in Automatic Text Retrieval', Information Processing and Management, 24(5), 513-523.

Schapire, R., Singer, Y. & Singhal, A. (1998), 'Boosting and Rocchio Applied to Text Filtering', in Proceedings of the Twenty-first Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215-223.

Schutze, H., Hull, D. A. & Pedersen, J. O. (1995), 'A Comparison of Classifiers and Document Representations for the Routing Problem', in Proceedings of SIGIR'95, the International Conference on Research and Development in Information Retrieval, ACM Press, pp. 229-237.

ture selection is applied. As a concrete example, assume that a document d has a positive similarity of 0.8 with the acceptance filter (respectively, the centroid of positive examples in Rocchio's method) and a negative similarity of 0.9 with the rejection filter (respectively, the centroid of negative examples); the relevance score of this document will then be −0.1. Thus, this document, which is common to the relevant category and one or more of the negative categories, would be ranked below the irrelevant documents which have very low or zero similarity with the positive and negative examples. In this case, Rocchio will perform better than the NDF because it prevents this effect by arbitrarily setting the negative components to zero.

6 Conclusion

We have presented a promising approach to modelling the user's information need in an intelligent text filtering environment. This approach is based on a specific novelty detection model, the NDF, which allows accurate learning of the user's profile and which provides, in a parallel way, a more comprehensive capture of the type of the user's information need. Experiments carried out on the Reuters collection showed the effectiveness of the NDF as compared to Rocchio's method. Moreover, when combining our method with a suitable feature selection approach, the results seem to be better than the ones previously reported for the best off-line learning methods, like kNN, SVM and boosting. Hence, in a further step, we plan to compare our approach more thoroughly with these latter methods. Nevertheless, a more extensive evaluation is needed, concerning especially the setting of dissemination thresholds and the performance of the NDF on the adaptive filtering task (see the definition in (Robertson et al. 2001)). Last but not least, we plan to exploit the ability of our approach to detect novelty for controlling the amount of redundancy in the filtering results according to the type of the user's information need (Zhang, Callan & Minka 2002). Hence, the filtering result should respond exactly to the user's need when this need is precise, whereas, if the type of the user's information need is exploratory, the filtering result should first cover the user's need as much as possible. In this latter case redundancy could be present and new information could also be recommended to the user.

Shankar, S. & Karypis, G. (2000), 'Weight Adjustment Schemes for a Centroid Based Classifier', Computer Science Technical Report TR00-035, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota.

Singhal, A., Mitra, M. & Buckley, C. (1997), 'Learning routing queries in a query zone', in Proceedings of SIGIR'97, 20th ACM International Conference on Research and Development in Information Retrieval, pp. 25-32.

Yang, Y. & Pedersen, J. O. (1997), 'A Comparative Study on Feature Selection in Text Categorization', in Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420.

Zhang, Y., Callan, J. & Minka, T. (2002), 'Novelty and redundancy detection in adaptive filtering', in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 81-88.

Appendix: Illustrative example of the NDF learning model

In Sections 2 and 3 we outlined the basic concepts of the NDF model and our adaptation of this model to the text filtering task. This appendix presents a simple example showing the main defect of the original learning rule of the NDF (Eq. 1) when applied to the filtering task, and why a modification of this rule is required. The example also illustrates how the modified rule (Eq. 2) can enhance the efficiency of the filtering process, in both learning the user's profile and ranking the filtering result. We will refer to the original learning rule as NDRule and to the modified learning rule as FRule.

2. Φ_2 = Φ_1 − d̃_2 d̃_2ᵀ / ‖d̃_2‖² =

   | 0  0  0  0  0 |
   | 0  1  0  0  0 |
   | 0  0  0  0  0 |
   | 0  0  0  1  0 |
   | 0  0  0  0  1 |

with d̃_2 = Φ_1 d_2.

On the basis of the above matrix, it is possible to calculate the habituation proportion of each feature by projecting the unit vector U_f associated with the feature on the space spanned by the NDF, using the following formula:

H_f = 1 − ‖Φ_2 U_f‖ / ‖U_f‖

We get: H_f1 = H_f3 = 1 and H_f2 = H_f4 = H_f5 = 0. This means that the features f1 and f3 have been totally learned and will have the same relevancy for representing the training documents, despite the fact that f1 is more relevant for discriminating the user's need than f3, which could be a common (i.e. non-discriminating) feature among the documents in the collection. This is due to the fast learning of features that is characteristic of the NDRule. Hence, once a feature is totally learned it will no longer be considered, since only the novelty vector is used during learning (i.e. the presence or absence of such a feature in the ensuing training documents will not affect learning in the next steps). This increases the possibility of learning non-discriminating features. Calculating now the relevance score of each new document, which corresponds to the habituation proportion of the document, we get:

Rank   Document   Relevance score
1      d7         1
2      d5, d6     0.29

Table 6: Sample feature-by-document matrix (5 features × 7 documents)

      Train              Test
      d1  d2  d3  d4  |  d5  d6  d7
f1     1   1   0   0  |   1   0   1
f2     0   0   0   1  |   0   0   0
f3     0   1   1   1  |   0   1   0
f4     0   0   1   1  |   0   1   0
f5     0   0   0   0  |   1   0   0

Let us now consider the set of documents presented in Table 6, which contains four training documents and three new test documents. Suppose that the two documents d1 and d2 are selected by the user as examples of his information need among the four available documents. The system has to learn the user's profile from these two examples in order to rank the incoming test documents by relevancy for the user. As a result of running the NDRule on the reference documents d1, d2 we obtain the transfer matrix Φ_2 of the NDF:

1. Φ_1 = Φ_0 − d̃_1 d̃_1ᵀ / ‖d̃_1‖², where Φ_0 = I and d̃_1 = Φ_0 d_1

Note that although d5 is more relevant than d6, they were ranked as having the same degree of relevancy⁷. The main reason for this is the inability of the NDRule to correctly distinguish between discriminating and non-discriminating features. Modifying the learning strategy is therefore imperative to avoid this problem. In the following discussion we show how our modification of the learning rule (FRule) can improve the filtering efficiency. Starting from the modified learning rule FRule, the transfer matrix Φ_2 we obtain is:

   |  0.8   0  −0.4   0   0 |
   |  0     2   0     0   0 |
   | −0.4   0   1.2   0   0 |
   |  0     0   0     2   0 |
   |  0     0   0     0   2 |

If we now calculate the habituation proportions of the features according to Eq.4, we get:
7 In more complex cases, where there is a high similarity between the relevant and non-relevant training and/or test documents, many relevant documents may be ranked lower than non-relevant ones.

H_f1 = 0.55, H_f3 = 0.37, H_f2 = H_f4 = H_f5 = 0

As can be seen, f1 is now more habituated than f3 and, accordingly, f1 will be considered as more relevant than f3 for representing the user's need. The ranking of the new documents is shown in the following table:

Rank   Document   Relevance score
1      d7         0.55
2      d5         0.23
3      d6         0.16

As expected, the non-relevant document d6 is ranked with the lowest relevance score.
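The whole appendix example can be reproduced numerically. The sketch below (a hypothetical helper, numpy assumed) runs both rules on d1 and d2 and recomputes the habituation proportions reported above:

```python
import numpy as np

def learn(X, frule):
    """NDRule (frule=False, Eq. 1) or FRule (frule=True, Eq. 2)."""
    n = X.shape[1]
    phi = np.zeros((n, n)) if frule else np.eye(n)
    for x in X:
        x_tilde = ((np.eye(n) + phi) if frule else phi) @ x
        norm2 = x_tilde @ x_tilde
        update = np.outer(x_tilde, x_tilde) / norm2 if norm2 > 1e-12 else 0.0
        phi = (np.eye(n) + phi - update) if frule else (phi - update)
    return phi

X = np.array([[1.0, 0, 0, 0, 0],    # d1
              [1.0, 0, 1, 0, 0]])   # d2
nd, fr = learn(X, False), learn(X, True)

# NDRule: f1 and f3 end up equally (totally) habituated
h_nd = 1 - np.linalg.norm(nd, axis=0)
# FRule (n = 2): the discriminating feature f1 is more habituated than f3
h_fr = 1 - np.linalg.norm(fr, axis=0) / 2
```

Only FRule separates f1 from f3, which is exactly the difference the two ranking tables above illustrate.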
