
A new approach to Email classification using Concept Vector Space Model

Chao Zeng
Institute of Computer Applications
East China Normal University
Shanghai, China
czeng@ica.stc.sh.cn
Junzhong Gu
Institute of Computer Applications
East China Normal University
Shanghai, China
jzgu@cs.ecnu.edu.cn
Zhao Lu
Institute of Computer Applications
East China Normal University
Shanghai, China
zlu@cs.ecnu.edu.cn
Abstract

Email classification methods based on content generally use the Vector Space Model (VSM). The model is constructed from the frequency of each independent word appearing in the Email content. Frequency-based VSM does not take the context of a word into account, so the feature vectors cannot accurately represent the Email content, which results in inaccurate classification. This paper presents a new approach to Email classification based on a Concept Vector Space Model built with WordNet. In our approach, we extract high-level category information during the training process by replacing the terms in the feature vector with WordNet synonym sets and considering the hypernymy-hyponymy relations between synonym sets. We design an Email classification system based on the concept VSM and carry out a series of experiments. The results show that our approach can improve the accuracy of Email classification, especially when the training set is small.

1. Introduction
Email has become an efficient and popular communication mechanism as the number of Internet users increases. However, the sheer volume and spread of Email cause greater interference even as we enjoy its convenience. Email often reflects current hot social issues and public feeling, but its proliferation also affects how people collate and acquire information. If Email can be automatically classified, then people can access the content relevant to them accurately and quickly, which will greatly improve efficiency and thereby reduce losses in manpower, financial and material resources. Above all, Email classification is of great significance and value. It has become a new academic subject and has drawn attention from various circles of society in recent years. In commercial circles, Email classification software emerges endlessly, while academia has seen an upsurge of research on Email classification. The accuracy of Email classification encourages the researchers.
Email filtering technology develops continuously while the means of sending Email change constantly. Nowadays many anti-Email programs adopt not just a single technology but a synthesis of multiple technologies. The main technologies used by current anti-Email products are as follows: black lists, white lists, DNS identification, rate control, OCR recognition and analysis, virus scanning, comprehensive reputation systems, rule-based scoring systems, data mining and so on. In recent years, a large number of researchers have studied Email classification based on data mining technology. Current data mining techniques include Bayesian methods, artificial intelligence, text clustering, decision trees, and so on.

2. Related work
Typically, the process of Email classification has three main steps: pre-processing, feature selection and classifier construction.
Pre-processing is composed of word segmentation, feature representation and feature extraction. According to the degree of semantic understanding, current feature representations can be divided into two categories: the keyword-based expression model, i.e. the Vector Space Model (VSM), and the notional expression model based on word-sense understanding. Although VSM does not consider semantic information and loses some relationships between words, it is simpler and easier to handle, and text processing (mainly classification) can be more effective than with the latter. Therefore VSM is the most commonly used method.
Existing feature selection methods can generally be classified into two categories: filter methods and wrapper methods. The former treats feature selection as a pre-processing step: it weights the features through a series of rules and then constructs a reduced-dimensional vector space from the top k features ranked by weight. Examples include Document Frequency, Mutual Information and the χ² statistic. The flaw of filter methods is that they treat the feature dimensions as independent of each other, which lowers the accuracy of classification. Many improvements have been made on Document Frequency and on the TF*IDF weighting proposed by Salton in 1973, such as the probabilistic TF*IDF algorithm proposed by Thorsten Joachims [1] and the TF*IWF*IWF algorithm proposed by Roberto Basili [2]. The wrapper method treats the classifier as a black box for feature selection. Such methods have been verified to be more effective than filter methods; however, their computational cost is excessive, and especially when the number of features is very large their practicality is limited.
Classifier construction can be broadly categorized into statistics-based classifiers [3], connection-based classifiers [4] and rule-based classifiers [5]. Naïve Bayes [6], KNN [7] and SVM [8] are statistics-based methods; the neural network is a connection-based method; the rule-based decision tree is a rule-based method. Researchers have verified the validity of these algorithms; according to the experimental results, among 14 classification algorithms including KNN, decision tree, Naïve Bayes and neural networks, the classification accuracy is satisfactory after training with a large training set. However, they share a common problem: they do not consider the semantic relationships between words, so they often produce a high-dimensional vector space, which greatly reduces classification performance. On the other hand, with a limited training set, the type information is too sparse and its level too low, so the classification accuracy of these algorithms decreases greatly. Simple vector classification is suitable for simple text and low vector dimension. Because Email is basically short text whose vector dimension will not be large, simple vector classification is appropriate for Email classification with respect to both complexity and effect.
After analyzing the present technology, this paper presents a new approach to feature selection. In our approach, based on WordNet [9], we describe a text Email by establishing a concept vector space model: we first extract the high-level category information during the training process by replacing terms with synonym sets in WordNet and considering the hypernymy-hyponymy relations between synonym sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; in the end, we determine the type of a text Email using a simple vector classification method.

3. Email classification using concept VSM


Email classification consists of two stages: the training stage and the classification stage. In the training stage we train the classifier using type-labeled Emails and thus obtain the feature vector space of each type; in the classification stage we take an unclassified Email as input and output the type of the Email as the result, as shown in Figure 1.
Figure 1. Email classification based on concept VSM (incoming Email → pre-processing → concept list formulation → weight revision → classifier → Email folders).

3.1. Training
The process of training is as follows:
1. Pre-processing
2. Concept list formulation
3. Weight revision

3.1.1. Pre-processing. The pre-processing proceeds as follows:
1) Segment the words in Email $d_j$; remove punctuation and high-frequency words; strip roots and affixes. Represent $d_j$ as $\{w_1, w_2, ..., w_i, ..., w_k\}$, and count the number of times $w_i$ appears in $d_j$ as $N(w_{ij})$.
2) Count the number of times $w_i$ appears in the Emails of type $C_k$ as $N(w_i : C_k) = \sum_{j=1}^{n_k} N(w_{ij})$, where $d_1, d_2, ..., d_j, ..., d_{n_k}$ are the $n_k$ training Emails of type $C_k$. Set up a word set $Q_L(C_k)$ to hold the words that appear in Emails of this type.
3) Count the number of times $w_i$ appears in the whole training set as $N(w_i) = \sum_{k=1}^{n} N(w_i : C_k)$, where $n$ is the number of types. Set up the word set of the whole training set, denoted $Q_J$.
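To make these steps concrete, here is a minimal pre-processing sketch in Python (not the authors' implementation), assuming the NLTK toolkit with its stopword list and Porter stemmer; the function names are our own.

```python
# A minimal sketch of steps 1)-3), assuming NLTK is installed and its
# 'punkt' and 'stopwords' data have been downloaded. Names are illustrative.
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

_stops = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess(email_text):
    """Return the term counts N(w_ij) for one Email d_j."""
    tokens = word_tokenize(email_text.lower())
    # Drop punctuation and high-frequency function words, then strip
    # affixes by stemming, as in step 1).
    words = [_stemmer.stem(t) for t in tokens if t.isalpha() and t not in _stops]
    return Counter(words)

def count_for_type(emails):
    """Sum the counts over the n_k Emails of one type: N(w_i : C_k)."""
    total = Counter()
    for text in emails:
        total.update(preprocess(text))
    return total
```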

3.1.2. Concept List Formulation. The concept list is formulated as follows. Initialization: $Q_L'(C_k) = Q_L(C_k)$; $Q_J' = Q_J$.
1) If $w_i$ does not appear in WordNet, then remove it: $Q_L'(C_k) \leftarrow Q_L'(C_k) - \{w_i\}$, $Q_J' \leftarrow Q_J' - \{w_i\}$.
2) For the first word $w_1$ in $Q_L'(C_k)$, get the concept list of $w_1$ by searching WordNet.
3) Save the concept list to VectorId sequentially, and at the same time save $N(w_i : C_k)$ to VectorCValue as the weight of the synonym set of $w_i$ ($i = 1, 2, ..., m$, where $m$ is the number of words in $Q_L'(C_k)$) in the concept list, and also as the weight of the synonym sets of $w_i$'s direct hyponyms in the concept list. If an entry already exists in VectorId, it need not be saved again.
4) For each $w_i$ ($i = 2, 3, ..., m$) in $Q_L'(C_k)$, search WordNet and get the concept list of $w_i$. If the synonym sets of the concept list are already in VectorId, then change the corresponding value $V_i$ in VectorCValue: $V_i \leftarrow V_i + N(w_i : C_k)$, and change the weights of the direct hyponyms of these synonym sets as well. If not, return to step 3).
5) When all the words in $Q_L'(C_k)$ have been processed, the concept vector space of the training set of type $C_k$ has been constructed.
6) Join the VectorId and VectorCValue of each type to form the VectorId and VectorSValue of the whole training set.
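As a rough illustration of this procedure, the sketch below uses NLTK's WordNet interface; a plain dictionary from synset name to weight stands in for the paper's VectorId/VectorCValue structures, which is our simplification.

```python
# Illustrative sketch of concept-list formulation; a dict from synset
# name to weight stands in for the VectorId/VectorCValue pair.
from nltk.corpus import wordnet as wn

def build_concept_vector(type_counts):
    """type_counts: word -> N(w_i : C_k) produced by pre-processing."""
    weights = {}
    for word, count in type_counts.items():
        synsets = wn.synsets(word)
        if not synsets:
            continue  # step 1): words absent from WordNet are dropped
        for s in synsets:
            # Weight the synonym set itself and its direct hyponym
            # synsets, per steps 3) and 4).
            for concept in [s] + s.hyponyms():
                key = concept.name()  # e.g. 'car.n.01'
                weights[key] = weights.get(key, 0) + count
    return weights
```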

3.1.3. Weight Revision. The weight revision proceeds as follows:
1) Count the Inverse Document Frequency of $V_i$: $IDF(V_i) = [\log(N(V_i)/N)]^2$, in which $N$ is the number of Emails in the whole training set and $N(V_i)$ is the number of Emails in which $V_i$ appears.
2) Count the Inverse Category Frequency of $V_i$: $FICF(V_i) = \sum_{k=1}^{n} (U_{ik} - \bar{V}_i)^2$, in which $U_{ik}$ is the weight of $V_i$ in type $C_k$ and $\bar{V}_i$ is equal to the weight of $V_i$ in the whole training set divided by the number of types.
3) Revise the weight of the concept: $VectorCValue(V_i) \leftarrow VectorCValue(V_i) \times IDF(V_i) \times FICF(V_i)$.
4) Normalize the VectorCValues: $VectorCValue(V_i) \leftarrow VectorCValue(V_i) / \sqrt{\sum_{i=1}^{n} (VectorCValue(V_i))^2}$.
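Under the reconstruction of the formulas above, the revision step could be sketched as follows; the doc_freq and per_type_weights inputs are assumed to have been gathered during training, and the dictionary representation is ours.

```python
import math

def revise_weights(weights, doc_freq, n_docs, per_type_weights):
    """weights: concept -> VectorCValue; doc_freq[c] = N(V_i);
    n_docs = N; per_type_weights[c] = [U_i1, ..., U_in] over the n types."""
    revised = {}
    for c, w in weights.items():
        idf = math.log(doc_freq[c] / n_docs) ** 2          # step 1)
        u = per_type_weights[c]
        v_bar = sum(u) / len(u)                            # mean weight over types
        ficf = sum((u_k - v_bar) ** 2 for u_k in u)        # step 2)
        revised[c] = w * idf * ficf                        # step 3)
    norm = math.sqrt(sum(v * v for v in revised.values()))  # step 4)
    return {c: v / norm for c, v in revised.items()} if norm else revised
```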

3.2. Classification
The process of classification is as follows:
1. Do pre-processing on the test Email, construct the concept list of the Email, and then get the VectorId and VectorCValue of the Email.
2. Revise the weights to get the final VectorId and VectorCValue of the Email.
3. Calculate the similarity between the concept vector of the Email, $X = (x_1, x_2, ..., x_m)$, and the concept vector of each type, $Y = (y_1, y_2, ..., y_m)$:

$F_{sim}(X, Y) = \frac{\sum_{k=1}^{m} x_k y_k}{\sqrt{\sum_{k=1}^{m} x_k^2} \cdot \sqrt{\sum_{k=1}^{m} y_k^2}}$

4. The test Email belongs to the type corresponding to the largest value of $F_{sim}(X, Y)$.
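A direct rendering of this similarity computation, with concept vectors stored as Python dictionaries (our representation, not the paper's):

```python
import math

def fsim(x, y):
    """Cosine similarity Fsim(X, Y) between two concept vectors."""
    dot = sum(v * y.get(c, 0.0) for c, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def classify(email_vec, type_vecs):
    """Step 4: pick the type whose vector maximizes Fsim."""
    return max(type_vecs, key=lambda t: fsim(email_vec, type_vecs[t]))
```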

4. Experiments

In this section, extensive experiments were performed to gauge several aspects of the proposed Email classification method. We introduce the experimental data set, the evaluation metrics and the experimental results.

4.1. Data Set

The data set is the prerequisite and foundation for Email classification, and is also an essential basis for objective evaluation of classification performance. In this paper we use the documents of 20_newsgroups, a standard document set, as the data set. The documents are placed under 20 directories; each directory is a category of the newsgroup, and each category generally includes 1,000 articles. We select some types from 20_newsgroups for the experiments, taking part of the data as the training set and the rest as the test set. We run the experiments with the traditional vector space model and with the concept-based vector space model proposed in this paper separately.

4.2. Performance Evaluation

We select three commonly used assessment measures to evaluate the performance of the classification system: Precision, Recall and the F1 value. Precision is the ratio of the number of correctly classified Emails to the number of Emails actually assigned to the type; it mainly reflects the classifier's ability to search accurately. Recall is the ratio of the number of correctly classified Emails to the number of Emails that truly belong to the type; it mainly reflects the classifier's ability to search extensively. F1 is an evaluation measure that considers both Precision and Recall comprehensively. We compute the Precision, Recall and F1 values as follows:

$P(\mathrm{Precision}) = \frac{N_{correct}}{N_{actual}}$, $R(\mathrm{Recall}) = \frac{N_{correct}}{N_{total}}$, $F1 = \frac{2 \times P \times R}{P + R}$

The average F1 score is used as the evaluation measure in all the following evaluations.
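These measures are straightforward to compute; a small helper with names matching the formulas above might look like this:

```python
def evaluate(n_correct, n_assigned, n_in_type):
    """Precision, Recall and F1 for one type: n_assigned = N_actual
    (Emails the classifier put in the type), n_in_type = N_total
    (Emails that truly belong to it)."""
    p = n_correct / n_assigned if n_assigned else 0.0
    r = n_correct / n_in_type if n_in_type else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```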

4.3. Experiment Results

We conducted two experiments: 1) fixing the sizes of the training set and test set, we compare the performance of the traditional vector space model and the concept vector space model using WordNet proposed in this paper; 2) with different sizes of training set, we compare the performance of the traditional VSM and the concept VSM.

In the first experiment three categories were selected from 20_newsgroups: alt.atheism, comp.graphics and rec.autos. We selected 300 Emails from each category as the training set, i.e. 900 in total. As the test set, 100 Emails were selected from each category.
Table 1. Traditional VSM

Type        rec.autos   alt.atheism   comp.graphics
Precision   0.78        0.70          0.73
Recall      0.70        0.70          0.80
F1          0.74        0.70          0.76


Table 2. Concept VSM

Type        rec.autos   alt.atheism   comp.graphics
Precision   0.86        0.93          0.90
Recall      0.82        0.88          0.96
F1          0.84        0.90          0.93

Tables 1 and 2 list the results of the traditional method and the improved method respectively. It can be seen from the two tables that the concept VSM based on WordNet gives results superior to the traditional vector space model.

Figure 2. Performance comparison with different sizes of training set (training set sizes 30, 100, 300, 600, 700 and 900; y-axis: classification accuracy; curves: Concept VSM vs. Tradition VSM).
In experiment 2, we selected training sets of 30, 100, 300, 600, 700 and 900 Emails from the training set of experiment 1, keeping the test set the same as in experiment 1. Here the evaluation index is the macro-averaged F1. It can be seen from Figure 2 that the scale of the training set has an impact on performance. As the size of the training set increases, the classification accuracy of both the traditional VSM and the concept VSM increases: when the training set is small it is difficult to select good features to represent the Email vector, so the classifier does not perform well, but as the number of training samples grows, better features can be chosen to represent the Email vector.

5. Conclusion and future work


This paper presents an approach to feature selection. In our approach, based on WordNet, we describe a text Email by establishing a concept vector space model: we first extract the high-level category information during the training process by replacing terms with synonym sets in WordNet and considering the hypernymy-hyponymy relations between synonym sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; in the end, we determine the type of a text Email using a simple vector classification method. We carried out a series of experiments to compare our approach with the term-based VSM approach. The results show that our approach can improve the accuracy of text Email classification, especially when the training set is small.

Our future research is to use the concept vector obtained by the method proposed in this paper for hierarchical classification. At the same time, we will further attempt to improve the classification accuracy and reduce the dimension of the feature vector when the training set is large.

6. References

[1] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML '97, pp. 143-151.
[2] R. Basili, A. Moschitti, M. Pazienza. A text classifier based on linguistic processing. In Proceedings of IJCAI '99, Machine Learning for Information Filtering.
[3] Manu Aery, Sharma Chakravarthy. eMailSift: Mining-based approaches to Email classification. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 580-581.
[4] J. Clark, I. Koprinska, J. Poon. A neural network based approach to automated e-mail classification. In Proc. of the IEEE/WIC Int'l Conf. on Web Intelligence, 2003, pp. 702-705.
[5] Irena Koprinska, Felix Trieu, Josiah Poon and James Clark. E-mail classification by decision forests. In Proc. of the 8th Australasian Document Computing Symposium (ADCS), 2003.
[6] T.A. Meyer, B. Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In First Conference on Email and Anti-Spam (CEAS), 2004, pp. 1-8.
[7] L. Baoli, L. Qin, Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 2004.
[8] Andrew Farrugia. Investigation of Support Vector Machines for Email Classification. 2004.
[9] C. Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998.
