
A new approach to Email classification using Concept Vector Space Model

Chao Zeng
Institute of Computer Applications
East China Normal University
Shanghai, China
czeng@ica.stc.sh.cn
Junzhong Gu
Institute of Computer Applications
East China Normal University
Shanghai, China
jzgu@cs.ecnu.edu.cn
Zhao Lu
Institute of Computer Applications
East China Normal University
Shanghai, China
zlu@cs.ecnu.edu.cn
Abstract

Email classification methods based on content generally use the Vector Space Model (VSM). The model is constructed from the frequency of each independent word appearing in the Email content. Frequency-based VSM does not take the context of a word into account, so the feature vectors cannot accurately represent the Email content, which results in inaccurate classification. This paper presents a new approach to Email classification based on a Concept Vector Space Model built with WordNet. In our approach, we extract high-level category information during the training process by replacing the terms in the feature vector with WordNet synonym sets and considering the hypernymy-hyponymy relations between synonym sets. We design an Email classification system based on the concept VSM and carry out a series of experiments. The results show that our approach can improve the accuracy of Email classification, especially when the training set is small.

1. Introduction
Email has become an efficient and popular communication mechanism as the number of Internet users increases. However, the sheer volume and spread of Email cause greater interference even as we enjoy its convenience. Email often reflects current hot social issues and public feeling, but its proliferation also affects how people collate and acquire information. If Email can be automatically classified, then people can access the content relevant to them accurately and quickly, which will greatly improve efficiency and thereby reduce losses in manpower, financial and material resources. Above all, Email classification is of great significance and value. It has become a new academic subject and has drawn attention from various circles of society in recent years. In commercial circles, Email classification software emerges endlessly, while academia has seen an upsurge of research on Email classification. The accuracy of Email classification encourages the researchers.
Email filtering technology develops continuously while the means of sending Email change constantly. Nowadays many anti-Email programs adopt not just a single technology but a synthesis of multiple technologies. The main technologies used by current anti-Email products are as follows: black lists, white lists, DNS identification, rate control, OCR recognition and analysis, virus scanning, comprehensive reputation systems, rule-based scoring systems, data mining and so on. In recent years, a large number of researchers have studied Email classification based on data mining technology. Current data mining techniques include Bayesian methods, artificial intelligence, text clustering, decision trees, and so on.

2. Related work
Typically, the process of Email classification has three main steps: pre-processing, feature selection and classifier construction.
Pre-processing is composed of word segmentation, feature representation and feature extraction. According to the degree of semantic understanding, current feature representations can be divided into two categories: the keyword-based expression model, i.e. the Vector Space Model (VSM), and the notional expression model based on word-sense understanding. Although VSM does not consider semantic information and loses some relationships between words, it is simpler and easier to handle, and text processing (mainly classification) can be more effective than with the latter. Therefore VSM is the most commonly used method.
Existing feature selection methods can generally be classified into two categories: filter methods and wrapper methods. The former treats feature selection as a pre-processing step: it weights the features through a series of rules and then constructs a reduced-dimensional vector space from the top k features ranked by weight. Examples include Document Frequency, Mutual Information and the χ² statistic. The flaw of filter methods is that they treat the feature dimensions as independent of each other, which lowers the accuracy of classification. Many improvements have been made on Document Frequency and on the TF*IDF weighting proposed by Salton in 1973, such as the probabilistic TF*IDF algorithm proposed by Thorsten Joachims [1] and the TF*IWF*IWF algorithm proposed by Roberto Basili [2]. The wrapper method treats the classifier as a black box for feature selection. Such methods have been verified to be more effective than filter methods; however, their computational cost is excessive, and especially when the number of features is very large their practicality is limited.
Classifier construction can be broadly categorized into statistics-based classifiers [3], connection-based classifiers [4] and rule-based classifiers [5]. Naïve Bayes [6], KNN [7] and SVM [8] are statistics-based methods; the neural network is a connection-based method; the rule-based decision tree is a rule-based method. Researchers have verified the validity of these algorithms; according to the experimental results, among 14 classification algorithms including KNN, decision tree, Naïve Bayes and neural networks, the classification accuracy is satisfactory after training with a large training set. However, they share a common problem: they do not consider the semantic relationships between words, so they often produce a high-dimensional vector space, which greatly reduces classification performance. On the other hand, with a limited training set, the type information is too sparse and its level too low, so the classification accuracy of these algorithms decreases greatly. Simple vector classification is suitable for simple text and low vector dimension. Because Email is basically short text whose vector dimension will not be large, simple vector classification is appropriate for Email classification with respect to both complexity and effect.
After analyzing the present technology, this paper presents a new approach to feature selection. In our approach, based on WordNet [9], we describe a text Email by establishing a concept vector space model: we first extract the high-level category information during the training process by replacing terms with synonym sets in WordNet and considering the hypernymy-hyponymy relations between synonym sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; in the end, we determine the type of a text Email using a simple vector classification method.

3. Email classification using concept VSM


Email classification consists of two stages: the training stage and the classification stage. In the training stage we train the classifier using type-labeled Emails and thus obtain the feature vector space of each type; in the classification stage we take an unclassified Email as input and output the type of the Email as the result, as shown in Figure 1.
Figure 1. Email classification based on concept VSM (incoming Email → pre-processing → concept list formulation → weight revision → classifier → Email folders).

3.1. Training
The process of training is as follows:
1. Pre-processing
2. Concept list formulation
3. Weight revision

3.1.1. Pre-processing. The pre-processing proceeds as follows:
1) Segment the words in Email $d_j$; remove punctuation and high-frequency words; strip roots and affixes. Represent $d_j$ as $\{w_1, w_2, ..., w_i, ..., w_k\}$, and count the number of times $w_i$ appears in $d_j$ as $N(w_{ij})$.
2) Count the number of times $w_i$ appears in the Emails of type $C_k$ as $N(w_i : C_k) = \sum_{j=1}^{n_k} N(w_{ij})$, where $d_1, d_2, ..., d_j, ..., d_{n_k}$ are the $n_k$ training Emails of type $C_k$. Set up a word set $Q_L(C_k)$ to hold the words that appear in Emails of this type.
3) Count the number of times $w_i$ appears in the whole training set as $N(w_i) = \sum_{k=1}^{n} N(w_i : C_k)$, where $n$ is the number of types. Set up the word set of the whole training set, denoted $Q_J$.
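To make these steps concrete, here is a minimal pre-processing sketch in Python (not the authors' implementation), assuming the NLTK toolkit with its stopword list and Porter stemmer; the function names are our own.

```python
# A minimal sketch of steps 1)-3), assuming NLTK is installed and its
# 'punkt' and 'stopwords' data have been downloaded. Names are illustrative.
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

_stops = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess(email_text):
    """Return the term counts N(w_ij) for one Email d_j."""
    tokens = word_tokenize(email_text.lower())
    # Drop punctuation and high-frequency function words, then strip
    # affixes by stemming, as in step 1).
    words = [_stemmer.stem(t) for t in tokens if t.isalpha() and t not in _stops]
    return Counter(words)

def count_for_type(emails):
    """Sum the counts over the n_k Emails of one type: N(w_i : C_k)."""
    total = Counter()
    for text in emails:
        total.update(preprocess(text))
    return total
```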

3.1.2. Concept List Formulation. The concept list is formulated as follows. Initialization: $Q_L'(C_k) = Q_L(C_k)$; $Q_J' = Q_J$.
1) If $w_i$ does not appear in WordNet, then remove it: $Q_L'(C_k) \leftarrow Q_L'(C_k) - \{w_i\}$, $Q_J' \leftarrow Q_J' - \{w_i\}$.
2) For the first word $w_1$ in $Q_L'(C_k)$, get the concept list of $w_1$ by searching WordNet.
3) Save the concept list to VectorId sequentially, and at the same time save $N(w_i : C_k)$ to VectorCValue as the weight of the synonym set of $w_i$ ($i = 1, 2, ..., m$, where $m$ is the number of words in $Q_L'(C_k)$) in the concept list, and also as the weight of the synonym sets of $w_i$'s direct hyponyms in the concept list. If an entry already exists in VectorId, it need not be saved again.
4) For each $w_i$ ($i = 2, 3, ..., m$) in $Q_L'(C_k)$, search WordNet and get the concept list of $w_i$. If the synonym sets of the concept list are already in VectorId, then change the corresponding value $V_i$ in VectorCValue: $V_i \leftarrow V_i + N(w_i : C_k)$, and change the weights of the direct hyponyms of these synonym sets as well. If not, return to step 3).
5) When all the words in $Q_L'(C_k)$ have been processed, the concept vector space of the training set of type $C_k$ has been constructed.
6) Join the VectorId and VectorCValue of each type to form the VectorId and VectorSValue of the whole training set.
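As a rough illustration of this procedure, the sketch below uses NLTK's WordNet interface; a plain dictionary from synset name to weight stands in for the paper's VectorId/VectorCValue structures, which is our simplification.

```python
# Illustrative sketch of concept-list formulation; a dict from synset
# name to weight stands in for the VectorId/VectorCValue pair.
from nltk.corpus import wordnet as wn

def build_concept_vector(type_counts):
    """type_counts: word -> N(w_i : C_k) produced by pre-processing."""
    weights = {}
    for word, count in type_counts.items():
        synsets = wn.synsets(word)
        if not synsets:
            continue  # step 1): words absent from WordNet are dropped
        for s in synsets:
            # Weight the synonym set itself and its direct hyponym
            # synsets, per steps 3) and 4).
            for concept in [s] + s.hyponyms():
                key = concept.name()  # e.g. 'car.n.01'
                weights[key] = weights.get(key, 0) + count
    return weights
```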

3.1.3. Weight Revision. The weight revision proceeds as follows:
1) Count the Inverse Document Frequency of $V_i$: $IDF(V_i) = [\log(N(V_i)/N)]^2$, in which $N$ is the number of Emails in the whole training set and $N(V_i)$ is the number of Emails in which $V_i$ appears.
2) Count the Inverse Category Frequency of $V_i$: $FICF(V_i) = \sum_{k=1}^{n} (U_{ik} - \bar{V}_i)^2$, in which $U_{ik}$ is the weight of $V_i$ in type $C_k$ and $\bar{V}_i$ is equal to the weight of $V_i$ in the whole training set divided by the number of types.
3) Revise the weight of the concept: $VectorCValue(V_i) \leftarrow VectorCValue(V_i) \times IDF(V_i) \times FICF(V_i)$.
4) Normalize the VectorCValues: $VectorCValue(V_i) \leftarrow VectorCValue(V_i) / \sqrt{\sum_{i=1}^{n} (VectorCValue(V_i))^2}$.
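Under the reconstruction of the formulas above, the revision step could be sketched as follows; the doc_freq and per_type_weights inputs are assumed to have been gathered during training, and the dictionary representation is ours.

```python
import math

def revise_weights(weights, doc_freq, n_docs, per_type_weights):
    """weights: concept -> VectorCValue; doc_freq[c] = N(V_i);
    n_docs = N; per_type_weights[c] = [U_i1, ..., U_in] over the n types."""
    revised = {}
    for c, w in weights.items():
        idf = math.log(doc_freq[c] / n_docs) ** 2          # step 1)
        u = per_type_weights[c]
        v_bar = sum(u) / len(u)                            # mean weight over types
        ficf = sum((u_k - v_bar) ** 2 for u_k in u)        # step 2)
        revised[c] = w * idf * ficf                        # step 3)
    norm = math.sqrt(sum(v * v for v in revised.values()))  # step 4)
    return {c: v / norm for c, v in revised.items()} if norm else revised
```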

3.2. Classification
The process of classification is as follows:
1. Do pre-processing on the test Email, construct the concept list of the Email, and then get the VectorId and VectorCValue of the Email.
2. Revise the weights to get the final VectorId and VectorCValue of the Email.
3. Calculate the similarity between the concept vector of the Email, $X = (x_1, x_2, ..., x_m)$, and the concept vector of each type, $Y = (y_1, y_2, ..., y_m)$:

$F_{sim}(X, Y) = \frac{\sum_{k=1}^{m} x_k y_k}{\sqrt{\sum_{k=1}^{m} x_k^2} \cdot \sqrt{\sum_{k=1}^{m} y_k^2}}$

4. The test Email belongs to the type corresponding to the largest value of $F_{sim}(X, Y)$.
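A direct rendering of this similarity computation, with concept vectors stored as Python dictionaries (our representation, not the paper's):

```python
import math

def fsim(x, y):
    """Cosine similarity Fsim(X, Y) between two concept vectors."""
    dot = sum(v * y.get(c, 0.0) for c, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def classify(email_vec, type_vecs):
    """Step 4: pick the type whose vector maximizes Fsim."""
    return max(type_vecs, key=lambda t: fsim(email_vec, type_vecs[t]))
```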

4. Experiments

In this section, extensive experiments were performed to gauge several aspects of the proposed Email classification method. We introduce the experimental data set, the evaluation metrics and the experimental results.

4.1. Data Set

The data set is the prerequisite and foundation for Email classification, and is also an essential basis for objective evaluation of classification performance. In this paper we use the documents of 20_newsgroups, a standard document set, as the data set. The documents are placed under 20 directories; each directory is a category of the newsgroup, and each category generally includes 1,000 articles. We select some types from 20_newsgroups for the experiments, taking part of the data as the training set and the rest as the test set. We run the experiments with the traditional vector space model and with the concept-based vector space model proposed in this paper separately.

4.2. Performance Evaluation

We select three commonly used assessment measures to evaluate the performance of the classification system: Precision, Recall and the F1 value. Precision is the ratio of the number of correctly classified Emails to the number of Emails actually assigned to the type; it mainly reflects the classifier's ability to search accurately. Recall is the ratio of the number of correctly classified Emails to the number of Emails that truly belong to the type; it mainly reflects the classifier's ability to search extensively. F1 is an evaluation measure that considers both Precision and Recall comprehensively. We compute the Precision, Recall and F1 values as follows:

$P(\mathrm{Precision}) = \frac{N_{correct}}{N_{actual}}$, $R(\mathrm{Recall}) = \frac{N_{correct}}{N_{total}}$, $F1 = \frac{2 \times P \times R}{P + R}$

The average F1 score is used as the evaluation measure in all the following evaluations.
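These measures are straightforward to compute; a small helper with names matching the formulas above might look like this:

```python
def evaluate(n_correct, n_assigned, n_in_type):
    """Precision, Recall and F1 for one type: n_assigned = N_actual
    (Emails the classifier put in the type), n_in_type = N_total
    (Emails that truly belong to it)."""
    p = n_correct / n_assigned if n_assigned else 0.0
    r = n_correct / n_in_type if n_in_type else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```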

4.3. Experiment Results

We conducted two experiments: 1) fixing the sizes of the training set and test set, we compare the performance of the traditional vector space model and the concept vector space model using WordNet proposed in this paper; 2) with different sizes of training set, we compare the performance of the traditional VSM and the concept VSM.

In the first experiment three categories were selected from 20_newsgroups: alt.atheism, comp.graphics and rec.autos. We selected 300 Emails from each category as the training set, i.e. 900 in total. As the test set, 100 Emails were selected from each category.
Table 1. Traditional VSM

Type        rec.autos   alt.atheism   comp.graphics
Precision   0.78        0.70          0.73
Recall      0.70        0.70          0.80
F1          0.74        0.70          0.76


Table 2. Concept VSM

Type        rec.autos   alt.atheism   comp.graphics
Precision   0.86        0.93          0.90
Recall      0.82        0.88          0.96
F1          0.84        0.90          0.93

Tables 1 and 2 list the results of the traditional method and the improved method respectively. It can be seen from the two tables that the concept VSM based on WordNet gives results superior to the traditional vector space model.

Figure 2. Performance comparison with different sizes of training set (training set sizes 30, 100, 300, 600, 700 and 900; y-axis: classification accuracy; curves: Concept VSM vs. Tradition VSM).
In experiment 2, we selected training sets of 30, 100, 300, 600, 700 and 900 Emails from the training set of experiment 1, keeping the test set the same as in experiment 1. Here the evaluation index is the macro-averaged F1. It can be seen from Figure 2 that the scale of the training set has an impact on performance. As the size of the training set increases, the classification accuracy of both the traditional VSM and the concept VSM increases: when the training set is small it is difficult to select good features to represent the Email vector, so the classifier does not perform well, but as the number of training samples grows, better features can be chosen to represent the Email vector.

5. Conclusion and future work


This paper presents an approach to feature selection. In our approach, based on WordNet, we describe a text Email by establishing a concept vector space model: we first extract the high-level category information during the training process by replacing terms with synonym sets in WordNet and considering the hypernymy-hyponymy relations between synonym sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; in the end, we determine the type of a text Email using a simple vector classification method. We carried out a series of experiments to compare our approach with the term-based VSM approach. The results show that our approach can improve the accuracy of text Email classification, especially when the training set is small.

Our future research is to use the concept vector obtained by the method proposed in this paper for hierarchical classification. At the same time, we will further attempt to improve the classification accuracy and reduce the dimension of the feature vector when the training set is large.

6. References

[1] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML '97, pp. 143-151.
[2] R. Basili, A. Moschitti, M. Pazienza. A text classifier based on linguistic processing. In Proceedings of IJCAI '99, Machine Learning for Information Filtering.
[3] Manu Aery, Sharma Chakravarthy. eMailSift: Mining-based approaches to Email classification. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 580-581.
[4] J. Clark, I. Koprinska, J. Poon. A neural network based approach to automated e-mail classification. In Proc. of the IEEE/WIC Int'l Conf. on Web Intelligence, 2003, pp. 702-705.
[5] Irena Koprinska, Felix Trieu, Josiah Poon and James Clark. E-mail classification by decision forests. In Proc. of the 8th Australasian Document Computing Symposium (ADCS), 2003.
[6] T.A. Meyer, B. Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In First Conference on Email and Anti-Spam (CEAS), 2004, pp. 1-8.
[7] L. Baoli, L. Qin, Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 2004.
[8] Andrew Farrugia. Investigation of Support Vector Machines for Email Classification. 2004.
[9] C. Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998.
