Chao Zeng
Institute of Computer Applications
East China Normal University
Shanghai, China
czeng@ica.stc.sh.cn
Junzhong Gu
Institute of Computer Applications
East China Normal University
Shanghai, China
jzgu@cs.ecnu.edu.cn
Abstract
Content-based Email classification methods generally
use the Vector Space Model (VSM). The model is
constructed from the frequency of each independent
word appearing in the Email content. A frequency-based
VSM does not take the context of a word into account,
so the feature vectors cannot accurately represent the
Email content, which results in inaccurate
classification. This paper presents a new approach to
Email classification based on a concept Vector Space
Model using WordNet. In our approach, we use WordNet
to extract high-level information about categories
during training by replacing terms in the feature
vector with synonymy sets and considering the
hypernymy-hyponymy relations between synonymy sets.
We design an Email classification system based on the
concept VSM and carry out a series of experiments. The
results show that our approach can improve the accuracy
of Email classification, especially when the training
set is small.
1. Introduction
Email has become an efficient and popular
communication mechanism as the number of Internet
users increases. However, the existence and spread of
Email have resulted in greater interference to us while
we enjoy its convenience. Email often reflects current
hot issues in society and public sentiment, and at the
same time the proliferation of Email affects how people
collate and acquire information. If Email can be
automatically classified,
Zhao Lu
Institute of Computer Applications
East China Normal University
Shanghai, China
zlu@cs.ecnu.edu.cn
2. Related work
Typically, the process of Email classification has
the following three main steps: pre-processing, feature
selection and classifier construction.
Pre-processing is composed of word segmentation,
feature representation and feature
[Figure: Incoming Email -> Pre-processing -> Concept list formulation -> Weight revision -> Classifier -> Email folders]
3.1. Training
The process of training is as follows:
1. Pre-processing
2. Concept list formulation
3. Weight revision
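As a minimal sketch of the concept list formulation step, the fragment below maps each word of a pre-processed Email to a concept (synonymy set) and counts concept occurrences. The `TOY_SYNSETS` and `TOY_HYPERNYMS` tables are hypothetical stand-ins for WordNet lookups, and generalizing a concept one level up to its hypernym is an assumption made for illustration, not necessarily the exact rule used in this paper.

```python
from collections import Counter

# Hypothetical stand-ins for WordNet lookups (not real WordNet data).
TOY_SYNSETS = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "truck": "truck.n.01",
    "meeting": "meeting.n.01",
}
TOY_HYPERNYMS = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
}


def concept_list(words, use_hypernyms=False):
    """Count concepts for a tokenized Email; unknown words stay as plain terms."""
    counts = Counter()
    for w in words:
        concept = TOY_SYNSETS.get(w, w)  # synonyms collapse to one concept
        if use_hypernyms:
            # Assumption: generalize one level up when a hypernym is known.
            concept = TOY_HYPERNYMS.get(concept, concept)
        counts[concept] += 1
    return counts


email = ["car", "automobile", "truck", "meeting", "budget"]
print(concept_list(email))
print(concept_list(email, use_hypernyms=True))
```

In this toy example "car" and "automobile" fall into the same concept, which is the effect a purely frequency-based VSM cannot capture.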
$V_i = V_i + N(w_i : C_k)$, and change the [...] $w_i$ in Email of type [...]
5) [...]
6) [...] $\sum_{j=1}^{n_k}$ [...] type.
Count the number of occurrences of $w_i$ in the whole
training set as
$$N(w_i) = \sum_{k=1}^{n} N(w_i : C_k),$$
where $n$ is the [...] Email of type [...]
2) If $w_i$ does not appear in [...] then [...]
4) For word $w_i$. If the [...] in which [...]
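$$\mathrm{VectorCValue}(V_i) = \mathrm{VectorCValue}(V_i) \cdot \mathrm{IDF}(V_i) \cdot \mathrm{FICF}(V_i)$$
4)
$$\mathrm{VectorCValue}(V_i) = \frac{\mathrm{VectorCValue}(V_i)}{\sqrt{\sum_{j=1}^{n} (\mathrm{VectorCValue}(V_j))^2}}$$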
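The weight revision and normalization can be sketched as follows. The IDF and FICF factors are taken as precomputed inputs because their definitions are not reproduced here, and the use of the Euclidean (L2) norm for normalization is an assumption; the function name `revise_weights` and all sample values are hypothetical.

```python
import math


def revise_weights(cvalue, idf, ficf):
    """Multiply each concept weight by its IDF and FICF, then L2-normalize."""
    revised = {c: v * idf[c] * ficf[c] for c, v in cvalue.items()}
    norm = math.sqrt(sum(v * v for v in revised.values()))
    if norm == 0.0:
        return revised  # all-zero vector: nothing to normalize
    return {c: v / norm for c, v in revised.items()}


# Hypothetical inputs: raw concept weights plus precomputed IDF and FICF factors.
cvalue = {"car.n.01": 3.0, "meeting.n.01": 4.0}
idf = {"car.n.01": 1.0, "meeting.n.01": 1.0}
ficf = {"car.n.01": 1.0, "meeting.n.01": 1.0}
weights = revise_weights(cvalue, idf, ficf)
print(weights)
```

After normalization the squared weights sum to one, so vectors built from long and short Emails become comparable.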
[...] $w_i$ by search in WordNet.
3) [...] $V_{ij}$ [...] $V_i$ [...]
FICF$(V_i)$ = [...]
2) [...] $C_k$ in which $V_i$ appears.
[...] WordNet, [...]
[...] $N(V_i)$ [...], in which $N$ is the [...]
3.2. Classification
The process of classification is as follows:
1. Pre-process the test Email, construct its
concept list, and then obtain the VectorId and
VectorCValue of the Email.
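2. [...]
3. [...] space of the Email $X = (x_1, x_2, \ldots, x_m)$
and the concept vector space of each type
$Y = (y_1, y_2, \ldots, y_m)$:
$$\mathrm{Fsim}(X, Y) = \frac{\sum_{k=1}^{m} (x_k \cdot y_k)}{\sqrt{\sum_{k=1}^{m} x_k^2} \cdot \sqrt{\sum_{k=1}^{m} y_k^2}}$$
4. [...]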
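The similarity computation can be sketched as below, reading Fsim as the cosine similarity between the Email vector and each type vector. Assigning the Email to the type with the highest similarity is an assumption about the final step; the names `fsim` and `classify` and the sample vectors are hypothetical.

```python
import math


def fsim(x, y):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_x * norm_y)


def classify(email_vec, type_vecs):
    """Return the type whose concept vector is most similar to the Email vector."""
    return max(type_vecs, key=lambda t: fsim(email_vec, type_vecs[t]))


# Hypothetical vectors over the same m = 3 concepts.
type_vecs = {"work": [1.0, 0.0, 1.0], "travel": [0.0, 1.0, 0.0]}
print(classify([0.9, 0.1, 0.8], type_vecs))
```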
$$P(\mathrm{Precision}) = \frac{N_{correct}}{N_{actual}}$$
$$R(\mathrm{Recall}) = \frac{N_{correct}}{N_{total}}$$
$$F_1 = \frac{P \cdot R \cdot 2}{P + R}$$
The average F1 score is used as the evaluation
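
The three measures can be computed as in the sketch below. It assumes that N_correct is the number of Emails correctly assigned to a type, N_actual the number of Emails the classifier assigned to that type, and N_total the number of Emails that truly belong to it; the unweighted mean used for the average F1 is also an assumption. All function names and counts are hypothetical.

```python
def precision_recall_f1(n_correct, n_actual, n_total):
    """Return (P, R, F1) for one Email type from three counts."""
    p = n_correct / n_actual if n_actual else 0.0
    r = n_correct / n_total if n_total else 0.0
    f1 = (p * r * 2) / (p + r) if (p + r) else 0.0
    return p, r, f1


def average_f1(per_type_counts):
    """Unweighted mean of the per-type F1 scores (an assumed averaging rule)."""
    scores = [precision_recall_f1(*c)[2] for c in per_type_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: (n_correct, n_actual, n_total) per type.
counts = {"work": (8, 10, 16), "travel": (5, 5, 5)}
p, r, f1 = precision_recall_f1(*counts["work"])
print(p, r, round(f1, 3))
print(round(average_f1(counts), 3))
```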
4. Experiments
[Table: columns Type, Precision, Recall, F1]
[Figure: Classification Accuracy; legend: Tradition VSM]
6. References