
Pattern Recognition 45 (2012) 4117–4128


An evidential reasoning based classification algorithm and its application for face recognition with class noise

Xiaodong Wang a,b, F. Liu a,b,*, L.C. Jiao b, Zhiguo Zhou a,b, Jingjing Yu a,b, Bing Li b, Jianrui Chen b, Jiao Wu a,b, Fanhua Shang b

a School of Computer Science and Technology, Xidian University, Xi'an 710071, PR China
b Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, PR China

* Corresponding author at: School of Computer Science and Technology, Xidian University, Xi'an 710071, PR China. Tel.: +86 29 88204310; fax: +86 29 88201023. E-mail addresses: xiao_dong_wang1975@163.com (X. Wang), lf204310@163.com (F. Liu).

Article history: Received 2 February 2012; received in revised form 26 May 2012; accepted 7 June 2012; available online 18 June 2012.

Abstract

For classification problems, in practice, real-world data may suffer from two types of noise, attribute noise and class noise. Removing as much of their adverse effects as possible is the key to improving recognition performance. In this paper, a formalism algorithm is proposed for classification problems with class noise, which are more challenging than those with attribute noise. The proposed formalism algorithm is based on evidential reasoning theory, which is a powerful tool for dealing with uncertain information in multiple attribute decision analysis and many other areas. Thus, it may be a more effective alternative for handling noisy label information. A specific algorithm, the Evidential Reasoning based Classification algorithm (ERC), is then derived to recognize human faces under class noise conditions. The proposed ERC algorithm is extensively evaluated on five publicly available face databases with class noise and yields good performance.

Keywords: Face recognition; Class noise; Evidential reasoning; Linear regression classification (LRC); Sparse representation-based classification (SRC)

1. Introduction

Pattern recognition/classification is an important topic in machine learning (or artificial intelligence). It assigns the input data into one of a given number of categories by an algorithm, and the algorithm is obtained by learning from a training set of instances. Classification is applied in many fields, such as speech recognition, handwriting recognition, document classification, internet search engines, medical image analysis, optical character recognition, and so on. For classification problems, the training set may suffer from two types of noise, attribute noise and class noise [1], which usually decrease classification accuracy. If some training samples are not correctly labeled, the training data contain class noise. Compared to attribute noise, classification under class noise conditions is usually a more challenging problem. The sources of class noise are very diverse, such as subjectivity, data-entry error, or inadequacy of the information [2]. This paper focuses on face classification problems in the presence of class noise.
Many algorithms have been proposed to solve the class noise problem. Some of them are summarized as follows.

- Nearest neighbor algorithm. Nearest neighbor based algorithms [3-7] choose effective subsets of the training sets instead of the original training sets according to some rules. These algorithms obtain better accuracy and efficiency since many noisy training data and outliers are cleared from the original training sets.
- Decision tree algorithm. Many decision tree algorithms [8-12] are applied to eliminate class noise. Some of them avoid overtraining by adjusting model parameters and building smaller trees (e.g. [9,10]), and others eliminate the training data with class noise by pruning methods or ensemble learning (e.g. [8,11] and [12]).
- Probabilistic algorithm. Probabilistic methods [13-17] also help many algorithms to tolerate class noise. High breakdown estimation is used to eliminate the influence of outliers in [13,14]. Modeling methods for the class noise are proposed in [15-17], which detect inconsistencies in the labels of the training samples.
- Ensemble learning algorithm. Ensemble learning algorithms [2,18,19] can also improve the classification accuracy under class noise conditions. Bagging [18] tolerates outliers by bootstrapping training subsets and combining the classification results from the different subsets. In [2], ensemble filters are applied to eliminate mislabeled training samples. [19] uses reduced reward-punishment editing to identify and remove the outliers, which form different training subsets with the changed parameters. Then, the rotation forest algorithm combines these subsets and yields better classification results.
- Other algorithm. There are also many other methods to deal with class noise problems, such as the method based on a neural network used in [20], which removed noisy samples by using a neural network as a filtering mechanism, and the method based on mutual information proposed in [21], which calculated the mutual information for each training sample and removed the samples with larger mutual information, and so on.

All of the above listed algorithms can be roughly divided into two groups: the first group aims to clear the outliers, and the second group tries to construct class-noise-proof models. According to their groups, these algorithms are summarized briefly in Table 1.

Table 1
The related algorithms.

                               First group    Second group
Nearest neighbor algorithm     [3-7]
Decision tree algorithm        [9,10]         [8,11,12]
Probabilistic algorithm                       [13-17]
Ensemble learning algorithm    [2]            [18,19]
Other algorithm                [20,21]
As a powerful framework for uncertain reasoning, Dempster–Shafer theory (D-S theory [22,23]) has been widely applied to pattern recognition fields, such as classification [24-33], clustering [34-37], and so on. Based on D-S theory, Yang et al. [38,39] proposed an evidential reasoning algorithm (ER). Then Wang et al. [40] developed the analytical evidential reasoning algorithm (ER analytical algorithm). Compared with the traditional evidence combination method of D-S theory, ER (and the ER analytical algorithm) can greatly reduce the computation cost and prevent irrational conclusions when the evidence conflicts during combination. The main contribution of this paper is to propose a novel evidential reasoning based classification algorithm, which can overcome the effects of class noise in classification problems. Specifically, several novel aspects of this paper are summarized as follows:
First, a formalism algorithm based on ER theory is proposed for classification problems. Because it distinguishes the training samples according to their different importance, the formalism algorithm may address class noise better. Utilizing the properties of ER, it has many other potential benefits. For example, it may be applied to data with various structures (e.g. manifold or sphere structures), diverse attributes (e.g. quantitative, qualitative or vague attributes), and different labels (e.g. soft or crisp labels). In addition, it may take advantage of additional prior information, and so on. Furthermore, under the framework of the formalism classification algorithm, a specific algorithm (named the Evidential Reasoning based Classification algorithm, ERC) is proposed to recognize human faces with class noise. Numerical experiments show that ERC has good performance for solving such problems.
The remainder of this paper is organized as follows. Section 2 introduces a formalism classification algorithm based on the ER analytical algorithm and analyzes its potential benefits. In Section 3, the specific algorithm ERC is described in detail. Several numerical experiments are presented in Section 4; they are used to evaluate the performance of ERC in handling class noise. Finally, the paper is concluded in Section 5.

2. A formalism classification algorithm

In this section, we first provide some background on the ER analytical algorithm, and then introduce a formalism classification algorithm.

2.1. ER analytical algorithm

Now, we briefly describe some basic concepts and conclusions of the ER analytical algorithm [38-40]. They are given to cater for classification problems. Let Ω = {C_1, C_2, ..., C_K} be a collectively exhaustive and mutually exclusive set of hypotheses; then Ω is called the frame of discernment. The nonnegative vector β = (β(1), ..., β(K))^T is called a belief degree vector (BDV) if Σ_{i=1}^{K} β(i) ≤ 1, where β(i) ≜ β(C_i) is the belief degree of the hypothesis C_i. If Σ_{i=1}^{K} β(i) = 1, the BDV β is complete; otherwise, if Σ_{i=1}^{K} β(i) < 1, the BDV β is incomplete. For classification problems, let x be a sample and let Ω = {C_1, C_2, ..., C_K} correspond to the K classes respectively; β(i) then represents the belief degree of the hypothesis C_i: "x belongs to the i-th class". So, a BDV can be interpreted as a soft label. If the BDV is incomplete, β(Ω) = 1 - Σ_{i=1}^{K} β(i) is the belief degree assigned to Ω, which corresponds to the hypothesis "x does not belong to any class".
Several BDVs corresponding to the same sample x form a belief rule base (BRB). Let {β^1, ..., β^N} be a BRB corresponding to the sample x. The final conclusion can be combined by the ER analytical algorithm [40,41]:
m = \left[\sum_{s=1}^{K}\prod_{i=1}^{N}\left(w_i\,\beta^i(s) + 1 - w_i\sum_{j=1}^{K}\beta^i(j)\right) - (K-1)\prod_{i=1}^{N}\left(1 - w_i\sum_{j=1}^{K}\beta^i(j)\right)\right]^{-1}    (1)

\beta(k) = \frac{m\left[\prod_{i=1}^{N}\left(w_i\,\beta^i(k) + 1 - w_i\sum_{j=1}^{K}\beta^i(j)\right) - \prod_{i=1}^{N}\left(1 - w_i\sum_{j=1}^{K}\beta^i(j)\right)\right]}{1 - m\prod_{i=1}^{N}(1 - w_i)}    (2)

In Eqs. (1) and (2), the activation weights are calculated [41,42] by

w_i = \frac{\theta_i}{\sum_{j=1}^{N}\theta_j}, \quad i = 1, 2, \ldots, N,    (3)

where the rule weights θ_i (i = 1, 2, ..., N) reflect the importance of β^i (i = 1, 2, ..., N) in the combination steps.
Now, we illustrate the behavior of the ER analytical algorithm by several examples. Let Ω = {C_1, C_2}, and let β^1, β^2 be two BDVs such that β^1(1) = 0.6, β^1(2) = 0.4 and β^2(1) = 0.6, β^2(2) = 0.4. Their combination result according to Eq. (2) with activation weights w_1 = w_2 = 0.5 is β(1) = 0.6190, β(2) = 0.3810.
In the above example, if w_1, w_2 are left unchanged and β^1, β^2 satisfy β^1(1) = 0.6, β^1(2) = 0.4 and β^2(1) = 0.4, β^2(2) = 0.6, the combination result becomes β(1) = 0.5, β(2) = 0.5.
So, if both β^1 and β^2 tend to support the same hypothesis, the combination result supports this hypothesis even more strongly. Otherwise, if β^1 and β^2 tend to support different hypotheses, the combination result gives an unclear conclusion. However, asymmetric activation weights can improve the unclear result. If w_1 = 0.8, w_2 = 0.2 and β^1, β^2 satisfy β^1(1) = 0.6, β^1(2) = 0.4 and β^2(1) = 0.4, β^2(2) = 0.6, the combination result becomes β(1) = 0.5793, β(2) = 0.4207.
This means that the BDV β^1 with a bigger weight plays a more important role in ER.
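To make the combination rule concrete, the following Python sketch (our own illustration, not code from the paper) implements Eqs. (1)-(3) with NumPy and reproduces the three worked examples above; the function name er_combine and the array layout are our assumptions.

import numpy as np

def er_combine(bdvs, weights):
    # Analytical ER combination of N belief degree vectors (Eqs. (1)-(2)).
    # bdvs    : (N, K) array, row i is the BDV beta^i
    # weights : (N,) array of activation weights w_i (Eq. (3) already applied)
    bdvs = np.asarray(bdvs, dtype=float)
    w = np.asarray(weights, dtype=float)
    K = bdvs.shape[1]
    residual = 1.0 - w * bdvs.sum(axis=1)                              # 1 - w_i * sum_j beta^i(j), one term per rule
    support = np.prod(w[:, None] * bdvs + residual[:, None], axis=0)   # one product per hypothesis k
    m = 1.0 / (support.sum() - (K - 1) * np.prod(residual))            # Eq. (1)
    return m * (support - np.prod(residual)) / (1.0 - m * np.prod(1.0 - w))  # Eq. (2)

print(er_combine([[0.6, 0.4], [0.6, 0.4]], [0.5, 0.5]))   # approx. [0.6190, 0.3810]
print(er_combine([[0.6, 0.4], [0.4, 0.6]], [0.5, 0.5]))   # [0.5, 0.5]
print(er_combine([[0.6, 0.4], [0.4, 0.6]], [0.8, 0.2]))   # approx. [0.5793, 0.4207]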


2.2. A formalism classification algorithm

Based on ER, a formalism classification algorithm is proposed as follows.

The formalism classification algorithm
Step 1: Generate a BDV for each training sample according to the information from the training sample itself and the other training samples.
Step 2: For a test sample, compute an activation weight for each BDV by exploiting the data structure.
Step 3: Combine all BDVs with the activation weights by Eqs. (1)-(3) and make a decision based on the combination result.
When the formalism classification algorithm is applied to classification, two crucial factors affect the results: the BDVs and the activation weights. The methods to form them can be chosen flexibly according to the characteristics of the data, and different methods lead to different specific algorithms. For example, in the next section ERC applies the linear regression classification algorithm (LRC [43]) and the sparse representation based classification algorithm (SRC [44]) to generate the BDVs and activation weights, because these two algorithms are well suited to face databases. There is a one-to-one correspondence between the BDVs and the training samples. In the training phase, the BDVs are generated and fixed. The function of the BDVs is to distinguish the importance of the training samples; this will be explained with ERC in the next section. Since the BDVs are fixed, the classification result of a test sample depends entirely on the activation weights. The activation weights reflect similarities that are closely correlated with the structure of the data. If larger activation weights are assigned to the training data in the same class as the test sample, the correct classification result will be obtained. As an example, we propose a specific algorithm, ERC, for face recognition that is robust against label noise. More details about ERC will be discussed in the next section.
The formalism classification algorithm also has many other potential benefits.
Firstly, this algorithm may be applied to classify a test sample with prior information if it is available. For example, it may be learned in advance that a test sample belongs to some classes with larger probability (or must not belong to some classes). In our method, this prior information can be used in making the decision by providing an additional BDV with a proper activation weight, or by adjusting only the activation weights corresponding to each training sample. If this prior information comes from other classifiers, our algorithm forms an ensemble learning algorithm.
Secondly, this algorithm may be applied to handle different kinds of attributes, e.g. quantitative, qualitative and vague attributes. Many methods to deal with different kinds of attributes have been developed with ER in multiple attribute decision analysis [38-42,45-52], and they may be introduced into our algorithm.
Thirdly, this algorithm may learn from training samples with soft labels and crisp labels, or even unreliable training samples. Both soft labels and crisp labels can easily be recast as BDVs, and the uncertainty of the i-th training sample is included in β^i(Ω). If a test sample obtains a combined BDV such that β(Ω) is larger than all other belief degrees, the algorithm can treat it as an invalid test image and refuse to classify it.

3. ERC algorithm

With increasingly diverse face data sources, e.g. the internet or surveillance video, class noise is unavoidable. Based on the formalism classification algorithm, we develop in this section a specific algorithm, ERC, for face recognition problems with class noise; it demonstrates the benefits of the proposed formalism classification algorithm well. Let y be a test sample and let the matrix X consist of the training samples. Let X = [X_1, X_2, ..., X_K], and let X_k (k = 1, 2, ..., K) be a submatrix of X with the training samples from the k-th class as its column vectors. Suppose that X_k contains n_k training samples and N = Σ_{k=1}^{K} n_k. ERC is introduced in detail in Sections 3.1-3.3.
3.1. Generate BDV

Let x_i be a training sample which belongs to the j-th class. According to its class label, a BDV is defined as

\beta(j) = 1 \quad \text{and} \quad \beta(k) = 0 \quad \forall k \in \{1, 2, \ldots, K\} \setminus \{j\}.    (4)

Let n_min = min_k {n_k - 1} and let {x_1^k, x_2^k, ..., x_{n_k}^k} be the column vectors of the matrix X_k (k = 1, 2, ..., K), i.e. the training samples in the k-th class. For k ≠ j, X̂_k denotes the matrix with the n_min vectors in {x_1^k, x_2^k, ..., x_{n_k}^k} nearest to x_i as its columns, and X̂_j denotes the matrix with the n_min vectors in {x_1^j, x_2^j, ..., x_{n_j}^j} \ {x_i} nearest to x_i as its columns. The BDV β^i corresponding to x_i is obtained by the following Algorithm 1.

Algorithm 1.
Step 1.1: Generate the BDV β according to Eq. (4).
Step 1.2: Calculate \hat{a}_k = (\hat{X}_k^T \hat{X}_k)^{-1} \hat{X}_k^T x_i, k = 1, 2, \ldots, K.
Step 1.3: Compute the distance d_k(x_i) = \|x_i - \hat{X}_k \hat{a}_k\|_2, k = 1, 2, \ldots, K.
Step 1.4: Calculate β̃ according to

\tilde{\beta}(k) = \exp\left(-\frac{\gamma\, d_k(x_i)}{\sum_j d_j(x_i)}\right), \quad k = 1, 2, \ldots, K,    (5)

where γ > 0 is a constant.
Step 1.5: Give activation weights w_1 = 1 - ρ for β and w_2 = ρ for β̃, where ρ ∈ (0, 1) is a constant.
Step 1.6: Combine β and β̃ according to formulae (1)-(3) with the activation weights w_1, w_2 and obtain β^i.


The BDV β^i fuses information that comes from both the class label of x_i and the other training samples. So, it can reduce the adverse effects of class noise well; this will be validated by experiments in Section 4.1. Specifically, if β and β̃ indicate the same class for x_i, the combination result β^i will clearly show that class; otherwise, no component of β^i will be significantly greater than the other components. In the latter case, the contribution of β^i to classifying the test sample will be small. In other words, the BDVs distinguish the training samples according to their importance.
The method to generate β^i is similar to many data-cleaning approaches [3-7]. The difference is that the data-cleaning approaches remove the unreliable training samples, whereas our method retains all training samples and generates BDVs. The BDVs represent the belief degrees of the training samples, and therefore they can be considered as soft labels. Compared with the data-cleaning approaches, the BDVs maintain more information about the training samples, which is later combined well by the ER analytical algorithm.
The idea of forming β̃ is inspired by the method in [30]. One of the differences between them is that the proposed method uses the distances between a sample and the class subspaces rather than the distances between samples; the distance between a sample and the class subspaces (LRC [43]) is more suitable for face recognition. The function of the BDV in this paper is similar to the function of the basic probability assignment (BPA) in [30]. Another difference is that the BDVs are fixed, whereas the BPAs change with the test samples during the test phase.
Algorithm 1 involves two parameters, γ and ρ. In Eq. (5), the parameter γ is set to 10. The other parameter, ρ in Step 1.5, will be described in Section 4.
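As an illustration only, the following sketch follows Steps 1.1-1.6 for a single training sample, reusing the er_combine function from the sketch in Section 2.1; the helper names (class_residuals, generate_bdv) and the least-squares call are our assumptions, and Eq. (5) is transcribed as stated in the paper.

def class_residuals(x, class_mats):
    # Steps 1.2-1.3: LRC-style reconstruction residual of x against each class subspace.
    d = np.empty(len(class_mats))
    for k, Xk in enumerate(class_mats):
        a_hat, *_ = np.linalg.lstsq(Xk, x, rcond=None)    # a_k = (Xk^T Xk)^(-1) Xk^T x
        d[k] = np.linalg.norm(x - Xk @ a_hat)
    return d

def generate_bdv(x, label, class_mats, gamma=10.0, rho=0.6):
    # Algorithm 1: fuse the label BDV (Eq. (4)) with the distance-based BDV (Eq. (5)) via ER.
    # class_mats[k] is assumed to hold the n_min training samples of class k nearest to x,
    # with x itself excluded from its own class, as required in Section 3.1.
    K = len(class_mats)
    beta = np.zeros(K)
    beta[label] = 1.0                                     # Eq. (4)
    d = class_residuals(x, class_mats)
    beta_tilde = np.exp(-gamma * d / d.sum())             # Eq. (5), as written in the paper
    return er_combine([beta, beta_tilde], [1.0 - rho, rho])   # Steps 1.5-1.6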


3.2. Generate activation weights

Let X̃_j (j = 1, 2, ..., K) be a matrix consisting of the column vectors x_i, where x_i is a training sample such that j = arg max_k {β^i(k)} (β^i corresponding to x_i is the BDV generated by Algorithm 1). Suppose B = [X̄, I], where X̄ is obtained by normalizing the columns of X to unit length, and I is an identity matrix. For the test sample y, the algorithm to generate the activation weight w_i corresponding to the training sample x_i (i = 1, 2, ..., N) is given as follows.

Algorithm 2.
Step 2.1: Calculate \tilde{a}_k = (\tilde{X}_k^T \tilde{X}_k)^{-1} \tilde{X}_k^T y, k = 1, 2, \ldots, K, and let \tilde{a}_k^i be the component of \tilde{a}_k corresponding to the training sample x_i if x_i is a column vector of \tilde{X}_k.
Step 2.2: Compute the distance d_k(y) = \|y - \tilde{X}_k \tilde{a}_k\|_2 (k = 1, 2, \ldots, K) and let k^* = \arg\min_k \{d_k(y)\}.
Step 2.3: Set \tilde{\theta}_i = \tilde{a}_{k^*}^i if x_i is a column vector of \tilde{X}_{k^*} and \tilde{a}_{k^*}^i > 0, and \tilde{\theta}_i = 0 otherwise, i = 1, 2, \ldots, N.
Step 2.4: Calculate \tilde{w}_i = \tilde{\theta}_i / \sum_{j=1}^{N} \tilde{\theta}_j, i = 1, 2, \ldots, N.
Step 2.5: Solve the ℓ1-minimization problem

\bar{a} = \arg\min_a \|a\|_1 \quad \text{s.t.} \quad y = Ba,    (6)

and let \bar{a}_i be the i-th component of \bar{a} (i.e. \bar{a}_i corresponds to the training sample x_i), i = 1, 2, \ldots, N.
Step 2.6: Set \bar{\theta}_i = \bar{a}_i if \bar{a}_i > 0 and \bar{\theta}_i = 0 otherwise, i = 1, 2, \ldots, N.
Step 2.7: Compute \bar{w}_i = \bar{\theta}_i / \sum_{j=1}^{N} \bar{\theta}_j, i = 1, 2, \ldots, N.
Step 2.8: Give the activation weight w_i (i = 1, 2, \ldots, N) according to

w_i = \lambda \tilde{w}_i + (1 - \lambda) \bar{w}_i,    (7)

where λ ∈ [0, 1] is a constant.
For a test sample y, the activation weight w_i (i = 1, 2, ..., N) includes two pieces of information, i.e. \tilde{w}_i and \bar{w}_i. The former is obtained by LRC, and the latter by SRC. For face recognition problems, both LRC and SRC tend to assign larger coefficients to the training samples which come from the same class as y [43,44], and thus the activation weights corresponding to these training samples are usually larger. The performance of ERC when only \tilde{w}_i (or only \bar{w}_i) is applied as the activation weight will be analyzed in Section 4.3. Since the activation weights must not be negative, \tilde{\theta}_i in Step 2.3 is set to 0 if \tilde{a}_{k^*}^i < 0; similarly, \bar{\theta}_i in Step 2.6 is set to 0 if \bar{a}_i < 0, i = 1, 2, ..., N. In addition, only the first N components of \bar{a} are used although its length is larger than N; in other words, we discard the components of \bar{a} corresponding to I.
3.3. Evidence combination and classification

For a test sample y, the complete ERC algorithm is presented as follows.

Algorithm 3 (ERC algorithm).
Step 3.1: Generate the BDV for each training sample by using Algorithm 1.
Step 3.2: Compute the activation weights for the test sample y by using Algorithm 2.
Step 3.3: Combine these BDVs with the activation weights according to Eqs. (1)-(3) and obtain the new BDV β corresponding to y.
Step 3.4: Assign y to the k-th class if k = arg max_k {β(k)}.

Generally speaking, it is hard for SRC to learn from training samples with soft labels. However, this becomes possible by using the strategy of Algorithm 3. Many other face recognition methods can be extended similarly, e.g. SSM (the sparse subspace method [53]) and NFL (the nearest feature line method [54]). That is another benefit of ERC.
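Putting the pieces together, a hypothetical end-to-end decision step (reusing er_combine, generate_bdv and activation_weights from the sketches above; the bookkeeping of training samples is omitted) could look like this:

def erc_classify(y, bdvs, weights):
    # Algorithm 3, Steps 3.3-3.4: combine all training BDVs with the activation weights of y.
    beta = er_combine(bdvs, weights)       # Eqs. (1)-(3)
    return int(np.argmax(beta))            # index of the predicted class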

3.4. Analysis of computational complexity

Because ERC is based on the sparse representation method, it has a high computation cost, like SRC. In this subsection, we analyze its computational complexity.
Firstly, we analyze the computational complexity of the evidential combination (see formulae (1) and (2)). Let N be the number of BDVs in the BRB and K be the number of components of the BDVs in the BRB. When m is calculated, two multiplication operations are needed to compute w_i β^i(s) and w_i Σ_{j=1}^{K} β^i(j), N multiplication operations are needed to compute Π_{i=1}^{N} (w_i β^i(s) + 1 - w_i Σ_{j=1}^{K} β^i(j)), and N multiplication operations are needed to compute Π_{i=1}^{N} (1 - w_i Σ_{j=1}^{K} β^i(j)). The other operations have only a very small computation cost, so the computation cost for m is O(N). When β(k) (k = 1, 2, ..., K) is calculated, N multiplication operations are needed to compute Π_{i=1}^{N} (1 - w_i). Note that Π_{i=1}^{N} (1 - w_i) needs to be computed only once (for all k = 1, 2, ..., K), and that Π_{i=1}^{N} (w_i β^i(s) + 1 - w_i Σ_{j=1}^{K} β^i(j)) and Π_{i=1}^{N} (1 - w_i Σ_{j=1}^{K} β^i(j)) have already been computed when m is calculated. In addition, two multiplication operations are needed to compute m[Π_{i=1}^{N} (w_i β^i(k) + 1 - w_i Σ_{j=1}^{K} β^i(j)) - Π_{i=1}^{N} (1 - w_i Σ_{j=1}^{K} β^i(j))] and m Π_{i=1}^{N} (1 - w_i) for each k ∈ {1, 2, ..., K}. So, the computation cost for β is 2K + N. Therefore, if N is much larger than K, the total computation cost of the evidential combination is 2K + N + O(N) = O(N); otherwise, if K is much larger than N, the total computation cost of the evidential combination is 2K + N + O(N) = O(K).
Secondly, we analyze the computation cost of calculating β̃(k) (k = 1, 2, ..., K) according to formula (5). Suppose that each class of data includes p training samples and each sample has D features. When â_k is calculated, p^2 D multiplication operations are needed to compute X̂_k^T X̂_k, p^3 multiplication operations are needed to compute (X̂_k^T X̂_k)^{-1}, pD multiplication operations are needed to compute X̂_k^T x_i, and p^2 multiplication operations are needed to compute (X̂_k^T X̂_k)^{-1} X̂_k^T x_i. For face recognition problems, D is usually much larger than p, so the computation cost is O(Kp^2 D) for all â_k (k = 1, 2, ..., K). It takes KpD + D = O(KpD) multiplication operations to compute d_k(x_i) = ||x_i - X̂_k â_k||_2 (k = 1, 2, ..., K), and K multiplication operations and K division operations to compute β̃(k) (k = 1, 2, ..., K) according to formula (5). Therefore, the total computation cost is O(Kp^2 D) + O(KpD) + 2K = O(Kp^2 D) for all β̃(k) (k = 1, 2, ..., K).
According to the above analysis and the fact that no multiplication operations are needed to generate β(j) (j = 1, 2, ..., K) according to formula (4), the total computation cost to compute β^i is O(Kp^2 D) for β̃ plus O(K) for the evidential combination. Note that the computation cost of the evidential combination here is O(K) because only two BDVs are combined to generate β^i. Therefore, the total computation cost to train ERC (Algorithm 1), which is applied to generate all β^i (i = 1, 2, ..., N), is O(NKp^2 D).
Thirdly, we analyze the computation cost of calculating \bar{a} by solving problem (6). Let B be a D × (N + D) matrix, where D is the number of components of the training samples and N is the number of training samples. There are many optimization algorithms to solve this problem. We choose an efficient one among them, LARS [55], which is a greedy algorithm; it can obtain a high-quality solution with few iterations. The entire computation cost of LARS is O(D^3), since the number of columns of B (N + D) is larger than the number of components of the samples (D) (see [55] for more details).
Fourthly, we analyze the computation cost of calculating the activation weights w_i (Algorithm 2). Similar to the analysis above, O(Kp^2 D) multiplication operations are needed to compute all d_k(y) (k = 1, 2, ..., K). According to the analysis in the last paragraph, O(D^3) multiplication operations are needed to compute \bar{a}. For the other variables, N division operations each are needed to compute all \tilde{w}_i and \bar{w}_i (i = 1, 2, ..., N), and 2N multiplication operations are needed to execute formula (7) and generate w_i (i = 1, 2, ..., N). Note that Kp^2 D and D^3 are much larger than N for face recognition. So, the entire computation cost of Algorithm 2 is O(Kp^2 D) + O(D^3).
According to the conclusions above and the fact that N is usually much larger than K for classification problems, the computation cost to test ERC is O(N) + O(Kp^2 D) + O(D^3) = O(Kp^2 D) + O(D^3) for each test sample y. The computation costs to test LRC and SRC are O(Kp^2 D) and O(D^3), respectively. So, the computation cost of ERC is roughly equivalent to the sum of the computation costs of LRC and SRC.

4. Experimental results and discussions

To illustrate the efficiency of the proposed algorithm, ERC is employed to classify five face databases under class noise conditions. The five face databases are AR [56,57], Georgia Tech (GT) [58], JAFFE [59], ORL [60] and Extended Yale B [61,62], respectively.
The AR face database contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Each class contains 26 samples (face pictures) corresponding to different facial variations, illumination, expressions, and facial disguises. Similar to [44], a part of the data set is selected to test ERC in the experiments. This part consists of 100 subjects (50 male subjects and 50 female subjects) and 14 images with only illumination change and expressions for each person. In Fig. 1, the first row shows the images of the first subject in this database.

The GT face database contains images of 50 people, with 15 images for each person. These images show frontal (or tilted) faces with different facial expressions and lighting conditions. In Fig. 1, the second row shows the images of the first subject in this database.
The JAFFE face database contains 213 images of ten female models, with about 21 images for each person. In Fig. 1, the third and fourth rows show the images of the first subject in this database.
The ORL face database contains 40 subjects with ten images per subject. The images were taken against a homogeneous background. For each person, the images show an upright and frontal position with different lighting, facial details (glasses/no glasses) and facial expressions. In Fig. 1, the fifth row shows the images of the first subject in this database.
The Extended Yale B face database contains 2414 frontal face images of 38 subjects under various laboratory-controlled lighting conditions. In Fig. 1, the sixth to ninth rows show the images of the first subject in this database.
In order to reduce the computational cost, we resize the cropped facial images to 32 × 32, convert them to column vectors, and then project them into the PCA subspace by discarding the smallest principal components. For the PCA projection, 99% of the energy is kept in the sense of reconstruction error. Specifically, suppose that f_1, f_2, ..., f_N are length-1024 vectors. Let W_pca = (w_pca^1, w_pca^2, ..., w_pca^1024)^T, where w_pca^i is the eigenvector corresponding to the i-th largest eigenvalue of the matrix (1/N) Σ_{i=1}^{N} (f_i - f̄)(f_i - f̄)^T and f̄ = (1/N) Σ_{i=1}^{N} f_i. Let W_pca^D = (w_pca^1, w_pca^2, ..., w_pca^D)^T be the projection matrix and x_i^D = W_pca^D f_i (i = 1, 2, ..., N). If d = min{D ∈ {1, 2, ..., 1024} | Σ_{i=1}^{N} ||x_i^D||_2^2 > 0.99 Σ_{i=1}^{N} ||f_i||_2^2}, then x_i^d (i = 1, 2, ..., N) are the dimensionality-reduced data by PCA, which will be used to test ERC in the numerical experiments. The other experimental designs and corresponding results are given in the following sections. They illustrate that ERC is more robust against class noise than the competing methods.
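As a rough illustration of this preprocessing (not the authors' code), the following NumPy sketch keeps the smallest number of principal components whose projections retain 99% of the energy; the exact handling of centering and of the threshold follows our reading of the formula above and is therefore an assumption.

import numpy as np

def pca_project_99(F, energy=0.99):
    # F: (N, 1024) matrix of vectorized 32 x 32 face images. Returns the (N, d) projected data.
    Fc = F - F.mean(axis=0)                              # center the data for the covariance eigenvectors
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)    # rows of Vt are eigenvectors, largest eigenvalue first
    Z = F @ Vt.T                                         # project the samples, as in the formula above
    kept = np.cumsum(Z ** 2, axis=1).sum(axis=0)         # energy kept by the first D components, summed over samples
    d = int(np.searchsorted(kept, energy * (F ** 2).sum())) + 1
    return Z[:, :d]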

Fig. 1. The first, second and fifth rows show cropped facial images (32 × 32) of the first subject in the AR database, the Georgia Tech database and the ORL database, respectively. The third and fourth rows show cropped facial images (32 × 32) of the first subject in the JAFFE database. The 64 faces (32 × 32 cropped images) of the first subject of the Extended Yale B database are shown in the sixth to ninth rows.


4.1. The stability of BDV under class noise conditions

In this subsection, we illustrate that the BDV formed by Algorithm 1 can reduce the adverse effects of class noise well. All five face databases are used to test the stability of the BDV under class noise conditions. For each class, about half of the images are randomly chosen as training samples and about 20% class noise is added. Specifically, n_1 denotes the number of training images of each class and n_2 denotes the number of training images with random noisy labels of each class; they are shown in Table 2. Algorithm 1 is employed to generate a BDV for each training image. If β^i(k) is the maximum component of β^i, we determine that the i-th training image belongs to the k-th class. In Algorithm 1, the parameter γ is set to 10 and the parameter ρ is searched over the grid {0.1, 0.2, ..., 0.9}. The best average results of 50 random experiments are shown in Table 3. The original class noise ratio is shown in the first row. The new class noise ratio (under the best ρ) obtained by the BDV and the corresponding ρ are presented in the second and third rows, respectively.
The results listed in Table 3 show that the BDV generated by Algorithm 1 reduces the class noise ratio in general, especially for the JAFFE and Extended Yale B databases.

Table 2
The numbers of training images (n_1) and noisy-label training images (n_2) of each class.

Face database    AR    GT    JAFFE    ORL    Extended Yale B
n_1              7     8     11       5      32
n_2              1     2     2        1      6
4.2. Performance of ERC

The classification performance of ERC on all five face databases is provided, as listed in Table 4. We choose seven algorithms for comparison:

- The method proposed in [30] (denoted by KNNDS), which is a k-nearest neighbor method based on Dempster-Shafer theory. KNNDS has one parameter (i.e. the neighborhood size), which is tuned over the grid {1, 2, ..., 20} in the experiments. The best results are reported.

- RT1, RT2 and RT3 [7], which are class-noise-tolerant algorithms based on the k-nearest neighbor method. There are two neighborhood sizes as parameters for the three methods. All of them are tuned over the grid {1, 2, ..., 20} and the best results are reported in the experiments.
- LRC [43] and SRC [44], which are two face recognition algorithms with good performance. LRC has no parameter. For SRC, the sparse representation is obtained by solving problem (6) in the experiments, so there is also no parameter for SRC.
- Linear support vector machine (linear SVM) [64], which is suitable for face classification because the face data roughly satisfy the linear subspace hypothesis (i.e. the face images from the same person lie roughly on a linear subspace). In the experiments, the C-SVC algorithm with the linear kernel in the libsvm 2.91 tools (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) is used. The regularization parameter C is tuned over the grid {2^-25, 2^-24, ..., 2^25}. The best results are reported.

In the experiments, the number of training images and the class noise ratio are the same as the settings in Section 4.1. The least-angle regression algorithm (LARS [55], i.e. SolveLasso.m in SparseLab [63]) is applied to solve problem (6) for both ERC and SRC. The parameter λ of ERC is searched over the grid {0, 0.1, 0.2, ..., 1}, and the other parameters, i.e. γ and ρ, are the same as the settings in Section 4.1. For the different parameters, the best results are shown. Table 4 shows the average error rate and standard deviation of 50 random experiments and the corresponding best parameters. For the AR, ORL and Extended Yale B databases, ERC obtains a lower error rate than the other seven algorithms. For the GT database, ERC is worse than linear SVM and KNNDS and much better than the other five algorithms. For the JAFFE database, ERC is slightly worse than linear SVM and much better than the other six algorithms.
4.3. Sensitivity to the selection of parameters

Although the classification performance of ERC is influenced by two parameters (i.e. ρ and λ), the following experimental results show that they are easy to tune. (The settings of the experiments are the same as the settings in Section 4.2.) Figs. 2-11 illustrate the evolution of the average error rate and standard deviation of 50 random experiments on the five face databases versus the two parameters.

Table 3
The class noise ratios before and after Algorithm 1 (average and standard deviation over 50 random experiments) and the corresponding best ρ.

Face database     AR              GT              JAFFE           ORL             Extended Yale B
Noise ratio       0.1429          0.25            0.1818          0.2             0.1875
New noise ratio   0.0994±0.0061   0.1778±0.0141   0.0591±0.0194   0.1235±0.0160   0.0376±0.0051
ρ                 0.6             0.7             0.6             0.6             0.6

Table 4
The average error rate and standard deviation obtained by different algorithms.

Face database    AR              GT              JAFFE           ORL             Yale B
ERC              0.2517±0.0206   0.3400±0.0188   0.0307±0.0210   0.1653±0.0297   0.0859±0.0087
ρ                0.6             0.6             0.6             0.6             0.6
λ                0.2             0.1             0.1             0.1             0.3
KNNDS            0.4794±0.0191   0.3396±0.0195   0.0334±0.0205   0.2402±0.0297   0.5070±0.0145
LRC              0.3020±0.0215   0.4058±0.0244   0.1309±0.0398   0.2472±0.0325   0.1096±0.0088
RT1              0.6677±0.0219   0.4647±0.0256   0.0683±0.0322   0.4159±0.0394   0.5931±0.0181
RT2              0.5462±0.0187   0.4882±0.0273   0.0788±0.0402   0.3987±0.0402   0.5768±0.0141
RT3              0.7930±0.0185   0.5397±0.0340   0.0670±0.0321   0.5550±0.0501   0.6602±0.0172
SRC              0.3670±0.0176   0.4223±0.0264   0.1501±0.0419   0.2392±0.0271   0.2413±0.0130
Linear SVM       0.2738±0.0235   0.3075±0.0218   0.0262±0.0232   0.1797±0.0297   0.1709±0.0111


Fig. 2. Evolution of the error rate and standard deviation of ERC on the AR database versus ρ. In the experiments, λ = 0.2; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 5. Evolution of the error rate and standard deviation of ERC on the GT database versus λ. In the experiments, ρ = 0.6; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 3. Evolution of the error rate and standard deviation of ERC on the AR database versus λ. In the experiments, ρ = 0.6; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 4. Evolution of the error rate and standard deviation of ERC on the GT database versus ρ. In the experiments, λ = 0.1; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 6. Evolution of the error rate and standard deviation of ERC on the JAFFE database versus ρ. In the experiments, λ = 0.1; the best error rate and standard deviation of the other algorithms are also shown for comparison.

When one parameter changes, the other parameter is set to the best value in Table 4. For ease of comparison, the best error rate and standard deviation of the other algorithms are shown as straight lines in the figures. As can be seen from the experimental results, although ERC is sensitive to ρ, the best ρ is near 0.6 for all databases. When ρ = 0.6, ERC is stable for a wide range of values of λ.
Next, we analyze the performance of ERC when only one of SRC and LRC is applied to generate the activation weights. When the parameters are set to ρ = 0.6 and λ = 0, only SRC is applied to generate the activation weights. For the ORL database, ERC is better than all of the other algorithms. For the JAFFE database, ERC is slightly worse than linear SVM and better than the other six algorithms. For the AR database, ERC is slightly worse than LRC and linear SVM, and better than the other five algorithms. For the Extended Yale B database, ERC is slightly worse than LRC and better than the other six algorithms. For the GT database, ERC is slightly worse than KNNDS and linear SVM, and better than the other five algorithms. It is clear that ERC is much better than SRC on all databases.



Fig. 7. Evolution of the error rate and standard deviation of ERC on the JAFFE database versus λ. In the experiments, ρ = 0.6; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 10. Evolution of the error rate and standard deviation of ERC on the Extended Yale B database versus ρ. In the experiments, λ = 0.3; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 8. Evolution of the error rate and standard deviation of ERC on the ORL database versus ρ. In the experiments, λ = 0.1; the best error rate and standard deviation of the other algorithms are also shown for comparison.


Fig. 11. Evolution of the error rate and standard deviation of ERC on the Extended Yale B database versus λ. In the experiments, ρ = 0.6; the best error rate and standard deviation of the other algorithms are also shown for comparison.

When the parameters are set to ρ = 0.6 and λ = 1, only LRC is applied to generate the activation weights. For the Extended Yale B database, ERC performs better than the other algorithms. For the AR and ORL databases, ERC is slightly worse than linear SVM and better than the other five algorithms. For the GT and JAFFE databases, ERC is slightly worse than linear SVM and KNNDS, and better than the other five algorithms. ERC is better than LRC for all databases under these conditions.
Since the classification performance of ERC depends heavily on SRC and LRC, ERC can obtain higher precision if a more effective technique to form the activation weights is found.

Fig. 9. Evolution of the error rate and standard deviation of ERC on the ORL database versus λ. In the experiments, ρ = 0.6; the best error rate and standard deviation of the other algorithms are also shown for comparison.

4.4. The performance for diverse contamination rates

In this subsection, we test the performance of ERC as the contamination rate changes. The experiments are split into two groups. In the first group, the number of training images is kept unchanged and the number of training images with noisy labels increases gradually. The different numbers of training images with noisy labels are shown in Fig. 12. The numbers of training images are 8, 8, 12, 6 and 32 for the AR database, GT database, JAFFE database, ORL database and Extended Yale B database, respectively.


Fig. 12. Evolution of the error rate and standard deviation of ERC versus the number of training images with noisy labels. In the experiments, the number of training images is kept unchanged: there are 8, 8, 12, 6 and 32 training images for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively.

The other settings of the experiments are the same as the settings in Section 4.2. The average error rates and standard deviations of 20 random experiments on the five face databases are given in Fig. 12. For the AR, ORL and Extended Yale B databases, ERC obtains the best results for all contamination rates. For the GT database, ERC is better than RT1, RT2, RT3, LRC and SRC for all noise ratios and worse than linear SVM and KNNDS. For the JAFFE database, ERC is best for the smallest noise ratio, but ERC degenerates seriously as the contamination rate increases. We consider that the large class noise seriously affects LRC and SRC, which leads to bad BDVs and activation weights and hence to the serious degeneration of ERC.
In the second group, the number of training images with noisy labels is kept unchanged and the number of training images increases gradually. The numbers of training images are shown in Fig. 13. The numbers of training images with noisy labels are 1, 2, 2, 1 and 6 for the AR database, GT database, JAFFE database, ORL database and Extended Yale B database, respectively. The other settings of the experiments are the same as the settings in Section 4.2. The average error rates and standard deviations of 20 random experiments on the five face databases are reported in Fig. 13. For the AR and Extended Yale B databases, ERC has the best performance. For the JAFFE database, ERC has a recognition rate similar to linear SVM and KNNDS, and better performance than the other five algorithms. For the ORL database, ERC has a recognition rate similar to linear SVM, and better performance than the other six algorithms. For the GT database, ERC is worse than linear SVM, similar in recognition accuracy to KNNDS, and better than the other five algorithms.
4.5. Fixed parameters and computational complexity

It is very difficult to select proper parameters for ERC because of the noisy labels. In this subsection, we test ERC under fixed-parameter conditions in terms of classification accuracy and CPU time.
All five face databases are applied to test ERC. In all five experiments, the parameters ρ = 0.6 and λ = 0.2 are fixed for ERC. KNNDS, RT1, RT2, RT3 and linear SVM select their best parameters for the five different databases according to the numerical results in Section 4.2. The other settings of the experiments are the same as the settings in Section 4.2. Table 5 reports the average error rate and standard deviation of ERC and the competing algorithms. The numerical results illustrate that the recognition accuracies of ERC are the best on the AR, ORL and Extended Yale B databases. Although ERC is not optimal on the GT and JAFFE databases, it is still better than LRC, RT1, RT2, RT3 and SRC there.
The theoretical computational complexity of ERC is analyzed in Section 3.4, and the numerical computational complexity is provided below. Tables 6 and 7 list the average training and test CPU times obtained by all algorithms under the best parameters, respectively. The numerical experiments are executed in MATLAB 7.10 on an HP xw9400 workstation with a Six-Core AMD Opteron 2439 SE 2.80 GHz processor and 32 GB memory.



Fig. 13. Evolution of the error rate and standard deviation of ERC versus the number of training images. In the experiments, the number of training images with noisy labels is kept unchanged: there are 1, 2, 2, 1 and 6 training images with noisy labels for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively.

Table 5
The average error rate and standard deviation obtained by different algorithms.

Face database    AR              GT              JAFFE           ORL             Yale B
ERC              0.2527±0.0195   0.3513±0.0245   0.0291±0.0201   0.1779±0.0298   0.0848±0.0090
KNNDS            0.4807±0.0171   0.3427±0.0288   0.0365±0.0223   0.2480±0.0253   0.5054±0.0123
LRC              0.2958±0.0237   0.4048±0.0303   0.1153±0.0320   0.2562±0.0384   0.1078±0.0080
RT1              0.6768±0.0219   0.4766±0.0305   0.0751±0.0427   0.4171±0.0430   0.5889±0.0179
RT2              0.5471±0.0189   0.4869±0.0326   0.0738±0.0304   0.4078±0.0462   0.5677±0.0169
RT3              0.7914±0.0199   0.5389±0.0247   0.0672±0.0350   0.5668±0.0517   0.6639±0.0142
SRC              0.3738±0.0159   0.4255±0.0294   0.1390±0.0349   0.2535±0.0355   0.2365±0.0107
Linear SVM       0.2700±0.0238   0.3123±0.0236   0.0249±0.0181   0.1822±0.0316   0.1702±0.0142

Table 6
The average training CPU time and standard deviation obtained by different algorithms.

Face database    AR              GT              JAFFE           ORL             Yale B
ERC              4.1028±0.0148   2.6719±0.5809   1.0794±0.4571   1.7000±0.6871   74.143±6.2063
RT1              37.746±1.4622   5.2578±0.5314   0.5538±0.2490   1.0944±0.2700   152.31±3.1661
RT2              2.8619±0.0773   0.6897±0.0690   0.0513±0.0071   0.2822±0.0259   5.4703±0.1159
RT3              0.8438±0.0465   0.7900±0.2883   0.0766±0.0253   0.1197±0.0125   4.8945±1.4927
Linear SVM       0.2859±0.0119   0.1059±0.0106   0.0088±0.0101   0.0325±0.0062   0.6977±0.0116

The CPU times of KNNDS, LRC and SRC are not reported in Table 6 because they have no training phase. Although it takes a long CPU time to train ERC, the computation cost can be tolerated. In the test phase, the CPU time of ERC is a little more than the sum of the CPU times consumed by LRC and SRC, which is consistent with the theoretical analysis of computational complexity in Section 3.4.


Table 7
The average test CPU time and standard deviation obtained by different algorithms.

Face database    AR              GT              JAFFE           ORL             Yale B
ERC              13.040±0.3545   5.9425±0.0342   2.0409±0.2936   1.4797±0.0298   189.52±2.1644
KNNDS            0.3475±0.0098   0.5438±0.2047   0.0306±0.0137   0.0541±0.0253   4.9531±0.6369
LRC              1.4369±0.0289   0.4866±0.0055   0.0519±0.0074   0.1391±0.0384   89.866±1.7034
RT1              0.2900±0.0096   0.0966±0.0068   0.0172±0.0047   0.0378±0.0430   2.1516±0.5379
RT2              0.3541±0.0075   0.1563±0.0055   0.0200±0.0071   0.0450±0.0462   3.5078±0.4134
RT3              0.2456±0.0078   0.0856±0.0079   0.0150±0.0044   0.0331±0.0517   0.3906±0.0134
SRC              8.3744±0.0356   4.8069±0.0291   1.9681±0.2936   1.1469±0.0355   51.987±0.9396
Linear SVM       0.5344±0.0063   0.1263±0.0043   0.0084±0.0079   0.0291±0.0316   0.7602±0.0105

5. Conclusion

This paper focuses on applying ER to face classification. This application needs two preparations: (i) a BRB and (ii) the activation weights corresponding to all BDVs in the BRB. Thus, a formalism classification algorithm is proposed based on the following two aspects. For the BRB, a BDV is generated as a soft label for each training sample and indicates the contribution of the sample to the final classification decision. Furthermore, the activation weights are obtained by exploiting the structure information between the testing and training sets. Classification is based on the result of the combination of evidence.
ER has well-developed theories and widespread applications in many areas. So, we think that the formalism classification algorithm may serve as a tie between ER and pattern recognition theories. Many benefits of ER can be applied to develop pattern recognition methods; for example, it can use various prior information, integrate different kinds of attributes (even partially missing attributes), handle unreliable training samples, and so on.
The formalism classification algorithm proposed in this paper might be regarded as a general framework. By changing the methods used to generate the BDVs and activation weights, different specific algorithms can be derived. Therefore, the formalism classification algorithm is very flexible and applicable to a wide range of classification problems. However, to obtain better performance, the BDVs and activation weights must be designed carefully according to the characteristics of different classification problems. So, it is better to choose one kind of data set with the same characteristics to test the performance of the formalism classification algorithm. Then, ERC is proposed to recognize human faces under class noise conditions. To the best of our knowledge, ER is applied to face recognition for the first time in this paper. Because the BDV fuses the information from the training sample itself and the other training samples, ERC can tolerate a certain degree of class noise well. Two excellent face recognition algorithms (i.e. LRC and SRC) are used to form the activation weights; thus, ERC is well suited to classifying face images. With the help of the strategy of ERC, many algorithms (e.g. SRC, SSM and NFL) can be applied to recognize human faces when the training samples have soft labels, which could not be done before. By taking full advantage of LRC, SRC and ER, the numerical experiments in Section 4 show that the proposed ERC algorithm obtains high accuracy in face recognition with class noise.
There are several limitations to the proposed algorithm. ERC has two parameters which need to be adjusted for better performance. Although these two parameters are easily adjusted, as shown in Sections 4.3 and 4.5, we would prefer an algorithm without any parameters. In addition, the proposed algorithm is time-consuming in both the training and testing phases; a more efficient method to find the BDVs and activation weights should be designed to reduce the computational cost in our future work.

Acknowledgment

The authors would like to thank the three reviewers for their comments, which helped us greatly to improve this submission. The authors would also like to thank Dr. Zhijie Zhou for his suggestions, which corrected many errors and greatly improved the quality of this paper. This work was supported in part by the National Natural Science Foundation of China (Nos. 60803097, 60970067, 61003198, 61072106, 60971112, 60971128, 61072108), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the National Science and Technology Ministry of China (Nos. 9140A07011810DZ0107, 9140A07021010DZ0131), and the Fundamental Research Funds for the Central Universities (Nos. JY10000902001, K50510020001, JY10000902045).

References
[1] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artificial Intelligence Review 22 (2004) 177–210.
[2] C. Brodley, M. Freidl, Identifying mislabeled training data, Journal of Artificial Intelligence Research 11 (1999) 131–167.
[3] B. Dasarathy, Noising around the neighbourhood: a new system structure and classification rule for recognition in partially exposed environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1980) 67–71.
[4] G. Gates, The reduced nearest neighbor rule, IEEE Transactions on Information Theory 18 (1972) 431–433.
[5] P. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968) 515–516.
[6] F. Angiulli, Fast condensed nearest neighbor rule, in: International Conference on Machine Learning, 2005, pp. 7–11.
[7] D. Wilson, T. Martinez, Instance pruning techniques, in: International Conference on Machine Learning, 1997, pp. 404–411.
[8] G. John, Robust decision trees: removing outliers from databases, in: Proceedings of the First ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1995, pp. 174–179.
[9] T. Denoeux, M. Bjanger, Induction of decision trees from partially classified data using belief functions, in: Proceedings of SMC, 2000, pp. 2923–2928.
[10] P. Vannoorenbergue, T. Denoeux, Handling uncertain labels in multiclass problems using belief decision trees, in: Proceedings of IPMU, 2002.
[11] J. Mingers, An empirical comparison of pruning methods for decision tree induction, Machine Learning 4 (1989) 227–243.
[12] X. Zhu, X. Wu, Q. Chen, Eliminating class noise in large datasets, in: International Conference on Machine Learning, 2003, pp. 920–927.
[13] D. Hawkins, G. McLachlan, High-breakdown linear discriminant analysis, Journal of the American Statistical Association 92 (1997) 136–143.
[14] S. Bashir, E. Carter, High breakdown mixture discriminant analysis, Journal of Multivariate Analysis 93 (2005) 102–111.
[15] N. Lawrence, B. Schölkopf, Estimating a kernel Fisher discriminant in the presence of label noise, in: International Conference on Machine Learning, 2001, pp. 306–313.
[16] Y. Li, L. Wessels, D. Ridder, M. Reinders, Classification in the presence of class noise using a probabilistic kernel Fisher method, Pattern Recognition 40 (2007) 3349–3357.
[17] C. Bouveyrona, S. Girard, Robust supervised classification with mixture models: learning from data with uncertain labels, Pattern Recognition 42 (2009) 2649–2658.
[18] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
[19] L. Nanni, A. Franco, Reduced reward-punishment editing for building ensembles of classifiers, Expert Systems with Applications 38 (2011) 2395–2400.
[20] X. Zeng, T. Martinez, A noise filtering method using neural networks, in: IEEE International Workshop on Soft Computing Techniques in Instrumentation, Measurement and Related Applications, 2003, pp. 26–31.
[21] I. Guyon, N. Matic, V. Vapnik, Discovering informative patterns and data cleaning, Advances in Knowledge Discovery and Data Mining (1996) 181–203.
[22] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, 1976.
[23] A. Dempster, Upper and lower probabilities induced by a multi-valued mapping, Annals of Mathematical Statistics 38 (1967) 325–339.
[24] T. Denoeux, Z. Younes, F. Abdallah, Representing uncertainty on set-valued variables using belief functions, Artificial Intelligence 174 (2010) 479–499.
[25] E. Côme, L. Oukhellou, T. Denoeux, P. Aknin, Learning from partially supervised data using mixture models and belief functions, Pattern Recognition 42 (2009) 334–348.
[26] T. Denoeux, P. Smets, Classification using belief functions: the relationship between the case-based and model-based approaches, IEEE Transactions on Systems, Man and Cybernetics, Part B 36 (2006) 1395–1406.
[27] T. Denoeux, A neural network classifier based on Dempster–Shafer theory, IEEE Transactions on Systems, Man and Cybernetics, Part A 30 (2000) 131–150.
[28] L. Zouhal, T. Denoeux, An evidence-theoretic k-NN rule with parameter optimization, IEEE Transactions on Systems, Man and Cybernetics, Part C 28 (1998) 263–271.
[29] T. Denoeux, Analysis of evidence-theoretic decision rules for pattern classification, Pattern Recognition 30 (1997) 1095–1107.
[30] T. Denoeux, A k-nearest neighbor classification rule based on Dempster–Shafer theory, IEEE Transactions on Systems, Man and Cybernetics 25 (1995) 804–813.
[31] Y. Bi, J. Guan, D. Bell, The combination of multiple classiers using an
evidential reasoning approach, Articial Intelligence 172 (2008) 17311751.
[32] Y. Bi, S. McClean, T. Anderson, Combining rough decisions for intelligent text
mining using Dempsters rule, Articial Intelligence Review 26 (2006) 191209.
[33] L. Xu, A. Krzyzak, C. Suen, Methods of combining multiple classiers and their
applications to handwriting recognition, IEEE Transactions on Systems, Man
and Cybernetics 22 (1992) 418435.
[34] M. Masson, T. Denoeux, RECM: relational evidential c-means algorithm,
Pattern Recognition Letters 30 (2009) 10151026.
[35] M. Masson, T. Denoeux, ECM: an evidential version of the fuzzy c-means
algorithm, Pattern Recognition 41 (2008) 13841397.
[36] M. Masson, T. Denoeux, Clustering interval-valued data using belief functions, Pattern Recognition Letters 25 (2004) 163171.
[37] T. Denoeux, M. Masson, EVCLUS: evidential clustering of proximity data, IEEE
Transactions on Systems, Man and CyberneticsPart B 34 (2004) 95109.
[38] J. Yang, M. Singh, An Evidential reasoning approach for multiple-attribute
decision making with uncertainty, IEEE Transactions on Systems, Man and
Cybernetics 24 (1994) 118.
[39] J. Yang, D. Xu, On the evidential reasoning algorithm for multiple attribute
decision analysis under uncertainty, IEEE Transactions on Systems, Man and
Cybernetics 32 (2002) 289304.
[40] Y. Wang, J. Yang, D. Xu, Environmental impact assessment using the
evidential reasoning approach, European Journal of Operational Research
174 (2006) 18851913.
[41] Z. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Online updating belief rule based
system for pipeline leak detection under expert intervention, Expert Systems
with Applications 36 (2009) 77007709.
[42] J. Yang, J. Liu, J. Wang, H. Sii, H. Wang, Belief rule-base inference methodology
using the evidential reasoning approach-RIMER, IEEE Transactions on Systems,
Man, and Cybernetics C Part A: Systems and Humans 36 (2006) 266285.

[43] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 2106–2112.
[44] J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 210–227.
[45] J. Yang, Y. Wang, D. Xu, K. Chin, The evidential reasoning approach for MADA under both probabilistic and fuzzy uncertainties, European Journal of Operational Research 171 (2006) 309–343.
[46] J. Yang, P. Sen, A general multi-level evaluation process for hybrid MADM with uncertainty, IEEE Transactions on Systems, Man, and Cybernetics 24 (1994) 1458–1473.
[47] J. Yang, Rule and utility based evidential reasoning approach for multiattribute decision analysis under uncertainties, European Journal of Operational Research 131 (2001) 31–61.
[48] Z. Zhou, C. Hu, J. Yang, D. Xu, M. Chen, D. Zhou, A sequential learning algorithm for online constructing belief-rule-based systems, Expert Systems with Applications 37 (2010) 1790–1799.
[49] J. Zhou, C. Hu, D. Xu, M. Chen, D. Zhou, A model for real-time failure prognosis based on hidden Markov model and belief rule base, European Journal of Operational Research 207 (2010) 269–283.
[50] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, New model for system behavior prediction based on belief rule based systems, Information Sciences 180 (2010) 4843–4846.
[51] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Bayesian reasoning approach based recursive algorithm for online updating belief rule based expert system of pipeline leak detection, Expert Systems with Applications 38 (2011) 3937–3943.
[52] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Online updating belief-rule-base using the RIMER approach, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, doi:http://dx.doi.org/10.1109/TSMCA.2011.2147312.
[53] T. Sakai, Multiple pattern classification by sparse subspace decomposition, arXiv:0907.5321v2.
[54] S. Li, J. Lu, Face recognition using the nearest feature line method, IEEE Transactions on Neural Networks 10 (1999) 439–443.
[55] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2004) 407–499.
[56] A. Martinez, R. Benavente, The AR Face Database, CVC Technical Report 24, 1998.
[57] A. Martinez, A. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 228–233.
[58] Georgia Tech Face Database, <http://www.anefian.com/face_reco.htm>, 2007.
[59] M. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 1357–1362.
[60] F. Samaria, A. Harter, Parameterization of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop Applications of Computer Vision, 1994, pp. 138–142.
[61] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 643–660.
[62] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 684–698.
[63] <http://sparselab.stanford.edu/>.
[64] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.

Xiaodong Wang received the B.S. degree from Harbin Institute of Technology, Harbin, China, in 1998, and the M.S. degree from Inner Mongolia University of Technology, Hohhot, China, in 2007. He is currently working toward the Ph.D. degree in Computer Application Technology at the School of Computer Science and Technology, Xidian University and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi'an, China. His current research interests include convex optimization, compressive sensing and pattern recognition.

Fang Liu (M'07–SM'07) received the B.S. degree in Computer Science and Technology from Xi'an Jiaotong University, Xi'an, China, in 1984, and the M.S. degree in Computer Science and Technology from Xidian University, Xi'an, in 1995. Currently, she is a Professor with the School of Computer Science, Xidian University, Xi'an, China. She is the author or coauthor of five books and more than 80 papers in journals and conferences. Her research interests include signal and image processing, synthetic aperture radar image processing, multiscale geometry analysis, learning theory and algorithms, optimization problems, and data mining.

L.C. Jiao (SM'89) received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1984 and 1990, respectively. He is currently a Distinguished Professor with the School of Electronic Engineering, Xidian University, Xi'an, China. His research interests include signal and image processing, natural computation, and intelligent information processing. He has led approximately 40 important scientific research projects and published more than ten monographs and 100 papers in international journals and conferences. He is the author of three books: Theory of Neural Network Systems (Xi'an, China: Xidian University Press, 1990), Theory and Application on Nonlinear Transformation Functions (Xi'an, China: Xidian University Press, 1992), and Applications and Implementations of Neural Networks (Xi'an, China: Xidian University Press, 1996). He is the author or coauthor of more than 150 scientific papers.
Prof. Jiao is a member of the IEEE Xi'an Section Executive Committee, the Chairman of the Awards and Recognition Committee, and an executive committee member of the Chinese Association of Artificial Intelligence.

Jiao Wu (S'09) received the B.S. degree and the M.S. degree in Applied Mathematics from Shaanxi Normal University, Xi'an, China, in 1999 and 2002, respectively. She is currently working towards the Ph.D. degree in Computer Application Technology at the School of Computer Science and Technology, Xidian University and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi'an, China. Her research interests include image processing, machine learning, statistical learning theory, and algorithms.
