
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 6, JUNE 2010

Novel Maximum-Margin Training Algorithms for Supervised Neural Networks


Oswaldo Ludwig and Urbano Nunes, Senior Member, IEEE
AbstractThis paper proposes three novel training methods, two of them based on the backpropagation approach and a third one based on information theory for multilayer perceptron (MLP) binary classiers. Both backpropagation methods are based on the maximal-margin (MM) principle. The rst one, based on the gradient descent with adaptive learning rate algorithm (GDX) and named maximum-margin GDX (MMGDX), directly increases the margin of the MLP output-layer hyperplane. The proposed method jointly optimizes both MLP layers in a single process, backpropagating the gradient of an MM-based objective function, through the output and hidden layers, in order to create a hidden-layer space that enables a higher margin for the output-layer hyperplane, avoiding the testing of many arbitrary kernels, as occurs in case of support vector machine (SVM) training. The proposed MM-based objective function aims to stretch out the margin to its limit. An objective function based -norm is also proposed in order to take into account the on idea of support vectors, however, overcoming the complexity involved in solving a constrained optimization problem, usually in SVM training. In fact, all the training methods proposed in this paper have time and space complexities ( ) while usual SVM training methods have time complexity ( 3 ) and space complexity ( 2 ), where is the training-data-set size. The second approach, named minimization of interclass interference (MICI), has an objective function inspired on the Fisher discriminant analysis. Such algorithm aims to create an MLP hidden output where the patterns have a desirable statistical distribution. In both training methods, the maximum area under ROC curve (AUC) is applied as stop criterion. The third approach offers a robust training framework able to take the best of each proposed training method. The main idea is to compose a neural model by using neurons extracted from three other neural networks, each one previously trained by MICI, MMGDX, and LevenbergMarquard (LM), respectively. The resulting neural network was named assembled neural network (ASNN). Benchmark data sets of real-world problems have been used in experiments that enable a comparison with other state-of-the-art classiers. The results provide evidence of the effectiveness of our methods regarding accuracy, AUC, and balanced error rate.

Index Terms—Information theory, maximal-margin (MM) principle, multilayer perceptron (MLP), pattern recognition, supervised learning.

Manuscript received April 30, 2009; revised March 12, 2010; accepted March 12, 2010. Date of publication April 19, 2010; date of current version June 03, 2010. This work was supported by the Portuguese Foundation for Science and Technology (FCT) under Grant PTDC/EEA-ACR/72226/2006. The work of O. Ludwig was supported by FCT under Grant SFRH/BD/44163/2008. The authors are with the ISR-Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra Polo II, 3030-290 Coimbra, Portugal (e-mail: oludwig@isr.uc.pt; urbano@isr.uc.pt). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2046423

I. INTRODUCTION

Concerning pattern recognition, there are three main problems to work on: a suitable composition for the training data set, the feature extraction and/or selection algorithm [10], and the classifier. This paper focuses on some developments that our research group has been working on regarding the last problem, using the multilayer perceptron (MLP) as a classifier. Similarly to other more complex neural models, such as simultaneous recurrent neural networks (SRNs) [5], an MLP with one sigmoidal hidden layer and a linear output layer is a universal approximator, because the sigmoidal hidden units of the MLP compose a basis of linearly independent soft functions [4]. Taking into account the simplicity of the MLP, this work adopts this model for pattern recognition applications. Our experiments with benchmark data sets give evidence that an adequate training algorithm can lead the MLP to achieve performance that is better than (or at least similar to) other state-of-the-art approaches, such as support vector machines (SVMs) with nonlinear kernel [8], [17], Bayesian neural networks [11], or novel algorithms based on kernel Fisher discriminant analysis [24].

One of the problems that occur during the training of supervised neural networks (NNs) is called overfitting: the error on the training data set is driven to a small value, but the error is large when new data are presented to the NN. It occurs because the NN memorizes the training examples, i.e., the NN does not learn to generalize to new situations. Unfortunately, despite some theoretical developments proposed in [23], it is difficult to know in practice how large a network should be for a specific application. To deal with this kind of problem, previous studies propose methods for improving generalization based on early stopping criteria, regularization, and maximal-margin (MM) training algorithms.

In the early stopping criterion [20], the available data are divided into three subsets. The first subset is the training data set, which is used for computing the gradient and updating the network weights and biases. The second subset is used as a validation data set, and the third subset is used to evaluate the final accuracy of the NN. The error on the validation data set is monitored during the training process. After some number of iterations, the NN begins to overfit the data and, consequently, the error on the validation data set begins to rise. In order to deal with this problem, when the validation error increases during a specified number of iterations, the algorithm stops the training session and applies the weights and biases found at the minimum of the validation error to the NN model. Our proposed approaches also follow the early stopping criterion; however, we apply the maximum value of the area under the ROC curve (AUC) [2] as the stop criterion for the training session.
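To make the AUC-based stopping rule concrete, the following minimal sketch (our own illustration, not the authors' code) monitors the validation AUC at each epoch and keeps the best parameter snapshot; the functions train_epoch and predict are placeholders for whatever training rule and model are being used, and the AUC is computed through the rank-sum identity.

import numpy as np

def auc_from_scores(scores, labels):
    # Area under the ROC curve via the rank-sum (Mann-Whitney) identity;
    # labels are in {0, 1}, ties are broken arbitrarily.
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def train_with_auc_stop(params, train_epoch, predict, x_val, y_val,
                        max_epochs=500, patience=20):
    # Keep the parameters that give the highest validation AUC and stop after
    # `patience` epochs without improvement (mirrors the stop rule in the text).
    best_auc, best_params, stall = -np.inf, params, 0
    for _ in range(max_epochs):
        params = train_epoch(params)                       # one epoch of any rule
        auc = auc_from_scores(predict(params, x_val), y_val)
        if auc > best_auc:
            best_auc, best_params, stall = auc, params, 0  # new best snapshot
        else:
            stall += 1
            if stall >= patience:
                break
    return best_params, best_auc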



The regularization approach [25] involves modifying the objective function, which is normally the mean squared error (MSE) on the training data set, by adding a term that consists of the average of the sum of squares of the network weights and biases. Using this objective function causes the network to have smaller weights and biases, which forces the network response to be smoother and less likely to overfit. The new algorithms proposed in this paper also adopt a modified objective function in order to improve the classification margin.

MM-based training algorithms for NNs are often inspired by SVM-based training algorithms, such as [6] and [12]. In [6], a decision tree based on linear programming is applied to maximize the margins, while in [12], an MLP is trained layer by layer based on the CARVE algorithm [26]. Motivated by the success of large margin methods in supervised learning, some authors have extended large margin methods to unsupervised learning [27]. Besides the early stopping criterion, our work also explores MM-based training algorithms. However, different from the SVM approach, in this paper the concept of margin has only an indirect relation with support vectors. Actually, in this paper, margin is defined as a function of the distance between each pattern and the output-layer hyperplane. Inspired by SVM-based training algorithms, a simple method that applies the L_p-norm in order to take into account the idea of support vectors is proposed in Section III-B.

The paper is organized as follows. Section II briefly describes the usual gradient descent with momentum term and adaptive learning rate method (GDX). Section III describes the new MMGDX algorithms. In Section IV, another new training method, named minimization of interclass interference (MICI), is described. Section V presents another new training method based on information theory. Experimental results are presented in Section VI and conclusions in Section VII.

II. GRADIENT DESCENT WITH ADAPTIVE LEARNING RATE

Besides some global search methods that have been applied to MLP training, such as genetic algorithms [25], simulated annealing [18], or hybrid methods [9], there are many variations of the backpropagation algorithm, due to different approaches to the gradient-descent algorithm, such as the methods commented on in Table I. The MATLAB neural network toolbox offers several training algorithms; among them we highlight GDX [1], due to its application in this work. Traditionally, the objective function of GDX is the MSE

E = (1/N) sum_{i=1..N} e_i^2,  with e_i = y_i - yhat_i,   (1)

where y_i is the target output, yhat_i is the output estimated by the MLP for the input x_i belonging to the training data set, and e_i is the error. Backpropagation is used to calculate the derivatives of the MSE functional with respect to the weights and biases. Each variable is adjusted according to the gradient descent with momentum term. For each epoch, if the MSE decreases toward the goal, then the learning rate is increased by a given factor. If the MSE increases by more than a given threshold, the learning rate is decreased by a given factor, and the synaptic weight update that increased the MSE is discarded. A sketch of this update rule is given below.
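The adaptive-rate logic described above can be summarized by the following sketch (our own illustration, not the toolbox code); lr_inc, lr_dec, and max_perf_inc play the roles of the increase factor, the decrease factor, and the error-growth threshold, while mse_fn and grad_fn stand for the MSE and its gradient for the network at hand.

import numpy as np

def gdx_step(params, velocity, grad, lr, momentum):
    # One gradient-descent update with momentum term.
    velocity = momentum * velocity - lr * grad
    return params + velocity, velocity

def gdx_epoch(params, velocity, lr, mse_fn, grad_fn,
              lr_inc=1.05, lr_dec=0.7, max_perf_inc=1.04, momentum=0.9):
    # GDX-style rule: accept the step and raise the rate if the MSE improves;
    # if the MSE grows beyond the threshold, discard the step and lower the rate.
    old_mse = mse_fn(params)
    new_params, new_velocity = gdx_step(params, velocity, grad_fn(params), lr, momentum)
    new_mse = mse_fn(new_params)
    if new_mse < old_mse:
        return new_params, new_velocity, lr * lr_inc   # keep step, speed up
    if new_mse > max_perf_inc * old_mse:
        return params, velocity, lr * lr_dec           # undo step, slow down
    return new_params, new_velocity, lr                # keep step, same rate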

TABLE I SOME VARIATIONS OF BACKPROPAGATION ALGORITHM

III. PROPOSED MMGDX

New MM-based training algorithms, where both MLP layers are jointly optimized in a single process, are proposed in this section. In fact, an MM-based objective function is backpropagated through the output and hidden layers in such a way as to create a hidden output especially oriented towards obtaining a larger margin for the output-layer separating hyperplane. This methodology, named MMGDX, is different from other previous approaches, such as [12], where the MLP is trained layer by layer.

The unconstrained optimization problem (2) is applied to an MLP with one sigmoidal hidden layer and a linear output layer, according to the model

h = phi(W_1 x + b_1),   yhat = W_2 h,   (3)

where h is the output vector of the hidden layer, W_l is the synaptic weight matrix of layer l, b_1 is the bias vector of layer 1, x is the input vector, and phi(.) is the sigmoid function. In MMGDX, the output layer of model (3) has zero bias, because after the training session the ROC curve information is taken into account to adjust the classifier threshold, which acts as bias. The separating hyperplane of model (3) is given by

W_2 h = 0,   (4)

where h is a point belonging to the hyperplane. Considering htilde_i as the projection of the point h_i on the separating hyperplane (4) and d_i as the distance between the separating hyperplane (4) and h_i, yields

h_i = htilde_i + d_i (W_2^T / ||W_2||).   (5)


Multiplying both sides of (5) by W_2 yields

W_2 h_i = W_2 htilde_i + d_i (W_2 W_2^T / ||W_2||).   (6)

As htilde_i belongs to hyperplane (4), substituting (4) and the second line of (3) into (6) yields

d_i = yhat_i / ||W_2||.   (7)

As the sigmoid activation function bounds the hidden neuron outputs, the norm of the vector h has its maximum value equal to sqrt(n), where n is the number of hidden neurons. Taking into account that the norm of W_2 / ||W_2|| is one, we can deduce that the distance d_i from (7) is bounded in the interval [-sqrt(n), sqrt(n)], as stated in (8). Therefore, as the target output y_i (where i denotes the training example index) assumes the values -1 or 1, we propose the error function (9) in order to force the MLP to stretch out the value of d_i (in this work defined as the classification margin of example i) to its limit, creating a hidden output space where the distance between patterns of different classes is as large as possible. Different from our approach, the traditional backpropagation methods usually adopt the error (1), which permits an undesirable increase of the output-layer weight norm ||W_2|| in order to achieve the target output without taking into account the distance d_i, as can be inferred from (7).

A. MMGDX Based on MSE

Our first approach to MMGDX, here referred to as MMGDX-A, has an MM objective function based on the MSE

F = (1/N) sum_{i=1..N} e_i^2,   (10)

where N is the total number of training examples and e_i is defined in (9). The weight update is based on the gradient descent with momentum term and adaptive learning rate; therefore, this method was named MMGDX. Backpropagation is used to calculate the derivatives of the objective function (10) with respect to the parameters of both layers, as stated in (11)-(13), where the terms involve the jth row of the matrix W_1, the jth element of the vector W_2, the jth position of the vector b_1, the derivative of the sigmoid function, the activation function of neuron j, and the hidden-layer output vector h_i in response to example i. The weights of each layer are then updated as in (14)-(16), where k is the iteration index, eta is the learning rate, and mu is the momentum term.

In short, each variable is adjusted according to the gradient descent with momentum term. For each epoch, if the MM-based objective function F decreases towards the goal, then the learning rate is increased by a given factor. If F increases by more than a given threshold, the learning rate is decreased by a given factor, and the update of synaptic weights and biases that increased F in the current iteration is undone. During the training, the value of the AUC is calculated at each epoch over the validation data set. If the AUC increases, then a register of the network parameters is updated. Finally, the registered network parameters, which correspond to the highest AUC, are adopted. The training session stops after a given number of AUC checks without AUC improvement. Algorithm 1 details the proposed method.

Algorithm 1: MMGDX
Input:
  training data set;
  validation data set (for AUC calculation);
  number of neurons in the hidden layer;
  number of iterations between each AUC check;
  stop criterion (maximum number of AUC checks without improvement).
Output: network parameters W_1, b_1, W_2.
1: initiate the weights according to the Nguyen-Widrow algorithm;
2: initialize the learning rate and the AUC register;
3: while the stop criterion is not reached do
4:   for each iteration between AUC checks do
5:     compute the gradient of the objective function;
6:     update the weights by means of (11)-(13) and (14)-(16);
7:     propagate the training data through the model (3), obtaining the estimated outputs;
8:     apply the outputs and targets in (9) and (10) in order to check F;
9:     if F decreased then
10:      increase the learning rate;
11:    else
12:      if F increased by more than the given threshold then
13:        decrease the learning rate and undo the update;
14:      end if
15:    end if
16:  end for
17:  propagate the validation data through the model (3), obtaining the estimated outputs;
18:  calculate the AUC using the validation outputs and targets;
19:  if the AUC improved then
20:    update the register of network parameters (W_1, b_1, W_2);
21:    reset the counter of AUC checks without improvement;


22:  else
23:    increment the counter of AUC checks without improvement;
24:  end if
25: end while
26: adopt the registered network parameters and adjust the threshold (i.e., the network bias) by means of the ROC curve information.

Notice that, at each iteration of Algorithm 1, the nonrecursive equations (14)-(16) as well as the recursive equations (10)-(13), which demand a number of operations directly proportional to the total number of training examples, are calculated. Therefore, MMGDX has time complexity O(N). Similarly to other first-order optimization methods, MMGDX does not need to store second-order derivatives to compose a Jacobian matrix; therefore, it has space complexity O(N).
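As a concrete illustration of the quantities involved in Section III-A, the sketch below (our own, not the authors' implementation) evaluates the margins d_i of (7) and an MM objective in the spirit of (9)-(10). The exact form of (9) is not reproduced in the text above, so the error e_i = sqrt(n)*y_i - d_i used here is an assumption, chosen only because it pushes each margin toward its extreme value; a tanh-type sigmoid is also assumed.

import numpy as np

def hidden_outputs(X, W1, b1):
    # Hidden layer of model (3); a tanh-type sigmoid is assumed, so each
    # hidden output lies in [-1, 1].
    return np.tanh(X @ W1.T + b1)               # shape (N, n)

def margins(H, W2):
    # Signed distances d_i = (W2 h_i) / ||W2|| to the hyperplane W2 h = 0, as in (7).
    return H @ W2 / np.linalg.norm(W2)

def mmgdx_a_objective(H, y, W2):
    # ASSUMED error form e_i = sqrt(n) * y_i - d_i, averaged as an MSE (10).
    n = H.shape[1]
    e = np.sqrt(n) * y - margins(H, W2)
    return np.mean(e ** 2)

# toy usage with random parameters and labels in {-1, +1}
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.choice([-1, 1], size=8)
W1, b1, W2 = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=5)
print(mmgdx_a_objective(hidden_outputs(X, W1, b1), y, W2))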

B. MMGDX Based on the L_p-Norm

The second approach to MMGDX, here denoted MMGDX-B, has an objective function based on the L_p-norm

F = ||e||_p = ( sum_{i=1..N} |e_i|^p )^(1/p),   (17)

where ||.||_p is the L_p-norm, e is the error vector, and e_i is defined in (9). The main idea is to calculate the functional focusing especially on the margins of the support vectors, inspired by the SVM soft-margin training algorithm. The L_p-norm is a trick to avoid the constrained optimization problem usual in SVM-like approaches. Notice that larger errors are related to support vectors (i.e., the patterns with small distance from the separating hyperplane); therefore, if the L_p-norm is applied, the larger p is, the larger is the contribution of the larger errors to the objective function F. In fact, if the power p tends to infinity, only the pattern with the smallest distance from the separating hyperplane is considered in the calculation of the objective function F. In short, the L_p-norm enables the implementation of a training algorithm with some similarity to the soft-margin SVM (i.e., with larger importance given to the support-vector margins), applying backpropagation in an unconstrained optimization approach. Backpropagation is used to calculate the derivatives of the objective function (17) with respect to the parameters of both layers, as stated in (18)-(21), whose terms involve the jth row of W_1, the jth element of the vector W_2, the jth position of the vector b_1, the derivative of the sigmoid function, the activation function of neuron j, and the hidden-layer output vector h_i in response to example i. The weight update was described in Section III-A. A dynamic norm is adopted in order to escape from local minima: if the optimization algorithm stops at a local minimum, the adopted norm p is temporarily replaced by a different norm during one iteration; if F is improved, the algorithm restores the adopted norm p. A small numerical illustration of the effect of the norm p is given below.
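The following small illustration (our own) shows the p-norm aggregation of (17) and how increasing p concentrates the objective on the largest errors, i.e., on the patterns closest to the separating hyperplane.

import numpy as np

def lp_norm_objective(e, p):
    # p-norm of the error vector, F = (sum_i |e_i|^p)^(1/p), as in (17).
    return np.sum(np.abs(e) ** p) ** (1.0 / p)

# illustration: as p grows, F approaches the single worst error
e = np.array([0.1, 0.2, 0.9, 1.5])
for p in (2, 6, 20):
    print(p, lp_norm_objective(e, p))   # tends to max(|e_i|) = 1.5 as p grows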

C. MMGDX for Multiclass Problems

As occurs in the case of SVM, the extension from two-class MMGDX to multiclass MMGDX is nontrivial and may be a topic for further study. However, it is possible to decompose the multiclass classification problem into multiple two-class classification problems. Some usual approaches to decompose a multiclass pattern classification problem into two-class problems are one-against-all (OAA), one-against-one (OAO), and P-against-Q (PAQ). Those approaches are popular among researchers in SVM, Adaboost, or decision trees. The OAA modeling scheme was first introduced by Vapnik [22] in the SVM context. For a K-class pattern classification problem, the OAA scheme uses a system of K binary NNs. In order to train the kth NN, the training data set is decomposed into two sets: one containing all the examples of class k, which receive label 1, and another containing all the examples belonging to all other classes, which receive label -1. A decision function for the ensemble output can be the argmax rule (22), i.e., the estimated class is the one whose NN gives the largest likelihood. This architecture has advantages over a single NN for multiclass problems; for instance, each NN can have its own feature space and architecture, since all of them are trained independently. However, there are some disadvantages; namely, the ensemble may leave uncovered regions in the feature space, i.e., regions that are rejected by all NNs as belonging to other classes. Another problem is the training data: when the number of classes is large, the training data for each NN are highly imbalanced, which can lead the NN to totally ignore the minority class.

The OAO modeling scheme, also known as the pairwise method, can avoid the OAA problems. The pairwise method decomposes the K-class pattern classification problem into K(K-1)/2 binary problems, i.e., each NN is trained to discriminate class i from class j, avoiding highly imbalanced training data sets. Notice that each class can receive up to K-1 votes, because there are K-1 NNs that are trained to discriminate class i from each other class. Among many decision functions, we highlight the simple majority vote, which counts the votes for each class based on the output from all NNs. The class with the largest number of votes is the system output. In OAO modeling, the feature space is less likely to have uncovered regions, due to the redundancy in the training of pattern classes. Another advantage of OAO modeling is that it has the capability of incremental class learning, i.e., a new set of NNs can be included and trained in order to represent a new class without affecting the existing NNs. However, the major disadvantage of OAO modeling is the required computational effort when the number of classes is large. Notice that, in this method, it is necessary to train K(K-1)/2 NNs; therefore, both time and space complexities grow in the order O(K^2). Fortunately, it is possible to train all NNs simultaneously on different computers in order to speed up the training. More details on multiclass pattern classification using NNs can be found in [13]. Both decompositions are sketched in the example below.
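A minimal sketch (our own helper names) of the two label decompositions discussed above: OAA builds one binary problem per class, while OAO builds one per class pair using only the examples of that pair.

import numpy as np
from itertools import combinations

def oaa_splits(y):
    # One-against-all: class k gets +1, every other class gets -1.
    return {k: np.where(y == k, 1, -1) for k in np.unique(y)}

def oao_splits(X, y):
    # One-against-one: one binary problem per unordered class pair (i, j),
    # keeping only the examples of those two classes.
    problems = {}
    for i, j in combinations(np.unique(y), 2):
        mask = np.isin(y, (i, j))
        problems[(i, j)] = (X[mask], np.where(y[mask] == i, 1, -1))
    return problems

# toy usage: 3 classes -> 3 OAA problems and K(K-1)/2 = 3 OAO problems
y = np.array([0, 0, 1, 1, 2, 2])
X = np.arange(12).reshape(6, 2)
print(len(oaa_splits(y)), len(oao_splits(X, y)))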


IV. MINIMIZATION OF INTERCLASS INTERFERENCE

The second proposed training method, named MICI, consists in generating a hidden layer where the patterns have a desirable statistical distribution, in such a way as to increase the classifier margin. The proposed method creates a hidden space where the distance between classes is increased, as exemplified in Fig. 1, which illustrates the hidden-layer space of a neural model with three hidden neurons before and after the application of MICI. Specifically, in such a space, the Euclidean distance between the prototypes of each class is increased, and the dispersion of the patterns of each class is decreased. The aim is to maximize the objective function (23), where c_k, with k in {1, 2}, is the prototype of class k, N_k is the number of training patterns that belong to class k, h is the hidden-layer output (feature vector) for an input that belongs to class k, and sigma_k^2 is the variance of the class-k patterns in the hidden space.

Fig. 1. Hidden-layer space with patterns of two classes (blue and red) (a) before and (b) after MICI application.

A. Training the Hidden Layer

The hidden layer of model (3) is trained in order to solve the unconstrained optimization problem

max over (W_1, b_1) of the objective (23),   (24)

where W_1 and b_1 are the parameters of the hidden layer of (3). An iterative solution based on the ascending gradient is adopted. Backpropagation is used to calculate the gradients of the objective function (23) with respect to the hidden-layer parameters, as stated in (25) and (26); the terms appearing in these gradients are defined in (27)-(34).

In (27)-(34), the terms involve the activation function of neuron j, the class index k, the derivative of the sigmoid function, the jth row of W_1 [see model (3)], the jth position of the vector b_1, and the jth position of the hidden-layer output vector in response to the input regarding the ith example that belongs to class k. The hidden-layer weights are updated according to (14) and (16), similarly to the MMGDX algorithm. Therefore, as detailed in Algorithm 1, during MICI training the value of the AUC is calculated at each epoch over the validation data set. If the AUC increases, then a register of the network parameters is updated. Finally, the registered network parameters, which correspond to the highest AUC, are adopted. A sketch of one natural reading of the MICI objective is given below.
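The exact form of (23) is not reproduced in the text above; the sketch below (our own) uses one natural reading of the description, namely the squared distance between the class prototypes in the hidden space divided by the sum of the within-class dispersions.

import numpy as np

def mici_objective(H_pos, H_neg):
    # H_pos, H_neg: hidden-layer output vectors of the two classes, shape (N_k, n).
    # Assumed reading of (23): (prototype distance)^2 / (sum of class dispersions).
    c_pos, c_neg = H_pos.mean(axis=0), H_neg.mean(axis=0)          # class prototypes
    var_pos = np.mean(np.sum((H_pos - c_pos) ** 2, axis=1))        # class dispersions
    var_neg = np.mean(np.sum((H_neg - c_neg) ** 2, axis=1))
    return np.sum((c_pos - c_neg) ** 2) / (var_pos + var_neg + 1e-12)

# toy usage: two well-separated clouds give a large objective value
rng = np.random.default_rng(0)
print(mici_objective(rng.normal(2.0, 0.3, (20, 3)), rng.normal(-2.0, 0.3, (20, 3))))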


B. Training the Output Layer

In order to take advantage of the suitable pattern distribution of the hidden-layer space (generated after the MICI training session), the output layer is trained based on Fisher linear discriminant analysis (FLDA), which defines the separation between two distributions as the ratio of the variance between the classes to the variance within the classes. Therefore, the output layer is trained in order to solve the optimization problem (35) with the separation criterion (36), where the class prototypes are defined as in (23) and S_k is the covariance matrix of the hidden-layer outputs generated from patterns that belong to class k. The separation value is, in some sense, a measurement of the signal-to-noise ratio for the class labeling. To obtain the maximum separation, it is sufficient to solve the optimization problem (35), whose closed-form solution is given by (37). In order to find the best separating hyperplane, one must then solve (38) for the threshold (i.e., bias).

As described above, the output-layer training implies the calculation of the covariance matrices of the hidden-layer output vectors. Such matrices are calculated by propagating the training examples through the hidden layer, whose weights were calculated in the previous step. Besides the covariance matrices, the prototypes of both classes are also calculated in this hidden-layer space. The MICI method is described in Algorithm 2, and a sketch of the FLDA solution follows it.

Algorithm 2: MICI method
Input:
  training data set;
  validation data set (for AUC calculation);
  number of neurons in the hidden layer.
Output: network parameters.
1: apply the training and validation data to train the hidden layer of (3) by means of MICI in order to obtain W_1 and b_1, as described in Section IV-A;
2: propagate the training data through the network hidden layer, trained at the previous step, in order to obtain a set of hidden outputs;
3: apply the hidden outputs and the labels in order to obtain the class prototypes and covariance matrices, according to Section IV-B;
4: apply the prototypes and covariance matrices in (37) and (38) in order to calculate W_2 and the threshold.
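The sketch below (our own) shows the standard Fisher discriminant computation that we assume corresponds to the closed form (37); the midpoint threshold used here is only illustrative, since the paper adjusts the threshold from ROC-curve information instead.

import numpy as np

def flda_output_layer(H_pos, H_neg):
    # Assumed reading of (35)-(37): direction w = (S_pos + S_neg)^(-1) (c_pos - c_neg),
    # where S_k are the hidden-output covariance matrices and c_k the prototypes.
    c_pos, c_neg = H_pos.mean(axis=0), H_neg.mean(axis=0)
    S_pos = np.cov(H_pos, rowvar=False)
    S_neg = np.cov(H_neg, rowvar=False)
    w = np.linalg.solve(S_pos + S_neg, c_pos - c_neg)
    b = -0.5 * w @ (c_pos + c_neg)     # illustrative threshold (midpoint of projections)
    return w, b

# toy usage on two Gaussian clouds in a 3-D hidden space
rng = np.random.default_rng(1)
w, b = flda_output_layer(rng.normal(1.0, 0.5, (30, 3)), rng.normal(-1.0, 0.5, (30, 3)))
print(w, b)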

C. Extending MICI to Deal With Multiclass Problems

The MICI training method can be extended to multiclass pattern recognition problems (i.e., neural networks with multiple output neurons) by replacing the objective function (23) with the multiclass version (39), where c_k is the prototype of class k, K is the number of classes, and sigma_k^2 is the variance of the class-k patterns in the hidden space. The derivatives of (39) are not in the scope of this paper. The weights and bias of each output neuron can be adjusted by (37) and (38), respectively.

V. ASSEMBLING A TRAINED NEURAL NETWORK

Our experiments indicate that neural models trained by different methods, namely, MICI, MMGDX, or MSE-based gradient methods (e.g., LM or GDX), have a diversity of behaviors, according to the pattern recognition problem. Considering that the success of a training method is problem dependent, we aim to offer a robust training framework able to take the best of each training method. Taking this into account, this section introduces a methodology that composes a neural model by using neurons extracted from three neural networks, each one previously trained by MICI, MMGDX, and LM, respectively. The resulting neural network was named assembled neural network (ASNN).

The idea is to compose a usual MLP neural network by choosing a set of previously trained hidden neurons. The neuron choice aims to maximize the mutual information between the target (i.e., the class) and the hidden outputs originated from the selected neurons. Therefore, the result of this process is a neural network with its hidden layer already trained. Notice that the proposed combinatorial optimization problem has a number of possible solutions given by (40), a binomial coefficient in the number of available hidden neurons, i.e., the total number of hidden neurons in the three previously trained neural models. In order to avoid the computational effort necessary to verify all the possibilities, the combinatorial optimization is performed by means of a genetic algorithm (GA), similarly to the indirect approach described in our previous work [10]. However, in this work, the crossover operation was modified in order to deal with combinatorial optimization problems; specifically, each gene of a new individual is taken from one of the parents in a random process, differently from [10], where each gene of a new individual is a linear combination of the genes of both parents. Algorithm 5 details the neuron selection process, which needs statistical information supplied by Algorithm 4. The proposed crossover operation is detailed in steps 10-27 of Algorithm 5. Actually, this neuron selection process is similar to a feature selection, because it is possible to consider the


hidden-layer output vector as a feature vector. The applied GA fitness function (see steps 5 and 6 of Algorithm 5) is based on the principle of max-relevance and min-redundancy [14], i.e., the objective is that the outputs of the selected hidden neurons present discriminant power while avoiding redundancy. In this work, the application of the principle of max-relevance and min-redundancy corresponds to searching for the set of hidden-neuron indexes that solves the maximization problem (41), subject to the constraint (42), where j is the neuron index, the objective function of (41) is based on the max-relevance/min-redundancy principle, (43) defines the relevance of a set of selected neurons, and (44) defines the redundancy between those neurons. Notice that the relevance is the mean value of the mutual information between the hidden outputs and the labels, and the redundancy is the mean value of the mutual information between hidden outputs. The constraint (42) was introduced in order to avoid repeated neurons; it is checked in steps 22-25 of Algorithm 5. The maximum value of (41) is attained when the set of selected neuron outputs is mutually exclusive and totally correlated with the target output. In other words, the idea is to take advantage of the diversity between neurons that were trained by means of different methods.

According to model (3), the parameters (synaptic weights and bias) of a hidden neuron are represented by the corresponding row of the matrix W_1 and the corresponding element of the vector b_1. Therefore, the weight matrix of the proposed ASNN is composed by concatenating the selected rows (corresponding to the selected neurons) of the hidden-layer weight matrices of the previously trained neural models. The same procedure is adopted for the bias vector. The output layer is trained based on FLDA, similarly to the MICI output-layer training. More details about the method are given in Algorithm 3; a sketch of the fitness computation of steps 5 and 6 is given below.
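The sketch below (our own) assumes that the objective of (41) combines relevance (43) and redundancy (44) as a difference, which is the usual max-relevance/min-redundancy convention; the mutual-information values are taken as precomputed, e.g., by the histogram-based estimates of Algorithm 4.

import numpy as np
from itertools import combinations

def mrmr_fitness(selected, mi_with_target, mi_between):
    # selected: iterable of hidden-neuron indexes (one chromosome);
    # mi_with_target[j]: mutual information I(h_j; y);
    # mi_between[j, k]: mutual information I(h_j; h_k).
    selected = list(selected)
    relevance = np.mean([mi_with_target[j] for j in selected])          # spirit of (43)
    pairs = list(combinations(selected, 2))
    redundancy = np.mean([mi_between[j, k] for j, k in pairs]) if pairs else 0.0  # (44)
    return relevance - redundancy          # assumed combination for (41)

# toy usage with random (symmetric) mutual-information tables for 6 neurons
rng = np.random.default_rng(2)
mi_t = rng.random(6)
mi_b = rng.random((6, 6)); mi_b = (mi_b + mi_b.T) / 2
print(mrmr_fitness([0, 2, 5], mi_t, mi_b))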

Algorithm 3: ASNN method
Input:
  training data set;
  total number of available hidden neurons (i.e., the total number of hidden neurons in the three previously trained NNs);
  desired number of hidden neurons for the ASNN model.
Output: ASNN parameters.
1: apply the training and validation data to train the model (3) by means of MMGDX-B, obtaining the adjusted parameters of the first network;
2: apply the training and validation data to train the model (3) by means of MICI, obtaining the adjusted parameters of the second network;
3: apply the training and validation data to train the model (3) by means of LM, obtaining the adjusted parameters of the third network;
4: concatenate all the hidden-layer weight matrices of the previously trained NNs, composing a single weight matrix;
5: concatenate all the hidden-layer bias vectors of the previously trained NNs, composing a single bias vector;
6: propagate the training data through the hidden layer of model (3) using the concatenated parameters, in order to obtain a set of hidden output vectors;
7: feed Algorithm 4 with the hidden outputs and the labels in order to obtain the mutual-information statistics;
8: feed Algorithm 5 with the mutual-information values in order to select the hidden neurons that satisfy condition (41), obtaining the set of selected indexes;
9: compose the ASNN weight matrix using the selected rows of the concatenated matrix;
10: compose the ASNN bias vector using the selected positions of the concatenated vector;
11: propagate the training data through the hidden layer of model (3) using the parameters composed at the previous steps, in order to obtain a set of hidden outputs;
12: apply the hidden outputs and the labels in order to obtain the class prototypes and covariance matrices, according to Section IV-B;
13: apply the prototypes and covariance matrices in (37) and (38) in order to calculate the output-layer weights and the threshold.

Algorithm 4: Mutual information calculation
Input: data set containing vectors with the responses of all hidden neurons (referring to the three previously trained NNs) and the respective target outputs.
Output: mutual-information statistics.
1: calculate the entropy of each element of the hidden-output distribution (details about the generation of the distribution in [14]);
2: calculate the entropy of the target output;
3: calculate the joint entropy of each hidden-output pair;
4: calculate the joint entropy of each hidden output and the target output;
5: calculate the mutual information


between each hidden output and the target output;
6: calculate the mutual information between the elements of each hidden-output pair.

Algorithm 5: Neuron selection by GA and information theory
Input:
  mutual-information statistics calculated by Algorithm 4;
  desired number of hidden neurons for the ASNN;
  selective pressure;
  maximum number of generations;
  population size.
Output: set of indexes of the selected neurons.
1: generate a set with the initial population of chromosomes; each chromosome is a vector containing neuron indexes randomly generated without repeated elements;
2: for each generation do
3:   evaluating the population:
4:   for each individual do
5:     calculate the relevance and the redundancy of the individual according to (43) and (44), by means of the previously calculated mutual-information values for all the elements (i.e., indexes) of the chromosome;
6:     store the fitness of the individual;
7:   end for
8:   rank the individuals according to their fitness;
9:   store the genes of the best individual;
10:  performing the crossover:
11:  initialize the new generation;
12:  for each individual of the new generation do
13:    select the parents:
14:    randomly select the indexes of the parents by using the asymmetric distribution proposed in [10];
15:    draw a random number with uniform distribution;
16:    map it according to the selective pressure, as in [10];
17:    store the indexes which are absent in both parents;
18:    assembling the chromosome:
19:    for each gene do
20:      randomly select a parent (i.e., one of the two) to give the gene to the new individual;
21:      copy the selected gene;
22:      considering the constraint (42):
23:      if there is duplicity of indexes in the chromosome then
24:        pick up a new index from the set of indexes absent in both parents;
25:      end if
26:    end for
27:  end for
28: end for
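The sketch below (our own reading of steps 10-27 of Algorithm 5) illustrates the modified crossover: each gene of the child is copied from one of the two parents at random, and duplicated indexes are repaired by drawing an index absent from both parents, so that constraint (42) (no repeated neurons) holds.

import numpy as np

def combinatorial_crossover(parent_a, parent_b, n_available, rng):
    # parent_a, parent_b: chromosomes (vectors of hidden-neuron indexes);
    # n_available: total number of available hidden neurons.
    parent_a, parent_b = np.asarray(parent_a), np.asarray(parent_b)
    pick_from_a = rng.random(parent_a.size) < 0.5
    child = np.where(pick_from_a, parent_a, parent_b)     # gene-wise random parent
    absent = np.setdiff1d(np.arange(n_available), np.union1d(parent_a, parent_b))
    seen, repaired = set(), []
    for gene in child:
        if gene in seen:                                  # duplicate -> repair
            gene, absent = absent[0], absent[1:]
        seen.add(int(gene))
        repaired.append(int(gene))
    return np.array(repaired)

# toy usage: chromosomes are sets of hidden-neuron indexes
rng = np.random.default_rng(1)
print(combinatorial_crossover([0, 2, 5], [2, 3, 7], n_available=10, rng=rng))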

VI. EXPERIMENTS

Results of five experiments, regarding real-world two-class classification problems, are described in this section. More specifically, the new algorithms are evaluated over five real-world pattern recognition benchmark data sets: Gisette,1 Dexter,2 Dorothea,3 Breast-Cancer,4 and Thyroid.5

A. Data Sets

The first three data sets were part of the 2003 Neural Information Processing Systems (NIPS2003) challenge; therefore, the performance of the other 56 methods applied in the challenge gives some reference for the effectiveness of our methods. The Breast-Cancer and Thyroid benchmark data sets do not have validation sets. Therefore, around 20% of each training data set was reserved to calculate the ROC curves, necessary to check the training stop criterion, and the other 80% was applied to MLP training. As the labels of all the NIPS2003 validation sets are available on the Internet, such validation sets were used as test data sets, in order to evaluate the MLP performance. Table II provides a brief description of the data sets, while their compositions are given in Table III.

1http://archive.ics.uci.edu/ml/datasets/Gisette
2http://archive.ics.uci.edu/ml/datasets/Dexter
3http://archive.ics.uci.edu/ml/datasets/Dorothea
4http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
5http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm

TABLE II DATA SET DESCRIPTION

TABLE III DATA SET COMPOSITION

B. Training-Method Parameters and Neural Architectures

In the experiments, the parameters of the MMGDX method (initial learning rates of the first and second layers, learning-rate increase and decrease factors, error-growth threshold, momentum term, and stop criterion) were set in advance, and the same was done for the MICI parameters (including its initial learning rate and stop criterion). The ASNN selects neurons from the NNs previously trained by MMGDX-B, MICI, and LM (or GDX in the case of the NIPS data sets). Five, seven, ten, twelve, and fifteen neurons were tested in the hidden layer. Concerning MMGDX-B (i.e., based on the L_p-norm), several values of the norm p were tested. The architectures and norms were selected based on the best mean accuracy over fivefold cross validation. Table IV reports the selected architectures, while Table V reports the selected norms.

TABLE IV APPLIED NEURAL ARCHITECTURES

TABLE V NORMS ADOPTED IN THE EXPERIMENTS WITH MMGDX-B

C. Results on NIPS2003 Data Sets

Besides the results of all the proposed methods, this section also reports the results obtained by the use of GDX with a stop criterion based on the maximal AUC. The results of each experiment are described in Tables VI-VIII, which report the accuracy, the balanced error rate (BER), and the AUC (calculated over the test data set). Taking into account the large number of features of the NIPS2003 data sets (up to 100 000), second-order training methods, such as Levenberg-Marquardt (LM), cannot be applied, due to the great amount of memory required to compose the Jacobian matrix and to calculate its pseudoinverse.

TABLE VI TRAINING METHOD COMPARISON ON DEXTER

TABLE VII TRAINING METHOD COMPARISON ON GISETTE

TABLE VIII TRAINING METHOD COMPARISON ON DOROTHEA


Fig. 2. Scalability analysis of training time on Dexter data set.

Fig. 3. Scalability analysis of training time on Gisette data set.

Regarding the experiments with SVM, the standard SVM package from MATLAB was applied. This software allows specifying the SVM training method used to find the separating hyperplane, among three options: the usual quadratic programming (QP), the least-squares method (LS), and sequential minimal optimization (SMO) [15]. The SMO method breaks the entire QP problem into a series of small QP problems, which are solved analytically. This method claims to decrease the amount of memory and time usually required by standard QP methods. However, the SVM training required 8 GB of RAM to run the NIPS2003 data sets in MATLAB (Windows version).

This section also presents a scalability analysis, by assessing the processing time and accuracy of SVM, and of NNs trained by GDX, MICI, and MMGDX, for different amounts of training samples. Unfortunately, it was not possible to perform experiments with LM, even considering only 20% of the training data set. Furthermore, the experiments with SVM did not include the PCA or Mahalanobis kernels, because the calculation of the covariance matrix over thousands of features and the subsequent calculation of its eigenvectors (in the case of the PCA kernel) or its inverse (in the case of the Mahalanobis kernel) overflows the available memory. Therefore, four kernels were investigated, by means of fivefold cross validation: linear, RBF, and polynomial (second and third orders). For each kernel, five values of the margin parameter were tested. In the case of the radial basis function (RBF) kernel, a 2-D grid, composed of discrete values of the margin parameter and the kernel width, was exploited. The linear kernel was the best kernel for all the NIPS data sets, with the best value of the margin parameter selected separately for the Dexter, Gisette, and Dorothea data sets. All three evaluated SVM training methods have similar performance and training time for the Dexter and Dorothea data sets; however, they have different performances in the case of the Gisette data set, where only the LS training method was able to run the entire data set. Notice that, even using only 5000 training examples, the SMO training method was more accurate than LS on the Gisette data set (see Fig. 6). Figs. 2-4 illustrate the training time comparison between training methods for different numbers of training samples, while Figs. 5-7 illustrate the accuracy comparison. Figs. 5 and 7 present only one curve for SVM accuracy because all the SVM training methods showed the same accuracy.
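For reference, the kernel and margin-parameter selection described above amounts to a standard cross-validated grid search. The sketch below is our own illustration using scikit-learn, which is not the MATLAB package used in the paper, and the candidate values are placeholders rather than the grids actually tested by the authors.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Placeholder grids: one margin-parameter sweep per kernel; for the RBF kernel
# a 2-D grid over the margin parameter and the kernel width is explored.
param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.01, 0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
]

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")

# toy usage on synthetic data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
search.fit(X, y)
print(search.best_params_)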

Fig. 4. Scalability analysis of training time on Dorothea data set.

Fig. 5. Scalability analysis of classifier accuracy on Dexter data set.

In Figs. 2 and 4, we present only the shortest SVM training times, because all the SVM training methods had quite similar training times. The reader can find other NIPS2003 results in [3], where feature selectors were applied before the classifiers. Our experimental results indicate that, without a feature selector, NNs seem more suitable to deal with the NIPS2003 data sets. From Table IV, we can observe that the usual GDX achieved its best accuracy using models with few hidden neurons on the Dexter and Dorothea data sets; this fact may be explained by the high input dimension of these data sets, i.e., in such cases, the neural model has many adjustable parameters (synaptic weights) in the input layer, in view of which, training methods based on MSE, such as GDX or LM, may lead the neural model to overfit the data in the case of many hidden neurons.

Fig. 6. Scalability analysis of classifier accuracy on Gisette data set.

Fig. 7. Scalability analysis of classifier accuracy on Dorothea data set.

TABLE IX TRAINING METHODS COMPARISON ON BREAST-CANCER DATA SET

TABLE X TRAINING METHODS COMPARISON ON THYROID DATA SET

TABLE XI TRAINING TIME ON BREAST-CANCER DATA SET

TABLE XII TRAINING TIME ON THYROID DATA SET

TABLE XIII OTHER METHODS RESULTS ON BREAST-CANCER DATA SET

TABLE XIV OTHER METHODS RESULTS ON THYROID DATA SET

6In Tables IX, X, XIII, and XIV, the value after the ± symbol means the standard deviation.

D. Results on Breast-Cancer and Thyroid Data Sets

The results reported in this section are averaged over the classification performance of 100 evaluations. The results6 of Tables IX and X can be compared with the results of other methods, described in [16] and reproduced in Tables XIII and XIV. Tables XI and XII report the training time. Actually, all the proposed methods had quite similar accuracy in the experiments, except for MICI on the Breast-Cancer data set. MMGDX-B was the second overall ranked algorithm among the proposed methods: the third position on the Dexter and


Gisette data sets and the second position in the other three experiments. On the other hand, ASNN was the first-ranked method in all the experiments, which gives evidence of the problem-independent robustness of ASNN. Regarding the ASNN training time, because MMGDX-B is the most expensive method among the three training algorithms applied in the composition of ASNN, the training time difference between MMGDX-B and ASNN is not significant in practical applications (see Tables XI and XII).

VII. CONCLUSION

The experimental results using MMGDX, MICI, and ASNN applied to real-world benchmarks provide evidence of the effectiveness of those methods regarding accuracy, AUC, and BER. Notice that the neural models trained by the proposed methods have been applied without feature selection, despite the distractor features (i.e., probes) that have been added to the NIPS2003 benchmark data sets. The proposed methods deserve special credit due to their high cost-benefit ratio, good results, and model simplicity (a simple MLP model) when compared with the other algorithms of Tables XIII and XIV. In the case of SVM, the number of support vectors increases linearly with the number of training examples [19]; therefore, the computational requirements to store the active segment of the kernel matrix increase with the size of the training data set, since the active segment contains all the dot products between support vectors. Actually, usual SVM training algorithms have O(N^3) time and O(N^2) space complexities, where N is the training-data-set size [21]. The proposed MM methods have the same time and space complexities as the usual GDX, i.e., O(N), being a suitable option of MM-based classifier for data sets with a large number of features, such as the NIPS data sets, or with a large number of training data. Moreover, MMGDX automatically adjusts the hidden-layer weights in such a way as to enable larger margins, avoiding the testing of many kernels, as occurs in the case of SVM applications.

REFERENCES
[1] H. Demuth and M. Beale, Neural Network Toolbox User's Guide: For Use with MATLAB, 4.0 ed. Natick, MA: The MathWorks Inc., 2000.
[2] U. Franke and S. Heinrich, "Fast obstacle detection for urban traffic situations," IEEE Trans. Intell. Transp. Syst., vol. 3, no. 3, pp. 173-181, Sep. 2002.
[3] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Feature Extraction, Foundations and Applications. Heidelberg, Germany: Springer-Verlag, 2006.
[4] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Netw., vol. 3, no. 5, pp. 551-560, 1990.
[5] R. Ilin, R. Kozma, and P. J. Werbos, "Beyond feedforward models trained by backpropagation: A practical training tool for a more efficient universal approximator," IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929-937, Jun. 2008.
[6] A. K. D. Jayadeva and S. Chandra, "Binary classification by SVM based tree type neural network," in Proc. Int. Conf. Neural Netw., Honolulu, HI, May 2002, vol. 3, pp. 2773-2778.
[7] C. T. Kim and J. J. Lee, "Training two-layered feedforward networks with variable projection method," IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 371-375, Feb. 2008.

[8] C. Liu, "Gabor-based kernel PCA with fractional power polynomial models for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 572-581, May 2004.
[9] O. Ludwig, P. C. Gonzalez, and A. C. Lima, "Optimization of ANN applied to non-linear system identification," in Proc. 25th IASTED Int. Conf. Model. Identif. Control, Lanzarote, Feb. 2006, pp. 402-407.
[10] O. Ludwig, U. Nunes, R. Araujo, L. Schnitman, and H. A. Lepikson, "Applications of information theory, genetic algorithms, and neural models to predict oil flow," Commun. Nonlinear Sci. Numer. Simul., vol. 14, no. 7, pp. 2870-2885, 2009.
[11] R. M. Neal and J. Zhang, "High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees," in Studies in Fuzziness and Soft Computing. Berlin, Germany: Springer-Verlag, 2006, vol. 207, pp. 265-296.
[12] T. Nishikawa and S. Abe, "Maximizing margins of multilayer neural networks," in Proc. 9th Int. Conf. Neural Inf. Process., Singapore, Nov. 2002, vol. 1, pp. 322-326.
[13] G. Ou and Y. L. Murphey, "Multi-class pattern classification using neural networks," Pattern Recognit., vol. 40, no. 1, pp. 4-18, Jan. 2007.
[14] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[15] J. C. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," Microsoft Research, Tech. Rep. MSR-TR-98-14, 1998.
[16] G. Rätsch, T. Onoda, and K. Müller, "Soft margins for AdaBoost," in Machine Learning. Norwell, MA: Kluwer, 2001, vol. 42, pp. 287-320.
[17] A. Ruiz and P. E. Lopez-de-Teruel, "Nonlinear kernel-based statistical pattern analysis," IEEE Trans. Neural Netw., vol. 12, no. 1, pp. 16-32, Jan. 2001.
[18] R. Sexton, R. Dorsey, and J. Johnson, "Optimization of neural networks: A comparative analysis of the genetic algorithm and simulated annealing," Eur. J. Oper. Res., vol. 114, no. 3, pp. 589-601, May 1999.
[19] I. Steinwart, "Sparseness of support vector machines," J. Mach. Learn. Res., vol. 4, no. 12, pp. 1071-1105, 2003.
[20] N. K. Treadgold and T. D. Gedeon, "Exploring constructive cascade networks," IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1335-1350, Nov. 1999.
[21] I. W. Tsang, J. T. Kwok, and P. M. Cheung, "Core vector machines: Fast SVM training on very large data sets," J. Mach. Learn. Res., vol. 6, pp. 363-392, Apr. 2005.
[22] V. Vapnik, The Nature of Statistical Learning Theory. London, U.K.: Springer-Verlag, 1995.
[23] V. Vapnik and A. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory Probab. Appl., vol. 16, no. 2, pp. 264-280, 1971.
[24] J. Yang, A. F. Frangi, J. U. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 2, pp. 230-244, Feb. 2005.
[25] J. Yaochu, T. Okabe, and B. Sendhoff, "Neural network regularization and ensembling using multi-objective evolutionary algorithms," in Proc. Congr. Evol. Comput., Portland, OR, Jun. 2004, DOI: 10.1109/CEC.2004.1330830.
[26] S. Young and T. Downs, "CARVE: A constructive algorithm for real-valued examples," IEEE Trans. Neural Netw., vol. 9, no. 6, pp. 1180-1190, Nov. 1998.
[27] K. Zhang, I. W. Tsang, and J. T. Kwok, "Maximum margin clustering made practical," IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 583-596, Apr. 2009.

Oswaldo Ludwig received the M.Sc.
degree in electrical engineering from the Federal University of Bahia, Brazil, in 2004. Currently, he is working towards the Ph.D. degree with the Automation and Mobile Robotics Group, ISR-Institute of Systems and Robotics, University of Coimbra, Coimbra, Portugal. He has published a book and more than 20 papers in the area of artificial intelligence. His current research focuses on machine learning with application to pedestrian detection in the domain of intelligent vehicles.


Urbano Nunes (S'90-M'95-SM'09) received the Lic. and Ph.D. degrees in electrical engineering from the University of Coimbra, Coimbra, Portugal, in 1983 and 1995, respectively. Currently, he is an Associate Professor with the Faculty of Sciences and Technology, University of Coimbra. He is also a Researcher of the Institute for Systems and Robotics, University of Coimbra, where he is the Coordinator of the Automation and Mobile Robotics Group and the Coordinator of the Mechatronics Laboratory. He has been involved with, and responsible for, several funded projects at both national and international levels, in the areas of mobile robotics and intelligent vehicles.

Dr. Nunes is an Associate Editor for the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, an Associate Editor for the IEEE INTELLIGENT TRANSPORTATION SYSTEMS MAGAZINE, and a Co-Chair of the Technical Committee (TC) on Autonomous Ground Vehicles and Intelligent Transportation Systems (ITS) of the IEEE Robotics and Automation Society (RAS). He has served in several conferences: International Conference on Advanced Robotics, General Co-Chair (2003); IEEE Intelligent Transportation Systems Conference (ITSC), Program Chair (2006); IEEE International Conference on Vehicular Electronics and Safety (ICVES'07), Program Chair (2007); IEEE ITSC, Program Co-Chair (2008). He is the General Chair of the IEEE ITSC 2010.
