
Published in the Proceedings of the International Joint Conference on Neural Networks (IJCNN2004), 25-29 July 2004, Budapest, Hungary. The extended version of this paper can be found in: G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, "Extreme Learning Machine: Theory and Applications", Neurocomputing, vol. 70, pp. 489-501, 2006.

Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew
School of Electrical and Electronic Engineering
Nanyang Technological University
Nanyang Avenue, Singapore 639798
E-mail: egbhuang@ntu.edu.sg

Abstract– It is clear that the learning speed of feedforward neural networks is in general far slower than required, and it has been a major bottleneck in their applications for past decades. Two key reasons behind this may be: 1) slow gradient-based learning algorithms are extensively used to train neural networks, and 2) all the parameters of the networks are tuned iteratively by such learning algorithms. Unlike these traditional implementations, this paper proposes a new learning algorithm called extreme learning machine (ELM) for single-hidden layer feedforward neural networks (SLFNs) which randomly chooses the input weights and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide the best generalization performance at extremely fast learning speed. The experimental results based on real-world benchmarking function approximation and classification problems, including large complex applications, show that the new algorithm can produce the best generalization performance in some cases and can learn much faster than traditional popular learning algorithms for feedforward neural networks.

I. Introduction

From a mathematical point of view, research on the approximation capabilities of feedforward neural networks has focused on two aspects: universal approximation on compact input sets and approximation in a finite set. Many researchers have explored the universal approximation capabilities of standard multi-layer feedforward neural networks [1], [2], [3]. In real applications, the neural networks are trained on finite training sets. For function approximation in a finite training set, Huang and Babri [4] show that a single-hidden layer feedforward neural network (SLFN) with at most N hidden neurons and with almost any nonlinear activation function can learn N distinct observations with zero error. It should be noted that the input weights (linking the input layer to the first hidden layer) need to be adjusted in all these previous theoretical research works as well as in almost all practical learning algorithms of feedforward neural networks.

Traditionally, all the parameters of the feedforward networks need to be tuned, and thus there exists a dependency between the parameters of different layers (weights and biases). For past decades gradient descent-based methods have mainly been used in various learning algorithms of feedforward neural networks. However, it is clear that gradient descent-based learning methods are generally very slow due to improper learning steps or may easily converge to local minima, and many iterative learning steps are required by such learning algorithms in order to obtain better learning performance.

It has been shown [5], [6] that SLFNs (with N hidden neurons) with arbitrarily chosen input weights can learn N distinct observations with arbitrarily small error. Unlike the popular thinking and most practical implementations that all the parameters of the feedforward networks need to be tuned, one may not necessarily adjust the input weights and first hidden layer biases in applications. In fact, some simulation results on artificial and real large applications in our work [7] have shown that this method not only makes learning extremely fast but also produces good generalization performance. Recently, it has been further rigorously proved in our work [8] that SLFNs with arbitrarily assigned input weights and hidden layer biases and with almost any nonzero activation function can universally approximate any continuous function on any compact input set. These research results imply that in the applications of feedforward neural networks the input weights may not need to be adjusted at all.

After the input weights and the hidden layer biases are chosen arbitrarily, SLFNs can simply be considered as a linear system and the output weights (linking the hidden layer to the output layer) of SLFNs can be analytically determined through a simple generalized inverse operation of the hidden layer output matrices. Based on this concept, this paper proposes a simple learning algorithm for SLFNs called extreme learning machine (ELM) whose learning speed can be thousands of times faster than traditional feedforward network learning algorithms like the back-propagation algorithm while obtaining better generalization performance. Different from traditional learning algorithms, the proposed learning algorithm not only tends to reach the smallest training error but also the smallest norm of weights. Bartlett's theory on the generalization performance of feedforward neural networks [9] states that for feedforward neural networks reaching smaller training error, the smaller the norm of weights is, the better generalization performance the networks tend to have. Therefore, the proposed learning algorithm tends to have better generalization performance for feedforward neural networks.
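
The analytical determination of the output weights mentioned above rests on the Moore-Penrose generalized inverse developed in Section II. As a minimal NumPy illustration (a hypothetical random rank-deficient system, not an example from the paper), the pseudoinverse returns a least-squares solution that also has the smallest norm among all least-squares solutions:

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical rank-deficient system Ax = y (rank 5, shape 50 x 10),
# so the least-squares solution is not unique.
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 10))
y = rng.standard_normal(50)

# Moore-Penrose solution: a least-squares solution with the smallest norm.
x0 = np.linalg.pinv(A) @ y

# Adding a null-space direction of A gives another least-squares solution.
_, _, Vt = np.linalg.svd(A)
x1 = x0 + Vt[-1]                     # Vt[-1] lies in the null space of A

print(np.linalg.norm(A @ x0 - y), np.linalg.norm(A @ x1 - y))  # equal residuals
print(np.linalg.norm(x0), np.linalg.norm(x1))                  # x0 has the smaller norm

Here x0 lies in the row space of A, so adding a null-space direction leaves the residual unchanged but can only increase the norm; this is exactly the minimum norm least-squares property formalized in Section II.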

As the new proposed learning algorithm tends to reach the smallest training error, obtain the smallest norm of weights and the best generalization performance, and runs extremely fast, in order to differentiate it from the other popular SLFN learning algorithms, it is called the Extreme Learning Machine (ELM) in the context of this paper.

This paper is organized as follows. Section II introduces the Moore-Penrose generalized inverse and the minimum norm least-squares solution of a general linear system, which play an important role in developing our new ELM learning algorithm. Section III proposes the new ELM learning algorithm for single-hidden layer feedforward neural networks (SLFNs). Performance evaluation is presented in Section IV. Discussions and conclusions are given in Section V.

II. Preliminaries

In this section, the Moore-Penrose generalized inverse is introduced. We also consider in this section the minimum norm least-squares solution of a general linear system Ax = y in Euclidean space, where A ∈ R^{m×n} and y ∈ R^m. As shown in [5], [6], the SLFNs are actually a linear system if the input weights and the hidden layer biases can be chosen arbitrarily.

A. Moore-Penrose Generalized Inverse

The resolution of a general linear system Ax = y, where A may be singular and may even not be square, can be made very simple by the use of the Moore-Penrose generalized inverse [10].

Definition 2.1: [10] A matrix G of order n × m is the Moore-Penrose generalized inverse of matrix A of order m × n, if

    AGA = A,  GAG = G,  (AG)^T = AG,  (GA)^T = GA.    (1)

For the sake of convenience, the Moore-Penrose generalized inverse of matrix A will be denoted by A†.

B. Minimum Norm Least-Squares Solution of General Linear System

For a general linear system Ax = y, we say that x̂ is a least-squares solution (l.s.s.) if

    ‖Ax̂ − y‖ = min_x ‖Ax − y‖,    (2)

where ‖·‖ is a norm in Euclidean space.

Definition 2.2: x0 ∈ R^n is said to be a minimum norm least-squares solution of a linear system Ax = y if for any y ∈ R^m

    ‖x0‖ ≤ ‖x‖,  ∀x ∈ {x : ‖Ax − y‖ ≤ ‖Az − y‖, ∀z ∈ R^n}.    (3)

That means, a solution x0 is said to be a minimum norm least-squares solution of a linear system Ax = y if it has the smallest norm among all the least-squares solutions.

Theorem 2.1: (p. 147 of [10]) Let there exist a matrix G such that Gy is a minimum norm least-squares solution of a linear system Ax = y. Then it is necessary and sufficient that G = A†, the Moore-Penrose generalized inverse of matrix A.

III. Extreme Learning Machine

In Section II we have briefed the Moore-Penrose inverse and the smallest norm least-squares solution of a general linear system Ax = y. We can now propose an extremely fast learning algorithm for single hidden layer feedforward networks (SLFNs) with Ñ hidden neurons, where Ñ ≤ N, the number of training samples.

A. Approximation Problem of SLFNs

For N arbitrary distinct samples (xi, ti), where xi = [xi1, xi2, ..., xin]^T ∈ R^n and ti = [ti1, ti2, ..., tim]^T ∈ R^m, standard SLFNs with Ñ hidden neurons and activation function g(x) are mathematically modeled as

    Σ_{i=1}^{Ñ} βi g(wi · xj + bi) = oj,  j = 1, ..., N,    (4)

where wi = [wi1, wi2, ..., win]^T is the weight vector connecting the ith hidden neuron and the input neurons, βi = [βi1, βi2, ..., βim]^T is the weight vector connecting the ith hidden neuron and the output neurons, and bi is the threshold of the ith hidden neuron. wi · xj denotes the inner product of wi and xj. The output neurons are chosen linear in this paper.

That standard SLFNs with Ñ hidden neurons with activation function g(x) can approximate these N samples with zero error means that Σ_{j=1}^{N} ‖oj − tj‖ = 0, i.e., there exist βi, wi and bi such that

    Σ_{i=1}^{Ñ} βi g(wi · xj + bi) = tj,  j = 1, ..., N.    (5)

The above N equations can be written compactly as

    Hβ = T,    (6)

where

    H(w1, ..., wÑ, b1, ..., bÑ, x1, ..., xN) =
        [ g(w1 · x1 + b1)   ...   g(wÑ · x1 + bÑ) ]
        [       ...         ...         ...       ]
        [ g(w1 · xN + b1)   ...   g(wÑ · xN + bÑ) ]   (N × Ñ),    (7)

    β = [β1^T; ...; βÑ^T]  (an Ñ × m matrix)  and  T = [t1^T; ...; tN^T]  (an N × m matrix).    (8)

As named in Huang and Babri [4] and Huang [6], H is called the hidden layer output matrix of the neural network; the ith column of H is the ith hidden neuron's output vector with respect to inputs x1, x2, ..., xN.
B. Gradient-Based Learning Algorithms

As analyzed in previous works [4], [5], [6], if the number of hidden neurons is equal to the number of distinct training samples, Ñ = N, matrix H is square and invertible, and SLFNs can approximate these training samples with zero error. However, in most cases the number of hidden neurons is much less than the number of distinct training samples, Ñ ≪ N, H is a nonsquare matrix and there may not exist wi, bi, βi (i = 1, ..., Ñ) such that Hβ = T. Thus, instead one may need to find specific ŵi, b̂i, β̂ (i = 1, ..., Ñ) such that

    ‖H(ŵ1, ..., ŵÑ, b̂1, ..., b̂Ñ)β̂ − T‖ = min_{wi, bi, β} ‖H(w1, ..., wÑ, b1, ..., bÑ)β − T‖,    (9)

which is equivalent to minimizing the cost function

    E = Σ_{j=1}^{N} ( Σ_{i=1}^{Ñ} βi g(wi · xj + bi) − tj )².    (10)

When H is unknown, gradient-based learning algorithms are generally used to search the minimum of ‖Hβ − T‖. In the minimization procedure using gradient-based algorithms, the vector W, which collects the weight (wi, βi) and bias (bi) parameters, is iteratively adjusted as follows:

    Wk = Wk−1 − η ∂E(W)/∂W.    (11)

Here η is a learning rate. The popular learning algorithm used in feedforward neural networks is the back-propagation learning algorithm, where gradients can be computed efficiently by propagation from the output to the input. There are several issues with back-propagation learning algorithms:

1) When the learning rate η is too small, the learning algorithm converges very slowly. However, when η is too large, the algorithm becomes unstable and diverges.
2) Another peculiarity of the error surface that impacts the performance of the back-propagation learning algorithm is the presence of local minima [11]. It is undesirable that the learning algorithm stops at a local minimum if it is located far above a global minimum.
3) Neural networks may be over-trained by using back-propagation algorithms and obtain worse generalization performance. Thus, validation and suitable stopping methods are required in the cost function minimization procedure.
4) Gradient-based learning is very time-consuming in most applications.

The aim of this paper is to solve the above issues related with gradient-based algorithms and propose an efficient learning algorithm for feedforward neural networks.

C. Minimum Norm Least-Squares Solution of SLFN

As mentioned in Section I, it is very interesting and surprising that, unlike the most common understanding that all the parameters of SLFNs need to be adjusted, the input weights wi and the hidden layer biases bi are in fact not necessarily tuned and the hidden layer output matrix H can actually remain unchanged once arbitrary values have been assigned to these parameters in the beginning of learning.

Our recent work [6] also shows that SLFNs (with infinitely differentiable activation functions) with the input weights chosen arbitrarily can learn distinct observations with arbitrarily small error. In fact, simulation results on artificial and real world cases done in [7] as well as in this paper have further demonstrated that no gain is possible by adjusting the input weights and the hidden layer biases. Recently, it has been further rigorously proved in our work [8] that, unlike the traditional function approximation theories [1] which require adjusting input weights and hidden layer biases, feedforward networks with arbitrarily assigned input weights and hidden layer biases and with almost all nonzero activation functions can universally approximate any continuous functions on any compact input sets.

These research results (simulations and theories) show that the input weights and the hidden layer biases of SLFNs need not be adjusted at all and can be arbitrarily given. For fixed input weights wi and hidden layer biases bi, seen from equation (9), to train an SLFN is simply equivalent to finding a least-squares solution β̂ of the linear system Hβ = T:

    ‖H(w1, ..., wÑ, b1, ..., bÑ)β̂ − T‖ = min_β ‖H(w1, ..., wÑ, b1, ..., bÑ)β − T‖.    (12)

According to Theorem 2.1, the smallest norm least-squares solution of the above linear system is:

    β̂ = H†T.    (13)

Remarks: As discussed in Section II, we have the following important properties:

1) Minimum training error. The special solution β̂ = H†T is one of the least-squares solutions of the general linear system Hβ = T, meaning that the smallest training error can be reached by this special solution:

    ‖Hβ̂ − T‖ = ‖HH†T − T‖ = min_β ‖Hβ − T‖.    (14)

Although almost all learning algorithms wish to reach the minimum training error, most of them cannot reach it because of local minima or because infinite training iterations are usually not allowed in applications.

2) Smallest norm of weights and best generalization performance. Furthermore, the special solution β̂ = H†T
has the smallest norm among all the least-squares solutions of Hβ = T:

    ‖β̂‖ = ‖H†T‖ ≤ ‖β‖,  ∀β ∈ {β : ‖Hβ − T‖ ≤ ‖Hz − T‖, ∀z ∈ R^{Ñ×N}}.    (15)

As pointed out by Bartlett [12], [9], for feedforward networks with many small weights but small squared error on the training examples, the Vapnik-Chervonenkis (VC) dimension (and hence the number of parameters) is irrelevant to the generalization performance. Instead, the magnitude of the weights in the network is more important. The smaller the weights are, the better generalization performance the network tends to have. As analyzed above, our method not only reaches the smallest squared error on the training examples but also obtains the smallest weights. Thus, it is reasonable to think that our method tends to have better generalization performance. It is worth pointing out that it may be difficult for gradient-based learning algorithms like back-propagation to reach the best generalization performance since they only try to obtain the smallest training errors without considering the magnitude of the weights.

3) The minimum norm least-squares solution of Hβ = T is unique, which is β̂ = H†T.

D. Learning Algorithm for SLFNs

Thus, a simple learning method for SLFNs called the extreme learning machine (ELM)¹ can be summarized as follows:

Algorithm ELM: Given a training set ℵ = {(xi, ti) | xi ∈ R^n, ti ∈ R^m, i = 1, ..., N}, activation function g(x), and hidden neuron number Ñ:
step 1: Assign arbitrary input weights wi and biases bi, i = 1, ..., Ñ.
step 2: Calculate the hidden layer output matrix H.
step 3: Calculate the output weight β:

    β = H†T,    (16)

where H, β and T are defined as in formulas (7) and (8).

IV. Performance Evaluation

In this section, the performance of the proposed ELM learning algorithm is compared with popular algorithms of feedforward neural networks like the conventional back-propagation (BP) algorithm on some benchmarking problems in the function approximation and classification areas. This paper mainly focuses on feedforward neural networks and aims to propose a new learning algorithm to train feedforward networks efficiently. Although Support Vector Machines (SVMs) are obviously different from standard feedforward networks and it is not the objective of the paper to systematically compare the differences between SVMs and feedforward networks, a performance comparison between SVMs and our ELM is also simply conducted.

All the simulations for the BP and ELM algorithms are carried out in the MATLAB 6.5 environment running on a Pentium 4, 1.9 GHz CPU. Although there are many variants of the BP algorithm, a faster BP algorithm called the Levenberg-Marquardt algorithm is used in our simulations.

The simulations for SVM are carried out using a compiled C-coded SVM package, LIBSVM [13], running on the same PC. When comparing the learning speed of ELM and SVMs, readers should bear in mind that a C implementation would be much faster than MATLAB, so a comparison with LIBSVM could give SVM an advantage. The kernel function used in SVM is the radial basis function whereas the activation function used in our proposed algorithm is a simple sigmoidal function g(x) = 1/(1 + exp(−x)). In order to compare BP and our ELM, BP and ELM are always assigned the same number of hidden neurons for the same applications.

A. Benchmarking with Function Approximation Problem

California Housing is a dataset obtained from the StatLib repository². There are 20,640 observations for predicting the price of houses in California. Information on the variables was collected using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. Distances among the centroids of each block group were computed as measured in latitude and longitude. All the block groups reporting zero entries for the independent and dependent variables were excluded. The final data contained 20,640 observations on 9 variables, which consist of 8 continuous inputs (median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude) and one continuous output (median house value). In our simulations, 8,000 training data and 12,640 testing data are randomly generated from the California Housing database for each trial.

For simplicity, the 8 input attributes and one output have been normalized to the range [0, 1] in our experiment. The parameter C is tuned and set as C = 500 in the SVR algorithm. 50 trials have been conducted for all the algorithms and the average results are shown in Table I. Seen from Table I, the learning speed of our ELM algorithm is more than 1,000 and 2,000 times faster than BP and SVM, respectively, for this case. The generalization performance obtained by the ELM algorithm is very close to the generalization performance of BP and SVMs.

¹ Source codes can be downloaded from http://www.ntu.edu.sg/eee/icis/cv/egbhuang.htm.
² http://www.niaad.liacc.up.pt/~ltorgo/Regression/cal_housing.html
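
The three steps of Algorithm ELM in Section III-D amount to a few lines of linear algebra. The sketch below is a minimal NumPy rendering on hypothetical synthetic data scaled to [0, 1]; it does not reproduce the California Housing experiment or its exact protocol:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, T, n_hidden, rng):
    """Algorithm ELM: arbitrary (w_i, b_i), then beta = pinv(H) @ T."""
    W = rng.standard_normal((X.shape[1], n_hidden))  # step 1: arbitrary input weights
    b = rng.standard_normal(n_hidden)                #         and hidden layer biases
    H = sigmoid(X @ W + b)                           # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                     # step 3: minimum norm least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta                 # linear output neurons

# Hypothetical regression data in [0, 1] (a stand-in for a real benchmark).
X = rng.random((1000, 8))
T = np.sin(X.sum(axis=1, keepdims=True)) * 0.5 + 0.5

X_train, T_train = X[:800], T[:800]
X_test,  T_test  = X[800:], T[800:]

W, b, beta = elm_train(X_train, T_train, n_hidden=20, rng=rng)
rmse = np.sqrt(np.mean((elm_predict(X_test, W, b, beta) - T_test) ** 2))
print("testing RMS error:", rmse)

The only quantities chosen by the user are the number of hidden neurons and the activation function; there is no learning rate, stopping criterion, or iteration count, which is why the training time reduces essentially to one pseudoinverse computation.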
TABLE I
PERFORMANCE COMPARISON IN CALIFORNIA HOUSING PREDICTION APPLICATION.

Algorithms | Training Time (s) | Testing Time (s) | Training RMS | Testing RMS | No. of SVs/Neurons
ELM        | 0.272             | 0.143            | 0.1358       | 0.1365      | 20
BP         | 295.23            | 0.286            | 0.1369       | 0.1426      | 20
SVM        | 558.4137          | 20.9763          | 0.1267       | 0.1275      | 2534.0

B. Benchmarking with Real Medical Diagnosis Application

The performance comparison of the new proposed ELM algorithm and many other popular algorithms has been conducted for a real medical diagnosis problem: Diabetes³, using the "Pima Indians Diabetes Database" produced in the Applied Physics Laboratory, Johns Hopkins University, 1988. The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The database consists of 768 women over the age of 21 resident in Phoenix, Arizona. All examples belong to either the positive or the negative class. All the input values are within [0, 1]. For this problem, 75% and 25% of the samples are randomly chosen for training and testing, respectively, at each trial. The parameter C of the SVM algorithm is tuned and set as C = 10 and the rest of the parameters are set as default.

50 trials have been conducted for all the algorithms and the average results are shown in Table II. Seen from Table II, in our simulations SVM can reach the testing rate 77.31% with 317.16 support vectors on average. Rätsch et al. [14] obtained a testing rate of 76.50% for SVM, which is slightly lower than the SVM result we obtained. However, the new ELM learning algorithm can achieve the average testing rate 76.54% with 20 neurons, which is higher than the results obtained by many other popular algorithms such as bagging and boosting methods [15], C4.5 [15], and RBF [16] (cf. Table III). The BP algorithm performs very poorly in our simulations for this case. It can also be seen that the ELM learning algorithm runs around 1,000 times faster than BP, and 12 times faster than SVM, for this small problem, without considering that the C executable environment may run much faster than the MATLAB environment.

TABLE II
PERFORMANCE COMPARISON IN REAL MEDICAL DIAGNOSIS APPLICATION: DIABETES.

Algorithms | Training Time (s) | Training Success Rate | Testing Success Rate | No. of SVs/Neurons
ELM        | 0.015             | 78.71%                | 76.54%               | 20
BP         | 16.196            | 92.86%                | 63.45%               | 20
SVM        | 0.1860            | 78.76%                | 77.31%               | 317.16

TABLE III
PERFORMANCE COMPARISON IN REAL MEDICAL DIAGNOSIS APPLICATION: DIABETES.

Algorithms    | Testing Rate
ELM           | 76.54%
SVM [14]      | 76.50%
AdaBoost [15] | 75.60%
C4.5 [15]     | 71.60%
RBF [16]      | 76.30%

C. Benchmarking with Real-World Large Complex Application

We have also tested the performance of our ELM algorithm for large complex applications such as Forest Cover Type Prediction⁴.

Forest Cover Type Prediction is an extremely large complex classification problem with seven classes. The forest cover type for 30 x 30 meter cells was obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. There are 581,012 instances (observations) and 54 attributes per observation. In order to compare with the previous work [17], it was similarly modified into a binary classification problem where the goal was to separate class 2 from the other six classes. There are 100,000 training data and 481,012 testing data. The parameters for the SVM are C = 10 and γ = 2.

50 trials have been conducted for the ELM algorithm, and 1 trial for SVM since it takes a very long time to train SLFNs using SVM for this large complex case⁵. Seen from Table IV, the proposed ELM learning algorithm obtains better generalization performance than the SVM learning algorithm. However, the proposed ELM learning algorithm only spent 1.5 minutes on learning while SVM needed to spend nearly 12 hours on training. The learning speed has been dramatically increased, by around 430 times. On the other hand, since the number of support vectors obtained by SVM is much larger than the number of hidden neurons required by ELM, the testing time spent by the SVM on this large testing data set is more than 480 times that of the ELM. It takes more than 5.5 hours for the SVM to react to the 481,012 testing samples, whereas it takes less than 1 minute for the obtained ELM to react to the testing samples. That means that, after being trained and deployed, the ELM may react to new observations much faster than SVMs in such a real application. It should be noted that, in order to obtain as good a performance as possible for SVM, a long time has been spent on finding appropriate parameters for SVM. In fact the generalization performance of the SVM we obtained in our simulation for this case is much higher than the one reported in the literature [17].

³ ftp://ftp.ira.uka.de/pub/neuron/proben1.tar.gz
⁴ http://www.ics.uci.edu/~mlearn/MLRepository.html
⁵ Actually we have tested SVM for this case many times and always obtained similar results as presented here.

V. Discussions and Conclusions

This paper proposed a new learning algorithm for single-hidden layer feedforward neural networks (SLFNs)
called extreme learning machine (ELM), which has several interesting and significant features different from traditional popular gradient-based learning algorithms for feedforward neural networks:

1) The learning speed of ELM is extremely fast. It can train SLFNs much faster than classical learning algorithms. Previously, it seemed that there existed a virtual speed barrier which most (if not all) classic learning algorithms cannot break through, and it is not unusual to take a very long time to train a feedforward network using classic learning algorithms even for simple applications.
2) Unlike the traditional classic gradient-based learning algorithms which intend to reach minimum training error but do not consider the magnitude of the weights, the ELM tends to reach not only the smallest training error but also the smallest norm of weights. Thus, the proposed ELM tends to have better generalization performance for feedforward neural networks.
3) Unlike the traditional classic gradient-based learning algorithms which only work for differentiable activation functions, the ELM learning algorithm can be used to train SLFNs with non-differentiable activation functions.
4) Unlike the traditional classic gradient-based learning algorithms which face several issues such as local minima, improper learning rates and overfitting, the ELM tends to reach the solution straightforwardly without such trivial issues. The ELM learning algorithm looks much simpler than most learning algorithms for feedforward neural networks.

TABLE IV
PERFORMANCE COMPARISON OF THE ELM, BP AND SVM LEARNING ALGORITHMS IN FOREST TYPE PREDICTION APPLICATION.

Algorithms | Training Time (min) | Testing Time (min) | Training Success Rate | Testing Success Rate | No. of SVs/Neurons
ELM        | 1.6148              | 0.7195             | 92.35%                | 90.21%               | 200
SLFN [17]  | 12                  | N/A                | 82.44%                | 81.85%               | 100
SVM        | 693.60              | 347.78             | 91.70%                | 89.90%               | 31,806

It should be noted that gradient-based learning algorithms like back-propagation can be used for feedforward neural networks which have more than one hidden layer, while the proposed ELM algorithm in its present form is only valid for single-hidden layer feedforward networks (SLFNs). Fortunately, it has been found that SLFNs can approximate any continuous function [1], [8] and implement any classification application [18]. Thus, reasonably speaking, the proposed ELM algorithm can generally be used in many cases, and it needs to be investigated further in the future. On the other hand, more comprehensive experimental results of ELM on different artificial and real world benchmark problems can be found in our technical report⁶ [19].

⁶ http://www.ntu.edu.sg/eee/icis/cv/egbhuang.htm

References

[1] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991.
[2] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861-867, 1993.
[3] Y. Ito, "Approximation of continuous functions on R^d by linear combinations of shifted rotations of a sigmoid function with and without scaling," Neural Networks, vol. 5, pp. 105-115, 1992.
[4] G.-B. Huang and H. A. Babri, "Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions," IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 224-229, 1998.
[5] S. Tamura and M. Tateishi, "Capabilities of a four-layered feedforward neural network: Four layers versus three," IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 251-255, 1997.
[6] G.-B. Huang, "Learning capability and storage capacity of two-hidden-layer feedforward networks," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 274-281, 2003.
[7] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Real-time learning capability of neural networks," Technical Report ICIS/45/2003, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Apr. 2003.
[8] G.-B. Huang, L. Chen, and C.-K. Siew, "Universal approximation using incremental feedforward networks with arbitrary input weights," Technical Report ICIS/46/2003, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Oct. 2003.
[9] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525-536, 1998.
[10] D. Serre, Matrices: Theory and Applications. New York: Springer-Verlag, 2002.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation. New Jersey: Prentice Hall, 1999.
[12] P. L. Bartlett, "For valid generalization, the size of the weights is more important than the size of the network," in Advances in Neural Information Processing Systems (M. Mozer, M. Jordan, and T. Petsche, eds.), vol. 9, pp. 134-140, MIT Press, 1997.
[13] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/, Department of Computer Science and Information Engineering, National Taiwan University, Taiwan, 2003.
[14] G. Rätsch, T. Onoda, and K. R. Müller, "An improvement of AdaBoost to avoid overfitting," in Proceedings of the 5th International Conference on Neural Information Processing (ICONIP'1998), 1998.
[15] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in International Conference on Machine Learning, pp. 148-156, 1996.
[16] D. R. Wilson and T. R. Martinez, "Heterogeneous radial basis function networks," in Proceedings of the International Conference on Neural Networks (ICNN 96), pp. 1263-1267, June 1996.
[17] R. Collobert, S. Bengio, and Y. Bengio, "A parallel mixture of SVMs for very large scale problems," Neural Computation, vol. 14, pp. 1105-1114, 2002.
[18] G.-B. Huang, Y.-Q. Chen, and H. A. Babri, "Classification ability of single hidden layer feedforward neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 799-801, 2000.
[19] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine," Technical Report ICIS/03/2004, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Jan. 2004.
