Machine learning and artificial intelligence have seemingly never been as critical and important to real-life applications as they are in today's autonomous, big data era. The success of machine learning and artificial intelligence relies on the coexistence of three necessary conditions: powerful computing environments, rich and/or large data, and efficient learning techniques (algorithms). The extreme learning machine (ELM) as an emerging learning technique provides efficient unified solutions to generalized feed-forward networks, including but not limited to (both single- and multi-hidden-layer) neural networks, radial basis function (RBF) networks, and kernel learning. ELM theories1–4 show that hidden neurons are important but can be randomly generated and independent from applications, and that ELMs have both universal approximation and classification capabilities; they also build a direct link between multiple theories (specifically, ridge regression, optimization, neural network generalization performance, linear system stability, and matrix theory). Consequently, ELMs, which can be biologically inspired, offer significant advantages such as fast learning speed, ease of implementation, and minimal human intervention. They thus have strong potential as a viable alternative technique for large-scale computing and machine learning.

This special edition of Trends & Controversies includes eight original works that detail the further developments of ELMs in theories, applications, and hardware implementation. In "Representational Learning with ELMs for Big Data," the authors propose using the ELM as an auto-encoder for learning feature representations using singular values. In "A Secure and Practical Mechanism for Outsourcing ELMs in Cloud Computing," the authors propose a method for handling large data applications by outsourcing to the cloud that would dramatically reduce ELM training time. In "ELM-Guided Memetic Computation for Vehicle Routing," the authors consider the ELM as an engine for automating the encapsulation of knowledge memes from past problem-solving experiences. In "ELMVIS: A Nonlinear Visualization Technique Using Random Permutations and ELMs," the authors propose an ELM method for data visualization based on random permutations to map original data and their corresponding visualization points. In "Combining ELMs with Random Projections," the authors analyze the relationships between ELM feature-mapping schemas and the paradigm of random projections. In "Reduced ELMs for Causal Relation Extraction from Unstructured Text," the authors propose combining ELMs with neuron selection to optimize the neural network architecture and improve the ELM ensemble's computational efficiency. In "A System for Signature Verification Based on Horizontal and Vertical Components in Hand Gestures," the authors propose a novel paradigm for hand signature biometry for touchless applications without the need for handheld devices. Finally, in "An Adaptive and Iterative Online Sequential ELM-Based Multi-Degree-of-Freedom Gesture Recognition System," the authors propose an online sequential ELM-based efficient gesture recognition algorithm for touchless human-machine interaction.

We thank all the authors for their contributions to this special issue. We also thank IEEE Intelligent Systems and its editor in chief, Daniel Zeng, for the opportunity of publishing these works.

References
1. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879–892.
Figure 1. ELM-AE has the same solution as the original extreme learning machine except that its target output is the same as input x, and the hidden node parameters (a_i, b_i) are made orthogonal after being randomly generated. Here, g_i(x) = g(a_i, b_i, x) is the ith hidden node for input x. The ELM orthogonal random feature mapping yields a compressed representation when d > L, an equal dimension representation when d = L, and a sparse representation when d < L.

…randomly generated hidden parameters tends to improve ELM-AE's generalization performance.

According to ELM theory, ELMs are universal approximators,8 hence ELM-AE is as well. Figure 1 shows ELM-AE's network structure for compressed, sparse, and equal dimension representation. In ELM-AE, the orthogonal random weights and biases of the hidden nodes project the input data to a different or equal dimension space, as shown by the Johnson-Lindenstrauss lemma9 and calculated as

h = g(a \cdot x + b), \quad a^T a = I, \; b^T b = 1, \quad (5)

where a = [a_1, \ldots, a_L] are the orthogonal random weights, and b = [b_1, \ldots, b_L] are the orthogonal random biases between the input and hidden nodes. ELM-AE's output weight β is responsible for learning the transformation from the feature space to the input data. For sparse and compressed ELM-AE representations, we calculate output weights β as follows:

\beta = \left( \frac{I}{C} + H^T H \right)^{-1} H^T X, \quad (6)

where H = [h_1, \ldots, h_N] are ELM-AE's hidden layer outputs, and X = [x_1, \ldots, x_N] are its input and output data. For equal dimension ELM-AE representations, we calculate output weights β as follows:

\beta = H^{-1} X, \quad \beta^T \beta = I. \quad (7)

Singular value decomposition (SVD) is a commonly used method for feature representation. Hence we believe that ELM-AE performs feature representation similar to SVD. Equation 6's singular value decomposition (SVD) is

H\beta = \sum_{i=1}^{N} u_i \frac{d_i^2}{d_i^2 + C} u_i^T X, \quad (8)

where u are the eigenvectors of HH^T, and d are the singular values of H, related to the SVD of input data X. Because H is the projected feature space of X squashed via a sigmoid function, we hypothesize that ELM-AE's output weight β will learn to represent the features of the input data via singular values. To test if our hypothesis is correct, we created 10 mini datasets containing digits 0 to 9 from the MNIST dataset. Then we sent each mini dataset through an ELM-AE (network structure: 784-20-784) and compared the contents of the output weights β (Figure 2a) with the manually calculated rank 20 SVD (Figure 2b) for each mini dataset. As Figure 2 shows, ELM-AE output weight β and the manually calculated SVD basis represent similar features of the input data.

Multilayer neural networks perform poorly when trained with back propagation (BP) only, so we initialize hidden layer weights in a deep network by using layer-wise unsupervised training and fine-tune the whole neural network with BP. Similar to deep networks, ML-ELM hidden layer weights are initialized with ELM-AE, which performs layer-wise unsupervised training. However, in contrast to deep networks, ML-ELM doesn't require fine tuning.

ML-ELM hidden layer activation functions can be either linear or nonlinear piecewise. If the number of nodes L_k in the kth hidden layer is equal to the number of nodes L_{k-1} in the (k − 1)th hidden layer, g is chosen as linear; otherwise, g is chosen as nonlinear piecewise, such as a sigmoidal function:

H_k = g\left( (\beta^k)^T H_{k-1} \right), \quad (9)

where H_k is the kth hidden layer output matrix. The input layer x can be considered as the 0th hidden layer, where k = 0. The output of the connections between the last hidden layer and the output node t is analytically calculated using regularized least squares.
Figure 2. ELM-AE vs. singular value decomposition. (a) The output weights β of ELM-AE and (b) the rank-20 SVD basis show the feature representation of each number (0–9) in the MNIST dataset.
Performance Evaluation
The MNIST dataset is commonly used for testing deep network performance; it contains images of handwritten digits with 60,000 training samples and 10,000 testing samples. Table 1 shows the results of using the original MNIST dataset without any distortions to test the performance of ML-ELM with respect to DBNs, DBMs, SAEs, SDAEs, random feature ELMs, and Gaussian kernel ELMs.

Table 1. Performance comparison of ML-ELM with state-of-the-art deep networks.

Algorithm                                          Testing accuracy % (standard deviation %)   Training time
Multi-layer extreme learning machine (ML-ELM)      99.03 (±0.04)                               444.655 s
Extreme learning machine (ELM random features)     97.39 (±0.1)                                545.95 s
ELM (Gaussian kernel); run on a faster machine     98.75                                       790.96 s
Deep belief network (DBN)                          98.87                                       20,580 s
Deep Boltzmann machine (DBM)                       99.05                                       68,246 s
Stacked auto-encoder (SAE)                         98.6                                        –
Stacked denoising auto-encoder (SDAE)              98.72                                       –

We conducted the experiments on a laptop with a core i7 3740QM 2.7-GHz processor and 32 Gbytes of RAM running Matlab 2013a. Gaussian-kernel ELMs require more memory than 32 Gbytes, so we executed them on a high-performance computer with dual Xeon E5-2650 2-GHz processors and 256 Gbytes of RAM running Matlab 2013a. ML-ELM (network structure: 784-700-700-15000-10 with ridge parameters 10^−1 for layer 784-700, 10^3 for layer 700-15000, and 10^8 for layer 15000-10) with a sigmoidal hidden layer activation function generated an accuracy of 99.03. We used DBN and DBM network structures 784-500-500-2000-10 and 784-500-1000-10, respectively, to generate the results shown in Table 1. As a two-layer DBM network produces better results than a three-layer one,3 we tested the two-layer network.

As Table 1 shows, ML-ELM performs on par with DBMs and outperforms SAEs, SDAEs, DBNs, ELMs with random features, and Gaussian kernel ELMs. Furthermore, ML-ELM has the least amount of required training time with respect to deep networks:

• In contrast to deep networks, ML-ELM doesn't require fine-tuning.
• ELM-AE output weights can be determined analytically, unlike RBMs and traditional auto-encoders, which require iterative algorithms.
• ELM-AE learns to represent features via singular values, unlike RBMs and traditional auto-encoders, where the actual representation of data is learned.
Figure 3. Adding layers in ML-ELM. (a) ELM-AE output weights β^1 with respect to input data x are the first-layer weights of ML-ELM. (b) The output weights β^{i+1} of ELM-AE with respect to the ith hidden layer output h_i of ML-ELM are the (i + 1)th layer weights of ML-ELM. (c) The ML-ELM output layer weights are calculated using regularized least squares.
ELM-AE can be seen as a special case of ELM, where the input is equal to the output, and the randomly generated weights are chosen to be orthogonal (see Figure 3). ELM-AE's representation capability might provide a good solution to multilayer feed-forward neural networks. ELM-based multilayer networks seem to provide better performance than state-of-the-art deep networks.

References
1. G.E. Hinton and R.R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, 2006, pp. 504–507.
2. P. Vincent et al., "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," J. Machine Learning Research, vol. 11, 2010, pp. 3371–3408.
3. R. Salakhutdinov and H. Larochelle, "Efficient Learning of Deep Boltzmann Machines," J. Machine Learning Research, vol. 9, 2010, pp. 693–700.
4. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, 2006, pp. 489–501.
5. Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, 1998, pp. 2278–2324.
6. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 2, 2012, pp. 513–529.
7. B. Widrow et al., "The No-Prop Algorithm: A New Learning Algorithm for Multilayer Neural Networks," Neural Networks, vol. 37, 2013, pp. 182–188.
8. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879–892.
9. W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz Mappings into a Hilbert Space," Proc. Conf. Modern Analysis and Probability, vol. 26, 1984, pp. 189–206.

Liyanaarachchi Lekamalage Chamara Kasun is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at chamarak001@e.ntu.edu.sg.

Hongming Zhou is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at hmzhou@ntu.edu.sg.

Guang-Bin Huang is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at egbhuang@ntu.edu.sg.

Chi Man Vong is in the Faculty of Science and Technology, University of Macau. Contact him at cmvong@umac.mo.
…for each trial. Table 2 shows the results. With the increase of M, memory becomes the dominant computing resource when solving the ELM problem. The asymmetric speedup also increases, which means that the larger the problem's overall size, the larger the speedups the proposed mechanism can achieve.

The training accuracy climbs steadily from 83 to 95 percent with the number of hidden nodes, while the testing accuracy varies between 80 and 84 percent. We also tested the proposed mechanism over the whole CIFAR-10 dataset with feature extraction performed in advance. SVM and Fastfood11 built on ELM can achieve 42.3 and 63.1 percent testing accuracy, respectively, while our method can achieve 64.5 percent testing accuracy. To find the specific M for the ELM problem with the best testing accuracy, customers might want to run multiple experiments under different values of M. Then, they can realize the computing power of the cloud in a way that tests multiple ELM problems with different values of M.

By outsourcing the calculation of the Moore-Penrose generalized inverse, which is the computationally heaviest operation in the ELM, Partitioned ELM can release the customer from the heavy burden of expensive computations. The high physical savings of computing resources and the literally unlimited resources in cloud computing enable our proposed mechanism to be applied to multiple big data applications.

Acknowledgments
This work was supported by the National Natural Science Foundation of China (project no. 61379145, 61170287, 61232016, 61070198). This research has been enabled by the use of computing resources provided by WestGrid and Compute/Calcul Canada. We thank Guang-Bin Huang and the reviewers for their constructive and insightful comments on this article.

References
1. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," Proc. Int'l Joint Conf. Neural Networks (IJCNN 04), IEEE, 2004, pp. 985–990.
4. Q. He et al., "Parallel Extreme Learning Machine for Regression Based on MapReduce," Neurocomputing, vol. 102, 2013, pp. 52–58.
5. M. van Heeswijk et al., "GPU-Accelerated and Parallelized ELM Ensembles for Large-Scale Regression," Neurocomputing, vol. 74, no. 16, 2011, pp. 2430–2437.
6. D. Serre, Matrices: Theory and Applications, Springer, 2010.
7. Y. Cheng et al., "Efficient Revocation in Ciphertext-Policy Attribute-Based Encryption Based Cryptographic Cloud Storage," J. Zhejiang University-Science C (Computers & Electronics), vol. 14, Feb. 2013, pp. 85–97.
8. C. Wang, K. Ren, and J. Wang, "Secure and Practical Outsourcing of Linear Programming in Cloud Computing," Proc. INFOCOM, IEEE, 2011, pp. 820–828.
9. P. Shi et al., "Dependable Deployment Method for Multiple Applications in Cloud Services Delivery Network," China Communications, vol. 8, July 2011, pp. 65–75.
10. A. Krizhevsky and G. Hinton, "Learning Multiple Layers of Features from Tiny Images," master's thesis, Dept. of Computer Science, University of Toronto, 2009.
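As a rough illustration of this division of labor, the following Python sketch keeps the cheap steps on the customer's side and delegates only the generalized inverse. It is a minimal outline under our own simplifications, not the authors' Partitioned ELM protocol: in particular, it omits the masking and partitioning that make the real mechanism secure, and cloud_pinv merely stands in for the remote service.

```python
import numpy as np

def cloud_pinv(H):
    """Stand-in for the outsourced step: the cloud computes the
    Moore-Penrose generalized inverse, the heaviest ELM operation."""
    return np.linalg.pinv(H)

def train_elm_outsourced(X, T, n_hidden, seed=0):
    """Customer side: build the random hidden layer locally, delegate pinv(H)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)          # cheap: stays with the customer
    beta = cloud_pinv(H) @ T        # expensive inverse outsourced; final product local
    return W, b, beta
```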
Figure 5. Realistic logistic vehicle routing: (a) the logistical vehicle routing in a typical courier service, and (b) a graph representation of that same routing plan.
…traveled by all vehicles, Cost_VRPSD(s), as given by

\mathrm{Cost}_{VRPSD}(s) = \sum_{k=1}^{K} L_{VRPSD}(\tau_k), \quad (1)

where L_{VRPSD}(\tau_k) is the expected distance traveled by vehicle k.

ELM-Guided Memetic Computation
The ELM was proposed by Guang-Bin Huang and colleagues5 for single-layer feed-forward neural networks (SLFNs). It reported notable generalization performance with high learning efficiency and little human intervention. The training process is equivalent to finding a least-squares solution β of the linear system Hβ = T, where H is the hidden-layer output matrix, and T is the target output.

Learning of Task Assignments from Previous Routing Experiences
The objective of learning task assignment via the ELM is to create association lists of customers to vehicles from optimized routes. Suppose V = {v_i | i = {1, …, n}}, where n is the number of customers, and s = {v_0, v_1, v_2, v_3, v_0, ..., v_i, v_0} denote customer data and optimized routes, respectively. The location (the Cartesian coordinates) of each customer v_i defines the features of the learning task, v_i = {x_1, …, x_i, …, x_d}, where d denotes the dimension. An SLFN-ELM structure is then designed to learn the task pair vectors v_i and v_j that are served by a common vehicle in s. To achieve this goal, we define the task pair feature vector representation as

\{f(v_i), f(v_j)\} = \left\{ |x_1^i - x_1^j|, \ldots, |x_d^i - x_d^j| \right\}, \quad (2)

where |·| denotes the absolute value operation. If v_i and v_j are served by a common vehicle in s, the respective {f(v_i), f(v_j)} will be classified with output 1; otherwise, they will be classified with output 0. The training data of class 1 task pairs and class 0 task pairs are extracted from the obtained optimized routes s. In this manner, the recommendations for effective task assignments on unseen VRPSDs are realized via the ELM trained from previous routing experiences.

Prediction of Task Assignments in Unseen VRPSDs
The recommendations of effective task assignments involve a prediction of the vehicle to be assigned to serve each customer of the unseen VRPSD of interest. Given routing customers V' = {v'_i | i = {1, ..., m}}, where m is the number of customers, the task pairs {f(v'_i), f(v'_j)} are constructed via Equation 2. The Hβ output of the trained ELM classifier describes how probable it is that the task pairs will be served by a common vehicle. With the sigmoid S(t) = 1/(1 + e^{-t}), S(Hβ) then gives the distances between constructed task pairs in the unseen VRPSD. In this manner, for m customers, an m × m symmetric distance matrix DM is attained, and simple clustering (such as K-medoids) on DM leads to the prediction of the task assignments. The predicted task assignments are then encoded to form the population of unseen VRPSD solution individuals in an EA so as to positively bias the search toward high-quality solutions.
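To make Equation 2 and the clustering step concrete, here is a small Python sketch. It is our paraphrase, not the authors' implementation: elm_score stands in for the trained classifier's Hβ output, and the K-medoids step is only named, not implemented.

```python
import numpy as np

def task_pair_feature(v_i, v_j):
    """Eq. 2: elementwise absolute difference of two customers' coordinates."""
    return np.abs(np.asarray(v_i) - np.asarray(v_j))

def predict_distance_matrix(customers, elm_score):
    """Build the m x m symmetric distance matrix DM via S(H*beta).
    elm_score maps a task pair feature vector to the raw ELM output."""
    m = len(customers)
    DM = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            raw = elm_score(task_pair_feature(customers[i], customers[j]))
            DM[i, j] = DM[j, i] = 1.0 / (1.0 + np.exp(-raw))   # sigmoid S(t)
    return DM
```

Simple clustering, such as K-medoids, applied to DM then yields the predicted task assignments that seed the EA population.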
Figure 6. Projecting high-dimensional spiral manifold data x_i to lower-dimensional visualization space points v_i. Visualization points are fixed, and only the pairings (stored in an ordering matrix O) of the original and visualization data samples are changed.
Several such swaps constitute an update:

V_{iter} \leftarrow V_{iter-1} O_{iter}. \quad (1)

ELMVIS starts by initializing N visualization space points v_i, taken either from a Gaussian distribution or from a regular grid. Then an ELM is initialized, and the ordering matrix O is set to an identity matrix. An initial reconstruction MSE is calculated, after which an iteration starts by choosing a random number of samples out of N and permuting the corresponding rows of O. The ordering matrix O is applied to the visualization points by multiplication, which permutes the prototypes V in the same way. The reconstruction error is recalculated: if it increases, the permutation of rows of O is rolled back; a new iteration begins by again choosing a number of samples and permuting the corresponding rows in O. Convergence is achieved once the error attains a desired threshold or the iteration limit is reached.

Adapting the ELM for Data Visualization
The direct data visualization algorithm requires recalculation of the whole ELM. The most computationally costly part is the recalculation of matrix H and its pseudo-inverse H†. For changes in V, the whole ELM needs recalculating, but for changes in X, the points V and a hidden layer representation H can remain constant, so only the output weight matrix needs to be updated.

The reconstruction mean squared error

MSE_{rec} = \frac{1}{ND} \sum_{i=1}^{N} \sum_{j=1}^{D} (\hat{x}_{ij} - x_{ij})^2

depends on the \hat{x}_i, which is an output of an ELM trained using data pairs (v_i, x_i). But the solution of the ELM is a linear system of equations, and the nonlinear part of the ELM is applied to each transformed input vector separately of the others. So the nonlinear mapping of an ELM is independent of the order of training pairs (v_i, x_i), as is the MSE_rec. This fact lets us adapt the ELM in ELMVIS to cut the computational load. Multiplying an ordering matrix O with either V or X yields exactly the same new pairs (v'_i, x'_i), although their order will differ. But because the reconstruction error doesn't depend on a particular ordering of the pairs, these operations are interchangeable. Our proposed adaptation of the ELM thus consists of replacing changes in V by changes in X, as in Equation 2:

(X_{iter} \leftarrow X_{iter-1} O_{iter}) \Leftarrow (V_{iter} \leftarrow V_{iter-1} O_{iter}). \quad (2)

In the ELM structure, replacing changes in V with changes in X keeps the matrices H and H† constant. They need to be calculated only once on initialization; during iterations, the reconstruction of X is obtained using the following rule:

\hat{X} = H\beta = H(H^{\dagger} X) = (H H^{\dagger}) X. \quad (3)

Denoting a new matrix H_2 = H H^{\dagger} and calculating it at initialization, the training of the ELM on each iteration is reduced to a single matrix multiplication. This gives the necessary speed to run hundreds of thousands or even millions of iterations within a few minutes.
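The payoff of Equations 2 and 3 is that each trial costs one matrix product. The sketch below is our outline, not the released ELMVIS code: the tanh hidden layer, pairwise swaps instead of arbitrary row permutations, and the iteration budget are all assumptions.

```python
import numpy as np

def elmvis(X, V, n_neurons=5, n_iter=100_000, seed=0):
    """ELMVIS by random permutations (Eqs. 1-3).
    X: N x D data; V: N x d fixed visualization points, paired by index."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    W = rng.standard_normal((V.shape[1], n_neurons))
    b = rng.standard_normal(n_neurons)
    H = np.tanh(V @ W + b)             # fixed, because V never changes
    H2 = H @ np.linalg.pinv(H)         # H2 = H H^+, computed once (Eq. 3)
    order = np.arange(N)               # implicit ordering matrix O
    err = np.mean((H2 @ X[order] - X[order]) ** 2)   # MSE_rec
    for _ in range(n_iter):
        i, j = rng.choice(N, size=2, replace=False)  # permute two rows of O
        order[[i, j]] = order[[j, i]]
        new_err = np.mean((H2 @ X[order] - X[order]) ** 2)
        if new_err <= err:
            err = new_err                      # keep the swap
        else:
            order[[i, j]] = order[[j, i]]      # roll it back
    return order, err
```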
Experimental Results
The ELMVIS visualization methodology was tested on three datasets. The selected reference methods are PCA as the baseline, SOM2 as another method that uses fixed visualization points, and NeRV3 as a state-of-the-art nonlinear visualization method.

The primary comparison uses the reconstruction error, an MSE of a reconstruction of the original data. A visualization method is assumed to have good performance if its visualization has a low MSE_rec. Reverse projection of visualized data to the original space is required to obtain the error; for NeRV, the only method that doesn't provide such a projection, the reverse projection is learned by using a separate ELM. Table 4 lists the errors for all methods.

Table 4. Reconstruction errors for all methods.

Dataset           PCA     SOM     NeRV    ELMVIS (Gaussian)   ELMVIS (PCA)
Spiral            0.482   0.054   0.011   0.049               0.060
Sculpture faces   0.980   0.916   0.769   0.718               0.724
Real faces        0.724   0.511   0.501   0.462               0.449

The first dataset for testing is a spiral toy dataset, a common and relatively hard benchmark. The spiral is drawn in a 2D space, and the goal is to project it into one dimension. It consists of N = 100 points, distributed evenly along its line by including a square root term in the input data X equation:

X = \begin{pmatrix} \sqrt{2\alpha}\,\cos(\pi L \alpha) \\ \sqrt{2\alpha}\,\sin(\pi L \alpha) \end{pmatrix}, \quad (4)

where α is distributed evenly between 0 and 1; L determines the number of swings the spiral makes and is set to 3 in the experiment. The visualization points V are evenly distributed on a line, and both X and V are normalized to have zero mean and unit variance. In this experiment, the number of neurons of the ELM and SOM is set to 5. Figure 7 shows the ELMVIS model and data mapping; Figure 8 shows a reconstruction learned from NeRV results.

Figure 7. An example of ELMVIS fitting the spiral data (100 points, 5 neurons). The thinner color line is a back projection of the ELM; black lines and the color gradient denote the ordering of points. Some points are mapped incorrectly because the solution isn't exact.

The PCA projection squashes the second dimension of the spiral along the direction of the largest variance. NeRV succeeded in finding a manifold, showing great results even after estimating its mapping by a separate ELM. SOM showed good results as well. ELMVIS partially unfolds the spiral, but some parts remain torn and misplaced. Also, eventual outliers appear because the random permutation algorithm hasn't found the best solution in a given range of iterations. Still, the results of ELMVIS on the spiral dataset are acceptable, far better than the naive PCA.

We also tested the experimental convergence speed of ELMVIS; the spiral test is the fastest of the three due to a smaller number of neurons and lower original data dimensionality, while convergence speed is independent of these values and only relies on the number of test points. Note that the graphs here represent averages over many runs; other results of ELM runs show the best outcome, corresponding to the best random initialization of the hidden layer of that ELM.

As stated earlier, the complexity of the exact solution of ELMVIS is factorial in the number of points. The real…
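For reference, the spiral of Equation 4 takes only a few lines to generate. This is our reading of the garbled formula: we assume the square root applies to the 2α factor, so treat the sketch as an approximation of the authors' setup.

```python
import numpy as np

def spiral(n=100, L=3):
    """Spiral toy dataset (Eq. 4): alpha evenly spaced in [0, 1];
    L controls the number of swings."""
    alpha = np.linspace(0.0, 1.0, n)
    r = np.sqrt(2.0 * alpha)                    # assumed square-root term
    X = np.stack([r * np.cos(np.pi * L * alpha),
                  r * np.sin(np.pi * L * alpha)], axis=1)
    # Normalize to zero mean and unit variance, as in the experiment.
    return (X - X.mean(axis=0)) / X.std(axis=0)
```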
5. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879–892.
6. C.R. Rao and S.K. Mitra, Generalized Inverse of a Matrix and Its Applications, J. Wiley, 1971.

Anton Akusok is a PhD student in the Department of Information and Computer Science at Aalto University, Finland. Contact him at anton.akusok@aalto.fi.

Amaury Lendasse is a docent in the Department of Information and Computer Science at Aalto University, Finland, and is also affiliated with IKERBASQUE, Basque Foundation for Science; the Computational Intelligence Group, Computer Science Faculty, University of the Basque Country; and Arcada University of Applied Sciences. Contact him at amaury.lendasse@aalto.fi.

Francesco Corona is a docent in the Department of Information and Computer Science at Aalto University, Finland. Contact him at francesco.corona@aalto.fi.

Rui Nian is an associate professor in the College of Information and Engineering at Ocean University, China. Contact her at nianrui_80@163.com.

Yoan Miche is a postdoctoral researcher in the Department of Information and Computer Science at Aalto University, Finland. Contact him at yoan.miche@aalto.fi.

Combining ELMs with Random Projections

Paolo Gastaldo and Rodolfo Zunino, University of Genoa, Italy
Erik Cambria, MIT Media Laboratory
Sergio Decherchi, Italian Institute of Technology, Italy

In the extreme learning machine (ELM) model,1 a single-layer feed-forward network performs effective supervised learning by combining two distinct components. A hidden layer performs an explicit mapping of the input space to a feature space; the mapping isn't subject to any optimization, since all the parameters in the hidden nodes are set randomly. The output layer includes the only degrees of freedom—that is, the weights of the links that connect hidden neurons to output neurons. Thus, training requires solving a linear system by a convex optimization problem. The literature has proven that the ELM approach can attain a notable representation ability.1

According to the ELM scheme, the configuration of the hidden nodes ultimately defines the feature mapping to be adopted. Actually, the ELM model can support a wide class of activation functions. Indeed, an extension of the ELM approach to kernel functions has been discussed in the literature.1

Here, we address the specific role played by feature mapping in the ELM. The goal is to analyze the relationships between such a feature-mapping schema and the paradigm of random projection (RP).2 RP is a prominent technique for dimensionality reduction that exploits random subspaces. This research shows that RP can support the design of a novel ELM approach, which combines generalization performance with computational efficiency. The latter aspect is attained by the RP-based model, which always performs a dimensionality reduction in the feature-mapping stage, and therefore shrinks the number of nodes in the hidden layer.

ELM Feature Mapping
Let x ∈ ℜ^d denote an input vector. The function f(x) of an output neuron in an ELM that adopts L hidden units is written as

f(x) = \sum_{j=1}^{L} w_j \cdot a(r_j \cdot x + b_j). \quad (1)

Thus, a set of random weights {r_j ∈ ℜ^d; j = 1, …, L} connects the input to the hidden layer; the jth hidden neuron embeds a random bias term b_j and a nonlinear activation function a(.). A vector of weighted links, w ∈ ℜ^L, connects the hidden layer to the output neuron.

The vector quantity w = [w_1, ..., w_L] embeds the degrees of freedom in the ELM learning process, which can be formalized after introducing the following notations:

• X is the N × (d + 1) matrix that originates from the training set. X stems from a set of N labeled pairs (x_i, y_i), where x_i is the ith input vector and y_i ∈ ℜ is the associated expected target value.
• R is the (d + 1) × L matrix with the random weights.

Here, by using a common trick, both the input vector x and the random weights r_j are extended to x := [x_1, ..., x_d, 1] and r_j ∈ ℜ^{d+1} to include the bias term.

Accordingly, the ELM learning process requires solving the following linear system:

y = Hw, \quad (2)

where H is the hidden layer output matrix obtained by applying the activation function, a(), to every element of the matrix

XR. \quad (3)

Equation 3 clarifies that in the ELM scheme in Equation 1, the hidden layer performs a mapping of the original d-dimensional space into an L-dimensional space through the random matrix R, which is set randomly.
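Equations 1 through 3 translate directly into a few lines of linear algebra. The Python sketch below is ours, not the authors' RP-ELM; the sigmoid choice for a() and the pseudoinverse solve are assumptions.

```python
import numpy as np

def elm_fit(X, y, L, seed=0):
    """ELM training per Eqs. 1-3: random feature map H = a(XR), then solve y = Hw."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])     # bias trick: x := [x_1, ..., x_d, 1]
    R = rng.standard_normal((d + 1, L))      # (d+1) x L random weight matrix
    H = 1.0 / (1.0 + np.exp(-(Xb @ R)))      # a() applied elementwise to XR
    w = np.linalg.pinv(H) @ y                # least-squares solution of y = Hw
    return R, w

def elm_predict(X, R, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (1.0 / (1.0 + np.exp(-(Xb @ R)))) @ w
```

In the RP-ELM reading, choosing L well below the input dimension makes the map XR itself a random-projection dimensionality reduction, which is what shrinks the hidden layer.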
The experimental session aimed to evaluate the ability of the RP-ELM model to suitably trade off generalization performance and computational complexity (that is, the number of nodes in the hidden layer). It's worth noting that the experiments didn't address gene selection. Table 5 reports the results of the two experiments and gives the error rates attained for 10 different settings of L. In both cases, the highest values of L corresponded to a compression ratio of 1:20 in the feature-mapping stage. The performances were assessed by adopting a leave-one-out (LOO) scheme, which yielded the most reliable estimates in the presence of limited-size datasets. Error rates were worked out as the percentage of misclassified patterns over the test set.

The table compares the results of the RP-ELM model with those attained by the standard ELM model. Results showed that, in both experiments, RP-ELM attained lower error rates than the standard ELM. Moreover, the RP-ELM performed comparably with approaches reported in the literature, in which ELM models included 1,000+ neurons and didn't adopt a LOO validation procedure.

Our theory showed that, by a direct implementation of the JL lemma, we can sharply reduce the number of neurons in the hidden node without affecting the generalization performance in prediction accuracy. As a result, the eventual learning machine always benefits from a considerable simplification in the feature-mapping stage. This allows the RP-ELM model to properly balance classification accuracy and resource occupation. The experiments also showed that the proposed model can attain satisfactory performance. Further investigations will aim to confirm the effectiveness of the RP-ELM scheme by additional theoretical insights and a massive campaign of experiments.

References
1. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 2, 2012, pp. 513–529.
2. R. Baraniuk et al., "A Simple Proof of the Restricted Isometry Property for Random Matrices," Constructive Approximation, vol. 28, no. 3, 2008, pp. 253–263.
3. G.-B. Huang, D.H. Wang, and Y. Lan, "Extreme Learning Machines: A Survey," Int'l J. Machine Learning and Cybernetics, vol. 2, no. 2, 2011, pp. 107–122.

Erik Cambria is an associate researcher at the MIT Media Laboratory. Contact him at cambria@media.mit.edu.

Sergio Decherchi is a postdoc researcher at the Italian Institute of Technology, Italy. Contact him at sergio.decherchi@iit.it.

Reduced ELMs for Causal Relation Extraction from Unstructured Text

Xuefeng Yang and Kezhi Mao, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Natural language is the major intermediary tool for human communication. However, it's unstructured and therefore hard for computers to understand. In recent decades, knowledge extraction, which transfers unstructured language text into machine-understandable knowledge, has received considerable attention.1,2 Knowledge can be categorized into descriptive and logic information, both of which are indispensable in knowledge expression. Think of the following example: Jim is happy today because his favourite basketball team won the final.
Figure 11. A flow diagram of the proposed hand gesture signature verification system. (a) and (b) The user's hand signature is captured using a depth sensor and stored as a video sequence. (c) Each sample is preprocessed and (d) represented by a set of directional features. (e) Finally, the obtained match scores are fused using TERELM.
Figure 15. Incremental experiment results, comparing (a) the testing accuracy of each increment with (b) the training time of each increment for ELM, SVM, OS-ELM, and ALOS-ELM.