
TRENDS & CONTROVERSIES

Editor: Erik Cambria, MIT Media Laboratory, cambria@media.mit.edu

Extreme Learning Machines

Erik Cambria, MIT Media Laboratory


Guang-Bin Huang, Nanyang Technological University, Singapore

Machine learning and artificial intelligence have seemingly never been as critical and important to real-life applications as they are in today's autonomous, big data era. The success of machine learning and artificial intelligence relies on the coexistence of three necessary conditions: powerful computing environments, rich and/or large data, and efficient learning techniques (algorithms). The extreme learning machine (ELM) as an emerging learning technique provides efficient unified solutions to generalized feed-forward networks including but not limited to (both single- and multi-hidden-layer) neural networks, radial basis function (RBF) networks, and kernel learning. ELM theories1-4 show that hidden neurons are important but can be randomly generated and independent from applications, and that ELMs have both universal approximation and classification capabilities; they also build a direct link between multiple theories (specifically, ridge regression, optimization, neural network generalization performance, linear system stability, and matrix theory). Consequently, ELMs, which can be biologically inspired, offer significant advantages such as fast learning speed, ease of implementation, and minimal human intervention. They thus have strong potential as a viable alternative technique for large-scale computing and machine learning.

This special edition of Trends & Controversies includes eight original works that detail the further developments of ELMs in theories, applications, and hardware implementation. In "Representational Learning with ELMs for Big Data," the authors propose using the ELM as an auto-encoder for learning feature representations using singular values. In "A Secure and Practical Mechanism for Outsourcing ELMs in Cloud Computing," the authors propose a method for handling large data applications by outsourcing to the cloud that would dramatically reduce ELM training time. In "ELM-Guided Memetic Computation for Vehicle Routing," the authors consider the ELM as an engine for automating the encapsulation of knowledge memes from past problem-solving experiences. In "ELMVIS: A Nonlinear Visualization Technique Using Random Permutations and ELMs," the authors propose an ELM method for data visualization based on random permutations to map original data and their corresponding visualization points. In "Combining ELMs with Random Projections," the authors analyze the relationships between ELM feature-mapping schemas and the paradigm of random projections. In "Reduced ELMs for Causal Relation Extraction from Unstructured Text," the authors propose combining ELMs with neuron selection to optimize the neural network architecture and improve the ELM ensemble's computational efficiency. In "A System for Signature Verification Based on Horizontal and Vertical Components in Hand Gestures," the authors propose a novel paradigm for hand signature biometry for touchless applications without the need for handheld devices. Finally, in "An Adaptive and Iterative Online Sequential ELM-Based Multi-Degree-of-Freedom Gesture Recognition System," the authors propose an online sequential ELM-based efficient gesture recognition algorithm for touchless human-machine interaction.

We thank all the authors for their contributions to this special issue. We also thank IEEE Intelligent Systems and its editor in chief, Daniel Zeng, for the opportunity of publishing these works.



References
1. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879-892.
2. G.-B. Huang, X. Ding, and H. Zhou, "Optimization Method Based Extreme Learning Machine for Classification," Neurocomputing, vol. 74, 2010, pp. 155-163.
3. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, 2006, pp. 489-501.
4. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 2, 2012, pp. 513-529.

Erik Cambria is an associate researcher at the MIT Media Laboratory. Contact him at cambria@media.mit.edu.

Guang-Bin Huang is in the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at egbhuang@ntu.edu.sg.

Representational Learning with ELMs for Big Data

Liyanaarachchi Lekamalage Chamara Kasun, Hongming Zhou, and Guang-Bin Huang, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Chi Man Vong, Faculty of Science and Technology, University of Macau

A machine learning algorithm's generalization capability depends on the dataset, which is why engineering a dataset's features to represent the data's salient structure is important. However, feature engineering requires domain knowledge and human ingenuity to generate appropriate features.

Geoffrey Hinton1 and Pascal Vincent2 showed that a restricted Boltzmann machine (RBM) and auto-encoders could be used for feature engineering. These engineered features then could be used to train multiple-layer neural networks, or deep networks. Two types of deep networks based on RBMs exist: the deep belief network (DBN)1 and the deep Boltzmann machine (DBM).3 The two types of auto-encoder-based deep networks are the stacked auto-encoder (SAE)2 and the stacked denoising auto-encoder (SDAE).3 DBNs and DBMs are created by stacking RBMs, whereas SAEs and SDAEs are created by stacking auto-encoders. Deep networks outperform traditional multilayer neural networks, single-layer feed-forward neural networks (SLFNs), and support vector machines (SVMs) for big data, but are tainted by slow learning speeds.

Guang-Bin Huang and colleagues4 introduced the extreme learning machine (ELM) as an SLFN with a fast learning speed and good generalization capability. Similar to deep networks, our proposed multilayer ELM (ML-ELM) performs layer-by-layer unsupervised learning. This article also introduces the ELM auto-encoder (ELM-AE), which represents features based on singular values. Resembling deep networks, ML-ELM stacks on top of ELM-AE to create a multilayer neural network. It learns significantly faster than existing deep networks, outperforming DBNs, SAEs, and SDAEs and performing on par with DBMs on the MNIST5 dataset.

Representation Learning
The ELM theory for SLFNs shows that hidden nodes can be randomly generated. The input data is mapped to an L-dimensional ELM random feature space, and the network output is

f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta,   (1)

where β = [β1, ..., βL]^T is the output weight matrix between the hidden nodes and the output nodes, h(x) = [g1(x), ..., gL(x)] are the hidden node outputs (random hidden features) for input x, and gi(x) is the output of the ith hidden node. Given N training samples {(xi, ti)}, i = 1, ..., N, the ELM resolves the following learning problem:

H\beta = T,   (2)

where T = [t1, ..., tN]^T are the target labels and H = [h^T(x1), ..., h^T(xN)]^T. We can calculate the output weights β from

\beta = H^{\dagger} T,   (3)

where H† is the Moore-Penrose generalized inverse of matrix H.

To improve generalization performance and make the solution more robust, we can add a regularization term as shown elsewhere:6

\beta = \left( \frac{I}{C} + H^{T} H \right)^{-1} H^{T} T.   (4)
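The least-squares solutions of Equations 3 and 4 fit in a few lines of linear algebra. The NumPy sketch below is an illustrative implementation only, not the authors' code: the function and variable names are ours, and the sigmoid hidden layer is just one admissible choice of activation function.

import numpy as np

def elm_train(X, T, L, C=None, rng=np.random.default_rng(0)):
    """Train a basic ELM (Eqs. 1-3) or its regularized form (Eq. 4).

    X: (N, d) inputs, T: (N, m) targets, L: number of hidden nodes,
    C: regularization constant (None reproduces the pseudo-inverse solution).
    """
    d = X.shape[1]
    W = rng.standard_normal((d, L))          # random input weights, never tuned
    b = rng.standard_normal(L)               # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden layer output matrix (Eq. 1)
    if C is None:
        beta = np.linalg.pinv(H) @ T         # beta = H† T (Eq. 3)
    else:
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)  # Eq. 4
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # f_L(x) = h(x) beta

For classification, each ti would typically be a one-hot target vector, with the predicted class taken as the index of the largest output.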
ELM-AE's main objective is to represent the input features meaningfully in three different representations:

• Compressed. Represent features from a higher dimensional input data space in a lower dimensional feature space.
• Sparse. Represent features from a lower dimensional input data space in a higher dimensional feature space.
• Equal. Represent features in a feature space whose dimension equals that of the input data space.

The ELM is modified as follows to perform unsupervised learning: the input data is used as the output data (t = x), and the random weights and biases of the hidden nodes are chosen to be orthogonal. Bernard Widrow and colleagues7 introduced a least mean square (LMS) implementation for the ELM and a corresponding ELM-based auto-encoder that uses nonorthogonal random hidden parameters (weights and biases). Orthogonalizing these randomly generated hidden parameters tends to improve ELM-AE's generalization performance.



Figure 1. ELM-AE has the same solution as the original extreme learning machine except that its target output is the same as input x, and the hidden node parameters (ai, bi) are made orthogonal after being randomly generated. Here, gi(x) = g(ai, bi, x) is the ith hidden node output for input x. The figure shows the ELM orthogonal random feature mapping from d input nodes through L hidden nodes back to d output nodes: d > L gives a compressed representation, d = L an equal dimension representation, and d < L a sparse representation.

According to ELM theory, ELMs are universal approximators,8 hence ELM-AE is as well. Figure 1 shows ELM-AE's network structure for compressed, sparse, and equal dimension representation. In ELM-AE, the orthogonal random weights and biases of the hidden nodes project the input data to a different or equal dimension space, as shown by the Johnson-Lindenstrauss lemma,9 and are calculated as

h = g(a \cdot x + b),\quad a^{T} a = I,\quad b^{T} b = 1,   (5)

where a = [a1, ..., aL] are the orthogonal random weights and b = [b1, ..., bL] are the orthogonal random biases between the input and hidden nodes. ELM-AE's output weight β is responsible for learning the transformation from the feature space to the input data. For sparse and compressed ELM-AE representations, we calculate the output weights β as follows:

\beta = \left( \frac{I}{C} + H^{T} H \right)^{-1} H^{T} X,   (6)

where H = [h1, ..., hN] are ELM-AE's hidden layer outputs and X = [x1, ..., xN] are its input and output data. For equal dimension ELM-AE representations, we calculate the output weights β as follows:

\beta = H^{-1} T,\quad \beta^{T} \beta = I.   (7)

Singular value decomposition (SVD) is a commonly used method for feature representation, and we believe that ELM-AE performs feature representation similar to SVD. Expressing Equation 6 through the SVD gives

H\beta = \sum_{i=1}^{N} u_i \frac{d_i^2}{d_i^2 + C} u_i^{T} X,   (8)

where the u are the eigenvectors of HH^T and the d are the singular values of H, related to the SVD of the input data X. Because H is the projected feature space of X squashed via a sigmoid function, we hypothesize that ELM-AE's output weight β will learn to represent the features of the input data via singular values. To test whether our hypothesis is correct, we created 10 mini datasets containing digits 0 to 9 from the MNIST dataset. We then sent each mini dataset through an ELM-AE (network structure: 784-20-784) and compared the contents of the output weights β (Figure 2a) with the manually calculated rank-20 SVD (Figure 2b) for each mini dataset. As Figure 2 shows, the ELM-AE output weights β resemble the manually calculated SVD basis.

Multilayer neural networks perform poorly when trained with back propagation (BP) only, so we initialize the hidden layer weights in a deep network by using layer-wise unsupervised training and fine-tune the whole neural network with BP. Similar to deep networks, ML-ELM hidden layer weights are initialized with ELM-AE, which performs layer-wise unsupervised training. However, in contrast to deep networks, ML-ELM doesn't require fine tuning.

ML-ELM hidden layer activation functions can be either linear or nonlinear piecewise. If the number of nodes Lk in the kth hidden layer is equal to the number of nodes Lk-1 in the (k - 1)th hidden layer, g is chosen as linear; otherwise, g is chosen as nonlinear piecewise, such as a sigmoidal function:

H^{k} = g\left((\beta^{k})^{T} H^{k-1}\right),   (9)

where H^k is the kth hidden layer output matrix. The input layer x can be considered as the 0th hidden layer, where k = 0. The output connections between the last hidden layer and the output node t are calculated analytically using regularized least squares.



Figure 2. ELM-AE vs. singular value decomposition. (a) The output weights β of ELM-AE and (b) the rank-20 SVD basis show the feature representation of each number (0-9) in the MNIST dataset.

Performance Evaluation
The MNIST dataset is commonly used for testing deep network performance; it contains images of handwritten digits with 60,000 training samples and 10,000 testing samples. Table 1 shows the results of using the original MNIST dataset without any distortions to test the performance of ML-ELM with respect to DBNs, DBMs, SAEs, SDAEs, random feature ELMs, and Gaussian kernel ELMs.

Table 1. Performance comparison of ML-ELM with state-of-the-art deep networks.

Algorithm | Testing accuracy % (standard deviation %) | Training time
Multilayer extreme learning machine (ML-ELM) | 99.03 (±0.04) | 444.655 s
Extreme learning machine (ELM random features) | 97.39 (±0.1) | 545.95 s
ELM (Gaussian kernel); run on a faster machine | 98.75 | 790.96 s
Deep belief network (DBN) | 98.87 | 20,580 s
Deep Boltzmann machine (DBM) | 99.05 | 68,246 s
Stacked auto-encoder (SAE) | 98.6 | –
Stacked denoising auto-encoder (SDAE) | 98.72 | –

We conducted the experiments on a laptop with a Core i7-3740QM 2.7-GHz processor and 32 Gbytes of RAM running Matlab 2013a. Gaussian-kernel ELMs require more than 32 Gbytes of memory, so we executed them on a high-performance computer with dual Xeon E5-2650 2-GHz processors and 256 Gbytes of RAM running Matlab 2013a. ML-ELM (network structure: 784-700-700-15000-10, with ridge parameters 10^-1 for layer 784-700, 10^3 for layer 700-15000, and 10^8 for layer 15000-10) with a sigmoidal hidden layer activation function generated an accuracy of 99.03 percent. We used DBN and DBM network structures of 784-500-500-2000-10 and 784-500-1000-10, respectively, to generate the results shown in Table 1. Because a two-layer DBM network produces better results than a three-layer one,3 we tested the two-layer network.

As Table 1 shows, ML-ELM performs on par with DBMs and outperforms SAEs, SDAEs, DBNs, ELMs with random features, and Gaussian kernel ELMs. Furthermore, ML-ELM has the least required training time with respect to deep networks:

• In contrast to deep networks, ML-ELM doesn't require fine-tuning.
• ELM-AE output weights can be determined analytically, unlike RBMs and traditional auto-encoders, which require iterative algorithms.
• ELM-AE learns to represent features via singular values, unlike RBMs and traditional auto-encoders, where the actual representation of data is learned.



Figure 3. Adding layers in ML-ELM. (a) ELM-AE output weights β1 with respect to input data x are the first-layer weights of ML-ELM. (b) The output weights βi+1 of ELM-AE, with respect to the ith hidden layer output hi of ML-ELM, are the (i + 1)th layer weights of ML-ELM. (c) The ML-ELM output layer weights are calculated using regularized least squares.

ELM-AE can be seen as a special case of ELM, where the input is equal to the output and the randomly generated weights are chosen to be orthogonal (see Figure 3). ELM-AE's representation capability might provide a good solution for multilayer feed-forward neural networks. ELM-based multilayer networks seem to provide better performance than state-of-the-art deep networks.

References
1. G.E. Hinton and R.R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, 2006, pp. 504-507.
2. P. Vincent et al., "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," J. Machine Learning Research, vol. 11, 2010, pp. 3371-3408.
3. R. Salakhutdinov and H. Larochelle, "Efficient Learning of Deep Boltzmann Machines," J. Machine Learning Research, vol. 9, 2010, pp. 693-700.
4. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, 2006, pp. 489-501.
5. Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, 1998, pp. 2278-2324.
6. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 2, 2012, pp. 513-529.
7. B. Widrow et al., "The No-Prop Algorithm: A New Learning Algorithm for Multilayer Neural Networks," Neural Networks, vol. 37, 2013, pp. 182-188.
8. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879-892.
9. W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz Mappings into a Hilbert Space," Proc. Conf. Modern Analysis and Probability, vol. 26, 1984, pp. 189-206.

Liyanaarachchi Lekamalage Chamara Kasun is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at chamarak001@e.ntu.edu.sg.

Hongming Zhou is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at hmzhou@ntu.edu.sg.

Guang-Bin Huang is at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Contact him at egbhuang@ntu.edu.sg.

Chi Man Vong is in the Faculty of Science and Technology, University of Macau. Contact him at cmvong@umac.mo.



A Secure and Practical Mechanism for Outsourcing ELMs in Cloud Computing

Jiarun Lin, Jianping Yin, Zhiping Cai, Qiang Liu, and Kuan Li, National University of Defense Technology, China
Victor C.M. Leung, University of British Columbia, Vancouver, Canada

Figure 4. Architecture for outsourcing the extreme learning machine (ELM) to the cloud. The customer formulates the original ELM problem and calculates H locally, the cloud servers return the result H†, and the customer computes β = H†T.

The extreme learning machine (ELM)1-3 is a newly proposed algorithm for generalized single-hidden-layer feed-forward neural networks (SLFNs) that tends to reach not only the smallest training error but also the smallest norm of weights, at an extremely fast learning speed, which provides good generalization performance. However, the growing volume and increasingly complex structure of the data involved in today's applications make using the ELM over large-scale data a challenging task. To address this challenge, researchers have proposed enhanced ELM variants,4,5 but not all users have abundant computing resources or distributed computing frameworks at hand. Instead, they need to be able to outsource the expensive computation associated with the ELM to the cloud to utilize its literally unlimited resources on a pay-per-use basis at relatively low prices.

To the best of our knowledge, we're the first to outsource the ELM in cloud computing while assuring the I/O's confidentiality. ELM problems, in which the parameters of hidden nodes are assigned randomly and the desired output weights can be determined analytically, are suitable for being outsourced to the cloud.

This article proposes a secure and practical outsourcing mechanism called Partitioned ELM to address the challenge of performing the ELM over large-scale data. The Partitioned ELM algorithm can significantly improve the training time of the original ELM algorithm by outsourcing the heaviest computation.

We've conducted extensive experiments to evaluate our proposed mechanism's performance. The experimental and analytical results show that our proposal can save considerable ELM training time. When the size of the ELM problem increases, the speedups achieved by the proposed mechanism also grow.

Outsourcing ELM in Cloud Computing
We modeled N arbitrary distinct samples with matrices (X, T). Other work has proved that iteratively adjusting the input weights w and the biases b when training SLFNs isn't necessary.1-3 Instead, they can be randomly assigned if the activation functions in the hidden layer are infinitely differentiable. We use M to denote the number of hidden nodes and H to denote the output matrix of the hidden layer, whose size is N × M. The smallest norm least-squares solution of the output weights can be theoretically determined by β = H†T, where H† is the Moore-Penrose generalized inverse.6

To reduce the time used for training or executing the ELM on large-scale data, it's natural to want to outsource any bottleneck computations to the cloud. However, doing so also relinquishes the user's direct control over his or her data, and could expose sensitive information.7

Cloud computing can follow an "honest but curious" model, also called a semi-honest model in previous research,8 in which the cloud server is persistently interested in analyzing data to mine more information for various purposes, either intentionally or because it's compromised. Here, we assume that cloud servers can behave unfaithfully—that is, cheat the customer to save power or reduce executing time while hoping not to be caught. To enable secure and practical outsourcing, our proposed mechanism must be ingeniously designed so as to ensure the confidentiality of ELM problems while guaranteeing correctness and soundness. We first assume that the cloud server performs the computation honestly and discuss the verification of correctness and soundness later.

Partitioned ELM Architecture
Two different entities are involved in our architecture: cloud customers and cloud servers. The former have several computationally expensive large-scale ELM problems to outsource; the latter have unlimited resources and provide utility computing services. Figure 4 shows our architecture for outsourcing the ELM in cloud computing.



To focus on outsourcing, we omitted the authentication processes in this article, assuming that the communication channels are reliably authenticated and encrypted, which can be achieved in practice with little overhead.9

As the name Partitioned ELM indicates, our mechanism explicitly decomposes the ELM algorithm into a public and a private part. The private part consists of the generation of random parameters and some light matrix operations. The customer calculates the output matrix of the hidden layer locally and sends it to the cloud server, which is mainly responsible for calculating the Moore-Penrose generalized inverse, the most time-consuming calculation in the ELM. Finally, the customer multiplies the inverse with the target matrix to calculate β.

Encryption of Training Samples
The ELM is instinctively suitable for outsourcing in cloud computing, while still assuring the confidentiality of the training samples and the desired parameters of the neural networks, because of encryption. In the private part, the parameters (w, b) are assigned randomly and are part of the desired parameters of the training SLFNs. These parameters must be assigned by the cloud customer, not the server. Without any knowledge of the activation function or the parameters, the cloud server can't obtain knowledge about the exact X or (w, b) from H. Random parameter generation is also associated with input data confidentiality: with random parameters and randomly chosen activation functions, the customer calculates the hidden layer's output matrix, which the cloud server can't mine. In short, the encryption of X is embedded in the ELM. The confidentiality of the input and of the training SLFN's parameters (w, b) is achieved by the randomly generated parameters and randomly chosen activation functions. For convenience, we denote this as H = g(H0), where g is the activation function and H0 is the temporary matrix for H. Even with knowledge of the infinitely differentiable activation functions associated with the hidden nodes, the cloud server can't exactly determine X, w, or b from the mediate matrix H0. Therefore, we can also outsource the computation of the activation functions to the cloud.

The communication overhead between the customer and the cloud server can be further reduced by using pipeline parallelization, where the cloud server calculates the activation functions and receives H0 in a pipelined manner.

Calculation of Output Weights
The cloud server receives the mediate matrix H0 and then calculates the hidden layer's output matrix. Thereafter, it calculates the Moore-Penrose generalized inverse, whose execution time dominates the training time of the original ELM problem, and sends the Moore-Penrose generalized inverse back to the customer. Finally, the customer calculates the output weights β by multiplying the inverse H† and the target output T of the training samples locally.

During the whole process, the parameters (w, b, β) of the training SLFNs are kept away from the cloud server: it can't mine special information about the original ELM problems or the trained SLFNs, such as the input training samples (X, T) or the desired parameters.

Result Verifications
Up until now, we've assumed that the cloud server performs the computation honestly, yet is still interested in learning information. However, the server might behave unfaithfully, so the customer must be able to verify result correctness and soundness.

In our mechanism, the returned inverse itself can serve as the verification proof. From the defining equations of the Moore-Penrose generalized inverse, we can verify whether the returned matrix is the desired inverse.6 Therefore, the correctness and soundness of the results can be verified while incurring little computational overhead or extra communication.

In this article, we focus only on outsourcing the basic ELM algorithm, but it's worth noting that the proposed mechanism isn't limited to a specific type of ELM and can be employed for a large variety of ELM algorithms. Applying our outsourcing mechanism to various ELM variants, especially those with regularization factors or kernels,3 is one of our future works.
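The division of labor just described can be summarized in a short sketch. The code below is our own illustration of the Partitioned ELM workflow, not the authors' implementation: the function names are invented, the sigmoid stands in for any infinitely differentiable activation function, and the verification step simply checks the four Penrose conditions that define the Moore-Penrose inverse.

import numpy as np

def customer_prepare(X, M, rng=np.random.default_rng()):
    """Private part: the customer generates (w, b) and the mediate matrix H0 locally."""
    w = rng.standard_normal((X.shape[1], M))
    b = rng.standard_normal(M)
    H0 = X @ w + b                       # only H0 (or H) is ever sent to the cloud
    return w, b, H0

def cloud_compute(H0):
    """Public part: the cloud applies the activation function and returns H and H†."""
    H = 1.0 / (1.0 + np.exp(-H0))
    return H, np.linalg.pinv(H)          # the dominant cost of ELM training

def customer_verify(H, H_pinv, tol=1e-8):
    """Check the Penrose conditions to confirm the returned matrix really is H†."""
    return (np.allclose(H @ H_pinv @ H, H, atol=tol) and
            np.allclose(H_pinv @ H @ H_pinv, H_pinv, atol=tol) and
            np.allclose((H @ H_pinv).T, H @ H_pinv, atol=tol) and
            np.allclose((H_pinv @ H).T, H_pinv @ H, atol=tol))

def customer_finish(H_pinv, T):
    """The customer computes the output weights locally: beta = H† T."""
    return H_pinv @ T

Checking the Penrose conditions costs only a few matrix multiplications, far less than computing the pseudo-inverse itself, which is consistent with the low verification overhead claimed above.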



Performance Evaluation
We use t_original to denote the training time of the original ELM and t_outsource to denote that of the proposed mechanism. In Partitioned ELM, the time costs at the customer and cloud server sides are denoted as t_customer and t_cloud, respectively. We then define the asymmetric speedup of the proposed mechanism as λ = t_original / t_customer, which physically represents the savings of the customer's computing resources; it is independent of how resourceful the cloud server is and is directly related to the ELM problem size.

In our series of experiments, we conducted the customer computations on a workstation with an Intel Xeon quad processor running at 3.60 GHz with 2 Gbytes of RAM and 1 Gbyte of Linux swap space; we did the cloud server computations on a workstation with an Intel Core Duo processor running at 2.50 GHz with 4 Gbytes of RAM and Windows virtual memory. By outsourcing the bottleneck ELM computation from a workstation with lower resources to one with more computing power, we could evaluate the training speed of our proposed mechanism without a real cloud environment.

We tested Partitioned ELM on a large-scale dataset called CIFAR-10,10 which consists of 50,000 32 × 32 training color images and 10,000 testing images in 10 classes; we had 5,000 training images and 1,000 testing images per class. To reduce the number of attributes, we transformed the color images into grayscale. We conducted five trials for each M, and randomly chose two classes from the 10 classes as the training and testing samples for each trial. Table 2 shows the results. With the increase of M, memory becomes the dominant computing resource when solving the ELM problem. The asymmetric speedup also increases, which means that the larger the problem's overall size, the larger the speedups the proposed mechanism can achieve.

Table 2. Performance over parts of the CIFAR-10 dataset.

M | t_original (s) | t_outsource (s) | t_customer (s) | t_cloud (s) | λ
500 | 12.65 | 6.19 | 2.70 | 3.48 | 4.69
1,000 | 53.94 | 17.07 | 5.07 | 12.00 | 10.64
1,500 | 114.29 | 33.62 | 7.46 | 26.16 | 15.32
2,000 | 347.02 | 57.84 | 10.10 | 47.74 | 34.36
2,500 | 485.30 | 89.78 | 12.58 | 77.20 | 38.58
3,000 | 1,055.95 | 135.74 | 14.79 | 120.95 | 71.40
3,500 | 1,513.80 | 191.40 | 17.29 | 174.11 | 87.55

The training accuracy climbs steadily from 83 to 95 percent with the number of hidden nodes, while the testing accuracy varies between 80 and 84 percent. We also tested the proposed mechanism over the whole CIFAR-10 dataset with feature extraction in advance. SVM and Fastfood11 built on the ELM can achieve 42.3 and 63.1 percent testing accuracy, respectively, while our method can achieve 64.5 percent testing accuracy. To find the specific M that gives an ELM problem the best testing accuracy, customers might want to run multiple experiments under different values of M. They can then exploit the computing power of the cloud by testing multiple ELM problems with different M simultaneously to reduce the overall training time.

Given that the activation functions are infinitely differentiable, the input weights and biases involved in Partitioned ELM aren't tuned iteratively but assigned randomly, which lets us determine the output weights theoretically. Compared with traditional learning algorithms for SLFNs and with deep learning algorithms, Partitioned ELM requires much less human intervention and potentially less training time.

By outsourcing the calculation of the Moore-Penrose generalized inverse, which is the computationally heaviest operation in the ELM, Partitioned ELM can release the customer from the heavy burden of expensive computations. The high physical savings of computing resources and the literally unlimited resources in cloud computing enable our proposed mechanism to be applied to multiple big data applications.

Acknowledgments
This work was supported by the National Natural Science Foundation of China (project nos. 61379145, 61170287, 61232016, and 61070198). This research has been enabled by the use of computing resources provided by WestGrid and Compute/Calcul Canada. We thank Guang-Bin Huang and the reviewers for their constructive and insightful comments on this article.

References
1. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," Proc. Int'l Joint Conf. Neural Networks (IJCNN 2004), vol. 2, IEEE, 2004, pp. 985-990.
2. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, 2006, pp. 489-501.
3. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, 2012, pp. 513-529.
4. Q. He et al., "Parallel Extreme Learning Machine for Regression Based on MapReduce," Neurocomputing, vol. 102, 2013, pp. 52-58.
5. M. van Heeswijk et al., "GPU-Accelerated and Parallelized ELM Ensembles for Large-Scale Regression," Neurocomputing, vol. 74, no. 16, 2011, pp. 2430-2437.
6. D. Serre, Matrices: Theory and Applications, Springer, 2010.
7. Y. Cheng et al., "Efficient Revocation in Ciphertext-Policy Attribute-Based Encryption Based Cryptographic Cloud Storage," J. Zhejiang University-Science C (Computers & Electronics), vol. 14, Feb. 2013, pp. 85-97.
8. C. Wang, K. Ren, and J. Wang, "Secure and Practical Outsourcing of Linear Programming in Cloud Computing," Proc. INFOCOM, IEEE, 2011, pp. 820-828.
9. P. Shi et al., "Dependable Deployment Method for Multiple Applications in Cloud Services Delivery Network," China Communications, vol. 8, July 2011, pp. 65-75.
10. A. Krizhevsky and G. Hinton, "Learning Multiple Layers of Features from Tiny Images," master's thesis, Dept. of Computer Science, University of Toronto, 2009.



11. Q. Le, T. Sarlos, and A. Smola, "Fastfood: Approximating Kernel Expansions in Loglinear Time," to appear in Proc. ICML, 2013.

Jiarun Lin is a PhD candidate at the National University of Defense Technology, Changsha, China. Contact him at nudtjrlin@gmail.com.

Jianping Yin is a professor at the National University of Defense Technology, Changsha, China. Contact him at jpyin@nudt.edu.cn.

Zhiping Cai is an associate professor at the National University of Defense Technology, Changsha, China. Contact him at zpcai@nudt.edu.cn.

Qiang Liu is a PhD candidate at the National University of Defense Technology, Changsha, China. Contact him at libra6032009@gmail.com.

Kuan Li is an assistant professor at the National University of Defense Technology, Changsha, China. Contact him at li.kuan@163.com.

Victor C.M. Leung is a professor at the University of British Columbia, Vancouver, Canada. Contact him at vleung@ece.ubc.ca.

ELM-Guided Memetic Computation for Vehicle Routing

Liang Feng and Yew-Soon Ong, School of Computer Engineering, Nanyang Technological University, Singapore
Meng-Hiot Lim, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

The significance of solving the vehicle routing problem (VRP) is increasingly apparent in the fields of transportation and logistics, mainly due to the escalation of costs related to soaring fuel prices and inflation. It also has significant national and international implications because of the traffic congestion and increased air pollution experienced in many urban cities worldwide. The VRP, which seeks to service a set of customers with a fleet of capacity-constrained vehicles, is a particularly challenging problem due to its complex combinatorial nature.1 The VRP is NP-hard, with only explicit enumeration approaches known to solve such problems optimally. However, enumeration methods don't cope well computationally with large-scale problems.

Evolutionary algorithms (EAs), on the other hand, have demonstrated notable performance and scale well. However, because of their inherent nature, which involves the iterative process of reproduction—selection, crossover, and mutation—EAs are deemed to be slow and unable to meet the pressure of delivering fast, high-quality solutions.

It's notable that learning serves as a core mechanism in human functioning and adaptation to a quickly evolving society. Past research studies sped up conventional EAs by incorporating problem-specific knowledge2 or memes (the basic unit of cultural transmission stored in brains3).4 Knowledge and memes usually exist as data structures, procedures, or rule representations, and when assimilated into search they can result in faster convergence to desirable solutions.

Recently, the extreme learning machine (ELM) has been a hot topic in neural network research. Here, we consider the ELM as a meme encapsulation engine for speeding up evolutionary search on vehicle routing problems. The ELM enhances the conventional EA by automating the learning of knowledge memes from previous vehicle routing experiences. In particular, we model the knowledge meme here as an ELM-encapsulated instruction that recommends high-quality task assignments of vehicles on fresh routing problems, thus speeding up the evolutionary search toward the global optima.

Vehicle Routing with Stochastic Demand
We showcase here the VRP with stochastic demand (VRPSD), whereby consignments are delivered and collected from delivery centers to customers' doors, or vice versa, and each customer's demand is uncertain before the customer is serviced. This part of the logistics often involves routing a fleet of vehicles for physical consignment distribution; it plays a crucial role in ensuring that consignments are distributed in correct quantities. In most supply chains, this accounts for the majority of shipment costs and is the main cause of air pollution and traffic congestion in urban areas. For instance, Figure 5 depicts an example VRPSD involving 10 customers served by three capacity-constrained vehicles located at the delivery center. In the VRPSD, each customer vi is modeled with a stochastic demand(i), which is only revealed when customer vi is visited. In the delivery/collection process, the assigned route τk might fail to fulfill the capacity constraint at its ith customer, where $C < \sum_{i=1, v_i \in \tau_k}^{m} \mathrm{demand}(i)$, at which point vehicle k has to take a recourse action from vi-1 to the delivery center to replenish before returning to service vi.

The objective is thus defined as finding a route s = {τ1, τ2, ..., τk, ..., τK} (here, K denotes the total number of vehicles and τk = {v0, vi, vi+1, ..., vm, v0}, where v0 denotes the depot) that satisfies all customer demands as well as the vehicle capacity constraint C, while at the same time minimizing the overall expected distance traveled by all vehicles, CostVRPSD(s), as given by

Cost_{VRPSD}(s) = \sum_{k=1}^{K} L_{VRPSD}(\tau_k),   (1)

where LVRPSD(τk) is the expected distance traveled by vehicle k.
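Equation 1's expected distance is what makes the VRPSD harder than the deterministic VRP: the recourse trips back to the depot depend on the random demands. The sketch below estimates LVRPSD(τk) for one route by Monte Carlo simulation of the recourse policy described above. It is only an illustration under our own assumptions (Euclidean distances and a user-supplied demand sampler); the authors' evaluation may differ in detail.

import math

def route_cost_with_recourse(route, coords, capacity, sample_demand, n_sims=1000):
    """Estimate the expected travelled distance of one route tau_k = [0, ..., 0].

    route: customer indices, starting and ending at the depot (index 0).
    coords: {index: (x, y)}; sample_demand(i) draws one stochastic demand for customer i.
    When the accumulated demand would exceed the capacity, the vehicle returns to the
    depot from the previous customer, replenishes, and resumes service.
    """
    def dist(a, b):
        (xa, ya), (xb, yb) = coords[a], coords[b]
        return math.hypot(xa - xb, ya - yb)

    total = 0.0
    for _ in range(n_sims):
        load, cost, prev = 0.0, 0.0, route[0]
        for v in route[1:]:
            if v != route[0]:
                demand = sample_demand(v)
                if load + demand > capacity:     # route failure: recourse trip to depot
                    cost += dist(prev, route[0]) + dist(route[0], prev)
                    load = 0.0
                load += demand
            cost += dist(prev, v)
            prev = v
        total += cost
    return total / n_sims

Summing this estimate over all routes in s gives a Monte Carlo counterpart of Equation 1.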



Figure 5. Realistic logistical vehicle routing. (a) The logistical vehicle routing in a typical courier service, and (b) a graph representation of that same routing plan: three routes leave the depot v0, with τ1 = {v0, v1, v2, v3, v0}, τ2 = {v0, v4, v5, v6, v7, v0}, τ3 = {v0, v10, v9, v8, v0}, and s = {τ1, τ2, τ3}.

ELM-Guided Memetic Computation
The ELM was proposed by Guang-Bin Huang and colleagues5 for single-layer feed-forward neural networks (SLFNs). It reports notable generalization performance with high learning efficiency and little human intervention. The training process is equivalent to finding a least-squares solution β of the linear system Hβ = T, where H is the hidden-layer output matrix and T is the target output.

Learning of Task Assignments from Previous Routing Experiences
The objective of learning task assignments via the ELM is to create association lists of customers to vehicles from optimized routes. Suppose V = {vi | i = 1, ..., n}, where n is the number of customers, and s = {v0, v1, v2, v3, v0, ..., vi, v0} denote the customer data and the optimized routes, respectively. The location (the Cartesian coordinates) of each customer vi defines the features of the learning task, vi = {x1, ..., xi, ..., xd}, where d denotes the dimension. An SLFN-ELM structure is then designed to learn which task pair vectors vi and vj are served by a common vehicle in s. To achieve this goal, we define the task pair feature vector representation as

\{f(v_i), f(v_j)\} = \{\,|x_1^i - x_1^j|, \ldots, |x_d^i - x_d^j|\,\},   (2)

where | · | denotes the absolute value operation. If vi and vj are served by a common vehicle in s, the respective {f(vi), f(vj)} is labeled with output 1; otherwise, it is labeled with output 0. The training data of class 1 task pairs and class 0 task pairs are extracted from the obtained optimized routes s. In this manner, the recommendations for effective task assignments on unseen VRPSDs are realized via the ELM trained from previous routing experiences.

Prediction of Task Assignments in Unseen VRPSDs
The recommendation of effective task assignments involves a prediction of the vehicle to be assigned to serve each customer of the unseen VRPSD of interest. Given routing customers V′ = {v′i | i = 1, ..., m}, where m is the number of customers, the task pairs {f(v′i), f(v′j)} are constructed via Equation 2. The Hβ output of the trained ELM classifier describes how probable it is that the task pairs will be served by a common vehicle. With the sigmoid S(t) = 1/(1 + e^{-t}), S(Hβ) then gives the distances between the constructed task pairs in the unseen VRPSD. In this manner, for m customers, an m × m symmetric distance matrix DM is attained, and simple clustering (such as K-Medoids) on DM leads to the prediction of the task assignments. The predicted task assignments are then encoded to form the population of unseen-VRPSD solution individuals in an EA so as to positively bias the search toward high-quality solutions rapidly. Algorithm 1 details our proposed ELM-guided memetic computational framework for vehicle routing.



Learning of Task Assignment:
For each solved routing problem instance
    Construct the task pair feature representation based on Equation 2.
    Assign a binary label to the constructed task pairs based on the optimized routing solution s.
    Train the SLFN ELM with the labeled task pairs.
End For
/* End of learning task assignment with ELM */

Prediction of Task Assignment for Evolutionary Search:
Population Initialization
    Construct the task pair feature representation of the newly encountered or unseen routing problem instance.
    Derive the distance matrix DM based on the Hβ output of the trained ELM classifier on the constructed task pairs.
    Apply K-Medoids on DM to obtain the task assignment.
    Encode the obtained task assignments as solutions (see Figure 5).
    Insert the encoded solutions into the initial population of the evolutionary search.
End Initialization
While (the termination criteria are not met)
    Reproduction operators (crossover, mutation, etc.)
    Selection operator (elitism, etc.)
End While

Algorithm 1. Outline of the proposed ELM-guided memetic computation for vehicle routing.

Table 3. Statistical results of ELM-MEA and MEA on VRPSDs with a stochastic demand of variance 0.25.*

# | VRPSD | Ave.Cost | Ave.R | Ave.CS
1 | A-n33-k5 | ≈ | ≈ | 1.23%
2 | A-n45-k7 | ≈ | ≈ | 29.23%
3 | A-n61-k9 | ≈ | ≈ | 19.30%
4 | A-n65-k9 | ≈ | ≈ | 24.61%
5 | B-n31-k5 | ≈ | ≈ | 24.79%
6 | B-n45-k5 | ≈ | ≈ | 22.78%
7 | B-n50-k7 | ≈ | ≈ | 14.97%
8 | B-n52-k7 | ≈ | ≈ | 31.99%
9 | B-n56-k7 | ≈ | + | 45.91%
10 | B-n67-k10 | ≈ | ≈ | 32.30%
11 | B-n78-k10 | ≈ | ≈ | 22.01%
12 | E-n22-k4 | ≈ | ≈ | 30.60%
13 | E-n30-k3 | ≈ | ≈ | 54.22%
14 | E-n76-k14 | ≈ | ≈ | 20.03%
15 | E-n76-k7 | ≈ | ≈ | 31.09%
16 | F-n45-k4 | ≈ | ≈ | 25.72%
17 | F-n72-k4 | ≈ | ≈ | 32.15%
18 | M-n121-k7 | ≈ | ≈ | 43.99%
19 | P-n101-k4 | ≈ | ≈ | 49.12%
20 | P-n22-k8 | ≈ | ≈ | 51.66%
* +, ≈, and − denote that ELM-MEA is statistically better than, competitive with, or significantly poorer than MEA, respectively.

Realistic Logistical Vehicle Routing
We tested the numerical performance of our proposed approach on realistic logistical vehicle routing by comparing it to the recently published Monte Carlo evolutionary algorithm (MEA) for reliable VRPSD route design.6 Our approach, which incorporates ELM-encapsulated knowledge memes from previously solved problems to provide recommendations for high-quality solutions in the baseline MEA search on unseen VRPSDs, is thus notated as ELM-MEA. For a fair comparison, we keep all parameters and operator settings of MEA and ELM-MEA consistent with those in the original work.6 All results reported are for 30 independent runs on 20 VRPSD instances.6

Table 3 tabulates the statistical results of ELM-MEA and MEA based on the Wilcoxon rank sum test under a 95 percent confidence level.



Ave.Cost denotes the averaged cost of the solutions, Ave.R refers to the averaged route reliabilities, and Ave.CS denotes the mean percentage computational cost savings (in terms of the number of fitness evaluations) observed for ELM-MEA to arrive at the converged optimized solution of MEA.

From the results obtained, ELM-MEA achieved solution qualities and route reliabilities competitive with MEA on all the VRPSDs considered. But on search efficiency, ELM-MEA demonstrated superiority over MEA. When solving VRPSD "A-n33-k5," where no previous routing experience is available, MEA and ELM-MEA performed alike. On subsequent VRPSDs, ELM-MEA achieved computational cost savings of up to 54.22 percent over MEA in arriving at competitive routing solutions. It's worth highlighting that because ELM-MEA and MEA share a common baseline VRPSD solver, ELM-MEA's superior performance in search efficiency is clearly attributed to the effectiveness of the ELM-guided memetic computation approach.

Our proposed approach for efficient vehicle routing comprises two core ingredients: the automated learning of task assignments as knowledge memes from previous vehicle routing experiences, and ELM prediction, which defines the task assignments of customers to vehicles based on the encapsulated knowledge memes. Our demonstrations with realistic logistical vehicle routing showcase our approach's effectiveness.

Acknowledgments
This work is partially supported under the A*Star-TSRP funding, the Singapore Institute of Manufacturing Technology, and the Center for Computational Intelligence (C2I) at Nanyang Technological University.

References
1. G. Dantzig and J.H. Ramser, "The Truck Dispatching Problem," Management Science, vol. 6, 1959, pp. 80-91.
2. Y.C. Jin, Knowledge Incorporation in Evolutionary Computation, Springer, 2010.
3. X.S. Chen et al., "A Multi-Facet Survey on Memetic Computation," IEEE Trans. Evolutionary Computation, no. 5, 2011, pp. 591-607.
4. Y.S. Ong, M.H. Lim, and X.S. Chen, "Research Frontier: Memetic Computation–Past, Present & Future," IEEE Computational Intelligence Magazine, vol. 5, no. 2, 2010, pp. 24-36.
5. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," Proc. IEEE Int'l Joint Conf. Neural Networks, IEEE, 2004, pp. 985-990.
6. X. Chen, L. Feng, and Y.S. Ong, "A Self-Adaptive Memeplexes Robust Search Scheme for Solving Stochastic Demands Vehicle Routing Problem," Int'l J. Systems Science, vol. 43, no. 7, 2012, pp. 1347-1366.

Liang Feng is a PhD student at the Center for Computational Intelligence in the School of Computer Engineering at Nanyang Technological University, Singapore. Contact him at feng0039@e.ntu.edu.sg.

Yew-Soon Ong is an associate professor and director of the Center for Computational Intelligence in the School of Computer Engineering at Nanyang Technological University, Singapore. Contact him at asysong@ntu.edu.sg.

Meng-Hiot Lim is an associate professor with the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. Contact him at emhlim@ntu.edu.sg.

ELMVIS: A Nonlinear Visualization Technique Using Random Permutations and ELMs

Anton Akusok, Amaury Lendasse, and Francesco Corona, Aalto University, Finland
Rui Nian, Ocean University of China
Yoan Miche, Aalto University, Finland

Data visualization is an old problem in machine learning.1 High-dimensional data is ill suited for human analysis, and only two or three dimensions can be perceived successfully. One of the simplest methods for dimensionality reduction is variable selection, in which the data can be explained by a smaller set of variables.

Many nonlinear dimensionality reduction methods aim to find and unfold a manifold in the data using various cost functions and training algorithms. A common cost function is the preservation of neighborhoods in the original and reduced spaces. Without an evident manifold structure, or if the dimensionality of the manifold is still higher than that of the reduced space, topology-preserving methods lose their point. These cases require a nonlinear dimensionality reduction method with a general cost function and without other assumptions. The ELM-based visualization method we propose here uses the natural reconstruction error, while the ELM's nonlinearity provides the desired nonlinear projection.

Our proposed ELM visualization method, denoted ELMVIS for convenience, maps the data points to some fixed points—or prototypes—in the visualization space. Their exact position is weakly relevant to the data and can be chosen arbitrarily, for example, as a grid or as Gaussian-distributed points. The prototypes are then randomly assigned to data points, and



an ELM is used to estimate the reconstruction error. To train the visualizer, several points are chosen, their assignment is permuted, and the error is re-estimated. Any better solution found is kept; otherwise, the permutation is abandoned. Although the exact solution requires a factorial number of trials (all possible permutations of N points), experiments show acceptable convergence rates with up to several hundred points due to the ELM's extremely fast reconstruction error estimation. Benefits of the method are its generality and the presence of only one parameter—the number of neurons in the ELM—which doesn't require exact tuning. The method also works with very high data dimensionality.

The competing visualization methods used here for comparison are principal component analysis (PCA), self-organizing maps (SOMs),2 and the neighborhood retrieval visualizer (NeRV).3 PCA is a simple linear regressor with an exact solution, which maximizes the variance of a projection under an orthogonality constraint. SOMs are initialized with a low-dimensional lattice embedded in the data space, which is then iteratively fit to the given data points using the quantization error. When a vertex is moved in the data space, its neighbors on the lattice perform a smaller move in the same direction, which preserves the whole lattice's integrity. NeRV approaches visualization as an information retrieval task—given a data sample as a query, the probability distributions over all the other samples to be its neighbors in both the original space and the visualization space should be as close as possible. NeRV derives its optimization function from the Kullback-Leibler divergence between these two distributions, and thus it's the most general visualization method of the aforementioned. More details about the last two methods appear elsewhere.2,3

Methodology
The ELM algorithm was originally proposed by Guang-Bin Huang and colleagues4 using the structure of a single-layer feed-forward network (SLFN). The main concept behind the ELM is the replacement of the computationally costly procedure of training the hidden layer by its random initialization. An output weight matrix between the hidden representation of the inputs and the true outputs remains to be found, which is a linear task. The method is proven to be a universal approximator given enough hidden neurons.5

Consider a set of N distinct samples (xi, yi) with xi ∈ R^D and yi ∈ R^d. An SLFN with K hidden neurons is modeled as \sum_{k=1}^{K} \beta_k \phi(w_k x_i + b_k), i ∈ [1, N], with φ being the activation function, w the input weights, b the biases, and β the output weights.

If the SLFN perfectly approximates the data, the errors between the estimated outputs ŷi and the actual outputs yi are zero, and the relation among inputs, weights, and outputs is

\sum_{k=1}^{K} \beta_k \phi(w_k x_i + b_k) = y_i,\quad i \in [1, N],

which can be written compactly as Hβ = Y, with β = (β1^T ... βK^T)^T and Y = (y1^T ... yN^T)^T.

Solving for the output weights β from the hidden layer representation of the inputs H and the true outputs Y is achieved using the Moore-Penrose generalized inverse of the matrix H, denoted as H†.6 The ELM's training requires no iterations; the most computationally costly part is the calculation of the pseudo-inverse of the matrix H, which makes the ELM an extremely fast artificial neural network method.

Data Visualization with the ELM
The goal of our ELMVIS method is to maximize recall by minimizing the mean square error (MSE) of a nonlinear reconstruction provided by an ELM. Given the N data points xi ∈ R^D, compactly written as a matrix X = (x1^T ... xN^T)^T, the goal is to find points vi ∈ R^d (shown schematically in Figure 6), denoted as V = (v1^T ... vN^T)^T, that minimize the ELM's reconstruction error, which serves as a nonlinear recall metric. Typically, d equals 2 or 3, while D can be large. Note that the ELM in this methodology performs an inverse projection R^d → R^D, from the low-dimensional visualization space to the high-dimensional original data space, to estimate a reconstruction error; other dimensionality reduction methods mostly use a direct projection R^D → R^d.

The ELM needs both input and output samples to be able to train. The data points X are already known, so we must set the visualization points V. Because the manifold structure of high-dimensional data X, if any, is unlikely to project well onto a 2D or 3D plane (except in artificially created datasets), the exact positioning of the points V isn't of great importance. This allows fixing the positions of V at the beginning. Knowing V and X, the only thing left to find is which point vi corresponds to which point xi. This correspondence (or pairing, in Figure 6) can be expressed as an ordering matrix O. At initialization, O0 is an identity matrix of size N × N. Some of its ones exchange indexes, such as (1a,a, 1b,b) → (1a,b, 1b,a), which swaps samples va and vb after application.
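Before the accelerated update rule is introduced, it helps to see the naive version of the idea in code: for a fixed set of prototypes V and a candidate pairing of prototypes to data points, train an ELM from V to X and measure the reconstruction MSE, keeping any permutation that lowers it. This is our own minimal sketch with invented function names, not the authors' release.

import numpy as np

def reconstruction_mse(V, X, K=20, seed=0):
    """Train an ELM from prototypes V (N x d) to data X (N x D) for the current
    pairing (row i of V is assigned to row i of X) and return the reconstruction MSE."""
    rng = np.random.default_rng(seed)       # fixed seed: the same random ELM for every pairing
    W = rng.standard_normal((V.shape[1], K))
    b = rng.standard_normal(K)
    H = np.tanh(V @ W + b)                  # random hidden layer
    beta = np.linalg.pinv(H) @ X            # linear output weights: beta = H† X
    return np.mean((H @ beta - X) ** 2)

def elmvis_naive(V, X, n_iter=1000, rng=np.random.default_rng(0)):
    """Random-permutation search: swap assignments, keep the swap if the error drops."""
    order = np.arange(X.shape[0])
    err = reconstruction_mse(V, X[order])
    for _ in range(n_iter):
        rows = rng.choice(X.shape[0], size=2, replace=False)
        candidate = order.copy()
        candidate[rows] = candidate[rows[::-1]]
        new_err = reconstruction_mse(V, X[candidate])
        if new_err < err:
            order, err = candidate, new_err
    return order, err

Because this direct version recomputes the pseudo-inverse for every candidate swap, it is exactly what the adaptation described in the following section avoids: by moving the pairing change from V to X, the matrices H and H† stay fixed and each candidate costs only a single matrix multiplication.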



Input data space Visualization data space
3

2.5
10 2
8 1.5
6 1
al
Spir ld
4 n i fo 0.5
ma
2 0

0 –0.5
10
5 –1
10
0 5 –1.5
–5 0 Normalized distribution
–5 –2
–10 –10 –2 –1.5 –1 –0.5 0 0.5 1 1.5 2

Pairings of points in both spaces

Evaluate reconstruction error using ELM

Figure 6. Projecting a high-dimensional spiral manifold data xi to a lower-dimensional visualization space points vi.
Visualization points are fixed, and only the pairings (stored in an ordering matrix O) of the original and visualization data
samples are changed.

Several such swaps constitute an update:

V_iter ← V_(iter−1) O_iter.    (1)

ELMVIS starts by initializing N visualization space points v_i, taken either from a Gaussian distribution or from a regular grid. Then an ELM is initialized, and the ordering matrix O is set to an identity matrix. An initial reconstruction MSE is calculated, after which an iteration starts by choosing a random number of samples out of N and permuting the corresponding rows of O. The ordering matrix O is applied to the visualization points by multiplication, which permutes the prototypes V in the same way. The reconstruction error is recalculated: if it increases, the permutation of rows of O is rolled back; a new iteration begins by again choosing a number of samples and permuting the corresponding rows of O. Convergence is achieved once the error attains a desired threshold or the iteration limit is reached.

Adapting the ELM for Data Visualization
The direct data visualization algorithm requires recalculation of the whole ELM. The most computationally costly part is the recalculation of matrix H and its pseudo-inverse H†. For changes in V, the whole ELM needs recalculating, but for changes in X, the points V and the hidden layer representation H can remain constant, so only the output weight matrix needs to be updated.

The reconstruction mean squared error

MSE_rec = (1/(N·D)) Σ_{i=1}^{N} Σ_{j=1}^{D} (x̂_ij − x_ij)²

depends on x̂_i, which is the output of an ELM trained using data pairs (v_i, x_i). But the solution of the ELM is a linear system of equations, and the nonlinear part of the ELM is applied to each transformed input vector separately from the others. So the nonlinear mapping of an ELM is independent of the order of the training pairs (v_i, x_i), as is MSE_rec. This fact lets us adapt the ELM in ELMVIS to cut the computational load. Multiplying an ordering matrix O with either V or X yields exactly the same new pairs (v'_i, x'_i), although their order will differ. But because the reconstruction error doesn't depend on a particular ordering of the pairs, these operations are interchangeable. Our proposed adaptation of the ELM thus consists of replacing changes in V by changes in X, as in Equation 2:

(X_iter ← X_(iter−1) O_iter) ⇐ (V_iter ← V_(iter−1) O_iter).    (2)

In the ELM structure, replacing changes in V with changes in X keeps the matrices H and H† constant. They need to be calculated only once at initialization; during iterations, the reconstruction of X is obtained using the following rule:

X̂ = Hβ = H(H†X) = (HH†)X.    (3)

Denoting a new matrix H2 = HH† and calculating it at initialization, the training of the ELM on each iteration is reduced to a single matrix multiplication. This gives the necessary speed to run hundreds of thousands or even millions of iterations within a few minutes.
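The adaptation above can be summarized in a few lines of code. The following NumPy sketch of the ELMVIS iteration is our own illustration, assuming a tanh hidden layer computed from the fixed visualization points and an acceptance rule that keeps a swap only when the reconstruction MSE decreases; the function name elmvis_fit and all parameter values are hypothetical.

```python
import numpy as np

def elmvis_fit(X, V, n_neurons=5, n_iter=100_000, rng=None):
    """Illustrative ELMVIS sketch: pair data X (N x D) with fixed
    visualization points V (N x d) by swapping rows of X."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]

    # Random hidden layer computed once from the fixed points V.
    W = rng.standard_normal((V.shape[1], n_neurons))
    b = rng.standard_normal(n_neurons)
    H = np.tanh(V @ W + b)                  # hidden-layer outputs
    H2 = H @ np.linalg.pinv(H)              # H2 = H H† (Equation 3), fixed

    order = np.arange(N)                    # implicit ordering matrix O
    err = np.mean((H2 @ X[order] - X[order]) ** 2)

    for _ in range(n_iter):
        # Permute a random handful of pairings (rows of O).
        k = rng.integers(2, 5)
        idx = rng.choice(N, size=k, replace=False)
        new_order = order.copy()
        new_order[idx] = rng.permutation(order[idx])

        new_err = np.mean((H2 @ X[new_order] - X[new_order]) ** 2)
        if new_err < err:                   # keep the swap only if MSE drops
            order, err = new_order, new_err
    return order, err
```

Because H2 = HH† is computed once, each trial swap costs only a matrix multiplication, which is what makes hundreds of thousands of iterations feasible.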



Experimental Results
The ELMVIS visualization methodology was tested on three datasets. The selected reference methods are PCA as the baseline, SOM2 as another method that uses fixed visualization points, and NeRV3 as a state-of-the-art nonlinear visualization method.

The primary comparison uses the reconstruction error, which is an MSE of a reconstruction of the original data. A visualization method is assumed to have good performance if its visualization has a low MSE_rec. A reverse projection of the visualized data to the original space is required to obtain the error; for NeRV, the only method that doesn't provide such a projection, the reverse projection is learned by a separate ELM. Table 4 lists the errors for all methods.

Table 4. MSE of reconstruction on all datasets. The best error of 100 restarts is shown for all methods except PCA, due to a random initialization procedure.

Dataset            PCA     SOM     NeRV    ELMVIS (Gaussian)   ELMVIS (PCA)
Spiral             0.482   0.054   0.011   0.049               0.060
Sculptural faces   0.980   0.916   0.769   0.718               0.724
Real faces         0.724   0.511   0.501   0.462               0.449

The first dataset for testing is a spiral toy dataset, a common and relatively hard benchmark. The spiral is drawn in a 2D space, and the goal is to project it into one dimension. It consists of N = 100 points, distributed evenly along its line by including a square-root term in the input data X equation:

X = [ √(2α) cos(πLα) ;  √(2α) sin(πLα) ],    (4)

where α is distributed evenly between 0 and 1, and L determines the number of swings the spiral makes and is set to 3 in the experiment. The visualization points V are evenly distributed on a line, and both X and V are normalized to have zero mean and unit variance. In this experiment, the number of neurons of the ELM and SOM is set to 5. Figure 7 shows the ELMVIS model and data mapping; Figure 8 shows a reconstruction learned from NeRV results.

Figure 7. An example of ELMVIS fitting the spiral data (100 points, 5 neurons). The thinner color line is a back projection of the ELM; black lines and color gradient denote the ordering of points. Some points are mapped incorrectly because the solution isn't exact.

The PCA projection squashes the second dimension of the spiral along the direction of the largest variance. NeRV succeeded in finding the manifold, showing great results even after estimating its mapping by a separate ELM. SOM showed good results as well. ELMVIS partially unfolds the spiral, but some parts remain torn and misplaced. Also, occasional outliers appear because the random permutation algorithm hasn't found the best solution in the given range of iterations. Still, the results of ELMVIS on the spiral dataset are acceptable, and far better than the naive PCA.

We also tested the experimental convergence speed of ELMVIS; the spiral test is the fastest of the three due to its smaller number of neurons and lower original data dimensionality, although the convergence speed is independent of these values and relies only on the number of test points. Note that the convergence graphs represent averages over many runs; the other reported ELM results show the best outcome, corresponding to the best random initialization of that ELM's hidden layer.

As stated earlier, the complexity of the exact solution of ELMVIS is factorial in the number of points.



The real speed of convergence was estimated on different-sized subsets of the spiral data, ranging from 20 to 100 points. For each separate amount of points, 100,000 training steps were performed, and the experiments were restarted 100 times with different initial pairings. Figure 9 shows the obtained convergence plot with average values and some standard deviations.

Figure 8. ELM reconstruction, learned from NeRV results (ELM model approximating NeRV). Only one point deviates from the perfect approximation. The ELM model printed with crosses is for visibility, as it mostly coincides with the data.

Figure 9. Convergence of the ELM visualization algorithm on the spiral dataset (ELMVIS convergence, 5 neurons; MSE with standard deviation versus number of iterations), with 100,000 training steps and 100 restarts. Plots are ordered from 20 samples (lowest) to 100 (highest). Only some standard deviations are shown.

Variance in ELMVIS convergence is explained by the convergence speed: while all the individual runs tend to the same lower bound, the best cases converge very quickly, and the worst cases spend much time on MSE plateaus seeking a better solution. For 50 points, convergence is reached on average at iteration 60,000, which is far less than the factorial of 50. The results show that the real convergence speed remains feasible for applications with a low to medium amount of data samples.

The ELMVIS method is most suitable for the visualization of complex data or data without a simple manifold. Another benefit of the ELM is the presence of the reverse projection, which can be used to check how visualization space areas correspond to the data space ones. Using PCA for initialization didn't prove useful—points from a simple Gaussian distribution proved to be a better alternative.

References
1. J.A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
2. T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, vol. 43, no. 1, 1982, pp. 59–69.
3. J. Venna et al., "Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization," J. Machine Learning Research, vol. 11, 2010, pp. 451–490.
4. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, no. 1, 2006, pp. 489–501.
5. G.-B. Huang, L. Chen, and C.-K. Siew, "Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, 2006, pp. 879–892.


6. C.R. Rao and S.K. Mitra, Generalized Inverse of a Matrix and Its Applications, J. Wiley, 1971.

Anton Akusok is a PhD student in the Department of Information and Computer Science at Aalto University, Finland. Contact him at anton.akusok@aalto.fi.

Amaury Lendasse is a docent in the Department of Information and Computer Science at Aalto University, Finland, and is also affiliated with IKERBASQUE, Basque Foundation for Science, the Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, and Arcada University of Applied Sciences. Contact him at amaury.lendasse@aalto.fi.

Francesco Corona is a docent in the Department of Information and Computer Science at Aalto University, Finland. Contact him at francesco.corona@aalto.fi.

Rui Nian is an associate professor in the College of Information and Engineering at Ocean University, China. Contact her at nianrui_80@163.com.

Yoan Miche is a postdoctoral researcher in the Department of Information and Computer Science at Aalto University, Finland. Contact him at yoan.miche@aalto.fi.

Combining ELMs with Random Projections

Paolo Gastaldo and Rodolfo Zunino, University of Genoa, Italy
Erik Cambria, MIT Media Laboratory
Sergio Decherchi, Italian Institute of Technology, Italy

In the extreme learning machine (ELM) model,1 a single-layer feed-forward network (SLFN) implements inductive supervised learning by combining two distinct components. A hidden layer performs an explicit mapping of the input space to a feature space; the mapping isn't subject to any optimization, since all the parameters in the hidden nodes are set randomly. The output layer includes the only degrees of freedom—that is, the weights of the links that connect hidden neurons to output neurons. Thus, training requires solving a linear system by a convex optimization problem. The literature has proven that the ELM approach can attain a notable representation ability.1

According to the ELM scheme, the configuration of the hidden nodes ultimately defines the feature mapping to be adopted. Actually, the ELM model can support a wide class of activation functions. Indeed, an extension of the ELM approach to kernel functions has been discussed in the literature.1

Here, we address the specific role played by feature mapping in the ELM. The goal is to analyze the relationships between such a feature-mapping schema and the paradigm of random projection (RP).2 RP is a prominent technique for dimensionality reduction that exploits random subspaces. This research shows that RP can support the design of a novel ELM approach, which combines generalization performance with computational efficiency. The latter aspect is attained by the RP-based model, which always performs a dimensionality reduction in the feature-mapping stage and therefore shrinks the number of nodes in the hidden layer.

ELM Feature Mapping
Let x ∈ ℜ^d denote an input vector. The function f(x) of an output neuron in an ELM that adopts L hidden units is written as

f(x) = Σ_{j=1}^{L} w_j · a(r_j · x + b_j).    (1)

Thus, a set of random weights {r_j ∈ ℜ^d; j = 1, …, L} connects the input to the hidden layer; the jth hidden neuron embeds a random bias term b_j and a nonlinear activation function a(·). A vector of weighted links, w ∈ ℜ^L, connects the hidden layer to the output neuron.

The vector quantity w = [w_1, ..., w_L] embeds the degrees of freedom in the ELM learning process, which can be formalized after introducing the following notations:

• X is the N × (d + 1) matrix that originates from the training set. X stems from a set of N labeled pairs (x_i, y_i), where x_i is the ith input vector and y_i ∈ ℜ is the associated expected target value.
• R is the (d + 1) × L matrix with the random weights.

Here, by using a common trick, both the input vector x and the random weights r_j are extended to x := [x_1, ..., x_d, 1] and r_j ∈ ℜ^(d+1) to include the bias term.

Accordingly, the ELM learning process requires solving the following linear system:

y = Hw,    (2)

where H is the hidden layer output matrix obtained by applying the activation function, a(), to every element of the matrix

XR.    (3)

Equation 3 clarifies that in the ELM scheme of Equation 1, the hidden layer performs a mapping of the original d-dimensional space into an L-dimensional space through the random matrix R, which is set independently from the distribution of the training data.


In principle, the feature mapping phase can either involve a reduction in dimensionality (L < d) or, conversely, remap the input space into an expanded space (L > d).

Both theoretical and practical criteria have been proposed in the literature to set the parameter L.1,3 This quantity is crucial because it determines the ELM's generalization ability. At the same time, it affects the eventual computational complexity of both the learning machine and the trained model. These aspects become critical in hardware implementations of the ELM model, where resource occupation is of paramount importance.

A few pruning strategies for the ELM model have been proposed in the literature to balance generalization performance and computational complexity.3 The present work tackles this problem from a different perspective and proposes to exploit the fruitful properties of random projections. The approach discussed here applies RP to reduce the dimensionality of the data; the study, however, opens interesting vistas on using RP to tune the basic quantity L as well.

Dimensionality Reduction by Using RP
RP is a simple and powerful dimension reduction technique that uses a suitably scaled random matrix with independent, normally distributed entries to project data into low-dimensional spaces. The procedure to get an RP is straightforward and arises from the Johnson-Lindenstrauss (JL) lemma.2 The lemma states that any N-point set lying in d-dimensional Euclidean space can be embedded into an r-dimensional space, with r ≥ O(ε^−2 ln(N)), without distorting the distances between any pair of points by more than a factor 1 ± ε, where ε ∈ (0, 1).

Over the years, the use of probabilistic methods greatly simplified the original JL proof, and at the same time led to straightforward randomized algorithms for implementing the transformation. In matrix notation, the embedding operation is expressed as

K = XP,    (4)

where X is the original set of N d-dimensional observations, K is the projection of the data into a lower, r-dimensional subspace, and P is the random matrix providing an embedding that satisfies the JL lemma.

In principle, Equation 4 is a projection only if P is orthogonal; this ensures that similar vectors in the original space remain close to each other in the low-dimensional space. In very high-dimensional spaces, however, bypassing orthogonalization saves computation time without affecting the quality of the projection matrix significantly. In this regard, the literature provides a few practical criteria to build P.2

RP-ELM
The ability of RP to preserve, approximately, the distances between the N data vectors in the r-dimensional subspace is a valuable property for machine learning applications in general.4 Indeed, this property is the conceptual basis of the novel approach that connects the ELM feature mapping scheme in Equation 3 to the RP paradigm.

A new ELM model can be derived from Equation 1 if we set as hypotheses that L should be smaller than d and that the mapping implemented by the weights r_j satisfies the JL lemma. Under these assumptions, the mapping scheme in Equation 3 always implements a dimensionality reduction process (as in Equation 4). In practice, we can take advantage of the properties of RP to obtain an ELM model that shrinks the size L of the hidden layer and reduces the computational overhead accordingly. The eventual model will be denoted as RP-ELM. The crucial point is that the JL lemma guarantees that the original geometry of the data is only slightly perturbed by the dimensionality reduction process;2 indeed, the degradation grows gradually as L decreases (given d and N).2

In principle, the literature provides several criteria for the construction of a random matrix that satisfies the JL lemma. The present work focuses on matrices in which the entries are independent realizations of ±1 Bernoulli random variables;2 hence, matrix R in Equation 3 is generated as follows:

R_{i,j} = +1/√L with probability 1/2, and R_{i,j} = −1/√L with probability 1/2.    (5)

Richard Baraniuk and colleagues2 showed that this kind of random matrix actually satisfies both the JL lemma and the restricted isometry property, thus bringing out a connection between RP and compressed sensing.
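A sketch of the resulting RP-ELM construction, assuming the scaled ±1/√L Bernoulli entries of Equation 5 and L < d, might look as follows; the helper names and the sigmoid activation are assumptions for illustration.

```python
import numpy as np

def rp_elm_train(X, y, L, rng=None):
    """RP-ELM sketch: Bernoulli +/- 1/sqrt(L) random mapping (Eq. 5), L < d."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    assert L < d, "RP-ELM assumes a dimensionality-reducing hidden layer"
    Xb = np.hstack([X, np.ones((N, 1))])                   # bias trick
    # Entries are +1/sqrt(L) or -1/sqrt(L), each with probability 1/2.
    R = rng.choice([1.0, -1.0], size=(d + 1, L)) / np.sqrt(L)
    H = 1.0 / (1.0 + np.exp(-(Xb @ R)))                    # hidden-layer outputs
    w, *_ = np.linalg.lstsq(H, y, rcond=None)              # linear solve for output weights
    return R, w
```

Apart from how R is drawn and the constraint L < d, training proceeds exactly as in a standard ELM, which is why the model trades hidden-layer size against accuracy so directly.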
Experimental Results
The performance of the proposed RP-ELM model was tested on two binary classification problems (www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html): colon cancer and leukemia. The former dataset contains expression levels of 2,000 genes taken from 62 different samples; 40 samples refer to tumors. The latter dataset provides the expression levels of 7,129 genes taken over 72 samples; 25 samples refer to "acute lymphoblast leukemia" and 47 samples refer to "acute myeloid leukemia." The datasets share two interesting features: the number of patterns is very low, and the dimensionality of the data is very high compared with the number of patterns. In both cases, the data are quite noisy because gene expression profiles are involved.



Table 5. Error rates scored by RP-ELM and standard ELM on the two binary classification problems.

Colon cancer                          Leukemia
L     RP-ELM (%)   ELM (%)            L     RP-ELM (%)   ELM (%)
10    38.7         38.7               35    25.0         40.3
20    40.3         35.5               70    27.8         31.9
30    43.5         45.2               105   47.2         27.8
40    32.3         45.2               140   30.6         33.3
50    29.0         50.0               175   37.5         37.5
60    37.1         48.4               210   25.0         37.5
70    37.1         40.3               245   27.8         40.3
80    29.0         37.1               280   31.9         36.1
90    29.0         43.5               315   31.9         30.6
100   25.8         40.3               350   38.9         33.3

The experimental session aimed to evaluate the ability of the RP-ELM model to suitably trade off generalization performance and computational complexity (that is, the number of nodes in the hidden layer). It's worth noting that the experiments didn't address gene selection. Table 5 reports the results of the two experiments and gives the error rates attained for 10 different settings of L. In both cases, the highest values of L corresponded to a compression ratio of 1:20 in the feature-mapping stage. The performances were assessed by adopting a leave-one-out (LOO) scheme, which yielded the most reliable estimates in the presence of a limited-size dataset. Error rates were worked out as the percentage of misclassified patterns over the test set.

The table compares the results of the RP-ELM model with those attained by the standard ELM model. Results showed that, in both experiments, RP-ELM attained lower error rates than the standard ELM. Moreover, the RP-ELM performed comparably with approaches reported in the literature, in which ELM models included 1,000+ neurons and didn't adopt a LOO validation procedure.

Our theory showed that, by a direct implementation of the JL lemma, we can sharply reduce the number of neurons in the hidden layer without affecting the generalization performance in prediction accuracy. As a result, the eventual learning machine always benefits from a considerable simplification in the feature-mapping stage. This allows the RP-ELM model to properly balance classification accuracy and resource occupation.

The experiments also showed that the proposed model can attain satisfactory performance. Further investigations will aim to confirm the effectiveness of the RP-ELM scheme with additional theoretical insights and a massive campaign of experiments.

References
1. G.-B. Huang et al., "Extreme Learning Machine for Regression and Multiclass Classification," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 2, 2012, pp. 513–529.
2. R. Baraniuk et al., "A Simple Proof of the Restricted Isometry Property for Random Matrices," Constructive Approximation, vol. 28, no. 3, 2008, pp. 253–263.
3. G.-B. Huang, D.H. Wang, and Y. Lan, "Extreme Learning Machines: A Survey," Int'l J. Machine Learning and Cybernetics, vol. 2, no. 2, 2011, pp. 107–122.
4. Y. Miche, B. Schrauwen, and A. Lendasse, "Machine Learning Techniques Based on Random Projections," Proc. European Symp. Artificial Neural Networks – Computational Intelligence and Machine Learning, 2010, pp. 295–302.

Paolo Gastaldo is an assistant professor at the University of Genoa, Italy. Contact him at paolo.gastaldo@unige.it.

Rodolfo Zunino is an associate professor at the University of Genoa, Italy. Contact him at rodolfo.zunino@unige.it.

Erik Cambria is an associate researcher at MIT Media Laboratory. Contact him at cambria@media.mit.edu.

Sergio Decherchi is a postdoc researcher at the Italian Institute of Technology, Italy. Contact him at sergio.decherchi@iit.it.

Reduced ELMs for Causal Relation Extraction from Unstructured Text

Xuefeng Yang and Kezhi Mao, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Natural language is the major intermediary tool for human communication. However, it's unstructured and therefore hard for computers to understand. In recent decades, knowledge extraction, which transfers unstructured language text into machine-understandable knowledge, has received considerable attention.1,2 Knowledge can be categorized into descriptive and logic information, both of which are indispensable in knowledge expression. Think of the following example: Jim is happy today because his favourite basketball team won the final.



The descriptive information Be(Jim, Happy) and Win(Team, Final) don't make much sense without the causal relation Because. In the literature, most research focuses on descriptive information extraction, and research on logic information extraction is relatively rare. We focus here on extracting the logic-level relation, namely, the causal relation, from unstructured text.

In recent years, machine learning and semantic resources for causal relation extraction have been explored. Some researchers, for example,3 extracted <NP1 verb NP2> syntactic patterns and then employed semantic constraints to classify candidates as causal or non-causal. Other work4 modified this approach and used the C4.5 decision tree instead of simple constraints to perform classification for a question-and-answer application. One team5 proposed a novel boundary feature extracted from WordNet to help classify semantic relations between nominals that contain a causal relation. Another team6 employed predefined syntactic patterns to extract candidates containing any of the four relators "because," "after," "as," and "since," and then classified the patterns using the bagging ensemble method.

Our study expands both the syntactic and semantic perspectives to cover purpose, explanation, condition, and intra-sentential explicitly marked causal relations. The larger coverage generates more candidate relations to classify, which requires a computationally efficient pattern classifier for both training and testing. In addition, among the generated candidate relations, only a small portion is causal, so an imbalance problem exists in both the training and testing data. To address the computational efficiency problem and the imbalanced data problem, we propose an ensemble with the extreme learning machine (ELM). This ensemble alleviates the imbalance problem,7 and lets the ELM8 address the computational efficiency requirement. The ELM is a newly developed learning paradigm for single-layer feed-forward neural networks, in which the weights from the input layer to the hidden layer are randomly assigned, while the weights from the hidden layer to the output layer are obtained using linear least-square estimation. Because of its non-iterative nature, the ELM is computationally efficient. Please note that our proposed algorithm isn't a simple combination of an ensemble technique with the ELM. We propose restricted boosting sampling to further enhance the ensemble's capability to handle the imbalance problem, while neuron selection/reduction helps reduce the ELM architecture and hence the computational cost for testing data. In the literature, several algorithms have been proposed to reduce hidden layer neurons,9–12 but they use a set-based selection method and are computationally expensive due to their attempts at finding optimal or suboptimal neurons. Here, we use Fisher's ratio to measure and select hidden layer neurons.

Figure 10 gives a full picture of our system. The relation extractor, built on the Stanford Parser, provides both a dependency relation format and a constituent tree format. The extracted relations are categorized as either a verb or a preposition type based on their cue's part of speech. The feature generation and selection module combines various resources, including a named entity recognition tool, English syntactic knowledge, linguistic expert knowledge, and lexical semantic resources, to generate candidate features and then select the informative ones. After this, every candidate relation is classified as causal or non-causal using our proposed ensemble of reduced ELM classifiers.

Figure 10. System architecture. The relation extractor is built on the Stanford Parser, which provides both a dependency relation format and a constituent tree format. An example long sentence ("The Murray Hill, N.J., company said full-year earnings may be off 33 cents a share because the company removed a catheter from the market") is decomposed into short relations with the help of lexical semantic resources, turned into a feature matrix (N = 2,240 features), reduced by feature selection (M = 120), and classified by the reduced ELM ensemble into causal relations.

Ensemble of the Reduced ELM
Compared with non-causal relations, the causal relation is relatively rare. The data of the causal and non-causal classes are often imbalanced, a problem that usually results in biased classifiers neglecting the minority class. In recent years, the ensemble technique has been used to alleviate this imbalance problem because the technique trains individual classifiers with balanced or less skewed data.



Table 6. Classification results of eight algorithms. Cs, RBO, BA, and MS denote cost sensitive, restricted boosting, bagging, and model selection, respectively.

Algorithm      F-score   G-mean   Accuracy
CsELM+RBO      0.6637    0.8237   0.8869
CsELM          0.6206    0.7472   0.8915
CsSvm+RBO      0.6524    0.8316   0.8764
CsSvm+BA       0.6311    0.8327   0.8606
Svm+RBO        0.6356    0.7784   0.8879
Svm+BA         0.6056    0.8031   0.8565
AdaBoost       0.5784    0.8237   0.8180
CsSvm+MS       0.6420    0.7840   0.8894

Table 7. Neuron selection reduces the neuron number without hurting performance. NS is neuron selection, CI is confidence interval, and Time is the time needed for one repeat of five-fold cross validation.

Data type   NS    Number   Time (s)   F        CI
prep        No    1000     16.35      0.6287   0.5907, 0.6678
prep        No    2000     29.38      0.6325   0.5942, 0.6708
prep        No    5000     69.16      0.6407   0.6032, 0.6781
prep        Yes   1000     34.74      0.6363   0.5996, 0.6729
verb        No    500      5.66       0.6578   0.6043, 0.7113
verb        No    1000     10.15      0.6697   0.6272, 0.7122
verb        No    2000     17.84      0.6812   0.6458, 0.7169
verb        No    4000     37.84      0.6889   0.6505, 0.7273
verb        Yes   500      34.98      0.6721   0.6219, 0.7224
verb        Yes   1000     19.25      0.6765   0.6354, 0.7177

AdaBoost13 is the most widely used ensemble technique. Assuming that p_i(k) denotes the class label of the ith data point predicted by the weak classifier at the kth iteration, and l_i denotes its true class label, the total error is calculated as follows:

ε(k) = Σ_{i=1}^{n} w_i(k) I(l_i ≠ p_i(k)),    (1)

where I is the indicator function, whose output is 1 if its argument holds (that is, if the true label and the prediction differ) and 0 otherwise, n is the number of data points, and w_i(k) is the weight of the ith data point at the kth iteration. The weight is updated at each iteration as follows:

α(k) = (1/2) ln[(1 − ε(k))/ε(k)],    (2)

w_i(k + 1) = w_i(k) exp(−α(k) l_i p_i(k)).    (3)

The weights are then normalized to make Σ_i w_i(k + 1) = 1.

To further enhance the imbalance-handling capability of the ensemble technique, we propose restricted boosting in this study, with the goal of restricting the weight adjustment of data in the majority class. In restricted boosting, the errors for the minority and majority classes are calculated separately:

ε_1(k) = Σ_{i=1}^{n} w_i(k) I(l_i ≠ p_i(k)) I(l_i = 1),    (4)

ε_{−1}(k) = Σ_{i=1}^{n} w_i(k) I(l_i ≠ p_i(k)) I(l_i = −1).    (5)

The weights for data of the minority class and the majority class are then adjusted as in Equations 2 and 3 based on their respective errors. With this restricted boosting, the base classifier won't give the majority class more attention than the minority class.
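As an illustration of Equations 1 through 5, the following NumPy sketch performs one restricted-boosting reweighting step for labels in {+1, −1}; the per-class application of the AdaBoost-style update and the final joint renormalization reflect our reading of the description above, not the authors' released code.

```python
import numpy as np

def restricted_boosting_update(w, y, y_pred):
    """One restricted-boosting reweighting step (labels in {+1, -1}).

    Per-class errors (Eqs. 4-5) drive separate AdaBoost-style updates
    (Eqs. 2-3); the joint renormalization at the end is our assumption.
    """
    miss = (y != y_pred).astype(float)            # indicator I(l_i != p_i)
    alphas = {}
    for cls in (+1, -1):
        mask = (y == cls).astype(float)
        eps = np.sum(w * miss * mask)             # class-specific error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against log(0)
        alphas[cls] = 0.5 * np.log((1 - eps) / eps)
    # Each sample is reweighted with the alpha of its own class.
    alpha_per_sample = np.where(y == 1, alphas[+1], alphas[-1])
    w_new = w * np.exp(-alpha_per_sample * y * y_pred)
    return w_new / np.sum(w_new)                  # normalize so weights sum to 1
```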
An ensemble of classifiers can consist of a large number of base classifiers. The training of those base classifiers must be computationally efficient, so we use the ELM for that purpose. Because of the random assignment and the linear least-squares estimation of weights, the training of the ELM is extremely fast, but due to the random assignment of weights, the ELM usually demands a relatively large number of hidden layer neurons, which harms its computational efficiency for testing data. To deal with this problem, researchers have proposed using the ELM with neuron selection,9–12 which aims to pick the best subset of neurons from a randomly projected large neuron set. However, these set-based selection methods are computationally intensive, which is why we use an individual-based neuron selection method to improve neuron selection efficiency.

The role of a hidden layer neuron is to map data from the original feature space into a new dimension in which data of different classes are separable. Thus, the importance of a hidden layer neuron can be evaluated based on its capability to provide large class separation in the new dimension. Assuming that the hidden layer neuron j maps data to a new dimension z_j, on which the means of the data of the two classes are μ_1^j and μ_2^j respectively, and the standard deviations are σ_1^j and σ_2^j respectively, the class separation provided by the hidden layer neuron j can be measured by Fisher's ratio, which is defined as follows:

F_j = (μ_1^j − μ_2^j)² / ((σ_1^j)² + (σ_2^j)²).    (6)

Neurons providing large class separation are retained, while those providing little or no class separation are removed.
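A minimal sketch of the individual, Fisher's-ratio-based neuron selection of Equation 6 could look like this; the hidden-layer matrix layout and the retained fraction keep are illustrative assumptions.

```python
import numpy as np

def select_neurons_fisher(H, y, keep=0.5):
    """Rank hidden neurons by Fisher's ratio (Eq. 6) and keep the best ones.

    H : (n_samples, n_neurons) hidden-layer outputs; y : labels in {+1, -1}.
    """
    pos, neg = H[y == 1], H[y == -1]
    mu1, mu2 = pos.mean(axis=0), neg.mean(axis=0)
    s1, s2 = pos.std(axis=0), neg.std(axis=0)
    fisher = (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2 + 1e-12)  # per-neuron separation
    n_keep = max(1, int(keep * H.shape[1]))
    keep_idx = np.argsort(fisher)[::-1][:n_keep]              # largest ratios first
    return keep_idx
```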
­ roviding



Due to the nature of individual neuron selection, the neurons selected by Fisher's ratio are non-optimal, but this non-optimality is wanted because the ensemble requires weak classifiers.

Experiment
For high-level semantic relation extraction, data is very expensive. In this study, we labeled 300 sentences from Propbank14 based on Matthew Hausknecht's annotation. The relation extractor extracted 1,683 relations, of which 280 are causal.

We conducted two experiments to test the proposed algorithm. The first tested the method's capability to deal with imbalanced data compared with other sampling methods, and the second evaluated the capability of the proposed neuron selection algorithm to reduce the ELM architecture. The performance of the ensemble of the original ELM and the ensemble of the reduced ELM were compared for an equal number of neurons. The results are based on 100 repeats of five-fold cross validation.

Table 6 lists the results of the first experiment. Apparently, the best F-score and G-mean are obtained by combining the ELM and restricted boosting. Compared with the original AdaBoost and other sampling methods, the proposed restricted boosting improves both the accuracy and the F-score. It is also observed that the ELM outperforms the SVM in this application.

Table 7 gives the results of the second experiment. The F-score shows that the ensemble of the reduced ELM with 1,000 neurons outperforms the ensemble of the ELM with 2,000 random neurons, while the Time column shows that the time needed for one repeat of five-fold cross validation is similar. The results in Table 7 also verify that simple individual-based neuron selection can significantly cut down the number of neurons, and hence the computational cost for testing data, with little performance loss. In addition, the smaller confidence interval indicates that the reduced ELM is more robust than the original ELM.

The restricted boosting and neuron selection algorithms effectively address the concerns of imbalanced data and computational efficiency in causal relation extraction. Our proposed method has been tested on a real problem of knowledge extraction.

References
1. D. Wimalasuriya and D. Dou, "Ontology-Based Information Extraction: An Introduction and a Survey of Current Approaches," J. Information Science, vol. 36, no. 3, 2010, pp. 306–323.
2. E. Cambria, T. Mazzocco, and A. Hussain, "Application of Multi-Dimensional Scaling and Artificial Neural Networks for Biologically Inspired Opinion Mining," Biologically Inspired Cognitive Architectures, vol. 4, no. 0, 2013, pp. 41–53.
3. R. Girju et al., "Text Mining for Causal Relations," Proc. FLAIRS Conf., AAAI Press, 2002, pp. 360–364.
4. R. Girju, "Automatic Detection of Causal Relations for Question Answering," Proc. ACL 2003 Workshop on Multilingual Summarization and Question Answering, Assoc. Computational Linguistics, 2003, pp. 76–83.
5. B. Beamer, A. Rozovskaya, and R. Girju, "Automatic Semantic Relation Extraction with Multiple Boundary Generation," Proc. 23rd Nat'l Conf. Artificial Intelligence, AAAI Press, 2008, pp. 824–829.
6. E. Blanco, N. Castell, and D. Moldovan, "Causal Relation Extraction," Proc. 6th Int'l Language Resources and Evaluation, European Language Resources Assoc., 2008, pp. 310–313.
7. M. Galar et al., "A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches," IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 4, 2012, pp. 463–484.
8. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, no. 1, 2006, pp. 489–501.
9. H.-J. Rong et al., "A Fast Pruned-Extreme Learning Machine for Classification Problem," Neurocomputing, vol. 72, no. 1, 2008, pp. 359–366.
10. Y. Lan, Y.C. Soh, and G.-B. Huang, "Random Search Enhancement of Error Minimized Extreme Learning Machine," Proc. European Symp. Artificial Neural Networks, European Neural Network Soc., 2010, pp. 327–332.
11. Y. Lan, Y.C. Soh, and G.-B. Huang, "Constructive Hidden Nodes Selection of Extreme Learning Machine for Regression," Neurocomputing, vol. 73, no. 16, 2010, pp. 3191–3199.
12. Y. Miche et al., "OP-ELM: Optimally Pruned Extreme Learning Machine," IEEE Trans. Neural Networks, vol. 21, no. 1, 2010, pp. 158–162.
13. Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," Computational Learning Theory, Springer, 1995, pp. 23–37.
14. P. Kingsbury and M. Palmer, "From Treebank to Propbank," Proc. 3rd Int'l Conf. Language Resources and Evaluation, Citeseer, 2002, pp. 1989–1993.

Xuefeng Yang is a PhD candidate in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. Contact him at yang0302@e.ntu.edu.sg.

Kezhi Mao is an associate professor in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. Contact him at EKZMao@ntu.edu.sg.



A System for Signature Verification Based on Horizontal and Vertical Components in Hand Gestures

Beom-Seok Oh, Jehyoung Jeon, Kar-Ann Toh, Andrew Beng Jin Teoh, and Jaihie Kim, School of Electrical and Electronics Engineering, Yonsei University, Korea

Due to its ease of use and behavioral uniqueness, the signature has played an important role in personal identification since the dawn of civilization. The most frequently and widely used form of signature is either a written version or a stamp that uses a seal, both of which have drawbacks. First, once a signature is written or stamped on a document, it's revealed to anyone who can access that document. This opens a vulnerability to forgery. Second, both have limitations in terms of remote authentication. To authenticate a handwritten signature on a document, the signers have to be physically present during signature acquisition.

Recent research1,2 proposes a new paradigm for signature biometry: a user holding a positional sensor or wearing a glove with markers attached performs his signature in the air instead of on a surface. Because of this interface's contactless nature, no trace of the signature is left for forgery, and the signers don't need to be physically present. However, existing in-air systems are rather limited. Holding a positioning sensor such as a smartphone for in-air signature isn't natural, and the range of wrist usage is rather narrow, which limits hand gestures.

Here, we propose an in-air hand gesture signature verification system that doesn't require a handheld device. A depth image sensor captures signature gestures and records each signature as a 3D volume. A structured projection3 is then applied to the directionally accumulated images for feature extraction. Subsequently, these features are fused for possible performance enhancement. The total error rate minimization of extreme learning machine (TERELM)4 was adopted for fusion due to its classification-goal-driven learning without the need for an iterative search.

Proposed System
Figure 11 shows the configuration of our prototype system for hand gesture signature verification. As illustrated in the figure, a depth sensor (Microsoft Kinect) was placed 1.4 m above an LCD monitor that displays an RGB movie taken by the sensor for real-time user feedback. The sensor height is determined to cover the upper-body motion of a user whose height falls between 1.6 and 1.9 m, standing approximately 1 to 2 m away from the sensor. The user spreads his arm out toward the sensor to perform the intended hand signature gestures.

Figure 11. A flow diagram of the proposed hand gesture signature verification system: (a) and (b) the user's hand signature is captured using a depth sensor and stored as a video sequence; (c) each sample is preprocessed and (d) represented by a set of directional features; (e) finally, the obtained match scores are fused using TERELM.

The signature data acquired using the prototype system contains not only the region of the body but also noise such as imaging distortion and background clutter.



We're particularly interested in the movement of the palm-mass region ("palm" is the targeted hand region that includes the palm, fingers, and back of the palm), which forms the desired signature information. To segment the region of interest, four preprocessing steps were performed on the acquired raw signature data as follows:

• Start and end of signature detection. Because there's no clear indication of when a user starts and ends a hand gesture signature, we manually detect them. The output of this process is the signature movies M_i ∈ ℝ^(h×w×t), where i = 1, ..., m indexes the samples, h and w respectively indicate the height and width of a depth frame, and t denotes the time index, which equals the number of frames.
• Palm-to-sensor distance estimation. Because the user's hand is the closest object to the sensor, the pixels that correspond to the fingertips might have the smallest depth values. Moreover, our pre-analysis of the signature data showed that the acquired hand gesture signatures are relatively consistent in terms of depth. With these in mind, we recorded the lowest depth value per frame; the average of these values, z̄_i, is used as an estimated distance between the palm-mass and the sensor.
• Palm-mass area detection. The next task is to segment the palm-mass area from each frame of M_i. The size (number of pixels) of the palm-mass area is estimated by a first-order exponential function defined as n_i = ⌊p_1 × exp(p_2 × z̄_i) + γ⌋, where ⌊·⌋ is the floor function, p_1 and p_2 are variables of the first-order exponential function, z̄_i is the calculated palm-to-sensor distance of the ith sample, and γ is an offset. The n_i pixels that correspond to the n_i lowest depth values are selected and used as the palm-mass region. The output of this step is M̃_i, which contains only the palm-mass area.
• Signature cropping in the spatial domain. Finally, a rectangular mask that covers the region of hand movement is applied to M̃_i to crop out only the signature region.

As shown in Figures 11c and 11d, the preprocessed signature data M̃_i is in the form of a 3D volume. To efficiently extract the necessary features, we adopt a summation of the volume data along the up (y-axis) and profile (x-axis) directions, respectively. The upward summing of M̃_i generates a 2D signature image called the up-summed image M^u_i ∈ ℝ^(w×t). This M^u_i exhibits the way the signature moves horizontally (see Figure 11d). In a similar manner, a profile summing of the volume yields another signature image called M^p_i ∈ ℝ^(h×t). Through this accumulation, we can observe how the signature varies vertically.

Different signatures have different spatial sizes and time durations. To standardize the spatial image size and time duration, a simple image resizing technique that uses bicubic interpolation is adopted. As a result of this step, M^u_i ∈ ℝ^(w×t) and M^p_i ∈ ℝ^(h×t) are normalized as M̃^u_i ∈ ℝ^(w'×t') and M̃^p_i ∈ ℝ^(h'×t'), where w', h', and t' are the normalized width, height, and time sizes.

From both the up-summed image M̃^u_i and the profile-summed image M̃^p_i, we can observe how the user's hand moves horizontally and vertically. To extract directional information for verification, both sum images were projected onto two structured projection bases: the horizontal projection basis matrix R1 and the vertical projection basis matrix R2.3

Considering the conformation of the matrix inner product, the size of the R1 projection matrix should be k × w' for pre-multiplication with M̃^u_i (which we call R1^u) and k × h' for pre-multiplication with M̃^p_i (denoted as R1^p), respectively. Here, k indicates an arbitrary number of projection vectors. Similar to the R1 projection, the R2^u ∈ ℝ^(t'×k) projection matrix is post-multiplied with M̃^u_i, and the R2^p ∈ ℝ^(t'×k) projection matrix is post-multiplied with M̃^p_i (see Figure 11e).

Here, R1^u M̃^u_i ∈ ℝ^(k×t') and R1^p M̃^p_i ∈ ℝ^(k×t') extract vertically compressed features of the hand position in the horizontal and vertical directions. M̃^u_i R2^u ∈ ℝ^(w'×k) captures hand movement in the horizontal direction, and the feature matrix that results from M̃^p_i R2^p ∈ ℝ^(h'×k) contains information on how the hand moves along the vertical direction.
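To illustrate the directional summation and the R1/R2 structured projections described above, here is a small NumPy sketch; the nearest-neighbor resizing (standing in for bicubic interpolation), the random projection bases, and the function name are assumptions for illustration.

```python
import numpy as np

def directional_projection_features(volume, R1u, R2u, R1p, R2p):
    """Sketch of the directional-sum + structured-projection features.

    volume : (h, w, t) preprocessed signature volume (palm-mass only).
    R1u (k, w'), R2u (t', k), R1p (k, h'), R2p (t', k) : projection bases.
    """
    M_up = volume.sum(axis=0)        # up-summed image, shape (w, t): horizontal motion
    M_profile = volume.sum(axis=1)   # profile-summed image, shape (h, t): vertical motion

    # Normalize spatial and temporal sizes (bicubic in the article; nearest here for brevity).
    def resize(img, shape):
        rows = np.linspace(0, img.shape[0] - 1, shape[0]).astype(int)
        cols = np.linspace(0, img.shape[1] - 1, shape[1]).astype(int)
        return img[np.ix_(rows, cols)]

    Mu = resize(M_up, (R1u.shape[1], R2u.shape[0]))        # (w', t')
    Mp = resize(M_profile, (R1p.shape[1], R2p.shape[0]))   # (h', t')

    # Four projected feature matrices (pre- and post-multiplication).
    return R1u @ Mu, R1p @ Mp, Mu @ R2u, Mp @ R2p
```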
Experiments
To enhance the verification accuracy of the individual features, the four projected features discussed above are fused at the score level using TERELM.4

Database
We acquired a database of hand gesture signatures from 100 subjects. Each subject was briefly instructed about the proposed signature system and asked to perform his or her own 2D signature using a hand in the air. Participants performed the in-air signature 10 times, with each trial recorded as a movie sequence. The first five trial sequences per subject were used for system training, and the remaining five were used for performance evaluation.

Evaluation scenario
The goal of our experimental study is to observe our proposed signature system's feasibility for identity verification under three scenarios: individual features, unimodal fusion, and bimodal fusion.



Under the first scenario, the proposed four projection features are evaluated in terms of accuracy. Besides the projection features, we also evaluate the discriminative power of six trajectory features5 under the same experimental setup. The six trajectory features were extracted from both the fingertip and palm-mass center trajectories.

Under the second scenario, we fused all four projection features at the score level and all six trajectory features at the feature level, respectively. Under the third scenario, all four projection features and six trajectory features are fused at the score level.

Evaluation protocols
To stabilize the palm-mass area detection, the exponential parameters p_1 = 13,910, p_2 = −0.001929, and γ = 495 were found manually using the training set. The normalization sizes w' = 97, h' = 69, and t' = 30 were determined based on the minimum sizes of the entire set of training palm-mass area samples.

The R1 and R2 projections have two parameters, namely, the projection size k and the group size l.3 In this work, we set k = 100, l = 10 for R1^u and R1^p, and l = 5 for R2^u and R2^p. These parameters were obtained based on 10 runs of two-fold cross-validation using only the training set.

For the trajectory features, the fingertip and palm-mass trajectories are extracted from a signature data sample M.5 From the trajectories, we also extracted velocity and acceleration features,5 giving us six trajectory features in total. Dynamic time warping (DTW) is adopted for trajectory matching.

In score fusion, the verification accuracy and CPU time (elapsed for learning) of TERELM are compared with those of the extreme learning machine (ELM)6 and the support vector machine (SVM)7 using linear, polynomial (at different orders within the range {2, …, 6}), and radial basis function (RBF) kernels (at different s values selected within {0.1, 0.5, 1, 1.5, …, 5}). For the ELM and TERELM, different numbers of hidden nodes N ∈ {10, 20, …, 100} are evaluated. In this fusion performance benchmarking, only the best test performances among the evaluated parameter settings are reported. Following related work,4 we set the threshold t = 0 and offset h = 1 for TERELM and normalized all the attributes into the range [0, 1].

Results
Table 8 shows the average equal error rate (EER) over 30 runs using 30 different R1 and R2 projection bases, along with the investigated experimental scenarios. As shown in the table, the R1 and R2 projections on the profile-summed images M^p show about 2.5 to 3 percent lower EER than those on the up-summed images M^u. Among the four projections, R1 on M^p shows the best EER performance, while R1 on M^u gives the worst.

Table 8. Average EER (%) accuracy and CPU time (s) performance (elapsed for learning) benchmarking along the evaluation scenarios.

Scenario 1, individual features
  Projection features:
    R1 on M^u images                 EER 10.17    N/A (no learning required)
    R2 on M^u images                 EER 9.32     N/A
    R1 on M^p images                 EER 7.72     N/A
    R2 on M^p images                 EER 7.15     N/A
  Trajectory features5:
    Fingertip position               EER 7.27     N/A (no learning required)
    Fingertip velocity               EER 2.92     N/A
    Fingertip acceleration           EER 10.48    N/A
    Palm-mass center position        EER 7.78     N/A
    Palm-mass center velocity        EER 3.04     N/A
    Palm-mass center acceleration    EER 5.92     N/A
Scenario 2, unimodal fusion
  Case 1: fusion of all projection features at score level:
    SVM (linear)                     EER 4.07     CPU 110.63 s
    SVM (poly, order = 3)            EER 3.37     CPU 102.59 s
    SVM (RBF, s = 1)                 EER 3.52     CPU 153.47 s
    ELM (N = 100)                    EER 3.39     CPU 1.55 s
    TERELM (N = 50)                  EER 3.43     CPU 0.16 s
  Case 2: fusion of all trajectories at feature level: EER 2.10, CPU N/A
Scenario 3, bimodal fusion
  Fusion of all projected and trajectory features at score level (10 features):
    SVM (linear)                     EER 0.72     CPU 13.17 s
    SVM (poly, order = 3)            EER 0.66     CPU 13.17 s
    SVM (RBF, s = 1)                 EER 0.62     CPU 26.46 s
    ELM (N = 90)                     EER 0.75     CPU 1.32 s
    TERELM (N = 100)                 EER 0.63     CPU 0.29 s

The best performance among the trajectory features was observed for "Fingertip velocity," with "Fingertip acceleration" giving the worst performance. Generally, the palm-mass center features show better EER performance than the fingertip features. This could be due to the extracted palm-mass center point being more stable than the extracted fingertip point.

Under Scenario 2, we observed verification performance enhancements as a result of information fusion. Particularly in Case 1, all the fusion results show about 3 to 4 percent lower EER than that of the best projection feature, R1 on M^p. The three investigated fusion schemes appear to have similar accuracy performance. However, TERELM outperformed SVM and ELM in terms of learning speed. The main reason for the fast learning speed of ELM and TERELM is their noniterative solution; TERELM is slightly faster than ELM due to its split covariance with smaller sizes. In Case 2, the feature-level fusion of trajectory features yields about 0.8 percent lower EER than that of the best trajectory feature, the palm-mass center velocity.



The last five rows of the table show the EER accuracy and CPU learning speed (in seconds) under Scenario 3. The three investigated fusion algorithms yield a similar range of 0.6 to 0.7 percent EER, about 1.4 to 1.5 percent lower EER values than that of the best unimodal fusion. Similar to Scenario 2, TERELM shows the fastest learning among the three compared algorithms due to its split covariance computation.

Observations and Discussion
The up-summed images M^u contain the horizontal movements of users' hands, while the hand movements in the vertical direction are captured by the profile-summed images M^p. From Table 8, we observe that the R1 and R2 projections on M^u produced better EER performances than those on M^p. From these clues, we conclude that summing signature volumes upward would be more beneficial than taking the profile summation in terms of verification accuracy.

The table shows that using palm-mass center features for identity verification yields better accuracies than using the fingertip features. This could be due to the stability of the extracted features, as mentioned previously. The table also reveals that the velocity feature contains the most discriminative information among the investigated trajectory features.

Under Scenarios 2 and 3, we observed performance enhancement resulting from information fusion. In particular, the lowest learning cost was observed for TERELM over ELM and SVMs, with similar performance enhancement over that of a single modality.

Our experiments showed that the proposed signature system, with adequate features and parameter settings, can be used for identity verification.

Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (grant number 2012-0001306).

References
1. M. Katagiri and T. Sugimura, "Personal Authentication by Free Space Signing with Video Capture," Proc. 5th Asian Conf. Computer Vision, Asian Federation of Computer Vision Societies, 2002, pp. 350–355.
2. G. Bailador et al., "Analysis of Pattern Recognition Techniques for In-Air Signature Biometrics," Pattern Recognition, vol. 44, no. 10, 2011, pp. 2468–2478.
3. B.-S. Oh et al., "Combining Local Face Image Features for Identity Verification," Neurocomputing, vol. 74, no. 16, 2011, pp. 2452–2463.
4. K.-A. Toh, "Deterministic Neural Classification," Neural Computation, vol. 20, no. 6, 2008, pp. 1565–1595.
5. J.-H. Jeon et al., "A System for Hand Gesture Based Signature Recognition," Proc. 12th Int'l Conf. Control Automation Robotics & Vision, IEEE, 2012, pp. 171–175.
6. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, no. 1–3, 2006, pp. 489–501.
7. C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, 1998, pp. 121–167.

Beom-Seok Oh is a PhD candidate in the School of Electrical and Electronics Engineering at Yonsei University, Korea. Contact him at a-bullet@yonsei.ac.kr.

Jehyoung Jeon is an MS candidate in the School of Electrical and Electronics Engineering at Yonsei University, Korea. Contact him at jh.jeon@yonsei.ac.kr.

Kar-Ann Toh is a professor in the School of Electrical and Electronics Engineering at Yonsei University, Korea. Contact him at katoh@yonsei.ac.kr.

Andrew Beng Jin Teoh is an associate professor in the School of Electrical and Electronics Engineering at Yonsei University, Korea. Contact him at bjteoh@yonsei.ac.kr.

Jaihie Kim is a professor in the School of Electrical and Electronics Engineering at Yonsei University, Korea. Contact him at jhkim@yonsei.ac.kr.

An Adaptive and Iterative Online Sequential ELM-Based Multi-Degree-of-Freedom Gesture Recognition System

Hanchao Yu, Yiqiang Chen, and Junfa Liu, Institute of Computing, Chinese Academy of Sciences, China
Guang-Bin Huang, School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore

Gesture recognition can be divided into online recognition, where the recognition model can adapt to new users automatically to achieve high recognition accuracy, and offline recognition, where the model fits well to users who have contributed training samples but might not perform as well with new users. Recently, gesture recognition technology has become a research hotspot in human-computer interaction.1 Zhou Ren and colleagues2 proposed a gesture recognition system based on Kinect. The system used depth and skin color information to detect hand gestures from a messy environment


and Finger-Earth Mover's Distance for gesture recognition. To help someone communicate with a hearing- or speech-impaired person, M.K. Bhuyan and colleagues3 presented a method for synthesizing hand gestures with the help of a computer and implemented a gesture animation framework for recognizing hand gestures. Mosiuoa Sole and colleagues4 applied the extreme learning machine (ELM)5 to classify static hand gestures that represent different letters of the Auslan dictionary.

These gesture recognition works are mainly for offline recognition. In actual applications, instead of working only for users whose samples have been used in training, gesture recognition systems should recognize most users' gestures quickly and accurately even if their samples weren't used in training beforehand. An online sequential learning framework might provide an efficient solution: such models can learn from users' samples chunk by chunk and don't require all the data to be present at one time. Nan-Ying Liang and colleagues6 proposed an online sequential variant of the ELM (OS-ELM). OS-ELM can process data in sequential form and update the existing model just by learning the newly arriving samples.

Here, we propose the adaptive and iterative online sequential ELM (AIOS-ELM), which executes multiple iterations to make full use of the implied knowledge in each batch of incremental data. By introducing an adaptive mechanism and capitalizing on the original model's recognition ability, AIOS-ELM emphasizes the contribution of the current data to the model, which can quickly improve its adaptive ability and thus improve the OS-ELM's generalization performance.

AIOS-ELM
We start by revising the parameter updating formula of OS-ELM:

β^(k+1) = β^(k) + (1 + B_{n+1} / (N_0 + Σ_{i=1}^{n} B_i)) · K_{k+1}^{−1} H_{k+1}^T (T_{k+1} − H_{k+1} β^(k)).    (1)

In Equation 1, β^(k) is the output weights linking the hidden nodes to the output nodes, and k is the index of the current model. N_0 represents the number of existing data in the system, B_i represents the amount of new user data used to update the model at the ith time, and n represents the number of batches of new user data in the system.

We adopted the Newton iterative method to update β^(k). That is, every new batch of data needs to execute Equation 1 iteratively until meeting Equation 2:

|β^(k+1) − β^(k)| < ε,    (2)

where ε is a given minimum threshold. To keep the retraining fast, we limit the number of iterations to 100. If the iterative execution doesn't meet Equation 2 and the iteration count reaches 100, we break the execution and use β^(k+1) as the final model.
gestures that represent different letters ability and thus improve the ­OS-ELM’s work, which includes gesture segmenta-
of the Auslan dictionary. generalization performance. tion, data collection, fingertip tracking,
These gesture recognition works feature extraction, digit/alphabet gesture
are mainly for offline recognition. AIOS-ELM recognition, and so on.
While in actual application, instead We start by revising the parameter
of working for users whose samples updating formula of OS-ELM: Gesture Segmentation
have been used in training, gesture Skeleton and depth data are acquired
recognition systems should recog-  Bn +1  through Kinect at a speed of 30 fps.
β (k +1) = β (k) +  1 + 
nize most users’ gestures fast and 
n
N0 + Σ i =1Bi  Effective input gestures can be seg-
accurately even if their samples mented out by feeling users’ writing
weren’t used in training beforehand. ( )
⋅ Kk−+11HkT+1 Tk +1 − Hk +1β (k) . intention via Equation 3; effec-
An online sequential learning frame- (1) tive input gestures are segmented
work might ­ provide an ­efficient so-  p|6 ≤ q ≤ 5p|6 and
out onlywhen
lution: they can learn from users’ In Equation 1, b(k) is the output ( )
0 ≤ h ≤ BA + BC / 2 . OGRS only col-
samples chunk by chunk and don’t weights linking the hidden nodes to lects the depth data of segmented in-
require all the data present at one the output nodes, and k is the index of put gestures to do further processing:
time. Nan-Ying Liangand colleagues6 the current model. N0 represents the
proposed an online sequential vari- number of existing data in the system,     
θ = arccos  BA ⋅BC
ant of the ELM (OS-ELM). OS-ELM bi represents the amount of new user    
  BA ⋅ BC  ,(3)
can process data in sequential form data for updating the model at the ith
and u­ pdate the existing model just by time, and n represents the number of 
h = A − Cy
learning the newly arriving samples. batches of new user data in the system.  y
2
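As a concrete illustration, the following minimal NumPy sketch applies the Equation 1 update repeatedly until the Equation 2 criterion or the 100-iteration cap is reached. The function and variable names are illustrative only and aren't taken from our implementation, and the convergence test used here (maximum absolute change in β) is one reasonable reading of Equation 2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer(X, W, b):
    # ELM random feature mapping: H = g(XW + b); W and b are generated
    # randomly once and then kept fixed.
    return sigmoid(X @ W + b)

def aios_elm_update(beta, K, H_new, T_new, n0, prev_batch_sizes,
                    eps=1e-4, max_iter=100):
    """One AIOS-ELM incremental step (a sketch of Equations 1 and 2).

    beta             -- current output weights, shape (L, m)
    K                -- accumulated matrix K_k (sum of H^T H over all seen data)
    H_new, T_new     -- hidden-layer outputs and one-hot targets of the new batch
    n0               -- N_0, the number of samples behind the initial model
    prev_batch_sizes -- sizes B_1..B_n of the earlier incremental batches
    """
    b_next = H_new.shape[0]                                  # B_{n+1}
    weight = 1.0 + b_next / (n0 + sum(prev_batch_sizes))     # adaptive weight in Eq. 1
    K = K + H_new.T @ H_new                                  # K_{k+1}
    K_inv = np.linalg.inv(K)
    for _ in range(max_iter):                                # iterate Eq. 1 ...
        beta_next = beta + weight * K_inv @ H_new.T @ (T_new - H_new @ beta)
        converged = np.max(np.abs(beta_next - beta)) < eps   # ... until Eq. 2 holds
        beta = beta_next
        if converged:
            break
    return beta, K
```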



Online Gesture Recognition System
Using Kinect and based on the AIOS-ELM, we developed an online gesture recognition system (OGRS) that can recognize contactless gesture inputs of the digits 0–9 and the letters a–z. Figure 12 shows the OGRS interface. The input window in the upper left displays the writing trace of users dynamically; the recognition result window at the lower left shows the recognition results as soon as the input finishes.

Figure 12. Online gesture recognition system interface. The upper left window displays the writing trace of users dynamically. The lower left window shows the recognition result as soon as the input finishes.

Figure 13 shows the OGRS framework, which includes gesture segmentation, data collection, fingertip tracking, feature extraction, digit/alphabet gesture recognition, and so on.

Gesture Segmentation
Skeleton and depth data are acquired through Kinect at a speed of 30 fps. Effective input gestures are segmented out by detecting users' writing intention via Equation 3; input gestures are segmented out only when π/6 ≤ θ ≤ 5π/6 and 0 ≤ h ≤ (|BA| + |BC|)/2. OGRS collects only the depth data of segmented input gestures for further processing:

θ = arccos((BA · BC) / (|BA| |BC|)),  h = A_y − C_y,  (3)

where A, B, and C are the positions of the wrist, elbow, and shoulder, respectively; BA and BC are the vectors from B to A and from B to C; and A_y and C_y are the vertical coordinates of points A and C.
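The following minimal NumPy sketch evaluates Equation 3 and the segmentation condition from a single skeleton frame. The joint layout (the y component as the vertical axis) and the function names are illustrative assumptions, not the OGRS code.

```python
import numpy as np

def writing_intention(wrist, elbow, shoulder):
    """Compute the angle theta and the height h of Equation 3.

    wrist, elbow, shoulder -- Kinect skeleton positions of points A, B, and C;
    the y component (index 1) is assumed to be the vertical axis.
    """
    A, B, C = (np.asarray(p, dtype=float) for p in (wrist, elbow, shoulder))
    BA, BC = A - B, C - B
    cos_theta = BA @ BC / (np.linalg.norm(BA) * np.linalg.norm(BC))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    h = A[1] - C[1]                       # vertical offset of wrist above shoulder
    return theta, h

def is_effective_gesture(wrist, elbow, shoulder):
    """Segmentation rule: pi/6 <= theta <= 5*pi/6 and 0 <= h <= (|BA| + |BC|)/2."""
    A, B, C = (np.asarray(p, dtype=float) for p in (wrist, elbow, shoulder))
    BA, BC = A - B, C - B
    theta, h = writing_intention(wrist, elbow, shoulder)
    upper = (np.linalg.norm(BA) + np.linalg.norm(BC)) / 2.0
    return (np.pi / 6 <= theta <= 5 * np.pi / 6) and (0.0 <= h <= upper)
```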
Fingertip Tracking
Based on the collected gesture data, fingertips can be detected accurately by the palm-posture-adaption-based robust single-fingertip tracking method we described in our previous work.7 By detecting and recording the moving trace of the fingertip over the effective gesture data, we can obtain recognition features of the same dimensionality for every gesture by applying interpolation and subsampling operations.
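As a sketch of the interpolation-and-subsampling step, the following function resamples a variable-length fingertip trace to a fixed number of points. The 64-point length and the (x, y) trace layout are illustrative assumptions rather than the settings used in OGRS.

```python
import numpy as np

def resample_trace(trace, n_points=64):
    """Interpolate and subsample a fingertip trace to a fixed-length feature.

    trace    -- array of shape (T, 2) holding the recorded (x, y) fingertip
                positions of one segmented gesture (T varies per gesture)
    n_points -- fixed number of samples kept as the recognition feature
    Returns a flat feature vector of length 2 * n_points.
    """
    trace = np.asarray(trace, dtype=float)
    t_old = np.linspace(0.0, 1.0, num=len(trace))
    t_new = np.linspace(0.0, 1.0, num=n_points)
    x = np.interp(t_new, t_old, trace[:, 0])   # linear interpolation in x
    y = np.interp(t_new, t_old, trace[:, 1])   # linear interpolation in y
    return np.concatenate([x, y])
```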

Figure 13. OGRS framework. It includes gesture segmentation, data collection, fingertip tracking, and so on.

Figure 14. Samples of digit gestures.

Gesture Recognition
OGRS uses the ELM to train the initial gesture recognition model based on the collected training data of digit/alphabet gestures; it then uses the AIOS-ELM to update the model based on new users' gesture data.

It's worth mentioning that we designed a delete gesture, performed by waving the other hand, which can delete an incorrect input from the user or an incorrect recognition by the system. Inputs that aren't deleted are considered correctly labeled samples to be learned. Based on the labeled samples, OGRS uses the AIOS-ELM to implement online learning by retraining the gesture recognition model whenever it receives new samples. OGRS can thus become more intelligent by frequently interacting with more users.
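A minimal sketch of this online loop follows, assuming a hypothetical system interface (gesture capture, result display, delete-gesture detection) and a model object with predict and update methods; none of these names come from the actual OGRS implementation, and the 50-gesture batch size simply mirrors the incremental batches used in the experiments below.

```python
class OnlineGestureLoop:
    """Sketch of the OGRS online-learning loop described above."""

    def __init__(self, system, model, batch_size=50):
        self.system = system          # hypothetical interface to the running app
        self.model = model            # gesture model with AIOS-ELM style updates
        self.batch_size = batch_size
        self.pending = []             # undeleted samples awaiting retraining

    def step(self):
        features = self.system.next_gesture()        # resampled fingertip trace
        label = self.model.predict(features)
        self.system.show_result(label)
        if self.system.delete_gesture_detected():    # user waved the other hand
            self.system.remove_last_result()         # discard wrong input or label
        else:
            self.pending.append((features, label))   # treat as correctly labeled
        if len(self.pending) >= self.batch_size:
            self.model.update(self.pending)          # AIOS-ELM retraining step
            self.pending.clear()
```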
Experiments and Results
We used the OGRS as the experiment system and samples of the digit gestures 0–9 as experiment data. Figure 14 shows some samples of digit gestures from users. The experiments ran on a PC with an Intel Core i5-2310 2.90-GHz processor, 4 Gbytes of RAM, and the Microsoft Windows Server 2008 operating system.

Data Source
We invited 21 users (11 males and 10 females) to use our OGRS. In practice, the system automatically collected users' writing trajectory information for digit gestures. We randomly selected 20 users' corresponding 3,000 gestures (2,700 as a training dataset and 300 as a testing dataset) for the initial gesture recognition model, with each user accounting for 15 gestures for each digit. We treated the last person as a new user of the system: 500 of this user's gestures were selected as incremental training data and divided into 10 batches, with each digit accounting for five gestures in each batch. The other 300 gestures of the new user were selected as an incremental testing dataset, with each digit accounting for 30 gestures.
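The following sketch encodes this split as illustrative constants and trains an initial ELM of the kind used in the experiments (sigmoid activation, 500 hidden nodes, least-squares output weights). It's a minimal sketch under those assumptions, not the authors' code.

```python
import numpy as np

# Illustrative constants matching the data split described above.
N_USERS_INITIAL = 20            # users behind the initial model
GESTURES_PER_USER = 150         # 10 digits x 15 gestures
N_TRAIN, N_TEST = 2700, 300     # split of the 3,000 initial gestures
N_INCREMENTAL_BATCHES = 10      # new user's 500 gestures, 50 per batch

def train_initial_elm(X, T, n_hidden=500, seed=0):
    """Train an initial ELM classifier (a sketch, not the authors' implementation).

    X -- training features, shape (N, d); T -- one-hot targets, shape (N, m).
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # sigmoid hidden outputs
    beta = np.linalg.pinv(H) @ T                             # least-squares output weights
    return W, b, beta

def classify(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)                       # predicted digit index
```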



Figure 15. Incremental experiment results. Compare (a) the testing accuracy of each increment with (b) the training time of each increment. (Both panels plot ELM, SVM, OS-ELM, and AIOS-ELM against the incremental number, 1 through 10; panel (a) reports testing accuracy and panel (b) reports training time in seconds.)

Table 9. Initial gesture recognition models.

            Training dataset        Testing dataset        Incremental testing dataset
Algorithm   Accuracy    Time        Accuracy    Time       Accuracy    Time
ELM         96.93%      3.69 s      90.33%      0.04 s     69.67%      0.04 s
SVM         88.59%      4.52 s      88.00%      0.58 s     63.67%      0.58 s

Gesture Recognition Experiments
We validated the AIOS-ELM's performance by comparing it with SVM, ELM, and OS-ELM.

Initial gesture recognition models. We trained gesture recognition models by ELM and SVM with the training dataset, and then tested the initial models with the testing dataset and the incremental testing dataset to get testing accuracy and running time. The activation function of the ELM was set to the sigmoid function, the number of hidden nodes of the ELM was set to 500, and the parameters c and g of the SVM were chosen to be 1 and 0.06. Table 9 shows the results.

As Table 9 shows, the training and testing times for the ELM are both shorter than those for SVM, and the ELM's training and testing accuracy are both higher. But even though the ELM-generated gesture recognition model is faster and more accurate than SVM, the ELM can only reach a testing accuracy of 69.67 percent for new users.

Incremental experiments. Based on the initial gesture recognition models, we set the incremental times and used the ELM, SVM, OS-ELM, and AIOS-ELM to retrain the gesture recognition models with sequentially arriving training data.

As Figure 15 shows, the ELM is more accurate and faster than SVM in all incremental experiments, but OS-ELM and AIOS-ELM are much faster than the ELM in all incremental experiments because they use a sequential training mechanism: they don't retrain with the old data but update the old model with the newly arrived data. AIOS-ELM is also more accurate than the ELM and OS-ELM in all incremental experiments because it uses the adaptive weight punishment and iterative strategy, which lets it adapt to new users faster and reach higher gesture recognition accuracy with less incremental time. Based on AIOS-ELM, the online gesture recognition system can reach a high accuracy of 96.7 percent within 10 sequential operations. AIOS-ELM needs to iterate Equation 1 in the sequential training process, which costs a little more time than OS-ELM, but this takes only about 1 second and doesn't affect the efficiency of an online gesture recognition system.

Our results show that, based on AIOS-ELM, the gesture recognition system can support online lifelong learning for users and reach quick, high recognition accuracy for new users' gestures.

Experiments confirm that our gesture recognition system using AIOS-ELM can quickly and accurately adapt to new users.



Acknowledgments
We thank Lei Zhang and Meiyu Huang for their constructive suggestions for this article.

References
1. S. Mitra and T. Acharya, "Gesture Recognition: A Survey," IEEE Trans. Systems, Man, and Cybernetics, vol. 37, no. 3, 2007, pp. 311–324.
2. Z. Ren et al., "Robust Hand Gesture Recognition with Kinect Sensor," Proc. 19th ACM Int'l Conf. Multimedia, ACM, 2011, pp. 759–760.
3. M.K. Bhuyan, V.V. Ramaraju, and Y. Iwahori, "Hand Gesture Recognition and Animation for Local Hand Motions," Int'l J. Machine Learning and Cybernetics, vol. 3, 2013, pp. 1–17.
4. M.M. Sole and M.S. Tsoeu, "Sign Language Recognition Using the Extreme Learning Machine," Proc. 2011 IEEE Region 8 Flagship Conf. African Continent (AFRICON 2011), IEEE, 2011, pp. 1–6.
5. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, nos. 1–3, 2006, pp. 489–501.
6. N.-Y. Liang et al., "A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks," IEEE Trans. Neural Networks, vol. 17, no. 6, 2006, pp. 1411–1423.
7. H.-C. Yu et al., "Robust Single Fingertip Tracking Method Based on Palm Posture Self-Adaption," J. Computer-Aided Design & Computer Graphics, vol. 25, no. 12, 2013, pp. 1793–1800.

Hanchao Yu is a PhD candidate in the Institute of Computing Technology at the Chinese Academy of Sciences, China. Contact him at yuhanchao@ict.ac.cn.

Yiqiang Chen is a professor in the Institute of Computing Technology at the Chinese Academy of Sciences, China. Contact him at yqchen@ict.ac.cn.

Junfa Liu is an associate professor in the Institute of Computing Technology at the Chinese Academy of Sciences, China. Contact him at liujunfa@ict.ac.cn.

Guang-Bin Huang is an associate professor in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. Contact him at egbhuang@ntu.edu.sg.

