
Information Sciences 299 (2015) 99–116


A weighted LS-SVM based learning system for time series forecasting ☆

Thao-Tsen Chen, Shie-Jue Lee
Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan

Article info

Article history:
Received 22 May 2014
Received in revised form 6 December 2014
Accepted 11 December 2014
Available online 17 December 2014
Keywords:
Time series forecasting
Multi-step forecasting
Nearest neighbor
Mutual information
Support vector machine

Abstract
Time series forecasting is important because it can often provide the foundation for decision making in a large variety of fields. Statistical approaches have been extensively adopted for time series forecasting in the past decades. Recently, machine learning techniques have drawn attention and useful forecasting systems based on these techniques have been developed. In this paper, we propose a weighted Least Squares Support Vector Machine (LS-SVM) based approach for time series forecasting. Given a forecasting sequence, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information. Based on the training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. A number of experiments are conducted to demonstrate the effectiveness of the proposed approach for time series forecasting.
© 2014 Elsevier Inc. All rights reserved.

1. Introduction
Time series forecasting concerns forecasting of future events based on a series of observations taken previously at equally spaced time intervals [25]. It has played an important role in the decision making process in a variety of fields such as finance, power supply, and medical care [21]. One example is to predict stock exchange indices or closing stock prices in the stock market [9–12,23,43,60]. Another example is to predict the electricity demand to avoid producing extra electric power in the power supply industry [27,39,68]. If forecasting is done for one time step ahead into the future, it is called single-step or one-step forecasting. For the case of two or more time steps ahead, it is usually called multi-step forecasting [5,21].
Two approaches have been adopted for constructing time series forecasting models. The global modeling approach constructs a model which is independent of the target to be forecasted. For time series prediction, the conditions of the environment may vary as time goes on. A global model is not adaptive and thus accuracy suffers. A local model constructed by the local modeling approach [28,29,45–47] is dependent on the target to be forecasted and therefore is adaptive.

☆ This work was supported by the National Science Council under Grants NSC-99-2221-E-110-064-MY3 and NSC-101-2622-E-110-011-CC3, and by the Aim for the Top University Plan of National Sun Yat-Sen University and the Ministry of Education.
⇑ Corresponding author.
E-mail address: leesj@mail.ee.nsysu.edu.tw (S.-J. Lee).

http://dx.doi.org/10.1016/j.ins.2014.12.031
0020-0255/© 2014 Elsevier Inc. All rights reserved.


Local models are usually characterized by using a small number of the neighbors in the proximity of the forecasting sequence.
Another issue in time series forecasting is to determine the lags to be involved in the model. The lags have a big influence on the forecasting accuracy. For example, in steel making engineering [4,34], the furnace temperature changes two to eight hours after the materials are charged into the furnace. This indicates that there is a time lag of two to eight hours for the temperature change. Furthermore, the lags may vary as time goes on and adjusting them is required [59]. Lags selection methods using t-statistics of estimated coefficients or F-statistics of groups of coefficients to measure statistical significance have been proposed [26]. The idea of mutual information was adopted in lags selection for electric load time series forecasting [7]. Two strategies, direct and iterative, have been adopted for constructing time series forecasting models. The difference between them lies in whether or not the forecasts of previous steps are involved in the prediction of the current step.
Statistical methods have been extensively adopted for time series forecasting in the past decades, among them moving average, weighted moving average, Kalman filtering, exponential smoothing, regression analysis, autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), and autoregressive moving average with exogenous inputs (ARMAX) [6]. These methods are based on the assumption that a probability model generates the underlying time series data. Future values of the time series are assumed to be related to past values as well as to past errors. Box–Jenkins models [6], i.e., ARMA, ARIMA, and ARMAX, are quite flexible due to the inclusion of both autoregressive and moving average terms. Recently, machine learning techniques have drawn attention and useful forecasting systems based on these techniques have been developed [49]. The multilayer perceptron, often simply called the neural network, is a popular network architecture in use for time series forecasting [13,20,24]. The neural network often encounters the local minimum problem during the learning process and the number of nodes in the hidden layer is usually difficult to decide. The k-nearest neighbor regression method is a nonparametric method which bases its prediction on the k nearest neighbors of the target to be forecasted [29]. However, it lacks the capability of adaptation and the distance measure adopted may affect prediction performance. Classification and Regression Trees is a regression model that is based on a hierarchical tree-like partition of the input space [22,53,72]. Fuzzy theory has been incorporated for prediction in the stock market [9,19,55–57]. However, membership functions need to be determined, which is often a challenging task for the user. Also, no learning is offered by fuzzy theory. Support vector regression can provide high accuracy by mapping into a high-dimensional feature space and including a penalty term in the error function [30,33,47,52,65,71]. Neuro-fuzzy modeling is a hybrid approach which takes advantage of both neural network techniques and fuzzy theory [1,31,32,43,50]. Supervised learning algorithms are utilized to optimize the parameters of the produced forecasting models through learning.
Some of the existing forecasting methods can only be applied in univariate forecasting, others can only work for one-step forecasting, and yet others cannot produce adaptive models. In this paper, we propose a direct local modeling approach based on machine learning to derive forecasting models. Given a forecasting sequence, two steps are involved in our proposed approach. Firstly, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information. A measure taking trends into consideration is adopted to gauge the similarity between two sequences of data, and proper lags associated with relevant variables for forecasting are determined. Secondly, based on the extracted training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. The effectiveness of the proposed approach is demonstrated by the results obtained from a number of experiments conducted on real-world datasets.
The rest of this paper is organized as follows. Section 2 states the problem to be solved. Section 3 describes our proposed forecasting approach in detail. An illustrative example is given in Section 4. Experimental results are presented in Section 5. Some relevant issues are discussed in Section 6. Finally, concluding remarks are given in Section 7.
2. Problem statement
Consider a series of real-valued observations [69]:

(X_0, Y_0), (X_1, Y_1), \ldots, (X_t, Y_t)    (1)

taken at equally spaced time points t_0, t_0 + \Delta t, t_0 + 2\Delta t, \ldots for some process P, where Y_i denotes the value of the output variable (or dependent variable) observed at the time point t_0 + i\Delta t and X_i denotes the values of n additional variables (or independent variables), n ≥ 0, observed at the time point t_0 + i\Delta t. Time series forecasting is to estimate the value of Y at some future time t + s, i.e., Y_{t+s}, by

\hat{Y}_{t+s} = G(X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t)    (2)

where s ≥ 1 is called the horizon of prediction, G is the forecasting function or model, Y_{t-i} is the ith lag of Y_t, X_{t-i} is the ith lag of X_t, and q is the lag-span of prediction. For s = 1, it is called one-step forecasting. For s > 1, it is called multi-step forecasting. Also, if n = 0, it is univariate forecasting; otherwise, it is multivariate forecasting.
Forecasting Y_{t+s} can be regarded as a function approximation task. Two strategies are usually adopted to construct forecasting models:

• Direct. Train on (X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t) to predict Y_{t+s} directly, for any s ≥ 1.
• Iterative. Train to predict Y_{t+1} only, but iterate to get Y_{t+s} for any s > 1.

These strategies work identically for the case of s = 1. However, for the case of s > 1, the iterative strategy cannot work for multivariate prediction since X_{t+1}, \ldots, X_{t+s-1} are not available to obtain \hat{Y}_{t+2}, \hat{Y}_{t+3}, \ldots, \hat{Y}_{t+s} successively. The forecasting model for estimating Y_{t+s} can be obtained by a global or local modeling approach. These two approaches have their own pros and cons as described in the previous section.
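As a rough illustration of the two strategies for the univariate case, the sketch below trains a plain least-squares autoregressive model (a stand-in for any regressor) either directly on the s-step-ahead target or on the one-step target with the forecasts fed back. The helper names and the linear model are our assumptions, not part of the paper.

```python
import numpy as np

def make_patterns(y, q, s):
    """Rows (y_{i-q}, ..., y_i) with target y_{i+s}, for all admissible i."""
    X = np.array([y[i - q:i + 1] for i in range(q, len(y) - s)])
    T = np.array([y[i + s] for i in range(q, len(y) - s)])
    return X, T

def fit_linear(X, T):
    """Ordinary least squares with a bias term (stand-in for any regressor)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, T, rcond=None)
    return w

def predict_linear(w, x):
    return float(np.append(x, 1.0) @ w)

def direct_forecast(y, q, s):
    """Direct strategy: one model whose training target is already s steps ahead."""
    X, T = make_patterns(y, q, s)
    w = fit_linear(X, T)
    return predict_linear(w, np.asarray(y)[-q - 1:])

def iterative_forecast(y, q, s):
    """Iterative strategy: a one-step model whose forecasts are fed back s times."""
    X, T = make_patterns(y, q, 1)
    w = fit_linear(X, T)
    hist = list(y)
    for _ in range(s):
        hist.append(predict_linear(w, np.array(hist[-q - 1:])))
    return hist[-1]
```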

3. Proposed approach
Without loss of generality, we consider only Y and one additional variable X, i.e., n = 1, in Eq. (1). Extension to the case of two or more additional variables is trivial. Our goal is to estimate Y_{t+s}, s ≥ 1, from (X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t). To do this, we try to learn from the observed data how

• Y_{q+s} can be forecasted from (X_0, Y_0, X_1, Y_1, \ldots, X_{q-1}, Y_{q-1}, X_q, Y_q);
• Y_{q+s+1} can be forecasted from (X_1, Y_1, X_2, Y_2, \ldots, X_q, Y_q, X_{q+1}, Y_{q+1});
• ...;
• Y_{t-1} can be forecasted from (X_{t-s-q-1}, Y_{t-s-q-1}, X_{t-s-q}, Y_{t-s-q}, \ldots, X_{t-s-2}, Y_{t-s-2}, X_{t-s-1}, Y_{t-s-1});
• Y_t can be forecasted from (X_{t-s-q}, Y_{t-s-q}, X_{t-s-q+1}, Y_{t-s-q+1}, \ldots, X_{t-s-1}, Y_{t-s-1}, X_{t-s}, Y_{t-s}).

Based on what we have learned, we intend to predict Y_{t+s} from (X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_t, Y_t) in a similar manner. For convenience, the sequence

Q = (X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t)    (3)

is called the forecasting sequence for forecasting Y_{t+s}.

We adopt the direct local modeling approach, based on machine learning techniques, to derive the forecasting model for Y_{t+s}. The proposed approach, as shown in Fig. 1, consists of two phases, training and forecasting. In the training phase, a forecasting model is derived. Two steps, training patterns extraction and model derivation, are involved in this phase. A suitable set of training patterns is extracted from the observed data. Then a modified LS-SVM is adopted to derive the forecasting model. In the forecasting phase, the forecasted result is obtained from the derived model. Let z = t - s + 1 - q and

S_1 = (X_0, Y_0, X_1, Y_1, \ldots, X_{q-1}, Y_{q-1}, X_q, Y_q),
S_2 = (X_1, Y_1, X_2, Y_2, \ldots, X_q, Y_q, X_{q+1}, Y_{q+1}),
\ldots,
S_{z-1} = (X_{z-2}, Y_{z-2}, X_{z-1}, Y_{z-1}, \ldots, X_{t-s-2}, Y_{t-s-2}, X_{t-s-1}, Y_{t-s-1}),
S_z = (X_{z-1}, Y_{z-1}, X_z, Y_z, \ldots, X_{t-s-1}, Y_{t-s-1}, X_{t-s}, Y_{t-s}).    (4)

For convenience, we call S = {S_1, S_2, \ldots, S_z} the neighbor set of Q and each element in S a neighbor of Q.

Fig. 1. Training and forecasting in the proposed approach.


3.1. Training patterns extraction


In this step, a suitable set of training patterns is extracted from the observed data. Firstly, the most appropriate context for constructing the forecasting model for Y_{t+s} is located. Then the proper lags involved in training are determined.
3.1.1. Finding k nearest neighbors
The most appropriate context for constructing the forecasting model for Y_{t+s} is defined to be the k nearest neighbors of Q in S, where k is a number specified by the user. We adopt a measure, which is similar to that in [29], to compute the similarity between two sequences. Let A_1 and A_2 be two sequences:

A_1 = (X_a, Y_a, X_{a+1}, Y_{a+1}, \ldots, X_{a+q-1}, Y_{a+q-1}, X_{a+q}, Y_{a+q}),
A_2 = (X_b, Y_b, X_{b+1}, Y_{b+1}, \ldots, X_{b+q-1}, Y_{b+q-1}, X_{b+q}, Y_{b+q}),

and F_1 and F_2 be the differential sequences of A_1 and A_2, respectively, defined as

F_1 = (X_{a+1} - X_a, Y_{a+1} - Y_a, \ldots, X_{a+q} - X_{a+q-1}, Y_{a+q} - Y_{a+q-1}),
F_2 = (X_{b+1} - X_b, Y_{b+1} - Y_b, \ldots, X_{b+q} - X_{b+q-1}, Y_{b+q} - Y_{b+q-1}).

Note that the differential sequence reveals trends, i.e., rising or falling conditions, of the original sequence. A negative entry in the differential sequence indicates a falling occurrence at the underlying point in the original sequence while a positive entry indicates a rising occurrence. Let N_E(A_1, A_2) be the normalized Euclidean distance between A_1 and A_2, and N_E(F_1, F_2) be the normalized Euclidean distance between F_1 and F_2. The following measure [29]:

N_H(A_1, A_2) = \frac{N_E(A_1, A_2) + N_E(F_1, F_2)}{2}    (5)

is adopted to locate the k nearest neighbors of Q. We call N_H(A_1, A_2) the hybrid distance between A_1 and A_2.
We identify the k nearest neighbors of Q as follows. We calculate the hybrid distance between Q and every neighbor in S. As a result, we have z hybrid distances. The k sequences with the k shortest hybrid distances are taken to be the k nearest neighbors of Q. For convenience, these k nearest neighbors are expressed as

(X_{t_1-q}, Y_{t_1-q}, X_{t_1-q+1}, Y_{t_1-q+1}, \ldots, X_{t_1-1}, Y_{t_1-1}, X_{t_1}, Y_{t_1}),
(X_{t_2-q}, Y_{t_2-q}, X_{t_2-q+1}, Y_{t_2-q+1}, \ldots, X_{t_2-1}, Y_{t_2-1}, X_{t_2}, Y_{t_2}),
\ldots,
(X_{t_{k-1}-q}, Y_{t_{k-1}-q}, X_{t_{k-1}-q+1}, Y_{t_{k-1}-q+1}, \ldots, X_{t_{k-1}-1}, Y_{t_{k-1}-1}, X_{t_{k-1}}, Y_{t_{k-1}}),
(X_{t_k-q}, Y_{t_k-q}, X_{t_k-q+1}, Y_{t_k-q+1}, \ldots, X_{t_k-1}, Y_{t_k-1}, X_{t_k}, Y_{t_k})    (6)

where q ≤ t_i ≤ t - s for any i, 1 ≤ i ≤ k.
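To make the neighbor search concrete, the sketch below (Python with NumPy) computes the hybrid distance of Eq. (5) and picks the k nearest neighbors of Q from S. It is a minimal illustration under our own assumptions: plain Euclidean distances are used where the paper applies normalized Euclidean distances, and the function names are ours rather than the authors'.

```python
import numpy as np

def differential(seq):
    """First-order differences of a sequence; reveals its rising/falling trend."""
    return np.diff(np.asarray(seq, dtype=float))

def hybrid_distance(a1, a2):
    """Eq. (5): average of the distance between the sequences and the distance
    between their differential sequences (trend information). Plain Euclidean
    distances are used here instead of the normalized ones."""
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    d_seq = np.linalg.norm(a1 - a2)
    d_diff = np.linalg.norm(differential(a1) - differential(a2))
    return 0.5 * (d_seq + d_diff)

def k_nearest_neighbors(Q, S, k):
    """Indices of the k neighbors in S closest to Q by the hybrid distance."""
    dists = np.array([hybrid_distance(Q, s) for s in S])
    return np.argsort(dists)[:k], dists

# Toy usage: Q and three candidate neighbors, each of length 2(q+1).
Q = [1.140, 1.066, 1.174, 1.114]
S = [[1.118, 1.073, 1.116, 1.062],
     [1.116, 1.062, 1.005, 0.963],
     [1.138, 1.099, 1.149, 1.137]]
idx, dists = k_nearest_neighbors(Q, S, k=2)
print(idx, dists)
```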


3.1.2. Lags selection
Next, we select certain lags out of X_{t-q}, Y_{t-q}, X_{t-q+1}, Y_{t-q+1}, \ldots, X_{t-1}, Y_{t-1}, X_t, and Y_t for training. One simple idea is to use them all, in which case the number of lags is 2q + 2. However, this is not necessarily a good idea. Usually, we do not know in advance how long the response time of the underlying system is. Worse, the response time may vary along the way. If q is not large enough, we might miss some inputs which are important to the system. Therefore, q is usually set sufficiently large. Adopting all the lags may then lead to the underfitting problem and the prediction accuracy can be poor. To avoid this, we only select a subset from the 2q + 2 candidate lags.
We adopt the concept of mutual information [38,54,58,61,62] for lags selection. Let the mutual information between two vectors U and V be denoted as MI(U, V). Intuitively, mutual information measures the information that U and V share; that is, it measures how much knowing one of these vectors reduces uncertainty about the other. If MI(U, V) is large, there is likely some strong connection between U and V. The adoption of mutual information, rather than other measures, e.g., correlation [42], is mainly due to its capability of measuring non-linear relationships between the involved vectors.
From the k nearest neighbors of Eq. (6) and their corresponding s-step ahead outputs Y_{t_1+s}, Y_{t_2+s}, \ldots, Y_{t_k+s}, we form the following 2q + 3 vectors:

Z_1, Z_2, \ldots, Z_{2q+1}, Z_{2q+2}, H    (7)

where

Z_{2i+1} = [X_{t_1-q+i}, X_{t_2-q+i}, \ldots, X_{t_k-q+i}]^T, \quad 0 ≤ i ≤ q,
Z_{2i+2} = [Y_{t_1-q+i}, Y_{t_2-q+i}, \ldots, Y_{t_k-q+i}]^T, \quad 0 ≤ i ≤ q,
H = [Y_{t_1+s}, Y_{t_2+s}, \ldots, Y_{t_k+s}]^T.

We use a greedy approach with forward selection [14,59] to find a desired number of lags. Let d be the number of lags to be found. Firstly, we calculate MI(Z_i, H), 1 ≤ i ≤ 2q + 2. Let MI(Z_{d_1}, H) be the largest, indicating that the most significant connection exists between Z_{d_1} and H. Therefore, Z_{d_1} is selected. Next, we calculate MI({Z_{d_1}, Z_i}, H), 1 ≤ i ≤ 2q + 2 and i ≠ d_1. Let MI({Z_{d_1}, Z_{d_2}}, H) be the largest. Therefore, Z_{d_2} is also selected. Then we calculate MI({Z_{d_1}, Z_{d_2}, Z_i}, H), 1 ≤ i ≤ 2q + 2, i ≠ d_1 and i ≠ d_2. This goes on until d lags are found. The lags obtained are then used to determine the training patterns and the input for forecasting. For example, suppose d = 3 and Z_{2q-1}, Z_{2q+1}, and Z_{2q+2} are selected. By referring to Eqs. (6) and (7), we extract the following k training patterns:

[(X_{t_j-1}, X_{t_j}, Y_{t_j}); Y_{t_j+s}], \quad 1 ≤ j ≤ k.    (8)

These training patterns are to be used in training a direct forecasting model in the model derivation step. Accordingly, the input for forecasting Y_{t+s} is (X_{t-1}, X_t, Y_t).
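A compact sketch of the greedy forward selection is given below. The paper does not prescribe a particular estimator of the joint mutual information MI({Z_{d_1}, ..., Z_i}, H), so this sketch approximates it with a simple histogram (plug-in) estimate after discretizing each vector into a few equal-width bins; the estimator, the bin count, and the function names are our assumptions.

```python
import numpy as np

def discretize(v, bins=4):
    """Map a real vector to integer bin labels (equal-width bins)."""
    v = np.asarray(v, float)
    edges = np.linspace(v.min(), v.max() + 1e-12, bins + 1)
    return np.clip(np.digitize(v, edges) - 1, 0, bins - 1)

def joint_mi(Z_list, H, bins=4):
    """Plug-in estimate of MI({Z_1, ..., Z_m}; H) from binned samples."""
    k = len(H)
    codes = np.zeros(k, dtype=int)           # encode each joint sample as one symbol
    for Z in Z_list:
        codes = codes * bins + discretize(Z, bins)
    h = discretize(H, bins)
    mi = 0.0
    for c in np.unique(codes):
        for y in np.unique(h):
            p_xy = np.mean((codes == c) & (h == y))
            if p_xy > 0:
                p_x, p_y = np.mean(codes == c), np.mean(h == y)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def select_lags(Z, H, d, bins=4):
    """Greedy forward selection of d lag vectors from Z = [Z_1, ..., Z_{2q+2}]."""
    selected, remaining = [], list(range(len(Z)))
    while len(selected) < d and remaining:
        best = max(remaining,
                   key=lambda i: joint_mi([Z[j] for j in selected] + [Z[i]], H, bins))
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of the chosen Z vectors
```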
3.2. Model derivation
After we obtain the k training patterns, we can proceed to derive the direct forecasting model. Let the training patterns be represented as {(x_i, y_i)}_{i=1}^{k}, where x_i is the input and y_i is the desired output of pattern i. Note that we have also derived the input x for forecasting Y_{t+s}. In the case of the previous example shown in Eq. (8), the first training pattern includes x_1 = (X_{t_1-1}, X_{t_1}, Y_{t_1}) and y_1 = Y_{t_1+s}, the second training pattern includes x_2 = (X_{t_2-1}, X_{t_2}, Y_{t_2}) and y_2 = Y_{t_2+s}, ..., and the kth training pattern includes x_k = (X_{t_k-1}, X_{t_k}, Y_{t_k}) and y_k = Y_{t_k+s}. The input x for forecasting Y_{t+s} is x = (X_{t-1}, X_t, Y_t).
Least Squares Support Vector Machine (LS-SVM) [52,65] is a powerful method for solving function estimation problems. However, all the errors induced have the same weight in LS-SVM [3]. Preferably, we would like to give more credit to a training pattern that is more accurately forecasted in the training process [64,67]. That is, a training pattern with a smaller error is allowed to have a bigger weight. We modify the traditional LS-SVM to meet our requirement, and the modified LS-SVM can be expressed as

\min_{\omega, b, e} \; J(\omega, e) = \frac{1}{2}\,\omega^T \omega + \frac{\gamma}{2} \sum_{j=1}^{k} g_j e_j^2
\quad \text{subject to} \quad y_i = \omega^T \phi(x_i) + b + e_i, \quad i = 1, 2, \ldots, k    (9)

where ω is the weight vector to be solved, γ is the regularization parameter, which is a constant, φ maps the original input space to a high-dimensional feature space, b is a real constant, and e_1, e_2, \ldots, e_k are error variables. Note that each error e_j, 1 ≤ j ≤ k, is weighted by g_j, which is defined as

g_j = \exp\left(-\frac{g_{1,j} + g_{2,j}}{2}\right)    (10)

where g_{1,j} is the hybrid distance between x and x_j and g_{2,j} is the normalized absolute difference between y_j and the median of y, i.e.,

g_{1,j} = N_H(x, x_j),    (11)

g_{2,j} = \frac{|y_j - \mathrm{median}(y)|}{\max_j\{|y_j - \mathrm{median}(y)|\}}.    (12)

Note that median(y) is the median of all the y values and the denominator max_j{|y_j - median(y)|} is the maximum of all |y_j - median(y)|, 1 ≤ j ≤ k. Essentially, g_{1,j} concerns the closeness between x and x_j. As x_j gets closer to x, g_{1,j} gets smaller and consequently g_j is larger. Furthermore, we consider not only the input but also the output for the weighting. The rationale behind this consideration is that, for time series forecasting, it is not always the case that closer inputs lead to closer outputs. An example is shown in Table 1, which contains 150 pieces of data taken from the Poland Electrical Load dataset [51]. In this table, the left column indicates the ordering of the input distances between one reference point and all of its 150 nearest neighbors; a smaller index indicates a neighbor closer to the reference point with respect to the input. The right column indicates the output distances between the reference point and these nearest neighbors. As can be seen, closer inputs may not lead to closer outputs. For instance, consider neighbors 1 and 2: neighbor 1 is closer to the reference point in terms of the input distance, while neighbor 2 is closer in terms of the output distance. Therefore, we consider the contribution of the output by including the term g_{2,j} of Eq. (12). As y_j gets closer to the median, g_{2,j} gets smaller and consequently g_j is larger. The median is used instead of the mean because of outliers, which often occur in practical applications; when some outputs have abnormal values, the median is less affected.

Table 1
An example showing the relationship between inputs and outputs.
Input distance (rank)   Output distance
1                       0.2364
2                       0.1505
3                       0.1009
4                       0.0893
...                     ...
148                     2.0958
149                     1.0462
150                     0.8589
The solution to Eq. (9) can be derived by constructing the Lagrangian function

L(\omega, b, e; \alpha) = J(\omega, e) - \sum_{i=1}^{k} \alpha_i \left( \omega^T \phi(x_i) + b + e_i - y_i \right)    (13)

where \alpha_i \in \mathbb{R}, 1 ≤ i ≤ k, are the Lagrange multipliers. By eliminating the variables ω and e, we have

\begin{bmatrix} 0 & \mathbf{1}_k^T \\ \mathbf{1}_k & \Omega + (\gamma G)^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix}    (14)

where y = [y_1, y_2, \ldots, y_k]^T, \mathbf{1}_k = [1, \ldots, 1]^T, \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_k]^T, I is the identity matrix, G = \mathrm{diag}(g_1, g_2, \ldots, g_k), and \Omega is the kernel matrix defined by

\Omega_{i,j} = \phi(x_i)^T \phi(x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)    (15)

for i, j = 1, 2, \ldots, k. From Eq. (14), the values of b and \alpha_1, \alpha_2, \ldots, \alpha_k are obtained, and the forecasting model is given by

\hat{y} = G(x) = \sum_{i=1}^{k} \alpha_i \phi(x)^T \phi(x_i) + b = \sum_{i=1}^{k} \alpha_i \exp\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right) + b    (16)

which can then be used for forecasting Y_{t+s}.
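A minimal NumPy sketch of this model derivation step is given below: it builds the RBF kernel matrix of Eq. (15), solves the linear system of Eq. (14) for b and the Lagrange multipliers, and evaluates the model of Eq. (16) on a new input. The weights g come from the previous sketch, gamma and sigma2 must be supplied (the authors obtain these hyper-parameters automatically from the LS-SVMlab program [44]), and the function names are our own.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Eq. (15): K[i, j] = exp(-||A_i - B_j||^2 / sigma2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def fit_weighted_lssvm(X, y, g, gamma, sigma2):
    """Solve the KKT system of Eq. (14):
       [ 0   1^T                         ] [ b     ]   [ 0 ]
       [ 1   K + diag(1/(gamma * g_j))   ] [ alpha ] = [ y ]"""
    k = len(y)
    K = rbf_kernel(X, X, sigma2)
    A = np.zeros((k + 1, k + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * g))
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                    # b, alpha

def predict(x_new, X, alpha, b, sigma2):
    """Eq. (16): y_hat = sum_i alpha_i * exp(-||x - x_i||^2 / sigma2) + b."""
    kvec = rbf_kernel(np.atleast_2d(x_new), X, sigma2)[0]
    return float(kvec @ alpha + b)
```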


3.3. Summary of training and forecasting
Given a series of real-valued observations X_0, Y_0, X_1, Y_1, \ldots, X_t, Y_t, our goal is to estimate the value of Y_{t+s} at some future time t + s, s ≥ 1, based on these observations. Let the forecasting sequence Q be (X_{t-q}, Y_{t-q}, \ldots, X_{t-1}, Y_{t-1}, X_t, Y_t), where q is a sufficiently large positive integer. We find the k nearest neighbors as the most suitable context of Q. Then we use mutual information to select the lags of the involved variables, and k training patterns are extracted from the observations. Based on the training patterns, the forecasting model is derived by the modified LS-SVM. Finally, Y_{t+s} is estimated by applying the forecasting model. The whole process can be summarized as below.
procedure Proposed-Forecasting-Method
  Training phase:
    Find the k nearest neighbors of Q using Eq. (5);
    Decide the lags of the involved variables using mutual information;
    Extract the set of k training patterns;
    Derive the forecasting model by the modified LS-SVM of Eq. (9);
  Forecasting phase:
    Forecast Y_{t+s} by the derived model using Eq. (16);
end procedure
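Putting the phases together, the sketch below chains the helpers introduced above (k_nearest_neighbors, select_lags, pattern_weights, fit_weighted_lssvm, predict) into one routine. The names and the way neighbor windows are sliced into training patterns are our illustration of the procedure, not the authors' implementation, and the hyper-parameters gamma and sigma2 are assumed to be given.

```python
import numpy as np

def forecast(XY, t, s, q, k, d, gamma, sigma2):
    """XY: array of shape (t+1, 2) holding the observed (X_i, Y_i) pairs.
    Returns the forecast of Y_{t+s} following the procedure above."""
    window = lambda i: XY[i - q:i + 1].ravel()     # (X_{i-q}, Y_{i-q}, ..., X_i, Y_i)
    Q = window(t)
    # Neighbor set S of Eq. (4): all windows whose s-step-ahead target is observed.
    starts = range(q, t - s + 1)
    S = [window(i) for i in starts]
    nn_idx, _ = k_nearest_neighbors(Q, S, k)
    targets = np.array([XY[starts[j] + s, 1] for j in nn_idx])    # H
    neighbors = np.stack([S[j] for j in nn_idx])
    # Lag selection over the 2q+2 candidate lag vectors (columns of `neighbors`).
    lags = select_lags([neighbors[:, c] for c in range(neighbors.shape[1])], targets, d)
    X_train, y_train = neighbors[:, lags], targets
    x_new = Q[lags]
    g = pattern_weights(x_new, X_train, y_train)
    b, alpha = fit_weighted_lssvm(X_train, y_train, g, gamma, sigma2)
    return predict(x_new, X_train, alpha, b, sigma2)
```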

4. Example
Suppose we have a series of observations X_0, Y_0, X_1, Y_1, \ldots, X_{19}, Y_{19}, as shown in Table 2. We want to forecast the value of Y_21 based on the given data. For this case, we have t = 19 and s = 2. Let q = 1, k = 6, and d = 2. According to Eq. (3), the forecasting sequence Q is

Q = (X_18, Y_18, X_19, Y_19) = (1.140, 1.066, 1.174, 1.114)



Table 2
Time series example.
Time index   X       Y       Hybrid distance
0            1.118   1.073   –
1            1.116   1.062   0.2469
2            1.005   0.963   0.7192
3            0.903   0.874   0.9059
4            1.099   1.048   0.8329
5            1.131   1.078   0.1278
6            0.969   1.084   0.6813
7            1.093   1.106   0.4349
8            1.125   1.104   0.2046
9            1.091   1.002   0.5280
10           0.943   0.897   0.9089
11           1.138   1.099   0.7967
12           1.149   1.137   0.1120
13           1.155   1.136   0.2153
14           1.158   1.130   0.2248
15           1.161   1.097   0.2697
16           1.065   0.990   0.6371
17           0.944   0.870   0.9207
18           1.140   1.066   –
19           1.174   1.114   –

and according to Eq. (4), the neighbor set S contains the following 17 neighbors:

S_1 = (X_0, Y_0, X_1, Y_1) = (1.118, 1.073, 1.116, 1.062),
S_2 = (X_1, Y_1, X_2, Y_2) = (1.116, 1.062, 1.005, 0.963),
...,
S_16 = (X_15, Y_15, X_16, Y_16) = (1.161, 1.097, 1.065, 0.990),
S_17 = (X_16, Y_16, X_17, Y_17) = (1.065, 0.990, 0.944, 0.870).

Then we proceed with the training steps as follows.
• Finding k nearest neighbors. We calculate the hybrid distances between Q and S_i, i = 1, 2, \ldots, 17, by Eq. (5). The distances are shown in the fourth column of Table 2. For example, the hybrid distance between Q and S_17 is 0.9207, shown in the row with time index 17, and the hybrid distance between Q and S_16 is 0.6371, shown in the row with time index 16. The 6 nearest neighbors of Q are S_12, S_5, S_8, S_13, S_14, and S_1. Therefore, we have t_1 = 12, t_2 = 5, t_3 = 8, t_4 = 13, t_5 = 14, and t_6 = 1 for Eq. (6).
• Lags selection. According to Eq. (7), we set

Z_1 = [X_11, X_4, X_7, X_12, X_13, X_0]^T,
Z_2 = [Y_11, Y_4, Y_7, Y_12, Y_13, Y_0]^T,
Z_3 = [X_12, X_5, X_8, X_13, X_14, X_1]^T,
Z_4 = [Y_12, Y_5, Y_8, Y_13, Y_14, Y_1]^T,
H = [Y_14, Y_7, Y_10, Y_15, Y_16, Y_3]^T.    (17)

Firstly, we compute MI(Z_1, H) = 0.2500, MI(Z_2, H) = 0.2083, MI(Z_3, H) = 0.2083, and MI(Z_4, H) = 0.4583. Since MI(Z_4, H) is the largest, Z_4 is selected. Next, we compute MI({Z_4, Z_1}, H) = 0.1667, MI({Z_4, Z_2}, H) = 0.1250, and MI({Z_4, Z_3}, H) = 0.2917. Since MI({Z_4, Z_3}, H) is the largest, Z_3 is selected. Then we stop. Note that Z_3 and Z_4 correspond to the third element and the fourth element, respectively, in each nearest neighbor. Therefore, we extract the following 6 training patterns:

x_1 = (X_12, Y_12), y_1 = Y_14;
x_2 = (X_5, Y_5),   y_2 = Y_7;
x_3 = (X_8, Y_8),   y_3 = Y_10;
x_4 = (X_13, Y_13), y_4 = Y_15;
x_5 = (X_14, Y_14), y_5 = Y_16;
x_6 = (X_1, Y_1),   y_6 = Y_3;    (18)

and the input for forecasting Y_21 is x = (X_19, Y_19).

• Model derivation. We use the extracted training patterns {(x_i, y_i)}_{i=1}^{6} to derive the forecasting model. Firstly, we create the modified LS-SVM of Eq. (9). The weights g_j, j = 1, 2, \ldots, 6, are calculated by Eq. (10). For example, g_1 is

g_1 = \exp\left(-\frac{g_{1,1} + g_{2,1}}{2}\right) = 0.7326

where

g_{1,1} = N_H(x, x_1) = 0.1120

and

g_{2,1} = \frac{|y_1 - \mathrm{median}(y)|}{\max_j\{|y_j - \mathrm{median}(y)|\}} = 0.5103.

Then we solve for the Lagrange multipliers in Eq. (14) and obtain the optimal forecasting model of Eq. (16). Finally, we apply x = (X_19, Y_19) to the forecasting model, and the forecast of Y_21 is

\hat{Y}_{21} = 1.0917.
5. Experimental results
We present the results of several experiments with real-world time series datasets to demonstrate the effectiveness of our proposed Local Modeling Approach (LMA), and we compare it with other existing methods in this section. To evaluate the performance of each method, several metrics are adopted [49], including the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE), which are defined below:

\mathrm{MAE} = \frac{1}{N_t} \sum_{i=1}^{N_t} |y_i - \hat{y}_i|,    (19)

\mathrm{MSE} = \frac{1}{N_t} \sum_{i=1}^{N_t} (y_i - \hat{y}_i)^2,    (20)

\mathrm{RMSE} = \sqrt{\frac{1}{N_t} \sum_{i=1}^{N_t} (y_i - \hat{y}_i)^2},    (21)

\mathrm{MAPE} = \frac{100}{N_t} \sum_{i=1}^{N_t} \frac{|y_i - \hat{y}_i|}{y_i},    (22)

where N_t is the number of testing data, and y_i and \hat{y}_i are the actual output value and the forecasted output value, respectively, for i = 1, 2, \ldots, N_t.
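For reference, the four metrics can be computed as in the short sketch below (NumPy); following Eq. (22), MAPE assumes the actual values y_i are nonzero.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and MAPE of Eqs. (19)-(22)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err) / y_true)   # assumes nonzero actual values
    return mae, mse, rmse, mape
```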
Some points should be mentioned here. Firstly, model derivation in our approach is performed using the LS-SVM program
provided by the authors of [65] on the website of [44]. When the training patterns are obtained, they are fed into the LS-SVM
program and the forecasting model is derived automatically. The program can determine the settings of the hyper-parameters by itself without user intervention. Secondly, not all metrics mentioned above are used in every experiment. To
compare with other methods, we borrow the experimental results of these methods from the literature. Since different metrics are adopted in different cited papers, the metrics used in each experiment can be different. Thirdly, training errors are
not available in the cited papers. Therefore, we only present the training errors of LMA in the experiments.
5.1. One-step forecasting
Four experiments are conducted on the datasets Poland Electricity [51], Laser [40], Sunspot [63], and TAIEX [66] to show the one-step forecasting performance of different methods. The first three datasets are used for univariate forecasting, while the last is used for multivariate forecasting.
5.1.1. Poland Electricity dataset
The Poland Electricity dataset [51] records the electricity load of Poland, covering 1500 days in the 1990s. Only one variable, the output, is involved in this dataset. The first 1000 values are used for training, and the remaining 500 values are used for testing. Some statistics [35], including the minimum (Min), maximum (Max), mean, median, mode, standard deviation (Std), and range, of the training data are shown in Table 3. The MAE and RMSE obtained by several one-step forecasting methods are shown in Table 4. These methods include ANFIS [32], a neural network based on the MATLAB toolbox (NN-MAT) [24], ARMA [6], Sorjamaa et al.'s method [59], and our LMA. For LMA, we set k = 150, q = 9, and d = 6. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 2.4 × 10^10 and σ² ≈ 3 × 10^8. These settings are determined in the training phase and are not changed in the testing phase. Note that the errors for Sorjamaa in the table are taken directly from [59], in which no training errors are listed.



Table 3
Some statistics of four datasets.
         Poland Electricity   Laser    Sunspot   EUNITE LoadMax   EUNITE Temperature
Min      0.6385               2        0         464              −14.225
Max      1.3490               255      154.400   876              26.525
Mean     0.9885               59.893   43.472    670.790          8.802
Median   0.9864               43       37.600    677              8.938
Mode     0.6385               14       11        799              20.250
Std      0.1632               49.279   34.267    93.541           8.686
Range    0.7105               253      154.400   412              40.750

Table 4
Performance of one-step forecasting on Poland Electricity.
            MAE               RMSE
ANFIS       0.2985            0.5225
NN-MAT      0.0639            0.0835
ARMA        0.3744            0.4807
Sorjamaa    0.1856            0.3351
LMA         0.0245 (0.0194)   0.0454 (0.0311)

Fig. 2. One-step forecasted results by LMA and ARMA on Poland Electricity.

The parenthesized numbers in the row of the LMA entry indicate the training errors of LMA. As can be seen from the table, LMA performs better than the others in both MAE and RMSE. Fig. 2 shows the forecasted results obtained by LMA and ARMA. For clarity, we only show the first 100 forecasted values in this figure. From this figure, we can see that LMA provides a better match than ARMA.
5.1.2. Laser dataset
The Laser dataset [40] contains a series of chaotic laser data obtained in a physics laboratory experiment. Only one variable, the output, is involved in this dataset. The length of the time series is 10,093, but we use only the first 5700 observations. The first 5600 observations are used for training and the remaining are used for testing. Some statistics of the training data are shown in Table 3. The forecasted results obtained by LMA and ANFIS are shown pictorially in Fig. 3. For LMA, we set k = 150, q = 9, and d = 7. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 2.4 × 10^7 and σ² ≈ 700. These settings are determined in the training phase and are not changed in the testing phase. Table 5 shows the MAE and RMSE obtained by different methods. The parenthesized numbers in the row of the LMA entry indicate the training errors of LMA. Again, LMA performs better than the others in both MAE and RMSE.


Fig. 3. One-step forecasted results by LMA and ANFIS on Laser.

Table 5
Performance of one-step forecasting on Laser.
            MAE               RMSE
ANFIS       5.7257            14.4748
NN-MAT      3.4300            10.3494
ARMA        14.8915           27.5397
Sorjamaa    1.8288            4.2405
LMA         0.7236 (0.6031)   0.9570 (0.8754)

5.1.3. Sunspot dataset


The Sunspot dataset [63] is related to observations of the annual activity of sunspots. The study of sunspot activity has practical importance for geophysicists, environment scientists, and climatologists. The dataset records the annual number of sunspots from the year 1700 through the year 1987, giving a total of 288 observations. Only one variable, the output, is involved in this dataset. The data from 1700 to 1920 are used for training, and the data from 1921 to 1987 are used for testing. Some statistics of the training data are shown in Table 3. In this experiment, we compare our method with ARIMA [70], Multiple ANN [2], ARIMA and Neural Network [70], ANN (p, d, q) [36], and Generalized ANNs/ARIMA [37]. Table 6 shows the MAE, MSE, and MAPE obtained by these methods. Note that the numbers in this table, except for ANFIS and LMA, are borrowed from the cited papers. A dash in an entry indicates that no number is provided in the source. The parenthesized numbers in the row of the LMA entry indicate the training errors of LMA. For LMA, we set k = 90, q = 11, and d = 9. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 3600 and σ² ≈ 700. These settings are determined in the training phase and are not changed in the testing phase. From this table, we can see that our method, LMA, provides the best performance. Fig. 4 shows the forecasted results obtained by LMA and ANFIS.
5.1.4. TAIEX dataset
In this experiment, we forecast the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) [66], with additional inputs from the Dow Jones [15] and NASDAQ [48] indices. The period we work with is from the year 1999 through the year 2004. The data from January to October of each year are used for training, and the data from November to December are used for testing.
Table 6
Performance of one-step forecasting on Sunspot.
                                MAE            MSE               MAPE
ANFIS                           16.31          613.87            36.10
ARIMA in [70]                   13.03          306.08            –
Multiple ANN [2]                –              280.48            –
ARIMA & Neural Network [70]     12.78          280.16            30.69
ANN (p, d, q) [36]              12.18          234.21            –
Generalized ANNs/ARIMA [37]     11.45          218.64            –
LMA                             10.29 (7.88)   215.71 (101.97)   20.92 (15.37)


Fig. 4. One-step forecasted results by LMA and ANFIS on Sunspot.

Table 7
Some statistics of the TAIEX related data in 2004.
            Min    Max      Mean       Median     Mode      Std      Range
NASDAQ      1752   2140     1958.95    1960.26    1969.99   81.59    388
Dow Jones   9758   10,738   10263.81   10244.93   9757.81   224.80   980
TAIEX       5317   7034     6057.18    5949.26    5316.87   458.25   1717

Table 8
RMSE of one-step forecasting on TAIEX.

With Dow Jones
Method             1999      2000       2001       2002      2003      2004      Average
ANFIS              156.72    280.85     521.64     1871.23   1805.04   192.69    799.69
ARMAX              102.53    126.41     115.36     63.19     52.24     53.81     85.76
NN-MAT             103.78    132.46     116.27     67.24     55.01     54.10     88.14
Chen & Chang [9]   101.97    148.85     113.70     79.81     64.08     82.32     98.46
Chen et al. [10]   115.47    127.51     121.98     74.65     66.02     58.89     94.09
Chen & Chen [11]   99.87     122.75     117.18     68.45     53.96     52.55     85.79
T2NFS [43]         97.30     124.28     110.01     60.05     52.24     51.80     82.61
Chen et al. [12]   102.34    131.25     113.62     65.77     52.23     56.16     86.89
LMA                100.73    123.65     117.59     62.96     52.17     53.44     85.09
  (training)       (91.48)   (110.83)   (103.77)   (48.36)   (45.16)   (44.73)

With NASDAQ
Method             1999      2000       2001       2002      2003      2004      Average
ANFIS              437.58    3708.04    425.19     601.06    3323.86   328.36    1470.68
ARMAX              102.37    113.06     115.79     63.74     55.61     53.33     83.98
NN-MAT             103.71    131.31     118.17     67.89     53.59     53.38     88.01
Chen & Chang [9]   123.64    131.10     115.08     73.06     66.36     60.48     94.95
Chen et al. [10]   119.32    129.87     123.12     71.01     65.14     61.94     95.07
Chen & Chen [11]   102.60    119.98     114.81     69.07     53.16     53.57     85.53
T2NFS [43]         99.63     121.94     108.68     64.83     51.05     51.81     82.99
Chen et al. [12]   102.11    131.30     113.83     66.45     52.83     54.17     86.78
LMA                102.84    123.56     115.26     65.59     51.59     51.44     85.05
  (training)       (93.45)   (113.32)   (109.23)   (56.16)   (43.80)   (42.10)

With Dow Jones and NASDAQ
Method             1999      2000       2001       2002      2003      2004      Average
ANFIS              434.02    3490.57    656.78     1340.88   574.10    201.52    1116.31
ARMAX              106.57    121.10     113.40     66.88     52.89     51.82     85.44
NN-MAT             109.29    132.98     114.83     63.73     55.65     51.50     88.00
Chen & Chang [9]   106.34    130.13     113.33     72.33     60.29     68.07     91.75
Chen et al. [10]   116.64    123.62     123.85     71.98     58.06     57.73     91.98
Chen & Chen [11]   101.33    121.27     114.48     67.18     52.72     52.27     84.88
T2NFS [43]         99.04     120.90     103.84     58.10     52.49     51.73     81.02
LMA                102.57    121.01     114.32     56.95     51.95     50.71     82.92
  (training)       (94.29)   (114.23)   (110.27)   (54.10)   (45.32)   (46.98)

Number of zeros in each year: 32 (1999), 33 (2000), 11 (2001).


Some statistics of the training data for 2004 are shown in Table 7. Table 8 shows the RMSE obtained by different methods. In this table, we also list the results of Chen [9], Chen and Chen [11], T2NFS [43], and Chen et al. [10,12]. The results under three conditions are listed in Table 8: with Dow Jones as the additional input, with NASDAQ as the additional input, and with both Dow Jones and NASDAQ as additional inputs. For LMA, we set q = 14, d = 4, and k ≈ 100. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 2 × 10^6 – 6 × 10^6 and σ² ≈ 1000 – 3000. These settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers in the rows of the LMA entries indicate the training errors of LMA. We find that our method has better RMSE in 2002, 2003, and 2004 than most of the other methods. However, LMA performs worse than [11,43] in 1999, 2000, and 2001. The reason is that in these years, many zeros are contained in the Dow Jones and NASDAQ data due to their being asynchronous with TAIEX: if Dow Jones and NASDAQ were closed while TAIEX was open on some day, a zero was recorded in Dow Jones and NASDAQ for that day. As can be seen, there are 32 zeros in these two datasets in 1999. Because our method is influenced by the obtained nearest neighbors, these zeros are harmful to it. Note that the zeros also considerably deteriorate the performance of ANFIS. However, our method still performs better than [9,10] in 1999, 2000, and 2001.

5.2. Multi-step forecasting


Two experiments are conducted on the datasets Laser [40] and EUNITE [16] to show the multi-step forecasting performance of different methods. The first dataset is used for univariate forecasting, while the second is used for multivariate forecasting.

5.2.1. Laser dataset


The division of training data and testing data is identical to that for one-step forecasting. Tables 9 and 10 show the MAE and RMSE, respectively, obtained by different methods for 2-step to 12-step forecasting. For LMA, we set k = 150, q = 9, and d = 7. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 3 × 10^4 – 7.5 × 10^6 and σ² ≈ 700 – 2 × 10^5. These settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers listed in the LMA column of these two tables indicate the training errors of LMA. As can be seen, our method performs much better than the other methods in both MAE and RMSE. Fig. 5 shows some forecasted results obtained by LMA and NN-MAT. Apparently, our forecasted results match the actual values more closely.

Table 9
MAE of multi-step forecasting on Laser.
           ANFIS     NN-MAT    ARMA      Sorjamaa   LMA
2-step     9.1072    7.0461    17.6998   1.8294     1.7388 (0.6298)
3-step     7.0787    6.7370    17.0023   3.1721     2.6654 (0.7722)
4-step     7.9532    9.5677    19.9313   4.5941     2.4529 (0.7682)
5-step     8.2511    7.3968    21.0612   3.8067     2.6971 (0.8238)
6-step     8.7864    7.5659    21.8979   4.6517     2.7371 (0.7889)
7-step     9.4033    9.5344    22.1063   5.0789     2.1663 (0.7915)
8-step     9.4587    9.5561    23.6841   8.3264     2.4619 (0.7720)
9-step     10.9792   10.9181   28.5120   6.4187     3.3041 (0.8603)
10-step    12.7650   10.0274   29.4743   6.9559     3.7495 (0.8772)
11-step    14.7686   9.5501    28.9017   19.0988    4.9000 (1.0343)
12-step    13.1312   13.8144   30.7653   17.5508    4.9966 (1.0965)

Table 10
RMSE of multi-step forecasting on Laser.
           ANFIS     NN-MAT    ARMA      Sorjamaa   LMA
2-step     22.5924   15.0670   33.2179   4.6673     4.2313 (0.9089)
3-step     17.9022   18.2621   33.2998   10.7440    7.4607 (1.3115)
4-step     20.1061   24.4235   36.2056   15.7759    6.5999 (1.3107)
5-step     19.5676   19.8554   37.9758   11.0067    8.3976 (1.4436)
6-step     18.4178   17.9245   38.6497   12.2136    7.8629 (1.3684)
7-step     20.5050   19.6204   38.9259   12.2059    5.9445 (1.3478)
8-step     21.3847   19.1559   39.2201   14.3293    5.1336 (1.3172)
9-step     24.8706   20.8771   44.2355   14.3791    7.7728 (1.4507)
10-step    24.7307   24.2657   45.2404   14.8786    9.9128 (1.3533)
11-step    29.2389   22.5536   43.4734   32.3007    12.6468 (1.6861)
12-step    27.3947   30.1944   43.9369   29.2913    12.7318 (1.8214)


Fig. 5. Multi-step forecasted results by LMA and NN-MAT on Laser: (a) 5-step forecasting; (b) 9-step forecasting.

Table 11
Performance of multi-step forecasting on EUNITE.
Method                                MAPE          MAX ERROR       # of inputs
ARMAX                                 3.69          67.56           20
ANFIS                                 4.54          94.74           5
Backpropagation [17]                  5.05          111.89          49
Gain Scaling [17]                     4.87          137.78          49
Gain Scaling with SS [17]             2.19          55.95           49
Gain Scaling with CIS and SS [17]     2.77          70.99           20
Early Stopping [17]                   1.95          40.28           49
Early Stopping with SS [17]           2.13          50.90           49
Early Stopping with CIS and SS [17]   2.87          71.26           20
Extended Bayesian Training [17]       1.75          55.64           40
L2-SVM with CV [17]                   3.52          60.39           49
L2-SVM with CIS and CV [17]           2.87          67.17           20
L2-SVM Gradient Descent [17]          2.07          59.78           45
Benchmark [16]                        1.98          51.42           –
LMA                                   1.71 (0.14)   40.99 (73.28)   12


Fig. 6. Multi-step forecasted results by LMA and ARMAX on EUNITE.

Table 12
Performance of LMA, with different settings of k, on Laser.
k      MAE      RMSE
110    1.0775   1.9369
120    1.0344   1.7951
130    1.0212   1.9197
140    0.9904   1.6181
150    0.7236   0.9570
160    0.9348   1.5870
170    0.9376   1.6058
180    0.9766   1.5502
190    1.0046   1.6440
200    1.0804   1.9369

5.2.2. EUNITE dataset


The EUNITE dataset was provided by the EUNITE network for a well-known competition [16]. It contains the electricity load demand recorded every half an hour from January 1997 through December 1998. Additional inputs include average daily temperatures and information about holidays. The goal of the competition was to predict the maximum daily values of electrical load for January 1999 (31 values altogether). As specified in the competition, we use the maximum daily values of electrical load (LoadMax) and the average daily temperatures (Temperature) as inputs. The information about holidays is not used. Some statistics of the training data are shown in Table 3. Table 11 shows the results obtained by a number of different methods. The numbers listed in the rows with citations are taken from [17]. The benchmark is the winner of the competition [8]. In this table, MAX ERROR indicates the maximum absolute error over the 31 forecasts and # of inputs indicates the number of inputs involved in forecasting. For LMA, we set k = 200, q = 35, and d = 12. The hyper-parameters set automatically by the LS-SVM program for LMA are γ ≈ 400 and σ² ≈ 40. These settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers in the row of the LMA entry indicate the training errors of LMA. The forecasted results obtained by LMA and ARMAX are shown pictorially in Fig. 6, where the horizontal axis indicates the number of steps ahead to be forecasted. Apparently, our forecasted results match the actual values more closely.
6. Discussion
We discuss some issues related to our proposed method LMA.
6.1. Effect of k
In our experiments, the value of k is set by trial-and-error. Once set, k is not changed during the testing stage. Here, we would like to show how much the value of k affects the performance of LMA. Table 12 shows the MAE and RMSE obtained by LMA on Laser, with k varying in the range between 110 and 200. In this experiment, q and d are fixed to 9 and 7, respectively. Note that different settings result in a variation of errors. It is not easy to determine a good k for a given application.


One way is to get help from experts who have domain knowledge about the underlying application. Another way is to utilize some optimization method, e.g., a GA. However, optimization may take time.
6.2. Comparison among different measures of similarity
We adopt the hybrid distance to measure the similarity between two sequences in our approach. We show here that using the hybrid distance to find the nearest neighbors is a good idea.
Suppose the distance between two sequences is measured by the time index, i.e., two sequences with close time indices are considered to be close to each other. For example, consider the example in Section 4. If measured by the time index, the 6 nearest neighbors of Q would be S_17, S_16, S_15, S_14, S_13, and S_12. Table 13 shows the MAE and RMSE obtained by LMA on Laser, with k varying in the range between 130 and 200, where the k nearest neighbors are determined by the time index. As before, q and d are fixed to 9 and 7, respectively. We can see that the performance is poor in both MAE and RMSE. This indicates that the time index is less appropriate than the hybrid distance in determining the nearest neighbors of Q for LMA.
Alternatively, the similarity between two sequences can be measured by only the Euclidean distance between the original sequences. Let us give an example to explain why the hybrid distance is more effective. For simplicity, suppose only one variable Y is involved. Consider three sequences A_1, A_2, and A_3:

A_1 = (5, 4, 3, 2, 1),  A_2 = (5, 4, 2, 0, -1),  A_3 = (5, 3, 1, 2, 3).

We have

N_E(A_1, A_2) = N_E(A_1, A_3) = \frac{3}{4.69} = 0.6397.    (23)

If we measure the similarity based on N_E(A_1, A_2) and N_E(A_1, A_3), then A_2 and A_3 are equally similar to A_1. Now we consider the differential sequences

F_1 = (-1, -1, -1, -1),  F_2 = (-1, -2, -2, -1),  F_3 = (-2, -2, 1, 1)

and have

N_E(F_1, F_2) = \frac{1.414}{3.7417} = 0.3779, \quad N_E(F_1, F_3) = \frac{3.162}{3.7417} = 0.8450.    (24)

Therefore, we have

N_H(A_1, A_2) = \frac{0.6397 + 0.3779}{2} = 0.5088, \quad N_H(A_1, A_3) = \frac{0.6397 + 0.8450}{2} = 0.7424.    (25)

By taking trends into account, we have N_H(A_1, A_2) < N_H(A_1, A_3) and conclude that A_2 is closer to A_1. This is reasonable since both A_1 and A_2 have a downward trend while A_3 has a down-then-up trend. Table 14 shows the MAE and RMSE of one-step forecasting on Laser obtained by LMA with different similarity measures. Note that we set k = 150, q = 9, and d = 7. In this table, ED-SEQ stands for the Euclidean distance between the original sequences, ED-FOD for the Euclidean distance between the differential sequences, and HD-SEQ for the hybrid distance. As can be seen, the hybrid distance works more effectively than the other two alternatives. Note that in this work we simply use the weight 1/2 in Eq. (5); however, it could be specified by a domain expert or learned through an optimization method, e.g., a GA.
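These numbers can be checked with a few lines of code; since the same normalizing constants apply to both pairs being compared, the plain (unnormalized) Euclidean distances used in the sketch below preserve the conclusion that A_2 is closer to A_1 than A_3 is.

```python
import numpy as np

A1 = np.array([5, 4, 3, 2, 1], float)
A2 = np.array([5, 4, 2, 0, -1], float)
A3 = np.array([5, 3, 1, 2, 3], float)

for name, A in (("A2", A2), ("A3", A3)):
    d_seq = np.linalg.norm(A1 - A)                       # both pairs give 3
    d_diff = np.linalg.norm(np.diff(A1) - np.diff(A))    # sqrt(2) vs sqrt(10)
    print(name, 0.5 * (d_seq + d_diff))
# A2 yields the smaller hybrid distance, matching N_H(A1, A2) < N_H(A1, A3).
```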

Table 13
Performance of LMA, with distance measured by time index, on Laser.
k      MAE       RMSE
130    18.2506   26.6031
140    18.1728   27.5728
150    12.3552   26.2939
160    14.5660   29.1388
170    10.6852   28.2920
180    10.1166   27.0055
190    10.2233   29.9817
200    10.4542   30.0145



Table 14
Performance of LMA, with different similarity measures, on Laser.
          MAE      RMSE
ED-SEQ    0.9668   1.5667
ED-FOD    1.4750   3.0873
HD-SEQ    0.7236   0.9570

Table 15
Performance of LMA, with different settings of d, on Laser.
d          MAE      RMSE
4          1.2883   2.4091
5          0.9426   1.6350
6          0.9479   1.4245
7          0.7236   0.9570
8          0.8869   1.4860
9          0.9609   1.9711
10 (all)   1.0468   2.1052

Table 16
Performance of LMA, with different settings of q, on Poland.
q      MAE      RMSE
6      0.0264   0.0477
7      0.0251   0.0462
8      0.0249   0.0460
9      0.0245   0.0454
10     0.0256   0.0480
11     0.0258   0.0487
12     0.0259   0.0507

Table 17
Comparison between LS-SVM and modified LS-SVM on Laser and Poland.
                   Laser              Poland
                   MAE      RMSE      MAE      RMSE
LS-SVM             0.8984   1.5171    0.0271   0.0532
Modified LS-SVM    0.7236   0.9570    0.0245   0.0454

Table 18
Comparison between LS-SVM and modified LS-SVM on EUNITE.
                   MAPE     MAX ERROR
LS-SVM             1.7653   45.6756
Modified LS-SVM    1.7100   40.9900

6.3. Effect of d and q


In our experiments, the values of d and q are set by trial-and-error. Once set, d and q are not changed during the testing stage. Here, we would like to show how much the values of d and q affect the performance of LMA. Note that determining good values for d and q is not an easy task. Seeking advice from experts is surely helpful. Applying some optimization method, e.g., a GA, is also useful but could be time-consuming. Table 15 shows the MAE and RMSE of one-step forecasting on Laser obtained by LMA with different values of d. For this table, k and q are fixed to 150 and 9, respectively. Therefore, the total number of candidate lags is 10. We can see that using all the candidate lags, i.e., d = 10, for forecasting does not provide the best performance. Instead, the best performance occurs at d = 7. Table 16 shows the MAE and RMSE of one-step forecasting on Poland obtained by LMA, with q varying in the range between 6 and 12. For this table, k and d are fixed to 150 and 6, respectively. Note that different settings result in a small variation of errors.


6.4. Comparison between LS-SVM and modified LS-SVM

Finally, we compare the performance of LS-SVM and the modified LS-SVM adopted in LMA. Table 17 shows the results of one-step forecasting on Laser and Poland, and Table 18 shows the results of multi-step forecasting on EUNITE. For LS-SVM, no weights g_j appear in Eq. (9). As can be seen, the modified LS-SVM provides better performance than LS-SVM.
7. Conclusion
We have presented a machine learning based local modeling approach for time series forecasting. Several steps are involved in our approach. Firstly, the k neighbors which are most similar to the given forecasting sequence are located. Secondly, proper lags associated with relevant variables for forecasting are determined. Thirdly, an optimal forecasting model is derived by applying a modified LS-SVM. The derived model can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. Several experiments have been conducted and the results have shown the effectiveness of the proposed approach for time series forecasting.
One drawback of the proposed approach is the demanding cost of computing the distance between the forecasting sequence and each of its neighbors. Some algorithms have been proposed [34,46] for fast distance computation. Another possibility is to divide the observed data into different groups in advance by certain clustering techniques, e.g., fuzzy clustering [41] or a self-organizing multi-layer perceptron [18]. For any forecasting sequence, the cluster to which the forecasting sequence is most similar is identified, and the sequences contained in this cluster are regarded as the nearest neighbors. In this way, the computational complexity can be greatly reduced.
Acknowledgments
The authors are grateful to the anonymous reviewers, Associate Editor, and Editor-in-Chief for their comments, which
were very helpful in improving the quality and presentation of the paper.
References
[1] M. Abdollahzade, A. Miranian, H. Hassani, H. Iranmanesh, A new hybrid enhanced local linear neuro-fuzzy model based on the optimized singular spectrum analysis and its application for nonlinear and chaotic time series forecasting, Inform. Sci. 295 (2015) 107–125.
[2] R. Adhikari, R.K. Agrawal, A homogeneous ensemble of artificial neural networks for time series forecasting, Int. J. Comput. Appl. 32 (7) (2011) 1–8.
[3] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66.
[4] S.K. Bag, ANN based prediction of blast furnace parameters, Inst. Eng. 68 (1) (2007) 37–42.
[5] Y. Bao, T. Xiong, Z. Hu, PSO-MISMO modeling strategy for multistep-ahead time series prediction, IEEE Trans. Cybernet. 44 (5) (2014) 655–668.
[6] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, Time Series Analysis: Forecasting and Control, Wiley, 2008.
[7] M. Bozic, M. Stojanovic, Z. Stajic, N. Floranovic, Mutual information-based inputs selection for electric load time series forecasting, Entropy 15 (2013) 926–942.
[8] B.-J. Chen, M.-W. Chang, C.-J. Lin, Load forecasting using support vector machines: a study on EUNITE competition 2001, IEEE Trans. Power Syst. 19 (4) (2004) 1821–1830.
[9] S.-M. Chen, TAIEX forecasting based on fuzzy time series and fuzzy variation groups, IEEE Trans. Fuzzy Syst. 19 (1) (2011) 1–12.
[10] S.-M. Chen, Y.-C. Chang, Multi-variable fuzzy forecasting based on fuzzy clustering and fuzzy rule interpolation techniques, Inform. Sci. 180 (24) (2010) 4772–4783.
[11] S.-M. Chen, H.-P. Chu, T.-W. Sheu, TAIEX forecasting using fuzzy time series and automatically generated weights of multiple factors, IEEE Trans. Syst. Man Cybernet. Part A 42 (6) (2012) 1485–1495.
[12] S.-M. Chen, G.M.T. Manalu, J.-S. Pan, H.-C. Liu, Fuzzy forecasting based on two-factors second-order fuzzy-trend logical relationship groups and particle swarm optimization techniques, IEEE Trans. Cybernet. 43 (3) (2013) 1102–1117.
[13] Y. Chen, B. Yang, J. Dong, Time-series prediction using a local linear wavelet neural network, Neurocomputing 69 (4–6) (2006) 449–465.
[14] S.F. Crone, N. Kourentzes, Feature selection for time series prediction – a combined filter and wrapper approach for neural networks, Neurocomputing 73 (10–12) (2010) 1923–1936.
[15] Dow Jones Web Site. <http://www.djindexes.com/>.
[16] EUNITE Data Set. <http://neuron.tuke.sk/competition/index.php>.
[17] V.H. Ferreira, A.P. Alves da Silva, Toward estimating autonomous neural network-based electric load forecasters, IEEE Trans. Power Syst. 22 (4) (2007) 1554–1562.
[18] B. Gas, Self-organizing multi-layer perceptron, IEEE Trans. Neural Netw. 21 (11) (2010) 1766–1779.
[19] F. Gaxiola, P. Melin, F. Valdez, O. Castillo, Interval type-2 fuzzy weight adjustment for backpropagation neural networks with application in time series prediction, Inform. Sci. 260 (2014) 1–14.
[20] M. Ghiassi, H. Saidane, A dynamic architecture for artificial neural networks, Neurocomputing 63 (2005) 397–413.
[21] J.G. De Gooijer, R.J. Hyndman, 25 years of time series forecasting, Int. J. Forecast. 22 (3) (2006) 443–473.
[22] J.A. Guajardo, R. Weber, J. Miranda, A model updating strategy for predicting time series with seasonal patterns, Appl. Soft Comput. 10 (1) (2010) 276–283.
[23] E. Guresen, G. Kayakutlu, T.U. Daim, Using artificial neural network models in stock market index prediction, Expert Syst. Appl. 38 (8) (2011) 10389–10397.
[24] M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design, PWS Pub. Co., 1995.
[25] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2011.
[26] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2008.
[27] H.S. Hippert, D.W. Bunn, R.C. Souza, Large neural networks for electricity load forecasting: are they overfitted?, Int. J. Forecast. 21 (3) (2005) 425–434.
[28] Z. Huang, M.-L. Shyu, k-NN based LS-SVM framework for long-term time series prediction, in: 2010 IEEE International Conference on Information Reuse and Integration, 2010, pp. 69–74.
[29] Z. Huang, M.-L. Shyu, Long-term time series prediction using k-NN based LS-SVM framework with multi-value integration, in: Recent Trends in Information Reuse and Integration, Springer, Vienna, 2012, pp. 191–209 (Chapter 9).
[30] K.-C. Hung, K.-P. Lin, Long-term business cycle forecasting through a potential intuitionistic fuzzy least-squares support vector regression approach, Inform. Sci. 224 (2013) 37–48.
[31] J.-S.R. Jang, Fuzzy modeling using generalized neural networks and Kalman filter algorithm, in: Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), 1991, pp. 762–767.
[32] J.-S.R. Jang, ANFIS: adaptive-network based fuzzy inference systems, IEEE Trans. Syst. Man Cybernet. 23 (3) (1993) 665–685.
[33] Z. Ji, B. Wang, S. Deng, Z. You, Predicting dynamic deformation of retaining structure by LSSVR-based time series method, Neurocomputing 137 (2014) 165–172.
[34] N. Kaneko, S. Matsuzaki, M. Ito, H. Oogai, K. Uchida, Application of improved local models of large scale database-based online modeling to prediction of molten iron temperature of blast furnace, ISIJ Int. 50 (7) (2010) 939–945.
[35] H. Kantz, Nonlinear Time Series Analysis, Cambridge University Press, 2003.
[36] M. Khashei, M. Bijari, An artificial neural network (p, d, q) model for time series forecasting, Expert Syst. Appl. 37 (1) (2010) 479–489.
[37] M. Khashei, M. Bijari, Which methodology is better for combining linear and nonlinear models for time series forecasting?, J. Ind. Syst. Eng. 4 (4) (2011) 265–285.
[38] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (6) (2004) 066138.
[39] A. Kusiak, H. Zheng, Z. Song, Short-term prediction of wind farm power: a data mining approach, IEEE Trans. Energy Convers. 24 (1) (2009) 125–136.
[40] Laser Time Series Data Set. <http://www-psych.stanford.edu/andreas/Time-Series/SantaFe.html>.
[41] S.-J. Lee, C.-S. Ouyang, A neuro-fuzzy system modeling with self-constructing rule generation and hybrid SVD-based learning, IEEE Trans. Fuzzy Syst. 11 (3) (2003) 341–353.
[42] W. Li, Mutual information functions versus correlation functions, J. Stat. Phys. 60 (5–6) (1990) 823–837.
[43] C.-F. Liu, C.-Y. Yeh, S.-J. Lee, Application of type-2 neuro-fuzzy modeling in stock price prediction, Appl. Soft Comput. 12 (4) (2012) 1348–1358.
[44] LS-SVM Program. <http://www.esat.kuleuven.be/sista/lssvmlab/>.
[45] J. McNames, A nearest trajectory strategy for time series prediction, in: Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, K.U. Leuven, Belgium, 1998, pp. 112–128.
[46] J. McNames, B. Widrow, J.H. Friedman, J.P. How, Innovations in Local Modeling for Time Series Prediction, 1999. <http://web.cecs.pdx.edu/mcnames/Publications/Dissertation.pdf>.
[47] A. Miranian, M. Abdollahzade, Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction, IEEE Trans. Neural Netw. Learn. Syst. 24 (2) (2013) 207–218.
[48] NASDAQ Web Site. <http://www.nasdaq.com/>.
[49] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 2004.
[50] C.-S. Ouyang, W.-J. Lee, S.-J. Lee, A TSK-type neuro-fuzzy network approach to system modeling problems, IEEE Trans. Syst. Man Cybernet. Part B: Cybernet. 35 (4) (2005) 751–767.
[51] Poland Data Set. <http://research.ics.aalto.fi/eiml/datasets.shtml>.
[52] N.I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Comput. Intell. Mag. 4 (2) (2009) 24–38.
[53] A. Sfetsos, C. Siriopoulos, Time series forecasting with a hybrid clustering scheme and pattern recognition, IEEE Trans. Syst. Man Cybernet. Part A 34 (3) (2004) 399–405.
[54] G. Silviu, Information Theory with Applications, McGraw-Hill, 1977.
[55] Q. Song, B.S. Chissom, Fuzzy time series and its models, Fuzzy Sets Syst. 54 (3) (1993) 269–277.
[56] Q. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part I, Fuzzy Sets Syst. 54 (1) (1993) 1–9.
[57] Q. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part II, Fuzzy Sets Syst. 62 (1) (1994) 1–8.
[58] A. Sorjamaa, J. Hao, A. Lendasse, Mutual information and k-nearest neighbors approximator for time series prediction, Lect. Notes Comput. Sci. 3657 (2005) 553–558.
[59] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (16–18) (2007) 2861–2869.
[60] J.H. Stock, M.W. Watson, Introduction to Econometrics, Addison-Wesley, 2010.
[61] H. Stögbauer, A. Kraskov, S.A. Astakhov, P. Grassberger, Least-dependent-component analysis based on mutual information, Phys. Rev. E 70 (2004) 066123.
[62] M.B. Stojanovic, M.M. Bozic, M.M. Stankovic, Z.P. Stajic, A methodology for training set instance selection using mutual information in time series prediction, Neurocomputing 141 (2014) 236–245.
[63] Sunspot Data Set. <http://sidc.oma.be/sunspot-data/>.
[64] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (1–4) (2002) 85–105.
[65] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing Company, 2002.
[66] TAIEX Web Site. <http://www.tese.com.tw/en/products/indices/tsec/taiex.php>.
[67] F.E.H. Tay, L.J. Cao, Modified support vector machines in financial time series forecasting, Int. J. Forecast. 48 (1) (2002) 69–84.
[68] S.S. Torbaghan, A. Motamedi, H. Zareipour, L.A. Tuan, Medium-term electricity price forecasting, in: North American Power Symposium (NAPS) 2012, 2012, pp. 1–8.
[69] W.W.S. Wei, Time Series Analysis: Univariate and Multivariate Methods, Pearson, 2005.
[70] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.
[71] L. Zhang, W.-D. Zhou, P.-C. Chang, J.-W. Yang, F.-Z. Li, Iterated time series prediction with multiple support vector regression models, Neurocomputing 99 (2013) 411–422.
[72] H. Zou, Y. Yang, Combining time series models for forecasting, Int. J. Forecast. 20 (1) (2004) 69–84.
