Information Sciences
journal homepage: www.elsevier.com/locate/ins
Article info
Article history:
Received 22 May 2014
Received in revised form 6 December 2014
Accepted 11 December 2014
Available online 17 December 2014
Keywords:
Time series forecasting
Multi-step forecasting
Nearest neighbor
Mutual information
Support vector machine
Abstract
Time series forecasting is important because it can often provide the foundation for decision making in a large variety of fields. Statistical approaches have been extensively adopted for time series forecasting in the past decades. Recently, machine learning techniques have drawn attention, and useful forecasting systems based on these techniques have been developed. In this paper, we propose a weighted Least Squares Support Vector Machine (LS-SVM) based approach for time series forecasting. Given a forecasting sequence, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information. Based on the training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. A number of experiments are conducted to demonstrate the effectiveness of the proposed approach for time series forecasting.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
Time series forecasting concerns forecasting of future events based on a series of observations taken previously at equally spaced time intervals [25]. It has played an important role in the decision making process in a variety of fields such as finance, power supply, and medical care [21]. One example is to predict stock exchange indices or closing stock prices in the stock market [9–12,23,43,60]. Another example is to predict the electricity demand to avoid producing extra electric power in the power supply industry [27,39,68]. If forecasting is done for one time step ahead into the future, it is called single-step or one-step forecasting. For the case of two or more time steps ahead, it is usually called multi-step forecasting [5,21].
Two approaches have been adopted for constructing time series forecasting models. The global modeling approach constructs a model which is independent of the target to be forecasted. For time series prediction, the conditions of the environment may vary as time goes on. A global model is not adaptive, and thus accuracy suffers. A local model constructed by the local modeling approach [28,29,45–47] is dependent on the target to be forecasted and therefore is adaptive. Local
⋆ This work was supported by the National Science Council under the Grants NSC-99-2221-E-110-064-MY3 and NSC-101-2622-E-110-011-CC3, and by the Aim for the Top University Plan of the National Sun Yat-Sen University and Ministry of Education.
⁎ Corresponding author.
E-mail address: leesj@mail.ee.nsysu.edu.tw (S.-J. Lee).
http://dx.doi.org/10.1016/j.ins.2014.12.031
0020-0255/© 2014 Elsevier Inc. All rights reserved.
models are usually characterized by using a small number of the neighbors in the proximity of the forecasting sequence.
Another issue in time series forecasting is to determine the lags to be involved in the model. The lags have a big influence on the forecasting accuracy. For example, in steel making engineering [4,34], the furnace temperature will change two to eight hours from the time the materials are applied into the furnace. This indicates that there is a time lag of two to eight hours for the temperature change. Furthermore, the lags may vary as time goes on, and adjusting them is required [59]. Lag selection methods using t-statistics of estimated coefficients or F-statistics of groups of coefficients to measure statistical significance were proposed [26]. The idea of mutual information was adopted in lag selection for electric load time series forecasting [7]. Two strategies, direct and iterative, have been adopted for constructing time series forecasting models. The difference between them lies in whether or not the forecasts of previous steps are involved in the prediction of the current step.
Statistical methods have been extensively adopted for time series forecasting in the past decades, among them moving average, weighted moving average, Kalman filtering, exponential smoothing, regression analysis, autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), and autoregressive moving average with exogenous inputs (ARMAX) [6]. These methods are based on the assumption that a probability model generates the underlying time series data. Future values of the time series are assumed to be related to past values as well as to past errors. Box–Jenkins models [6], i.e., ARMA, ARIMA, and ARMAX, are quite flexible due to the inclusion of both autoregressive and moving average terms. Recently, machine learning techniques have drawn attention, and useful forecasting systems based on these techniques have been developed [49]. The multilayer perceptron, often simply called the neural network, is a popular network architecture in use for time series forecasting [13,20,24]. The neural network often encounters the local minimum problem during the learning process, and the number of nodes in the hidden layer is usually difficult to decide. The k-nearest neighbor regression method is a nonparametric method which bases its prediction on the k nearest neighbors of the target to be forecasted [29]. However, it lacks the capability of adaptation, and the distance measure adopted may affect prediction performance. Classification and Regression Trees (CART) is a regression model based on a hierarchical tree-like partition of the input space [22,53,72]. Fuzzy theory has been incorporated for prediction in the stock market [9,19,55–57]. However, membership functions need to be determined, which is often a challenging task for the user. Also, no learning is offered by fuzzy theory. Support vector regression can provide high accuracy by mapping into a high-dimensional feature space and including a penalty term in the error function [30,33,47,52,65,71]. Neuro-fuzzy modeling is a hybrid approach which takes advantage of both neural network techniques and fuzzy theory [1,31,32,43,50]. Supervised learning algorithms are utilized to optimize the parameters of the produced forecasting models through learning.
Some of the existing forecasting methods can only be applied in univariate forecasting, others can only work for one-step forecasting, and yet others cannot produce adaptive models. In this paper, we propose a direct local modeling approach based on machine learning to derive forecasting models. Given a forecasting sequence, two steps are involved in our proposed approach. Firstly, a suitable set of training patterns is extracted from the historical data by employing the concepts of k-nearest neighbors and mutual information. A measure taking trends into consideration is adopted to gauge the similarity between two sequences of data, and proper lags associated with relevant variables for forecasting are determined. Secondly, based on the extracted training patterns, a modified LS-SVM is developed to derive a forecasting model which can then be used for forecasting. Our proposed approach has several advantages. It can produce adaptive forecasting models. It works for univariate and multivariate cases. It also works for one-step as well as multi-step forecasting. The effectiveness of the proposed approach is demonstrated by the results obtained from a number of experiments conducted on real-world datasets.
The rest of this paper is organized as follows. Section 2 states the problem to be solved. Section 3 describes in detail our proposed forecasting approach. An illustrative example is given in Section 4. Experimental results are presented in Section 5. Some relevant issues are discussed in Section 6. Finally, concluding remarks are given in Section 7.
2. Problem statement
Consider a series of real-valued observations [69]:

(X_0, Y_0), (X_1, Y_1), …, (X_t, Y_t)

taken at equally spaced time points t_0, t_0 + Δt, t_0 + 2Δt, … for some process P, where Y_i denotes the value of the output variable (or dependent variable) observed at the time point t_0 + iΔt and X_i denotes the values of n additional variables (or independent variables), n ≥ 0, observed at the time point t_0 + iΔt. Time series forecasting is to estimate the value of Y at some future time t + s, i.e., Y_{t+s}, by

Ŷ_{t+s} = G(X_{t−q}, Y_{t−q}, X_{t−q+1}, Y_{t−q+1}, …, X_t, Y_t)    (1)

where s ≥ 1 is called the horizon of prediction, G is the forecasting function or model, Y_{t−i} is the ith lag of Y_t, X_{t−i} is the ith lag of X_t, and q is the lag-span of prediction. For s = 1, it is called one-step forecasting. For s > 1, it is called multi-step forecasting. Also, if n = 0, it is univariate forecasting; otherwise, it is multivariate forecasting.
Forecasting Y_{t+s} can be regarded as a function approximation task. Two strategies are usually adopted to construct forecasting models:

Direct. Train on (X_{t−q}, Y_{t−q}), …, (X_{t−1}, Y_{t−1}), (X_t, Y_t) to predict Y_{t+s} directly, for any s ≥ 1.
Iterative. Train to predict Y_{t+1} only, but iterate to get Y_{t+s} for any s > 1.

These strategies work identically for the case of s = 1. However, for the case of s > 1, the iterative strategy cannot work for multivariate prediction, since X_{t+1}, …, X_{t+s−1} are not available to obtain Ŷ_{t+2}, Ŷ_{t+3}, …, Ŷ_{t+s} successively. The forecasting model for estimating Y_{t+s} can be obtained by a global or local modeling approach. These two approaches have their own pros and cons, as described in the previous section.
3. Proposed approach
Without loss of generality, we consider only Y and one additional variable X, i.e., n = 1, in Eq. (1). Extension to the case of two or more additional variables is trivial. Our goal is to estimate Y_{t+s}, s ≥ 1, from (X_{t−q}, Y_{t−q}), (X_{t−q+1}, Y_{t−q+1}), …, (X_{t−1}, Y_{t−1}), (X_t, Y_t). To do this, we try to learn from the observed data how Y_{i+s} is related to (X_{i−q}, Y_{i−q}), …, (X_i, Y_i) for q ≤ i ≤ t − s. Based on what we have learned, we intend to predict Y_{t+s} from (X_{t−q}, Y_{t−q}), (X_{t−q+1}, Y_{t−q+1}), …, (X_t, Y_t) in a similar manner. For convenience, the sequence

Q = (X_{t−q}, Y_{t−q}), (X_{t−q+1}, Y_{t−q+1}), …, (X_{t−1}, Y_{t−1}), (X_t, Y_t)    (3)

is called the forecasting sequence. From the historical data, we form the following sequences:

S_1 = (X_0, Y_0), (X_1, Y_1), …, (X_{q−1}, Y_{q−1}), (X_q, Y_q),
S_2 = (X_1, Y_1), (X_2, Y_2), …, (X_q, Y_q), (X_{q+1}, Y_{q+1}),
…,
S_{z−1} = (X_{z−2}, Y_{z−2}), (X_{z−1}, Y_{z−1}), …, (X_{t−s−2}, Y_{t−s−2}), (X_{t−s−1}, Y_{t−s−1}),
S_z = (X_{z−1}, Y_{z−1}), (X_z, Y_z), …, (X_{t−s−1}, Y_{t−s−1}), (X_{t−s}, Y_{t−s})    (4)

where z = t − s − q + 1. For convenience, we call S = {S_1, S_2, …, S_z} the neighbor set of Q and each element in S a neighbor of Q.
Let A_1 and A_2 be two sequences:

A_1 = (X_a, Y_a), (X_{a+1}, Y_{a+1}), …, (X_{a+q−1}, Y_{a+q−1}), (X_{a+q}, Y_{a+q}),
A_2 = (X_b, Y_b), (X_{b+1}, Y_{b+1}), …, (X_{b+q−1}, Y_{b+q−1}), (X_{b+q}, Y_{b+q}),

and let F_1 and F_2 be the differential sequences of A_1 and A_2, respectively, defined as

F_1 = (X_{a+1} − X_a, Y_{a+1} − Y_a, …, X_{a+q} − X_{a+q−1}, Y_{a+q} − Y_{a+q−1}),
F_2 = (X_{b+1} − X_b, Y_{b+1} − Y_b, …, X_{b+q} − X_{b+q−1}, Y_{b+q} − Y_{b+q−1}).

Note that the differential sequence reveals trends, i.e., rising or falling conditions, of the original sequence. A negative entry in the differential sequence indicates a falling occurrence at the underlying point in the original sequence, while a positive entry indicates a rising occurrence. Let N_E(A_1, A_2) be the normalized Euclidean distance between A_1 and A_2, and N_E(F_1, F_2) be the normalized Euclidean distance between F_1 and F_2. The following measure [29]:

N_H(A_1, A_2) = (N_E(A_1, A_2) + N_E(F_1, F_2)) / 2    (5)

is adopted to locate the k nearest neighbors of Q. We call N_H(A_1, A_2) the hybrid distance between A_1 and A_2.
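A minimal sketch of the hybrid distance and the neighbor search follows. As an assumption, the plain (unnormalized) Euclidean distance stands in for the paper's normalized N_E; the `norm` argument is a hypothetical hook for plugging in the exact normalization:

```python
import numpy as np

def differential(seq):
    """First-order differences; negative entries mark falls, positive entries rises."""
    a = np.asarray(seq, dtype=float)
    return a[1:] - a[:-1]

def hybrid_distance(a1, a2, norm=lambda u, v: np.linalg.norm(u - v)):
    """N_H(A1, A2) = (N_E(A1, A2) + N_E(F1, F2)) / 2, as in Eq. (5).

    `norm` defaults to the plain Euclidean distance; the paper's normalized
    Euclidean distance can be supplied through this hook."""
    a1, a2 = np.asarray(a1, dtype=float), np.asarray(a2, dtype=float)
    return 0.5 * (norm(a1, a2) + norm(differential(a1), differential(a2)))

def k_nearest_neighbors(Q, S, k):
    """Indices of the k neighbors in S with the smallest hybrid distance to Q."""
    d = np.array([hybrid_distance(Q, Si) for Si in S])
    return np.argsort(d)[:k]
```

Flattening each neighbor sequence into a vector, the k nearest neighbors of Q are then simply the k smallest entries of the distance array.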
We identify the k nearest neighbors of Q as follows. We calculate the hybrid distance between Q and every neighbor in S. As a result, we have z hybrid distances. The k sequences with the k shortest hybrid distances are taken to be the k nearest neighbors of Q. For convenience, these k nearest neighbors are expressed as

(X_{t_1−q}, Y_{t_1−q}), (X_{t_1−q+1}, Y_{t_1−q+1}), …, (X_{t_1−1}, Y_{t_1−1}), (X_{t_1}, Y_{t_1}),
(X_{t_2−q}, Y_{t_2−q}), (X_{t_2−q+1}, Y_{t_2−q+1}), …, (X_{t_2−1}, Y_{t_2−1}), (X_{t_2}, Y_{t_2}),
…,
(X_{t_{k−1}−q}, Y_{t_{k−1}−q}), (X_{t_{k−1}−q+1}, Y_{t_{k−1}−q+1}), …, (X_{t_{k−1}−1}, Y_{t_{k−1}−1}), (X_{t_{k−1}}, Y_{t_{k−1}}),
(X_{t_k−q}, Y_{t_k−q}), (X_{t_k−q+1}, Y_{t_k−q+1}), …, (X_{t_k−1}, Y_{t_k−1}), (X_{t_k}, Y_{t_k})    (6)

where t_1, t_2, …, t_k are the ending time indices of the k nearest neighbors. From these neighbors, we form the lag vectors

Z_{2i+1} = (X_{t_1−q+i}, X_{t_2−q+i}, …, X_{t_k−q+i})^T, 0 ≤ i ≤ q,
Z_{2i+2} = (Y_{t_1−q+i}, Y_{t_2−q+i}, …, Y_{t_k−q+i})^T, 0 ≤ i ≤ q,    (7)

and the target vector

H = (Y_{t_1+s}, Y_{t_2+s}, …, Y_{t_k+s})^T.
We use a greedy approach with forward selection [14,59] to find a desired number of lags. Let d be the number of lags to be found. Firstly, we calculate MI(Z_i; H), 1 ≤ i ≤ 2q+2. Let MI(Z_{d_1}; H) be the largest, indicating that the most significant connection exists between Z_{d_1} and H. Therefore, Z_{d_1} is selected. Next, we calculate MI({Z_{d_1}, Z_i}; H), 1 ≤ i ≤ 2q+2 and i ≠ d_1. Let MI({Z_{d_1}, Z_{d_2}}; H) be the largest. Therefore, Z_{d_2} is also selected. Then we calculate MI({Z_{d_1}, Z_{d_2}, Z_i}; H), 1 ≤ i ≤ 2q+2, i ≠ d_1, and i ≠ d_2. This goes on until d lags are found. Then the lags obtained are used to determine the training patterns and the input for forecasting. For example, suppose d = 3 and Z_{2q−1}, Z_{2q+1}, and Z_{2q+2} are selected. By referring to Eqs. (6) and (7), we extract the following k training patterns:

[X_{t_j−1}, X_{t_j}, Y_{t_j}; Y_{t_j+s}], 1 ≤ j ≤ k.    (8)

These training patterns are to be used in training a direct forecasting model in the model derivation step. Accordingly, the input for forecasting Y_{t+s} is [X_{t−1}, X_t, Y_t].
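The greedy forward selection can be sketched as follows. `joint_mi` is a crude histogram-based MI estimator used purely for illustration (the paper relies on a k-NN based MI estimator [38]); `greedy_lag_selection` takes the candidate lag vectors Z_1, …, Z_{2q+2} and the target vector H:

```python
import numpy as np

def joint_mi(Z, h, bins=4):
    """Histogram estimate of I(Z; h) for a (k, m) feature block Z and target h.
    Crude equal-width binning, for illustration only."""
    Z, h = np.atleast_2d(np.asarray(Z, float).T).T, np.asarray(h, float)
    def disc(v):
        edges = np.linspace(v.min(), v.max() + 1e-12, bins + 1)
        return np.digitize(v, edges[1:-1])
    zc = np.zeros(len(h), dtype=int)
    for col in Z.T:                      # joint code over all selected columns
        zc = zc * bins + disc(col)
    hc = disc(h)
    n, mi = len(h), 0.0
    for z in np.unique(zc):
        for y in np.unique(hc):
            p_zy = np.sum((zc == z) & (hc == y)) / n
            if p_zy > 0:
                mi += p_zy * np.log(p_zy / ((np.sum(zc == z) / n) * (np.sum(hc == y) / n)))
    return mi

def greedy_lag_selection(Zs, H, d, mi=joint_mi):
    """Forward selection: repeatedly add the lag vector that maximizes MI with H."""
    selected = []
    while len(selected) < d:
        best, best_mi = None, -np.inf
        for i in range(len(Zs)):
            if i in selected:
                continue
            cols = np.column_stack([Zs[j] for j in selected + [i]])
            m = mi(cols, H)
            if m > best_mi:
                best, best_mi = i, m
        selected.append(best)
    return selected
```

The selected indices then determine which entries of each nearest neighbor enter the training patterns of Eq. (8).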
3.2. Model derivation
After we obtain the k training patterns, we can proceed to derive the direct forecasting models. Let the training patterns be represented as {(x_i, y_i)}_{i=1}^k, where x_i is the input and y_i is the desired output of pattern i. Note that we have also derived the input x for forecasting Y_{t+s}. In the case of the previous example shown in Eq. (8), the first training pattern includes x_1 = [X_{t_1−1}, X_{t_1}, Y_{t_1}] and y_1 = Y_{t_1+s}, the second training pattern includes x_2 = [X_{t_2−1}, X_{t_2}, Y_{t_2}] and y_2 = Y_{t_2+s}, …, and the kth training pattern includes x_k = [X_{t_k−1}, X_{t_k}, Y_{t_k}] and y_k = Y_{t_k+s}. The input x for forecasting Y_{t+s} is x = [X_{t−1}, X_t, Y_t].

The Least Squares Support Vector Machine (LS-SVM) [52,65] is a powerful method for solving function estimation problems. However, all the errors induced have the same weight in LS-SVM [3]. Preferably, we would like to give more credit to a training pattern that is more accurately forecasted in the training process [64,67]. That is, a training pattern with a smaller error is allowed to have a bigger weight. We modify the traditional LS-SVM to meet our requirement, and the modified LS-SVM can be expressed as

min J(ω, e) = (1/2) ω^T ω + (γ/2) Σ_{j=1}^{k} g_j e_j²
subject to y_i = ω^T φ(x_i) + b + e_i, i = 1, 2, …, k    (9)

where ω is the weight vector to be solved, γ is the regularization parameter which is a constant, φ maps the original input space to a high-dimensional feature space, b is a real constant, and e_1, e_2, …, e_k are error variables. Note that each error e_j, 1 ≤ j ≤ k, is weighted by g_j, which is defined as

g_j = exp(−(g_{1,j} + g_{2,j})/2)    (10)

where g_{1,j} is the hybrid distance between x and x_j and g_{2,j} is the normalized absolute difference between y_j and the median of y, i.e.,

g_{1,j} = N_H(x, x_j),    (11)
g_{2,j} = |y_j − median(y)| / max_j{|y_j − median(y)|}.    (12)

Note that median(y) is the median of all the y values and the denominator max_j{|y_j − median(y)|} is the maximum of all |y_j − median(y)|, 1 ≤ j ≤ k. Essentially, g_{1,j} concerns the closeness between x and x_j. As x_j gets closer to x, g_{1,j} gets smaller and, consequently, g_j is larger. Furthermore, we consider not only the input but also the output for the weighting. The rationale behind this consideration is that, for time series forecasting, it is not always the case that closer inputs lead to closer outputs. An example is shown in Table 1, which contains 150 pieces of data taken from the Poland Electrical Load dataset [51]. In this table, the left column indicates the ordering of the input distances between one reference point and all its 150 nearest neighbors. A smaller index indicates a neighbor closer to the reference point with respect to the input. The right column indicates the output distances between the reference point and these nearest neighbors. As can be seen, closer inputs may not lead to closer outputs. For instance, consider neighbors 1 and 2. Neighbor 1 is closer to the reference point in terms of the input distance, while neighbor 2 is closer to the reference point in terms of the output distance. Therefore, we consider the contribution of the output by including the term g_{2,j} in Eq. (12). As y_j gets closer to the median, g_{2,j} gets smaller and, consequently, g_j is larger. Using the median instead of the mean is due to the consideration of outliers, which may occur in practical applications. In the case of some outputs having abnormal values, the median is less likely to be affected.

Table 1
An example showing the relationship between inputs and outputs.

Neighbor (ranked by input distance) | Output distance
1   | 0.2364
2   | 0.1505
3   | 0.1009
4   | 0.0893
⋮   | ⋮
148 | 2.0958
149 | 1.0462
150 | 0.8589
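The outlier argument for the median is easy to check numerically; a tiny sketch with made-up values:

```python
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.2, 1.0])
y_out = y.copy()
y_out[-1] = 10.0  # one abnormal output value

# The mean jumps from 1.04 to 2.84, while the median only moves from 1.0 to 1.1:
print(np.mean(y), np.median(y))
print(np.mean(y_out), np.median(y_out))
```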
The solution to Eq. (9) can be derived by constructing the Lagrangian function:

L(ω, b, e, α) = J(ω, e) − Σ_{i=1}^{k} α_i (ω^T φ(x_i) + b + e_i − y_i)    (13)

where α_1, α_2, …, α_k are Lagrange multipliers. Eliminating ω and e from the optimality conditions yields the linear system

[ 0     1_k^T          ] [ b ]   [ 0 ]
[ 1_k   Ω + V_γ        ] [ α ] = [ y ]    (14)

where y = (y_1, y_2, …, y_k)^T, 1_k = (1, 1, …, 1)^T, α = (α_1, α_2, …, α_k)^T, V_γ = diag(1/(γg_1), 1/(γg_2), …, 1/(γg_k)), and Ω is the kernel matrix, which is defined by

Ω_{i,j} = φ(x_i)^T φ(x_j) = exp(−‖x_i − x_j‖² / σ²)    (15)

for i, j = 1, 2, …, k, with σ a kernel parameter. From Eq. (14), the values of b and α_1, α_2, …, α_k are obtained, and the forecasting model is given by

ŷ = G(x) = Σ_{i=1}^{k} α_i φ(x)^T φ(x_i) + b = Σ_{i=1}^{k} α_i exp(−‖x − x_i‖² / σ²) + b.    (16)
4. Example
Suppose we have a series of observations (X_0, Y_0), (X_1, Y_1), …, (X_19, Y_19), as shown in Table 2. We want to forecast the value of Y_21 based on the given data. For this case, we have t = 19 and s = 2. Let q = 1, k = 6, and d = 2. According to Eq. (3), the forecasting sequence Q is

Q = (X_18, Y_18), (X_19, Y_19),
Table 2
The observations (X_t, Y_t), t = 0, 1, …, 19, and the hybrid distance between each neighbor S_{t+1} and Q.

t  | X_t   | Y_t   | N_H(S_{t+1}, Q)
0  | 1.118 | 1.073 | 0.2469
1  | 1.116 | 1.062 | 0.7192
2  | 1.005 | 0.963 | 0.9059
3  | 0.903 | 0.874 | 0.8329
4  | 1.099 | 1.048 | 0.1278
5  | 1.131 | 1.078 | 0.6813
6  | 0.969 | 1.084 | 0.4349
7  | 1.093 | 1.106 | 0.2046
8  | 1.125 | 1.104 | 0.5280
9  | 1.091 | 1.002 | 0.9089
10 | 0.943 | 0.897 | 0.7967
11 | 1.138 | 1.099 | 0.1120
12 | 1.149 | 1.137 | 0.2153
13 | 1.155 | 1.136 | 0.2248
14 | 1.158 | 1.130 | 0.2697
15 | 1.161 | 1.097 | 0.6371
16 | 1.065 | 0.990 | 0.9207
17 | 0.944 | 0.870 | –
18 | 1.140 | 1.066 | –
19 | 1.174 | 1.114 | –
and according to Eq. (4), the neighbor set S contains the following 17 neighbors: S_1, S_2, …, S_17. The hybrid distance between Q and each neighbor is listed in Table 2. The 6 nearest neighbors of Q are S_12, S_5, S_8, S_13, S_14, and S_1, with ending indices t_1 = 12, t_2 = 5, t_3 = 8, t_4 = 13, t_5 = 14, and t_6 = 1. By Eqs. (6) and (7), we have

Z_1 = (X_11, X_4, X_7, X_12, X_13, X_0)^T,
Z_2 = (Y_11, Y_4, Y_7, Y_12, Y_13, Y_0)^T,
Z_3 = (X_12, X_5, X_8, X_13, X_14, X_1)^T,
Z_4 = (Y_12, Y_5, Y_8, Y_13, Y_14, Y_1)^T,    (17)
H = (Y_14, Y_7, Y_10, Y_15, Y_16, Y_3)^T.
Firstly, we compute MI(Z_1; H) = 0.2500, MI(Z_2; H) = 0.2083, MI(Z_3; H) = 0.2083, and MI(Z_4; H) = 0.4583. Since MI(Z_4; H) is the largest, Z_4 is selected. Next, we compute MI({Z_4, Z_1}; H) = 0.1667, MI({Z_4, Z_2}; H) = 0.1250, and MI({Z_4, Z_3}; H) = 0.2917. Since MI({Z_4, Z_3}; H) is the largest, Z_3 is selected. Then we stop. Note that Z_3 and Z_4 correspond to the third element and the fourth element, respectively, in each nearest neighbor. Therefore, we extract the following 6 training patterns:

x_1 = [X_12, Y_12], y_1 = Y_14,
x_2 = [X_5, Y_5], y_2 = Y_7,
x_3 = [X_8, Y_8], y_3 = Y_10,
x_4 = [X_13, Y_13], y_4 = Y_15,
x_5 = [X_14, Y_14], y_5 = Y_16,
x_6 = [X_1, Y_1], y_6 = Y_3.    (18)
Model derivation. We use the extracted training patterns {(x_i, y_i)}_{i=1}^6 to derive the forecasting model. Firstly, we create the modified LS-SVM of Eq. (9). The weights g_j, j = 1, 2, …, 6, are calculated by Eq. (10). For example, g_1 is

g_1 = exp(−(g_{1,1} + g_{2,1})/2) = 0.7326

where

g_{1,1} = N_H(x, x_1) = 0.1120

and

g_{2,1} = |y_1 − median(y)| / max_j{|y_j − median(y)|} = 0.5103.

Then we solve the Lagrange multipliers in Eq. (14) and obtain the optimal forecasting model of Eq. (16). Finally, we apply x = [X_19, Y_19] to the forecasting model, and the forecast of Y_21 is

Ŷ_21 = 1.0917.
5. Experimental results
We present the results of several experiments with real-world time series datasets to demonstrate the effectiveness of our proposed Local Modeling Approach (LMA). We also compare it with other existing methods in this section. To evaluate the performance of each method, several metrics are adopted [49], including mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE), which are defined below:

MAE = (1/N_t) Σ_{i=1}^{N_t} |y_i − ŷ_i|,    (19)
MSE = (1/N_t) Σ_{i=1}^{N_t} (y_i − ŷ_i)²,    (20)
RMSE = √( (1/N_t) Σ_{i=1}^{N_t} (y_i − ŷ_i)² ),    (21)
MAPE = (100/N_t) Σ_{i=1}^{N_t} |y_i − ŷ_i| / y_i    (22)

where N_t is the number of testing patterns, y_i is the desired value, and ŷ_i is the forecast value.
Table 3
Some statistics of the training data of the datasets used.

Dataset            | Min     | Max     | Mean    | Median | Mode   | Std    | Range
Poland Electricity | 0.6385  | 1.3490  | 0.9885  | 0.9864 | 0.6385 | 0.1632 | 0.7105
Laser              | 2       | 255     | 59.893  | 43     | 14     | 49.279 | 253
Sunspot            | 0       | 154.400 | 43.472  | 37.600 | 11     | 34.267 | 154.400
EUNITE LoadMax     | 464     | 876     | 670.790 | 677    | 799    | 93.541 | 412
EUNITE Temperature | −14.225 | 26.525  | 8.802   | 8.938  | 20.250 | 8.686  | 40.750
Table 4
Performance of one-step forecasting on Poland Electricity.

Method   | MAE             | RMSE
ANFIS    | 0.2985          | 0.5225
NN-MAT   | 0.0639          | 0.0835
ARMA     | 0.3744          | 0.4807
Sorjamaa | 0.1856          | 0.3351
LMA      | 0.0245 (0.0194) | 0.0454 (0.0311)
for which no training errors are listed. The parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA, and the best result on each metric is in boldface in this table. As can be seen from the table, LMA performs better than the others in both MAE and RMSE. Fig. 2 shows the forecasted results obtained by LMA and ARMA. For clarity, we only show the first 100 forecasted values in this figure. From this figure, we can see that LMA provides a better match than ARMA.
5.1.2. Laser dataset
The Laser dataset [40] contains a series of chaotic laser data obtained in a physics laboratory experiment. Only one variable, the output, is involved in this dataset. The length of the time series is 10,093, but we use only the first 5700 observations. The first 5600 observations are used for training and the remaining are used for testing. Some statistics of the training data are shown in Table 3. The forecasted results obtained by LMA and ANFIS are shown pictorially in Fig. 3. For LMA, we set k = 150, q = 9, and d = 7. The hyper-parameters set automatically in the LS-SVM program for LMA are γ = 2.4 × 10⁷ and σ² = 700. These settings are determined in the training phase and are not changed in the testing phase. Table 5 shows the MAE and RMSE obtained by different methods. The parenthesized numbers in the second row of the LMA entry indicate the training errors of LMA. Again, LMA performs better than the others in both MAE and RMSE.
Table 5
Performance of one-step forecasting on Laser.

Method   | MAE             | RMSE
ANFIS    | 5.7257          | 14.4748
NN-MAT   | 3.4300          | 10.3494
ARMA     | 14.8915         | 27.5397
Sorjamaa | 1.8288          | 4.2405
LMA      | 0.7236 (0.6031) | 0.9570 (0.8754)
Table 6
Performance of one-step forecasting on Sunspot.

Method                          | MAE          | MSE             | MAPE
ANFIS                           | 16.31        | 613.87          | 36.10
ARIMA in [70]                   | 13.03        | 306.08          | 30.69
Multiple ANN [2]                | 12.78        | 280.48          | –
ARIMA & Neural Network [70]     | 12.18        | 280.16          | –
ANN (p, d, q) [36]              | 11.45        | 234.21          | –
Generalized ANNs/ARIMA [37]     | –            | 218.64          | –
LMA                             | 10.29 (7.88) | 215.71 (101.97) | 20.92 (15.37)
Table 7
Some statistics of the TAIEX related data in 2004.

       | NASDAQ  | Dow Jones | TAIEX
Min    | 1752    | 9758      | 5317
Max    | 2140    | 10,738    | 7034
Mean   | 1958.95 | 10263.81  | 6057.18
Median | 1960.26 | 10244.93  | 5949.26
Mode   | 1969.99 | 9757.81   | 5316.87
Std    | 81.59   | 224.80    | 458.25
Range  | 388     | 980       | 1717
Table 8
RMSE of one-step forecasting on TAIEX.

With Dow Jones
Method           | 1999           | 2000            | 2001            | 2002          | 2003          | 2004          | Average
ANFIS            | 156.72         | 280.85          | 521.64          | 1871.23       | 1805.04       | 192.69        | 799.69
ARMAX            | 102.53         | 126.41          | 115.36          | 63.19         | 52.24         | 53.81         | 85.76
NN-MAT           | 103.78         | 132.46          | 116.27          | 67.24         | 55.01         | 54.10         | 88.14
Chen & Chang [9] | 101.97         | 148.85          | 113.70          | 79.81         | 64.08         | 82.32         | 98.46
Chen et al. [10] | 115.47         | 127.51          | 121.98          | 74.65         | 66.02         | 58.89         | 94.09
Chen & Chen [11] | 99.87          | 122.75          | 117.18          | 68.45         | 53.96         | 52.55         | 85.79
T2NFS [43]       | 97.30          | 124.28          | 110.01          | 60.05         | 52.24         | 51.80         | 82.61
Chen et al. [12] | 102.34         | 131.25          | 113.62          | 65.77         | 52.23         | 56.16         | 86.89
LMA              | 100.73 (91.48) | 123.65 (110.83) | 117.59 (103.77) | 62.96 (48.36) | 52.17 (45.16) | 53.44 (44.73) | 85.09

With NASDAQ
Method           | 1999           | 2000            | 2001            | 2002          | 2003          | 2004          | Average
ANFIS            | 437.58         | 3708.04         | 425.19          | 601.06        | 3323.86       | 328.36        | 1470.68
ARMAX            | 102.37         | 113.06          | 115.79          | 63.74         | 55.61         | 53.33         | 83.98
NN-MAT           | 103.71         | 131.31          | 118.17          | 67.89         | 53.59         | 53.38         | 88.01
Chen & Chang [9] | 123.64         | 131.10          | 115.08          | 73.06         | 66.36         | 60.48         | 94.95
Chen et al. [10] | 119.32         | 129.87          | 123.12          | 71.01         | 65.14         | 61.94         | 95.07
Chen & Chen [11] | 102.60         | 119.98          | 114.81          | 69.07         | 53.16         | 53.57         | 85.53
T2NFS [43]       | 99.63          | 121.94          | 108.68          | 64.83         | 51.05         | 51.81         | 82.99
Chen et al. [12] | 102.11         | 131.30          | 113.83          | 66.45         | 52.83         | 54.17         | 86.78
LMA              | 102.84 (93.45) | 123.56 (113.32) | 115.26 (109.23) | 65.59 (56.16) | 51.59 (43.80) | 51.44 (42.10) | 85.05

With Dow Jones and NASDAQ
Method           | 1999           | 2000            | 2001            | 2002          | 2003          | 2004          | Average
ANFIS            | 434.02         | 3490.57         | 656.78          | 1340.88       | 574.10        | 201.52        | 1116.31
ARMAX            | 106.57         | 121.10          | 113.40          | 66.88         | 52.89         | 51.82         | 85.44
NN-MAT           | 109.29         | 132.98          | 114.83          | 63.73         | 55.65         | 51.50         | 88.00
Chen & Chang [9] | 106.34         | 130.13          | 113.33          | 72.33         | 60.29         | 68.07         | 91.75
Chen et al. [10] | 116.64         | 123.62          | 123.85          | 71.98         | 58.06         | 57.73         | 91.98
Chen & Chen [11] | 101.33         | 121.27          | 114.48          | 67.18         | 52.72         | 52.27         | 84.88
T2NFS [43]       | 99.04          | 120.90          | 103.84          | 58.10         | 52.49         | 51.73         | 81.02
LMA              | 102.57 (94.29) | 121.01 (114.23) | 114.32 (110.27) | 56.95 (54.10) | 51.95 (45.32) | 50.71 (46.98) | 82.92

(The parenthesized numbers are the training errors of LMA. The Dow Jones and NASDAQ series contain 32 zeros in 1999, 33 in 2000, and 11 in 2001.)
to December are used for testing. Some statistics of the training data for 2004 are shown in Table 7. Table 8 shows the RMSE obtained by different methods. In this table, we also list the results of Chen and Chang [9], Chen and Chen [11], T2NFS [43], and Chen et al. [10,12]. The results under three conditions are listed in Table 8: with Dow Jones as the additional input, with NASDAQ as the additional input, and with Dow Jones and NASDAQ as the additional inputs. For LMA, we set q = 14, d = 4, and k = 100. The hyper-parameters set automatically in the LS-SVM program for LMA are γ = 2 × 10⁶ to 6 × 10⁶ and σ² = 1000 to 3000. These settings are determined in the training phase and are not changed in the testing phase. Note that the parenthesized numbers in the second row of the LMA entries indicate the training errors of LMA. We can find that our method has a better RMSE in 2002, 2003, and 2004 than most of the other methods. However, LMA performs worse than [11,43] in 1999, 2000, and 2001. The reason is that in these years, many zeros are contained in Dow Jones and NASDAQ due to their being asynchronous with TAIEX. If Dow Jones and NASDAQ were closed while TAIEX was open on some day, a zero was recorded in Dow Jones and NASDAQ for that day. As can be seen, there are 32 zeros in these two datasets in 1999. Because our method is influenced by the obtained nearest neighbors, these zeros are harmful to our method. Note that the zeros also considerably deteriorate the performance of ANFIS. However, our method still performs better than [9,10] in 1999, 2000, and 2001.
Table 9
MAE of multi-step forecasting on Laser.

Steps   | ANFIS   | NN-MAT  | ARMA    | Sorjamaa | LMA
2-step  | 9.1072  | 7.0461  | 17.6998 | 1.8294   | 1.7388 (0.6298)
3-step  | 7.0787  | 6.7370  | 17.0023 | 3.1721   | 2.6654 (0.7722)
4-step  | 7.9532  | 9.5677  | 19.9313 | 4.5941   | 2.4529 (0.7682)
5-step  | 8.2511  | 7.3968  | 21.0612 | 3.8067   | 2.6971 (0.8238)
6-step  | 8.7864  | 7.5659  | 21.8979 | 4.6517   | 2.7371 (0.7889)
7-step  | 9.4033  | 9.5344  | 22.1063 | 5.0789   | 2.1663 (0.7915)
8-step  | 9.4587  | 9.5561  | 23.6841 | 8.3264   | 2.4619 (0.7720)
9-step  | 10.9792 | 10.9181 | 28.5120 | 6.4187   | 3.3041 (0.8603)
10-step | 12.7650 | 10.0274 | 29.4743 | 6.9559   | 3.7495 (0.8772)
11-step | 14.7686 | 9.5501  | 28.9017 | 19.0988  | 4.9000 (1.0343)
12-step | 13.1312 | 13.8144 | 30.7653 | 17.5508  | 4.9966 (1.0965)
Table 10
RMSE of multi-step forecasting on Laser.

Steps   | ANFIS   | NN-MAT  | ARMA    | Sorjamaa | LMA
2-step  | 22.5924 | 15.0670 | 33.2179 | 4.6673   | 4.2313 (0.9089)
3-step  | 17.9022 | 18.2621 | 33.2998 | 10.7440  | 7.4607 (1.3115)
4-step  | 20.1061 | 24.4235 | 36.2056 | 15.7759  | 6.5999 (1.3107)
5-step  | 19.5676 | 19.8554 | 37.9758 | 11.0067  | 8.3976 (1.4436)
6-step  | 18.4178 | 17.9245 | 38.6497 | 12.2136  | 7.8629 (1.3684)
7-step  | 20.5050 | 19.6204 | 38.9259 | 12.2059  | 5.9445 (1.3478)
8-step  | 21.3847 | 19.1559 | 39.2201 | 14.3293  | 5.1336 (1.3172)
9-step  | 24.8706 | 20.8771 | 44.2355 | 14.3791  | 7.7728 (1.4507)
10-step | 24.7307 | 24.2657 | 45.2404 | 14.8786  | 9.9128 (1.3533)
11-step | 29.2389 | 22.5536 | 43.4734 | 32.3007  | 12.6468 (1.6861)
12-step | 27.3947 | 30.1944 | 43.9369 | 29.2913  | 12.7318 (1.8214)
Table 11
Performance of multi-step forecasting on EUNITE.

Method                            | MAPE        | MAX ERROR     | # of inputs
ARMAX                             | 3.69        | 67.56         | 20
ANFIS                             | 4.54        | 94.74         | 5
Backpropagation [17]              | 5.05        | 111.89        | 49
Gain Scaling [17]                 | 4.87        | 137.78        | 49
Gain Scaling with SS [17]         | 2.19        | 55.95         | 49
Gain Scaling with CIS and SS [17] | 2.77        | 70.99         | 20
Early Stopping [17]               | 1.95        | 40.28         | 49
Early Stopping with SS [17]       | 2.13        | 50.90         | 49
Early Stopping with CIS and SS [17] | 2.87      | 71.26         | 20
Extended Bayesian Training [17]   | 1.75        | 55.64         | 40
L2-SVM with CV [17]               | 3.52        | 60.39         | 49
L2-SVM with CIS and CV [17]       | 2.87        | 67.17         | 20
L2-SVM Gradient Descent [17]      | 2.07        | 59.78         | 45
Benchmark [16]                    | 1.98        | 51.42         | 12
LMA                               | 1.71 (0.14) | 40.99 (73.28) | –
Table 12
Performance of LMA, with different settings of k, on Laser.

k    | 110    | 120    | 130    | 140    | 150    | 160    | 170    | 180    | 190    | 200
MAE  | 1.0775 | 1.0344 | 1.0212 | 0.9904 | 0.7236 | 0.9348 | 0.9376 | 0.9766 | 1.0046 | 1.0804
RMSE | 1.9369 | 1.7951 | 1.9197 | 1.6181 | 0.9570 | 1.5870 | 1.6058 | 1.5502 | 1.6440 | 1.9369
One way is to get help from experts who have domain knowledge about the underlying application. Another way is to utilize some optimization method, e.g., GA. However, optimization may take time.
6.2. Comparison among different measures of similarity
We adopt the hybrid distance to measure the similarity between two sequences in our approach. We show here that using the hybrid distance to find the nearest neighbors is a good idea.
Suppose the distance between two sequences is measured by the time index, i.e., two sequences with close time indices are considered to be close to each other. For example, consider the example in Section 4. If measured by the time index, the 6 nearest neighbors of Q would be S_17, S_16, S_15, S_14, S_13, and S_12. Table 13 shows the MAE and RMSE obtained by LMA on Laser, with k varying in the range between 130 and 200, where the k nearest neighbors are determined by the time index. As before, q and d are fixed to be 9 and 7, respectively. We can see that the performance is poor in both MAE and RMSE. This indicates that the time index is less appropriate than the hybrid distance in determining the nearest neighbors of Q for LMA.
Alternatively, the similarity between two sequences can be measured by only the Euclidean distance between the original sequences. Let us give an example to explain why the hybrid distance is more effective. For simplicity, suppose only one variable Y is involved. Consider three sequences A_1, A_2, and A_3:

A_1 = (5, 4, 3, 2, 1),
A_2 = (5, 4, 2, 0, −1),
A_3 = (5, 3, 1, 2, 3).

We have

N_E(A_1, A_2) = N_E(A_1, A_3) = 3/4.69 = 0.6397.    (23)

If we measure the similarity based on N_E(A_1, A_2) and N_E(A_1, A_3), then A_2 and A_3 are equally similar to A_1. Now we consider the differential sequences:

F_1 = (−1, −1, −1, −1),
F_2 = (−1, −2, −2, −1),
F_3 = (−2, −2, 1, 1)

and have

N_E(F_1, F_2) = 1.414/3.7417 = 0.3779,
N_E(F_1, F_3) = 3.162/3.7417 = 0.8450.    (24)

Therefore, we have

N_H(A_1, A_2) = (0.6397 + 0.3779)/2 = 0.5088,
N_H(A_1, A_3) = (0.6397 + 0.8450)/2 = 0.7424.    (25)

By taking trends into account, we have N_H(A_1, A_2) < N_H(A_1, A_3) and conclude that A_2 is closer to A_1. This is reasonable, since both A_1 and A_2 have a down trend while A_3 has a down-then-up trend. Table 14 shows the MAE and RMSE of one-step forecasting on Laser obtained by LMA with different similarity measures. Note that we set k = 150, q = 9, and d = 7. In this table, ED-SEQ stands for the Euclidean distance between original sequences, ED-FOD for the Euclidean distance between differential sequences, and HD-SEQ for the hybrid distance. As can be seen, the hybrid distance works more effectively than the other two alternatives. Note that in this work, we simply use the equal weight 1/2 in Eq. (5). However, it can be specified by a domain expert or learned through an optimization method, e.g., GA.
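The example above can be checked with unnormalized Euclidean distances; here N_E(A_1, A_2) and N_E(A_1, A_3) coincide, so the ranking is decided entirely by the differential term and is unaffected by the normalization constants:

```python
import numpy as np

A1 = np.array([5, 4, 3, 2, 1], dtype=float)
A2 = np.array([5, 4, 2, 0, -1], dtype=float)
A3 = np.array([5, 3, 1, 2, 3], dtype=float)

def diff(a):
    return a[1:] - a[:-1]

dA2 = np.linalg.norm(A1 - A2)              # 3.0
dA3 = np.linalg.norm(A1 - A3)              # 3.0 -- tied on the original sequences
dF2 = np.linalg.norm(diff(A1) - diff(A2))  # sqrt(2), about 1.414
dF3 = np.linalg.norm(diff(A1) - diff(A3))  # sqrt(10), about 3.162
print((dA2 + dF2) / 2 < (dA3 + dF3) / 2)   # True: A2 is closer once trends count
```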
Table 13
Performance of LMA, with distance measured by time index, on Laser.

k    | 130     | 140     | 150     | 160     | 170     | 180     | 190     | 200
MAE  | 18.2506 | 18.1728 | 12.3552 | 14.5660 | 10.6852 | 10.1166 | 10.2233 | 10.4542
RMSE | 26.6031 | 27.5728 | 26.2939 | 29.1388 | 28.2920 | 27.0055 | 29.9817 | 30.0145
Table 14
Performance of one-step forecasting on Laser, with different similarity measures.

     | ED-SEQ | ED-FOD | HD-SEQ
MAE  | 0.9668 | 1.4750 | 0.7236
RMSE | 1.5667 | 3.0873 | 0.9570
Table 15
Performance of LMA, with different settings of d, on Laser.

d    | 4      | 5      | 6      | 7      | 8      | 9      | 10 (all)
MAE  | 1.2883 | 0.9426 | 0.9479 | 0.7236 | 0.8869 | 0.9609 | 1.0468
RMSE | 2.4091 | 1.6350 | 1.4245 | 0.9570 | 1.4860 | 1.9711 | 2.1052
Table 16
Performance of LMA, with different settings of q, on Poland.

q    | 6      | 7      | 8      | 9      | 10     | 11     | 12
MAE  | 0.0264 | 0.0251 | 0.0249 | 0.0245 | 0.0256 | 0.0258 | 0.0259
RMSE | 0.0477 | 0.0462 | 0.0460 | 0.0454 | 0.0480 | 0.0487 | 0.0507
Table 17
Comparison between LS-SVM and modified LS-SVM on Laser and Poland.

                   Laser              Poland
                   MAE      RMSE      MAE      RMSE
LS-SVM             0.8984   1.5171    0.0271   0.0532
Modified LS-SVM    0.7236   0.9570    0.0245   0.0454
Table 18
Comparison between LS-SVM and modified LS-SVM on EUNITE.

                   MAPE     MAX ERROR
LS-SVM             1.7653   45.6756
Modified LS-SVM    1.7100   40.9900
[29] Z. Huang, M.-L. Shyu, Recent trends in information reuse and integration, in: Long-term Time Series Prediction using k-NN based LS-SVM Framework with Multi-value Integration, Springer, Vienna, 2012, pp. 191–209 (Chapter 9).
[30] K.-C. Hung, K.-P. Lin, Long-term business cycle forecasting through a potential intuitionistic fuzzy least-squares support vector regression approach, Inform. Sci. 224 (2013) 37–48.
[31] J.-S.R. Jang, Fuzzy modeling using generalized neural networks and Kalman filter algorithm, in: Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), 1991, pp. 762–767.
[32] J.-S.R. Jang, ANFIS: adaptive-network based fuzzy inference systems, IEEE Trans. Syst. Man Cybernet. 23 (3) (1993) 665–685.
[33] Z. Ji, B. Wang, S. Deng, Z. You, Predicting dynamic deformation of retaining structure by LSSVR-based time series method, Neurocomputing 137 (2014) 165–172.
[34] N. Kaneko, S. Matsuzaki, M. Ito, H. Oogai, K. Uchida, Application of improved local models of large scale database-based online modeling to prediction of molten iron temperature of blast furnace, ISIJ Int. 50 (7) (2010) 939–945.
[35] H. Kantz, Nonlinear Time Series Analysis, Cambridge University Press, 2003.
[36] M. Khashei, M. Bijari, An artificial neural network (p, d, q) model for time series forecasting, Expert Syst. Appl. 37 (1) (2010) 479–489.
[37] M. Khashei, M. Bijari, Which methodology is better for combining linear and nonlinear models for time series forecasting?, J. Ind. Syst. Eng. 4 (4) (2011) 265–285.
[38] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (6) (2004) 066138.
[39] A. Kusiak, H. Zheng, Z. Song, Short-term prediction of wind farm power: a data mining approach, IEEE Trans. Energy Convers. 24 (1) (2009) 125–136.
[40] Laser Time Series Data Set. <http://www-psych.stanford.edu/andreas/Time-Series/SantaFe.html>.
[41] S.-J. Lee, C.-S. Ouyang, A neuro-fuzzy system modeling with self-constructing rule generation and hybrid SVD-based learning, IEEE Trans. Fuzzy Syst. 11 (3) (2003) 341–353.
[42] W. Li, Mutual information functions versus correlation functions, J. Stat. Phys. 60 (5–6) (1990) 823–837.
[43] C.-F. Liu, C.-Y. Yeh, S.-J. Lee, Application of type-2 neuro-fuzzy modeling in stock price prediction, Appl. Soft Comput. 12 (4) (2012) 1348–1358.
[44] LS-SVM Program. <http://www.esat.kuleuven.be/sista/lssvmlab/>.
[45] J. McNames, A nearest trajectory strategy for time series prediction, in: Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, K.U. Leuven, Belgium, 1998, pp. 112–128.
[46] J. McNames, B. Widrow, J.H. Friedman, J.P. How, Innovations in Local Modeling for Time Series Prediction, 1999. <http://web.cecs.pdx.edu/mcnames/Publications/Dissertation.pdf>.
[47] A. Miranian, M. Abdollahzade, Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction, IEEE Trans. Neural Netw. Learn. Syst. 24 (2) (2013) 207–218.
[48] NASDAQ Web Site. <http://www.nasdaq.com/>.
[49] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 2004.
[50] C.-S. Ouyang, W.-J. Lee, S.-J. Lee, A TSK-type neuro-fuzzy network approach to system modeling problems, IEEE Trans. Syst. Man Cybernet. Part B: Cybernet. 35 (4) (2005) 751–767.
[51] Poland Data Set. <http://research.ics.aalto.fi/eiml/datasets.shtml>.
[52] N.I. Sapankevych, R. Sankar, Time series prediction using support vector machines: a survey, IEEE Comput. Intell. Mag. 4 (2) (2009) 24–38.
[53] A. Sfetsos, C. Siriopoulos, Time series forecasting with a hybrid clustering scheme and pattern recognition, IEEE Trans. Syst. Man Cybernet. Part A 34 (3) (2004) 399–405.
[54] G. Silviu, Information Theory with Applications, McGraw-Hill, 1977.
[55] O. Song, B.S. Chissom, Fuzzy time series and its models, Fuzzy Sets Syst. 54 (3) (1993) 269–277.
[56] O. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part I, Fuzzy Sets Syst. 54 (1) (1993) 1–9.
[57] O. Song, B.S. Chissom, Forecasting enrollments with fuzzy time series – Part II, Fuzzy Sets Syst. 62 (1) (1994) 1–8.
[58] A. Sorjamaa, J. Hao, A. Lendasse, Mutual information and k-nearest neighbors approximator for time series prediction, Lect. Notes Comput. Sci. 3657 (2005) 553–558.
[59] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (16–18) (2007) 2861–2869.
[60] J.H. Stock, M.W. Watson, Introduction to Econometrics, Addison-Wesley, 2010.
[61] H. Stögbauer, A. Kraskov, S.A. Astakhov, P. Grassberger, Least-dependent-component analysis based on mutual information, Phys. Rev. E 70 (2004) 066123.
[62] M.B. Stojanovic, M.M. Bozic, M.M. Stankovic, Z.P. Stajic, A methodology for training set instance selection using mutual information in time series prediction, Neurocomputing 141 (2014) 236–245.
[63] Sunspot Data Set. <http://sidc.oma.be/sunspot-data/>.
[64] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (1–4) (2002) 85–105.
[65] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing Company, 2002.
[66] TAIEX Web Site. <http://www.tese.com.tw/en/products/indices/tsec/taiex.php>.
[67] F.E.H. Tay, L.J. Cao, Modified support vector machines in financial time series forecasting, Int. J. Forecast. 48 (1) (2002) 69–84.
[68] S.S. Torbaghan, A. Motamedi, H. Zareipour, L.A. Tuan, Medium-term electricity price forecasting, in: North American Power Symposium (NAPS) 2012, 2012, pp. 1–8.
[69] W.W.S. Wei, Time Series Analysis: Univariate and Multivariate Methods, Pearson, 2005.
[70] G.P. Zhang, Time series forecasting using a hybrid ARIMA and neural network model, Neurocomputing 50 (2003) 159–175.
[71] L. Zhang, W.-D. Zhou, P.-C. Chang, J.-W. Yang, F.-Z. Li, Iterated time series prediction with multiple support vector regression models, Neurocomputing 99 (2013) 411–422.
[72] H. Zou, Y. Yang, Combining time series models for forecasting, Int. J. Forecast. 20 (1) (2004) 69–84.