
Improving recurrent network load forecasting

T. Czernichow 1,2,3, A. Germond 1, B. Dorizzi 2, P. Caire 3

1 EPFL-DE-LRE, Switzerland; 2 INT, France (czerni@etna.int-evry.fr); 3 EDF-DER, France.

Abstract
In this article, we present a recurrent network, not fully connected, applied to the problem of load forecasting. Although many authors have pointed out that recurrent networks are able to model NARMAX processes (Non-linear Auto-Regressive Moving Average with eXogenous variables), we present a construction scheme for the MA part. In addition, we present a modification of the learning step which improves learning convergence and the accuracy of the forecast. Finally, the use of a continuous learning scheme and of a robust learning scheme, which proved necessary when using an MA part, enables us to reach a good forecast precision compared to the accuracy of the model currently in use at the utility.

1. Introduction
Load forecasting has been a key issue for decades, even more so since the economic crisis struck western countries. Adjusting production to a precisely estimated demand is very important for reducing energy prices. Many different methods have been used for load forecasting: GMDH [1], linear models and statistical methods [2], and neural networks in all their forms [3], modular, hybrid [4] and also recurrent [5]. Recurrent neural networks (RNN) are particular in the sense that they allow feedback connections. In particular, they can take into account the moving average (MA) part of a process, and hence model an ARMA process [6-8]. In this paper we will first describe the problem of load forecasting: which series we use, what the explanatory variables are, etc. In the second part, we describe how to build the MA part of the network, and the modifications we have brought to the estimation algorithm so as to reach results exploitable by a utility.

2. The problem of load forecasting


We basically deal with three kinds of series. The first one is the load, recorded for a number of years, but which cannot be used entirely because the process has changed over the years. The second is composed of the weather variables, which are supposed to partially explain the behaviour of the first series, and can be forecast up to a maximum horizon of 5 days. The last kind are the calendar variables. The key point with these is selecting the right ones, as many are available: coding of the day type (day off, day of the week), coding of the hour, or coding of the number of the day in the year. In this work, the data are the half-hourly load recorded since 1986, together with the temperature and the cloud cover, recorded every three hours since the same date. We only use the years 1989 to 1992 for the databases; this is nevertheless a 70000-pattern database, which is clearly enough to estimate a model. The load is the sum of the loads of different regions of France. The weather variables are also measured over different regions, and are summed with weights proportional to the share of the corresponding region in the total load (a sketch of this aggregation is given below). The load has a peak value of 70000 MW and shows, in France, a high degree of diversity. It may be due to the extensive use of electric heating, which creates a highly non-linear relationship with the weather variables. The cloud cover variable ranges from 0 to 8. The next section describes more thoroughly the chosen network architecture, the corresponding input variables and the learning algorithm.
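As a minimal sketch of the regional weighting just described (the regional shares and values below are hypothetical):

```python
import numpy as np

# Hypothetical shares of each region in the total national load.
load_shares = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

# Temperatures measured in each region at one time step (degrees Celsius).
regional_temp = np.array([3.5, 5.0, 7.2, 4.1, 6.8])

# National variable: sum weighted proportionally to the load shares.
weights = load_shares / load_shares.sum()
national_temp = float(np.dot(weights, regional_temp))
print(f"weighted national temperature: {national_temp:.2f} C")
```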

3. Simple Recurrent Networks


Many authors have proposed the use of local recurrences [9-12], but a general definition of a RNN would be that some entries depend on the weights. In [13] we also proposed a general framework for the description, and therefore the implementation, of Simple Recurrent Networks (SRN). In RNN, the derivatives of the outputs with respect to the weights lead to recursive equations, because of the dependencies previously mentioned, and many different algorithms exist for training these networks (a good summary may be found in the book by Hertz, Krogh and Palmer [14]). The problem is that, for complex structures, there exists no general algorithm that derives the computation of the gradient from the topology, as the back-propagation algorithm does. What we know is that computing the exact gradient will always lead, at least, to a recursive equation. We have decided to compute only an approximation of the gradient, by considering only the non-recursive part; in this case we speak of a SRN. SRN are a good framework for the introduction of knowledge into the model: if a priori knowledge exists on the relationship between variables, or even on the MA part (for example a bilinear assumption), it is very easy to build a recurrent link including this function (for example, the square of the error). In [15], the authors showed how a SRN can learn the functioning of a simple automaton. In [13] we showed through simple experiments that SRN were able to memorize past values of a time series. It is very important in the context of model estimation to use the relevant variables. It appeared through these experiments that, when dealing with time series forecasting, SRN were able to make a local search in the past to extract the relevant inputs: if $X_t$ is the only input of variable $X$, but $X_{t-3}$ is also important, then during the learning phase the weights of the recurrent layers will adjust so as to raise the importance of $X_{t-3}$. The importance of a variable is difficult to detect if its relation with the target is not linear. SRN are only able to make local searches (technically, a short-term memory), e.g. only up to $X_{t-p}$ for $p$ equal to seven or eight.
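As an illustration, a minimal Elman-style SRN forward pass might look as follows; the layer sizes and initialization are arbitrary, and in our scheme only the non-recursive part of the gradient would be used for training:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 10, 8, 1                 # arbitrary sizes
W_in  = rng.normal(0, 0.1, (n_hid, n_in))     # input -> hidden
W_ctx = rng.normal(0, 0.1, (n_hid, n_hid))    # context (past hidden) -> hidden
W_out = rng.normal(0, 0.1, (n_out, n_hid))    # hidden -> output

def forward(x_seq):
    """Run the SRN over a sequence: the context layer carries the previous
    hidden state, giving the network its short-term memory of the past."""
    h = np.zeros(n_hid)                       # context starts empty
    outputs = []
    for x in x_seq:
        h = np.tanh(W_in @ x + W_ctx @ h)
        outputs.append(W_out @ h)
    return np.array(outputs)

y = forward(rng.normal(size=(5, n_in)))       # 5 dummy time steps
print(y.shape)                                # (5, 1)
```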

Modification of the gradient step
One of the problems that occurred is that the load series is heteroskedastic, with a period of one week. This particularity disturbs the training on the high-variance parts and decreases the performance on the low-variance parts. To prevent this effect, the learning step has been adjusted for each pattern. On the learning database we estimated the variance at each half-hour h (from 1 to 336, one week), which gives a 336-element vector, named V. During learning, for each pattern, the learning rate is multiplied by a coefficient linked to the variance of the corresponding half-hour: the inverse of the estimated variance, normalized so that the coefficient equals one at the half-hour of maximum variance. In this way, a pattern with high variance has a small learning step:

$$\epsilon_t = c_t \cdot \epsilon, \qquad c_t = \frac{\max(V)}{V_{h(t)}},$$

where $h(t)$ is a function giving the half-hour of pattern $t$; the base rate $\epsilon$ is globally decreasing with a $1/t$ scheme, but is also able to grow if the error diminishes.
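A minimal sketch of this per-pattern scaling, under our reading of the formula above (names are ours):

```python
import numpy as np

def halfhour_variance(load, period=336):
    """Variance of the load at each of the 336 half-hours of the week,
    estimated on a training series aligned on week boundaries."""
    load = np.asarray(load, dtype=float)
    n_weeks = len(load) // period
    return load[: n_weeks * period].reshape(n_weeks, period).var(axis=0)

def step_coefficient(V, t):
    """c_t = max(V) / V[h(t)]: a pattern on the half-hour of maximum
    variance keeps the base step (c_t = 1), while lower-variance
    half-hours get a larger one."""
    return V.max() / V[t % len(V)]        # t % len(V) plays the role of h(t)

# Usage: effective step for pattern t, with a base rate decaying in 1/t.
V = halfhour_variance(np.random.default_rng(1).normal(5e4, 3e3, 336 * 52))
t, eps0 = 1000, 0.05
eps_t = (eps0 / (1.0 + t / 1e4)) * step_coefficient(V, t)
```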

In a second phase, we are now predicting the variance (the error) of the system at each time step, so as to both control the gradient step and provide error bounds.

Building the MA part
Starting from a pure AR model, we use the Auto-Correlation Function (ACF) and the Partial Auto-Correlation Function (PACF) graphs to decide which error lag is necessary for the modelling. For each model error, the time lag corresponding to the highest peak is included and a new estimation is made. For this application we found that using the error at times t, t-24h, ..., t-168h (one for each day lag) improved the forecast. The error is taken into account by feeding it back into the network through a recurrent link. Stability problems may occur at the beginning of training, when the error of the model is high; that is why it is necessary, in this case, to use a robust algorithm.

Robust modification
The quality of the database is crucial in model estimation [16], in the sense that a few outliers may greatly distort the shape of the estimated function. We have implemented a robust algorithm by removing a certain number of patterns from the learning database at each learning sequence (a sketch is given below). After one estimation step over the learning database, we estimate the mean and the standard deviation (std) of the error. On the next estimation step, the patterns whose previous error is higher than c*std are ignored. This stabilizes the learning and prevents distortions due to outliers. The value of c is usually around 4, but several trials may be needed to select the right value on the validation base.

Continuous learning scheme
Another improvement is the continuous learning scheme. The basic idea is to re-adjust the model as soon as new patterns become available, so that the model can track a process that changes smoothly. In our application, this scheme made it possible to capture the break in the load curve which occurred at the beginning of the nineties. During the test phase, the weights are updated with the same learning algorithm as during the learning phase; the adaptation of the weights is generally done after the presentation of each pattern of the test set. However, a learning rate $\mu$ has to be estimated separately for each pattern: if we use a constant learning rate for all the patterns of the test set, the curve of the mean square error E versus $\mu$ is the parabola presented in Figure 1. When re-learning is too strong, the estimation is damaged, and we speak of catastrophic interference.
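Returning to the robust modification, a minimal sketch of the outlier-trimming step (the function name and interface are ours):

```python
import numpy as np

def robust_mask(prev_errors, c=4.0):
    """Select the patterns to keep for the next estimation step: those
    whose error on the previous pass stayed within c standard deviations
    of the mean error; c is around 4, tuned on the validation base."""
    e = np.asarray(prev_errors, dtype=float)
    return np.abs(e - e.mean()) <= c * e.std()

# Usage: drop suspected outliers before the next pass over the database.
errors = np.random.default_rng(2).normal(0, 300, 30000)
errors[:10] = 8000                        # a few gross outliers
keep = robust_mask(errors, c=4.0)
print(f"kept {keep.sum()} of {keep.size} patterns")
```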
Fig. 1. MSE versus the learning rate $\mu$.

The estimation of the optimal learning rate is in itself a temporal problem. We have chosen to estimate it on the recent past of the series, by merging two techniques. The first is to choose the learning rate (for the current day, or hour) which gives the best result (in terms of MSE) on the day before. The second is to choose the one that gives the best result on the most recent day (or hour) with the same configuration. The configuration is coded on three binary positions: for example, if the day is off while both the day before and the day after are worked, the code will be 010. We take the mean of the two values as our learning rate.
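A minimal sketch of the configuration coding and of the merging of the two candidate rates (names are ours):

```python
def day_config(prev_off, today_off, next_off):
    """Three-bit configuration code (day before, current day, day after),
    1 meaning a day off: a day off between two worked days gives '010'."""
    return f"{int(prev_off)}{int(today_off)}{int(next_off)}"

def merged_rate(best_rate_yesterday, best_rate_same_config):
    """Mean of the two candidates described above: the rate that was best
    on the day before, and the rate that was best on the most recent day
    having the same configuration."""
    return 0.5 * (best_rate_yesterday + best_rate_same_config)

code = day_config(False, True, False)     # -> '010'
mu = merged_rate(0.02, 0.04)              # -> 0.03
```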

4. Results on a one-year data set


All the models are estimated with about 30000 patterns, and the results correspond to the best models selected on a validation database of one year. They are compared with two mean criteria and with the quantile graphs of the raw errors. The two criteria are the Mean Absolute Percentage Error,

$$\mathrm{MAPE} = \frac{100}{N}\sum_{t=1}^{N}\frac{\left|P_t - \hat{P}_t\right|}{P_t},$$

in percent, and the Root Mean Square Error,

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(P_t - \hat{P}_t\right)^2},$$

in megawatts (MW). We will also use the quantiles of the raw error in order to give an idea of its distribution. We consider two types of networks: a NARX model (model 1, Fig. 3b) and a NARMAX network (Fig. 3a). We compared the performance of the NARMAX when it includes only the error of the last half-hour (model 2) or the errors of each day of the past week (model 3). Both networks take as AR inputs the load, temperature and cloud cover at times t, t-24h and t-168h for forecasting t+1. Moreover, we have included a coding of the hour and a binary coding of the day type: 0 for a working day and 1 for a day off. The inputs carry this day-type coding for the day of the hour to forecast, for the day before and for the day after. There is also a one-bit binary coding for the special tariff.
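Written in code, the two criteria are direct transcriptions of the formulas above:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * float(np.mean(np.abs(actual - forecast) / actual))

def rmse(actual, forecast):
    """Root Mean Square Error, in the units of the series (here MW)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))
```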

NARX results
The first results with a SRN (Fig. 3b) are given only for a forecast horizon of one half-hour. For this horizon, the network is better than the EDF model (Tab. 1).

Model            MAPE
ARIMA            0.7%
SRN with CL      0.5%
SRN without CL   0.7%

Tab. 1. One-step-ahead forecast.

However, this no longer holds for larger horizons (up to one day). To get the one-day-ahead forecast, we iterate the mapping corresponding to the one-step-ahead forecast (as sketched below). The performance was then disastrous. Some input variables were badly chosen, in particular for the vacation periods, and had to be changed. Still, the accuracy was not good enough for industrial criteria. The study of the residuals showed correlations at some time lags, and no AR model taking those time lags as inputs was able to cancel the correlations. For these reasons, we have considered a NARMAX model.

Inclusion of past error lags
Figure 2 shows the first results after inclusion of the error at t-1. The performances are better but still below those of the EDF system.
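The iteration of the one-step-ahead mapping mentioned above can be sketched as follows; model_step stands in for the trained network and its interface is hypothetical:

```python
import numpy as np

def iterate_forecast(model_step, history, horizon):
    """Multi-step forecast obtained by feeding each one-step-ahead
    prediction back into the inputs as the most recent load value."""
    history = list(history)
    forecasts = []
    for _ in range(horizon):
        y_next = model_step(np.asarray(history))
        forecasts.append(y_next)
        history.append(y_next)            # the forecast becomes an input
    return np.array(forecasts)

# Usage with a dummy persistence model (same half-hour one week earlier),
# forecasting one day (48 half-hours) ahead.
persistence = lambda h: h[-336]
hist = np.random.default_rng(3).normal(5e4, 5e3, size=2 * 336)
print(iterate_forecast(persistence, hist, horizon=48).shape)   # (48,)
```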

Fig. 2. Performance of the model including only the last half-hour error: (a) quantiles of the raw error for each horizon (an x% line means that x% of the values lie below it), (b) the RMSE and (c) the MAPE for each horizon.

It is interesting to notice that for large horizons the forecast is very stable when the input or the weather variables are noised, up to a limit of 10% of the variance of the load signal. This strongly suggests that the network takes the other input variables significantly into account; it is also the reason why the iteration process (using the forecast as an input variable) yields a good forecast. In the following part, we show the final result of a network taking the error into account at eight time steps: t, t-24h, ..., t-168h. Figure 3 shows the architectures of the networks.
[Fig. 3 diagrams: (a) NARMAX: the Input layer with its context In_Cont, the Delay layer with Del_Cont, and the error units (ErrW, ErrJ1, ErrH) with Err_Cont feed the hidden layers Hidden_In, Hidden_Del and Hidden_Err, which are merged into a common Hidden layer before the Output; (b) NARX: the same structure without the error branch.]

Fig. 3. NARMAX (a) and NARX (b) networks.

The input layer has a memory, or context layer, and so have the delay layer and the error layer. Each has its own, because their memories may have a different structure and a different length in the past. The information is then merged in a hidden layer before the output layer. The size of the first hidden layer is chosen from our past experience (around eight neurons), and the second layer is sized by trials on the validation database. In the (a) part we have only drawn the error units for the last half-hour, the day before and the week before, but the network takes into account the error of each day of the past week. Figure 4 shows the ACF and PACF functions (on the same graph) for the errors made by the final model on the learning database.
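A correlogram like the one in Figure 4 can be computed with a sketch along these lines (plain sample ACF only; the PACF would follow, e.g., from the Durbin-Levinson recursion, omitted here):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation of a series for lags 1..max_lag."""
    x = np.asarray(x, float) - np.mean(x)
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# Usage on model residuals, over one week of half-hourly lags, with the
# approximate 95% significance bounds +/- 1.96 / sqrt(N).
residuals = np.random.default_rng(5).normal(0, 300, 30000)
rho = acf(residuals, max_lag=336)
bound = 1.96 / np.sqrt(len(residuals))
print((np.abs(rho) > bound).sum(), "lags outside the bounds")
```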


Fig. 4. Correlogram of the residuals on the learning database.

Although most of the peaks have disappeared, some remain, with a nevertheless reduced amplitude. The learning database contains 30000 patterns, so the significance bounds on the correlations lie below many values of the graph. This may be partly due to the repetitiveness of the information inside the base, or to its varying variance, and will have to be studied more thoroughly.
Fig. 5. Results of the final NARMAX model, including the error for each day of the previous week: (a) quantiles of the errors for each horizon up to two and a half days, and (b) the error measured along the horizon with the two criteria, the MAPE and the RMSE.

The results are better than those of models 1 and 2. The values are 1.8% for the MAPE and 1020 MW for the RMSE at a horizon of one day (starting the forecast at noon of the day before and measuring the error from 0h30 to 24h, so as to compare with the model in use at EDF). The two-days-ahead forecast gives 2.2% for the MAPE and 1240 MW for the RMSE. The system in use at EDF has a MAPE of 1.9% for a one-day-ahead horizon. Our system hence has similar performance, although it is better for the two-days-ahead horizon, and it has smaller error bounds.

Fig. 6. The MAPE for one-day-ahead forecasting, measured for each type of day.

Figure 6 shows the decomposition of the error for each type of day. The type is given by the numerical conversion of the binary coding: for example, 011 is a day off for which the day before was not off but the day after was off; its type on the graph is therefore 0x4 + 1x2 + 1x1 = type 3. Normal days are well predicted (1.6%) and the worst ones are type 7 (code 111) and type 2 (code 010). Most of the time the vacation periods are not well predicted, because of a poor understanding of the process: the explanatory variables are not well identified. It is evident that the introduction of this MA part in our model not only improved the averaged prediction errors (MAPE, RMSE), but also reduced the maximum error. This, and more generally the shrinking of the extreme parts of the quantile graph, is very important for the utility. In fact, we should measure the forecast with a weighting linked to the economic cost (itself very difficult to obtain) of an error of one MW at the time the forecast is made. In other words, the evaluation criteria that we use are not fully appropriate to the real cost of a prediction error.
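The conversion from the binary coding to the type number is simply the base-2 value of the code (a hypothetical helper):

```python
def day_type(code):
    """Type number of a 3-bit day-configuration code,
    e.g. '011' -> 0*4 + 1*2 + 1*1 = type 3."""
    return int(code, 2)

assert day_type("011") == 3 and day_type("111") == 7
```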

5. Conclusion
We have presented a methodology for building the MA part of a SRN model. It is coherent with the formalism we introduced in our previous work, and although a part of the gradient computation is neglected, the algorithm converges and gives good results. We have also shown that it is possible, when the process changes slowly, to use a continuous learning scheme during the test phase which avoids the catastrophic interference trap. Finally, the modifications brought to the standard algorithm make the system at least as good, for one-day-ahead forecasting, as the complex model in use at the utility, and even better for larger horizons.

Acknowledgement
The authors would like to thank A. Piras and A. Muñoz for interesting comments and discussions.

References
[1] T. S. Dillon, K. Morsztyn, K. Phua, Short Term Load Forecasting using Adaptive Pattern Recognition and Self Organising Techniques, presented at the 5th Power Systems Computation Conference (PSCC), Cambridge, Sept. 1975.
[2] D. W. Bunn, Short-term Forecasting: A Review of Procedures in the Electricity Supply Industry, J. Opl. Res. Soc., vol. 33, pp. 533-545, 1982.
[3] A. Piras, T. Czernichow, K. Imhof, P. Caire, Y. Jaccard, B. Dorizzi, Short Term Load Forecasting with Neural Networks: A Review, submitted to ICANN, Paris, 1995.
[4] A. Piras, A. Germond, B. Buchenel, K. Imhof, Y. Jaccard, Heterogeneous Artificial Neural Networks for Short Term Load Forecasting, IEEE PICA '95, Salt Lake City, Utah, 1995.
[5] K. Y. Lee, T. I. Choi, C. C. Ku, Short-Term Load Forecasting Using Diagonal Recurrent Neural Network, IEEE ANNPS '93, 1993.
[6] S. A. Billings, H. B. Jamaluddin, S. Chen, Properties of Neural Networks with Applications to Modelling Non-linear Dynamical Systems, Int. J. Control, vol. 55, no. 1, pp. 193-224, 1992.
[7] J. Connor, L. Atlas, Recurrent Neural Networks and Time Series Prediction, IJCNN '91, vol. I, pp. 301-306, 1991.
[8] J. T. Connor, R. D. Martin, L. E. Atlas, Recurrent Neural Networks and Robust Time Series Prediction, IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 240-254, 1994.
[9] A. D. Back, A. C. Tsoi, FIR and IIR Synapses, a New Neural Network Architecture for Time Series Modelling, Neural Computation, vol. 3, pp. 375-385, 1991.
[10] J. L. Elman, Finding Structure in Time, Cognitive Science, vol. 14, pp. 179-211, 1990.
[11] C. R. Gent, C. P. Sheppard, Predicting Time Series by a Fully Connected Neural Network Trained by Back-Propagation, Computing & Control Engineering Journal, pp. 109-112, 1992.
[12] P. Caire, G. Hatabian, C. Muller, Progress in Forecasting by Neural Networks, IJCNN, Baltimore, 1992.
[13] T. Czernichow, B. Dorizzi, P. Caire, Load Forecasting and Recurrent Networks, Neural Network World, vol. 6, pp. 895-905, 1993.
[14] J. Hertz, A. Krogh, R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.
[15] A. Cleeremans, D. Servan-Schreiber, J. L. McClelland, Finite State Automata and Simple Recurrent Networks, Neural Computation, vol. 1, pp. 372-381, 1989.
[16] D. S. Chen, R. C. Jain, A Robust Back-Propagation Learning Algorithm for Function Approximation, IEEE Trans. on Neural Networks, vol. 5, no. 3, May 1994.
