NON-LINEAR MODEL IDENTIFICATION AND STATISTICAL SIGNIFICANCE TESTS
AND THEIR APPLICATION TO FINANCIAL MODELLING

A N Burgess, London Business School, UK

ABSTRACT

We describe a methodology, based upon the statistical concept of analysis of variance (ANOVA), which can be used both for non-linear model identification and for testing the statistical significance of inputs to a neural network. We compare our model identification procedure to the established approach of correlation analysis on both linear and non-linear time-series. We describe how the significance tests can form the basis of a modelling methodology analogous to stepwise regression. Finally we consider an application of these techniques to the problem of modelling weekly returns of the FTSE 100 index.

A NON-LINEAR MODEL IDENTIFICATION PROCEDURE

Within a linear framework there are a variety of statistical tools which assist in the stages of model identification, construction and verification; for instance the well-known and established Box-Jenkins framework, see Box and Jenkins (1). The model identification stage of these techniques is typically based on correlation analysis, i.e. constructing the auto-correlation function (ACF) of the time-series. Examination of the ACF allows the modeller to identify both the order and type (autoregressive, moving average, or mixed) of the model.

When performing time-series modelling using neural networks it is still the norm to use correlation analysis as a preliminary variable selection technique. However, correlation analysis will in many cases fail to identify the existence of significant non-linear relationships. This might cause variables to be erroneously left out of later stages of modelling and result in poorer models than would be the case had a more powerful identification procedure been used.

Both linear regression models and neural networks are commonly fitted by minimising mean squared error. They provide an estimate of the mean value of the output variable given the current values of the input variables, i.e.
a conditional expectation. This insight allows us to realise why measures of dependence which have been developed for linear models are not always suitable for non-linear models. In a broad sense, a linear relationship exists if and only if the conditional expectation either consistently increases or consistently decreases as we increase the value of the independent variable. A non-linear relationship, however, only requires that the conditional expectation varies as we increase the independent variable. The class of functions which would thus be missed by a linear measure but identified by a suitable non-linear one includes all symmetric functions, all periodic functions (e.g. sinewave) and many others besides. We propose a measure of the degree to which the conditional expectation of the dependent variable varies with a given set of independent variables but which imposes no condition that this variation should be of a particular form.

Analysis of variance or ANOVA

The ANOVA technique is a standard statistical technique, usually used for analysis of categorical independent variables, which divides a sample into g groups and then tests to see if the differences between the groups are statistically significant. It does this by comparing the variability within the individual groups to the variability between the different groups. Firstly, the total variability within the groups is calculated:

SSW = Σ_i ( y_i − ȳ_g(i) )²    (1)

where g(i) is the group to which the ith pattern belongs and ȳ_j is the mean of group j. With g groups we lose g degrees of freedom in estimating the group means, so we can estimate the variance of the data by SSW/(n−g). Secondly, the variability between the groups is calculated:

SSB = Σ_j n_j ( ȳ_j − ȳ )²    (2)

where n_j is the number of observations in group j and ȳ is the sample mean. Here we lose one degree of freedom for using the sample mean. Thus we can also use SSB to estimate the true variance of the data by dividing by (g−1).
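As a minimal illustration (not part of the original paper; the function name and the use of numpy are our own choices), the two variability measures just described, and the corresponding variance estimates, can be computed as follows:

```python
import numpy as np

def within_between(y, groups, g):
    """SSW and SSB as described above, returned as the two variance estimates
    SSW/(n-g) and SSB/(g-1)."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    n = len(y)
    grand_mean = y.mean()
    ssw = 0.0  # total variability within the groups
    ssb = 0.0  # variability between the groups
    for j in range(g):
        yj = y[groups == j]
        if len(yj) == 0:
            continue
        ssw += ((yj - yj.mean()) ** 2).sum()
        ssb += len(yj) * (yj.mean() - grand_mean) ** 2
    return ssw / (n - g), ssb / (g - 1)
```

Under the null hypothesis that all groups are drawn from the same distribution, both returned quantities estimate the same underlying variance.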
If the different groups represent samples from the same underlying distribution then SSW and SSB are simply dependent on the underlying true variance. Adjusting for degrees of freedom, each can be used as an estimate of the variance and the ratio:

F = ( SSB / (g−1) ) / ( SSW / (n−g) )    (3)

follows an F distribution with (g−1, n−g) degrees of freedom. Under the null hypothesis that all groups are drawn from the same distribution the F ratio will be below the 10% critical value 9 times out of ten and below the 1% critical value 99 times out of a hundred, etc. However if a pattern exists then the between-group variation will be increased. This will cause a higher F ratio and, if the pattern is sufficiently strong, lead to the rejection of the null hypothesis.

Testing a single variable

Thus we can perform an ANOVA test to establish whether the variance in the conditional expectation of y given different values of x is statistically significant. The first step is to choose g, the number of groups; typically this would be in the range 3-10. Following this, each observation is allocated to the appropriate group by dividing the continuous range of the original variable into g non-overlapping regions. For a normally distributed variable, using boundary values which correspond to (1..g−1)/g of the cumulative normal distribution will cause the number in each group to be approximately the same. E.g. let X be normally distributed with mean 10 and standard deviation 5, and let g = 4; the z values for 1/4, 2/4 and 3/4 are −0.675, 0 and 0.675. These correspond to X values of 6.625, 10 and 13.375. Group 1 consists of all those observations for which X ≤ 6.625; group 2 is those for which 6.625 < X ≤ 10; group 3 those where 10 < X ≤ 13.375; and group 4 those where X > 13.375. The mean value of the dependent variable within each group can then be computed. Under the null hypothesis that the independent variable contains no useful information about the dependent variable, the F ratio (SSB/(g−1))/(SSW/(n−g)) follows an F distribution with (g−1, n−g) degrees of freedom.
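The grouping step and the resulting test can be sketched in Python (a sketch, not from the paper; `normal_group_labels` is our own hypothetical helper name, we assume scipy for the normal quantiles, and scipy's one-way ANOVA is used for the F ratio, which implements the same computation as above):

```python
import numpy as np
from scipy import stats

def normal_group_labels(x, g):
    """Allocate each observation to one of g non-overlapping regions whose
    boundaries are the (1..g-1)/g quantiles of a normal fitted to x."""
    x = np.asarray(x, dtype=float)
    boundaries = x.mean() + x.std() * stats.norm.ppf(np.arange(1, g) / g)
    return np.searchsorted(boundaries, x)  # group labels 0..g-1

# As in the worked example above: for X ~ N(10, 5) and g = 4 the exact
# boundaries are approximately 6.625, 10 and 13.375.
rng = np.random.default_rng(1)
x = rng.normal(10, 5, size=1000)
y = (x - 10) ** 2 + rng.normal(size=1000)  # symmetric: invisible to the ACF

labels = normal_group_labels(x, 4)
f, p = stats.f_oneway(*(y[labels == j] for j in range(4)))
# A large F and small p flag the non-linear dependence of E[y|x] on x,
# which a linear correlation measure would miss.
```

The group labels are exactly the allocation into g non-overlapping regions described above; the rest is a standard one-way ANOVA.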
Testing sets of two or more independent variables

We can also test sets of variables simultaneously. For instance, with two variables the variation between the groups can be broken down into one component due to the first variable, one component due to the second variable and a third component due to the interaction between the two, i.e.

SSB_total = SSB_x1 + SSB_x2 + SSB_(x1,x2)

with degrees of freedom, for r groups per variable:

r² − 1 = (r−1) + (r−1) + (r−1)²

Thus we can use this approach to test directly for either positive or negative interaction effects between variables.

RESULTS ON TEST DATA

In order to motivate our approach we compare it to correlation analysis on both linear and non-linear benchmark time series.

Linear relationships

Linearly autocorrelated series were constructed and analysed using the two approaches. In order to obtain a function which is comparable with the ACF we define the conditional expectation variation function (CEVF) to equal √((F−1)/n). Sample results are shown in figure 1. The overall shape of the CEVF is very similar to that of the ACF; the scale and decay rate differ slightly but the AR(1) nature of the process leaves a similar and consistent signature.

Non-linear relationships

A further set of benchmark series were constructed so as to exhibit first-order non-linear autocorrelation of degree r, where e(t+1) is normal and i.i.d. The parameters α, β and γ were chosen so as to make the function pathological to linear analysis. As for the linear case, a thousand different series of 1000 observations each were generated at each level of correlation from 0.1 through to 0.9 in steps of 0.2. Average ACFs and CEVFs for a sample of these series are shown in figure 2. Unlike the ACF, the CEVF clearly picks out a significant pattern in the data. The pattern is consistent across different levels of r and exhibits a faster decay than with the linear series.
This is perhaps due to the noise-enhancing chaotic effect of the non-linear relationship.

USING PARTIAL F-TESTS TO MEASURE THE STATISTICAL SIGNIFICANCE OF NEURAL NETWORKS

In this section, we propose the use of partial F-tests to measure the statistical significance of input variables, hidden units and weights within a neural network model.

The partial F test

A normal F test, which might be referred to as a full F test, tests the null hypothesis that the variance explained by a model is no more than would be accounted for purely by fitting the degrees of freedom in the model to random data. If the null hypothesis is true then the F-ratio:

F = ( (SSV − SSE) / k ) / ( SSE / (n − k − 1) )

where SSV = original variation, SSE = sum squared error of model, and k = number of parameters, should follow an F distribution with (k, n−k−1) degrees of freedom and hence the actual F-ratio can be tested for statistical significance. A partial F test is simply one in which a part of the model is tested rather than the whole model. For instance, it is relatively easy to perform partial F tests on subsets of the variables within a multiple linear regression; for details see Weisberg (2).

Applying partial F tests to neural networks

With neural networks there is a problem in precisely estimating the degrees of freedom: the effective degrees of freedom of the model may not be the same as the total number of parameters in the model, see Moody (3). This might be caused by inefficiencies in the learning procedure or by non-independence of model parameters. The problem can be particularly acute for neural network models where factors such as early stopping and weight decay are introduced into the learning procedure in a conscious effort to prevent the network fully exploiting its potential degrees of freedom. The usually quoted motivation for such an approach is the suggestion, via Occam's Razor, that simpler models are more likely to generalise well.
However, from a statistical viewpoint, the objective can be more concisely stated as trying to maximise the effective F ratio of the network. Thus there is a problem in estimating the degrees of freedom contained in a particular neural network, or a given portion of a network. This is a field worthy of research in its own right. The approach we choose to adopt, however, is one which attempts to solve this problem indirectly and in two stages. The first stage is to conduct F tests on sections of the entire network to identify which inputs, hidden units and connections play a statistically significant role in the model. The second stage is to use the results of this analysis to reduce the model to a minimal form in which the insignificant components are removed from the model. At this point we can test the significance of the entire model, whilst at the same time having achieved a parsimonious model which is more open to interpretation than the original network. In order to perform F tests on part of a neural network model we must be able to measure two things: the amount of variance explained by the partial model and the number of degrees of freedom which the partial model contains. Let us consider each of these in turn.

Measuring the variance explained by a model component

For individual weights, hidden units or input variables the procedure is relatively straightforward. First measure the squared error of the complete model and subtract this from the squared error of the reduced model (i.e. the original model deprived of the influence of the element under test). This gives the amount of variance explained by the element in question. A network can be deprived of an input variable by simply replacing the actual values of that variable with the average (mean) value of the variable, following Moody and Utans (4). For hidden units within the network it is necessary to first run through the data sample and calculate the average value which the unit takes.
Then measure the increase in squared error which is caused by holding the unit fixed to its average value. In order to measure the effect of a single connection we measure the increase in error caused by holding the input to the connection constant at its average value.

Estimating the degrees of freedom associated with a model component

The problem of estimating d, the number of degrees of freedom associated with a network component, consists of identifying the parameters which are directly linked to the component and adding to these an estimate of the indirect linkages. Although there is no known method for quantifying the actual number of indirect linkages we describe below methods for obtaining both worst-case and average-case estimates. We will consider, in turn, the cases of input variables, hidden units, and network connections in the context of a fully connected network with N inputs, M outputs and H1, H2 units in the first and second hidden layers respectively.

Input variables: each input variable has H1 direct connections and (N−1)·H1 irrelevant connections. The remaining parameters of the network might or might not be influenced by the variable and represent candidate indirect connections. We define worst-case and average-case measures of the number of indirect connections as follows:

Worst case: all candidate parameters, i.e. H1 + H1·H2 + H2 + … + Hn·M + M

Average case: an equal share, along with the other input variables, of the candidate parameters, i.e. (H1 + H1·H2 + H2 + … + Hn·M + M) / N

In general the total degrees of freedom, d, associated with a chosen input variable in a network with n hidden layers is given by:

Worst case: d = H1 + Σ_{i=2..n} Hi·(1 + H_{i−1}) + M·(1 + Hn)    (6)

Average case: d = [ H1 + Σ_{i=2..n} Hi·(1 + H_{i−1}) + M·(1 + Hn) ] / N    (7)

These two figures can then be used to calculate both conservative and average-case F statistics for the input variable being tested.

Hidden units: each hidden unit has N + H2 direct connections plus a bias.
There are (H1 − 1)·(N + H2 + 1) irrelevant parameters which are directly associated with the other units in the same layer. As in the case of the input variable, the remaining C parameters are candidates to provide indirect degrees of freedom, with the worst-case estimate being C itself and the average case being C/H1. We will omit the expressions for the general-case estimates as they resemble those given above for the input layer.

Connections: each connection has only one direct parameter: the weight which is associated with the connection. The other H1·H2 − 1 connections between the first and second hidden layers are irrelevant. The unit in the first hidden layer provides 1 bias and N connections as candidates, whilst the unit in the second hidden layer provides H3 + 1 candidates. Again, the (H1−1)·N connections between input and first hidden layer, and the (H2−1)·H3 connections between second and third hidden layers, can also be ignored. All other connections in the network (i.e. if there were more hidden layers) would also provide candidate degrees of freedom. For the case of a general connection from layer i to layer j, i.e. i < j, the average-case degrees of freedom are:

d = 1 + (L_{i−1} + 1) / L_j + (L_{j+1} + 1) / L_i

where L_0 = number of input variables, L_1 .. L_{n−1} = number of units in hidden layers 1..n−1, and L_n = number of output units.

MODELLING METHODOLOGY BASED ON SIGNIFICANCE TESTS

The ability to perform significance tests on network components allows the use of a principled modelling methodology analogous to backwards stepwise regression (which is itself based on the concept of partial F tests). An initial model is built using the entire set of candidate variables (which in turn might have been selected from a larger universe of variables using conditional expectation analysis). Significance tests are then carried out on each input variable. Insignificant variables are dropped, and the new model is fitted to the data.
This process is continued until all remaining variables are statistically significant.

APPLICATION TO FINANCIAL MODELLING OF THE FTSE INDEX

We applied our methodology to a problem of modelling returns on the FTSE 100 index. The economic basis of our approach was to first derive a cointegrating residual which reflects the over- or under-pricing of the FTSE relative to a basket of other stock indices. (For an introduction to the concept of cointegration see Alexander and Johnson (5).) The other variables were chosen as being ones which might modulate the cointegrating effect and were: gold prices; oil prices; 3-month interbank lending rates (3MM); bond yields; the sterling index; and market momentum (lagged returns on the FTSE).

The available data covered the period June 1990 to February 1994. Five hundred daily observations were used for training purposes and the most recent 200 observations were used for out-of-sample testing. Small networks with only four hidden units were used and the networks were trained to convergence using standard backpropagation. Because the statistical tests are designed to be used in-sample there was no need for a cross-validation set, although in practice one might still be used as a further measure of consistency of results.

Results

In total, four iterations of the stepwise procedure were required, using a critical F value of 4. The initial network contained 12 explanatory variables, of which 4 were significant. Two of these variables were dropped in successive iterations to leave a model containing only the cointegrating residual and the volatility of short-term interest rates. Intuitively this makes economic sense because it suggests that the cointegrating effect is changed when there is an expectation of change in interest rates. Details of the different networks are presented in table 1.
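The partial F tests and the stepwise elimination loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fit` stands for any training routine returning a model with a `predict` method (a neural network in the paper; a linear least-squares model also works), `dof` is the worst- or average-case degrees-of-freedom estimate for an input, and all function names are our own.

```python
import numpy as np

def partial_f(model, X, y, col, dof):
    """Partial F for one input variable: compare the SSE of the full model
    with the SSE obtained when that input is replaced by its mean value
    (the mean-substitution approach of Moody and Utans)."""
    sse_full = ((y - model.predict(X)) ** 2).sum()
    X_reduced = X.copy()
    X_reduced[:, col] = X[:, col].mean()  # deprive the model of this input
    sse_reduced = ((y - model.predict(X_reduced)) ** 2).sum()
    n = len(y)
    return ((sse_reduced - sse_full) / dof) / (sse_full / (n - dof - 1))

def backwards_stepwise(fit, X, y, names, dof, f_crit=4.0):
    """Refit repeatedly, dropping all inputs whose partial F is below f_crit,
    until every remaining input is statistically significant."""
    keep = list(range(X.shape[1]))
    while True:
        model = fit(X[:, keep], y)
        fs = [partial_f(model, X[:, keep], y, i, dof) for i in range(len(keep))]
        survivors = [k for k, f in zip(keep, fs) if f >= f_crit]
        if len(survivors) == len(keep) or not survivors:
            return model, [names[k] for k in keep], fs
        keep = survivors
```

With a critical value of f_crit = 4, as used in this application, only inputs whose partial F exceeds the cutoff survive to the final model.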
The out-of-sample performance, measured in percentage direction correct together with net profit from a simple trading strategy, generally improved as insignificant variables were dropped from the model. The final model generated profits of 13.1 percent during the out-of-sample period (approximately one year), with the direction of the market being correctly predicted 60.5% of the time.

CONCLUSION

We have introduced a principled variable selection methodology based on the use of analysis of variance (ANOVA). On artificial data, our method performs comparably to correlation analysis where linear relationships are present but, unlike correlation analysis, it also succeeds in identifying non-linear relationships. A simple generalisation of the approach allows for direct testing for interaction effects between variables. We have described a methodology for conducting significance tests of components - input variables, hidden units, and connections - within a trained neural network, based on the use of partial F tests. The significance tests provide the basis for a modelling methodology analogous to the stepwise regression used in linear modelling. We have successfully applied this approach to an application of predicting returns on the FTSE 100 index. An original model containing 12 variables was reduced to a final model containing only a cointegrating residual (a measure of mispricing relative to other stock indices) and a measure of interest rate volatility. We are currently developing an integrated modelling framework in which significance testing is used to drop variables from a model and variable selection is used to bring in new variables, in a manner directly analogous to the full (backwards and forwards) stepwise regression used in traditional statistics.

REFERENCES

1. Box G E P and Jenkins G M, 1970, Time Series Analysis, Forecasting and Control, Holden-Day
2. Weisberg S, 1985, Applied Linear Regression, Wiley, New York, USA
3.
Moody J E, 1992, The effective number of parameters: an analysis of generalisation and regularization in nonlinear learning systems, NIPS 4, 847-854, Morgan Kaufmann, San Mateo, USA
4. Moody J E and Utans J, 1992, Principled architecture selection for neural networks: application to corporate bond rating prediction, NIPS 4, 683-690, Morgan Kaufmann, San Mateo, USA
5. Alexander C and Johnson A, 1994, Dynamic links, RISK, Vol 7, No 2

Figure 1: ACFs and CEVFs for linearly autocorrelated series

Figure 2: ACFs and CEVFs for non-linearly autocorrelated series

TABLE 1 - Results of stepwise modelling procedure applied to FTSE 100 returns

Variables            F (Model 1)  F (Model 2)  F (Model 3)  F (Model 4)
Gold (volatility)    4.5          3.2          X            X
Gold (changes)       0.1          X            X            X
Oil (vol)            4.2          5.4          2.5          X
Oil (changes)        1.6          X            X            X
Bond (vol)           2.4          X            X            X
Bond (changes)       1.7          X            X            X
3MM (vol)            6.2          7.1          5.7          7.1
3MM (changes)        1.3          X            X            X
Sterling (vol)       0.5          X            X            X
Sterling (changes)   2.3          X            X            X
Residual             7.5          8.4          6.9          7.3
Momentum             3.1          X            X            X
Cutoff (F)           4            4            4            4
Direction correct    104/200      113/200      111/200      121/200
Profit               3.4%         8.2%         6.7%         13.1%