

NON-LINEAR MODEL IDENTIFICATION AND STATISTICAL SIGNIFICANCE TESTS


AND THEIR APPLICATION TO FINANCIAL MODELLING
A N Burgess
London Business School, UK
ABSTRACT
We describe a methodology, based upon the statistical
concept of analysis of variance (ANOVA), which can
be used both for non-linear model identification and
for testing the statistical significance of inputs to a
neural network. We compare our model identification
procedure to the established approach of correlation
analysis on both linear and non-linear time-series. We
describe how the significance tests can form the basis
of a modelling methodology analogous to stepwise
regression. Finally we consider an application of these
techniques to the problem of modelling weekly returns
of the FTSE 100 index.
A NON-LINEAR MODEL IDENTIFICATION
PROCEDURE
Within a linear framework there are a variety of
statistical props which assist in the stages of model
identification, construction and verification; for
instance the well-known and established Box-Jenkins
framework, see Box and Jenkins (1). The
model identification stage of these techniques is
typically based on correlation analysis, i.e.
constructing the auto-correlation function (ACF) of the
time-series. Examination of the ACF allows the
modeller to identify both the order and type
(autoregressive, moving average, or mixed) of the
model.
When performing time-series modelling using neural
networks it is still the norm to use correlation analysis
as a preliminary variable selection technique.
However, correlation analysis will in many cases fail to
identify the existence of significant non-linear
relationships. This might cause variables to be
erroneously left out of later stages of modelling and
result in poorer models than would be the case had a
more powerful identification procedure been used.
Both linear regression models and neural networks are
commonly fitted by minimising mean squared error.
They provide an estimate of the mean value of the
output variable given the current values of the input
variables, i.e. a conditional expectation. This insight
allows us to realise why measures of dependence
which have been developed for linear models are not
always suitable for non-linear models.
In a broad sense, a linear relationship exists if and
only if the conditional expectation either consistently
increases or consistently decreases as we increase the
value of the independent variable. A non-linear
relationship, however, only requires that the
conditional expectation varies as we increase the
independent variable. The class of functions which
would thus be missed by a linear measure but
identified by a suitable non-linear one includes all
symmetric functions, all periodic functions (e.g.
sinewave) and many others besides.
We propose a measure of the degree to which the
conditional expectation of the dependent variable
varies with a given set of independent variables, but
which imposes no condition that this variation should
be of a particular form.
Analysis of variance or ANOVA
The ANOVA technique is a standard statistical
technique, usually used for analysis of categorical
independent variables, which divides a sample into g
groups and then tests to see if the differences between
the groups are statistically significant. It does this by
comparing the variability within the individual groups
to the variability between the different groups.
Firstly, the total variability within the groups is
calculated:

SSW = \sum_{i=1}^{n} ( y_i - \bar{y}_{g(i)} )^2     (1)

where g(i) is the group to which the ith pattern
belongs and \bar{y}_j denotes the mean of group j.
With g groups we lose g degrees of freedom in
estimating the group means, so we can estimate the
variance of the data by SSW / (n - g).
Secondly, the variability between the groups is
calculated:

SSB = \sum_{j=1}^{g} n_j ( \bar{y}_j - \bar{y} )^2     (2)

where n_j is the number of observations in group j and
\bar{y} is the sample mean. Here we lose one degree of
freedom for using the sample mean. Thus we can also
use SSB to estimate the true variance of the data by
dividing by (g - 1).
If the different groups represent samples from the
same underlying distribution then SSW and SSB are
simply dependent on the underlying true variance.
Adjusting for degrees of freedom, each can be used as
an estimate of the variance, and the ratio

F = \frac{ SSB / (g - 1) }{ SSW / (n - g) }     (3)

follows an F distribution with (g-1, n-g) degrees of
freedom.
Under the null hypothesis that all groups are drawn
from the same distribution, the F ratio will be
below the 10% critical value 9 times out of ten, below
the 1% critical value 99 times out of a hundred, and so
on. However, if a pattern exists then the between-group
variation will be increased. This will cause a higher F
ratio and, if the pattern is sufficiently strong, lead to
the rejection of the null hypothesis.
Testing a single variable
Thus we can perform an ANOVA test to establish
whether the variance in the conditional expectation of
y given different values of x is statistically significant.
The first step is to choose g, the number of groups;
typically this would be in the range 3-10. Following
this, each observation is allocated to the appropriate
group by dividing the continuous range of the original
variable into g non-overlapping regions. For a
normally distributed variable, using boundary values
which correspond to (1..g-1)/g of the cumulative
normal distribution will cause the number in each
group to be approximately the same.
For example, let X be normally distributed with mean 10 and
standard deviation 5, and let g = 4. The z values for 1/4,
2/4 and 3/4 are -0.675, 0 and 0.675. These correspond
to X values of 6.625, 10 and 13.375. Group 1 consists
of all those observations for which X <= 6.625; group
2 is those for which 6.625 < X <= 10; group 3 those
where 10 < X <= 13.375; and group 4 those where X >
13.375.
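The grouping scheme in the example above can be sketched as follows (illustrative code, not from the paper; the function name normal_quantile_groups is ours, and scipy's norm supplies the cumulative normal quantiles):

```python
import numpy as np
from scipy.stats import norm

def normal_quantile_groups(x, g):
    """Assign each observation of x to one of g groups whose boundaries
    are the (1/g, ..., (g-1)/g) quantiles of a normal distribution
    fitted to x's mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    z = norm.ppf(np.arange(1, g) / g)      # e.g. g=4 -> -0.675, 0, 0.675
    bounds = x.mean() + x.std() * z        # map z values onto x's scale
    return np.searchsorted(bounds, x)      # group index 0..g-1

# The worked example: mean 10, sd 5, g = 4 gives boundaries
# approximately 6.63, 10 and 13.37.
bounds = 10 + 5 * norm.ppf([0.25, 0.5, 0.75])
```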
The mean value of the dependent variable within each
group can then be computed. Under the null
hypothesis that the independent variable contains no
useful information about the dependent variable, the F
ratio (SSB/(g-1)) / (SSW/(n-g)) follows an F distribution
with (g-1, n-g) degrees of freedom.
Testing sets of two or more independent variables

We can also test sets of variables simultaneously. For
instance, with two variables the variation between the
groups can be broken down into one component due to
the first variable, one component due to the second
variable and a third component due to the interaction
between the two:

SSB_{total} = SSB_{x1} + SSB_{x2} + SSB_{x1,x2}

with degrees of freedom:

r^2 - 1 = (r - 1) + (r - 1) + (r - 1)^2

Thus we can use this approach to test directly for
either positive or negative interaction effects between
variables.
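The two-variable decomposition can be sketched numerically (illustrative code, not from the paper; ssb_decomposition is our name, and the interaction term is obtained as the remainder, which matches the identity exactly for balanced groupings):

```python
import numpy as np

def ssb_decomposition(y, g1, g2):
    """Decompose between-group variation for two grouping variables into
    two main effects and an interaction term:
    SSB_total = SSB_1 + SSB_2 + SSB_12."""
    y = np.asarray(y, dtype=float)
    mean = y.mean()

    def ssb(labels):
        # between-group variation for one labelling of the sample
        return sum((labels == v).sum() * (y[labels == v].mean() - mean) ** 2
                   for v in np.unique(labels))

    cells = g1 * (g2.max() + 1) + g2       # joint cell label for (g1, g2)
    ssb_total = ssb(cells)
    ssb_1, ssb_2 = ssb(g1), ssb(g2)
    # interaction as the remainder of the between-cell variation
    return ssb_total, ssb_1, ssb_2, ssb_total - ssb_1 - ssb_2
```

For a purely additive relationship with balanced groups, the interaction component is zero, so a large interaction term signals a genuine joint effect.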
RESULTS ON TEST DATA
In order to motivate our approach we compare it to
correlation analysis on both linear and non-linear
benchmark time series.
Linear relationships
Linearly autocorrelated series were constructed and
analysed using the two approaches. In order to obtain a
function which is comparable with the ACF, we define
the conditional expectation variation function (CEVF)
to equal \sqrt{(F-1)/n}.
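A sketch of how the CEVF might be computed over a range of lags (illustrative code, not from the paper; the helper name cevf and the use of empirical quantiles for the group boundaries are our choices, and scipy's f_oneway supplies the F ratio):

```python
import numpy as np
from scipy.stats import f_oneway

def cevf(y, max_lag, g=5):
    """Conditional expectation variation function: for each lag k, bin the
    lagged series into g equal-count groups and convert the ANOVA F ratio
    into CEVF(k) = sqrt((F - 1) / n)."""
    y = np.asarray(y, dtype=float)
    out = []
    for k in range(1, max_lag + 1):
        x, target = y[:-k], y[k:]
        edges = np.quantile(x, np.arange(1, g) / g)  # equal-count bins
        groups = np.searchsorted(edges, x)
        samples = [target[groups == j] for j in range(g)]
        F, _ = f_oneway(*samples)
        n = len(target)
        out.append(np.sqrt(max(F - 1.0, 0.0) / n))
    return np.array(out)
```

Unlike the ACF, this statistic responds to any variation in the conditional expectation, including symmetric or periodic dependence.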
Sample results are shown in figure 1. The overall
shape of the CEVF is very similar to that of the ACF;
the scale and decay rate differ slightly, but the AR(1)
nature of the process leaves a similar and consistent
signature.
Non-linear relationships
A further set of benchmark series were constructed so
as to exhibit first-order non-linear autocorrelation of
degree r, where e(t+1) is normal and i.i.d. The
parameters \alpha, \beta and \gamma of the generating
function were chosen so as to make it pathological to
linear analysis.
As for the linear case, a thousand different series of
1000 observations each were generated at each level of
correlation from 0.1 through to 0.9 in steps of 0.2.
Average ACFs and CEVFs for a sample of these series
are shown in figure 2.
Unlike the ACF, the CEVF clearly picks out a
significant pattern in the data. The pattern is
consistent across different levels of r and exhibits a
faster decay than with the linear series. This is perhaps
due to the noise-enhancing chaotic effect of the
non-linear relationship.
USING PARTIAL F-TESTS TO MEASURE THE
STATISTICAL SIGNIFICANCE OF NEURAL
NETWORKS
In this section, we propose the use of partial F-tests to
measure the statistical significance of input variables,
hidden units and weights within a neural network
model.
The partial F test
A normal F test, which might be referred to as a full
F test, tests the null hypothesis that the variance
explained by a model is no more than would be
accounted for purely by fitting the degrees of freedom
in the model to random data. If the null hypothesis is
true then the F-ratio

F = \frac{ (SSV - SSE) / k }{ SSE / (n - k - 1) }

where SSV = original variation, SSE = sum squared
error of the model, and k = number of parameters,
should follow an F distribution with (k, n-k-1)
degrees of freedom, and hence the actual F-ratio can be
tested for statistical significance.
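The full F test can be sketched as follows (illustrative code, not from the paper; the function name full_f_test is ours):

```python
import numpy as np
from scipy.stats import f as f_dist

def full_f_test(y, y_hat, k):
    """Full F test of a fitted model: SSV is the variation of y about its
    mean, SSE the model's sum of squared errors, k the parameter count."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    ssv = ((y - y.mean()) ** 2).sum()      # original variation
    sse = ((y - y_hat) ** 2).sum()         # residual variation of the model
    F = ((ssv - sse) / k) / (sse / (n - k - 1))
    p = 1.0 - f_dist.cdf(F, k, n - k - 1)
    return F, p
```

For a one-predictor linear fit this reduces to the familiar R-squared form F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).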
A partial F test is simply one in which a part of the
model is tested rather than the whole model. For
instance, it is relatively easy to perform partial F tests
on subsets of the variables within a multiple linear
regression, for details see Weisberg (2).
Applying partial F tests to neural networks
With neural networks there is a problem in precisely
estimating the degrees of freedom, because the
effective degrees of freedom of the model may not be
the same as the total number of parameters in the
model; see Moody (3). This might be caused by
inefficiencies in the learning procedure or by non-
independence of model parameters.
The problem can be particularly acute for neural
network models where factors such as early stopping
and weight decay are introduced into the learning
procedure in a conscious effort to prevent the network
fully exploiting its potential degrees of freedom. The
usually quoted motivation for such an approach is the
suggestion, via Occam's Razor, that simpler models
are more likely to generalise well. However, from a
statistical viewpoint, the objective can be more
concisely stated as trying to maximise the effective F
ratio of the network.
Thus there is a problem in estimating the degrees of
freedom contained in a particular neural network, or a
given portion of a network. This is a field worthy of
research in its own right. The approach we choose to
adopt, however, is one which attempts to solve this
problem indirectly and in two stages.
The first stage is to conduct F tests on sections of the
entire network to identify which inputs, hidden units
and connections play a statistically significant role in
the model. The second stage is to use the results of this
analysis to reduce the model to a minimal form in
which the insignificant components are removed from
the model. At this point we can test the significance of
the entire model, whilst at the same time having
achieved a parsimonious model which is more open to
interpretation than the original network.
In order to perform F tests on part of a neural network
model we must be able to measure two things: the
amount of variance explained by the partial model and
the number of degrees of freedom which the partial
model contains. Let us consider each of these in turn.
Measuring the variance explained by a model
component
For individual weights, hidden units or input variables
the procedure is relatively straightforward. First
measure the squared error of the complete model and
subtract this from the squared error of the reduced
model (i.e. the original model deprived of the
influence of the element under test). This gives the
amount of variance explained by the element in
question.
A network can be deprived of an input variable by
simply replacing the actual values of that variable with
the average (mean) value of the variable, following
Moody and Utans (4).
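This mean-substitution procedure can be sketched generically for any fitted model (illustrative code, not from the paper; predict stands for the trained network's forward pass, and the function name is ours):

```python
import numpy as np

def input_variance_explained(predict, X, y, col):
    """Variance attributed to input `col` by mean substitution, following
    Moody and Utans: SSE of the deprived model minus SSE of the full
    model. `predict` maps an (n, d) input array to predictions."""
    sse_full = ((y - predict(X)) ** 2).sum()
    X_dep = X.copy()
    X_dep[:, col] = X[:, col].mean()   # deprive the model of this input
    sse_dep = ((y - predict(X_dep)) ** 2).sum()
    return sse_dep - sse_full
```

An input the model ignores yields exactly zero, while an informative input yields a positive increase in squared error when removed.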
For hidden units within the network it is necessary to
first run through the data sample and calculate the
average value which the unit takes. Then measure the
increase in squared error which is caused by holding
the unit fixed to its average value.
In order to measure the effect of a single connection
we measure the increase in error caused by holding the
input to the connection constant at its average value.
Estimating the degrees of freedom associated with a
model component
The problem of estimating d, the number of degrees of
freedom associated with a network component,
consists of identifying the parameters which are
directly linked to the component and adding to these
an estimate of the indirect linkages. Although there is
no known method for quantifying the actual number of
indirect linkages, we describe below methods for
obtaining both worst-case and average-case estimates.
We will consider, in turn, the cases of input variables,
hidden units, and network connections in the context
of a fully connected network with N inputs, M outputs
and H1, H2 units in the first and second hidden layers
respectively.
Input variables: each input variable has H1 direct
connections and (N-1).H1 irrelevant connections. The
remaining parameters of the network might or might
not be influenced by the variable and represent
candidate indirect connections. We define worst-case
and average-case measures of the number of indirect
connections as follows:

Worst case: all candidate parameters, i.e. H1 + H1.H2
+ H2 + ... + Hn.M + M

Average case: an equal share, along with the other
input variables, of the candidate parameters, i.e. (H1 +
H1.H2 + H2 + ... + Hn.M + M) / N
In general the total degrees of freedom, d, associated
with a chosen input variable in a network with n
hidden layers is given by:

Worst case:

d = H_1 + M(1 + H_n) + \sum_{i=2}^{n} H_i (1 + H_{i-1})     (6)

Average case:

d = H_1 + \frac{ M(1 + H_n) + \sum_{i=2}^{n} H_i (1 + H_{i-1}) }{N}     (7)

These two figures can then be used to calculate both
conservative and average-case F statistics for the input
variable being tested.
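The worst-case and average-case degrees of freedom for an input variable can be sketched as a small helper (illustrative code, not from the paper; the function name input_dof is ours, and `hidden` is the list of hidden-layer sizes [H1, ..., Hn]):

```python
def input_dof(hidden, n_inputs, n_outputs):
    """Worst-case and average-case degrees of freedom for one input
    variable: H1 direct connections plus all (worst case) or an equal
    N-way share (average case) of the candidate indirect parameters."""
    H = hidden
    # candidate indirect parameters: output weights and biases plus the
    # weights and biases of every hidden layer beyond the first
    candidates = n_outputs * (1 + H[-1]) + sum(
        H[i] * (1 + H[i - 1]) for i in range(1, len(H)))
    worst = H[0] + candidates
    average = H[0] + candidates / n_inputs
    return worst, average
```

For example, a network with 6 inputs, one hidden layer of 4 units and a single output gives a worst case of 9 and an average case of 4 + 5/6 degrees of freedom per input.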
Hidden units: each hidden unit has N + H2 direct
connections plus a bias. There are (H1 - 1)(N + H2 +
1) irrelevant parameters which are directly associated
with the other units in the same layer. As in the case of
the input variable, the remaining C parameters are
candidates to provide indirect degrees of freedom, with
the worst-case estimate being C itself and the average
case being C/H1.

We will omit the expressions for the general case
estimates as they resemble those given above for the
input layer.
Connections: each connection has only one direct
parameter, the weight which is associated with the
connection. The other H1.H2 - 1 connections between
the first and second hidden layers are irrelevant. The
unit in the first hidden layer provides 1 bias and N
connections as candidates, whilst the unit in the second
hidden layer provides H3 + 1 candidates. Again, the
(H1-1)N connections between input and first hidden
layer, and the (H2-1)H3 connections between second
and third hidden layers, can also be ignored. All other
connections in the network (i.e. if there were more
hidden layers) would also provide candidate degrees of
freedom.

For the case of a general connection from layer i to
layer j, i.e. i < j, the average-case degrees of freedom
are

d = 1 + \frac{C}{L_i L_j}

where C is the number of candidate parameters, L_0 =
number of input variables, L_{1..n-1} = number of
units in hidden layers 1..n-1, and L_n = number of
output units.
MODELLING METHODOLOGY BASED ON
SIGNIFICANCE TESTS
The ability to perform significance tests on network
components allows the use of a principled modelling
methodology analogous to backwards stepwise
regression (which is itself based on the concept of
partial F tests).

An initial model is built using the entire set of
candidate variables (which in turn might have been
selected from a larger universe of variables using
conditional expectation analysis). Significance tests
are then carried out on each input variable.
Insignificant variables are dropped, and the new model
is fitted to the data. This process is continued until all
remaining variables are statistically significant.
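The backward stepwise loop can be sketched as follows (illustrative code, not from the paper; the `fit` interface, which returns a prediction function together with per-input partial F statistics, is a hypothetical stand-in for model training plus the significance tests described above):

```python
import numpy as np

def backward_stepwise(fit, X, y, names, f_crit=4.0):
    """Backward stepwise elimination driven by per-input partial F values.
    `fit(X_subset, y)` must return (predict, partial_f), where partial_f
    holds one F statistic per remaining input (hypothetical interface)."""
    keep = list(range(X.shape[1]))
    while True:
        predict, partial_f = fit(X[:, keep], y)
        worst = int(np.argmin(partial_f))
        if partial_f[worst] >= f_crit or len(keep) == 1:
            # all remaining inputs are significant at the chosen cutoff
            return [names[i] for i in keep], predict
        del keep[worst]   # drop the least significant input and refit
```

The critical F value of 4 used in the application below corresponds to dropping, at each iteration, any input whose partial F falls under that cutoff.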
APPLICATION TO FINANCIAL MODELLING
OF THE FTSE INDEX
We applied our methodology to a problem of
modelling returns on the FTSE 100 index. The
economic basis of our approach was to first derive a
cointegrating residual which reflects the over- or
under-pricing of the FTSE relative to a basket of other
stock indices. (For an introduction to the concept of
cointegration see Alexander and Johnson (5).) The
other variables were chosen as being ones which might
modulate the cointegrating effect and were:
gold prices
oil prices
3-month interbank lending rates (3MM)
bond yields
sterling index
market momentum (lagged returns on the FTSE)
The available data covered the period June 1990 -
February 1994. Five hundred daily observations were
used for training purposes and the most recent 200
observations used for out-of-sample testing. Small
networks with only four hidden units were used and
the networks were trained to convergence using
standard backpropagation. Because the statistical tests
are designed to be used in-sample there was no need
for a cross-validation set, although in practice one
might still be used as a further measure of consistency
of results.
Results
In total, four iterations of the stepwise procedure were
required, using a critical F value of 4. The initial
network contained 12 explanatory variables of which 4
were significant. Two of these variables were dropped
in successive iterations to leave a model containing
only the cointegrating residual and the volatility of
short term interest rates. Intuitively this makes
economic sense because it suggests that the
cointegrating effect is changed when there is an
expectation of change in interest rates. Details of the
different networks are presented in table 1.
The out-of-sample performance, measured in
percentage direction correct together with net profit
from a simple trading strategy, generally improved as
insignificant variables were dropped from the model.
The final model generated profits of 13.1 percent
during the out of sample period (approximately one
year) with the direction of the market being correctly
predicted 60.5% of the time.
CONCLUSION

We have introduced a principled variable selection
methodology based on the use of analysis of variance
(ANOVA). On artificial data, our method performs
comparably to correlation analysis where linear
relationships are present but, unlike correlation
analysis, it also succeeds in identifying non-linear
relationships. A simple generalisation of the approach
allows for direct testing for interaction effects between
variables.
We have described a methodology for conducting
significance tests of components - input variables,
hidden units, and connections - within a trained neural
network, based on the use of partial F tests.
The significance tests provide the basis for a modelling
methodology analogous to the stepwise regression used
in linear modelling. We have successfully applied this
approach to an application of predicting returns on the
FTSE 100 index. An original model containing 12
variables was reduced to a final model containing only
a cointegrating residual (a measure of mispricing
relative to other stock indices) and a measure of
interest rate volatility.
We are currently developing an integrated modelling
framework in which significance testing is used to
drop variables from a model and variable selection is
used to bring in new variables, in a manner directly
analogous to the full (backwards and forwards)
stepwise regression used in traditional statistics.
REFERENCES
1. Box G E P and Jenkins G M, 1970, Time Series
Analysis, Forecasting and Control, Holden-Day
2. Weisberg S, 1985, Applied Linear Regression,
Wiley, New York, USA
3. Moody J E, 1992, The effective number of
parameters: an analysis of generalisation and
regularization in nonlinear learning systems, NIPS 4,
847-54, Morgan Kaufmann, San Mateo, US
4. Moody J E and Utans J, 1992, Principled
architecture selection for neural networks: Application
to corporate bond rating prediction, NIPS 4, 683-90,
Morgan Kaufmann, San Mateo, US
5. Alexander C and Johnson A, 1994, Dynamic
links, Risk, Vol 7, No 2
Figure 1: ACFs and CEVFs for linearly autocorrelated series
Figure 2: ACFs and CEVFs for non-linearly autocorrelated series
TABLE 1 - Results of stepwise modelling procedure applied to FTSE 100 returns

Variables            F (Model 1)   F (Model 2)   F (Model 3)   F (Model 4)
Gold (volatility)        4.5           3.2            X             X
Gold (changes)           0.1            X             X             X
Oil (vol)                4.2           5.4           2.5            X
Oil (changes)            1.6            X             X             X
Bond (vol)               2.4            X             X             X
Bond (changes)           1.7            X             X             X
3MM (vol)                6.2           7.1           5.7           7.1
3MM (changes)            1.3            X             X             X
Sterling (vol)           0.5            X             X             X
Sterling (changes)       2.3            X             X             X
Residual                 7.5           8.4           6.9           7.3
Momentum                 3.1            X             X             X
Cutoff (F)               4             4             4             4
Direction Correct      104/200       113/200       111/200       121/200
Profit                   3.4%          8.2%          6.7%         13.1%
