Statistical Arbitrage For Mid-Frequency Trading

Nicolas Kseib, Xiaolin Lin, Lorenzo Limonta, Mike Phulsuksombati

June 11, 2014
Abstract
The main goal of this project is to generate and exploit trading signals from real-life high-frequency and mid-frequency trading data. With the aid of Thesys, we are able to use real-life trading data to explore and evaluate statistical-arbitrage-based algorithms. We implement a principal component analysis (PCA) to isolate residual signals. Using multiple data mining techniques, we develop market-neutral trading strategies. The parameters of the different learning methods are updated using walk-forward optimization. Finally, we simulate the trading strategies on real data and evaluate their performance. The results show that our methods, using different models as well as the raw residual signal, can generate profitable strategies on pre-managed data.
1 Introduction
In the field of investment, statistical arbitrage refers to strategies that attempt to profit from pricing inefficiencies in the market, identified through mathematical models. The basic assumption of any such strategy is that prices of similar securities will move towards a historical average. It encompasses a variety of strategies and investment programs whose common features are:
• Trading signals are systematic
• The trading book is market-neutral
• The mechanism for generating excess returns is statistical
The idea is to make many bets with positive expected returns, taking advantage of diversification across stocks, to produce a low-volatility investment strategy that is uncorrelated with the market.
Historically, the father of modern statistical arbitrage techniques is pairs trading, a strategy in which two securities with similar return behavior are first identified and then traded. Once their respective values diverge significantly from the expected mean, one goes long on the underperforming security while going short on the security performing better than expected. This is done under the assumption that, in the long term, their prices will converge back to their mean.
In this paper we follow the natural extension of this strategy: rather than simply choosing a pair, we trade groups of stocks against other groups of stocks, thus implementing a generalized pairs-trading technique. What is innovative in our technique is the trading time horizon, which is very different from that of usual arbitrage strategies. Rather than working with time windows of weeks or months, we behave as high-frequency trading (HFT) firms do, looking for imbalances in the short term, on the order of a few minutes at most.
In this section, we introduce one way to construct the residual signals. This signal will be the basis of our trading strategy, as it allows us to distinguish relevant information from noise. In section 2, we present walk-forward optimization (WFO), the method we use to update our parameters, as well as our trading strategy and the signal filtering methods used to improve our performance. The simulation results and daily returns of our strategy are shown in section 3. We conclude and discuss our challenges in section 4.
1.1 PCA Analysis
As one can imagine, a correct implementation of such a strategy requires understanding the correlation between the price movements of the different assets that make up our book. We follow the approach of [AL10] and [Lal+99]. A first approach for extracting our signal of interest from the data is to use Principal Component Analysis (PCA). This approach uses historical share-price data on a cross-section of N stocks going back M periods in history. For simplicity of exposition, the cross-section is assumed to be identical to the investment universe, although this need not be the case in practice. Let us represent the stock return data, on any given date t0, going back M + 1 periods, as a matrix:
\[
R_{ik} = \frac{S_i(t_0 - (k-1)\Delta t) - S_i(t_0 - k\Delta t)}{S_i(t_0 - k\Delta t)}, \qquad k = 1, \ldots, M, \quad i = 1, \ldots, N, \tag{1}
\]
where S_i(t) is the price of stock i at time t, adjusted for dividends, and Δt = 1 minute. Since some stocks are more volatile than others, it is convenient to work with the standardized returns matrix Y,
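\[
Y_{ik} = \frac{R_{ik} - \bar{R}_i}{\bar{\sigma}_i}, \tag{2}
\]
where
\[
\bar{R}_i = \frac{1}{M} \sum_{k=1}^{M} R_{ik} \tag{3}
\]
and
\[
\bar{\sigma}_i^2 = \frac{1}{M-1} \sum_{k=1}^{M} (R_{ik} - \bar{R}_i)^2. \tag{4}
\]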
The empirical correlation matrix C of the data is defined by
\[
\rho_{ij} = \frac{1}{M-1} \sum_{k=1}^{M} Y_{ik} Y_{jk}, \tag{5}
\]
which is symmetric and non-negative definite. Notice that, for any index i, we have
\[
\rho_{ii} = \frac{1}{M-1} \sum_{k=1}^{M} (Y_{ik})^2 = \frac{1}{M-1} \frac{\sum_{k=1}^{M} (R_{ik} - \bar{R}_i)^2}{\bar{\sigma}_i^2} = 1. \tag{6}
\]
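As a minimal sketch of equations (1) to (6), the following NumPy code computes the returns, the standardized returns, and the empirical correlation matrix. The function names and the layout of the price array (one row per stock, columns in chronological order) are assumptions made for illustration, not the authors' implementation. The column order differs from the index k in equation (1), which does not change the correlation matrix.

```python
import numpy as np

def standardized_returns(prices):
    """prices: array of shape (N, M + 1), one row per stock, oldest column first.

    Returns (R, Y, sigma): returns, standardized returns, per-stock volatility.
    """
    prices = np.asarray(prices, dtype=float)
    # One-period simple returns, as in equation (1).
    R = (prices[:, 1:] - prices[:, :-1]) / prices[:, :-1]
    mean = R.mean(axis=1, keepdims=True)           # equation (3)
    sigma = R.std(axis=1, ddof=1, keepdims=True)   # square root of equation (4)
    Y = (R - mean) / sigma                         # equation (2)
    return R, Y, sigma

def empirical_correlation(Y):
    """Empirical correlation matrix of equation (5)."""
    M = Y.shape[1]
    return Y @ Y.T / (M - 1)
```

A stock whose price never moves in the window has zero volatility and would produce a division by zero here, so such stocks would have to be excluded before this step.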
The commonly used solution to extract meaningful information from the data is Principal Component Analysis. We consider the eigenvectors and eigenvalues of the empirical correlation matrix and rank the eigenvalues in decreasing order:
\[
N \ge \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_N \ge 0. \tag{7}
\]
We denote the corresponding eigenvectors by
\[
v^{(j)} = \left(v_1^{(j)}, \ldots, v_N^{(j)}\right). \tag{8}
\]
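We denote by ρ_C(λ) the density of eigenvalues of the empirical correlation matrix, defined by
\[
\rho_C(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda}, \tag{9}
\]
where n(λ) is the number of eigenvalues of C less than λ. Interestingly, if Y is a T × N random matrix, ρ_C(λ) is exactly known in the limit N → ∞, T → ∞ with Q = T/N ≥ 1 fixed, and reads:
\[
\rho_C(\lambda) = \frac{Q}{2\pi\sigma^2} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda}, \tag{10}
\]
\[
\lambda_{\min}^{\max} = \sigma^2 \left(1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}\right), \tag{11}
\]
with λ ∈ [λ_min, λ_max], and where σ² is equal to the variance of the elements of Y.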
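The bounds in equation (11) can be turned into a simple eigenvalue filter. The sketch below is hypothetical: the function names are ours, and it assumes that the "significant" eigenvalues are those lying above λ_max, which is a common reading of [Lal+99] but is not spelled out in the text.

```python
import numpy as np

def marchenko_pastur_bounds(T, N, sigma2=1.0):
    """Edges (lam_min, lam_max) of the random-matrix spectrum, equation (11)."""
    Q = T / N
    if Q < 1:
        raise ValueError("equation (11) requires Q = T/N >= 1")
    root = np.sqrt(1.0 / Q)
    lam_min = sigma2 * (1 + 1 / Q - 2 * root)
    lam_max = sigma2 * (1 + 1 / Q + 2 * root)
    return lam_min, lam_max

def significant_eigenpairs(C, T, sigma2=1.0):
    """Eigenpairs of C whose eigenvalue exceeds lam_max (assumed criterion)."""
    N = C.shape[0]
    _, lam_max = marchenko_pastur_bounds(T, N, sigma2)
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # rank in decreasing order, eq. (7)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > lam_max
    return eigvals[keep], eigvecs[:, keep]
```

Because the cut-off scales linearly with sigma2, the number of retained eigenvectors, and therefore the residual, changes with the chosen variance parameter.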
Let λ_min, ..., λ_max be the significant eigenvalues in the above sense. For each index j, we consider the corresponding eigenportfolio, in which the amount invested in each stock is defined as
\[
Q_i^{(j)} = \frac{v_i^{(j)}}{\bar{\sigma}_i}. \tag{12}
\]
The eigenportfolio returns are therefore
\[
F_{jk} = \sum_{i=1}^{N} \frac{v_i^{(j)}}{\bar{\sigma}_i} R_{ik}, \qquad j = k, \ldots, l. \tag{13}
\]
The residual signal can then be generated as follows:
\[
\hat{\beta} = (F^T F)^{-1} F^T C, \tag{14}
\]
\[
\hat{C} = F \hat{\beta}, \tag{15}
\]
\[
\text{Residual} = C - \hat{C}. \tag{16}
\]
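Equations (12) to (16) amount to building factor returns from the selected eigenvectors and taking the residual of a least-squares fit. The sketch below is illustrative only: the function names are ours, the regression target is passed in as a generic matrix, and np.linalg.lstsq is used in place of the explicit inverse in equation (14), which gives the same coefficients when F has full column rank.

```python
import numpy as np

def eigenportfolio_returns(R, eigvecs, sigma):
    """Factor returns F of equation (13).

    R: (N, M) returns; eigvecs: (N, J) selected eigenvectors as columns;
    sigma: per-stock volatilities, shape (N,) or (N, 1). Returns F, shape (M, J).
    """
    weights = eigvecs / np.reshape(sigma, (-1, 1))   # equation (12)
    return R.T @ weights

def regression_residual(F, target):
    """Least-squares fit of target (M rows) on F; returns (beta, residual)."""
    beta, *_ = np.linalg.lstsq(F, target, rcond=None)   # equation (14)
    fitted = F @ beta                                   # equation (15)
    return beta, target - fitted                        # equation (16)
```

By construction the residual is orthogonal to every column of F, which is a quick sanity check on any implementation.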
1.2 Residual Prediction with Data Mining Techniques
We also need to select a model to predict the residual signal. We use standard data mining techniques, namely least squares, random forest, elastic net regression, and multinomial logistic regression.
1.3 Main Challenges
• A first challenge, built into the use of Random Matrix Theory, is that we have many zero returns as the time differences become smaller. This makes it hard to compute the SVD that we need for our eigenvalues and eigenvectors.
• The residual signals we compute are very small and are easily perturbed by numerical (computer) error.
• The biggest challenge is that the residual signals are sensitive to the σ² we choose for the distribution of eigenvalues.
• There is a trade-off between time cost and parameter tuning. Ideally, we want to tune more parameters to get more stable, accurate results, but we then spend more time tuning them, which is not ideal when executing at high frequency.
2 Trading Strategy
2.1 Parameter Estimation
The walk-forward optimization (WFO) methodology is used to update the choice of the variance of the elements of the standardized returns matrix. This variance is used to compute the eigenvalue spectrum of the empirical correlation matrix. The performance of the parameter to be optimized is judged a posteriori in terms of the robustness or stability of the optimal parameter obtained by maximizing a certain objective function. In this report, we choose the Sharpe ratio as our objective function. We use the classical WFO algorithm: we start by building our model using an initial amount of data satisfying the condition T/N > 1. Using the optimal model obtained in this initial period, we make our first out-of-sample predictions. After the out-of-sample prediction period ends, this segment of data is added to our in-sample database and we build another predictive model with a different σ². This allows us to update our model and account for any non-stationarity or new information in the process. It should be noted that it is essential to find a good, stable optimization procedure in order to fit the parameters used in modeling the mean-reversion process.
As we said, the optimization is performed using the Sharpe ratio as the objective function, starting with a $7M investment ($10K and following a buy/short trading strategy on the 70 stocks in the XLK technology ETF). The first predictive model is built using data from the first 191 minutes of the trading day, which gives Q = T/N ≈ 2.73, thus satisfying the conditions that allow us to apply equations (10) and (11). This model is used to predict the next period, consisting of 120 minutes. When the 120 minutes are over, they are added to our sample and used on top of the initial data to build a new predictive model. This process is repeated until the end of the day. Using periods of 120 minutes, we end up building two different models each day, with distinct values of σ². From a preliminary analysis, it seems that the choice of the length of the training periods is crucial for a correct parametrization of σ²: if one chooses a large number of minutes, one runs the risk of over-fitting, whereas if one chooses a small sample, the statistical significance can deteriorate greatly.
Figure 1: Walk Forward Optimization
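The expanding-window schedule described above can be sketched as follows. This is a hedged illustration, not the authors' notebook: strategy_returns is a hypothetical callback that maps a data segment and a candidate σ² to per-minute returns, the Sharpe ratio is left unannualized, and the 390-minute trading day used in the usage note is our assumption.

```python
import numpy as np

def sharpe_ratio(returns):
    """Unannualized Sharpe ratio: mean over sample standard deviation."""
    returns = np.asarray(returns, dtype=float)
    if returns.size < 2:
        return 0.0
    sd = returns.std(ddof=1)
    return 0.0 if sd == 0 else returns.mean() / sd

def walk_forward_splits(n_minutes, initial=191, step=120):
    """Yield (train_end, test_end) pairs for an expanding in-sample window."""
    train_end = initial
    while train_end < n_minutes:
        test_end = min(train_end + step, n_minutes)
        yield train_end, test_end
        train_end = test_end   # the test segment joins the in-sample data

def walk_forward(data, strategy_returns, grid, initial=191, step=120):
    """Pick the sigma2 with the best in-sample Sharpe, then apply it out of sample.

    data: array with time along axis 0; strategy_returns(segment, sigma2) is a
    hypothetical user-supplied function returning per-minute returns.
    """
    results = []
    for train_end, test_end in walk_forward_splits(len(data), initial, step):
        in_sample = data[:train_end]
        best = max(grid, key=lambda s2: sharpe_ratio(strategy_returns(in_sample, s2)))
        out_returns = strategy_returns(data[train_end:test_end], best)
        results.append((train_end, test_end, best, out_returns))
    return results
```

With a 390-minute session, list(walk_forward_splits(390)) produces two windows, (191, 311) and (311, 390), which matches the two models per day mentioned in the text.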
Figure 2: Sharpe ratio versus σ², used to compute the trading signals, for the 25th of February.
Figures 2 and 3 show the variation of the Sharpe ratio with respect to σ² for both periods of a given day. We show the results for two days: on the first, profits were realized, and on the second, losses were incurred. The idea is to understand whether the stability or robustness of the optimization procedure has any impact on the profits and losses. Indeed, for the profitable day we can see in figure 2 that, for both periods considered, there is a cluster of positive values (σ² values achieving a high Sharpe ratio) and a cluster of negative values. It was interesting to note that the positive cluster was around σ² ∈ [0.75, 0.85], in accordance with the results obtained by Laloux et al. [Lal+99]. This could explain the good results seen in the accumulated wealth plot for the raw residuals approach on the 25th of February data. On the other hand, losses were incurred on the 24th of February data when using the same signal generation approach. Looking at the graph of σ² versus the Sharpe ratio, we see that the cluster of positive values is absent and that the objective function is highly oscillatory. This should be considered a warning that the methodology might be overfitting the available data. A possible future extension of this work is to test an important hypothesis: that the absence of a good cluster of positive σ² values can be considered evidence against the predictive power of the model, and thus an indication of a possible bearish day.
Figure 3: Sharpe ratio versus σ², used to compute the trading signals, for the 24th of February.
2.2 Signal Filtering Techniques
The signal is obtained from the residual of section 1.1. A positive residual is a signal to buy the stock, and a negative residual is a signal to short sell it. However, the residual may contain noise, and we may end up with a strategy that is not market neutral. Thus, to obtain a more accurate signal and trade with a market-neutral strategy, we need to filter out some residuals before transforming them into signals. We provide an IPython notebook demonstrating this part and the WFO at
https://github.com/mikemeetoo/mse448.
Figure 4: Signal Filtering Techniques
2.2.1 Residual Filtering
Once we have the residuals, we want to extract the strong signals from them; therefore, we ignore residuals of low magnitude by setting them to zero. We pick a filtering ratio and set up the banners as
positive banner = ratio × max(positive residual) + (1 − ratio) × min(positive residual)
negative banner = ratio × max(negative residual) + (1 − ratio) × min(negative residual)
Residuals that lie between these two banners are filtered out by setting them to zero.
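A minimal sketch of this filter is given below. It assumes that each banner is the convex combination ratio × max + (1 − ratio) × min of the residuals of that sign, which is our reading of the formula and may differ from the authors' notebook; the function name is hypothetical.

```python
import numpy as np

def filter_residuals(residuals, ratio=0.3):
    """Zero out residuals lying between the negative and positive banners."""
    r = np.asarray(residuals, dtype=float)
    out = r.copy()
    pos, neg = r[r > 0], r[r < 0]
    if pos.size:
        # Assumed reading of the positive banner.
        pos_banner = ratio * pos.max() + (1 - ratio) * pos.min()
        out[(r > 0) & (r < pos_banner)] = 0.0
    if neg.size:
        # Assumed reading of the negative banner.
        neg_banner = ratio * neg.max() + (1 - ratio) * neg.min()
        out[(r < 0) & (r > neg_banner)] = 0.0
    return out
```

Under this reading the two sides are not treated symmetrically: with the same ratio, the negative banner sits closer to the most negative residual than the positive banner sits to the largest positive one.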
2.2.2 Active Stock Filtering
We also consider only stocks that are actively traded, judged by the ratio of zeros in the return data. If the returns of a stock contain a higher ratio of zeros than a given threshold, we do not consider that stock in our analysis.
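The zero-ratio test can be written in a few lines. The function and parameter names below are illustrative assumptions; the default of 0.5 mirrors the active-stock parameter reported in the simulation settings.

```python
import numpy as np

def active_stock_mask(R, max_zero_ratio=0.5):
    """Boolean mask of stocks to keep.

    R: (N, M) returns matrix. A stock is dropped when the fraction of exactly
    zero returns in its row is strictly greater than max_zero_ratio.
    """
    R = np.asarray(R, dtype=float)
    zero_ratio = np.mean(R == 0.0, axis=1)
    return zero_ratio <= max_zero_ratio
```

The mask would typically be applied as R[active_stock_mask(R)] before the returns are standardized, since rows dominated by zeros are also the ones that make the correlation matrix hard to decompose.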
2.2.3 Sorting Residual
The goal of this part is to obtain a market-neutral portfolio. We sort the filtered residuals and count the number of positive and negative residuals. Then we set the low-magnitude residuals to zero until the numbers of positive and negative residuals are equal.
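One possible way to balance the two sides is sketched here. It assumes that the residuals set to zero are the smallest-magnitude ones on whichever side has more entries; the text does not state this explicitly, and the function name is ours.

```python
import numpy as np

def balance_residuals(residuals):
    """Zero low-magnitude residuals until positives and negatives are equal in number."""
    out = np.asarray(residuals, dtype=float).copy()
    pos_idx = np.flatnonzero(out > 0)
    neg_idx = np.flatnonzero(out < 0)
    excess = len(pos_idx) - len(neg_idx)
    if excess == 0:
        return out
    side = pos_idx if excess > 0 else neg_idx
    # Sort the over-represented side by magnitude and drop the weakest entries.
    weakest = side[np.argsort(np.abs(out[side]))[:abs(excess)]]
    out[weakest] = 0.0
    return out
```

For instance, with residuals (5, 1, 3, −2, 0) there are three positive entries and one negative entry, so the two weakest positive residuals (1 and 3) are set to zero.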
2.2.4 Generate Signal
After applying all the methods described above, we generate the signal from the sign of the residual: if it is positive, we get the signal +1 to buy the stock; if it is negative, we get the signal −1 to sell the stock.
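In code, this last step reduces to taking the sign of each filtered residual. The sketch below uses a hypothetical function name and assumes that a residual set to zero by the earlier filters yields a signal of 0, meaning no trade in that stock.

```python
import numpy as np

def generate_signal(residuals):
    """Map residuals to +1 (buy), -1 (sell), or 0 (filtered out, assumed no trade)."""
    return np.sign(np.asarray(residuals, dtype=float)).astype(int)
```

If the positive and negative residuals have been balanced beforehand, the signals sum to zero, which gives a simple check that the book holds as many long as short positions.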
3 Results
3.1 Simulation Settings
We were provided with high-frequency data by Thesys covering Feb 24, 2014 to Feb 28, 2014. Based on this data set, we conducted simulations to tune and evaluate the different methods. There are two kinds of simulations, with different settings. The first compares the methods proposed above. Daily profit (investment returns before and after the last minute of the day), computational cost (inner-loop time), and minimum and maximum wealth are evaluated and compared.
The parameters are selected by tuning in the first part of the simulations. For a fair comparison across all methods, we used a filtering parameter of 0.3 (except for logistic regression, which is not stable with respect to the filtering parameter and is therefore run without filtering) and an active-stock parameter of 0.5.
The second kind of simulation compares different parameters in the parameter tuning, using only raw residual signals.
3.2 Simulation Results
First of all, our simulation compares the different methods and shows that there is some instability in the net wealth invested during the last minute of the day (due to dumping all the positions).
Secondly, we report the computational cost of the different methods, which can guide the feasibility of implementation in high-frequency trading (shown in Table 1 (a) below). Among all the methods, Logistic Regression takes from 60 to 170 milliseconds per inner loop, Random Forest from 130 to 280 milliseconds, Elastic Net from 50 to 200 milliseconds, Least Square from 16 to 115 milliseconds, and the Raw Residual Signal from 5 to 15 milliseconds. Thus, the Raw Residual Signal and Least Square methods are the most efficient, and Random Forest is the most time-consuming. However, all the proposed methods can finish evaluation, prediction, and execution within 300 milliseconds, which makes them feasible for high-frequency trading.
Thirdly, across all the models we used, we cannot identify one that is always profitable. Daily profits for each method are shown in Table 1 (b) and Table 1 (c) below. The highest daily profit is achieved using Elastic Net, while the largest daily loss is incurred using Logistic Regression. Least Square tends to result in a small profit or loss. We need to perform more backtesting to further characterize each method and decide which one to implement under different settings. In general, however, we are optimistic about the results, since we see no model that should clearly be discarded.
Last but not least, the proposed methods are robust in terms of daily profit. In the raw residual simulations, if we choose the variance parameter for the eigenvalue distribution suitably, we are profitable on every day of the given data.

Strategy               Min Inner-Loop Time (ms)   Max Inner-Loop Time (ms)
Raw Residual                      5                          15
Least Square                     16                         115
Logistic Regression              60                         170
Elastic Net                      60                         200
Random Forest                   130                         280
(a) Inner-loop time cost for different methods

Strategy                 1st Day     2nd Day      3rd Day     4th Day     5th Day
Elastic Net             4991.895    1207.83     -1419.98     7315.945   -5374.125
Least Square           -1426.335   -1543.56     -5180.005    3441.225   -1155.935
Random Forest           3363.33     1465.835     -300.325    4704.34    -4259.67
Logistic Regression     4213.545    3247.285   -11122.88     2200.35    -6603.495
Raw Residual           -3673.71    -1137.92      5296.745   -3831.775    3330.235
(b) Investment returns ($) at the minute before the last minute of the day

Strategy                 1st Day     2nd Day      3rd Day     4th Day     5th Day
Elastic Net             6223.68     3510.77     -2142.92     7410.985   -5079.255
Least Square             934.48     -208.865    -1159.76     3214.12     -539.23
Random Forest           1979.275    3693.96     -1108.035    6212.715   -8128.54
Logistic Regression     2755.92     3037.53     -7647.885     250.56    -7885.815
Raw Residual           -2570.84      231.56      5375.98    -4056.205    3611.49
(c) Investment returns ($) at the last minute of the day

Table 1: Performance comparison of the strategies
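The figures in Table 1 (c) can be tallied with a short script. The numbers below are copied from the table; the helper names are ours and serve only to count, for each strategy and for each day, how many returns are positive.

```python
# Last-minute daily returns copied from Table 1 (c), days 1 to 5.
returns_c = {
    "Elastic Net": [6223.68, 3510.77, -2142.92, 7410.985, -5079.255],
    "Least Square": [934.48, -208.865, -1159.76, 3214.12, -539.23],
    "Random Forest": [1979.275, 3693.96, -1108.035, 6212.715, -8128.54],
    "Logistic Regression": [2755.92, 3037.53, -7647.885, 250.56, -7885.815],
    "Raw Residual": [-2570.84, 231.56, 5375.98, -4056.205, 3611.49],
}

def profitable_days(returns_by_strategy):
    """Number of days with a positive return, per strategy."""
    return {name: sum(r > 0 for r in values)
            for name, values in returns_by_strategy.items()}

def profitable_strategies_per_day(returns_by_strategy):
    """Number of strategies with a positive return, per day."""
    return [sum(r > 0 for r in day) for day in zip(*returns_by_strategy.values())]

if __name__ == "__main__":
    print(profitable_days(returns_c))
    print(profitable_strategies_per_day(returns_c))
```

A tally of this kind makes it easy to see on which days the raw residual signal and the model-based signals move in opposite directions.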
4 Conclusion
As can be seen from Table 1, we are able to generate positive (P) returns on any of the days considered, though this ability to make a profit depends strongly on the chosen optimization method. This implies that we are just as likely to generate negative (N) as positive returns on any given day. A closer look at the table reveals that no method generates a profit on more than four of the five days considered; thus, within the scope of this study, relying on a single strategy seems unwise and too risky. This suggests that in a real-case scenario, in order to implement a winning HFT stat-arb strategy, we will have to choose correctly which of the five strategies to use and, if more than one is chosen, what weight to assign to each in order to maximize returns while minimizing risk. A first possibility would simply be to choose a static optimal weight for each of the presented optimization strategies, though from a cursory look at Table 1 it would seem better to limit future analysis to strategies 1-3-5 or 3-4-5, since on any given day at least two of the strategies give positive returns. Alternatively, a continuously updating weighting process could be applied: as figures 5 through 11 show, there seem to be clear daily trends depending on the optimization strategy chosen. This feature could be exploited to maximize return by increasing the amount of money invested in a single strategy throughout the day as it makes a profit, while filtering out the negative effect of strategies with negative returns. In summary, given a correct computation of our book signal and a careful implementation of our optimization process, the results presented so far show the feasibility of implementing an HFT stat-arb strategy.
References
[AL10] Marco Avellaneda and Jeong-Hyun Lee. Statistical arbitrage in the US equities market. In: Quantitative Finance 10.7 (2010), pp. 761–782.
[Lal+99] Laurent Laloux et al. Noise dressing of financial correlation matrices. In: Physical Review Letters 83.7 (1999), p. 1467.
A Appendix
Figure 5: Raw Residuals. Wealth versus time, simulation of a $7M investment. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 6: Raw Residuals after filtering with parameter 0.3. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 7: Elastic Net. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 8: Least Square. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 9: Random Forest. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 10: Random Forest after filtering with parameter 0.3. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
Figure 11: Logistic Regression. Wealth versus time. Panels: (a) Day 24, (b) Day 25, (c) Day 26, (d) Day 27, (e) Day 28.
