(University of Toronto, Surkov) Parallel Option Pricing With Fourier Space Time-Stepping Method On Graphics Processing Units

Parallel Option Pricing with Fourier Space Time-stepping Method on Graphics Processing Units
Vladimir Surkov Department of Computer Science, University of Toronto, Toronto, ON, Canada vsurkov@cs.toronto.edu
October 8, 2007
ABSTRACT With the evolution of Graphics Processing Units (GPUs) into powerful and cost-efficient computing architectures, their range of application has expanded tremendously, especially in the area of computational finance. Current research in the area, however, is limited in terms of options priced and complexity of stock price models. This paper presents algorithms, based on the Fourier Space Time-stepping (FST) method, for pricing single and multi-asset European and American options with Lévy underliers on a GPU. Furthermore, the single-asset pricing algorithm is parallelized to attain greater efficiency.
KEY WORDS Option pricing, Lévy processes, Fourier Space Time-stepping, Fast Fourier Transform, Graphics Processing Unit, parallel computing
Introduction
Over the past decade, the area of computational finance has grown tremendously, both in scope and volume of problems being addressed and in complexity of models being used. Increasingly, jump-diffusion and Lévy models are used instead of the Black-Scholes-Merton (BSM) model to correct for the observed implied volatility smile (or skew) and term structure. Since working with these new models requires solving more complex equations, a wide array of new computational methods has been developed. Moreover, the ever increasing range of applications of these computationally involved methods brings about a need to execute them as quickly and efficiently as possible. Pricing and risk management tasks in a typical industry setting must be delegated to dedicated computational servers in order to be completed in a satisfactory period of time. Thus, there is an ever growing need for development of computationally efficient pricing architectures that are flexible enough to tackle various pricing problems. This paper discusses the application of graphics cards to option pricing and shows that they provide a significant increase in performance over CPUs when pricing path-dependent, single and multi-asset options.
Over the last several years Graphics Processing Units (GPUs) have evolved from mere dedicated graphics rendering devices to computing workhorses. Because GPUs are designed with data processing in mind rather than data caching and flow control, and because of their highly parallel structure, they are more effective than typical CPUs in compute-intensive and highly parallel applications. The literature on utilizing GPUs in option pricing is quite sparse and limited to implementations of the BSM formula and Monte Carlo simulation by Podlozhnyuk (2007) and of the binomial lattice pricing method by Colb and Phar (2005). This paper shows that GPUs can be effectively leveraged for pricing of exotic, path-dependent and multi-dimensional options, where the underlying stock price indices are modeled using jump-diffusion and Lévy processes. Additionally, the paper shows how the parallel structure of GPUs can be used to price multiple options simultaneously to further increase the computational efficiency of such architectures.
The remainder of the paper is organized as follows: section 2 presents the Fourier Space Time-stepping (FST) method, which computes the evolution of option values in time using two Fast Fourier Transforms (FFTs). Section 3 shows that GPUs provide a highly efficient alternative to CPUs for computing FFTs. Sections 4 and 5 develop numerical algorithms for utilizing GPUs in order to price single and multi-asset options on Lévy processes using the FST method. Lastly, section 6 shows how the parallel architecture of a GPU can be further utilized to price multiple options concurrently.
FST Method
Under the BSM model, where stock returns have a log-normal distribution, the option pricing problem reduces to solving a partial differential equation (PDE) and standard computational methods can be utilized. It is well known, however, that the BSM model is inconsistent with market behavior, as manifested in the observed implied volatility smile (or skew) and term structure. Jump-diffusion and Lévy models have been widely used to partially alleviate some of the biases inherent in the classical BSM model. Unfortunately, the resulting pricing problems require solving more difficult partial integro-differential equations (PIDEs). A number of approaches for solving such equations have been suggested in the literature (e.g. Andersen and Andreasen (2000), d'Halluin, Forsyth, and Labahn (2003), Briani, Natalini, and Russo (2004), Cont and Voltchkova (2005), and Almendral and Oosterlee (2005)). Although the methods are quite diverse, they all treat the integral and diffusive terms asymmetrically and are difficult to extend to higher dimensions.
The FST method is a new, efficient algorithm, based on transform methods, which treats the diffusive and integral terms symmetrically, is applicable to a wide class of path-dependent options (such as American, Bermudan, barrier, and shout options) as well as multi-asset options, and naturally extends to regime-switching Lévy models. The outline of the derivation of the FST method is presented here; for more details see Jackson, Jaimungal, and Surkov (2007a).
Let $V(t, S(t))$ denote the price at time $t$ of an option written on an underlying price index (or indices) $S(t)$ with a $T$-maturity payoff of $\varphi(S(T))$. Assume that the price index, or indices, follow an exponential Lévy process and can be written as $S(t) = S(0)e^{X(t)}$. This class of models is rich enough to generate geometric Brownian motion, jump-diffusion and other Lévy processes such as Variance Gamma and CGMY.
Under this modeling assumption, the discount-adjusted and log-transformed price process $v(t, X(t)) := e^{r(T-t)} V(t, S(0)e^{X(t)})$ satisfies a PIDE
$$\partial_t v + \mathcal{L} v = 0, \qquad v(T, x) = \varphi(S(0)e^{x}), \tag{1}$$
where $\mathcal{L}$ is the infinitesimal generator of the Lévy process. Applying the Fourier transform to the infinitesimal generator allows the characteristic exponent $\Psi(\omega)$ of $X(t)$ to be factored out. Furthermore, applying the Fourier transform to the pricing PIDE leads to
$$\partial_t \mathcal{F}[v](t, \omega) + \Psi(\omega)\,\mathcal{F}[v](t, \omega) = 0, \qquad \mathcal{F}[v](T, \omega) = \mathcal{F}[\varphi](\omega). \tag{2}$$
The PIDE is transformed into a one-parameter family of ODEs, parameterized by $\omega$. The resulting ODE has an explicit solution:
$$\mathcal{F}[v](t_1, \omega) = \mathcal{F}[v](t_2, \omega)\, e^{(t_2 - t_1)\Psi(\omega)}. \tag{3}$$
The solution of the PIDE is found by applying the inverse Fourier transform:
$$v(t_1, x) = \mathcal{F}^{-1}\!\left[\mathcal{F}[v](t_2, \omega)\, e^{(t_2 - t_1)\Psi(\omega)}\right](x). \tag{4}$$
The continuous Fourier transforms can be approximated by discrete Fourier transforms, which in turn can be computed efficiently using the FFT algorithm. Thus, the FST method can be written as:
$$v^{n-1} = \mathrm{FFT}^{-1}\!\left[\mathrm{FFT}[v^{n}] \cdot e^{\Psi \Delta t}\right], \tag{5}$$
where $v^n$ and $\Psi$ are d-dimensional matrices of option values at time $t_n$ and characteristic exponents, respectively, with the exponentiation and product applied componentwise. For European options a single time step, with $\Delta t = T$, is required:
$$v^{0} = \mathrm{FFT}^{-1}\!\left[\mathrm{FFT}[v^{T}] \cdot e^{\Psi \Delta t}\right]. \tag{6}$$
For American options, the exercise condition must be monitored at each time step:
$$v^{n-1} = \max\left\{\mathrm{FFT}^{-1}\!\left[\mathrm{FFT}[v^{n}] \cdot e^{\Psi \Delta t}\right],\ v^{T}\right\}. \tag{7}$$
Other path-dependent options, such as barrier, shout and Asian options, can be priced using the FST method by applying an appropriate boundary policy or optimization algorithm. Since the cornerstone of the FST method is the FFT algorithm, and computationally efficient implementations of the FFT are available on GPUs, it is quite natural to use GPUs for option pricing via the FST method.
FFT Computation on GPUs
Traditionally, GPUs have been reserved for performing a number of graphics primitive operations, such as texture mapping and polygon rendering. Over the past several years, however, the functionality of such cards has increased tremendously to allow for their use in general scientific and business computing. GPUs have evolved into cheap, powerful and highly parallel processing units that rival traditional CPUs in computationally intensive applications. Figure 1 compares the computational throughput of a typical CPU (Dual-Core Intel Xeon 5160 3.0GHz processor) and various graphics cards (nVidia 8600 GTS card with 32 stream processors @ 675MHz, nVidia 8800 GTX card with 128 stream processors @ 575MHz and ATI Radeon X1900 card with 48 pixel shader processors @ 650MHz).
[Figure 1: bar chart of throughput in GFlops for Intel Xeon 5160, nVidia 8600 GTS, nVidia 8800 GTX and ATI Radeon X1900, with bar labels 38, 139, 330 and 554, respectively]
Figure 1: Benchmark performance comparison for a typical CPU and various GPUs
A significant bottleneck in utilizing GPUs for any type of computing is the transfer of data to and from the card. Thus, it is of paramount importance to reduce data traffic when designing numerical algorithms that utilize GPUs. In this section, computational times required to perform FFTs of various sizes and dimensions on a GPU and on a CPU are discussed. Also, the total round-trip time to compute an FFT, which includes the data transfer time, is measured.
The FFTW library by Frigo and Johnson (2005), which provides a flexible C interface and is one of the fastest FFT algorithm implementations currently available, was used to execute FFTs on a CPU. For executing FFTs on a GPU, the nVidia CUFFT library provides an interface modeled after FFTW and was used for our experiments. The experiments were conducted on an Intel P4 2.8 GHz workstation with an nVidia 8600 GTS video card.
Transform size   CPU time (msec.)   GPU round-trip time (msec.)   GPU time (msec.)
4096             0.41               1.69                          0.10
16384            2.67               2.32                          0.42
65536            11.9               4.66                          1.54
512^2            57                 26                            11
1024^2           238                94                            43
2048^2           1249               426                           170
Table 1: Comparison of one and two-dimensional FFT execution times on CPU and GPU
Table 1 summarizes the timing results for executing one and two-dimensional FFTs of various sizes on CPU and GPU. The CPU time measures the computational time for a combination of forward and backward, complex-to-complex FFTs on a CPU. The GPU time measures the time to perform the same combination of FFTs on a GPU where the data is not moved to or from the device. The GPU round-trip time, again, measures the same combination of FFTs but with data uploaded to the device before and downloaded from the device after the computation. Note that the one-dimensional transforms are out-of-place while the two-dimensional transforms are in-place transforms.
As evident from the results presented in Table 1, if memory transfer is not taken into account, the GPU is more efficient for any transform size and has an asymptotic speedup factor of approximately 7.5 as compared to the CPU. The computational efficiency of the GPU is reduced when the data transfer is taken into account. In fact, for transforms of size less than 16384 the memory round-trip makes the computation of an FFT on a GPU actually slower. For transforms of larger sizes, however, the memory transfer overhead becomes less significant and the GPU achieves a speedup factor of 3.
Single Asset Options
In this section, the FST method on a GPU (FST-GPU) for pricing European and American options on a single asset is discussed. In addition, results for timing tests are presented to compare the efficiency of the FST-GPU method and the FST method on a CPU (FST-CPU). As confirmed by the results of the previous section, memory transfer is a critical issue when designing option pricing algorithms for GPUs. Given the above timing results, one would expect FST-GPU to be marginally less efficient than FST-CPU for pricing of standard European options (where typically only 8192 space points are required to achieve accuracy of 1/10 of a cent), since only two FFT evaluations are performed with a full memory round-trip. For American options, on the other hand, one can expect a greater efficiency of FST-GPU if the algorithm does not require a memory round-trip between every time step.
The FST-GPU algorithm for pricing of European options is outlined in Algorithm 1 and is naturally derived from equation (6). For pricing with $N$ points, the algorithm requires uploading $N$ floating point values for the option payoff and $N/2 + 1$ complex floating point values for the characteristic factor $e^{\Psi \Delta t}$ (since option values are real, half the complex values are redundant due to Hermitian symmetry) and downloading $N$ floating point values of $v^0$ to the host. If the option value is required only at a specific spot price then only one float value has to be downloaded. In addition to the memory transfer, one forward and one backward FFT must be executed. Since only two FFTs are performed per full memory round-trip, the transfer overhead becomes a significant disadvantage of FST-GPU.
Algorithm 1: FST-GPU algorithm for pricing European options
    Input: Option payoff $v^T$, characteristic exponent $\Psi$
    Output: Option values $v^0$
    Upload $v^T$, $e^{\Psi \Delta t}$ to GPU
    $v^0 \leftarrow \mathrm{FFT}^{-1}[\mathrm{FFT}[v^T] \cdot e^{\Psi \Delta t}]$
    Download $v^0$ from GPU
    return $v^0$
To test the performance of the FST-GPU algorithm, a European put option with parameters S = 95, K = 100, T = 1 was priced, where the stock process is modeled by a Kou jump-diffusion model with parameters $\sigma = 0.2$, $\lambda = 0.3$, $p = 0.5$, $\eta_+ = 3$, $\eta_- = 2$, and $r = 0.03$. The timing results are given in Table 2. As expected, the overhead created by the memory transfer makes FST-GPU marginally less efficient than FST-CPU. It must be noted, however, that pricing on a GPU card frees up the CPU resources for other tasks, such as position aggregation and analysis. Thus, it may still be beneficial to utilize GPUs for European option pricing in a typical workstation setup.
Grid points   Price     CPU time (msec.)   GPU time (msec.)
4096          10.6783   4.09               7.09
8192          10.6781   8.38               10.89
16384         10.6782   16.54              18.83
32768         10.6782   33.24              35.21
Table 2: Performance results for pricing of a European option with a Kou jump-diffusion model
The FST-GPU algorithm for pricing American options extends Algorithm 1 by incorporating the time-stepping of equation (7) and is given in Algorithm 2. When $M$ time steps are used, $M$ forward and backward FFTs of size $N$ are executed and $MN$ evaluations of the max function are required. Yet, the algorithm requires the same amount of memory transfer as in the European case. Thus, as $M$ increases, the memory transfer overhead becomes a less significant factor in the performance of FST-GPU.
Algorithm 2: FST-GPU algorithm for pricing American options
    Input: Option payoff $v^T$, characteristic exponent $\Psi$
    Output: Option values $v^0$
    Upload $v^T$, $e^{\Psi \Delta t}$ to GPU
    $v^N \leftarrow v^T$
    for $n \leftarrow N$ to 1 do
        $v^n \leftarrow \mathrm{FFT}^{-1}[\mathrm{FFT}[v^n] \cdot e^{\Psi \Delta t}]$
        $v^{n-1} \leftarrow \max\{v^n, v^T\}$
    end
    Download $v^0$ from GPU
    return $v^0$
The American option test was done on an American put option with parameters S = 100, K = 100, T = 0.25 with the stock process modeled by a Variance Gamma process with model parameter values 0.1, 0.15, and 0.4, and $r = 0.05$. The timing results are given in Table 3.
Grid points   Time points   Price    CPU time (sec.)   GPU time (sec.)
4096          512           2.3577   0.17              0.07
8192          1024          2.3580   0.60              0.25
16384         2048          2.3581   3.27              1.11
32768         4096          2.3582   14.02             4.10
Table 3: Performance results for pricing of an American option with Variance Gamma model
As expected, the FST-GPU outperformed FST-CPU by a factor of 3.4 due to the substantial decrease in memory transfer time as a fraction of the overall computational time.
Multi Asset Options
The computational efficiency of the FFT is not restricted to a single dimension; as such, the FST method for pricing multi-asset options can be readily implemented on a GPU. This section extends the one-dimensional FST-GPU algorithm of the previous section to pricing of European and American options that depend on two assets. From the timing results presented in section 3, one can expect higher efficiency of FST-GPU due to larger transform sizes, with American multi-asset options benefiting the most.
With the derivations of the multi-asset FST method given in section 2, the extensions of Algorithm 1 and Algorithm 2 to the two-asset setting are discussed below. In the two-asset case, the option payoff $v^T$ constitutes a matrix of values, while $\Psi$ is the corresponding characteristic exponent matrix with the same dimensions. Similarly, FFT and $\mathrm{FFT}^{-1}$ refer to the two-dimensional forward and backward FFT algorithms. For pricing with $N \times N$ space points, the algorithm requires uploading $N^2$ floating point values for the option payoff and $N(N/2 + 1)$ complex floating point values for the characteristic factor $e^{\Psi \Delta t}$ (again, due to Hermitian symmetry). Also, $N^2$ floating point values are downloaded to the host, while only one float value may be downloaded if the entire price surface is not needed. As in the single-asset case, pricing of European options requires one forward and one backward two-dimensional FFT, while for path-dependent options $M$ forward and backward FFTs are needed.
In the first test case, a European spread call option with parameters $S_1 = 98$, $S_2 = 100$, $K = 2$, $T = 3$ and payoff given by $\varphi(S_1(T), S_2(T)) = (S_1(T) - S_2(T) - K)^+$ was priced. The stock process is modeled by a two-dimensional Merton jump-diffusion model with parameter values 0.1, 0.125, 0.13, 0.37 for the first asset, 0.2, 0.25, 0.11, 0.41 for the second asset, a common parameter of 0.5, and $r = 0.1$. Table 4 provides timing results. Similarly to the results in the one-dimensional European case, FST-GPU and FST-CPU produce comparable results. Due to the fixed overhead associated with each memory transfer, transforms of large size are relatively more efficient than small transforms. As opposed to the single-asset case, the large size of the problem has made the multi-asset FST-GPU more efficient than FST-CPU.
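The two-asset European step described above can be sketched on a CPU with NumPy, which may help when checking memory sizes before moving to a GPU. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `fst_european_2d` and `margrabe`, the grid half-width `x_max`, and the pure-diffusion characteristic exponent (the jump term of the Merton model is omitted) are all choices made here. The real-to-complex transform `rfft2` stores only the Hermitian-unique half of the spectrum, so the characteristic factor has N(N/2 + 1) complex entries, matching the upload count quoted in the text.

```python
import numpy as np
from math import erf, log, sqrt

def fst_european_2d(payoff, s1, s2, T, r, sig1, sig2, rho, n=512, x_max=4.0):
    """One FST step of size T on an n x n log-price grid.

    Assumption: correlated Brownian motions only (no jumps), so the
    characteristic exponent is a quadratic form in the frequencies.
    Returns the price at spot (s1, s2) and the shape of the factor.
    """
    dx = 2.0 * x_max / n
    x = -x_max + dx * np.arange(n)            # x[n // 2] == 0 is today's spot
    X1, X2 = np.meshgrid(x, x, indexing="ij")
    v_T = payoff(s1 * np.exp(X1), s2 * np.exp(X2))   # n^2 real payoff values

    # Angular frequencies: full axis for asset 1, half axis for asset 2.
    w1 = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)[:, None]
    w2 = 2.0 * np.pi * np.fft.rfftfreq(n, d=dx)[None, :]
    mu1 = r - 0.5 * sig1**2                   # risk-neutral drifts
    mu2 = r - 0.5 * sig2**2
    psi = (1j * (mu1 * w1 + mu2 * w2)
           - 0.5 * (sig1**2 * w1**2
                    + 2.0 * rho * sig1 * sig2 * w1 * w2
                    + sig2**2 * w2**2))
    factor = np.exp(psi * T)                  # n * (n // 2 + 1) complex values

    # Equation (6): forward FFT, componentwise product, inverse FFT.
    v_0 = np.fft.irfft2(np.fft.rfft2(v_T) * factor, s=v_T.shape)
    # v is discount-adjusted, so discount once at the end.
    return np.exp(-r * T) * v_0[n // 2, n // 2], factor.shape

def margrabe(s1, s2, T, sig1, sig2, rho):
    """Closed-form exchange-option price, used only as a sanity check."""
    sig = sqrt(sig1**2 + sig2**2 - 2.0 * rho * sig1 * sig2)
    d1 = (log(s1 / s2) + 0.5 * sig**2 * T) / (sig * sqrt(T))
    cdf = lambda d: 0.5 * (1.0 + erf(d / sqrt(2.0)))
    return s1 * cdf(d1) - s2 * cdf(d1 - sig * sqrt(T))

# Spread call with K = 0 (an exchange option), so a closed form is available.
exchange = lambda a, b: np.maximum(a - b, 0.0)
price, factor_shape = fst_european_2d(exchange, 100.0, 100.0, 1.0, 0.05, 0.2, 0.3, 0.5)
reference = margrabe(100.0, 100.0, 1.0, 0.2, 0.3, 0.5)
```

The strike is set to zero here only so that the result can be compared with a closed form; a non-zero strike or a jump term changes the payoff lambda and `psi` but not the structure of the step.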
Grid points   Price     CPU time (sec.)   GPU time (sec.)
256^2         15.5678   0.10              0.11
512^2         15.5723   0.47              0.45
1024^2        15.5729   1.98              1.87
2048^2        15.5729   8.32              7.90
Table 4: Performance results for pricing of a European spread option
As an example of the path-dependent test case, an American double-trigger stop-loss option with parameters $S = 120$, $K = 100$, $T = 0.25$, $L_a = 5$, $L_d = 40$ and payoff given by $\varphi(S(T), L(T)) = \mathbb{1}_{S(T)<K}\left[(L(T) - L_a)^+ - (L(T) - L_d)^+\right]$ was priced. The joint stock price and loss dynamics are given by
$$S(t) = S(0)\exp\{-\alpha L(t) + \mu t + \sigma W_t\}, \qquad L(t) = \sum_{n=1}^{N(t)} l_n.$$
Here, $N(t)$ is a Poisson process which drives the arrival of losses $l_n$ drawn from a Gamma distribution with mean $m$ and variance $v$ (refer to Jaimungal and Wang (2006) and Jackson, Jaimungal, and Surkov (2007b) for more details on the double-trigger stop-loss contracts and compound Poisson loss processes). The parameters for this test case are $\sigma = 0.15$, $r = 0.05$, $\lambda = 1$, $m = 2$, $v = 5$, and $\alpha = 0.005$; the timing results for this test case are presented in Table 5.
Figure 2: Price surface of an American double-trigger stop-loss option
Grid points   Time points   Price    CPU time (sec.)   GPU time (sec.)
256^2         128           3.5256   1.0               0.2
512^2         256           3.7416   9.9               1.4
1024^2        512           3.7785   82.5              10.9
2048^2        1024          3.7935   713.8             142.4
Table 5: Performance results for pricing of an American double-trigger stop-loss option
Here, the FST-GPU outperformed FST-CPU by a factor of 7.5. Note that for the largest grid size, the speedup factor is only 5, apparently due to overheating and thus lower performance of the GPU card (the overheating can be remedied by providing appropriate cooling to the system, which was not available for this experiment). Still, pricing of American options with FST-GPU is the more efficient method by far.
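The reason American options gain the most from the GPU is that the time-stepping loop of Algorithm 2 touches host memory only before the first and after the last step. The following one-dimensional NumPy sketch illustrates that loop structure on a CPU. It is an illustration under stated assumptions rather than the paper's code: the function name `fst_put_1d`, the Merton jump parameters `lam`, `mu_j`, `sig_j`, and the grid half-width `x_max` are hypothetical, and the discount rate is folded into the exponent (`- r`) so that the early-exercise comparison is made on prices rather than on the discount-adjusted values of equation (1).

```python
import numpy as np

def fst_put_1d(s0, K, T, r, sigma, lam=0.0, mu_j=0.0, sig_j=0.0,
               n=4096, m=1, x_max=7.5, american=False):
    """Price a put at spot s0 with m FST steps on an n-point log-price grid."""
    dx = 2.0 * x_max / n
    x = -x_max + dx * np.arange(n)               # x[n // 2] == 0
    payoff = np.maximum(K - s0 * np.exp(x), 0.0)
    w = 2.0 * np.pi * np.fft.rfftfreq(n, d=dx)   # n // 2 + 1 frequencies

    # Assumed model: Merton jump-diffusion with risk-neutral drift.
    # Setting lam = 0 recovers the BSM model.
    kappa = np.exp(mu_j + 0.5 * sig_j**2) - 1.0
    mu = r - 0.5 * sigma**2 - lam * kappa
    psi = (1j * mu * w - 0.5 * sigma**2 * w**2
           + lam * (np.exp(1j * mu_j * w - 0.5 * sig_j**2 * w**2) - 1.0)
           - r)                                  # "- r" discounts every step
    factor = np.exp(psi * (T / m))               # computed ("uploaded") once

    v = payoff.copy()
    for _ in range(m):                           # no transfers inside the loop
        v = np.fft.irfft(np.fft.rfft(v) * factor, n=n)   # equation (5)
        if american:
            v = np.maximum(v, payoff)            # exercise check, equation (7)
    return v[n // 2]                             # value at today's spot only

european = fst_put_1d(100.0, 100.0, 1.0, 0.05, 0.2)
american = fst_put_1d(100.0, 100.0, 1.0, 0.05, 0.2, m=200, american=True)
```

With `m` steps the loop performs `m` forward and `m` backward transforms against a fixed amount of input and output data, which is the ratio that drives the speedups reported for the American test cases.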
Parallel Option Pricing
The inherently parallel, multi-processor structure of GPUs makes them especially attractive for parallel solution of multiple problems. As this section will show, multiple single-asset options can be priced efficiently by utilizing precisely this property and running multiple algorithms concurrently.
The CUFFT library provides a convenient interface for evaluating multiple one-dimensional FFTs concurrently and thus fully exploiting the parallel architecture of the GPU. Results in Figure 3 show that by batching the FFTs together one can achieve a higher throughput on the GPU. For FFTs of size less than 4096 points, the GPU achieves an increase in throughput of 50% when the batch size is increased from 1 to 64; however, the marginal return of parallelizing the FFTs diminishes as the size of the FFT increases. In fact, for FFTs of size larger than 16384 points, there is no advantage in throughput across the various batch sizes. Given the large size of the FFT and the relatively small number of processors on the nVidia 8600 GTS, there is little idling of the processors and thus little benefit to parallelization. Still, more powerful GPUs with a larger number of processors, such as the nVidia 8800 GTX, should retain the benefits of parallelization even at such transform sizes.
[Figure 3: FFT throughput (FFTs / ms) versus batch size (4, 16, 64) for N = 4096, N = 8192, N = 16384 and N = 32768]
Figure 3: Performance of batching FFT computation
The increase in FFT throughput due to batching directly translates into faster pricing of a large number of options, as shown in Table 6. In this example, European options were priced with 8192 space points and various batch sizes, ranging from 4 to 64. Parallelization of FST-GPU increased the throughput of the algorithm from 197 options per second to 360 options per second, an increase of 83%. While most of the performance gains came from parallelization of FFTs, streamlining of memory data transfer contributed as well. For path-dependent options, where the computation of FFTs takes up the majority of computational time, the performance gain due to parallelization of FST-GPU would be even higher.
Batch size   GPU time (msec.)   Options/sec.
4            20.21              197
8            29.35              272
16           50.78              315
32           94.20              339
64           177.78             360
Table 6: Performance results for parallel pricing of European options
Conclusions
This paper presents GPU-based pricing algorithms for European and American options that depend on one or two assets. The algorithms implement the FST method on a GPU architecture and are especially effective for path-dependent options and in concurrent pricing, i.e. scenarios where the memory transfer is a small part of the overall computational cost. Further performance gains are achieved by parallelizing the pricing of multiple options.
When evaluating the potential of GPU-based numerical methods, speed and precision are two of the most critical factors to be considered. At the time of writing of this paper, more powerful GPUs with a larger number of processors, such as the nVidia 8800 GTX, have become available on the market. The increase in the number of parallel processors should increase the performance of the FST-GPU algorithm, especially for multi-asset and concurrent option pricing.
In terms of precision, GPUs are at a disadvantage to CPUs, because they currently do not support double precision floating point values. Although there is research under way to achieve double precision accuracy using iterative single precision algorithms (see, for instance, Göddeke, Strzodka, and Turek (2005)), native implementation of double precision is essential for GPUs to become widespread in general scientific computing. Nonetheless, single precision implementations of pricing algorithms can still be used in settings such as risk management, where speed considerations outweigh precision.
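The precision trade-off for concurrent pricing can be probed on a CPU before committing to hardware. The sketch below prices a batch of European puts, one per strike, with a single batched transform along the last array axis, and repeats the computation with the payoff, characteristic factor, spectrum and result rounded to single precision. Several points are assumptions made for illustration: the function name `fst_batch_european_puts`, the BSM exponent, and the grid half-width `x_max` are not from the paper, and depending on the NumPy version the transform itself may still run in double precision internally, so the rounding only approximates a native single precision GPU FFT.

```python
import numpy as np

def fst_batch_european_puts(s0, strikes, T, r, sigma,
                            n=8192, x_max=7.5, dtype=np.float64):
    """Price one European put per strike with one batched FFT round trip."""
    cdtype = np.complex64 if dtype == np.float32 else np.complex128
    dx = 2.0 * x_max / n
    x = -x_max + dx * np.arange(n)                   # x[n // 2] == 0
    w = 2.0 * np.pi * np.fft.rfftfreq(n, d=dx)

    # Assumed model: BSM exponent; the factor is shared by the whole batch.
    psi = 1j * (r - 0.5 * sigma**2) * w - 0.5 * sigma**2 * w**2
    factor = np.exp(psi * T).astype(cdtype)

    k = np.asarray(strikes, dtype=float)[:, None]
    payoffs = np.maximum(k - s0 * np.exp(x), 0.0).astype(dtype)  # (batch, n)

    # Batched transforms along the last axis, one row per option.
    spectrum = np.fft.rfft(payoffs, axis=-1).astype(cdtype) * factor
    values = np.fft.irfft(spectrum, n=n, axis=-1).astype(dtype)
    return np.exp(-r * T) * values[:, n // 2]        # prices at today's spot

strikes = [90.0, 100.0, 110.0, 120.0]
prices_double = fst_batch_european_puts(100.0, strikes, 1.0, 0.05, 0.2)
prices_single = fst_batch_european_puts(100.0, strikes, 1.0, 0.05, 0.2,
                                        dtype=np.float32)
max_gap = float(np.max(np.abs(prices_double - prices_single)))
```

Comparing `max_gap` with the accuracy target of 1/10 of a cent mentioned earlier gives a first indication of whether single precision is adequate for a given grid and payoff; it is not a substitute for measuring the error of the actual GPU transform.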
The results presented in this paper show that utilizing GPUs in option pricing via the FST method can bring about tremendous computational advantages. Further advances in the performance of this type of dedicated hardware will result in a bigger edge for GPU-based pricing methods over the corresponding CPU implementations. Moreover, the increase in functionality of GPUs could make them a tool of choice in computational finance and scientific computing as a whole.
References
Almendral, A. and C. W. Oosterlee (2005). Numerical Valuation of Options with Jumps in the Underlying. Applied Numerical Mathematics 53, 1-18.
Andersen, L. and J. Andreasen (2000). Jump-Diffusion Processes: Volatility Smile Fitting and Numerical Methods for Option Pricing. Review of Derivatives Research 4, 231-262.
Briani, M., R. Natalini, and G. Russo (2004). Implicit-Explicit Numerical Schemes for Jump-Diffusion Processes. IAC Report 38, Istituto per le Applicazioni del Calcolo IAC-CNR.
Colb, K. and M. Phar (2005). Option Pricing on the GPU. GPU Gems 2, 719-731.
Cont, R. and E. Voltchkova (2005). A Finite Difference Scheme for Option Pricing in Jump Diffusion and Exponential Lévy Models. SIAM Journal on Numerical Analysis 43(4), 1596-1626.
d'Halluin, Y., P. A. Forsyth, and G. Labahn (2003). A Penalty Method for American Options with Jump Diffusion Processes. Numerische Mathematik 97(2), 321-352.
Frigo, M. and S. G. Johnson (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE 93(2), 216-231. Special issue on Program Generation, Optimization, and Platform Adaptation.
Göddeke, D., R. Strzodka, and S. Turek (2005). Accelerating Double Precision FEM Simulations with GPUs. In F. Hülsemann, M. Kowarschik, and U. Rüde (Eds.), Proceedings of the 18th Symposium on Simulation Technique (ASIM 2005), pp. 139-144. SCS Publishing House e.V.
Jackson, K. R., S. Jaimungal, and V. Surkov (2007a). Fourier Space Time-stepping for Option Pricing with Lévy Models. Preprint.
Jackson, K. R., S. Jaimungal, and V. Surkov (2007b). Option Pricing with Regime Switching Lévy Processes using Fourier Space Time-stepping. In Proceedings of the Fourth IASTED International Conference on Financial Engineering and Applications, pp. 92-97.
Jaimungal, S. and T. Wang (2006). Catastrophe Options with Stochastic Interest Rates and Compound Poisson Losses. Insurance: Mathematics and Economics 38(3), 469-483.
Podlozhnyuk, V. (2007). Black-Scholes Option Pricing. Part of CUDA SDK documentation.
