Jerzy Kaniewski, Robert Berezowski, Dariusz Gretkowski, Oleg Maslennikow and Przemysław Sołtan
Technical University of Koszalin, Institute of Electronics, ul. Partyzantów 17, 75-411 Koszalin, POLAND
Abstract: In this paper, the design of processor array (PA) architectures for solving DSP problems is discussed, using the digital FIR-filtering algorithm as an example. First, the main stages of the proposed methodology [4] for PA structure design are described. Then, as an example, the design of PA architectures performing the FIR filtering algorithm is described. To derive array architectures with the desired features, some purposeful transformations of the basic algorithm dependence graph are employed. Then the library of the derived FIR-filter VHDL models is presented. Finally, the emulation parameters of several filter structures and of different variants of the multiply unit realization are shown. As a result, the obtained filter structures have high throughput, high hardware utilization and minimized sensitivity to truncation errors.

1 Introduction
The solution of numerous digital signal processing (DSP) problems reduces to linear algebraic computations and must be carried out in real time [1,3]. Moreover, most DSP problems, such as digital filtering, Fourier transformation, and the solution of linear systems and least squares problems, are characterized by high computational complexity [3,4]. This makes it necessary to solve these problems on application-specific parallel systems. VLSI processor arrays (PA) [3,4,5] are examples of such architectures. Using massive pipelining, these arrays exploit the regularity inherent in many algorithms to achieve high performance while keeping communications local and I/O requirements low. Note that the architectures of VLSI processor arrays can be designed systematically [1,3,4,5] for regular algorithms, and that PAs can be realized as ASICs or as FPGA-based circuits. Implementing DSP systems in FPGAs has a number of advantages, such as full adaptation of the implemented structure to the algorithm, high throughput and scalability, effective hardware utilization, high computational precision and the lowest cost/performance ratio.

Therefore, in this paper, the design of PA architectures for solving DSP problems is discussed, using the digital FIR-filtering algorithm as an example. First, the main stages of the proposed methodology [4] for PA structure design are described. Then, as an example, the design of a PA architecture performing the FIR filtering algorithm is described. To derive an array architecture with the desired features, some purposeful transformations of the basic algorithm dependence graph are employed. Since the array architectures obtained in this way depend strongly on the parameters of the filtering task (for example, the number of filter coefficients), we show how these architectures should be modified in order to process an arbitrarily large task on fixed-size arrays. Then the library of the derived FIR-filter VHDL models is presented. This library contains a set of parametrized FIR filter structures and is intended for the computer-aided design of digital signal processing systems based on FPGAs. Finally, the emulation parameters of several filter structures and of different variants of the multiply unit realization are shown. As a result, the obtained filter structures have high throughput, high hardware utilization and minimized sensitivity to truncation errors.
G is transformed into a set of structural schemes $C = \langle S, T, \Gamma \rangle$ of arrays implementing this algorithm, where S is a directed graph called the array structure, T is the synchronization function specifying the computation time of the nodes of the dependence graph (DG), and $\Gamma$ is the set of operation algorithms of the PEs. One of the most promising approaches to mapping recursive algorithms with regular dependencies onto processor arrays consists in finding [3,4] a linear mapping operator F which transforms each node K of the algorithm DG into the corresponding node of the structure graph S:

$$F: K^n \to K_F^{m+1}, \quad F(K) = F \cdot K, \quad K \in K^n, \qquad (1)$$

where m is the dimension of the PA structure ($m+1 \le n$). The operator F is an $(m+1) \times n$ matrix composed of two components, the space mapping $F_S$ and the time mapping $F_T$:

$$F = \begin{bmatrix} F_S \\ F_T \end{bmatrix} \in Z^{(m+1) \times n}$$

Thus, an arbitrary DG node $K \in K^n$ is executed in the processor element (PE) with coordinates $F_S K$ at time step (tact) $F_T K$. The operator F must satisfy the following conditions, where D is the set of dependence vectors of the algorithm:

1. $F_T d > 0, \ \forall d \in D$;
2. $\forall K_1, K_2 \in K^n \ (K_1 \ne K_2 \Rightarrow F K_1 \ne F K_2)$;     (2)
3. $\mathrm{rank}(F) = m+1$.

According to the methodology [4], the set of all possible nonequivalent allocation mappings $F_S(K)$ satisfying the given constraints on the links between PEs (which are located at the vertices of a lattice $K^m \subset Z^m$) is determined first. Then, for each network topology S corresponding to this set, an optimal schedule mapping that implements the algorithm correctly is found. This mapping is constructed as a linear (or affine) function $F_T$ with n unknown coefficients.

A basic requirement in practical PA design is the ability to process large tasks (large DGs) on processor arrays with a fixed number of PEs. To provide this ability, two partitioning methods [3] are usually used: the locally sequential globally parallel (LSGP) method and the locally parallel globally sequential (LPGS) method. Both are based on decomposing the DG of an algorithm into a set of regular subgraphs, but they differ in how these subgraphs are mapped onto the resulting architecture. In the LSGP method, one subgraph is mapped to one PE, and each PE sequentially executes the nodes of its subgraph. Therefore, additional local memory is needed within each PE. To avoid this disadvantage, the LPGS method maps one subgraph to one array. All nodes within one subgraph are processed concurrently, while the subgraphs are processed sequentially. As a result, all intermediate data corresponding to data dependencies between subgraphs must be stored in buffers outside the processor array. This methodology is described in detail in Ref. [4] and is used in the next section to design the FIR-filter PA architectures.
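The three conditions (2) on the mapping operator can be checked mechanically for a candidate F over a finite dependence graph. The following is a minimal sketch under stated assumptions: the function names (`rank`, `check_mapping`, `fir_nodes`) are hypothetical, the dependence vectors and the node lattice are those given for the FIR example later in the paper, and the candidate mappings $F_S = [0, 1]$, $F_T = [1, 1]$ are illustrative choices, not mappings taken from the methodology [4].

```python
from fractions import Fraction


def rank(matrix):
    """Rank of an integer matrix, by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def check_mapping(F_S, F_T, deps, nodes):
    """Return (causal, injective, full_rank) for F = [F_S; F_T].

    causal    : F_T * d > 0 for every dependence vector d   (condition 1)
    injective : distinct DG nodes get distinct images F * K (condition 2)
    full_rank : rank(F) equals the number of rows, m + 1    (condition 3)
    """
    F = [list(row) for row in F_S] + [list(F_T)]
    causal = all(sum(t * d_i for t, d_i in zip(F_T, d)) > 0 for d in deps)
    images = [tuple(sum(f * k for f, k in zip(row, K)) for row in F) for K in nodes]
    injective = len(set(images)) == len(images)
    full_rank = rank(F) == len(F)
    return causal, injective, full_rank


def fir_nodes(K, N):
    """Nodes (n, k) of the FIR lattice: 1 <= k <= K, k <= n <= N + k."""
    return [(n, k) for k in range(1, K + 1) for n in range(k, N + k + 1)]


# Dependence vectors of the FIR example, written as (n, k) pairs.
FIR_DEPS = [(1, 1), (0, 1), (1, 0)]

if __name__ == "__main__":
    nodes = fir_nodes(K=3, N=4)
    print(check_mapping([[0, 1]], [1, 1], FIR_DEPS, nodes))
    print(check_mapping([[0, 1]], [1, 0], FIR_DEPS, nodes))
```

In this sketch the schedule $F_T = [1, 1]$ satisfies all three conditions, whereas $F_T = [1, 0]$ fails condition 1 because it assigns the same time step to both ends of the dependence $d_2 = [0, 1]$.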
The FIR filtering algorithm is defined by the expression

$$y[n] = \sum_{k=1}^{K} a[k]\, x[n-k+1], \qquad (3)$$

where x[n-k+1] are the input data, a[k] are the filter coefficients and y[n] are the output data (k = 1, ..., K; n = 1, ..., N+K-1), K is the number of coefficients and N is the number of input samples. The dependence graph G of this algorithm, constructed according to Ref. [6], is shown in Fig. 1 (on the left). The nodes of G are located at the nodes of the two-dimensional lattice $Q_1 = \{K = (n,k): 1 \le k \le K,\ k \le n \le N+k\}$. Each DG node corresponds to a full bit-parallel multiply-add operation, i.e. the DG G represents the FIR filtering algorithm at the level of input words. The data dependencies (arcs) between the nodes of G are represented by three different vectors $d_1, d_2, d_3$, which form the dependence matrix D of the algorithm:

$$D = [d_1, d_2, d_3] = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \begin{matrix} n \\ k \end{matrix}$$

The vectors $d_1 = [1,1]^T$ and $d_2 = [0,1]^T$ correspond, respectively, to the pipelined propagation of the variables x[n-k+1] and a[k] between DG nodes, while the vector $d_3 = [1,0]^T$ corresponds to the transmission of the partial results y[n] for updating. The dimension of the algorithm DG at the word level is n = 2. Therefore, PA structures with m = 2, m = 1 or m = 0 can be considered.
Fig. 1. Dependence graph G of the 1-D filtering algorithm

The 2-D filter structures have the largest throughput. However, with full bit-parallel calculations, the high hardware cost of such structures often does not permit their implementation in FPGAs. Therefore, bit-level algorithm mapping is often performed to synthesize a structure with bit-serial calculations, which provides low hardware cost and pipelined computation. In this case, however, the quantization frequency must be equal to $f_Q = 1/(l\,t_C)$, where l is the word length. For these reasons, the 2-D filter structures with full bit-parallel calculations are not considered in this paper.

To derive a 1-D PA architecture, the DG G is projected onto the 1-D hyperplane determined by the allocation mapping operator $F_S$. For example, one possible 1-D structure is the structure S1, shown in Fig. 1 (on the right). This structure, which corresponds to the projection of G along the n-axis, contains K PEs, one input channel and K output channels, and implements the filtering algorithm with an asymptotic time period of T = N time steps. Each PE contains a multiplier, an adder and three registers.
The drawbacks of the structure S1 are the comparatively large number of registers in the PEs and the large number of output channels. The latter drawback can be eliminated at the cost of increased PE control overhead and latency. Therefore, to derive a 1-D array structure with a minimal number of I/O channels and PE registers and with a fixed number p < K/2 of PEs, we transform the graph G and decompose it into subgraphs. In particular, we reverse the direction of the vector $d_3 = [1,0]$ and join in pairs the neighboring DG nodes located along the k-axis (see Fig. 1). The result is the modified graph G*, shown in Fig. 2 (on the left) for K = 8. Then we decompose G* into a set of $s = \lceil K/(2p) \rceil$ subgraphs of the same topology, where $\lceil q \rceil$ denotes the smallest integer greater than or equal to q. As is evident from Fig. 2, this can be done by cutting G* with a set of straight lines parallel to the n-axis. These lines decompose G* into s regular subgraphs with p layers each (see Fig. 2).
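The decomposition described above can be illustrated with a functional model. The sketch below is not cycle-accurate and does not reproduce the authors' VHDL models: it only shows that evaluating the filter in s sequential passes over p PEs, each PE handling a pair of adjacent coefficients, gives the same result as the direct form of Eq. (3). The names `fir_direct`, `fir_partitioned` and `execution_time` are hypothetical; the list `y` stands in for the partial sums that a fixed-size array would hold in external buffer memory between passes, and `execution_time` simply evaluates the expression T = s(N + K - 1) that the paper states for the structure S2.

```python
import math


def fir_direct(a, x):
    """Direct evaluation of y[n] = sum_k a[k] * x[n-k+1], n = 1..N+K-1."""
    K, N = len(a), len(x)
    y = []
    for n in range(1, N + K):
        acc = 0
        for k in range(1, K + 1):
            i = n - k + 1              # 1-based index of the input sample
            if 1 <= i <= N:
                acc += a[k - 1] * x[i - 1]
        y.append(acc)
    return y


def fir_partitioned(a, x, p):
    """Evaluate the filter in s = ceil(K / (2p)) sequential passes.

    In each pass, p PEs each handle one pair of adjacent coefficients;
    partial sums are carried between passes in `y` (assumed external buffer).
    Returns the outputs and the number of passes s.
    """
    K, N = len(a), len(x)
    s = math.ceil(K / (2 * p))
    y = [0] * (N + K - 1)
    for block in range(s):             # subgraphs are processed sequentially
        for pe in range(p):            # PEs within a subgraph work concurrently
            k0 = 2 * (block * p + pe) + 1
            for k in (k0, k0 + 1):     # the pair of joined DG nodes
                if k > K:
                    continue           # padding when K is not a multiple of 2p
                for n in range(1, N + K):
                    i = n - k + 1
                    if 1 <= i <= N:
                        y[n - 1] += a[k - 1] * x[i - 1]
    return y, s


def execution_time(s, N, K):
    """Total number of time steps, T = s * (N + K - 1), as stated for S2."""
    return s * (N + K - 1)


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]       # K = 8 coefficients
    x = [1, 0, 2, -1, 3]               # N = 5 input samples
    y, s = fir_partitioned(a, x, p=2)
    print(y == fir_direct(a, x), s, execution_time(s, len(x), len(a)))
```

With K = 8 and p = 2 the model uses s = 2 passes, which illustrates the trade-off of the partitioning: halving the number of PEs doubles the number of passes over the input data.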
Then we project each resulting subgraph onto the k*-axis to obtain the fixed-size array structure S2 shown in Fig. 2 (on the right), where FIFO denotes the external FIFO memory block. The total execution time T of the FIR filtering algorithm is equal to

$$T = s\,(N + K - 1)$$

time steps, and the asymptotic processor utilization is 1. Moreover, the latency is one time step. The internal structure of the k*-th PE is detailed in Fig. 3, where RG denotes a register, SM an adder and MUL a multiplier.

Fig. 3. Internal PE structure for the PA S2

Note that the methodology [4] also allows zero-dimensional (m = 0) PA structures, i.e. structures containing only one PE, to be designed. Examples of such structures are the structures S3 and S4, shown in Fig. 4 and Fig. 5 respectively. Here FIFO(q) denotes a FIFO memory block with q cells.

Fig. 4. Structure S3 of the one-processor filter PA

Fig. 5. Structure S4 of the one-processor filter PA

The advantage of the filter structure S3 is its minimal latency, which is equal to one time step, while the advantage of the structure S4 is its lower hardware overhead (smaller data width of the FIFO(K-1)). The input frequency of the X and Y data is equal (for both filters) to f/K, where f is the internal operating frequency of the filter.

As mentioned above, with full bit-parallel calculations the high hardware cost of such structures often does not permit their implementation in FPGAs [1]. Therefore, bit-level algorithm mapping is often performed to synthesize a structure with bit-serial calculations. The proposed methodology [4] allows bit-level PA structures to be derived if the input algorithm describes the calculation at the bit level. For example, the 1-D FIR filtering algorithm at the bit level is represented by a four-dimensional dependence graph [6]. Using the methodology [6], the bit-level PA architecture S5 was designed. It is shown in Fig. 6 and consists of K multiply-add units (MU) and FIFO blocks of different sizes (Xdelay, Ddelay, Ydelay). Each MU block consists of one-bit cells. The internal structure of a cell of the first type is shown in Fig. 7, where & denotes the AND operation and D denotes a D flip-flop.

Fig. 6. Structure S5 of the bit-level filter PA
Fig. 8. Internal structure of the ROM-based multiply unit

The model parameters obtained for the case $L_x = L_a = 16$ and K = 8 are presented in Table 1, where CLB denotes a configurable logic block of the FPGA and LUT denotes a look-up table.
Table 1

Structure   CLB    Flip-flops   4-LUT
S2          468    870
S4          130    69
S5          1687   3276
S6          44
5 Conclusions
Implementing application-specific parallel systems in FPGAs has a number of advantages, such as full adaptation of the implemented structure to the applied algorithms, high performance, high computational precision, and a reduction of both the time from idea to market and the development cost. Moreover, only an FPGA-based implementation of the system makes it possible to obtain the highest hardware utilization and the lowest cost/performance ratio. Therefore, in this paper, the design of FPGA-based parallel FIR filter structures is presented. The authors propose a method for mapping algorithms onto PA structures, which helps to design high-performance filters. This method has shown good results in the development of a library of FIR filter structures. The library contains a set of parametrized FIR filter structures and is intended for the computer-aided design of digital signal processing systems based on FPGAs. It consists of structural models of filters described in VHDL in a synthesizable style. All models are characterized by high technical parameters.
References
[1] J. Isoaho, J. Pasawn, O. Vaino, H. Terhunen. DSP System Integration and Prototyping With FPGAs. J. VLSI Signal Processing, 1993, 6, pp. 155-172.
[2] The Synthesis Approach to Digital System Design / Ed.: P. Michel, U. Lauther, P. Duzy. Kluwer Academic Pub., 1992.
[3] Kung S.Y. VLSI processor arrays. Prentice Hall, Englewood Cliffs, 1988.
[4] Wyrzykowski R., Kanevski J.S., Maslennikov O. Mapping recursive algorithms into processor arrays. Proc. Int. Workshop Parallel Numerics'94, M. Vajtersic and P. Zinterhof eds., Smolenice (Slovakia), 1994, pp. 169-191.
[5] Moreno J.H., Lang T. Matrix computations on systolic-type arrays. Kluwer Acad. Publ., Boston, 1992.
[6] Wyrzykowski R., Kanevski Ju.S., Maslennikov O.V., Maslennikova N.N. A Method for Deriving Dependence Graphs of Recursive Algorithms for Processor Array Design. Proc. Int. Workshop "Parallel Numerics'95", Sorrento, Italy, 1995, pp. 263-280.