
VHDL-MODELS OF PARALLEL FIR DIGITAL FILTERS

Jerzy Kaniewski, Robert Berezowski, Dariusz Gretkowski, Oleg Maslennikow and Przemysław Sołtan
Technical University of Koszalin, Institute of Electronics, ul. Partyzantów 17, 75-411 Koszalin, POLAND

Abstract: This paper discusses the design of processor array (PA) architectures for DSP problems, using the digital FIR filtering algorithm as an example. First, the main stages of the proposed methodology [4] for PA structure design are described. Then, as an example, the design of PA architectures performing the FIR filtering algorithm is described. To derive array architectures with the desired features, purposeful transformations of the basic algorithm dependence graph are employed. Next, the library of derived FIR-filter VHDL models is presented. Finally, the emulation parameters of several filter structures and of different variants of the multiply unit are shown. The resulting filter structures have high throughput, high hardware utilization and minimized sensitivity to truncation errors.

1 Introduction
The solution of numerous digital signal processing (DSP) problems reduces to linear algebraic computations that should be carried out in real time [1,3]. Moreover, most DSP problems, such as digital filtering, the Fourier transform, and the solution of linear systems and least squares problems, are characterized by high computational complexity [3,4]. This implies the necessity of solving these problems on application-specific parallel systems. VLSI processor arrays (PAs) [3,4,5] are examples of such architectures. Using massive pipelining, these arrays exploit the regularity inherent in many algorithms to achieve high performance while keeping communications local and I/O requirements low. Note that architectures of VLSI processor arrays can be designed systematically [1,3,4,5] for regular algorithms, and that PAs can be realized as ASICs or as FPGA-based circuits. Implementing DSP systems in FPGAs has a number of advantages, such as full adaptation of the implemented structure to the algorithm, high throughput and scalability, effective hardware utilization, high computational precision and a low cost/performance ratio. Therefore, this paper discusses the design of PA architectures for DSP problems, using the digital FIR filtering algorithm as an example.

First, the main stages of the proposed methodology [4] for PA structure design are described. Then, as an example, the design of a PA architecture performing the FIR filtering algorithm is described. To derive array architectures with the desired features, purposeful transformations of the basic algorithm dependence graph are employed. Since the array architectures obtained in this way depend strongly on the parameters of the filtering task (for example, on the number of filter coefficients), we show how these architectures should be modified in order to process a task of arbitrarily large size on fixed-size arrays. Next, the library of derived FIR-filter VHDL models is presented. This library contains a set of parametrized FIR filter structures and is intended for the computer-aided design of FPGA-based digital signal processing systems. Finally, the emulation parameters of several filter structures and of different variants of the multiply unit are shown. The resulting filter structures have high throughput, high hardware utilization and minimized sensitivity to truncation errors.

2 Methodology of the PA structures design (overview)


Architectures of VLSI processor arrays can be designed systematically [3,4,5] by applying linear (or affine) mappings to algorithms expressed as systems of recursive equations or nested loops. Such algorithms are regular and can be represented [3,4,6] by regular or quasi-regular dependence graphs (DGs), or by a composition of them. Each node of such a DG corresponds to a certain operator (or iteration) of the original algorithm and is associated with an integer vector K = (k1,...,kn). All nodes are located at the vertices K of a lattice $\mathbf{K}^n \subset \mathbf{Z}^n$, where $\mathbf{K}^n$ is called the index space. If the iteration corresponding to a node K2 depends on the iteration corresponding to another node K1, this dependence is represented by the dependence vector d = K2 - K1. The set of all distinct dependence vectors d of the DG forms the dependence matrix D of the algorithm.

In the course of mapping, a given algorithm AL with the dependence graph G is transformed into a set of structural schemes C of arrays implementing this algorithm. Each scheme is a triple consisting of S, a directed graph called the array structure, T, the synchronization function specifying the computation time of the nodes of the DG, and the set of operation algorithms of the PEs. One of the most promising approaches to mapping recursive algorithms with regular dependencies into processor arrays consists [3,4] in finding a linear mapping operator F which transforms each node K of the algorithm DG into the corresponding node of the structure graph S:

$$ F : \mathbf{K}^n \to \mathbf{K}_F^{m+1}, \qquad F(K) = F K, \quad K \in \mathbf{K}^n, \qquad (1) $$

where m is the dimension of the PA structure ($m+1 \le n$). The operator F is represented by an (m+1) x n matrix and is composed of two components, the space mapping $F_S$ and the time mapping $F_T$:

$$ F = \begin{bmatrix} F_S \\ F_T \end{bmatrix} \in \mathbf{Z}^{(m+1) \times n} $$

Thus, an arbitrary DG node $K \in \mathbf{K}^n$ is executed in the processor element (PE) with coordinates $F_S K$ at clock cycle (tact) number $F_T K$. Note that the operator F must satisfy the following conditions:

1. $F_T d > 0, \ \forall d \in D$;
2. $\forall K_1, K_2 \in \mathbf{K}^n \ (K_1 \ne K_2 \Rightarrow F K_1 \ne F K_2)$;     (2)
3. rank(F) = m+1.

According to the methodology [4], the set of all possible nonequivalent allocation mappings $F_S(K)$ satisfying the given constraints on the links between PEs (which are located at the vertices of a lattice $\mathbf{K}^m \subset \mathbf{Z}^m$) is determined first. For each network topology S corresponding to this set, an optimal schedule mapping which implements the algorithm correctly is then found. This mapping is constructed as a linear (or affine) function $F_T$ with n unknown coefficients.

A basic requirement in practical PA designs is the ability to process large tasks (large DGs) on processor arrays with a fixed number of PEs. To provide this ability, two partitioning methods [3] are usually used: the locally sequential globally parallel (LSGP) method and the locally parallel globally sequential (LPGS) method. Both are based on decomposing the dependence graph of an algorithm into a set of regular subgraphs, but they differ in how these subgraphs are mapped onto the resulting architecture. In the LSGP method, one subgraph is mapped to one PE, and each PE sequentially executes the nodes of its subgraph. Therefore, additional local memory is needed within each PE. To avoid this disadvantage, in the LPGS method one subgraph is mapped to the whole array. All nodes within one subgraph are processed concurrently, while the subgraphs are processed sequentially. As a result, all intermediate data corresponding to data dependencies between subgraphs must be stored in buffers outside the processor array. This methodology is described in detail in Ref. [4] and is used in the next section to design the FIR-filter PA architectures.

3 FIR filtering algorithm and architectures

Finite impulse response (FIR) filters are among the most basic building blocks in digital signal processing. For a given frequency response, FIR filters are of higher order than IIR filters, which makes them computationally more expensive. However, only FIR filters can be used in systems that require linear phase, and they have an inherently stable structure. The equation describing the operation of a one-dimensional (1-D) FIR filter is:

$$ y[n] = \sum_{k=1}^{K} a[k]\, x[n-k+1] \qquad (3) $$

where x[n-k+1] are the input data, a[k] are the filter coefficients and y[n] are the output data (k=1,...,K; n=1,...,N+K-1), while K is the number of coefficients and N is the number of input data. The dependence graph G of this algorithm, constructed according to Ref. [6], is shown in Fig.1 (on the left). The nodes of G are located at the nodes of the two-dimensional lattice Q1 = {K=(n,k): 1 ≤ k ≤ K, k ≤ n < N+k}. Note that each DG node corresponds to a full bit-parallel multiplication followed by an addition, i.e. the DG G represents the FIR filtering algorithm at the level of input words. The data dependencies (arcs) between the nodes of G are represented by three different vectors d1, d2, d3, which compose the dependence matrix D of the algorithm:

$$ D = [\, d_1, d_2, d_3 \,] = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \begin{matrix} n \\ k \end{matrix} $$
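As a quick numerical illustration, conditions (2) can be checked against the dependence matrix D for a candidate mapping operator. The Python sketch below is not taken from the paper: the function name `check_mapping`, the small index set (N=4, K=3) and the two candidate schedules are assumptions chosen only for illustration. The allocation FS = [0 1] projects the graph along the n-axis; the schedules FT = [1 1] and FT = [1 -1] are hypothetical.

```python
# Dependence vectors d1, d2, d3 (columns of D), written as (n, k) pairs.
D = [(1, 1), (0, 1), (1, 0)]

def dot(u, v):
    """Inner product of two 2-component integer vectors."""
    return u[0] * v[0] + u[1] * v[1]

def check_mapping(FS, FT, D, N=4, K=3):
    """Return (causal, injective, full_rank) for the 2x2 operator F = [FS; FT].

    The index set used for the injectivity test is a small illustrative
    example (an assumption), not a parameter set from the paper.
    """
    # Condition 1: FT * d > 0 for every dependence vector d.
    causal = all(dot(FT, d) > 0 for d in D)
    # Condition 2: distinct DG nodes get distinct (PE, time) pairs.
    nodes = [(n, k) for k in range(1, K + 1) for n in range(k, N + k)]
    images = {(dot(FS, node), dot(FT, node)) for node in nodes}
    injective = len(images) == len(nodes)
    # Condition 3: rank(F) = m + 1 = 2, i.e. a nonzero 2x2 determinant.
    full_rank = FS[0] * FT[1] - FS[1] * FT[0] != 0
    return causal, injective, full_rank

print(check_mapping((0, 1), (1, 1), D))    # (True, True, True)
print(check_mapping((0, 1), (1, -1), D))   # (False, True, True)
```

The second candidate fails only the first condition: its schedule assigns the same time step to both ends of the dependence d1 = [1,1], so the data would be needed before they could be passed on.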

Note that the vectors d1 = [1,1] and d2 = [0,1] correspond, respectively, to the pipelined propagation of the variables x[n-k+1] and a[k] between DG nodes, while the vector d3 = [1,0] corresponds to the transmission of the intermediate values of y[n] for further updating. The dimension of the algorithm DG at the word level is n=2. Therefore, PA architectures of dimension m=2, m=1 or m=0 may be designed for the FIR filter using the methodology [4].
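For reference, equation (3) can be evaluated directly at the word level, with one multiply-add per DG node. The following Python sketch keeps the 1-based indices of equation (3); the function name `fir_direct` and the convention that samples outside 1..N are zero are assumptions made for this illustration.

```python
def fir_direct(a, x):
    """Evaluate y[n] = sum_k a[k] * x[n-k+1] for n = 1..N+K-1 (eq. (3)).

    a holds the K coefficients a[1..K], x holds the N samples x[1..N];
    both are ordinary 0-based Python lists.
    """
    K, N = len(a), len(x)
    y = []
    for n in range(1, N + K):          # n = 1 .. N+K-1
        acc = 0
        for k in range(1, K + 1):      # k = 1 .. K, one DG node per (n, k)
            i = n - k + 1              # 1-based index of the input sample
            if 1 <= i <= N:            # samples outside 1..N are taken as zero
                acc += a[k - 1] * x[i - 1]
        y.append(acc)
    return y

print(fir_direct([1, 2, 3], [1, 0, 0, 1]))   # [1, 2, 3, 1, 2, 3]
```

The two unit samples in the input each reproduce the coefficient sequence, which makes the result easy to verify by hand.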

Fig.1. Dependence graph G of 1-D filtering algorithm

The 2-D filter structures have the largest throughput. However, with full bit-parallel calculations, the high hardware cost of such structures often does not permit their implementation in FPGAs. Therefore, bit-level algorithm mapping is often performed to synthesize a structure with bit-serial calculations, which provides low hardware cost and pipelined computation. In this case, however, the quantization frequency must be equal to fQ = 1/(l·tC), where l is the word length. Thus, the 2-D filter structures with full bit-parallel calculations are not considered in this paper.

To derive a 1-D PA architecture, the DG G is projected onto the 1-D hyperplane determined by the allocation mapping operator $F_S$. For example, one of the possible 1-D structures is the structure S1, shown in Fig.1 (on the right). This structure, which corresponds to the projection of G along the n-axis, contains K PEs, one input channel and K output channels, and implements the filtering algorithm with an asymptotic time period of T=N clock cycles. Each PE contains a multiplier, an adder and three registers.
The drawbacks of the structure S1 are the comparatively large number of registers in the PEs and the large number of output channels. The latter drawback may be eliminated at the cost of increased PE control overhead and latency. Therefore, to derive a 1-D array structure with a minimal number of I/O channels and PE registers and with a fixed number p < K/2 of PEs, we transform the graph G and decompose it into subgraphs. In particular, we reverse the direction of the vector d3 = [1,0] and join together, in pairs, the neighbouring DG nodes located along the k-axis (see Fig.1). The result is the modified graph G*, shown in Fig.2 (on the left) for K=8. Then we decompose G* into a set of s = ]K/2p[ subgraphs with the same topology, where ]q[ denotes the smallest integer greater than or equal to q. As is evident from Fig.2, this can be done by cutting the graph G* with a set of straight lines parallel to the n-axis. These lines decompose G* into s regular subgraphs with p layers each (see Fig.2).

Fig.2. Modified DG G* of 1-D filtering algorithm and fixed-size PA structure
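The effect of this decomposition on the arithmetic can be emulated at the word level. The Python sketch below is an assumption-laden illustration rather than a model of the actual array: it ignores timing and registers, and simply processes the coefficients in s = ]K/2p[ passes, with each of the p PEs holding a pair of coefficients and with partial sums kept in a buffer between passes. The name `fir_partitioned` and the buffer `partial` are hypothetical.

```python
def fir_partitioned(a, x, p):
    """Word-level emulation: s passes over p PEs, two coefficients per PE.

    Returns the N+K-1 output values and the number of passes s.
    """
    K, N = len(a), len(x)
    s = (K + 2 * p - 1) // (2 * p)          # s = ]K/2p[, i.e. ceil(K / (2p))
    partial = [0] * (N + K - 1)             # partial sums held between passes
    for subgraph in range(s):               # subgraphs are processed one after another
        for pe in range(p):                 # the p PEs of one pass
            for j in range(2):              # each PE holds a pair of coefficients
                k = 2 * (subgraph * p + pe) + j   # 0-based coefficient index
                if k >= K:
                    continue                # padding when K is not a multiple of 2p
                for i in range(N):
                    partial[i + k] += a[k] * x[i]
    return partial, s

y, s = fir_partitioned([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1], p=2)
print(s, y)   # 2 [1, 3, 6, 9, 12, 15, 18, 21, 15, 8]
```

For K=8 and p=2 the coefficients are handled in two passes, (a1..a4) and (a5..a8), and the final contents of the buffer equal the result of evaluating the filter sum directly.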

Then we project each resulting subgraph onto the k*-axis in order to obtain the fixed-size array structure S2 shown in Fig.2 (on the right), where FIFO denotes the external FIFO memory block. The total execution time T of the FIR filtering algorithm is equal to

$$ T = s\,(N + K - 1) $$

time steps, and the asymptotic processor utilization is 1. Moreover, the latency is equal to one time step. The internal structure of the k*-th PE is detailed in Fig.3, where RG denotes a register, SM an adder and MUL a multiplier.

Fig. 3. Internal PE structure for the PA S2

Note that the methodology [4] also allows zero-dimensional (m=0) PA structures, i.e. structures containing only one PE, to be designed. Examples of such structures are the structures S3 and S4, shown in Fig.4 and Fig.5 respectively. Here FIFO(q) denotes a FIFO memory block with q cells.

Fig.4. Structure S3 of the one-processor filter PA

Fig.5. Structure S4 of the one-processor filter PA

The advantage of the filter structure S3 is the minimal latency, which is equal to one time step, while the advantage of the structure S4 is the lower hardware overhead (lower data width of the FIFO(K-1)). The frequency of X and Y data input is equal (for both filters) to f/K, where f is the internal operating frequency of the filter.

As mentioned above, with full bit-parallel calculations the high hardware cost of such structures often does not permit their implementation in FPGAs [1]. Therefore, bit-level algorithm mapping is often performed to synthesize a structure with bit-serial calculations. Note that the proposed methodology [4] makes it possible to derive bit-level PA structures if the input algorithm describes the calculations at the bit level. For example, the 1-D FIR filtering algorithm at the bit level is represented by a four-dimensional dependence graph [6]. Using the methodology [6], the bit-level PA architecture S5 was designed. It is shown in Fig.6 and consists of K multiply-addition units (MU) and FIFO blocks of different sizes (Xdelay, Ddelay, Ydelay). Each MU block consists of one-bit cells. The internal structure of a cell of the first type is shown in Fig.7, where & denotes the AND operation and D denotes the D-trigger (D flip-flop).

Fig.6. Structure S5 of the bit-level filter PA

Fig.7. Internal structure of the bit-level cell of MU block (signals Xin, Ain, Yin, Cin, Xout, Yout, Cout; & is the AND gate, D the D-triggers, SM the adder)
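The behaviour of such a one-bit cell can be sketched in Python under a simplifying assumption: the cell is taken to be an AND gate followed by a one-bit full adder, as the signal names in Fig.7 suggest, and the D-triggers (pipeline delays) are ignored. The functions `bit_cell` and `multiply_with_cells` are hypothetical names introduced here, and the multiplier is a generic shift-and-add arrangement, not the MU block of the structure S5.

```python
def bit_cell(x_bit, a_bit, y_in, c_in):
    """One-bit multiply-add cell: (x AND a) + y_in + c_in -> (y_out, c_out)."""
    partial_product = x_bit & a_bit
    total = partial_product + y_in + c_in
    return total & 1, total >> 1

def multiply_with_cells(x, a, width):
    """Unsigned shift-and-add multiplication built only from bit_cell calls."""
    acc = [0] * (2 * width)                 # accumulator bits, LSB first
    for i in range(width):                  # one row of cells per bit of x
        x_bit = (x >> i) & 1
        carry = 0
        for j in range(width):              # one cell per bit of a
            a_bit = (a >> j) & 1
            acc[i + j], carry = bit_cell(x_bit, a_bit, acc[i + j], carry)
        # The carry out of the row lands in the next, still empty, bit.
        acc[i + width], carry = bit_cell(0, 0, acc[i + width], carry)
    return sum(bit << position for position, bit in enumerate(acc))

print(multiply_with_cells(13, 11, 4))   # 143
```

Each row adds one shifted partial product to the accumulator, so a word-level multiply-add decomposes into width x width one-bit cell operations.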

4 Parametrized FIR filter structure library


Using the proposed mapping method [4], a library of FIR filter structures has been developed. The library contains a set of parametrized FIR filter structures and is intended for the computer-aided design of FPGA-based digital signal processing systems. It consists of structural filter models described in VHDL in a synthesizable style, with fixed-point data, programmable coefficients and full internal accuracy. The structure parameters are:
- number of filter taps (number of coefficients) K = 2,3,...,32;
- input/output data width Lx = 4,6,...,32 bits;
- coefficient width La = 4,6,...,32 bits.

To increase the internal operating frequency of the filters, in all filter structures an addition takes one clock period and a multiplication takes two or three clock periods; however, the two- or three-stage pipelined parallel multiplier delivers a product every clock cycle. Usually the coefficients are known constants; the multiplication is therefore performed using tables of the coefficients multiplied by a set of natural numbers, which substantially reduces the hardware cost. An example of such a 16-bit multiply unit structure, S6, is shown in Fig. 8, where k is the coefficient number input and ROM1 to ROM4 are 16-cell memory blocks (if k=1). The VHDL models of all derived filter structures were developed and tested in Xilinx Foundation 1.5.
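The table-based multiplication can be illustrated with a short Python sketch. It assumes an unsigned 16-bit input split into four 4-bit fields, each addressing a 16-cell table that holds the coefficient multiplied by 0..15; the partial products are then shifted and added. The names `build_rom` and `rom_multiply` are hypothetical, and a single Python list stands in for the four ROM blocks of the hardware unit.

```python
def build_rom(coeff):
    """16-cell table: cell i holds coeff * i for the 4-bit digit i."""
    return [coeff * i for i in range(16)]

def rom_multiply(rom, x):
    """Multiply an unsigned 16-bit x by the coefficient stored in rom."""
    assert 0 <= x < 1 << 16
    product = 0
    for digit in range(4):                      # bit fields 3..0, 7..4, 11..8, 15..12
        nibble = (x >> (4 * digit)) & 0xF       # 4-bit address into the table
        product += rom[nibble] << (4 * digit)   # weight the entry by 16**digit
    return product

rom = build_rom(1234)
print(rom_multiply(rom, 0xBEEF) == 1234 * 0xBEEF)   # True
```

Only table look-ups, shifts and three additions are needed per product, which is why a constant-coefficient multiplier of this kind is much cheaper than a general parallel multiplier.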

Fig.8. Internal structure of the ROM-based multiply unit

The model parameters obtained for the case Lx=La=16 and K=8 are given in Tabl. 1, where CLB denotes a configurable logic block of the FPGA and LUT denotes a look-up table.

Tabl.1
Structure   CLB    Flip-flops   4LUT   3LUT   MHz
S2          468    870          646    103    34.3
S4          130    69           219    46     38.6
S5          1687   3276         2174   498    111.9
S6          44     -            78     -      72.8

5 Conclusions
Implementing application-specific parallel systems in FPGAs has a number of advantages, such as full adaptation of the implemented structure to the applied algorithms, high performance, high computational precision, and reduced time to market and development cost. Moreover, only an FPGA-based implementation makes it possible to obtain the highest hardware utilization together with the lowest cost/performance ratio. Therefore, this paper presents the design of FPGA-based parallel FIR filter structures. The authors propose a method for mapping algorithms into PA structures, which helps to design high-performance filters. The method has shown good results in the development of the library of FIR filter structures. The library contains a set of parametrized FIR filter structures and is intended for the computer-aided design of FPGA-based digital signal processing systems. It consists of structural filter models described in VHDL in a synthesizable style. All models are characterized by high technical parameters.

References
[1] J. Isoaho, J. Pasawn, O. Vaino, H. Terhunen. DSP System Integration and Prototyping With FPGAs. J. VLSI Signal Processing, 1993, 6, pp. 155-172.
[2] The Synthesis Approach to Digital System Design / Ed.: P. Michel, U. Lauther, P. Duzy. Kluwer Academic Publ., 1992.
[3] Kung S.Y. VLSI processor arrays. Prentice Hall, Englewood Cliffs, 1988.
[4] Wyrzykowski R., Kanevski J.S., Maslennikov O. Mapping recursive algorithms into processor arrays. Proc. Int. Workshop Parallel Numerics'94, M. Vajtersic and P. Zinterhof eds., Smolenice (Slovakia), 1994, pp. 169-191.
[5] Moreno J.H., Lang T. Matrix computations on systolic-type arrays. Kluwer Academic Publ., Boston, 1992.
[6] Wyrzykowski R., Kanevski Ju.S., Maslennikov O.V., Maslennikova N.N. A Method for Deriving Dependence Graphs of Recursive Algorithms for Processor Array Design. Proc. Int. Workshop "Parallel Numerics'95", Sorrento, Italy, 1995, pp. 263-280.
