
Asian Journal Of Computer Science And Information Technology 3 : 8 (2013) 109 - 116.

Contents lists available at www.innovativejournal.in

Asian Journal of Computer Science And Information Technology

Journal Homepage: http://www.innovativejournal.in/index.php/ajcsit

HDL BASED IMPLEMENTATION OF PALM ASSOCIATIVE MEMORY


Pavan Vyas*, Rachit Kacheria

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India

ARTICLE INFO

Corresponding Author:
Pavan Vyas
PG student, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India

Keywords: Neural network algorithm, Associative memory, k winners take all circuit.

ABSTRACT

The objective of this paper is to implement and analyze the Palm associative memory proposed by G. Palm [1]. A design implementation of this algorithm, based on Verilog HDL (hardware description language) and the MATLAB programming language, is proposed. A Xilinx Spartan-3E FPGA (Field Programmable Gate Array) is used for simulation; it performs the arithmetic operations that implement the associative memory. The simulation results are obtained with Xilinx ISE 10.1 and MATLAB R2010a and are analyzed in terms of operating frequency and chip utilization. The paper also summarizes the computation time and logic hardware for different input vector sizes in the system.

© 2013, AJCSIT, All Rights Reserved.
INTRODUCTION
An associative memory is a content-addressable structure that maps a set of input patterns to a set of output patterns. A content-addressable structure is a type of memory that allows the recall of data based on the degree of similarity between the input pattern and the patterns stored in memory.
There are two types of associative memory. One is auto-associative memory, which retrieves a previously stored pattern that most closely resembles the current pattern. The other is hetero-associative memory, in which the retrieved pattern is in general different from the input pattern, not only in content but possibly also in type and format.
The paper aims at implementing an associative memory suitable for running experiments and, based on the experimental results, for possible further elaboration of the implementation. We used the Palm model for neural associative memory. Both algorithms given here are first implemented for smaller data sets; after the functionality is verified, they are run on the actual data set. The paper describes the concept and a practical implementation of the associative memory.
I. ALGORITHM
Willshaw published a paper on neural associative memory in 1969, and Palm gave a detailed treatment of it in 1980. Together, these works propose associative memory models based on a binary weight matrix, a Hebbian learning function and threshold activation functions [1, 8]. Their combined efforts gave an efficient implementation of Bayesian memory, defined as the WPNAM (Willshaw and Palm Neural Associative Memory). This term will be used many times in this paper; WPNAM and AM are interchangeable, and when we mention AM in this work, we mean a WPNAM-based AM. Figure 1 shows the basic network structure of the WPNAM model.

Figure 1: The architecture of the Willshaw and Palm NAM model.

The training stage of the network is described as:

Wij = ∨(μ=1..M) (Xiμ ∧ Yjμ)    or    Wij = Σ(μ=1..M) (Xiμ ∧ Yjμ)        (1)

Where Wij is the binary (0 or 1) or multi-bit value of the weight between the i-th input neuron and the j-th output neuron; Xiμ is the binary value of the i-th input neuron; Yjμ is the binary value of the j-th output neuron; μ is the index of the training pattern; M is the total number of training patterns; ∨ is the ORing operation; ∧ is the ANDing operation; and Σ is integer summation.

In equation 1 above, the weights can be produced by two methods. In the first method, the training vectors Xiμ and Yjμ are ANDed for each training pattern and the results are then ORed over the M different training patterns; this method gives binary (0 or 1) weights. In the second method, the training vectors Xiμ and Yjμ are again ANDed for each training pattern, but the results are instead summed over the M different training patterns; this method gives multi-bit weights.
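As an illustration, the following MATLAB sketch (vector size, pattern count and sparsity chosen only for illustration; not taken from the paper) computes both forms of the weight matrix:

n = 10; M = 3;                              % illustrative vector size and pattern count
X = rand(n, M) > 0.7;                       % M sparse binary input patterns (columns)
Y = X;                                      % auto-associative case: outputs equal inputs
Wbin = false(n); Wsum = zeros(n);           % binary and multi-bit weight matrices
for mu = 1:M
    P = double(X(:,mu)) * double(Y(:,mu))'; % ANDing of Xi^mu and Yj^mu (outer product)
    Wbin = Wbin | P;                        % method 1: ORing over the M patterns
    Wsum = Wsum + P;                        % method 2: integer summation over the M patterns
end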
For the retrieval stage of the network, when an input pattern x propagates through the network, the output neuron's potential is given by:

Sj = Σ(i=1..n) (Wij ∧ Xi)        (2)

Where Sj is the j-th output neuron's potential; Xi is the value of the i-th input neuron, which may be a noisy version of the training vector (the generation of such test vectors is discussed in [9]); and Wij is the weight connecting the i-th input neuron to the j-th output neuron.
Now this neuron's potential passes through a function with some threshold value θ, which activates k different output neurons; outputs with value one are called activated neurons. The value of the j-th output neuron is given by:

Yj = 1 if Sj ≥ θ, and Yj = 0 otherwise        (3)

Equation 2 ANDs the input vector with the weight memory and performs an integer summation of the resultant vector to generate the output neurons' potentials. This operation is called inference, and it is formed by the inner product operation of matrices. Equation 3 gives the neurons' activation function: as noted above, its threshold value θ activates k different output neurons, and those set to one are called activated neurons. As proposed by Palm, only O(log2 N) neurons are active at a time; thus, such a computed connection updates the network. Equation 3 is generally referred to as k-WTA (k Winners-Take-All). So any implementation of an AM-based HDM model must provide these two operations (a sketch of both follows the list):
1. Inner product operation for matrices
2. k-WTA
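A compact MATLAB sketch of these two operations (illustrative sizes; the threshold is chosen so that the k largest potentials win):

n = 10; k = 2;                      % illustrative sizes
W = rand(n) > 0.8;                  % binary weight matrix from training
X = rand(n,1) > 0.7;                % (possibly noisy) binary input vector
S = sum(W & repmat(X, 1, n), 1)';   % inner product, eq. 2: AND then integer summation
s = sort(S, 'descend');             % order the potentials
theta = s(k);                       % threshold at the k-th largest potential
Y = double(S >= theta);             % k-WTA activation, eq. 3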
AM can perform ISP (Intelligent Signal Processing), but it has some limitations, the main issue being network scaling. AM operates on sparse vectors with very few 1s compared to 0s, but after the accumulation of these vectors, the weight memory becomes less sparse. As proposed by Palm, maximum capacity is achieved when the numbers of 1s and 0s in the weight matrix are the same.
II. SYSTEM ORGANIZATION
Figure 2 shows the system organization and data flow for the AM algorithm. First, we add noise to the original (clean) images; the size of each image is 128 x 128 pixels. The noisy image then goes through image processing for feature extraction. Until this stage, the work is done on a PC using MATLAB. The extracted feature vectors are then transferred from the PC over a communication channel to the FPGA board, where they are multiplied by the weight matrix stored in memory. This operation is called the inner product, and it yields an integer 'sum vector'. Thresholding the 'sum vector' gives a 'binary vector' (clean vector). Finally, this 'binary vector' is transferred back to the PC, which reassembles the clean original image.
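The noise-injection step can be sketched as follows; the flip probability p is an assumption, since the paper does not state the noise level:

img   = rand(128) > 0.5;           % stand-in for a clean 128 x 128 binary image
p     = 0.05;                      % assumed probability of corrupting a pixel
noisy = xor(img, rand(128) < p);   % noisy image passed on to feature extraction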

Figure 2: System organization for the NAM algorithm.

III. FEATURE EXTRACTION


Feature extraction is a technique that maps a large vector to a small vector; it is performed on the large input vector, and the extracted vector may bear little resemblance to the input vector. A short explanation of feature extraction is given here. It is not the only method for extracting a feature vector, but it serves the purpose. The input images are shown below in Figure 3 (courtesy: Sam Roweis data, http://www.cs.nyu.edu/~roweis/data.html).

Figure 3: Input data set for feature extraction

IV: GENERATE INPUT DATA SEQUENCE FOR AM


Figure 4 shows a single character of the input data set, which is a 16 x 20 pixel image. We then add a 2-pixel border of zero padding around the image to distinguish it in the input data set; white pixels are defined as 1s and black pixels as 0s. This reproduces a 20 x 24 pixel image. We designed a 5 x 5 pixel mask (shown in green) which moves over the image and collects the number of 1s (white pixels) under it, as directed in the figure. This is done in MATLAB. The extracted features are shown in Figure 5; the white shaded areas help us identify a particular character.
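A minimal MATLAB sketch of this masking step is given below; the stride of the mask is an assumption, since the paper only states that the mask moves over the image and collects the 1s under it:

img = rand(20, 16) > 0.5;                % stand-in for a 16 x 20 pixel character (rows x cols)
pad = zeros(24, 20);                     % zero-padded 20 x 24 pixel canvas
pad(3:22, 3:18) = img;                   % 2-pixel border of zeros on all sides
feat = [];                               % feature vector of white-pixel counts
for r = 1:5:20                           % assumed non-overlapping 5 x 5 mask
    for c = 1:5:16
        feat(end+1) = sum(sum(pad(r:r+4, c:c+4)));
    end
end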


Figure 4: Method to generate feature vector

Figure 5: Extracted feature vectors

Figure 6: Functional blocks and data flow

Figure 6 shows the FPGA functional blocks and the weight matrix organization in memory. The memory stores the weight matrix sparsely, as the row and column indices of the active weight bits. The FPGA stores the input vector in the input matrix unit, which contains the non-zero row information of the input vector. The FPGA reads the weight matrix at the same row index, searches for the corresponding column numbers inside the input vector, then accumulates the values and writes them to the result vector unit. This operation is done in the inner-product unit. The 'k-WTA' unit sorts the incoming result and transfers it from the inner-product unit to the result vector unit.
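The paper does not spell out the exact on-chip storage format, so the following MATLAB sketch assumes a simple coordinate-list representation of the sparse weight matrix and models the accumulation performed by the inner-product unit:

n = 10;
W = rand(n) > 0.8;                   % dense binary weight matrix (for reference)
[rows, cols] = find(W);              % sparse storage: indices of the active weight bits
X = rand(n,1) > 0.7;                 % input vector (only non-zero rows matter)
S = zeros(n,1);                      % result vector unit
for t = 1:numel(rows)                % inner-product unit: for each stored bit,
    if X(rows(t))                    % check the matching row of the input and
        S(cols(t)) = S(cols(t)) + 1; % accumulate into the result vector
    end
end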
V: ALGORITHMS OF NAM

Basically, there are two major algorithms in NAM. One is the training or learning algorithm, in which the network is exposed to a large number of input and output vectors; in this algorithm, the outer product of the vectors is called training. The other is the inference algorithm, which receives an input vector and settles down to the nearest stored pair; in this algorithm, the inner product of the vectors is called inference.
Here, a small example is given to understand the AM:
• There is an input vector X and an output vector Y.
• X is a 10x1 vector.
• Y is a 10x1 vector.
• The outer product gives the learning algorithm: W = X * YT.
• Hence we get a 10x10 weight matrix W.
• The operation XT * W then leads back to vector Y; this is called inference.
NOTE: Even if noise corrupts vector X, the operation XT * W still leads to vector Y (a MATLAB sketch of this example is given below).
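A minimal MATLAB sketch of this example, using the same vectors as in Appendix A:

X = [1 1 0 0 0 0 0 0 0 0]';     % clean 10x1 input vector (see Appendix A)
Y = X;                          % auto-associative: output equals input
W = double((X * Y') > 0);       % learning: clipped outer product, 10x10
Xn = [1 1 1 0 1 0 0 0 0 0]';    % noisy version of X (two extra 1s)
S  = (Xn' * W)';                % inference: inner product with the weights
s  = sort(S, 'descend');
Yhat = double(S >= s(sum(X)));  % k-WTA with k = 2 recovers [1 1 0 ... 0]' = X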
VI: HDL IMPLEMENTATION OF TRAINING ALGORITHM
We have studied the training algorithm in the previous topic. Now its implementation in Verilog HDL is given. It requires the input vectors and output vectors to produce the weight matrix, which has to be very sparse.

Figure 7: Input/ Output pin diagram of training algorithm

Figure 7 shows the input-output pin diagram. It consists of the input vector, output vector, clock, reset, memory write, memory read and data out. The many complex arithmetic operations seen in the previous topic are all implemented in Verilog for hardware.
The block diagram in Figure 8 shows how this algorithm generates the weight matrix. It consists of an 'and block', an 'or block', a counter, a clock and memory. Here, vectors X and Y are the feature-extracted vectors of a training image; for computing an auto-associative memory, vector X (input) and vector Y (output) should be the same. The 'and block' performs the ANDing of vectors X and Y on the positive clock edge and sends its output to the 'or block', where the weight memory and the resultant vector of the 'and block' are ORed together. A counter, also enabled on the positive clock edge, generates the addresses for the memory; these addresses are used for the reallocation of the weight matrix in it. The memory module holds the weight matrix, which is mathematically a two-dimensional array. When another image comes for training, the existing weight matrix is ORed with the newer result and stored back into the memory. This weight matrix is then used for inference.
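A behavioural MATLAB model of this datapath (a sketch, not the authors' Verilog; sizes and sparsity illustrative):

n = 128;
W = false(n, n);                      % weight memory (n x n, binary)
X = rand(n,1) > 0.9;  Y = X;          % sparse feature vectors (auto-associative)
for addr = 1:n                        % counter generating the memory addresses
    andRow = X(addr) & Y';            % 'and block': row addr of the outer product
    W(addr, :) = W(addr, :) | andRow; % 'or block': OR with the stored weight row
end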
The timing is calculated as follows.
As shown in Figure 8, the ANDing of the two vectors is carried out by the 'and block'. This block takes time Tand, a function of the input vector size n and the minimum clock period Tclk; the operation requires one clock cycle per bit, so Tand = Tclk * n. In our experiment, n is a 128-bit vector and the minimum clock period Tclk is 11.113 ns, so the 'and block' takes a total time Tand of 1.42 us.
The ORing of the two vectors is carried out by the 'or block'. This block takes time Tor, likewise a function of n and Tclk, with one clock cycle per bit: Tor = Tclk * n. With the same n = 128 and Tclk = 11.113 ns, the 'or block' takes a total time Tor of 1.42 us.
Storing the weight matrix in memory requires another clock cycle, equal to Tclk. Hence, three stages (AND, OR, memory write) are required to store one image's extraction, taking O(n) time. In our experiment, the total time Tcomputation taken by this algorithm for single-image storage is 3615 us.
In general, the total computation time is given as: Tcomputation = (Tand + Tor + Tmem) * Img, where Tcomputation is the total computation time of the training algorithm; Tand is the time taken by the 'and block'; Tor is the time taken by the 'or block'; Tmem is the time taken to store the weight matrix into memory; Tclk is the clock period; n is the size of the extracted vector; and Img is the number of images to be trained.

VII: HDL IMPLEMENTATION OF INFERENCE ALGORITHM


We know the inference algorithm from the previous topic. Now its Verilog HDL implementation for the FPGA is given. As explained earlier, the algorithm requires the input vector X and the weight matrix to produce the output vector Y.
Figure 9 shows the input-output pin diagram. It consists of the input vector, output vector, clock, reset and weight matrix. As before, all the complex arithmetic operations are implemented in Verilog HDL for hardware.
The block diagram in Figure 10 shows how this algorithm generates the output vector Y; it contains counters, memory modules, shift registers, comparators, flip-flops, an ANDing module, etc. The weight matrix generated by the learning algorithm is used for inference: it is ANDed with the (noisy) input vector, and the resultant vector is stored in memory. To collect the total number of 1s in the vector, the resultant vector is transferred to a shift register and shifted one bit at a time; each time a 1 arrives, a counter is incremented. This count is stored in memory-1 as well as memory-2, with the counter generating the addresses. The sorting module then sorts memory-1's numbers and finds the appropriate k-WTA value. For thresholding, the sorted numbers of memory-1 are compared with memory-2: a 1 is written if the memory-2 number is greater than the k-WTA value, else a 0. Thresholding memory-2 yields the binary (cleaned) vector as output. If the output is not yet clean, the whole process is repeated with the previous output vector as the new input. Thus, even with a noisy vector as input, we get a cleaned vector as output.
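A behavioural MATLAB model of this datapath (again a sketch, with the serial bit-count modelling the shift register; sizes and sparsity illustrative):

n = 128; k = 7;
W  = rand(n) > 0.95;                  % stand-in weight memory from training
Xn = rand(n,1) > 0.9;                 % noisy input vector
S  = zeros(n,1);                      % counts stored in memory-1 and memory-2
for j = 1:n
    v = W(:, j) & Xn;                 % 'and block': weight column AND input
    for b = 1:n                       % shift register: inspect one bit per cycle
        S(j) = S(j) + v(b);           % counter increments on each 1
    end
end
s    = sort(S, 'descend');            % sorting module over memory-1
th   = s(k);                          % k-WTA threshold
Yhat = double(S >= th);               % thresholding memory-2 gives the clean vector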


Figure 8: Block diagram of learning/training algorithm

Figure 9: Input/Output pin diagram of inference algorithm

Figure 10: Block diagram of inference algorithm

The detailed timings of this algorithm are given below:


Figure 10 shows an 'and block' for ANDing the two vectors. This block takes time Tand, a function of the input vector size n and the minimum clock period Tclk: Tand = Tclk * n. In our experiment, n is a 128-bit vector and the minimum clock period Tclk is 28.48 ns, so the 'and block' takes a total time Tand of 3.65 us.
Next, the addition operation takes place: the 1s in the ANDed vectors are collected and summed, giving the total number of 1s in the whole binary vector. Done serially, this is a function of the input vector size n, the ANDed vector size n and the minimum clock period Tclk: Taddition = Tclk * n^2. With n = 128 and Tclk = 28.48 ns, the addition block takes a total time Taddition of 466.6 us.
If the summation is instead done in parallel, it is a function of only the input vector size n and the minimum clock period Tclk: Taddition = Tclk * n, so we can clearly see the time difference between serial and parallel addition. With n = 128 and a minimum clock period Tclk of 31.35 ns, the addition block then takes a total time Taddition of 4.01 us.

The sorting of these values requires a sorting block. Its time is a function of the input vector size n, the k-WTA number m and the minimum clock period Tclk: Tsort = Tclk * n * m, where m is log2 n. In our experiment, with n = 128, Tclk = 28.48 ns and the k-WTA number m = 7, the sorting block takes a total time Tsort of 25.55 us.
Finally, we are left with the k-WTA module. Its time is a function of the input vector size n and the minimum clock period Tclk: Tk-wta = Tclk * n. With n = 128 and Tclk = 28.48 ns, the k-WTA block takes a total time Tk-wta of 3.65 us.
So our total computation time Tcomputation is 499.45 us (serial) and 40.13 us (parallel). In general, the total time is O(n^2) for serial computation and O(n log2 n) for parallel computation.
In general, the total computation time is given by:
Tcomputation = Tand + Taddition + Tsort + Tk-wta
where Tcomputation is the total time taken by the whole algorithm; Tand is the time taken by ANDing the input vector with the weight matrix; Taddition is the time taken by the addition of 1s with the shifting operation; Tsort is the time taken by sorting memory-1; Tk-wta is the time taken by the k-WTA thresholding operation; and Tclk is the minimum clock period.
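The figures quoted above can be checked directly in MATLAB:

Tclk = 28.48e-9;  n = 128;  m = 7;             % serial-design clock, vector size, k
Tand      = Tclk * n                           % ~3.65 us
Taddition = Tclk * n^2                         % ~466.6 us (serial, bit by bit)
Tsort     = Tclk * n * m                       % ~25.5 us
Tkwta     = Tclk * n                           % ~3.65 us
Ttotal    = Tand + Taddition + Tsort + Tkwta   % ~499.4 us, as quoted
Tclk_p = 31.35e-9;                             % clock period of the parallel design
Tadd_p = Tclk_p * n                            % ~4.01 us for parallel addition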
Figure 11 shows the timings of the different blocks for serial computation. Most of the time in the serial computation is taken by the addition operation, which is O(n^2).

Figure 11: Serial computation

Figure 12 shows the timings of the different blocks for parallel computation; the time difference between serial and parallel computation is clear. For parallel computation, the addition operation is done in parallel in O(n) time. This saves overall computation time, but it is a trade-off between cost and time, where cost refers to the memory or the number of adders used for the parallel computation.

Figure 12: Parallel computation

VIII: MATLAB AND HDL SIMULATION RESULT


Now it is time to compare the inference algorithms: which is better, HDL or MATLAB? We have implemented the whole architecture in MATLAB with the same data sets used in the Verilog HDL version.

Table 1. Timing comparison of HDL and MATLAB for serial computation

Size                 40x40    64x64    80x80    128x128
LUTs (total 9312)    2212     3452     4386     6762
Verilog time (us)    171.63   253.48   298.71   499.45
MATLAB time (us)     141      236      282      472

Here, the MATLAB time is linear because the PC has a lower degree of parallelism than the FPGA. On the FPGA, thanks to the parallelism of memory fetching and the addition operation, the result is achieved in far less time than with the serial implementation of NAM.

Table 2. Timing comparison of HDL and MATLAB for parallel computation

Size                 40x40    64x64    80x80    128x128
LUTs (total 9312)    2892     4718     5582     8376
Verilog time (us)    14.92    22.16    27.37    40.13
MATLAB time (us)     141      236      282      472
IX: CONCLUSION
The implementation of AM gives an overview of the design methodology and hardware components. The simulation results clearly show that it eradicates noise. They also provide a comparison between the HDL and MATLAB (microprocessor) implementations, and timing formulas are given to determine the speed of NAM.
The main problem with PCs is that instructions are executed sequentially; even multi-core PCs are slower than the FPGA. The degree of parallelism of an FPGA is far greater than that of a PC: as shown in Tables 1 and 2, it is almost five times faster than the PC. We succeeded in implementing the NAM architecture proposed by Palm [1]. The methodology will help in mapping different networks onto FPGAs. We have shown the proper data sizes and bit widths of the various elements so that a design for another neural network can be compared with the existing design.
X: FUTURE WORK
In this paper we have used binary vectors, but integer vectors can also be used to generate the weight matrix for better accuracy; for that, the price is memory, so there is a trade-off between accuracy and cost. The FPGA implementation can also be extended to real video (images) by interfacing a camera directly to the FPGA. Feature extraction was done on the PC (MATLAB), but it can also be written in Verilog and fitted into the FPGA, so that no price is paid for the communication time between PC and FPGA. The architecture can still be optimised by experimenting with different tools, e.g. MATLAB, and the overall computation timings can be optimised by further parallelisation.
REFERENCES
[1] G. Palm, "On associative memory," Biological Cybernetics, vol. 36, pp. 19-31, 1980.
[2] Intel, "60 Years of the transistor: 1947-2007," 2008. Available: http://www.intel.com/technology/timeline.pdf.
[3] D. Hammerstrom, "A survey of bio-inspired and other alternative architectures," in Nanotechnology Information Technology - II, vol. 2. Weinheim,
Germany: Wiley-VCH Verlag GmbH & Co., 2008, pp. 251-285.
[4] S. Borkar, P. Dubey, K. Kahn, D. Kuck, H. Mulder, S. Pawlowski, and J. Rattner, "Platform 2015: Intel processor and platform evolution for the next decade," Technology@Intel Magazine, pp. 1-10, 2005.
[5] S. Zankovych, T. Hoffmann, J. Seekamp, J.-U. Bruch, and C. M. S. Torres, "Nanoimprint lithography: challenges and prospects," Nanotechnology, vol. 12,
pp. 91-95, 2001.
[6] Mazad S. Zaveri and Dan Hammerstrom, "CMOL/CMOS Implementations of Bayesian Polytree Inference: Digital & Mixed-Signal Architectures and
Performance/Price," 2009.
[7] D. Hammerstrom, C. Gao, S. Zhu, and M. Butts, "FPGA implementation of very large associative memories - scaling issues," in FPGA Implementations of
Neural Networks, A. Omondi, Ed. Boston: Kluwer Academic Publishers, 2003.
[8] G. Palm, F. Schwenker, F. T. Sommer, and A. Strey, "Neural associative memories," in Associative Processing and Processors. Los Alamitos, CA.: IEEE
Computer Society, 1997, pp. 284-306.
[9] S. Zhu, "Associative memory as a Bayesian building block," Ph.D. dissertation, OGI School of Science and Engineering, Oregon Health and Science
University, Beaverton, Oregon 2008.
[10] C. H. Luk, C. Gao, D. Hammerstrom, M. Pavel, and D. Kerr, "Biologically inspired enhanced vision system (EVS) for aircraft landing guidance,"
presented at International Joint Conference on Neural Networks, Budapest HUNGARY, pp. 1751-1756, 2004.

APPENDIX A: SIMULATION WAVEFORM

For simulation, we have taken the clean input vector Xclean = [1100000000] and the noisy input vectors Xnoisy = [1110100000] and [1100010000].

Figure 13: Simulation waveform1

Now the inference operation takes place. The clean input vector X is multiplied by the weight matrix W and gives the clean output vector. As this is an auto-associative memory, the input vector and output vector should be the same. As indicated in Figure 13, the output has not yet been received; it requires more clock cycles. Operation: Xclean * W = [1100000000] = Yclean = Xclean.


Figure 14: Simulation waveform2

In Figure 14, after many clock cycles, we get the output. An operation takes place in which the noisy input vector is operated on with the weight matrix W and gives the clean output vector. Figure 14 clearly shows that if the input vector is clean, we get a clean output vector, because this is an auto-associative memory. The figure also indicates the sorted values from the memory.
We can clearly see in Figure 15 that if we have a noisy input, this algorithm can clean it up; the clean output vector is shown in Figure 15. The main requirement is to generate a sparse weight matrix for the proper working of the algorithm. Operation: Xnoisy * W = [1100000000] = Yclean = Xclean.

Figure 15: Simulation waveform3

