HDL Based Implementation of Palm Associative Memory

Pavan Vyas
PG student, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India
Corresponding Author: Pavan Vyas

ABSTRACT: The objective of this paper is to implement and analyze the Palm associative memory proposed by G. Palm [1]. A design implementation of this algorithm, based on Verilog HDL (hardware description language) and the MATLAB programming language, is proposed. A Xilinx Spartan-3E FPGA (Field Programmable Gate Array) is used for simulation and performs the arithmetic operations that implement the associative memory. The simulation results are obtained with Xilinx ISE 10.1 and MATLAB R2010a and are analyzed in terms of operating frequency and chip utilization. The paper also summarizes the computation time and logic hardware of the implementation for different input vector sizes.

Keywords: Neural network algorithm, Associative memory, k-winners-take-all circuit.

2013, AJCSIT, All Rights Reserved.
INTRODUCTION
An associative memory is a content-addressable structure that maps a set of input patterns to a set of output patterns. A content-addressable structure is a type of memory that allows the recall of data based on the degree of similarity between the input pattern and the patterns stored in memory.
There are two types of associative memory. One is an auto-associative memory, which retrieves the previously stored pattern that most closely resembles the current pattern. The other is a hetero-associative memory, in which the retrieved pattern is in general different from the input pattern, not only in content but possibly also in type and format.
The paper aims at implementing an associative memory suitable for running experiments, and at possible further elaboration of the implementation based on the experimental results. We use the Palm model of neural associative memory. Both algorithms given here are first implemented for smaller data sets; after the functionality is verified, they are run on the actual data set. The paper describes the concept and the practical implementation of the associative memory.
I. ALGORITHM
Willshaw published a paper on neural associative memory in 1969, and Palm gave a detailed explanation of it in 1980. Together they proposed associative memory models based on a binary weight matrix, a Hebbian learning function, and threshold activation functions [1, 8]. Their combined efforts gave an efficient implementation of Bayesian memory, defined as WPNAM (Willshaw and Palm Neural Associative Memory). This term is used many times in this paper; WPNAM and AM are interchangeable, and when we mention AM in this work we mean WPNAM-based AM. Figure 1 shows the basic network structure of the WPNAM model.
The weight matrix of equation 1 is formed as

Wij = ∨μ=1..M (Xiμ ∧ Yjμ)

where Wij is the binary value, 0 or 1 (or a multi-bit value), of the weight between the i-th input neuron and the j-th output neuron; Xiμ is the binary value of the i-th input neuron; Yjμ is the binary value of the j-th output neuron; μ is the index of the μ-th training pattern; M is the total number of training patterns; ∨ is the OR operation; ∧ is the AND operation; and Σ is integer summation.
Vyas et.al/HDL Based Implementation of Palm Associative Memory
In equation 1 above, the weights can be produced by two methods. In the first method, the training vectors Xiμ and Yjμ are ANDed for a particular training pattern, and the results are then ORed over the M different training patterns; this gives binary weights with value 0 or 1. In the second method, the training vectors Xiμ and Yjμ are again ANDed for a particular training pattern, but the results are summed over the M different training patterns; this gives multi-bit weights. For the retrieval stage of the network, when an input pattern X propagates through the network, the output neuron's potential is given by equation 2:

Sj = Σi (Xi ∧ Wij)

where Sj is the j-th output neuron's potential; Xi is the value of the i-th input neuron, which may be a noisy version of the training vector (how to generate such test vectors is discussed in [9]); and Wij is the weight connecting the i-th input neuron to the j-th output neuron.
This potential then passes through a function with some threshold value θ, which activates k different output neurons; the outputs with value one are called activated neurons. The value of the j-th output neuron is given by equation 3:

Yj = 1 if Sj ≥ θ, and Yj = 0 otherwise

Equation 2 ANDs the input vector with the weight memory and then forms the integer sum of the resultant vector to generate the output neurons' potentials. This operation, called inference, is the inner-product operation of the matrix. Equation 3 gives the neurons' activation function: its threshold value θ activates k different output neurons, and those set to one are called activated neurons. As proposed by Palm, only O(log2 N) neurons are active at a time; such a computed connection updates the network. Equation 3 is generally referred to as k-WTA (k Winners-Take-All). Any implementation of the AM-based HDM model must therefore have these two operations:
1. Inner product operation for matrix
2. k-WTA
AM can perform ISP (Intelligent Signal Processing), but it has some limitations, the main one being network scaling. The AM uses sparse vectors with very few 1s compared to 0s, but after these vectors are accumulated, the weight memory becomes less sparse. As proposed by Palm, maximum capacity is achieved when the number of 1s and the number of 0s in the weight matrix are the same.
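The two required operations can be illustrated with a short sketch. This is a Python model of the WPNAM equations above, not the paper's Verilog; the vector sizes, helper names, and choice of k are ours:

```python
# Sketch of WPNAM: binary OR-accumulated weights (equation 1), inner-product
# inference (equation 2), and a k-WTA threshold (equation 3).
import numpy as np

def train(patterns_x, patterns_y):
    """Equation 1: W is the OR, over all patterns, of the outer product x * y^T."""
    n, m = len(patterns_x[0]), len(patterns_y[0])
    W = np.zeros((n, m), dtype=np.uint8)
    for x, y in zip(patterns_x, patterns_y):
        W |= np.outer(x, y).astype(np.uint8)  # Hebbian outer product, ORed in
    return W

def infer(x, W, k):
    """Equation 2 (inner product) followed by equation 3 (k-WTA threshold)."""
    s = x @ W                      # integer potentials S_j
    theta = np.sort(s)[-k]         # threshold chosen so k neurons fire
    return (s >= theta).astype(np.uint8)

# Auto-associative use: store one sparse pattern, recall it from a noisy copy.
x = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)
W = train([x], [x])
noisy = x.copy(); noisy[4] = 1     # flip one bit to add noise
print(infer(noisy, W, k=2))        # recovers [1 1 0 0 0 0 0 0 0 0]
```

For binary vectors, the multiply inside `x @ W` is identical to the AND of equation 2, so the integer matrix product models the inference unit directly.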
II. SYSTEM ORGANIZATION
Figure 2 shows the system organization and data flow for the AM algorithm. First, noise is added to the original (clean) images; the size of each image is 128 X 128 pixels. The noisy image then goes through image processing for feature extraction. Up to this stage, the work is done on a PC using MATLAB. The extracted feature vectors are then transferred from the PC over a communication channel to the FPGA board, where they are multiplied by the weight matrix stored in memory. This operation is the inner product, which produces an integer 'sum vector'. A thresholding operation on the sum vector yields a binary (clean) vector. Finally, this binary vector is transferred back to the PC, which reassembles the clean original image.
Figure 6 shows the FPGA functional blocks and the weight matrix organization in memory. The memory stores the weight matrix sparsely, as the row and column indices of the active weight bits. The FPGA stores the input vector in the input matrix unit, which holds the non-zero row information of the input vector. The FPGA reads the weight matrix at the same row index, searches for the corresponding column numbers in the input vector, then accumulates the values and writes them to the result vector unit. This is done in the inner-product unit. The k-WTA unit sorts the incoming result and transfers it from the inner-product unit to the result vector unit.
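The sparse scheme just described can be modeled in a few lines. This Python sketch (names are ours, not from the design) shows how an inner product is computed from (row, column) index pairs alone:

```python
# Behavioral model of the Figure 6 scheme: the weight matrix is kept as
# (row, col) index pairs of its 1-bits, and the inner-product unit adds 1
# to a column's accumulator whenever that row is active in the input.
def sparse_inner_product(active_rows, weight_index_pairs, n_out):
    """active_rows: set of non-zero input positions.
    weight_index_pairs: iterable of (row, col) for each stored 1-bit of W."""
    result = [0] * n_out
    for row, col in weight_index_pairs:
        if row in active_rows:       # input bit i is 1 and W[i][j] is 1
            result[col] += 1         # accumulate into the result vector unit
    return result

# W has 1-bits at (0,0),(0,1),(1,0),(1,1); input rows {0, 1, 4} are active.
print(sparse_inner_product({0, 1, 4}, [(0, 0), (0, 1), (1, 0), (1, 1)], 10))
```

Because both the input vector and the weight matrix are sparse, iterating only over stored 1-bits avoids touching the mostly-zero dense matrix.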
V: ALGORITHMS OF NAM
Basically, there are two major algorithms in NAM. One is the training (or learning) algorithm, in which the network is exposed to a large number of input and output vectors; the outer product of vectors performed here is called training. The other is the inference algorithm, which receives an input vector and settles down to the nearest stored pair; the inner product of vectors performed here is called inference.
Here is a small example to illustrate the AM. Let X be a 10x1 input vector and Y a 10x1 output vector. The outer product, which is the learning algorithm, is

X * YT = W

giving a 10x10 weight matrix W. The operation XT * W then leads to vector Y; this is called inference.
NOTE: even if noise occurs in vector X, the operation XT * W still leads to vector Y.
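The 10-element example above, worked numerically in Python (the particular sparse pattern is our own choice; auto-associative, so Y = X):

```python
# Worked example of the outer-product learning step and inner-product
# inference step for a 10-element auto-associative pattern.
import numpy as np

X = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
Y = X                                # auto-associative: output equals input
W = np.outer(X, Y)                   # learning: 10x10 weight matrix X * Y^T
S = X @ W                            # inference: inner product gives potentials
Y_out = (S >= S.max()).astype(int)   # threshold at the peak potential
print(W.sum())    # number of 1-bits stored in W
print(Y_out)      # [1 1 0 0 0 0 0 0 0 0] -> recovers Y
```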
VI: HDL IMPLEMENTATION OF TRAINING ALGORITHM
We have studied the training algorithm in the previous topic. Now, in this topic, implementation of training algorithm by Verilog HDL is given. It requires
input vectors and output vectors to produce weight matrix. This weight matrix has to be very sparse.
Figure 7 shows the input-output pin diagram. It consists of the input vector, output vector, clock, reset, memory write, memory read, and data out. The complex arithmetic operations seen in the previous section are all carried out in Verilog for the hardware implementation.
The block diagram in Figure 8 shows how this algorithm generates the weight matrix. It consists of an 'and block', an 'or block', a counter, a clock, and memory. Here the vectors X and Y of the training image are feature-extracted vectors; for an auto-associative memory, the input vector X and the output vector Y must be the same. The 'and block' performs the AND operation on vectors X and Y. It works on the positive edge of the clock and sends its output to the 'or block', where the weight memory and the resultant vector of the 'and block' are ORed together. A counter, also enabled on the positive clock edge, generates the memory addresses used to place the weight matrix. The memory module holds the weight matrix, which is mathematically a two-dimensional array. When another image arrives for training, the existing weight matrix is ORed with the new one and stored back into the memory. This weight matrix is then used for inference.
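The training datapath can be summarized in a behavioral model. This is Python standing in for the Verilog (the function name and word-per-address layout are our assumptions): each "clock edge" ANDs one X bit against the Y vector and ORs the result into the weight word selected by the counter:

```python
# Behavioral model of the Figure 8 training datapath: AND the addressed X bit
# with the Y vector ('and block'), OR the result into the weight word at the
# counter's address ('or block' + memory write), one address per clock edge.
def train_one_image(weight_mem, x_bits, y_bits):
    """weight_mem: list of rows of the weight matrix, updated in place."""
    for addr in range(len(x_bits)):               # counter sweeps addresses
        and_row = [x_bits[addr] & y for y in y_bits]          # 'and block'
        weight_mem[addr] = [w | a for w, a in zip(weight_mem[addr], and_row)]
    return weight_mem

n = 4
mem = [[0] * n for _ in range(n)]
train_one_image(mem, [1, 0, 1, 0], [1, 0, 1, 0])   # first training image
train_one_image(mem, [0, 1, 0, 0], [0, 1, 0, 0])   # second image ORed on top
print(mem[0])   # [1, 0, 1, 0]
```

Training a second image only ever sets additional bits, which matches the OR accumulation of equation 1.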
The timing is calculated as follows. As shown in Figure 8, the ANDing of the two vectors is carried out by the 'and block' in time Tand, which is a function of the input vector size n and the minimum clock period Tclk; the operation takes one clock cycle per bit, so Tand = Tclk * n. In our experiment we used an n = 128-bit vector with a minimum clock period Tclk of 11.113 ns, so the 'and block' takes Tand = 1.42 us. The ORing of the two vectors is carried out by the 'or block' in time Tor, likewise Tor = Tclk * n; with the same n and Tclk, the 'or block' takes Tor = 1.42 us. Storing the weight matrix in memory requires another clock cycle, Tmem = Tclk. Hence three clock phases are needed per step of storing one image's extraction, and the algorithm takes O(n) time. In our experiment, the total time Tcomputation taken by this algorithm for single-image storage is 3615 us.
In general, the total computation time is Tcomputation = (Tand + Tor + Tmem) * Img, where Tcomputation is the total computation time of the training algorithm; Tand is the time taken by the 'and block'; Tor is the time taken by the 'or block'; Tmem is the time taken to store the weight matrix into memory; Tclk is the clock period; n is the extracted vector size; and Img is the number of images to be trained.
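The per-block figures above can be reproduced directly from the formulas; this sketch plugs in the experiment's constants (it reproduces Tand and Tor only, not the measured whole-image total):

```python
# Training-stage block timings: Tand = Tor = Tclk * n, Tmem = Tclk,
# with the constants reported for the experiment.
T_CLK = 11.113e-9          # minimum clock period from the experiment (s)
N = 128                    # input vector size (bits)

t_and = T_CLK * N          # 'and block' time
t_or = T_CLK * N           # 'or block' time
t_mem = T_CLK              # one cycle to write the weight word

print(round(t_and * 1e6, 2))                    # 1.42 us, as reported
print(round((t_and + t_or + t_mem) * 1e6, 2))   # per-step AND+OR+write time, us
```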
For sorting these values, we require a sorting block. Its time is a function of the input vector size n, the k-WTA number m, and the minimum clock period Tclk: Tsort = Tclk * n * m, where m = log2 n. In our experiment we used an n = 128-bit vector with a minimum clock period Tclk of 28.48 ns and k-WTA number m = 7, so the sorting block takes Tsort = 25.55 us.
Finally, we are left with the k-WTA module, whose time is a function of n and Tclk: Tk-wta = Tclk * n. With n = 128 bits and Tclk = 28.48 ns, the k-WTA block takes Tk-wta = 3.65 us.
Our total computation time Tcomputation is therefore 499.45 us serially and 40.13 us in parallel. In general, the total time is O(n²) for serial computation and O(n log2 n) for parallel computation.
In general, the total computation time is:
Tcomputation = Tand + Taddition + Tsort + Tk-wta
where Tcomputation is the total time taken by the whole algorithm; Tand is the total time taken by ANDing the input vector with the weight matrix; Taddition is the total time taken by the addition of 1s with the shifting operation; Tsort is the total time taken by the sorting operation on memory-1; Tk-wta is the total time taken by the k-WTA thresholding operation; and Tclk is the minimum clock period.
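The inference-stage figures can likewise be checked against the formulas, using the experiment's constants (note the computed Tsort comes out slightly below the reported 25.55 us):

```python
# Inference-stage timings: Tsort = Tclk * n * m with m = log2(n),
# and Tk-wta = Tclk * n, using the experiment's reported constants.
import math

T_CLK = 28.48e-9           # minimum clock period for the inference design (s)
N = 128                    # input vector size (bits)
M = int(math.log2(N))      # k-WTA number m = log2(n)

t_sort = T_CLK * N * M     # sorting block time
t_kwta = T_CLK * N         # k-WTA block time

print(M)                          # 7
print(round(t_sort * 1e6, 2))     # ~25.52 us (the paper reports 25.55 us)
print(round(t_kwta * 1e6, 2))     # 3.65 us, as reported
```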
Figure 11 shows the timings of the different blocks for serial computation. Most of the serial-computation time is taken by the addition operation, which is O(n²).
Figure 12 shows the timings of the different blocks for parallel computation, and the difference from serial computation is clear. For parallel computation, we performed the addition operation in parallel in O(n) time. This saves overall computation time, but it is a trade-off between cost and time, where cost refers to the memory or the number of adders used for the parallel computation.
Table 1. Timing comparison of HDL and MATLAB for serial computation
Size 40X40 64X64 80X80 128X128
LUTs (total 9312) 2212 3452 4386 6762
Verilog time(us) 171.63 253.48 298.71 499.45
MATLAB time(us) 141 236 282 472
Here the MATLAB time scales linearly, because the PC has a lower degree of parallelism than the FPGA. In the FPGA, the parallelism of memory fetching and the addition operation lets us obtain the result in far less time than a serial implementation of NAM.
Table 2. Timing comparison of HDL and MATLAB for parallel computation
size 40X40 64X64 80X80 128X128
LUTs (total 9312) 2892 4718 5582 8376
Verilog time(us) 14.92 22.16 27.37 40.13
MATLAB time(us) 141 236 282 472
IX: CONCLUSION
The implementation of AM gives an overview of the design methodology and hardware components. It clearly shows from simulation results that it
eradicates noise. It also gives information about the comparison between HDL and MATLAB (microprocessor). Also, the timing formulas are given to find
speed of NAM.
The main problem with PCs is that instructions are executed sequentially; even multi-core PCs are slower here than the FPGA, whose degree of parallelism is far greater. As shown in Tables 1 and 2, the FPGA is almost five times faster than the PC. We succeeded in implementing the NAM architecture proposed by Palm [1]. The methodology will help in mapping different networks onto FPGAs, and we have given the data sizes and bit widths of the various elements so that a design for another neural network can be compared with this one.
X: FUTURE WORK
In this paper, we have used binary vectors, but it is also possible with the use integer vectors to generate weight matrix for better accuracy and for that
we have to pay price of memory. So, it has to be trade off between accuracy and price. FPGA implementation can be done for real video (image) with
interfacing camera to the FPGA itself. Feature extraction was done by PC (MATLAB), but it can also be done by Verilog and get fitted into the FPGA and
hence we won’t pay price for communication time between PC and FPGA anymore. Still, we can optimise the architecture by experimenting on different
tools e.g. MATLAB. It can be done parallel by another methodology and optimise the timings for the overall computation.
REFERENCES
[1] G. Palm, "On associative memory," Biological Cybernetics, vol. 36, pp. 19-31, 1980.
[2] Intel, "60 Years of the transistor: 1947-2007," 2008. Available:
http://www.intel.com/technology/timeline.pdf.
[3] D. Hammerstrom, "A survey of bio-inspired and other alternative architectures," in Nanotechnology Information Technology - II, vol. 2. Weinheim,
Germany: Wiley-VCH Verlag GmbH & Co., 2008, pp. 251-285.
[4] S. Borkar, P. Dubey, K. Kahn, D. Kuck, H. Mulder, S. Pawlowski, and J. Rattner, "Platform 2015: Intel processor and platform evolution for the next decade," Technology Intel Magazine, pp. 1-10, 2005.
[5] S. Zankovych, T. Hoffmann, J. Seekamp, J.-U. Bruch, and C. M. S. Torres, "Nanoimprint lithography: challenges and prospects," Nanotechnology, vol. 12,
pp. 91-95, 2001.
[6] Mazad S. Zaveri and Dan Hammerstrom, "CMOL/CMOS Implementations of Bayesian Polytree Inference: Digital & Mixed-Signal Architectures and
Performance/Price," 2009.
[7] D. Hammerstrom, C. Gao, S. Zhu, and M. Butts, "FPGA implementation of very large associative memories - scaling issues," in FPGA Implementations of
Neural Networks, A. Omondi, Ed. Boston: Kluwer Academic Publishers, 2003.
[8] G. Palm, F. Schwenker, F. T. Sommer, and A. Strey, "Neural associative memories," in Associative Processing and Processors. Los Alamitos, CA.: IEEE
Computer Society, 1997, pp. 284-306.
[9] S. Zhu, "Associative memory as a Bayesian building block," Ph.D. dissertation, OGI School of Science and Engineering, Oregon Health and Science
University, Beaverton, Oregon 2008.
[10] C. H. Luk, C. Gao, D. Hammerstrom, M. Pavel, and D. Kerr, "Biologically inspired enhanced vision system (EVS) for aircraft landing guidance," presented at the International Joint Conference on Neural Networks, Budapest, Hungary, pp. 1751-1756, 2004.
For simulation, we take the clean input vector Xclean = [1100000000] and the noisy input vectors Xnoisy = [1110100000] and [1100010000].
Now the inference operation takes place: the clean input vector X is multiplied by the weight matrix W and gives the clean output vector. As this is an auto-associative memory, the input vector and output vector should be the same. As Figure 13 indicates, the output has not yet been received at this point; more clock cycles are required. Operation: Xclean * W = [1100000000] = Yclean = Xclean.
In Figure 14, after many clock cycles, we get the output: the noisy input vector is operated on with the weight matrix W and gives the clean output vector. Figure 14 also shows that if the input vector is clean, we get the clean output vector, because this is an auto-associative memory; the sorted values from the memory are indicated as well. Figure 15 clearly shows that if the input is noisy, the algorithm cleans it up; the clean output vector is shown in Figure 15. The main requirement for the algorithm to work properly is to generate a sparse weight matrix. Operation: Xnoisy * W = [1100000000] = Yclean = Xclean.
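The recall of these exact vectors can be cross-checked outside the HDL simulation. A Python reproduction (ours, not part of the original flow) using the paper's Xclean and both noisy variants:

```python
# Cross-check of the simulation example: store Xclean = 1100000000
# auto-associatively, then recall it from the two noisy inputs.
import numpy as np

x_clean = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)
W = np.outer(x_clean, x_clean)            # auto-associative weight matrix

for bits in ("1110100000", "1100010000"):
    x_noisy = np.array([int(b) for b in bits], dtype=np.uint8)
    s = x_noisy @ W                       # inference: inner product
    y = (s >= s.max()).astype(np.uint8)   # threshold at the top potential
    print(bits, "->", "".join(map(str, y)))   # both recover 1100000000
```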