
Design and Implementation of a CAM Memory Using Dual-Port RAM

Abstract:
A new information detection method is proposed for a very fast and efficient
search engine, implemented in hardware on an FPGA. The design takes advantage of
Content Addressable Memory (CAM), which supports both storage and matching
modes. The CAM blocks are built from the FPGA's available memory blocks to
reduce the access time of the whole system, and the entire memory can return
multiple matched results concurrently. The system performs pattern matching in
the CAMs in parallel and returns the addresses of all matches at once. Because
of these parallel multi-matching operations, the system can be applied to
pattern matching under various required constraint conditions without any
search algorithm. Multi-matched results are returned within 60 ns at an
operating frequency of 50 MHz, which increases the matching performance of any
information detection system that uses this method as its core.

Keywords - FPGA, Information Detection Hardware System, Content Addressable
Memory, Dual-port RAM, parallel operation, multiple matches.







Introduction
Nowadays, high-speed information detection plays an important role in many
applications. It is a key element in analyzing and detecting patterns and in
searching data, for example in network routers, telecommunications, image
recognition, sound recognition, character recognition, full-text search,
artificial intelligence in robots, bioinformatics, and DNA computation.
However, designing fast and efficient information detection in hardware is a
challenge for designers. Moreover, the information detection systems
implemented in hardware to date are still too limited to satisfy industrial
needs. Search efficiency is always a trade-off between the number of detection
operations and system resources. Search engines for various purposes have been
presented in hardware, but their matching operations were implemented
sequentially or with complicated algorithms. These methods slow down
information detection over large data sets and consume a large amount of space
in the system.









Content addressable memory:
Content-addressable memory (CAM) is a special type of computer memory used in
certain very high speed searching applications. It is also known as associative
memory, associative storage, or associative array, although the last term is more
often used for a programming data structure. Several custom computers, like
the Goodyear STARAN, were built to implement CAM, and were
designated associative computers.

Hardware associative array:
Unlike standard computer memory (random access memory or RAM) in which the
user supplies a memory address and the RAM returns the data word stored at that
address, a CAM is designed such that the user supplies a data word and the CAM
searches its entire memory to see if that data word is stored anywhere in it. If the
data word is found, the CAM returns a list of one or more storage addresses where
the word was found (and in some architectures, it also returns the data word, or
other associated pieces of data). Thus, a CAM is the hardware embodiment of what
in software terms would be called an associative array. The data word recognition
unit was proposed by Dudley Allen Buck in 1955.
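The contrast between RAM-style and CAM-style access described above can be sketched as follows. This is an illustrative software model only (not the hardware design in this paper); the stored words are arbitrary examples.

```python
# Illustrative sketch: contrasting RAM access (address in, data out) with
# CAM access (data in, matching addresses out) over the same storage.
stored = ["cat", "dog", "cat", "owl"]   # hypothetical memory contents

def ram_read(address):
    """RAM: the user supplies an address, the data word comes back."""
    return stored[address]

def cam_search(word):
    """CAM: the user supplies a data word, a list of every storage
    address holding that word comes back (possibly empty)."""
    return [addr for addr, data in enumerate(stored) if data == word]

print(ram_read(1))        # -> dog
print(cam_search("cat"))  # -> [0, 2]  (multiple matches)
print(cam_search("fox"))  # -> []      (no match)
```

Note that a real CAM examines all entries in a single hardware operation; the list comprehension here merely models the result, not the parallelism.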

Standards for content addressable memories:
A major interface definition for CAMs and other Network Search Elements (NSEs)
was specified in an Interoperability Agreement called the Look-Aside
Interface (LA-1 and LA-1B) developed by the Network Processing Forum, which
later merged with the Optical Internetworking Forum (OIF). Numerous devices
have been produced by Integrated Device Technology, Cypress Semiconductor,
IBM, Broadcom and others to the LA interface agreement. On December 11, 2007,
the OIF published the serial look-aside (SLA) interface agreement.

Semiconductor implementations:
Because a CAM is designed to search its entire memory in a single operation, it is
much faster than RAM in virtually all search applications. There are cost
disadvantages to CAM however. Unlike a RAM chip, which has simple storage
cells, each individual memory bit in a fully parallel CAM must have its own
associated comparison circuit to detect a match between the stored bit and the input
bit. Additionally, match outputs from each cell in the data word must be combined
to yield a complete data word match signal. The additional circuitry increases the
physical size of the CAM chip which increases manufacturing cost. The extra
circuitry also increases power dissipation since every comparison circuit is active
on every clock cycle. Consequently, CAM is used only in specialized applications
where the required search speed cannot be achieved by a less costly method. One
successful early implementation was a General Purpose Associative Processor IC
and System.
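The per-cell comparison logic described above, with per-bit match results combined into a single word-match line, can be modelled bitwise. This is a minimal sketch in which each stored word is a Python integer and the 5-bit width is an arbitrary choice.

```python
# Sketch of fully parallel CAM match logic: each stored bit is compared
# with the corresponding search bit (XNOR), and the per-bit results are
# ANDed together into one match line per word.
WIDTH = 5
ALL_ONES = (1 << WIDTH) - 1

def word_match(stored, query):
    xnor = ~(stored ^ query) & ALL_ONES   # 1 wherever the bits agree
    return xnor == ALL_ONES               # AND across every bit position

words = [0b10110, 0b01101, 0b10110]
match_lines = [word_match(w, 0b10110) for w in words]
print(match_lines)  # -> [True, False, True]
```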

Alternative implementations:
To achieve a different balance between speed, memory size and cost, some
implementations emulate the function of CAM by using standard tree search or
hashing designs in hardware, using hardware tricks like replication or pipelining to
speed up effective performance. These designs are often used in routers.
Ternary CAMs:
Binary CAM is the simplest type of CAM which uses data search words consisting
entirely of 1s and 0s. Ternary CAM (TCAM) allows a third matching state of "X"
or "Don't Care" for one or more bits in the stored data word, thus adding flexibility
to the search. For example, a ternary CAM might have a stored word of "10XX0"
which will match any of the four search words "10000", "10010", "10100", or
"10110". The added search flexibility comes at an additional cost over binary
CAM as the internal memory cell must now encode three possible states instead of
the two of binary CAM. This additional state is typically implemented by adding a
mask bit ("care" or "don't care" bit) to every memory cell.
Holographic associative memory provides a mathematical model for "Don't Care"
integrated associative recollection using complex valued representation.
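The mask-bit scheme just described can be sketched directly: each TCAM entry stores a value together with a "care" mask, and a query matches when it agrees with the value on every cared-about bit. The entry below encodes the "10XX0" example from the text.

```python
# Sketch of ternary matching with an explicit care mask. A mask bit of 0
# marks a "don't care" (X) position. "10XX0" becomes value 0b10000 with
# care mask 0b11001 (bits 4, 3 and 0 are cared about).
def tcam_match(value, care, query):
    return (query & care) == (value & care)

value, care = 0b10000, 0b11001   # stored word "10XX0"
for q in (0b10000, 0b10010, 0b10100, 0b10110, 0b00000):
    print(format(q, "05b"), tcam_match(value, care, q))
# The first four queries ("10000", "10010", "10100", "10110") match,
# exactly as stated in the text; "00000" does not.
```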

Example applications:
Content-addressable memory is often used in computer networking devices. For
example, when a network switch receives a data frame from one of its ports, it
updates an internal table with the frame's source MAC address and the port it was
received on. It then looks up the destination MAC address in the table to determine
what port the frame needs to be forwarded to, and sends it out on that port. The
MAC address table is usually implemented with a binary CAM so the destination
port can be found very quickly, reducing the switch's latency.
Ternary CAMs are often used in network routers, where each address has two
parts: the network address, which can vary in size depending on
the subnet configuration, and the host address, which occupies the remaining bits.
Each subnet has a network mask that specifies which bits of the address are the
network address and which bits are the host address. Routing is done by consulting
a routing table maintained by the router which contains each known destination
network address, the associated network mask, and the information needed to route
packets to that destination. Without CAM, the router compares the destination
address of the packet to be routed with each entry in the routing table, performing
a logical AND with the network mask and comparing it with the network address.
If they are equal, the corresponding routing information is used to forward the
packet. Using a ternary CAM for the routing table makes the lookup process very
efficient. The addresses are stored using "don't care" for the host part of the
address, so looking up the destination address in the CAM immediately retrieves
the correct routing entry; both the masking and comparison are done by the CAM
hardware.
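The routing lookup described above can be sketched as follows. The table entries are hypothetical; the host bits of each entry are the "don't care" part, and a hardware TCAM would check every entry in parallel, whereas this model scans them longest prefix first.

```python
# Sketch of a routing-table lookup where each entry masks off the host
# bits before comparison, as a TCAM would. Entries are hypothetical.
import ipaddress

table = [  # (network, next hop), sorted by descending prefix length
    (ipaddress.ip_network("192.168.1.0/24"), "port 2"),
    (ipaddress.ip_network("192.168.0.0/16"), "port 1"),
    (ipaddress.ip_network("0.0.0.0/0"),      "default gateway"),
]

def route(dest):
    addr = ipaddress.ip_address(dest)
    for net, hop in table:
        if addr in net:   # i.e. (dest AND netmask) == network address
            return hop

print(route("192.168.1.7"))   # -> port 2
print(route("192.168.9.9"))   # -> port 1
print(route("8.8.8.8"))       # -> default gateway
```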
Other CAM applications include:
CPU fully associative cache controllers and translation look-aside buffers
A CPU cache is a cache used by the central processing unit (CPU) of
a computer to reduce the average time to access memory. The cache is a
smaller, faster memory which stores copies of the data from frequently
used main memory locations. Most CPUs have different independent caches,
including instruction and data caches, where the data cache is usually
organized as a hierarchy of more cache levels (L1, L2 etc.)
Database engines
A database is an organized collection of data. The data are typically
organized to model relevant aspects of reality in a way that supports
processes requiring this information. For example, modeling the availability
of rooms in hotels in a way that supports finding a hotel with vacancies.
Database management systems (DBMSs) are specially designed software
applications that interact with the user, other applications, and the database
itself to capture and analyze data. A general-purpose DBMS is
a software system designed to allow the definition, creation, querying,
update, and administration of databases. Well-known DBMSs
include MySQL, MariaDB, PostgreSQL, SQLite, Microsoft SQL
Server, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice
Base and FileMaker Pro. A database is not generally portable across
different DBMSs, but different DBMSs can interoperate by
using standards such as SQL and ODBC or JDBC to allow a single
application to work with more than one database.
Data compression hardware
o In computer science and information theory, data compression, source
coding, or bit-rate reduction involves encoding information using
fewer bits than the original representation. Compression can be
either lossy or lossless. Lossless compression reduces bits by
identifying and eliminating statistical redundancy. No information is
lost in lossless compression. Lossy compression reduces bits by
identifying unnecessary information and removing it. The process of
reducing the size of a data file is popularly referred to as data
compression, although its formal name is source coding (coding done
at the source of the data before it is stored or transmitted).
o Compression is useful because it helps reduce resource usage, such as
data storage space or transmission capacity. Because compressed data
must be decompressed to use, this extra processing imposes
computational or other costs through decompression; this situation is
far from being a free lunch. Data compression is subject to a space
time complexity trade-off. For instance, a compression scheme for
video may require expensive hardware for the video to be
decompressed fast enough to be viewed as it is being decompressed,
and the option to decompress the video in full before watching it may
be inconvenient or require additional storage. The design of data
compression schemes involves trade-offs among various factors,
including the degree of compression, the amount of distortion
introduced (e.g., when using lossy data compression), and the
computational resources required to compress and uncompress the
data.
o New alternatives to traditional systems (which sample at full
resolution, then compress) provide efficient resource usage based on
principles of compressed sensing. Compressed sensing techniques
circumvent the need for data compression by sampling on a cleverly
selected basis.
Artificial neural networks
o In computer science and related fields, artificial neural networks are
computational models inspired by animals' central nervous systems (in
particular the brain) that are capable of machine learning and pattern
recognition. They are usually presented as systems of interconnected
"neurons" that can compute values from inputs by feeding information
through the network.
o For example, in a neural network for handwriting recognition, a set of
input neurons may be activated by the pixels of an input image
representing a letter or digit. The activations of these neurons are then
passed on, weighted and transformed by some function determined by
the network's designer, to other neurons, etc., until finally an output
neuron is activated that determines which character was read.
o Like other machine learning methods, neural networks have been used
to solve a wide variety of tasks that are hard to solve using ordinary
rule-based programming, including computer vision and speech
recognition.
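o The feed-forward step described above (inputs weighted, summed, and
passed through a function chosen by the designer) can be sketched for a
single neuron. The weights below are arbitrary illustrative values, not
a trained network.

```python
# Minimal sketch of one artificial neuron: a weighted sum of the inputs
# plus a bias, passed through a sigmoid activation function.
import math

def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))   # sigmoid squashes to (0, 1)

pixels = [0.0, 1.0, 1.0]                # hypothetical input activations
out = neuron(pixels, [0.5, -0.2, 0.8], bias=0.1)
print(round(out, 3))                    # activation strength of this neuron
```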
Intrusion Prevention System
o An intrusion detection system (IDS) is a device or software
application that monitors network or system activities for malicious
activities or policy violations and produces reports to a management
station. Some systems may attempt to stop an intrusion attempt but
this is neither required nor expected of a monitoring system. Intrusion
detection and prevention systems (IDPS) are primarily focused on
identifying possible incidents, logging information about them, and
reporting attempts. In addition, organizations use IDPSes for other
purposes, such as identifying problems with security policies,
documenting existing threats and deterring individuals from violating
security policies. IDPSes have become a necessary addition to the
security infrastructure of nearly every organization.
o IDPSes typically record information related to observed events, notify
security administrators of important observed events and produce
reports. Many IDPSes can also respond to a detected threat by
attempting to prevent it from succeeding. They use several response
techniques, which involve the IDPS stopping the attack itself,
changing the security environment (e.g. reconfiguring a firewall) or
changing the attack's content.

Random-access memory:
Random-access memory is a form of computer data storage. A random-access
device allows stored data to be accessed directly in any random order. In contrast,
other data storage media such as hard disks, CDs, DVDs and magnetic tape, as
well as early primary memory types such as drum memory, read and write data
only in a predetermined order, consecutively, because of mechanical design
limitations. Therefore, the time to access a given data location varies significantly
depending on its physical location.
Today, random-access memory takes the form of integrated circuits. Strictly
speaking, modern types of DRAM are not random access, as data is read in bursts,
although the name DRAM / RAM has stuck. However, many types
of SRAM, ROM, OTP, and NOR flash are still random access even in a strict
sense. RAM is normally associated with volatile types of memory (such
as DRAM memory modules), where its stored information is lost if the power is
removed. Many other types of non-volatile memory are RAM as well, including
most types of ROM and a type of flash memory called NOR-Flash. The first RAM
modules to come into the market were created in 1951 and were sold until the late
1960s and early 1970s.




History:
Early computers used relays or delay lines for "main" memory functions.
Ultrasonic delay lines could only reproduce data in the order it was written. Drum
memory could be expanded at low cost but retrieval of non-sequential memory
items required knowledge of the physical layout of the drum to optimize speed.
Latches built out of vacuum tube triodes, and later, out of discrete transistors, were
used for smaller and faster memories such as random-access register banks and
registers. Such registers were relatively large, power-hungry and too costly to use
for large amounts of data; generally only a few hundred or few thousand bits of
such memory could be provided.
The first practical form of random-access memory was the Williams tube starting
in 1947. It stored data as electrically charged spots on the face of a cathode ray
tube. Since the electron beam of the CRT could read and write the spots on the
tube in any order, memory was random access. The capacity of the Williams tube
was a few hundred to around a thousand bits, but it was much smaller, faster, and
more power-efficient than using individual vacuum tube latches. Developed at
the University of Manchester in England, the Williams tube provided the medium
on which the first electronically stored-memory program was implemented in
the Manchester Small-Scale Experimental Machine (SSEM) computer, which first
successfully ran a program on 21 June 1948. In fact, rather than the Williams
tube memory being designed for the SSEM, the SSEM was a testbed to
demonstrate the reliability of the memory.
Magnetic-core memory was invented in 1947 and developed up until the mid-
1970s. It became a widespread form of random-access memory, relying on an
array of magnetized rings. By changing the sense of each ring's magnetization, data
could be stored with one bit stored per ring. Since every ring had a combination of
address wires to select and read or write it, access to any memory location in any
sequence was possible.
Magnetic core memory was the standard form of memory system until displaced
by solid-state memory in integrated circuits, starting in the early 1970s. Robert H.
Dennard invented dynamic random-access memory (DRAM) in 1968; this allowed
replacement of a 4 or 6-transistor latch circuit by a single transistor for each
memory bit, greatly increasing memory density at the cost of volatility. Data was
stored in the tiny capacitance of each transistor, and had to be periodically
refreshed in a few milliseconds before the charge could leak away.
Prior to the development of integrated read-only memory (ROM)
circuits, permanent (or read-only) random-access memory was often constructed
using diode matrices driven by address decoders, or specially wound core rope
memory planes.

Types of RAM:
The three main forms of modern RAM are static RAM (SRAM), dynamic
RAM (DRAM) and phase-change memory (PRAM). In SRAM, a bit of data is
stored using the state of a flip-flop. This form of RAM is more expensive to
produce, but is generally faster and requires less power than DRAM and, in
modern computers, is often used as cache memory for the CPU. DRAM stores a bit
of data using a transistor and capacitor pair, which together comprise a memory
cell. The capacitor holds a high or low charge (1 or 0, respectively), and the
transistor acts as a switch that lets the control circuitry on the chip read the
capacitor's state of charge or change it. As this form of memory is less expensive to
produce than static RAM, it is the predominant form of computer memory used in
modern computers.
Both static and dynamic RAM are considered volatile, as their state is lost or reset
when power is removed from the system. By contrast, read-only memory (ROM)
stores data by permanently enabling or disabling selected transistors, such that the
memory cannot be altered. Writeable variants of ROM (such
as EEPROM and flash memory) share properties of both ROM and RAM, enabling
data to persist without power and to be updated without requiring special
equipment. These persistent forms of semiconductor ROM include USB flash
drives, memory cards for cameras and portable devices, etc. ECC memory (which
can be either SRAM or DRAM) includes special circuitry to detect and/or correct
random faults (memory errors) in the stored data, using parity bits or error
correction codes.
In general, the term RAM refers solely to solid-state memory devices (either
DRAM or SRAM), and more specifically the main memory in most computers. In
optical storage, the term DVD-RAM is somewhat of a misnomer since, unlike
CD-RW or DVD-RW, it does not need to be erased before reuse. Nevertheless, a
DVD-RAM behaves much like a hard disk drive, if somewhat slower.

Memory hierarchy:
One can read and over-write data in RAM. Many computer systems have a
memory hierarchy consisting of CPU registers, on-die SRAM caches,
external caches, DRAM, paging systems and virtual memory or swap space on a
hard drive. This entire pool of memory may be referred to as "RAM" by many
developers, even though the various subsystems can have very different access
times, violating the original concept behind the random access term in RAM. Even
within a hierarchy level such as DRAM, the specific row, column, bank, rank,
channel, or interleave organization of the components makes the access time
variable, although not to the extent that access times to rotating storage media
or tape are variable. The overall goal of using a memory hierarchy is to obtain
the highest possible average access performance while minimizing the total cost
of the entire memory system (generally, the memory hierarchy follows the access
times, with the fast CPU registers at the top and the slow hard drive at the
bottom).
In many modern personal computers, the RAM comes in an easily upgraded form
of modules called memory modules or DRAM modules about the size of a few
sticks of chewing gum. These can quickly be replaced should they become
damaged or when changing needs demand more storage capacity. As suggested
above, smaller amounts of RAM (mostly SRAM) are also integrated in
the CPU and other ICs on the motherboard, as well as in hard-drives, CD-ROMs,
and several other parts of the computer system.

Other uses of RAM:
In addition to serving as temporary storage and working space for the operating
system and applications, RAM is used in numerous other ways.

Virtual memory:
Most modern operating systems employ a method of extending RAM capacity,
known as "virtual memory". A portion of the computer's hard drive is set aside for
a paging file or a scratch partition, and the combination of physical RAM and the
paging file form the system's total memory. (For example, if a computer has 2 GB
of RAM and a 1 GB page file, the operating system has 3 GB total memory
available to it.) When the system runs low on physical memory, it can "swap"
portions of RAM to the paging file to make room for new data, as well as to read
previously swapped information back into RAM. Excessive use of this mechanism
results in thrashing and generally hampers overall system performance, mainly
because hard drives are far slower than RAM.

RAM disk:
Software can "partition" a portion of a computer's RAM, allowing it to act as a
much faster hard drive that is called a RAM disk. A RAM disk loses the stored
data when the computer is shut down, unless memory is arranged to have a standby
battery source.

Shadow RAM:
Sometimes, the contents of a relatively slow ROM chip are copied to read/write
memory to allow for shorter access times. The ROM chip is then disabled while
the initialized memory locations are switched in on the same block of addresses
(often write-protected). This process, sometimes called shadowing, is fairly
common in both computers and embedded systems.
As a common example, the BIOS in typical personal computers often has an option
called "use shadow BIOS" or similar. When enabled, functions relying on data
from the BIOS's ROM will instead use DRAM locations (most can also toggle
shadowing of video card ROM or other ROM sections). Depending on the system,
this may not result in increased performance, and may cause incompatibilities. For
example, some hardware may be inaccessible to the operating system if shadow
RAM is used. On some systems the benefit may be hypothetical because the BIOS
is not used after booting in favor of direct hardware access. Free memory is
reduced by the size of the shadowed ROMs.

Memory wall:
The "memory wall" is the growing disparity of speed between CPU and memory
outside the CPU chip. An important reason for this disparity is the limited
communication bandwidth beyond chip boundaries. From 1986 to
2000, CPU speed improved at an annual rate of 55% while memory speed only
improved at 10%. Given these trends, it was expected that memory latency would
become an overwhelming bottleneck in computer performance.
CPU speed improvements slowed significantly partly due to major physical
barriers and partly because current CPU designs have already hit the memory wall
in some sense. Intel summarized these causes in a 2005 document.
First of all, as chip geometries shrink and clock frequencies rise, the
transistor leakage current increases, leading to excess power consumption and
heat... Secondly, the advantages of higher clock speeds are in part negated by
memory latency, since memory access times have not been able to keep pace with
increasing clock frequencies. Third, for certain applications, traditional serial
architectures are becoming less efficient as processors get faster (due to the so-
called Von Neumann bottleneck), further undercutting any gains that frequency
increases might otherwise buy. In addition, partly due to limitations in the means
of producing inductance within solid state devices, resistance-capacitance (RC)
delays in signal transmission are growing as feature sizes shrink, imposing an
additional bottleneck that frequency increases don't address.
The RC delays in signal transmission were also noted in Clock Rate versus IPC:
The End of the Road for Conventional Micro-architectures which projects a
maximum of 12.5% average annual CPU performance improvement between 2000
and 2014. The data on Intel Processors clearly shows a slowdown in performance
improvements in recent processors. However, Intel's Core 2 Duo-processors
(codenamed Conroe) showed a significant improvement over previous Pentium
4 processors; due to a more efficient architecture, performance increased while
clock rate actually decreased.

Dual ported RAM:
Dual-ported RAM (DPRAM) is a type of Random Access Memory that allows
multiple reads or writes to occur at the same time, or nearly the same time, unlike
single-ported RAM which only allows one access at a time.
Video RAM or VRAM is a common form of dual-ported dynamic RAM mostly
used for video memory, allowing the CPU to draw the image at the same time the
video hardware is reading it out to the screen.
Apart from VRAM, most other types of dual-ported RAM are based on static
RAM technology.
Most CPUs implement the processor registers as a small dual-ported or multi-
ported RAM.
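A minimal behavioural sketch of dual-port access follows: one write-only port and one read-only port can be exercised in the same cycle. The class and port naming are illustrative only (the Port-A/Port-B convention is borrowed for readability); real DPRAMs also define arbitration rules for same-address conflicts, which this model ignores.

```python
# Sketch of a dual-port RAM: Port-A may write while Port-B reads,
# within one modelled clock cycle.
class DualPortRAM:
    def __init__(self, words):
        self.mem = [0] * words

    def cycle(self, write_addr=None, write_data=None, read_addr=None):
        """One clock cycle: Port-B read and Port-A write may both occur."""
        read_out = self.mem[read_addr] if read_addr is not None else None
        if write_addr is not None:
            self.mem[write_addr] = write_data
        return read_out

ram = DualPortRAM(words=8)
ram.cycle(write_addr=3, write_data=0xAB)        # Port-A write only
value = ram.cycle(read_addr=3,                  # both ports active:
                  write_addr=4, write_data=0xCD)  # read 3, write 4
print(hex(value))  # -> 0xab
```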




Parallel computing:
Parallel computing is a form of computation in which many calculations are
carried out simultaneously, operating on the principle that large problems can often
be divided into smaller ones, which are then solved concurrently ("in parallel").
There are several different forms of parallel computing: bit-level, instruction
level, data, and task parallelism. Parallelism has been employed for many years,
mainly in high-performance computing, but interest in it has grown lately due to
the physical constraints preventing frequency scaling. As power consumption (and
consequently heat generation) by computers has become a concern in recent
years, parallel computing has become the dominant paradigm in computer
architecture, mainly in the form of multi-core processors.
Parallel computers can be roughly classified according to the level at which the
hardware supports parallelism, with multi-core and multi-processor computers
having multiple processing elements within a single machine,
while clusters, MPPs, and grids use multiple computers to work on the same task.
Specialized parallel computer architectures are sometimes used alongside
traditional processors, for accelerating specific tasks.
Parallel computer programs are more difficult to write than sequential
ones, because concurrency introduces several new classes of potential software
bugs, of which race conditions are the most
common. Communication and synchronization between the different subtasks are
typically some of the greatest obstacles to getting good parallel program
performance. The maximum possible speed-up of a single program as a result of
parallelization is given by Amdahl's law.
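Amdahl's law, which bounds the speed-up mentioned above, can be computed directly: for a program whose fraction p is parallelizable, running on n processors, the speed-up is 1 / ((1 - p) + p / n). The example values below are arbitrary.

```python
# Amdahl's law: the serial fraction (1 - p) limits overall speed-up no
# matter how many processors work on the parallel fraction p.
def amdahl_speedup(p, n):
    return 1 / ((1 - p) + p / n)

print(round(amdahl_speedup(0.9, 8), 2))      # -> 4.71 (8 cores, 90% parallel)
print(round(amdahl_speedup(0.9, 10**6), 2))  # -> 10.0 (limit is 1/(1-p))
```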

Design approach
Our objective is to design a search engine on a hardware system such as an FPGA
without increasing the detection time or the system complexity. Moreover, a
system that is fast, efficient, and low cost has become increasingly necessary.
We propose a novel information detection method in hardware in which the search
data are processed in parallel. The architecture is based on Content
Addressable Memory (CAM) and parallel structures to achieve fast detection and
accelerate search performance while consuming as few system resources as
possible. Because a CAM can be implemented with digital circuits, the CAM and
the whole system can be implemented entirely on a programmable hardware device
such as an FPGA.
This paper makes the following main contributions:
1) An efficient and fast CAM structure whose multi-match values are returned
in parallel.
2) A system that operates with a parallel structure and produces the multi-
match results in parallel at high speed (60 ns on the FPGA).
3) A system proposed for search purposes that does not rely on any search
algorithm.
4) A system design based on CAM blocks and simple logic circuits such as
SHIFT and AND, without a CPU or complex computations.




CAM STRUCTURE ON FPGA
Content Addressable Memory (CAM) plays a very important part in our system.
There are many approaches for CAM designs. Most of the CAM designs are
mainly focused on circuit design aspects. However, designing CAM on FPGA
hardware has also received much attention. For example, the CAM can be carried
out by using Look-Up Table (LUT) or memory resources on FPGA devices.


Fig: DUAL PORT RAM structure

In this paper, we have designed the CAM using the memory resources of an Altera
Cyclone IV E device. The reasons are to save logic resources, to use the
available memory resources, and to accelerate the match speed of the system.
Because a CAM has a structure similar to RAM, we have implemented the CAM based
on the dual-port RAM structure, taking advantage of the M9K memory blocks of
the Cyclone IV E device as shown in Fig. Each dual-port M9K block consists of
8192 memory bits plus 1024 bits for parity check codes. Port-A is used to store
or erase data in the CAM and is therefore a write-only port. Port-B is used to
look up matched data and is a read-only port.

Port-A is configured with a 13-bit address (2^13 = 8192 words) and 1-bit data,
for 8192 words x 1 bit = 8192 bits. Port-B is configured with an 8-bit address
(2^8 = 256 words) and a 32-bit data width, for 256 words x 32 bits = 8192 bits.
The size of these two ports is therefore always the same 8192 bits. At Port-B,
the address port is treated as the match data and the data port as the match
address: the 8-bit address becomes the 8-bit match data, while the 32-bit data
word becomes the 32-bit one-hot match address. In other words, Port-B behaves
like a normal RAM read, but the roles of the address and data ports are swapped
to build the CAM function. The detailed swapping process is illustrated in
Fig. 2. The expected match data is applied to Port-B's address port, and the
matched address is returned via Port-B's data port. The data stored in memory
are treated as memory addresses with the same indexes, so that all matching
data, also called CAM addresses, are read out concurrently. In this case, one
M9K is configured as one 32-word x 8-bit CAM.

Fig: CAM circuit based on dual-port RAM

A dual-port M9K block can be configured as CAMs of various sizes. Because our
CAM is built from dual-port M9K blocks with 8192 one-bit words, corresponding
to 13 address bits, the bit length and the depth of one CAM unit device can be
varied as long as the total number of address bits and data bits remains 13. For
instance, the CAM can be flexibly configured with depths of 32, 16, 8 or 4 words
and data lengths of 8, 9, 10 or 11 bits, respectively. The CAM sizes with their
various bit lengths are listed in Table I. In our research, however, we use only the
32-word x 8-bit size as the CAM unit device for cascading, so we focus on the
32-word x 8-bit CAM only. The output of our CAM is a one-hot match address
register storing the multiple matched values in parallel. The match detection
output is validated by the match signal, which is the bitwise OR of all the one-hot
match addresses from the CAM.
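The depth/width trade-off behind Table I follows from the fixed 13-bit budget and can be sanity-checked with a short script (our own illustrative arithmetic, not output from the design tools):

```python
# Possible CAM shapes from one 8192-bit block: with log2(depth) address
# bits and `width` data bits summing to 13, the one-hot lookup table of
# depth x 2**width bits always fills the block exactly.
TOTAL_BITS = 13

shapes = []
for depth_bits in (5, 4, 3, 2):
    depth = 1 << depth_bits
    width = TOTAL_BITS - depth_bits
    assert depth * (1 << width) == 8192   # block capacity is invariant
    shapes.append((depth, width))
    print(f"{depth:2d}-word x {width}-bit CAM")

# shapes == [(32, 8), (16, 9), (8, 10), (4, 11)], matching Table I
```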








Conclusion
A new information detection method, proposed for a very fast and efficient search
engine, has been implemented successfully on a hardware system using an FPGA.
Based on the parallel multi-matching operations, the information detection system
can be applied to pattern matching with various defined search patterns without
using any search principles. This improves the detection and search performance
of the information processing and detection systems that use it as the core system.
The most significant advantages of this information detection system are the use
of CAM blocks with parallel match outputs, the parallel processing operations,
and very simple logic circuits such as AND and SHIFT for very fast data
detection. This leads to fewer detection operations and accelerates the search
performance. The multi-matched values of various search patterns are returned
concurrently. All the designs are implemented successfully on the FPGA, and the
system produces the correct addresses in parallel.







4. INTRODUCTION TO XILINX ISE

Xilinx designs, develops and markets programmable logic products
including integrated circuits (ICs), software design tools, predefined system
functions delivered as intellectual property (IP) cores, design services, customer
training, field engineering and technical support. Xilinx provides both FPGA and
CPLD programmable logic devices to electronic equipment manufacturers in
markets such as communications, industrial, consumer, automotive and data
processing.

It is the world's largest supplier of programmable logic devices, the inventor
of the field programmable gate array (FPGA) and the first semiconductor company
with a fabless manufacturing model.

The Integrated Software Environment (ISE) Design Suite is the central
electronic design automation (EDA) product family sold by Xilinx. The ISE
Design Suite features include design entry and synthesis supporting Verilog or
VHDL, place-and-route (PAR), verification and debugging using the ChipScope
Pro tools, and creation of the bit files that are used to configure the chip.

The ISE 9.1i is a hands-on learning tool for new users of the ISE software
and for users who wish to refresh their knowledge of the software. It demonstrates
basic set-up and design methods available in the PC version of the ISE software.
We will have a greater understanding of how to implement our own design flow
using the ISE 9.1i software.


This section gives Xilinx PLD designers a quick overview of the basic design
process using ISE 9.1i. We will gain an understanding of how to create, verify,
and implement a design. ISE controls all aspects of the design flow. Through the
Project Navigator interface, you can access all of the design entry and design
implementation tools. You can also access the files and documents associated with
your project.

THE FOLLOWING ARE THE STEPS IN DOING A PROJECT:
Getting Started
Create a New Project
Create an HDL Source
Design Simulation
Create Timing Constraints
Implement Design and Verify Constraints
Reimplement Design and Verify Pin Locations
Dumping the code onto the Spartan-3 Demo Board

4.1 SOFTWARE REQUIREMENTS
We must install the following software: ISE 9.1i


4.2 HARDWARE REQUIREMENTS
We must have the following hardware: the Spartan-3 Startup Kit, containing the
Spartan-3 Startup Kit Demo Board (FPGA).

4.3 PROJECT NAVIGATOR MAIN WINDOW:
The following figure shows the Project Navigator main window, which
allows you to manage your design starting with design entry through device
configuration.




Figure 1. Project Navigator Window

1. Toolbar
2. Sources window
3. Processes window
4. Workspace
5. Transcript window
4.4 USING THE SOURCES WINDOW:
The first step in implementing your design for a Xilinx FPGA or
CPLD is to assemble the design source files into a project. The Sources tab
in the Sources window shows the source files you create and add to your
project, as shown in the following figure.

Figure 2. Sources Window

4.5 USING THE PROCESSES WINDOW
The Processes tab in the Processes window allows you to run
actions or "processes" on the source file you select in the Sources tab of
the Sources window. The processes change according to the source file you select.
The Processes tab shows the available processes in a hierarchical view. Processes are
arranged in the order of a typical design flow: project creation, design entry,
constraints management, synthesis, implementation, and programming file
creation.


Figure 3. Process Window


PROCESS TYPES:
The following types of processes are available as you work on your design:
Tasks
When you run a task process, the ISE software runs in "batch mode";
that is, the software processes your source file but does not open any
additional software tools in the Workspace. Output from the process appears
in the Transcript window.
Reports
Most tasks include report sub-processes, which generate a summary or
status report, for example, the Synthesis Report or Map Report. When
you run a report process, the report appears in the Workspace.
Tools
When you run a tools process, the related tool launches in standalone
mode or appears in the Workspace, where you can view or modify your
design source files.

PROCESS STATUS:
Project Navigator keeps track of the changes you make to a source file
and shows the status of each process with the following status icons:


Running
This icon shows that the process is running.
Up-to-date
This icon shows that the process ran successfully with no errors or
warnings and does not need to be rerun. If the icon is next to a report
process, the report is up-to-date; however, associated tasks may have
warnings or errors.
Warnings reported
This icon shows that the process ran successfully but that warnings
were encountered.
Errors reported
This icon shows that the process ran but encountered an error.
Out-of-Date
This icon shows that you made design changes, which require that the
Process be rerun. If this icon is next to a report process, you can rerun
the associated task process to create an up-to-date version of the report.
No icon
If there is no icon, this shows that the process was never run.




2. INTRODUCTION TO VLSI

VLSI stands for "Very Large Scale Integration". This is the field which
involves packing more and more logic devices into smaller and smaller areas.
Thanks to VLSI, circuits that would once have taken a full board of space can now
be put into a small space a few millimeters across! This has opened up a big
opportunity to do things that were not possible before.
VLSI circuits are everywhere: your computer, your car, your brand new
state-of-the-art digital camera, your cell phone, and so on. All this involves a lot
of expertise on many fronts within the same field, which we will look at in later
sections.
VLSI has been around for a long time; there is nothing new about it. But as
a side effect of advances in the world of computers, there has been a dramatic
proliferation of tools that can be used to design VLSI circuits. Alongside this,
obeying Moore's law, the capability of an IC has increased exponentially over the
years in terms of computation power, utilisation of available area, and yield. The
combined effect of these two advances is that people can now put diverse
functionality into ICs, opening up new frontiers.
Examples are embedded systems, where intelligent devices are put inside
everyday objects, and ubiquitous computing, where small computing devices
proliferate to such an extent that even the shoes you wear may actually do
something useful, like monitoring your heartbeat! These two fields are somewhat
related, and getting into their description could easily lead to another article.
2.1 DEALING WITH VLSI CIRCUITS
Digital VLSI circuits are predominantly CMOS based. The way normal
blocks like latches and gates are implemented is different from what students have
seen so far, but the behaviour remains the same. All this miniaturisation brings
new things to consider. A lot of thought has to go into the actual implementation
as well as the design. Let us look at some of the factors involved:
1. Circuit Delays. Large complicated circuits running at very high frequencies have
one big problem to tackle: the delays in the propagation of signals through gates
and wires, even across areas only a few micrometers across! The operating speed is
so high that, as the delays add up, they can become comparable to the clock period
itself.
2. Power. Another effect of high operating frequencies is increased power
consumption. This has a two-fold effect: devices consume batteries faster, and heat
dissipation increases. Coupled with the fact that surface areas have decreased, heat
poses a major threat to the stability of the circuit itself.

3. Layout. Laying out the circuit components is a task common to all branches of
electronics. What is special in our case is that there are many possible ways to
do this; there can be multiple layers of different materials on the same silicon, there
can be different arrangements of the smaller parts for the same component, and so
on. Power dissipation and speed in a circuit present a trade-off; if we try to
optimize one, the other is affected. The choice between the two is determined
by the way we choose to lay out the circuit components. Layout can also affect the
fabrication of VLSI chips, making the components either easy or difficult to
implement on the silicon.

2.2 THE VLSI DESIGN PROCESS
A typical digital design flow is as follows:
Specification
Architecture
RTL Coding
RTL Verification
Synthesis
Backend

Tape Out to Foundry to get the end product: a wafer with a repeated number
of identical ICs.

All modern digital designs start with a designer writing a hardware
description of the IC in a Hardware Description Language (HDL) such as Verilog
or VHDL. A Verilog or VHDL program essentially describes the hardware (logic
gates, flip-flops, counters, etc.), the interconnect of the circuit blocks, and the
functionality. Various CAD tools are available to synthesize a circuit based on
the HDL. The most widely used synthesis tools come from two CAD companies,
Synopsys and Cadence. Without going into details, we can say that VHDL can be
called the "C" of VLSI design; VHDL stands for VHSIC Hardware Description
Language, where VHSIC stands for "Very High Speed Integrated Circuit". This
language is used to design circuits at a high level, in two ways. It can either be a
behavioural description, which describes what the circuit is supposed to do, or a
structural description, which describes what the circuit is made of. There are
other languages for describing circuits, such as Verilog, which work in a similar
fashion. Both forms of description are then used to generate a very low-level
description that actually spells out how all this is to be fabricated on the silicon
chips. This results in the manufacture of the intended IC.
2.3 A TYPICAL ANALOG DESIGN FLOW IS AS FOLLOWS:
In case of analog design, the flow changes somewhat.
Specifications

Architecture

Circuit Design

SPICE Simulation

Layout

Parametric Extraction / Back Annotation

Final Design

Tape Out to foundry.


While digital design is highly automated now, only a very small portion of
analog design can be automated. There is a hardware description language called
AHDL, but it is not widely used, as it does not accurately capture the behavioral
model of the circuit because of the complexity of the effects of parasitics on the
analog behavior of the circuit. Many analog chips are what are termed flat or
non-hierarchical designs. This is true for small-transistor-count chips such as an
operational amplifier, a filter or a power management chip. For more complex
analog chips such as data converters, the design is done at the transistor level,
building up to a cell level, then a block level, and then integrated at the chip level.
Not many CAD tools are available for analog design even today, and thus analog
design remains a difficult art. SPICE remains the most useful simulation tool for
analog as well as digital design.
2.4 MOST OF TODAY'S VLSI DESIGNS ARE CLASSIFIED INTO
THREE CATEGORIES:
2.4.1 ANALOG:
Small-transistor-count precision circuits such as amplifiers, data converters,
filters, phase-locked loops, sensors, etc.
2.4.2 ASICS OR APPLICATION SPECIFIC INTEGRATED CIRCUITS:
Progress in the fabrication of ICs has enabled us to create fast and
powerful circuits in smaller and smaller devices. This also means that we can pack
a lot more functionality into the same area. The biggest application of this
ability is found in the design of ASICs. These are ICs created for specific
purposes: each device is created to do a particular job, and to do it well. The most
common application area for this is DSP: signal filters, image compression, etc.
To go to extremes, consider the fact that a digital wristwatch normally consists of
a single IC doing all the time-keeping jobs as well as extra features like games, a
calendar, etc.


2.4.3 SoC OR SYSTEMS ON A CHIP:
These are highly complex mixed signal circuits (digital and analog all on the
same chip). A network processor chip or a wireless radio chip is an example of an
SoC.



















5. INTRODUCTION TO FPGA (SPARTAN 3E) KIT

A field-programmable gate array (FPGA) is an integrated circuit designed
to be configured by the customer or designer after manufacturing, hence
"field-programmable". The FPGA configuration is generally specified using a
hardware description language (HDL). FPGAs can be used to implement any
logical function. The ability to update the functionality after shipping, and the low
non-recurring engineering costs (notwithstanding the generally higher unit cost),
offer advantages for many applications.


FPGAs contain programmable logic components called "logic blocks", and a
hierarchy of reconfigurable interconnects that allow the blocks to be "wired
together", somewhat like a one-chip programmable breadboard. Logic blocks can
be configured to perform complex combinational functions, or merely simple logic
gates like AND and XOR. In most FPGAs, the logic blocks also include memory
elements, which may be simple flip-flops or more complete blocks of memory.


An alternate approach to using hard-macro processors is to make use of "soft"
processor cores that are implemented within the FPGA logic.
To define the behavior of the FPGA, the user provides a hardware
description language (HDL) design or a schematic design. The HDL form is better
suited to working with large structures, because it is possible to specify them
numerically rather than having to draw every piece by hand.

As previously mentioned, many modern FPGAs can be reprogrammed at
"run time", and this is leading to the idea of reconfigurable computing or
reconfigurable systems.
Applications of FPGAs include digital signal processing, software-defined
radio, aerospace and defense systems, medical imaging, computer vision, speech
recognition, cryptography, bioinformatics, computer hardware emulation, radio
astronomy, metal detection and a growing range of other areas.
Xilinx grew quickly and unchallenged from 1985 to the mid-1990s, when
competitors sprouted up.
Xilinx has two main FPGA families: the high-performance Virtex series and
the high-volume Spartan series, with a cheaper EasyPath option for ramping to
volume production.

5.1 SPARTAN FAMILY
The Spartan series targets applications with a low power footprint, extreme
cost sensitivity and high volume, e.g. displays, set-top boxes, wireless routers and
other consumer applications. The Spartan-3E consumes 70-90% less power in
suspend mode and 40-50% less static power compared to standard devices.
Verilog HDL is a hardware description language used to design and
document electronic systems. Verilog HDL allows designers to design at various
levels of abstraction. It is the most widely used HDL, with a user community of
more than 50,000 active designers.


Figure 4. FPGA Spartan 3e Starter Kit


5.2 Spartan 3e Architecture
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the
internal logic of the device. Each IOB supports bidirectional data flow plus 3-state
operation, and supports a variety of signal standards, including four
high-performance differential standards. Double Data-Rate (DDR) registers are
included.
Block RAM provides data storage in the form of 18-Kbit dual-port blocks.
Multiplier Blocks accept two 18-bit binary numbers as inputs and calculate the
product.
Digital Clock Manager (DCM) Blocks provide self-calibrating, fully digital
solutions for distributing, delaying, multiplying, dividing, and phase-shifting clock
signals.
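As an aside, the multiplier block's result width can be sanity-checked with a short script; the snippet below is our own illustrative arithmetic (assuming signed 18-bit inputs and a 36-bit product), not text from the Spartan-3E documentation:

```python
# Range check for an 18 x 18 -> 36-bit signed multiplier.
lo, hi = -(1 << 17), (1 << 17) - 1        # signed 18-bit input range

extremes = [lo * lo, lo * hi, hi * hi]    # candidate extreme products
for p in extremes:
    assert -(1 << 35) <= p <= (1 << 35) - 1   # fits in 36 signed bits

print(max(extremes))   # lo*lo = 2**34 = 17179869184
```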

These elements are organized as shown in Figure 5. A ring of IOBs surrounds a
regular array of CLBs. Each device has two columns of block RAM except for the
XC3S100E, which has one column. Each RAM column consists of several 18-Kbit
RAM blocks. Each block RAM is associated with a dedicated multiplier. The
DCMs are positioned in the center with two at the top and two at the bottom of the
device. The XC3S100E has only one DCM at the top and bottom, while the
XC3S1200E and XC3S1600E add two DCMs in the middle of the left and right
sides.



Figure 5 Spartan 3e Architecture

The Spartan-3E family features a rich network of traces that interconnect all five
functional elements, transmitting signals among them. Each functional element has
an associated switch matrix that permits multiple connections to the routing
network.
5.3 IMPLEMENTATION OVERVIEW FOR FPGAS:
After synthesis, you run design implementation, which comprises the following
steps:
1. Translate, which merges the incoming netlists and constraints into a
Xilinx design file
2. Map, which fits the design into the available resources on the target
device
3. Place and Route, which places and routes the design to meet the timing
constraints
4. Programming file generation, which creates a bitstream file that can
be downloaded to the device

5.4 FPGA PIN CONFIGURATION FOR SRL KIT
S.No  PIN NAME  H/W PIN NO  PURPOSE
1.  Switch_0  P2   i/p on board SW1
2.  Switch_1  P3   i/p on board SW2
3.  Switch_2  P4   i/p on board SW3
4.  Switch_3  P5   i/p on board SW4
5.  Switch_4  P9   i/p on board SW5
6.  Switch_5  P10  i/p on board SW6
7.  Switch_6  P11  i/p on board SW7
8.  Switch_7  P13  i/p on board SW8

9.  LED_0  P12  o/p on board D8
10. LED_1  P15  o/p on board D7
11. LED_2  P16  o/p on board D6
12. LED_3  P17  o/p on board D5
13. LED_4  P18  o/p on board D4
14. LED_5  P22  o/p on board D3
15. LED_6  P23  o/p on board D2
16. LED_7  P98  o/p on board D1

17. LEDDT P24 (B_3 or pin9)
18. LEDQ1 P27 (T_1 or pin1)
19. LEDQ2 P32 (T_4 or pin4)
20. LEDQ3 P33 (T_5 or pin5)
21. LEDQ4 P34 (B_6 or pin12)
22. LEDA P35 (T_2 or pin2)
23. LEDB P36 (T_6 or pin6)
24. LEDC P40 (B_4 or pin10)
25. LEDD P41 (B_2 or pin8)
26. LEDE P47 (B_1 or pin7)
27. LEDF P48 (T_3 or pin3)
28. LEDG P49 (B_2 or pin11)

30. GCLK P89 Internal CLK






