
Project Report

on
Design and Development of Scalable Deep Learning
Accelerator
Submitted in partial fulfilment of the requirements for
the award of the degree of

BACHELOR OF TECHNOLOGY
In
ELECTRONICS AND COMMUNICATION ENGINEERING
By

Md. Samrah Shimroj 16311A0473


M. V. K. Gayatri Shivani 16311A0476
Ch. Kaveri 17315A0414

UNDER THE GUIDANCE OF


Dr. Abhishek Choubey
Associate Professor

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING


SREENIDHI INSTITUTE OF SCIENCE & TECHNOLOGY
Yamnampet (V), Ghatkesar (M), Hyderabad – 501 301.
SREENIDHI INSTITUTE OF SCIENCE AND TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad)
Yamnampet (V), Ghatkesar (M), Hyderabad – 501 301

CERTIFICATE

This is to certify that the project report entitled “Design and Development of
Scalable Deep Learning Accelerator using VHDL” is being submitted by

K. Rakesh 16311A0478
M.V.K Gayatri Shivani 16311A0476
Ch. Kaveri 17315A0414
in partial fulfilment of the requirements for the award of Bachelor of Technology degree in
Electronics and Communication Engineering to Sreenidhi Institute of Science &
Technology affiliated to Jawaharlal Nehru Technological University, Hyderabad
(Telangana). This record is a bona fide work carried out by them under our guidance and
supervision. The results embodied in the report have not been submitted to any other
University or Institution for the award of any degree or diploma.

Internal Guide Project Coordinator

Dr. Abhishek Choubey Dr. S.Ramani


Associate Professor Associate Professor

Head of the Department


Dr. S.P.V. SUBBA RAO
Professor, Department of ECE
DECLARATION

We hereby declare that the work described in this thesis titled “Design and Development of
Scalable Deep Learning Accelerator Unit using VHDL”, which is being submitted by us
in partial fulfilment for the award of Bachelor of Technology in the Department of
Electronics and Communication Engineering, Sreenidhi Institute of Science &
Technology, is the result of investigations carried out by us under the guidance of
Dr. Abhishek Choubey, Associate Professor, Department of ECE, Sreenidhi Institute of Science &
Technology, Hyderabad.

No part of the thesis is copied from books/journals/the internet, and wherever material has been
taken, it has been duly referenced. The report is based on the project work done entirely
by us and not copied from any other source. The work is original and has not been submitted
for any Degree/Diploma of this or any other university.

Place: Hyderabad
Date:

Md. Samrah Shimroj 16311A0473


M.V.K Gayatri Shivani 16311A0476
Ch. Kaveri 17315A0414
ACKNOWLEDGEMENTS

We thank Dr. S. Ramani, Associate Professor, Dept. of ECE, Sreenidhi Institute of Science
& Technology, Hyderabad, for her valuable comments and suggestions that greatly helped in
improving the quality of this thesis.

We would like to express our sincere gratitude to Dr. S.P.V. Subbarao, Professor, Head of
the department, Electronics & Communication Engineering, Sreenidhi Institute of Science &
Technology, Hyderabad for his continued support and valuable guidance and encouragement
extended to us during our research work. We thank him for his painstaking efforts to guide us
throughout our research work.

We are very grateful to Dr. P. NARSIMHA REDDY, Director and Dr. Siva Reddy,
Principal and the Management of Sreenidhi Institute Of Science & Technology for
having provided the opportunity for taking up this project.

We thank all our teachers and professors for their valuable comments after reviewing our
research papers.

We wish to extend our special thanks to all our colleagues and friends who helped directly or
indirectly in completing this work.

We extend our thanks to our parents and all our family members for their unceasing
encouragement and support throughout this work.
TABLE OF CONTENTS

INDEX
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
    1.0 Introduction
    1.1 Brief History
    1.2 Problem Statement
    1.3 Motivation
    1.4 Objectives of the Project
    1.5 Existing Method
    1.6 Proposed Method
CHAPTER 2: LITERATURE SURVEY
    2.0 Introduction
    2.1 Literature Review
CHAPTER 3: PROPOSED SYSTEM
    3.0 Introduction
    3.1 DLAU Architecture and Execution Model
CHAPTER 4: SOFTWARE REQUIREMENTS
CHAPTER 5: EXPERIMENTAL/SIMULATION RESULTS AND DISCUSSIONS
CHAPTER 6: CONCLUSIONS AND FUTURE SCOPE
    6.0 Conclusions
LIST OF FIGURES
Figure 1.1    Neural Network
Figure 2.1    DLAU architecture
Figure 2.2    TMMU architecture
Figure 2.3    PSAU architecture
Figure 2.4    General block diagram
Figure 4.1    Graph of a convolution layer
Figure 4.2    Overview of accelerator design
Figure 4.3    Computation model of a neural network
Figure 4.4    A typical structure of an NN-based accelerator


ABSTRACT

As an emerging field of machine learning, deep learning shows excellent ability in
solving complex learning problems. However, the size of the networks is becoming increasingly
large due to the demands of practical applications, which poses a significant challenge to
constructing high-performance implementations of deep learning neural networks. In order to
improve the performance as well as to maintain low power cost, in this project we design a deep
learning accelerator unit (DLAU), which is a scalable accelerator architecture for large-scale
deep learning networks using a field-programmable gate array (FPGA) as the hardware prototype.
The DLAU accelerator employs three pipelined processing units to improve the throughput and
utilizes tile techniques to explore the locality of deep learning applications. Experimental results on
a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve
up to 36.1× speedup compared with the Intel Core2 processor, with a power consumption of
234 mW.

Keywords: Deep learning, field-programmable gate array (FPGA), hardware accelerator, neural
network
CHAPTER 1

INTRODUCTION

Deep Learning is a specific subset of Machine Learning, which is a specific subset of Artificial
Intelligence. Artificial Intelligence is the broad mandate of creating machines that can think
intelligently. Machine Learning is one way of doing that, by using algorithms to glean insights
from data. Deep Learning is one way of doing that, using a specific algorithm called a Neural
Network. Neural networks are inspired by the structure of the cerebral cortex. At the basic level
is the perceptron, the mathematical representation of a biological neuron. As in the cerebral
cortex, there can be several layers of interconnected perceptrons. Input values, or in other words
our underlying data, get passed through this “network” of hidden layers until they eventually
converge to the output layer. This is explained in the next section. Deep Learning is at the cutting
edge of what machines can do, and developers and business leaders need to understand what it is
and how it works. This unique type of algorithm has far surpassed any previous benchmarks for
the classification of images, text, and voice. It also powers some of the most interesting
applications in the world, such as autonomous vehicles and real-time translation.

There was certainly a great deal of excitement around Google’s Deep Learning based
AlphaGo beating the best Go player in the world, but the business applications for this technology
are more immediate and potentially more impactful. In the 1980s, most neural networks were a
single layer due to the cost of computation and the limited availability of data. Using Deep
Learning to classify and label images is not only better than traditional algorithms: it is starting to
be better than actual humans. Facebook has had great success with identifying faces in photographs
by using Deep Learning; it is not just a marginal improvement, but a game changer. Speech
recognition is another area that has felt Deep Learning’s impact, since spoken languages are vast
and ambiguous. Baidu, one of the leading search engines of China, has developed a voice
recognition system that is faster and more accurate than humans at producing text on a mobile
phone, in both English and Mandarin. Google is now using deep learning to manage the energy
at the company’s data centers. They’ve cut their energy needs for cooling by 40%. That
translates to about a 15% improvement in power usage efficiency for the company and hundreds
of millions of dollars in savings. Deep Learning is important because it finally makes these tasks
accessible. As a main means to accelerate deep learning algorithms, FPGA (Field Programmable
Gate Array) has high performance and low power consumption. It poses significant challenges to
implement high performance deep learning networks with low power cost, especially for large-
scale deep learning neural network models. So far, the state-of-the-art means for accelerating
deep learning algorithms are field programmable gate array (FPGA), application specific
integrated circuit (ASIC), and graphic processing unit (GPU). Compared with GPU acceleration,
hardware accelerators like FPGA and ASIC can achieve at least moderate performance with
lower power consumption. To tackle these problems, a scalable deep learning accelerator unit
named DLAU to speed up the kernel computational parts of deep learning algorithms is
presented. In particular, we utilize the tile techniques, FIFO buffers, and pipelines to minimize
memory transfer operations, and reuse the computing units to implement the large size neural
networks. This approach distinguishes itself from previous literatures with following
contributions. The DLAU accelerator is composed of three fully pipelined processing units,
including tiled matrix multiplication unit (TMMU), part sum accumulation unit (PSAU), and
activation function acceleration unit (AFAU). Different network topologies such as CNN, DNN,
or even emerging neural networks can be composed from these basic modules. Consequently, the
scalability of FPGA-based accelerator is higher than ASIC-based accelerator.

1.1 INTRODUCTION TO NEURAL NETWORKS

The simplest definition of a neural network, more properly referred to as an 'artificial'


neural network (ANN), is provided by the inventor of one of the first neuro-computers, Dr.
Robert Hecht-Nielsen. He defines a neural network as "...a computing system made up of a
number of simple, highly interconnected processing elements, which process information by
their dynamic state response to external inputs" (quoted in "Neural Network Primer: Part I" by
Maureen Caudill, AI Expert, Feb. 1989). ANNs are processing devices (algorithms or actual hardware) that
are loosely modeled after the neuronal structure of the mammalian cerebral cortex but on much
smaller scales. A large ANN might have hundreds or thousands of processor units, whereas a
mammalian brain has billions of neurons with a corresponding increase in the magnitude of their
overall interaction and emergent behavior. Although ANN researchers are generally not
concerned with whether their networks accurately resemble biological systems, some have been. For
example, researchers have accurately simulated the function of the retina and modeled the eye
rather well. Although the mathematics involved with neural networking is not a trivial matter, a
user can rather easily gain at least an operational understanding of their structure and function.

Basics of Neural Networks: Neural networks are typically organized in layers. Layers are made
up of a number of interconnected 'nodes', each of which contains an 'activation function'. Patterns are
presented to the network via the 'input layer', which communicates to one or more 'hidden layers'
where the actual processing is done via a system of weighted 'connections'. The hidden layers
then link to an 'output layer' where the answer is output, as shown in the graphic below.

Figure 1.1: Neural Network
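To make the weighted-connection description above concrete, a single node j in a hidden or output layer computes a weighted sum of its inputs and passes it through its activation function (notation ours, for illustration only):

    y_j = f( Σ_i (w_ij · x_i) + b_j )

where the x_i are the outputs of the previous layer, the w_ij are the connection weights, b_j is an optional bias term, and f is the node's activation function (for example, the sigmoid used later by the AFAU).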

However, both FPGA and ASIC have relatively limited computing resources, memory, and I/O
bandwidths, therefore it is challenging to develop complex and massive DNNs using hardware
accelerators. For ASIC, it has a longer development cycle and the flexibility is not satisfying.
Chen et al. [6] presented a ubiquitous machine-learning hardware accelerator called DianNao,
which initiated the field of deep learning processor. It opens a new paradigm to machine learning
hardware accelerators focusing on neural networks. But DianNao is not implemented using
reconfigurable hardware like FPGA, therefore it cannot adapt to different application demands.
Currently, among FPGA acceleration studies, Ly and Chow [5] designed FPGA-based
solutions to accelerate the restricted Boltzmann machine (RBM). They created dedicated
hardware processing cores which are optimized for the RBM algorithm. Similarly, Kim et al. [7]
also developed an FPGA-based accelerator for the RBM. They use multiple RBM processing
modules in parallel, with each module responsible for a relatively small number of nodes. Other
similar works also present FPGA-based neural network accelerators [9]. Yu et al. [8] presented
an FPGA-based accelerator, but it cannot accommodate changing network size and network
topologies. To sum up, these studies focus on implementing a particular deep learning algorithm
efficiently, but how to increase the size of the neural networks with scalable and flexible
hardware architecture has not been properly solved. To tackle these problems, we present a
scalable deep learning accelerator unit named DLAU to speed up the kernel computational parts
of deep learning algorithms. In particular, we utilize the tile techniques, FIFO buffers, and
pipelines to minimize memory transfer operations, and reuse the computing units to implement
the large size neural networks.

1.2 Problem Statement

The emerging field of deep learning shows excellent ability in solving complex learning
problems. However, the size of the networks is becoming increasingly large due to the
demands of practical applications, which poses a significant challenge to constructing
high-performance implementations of deep learning neural networks.

1.3 Motivation

Data centers around the world are consuming more energy every year. This consumption of
energy will only increase with the development of deep learning based applications, as it is a
data-intensive field. In order to reduce the energy consumed and increase efficiency, novel
methods need to be developed in the form of hardware accelerators dedicated to deep learning
based applications.

1.4 Objectives of the Project

The aim of the project is to simulate a scalable deep learning accelerator in VHDL. It involves
building three pipelined processing units, which are then combined to realize the model of the
proposed accelerator.

1.5 Existing Method

In general, deep learning uses a multi-layer neural network model to extract high-level features,
which are combinations of low-level abstractions, to find distributed data features, in order to
solve complex problems in machine learning. However, with the increasing accuracy
requirements and complexity of practical applications, the size of the neural networks
becomes explosively large. The explosive volume of data makes data centers quite
power consuming. Therefore, it poses significant challenges to implement high-performance deep
learning networks with low power cost, especially for large-scale deep learning neural network
models. Moreover, both FPGA and ASIC have relatively limited computing resources, memory,
and I/O bandwidths; therefore, it is challenging to develop complex and massive deep neural
networks using hardware accelerators.

1.6 Proposed Method

This project presents a scalable deep learning accelerator unit named DLAU to speed up the kernel
computational parts of deep learning algorithms. In particular, we utilize tile techniques,
FIFO buffers, and pipelines to minimize memory transfer operations, and reuse the computing
units to implement large-size neural networks. In order to explore the locality of the deep
learning application, we employ tile techniques to partition the large-scale input data. The
DLAU architecture can be configured to operate on different sizes of tile data to leverage the trade-
offs between speedup and hardware cost. Consequently, the FPGA-based accelerator is more
scalable and can accommodate different machine learning applications. The DLAU accelerator is
composed of three fully pipelined processing units, including TMMU, PSAU, and AFAU.
Different network topologies such as CNN, DNN, or even emerging neural networks can be
composed from these basic modules. Consequently, the scalability of the FPGA-based accelerator is
higher than that of an ASIC-based accelerator.
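As a sketch of how the three units divide the work on one fully connected layer (our notation; assume the layer has N input nodes, M output nodes, weight matrix entries w_mk, and tile size T):

    y_m = f( Σ_{t = 0 .. ⌈N/T⌉−1} [ Σ_{k in tile t} (w_mk · x_k) ] ),   m = 0, 1, ..., M−1

The inner bracketed sum over one tile of input nodes is the part sum produced by TMMU, the outer sum over all tiles is the accumulation performed by PSAU, and the activation function f is applied by AFAU.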
CHAPTER 2

LITERATURE REVIEW

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” The deep learning paradigm
tackles problems on which shallow architectures (e.g. SVM) are affected by the curse of
dimensionality. As part of a two-stage learning scheme involving multiple layers of
nonlinear processing a set of statistically robust features is automatically extracted from
the data. The present tutorial introducing the ESANN deep learning special session
details the state-of-the-art models and summarizes the current understanding of this
learning approach, which is a reference for many difficult classification tasks.
Unfortunately, training deep architectures is a difficult task, and classical methods that
have proved effective when applied to shallow architectures are not as efficient when
adapted to deep architectures. Adding layers does not necessarily lead to better solutions.
For example, the more layers a neural network has, the smaller the impact of
back-propagation on the first layers. The gradient descent then tends to get stuck in
local minima or plateaus [3], which is why practitioners have often preferred to limit
neural networks to one or two hidden layers. This issue has been addressed by introducing an
unsupervised layer-wise pretraining of deep architectures [3, 4]. More precisely, in a deep
learning scheme each layer is treated separately and successively trained in a greedy
manner: once the previous layers have been trained, a new layer is trained from the
encoding of the input data by the previous layers. Then, a supervised fine-tuning stage of
the whole network can be performed.
The main idea of convolution networks is to combine local computations
(convolution of the signal with weight sharing units) and pooling. The convolutions are
intended to give translation invariance to the system, as the weights depend only on
spatial separation and not on spatial position. The pooling allows constructing a more
abstract set of features through nonlinear combination of the previous level features,
taking into account the local topology of the input data. By alternating convolution layers
and pooling layers, the network successively extracts and combines local features to
construct a good representation of the input. The connectivity of the convolution
networks, where each unit in a convolution or a pooling layer is only connected to a small
subset of the preceding layer, allows training networks with as many as 7 hidden layers.
Supervised learning is easily achieved through error gradient back-propagation.
On the one hand, the convolution framework has been applied to RBMs and DBNs [10, 30,
31]. In [31] the authors derive a generative pooling strategy which scales well with image
size, and they show that the intermediate representations are more abstract in the higher
layers (from edges in the lower layers to object parts in the higher). On the other hand, the
unsupervised pre-training stage of deep learning has been applied to convolution
networks [32] and can greatly reduce the number of labeled examples required.
Furthermore, deep convolution networks with sparse regularization [33] yield very
promising results for difficult visual detection tasks, such as pedestrian detection.
2. C. Zhang et al., “Optimizing FPGA-based accelerator design for deep convolutional
neural networks,” The convolutional neural network (CNN) has been widely employed for
image recognition because it can achieve high accuracy by emulating the behavior of optic
nerves in living creatures. Recently, the rapid growth of modern applications based on deep
learning algorithms has further driven research and implementations. In particular,
various accelerators for deep CNNs have been proposed based on the FPGA platform because
of its advantages of high performance, reconfigurability, and fast development cycles.
Although current FPGA accelerators have demonstrated better performance over
generic processors, the accelerator design space has not been well exploited. One critical
problem is that the computation throughput may not well match the memory bandwidth
provided by an FPGA platform. Consequently, existing approaches cannot achieve the best
performance due to underutilization of either logic resources or memory bandwidth. At the
same time, the increasing complexity and scalability of deep learning applications
aggravate this problem. In order to overcome this problem, we propose an analytical
design scheme using the roofline model. For any solution of a CNN design, we
quantitatively analyze its computing throughput and required memory bandwidth using
various optimization techniques, such as loop tiling and transformation. Then, with the
help of roofline model, we can identify the solution with best performance and lowest
FPGA resource requirement. As a case study, we implement a CNN accelerator on a
VC707 FPGA board and compare it to previous approaches. Our implementation
achieves a peak performance of 61.62 GFLOPS under a 100 MHz working frequency,
which outperforms previous approaches significantly.
Unfortunately, both advances of FPGA technology and deep learning algorithm
aggravate this problem at the same time. On one hand, the increasing logic resources and
memory bandwidth provided by state-of-the-art FPGA platforms enlarge the design space. In
addition, when various FPGA optimization techniques, such as loop tiling and
transformation, are applied, the design space is further expanded. On the other hand, the
scale and complexity of deep learning algorithms keep increasing to meet the requirement
of modern applications. Consequently, it is more difficult to find out the optimal solution
in the design space. Thus, an efficient method is urgently required for exploration of
FPGA based CNN design space. To efficiently explore the design space, we propose an
analytical design scheme in this work. Our work outperforms previous approaches for
two reasons. First, work [1] [2] [3] [6] [14] mainly focused on computation engine
optimization. They either ignore external memory operation or connect their accelerator
directly to external memory. Our work, however, takes buffer management and
bandwidth optimization into consideration to make better utilization of FPGA resource
and achieve higher performance. Second, previous study [12] accelerates CNN
applications by reducing external data access with delicate data reuse. However, this
method does not necessarily lead to the best overall performance. Moreover, their method
needs to reconfigure FPGA for different layers of computation. This is not feasible in
some scenarios. Our accelerator is able to execute acceleration jobs across different
layers without reprogramming FPGA
3. D. L. Ly and P. Chow, “A high-performance FPGA architecture for restricted
Boltzmann machines,” Artificial neural networks (ANNs) are a natural target for
hardware acceleration by FPGAs and GPGPUs because commercial-scale applications
can require days to weeks to train using CPUs, and the algorithms are highly
parallelizable. Previous work on FPGAs has shown how hardware parallelism can be
used to accelerate a “Restricted Boltzmann Machine” (RBM) ANN algorithm, and how
to distribute computation across multiple FPGAs. Here we describe a fully pipelined
parallel architecture that exploits “mini-batch” training (combining many input cases to
compute each set of weight updates) to further accelerate ANN training. We implement
on an FPGA, for the first time to our knowledge, a more powerful variant of the basic
RBM, the “Factored RBM” (fRBM). The fRBM has proved valuable in learning
transformations and in discovering features that are present across multiple types of
input. We obtain (in simulation) a 100-fold acceleration (vs. CPU software) for an fRBM
having N = 256 units in each of its four groups (two input, one output, one intermediate
group of units) running on a Virtex-6 LX760 FPGA. Many of the architectural features
we implement are applicable not only to fRBMs, but to basic RBMs and other ANN
algorithms more broadly.
In recent years increasing attention has been directed to the acceleration of deep
learning algorithms, in particular the basic RBM algorithm. Raina et al. [2009] applied
the fine-grained parallelism of graphics processor units (GPU) to the basic RBM ANN.
Ly and Chow [2010] investigated how the basic RBM can be mapped to an FPGA
hardware architecture. They created dedicated hardware processing cores which were
optimized for certain parts of the algorithm: an embedded processor, memory, and the
Message Passing Interface (MPI) manage system controls, storing of intermediate results,
and communication among the cores. Their experimental system, using a Xilinx FPGA
(XC2VP70), integrated 128 neurons in each of two fully interconnected layers. The
architecture in Kim et al. [2010] uses multiple RBM processing modules in parallel, with
each module responsible for a relatively small number of ANN nodes. Their experimental
system on an Altera FPGA (EP3SL340) integrated 256 neurons in each of two fully-
interconnected layers. The architecture we describe here for the Factored RBM algorithm
is suitable for FPGA or ASIC implementation, and is extendable to multi-FPGA systems.
Our experimental system, run on a Xilinx Virtex-6 FPGA (XC6VLX760), can integrate
up to 256 nodes in each of the four sets of ANN units (“neuron” layers x, y, and h, and
the set of “factors” f, as described in this article), with full interconnection between layer
f and each of layers x, y, and h. Our architecture exploits parallel operators and a coarse-
grain pipeline to increase system throughput. The weight update operations are divided
into twelve processing phases that comprise seven pipeline stages. The design is fully
parametrized, so that setting network size parameters in a script automatically generates
the required HDL code for FPGA configuration. In addition, using multiple instances of
processing blocks or time-division multiplexed processing, the proposed architecture
supports super-scalar or weight training operations for virtually increased node sizes.
4. T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for
ubiquitous machine-learning,” Machine-Learning tasks are becoming pervasive in a
broad range of domains, and in a broad range of systems (from embedded systems to data
centers). At the same time, a small set of machine-learning algorithms (especially
Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-
of-the-art across many applications. As architectures evolve towards heterogeneous
multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator
can achieve the rare combination of efficiency (due to the small number of target
algorithms) and broad application scope. Until now, most machine-learning accelerator
designs have focused on efficiently implementing the computational part of the
algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their
large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a
special emphasis on the impact of memory on accelerator design, performance and
energy. We show that it is possible to design an accelerator with a high throughput,
capable of performing 452 GOP/s (key NN operations such as synaptic weight
multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485
mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and
it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after
layout at 65nm. Such a high throughput in a small footprint can open up the usage of
state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set
of applications.
While efficient implementation of computational primitives is a first and important step
with promising results, inefficient memory transfers can potentially void the throughput,
energy or cost advantages of accelerators, i.e., an Amdahl’s law effect, and thus, they
should become a first-order concern, just like in processors, rather than an element
factored in accelerator design on a second step. Unlike in processors though, one can
factor in the specific nature of memory transfers in target algorithms, just like it is done
for accelerating computations. This is especially important in the domain of machine-
learning where there is a clear trend towards scaling up the size of neural networks in
order to achieve better accuracy and more functionality [16, 26]. In this study, we
investigate an accelerator design that can accommodate the most popular state-of-the-art
algorithms, i.e., Convolutional Neural Networks (CNNs) and Deep Neural Networks
(DNNs). We focus the design of the accelerator on memory usage, and we investigate an
accelerator architecture and control both to minimize memory transfers and to perform
them as efficiently as possible. We present a design at 65nm which can perform 496 16-
bit fixed-point operations in parallel every 1.02ns, i.e., 452 GOP/s, in a 3.02mm2 ,
485mW footprint (excluding main memory accesses). On 10 of the largest layers found in
recent CNNs and DNNs, this accelerator is 117.87x faster and 21.08x more energy-
efficient (including main memory accesses) on average than a 128-bit SIMD core
clocked at 2 GHz.
5. S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A highly scalable
restricted Boltzmann machine FPGA implementation,” The paradigm of machine
learning deals with methods that allow a computer to extract complex patterns underlying
data. The applications of such methods are extensive, including visual pattern
recognition, speech recognition and video game artificial intelligence. One popular
method of machine learning is the use of artificial neural networks (ANNs). Such
networks roughly model the structure of the biological neural networks in the brain, in
that they consist of many parallel simple neurons, connected together through weighted
relationships. The activation of the neurons, dependent on the weights and states of
connected neurons, determines the reaction of the network to some input. By controlling
the value of the weights, the network can be trained to recognize certain patterns or
features of a dataset. Many different types of ANNs exist with different network
topologies, activation functions and learning algorithms. A particularly popular
architecture is the Restricted Boltzmann Machine (RBM); a stochastic, generative model
that has proven to perform well in problems such as face recognition [6]. Recently, it has
been shown that when several RBMs are stacked together to form a Deep Belief Network
(DBN), an efficient learning algorithm exists to train the entire network [1]. DBNs have
the benefit of being able to learn more complex features and have been applied to
problems of generating facial expressions [2], semantic hashing of documents [3] and
recognition of hand written digits [1]. Although the learning algorithm is relatively
efficient, training the large networks required for the real-world applications above can
still take several days or weeks on a general purpose desktop computer [6]. The parallel
nature of the RBM architecture makes it very tractable by hardware implementations and
several groups have created FPGA and GPU based RBM solutions providing much
needed speed-up [7, 8, 5]. In particular, Ly et al.’s FPGA architecture has produced a
145x speed-up relative to a desktop PC [7]. However, it has only implemented relatively
small RBM networks of 256x256 neurons whereas real world applications require much
larger networks. For example, the DBN used to recognize handwritten digits [1]
contained a RBM of size 2000x510. The goal of my thesis is to scale Ly, et al.’s FPGA
architecture [4] up, to be capable of handling the thousands of neurons necessary in real
world DBN applications while maintaining maximum performance. The main bottleneck
limiting the current implementation size of the FPGA architecture is the size of the
weight matrix. This data structure is necessarily large since each node must be connected
through a weight to all of the nodes in the next layer due to the bipartite graph
organization of the RBM. To allow for larger networks, my project will first involve
adapting the FPGA architecture to a larger, faster FPGA platform. This allows for the
possibility of higher clock speeds as well as better interconnects between FPGAs and
thus greater performance. In addition, I will investigate the effect of decreasing the bit
width of the weights. This could allow more weights to be stored on-chip and an increase
in communication bandwidth between computational cores, provided that the network is
trainable at lower precision. Finally, by time-multiplexing the resources of four FPGAs, I
hope to accelerate the training performance of arbitrary-size networks.
6. J. Qiu et al., “Going deeper with embedded FPGA platform for convolution neural
network,” Recent research on neural networks has shown significant advantages in
machine learning over traditional algorithms based on features and models. Neural
networks are now widely adopted in areas like image, speech and video recognition. But
the high computation and storage complexity of neural network inference poses great
difficulty for its application. CPU platforms can hardly provide enough computation capacity.
GPU platforms are the first choice for neural network processing because of their high
computation capacity and easy-to-use development frameworks. On the other hand,
FPGA-based neural network inference accelerators are becoming a research topic. With
specifically designed hardware, FPGA is the next possible solution to surpass GPU in
speed and energy efficiency. Various FPGA-based accelerator designs have been
proposed with software and hardware optimization techniques to achieve high speed and
energy efficiency. In this paper, we give an overview of previous work on neural network
inference accelerators based on FPGA and summarize the main techniques used. An
investigation from software to hardware, from circuit level to system level, is carried out
to complete the analysis of FPGA-based neural network inference accelerator design and
serves as a guide to future work.
But the computation and storage complexity of NN models is high. In Table 1,
we list the number of operations, number of parameters (adds or multiplications), and top-1
accuracy on the ImageNet dataset [50] of state-of-the-art CNN models. Take CNN as an
example. The largest CNN model for 224 × 224 image classification requires up to 39
billion floating point operations (FLOP) and more than 500 MB of model parameters [56].
As the computation complexity is proportional to the input image size, processing images
with higher resolutions may need more than 100 billion operations. Recent work like
MobileNet [24] and ShuffleNet [79] is trying to reduce the network size with advanced
network structures, but with obvious accuracy loss. The balance between the size of NN
models and accuracy is still an open question today. In some cases, the large model size
hinders the application of NN, especially in power-limited or latency-critical
scenarios. Therefore, choosing a proper computation platform for neural-network-based
applications is essential. A typical CPU can perform 10-100 GFLOP per second, and the
power efficiency is usually below 1 GOP/J. So CPUs can hardly meet the high-performance
requirements of cloud applications or the low-power requirements of
mobile applications. In contrast, GPUs offer up to 10 TOP/s peak performance and are good
choices for high-performance neural network applications. Development frameworks like
Caffe [26] and TensorFlow [4] also provide easy-to-use interfaces, which makes GPU the first
choice for neural network acceleration. Besides CPUs and GPUs, FPGAs are becoming a
platform candidate to achieve energy-efficient neural network processing. With a neural
network oriented hardware design, FPGAs can implement high parallelism and make use
of the properties of neural network computation to remove additional logic.
CHAPTER 3
PROPOSED SYSTEM

3.1 DLAU ARCHITECTURE AND EXECUTION MODEL

Fig. 2.1 describes the DLAU system architecture, which contains an embedded processor, a DDR3
memory controller, a DMA module, and the DLAU accelerator. The embedded processor is
responsible for providing programming interface to the users and communicating with DLAU
via JTAG-UART. In particular it transfers the input data and the weight matrix to internal
BRAM blocks, activates the DLAU accelerator, and returns the results to the user after
execution. The DLAU is integrated as a standalone unit which is flexible and adaptive to
accommodate different applications with configurations. The DLAU consists of three processing
units organized in a pipeline manner: 1) TMMU; 2) PSAU; and 3) AFAU. For execution, DLAU
reads the tiled data from the memory by DMA, computes with all the three processing units in
turn, and then writes the results back to the memory. In particular, the DLAU accelerator
architecture has the following key features.

FIFO Buffers: Each processing unit in DLAU has an input buffer and an output buffer to receive
or send data in FIFO order. These buffers are employed to prevent data loss caused by the
inconsistent throughput between the processing units.

Tile Techniques: Different machine learning applications may require specific neural network
sizes. The tile technique is employed to divide the large volume of data into small tiles that can
be cached on chip, so the accelerator can be adapted to different neural network sizes.
Consequently, the FPGA-based accelerator is more scalable and can accommodate different
machine learning applications.

Pipeline Accelerator: We use a stream-like data passing mechanism (e.g., AXI-Stream for
demonstration) to transfer data between adjacent processing units; therefore, TMMU, PSAU, and
AFAU can compute in a streaming-like manner.

Of these three computational modules, TMMU is the primary computational unit, which reads the
total weights and tiled node data through DMA, performs the calculations, and then transfers the
intermediate part sum results to PSAU. PSAU collects the part sums and performs accumulation.
When the accumulation is completed, the results are passed to AFAU. AFAU performs the
activation function using piecewise linear interpolation methods. In the rest of this section, we
detail the implementation of these three processing units, respectively.
Figure 2.1: DLAU accelerator architecture
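As an illustration of the FIFO buffers that decouple the processing units, the sketch below shows a minimal synchronous FIFO in VHDL. It is only a sketch under assumed conventions: the entity name, data width, depth, and flag behaviour are illustrative and are not taken from the actual DLAU design files.

library ieee;
use ieee.std_logic_1164.all;

-- Minimal synchronous FIFO sketch for the buffers between DLAU units.
-- WIDTH/DEPTH and the port names are assumptions made for this example.
entity dlau_fifo_sketch is
  generic (
    WIDTH : positive := 32;
    DEPTH : positive := 16
  );
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;
    wr_en : in  std_logic;                               -- push from the producer unit
    din   : in  std_logic_vector(WIDTH-1 downto 0);
    full  : out std_logic;
    rd_en : in  std_logic;                               -- pop by the consumer unit
    dout  : out std_logic_vector(WIDTH-1 downto 0);
    empty : out std_logic
  );
end entity;

architecture rtl of dlau_fifo_sketch is
  type mem_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal mem    : mem_t;
  signal wr_ptr : natural range 0 to DEPTH-1 := 0;
  signal rd_ptr : natural range 0 to DEPTH-1 := 0;
  signal count  : natural range 0 to DEPTH   := 0;
begin
  full  <= '1' when count = DEPTH else '0';
  empty <= '1' when count = 0     else '0';

  process(clk)
    variable do_wr, do_rd : boolean;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        wr_ptr <= 0;
        rd_ptr <= 0;
        count  <= 0;
      else
        do_wr := (wr_en = '1') and (count < DEPTH);
        do_rd := (rd_en = '1') and (count > 0);
        if do_wr then
          mem(wr_ptr) <= din;
          wr_ptr      <= (wr_ptr + 1) mod DEPTH;
        end if;
        if do_rd then
          dout   <= mem(rd_ptr);                          -- registered read
          rd_ptr <= (rd_ptr + 1) mod DEPTH;
        end if;
        -- occupancy tracking; simultaneous push/pop leaves count unchanged
        if do_wr and not do_rd then
          count <= count + 1;
        elsif do_rd and not do_wr then
          count <= count - 1;
        end if;
      end if;
    end if;
  end process;
end architecture;

Each DLAU unit would read from such a buffer on its input side and write to one on its output side, so a temporarily slower downstream unit simply back-pressures the producer through the full flag.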
A. TMMU Architecture

TMMU is in charge of multiplication and accumulation operations. TMMU is specially


designed to exploit the data locality of the weights and is responsible for calculating the part
sums. TMMU employs an input FIFO buffer which receives the data transferred from DMA
and an output FIFO buffer to send part sums to PSAU. Fig. 2.2 illustrates the TMMU
schematic diagram, in which we set tile size = 32 as an example. TMMU first reads the
weight matrix data from the input buffer into 32 different BRAMs, distributed by the row number
of the weight matrix (n = i mod 32, where n refers to the BRAM index and i is the row number of
the weight matrix). Then, TMMU begins to buffer the tiled node data. The first time, TMMU
reads the 32 tiled values into the registers Reg_a and starts execution. In parallel with the
computation, at every cycle TMMU reads the next node from the input buffer and saves it to the
registers Reg_b. Consequently, the registers Reg_a and Reg_b can be used alternately. For
the calculation, we use a pipelined binary adder tree structure to optimize the performance. As
depicted in Fig. 2.2, the weight data and the node data are saved in BRAMs and registers. The
pipeline takes advantage of time-sharing the coarse-grained accelerators. As a consequence,
this implementation enables the TMMU unit to produce a part sum result every clock cycle.
Figure 2.2: TMMU schematic
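The paragraph above mentions the pipelined binary adder tree used for the part-sum calculation. The following VHDL sketch shows such a multiply-and-reduce datapath for a tile of 8 values (the report uses tile size 32; 8 keeps the example short). The 16-bit fixed-point width, the packed port format, and the entity name are assumptions made for illustration, not the actual TMMU interface.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of a pipelined multiply + binary adder tree producing one part sum
-- per clock cycle, here for 8 weight/node pairs of 16 bits each.
entity tmmu_tree_sketch is
  port (
    clk     : in  std_logic;
    weights : in  signed(8*16-1 downto 0);   -- 8 packed 16-bit weights
    nodes   : in  signed(8*16-1 downto 0);   -- 8 packed 16-bit node values
    partsum : out signed(35 downto 0)        -- widened to avoid overflow
  );
end entity;

architecture rtl of tmmu_tree_sketch is
  type prod_t is array (0 to 7) of signed(31 downto 0);
  type lvl1_t is array (0 to 3) of signed(32 downto 0);
  type lvl2_t is array (0 to 1) of signed(33 downto 0);
  signal p  : prod_t;
  signal s1 : lvl1_t;
  signal s2 : lvl2_t;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- stage 1: one multiplier per weight/node pair
      for i in 0 to 7 loop
        p(i) <= weights(16*i+15 downto 16*i) * nodes(16*i+15 downto 16*i);
      end loop;
      -- stage 2: first level of the binary adder tree
      for i in 0 to 3 loop
        s1(i) <= resize(p(2*i), 33) + resize(p(2*i+1), 33);
      end loop;
      -- stage 3: second level
      for i in 0 to 1 loop
        s2(i) <= resize(s1(2*i), 34) + resize(s1(2*i+1), 34);
      end loop;
      -- stage 4: root of the tree; a new part sum leaves every cycle
      partsum <= resize(s2(0), 36) + resize(s2(1), 36);
    end if;
  end process;
end architecture;

In the actual TMMU the two operand sets would come from the BRAM weight banks and the alternating Reg_a/Reg_b node registers described above.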

B. PSAU Architecture

PSAU is responsible for the accumulation operation. Fig. 2.3 presents the PSAU architecture,
which accumulates the part sums produced by TMMU. If the part sum is the
final result, PSAU will write the value to the output buffer and send results to AFAU in a pipeline
manner. PSAU can accumulate one part sum every clock cycle, therefore the throughput of
PSAU accumulation matches the generation of the part sum in TMMU.

Fig.2.3 PSAU Architecture
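A minimal VHDL sketch of the accumulation step is shown below, assuming a part sum arrives with a valid flag and a last flag marking the final part sum of a node; the port names and the 32-bit width are illustrative assumptions rather than the real PSAU interface.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of a part-sum accumulation unit: one part sum accepted per cycle,
-- final node value forwarded when the last part sum of that node arrives.
entity psau_sketch is
  port (
    clk       : in  std_logic;
    rst       : in  std_logic;
    in_valid  : in  std_logic;                 -- part sum from TMMU is valid
    in_last   : in  std_logic;                 -- marks the last part sum of a node
    part_sum  : in  signed(31 downto 0);
    out_valid : out std_logic;                 -- accumulated node value is ready
    node_sum  : out signed(31 downto 0)
  );
end entity;

architecture rtl of psau_sketch is
  signal acc : signed(31 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      out_valid <= '0';
      if rst = '1' then
        acc <= (others => '0');
      elsif in_valid = '1' then
        if in_last = '1' then
          node_sum  <= acc + part_sum;         -- pass the finished sum to AFAU
          out_valid <= '1';
          acc       <= (others => '0');        -- start a fresh accumulation
        else
          acc <= acc + part_sum;               -- keep accumulating part sums
        end if;
      end if;
    end if;
  end process;
end architecture;

Because the accumulator accepts one part sum per cycle, its throughput matches the one-part-sum-per-cycle output of the adder tree in TMMU, as the text above notes.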


C. AFAU Architecture

Finally, AFAU implements the activation function using piecewise linear interpolation (y =
ai·x + bi for x ∈ [xi, xi+1]). This method has been widely applied to implement activation
functions with negligible accuracy loss when the interval between xi and xi+1 is
small. In this implementation, the sigmoid function is used as the activation function.
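For the sigmoid activation, the segment coefficients can be precomputed offline. A common choice (a sketch, not necessarily the exact scheme used in this design) is to make each segment interpolate the sigmoid at its endpoints:

    σ(x) = 1 / (1 + e^(−x)),   a_i = (σ(x_{i+1}) − σ(x_i)) / (x_{i+1} − x_i),   b_i = σ(x_i) − a_i · x_i

so that y = a_i·x + b_i agrees with σ(x) exactly at x_i and x_{i+1}, and the error inside each segment shrinks as the segments get shorter. The symmetry σ(−x) = 1 − σ(x) also allows coefficients to be stored for non-negative x only.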
In this section, the architecture and theory behind PASTA (parallel self-timed adder), which is
used for the addition operations in this design, is presented. The adder first accepts two input
operands and performs half additions for each bit. Subsequently, it iterates using the
previously generated carries and sums to perform half additions repeatedly until all carry bits are
consumed and settle at zero.

Architecture of PASTA: The general architecture of the adder is shown in Fig. 2.4. The selection
input for the two-input multiplexers corresponds to the Req handshake signal and will be a single
0-to-1 transition denoted by SEL. It initially selects the actual operands when SEL = 0 and
switches to the feedback/carry paths for subsequent iterations when SEL = 1. The feedback path
from the HAs enables the multiple iterations to continue until completion, when all carry signals
assume zero values.

Fig. 2.4. General block diagram of PASTA


D. State Diagrams

In Fig. 2.5, two state diagrams are drawn for the initial phase and the iterative phase of the
proposed architecture. Each state is represented by a (Ci+1, Si) pair, where Ci+1 and Si represent the
carry-out and sum values, respectively, from the ith bit adder block. During the initial phase, the circuit
merely works as a combinational HA operating in fundamental mode. It is apparent that due to
the use of HAs instead of FAs, state (11) cannot appear. During the iterative phase (SEL=1), the
feedback path through multiplexer block is activated. The carry transitions (Ci) are allowed as
many times as needed to complete the recursion. From the definition of fundamental mode
circuits, the present design cannot be considered as a fundamental mode circuit as the input–
outputs will go through several transitions before producing the final output. It is not a Muller
circuit working outside the fundamental mode either as internally; several transitions will take
place, as shown in the state diagram. This is analogous to cyclic sequential circuits where gate
delays are utilized to separate individual states.
Fig.2.5. State diagrams for PASTA (a) Initial phase (b) Iterative phase
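Written out, the behaviour described by the two state diagrams is the following recursion (reconstructed from the description above; notation ours). With SEL = 0 the half adders produce the initial sums and carries,

    S_i^0 = a_i ⊕ b_i,   C_{i+1}^0 = a_i · b_i,

and with SEL = 1 each iteration j ≥ 1 half-adds the previous sum with the incoming carry from the lower bit,

    S_i^j = S_i^(j−1) ⊕ C_i^(j−1),   C_{i+1}^j = S_i^(j−1) · C_i^(j−1),

the process terminating when every carry C_i has become 0, at which point the S_i hold the final sum.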

Thus, all the single-bit adders will successfully kill or propagate the carries until all carries are
zero fulfilling the terminating condition. The mathematical form presented above is valid under
the condition that the iterations progress synchronously for all bit levels and the required input
and outputs for a specific iteration will also be in synchrony with the progress of one iteration. In
the next section, we present an implementation of the proposed architecture which is
subsequently verified using simulations.
CHAPTER 4

SOFTWARE REQUIREMENTS

CNN Basics: The convolutional neural network (CNN) was first inspired by research in neuroscience.
After over twenty years of evolution, CNN has been gaining more and more prominence in
research fields such as computer vision and AI (e.g. [11] [9]). As a classical supervised learning
algorithm, CNN employs a feed forward process for recognition and a backward path for
training. In industrial practice, many application designers train CNN off-line and use the off-
line trained CNN to perform time-sensitive jobs. So the speed of feed forward computation is
what matters. In this work, we focus on speeding up the feed forward computation with FPGA
based accelerator design. A typical CNN is composed of two components: a feature extractor and
a classifier. The feature extractor is used to filter input images into “feature maps” that represent
various features of the image. These features may include corners, lines, circular arcs, etc.,
which are relatively invariant to position shifting or distortions. The output of the feature
extractor is a low-dimensional vector containing these features. This vector is then fed into the
classifier, which is usually based on traditional artificial neural networks. The purpose of this
classifier is to decide the likelihood of categories that the input (e.g. image) might belong to. A
typical CNN is composed of multiple computation layers. For example, the feature extractor may
consist of several convolution layers and optional sub-sampling layers. Figure 1 illustrates the
computation of a convolution layer. The convolution layer receives N feature maps as input.
Each input feature map is convolved by a shifting window with a K ×K kernel to generate one
pixel in one output feature map. The stride of the shifting window is S, which is normally
smaller than K. A total of M output feature maps will form the set of input feature maps for the
next convolution layer. The pseudo code of a convolution layer can be written as that in Code 1

Figure 4.1: Graph of a convolution layer
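Code 1 itself is not reproduced in this report, but the loop nest it refers to computes each output pixel as (our notation, following the description above)

    out[m][r][c] = Σ_{n=0..N−1} Σ_{i=0..K−1} Σ_{j=0..K−1} weights[m][n][i][j] · in[n][S·r + i][S·c + j]

for every output feature map m = 0, ..., M−1 and every output position (r, c). The loop tiling discussed later partitions these loops (over m, n, r, and c) into tiles of size Tm, Tn, Tr, and Tc, respectively.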


In the feed forward computation perspective, a previous study [5] proved that convolution
operations will occupy over 90% of the computation time. So in this work, we will focus on
accelerating convolution layers. Integration with other optional layers, such as sub-sampling or
max pooling layers, will be studied in future work.

A Real-Life CNN

Figure 4.2: A real-life CNN that won the Image Net 2012 contest

Figure 4.2 shows a real-life CNN application, taken from [9]. This CNN is composed of 8 layers.
The first 5 layers are convolution layers and layers 6 ∼ 8 form a fully connected artificial neural
network. The algorithm receives three 224x224 input images that are from an original 256x256
three-channel RGB image. The output vector of 1000 elements represents the likelihoods of 1000
categories. As shown in Figure 4.2, Layer1 receives 3 input feature maps at 224x224 resolution
and produces 96 output feature maps at 55x55 resolution. The output of Layer1 is partitioned into two sets,
each sized 48 feature maps. Layer1’s kernel size is 11x11 and the sliding window shifts across
feature maps in a stride of 4 pixels. The following layers also have a similar structure. The
sliding strides of other layers’ convolution window are 1 pixel. Table 1 shows this CNN’s
configuration.

ACCELERATOR DESIGN EXPLORATION

In this section, we first present an overview of our accelerator structure and introduce several
design challenges on an FPGA platform. Then, in order to overcome them, we propose
corresponding optimization techniques to explore the design space.

Design Overview: As shown in Figure 4.3, a CNN accelerator design on FPGA is composed


of several major components, which are processing elements (PEs), on-chip buffer, external
memory, and on-/off-chip interconnect. A PE is the basic computation unit for convolution. All
data for processing are stored in external memory. Due to on-chip resource limitation, data are
first cached in on-chip buffers before being fed to PEs. Double buffers are used to cover
computation time with data transfer time. The on-chip interconnect is dedicated for data
communication between PEs and on-chip buffer banks.

Figure 4.3: Overview of accelerator design

There are several design challenges that obstruct an efficient CNN accelerator design on an
FPGA platform. First, loop tiling is mandatory to fit a small portion of data on chip. An improper
tiling may degrade the efficiency of data reuse and parallelism of data processing. Second, the
organization of PEs and buffer banks and interconnects between them should be carefully
considered in order to process on chip data efficiently. Third, the data processing throughput of
PEs should match the off-chip bandwidth provided by the FPGA platform.

Design Space Exploration: As mentioned in Section 3.2 and Section 3.3, given a specific
loop schedule and tile size tuple ⟨Tm, Tn, Tr, Tc⟩, the computational roof and computation to
communication ratio of the design variant can be calculated. Enumerating all possible loop
orders and tile sizes will generate a series of computational performance and computation to
communication ratio pairs. Figure 8(a) depicts all legal solutions for layer 5 of the example CNN
application in the roofline model coordinate system. The “x” axis denotes the computation to
communication ratio, or the ratio of floating point operation per DRAM byte access. The “y”
axis denotes the computational performance (GFLOPS). The slope of the line between any point
and the origin point (0, 0) denotes the minimal bandwidth requirement for this implementation.
For example, design P’s minimal bandwidth requirement is equal to the slope of the line P′. In
Figure 8(b), the line of bandwidth roof and computational roof are defined by the platform
specification. Any point at the left side of bandwidth roofline requires a higher bandwidth than
what the platform can provide. For example, although implementation A achieves the highest
possible computational performance, the memory bandwidth required cannot be satisfied by the
target platform. The actual performance achievable on the platform would be the ordinate value
of A′. Thus, the platform-supported designs are defined as a set including those located at the
right side of the bandwidth roofline and those just located on the bandwidth roofline, which are
projections of the left side designs. We explore this platform-supported design space and a set of
implementations with the highest performance can be collected. If this set only includes one
design, then this design will be our final result of design space exploration. However, a more
common situation is that we could find several counterparts within this set, e.g. point C, D and
some others in Figure 8(b). We pick the one with the highest computation to communication ratio
because this design requires the least bandwidth. This selection criterion derives from the fact that
we can use fewer I/O ports, fewer LUTs, and fewer hardwired connections, etc., for the data transfer
engine in designs with a lower bandwidth requirement. Thus, point C is the finally chosen design
in this case for layer 5.
Its bandwidth requirement is 2.2 GB/s.
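The selection above can be summarized with the standard roofline bound (our formulation; CTC denotes the computation to communication ratio in floating point operations per DRAM byte accessed):

    Attainable performance = min( Computational roof,  CTC × Off-chip memory bandwidth )

A design point whose required bandwidth exceeds what the platform provides (such as A) is limited by the second term and only reaches the ordinate of its projection A′ on the bandwidth roofline.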

Memory Sub-System

On-chip buffers are built upon a basic idea of double buffering, in which double buffers are
operated in a ping pong manner to overlap data transfer time with computation. Therefore, they
are organized in four sets: two for input feature maps and weights and two for output feature
maps. We first introduce each buffer set’s organization, followed by the ping-pong data
transfer mechanism. Every buffer set contains several independent buffer banks. The number of
buffer banks in each input buffer set is equal to Tn (tile size of input fm). The number of buffer
banks in each output buffer set is equal to Tm (tile size of output fm). Double buffer sets are used
to realize ping-pong operations. To simplify discussion, we use a concrete case in Figure 9 to
illustrate the mechanism of the ping-pong operation. See the code in Figure 9. The “off-load”
operation will occur only once after N/Tn “load” operations. But the amount of data in
every output fm transfer is larger than that of an input fm by a ratio of ≈ Tm/Tn = 64/7. To
increase the bandwidth utilization, we implement two independent channels, one for load
operation and the other for off-load operation. Figure 12 shows the timing of several compute
and data transfer phases. For the first phase, computation engine is processing with input buffer
set 0 while copying the next phase data to input buffer set 1. The next phase will do the opposite
operation. This is the ping-pong operation of the input feature maps and weights. When N/Tn
phases of computation and data copying are done, the resulting output feature maps are written
to DRAM. The “off-load” operation off-loads the results in output buffer set 0 during the
next N/Tn phases, while the reused temporary data in output buffer set 1 generate the new
results. This is the ping-pong operation of the output feature maps. Note that this mechanism of two
independent channels for load and store operations works for any other data reuse situation in this
framework.

Fig. 4.4. (a) Computation graph of a neural network model. (b) CONV and FC layers in NN
model.

In this section, we introduce the basic functions in a neural network. In this paper, we only focus
on the inference of NN, which means using a trained model to predict or classify new data. The
training process of NN is not discussed in this paper. A neural network model can be expressed
as a directed graph, as shown in Fig. 4.4(a). Each vertex of the graph denotes a layer which
conducts operations on data from a previous layer or input and generates results for the next layer
or output. We refer to the parameters of each layer as weights and to the input/output of each
layer as activations throughout this paper.

FPGA-based Accelerator

In recent years, FPGA is becoming a promising solution for algorithm acceleration. Compared
with CPU, GPU, and DSP platforms, for which the software and hardware are designed
independently, FPGA enables the developers to implement only the necessary logic in hardware
according to the target algorithm. By eliminating the redundancy in general hardware platforms,
FPGAs can achieve higher efficiency. Application-specific integrated circuit (ASIC) based
solutions achieve even higher efficiency but require a much longer development cycle and higher
cost.

Fig. 4.5. (a) A typical structure of an FPGA-based NN accelerator. (b) Gap between NN model
size and the storage unit size on FPGAs. The bar chart compares the register and SRAM sizes on
FPGA chips in different scales. The dashed line denotes the parameter sizes of different NN
models with 32-bit floating point parameters

For an FPGA-based neural network accelerator, a typical architecture of the system is shown in
Fig. 4.5(a). The system usually consists of a CPU host and an FPGA part. A pure FPGA chip
usually works with a host PC/server through PCIe connections. SoC platforms (like the Xilinx
Zynq series) and the Intel HARPv2 [18] platform integrate the host and the FPGA in the same chip
or package. Both the host and the FPGA can work with their own external memory and access
each other’s memory through the connection. Most of the designs implement the NN accelerator on
the FPGA part and control the accelerator with software on the host. Typical FPGA chips
contain large on-chip storage units like registers and SRAM (static random access memory), but
these are still too small compared with NN models, as shown in Fig. 4.5(b). Common models require
100-1000 MB of parameters, while the largest available FPGA chips provide less than 50 MB of
on-chip SRAM. This gap means that external memory like DDR SDRAM is needed. The bandwidth
and power consumption of DDR limit the system performance. The computation capacity of FPGA
is relatively high. Common FPGAs implement hundreds to thousands of DSP units, each of
which can compute an 18 × 27 or 18 × 19 multiplication, achieving up to 10 TFLOP/s (floating point
operations per second) on the largest FPGAs. But for low-end FPGAs like the Xilinx
XC7Z020, this number is reduced to 20 GFLOP/s, which can hardly support real-time video
researchers have proposed a series of optimization methods from algorithm to architecture to
design high performance NN accelerators on FPGA, which will be discussed in the following
sections of this paper.
CHAPTER 5

RESULTS AND DISCUSSION

In order to evaluate the performance and cost of the DLAU accelerator, we have implemented
the hardware prototype on the Xilinx Zynq Zedboard development board, which is equipped with
an ARM Cortex-A9 processor clocked at 667 MHz and programmable fabric. For benchmarks, we
use the MNIST data set to train 784×M×N×10 DNNs in MATLAB, and use the M×N layers’
weights and node values as the input data of DLAU. For comparison, we use an Intel Core2
processor clocked at 2.3 GHz as the baseline. In the experiment we use tile size = 32, considering
the hardware resources integrated in the Zedboard development board. The DLAU computes 32
hardware neurons with 32 weights every cycle. The clock of DLAU is 200 MHz (one cycle takes
5 ns). Three network sizes, 64×64, 128×128, and 256×256, are tested.

Speedup Analysis

We present the speedup of DLAU and some other similar implementations of the deep learning
algorithms in Table I. Experimental results demonstrate that the DLAU is able to achieve up to
36.1× speedup at the 256×256 network size.

TABLE I COMPARISONS BETWEEN SIMILAR APPROACHES

In comparison, Ly and Chow [5] and Kim et al. [7] presented the work only on RBM algorithms,
while the DLAU is much more scalable and flexible. DianNao [6] reaches up to 117.87×
speedup due to its high working frequency of 0.98 GHz. Moreover, as DianNao is hardwired
instead of being implemented on an FPGA platform, it cannot efficiently adapt to different
neural network sizes.
Fig. 5.1. Speedup at different network sizes and tile sizes.

Fig. 5.1 illustrates the speedup of DLAU at different network sizes: 64×64, 128×128, and 256×256,
respectively. Experimental results demonstrate a reasonable increase in speedup with the growth
of the neural network size. In particular, the speedup increases from 19.2× at the 64×64 network size
to 36.1× at the 256×256 network size. The right part of Fig. 5.1 illustrates how the tile size has an
impact on the performance of the DLAU. A bigger tile size means that
more neurons are computed concurrently. At the network size of 128×128, the
speedup is 9.2× when the tile size is 8. When the tile size increases to 32, the speedup reaches
30.5×. Experimental results demonstrate that the DLAU framework is configurable and scalable
with different tile sizes. The speedup can be traded against hardware cost to achieve satisfying
tradeoffs. Fig. 5.2 illustrates the floorplan of the FPGA chip. The left corner depicts the ARM
processor, which is hardwired in the FPGA chip. Other modules, including different components
of the DLAU accelerator, the DMA, and the memory interconnect, are presented in different colors.
Regarding the programmable logic, TMMU takes most of the area as it utilizes a
significant number of LUTs and FFs.

Fig. 5.2. Floorplan of the FPGA chip.


Technology Schematic:
Design summary

Timing Report

COMPARISON TABLE

CONCLUSION

In this project, we have presented DLAU, which is a scalable and flexible deep learning
accelerator based on FPGA. The DLAU includes three pipelined processing units, which can be
reused for large-scale neural networks. The proposed DLAU uses a carry-save adder in the
computation process, and as a further extension of the project, a parallel self-timed adder (PASTA)
is used in the design to increase the computation speed. Experimental results on the Xilinx FPGA
prototype show that the DLAU accelerator achieves a significant speedup over the general-purpose
processor while remaining more scalable and flexible than ASIC-based accelerators.
REFERENCES

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
436–444, 2015.
2. J. Hauswald et al., “DjiNN and Tonic: DNN as a service and its implications for future
warehouse scale computers,” in Proc. ISCA, Portland, OR, USA, 2015, pp. 27–40.
3. C. Zhang et al., “Optimizing FPGA-based accelerator design for deep convolutional
neural networks,” in Proc. FPGA, Monterey, CA, USA, 2015, pp. 161–170.
4. P. Thibodeau, Data Centers are the New Polluters. Accessed on Apr. 4, 2016. [Online].
Available: http://www.computerworld.com/article/2598562/data-center/data-centers-are-the-new-polluters.html
5. D. L. Ly and P. Chow, “A high-performance FPGA architecture for restricted Boltzmann
machines,” in Proc. FPGA, Monterey, CA, USA, 2009, pp. 73–82.
6. T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in Proc. ASPLOS, Salt Lake City, UT, USA, 2014, pp. 269–284.
7. S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A highly scalable
restricted Boltzmann machine FPGA implementation,” in Proc. FPL, Prague, Czech
Republic, 2009, pp. 367–372.
8. Yu, C. Wang, X. Ma, X. Li, and X. Zhou, “A deep learning prediction process accelerator
based FPGA,” in Proc. CCGRID, Shenzhen, China, 2015, pp. 1159–1162.
9. J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural
network,” in Proc. FPGA, Monterey, CA, USA, 2016, pp. 26–35.
