Abstract
In this final project, an RTL design is developed and implemented as a hardware accelerator for the reinforcement learning process. The RTL design, written in Verilog, is simulated with the ModelSim software and implemented on the ZyBo board using the Vivado software. A reinforcement learning test is then carried out for navigation of a 10 × 10 maze. The designed system successfully navigates the maze after going through the learning process on hardware.
Keywords: Reinforcement Learning, Maze Navigation, ZyBO, Vivado, Verilog.
1. INTRODUCTION
In this final project for EL4138 VLSI System Design, an RTL design for reinforcement learning is implemented for navigation of a 10 × 10 maze. The RTL design is written in Verilog and runs on the ZyBo board. The assignment was motivated by the final project of the EL4138 VLSI System Design course and by participation in the 2021 LSI Design Contest, whose theme this time is "Reinforcement Learning". Reinforcement learning is a type of machine learning that adopts a reward and punishment system in the learning process. One application of reinforcement learning is navigating a maze with a machine or robot. Tests and simulations are carried out on the ZyBo board, with Vivado as the compiler, using the Verilog and C languages. The maze navigation simulation produces a visualization of a safe navigation path from the entrance to the exit of the maze.
The reinforcement learning application we implement is 10 × 10 maze navigation. Maze navigation is an algorithm that moves a robot through a maze automatically, without human involvement. Maze navigation is urgently needed in search and rescue operations on previously unmapped terrain, such as the search for people lost in a flooded natural cave, as happened to a group of youth soccer players in Thailand in 2018. Maze exploration is very dangerous when carried out by humans, because explorers can get lost, threatening their lives. We therefore decided to implement automatic maze navigation, in the hope that it provides an overview and learning material for people who want to build maze navigation robots.
2. LITERATURE STUDY
This section contains brief descriptions of the various literature sources related to the final project.
In reinforcement learning, an agent interacts with an environment: for every action the agent takes, the environment assigns a new state to the agent. The environment also produces rewards, which the agent seeks to maximize at all times. Agents can be trained using the following approach. Initially, the agent is given no clue about which action to take. The agent learns actions based on the trial-and-error principle, then makes decisions based on the (maximum) reward. Reinforcement learning has four sub-elements (the update rule that combines them is given after the list), namely:
1. Policy
The policy is the way an agent behaves in a given situation. In other words, this element is a mapping from a situation to the action the agent will take in it.
2. Reward Function
This element defines the goal the agent wants to achieve. In this process, the agent maximizes the reward of the actions it takes. The reward function is the agent's reference for what is good and what is bad.
3. Value Function
While the reward function defines the best result on the spot, under the value function the agent considers the best outcome in the long term. In other words, the value of a state is the total amount of reward the agent can accumulate in future periods, starting from that state. Rewards are obtained directly from the environment, while values must be estimated continuously from the agent's observations.
4. Environment Model
With this element, the agent predicts the next state and reward. This element is used for planning; in other words, the agent decides on an action by considering possible future situations.
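These four elements come together in the Q-learning update rule that the hardware in this report implements, where alpha is the learning rate and gamma the discount factor (the values used here are given in the conclusion):

Qnew(s, a) = Q(s, a) + alpha * ( r + gamma * max_a' Q(s', a') - Q(s, a) )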
2.3 VIVADO
The Vivado® Design Suite provides a next-generation, SoC-strength, IP- and system-centric development environment, built from the ground up to overcome productivity barriers in system-level integration and implementation. Among its components are the following: [3]
1. The Vivado simulator is a component of the Vivado Design Suite. It is a compiled language
simulator that supports mixed languages, Tcl scripts, encrypted IP and enhanced verification.
2. Vivado IP Integrator allows engineers to quickly integrate and configure IP from the large Xilinx IP library. The Integrator is also tuned for MathWorks Simulink designs built with the Xilinx System Generator and Vivado High-Level Synthesis.
3. The Vivado Tcl Store is a scripting system for developing add-ons to Vivado, and can be
used to add and modify Vivado's capabilities. Tcl is the scripting language Vivado is based on. All
the functions underlying Vivado can be called and controlled via Tcl Script.
2.4 ZYBO
ZYBO (ZYnq BOard) is a feature-rich, ready-to-use, entry-level embedded software and digital circuit development platform built around the smallest member of the Xilinx Zynq-7000 family, the Z-7010. The Z-7010 is based on the Xilinx All Programmable System-on-Chip (AP SoC) architecture, which tightly integrates a dual-core ARM Cortex-A9 processor with Xilinx 7-series Field Programmable Gate Array (FPGA) logic. With the rich set of connectivity peripherals available on the ZYBO, the Zynq Z-7010 can accommodate a complete system design. On-board memory, video and audio I/O, dual USB ports, Ethernet, and an SD slot make designs ready to go without additional hardware, and six Pmod ports put any design on an easy growth path. [4]
3. SYSTEM DESIGN
If decideFactor is positive, stateSelect is set to 1 and the agent chooses its action randomly. If decideFactor is negative, stateSelect is set to 0 and the agent chooses the next action with the largest value in the Qtable (the greedy algorithm). Thus, as the generation number grows, decideFactor becomes negative more easily, so the agent increasingly chooses the greedy action rather than a random action. A minimal C sketch of this selection follows.
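In the sketch below, the names decideFactor and stateSelect follow the signals above; rand() and the argmaxAction helper are hypothetical stand-ins for the hardware LFSR and MAX modules.

#include <stdlib.h>
/* Epsilon-greedy action selection as described above: explore while
   decideFactor is positive, otherwise act greedily on the Q table. */
static int argmaxAction(const short q[4][100], int state){
    int a, best = 0;
    for (a = 1; a < 4; a++)
        if (q[a][state] > q[best][state]) best = a;
    return best;
}
int selectAction(int decideFactor, const short q[4][100], int state){
    int stateSelect = (decideFactor >= 0) ? 1 : 0;
    if (stateSelect)
        return rand() & 3;           /* random action, like the LFSR */
    return argmaxAction(q, state);   /* greedy action (largest Q)    */
}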
3.2.2 Reward Generator
The Reward Generator is a block that determines the reward value (currentReward) obtained by the agent. Its inputs are clock, stateReset, currentState, and nextState; its output is currentReward. The block determines the reward based on nextState, currentState, and prevState, which is currentState delayed by one clock. The reward is determined from the states as follows (a C sketch follows the list):
1. The agent gets a reward of 10 (rw3) if nextState is 99 (the goal), regardless of the currentState and prevState values.
2. The agent gets a punishment of -10 (rw2) if it hits a wall, which is marked by an unchanged state; a wall hit is identified when currentState and nextState have the same value.
3. The agent gets a punishment of -5 (rw1) if it reverses direction, i.e. returns to the previous state, which is marked by the prevState value equaling the nextState value. This punishment is given to avoid looping conditions in which the agent moves back and forth between the same two states.
4. The agent gets a reward of 0 (rw0) if it changes state without hitting a wall or returning to the previous state.
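In software terms, the four rules amount to the following C sketch. Plain integers are used here; the hardware encodes the same rewards as 8.8 fixed-point constants (for example 0x0A00 for 10).

/* Reward selection per the four rules above; the goal is checked first
   because it applies regardless of prevState and currentState. */
int rewardFor(int prevState, int currentState, int nextState){
    if (nextState == 99)           return 10;   /* rw3: reached the goal   */
    if (nextState == currentState) return -10;  /* rw2: hit a wall         */
    if (nextState == prevState)    return -5;   /* rw1: reversed direction */
    return 0;                                   /* rw0: normal move        */
}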
RAM_WallH and RAM_WallV are read-only memories written by software, which will be explained in the next section. RAM_WallH and RAM_WallV store, for each state, whether there is a wall on the left of that state and above that state, respectively. The addresses of RAM_WallH and RAM_WallV correspond to the states they represent. The access addresses of RAM_WallH and RAM_WallV for each type of action are as follows (a C sketch follows the list):
1. Action 0 (up): access the RAM_WallV address for currentState, which tells whether a wall is present above the current state (currentState).
2. Action 1 (down): access the RAM_WallV address for (currentState + 10), which tells whether a wall is present above the state below the current state, i.e. the wall at the bottom of the current state.
3. Action 2 (left): access the RAM_WallH address for currentState, which tells whether a wall is present on the left of the current state.
4. Action 3 (right): access the RAM_WallH address for (currentState + 1), which tells whether a wall is present on the left of the state to the right of the current state, i.e. the wall at the right of the current state.
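The address computation amounts to the following C sketch, in which the wallV and wallH arrays are hypothetical stand-ins for RAM_WallV and RAM_WallH.

/* Wall lookup per action, following the address list above. */
int hitWallFor(int action, int currentState,
               const short wallV[], const short wallH[]){
    switch (action){
        case 0:  return wallV[currentState];      /* wall above current   */
        case 1:  return wallV[currentState + 10]; /* wall below current   */
        case 2:  return wallH[currentState];      /* wall left of current */
        default: return wallH[currentState + 1];  /* wall right of current*/
    }
}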
The wall flag (hitWall) is then processed by the State Gen block, which determines the agent's next movement based on currentState, nextAction, and hitWall as follows (a C sketch follows the list):
1. the next state (nextState) equals the current state (currentState) if a wall is detected or the agent is at the destination position (state 99)
2. otherwise, the next state moves from the current state by a displacement that depends on the nextAction signal, as follows:
a. Action 0 (up): the agent moves to state (currentState - 10)
b. Action 1 (down): the agent moves to state (currentState + 10)
c. Action 2 (left): the agent moves to state (currentState - 1)
d. Action 3 (right): the agent moves to state (currentState + 1)
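A C sketch of this next-state rule, using the same state numbering as the hardware:

/* Stay put on a wall hit or at the goal (state 99); otherwise move by
   the displacement that belongs to the chosen action. */
int nextStateFor(int currentState, int action, int hitWall){
    static const int step[4] = {-10, 10, -1, 1}; /* up, down, left, right */
    if (hitWall || currentState == 99)
        return currentState;
    return currentState + step[action];
}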
The waveforms show that the control unit repeats the training process when the reset signal is given. After state 99 is reached, the stateReset signal stays at 1, so the agent no longer changes state and the training process halts.
As seen in the waveform, after 50 actions (50 clocks) the stateReset signal goes to 1, which returns the agent to position 0. This marks a change of generation.
b. Middle Generations
In the middle generations (marked by genCount values between 500 and 600), the stateSelectOut signal alternates between 0 and 1, indicating that the agent sometimes takes a random action and sometimes a greedy action. This matches the epsilon-greedy algorithm, in which the actions taken in the middle generations are a combination of random and greedy actions.
c. Final Generations
In the final generations (marked by genCount values above 950), the stateSelectOut signal is always 0, indicating that the agent takes the greedy action; this is confirmed by the nextAction signal equaling the actGreed signal. This matches the epsilon-greedy algorithm, in which the final generations are meant to take only greedy actions.
The figure shows that when nextState reaches 99 (the goal), the reward given is 0x0A00, or 10 in decimal.
When the policy generator reads the wall memories and the agent's chosen movement would take it into a wall, indicated by the hitWall signal being 1, the agent's position does not change (currentState = nextState).
Qnew from the simulation result is 0xF500, or -11 in decimal: in the 8.8 fixed-point format, 0xF500 is the signed value -2816, and -2816/256 = -11. Thus, Qnew from the hand calculation and from the simulation are the same.
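For reference, the following is a C sketch of the shift-based update the QUpdater datapath implements. It assumes 8.8 fixed-point Q values, alpha = 0.5 realized as one right shift, gamma = 0.875 realized as 1/2 + 1/4 + 1/8 (the three right shifters named in the conclusion), and arithmetic right shifts on negative values.

/* Qnew = Q + alpha*(r + gamma*Qmax - Q) in 8.8 fixed point,
   with the multiplications approximated by right shifts. */
short qUpdate(short q, short qMax, short r){
    short gammaQmax = (short)((qMax >> 1) + (qMax >> 2) + (qMax >> 3));
    short delta     = (short)(r + gammaQmax - q);
    return (short)(q + (delta >> 1));            /* alpha = 0.5 */
}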
4.2 SYSTEM PERFORMANCE
The designed system is then implemented on the ZyBo through the Vivado software. Figure 4.14 shows the block design created to implement it on the ZyBo. The design successfully passed the implementation process in the software, so its size and power usage can be analyzed; a demo on the physical ZyBo board, however, could not yet be shown.
4.2.1 Speed
The RTL design executes the entire Q-learning computation needed for one action in a single clock period, so one generation of 50 actions requires only 50 clock cycles. The time required to train the maze navigation agent can therefore be calculated as 1023 generations × 50 actions per generation × 1 clock per action = 51,150 clock periods.
4.2.2 Size
Through the implementation process in the Vivado software, information is obtained on the resources and power needed to run the design. The available resources on the ZyBo are still sufficient for the design's needs, so it can be implemented without changing the architecture.
Figure 4.15 FPGA Components Required
4.3.3 Reading the Qtable1, Qtable2, Qtable3, and Qtable4 memories after the training process
After the training process completes, the software reads the Qtable that QLearningAgent updated during training. The Qtable is then processed so that the software can determine the travel path the agent has obtained. The contents of the Qtable memory before and after the training process can be seen in the appendix.
Figure 4.22 Display during the learning process
4.3.4 Displaying the agent navigation results based on the Qtable memory contents received from the hardware
Below are the results of our maze navigation, which show the agent able to exit the maze in 20 steps.
[Figures: the agent's position at Steps 3 through 20 of the navigated path]
5. CONCLUSION
From the design, simulation, and implementation processes that have been carried out, the following conclusions are obtained.
• The designed system can carry out a reinforcement learning process to navigate a maze with a size of 10 × 10.
• The entire learning process was successfully implemented in hardware through the RTL design, so the software serves only as a link for user input and contains no policy generator algorithm.
• The learning process produces a Qtable with which the agent's actions reach the destination position, using the following learning parameters:
o Alpha of 0.5
o Gamma of 0.875
o 1023 generations with a maximum of 50 actions per generation
o QUpdater calculation with 3 right shifters
• The learning process takes a total of 51,150 clock periods.
• The implementation on the ZyBo remains below the resource limits, using 11% of the available LUTs; the LUTRAM and RAM utilization figures are given in Figure 4.15. The power consumption is 1.6665 W, with a junction temperature of 44.2 °C, a thermal margin of 40.8 °C (3.5 W), and an effective θJA of 11.5 °C/W.
List of References
[1] https://medium.com/group-3-machine-learning/reinforcement-learning-dcf88b8a49c, accessed on 28 December 2020
[2] https://en.wikipedia.org/wiki/Verilog , accessed on 27 December 2020
[3] https://www.xilinx.com/support/university/vivado.html , accessed on 27 December 2020
[4] https://reference.digilentinc.com/reference/programmable-logic/zybo/start , accessed on 27 December
2020
[5] Spanò, S., Cardarilli, G. C., Di Nunzio, L., Fazzolari, R., Giardino, D., Matta, M., ... & Re, M. (2019). An efficient hardware implementation of reinforcement learning: The Q-learning algorithm. IEEE Access, 7, 186340-186351.
APPENDIX I Verilog Code
Code for file modules16bit.v
//Andi Muhammad Riyadhus Ilmy (13217053)
//Tyfan Juliano (13217063)
//Yafie Abdillah (13217091)
//Evan Robert (13217057)
//Building Blocks for Q Learning Accelerator with 16 bit data
//=========================================================================
//=========================================================================
//Q LEARNING ACCELERATOR MODULES===========================================
//=========================================================================
//=========================================================================
//VALIDATION MODULE========================================================
//Prevent Q Table from being Updated if Finished (State = 99)
module isFinished(
input signed [15:0] qValue,
input [7:0] currentState,
output signed [15:0] out
);
assign out = (currentState != 8'd99) ? qValue:
16'd0;
endmodule
//DECODER MODULE===========================================================
//Send enable signal to control Action RAM Write Mode.
//Use 2 bit input for 4 outputs.
module decoder(
input stateRst,
input rst,
input [1:0] act,
output reg en0,
output reg en1,
output reg en2,
output reg en3
);
//One-hot write enable for the Action RAM selected by act
//(decoder body reconstructed from its port list; gating the enables
//during reset is an assumption)
always @(stateRst or rst or act) begin
en0 = (!stateRst && !rst) && (act == 2'd0);
en1 = (!stateRst && !rst) && (act == 2'd1);
en2 = (!stateRst && !rst) && (act == 2'd2);
en3 = (!stateRst && !rst) && (act == 2'd3);
end
endmodule
//ACTION RAM MODULES=======================================================
//Store the Q value of each action; one RAM per action.
module ram1_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row1.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram2_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row2.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram3_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row3.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram4_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN ) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row4.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
//MULTIPLEXER MODULE=======================================================
//Select Q Value, based on action taken by agents.
//Use 2 bit selector that represent action taken by the agent.
module mux4to1_16bit(
input [15:0] in0, in1, in2, in3,
input [1:0] sel,
output [15:0] out
);
assign out =
(sel == 2'd0) ? in0 :
(sel == 2'd1) ? in1 :
(sel == 2'd2) ? in2 :
(sel == 2'd3) ? in3 :
16'd00;
endmodule
//MAX MODULE===============================================================
//Compare 4 value and choose the highest value
module max4to1_16bit(
input [15:0] D1, D2, D3, D4,
output [15:0] Y
);
wire [15:0] max0_out, max1_out;
//portmap
compMax_16bit max0(.A(D1), .B(D2), .C(max0_out));
compMax_16bit max1(.A(D3), .B(D4), .C(max1_out));
compMax_16bit max2(.A(max0_out), .B(max1_out), .C(Y));
endmodule
//COMPARATOR===============================================================
//Act as basic module for building MAX MODULE
//Compare 2 value and choose the highest value
module compMax_16bit (
input signed[15:0] A, B,
output signed[15:0] C
);
assign C = (A > B) ? A : B;
endmodule
//Q UPDATER MODULE=========================================================
//Implement equation to update Q Value based on Alfa, Gamma, and Reward
module qUpdater_16bit(
input signed [15:0] Q, Qmax, rt,
input signed [7:0] alfa_i, alfa_j, alfa_k,
input signed [7:0] gamma_i, gamma_j, gamma_k,
output signed[15:0] Qnew
);
wire signed [15:0] yi_a0, yj_a0, yk_a1;
wire signed [15:0] a0_a1, a1_a2, a2_s0;
wire signed [15:0] s0_alfa, ai_a3, aj_a3, ak_a4;
wire signed [15:0] a3_a4, Qn;
//PortMap
rShift_16bit yi(.Q(Qmax), .S(gamma_i), .Y(yi_a0));
rShift_16bit yj(.Q(Qmax), .S(gamma_j), .Y(yj_a0));
rShift_16bit yk(.Q(Qmax), .S(gamma_k), .Y(yk_a1));
add_16bit a0(.in0(yi_a0), .in1(yj_a0), .out(a0_a1));
add_16bit a1(.in0(a0_a1), .in1(yk_a1), .out(a1_a2));
//Remaining datapath: Qnew = Q + alfa*(rt + gamma*Qmax - Q); the port
//maps below are reconstructed from the declared wire names
add_16bit a2(.in0(a1_a2), .in1(rt), .out(a2_s0));
sub_16bit s0(.in0(a2_s0), .in1(Q), .out(s0_alfa));
rShift_16bit ai(.Q(s0_alfa), .S(alfa_i), .Y(ai_a3));
rShift_16bit aj(.Q(s0_alfa), .S(alfa_j), .Y(aj_a3));
rShift_16bit ak(.Q(s0_alfa), .S(alfa_k), .Y(ak_a4));
add_16bit a3(.in0(ai_a3), .in1(aj_a3), .out(a3_a4));
add_16bit a4(.in0(a3_a4), .in1(ak_a4), .out(Qn));
add_16bit a5(.in0(Q), .in1(Qn), .out(Qnew));
endmodule
//RIGHT SHIFTER============================================================
//Act as basic module for building Q UPDATER MODULE
//Implement right shift as approximated multiplier
module rShift_16bit (
input signed [15:0] Q,
input signed [7:0] S,
output signed [15:0] Y
);
assign Y = (S == 8'sd0) ? 16'sd0 :
((Q >>> S));
endmodule
//ADDER====================================================================
//Act as basic module for building Q UPDATER MODULE
//Implement addition
module add_16bit(
input signed[15:0] in0, in1,
output signed[15:0] out
);
assign out = $signed(in0) + $signed(in1);
endmodule
//SUBSTRACTOR==============================================================
//Act as basic module for building Q UPDATER MODULE
//Implement substraction
module sub_16bit(
input signed[15:0] in0, in1,
output signed[15:0] out
);
assign out = $signed(in0) - $signed(in1);
endmodule
//=========================================================================
//=========================================================================
//POLICY GENERATOR MODULES=================================================
//=========================================================================
//=========================================================================
//GREEDY ACTION MODULE=====================================================
//Choose the action with the largest Q value (module header reconstructed
//from the instantiation in policyGenerator_16bit below)
module greedAction(
input [15:0] qAct0, qAct1, qAct2, qAct3,
output [1:0] nextAction
);
wire [15:0] maxValue;
max4to1_16bit max(
.D1(qAct0),
.D2(qAct1),
.D3(qAct2),
.D4(qAct3),
.Y(maxValue)
);
assign nextAction =
(maxValue == qAct0) ? 2'd0:
(maxValue == qAct1) ? 2'd1:
(maxValue == qAct2) ? 2'd2:
(maxValue == qAct3) ? 2'd3:
2'd0;
endmodule
//fibonacci lsfr
always @(posedge clk)
begin
shiftReg = shiftReg << 1; //Left shift 1 bit
//=========================================================================
//=========================================================================
//WALL DETECT SUBMODULES===================================================
//=========================================================================
//=========================================================================
//WALL MEMORY MODULES======================================================
//Read-only memories holding the horizontal and vertical wall maps,
//written by the software
module wallH_16bit (
input [7:0] RD_ADDR,
output signed [15:0] D_OUT
);
reg [15:0] tempMem[0:255];
assign D_OUT = tempMem[RD_ADDR]; //read port (reconstructed)
endmodule
//The instances below are an excerpt of the wallDetector module, which
//reads both wall memories at the addresses derived from the chosen action
wallV_16bit wallV_16bit(
.RD_ADDR(rdAddrV),
.D_OUT(wallV_det)
);
wallH_16bit wallH_16bit(
.RD_ADDR(rdAddrH),
.D_OUT(wallH_det)
);
endmodule
//STATE GENERATOR MODULE===================================================
//determine agent's next action after taking certain action from certain state
module stategen(
input [7:0] currentState,
input [15:0] hitWall,
input [1:0] nxtAction,
output [7:0] nxtState
);
reg [7:0] out;
//Displacement per action (logic reconstructed from the state rules above)
always @(*) begin
if (hitWall != 16'd0)
out = currentState; //stay put on a wall hit
else case (nxtAction)
2'd0: out = currentState - 8'd10; //up
2'd1: out = currentState + 8'd10; //down
2'd2: out = currentState - 8'd1; //left
default: out = currentState + 8'd1; //right
endcase
end
assign nxtState = (currentState == 8'd99) ? currentState : out;
endmodule
module wallDetect (
input [7:0] currentState,
input [1:0] nxtAction,
output [7:0] nxtState
);
wire [15:0] hitWall; //internal wire between wallDetector and stategen
wallDetector wallDetector (
.currentState(currentState),
.nxtAct(nxtAction),
.hitWallfin(hitWall)
);
stategen stategen(
.currentState(currentState) ,
.hitWall(hitWall),
.nxtAction(nxtAction),
.nxtState(nxtState)
);
endmodule
//=========================================================================
//=========================================================================
//END OF WALL DETECT SUBMODULES============================================
//=========================================================================
//=========================================================================
//=========================================================================
//=========================================================================
//REWARD GENERATOR MODULE==================================================
//=========================================================================
//=========================================================================
//REWARD MULTIPLEXER MODULE================================================
//select reward according to reward selector
module rewardMux_16bit(
input signed [15:0] rw0, rw1, rw2, rw3,
input [1:0] rwSel,
input staterst,
output signed [15:0] out
);
assign out =
(rwSel == 2'd1 && staterst == 1'd0) ? rw1 : //-5
(rwSel == 2'd2 && staterst == 1'd0) ? rw2 : //-10
(rwSel == 2'd3 && staterst == 1'd0) ? rw3 : //+10
rw0; //0
endmodule
isFinished isFinished(
.qValue(newQVal),
.currentState(st),
.out(qOut)
);
decoder decoder(
.stateRst(stateRst),
.rst(rst),
.act(act),
.en0(wrEn1),
.en1(wrEn2),
.en2(wrEn3),
.en3(wrEn4)
);
ram1_16bit action1(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn1),
.D_OUT(datOut1)
);
ram2_16bit action2(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn2),
.D_OUT(datOut2)
);
ram3_16bit action3(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn3),
.D_OUT(datOut3)
);
ram4_16bit action4(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn4),
.D_OUT(datOut4)
);
mux4to1_16bit mux(
.in0(del0),
.in1(del1),
.in2(del2),
.in3(del3),
.sel(act),
.out(muxOut)
);
max4to1_16bit max(
.D1(datOut1),
.D2(datOut2),
.D3(datOut3),
.D4(datOut4),
.Y(maxOut)
);
qUpdater_16bit main(
.Q(muxOut),
.Qmax(maxOut),
.rt(rt),
.alfa_i(alfai),
.alfa_j(alfaj),
.alfa_k(alfak),
.gamma_i(gammai),
.gamma_j(gammaj),
.gamma_k(gammak),
.Qnew(newQVal)
);
//Delay operation
always @ (posedge clk)
begin
del0 = datOut1;
del1 = datOut2;
del2 = datOut3;
del3 = datOut4;
end
//PORTMAP
rewardSelect sel0(
.prevState(prevState),
.nxtState(nextStateIn),
.currentState(currentStateIn),
.rwSel(rwSel)
);
rewardMux_16bit mux0(
.rw0(rw0),
.rw1(rw1),
.rw2(rw2),
.rw3(rw3),
.rwSel(rwSel),
.staterst(stateRstIn),
.out(nextReward)
);
//Delay operations
always @(posedge clk) begin
prevState = currentStateIn;
tempOut = nextReward;
end
//Assign Outputs
assign currentRewardOut = nextReward;
endmodule
module policyGenerator_16bit(
input clk,
input stateRstIn, //From CU
input stateSelectIn, //From CU
input [7:0] currentStateIn,
input [15:0] qAct0, qAct1, qAct2, qAct3, //From QUpdater
output [1:0] nextActionOut,
output [9:0] randValueOut,
output [7:0] nxtStateOut
);
//PORTMAP
lsfr_16bit lsfr0(
.clk(clk),
.nextAction(actLsfr),
.randomValue(randValueOut)
);
greedAction greed0(
.qAct0(qAct0), .qAct1(qAct1), .qAct2(qAct2), .qAct3(qAct3),
.nextAction(actGreed)
);
decideNextAct mux0(
.greedAct(actGreed), .lsfrAct(actLsfr),
.sel(stateSelectIn),
.nxtAct(muxOut)
);
wallDetect wall0(
.currentState(currentStateIn),
.nxtAction(muxOut),
.nxtState(wallOut)
);
resetState rst0(
.inState(wallOut),
.stateRst(stateRstIn),
.outState(nxtStateOut)
);
endmodule
module qLearningAgent_16bit(
input CLOCK, RESET,
output [1:0] nextAction
);
//Rewards
parameter signed [15:0] rw0 = 16'sh0000; //ZERO
parameter signed [15:0] rw1 = 16'shFB00; //-5 = 1111 1011.0000 0000
parameter signed [15:0] rw2 = 16'shF600; //-10 = 1111 0110.0000 0000
parameter signed [15:0] rw3 = 16'sh0A00; // 10 = 0000 1010.0000 0000
//Wires and registers
reg [1:0] currentAction; //registered copy of nextAction
reg [7:0] currentState; //registered agent state
wire stateReset, stateSelect;
wire [7:0] nextState;
wire [9:0] randomValue;
wire signed [15:0] currentReward;
wire signed [15:0] row0, row1, row2, row3; // For passing to outputs
//PORTMAPPING START========================================================
maze_display_tb_2 disp(
.state(currentState)
);
qLearningAccel_16bit qla(
.clk(CLOCK),
.stateRst(stateReset),
.rst(RESET),
.st(currentState), //Current State
.nxtst(nextState), //Next State
.act(nextAction), //Current Action
.rt(currentReward),
.alfa(alfa),
.gamma(gamma),
.qRow0(row0), .qRow1(row1), .qRow2(row2), .qRow3(row3)
);
policyGenerator_16bit pg(
.clk(CLOCK),
.stateRstIn(stateReset), //From CU
.stateSelectIn(stateSelect), //From CU
.currentStateIn(currentState),
.qAct0(row0), .qAct1(row1), .qAct2(row2), .qAct3(row3), //From QUpdater
.nextActionOut(nextAction),
.randValueOut(randomValue),
.nxtStateOut(nextState)
);
controlUnit cu(
.clk (CLOCK),
.rst(RESET),
.randomValueIn(randomValue),
.stateRstOut(stateReset),
.stateSelectOut(stateSelect)
);
rewardModule_16bit reward(
.clk(CLOCK),
.stateRstIn(stateReset),
.currentStateIn(currentState),
.nextStateIn(nextState),
.rw0(rw0), .rw1(rw1), .rw2(rw2), .rw3(rw3),
.currentRewardOut(currentReward)
);
//=========================================================================
//Delay Operations
always @(posedge CLOCK)
begin
currentAction <= nextAction;
currentState <= nextState;
end
endmodule
//TESTBENCH================================================================
//Module header reconstructed to match the DUT instantiation below
module tb_qLearningAgent_16bit();
reg clock, reset;
wire [1:0] nextAct;
qLearningAgent_16bit DUT(
.CLOCK(clock),
.RESET(reset),
.nextAction(nextAct)
);
initial begin
clock = 1'b1;
reset = 1'b1;
end
//clock generator
always begin
#50 clock = ~clock; //toggle the clock every 50 ps
end
//reset Cycle
always begin
#150
reset = ~reset; //Reset HIGH for 150 ps
#7850
reset = ~reset; //Reset LOW for 7850 ps
end
endmodule
//mapDisp==================================================================
//Displays the map based on the vertical and horizontal wall locations
//and the agent's current location
void mapDisp(short int wallH[], short int wallV[], int agent){
int indH, indV, wallvdisp;
int indWallH, indWallV, indPos;
indH = 0;
indV = 0;
int indfin = 0;
indWallH = 0;
indWallV = 0;
indPos = 0;
wallvdisp = 0;
for(indV = 0; indV < 20; indV++){
for(indH = 0; indH <20; indH++){
if((indV%2) == 0){
if((indH%2) == 0){ //even rows, even columns: junction characters
if(indV == 0 && indH == 0){
printf("-");
}
else if((wallvdisp == 1) && (indH !=0)){
printf("-");
wallvdisp = 0;
}
else{
printf(" ");
wallvdisp = 0;
}
}
else {
if(wallV[indWallV] == 1){
if(indH == 19){
if(indV == 0){
printf("--");
}
else{
printf("|");
}
}
else{
printf("-");
}
wallvdisp = 1;
}
else{
if(indH == 19){
printf(" |");
}
else{
printf(" ");
}
}
indWallV++;
}
}
else{
if((indH%2) == 0){
if(wallH[indWallH] == 1){
printf("|");
}
else{
printf(" ");
}
indWallH++;
if(((indWallH+1)%20)==0){
}
}
else {
if(indPos == agent){
printf("1");
}
else{
printf("0");
}
indPos++;
}
}
}
if((indV%2) != 0){
printf("|");
}
printf("\n");
}
for (indfin = 0; indfin < 21; indfin++){
if((indfin%2) == 0){
printf("-");
}
else {
printf("-");
}
}
printf("\n");
//return 0;
}
//writeMap=================================================================
//Writes the arrays holding the vertical and horizontal wall locations to
//the Zybo memory
void writeMap (short int wallH[], short int wallV[]){
//body omitted in the source listing
}
//printPath================================================================
//Displays a visualization of the agent's journey based on the learned
//Q table and the vertical and horizontal wall locations
void printPath(short int qtable0[], short int qtable1[], short int qtable2[],
short int qtable3[], short int wallH[],short int wallV[]){
printf("route : \n");
int s = 0;
short int max, maxind;
int i;
for (i = 0; i<100 ;i++){
if (s == 99){
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n GOAL in %d step!!! \n",i);
break;
}
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n");
max = qtable0[s];
maxind = 0;
if(max < qtable1[s]){
max = qtable1[s];
maxind = 1;
}
if(max < qtable2[s]){
max = qtable2[s];
maxind = 2;
}
if(max < qtable3[s]){
max = qtable3[s];
maxind = 3;
}
if(maxind == 0){
if(wallH[s+1] == 0){
s = s+1;
printf("Agent go East\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else if(maxind == 1){
if(wallV[s] == 0){
s = s-10;
printf("Agent go North\n");
}
else{
printf("HIT WALL\n!!");
break;
}
}
else if(maxind == 2){
if(wallH[s] == 0){
s = s-1;
printf("Agent go West\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else{
if(wallV[s+10] == 0){
s = s+10;
printf("Agent go South\n");
}
else{
printf("HIT WALL!!");
break;
}
}
}
}
void main () {
//memout_p1 = (uint32_t *)MEM_OUT_BASE1; //assign qtable0 memory location
//memout_p2 = (uint32_t *)MEM_OUT_BASE2; //assign qtable1 memory location
//memout_p3 = (uint32_t *)MEM_OUT_BASE3; //assign qtable2 memory location
//memout_p4 = (uint32_t *)MEM_OUT_BASE4; //assign qtable3 memory location
printf("start\n");
short int disp[100] = {11, 10, 8, 8, 8, 8, 8, 8, 8, 9,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
2, 1, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 2, 0, 0, 0, 4, 4, 4, 4, 1,
7, 6, 4, 4, 12, 12, 12, 12, 12, 5};
short int qtable0[100];
short int qtable1[100];
short int qtable2[100];
short int qtable3[100];
short int wallH [100]; //horizontal wall positions: 1 if there is a wall to the left of a cell; indices 10, 21, 32, ..., 109 are the map's right-boundary walls
short int wallV [110]; //vertical wall positions: 1 if there is a wall above a cell; indices 100, 101, ..., 109 are the map's bottom-boundary walls
int i = 0;
int input = 0;
//write the map to Zybo memory
//writeMap(disp);
//the program starts converting the map array into horizontal and vertical wall arrays
for(i=0; i<100; i++){
wallV[i] = 0;
wallH[i] = 0;
}
for(i=0; i<100; i++){
if ((disp[i] & 2) == 2){
wallH[i] = 1;
}
else {
wallH[i] = 0;
}
if((disp[i] & 8) == 8){
wallV[i] = 1;
}
else{
wallV[i] = 0;
}
}
for(i=100; i<110; i++) {
wallV[i] = 1;
}
// wall arrays filled
//read the qtable from the hardware
for (int i = 0; i < 100; i++){ //read the 100 Q-table entries
qtable0[i] = (unsigned int)*(memout_p1+i);
qtable1[i] = (unsigned int)*(memout_p2+i);
qtable2[i] = (unsigned int)*(memout_p3+i);
qtable3[i] = (unsigned int)*(memout_p4+i);
}
while (input != 99){
printf("\nWhat should I do? :\n");
printf("1 : print Map\n");
printf("2 : show Prediction\n");
printf("99 : END\n");
printf("Command : ");
scanf("%d",&input);
printf("\n");
if(input == 1){
mapDisp(wallH, wallV, 0);
}
else if (input == 2){
printf("Sending Map data to Zybo...\n");
printf("Calculating best route...\n");
printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
}
else if (input == 99){
printf("Thanks!!\n");
}
else {
printf("Invalid Command!!\n");
}
}
}
Maze_Nav_Disp_demo.c (software for the demo)
#include <stdio.h>
#include <stdlib.h>
#include "string.h"
#include "math.h"
#define MEM_INP_BASE_H 0x0 // memory for horizontal walls
#define MEM_INP_BASE_V 0x0 // memory for vertical walls
#define MEM_OUT_BASE1 0x0 // qtable 0
#define MEM_OUT_BASE2 0x0 // qtable 1
#define MEM_OUT_BASE3 0x0 // qtable 2
#define MEM_OUT_BASE4 0x0 // qtable 3
//mapDisp==================================================================
//Displays the map based on the vertical and horizontal wall locations
//and the agent's current location
void mapDisp(short int wallH[], short int wallV[], int agent){
int indH, indV, wallvdisp;
int indWallH, indWallV, indPos;
indH = 0;
indV = 0;
int indfin = 0;
indWallH = 0;
indWallV = 0;
indPos = 0;
wallvdisp = 0;
for(indV = 0; indV < 20; indV++){
for(indH = 0; indH <20; indH++){
if((indV%2) == 0){
if((indH%2) == 0){ //even rows, even columns: junction characters
if(indV == 0 && indH == 0){
printf("-");
}
else if((wallvdisp == 1) && (indH !=0)){
printf("-");
wallvdisp = 0;
}
else{
printf(" ");
wallvdisp = 0;
}
}
else {
if(wallV[indWallV] == 1){
if(indH == 19){
if(indV == 0){
printf("--");
}
else{
printf("|");
}
}
else{
printf("-");
}
wallvdisp = 1;
}
else{
if(indH == 19){
printf(" |");
}
else{
printf(" ");
}
}
indWallV++;
}
}
else{
if((indH%2) == 0){
if(wallH[indWallH] == 1){
printf("|");
}
else{
printf(" ");
}
indWallH++;
if(((indWallH+1)%20)==0){
}
}
else {
if(indPos == agent){
printf("1");
}
else{
printf("0");
}
indPos++;
}
}
}
if((indV%2) != 0){
printf("|");
}
printf("\n");
}
for (indfin = 0; indfin < 21; indfin++){
if((indfin%2) == 0){
printf("-");
}
else {
printf("-");
}
}
printf("\n");
//return 0;
}
//writeMap=================================================================
//Writes the arrays holding the vertical and horizontal wall locations to
//the Zybo memory
void writeMap (short int wallH[], short int wallV[]){
//body omitted in the source listing
}
//printPath================================================================
//Displays a visualization of the agent's journey based on the learned
//Q table and the vertical and horizontal wall locations
void printPath(short int qtable0[], short int qtable1[], short int qtable2[],
short int qtable3[], short int wallH[],short int wallV[]){
printf("route : \n");
int s = 0;
short int max, maxind;
int i;
for (i = 0; i<100 ;i++){
if (s == 99){
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n GOAL in %d step!!! \n",i);
break;
}
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n");
max = qtable0[s];
maxind = 0;
if(max < qtable1[s]){
max = qtable1[s];
maxind = 1;
}
if(max < qtable2[s]){
max = qtable2[s];
maxind = 2;
}
if(max < qtable3[s]){
max = qtable3[s];
maxind = 3;
}
if(maxind == 0){
if(wallH[s+1] == 0){
s = s+1;
printf("Agent go East\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else if(maxind == 1){
if(wallV[s] == 0){
s = s-10;
printf("Agent go North\n");
}
else{
printf("HIT WALL\n!!");
break;
}
}
else if(maxind == 2){
if(wallH[s] == 0){
s = s-1;
printf("Agent go West\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else{
if(wallV[s+10] == 0){
s = s+10;
printf("Agent go South\n");
}
else{
printf("HIT WALL!!");
break;
}
}
}
}
void main () {
//memout_p1 = (uint32_t *)MEM_OUT_BASE1; //assign qtable0 memory location
//memout_p2 = (uint32_t *)MEM_OUT_BASE2; //assign qtable1 memory location
//memout_p3 = (uint32_t *)MEM_OUT_BASE3; //assign qtable2 memory location
//memout_p4 = (uint32_t *)MEM_OUT_BASE4; //assign qtable3 memory location
printf("start\n");
short int disp[100] = {11, 10, 8, 8, 8, 8, 8, 8, 8, 9,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
2, 1, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 2, 0, 0, 0, 4, 4, 4, 4, 1,
7, 6, 4, 4, 12, 12, 12, 12, 12, 5};
short int qtable0[100] = {0xf622, 0xfe63, 0xfa84, 0xf925, 0xfbbe, 0xfc25,
0xfa06, 0xfa60, 0xfbab, 0xf8a0,
0xf52a, 0xf246, 0xfad3, 0xfd96, 0xfcf5, 0xfe6b,
0xfcaa, 0xfabb, 0xfb7e, 0xf7c9,
0xf638, 0xfc18, 0xfe36, 0xfe41, 0xfe21, 0xfd7b,
0xfa1b, 0xfb97, 0xfe68, 0xf804,
0xf6a1, 0xf0eb, 0xf85d, 0xfa86, 0xfce3, 0xfe9b,
0xfbe4, 0xfe0e, 0xfc50, 0xf792,
0xf6e6, 0xf288, 0xfb3f, 0xfae0, 0xfbdd, 0xfbee,
0xfc43, 0xfcb6, 0x0097, 0xf578,
0x017c, 0xf742, 0xf9d0, 0xfcee, 0xfcd3, 0xfca7,
0xfb86, 0xfcd5, 0x0102, 0xf954,
0xf314, 0xf7e9, 0xfcc6, 0xfe52, 0xfdf7, 0xfebc,
0xfde0, 0xff47, 0xff3a, 0xf99c,
0xf1d2, 0xf828, 0xfb08, 0xfcb9, 0xfb93, 0xff07,
0xfd5e, 0x0460, 0x034e, 0xfb15,
0xef30, 0xffca, 0x0018, 0x0474, 0x0518, 0x05d5,
0x06ac, 0x07a3, 0x08bc, 0xfebd,
0xedba, 0x02f7, 0x0366, 0x01E9, 0x0585, 0x06ac,
0x07a3, 0x08bc, 0x09ff, 0x0000};
short int qtable1[100] =
{0xf693,0xf245,0xf48f,0xf2e7,0xf3ec,0xf4ec,0xf3e9,0xf536,0xf422,0xf921,
0xf566,0xfeb6,0xf811,0xfecb,0xfda9,0xfd60,0xfc76,0xfe02,0xfbd9,0xfaea,
0xfcca,0xfec4,0xf878,0xfc5a,0xfa0d,0xfc0f,0xfc3c,0xfbd0,0xfdfe,0xfd0b,
0xfc51,0xfd88,0xfc25,0xfd4b,0xfda1,0xfbb7,0xfd2f,0xfc3c,0xfd32,0xfb11,
0xfc96,0xfcd1,0xf9ed,0xfbf1,0xfd68,0xfb77,0xfd6f,0xfcb5,0xff0f,0xfe36,
0xfb79,0xfd52,0xfa28,0xf95b,0xfdcb,0xfd8e,0xfd62,0xff6c,0xffb0,0x012f,
0xffe7,0x0017,0xfa5c,0xfb4b,0xfe1f,0xfe58,0xfe9d,0xfcff,0x0073,0x005c,
0xfe93,0xff9b,0xfbdb,0xfdaf,0xfc28,0xfebc,0x0008,0xff93,0xff07,0xff44,
0xfc08,0xfec6,0xfd88,0xfede,0xfe95,0x0079,0x01b8,0x031e,0x0486,0x05ff,
0xf5a5,0xfcf0,0x0110,0x03e4,0xfad4,0xfb80,0xfc97,0xfda2,0xfebd,0x0000};
short int qtable2[100] =
{0xf642,0xf14b,0xf9b0,0xfece,0xfe10,0xfa07,0xfd5e,0xfa64,0xfa73,0xfd7d,
0xf61a,0xf1c3,0xf3a0,0xfaa7,0xf92a,0xfd2d,0xfe88,0xfc2e,0xfe27,0xfb2d,
0xf6c9,0xf1df,0xfc50,0xfc52,0xfa1b,0xfaca,0xfbeb,0xf9c0,0xfc35,0xfbd0,
0xf70c,0xf363,0xf274,0xfc2a,0xfcba,0xfb27,0xfdff,0xfbe7,0xfd33,0xfb8c,
0xf6e4,0xf2c7,0xf23f,0xf9cc,0xfa86,0xfbe6,0xfc6b,0xfc89,0xfe40,0x0014,
0xf64d,0xff34,0xf2a4,0xf8ee,0xfcb4,0xfcf9,0xff16,0xfdff,0xfeaa,0xfdfc,
0xf480,0xf7ca,0xf374,0xfa34,0xfd7d,0xfd2c,0xfd9f,0xfc4b,0xfe99,0x0008,
0xf417,0xf832,0xf33c,0xfadb,0xfe05,0xfe97,0x0072,0x0219,0x01d3,0x0398,
0xeead,0xf8ce,0xff53,0xfd8e,0x01bb,0x00c1,0xffad,0x0218,0x0247,0x03e0,
0xecdd,0xf8dc,0x0020,0x0076,0xff8f,0x032e,0x0411,0x051a,0x0386,0x0000};
short int qtable3[100] =
{0x00bc,0xf6d8,0xfe10,0xfb1b,0xf908,0xfbc8,0xf9aa,0xfafb,0xfd13,0xfbdd,
0x00d8,0xf800,0xfe03,0xf9c5,0xfc0e,0xfc2a,0xfdb9,0xfc1b,0xfbcf,0xfa2b,
0x00f8,0xf942,0xfa36,0xfb47,0xfa0d,0xfe47,0xfbbb,0xfd2f,0xf97b,0xffc0,
0x011f,0xfa43,0xfad3,0xf9a2,0xfc88,0xfabf,0xfe44,0xfd3b,0xff96,0xffe6,
0x014b,0xfc69,0xfc4a,0xf7c7,0xfab4,0xfded,0xfad4,0xfb18,0xfbe7,0x047c,
0xfee9,0x01b5,0xf908,0xf909,0xf918,0xff0d,0xfba5,0xfc76,0xfd70,0x05a1,
0xfa57,0x01f7,0xf92c,0xfbb2,0xfd16,0xfa62,0xfdf5,0xfc59,0x0127,0x0719,
0xf825,0x0240,0xfc7c,0x0186,0xfe6b,0xffc3,0x0234,0xffe7,0x03a1,0x08a2,
0xf356,0x0295,0x0211,0xfdd5,0xf938,0xfac9,0xfb3e,0xfc95,0xfda2,0x09ff,
0xee0e,0xf807,0xf96e,0xf957,0xfb03,0xfb6a,0xfc7f,0xfd97,0xfebd,0x0000};
short int wallH [100]; //horizontal wall positions: 1 if there is a wall to the left of a cell; indices 10, 21, 32, ..., 109 are the map's right-boundary walls
short int wallV [110]; //vertical wall positions: 1 if there is a wall above a cell; indices 100, 101, ..., 109 are the map's bottom-boundary walls
int i = 0;
int input = 0;
//write the map to Zybo memory
//writeMap(disp);
//the program starts converting the map array into horizontal and vertical wall arrays
for(i=0; i<100; i++){
wallV[i] = 0;
wallH[i] = 0;
}
for(i=0; i<100; i++){
if ((disp[i] & 2) == 2){
wallH[i] = 1;
}
else {
wallH[i] = 0;
}
if((disp[i] & 8) == 8){
wallV[i] = 1;
}
else{
wallV[i] = 0;
}
}
for(i=100; i<110; i++) {
wallV[i] = 1;
}
// wall arrays filled
//printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
while (input != 99){
printf("\nWhat should I do? :\n");
printf("1 : print Map\n");
printf("2 : show Prediction\n");
printf("99 : END\n");
printf("Command : ");
scanf("%d",&input);
printf("\n");
if(input == 1){
mapDisp(wallH, wallV, 0);
}
else if (input == 2){
printf("Sending Map data to Zybo...\n");
printf("Calculating best route...\n");
//scanf("%d",&input);
printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
}
else if (input == 99){
printf("Thanks!!\n");
}
else {
printf("Invalid Command!!\n");
}
}
}
APPENDIX II Memory Block Contents
Qtable memory (before the training process)
01 01
01 01
00 01
00 01
00 01
00 01
00 01
00 01
00 01
00 01
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
00 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 01
00 01
00 01
00 01
00 01
00 00
01
01
01
01
01
01
01
01
01
01