Abstract
In this final project, an RTL design is developed and implemented as a hardware accelerator for the reinforcement learning process. The RTL design, written in Verilog, is simulated with the ModelSim software and implemented on the ZyBo board using the Vivado software. A reinforcement learning test is then carried out for navigation of a 10 × 10 maze. The designed system successfully navigates the maze after going through the learning process on hardware.
Keywords: Reinforcement Learning, Maze Navigation, ZyBO, Vivado, Verilog.
1. INTRODUCTION
In this final project for EL4138 VLSI System Design, an RTL design for reinforcement learning is implemented for navigation of a 10 × 10 maze. The RTL design is written in Verilog and runs on the ZyBo board. The assignment was motivated by the final project of the EL4138 VLSI System Design course and by participation in the 2021 LSI Design Contest, whose theme this time is "Reinforcement Learning". Reinforcement learning is a type of machine learning that adopts a reward and punishment system in the learning process. One application of reinforcement learning is navigating a maze with a machine or robot. Tests and simulations are carried out on the ZyBo board, with Vivado as the compiler, using the Verilog and C languages. The maze navigation simulation produces a visualization of a safe navigation path from the entrance to the exit of the maze.
The reinforcement learning application we implement is 10 × 10 maze navigation. Maze navigation is an algorithm that moves a robot through a maze automatically, without human involvement. Maze navigation is urgently needed in search and rescue operations on previously unmapped terrain, such as the search for people lost in a flooded natural cave, as happened to a group of youth soccer players in Thailand in 2018. Maze exploration is very dangerous when carried out by humans, because explorers can get lost, threatening their lives. We therefore decided to implement automatic maze navigation, in the hope that it provides an overview and learning material for people who want to build maze navigation robots.
2. LITERATURE STUDY
This section contains brief descriptions of the various literature sources related to the final project.
In reinforcement learning, an agent interacts with an environment: for every action the agent takes, the environment assigns a new state to the agent. The environment also produces rewards, which the agent seeks to maximize at all times. Agents can be trained using the following approach. Initially, the agent is given no clue about which action to take. The agent learns actions based on the trial-and-error principle, then makes decisions based on the (maximum) reward. Reinforcement learning has four sub-elements (the update rule that combines them is given after the list), namely:
1. Policy
The policy is the way an agent behaves in a given situation. In other words, this element is a mapping from a situation to the action the agent will take in it.
2. Reward Function
This element defines the goal the agent wants to achieve. In this process, the agent maximizes the reward of the actions it takes. The reward function is the agent's reference for what is good and what is bad.
3. Value Function
While the reward function defines the best result on the spot, under the value function the agent considers the best outcome in the long term. In other words, the value of a state is the total amount of reward the agent can accumulate in future periods, starting from that state. Rewards are obtained directly from the environment, while values must be estimated continuously from the agent's observations.
4. Environment Model
With this element, the agent predicts the next state and reward. This element is used for planning; in other words, the agent decides on an action by considering possible future situations.
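These four elements come together in the Q-learning update rule that the hardware in this report implements, where alpha is the learning rate and gamma the discount factor (the values used here are given in the conclusion):

Qnew(s, a) = Q(s, a) + alpha * ( r + gamma * max_a' Q(s', a') - Q(s, a) )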
2.3 VIVADO
The Vivado® Design Suite provides a next-generation, SoC-strength, IP- and system-centric development environment, built from the ground up to overcome productivity barriers in system-level integration and implementation. Among its components are the following: [3]
1. The Vivado simulator is a component of the Vivado Design Suite. It is a compiled language
simulator that supports mixed languages, Tcl scripts, encrypted IP and enhanced verification.
2. Vivado IP Integrator allows engineers to quickly integrate and configure IP from the large Xilinx IP library. The Integrator is also tuned for MathWorks Simulink designs built with the Xilinx System Generator and Vivado High-Level Synthesis.
3. The Vivado Tcl Store is a scripting system for developing add-ons to Vivado, and can be
used to add and modify Vivado's capabilities. Tcl is the scripting language Vivado is based on. All
the functions underlying Vivado can be called and controlled via Tcl Script.
2.4 ZYBO
ZYBO (ZYnq BOard) is a feature-rich, ready-to-use, entry-level embedded software and digital circuit development platform built around the smallest member of the Xilinx Zynq-7000 family, the Z-7010. The Z-7010 is based on the Xilinx All Programmable System-on-Chip (AP SoC) architecture, which tightly integrates a dual-core ARM Cortex-A9 processor with Xilinx 7-series Field Programmable Gate Array (FPGA) logic. With the rich set of connectivity peripherals available on the ZYBO, the Zynq Z-7010 can accommodate a complete system design. On-board memory, video and audio I/O, dual USB ports, Ethernet, and an SD slot make designs ready to go without additional hardware, and six Pmod ports put any design on an easy growth path. [4]
3. SYSTEM DESIGN
If decideFactor is positive, stateSelect is set to 1 and the agent chooses its action randomly. If decideFactor is negative, stateSelect is set to 0 and the agent chooses the next action with the largest value in the Qtable (the greedy algorithm). Thus, as the generation number grows, decideFactor becomes negative more easily, so the agent increasingly chooses the greedy action rather than a random action. A minimal C sketch of this selection follows.
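In the sketch below, the names decideFactor and stateSelect follow the signals above; rand() and the argmaxAction helper are hypothetical stand-ins for the hardware LFSR and MAX modules.

#include <stdlib.h>
/* Epsilon-greedy action selection as described above: explore while
   decideFactor is positive, otherwise act greedily on the Q table. */
static int argmaxAction(const short q[4][100], int state){
    int a, best = 0;
    for (a = 1; a < 4; a++)
        if (q[a][state] > q[best][state]) best = a;
    return best;
}
int selectAction(int decideFactor, const short q[4][100], int state){
    int stateSelect = (decideFactor >= 0) ? 1 : 0;
    if (stateSelect)
        return rand() & 3;           /* random action, like the LFSR */
    return argmaxAction(q, state);   /* greedy action (largest Q)    */
}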
3.2.2 Reward Generator
The Reward Generator is a block that determines the reward value (currentReward) obtained by the agent. Its inputs are clock, stateReset, currentState, and nextState; its output is currentReward. The block determines the reward based on nextState, currentState, and prevState, which is currentState delayed by one clock. The reward is determined from the states as follows (a C sketch follows the list):
1. The agent gets a reward of 10 (rw3) if nextState is 99 (the goal), regardless of the currentState and prevState values.
2. The agent gets a punishment of -10 (rw2) if it hits a wall, which is marked by an unchanged state; a wall hit is identified when currentState and nextState have the same value.
3. The agent gets a punishment of -5 (rw1) if it reverses direction, i.e. returns to the previous state, which is marked by the prevState value equaling the nextState value. This punishment is given to avoid looping conditions in which the agent moves back and forth between the same two states.
4. The agent gets a reward of 0 (rw0) if it changes state without hitting a wall or returning to the previous state.
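In software terms, the four rules amount to the following C sketch. Plain integers are used here; the hardware encodes the same rewards as 8.8 fixed-point constants (for example 0x0A00 for 10).

/* Reward selection per the four rules above; the goal is checked first
   because it applies regardless of prevState and currentState. */
int rewardFor(int prevState, int currentState, int nextState){
    if (nextState == 99)           return 10;   /* rw3: reached the goal   */
    if (nextState == currentState) return -10;  /* rw2: hit a wall         */
    if (nextState == prevState)    return -5;   /* rw1: reversed direction */
    return 0;                                   /* rw0: normal move        */
}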
RAM_WallH and RAM_WallV are read-only memories written by software, which will be explained in the next section. RAM_WallH and RAM_WallV store, for each state, whether there is a wall on the left of that state and above that state, respectively. The addresses of RAM_WallH and RAM_WallV correspond to the states they represent. The access addresses of RAM_WallH and RAM_WallV for each type of action are as follows (a C sketch follows the list):
1. Action 0 (up): access the RAM_WallV address for currentState, which tells whether a wall is present above the current state (currentState).
2. Action 1 (down): access the RAM_WallV address for (currentState + 10), which tells whether a wall is present above the state below the current state, i.e. the wall at the bottom of the current state.
3. Action 2 (left): access the RAM_WallH address for currentState, which tells whether a wall is present on the left of the current state.
4. Action 3 (right): access the RAM_WallH address for (currentState + 1), which tells whether a wall is present on the left of the state to the right of the current state, i.e. the wall at the right of the current state.
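The address computation amounts to the following C sketch, in which the wallV and wallH arrays are hypothetical stand-ins for RAM_WallV and RAM_WallH.

/* Wall lookup per action, following the address list above. */
int hitWallFor(int action, int currentState,
               const short wallV[], const short wallH[]){
    switch (action){
        case 0:  return wallV[currentState];      /* wall above current   */
        case 1:  return wallV[currentState + 10]; /* wall below current   */
        case 2:  return wallH[currentState];      /* wall left of current */
        default: return wallH[currentState + 1];  /* wall right of current*/
    }
}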
The wall flag (hitWall) is then processed by the State Gen block, which determines the agent's next movement based on currentState, nextAction, and hitWall as follows (a C sketch follows the list):
1. the next state (nextState) equals the current state (currentState) if a wall is detected or the agent is at the destination position (state 99)
2. otherwise, the next state moves from the current state by a displacement that depends on the nextAction signal, as follows:
a. Action 0 (up): the agent moves to state (currentState - 10)
b. Action 1 (down): the agent moves to state (currentState + 10)
c. Action 2 (left): the agent moves to state (currentState - 1)
d. Action 3 (right): the agent moves to state (currentState + 1)
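A C sketch of this next-state rule, using the same state numbering as the hardware:

/* Stay put on a wall hit or at the goal (state 99); otherwise move by
   the displacement that belongs to the chosen action. */
int nextStateFor(int currentState, int action, int hitWall){
    static const int step[4] = {-10, 10, -1, 1}; /* up, down, left, right */
    if (hitWall || currentState == 99)
        return currentState;
    return currentState + step[action];
}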
The waveforms show that the control unit repeats the training process when the reset signal is given. After state 99 is reached, the stateReset signal stays at 1, so the agent no longer changes state and the training process halts.
As seen in the waveform, after 50 actions (50 clocks) the stateReset signal goes to 1, which returns the agent to position 0. This marks a change of generation.
b. Middle Generations
In the middle generations (marked by genCount values between 500 and 600), the stateSelectOut signal alternates between 0 and 1, indicating that the agent sometimes takes a random action and sometimes a greedy action. This matches the epsilon-greedy algorithm, in which the actions taken in the middle generations are a combination of random and greedy actions.
c. Final Generations
In the final generations (marked by genCount values above 950), the stateSelectOut signal is always 0, indicating that the agent takes the greedy action; this is confirmed by the nextAction signal equaling the actGreed signal. This matches the epsilon-greedy algorithm, in which the final generations are meant to take only greedy actions.
The figure shows that when nextState reaches 99 (the goal), the reward given is 0x0A00, or 10 in decimal.
When the policy generator reads the wall memories and the agent's chosen movement would take it into a wall, indicated by the hitWall signal being 1, the agent's position does not change (currentState = nextState).
Qnew from the simulation result is 0xF500, or -11 in decimal: in the 8.8 fixed-point format, 0xF500 is the signed value -2816, and -2816/256 = -11. Thus, Qnew from the hand calculation and from the simulation are the same.
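For reference, the following is a C sketch of the shift-based update the QUpdater datapath implements. It assumes 8.8 fixed-point Q values, alpha = 0.5 realized as one right shift, gamma = 0.875 realized as 1/2 + 1/4 + 1/8 (the three right shifters named in the conclusion), and arithmetic right shifts on negative values.

/* Qnew = Q + alpha*(r + gamma*Qmax - Q) in 8.8 fixed point,
   with the multiplications approximated by right shifts. */
short qUpdate(short q, short qMax, short r){
    short gammaQmax = (short)((qMax >> 1) + (qMax >> 2) + (qMax >> 3));
    short delta     = (short)(r + gammaQmax - q);
    return (short)(q + (delta >> 1));            /* alpha = 0.5 */
}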
4.2 SYSTEM PERFORMANCE
The designed system is then implemented on the ZyBo through the Vivado software. Figure 4.14 shows the block design created to implement it on the ZyBo. The design successfully passed the implementation process in the software, so its size and power usage can be analyzed; a demo on the physical ZyBo board, however, could not yet be shown.
4.2.1 Speed
The RTL design executes the entire Q-learning computation needed for one action in a single clock period, so one generation of 50 actions requires only 50 clock cycles. The time required to train the maze navigation agent can therefore be calculated as 1023 generations × 50 actions per generation × 1 clock per action = 51,150 clock periods.
4.2.2 Size
Through the implementation process in the Vivado software, information is obtained on the resources and power needed to run the design. The available resources on the ZyBo are still sufficient for the design's needs, so it can be implemented without changing the architecture.
Figure 4.15 FPGA Components Required
4.3.3 Reading the Qtable1, Qtable2, Qtable3, and Qtable4 memories after the training process
After the training process completes, the software reads the Qtable that QLearningAgent updated during training. The Qtable is then processed so that the software can determine the travel path the agent has obtained. The contents of the Qtable memory before and after the training process can be seen in the appendix.
Figure 4.22 Display during the learning process
4.3.4 Displaying the agent navigation results based on the Qtable memory contents received from the hardware
Below are the results of our maze navigation, which show the agent able to exit the maze in 20 steps.
[Figures: the agent's position at Steps 3 through 20 of the navigated path]
5. CONCLUSION
From the design, simulation, and implementation processes that have been carried out, the following conclusions are obtained.
• The designed system can carry out a reinforcement learning process to navigate a maze with a size of 10 × 10.
• The entire learning process was successfully implemented in hardware through the RTL design, so the software serves only as a link for user input and contains no policy generator algorithm.
• The learning process produces a Qtable with which the agent's actions reach the destination position, using the following learning parameters:
o Alpha of 0.5
o Gamma of 0.875
o 1023 generations with a maximum of 50 actions per generation
o QUpdater calculation with 3 right shifters
• The learning process takes a total of 51,150 clock periods.
• The implementation on the ZyBo remains below the resource limits, using 11% of the available LUTs; the LUTRAM and RAM utilization figures are given in Figure 4.15. The power consumption is 1.6665 W, with a junction temperature of 44.2 °C, a thermal margin of 40.8 °C (3.5 W), and an effective θJA of 11.5 °C/W.
List of References
[1] https://medium.com/group-3-machine-learning/reinforcement-learning-dcf88b8a49c, accessed on 28 December 2020
[2] https://en.wikipedia.org/wiki/Verilog , accessed on 27 December 2020
[3] https://www.xilinx.com/support/university/vivado.html , accessed on 27 December 2020
[4] https://reference.digilentinc.com/reference/programmable-logic/zybo/start , accessed on 27 December
2020
[5] Spanò, S., Cardarilli, G. C., Di Nunzio, L., Fazzolari, R., Giardino, D., Matta, M., ... & Re, M. (2019). An efficient hardware implementation of reinforcement learning: The Q-learning algorithm. IEEE Access, 7, 186340-186351.
APPENDIX I Verilog Code
Code for file modules16bit.v
//Andi Muhammad Riyadhus Ilmy (13217053)
//Tyfan Juliano (13217063)
//Yafie Abdillah (13217091)
//Evan Robert (13217057)
//Building Blocks for Q Learning Accelerator with 16 bit data
//=========================================================================
//=========================================================================
//Q LEARNING ACCELERATOR MODULES===========================================
//=========================================================================
//=========================================================================
//VALIDATION MODULE========================================================
//Prevent Q Table from being Updated if Finished (State = 99)
module isFinished(
input signed [15:0] qValue,
input [7:0] currentState,
output signed [15:0] out
);
assign out = (currentState != 8'd99) ? qValue:
16'd0;
endmodule
//DECODER MODULE===========================================================
//Send enable signal to control Action RAM Write Mode.
//Use 2 bit input for 4 outputs.
module decoder(
input stateRst,
input rst,
input [1:0] act,
output reg en0,
output reg en1,
output reg en2,
output reg en3
);
//One-hot write enable for the Action RAM selected by act
//(decoder body reconstructed from its port list; gating the enables
//during reset is an assumption)
always @(stateRst or rst or act) begin
en0 = (!stateRst && !rst) && (act == 2'd0);
en1 = (!stateRst && !rst) && (act == 2'd1);
en2 = (!stateRst && !rst) && (act == 2'd2);
en3 = (!stateRst && !rst) && (act == 2'd3);
end
endmodule
//ACTION RAM MODULES=======================================================
//Store the Q value of each action; one RAM per action.
module ram1_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row1.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram2_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row2.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram3_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row3.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
module ram4_16bit (
input [7:0] WR_ADDR, RD_ADDR,
input signed[15:0] D_IN,
input WR_EN,
output signed[15:0] D_OUT
);
reg [15:0] tempMem[0:255];
//Write if EN = 1
always @(WR_ADDR or RD_ADDR or WR_EN) begin
#1 if (WR_EN ) begin
tempMem[WR_ADDR] <= D_IN;
$writememh("mem_out_row4.list", tempMem);
end
end
assign D_OUT = tempMem[RD_ADDR];
endmodule
//MULTIPLEXER MODULE=======================================================
//Select Q Value, based on action taken by agents.
//Use 2 bit selector that represent action taken by the agent.
module mux4to1_16bit(
input [15:0] in0, in1, in2, in3,
input [1:0] sel,
output [15:0] out
);
assign out =
(sel == 2'd0) ? in0 :
(sel == 2'd1) ? in1 :
(sel == 2'd2) ? in2 :
(sel == 2'd3) ? in3 :
16'd00;
endmodule
//MAX MODULE===============================================================
//Compare 4 value and choose the highest value
module max4to1_16bit(
input [15:0] D1, D2, D3, D4,
output [15:0] Y
);
wire [15:0] max0_out, max1_out;
//portmap
compMax_16bit max0(.A(D1), .B(D2), .C(max0_out));
compMax_16bit max1(.A(D3), .B(D4), .C(max1_out));
compMax_16bit max2(.A(max0_out), .B(max1_out), .C(Y));
endmodule
//COMPARATOR===============================================================
//Act as basic module for building MAX MODULE
//Compare 2 value and choose the highest value
module compMax_16bit (
input signed[15:0] A, B,
output signed[15:0] C
);
assign C = (A > B) ? A : B;
endmodule
//Q UPDATER MODULE=========================================================
//Implement equation to update Q Value based on Alfa, Gamma, and Reward
module qUpdater_16bit(
input signed [15:0] Q, Qmax, rt,
input signed [7:0] alfa_i, alfa_j, alfa_k,
input signed [7:0] gamma_i, gamma_j, gamma_k,
output signed[15:0] Qnew
);
wire signed [15:0] yi_a0, yj_a0, yk_a1;
wire signed [15:0] a0_a1, a1_a2, a2_s0;
wire signed [15:0] s0_alfa, ai_a3, aj_a3, ak_a4;
wire signed [15:0] a3_a4, Qn;
//PortMap
rShift_16bit yi(.Q(Qmax), .S(gamma_i), .Y(yi_a0));
rShift_16bit yj(.Q(Qmax), .S(gamma_j), .Y(yj_a0));
rShift_16bit yk(.Q(Qmax), .S(gamma_k), .Y(yk_a1));
add_16bit a0(.in0(yi_a0), .in1(yj_a0), .out(a0_a1));
add_16bit a1(.in0(a0_a1), .in1(yk_a1), .out(a1_a2));
//Remaining datapath: Qnew = Q + alfa*(rt + gamma*Qmax - Q); the port
//maps below are reconstructed from the declared wire names
add_16bit a2(.in0(a1_a2), .in1(rt), .out(a2_s0));
sub_16bit s0(.in0(a2_s0), .in1(Q), .out(s0_alfa));
rShift_16bit ai(.Q(s0_alfa), .S(alfa_i), .Y(ai_a3));
rShift_16bit aj(.Q(s0_alfa), .S(alfa_j), .Y(aj_a3));
rShift_16bit ak(.Q(s0_alfa), .S(alfa_k), .Y(ak_a4));
add_16bit a3(.in0(ai_a3), .in1(aj_a3), .out(a3_a4));
add_16bit a4(.in0(a3_a4), .in1(ak_a4), .out(Qn));
add_16bit a5(.in0(Q), .in1(Qn), .out(Qnew));
endmodule
//RIGHT SHIFTER============================================================
//Act as basic module for building Q UPDATER MODULE
//Implement right shift as approximated multiplier
module rShift_16bit (
input signed [15:0] Q,
input signed [7:0] S,
output signed [15:0] Y
);
assign Y = (S == 8'sd0) ? 16'sd0 :
((Q >>> S));
endmodule
//ADDER====================================================================
//Act as basic module for building Q UPDATER MODULE
//Implement addition
module add_16bit(
input signed[15:0] in0, in1,
output signed[15:0] out
);
assign out = $signed(in0) + $signed(in1);
endmodule
//SUBSTRACTOR==============================================================
//Act as basic module for building Q UPDATER MODULE
//Implement substraction
module sub_16bit(
input signed[15:0] in0, in1,
output signed[15:0] out
);
assign out = $signed(in0) - $signed(in1);
endmodule
//=========================================================================
//=========================================================================
//POLICY GENERATOR MODULES=================================================
//=========================================================================
//=========================================================================
//GREEDY ACTION MODULE=====================================================
//Choose the action with the largest Q value (module header reconstructed
//from the instantiation in policyGenerator_16bit below)
module greedAction(
input [15:0] qAct0, qAct1, qAct2, qAct3,
output [1:0] nextAction
);
wire [15:0] maxValue;
max4to1_16bit max(
.D1(qAct0),
.D2(qAct1),
.D3(qAct2),
.D4(qAct3),
.Y(maxValue)
);
assign nextAction =
(maxValue == qAct0) ? 2'd0:
(maxValue == qAct1) ? 2'd1:
(maxValue == qAct2) ? 2'd2:
(maxValue == qAct3) ? 2'd3:
2'd0;
endmodule
//fibonacci lsfr
always @(posedge clk)
begin
shiftReg = shiftReg << 1; //Left shift 1 bit
//=========================================================================
//=========================================================================
//WALL DETECT SUBMODULES===================================================
//=========================================================================
//=========================================================================
//WALL MEMORY MODULES======================================================
//Read-only memories holding the horizontal and vertical wall maps,
//written by the software
module wallH_16bit (
input [7:0] RD_ADDR,
output signed [15:0] D_OUT
);
reg [15:0] tempMem[0:255];
assign D_OUT = tempMem[RD_ADDR]; //read port (reconstructed)
endmodule
//The instances below are an excerpt of the wallDetector module, which
//reads both wall memories at the addresses derived from the chosen action
wallV_16bit wallV_16bit(
.RD_ADDR(rdAddrV),
.D_OUT(wallV_det)
);
wallH_16bit wallH_16bit(
.RD_ADDR(rdAddrH),
.D_OUT(wallH_det)
);
endmodule
//STATE GENERATOR MODULE===================================================
//determine agent's next action after taking certain action from certain state
module stategen(
input [7:0] currentState,
input [15:0] hitWall,
input [1:0] nxtAction,
output [7:0] nxtState
);
reg [7:0] out;
//Displacement per action (logic reconstructed from the state rules above)
always @(*) begin
if (hitWall != 16'd0)
out = currentState; //stay put on a wall hit
else case (nxtAction)
2'd0: out = currentState - 8'd10; //up
2'd1: out = currentState + 8'd10; //down
2'd2: out = currentState - 8'd1; //left
default: out = currentState + 8'd1; //right
endcase
end
assign nxtState = (currentState == 8'd99) ? currentState : out;
endmodule
module wallDetect (
input [7:0] currentState,
input [1:0] nxtAction,
output [7:0] nxtState
);
wire [15:0] hitWall; //internal wire between wallDetector and stategen
wallDetector wallDetector (
.currentState(currentState),
.nxtAct(nxtAction),
.hitWallfin(hitWall)
);
stategen stategen(
.currentState(currentState) ,
.hitWall(hitWall),
.nxtAction(nxtAction),
.nxtState(nxtState)
);
endmodule
//=========================================================================
//=========================================================================
//END OF WALL DETECT SUBMODULES============================================
//=========================================================================
//=========================================================================
//=========================================================================
//=========================================================================
//REWARD GENERATOR MODULE==================================================
//=========================================================================
//=========================================================================
//REWARD MULTIPLEXER MODULE================================================
//select reward according to reward selector
module rewardMux_16bit(
input signed [15:0] rw0, rw1, rw2, rw3,
input [1:0] rwSel,
input staterst,
output signed [15:0] out
);
assign out =
(rwSel == 2'd1 && staterst == 1'd0) ? rw1 : //-5
(rwSel == 2'd2 && staterst == 1'd0) ? rw2 : //-10
(rwSel == 2'd3 && staterst == 1'd0) ? rw3 : //+10
rw0; //0
endmodule
isFinished isFinished(
.qValue(newQVal),
.currentState(st),
.out(qOut)
);
decoder decoder(
.stateRst(stateRst),
.rst(rst),
.act(act),
.en0(wrEn1),
.en1(wrEn2),
.en2(wrEn3),
.en3(wrEn4)
);
ram1_16bit action1(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn1),
.D_OUT(datOut1)
);
ram2_16bit action2(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn2),
.D_OUT(datOut2)
);
ram3_16bit action3(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn3),
.D_OUT(datOut3)
);
ram4_16bit action4(
.WR_ADDR(st),
.D_IN(qOut),
.RD_ADDR(nxtst),
.WR_EN(wrEn4),
.D_OUT(datOut4)
);
mux4to1_16bit mux(
.in0(del0),
.in1(del1),
.in2(del2),
.in3(del3),
.sel(act),
.out(muxOut)
);
max4to1_16bit max(
.D1(datOut1),
.D2(datOut2),
.D3(datOut3),
.D4(datOut4),
.Y(maxOut)
);
qUpdater_16bit main(
.Q(muxOut),
.Qmax(maxOut),
.rt(rt),
.alfa_i(alfai),
.alfa_j(alfaj),
.alfa_k(alfak),
.gamma_i(gammai),
.gamma_j(gammaj),
.gamma_k(gammak),
.Qnew(newQVal)
);
//Delay operation
always @ (posedge clk)
begin
del0 = datOut1;
del1 = datOut2;
del2 = datOut3;
del3 = datOut4;
end
//PORTMAP
rewardSelect sel0(
.prevState(prevState),
.nxtState(nextStateIn),
.currentState(currentStateIn),
.rwSel(rwSel)
);
rewardMux_16bit mux0(
.rw0(rw0),
.rw1(rw1),
.rw2(rw2),
.rw3(rw3),
.rwSel(rwSel),
.staterst(stateRstIn),
.out(nextReward)
);
//Delay operations
always @(posedge clk) begin
prevState = currentStateIn;
tempOut = nextReward;
end
//Assign Outputs
assign currentRewardOut = nextReward;
endmodule
module policyGenerator_16bit(
input clk,
input stateRstIn, //From CU
input stateSelectIn, //From CU
input [7:0] currentStateIn,
input [15:0] qAct0, qAct1, qAct2, qAct3, //From QUpdater
output [1:0] nextActionOut,
output [9:0] randValueOut,
output [7:0] nxtStateOut
);
//PORTMAP
lsfr_16bit lsfr0(
.clk(clk),
.nextAction(actLsfr),
.randomValue(randValueOut)
);
greedAction greed0(
.qAct0(qAct0), .qAct1(qAct1), .qAct2(qAct2), .qAct3(qAct3),
.nextAction(actGreed)
);
decideNextAct mux0(
.greedAct(actGreed), .lsfrAct(actLsfr),
.sel(stateSelectIn),
.nxtAct(muxOut)
);
wallDetect wall0(
.currentState(currentStateIn),
.nxtAction(muxOut),
.nxtState(wallOut)
);
resetState rst0(
.inState(wallOut),
.stateRst(stateRstIn),
.outState(nxtStateOut)
);
endmodule
module qLearningAgent_16bit(
input CLOCK, RESET,
output [1:0] nextAction
);
//Rewards
parameter signed [15:0] rw0 = 16'sh0000; //ZERO
parameter signed [15:0] rw1 = 16'shFB00; //-5 = 1111 1011.0000 0000
parameter signed [15:0] rw2 = 16'shF600; //-10 = 1111 0110.0000 0000
parameter signed [15:0] rw3 = 16'sh0A00; // 10 = 0000 1010.0000 0000
//Wires and registers
reg [1:0] currentAction; //registered copy of nextAction
reg [7:0] currentState; //registered agent state
wire stateReset, stateSelect;
wire [7:0] nextState;
wire [9:0] randomValue;
wire signed [15:0] currentReward;
wire signed [15:0] row0, row1, row2, row3; // For passing to outputs
//PORTMAPPING START========================================================
maze_display_tb_2 disp(
.state(currentState)
);
qLearningAccel_16bit qla(
.clk(CLOCK),
.stateRst(stateReset),
.rst(RESET),
.st(currentState), //Current State
.nxtst(nextState), //Next State
.act(nextAction), //Current Action
.rt(currentReward),
.alfa(alfa),
.gamma(gamma),
.qRow0(row0), .qRow1(row1), .qRow2(row2), .qRow3(row3)
);
policyGenerator_16bit pg(
.clk(CLOCK),
.stateRstIn(stateReset), //From CU
.stateSelectIn(stateSelect), //From CU
.currentStateIn(currentState),
.qAct0(row0), .qAct1(row1), .qAct2(row2), .qAct3(row3), //From QUpdater
.nextActionOut(nextAction),
.randValueOut(randomValue),
.nxtStateOut(nextState)
);
controlUnit cu(
.clk (CLOCK),
.rst(RESET),
.randomValueIn(randomValue),
.stateRstOut(stateReset),
.stateSelectOut(stateSelect)
);
rewardModule_16bit reward(
.clk(CLOCK),
.stateRstIn(stateReset),
.currentStateIn(currentState),
.nextStateIn(nextState),
.rw0(rw0), .rw1(rw1), .rw2(rw2), .rw3(rw3),
.currentRewardOut(currentReward)
);
//=========================================================================
//Delay Operations
always @(posedge CLOCK)
begin
currentAction <= nextAction;
currentState <= nextState;
end
endmodule
//TESTBENCH================================================================
//Module header reconstructed to match the DUT instantiation below
module tb_qLearningAgent_16bit();
reg clock, reset;
wire [1:0] nextAct;
qLearningAgent_16bit DUT(
.CLOCK(clock),
.RESET(reset),
.nextAction(nextAct)
);
initial begin
clock = 1'b1;
reset = 1'b1;
end
//clock generator
always begin
#50 clock = ~clock; //toggle the clock every 50 ps
end
//reset Cycle
always begin
#150
reset = ~reset; //Reset HIGH for 150 ps
#7850
reset = ~reset; //Reset LOW for 7850 ps
end
endmodule
//mapDisp==================================================================
//Displays the map based on the vertical and horizontal wall locations
//and the agent's current location
void mapDisp(short int wallH[], short int wallV[], int agent){
int indH, indV, wallvdisp;
int indWallH, indWallV, indPos;
indH = 0;
indV = 0;
int indfin = 0;
indWallH = 0;
indWallV = 0;
indPos = 0;
wallvdisp = 0;
for(indV = 0; indV < 20; indV++){
for(indH = 0; indH <20; indH++){
if((indV%2) == 0){
if((indH%2) == 0){ //even rows, even columns: junction characters
if(indV == 0 && indH == 0){
printf("-");
}
else if((wallvdisp == 1) && (indH !=0)){
printf("-");
wallvdisp = 0;
}
else{
printf(" ");
wallvdisp = 0;
}
}
else {
if(wallV[indWallV] == 1){
if(indH == 19){
if(indV == 0){
printf("--");
}
else{
printf("|");
}
}
else{
printf("-");
}
wallvdisp = 1;
}
else{
if(indH == 19){
printf(" |");
}
else{
printf(" ");
}
}
indWallV++;
}
}
else{
if((indH%2) == 0){
if(wallH[indWallH] == 1){
printf("|");
}
else{
printf(" ");
}
indWallH++;
if(((indWallH+1)%20)==0){
}
}
else {
if(indPos == agent){
printf("1");
}
else{
printf("0");
}
indPos++;
}
}
}
if((indV%2) != 0){
printf("|");
}
printf("\n");
}
for (indfin = 0; indfin < 21; indfin++){
if((indfin%2) == 0){
printf("-");
}
else {
printf("-");
}
}
printf("\n");
//return 0;
}
//writeMap=================================================================
//Writes the arrays holding the vertical and horizontal wall locations to
//the Zybo memory
void writeMap (short int wallH[], short int wallV[]){
//body omitted in the source listing
}
//printPath================================================================
//Displays a visualization of the agent's journey based on the learned
//Q table and the vertical and horizontal wall locations
void printPath(short int qtable0[], short int qtable1[], short int qtable2[],
short int qtable3[], short int wallH[],short int wallV[]){
printf("route : \n");
int s = 0;
short int max, maxind;
int i;
for (i = 0; i<100 ;i++){
if (s == 99){
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n GOAL in %d step!!! \n",i);
break;
}
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n");
max = qtable0[s];
maxind = 0;
if(max < qtable1[s]){
max = qtable1[s];
maxind = 1;
}
if(max < qtable2[s]){
max = qtable2[s];
maxind = 2;
}
if(max < qtable3[s]){
max = qtable3[s];
maxind = 3;
}
if(maxind == 0){
if(wallH[s+1] == 0){
s = s+1;
printf("Agent go East\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else if(maxind == 1){
if(wallV[s] == 0){
s = s-10;
printf("Agent go North\n");
}
else{
printf("HIT WALL\n!!");
break;
}
}
else if(maxind == 2){
if(wallH[s] == 0){
s = s-1;
printf("Agent go West\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else{
if(wallV[s+10] == 0){
s = s+10;
printf("Agent go South\n");
}
else{
printf("HIT WALL!!");
break;
}
}
}
}
void main () {
//memout_p1 = (uint32_t *)MEM_OUT_BASE1; //assign qtable0 memory location
//memout_p2 = (uint32_t *)MEM_OUT_BASE2; //assign qtable1 memory location
//memout_p3 = (uint32_t *)MEM_OUT_BASE3; //assign qtable2 memory location
//memout_p4 = (uint32_t *)MEM_OUT_BASE4; //assign qtable3 memory location
printf("start\n");
short int disp[100] = {11, 10, 8, 8, 8, 8, 8, 8, 8, 9,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
2, 1, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 2, 0, 0, 0, 4, 4, 4, 4, 1,
7, 6, 4, 4, 12, 12, 12, 12, 12, 5};
short int qtable0[100];
short int qtable1[100];
short int qtable2[100];
short int qtable3[100];
short int wallH [100]; //horizontal wall positions: 1 if there is a wall to the left of a cell; indices 10, 21, 32, ..., 109 are the map's right-boundary walls
short int wallV [110]; //vertical wall positions: 1 if there is a wall above a cell; indices 100, 101, ..., 109 are the map's bottom-boundary walls
int i = 0;
int input = 0;
//write the map to Zybo memory
//writeMap(disp);
//the program starts converting the map array into horizontal and vertical wall arrays
for(i=0; i<100; i++){
wallV[i] = 0;
wallH[i] = 0;
}
for(i=0; i<100; i++){
if ((disp[i] & 2) == 2){
wallH[i] = 1;
}
else {
wallH[i] = 0;
}
if((disp[i] & 8) == 8){
wallV[i] = 1;
}
else{
wallV[i] = 0;
}
}
for(i=100; i<110; i++) {
wallV[i] = 1;
}
// wall arrays filled
//read the qtable from the hardware
for (int i = 0; i < 100; i++){ //read the 100 Q-table entries
qtable0[i] = (unsigned int)*(memout_p1+i);
qtable1[i] = (unsigned int)*(memout_p2+i);
qtable2[i] = (unsigned int)*(memout_p3+i);
qtable3[i] = (unsigned int)*(memout_p4+i);
}
while (input != 99){
printf("\nWhat should I do? :\n");
printf("1 : print Map\n");
printf("2 : show Prediction\n");
printf("99 : END\n");
printf("Command : ");
scanf("%d",&input);
printf("\n");
if(input == 1){
mapDisp(wallH, wallV, 0);
}
else if (input == 2){
printf("Sending Map data to Zybo...\n");
printf("Calculating best route...\n");
printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
}
else if (input == 99){
printf("Thanks!!\n");
}
else {
printf("Invalid Command!!\n");
}
}
}
Maze_Nav_Disp_demo.c (software for the demo)
#include <stdio.h>
#include <stdlib.h>
#include "string.h"
#include "math.h"
#define MEM_INP_BASE_H 0x0 // memory for horizontal walls
#define MEM_INP_BASE_V 0x0 // memory for vertical walls
#define MEM_OUT_BASE1 0x0 // qtable 0
#define MEM_OUT_BASE2 0x0 // qtable 1
#define MEM_OUT_BASE3 0x0 // qtable 2
#define MEM_OUT_BASE4 0x0 // qtable 3
//mapDisp==================================================================
//Displays the map based on the vertical and horizontal wall locations
//and the agent's current location
void mapDisp(short int wallH[], short int wallV[], int agent){
int indH, indV, wallvdisp;
int indWallH, indWallV, indPos;
indH = 0;
indV = 0;
int indfin = 0;
indWallH = 0;
indWallV = 0;
indPos = 0;
wallvdisp = 0;
for(indV = 0; indV < 20; indV++){
for(indH = 0; indH <20; indH++){
if((indV%2) == 0){
if((indH%2) == 0){ //even rows, even columns: junction characters
if(indV == 0 && indH == 0){
printf("-");
}
else if((wallvdisp == 1) && (indH !=0)){
printf("-");
wallvdisp = 0;
}
else{
printf(" ");
wallvdisp = 0;
}
}
else {
if(wallV[indWallV] == 1){
if(indH == 19){
if(indV == 0){
printf("--");
}
else{
printf("|");
}
}
else{
printf("-");
}
wallvdisp = 1;
}
else{
if(indH == 19){
printf(" |");
}
else{
printf(" ");
}
}
indWallV++;
}
}
else{
if((indH%2) == 0){
if(wallH[indWallH] == 1){
printf("|");
}
else{
printf(" ");
}
indWallH++;
if(((indWallH+1)%20)==0){
}
}
else {
if(indPos == agent){
printf("1");
}
else{
printf("0");
}
indPos++;
}
}
}
if((indV%2) != 0){
printf("|");
}
printf("\n");
}
for (indfin = 0; indfin < 21; indfin++){
if((indfin%2) == 0){
printf("-");
}
else {
printf("-");
}
}
printf("\n");
//return 0;
}
//writeMap=================================================================
//Writes the arrays holding the vertical and horizontal wall locations to
//the Zybo memory
void writeMap (short int wallH[], short int wallV[]){
//body omitted in the source listing
}
//printPath================================================================
//Displays a visualization of the agent's journey based on the learned
//Q table and the vertical and horizontal wall locations
void printPath(short int qtable0[], short int qtable1[], short int qtable2[],
short int qtable3[], short int wallH[],short int wallV[]){
printf("route : \n");
int s = 0;
short int max, maxind;
int i;
for (i = 0; i<100 ;i++){
if (s == 99){
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n GOAL in %d step!!! \n",i);
break;
}
printf("STEP %d !!\n",i);
mapDisp(wallH, wallV, s);
printf("\n");
max = qtable0[s];
maxind = 0;
if(max < qtable1[s]){
max = qtable1[s];
maxind = 1;
}
if(max < qtable2[s]){
max = qtable2[s];
maxind = 2;
}
if(max < qtable3[s]){
max = qtable3[s];
maxind = 3;
}
if(maxind == 0){
if(wallH[s+1] == 0){
s = s+1;
printf("Agent go East\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else if(maxind == 1){
if(wallV[s] == 0){
s = s-10;
printf("Agent go North\n");
}
else{
printf("HIT WALL\n!!");
break;
}
}
else if(maxind == 2){
if(wallH[s] == 0){
s = s-1;
printf("Agent go West\n");
}
else{
printf("HIT WALL!!\n");
break;
}
}
else{
if(wallV[s+10] == 0){
s = s+10;
printf("Agent go South\n");
}
else{
printf("HIT WALL!!");
break;
}
}
}
}
void main () {
//memout_p1 = (uint32_t *)MEM_OUT_BASE1; //assign qtable0 memory location
//memout_p2 = (uint32_t *)MEM_OUT_BASE2; //assign qtable1 memory location
//memout_p3 = (uint32_t *)MEM_OUT_BASE3; //assign qtable2 memory location
//memout_p4 = (uint32_t *)MEM_OUT_BASE4; //assign qtable3 memory location
printf("start\n");
short int disp[100] = {11, 10, 8, 8, 8, 8, 8, 8, 8, 9,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
2, 1, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 3, 2, 0, 0, 0, 0, 0, 0, 1,
3, 2, 0, 0, 0, 4, 4, 4, 4, 1,
7, 6, 4, 4, 12, 12, 12, 12, 12, 5};
short int qtable0[100] = {0xf622, 0xfe63, 0xfa84, 0xf925, 0xfbbe, 0xfc25,
0xfa06, 0xfa60, 0xfbab, 0xf8a0,
0xf52a, 0xf246, 0xfad3, 0xfd96, 0xfcf5, 0xfe6b,
0xfcaa, 0xfabb, 0xfb7e, 0xf7c9,
0xf638, 0xfc18, 0xfe36, 0xfe41, 0xfe21, 0xfd7b,
0xfa1b, 0xfb97, 0xfe68, 0xf804,
0xf6a1, 0xf0eb, 0xf85d, 0xfa86, 0xfce3, 0xfe9b,
0xfbe4, 0xfe0e, 0xfc50, 0xf792,
0xf6e6, 0xf288, 0xfb3f, 0xfae0, 0xfbdd, 0xfbee,
0xfc43, 0xfcb6, 0x0097, 0xf578,
0x017c, 0xf742, 0xf9d0, 0xfcee, 0xfcd3, 0xfca7,
0xfb86, 0xfcd5, 0x0102, 0xf954,
0xf314, 0xf7e9, 0xfcc6, 0xfe52, 0xfdf7, 0xfebc,
0xfde0, 0xff47, 0xff3a, 0xf99c,
0xf1d2, 0xf828, 0xfb08, 0xfcb9, 0xfb93, 0xff07,
0xfd5e, 0x0460, 0x034e, 0xfb15,
0xef30, 0xffca, 0x0018, 0x0474, 0x0518, 0x05d5,
0x06ac, 0x07a3, 0x08bc, 0xfebd,
0xedba, 0x02f7, 0x0366, 0x01E9, 0x0585, 0x06ac,
0x07a3, 0x08bc, 0x09ff, 0x0000};
short int qtable1[100] =
{0xf693,0xf245,0xf48f,0xf2e7,0xf3ec,0xf4ec,0xf3e9,0xf536,0xf422,0xf921,
0xf566,0xfeb6,0xf811,0xfecb,0xfda9,0xfd60,0xfc76,0xfe02,0xfbd9,0xfaea,
0xfcca,0xfec4,0xf878,0xfc5a,0xfa0d,0xfc0f,0xfc3c,0xfbd0,0xfdfe,0xfd0b,
0xfc51,0xfd88,0xfc25,0xfd4b,0xfda1,0xfbb7,0xfd2f,0xfc3c,0xfd32,0xfb11,
0xfc96,0xfcd1,0xf9ed,0xfbf1,0xfd68,0xfb77,0xfd6f,0xfcb5,0xff0f,0xfe36,
0xfb79,0xfd52,0xfa28,0xf95b,0xfdcb,0xfd8e,0xfd62,0xff6c,0xffb0,0x012f,
0xffe7,0x0017,0xfa5c,0xfb4b,0xfe1f,0xfe58,0xfe9d,0xfcff,0x0073,0x005c,
0xfe93,0xff9b,0xfbdb,0xfdaf,0xfc28,0xfebc,0x0008,0xff93,0xff07,0xff44,
0xfc08,0xfec6,0xfd88,0xfede,0xfe95,0x0079,0x01b8,0x031e,0x0486,0x05ff,
0xf5a5,0xfcf0,0x0110,0x03e4,0xfad4,0xfb80,0xfc97,0xfda2,0xfebd,0x0000};
short int qtable2[100] =
{0xf642,0xf14b,0xf9b0,0xfece,0xfe10,0xfa07,0xfd5e,0xfa64,0xfa73,0xfd7d,
0xf61a,0xf1c3,0xf3a0,0xfaa7,0xf92a,0xfd2d,0xfe88,0xfc2e,0xfe27,0xfb2d,
0xf6c9,0xf1df,0xfc50,0xfc52,0xfa1b,0xfaca,0xfbeb,0xf9c0,0xfc35,0xfbd0,
0xf70c,0xf363,0xf274,0xfc2a,0xfcba,0xfb27,0xfdff,0xfbe7,0xfd33,0xfb8c,
0xf6e4,0xf2c7,0xf23f,0xf9cc,0xfa86,0xfbe6,0xfc6b,0xfc89,0xfe40,0x0014,
0xf64d,0xff34,0xf2a4,0xf8ee,0xfcb4,0xfcf9,0xff16,0xfdff,0xfeaa,0xfdfc,
0xf480,0xf7ca,0xf374,0xfa34,0xfd7d,0xfd2c,0xfd9f,0xfc4b,0xfe99,0x0008,
0xf417,0xf832,0xf33c,0xfadb,0xfe05,0xfe97,0x0072,0x0219,0x01d3,0x0398,
0xeead,0xf8ce,0xff53,0xfd8e,0x01bb,0x00c1,0xffad,0x0218,0x0247,0x03e0,
0xecdd,0xf8dc,0x0020,0x0076,0xff8f,0x032e,0x0411,0x051a,0x0386,0x0000};
short int qtable3[100] =
{0x00bc,0xf6d8,0xfe10,0xfb1b,0xf908,0xfbc8,0xf9aa,0xfafb,0xfd13,0xfbdd,
0x00d8,0xf800,0xfe03,0xf9c5,0xfc0e,0xfc2a,0xfdb9,0xfc1b,0xfbcf,0xfa2b,
0x00f8,0xf942,0xfa36,0xfb47,0xfa0d,0xfe47,0xfbbb,0xfd2f,0xf97b,0xffc0,
0x011f,0xfa43,0xfad3,0xf9a2,0xfc88,0xfabf,0xfe44,0xfd3b,0xff96,0xffe6,
0x014b,0xfc69,0xfc4a,0xf7c7,0xfab4,0xfded,0xfad4,0xfb18,0xfbe7,0x047c,
0xfee9,0x01b5,0xf908,0xf909,0xf918,0xff0d,0xfba5,0xfc76,0xfd70,0x05a1,
0xfa57,0x01f7,0xf92c,0xfbb2,0xfd16,0xfa62,0xfdf5,0xfc59,0x0127,0x0719,
0xf825,0x0240,0xfc7c,0x0186,0xfe6b,0xffc3,0x0234,0xffe7,0x03a1,0x08a2,
0xf356,0x0295,0x0211,0xfdd5,0xf938,0xfac9,0xfb3e,0xfc95,0xfda2,0x09ff,
0xee0e,0xf807,0xf96e,0xf957,0xfb03,0xfb6a,0xfc7f,0xfd97,0xfebd,0x0000};
short int wallH [100]; //horizontal wall positions: 1 if there is a wall to the left of a cell; indices 10, 21, 32, ..., 109 are the map's right-boundary walls
short int wallV [110]; //vertical wall positions: 1 if there is a wall above a cell; indices 100, 101, ..., 109 are the map's bottom-boundary walls
int i = 0;
int input = 0;
//write the map to Zybo memory
//writeMap(disp);
//the program starts converting the map array into horizontal and vertical wall arrays
for(i=0; i<100; i++){
wallV[i] = 0;
wallH[i] = 0;
}
for(i=0; i<100; i++){
if ((disp[i] & 2) == 2){
wallH[i] = 1;
}
else {
wallH[i] = 0;
}
if((disp[i] & 8) == 8){
wallV[i] = 1;
}
else{
wallV[i] = 0;
}
}
for(i=100; i<110; i++) {
wallV[i] = 1;
}
// wall arrays filled
//printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
while (input != 99){
printf("\nWhat should I do? :\n");
printf("1 : print Map\n");
printf("2 : show Prediction\n");
printf("99 : END\n");
printf("Command : ");
scanf("%d",&input);
printf("\n");
if(input == 1){
mapDisp(wallH, wallV, 0);
}
else if (input == 2){
printf("Sending Map data to Zybo...\n");
printf("Calculating best route...\n");
//scanf("%d",&input);
printPath(qtable0, qtable1, qtable2, qtable3, wallH, wallV);
}
else if (input == 99){
printf("Thanks!!\n");
}
else {
printf("Invalid Command!!\n");
}
}
}
APPENDIX II Memory Block Contents
Qtable memory (before the training process)
01 01
01 01
00 01
00 01
00 01
00 01
00 01
00 01
00 01
00 01
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
00 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
00 00
01 00
01 00
00 00
00 00
00 01
00 01
00 01
00 01
00 01
00 00
01
01
01
01
01
01
01
01
01
01