ACTOR-CRITIC REINFORCEMENT
LEARNING ALGORITHM FOR A BULK
GOOD SYSTEM
The main focus of this thesis is to design and implement a neural-network based actor-critic reinforcement learning algorithm for a bulk good system. The algorithm is applied to a Simulink model of the bulk good system to evaluate the result of the learning process. The neural network acts as a function approximator. The critic part uses the SARSALambdaLin algorithm and the actor part uses the natural actor-critic algorithm. The whole algorithm is written in Matlab. The goal is to get the maximum output from the system with minimum electrical power consumption. A comparison between the learning process result and the baseline model result is made.
Keywords:
Simulink, Function Approximator, Actor Critic RL, SARSA
DECLARATION OF AUTHORSHIP
I hereby certify that this thesis has been composed by me and is based on my own work,
unless stated otherwise. No other person’s work has been used without due acknowledg-
ment in this thesis. All references and verbatim extracts have been quoted, and all sources
of information, including graphs and data sets, have been specifically acknowledged.
Date: Signature:
TASK OF THE PROJECT
• Familiarization with the bulk good system and the function of each station.
• Familiarization and research on reinforcement learning fundamentals.
• Research on actor‐critic reinforcement learning.
• Physical modelling of the bulk good system based on previous works and
implementation of the model in Simulink.
• Design and implementation of the actor critic reinforcement learning algorithm in
the simulation environment in Matlab/Simulink to maximise volume output and
minimize electrical power consumption of the system.
CONTENT
1 INTRODUCTION .............................................................................................. 1
1.1 Background ......................................................................................................... 1
1.2 Scope ..................................................................................................................... 2
1.3 Purpose ................................................................................................................. 2
1.4 Organization of the Thesis ................................................................................. 2
4 PROGRAMMING............................................................................................ 14
4.1 Programming Overview ................................................................................... 14
4.2 State-Action Space ............................................................................................ 14
4.2.1 State Space ................................................................................................... 15
4.2.2 Action Space ................................................................................................ 16
4.3 Neural-Network................................................................................................. 16
4.4 Actor Critic Reinforcement Learning Algorithm .......................................... 18
4.4.1 Critic ............................................................................................................ 18
4.4.2 Actor ............................................................................................................ 19
4.4.3 Actor Critic .................................................................................................. 20
4.4.4 Reward Function .......................................................................................... 21
5 RESULT ............................................................................................................ 22
5.1 The Simulink Model of the Bulk Good System .............................................. 22
5.1.1 Baseline Model ............................................................................................ 22
5.1.2 Model for Learning Process ......................................................................... 26
5.2 The Reinforcement Learning Implementation Result ................................... 27
5.3 Comparison between the Baseline Model Simulation Result and the Learning
Simulation Result .................................................................................................... 33
6 CONCLUSION ................................................................................................. 35
6.1 Conclusion ......................................................................................................... 35
LIST OF FIGURES............................................................................................. 36
ABBREVIATIONS
RL Reinforcement Learning
SARSA State, Action, Reward, (next) State, (next) Action
TD Temporal-Difference
1 Introduction
1.1 Background
Nowadays, manufacturers around the world are racing to be the most cost-efficient manufacturer. One of the costliest resources used in most production processes is electricity. Manufacturers are thinking about how to produce as many products as possible with a minimum of electrical power consumed.
Optimizing the productivity of a production process is a very lengthy task if done manually. To simplify the task, a reinforcement learning algorithm can be used to optimize the production process. The algorithm explores all the possible adjustments that can be made to the system and chooses the most optimal one, independent of the parameters set by the operator.
The Automation Laboratory at the University of Applied Sciences Soest has a bulk good system which operates similarly to a production process system. The bulk good system is used to transport material from station 1 to station 4 by the actuators installed in the system. A bachelor student has already designed a reinforcement learning algorithm to increase the volume output of the bulk good system while minimizing the electrical power consumption [1]. Because many reinforcement learning algorithms have been developed, it might be possible to increase the efficiency of the bulk good system even more.
This thesis project implements another reinforcement learning algorithm on the bulk good system in order to maximize the volume output while minimizing the power consumption. The algorithm runs against a Simulink model of the bulk good system and is written in Matlab. The result of the learning process is then compared with the bulk good system without optimization.
1.2 Scope
The thesis project focuses on designing a neural-network based actor critic reinforcement
learning algorithm to be applied to the bulk good system. The physical model of the bulk
good system is made in Simulink. The actor critic reinforcement learning algorithm is
made in Matlab. The thesis project only applies the reinforcement learning algorithm to
the simulation model of the bulk good system for testing the resulting behavior of the
system.
1.3 Purpose
The main goals of the thesis project are to design the physical model of the bulk good system, to design a neural-network based actor critic reinforcement learning algorithm, and to compare the result of the learning process with the result of the baseline model.
1.4 Organization of the Thesis
This thesis report is divided into six chapters. The first chapter introduces the thesis project. The second describes the definition of reinforcement learning and actor critic reinforcement learning. The third chapter discusses the modelling process of the bulk good system in Simulink. The fourth chapter discusses the programming process and structure of the neural-network based actor critic reinforcement learning algorithm. The fifth chapter discusses the Simulink model result, the result of implementing the reinforcement learning algorithm on the bulk good system model, and the comparison between the result of the learning process and the baseline model result. The last chapter concludes the thesis report.
2 Reinforcement Learning
The agent acts in the environment and observes the state that results from its previous action. The previous action also yields a reward value which values that action. The agent then takes another action and compares the reward with its experience in order to take a better action in the next steps.
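This interaction loop can be sketched as follows (an illustrative Python toy, not the bulk good system; the environment, state transition, and reward here are invented purely for demonstration — the thesis implementation is in Matlab):

```python
class ToyEnvironment:
    """Hypothetical stand-in environment: an integer state and a reward
    that favors action 1. Not the bulk good system model."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = (self.state + action) % 5   # environment moves to a new state
        reward = 1.0 if action == 1 else -1.0    # reward value for the action taken
        return self.state, reward

def run_episode(env, policy, steps=10):
    """Agent acts, observes the new state and the reward, then acts again."""
    state, total = env.state, 0.0
    for _ in range(steps):
        action = policy(state)            # choose an action based on experience
        state, reward = env.step(action)  # observe resulting state and reward
        total += reward                   # comparing rewards lets the agent prefer better actions
    return total

assert run_episode(ToyEnvironment(), policy=lambda s: 1) == 10.0
assert run_episode(ToyEnvironment(), policy=lambda s: 0) == -10.0
```

A learning agent would adjust its policy between episodes so that the accumulated reward grows; here the two fixed policies only illustrate the act-observe-compare cycle.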
In actor critic reinforcement learning, the critic part evaluates the policy based on the reward given by the environment for the previous action and criticizes it so that it can be changed and improved [5]. The actor then chooses an action based on the improved policy. The reward value is only observed by the critic part. The process is shown in figure 2.2.
The actor critic algorithm aims to combine the benefits of actor-only and critic-only algorithms [7] to get a faster learning process.
3 Modelling of the Bulk Good System
In order to develop the bulk good system model in Simulink, the researcher needs to understand the behavior of the actual bulk good system. The bulk good system consists of four different stations, each with an individual task: the loading station, the storage station, the weighing station, and the final storage station. The first station is the loading station. In this station, materials are fed into the first buffer container and transported to the first minihopper by the belt conveyor. The materials are then transported to the storage station by a vacuum pump. In the storage station, the materials are first stored in the second buffer container and moved to the second minihopper by the vibration conveyor. The second vacuum pump moves the materials from the second station to the third station. The third station is the weighing station. The materials are stored in the third buffer container and are gradually moved to the third minihopper by the dosing machine. Lastly, the materials are moved to the final station by the third vacuum pump and stored there. Figure 3.1 shows the bulk good system.
In this project, the researcher is interested only in the material flow from the first buffer
container until the third buffer container before the weighing process.
The bulk good system description in this thesis is made by combining information from
experiments, datasheets, and Tim Kempe’s thesis chapter 4 [1].
3.1.1 Buffer Containers
There are four identical buffer containers in the bulk good system. Each buffer container is equipped with two level sensors, one at the top and one at the bottom, which give information about the material volume level in the container. The researcher conducted an experiment to measure the actual material volume that triggers each level sensor by filling the buffer container with material until the sensor status changed. Figure 3.2 shows the sensor placement in the buffer container.
The experiment shows that the first level sensor, located near the top of the buffer container, is triggered when the material volume in the container is about 17.42 liters. The second level sensor is located near the bottom of the buffer container and is triggered when the material volume in the container is more than 0.2 liters.
3.1.2 Minihoppers
There are three identical minihoppers in the bulk good system. In this project, only the first two minihoppers are used. Each minihopper is equipped with a level sensor which gives information about the material volume level in the minihopper. An experiment was conducted to measure the relation between the material volume level and the level sensor reading in both minihoppers by gradually filling material into the minihopper and recording the level sensor reading. Figure 3.3 shows the level sensor placement above the minihopper.
The experiment shows that in the first minihopper, the level sensor reading varies from 464 to 880. The reading is 464 when the material volume in the first minihopper is between 0 and 1.6 liters, and 880 when the material volume is 3 liters.
In the second minihopper, the level sensor reading varies from 224 to 881. The reading is 224 when the material volume in the second minihopper is between 0 and 1.2 liters, and 881 when the material volume is 3.6 liters.
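Assuming the reading grows linearly between the two calibration points of each minihopper (an assumption of this sketch; the thesis only reports the end points), the material volume can be estimated from a sensor reading as follows (illustrative Python, not the Matlab implementation):

```python
def volume_from_reading(reading, r_low, v_low, r_high, v_high):
    """Estimate material volume (liters) from a level-sensor reading by
    linear interpolation between two calibration points.
    Below r_low the sensor cannot resolve the volume; v_low is returned."""
    if reading <= r_low:
        return v_low
    frac = (reading - r_low) / (r_high - r_low)   # position between the calibration points
    return v_low + frac * (v_high - v_low)

# Minihopper 1 calibration from the experiment: reading 464 -> 1.6 l, 880 -> 3.0 l
v = volume_from_reading(672, 464, 1.6, 880, 3.0)  # reading halfway between the points
assert abs(v - 2.3) < 1e-9
# Minihopper 2 calibration: reading 224 -> 1.2 l, 881 -> 3.6 l
assert volume_from_reading(200, 224, 1.2, 881, 3.6) == 1.2
```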
3.1.3 Belt Conveyor
The first actuator in the bulk good system is the belt conveyor. The electric-motor-driven belt conveyor is installed in the loading station (first station). Its task is to transport material from the first buffer container to the first minihopper. The speed of the belt conveyor can be set between 450 and 1800 rpm.
The material volume flow in the belt conveyor can be described as a linear function:
dV_b/dt = V̇_b = α_b ∙ n (1)
where V̇_b is the material volume flow of the belt conveyor (l/min), α_b is the belt conveyor constant (0.0054 l/(min∙rpm)), and n is the belt conveyor speed (rpm) [1].
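Equation (1) can be sketched directly (illustrative Python, not the Matlab implementation; the speed-range check and the encoding of "off" as 0 rpm reflect the range given above):

```python
ALPHA_B = 0.0054   # belt conveyor constant from [1], l/(min*rpm)

def belt_volume_flow(n_rpm):
    """Material volume flow of the belt conveyor in l/min (equation 1)."""
    if n_rpm == 0:
        return 0.0                    # conveyor switched off
    if not 450 <= n_rpm <= 1800:
        raise ValueError("belt speed must be 0 or between 450 and 1800 rpm")
    return ALPHA_B * n_rpm

assert abs(belt_volume_flow(1000) - 5.4) < 1e-9   # 0.0054 * 1000 rpm = 5.4 l/min
assert belt_volume_flow(0) == 0.0
```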
3.1.4 Vibration Conveyor
The vibration conveyor is installed in the storage station (second station). Its task is to transport material from the second buffer container to the second minihopper. The vibration conveyor is operated pneumatically.
The researcher conducted an experiment to determine the material volume flow generated by the vibration conveyor by feeding it a constant material volume flow and measuring the material output over time.
Table 3.3 shows the result of the experiment. Taking the average volume flow from the
experiment gives the vibration conveyor material volume flow = 0.098 l/s.
3.1.5 Vacuum Pumps
There are three vacuum pumps installed in the bulk good system. Because the researcher is only interested in the material volume flow up to the weighing process, the third vacuum pump is not taken into consideration in this thesis.
The first vacuum pump is the MULTIJECTOR MX360 [10]. The evacuation time of this pump is 0.569 s. The MX360 volume output equation [1]:
V_MX360(t) = 0.0332 (1/s²) ∙ t² + 0.464 (1/s) ∙ t − 0.2749 l (2)
Taking the first derivative of equation 2 with respect to time will result in the material
volume output flow generated by MX360 pump:
V̇_MX360(t) = 0.0664 (1/s²) ∙ t + 0.464 l/s (3)
The second vacuum pump is the MULTIJECTOR MX540 [10]. The evacuation time of
this pump is 0.979 s. The MX540 volume output equation [1]:
V_MX540(t) = 0.0096 (1/s²) ∙ t² + 0.3535 (1/s) ∙ t − 0.3552 l (4)
Taking the first derivative of equation 4 with respect to time will result in the material
volume output flow generated by MX540 pump:
V̇_MX540(t) = 0.0192 (1/s²) ∙ t + 0.3535 l/s (5)
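Combining equation (3) with the evacuation-time behavior (no material flows before the evacuation time is reached), the MX360 flow can be sketched as follows (illustrative Python, not the Matlab model; the flow is taken as the time derivative of equation (2), i.e. 2 ∙ 0.0332 ∙ t + 0.464):

```python
def mx360_flow(t, evac_time=0.569):
    """Volume flow (l/s) of the MX360 pump t seconds after switch-on.
    No material flows until the evacuation time is reached; afterwards the
    flow is the derivative of equation (2): 2*0.0332*t + 0.464."""
    if t < evac_time:
        return 0.0                        # pump still evacuating, no material flow
    return 2 * 0.0332 * t + 0.464

assert mx360_flow(0.3) == 0.0                       # before evacuation time
assert abs(mx360_flow(1.0) - 0.5304) < 1e-9         # 0.0664 + 0.464 l/s
```

The MX540 pump would follow the same pattern with its own evacuation time (0.979 s) and the coefficients of equation (5).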
In order to fulfill the goal of this thesis project, the electrical power consumption of each actuator must also be known. The power consumption of the belt conveyor, the MX360 pump, and the MX540 pump has already been measured and documented by Tim Kempe [1].
The belt conveyor power consumption is measured by an AI Energy Meter [11].
Figure 3.4 shows the relation between the belt conveyor motor speed and its power consumption.
The vibration conveyor is powered by a pneumatic system. There are only two states of the vibration conveyor: ON and OFF. The power consumption of the vibration conveyor is 0 when it is OFF and constant when it is ON, because the speed remains constant. When it is ON, the vibration conveyor power consumption is:
P_vibr = 26.9 W (6)
The power consumption of the vacuum pumps while active is [1]:
P_MX360 = 305 W (7)
P_MX540 = 456 W (8)
In the bulk good system, the sensors are utilized to create a control system which controls the actuators' behavior. This control system is made to prevent overflow of materials in the bulk good system containers. The belt conveyor stops when minihopper 1 is full or when buffer container 1 is empty. The vibration conveyor stops when minihopper 2 is full or when buffer container 2 is empty.
After understanding the physical properties, material volume flow, and power consumption of each component in the bulk good system, the researcher can proceed to develop the Simulink model. The researcher creates the Simulink model with the help of the Simulink documentation [12] and a previous work example [13]. Two different Simulink models of the bulk good system are needed: one that mimics the actual bulk good system with the control system (baseline model) [14], and one without the control system. The second Simulink model is made without the control system because the reinforcement learning algorithm will be implemented on it and should explore and learn all the possibilities of taking a set of actions in the system. The inputs and outputs of both Simulink models are the same. The inputs are the initial states of each buffer container and minihopper, the speed of the belt conveyor, the state of the vibration conveyor, the suction time of the MX360 pump, and the suction time of the MX540 pump. The outputs are the material volume output from the MX540 pump and the total power consumption of the system over time. It is also possible to monitor the state of each container and actuator during the simulation.
4 Programming
4.1 Programming Overview
The researcher develops the program of this thesis project in Matlab. A neural-network based actor critic reinforcement learning algorithm needs to be developed for this thesis project. The first task is to determine the state-action space, then to develop a neural network for the actor critic part. The next step is to develop an actor critic algorithm. The last step is to compile all the functions into one program.
4.2 State-Action Space
For the reinforcement learning algorithm, the researcher declares the following state and action space:

No. | State
1 | Buffer station 1
2 | Buffer station 2
3 | Minihopper station 1
4 | Minihopper station 2
No. | Actuator
1 | Belt conveyor
2 | Vibration conveyor
3 | MX360 pump
4 | MX540 pump
4.2.1 State Space
The sensors in buffer 1 give information about the state of buffer 1. The following table describes the states of buffer 1:

No. | Bottom sensor | Top sensor | State
1 | 0 | 0 | Empty
2 | 1 | 0 | Filled
3 | 1 | 1 | Full/Overflow
Where 0 means that the sensor is not triggered and 1 means that the sensor is triggered.
The sensors in buffer 2 give information about the state of buffer 2. The following table describes the states of buffer 2:
No. | Bottom sensor | Top sensor | State
1 | 0 | 0 | Empty
2 | 1 | 0 | Filled
3 | 1 | 1 | Full/Overflow
For buffer 1 and buffer 2, the measured data is considered discrete. This is because it is not possible to determine the exact volume of material in the buffers when they are only partially filled (between empty and full).
For the minihoppers, the level sensor reading changes with the fill level as described in chapter 3.1.2. Therefore, the states of minihoppers 1 and 2 are considered continuous states.
4.2.2 Action Space
For the belt conveyor, the action is the speed of the belt conveyor. The belt conveyor speed can be any number between 450 and 1800 rpm, based on the range described in chapter 3.1.3. It is also possible to switch off the belt conveyor.
For the vibration conveyor, the action is only ON or OFF (discrete), with 1 = ON and 0 = OFF.
For the vacuum pumps, the action is the suction time. The minimum suction time is 1 s and the maximum is 10 s. It is also possible to switch off the vacuum pumps.
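The resulting action space can be sketched as a random sampler (illustrative Python, not the Matlab implementation; encoding a switched-off actuator as 0 is an assumption of this sketch):

```python
import random

def sample_random_actions(rng=random):
    """Draw one action per actuator from the spaces above.
    Switched-off actuators are encoded as 0 (a convention of this sketch)."""
    belt = rng.choice([0, rng.uniform(450, 1800)])   # belt speed in rpm, or off
    vibration = rng.choice([0, 1])                   # vibration conveyor OFF/ON
    mx360 = rng.choice([0, rng.uniform(1, 10)])      # suction time in s, or off
    mx540 = rng.choice([0, rng.uniform(1, 10)])      # suction time in s, or off
    return belt, vibration, mx360, mx540

random.seed(1)
belt, vib, p1, p2 = sample_random_actions()
assert belt == 0 or 450 <= belt <= 1800
assert vib in (0, 1)
assert p1 == 0 or 1 <= p1 <= 10
assert p2 == 0 or 1 <= p2 <= 10
```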
4.3 Neural-Network
The neural network developed in this thesis serves as a function approximator which feeds the actor critic reinforcement learning algorithm. This thesis project uses a function approximator so that the learning algorithm is able to explore every state and action [15]. The actor and critic parts of the reinforcement learning algorithm use only one neural network. The outputs of the neural network are the Q-value for the critic part and the π-value for the actor part. The activation function used in the neural network is the Gaussian radial basis function (RBF), because there are continuous state-action spaces in the system [16, 17].
Each state value is covered by three Gaussian functions [18]. For the actions, the belt conveyor and each vacuum pump are covered by three Gaussian functions and the vibration conveyor by two. The number of multivariate Gaussian distribution functions in the neural network (hidden neurons) is the product of the Gaussians per input dimension: 3 ∙ 3 ∙ 3 ∙ 3 ∙ 3 ∙ 3 ∙ 3 ∙ 2 = 4374.
Figure 4.1 shows the neural network. The input is the state-action pair, which is fed into the multivariate Gaussian distribution functions. The multivariate Gaussian distribution function is described as:
f(s,a) = exp(−(sa − c)^T Σ^(−1) (sa − c)) / √((2π)^d ∙ det(Σ)) (9)
where
d = the input dimension,
sa = the state-action vector,
c = the Gaussian centers, and
Σ = the covariance matrix.
The outputs of the Gaussian functions are normalized and multiplied by the critic weights θ. The results of this multiplication are then summed up to get the critic output Q(s,a). The normalized Gaussian function outputs are also multiplied by the actor weights ω and summed up to get the actor output policy π(s,a).
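This forward pass can be sketched as follows (illustrative Python, not the Matlab implementation; isotropic Gaussians with a single shared width stand in for the full covariance matrix Σ, and the centers and weights are made up):

```python
import math

def rbf_features(sa, centers, width=1.0):
    """Normalized Gaussian activations for a state-action vector sa."""
    acts = []
    for c in centers:
        d2 = sum((x - y) ** 2 for x, y in zip(sa, c))  # squared distance to center
        acts.append(math.exp(-d2 / width))
    total = sum(acts)
    return [a / total for a in acts]                   # normalization step

def critic_actor_outputs(sa, centers, theta, omega):
    """Q(s,a) from critic weights theta and pi(s,a) from actor weights omega,
    both computed from the same normalized feature vector."""
    phi = rbf_features(sa, centers)
    q = sum(t * p for t, p in zip(theta, phi))    # weighted sum -> critic output
    pi = sum(w * p for w, p in zip(omega, phi))   # weighted sum -> actor output
    return q, pi

centers = [(0.0, 0.0), (1.0, 1.0)]
q, pi = critic_actor_outputs((0.0, 0.0), centers, theta=[2.0, 0.0], omega=[0.0, 3.0])
assert 0 < q < 2.0 and 0 < pi < 3.0   # both are convex combinations of the weights
```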
4.4 Actor Critic Reinforcement Learning Algorithm
The reinforcement learning programming part is divided into two parts: critic and actor.
4.4.1 Critic
In this thesis project, the critic part uses SARSALambdaLin, SARSA(λ) with linear function approximation. The pseudocode of the algorithm [19]:
1. δ = r + γQ(s',a') – Q(s,a)
2. z = ∇θQ(s,a) + γλz
3. θ = θ + αδz
4. return (θ, z)
where
δ = the TD-error,
r = the reward,
z = the eligibility trace vector,
λ = the trace decay factor, and
θ = the critic's weights.
The first line is the TD-error calculation. The discount factor γ in this thesis project is set to 0.99. The Q values are fed into the algorithm by the neural-network function approximator: Q(s',a') is the Q value for the next state and next action, and Q(s,a) is the Q value for the current state and current action. In every iteration, z is updated, which in turn updates the critic's weights used in the neural network to get the Q values. The critic's learning rate α is set between 0.01 and 0.1 in this thesis project. A higher learning rate might make the learning unstable, leading to insignificant results. A gradual decrease of the critic's learning rate α is also implemented to smooth the learning process.
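One critic step can be sketched as follows (illustrative Python, not the thesis's Matlab code; with a linear approximator, ∇θQ(s,a) is just the feature vector φ(s,a), and the λ and α values here are placeholders within the ranges mentioned above):

```python
def sarsa_lambda_lin(phi_sa, r, q_sa, q_next, theta, z,
                     gamma=0.99, lam=0.9, alpha=0.05):
    """One SARSALambdaLin step with linear function approximation.
    phi_sa is the feature vector of the current state-action pair."""
    delta = r + gamma * q_next - q_sa                               # TD-error (line 1)
    z = [gamma * lam * zi + pi for zi, pi in zip(z, phi_sa)]        # trace update (line 2)
    theta = [t + alpha * delta * zi for t, zi in zip(theta, z)]     # weight update (line 3)
    return theta, z                                                 # (line 4)

# One step from zero weights with a unit feature on the first neuron:
theta, z = sarsa_lambda_lin(phi_sa=[1.0, 0.0], r=1.0, q_sa=0.0, q_next=0.0,
                            theta=[0.0, 0.0], z=[0.0, 0.0])
assert z == [1.0, 0.0]        # trace picks up the active feature
assert theta == [0.05, 0.0]   # weight moves by alpha * delta * z
```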
4.4.2 Actor
In this thesis project, the actor part uses the natural actor-critic update [19]:
ω_(t+1) = ω_t + β_t ∙ θ_t (10)
where,
ω = actor’s weight,
θ = critic’s weight.
In every iteration, the actor's weights are updated. The actor's weights are used in the neural network to get the action policies. A gradual decrease of the actor's learning rate β is also implemented to smooth the learning process.
4.4.3 Actor Critic
Combining the actor and critic parts of the algorithm results in an actor critic reinforcement learning algorithm. The pseudocode is as follows [19]:
1. Initialize ω, θ, z to 0
2. Set a pre-determined first action
Loop for each episode:
   3. (r,s') ← Execute a
   4. a' ← draw(πω(s',.))
   5. (θ,z) ← SARSALambdaLin(s,a,r,s',a',θ,z)
   6. ω ← ω + β∙θ
   7. update state s ← s'
   8. update action a ← a'
Lines 1 and 2 are the initialization part of the algorithm. Line 3 obtains the reward value and the next state by executing action a in the environment. Line 4 draws the next action. There are four actuators in the system that can be manipulated, as discussed in chapter 4.2.2. For the vibration conveyor, the action is chosen by Gibbs sampling because it is discrete [20]:
π_ω(a|s) = exp(ω^T ξ(s,a)) / Σ_b exp(ω^T ξ(s,b)) (11)
𝜉 (𝑠, 𝑎) is the output of the multivariate gaussian function in the neural-network for the
corresponding action. 𝜋𝜔 (𝑎|𝑠) is the probability of the action a to be taken in state s.
For the belt conveyor and the vacuum pumps, the actions are in a continuous (infinite) space. The action is chosen by using Gaussian distribution policies.
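Both selection mechanisms can be sketched as follows (illustrative Python, not the Matlab implementation; the preference values, Gaussian mean, and width are hypothetical placeholders):

```python
import math, random

def gibbs_probabilities(preferences):
    """Gibbs/Boltzmann distribution over a discrete action set (e.g. the
    vibration conveyor OFF/ON), from per-action preference values."""
    exps = [math.exp(p) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def draw_discrete(probs, rng):
    """Sample an action index according to the Gibbs probabilities."""
    u, acc = rng.random(), 0.0
    for action, p in enumerate(probs):
        acc += p
        if u <= acc:
            return action
    return len(probs) - 1

def draw_continuous(mean, sigma, low, high, rng):
    """Gaussian policy for continuous actuators (belt speed, suction times),
    clipped to the actuator's admissible range."""
    return min(max(rng.gauss(mean, sigma), low), high)

rng = random.Random(0)
probs = gibbs_probabilities([0.0, 0.0])          # equal preferences -> 50/50
assert abs(probs[0] - 0.5) < 1e-9
assert draw_discrete(probs, rng) in (0, 1)
speed = draw_continuous(1000, 200, 450, 1800, rng)   # belt speed in rpm
assert 450 <= speed <= 1800
```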
After getting the current state-action, the next state-action, and the reward value, the algorithm continues to the critic part as discussed in chapter 4.4.1. The outputs of the critic algorithm are the updated critic's weights and the updated z value. The critic's weights are then used to update the actor's weights. So, the critic part evaluates the result of the learning process by using the reward value in order to update the actor part so that it eventually takes better actions in the future.
4.4.4 Reward Function
Because the goal of this thesis is to get a high material volume output with the lowest power consumption possible, the reward function consists of the material volume output value V and the power consumption value P. In the learning process, the simulation model used is the one without the control mechanism, making it possible to lose material because of overflow in the buffer containers and the minihoppers. When an overflow happens, the learning process should be reminded by penalizing the reward value so that it learns that overflow is not allowed in the system. There are many possible reward functions that can be built from these parameters. The researcher developed two reward functions to be tested in the learning process. The first reward function:
r = − P / (V + 0.1) − penalty (12)
The idea of the first reward function is to maximize the reward value when the power consumption is low and the volume is high. The denominator is V + 0.1 in order to prevent the reward value from being undefined when there is no material volume flow (V = 0). The second reward function:
r = a ∙ V − b ∙ P − penalty (13)
The idea of the second reward function is to reward material volume flow and to penalize power consumption and the overflow penalty. The coefficients a and b are modified based on trial-and-error runs in order to balance the three parameters.
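The two reward functions can be compared with a short sketch (illustrative Python, not the Matlab implementation; the volume and power numbers are made up for the demonstration):

```python
def reward_v1(volume, power, penalty):
    """First reward function (equation 12). It is never positive, so an
    'all actuators OFF' episode (P = 0, penalty = 0) already reaches the
    maximum reward of 0."""
    return -power / (volume + 0.1) - penalty

def reward_v2(volume, power, penalty, a=10.0, b=1.0):
    """Second reward function (equation 13), here with the trial
    coefficients a = 10 and b = 1."""
    return a * volume - b * power - penalty

# All actuators OFF maximizes the first reward function ...
assert reward_v1(0.0, 0.0, 0.0) == 0.0
assert reward_v1(2.0, 11.0, 0.0) < 0.0      # producing output is still penalized
# ... while the second one rewards producing volume over merely saving power.
assert reward_v2(2.0, 11.0, 0.0) == 9.0
assert reward_v2(0.0, 0.0, 0.0) == 0.0
```

This already hints at the behavior observed in the simulations: under the first function the learner can settle on switching everything off, while the second function only pays off when material actually flows.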
5 Result
5.1 The Simulink Model of the Bulk Good System
5.1.1 Baseline Model
Figure 5.1 shows the baseline model for station 1 of the bulk good system. The belt conveyor is modeled similarly to the actual system: the material flows from buffer 1 to minihopper 1 at the rate discussed in chapter 3.1.3. The belt conveyor is switched off when the sensor in minihopper 1 is triggered (the volume in minihopper 1 is more than 3 l). For the sake of simulation continuity, when buffer 1 is empty, the system waits for a pre-determined number of seconds and buffer 1 is then refilled to full. The power consumption of the belt conveyor is calculated by a lookup table according to the relation between the belt conveyor motor speed and the power consumption.
Figure 5.2 shows the baseline model of station 2. The actuators in this station are the MX360 pump and the vibration conveyor. Material volume is added to buffer 2 by the vacuum pump and taken out by the vibration conveyor, which transports it to minihopper 2. The material volume flow rate of the vibration conveyor is modelled based on the calculation made in chapter 3.1.4. When minihopper 1 in station 1 is empty, the MX360 pump is not turned off; there is simply no material flow in the vacuum pump, so it still consumes power. Figure 5.3 shows the model of the MX360 pump.
The model is made with the pump evacuation time in mind. When the evacuation time is reached, the pump starts to pump materials. So, for t < 0.569 s the material volume flow is 0, and for 0.569 s < t < suction time the material volume flow is as described in chapter 3.1.5.
Figure 5.4 shows the baseline model of station 3 of the bulk good system, which is the last station in this thesis project. The output volume is added by the MX540 pump, taken from minihopper 2 in the previous station. When minihopper 2 is empty, the MX540 pump is not turned off; only the material volume flow becomes 0, but the pump still consumes power.
Figure 5.5 shows the model of the MX540 pump, which has a similar structure to the model of the MX360 pump shown in figure 5.3. The difference is in the material volume flow calculation and the evacuation time, as described in chapter 3.1.5.
The baseline model is then tested to validate its similarity with the actual bulk good system. The baseline model is simulated for 1000 s and the result is shown in figure 5.6.
Figure 5.6: Graph of Baseline Model (a) Volume Output and (b) Power Consump-
tion
Figure 5.6(a) shows the volume output of the baseline model. At fixed time intervals, the volume output of the system remains constant. This happens because the system waits for some seconds (in this case 45 seconds) for buffer 1 to be refilled to full when it is empty. When buffer 1 is empty, there is no material input flow into the system, so the output also becomes 0 after the material inside the system is completely removed.
Figure 5.6(b) shows the total power consumption of the baseline model. The graph appears linear, but on closer inspection the line is not perfectly straight, as seen in figure 5.7.
This behavior matches the real behavior of the system, because there are times when the vacuum pump is not consuming energy, owing to how it works. And when the suction time of the vacuum pump is constant for the whole simulation, a repeating behavior as seen in figure 5.7 appears.
5.1.2 Model for Learning Process
The Simulink model of the bulk good system for the learning process is similar to the baseline model. The differences are:
1. There is no control system implemented that can prevent overflow.
2. The simulation time is significantly shorter than for the baseline model. The idea is to let the system update the state and action values in every simulation run so that the learning algorithm can learn.
3. The model needs the initial state of each container in the system for every run.
4. The model needs the action values for each actuator in the system for every run.
The biggest difference is in the vacuum pump model. Figure 5.8 shows the modified MX360 vacuum pump model. The calculation of the material volume flow in the vacuum pump is still the same as before. The interval limit block simulates the operating time of the pump, which lies between the evacuation time and the suction time. As a result, the vacuum pump turns ON only once in one simulation run. For example, if the simulation time is 15 seconds and the suction time is 5 seconds, the vacuum pump is turned off after 5 seconds.
5.2 The Reinforcement Learning Implementation Result
The first simulation uses the first reward function:
r = − P / (V + 0.1) − penalty
The simulation runs for 1000 episodes with 15 seconds per episode using this reward function, and the result is shown in figure 5.9.
At first, the algorithm successfully explores the reward values of different action sets. At around 250 episodes, the algorithm decides to turn OFF every actuator in the system in order to get the maximum reward value, which is 0. It is clear that this reward function is not suitable for the task of optimizing volume output with minimum power consumption: the reward value stays negative, so the maximum value of 0 is reached when the power consumption is 0, which translates to all actuators being turned OFF.
The second reward function is then used for the remaining tests:
r = a ∙ V − b ∙ P − penalty
Because finding the balance of the parameters is a trial-and-error process, the second simulation is made with a = 10 and b = 1. Figure 5.10 shows the result of the simulation.
The reward value converges to 100 after around 150 episodes. The reward value can only be positive if 10 × the volume output is greater than the power consumption plus the penalty. The reward value in some episodes is negative because of overflow and higher power consumption. Figure 5.11 shows the volume output curve and the power consumption curve of the simulation:
Figure 5.11: (a) Volume Output Curve and (b) Power Consumption Curve of Simulation 2
The volume output curve shows that the system's volume output converges at the
maximum. The power consumption converges at around 11 W per episode. The power
consumption values in the early episodes suggest that it could be lowered further. Further
analysis is made by examining the behavior of the vacuum pumps, and the result is shown
in figure 5.12.
Figure 5.12: (a) Pump 1 Behavior Curve and (b) Pump 2 Behavior Curve of Simulation 2
Figure 5.12 shows that both pumps converge at the maximum suction time, which is 10 s.
The researcher then checks the behavior of the belt and vibration conveyors and finds that
the learning algorithm forces the belt conveyor to always run at maximum speed and
keeps the vibration conveyor always on as well. Since this does not seem to be the best
action set, the researcher conducts several simulations with different reward function
coefficients.
The results of these simulations vary. Most notably, some simulations turn ON all
actuators to the maximum while others turn OFF all the actuators. At this point, the
researcher decides to modify the reinforcement learning algorithm parameters as well,
not only the reward function.
The first step is changing the number of hidden neurons in the neural network. With 4374
hidden neurons, the network might overfit the task, so the number is reduced to 100. The
next step is to change algorithm parameters such as the decay rate and the eligibility trace.
Again, this modification process is trial and error.
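The roles of the decay rate and the eligibility trace in the critic can be sketched as a standard SARSA(λ) update with linear function approximation (a Python sketch under that assumption; the thesis code is in Matlab, and the hyperparameter values shown are illustrative, not the tuned ones):

```python
import numpy as np

def sarsa_lambda_step(w, z, phi, phi_next, r,
                      alpha=0.01, gamma=0.95, lam=0.9):
    """One critic update of SARSA(lambda) with a linear approximator.

    w        -- weight vector of the linear value function
    z        -- eligibility trace vector (same shape as w)
    phi      -- feature vector of the current state-action pair
    phi_next -- feature vector of the next state-action pair
    r        -- immediate reward
    gamma    -- discount factor; lam is the trace decay rate
    """
    delta = r + gamma * w @ phi_next - w @ phi  # TD error
    z = gamma * lam * z + phi   # decay old traces, add current features
    w = w + alpha * delta * z   # credit recent features for the TD error
    return w, z
```

With lam = 0 this degenerates to one-step SARSA; larger trace decay rates spread the TD error over recently visited features, which is what makes the parameter worth tuning.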
After several simulations, the result shown in figure 5.13 is obtained.
The material volume flow in each actuator is now taken into consideration in the reward
function. This simulation uses the initial state randomizer, 100 hidden neurons, and the
Gaussian parameter randomizer. The result shows repeating reward values at the end of
the simulation. The volume output and power consumption of the system are shown in
figure 5.14.
Figure 5.14: (a) Volume Output Curve and (b) Power Consumption Curve of Simulation 3
Figure 5.14 shows that the volume output and power consumption of the system end with
an oscillating behavior. In the volume output curve, there are times when the system's
volume output is 0. This happens because of the refill mechanism of buffer container 1,
which waits 10 episodes before being refilled. The power consumption curve hints that
the actuator settings keep changing during the learning process until the end. Further
analysis is made to check the behavior of the actuators, as shown in figure 5.15.
Figure 5.15: Actuators Behavior Curves of Simulation 3: (a) Belt Conveyor, (b) Vibration Conveyor, (c) MX360 Pump, and (d) MX540 Pump
Figure 5.15 shows that the belt conveyor is set to the maximum value, and the same
happens with both vacuum pumps. The vibration conveyor, however, oscillates between
ON and OFF. This oscillating behavior causes the power consumption oscillation shown
in figure 5.14(b).
At this point, the researcher cannot conclude that the learning algorithm successfully
makes the system learn to generate the maximum output with the minimum power
consumed. Therefore, a comparison between the simulation result and the baseline model
is conducted, and the result is discussed in the next sub-chapter.
For the simulation result to be comparable with the baseline model simulation result, the
baseline model simulation needs to run for the same amount of time as the learning
simulation. The last learning simulation ran for 438 episodes of 15 s each, so the total
simulation time is

438 × 15 s = 6570 s

The baseline model simulation therefore needs to run for 6570 s. After the simulation has
ended, the volume output and power consumption data for each sampling time are
recorded. The data are then divided into 438 samples of 15 s each, giving the same number
of samples as in the learning simulation. Each sample contains the total volume output
and total power consumption of the system in the corresponding 15 s period. The result
is then plotted in the same graph as the learning simulation result for comparison, as
shown in figure 5.16 and figure 5.17.
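The re-sampling of the baseline recording into 438 fifteen-second totals can be sketched as follows (Python for illustration; the actual post-processing was done in Matlab, and the function name and the 1 s sampling period in the example are assumptions):

```python
import numpy as np

def totals_per_episode(samples, sample_time,
                       episode_length=15.0, n_episodes=438):
    """Sum a per-sample recording into per-episode totals.

    samples     -- 1-D array with one value per sampling step
                   (volume output or power consumption)
    sample_time -- sampling period of the recording in seconds
    """
    steps = int(round(episode_length / sample_time))  # steps per episode
    samples = np.asarray(samples)[:steps * n_episodes]
    return samples.reshape(n_episodes, steps).sum(axis=1)

# A constant recording of 1 unit per second over 6570 s yields
# 438 episode totals of 15 units each.
totals = totals_per_episode(np.ones(6570), sample_time=1.0)
print(totals.shape, totals[0])  # (438,) 15.0
```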
Figure 5.16 compares the volume output of the baseline model simulation with that of the
learning algorithm simulation. The maximum volume output of the learning algorithm
simulation is higher than that of the baseline model. The volume output of the learning
algorithm also appears shifted to the right compared to the baseline model. This might
happen because the learning process stops at a random time, so the stable phase does not
start at exactly the same time as in the baseline model, which is constant from the start of
the simulation.
Figure 5.17 compares the power consumption of the baseline model simulation with that
of the learning algorithm simulation. The power consumption of the learning algorithm
simulation is consistently lower than the baseline result. This might be because the
baseline model stops the belt conveyor or the vibration conveyor only when the next
container is full, whereas the learning algorithm learns about the intermediate states of
the next container and adjusts the actuators accordingly.
6 Conclusion
6.1 Conclusion
A Simulink model of the bulk good system has been built and mimics the behavior of the
actual bulk good system.
A neural-network based actor-critic reinforcement learning algorithm has been
implemented in Matlab. The neural network acts as function approximator for both the
actor and the critic part of the algorithm. The actor part uses the natural actor-critic
algorithm and the critic part uses the SARSALambdaLin algorithm.
Another reinforcement learning approach should be explored in the future to further
improve the efficiency of the bulk good system.
List of Figures
List of Tables