by M. Bindhammer
and β = 1 a penalty. T is the updating function (rule) according to which the
elements of the set p are updated at each time t = n. Therefore

    p(n + 1) = T(α(n), β(n), p(n)),

where the i-th element of the set p(n) is p_i(n) = Prob(α(n) = α_i) with
i = 1, 2, ..., r. For all n:

    p_1(n) + p_2(n) + ... + p_r(n) = Σ_{i=1}^{r} p_i(n) = 1

and for all i:

    p_i(n = 1) = 1/r.

c = {c_1, c_2, ..., c_r} is the finite set of penalty probabilities that the
action α_i will result in a penalty input from the random environment. If the
penalty probabilities are constant, the environment is called a stationary
random environment.

The linear reward-penalty scheme updates the probabilities as follows.

β = 0:

    p_j(n + 1) = p_j(n) + a·(1 − p_j(n))    for j = i
    p_j(n + 1) = p_j(n)·(1 − a)             for j ≠ i

β = 1:

    p_j(n + 1) = p_j(n)·(1 − b)                  for j = i
    p_j(n + 1) = b/(r − 1) + p_j(n)·(1 − b)      for j ≠ i
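As a sketch, the two updating rules can be written as one small function. This is an illustrative Python model; the names (lrp_update, probs, chosen) are my own, not from the original.

```python
def lrp_update(probs, chosen, beta, a, b):
    """Linear reward-penalty update of the action probabilities.

    probs  -- list of current probabilities p_j(n), summing to 1
    chosen -- index i of the action that was performed
    beta   -- environment response: 0 = reward, 1 = penalty
    a, b   -- reward and penalty learning parameters, 0 < a, b < 1
    """
    r = len(probs)
    new = []
    for j, p in enumerate(probs):
        if beta == 0:  # reward: raise the chosen action, lower the rest
            new.append(p + a * (1 - p) if j == chosen else p * (1 - a))
        else:          # penalty: lower the chosen action, raise the rest
            new.append(p * (1 - b) if j == chosen
                       else b / (r - 1) + p * (1 - b))
    return new
```

Note that both branches preserve the property that the probabilities sum to 1.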
Rev. 2
2. An Example
We make two assumptions before we start with the example. For simplicity we
consider the random environment to be a stationary random environment, and we
use the linear reward-penalty scheme.
Let's say the robot roams through a room and shall learn how to avoid
obstacles. A stationary random environment then simply means that the
probability that the robot will hit an obstacle is the same everywhere in the
room. We will discuss later in detail how to obtain such penalty probabilities
as a function of the position in the room.
Let us assume the robot can choose from the set α = {α_1, α_2, α_3, α_4} of
actions. We could define these actions for instance as follows: α_1: drive
forwards, α_2: drive backwards, α_3: turn right and α_4: turn left.
With i = 1, 2, ..., r and r = 4, the initial probabilities are

    p_i(1) = 1/r = 1/4.

Let a = b = 1/2.

Let's assume now that the initial action α_1 (which has been selected
randomly) has led to an input β = 0 (reward) at the time t = n. The new
probabilities are then calculated as follows:
β = 0:

    p_j(n + 1) = p_j(n) + a·(1 − p_j(n))    for j = i
    p_j(n + 1) = p_j(n)·(1 − a)             for j ≠ i

For α_1:

    p_j(n + 1) = p_j(n) + a·(1 − p_j(n)) = 1/4 + 1/2·(1 − 1/4) = 5/8

For α_2, α_3, α_4:

    p_j(n + 1) = p_j(n)·(1 − a) = 1/4·(1 − 1/2) = 1/8
As it is required that for all n:

    Σ_{i=1}^{r} p_i(n) = 1,

we check:

    Σ_{j=1}^{r} p_j(n + 1) = 5/8 + 1/8 + 1/8 + 1/8 = 1.

I.e. after the input from the environment β(n) = 0, the probability that
action α_1 will be chosen has been increased to 5/8, while the probabilities
of the remaining actions have been decreased to 1/8.
We now compute the same for the case that the initial action α_1 has led to an
input β = 1 (penalty) at the time t = n.

β = 1:

    p_j(n + 1) = p_j(n)·(1 − b)                  for j = i
    p_j(n + 1) = b/(r − 1) + p_j(n)·(1 − b)      for j ≠ i

For α_1:

    p_j(n + 1) = p_j(n)·(1 − b) = 1/4·(1 − 1/2) = 1/8

For α_2, α_3, α_4:

    p_j(n + 1) = b/(r − 1) + p_j(n)·(1 − b) = (1/2)/(4 − 1) + 1/4·(1 − 1/2) = 7/24

Again

    Σ_{j=1}^{r} p_j(n + 1) = 1/8 + 7/24 + 7/24 + 7/24 = 1.

I.e. after the input from the environment β(n) = 1, the probability that
action α_1 will be chosen has been decreased to 1/8, while the probabilities
of the remaining actions have been increased to 7/24.
From this example it can also be seen immediately that the limits of a
probability p_i for n → ∞ are either 0 or 1. Therefore the robot learns to
choose the optimal action asymptotically. It shall be noted that it does not
always converge to the correct action, but the probability that it converges
to the wrong one can be made arbitrarily small by making the learning
parameter a small.
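A quick way to observe this behaviour is to simulate the automaton against a stationary random environment. The sketch below is illustrative: the penalty probabilities and all names are made-up values, not from the original, and b = a is assumed as in the example.

```python
import random

def simulate(penalty_probs, a, steps, seed=1):
    """Run a linear reward-penalty automaton (b = a) against a stationary
    random environment given by fixed penalty probabilities c_i, and
    return the final action probabilities."""
    rng = random.Random(seed)
    r = len(penalty_probs)
    probs = [1.0 / r] * r          # p_i(1) = 1/r
    for _ in range(steps):
        # choose an action according to the current probabilities
        i = rng.choices(range(r), weights=probs)[0]
        # the environment penalizes action i with probability c_i
        beta = 1 if rng.random() < penalty_probs[i] else 0
        if beta == 0:   # reward
            probs = [p + a * (1 - p) if j == i else p * (1 - a)
                     for j, p in enumerate(probs)]
        else:           # penalty
            probs = [p * (1 - a) if j == i else a / (r - 1) + p * (1 - a)
                     for j, p in enumerate(probs)]
    return probs

# Action 2 is penalized least often, so its probability should end up
# the largest one after enough steps.
final = simulate([0.9, 0.8, 0.1, 0.7], a=0.05, steps=3000)
print(final.index(max(final)))
```

With a smaller learning parameter the automaton moves more slowly but concentrates more reliably on the action with the smallest penalty probability.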
3. Algorithm of choice
Before we finally can start writing a basic program example, we need to find an
'algorithm of choice' that selects an action i , tagged with the according probability
pi .
To begin with, we consider the random number generator random (min,max),
where min is the lower bound of the random value and max the upper bound. It is
sufficient for our approach, to use a pseudo-random generator, as all microcontrollers
must perform mathematics to generate random numbers, and so the sequence can
never be truly random, but it is important for a sequence of values generated by
random(min,max) to differ on subsequent executions. This can be archived by
initializing the random number generator with a fairly random input, such as an
unconnected ADC pin.
The easiest case is if the probabilities are equal (which is for instance
initially the case). The pseudo code looks as follows:

    if(p_1 == p_2 == ... == p_r) then
        rand_number = random(1, r)    // if all probabilities are equal,
                                      // choose uniformly
To cover all cases, a simple approach is to first find the probability(ies)
with the maximum value at a time n and use a modified random number generator
to choose the action according to the probability value of p_max(n) and the
remaining probabilities. If the action with the maximum probability value has
been selected, the algorithm stops. If not, the action with the next smaller
probability value is determined and a modified random number generator is used
again to choose. The algorithm can be imagined as a repeating coin toss: on
one side of the coin is the action tagged with the probability p_max(n), on
the other side are all the other actions with the remaining probabilities.
We define the modified random number generator as random(1, y), where
y ∈ N>0. We furthermore define a variable t ∈ N>0.
Example: let y = 5 and t = 3, and

    if(rand_number <= t) then do ...

The probability in the example that the action tagged with p_max will be
chosen is 3/5, while the probability that an action tagged with one of the
remaining probabilities will be chosen is 2/5. With this method we can create
any rational number probability. We must therefore avoid an irrational
learning parameter a, because then the probabilities would most likely become
irrational too, and we would not be able to reproduce them with the introduced
simple random number generator, which most microcontroller IDEs provide. We
therefore define: p_i ∈ {0, 1}, otherwise p_i ∈ Q+, and 0 < a < 1 with
a ∈ Q+.
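The repeated coin toss described above can be sketched as follows. This is an illustrative Python model (the name choose_action and the test loop are my own); exact fractions are used so that every toss has exactly the form random(1, y) <= t.

```python
import random
from fractions import Fraction

def choose_action(probs, rng=random):
    """Select an action index according to the exact probabilities in
    `probs` (a list of Fractions summing to 1), via repeated coin
    tosses: first the action with the maximum probability against all
    remaining ones, then the next smaller one, and so on."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    remaining = Fraction(1)
    for j in order[:-1]:
        toss = probs[j] / remaining        # this toss succeeds with t/y
        t, y = toss.numerator, toss.denominator
        if rng.randint(1, y) <= t:         # random(1, y) <= t
            return j
        remaining -= probs[j]              # renormalize for the next toss
    return order[-1]                       # only one action left

# With p = (5/8, 1/8, 1/8, 1/8) the first toss is random(1, 8) <= 5.
probs = [Fraction(5, 8), Fraction(1, 8), Fraction(1, 8), Fraction(1, 8)]
counts = [0, 0, 0, 0]
rng = random.Random(42)
for _ in range(8000):
    counts[choose_action(probs, rng)] += 1
print(counts)   # roughly 5000 / 1000 / 1000 / 1000
```

Because the probabilities are kept rational, every toss reduces to an integer comparison, which is exactly what a microcontroller's random(1, y) can provide.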
p_max(n) can now be computed by

    p_max(n) = t/y,

a by

    a = u/v    with u, v ∈ N>0,

and p_j(n) by

    p_j(n) = w/x    with w, x ∈ N>0.
With b = a, the updating rules read:

β = 0:

    p_j(n + 1) = p_j(n) + a·(1 − p_j(n))    for j = i
    p_j(n + 1) = p_j(n)·(1 − a)             for j ≠ i

β = 1:

    p_j(n + 1) = p_j(n)·(1 − a)                  for j = i
    p_j(n + 1) = a/(r − 1) + p_j(n)·(1 − a)      for j ≠ i

Substituting a = u/v and p_j(n) = w/x gives:

β = 0:

    p_j(n + 1) = (v·w + u·(x − w)) / (v·x) = t_1/y_1    for j = i
    p_j(n + 1) = w·(v − u) / (v·x) = t_2/y_2            for j ≠ i

β = 1:

    p_j(n + 1) = w·(v − u) / (v·x) = t_3/y_3                            for j = i
    p_j(n + 1) = (x·u + w·(r − 1)·(v − u)) / (x·v·(r − 1)) = t_4/y_4    for j ≠ i
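These integer formulas can be checked quickly against the direct updating rules. The following sketch plugs in the numbers from the example in chapter 2 (a = u/v = 1/2, p_j(n) = w/x = 1/4, r = 4) and uses Python's Fraction only as a reference for the exact values.

```python
from fractions import Fraction

u, v = 1, 2    # a = u/v = 1/2
w, x = 1, 4    # p_j(n) = w/x = 1/4
r = 4

# beta = 0, j = i:   t1/y1 = (v*w + u*(x - w)) / (v*x)
t1, y1 = v * w + u * (x - w), v * x

# beta = 0, j != i:  t2/y2 = w*(v - u) / (v*x)
# (t3/y3 for beta = 1, j = i, is the same expression)
t2, y2 = w * (v - u), v * x

# beta = 1, j != i:  t4/y4 = (x*u + w*(r - 1)*(v - u)) / (x*v*(r - 1))
t4, y4 = x * u + w * (r - 1) * (v - u), x * v * (r - 1)

# Compare against the direct updating rules.
a, p = Fraction(u, v), Fraction(w, x)
assert Fraction(t1, y1) == p + a * (1 - p)            # 5/8
assert Fraction(t2, y2) == p * (1 - a)                # 1/8
assert Fraction(t4, y4) == a / (r - 1) + p * (1 - a)  # 7/24

print(t1, y1, t2, y2, t4, y4)  # 5 8 1 8 7 24
```

On a microcontroller the same bookkeeping would be done with plain integer variables, so the probabilities stay exactly representable as t/y.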
At the end of this chapter, we discuss the need for stopping criteria. As
mentioned, the robot learns asymptotically, i.e. a certain probability
converges to 0 or 1 only for n → ∞. The rate of convergence can be adjusted by
the learning parameter a. The learning parameter is no universal constant; to
choose the right value, some experiments need to be done, and it depends on
the task the robot should learn. If the value of the learning parameter is too
small, it takes a long time until the robot learns a task; if it is too large,
the robot might misinterpret data from the environment.
As all microcontrollers that support floating point math have a limited number
of digits after the decimal point, the convergence criterion is not really
n → ∞; it just depends on the number of digits after the decimal point. If,
for instance, 5 digits after the decimal point can be calculated, the
microcontroller will interpret 0.000001 as 0 and 0.999999 as 1. That isn't a
problem for our project and can be used instead of implementing stopping
criteria.
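This implicit stopping rule can be sketched as a simple rounding test; the 5-digit precision and the name converged are taken as illustrative values from the example above.

```python
def converged(p, digits=5):
    """Treat a probability as converged once it rounds to 0 or 1
    at the given number of decimal digits."""
    rounded = round(p, digits)
    return rounded == 0.0 or rounded == 1.0

print(converged(0.999999))  # True:  rounds to 1.0 at 5 digits
print(converged(0.000001))  # True:  rounds to 0.0 at 5 digits
print(converged(0.99990))   # False: still distinguishable from 1
```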
Stopping criteria only make sense if the environment does not change. Imagine
a robot arm that should learn to lift a cup from a fixed position. After the
robot has learned to lift the cup, the learning progress can be stopped; it
just repeats what it has learned. But if the position of the cup changes
unpredictably, the learning progress never stops.