
Concept of a learning robot based on VSLA

by M. Bindhammer

1. Learning robot and its environment


In this first chapter we briefly discuss the concept behind the learning robot and its environment, based on the variable structure stochastic learning automaton (VSLA). The robot can choose from a finite number of actions (e.g. drive forwards, drive backwards, turn right, turn left). Initially, at time t = n = 1, one of the possible actions is chosen by the robot at random with a given probability p_i. This action is applied to the random environment in which the robot "lives", and the response from the environment is observed by the sensor(s) of the robot.


The feedback from the environment is binary, i.e. it is either favorable or unfavorable for the given task the robot should learn. We define β = 0 as a reward (favorable) and β = 1 as a penalty (unfavorable). If the response from the environment is favorable (β = 0), then the probability p_i of choosing that action α_i for the next period of time t = n + 1 is updated according to the updating rule T. After that, another action is chosen and the response of the environment observed. When a certain stopping criterion is reached, the algorithm stops and the robot has learnt some characteristics of the random environment.
Definitions:

α = {α_1, α_2, ..., α_r} is the finite set of r actions/outputs of the robot. The output (action) applied to the environment at time t = n is denoted by α(n).

β = {β_1, β_2} is the binary set of inputs/responses from the environment. The input (response) applied to the robot at time t = n is denoted by β(n). In our case, the values for β are chosen to be 0 and 1: β = 0 represents a reward and β = 1 a penalty.

p = {p_1, p_2, ..., p_r} is the finite set of probabilities that a certain action α(n) is chosen at time t = n, denoted by p(n).

T is the updating function (rule) according to which the elements of the set p are updated at each time t = n. Therefore p(n + 1) = T(α(n), β(n), p(n)), where the i-th element of the set p(n) is p_i(n) = Prob(α(n) = α_i) with i = 1, 2, ..., r. For all n:

  Σ_{i=1}^{r} p_i(n) = p_1(n) + p_2(n) + ... + p_r(n) = 1

and for all i:

  p_i(n = 1) = 1/r.

c = {c_1, c_2, ..., c_r} is the finite set of penalty probabilities, where c_i is the probability that action α_i will result in a penalty input from the random environment. If the penalty probabilities are constant, the environment is called a stationary random environment.

The updating functions (reinforcement schemes) are categorized based on their linearity. The general linear scheme is given by:

If α(n) = α_i,

β = 0:

  p_j(n + 1) = p_j(n) + a(1 - p_j(n))      for j = i
  p_j(n + 1) = p_j(n)(1 - a)               for j ≠ i

β = 1:

  p_j(n + 1) = p_j(n)(1 - b)               for j = i
  p_j(n + 1) = b/(r - 1) + p_j(n)(1 - b)   for j ≠ i

where a and b are the learning parameters with 0 < a, b < 1.

If a = b, the scheme is called the linear reward-penalty scheme. If for β = 1 the p_j remain unchanged (b = 0), it is called the linear reward-inaction scheme.
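
To make the scheme concrete, here is a minimal C sketch of this update; the array p, the constant R and the function name are assumptions made only for this illustration:

#define R 4   /* number of actions, assumed for the sketch */

static float p[R] = {0.25f, 0.25f, 0.25f, 0.25f};   /* p_i(1) = 1/r */

/* Update all p_j after action i received the response beta (0 = reward, 1 = penalty). */
void update_probabilities(int i, int beta, float a, float b)
{
    for (int j = 0; j < R; j++) {
        if (beta == 0) {                                   /* reward */
            p[j] = (j == i) ? p[j] + a * (1.0f - p[j])
                            : p[j] * (1.0f - a);
        } else {                                           /* penalty */
            p[j] = (j == i) ? p[j] * (1.0f - b)
                            : b / (R - 1) + p[j] * (1.0f - b);
        }
    }
}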


2. An Example
We make two assumptions before we start with the example. For simplicity we consider the random environment to be a stationary random environment, and we use the linear reward-penalty scheme.
Let's say the robot roams through a room and shall learn how to avoid obstacles. A stationary random environment then simply means that the probabilities that the robot will hit an obstacle are the same everywhere in the room. We will discuss later in detail how to obtain such penalty probabilities as a function of the position in the room.
Let us assume the robot can choose from the set α = {α_1, α_2, α_3, α_4} of actions. We could define these actions for instance as follows: α_1: drive forwards, α_2: drive backwards, α_3: turn right and α_4: turn left.
With i = 1, 2, ..., r and r = 4, the initial probabilities are

  p_i(1) = 1/r = 1/4.

Let a = b = 1/2.
Let's assume now that the initial action α_1 (which has been selected randomly) has led to an input β = 0 (reward) at time t = n. The new probabilities are then calculated as follows:

β = 0:

  p_j(n + 1) = p_j(n) + a(1 - p_j(n))   for j = i
  p_j(n + 1) = p_j(n)(1 - a)            for j ≠ i

For α_1:

  p_j(n + 1) = p_j(n) + a(1 - p_j(n)) = 1/4 + 1/2 · (1 - 1/4) = 5/8

For α_2, α_3, α_4:

  p_j(n + 1) = p_j(n)(1 - a) = 1/4 · (1 - 1/2) = 1/8

As it is required that for all n:

  Σ_{i=1}^{r} p_i(n) = 1,

we check:

  Σ_{j=1}^{r} p_j(n + 1) = 5/8 + 1/8 + 1/8 + 1/8 = 1.

I.e. after the input from the environment β(n) = 0, the probability that action α_1 will be chosen as action α(n + 1) has been increased to 5/8, while the probability that one of the actions α_2, α_3 or α_4 will be chosen has been decreased to 1/8.

We now compute the same for the case that the initial action α_1 has led to an input β = 1 (penalty) at time t = n.

β = 1:

  p_j(n + 1) = p_j(n)(1 - b)                for j = i
  p_j(n + 1) = b/(r - 1) + p_j(n)(1 - b)    for j ≠ i

For α_1:

  p_j(n + 1) = p_j(n)(1 - b) = 1/4 · (1 - 1/2) = 1/8

For α_2, α_3, α_4:

  p_j(n + 1) = b/(r - 1) + p_j(n)(1 - b) = (1/2)/(4 - 1) + 1/4 · (1 - 1/2) = 7/24

Again:

  Σ_{j=1}^{r} p_j(n + 1) = 1/8 + 7/24 + 7/24 + 7/24 = 1.

I.e. after the input from the environment β(n) = 1, the probability that action α_1 will be chosen as action α(n + 1) has been decreased to 1/8, while the probability that one of the actions α_2, α_3 or α_4 will be chosen has been increased to 7/24.

From this example it can also be seen immediately that the limits of a probability p_i for n → ∞ are either 0 or 1. Therefore the robot learns to choose the optimal action asymptotically. It should be noted that it does not always converge to the correct action, but the probability that it converges to the wrong one can be made arbitrarily small by making the learning parameter a small.
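
To see the convergence behaviour, here is a minimal C simulation sketch of this example; the penalty probabilities in c[] and all names are assumptions chosen only for illustration, they are not given in the text:

#include <stdio.h>
#include <stdlib.h>

#define R 4

int main(void)
{
    float p[R] = {0.25f, 0.25f, 0.25f, 0.25f};   /* p_i(1) = 1/r */
    float c[R] = {0.7f, 0.6f, 0.5f, 0.1f};       /* assumed penalty probabilities of the stationary environment */
    float a = 0.5f, b = 0.5f;                    /* linear reward-penalty scheme, a = b */
    srand(1);

    for (int n = 0; n < 200; n++) {
        /* choose action i according to the current probabilities p */
        float u = (float)rand() / RAND_MAX, s = 0.0f;
        int i = R - 1;
        for (int k = 0; k < R; k++) { s += p[k]; if (u <= s) { i = k; break; } }

        /* stationary random environment: penalty with probability c[i] */
        int beta = ((float)rand() / RAND_MAX < c[i]) ? 1 : 0;

        /* general linear scheme */
        for (int j = 0; j < R; j++) {
            if (beta == 0) p[j] = (j == i) ? p[j] + a * (1.0f - p[j]) : p[j] * (1.0f - a);
            else           p[j] = (j == i) ? p[j] * (1.0f - b)        : b / (R - 1) + p[j] * (1.0f - b);
        }
    }

    for (int k = 0; k < R; k++)
        printf("p_%d = %f\n", k + 1, p[k]);
    return 0;
}

With these assumed c values, p_4 (turn left, the action with the lowest penalty probability) will typically end up close to 1, while the other probabilities approach 0.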


3. Algorithm of choice
Before we can finally start writing a basic program example, we need to find an 'algorithm of choice' that selects an action α_i tagged with the corresponding probability p_i.
To begin with, we consider the random number generator random(min,max), where min is the lower bound of the random value and max the upper bound. It is sufficient for our approach to use a pseudo-random generator: microcontrollers must perform mathematics to generate random numbers, so the sequence can never be truly random. It is important, however, that the sequence of values generated by random(min,max) differs on subsequent executions. This can be achieved by initializing the random number generator with a fairly random input, such as the reading of an unconnected ADC pin.
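
Such seeding could be sketched as follows in C; adc_read_noise() merely stands in for the platform-specific ADC read and here falls back to the clock so the sketch compiles everywhere:

#include <stdlib.h>
#include <time.h>

/* Stand-in for the platform-specific read of an unconnected ADC pin. */
static unsigned int adc_read_noise(void)
{
    return (unsigned int)clock();
}

void init_rng(void)
{
    srand(adc_read_noise());   /* seed the pseudo-random generator with a fairly random input */
}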
The easiest case is when all probabilities are equal (which is, for instance, initially the case). The pseudo code looks as follows:
if(p_1==p_2==...==p_r) then
    rand_number=random(1,r)              //if all probabilities are equal, generate a random number between 1 and r
    if(rand_number==1) then do alpha_1   //if random number=1 then perform action 1
    if(rand_number==2) then do alpha_2   //if random number=2 then perform action 2
    ...
    if(rand_number==r) then do alpha_r   //if random number=r then perform action r

To cover all cases, a simple approach is to first find the probability (or probabilities) with the maximum value at a time n and use a modified random number generator to choose between the action with probability p_max(n) and the remaining probabilities. If the action with the maximum probability value has been selected, the algorithm stops. If not, the action with the next smaller probability value is determined and the modified random number generator is used again to choose. The algorithm can be imagined as a repeated coin toss: on one side of the coin is the action tagged with the probability p_max(n), on the other side are all the other actions with the remaining probabilities.
We define the modified random number generator as random(1,y), where y ∈ ℕ, y > 0. We furthermore define a variable t ∈ ℕ, t > 0.
Example:

int t=3                                  //assign the value 3 to the integer variable t
int y=5                                  //assign the value 5 to the integer variable y
rand_number=random(1,y)                  //generate a random number between 1 and 5
if(rand_number<=t) then do alpha_p_max   //if random number has a value of 1 to 3 then perform action tagged with p_max
else                                     //start algorithm of choice again

The probability in the example that the action tagged with p_max will be chosen is 3/5, while the probability that actions tagged with the remaining probabilities will be chosen is 2/5. With this method we can create any rational probability. In this regard, we should not use an irrational number like √2 for the learning parameter a, because the probabilities would then most likely become irrational too and we would not be able to reproduce them with the introduced simple random number generator, which most microcontroller IDEs provide. We therefore define that p_i ∈ {0, 1} or otherwise p_i ∈ ℚ+, and 0 < a < 1 with a ∈ ℚ+.
p_max(n) can now be computed by:

  p_max(n) = t/y
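
Put together, the whole algorithm of choice can be sketched in C as below. One detail is an assumption of this sketch and is not spelled out in the text: after a failed coin toss, the remaining probability mass is used as the new denominator, so that each action α_i is selected with exactly its probability p_i. rand_range() stands in for the random(min,max) generator introduced above.

#include <stdlib.h>

/* Stand-in for random(min,max): a pseudo-random integer between min and max, inclusive. */
static int rand_range(int min, int max)
{
    return min + rand() % (max - min + 1);
}

/* Algorithm of choice for probabilities stored as integer fractions num[i]/den
   (common denominator den). idx[] holds the action indices sorted by decreasing
   probability. Returns the index of the selected action. */
int choose_action(const int idx[], const int num[], int den, int r)
{
    int remaining = den;                      /* probability mass still in play */
    for (int k = 0; k < r - 1; k++) {
        int i = idx[k];
        /* coin toss: select action i with probability num[i]/remaining */
        if (rand_range(1, remaining) <= num[i])
            return i;
        remaining -= num[i];                  /* start the algorithm of choice again */
    }
    return idx[r - 1];                        /* only one candidate left */
}

In the example above, t = 3 and y = 5 correspond to num[idx[0]] = 3 and den = 5 in the first coin toss.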

As we defined the learning parameter a as 0 < a < 1 with a ∈ ℚ+, we can substitute a and p_j(n) by

  a = u/v       with u, v ∈ ℕ, u, v > 0,

  p_j(n) = w/x  with w, x ∈ ℕ, w, x > 0.

We rewrite the right-hand sides of the following equations as common fractions with integer numerators and denominators:

β = 0:

  j = i:  p_j(n + 1) = p_j(n) + a(1 - p_j(n)) = (v·w + u·(x - w)) / (v·x) = t_1 / y_1

  j ≠ i:  p_j(n + 1) = p_j(n)(1 - a) = w·(v - u) / (v·x) = t_2 / y_2

β = 1:

  j = i:  p_j(n + 1) = p_j(n)(1 - a) = w·(v - u) / (v·x) = t_3 / y_3

  j ≠ i:  p_j(n + 1) = a/(r - 1) + p_j(n)(1 - a) = (x·u + w·(r - 1)·(v - u)) / (v·x·(r - 1)) = t_4 / y_4
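
These four cases translate directly into integer arithmetic. A minimal C sketch, in which the gcd reduction is an addition of the sketch (to keep numerator and denominator small) and all names are illustrative:

/* p_j(n) is stored as the integer fraction w/x, the learning parameter as a = u/v
   (linear reward-penalty scheme, a = b). */
static unsigned long gcd(unsigned long m, unsigned long n)
{
    while (n != 0) { unsigned long tmp = m % n; m = n; n = tmp; }
    return m;
}

/* beta: 0 = reward, 1 = penalty; chosen: 1 if j = i, 0 if j != i; r: number of actions.
   On return, *w / *x holds p_j(n+1) as a reduced fraction t/y. */
void update_fraction(unsigned long *w, unsigned long *x,
                     unsigned long u, unsigned long v,
                     int beta, int chosen, int r)
{
    unsigned long t, y;

    if (beta == 0 && chosen)        { t = v * (*w) + u * (*x - *w);            y = v * (*x); }            /* t_1 / y_1 */
    else if (beta == 0 && !chosen)  { t = (*w) * (v - u);                      y = v * (*x); }            /* t_2 / y_2 */
    else if (beta == 1 && chosen)   { t = (*w) * (v - u);                      y = v * (*x); }            /* t_3 / y_3 */
    else                            { t = (*x) * u + (*w) * (r - 1) * (v - u); y = v * (*x) * (r - 1); }  /* t_4 / y_4 */

    unsigned long g = gcd(t, y);    /* reduce the fraction (addition of this sketch) */
    *w = t / g;
    *x = y / g;
}

With the values of chapter 2, a = 1/2 (u = 1, v = 2) and p_j(n) = 1/4 (w = 1, x = 4), a reward for the chosen action gives t = 2·1 + 1·(4 - 1) = 5 and y = 2·4 = 8, i.e. exactly the 5/8 computed earlier.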

At the end of this chapter, we discuss the need for stopping criteria. As mentioned, the robot learns asymptotically, i.e. a certain probability converges to 0 or 1 only for n → ∞. The rate of convergence can be adjusted by the learning parameter a. The learning parameter is no universal constant: to choose the right value, some experiments need to be done, and it depends on the task the robot should learn. If the value of the learning parameter is too small, it takes a long time until the robot learns a task; if the value is too large, the robot might misinterpret data from the environment.
As all microcontrollers that support floating-point math have a limited number of digits after the decimal point, the convergence criterion is not n → ∞; it just depends on the number of digits after the decimal point. If for instance 5 digits after the decimal point can be represented, the microcontroller will interpret 0.000001 as 0 and 0.999999 as 1. That isn't a problem for our project and can be used instead of implementing stopping criteria.
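
If one nevertheless wants to detect that state in code, a precision-based check could look like the following sketch; the threshold EPS is an assumed value that has to be matched to the float format of the target microcontroller:

#define EPS 1e-5f   /* assumed precision threshold */

/* Returns the index of the action whose probability has effectively reached 1,
   or -1 if the probabilities have not converged yet. */
int converged_action(const float p[], int r)
{
    for (int i = 0; i < r; i++) {
        if (p[i] >= 1.0f - EPS)
            return i;
    }
    return -1;
}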
Stopping criteria only make sense if the environment does not change. Imagine a robot arm that should learn to lift a cup from a fixed position. After the robot has learned to lift the cup, the learning process can be stopped; it just repeats what it has learned. But if the position of the cup changes unpredictably, the learning process never stops.
