
Using the GTSOM network for mobile robot navigation with reinforcement learning

Mauricio Menegaz1, Paulo M. Engel1


1 Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Caixa Postal 15.064, 91.501-970, Porto Alegre, RS, Brazil
{mmenegaz,engel}@inf.ufrgs.br

Abstract. This paper describes a model for an autonomous robotic agent that is capable of mapping its environment, creating a state representation and learning how to execute simple tasks using this representation. The multi-level architecture developed is composed of the execution level, responsible for interaction with the environment, the clustering level, which maps the input received from sensor space into a compact representation, and the planning level, which learns the correct behaviour in order to achieve the goal. The model was implemented in software and tested in an experiment that consists of finding the path through a maze. Results show that it can divide the state space in a meaningful and efficient way and learn how to execute the given task.

1. Introduction
This paper describes a model to be used in robotics. One of the big challenges in this area is the need to deal with multi-dimensional sensor data, with a lot of redundancy and temporal dependency. Using this input, the robot should be able to extract the information that is relevant to the task execution and build a memory representation of it. Classic approaches to the mapping problem include topological maps (for example [Kuipers and Byun 1988]), grid maps [Moravec and Elfes 1985] or a combination of both [Thrun 1998]. Also important is the problem of localization. This is especially difficult when the map of the environment is not previously known. In this case, the map must be built online, while the environment is being explored. However, in order to interpret the data read from the sensors, the robot needs to know its own location in the map. To help prevent errors, the sensor data can be combined with odometric readings using some method for modelling and reducing the error (see [Machado 2003]).

In the current work, we propose a model that uses a self-organizing neural network to build the state representation from the sensor data. The chosen algorithm was the Growing Temporal Self-Organizing Map (GTSOM) [Bastos 2007]. Once the state representation is available, it can be used for guiding the robot's actions. This is achieved with a reinforcement learning approach: the robot has no previous knowledge of the task and must learn it by trial and error. The work in [Li and Pang 2006] is somewhat similar to ours, even though the neural network algorithm is not the same. The differences between the model proposed here and the one by Li and Pang will be explained in this paper.

The remainder of the text is organized as follows: section 2 introduces the GTSOM network. Next, in section 3, we describe the proposed model in detail. Section 4 presents the results obtained by the simulated model in the maze task. Finally, we show the contributions and future work in section 5.

2. The GTSOM Network


The GTSOM network is a temporal self-organizing map whose structure changes during use to best fit the input data. It does not have a fixed shape, as nodes and the connections between them are added and deleted during operation. An overview of its algorithm is presented here; for further details, see [Bastos 2007]. The network is composed of the input layer, a set of neurons and a set of arcs. For each component of the input vector there is a memory unit, used to extract temporal features that are implicit in the input data. The memory is based on the differentiator-integrator neuron [Moser 2004], defined by equation 1:

x̄(t) = (1 − α) x(t) + α x̄(t − 1)
y(t) = (1 − β) (x̄(t) − x̄(t − 1)) + β y(t − 1)          (1)

where α and β are constant values between 0 and 1 and y(t) is the memory's output at time t. The initial conditions are x̄(0) = 0 and y(0) = 0. The vector z, formed by concatenating the input vector with the memory units' output, is fully connected to a set of neurons. Each of these neurons has a vector of input weights wi(t) and a habituation unit wh. The input weights are used to determine the winning neuron, in a way that is similar to the conventional Kohonen network. The habituation unit guides the neuron inclusion/exclusion process. At each time step, the algorithm performs the following steps: selection of the winning unit, weight adjustment, neuron inclusion/exclusion and arc inclusion/exclusion. The habituation unit is based on the model by [Wang 1998] and has its behaviour defined by equation 2:

wh(t) = wh(t − 1) + τ (κ (1 − wh(t − 1)) − δ wh(t − 1) S(t))          (2)

where τ, κ and δ are constants that control the rate of habituation increase and decay, and S(t) is a vector that represents the stimuli received from an external source at time t. Neurons whose weight vector is similar to the input receive stimuli close to 1, as seen in equation 3:

Si(t) = exp(−||z(t) − wi(t)||)          (3)
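To make equations 1-3 concrete, the following Python fragment sketches a single memory unit together with the stimulus and habituation updates. It is only an illustration of the formulas as reconstructed above: the class and function names are ours and the default constants are placeholders, not values prescribed by GTSOM.

```python
import numpy as np

class DifferentiatorIntegrator:
    """Memory unit of equation 1 (one instance per input component)."""
    def __init__(self, alpha=0.5, beta=0.5):
        self.alpha, self.beta = alpha, beta
        self.x_bar = 0.0  # integrated input, x̄(0) = 0
        self.y = 0.0      # memory output,    y(0) = 0

    def step(self, x):
        x_bar_prev = self.x_bar
        self.x_bar = (1 - self.alpha) * x + self.alpha * x_bar_prev
        self.y = (1 - self.beta) * (self.x_bar - x_bar_prev) + self.beta * self.y
        return self.y

def stimulus(z, w):
    """Equation 3: stimulus close to 1 when the prototype w matches the input z."""
    return np.exp(-np.linalg.norm(z - w))

def habituate(w_h, s, tau=0.05, kappa=1.0, delta=1.0):
    """Equation 2: habituation drifts towards 1 when the neuron receives
    little stimulus and is pushed back down when it is stimulated."""
    return w_h + tau * (kappa * (1.0 - w_h) - delta * w_h * s)
```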

The winning unit is chosen as the one whose weight vector is at the minimal Euclidean distance from the input vector. Then, for all neurons in the network, the weight vector is adjusted by equation 4:

wi(t + 1) =
    wi(t) + ev(t) wh(t) [z(t) − wi(t)]    if i = v
    wi(t) + en(t) wh(t) [z(t) − wi(t)]    if i is directly connected to v          (4)
    wi(t)                                  otherwise

ev(t) and en(t) are the learning rates for the winning neuron and its direct neighbours. They are calculated by equation 5:

p(t) = max(||z(t) − wv(t)||, p(t − 1))
ev(t) = ||z(t) − wv(t)|| / p(t)
en(t) = ev(t) / 10          (5)
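A minimal sketch of the winner selection and weight adjustment steps (equations 4 and 5) follows. The data layout (a prototype matrix, a habituation array and a set of arcs) is our own choice, and we read wh(t) in equation 4 as the habituation of the neuron being adjusted; the original algorithm in [Bastos 2007] may organize this differently.

```python
import numpy as np

def gtsom_weight_update(z, W, w_h, edges, p_prev):
    """One step of winner selection and weight adjustment.
    W: (n_neurons x dim) prototype matrix, w_h: habituation values,
    edges: set of (i, j) arcs, p_prev: running maximum from equation 5."""
    dists = np.linalg.norm(W - z, axis=1)
    v = int(np.argmin(dists))                # winner: smallest Euclidean distance

    p = max(dists[v], p_prev)                # equation 5
    e_v = dists[v] / p if p > 0 else 0.0     # learning rate of the winner
    e_n = e_v / 10.0                         # learning rate of its neighbours

    for i in range(W.shape[0]):              # equation 4
        if i == v:
            W[i] += e_v * w_h[i] * (z - W[i])
        elif (i, v) in edges or (v, i) in edges:
            W[i] += e_n * w_h[i] * (z - W[i])
        # all other prototypes remain unchanged
    return v, p
```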

When the quantization error is large and the input vector is unknown to the network, a new neuron is created. This is mathematically defined by equation 6:

exp(−||z(t) − wv(t)||) < ρ   and   wh(t) < λ          (6)

where ρ is used to control the network's granularity and λ controls the network's growing speed. The removal of a neuron can happen in two cases: when it is not connected to any other unit in the network, or when its habituation unit has a value very close to 1, which indicates that the neuron is not being used. The connections between nodes can also be created and removed at each time step. When the winning neuron is not connected to the one with the second-best prototype, a connection between them is added. Each connection has an associated age counter. This counter is increased at each step and is reset when the connected neurons are chosen as winners. If the age counter reaches a limit (imax), the connection is removed.
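The insertion and arc-maintenance rules can be summarized as below. This is a sketch under the same assumptions as the previous fragments; rho and lam stand for the two insertion constants of equation 6 and their default values are arbitrary.

```python
import numpy as np

def should_insert_neuron(z, W, w_h, v, rho=0.1, lam=0.5):
    """Equation 6: insert a neuron when the winner matches the input poorly
    and its habituation is still low (i.e. the winner is already well used)."""
    poor_match = np.exp(-np.linalg.norm(z - W[v])) < rho
    return poor_match and w_h[v] < lam

def update_arcs(ages, v, second, i_max=100):
    """Arc maintenance: connect the winner to the second-best unit, reset that
    arc's age counter, age every other arc and drop arcs older than i_max.
    `ages` maps an arc (i, j), with i < j, to its age counter."""
    key = (min(v, second), max(v, second))
    ages[key] = 0
    for arc in list(ages):
        if arc != key:
            ages[arc] += 1
            if ages[arc] > i_max:
                del ages[arc]
    return ages
```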

3. Proposed Model
The proposed model was built using the GTSOM network together with a grid map module in order to create a discrete state space on top of which standard reinforcement learning methods can be applied. Figure 1 shows the main components of the model. The architecture is divided into three levels. At the execution level are the components that interact directly with the outside world, receiving information through distance and odometry sensors and sending commands to the robot's actuators. The clustering level is responsible for mapping the sensor readings into a compact state representation that contains the minimum amount of information relevant to the decision making process. This is done by three components: the GTSOM network, the grid map module and the state identification (clustering) module. The last level, called planning, is where the action policy learning takes place. It uses the state information generated in the clustering level, together with a reinforcement signal, in order to decide the best action to execute at each time step.

3.1. Input Interfaces

The input interfaces perform pre-processing on the data read from the sensors. In the simulated experiments, the two data sources are distance sensors and odometry readings. The first provide the distance to the nearest obstacle in each direction and are used by the GTSOM network and the mapping module. The latter are used by the grid map algorithm to update the robot's location.

Figure 1. Model Overview
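To show how the three levels of Figure 1 interact at run time, the following fragment sketches one control cycle. Every interface name here (robot, gtsom, grid_map, planner) is hypothetical; the paper does not prescribe this API. The helper identify_state is sketched in section 3.2 below.

```python
def agent_step(robot, gtsom, grid_map, planner, prev_state, prev_action):
    """One interaction cycle across the execution, clustering and planning levels."""
    distances = robot.read_distance_sensors()    # execution level: raw sensor input
    odometry = robot.read_odometry()

    node = gtsom.present(distances)              # clustering level: sensorial category
    x, y = grid_map.update(distances, odometry)  # self-localization on the grid map
    state = identify_state(node, x, y)           # Markovian state = node + location

    reward = robot.last_reward()
    action = planner.step(prev_state, prev_action, reward, state)  # planning level
    robot.execute(action)                        # execution level: actuator command
    return state, action
```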

3.2. The Clustering Level

The GTSOM network performs a central role in the clustering level. Its task is to transform the multidimensional continuous sensor readings into a discrete representation suitable for use with conventional reinforcement learning. Due to the characteristics of the network, it is able to perform online learning, without the need for a separate training phase. During operation, it automatically creates nodes to represent new, unseen situations and removes unused ones. The nodes created by the network can be seen as states in the robot's sensorial space. Each one corresponds to a situation, such as start of corridor, crossing, room entrance, etc. Although this is important information, it is not enough for decision making: using only distance sensors, the same sensor reading can be obtained at different locations in the environment. The grid map was included in the architecture to eliminate this sensorial ambiguity. So, if there are two similar corridors, for example, they will be represented at the clustering level as different states due to their location on the map, even though they correspond to the same node in the GTSOM network. This work does not focus on the grid map algorithm; any algorithm capable of performing self-localization by combining distance sensors and odometry information for error correction can be used. It is assumed that the output of the grid map module is an (x, y) coordinate in a bidimensional map. The Markovian state is created by combining the winning node from the GTSOM network with the coordinates from the grid map. For every node in the GTSOM network there is a list of possible Markovian states, each one associated with a location. Different nodes in the GTSOM network always lead to different Markovian states. A parameter R sets the threshold distance above which GTSOM nodes are broken into different states.
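A possible implementation of this state identification step is sketched below; the data structure holding the per-node state lists is our own, and only the role of the threshold R follows the description above.

```python
import math

# For every GTSOM node, the list of (x, y, state_id) entries already created.
_known_states = {}

def identify_state(node_id, x, y, R=40.0):
    """Map a (GTSOM node, grid location) pair to a Markovian state. A new state
    is created when the location is farther than R from every known state
    associated with this node."""
    entries = _known_states.setdefault(node_id, [])
    for sx, sy, state_id in entries:
        if math.hypot(x - sx, y - sy) <= R:
            return state_id
    state_id = (node_id, len(entries))           # new state for this node
    entries.append((x, y, state_id))
    return state_id
```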

3.3. The Planning Level

The planning level is responsible for deciding which action should be taken at each moment, using the state representation from the clustering level and the reinforcement received from the environment. The method follows the discrete reinforcement learning approach, considering that the reward is delayed in time and the model of the environment is unknown. The chosen algorithm was Q-Learning. However, the set of states is not fixed, but incrementally built during the learning process. At each time step, the agent checks the current state. When the state remains the same, no learning takes place and the agent keeps executing the same selected action. When the state changes, if the new state is not yet known, its Q-value is initialized with zero. Then, the previous state's Q-value is updated using the standard rule, as in equation 7:

Q(st−1, a) ← Q(st−1, a) + α [r + γ maxa′ Q(st, a′) − Q(st−1, a)]          (7)

The action selection policy is ε-greedy: the action with the highest Q-value is usually selected, but sometimes a random action is chosen. The parameter ε sets the frequency with which a random action is taken. It is important to note that the environment learning (at the clustering level) happens simultaneously with the value-function learning (at the planning level). This means that the planning level must deal with a changing state set.
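The learning rule and the ε-greedy policy described in this section could be written, for instance, as follows. The table layout and the handling of the very first step are our own simplifications; the parameter defaults mirror the values reported in section 4.

```python
import random
from collections import defaultdict

ACTIONS = ["turn_left", "turn_right", "forward"]

def plan_step(Q, s_prev, a_prev, r, s, alpha=0.9, gamma=0.95, epsilon=0.05):
    """Equation 7 over an incrementally built state set. Q maps each known
    state to a defaultdict of action values (new states start at zero)."""
    if s not in Q:                               # newly discovered state
        Q[s] = defaultdict(float)
    if s == s_prev:                              # same state: keep the same action
        return a_prev
    if s_prev is not None:                       # standard Q-Learning update
        best_next = max(Q[s][a] for a in ACTIONS)
        Q[s_prev][a_prev] += alpha * (r + gamma * best_next - Q[s_prev][a_prev])
    if random.random() < epsilon:                # epsilon-greedy action selection
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])
```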

4. Results
The proposed model was simulated in software and tested in a simple maze task. The agent has no previous knowledge of the environment and must find the location of a certain object. We assume that it is capable of recognizing the object once it gets sufficiently close to it. To navigate in the environment, the planning level chooses between 3 actions: turn 90° left, turn 90° right or keep moving in the same direction. There is no previous domain knowledge, meaning that learning how to avoid collisions is part of the task. The goal of the experiment is to test whether the combination of the GTSOM network with the grid map is able to divide the state space in a way that enables the agent to learn the task. The environment is a cross-shaped maze (figure 2). The agent must find an object that is placed at the end of one of the corridors. At each episode its initial position is set to one of the other 3 corridors. Each time the agent moves, it receives a reward of -1. When it collides with a wall, the reward is -100 and the agent goes back to one of the possible initial positions, starting a new episode. If it reaches the goal, it receives a reward of +100 and the episode is finished. This is different from what was done in [Li and Pang 2006], because in that work the agent received a positive reward when moving closer to the goal and a negative reward otherwise, which makes the learning task easier since the reward is immediate.
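For reference, the reward signal described above amounts to the following rule; only the numeric values come from the text, the function itself is illustrative.

```python
def maze_reward(collided, reached_goal):
    """Reward used in the maze experiment: -1 per move, -100 on collision
    (the episode restarts), +100 at the goal (the episode ends)."""
    if collided:
        return -100, True     # episode restarts from an initial position
    if reached_goal:
        return 100, True      # episode finished successfully
    return -1, False          # ordinary move
```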

Figure 2. Environment shape

The chosen parameters for the GTSOM network were 0.70 and 0.30 for the memory constants, 0.002, 0.9 and 0.7 for the habituation constants, imax = 100, and 1 for each of the two neuron-insertion thresholds. The threshold distance R for state identification was set to 40. For Q-Learning, we used α = 0.9, γ = 0.95 and ε = 0.05. The experiment was run for 150 episodes. For each episode, the total reward received by the agent was recorded as a way to evaluate the agent's performance. This result is shown in figure 3. Each dot corresponds to an episode, with the total reward on the y-axis. The line is a moving average of period 12. At first it might seem odd that in the first episodes the total reward decreases with time, as if the agent were performing worse at each episode. What is really happening is that it is learning to avoid collisions, therefore it keeps moving (and receiving a reward of -1) for a longer time, but still without reaching the goal. After about 50 episodes, the agent finds the goal. At this point it starts to increase its performance but still fails a few times, because of the exploration constant (ε = 0.05) associated with the action policy. One interesting point about the experiment is the way the GTSOM network segments the sensorial space. It formed 5 groups of neurons, each one associated with a certain feature (Figure 4f). Figures 4a-e show the locations on the map associated with each of the 5 formed groups. We can see that groups 1 and 3 correspond to the agent moving through a corridor. The difference between them is that in group 1 there is a wall close to the agent's back, while in group 3 the wall is closer to the front. Groups 4 and 5 correspond to movement perpendicular to the corridor, and group 2 is associated with the intersection.

Figure 3. Reward received

Figure 4. Node groups

5. Conclusion and Future Work

In this paper, we proposed an architecture for a robotic agent that is capable of creating a cognitive map combining sensorial and spatial features. Using these two aspects to define a state space, the agent is able to learn the appropriate path to the goal in a simple environment using discrete reinforcement learning. We have also demonstrated how the topology defined by the GTSOM network's nodes and arcs can be used to define sensorial categories. Results from simulation show that the model can segment the state space correctly and learn the task in a reliable way. The agent learns to avoid collisions and remembers the location of the goal, even when its initial position changes. Two points are identified as possible improvements. One is that, since the exploration strategy is generic, the agent tends to focus on regions that are already known and thus takes longer to find the goal. It is possible that an incentive to explore unknown areas could improve convergence speed. The other aspect that can be pointed out is the learning method. Algorithms that use eligibility traces can converge faster than standard Q-Learning. However, since in this model previously visited states can be deleted, it is not clear how to keep the updates consistent. This is subject to further investigation.

References
Bastos, E. N. F. (2007). Uma proposta de rede neural auto-organizável, temporal e construtiva voltada a aplicações robóticas. Master's thesis, UFRGS.

Kuipers, B. J. and Byun, Y.-T. (1988). A robust qualitative method for robot spatial learning. In Proceedings of the Seventh National Conference on Artificial Intelligence, pages 774-779.

Li, G. and Pang, J. (2006). A reinforcement learning with adaptive state space construction for mobile robot navigation. In Networking, Sensing and Control, 2006.

Machado, K. F. (2003). Módulo de auto-localização para um agente exploratório usando o filtro de Kalman. Master's thesis, UFRGS.

Moravec, H. P. and Elfes, A. (1985). High resolution maps from wide angle sonar. In Proceedings of the IEEE International Conference on Robotics and Automation.

Moser, L. D. (2004). Modelo de um neurônio diferenciador-integrador para representação temporal em arquiteturas conexionistas. Master's thesis, UFRGS.

Thrun, S. (1998). Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21-71.

Wang, D. (1998). Habituation. MIT Press, Cambridge, MA, USA.

