
Reinforcement Learning Models
Peter Sords

Reinforcement Learning
Reinforcement learning theories seek to explain how an
agent learns to take actions in an environment so as to
maximize some notion of cumulative reward
Reinforcement learning differs from standard supervised
learning in that correct input/output pairs are never presented,
nor are sub-optimal actions explicitly corrected
Simple behavioral observations reveal a biological wisdom
encoded in our nervous system for computing optimal solutions
Reinforcement Learning
In 1927, Pavlov first demonstrated how cognitive connections
can be created or degraded based on predictability, in his
famous experiment in which he trained his dogs by ringing a
bell (stimulus) before feeding them (reward)
After repeated trials, the dogs started to salivate upon hearing
the bell, before the food was delivered.
The bell is the conditioned stimulus (CS) and the food is the
unconditioned stimulus (US)
Further experimental evidence led to the hypothesis that
reinforcement learning depends on the temporal difference (TD)
between the CS and the US, and on the predictability of the reward
Temporal Difference (TD) Model
The difference between the actual reward occurrence and
the predicted reward occurrence is referred to as the reward
prediction error (RPE)
This concept has been employed in the temporal difference (TD)
model to model reinforcement learning
The TD model uses the reward prediction error of a learned
event to recalibrate the odds of that event in the future
The goal of the model is to compute a desired prediction signal
which reflects the sum of future reinforcement


Reinforcement Learning Models
Reinforcement learning models (like the TD model) are typically composed
of four main components:
1. A reward function, which attributes a desirability value to a state.
The reward function is often a one-dimensional scalar r(t)
2. A value function (also known as the critic), which determines the
long-term desirability of a state
3. A policy function (also known as the actor), which maps the agent's
states to possible actions, using the output of the value function
(the reinforcement signal) to determine the best action to choose
4. A model of the environment, which includes a representation of
the environment dynamics required for maximizing the sum of
future rewards
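
As a rough, hypothetical sketch (the corridor task, function names, and values below are invented for illustration, not taken from the slides), these four components map onto code roughly as follows:

```python
import random

# A minimal sketch of the four components on a hypothetical 1-D corridor task.
N_STATES, GOAL = 5, 4

def reward_fn(state):                      # 1. reward function r(t): desirability of a state
    return 1.0 if state == GOAL else 0.0

value = [0.0] * N_STATES                   # 2. value function (critic): long-term desirability

policy = {s: {"left": 0.5, "right": 0.5}   # 3. policy (actor): maps states to action probabilities
          for s in range(N_STATES)}

def env_model(state, action):              # 4. model of the environment dynamics
    next_state = max(0, min(N_STATES - 1, state + (1 if action == "right" else -1)))
    return next_state, reward_fn(next_state)

# One interaction step under the current (uniform) policy
s = 0
a = random.choices(list(policy[s]), weights=list(policy[s].values()))[0]
s_next, r = env_model(s, a)
print(s, a, s_next, r)
```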


Dopamine neurons
Dopaminergic (DA) neurons (i.e., neurons whose primary neurotransmitter is
dopamine) have been known to play important roles in behavior, motivation,
attention, working memory, reward and learning
Wolfram Schultz and colleagues (1997) have shown that the reward prediction error
of the TD model resembles midbrain dopamine (DA) neuron activity in
situations with predictable rewards
During classical conditioning, unexpected rewards triggered an increase in
phasic activity of DA neurons
As learning progresses, the phasic burst shifts from the time of expected reward
delivery to the time the conditioned stimulus (CS) is given
The conditioned response to CS onset grows over learning trials whereas the
unconditioned response to actual reward delivery declines over learning trials
After learning, the omission of an expected reward induces a dopamine cell
pause, i.e., a depression in firing rate to a below-baseline level, at the time of
expected reward delivery.
DA neurons encode prediction error
DA activity predicts rewards before they occur rather than
reporting them only after the behavior
Dopamine neural activity was measured while macaque monkeys were presented with distinct
visual stimuli that specified both the probability and magnitude of receiving a reward
TD Model: The Critic
The Critic: The TD algorithm is used as an adaptive critic for
reinforcement learning, to learn an estimate of a value function,
the reward prediction V(t), representing the expected total
future reward from any state
Reward prediction V(t) is computed as the weighted sum over
multiple stimulus representations x_m(t):

V(t) = Σ_m w_m(t) x_m(t)

V(t) is adjusted throughout learning by updating the weights
w_m(t) of the incoming neuron units according to the TD model
TD Model: The Critic
The TD model represents a temporal stimulus as multiple
incoming signals x_m(t), each multiplied by a corresponding
adaptive weight w_m(t)
TD Model: The Critic
The TD model calculates the reward prediction error δ(t)
based on the temporal difference between the current
discounted value function γV(t) and the previous time step
V(t−1):

δ(t) = r(t) + γV(t) − V(t−1)

where
V(t) = the weighted input from the active unit at time t
r(t) = the reward obtained at time t
γ = the discount factor, reflecting the decreased impact of more
distant rewards
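
As a minimal numerical sketch (the stimulus representation, weights, and discount factor below are assumed values, not from the slides), V(t) and δ(t) can be computed as:

```python
import numpy as np

GAMMA = 0.98   # discount factor (assumed value)

def value(w, x):
    """Reward prediction V(t) as the weighted sum over stimulus representations x_m(t)."""
    return float(np.dot(w, x))

def td_error(r_t, v_t, v_prev):
    """delta(t) = r(t) + gamma*V(t) - V(t-1)"""
    return r_t + GAMMA * v_t - v_prev

# Hypothetical example: two stimulus units, reward of 1 delivered at time t
w = np.array([0.5, 0.2])
x_prev = np.array([1.0, 0.0])    # stimulus representation at t-1
x_t = np.array([0.0, 0.0])       # stimulus off at time t
delta = td_error(r_t=1.0, v_t=value(w, x_t), v_prev=value(w, x_prev))
print(delta)   # 1.0 + 0.98*0.0 - 0.5 = 0.5, i.e. the reward is still under-predicted
```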
TD Model: The Value Function
The prediction error δ(t) is used to improve the estimate of the
reward value V(t) for later trials as follows:

V(t) ← V(t) + α δ(t)

where α is the learning rate
For a given policy and a sufficiently small α, the TD learning
algorithm converges with probability 1
This adjusts the weights w_m(t) of the incoming stimuli:

w_m(t) ← w_m(t−1) + α δ(t) x_m(t−1)

If the reward prediction V(t) for a stimulus is underestimated, then
the prediction error δ(t) is positive and the adaptive weight w_m(t) is
increased. If the value of the reward is overestimated, then δ(t) is
negative and the synaptic weight is decreased
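
Continuing the sketch above (the learning rate, discount factor, and single-unit stimulus are assumed for illustration), the two update rules translate directly into code:

```python
import numpy as np

GAMMA, ALPHA = 0.98, 0.1    # discount factor and learning rate (assumed values)

def td_update(w, x_prev, x_t, r_t):
    """One TD learning step: compute delta(t) and adjust the weights w_m."""
    v_prev, v_t = np.dot(w, x_prev), np.dot(w, x_t)
    delta = r_t + GAMMA * v_t - v_prev       # delta(t) = r(t) + gamma*V(t) - V(t-1)
    w = w + ALPHA * delta * x_prev           # w_m(t) <- w_m(t-1) + alpha*delta(t)*x_m(t-1)
    return w, delta

# Repeated trials: a single stimulus unit followed by a reward of 1
w = np.array([0.0])
for trial in range(50):
    w, delta = td_update(w, x_prev=np.array([1.0]), x_t=np.array([0.0]), r_t=1.0)
print(w, delta)   # the weight approaches 1 and delta shrinks toward 0 as the reward becomes predicted
```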
TD Model: The Actor
The goal of a TD learning agent, as for
every reinforcement learning agent, is to
maximize the accumulated reward it
receives over time
The actor module learns a policy π(s,a),
which gives the probability of selecting
an action a in a state s. A common
method of defining a policy is given by
the Gibbs softmax distribution:

π(s,a) = e^{p(s,a)} / Σ_b e^{p(s,b)}

where p(s,a) is known as the preference
of action a in state s and the index b runs
over all possible actions in state s
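
A small sketch of the Gibbs softmax policy (the action names and preference values are hypothetical):

```python
import numpy as np

def softmax_policy(preferences):
    """pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b)), over the actions available in state s."""
    prefs = np.array(list(preferences.values()), dtype=float)
    probs = np.exp(prefs - prefs.max())      # subtract the max for numerical stability
    probs /= probs.sum()
    return dict(zip(preferences.keys(), probs))

# Hypothetical preferences p(s, a) for one state
p_s = {"left": 0.0, "press_lever": 2.0, "right": 0.5}
print(softmax_policy(p_s))   # 'press_lever' receives the highest selection probability
```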
TD Model: The Actor
The preference of the chosen action a in state s is adjusted to
make the selection of this action correspondingly more or
less likely the next time the agent visits that state. One
possibility to update the preference in the actor-critic
architecture is given by:

p(s_n, a_n) ← p(s_n, a_n) + β δ_n(t)

where β is another learning rate that relates different states to
corresponding actions
The actor (neuron) learns
stimulus-action pairs under
the influence of the
prediction-error signal of
the critic (TD model)
The actor module learns a
policy π(s,a), which gives
the probability of selecting
an action a in a state s.
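
Putting the two modules together, a minimal sketch of one actor-critic step might look like the following (the two-state example, learning rates, and discount factor are assumptions for illustration):

```python
GAMMA, ALPHA, BETA = 0.98, 0.1, 0.1   # discount factor, critic and actor learning rates (assumed)

def actor_critic_step(V, p, s, a, r, s_next):
    """The critic computes delta(t); the actor nudges the preference of the chosen action by beta*delta."""
    delta = r + GAMMA * V[s_next] - V[s]      # reward prediction error from the critic
    V[s] += ALPHA * delta                     # V(t) <- V(t) + alpha*delta(t)
    p[s][a] += BETA * delta                   # p(s_n, a_n) <- p(s_n, a_n) + beta*delta_n(t)
    return delta

# Hypothetical two-state example: taking 'go' in state 0 leads to state 1 and a reward
V = {0: 0.0, 1: 0.0}
p = {0: {"go": 0.0, "stay": 0.0}, 1: {"go": 0.0, "stay": 0.0}}
delta = actor_critic_step(V, p, s=0, a="go", r=1.0, s_next=1)
print(delta, V[0], p[0])   # a positive delta raises both the value of state 0 and the preference for 'go'
```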


Modeling the DA prediction error signal
One drawback of the model is run-away synaptic strength.
The TD model computes the reward prediction error δ(t) from
discounted temporal differences in the prediction signal p(t)
and from the reward signal:

δ(t) = reward(t−100) − [p(t−100) − γ·p(t)]

(time t in ms; 100 ms is the step size of the model
implementation). The reward prediction error is phasically
increased above its baseline level of zero for rewards and for
reward-predicting stimuli when these events are unpredicted,
but remains at baseline when they are predicted. In addition,
if a predicted reward is omitted, the reward prediction error
decreases below baseline at the time the predicted reward
fails to occur.
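
The following sketch simulates this behaviour for one trial discretized in 100 ms steps; the trial layout, the hand-set prediction signal p(t), and the discount factor are all assumptions made for illustration:

```python
import numpy as np

GAMMA = 0.98
T = 30                       # a 3 s trial, one index per 100 ms step

def prediction_error(p, reward):
    """delta(t) = reward(t-100 ms) - [p(t-100 ms) - gamma*p(t)]"""
    delta = np.zeros(T)
    for t in range(1, T):
        delta[t] = reward[t - 1] - (p[t - 1] - GAMMA * p[t])
    return delta

p = np.zeros(T); p[10:21] = 1.0            # learned prediction: CS at 1.0 s, reward expected at 2.0 s
delivered = np.zeros(T); delivered[20] = 1.0
omitted = np.zeros(T)                      # the expected reward is withheld

print(prediction_error(p, delivered).round(2))  # burst at the CS, ~0 at the (predicted) reward time
print(prediction_error(p, omitted).round(2))    # burst at the CS, dip below baseline after the omitted reward
```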
The agent-environment interface
In reinforcement learning a distinction is made between the
elements of a problem that are controllable and those which
are only observable. The controllable aspects are said to be
modified by an agent, and the observable aspects are said to
be sensed through the environment




At each point in time, the environment is in some state, s_t, which is
one of a finite set of states
The agent has access to that state and, based on it, takes some
action, a_t, which is one of the finite set of actions available in that
state, A(s_t)
That action has some effect on the environment, which pushes it
into its next state, s_{t+1}
The environment also emits a scalar reward value, r_{t+1}
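
A generic sketch of this interface (the two-state environment below is a made-up example, used only to show the loop):

```python
import random

def actions(state):                       # A(s_t): the finite set of actions available in state s_t
    return ["stay", "switch"]

def step(state, action):                  # environment: returns the next state s_{t+1} and reward r_{t+1}
    next_state = 1 - state if action == "switch" else state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

s = 0
for t in range(5):
    a = random.choice(actions(s))         # the agent senses s_t and selects a_t
    s_next, r = step(s, a)                # the environment emits s_{t+1} and the scalar reward r_{t+1}
    print(f"t={t}: s={s}, a={a}, r={r}, s_next={s_next}")
    s = s_next
```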


Future Prospects
Further development and understanding of learning models
and algorithms can be applied in robotics
Reinforcement learning models are useful in computer science
for creating algorithms that can self-optimize when solving tasks

The goal of a reinforcement learning algorithm is to choose
the best actions for the agent.
The simplest formulation is to sum all the rewards the agent will ever receive:

R = Σ_{t=0}^{∞} r_{t+1}

It is, however, impossible to reason about rewards delivered
for all of time, so rewards are weighted such that those
delivered sooner are weighted more heavily, and rewards delivered
very far in the future are ignored entirely. To do this, we add a
discount factor 0 ≤ γ < 1 to the above equation:

R = Σ_{t=0}^{∞} γ^t r_{t+1}
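
For illustration, the discounted return can be computed over a finite sequence of rewards (the reward values and γ below are arbitrary):

```python
# Discounted return R = sum_{t>=0} gamma^t * r_{t+1}, truncated to a finite list of rewards
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0, 1.0]))   # later rewards contribute less: 0.81 + 0.729 = 1.539
```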

Disturbances of Dopamine
Perhaps the best understood pathology of dopamine excess is drug
addiction
Addictive drugs such as cocaine, amphetamine and heroin all
increase dopamine concentrations in the nucleus accumbens (NAc)
and other forebrain structures
Disturbances of dopamine function are also known to have a
central role in schizophrenia, which is associated with a
hyper-dopaminergic state
The development of the formal models of dopamine function
discussed above, and its interaction with other brain systems,
offers hope for a more sophisticated understanding of how
dopamine disturbances produce the patterns of clinical
psychopathology observed in schizophrenia
Fiorillo et al. (2003)
The study demonstrated that
dopamine may also code for
uncertainty
The Fiorillo experiment
associated the presentation of
five different visual stimuli to
macaque monkeys with the
delayed, probabilistic (p_r = 0,
0.25, 0.5, 0.75, 1) delivery of
juice rewards, where p_r is the
probability of receiving the
reward
Fiorillo et al. (2003)
They used a delay conditioning paradigm, in which the
stimulus persists for a fixed interval of 2s, with the reward
being delivered when the stimulus disappears
TD theory predicts that the phasic activation of the DA cells at
the time of the visual stimuli should correspond to the average
expected reward, and so should increase with p_r. This is exactly
what is seen to occur.
What the current TD model fails to explain in results such as
these, however, is the ramping of activity between the CS and
the reward, and the small response at the expected time of reward

Fiorillo et al. (2003)
The resulting neuron activity, after learning had taken
place, showed:
(i) A phasic burst of activity, or reward prediction error, at the
time of the expected reward, whose magnitude increased as
probability decreased; and
(ii) a new slower, sustained activity, above baseline, related to
motivationally relevant stimuli, which developed with increasing
levels of uncertainty, and varied with reward magnitude
This sustained dopamine neuron firing is greatest when the uncertainty
of the reward is at a maximum (i.e., at a 50% probability of reward)
