

Deep Q-Learning Explained
Mauricio Arango
June 20, 2018



Deep Q-Learning
Deep Q-Network (DQN)
Mnih et al. 2013, 2015, Google DeepMind

[Diagram: DQN agent and Atari 2600 console]

• Combined an improved version of Q-Learning with a deep convolutional neural network
• Used the same architecture to learn to play 49 different Atari 2600 video games
  – Performance comparable to or better than that of human players
• State, $s_t$: raw video pixels from the console – no feature engineering
• Reward, $r_t$: change in game score for the step
• Output action, $a_t$: joystick/button positions
• The focus of this talk is on Q-Learning and the improvements that enabled the successful use of deep networks with Q-Learning



Agenda
1 Reinforcement Learning Overview
2 Q-Learning
3 Deep Q-Learning



What is Reinforcement Learning?
• Type of machine learning where an agent learns to achieve a goal by trial
and error
• The agent tries actions and receives feedback (reinforcement) − a reward or punishment − that indicates how good or bad the action was
• The agent learns by adapting its behavior in such a way that the total reward received is maximized
• Applicable to a wide range of control and optimization problems involving sequential decision making
– Industrial automation & robotics
– Healthcare
– Finance



Reinforcement Learning Framework
Framework for sequential decision making

[Diagram: agent–environment interaction loop – the agent observes state $s_t$ and reward $r_t$ and emits action $a_t$; the environment returns the next state $s_{t+1}$ and reward $r_{t+1}$]

• Agent – learner and decision-maker


• Environment – set of things the agent interacts with
– State – a representation of the environment observable by the agent



Reinforcement Learning Framework
[Diagram: the agent–environment loop unrolled over time – states $s_0, s_1, s_2, \dots$, actions $a_0, a_1, a_2, \dots$, rewards $r_0, r_1, r_2, \dots$]

• Interaction occurs as a sequence of time steps, in each of which:
  – The agent observes state $s_t$ of the environment
  – Performs action $a_t$ selected by a policy function
  – As a consequence of the action, one step later the agent observes a new state $s_{t+1}$ and receives a numerical reward $r_{t+1}$
  – The agent adjusts its policy (value function) – adapts its behavior, learns
• The goal of the agent is to maximize the total reward it receives – this drives the policy (value function) adjustments
Reinforcement Learning Algorithms
Output is an optimal policy function

• Policy function – given a state, produces the action to execute
  – $\pi(s) \rightarrow a$

• Types of RL algorithms:
  – Value-based – learn the optimal action-value function $Q^*(s,a)$
    • Derive the policy from $Q^*(s,a)$ – Q-Learning
  – Policy-based – search directly for the optimal policy $\pi^*$



Gridworld example
Learning the action-value function $Q(s,a)$

• The environment is the grid


• States: $S = \{s_{1,1}, \dots, s_{3,4}\}$, one per grid cell
• Actions: $A = \{\text{up, right, down, left}\}$
• Reward (see the environment sketch below):
  – $-0.1$ on reaching any state except $s_{3,4}$ and $s_{2,4}$
  – $-1.0$ on reaching the fail state $s_{2,4}$
  – $+1.0$ on reaching the goal state $s_{3,4}$
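A minimal sketch of this environment in Python (the coordinate convention, the step dynamics and all names are illustrative assumptions; only the reward structure above is taken from the slide):

```python
# Hypothetical 3x4 gridworld: states are (row, col) cells, goal at (3, 4), fail at (2, 4).
ACTIONS = {"up": (1, 0), "right": (0, 1), "down": (-1, 0), "left": (0, -1)}
GOAL, FAIL = (3, 4), (2, 4)

def step(state, action):
    """Apply an action; stay in place when the move would leave the grid."""
    row = min(max(state[0] + ACTIONS[action][0], 1), 3)
    col = min(max(state[1] + ACTIONS[action][1], 1), 4)
    next_state = (row, col)
    if next_state == GOAL:
        reward = +1.0      # reaching the goal state
    elif next_state == FAIL:
        reward = -1.0      # reaching the fail state
    else:
        reward = -0.1      # any other transition
    return next_state, reward

print(step((3, 3), "right"))   # ((3, 4), 1.0) -- the agent reaches the goal state
```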



Gridworld example
Computing actions from Q-values

• Assume the optimal values $Q^*(s,a)$ for each $(s,a)$ pair have been calculated – the Q-values
• The policy $\pi^*(s)$ for each state is then very simple to obtain:
  – Choose the action that yields the highest Q-value (see the sketch below):

$$\pi^*(s) = \operatorname*{argmax}_a Q^*(s,a)$$
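A minimal sketch of this policy extraction, assuming the learned Q-values are stored in a NumPy array indexed by state and action (the array contents and names here are placeholders):

```python
import numpy as np

ACTIONS = ["up", "right", "down", "left"]
n_states = 12                                        # 3x4 gridworld cells
q_values = np.random.rand(n_states, len(ACTIONS))    # stands in for learned Q*(s, a)

def greedy_policy(state_index: int) -> str:
    """pi*(s) = argmax_a Q*(s, a): pick the action with the highest Q-value."""
    return ACTIONS[int(np.argmax(q_values[state_index]))]

print(greedy_policy(0))
```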



Q-Values
Dynamic programming – Bellman equation

• The solution to this problem is based on dynamic programming and takes the form of an iterative update – the Bellman update equation:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

  – the left-hand side is the next estimate; $Q(s,a)$ inside the brackets is the current estimate
  – $\alpha$ is the learning rate
  – $r + \gamma \max_{a'} Q(s',a')$ is the target value (the Bellman equation)
  – the bracketed difference is the error $\Delta(s,a,r,s')$, so the update can also be written

$$Q(s,a) \leftarrow Q(s,a) + \alpha\, \Delta(s,a,r,s')$$

• The agent interacts with the environment and obtains samples $\langle$current state, action, reward, next state$\rangle$ = $\langle s, a, r, s' \rangle$ – a code sketch of the update follows below
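A minimal sketch of this update for a tabular agent, assuming the Q-values live in a NumPy array and that the hyperparameter values shown are illustrative:

```python
import numpy as np

ALPHA = 0.1   # learning rate
GAMMA = 0.99  # discount factor

def bellman_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int) -> None:
    """Apply one Q-learning update from a single sample <s, a, r, s'>."""
    target = r + GAMMA * np.max(Q[s_next])   # target value (Bellman equation)
    error = target - Q[s, a]                 # delta(s, a, r, s')
    Q[s, a] += ALPHA * error                 # move the current estimate toward the target
```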



Q-Learning
Lookup table-based
• Lookup table
– One entry per (s, a) pair
– Initialized with random Q-values
• Use the Bellman update equation to iteratively update the $Q(s,a)$ estimates
• The Q-values converge to the optimal values: $Q(s,a) \rightarrow Q^*(s,a)$
• Limitations:
  – Only feasible for small state-action spaces
  – In large state-action spaces many regions may never be visited – it takes many iterations to cover even a limited fraction of the space
  – Does not generalize to unvisited state-action pairs



Q-Learning
Function approximation
• Construct approximation of Q-function from observed examples of agent interaction
with environment
• Generalize from states agent has visited to states it has not visited
• Makes possible massive reduction in states that need to be visited to reach an
approximate solution
• Function approximation is what supervised learning does; in this case it is regression

[Diagram: experience samples $(s, a, r, s')$ feed a function approximator with parameter vector $w$, which outputs $Q(s,a,w)$]


Q-Learning
Function approximation with online updates
• Regression: find $Q(s,a)$ such that $Q(s,a) = y$, given $s$, $a$ and $y$
  – $X$ are the state-action pairs $(s,a)$
  – $y$ (the targets) are the desired values estimated with the Bellman equation: $y = r + \gamma \max_{a'} Q(s',a',w)$

• Regression requirements – fulfilled through neural networks


– Nonlinear functions
– Capability to handle both online and batch modes – incremental updates
• With a neural network, 𝑄 is a parameterized function with weights w
– Instead of updating Q values on each iteration, update parameter vector w that defines the function:
𝒘 ← 𝒘 + ∆𝒘
• This can be realized through gradient descent methods



Q-Learning
Neural network function approximation with online updates

• Use a neural network trained with stochastic gradient descent and backpropagation, where:

$$\text{error}(s,a,r,s',w) = r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w)$$

$$\text{Mean Squared Error (MSE)} = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w)\big)^2\big]$$

• The resulting weight update (a code sketch follows below):

$$w \leftarrow w + \alpha \big[ r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w) \big] \nabla_w Q(s,a,w)$$

  – the gradient $\nabla_w Q(s,a,w)$ is obtained with backpropagation
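A minimal sketch of one such online update as a gradient step on the squared error (PyTorch is assumed, and the sketch reuses the hypothetical `q_net` from the function-approximation sketch above):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def online_update(state, action, reward, next_state):
    """One stochastic-gradient step toward the Bellman target for a single transition."""
    q_sa = q_net(state.unsqueeze(0))[0, action]        # Q(s, a, w)
    with torch.no_grad():                              # target treated as a constant
        target = reward + GAMMA * q_net(next_state.unsqueeze(0)).max()
    loss = F.mse_loss(q_sa, target)                    # squared TD error
    optimizer.zero_grad()
    loss.backward()                                    # gradient via backpropagation
    optimizer.step()                                   # w <- w + alpha * error * grad_w Q(s, a, w)
```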



Q-Learning
Neural network function approximation with online updates

[Diagram: data flow of online Q-Learning – one network box computes $Q(s,a,w)$ and its gradient for the current state, another computes $\max_{a'} Q(s',a',w)$ for the next state; a note indicates both boxes are the same network; the numbered steps match the pseudocode below]

Initialize network weights
Repeat for each episode:
    Initialize s
    Repeat for each step of the episode:
        Choose a from s using the ε-greedy policy (*) – 1
        Take action a, observe r, s'
        Obtain current Q(s, a, w) and gradient ∇w Q(s, a, w) – 2
        Calculate max next-state Q-value, max_a' Q(s', a', w) – 3
        Calculate target – 4
        Calculate error – 5
        Calculate weights delta Δw – 6
        Update weights – retrain network (perform gradient descent) – 7
        s ← s'
    Until s is terminal

(*) ε-greedy policy: with probability ε select a random action a, otherwise select argmax_a Q(s, a, w)
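A minimal runnable sketch of this loop (Gymnasium's CartPole environment and the ε value are illustrative assumptions; `q_net` and `online_update` are the hypothetical sketches from earlier slides):

```python
import random
import gymnasium as gym
import torch

EPSILON = 0.1

def epsilon_greedy(state: torch.Tensor, n_actions: int) -> int:
    """With probability epsilon pick a random action, otherwise argmax_a Q(s, a, w)."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())

env = gym.make("CartPole-v1")
for episode in range(100):                            # repeat for each episode
    obs, _ = env.reset()                              # initialize s
    state = torch.as_tensor(obs, dtype=torch.float32)
    done = False
    while not done:                                   # repeat for each step of the episode
        a = epsilon_greedy(state, env.action_space.n)          # step 1
        next_obs, r, terminated, truncated, _ = env.step(a)
        next_state = torch.as_tensor(next_obs, dtype=torch.float32)
        online_update(state, a, r, next_state)                 # steps 2-7
        state = next_state                                     # s <- s'
        done = terminated or truncated                         # until s is terminal
```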



Problems with online Q-Learning
Reasons online Q-Learning implementations may fail to converge

• Function approximation issues


– Correlations between samples -- Consecutive experience samples can be highly
correlated, which can cause overfitting
– Data-inefficient - samples are discarded after each update
• Bootstrapping issues – update estimates on the basis of other estimates
– Non-stationary targets - target values of Q-learning updates depend on same
weights that are being updated
  • Target $= r + \gamma \max_{a'} Q(s',a',w)$ depends on $w$; the targets are used to build the updates to $w$



DQN
DQN is a general-purpose algorithm – what changes across applications is the architecture of the neural network used, the Q-network

[Diagram: DQN agent]

• The DQN improvements to Q-Learning made possible the successful use of Q-Learning with deep neural networks



DQN
Modifications to online Q-Learning
• Experience replay
– Resolves correlated samples and data efficiency issues
• Separate network for generating targets
– Reduces non-stationary targets issue



Experience Replay
• To remove correlations, build a data set from the agent's own experience, stored in a replay memory

• At each time step:


– Store the transition tuple (s, a, r, s’) in the replay memory
– Sample random mini-batch of transitions (s, a, r, sʹ) from the replay memory
  – Compute the Q-learning targets using a separate, fixed network with parameters $w^-$
• Mini-batch updates yield smoother (less noisy) gradients
• Each experience sample is potentially used in many weight updates – fewer new samples are required (a sketch of a replay memory follows below)
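A minimal sketch of a replay memory (the class name, capacity and the inclusion of a done flag are illustrative assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s, a, r, s', done) and samples uniform random mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```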



Separate network for generating targets
Reduce targets non-stationarity

• A separate neural network $\hat{Q}$ with weights $w^-$ is used to calculate the targets:
  – Mini-batch targets $= r + \gamma \max_{a'} \hat{Q}(s',a',w^-)$

[Diagram: the online network with weights $w$ takes $(s, a)$ and outputs $Q(s,a,w)$; the target network with weights $w^-$ takes $(s', a')$ and outputs $\hat{Q}(s',a',w^-)$]

• The target network parameters are held fixed for C updates (see the sketch below)
  – Every C updates, clone network $Q$ to obtain the target network $\hat{Q}$ and use it to generate the Q-learning targets for the following C updates to $Q$
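A minimal sketch of these target-network mechanics (PyTorch is assumed; `q_net` is the earlier hypothetical online network, and the value of C is illustrative):

```python
import copy
import torch

GAMMA = 0.99
C = 1000                              # clone the online network every C updates

target_net = copy.deepcopy(q_net)     # Q-hat with weights w-

def batch_targets(rewards, next_states, dones):
    """y = r + gamma * max_a' Q-hat(s', a', w-); no bootstrapping past terminal states."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + GAMMA * next_q * (1.0 - dones)

def maybe_sync(update_count: int):
    """Every C updates, reset Q-hat = Q."""
    if update_count % C == 0:
        target_net.load_state_dict(q_net.state_dict())   # w- <- w
```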



DQN algorithm
[Diagram: DQN data flow – the online network with weights $w$ computes $Q(s,a,w)$ and its gradient; a separate target network with weights $w^-$ computes $\max_{a'} \hat{Q}(s',a',w^-)$ and is reset from the online network every C updates; the numbered steps match the pseudocode below]

Initialize replay memory D to capacity N
Initialize network Q with weights w
Initialize network Q̂ with weights w⁻ = w
For episode = 1, M do
    Initialize s
    For t = 1, T do
        Select action a using an ε-greedy policy (*) – 1
        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in D
        Sample a random mini-batch of transitions (s, a, r, s') from D
        Obtain Q(s, a, w) and gradient ∇w Q(s, a, w) – 2
        Calculate max_a' Q̂(s', a', w⁻) – 3
        Calculate mini-batch targets, y – 4
        Calculate error – 5
        Calculate weights delta Δw – 6
        Update weights w of network Q – retrain network (perform gradient descent) – 7
        Every C updates reset Q̂ = Q – 8
        s ← s'
    End For
End For

(*) ε-greedy policy: with probability ε select a random action a, otherwise select argmax_a Q(s, a, w)
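A minimal end-to-end sketch of this algorithm that ties the earlier hypothetical pieces together (`QNetwork`, `ReplayMemory`, `epsilon_greedy`, `batch_targets`, `maybe_sync`); CartPole and all hyperparameter values are illustrative assumptions, not the Atari setup of the DQN paper:

```python
import gymnasium as gym
import torch
import torch.nn.functional as F

BATCH_SIZE = 32
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
memory = ReplayMemory(capacity=100_000)        # replay memory D with capacity N

env = gym.make("CartPole-v1")
updates = 0
for episode in range(500):                     # For episode = 1, M
    obs, _ = env.reset()                       # initialize s
    state = torch.as_tensor(obs, dtype=torch.float32)
    done = False
    while not done:                            # For t = 1, T
        a = epsilon_greedy(state, env.action_space.n)            # 1
        next_obs, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        next_state = torch.as_tensor(next_obs, dtype=torch.float32)
        memory.push(state, a, r, next_state, float(done))        # store (s, a, r, s') in D

        if len(memory) >= BATCH_SIZE:
            batch = memory.sample(BATCH_SIZE)                    # random mini-batch from D
            s_b = torch.stack([t[0] for t in batch])
            a_b = torch.tensor([t[1] for t in batch])
            r_b = torch.tensor([t[2] for t in batch], dtype=torch.float32)
            s2_b = torch.stack([t[3] for t in batch])
            d_b = torch.tensor([t[4] for t in batch], dtype=torch.float32)

            q_sa = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)   # Q(s, a, w)       2
            y = batch_targets(r_b, s2_b, d_b)                          # targets          3, 4
            loss = F.mse_loss(q_sa, y)                                 # error            5
            optimizer.zero_grad()
            loss.backward()                                            # weights delta    6
            optimizer.step()                                           # update w         7
            updates += 1
            maybe_sync(updates)                                        # reset Q-hat = Q  8
        state = next_state                                             # s <- s'
```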



Improvements to DQN

• Prioritized experience replay
  – The RL agent can learn more effectively from some transitions than from others
  – Prioritize transitions according to surprise, defined as proportional to the DQN error: $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$

• Double DQN (see the sketch below)
  – Addresses the DQN overestimation problem caused by the $\max_{a'} Q(s',a')$ operation in the updates
  – Uses two separate networks: one determines the maximizing action and the other estimates its Q-value; the roles alternate on each step
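A minimal sketch of the Double DQN target computation in one common formulation, where the online network selects the maximizing action and the target network evaluates it (reusing the hypothetical `q_net` and `target_net` names from the earlier sketches):

```python
import torch

GAMMA = 0.99

def double_dqn_targets(rewards, next_states, dones):
    """Decouple action selection from action evaluation to reduce overestimation."""
    with torch.no_grad():
        best_a = q_net(next_states).argmax(dim=1, keepdim=True)        # selection (online net)
        next_q = target_net(next_states).gather(1, best_a).squeeze(1)  # evaluation (target net)
    return rewards + GAMMA * next_q * (1.0 - dones)
```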



Summary & conclusions
• Online Q-Learning with function approximation transforms RL into a
sequence of regression training calculations, but has limitations due to:
– Non-stationary targets
– Correlated training transitions
• DQN mitigates these problems and makes Q-Learning with function approximation successfully applicable in many domains − a key tool in the RL toolbox
• Why neural networks pair well with Q-Learning in DQN:
– Non-linear function approximation
– Incremental training to support both online and batch learning
– Capability to handle large number of inputs
– Automates feature engineering on raw inputs



References
• Mnih et al. – Playing Atari with Deep Reinforcement Learning,
arXiv:1312.5602v1, 2013.
• Mnih et al. – Human-level control through deep reinforcement learning,
Nature 518, 2015.
• Arulkumaran, K. et al. – A Brief Survey of Deep Reinforcement Learning,
arXiv:1708.05866, 2017.
• Sutton R. and Barto A. – Reinforcement Learning: An Introduction, MIT
Press, second edition, 2018.
• Mitchell, T. – Machine Learning, WCB/McGraw-Hill, 1997.
• Russell, S. and Norvig, P. – Artificial Intelligence: A Modern Approach, Prentice Hall, 2010.
