Deep Reinforcement Learning

MIT 6.S191
Alexander Amini
January 30, 2019
AlphaGo video
6.S191 Introduction to Deep Learning

1/30/19
introtodeeplearning.com
Classes of Learning Problems
Supervised Learning
Data: $(x, y)$
$x$ is data, $y$ is label
Goal: Learn function to map $x \to y$
Apple example: This thing is an apple.

Unsupervised Learning
Data: $x$
$x$ is data, no labels!
Goal: Learn underlying structure
Apple example: This thing is like the other thing.

Reinforcement Learning
Data: state-action pairs
Goal: Maximize future rewards over many time steps
Apple example: Eat this thing because it will keep you alive.

RL: our focus today.

Reinforcement Learning (RL): Key Concepts
Agent: takes actions.
Environment: the world in which the agent exists and operates.
Action: a move the agent can make in the environment.
Observations: of the environment after taking actions.
State: a situation which the agent perceives.
Reward: feedback that measures the success or failure of the agent’s action.

[Diagram: the AGENT sends an action $a_t$ to the ENVIRONMENT; the ENVIRONMENT returns observations to the agent: the state change $s_{t+1}$ and the reward $r_t$.]

Total Reward
$R_t = \sum_{i=t}^{\infty} r_i = r_t + r_{t+1} + \cdots + r_{t+n} + \cdots$

Discounted Total Reward
$R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i = r_t + \gamma r_{t+1} + \cdots + \gamma^{n} r_{t+n} + \cdots$
$\gamma$: discount factor

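The discounted total reward can be computed with a short backward pass over a reward sequence. The sketch below is only an illustration: the reward list and the values of $\gamma$ are made-up numbers, and the sequence is assumed to be finite, with exponents counted from the current step so that the result is $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards using the recursion R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_t, r_{t+1}, r_{t+2}

undiscounted = discounted_return(rewards, gamma=1.0)  # plain total reward
discounted = discounted_return(rewards, gamma=0.5)    # later rewards count less
print(undiscounted, discounted)
```

With $\gamma = 1$ this reduces to the plain total reward; a smaller $\gamma$ makes the agent weigh near-term rewards more heavily.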
Defining the Q-function
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
Total reward, $R_t$, is the discounted sum of all rewards obtained from time $t$
$Q(s, a) = \mathbb{E}[R_t]$
The Q-function captures the expected total future reward an agent in state, $s$, can receive by executing a certain action, $a$

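Because $Q(s, a)$ is an expectation over returns, one simple way to approximate it is to average many sampled returns. The sketch below assumes a made-up two-step rollout for a single fixed state; `sample_return`, the action names, the reward values, and `GAMMA` are all hypothetical and are not taken from the lecture.

```python
import random

GAMMA = 0.9  # assumed discount factor


def sample_return(action, rng):
    """Hypothetical two-step rollout: return r_t + GAMMA * r_{t+1}."""
    if action == "risky":
        r_t = 1.0 if rng.random() < 0.5 else 0.0  # coin-flip reward
    else:
        r_t = 0.4                                 # fixed reward
    r_next = 1.0                                  # same follow-up reward either way
    return r_t + GAMMA * r_next


def estimate_q(action, n_rollouts, rng):
    """Monte Carlo estimate of Q(s, a) = E[R_t]: average many sampled returns."""
    returns = [sample_return(action, rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts


rng = random.Random(0)
q_risky = estimate_q("risky", 20_000, rng)  # true value is 0.5 + 0.9 = 1.4
q_safe = estimate_q("safe", 20_000, rng)    # true value is 0.4 + 0.9 = 1.3
print(q_risky, q_safe)
```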
How to take actions given a Q-function?
$Q(s, a) = \mathbb{E}[R_t]$, with arguments (state, action)
Ultimately, the agent needs a policy $\pi(s)$ to infer the best action to take at its state, $s$
Strategy: the policy should choose an action that maximizes future reward
$\pi^*(s) = \arg\max_a Q(s, a)$

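The argmax strategy is easy to see on a small lookup table. The following is a minimal sketch, assuming Q-values are already known and stored in a nested dictionary; the state names, action names, and numbers are invented for illustration.

```python
# Hypothetical Q-table: Q[state][action] holds an estimated Q-value.
Q = {
    "ball_left": {"left": 1.2, "stay": 0.4, "right": -0.3},
    "ball_right": {"left": -0.1, "stay": 0.5, "right": 1.7},
}


def greedy_policy(q_table, state):
    """pi*(s) = argmax_a Q(s, a): pick the action with the largest Q-value."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


for state in Q:
    print(state, "->", greedy_policy(Q, state))
```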
Deep Reinforcement Learning Algorithms

Value Learning
Find $Q(s, a)$
$a = \arg\max_a Q(s, a)$

Policy Learning
Find $\pi(s)$
Sample $a \sim \pi(s)$

Digging deeper into the Q-function
Example: Atari Breakout
It can be very difficult for humans to accurately estimate Q-values
Which (s, a) pair has a higher Q-value?
[Figure: two Breakout (s, a) pairs, labeled A and B.]
[Videos: Example: Atari Breakout - Middle; Example: Atari Breakout - Side.]

Deep Q Networks (DQN)
How can we use deep neural networks to model Q-functions?
[Diagram: state, $s$, and action, $a$ (e.g. “move right”) → Deep NN → $Q(s, a)$]
[Diagram: state, $s$ → Deep NN → $Q(s, a_1)$, $Q(s, a_2)$, ..., $Q(s, a_n)$]

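The second architecture, where the network takes only the state and returns one Q-value per action, can be sketched as a tiny two-layer network. This is an untrained toy in plain Python, not the network used in the lecture: the layer sizes, the random weights, and the example state vector are all assumptions.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by a ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, hidden, output):
    """One forward pass returns Q(s, a_i) for every action a_i at once."""
    return dense(dense(state, hidden, relu=True), output, relu=False)


rng = random.Random(0)
hidden = make_layer(4, 8, rng)  # 4 state features (assumed)
output = make_layer(8, 3, rng)  # 3 discrete actions (assumed)

qs = q_values([0.1, -0.2, 0.3, 0.0], hidden, output)
best_action = max(range(len(qs)), key=qs.__getitem__)
print(qs, best_action)
```

Returning all Q-values in one pass means the argmax over actions needs a single network evaluation per state, rather than one evaluation per candidate action.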
Deep Q Networks (DQN): Training
[Diagram: the two network architectures, mapping $(s, a)$ to $Q(s, a)$, or $s$ to $Q(s, a_i)$ for every action.]
$\mathcal{L} = \mathbb{E}\left[ \left\| \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a) \right\|^2 \right]$
target: $r + \gamma \max_{a'} Q(s', a')$
predicted: $Q(s, a)$

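The loss compares a target built from the observed reward and the best next-step Q-value against the network's current prediction. The sketch below evaluates that loss on a hand-written batch; the batch layout, the numbers, and the `done` flag for terminal transitions are assumptions added for illustration and do not appear on the slide.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Target r + gamma * max_a' Q(s', a'); just r if the episode ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, gamma):
    """Mean squared error between target and predicted Q(s, a) over a batch."""
    errors = []
    for predicted_q, reward, next_q_values, done in batch:
        target = td_target(reward, next_q_values, gamma, done)
        errors.append((target - predicted_q) ** 2)
    return sum(errors) / len(errors)


# Each entry: (predicted Q(s, a), reward r, Q(s', a') for every a', done flag)
batch = [
    (1.0, 0.5, [0.2, 1.0, -0.3], False),  # target = 0.5 + 0.9 * 1.0 = 1.4
    (2.0, 1.0, [0.0, 0.0, 0.0], True),    # target = 1.0 (terminal transition)
]
loss = dqn_loss(batch, gamma=0.9)
print(loss)
```

In a typical training setup the target is treated as a fixed number when taking gradients, so only the predicted $Q(s, a)$ is pushed toward it.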
DQN Atari Results
[Figure: DQN results on Atari games, divided into games where DQN surpasses human-level performance and games where it is below human-level.]

Downsides of Q-learning
Complexity:
• Can model scenarios where the action space is discrete and small
• Cannot handle continuous action spaces
IMPORTANT: Imagine you want to predict steering wheel angle of a car!
Flexibility:
• Cannot learn stochastic policies since policy is deterministically computed from the Q function
To overcome, consider a new class of RL training algorithms: Policy gradient methods

Policy Gradient (PG): Key Idea
DQN (before): Approximating Q and inferring the optimal policy,
[Diagram: state, $s$ → Deep NN → $Q(s, a_1)$, $Q(s, a_2)$, ..., $Q(s, a_n)$]
Policy Gradient: Directly optimize the policy!
[Diagram: state, $s$ → Deep NN → $\pi(a_1|s)$, $\pi(a_2|s)$, ..., $\pi(a_n|s)$]
$\sum_{a_i \in A} \pi(a_i|s) = 1$
$\pi(a|s) = P(\text{action} \mid \text{state})$

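A common way to make network outputs satisfy $\sum_{a_i \in A} \pi(a_i|s) = 1$ is a softmax over raw scores, after which an action is sampled rather than chosen by argmax. The sketch below assumes three discrete actions and made-up scores; the softmax output layer is an assumption here, since the slide does not say how the probabilities are produced.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Sample an action index a ~ pi(a | s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical network outputs for one state
probs = softmax(logits)
action = sample_action(probs, random.Random(0))
print(probs, action)
```

Sampling from the distribution is what allows a policy network to represent stochastic policies, which a deterministic argmax over Q-values cannot.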
Policy Gradient (PG): Training
function REINFORCE
    Initialize $\theta$
    for episode ~ $\pi_\theta$
        $\{s_i, a_i, r_i\}_{i=1}^{T-1}$ ← episode
        for t = 1 to T-1
            $\nabla$ ← $\nabla_\theta \log \pi_\theta(a_t|s_t) R_t$
            $\theta$ ← $\theta + \alpha \nabla$
    return $\theta$

1. Run a policy for a while
2. Increase probability of actions that lead to high rewards
3. Decrease probability of actions that lead to low/no rewards

In the update $\nabla_\theta \log \pi_\theta(a_t|s_t) R_t$, the factor $\log \pi_\theta(a_t|s_t)$ is the log-likelihood of the action and $R_t$ is the reward.

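The REINFORCE update can be tried on the simplest possible problem, a two-armed bandit where every episode is a single step and there is no state. The sketch below is a toy under those assumptions: the softmax parameterization, the arm rewards, the learning rate, and the episode count are all chosen for illustration. For a softmax policy, the gradient of $\log \pi_\theta(a)$ with respect to $\theta_k$ is $1[k = a] - \pi_\theta(k)$.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, episodes, alpha, rng):
    """REINFORCE on a stateless bandit with a softmax policy over arms."""
    theta = [0.0] * len(arm_rewards)  # Initialize theta
    for _ in range(episodes):         # for episode ~ pi_theta
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        R = arm_rewards[a]            # one-step episode, so R_t is just the reward
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] is 1[k == a] - probs[k]
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * grad_log_pi * R  # theta <- theta + alpha * grad
    return theta


# Arm 0 pays 1.0 and arm 1 pays nothing (hypothetical rewards).
theta = reinforce_bandit([1.0, 0.0], episodes=200, alpha=0.1, rng=random.Random(0))
final_probs = softmax(theta)
print(final_probs)
```

Each time the rewarded arm is pulled its probability is pushed up, while pulls of the zero-reward arm leave the parameters unchanged because the gradient is multiplied by a reward of zero.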
The Game of Go
Aim: Get more board territory than your opponent.

Board Size n×n    Positions 3^(n²)       % Legal    Legal Positions
1×1               3                      33.33%     1
2×2               81                     70.37%     57
3×3               19,683                 64.40%     12,675
4×4               43,046,721             56.49%     24,318,165
5×5               847,288,609,443        48.90%     414,295,148,741
9×9               4.434264882×10^38      23.44%     1.03919148791×10^38
13×13             4.300233593×10^80      8.66%      3.72497923077×10^79
19×19             1.740896506×10^172     1.20%      2.08168199382×10^170

Greater number of legal board positions than atoms in the universe.

Source: Wikipedia.
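The "Positions" column follows from each of the $n \times n$ points being empty, black, or white, giving $3^{n^2}$ configurations. The short check below recomputes that column and the legal percentage for the small boards, using the legal-position counts listed in the table; the helper name `total_positions` is just an illustrative choice.

```python
# Legal-position counts for small boards, copied from the table.
legal_positions = {1: 1, 2: 57, 3: 12_675, 4: 24_318_165, 5: 414_295_148_741}


def total_positions(n):
    """Each of the n*n points is empty, black, or white: 3**(n*n) configurations."""
    return 3 ** (n * n)


for n, legal in legal_positions.items():
    total = total_positions(n)
    print(f"{n}x{n}: {total:,} positions, {100 * legal / total:.2f}% legal")

# Python integers are arbitrary precision, so the 19x19 count is exact.
digits_19 = len(str(total_positions(19)))
print(digits_19)
```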
AlphaGo Beats Top Human Player at Go (2016)
Silver et al., Nature 2016.

1) Initial training: human data
2) Self-play and reinforcement learning → super-human performance
3) “Intuition” about board state

AlphaZero: RL from Self-Play (2018)
Silver et al., Science 2018.

Questions?
