A Selective Overview
Dimitri P. Bertsekas
2018 CDC
December 2018
[Figure: the AI/RL and Decision/Control/DP fields and their overlap (late 80s-early 90s) — AI/RL side: learning through experience, simulation, model-free methods, feature-based representations, A*/games/heuristics; Decision/Control/DP side: principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration; the overlap: complementary ideas]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)
First major successes: Backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
AlphaZero (Google-DeepMind)
Plays much better than all chess programs
Plays differently!
Example approaches to compute J̃:
Problem approximation: Use as J̃ the optimal cost function of a simpler problem
Rollout and model predictive control: Use a single policy iteration, with cost
evaluated on-line by simulation or limited optimization
Self-learning/approximate policy iteration (API): Use as J̃ an approximation to the
cost function of the final policy obtained through a policy iteration process
Role of neural networks: "Learn" the cost functions of policies in the context of
API; "learn" policies obtained by value space approximation
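The rollout idea above (one-step lookahead, with the base policy's cost evaluated on-line by simulation) can be sketched on a made-up problem; all problem data, names, and parameters below are illustrative, not from the talk:

```python
import random

# Toy problem (illustrative): state x in 0..10, goal x == 0; controls move
# the state by -1 or -2, but a "slip" cancels the move with probability 0.2.
GOAL, HORIZON, SLIP = 0, 30, 0.2

def step(x, u, rng):
    """System: apply control u unless a slip occurs."""
    return x if rng.random() < SLIP else max(x + u, GOAL)

def base_policy(x):
    """Heuristic base policy: always take the small step."""
    return -1

def simulate_base(x, rng):
    """Cost-to-go of the base policy from x (stages to reach GOAL), by simulation."""
    cost = 0
    while x != GOAL and cost < HORIZON:
        x = step(x, base_policy(x), rng)
        cost += 1
    return cost

def rollout_control(x, rng, num_sims=200):
    """One-step lookahead; the cost of each candidate move is evaluated
    on-line by averaging simulated base-policy trajectories from the next state."""
    q = {}
    for u in (-1, -2):
        q[u] = sum(1 + simulate_base(step(x, u, rng), rng)
                   for _ in range(num_sims)) / num_sims
    return min(q, key=q.get)

rng = random.Random(0)
print(rollout_control(5, rng))  # the rollout policy improves on the base policy
```

Note the one-step improvement effect: even though the base policy always takes the small step, the rollout control prefers the larger step, because the simulated cost-to-go is lower from the resulting state.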
Bertsekas (M.I.T.) Reinforcement Learning 6 / 33
Aims and References of this Talk
References
Quite a few Exact DP books (1950s-present, starting with Bellman; my latest book
"Abstract DP" came out earlier this year)
Methods terminology
Learning = Solving a DP-related problem using simulation.
Self-learning (or self-play in the context of games) = Solving a DP problem using
simulation-based policy iteration.
Planning vs Learning distinction = Solving a DP problem with model-based vs
model-free simulation.
[Figure: closed-loop system — controller µk applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk)]

System
    xk+1 = fk(xk, uk, wk),  k = 0, 1, . . . , N − 1
where xk: State, uk: Control, wk: Random disturbance

Cost function:
    E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

DP algorithm: Go backwards, k = N − 1, . . . , 0, using
    JN(xN) = gN(xN)
    Jk(xk) = min_{uk} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }
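The backward DP recursion can be sketched on a toy inventory-style problem; the states, controls, costs, and demand distribution below are all illustrative, not from the talk:

```python
# Toy finite-horizon problem (all data illustrative): stock levels 0..4,
# orders 0..2, random demand w with a given distribution.
N = 3
STATES = range(5)
CONTROLS = range(3)
W_PROBS = {0: 0.2, 1: 0.5, 2: 0.3}

def f(x, u, w):
    """System equation x_{k+1} = f_k(x_k, u_k, w_k), stock clipped to [0, 4]."""
    return min(max(x + u - w, 0), 4)

def g(x, u, w):
    """Stage cost g_k: ordering cost plus a holding/shortage penalty."""
    return u + abs(x + u - w - 1)

# Backward recursion: J_N(x) = g_N(x) = 0 here, then
# J_k(x) = min_u E_w{ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) }.
J = {x: 0.0 for x in STATES}
policy = []
for k in reversed(range(N)):
    Jk, mu = {}, {}
    for x in STATES:
        q_values = {u: sum(p * (g(x, u, w) + J[f(x, u, w)])
                           for w, p in W_PROBS.items())
                    for u in CONTROLS}
        mu[x] = min(q_values, key=q_values.get)
        Jk[x] = q_values[mu[x]]
    J = Jk
    policy.insert(0, mu)

print({x: round(J[x], 3) for x in STATES})  # J_0(x) for each state
```

The recursion also yields an optimal policy µk as a byproduct: the minimizing u at each (k, x).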
[Figure: chess position evaluator — feature extraction (features: material balance, mobility, safety, etc.), weighting of features, score]

[Figure: neural network architecture — encoding y(x), linear layer Ay(x) + b, sigmoidal layer, linear weighting r, cost approximation r′φ(x, v)]

Linear layer Ay(x) + b [parameters to be determined: v = (A, b)]
Nonlinear layer produces m outputs φi(x, v) = σ((Ay(x) + b)i), i = 1, . . . , m
σ is a scalar nonlinear differentiable function; several types have been used (hyperbolic tangent, logistic, rectified linear unit)
Training problem is to use the training set (x^s, β^s), s = 1, . . . , q, for

    min_{v, r} Σ_{s=1}^{q} ( Σ_{i=1}^{m} r_i φ_i(x^s, v) − β^s )^2 + (Regularization Term)

Solved often with incremental gradient methods (known as backpropagation)
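A minimal numpy sketch of this training problem, under assumptions not in the slides: encoding y(x) = x, σ = tanh, and a synthetic training set (the target |x| is illustrative). The incremental gradient updates apply the chain rule through σ, which is what backpropagation computes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (x^s, beta^s), s = 1..q; target beta = |x| is illustrative.
q, m = 200, 8
xs = rng.uniform(-2, 2, size=q)
betas = np.abs(xs)

# Single nonlinear layer with y(x) = x and sigma = tanh (one choice among several).
A = rng.normal(size=m)
b = rng.normal(size=m)
r = np.zeros(m)

def phi(x):
    """phi_i(x, v) = sigma((A y(x) + b)_i), with v = (A, b)."""
    return np.tanh(A * x + b)

step = 0.05
for epoch in range(200):
    for s in rng.permutation(q):         # incremental: one sample at a time
        p = phi(xs[s])
        err = r @ p - betas[s]           # residual of r' phi(x^s, v) - beta^s
        dsig = 1.0 - p ** 2              # tanh' at the linear-layer outputs
        grad_r = err * p                 # chain rule through the architecture
        grad_A = err * r * dsig * xs[s]  # (this is what backpropagation computes)
        grad_b = err * r * dsig
        r -= step * grad_r
        A -= step * grad_A
        b -= step * grad_b

mse = np.mean((np.array([r @ phi(x) for x in xs]) - betas) ** 2)
print(round(float(mse), 4))
```

No regularization term is included here; in practice one would add, e.g., a small quadratic penalty on (v, r) to the per-sample gradients.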
Universal approximation theorem: With a sufficiently large number of parameters, "arbitrarily" complex functions can be closely approximated
Deep Neural Networks

[Figure: multilayer network — encoding y(x), alternating linear layers Ay(x) + b with parameters v = (A, b) and sigmoidal layers φ1(x, v), . . . , φm(x, v), final linear weighting r, cost approximation r′φ(x, v)]

More complex NNs are formed by concatenation of multiple layers
The outputs of each nonlinear layer become the inputs of the next linear layer
A hierarchy of features
Considerable success has been achieved in major contexts

Possible reasons for the success:
With more complex features, the number of parameters in the linear layers may be drastically decreased
We may use matrices A with a special structure that encodes special linear operations such as convolution
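The convolution point can be made concrete: a 1-D "valid" convolution is a linear layer whose matrix A is banded with shared entries, so a handful of kernel parameters replaces a full matrix. The kernel and input below are illustrative:

```python
import numpy as np

# Illustrative: a "valid" 1-D convolution with a 3-tap kernel is the linear map
# y -> A y where A is banded with shared entries, so 3 kernel parameters
# replace the (n - 2) * n free entries of a general matrix A.
n = 8
kernel = np.array([1.0, -2.0, 1.0])   # second-difference kernel (illustrative)

A = np.zeros((n - 2, n))
for i in range(n - 2):
    A[i, i:i + 3] = kernel            # same 3 parameters on every row

y = np.arange(n, dtype=float) ** 2    # sample input: y_j = j^2
conv = np.convolve(y, kernel[::-1], mode="valid")

print(np.allclose(A @ y, conv))       # True: structured matrix = convolution
print(A @ y)                          # second differences of j^2 are all 2
```

Here a general matrix of this shape has 48 free entries, while the structured (convolutional) layer has 3: this is the "drastically decreased" parameter count.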
Q-Factors - Model-Free RL
(Note a bit of mathematical magic: the order of E{·} and min has been reversed.)
We obtain Q̃k(xk, uk, rk) by training with many pairs ((xk^s, uk^s), βk^s), where βk^s is a sample of the corresponding Q-factor.
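A minimal sketch of this training step, with synthetic pairs and hypothetical linear features; β^s below stands in for a sampled Q-factor target. Once Q̃ is trained, the on-line control minimizes Q̃ over u, with no model (and no expectation over w) required:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, u):
    """Hypothetical features of the state-control pair (x, u)."""
    return np.array([1.0, x, x * x, x * u, u])

# Synthetic training pairs ((x^s, u^s), beta^s): controls u in {0, 1}, and
# beta^s is a noisy sample of a made-up Q-factor, (x - u)^2.
q = 500
xs = rng.uniform(-1.0, 1.0, size=q)
us = rng.integers(0, 2, size=q)
betas = (xs - us) ** 2 + 0.01 * rng.normal(size=q)

# Train Q~(x, u, r) = r' phi(x, u) by linear least squares on the pairs.
Phi = np.array([phi(x, u) for x, u in zip(xs, us)])
r, *_ = np.linalg.lstsq(Phi, betas, rcond=None)

def control(x):
    """Model-free control: minimize the trained Q-factor over u."""
    return min((0, 1), key=lambda u: r @ phi(x, u))

print(control(0.9), control(0.1))  # 1 0
```

The point of the reversed order of E{·} and min is visible in `control`: the minimization is over a trained function of (x, u), so no simulation or model evaluation is needed at decision time.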
Infinite horizon extensions: Approximate value and policy iteration methods, error
bounds, model-based and model-free methods
Temporal difference methods: A class of methods for policy evaluation in infinite
horizon problems with a rich theory, issues of variance-bias tradeoff
Sampling for exploration, in the context of policy iteration
Monte Carlo tree search, and related methods
Aggregation methods, synergism with other approximate DP methods
Approximation in policy space, actor-critic methods, policy gradient methods
Special aspects of imperfect state information problems, connections with
traditional control schemes
Infinite spaces optimal control, connections with aggregation schemes
Special aspects of deterministic problems: Shortest paths and their use in
approximate DP
Simulation-based methods for general linear systems, connection to proximal
algorithms