A Selective Overview
Dimitri P. Bertsekas
2018 CDC
December 2018
[Figure: the AI/RL and Decision/Control/DP fields and their overlap (late 80s-early 90s) — AI/RL side: learning through experience, simulation, model-free methods, feature-based representations, A*/games/heuristics; Decision/Control/DP side: principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration; the overlap: complementary ideas]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)
First major successes: Backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
AlphaZero (Google-DeepMind)
Plays much better than all chess programs
Plays differently!
Example approaches to compute J̃:
Problem approximation: Use as J̃ the optimal cost function of a simpler problem
Rollout and model predictive control: Use a single policy iteration, with cost
evaluated on-line by simulation or limited optimization
Self-learning/approximate policy iteration (API): Use as J̃ an approximation to the
cost function of the final policy obtained through a policy iteration process
Role of neural networks: "Learn" the cost functions of policies in the context of
API; "learn" policies obtained by value space approximation
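The rollout idea above (one-step lookahead, with the base policy's cost evaluated on-line by simulation) can be sketched on a made-up problem; all problem data, names, and parameters below are illustrative, not from the talk:

```python
import random

# Toy problem (illustrative): state x in 0..10, goal x == 0; controls move
# the state by -1 or -2, but a "slip" cancels the move with probability 0.2.
GOAL, HORIZON, SLIP = 0, 30, 0.2

def step(x, u, rng):
    """System: apply control u unless a slip occurs."""
    return x if rng.random() < SLIP else max(x + u, GOAL)

def base_policy(x):
    """Heuristic base policy: always take the small step."""
    return -1

def simulate_base(x, rng):
    """Cost-to-go of the base policy from x (stages to reach GOAL), by simulation."""
    cost = 0
    while x != GOAL and cost < HORIZON:
        x = step(x, base_policy(x), rng)
        cost += 1
    return cost

def rollout_control(x, rng, num_sims=200):
    """One-step lookahead; the cost of each candidate move is evaluated
    on-line by averaging simulated base-policy trajectories from the next state."""
    q = {}
    for u in (-1, -2):
        q[u] = sum(1 + simulate_base(step(x, u, rng), rng)
                   for _ in range(num_sims)) / num_sims
    return min(q, key=q.get)

rng = random.Random(0)
print(rollout_control(5, rng))  # the rollout policy improves on the base policy
```

Note the one-step improvement effect: even though the base policy always takes the small step, the rollout control prefers the larger step, because the simulated cost-to-go is lower from the resulting state.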
Bertsekas (M.I.T.) Reinforcement Learning 6 / 33
Aims and References of this Talk
References
Quite a few Exact DP books (1950s-present, starting with Bellman; my latest book
"Abstract DP" came out earlier this year)
Methods terminology
Learning = Solving a DP-related problem using simulation.
Self-learning (or self-play in the context of games) = Solving a DP problem using
simulation-based policy iteration.
Planning vs Learning distinction = Solving a DP problem with model-based vs
model-free simulation.
[Figure: closed-loop system — controller µk applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk)]

System
    xk+1 = fk(xk, uk, wk),  k = 0, 1, . . . , N − 1
where xk: State, uk: Control, wk: Random disturbance

Cost function:
    E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

DP algorithm: Go backwards, k = N − 1, . . . , 0, using
    JN(xN) = gN(xN)
    Jk(xk) = min_{uk} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }
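The backward DP recursion can be sketched on a toy inventory-style problem; the states, controls, costs, and demand distribution below are all illustrative, not from the talk:

```python
# Toy finite-horizon problem (all data illustrative): stock levels 0..4,
# orders 0..2, random demand w with a given distribution.
N = 3
STATES = range(5)
CONTROLS = range(3)
W_PROBS = {0: 0.2, 1: 0.5, 2: 0.3}

def f(x, u, w):
    """System equation x_{k+1} = f_k(x_k, u_k, w_k), stock clipped to [0, 4]."""
    return min(max(x + u - w, 0), 4)

def g(x, u, w):
    """Stage cost g_k: ordering cost plus a holding/shortage penalty."""
    return u + abs(x + u - w - 1)

# Backward recursion: J_N(x) = g_N(x) = 0 here, then
# J_k(x) = min_u E_w{ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) }.
J = {x: 0.0 for x in STATES}
policy = []
for k in reversed(range(N)):
    Jk, mu = {}, {}
    for x in STATES:
        q_values = {u: sum(p * (g(x, u, w) + J[f(x, u, w)])
                           for w, p in W_PROBS.items())
                    for u in CONTROLS}
        mu[x] = min(q_values, key=q_values.get)
        Jk[x] = q_values[mu[x]]
    J = Jk
    policy.insert(0, mu)

print({x: round(J[x], 3) for x in STATES})  # J_0(x) for each state
```

The recursion also yields an optimal policy µk as a byproduct: the minimizing u at each (k, x).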
[Figure: chess position evaluator — feature extraction (features: material balance, mobility, safety, etc.), weighting of features, score]

[Figure: neural network architecture — encoding y(x), linear layer Ay(x) + b, sigmoidal layer, linear weighting r, cost approximation r′φ(x, v)]

Linear layer Ay(x) + b [parameters to be determined: v = (A, b)]
Nonlinear layer produces m outputs φi(x, v) = σ((Ay(x) + b)i), i = 1, . . . , m
σ is a scalar nonlinear differentiable function; several types have been used (hyperbolic tangent, logistic, rectified linear unit)
Training problem is to use the training set (x^s, β^s), s = 1, . . . , q, for

    min_{v, r} Σ_{s=1}^{q} ( Σ_{i=1}^{m} r_i φ_i(x^s, v) − β^s )^2 + (Regularization Term)

Solved often with incremental gradient methods (known as backpropagation)
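A minimal numpy sketch of this training problem, under assumptions not in the slides: encoding y(x) = x, σ = tanh, and a synthetic training set (the target |x| is illustrative). The incremental gradient updates apply the chain rule through σ, which is what backpropagation computes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (x^s, beta^s), s = 1..q; target beta = |x| is illustrative.
q, m = 200, 8
xs = rng.uniform(-2, 2, size=q)
betas = np.abs(xs)

# Single nonlinear layer with y(x) = x and sigma = tanh (one choice among several).
A = rng.normal(size=m)
b = rng.normal(size=m)
r = np.zeros(m)

def phi(x):
    """phi_i(x, v) = sigma((A y(x) + b)_i), with v = (A, b)."""
    return np.tanh(A * x + b)

step = 0.05
for epoch in range(200):
    for s in rng.permutation(q):         # incremental: one sample at a time
        p = phi(xs[s])
        err = r @ p - betas[s]           # residual of r' phi(x^s, v) - beta^s
        dsig = 1.0 - p ** 2              # tanh' at the linear-layer outputs
        grad_r = err * p                 # chain rule through the architecture
        grad_A = err * r * dsig * xs[s]  # (this is what backpropagation computes)
        grad_b = err * r * dsig
        r -= step * grad_r
        A -= step * grad_A
        b -= step * grad_b

mse = np.mean((np.array([r @ phi(x) for x in xs]) - betas) ** 2)
print(round(float(mse), 4))
```

No regularization term is included here; in practice one would add, e.g., a small quadratic penalty on (v, r) to the per-sample gradients.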
Universal approximation theorem: With a sufficiently large number of parameters, "arbitrarily" complex functions can be closely approximated
Deep Neural Networks

[Figure: multilayer network — encoding y(x), alternating linear layers Ay(x) + b with parameters v = (A, b) and sigmoidal layers φ1(x, v), . . . , φm(x, v), final linear weighting r, cost approximation r′φ(x, v)]

More complex NNs are formed by concatenation of multiple layers
The outputs of each nonlinear layer become the inputs of the next linear layer
A hierarchy of features
Considerable success has been achieved in major contexts

Possible reasons for the success:
With more complex features, the number of parameters in the linear layers may be drastically decreased
We may use matrices A with a special structure that encodes special linear operations such as convolution
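The convolution point can be made concrete: a 1-D "valid" convolution is a linear layer whose matrix A is banded with shared entries, so a handful of kernel parameters replaces a full matrix. The kernel and input below are illustrative:

```python
import numpy as np

# Illustrative: a "valid" 1-D convolution with a 3-tap kernel is the linear map
# y -> A y where A is banded with shared entries, so 3 kernel parameters
# replace the (n - 2) * n free entries of a general matrix A.
n = 8
kernel = np.array([1.0, -2.0, 1.0])   # second-difference kernel (illustrative)

A = np.zeros((n - 2, n))
for i in range(n - 2):
    A[i, i:i + 3] = kernel            # same 3 parameters on every row

y = np.arange(n, dtype=float) ** 2    # sample input: y_j = j^2
conv = np.convolve(y, kernel[::-1], mode="valid")

print(np.allclose(A @ y, conv))       # True: structured matrix = convolution
print(A @ y)                          # second differences of j^2 are all 2
```

Here a general matrix of this shape has 48 free entries, while the structured (convolutional) layer has 3: this is the "drastically decreased" parameter count.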
Q-Factors - Model-Free RL
(Note a bit of mathematical magic: the order of E{·} and min has been reversed.)
We obtain Q̃k(xk, uk, rk) by training with many pairs ((xk^s, uk^s), βk^s), where βk^s is a sample of the corresponding Q-factor.
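A minimal sketch of this training step, with synthetic pairs and hypothetical linear features; β^s below stands in for a sampled Q-factor target. Once Q̃ is trained, the on-line control minimizes Q̃ over u, with no model (and no expectation over w) required:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, u):
    """Hypothetical features of the state-control pair (x, u)."""
    return np.array([1.0, x, x * x, x * u, u])

# Synthetic training pairs ((x^s, u^s), beta^s): controls u in {0, 1}, and
# beta^s is a noisy sample of a made-up Q-factor, (x - u)^2.
q = 500
xs = rng.uniform(-1.0, 1.0, size=q)
us = rng.integers(0, 2, size=q)
betas = (xs - us) ** 2 + 0.01 * rng.normal(size=q)

# Train Q~(x, u, r) = r' phi(x, u) by linear least squares on the pairs.
Phi = np.array([phi(x, u) for x, u in zip(xs, us)])
r, *_ = np.linalg.lstsq(Phi, betas, rcond=None)

def control(x):
    """Model-free control: minimize the trained Q-factor over u."""
    return min((0, 1), key=lambda u: r @ phi(x, u))

print(control(0.9), control(0.1))  # 1 0
```

The point of the reversed order of E{·} and min is visible in `control`: the minimization is over a trained function of (x, u), so no simulation or model evaluation is needed at decision time.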
Infinite horizon extensions: Approximate value and policy iteration methods, error
bounds, model-based and model-free methods
Temporal difference methods: A class of methods for policy evaluation in infinite
horizon problems with a rich theory, issues of variance-bias tradeoff
Sampling for exploration, in the context of policy iteration
Monte Carlo tree search, and related methods
Aggregation methods, synergism with other approximate DP methods
Approximation in policy space, actor-critic methods, policy gradient methods
Special aspects of imperfect state information problems, connections with
traditional control schemes
Infinite spaces optimal control, connections with aggregation schemes
Special aspects of deterministic problems: Shortest paths and their use in
approximate DP
Simulation-based methods for general linear systems, connection to proximal
algorithms