
1: Discrete Time Optimal Control

(DPA = Dynamic Programming Algorithm.)

System dynamics.
x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N−1
where
k ∈ ℕ discrete time index
x_k ∈ S_k state
u_k ∈ U_k(x_k) control input
w_k ∈ D_k disturbance or noise, potentially w_k ∼ P(·|x_k, u_k)
N ∈ ℕ time horizon
f_k system dynamics
When x_k can only take a finite # of values, express the dynamics by transition probabilities:
x_{k+1} = w_k (dynamics),  p_ij(u, k) = P(w_k = j | x_k = i, u_k = u) (transition probabilities)
To visualize: transition probability graph (Markov chain).
Additive cost function.
J_π(x_0) = E_{w_k}[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) ]
where g_N(x_N) is the terminal cost and the sum over the stage costs g_k(x_k, u_k, w_k) is the accumulated cost.
The above definitions (system dynamics and additive cost function) form the Basic Problem.
Open loop: decide on the control inputs {u_0, u_1, ..., u_{N−1}} before k = 0. Stick to it no matter what happens!
Closed loop: wait until time k to make the decision. Does not imply online computation! Basically, requires a mechanism to decide what u_k to apply as a function of x_k ⇒ implies x_k measurable.
Control rule u_k = µ_k(x_k). This is the mechanism that makes a closed loop. When at k, get x_k ⇒ u_k = µ_k(x_k). π = {µ_0, µ_1, ..., µ_{N−1}} is a policy or control law.
Π = {π_1, π_2, ...} is the set of admissible policies. The optimal policy π* = {µ*_0, µ*_1, ..., µ*_{N−1}} is s.t. J_{π*}(x_0) ≤ J_π(x_0) ∀π ∈ Π. Then J*(x_0) ≡ J_{π*}(x_0) is the optimal cost.
DPA objective: find π*.
Principle of optimality. Suppose π* = {µ*_0, µ*_1, ..., µ*_{N−1}} is the optimal policy going from time 0 to time N−1. Then the truncated policy {µ*_i, µ*_{i+1}, ..., µ*_{N−1}} is the optimal policy going from time i to time N−1.
The DPA. The optimal cost J*(x_0) and the associated optimal policy π* are given by the last step of the following recursive algorithm, proceeding backwards in time:
Initialization: J_N(x_N) = g_N(x_N) ∀x_N ∈ S_N
Recursion: for each x_k ∈ S_k, compute the cost-to-go at state x_k:
J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]
In the above: memorize each J_k(x_k) and the argmin, u*_k = µ*_k(x_k), for each x_k, k = N−1, ..., 1, 0. Finally: J_0(x_0) = J*(x_0), and the memorized µ*_k form the optimal policy π* in the form of a lookup table.
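A minimal sketch of this backward recursion for a finite problem, assuming the disturbance has already been folded into per-input transition matrices and expected stage costs; the names dpa, P, g, gN are hypothetical placeholders, not from the notes:

```python
import numpy as np

def dpa(P, g, gN, N):
    """Backward DPA over a finite state space.

    P[u][i, j]: transition probability i -> j under input u (assumed),
    g[u][i]:    expected stage cost of input u in state i (time-invariant here),
    gN[i]:      terminal cost, N: horizon.
    Returns the costs-to-go J[k, i] and the lookup table mu[k, i].
    """
    n_u, n_x = len(P), len(gN)
    J = np.zeros((N + 1, n_x))
    mu = np.zeros((N, n_x), dtype=int)
    J[N] = gN                                    # initialization: J_N = g_N
    for k in range(N - 1, -1, -1):               # recursion, backwards in time
        # Q[u, i] = g(i, u) + E[ J_{k+1}(x_{k+1}) | x_k = i, u_k = u ]
        Q = np.array([g[u] + P[u] @ J[k + 1] for u in range(n_u)])
        mu[k] = Q.argmin(axis=0)                 # memorize the argmin mu*_k
        J[k] = Q.min(axis=0)                     # memorize the cost-to-go J_k
    return J, mu
```

At run time the lookup table plays the role of π*: measure x_k, apply u_k = mu[k, x_k].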
Recasting problems into DPA formulation
1. Time lags:
x_{k+1} = f_k(x_k, x_{k−1}, u_k, u_{k−1}, w_k)
Recast: define y_k = x_{k−1}, s_k = u_{k−1} and let x̃_k = (x_k, y_k, s_k). Then x̃_{k+1} = f̃_k(x̃_k, u_k, w_k):
x̃_{k+1} = (x_{k+1}, y_{k+1}, s_{k+1}) = ( f_k(x_k, y_k, u_k, s_k, w_k), x_k, u_k )
Consequently: u_k = µ_k(x_k, y_k, s_k).
This is exactly how discrete-time state space works! Sure, we write x_{k+1} = Ax_k + Bu_k, but in fact x_{k+1} → x̃_{k+1}, x_k → x̃_k, so the underlying mechanics is that all the “−1”s are replaced with augmented states.
2. Correlated disturbances: disturbances given by a linear system: w_k = C_k y_{k+1}, y_{k+1} = A_k y_k + ξ_k. Recast:
x̃_{k+1} = (x_{k+1}, y_{k+1}) = ( f_k(x_k, u_k, C_k(A_k y_k + ξ_k)), A_k y_k + ξ_k ) = f̃_k(x̃_k, u_k, ξ_k)
Consequently: u_k = µ_k(x_k, y_k).
3. Forecasts:
Idea: at time k, receive info that w_k has probability distribution Q_k. At time k+1, w_{k+1} can take on prob. distrib. Q_i with probability p_i; we encode this in y_{k+1} = ξ_k where P(ξ_k = i) = p_i; when ξ_k = i, the next disturbance w_{k+1} will be evaluated using prob. distrib. Q_i. Augmented state:
x̃_{k+1} = (x_{k+1}, y_{k+1}) = ( f_k(x_k, u_k, w_k), ξ_k ) with w_k ∼ Q_{y_k}
Consequently: u_k = µ_k(x_k, y_k), i.e. the input we apply now depends on the forecast we receive right now.
Then the cost-to-go becomes:
J_k(x_k, y_k) = min_{u_k} E_{w_k}[ g_k(x_k, u_k, w_k) + E_{ξ_k}[ J_{k+1}(f_k(x_k, u_k, w_k), ξ_k) ] | y_k ]
= min_{u_k} E_{w_k}[ g_k(x_k, u_k, w_k) + Σ_{i=1}^{m} p_i J_{k+1}(f_k(x_k, u_k, w_k), i) | y_k ]
Viterbi algorithm. Situation: Markov chain with transition probabilities p_ij = P(x_{k+1} = j | x_k = i) and initial state probability p(x_0), but we can only indirectly observe x via measurements r(z; i, j) = P(meas = z | x_k = i, x_{k+1} = j) ∀k. Objective: given the measurements Z_N = {z_1, ..., z_N}, find X̂_N = {x̂_0, ..., x̂_N} = argmax_{X_N} P(X_N | Z_N). Idea:
P(X_N, Z_N) = P(x_0, ..., x_N, z_1, ..., z_N)
= P(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) P(z_1 | x_0, x_1) P(x_1 | x_0) P(x_0)
= P(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) r(z_1; x_0, x_1) p_{x_0, x_1} p(x_0)
Continuing by developing P(x_2, ..., x_N, z_2, ..., z_N | x_0, x_1, z_1) the same way, we get:
P(X_N, Z_N) = p(x_0) Π_{k=1}^{N} p_{x_{k−1}, x_k} r(z_k; x_{k−1}, x_k)
Taking the logarithm, maximizing P(X_N, Z_N) ≡ min_{X_N} ( −log(p(x_0)) + Σ_{k=1}^{N} −log( p_{x_{k−1}, x_k} r(z_k; x_{k−1}, x_k) ) ).
Can apply the DPA with 0 terminal cost and stage cost −log(p(x_0)) for k = 0.
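A compact sketch of this shortest-path view of Viterbi in −log space; the names viterbi, p0, P, r are hypothetical placeholders, with r a table of measurement likelihoods:

```python
import numpy as np

def viterbi(p0, P, r, z):
    """Most likely state sequence via DP on -log probabilities.

    p0[i]: initial state prob., P[i, j]: transition prob.,
    r[zk][i, j]: P(meas = zk | x_k = i, x_{k+1} = j), z: measurement list.
    """
    with np.errstate(divide="ignore"):           # log(0) -> -inf is fine here
        cost = -np.log(p0)                       # stage cost -log p(x0) at k = 0
        back = []
        for zk in z:
            # total[i, j] = best cost into i, plus arc cost -log(p_ij * r)
            total = cost[:, None] - np.log(P * r[zk])
            back.append(total.argmin(axis=0))    # best predecessor of each j
            cost = total.min(axis=0)
    path = [int(cost.argmin())]                  # cheapest end state (0 terminal cost)
    for b in reversed(back):                     # backtrack the memorized argmins
        path.append(int(b[path[-1]]))
    return path[::-1]                            # x0_hat, ..., xN_hat
```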
Deterministic, finite state systems (shortest path problem)
Assume:
1. x_k ∈ S_k finite set
2. No disturbance w_k, i.e. P(w_k = 0) = 1.
3. Only one way to go from state i ∈ S_k to j ∈ S_{k+1} (if multiple, pick the one with lowest cost when at stage k...).
(Figure: trellis of states over stages 1, 2, ..., N, from S to T.)
Shortest Path Problem.
a^k_ij = g_k(i, u^{ij}_k) is the cost to go from state i ∈ S_k to j ∈ S_{k+1}, where j = f_k(i, u^{ij}_k). a_ij = ∞ if going from i to j is impossible.
Goal: find the path of least cost from S to T.
Shortest Path DPA.
Initialization: J_N(i) = a^N_{iT} ∀i ∈ S_N
Recursion: for each i ∈ S_k, compute the cost-to-go at state i:
J_k(i) = min_{j ∈ S_{k+1}} [ a^k_ij + J_{k+1}(j) ],  k = N−1, ..., 1, 0
Forward DPA (only possible when deterministic, no noise).
Idea: going backwards is the same as going forwards, the “optimal” path cannot change! Possible only when deterministic, because of the noise info.
Initialization: J̃_N(j) = a^0_{Sj} ∀j ∈ S_1
Recursion: for each j ∈ S_{N−k+1}, compute the cost-to-arrive at state j:
J̃_k(j) = min_{i ∈ S_{N−k}} [ a^{N−k}_ij + J̃_{k+1}(i) ],  k = N−1, ..., 1, 0
When finished, the optimal path is the same, so: J̃_0(T) ≡ J_0(S).
Recast shortest path as DPA. Assume all cycles have non-negative cost (otherwise: infinite loop). Then going from S to T will only ever require at most M := N − 1 moves (visit all nodes, not counting the starting node). So we can limit the horizon to M. Allow degenerate moves: we can stay at some nodes (effectively reaching T in M − (# degen.’s) moves).
Call J_k(i) the optimal cost to go i → T in M − k moves. Then can write:
Recast shortest path as DPA algorithm.
Initialization (k = M−1): J_{M−1}(i) = a_{iT} ∀i. Can be infinite if no direct path i → T!
Recursion: for each i, calculate the cost to go to T in M − k moves by going to j in one move and from j to T in M − k − 1 moves:
J_k(i) = min_j [ a_ij + J_{k+1}(j) ],  k = M−2, ..., 1, 0
Termination criterion: if k = 0 or J_k(i) = J_{k+1}(i) ∀i.
How about an alternative to DP for shortest path? Label-correcting algorithm.
(Figure: flowchart of the OPEN bin: node i is removed from OPEN, and each child j passes the tests d_i + a_ij < d_j? and d_i + a_ij < d_T? before d_j is set to d_i + a_ij, i is recorded as parent, and j is placed into OPEN.)
0. Place node S into OPEN, set d_S = 0, d_j = ∞ ∀j.
1. Remove i from OPEN, execute Step 2 ∀ children j of i.
2. If d_i + a_ij < min(d_j, d_T), set d_j = d_i + a_ij, set i to be the optimal parent of j. If j ≠ T, place j into OPEN if not already there.
3. If OPEN empty, done. Else, go to Step 1.
(A code sketch of these steps follows the A* paragraph below.)
Proof of termination. Every time j enters OPEN, its cost is reduced to the current shortest path from S → j. The # of distinct paths from S → j with cost smaller than any given number is finite, since the # of nodes is finite and a_ij ≥ 0 ∀i, j. Hence there can be only a finite # of cost reductions ⇒ the algorithm terminates in a finite # of steps. ∎
Proof that the resulting cost is optimal (or ∞ if no path exists). Suppose no path S → T exists. Then no node i such that i → T exists can enter OPEN. Thus, d_T will never be reduced from ∞. Now suppose a path S → T exists. Then, since the number of paths with cost smaller than any given number is finite, a shortest path exists. Let (S, j_1, ..., j_k, T) be this path and d* its cost. Let d_m be the cost of the path (S, j_1, ..., j_m), m = 1, ..., k. FTSOC, suppose at termination d_T > d*. If so, this must have been true throughout the algorithm. Notably, d_T > d_m ∀m (since a_ij ≥ 0). It follows that j_k can never have entered OPEN with cost d_k, otherwise the next iteration would set d_T = d*. Similarly, j_{k−1} can never have entered OPEN with d_{k−1}, otherwise in the next iteration j_k would enter with distance d_k, and then d_T = d* the iteration after. Going back, it means j_1 never enters OPEN with d_1. But this happens at the first iteration ⇒ contradiction! ∎
Infinite # of nodes. Suppose 1) a_ij ≥ 1, 2) the # of children is finite for each node, 3) ∃ a path S → T, 4) the shortest distance d*_T (∈ ℕ) ≤ d_{T,max} < ∞. Define the set R = ∪_{i=1}^{d_{T,max}} S_i where S_i is the set of nodes for which the min # of arcs to S is i; let’s show that R is finite. Induction: • S_1 is finite by 2) • Assume S_k finite (*) • S_{k+1} ⊆ {i | i child of a node in S_k}. Since S_k is finite (*) & each j ∈ S_k has a finite # of children (by 2)), S_{k+1} is finite. So R is finite. Also, any i ∉ R never enters OPEN, since by 1) the shortest path S → i is > d_{T,max}. So we are back to a finite-graph problem and can apply the proofs above to show that the algorithm terminates with label d_T = d*_T! ∎
NB1: the test d_i + a_ij < d_T assumes that the cost always increases along a path. Won’t work with negative arc lengths!
NB2: a child can have multiple parents, only one optimal parent, and multiple of its own children...
Different algorithms around how to select the item in OPEN:
1. LIFO (depth-first): pick the youngest node in the bin (idea: explore as far as possible first).
2. FIFO (breadth-first): pick the oldest node in the bin (idea: stockpile many nodes into OPEN).
3. Dijkstra (best-first): pick the node with the lowest current cost (idea: explore the most promising nodes).
A* Algorithm.
Replace the test d_i + a_ij < d_T in the label-correcting algo. by d_i + a_ij + h_j < d_T, where h_j is a lower bound on the cost from j → T, i.e. d_{jT} ≥ h_j.
Idea: prevent adding j into OPEN if clearly, even in the best-case scenario (lower bound h_j), going from j to T will cost more than the already-available d_T.
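A minimal sketch of Steps 0–3 with a FIFO deque as the OPEN bin; the dict-of-children graph encoding and the name label_correcting are hypothetical choices, not from the notes:

```python
from collections import deque
import math

def label_correcting(graph, S, T):
    """graph[i]: list of (j, a_ij) children, a_ij >= 0;
    every node (including T) must appear as a key of graph."""
    d = {node: math.inf for node in graph}       # step 0: d_j = inf for all j
    d[S] = 0.0                                   # step 0: d_S = 0
    parent = {}
    OPEN = deque([S])                            # step 0: place S into OPEN
    while OPEN:                                  # step 3: stop when OPEN empty
        i = OPEN.popleft()                       # step 1 (FIFO selection)
        for j, a_ij in graph[i]:                 # step 2, for all children of i
            if d[i] + a_ij < min(d[j], d[T]):
                d[j] = d[i] + a_ij
                parent[j] = i                    # i becomes optimal parent of j
                if j != T and j not in OPEN:
                    OPEN.append(j)
    return d[T], parent                          # shortest cost (or inf), tree
```

Swapping popleft() for pop() gives the LIFO variant, a priority queue keyed on d gives Dijkstra, and adding a lower bound h[j] inside the d_T test gives A*.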
Multi-objective problems
Idea: multiple costs you care about, but you don’t know yet how to “combine” them into one cost. How to narrow down the # of possible optimal control policies?
(Figure: plane of cost #1 (time) vs. cost #2 (fuel); a point is inferior b/c some other point is at least as good in every cost.)
A vector of costs x = (x_1, x_2, ..., x_M) is inferior if ∃y s.t. y_l ≤ x_l, l = 1, 2, ..., M, with strict inequality for at least one l. If no such y exists, x is non-inferior.
Suppose you have M cost functions f_1(x), f_2(x), ..., f_M(x) with x ∈ X a vector. x ∈ S is called a non-inferior solution if (f_1(x), f_2(x), ..., f_M(x)) is a non-inferior vector of the set {(f_1(y), f_2(y), ..., f_M(y)) | y ∈ S}. Each cost f_l(x) is computed using the deterministic DPA:
x_{k+1} = f_k(x_k, u_k)
f_l(u) = g^l_N(x_N) + Σ_{k=0}^{N−1} g^l_k(x_k, u_k)
Idea: find all non-inferior solutions, store them and later decide which one to use (based on some criteria).
Extended principle of optimality: If {u_k, u_{k+1}, ..., u_{N−1}} is a non-inferior control sequence starting at x_k at time k, then {u_{k+1}, ..., u_{N−1}} is a non-inferior control sequence starting at f_k(x_k, u_k) at time k + 1.
Multi-objective problem algorithm (***). Call F_k(x_k) the set of non-inferior M-tuples of costs-to-go at x_k. The algorithm is then:
Initialization: F_N(x_N) = {(g^1_N(x_N), g^2_N(x_N), ..., g^M_N(x_N)) | x_N ∈ S_N}.
Recursion: for each x_k ∈ S_k, compute the non-inferior set at state x_k:
F_k(x_k) = noninf_{u_k ∈ U_k(x_k)} { (g^1_k(x_k, u_k) + c_1, ..., g^M_k(x_k, u_k) + c_M) | (c_1, ..., c_M) ∈ F_{k+1}(x_{k+1}) }
This search is hard: sometimes brute force, explore all u_k combinations to find the non-inferior M-tuples!
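A small sketch of the noninf filter used in the recursion, i.e. dropping every candidate M-tuple that is dominated by another one (brute-force pairwise checks, as the notes warn); the name noninf mirrors the notation above:

```python
def noninf(tuples):
    """Keep the non-inferior M-tuples: drop x if some y has
    y_l <= x_l for all l, with strict inequality for at least one l."""
    def dominated(x, y):
        return all(yl <= xl for yl, xl in zip(y, x)) and y != x
    return [x for x in tuples if not any(dominated(x, y) for y in tuples)]

# e.g. noninf([(1, 5), (2, 2), (3, 1), (4, 4)]) -> [(1, 5), (2, 2), (3, 1)]
```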
At the final iteration we’ll get F_0(x_0); this is the set of non-inferior solutions (the associated non-inferior control policies will have had to be memorized when doing the algorithm!). Later use some criterion to pick the one we’ll use (e.g.: take a linear combination of each M-tuple, pick the smallest one)!
The above algorithm degenerates to the DPA when M = 1, since in this case non-inferior ≡ minimal: the single minimal cost is non-inferior and all the other costs are inferior.
Handling constraints
Consider the problem:
x_{k+1} = f_k(x_k, u_k)
J_π = g^1_N(x_N) + Σ_{k=0}^{N−1} g^1_k(x_k, u_k)   (cost)
subject to g^l_N(x_N) + Σ_{k=0}^{N−1} g^l_k(x_k, u_k) ≤ b^l,  l = 2, ..., M   (constraints)
Solution: calculate F_0(x_0) with algo. (***) above. Then throw away any elements (i.e. control policies) that don’t satisfy the constraints and, from what’s left, pick the one with smallest cost (i.e. smallest J_π).
Can we do better? Yes, if the pbm. is deterministic (no noise). Do the following:
1. Compute J̃^l_k(x_k), the optimal cost to arrive at x_k, for l = 2, ..., M using forward DP on cost l, individually. Forms a lower bound for cost l, “lower” because it may be violating the other costs.
2. Apply algo. (***) but in the Recursion, on top of the non-inferior check, also use an “A* algo. in reverse”-type check:
J̃^l_k(x_k) [going 0 → k] + g^l_k(x_k, u_k) + c_l [going k → N] ≤ b^l,  l = 2, ..., M
At time k = 0, you’ll have only the feasible, non-inferior control sequences.
Infinite-horizon problems
Time-invariant system:
x_{k+1} = f(x_k, u_k, w_k),  x_k ∈ S, u_k ∈ U, w_k ∼ P(·|x_k, u_k)
Notice: no k subscript on f, S and U!
Cost:
J_π(x_0) = E[ Σ_{k=0}^{N−1} g(x_k, µ_k(x_k), w_k) ]   No terminal cost! g time-inv.!
Idea: loose notion of time, obtain a static policy u = µ(x) using the
Bellman equation:
J*(x) = min_u E_w[ g(x, u, w) + J*(f(x, u, w)) ]  ∀x ∈ S
Stochastic shortest-path problems (SSPP)
Idea: a problem with infinite horizon, no noise, but state transitions uncertain (only probabilities known). Want to control so as to optimally reach a “terminal state”.
Dynamics:
x_{k+1} = w_k,  x_k ∈ S = {1, 2, ..., n, t} finite set!
P(w_k = j | x_k = i, u_k = u) = p_ij(u),  u_k ∈ U(i) finite set!
Special cost-free termination state (think: “destination state”):
1. p_tt(u) = 1 ∀u ∈ U(t) (“once at t, always at t”).
2. g(t, u) = 0 ∀u ∈ U(t) (“no cost being at t”).
Policy stationary: π = {µ, µ, ...} ≡ µ. µ is optimal if J_µ(i) = J*(i) = min_π J_π(i) (the optimal cost), where the cost is computed as:
J_π(i) = lim_{N→∞} E[ Σ_{k=0}^{N−1} g(x_k, µ(x_k)) | x_0 = i ]   (1)
Main results:
(A). Given any initial conditions J_0(1), ..., J_0(n) (pick them!) the sequence
J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_ij(u) J_k(j) ]  ∀i ∈ S \ t = {1, 2, ..., n}
converges to the optimal cost J*(i) for each i. Note: the subscript k in J_k is not “stage k”, rather the “current iteration number” in the sequence!
(B). The optimal cost satisfies Bellman’s Equation (version for stochastic shortest path problems):
J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_ij(u) J*(j) ]  ∀i = 1, ..., n   (2)
which has a unique solution.
(C). For any static policy µ, the costs J_µ(i) are the unique solutions of:
J(i) = g(i, µ(i)) + Σ_{j=1}^{n} p_ij(µ(i)) J(j),  i ∈ S \ t = {1, 2, ..., n}
Furthermore, given any initial conditions J_0(1), J_0(2), ..., J_0(n) the sequence
J_{k+1}(i) = g(i, µ(i)) + Σ_{j=1}^{n} p_ij(µ(i)) J_k(j)
converges to J_µ(i) (the cost starting from state i given policy µ) for each i. Proof: take (A) but restrict U(i) = {µ(i)}, so no min ⇒ just take u = µ(i). Convergence to J_µ(i) is assured by (B), with J_µ(i) not J*(i) since we only consider the single policy µ...
How to solve Bellman’s equation (2)?
Method 1: Value Iteration (VI). Basically, use (A).
Step 1: Choose initial costs J_0(i), i = 1, ..., n (could guess, could be smart).
Step 2: Iterate by solving
J_{k+1}(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_ij(u) J_k(j) ]  ∀i = 1, 2, ..., n   (3)
until convergence: |J_{k+1}(i) − J_k(i)| < ε ∀i. Problem: NOT guaranteed that the policy has converged!
Complexity of VI: Step 2 is a minimization over p possibilities of u (each taking n multiplications to compute because of the sum), done n times (for each i) ⇒ O(pn²).
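A minimal sketch of iteration (3), with the terminal state t dropped from the arrays; g[i, u] and P[u][i, j] are hypothetical placeholders (a row of P may sum to less than 1, the missing mass being the probability of jumping to t):

```python
import numpy as np

def value_iteration(g, P, eps=1e-9):
    """VI for an SSPP: g[i, u] stage costs, P[u][i, j] transition
    probs among the non-terminal states 1..n."""
    n, p = g.shape
    J = np.zeros(n)                              # step 1: initial costs
    while True:
        # Q[i, u] = g(i, u) + sum_j p_ij(u) J(j)
        Q = g + np.stack([P[u] @ J for u in range(p)], axis=1)
        J_next = Q.min(axis=1)                   # step 2: Bellman update (3)
        if np.max(np.abs(J_next - J)) < eps:     # |J_{k+1}(i) - J_k(i)| < eps
            return J_next, Q.argmin(axis=1)      # costs and greedy policy
        J = J_next
```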
Method 2: Policy Iteration (PI). Basically, use (C). Let i ∈ S \ t = {1, 2, ..., n}.
Initialize: Choose an initial policy µ^0(i) ∀i (can be anything...). Set k = 0.
Stage 1: Given µ^k (the static policy at iteration k; not “time”, no such thing exists here), obtain J_{µ^k}(i) by solving:
J(i) = g(i, µ^k(i)) + Σ_{j=1}^{n} p_ij(µ^k(i)) J(j)  for each i   (4)
This is a linear problem with n eq.’s, n unknowns. Solve as Ax = b ⇒ x = A⁻¹b, where x = [J_{µ^k}(1); J_{µ^k}(2); ...; J_{µ^k}(n)].
Stage 2: Improve the policy by iterating
µ^{k+1}(i) = argmin_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} p_ij(u) J_{µ^k}(j) ]  ∀i
Quit when J_{µ^{k+1}}(i) = J_{µ^k}(i) ∀i (precise no-bullshit stopping criterion).
NB: STOP ⇔ J_{µ^{k+1}}(i) = J_{µ^k}(i) or µ^{k+1}(i) = µ^k(i) ∀i. If you get multiple minimizers for µ^{k+1}(i), including some identical to µ^k(i), you must continue to iterate! Only guarantee: it will converge eventually for whatever choice!
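A sketch of the two stages under the same hypothetical g, P encoding as above, assuming every policy is proper so that I − P_µ is invertible; Stage 1 solves the linear system (4) with numpy rather than inverting A explicitly:

```python
import numpy as np

def policy_iteration(g, P):
    """PI for an SSPP; g[i, u], P[u][i, j] as in the VI sketch."""
    n, p = g.shape
    mu = np.zeros(n, dtype=int)                  # initialize: any policy
    J = np.full(n, np.inf)
    while True:
        # Stage 1: solve (I - P_mu) J = g_mu, i.e. the linear system (4)
        P_mu = np.stack([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J_new = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Stage 2: greedy policy improvement on J_new
        Q = g + np.stack([P[u] @ J_new for u in range(p)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(J_new, J):             # stop: J_{mu^{k+1}} = J_{mu^k}
            return J_new, mu
        J, mu = J_new, mu_new
```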
Extra “Method”: Exhaustive Search. Enumerate all policy combinations and for each policy find J(i), i = 1, ..., n using (4). The optimal policy is the one giving the smallest (if taking min) vector [J(1); ...; J(n)].
Complexity of PI: Stage 2 is like VI, so O(pn²). Stage 1: solve a linear system, so O(n³). ⇒ O(n²(n + p)). Worst case: search over all policy combinations, pⁿ of them.
Rewrite VI (left) as PI (right):
J_{k+1}(i) = min_u [ g(i, u) + Σ_j p_ij(u) J_k(j) ]   →   1. µ^k(i) = argmin_u [ g(i, u) + Σ_j p_ij(u) J_k(j) ];  2. J_{k+1}(i) = g(i, µ^k(i)) + Σ_j p_ij(µ^k(i)) J_k(j)
PI solves a linear system → same as running the value update an ∞ # of times. VI: only does it once.
For the next solution method, notice that if we pick as initial condition
J_0(i) ≤ min_{u ∈ U(i)} [ g(i, u) + Σ_j p_ij(u) J_0(j) ]  ∀i = 1, ..., n
then by (3), J_0(i) ≤ J_1(i) ∀i. Can show this means: J_k(i) ≤ J_{k+1}(i) ∀i, k. From VI we know that J_k → J*, so J_0(i) ≤ J_k(i) ≤ J*(i) ∀i, k. Can do:
Method 3: Linear Programming (LP). Find J by solving the following linear program:
max Σ_{i=1}^{n} J(i)  subject to  J(i) ≤ g(i, u) + Σ_{j=1}^{n} p_ij(u) J(j)  ∀i, ∀u ∈ U(i)
The solution must be J*, since J(i) ≤ J*(i) ∀i and J* satisfies the constraints.
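A sketch of this LP with scipy.optimize.linprog (which minimizes, so the objective becomes −Σ J(i)); same hypothetical g, P encoding as above:

```python
import numpy as np
from scipy.optimize import linprog

def bellman_lp(g, P):
    """Solve the SSPP Bellman equation as an LP; g[i, u], P[u][i, j]."""
    n, p = g.shape
    A_ub, b_ub = [], []
    for u in range(p):
        for i in range(n):
            # constraint: J(i) - sum_j p_ij(u) J(j) <= g(i, u)
            row = -P[u][i].astype(float)
            row[i] += 1.0
            A_ub.append(row)
            b_ub.append(g[i, u])
    # maximize sum_i J(i)  <=>  minimize -sum_i J(i)
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                                 # J*(1), ..., J*(n)
```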
Discounted problems. Discounted problem cost function:
J_π(i) = lim_{N→∞} E[ Σ_{k=0}^{N−1} α^k g(x_k, µ(x_k)) | x_0 = i ],  0 < α < 1, i ∈ {1, ..., n}
Advantage: no explicit termination state required! Use when your problem “doesn’t have an end”...
Bellman’s Equation for the discounted problem:
J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^{n} (α · p_ij(u)) J*(j) ]  ∀i = 1, ..., n
Can then use VI, PI or LP as before using α · p_ij(u) as the trans. prob.’s!
Markov Decision Processes (MDP). Idea: the stage cost in (1) is E_{x_k}(g(x_k, µ(x_k))); in MDP, replace this with E_{x_k, x_{k+1}}(g(x_k, x_{k+1})). However:
E_{x_k, x_{k+1}}(g(x_k, x_{k+1})) = E_{x_k}[ E_{x_{k+1}}( g(x_k, x_{k+1}) | x_k ) ] = E_{x_k}[ Σ_j g(x_k, j) p_ij(µ(x_k)) ]
Then, can use the inner sum Σ_j g(i, j) p_ij(u) as the stage cost g(i, u) and apply VI/PI/LP as before!
2: Continuous Time Optimal Control constrained variables!
Notation
• ∂F(t, x)/∂t : partial derivative.
• ∂F(t, x(t))/∂t = ∂F(t, x)/∂t |_{x=x(t)} : partial derivative (shorthand).
• dF(t, x(t))/dt = ∂F(t, x(t))/∂t + ∂F(t, x)/∂x |_{x=x(t)} · ∂x(t)/∂t : total derivative.
The Hamilton Jacobi Bellman (HJB) Equation
System dynamics:
ẋ(t) = f(x(t), u(t)),  0 ≤ t ≤ T,  x(0) = x_0,  no noise!
where
x(t) ∈ ℝ^n state
u(t) ∈ U ⊂ ℝ^m control constraint set
t ∈ ℝ time, with T the terminal time
Require: f(·, u(t)) continuously differentiable, f(x(t), ·) continuous, u(t) piecewise continuous, and assume a solution exists and is unique on 0 ≤ t ≤ T (a very heavy assumption!).
Objective: minimize the following cost function:
Cost = h(x(T)) + ∫_0^T g(x(t), u(t)) dt
where g(·, u(t)) and h(·) are continuously differentiable, g(x(t), ·) continuous.
Hamilton Jacobi Bellman (HJB) equation:
0 = min_{u ∈ U} [ g(x, u) + ∂V(t, x)/∂t + (∂V(t, x)/∂x)^T f(x, u) ]  ∀t, x
V(T, x) = h(x)   boundary condition
The u = µ(t, x) minimizing the RHS of the HJB is an optimal policy, and the resulting V(t, x) := J*(t, x) is the optimal cost (NB: must verify the HJB and the boundary condition!). NB: the HJB is a sufficient but not necessary optimality condition!
LQR controller
Cost function:
Cost = x^T(T) Q_T x(T) + ∫_0^T [ x^T(t) Q x(t) + u^T(t) R u(t) ] dt
with Q_T = Q_T^T ≥ 0, Q = Q^T ≥ 0, R = R^T > 0. Assume:
• Cost-to-go is quadratic: V(t, x) = x^T K(t) x
• K(t) = K^T(t)
Optimal policy: µ(t, x) = −R⁻¹B^T K(t) x(t), where K(t) is found from the Continuous Time Riccati Differential Equation:
K̇(t) = −K(t)A − A^T K(t) + K(t)BR⁻¹B^T K(t) − Q,  K(T) = Q_T
For an infinite-horizon problem (Q_T = 0, T = ∞): µ(x) = −R⁻¹B^T Kx := −Fx, where K is found from the Algebraic Riccati Equation:
0 = −KA − A^T K + KBR⁻¹B^T K − Q
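A sketch of the infinite-horizon case with scipy, whose solve_continuous_are solves the same ARE written in the sign convention A^T K + KA − KBR⁻¹B^T K + Q = 0; the system matrices here are hypothetical placeholders:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical double integrator: x = [position; velocity], u = force.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = solve_continuous_are(A, B, Q, R)             # Algebraic Riccati solution
F = np.linalg.solve(R, B.T @ K)                  # F = R^{-1} B^T K, u = -F x
print(F)                                         # static feedback gain
```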
Lemma. F(t, x, u) continuously differentiable, U convex and µ*(t, x) := argmin_{u ∈ U} F(t, x, u) continuously differentiable. Then:
1. ∂/∂t [ min_{u ∈ U} F(t, x, u) ] = ∂F(t, x, µ*(t, x))/∂t  ∀t, x
2. ∂/∂x [ min_{u ∈ U} F(t, x, u) ] = ∂F(t, x, µ*(t, x))/∂x  ∀t, x
Pontryagin’s Minimum Principle.
Define: H(x, u, p) := g(x, u) + p^T f(x, u), the Hamiltonian.
Co-state equation: p(t) = ∂J*(t, x*(t))/∂x
u*(t) := optimal control and x*(t) := resulting state trajectory.
Statement: u*(t), x*(t) & the co-state p(t) must minimize H(x, u, p) (maximize if the original problem seeks to maximize the cost!).
Resulting conditions (≡ the free-terminal-state-with-cost-h(x(T)) pbm.):
1) ẋ*(t) = ∂H/∂p (x*(t), u*(t), p(t)),  x*(0) = x_0
2) Adjoint eq.’s: ṗ(t) = −∂H/∂x (x*(t), u*(t), p(t)),  p(T) = ∂h(x*(T))/∂x   (5)
3) u*(t) = argmin_{u ∈ U} H(x*(t), u, p(t))
4) H(x*(t), u*(t), p(t)) = const. (if time-invariant f, g)  ∀t ∈ [0, T]
The full set of equations/ICs (5) := Pontryagin’s necessary conditions for optimality. When writing in the exam: in condition 3), already show the minimization’s result (i.e. u* = {···})!
NB: when there are multiple states, write 2) for each state individually, i.e. ṗ_i = −∂H/∂x_i.
NB: when there is more than 1 input, write 3) for each input individually (u*_i(t) = argmin_{u_i ∈ U_i} H).
NB: when solving cond. 1) using Laplace, substitute x_1(0), ẋ_1(0), etc. by A, B, ... and use the available ICs to solve for A, B, ... Always possible! Don’t “assume” e.g. ẋ_i(0) = 0 if not given! Same for cond. 2) (e.g. don’t assume ṗ_j(0) = 0...).
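A small sympy sketch of conditions 2) and 3) on a hypothetical scalar example (ẋ = u, g = x² + u², unconstrained u), just to show the mechanics of forming H and reading off the adjoint equation and the minimizer:

```python
import sympy as sp

x, u, p = sp.symbols("x u p")
f = u                                            # hypothetical dynamics: xdot = u
g = x**2 + u**2                                  # hypothetical stage cost
H = g + p * f                                    # Hamiltonian H = g + p^T f

p_dot = -sp.diff(H, x)                           # condition 2): pdot = -dH/dx
u_star = sp.solve(sp.diff(H, u), u)[0]           # condition 3) with U = R
print(p_dot, u_star)                             # -> -2*x, -p/2
```

With U = ℝ, condition 3) reduces to stationarity ∂H/∂u = 0; with a constrained U, minimize H over U directly (cf. the singular problems below).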
Time-varying ẋ = f(x, u, t). Everything the same, but H ≠ const. Idea: augment the state, z = [x; y] = [x; t] ⇒ ż = [f(x, u, y); 1]. Then H̃ = g(x, u, y) + p_1^T f(x, u, y) + p_2 = H(x, u, p, y) + p_2. Since ṗ_2 = −∂H̃/∂y ≠ 0 generally, and since H̃ = const. ⇒ H(x, u, p, y) = H(x, u, p, t) ≠ const.!
The minimum principle is necessary but not sufficient for optimality; necessary and sufficient ⇔ f(x, u) linear and U, h, g convex! Also, since necessary, if ∃! solution only, then it is optimal!
Practical implementation:
• HJB: if you can solve it, then you will obtain u = µ(x), your feedback controller!
• Pontryagin: can obtain one “optimal trajectory” (for the given initial/final conditions and cost). Use another closed-loop feedback (e.g. PID) to track the trajectory, or solve for the trajectory/optimal input online (e.g. MPC).
Minimum Principle applications (drop * notation for brevity)
Fixed Terminal State x(T) = x_T.
Cost = ∫_0^T g(x(t), u(t)) dt
Conditions 1 and 2 (3 and 4 same as in (5)):
1) ẋ(t) = f(x(t), u(t)),  x(0) = x_0, x(T) = x_T
2) ṗ(t) = −∂H/∂x (x(t), u(t), p(t))
Free initial state, with cost l(x(0)).
Cost = l(x(0)) + ∫_0^T g(x(t), u(t)) dt
Conditions 1 and 2 (3 and 4 same as in (5)):
1) ẋ(t) = f(x(t), u(t)),  x(T) = x_T
2) ṗ(t) = −∂H/∂x (x(t), u(t), p(t)),  p(0) = −∂l(x(0))/∂x
Free initial and terminal states. Conditions 1 and 2:
1) ẋ(t) = f(x(t), u(t))
2) ṗ(t) = −∂H/∂x (x(t), u(t), p(t)),  p(0) = −∂l(x(0))/∂x,  p(T) = ∂h(x(T))/∂x
Free terminal time.
Cost = h(x(T)) + ∫_0^T g(x(t), u(t)) dt
Conditions 1, 2 and 3 same as in (5); 4) H(x(t), u(t), p(t)) = 0 ∀t ∈ [0, T].
Singular problems
The problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval ⇒ cannot use u = argmin(H) to find the optimal input during that time interval!
Example situation:
u(t) = { 1 if p > 0;  −1 if p < 0;  ??? if p = 0 (“singular arc candidate”) }
Solution: a singular arc requires ṗ = 0 for a non-trivial time. Using condition 2), write out what this requires (ṗ = (condition 2) = 0; what does this imply for the input u?). If this condition can hold for a non-trivial time ⇒ then this is the ??? you are looking for! If it cannot hold for a non-trivial time, then there is no need to consider the p = 0 case!
NB: if there are multiple inputs u_i, in a singular arc you can only control the “current” input u being considered, not the others! Dunno what’s going on with them ⇒ cannot set non-trivial-time conditions on them.
Constraints
Equality constraints. min(Cost) subject to ẋ(t) = f(x(t), u(t), t), g(t, x(t), u(t)) = c. Form the Lagrangian:
L = H(x(t), u(t), t) + λ(t)( g(t, x(t), u(t)) − c )
where the Lagrange multiplier λ(t) is generally time-variant. Continue as before using L instead of H, and require additionally ∂L/∂λ = 0. NB: preferable to just use substitution to eliminate the constrained variables!
Inequality constraints. min(Cost) subject to ẋ(t) = f(x(t), u(t), t), g(t, x(t), u(t)) ≤ c. Form L as above and consider two cases: constraint inactive (λ = 0) and constraint active (λ ≠ 0). For the inactive case, solve the unconstrained problem → if the constraint is violated, then the constraint must be active, so you solve a g(·) = c equality-constraint problem as above.
0 = −KA − A K + KBR B K−Q Combin. (order doesn’t matter): = . Permut. (order matters): P (n, k) = n!
k k!(n−k)! k!
Trig identities: • sin(u ± v) = sin(u)cos(v) ± cos(u)sin(v) • cos(u ± v) = cos(u)cos(v) ∓ sin(u)sin(v) • tan(u ± v) = (tan(u) ± tan(v))/(1 ∓ tan(u)tan(v)) • sin(2u) = 2 sin(u)cos(u) • cos(2u) = cos²(u) − sin²(u) = 2cos²(u) − 1 = 1 − 2sin²(u) • sinh(x) = (e^x − e^{−x})/2 • cosh(x) = (e^x + e^{−x})/2 • tanh(x) = sinh(x)/cosh(x) • cosh²(x) − sinh²(x) = 1 • csch(x) = 1/sinh(x) • sech(x) = 1/cosh(x) • coth(x) = 1/tanh(x)
Complexity. The DPA has a complexity O(|U| · |S| · N), where |U| is the cardinality (# of elements) of the control space U, |S| the cardinality of the state space and N the time horizon. A “brute force” approach has a complexity ≡ Σ_{k=0}^{N} (N choose k) = 2^N (the cardinality of the power set), where N is the # of choices.
Combin. (order doesn’t matter): (n choose k) = n!/(k!(n−k)!). Permut. (order matters): P(n, k) = n!/(n−k)!.
NB: e ≈ 2.7183, ln(2) ≈ 0.6931.

S-ar putea să vă placă și