Slides from
R. S. Sutton and A. G. Barto
Reinforcement Learning: An Introduction
http://www.cs.ualberta.ca/~sutton/book/thebook.html
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse.html
The Agent-Environment Interface
Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
Agent observes state at step t:    s_t \in S
  produces action at step t:       a_t \in A(s_t)
  gets resulting reward:           r_{t+1}
  and resulting next state:        s_{t+1}
[Interaction trajectory: ... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...]
The Agent Learns a Policy
Policy at step t, \pi_t:
  a mapping from states to action probabilities
  \pi_t(s, a) = probability that a_t = a when s_t = s
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.
Getting the Degree of Abstraction Right
Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), mental (e.g., shift in focus of attention), etc.
States can be low-level sensations, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being surprised or lost).
An RL agent is not like a whole animal or robot, which may consist of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
Goals and Rewards
Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, and thus outside the agent.
The agent must be able to measure success:
  explicitly;
  frequently during its lifespan.
Returns
Suppose the sequence of rewards after step t is:
  r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize?
In general, we want to maximize the expected return, E{R_t}, for each step t.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
  R_t = r_{t+1} + r_{t+2} + ... + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
  R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},
where \gamma, 0 \le \gamma \le 1, is the discount rate.
  shortsighted  0 <-- \gamma --> 1  farsighted
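A minimal Python sketch of computing a (truncated) discounted return from a finite reward sequence; the reward values and discount rate below are made up purely for illustration:

    def discounted_return(rewards, gamma):
        """Return sum_k gamma^k * rewards[k] for a finite reward sequence
        r_{t+1}, r_{t+2}, ... (a truncated version of the infinite sum)."""
        g = 0.0
        # accumulate from the last reward backwards: G = r + gamma * G
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # example: rewards after step t, with gamma = 0.9 (illustrative values)
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62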
An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task where the episode ends upon failure:
  reward = +1 for each step before failure
  => return = number of steps before failure
As a continuing task with discounted return:
  reward = -1 upon failure; 0 otherwise
  => return = -\gamma^k, for k steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
Another Example
Get to the top of the hill as quickly as possible.
  reward = -1 for each step not at the top of the hill
  => return = -(number of steps before reaching the top of the hill)
Return is maximized by minimizing the number of steps to reach the top of the hill.
A Unified Notation
In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
Think of each episode as ending in an absorbing state that always produces a reward of zero.
We can cover all cases by writing
  R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},
where \gamma can be 1 only if a zero-reward absorbing state is always reached.
TheMarkovProperty
Bythestateatstept,thebookmeanswhateverinformationis
availabletotheagentatsteptaboutitsenvironment.
Thestatecanincludeimmediatesensations,highlyprocessed
sensations,andstructuresbuiltupovertimefromsequencesof
sensations.
Ideally,astateshouldsummarizepastsensationssoastoretain
allessentialinformation,i.e.,itshouldhavetheMarkov
Property:
Prst 1 s,r
t 1 r st ,at ,rt , st 1 ,at 1 ,K ,r1 ,s0 ,a0
Prst 1 s,r
t 1 r st ,at
Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
  state and action sets
  one-step dynamics defined by transition probabilities:
    P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }   for all s, s' \in S, a \in A(s).
  expected rewards:
    R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }   for all s, s' \in S, a \in A(s).
An Example Finite MDP: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected
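A minimal sketch of how this MDP's one-step dynamics could be written down as Python dictionaries. The parameter names alpha, beta, r_search, r_wait follow the book's parameterization of the recycling robot, but the specific numbers below are illustrative placeholders, not values from the slides:

    # One-step dynamics (P, R) for the recycling robot, indexed as
    # P[(s, a)] -> list of (next_state, probability) and
    # R[(s, a, s_next)] -> expected reward.
    alpha, beta = 0.9, 0.6          # prob. battery stays high / stays low while searching (placeholders)
    r_search, r_wait = 2.0, 1.0     # expected cans found while searching / waiting (placeholders)

    P = {
        ('high', 'search'):   [('high', alpha), ('low', 1 - alpha)],
        ('high', 'wait'):     [('high', 1.0)],
        ('low',  'search'):   [('low', beta), ('high', 1 - beta)],   # 1 - beta: battery dies, robot is rescued
        ('low',  'wait'):     [('low', 1.0)],
        ('low',  'recharge'): [('high', 1.0)],
    }
    R = {
        ('high', 'search', 'high'): r_search, ('high', 'search', 'low'): r_search,
        ('high', 'wait',   'high'): r_wait,
        ('low',  'search', 'low'):  r_search, ('low',  'search', 'high'): -3.0,  # rescue penalty (placeholder)
        ('low',  'wait',   'low'):  r_wait,
        ('low',  'recharge', 'high'): 0.0,
    }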
Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy \pi:
  V^\pi(s) = E_\pi{ R_t | s_t = s } = E_\pi{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s }
The value of taking an action in a state under policy \pi is the expected return starting from that state, taking that action, and thereafter following \pi.
Action-value function for policy \pi:
  Q^\pi(s, a) = E_\pi{ R_t | s_t = s, a_t = a } = E_\pi{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s, a_t = a }
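These expectations can be estimated from experience by averaging sampled returns. A minimal sketch, assuming every-visit Monte Carlo averaging (Monte Carlo methods are treated later in the book) and a trajectory format invented here for illustration:

    from collections import defaultdict

    def mc_state_values(episodes, gamma):
        """Estimate V^pi(s) by averaging the returns observed from each state.
        `episodes` is a list of trajectories [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        generated by following policy pi."""
        returns = defaultdict(list)
        for episode in episodes:
            g = 0.0
            # walk backwards so g is the return that followed each visited state
            for (s, a, r) in reversed(episode):
                g = r + gamma * g
                returns[s].append(g)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}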
Bellman Equation for a Policy \pi
The basic idea:
  R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + ...
      = r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + ... )
      = r_{t+1} + \gamma R_{t+1}
So:
  V^\pi(s) = E_\pi{ R_t | s_t = s }
           = E_\pi{ r_{t+1} + \gamma V^\pi(s_{t+1}) | s_t = s }
Or, without the expectation operator:
  V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
More on the Bellman Equation
  V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
This is a set of equations (in fact, linear), one for each state.
The value function for \pi is its unique solution.
Backup diagrams:
  for V^\pi        for Q^\pi
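Because these equations are linear in the values V^\pi(s), a small MDP can be evaluated by solving the linear system directly. A minimal numpy sketch, assuming the policy-averaged transition matrix P_pi and expected-reward vector r_pi have already been built from \pi, P^a_{ss'}, and R^a_{ss'} (the toy numbers are made up):

    import numpy as np

    def evaluate_policy_exact(P_pi, r_pi, gamma):
        """Solve V = r_pi + gamma * P_pi V, i.e. (I - gamma * P_pi) V = r_pi.
        P_pi[s, s'] = sum_a pi(s,a) P^a_{ss'};  r_pi[s] = sum_a pi(s,a) sum_s' P^a_{ss'} R^a_{ss'}."""
        n = len(r_pi)
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    # toy 2-state example: state 0 tends to stay put, state 1 tends to move to state 0
    P_pi = np.array([[0.8, 0.2],
                     [0.5, 0.5]])
    r_pi = np.array([1.0, 0.0])
    print(evaluate_policy_exact(P_pi, r_pi, gamma=0.9))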
Gridworld
Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = -1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy; \gamma = 0.9
Optimal Value Functions
For finite MDPs, policies can be partially ordered:
  \pi \ge \pi'  if and only if  V^\pi(s) \ge V^{\pi'}(s) for all s \in S
There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all \pi^*.
Optimal policies share the same optimal state-value function:
  V^*(s) = max_\pi V^\pi(s)  for all s \in S
Optimal policies also share the same optimal action-value function:
  Q^*(s, a) = max_\pi Q^\pi(s, a)  for all s \in S and a \in A(s)
This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
  V^*(s) = max_{a \in A(s)} Q^{\pi^*}(s, a)
         = max_{a \in A(s)} E{ r_{t+1} + \gamma V^*(s_{t+1}) | s_t = s, a_t = a }
         = max_{a \in A(s)} \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^*(s') ]
The relevant backup diagram:
V^* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*
  Q^*(s, a) = E{ r_{t+1} + \gamma max_{a'} Q^*(s_{t+1}, a') | s_t = s, a_t = a }
            = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma max_{a'} Q^*(s', a') ]
The relevant backup diagram:
Q^* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V^* is an optimal policy.
Therefore, given V^*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld:
What About Optimal Action-Value Functions?
Given Q^*, the agent does not even have to do a one-step-ahead search:
  \pi^*(s) = arg max_{a \in A(s)} Q^*(s, a)
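A minimal sketch contrasting the two cases; the dictionary-based model (P, R) and the value tables V and Q below are illustrative assumptions, not a data structure prescribed by the book:

    def greedy_action_from_V(s, actions, P, R, V, gamma):
        """One-step-ahead search: needs the model. P[(s, a)] is a list of
        (s_next, prob); R[(s, a, s_next)] is the expected reward."""
        def backed_up_value(a):
            return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])
        return max(actions, key=backed_up_value)

    def greedy_action_from_Q(s, actions, Q):
        """With Q*, no model and no lookahead are needed: just arg max over actions."""
        return max(actions, key=lambda a: Q[(s, a)])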
Solving the Bellman Optimality Equation
Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
  accurate knowledge of environment dynamics;
  enough space and time to do the computation;
  the Markov Property.
How much space and time do we need?
  polynomial in the number of states (via dynamic programming methods; Chapter 4),
  BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
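One classical dynamic programming method of this kind is value iteration (Chapter 4 of the book), which turns the Bellman optimality equation into an update rule. A minimal sketch, assuming the same dictionary-based model (P, R) used in the earlier sketches and an actions(s) helper invented here:

    def value_iteration(states, actions, P, R, gamma, theta=1e-8):
        """Iterate V(s) <- max_a sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]
        until the largest change is below theta. Returns an approximation of V*."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])
                    for a in actions(s)
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V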
TD Prediction
Policy Evaluation (the prediction problem):
  for a given policy \pi, compute the state-value function V^\pi
The simplest TD method, TD(0):
  V(s_t) \leftarrow V(s_t) + \alpha [ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) ]
where r_{t+1} + \gamma V(s_{t+1}) is the target: an estimate of the return.
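A minimal sketch of this update applied along one episode; the env.reset()/env.step(a) interface and the policy function are assumptions for illustration, not part of the book's notation:

    def td0_episode(env, policy, V, alpha, gamma):
        """Run one episode, applying V(s) <- V(s) + alpha*[r + gamma*V(s') - V(s)]
        after every step. V is a dict mapping state -> estimated value."""
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
        return V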
Simplest TD Method
  V(s_t) \leftarrow V(s_t) + \alpha [ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) ]
[Backup diagram: TD(0) backs up the value of s_t from the sampled reward r_{t+1} and next state s_{t+1}.]
Example: Driving Home

State                 Elapsed Time    Predicted     Predicted
                      (minutes)       Time to Go    Total Time
leaving office        0               30            30
reach car, raining    5               35            40
exit highway          20              15            35
behind truck          30              10            40
home street           40              3             43
arrive home           43              0             43
Driving Home
Changes recommended by Monte Carlo methods (\alpha = 1)
Changes recommended by TD methods (\alpha = 1)
Advantages of TD Learning
TD methods do not require a model of the environment, only experience.
TD methods can be fully incremental:
  You can learn before knowing the final outcome
    Less memory
    Less peak computation
  You can learn without the final outcome
    From incomplete sequences
Random Walk Example
Values learned by TD(0) after various numbers of episodes
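A minimal sketch of TD(0) on the book's five-state random walk (states A through E, start in the center, equal probability of moving left or right, reward +1 only on exiting to the right); the step size, episode count, and 0.5 initial values are illustrative choices:

    import random

    def random_walk_episode():
        """One episode of the five-state random walk, states 1..5 (A..E).
        Returns a list of (state, reward, next_state) transitions; next_state
        is None when the walk terminates off either end."""
        s, trajectory = 3, []
        while True:
            s_next = s + random.choice([-1, 1])
            if s_next == 6:                       # exit right: reward +1
                trajectory.append((s, 1.0, None)); return trajectory
            if s_next == 0:                       # exit left: reward 0
                trajectory.append((s, 0.0, None)); return trajectory
            trajectory.append((s, 0.0, s_next))
            s = s_next

    # TD(0), undiscounted (gamma = 1), with all values initialized to 0.5
    V = {s: 0.5 for s in range(1, 6)}
    alpha, gamma = 0.1, 1.0
    for _ in range(100):
        for s, r, s_next in random_walk_episode():
            target = r if s_next is None else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
    print(V)   # should approach the true values 1/6, 2/6, 3/6, 4/6, 5/6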
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
Optimality of TD(0)
Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small \alpha.
Constant-\alpha MC also converges under these conditions, but to a different answer!
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times.
Learning an Action-Value Function
Estimate Q^\pi for the current behavior policy \pi.
After every transition from a nonterminal state s_t, do this:
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
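A minimal sketch of one Sarsa episode. Using an epsilon-greedy rather than a fully greedy policy is an assumption here, matching the exploration used in the cliff-walking example below; the env.reset()/env.step(a) interface and the action set are also assumed for illustration:

    import random

    def epsilon_greedy(Q, s, actions, epsilon):
        """Pick a random action with probability epsilon, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa_episode(env, Q, actions, alpha, gamma, epsilon):
        """One episode of Sarsa: update Q(s,a) toward r + gamma*Q(s',a'),
        where a' is the action actually chosen in s' (on-policy)."""
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                      # Q of a terminal state is taken as 0
            else:
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next
        return Q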
Windy Gridworld
undiscounted, episodic, reward = -1 until goal
Results of Sarsa on the Windy Gridworld
Cliff Walking
\epsilon-greedy, \epsilon = 0.1