
Oleksiy Mnyshenko

PROPOSAL REFINEMENT
Understanding Rule Learning Dynamics as an ALGORITHM

PRIOR
Given a finite action space A, the prior θ is chosen such that θ(a) ≥ 0 for every a ∈ A and Σ_{a∈A} θ(a) = 1. If the prior places positive probability on only a subset A₀ ⊂ A, then some positive probability has to be placed on the remaining actions A \ A₀ so that these actions are not initially excluded. Therefore let θ(a) = ε / |A \ A₀| for every a ∈ A \ A₀, where ε is suitably small. Since the probabilities in the prior θ must still sum to one, they need to be readjusted (e.g., the mass on A₀ rescaled by 1 − ε) so that Σ_{a∈A} θ(a) = 1. Also note that a value of the utility function u(a) is assumed to be available for every action in A₀ [most likely the utilities have to be consistent with the probabilities placed on each such action].
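
A minimal sketch of this adjustment, assuming the uniform spread of ε over the initially excluded actions discussed in question 2 below; the function name and the default value of ε are illustrative:

    import numpy as np

    def adjust_prior(theta0, eps=0.05):
        # theta0: original prior over the finite action space A (may contain zeros)
        theta0 = np.asarray(theta0, dtype=float)
        excluded = theta0 == 0.0                  # actions in A \ A0
        theta = theta0 * (1.0 - eps)              # rescale the original mass to 1 - eps
        theta[excluded] = eps / excluded.sum()    # spread eps uniformly over A \ A0
        return theta

    print(adjust_prior([0.5, 0.5, 0.0, 0.0]))     # [0.475 0.475 0.025 0.025]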

Given the above ramifications, the following questions arise.

_____________________________________________________________________________________

QUESTIONS:

1. How to choose ε?

2. Is it sufficient to distribute the probability mass ε uniformly over the actions not included in the original prior?

3. Is it possible to spread ε in a smoother manner, to prevent our prior from looking like a field of one-point peaks over a uniform distribution on the rest of the action space? [We may apply the intuition that a greater portion of ε should be allocated to actions that are closer to the actions with positive probability in the original prior.]

4. Is the following observation correct?


In order to perform an update of the potential function in accordance with the equation

v(a, t, t+1) = λ₀ · v(a, t, t) + r(a, t, t; a_t),

we need some value of utility for the actions over which ε was spread, so that the reinforcement r(a_t, t, t; a_t) = λ₁ · u(a_t, t, t) can be computed.

Extra reasoning: maybe keeping the utilities of the ε-actions at a default value is fine, because on average actions with known utilities will be chosen, and therefore the updated values will come from these actions.
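
A minimal sketch of the update step in question 4, assuming the reconstructed parameters λ₀ (decay) and λ₁ (reinforcement scale) and a default utility of zero for the ε-actions; all names and values here are illustrative:

    def reinforcement(u_at, lam1=1.0):
        # r(a_t, t, t; a_t) = lam1 * u(a_t, t, t); u defaults to 0 for epsilon-actions
        return lam1 * u_at

    def update_potential(v_a, r_a, lam0=0.9):
        # v(a, t, t+1) = lam0 * v(a, t, t) + r(a, t, t; a_t)
        return lam0 * v_a + r_a

    # If an epsilon-action with unknown (default zero) utility is drawn,
    # its reinforcement is zero and its potential value simply decays:
    print(update_potential(v_a=2.0, r_a=reinforcement(0.0)))   # 1.8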

UPDATING potential function and TRANSITION PROBABILITIES


Once the first action a_t is drawn from the prior θ, we update the potential function in the “neighborhood” of a_t. Note that the definition of the neighborhood depends on the choice of similarity function. Since we will be using normalized Gaussian kernels, the value of the potential function is updated for every action a in the action space A. The domain within which updating occurs can be restricted by using pyramidal Gaussians.

Definition: the similarity between an arbitrary action a and the chosen action a_t is given by a normalized Gaussian kernel with bandwidth b,

s(a, a_t) = exp(−(a − a_t)² / (2b²)) / Σ_{a′∈A} exp(−(a′ − a_t)² / (2b²)).

Therefore updating proceeds in the following way:

∀ a ∈ A:  v(a, t, t+1) = λ₀ · v(a, t, t) + r(a, t, t; a_t),

where the reinforcement is

r(a, t, t; a_t) = λ₁ · s(a, a_t) · u(a_t, t, t).

Given the new updated values we can choose the next action. Since (t + 1) is now the current period, we relabel it t. Thus a_t is chosen with the following probability:

p_t(a) = v(a, t, t) / Σ_{a′∈A} v(a′, t, t).

We have just completed ONE ITERATION of the algorithm!
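
A minimal end-to-end sketch of one iteration in Python, assuming the ingredients above (normalized Gaussian similarity with bandwidth b, parameters λ₀ and λ₁, and choice probabilities proportional to the potential values); the action space, utility function, and parameter values are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)

    A = np.linspace(0.0, 1.0, 11)      # finite action space (illustrative)
    v = np.ones_like(A)                # initial potential values
    b, lam0, lam1 = 0.1, 0.9, 1.0      # bandwidth and update parameters (assumed)

    def u(a):
        # illustrative utility function, not part of the proposal
        return 1.0 - (a - 0.7) ** 2

    def similarity(A, a_t, b):
        # normalized Gaussian kernel centered at the chosen action
        k = np.exp(-(A - a_t) ** 2 / (2 * b ** 2))
        return k / k.sum()

    # --- one iteration ---
    p = v / v.sum()                              # choice probabilities from potentials
    a_t = rng.choice(A, p=p)                     # draw the current action
    r = lam1 * similarity(A, a_t, b) * u(a_t)    # reinforcement spread over the neighborhood
    v = lam0 * v + r                             # update the potential for every a in A
    print(a_t, v.round(3))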

_____________________________________________________________________________________

QUESTIONS:

1. What are the strategies for choosing the parameter b (bandwidth)? Is it a static parameter, or can it be dynamically adjusted based on the data generated by the algorithm?

MARKOV PROCESS
The process of choosing an action from the current probability distribution (based on the current values of the potential function) and then updating the potential function to arrive at a new probability distribution on the action space A can be modeled as a Markov process.

p(a′ | a) denotes the probability of transition from the current state, in which the last update of the potential function occurred around action a, to a new state in which the update of the potential value function occurs around action a′.

Every state is characterized by the current values of the potential function on A and by the action around which the last update was performed.

[Question: if we were to run the computation, how would we record the values of the utility function over our action space?]

Transition probabilities: the probability of transitioning to the state in which the next update is centered on a′ is the probability with which a′ is chosen after the potential function has been updated around the last action, p_t(a′) = v(a′, t, t) / Σ_{a″∈A} v(a″, t, t).

Note that the above Markov chain is inhomogeneous, since the transition probabilities depend on the last action a_t that was used to update the potential function on A. Since the initial prior is such that θ(a) > 0 for every a ∈ A, and the potential values remain strictly positive under the update, there is always a positive probability of randomly choosing any a ∈ A at any t.

Therefore the whole state space consists of one recurrent class, meaning that the above Markov process is irreducible. Irreducibility of the state space has the potential to yield interesting results related to convergence.

The above process may get stuck in a local maximum. This can be avoided by using a perturbed Markov process such that with probability (1 − η) the next state is chosen in accordance with the transition probabilities, and with probability η a mistake is made, so that the next action around which the update will be performed is chosen randomly from the uniform distribution over the action space.
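
A minimal sketch of the perturbed choice step; the symbol η for the mistake probability and the default value below are assumptions:

    import numpy as np

    def perturbed_choice(A, p, eta=0.05, rng=None):
        # with probability (1 - eta): follow the transition (choice) probabilities p;
        # with probability eta: make a "mistake" and draw uniformly over the action space
        if rng is None:
            rng = np.random.default_rng()
        if rng.random() < eta:
            return rng.choice(A)
        return rng.choice(A, p=p)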

CONVERGENCE
Our goal is to demonstrate that the statement below, which holds for the extreme case of counterfactual thinking, is also true under the reinforcement learning algorithm outlined above: the chosen action satisfies a_t ∈ argmax_{a∈A} u(a), regardless of the recently undertaken action.


This assumption may appear to be a nuisance, since if such counterfactual thinking were possible the agent would choose an action with maximum utility the first time he has to decide. The question of whether the agent knows all of his available actions may arise as well.

In other words, we want to demonstrate that the choice distribution p_t converges to a distribution concentrated on an action a* such that the utility of undertaking a* is maximal.
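
One way to state this convergence target formally, as a sketch that assumes the maximizer a* of the utility function is unique:

    \lim_{t \to \infty} \Pr\left(a_t = a^*\right) = 1,
    \qquad a^* = \arg\max_{a \in A} u(a).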

QUESTIONS:

1. Aiding convergence by specifying the similarity function in a way that allows the kernel to have a smaller variance where the utility function is sensitive to small variations in actions, and a greater bandwidth where changes in payoff across a neighborhood of an action are small (see the sketch after this list).

2. Need to investigate other ways of specifying the similarity function.
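
A minimal sketch of the idea in question 1, assuming a simple local-sensitivity heuristic: the bandwidth near an action shrinks where the (estimated) utility gradient is large and grows where the payoff is flat; the particular formula is illustrative, not part of the proposal:

    import numpy as np

    def adaptive_bandwidths(A, u_vals, b_min=0.02, b_max=0.3):
        # estimate local sensitivity of the utility by the absolute numerical gradient
        sens = np.abs(np.gradient(u_vals, A))
        sens = sens / (sens.max() + 1e-12)
        # high sensitivity -> small bandwidth; flat payoff -> large bandwidth
        return b_max - (b_max - b_min) * sens

    def similarity_adaptive(A, a_t, b_at):
        # normalized Gaussian kernel centered at a_t with the locally chosen bandwidth b(a_t)
        k = np.exp(-(A - a_t) ** 2 / (2 * b_at ** 2))
        return k / k.sum()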



READINGS
Topics:

Mathematics

1. Infinite and inhomogeneous Markov chains;

2. Kernel density estimation / constructing similarity functions that aid convergence;

Economics:

1. Bayesian Convergence theorem;

2. Law of effect;

3. Learning behavior (books or articles that will help me put such highly focused research in context with other overarching topics)
- Josef Hofbauer and Karl Sigmund, “The Theory of Evolution and Dynamical Systems” (Cambridge University Press).

Research (methodology):

Becker, Tricks of the Trade: How to Think about Your Research While You’re Doing It
(University of Chicago Press, 1998).
Booth, Colomb and Williams, The Craft of Research (University of Chicago Press, 2003).
Turabian, Booth, Colomb, and Williams, A Manual for Writers of Research Papers, Theses,
and Dissertations, Seventh Edition: Chicago Style for Students and Researchers (Chicago:
University of Chicago Press, 2007). Get the 7th edition!
Zerubavel, The Clockwork Muse: A Practical Guide to Writing Theses, Dissertations and
Books (Harvard University Press, 1999).
