Sébastien Bubeck
[Figure: the sequential game on $d$ arms (1: CNN, 2: NBC, . . . , $d$: ABC). At each round the Player chooses an arm $A \in \{1, \dots, d\}$ and the Adversary simultaneously assigns a loss to each arm; the Player suffers the loss of the chosen arm and then receives some feedback.]
Some Applications

Computer Go
Medical trials
Packet routing
Ad placement
Dynamic allocation
Notation

For each round $t = 1, 2, \dots, n$:

1. The player chooses an arm $I_t \in \{1, \dots, d\}$, possibly with the help of an external randomization. Simultaneously the adversary chooses a loss vector $\ell_t = (\ell_{1,t}, \dots, \ell_{d,t}) \in [0,1]^d$.
2. The player incurs the loss $\ell_{I_t,t}$, and observes:
   - the loss vector $\ell_t$ in the full information setting;
   - only the incurred loss $\ell_{I_t,t}$ in the bandit setting.

The player's goal is to minimize the regret:

$$R_n = \mathbb{E}\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\dots,d} \mathbb{E}\sum_{t=1}^n \ell_{i,t}.$$
Theorem (Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [1997])
Exp satisfies

$$R_n \leq \sqrt{\frac{n \log d}{2}}.$$

Moreover, for any strategy,

$$\sup_{\text{adversaries}} R_n \geq (1 - o(1))\sqrt{\frac{n \log d}{2}}.$$
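The update behind this theorem is short; below is a minimal NumPy sketch of Exp (exponential weights), with the tuning $\eta = \sqrt{8 \log d / n}$ that yields the $\sqrt{n \log d / 2}$ bound. The random loss matrix and the final comparison are illustrative, not part of the original slides.

```python
import numpy as np

def exp_forecaster(losses, eta):
    """Exp (exponentially weighted average forecaster), full information.

    losses: (n, d) array; row t is the adversary's loss vector in [0,1]^d.
    Returns the player's expected loss at each round."""
    n, d = losses.shape
    cum = np.zeros(d)                         # cumulative losses L_{i,t-1}
    expected = np.empty(n)
    for t in range(n):
        w = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        p = w / w.sum()                       # p_t(i) prop. to exp(-eta L_{i,t-1})
        expected[t] = p @ losses[t]           # expected loss of randomized play
        cum += losses[t]
    return expected

rng = np.random.default_rng(0)
n, d = 1000, 5
losses = rng.uniform(size=(n, d))
eta = np.sqrt(8 * np.log(d) / n)              # tuning matching the theorem
regret = exp_forecaster(losses, eta).sum() - losses.sum(axis=0).min()
print(regret, np.sqrt(n * np.log(d) / 2))     # regret should sit below the bound
```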
$\tilde\ell_{i,t} = \frac{\ell_{i,t}}{p_t(i)} \mathbf{1}_{I_t = i}$ is an unbiased estimate of $\ell_{i,t}$. Exp3 is Exp run on the estimated losses. Exp3 satisfies:

$$R_n \leq \sqrt{2 n d \log d}.$$

Moreover, for any strategy,

$$\sup_{\text{adversaries}} R_n \geq \frac{1}{4}\sqrt{nd} + o(\sqrt{nd}).$$
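A sketch of Exp3 under the same conventions: it is the previous loop run on the importance-weighted estimates, so only the observed loss $\ell_{I_t,t}$ is needed. The `loss_fn` callback, the Bernoulli arms, and the tuning $\eta = \sqrt{2\log d/(nd)}$ are illustrative assumptions.

```python
import numpy as np

def exp3(loss_fn, n, d, eta, rng):
    """Exp3: Exp run on the unbiased estimates
    tilde_l[i] = (l[i] / p[i]) * 1{I_t = i} (bandit feedback)."""
    cum = np.zeros(d)                        # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum - cum.min()))
        p = w / w.sum()
        i = int(rng.choice(d, p=p))          # draw I_t ~ p_t
        loss = loss_fn(t, i)                 # only l_{I_t,t} is observed
        total += loss
        cum[i] += loss / p[i]                # importance-weighted estimate
    return total

rng = np.random.default_rng(0)
n, d = 10000, 5
means = np.linspace(0.3, 0.7, d)
loss_fn = lambda t, i: float(rng.random() < means[i])  # Bernoulli losses
eta = np.sqrt(2 * np.log(d) / (n * d))       # tuning for sqrt(2 n d log d)
regret = exp3(loss_fn, n, d, eta, rng) - n * means.min()
print(regret, np.sqrt(2 * n * d * np.log(d)))
```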
Exp3.P

High-probability bounds call for a slightly biased estimate:

$$\tilde\ell_{i,t} = \frac{\ell_{i,t}\mathbf{1}_{I_t=i} + \beta}{p_t(i)}.$$

Theorem
With $\beta = \sqrt{\frac{\log(d\delta^{-1})}{nd}}$, $\eta = 0.95\sqrt{\frac{\log d}{nd}}$ and $\gamma = 1.05\sqrt{\frac{d\log d}{n}}$, Exp3.P satisfies, for any $\delta \in (0,1)$, with probability at least $1-\delta$:

$$\sum_{t=1}^n \ell_{I_t,t} - \min_{i=1,\dots,d}\sum_{t=1}^n \ell_{i,t} \leq 5.15\sqrt{nd\log(d\delta^{-1})}.$$
INF (Implicitly Normalized Forecaster) is based on a potential function $\psi: \mathbb{R}_- \to \mathbb{R}_+^*$ increasing, convex, twice continuously differentiable, and such that $(0,1] \subset \psi(\mathbb{R}_-)$. At each time step INF computes the new probability distribution as follows:

$$p_t(i) = \psi\Big(\sum_{s=1}^{t-1}\tilde\ell_{i,s} - C_t\Big),$$

where $C_t$ is the unique real number such that $\sum_{i=1}^d p_t(i) = 1$. For instance $\psi(x) = \exp(\eta x) + \frac{\gamma}{d}$ recovers Exp3, while $\psi(x) = \left(\frac{\eta}{-x}\right)^2$ gives the quadratic INF.
Theorem (Audibert and Bubeck [2009], Audibert and Bubeck [2010], Audibert, Bubeck and Lugosi [2011])
Quadratic INF satisfies: $R_n \leq 2\sqrt{2nd}$.
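Since $\psi$ is increasing, $\sum_i \psi(\tilde L_i - C)$ is decreasing in $C$, so the normalization constant $C_t$ can be computed by bisection. A minimal sketch; the quadratic potential with $\eta = 1$ and the tolerance are illustrative choices, not the tuned algorithm.

```python
import numpy as np

def inf_weights(cum_losses, psi, tol=1e-10):
    """INF update: p(i) = psi(L_i - C), with C the unique real number
    making the weights sum to 1 (bisection on the decreasing sum)."""
    L = np.asarray(cum_losses, dtype=float)
    lo = L.max() + 1e-12                 # keep every argument L_i - C < 0
    hi = L.max() + 1.0
    while psi(L - hi).sum() > 1.0:       # grow hi until the sum drops below 1
        hi = lo + 2 * (hi - lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if psi(L - mid).sum() > 1.0:     # sum too large -> increase C
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return psi(L - 0.5 * (lo + hi))

# Quadratic INF potential (an assumed illustrative tuning, eta = 1):
eta = 1.0
psi = lambda x: (eta / np.maximum(-x, 1e-12)) ** 2
p = inf_weights([3.0, 1.0, 2.5], psi)
print(p, p.sum())   # a probability vector; smaller loss -> larger weight
```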
Stochastic Assumption

Assumption (Robbins [1952])
The sequence of losses $(\ell_t)_{1\leq t\leq n}$ is a sequence of i.i.d. random variables.

For historical reasons, in this setting we consider gains rather than losses, and we introduce different notation: let $\nu_i$ be the unknown reward distribution underlying arm $i$, $\mu_i$ the mean of $\nu_i$, $\mu^* = \max_{1\leq i\leq d}\mu_i$ and $\Delta_i = \mu^* - \mu_i$. Let $X_{i,s} \sim \nu_i$ be the reward obtained when pulling arm $i$ for the $s$-th time, and $T_i(t) = \sum_{s=1}^t \mathbf{1}_{I_s=i}$ the number of times arm $i$ was pulled up to time $t$. Thus here

$$R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n \mu_{I_t} = \sum_{i=1}^d \Delta_i\,\mathbb{E}\,T_i(n).$$
General principle: given some observations from an unknown environment, build (with some probabilistic argument) a set of possible environments, then act as if the real environment were the most favorable one in this set.

Application to stochastic bandits: given the past rewards, build confidence intervals for the means $(\mu_i)$ (in particular build upper confidence bounds), then play the arm with the highest upper confidence bound.
By Hoeffding's inequality, with probability at least $1-\delta$,

$$\mathbb{E}X \leq \frac{1}{t}\sum_{s=1}^t X_s + \sqrt{\frac{\log(1/\delta)}{2t}}.$$

This directly suggests the famous UCB strategy of Auer, Cesa-Bianchi and Fischer [2002]:

$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\ \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s} + \sqrt{\frac{2\log t}{T_i(t-1)}}.$$

UCB satisfies $R_n \leq \sum_{i:\Delta_i>0}\frac{10\log n}{\Delta_i}$.
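A minimal sketch of the UCB index above; the Bernoulli test arms and the initialization (one pull per arm) are illustrative conventions.

```python
import numpy as np

def ucb(pull, d, n):
    """UCB of Auer, Cesa-Bianchi and Fischer [2002]:
    play the arm maximizing  mean_i + sqrt(2 log t / T_i)."""
    T = np.zeros(d)              # T_i(t-1): number of pulls of arm i
    S = np.zeros(d)              # cumulative reward of arm i
    for t in range(1, n + 1):
        if t <= d:
            i = t - 1            # pull each arm once to initialize
        else:
            i = int(np.argmax(S / T + np.sqrt(2 * np.log(t) / T)))
        S[i] += float(pull(i))
        T[i] += 1
    return T

# Example: the suboptimal arm should be pulled only O(log n / Delta^2) times.
rng = np.random.default_rng(0)
means = [0.5, 0.6]
T = ucb(lambda i: rng.random() < means[i], d=2, n=10000)
print(T)
```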
Illustration of UCB
Lower bound

For any $p, q \in [0,1]$, let $\mathrm{kl}(p,q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$.

Theorem (Lai and Robbins [1985])
Consider a consistent strategy, i.e. such that for all $a > 0$ we have $\mathbb{E}\,T_i(n) = o(n^a)$ if $\Delta_i > 0$. Then, for any Bernoulli reward distributions,

$$\liminf_{n\to+\infty}\frac{R_n}{\log n} \geq \sum_{i:\Delta_i>0}\frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}.$$

Note that

$$\frac{1}{2\Delta_i} \geq \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)} \geq \frac{\mu^*(1-\mu^*)}{2\Delta_i}.$$
KL-UCB

Theorem (Chernoff's inequality)
Let $X, X_1, \dots, X_t$ be i.i.d. random variables in $[0,1]$, then

$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \geq \mathbb{E}X + \varepsilon\right) \leq \exp\big(-t\,\mathrm{kl}(\mathbb{E}X+\varepsilon, \mathbb{E}X)\big).$$

Thus, with probability at least $1-\delta$,

$$\mathbb{E}X \leq \max\left\{ q\in[0,1] : \mathrm{kl}\left(\frac{1}{t}\sum_{s=1}^t X_s,\, q\right) \leq \frac{\log(1/\delta)}{t} \right\}.$$
KL-UCB

Thus Chernoff's bound suggests the KL-UCB strategy of Garivier and Cappé [2011] (see also Honda and Takemura [2010], Maillard, Munos and Stoltz [2011]):

$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\ \max\left\{ q\in[0,1] : \mathrm{kl}\left(\frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s},\, q\right) \leq \frac{(1+\varepsilon)\log t}{T_i(t-1)} \right\}.$$

Garivier and Cappé proved the following regret bound for $n$ large enough:

$$R_n \leq (1+2\varepsilon)\sum_{i:\Delta_i>0}\frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}\log n.$$
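Since $\mathrm{kl}(p,\cdot)$ is increasing on $[p,1]$, the KL-UCB index is computable by bisection. A sketch, taking $\varepsilon = 0$ in the exploration level as a simplifying assumption:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, level, iters=50):
    """Largest q in [mean, 1] with kl(mean, q) <= level, by bisection
    (kl(mean, .) is increasing on [mean, 1])."""
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# Index of an arm with empirical mean 0.3 after 100 pulls, at time t = 1000:
print(kl_ucb_index(0.3, np.log(1000) / 100))
```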
Heavy-tailed distributions

The standard UCB works for all $\sigma^2$-subgaussian distributions (not only bounded distributions), i.e. such that

$$\mathbb{E}\exp\big(\lambda(X - \mathbb{E}X)\big) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right), \quad \forall \lambda \in \mathbb{R}.$$

It is easy to see that this is equivalent to: $\exists\, \lambda > 0$ s.t. $\mathbb{E}\exp(\lambda X^2) < +\infty$. What happens for distributions with heavier tails? Can we get logarithmic regret if the distributions only have a finite variance?
Lemma
Let $X, X_1, \dots, X_n$ be i.i.d. random variables such that $\mathbb{E}(X-\mathbb{E}X)^2 \leq 1$. Let $\delta \in (0,1)$, $k = \lceil 8\log\delta^{-1}\rceil$ and $N = \lfloor n/(8\log\delta^{-1})\rfloor$. Then, with probability at least $1-\delta$,

$$\mathbb{E}X \leq \mathrm{median}\left(\frac{1}{N}\sum_{s=1}^N X_s,\ \dots,\ \frac{1}{N}\sum_{s=(k-1)N+1}^{kN} X_s\right) + \sqrt{\frac{8\log(\delta^{-1})}{n}}.$$
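A sketch of the median-of-means estimator behind this lemma; the Pareto test distribution (finite variance, heavy tail) is illustrative.

```python
import numpy as np

def median_of_means(x, delta):
    """Split the sample into k = ceil(8 log(1/delta)) blocks and take the
    median of the block means. For variance <= 1 this deviates from EX by
    O(sqrt(log(1/delta)/n)) with probability 1 - delta, even without
    exponential moments."""
    x = np.asarray(x, dtype=float)
    k = max(1, int(np.ceil(8 * np.log(1 / delta))))
    N = len(x) // k                      # block size
    blocks = x[: k * N].reshape(k, N)    # drop the leftover samples
    return np.median(blocks.mean(axis=1))

# Heavy-tailed example: the empirical mean has fat deviations,
# the median of means concentrates.
rng = np.random.default_rng(0)
x = rng.pareto(3.0, size=10000)          # finite variance, heavy tail
print(x.mean(), median_of_means(x, delta=0.01))
```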
LT-UCB

This suggests LT-UCB, Bubeck, Cesa-Bianchi and Lugosi [2012]:

$$I_t \in \operatorname*{argmax}_{1\leq i\leq d}\ \mathrm{median}\left(\frac{1}{N_{i,t}}\sum_{s=1}^{N_{i,t}} X_{i,s},\ \dots,\ \frac{1}{N_{i,t}}\sum_{s=(k_t-1)N_{i,t}+1}^{k_t N_{i,t}} X_{i,s}\right) + \sqrt{\frac{32\log t}{T_i(t-1)}},$$

with $k_t = \lceil 16\log t\rceil$ and $N_{i,t} = \lfloor T_i(t-1)/(16\log t)\rfloor$. The following regret bound can be proved for any set of distributions with variance bounded by 1:

$$R_n \leq \sum_{i:\Delta_i>0}\frac{200\log n}{\Delta_i}.$$
Markovian rewards

Assumption
The sequence $(X_{i,t})_{t\geq 1}$ forms an aperiodic irreducible finite-state Markov chain with unknown transition matrix $P_i$.

Again, in this framework it is possible to design a UCB strategy with logarithmic regret (Tekin and Liu [2011]), using the following result:

Theorem (Lezaud [1998])
Let $X_1, \dots, X_t$ be an aperiodic irreducible finite-state Markov chain with transition matrix $P$. Let $\lambda$ be the second largest eigenvalue of the multiplicative symmetrization of $P$, and $\gamma = 1 - \lambda$. Let $\mu$ be the expectation of $X_1$ under the stationary distribution. There exists $C > 0$ such that, for any $\varepsilon \in (0,1]$,

$$\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^t X_s \geq \mu + \varepsilon\right) \leq C\exp\left(-\frac{t\,\varepsilon^2\gamma}{28}\right).$$
In the continuous setting, the player sequentially picks points $x_t$ in a space $\mathcal{X}$ and receives a stochastic reward with mean $f(x_t)$; with $f^* = \sup_{x\in\mathcal{X}} f(x)$, the regret is

$$R_n = n f^* - \mathbb{E}\sum_{t=1}^n f(x_t).$$
Example in 1d

[Figure: a function $f$ on the unit interval with maximum $f^*$; playing the point $x_t$ yields mean payoff $f(x_t)$.]
For a fixed domain $\mathcal{X}_i \ni x$ containing $n_i$ points $\{x_t\} \subset \mathcal{X}_i$, we have that $\sum_{t=1}^{n_i}(r_t - f(x_t))$ is a martingale. Thus, by Azuma's inequality, with probability at least $1-\delta$,

$$\frac{1}{n_i}\sum_{t=1}^{n_i} f(x_t) \leq \frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2 n_i}}.$$
Together with a bound on the variation of $f$ over $\mathcal{X}_i$, this yields, w.p. $1-\delta$,

$$\sup_{x\in\mathcal{X}_i} f(x) \leq \frac{1}{n_i}\sum_{t=1}^{n_i} r_t + \sqrt{\frac{\log(1/\delta)}{2 n_i}} + \mathrm{diam}(\mathcal{X}_i).$$

Tradeoff between the number of points in a domain and the size of the domain. By considering several domains we can derive a tighter upper bound.
A hierarchical decomposition

Use a tree of partitions at all scales: node $(h,i)$ covers the cell $\mathcal{B}_{h,i} \overset{\mathrm{def}}{=} \mathcal{B}_{h+1,2i-1} \cup \mathcal{B}_{h+1,2i}$.

[Figure: the tree with turned-on nodes, the selected node, and the pulled point $X_t$.]
Example in 1d

$r_t \sim \mathcal{B}(f(x_t))$, a Bernoulli distribution with parameter $f(x_t)$.
Analysis of HOO

The near-optimality dimension $d$ of $f$ is defined as follows: let $\mathcal{X}_\varepsilon \overset{\mathrm{def}}{=} \{x\in\mathcal{X} : f(x) \geq f^* - \varepsilon\}$ be the set of $\varepsilon$-optimal points. Then $\mathcal{X}_\varepsilon$ can be covered by $O(\varepsilon^{-d})$ balls of radius $\varepsilon$. A similar notion was introduced in [Kleinberg, Slivkins and Upfal, 2008].

Theorem
HOO satisfies:

$$R_n = \widetilde{O}\left(n^{\frac{d+1}{d+2}}\right).$$
Example 1:

Assume the function is locally peaky around its maximum: $f(x^*) - f(x) = \Theta(\|x^* - x\|)$.

It takes $O(\varepsilon^0)$ balls of radius $\varepsilon$ to cover $\mathcal{X}_\varepsilon$ with $\ell(x,y) = \|x-y\|$. Thus $d = 0$ and the regret is $\widetilde{O}(\sqrt{n})$.
Example 2:

Assume the function is locally quadratic around its maximum: $f(x^*) - f(x) = \Theta(\|x^* - x\|^2)$.

For $\ell(x,y) = \|x-y\|$, it takes $O(\varepsilon^{-D/2})$ balls of radius $\varepsilon$ to cover $\mathcal{X}_\varepsilon$. Thus $d = D/2$ and $R_n = \widetilde{O}\big(n^{\frac{D+2}{D+4}}\big)$.

For $\ell(x,y) = \|x-y\|^2$, it takes $O(\varepsilon^0)$ $\ell$-balls of radius $\varepsilon$ to cover $\mathcal{X}_\varepsilon$. Thus $d = 0$ and $R_n = \widetilde{O}(\sqrt{n})$.
Example

$\mathcal{X} = [0,1]^D$, $\alpha > 0$, and mean-payoff function $f$ locally $\alpha$-smooth around (any of) its maxima $x^*$ (in finite number): $f(x^*) - f(x) = \Theta(\|x - x^*\|^\alpha)$ as $x \to x^*$.

Theorem
Assume that we run HOO using $\ell(x,y) = \|x-y\|^\beta$.

Known smoothness: $\beta = \alpha$. Then $R_n = \widetilde{O}(\sqrt{n})$, i.e., the rate is independent of the dimension $D$.

Smoothness underestimated: $\beta < \alpha$. Then $R_n = \widetilde{O}\big(n^{(d+1)/(d+2)}\big)$ where $d = D\left(\frac{1}{\beta} - \frac{1}{\alpha}\right)$.
Path planning

[Figure: a directed graph with vertices $1, \dots, 9$ and edges indexed up to $d$; the Player selects a path and the Adversary assigns a loss to each edge.]

Loss suffered: the sum of the losses of the edges on the chosen path, $\ell_{i_1,t} + \dots + \ell_{i_k,t}$. The feedback received depends on the setting (all edge losses, the losses of the traversed edges, or only the total loss of the path).
Notation

The concept class is a set $\mathcal{S} \subset \{0,1\}^d$ (e.g., the incidence vectors of the paths over the $d$ edges). At each round the adversary chooses a loss vector $\ell_t \in \mathbb{R}_+^d$, the player chooses $V_t \in \mathcal{S}$, and the loss suffered is $\ell_t^T V_t$. The regret is

$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^T V_t - \min_{u\in\mathcal{S}}\mathbb{E}\sum_{t=1}^n \ell_t^T u.$$
Key idea

Draw $V_t \sim p_t$, with $p_t \in \Delta(\mathcal{S})$. Then build an unbiased estimate $\tilde\ell_t$ of the loss $\ell_t$:

$\tilde\ell_t = \ell_t$ in the full information game;

$\tilde\ell_{i,t} = \dfrac{\ell_{i,t} V_{i,t}}{\sum_{V\in\mathcal{S}: V_i=1} p_t(V)}$ in the semi-bandit game.
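A sketch checking unbiasedness of the semi-bandit estimate on a toy concept class of two paths over three edges; the class, distribution, and losses are illustrative.

```python
import numpy as np

def semi_bandit_estimate(S, p, V_played, observed):
    """Unbiased loss estimate in the semi-bandit game.

    S: (m, d) 0/1 matrix listing the concept class; p: distribution on its
    rows; V_played: the drawn row; observed: the losses on the coordinates
    with V_played[i] = 1 (semi-bandit feedback), zeros elsewhere.
    Returns tilde_l with tilde_l[i] = l[i] * V[i] / P(V_i = 1)."""
    q = p @ S                              # q[i] = sum_{V in S: V_i = 1} p(V)
    est = np.zeros(S.shape[1])
    active = V_played.astype(bool)
    est[active] = observed[active] / q[active]
    return est                             # E[est] = l on coords with q > 0

S = np.array([[1, 1, 0], [0, 1, 1]])       # two paths sharing an edge
p = np.array([0.3, 0.7])
l = np.array([0.2, 0.5, 0.1])
rng = np.random.default_rng(0)
draws = rng.choice(2, size=20000, p=p)
avg = np.mean([semi_bandit_estimate(S, p, S[j], l * S[j]) for j in draws], axis=0)
print(avg)   # approximately [0.2, 0.5, 0.1]
```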
Loss assumptions

Definition ($L_\infty$)
We say that the adversary satisfies the $L_\infty$ assumption if $\|\ell_t\|_\infty \leq 1$ for all $t = 1,\dots,n$.

Definition ($L_2$)
We say that the adversary satisfies the $L_2$ assumption if $\ell_t^T v \leq 1$ for all $t = 1,\dots,n$ and $v \in \mathcal{S}$.
Exp

In the full information game, against $L_2$ adversaries, we have (for some $\eta$) $R_n \leq \sqrt{2dn}$, which is the optimal rate (Dani, Hayes and Kakade [2008]). Thus, against $L_\infty$ adversaries, we have $R_n \leq d^{3/2}\sqrt{2n}$. But this is suboptimal (Koolen, Warmuth and Kivinen [2010]). Audibert, Bubeck and Lugosi [2011] showed that, for any $\eta$, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$ adversary such that:

$$R_n \geq 0.02\, d^{3/2}\sqrt{n}.$$
Legendre function

Definition
Let $\mathcal{D}$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\mathrm{int}(\mathcal{D})$ and boundary $\partial\mathcal{D}$. We call Legendre any function $F: \mathcal{D} \to \mathbb{R}$ such that

$F$ is strictly convex and admits continuous first partial derivatives on $\mathrm{int}(\mathcal{D})$;

for any $u \in \partial\mathcal{D}$ and any $v \in \mathrm{int}(\mathcal{D})$, we have

$$\lim_{s\to 0,\, s>0} (u-v)^T \nabla F\big((1-s)u + sv\big) = +\infty.$$
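A quick numerical illustration of the boundary condition for the binary negative entropy on $\mathcal{D} = [0,1]$; this worked example is an assumption for illustration, not from the slides.

```python
import numpy as np

# F(x) = x log x + (1 - x) log(1 - x) on D = [0, 1] is Legendre:
# for u on the boundary and v interior, (u - v) F'((1-s)u + sv) -> +inf.
grad_F = lambda x: np.log(x / (1 - x))
u, v = 0.0, 0.5
for s in [1e-1, 1e-3, 1e-6, 1e-9]:
    x = (1 - s) * u + s * v
    print(s, (u - v) * grad_F(x))   # grows without bound as s -> 0
```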
Bregman divergence

Definition
The Bregman divergence $D_F: \mathcal{D} \times \mathrm{int}(\mathcal{D}) \to \mathbb{R}$ associated to a Legendre function $F$ is defined by

$$D_F(u,v) = F(u) - F(v) - (u-v)^T \nabla F(v).$$
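A small sketch: with the negative entropy (a Legendre function on the positive orthant), the Bregman divergence specializes to the (unnormalized) relative entropy.

```python
import numpy as np

def bregman(F, grad_F, u, v):
    """Bregman divergence D_F(u, v) = F(u) - F(v) - <u - v, grad F(v)>."""
    return F(u) - F(v) - (u - v) @ grad_F(v)

F = lambda x: np.sum(x * np.log(x))         # negative entropy
grad_F = lambda x: np.log(x) + 1.0
u = np.array([0.2, 0.3, 0.5])
v = np.array([0.1, 0.4, 0.5])
print(bregman(F, grad_F, u, v))             # sum u log(u/v) - sum u + sum v
print(np.sum(u * np.log(u / v)))            # equal here: both vectors sum to 1
```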
[Figure: the OSMD update, alternating between a point $w_t$ in the convex hull of $\mathcal{S}$ and a distribution $p_t \in \Delta(\mathcal{S})$.]

The resulting regret bound takes the form

$$R_n \leq \frac{\mathrm{diam}_{D_F}(\mathcal{S})}{\eta} + \eta\,\mathbb{E}\sum_{t=1}^n \tilde\ell_t^T \big(\nabla^2 F(w_t)\big)^{-1}\tilde\ell_t.$$
Semi-Bandit = Bandit: Exp3, Auer et al. [2002]
Full Info: Component Hedge, Koolen, Warmuth and Kivinen [2010]
Semi-Bandit: MW, Kale, Reyzin and Schapire [2010]
Bandit: new algorithm
Taking $F(w) = \sum_{i=1}^d \int_0^{w_i}\psi^{-1}(s)\,ds$ recovers INF.
$$w_{t+1} \in \operatorname*{argmin}_{w\in\mathcal{D}}\ \eta\sum_{s=1}^t \tilde\ell_s^T w + F(w).$$

Particularly interesting choice: $F$ a self-concordant barrier function, Abernethy, Hazan and Rakhlin [2008].
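With the negative-entropy regularizer on the simplex, this argmin has a closed form and FTRL reduces to exponential weights; a sketch with a numerical sanity check (the Dirichlet grid of candidate points is purely illustrative).

```python
import numpy as np

def ftrl_entropy(cum_loss, eta):
    """FTRL step  w = argmin_{w in simplex} eta <L, w> + sum_i w_i log w_i.
    The first-order conditions give w_i prop. to exp(-eta L_i), i.e. the
    entropic regularizer recovers exponential weights."""
    z = -eta * (cum_loss - np.min(cum_loss))   # shift for stability
    w = np.exp(z)
    return w / w.sum()

L = np.array([3.0, 1.0, 2.0])
eta = 0.5
w = ftrl_entropy(L, eta)
obj = lambda x: eta * (L @ x) + np.sum(x * np.log(x))
rng = np.random.default_rng(0)
cands = rng.dirichlet(np.ones(3), size=1000)   # random simplex points
print(all(obj(w) <= obj(x) + 1e-9 for x in cands))   # w is the minimizer
```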
Theorem (Koolen, Warmuth and Kivinen [2010])
In the full information game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary: $R_n \leq d\sqrt{2n}$. Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that: $R_n \geq 0.008\, d\sqrt{n}$.
Theorem (Audibert, Bubeck and Lugosi [2011])
In the semi-bandit game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class $\mathcal{S} \subset \{0,1\}^d$ and any $L_\infty$-adversary: $R_n \leq d\sqrt{2n}$. Moreover, for any strategy, there exists a subset $\mathcal{S} \subset \{0,1\}^d$ and an $L_\infty$-adversary such that: $R_n \geq 0.008\, d\sqrt{n}$.
For the bandit game the situation becomes trickier. First, it appears necessary to add some sort of forced exploration on $\mathcal{S}$ to control third-order error terms in the regret bound. Second, the control of the quadratic term $\tilde\ell_t^T \big(\nabla^2 F(w_t)\big)^{-1}\tilde\ell_t$ is much more involved than previously.