
ELE 538B: Large-Scale Optimization for Data Science

Mirror descent

Yuxin Chen
Princeton University, Spring 2018
Outline

• Mirror descent

• Bregman divergence

• Alternative forms of mirror descent

• Convergence analysis

Mirror descent 5-2


A proximal viewpoint of projected GD

$$\text{minimize}_{x} \quad f(x) \qquad \text{subject to} \quad x \in \mathcal{C}$$

where $f$ is convex and $L_f$-Lipschitz continuous

$$x^{t+1} = \arg\min_{x \in \mathcal{C}} \Big\{ \underbrace{f(x^t) + \langle \nabla f(x^t), x - x^t \rangle}_{\text{linear approximation}} + \underbrace{\frac{1}{2\eta_t} \|x - x^t\|_2^2}_{\text{proximity term}} \Big\}$$

• quadratic proximity term is used by GD to monitor discrepancy between $f(\cdot)$ and its first-order approximation


Inhomogeneous / non-Euclidean geometry

Quadratic proximity term is based on certain “prior belief”:

• discrepancy between $f(\cdot)$ and its linear approximation is locally well approximated by homogeneous penalty $\underbrace{(2\eta_t)^{-1} \|x - x^t\|_2^2}_{\text{squared Euclidean penalty}}$

Issues: local geometry might sometimes be highly inhomogeneous, or even non-Euclidean


Example: quadratic minimization

$$\text{minimize}_{x \in \mathbb{R}^n} \quad f(x) = \frac{1}{2} (x - x^*)^\top Q (x - x^*)$$

where $Q \succ 0$ is diagonal matrix with large $\kappa = \frac{\max_i Q_{i,i}}{\min_i Q_{i,i}} \gg 1$

• gradient descent $x^{t+1} = x^t - \eta_t Q (x^t - x^*)$ is slow, since iteration complexity is $O\big(\kappa \log \frac{1}{\varepsilon}\big)$
• doesn’t fit curvature of $f(\cdot)$ well
• one can significantly accelerate it by rescaling gradient

$$x^{t+1} = x^t - \eta_t Q^{-1} \nabla f(x^t) = \underbrace{x^t - \eta_t (x^t - x^*)}_{\text{reaches } x^* \text{ with } \eta_t = 1}$$

$$\iff \quad x^{t+1} = \arg\min_{x \in \mathbb{R}^n} \Big\{ \langle \nabla f(x^t), x - x^t \rangle + \underbrace{\frac{1}{2\eta_t} (x - x^t)^\top Q (x - x^t)}_{\text{fits geometry better}} \Big\}$$
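The effect of rescaling can be seen numerically. The following is a minimal sketch, assuming a 2-dimensional diagonal $Q$ with $\kappa = 1000$; the function name `gd_quadratic`, the step sizes, and the test values are illustrative choices, not part of the lecture.

```python
import numpy as np

def gd_quadratic(q_diag, x_star, x0, eta, iters, rescale=False):
    """Run x <- x - eta * P * grad f(x) for f(x) = 0.5 (x - x*)^T Q (x - x*).

    P is the identity for plain GD and Q^{-1} for the rescaled update.
    q_diag holds the diagonal of Q (assumed positive).
    """
    x = x0.copy()
    for _ in range(iters):
        grad = q_diag * (x - x_star)      # gradient of the quadratic
        if rescale:
            grad = grad / q_diag          # multiply by Q^{-1}
        x = x - eta * grad
    return x

q_diag = np.array([1.0, 1000.0])          # condition number kappa = 1000
x_star = np.array([1.0, -2.0])
x0 = np.zeros(2)

# Plain GD with step size 1 / max_i Q_ii: the flat direction converges slowly.
x_gd = gd_quadratic(q_diag, x_star, x0, eta=1.0 / q_diag.max(), iters=100)
# Rescaled GD with eta = 1 lands on x* after a single step.
x_rescaled = gd_quadratic(q_diag, x_star, x0, eta=1.0, iters=1, rescale=True)

err_gd = np.linalg.norm(x_gd - x_star)
err_rescaled = np.linalg.norm(x_rescaled - x_star)
print(f"plain GD error after 100 steps: {err_gd:.3f}")
print(f"rescaled GD error after 1 step: {err_rescaled:.3e}")
```

After 100 plain GD steps the error along the low-curvature coordinate has only shrunk by a factor of $(1 - 1/\kappa)^{100}$, whereas the rescaled update removes the dependence on $\kappa$.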


Example: probability simplex

$$\text{minimize}_{x \in \Delta} \quad f(x)$$

where $\Delta := \{x \in \mathbb{R}^n_+ \mid \mathbf{1}^\top x = 1\}$ is probability simplex

• Euclidean distance is in general not recommended for measuring distance between probability vectors
• may prefer probability divergence metrics, e.g. Kullback-Leibler divergence, total-variation distance, $\chi^2$ divergence


Mirror descent: adjust gradient updates to fit problem geometry

(Nemirovski & Yudin, 1983)


Mirror descent (MD)

Replace quadratic proximity $\|x - x^t\|_2^2$ with distance-like metric $D_\phi$:

$$x^{t+1} = \arg\min_{x \in \mathcal{C}} \Big\{ f(x^t) + \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} \underbrace{D_\phi(x, x^t)}_{\text{Bregman divergence}} \Big\}$$

where $D_\phi(x, z) := \phi(x) - \phi(z) - \langle \nabla \phi(z), x - z \rangle$ for convex and differentiable $\phi$
Mirror descent (MD)

or more generally,

$$x^{t+1} = \arg\min_{x \in \mathcal{C}} \Big\{ f(x^t) + \langle g^t, x - x^t \rangle + \frac{1}{\eta_t} D_\phi(x, x^t) \Big\} \qquad (5.1)$$

with $g^t \in \partial f(x^t)$

• monitor local geometry via appropriate Bregman divergence metrics
    ◦ generalization of squared Euclidean distance
    ◦ e.g. squared Mahalanobis distance, KL divergence


Principles in choosing Bregman divergence

• fits local curvature of $f(\cdot)$
• fits geometry of constraint set $\mathcal{C}$
• makes sure Bregman projection (defined later) is inexpensive


Bregman divergence

Let $\phi : \mathcal{C} \to \mathbb{R}$ be strictly convex and differentiable on $\mathcal{C}$, then

$$D_\phi(x, z) := \phi(x) - \phi(z) - \langle \nabla \phi(z), x - z \rangle$$

• shares a few similarities to squared Euclidean distance
• locally quadratic measure: when $\phi$ is twice differentiable, think of it as

$$D_\phi(x, z) = \frac{1}{2} (x - z)^\top \nabla^2 \phi(\xi) (x - z)$$

for some $\xi$ depending on $x$ and $z$


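Example: squared Mahalanobis distance

Let $D_\phi(x, z) = \frac{1}{2} (x - z)^\top Q (x - z)$ for $Q \succ 0$, which is generated by

$$\phi(x) = \frac{1}{2} x^\top Q x$$

Proof:

$$\begin{aligned}
D_\phi(x, z) &= \phi(x) - \phi(z) - \langle \nabla \phi(z), x - z \rangle \\
&= \frac{1}{2} x^\top Q x - \frac{1}{2} z^\top Q z - z^\top Q (x - z) \\
&= \frac{1}{2} (x - z)^\top Q (x - z)
\end{aligned}$$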
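The identity above can be checked numerically. The sketch below assumes a hypothetical helper `bregman(phi, grad_phi, x, z)` that evaluates the definition of $D_\phi$ directly, and a randomly generated positive definite $Q$; both are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

def bregman(phi, grad_phi, x, z):
    """Bregman divergence D_phi(x, z) = phi(x) - phi(z) - <grad phi(z), x - z>."""
    return phi(x) - phi(z) - grad_phi(z) @ (x - z)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4.0 * np.eye(4)             # symmetric positive definite

phi = lambda x: 0.5 * x @ Q @ x           # generating function
grad_phi = lambda x: Q @ x                # its gradient

x = rng.standard_normal(4)
z = rng.standard_normal(4)

d_bregman = bregman(phi, grad_phi, x, z)
d_mahalanobis = 0.5 * (x - z) @ Q @ (x - z)
print(d_bregman, d_mahalanobis)           # the two values should agree
```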


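Example: squared Mahalanobis distance

When $D_\phi(x, z) = \frac{1}{2} (x - z)^\top Q (x - z)$, $\mathcal{C} = \mathbb{R}^n$, and $f$ differentiable, MD has closed-form

$$x^{t+1} = x^t - \eta_t Q^{-1} \nabla f(x^t)$$

In general,

$$\begin{aligned}
x^{t+1} &= \arg\min_{x \in \mathcal{C}} \Big\{ \eta_t \langle g^t, x \rangle + \frac{1}{2} (x - x^t)^\top Q (x - x^t) \Big\} \\
&= \arg\min_{x \in \mathcal{C}} \Big\{ \frac{1}{2} x^\top Q x - \big\langle Q (x^t - \eta_t Q^{-1} g^t), x \big\rangle + \frac{1}{2} x^{t\top} Q x^t \Big\} \\
&= \arg\min_{x \in \mathcal{C}} \Big\{ \frac{1}{2} \big( x - (x^t - \eta_t Q^{-1} g^t) \big)^\top Q \big( x - (x^t - \eta_t Q^{-1} g^t) \big) \Big\}
\end{aligned}$$

i.e. the projection of $x^t - \eta_t Q^{-1} g^t$ onto $\mathcal{C}$ based on weighted $\ell_2$ distance $\|z\|_Q^2 := z^\top Q z$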
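For the unconstrained case, the closed form can be sanity-checked against the MD objective itself. This is a small sketch under assumed data: `Q`, `x_t`, `g_t`, and `eta` are random or arbitrary illustrative values, and `g_t` simply stands in for $\nabla f(x^t)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)               # symmetric positive definite
x_t = rng.standard_normal(n)              # current iterate
g_t = rng.standard_normal(n)              # stands in for grad f(x^t)
eta = 0.3

def md_objective(x):
    """eta * <g^t, x> + 0.5 (x - x^t)^T Q (x - x^t), the MD subproblem with C = R^n."""
    return eta * g_t @ x + 0.5 * (x - x_t) @ Q @ (x - x_t)

# Closed-form update x^{t+1} = x^t - eta * Q^{-1} g^t (solve instead of inverting Q).
x_next = x_t - eta * np.linalg.solve(Q, g_t)

# First-order optimality: the gradient of the subproblem vanishes at x_next.
stationarity = eta * g_t + Q @ (x_next - x_t)

# The subproblem is strictly convex, so any perturbation increases the objective.
perturbed = [md_objective(x_next + 0.1 * rng.standard_normal(n)) for _ in range(20)]
print(np.linalg.norm(stationarity), min(perturbed) - md_objective(x_next))
```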


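Example: KL divergence

Let $D_\phi(x, z) = \mathrm{KL}(x \,\|\, z) := \sum_i x_i \log \frac{x_i}{z_i}$, which is generated by

$$\phi(x) = \sum_i x_i \log x_i \qquad \text{(negative entropy)}$$

if $\mathcal{C} = \Delta := \{x \in \mathbb{R}^n_+ \mid \sum_i x_i = 1\}$ is probability simplex

Proof:

$$\begin{aligned}
D_\phi(x, z) &= \phi(x) - \phi(z) - \langle \nabla \phi(z), x - z \rangle \\
&= \sum_i x_i \log x_i - \sum_i z_i \log z_i - \sum_i (\log z_i + 1)(x_i - z_i) \\
&= - \underbrace{\sum_i x_i}_{=1} + \underbrace{\sum_i z_i}_{=1} + \sum_i x_i \log \frac{x_i}{z_i} = \mathrm{KL}(x \,\|\, z)
\end{aligned}$$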
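A quick numerical check of this proof, as a sketch: the helper names (`bregman`, `neg_entropy`, `grad_neg_entropy`) and the random strictly positive probability vectors are assumptions made for illustration.

```python
import numpy as np

def bregman(phi, grad_phi, x, z):
    """Bregman divergence D_phi(x, z) = phi(x) - phi(z) - <grad phi(z), x - z>."""
    return phi(x) - phi(z) - grad_phi(z) @ (x - z)

def neg_entropy(x):
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    return np.log(x) + 1.0

rng = np.random.default_rng(1)
# Two strictly positive probability vectors (entries bounded away from 0).
x = rng.random(5) + 0.1
x /= x.sum()
z = rng.random(5) + 0.1
z /= z.sum()

d_bregman = bregman(neg_entropy, grad_neg_entropy, x, z)
kl = np.sum(x * np.log(x / z))
print(d_bregman, kl)                      # should coincide on the simplex
```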
Example: KL divergence

When $D_\phi(x, z) = \mathrm{KL}(x \,\|\, z)$, $\mathcal{C} = \Delta$, and $f$ differentiable, MD has closed-form (homework)

$$x_i^{t+1} = \frac{x_i^t \exp\big( -\eta_t [\nabla f(x^t)]_i \big)}{\sum_{j=1}^n x_j^t \exp\big( -\eta_t [\nabla f(x^t)]_j \big)}, \qquad 1 \le i \le n$$

• often called exponentiated gradient descent or entropic descent
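The update above is short enough to sketch directly. The example below is illustrative only: it assumes a linear objective $f(x) = \langle c, x \rangle$ with a made-up cost vector `c`, a constant step size, and a hypothetical function name `exponentiated_gradient_step`. Subtracting `grad.min()` inside the exponential is a numerical-stability choice that cancels after normalization.

```python
import numpy as np

def exponentiated_gradient_step(x, grad, eta):
    """One entropic mirror descent step on the probability simplex.

    x_i <- x_i * exp(-eta * grad_i) / sum_j x_j * exp(-eta * grad_j).
    Shifting grad by a constant leaves the normalized result unchanged.
    """
    w = x * np.exp(-eta * (grad - grad.min()))
    return w / w.sum()

c = np.array([0.3, 0.1, 0.7])             # assumed cost vector; f(x) = <c, x>
x = np.full(3, 1.0 / 3.0)                 # start from the uniform distribution
for _ in range(200):
    x = exponentiated_gradient_step(x, grad=c, eta=0.5)

print(x)                                  # mass concentrates on argmin_i c_i
```

For a linear objective the minimizer over the simplex is a vertex, so the iterates should concentrate on the coordinate with the smallest cost while staying strictly positive and summing to one.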


Example: generalized KL divergence

If $\mathcal{C} = \mathbb{R}^n_+$ (positive orthant), then negative entropy $\phi(x) = \sum_i x_i \log x_i$ generates

$$D_\phi(x, z) = \mathrm{KL}(x \,\|\, z) := \sum_i \Big( x_i \log \frac{x_i}{z_i} - x_i + z_i \Big)$$


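Example: von Neumann divergence

If $\mathcal{C} = \mathbb{S}^n_+$ (positive-definite cone), then generalized negative entropy of eigenvalues

$$\phi(X) = \sum_i \big( \lambda_i(X) \log \lambda_i(X) - \lambda_i(X) \big) = \mathrm{Tr}(X \log X - X)$$

generates von Neumann divergence

$$D_\phi(X, Z) := \mathrm{Tr}\big( X (\log X - \log Z) - X + Z \big)$$

which reduces to $\sum_i \big( \lambda_i(X) \log \frac{\lambda_i(X)}{\lambda_i(Z)} - \lambda_i(X) + \lambda_i(Z) \big)$ when $X$ and $Z$ are simultaneously diagonalizable (with eigenvalues paired accordingly)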
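A minimal sketch of evaluating the trace formula, assuming symmetric positive definite inputs. The helper names `sym_logm` and `von_neumann_divergence` are hypothetical; the matrix logarithm is computed through an eigendecomposition (`numpy.linalg.eigh`). The last check uses diagonal matrices, for which the trace formula reduces to the eigenvalue-wise generalized KL sum.

```python
import numpy as np

def sym_logm(X):
    """Matrix logarithm of a symmetric positive definite matrix via eigh."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T          # V diag(log w) V^T

def von_neumann_divergence(X, Z):
    """Tr(X (log X - log Z) - X + Z) for symmetric positive definite X, Z."""
    return np.trace(X @ (sym_logm(X) - sym_logm(Z)) - X + Z)

rng = np.random.default_rng(0)

def random_pd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + np.eye(n)            # symmetric positive definite

X, Z = random_pd(3), random_pd(3)
d_xz = von_neumann_divergence(X, Z)
d_xx = von_neumann_divergence(X, X)

# Commuting (diagonal) case: compare with the generalized KL of the eigenvalues.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])
d_diag = von_neumann_divergence(np.diag(a), np.diag(b))
gen_kl = np.sum(a * np.log(a / b) - a + b)
print(d_xz, d_xx, d_diag, gen_kl)
```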


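Common families

Function name | $\phi(x)$ | dom $\phi$ | $D_\phi(x; y)$
Squared norm | $\frac{1}{2} x^2$ | $(-\infty, +\infty)$ | $\frac{1}{2} (x - y)^2$
Shannon entropy | $x \log x - x$ | $[0, +\infty)$ | $x \log \frac{x}{y} - x + y$
Bit entropy | $x \log x + (1 - x) \log(1 - x)$ | $[0, 1]$ | $x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}$
Burg entropy | $-\log x$ | $(0, +\infty)$ | $\frac{x}{y} - \log \frac{x}{y} - 1$
Hellinger | $-\sqrt{1 - x^2}$ | $[-1, 1]$ | $(1 - xy)(1 - y^2)^{-1/2} - (1 - x^2)^{1/2}$
$\ell_p$ quasi-norm | $-x^p$ $(0 < p < 1)$ | $[0, +\infty)$ | $-x^p + p\, x y^{p-1} - (p - 1)\, y^p$
$\ell_p$ norm | $\lvert x \rvert^p$ $(1 < p < \infty)$ | $(-\infty, +\infty)$ | $\lvert x \rvert^p - p\, x\, \mathrm{sgn}(y) \lvert y \rvert^{p-1} + (p - 1) \lvert y \rvert^p$
Exponential | $\exp x$ | $(-\infty, +\infty)$ | $\exp x - (x - y + 1) \exp y$
Inverse | $1/x$ | $(0, +\infty)$ | $1/x + x/y^2 - 2/y$

Table 2.1: Common seed functions and the corresponding divergences (taken from I. Dhillon & J. Tropp, 2007)

Exponential family | $\psi(\theta)$ | dom $\psi$ | $\mu(\theta)$ | $\phi(x)$ | Divergence
Gaussian ($\sigma^2$ fixed) | $\frac{1}{2} \sigma^2 \theta^2$ | $(-\infty, +\infty)$ | $\sigma^2 \theta$ | $\frac{1}{2\sigma^2} x^2$ | Euclidean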
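Entries of the seed-function table can be spot-checked against the definition $D_\phi(x; y) = \phi(x) - \phi(y) - \phi'(y)(x - y)$. The sketch below covers four of the rows; the dictionary `families`, the helper `bregman_1d`, and the evaluation point are illustrative assumptions.

```python
import numpy as np

def bregman_1d(phi, dphi, x, y):
    """Scalar Bregman divergence phi(x) - phi(y) - phi'(y) (x - y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

# name -> (phi, phi', closed-form divergence as listed in the table)
families = {
    "Shannon entropy": (lambda x: x * np.log(x) - x,
                        lambda x: np.log(x),
                        lambda x, y: x * np.log(x / y) - x + y),
    "Burg entropy": (lambda x: -np.log(x),
                     lambda x: -1.0 / x,
                     lambda x, y: x / y - np.log(x / y) - 1.0),
    "Exponential": (np.exp,
                    np.exp,
                    lambda x, y: np.exp(x) - (x - y + 1.0) * np.exp(y)),
    "Inverse": (lambda x: 1.0 / x,
                lambda x: -1.0 / x**2,
                lambda x, y: 1.0 / x + x / y**2 - 2.0 / y),
}

x, y = 0.7, 1.9                           # arbitrary point inside every domain above
gaps = {name: abs(bregman_1d(phi, dphi, x, y) - closed(x, y))
        for name, (phi, dphi, closed) in families.items()}
max_gap = max(gaps.values())
print(gaps)
```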
Basic properties of Bregman divergence

Let $\phi : \mathcal{C} \to \mathbb{R}$ be $\mu$-strongly convex and differentiable on $\mathcal{C}$

• non-negativity: $D_\phi(x, z) \ge 0$, and $D_\phi(x, z) = 0$ iff $x = z$
    ◦ in fact, $D_\phi(x, z) \ge \frac{\mu}{2} \|x - z\|_2^2$ (strong convexity of $\phi$)
• convexity: $D_\phi(x, z)$ is convex in $x$ (by definition, since $\phi$ is convex), but not necessarily convex in $z$ (e.g. $\phi(x) = -\log x$ with domain $[1, \infty)$)
• lack of symmetry: in general, $D_\phi(x, z) \ne D_\phi(z, x)$
• linearity: for $\phi_1, \phi_2$ strictly convex and $\lambda \ge 0$,

$$D_{\phi_1 + \lambda \phi_2}(x, z) = D_{\phi_1}(x, z) + \lambda D_{\phi_2}(x, z)$$

• unaffected by linear terms: let $\phi_2(x) = \phi_1(x) + a^\top x + b$, then $D_{\phi_2} = D_{\phi_1}$
• gradient: $\nabla_x D_\phi(x, z) = \nabla \phi(x) - \nabla \phi(z)$
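Several of these properties can be illustrated on a concrete example. The sketch below assumes the negative entropy $\phi(x) = \sum_i x_i \log x_i$ on positive vectors; the specific vectors, the linear perturbation `(a, b)`, and the finite-difference step are arbitrary illustrative choices.

```python
import numpy as np

def bregman(phi, grad_phi, x, z):
    """Bregman divergence D_phi(x, z) = phi(x) - phi(z) - <grad phi(z), x - z>."""
    return phi(x) - phi(z) - grad_phi(z) @ (x - z)

phi = lambda x: np.sum(x * np.log(x))     # negative entropy
grad_phi = lambda x: np.log(x) + 1.0

x = np.array([1.0, 1.0, 0.5])
z = np.array([2.0, 4.0, 0.5])

# Non-negativity and lack of symmetry.
d_xz = bregman(phi, grad_phi, x, z)
d_zx = bregman(phi, grad_phi, z, x)

# Unaffected by linear terms: phi2(x) = phi(x) + a^T x + b.
a, b = np.array([0.3, -1.2, 2.0]), 3.0
phi2 = lambda x: phi(x) + a @ x + b
grad_phi2 = lambda x: grad_phi(x) + a
d_xz_shifted = bregman(phi2, grad_phi2, x, z)

# Gradient in x: compare central finite differences with grad phi(x) - grad phi(z).
eps = 1e-6
num_grad = np.array([
    (bregman(phi, grad_phi, x + eps * e, z) - bregman(phi, grad_phi, x - eps * e, z)) / (2 * eps)
    for e in np.eye(3)
])
exact_grad = grad_phi(x) - grad_phi(z)
print(d_xz, d_zx, d_xz_shifted, np.max(np.abs(num_grad - exact_grad)))
```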


Three-point lemma

Fact 5.1
For every three points $x, y, z$,

$$D_\phi(x, z) = D_\phi(x, y) + D_\phi(y, z) - \langle \nabla \phi(z) - \nabla \phi(y), x - y \rangle$$

• for Euclidean case with $\phi(x) = \|x\|_2^2$, this is law of cosine

$$\|x - z\|_2^2 = \|x - y\|_2^2 + \|y - z\|_2^2 - 2 \underbrace{\langle z - y, x - y \rangle}_{\|z - y\|_2 \|x - y\|_2 \cos \angle zyx}$$

[Figure: triangle with vertices $x$, $y$, $z$; sides labeled $D_\phi(x, z) = \|x - z\|_2^2$, $D_\phi(x, y) = \|x - y\|_2^2$, $D_\phi(y, z) = \|y - z\|_2^2$; angle $\angle xyz$ marked]
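Proof of three-point lemma

$$\begin{aligned}
& D_\phi(x, y) + D_\phi(y, z) - D_\phi(x, z) \\
&\quad = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle + \phi(y) - \phi(z) - \langle \nabla \phi(z), y - z \rangle \\
&\qquad - \big\{ \phi(x) - \phi(z) - \langle \nabla \phi(z), x - z \rangle \big\} \\
&\quad = -\langle \nabla \phi(y), x - y \rangle - \langle \nabla \phi(z), y - z \rangle + \langle \nabla \phi(z), x - z \rangle \\
&\quad = \langle \nabla \phi(z) - \nabla \phi(y), x - y \rangle
\end{aligned}$$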
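The three-point identity holds exactly for any differentiable $\phi$, so it can be confirmed numerically. This sketch assumes the negative entropy as $\phi$ and three random positive vectors; these choices are illustrative only.

```python
import numpy as np

def bregman(phi, grad_phi, x, z):
    """Bregman divergence D_phi(x, z) = phi(x) - phi(z) - <grad phi(z), x - z>."""
    return phi(x) - phi(z) - grad_phi(z) @ (x - z)

phi = lambda v: np.sum(v * np.log(v))     # negative entropy
grad_phi = lambda v: np.log(v) + 1.0

rng = np.random.default_rng(4)
x, y, z = (rng.random(5) + 0.1 for _ in range(3))

lhs = bregman(phi, grad_phi, x, z)
rhs = (bregman(phi, grad_phi, x, y)
       + bregman(phi, grad_phi, y, z)
       - (grad_phi(z) - grad_phi(y)) @ (x - y))
print(lhs, rhs)                           # both sides of Fact 5.1
```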


Connection with exponential families

Exponential family: a family of distributions with probability density (parametrized by $\theta$)

$$p_\phi(x \mid \theta) = \exp\{ \langle x, \theta \rangle - \phi(\theta) - h(x) \}$$

for some cumulant function $\phi$ and some function $h$

• example (spherical Gaussian)

$$p_\phi(x \mid \theta) \propto \exp\Big\{ -\frac{\|x - \theta\|_2^2}{2} \Big\} = \exp\Big\{ \langle x, \theta \rangle - \underbrace{\frac{1}{2} \|\theta\|_2^2}_{=: \phi(\theta)} - \frac{1}{2} \|x\|_2^2 \Big\}$$


Connection with exponential families

For exponential families, under mild conditions, $\exists$ function $g_{\phi^*}$ s.t.

$$p_\phi(x \mid \theta) = \exp\{ -D_{\phi^*}(x, \mu(\theta)) \}\, g_{\phi^*}(x) \qquad (5.2)$$

where $\phi^*(\theta) := \sup_x \{ \langle x, \theta \rangle - \phi(x) \}$ is Fenchel conjugate of $\phi$, and $\mu(\theta) := \mathbb{E}_\theta[x]$

• $\exists$ unique Bregman divergence associated with every member of exponential family
• example (spherical Gaussian): since $\phi^*(x) = \frac{1}{2} \|x\|_2^2$, we have $D_{\phi^*}(x, \mu) = \frac{1}{2} \|x - \mu\|_2^2$, which implies

$$p_\phi(x \mid \theta) \propto \exp\Big\{ -\underbrace{\frac{\|x - \mu\|_2^2}{2}}_{D_{\phi^*}(x, \mu)} \Big\}$$


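Proof of (5.2)

$$\begin{aligned}
p_\phi(x \mid \theta) &= \exp\{ \langle x, \theta \rangle - \phi(\theta) - h(x) \} \\
&\overset{\text{(i)}}{=} \exp\{ \phi^*(\mu) + \langle x - \mu, \nabla \phi^*(\mu) \rangle - h(x) \} \\
&= \exp\{ -\phi^*(x) + \phi^*(\mu) + \langle x - \mu, \nabla \phi^*(\mu) \rangle \} \exp\{ \phi^*(x) - h(x) \} \\
&= \exp\big( -D_{\phi^*}(x, \mu) \big) \underbrace{\exp\{ \phi^*(x) - h(x) \}}_{=: g_{\phi^*}(x)}
\end{aligned}$$

Here, (i) follows since (a) in exponential families, one has $\mu = \nabla \phi(\theta)$ and $\nabla \phi^*(\mu) = \theta$, and (b) $\langle \mu, \theta \rangle = \phi(\theta) + \phi^*(\mu)$ (homework)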
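For the one-dimensional unit-variance Gaussian, the factorization (5.2) can be verified on a grid. The sketch below assumes $\phi(\theta) = \frac{1}{2}\theta^2$, $\phi^*(x) = \frac{1}{2}x^2$, and $h(x) = \frac{1}{2}x^2 + \frac{1}{2}\log(2\pi)$ (the constant makes the density normalized); the value of `theta` and the grid are arbitrary.

```python
import numpy as np

theta = 0.8                               # natural parameter (arbitrary choice)
xs = np.linspace(-3.0, 3.0, 13)           # evaluation grid

phi = lambda t: 0.5 * t**2                # cumulant function
phi_star = lambda x: 0.5 * x**2           # Fenchel conjugate of phi
h = lambda x: 0.5 * x**2 + 0.5 * np.log(2.0 * np.pi)

mu = theta                                # mu = grad phi(theta) for this family

# Left-hand side: exponential-family form of the density.
density = np.exp(xs * theta - phi(theta) - h(xs))

# Right-hand side of (5.2): exp(-D_{phi*}(x, mu)) * g_{phi*}(x),
# using grad phi*(mu) = mu for phi*(x) = x^2 / 2.
d_conj = phi_star(xs) - phi_star(mu) - mu * (xs - mu)
g = np.exp(phi_star(xs) - h(xs))
density_bregman = np.exp(-d_conj) * g

print(np.max(np.abs(density - density_bregman)))
```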


Bregman projection

Given a point $x$, define

$$P_{\mathcal{C}, \phi}(x) := \arg\min_{z \in \mathcal{C}} D_\phi(z, x)$$

as Bregman projection of $x$ onto $\mathcal{C}$

• as we shall see, MD is useful when Bregman projection requires little computational effort

Generalized Pythagorean Theorem

Fact 5.2
If $x_{\mathcal{C}, \phi} = P_{\mathcal{C}, \phi}(x)$, then

$$D_\phi(z, x) \ge D_\phi(z, x_{\mathcal{C}, \phi}) + D_\phi(x_{\mathcal{C}, \phi}, x) \qquad \forall z \in \mathcal{C}$$

• in squared Euclidean case, it means angle $\angle z x_{\mathcal{C}, \phi} x$ is obtuse
• if $\mathcal{C}$ is affine plane, then

$$D_\phi(z, x) = D_\phi(z, x_{\mathcal{C}, \phi}) + D_\phi(x_{\mathcal{C}, \phi}, x) \qquad \forall z \in \mathcal{C}$$

[Figure: point $x$, its Bregman projection $x_{\mathcal{C}, \phi}$ onto $\mathcal{C}$, and a point $z \in \mathcal{C}$]


Proof of Fact 5.2

Let

$$g = \nabla_z D_\phi(z, x) \big|_{z = x_{\mathcal{C}, \phi}} = \nabla \phi(x_{\mathcal{C}, \phi}) - \nabla \phi(x)$$

Since $x_{\mathcal{C}, \phi} = \arg\min_{z \in \mathcal{C}} D_\phi(z, x)$, optimality condition gives

$$\langle g, z - x_{\mathcal{C}, \phi} \rangle \ge 0 \qquad \forall z \in \mathcal{C}$$

Therefore, for all $z \in \mathcal{C}$,

$$\begin{aligned}
0 &\ge \langle g, x_{\mathcal{C}, \phi} - z \rangle = \langle \nabla \phi(x) - \nabla \phi(x_{\mathcal{C}, \phi}), z - x_{\mathcal{C}, \phi} \rangle \\
&= D_\phi(x_{\mathcal{C}, \phi}, x) - D_\phi(z, x) + D_\phi(z, x_{\mathcal{C}, \phi})
\end{aligned}$$

as claimed, where last line comes from Fact 5.1
= Îx2descent
+
2Îx
2
Bregman divergence
Ï Ï Ï

if Ø is affine
≠ zÎ ≠=y, 5-26≠ yÍ
x (z, = (z, ) + (x
• then (z, (z, ) + (x C 5-21• Ïfor CEuclidean case with
Mirror 5-2 ’z
ifC,Ï
C isÎx affine plane, then =
(z, =Mirror
(z,
descent
) + 2 (x • Ï ¸ 2 Ï(x) ÎxÎ
Îx ≠Mirror
2zÎ2descent
zÎD 2
=
x)Îx ≠D yÎ
2yÎ 2 C2
x
2 + Îy
Îy D ≠ C 2, x) ≠ 2’z
5-26 œ Èz
2
Mirror
C ≠ y,
descent x ≠ yÍ D D x˝C 2 Ï C
D , x) 22 œ C
5-21 D z≠y
Ï(x) x) D x D , x) ’z œ C
Îx ≠ yÎ2 + Îy ≠2 zÎ2 ≠2 2 Èz 2≠ y, x
For
Îx ≠
(xÏC(z,
Ï
MirrorÎx
C ) +’z Ïœ(x
≠ Ï Ï
≠2 is zÎ ≠ 2•+˝ Îy
Mirror Èz
descent ≠ y, x 2
5-21 ≠ yÍ ˚˙ Ï
5-21
2 ϸ ˚˙ Ï ˝
Ïevery , x)xthree Dpoints C˚˙2 = 2 Mirror
¸ ˚˙ ˝
DÏ(x, œ= (x, +Èz≠z, (y,
CC ,•x)¸for if affine plane, then 2
2
x,Euclidean
2
case with
2
=≠ zÎ22, ≠ this is Èz
law ≠ofy,cosine
2
≠descent 2 xz≠y 2
+D
zÎ22 Îx≠

cosFact 5.2
= + 2
Ï=(x =
descent
’z œz≠y
ÏFor C ) +every three points
x) D • C
Îx ≠ Îx ≠
≠˚˙¸z,zÎ
Îy ≠ ≠
Îx ≠ Èz
yÎ2Ï(x)
≠ ≠

\xyz
zÎ yÎ zÎ y,
Mirror x yÍ x≠y cos \zyx 5-2
y, ¸˝Mirror ˚˙ ÎxÎ
Mirror descent 5-26

Let 2Mirror≠ ÈÒÏ(z) ÒÏ(y


z≠y
(z, xD z) C z)2D
≠¸zÎ22y) D C z)
cos descent 2 2 5-26
’z≠ ÏÎx ≠ Îx Ï≠ Î\zyx Îx CÎx≠
\zyx
Mirror descent
DÏ (z,x≠yx) =
˝
x≠y descent
≠2 (x
D C ,2
C yÎ2 +
2

x, 2y,
≠ 2˚˙
2
2 x≠y
≠ ˝Mirror
x
5-26
2
5-26

Ï (z, = with (z, )+ (x x)Ï ≠ zÎ22 =


cos 2
2 + Îy ≠
z≠y D\zyx cos x)
\zyx 2
- Îy 2

D (z, x) Ø D (z, x ) + D (x , x)
cos MirrorÎx zÎ 2 ’z Îx
œ≠ y, x ≠ yÍ
D
• for x) case
Euclidean DÏÏ(x) =x 2x≠y z≠y
DofÏ\zyx 2 x≠y,descent ≠ yÎ
2 , this is2law cosine
(x, +D 2 (y, z)
z≠y descent
2C,Ï
ÎxÎ
- descent
Alternative forms
z≠y 2 x≠y 2¸cos \zyx
of mirror descent
C,Ï 5-2
Fact 25.2 ≠Îx Ï (z, 2x) =≠ (z, = C )Alternative
z+ (z, (x = of ) ˚˙
Mirror descent 5-2

forms mirror ≠descent ˝


’z œ C
Mirror descent 5-26

Ï ÈÒÏ(z)
Fact
y) 5.2
Mirror descent
D≠ =≠ yÍ D≠ÏÒÏ(y), 2xÒ
gMirror + D≠Ïx
D Ï• ≠ C yÍ
2x) ,-x)
2z=x ’z œxC
5-26
ÒÏ(x ÒÏ(x)
DÏ (x,Îxz) ≠ zÎÏ
= DÏ (x, +z≠y
Îy ≠ zÎ22 ≠ 2Mirror zÎ ≠2
(y, Îx ≠yÎ •Ï2• inÎy squared
≠zÎC 2≠ Euclidean
Èz ≠Ïy, ˚˙ case,
≠ yÍ
C C,Ï˝ it5-2 means If xangle
2 cos \zyx
\zxC x isthen
obtuse
C5-21= PC,Ï (x),
descent 2 = Îx ≠ yÎ2 + 5-26
or descent Mirror descent Mirror descent
5-26 y) DÈz y, x
¸ descent
Ïx≠y˚˙ z)
˝ ÈÒÏ(z)descent
ÒÏ(y), ¸x ≠ yÍ z≠y 2 x≠y
5-21 5-2
5-21
5-21

uclidean case,\zx it means angle is obtuse


C,Ï
means angle is obtuse \zx
If = (x), then
(x, •= (x,optimality
+D Ï (y,
Cx Mirror cos \zyx
descent Mirror descent
If x = P (x), then x 2 2
x P cos
scent C C C,Ï C,Ï C,Ï 5-21 z≠y
C,Ï D
Since x5-21
2
=Ïarg
x≠y
minz)
2 \zyx
ÏD
(z,Ïx),
DConvergencey) analysis
condition z) ≠
gives ÈÒÏ(z) ≠ ÒÏ(y)
plane, then Mirror descent
(z,xx)
DÏ (z, x) Ø DDÏÏ(z, (z, x) •
•ifD’z
)+ C œisC,Ï
Ï (x C affine
Convergence plane,
œ C then
’zanalysis
zœC 5-21 Mirror descent
DÏ (z, x) Ø D
C)Ø
+Mirror (xCx, C,Ï
Mirror descent
Mirror descent
DÏÏdescent , x) 5-21
5-21

(z,
Mirror descent
x) = D (z, x ) +’z
C ) + DÏ (xÏC , x) C œ(xCCdescent ’z œ C DÏ (x ≠ Èg,
Dz) z(x=x≠
≠ Îx z)Í≠Ø
C,Ï 0Îx
=zÎ zÎœ22≠
2 ≠’z
Îx C5-26
C ÎÏ2(x ≠
x5-21D 2 Îxy)C2≠ = zÎ
Îx22 ≠ yÎ 2
\xyz
2
• for Euclidean case with Ï(x) = ÎxÎ , this is law of
DÏMirror , x) Ï 2
• for
• ifEuclidean if C is
case
C is affine •plane, affine
with
then plane, =
then
ÎxÎ22 , this isDlaw (yD of (z,
cosine
x)
= = D (z,
2 xC\xyz) + D (x 2 , x) ’z Mirror
œ C desc
• in squared Euclidean case, it means angle \zx x is obtuse
ase with Ï(x) = ÎxÎ 2 , this
Ï(x)
is law of cosine
Ï
≠ z) Îy ≠ÏzÎ 2
Ï C
2 Therefore, for all
Mirror
z œ Ï
descent
C, • if C is affine plane, t
D (z, x) = D (z,
DÏ (z, x) = DÏÏ(z, xC ) + DÏÏ (xC , C,Ï x x) ) + D’z (x
Ï œ C,Ï
C , x) ’z œ C C
2 2 2
Îx ≠ zÎ2 = Îx ≠ yÎ2 + Îy ≠0zÎ ≠x2 ≠≠Èz ≠2 = yÎ22C,Ï
+ 2
ÍzÎ22, ≠ 2 isÏ Èz
for y,Euclidean =y,˚˙ case
≠ yÍ≠
x Îx
with ≠=
2 this D law≠ofy,c
¸(z, x) =˚
Mirror descent 5-26
Mirror descent
2 2 2Èg,
Îx C,Ï ¸zÎ ˝≠ ), zÎy x≠ 5-26 5-26

if is ≠affine≠Cplane, then
Îx ≠ + Îy 2 Èz • Ø≠ ≠zÍyÍ 2 ÈÒÏ(x) ÒÏ(x Ï(x) ÎxÎ
C,Ï 5-26

C ) + D •(x

Ï C2 , C
x) zÎ
’z2 œ ¸ = DÏ (x
x
Îz≠yÎ Îx≠yÎ cos
˚˙C,Ï , x)2≠˝ DÏ (z,
2 x) + DÏ (z, xC,Ï )
\zyx
Mirror descent
Mirror descent Îz≠yÎ2 Îx≠
2 2
Îz≠yÎ2 Îx≠yÎ2 cos \zyx
MirrorÎx
as claimed, where ≠linezÎ
lastMirror 2 = from
comes Îx ≠ yÎ
Fact 5.1
2 + Îy ≠ zÎ22 ≠ 2 Èz ≠ y, x
Fact 5.2
descent
DÏ (z, x) = DÏ (z, xC ) + DÏ (xC , x)
descent
Mirror descent 5-21 C¸
’z œÎz≠yÎ ˚˙
2 Îx≠yÎ
it means angle \zxC x is obtuse
If xC,ϕ = PC,ϕ (x), then
Mirror descent Mirror descent 5-21 5-27

Dϕ (z, x) ≥Mirror (z, xC,ϕ ) + Dϕ (xC,ϕ , x)


Dϕdescent ∀z ∈ C
xC )Mirror
+ Ddescent
Ï (xC , x) ’z œ C 5-2

• if C is affine plane, then


Dϕ (z, x) = Dϕ (z, xC,ϕ ) + Dϕ (xC,ϕ , x) ∀z ∈ C
5-26
Mirror descent 5-27
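The three-point identity of Fact 5.1 can be checked numerically. The sketch below assumes the negative-entropy mirror map on the positive orthant; the helper name `bregman` and the random test points are illustrative choices, not part of the lecture.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Assumed mirror map: negative entropy on the positive orthant.
phi = lambda v: np.sum(v * np.log(v))
grad_phi = lambda v: 1.0 + np.log(v)

rng = np.random.default_rng(0)
x, y, z = rng.uniform(0.1, 2.0, size=(3, 5))   # three arbitrary positive points

# Left- and right-hand sides of Fact 5.1.
lhs = bregman(phi, grad_phi, x, z)
rhs = (bregman(phi, grad_phi, x, y) + bregman(phi, grad_phi, y, z)
       - (grad_phi(z) - grad_phi(y)) @ (x - y))
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, since the identity is purely algebraic and holds for any differentiable mirror map.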
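Proof of Fact 5.2

Let

$$g = \nabla_z D_\varphi(z, x)\,\big|_{z = x_{C,\varphi}} = \nabla\varphi(x_{C,\varphi}) - \nabla\varphi(x)$$

Since $x_{C,\varphi} = \arg\min_{z \in C} D_\varphi(z, x)$, optimality condition gives

$$\langle g,\; z - x_{C,\varphi} \rangle \ge 0 \qquad \forall z \in C$$

Therefore, for all $z \in C$,

$$0 \ge \langle g,\; x_{C,\varphi} - z \rangle = \langle \nabla\varphi(x) - \nabla\varphi(x_{C,\varphi}),\; z - x_{C,\varphi} \rangle = D_\varphi(z, x_{C,\varphi}) + D_\varphi(x_{C,\varphi}, x) - D_\varphi(z, x)$$

as claimed, where last line comes from Fact 5.1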
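A small numerical illustration of Fact 5.2, under two assumed setups that are not taken from the lecture: (i) $\varphi(x) = \frac12\|x\|_2^2$ with C a box, where the projection is coordinate-wise clipping, and (ii) negative entropy with C the probability simplex, where the Bregman (KL) projection of a positive vector is a rescaling to unit sum.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Case (i): phi(x) = 0.5 * ||x||_2^2 and C = [0, 1]^n, so P_C is clipping.
sq = lambda a, b: 0.5 * np.sum((a - b) ** 2)
x = rng.normal(scale=2.0, size=n)
x_c = np.clip(x, 0.0, 1.0)
gaps_box = [sq(z, x) - sq(z, x_c) - sq(x_c, x)
            for z in rng.uniform(0.0, 1.0, size=(100, n))]

# Case (ii): negative entropy and C = simplex; the divergence is generalized KL.
kl = lambda a, b: np.sum(a * np.log(a / b) - a + b)
w = rng.uniform(0.1, 3.0, size=n)
w_c = w / w.sum()                      # KL projection of w onto the simplex
gaps_kl = [kl(z, w) - kl(z, w_c) - kl(w_c, w)
           for z in rng.dirichlet(np.ones(n), size=100)]

# Each gap is D(z, x) - D(z, x_C) - D(x_C, x), which Fact 5.2 says is >= 0.
print(min(gaps_box), max(abs(g) for g in gaps_kl))
```

In case (ii) the gaps come out numerically zero: the only active constraint at the projection is the affine one (entries summing to one), which matches the equality case of Fact 5.2.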


Alternative forms of mirror descent

An alternative form of MD

Using Bregman divergence, one can also describe MD as

$$\nabla\varphi(y^{t+1}) = \nabla\varphi(x^t) - \eta_t g^t \quad \text{with } g^t \in \partial f(x^t) \qquad (5.3a)$$

$$x^{t+1} \in P_{C,\varphi}(y^{t+1}) = \arg\min_{z \in C} D_\varphi(z, y^{t+1}) \qquad (5.3b)$$

• performs gradient descent in “dual” space
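As an illustration of (5.3a)-(5.3b), here is a minimal sketch of one MD step, assuming the negative-entropy mirror map $\varphi(x) = \sum_i x_i \log x_i$ and C the probability simplex. The function name `entropic_md_step` and the sample subgradient are hypothetical.

```python
import numpy as np

def entropic_md_step(x, g, eta):
    """One step of (5.3a)-(5.3b) with phi(x) = sum_i x_i log x_i, C = simplex."""
    # (5.3a): grad phi(y) = grad phi(x) - eta * g, where grad phi(v) = 1 + log v,
    # so log y = log x - eta * g.
    y = np.exp(np.log(x) - eta * g)
    # (5.3b): the Bregman (KL) projection of a positive vector onto the simplex
    # is a rescaling to unit sum.
    return y / y.sum()

x0 = np.full(4, 0.25)                    # uniform starting point
g = np.array([1.0, 0.0, -1.0, 0.5])      # illustrative subgradient
x1 = entropic_md_step(x0, g, eta=0.5)
print(x1)
```

With this mirror map the two-stage update collapses to a multiplicative (exponentiated-gradient) rule: coordinates with smaller subgradient entries gain mass.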


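An alternative form of MD

The equivalence can be seen by looking at optimality conditions

• optimality condition of (5.3b) together with (5.3a) gives

$$0 \in \nabla\varphi(x^{t+1}) - \nabla\varphi(y^{t+1}) + \underbrace{N_C(x^{t+1})}_{\text{normal cone}} = \nabla\varphi(x^{t+1}) - \nabla\varphi(x^t) + \eta_t g^t + N_C(x^{t+1})$$

• optimality condition of (5.1) reads

$$0 \in g^t + \frac{1}{\eta_t}\left\{ \nabla\varphi(x^{t+1}) - \nabla\varphi(x^t) \right\} + N_C(x^{t+1})$$

• these two conditions are identical: multiplying the second by $\eta_t > 0$ gives the first, since $N_C(x^{t+1})$ is a cone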


Another form of MD

For simplicity, assume $C = \mathbb{R}^n$; then another form is

$$x^{t+1} = \nabla\varphi^*\big( \nabla\varphi(x^t) - \eta_t g^t \big) \qquad (5.4)$$

where $\varphi^*(x) := \sup_z \{ \langle z, x \rangle - \varphi(z) \}$ is Fenchel conjugate of $\varphi$

• this is the version originally proposed in Nemirovski & Yudin '1983
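Update (5.4) can be sketched in a few lines. The example assumes a quadratic mirror map $\varphi(x) = \frac12 x^\top Q x$ with a hypothetical positive definite matrix Q, for which $\nabla\varphi(x) = Qx$ and $\nabla\varphi^*(y) = Q^{-1}y$; the names `md_step_dual`, `grad_phi` and `grad_phi_conj` are illustrative.

```python
import numpy as np

def md_step_dual(x, g, eta, grad_phi, grad_phi_conj):
    """Update (5.4): map to the dual space, take a gradient step, map back."""
    return grad_phi_conj(grad_phi(x) - eta * g)

Q = np.array([[2.0, 0.5], [0.5, 1.0]])            # hypothetical positive definite matrix
grad_phi = lambda x: Q @ x                         # phi(x) = 0.5 * x^T Q x
grad_phi_conj = lambda y: np.linalg.solve(Q, y)    # phi*(y) = 0.5 * y^T Q^{-1} y

x = np.array([1.0, -1.0])
g = np.array([0.3, 0.7])                           # illustrative subgradient
eta = 0.1
x_next = md_step_dual(x, g, eta, grad_phi, grad_phi_conj)
print(x_next)
```

For this choice the update reduces to $x - \eta Q^{-1} g$, a preconditioned (sub)gradient step; taking Q as the identity recovers the plain subgradient method.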


Another form of MD

When $C = \mathbb{R}^n$, (5.3a)-(5.3b) simplifies to

$$x^{t+1} = y^{t+1} = (\nabla\varphi)^{-1}\big( \nabla\varphi(x^t) - \eta_t g^t \big)$$

It thus suffices to show

$$(\nabla\varphi)^{-1} = \nabla\varphi^* \qquad (5.5)$$


Proof of Claim (5.5)

Suppose $y = \nabla\varphi(x)$. From conjugate subgradient theorem, this is equivalent to (homework)

$$\varphi(x) + \varphi^*(y) = \langle x, y \rangle$$

Since $\varphi^{**} = \varphi$, we further have

$$\varphi^*(y) + \varphi^{**}(x) = \langle x, y \rangle,$$

which combined with conjugate subgradient theorem yields $x = \nabla\varphi^*(y)$. This means

$$x = \nabla\varphi^*(y) = \nabla\varphi^*(\nabla\varphi(x))$$

and hence $\nabla\varphi^* = (\nabla\varphi)^{-1}$
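Claim (5.5) and the Fenchel equality used in its proof can be verified on a concrete pair. The sketch assumes negative entropy $\varphi(x) = \sum_i x_i \log x_i$ on the positive orthant, whose conjugate is $\varphi^*(y) = \sum_i e^{y_i - 1}$; the test point is arbitrary.

```python
import numpy as np

phi = lambda x: np.sum(x * np.log(x))           # negative entropy, x > 0
grad_phi = lambda x: 1.0 + np.log(x)
phi_conj = lambda y: np.sum(np.exp(y - 1.0))    # Fenchel conjugate of phi
grad_phi_conj = lambda y: np.exp(y - 1.0)

x = np.array([0.2, 1.5, 3.0])                   # arbitrary positive point
y = grad_phi(x)

# Fenchel equality phi(x) + phi*(y) = <x, y> when y = grad phi(x).
fenchel_gap = phi(x) + phi_conj(y) - x @ y
# Claim (5.5): grad phi* inverts grad phi.
x_back = grad_phi_conj(y)
print(fenchel_gap, x_back)
```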


Convergence analysis

Convex and Lipschitz problems

$$\text{minimize}_x \;\; f(x) \qquad \text{subject to} \;\; x \in C$$

• f is convex and Lipschitz continuous
    ◦ $\varphi$ is $\rho$-strongly convex w.r.t. $\|\cdot\|$
    ◦ $\|g\|_* \le L_f$ for any subgradient $g \in \partial f(x)$ and any point x, where $\|\cdot\|_*$ is dual norm of $\|\cdot\|$


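Convergence analysis

Using Lemma 5.4, we immediately arrive at

Theorem 5.3
Suppose f is convex and Lipschitz continuous (i.e. $\|g^t\|_* \le L_f$) on C, and suppose $\varphi$ is $\rho$-strongly convex w.r.t. $\|\cdot\|$. Then

$$f^{\text{best},t} - f^{\text{opt}} \le \frac{\sup_{x \in C} D_\varphi(x, x^0) + \frac{L_f^2}{2\rho} \sum_{k=0}^{t} \eta_k^2}{\sum_{k=0}^{t} \eta_k}$$

• If $\eta_t = \frac{\sqrt{2\rho R}}{L_f} \frac{1}{\sqrt{t}}$ with $R := \sup_{x \in C} D_\varphi(x, x^0)$, then

$$f^{\text{best},t} - f^{\text{opt}} \le O\left( \frac{L_f \sqrt{R}}{\sqrt{\rho}} \frac{\log t}{\sqrt{t}} \right)$$

    ◦ one can further remove log t factor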
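The right-hand side of Theorem 5.3 is easy to evaluate for a given stepsize sequence. The sketch below makes two assumptions that go beyond the slide: the decaying rule is indexed as $1/\sqrt{k+1}$ so that it is defined at $k = 0$, and the log factor is removed by a constant stepsize tuned to a known horizon T, which is one standard way to do so and not necessarily the one intended in the lecture. All constants are illustrative.

```python
import numpy as np

def md_bound(R, L_f, rho, etas):
    """Right-hand side of Theorem 5.3 for stepsizes etas = (eta_0, ..., eta_t)."""
    etas = np.asarray(etas, dtype=float)
    return (R + L_f**2 / (2.0 * rho) * np.sum(etas**2)) / np.sum(etas)

R, L_f, rho, T = np.log(1000.0), 1.0, 1.0, 10_000   # illustrative constants
k = np.arange(T + 1)

# Decaying stepsizes; using k + 1 instead of k is an assumption made here.
decaying = np.sqrt(2 * rho * R) / (L_f * np.sqrt(k + 1))
# Constant stepsize tuned to the horizon T (assumed known in advance).
constant = np.full(T + 1, np.sqrt(2 * rho * R / (T + 1)) / L_f)

bound_decaying = md_bound(R, L_f, rho, decaying)
bound_constant = md_bound(R, L_f, rho, constant)
print(bound_decaying, bound_constant)
```

With the constant stepsize the bound evaluates to $L_f \sqrt{2R / (\rho (T+1))}$, with no logarithmic factor, at the price of fixing T beforehand.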
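Example: optimization over probability simplex

Suppose $C = \Delta$ is probability simplex, and pick $x^0 = n^{-1}\mathbf{1}$

(1) set $\varphi(x) = \frac12 \|x\|_2^2$, which is 1-strongly convex w.r.t. $\|\cdot\|_2$. Then

$$\sup_{x \in \Delta} D_\varphi(x, x^0) = \sup_{x \in \Delta} \frac12 \|x - n^{-1}\mathbf{1}\|_2^2 = \sup_{x \in \Delta} \frac12 \left( \|x\|_2^2 - \frac1n \right) \le \frac12$$

Then Theorem 5.3 says

$$f^{\text{best},t} - f^{\text{opt}} \le O\left( L_{f,2} \frac{\log t}{\sqrt{t}} \right)$$

if all subgradients g obey $\|g\|_2 \le L_{f,2}$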


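Example: optimization over probability simplex

Suppose $C = \Delta$ is probability simplex, and pick $x^0 = n^{-1}\mathbf{1}$

(2) set $\varphi(x) = \sum_{i=1}^n x_i \log x_i$ (negative entropy), which is 1-strongly convex w.r.t. $\|\cdot\|_1$. Then

$$\sup_{x \in \Delta} D_\varphi(x, x^0) = \sup_{x \in \Delta} \mathrm{KL}(x \,\|\, x^0) = \sup_{x \in \Delta} \left\{ \sum_{i=1}^n x_i \log x_i - \sum_{i=1}^n x_i \log \frac1n \right\} = \log n + \sup_{x \in \Delta} \sum_{i=1}^n x_i \log x_i \le \log n$$

Then Theorem 5.3 says

$$f^{\text{best},t} - f^{\text{opt}} \le O\left( L_{f,\infty} \sqrt{\log n}\, \frac{\log t}{\sqrt{t}} \right)$$

if all subgradients g obey $\|g\|_\infty \le L_{f,\infty}$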
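The two values of $\sup_{x \in \Delta} D_\varphi(x, x^0)$ can be checked directly. Both divergences are convex in x, so the supremum over the simplex is attained at a vertex; the sketch evaluates them there, with n = 1000 as an arbitrary illustrative dimension.

```python
import numpy as np

n = 1000
x0 = np.full(n, 1.0 / n)          # uniform starting point
vertex = np.zeros(n)
vertex[0] = 1.0                   # an extreme point of the simplex

# Euclidean choice: D(x, x0) = 0.5 * ||x - x0||_2^2.
R_euclid = 0.5 * np.sum((vertex - x0) ** 2)

def kl(x, y):
    """KL(x || y) with the convention 0 * log 0 = 0."""
    mask = x > 0
    return np.sum(x[mask] * np.log(x[mask] / y[mask]))

# Entropy choice: D(x, x0) = KL(x || x0).
R_kl = kl(vertex, x0)
print(R_euclid, R_kl)
```

The Euclidean radius equals $\frac12(1 - 1/n)$ and the KL radius equals $\log n$, so the entropy choice pays only a logarithmic dependence on the dimension.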
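Example: optimization over probability simplex

Comparing these two choices and ignoring log terms, we have

$$\text{Euclidean: } O\left( \frac{L_{f,2}}{\sqrt{t}} \right) \qquad \text{vs.} \qquad \text{KL: } O\left( \frac{L_{f,\infty}}{\sqrt{t}} \right)$$

Since $\|g\|_\infty \le \|g\|_2 \le \sqrt{n}\, \|g\|_\infty$, one has

$$\frac{1}{\sqrt{n}} \le \frac{L_{f,\infty}}{L_{f,2}} \le 1$$

and hence KL version often yields much better performance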
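To see how far apart the two Lipschitz constants can be, the following sketch computes the ratio $\|g\|_\infty / \|g\|_2$ for three hypothetical subgradient vectors: a one-sparse vector, an all-ones vector, and a Gaussian vector. The dimension n = 3000 is only for illustration.

```python
import numpy as np

n = 3000
ratio = lambda g: np.max(np.abs(g)) / np.linalg.norm(g)   # ||g||_inf / ||g||_2

e1 = np.zeros(n)
e1[0] = 1.0
r_sparse = ratio(e1)                                   # upper extreme of the range
r_dense = ratio(np.ones(n))                            # lower extreme of the range
r_gauss = ratio(np.random.default_rng(0).normal(size=n))
print(r_sparse, r_dense, r_gauss)
```

The sparse vector attains the ratio 1 and the dense one attains $1/\sqrt{n}$; dense subgradients are the regime where the KL version gains the most.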


Numerical example: robust regression
taken from Stanford EE364B

$$\text{minimize}_x \;\; f(x) = \sum_{i=1}^m |a_i^\top x - b_i| \qquad \text{subject to} \;\; x \in \Delta = \{ x \in \mathbb{R}^n_+ \mid \mathbf{1}^\top x = 1 \}$$

with $a_i \sim N(0, I_{n \times n})$ and $b_i = \frac{a_{i,1} + a_{i,2}}{2} + N(0, 10^{-2})$, m = 20, n = 3000
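A minimal sketch of this experiment, comparing the projected subgradient method with entropic mirror descent. Several things here are assumptions rather than the lecture's setup: the dimension is reduced to n = 300 so the script runs quickly, the stepsize constants `eta0` are hand-picked rather than derived from the bounds, and the helper names `project_simplex` and `run` are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    k = idx[u - css / idx > 0][-1]          # number of positive entries kept
    return np.maximum(v - css[k - 1] / k, 0.0)

def run(A, b, steps, eta0, entropic):
    """Subgradient iterations on the simplex; entropic=True uses the KL mirror map."""
    n = A.shape[1]
    x = np.full(n, 1.0 / n)
    f = lambda x: np.sum(np.abs(A @ x - b))
    best = f(x)
    for t in range(1, steps + 1):
        g = A.T @ np.sign(A @ x - b)        # a subgradient of f at x
        eta = eta0 / np.sqrt(t)             # assumed decaying stepsize rule
        if entropic:
            x = x * np.exp(-eta * g)        # mirror step, then KL projection
            x /= x.sum()
        else:
            x = project_simplex(x - eta * g)
        best = min(best, f(x))
    return x, best

rng = np.random.default_rng(0)
m, n = 20, 300                              # smaller n than in the lecture
A = rng.normal(size=(m, n))
b = (A[:, 0] + A[:, 1]) / 2 + rng.normal(scale=0.1, size=m)   # noise variance 1e-2
f0 = np.sum(np.abs(A @ np.full(n, 1.0 / n) - b))
x_pg, best_pg = run(A, b, 500, eta0=0.01, entropic=False)
x_md, best_md = run(A, b, 500, eta0=0.05, entropic=True)
print(f0, best_pg, best_md)
```

The printed values are the initial objective and the best objective found by each method; how the two compare depends on the stepsize constants, which is consistent with the sensitivity to stepsize choice noted for this example.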


Numerical example: robust regression

[PICTURE] Robust regression problem with $a_i \sim N(0, I_{n \times n})$ and $b_i = (a_{i,1} + a_{i,2})/2 + \varepsilon_i$ where $\varepsilon_i \sim N(0, 10^{-2})$, m = 20, n = 3000 (taken from Stanford EE364B)

stepsizes chosen according to best bounds (but still sensitive to stepsize choice)
Fundamental inequality for mirror descent

Lemma 5.4

$$\eta_t \big( f(x^t) - f^{\text{opt}} \big) \le D_\varphi(x^*, x^t) - D_\varphi(x^*, x^{t+1}) + \frac{\eta_t^2 L_f^2}{2\rho}$$

• $D_\varphi(x^*, x^t) - D_\varphi(x^*, x^{t+1})$ motivates us to form telescopic sum later
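Lemma 5.4 can be checked along an actual run when the minimizer is known. The sketch assumes a linear objective $f(x) = \langle c, x \rangle$ over the simplex with the entropy mirror map, so that $g^t = c$, $L_f = \|c\|_\infty$, $\rho = 1$, and $x^*$ is the vertex at the smallest entry of c. The vector c and stepsize are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 8, 0.3
c = rng.uniform(-1.0, 1.0, size=n)        # f(x) = <c, x>, so g^t = c for all t
j = int(np.argmin(c))                      # x* is the vertex e_j
f_opt, L_f, rho = c[j], np.max(np.abs(c)), 1.0

x = np.full(n, 1.0 / n)
slacks = []
for t in range(200):
    x_next = x * np.exp(-eta * c)          # entropic mirror descent step
    x_next /= x_next.sum()
    lhs = eta * (c @ x - f_opt)
    # KL(e_j || x) = -log x_j, so the divergence difference is log(x_next_j / x_j).
    rhs = -np.log(x[j]) + np.log(x_next[j]) + (eta * L_f) ** 2 / (2 * rho)
    slacks.append(rhs - lhs)               # Lemma 5.4 says this is >= 0
    x = x_next
print(min(slacks))
```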


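Proof of Lemma 5.4

$$\begin{aligned}
f(x^t) - f(x^*) &\le \langle g^t,\; x^t - x^* \rangle && \text{(property of subgradient)} \\
&= \frac{1}{\eta_t} \langle \nabla\varphi(x^t) - \nabla\varphi(y^{t+1}),\; x^t - x^* \rangle && \text{(MD update rule)} \\
&= \frac{1}{\eta_t} \left\{ D_\varphi(x^*, x^t) + D_\varphi(x^t, y^{t+1}) - D_\varphi(x^*, y^{t+1}) \right\} && \text{(three point lemma)} \\
&\le \frac{1}{\eta_t} \left\{ D_\varphi(x^*, x^t) + D_\varphi(x^t, y^{t+1}) - D_\varphi(x^*, x^{t+1}) - D_\varphi(x^{t+1}, y^{t+1}) \right\} && \text{(Pythagorean)} \\
&= \frac{1}{\eta_t} \left\{ D_\varphi(x^*, x^t) - D_\varphi(x^*, x^{t+1}) \right\} + \frac{1}{\eta_t} \left\{ D_\varphi(x^t, y^{t+1}) - D_\varphi(x^{t+1}, y^{t+1}) \right\}
\end{aligned}$$

so we need to first bound 2nd term of last line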


Proof of Lemma 5.4 (cont.)

We claim that

$$D_\varphi(x^t, y^{t+1}) - D_\varphi(x^{t+1}, y^{t+1}) \le \frac{(\eta_t L_f)^2}{2\rho} \qquad (5.6)$$

This gives

$$\eta_t \big( f(x^t) - f(x^*) \big) \le D_\varphi(x^*, x^t) - D_\varphi(x^*, x^{t+1}) + \frac{(\eta_t L_f)^2}{2\rho}$$

as claimed


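Proof of Lemma 5.4 (cont.)

Finally, we justify (5.6):

$$\begin{aligned}
& D_\varphi(x^t, y^{t+1}) - D_\varphi(x^{t+1}, y^{t+1}) \\
&\quad = \varphi(x^t) - \varphi(x^{t+1}) - \langle \nabla\varphi(y^{t+1}),\; x^t - x^{t+1} \rangle \\
&\quad \le \langle \nabla\varphi(x^t),\; x^t - x^{t+1} \rangle - \frac{\rho}{2} \|x^t - x^{t+1}\|^2 - \langle \nabla\varphi(y^{t+1}),\; x^t - x^{t+1} \rangle && \text{(strong convexity of } \varphi \text{)} \\
&\quad = \langle \nabla\varphi(x^t) - \nabla\varphi(y^{t+1}),\; x^t - x^{t+1} \rangle - \frac{\rho}{2} \|x^t - x^{t+1}\|^2 \\
&\quad = \eta_t \langle g^t,\; x^t - x^{t+1} \rangle - \frac{\rho}{2} \|x^t - x^{t+1}\|^2 && \text{(MD update rule)} \\
&\quad \le \eta_t L_f \|x^t - x^{t+1}\| - \frac{\rho}{2} \|x^t - x^{t+1}\|^2 && \text{(Cauchy-Schwarz)} \\
&\quad \le \frac{(\eta_t L_f)^2}{2\rho} && \text{(optimize quadratic function in } \|x^t - x^{t+1}\| \text{)}
\end{aligned}$$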
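The last step of the chain can be spelled out as a one-variable maximization. Writing $u = \|x^t - x^{t+1}\| \ge 0$ (the symbol $u$ is introduced here only for illustration), the bound follows from:

```latex
\max_{u \ge 0} \left\{ \eta_t L_f\, u - \frac{\rho}{2} u^2 \right\}
\quad\Longrightarrow\quad
\eta_t L_f - \rho\, u^\star = 0
\quad\Longrightarrow\quad
u^\star = \frac{\eta_t L_f}{\rho},
\qquad
\eta_t L_f\, u^\star - \frac{\rho}{2} (u^\star)^2
= \frac{(\eta_t L_f)^2}{\rho} - \frac{(\eta_t L_f)^2}{2\rho}
= \frac{(\eta_t L_f)^2}{2\rho}.
```

Since the quadratic is concave in $u$, the stationary point is the maximizer, so the expression is at most $(\eta_t L_f)^2 / (2\rho)$ for every value of $\|x^t - x^{t+1}\|$.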


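Proof of Theorem 5.3

From Lemma 5.4, one has

$$\eta_k \big( f(x^k) - f^{\text{opt}} \big) \le D_\varphi(x^*, x^k) - D_\varphi(x^*, x^{k+1}) + \frac{\eta_k^2 L_f^2}{2\rho}$$

Taking this inequality for $k = 0, \cdots, t$ and summing them up give

$$\begin{aligned}
\sum_{k=0}^{t} \eta_k \big( f(x^k) - f^{\text{opt}} \big) &\le D_\varphi(x^*, x^0) - D_\varphi(x^*, x^{t+1}) + \frac{L_f^2 \sum_{k=0}^{t} \eta_k^2}{2\rho} \\
&\le \sup_{x \in C} D_\varphi(x, x^0) + \frac{L_f^2 \sum_{k=0}^{t} \eta_k^2}{2\rho}
\end{aligned}$$

This together with $f^{\text{best},t} - f^{\text{opt}} \le \frac{\sum_{k=0}^{t} \eta_k ( f(x^k) - f^{\text{opt}} )}{\sum_{k=0}^{t} \eta_k}$ concludes proof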
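The final inequality used in this proof is a minimum-versus-weighted-average argument. The derivation below assumes the usual definition $f^{\text{best},t} := \min_{0 \le k \le t} f(x^k)$, which is not restated in this part of the notes:

```latex
f^{\text{best},t} - f^{\text{opt}}
= \min_{0 \le k \le t} \big( f(x^k) - f^{\text{opt}} \big)
= \frac{\sum_{k=0}^{t} \eta_k \min_{0 \le j \le t} \big( f(x^j) - f^{\text{opt}} \big)}{\sum_{k=0}^{t} \eta_k}
\le \frac{\sum_{k=0}^{t} \eta_k \big( f(x^k) - f^{\text{opt}} \big)}{\sum_{k=0}^{t} \eta_k},
```

using only that the stepsizes $\eta_k$ are positive, so that a minimum never exceeds a weighted average.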
Reference

[1] "Problem complexity and method efficiency in optimization," A. Nemirovski, D. Yudin, Wiley, 1983.
[2] "Mirror descent and nonlinear projected subgradient methods for convex optimization," A. Beck, M. Teboulle, Operations Research Letters, 31(3), 2003.
[3] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[4] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[5] "Mathematical optimization, MATH301 lecture notes," E. Candes, Stanford.
[6] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.
[7] "Matrix nearness problems with Bregman divergences," I. Dhillon, J. Tropp, SIAM Journal on Matrix Analysis and Applications, 29(4), 2007.