
Lower Bounds on Rate of Convergence of Cutting Plane Methods
Xinhua Zhang
Dept. of Computing Sciences
University of Alberta
xinhua2@ualberta.ca
Ankan Saha
Dept. of Computer Science
University of Chicago
ankans@cs.uchicago.edu
S.V. N. Vishwanathan
Dept. of Statistics and
Dept. of Computer Science
Purdue University
vishy@stat.purdue.edu
Abstract
In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
1 Introduction
There has been an explosion of interest in machine learning over the past decade, much of which has been fueled by the phenomenal success of binary Support Vector Machines (SVMs). Driven by numerous applications, there has recently been increasing interest in support vector learning with linear models. At the heart of SVMs is the following regularized risk minimization problem:

$$\min_{\mathbf{w}} J(\mathbf{w}) := \underbrace{\frac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularizer}} + \underbrace{R_{\mathrm{emp}}(\mathbf{w})}_{\text{empirical risk}} \quad\text{with}\quad R_{\mathrm{emp}}(\mathbf{w}) := \frac{1}{n}\sum_{i=1}^{n}\max\big(0,\, 1 - y_i\langle\mathbf{w},\mathbf{x}_i\rangle\big). \tag{1}$$
Here we assume access to a training set of n labeled examples {(x_i, y_i)}_{i=1}^n where x_i ∈ ℝ^d and y_i ∈ {−1, +1}, and use the square Euclidean norm ‖w‖² = Σ_i w_i² as the regularizer. The parameter λ controls the trade-off between the empirical risk and the regularizer.
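As a concrete reference point (our own snippet, not the authors' code), the objective (1) is straightforward to evaluate:

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """J(w) = (lam/2) ||w||^2 + (1/n) sum_i max(0, 1 - y_i <w, x_i>),
    the objective in (1); X is n-by-d with the x_i as rows."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return 0.5 * lam * (w @ w) + hinge
```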
There has been significant research devoted to developing specialized optimizers which minimize J(w) efficiently. In an award winning paper, Joachims [1] presented a cutting plane method (CPM)¹, SVM-Perf, which was shown to converge to an ε accurate solution of (1) in O(1/ε²) iterations, with each iteration requiring O(nd) effort. This was improved by Teo et al. [2] who showed that their Bundle Method for Regularized Risk Minimization (BMRM) (which encompasses SVM-Perf as a special case) converges to an ε accurate solution in O(nd/ε) time.
While online learning methods are becoming increasingly popular for solving (1), a key advantage of CPM such as SVM-Perf and BMRM is their ability to directly optimize nonlinear multivariate performance measures such as F₁-score, ordinal regression loss, and ROCArea, which are widely used in some application areas. In this case R_emp does not decompose into a sum of losses over individual data points like in (1), and hence one has to employ batch algorithms. Letting Δ(ȳ, y) denote the multivariate discrepancy between the correct labels y := (y_1, …, y_n)ᵀ and a candidate labeling ȳ (to be concretized later), the R_emp for the multivariate measure is formulated by [3] as
¹ In this paper we use the term cutting plane methods to denote specialized solvers employed in machine learning. While clearly related, they must not be confused with cutting plane methods used in optimization.
$$R_{\mathrm{emp}}(\mathbf{w}) = \max_{\bar{\mathbf{y}}\in\{-1,1\}^n}\Big[\Delta(\bar{\mathbf{y}},\mathbf{y}) + \frac1n\sum_{i=1}^n\langle\mathbf{w},\mathbf{x}_i\rangle\,(\bar y_i - y_i)\Big]. \tag{2}$$
In another award winning paper by Joachims [3], the regularized risk minimization problems corresponding to these measures are optimized by using a CPM.
Given the widespread use of CPM in machine learning, it is important to understand their convergence guarantees in terms of the upper and lower bounds on the number of iterations needed to converge to an ε accurate solution. The tightest, O(1/ε), upper bounds on the convergence speed of CPM are due to Teo et al. [2], who analyzed a restricted version of BMRM which only optimizes over one dual variable per iteration. However, on practical problems the observed rate of convergence is significantly faster than predicted by theory. Therefore, it had been conjectured that the upper bounds might be further tightened via a more refined analysis. In this paper we construct counter examples for both decomposable R_emp like in equation (1) and non-decomposable R_emp like in equation (2), on which CPM require Ω(1/ε) iterations to converge, thus disproving this conjecture².
We will work with BMRM as our prototypical CPM. As Teo et al. [2] point out, BMRM includes many other CPM such as SVM-Perf as special cases.
Our results lead to the following natural question: Do the lower bounds hold because regularized risk minimization problems are fundamentally hard, or is it an inherent limitation of CPM? In other words, to solve problems such as (1), does there exist a solver which requires less than O(nd/ε) effort (better in n, d and ε)? We provide partial answers. To understand our contribution one needs to understand the two standard assumptions that are made when proving convergence rates:

A1: The data points x_i lie inside an L₂ (Euclidean) ball of radius R, that is, ‖x_i‖ ≤ R.

A2: The subgradient of R_emp is bounded, i.e., at any point w, there exists a subgradient g of R_emp such that ‖g‖ ≤ G < ∞.

Clearly assumption A1 is more restrictive than A2. By adapting a result due to [6] we show that one can devise an O(nd/√ε) algorithm for the case when assumption A1 holds. Finding a fast optimizer under assumption A2 remains an open problem.
Notation: Lower bold case letters (e.g., w, μ) denote vectors, w_i denotes the i-th component of w, 0 refers to the vector with all zero components, e_i is the i-th coordinate vector (all 0s except 1 at the i-th coordinate) and Δ_k refers to the k dimensional simplex. Unless specified otherwise, ⟨·, ·⟩ denotes the Euclidean dot product ⟨x, w⟩ = Σ_i x_i w_i, and ‖·‖ refers to the Euclidean norm ‖w‖ := (⟨w, w⟩)^{1/2}. We denote ℝ̄ := ℝ ∪ {∞}, and [t] := {1, …, t}.
Our paper is structured as follows. We briefly review BMRM in Section 2. Two types of lower bounds are subsequently defined in Section 3, and Section 4 contains descriptions of various counter examples that we construct. In Section 5 we describe an algorithm which provably converges to an ε accurate solution of (1) in O(1/√ε) iterations under assumption A1. The paper concludes with a discussion and outlook in Section 6. Technical proofs and a ready reckoner of the convex analysis concepts used in the paper can be found in the supplementary material.
2 BMRM
At every iteration, BMRM replaces R_emp by a piecewise linear lower bound R_k^cp and optimizes [2]

$$\min_{\mathbf{w}} J_k(\mathbf{w}) := \frac{\lambda}{2}\|\mathbf{w}\|^2 + R_k^{\mathrm{cp}}(\mathbf{w}), \quad\text{where}\quad R_k^{\mathrm{cp}}(\mathbf{w}) := \max_{1\le i\le k}\,\langle\mathbf{w},\mathbf{a}_i\rangle + b_i, \tag{3}$$

to obtain the next iterate w_k. Here a_i ∈ ∂R_emp(w_{i−1}) denotes an arbitrary subgradient of R_emp at w_{i−1} and b_i = R_emp(w_{i−1}) − ⟨w_{i−1}, a_i⟩. The piecewise linear lower bound is successively tightened until the gap

$$\epsilon_k := \min_{0\le t\le k} J(\mathbf{w}_t) - J_k(\mathbf{w}_k) \tag{4}$$

falls below a predefined tolerance ε.
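As a concrete illustration of this loop (our sketch, not the authors' code; `risk_oracle` and `inner_solver` are assumed helpers), the outer iteration collects one cutting plane per step and stops when the gap (4) drops below ε:

```python
import numpy as np

def bmrm(risk_oracle, inner_solver, lam, dim, eps, max_iter=1000):
    """Schematic BMRM outer loop for (3)-(4).
    risk_oracle(w) returns (R_emp(w), a) with a in the subdifferential;
    inner_solver(A, b, lam) returns argmin_w J_k(w), e.g. Algorithm 1 or 2."""
    w = np.zeros(dim)
    A, b = [], []                    # slopes a_i and offsets b_i of the planes
    J_best = np.inf
    for _ in range(max_iter):
        R, a = risk_oracle(w)
        J_best = min(J_best, 0.5 * lam * (w @ w) + R)   # min_t J(w_t)
        A.append(a)
        b.append(R - w @ a)          # b_i = R_emp(w_{i-1}) - <w_{i-1}, a_i>
        w = inner_solver(np.array(A), np.array(b), lam)
        J_k = 0.5 * lam * (w @ w) + np.max(np.array(A) @ w + np.array(b))
        if J_best - J_k <= eps:      # the gap (4)
            break
    return w
```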
Since J_k in (3) is a convex objective function, one can compute its dual. Instead of minimizing J_k with respect to w one can equivalently maximize the dual [2]:

$$D_k(\boldsymbol\alpha) = -\frac{1}{2\lambda}\|A_k\boldsymbol\alpha\|^2 + \langle\mathbf{b}_k,\boldsymbol\alpha\rangle, \tag{5}$$
² Because of the specialized nature of these solvers, lower bounds for general convex optimizers such as those studied by Nesterov [4] and Nemirovski and Yudin [5] do not apply.
Algorithm 1: qp-bmrm: solving the inner loop of BMRM exactly via full QP.
  Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
  1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)ᵀ.
  2: α_k ← argmax_{α∈Δ_k} { −(1/2λ)‖A_kα‖² + ⟨α, b_k⟩ }.
  3: return w_k = −(1/λ) A_k α_k.

Algorithm 2: ls-bmrm: solving the inner loop of BMRM approximately via line search.
  Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
  1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)ᵀ.
  2: Set α(η) := (η α_{k−1}ᵀ, 1 − η)ᵀ.
  3: η_k ← argmax_{η∈[0,1]} { −(1/2λ)‖A_kα(η)‖² + ⟨α(η), b_k⟩ }.
  4: α_k ← (η_k α_{k−1}ᵀ, 1 − η_k)ᵀ.
  5: return w_k = −(1/λ) A_k α_k.
and set α_k = argmax_{α∈Δ_k} D_k(α). Note that A_k and b_k in (5) are defined in Algorithm 1. Since maximizing D_k(α) is a quadratic programming (QP) problem, we call this algorithm qp-bmrm. Pseudo-code can be found in Algorithm 1.
Note that at iteration k the dual D_k(α) is a QP with k variables. As the number of iterations increases the size of the QP also increases. In order to avoid the growing cost of the dual optimization at each iteration, [2] proposed using a one-dimensional line search to calculate an approximate maximizer α_k on the line segment {(η α_{k−1}ᵀ, 1 − η)ᵀ : η ∈ [0, 1]}, and we call this variant ls-bmrm. Pseudo-code can be found in Algorithm 2. We refer the reader to [2] for details.
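Since the line search in Algorithm 2 maximizes a concave quadratic in η over [0, 1], it admits a closed form. The following sketch of one ls-bmrm inner step is our own illustration of that observation (A holds the subgradients as columns; it assumes k ≥ 2, the k = 1 dual being simply α = [1]):

```python
import numpy as np

def ls_bmrm_step(A, b, alpha_prev, lam):
    """One inner step of ls-bmrm (Algorithm 2): maximize
    -1/(2*lam) ||A alpha(eta)||^2 + <alpha(eta), b> over eta in [0, 1],
    where alpha(eta) = (eta * alpha_prev, 1 - eta)."""
    p = A[:, :-1] @ alpha_prev          # A_{k-1} alpha_{k-1}
    q = A[:, -1]                        # newest subgradient a_k
    r = p - q
    beta1, beta0 = b[:-1] @ alpha_prev, b[-1]
    if r @ r == 0:                      # objective is linear in eta
        eta = 1.0 if beta1 >= beta0 else 0.0
    else:                               # clipped stationary point
        eta = float(np.clip((lam * (beta1 - beta0) - q @ r) / (r @ r), 0.0, 1.0))
    alpha = np.append(eta * alpha_prev, 1.0 - eta)
    return -(A @ alpha) / lam, alpha    # w_k, alpha_k
```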
Even though qp-bmrm solves a more expensive optimization problem D_k(α) per iteration, Teo et al. [2] could only show that both variants of BMRM converge at O(1/ε) rates:

Theorem 1 ([2]) Suppose assumption A2 holds. Then for any ε < 4G²/λ, both ls-bmrm and qp-bmrm converge to an ε accurate solution of (1) as measured by (4) after at most the following number of steps:

$$\log_2\frac{\lambda J(\mathbf{0})}{G^2} + \frac{8G^2}{\lambda\epsilon} - 1.$$
Generality of BMRM Thanks to the formulation in (3), which only uses R_emp, BMRM is applicable to a wide variety of R_emp. For example, when used to train binary SVMs with R_emp specified by (1), it yields exactly the SVM-Perf algorithm [1]. When applied to optimize the multivariate score, e.g. the F₁-score with R_emp specified by (2), it immediately leads to the optimizer given by [3].
3 Upper and Lower Bounds
Since most rates of convergence discussed in the machine learning community are upper bounds, it is important to rigorously define the meaning of a lower bound with respect to ε, and to study its relationship with the upper bounds. At this juncture it is also important to clarify an important technical point. Instead of minimizing the objective function J(w) defined in (1), if we minimize a scaled version cJ(w) this scales the approximation gap (4) by c. Assumptions such as A1 and A2 fix this degree of freedom by bounding the scale of the objective function.
Given a function f ∈ 𝓕 and an optimization algorithm A, suppose {w_k} are the iterates produced by the algorithm A when minimizing f. Define T(ε; f, A) as the first step index k when w_k becomes an ε accurate solution³:

$$T(\epsilon; f, A) = \min\big\{k : f(\mathbf{w}_k) - \min_{\mathbf{w}} f(\mathbf{w}) \le \epsilon\big\}. \tag{6}$$
Upper and lower bounds are both properties of a pair 𝓕 and A. A function g(ε) is called an upper bound of (𝓕, A) if for all functions f ∈ 𝓕 and all ε > 0, it takes at most order g(ε) steps for A to reduce the gap to less than ε, i.e.,

$$\text{(UB)} \quad \forall\,\epsilon > 0,\; \forall f\in\mathcal{F},\quad T(\epsilon; f, A) \le g(\epsilon). \tag{7}$$
³ The initial point also matters, as in the best case we can just start from the optimal solution. Thus the quantity of interest is actually T(ε; f, A) := max_{w₀} min{k : f(w_k) − min_w f(w) ≤ ε, starting point being w₀}. However, without loss of generality we assume some pre-specified way of initialization.
                 |        Assuming A1          |        Assuming A2
   Algorithm     |  UB       SLB       WLB     |  UB       SLB       WLB
   ls-bmrm       | O(1/ε)    Ω(1/ε)    Ω(1/ε)  | O(1/ε)    Ω(1/ε)    Ω(1/ε)
   qp-bmrm       | O(1/ε)    open      open    | O(1/ε)    open      Ω(1/ε)
   Nesterov      | O(1/√ε)   Ω(1/√ε)   Ω(1/√ε) | n/a       n/a       n/a

Table 1: Summary of the known upper bounds and our lower bounds. Note: A1 ⇒ A2, but not vice versa; SLB ⇒ WLB, but not vice versa. A UB is tight if it matches the WLB.
On the other hand, lower bounds can be defined in two different ways depending on how the above two universal quantifiers are flipped to existential quantifiers.

Strong lower bounds (SLB): h(ε) is called a SLB of (𝓕, A) if there exists a function f̃ ∈ 𝓕 such that for all ε > 0 it takes at least h(ε) steps for A to find an ε accurate solution of f̃:

$$\text{(SLB)} \quad \exists\,\tilde f\in\mathcal{F},\;\text{s.t.}\;\forall\,\epsilon>0,\quad T(\epsilon; \tilde f, A) \ge h(\epsilon). \tag{8}$$

Weak lower bound (WLB): h(ε) is called a WLB of (𝓕, A) if for any ε > 0, there exists a function f_ε ∈ 𝓕 depending on ε, such that it takes at least h(ε) steps for A to find an ε accurate solution of f_ε:

$$\text{(WLB)} \quad \forall\,\epsilon>0,\;\exists\,f_\epsilon\in\mathcal{F},\;\text{s.t.}\quad T(\epsilon; f_\epsilon, A) \ge h(\epsilon). \tag{9}$$

Clearly, the existence of a SLB implies a WLB. However, it is usually much harder to establish a SLB than a WLB. Fortunately, WLBs are sufficient to refute upper bounds or to establish their tightness.
The size of the function class 𝓕 affects the upper and lower bounds in opposite ways. Suppose 𝓕′ ⊆ 𝓕. Proving upper (resp. lower) bounds on (𝓕′, A) is usually easier (resp. harder) than proving upper (resp. lower) bounds for (𝓕, A).
4 Constructing Lower Bounds
Letting the minimizer of J(w) be w*, we are interested in bounding the primal gap of the iterates w_k: J(w_k) − J(w*). Datasets will be constructed explicitly whose resulting objective J(w) will be shown to attain the lower bounds of the algorithms. The R_emp for both the hinge loss in (1) and the F₁-score in (2) will be covered, and our results are summarized in Table 1. Note that as assumption A1 implies A2 and SLB implies WLB, some entries of the table imply others.
4.1 Strong Lower Bounds for Solving Linear SVMs using ls-bmrm
We first prove the Ω(1/ε) lower bound for ls-bmrm on SVM problems under assumption A1. Consider a one dimensional training set with four examples: (x₁, y₁) = (−1, −1), (x₂, y₂) = (−1/2, −1), (x₃, y₃) = (1/2, 1), (x₄, y₄) = (1, 1). Setting λ = 1/16, the regularized risk (1) can be written as (in terms of the scalar w):

$$\min_{w\in\mathbb{R}} J(w) = \frac{1}{32}w^2 + \frac12\Big[1-\frac{w}{2}\Big]_+ + \frac12\big[1-w\big]_+. \tag{10}$$
The minimizer of J(w) is w* = 2, which can be verified by the fact that 0 is in the subdifferential of J at w*: 0 ∈ ∂J(2) = { 2/16 − (1/2)(1/2)ρ : ρ ∈ [0, 1] }. So J(w*) = 1/8. Choosing w₀ = 0, we have
Theorem 2 lim_{k→∞} k (J(w_k) − J(w*)) = 1/4, i.e. J(w_k) converges to J(w*) at a 1/k rate.
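A quick numerical check of this rate (our own script, not part of the paper) runs ls-bmrm on (10) from w₀ = 0 and prints k(J(w_k) − 1/8), which approaches 1/4:

```python
import numpy as np

def J(w):                      # the objective (10)
    return w * w / 32 + 0.5 * max(0.0, 1 - w / 2) + 0.5 * max(0.0, 1 - w)

def subgrad_Remp(w):           # a subgradient of the empirical risk part
    return (-0.25 if w < 2 else 0.0) + (-0.5 if w < 1 else 0.0)

lam, w, alpha, A, b = 1 / 16, 0.0, None, [], []
for k in range(1, 5001):
    a = subgrad_Remp(w)
    A.append(a)
    b.append(J(w) - lam / 2 * w * w - a * w)     # b_k = R_emp(w) - a w
    if alpha is None:
        alpha = np.array([1.0])                  # k = 1: trivial simplex
    else:                                        # closed-form line search
        p, q = np.dot(A[:-1], alpha), A[-1]
        r, b1, b0 = p - q, np.dot(b[:-1], alpha), b[-1]
        eta = 1.0 if r == 0 else np.clip((lam * (b1 - b0) - q * r) / (r * r), 0, 1)
        alpha = np.append(eta * alpha, 1 - eta)
    w = -np.dot(A, alpha) / lam
print(5000 * (J(w) - 1 / 8))   # tends to 1/4, matching Theorem 2
```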
The proof relies on two lemmata. The rst shows that the iterates generated by ls-bmrm on J(w)
satisfy the following recursive relations.
Lemma 3 For k ≥ 1, the following recursive relations hold true:

$$w_{2k+1} = 2 + \frac{8\alpha_{2k-1,1}\,(w_{2k-1} - 4\alpha_{2k-1,1})}{w_{2k-1}\,(w_{2k-1} + 4\alpha_{2k-1,1})} > 2, \quad\text{and}\quad w_{2k} = 2 - \frac{8\alpha_{2k-1,1}}{w_{2k-1}} \in (1, 2). \tag{11}$$

$$\alpha_{2k+1,1} = \frac{w_{2k-1}^2 + 16\alpha_{2k-1,1}^2}{(w_{2k-1} + 4\alpha_{2k-1,1})^2}\,\alpha_{2k-1,1}, \quad\text{where } \alpha_{2k+1,1} \text{ is the first coordinate of } \boldsymbol\alpha_{2k+1}. \tag{12}$$
The proof is lengthy and is relegated to Appendix B. These recursive relations allow us to derive the convergence rate of α_{2k−1,1} and w_k (see proof in Appendix C):

Lemma 4 lim_{k→∞} k α_{2k−1,1} = 1/4. Combining with (11), we get lim_{k→∞} k |2 − w_k| = 2.
Now that w_k approaches 2 at the rate of O(1/k), it is finally straightforward to translate this into the rate at which J(w_k) approaches J(w*). See the proof of Theorem 2 in Appendix D.
4.2 Weak Lower Bounds for Solving Linear SVMs using qp-bmrm
Theorem 1 gives an upper bound on the convergence rate of qp-bmrm, assuming that R_emp satisfies assumption A2. In this section we further demonstrate that this O(1/ε) rate is also a WLB (hence tight) even when the R_emp is specialized to SVM objectives satisfying A2.
Given ε > 0, define n = ⌈1/ε⌉ and construct a dataset {(x_i, y_i)}_{i=1}^n with y_i = (−1)^i and x_i = (−1)^i (n e_{i+1} + √n e₁) ∈ ℝ^{n+1}. Then the corresponding objective function (1) is (with λ = 1)

$$J(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + R_{\mathrm{emp}}(\mathbf{w}), \quad\text{where}\quad R_{\mathrm{emp}}(\mathbf{w}) = \frac1n\sum_{i=1}^n\big[1 - y_i\langle\mathbf{w},\mathbf{x}_i\rangle\big]_+ = \frac1n\sum_{i=1}^n\big[1 - \sqrt n\,w_1 - n\,w_{i+1}\big]_+. \tag{13}$$
It is easy to see that the minimizer is w* = (1/2)(1/√n, 1/n, 1/n, …, 1/n)ᵀ and J(w*) = 1/(4n). In fact, simply check that y_i⟨w*, x_i⟩ = 1, so

$$\partial J(\mathbf{w}^*) = \Big\{\mathbf{w}^* - \frac1n\sum_{i=1}^n\beta_i\,y_i\mathbf{x}_i : \beta_i\in[0,1]\Big\},$$

and setting all β_i = 1/(2n) yields the subgradient 0. Our key result is the following theorem.
Theorem 5 Let w₀ = (1/√n, 0, 0, …)ᵀ. Suppose running qp-bmrm on the objective function (13) produces iterates w₁, …, w_k, …. Then it takes qp-bmrm at least ⌈2/(3ε)⌉ steps to find an ε accurate solution. Formally,

$$\min_{i\in[k]} J(\mathbf{w}_i) - J(\mathbf{w}^*) = \frac{1}{2k} + \frac{1}{4n} \;\text{ for all } k\in[n], \quad\text{hence}\quad \min_{i\in[k]} J(\mathbf{w}_i) - J(\mathbf{w}^*) > \epsilon \;\text{ for all } k < \frac{2}{3\epsilon}.$$
Indeed, after taking n steps, w_n will cut a subgradient a_{n+1} = 0 and b_{n+1} = 0, and then the minimizer of J_{n+1}(w) gives exactly w*.
Proof Since R_emp(w₀) = 0 and ∂R_emp(w₀) = {−(1/n)Σ_{i=1}^n β_i y_i x_i : β_i ∈ [0, 1]}, we can choose

$$\mathbf{a}_1 = -\frac1n y_1\mathbf{x}_1 = \Big(-\frac{1}{\sqrt n}, -1, 0, \ldots\Big)^\top, \qquad b_1 = R_{\mathrm{emp}}(\mathbf{w}_0) - \langle\mathbf{a}_1,\mathbf{w}_0\rangle = 0 + \frac1n = \frac1n,$$

and

$$\mathbf{w}_1 = \operatorname*{argmin}_{\mathbf{w}}\Big\{\frac12\|\mathbf{w}\|^2 - \frac{1}{\sqrt n}w_1 - w_2 + \frac1n\Big\} = \Big(\frac{1}{\sqrt n}, 1, 0, \ldots\Big)^\top.$$
In general, we claim that the k-th iterate w_k produced by qp-bmrm is given by

$$\mathbf{w}_k = \Big(\frac{1}{\sqrt n}, \underbrace{\frac1k, \ldots, \frac1k}_{k\text{ copies}}, 0, \ldots\Big)^\top.$$
We prove this claim by induction on k. Assume the claim holds true for steps 1, …, k; then it is easy to check that R_emp(w_k) = 0 and ∂R_emp(w_k) = {−(1/n)Σ_{i=k+1}^n β_i y_i x_i : β_i ∈ [0, 1]}. Thus we can again choose

$$\mathbf{a}_{k+1} = -\frac1n y_{k+1}\mathbf{x}_{k+1}, \qquad b_{k+1} = R_{\mathrm{emp}}(\mathbf{w}_k) - \langle\mathbf{a}_{k+1},\mathbf{w}_k\rangle = \frac1n, \quad\text{so}$$
$$\mathbf{w}_{k+1} = \operatorname*{argmin}_{\mathbf{w}}\Big\{\frac12\|\mathbf{w}\|^2 + \max_{1\le i\le k+1}\langle\mathbf{a}_i,\mathbf{w}\rangle + b_i\Big\} = \Big(\frac{1}{\sqrt n}, \underbrace{\frac{1}{k+1},\ldots,\frac{1}{k+1}}_{k+1\text{ copies}}, 0, \ldots\Big)^\top,$$
which can be verified by checking that

$$\partial J_{k+1}(\mathbf{w}_{k+1}) = \Big\{\mathbf{w}_{k+1} + \sum_{i\in[k+1]}\alpha_i\mathbf{a}_i : \boldsymbol\alpha\in\Delta_{k+1}\Big\} \ni \mathbf{0}.$$

All that remains is to observe that J(w_k) = 1/(2k) + 1/(2n) while J(w*) = 1/(4n), from which it follows that J(w_k) − J(w*) = 1/(2k) + 1/(4n) as claimed.
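The claimed iterates and gap can be checked numerically; the small script below (ours, purely illustrative) builds the dataset of (13) and verifies J(w_k) − J(w*) = 1/(2k) + 1/(4n):

```python
import numpy as np

n = 20                                    # stand-in for ceil(1/eps)
I = np.eye(n + 1)
y = np.array([(-1.0) ** i for i in range(1, n + 1)])
X = np.array([(-1.0) ** i * (n * I[i] + np.sqrt(n) * I[0])
              for i in range(1, n + 1)])  # rows are the x_i of (13)

def J(w):
    return 0.5 * (w @ w) + np.maximum(0.0, 1 - y * (X @ w)).mean()

J_star = 1 / (4 * n)
for k in range(1, n + 1):
    w_k = np.zeros(n + 1)
    w_k[0] = 1 / np.sqrt(n)
    w_k[1:k + 1] = 1 / k                  # the claimed qp-bmrm iterate
    assert np.isclose(J(w_k) - J_star, 1 / (2 * k) + 1 / (4 * n))
```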
As an aside, the subgradient of the R_emp in (13) does have Euclidean norm √(2n) at w = 0. However, in the above run of qp-bmrm, ∂R_emp(w₀), …, ∂R_emp(w_n) always contains a subgradient whose norm is bounded by a constant independent of n (the chosen a_k have ‖a_k‖ = √(1 + 1/n)). So if we restrict the feasible region to {n^{−1/2}} × [0, ∞)^n, then J(w) does satisfy assumption A2 and the optimal solution does not change. This is essentially a local satisfaction of A2. In fact, having a bounded subgradient of R_emp at all w_k is sufficient for qp-bmrm to converge at the rate in Theorem 1.
However, when we assume A1, which is more restrictive than A2, it remains an open question whether the O(1/ε) rates are optimal for qp-bmrm on SVM objectives. Also left open is the SLB for qp-bmrm on SVMs.
4.3 Weak Lower Bounds for Optimizing F₁-score using qp-bmrm
The F₁-score is defined by using the contingency table: F₁(ȳ, y) := 2a/(2a + b + c), where a, b, c, d count the entries of the table below (a = true positives, b = false positives, c = false negatives, d = true negatives):

               y = 1    y = −1
     ȳ = 1       a        b
     ȳ = −1      c        d
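Computed directly from a pair of labelings, the score reads as follows (our helper, with the table entries as named above):

```python
def f1_score(ybar, y):
    """F_1(ybar, y) = 2a / (2a + b + c): a = true positives,
    b = false positives, c = false negatives."""
    a = sum(p == 1 and t == 1 for p, t in zip(ybar, y))
    b = sum(p == 1 and t == -1 for p, t in zip(ybar, y))
    c = sum(p == -1 and t == 1 for p, t in zip(ybar, y))
    return 2 * a / (2 * a + b + c) if 2 * a + b + c else 0.0

# e.g. f1_score([1, -1, 1], [1, 1, 1]) == 0.8
```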
Given ε > 0, define n = ⌈1/ε⌉ + 1 and construct a dataset {(x_i, y_i)}_{i=1}^n as follows: x_i = (n/(2√3)) e₁ − (n/2) e_{i+1} ∈ ℝ^{n+1} with y_i = −1 for all i ∈ [n−1], and x_n = −(√3 n/2) e₁ + (n/2) e_{n+1} ∈ ℝ^{n+1} with y_n = +1. So there is only one positive training example. Then the corresponding objective function is

$$J(\mathbf{w}) = \frac12\|\mathbf{w}\|^2 + \max_{\bar{\mathbf{y}}}\Big[1 - F_1(\bar{\mathbf{y}},\mathbf{y}) + \frac1n\sum_{i=1}^n y_i\langle\mathbf{w},\mathbf{x}_i\rangle\,(y_i\bar y_i - 1)\Big]. \tag{14}$$
Theorem 6 Let w₀ = −(1/√3) e₁. Then qp-bmrm takes at least ⌈1/(3ε)⌉ steps to find an ε accurate solution:

$$J(\mathbf{w}_k) - \min_{\mathbf{w}} J(\mathbf{w}) \ge \frac12\Big(\frac1k - \frac{1}{n-1}\Big) \;\;\forall k\in[n-1], \quad\text{hence}\quad \min_{i\in[k]} J(\mathbf{w}_i) - \min_{\mathbf{w}} J(\mathbf{w}) > \epsilon \;\;\forall k < \frac{1}{3\epsilon}.$$
Proof A rigorous proof can be found in Appendix E; we provide a sketch here. The crux is to show

$$\mathbf{w}_k = \Big(-\frac{1}{\sqrt3}, \underbrace{\frac1k,\ldots,\frac1k}_{k\text{ copies}}, 0, \ldots\Big)^\top \quad \forall k\in[n-1]. \tag{15}$$
We prove (15) by induction. Assume it holds for steps 1, …, k. Then at step k + 1 we have

$$\frac1n y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle = \begin{cases} \frac16 + \frac{1}{2k} & \text{if } i\in[k] \\ \frac16 & \text{if } k+1 \le i \le n-1 \\ \frac12 & \text{if } i = n \end{cases}. \tag{16}$$
For convenience, define the term in the max in (14) as

$$\Delta_k(\bar{\mathbf{y}}) := 1 - F_1(\bar{\mathbf{y}},\mathbf{y}) + \frac1n\sum_{i=1}^n y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle\,(y_i\bar y_i - 1).$$
Then it is not hard to see that the following assignments of ȳ (among others) maximize Δ_k: a) the correct labeling, b) only misclassify the positive training example x_n (i.e., ȳ_n = −1), c) only misclassify one negative training example in x_{k+1}, …, x_{n−1} into positive. And Δ_k equals 0 at all these assignments. For a proof, consider two cases. If ȳ misclassifies the positive training example, then F₁(ȳ, y) = 0 and by (16) we have

$$\Delta_k(\bar{\mathbf{y}}) = 1 - 0 + \frac1n\sum_{i=1}^{n-1} y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle(y_i\bar y_i - 1) + \frac12(-1-1) = \Big(\frac16 + \frac1{2k}\Big)\sum_{i=1}^{k}(y_i\bar y_i - 1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i - 1) \le 0.$$
Suppose ȳ correctly labels the positive example, but misclassifies t₁ examples in {x₁, …, x_k} and t₂ examples in {x_{k+1}, …, x_{n−1}} (into positive). Then F₁(ȳ, y) = 2/(2 + t₁ + t₂), and

$$\Delta_k(\bar{\mathbf{y}}) = 1 - \frac{2}{2+t_1+t_2} + \Big(\frac16+\frac1{2k}\Big)\sum_{i=1}^{k}(y_i\bar y_i-1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i-1) = \frac{t_1+t_2}{2+t_1+t_2} - \Big(\frac13+\frac1k\Big)t_1 - \frac13 t_2 \le \frac{t-t^2}{3(2+t)} \le 0 \quad (t := t_1+t_2).$$
So we can pick ȳ = (−1, …, −1 (k copies), +1, −1, …, −1 (n − k − 2 copies), +1)ᵀ, which only misclassifies x_{k+1}, and get

$$\mathbf{a}_{k+1} = \frac2n\bar y_{k+1}\mathbf{x}_{k+1} = \frac{1}{\sqrt3}\mathbf{e}_1 - \mathbf{e}_{k+2}, \qquad b_{k+1} = R_{\mathrm{emp}}(\mathbf{w}_k) - \langle\mathbf{a}_{k+1},\mathbf{w}_k\rangle = 0 + \frac13 = \frac13,$$
$$\mathbf{w}_{k+1} = \operatorname*{argmin}_{\mathbf{w}} \underbrace{\Big\{\frac12\|\mathbf{w}\|^2 + \max_{i\in[k+1]}\langle\mathbf{a}_i,\mathbf{w}\rangle + b_i\Big\}}_{=:J_{k+1}(\mathbf{w})} = \Big(-\frac{1}{\sqrt3}, \underbrace{\frac{1}{k+1},\ldots,\frac{1}{k+1}}_{k+1\text{ copies}}, 0, \ldots\Big)^\top,$$
which can be verified by ∂J_{k+1}(w_{k+1}) = { w_{k+1} + Σ_{i∈[k+1]} α_i a_i : α ∈ Δ_{k+1} } ∋ 0 (just set all α_i = 1/(k+1)). So (15) holds for step k + 1. End of induction.
All that remains is to observe that J(w_k) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(w_k) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) as claimed in Theorem 6.
5 An O(nd/√ε) Algorithm for Training Binary Linear SVMs
The lower bounds we proved above show that CPM such as BMRM require Ω(1/ε) iterations to converge. We now show that this is an inherent limitation of CPM and not an artifact of the problem. To demonstrate this, we will show that one can devise an algorithm for problems (1) and (2) which converges in O(1/√ε) iterations. The key difficulty stems from the non-smoothness of the objective function, which renders second and higher order algorithms such as L-BFGS inapplicable. However, thanks to Theorem 11 (see Appendix A), the Fenchel dual of (1) is a convex smooth function with a Lipschitz continuous gradient, which is easy to optimize.
To formalize the idea of using the Fenchel dual, we can abstract from the objectives (1) and (2) a composite form of objective functions used in machine learning with linear models:

$$\min_{\mathbf{w}\in Q_1} J(\mathbf{w}) = f(\mathbf{w}) + g^\star(A\mathbf{w}), \quad\text{where } Q_1 \text{ is a closed convex set.} \tag{17}$$
Here, f(w) is a strongly convex function corresponding to the regularizer, Aw stands for the output of a linear model, and g* encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model. Let the domain of g be Q₂. It is well known [e.g. 7, Theorem 3.3.5] that under some mild constraint qualifications, the adjoint form of J(w):

$$D(\boldsymbol\alpha) = -g(\boldsymbol\alpha) - f^\star(-A^\top\boldsymbol\alpha), \quad \boldsymbol\alpha\in Q_2 \tag{18}$$

satisfies J(w) ≥ D(α) and inf_{w∈Q₁} J(w) = sup_{α∈Q₂} D(α).
Example 1: binary SVMs with bias. Let A := −Y Xᵀ where Y := diag(y₁, …, y_n) and X := (x₁, …, x_n), f(w) = (λ/2)‖w‖², and g*(u) = min_{b∈ℝ} (1/n)Σ_{i=1}^n [1 + u_i − y_i b]_+, which corresponds to g(−α) = −Σ_i α_i. Then the adjoint form turns out to be the well known SVM dual objective function:

$$D(\boldsymbol\alpha) = \sum_i\alpha_i - \frac{1}{2\lambda}\,\boldsymbol\alpha^\top Y X^\top X Y\boldsymbol\alpha, \qquad Q_2 = \Big\{\boldsymbol\alpha\in[0, n^{-1}]^n : \sum_i y_i\alpha_i = 0\Big\}. \tag{19}$$
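For reference, (19) and the primal point recovered from a dual point can be written in a few lines (our sketch; X stores the x_i as rows, so the product XYα below is computed as Xᵀ(y ∘ α)):

```python
import numpy as np

def svm_dual(alpha, X, y, lam):
    """D(alpha) in (19); feasibility of alpha in Q_2 is assumed."""
    v = X.T @ (y * alpha)            # X Y alpha, a d-vector
    return alpha.sum() - (v @ v) / (2 * lam)

def w_from_alpha(alpha, X, y, lam):
    """Primal point recovered from a dual point, as in (33) below."""
    return X.T @ (y * alpha) / lam
```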
Example 2: multivariate scores. Denote by A the 2ⁿ-by-d matrix whose ȳ-th row is Σ_{i=1}^n x_iᵀ(ȳ_i − y_i) for each ȳ ∈ {−1, +1}ⁿ, and let f(w) = (λ/2)‖w‖² and g*(u) = max_ȳ [Δ(ȳ, y) + (1/n)u_ȳ], which corresponds to g(−α) = −n Σ_ȳ Δ(ȳ, y) α_ȳ. Then we recover the primal objective (2) for multivariate performance measures. Its adjoint form is

$$D(\boldsymbol\alpha) = -\frac{1}{2\lambda}\,\boldsymbol\alpha^\top A A^\top\boldsymbol\alpha + n\sum_{\bar{\mathbf{y}}}\Delta(\bar{\mathbf{y}},\mathbf{y})\,\alpha_{\bar{\mathbf{y}}}, \qquad Q_2 = \Big\{\boldsymbol\alpha\in[0, n^{-1}]^{2^n} : \sum_{\bar{\mathbf{y}}}\alpha_{\bar{\mathbf{y}}} = \frac1n\Big\}. \tag{20}$$
In a series of papers [6, 8, 9], Nesterov developed optimal gradient based methods for minimizing the composite objectives with primal (17) and adjoint (18). A sequence of {w_k} and {α_k} is produced such that under assumption A1 the duality gap J(w_k) − D(α_k) is reduced to less than ε after at most k = O(1/√ε) steps. We refer the reader to [8] for details.
5.1 Efficient Projections in Training SV Models with Optimal Gradient Methods
However, applying Nesterov's algorithm is challenging, because it requires an efficient subroutine for computing projections onto the set of constraints Q₂. This projection can be either a Euclidean projection or a Bregman projection.
Example 1: binary SVMs with bias. In this case we need to compute the Euclidean projection onto Q₂ defined by (19), which entails solving a Quadratic Programming problem with a diagonal Hessian, many box constraints, and a single equality constraint. We present an O(n) algorithm for this task in Appendix F. Plugging this into the algorithm described in [8] and noting that all intermediate steps of the algorithm can be computed in O(nd) time directly yields an O(nd/√ε) algorithm. A self-contained description of the algorithm is given in Appendix G.
Example 2: multivariate scores. Since the dimension of Q₂ in (20) is exponentially large in n, Euclidean projection is intractable and we resort to Bregman projection. Given a differentiable convex function F on Q₂, a point α, and a direction g, we can define the Bregman projection as:

$$V(\boldsymbol\alpha, \mathbf{g}) := \operatorname*{argmin}_{\bar{\boldsymbol\alpha}\in Q_2} F(\bar{\boldsymbol\alpha}) - \langle\nabla F(\boldsymbol\alpha) - \mathbf{g},\, \bar{\boldsymbol\alpha}\rangle.$$

Scaling up by a factor of n, we can choose F(α) as the negative entropy F(α) = Σ_i α_i log α_i. Then the application of the algorithm in [8] will endow a distribution over all possible labelings:

$$p(\bar{\mathbf{y}};\mathbf{w}) \propto \exp\Big(c\,\Delta(\bar{\mathbf{y}},\mathbf{y}) + \sum_i a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\Big), \quad\text{where } c \text{ and } a_i \text{ are constant scalars.} \tag{21}$$
The solver will request the expectation E_ȳ[Σ_i a_i x_i ȳ_i], which in turn requires the marginal distributions p(ȳ_i). This is not as straightforward as in graphical models, because Δ(ȳ, y) may not decompose. Fortunately, for multivariate scores defined by contingency tables, it is possible to compute the marginals in O(n²) time by using dynamic programming, and this cost is similar to the algorithm proposed by [3]. The idea of the dynamic programming can be found in Appendix H.
6 Outlook and Conclusion
CPM are widely employed in machine learning, especially in the context of structured prediction [14–16]. While upper bounds on their rates of convergence were known, lower bounds were not studied before. In this paper we set out to fill this gap by exhibiting counter examples on which CPM require Ω(1/ε) iterations. This is a fundamental limitation of these algorithms and not an artifact of the problem. We show this by devising an O(1/√ε) algorithm borrowing techniques from [8]. However, this algorithm assumes that the dataset is contained in a ball of bounded radius (assumption A1, Section 1). Devising an O(1/√ε) algorithm under the less restrictive assumption A2 remains an open problem.
It is important to note that the linear time algorithm in Appendix F is the key to obtaining the O(nd/√ε) computational complexity for binary SVMs with bias mentioned in Section 5.1. However, this method has been rediscovered independently by many authors (including us), with the earliest known reference to the best of our knowledge being [10] in 1990. Some recent work in optimization [11] has focused on improving the practical performance, while in machine learning [12] gave an exact projection algorithm in linear time and [13] gave an expected linear time algorithm for the same problem.
Choosing an optimizer for a given machine learning task is a trade-off between a number of potentially conflicting requirements. CPM are one popular choice but there are others. If one is interested in classification accuracy alone, without requiring deterministic guarantees, then online to batch conversion techniques combined with stochastic subgradient descent are a good choice [17]. While the dependence on ε is still Ω(1/ε) or worse [18], one gets bounds independent of n. However, as we pointed out earlier, these algorithms are applicable only when the empirical risk decomposes over the examples.
On the other hand, one can employ coordinate descent in the dual, as is done in the Sequential Minimal Optimization (SMO) algorithm of [19]. However, as [20] show, if the kernel matrix XᵀX (obtained by stacking the x_i into a matrix X) is not strictly positive definite, then SMO requires O(1/ε) iterations, with each iteration costing O(nd) effort. When the kernel matrix is strictly positive definite, one can obtain an O(n² log(1/ε)) bound on the number of iterations, which has a better dependence on ε, but is prohibitively expensive for large n.
References
[1] T. Joachims. Training linear SVMs in linear time. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2006.
[2] Choon Hui Teo, S. V. N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311–365, January 2010.
[3] T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, pages 377–384, San Francisco, California, 2005. Morgan Kaufmann Publishers.
[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[5] Arkadi Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.
[6] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269:543–547, 1983.
[7] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Canadian Mathematical Society, 2000.
[8] Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. on Optimization, 16(1):235–249, 2005. ISSN 1052-6234.
[9] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, CORE Discussion Paper, UCL, 2007.
[10] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46:321–328, 1990.
[11] Y.-H. Dai and R. Fletcher. New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Mathematical Programming: Series A and B, 106(3):403–421, 2006.
[12] Jun Liu and Jieping Ye. Efficient Euclidean projections in linear time. In Proc. Intl. Conf. Machine Learning. Morgan Kaufmann, 2009.
[13] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ₁-ball for learning in high dimensions. In Proc. Intl. Conf. Machine Learning, 2008.
[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press.
[15] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res., 9:1775–1822, 2008.
[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
[17] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. Intl. Conf. Machine Learning, 2007.
[18] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Neural Information Processing Systems, 2009.
[19] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[20] Nikolas List and Hans Ulrich Simon. SVM-optimization and steepest-descent line search. In Sanjoy Dasgupta and Adam Klivans, editors, Proc. Annual Conf. Computational Learning Theory, LNCS. Springer, 2009.
[21] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, I and II, volumes 305 and 306. Springer-Verlag, 1993.
Supplementary Material
A Concepts from Convex Analysis
The following four concepts from convex analysis are used in the paper.
Definition 7 Suppose a convex function f : ℝⁿ → ℝ̄ is finite at w. Then a vector g ∈ ℝⁿ is called a subgradient of f at w if, and only if,

$$f(\mathbf{w}') \ge f(\mathbf{w}) + \langle\mathbf{w}' - \mathbf{w},\, \mathbf{g}\rangle \quad\text{for all } \mathbf{w}'.$$

The set of all such vectors g is called the subdifferential of f at w, denoted by ∂_w f(w). For any convex function f, ∂_w f(w) must be nonempty. Furthermore, if it is a singleton then f is said to be differentiable at w, and we use ∇f(w) to denote the gradient.
Definition 8 A convex function f : ℝⁿ → ℝ̄ is strongly convex with respect to a norm ‖·‖ if there exists a constant ρ > 0 such that f − (ρ/2)‖·‖² is convex. ρ is called the modulus of strong convexity of f, and for brevity we will call f ρ-strongly convex.
Definition 9 Suppose a function f : ℝⁿ → ℝ̄ is differentiable on Q ⊆ ℝⁿ. Then f is said to have Lipschitz continuous gradient (l.c.g) with respect to a norm ‖·‖ if there exists a constant L such that

$$\|\nabla f(\mathbf{w}) - \nabla f(\mathbf{w}')\| \le L\,\|\mathbf{w} - \mathbf{w}'\| \quad \forall\, \mathbf{w}, \mathbf{w}' \in Q.$$

For brevity, we will call f L-l.c.g.
Definition 10 The Fenchel dual of a function f : ℝⁿ → ℝ̄ is the function f* : ℝⁿ → ℝ̄ defined by

$$f^\star(\mathbf{w}^\star) = \sup_{\mathbf{w}\in\mathbb{R}^n}\big\{\langle\mathbf{w},\mathbf{w}^\star\rangle - f(\mathbf{w})\big\}.$$
Strong convexity and l.c.g are related by Fenchel duality according to the following lemma:

Theorem 11 ([21, Theorem 4.2.1 and 4.2.2])
1. If f : ℝⁿ → ℝ̄ is ρ-strongly convex, then f* is finite on ℝⁿ and f* is (1/ρ)-l.c.g.
2. If f : ℝⁿ → ℝ is convex, differentiable on ℝⁿ, and L-l.c.g, then f* is (1/L)-strongly convex.
Finally, the following lemma gives a useful characterization of the minimizer of a convex function.

Lemma 12 ([21, Theorem 2.2.1]) A convex function f is minimized at w* if, and only if, 0 ∈ ∂f(w*). Furthermore, if f is strongly convex, then its minimizer is unique.
B Proof of Lemma 3
We prove the lemma by induction on k. Obviously, Lemma 3 holds for k = 1. Suppose it holds for indices up to some k − 1 (k ≥ 2). Let p = 2k − 1 (p ≥ 3). Then
$$A_p = -\Big(\frac34, 0, \frac14, \ldots, 0, \frac14\Big)^\top, \qquad \mathbf{b}_p = \Big(1, 0, \frac12, \ldots, 0, \frac12\Big)^\top,$$

$$w_p = -16\,A_p^\top\boldsymbol\alpha_p = 16\Big(\frac34\alpha_{p,1} + \frac14\alpha_{p,3} + \frac14\alpha_{p,5} + \ldots + \frac14\alpha_{p,p-2} + \frac14\alpha_{p,p}\Big) \;\Longrightarrow\; \alpha_{p,3} + \ldots + \alpha_{p,p-2} + \alpha_{p,p} = \frac{w_p}{4} - 3\alpha_{p,1}.$$

So

$$\mathbf{b}_p^\top\boldsymbol\alpha_p = \alpha_{p,1} + \frac12\alpha_{p,3} + \frac12\alpha_{p,5} + \ldots + \frac12\alpha_{p,p-2} + \frac12\alpha_{p,p} = \frac18 w_p - \frac12\alpha_{p,1}.$$

Since w_p > 2, we have a_{p+1} = 0 and b_{p+1} = 0. So A_{p+1} = (A_pᵀ, 0)ᵀ and b_{p+1} = (b_pᵀ, 0)ᵀ. Let α_{p+1} = (η α_pᵀ, 1 − η)ᵀ; then D_{p+1}(η) = −8η²(A_pᵀα_p)² + η b_pᵀα_p. So

$$\eta_{p+1} = \frac{\mathbf{b}_p^\top\boldsymbol\alpha_p}{16\,(A_p^\top\boldsymbol\alpha_p)^2} = \frac{2w_p - 8\alpha_{p,1}}{w_p^2}, \qquad w_{p+1} = -16\,A_{p+1}^\top\boldsymbol\alpha_{p+1} = w_p\,\eta_{p+1} = 2 - \frac{8\alpha_{p,1}}{w_p} < 2, \tag{22}$$

which proves the claim in (11) for even iterates, as p + 1 = 2k.

Since α_{2,1} = 1/9, p ≥ 3, and α_{k,1} ≥ α_{k+1,1} due to the update rule of ls-bmrm, we have

$$8\alpha_{p,1} \le \frac89 < 2 < w_p, \quad\text{hence}\quad w_{p+1} > 1. \tag{23}$$

At the next step, since w_{p+1} ∈ (1, 2), we have a_{p+2} = −1/4, b_{p+2} = 1/2, A_{p+2} = (A_pᵀ, 0, −1/4)ᵀ, b_{p+2} = (b_pᵀ, 0, 1/2)ᵀ. Let α_{p+2}(η) = (η α_{p+1}ᵀ, 1 − η)ᵀ = (η η_{p+1} α_pᵀ, η(1 − η_{p+1}), 1 − η)ᵀ. Then

$$A_{p+2}^\top\boldsymbol\alpha_{p+2} = \eta\,\eta_{p+1}A_p^\top\boldsymbol\alpha_p - \frac14(1-\eta), \qquad \mathbf{b}_{p+2}^\top\boldsymbol\alpha_{p+2} = \eta\,\eta_{p+1}\mathbf{b}_p^\top\boldsymbol\alpha_p + \frac12(1-\eta).$$

$$D_{p+2}(\eta) = -8\big(A_{p+2}^\top\boldsymbol\alpha_{p+2}\big)^2 + \mathbf{b}_{p+2}^\top\boldsymbol\alpha_{p+2} = -\frac{\big(4\eta_{p+1}A_p^\top\boldsymbol\alpha_p + 1\big)^2}{2}\,\eta^2 + \Big(4\eta_{p+1}A_p^\top\boldsymbol\alpha_p + \eta_{p+1}\mathbf{b}_p^\top\boldsymbol\alpha_p + \frac12\Big)\eta + \text{const},$$

where const means terms independent of η. So

$$\eta_{p+2} = \operatorname*{argmax}_{\eta\in[0,1]} D_{p+2}(\eta) = \frac{4\eta_{p+1}A_p^\top\boldsymbol\alpha_p + \eta_{p+1}\mathbf{b}_p^\top\boldsymbol\alpha_p + \frac12}{\big(4\eta_{p+1}A_p^\top\boldsymbol\alpha_p + 1\big)^2} = \frac{w_p^2 + 16\alpha_{p,1}^2}{(w_p + 4\alpha_{p,1})^2}, \tag{24}$$

$$w_{p+2} = -16\,A_{p+2}^\top\boldsymbol\alpha_{p+2} = -16\,\eta_{p+2}\eta_{p+1}A_p^\top\boldsymbol\alpha_p + 4(1-\eta_{p+2}) = 2 + \frac{8\alpha_{p,1}(w_p - 4\alpha_{p,1})}{w_p(w_p + 4\alpha_{p,1})},$$
which proves the claim in (11) for even iterates as p + 1 = 2k.
where the last step is by plugging in the expression of η_{p+1} in (22) and η_{p+2} in (24). Now using (23) we get
$$w_{p+2} - 2 = \frac{8\alpha_{p,1}\,(w_p - 4\alpha_{p,1})}{w_p\,(w_p + 4\alpha_{p,1})} > 0,$$
which proves the claim in (11) for odd iterates, as p + 2 = 2k + 1. Finally, α_{p+2,1} = η_{p+2}η_{p+1}α_{p,1}, and substituting (22) and (24) yields the recursion (12). This completes the induction.
C Proof of Lemma 4
The proof is based on (12). Let ζ_k = 1/α_{2k−1,1}; then lim_{k→∞} ζ_k = ∞ because lim_{k→∞} α_{2k−1,1} = 0. Now

$$\lim_{k\to\infty} k\,\alpha_{2k-1,1} = \Big(\lim_{k\to\infty}\frac{1}{k\,\alpha_{2k-1,1}}\Big)^{-1} = \Big(\lim_{k\to\infty}\frac{\zeta_k}{k}\Big)^{-1} = \Big(\lim_{k\to\infty}(\zeta_{k+1}-\zeta_k)\Big)^{-1},$$

where the last step is by the discrete version of L'Hôpital's rule.

To compute lim_{k→∞}(ζ_{k+1} − ζ_k) we plug the definition ζ_k = 1/α_{2k−1,1} into (12), which gives:

$$\frac{1}{\zeta_{k+1}} = \frac{w_{2k-1}^2 + 16\,\zeta_k^{-2}}{\big(w_{2k-1} + 4\,\zeta_k^{-1}\big)^2}\cdot\frac{1}{\zeta_k} \quad\Longrightarrow\quad \zeta_{k+1} - \zeta_k = \frac{8\,w_{2k-1}\,\zeta_k^2}{w_{2k-1}^2\,\zeta_k^2 + 16} = \frac{8\,w_{2k-1}}{w_{2k-1}^2 + 16\,\zeta_k^{-2}}.$$

Since lim_{k→∞} w_k = 2 and lim_{k→∞} ζ_k = ∞, we conclude

$$\lim_{k\to\infty} k\,\alpha_{2k-1,1} = \Big(\lim_{k\to\infty}(\zeta_{k+1}-\zeta_k)\Big)^{-1} = \Big(\frac{8\cdot2}{2^2}\Big)^{-1} = \frac14.$$
D Proof of Theorem 2
Denote δ_k = 2 − w_k; then lim_{k→∞} k|δ_k| = 2 by Lemma 4. So:

If δ_k > 0, then
$$J(w_k) - J(w^*) = \frac{1}{32}(2-\delta_k)^2 + \frac12\cdot\frac{\delta_k}{2} - \frac18 = \frac18\delta_k + \frac{1}{32}\delta_k^2 = \frac18|\delta_k| + \frac{1}{32}\delta_k^2.$$

If δ_k ≤ 0, then
$$J(w_k) - J(w^*) = \frac{1}{32}(2-\delta_k)^2 - \frac18 = -\frac18\delta_k + \frac{1}{32}\delta_k^2 = \frac18|\delta_k| + \frac{1}{32}\delta_k^2.$$

Combining these two cases, we conclude lim_{k→∞} k(J(w_k) − J(w*)) = 1/4.
E Proof of Theorem 6
The crux of the proof is to show that

$$\mathbf{w}_k = \Big(-\frac{1}{\sqrt3}, \underbrace{\frac1k,\ldots,\frac1k}_{k\text{ copies}}, 0, \ldots\Big)^\top \quad \forall k\in[n-1]. \tag{25}$$
At the first iteration, we have

$$\frac1n y_i\langle\mathbf{w}_0,\mathbf{x}_i\rangle = \begin{cases} \frac16 & \text{if } i\in[n-1] \\ \frac12 & \text{if } i = n \end{cases}. \tag{26}$$
For convenience, define the term in the max of (14) as

$$\Delta_0(\bar{\mathbf{y}}) := 1 - F_1(\bar{\mathbf{y}},\mathbf{y}) + \frac1n\sum_{i=1}^n y_i\langle\mathbf{w}_0,\mathbf{x}_i\rangle\,(y_i\bar y_i - 1).$$
The key observation in the context of the F₁ score is that Δ₀(ȳ) is maximized at any of the following assignments of (ȳ₁, …, ȳ_n), and it is easy to check that they all give Δ₀(ȳ) = 0:

(−1, …, −1, +1), (−1, …, −1, −1), (+1, −1, −1, …, −1, +1), …, (−1, …, −1, +1, +1).
The first assignment is just the correct labeling of the training examples. The second assignment just misclassifies the only positive example x_n into negative. The remaining n − 1 assignments each misclassify a single negative example into positive. To prove that they maximize Δ₀(ȳ), consider two cases of ȳ. First, the positive training example is misclassified. Then F₁(ȳ, y) = 0 and by (26) we have
$$\Delta_0(\bar{\mathbf{y}}) = 1 - 0 + \frac1n\sum_{i=1}^{n-1} y_i\langle\mathbf{w}_0,\mathbf{x}_i\rangle(y_i\bar y_i - 1) + \frac12(-1-1) = \frac16\sum_{i=1}^{n-1}(y_i\bar y_i - 1) \le 0.$$
Second, consider the case of ȳ where the positive example is correctly labeled, while t ≥ 1 negative examples are misclassified. Then F₁(ȳ, y) = 2/(2 + t), and
$$\Delta_0(\bar{\mathbf{y}}) = 1 - \frac{2}{2+t} + \frac16\sum_{i=1}^{n-1}(y_i\bar y_i - 1) = \frac{t}{2+t} - \frac13 t = \frac{t - t^2}{3(2+t)} \le 0, \quad \forall t\in[1, n-1].$$
So now suppose we pick

ȳ¹ = (+1, −1, −1, …, −1, +1)ᵀ,

i.e. just misclassify the first negative training example. Then
$$\mathbf{a}_1 = \frac2n\bar y_1\mathbf{x}_1 = \Big(\frac{1}{\sqrt3}, -1, 0, \ldots\Big)^\top, \qquad b_1 = R_{\mathrm{emp}}(\mathbf{w}_0) - \langle\mathbf{a}_1,\mathbf{w}_0\rangle = 0 + \frac13 = \frac13,$$

$$\mathbf{w}_1 = \operatorname*{argmin}_{\mathbf{w}}\Big\{\frac12\|\mathbf{w}\|^2 + \frac{1}{\sqrt3}w_1 - w_2 + \frac13\Big\} = \Big(-\frac{1}{\sqrt3}, 1, 0, \ldots\Big)^\top.$$
Next, we prove (25) by induction. Assume that it holds for steps 1, …, k. Then at step k + 1 it is easy to check that

$$\frac1n y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle = \begin{cases} \frac16 + \frac1{2k} & \text{if } i\in[k] \\ \frac16 & \text{if } k+1 \le i \le n-1 \\ \frac12 & \text{if } i = n \end{cases}. \tag{27}$$
Define

$$\Delta_k(\bar{\mathbf{y}}) := 1 - F_1(\bar{\mathbf{y}},\mathbf{y}) + \frac1n\sum_{i=1}^n y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle\,(y_i\bar y_i - 1).$$
Then it is not hard to see that the following ȳ (among others) maximize Δ_k: a) the correct labeling, b) only misclassify the positive training example x_n, c) only misclassify one negative training example in x_{k+1}, …, x_{n−1}. And Δ_k equals 0 at all these assignments. For proof, again consider two cases. If ȳ misclassifies the positive training example, then F₁(ȳ, y) = 0 and by (27) we have
$$\Delta_k(\bar{\mathbf{y}}) = 1 - 0 + \frac1n\sum_{i=1}^{n-1} y_i\langle\mathbf{w}_k,\mathbf{x}_i\rangle(y_i\bar y_i - 1) + \frac12(-1-1) = \Big(\frac16 + \frac1{2k}\Big)\sum_{i=1}^{k}(y_i\bar y_i - 1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i - 1) \le 0.$$
If ȳ correctly labels the positive example, but misclassifies t₁ examples in {x₁, …, x_k} and t₂ examples in {x_{k+1}, …, x_{n−1}} (into positive), then F₁(ȳ, y) = 2/(2 + t₁ + t₂), and
$$\Delta_k(\bar{\mathbf{y}}) = 1 - \frac{2}{2+t_1+t_2} + \Big(\frac16 + \frac1{2k}\Big)\sum_{i=1}^{k}(y_i\bar y_i - 1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i - 1) = \frac{t_1+t_2}{2+t_1+t_2} - \Big(\frac13 + \frac1k\Big)t_1 - \frac13 t_2 \le \frac{t - t^2}{3(2+t)} \le 0 \quad (t := t_1 + t_2).$$
So we can pick ȳ = (−1, …, −1 (k copies), +1, −1, …, −1 (n − k − 2 copies), +1)ᵀ, which only misclassifies x_{k+1}, and get

$$\mathbf{a}_{k+1} = \frac2n\bar y_{k+1}\mathbf{x}_{k+1} = \frac{1}{\sqrt3}\mathbf{e}_1 - \mathbf{e}_{k+2}, \qquad b_{k+1} = R_{\mathrm{emp}}(\mathbf{w}_k) - \langle\mathbf{a}_{k+1},\mathbf{w}_k\rangle = 0 + \frac13 = \frac13,$$
$$\mathbf{w}_{k+1} = \operatorname*{argmin}_{\mathbf{w}}\Big\{\frac12\|\mathbf{w}\|^2 + \max_{i\in[k+1]}\langle\mathbf{a}_i,\mathbf{w}\rangle + b_i\Big\} = \Big(-\frac{1}{\sqrt3}, \underbrace{\frac{1}{k+1},\ldots,\frac{1}{k+1}}_{k+1\text{ copies}}, 0, \ldots\Big)^\top,$$
which can be verified by ∂J_{k+1}(w_{k+1}) = { w_{k+1} + Σ_{i=1}^{k+1} α_i a_i : α ∈ Δ_{k+1} } ∋ 0 (setting all α_i = 1/(k+1)). So (25) holds for step k + 1. End of induction.

All that remains is to observe that J(w_k) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(w_k) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) as claimed by Theorem 6.
F A linear time algorithm for a box constrained diagonal QP with a single linear equality constraint
It can be shown that the Euclidean projection onto the constraint set Q₂ of the dual problem in (19) reduces to a box constrained diagonal QP with a single linear equality constraint.
In this section, we focus on the following simple QP:

$$\min_{\boldsymbol\alpha}\; \frac12\sum_{i=1}^n d_i^2\,(\alpha_i - m_i)^2 \qquad \text{s.t.}\quad l_i \le \alpha_i \le u_i \;\; \forall i\in[n]; \qquad \sum_{i=1}^n \sigma_i\alpha_i = z.$$
Without loss of generality, we assume l_i < u_i and d_i ≠ 0 for all i. Also assume σ_i ≠ 0, because otherwise α_i can be solved independently. To make the feasible region nonempty, we also assume

$$\sum_i \big(\delta(\sigma_i>0)\,\sigma_i l_i + \delta(\sigma_i<0)\,\sigma_i u_i\big) \;\le\; z \;\le\; \sum_i \big(\delta(\sigma_i>0)\,\sigma_i u_i + \delta(\sigma_i<0)\,\sigma_i l_i\big).$$
The algorithm we describe below stems from [10] and finds the exact optimal solution in O(n) time, faster than the O(n log n) complexity of [13].
With a simple change of variable β_i = σ_i(α_i − m_i), the problem is simplified as

$$\min_{\boldsymbol\beta}\; \frac12\sum_{i=1}^n \bar d_i^2\,\beta_i^2 \qquad \text{s.t.}\quad l'_i \le \beta_i \le u'_i \;\;\forall i\in[n]; \qquad \sum_{i=1}^n \beta_i = z',$$
where

$$l'_i = \begin{cases} \sigma_i(l_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i(u_i - m_i) & \text{if } \sigma_i < 0 \end{cases}, \qquad u'_i = \begin{cases} \sigma_i(u_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i(l_i - m_i) & \text{if } \sigma_i < 0 \end{cases}, \qquad \bar d_i^2 = \frac{d_i^2}{\sigma_i^2}, \qquad z' = z - \sum_i \sigma_i m_i.$$
We derive its dual via the standard Lagrangian:

$$L = \frac12\sum_i \bar d_i^2\beta_i^2 - \sum_i\rho_i^+(\beta_i - l'_i) + \sum_i\rho_i^-(\beta_i - u'_i) - \lambda\Big(\sum_i\beta_i - z'\Big).$$
Taking the derivative:

$$\frac{\partial L}{\partial\beta_i} = \bar d_i^2\beta_i - \rho_i^+ + \rho_i^- - \lambda = 0 \quad\Longrightarrow\quad \beta_i = \bar d_i^{-2}\,(\rho_i^+ - \rho_i^- + \lambda). \tag{28}$$
Substituting into L, we get the dual optimization problem

$$\min_{\lambda,\,\rho_i^+,\,\rho_i^-}\; D(\lambda,\rho_i^+,\rho_i^-) = \frac12\sum_i \bar d_i^{-2}\,(\rho_i^+ - \rho_i^- + \lambda)^2 - \sum_i\rho_i^+ l'_i + \sum_i\rho_i^- u'_i - \lambda z' \qquad\text{s.t.}\;\; \rho_i^+ \ge 0,\; \rho_i^- \ge 0 \;\;\forall i\in[n].$$
Taking the derivative of D with respect to λ, we get:

$$\sum_i \bar d_i^{-2}\,(\rho_i^+ - \rho_i^- + \lambda) - z' = 0. \tag{29}$$
The KKT conditions give:

$$\rho_i^+(\beta_i - l'_i) = 0, \tag{30a}$$
$$\rho_i^-(\beta_i - u'_i) = 0. \tag{30b}$$
Now we enumerate four cases.
1. ρ⁺_i > 0, ρ⁻_i > 0. This implies that l'_i = β_i = u'_i, which contradicts our assumption.
2. ρ⁺_i = 0, ρ⁻_i = 0. Then by (28), β_i = d̄⁻²_i λ ∈ [l'_i, u'_i], hence λ ∈ [d̄²_i l'_i, d̄²_i u'_i].
3. ρ⁺_i > 0, ρ⁻_i = 0. Now by (30) and (28), we have l'_i = β_i = d̄⁻²_i(ρ⁺_i + λ) > d̄⁻²_i λ, hence λ < d̄²_i l'_i and ρ⁺_i = d̄²_i l'_i − λ.
4. ρ⁺_i = 0, ρ⁻_i > 0. Now by (30) and (28), we have u'_i = β_i = d̄⁻²_i(−ρ⁻_i + λ) < d̄⁻²_i λ, hence λ > d̄²_i u'_i and ρ⁻_i = λ − d̄²_i u'_i.
In sum, we have ρ⁺_i = [d̄²_i l'_i − λ]_+ and ρ⁻_i = [λ − d̄²_i u'_i]_+. Now (29) turns into
[Figure 1: h_i(λ). The function h_i is piecewise linear and monotonically increasing: it equals l'_i for λ ≤ d̄²_i l'_i, rises with slope d̄⁻²_i in between, and equals u'_i for λ ≥ d̄²_i u'_i.]
Algorithm 3: O(n) algorithm to find the root of f(λ). Ignoring boundary condition checks.
  1: Set the kink set S ← {d̄²_i l'_i : i ∈ [n]} ∪ {d̄²_i u'_i : i ∈ [n]}.
  2: while |S| > 2 do
  3:   Find the median of S: m ← MED(S).
  4:   if f(m) ≥ 0 then
  5:     S ← {x ∈ S : x ≤ m}.
  6:   else
  7:     S ← {x ∈ S : x ≥ m}.
  8:   end if
  9: end while
  10: Return root (l f(u) − u f(l)) / (f(u) − f(l)), where S = {l, u}.
$$f(\lambda) := \sum_i \underbrace{\bar d_i^{-2}\Big(\big[\bar d_i^2 l'_i - \lambda\big]_+ - \big[\lambda - \bar d_i^2 u'_i\big]_+ + \lambda\Big)}_{=:\,h_i(\lambda)} - z' = 0. \tag{31}$$
In other words, we only need to find the root of f(λ) in (31). h_i(λ) is plotted in Figure 1. Note that h_i(λ) is a monotonically increasing function of λ, so the whole f(λ) is monotonically increasing in λ. Since f(∞) ≥ 0 because z' ≤ Σ_i u'_i, and f(−∞) ≤ 0 because z' ≥ Σ_i l'_i, the root must exist.
Considering that f has at most 2n kinks (nonsmooth points) and is linear between two adjacent kinks, the simplest idea is to sort {d̄²_i l'_i, d̄²_i u'_i : i ∈ [n]} into s₍₁₎ ≤ … ≤ s₍₂ₙ₎. If f(s₍ᵢ₎) and f(s₍ᵢ₊₁₎) have different signs, then the root must lie between them and can be easily found because f is linear on [s₍ᵢ₎, s₍ᵢ₊₁₎]. This algorithm takes at least O(n log n) time because of the sorting.
However, this complexity can be reduced to O(n) by making use of the fact that the median of n (unsorted) elements can be found in O(n) time. Notice that due to the monotonicity of f, the median of a set S gives exactly the median of function values, i.e., f(MED(S)) = MED({f(x) : x ∈ S}). Algorithm 3 sketches the idea of this binary search. The while loop terminates in log₂(2n) iterations because the set S is halved in each iteration, and in each iteration the time complexity is linear in |S|, the size of the current S. So the total complexity is O(n). Note that the evaluation of f(m) potentially involves summing up n terms as in (31); however, by some clever aggregation of slope and offset, this can be reduced to O(|S|).
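As a concrete sketch (ours), the following implements the sort-based variant described above; replacing the sort with median selection, plus the slope/offset aggregation, yields the O(n) version of Algorithm 3:

```python
import numpy as np

def solve_box_diag_qp(dbar2, lp, up, zp):
    """Minimize (1/2) sum_i dbar2[i] * beta[i]^2 subject to
    lp <= beta <= up and sum(beta) = zp, via the root of f in (31)."""
    def f(lam):
        # h_i(lam) = clip(lam / dbar2_i, l'_i, u'_i); f(lam) = sum_i h_i - z'
        return np.clip(lam / dbar2, lp, up).sum() - zp

    kinks = np.sort(np.concatenate([dbar2 * lp, dbar2 * up]))
    vals = np.array([f(s) for s in kinks])
    if vals[0] >= 0:                     # root at the leftmost kink
        lam = kinks[0]
    elif vals[-1] <= 0:                  # root at the rightmost kink
        lam = kinks[-1]
    else:
        j = int(np.argmax(vals >= 0))    # first kink with f >= 0
        l, u = kinks[j - 1], kinks[j]
        fl, fu = vals[j - 1], vals[j]    # f is linear on [l, u]
        lam = l if fl == fu else (l * fu - u * fl) / (fu - fl)
    return np.clip(lam / dbar2, lp, up)  # the optimal beta
```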
G Solving Binary Linear SVMs using Nesterov's Algorithm
Now we present the algorithm of [8] in Algorithm 4. It requires a σ₂-strongly convex prox-function on Q₂: d₂(α) = (σ₂/2)‖α‖², and sets D₂ = max_{α∈Q₂} d₂(α). Let the Lipschitz constant of ∇D(α) be L. Algorithm 4 is based on two mappings α(w) : Q₁ → Q₂ and w(α) : Q₂ → Q₁, together with an auxiliary mapping v : Q₂ → Q₂. They are defined by
$$\boldsymbol\alpha(\mathbf{w}) := \operatorname*{argmin}_{\boldsymbol\alpha\in Q_2}\; \mu\,d_2(\boldsymbol\alpha) - \langle A\mathbf{w},\boldsymbol\alpha\rangle + g(\boldsymbol\alpha) = \operatorname*{argmin}_{\boldsymbol\alpha\in Q_2}\; \frac{\mu\sigma_2}{2}\|\boldsymbol\alpha\|^2 + \mathbf{w}^\top XY\boldsymbol\alpha - \sum_i\alpha_i, \tag{32}$$

$$\mathbf{w}(\boldsymbol\alpha) := \operatorname*{argmin}_{\mathbf{w}\in Q_1}\; \langle A\mathbf{w},\boldsymbol\alpha\rangle + f(\mathbf{w}) = \operatorname*{argmin}_{\mathbf{w}\in\mathbb{R}^d}\; -\mathbf{w}^\top XY\boldsymbol\alpha + \frac{\lambda}{2}\|\mathbf{w}\|^2 = \frac{1}{\lambda}XY\boldsymbol\alpha, \tag{33}$$
Algorithm 4: Solving Binary Linear SVMs using Nesterov's Algorithm.
  Require: L as a conservative estimate of (i.e., no less than) the Lipschitz constant of ∇D(α).
  Ensure: Two sequences {w_k} and {α_k} which reduce the duality gap at O(1/k²) rate.
  1: Initialize: Randomly pick α₋₁ in Q₂. Let μ₀ = 2L, α₀ ← v(α₋₁), w₀ ← w(α₋₁).
  2: for k = 0, 1, 2, … do
  3:   Let τ_k = 2/(k+3), and set α̂_k ← (1 − τ_k)α_k + τ_k α(w_k).
  4:   Set w_{k+1} ← (1 − τ_k)w_k + τ_k w(α̂_k), α_{k+1} ← v(α̂_k), μ_{k+1} ← (1 − τ_k)μ_k.
  5: end for
$$v(\boldsymbol\alpha) := \operatorname*{argmin}_{\bar{\boldsymbol\alpha}\in Q_2}\; \frac{L}{2}\|\bar{\boldsymbol\alpha} - \boldsymbol\alpha\|^2 - \langle\nabla D(\boldsymbol\alpha),\, \bar{\boldsymbol\alpha} - \boldsymbol\alpha\rangle. \tag{34}$$
Equations (32) and (34) are examples of a box constrained QP with a single equality constraint. In Appendix F, we provide a linear time algorithm to find the minimizer of such a QP. The overall complexity of each iteration is thus O(nd), due to the gradient calculation in (34) and the matrix multiplication in (33).
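Putting the pieces together, the main loop of Algorithm 4 is only a few lines once the three mappings are available; the sketch below (ours) takes them as callables, e.g. with (32) and (34) implemented via the routine of Appendix F:

```python
def nesterov_svm(alpha_map, w_map, v_map, alpha_init, num_iters):
    """Schematic of Algorithm 4; alpha_map, w_map, v_map realize (32)-(34),
    and alpha_init is any feasible point of Q_2."""
    alpha = v_map(alpha_init)                    # alpha_0
    w = w_map(alpha_init)                        # w_0
    for k in range(num_iters):
        tau = 2.0 / (k + 3)
        alpha_hat = (1 - tau) * alpha + tau * alpha_map(w)
        w = (1 - tau) * w + tau * w_map(alpha_hat)   # w_{k+1}
        alpha = v_map(alpha_hat)                     # alpha_{k+1}
    return w, alpha
```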
H Dynamic Programming to Compute the Marginals of (21)
For convenience, we repeat the joint distribution here:

$$p(\bar{\mathbf{y}};\mathbf{w}) \propto \exp\Big(c\,\Delta(\bar{\mathbf{y}},\mathbf{y}) + \sum_i a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\Big).$$
Clearly, the marginal distributions can be computed efficiently if we can efficiently compute sums of the form Σ_ȳ exp(c Δ(ȳ, y) + Σ_i a_i⟨x_i, w⟩ ȳ_i). Notice that Δ(ȳ, y) only depends on the sufficient statistics, namely the number of false positives and false negatives (b and c in the contingency table, respectively), so our idea will be to enumerate all possible values of b and c.
For any fixed value of b and c, we just need to sum up

$$\exp\Big(\sum_i a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\Big) = \prod_i \exp\big(a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\big)$$

over all ȳ which have false positive count b and false negative count c. Let us call this set of labelings 𝒞_n(b, c).
If y_n = +1, then

$$V_n(b,c) := \sum_{\bar{\mathbf{y}}\in\mathcal{C}_n(b,c)}\,\prod_{i=1}^n \exp\big(a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\big)$$
$$= \exp\big(a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\sum_{(\bar y_1,\ldots,\bar y_{n-1})\in\mathcal{C}_{n-1}(b,c)}\,\prod_{i=1}^{n-1}\exp\big(a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\big) \qquad (\bar y_n = +1)$$
$$\;+\; \exp\big(-a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\sum_{(\bar y_1,\ldots,\bar y_{n-1})\in\mathcal{C}_{n-1}(b,\,c-1)}\,\prod_{i=1}^{n-1}\exp\big(a_i\langle\mathbf{x}_i,\mathbf{w}\rangle\,\bar y_i\big) \qquad (\bar y_n = -1)$$
$$= \exp\big(a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\,V_{n-1}(b,c) + \exp\big(-a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\,V_{n-1}(b,\,c-1).$$
If y_n = −1, then

$$V_n(b,c) = \exp\big(a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\,V_{n-1}(b-1,\,c) + \exp\big(-a_n\langle\mathbf{x}_n,\mathbf{w}\rangle\big)\,V_{n-1}(b,c).$$
In practice, the recursions start from V₁, and no specific b and c is fixed in advance: V_k(p, q) is maintained for all values of p and q. So at completion, we obtain V_n(b, c) for all possible values of b and c. The cost for computation and storage is O(n²).
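A direct transcription of this recursion (our sketch; `scores[i]` stands for a_i⟨x_i, w⟩ in the notation of (21)) is:

```python
import numpy as np

def f1_dp_tables(scores, y):
    """Compute V_n(b, c) for all false-positive counts b and
    false-negative counts c via the Appendix H recursion."""
    V = {(0, 0): 1.0}                     # empty prefix
    for s, yi in zip(scores, y):
        ep, em = np.exp(s), np.exp(-s)    # factors for ybar_i = +1 / -1
        Vn = {}
        for (b, c), val in V.items():
            if yi == 1:   # ybar=+1 keeps (b,c); ybar=-1 adds a false negative
                Vn[(b, c)] = Vn.get((b, c), 0.0) + ep * val
                Vn[(b, c + 1)] = Vn.get((b, c + 1), 0.0) + em * val
            else:         # ybar=+1 adds a false positive; ybar=-1 keeps (b,c)
                Vn[(b + 1, c)] = Vn.get((b + 1, c), 0.0) + ep * val
                Vn[(b, c)] = Vn.get((b, c), 0.0) + em * val
        V = Vn
    return V
```

Weighting each V_n(b, c) by the factor exp(c·Δ) of (21), with Δ determined by the pair (b, c), gives the normalizer of the distribution; the marginal computation described in the text builds on the same tables.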