Lower Bounds on Rate of Convergence of Cutting Plane Methods
Xinhua Zhang
Dept. of Computing Sciences
University of Alberta
xinhua2@ualberta.ca
Ankan Saha
Dept. of Computer Science
University of Chicago
ankans@cs.uchicago.edu
S.V. N. Vishwanathan
Dept. of Statistics and
Dept. of Computer Science
Purdue University
vishy@stat.purdue.edu
Abstract
In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.
1 Introduction
There has been an explosion of interest in machine learning over the past decade, much of which has been fueled by the phenomenal success of binary Support Vector Machines (SVMs). Driven by numerous applications, there has recently been increasing interest in support vector learning with linear models. At the heart of SVMs is the following regularized risk minimization problem:
min_w J(w) := (λ/2)‖w‖²  [regularizer]  +  R_emp(w)  [empirical risk],
with R_emp(w) := (1/n) Σ_{i=1}^n max(0, 1 − y_i⟨w, x_i⟩).  (1)
Here we assume access to a training set of n labeled examples {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d and y_i ∈ {−1, +1}, and use the squared Euclidean norm ‖w‖² = Σ_i w_i² as the regularizer. The parameter λ > 0 controls the trade-off between the empirical risk and the regularizer.
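For concreteness, the objective (1) can be evaluated as follows. This is a minimal sketch, not code from the paper; the function name and the NumPy representation are our own.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized risk J(w) from (1): (lam/2)*||w||^2 + mean hinge loss."""
    margins = y * (X @ w)                      # y_i <w, x_i> for each example
    hinge = np.maximum(0.0, 1.0 - margins)     # max(0, 1 - y_i <w, x_i>)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```

Any subgradient-based solver for (1), including the cutting plane methods discussed below, works with this quantity and its subgradients.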
There has been significant research devoted to developing specialized optimizers which minimize J(w) efficiently. In an award winning paper, Joachims [1] presented a cutting plane method (CPM)¹, SVM-Perf, which was shown to converge to an ε accurate solution of (1) in O(1/ε²) iterations, with each iteration requiring O(nd) effort. This was improved by Teo et al. [2] who showed that their Bundle Method for Regularized Risk Minimization (BMRM) (which encompasses SVM-Perf as a special case) converges to an ε accurate solution in O(nd/ε) time.
While online learning methods are becoming increasingly popular for solving (1), a key advantage of CPM such as SVM-Perf and BMRM is their ability to directly optimize nonlinear multivariate performance measures such as the F₁-score, the ordinal regression loss, and the ROCArea, which are widely used in some application areas. In this case R_emp does not decompose into a sum of losses over individual data points as in (1), and hence one has to employ batch algorithms. Letting Δ(y, ȳ) denote the multivariate discrepancy between the correct labels y := (y₁, …, y_n)⊤ and a candidate labeling ȳ (to be concretized later), the R_emp for the multivariate measure is formulated by [3] as
¹ In this paper we use the term cutting plane methods to denote the specialized solvers employed in machine learning. While clearly related, they must not be confused with cutting plane methods used in optimization.
R_emp(w) = max_{ȳ∈{−1,1}ⁿ} [ Δ(y, ȳ) + (1/n) Σ_{i=1}^n ⟨w, x_i⟩ (ȳ_i − y_i) ].  (2)
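For toy-sized n, the max in (2) can be evaluated by brute force enumeration over all 2ⁿ labelings, which is useful for checking implementations of the max oracle. This is an illustration with our own naming; the discrepancy Δ is passed in as a callable.

```python
import itertools
import numpy as np

def multivariate_remp(w, X, y, delta):
    """R_emp from (2): max over all 2^n labelings ybar of
    delta(y, ybar) + (1/n) * sum_i <w, x_i> (ybar_i - y_i).
    Exponential in n -- for illustration and testing only."""
    n = len(y)
    scores = X @ w                       # <w, x_i> for each example
    best = -np.inf
    for ybar in itertools.product([-1.0, 1.0], repeat=n):
        ybar = np.array(ybar)
        val = delta(y, ybar) + np.dot(scores, ybar - y) / n
        best = max(best, val)
    return best
```

A practical solver replaces the enumeration by the polynomial-time argmax oracle of [3].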
In another award winning paper by Joachims [3], the regularized risk minimization problems corresponding to these measures are optimized by using a CPM.

Given the widespread use of CPM in machine learning, it is important to understand their convergence guarantees in terms of the upper and lower bounds on the number of iterations needed to converge to an ε accurate solution. The tightest upper bound on the convergence speed of CPM, O(1/ε), is due to Teo et al. [2], who analyzed a restricted version of BMRM which only optimizes over one dual variable per iteration. However, on practical problems the observed rate of convergence is significantly faster than predicted by theory. Therefore, it had been conjectured that the upper bounds might be further tightened via a more refined analysis. In this paper we construct counter examples for both decomposable R_emp as in equation (1) and non-decomposable R_emp as in equation (2), on which CPM require Ω(1/ε) iterations to converge, thus disproving this conjecture². We will work with BMRM as our prototypical CPM. As Teo et al. [2] point out, BMRM includes many other CPM such as SVM-Perf as special cases.
Our results lead to the following natural question: Do the lower bounds hold because regularized risk minimization problems are fundamentally hard, or is it an inherent limitation of CPM? In other words, to solve problems such as (1), does there exist a solver which requires less than O(nd/ε) effort (better in n, d and ε)? We provide partial answers. To understand our contribution one needs to understand the two standard assumptions that are made when proving convergence rates:

A1: The data points x_i lie inside an L₂ (Euclidean) ball of radius R, that is, ‖x_i‖ ≤ R.
A2: The subgradient of R_emp is bounded, i.e., at any point w, there exists a subgradient g of R_emp such that ‖g‖ ≤ G < ∞.

Clearly assumption A1 is more restrictive than A2. By adapting a result due to [6] we show that one can devise an O(nd/√ε) algorithm for the case when assumption A1 holds. Finding a fast optimizer under assumption A2 remains an open problem.
Notation: Lower bold case letters (e.g., w, α) denote vectors, w_i denotes the i-th component of w, 0 refers to the vector with all zero components, e_i is the i-th coordinate vector (all 0s except 1 at the i-th coordinate) and Δ_k refers to the k dimensional simplex. Unless specified otherwise, ⟨·, ·⟩ denotes the Euclidean dot product ⟨x, w⟩ = Σ_i x_i w_i, and ‖·‖ refers to the Euclidean norm ‖w‖ := ⟨w, w⟩^{1/2}. We denote R̄ := R ∪ {∞}, and [t] := {1, …, t}.
Our paper is structured as follows. We briefly review BMRM in Section 2. Two types of lower bounds are subsequently defined in Section 3, and Section 4 contains descriptions of the various counter examples that we construct. In Section 5 we describe an algorithm which provably converges to an ε accurate solution of (1) in O(1/√ε) iterations under assumption A1.

2 BMRM

At iteration k, BMRM replaces R_emp in (1) by a piecewise linear lower bound built from subgradients a_i ∈ ∂R_emp(w_{i−1}) and intercepts b_i = R_emp(w_{i−1}) − ⟨a_i, w_{i−1}⟩, and optimizes

min_w J_k(w) := (λ/2)‖w‖² + max_{1≤i≤k} {⟨a_i, w⟩ + b_i}.  (3)

After each update a new cutting plane is added, and the algorithm stops as soon as the approximation gap

ε_k := min_{0≤t≤k} J(w_t) − J_k(w_k)  (4)

falls below a predefined tolerance ε.
Since J_k in (3) is a convex objective function, one can compute its dual. Instead of minimizing J_k with respect to w one can equivalently maximize the dual [2]:

D_k(α) = −(1/(2λ))‖A_k α‖² + ⟨b_k, α⟩,  α ∈ Δ_k,  (5)
² Because of the specialized nature of these solvers, lower bounds for general convex optimizers such as those studied by Nesterov [4] and Nemirovski and Yudin [5] do not apply.
Algorithm 1: qp-bmrm: solving the inner loop of BMRM exactly via a full QP.
Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)⊤.
2: α_k ← argmax_{α∈Δ_k} { −(1/(2λ))‖A_k α‖² + ⟨α, b_k⟩ }.
3: return w_k = −(1/λ) A_k α_k.

Algorithm 2: ls-bmrm: solving the inner loop of BMRM approximately via line search.
Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)⊤.
2: Set α(η) := (ηα_{k−1}⊤, 1 − η)⊤.
3: η_k ← argmax_{η∈[0,1]} { −(1/(2λ))‖A_k α(η)‖² + ⟨α(η), b_k⟩ }.
4: α_k ← (η_k α_{k−1}⊤, 1 − η_k)⊤.
5: return w_k = −(1/λ) A_k α_k.
and set α_k = argmax_{α∈Δ_k} D_k(α). Note that A_k and b_k in (5) are defined in Algorithm 1. Since maximizing D_k(α) is a quadratic programming (QP) problem, we call this algorithm qp-bmrm. Pseudo-code can be found in Algorithm 1.
Note that at iteration k the dual D_k(α) is a QP with k variables. As the number of iterations increases, the size of the QP also increases. In order to avoid the growing cost of the dual optimization at each iteration, [2] proposed using a one-dimensional line search to calculate an approximate maximizer α_k on the line segment {(ηα_{k−1}⊤, 1 − η)⊤ : η ∈ [0, 1]}; we call this algorithm ls-bmrm. Pseudo-code can be found in Algorithm 2.
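The inner maximization in ls-bmrm is a one dimensional concave quadratic over η ∈ [0, 1], so it has a closed form solution. The helper below is our own sketch (not the paper's code); it assumes the old model is summarized by the assumed quantities u = A_{k−1}α_{k−1} and B = ⟨α_{k−1}, b_{k−1}⟩, with a_new and b_new the newest cutting plane.

```python
import numpy as np

def ls_bmrm_eta(u, B, a_new, b_new, lam):
    """Maximize over eta in [0, 1] the ls-bmrm line search objective
    D(eta) = -1/(2*lam) * ||eta*u + (1-eta)*a_new||^2 + eta*B + (1-eta)*b_new,
    where u = A_{k-1} alpha_{k-1} and B = <alpha_{k-1}, b_{k-1}>."""
    p = u - a_new
    pp = np.dot(p, p)
    if pp == 0.0:                      # D is linear in eta with slope B - b_new
        return 1.0 if B > b_new else 0.0
    # stationary point of the concave quadratic, clipped to [0, 1]
    eta = (lam * (B - b_new) - np.dot(a_new, p)) / pp
    return float(np.clip(eta, 0.0, 1.0))
```

This makes each ls-bmrm iteration O(k) for the model bookkeeping plus one subgradient evaluation, instead of a growing QP.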
Generality of BMRM. Thanks to the formulation in (3), which only uses R_emp, BMRM is applicable to a wide variety of R_emp. For example, when used to train binary SVMs with R_emp specified by (1), it yields exactly the SVM-Perf algorithm [1]. When applied to optimize the multivariate score, e.g. the F₁-score with R_emp specified by (2), it immediately leads to the optimizer given by [3].
3 Upper and Lower Bounds

Since most rates of convergence discussed in the machine learning community are upper bounds, it is important to rigorously define the meaning of a lower bound with respect to ε, and to study its relationship with the upper bounds. At this juncture it is also important to clarify an important technical point. Instead of minimizing the objective function J(w) defined in (1), if we minimize a scaled version cJ(w), this scales the approximation gap (4) by c. Assumptions such as A1 and A2 fix this degree of freedom by bounding the scale of the objective function.

Given a function f ∈ F and an optimization algorithm A, suppose {w_k} are the iterates produced by the algorithm A when minimizing f. Define T(ε; f, A) as the first step index k when w_k becomes an ε accurate solution³:

T(ε; f, A) = min {k : f(w_k) − min_w f(w) ≤ ε}.  (6)

Upper and lower bounds are both properties of a pair F and A. A function g(ε) is called an upper bound of (F, A) if for all functions f ∈ F and all ε > 0, it takes at most order g(ε) steps for A to reduce the gap to less than ε, i.e.,

(UB) ∀ ε > 0, ∀ f ∈ F, T(ε; f, A) ≤ g(ε).  (7)
³ The initial point also matters, as in the best case we can just start from the optimal solution. Thus the quantity of interest is actually T(ε; f, A) := max_{w₀} min{k : f(w_k) − min_w f(w) ≤ ε, starting point being w₀}. However, without loss of generality we assume some pre-specified way of initialization.
Summary of upper bounds (UB), strong lower bounds (SLB) and weak lower bounds (WLB):

Algorithms | Assuming A1: UB, SLB, WLB | Assuming A2: UB, SLB, WLB
ls-bmrm    | O(1/ε), Ω(1/ε), Ω(1/ε)   | O(1/ε), Ω(1/ε), Ω(1/ε)
qp-bmrm    | O(1/ε), open, open       | O(1/ε), open, Ω(1/ε)
Nesterov   | O(1/√ε), Ω(1/√ε), Ω(1/√ε) | –

A function h(ε) is called a strong lower bound of (F, A) if there exists a single function f ∈ F on which A takes at least order h(ε) steps for every ε:

(SLB) ∃ f ∈ F, s.t. ∀ ε > 0, T(ε; f, A) ≥ h(ε).  (8)

A weaker notion allows the hard function to depend on ε:

(WLB) ∀ ε > 0, ∃ f_ε ∈ F, s.t. T(ε; f_ε, A) ≥ h(ε).  (9)
Clearly, the existence of a SLB implies a WLB. However, it is usually much harder to establish SLB than WLB. Fortunately, WLBs are sufficient to refute upper bounds or to establish their tightness. The size of the function class F affects the upper and lower bounds in opposite ways. Suppose F₁ ⊆ F₂: any upper bound for (F₂, A) remains valid for (F₁, A), while a lower bound established on (F₁, A) automatically applies to (F₂, A).

4 Lower Bounds

4.1 Strong Lower Bounds for ls-bmrm

Consider running ls-bmrm with λ = 1/16 on the one dimensional objective

J(w) = (1/32)w² + (1/2)[1 − w/2]₊ + (1/2)[1 − w]₊,  (10)

which is an instance of (1) on the two example training set x₁ = 1/2, x₂ = 1, y₁ = y₂ = +1. The minimizer of J(w) is w* = 2: indeed 0 ∈ ∂J(2) = {2/16 − (1/2)(1/2)η : η ∈ [0, 1]}. So J(w*) = 1/8. Choosing w₀ = 0, we have

Theorem 2 lim_{k→∞} k (J(w_k) − J(w*)) = 1/4, i.e. J(w_k) converges to J(w*) at 1/k rate.
The proof relies on two lemmata. The first shows that the iterates generated by ls-bmrm on J(w) satisfy the following recursive relations.

Lemma 3 For k ≥ 1, the following recursive relations hold true:

w_{2k+1} = 2 + [8α_{2k−1,1}(w_{2k−1} − 4α_{2k−1,1})] / [w_{2k−1}(w_{2k−1} + 4α_{2k−1,1})] > 2, and w_{2k} = 2 − 8α_{2k−1,1}/w_{2k−1} ∈ (1, 2),  (11)

α_{2k+1,1} = [(w²_{2k−1} + 16α²_{2k−1,1}) / (w_{2k−1} + 4α_{2k−1,1})²] α_{2k−1,1}, where α_{2k+1,1} is the first coordinate of α_{2k+1}.  (12)
The proof is lengthy and is relegated to Appendix B. These recursive relations allow us to derive the convergence rate of α_{2k−1,1} and w_k (see proof in Appendix C):

Lemma 4 lim_{k→∞} k α_{2k−1,1} = 1/4. Combining with (11), we get lim_{k→∞} k|2 − w_k| = 2.
Now that w_k approaches 2 at the rate of O(1/k), it is finally straightforward to translate this into the rate at which J(w_k) approaches J(w*); see Appendix D.

4.2 Weak Lower Bounds for Training Binary SVMs using qp-bmrm

Given ε > 0, choose n ≥ 1/(2ε) and construct a dataset {(x_i, y_i)}_{i=1}^n with y_i = +1 and x_i = √n e₁ + n e_{i+1} ∈ R^{n+1}. Then the corresponding objective function (1) with λ = 1 is

J(w) = ‖w‖²/2 + R_emp(w), where R_emp(w) = (1/n) Σ_{i=1}^n [1 − y_i⟨w, x_i⟩]₊ = (1/n) Σ_{i=1}^n [1 − √n w₁ − n w_{i+1}]₊.  (13)
It is easy to see that the minimizer is w* = (1/2)(1/√n, 1/n, 1/n, …, 1/n)⊤ and J(w*) = 1/(4n). In fact, simply check that y_i⟨w*, x_i⟩ = 1, so ∂J(w*) = {w* − (1/n) Σ_{i=1}^n β_i y_i x_i : β_i ∈ [0, 1]}, and setting all β_i = 1/(2n) yields the subgradient 0. Our key result is the following theorem.
Theorem 5 Let w₀ = (1/√n, 0, 0, …)⊤. Then qp-bmrm takes at least 1/(2ε) steps to find an ε accurate solution. Formally,

min_{i∈[k]} J(w_i) − J(w*) = 1/(2k) + 1/(4n) for all k ∈ [n], hence min_{i∈[k]} J(w_i) − J(w*) > ε for all k ≤ 1/(2ε).
Proof. Since R_emp(w₀) = 0 and ∂R_emp(w₀) = {−(1/n) Σ_i β_i y_i x_i : β_i ∈ [0, 1]}, we can choose

a₁ = −(1/n) y₁ x₁ = (−1/√n, −1, 0, …)⊤, b₁ = R_emp(w₀) − ⟨a₁, w₀⟩ = 0 + 1/n = 1/n, and

w₁ = argmin_w { (1/2)‖w‖² + ⟨a₁, w⟩ + b₁ } = (1/√n, 1, 0, …)⊤.

In general, we claim that the k-th iterate w_k produced by qp-bmrm is given by

w_k = (1/√n, 1/k, …, 1/k, 0, …)⊤ (k copies of 1/k).
We prove this claim by induction on k. Assume the claim holds true for steps 1, …, k; then it is easy to check that R_emp(w_k) = 0 and ∂R_emp(w_k) = {−(1/n) Σ_{i=k+1}^n β_i y_i x_i : β_i ∈ [0, 1]}. Thus we can again choose

a_{k+1} = −(1/n) y_{k+1} x_{k+1}, and b_{k+1} = R_emp(w_k) − ⟨a_{k+1}, w_k⟩ = 1/n, so

w_{k+1} = argmin_w { (1/2)‖w‖² + max_{1≤i≤k+1} {⟨a_i, w⟩ + b_i} } = (1/√n, 1/(k+1), …, 1/(k+1), 0, …)⊤ (k+1 copies of 1/(k+1)),

which can be verified by checking that 0 ∈ ∂J_{k+1}(w_{k+1}) = {w_{k+1} + Σ_{i∈[k+1]} α_i a_i : α ∈ Δ_{k+1}} (setting all α_i = 1/(k+1)). All that remains is to observe that J(w_k) = 1/(2k) + 1/(2n) while J(w*) = 1/(4n), from which it follows that J(w_k) − J(w*) = 1/(2k) + 1/(4n) as claimed.
As an aside, the subgradient of the R_emp in (13) does have Euclidean norm √(2n) at w = 0. However, in the above run of qp-bmrm, ∂R_emp(w₀), …, ∂R_emp(w_n) always contain a subgradient with norm √(1 + 1/n) ≤ √2. So if we restrict the feasible region to a suitable box containing all the iterates and w*, then J(w) does satisfy the assumption A2 and the optimal solution does not change. This is essentially a local satisfaction of A2. In fact, having a bounded subgradient of R_emp at all w_k is sufficient for qp-bmrm to converge at the rate in Theorem 1.

However, when we assume A1, which is more restrictive than A2, it remains an open question to determine whether the O(1/ε) rates are optimal for qp-bmrm on SVM objectives. Also left open is the SLB for qp-bmrm on SVMs.
4.3 Weak Lower Bounds for Optimizing F₁-score using qp-bmrm

The F₁-score is defined by using the contingency table: F₁(ȳ, y) := 2a/(2a + b + c).
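From the contingency table, the F₁-score can be computed directly. A small helper with our own naming, following the table's convention (a = true positives, b = false negatives, c = false positives):

```python
def f1_score(y_true, y_pred):
    """F1 = 2a / (2a + b + c) from the contingency table:
    a = #(y=1, ybar=1), b = #(y=1, ybar=-1), c = #(y=-1, ybar=1)."""
    a = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    b = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return 2 * a / (2 * a + b + c)
```

Note that F₁ is symmetric in b and c, so swapping the roles of false positives and false negatives leaves the score unchanged.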
Given ε > 0, define n = ⌊1/ε⌋ + 1 and construct a dataset {(x_i, y_i)}_{i=1}^n as follows: x_i = −(n/(2√3)) e₁ − (n/2) e_{i+1} ∈ R^{n+1} with y_i = −1 for all i ∈ [n−1], and x_n = (√3 n/2) e₁ + (n/2) e_{n+1} ∈ R^{n+1} with y_n = +1. So there is only one positive training example. The contingency table is

         ȳ = 1   ȳ = −1
y = 1      a       b
y = −1     c       d

Then the corresponding objective function is
J(w) = (1/2)‖w‖² + max_ȳ { 1 − F₁(ȳ, y) + (1/n) Σ_{i=1}^n y_i⟨w, x_i⟩ (y_i ȳ_i − 1) }.  (14)
Theorem 6 Let w₀ = (1/√3) e₁. Then qp-bmrm takes at least 1/(3ε) steps to find an ε accurate solution of (14). In particular, the iterates satisfy

w_k = (1/√3, 1/k, …, 1/k, 0, …)⊤ (k copies of 1/k), ∀ k ∈ [n − 1].  (15)
We prove (15) by induction. Assume it holds for steps 1, …, k. Then at step k + 1 we have

(1/n) y_i⟨w_k, x_i⟩ = 1/6 + 1/(2k) if i ∈ [k]; 1/6 if k + 1 ≤ i ≤ n − 1; 1/2 if i = n.  (16)
For convenience, define the term in the max in (14) as

Δ_k(ȳ) := 1 − F₁(ȳ, y) + (1/n) Σ_{i=1}^n y_i⟨w_k, x_i⟩ (y_i ȳ_i − 1).

Then it is not hard to see that the following assignments of ȳ (among others) maximize Δ_k: a) the correct labeling, b) only misclassify the positive training example x_n (i.e., ȳ_n = −1), c) only misclassify one negative training example in {x_{k+1}, …, x_{n−1}} into positive. And Δ_k equals 0 at all these assignments. For a proof, consider two cases. If ȳ misclassifies the positive training example, then F₁(ȳ, y) = 0 and by (16) we have

Δ_k(ȳ) = 1 − 0 + (1/n) Σ_{i=1}^{n−1} y_i⟨w_k, x_i⟩(y_i ȳ_i − 1) + (1/2)(−1 − 1) = ((k+3)/(6k)) Σ_{i=1}^{k} (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1) ≤ 0.
Suppose ȳ correctly labels the positive example, but misclassifies t₁ examples in {x₁, …, x_k} and t₂ examples in {x_{k+1}, …, x_{n−1}} (into positive). Then F₁(ȳ, y) = 2/(2 + t₁ + t₂), and

Δ_k(ȳ) = 1 − 2/(2 + t₁ + t₂) + (1/6 + 1/(2k)) Σ_{i=1}^{k} (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1)
= (t₁ + t₂)/(2 + t₁ + t₂) − (1/3 + 1/k) t₁ − (1/3) t₂ ≤ (t − t²)/(3(2 + t)) ≤ 0  (t := t₁ + t₂).
So we can pick ȳ = (−1, …, −1, +1, −1, …, −1, +1)⊤, with the first k coordinates −1 and the +1 in coordinate k + 1 misclassifying a single negative example. Then

a_{k+1} = −(2/n) y_{k+1} x_{k+1} = (−1/√3) e₁ − e_{k+2}, b_{k+1} = R_emp(w_k) − ⟨a_{k+1}, w_k⟩ = 0 + 1/3 = 1/3,

w_{k+1} = argmin_w { (1/2)‖w‖² + max_{i∈[k+1]} {⟨a_i, w⟩ + b_i} } (=: J_{k+1}(w)) = (1/√3, 1/(k+1), …, 1/(k+1), 0, …)⊤ (k+1 copies of 1/(k+1)),

which can be verified by checking that 0 ∈ ∂J_{k+1}(w_{k+1}) = {w_{k+1} + Σ_{i=1}^{k+1} α_i a_i : α ∈ Δ_{k+1}} (setting all α_i = 1/(k+1)). So (15) holds for step k + 1. End of induction.
All that remains is to observe that J(w_k) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(w_k) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) as claimed in Theorem 6.
5 An O(nd/√ε) Algorithm when Assumption A1 Holds

Our algorithm is based on a technique of [6], which applies to composite objectives of the form

min_{w∈Q₁} J(w) := f(w) + g*(Aw), where Q₁ is a closed convex set.  (17)

Here, f(w) is a strongly convex function corresponding to the regularizer, Aw stands for the output of a linear model, and g* encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model. Let the domain of g be Q₂. It is well known that [e.g. 7, Theorem 3.3.5] under some mild constraint qualifications, the adjoint form of J(w):

D(α) = −g(α) − f*(−A⊤α), α ∈ Q₂  (18)

satisfies J(w) ≥ D(α) and inf_{w∈Q₁} J(w) = sup_{α∈Q₂} D(α).
Example 1: binary SVMs with bias. Let A := −Y X where Y := diag(y₁, …, y_n) and X := (x₁, …, x_n)⊤, f(w) = (λ/2)‖w‖², and g*(u) = min_{b∈R} (1/n) Σ_i [1 + u_i − y_i b]₊, which corresponds to g(α) = −Σ_i α_i. Then the adjoint form turns out to be the well known SVM dual objective function:

D(α) = Σ_i α_i − (1/(2λ)) α⊤ Y X X⊤ Y α, Q₂ = { α ∈ [0, n⁻¹]ⁿ : Σ_i y_i α_i = 0 }.  (19)
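The dual objective (19) is cheap to evaluate. A sketch with our own naming; the feasibility constraints α ∈ [0, n⁻¹]ⁿ and Σ_i y_i α_i = 0 are assumed to hold and are not enforced here.

```python
import numpy as np

def svm_dual(alpha, X, y, lam):
    """SVM dual objective D(alpha) from (19):
    sum_i alpha_i - 1/(2*lam) * alpha^T Y X X^T Y alpha."""
    v = X.T @ (y * alpha)              # X^T Y alpha, a d-dimensional vector
    return alpha.sum() - np.dot(v, v) / (2.0 * lam)
```

By weak duality (18), any feasible α gives a lower bound D(α) ≤ J(w) on the primal, which is how the duality gap used in Section 5 is measured.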
Example 2: multivariate scores. Denote by A the 2ⁿ-by-d matrix whose ȳ-th row is Σ_i x_i⊤(ȳ_i − y_i) for each ȳ ∈ {−1, +1}ⁿ, and let f(w) = (λ/2)‖w‖², g*(u) = max_ȳ { Δ(y, ȳ) + (1/n) u_ȳ }. With g(α) = −n Σ_ȳ Δ(y, ȳ) α_ȳ, we recover the primal objective (2) for the multivariate performance measure. Its adjoint form is

D(α) = −(1/(2λ)) ‖A⊤α‖² + n Σ_ȳ Δ(y, ȳ) α_ȳ, Q₂ = { α ∈ [0, n⁻¹]^{2ⁿ} : Σ_ȳ α_ȳ = n⁻¹ }.  (20)
In a series of papers [6, 8, 9], Nesterov developed optimal gradient based methods for minimizing composite objectives with primal (17) and adjoint (18). A sequence of w_k and α_k is produced such that under assumption A1 the duality gap J(w_k) − D(α_k) is reduced to less than ε after at most k = O(1/√ε) iterations; since each iteration costs O(nd) on our problems, this yields an O(nd/√ε) algorithm. A self-contained description of the algorithm is given in Appendix G.
Example 2: multivariate scores. Since the dimension of Q₂ in (20) is exponentially large in n, Euclidean projection is intractable and we resort to Bregman projection. Given a differentiable convex function F on Q₂, a point α, and a direction g, we can define the Bregman projection as:

V(α, g) := argmin_{α′∈Q₂} { F(α′) − ⟨∇F(α) − g, α′⟩ }.

Scaling up α by a factor of n, we can choose F(α) as the negative entropy F(α) = Σ_i α_i log α_i. Then the application of the algorithm in [8] will endow a distribution over all possible labelings:

p(ȳ; w) ∝ exp( c Δ(ȳ, y) + Σ_i a_i ⟨x_i, w⟩ ȳ_i ), where c and the a_i are constant scalars.  (21)
The solver will request the expectation E_ȳ[Σ_i a_i x_i ȳ_i], which in turn requires the marginal distributions p(ȳ_i). This is not as straightforward as in graphical models because Δ(ȳ, y) may not decompose. Fortunately, for multivariate scores defined by contingency tables, it is possible to compute the marginals in O(n²) time by using dynamic programming, and this cost is similar to that of the algorithm proposed by [3]. The idea of the dynamic programming can be found in Appendix H.
6 Outlook and Conclusion

CPM are widely employed in machine learning, especially in the context of structured prediction [16]. While upper bounds on their rates of convergence were known, lower bounds were not studied before. In this paper we set out to fill this gap by exhibiting counter examples on which CPM require Ω(1/ε) iterations. This is a fundamental limitation of these algorithms and not an artifact of the problem. We show this by devising an O(1/√ε) algorithm with O(nd/√ε) computational complexity for binary SVMs with bias, as discussed in Section 5. However, this method has been rediscovered independently by many authors (including us), with the earliest known reference, to the best of our knowledge, being [10] in 1990. Some recent work in optimization [11] has focused on improving the practical performance, while in machine learning [12] gave an exact projection algorithm in linear time and [13] gave an expected linear time algorithm for the same problem.
Choosing an optimizer for a given machine learning task is a trade-off between a number of potentially conflicting requirements. CPM are one popular choice, but there are others. If one is interested in classification accuracy alone, without requiring deterministic guarantees, then online to batch conversion techniques combined with stochastic subgradient descent are a good choice [17]. While the dependence on ε is still Ω(1/ε) or worse [18], one gets bounds independent of n. However, as we pointed out earlier, these algorithms are applicable only when the empirical risk decomposes over the examples.
On the other hand, one can employ coordinate descent in the dual, as is done in the Sequential Minimal Optimization (SMO) algorithm of [19]. However, as [20] show, the rate of convergence of such solvers depends on the kernel matrix X X⊤ obtained by stacking the x_i into a matrix X.

A Preliminaries

Given a convex function f, a vector g is called a subgradient of f at w if f(w′) ≥ f(w) + ⟨w′ − w, g⟩ for all w′. The set of all such g vectors is called the subdifferential of f at w, denoted by ∂_w f(w). For any convex function f, ∂_w f(w) must be nonempty. Furthermore, if it is a singleton then f is said to be differentiable at w, and we use ∇f(w) to denote the gradient.
Definition 8 A convex function f : Rⁿ → R̄ is strongly convex with respect to a norm ‖·‖ if there exists a constant σ > 0 such that f − (σ/2)‖·‖² is convex. σ is called the modulus of strong convexity of f, and for brevity we will call f σ-strongly convex.

Definition 9 Suppose a function f : Rⁿ → R̄ is differentiable on Q ⊆ Rⁿ. Then f is said to have Lipschitz continuous gradient (l.c.g) with respect to a norm ‖·‖ if there exists a constant L such that

‖∇f(w) − ∇f(w′)‖ ≤ L ‖w − w′‖ ∀ w, w′ ∈ Q.

For brevity, we will call f L-l.c.g.
Definition 10 The Fenchel dual of a function f : Rⁿ → R̄ is the function f* : Rⁿ → R̄ defined by

f*(w*) = sup_{w∈Rⁿ} { ⟨w, w*⟩ − f(w) }.

Strong convexity and l.c.g are related by Fenchel duality according to the following lemma:

Theorem 11 ([21, Theorem 4.2.1 and 4.2.2])
1. If f : Rⁿ → R̄ is σ-strongly convex, then f* is finite on Rⁿ and f* is (1/σ)-l.c.g.
2. If f : Rⁿ → R̄ is convex, differentiable on Rⁿ, and L-l.c.g, then f* is (1/L)-strongly convex.

Finally, the following lemma gives a useful characterization of the minimizer of a convex function.
Lemma 12 ([21, Theorem 2.2.1]) A convex function f is minimized at w* if, and only if, 0 ∈ ∂f(w*).

B Proof of Lemma 3

We prove the recursive relations by induction over the odd iterates p = 2k − 1. At such an iterate the cutting plane model consists of

A_p = (−3/4, 0, −1/4, 0, …, 0, −1/4), b_p = (1, 0, 1/2, …, 0, 1/2)⊤,

and w_p = −16 A_p α_p = 16 ( (3/4)α_{p,1} + (1/4)α_{p,3} + … + (1/4)α_{p,p−2} + (1/4)α_{p,p} ), i.e.

α_{p,3} + … + α_{p,p−2} + α_{p,p} = w_p/4 − 3α_{p,1}.
So

b_p⊤α_p = α_{p,1} + (1/2)α_{p,3} + (1/2)α_{p,5} + … + (1/2)α_{p,p−2} + (1/2)α_{p,p} = (1/8)w_p − (1/2)α_{p,1}.
Since w_p > 2, we have a_{p+1} = 0 and b_{p+1} = 0, so A_{p+1} = (A_p, 0) and b_{p+1} = (b_p⊤, 0)⊤. Let α_{p+1} = (ηα_p⊤, 1 − η)⊤; then D_{p+1}(η) = −8η²(A_p α_p)² + η b_p⊤α_p. So

η_{p+1} = (b_p⊤α_p) / (16 (A_p α_p)²) = (2w_p − 8α_{p,1}) / w_p², w_{p+1} = −16 A_p α_p η_{p+1} = w_p η_{p+1} = 2 − 8α_{p,1}/w_p < 2,  (22)

which proves the claim in (11) for the even iterates, as p + 1 = 2k.
Since α_{2,1} = 1/9, and α_{k,1} ≥ α_{k+1,1} for all k due to the update rule of ls-bmrm, we have for p ≥ 3

8α_{p,1} ≤ 8/9 < 2 < w_p, hence w_{p+1} > 1.  (23)
Next, since w_{p+1} ∈ (1, 2), we have a_{p+2} = −1/4 and b_{p+2} = 1/2, so

A_{p+2} = (A_p, 0, −1/4), b_{p+2} = (b_p⊤, 0, 1/2)⊤.
Let α_{p+2}(η) = (ηη_{p+1}α_p⊤, η(1 − η_{p+1}), 1 − η)⊤. Then

A_{p+2} α_{p+2} = ηη_{p+1} A_p α_p − (1/4)(1 − η), b_{p+2}⊤α_{p+2} = ηη_{p+1} b_p⊤α_p + (1/2)(1 − η).
D_{p+2}(η) = −8(A_{p+2}α_{p+2})² + b_{p+2}⊤α_{p+2} = −(1/2)(4η_{p+1}A_pα_p + 1)² η² + (4η_{p+1}A_pα_p + η_{p+1}b_p⊤α_p + 1/2) η + const,

where const denotes terms independent of η. So

η_{p+2} = argmax_{η∈[0,1]} D_{p+2}(η) = (4η_{p+1}A_pα_p + η_{p+1}b_p⊤α_p + 1/2) / (4η_{p+1}A_pα_p + 1)² = (w_p² + 16α_{p,1}²) / (w_p + 4α_{p,1})²,  (24)
w_{p+2} = −16 A_{p+2} α_{p+2} = −16 η_{p+2} η_{p+1} A_p α_p + 4(1 − η_{p+2}) = 2 + [8α_{p,1}(w_p − 4α_{p,1})] / [w_p(w_p + 4α_{p,1})],

where the last step is by plugging in the expression of η_{p+1} in (22) and η_{p+2} in (24). Now using (23) we get

w_{p+2} − 2 = [8α_{p,1}(w_p − 4α_{p,1})] / [w_p(w_p + 4α_{p,1})] > 0,

which proves the claim in (11) for the odd iterates, as p + 2 = 2k + 1.
C Proof of Lemma 4

The proof is based on (12). Let φ_k = 1/α_{2k−1,1}; then lim_{k→∞} φ_k = ∞ because lim_{k→∞} α_{2k−1,1} = 0. Now

lim_{k→∞} k α_{2k−1,1} = lim_{k→∞} [ φ_k / k ]^{−1} = lim_{k→∞} [ φ_{k+1} − φ_k ]^{−1},

where the last step is by the discrete version of L'Hospital's rule.
To compute lim_{k→∞} (φ_{k+1} − φ_k) we plug the definition φ_k = 1/α_{2k−1,1} into (12), which gives:

1/φ_{k+1} = [ (w²_{2k−1} + 16(1/φ_k)²) / (w_{2k−1} + 4(1/φ_k))² ] (1/φ_k),

hence

φ_{k+1} − φ_k = 8 w_{2k−1} φ_k² / (w²_{2k−1} φ_k² + 16) = 8 w_{2k−1} / (w²_{2k−1} + 16/φ_k²).
Since lim_{k→∞} w_k = 2 and lim_{k→∞} φ_k = ∞, we conclude

lim_{k→∞} k α_{2k−1,1} = lim_{k→∞} [ φ_{k+1} − φ_k ]^{−1} = [ 8·2/4 ]^{−1} = 1/4.
D Proof of Theorem 2

Denote δ_k = 2 − w_k; then lim_{k→∞} k|δ_k| = 2 by Lemma 4. So:

If δ_k > 0, then J(w_k) − J(w*) = (1/32)(2 − δ_k)² + (1/2)(δ_k/2) − 1/8 = (1/8)δ_k + (1/32)δ_k² = (1/8)|δ_k| + (1/32)δ_k².

If δ_k ≤ 0, then J(w_k) − J(w*) = (1/32)(2 − δ_k)² − 1/8 = −(1/8)δ_k + (1/32)δ_k² = (1/8)|δ_k| + (1/32)δ_k².

Combining these two cases, we conclude lim_{k→∞} k (J(w_k) − J(w*)) = (1/8)·2 = 1/4.
E Proof of Theorem 6

The crux of the proof is to show that

w_k = (1/√3, 1/k, …, 1/k, 0, …)⊤ (k copies of 1/k), ∀ k ∈ [n − 1].  (25)

At the first iteration, we have

(1/n) y_i⟨w₀, x_i⟩ = 1/6 if i ∈ [n − 1]; 1/2 if i = n.  (26)

For convenience, define the term in the max of (14) as

Δ₀(ȳ) := 1 − F₁(ȳ, y) + (1/n) Σ_{i=1}^n y_i⟨w₀, x_i⟩ (y_i ȳ_i − 1).
The key observation in the context of the F₁ score is that Δ₀(ȳ) is maximized at any of the following assignments of (ȳ₁, …, ȳ_n), and it is easy to check that they all give Δ₀(ȳ) = 0:

(−1, …, −1, +1), (−1, …, −1, −1), (+1, −1, …, −1, +1), …, (−1, …, −1, +1, +1).

The first assignment is just the correct labeling of the training examples. The second assignment just misclassifies the only positive example x_n into negative. The remaining n − 1 assignments only misclassify a single negative example into positive. To prove that they maximize Δ₀(ȳ), consider two cases of ȳ. First, the positive training example is misclassified. Then F₁(ȳ, y) = 0 and by (26) we have

Δ₀(ȳ) = 1 − 0 + (1/n) Σ_{i=1}^{n−1} y_i⟨w₀, x_i⟩ (y_i ȳ_i − 1) + (1/2)(−1 − 1) = (1/6) Σ_{i=1}^{n−1} (y_i ȳ_i − 1) ≤ 0.
Second, consider the case where ȳ correctly labels the positive example, while t ≥ 1 negative examples are misclassified. Then F₁(ȳ, y) = 2/(2 + t), and

Δ₀(ȳ) = 1 − 2/(2 + t) + (1/6) Σ_{i=1}^{n−1} (y_i ȳ_i − 1) = t/(2 + t) − (1/3)t = (t − t²)/(3(2 + t)) ≤ 0, ∀ t ∈ [1, n − 1].
So now suppose we pick

ȳ¹ = (+1, −1, −1, …, −1, +1)⊤,

i.e. we just misclassify the first negative training example. Then

a₁ = −(2/n) y₁ x₁ = (−1/√3, −1, 0, …)⊤, b₁ = R_emp(w₀) − ⟨a₁, w₀⟩ = 0 + 1/3 = 1/3,

w₁ = argmin_w { (1/2)‖w‖² + ⟨a₁, w⟩ + b₁ } = (1/√3, 1, 0, …)⊤.
Next, we prove (25) by induction. Assume that it holds for steps 1, …, k. Then at step k + 1 it is easy to check that

(1/n) y_i⟨w_k, x_i⟩ = 1/6 + 1/(2k) if i ∈ [k]; 1/6 if k + 1 ≤ i ≤ n − 1; 1/2 if i = n.  (27)
Define

Δ_k(ȳ) := 1 − F₁(ȳ, y) + (1/n) Σ_{i=1}^n y_i⟨w_k, x_i⟩ (y_i ȳ_i − 1).

Then it is not hard to see that the following ȳ (among others) maximize Δ_k: a) the correct labeling, b) only misclassify the positive training example x_n, c) only misclassify one negative training example in {x_{k+1}, …, x_{n−1}}. And Δ_k equals 0 at all these assignments. For the proof, again consider two cases.
If ȳ misclassifies the positive training example, then F₁(ȳ, y) = 0 and by (27) we have

Δ_k(ȳ) = 1 − 0 + (1/n) Σ_{i=1}^{n−1} y_i⟨w_k, x_i⟩ (y_i ȳ_i − 1) + (1/2)(−1 − 1)
= (1/6 + 1/(2k)) Σ_{i=1}^{k} (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1) ≤ 0.
If ȳ correctly labels the positive example, but misclassifies t₁ examples in {x₁, …, x_k} and t₂ examples in {x_{k+1}, …, x_{n−1}} (into positive), then F₁(ȳ, y) = 2/(2 + t₁ + t₂), and

Δ_k(ȳ) = 1 − 2/(2 + t₁ + t₂) + (1/6 + 1/(2k)) Σ_{i=1}^{k} (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1)
= (t₁ + t₂)/(2 + t₁ + t₂) − (1/3 + 1/k) t₁ − (1/3) t₂ ≤ (t − t²)/(3(2 + t)) ≤ 0  (t := t₁ + t₂).
So we can pick ȳ = (−1, …, −1, +1, −1, …, −1, +1)⊤, with the first k coordinates −1 and the +1 in coordinate k + 1 misclassifying a single negative example. Then

a_{k+1} = −(2/n) y_{k+1} x_{k+1} = (−1/√3) e₁ − e_{k+2}, b_{k+1} = R_emp(w_k) − ⟨a_{k+1}, w_k⟩ = 0 + 1/3 = 1/3,

w_{k+1} = argmin_w { (1/2)‖w‖² + max_{i∈[k+1]} {⟨a_i, w⟩ + b_i} } (=: J_{k+1}(w)) = (1/√3, 1/(k+1), …, 1/(k+1), 0, …)⊤ (k+1 copies of 1/(k+1)).
This can be verified by checking that 0 ∈ ∂J_{k+1}(w_{k+1}) = {w_{k+1} + Σ_{i=1}^{k+1} α_i a_i : α ∈ Δ_{k+1}} (setting all α_i = 1/(k+1)). So (25) holds for step k + 1, which completes the induction. All that remains is to observe that J(w_k) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(w_k) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)), as claimed by Theorem 6.
F A linear time algorithm for a box constrained diagonal QP with a single linear equality constraint

It can be shown that the dual optimization problem D(α) from (19) can be reduced to a box constrained diagonal QP with a single linear equality constraint. In this section, we focus on the following simple QP:

min_α (1/2) Σ_{i=1}^n d_i² (α_i − m_i)²  s.t. l_i ≤ α_i ≤ u_i ∀ i ∈ [n]; Σ_{i=1}^n σ_i α_i = z.
Without loss of generality, we assume l_i < u_i and d_i ≠ 0 for all i. Also assume σ_i ≠ 0, because otherwise α_i can be solved for independently. To make the feasible region nonempty, we also assume

Σ_i (δ(σ_i > 0) l_i + δ(σ_i < 0) u_i) σ_i ≤ z ≤ Σ_i (δ(σ_i > 0) u_i + δ(σ_i < 0) l_i) σ_i.
The algorithm we describe below stems from [10] and finds the exact optimal solution in O(n) time, faster than the O(n log n) complexity of [13]. With the simple change of variable β_i = σ_i(α_i − m_i), the problem is simplified as

min_β (1/2) Σ_{i=1}^n d̄_i² β_i²  s.t. l̄_i ≤ β_i ≤ ū_i ∀ i ∈ [n]; Σ_{i=1}^n β_i = z̄,
where

l̄_i = σ_i(l_i − m_i) if σ_i > 0, and σ_i(u_i − m_i) if σ_i < 0;
ū_i = σ_i(u_i − m_i) if σ_i > 0, and σ_i(l_i − m_i) if σ_i < 0;
d̄_i² = d_i²/σ_i², z̄ = z − Σ_i σ_i m_i.
We derive its dual via the standard Lagrangian:

L = (1/2) Σ_i d̄_i² β_i² − Σ_i ρ_i⁺(β_i − l̄_i) + Σ_i ρ_i⁻(β_i − ū_i) − λ(Σ_i β_i − z̄).

Taking derivatives:

∂L/∂β_i = d̄_i² β_i − ρ_i⁺ + ρ_i⁻ − λ = 0 ⟹ β_i = d̄_i⁻² (ρ_i⁺ − ρ_i⁻ + λ).  (28)
Substituting into L, we get the dual optimization problem

min_{λ, ρ⁺, ρ⁻} D(λ, ρ⁺, ρ⁻) = (1/2) Σ_i d̄_i⁻² (ρ_i⁺ − ρ_i⁻ + λ)² − Σ_i ρ_i⁺ l̄_i + Σ_i ρ_i⁻ ū_i − λ z̄  s.t. ρ_i⁺ ≥ 0, ρ_i⁻ ≥ 0 ∀ i ∈ [n].

Taking the derivative of D with respect to λ, we get:

Σ_i d̄_i⁻² (ρ_i⁺ − ρ_i⁻ + λ) − z̄ = 0.  (29)
The KKT conditions give:

ρ_i⁺ (β_i − l̄_i) = 0,  (30a)
ρ_i⁻ (β_i − ū_i) = 0.  (30b)
Now we enumerate four cases.

1. ρ_i⁺ > 0, ρ_i⁻ > 0. This implies that l̄_i = β_i = ū_i, which contradicts our assumption.
2. ρ_i⁺ = 0, ρ_i⁻ = 0. Then by (28), β_i = d̄_i⁻² λ ∈ [l̄_i, ū_i], hence λ ∈ [d̄_i² l̄_i, d̄_i² ū_i].
3. ρ_i⁺ > 0, ρ_i⁻ = 0. Now by (30) and (28), we have l̄_i = β_i = d̄_i⁻²(ρ_i⁺ + λ) > d̄_i⁻² λ, hence λ < d̄_i² l̄_i and ρ_i⁺ = d̄_i² l̄_i − λ.
4. ρ_i⁺ = 0, ρ_i⁻ > 0. Now by (30) and (28), we have ū_i = β_i = d̄_i⁻²(−ρ_i⁻ + λ) < d̄_i⁻² λ, hence λ > d̄_i² ū_i and ρ_i⁻ = λ − d̄_i² ū_i.
In sum, we have ρ_i⁺ = [d̄_i² l̄_i − λ]₊ and ρ_i⁻ = [λ − d̄_i² ū_i]₊. Now (29) turns into
Figure 1: The function h_i(λ): equal to l̄_i for λ ≤ d̄_i² l̄_i, linear with slope d̄_i⁻² for λ ∈ [d̄_i² l̄_i, d̄_i² ū_i], and equal to ū_i for λ ≥ d̄_i² ū_i.
Algorithm 3: O(n) algorithm to find the root of f(λ). Ignoring boundary condition checks.
1: Set kink set S ← {d̄_i² l̄_i : i ∈ [n]} ∪ {d̄_i² ū_i : i ∈ [n]}.
2: while |S| > 2 do
3:   Find the median of S: m ← MED(S).
4:   if f(m) ≥ 0 then
5:     S ← {x ∈ S : x ≤ m}.
6:   else
7:     S ← {x ∈ S : x ≥ m}.
8:   end if
9: end while
10: Return root (l f(u) − u f(l)) / (f(u) − f(l)), where S = {l, u}.
f(λ) := Σ_i d̄_i⁻² ( [d̄_i² l̄_i − λ]₊ − [λ − d̄_i² ū_i]₊ + λ )  [the i-th summand =: h_i(λ)]  − z̄ = 0.  (31)
In other words, we only need to find the root of f(λ) in (31). h_i(λ) is plotted in Figure 1. Note that h_i(λ) is a monotonically increasing function of λ, so the whole f(λ) is monotonically increasing in λ. Since f(−∞) ≤ 0 by z̄ ≥ Σ_i l̄_i and f(+∞) ≥ 0 by z̄ ≤ Σ_i ū_i, the root must exist.
Considering that f has at most 2n kinks (nonsmooth points) and is linear between two adjacent kinks, the simplest idea is to sort {d̄_i² l̄_i, d̄_i² ū_i : i ∈ [n]} into s_(1) ≤ … ≤ s_(2n). If f(s_(i)) and f(s_(i+1)) have different signs, then the root must lie between them and can be easily found because f is linear on [s_(i), s_(i+1)]. This algorithm takes at least O(n log n) time because of the sorting.
However, this complexity can be reduced to O(n) by making use of the fact that the median of n (unsorted) elements can be found in O(n) time. Notice that due to the monotonicity of f, the median of a set S gives exactly the median of the function values, i.e., f(MED(S)) = MED({f(x) : x ∈ S}). Algorithm 3 sketches the idea of a binary search. The while loop terminates in log₂(2n) iterations because the set S is halved in each iteration. And in each iteration, the time complexity is linear in |S|, the size of the current S. So the total complexity is O(n). Note the evaluation of f(m) potentially involves summing up n terms as in (31). However, by some clever aggregation of slope and offset, this can be reduced to O(|S|).
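A runnable sketch of the median-of-kinks search, with our own naming. Here statistics.median stands in for a true linear-time selection routine, so this version costs O(n log n) rather than the stated O(n), and it maintains an explicit bracket [lo, hi], one of the boundary details that Algorithm 3 ignores.

```python
import statistics

def solve_root(dbar2, lbar, ubar, zbar):
    """Find the root of f(lmb) = sum_i h_i(lmb) - zbar from (31), where
    h_i(lmb) = clip(lmb / dbar2[i], lbar[i], ubar[i]) as in Figure 1.
    Assumes feasibility: sum(lbar) <= zbar <= sum(ubar)."""
    def f(lmb):
        return sum(min(max(lmb / d2, lo), hi)
                   for d2, lo, hi in zip(dbar2, lbar, ubar)) - zbar

    # kink points of the piecewise-linear f
    S = [d2 * lo for d2, lo in zip(dbar2, lbar)] + \
        [d2 * hi for d2, hi in zip(dbar2, ubar)]
    lo, hi = min(S), max(S)       # bracket: f(lo) <= 0 <= f(hi) by feasibility
    S = [x for x in S if lo < x < hi]
    while S:
        m = statistics.median(S)  # a linear-time selection would give O(n) total
        if f(m) >= 0:
            hi = m
            S = [x for x in S if x < m]
        else:
            lo = m
            S = [x for x in S if x > m]
    flo, fhi = f(lo), f(hi)
    if fhi == flo:                # f constant on [lo, hi]: any point is a root
        return lo
    # no kinks remain strictly inside (lo, hi), so f is linear there
    return (lo * fhi - hi * flo) / (fhi - flo)
```

Each filtering pass discards at least half of the remaining kinks, which is what yields the geometric shrinkage the complexity argument relies on.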
G Solving Binary Linear SVMs using Nesterov's Algorithm

We now present the algorithm of [8] as Algorithm 4. It requires a σ₂-strongly convex prox-function on Q₂: d₂(α) = (σ₂/2)‖α‖², and sets D₂ = max_{α∈Q₂} d₂(α). Let the Lipschitz constant of ∇D(α) be L. Algorithm 4 is based on two mappings α_μ(w) : Q₁ → Q₂ and w(α) : Q₂ → Q₁, together with an auxiliary mapping v : Q₂ → Q₂. They are defined by

α_μ(w) := argmin_{α∈Q₂} { μ d₂(α) − ⟨Aw, α⟩ + g(α) } = argmin_{α∈Q₂} { (μ/2)‖α‖² + w⊤X⊤Yα − Σ_i α_i },  (32)

w(α) := argmin_{w∈Q₁} { ⟨Aw, α⟩ + f(w) } = argmin_{w∈R^d} { −w⊤X⊤Yα + (λ/2)‖w‖² } = (1/λ) X⊤Yα,  (33)
Algorithm 4: Solving Binary Linear SVMs using Nesterov's Algorithm.
Require: L, a conservative estimate of (i.e., no less than) the Lipschitz constant of ∇D(α).
Ensure: Two sequences w_k and α_k which reduce the duality gap at O(1/k²) rate.
1: Initialize: Randomly pick α₋₁ in Q₂. Let μ₀ = 2L, α₀ ← v(α₋₁), w₀ ← w(α₋₁).
2: for k = 0, 1, 2, … do
3:   Let τ_k = 2/(k+3), α̂ ← (1 − τ_k)α_k + τ_k α_{μ_k}(w_k).
4:   Set w_{k+1} ← (1 − τ_k)w_k + τ_k w(α̂), α_{k+1} ← v(α̂), μ_{k+1} ← (1 − τ_k)μ_k.
5: end for
v(α) := argmin_{α′∈Q₂} { (L/2)‖α′ − α‖² − ⟨∇D(α), α′ − α⟩ }.  (34)

Equations (32) and (34) are examples of a box constrained QP with a single equality constraint. Appendix F provides a linear time algorithm to find the minimizer of such a QP. The overall complexity of each iteration is thus O(nd), due to the gradient calculation in (34) and the matrix multiplication in (33).
H Dynamic Programming to Compute the Marginals of (21)

For convenience, we repeat the joint distribution here:

p(ȳ; w) ∝ exp( c Δ(ȳ, y) + Σ_i a_i⟨x_i, w⟩ ȳ_i ).

Clearly, the marginal distributions can be computed efficiently if we are able to efficiently compute forms like Σ_ȳ exp( c Δ(ȳ, y) + Σ_i a_i⟨x_i, w⟩ ȳ_i ). Notice that Δ(ȳ, y) only depends on the sufficient statistics, namely the numbers of false positives and false negatives (b and c in the contingency table, respectively), so our idea is to enumerate all possible values of b and c.
For any fixed value of b and c, we just need to sum up

exp( Σ_i a_i⟨x_i, w⟩ ȳ_i ) = Π_i exp( a_i⟨x_i, w⟩ ȳ_i )

over all ȳ which have b false positives and c false negatives. Let us call this set of labelings C_n(b, c).
= +1, then
V
n
(b, c) :=
yCn(b,c)
n
i=1
exp (a
i
'x
i
, w` y
i
)
= exp (a
n
'x
i
, w`)
( y1,..., yn1)Cn1(b,c)
n1
i=1
exp (a
i
'x
i
, w` y
i
) ( y
n
= +1)
+ exp (a
n
'x
i
, w`)
( y1,..., yn1)Cn1(b,c1)
n1
i=1
exp (a
i
'x
i
, w` y
i
) ( y
n
= 1)
= exp (a
n
'x
i
, w`) V
n1
(b, c) + exp (a
n
'x
i
, w`) V
n1
(b, c 1).
If y_n = −1, then

V_n(b, c) = exp( a_n⟨x_n, w⟩ ) V_{n−1}(b − 1, c) + exp( −a_n⟨x_n, w⟩ ) V_{n−1}(b, c).
In practice, the recursion starts from V₁, and no particular b and c is fixed in advance: V_k(p, q) is computed for all values of p and q. So at completion, we obtain V_n(b, c) for all possible values of b and c. The cost of computation and storage is O(n²).
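The recursion above can be sketched as follows. This is our own illustration: scores[i] plays the role of a_i⟨x_i, w⟩, and the Δ-dependent factor exp(c Δ) can be applied afterwards, since it only depends on (b, c).

```python
import math

def contingency_dp(scores, y):
    """V[b][c] = sum over labelings ybar with b false positives and c false
    negatives of prod_i exp(scores[i] * ybar_i), computed by the Appendix H
    recursion, processing examples one at a time."""
    npos = sum(1 for t in y if t == 1)
    nneg = len(y) - npos
    V = [[0.0] * (npos + 1) for _ in range(nneg + 1)]
    V[0][0] = 1.0                      # empty prefix: no errors, weight 1
    for s, t in zip(scores, y):
        up, down = math.exp(s), math.exp(-s)   # ybar_i = +1 / ybar_i = -1
        W = [[0.0] * (npos + 1) for _ in range(nneg + 1)]
        for b in range(nneg + 1):
            for c in range(npos + 1):
                if V[b][c] == 0.0:
                    continue
                if t == 1:
                    W[b][c] += up * V[b][c]            # correct positive
                    if c + 1 <= npos:
                        W[b][c + 1] += down * V[b][c]  # false negative
                else:
                    W[b][c] += down * V[b][c]          # correct negative
                    if b + 1 <= nneg:
                        W[b + 1][c] += up * V[b][c]    # false positive
        V = W
    return V
```

When only one example is positive, as in the F₁ construction of Section 4.3, the table degenerates to O(n) states and the whole computation is O(n²), matching the cost quoted above.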