
Coordinate Descent

Algorithms
Stephen J. Wright

Sai Krishna Dubagunta


1
Index
About Coordinate Descent (CD) Algorithms

Introduction

Optimization - minimization

Types of functions [ Definitions ]

Convex

Smooth

Regularization Functions

Outline of CD Algorithms

Applications

2
Contd.

Algorithms, Convergence and Implementations

Powell's Example

Randomized Algorithms

Conclusion

3
Coordinate Descent
Algorithm

Coordinate descent algorithms solve optimization problems by
successively performing approximate minimization along coordinate
directions or coordinate hyperplanes.

As in dynamic programming, they break a problem into subproblems of
lower dimension (even scalar) and solve each subproblem independently.

4
Introduction
Optimization - minimization

Minimization of a multivariate function F(x) can be achieved by
minimizing it along one direction at a time (solving a univariate
problem in a loop).

The problem considered in this paper is

$$\min_{x \in \mathbb{R}^n} f(x),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is continuous.
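To illustrate the one-direction-at-a-time idea, here is a minimal Python sketch (my own example, not from the paper; the two-variable function f and all names are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimize a smooth convex function of two variables by repeatedly
# solving a univariate problem along one coordinate direction at a time.
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + 0.5 * x[0] * x[1]

x = np.zeros(2)
for sweep in range(20):
    for i in range(len(x)):
        def f_along_i(t, i=i):          # f restricted to coordinate i
            y = x.copy()
            y[i] = t
            return f(y)
        x[i] = minimize_scalar(f_along_i).x   # exact 1-D minimization

print(x)   # approaches the unconstrained minimizer of f
```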

5
Types of functions

Convex Function :

Example: a real-valued function $f : \mathbb{R}^n \to \mathbb{R}$
whose domain is a convex set.

f is said to be convex if, for all x, y in its domain and all
$\alpha \in [0, 1]$,

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$
6
Smooth Function :

A function f(x) is said to be smooth if all the partial derivatives of
f(x) are defined and continuous at every point in the domain of f.

7
Regularization Functions

No regularizer

Box constraint

Weighted L1 norm

Weighted L2 norm
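The formulas for these regularizers were lost from the slide; the following are the standard forms such regularizers usually take (my reconstruction, so the exact weights and notation in the paper may differ):

$$
\begin{aligned}
\text{No regularizer:} \quad & \Omega(x) = 0, \\
\text{Box constraint:} \quad & \Omega(x) =
  \begin{cases} 0 & \text{if } l \le x \le u, \\ +\infty & \text{otherwise,} \end{cases} \\
\text{Weighted } \ell_1 \text{ norm:} \quad & \Omega(x) = \sum_{i=1}^{n} w_i |x_i|, \\
\text{Weighted } \ell_2 \text{ (squared form):} \quad & \Omega(x) = \tfrac{1}{2}\sum_{i=1}^{n} w_i x_i^2, \qquad w_i > 0.
\end{aligned}
$$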

8
Outline of Coordinate
Descent Algorithms
As stated in the Introduction, the basic problem considered in this
paper is $\min_{x \in \mathbb{R}^n} f(x)$.

Motivated by recent applications, it is now common to consider the
regularized formulation

$$\min_{x} \; h(x) := f(x) + \lambda \Omega(x),$$

where f is smooth, $\Omega$ is a regularization function that may be
nonsmooth and extended-valued, and $\lambda > 0$ is a regularization
parameter.

$\Omega$ is often convex and assumed to be separable or block
separable. If separable,

$$\Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i),$$

where $\Omega_i : \mathbb{R} \to \mathbb{R}$ for all i, and n (the
number of coordinates) is typically very large.

When $\Omega$ is block separable, the $n \times n$ identity matrix can
be partitioned into column submatrices $U_i$, $i = 1, 2, \ldots, N$,
such that

$$\Omega(x) = \sum_{i=1}^{N} \Omega_i(U_i^T x).$$
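As a concrete (standard, not paper-specific) example of a separable regularizer, the $\ell_1$ norm:

$$\Omega(x) = \|x\|_1 = \sum_{i=1}^{n} |x_i|, \qquad \Omega_i(x_i) = |x_i|.$$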
Basic Coordinate Descent Framework

Each step consists of evaluating a single component $i_k$ of the
gradient $\nabla f$ at the current point.

An adjustment is then made to the corresponding component of x, in the
direction opposite to this gradient component (see Algorithm 1):

$$x^{k+1} = x^k - \alpha_k\, [\nabla f(x^k)]_{i_k}\, e_{i_k}.$$

$\alpha_k$ is the step length, which can be obtained by exact
minimization along the $i_k$ component, or set to a predefined short
step.

The components can be selected in cyclic fashion, in which $i_0 = 1$
and

$$i_{k+1} = [\,i_k \bmod n\,] + 1.$$

Algorithm 1
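A minimal Python sketch of this framework (my own code and naming, not the paper's; it assumes a full-gradient callback for simplicity, whereas a practical implementation would evaluate only the single component):

```python
import numpy as np

def cyclic_cd(grad, x0, alpha, num_iters):
    """Sketch of an Algorithm-1-style framework: at iteration k, pick
    coordinate i_k cyclically and step against that single gradient
    component with step length alpha."""
    x = x0.copy()
    n = len(x)
    i = 0                       # i_0 = 1 in the slides' 1-based notation
    for k in range(num_iters):
        g_i = grad(x)[i]        # component i_k of the gradient at x^k
        x[i] -= alpha * g_i     # x^{k+1} = x^k - alpha_k * g_i * e_{i_k}
        i = (i + 1) % n         # cyclic choice of the next coordinate
    return x

# Example: f(x) = 0.5 * ||x - c||^2, with gradient x - c.
c = np.array([1.0, -2.0, 3.0])
x_min = cyclic_cd(lambda x: x - c, np.zeros(3), alpha=0.5, num_iters=100)
print(x_min)  # approaches c
```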
The formulation from slide 10 is handled by Algorithm 2.

At iteration k, a scalar subproblem is formed by making a linear
approximation to f along the $i_k$ coordinate direction at the current
iterate $x^k$, adding a quadratic damping term weighted by
$1/\alpha_k$, and treating the regularization term $\Omega_{i_k}$
exactly:

$$x^{k+1}_{i_k} \in \arg\min_{z}\;
[\nabla f(x^k)]_{i_k}\,(z - x^k_{i_k})
+ \frac{1}{2\alpha_k}(z - x^k_{i_k})^2
+ \lambda\,\Omega_{i_k}(z),$$

with the other components of $x^{k+1}$ left unchanged.

For some interesting choices of $\Omega_i$, it is possible to devise a
closed-form solution to this subproblem without performing explicit
searches. This kind of operation is called a shrink operation.

Stating the subproblem in Algorithm 2 equivalently in terms of the
shrink operation, we can express the CD update compactly.

Note that when the regularization term is not present, the step is
identical to Algorithm 1.

Algorithm 2
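As a concrete instance, for the plain $\ell_1$ regularizer $\Omega(x) = \|x\|_1$ the scalar subproblem above has the closed-form soft-thresholding solution. A minimal Python sketch (my own code and naming, not the paper's):

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form shrink (soft-thresholding) operator for the l1
    regularizer: argmin_u 0.5*(u - z)^2 + tau*|u|."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cd_prox_step(x, grad_i, i, alpha, lam):
    """One Algorithm-2-style coordinate step for Omega(x) = ||x||_1:
    linearize f at x along coordinate i, add the (1/(2*alpha)) damping
    term, and solve the scalar subproblem in closed form."""
    z = x[i] - alpha * grad_i              # gradient step on the smooth part
    x = x.copy()
    x[i] = soft_threshold(z, alpha * lam)  # shrink handles lambda*|x_i|
    return x
```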
Application to Linear Equations

For a linear system Aw = b, let us assume the rows of A are
normalized, $\|A_{i\cdot}\| = 1$ for each row i.

The least-norm solution is found by solving
$\min_w \frac{1}{2}\|w\|^2$ subject to $Aw = b$, whose Lagrangian dual
is

$$\min_{x} \; \tfrac{1}{2}\|A^T x\|^2 - b^T x.$$

Applying Algorithm 1 to the Lagrangian dual with $\alpha_k \equiv 1$,
each step has the form

$$x^{k+1} = x^k - \big(A_{i_k\cdot} A^T x^k - b_{i_k}\big)\, e_{i_k}.$$

From the Lagrangian dual we recover $w = A^T x$, and after each update
on $x^k$ we obtain

$$w^{k+1} = w^k - \big(A_{i_k\cdot}\, w^k - b_{i_k}\big)\, A_{i_k\cdot}^T,$$

which is the update of the Kaczmarz algorithm.
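A minimal Python sketch of the (randomized) Kaczmarz update described above (my own code, not the paper's; row normalization is not assumed here, so the general denominator is kept):

```python
import numpy as np

def kaczmarz(A, b, num_iters, rng=None):
    """At each step, project the current iterate w onto the hyperplane
    A[i] @ w = b[i] for a randomly chosen row i."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(num_iters):
        i = rng.integers(m)
        a = A[i]
        # With normalized rows (||a|| = 1) this is exactly
        # w <- w - (a @ w - b[i]) * a.
        w -= (a @ w - b[i]) / (a @ a) * a
    return w

# Tiny consistency check on a small consistent system.
A = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([1.0, 3.0])
print(kaczmarz(A, b, 500))   # approaches [1, 2]
```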


Relationship to other Methods

Stochastic Gradient (SG) Methods

SG methods minimize a smooth function f by taking a (negative) step
along an estimate $g_k$ of the gradient $\nabla f(x^k)$ at iteration
k.

It is often assumed that $g_k$ is an unbiased estimate of
$\nabla f(x^k)$, that is, $\nabla f(x^k) = E(g_k)$.

Randomized CD algorithms can be viewed as a special case of SG
methods, in which

$$g_k = n\, [\nabla f(x^k)]_{i_k}\, e_{i_k},$$

where $i_k$ is chosen uniformly at random from $\{1, 2, \ldots, n\}$.
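A one-line check that this estimate is unbiased under uniform sampling (my own verification, not quoted from the paper):

$$E(g_k) = \sum_{i=1}^{n} \frac{1}{n}\, n\, [\nabla f(x^k)]_i\, e_i
= \sum_{i=1}^{n} [\nabla f(x^k)]_i\, e_i = \nabla f(x^k).$$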


Gauss - Seidel Method

The Gauss-Seidel method for $n \times n$ systems of linear equations
adjusts the $i_k$ variable at iteration k to ensure satisfaction of
the $i_k$-th equation:

$$w^{k+1}_{i_k} = \frac{1}{A_{i_k i_k}}
\Big(b_{i_k} - \sum_{j \ne i_k} A_{i_k j}\, w^k_j\Big).$$

Standard Gauss-Seidel uses the cyclic choice of coordinates, whereas a
random choice of $i_k$ corresponds to the randomized versions of these
methods.

The Gauss-Seidel method applied to the normal equations of the linear
system Aw = b, that is, to $A^T A w = A^T b$, is equivalent to applying
Algorithm 1 (with exact coordinate minimization) to the least-squares
problem $\min_w \frac{1}{2}\|Aw - b\|^2$.
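A minimal Gauss-Seidel sketch in Python (my own code, not the paper's), solving each equation exactly for its own variable in a cyclic sweep:

```python
import numpy as np

def gauss_seidel(A, b, num_sweeps):
    """Cycle through the equations, solving the i-th equation exactly
    for w[i] while holding the other components fixed."""
    n = len(b)
    w = np.zeros(n)
    for _ in range(num_sweeps):
        for i in range(n):
            # w[i] = (b[i] - sum_{j != i} A[i, j] * w[j]) / A[i, i]
            w[i] = (b[i] - A[i, :] @ w + A[i, i] * w[i]) / A[i, i]
    return w

# Example on a small SPD system (Gauss-Seidel converges for SPD A).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(gauss_seidel(A, b, 50), np.linalg.solve(A, b))
```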
Applications
Several applications of CD algorithms are mentioned here.

Bouman & Sauer: positron emission tomography, in which the objective
has the form of the equation from slide 9, where f is smooth and
convex and $\Omega$ is a sum of terms of the form $|x_j - x_l|^q$.

Liu, Palatucci, and Zhang describe a block CD approach for linear
least squares plus a regularization function consisting of a sum of
$\ell_\infty$ norms of subvectors of x.

Chang, Hsieh, and Lin use cyclic and stochastic CD to solve a
squared-loss formulation of the support vector machine (SVM) problem
in machine learning, where
$(x_i, y_i) \in \mathbb{R}^n \times \{-1, +1\}$ are feature vector /
label pairs and $\lambda$ is a regularization parameter.
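The slide's formula for this problem was lost in extraction; a standard squared-hinge ("L2-loss") SVM objective of the kind described is the following (the exact scaling used in the paper may differ):

$$\min_{w}\; \frac{\lambda}{2}\|w\|_2^2
+ \sum_{i=1}^{N} \max\big(1 - y_i\, w^T x_i,\; 0\big)^2.$$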
Algorithms,
Convergence,
Implementations
Powell's Example :

Powell gave a function in $\mathbb{R}^3$ for which cyclic CD fails to
converge to a stationary point.

This non-convex, continuously differentiable function
$f : \mathbb{R}^3 \to \mathbb{R}$ has minimizers at two of the
vertices of the unit cube, but coordinate descent with exact
minimization, started near one of the other vertices of the cube,
cycles around the neighborhoods of the six points close to the six
non-optimal vertices.

[Figure: example showing the non-convergence of cyclic coordinate
descent.]

Thus, we cannot expect a general convergence result for non-convex
functions of the type that is available for full-gradient descent.

Still, results are available for the non-convex case under certain
assumptions that admit interesting applications.
Assumptions & Notations

We consider the unconstrained problem mentioned in slide 5, where the
objective f is convex and Lipschitz continuously differentiable.

We assume there exists $\sigma \ge 0$ such that

$$f(y) \ge f(x) + \nabla f(x)^T (y - x)
+ \frac{\sigma}{2}\|y - x\|^2
\quad \text{for all } x, y. \tag{20}$$

We define Lipschitz constants that are tied to the component
directions and are key to the algorithms and their analysis.

The component Lipschitz constants are positive quantities $L_i$ such
that for all $x \in \mathbb{R}^n$ and all $t \in \mathbb{R}$ we have

$$\big|[\nabla f(x + t e_i)]_i - [\nabla f(x)]_i\big| \le L_i |t|. \tag{21}$$

We define the coordinate Lipschitz constant $L_{\max}$ to be such that
$L_{\max} = \max_{i=1,\ldots,n} L_i$.

The standard Lipschitz constant L is such that

$$\|\nabla f(x + d) - \nabla f(x)\| \le L \|d\|$$

for all x and d of interest.

By referring to the relationships between the norm and trace of a
symmetric matrix, we can assume that $1 \le L / L_{\max} \le n$.

We also define the restricted Lipschitz constant $L_{\mathrm{res}}$
such that the following property is true for all
$x \in \mathbb{R}^n$, all $t \in \mathbb{R}$, and all
$i = 1, 2, \ldots, n$:

$$\|\nabla f(x + t e_i) - \nabla f(x)\| \le L_{\mathrm{res}} |t|.$$

Clearly $L_{\mathrm{res}} \le L$. The ratio
$\Lambda = L_{\mathrm{res}} / L_{\max}$ is important in our analysis
of asynchronous parallel algorithms in a later section.
In the case that f is convex and twice continuously differentiable, we
have by positive semidefiniteness of the Hessian $\nabla^2 f(x)$ at
all x that

$$\big|[\nabla^2 f(x)]_{ij}\big| \le
\big([\nabla^2 f(x)]_{ii}\,[\nabla^2 f(x)]_{jj}\big)^{1/2},$$

from which we can deduce that $1 \le \Lambda \le \sqrt{n}$.

We can derive stronger bounds on $\Lambda$ for functions f in which
the coupling between the components of x is weak.

The coordinate Lipschitz constant $L_{\max}$ corresponds to the
maximum absolute value of the diagonal elements of the Hessian
$\nabla^2 f(x)$, while the restricted constant $L_{\mathrm{res}}$ is
related to the maximal column norm of the Hessian. So, if the Hessian
is positive semidefinite and diagonally dominant, the ratio $\Lambda$
is at most 2.
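A short supporting argument for the last claim (my own sketch, using the column-norm characterization of $L_{\mathrm{res}}$ above): for a symmetric positive semidefinite, diagonally dominant Hessian $H = \nabla^2 f(x)$,

$$\|H e_i\|_2 \le \|H e_i\|_1 = \sum_{j} |H_{ji}|
\le 2\, H_{ii} \le 2\, L_{\max},$$

where the middle inequality uses diagonal dominance (the off-diagonal entries of column i sum to at most the diagonal entry). Hence $L_{\mathrm{res}} \le 2 L_{\max}$, that is, $\Lambda \le 2$.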
Assumption 1

The function f is convex and uniformly Lipschitz continuously
differentiable, and attains its minimum value f* on a set S. There is
a finite $R_0$ such that the level set for f defined by $x^0$ is
bounded, that is,

$$\max_{x^* \in S}\; \max_{x}\;
\big\{\|x - x^*\| : f(x) \le f(x^0)\big\} \le R_0. \tag{26}$$
Randomized Algorithms
The update component $i_k$ is chosen randomly at each iteration. In
the given algorithm, we consider the simplest variant, in which each
$i_k$ is selected from $\{1, 2, \ldots, n\}$ with equal probability,
independently of the selections made at previous iterations.

We prove a convergence result for the randomized algorithm, for the
simple step-length choice $\alpha_k \equiv 1/L_{\max}$.

Algorithm 3
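A minimal Python sketch of this randomized scheme (my own code and naming, not the paper's):

```python
import numpy as np

def randomized_cd(grad, x0, L_max, num_iters, rng=None):
    """At each iteration choose i_k uniformly at random and take a step
    of length 1/L_max against that single gradient component."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    n = len(x)
    for _ in range(num_iters):
        i = rng.integers(n)                  # i_k uniform on {0, ..., n-1}
        x[i] -= grad(x)[i] / L_max           # alpha_k = 1/L_max
    return x

# Example: f(x) = 0.5 * x^T A x - b^T x, so L_i = A_ii and L_max = 3.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_hat = randomized_cd(lambda x: A @ x - b, np.zeros(2), L_max=3.0,
                      num_iters=2000)
print(x_hat, np.linalg.solve(A, b))
```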
Theorem 1: Suppose that Assumption 1 holds, and that
$\alpha_k \equiv 1/L_{\max}$ in Algorithm 3. Then for all $k > 0$ we
have

$$E\big[f(x^k)\big] - f^* \le \frac{2 n L_{\max} R_0^2}{k}. \tag{27}$$

When $\sigma > 0$, we have in addition that

$$E\big[f(x^k)\big] - f^* \le
\Big(1 - \frac{\sigma}{n L_{\max}}\Big)^{k} \big(f(x^0) - f^*\big). \tag{28}$$

Proof: By application of Taylor's theorem, together with (21) and
(22), we have

$$
\begin{aligned}
f(x^{k+1}) &= f\big(x^k - \alpha_k [\nabla f(x^k)]_{i_k} e_{i_k}\big) \\
&\le f(x^k) - \alpha_k [\nabla f(x^k)]_{i_k}^2
   + \frac{L_{\max}}{2}\alpha_k^2 [\nabla f(x^k)]_{i_k}^2 \\
&= f(x^k) - \frac{1}{2 L_{\max}} [\nabla f(x^k)]_{i_k}^2,
\end{aligned}
$$

where we substituted the choice $\alpha_k \equiv 1/L_{\max}$ in the
last equality. Taking the expectation of both sides of this expression
over the random index $i_k$, we have

$$E_{i_k}\big[f(x^{k+1})\big]
\le f(x^k) - \frac{1}{2 n L_{\max}} \|\nabla f(x^k)\|^2.$$

We now subtract $f(x^*)$ from both sides of this expression, take the
expectation of both sides with respect to all random variables
$i_0, i_1, i_2, \ldots$, and use the notation

$$\phi_k := E\big[f(x^k)\big] - f^*$$

to obtain

$$\phi_{k+1} \le \phi_k
- \frac{1}{2 n L_{\max}} E\big[\|\nabla f(x^k)\|^2\big]. \tag{32}$$

By convexity of f we have for any $x^* \in S$ that

$$f(x^k) - f^* \le \nabla f(x^k)^T (x^k - x^*)
\le \|\nabla f(x^k)\|\,\|x^k - x^*\| \le R_0\,\|\nabla f(x^k)\|,$$

where the final inequality holds because $f(x^k) \le f(x^0)$, so that
$x^k$ lies in the level set in (26). By taking expectations of both
sides, we obtain

$$E\big[\|\nabla f(x^k)\|^2\big]
\ge \big(E\|\nabla f(x^k)\|\big)^2 \ge \frac{1}{R_0^2}\,\phi_k^2.$$

When we substitute this bound into (32) and rearrange, we obtain

$$\phi_{k+1} \le \phi_k - \frac{1}{2 n L_{\max} R_0^2}\,\phi_k^2.$$

We thus have

$$\frac{1}{\phi_{k+1}} \ge \frac{1}{\phi_k}
+ \frac{1}{2 n L_{\max} R_0^2}.$$

By applying this formula recursively, we obtain

$$\frac{1}{\phi_k} \ge \frac{1}{\phi_0}
+ \frac{k}{2 n L_{\max} R_0^2} \ge \frac{k}{2 n L_{\max} R_0^2},$$

so that (27) holds as claimed.

In the case that f is strongly convex with modulus $\sigma > 0$, we
have, by taking the minimum of both sides with respect to y in (20)
and setting $x = x^k$, that

$$f^* \ge f(x^k) - \frac{1}{2\sigma}\|\nabla f(x^k)\|^2,
\qquad\text{that is,}\qquad
\|\nabla f(x^k)\|^2 \ge 2\sigma\big(f(x^k) - f^*\big).$$

By using this expression to bound $\|\nabla f(x^k)\|^2$ in (32), we
obtain

$$\phi_{k+1} \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)\,\phi_k.$$

Recursive application of this formula leads to (28).

Note that the same convergence expressions can be obtained for more
refined choices of step length $\alpha_k$, by making minor adjustments
to the logic. For example, the choice $\alpha_k \equiv 1/L_{i_k}$
leads to the same bounds; the same bounds hold too when $\alpha_k$ is
the exact minimizer of f along the $i_k$ coordinate direction.

We compare (27) with the corresponding result for full-gradient
descent with constant step length $\alpha_k \equiv 1/L$. The iteration

$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$$

leads to the convergence expression

$$f(x^k) - f^* \le \frac{2 L R_0^2}{k}.$$

Accelerated Randomized Algorithms

Proposed by Nesterov.

Algorithm 4 assumes that an estimate $\sigma \ge 0$ of the modulus of
strong convexity from (20) is available, as well as estimates of the
component-wise Lipschitz constants $L_i$ from (21).

It is closely related to accelerated full-gradient methods.

Algorithm 4
Theorem 2: Suppose that Assumption 1 holds, and define

Then for all $k \ge 0$ we have

In the strongly convex case $\sigma > 0$, the term involving $\sigma$
eventually dominates the second term in brackets in (35), so that the
linear convergence rate suggested by this expression is significantly
faster than the corresponding rate (28) for Algorithm 3.
Efficient Implementation of Accelerated Algorithm

The higher cost of each iteration of Algorithm 4 detracts from the
appeal of accelerated CD methods over standard methods.

However, by using a change of variables due to Lee and Sidford, it is
possible to implement the accelerated randomized CD approach
efficiently for problems with certain structure, including the linear
system Aw = b.

We explain the Lee-Sidford technique in the context of the Kaczmarz
algorithm for (8), assuming normalization of the rows of A (14).

As explained in (16), the Kaczmarz algorithm is obtained by applying
CD to the dual formulation (10) with variables x, but operating in the
space of primal variables w using the transformation $w = A^T x$.

If we apply the transformations $\hat{v}^k = A^T v^k$ and
$\hat{y}^k = A^T y^k$ to the vectors in Algorithm 4, and note that
$L_i \equiv 1$ in (21), we obtain Algorithm 5.
Algorithm 5
When the matrix A is dense, there is only a small constant factor of
difference between the per-iteration workload of the standard Kaczmarz
algorithm and its accelerated variant, Algorithm 5; both require
O(m + n) operations per iteration.

When A is sparse, the computational difference between the two becomes
substantial.

At iteration k, the standard Kaczmarz algorithm requires on the order
of $|A_{i_k\cdot}|$ operations, where $|A_{i_k\cdot}|$ denotes the
number of nonzeros in row $i_k$ of A.

At iteration k of Algorithm 5, the accelerated variant also requires
$O(|A_{i_k\cdot}|)$ operations.
Conclusion
We have surveyed the state of the art in
convergence of coordinate descent methods,
with a focus on the most elementary settings
and the most fundamental algorithms.

Coordinate descent methods have become an important tool in the
optimization toolbox, used to solve problems that arise in machine
learning and data analysis, particularly in big-data settings.
