
Coordinate Descent

Algorithms
Stephen J. Wright

Sai Krishna Dubagunta


1
Index
About Coordinate Descent (CD) Algorithms

Introduction

Optimization - minimization

Types of functions [ Definitions ]

Convex

Smooth

Regularization Functions

Outline of CD Algorithms

Applications

2
Contd.

Algorithms, Convergence and Implementations

Powell's Example

Randomized Algorithms

Conclusion

3
Coordinate Descent
Algorithm

Coordinate descent algorithms solve optimization problems by
successively performing approximate minimization along coordinate
directions or coordinate hyperplanes.

As in dynamic programming, they break a problem into subproblems of
lower dimension (even scalar) and solve each subproblem independently.

4
Introduction
Optimization - minimization

Minimization of a multivariate function F(x) can be achieved by
minimizing it along one direction at a time (solving a univariate
problem in a loop).

The problem considered in this paper is

$$\min_{x \in \mathbb{R}^n} f(x),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is continuous.
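To illustrate the one-direction-at-a-time idea, here is a minimal Python sketch (my own example, not from the paper; the two-variable function f and all names are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimize a smooth convex function of two variables by repeatedly
# solving a univariate problem along one coordinate direction at a time.
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + 0.5 * x[0] * x[1]

x = np.zeros(2)
for sweep in range(20):
    for i in range(len(x)):
        def f_along_i(t, i=i):          # f restricted to coordinate i
            y = x.copy()
            y[i] = t
            return f(y)
        x[i] = minimize_scalar(f_along_i).x   # exact 1-D minimization

print(x)   # approaches the unconstrained minimizer of f
```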

5
Types of functions

Convex Function :

Example: a real-valued function $f : \mathbb{R}^n \to \mathbb{R}$
whose domain is a convex set.

f is said to be convex if, for all x, y in its domain and all
$\alpha \in [0, 1]$,

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$
6
Smooth Function :

A function f(x) is said to be smooth if all the partial derivatives of
f(x) are defined and continuous at every point in the domain of f.

7
Regularization Functions

No regularizer

Box constraint

Weighted L1 norm

Weighted L2 norm
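The formulas for these regularizers were lost from the slide; the following are the standard forms such regularizers usually take (my reconstruction, so the exact weights and notation in the paper may differ):

$$
\begin{aligned}
\text{No regularizer:} \quad & \Omega(x) = 0, \\
\text{Box constraint:} \quad & \Omega(x) =
  \begin{cases} 0 & \text{if } l \le x \le u, \\ +\infty & \text{otherwise,} \end{cases} \\
\text{Weighted } \ell_1 \text{ norm:} \quad & \Omega(x) = \sum_{i=1}^{n} w_i |x_i|, \\
\text{Weighted } \ell_2 \text{ (squared form):} \quad & \Omega(x) = \tfrac{1}{2}\sum_{i=1}^{n} w_i x_i^2, \qquad w_i > 0.
\end{aligned}
$$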

8
Outline of Coordinate
Descent Algorithms
As stated in the Introduction, the basic problem considered in this
paper is $\min_{x \in \mathbb{R}^n} f(x)$.

Motivated by recent applications, it is now common to consider the
regularized formulation

$$\min_{x} \; h(x) := f(x) + \lambda \Omega(x),$$

where f is smooth, $\Omega$ is a regularization function that may be
nonsmooth and extended-valued, and $\lambda > 0$ is a regularization
parameter.

$\Omega$ is often convex and assumed to be separable or block
separable. If separable,

$$\Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i),$$

where $\Omega_i : \mathbb{R} \to \mathbb{R}$ for all i, and n (the
number of coordinates) is typically very large.

When $\Omega$ is block separable, the $n \times n$ identity matrix can
be partitioned into column submatrices $U_i$, $i = 1, 2, \ldots, N$,
such that

$$\Omega(x) = \sum_{i=1}^{N} \Omega_i(U_i^T x).$$
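As a concrete (standard, not paper-specific) example of a separable regularizer, the $\ell_1$ norm:

$$\Omega(x) = \|x\|_1 = \sum_{i=1}^{n} |x_i|, \qquad \Omega_i(x_i) = |x_i|.$$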
Basic Coordinate Descent Framework

Each step consists of evaluating a single component $i_k$ of the
gradient $\nabla f$ at the current point.

An adjustment is then made to the corresponding component of x, in the
direction opposite to this gradient component (see Algorithm 1):

$$x^{k+1} = x^k - \alpha_k\, [\nabla f(x^k)]_{i_k}\, e_{i_k}.$$

$\alpha_k$ is the step length, which can be obtained by exact
minimization along the $i_k$ component, or set to a predefined short
step.

The components can be selected in cyclic fashion, in which $i_0 = 1$
and

$$i_{k+1} = [\,i_k \bmod n\,] + 1.$$

Algorithm 1
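A minimal Python sketch of this framework (my own code and naming, not the paper's; it assumes a full-gradient callback for simplicity, whereas a practical implementation would evaluate only the single component):

```python
import numpy as np

def cyclic_cd(grad, x0, alpha, num_iters):
    """Sketch of an Algorithm-1-style framework: at iteration k, pick
    coordinate i_k cyclically and step against that single gradient
    component with step length alpha."""
    x = x0.copy()
    n = len(x)
    i = 0                       # i_0 = 1 in the slides' 1-based notation
    for k in range(num_iters):
        g_i = grad(x)[i]        # component i_k of the gradient at x^k
        x[i] -= alpha * g_i     # x^{k+1} = x^k - alpha_k * g_i * e_{i_k}
        i = (i + 1) % n         # cyclic choice of the next coordinate
    return x

# Example: f(x) = 0.5 * ||x - c||^2, with gradient x - c.
c = np.array([1.0, -2.0, 3.0])
x_min = cyclic_cd(lambda x: x - c, np.zeros(3), alpha=0.5, num_iters=100)
print(x_min)  # approaches c
```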
The formulation from slide 10 is handled by Algorithm 2.

At iteration k, a scalar subproblem is formed by making a linear
approximation to f along the $i_k$ coordinate direction at the current
iterate $x^k$, adding a quadratic damping term weighted by
$1/\alpha_k$, and treating the regularization term $\Omega_{i_k}$
exactly:

$$x^{k+1}_{i_k} \in \arg\min_{z}\;
[\nabla f(x^k)]_{i_k}\,(z - x^k_{i_k})
+ \frac{1}{2\alpha_k}(z - x^k_{i_k})^2
+ \lambda\,\Omega_{i_k}(z),$$

with the other components of $x^{k+1}$ left unchanged.

For some interesting choices of $\Omega_i$, it is possible to devise a
closed-form solution to this subproblem without performing explicit
searches. This kind of operation is called a shrink operation.

Stating the subproblem in Algorithm 2 equivalently in terms of the
shrink operation, we can express the CD update compactly.

Note that when the regularization term is not present, the step is
identical to Algorithm 1.

Algorithm 2
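As a concrete instance, for the plain $\ell_1$ regularizer $\Omega(x) = \|x\|_1$ the scalar subproblem above has the closed-form soft-thresholding solution. A minimal Python sketch (my own code and naming, not the paper's):

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form shrink (soft-thresholding) operator for the l1
    regularizer: argmin_u 0.5*(u - z)^2 + tau*|u|."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cd_prox_step(x, grad_i, i, alpha, lam):
    """One Algorithm-2-style coordinate step for Omega(x) = ||x||_1:
    linearize f at x along coordinate i, add the (1/(2*alpha)) damping
    term, and solve the scalar subproblem in closed form."""
    z = x[i] - alpha * grad_i              # gradient step on the smooth part
    x = x.copy()
    x[i] = soft_threshold(z, alpha * lam)  # shrink handles lambda*|x_i|
    return x
```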
Application to Linear Equations

For a linear system Aw = b, let us assume the rows of A are
normalized, $\|A_{i\cdot}\| = 1$ for each row i.

The least-norm solution is found by solving
$\min_w \frac{1}{2}\|w\|^2$ subject to $Aw = b$, whose Lagrangian dual
is

$$\min_{x} \; \tfrac{1}{2}\|A^T x\|^2 - b^T x.$$

Applying Algorithm 1 to the Lagrangian dual with $\alpha_k \equiv 1$,
each step has the form

$$x^{k+1} = x^k - \big(A_{i_k\cdot} A^T x^k - b_{i_k}\big)\, e_{i_k}.$$

From the Lagrangian dual we recover $w = A^T x$, and after each update
on $x^k$ we obtain

$$w^{k+1} = w^k - \big(A_{i_k\cdot}\, w^k - b_{i_k}\big)\, A_{i_k\cdot}^T,$$

which is the update of the Kaczmarz algorithm.
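A minimal Python sketch of the (randomized) Kaczmarz update described above (my own code, not the paper's; row normalization is not assumed here, so the general denominator is kept):

```python
import numpy as np

def kaczmarz(A, b, num_iters, rng=None):
    """At each step, project the current iterate w onto the hyperplane
    A[i] @ w = b[i] for a randomly chosen row i."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(num_iters):
        i = rng.integers(m)
        a = A[i]
        # With normalized rows (||a|| = 1) this is exactly
        # w <- w - (a @ w - b[i]) * a.
        w -= (a @ w - b[i]) / (a @ a) * a
    return w

# Tiny consistency check on a small consistent system.
A = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([1.0, 3.0])
print(kaczmarz(A, b, 500))   # approaches [1, 2]
```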


Relationship to other Methods

Stochastic Gradient (SG) Methods

SG methods minimize a smooth function f by taking a (negative) step
along an estimate $g_k$ of the gradient $\nabla f(x^k)$ at iteration
k.

It is often assumed that $g_k$ is an unbiased estimate of
$\nabla f(x^k)$, that is, $\nabla f(x^k) = E(g_k)$.

Randomized CD algorithms can be viewed as a special case of SG
methods, in which

$$g_k = n\, [\nabla f(x^k)]_{i_k}\, e_{i_k},$$

where $i_k$ is chosen uniformly at random from $\{1, 2, \ldots, n\}$.
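A one-line check that this estimate is unbiased under uniform sampling (my own verification, not quoted from the paper):

$$E(g_k) = \sum_{i=1}^{n} \frac{1}{n}\, n\, [\nabla f(x^k)]_i\, e_i
= \sum_{i=1}^{n} [\nabla f(x^k)]_i\, e_i = \nabla f(x^k).$$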


Gauss - Seidel Method

The Gauss-Seidel method for $n \times n$ systems of linear equations
adjusts the $i_k$ variable at iteration k to ensure satisfaction of
the $i_k$-th equation:

$$w^{k+1}_{i_k} = \frac{1}{A_{i_k i_k}}
\Big(b_{i_k} - \sum_{j \ne i_k} A_{i_k j}\, w^k_j\Big).$$

Standard Gauss-Seidel uses the cyclic choice of coordinates, whereas a
random choice of $i_k$ corresponds to the randomized versions of these
methods.

The Gauss-Seidel method applied to the normal equations of the linear
system Aw = b, that is, to $A^T A w = A^T b$, is equivalent to applying
Algorithm 1 (with exact coordinate minimization) to the least-squares
problem $\min_w \frac{1}{2}\|Aw - b\|^2$.
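A minimal Gauss-Seidel sketch in Python (my own code, not the paper's), solving each equation exactly for its own variable in a cyclic sweep:

```python
import numpy as np

def gauss_seidel(A, b, num_sweeps):
    """Cycle through the equations, solving the i-th equation exactly
    for w[i] while holding the other components fixed."""
    n = len(b)
    w = np.zeros(n)
    for _ in range(num_sweeps):
        for i in range(n):
            # w[i] = (b[i] - sum_{j != i} A[i, j] * w[j]) / A[i, i]
            w[i] = (b[i] - A[i, :] @ w + A[i, i] * w[i]) / A[i, i]
    return w

# Example on a small SPD system (Gauss-Seidel converges for SPD A).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(gauss_seidel(A, b, 50), np.linalg.solve(A, b))
```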
Applications
Several applications of CD algorithms are mentioned here.

Bouman & Sauer: positron emission tomography, in which the objective
has the form of the equation from slide 9, where f is smooth and
convex and $\Omega$ is a sum of terms of the form $|x_j - x_l|^q$.

Liu, Palatucci, and Zhang describe a block CD approach for linear
least squares plus a regularization function consisting of a sum of
$\ell_\infty$ norms of subvectors of x.

Chang, Hsieh, and Lin use cyclic and stochastic CD to solve a
squared-loss formulation of the support vector machine (SVM) problem
in machine learning, where
$(x_i, y_i) \in \mathbb{R}^n \times \{-1, +1\}$ are feature vector /
label pairs and $\lambda$ is a regularization parameter.
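The slide's formula for this problem was lost in extraction; a standard squared-hinge ("L2-loss") SVM objective of the kind described is the following (the exact scaling used in the paper may differ):

$$\min_{w}\; \frac{\lambda}{2}\|w\|_2^2
+ \sum_{i=1}^{N} \max\big(1 - y_i\, w^T x_i,\; 0\big)^2.$$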
Algorithms,
Convergence,
Implementations
Powell's Example :

Powell gave a function in $\mathbb{R}^3$ for which cyclic CD fails to
converge to a stationary point.

This non-convex, continuously differentiable function
$f : \mathbb{R}^3 \to \mathbb{R}$ has minimizers at two of the
vertices of the unit cube, but coordinate descent with exact
minimization, started near one of the other vertices of the cube,
cycles around the neighborhoods of the six points close to the six
non-optimal vertices.

[Figure: example showing the non-convergence of cyclic coordinate
descent.]

Thus, we cannot expect a general convergence result for non-convex
functions of the type that is available for full-gradient descent.

Still, results are available for the non-convex case under certain
assumptions that admit interesting applications.
Assumptions & Notations

We consider the unconstrained problem mentioned in slide 5, where the
objective f is convex and Lipschitz continuously differentiable.

We assume there exists $\sigma \ge 0$ such that

$$f(y) \ge f(x) + \nabla f(x)^T (y - x)
+ \frac{\sigma}{2}\|y - x\|^2
\quad \text{for all } x, y. \tag{20}$$

We define Lipschitz constants that are tied to the component
directions and are key to the algorithms and their analysis.

The component Lipschitz constants are positive quantities $L_i$ such
that for all $x \in \mathbb{R}^n$ and all $t \in \mathbb{R}$ we have

$$\big|[\nabla f(x + t e_i)]_i - [\nabla f(x)]_i\big| \le L_i |t|. \tag{21}$$

We define the coordinate Lipschitz constant $L_{\max}$ to be such that
$L_{\max} = \max_{i=1,\ldots,n} L_i$.

The standard Lipschitz constant L is such that

$$\|\nabla f(x + d) - \nabla f(x)\| \le L \|d\|$$

for all x and d of interest.

By referring to the relationships between the norm and trace of a
symmetric matrix, we can assume that $1 \le L / L_{\max} \le n$.

We also define the restricted Lipschitz constant $L_{\mathrm{res}}$
such that the following property is true for all
$x \in \mathbb{R}^n$, all $t \in \mathbb{R}$, and all
$i = 1, 2, \ldots, n$:

$$\|\nabla f(x + t e_i) - \nabla f(x)\| \le L_{\mathrm{res}} |t|.$$

Clearly $L_{\mathrm{res}} \le L$. The ratio
$\Lambda = L_{\mathrm{res}} / L_{\max}$ is important in our analysis
of asynchronous parallel algorithms in a later section.
In the case that f is convex and twice continuously differentiable, we
have by positive semidefiniteness of the Hessian $\nabla^2 f(x)$ at
all x that

$$\big|[\nabla^2 f(x)]_{ij}\big| \le
\big([\nabla^2 f(x)]_{ii}\,[\nabla^2 f(x)]_{jj}\big)^{1/2},$$

from which we can deduce that $1 \le \Lambda \le \sqrt{n}$.

We can derive stronger bounds on $\Lambda$ for functions f in which
the coupling between the components of x is weak.

The coordinate Lipschitz constant $L_{\max}$ corresponds to the
maximum absolute value of the diagonal elements of the Hessian
$\nabla^2 f(x)$, while the restricted constant $L_{\mathrm{res}}$ is
related to the maximal column norm of the Hessian. So, if the Hessian
is positive semidefinite and diagonally dominant, the ratio $\Lambda$
is at most 2.
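A short supporting argument for the last claim (my own sketch, using the column-norm characterization of $L_{\mathrm{res}}$ above): for a symmetric positive semidefinite, diagonally dominant Hessian $H = \nabla^2 f(x)$,

$$\|H e_i\|_2 \le \|H e_i\|_1 = \sum_{j} |H_{ji}|
\le 2\, H_{ii} \le 2\, L_{\max},$$

where the middle inequality uses diagonal dominance (the off-diagonal entries of column i sum to at most the diagonal entry). Hence $L_{\mathrm{res}} \le 2 L_{\max}$, that is, $\Lambda \le 2$.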
Assumption 1

The function f is convex and uniformly Lipschitz continuously
differentiable, and attains its minimum value f* on a set S. There is
a finite $R_0$ such that the level set for f defined by $x^0$ is
bounded, that is,

$$\max_{x^* \in S}\; \max_{x}\;
\big\{\|x - x^*\| : f(x) \le f(x^0)\big\} \le R_0. \tag{26}$$
Randomized Algorithms
The update component $i_k$ is chosen randomly at each iteration. In
the given algorithm, we consider the simplest variant, in which each
$i_k$ is selected from $\{1, 2, \ldots, n\}$ with equal probability,
independently of the selections made at previous iterations.

We prove a convergence result for the randomized algorithm, for the
simple step-length choice $\alpha_k \equiv 1/L_{\max}$.

Algorithm 3
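A minimal Python sketch of this randomized scheme (my own code and naming, not the paper's):

```python
import numpy as np

def randomized_cd(grad, x0, L_max, num_iters, rng=None):
    """At each iteration choose i_k uniformly at random and take a step
    of length 1/L_max against that single gradient component."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    n = len(x)
    for _ in range(num_iters):
        i = rng.integers(n)                  # i_k uniform on {0, ..., n-1}
        x[i] -= grad(x)[i] / L_max           # alpha_k = 1/L_max
    return x

# Example: f(x) = 0.5 * x^T A x - b^T x, so L_i = A_ii and L_max = 3.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_hat = randomized_cd(lambda x: A @ x - b, np.zeros(2), L_max=3.0,
                      num_iters=2000)
print(x_hat, np.linalg.solve(A, b))
```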
Theorem 1: Suppose that Assumption 1 holds, and that
$\alpha_k \equiv 1/L_{\max}$ in Algorithm 3. Then for all $k > 0$ we
have

$$E\big[f(x^k)\big] - f^* \le \frac{2 n L_{\max} R_0^2}{k}. \tag{27}$$

When $\sigma > 0$, we have in addition that

$$E\big[f(x^k)\big] - f^* \le
\Big(1 - \frac{\sigma}{n L_{\max}}\Big)^{k} \big(f(x^0) - f^*\big). \tag{28}$$

Proof: By application of Taylor's theorem, together with (21) and
(22), we have

$$
\begin{aligned}
f(x^{k+1}) &= f\big(x^k - \alpha_k [\nabla f(x^k)]_{i_k} e_{i_k}\big) \\
&\le f(x^k) - \alpha_k [\nabla f(x^k)]_{i_k}^2
   + \frac{L_{\max}}{2}\alpha_k^2 [\nabla f(x^k)]_{i_k}^2 \\
&= f(x^k) - \frac{1}{2 L_{\max}} [\nabla f(x^k)]_{i_k}^2,
\end{aligned}
$$

where we substituted the choice $\alpha_k \equiv 1/L_{\max}$ in the
last equality. Taking the expectation of both sides of this expression
over the random index $i_k$, we have

$$E_{i_k}\big[f(x^{k+1})\big]
\le f(x^k) - \frac{1}{2 n L_{\max}} \|\nabla f(x^k)\|^2.$$

We now subtract $f(x^*)$ from both sides of this expression, take the
expectation of both sides with respect to all random variables
$i_0, i_1, i_2, \ldots$, and use the notation

$$\phi_k := E\big[f(x^k)\big] - f^*$$

to obtain

$$\phi_{k+1} \le \phi_k
- \frac{1}{2 n L_{\max}} E\big[\|\nabla f(x^k)\|^2\big]. \tag{32}$$

By convexity of f we have for any $x^* \in S$ that

$$f(x^k) - f^* \le \nabla f(x^k)^T (x^k - x^*)
\le \|\nabla f(x^k)\|\,\|x^k - x^*\| \le R_0\,\|\nabla f(x^k)\|,$$

where the final inequality holds because $f(x^k) \le f(x^0)$, so that
$x^k$ lies in the level set in (26). By taking expectations of both
sides, we obtain

$$E\big[\|\nabla f(x^k)\|^2\big]
\ge \big(E\|\nabla f(x^k)\|\big)^2 \ge \frac{1}{R_0^2}\,\phi_k^2.$$

When we substitute this bound into (32) and rearrange, we obtain

$$\phi_{k+1} \le \phi_k - \frac{1}{2 n L_{\max} R_0^2}\,\phi_k^2.$$

We thus have

$$\frac{1}{\phi_{k+1}} \ge \frac{1}{\phi_k}
+ \frac{1}{2 n L_{\max} R_0^2}.$$

By applying this formula recursively, we obtain

$$\frac{1}{\phi_k} \ge \frac{1}{\phi_0}
+ \frac{k}{2 n L_{\max} R_0^2} \ge \frac{k}{2 n L_{\max} R_0^2},$$

so that (27) holds as claimed.

In the case that f is strongly convex with modulus $\sigma > 0$, we
have, by taking the minimum of both sides with respect to y in (20)
and setting $x = x^k$, that

$$f^* \ge f(x^k) - \frac{1}{2\sigma}\|\nabla f(x^k)\|^2,
\qquad\text{that is,}\qquad
\|\nabla f(x^k)\|^2 \ge 2\sigma\big(f(x^k) - f^*\big).$$

By using this expression to bound $\|\nabla f(x^k)\|^2$ in (32), we
obtain

$$\phi_{k+1} \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)\,\phi_k.$$

Recursive application of this formula leads to (28).

Note that the same convergence expressions can be obtained for more
refined choices of step length $\alpha_k$, by making minor adjustments
to the logic. For example, the choice $\alpha_k \equiv 1/L_{i_k}$
leads to the same bounds; the same bounds hold too when $\alpha_k$ is
the exact minimizer of f along the $i_k$ coordinate direction.

We compare (27) with the corresponding result for full-gradient
descent with constant step length $\alpha_k \equiv 1/L$. The iteration

$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$$

leads to the convergence expression

$$f(x^k) - f^* \le \frac{2 L R_0^2}{k}.$$

Accelerated Randomized Algorithms

Proposed by Nesterov.

Algorithm 4 assumes that an estimate $\sigma \ge 0$ of the modulus of
strong convexity from (20) is available, as well as estimates of the
component-wise Lipschitz constants $L_i$ from (21).

It is closely related to accelerated full-gradient methods.

Algorithm 4
Theorem 2: Suppose that Assumption 1 holds, and define

Then for all $k \ge 0$ we have

In the strongly convex case $\sigma > 0$, the term involving $\sigma$
eventually dominates the second term in brackets in (35), so that the
linear convergence rate suggested by this expression is significantly
faster than the corresponding rate (28) for Algorithm 3.
Efficient Implementation of Accelerated Algorithm

The higher cost of each iteration of Algorithm 4 detracts from the
appeal of accelerated CD methods over standard methods.

However, by using a change of variables due to Lee and Sidford, it is
possible to implement the accelerated randomized CD approach
efficiently for problems with certain structure, including the linear
system Aw = b.

We explain the Lee-Sidford technique in the context of the Kaczmarz
algorithm for (8), assuming normalization of the rows of A (14).

As explained in (16), the Kaczmarz algorithm is obtained by applying
CD to the dual formulation (10) with variables x, but operating in the
space of primal variables w using the transformation $w = A^T x$.

If we apply the transformations $\hat{v}^k = A^T v^k$ and
$\hat{y}^k = A^T y^k$ to the vectors in Algorithm 4, and note that
$L_i \equiv 1$ in (21), we obtain Algorithm 5.
Algorithm 5
When the matrix A is dense, there is only a small constant factor of
difference between the per-iteration workload of the standard Kaczmarz
algorithm and its accelerated variant, Algorithm 5; both require
O(m + n) operations per iteration.

When A is sparse, the computational difference between the two becomes
substantial.

At iteration k, the standard Kaczmarz algorithm requires on the order
of $|A_{i_k\cdot}|$ operations, where $|A_{i_k\cdot}|$ denotes the
number of nonzeros in row $i_k$ of A.

At iteration k of Algorithm 5, the accelerated variant also requires
$O(|A_{i_k\cdot}|)$ operations.
Conclusion
We have surveyed the state of the art in
convergence of coordinate descent methods,
with a focus on the most elementary settings
and the most fundamental algorithms.

Coordinate descent methods have become an important tool in the
optimization toolbox, used to solve problems that arise in machine
learning and data analysis, particularly in big-data settings.
