
Distributed Zero-Order Algorithms for Nonconvex Multi-Agent Optimization

Yujie Tang and Na Li

School of Engineering and Applied Sciences, Harvard University

Abstract
Distributed multi-agent optimization is the core of many applications in distributed learning,
control, estimation, etc. Most existing algorithms assume knowledge of first-order information
of the objective and have been analyzed for convex problems. However, there are situations
where the objective is nonconvex, and one can only evaluate the function values at finitely
many points. In this paper we consider derivative-free distributed algorithms for nonconvex
multi-agent optimization, based on recent progress in zero-order optimization. We develop two
algorithms for different settings, provide detailed analysis of their convergence behavior, and
compare them with existing centralized zero-order algorithms and gradient-based distributed
algorithms.

1 Introduction
Consider a set of n agents connected over a network, each of which is associated with a smooth
local objective function fi that can be nonconvex. The goal is to solve the optimization problem
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$$

with the restriction that fi is only known to agent i and each agent can exchange information only
with its neighbors in the network during the optimization procedure. We focus on the situation
where only zero-order information of fi is available to agent i.
Distributed multi-agent optimization lies at the core of a wide range of applications, and a large
body of literature has been contributed to distributed multi-agent optimization algorithms. One line
of research combines (sub)gradient-based methods with a consensus/averaging scheme. It has been
shown that, for convex functions, the convergence rates of distributed gradient-based algorithms can
match or nearly match those of centralized gradient-based algorithms. Specifically, [1, 2, 3] proposed
and analyzed distributed algorithms with $O(\log t/\sqrt{t})$ convergence for nonsmooth convex functions;
[4, 5, 6] proposed distributed algorithms with $O(1/t)$ convergence for smooth convex functions and
linear convergence for strongly convex functions; [7] employed Nesterov's gradient descent method
and showed $O(1/t^{1.4-\epsilon})$ convergence for smooth convex functions and improved linear convergence
for strongly convex functions. Besides convergence rates, some works have additional focuses such

as time-varying/directed graphs [1, 2, 8, 5, 9, 10], uncoordinated step sizes [11, 12], stochastic
(sub)gradient [13, 14], etc.
While distributed convex optimization has broad applicability, nonconvex problems also appear in important applications such as distributed learning [15], compressed sensing [16], robotic networks [17], operation of wind farms [18], etc., and several works have considered nonconvex multi-agent optimization. [19] studied the behavior of the distributed projected stochastic gradient algorithm via tools from continuous-time dynamical systems. [20] developed distributed algorithms based on the convexification-decomposition technique. [21] established convergence of the distributed push-sum algorithm for nonconvex problems and also proposed perturbations to avoid local maxima. [22] studied decentralized parallel stochastic gradient descent, and showed its $O(1/\sqrt{T})$ convergence rate to stationary points. [23] proposed a decentralized Frank–Wolfe algorithm, and showed its $O(1/\sqrt{t})$ convergence rate of the quantity $\langle\nabla f(\bar{x}(t)),\, \bar{x}(t)-x^*\rangle$. [24] proposed the proximal primal-dual algorithm for distributed nonconvex optimization, and showed its $O(1/t)$ convergence to a stationary point. [25] studied decentralized gradient descent-type algorithms for nonconvex problems, established their convergence to stationary points and also provided consensus rates.
Recently there has been increasing interest in zero-order optimization, where one does not
have access to the gradient of the objective. Such situations can occur, for example, when only
black-box procedures are available for computing the values of the functional characteristics of the
problem, or when resource limitations restrict the use of fast or automatic differentiation techniques.
Many existing works on zero-order optimization are based on constructing gradient estimators using
finitely many function evaluations. [26] proposed and analyzed a single-point gradient estimator,
and [27] further studied the convergence rate of single-point zero-order algorithms for highly smooth
objectives. [28] proposed two-point gradient estimators and showed that the convergence of the resulting algorithms is comparable with that of their first-order counterparts. [29] studied two-point
gradient estimators in stochastic nonconvex zero-order optimization. [30] and [31] showed that for stochastic zero-order convex optimization with two-point gradient estimators, the optimal rate $O(\sqrt{d/N})$ is achievable, where N denotes the number of function value queries. [32] proposed and analyzed a zero-order stochastic Frank–Wolfe algorithm.
Some recent works have also started to combine zero-order and distributed methods. [33] pro-
posed a distributed zero-order algorithm for stochastic nonconvex problems based on the method
of multipliers. [34] proposed a zero-order ADMM algorithm for distributed online convex opti-
mization. [35] proposed a distributed zero-order algorithm over random networks and established
its convergence for strongly convex objectives. [36] considered distributed zero-order methods for
constrained convex optimization. On the other hand, many questions remain to be studied in distributed zero-order optimization, e.g., how zero-order and distributed methods affect each other's performance, and whether their fundamental structural properties can be preserved by tuning the way they are combined. This paper aims to provide answers along this line:
We propose and analyze two zero-order distributed algorithms for deterministic nonconvex opti-
mization, and compare their convergence rates with their distributed first-order and centralized
zero-order counterparts. The first algorithm employs a simple two-point gradient estimator and
only performs consensus on the local decision variables, while the second algorithm uses a 2d-point
gradient estimator and incorporates gradient tracking. The convergence rates of the two algorithms
are summarized in Table 1, and are compared with their distributed first-order and centralized
counterparts. We show that for deterministic nonconvex optimization, the proposed distributed
zero-order algorithms have comparable convergence behavior with their first-order and centralized
counterparts. These results shed light on how zero-order evaluations affect distributed optimization and how the network structure affects zero-order algorithms.
Table 1: Comparison of different algorithms for distributed optimization and zero-order optimization.

| | | smooth | gradient dominated |
| --- | --- | --- | --- |
| this paper (nonconvex) | Alg. 1 | $O\big(\sqrt{d/N}\,\log N\big)$ | $O(d/N)$ |
| | Alg. 2 | $O(d/N)$ | $O\Big(\big(1 - c(1-\rho^2)^2(\mu/L)^{4/3}\big)^{N/d}\Big)$ |
| distributed gradient-based methods | DGD | $O(\log t/\sqrt{t})$ [2] (convex); $O(1/\sqrt{T})$ [22] (nonconvex) | $O(\log t/t)$ [37] (convex) |
| | gradient tracking | $O(1/t)$ [6] (convex) | $O\Big(\big(1 - c(1-\rho)^2(\mu/L)^{3/2}\big)^{t}\Big)$ [5] |
| centralized zero-order | [28] | $O(d/N)$ (nonconvex) | $O\Big(\big(1 - \frac{c}{d}\frac{\mu}{L}\big)^{N}\Big)$ (strongly convex) |

Note: t denotes the number of iterations, N denotes the number of function value queries, d denotes the dimension of the decision variable, and the c's represent numerical constants that can be different for different algorithms.
T denotes the total number of iterations fixed before the optimization procedure; the rate in [22] assumes knowledge of T and uses T to set a constant step size.
The listed convergence rates are the ergodic rates of $\|\nabla f\|^2$ for the smooth case, and the objective error rates for the gradient dominated case, respectively.
We do not include algorithms with Nesterov-type acceleration in this comparison.


Notation: We denote the $\ell_2$-norm by $\|\cdot\|$, and the standard inner product by $\langle x, y\rangle := x^T y$. The standard basis of $\mathbb{R}^d$ will be denoted by $\{e_k\}_{k=1}^d$. The closed unit ball $\{x\in\mathbb{R}^d : \|x\|\le 1\}$ will be denoted by $B^d$, and the unit sphere $\{x\in\mathbb{R}^d : \|x\|=1\}$ will be denoted by $S^{d-1}$. The uniform distributions over $B^d$ and $S^{d-1}$ will be denoted by $U(B^d)$ and $U(S^{d-1})$.

2 Formulation and Algorithms


2.1 Problem Formulation
Let N = {1, 2, . . . , n} denote the set of agents. Suppose the agents are connected by a communi-
cation network, whose topology is represented by a connected graph G = (N , E) where E denotes
the set of edges that represent the communication links. The graph G is assumed to be undirected,

3
meaning that the communication links of the network are bidirectional.
Each agent i is associated with a local objective function fi : Rd → R. The goal of the agents
is to collaboratively solve the optimization problem
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x). \tag{1}$$

We assume that at each time step, agent i can only query the function values of fi at finitely
many points, and can only communicate with its neighbors in the communication network. We also
assume that the queries of the function values are noise-free and error-free.
We use the following definitions of function classes throughout the paper:
Definition 1. 1. A function $f:\mathbb{R}^d\to\mathbb{R}$ is said to be L-smooth if f is continuously differentiable and satisfies $\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|$ for all $x,y\in\mathbb{R}^d$.
2. A function $f:\mathbb{R}^d\to\mathbb{R}$ is said to be G-Lipschitz if $|f(x)-f(y)|\le G\|x-y\|$ for all $x,y\in\mathbb{R}^d$.
3. A function $f:\mathbb{R}^d\to\mathbb{R}$ is said to be µ-gradient dominated¹ if f is differentiable, has a global minimizer $x^*$, and $2\mu(f(x)-f(x^*))\le\|\nabla f(x)\|^2$ for all $x\in\mathbb{R}^d$.
Lipschitz continuity and smoothness are standard assumptions on the objective function in distributed optimization. The concept of gradient domination can be viewed as a nonconvex analog of strong convexity, and is important in the nonconvex optimization literature [39, 38]. It has been observed that nonconvex but gradient-dominated objective functions appear in various applications [40, 41].
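As a concrete one-dimensional illustration (a standard example along the lines of [38]; the constant below is taken from there), the function
$$f(x) = x^2 + 3\sin^2(x)$$
is nonconvex, since $f''(x) = 2 + 6\cos(2x)$ changes sign, yet it is µ-gradient dominated with $x^* = 0$ and $\mu = 1/32$: one can verify that $|f'(x)|^2 = |2x + 3\sin(2x)|^2 \ge \frac{1}{16}\big(f(x) - f(0)\big)$ for all $x\in\mathbb{R}$.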

2.2 Algorithms
We propose consensus-based distributed algorithms for solving Problem (1), where each agent maintains a local copy of the global variable and weighted-averages its neighbors' information to update the local variable. Specifically, we introduce a consensus matrix $W=[W_{ij}]\in\mathbb{R}^{n\times n}$ that satisfies the following assumption:
Assumption 1. 1. $W_{ij}\ge 0$ for all $i,j\in\mathcal{N}$ and $W\mathbf{1}_n = W^T\mathbf{1}_n = \mathbf{1}_n$, i.e., W is a doubly stochastic matrix.
2. $W_{ii}>0$ for all $i\in\mathcal{N}$, and for two distinct agents i and j, $W_{ij}>0$ if and only if $(i,j)\in\mathcal{E}$.
It is a standard result of consensus optimization that, when Assumption 1 is satisfied, we have
$$\rho := \sup_{\|x\|=1} \frac{\|W(x - n^{-1}\mathbf{1}_n\mathbf{1}_n^T x)\|}{\|x - n^{-1}\mathbf{1}_n\mathbf{1}_n^T x\|} < 1.$$
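For a concrete W, the quantity ρ can be computed directly; the following is a minimal sketch (NumPy assumed; `consensus_rate` is an illustrative name, not from the paper). It uses the fact that, for doubly stochastic W, $W\mathbf{1}_n\mathbf{1}_n^T/n = \mathbf{1}_n\mathbf{1}_n^T/n$, so ρ equals the spectral norm of $W - \mathbf{1}_n\mathbf{1}_n^T/n$:

```python
import numpy as np

def consensus_rate(W):
    """rho = ||W (I - 11^T/n)||_2, the contraction factor on the
    disagreement subspace; rho < 1 for W satisfying Assumption 1
    on a connected graph."""
    n = W.shape[0]
    J = np.ones((n, n)) / n
    return np.linalg.norm(W - J, 2)  # since W J = J, W(I - J) = W - J
```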
Because each agent i can only query function values of $f_i$ at finitely many points, we employ techniques from zero-order optimization and introduce the following maps:
$$G_f^{(2)}(x; u, z) := d\,\frac{f(x+uz) - f(x-uz)}{2u}\, z, \tag{2}$$
$$G_f^{(2d)}(x; u) := \sum_{k=1}^{d} \frac{f(x+ue_k) - f(x-ue_k)}{2u}\, e_k. \tag{3}$$
¹ This definition is adopted from [38].

Algorithm 1: 2-point gradient estimator without global gradient tracking

for t = 1, 2, 3, . . . do
    foreach $i \in \mathcal{N}$ do
        1. Generate $z^i(t) \sim U(S^{d-1})$ independently from $(z^i(\tau))_{\tau=1}^{t-1}$ and $(z^j(\tau))_{\tau=1}^{t}$ for $j \ne i$.
        2. Update $x^i(t)$ by
           $$g^i(t) = G_{f_i}^{(2)}\big(x^i(t-1);\, u_t, z^i(t)\big), \tag{4}$$
           $$x^i(t) = \sum_{j=1}^{n} W_{ij}\big(x^j(t-1) - \eta_t\, g^j(t)\big). \tag{5}$$
    end
end
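For concreteness, one iterate of Algorithm 1 can be sketched in Python as follows (NumPy assumed; all identifiers such as `alg1_step` and `fs` are illustrative, and the step-size and smoothing schedules are left to the caller):

```python
import numpy as np

def alg1_step(X, fs, W, eta_t, u_t, rng):
    """One iterate of Algorithm 1. X is n-by-d; row i holds x^i(t-1);
    fs is a list of the local objectives f_i; W is the consensus matrix."""
    n, d = X.shape
    Z = rng.normal(size=(n, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # z^i(t) ~ U(S^{d-1})
    G = np.empty_like(X)
    for i, fi in enumerate(fs):                     # 2-point estimator (4)
        G[i] = d * (fi(X[i] + u_t * Z[i]) - fi(X[i] - u_t * Z[i])) / (2 * u_t) * Z[i]
    return W @ (X - eta_t * G)                      # adapt-then-combine update (5)
```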

The map $G_f^{(2d)}(x;u)$ approximates $\nabla f(x)$ by difference quotients along d orthogonal directions, and can be viewed as a noise-free version of the Kiefer–Wolfowitz type method [42]. The following proposition establishes the rationale for employing $G_f^{(2)}(x;u,z)$ as an estimator of $\nabla f(x)$ when z is properly randomly generated:
Proposition 1 ([26]). Suppose $f:\mathbb{R}^d\to\mathbb{R}$ is continuous. Then for any $u>0$ and $x\in\mathbb{R}^d$,
$$\mathbb{E}_{z\sim U(S^{d-1})}\big[G_f^{(2)}(x;u,z)\big] = \nabla f^u(x),$$
where $f^u(x) := \mathbb{E}_{y\sim U(B^d)}[f(x+uy)]$.

Basically, Proposition 1 indicates that when z is randomly generated from the sphere $S^{d-1}$, the expectation of $G_f^{(2)}(x;u,z)$ is the gradient of a "smoothed version" of f.
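Proposition 1 is easy to sanity-check numerically. A minimal sketch (NumPy; the toy objective and sample size are chosen purely for illustration) uses $f(x) = \frac{1}{2}\|x\|^2$, for which $\nabla f^u(x) = \nabla f(x) = x$, so the sample mean of $G_f^{(2)}(x;u,z)$ should approach x:

```python
import numpy as np

rng = np.random.default_rng(0)
d, u = 5, 0.1
x = rng.normal(size=d)
f = lambda v: 0.5 * np.dot(v, v)     # toy smooth objective

Z = rng.normal(size=(50000, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # z ~ U(S^{d-1})
est = np.mean([d * (f(x + u*z) - f(x - u*z)) / (2*u) * z for z in Z], axis=0)
print(np.linalg.norm(est - x))       # small: the estimator is unbiased for grad f^u
```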
We propose two distributed algorithms for Problem (1) based on the gradient estimators (2)
and (3):
1. Algorithm 1 employs the 2-point gradient estimator (2) in which z is independently sampled
from the uniform distribution $U(S^{d-1})$, and only involves consensus on the local decision variables, similarly to the decentralized (sub)gradient descent (DGD) method [1, 2].
2. Algorithm 2 employs the 2d-point gradient estimator (3) and also introduces auxiliary local
variables si (t) for gradient tracking. We shall see in Theorem 3 that si (t) converges to the gradient
of the global objective function as t → ∞ under mild conditions.
Note that here we employ the adapt-then-combine (ATC) strategy [43] for both algorithms,
which is a commonly used variant for consensus optimization.

Algorithm 2: 2d-point gradient estimator with global gradient tracking

Set $s^i(0) = g^i(0) = 0$ for each $i \in \mathcal{N}$.
for t = 1, 2, 3, . . . do
    foreach $i \in \mathcal{N}$ do
        1. Update $s^i(t)$ by
           $$g^i(t) = G_{f_i}^{(2d)}\big(x^i(t-1);\, u_t\big), \tag{6}$$
           $$s^i(t) = \sum_{j=1}^{n} W_{ij}\, s^j(t-1) + g^i(t) - g^i(t-1). \tag{7}$$
        2. Update $x^i(t)$ by
           $$x^i(t) = \sum_{j=1}^{n} W_{ij}\big(x^j(t-1) - \eta\, s^j(t)\big). \tag{8}$$
    end
end
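A corresponding sketch of one Algorithm 2 iterate (again NumPy-based, with illustrative names only; `S` and `G_prev` are initialized to zero as in the algorithm):

```python
import numpy as np

def grad_est_2d(fi, x, u):
    """2d-point estimator (6): central differences along each coordinate."""
    d = x.size
    g = np.empty(d)
    for k in range(d):
        e = np.zeros(d); e[k] = u
        g[k] = (fi(x + e) - fi(x - e)) / (2 * u)
    return g

def alg2_step(X, S, G_prev, fs, W, eta, u_t):
    """One iterate of Algorithm 2 with gradient tracking; rows index agents."""
    G = np.stack([grad_est_2d(fi, x, u_t) for fi, x in zip(fs, X)])
    S = W @ S + G - G_prev        # gradient tracking update (7)
    X = W @ (X - eta * S)         # adapt-then-combine update (8)
    return X, S, G
```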

3 Main Results
3.1 Convergence of Algorithm 1
Let $x^i(t)$ denote the sequence generated by Algorithm 1, where the sequence of step sizes $\eta_t$ is positive and non-increasing. Denote
$$\bar{x}(t) := \frac{1}{n}\sum_{i=1}^n x^i(t), \qquad R_0 := \sum_{i=1}^n \|x^i(0)-\bar{x}(0)\|^2.$$

We first analyze the case with general nonconvex but smooth objectives.
Theorem 1. Assume that each local objective function $f_i$ is uniformly G-Lipschitz and L-smooth for some positive constants G and L, and that $f^* := \inf_{x\in\mathbb{R}^d} f(x) > -\infty$.
1. Suppose $\eta_1 L \le 1/4$, $\sum_{t=1}^\infty \eta_t = +\infty$, $\sum_{t=1}^\infty \eta_t^2 < +\infty$, and $\sum_{t=1}^\infty \eta_t u_t^2 < +\infty$. Then almost surely, $\|x^i(t)-\bar{x}(t)\|$ converges to zero for all $i\in\mathcal{N}$, $\nabla f(\bar{x}(t))$ converges to zero, and the function value $f(\bar{x}(t))$ converges as $t\to\infty$.
2. Suppose now that
$$\eta_t = \frac{\alpha_\eta}{4L\sqrt{d}}\cdot\frac{1}{t^\beta}, \qquad u_t \le \frac{\alpha_u G}{L\sqrt{d}}\cdot\frac{1}{t^{(\gamma-\beta)/2}}$$
with $\alpha_\eta\in(0,1]$, $\alpha_u\ge 0$, $\beta\in(1/2,1)$ and $\gamma>1$. Then
$$\frac{\sum_{\tau=0}^{t-1}\eta_{\tau+1}\,\mathbb{E}\big[\|\frac{1}{G}\nabla f(\bar{x}(\tau))\|^2\big]}{\sum_{\tau=0}^{t-1}\eta_{\tau+1}} \le \frac{(1-\beta)\sqrt{d}}{t^{1-\beta}}\Bigg[\frac{16(f(\bar{x}(0))-f^*)}{\alpha_\eta G^2/L} + \frac{12 R_0 L^2/G^2}{n(1-\rho^2)\sqrt{d}} + \frac{18\alpha_\eta^2\kappa^2\rho^2}{(6\beta-3)n^2} + \frac{4\alpha_\eta}{\sqrt{d}(1-\rho^2)^2} + \frac{9\alpha_u^2\gamma}{2(\gamma-1)}\Bigg] + o\Big(\frac{1}{t^{1-\beta}}\Big), \tag{9}$$
and
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|x^i(t)-\bar{x}(t)\|^2\big] \le \frac{\alpha_\eta^2\kappa^2\rho^2\, G^2/L^2}{(1-\rho^2)^2\, t^{2\beta}} + o(t^{-2\beta}). \tag{10}$$
3. Suppose now that
$$\eta_t = \frac{\alpha_\eta}{4L\sqrt{d}}\cdot\frac{1}{\sqrt{t}}, \qquad u_t \le \frac{\alpha_u G}{L\sqrt{d}}\cdot\frac{1}{t^{\gamma/2-1/4}}$$
with $\alpha_\eta\in(0,1]$, $\alpha_u\ge 0$ and $\gamma>1$, and that every agent starts from the same initial point. Then almost surely, $\|x^i(t)-\bar{x}(t)\|$ converges to zero for all i, and $\liminf_{t\to\infty}\|\nabla f(\bar{x}(t))\| = 0$. Furthermore, we have
$$\frac{\sum_{\tau=0}^{t-1}\eta_{\tau+1}\,\mathbb{E}\big[\|\frac{1}{G}\nabla f(\bar{x}(\tau))\|^2\big]}{\sum_{\tau=0}^{t-1}\eta_{\tau+1}} \le \sqrt{\frac{d}{t}}\Bigg[\frac{\alpha_\eta}{3n^2}\log(2t+1) + \frac{8(f(\bar{x}(0))-f^*)}{\alpha_\eta G^2/L} + \frac{6 R_0 L^2/G^2}{n(1-\rho^2)\sqrt{d}} + \frac{9\alpha_\eta^2\kappa^2\rho^2}{(1-\rho^2)^2\sqrt{d}} + \frac{9\alpha_u^2\gamma}{4(\gamma-1)}\Bigg] + o\Big(\frac{1}{\sqrt{t}}\Big), \tag{11}$$
and
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|x^i(t)-\bar{x}(t)\|^2\big] \le \frac{\alpha_\eta^2\kappa^2\rho^2\, G^2/L^2}{(1-\rho^2)^2\, t} + o(t^{-1}). \tag{12}$$

Remark 1. Theorem 1 uses the squared norm of the gradient to assess the sub-optimality of the
iterates, which is common for unconstrained nonconvex problems where we do not aim for globally optimal solutions [44, 28]. While Theorem 1 does not exclude the possibility of converging to saddle
points, recent works [45, 46] show that saddle points can be avoided almost surely with a proper
random initialization for many first-order methods, and we conjecture that the proposed algorithms
may also share this property. Rigorous analysis is left for future work.
Remark 2. Each iteration of Algorithm 1 requires 2 queries of function values. Thus the convergence rates (9) and (11) can also be interpreted as $O(\sqrt{d/N^{1-\beta}})$ and $O(\sqrt{d/N}\log N)$ respectively, where N denotes the number of function value queries. Characterizing convergence rates in terms of the number of function value queries N and the dimension d is conventional for zero-order optimization. In scenarios where zero-order methods are applied, the computation of the function values is usually one of the most time-consuming procedures. In addition, how the convergence of the algorithm scales with the problem dimension d is also of interest.
The following result shows that for a gradient dominated global objective, a faster convergence
rate can be achieved by Algorithm 1.
Theorem 2. Assume that each local objective function $f_i$ is uniformly G-Lipschitz and L-smooth for some positive constants G and L. Furthermore, assume that the global objective function f is µ-gradient dominated and has a minimum value denoted by $f^*$. Suppose
$$\eta_t = \frac{2\alpha_\eta}{\mu(t+t_0)}, \qquad t_0 \ge \frac{8\alpha_\eta L}{\mu} - 1, \qquad u_t \le \frac{\alpha_u G}{L}\cdot\frac{1}{\sqrt{t+1}}$$
for some $\alpha_\eta > 1$ and $\alpha_u \ge 0$. Then
$$\mathbb{E}[f(\bar{x}(t)) - f^*] \le \frac{8\alpha_\eta L}{\mu(\alpha_\eta-1)}\cdot\Big(\frac{\alpha_\eta G^2 d}{3n^2\mu} + 2\alpha_u^2\Big)\cdot\frac{1}{t+1} + o(t^{-1}), \tag{13}$$
and
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|x^i(t)-\bar{x}(t)\|^2\big] \le \frac{64\alpha_\eta^2\kappa^2\rho^2 G^2 d}{\mu^2(1-\rho^2)^2}\cdot\frac{1}{t^2} + o(t^{-2}). \tag{14}$$
Remark 3. The convergence rate (13) can also be described as E[f (x̄(t)) − f ∗ ] = O(d/N ), where N
is the number of function value queries.
Table 1 shows that, while Algorithm 1 employs a randomized 2-point zero-order estimator of ∇fi ,
its convergence rates are comparable with its gradient-based counterpart, the decentralized gradient
descent (DGD) algorithm [22, 37]. However, its convergence rates are inferior to its centralized zero-
order counterpart in [28].

3.2 Convergence of Algorithm 2


Let $(x^i(t), s^i(t))$ denote the sequence generated by Algorithm 2 with a constant step size η. Denote
$$\bar{x}(t) := \frac{1}{n}\sum_{i=1}^n x^i(t), \qquad R_0 := \frac{1}{n}\sum_{i=1}^n\bigg(\frac{\eta\rho^2}{2L}\|\nabla f_i(x^i(0))\|^2 + \|x^i(0)-\bar{x}(0)\|^2 + \frac{\eta\rho^2 u_1^2 L d}{4}\bigg).$$
We first analyze the case where the local objectives are nonconvex and smooth.
Theorem 3. Assume that each local objective function $f_i$ is uniformly L-smooth for some positive constant L, and that $f^* := \inf_{x\in\mathbb{R}^d} f(x) > -\infty$. Suppose
$$\eta L \le \min\bigg\{\frac{1}{6},\ \frac{(1-\rho^2)^2}{4\rho^2(3+4\rho^2)}\bigg\} \qquad\text{and}\qquad R_u := d\sum_{t=1}^\infty u_t^2 < +\infty.$$
Then $f(\bar{x}(t))$ converges,
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\|\nabla f(\bar{x}(\tau))\|^2 \le \frac{1}{t}\bigg(\frac{3.2(f(\bar{x}(0))-f^*)}{\eta} + \frac{12.8\, L^2 R_0}{1-\rho^2} + 2.4\, R_u L^2\bigg), \tag{15}$$
and
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\frac{1}{n}\sum_{i=1}^n\|x^i(\tau)-\bar{x}(\tau)\|^2 \le \frac{1}{t}\Big(1.6\,\eta(f(\bar{x}(0))-f^*) + 6.4\,(\eta L)^2 R_0 + 0.35\, R_u\Big), \tag{16}$$
$$\frac{1}{t}\sum_{\tau=1}^{t}\frac{1}{n}\sum_{i=1}^n\|s^i(\tau)-\nabla f(\bar{x}(\tau-1))\|^2 \le \frac{1}{t}\Big(9.6\, L(f(\bar{x}(0))-f^*) + 38.4\,\eta L^3 R_0 + \frac{2.35}{\eta}\, L R_u\Big). \tag{17}$$

Remark 4. Theorem 3 shows that Algorithm 2 achieves a convergence rate of O(1/t) in terms of the averaged squared norm of $\nabla f(\bar{x}(t))$, and has an O(1/t) rate for both the averaged squared consensus error $\|x^i(t)-\bar{x}(t)\|^2$ and the averaged squared gradient tracking error $\|s^i(t)-\nabla f(\bar{x}(t-1))\|^2$. These match the rates for distributed convex optimization with gradient tracking [6]². On the other hand, since each iteration requires 2d queries of function values, we get an O(d/N) rate in terms of the number of function value queries N. This matches the convergence rate of centralized zero-order algorithms without Nesterov-type acceleration [28].
² Existing convergence rates of gradient tracking algorithms are mainly objective error rates for convex problems. On the other hand, notice that L-smoothness of f implies $\|\nabla f(x)\|^2 \le 2L(f(x)-f^*)$. Therefore this statement should be interpreted as follows: both Algorithm 2 and the gradient tracking algorithm in [6] achieve the O(1/t) ergodic rate of $\|\nabla f\|^2$ for smooth non-strongly convex problems.

Now we proceed to the situation with a gradient dominated global objective.
Theorem 4. Assume that each local objective function $f_i$ is uniformly L-smooth for some positive constant L, and that the global objective function f is µ-gradient dominated and achieves its global minimum at $x^*$. Suppose the step size η satisfies
$$\eta L = \alpha\Big(\frac{\mu}{L}\Big)^{1/3}\frac{(1-\rho^2)^2}{14}$$
for some $\alpha\in(0,1]$, and let $\lambda := 1 - \frac{\alpha^2}{5}(1-\rho^2)^2\big(\frac{\mu}{L}\big)^{4/3}$. Then
$$f(\bar{x}(t)) - f(x^*) \le O(\lambda^t) + 5(1-\rho^2)Ld\sum_{\tau=0}^{t-1}\lambda^\tau u_{t-\tau}^2, \tag{18}$$
and
$$\frac{1}{n}\sum_{i=1}^n\|x^i(t)-\bar{x}(t)\|^2 \le O(\lambda^t) + \frac{3\alpha(1-\rho^2)}{10\sqrt{2}}\Big(\frac{\mu}{L}\Big)^{1/3} d\sum_{\tau=0}^{t-1}\lambda^\tau u_{t-\tau}^2, \tag{19}$$
$$\frac{1}{n}\sum_{i=1}^n\|s^i(t)-\nabla f(\bar{x}(t-1))\|^2 \le O(\lambda^t) + \frac{7\sqrt{2}}{5(1-\rho^2)}\, L^2 d\sum_{\tau=0}^{t-1}\lambda^\tau u_{t-\tau}^2. \tag{20}$$

Remark 5. If we use an exponentially decreasing sequence $u_t \propto \tilde{\lambda}^{t/2}$ with $\tilde{\lambda} < \lambda$, then the objective error $f(\bar{x}(t))-f(x^*)$ and the errors $\|x^i(t)-\bar{x}(t)\|^2$ and $\|s^i(t)-\nabla f(\bar{x}(t-1))\|^2$ all achieve the exponential convergence rate $O(\lambda^t)$, or $O(\lambda^{N/d})$ in terms of the number of function value queries. In addition, we notice that the decaying factor λ given by Theorem 4 has a better dependence on µ/L than in [5] for convex problems. We point out that this is not a result of using zero-order techniques, but rather of a more refined analysis of the gradient tracking procedure.

3.3 Comparison of the Two Algorithms


We see from the above results that, in theory, Algorithm 2 converges faster than Algorithm 1 asymptotically as $N\to\infty$. However, each iteration of Algorithm 2 makes progress only after 2d queries of function values, which could be an issue if d is very large. By contrast, each iteration of Algorithm 1 requires only 2 function value queries, meaning that progress can be made relatively immediately without exploring all d dimensions. This observation suggests that, neglecting communication delays, Algorithm 1 is more favorable for high-dimensional problems, while Algorithm 2 handles problems of relatively low dimension better, with faster convergence. On the other hand, the number of local information exchanges per function value query for Algorithm 1 is d/2 times as large as that for Algorithm 2. This suggests that the rate of communication between agents can have a larger impact on the performance of Algorithm 1 than on that of Algorithm 2 in practice.

3.4 Comparison with Existing Algorithms


In this subsection, we compare our algorithms and results with existing literature on distributed
zero-order optimization, specifically [33, 34, 35, 36].

1. References [34, 35, 36] discuss convex problems, while [33] and our work focus on nonconvex
problems.
2. In terms of the assumptions on the noisy function queries, [36] and our work consider a noise-free
case. [33] considers stochastic queries but two function evaluations are available for each random
sample. [34] considers an online setting but two online function evaluations are available for each
time step. [35] assumes that independent noises are added on each query of function values. We
expect that our proposed Algorithm 1 can be generalized to the setting used in [33] at the cost of a heavier mathematical exposition. Extensions to general stochastic cases remain ongoing work.
3. In terms of the approach to reach consensus among agents, our algorithms are similar to [35, 36],
where some form of weighted average of the neighbors’ local variables is utilized, while [33] uses the
method of multipliers and [34] uses ADMM to design their algorithms. We also mention that our Algorithm 2 employs the gradient tracking technique, which, to the best of our knowledge, has not been discussed in the existing literature on distributed zero-order optimization.
4. Regarding the convergence rates for nonconvex optimization, [33] proves that its proposed algorithm achieves an O(1/T) rate if each iteration also employs O(T) function value queries, where T is the number of iterations. Therefore, in terms of the number of function value queries, its convergence rate is in fact $O(1/\sqrt{N})$, which is roughly comparable with Algorithm 1 and slower than Algorithm 2 in our paper. Furthermore, [33] does not discuss the dependence on the problem dimension d. Moreover, our algorithms only require constant numbers (2 or 2d) of function value queries per iteration, which is more appealing for practical implementation when T is set to be very large to achieve sufficiently accurate solutions.

4 Numerical Example
We consider a phase retrieval problem formulated as
$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n f_i(x), \qquad f_i(x) := \frac{1}{m}\sum_{k=1}^m\big(y_{ik}^2 - |a_{ik}^T x|^2\big)^2. \tag{21}$$
We generate the complex vectors $a_{ik} = a_{ik}^R + \mathrm{i}\, a_{ik}^I$ such that $(a_{ik}^R, a_{ik}^I) \sim \mathcal{N}(0, \frac{1}{2} I_{2d})$, and they are independent of each other. The scalars $y_{ik}$ are generated by
$$y_{ik} = |a_{ik}^T x^\star| + \varepsilon_{ik},$$
where $x^\star = (1, 0, 0, \ldots, 0)$, and $\varepsilon_{ik} \sim \mathcal{N}(0, 0.01^2)$ are independent Gaussian noises.
We set the dimension to be d = 64, the number of agents to be n = 50, and set m = 30. The graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ is generated by uniformly randomly sampling n points on $S^2$, and then connecting pairs of points with spherical distance less than π/4. The Metropolis–Hastings weights [47] are employed for constructing W:
$$W_{ij} = \begin{cases} \dfrac{1}{1+\max\{\deg(i),\, \deg(j)\}}, & (i,j)\in\mathcal{E},\\[1.5ex] 1 - \displaystyle\sum_{k:(i,k)\in\mathcal{E}} W_{ik}, & i = j,\\[1.5ex] 0, & \text{otherwise}, \end{cases}$$

Figure 1: Comparison of Algorithm 1 and Algorithm 2.

where deg(i) denotes the degree of vertex i. We randomly sample a number of different initial points x(0) from the distribution $\mathcal{N}(0, \frac{1}{d} I_{nd})$, and test Algorithm 1 and Algorithm 2 starting from these initial points.
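The random geometric graph and the Metropolis–Hastings weight matrix above can be constructed, for instance, as follows (a NumPy sketch with illustrative names; in the rare event that the sampled graph is disconnected, one would resample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
P = rng.normal(size=(n, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)   # n points uniform on S^2
D = np.arccos(np.clip(P @ P.T, -1.0, 1.0))      # pairwise spherical distances
A = (D < np.pi / 4) & ~np.eye(n, dtype=bool)    # adjacency: distance < pi/4

deg = A.sum(axis=1)
W = np.where(A, 1.0 / (1.0 + np.maximum.outer(deg, deg)), 0.0)
W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)     # Metropolis-Hastings weights
```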
Figure 1 illustrates the convergence of $\|\nabla f(\bar{x}(t))\|^2$ for Algorithms 1 and 2 starting from the same randomly generated initial point. The light blue curves represent the results of 10 random instances of Algorithm 1, and the dark blue curve represents their average. The horizontal axis has been normalized to the number of function value queries N. It can be seen that Algorithm 1 converges faster during the initial stage, but then slows down and converges at a relatively stable sublinear rate; Algorithm 2 converges relatively slowly initially, but its convergence rate does not change much after $N \gtrsim 5\times 10^3$, and it finally achieves a smaller squared gradient norm once $N \gtrsim 2.6\times 10^4$. Therefore, if the total number of function value queries is limited to $N \lesssim 2.6\times 10^4$, then Algorithm 1 gives better performance despite its slower asymptotic convergence rate, while if more function value queries are allowed, then Algorithm 2 could be favored. This is consistent with the discussion in Section 3.3.
We refer to the appendix for more numerical results.

5 Conclusion
We proposed two distributed zero-order algorithms for nonconvex multi-agent optimization, estab-
lished theoretical results on their convergence rates, and showed that they achieve comparable
performance with their distributed gradient-based or centralized zero-order counterparts. We also
provided a brief discussion on how the dimension of the problem and rate of communication will
affect their performance in practice.
We point out some future directions that are worth exploring:
1. In this work, we assume that fi (x) can be evaluated without noise or error, which can limit the
applicability of the results here considering that in many practical scenarios the function values are

obtained through some noisy measurement procedure. We are interested in investigating distributed
zero-order algorithms in this situation.
2. Some recent works [45, 46, 48] show that modified centralized first-order methods with proper random initialization can escape saddle points efficiently. As the algorithms proposed here are based
on first-order methods, we are interested in whether these results can be extended to distributed
zero-order methods.
3. It is interesting to see whether techniques for distributed optimization over time-varying directed
graphs can be applied and give similar performance guarantees.
4. As discussed in Section 3.3, there is a trade-off between convergence rate and the ability to handle high-dimensional problems for the two proposed algorithms. By contrast, the centralized zero-order algorithm in [28] is able to handle high-dimensional problems without sacrificing convergence rate. We are interested in whether this gap between distributed and centralized zero-order
algorithms can be mitigated.

References
[1] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[2] I-An Chen. Fast distributed first-order methods. Master’s thesis, Massachusetts Institute of
Technology, 2012.
[3] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed
optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
[4] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for
decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
[5] Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed
optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
[6] Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE
Transactions on Control of Network Systems, 5(3):1245–1260, 2018.
[7] Guannan Qu and Na Li. Accelerated distributed Nesterov gradient descent, 2017. arXiv preprint
arXiv:1705.07176.
[8] Angelia Nedić and Alex Olshevsky. Distributed optimization over time-varying directed graphs.
IEEE Transactions on Automatic Control, 60(3):601–615, 2014.
[9] Chenguang Xi, Ran Xin, and Usman A Khan. Add-opt: Accelerated distributed directed
optimization. IEEE Transactions on Automatic Control, 63(5):1329–1339, 2018.
[10] Ran Xin and Usman A Khan. A linear algorithm for optimization over directed graphs with
geometric convergence. IEEE Control Systems Letters, 2(3):315–320, 2018.

[11] Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient
methods for multi-agent optimization under uncoordinated constant stepsizes. In Proceedings
of the 54th IEEE Conference on Decision and Control (CDC), pages 2055–2060, 2015.
[12] Angelia Nedić, Alex Olshevsky, Wei Shi, and César A Uribe. Geometrically convergent dis-
tributed optimization with uncoordinated step-sizes. In 2017 American Control Conference
(ACC), pages 3950–3955. IEEE, 2017.
[13] S Sundhar Ram, Angelia Nedić, and Venugopal V Veeravalli. Distributed stochastic subgradient
projection algorithms for convex optimization. Journal of optimization theory and applications,
147(3):516–545, 2010.
[14] Shi Pu and Angelia Nedić. A distributed stochastic gradient tracking method. In Proceedings
of the 57th IEEE Conference on Decision and Control (CDC), pages 963–968, 2018.
[15] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian.
Deep decentralized multi-task multi-agent reinforcement learning under partial observability.
In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Pro-
ceedings of Machine Learning Research, pages 2681–2690. PMLR, 2017.

[16] Stacy Patterson, Yonina C Eldar, and Idit Keidar. Distributed compressed sensing for static
and time-varying networks. IEEE Transactions on Signal Processing, 62(19):4931–4946, 2014.
[17] Benjamin Charrow, Nathan Michael, and Vijay Kumar. Cooperative multi-robot estimation
and control for radio source localization. The International Journal of Robotics Research,
33(4):569–580, 2014.
[18] Jason R. Marden, Shalom D. Ruben, and Lucy Y. Pao. A model-free approach to wind farm
control using game theoretic methods. IEEE Transactions on Control Systems Technology,
21(4):1207–1214, 2013.
[19] Pascal Bianchi and Jérémie Jakubowicz. Convergence of a multi-agent projected stochas-
tic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control, 58(2):391–405, 2012.
[20] Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE
Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.
[21] Tatiana Tatarenko and Behrouz Touri. Non-convex distributed optimization. IEEE Transac-
tions on Automatic Control, 62(8):3744–3757, 2017.
[22] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decen-
tralized algorithms outperform centralized algorithms? a case study for decentralized parallel
stochastic gradient descent. In Proceedings of the 31st International Conference on Neural
Information Processing Systems, NIPS’17, pages 5336–5346, 2017.
[23] Hoi-To Wai, Jean Lafond, Anna Scaglione, and Eric Moulines. Decentralized Frank–Wolfe
algorithm for convex and nonconvex problems. IEEE Transactions on Automatic Control,
62(11):5522–5537, 2017.

[24] Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual
algorithm for fast distributed nonconvex optimization and learning over networks. In Proceed-
ings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 1529–1538. PMLR, 2017.
[25] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.
[26] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex op-
timization in the bandit setting: gradient descent without a gradient. In Proceedings of the
Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
[27] Francis Bach and Vianney Perchet. Highly-smooth zero-th order online optimization. In
29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning
Research, pages 257–283. PMLR, 2016.
[28] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.
Foundations of Computational Mathematics, 17(2):527–566, 2017.
[29] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex
stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[30] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates
for zero-order convex optimization: The power of two function evaluations. IEEE Transactions
on Information Theory, 61(5):2788–2806, 2015.
[31] Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-
point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.
[32] Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar. Towards gradient free and projection
free stochastic optimization, 2018. arXiv preprint arXiv:1810.03233.
[33] Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. Zeroth order nonconvex multi-agent
optimization over networks, 2017. arXiv preprint arXiv:1710.09997.
[34] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction
method of multipliers: Convergence analysis and applications. In Proceedings of the Twenty-
First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings
of Machine Learning Research, pages 288–297. PMLR, 2018.
[35] Anit Kumar Sahu, Dusan Jakovetic, Dragana Bajovic, and Soummya Kar. Distributed zeroth order optimization over random networks: A Kiefer–Wolfowitz stochastic approximation
approach. In 2018 IEEE Conference on Decision and Control (CDC), pages 4951–4958. IEEE,
2018.
[36] Zhan Yu, Daniel W. C. Ho, and Deming Yuan. Distributed randomized gradient-free mirror
descent algorithm for constrained optimization, 2019. arXiv preprint arXiv:1903.04157.
[37] Angelia Nedić and Alex Olshevsky. Stochastic gradient-push for strongly convex functions on
time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947,
2016.

[38] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-
gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases, pages 795–811, 2016.
[39] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. USSR Computational
Mathematics and Mathematical Physics, 3(4):864–878, 1963.
[40] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator. In Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
pages 1467–1476. PMLR, 2018.
[41] Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep
linear neural networks, 2018. arXiv preprint arXiv:1809.08587.
[42] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression
function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[43] Ali H Sayed. Diffusion adaptation over networks. In Academic Press Library in Signal Pro-
cessing, volume 3, pages 323–453. Elsevier, 2014.
[44] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business
Media, 2006.
[45] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online
stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–
842, 2015.
[46] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and
Benjamin Recht. First-order methods almost always avoid saddle points, 2017. arXiv preprint
arXiv:1710.07406.
[47] Lin Xiao, Stephen Boyd, and Sanjay Lall. A scheme for robust distributed sensor fusion based
on average consensus. In 2005 Fourth International Symposium on Information Processing in
Sensor Networks, pages 63–70, 2005.
[48] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to
escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine
Learning - Volume 70, ICML’17, pages 1724–1732. JMLR.org, 2017.
[49] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost su-
permartingales and some applications. In Optimizing methods in statistics, pages 233–257.
Elsevier, 1971.

Appendix
The set of positive integers will be denoted by $\mathbb{Z}_+$. We use $I_d$ to denote the d-dimensional identity matrix, and $\mathbf{1}_n$ to denote the n-dimensional vector whose entries are all 1.
For two matrices $A=[a_{ij}]\in\mathbb{R}^{p\times q}$ and $B=[b_{ij}]\in\mathbb{R}^{r\times s}$, their tensor product $A\otimes B$ is defined by
$$A\otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B\\ a_{21}B & a_{22}B & \cdots & a_{2q}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{p1}B & a_{p2}B & \cdots & a_{pq}B \end{bmatrix} \in \mathbb{R}^{pr\times qs}.$$
Let $W\in\mathbb{R}^{n\times n}$ be a consensus matrix that satisfies Assumption 1 in the main text, and denote
$$\rho := \sup_{\|x\|=1} \frac{\|W(x - n^{-1}\mathbf{1}_n\mathbf{1}_n^T x)\|}{\|x - n^{-1}\mathbf{1}_n\mathbf{1}_n^T x\|}.$$
The following lemma is a standard result in consensus optimization.
Lemma 1. We have ρ < 1 when $\mathcal{G}$ is a connected undirected graph. Moreover, for any $x^1,\ldots,x^n\in\mathbb{R}^d$, we have
$$\|(W\otimes I_d)(x - \mathbf{1}_n\otimes\bar{x})\| \le \rho\,\|x - \mathbf{1}_n\otimes\bar{x}\|,$$
where we denote
$$x = \begin{bmatrix} x^1\\ \vdots\\ x^n \end{bmatrix}, \qquad \bar{x} = \frac{1}{n}(\mathbf{1}_n\otimes I_d)^T x = \frac{1}{n}\sum_{i=1}^n x^i.$$
The following lemma provides a useful property of smooth functions.
Lemma 2. Suppose $f:\mathbb{R}^d\to\mathbb{R}$ is L-smooth and attains its global minimum at $x^*\in\mathbb{R}^d$. Then $\|\nabla f(x)\|^2 \le 2L(f(x)-f(x^*))$.
Proof. The L-smoothness of f implies
$$f(x^*) \le f\big(x - L^{-1}\nabla f(x)\big) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2.$$

The following lemma will be used to establish convergence of the proposed algorithms.
Lemma 3 ([49]). Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $(\mathcal{F}_t)_{t\in\mathbb{Z}_+}$ be a filtration. Let U(t), ξ(t) and ζ(t) be nonnegative $\mathcal{F}_t$-measurable random variables for $t\in\mathbb{Z}_+$ such that
$$\mathbb{E}\big[U(t+1)\,\big|\,\mathcal{F}_t\big] \le U(t) + \xi(t) - \zeta(t), \qquad \forall t = 1, 2, \ldots$$
Then almost surely on the event $\big\{\sum_{t=1}^\infty \xi(t) < +\infty\big\}$, U(t) converges to a random variable and $\sum_{t=1}^\infty \zeta(t) < +\infty$.
As a special case, let $U_t$, $\xi_t$ and $\zeta_t$ be (deterministic) nonnegative sequences for $t\in\mathbb{Z}_+$ such that
$$U_{t+1} \le U_t + \xi_t - \zeta_t,$$
with $\sum_{t=1}^\infty \xi_t < +\infty$. Then $U_t$ converges and $\sum_{t=1}^\infty \zeta_t < +\infty$.

We will also use the following inequalities:
$$\sum_{t=t_1}^{t_2}\frac{1}{t^\epsilon} \ge \int_{t_1}^{t_2+1}\frac{ds}{s^\epsilon} = \frac{(t_2+1)^{1-\epsilon} - t_1^{1-\epsilon}}{1-\epsilon}, \tag{22}$$
and
$$\sum_{t=t_1}^{t_2}\frac{1}{t^\epsilon} \le \begin{cases} 1 + \displaystyle\int_{3/2}^{t_2+1/2}\frac{ds}{s^\epsilon} = 1 + \dfrac{(t_2+1/2)^{1-\epsilon} - (3/2)^{1-\epsilon}}{1-\epsilon}, & t_1 = 1,\\[2ex] \displaystyle\int_{t_1-1/2}^{t_2+1/2}\frac{ds}{s^\epsilon} = \dfrac{(t_2+1/2)^{1-\epsilon} - (t_1-1/2)^{1-\epsilon}}{1-\epsilon}, & t_1 > 1, \end{cases} \tag{23}$$
where $\epsilon>0$ and $\epsilon\ne 1$, and
$$\ln\frac{t_2+1}{t_1} = \int_{t_1}^{t_2+1}\frac{ds}{s} \le \sum_{t=t_1}^{t_2}\frac{1}{t} \le \int_{t_1-1/2}^{t_2+1/2}\frac{ds}{s} = \ln\frac{2t_2+1}{2t_1-1}. \tag{24}$$
Especially, when $\epsilon>1$, we have
$$\sum_{t=1}^\infty\frac{1}{t^\epsilon} \le 1 + \int_{3/2}^\infty\frac{ds}{s^\epsilon} = 1 + \frac{1}{(\epsilon-1)(3/2)^{\epsilon-1}} \le \frac{\epsilon}{\epsilon-1}. \tag{25}$$

Analysis of Algorithm 1
Let $(\mathcal{F}_t)_{t\in\mathbb{Z}_+}$ be a filtration to which $(z^i(t), x^i(t) : i\in\mathcal{N})_{t\in\mathbb{Z}_+}$ is adapted. We will extensively use the following properties of the distribution $U(S^{d-1})$:
$$\mathbb{E}_{z\sim U(S^{d-1})}\big[d\,\langle g, z\rangle^2\big] = \|g\|^2, \qquad \mathbb{E}_{z\sim U(S^{d-1})}\big[d\,\langle g, z\rangle z\big] = g, \tag{26}$$
for any (deterministic) $g\in\mathbb{R}^d$.


Lemma 4. 1. Let u > 0 be arbitrary, and suppose $f:\mathbb{R}^d\to\mathbb{R}$ is differentiable. Then
$$\nabla f^u(x) = \mathbb{E}_{z\sim U(S^{d-1})}\big[G_f^{(2)}(x; u, z)\big], \tag{27}$$
where $f^u(x) := \mathbb{E}_{y\sim U(B^d)}[f(x+uy)]$ is the smoothed version of f. Moreover, if f is L-smooth, then $f^u$ is also L-smooth.
2. [31, Lemma 10] Suppose $f:\mathbb{R}^d\to\mathbb{R}$ is G-Lipschitz. Then for any $x\in\mathbb{R}^d$ and $u\ge 0$,
$$\mathbb{E}_{z\sim U(S^{d-1})}\Big[\big\|G_f^{(2)}(x; u, z)\big\|^2\Big] \le \kappa^2 G^2 d, \tag{28}$$
where κ > 0 is some numerical constant.
3. Suppose $f:\mathbb{R}^d\to\mathbb{R}$ is L-smooth, and let u be positive. Then for any $x\in\mathbb{R}^d$ and $h\in\mathbb{R}^d$, we have
$$\bigg|\frac{f(x+uh)-f(x-uh)}{2u} - \langle\nabla f(x), h\rangle\bigg| \le \frac{1}{2}uL\|h\|^2. \tag{29}$$
In addition,
$$\|\nabla f(x) - \nabla f^u(x)\| \le uL. \tag{30}$$

Proof. 1. The equality (27) follows from [26, Lemma 1] and the fact that the distribution $U(S^{d-1})$ has zero mean. When f is L-smooth, we have
$$\|\nabla f^u(x_1) - \nabla f^u(x_2)\| = \bigg\|\frac{1}{\int_{B^d} dy}\int_{B^d}\big(\nabla f(x_1+uy) - \nabla f(x_2+uy)\big)\, dy\bigg\| \le \frac{1}{\int_{B^d} dy}\int_{B^d}\|\nabla f(x_1+uy) - \nabla f(x_2+uy)\|\, dy \le L\|x_1 - x_2\|$$
for any $x_1, x_2\in\mathbb{R}^d$.
3. We have
$$\frac{f(x+uh)-f(x-uh)}{2u} - \langle\nabla f(x), h\rangle = \frac{1}{2u}\int_{-1}^1\langle\nabla f(x+ush),\, uh\rangle\, ds - \langle\nabla f(x), h\rangle = \frac{1}{2}\int_{-1}^1\langle\nabla f(x+ush) - \nabla f(x),\, h\rangle\, ds \le \frac{1}{2}\int_{-1}^1 Lu|s|\,\|h\|^2\, ds = \frac{1}{2}uL\|h\|^2,$$
and
$$\|\nabla f(x) - \nabla f^u(x)\| = \bigg\|\frac{1}{\int_{B^d} dy}\int_{B^d}\big(\nabla f(x) - \nabla f(x+uy)\big)\, dy\bigg\| \le \frac{uL}{\int_{B^d} dy}\int_{B^d}\|y\|\, dy \le uL.$$

Now we introduce the following quantities:
$$x(t) = \begin{bmatrix} x^1(t)\\ \vdots\\ x^n(t) \end{bmatrix}, \qquad g(t) = \begin{bmatrix} g^1(t)\\ \vdots\\ g^n(t) \end{bmatrix}, \qquad \bar{x}(t) = \frac{1}{n}\sum_{i=1}^n x^i(t), \qquad \bar{g}(t) = \frac{1}{n}\sum_{i=1}^n g^i(t).$$
We can see that
$$x(t) = (W\otimes I_d)\big(x(t-1) - \eta_t\, g(t)\big), \qquad \bar{x}(t) = \bar{x}(t-1) - \eta_t\,\bar{g}(t).$$

Lemma 5. Suppose each $f_i$ is G-Lipschitz and L-smooth. Then
$$\|x(t) - \mathbf{1}_n\otimes\bar{x}(t)\|^2 \le \Big(\frac{1+\rho^2}{2}\Big)^t\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2 + \frac{8n\rho^2}{1-\rho^2}\, G^2 d^2\sum_{\tau=0}^{t-1}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_{t-\tau}^2 \tag{31}$$
almost surely, and
$$\mathbb{E}\big[\|x(t) - \mathbf{1}_n\otimes\bar{x}(t)\|^2\big] \le \Big(\frac{1+\rho^2}{2}\Big)^t\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2 + \frac{8n\rho^2\kappa^2}{1-\rho^2}\, G^2 d\sum_{\tau=0}^{t-1}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_{t-\tau}^2. \tag{32}$$

Proof. We have
$$x(t) - \mathbf{1}_n\otimes\bar{x}(t) = (W\otimes I_d)\big(x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1) - \eta_t(g(t) - \mathbf{1}_n\otimes\bar{g}(t))\big),$$
and therefore
$$\begin{aligned} \|x(t) - \mathbf{1}_n\otimes\bar{x}(t)\|^2 &\le \rho^2\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \eta_t^2\rho^2\|g(t) - \mathbf{1}_n\otimes\bar{g}(t)\|^2\\ &\quad + \frac{1-\rho^2}{2\rho^2}\cdot\rho^2\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{2\rho^2}{1-\rho^2}\cdot\eta_t^2\rho^2\|g(t) - \mathbf{1}_n\otimes\bar{g}(t)\|^2\\ &= \frac{1+\rho^2}{2}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{\rho^2(1+\rho^2)}{1-\rho^2}\,\eta_t^2\|g(t) - \mathbf{1}_n\otimes\bar{g}(t)\|^2, \end{aligned} \tag{33}$$
where we used Lemma 1. Since each $f_i$ is G-Lipschitz, we have $\|g^i(t)\| \le Gd\|z^i(t)\|^2 = Gd$, and so
$$\begin{aligned} \|g(t) - \mathbf{1}_n\otimes\bar{g}(t)\|^2 &= \sum_{i=1}^n\bigg\|\frac{n-1}{n}g^i(t) - \frac{1}{n}\sum_{j\ne i}g^j(t)\bigg\|^2 \le \sum_{i=1}^n\bigg(\frac{n-1}{n}\|g^i(t)\| + \frac{1}{n}\sum_{j\ne i}\|g^j(t)\|\bigg)^2\\ &\le \sum_{i=1}^n\bigg(\frac{n-1}{n}Gd\|z^i(t)\|^2 + \frac{1}{n}\sum_{j\ne i}Gd\|z^j(t)\|^2\bigg)^2 = \frac{4(n-1)^2}{n}G^2 d^2 \le 4nG^2 d^2, \end{aligned}$$
and by (28) of Lemma 4, we have
$$\begin{aligned} \mathbb{E}\big[\|g(t) - \mathbf{1}_n\otimes\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big] &= \sum_{i=1}^n\mathbb{E}\Bigg[\bigg\|\frac{n-1}{n}g^i(t) - \frac{1}{n}\sum_{j\ne i}g^j(t)\bigg\|^2\,\Bigg|\,\mathcal{F}_{t-1}\Bigg]\\ &\le 2\sum_{i=1}^n\mathbb{E}\Bigg[\Big(\frac{n-1}{n}\Big)^2\|g^i(t)\|^2 + \bigg\|\frac{1}{n}\sum_{j\ne i}g^j(t)\bigg\|^2\,\Bigg|\,\mathcal{F}_{t-1}\Bigg] \le \frac{4(n-1)^2}{n}\kappa^2 G^2 d \le 4n\kappa^2 G^2 d. \end{aligned}$$
By plugging these bounds into (33) and noting that ρ < 1, we get (31) and (32) by mathematical induction.
Corollary 1. 1. Let $\eta_t$ be a non-increasing sequence that converges to zero. Then
$$\lim_{t\to\infty}\|x(t) - \mathbf{1}_n\otimes\bar{x}(t)\|^2 = 0.$$
Furthermore, if $\sum_{t=1}^\infty\eta_t^3 < +\infty$, then
$$\sum_{t=1}^\infty\eta_t\,\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 < +\infty$$
almost surely.
2. Suppose $\eta_t = \eta_1/t^\beta$ for β > 1/3. Then
$$\sum_{t=1}^\infty\eta_t\,\mathbb{E}\big[\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2\big] \le \frac{2\eta_1\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{1-\rho^2} + \eta_1^3\,\frac{48\beta n\kappa^2\rho^2}{(3\beta-1)(1-\rho^2)^2}\, G^2 d. \tag{34}$$

Proof. 1. By the monotonicity of $\eta_t$ and $((1+\rho^2)/2)^t$ (using Chebyshev's sum inequality for the two oppositely ordered sequences), we have
$$\sum_{\tau=0}^{t-1}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_{t-\tau}^2 = \sum_{\tau=1}^{t}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\eta_\tau^2 \le \sum_{\tau=1}^{t}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\cdot\frac{1}{t}\sum_{\tau=1}^{t}\eta_\tau^2 \longrightarrow 0$$
as $t\to\infty$, since $\sum_{\tau=1}^t((1+\rho^2)/2)^{t-\tau} \le 2/(1-\rho^2)$ and $\frac{1}{t}\sum_{\tau=1}^t\eta_\tau^2 \to 0$ by Cesàro's lemma.
For the summability of $\eta_t\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2$, we have
$$\sum_{t=2}^\infty\eta_t\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 \le \|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2\sum_{t=2}^\infty\Big(\frac{1+\rho^2}{2}\Big)^{t-1}\eta_t + \frac{8n\rho^2 G^2 d^2}{1-\rho^2}\sum_{t=2}^\infty\eta_t\sum_{\tau=0}^{t-2}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_{t-1-\tau}^2.$$
The first term on the right-hand side obviously converges. For the second term, we have
$$\sum_{t=2}^\infty\sum_{\tau=0}^{t-2}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_t\,\eta_{t-1-\tau}^2 \le \sum_{t=2}^\infty\sum_{\tau=0}^{t-2}\Big(\frac{1+\rho^2}{2}\Big)^\tau\eta_{t-1-\tau}^3 = \sum_{t=2}^\infty\sum_{\tau=2}^{t}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\eta_{\tau-1}^3 = \sum_{\tau=2}^\infty\eta_{\tau-1}^3\sum_{t=\tau}^\infty\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau} = \frac{2}{1-\rho^2}\sum_{\tau=2}^\infty\eta_{\tau-1}^3 < +\infty.$$
Therefore we can conclude that $\eta_t\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2$ is summable almost surely.
2. We have
$$\begin{aligned} \sum_{t=1}^\infty\eta_t\,\mathbb{E}\big[\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2\big] &\le \eta_1\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2\sum_{t=1}^\infty\Big(\frac{1+\rho^2}{2}\Big)^{t-1} + \frac{8n\rho^2\kappa^2}{1-\rho^2}\, G^2 d\,\eta_1^3\sum_{t=2}^\infty\sum_{\tau=0}^{t-2}\frac{1}{t^\beta(t-1-\tau)^{2\beta}}\Big(\frac{1+\rho^2}{2}\Big)^\tau\\ &\le \frac{2\eta_1\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{1-\rho^2} + \eta_1^3\,\frac{8n\rho^2\kappa^2}{1-\rho^2}\, G^2 d\sum_{t=2}^\infty\sum_{\tau=0}^{t-2}\frac{1}{(t-1-\tau)^{3\beta}}\Big(\frac{1+\rho^2}{2}\Big)^\tau. \end{aligned}$$
Then since
$$\sum_{t=2}^\infty\sum_{\tau=0}^{t-2}\frac{1}{(t-1-\tau)^{3\beta}}\Big(\frac{1+\rho^2}{2}\Big)^\tau = \sum_{t=2}^\infty\sum_{\tau=2}^{t}\frac{1}{(\tau-1)^{3\beta}}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau} = \sum_{\tau=2}^\infty\frac{1}{(\tau-1)^{3\beta}}\sum_{t=\tau}^\infty\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau} = \frac{2}{1-\rho^2}\sum_{\tau=2}^\infty\frac{1}{(\tau-1)^{3\beta}} \le \frac{6\beta}{(3\beta-1)(1-\rho^2)},$$
we get the inequality (34).

Lemma 6. We have
$$\mathbb{E}\big[\|\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big] \le \frac{4G^2(d-1)}{3n^2} + 2\|\nabla f(\bar{x}(t-1))\|^2 + \frac{4L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + u_t^2 L^2 d^2.$$
Proof. Since
$$\|\bar{g}(t)\|^2 = \bigg\|\frac{d}{n}\sum_{i=1}^n\bigg[\langle\nabla f_i(x^i(t-1)), z^i(t)\rangle z^i(t) + \Big(\frac{f_i(x^i(t-1)+u_t z^i(t)) - f_i(x^i(t-1)-u_t z^i(t))}{2u_t} - \langle\nabla f_i(x^i(t-1)), z^i(t)\rangle\Big) z^i(t)\bigg]\bigg\|^2,$$
by (29) of Lemma 4, we see that
$$\begin{aligned} \mathbb{E}\big[\|\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big] &\le \mathbb{E}\Bigg[\Big(1+\frac{1}{3}\Big)\bigg\|\frac{d}{n}\sum_{i=1}^n\langle\nabla f_i(x^i(t-1)), z^i(t)\rangle z^i(t)\bigg\|^2 + (1+3)\bigg(\frac{d}{n}\sum_{i=1}^n\frac{1}{2}u_t L\bigg)^2\,\Bigg|\,\mathcal{F}_{t-1}\Bigg]\\ &\le \frac{4}{3}\bigg(\frac{d}{n^2}\sum_{i=1}^n\|\nabla f_i(x^i(t-1))\|^2 + \frac{1}{n^2}\sum_{i\ne j}\langle\nabla f_i(x^i(t-1)), \nabla f_j(x^j(t-1))\rangle\bigg) + u_t^2 L^2 d^2, \end{aligned}$$
where we used (26) and the fact that $\langle\nabla f_i(x^i(t-1)), z^i(t)\rangle z^i(t)$ and $\langle\nabla f_j(x^j(t-1)), z^j(t)\rangle z^j(t)$ are independent for $j\ne i$ conditioned on $\mathcal{F}_{t-1}$. Then since
$$\frac{d}{n^2}\sum_{i=1}^n\|\nabla f_i(x^i(t-1))\|^2 + \frac{1}{n^2}\sum_{i\ne j}\langle\nabla f_i(x^i(t-1)), \nabla f_j(x^j(t-1))\rangle = \frac{d-1}{n^2}\sum_{i=1}^n\|\nabla f_i(x^i(t-1))\|^2 + \bigg\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(x^i(t-1))\bigg\|^2,$$
and
$$\begin{aligned} \bigg\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(x^i(t-1))\bigg\|^2 &\le \Big(1+\frac{1}{2}\Big)\bigg\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(\bar{x}(t-1))\bigg\|^2 + (1+2)\bigg\|\frac{1}{n}\sum_{i=1}^n\big(\nabla f_i(x^i(t-1)) - \nabla f_i(\bar{x}(t-1))\big)\bigg\|^2\\ &\le \frac{3}{2}\|\nabla f(\bar{x}(t-1))\|^2 + 3\cdot\frac{1}{n}\sum_{i=1}^n\|\nabla f_i(x^i(t-1)) - \nabla f_i(\bar{x}(t-1))\|^2\\ &\le \frac{3}{2}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{3L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2, \end{aligned}$$
we get
$$\begin{aligned} \mathbb{E}\big[\|\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big] &\le \frac{4(d-1)}{3n^2}\sum_{i=1}^n\|\nabla f_i(x^i(t-1))\|^2 + \frac{4}{3}\bigg\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(x^i(t-1))\bigg\|^2 + u_t^2 L^2 d^2\\ &\le \frac{4G^2(d-1)}{3n^2} + 2\|\nabla f(\bar{x}(t-1))\|^2 + \frac{4L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + u_t^2 L^2 d^2. \end{aligned}$$

Lemma 7. Suppose $\eta_t L \le 1/4$. Then
$$\begin{aligned} \mathbb{E}\big[f(\bar{x}(t))\,\big|\,\mathcal{F}_{t-1}\big] &\le f(\bar{x}(t-1)) - \frac{\eta_t}{4}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{3\eta_t L^2}{2n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2\\ &\quad + \frac{2\eta_t^2 L G^2(d-1)}{3n^2} + \eta_t u_t^2 L^2\Big(1 + \frac{1}{2}d^2\eta_t L\Big). \end{aligned} \tag{35}$$

Proof. Since $\bar{x}(t) = \bar{x}(t-1) - \eta_t\bar{g}(t)$, by the L-smoothness of the function f, we get
$$f(\bar{x}(t)) \le f(\bar{x}(t-1)) - \eta_t\langle\nabla f(\bar{x}(t-1)), \bar{g}(t)\rangle + \frac{L}{2}\eta_t^2\|\bar{g}(t)\|^2.$$
Note that by (27) of Lemma 4, we have
$$\mathbb{E}[\bar{g}(t)\,|\,\mathcal{F}_{t-1}] = \frac{1}{n}\sum_{i=1}^n\nabla f_i^{u_t}(x^i(t-1)).$$
By taking the expectation conditioned on $\mathcal{F}_{t-1}$, we get
$$\begin{aligned} \mathbb{E}[f(\bar{x}(t))\,|\,\mathcal{F}_{t-1}] &\le f(\bar{x}(t-1)) - \eta_t\|\nabla f(\bar{x}(t-1))\|^2 + \frac{L}{2}\eta_t^2\,\mathbb{E}\big[\|\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big]\\ &\quad - \eta_t\bigg\langle\nabla f(\bar{x}(t-1)),\ \frac{1}{n}\sum_{i=1}^n\big(\nabla f_i^{u_t}(x^i(t-1)) - \nabla f_i^{u_t}(\bar{x}(t-1))\big)\bigg\rangle\\ &\quad - \eta_t\big\langle\nabla f(\bar{x}(t-1)),\ \nabla f^{u_t}(\bar{x}(t-1)) - \nabla f(\bar{x}(t-1))\big\rangle. \end{aligned}$$
Since each $f_i^{u_t}$ is L-smooth (see Part 1 of Lemma 4), we have
$$\begin{aligned} &-\bigg\langle\nabla f(\bar{x}(t-1)),\ \frac{1}{n}\sum_{i=1}^n\big(\nabla f_i^{u_t}(x^i(t-1)) - \nabla f_i^{u_t}(\bar{x}(t-1))\big)\bigg\rangle\\ &\le \frac{1}{2}\Bigg(\frac{1}{2}\|\nabla f(\bar{x}(t-1))\|^2 + 2\bigg\|\frac{1}{n}\sum_{i=1}^n\big(\nabla f_i^{u_t}(x^i(t-1)) - \nabla f_i^{u_t}(\bar{x}(t-1))\big)\bigg\|^2\Bigg)\\ &\le \frac{1}{4}\|\nabla f(\bar{x}(t-1))\|^2 + \bigg(\frac{1}{n}\sum_{i=1}^n L\|x^i(t-1) - \bar{x}(t-1)\|\bigg)^2 \le \frac{1}{4}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2, \end{aligned}$$
and by (30), we have
$$-\big\langle\nabla f(\bar{x}(t-1)),\ \nabla f^{u_t}(\bar{x}(t-1)) - \nabla f(\bar{x}(t-1))\big\rangle \le \frac{1}{2}\Big(\frac{1}{2}\|\nabla f(\bar{x}(t-1))\|^2 + 2\|\nabla f(\bar{x}(t-1)) - \nabla f^{u_t}(\bar{x}(t-1))\|^2\Big) \le \frac{1}{4}\|\nabla f(\bar{x}(t-1))\|^2 + u_t^2 L^2.$$
Therefore
$$\mathbb{E}[f(\bar{x}(t))\,|\,\mathcal{F}_{t-1}] \le f(\bar{x}(t-1)) - \frac{\eta_t}{2}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{L}{2}\eta_t^2\,\mathbb{E}\big[\|\bar{g}(t)\|^2\,\big|\,\mathcal{F}_{t-1}\big] + \frac{\eta_t L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \eta_t u_t^2 L^2.$$
Finally, by plugging in the bound of Lemma 6, we get
$$\begin{aligned} \mathbb{E}[f(\bar{x}(t))\,|\,\mathcal{F}_{t-1}] &\le f(\bar{x}(t-1)) - \frac{\eta_t}{2}(1 - 2\eta_t L)\|\nabla f(\bar{x}(t-1))\|^2 + \eta_t u_t^2 L^2\Big(1 + \frac{1}{2}d^2\eta_t L\Big)\\ &\quad + \eta_t^2\,\frac{2LG^2(d-1)}{3n^2} + \frac{\eta_t L^2}{n}(1 + 2\eta_t L)\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2, \end{aligned}$$
and the inequality (35) follows from the assumption $\eta_t L \le 1/4$.
Now we are ready to prove Theorem 1 and Theorem 2 in the main text.
Proof of Theorem 1. Without loss of generality we assume that $f^* = \inf_{x\in\mathbb{R}^d} f(x) = 0$.
1. Suppose $\eta_t$ is non-increasing and $\sum_{t=1}^\infty\eta_t = +\infty$, $\sum_{t=1}^\infty\eta_t^2 < +\infty$, and $\sum_{t=1}^\infty\eta_t u_t^2 < +\infty$. The convergence of $x^i(t)$ to $\bar{x}(t)$ is already shown by Corollary 1. Moreover, the random variable
$$\frac{3\eta_t L^2}{2n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{2\eta_t^2 LG^2(d-1)}{3n^2} + \eta_t u_t^2 L^2\Big(1 + \frac{1}{2}d^2\eta_t L\Big)$$
is summable almost surely. By Lemma 3, we see that $f(\bar{x}(t))$ converges and
$$\sum_{t=1}^\infty\eta_t\|\nabla f(\bar{x}(t-1))\|^2 < +\infty$$
almost surely, which implies that $\liminf_{t\to\infty}\|\nabla f(\bar{x}(t))\| = 0$.
Now let δ > 0 be arbitrary, and consider the event
$$\mathcal{A}_\delta := \Big\{\limsup_{t\to\infty}\|\nabla f(\bar{x}(t))\| \ge \delta\Big\}.$$
On the event $\mathcal{A}_\delta$, we can always find a (random) subsequence of $\|\nabla f(\bar{x}(t))\|$, which we denote by $(\|\nabla f(\bar{x}(t_k))\|)_{k\in\mathbb{Z}_+}$, such that $\|\nabla f(\bar{x}(t_k))\| \ge \frac{2\delta}{3}$ for all k. It's not hard to verify that
$$M := \sup_{t\in\mathbb{Z}_+}\|\bar{g}(t)\| < +\infty.$$
Then for any $s\in\mathbb{Z}_+$, we have
$$\|\nabla f(\bar{x}(t_k+s))\| \ge \|\nabla f(\bar{x}(t_k))\| - \sum_{\tau=1}^s\|\nabla f(\bar{x}(t_k+\tau)) - \nabla f(\bar{x}(t_k+\tau-1))\| \ge \frac{2\delta}{3} - L\sum_{\tau=1}^s\eta_{t_k+\tau}M.$$
Let $\hat{s}(k)$ be the smallest positive integer such that
$$\frac{2\delta}{3} - L\sum_{\tau=1}^{\hat{s}(k)+1}\eta_{t_k+\tau}M < \frac{\delta}{3}$$
(such $\hat{s}(k)$ exists as $\sum_{t=1}^\infty\eta_t = +\infty$). This implies that
$$\sum_{\tau=1}^{\hat{s}(k)+1}\eta_{t_k+\tau} > \frac{\delta}{3LM},$$
and $\|\nabla f(\bar{x}(t_k+s))\| \ge \delta/3$ for all $s = 0,\ldots,\hat{s}(k)$. Therefore
$$\sum_{\tau=1}^{\hat{s}(k)+1}\eta_{t_k+\tau}\|\nabla f(\bar{x}(t_k+\tau-1))\|^2 \ge \sum_{\tau=1}^{\hat{s}(k)+1}\eta_{t_k+\tau}\,\frac{\delta^2}{9} \ge \frac{\delta^3}{27LM}.$$
Since $t_k\to\infty$ as $k\to\infty$, we can find a subsequence $(t_{k_p})_{p\in\mathbb{Z}_+}$ satisfying $t_{k_{p+1}} - t_{k_p} > \hat{s}(k_p)$ by induction, and then
$$\sum_{t=1}^\infty\eta_t\|\nabla f(\bar{x}(t-1))\|^2 \ge \sum_{p=1}^\infty\frac{\delta^3}{27LM} = +\infty.$$
In other words, on $\mathcal{A}_\delta$ the series $\sum_{t=1}^\infty\eta_t\|\nabla f(\bar{x}(t-1))\|^2$ diverges. Since this series converges almost surely, we have $\mathbb{P}(\mathcal{A}_\delta) = 0$, and consequently
$$\mathbb{P}\Big(\limsup_{t\to\infty}\|\nabla f(\bar{x}(t))\| > 0\Big) = \mathbb{P}\bigg(\bigcup_{k\in\mathbb{Z}_+}\mathcal{A}_{1/k}\bigg) = \lim_{k\to\infty}\mathbb{P}(\mathcal{A}_{1/k}) = 0,$$
and we see that $\|\nabla f(\bar{x}(t))\|$ converges to zero almost surely.


2. When $\eta_t = \eta_1/t^\beta$ and $u_t = u_1/t^{(\gamma-\beta)/2}$, by taking the telescoping sum of (35) and noting that f is nonnegative, we get
$$\begin{aligned} \sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|\nabla f(\bar{x}(\tau-1))\|^2\big] &\le 4f(\bar{x}(0)) + \frac{6L^2}{n}\sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|x(\tau-1) - \mathbf{1}_n\otimes\bar{x}(\tau-1)\|^2\big]\\ &\quad + \eta_1^2\,\frac{8LG^2(d-1)}{3n^2}\sum_{\tau=1}^t\frac{1}{\tau^{2\beta}} + 4\eta_1 u_1^2 L^2\sum_{\tau=1}^t\Big(\frac{1}{\tau^\gamma} + \frac{1}{2}d^2\eta_1 L\,\frac{1}{\tau^{\gamma+\beta}}\Big)\\ &\le 4f(\bar{x}(0)) + \eta_1^2\,\frac{8LG^2(d-1)}{3n^2}\cdot\frac{2\beta}{2\beta-1} + 4\eta_1 u_1^2 L^2\cdot\frac{(1 + d^2\eta_1 L/2)\gamma}{\gamma-1}\\ &\quad + \frac{6\eta_1 L^2}{n}\bigg(\frac{2\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{1-\rho^2} + \eta_1^2\,\frac{48\beta n\kappa^2\rho^2}{(3\beta-1)(1-\rho^2)^2}\, G^2 d\bigg), \end{aligned}$$
where we used (34), (25) and $\beta\in(1/2,1)$. Now since $\eta_1 = \alpha_\eta/(4L\sqrt{d})$ and $u_1 \le \alpha_u G/(L\sqrt{d})$, and noticing that when β < 1,
$$\sum_{\tau=1}^t\eta_\tau = \eta_1\sum_{\tau=1}^t\frac{1}{\tau^\beta} \ge \eta_1\int_1^{t+1}\frac{ds}{s^\beta} = \frac{\eta_1}{1-\beta}\big((1+t)^{1-\beta} - 1\big),$$
we have
$$\begin{aligned} \frac{\sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|\nabla f(\bar{x}(\tau-1))\|^2\big]}{\sum_{\tau=1}^t\eta_\tau} &\le \frac{1-\beta}{(t+1)^{1-\beta} - 1}\Bigg(\frac{16\sqrt{d}\,L f(\bar{x}(0))}{\alpha_\eta} + \frac{12L^2\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{n(1-\rho^2)} + \frac{\alpha_u^2\gamma}{2(\gamma-1)}\, G^2\big(\sqrt{d} + 8d^{-1}\big)\Bigg)\\ &\quad + \bigg(\frac{18\alpha_\eta\kappa^2\rho^2}{(6\beta-3)n^2} + \frac{4}{\sqrt{d}(1-\rho^2)^2}\bigg)\cdot\frac{(1-\beta)\,\alpha_\eta G^2 d}{(t+1)^{1-\beta} - 1}, \end{aligned}$$
and we get the convergence rate of $\mathbb{E}[\|\nabla f(\bar{x}(t))\|^2]$ stated in the theorem.
The convergence rate of the consensus error follows from Lemma 5 and the fact that
$$\sum_{\tau=0}^{t-1}\frac{\lambda^\tau}{(t-\tau)^\epsilon} = \frac{1}{(1-\lambda)t^\epsilon} + o(t^{-\epsilon})$$
for any $\lambda\in(0,1)$ and $\epsilon>0$.



3. When $\eta_t = \eta_1/\sqrt{t}$ and $u_t = u_1/t^{\gamma/2-1/4}$, by (35) we have
$$\begin{aligned} \mathbb{E}\bigg[\frac{f(\bar{x}(t))}{(t+1)^\epsilon}\,\bigg|\,\mathcal{F}_{t-1}\bigg] &\le \frac{1}{t^\epsilon}f(\bar{x}(t-1)) - \frac{\eta_1}{4t^{1/2+\epsilon}}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{3\eta_t L^2}{2nt^\epsilon}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2\\ &\quad + \frac{2\eta_1^2 LG^2(d-1)}{3n^2 t^{1+\epsilon}} + \frac{\eta_t u_t^2 L^2}{t^\epsilon}\Big(1 + \frac{1}{2}d^2\eta_t L\Big), \end{aligned}$$
where ε > 0 is arbitrary. Since
$$\frac{3\eta_t L^2}{2nt^\epsilon}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{2\eta_1^2 LG^2(d-1)}{3n^2 t^{1+\epsilon}} + \frac{\eta_t u_t^2 L^2}{t^\epsilon}\Big(1 + \frac{1}{2}d^2\eta_t L\Big)$$
is summable, we see that
$$\sum_{t=1}^\infty\frac{\eta_1}{t^{1/2+\epsilon}}\|\nabla f(\bar{x}(t-1))\|^2 < +\infty,$$
which implies that
$$\liminf_{t\to\infty}\|\nabla f(\bar{x}(t))\| = 0.$$
Now by taking the telescoping sum of (35) and using $\eta_1 = \alpha_\eta/(4L\sqrt{d})$ and $u_1 \le \alpha_u G/(L\sqrt{d})$, we can show that
$$\sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|\nabla f(\bar{x}(\tau-1))\|^2\big] \le 4f(\bar{x}(0)) + \eta_1\,\frac{2\alpha_\eta G^2 d}{3n^2}\ln(2t+1) + \eta_1\cdot\frac{\alpha_u^2\gamma}{2(\gamma-1)}\, G^2\big(\sqrt{d} + 8d^{-1}\big) + \frac{12\eta_1 L^2\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{n(1-\rho^2)} + \eta_1\,\frac{18\alpha_\eta^2\kappa^2\rho^2}{(1-\rho^2)^2}\, G^2,$$
where we used $\sum_{\tau=1}^t\tau^{-1} \le \ln(2t+1)$. This further leads to
$$\begin{aligned} \frac{\sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|\nabla f(\bar{x}(\tau-1))\|^2\big]}{\sum_{\tau=1}^t\eta_\tau} &\le \frac{1}{\sqrt{t+1}-1}\Bigg(\frac{8\sqrt{d}\,Lf(\bar{x}(0))}{\alpha_\eta} + \frac{6L^2\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2}{n(1-\rho^2)} + \frac{\alpha_u^2\gamma}{4(\gamma-1)}\, G^2\big(\sqrt{d} + 8d^{-1}\big)\Bigg)\\ &\quad + \frac{\alpha_\eta G^2 d}{3n^2}\,\frac{\ln(2t+1)}{\sqrt{t+1}-1} + \frac{9\kappa^2\rho^2\alpha_\eta^2 G^2}{(1-\rho^2)^2}\,\frac{1}{\sqrt{t+1}-1}. \end{aligned}$$
We now get the convergence rate of $\mathbb{E}\big[\|\nabla f(\bar{x}(t))\|^2\big]$ stated in the theorem. The convergence rate of the consensus error follows from the same argument as in the previous part.

Proof of Theorem 2. Denote $\delta(t) := f(\bar{x}(t)) - f^*$. It's easy to check that $\eta_t L \le 1/4$. By (35) and using the fact that f is µ-gradient dominated, we have
$$\mathbb{E}[\delta(t)\,|\,\mathcal{F}_{t-1}] \le \Big(1 - \frac{\eta_t\mu}{2}\Big)\delta(t-1) + \frac{3\eta_t L^2}{2n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{2\eta_t^2 LG^2(d-1)}{3n^2} + \eta_t u_t^2 L^2\Big(1 + \frac{1}{2}d^2\eta_t L\Big). \tag{36}$$
By induction we see that
$$\begin{aligned} \mathbb{E}[\delta(t)] &\le \delta(0)\prod_{\tau=1}^t\Big(1 - \frac{\eta_\tau\mu}{2}\Big)\\ &\quad + \frac{3L^2}{2n}\sum_{\tau=0}^{t-1}\bigg[\eta_{t-\tau}\,\mathbb{E}\big[\|x(t-\tau-1) - \mathbf{1}_n\otimes\bar{x}(t-\tau-1)\|^2\big]\prod_{s=t-\tau+1}^t\Big(1 - \frac{\eta_s\mu}{2}\Big)\bigg]\\ &\quad + \sum_{\tau=0}^{t-1}\bigg(\frac{2\eta_{t-\tau}^2 LG^2(d-1)}{3n^2} + \eta_{t-\tau}u_{t-\tau}^2 L^2\Big(1 + \frac{1}{2}d^2\eta_{t-\tau}L\Big)\bigg)\cdot\prod_{s=t-\tau+1}^t\Big(1 - \frac{\eta_s\mu}{2}\Big). \end{aligned}$$
Since
$$\prod_{\tau=t_1}^{t_2}\Big(1 - \frac{\eta_\tau\mu}{2}\Big) \le \exp\bigg(-\sum_{\tau=t_1}^{t_2}\frac{\eta_\tau\mu}{2}\bigg) = \exp\bigg(-\alpha_\eta\sum_{\tau=t_1}^{t_2}\frac{1}{\tau+t_0}\bigg) \le \exp\big(-\alpha_\eta(\ln(t_2+1) - \ln t_1)\big) = \Big(\frac{t_1}{t_2+1}\Big)^{\alpha_\eta},$$
we get
$$\begin{aligned} \mathbb{E}[\delta(t)] &\le \delta(0)\cdot\Big(\frac{1}{t+1}\Big)^{\alpha_\eta}\\ &\quad + \frac{3L^2}{2n}\sum_{\tau=0}^{t-1}\bigg[\eta_{t-\tau}\,\mathbb{E}\big[\|x(t-\tau-1) - \mathbf{1}_n\otimes\bar{x}(t-\tau-1)\|^2\big]\Big(\frac{t-\tau+1}{t+1}\Big)^{\alpha_\eta}\bigg]\\ &\quad + \sum_{\tau=0}^{t-1}\bigg(\frac{2\eta_{t-\tau}^2 LG^2(d-1)}{3n^2} + \eta_{t-\tau}u_{t-\tau}^2 L^2\Big(1 + \frac{1}{2}d^2\eta_{t-\tau}L\Big)\bigg)\cdot\Big(\frac{t-\tau+1}{t+1}\Big)^{\alpha_\eta}. \end{aligned}$$

By Lemma 5, we have
$$\begin{aligned} \sum_{\tau=1}^t\eta_\tau\,\mathbb{E}\big[\|x(\tau-1) - \mathbf{1}_n\otimes\bar{x}(\tau-1)\|^2\big]\Big(\frac{\tau+1}{t+1}\Big)^{\alpha_\eta} &\le \frac{2\alpha_\eta}{\mu(t+1)^{\alpha_\eta}}\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2\sum_{\tau=1}^t\frac{(\tau+1)^{\alpha_\eta}}{\tau+t_0}\Big(\frac{1+\rho^2}{2}\Big)^{\tau-1}\\ &\quad + \frac{8\alpha_\eta^3}{\mu^3(t+1)^{\alpha_\eta}}\,\frac{8n\kappa^2\rho^2}{1-\rho^2}\, G^2 d\sum_{\tau=2}^t\frac{(\tau+1)^{\alpha_\eta}}{\tau+t_0}\sum_{s=0}^{\tau-1}\Big(\frac{1+\rho^2}{2}\Big)^s\frac{1}{(\tau-s-1+t_0)^2}\\ &\le \frac{2\alpha_\eta}{\mu(t+1)^{\alpha_\eta}}\|x(0) - \mathbf{1}_n\otimes\bar{x}(0)\|^2\sum_{\tau=1}^\infty(\tau+1)^{\alpha_\eta-1}\Big(\frac{1+\rho^2}{2}\Big)^{\tau-1}\\ &\quad + \frac{8\alpha_\eta^3}{\mu^3}\,\frac{8n\kappa^2\rho^2}{1-\rho^2}\, G^2 d\cdot\frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=2}^t\sum_{s=2}^\tau\Big(\frac{1+\rho^2}{2}\Big)^{\tau-s}\frac{(\tau+1)^{\alpha_\eta-1}}{(s-1+t_0)^2}. \end{aligned}$$
Since
$$\frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=2}^t\sum_{s=2}^\tau\Big(\frac{1+\rho^2}{2}\Big)^{\tau-s}\frac{(\tau+1)^{\alpha_\eta-1}}{(s-1+t_0)^2} = \frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=2}^t(\tau+1)^{\alpha_\eta-1}\bigg(\frac{2}{1-\rho^2}\,\frac{1}{(\tau+1)^2} + o(\tau^{-2})\bigg) = \begin{cases} \dfrac{2}{(\alpha_\eta-2)(1-\rho^2)(t+1)^2} + o(t^{-2}), & \alpha_\eta > 2,\\[1.5ex] \dfrac{2\ln t}{(1-\rho^2)(t+1)^2} + o\Big(\dfrac{\ln t}{(t+1)^2}\Big), & \alpha_\eta = 2,\\[1.5ex] \dfrac{C_1(\alpha_\eta, \mu/L, \rho)}{(t+1)^{\alpha_\eta}}, & \alpha_\eta\in(1,2), \end{cases}$$
where $C_1(\alpha_\eta, \mu/L, \rho)$ denotes some positive quantity that depends only on $\alpha_\eta$, µ/L and ρ, and
$$\sum_{\tau=0}^{t-1}\frac{2\eta_{t-\tau}^2 LG^2(d-1)}{3n^2}\Big(\frac{t-\tau+1}{t+1}\Big)^{\alpha_\eta} \le \frac{8\alpha_\eta^2 LG^2(d-1)}{3\mu^2 n^2}\,\frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=0}^{t-1}(t-\tau+1)^{\alpha_\eta-2} \le \frac{8\alpha_\eta^2 LG^2(d-1)}{3(\alpha_\eta-1)\mu^2 n^2}\,\frac{1}{t+1},$$
$$\sum_{\tau=0}^{t-1}\eta_{t-\tau}u_{t-\tau}^2 L^2\Big(\frac{t-\tau+1}{t+1}\Big)^{\alpha_\eta} \le \frac{2\alpha_\eta\alpha_u^2 G^2}{\mu}\cdot\frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=0}^{t-1}(t-\tau+1)^{\alpha_\eta-2} \le \frac{2\alpha_\eta\alpha_u^2 G^2}{\mu(\alpha_\eta-1)}\,\frac{1}{t+1},$$
$$\sum_{\tau=0}^{t-1}\frac{\eta_{t-\tau}^2 u_{t-\tau}^2 L^3 d^2}{2}\Big(\frac{t-\tau+1}{t+1}\Big)^{\alpha_\eta} \le \frac{2\alpha_\eta^2\alpha_u^2 G^2 Ld^2}{\mu^2}\cdot\frac{1}{(t+1)^{\alpha_\eta}}\sum_{\tau=0}^{t-1}(t-\tau+1)^{\alpha_\eta-3} \le \frac{2\alpha_\eta^2\alpha_u^2 G^2 Ld^2}{\mu^2}\cdot\begin{cases} \dfrac{1}{(\alpha_\eta-2)(t+1)^2}, & \alpha_\eta > 2,\\[1.5ex] \dfrac{\ln(t+1)}{(t+1)^2}, & \alpha_\eta = 2,\\[1.5ex] \dfrac{1}{(2-\alpha_\eta)(t+1)^{\alpha_\eta}}, & \alpha_\eta\in(1,2). \end{cases}$$
We see that
$$\mathbb{E}[\delta(t)] \le \frac{8\alpha_\eta L}{\mu(\alpha_\eta-1)(t+1)}\cdot\Big(\frac{\alpha_\eta G^2 d}{3n^2\mu} + 2\alpha_u^2\Big) + o(t^{-1}).$$
The bound on the consensus error follows from Lemma 5.

Analysis of Algorithm 2
Lemma 8. Let $f:\mathbb{R}^d\to\mathbb{R}$ be L-smooth. Then for any $x\in\mathbb{R}^d$,
$$\bigg\|\sum_{k=1}^d\frac{f(x+ue_k) - f(x-ue_k)}{2u}\, e_k - \nabla f(x)\bigg\| \le \frac{1}{2}uL\sqrt{d}.$$

Proof. We have
$$\begin{aligned} \bigg\|\sum_{k=1}^d\frac{f(x+ue_k) - f(x-ue_k)}{2u}\, e_k - \nabla f(x)\bigg\| &= \bigg\|\sum_{k=1}^d\Big(\frac{f(x+ue_k) - f(x-ue_k)}{2u} - \langle\nabla f(x), e_k\rangle\Big)e_k\bigg\|\\ &= \Bigg(\sum_{k=1}^d\Big(\frac{f(x+ue_k) - f(x-ue_k)}{2u} - \langle\nabla f(x), e_k\rangle\Big)^2\Bigg)^{1/2} \le \Bigg(\sum_{k=1}^d\Big(\frac{1}{2}uL\Big)^2\Bigg)^{1/2} = \frac{1}{2}uL\sqrt{d}, \end{aligned}$$
where we used (29) of Lemma 4.

We introduce the following quantities:
$$x(t) = \begin{bmatrix} x^1(t)\\ \vdots\\ x^n(t) \end{bmatrix}, \quad g(t) = \begin{bmatrix} g^1(t)\\ \vdots\\ g^n(t) \end{bmatrix}, \quad s(t) = \begin{bmatrix} s^1(t)\\ \vdots\\ s^n(t) \end{bmatrix}, \quad \bar{x}(t) = \frac{1}{n}\sum_{i=1}^n x^i(t), \quad \bar{g}(t) = \frac{1}{n}\sum_{i=1}^n g^i(t).$$
It's not hard to see that
$$s(t) = (W\otimes I_d)\big(s(t-1) + g(t) - g(t-1)\big), \qquad x(t) = (W\otimes I_d)\big(x(t-1) - \eta\, s(t)\big),$$
and
$$\frac{1}{n}\sum_{i=1}^n s^i(t) = \bar{g}(t), \qquad \bar{x}(t) = \bar{x}(t-1) - \eta\,\bar{g}(t).$$
Lemma 9. Suppose $\eta L \le 1/6$. Then
$$f(\bar{x}(t)) \le f(\bar{x}(t-1)) - \frac{\eta}{3}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{4\eta L^2}{3n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{\eta u_t^2 L^2 d}{3}. \tag{37}$$
Proof. By $\bar{x}(t) = \bar{x}(t-1) - \eta\bar{g}(t)$ and the L-smoothness of the function f, we have
$$\begin{aligned} f(\bar{x}(t)) &\le f(\bar{x}(t-1)) - \eta\langle\nabla f(\bar{x}(t-1)), \bar{g}(t)\rangle + \frac{\eta^2 L}{2}\|\bar{g}(t)\|^2\\ &= f(\bar{x}(t-1)) - \eta\|\nabla f(\bar{x}(t-1))\|^2 + \frac{\eta^2 L}{2}\|\bar{g}(t)\|^2 - \eta\bigg\langle\nabla f(\bar{x}(t-1)),\ \frac{1}{n}\sum_{i=1}^n\big(g^i(t) - \nabla f_i(\bar{x}(t-1))\big)\bigg\rangle\\ &\le f(\bar{x}(t-1)) - \frac{\eta}{2}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{\eta^2 L}{2}\|\bar{g}(t)\|^2 + \frac{\eta}{2}\bigg\|\frac{1}{n}\sum_{i=1}^n\big(g^i(t) - \nabla f_i(\bar{x}(t-1))\big)\bigg\|^2. \end{aligned}$$
Then, by Lemma 8,
$$\begin{aligned} \bigg\|\frac{1}{n}\sum_{i=1}^n\big(g^i(t) - \nabla f_i(\bar{x}(t-1))\big)\bigg\|^2 &\le 2\bigg\|\frac{1}{n}\sum_{i=1}^n\big(\nabla f_i(x^i(t-1)) - \nabla f_i(\bar{x}(t-1))\big)\bigg\|^2 + 2\bigg(\frac{1}{n}\sum_{i=1}^n\|g^i(t) - \nabla f_i(x^i(t-1))\|\bigg)^2\\ &\le 2\bigg(\frac{1}{n}\sum_{i=1}^n L\|x^i(t-1) - \bar{x}(t-1)\|\bigg)^2 + \frac{1}{2}u_t^2 L^2 d\\ &\le \frac{2L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{1}{2}u_t^2 L^2 d, \end{aligned} \tag{38}$$
and we see that
$$f(\bar{x}(t)) \le f(\bar{x}(t-1)) - \frac{\eta}{2}\|\nabla f(\bar{x}(t-1))\|^2 + \frac{\eta^2 L}{2}\|\bar{g}(t)\|^2 + \frac{\eta L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{\eta u_t^2 L^2 d}{4}.$$
Next, we bound the term $\|\bar{g}(t)\|^2$:
$$\|\bar{g}(t)\|^2 = \bigg\|\frac{1}{n}\sum_{i=1}^n g^i(t)\bigg\|^2 \le 2\|\nabla f(\bar{x}(t-1))\|^2 + 2\bigg\|\frac{1}{n}\sum_{i=1}^n\big(g^i(t) - \nabla f_i(\bar{x}(t-1))\big)\bigg\|^2 \le 2\|\nabla f(\bar{x}(t-1))\|^2 + \frac{4L^2}{n}\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + u_t^2 L^2 d.$$
Then we see that
$$f(\bar{x}(t)) \le f(\bar{x}(t-1)) - \frac{\eta}{2}(1 - 2\eta L)\|\nabla f(\bar{x}(t-1))\|^2 + \frac{\eta L^2}{n}(1 + 2\eta L)\|x(t-1) - \mathbf{1}_n\otimes\bar{x}(t-1)\|^2 + \frac{\eta u_t^2 L^2 d}{4}(1 + 2\eta L). \tag{39}$$
Finally, by using $\eta L \le 1/6$, we get the desired result.
Lemma 10. We have

   ‖s(1) − 1_n ⊗ ḡ(1)‖² ≤ ρ² ( (4/3) Σ_{i=1}^{n} ‖∇f_i(x^i(0))‖² + n u_1² L² d ).

Proof. Since s(0) = g(0) = 0, we have

   ‖s(1) − 1_n ⊗ ḡ(1)‖² = ‖(W ⊗ I_d)(g(1) − 1_n ⊗ ḡ(1))‖² ≤ ρ² ‖g(1) − 1_n ⊗ ḡ(1)‖².

Then since

   ‖g(1) − 1_n ⊗ ḡ(1)‖² = ‖g(1)‖² + n ‖ḡ(1)‖² − 2 Σ_{i=1}^{n} ⟨g^i(1), (1/n) Σ_{j=1}^{n} g^j(1)⟩
   = ‖g(1)‖² − n ‖ḡ(1)‖² ≤ ‖g(1)‖²,

and by Lemma 8,

   ‖g(1)‖² ≤ Σ_{i=1}^{n} ( (4/3) ‖∇f_i(x^i(0))‖² + 4 ‖g^i(1) − ∇f_i(x^i(0))‖² )
   ≤ (4/3) Σ_{i=1}^{n} ‖∇f_i(x^i(0))‖² + 4 Σ_{i=1}^{n} ( (1/2) u_1 L √d )²
   = (4/3) Σ_{i=1}^{n} ‖∇f_i(x^i(0))‖² + n u_1² L² d,

we get the desired result. □
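The identity ‖g(1) − 1_n ⊗ ḡ(1)‖² = ‖g(1)‖² − n‖ḡ(1)‖² used in this proof (and again in the proof of Lemma 11) is easy to verify numerically. The snippet below is only a sanity check of that decomposition, with an arbitrary stacked vector standing in for g(1).

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
g = rng.standard_normal((n, d))   # rows play the role of g^i(1)
g_bar = g.mean(axis=0)            # the average vector g-bar(1)
lhs = np.sum((g - g_bar) ** 2)    # ||g - 1_n (x) g_bar||^2
rhs = np.sum(g ** 2) - n * np.sum(g_bar ** 2)
assert np.isclose(lhs, rhs)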
Lemma 11. If

   ηL ≤ min{ 1/6, (1−ρ²)²/(4ρ²(3+4ρ²)) },   (40)

then

   max{ ‖x(t) − 1_n ⊗ x̄(t)‖², (3η/(10L)) ‖s(t) − 1_n ⊗ ḡ(t)‖² }
   ≤ λ^t n R_0 + (4nη³Lρ²(1+2ρ²)/(3(1−ρ²))) Σ_{τ=0}^{t−1} λ^τ ‖∇f(x̄(t−τ−1))‖²
     + (5nηLdρ²(1+2ρ²)/(6(1−ρ²))) Σ_{τ=0}^{t−1} λ^τ u²_{t−τ},   (41)

where

   λ := (2+ρ²)/3 < 1.

Consequently

   max{ Σ_{τ=0}^{t−1} ‖x(τ) − 1_n ⊗ x̄(τ)‖², (3η/(10L)) Σ_{τ=1}^{t} ‖s(τ) − 1_n ⊗ ḡ(τ)‖² }
   ≤ 3nR_0/(1−ρ²) + (4nη³Lρ²(1+2ρ²)/(1−ρ²)²) Σ_{τ=0}^{t−2} ‖∇f(x̄(τ))‖²
     + (5nηLdρ²(1+2ρ²)/(2(1−ρ²)²)) Σ_{τ=1}^{t−1} u_τ².   (42)
Proof. We have

   s(t) − 1_n ⊗ ḡ(t)
   = (W ⊗ I_d)( s(t−1) − 1_n ⊗ ḡ(t−1) + g(t) − g(t−1) − 1_n ⊗ ḡ(t) + 1_n ⊗ ḡ(t−1) ).

Then since

   ‖g(t) − g(t−1) − 1_n ⊗ ḡ(t) + 1_n ⊗ ḡ(t−1)‖²
   = ‖g(t) − g(t−1)‖² + n ‖ḡ(t) − ḡ(t−1)‖² − 2 Σ_{i=1}^{n} ⟨g^i(t) − g^i(t−1), ḡ(t) − ḡ(t−1)⟩
   = ‖g(t) − g(t−1)‖² − n ‖ḡ(t) − ḡ(t−1)‖² ≤ ‖g(t) − g(t−1)‖²,

we have

   ‖s(t) − 1_n ⊗ ḡ(t)‖² ≤ ρ² ( ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖ + ‖g(t) − g(t−1)‖ )²
   ≤ ((1+2ρ²)/3) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖² + (ρ²(1+2ρ²)/(1−ρ²)) Σ_{i=1}^{n} ‖g^i(t) − g^i(t−1)‖².

Now since

   ‖g^i(t) − g^i(t−1)‖² ≤ 2 ‖∇f_i(x^i(t−1)) − ∇f_i(x^i(t−2))‖² + 2 ((u_t + u_{t−1})/2)² L² d
   ≤ 2L² ‖x^i(t−1) − x^i(t−2)‖² + 2 u²_{t−1} L² d,

we get

   ‖s(t) − 1_n ⊗ ḡ(t)‖² ≤ ((1+2ρ²)/3) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖²
     + (2ρ²(1+2ρ²)L²/(1−ρ²)) ‖x(t−1) − x(t−2)‖² + (2ρ²(1+2ρ²)L²/(1−ρ²)) n u²_{t−1} d.

Then

   x(t−1) − x(t−2)
   = ((W ⊗ I_d) − I) x(t−2) − η (W ⊗ I_d) s(t−1)
   = ((W ⊗ I_d) − I)(x(t−2) − 1_n ⊗ x̄(t−2)) − η (W ⊗ I_d)(s(t−1) − 1_n ⊗ ḡ(t−1))
     − η 1_n ⊗ (ḡ(t−1) − ∇f(x̄(t−2))) − η 1_n ⊗ ∇f(x̄(t−2)).
We notice that for any u_1, ..., u_n ∈ R^d and v ∈ R^d, we have

   Σ_{i=1}^{n} ⟨u_i − (1/n) Σ_{j=1}^{n} u_j, v⟩ = 0,   and   Σ_{i=1}^{n} ⟨Σ_{j=1}^{n} W_{ij} u_j − (1/n) Σ_{j=1}^{n} u_j, v⟩ = 0.   (43)

In addition, similar to (38), we have

   ‖ḡ(t−1) − ∇f(x̄(t−2))‖² ≤ (18L²/(17n)) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² + (9/2) u²_{t−1} L² d.

Therefore we get

   ‖x(t−1) − x(t−2)‖²
   = ‖((W ⊗ I_d) − I)(x(t−2) − 1_n ⊗ x̄(t−2)) − η (W ⊗ I_d)(s(t−1) − 1_n ⊗ ḡ(t−1))‖²
     + η² n ‖ḡ(t−1) − ∇f(x̄(t−2)) + ∇f(x̄(t−2))‖²
   ≤ (9/2) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² + 9η²ρ² ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖²
     + 2η² n ‖ḡ(t−1) − ∇f(x̄(t−2))‖² + 2η² n ‖∇f(x̄(t−2))‖²
   ≤ ( 9/2 + (36/17) η²L² ) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖²
     + 9η²ρ² ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖² + 2η² n ‖∇f(x̄(t−2))‖² + 9η² n u²_{t−1} L² d
   ≤ (155/34) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² + 9η²ρ² ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖²
     + 2η² n ‖∇f(x̄(t−2))‖² + (1/4) n u²_{t−1} d,

where the first equality follows from (43), the first inequality follows from ‖W ⊗ I_d − I‖ ≤ 2 and ‖u+v‖² ≤ (1+1/ǫ)‖u‖² + (1+ǫ)‖v‖² for any vectors u, v and ǫ > 0, and the third inequality follows from ηL ≤ 1/6. Consequently

   ‖s(t) − 1_n ⊗ ḡ(t)‖² ≤ ( (1+2ρ²)/3 + (18ρ⁴(1+2ρ²)/(1−ρ²)) η²L² ) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖²
     + (228ρ²(1+2ρ²)/(25(1−ρ²))) L² ‖x(t−2) − 1_n ⊗ x̄(t−2)‖²
     + (2ρ²(1+2ρ²)/(1−ρ²)) ( 2η²L² n ‖∇f(x̄(t−2))‖² + (5/4) n L² u²_{t−1} d ),

where we used 155/34 < 114/25. On the other hand,

   ‖x(t−1) − 1_n ⊗ x̄(t−1)‖²
   = ‖(W ⊗ I_d)[ x(t−2) − 1_n ⊗ x̄(t−2) − η (s(t−1) − 1_n ⊗ ḡ(t−1)) ]‖²
   ≤ ((1+2ρ²)/3) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² + (ρ²(1+2ρ²)/(1−ρ²)) η² ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖².
Therefore

   ( (5η/(2√57 L)) ‖s(t) − 1_n ⊗ ḡ(t)‖², ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² )ᵀ
   ≤ A ( (5η/(2√57 L)) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖², ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² )ᵀ
     + (5nη³Lρ²(1+2ρ²)/(√57 (1−ρ²))) ( 2‖∇f(x̄(t−2))‖² + (5/4) η⁻² u²_{t−1} d, 0 )ᵀ,   (44)

where

   A = [ (1+2ρ²)/3 + (18ρ⁴(1+2ρ²)/(1−ρ²)) η²L²    (2√57 ρ²(1+2ρ²)/(5(1−ρ²))) ηL ;
         (2√57 ρ²(1+2ρ²)/(5(1−ρ²))) ηL            (1+2ρ²)/3                     ].   (45)

This leads to

   ( (5η/(2√57 L)) ‖s(t+1) − 1_n ⊗ ḡ(t+1)‖², ‖x(t) − 1_n ⊗ x̄(t)‖² )ᵀ
   ≤ A^t ( (5η/(2√57 L)) ‖s(1) − 1_n ⊗ ḡ(1)‖², ‖x(0) − 1_n ⊗ x̄(0)‖² )ᵀ
     + (5nη³Lρ²(1+2ρ²)/(√57 (1−ρ²))) Σ_{τ=0}^{t−1} A^τ ( 2‖∇f(x̄(t−τ−1))‖² + (5/4) η⁻² u²_{t−τ} d, 0 )ᵀ,

and consequently

   max{ ‖x(t) − 1_n ⊗ x̄(t)‖², (5η/(2√57 L)) ‖s(t+1) − 1_n ⊗ ḡ(t+1)‖² }
   ≤ ‖A‖^t ( (5η/(2√57 L)) ‖s(1) − 1_n ⊗ ḡ(1)‖² + ‖x(0) − 1_n ⊗ x̄(0)‖² )
     + (10nη³Lρ²(1+2ρ²)/(√57 (1−ρ²))) Σ_{τ=0}^{t−1} ‖A‖^τ ‖∇f(x̄(t−τ−1))‖²
     + (25nηLdρ²(1+2ρ²)/(4√57 (1−ρ²))) Σ_{τ=0}^{t−1} ‖A‖^τ u²_{t−τ}.

Now, since A is symmetric, straightforward calculation shows that

   ‖A‖ = ((1+2ρ²)/(3(1−ρ²))) ( 1 − ρ² + 27ρ⁴(ηL)² + (3/5) √( 76ρ⁴(ηL)² + 675ρ⁸(ηL)⁴ ) ).   (46)

By solving the inequality ‖A‖ ≤ (2+ρ²)/3, we get

   (ηL)² ≤ 25(1−ρ²)⁴ / ( ρ⁴ (3402 + 8208ρ² + 4158ρ⁴ + 2700ρ⁶) ).

It can be checked that

   (1/25) (3402 + 8208ρ² + 4158ρ⁴ + 2700ρ⁶) ≤ ( 4(3+4ρ²) )²,   ∀ρ ∈ [0, 1).

Therefore if ηL satisfies (40), we have ‖A‖ ≤ (2+ρ²)/3 = λ. By Lemma 10 and the fact that 5/(2√57) < 1/3, we get (41). The bound (42) follows by taking the sum of (41) over t and using

   Σ_{τ=1}^{t−1} Σ_{s=0}^{τ−1} λ^s a_{τ−s} = Σ_{τ=1}^{t−1} Σ_{s=1}^{τ} λ^{τ−s} a_s = Σ_{s=1}^{t−1} a_s Σ_{τ=s}^{t−1} λ^{τ−s} ≤ (1/(1−λ)) Σ_{s=1}^{t−1} a_s

for any nonnegative sequence (a_s)_{s∈Z₊}. □
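The exchange-of-summation bound at the end of this proof can be sanity-checked numerically. The snippet below is only an illustration, with λ = (2+ρ²)/3 for an arbitrary ρ and a random nonnegative sequence.

import numpy as np

rng = np.random.default_rng(1)
t = 50
lam = (2 + 0.3 ** 2) / 3   # lambda = (2 + rho^2)/3 with rho = 0.3
a = rng.random(t)          # nonnegative sequence a_1, ..., a_{t-1} (a[0] unused)
lhs = sum(lam ** s * a[tau - s] for tau in range(1, t) for s in range(tau))
rhs = a[1:].sum() / (1 - lam)
assert lhs <= rhs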
Now we are ready to prove Theorems 3 and 4 in the main text.
Proof of Theorem 3. Let t ∈ Z₊ be arbitrary. By Lemma 9 and (42), we see that

   0 ≤ f(x̄(0)) − f* − (η/3) Σ_{τ=0}^{t−1} ‖∇f(x̄(τ))‖² + (ηL²d/3) Σ_{τ=1}^{t} u_τ²
     + (4ηL²/(3n)) ( 3nR_0/(1−ρ²) + (4nη³Lρ²(1+2ρ²)/(1−ρ²)²) Σ_{τ=0}^{t−2} ‖∇f(x̄(τ))‖²
       + (5nηLdρ²(1+2ρ²)/(2(1−ρ²)²)) Σ_{τ=1}^{t−1} u_τ² )

   ≤ f(x̄(0)) − f* + 4ηL²R_0/(1−ρ²) − (η/3) ( 1 − 16η³L³ρ²(1+2ρ²)/(1−ρ²)² ) Σ_{τ=0}^{t−1} ‖∇f(x̄(τ))‖²
     + ( 10ηLρ²(1+2ρ²)/(3(1−ρ²)²) + 1/3 ) ηL²d Σ_{τ=1}^{t} u_τ².

Then since

   16η³L³ρ²(1+2ρ²)/(1−ρ²)² ≤ (16/36) · ρ²(1+2ρ²)(1−ρ²)²/(4ρ²(3+4ρ²)(1−ρ²)²) ≤ (4/9) · (1+2ρ²)/(4(2+4ρ²)) = 1/18,

and (1/3)(1 − 1/18) = 17/54 ≥ 5/16, and

   10ηLρ²(1+2ρ²)/(3(1−ρ²)²) ≤ (10/3) · ρ²(1+2ρ²)(1−ρ²)²/(4ρ²(3+4ρ²)(1−ρ²)²) ≤ 5/12,

we get

   0 ≤ f(x̄(0)) − f* + 4ηL²R_0/(1−ρ²) − (5η/16) Σ_{τ=0}^{t−1} ‖∇f(x̄(τ))‖² + (3/4) ηL²d Σ_{τ=1}^{t} u_τ².

Since u_t² is summable, this implies that ‖∇f(x̄(t))‖ converges to zero, and we have

   (1/t) Σ_{τ=0}^{t−1} ‖∇f(x̄(τ))‖² ≤ (1/t) ( 3.2 (f(x̄(0)) − f*)/η + 12.8 L²R_0/(1−ρ²) + 2.4 L²d Σ_{τ=1}^{∞} u_τ² ).   (15)
Now by (42) and (15), we see that ‖x(t) − 1_n ⊗ x̄(t)‖² is summable, and

   (1/n) Σ_{τ=0}^{∞} ‖x(τ) − 1_n ⊗ x̄(τ)‖²
   ≤ 3R_0/(1−ρ²) + (4η³Lρ²(1+2ρ²)/(1−ρ²)²) ( 3.2 (f(x̄(0)) − f*)/η + 12.8 L²R_0/(1−ρ²) + 2.4 L²d Σ_{τ=1}^{∞} u_τ² )
     + (5ηLdρ²(1+2ρ²)/(2(1−ρ²)²)) Σ_{τ=1}^{∞} u_τ²
   ≤ 1.6η (f(x̄(0)) − f*) + 6.4 (ηL)² R_0 + 0.35 d Σ_{τ=1}^{∞} u_τ².

For the convergence of s(t), we have

   (1/n) Σ_{τ=1}^{∞} ‖s(τ) − 1_n ⊗ ∇f(x̄(τ−1))‖²
   ≤ (3/(2n)) Σ_{τ=1}^{∞} ‖s(τ) − 1_n ⊗ ḡ(τ)‖² + 3 Σ_{τ=1}^{∞} ‖ḡ(τ) − ∇f(x̄(τ−1))‖²
   ≤ (3/(2n)) Σ_{τ=1}^{∞} ‖s(τ) − 1_n ⊗ ḡ(τ)‖² + 3 Σ_{τ=1}^{∞} ( (2L²/n) ‖x(τ−1) − 1_n ⊗ x̄(τ−1)‖² + (1/2) u_τ² L² d )
   ≤ ( 5L/η + 6L² ) ( 1.6η (f(x̄(0)) − f*) + 6.4 (ηL)² R_0 + 0.35 d Σ_{τ=1}^{∞} u_τ² ) + (3/2) L²d Σ_{τ=1}^{∞} u_τ²
   ≤ 9.6 L (f(x̄(0)) − f*) + 38.4 ηL³ R_0 + (2.35/η) L d Σ_{τ=1}^{∞} u_τ²,

where we used (38), (34) and ηL ≤ 1/6. Finally, since u_t² is summable, by Lemma 9 and the deterministic version of Lemma 3, we see that f(x̄(t)) converges. □
Proof of Theorem 4. Denote κ := µ/L and

   δ_t := f(x̄(t)) − f(x*).

By Lemma 2 we see that µ ≤ L. Notice that the condition on the step size

   ηL = α (µ/L)^{1/3} (1−ρ²)²/14   (47)

implies ηL ≤ 1/6. By (44) and Lemma 2, we get

   ( (5η/(2√57 L)) ‖s(t) − 1_n ⊗ ḡ(t)‖², ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² )ᵀ
   ≤ A ( (5η/(2√57 L)) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖², ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² )ᵀ
     + (2nηLρ²(1+2ρ²)/(3(1−ρ²))) ( 4η²L δ_{t−2} + (5/4) u²_{t−1} d, 0 )ᵀ,

where A is given by (45) and the norm of A is given by (46). By solving the inequality ‖A‖ ≤ 1 − (1−ρ²)²/21, we get

   (ηL)² ≤ 25(1−ρ²)⁴ (13+ρ²)² / ( ρ⁴ (223398 + 411642ρ² + 33642ρ⁴ + 217350ρ⁶ + 18900ρ⁸) ).

It can be verified that

   25(13+ρ²)² / ( ρ⁴ (223398 + 411642ρ² + 33642ρ⁴ + 217350ρ⁶ + 18900ρ⁸) ) ≥ 1/14²

for all ρ ∈ [0, 1). By the condition (47) on ηL, we see that ‖A‖ ≤ 1 − (1−ρ²)²/21. Then, since

   8nη³L²ρ²(1+2ρ²)/(3(1−ρ²)) = (8nηρ²(1+2ρ²)/3) · (α²κ^{2/3}(1−ρ²)³/196)
   ≤ (2nα²κ^{2/3}η/147) max_{ρ∈[0,1]} ρ²(1+2ρ²)(1−ρ²)³ = (nα²κ^{2/3}η/6) (1−χ),
where we denote

   χ := 1 − (4/49) max_{ρ∈[0,1]} ρ²(1+2ρ²)(1−ρ²)³ ≈ 0.9865,
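The numerical value of χ is easy to reproduce; the snippet below is only a quick check of this constant and assumes nothing beyond the definition above.

import numpy as np

rho = np.linspace(0.0, 1.0, 1_000_001)
h = rho ** 2 * (1 + 2 * rho ** 2) * (1 - rho ** 2) ** 3
chi = 1 - (4.0 / 49.0) * h.max()
print(round(chi, 4))   # 0.9865; the maximum is attained at rho^2 = 1/sqrt(10)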
we get

   ( (5η/(2√57 L)) ‖s(t) − 1_n ⊗ ḡ(t)‖², ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² )ᵀ
   ≤ ( 1 − (1−ρ²)²/21 ) ( (5η/(2√57 L)) ‖s(t−1) − 1_n ⊗ ḡ(t−1)‖², ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² )ᵀ
     + ( (nα²κ^{2/3}η/6)(1−χ) δ_{t−2} + (5nηLρ²(1+2ρ²)/(6(1−ρ²))) u²_{t−1} d, 0 )ᵀ,

where the condition (47) was used. Consequently, if we denote

   σ_{t−1} := (2√2 L/(nακ^{1/3} √(1−χ))) ‖( (5η/(2√57 L)) ‖s(t) − 1_n ⊗ ḡ(t)‖², ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² )‖,

we get

   σ_{t−1} ≤ ( 1 − (1−ρ²)²/21 ) σ_{t−2} + (√2 ακ^{1/3} √(1−χ)/3) ηL δ_{t−2}
     + (5√2 ρ²(1+2ρ²)(1−ρ²)/(42 √(1−χ))) u²_{t−1} L d.

On the other hand, by Lemma 9 and the fact that f is µ-gradient dominated, we have

   δ_{t−1} ≤ ( 1 − 2ηµ/3 ) δ_{t−2} + (4ηL²/(3n)) ‖x(t−2) − 1_n ⊗ x̄(t−2)‖² + ηL² u²_{t−1} d/3
   ≤ ( 1 − 2ηµ/3 ) δ_{t−2} + (√2 ακ^{1/3} √(1−χ)/3) ηL σ_{t−2} + ηL² u²_{t−1} d/3.
Therefore

   ( σ_{t−1}, δ_{t−1} )ᵀ ≤ B ( σ_{t−2}, δ_{t−2} )ᵀ + (u²_{t−1} L d/3) ( 5√2 ρ²(1+2ρ²)(1−ρ²)/(14 √(1−χ)), ηL )ᵀ,   (48)

where

   B := [ 1 − (1−ρ²)²/21                  (1/3)√(2(1−χ)) ακ^{1/3} ηL ;
          (1/3)√(2(1−χ)) ακ^{1/3} ηL      1 − (2/3) ηµ               ].

Straightforward calculation shows that

   ‖B‖ = 1 − ((1−ρ²)²/42) ( 1 + ακ^{4/3} − √( (1 − ακ^{4/3})² + 2(1−χ) α⁴κ^{4/3} ) )
   ≤ 1 − ((1−ρ²)²/42) ( 1 + ακ^{4/3} − √( (1 − ακ^{4/3})² + 2(1−χ) ακ^{4/3} ) )
   = 1 − ((1−ρ²)²/42) ( 1 + ακ^{4/3} − √( (1 − χακ^{4/3})² + (1−χ²) α²κ^{8/3} ) ).

Since x ↦ √( (1−χx)² + (1−χ²)x² ) is a convex function over x ∈ [0, 1], it can be shown that

   √( (1−χx)² + (1−χ²)x² ) ≤ 1 + ( √(2(1−χ)) − 1 ) x,
and so

   ‖B‖ ≤ 1 − ((1−ρ²)²/42) ( 2 − √(2(1−χ)) ) ακ^{4/3} ≤ 1 − ((1−ρ²)²/25) ακ^{4/3},

where we used the fact that 2 − √(2(1−χ)) > 42/25. By (48), we then have

   ( σ_{t−1}, δ_{t−1} )ᵀ ≤ ( 1 − ((1−ρ²)²/25) ακ^{4/3} ) ( σ_{t−2}, δ_{t−2} )ᵀ
     + (u²_{t−1} L d/3) ( 5√2 ρ²(1+2ρ²)(1−ρ²)/(14 √(1−χ)), ηL )ᵀ
   ≤ ( 1 − ((1−ρ²)²/25) ακ^{4/3} ) ( σ_{t−2}, δ_{t−2} )ᵀ + 5(1−ρ²) u²_{t−1} L d,

where we used √(1−χ) > 1/9, ρ²(1+2ρ²) ≤ 3 and ηL ≤ (1−ρ²)/14, so that

   ‖( 5√2 ρ²(1+2ρ²)(1−ρ²)/(14 √(1−χ)), ηL )‖ ≤ ‖( 135√2 (1−ρ²)/14, (1−ρ²)/14 )‖ < 15(1−ρ²).

By induction we get

   ( σ_t, δ_t )ᵀ ≤ ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^t ( σ_0, δ_0 )ᵀ + 5(1−ρ²) L d Σ_{τ=0}^{t−1} ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^τ u²_{t−τ},

which implies the bound on f(x̄(t)) − f(x*). The bound on (1/n) Σ_{i=1}^{n} ‖x^i(t) − x̄(t)‖² follows from

   (nακ^{1/3} √(1−χ)/(2√2 L)) · 5(1−ρ²) L d < (3nακ^{1/3}/(10√2)) (1−ρ²) d,

as √(1−χ) < 3/25. The bound on (1/n) Σ_{i=1}^{n} ‖s^i(t) − ∇f(x̄(t−1))‖² follows from

   (1/n) ‖s(t+1) − 1_n ⊗ ∇f(x̄(t))‖²
   ≤ (3/(2n)) ‖s(t+1) − 1_n ⊗ ḡ(t+1)‖² + 3 ‖ḡ(t+1) − ∇f(x̄(t))‖²
   ≤ (3/(2n)) ‖s(t+1) − 1_n ⊗ ḡ(t+1)‖² + 3 ( (2L²/n) ‖x(t) − 1_n ⊗ x̄(t)‖² + (1/2) u²_{t+1} L² d )
   ≤ ( (3/(2n)) (10L/(3η)) + 6L²/n ) (nακ^{1/3} √(1−χ)/(2√2 L)) ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^t ‖(σ_0, δ_0)‖
     + (3/2) u²_{t+1} L² d + ( (3/(2n)) (10L/(3η)) + 6L²/n ) (3nακ^{1/3}/(10√2)) (1−ρ²) d Σ_{τ=0}^{t−1} ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^τ u²_{t−τ}
   ≤ (18L²/(5(1−ρ²)²)) ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^t ‖(σ_0, δ_0)‖
     + (7√2 L² d/(1−ρ²)) Σ_{τ=0}^{t} ( 1 − ((1−ρ²)²/25) ακ^{4/3} )^τ u²_{t+1−τ}.
Details on the Numerical Example
In this part we present more details on the numerical example.
We have tested Algorithm 1, Algorithm 2, decentralized gradient descent (DGD), and distributed gradient descent with gradient tracking [6] on the phase retrieval problem (21), starting from initial points randomly generated from the distribution N(0, (1/d) I_{nd}). For legibility, we only present numerical results for 4 of these initial points here. Since Algorithm 1 uses random vectors in its iterations, we run it 10 times for each initial point to obtain 10 random instances. In the following illustrations, light blue curves represent the individual results of the 10 instances of Algorithm 1, and dark blue curves represent their average. The parameters of the algorithms are set as follows:
1. Algorithm 1: η_t = 0.05/√(t+24), u_t = 0.2/√(t+24).

2. Algorithm 2: η = 0.03, u_t = 0.1/t^{3/4}.

3. DGD: η_t = 0.05/√(t+24).

4. Gradient tracking: η = 0.03.
Note that the step sizes used for Algorithm 1 have a slightly different form from the one studied in Theorem 1. We point out that this does not affect the applicability of the theoretical results much.
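For reference, the schedules above can be written as small helpers; the function names are our own and purely illustrative, and we assume the DGD step size shares the √(t+24) decay used by Algorithm 1.

import numpy as np

def alg1_params(t):
    # Step size and smoothing radius for Algorithm 1 (DGD uses the same eta_t).
    return 0.05 / np.sqrt(t + 24), 0.2 / np.sqrt(t + 24)

def alg2_params(t):
    # Constant step size and decaying smoothing radius for Algorithm 2
    # (the gradient tracking baseline uses the same constant step size).
    return 0.03, 0.1 / t ** 0.75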
We first compare Algorithm 1 with DGD; Figure 2 shows the associated illustrations of the two algorithms. We see that for all 4 cases, both algorithms converge relatively fast during the initial stage and then stabilize at a sublinear convergence rate, meaning that they exhibit similar qualitative convergence behavior. On the other hand, Algorithm 1 converges more slowly and enters the sublinear convergence stage with higher sub-optimality and consensus errors compared to DGD. In particular, the squared gradient magnitudes ‖∇f(x̄(t))‖² for Algorithm 1 are about 100 times larger than those for DGD when t ≳ 7.5 × 10³. We speculate that this results from the randomization of the 2-point gradient estimator employed in Algorithm 1.
Next we compare Algorithm 2 with the gradient tracking method in [6]. The results are shown in Figure 3. We can see that the two algorithms exhibit almost identical behavior. We point out that this might be a consequence of the sufficient smoothness of the objective functions, which could lead to very accurate estimation of the gradient even when u_t is not very small (the smallest u_t used in the simulation is 0.1/500^{3/4} ≈ 1 × 10⁻³). We are currently working on testing the performance of the proposed algorithm on other problems, and it is likely that different phenomena can occur for different problems.
Finally, we compare the convergence of Algorithm 1 and Algorithm 2 in terms of the number of function value queries, as they are both zero-order methods. We have already presented and discussed one case of the initial points in the main text. Figure 4 illustrates the results for all 4 cases, and also includes the graphs of the consensus behavior. We see that the observations made in the main text remain valid for the rest of the cases.

Figure 2: Comparison of Algorithm 1 and DGD for 4 cases of initial points. Figures on the left-hand side show convergence of ‖∇f(x̄(t))‖², and figures on the right-hand side show the corresponding consensus errors (1/n) Σ_{i=1}^{n} ‖x^i(t) − x̄(t)‖².


Figure 3: Comparison of Algorithm 2 and the distributed gradient descent with gradient tracking in [6] for 4 cases of initial points. Figures on the left-hand side show convergence of ‖∇f(x̄(t))‖², figures in the middle show the corresponding consensus errors (1/n) Σ_{i=1}^{n} ‖x^i(t) − x̄(t)‖², and figures on the right-hand side illustrate the corresponding gradient tracking errors (1/n) Σ_{i=1}^{n} ‖s^i(t) − ∇f(x̄(t−1))‖².

Figure 4: Comparison of Algorithm 1 and Algorithm 2 for 4 cases of initial points. Figures on the left-hand side show convergence of ‖∇f(x̄(t))‖², and figures on the right-hand side show the corresponding consensus errors (1/n) Σ_{i=1}^{n} ‖x^i(t) − x̄(t)‖².
