A Stochastic Second-Order Proximal Method
for Distributed Optimization
Abstract
In this paper, we propose a distributed stochastic second-order proximal method that enables agents in a network to cooperatively minimize the sum of their local loss functions without any centralized coordination. The proposed algorithm, referred to as St-SoPro, incorporates a decentralized second-order approximation into an augmented Lagrangian function, and then randomly samples the local gradients and Hessian matrices of the agents, so that it is computationally and memory-wise efficient, particularly for large-scale optimization problems. We show that for globally restricted strongly convex problems, the expected optimality error of St-SoPro asymptotically drops below an explicit error bound at a linear rate, and the error bound can be arbitrarily small with proper parameter settings. Simulations over real machine learning datasets demonstrate that St-SoPro outperforms several state-of-the-art distributed stochastic first-order methods in terms of convergence speed as well as computation and communication costs.
I Introduction
Stochastic optimization algorithms have been flourishing recently due to their appealing efficiency in machine learning [1, 2]. In the context of large-scale machine learning, parallel stochastic algorithms are often used to process large datasets [3, 4]. However, such methods require a central node to keep the variables of all nodes consistent, so the communication burden on the central node becomes a bottleneck that restricts algorithm performance.
On the other hand, a variety of distributed optimization algorithms have been proposed over the past decade to tackle network control and resource allocation problems, where agents in a network communicate only with their neighbors and do not rely on any central coordination, eliminating potential communication bottlenecks in the computing infrastructure [5]. Typical methods include distributed gradient descent (DGD) [5], the decentralized exact first-order algorithm (EXTRA) [6], and distributed gradient tracking algorithms [7].
Inheriting the merits of the above two algorithm types, distributed stochastic optimization algorithms have been attracting a lot of recent interest. For smooth, strongly convex optimization problems, [8] develops a distributed stochastic gradient descent (DSGD) method based on DGD, which is shown to achieve the optimal sublinear convergence rate (independent of the network) of a centralized stochastic gradient descent (SGD) method [9, 10]. The same convergence rate is attained by the exact diffusion method with adaptive step-sizes (EDAS) in [11]. In addition, the distributed stochastic gradient tracking (DSGT) method proposed in [12] is guaranteed to linearly converge to a neighborhood of the optimal solution in expectation. For non-convex problems, [13] designs a distributed primal-dual SGD algorithm with both fixed step-sizes (DPD-SGD-F) and adaptive step-sizes (DPD-SGD-T), where the former converges linearly to a neighborhood of the optimum under the Polyak-Łojasiewicz (PL) condition.
The aforementioned distributed stochastic algorithms are all constructed upon deterministic first-order methods and evolve using only stochastic gradients. Since second-order information often yields more accurate local approximations and accelerates convergence, we endeavor to develop a distributed stochastic second-order optimization algorithm.
To this end, we choose SoPro [14], a deterministic distributed second-order proximal algorithm, as the cornerstone. SoPro is developed by virtue of a decentralized second-order approximation of the augmented Lagrangian function in the classic method of multipliers [15], and its convergence performance outperforms that of various distributed first-order methods in the deterministic setting. In this paper, we adapt SoPro to the stochastic scenario. Specifically, instead of letting each agent compute the exact local gradient and Hessian determined by all its local data as SoPro does, we allow each agent to update using stochastic approximations of its local gradient and Hessian, obtained from two batches of samples drawn randomly and uniformly from its local loss function. Such a stochastic variant of SoPro can significantly enhance the computational and memory efficiency of the agents. We refer to this algorithm as the stochastic second-order proximal algorithm (St-SoPro).
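To make this sampling step concrete, the following minimal Python sketch (our illustration, not the paper's implementation; all function and variable names are ours) forms the two mini-batch estimates from independent batches drawn uniformly without replacement:

```python
import numpy as np

def sampled_grad_and_hessian(samples, grad_fn, hess_fn, x, s_g, s_h, rng):
    """Mini-batch estimates of one agent's local gradient and Hessian.

    samples: the agent's local data (length m_i); grad_fn/hess_fn return
    the gradient/Hessian of a single sample's loss at x; s_g, s_h are the
    two batch sizes.
    """
    m = len(samples)
    # Two independent batches, drawn uniformly without replacement, so the
    # averaged estimates are unbiased for the full local gradient/Hessian.
    batch_g = rng.choice(m, size=s_g, replace=False)
    batch_h = rng.choice(m, size=s_h, replace=False)
    g = sum(grad_fn(samples[j], x) for j in batch_g) / s_g
    H = sum(hess_fn(samples[j], x) for j in batch_h) / s_h
    return g, H
```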
Under the assumptions that the local loss functions are smooth and convex, and that their sum (i.e., the global objective function) is globally restricted strongly convex, we show that our proposed St-SoPro algorithm linearly converges to a neighborhood of the optimal solution in expectation over undirected networks. In particular, we provide an explicit upper bound on its ultimate suboptimality, and illustrate that this upper bound can be made arbitrarily small with properly set parameters. Finally, we validate the superior performance of St-SoPro over several recent distributed stochastic optimization methods on real machine-learning classification datasets, in terms of convergence speed, communication load, computational efficiency, and classification accuracy.
The paper is organized as follows. Section II formulates the optimization problem to be solved, and Section III describes the proposed St-SoPro algorithm. Section IV provides the convergence analysis, Section V presents the numerical results, and Section VI concludes the paper.
Notation: For any differentiable function $f$, its gradient at $x$ is denoted by $\nabla f(x)$, and if $f$ is twice differentiable, we use $\nabla^2 f(x)$ to denote its Hessian matrix at $x$. For any set $\mathcal{S}$, $|\mathcal{S}|$ represents the cardinality of $\mathcal{S}$. In addition, $\otimes$ is the Kronecker product, $\langle \cdot, \cdot \rangle$ is the Euclidean inner product, and $\|\cdot\|$ is the Euclidean norm. We use $\mathbf{0}_d$, $\mathbf{1}_d$, $O_d$, and $I_d$ to denote the $d$-dimensional all-zero vector, all-one vector, zero matrix, and identity matrix, respectively. Also, $\operatorname{diag}(A_1, \ldots, A_n)$ represents the block diagonal matrix whose diagonal blocks are sequentially $A_1, \ldots, A_n$. Given a symmetric matrix $P$, we write $P \succeq O$ if it is positive semidefinite and $P \succ O$ if it is positive definite. For any symmetric $P$, $\lambda_{\max}(P)$ and $\lambda_{\min}(P)$ are the largest and smallest real eigenvalues of $P$, respectively, and $P^{\dagger}$ is $P$'s pseudoinverse.
II Problem Formulation
Consider a set $\mathcal{V} = \{1, \ldots, N\}$ of agents, where the agents are connected through the link set $\mathcal{E}$. We model such a network as a connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, and denote the set of each agent $i$'s neighbors by $\mathcal{N}_i = \{j \in \mathcal{V} : \{i, j\} \in \mathcal{E}\}$. Suppose each agent $i \in \mathcal{V}$ observes a finite number $m_i$ of local samples that are independent random vectors, and attempts to solve the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \; f(x) := \sum_{i \in \mathcal{V}} f_i(x). \qquad (1)$$
In problem (1), $f_i : \mathbb{R}^d \to \mathbb{R}$ is the local loss function of agent $i$, which is the average of every sample's loss associated with agent $i$, i.e., $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} f_{i,j}(x)$, where $f_{i,j}$ denotes the loss of agent $i$'s $j$th sample.
Below, we impose the following assumptions on problem (1).
Assumption 1.
a) The global objective function $f = \sum_{i \in \mathcal{V}} f_i$ is globally restricted strongly convex with respect to the optimal solution of problem (1).
b) Each local loss function $f_i$, $i \in \mathcal{V}$, is convex and twice continuously differentiable, and each gradient $\nabla f_i$ is Lipschitz continuous.
Assumption 1 leads to the following inequalities: for any given $x \in \mathbb{R}^d$ and $i \in \mathcal{V}$, $O_d \preceq \nabla^2 f_i(x)$ and
$$\nabla^2 f_i(x) \preceq M_i I_d \qquad (2)$$
for some $M_i > 0$. Also, the globally restricted strong convexity in Assumption 1a) guarantees that the optimal solution $x^\star$ of problem (1) is unique.
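For reference, a standard way to state this restricted strong convexity property (our phrasing, with a generic constant $\beta$; the paper's exact statement was lost in extraction) is
$$f(x) \;\ge\; f(x^\star) + \nabla f(x^\star)^T (x - x^\star) + \frac{\beta}{2} \|x - x^\star\|^2 \qquad \forall x \in \mathbb{R}^d$$
for some $\beta > 0$; that is, strong convexity is required only relative to the optimum $x^\star$ rather than between arbitrary pairs of points, which is weaker than standard strong convexity.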
Problem (1) requires that the agents reach a consensus while minimizing all the sample losses throughout the network. Indeed, a wide range of real-world problems can be cast into the form of problem (1), such as distributed model predictive control [16], distributed spectrum sensing [17], and logistic regression [18]. Under many circumstances, these engineering problems involve huge datasets. Thus, we focus on solving problem (1) in a fully decentralized and stochastic fashion: each agent communicates only with its neighbors and computes using only a randomly chosen subset of its local samples.
III Stochastic Second-order Proximal Method
In this section, we develop a distributed stochastic algorithm for solving problem (1) over undirected networks.
To this end, we first provide a brief review of the distributed (deterministic) second-order proximal algorithm (SoPro) in [14]. Note that problem (1) is equivalent to
$$\min_{\mathbf{x} \in \mathbb{R}^{Nd}} \; F(\mathbf{x}) := \sum_{i \in \mathcal{V}} f_i(x_i) \quad \text{s.t.} \quad W^{1/2} \mathbf{x} = \mathbf{0}_{Nd}, \qquad (3)$$
where $\mathbf{x} = (x_1^T, \ldots, x_N^T)^T$, each $x_i \in \mathbb{R}^d$ is agent $i$'s local copy of the decision variable, and $W \in \mathbb{R}^{Nd \times Nd}$ is a symmetric positive semidefinite matrix associated with the graph $\mathcal{G}$, chosen so that the null space of $W$ is $\{\mathbf{1}_N \otimes y : y \in \mathbb{R}^d\}$. Also, the unique optimal solution of problem (3) is $\mathbf{x}^\star = \mathbf{1}_N \otimes x^\star$.
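As an illustration, one standard choice of $W$ consistent with this null-space requirement is the graph Laplacian lifted by a Kronecker product; a minimal sketch, assuming that choice (the helper name is ours):

```python
import numpy as np

def lifted_laplacian(n_agents, edges, d):
    """Graph Laplacian of G, lifted to the stacked space R^{Nd x Nd}.

    For a connected graph, the null space of L is span(1_N), so the null
    space of kron(L, I_d) is {1_N (x) y : y in R^d}; hence W^{1/2} x = 0
    exactly when x_1 = ... = x_N, matching the constraint in (3).
    """
    L = np.zeros((n_agents, n_agents))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.kron(L, np.eye(d))
```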
The application of the method of multipliers [15] to solve (3) gives the following: starting from any $(\mathbf{x}^0, q^0)$,
$$\mathbf{x}^{k+1} = \arg\min_{\mathbf{x} \in \mathbb{R}^{Nd}} L_\rho(\mathbf{x}, q^k), \qquad (4)$$
$$q^{k+1} = q^k + \rho W^{1/2} \mathbf{x}^{k+1}, \qquad (5)$$
where $\mathbf{x}$ and $q$ are the primal and dual variables, respectively, and $L_\rho$ is the augmented Lagrangian function given by $L_\rho(\mathbf{x}, q) = F(\mathbf{x}) + \langle q, W^{1/2} \mathbf{x} \rangle + \frac{\rho}{2} \mathbf{x}^T W \mathbf{x}$ with penalty parameter $\rho > 0$.
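A compact centralized sketch of (4)–(5), assuming the augmented Lagrangian form above (the inner argmin is handed to a generic solver, which is precisely the expensive step that SoPro's decentralized second-order model replaces):

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize

def method_of_multipliers(F, gradF, W, x0, rho=1.0, iters=50):
    """Illustrative method of multipliers for problem (3)."""
    W_half = np.real(sqrtm(W))          # W^{1/2}
    x, q = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        # (4): primal step, minimize L_rho(., q) over x.
        L = lambda z, q=q: F(z) + q @ (W_half @ z) + 0.5 * rho * z @ (W @ z)
        gL = lambda z, q=q: gradF(z) + W_half @ q + rho * (W @ z)
        x = minimize(L, x, jac=gL, method="L-BFGS-B").x
        # (5): dual ascent step along the constraint residual.
        q = q + rho * (W_half @ x)
    return x
```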
Since (4)–(5) cannot be implemented in a distributed way, the SoPro algorithm in [14] introduces a decentralized second-order proximal approximation of $L_\rho$ in (4) and applies a variable change to (5). Specifically, it replaces $F(\mathbf{x})$ with its second-order Taylor expansion at $\mathbf{x}^k$. Then, it replaces the remaining coupling term $\frac{\rho}{2} \mathbf{x}^T W \mathbf{x}$ in the primal update with a proximal term $\frac{1}{2} \|\mathbf{x} - \mathbf{x}^k\|_D^2$, where $D$ is a symmetric block diagonal matrix satisfying a positive definiteness condition that makes the approximation well-posed. Furthermore, we define $v^k := W^{1/2} q^k$ as a substitute for the dual variable $q^k$, and $v^k$ can be ensured to identically stay in the range of $W$ by letting $v^0 \in \operatorname{range}(W)$. To summarize, SoPro takes the following form: starting from $(\mathbf{x}^0, v^0)$ satisfying $v^0 \in \operatorname{range}(W)$,
(6)
where the matrix $D$ and the parameter $\rho$ satisfy the condition that makes the primal update in (6) well-posed.
The primal update of SoPro (6) requires that each agent uses up all its local samples. However, the agents may only be able to access or process a portion of their local samples at one time, especially in the big-data scenario. Motivated by this, we consider approximating the gradient and the Hessian in (6) via a stochastic gradient and a stochastic Hessian given by
$$g_i^k = \frac{1}{|\mathcal{S}_{g,i}^k|} \sum_{j \in \mathcal{S}_{g,i}^k} \nabla f_{i,j}(x_i^k), \qquad (7)$$
$$H_i^k = \frac{1}{|\mathcal{S}_{h,i}^k|} \sum_{j \in \mathcal{S}_{h,i}^k} \nabla^2 f_{i,j}(x_i^k). \qquad (8)$$
Here, for each agent $i \in \mathcal{V}$, $\mathcal{S}_{g,i}^k$ and $\mathcal{S}_{h,i}^k$ are two independent random sample sets uniformly chosen from $\{1, \ldots, m_i\}$ without replacement, so that $g_i^k$ and $H_i^k$ are unbiased, i.e., $\mathbb{E}[g_i^k \mid x_i^k] = \nabla f_i(x_i^k)$ and $\mathbb{E}[H_i^k \mid x_i^k] = \nabla^2 f_i(x_i^k)$ for all $k \ge 0$. Due to (2), $O_d \preceq H_i^k \preceq M_i I_d$, and
$$O_{Nd} \preceq H^k \preceq \Lambda, \qquad (9)$$
where $H^k = \operatorname{diag}(H_1^k, \ldots, H_N^k)$ and $\Lambda = \operatorname{diag}(M_1 I_d, \ldots, M_N I_d)$.
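The unbiasedness claim is easy to check numerically; below is a toy Monte Carlo verification for the gradient estimate (the same pattern applies to the Hessian), using a linear per-sample loss $f_{i,j}(x) = a_j^T x$ so that each sample's gradient is simply $a_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, s = 200, 5, 20
A = rng.normal(size=(m, d))                 # row j = gradient of sample j
full_grad = A.mean(axis=0)                  # exact local gradient
# Average many without-replacement batch means; by unbiasedness this
# should converge to the full gradient.
est = np.mean([A[rng.choice(m, size=s, replace=False)].mean(axis=0)
               for _ in range(20000)], axis=0)
assert np.allclose(est, full_grad, atol=1e-2)
```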
Using the above randomly sampled gradient and Hessian, we obtain the following stochastic variant of SoPro: starting from any $(\mathbf{x}^0, v^0)$ such that $v^0 \in \operatorname{range}(W)$,
(10)
where each diagonal block of $D$ is chosen so that the primal update in (10) is well-posed. The above initialization and updates compose our proposed stochastic second-order proximal (St-SoPro) method. The distributed implementation of St-SoPro over the undirected network $\mathcal{G}$ is described in Algorithm 1, in which auxiliary variables are introduced for clearer presentation.
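For intuition, the following sketch shows one iteration of an update with the structure described above (a simplified illustration of (10), not the paper's verbatim update; `grad_hat` and `hess_hat` stack the sampled estimates (7)–(8), and `W`, `D`, `rho` are as in Section III):

```python
import numpy as np

def st_sopro_step(x, v, grad_hat, hess_hat, W, D, rho):
    """One illustrative St-SoPro-style primal-dual iteration."""
    # Primal step: minimize the decentralized second-order model. Since
    # hess_hat + D is block diagonal, the linear solve decouples, so each
    # agent only inverts its own d-by-d block locally.
    rhs = grad_hat + v + rho * (W @ x)
    x_new = x - np.linalg.solve(hess_hat + D, rhs)
    # Dual step: v stays in range(W) whenever v^0 is initialized there.
    v_new = v + rho * (W @ x_new)
    return x_new, v_new
```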
IV Convergence Analysis
This section provides the convergence analysis of St-SoPro.
We first impose an assumption on the expected deviation of each sample's gradient from the corresponding full local gradient.
Assumption 2.
The random local samples are independent, and there exists some $\sigma > 0$ such that for each $i \in \mathcal{V}$ and every $x \in \mathbb{R}^d$,
$$\frac{1}{m_i} \sum_{j=1}^{m_i} \|\nabla f_{i,j}(x) - \nabla f_i(x)\|^2 \le \sigma^2.$$
To simplify the notation, below we let the two random sample sets $\mathcal{S}_{g,i}^k$ and $\mathcal{S}_{h,i}^k$ of each agent be of the same size $s$, and we let each $m_i = m$. According to [19, Chapter 2],
$$\mathbb{E}\big[\|g_i^k - \nabla f_i(x_i^k)\|^2\big] \le \frac{m - s}{s(m-1)} \, \sigma^2. \qquad (11)$$
This is consistent with the fact that, when computing the stochastic gradient $g_i^k$, reducing the number of randomly selected samples enlarges the expected discrepancy between $g_i^k$ and $\nabla f_i(x_i^k)$.
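The finite-population scaling in (11) can be illustrated empirically: the variance of a without-replacement batch mean matches $(1 - s/m)S^2/s$, where $S^2$ is the population variance with the $m-1$ denominator (a toy check, with scalar data standing in for per-sample gradients):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
pop = rng.normal(size=m)                    # stand-in for per-sample gradients
for s in (10, 50, 250):
    means = [pop[rng.choice(m, size=s, replace=False)].mean()
             for _ in range(5000)]
    empirical = np.var(means)
    predicted = pop.var(ddof=1) / s * (1.0 - s / m)
    print(f"s={s}: empirical {empirical:.5f} vs predicted {predicted:.5f}")
```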
Next, for the sake of presenting the convergence result, we introduce the following notation and definitions. According to [14], any $q$ satisfying $\nabla F(\mathbf{x}^\star) + W^{1/2} q = \mathbf{0}_{Nd}$ is a dual optimum of problem (3). Thus, we define
(12)
as a particular dual optimum $v^\star$ of (3). Also, throughout this section, we let $(\mathbf{x}^k, v^k)_{k \ge 0}$ denote the iterates generated by St-SoPro. Since such a $v^\star$ satisfies (5),
(13)
In addition, we define . Based on [14, Lemma 1], for any ,
(14)
where is given by , is given in Assumption 1a), , and is the smallest non-zero eigenvalue of . It can be shown that if and only if , and its maximum value is attained at the unique positive root of the cubic equation . We denote the maximum value of by , which is the convexity parameter of (and indeed can be taken as any positive ).
Our convergence analysis relies on the following parameter condition. Suppose there exists such that
(15)
Let and , which are guaranteed to be positive definite due to (15). Furthermore, it follows from (9) and (15) that the condition required in Section III holds.
We provide our main result in the theorem below.
Theorem 1.
Proof.
See Appendix A. ∎
It can be shown that certain parameter choices lead to faster convergence of St-SoPro; such analysis follows the idea in [14] and is omitted here due to the space limitation. In addition, it can be seen from (17) that the error bound drops as the stochastic-gradient variance bound in (11) decreases. Hence, essentially, larger random sample sets for computing the stochastic gradients lead to smaller optimality error.
In fact, the expected distance between $\mathbf{x}^k$ and $\mathbf{x}^\star$ can eventually reach an arbitrarily small value under proper parameter settings. To see this, for simplicity, let $m_i = m$ and let both random sample sets be of size $s$ for all $i \in \mathcal{V}$, and choose the remaining parameters so that (15) holds. From (17), it can be shown that as $s \to m$, the resulting upper bound on the ultimate optimality error goes to zero. Since this bound is continuous in $s$, given any $\epsilon > 0$, the above parameter setting with a sufficiently large $s$ guarantees that the expected optimality error ultimately drops below $\epsilon$.
V Numerical Experiment
This section compares the practical convergence performance of St-SoPro with several state-of-the-art distributed stochastic optimization algorithms.
In the numerical experiment, we intend to learn linear classifiers by solving the $\ell_2$-regularized logistic regression problem of the following form over a randomly generated, undirected, and connected network:
$$\min_{x \in \mathbb{R}^d} \; \sum_{i \in \mathcal{V}} \left( \frac{1}{m_i} \sum_{j=1}^{m_i} \ln\big(1 + \exp(-b_{ij} a_{ij}^T x)\big) + \frac{\lambda}{2} \|x\|^2 \right), \qquad (19)$$
where $\lambda > 0$ is the regularization parameter and $(a_{ij}, b_{ij}) \in \mathbb{R}^d \times \{-1, +1\}$, $j = 1, \ldots, m_i$, are the data samples of agent $i$. Our experiment is conducted on two standard real datasets, a4a and mushrooms, from the LIBSVM library [20]. Table I lists the problem and network parameters corresponding to these two datasets, including the problem dimension $d$, the number $N$ of agents, the network's average degree, the number $m_i$ of samples assigned to each agent, the sizes $s_g$ and $s_h$ of the random sample sets that each agent chooses per iteration, as well as the regularization parameter $\lambda$.
Table I: Problem and network parameters for the two datasets.
| Dataset | $d$ | $N$ | avg. degree | $m_i$ | $s_g$ | $s_h$ | $\lambda$ |
|---|---|---|---|---|---|---|---|
| a4a | 123 | 20 | 5 | 239 | 80 | 10 | |
| mushrooms | 112 | 10 | 3 | 600 | 80 | 25 | |
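For completeness, a per-sample gradient and Hessian for the loss in (19) can be sketched as follows (assuming the standard form $\ln(1 + e^{-b a^T x}) + \frac{\lambda}{2}\|x\|^2$ with labels $b \in \{-1, +1\}$; these are the quantities each agent averages over its sampled batches in (7)–(8)):

```python
import numpy as np

def logistic_grad_hess(a, b, x, lam):
    """Gradient and Hessian of one sample's l2-regularized logistic loss."""
    p = 1.0 / (1.0 + np.exp(b * (a @ x)))   # sigmoid(-b a^T x)
    g = -b * p * a + lam * x
    # b^2 = 1 and sigmoid'(z) = p(1-p), so the data term is p(1-p) a a^T.
    H = p * (1.0 - p) * np.outer(a, a) + lam * np.eye(a.size)
    return g, H
```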
The simulations include DSGD [8], EDAS [11], DSGT [12], and DPD-SGD-T [13], all first-order methods, for comparison with our proposed St-SoPro. We fine-tune all the algorithm parameters so that each algorithm reaches a given accuracy on each dataset within the fewest possible iterations.
Figures 1(a)–(c) and 2(a)–(c) plot the evolution of the optimality error generated by the aforementioned algorithms over a4a and mushrooms with respect to the number of iterations, the number of communicated bits (computed from the number of transmitted real scalars according to [21]), and the computation time. Observe that St-SoPro converges fastest to the given accuracy, validating its computational and communication efficiency. It is worth mentioning that although St-SoPro is a second-order method, its computational cost is comparable with that of the first-order methods on such common machine learning problems. Figures 1(d) and 2(d) present the classification accuracy on the test sets after each training iteration of these algorithms, where St-SoPro outperforms the other methods.
Figure 1: Results on a4a: (a) optimality error vs. iterations; (b) optimality error vs. communicated bits; (c) optimality error vs. computation time; (d) test classification accuracy.
Figure 2: Results on mushrooms: (a) optimality error vs. iterations; (b) optimality error vs. communicated bits; (c) optimality error vs. computation time; (d) test classification accuracy.
VI Conclusion
We have developed St-SoPro, a distributed stochastic second-order proximal method, for addressing strongly convex and smooth optimization over undirected networks. Different from the existing first-order distributed stochastic algorithms, St-SoPro incorporates a second-order approximation of an augmented Lagrangian function and randomly samples each local gradient and Hessian. We show that St-SoPro linearly converges to a neighborhood of the optimal solution in expectation, and the neighborhood can be arbitrarily small. Simulations over two real datasets demonstrate that St-SoPro is both computationally and communication-wise efficient.
A Proof of Theorem 1
The following lemma intends to bound the difference between and .
Lemma 1.
For each ,
(20)
where can be arbitrary and .
Proof.
We first equivalently expand the left-hand side of (20). Similar to [14, Eq. (27)], we derive
(21)
Then, using (5) and , we obtain . From (10) and (5), we have , where . The above two equations, together with (12), give
(22)
Moreover, based on (5), , and [14, Eq. (26)],
(23)
By incorporating (23) into (22) and then combining the resulting equation with (21), we have
(24)
Subsequently, we provide a lower bound for the first term on the right-hand side of (24). To do so, we utilize the AM-GM inequality and (11) to derive
(25)
Due to the Lipschitz continuity of each and the unbiasedness of , we have . We multiply this inequality by and then add it to (25), which leads to
(26)
Because of the restricted strong convexity of shown in Section IV and , we have . This, along with (26), results in
(27)
In addition to Lemma 1, below we provide an upper bound on . For any , through (11), (10), (12), (13), and the AM-GM inequality,
leading to
(29)
Pick an arbitrary . By subtracting (29) multiplied by from (20), we have
(30)
where , , and . To make (16) hold based on (30), it suffices to let , , , i.e.,
(31)
(32)
(33)
To guarantee the existence of subject to (31)–(33), we need and . Note from (15) that at . Then, due to the continuity of with respect to , there is such that . Therefore, we ensure (16) with given by (18).
References
- [1] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
- [2] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
- [3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
- [4] S. U. Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations (ICLR), 2019.
- [5] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [6] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [7] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, pp. 2597–2633, 2017.
- [8] S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Transactions on Automatic Control, vol. 67, no. 11, pp. 5900–5915, 2022.
- [9] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in 29th International Conference on Machine Learning, 2012, pp. 1571–1578.
- [10] A. Nemirovski, A. B. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, pp. 1574–1609, 2009.
- [11] K. Huang and S. Pu, “Improving the transient times for distributed stochastic gradient methods,” IEEE Transactions on Automatic Control (Early Access), 2022.
- [12] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, 2021.
- [13] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 5, pp. 812–833, 2022.
- [14] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, pp. 1864–1871, 2021.
- [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
- [16] P. Giselsson, M. D. Doan, T. Keviczky, B. D. Schutter, and A. Rantzer, “Accelerated gradient methods and dual decomposition in distributed model predictive control,” Automatica, vol. 49, pp. 829–833, 2013.
- [17] J. A. Bazerque and G. B. Giannakis, “Distributed spectrum sensing for cognitive radio networks by exploiting sparsity,” IEEE Transactions on Signal Processing, vol. 58, pp. 1847–1862, 2010.
- [18] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 595–627, 2014.
- [19] S. L. Lohr, Sampling: Design and Analysis. Chapman and Hall/CRC, 2021.
- [20] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
- [21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017.