
Escaping Saddle Points in Distributed Newton’s Method with Communication Efficiency and Byzantine Resilience

Avishek Ghosh (a2ghosh@ucsd.edu), UCSD
Raj Kumar Maity (rajkmaity@cs.umass.edu), UMass Amherst
Arya Mazumdar (arya@ucsd.edu), UCSD
Kannan Ramachandran (kannanr@eecs.berkeley.edu), UC Berkeley
Abstract

The problem of saddle-point avoidance for non-convex optimization is quite challenging in large-scale distributed learning frameworks, such as Federated Learning, especially in the presence of Byzantine workers. The celebrated cubic-regularized Newton method of [Nesterov and Polyak(2006)] is one of the most elegant ways to avoid saddle points in the standard centralized (non-distributed) setup. In this paper, we extend the cubic-regularized Newton method to a distributed framework and simultaneously address several practical challenges, such as communication bottlenecks and Byzantine attacks. Note that the issue of saddle-point avoidance becomes more crucial in the presence of Byzantine machines, since rogue machines may create fake local minima near the saddle points of the loss function; this is known as the saddle-point attack. Being a second order algorithm, our method has a much lower iteration complexity than its first order counterparts. Furthermore, we use compression (or sparsification) techniques, such as δ-approximate compression, for communication efficiency. We obtain theoretical guarantees for our proposed scheme under several settings, including approximate (sub-sampled) gradients and Hessians. Moreover, we validate our theoretical findings with experiments using standard datasets and several types of Byzantine attacks, and obtain an improvement of 25% over first order methods in iteration complexity.

keywords:
Distributed optimization, Communication efficiency, robustness, compression.

1 Introduction

Motivated by real-world applications such as recommendation systems, image recognition, and conversational AI, it has become crucial to implement learning algorithms in a distributed fashion. In a commonly used framework, namely data-parallelism, large datasets are distributed among several worker machines for parallel processing. In many applications, like Federated Learning (FL) [Konečnỳ et al.(2016)Konečnỳ, McMahan, Ramage, and Richtárik], data is stored on user devices such as mobile phones and personal computers. In a standard distributed framework, several worker machines perform local computations and communicate with the center machine (a parameter server), and the center machine aggregates and broadcasts the information iteratively.

In this setting, it is well known that one of the major challenges is to tackle the behavior of Byzantine machines [Lamport et al.(1982)Lamport, Shostak, and Pease]. Byzantine behavior can arise from software or hardware crashes, poor communication links between the worker and the center machine, stalled computations, or even coordinated or malicious attacks by a third party. In this setup, it is generally assumed (see [Yin et al.(2018)Yin, Chen, Kannan, and Bartlett, Blanchard et al.(2017)Blanchard, Mhamdi, Guerraoui, and Stainer]) that a subset of worker machines behave completely arbitrarily, even in a way that depends on the algorithm used and the data on the other machines, thereby capturing the unpredictable nature of the errors.

Another critical challenge in this distributed setup is the communication cost between the worker and the center machine. The gains obtained by parallelizing the task among several worker machines often get bottlenecked by this communication cost. In applications like Federated Learning, the communication cost is directly linked to the (internet) bandwidth of the users and is thus resource-constrained. It is well known that, in terms of the number of iterations, second order methods (like Newton and its variants) outperform their first order, gradient-based competitors. In this work, we simultaneously handle the Byzantine and communication-cost aspects of distributed optimization for non-convex functions.

In this paper, we focus on optimizing a non-convex loss function f(·) in a distributed optimization framework. We have m worker machines, out of which an α fraction may behave in a Byzantine fashion, where α < 1/2. Optimizing a loss function in a distributed setup has gained a lot of attention in recent years [Alistarh et al.(2018a)Alistarh, Allen-Zhu, and Li, Blanchard et al.(2017)Blanchard, Mhamdi, Guerraoui, and Stainer, Feng et al.(2014)Feng, Xu, and Mannor, Chen et al.(2017)Chen, Su, and Xu]. However, most of these approaches either work only when f(·) is convex, or provide weak guarantees in the non-convex case (for example, convergence to a point with zero gradient, which may be a saddle point).

In order to fit complex machine learning models, one often needs to find local minima of a non-convex loss f(·), instead of merely critical points, which may include several saddle points. Training deep neural networks and other high-capacity learning architectures [Soudry and Carmon(2016), Ge et al.(2017)Ge, Jin, and Zheng] are examples where finding local minima is crucial. [Ge et al.(2017)Ge, Jin, and Zheng, Kawaguchi(2016)] show that the stationary points of these problems are in fact saddle points that lie far away from any local minimum, and hence designing efficient algorithms that escape saddle points is of interest. Moreover, [Jain et al.(2017)Jain, Jin, Kakade, and Netrapalli, Sun et al.(2016)Sun, Qu, and Wright] argue that saddle points can lead to highly sub-optimal solutions in many problems of interest. This effect is amplified in high dimensions, as shown in [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio], and becomes the main bottleneck in training deep neural networks. Furthermore, a line of recent work [Sun et al.(2016)Sun, Qu, and Wright, Bhojanapalli et al.(2016)Bhojanapalli, Neyshabur, and Srebro, Sun et al.(2017)Sun, Qu, and Wright] shows that for many non-convex problems it is sufficient to find a local minimum. In fact, in many problems of interest, all local minima are global minima (e.g., dictionary learning [Sun et al.(2017)Sun, Qu, and Wright], phase retrieval [Sun et al.(2016)Sun, Qu, and Wright], matrix sensing and completion [Bhojanapalli et al.(2016)Bhojanapalli, Neyshabur, and Srebro, Ge et al.(2017)Ge, Jin, and Zheng], and certain classes of neural networks [Kawaguchi(2016)]). Also, [Choromanska et al.(2015)Choromanska, Henaff, Mathieu, Arous, and LeCun] argue that for more general neural networks, the local minima are as good as global minima.

The issue of convergence to a local minimum becomes non-trivial in the presence of Byzantine workers. Since we do not assume anything about the behavior of the Byzantine workers, it is certainly conceivable that, by appropriately modifying their messages to the center, they can create fake local minima close to a saddle point of the loss function f(·) and far away from the true local minima. This is popularly known as the saddle-point attack (see [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]), and it can arbitrarily destroy the performance of any non-robust learning algorithm. Hence, our goal is to design an algorithm that escapes saddle points of f(·) efficiently and simultaneously resists the saddle-point attack. The complexity of such an algorithm emerges from the interplay between the non-convexity of the loss function and the behavior of the Byzantine machines.

The problem of saddle point avoidance in the context of non-convex optimization has received considerable attention in the past few years. The seminal paper of [Jin et al.(2017)Jin, Ge, Netrapalli, Kakade, and Jordan] proposed a (first order) gradient descent based approach, and a few follow-up papers [Xu et al.(2017)Xu, Jin, and Yang, Allen-Zhu and Li(2017)] use various modifications to obtain saddle point avoidance guarantees. A Byzantine robust first order saddle point avoidance algorithm is proposed by Yin et al. [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett], which is probably the closest to this work. In [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett], the authors propose a repeated check-and-escape type of first order gradient descent based algorithm. First, being a first order algorithm, its convergence rate is quite slow (the gradient decays at rate 1/\sqrt{T}, where T is the number of iterations). Moreover, implementation-wise, the algorithm presented in [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] is computation heavy and potentially takes many iterations between the center and the worker machines. Hence, this algorithm is not efficient in terms of communication cost.

In this work, we consider a variation of the celebrated cubic-regularized Newton algorithm of Nesterov and Polyak [Nesterov and Polyak(2006)], which efficiently escapes the saddle points of a non-convex function by appropriately choosing a regularization that pushes the Hessian towards a positive semi-definite matrix. The primary motivation behind this choice is the faster convergence rate compared to first order methods, which is crucial for communication efficiency in applications like Federated Learning. Indeed, the rate of gradient decay is 1/T^{2/3}.

We consider a distributed variant of the cubic-regularized Newton algorithm. In this scheme, the center machine asks the workers to solve an auxiliary optimization problem and return the result; the complexity of the problem is thereby partially transferred to the worker machines. It is worth mentioning that in most distributed optimization paradigms, including Federated Learning, the workers possess sufficient compute power to handle this partial transfer of compute load, and in most cases this is desirable [Konečnỳ et al.(2016)Konečnỳ, McMahan, Ramage, and Richtárik]. The center machine aggregates the solutions of the worker machines and takes a descent step. Note that, unlike gradient aggregation, the aggregation of solutions of local optimization problems is a highly non-linear operation. Hence, it is quite non-trivial to extend the centralized cubic-regularized algorithm to a distributed one. Unlike the second order Hessian based update or the first order gradient based update, the cubic-regularized sub-problem does not even admit a closed form solution. The analysis is carried out by leveraging the first order and second order stationarity conditions of the auxiliary problem solved at each worker machine.

In addition, we simultaneously use (i) a δ-approximate compressor (defined shortly) to compress the messages sent from the workers to the center for further communication reduction, and (ii) a simple norm based thresholding to robustify against adversarial attacks. Norm based thresholding is a standard trick for Byzantine resilience, as featured in [Ghosh et al.(2020a)Ghosh, Maity, Kadhe, Mazumdar, and Ramchandran, Ghosh et al.(2020b)Ghosh, Maity, and Mazumdar]. However, since the local optimization problem lacks a closed form solution, using norm based trimming is also technically challenging in this case. We now list our contributions.

1.1 Our Contributions

We propose a novel distributed and robust cubic-regularized Newton algorithm that escapes saddle points efficiently. We prove that the algorithm converges at a rate of 1/T^{2/3}, which is faster than first order methods (which converge at a 1/\sqrt{T} rate, see [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]). Also, the rate matches that of the centralized scheme of [Nesterov and Polyak(2006)]; hence, we do not lose in terms of convergence rate by making the algorithm distributed. We emphasize that since the center machine aggregates solutions of local (auxiliary) loss functions, this extension is quite non-trivial and technically challenging. The fast convergence reduces the number of iterations (and hence the communication cost) required to achieve a target accuracy.

Along with saddle point avoidance, we simultaneously address the issues of (i) communication efficiency and (ii) Byzantine resilience by using a δ-approximate compressor and a norm based thresholding scheme, respectively. A major technical challenge is to address all of the above issues at the same time, and it turns out that with proper parameter choices (step size etc.) it is possible to carry out the analysis jointly.

In Section 4, we verify our theoretical findings via experiments. We use benchmark LIBSVM ([Chang and Lin(2011)]) datasets for logistic regression and non-convex robust regression, and show convergence results both without Byzantine machines and under several different Byzantine attacks. Specifically, we characterize the total iteration complexity (defined in Section 4) of our algorithm and compare it with first order methods. We observe that the algorithm of [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] requires 25% more total iterations than ours.

Preliminaries:

A point \mathbf{x} is said to satisfy the ε-second order stationary condition of f(·) if

\|\nabla f(\mathbf{x})\|\leq\epsilon,\qquad\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\sqrt{\epsilon}.

Here, \nabla f(\mathbf{x}) denotes the gradient of the function and \lambda_{\min}(\nabla^{2}f(\mathbf{x})) denotes the minimum eigenvalue of its Hessian. Hence, under the assumption (standard in the literature, see [Jin et al.(2017)Jin, Ge, Netrapalli, Kakade, and Jordan, Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]) that all saddle points are strict (i.e., \lambda_{\min}(\nabla^{2}f(\mathbf{x}_{s}))<0 for any saddle point \mathbf{x}_{s}), all second order stationary points (with ε = 0) are local minima; hence converging to a second order stationary point is equivalent to converging to a local minimum.
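As a quick numerical illustration of this definition (our own sketch, not part of the original text), the two conditions can be checked as follows in Python/NumPy:

import numpy as np

def is_eps_second_order_stationary(grad, hess, eps):
    """Check the eps-second order stationary condition:
    ||grad f(x)|| <= eps  and  lambda_min(Hess f(x)) >= -sqrt(eps)."""
    return (np.linalg.norm(grad) <= eps
            and np.linalg.eigvalsh(hess)[0] >= -np.sqrt(eps))  # eigvalsh sorts ascending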

1.2 Problem Formulation

We minimize a loss function of the form f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^{m}f_{i}(\mathbf{x}), where f:\mathbb{R}^{d}\rightarrow\mathbb{R} is twice differentiable and non-convex. In this work, we consider a distributed optimization framework with m worker machines and one center machine, where the worker machines communicate with the center machine. Each worker machine i is associated with a local loss function f_i. We assume that the data distribution is non-iid across workers, which is standard in frameworks like FL. In addition, we also consider the case where an α fraction of the worker machines are Byzantine for some α < 1/2. The Byzantine machines can send arbitrary updates to the center machine, which can disrupt the learning. Furthermore, the Byzantine machines can collude with each other, create fake local minima, or attack maliciously by gaining information about the learning algorithm and the other workers. In the rest of the paper, the norm \|\cdot\| refers to the \ell_2 norm for vectors and the spectral norm for matrices.

Next, we consider a generic class of compressors from [Karimireddy et al.(2019)Karimireddy, Rebjock, Stich, and Jaggi]:

Definition 1.1 (δ-Approximate Compressor).

An operator Q(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is called a δ-approximate compressor on a set \mathcal{S}\subseteq\mathbb{R}^{d} if, for all x\in\mathcal{S}, \|Q(x)-x\|^{2}\leq(1-\delta)\|x\|^{2}, where δ ∈ (0,1] is the compression factor.

Furthermore, a randomized operator Q(\cdot) is a δ-approximate compressor on a set \mathcal{S}\subseteq\mathbb{R}^{d} if the above holds in expectation. In this paper, for clarity of exposition, we consider the deterministic form of the compressor (as in Definition 1.1); however, the results can easily be extended to randomized Q(\cdot). Notice that δ = 1 implies Q(x) = x (no compression).
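As a concrete example (our own illustration, not taken from the paper), top-k sparsification is a standard deterministic δ-approximate compressor with δ = k/d. A minimal Python/NumPy sketch:

import numpy as np

def topk_compressor(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest.
    For any x in R^d this satisfies ||Q(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e., it is a delta-approximate compressor with delta = k/d."""
    q = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    q[idx] = x[idx]
    return q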

2 Related Work

Saddle Point avoidance algorithms:

In recent years, a handful of first order algorithms [Lee et al.(2016)Lee, Simchowitz, Jordan, and Recht, Lee et al.(2017)Lee, Panageas, Piliouras, Simchowitz, Jordan, and Recht, Du et al.(2017)Du, Jin, Lee, Jordan, Poczos, and Singh] have focused on escaping saddle points and converging to local minima. The key algorithmic ingredient is to run a gradient based algorithm and add perturbation to the iterates when the gradient is small. ByzantinePGD [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett], PGD [Jin et al.(2017)Jin, Ge, Netrapalli, Kakade, and Jordan], Neon+GD [Xu et al.(2017)Xu, Jin, and Yang], and Neon2+GD [Allen-Zhu and Li(2017)] are examples of such algorithms. The work of Nesterov and Polyak [Nesterov and Polyak(2006)] first proposed the cubic-regularized second order Newton method and provided an analysis of the second order stationarity condition. An algorithm called Adaptive Regularization with Cubics (ARC) was developed in [Cartis et al.(2011a)Cartis, Gould, and Toint, Cartis et al.(2011b)Cartis, Gould, and Toint], where the cubic-regularized Newton method with access to an inexact Hessian was studied. Cubic regularization with both the gradient and the Hessian being inexact was studied in [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan]. In [Kohler and Lucchi(2017)], a cubic-regularized Newton method with sub-sampled Hessian and gradient was proposed and analyzed. A momentum based cubic-regularized algorithm was studied in [Wang et al.(2020)Wang, Zhou, Liang, and Lan]. Variance reduced cubic-regularized algorithms were proposed in [Zhou et al.(2018)Zhou, Xu, and Gu, Wang et al.(2019)Wang, Zhou, Liang, and Lan]. In terms of solving the cubic sub-problem, [Carmon and Duchi(2016)] proposes a gradient based algorithm and [Agarwal et al.(2017)Agarwal, Allen-Zhu, Bullins, Hazan, and Ma] provides a Hessian-vector product technique.

Compression:

Byzantine resilience:

The effect of adversaries on the convergence of non-convex optimization was studied in [Damaskinos et al.(2019)Damaskinos, El Mhamdi, Guerraoui, Guirguis, and Rouault, Mhamdi et al.(2018)Mhamdi, Guerraoui, and Rouault]. In the distributed learning context, [Feng et al.(2014)Feng, Xu, and Mannor] proposes one-shot median based robust learning. A median-of-means based algorithm was proposed in [Chen et al.(2017)Chen, Su, and Xu], where the worker machines are grouped in batches and Byzantine resilience is achieved by computing the median of the grouped machines. Later, [Yin et al.(2018)Yin, Chen, Kannan, and Bartlett] proposed coordinate-wise median, trimmed mean, and iterative filtering based approaches. Communication-efficient and Byzantine robust algorithms were developed in [Bernstein et al.(2018)Bernstein, Zhao, Azizzadenesheli, and Anandkumar, Ghosh et al.(2020a)Ghosh, Maity, Kadhe, Mazumdar, and Ramchandran]. A norm based thresholding approach for Byzantine resilience of the distributed Newton algorithm was also developed in [Ghosh et al.(2020b)Ghosh, Maity, and Mazumdar]. All these works provide only first order convergence guarantees (small gradient). The work of [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] is the only one that provides a second order guarantee (Hessian positive semi-definite) under Byzantine attacks.

Algorithm 1 Byzantine Robust Distributed Cubic Regularized Newton Algorithm
1:  Input: Step size η_k, parameters 0 ≤ α ≤ β, γ > 0, M > 0, and a δ-approximate compressor Q.
2:  Initialize: Initial iterate \mathbf{x}_{0}\in\mathbb{R}^{d}.
3:  for k = 0, 1, \ldots, T-1 do
4:     Center machine: broadcast \mathbf{x}_{k}.
5:     for i\in[m] do in parallel
6:        i-th worker machine:
          Non-Byzantine: compute the local gradient \mathbf{g}_{i,k} and Hessian \mathbf{H}_{i,k}; locally solve the problem in equation (1); apply the compressor Q and send Q(\mathbf{s}_{i,k+1}) to the center.
          Byzantine: generate an arbitrary vector \star and send it to the center machine.
7:     end for
8:     Center machine: (i) sort the worker machines in non-decreasing order of the norms of the received updates \{Q(\mathbf{s}_{i,k+1})\}_{i=1}^{m}; (ii) let \mathcal{U}_{k} be the indices of the first (1-β) fraction of machines; (iii) update the parameter: \mathbf{x}_{k+1}=\mathbf{x}_{k}+\eta_{k}\frac{1}{|\mathcal{U}_{k}|}\sum_{i\in\mathcal{U}_{k}}Q(\mathbf{s}_{i,k+1}).
9:  end for

3 Compression, Byzantine Resilience and Distributed Cubic Regularized Newton

In this section, we describe a communication efficient and Byzantine robust distributed cubic-regularized Newton algorithm that avoids saddle points by meeting the second order stationarity condition, and thus converges to a local minimum of the non-convex loss function. Starting from the initialization \mathbf{x}_{0}, the center machine broadcasts the parameter to the worker machines. At the k-th iteration, the i-th worker machine solves a cubic-regularized auxiliary loss function based on its local data:

\mathbf{s}_{i,k+1}=\arg\min_{\mathbf{s}}\mathbf{g}_{i,k}^{T}\mathbf{s}+\frac{\gamma}{2}\mathbf{s}^{T}\mathbf{H}_{i,k}\mathbf{s}+\frac{M}{6}\gamma^{2}\|\mathbf{s}\|^{3}, (1)

where M > 0 and γ > 0 are parameters chosen suitably, and \mathbf{g}_{i,k},\mathbf{H}_{i,k} are the gradient and Hessian of the local loss function f_i computed on the data S_i stored in the i-th worker machine:

\mathbf{g}_{i,k}=\nabla f_{i}(\mathbf{x}_{k})=\frac{1}{|S_{i}|}\sum_{z_{i}\in S_{i}}\nabla f_{i}(\mathbf{x}_{k},z_{i}),\qquad\mathbf{H}_{i,k}=\nabla^{2}f_{i}(\mathbf{x}_{k})=\frac{1}{|S_{i}|}\sum_{z_{i}\in S_{i}}\nabla^{2}f_{i}(\mathbf{x}_{k},z_{i}).

After solving the problem described in (1), each worker machine applies the compression operator Q, as defined in Definition 1.1, to the update \mathbf{s}_{i,k+1}. Compressing the update reduces the communication cost.

Moreover, we also consider that an α (< 1/2) fraction of the worker machines are Byzantine in nature. We denote the set of Byzantine worker machines by \mathcal{B} and the set of the remaining good machines by \mathcal{M}. In each iteration, the good machines send the compressed solution of the cubic sub-problem described in equation (1), while the Byzantine machines can send arbitrary values or intentionally disrupt the learning algorithm with malicious updates. Moreover, in non-convex optimization problems, one of the more delicate and important issues is to avoid saddle points, which can yield highly sub-optimal solutions. Byzantine worker machines can collude to create fake local minima and drive the algorithm into a sub-optimal region. Without a robust safeguard against such intentional or unintentional attacks, the learning procedure can be affected catastrophically, since the algorithm can get stuck at such a sub-optimal point. To tackle such Byzantine worker machines, we employ a simple procedure called norm based thresholding.

After receiving all the updates from the worker machines, the center machine computes a set \mathcal{U} consisting of the indices of the worker machines with the smallest norms. We choose the size of the set \mathcal{U} to be (1-β)m. Hence, we 'trim' a β fraction of the worker machines, so that worker machines with large-norm updates cannot participate and derail the learning process. We denote the set of trimmed machines by \mathcal{T}. We choose β > α so that at least one of the good machines gets trimmed. In this way, the norm of every update in the set \mathcal{U} is bounded by the largest norm among the good machines.

Notice that the δ-approximate compressor is used to reduce the communication cost, and the norm based thresholding is used to mitigate the effect of the Byzantine machines. The center machine then updates the parameter with step size η_k as \mathbf{x}_{k+1}=\mathbf{x}_{k}+\eta_{k}\frac{1}{|\mathcal{U}_{k}|}\sum_{i\in\mathcal{U}_{k}}Q(\mathbf{s}_{i,k+1}).
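For concreteness, the center's aggregation step (norm based thresholding followed by averaging and a descent step) could look like the following minimal Python/NumPy sketch; this is our own illustration of step 8 of Algorithm 1, not the authors' implementation.

import numpy as np

def center_update(x_k, updates, beta, eta):
    """updates: list of (possibly compressed) worker updates Q(s_{i,k+1}).
    Keep the (1 - beta) fraction with smallest norm and average them."""
    m = len(updates)
    norms = np.array([np.linalg.norm(u) for u in updates])
    keep = int(np.ceil((1 - beta) * m))
    untrimmed = np.argsort(norms)[:keep]          # indices of smallest-norm updates
    avg = np.mean(np.stack([updates[i] for i in untrimmed]), axis=0)
    return x_k + eta * avg                        # x_{k+1}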

Remark 3.1.

Note that we introduce the parameter γ in the cubic-regularized sub-problem. The parameter γ controls the effect of the second and third order terms in the sub-problem. The choice of γ plays an important role in our analysis in handling the non-linear updates from different worker machines. Such non-linearity vanishes if we choose γ = 0.

3.1 Theoretical Guarantees

We have the following standard assumptions:

Assumption 1

The non-convex loss function f(·) is twice continuously differentiable and bounded below, i.e., f^{*}=\inf_{\mathbf{x}\in\mathbb{R}^{d}}f(\mathbf{x})>-\infty.

Assumption 2

The loss f(·) is L-Lipschitz continuous (|f(\mathbf{x})-f(\mathbf{y})|\leq L\|\mathbf{x}-\mathbf{y}\| for all \mathbf{x},\mathbf{y}), has L_1-Lipschitz gradients (\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|\leq L_{1}\|\mathbf{x}-\mathbf{y}\|), and has an L_2-Lipschitz Hessian (\|\nabla^{2}f(\mathbf{x})-\nabla^{2}f(\mathbf{y})\|\leq L_{2}\|\mathbf{x}-\mathbf{y}\|).

The above assumptions state that the loss, its gradient, and its Hessian do not change drastically in a local neighborhood. These assumptions are standard in the analysis of saddle point escape for cubic regularization (see [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan, Kohler and Lucchi(2017), Nesterov and Polyak(2006), Carmon and Duchi(2016)]).

We assume the data distribution across workers to be non-iid. However, we assume that the local gradients and Hessians computed at the worker machines (using local data) satisfy the following gradient and Hessian dissimilarity conditions. Note that these conditions apply to non-Byzantine machines only; Byzantine machines do not adhere to any assumptions.

Assumption 3

(Gradient dissimilarity) For ε_g > 0, we have \|\nabla f(\mathbf{x}_{k})-\mathbf{g}_{i,k}\|\leq\epsilon_{g} for all k, i.

Assumption 4

(Hessian dissimilarity) For ε_H > 0, we have \|\nabla^{2}f(\mathbf{x}_{k})-\mathbf{H}_{i,k}\|\leq\epsilon_{H} for all k, i.

Similar assumptions have featured in the previous literature. For example, in [Karimireddy et al.(2020)Karimireddy, Kale, Mohri, Reddi, Stich, and Suresh, Fallah et al.(2020)Fallah, Mokhtari, and Ozdaglar], the authors use similar dissimilarity assumptions, which are prevalent in the Federated Learning setup, to capture the non-iid (heterogeneous) nature of the data among users.

Remark 3.2.

[Values of ε_g and ε_H in special cases] Compared to Assumptions 3 and 4, gradient and Hessian bounds have been studied under more relaxed conditions. In [Kohler and Lucchi(2017), Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan, Wang et al.(2020)Wang, Zhou, Liang, and Lan], the authors consider gradients and Hessians with sub-sampled data drawn uniformly at random from the dataset. Since the data is drawn in an iid manner, both parameters (ε_g, ε_H) diminish at a rate proportional to 1/\sqrt{|S|}, where |S| is the size of the data sample in each worker machine. In [Ghosh et al.(2020b)Ghosh, Maity, and Mazumdar], the authors analyze the deviation in the case of data partitioning, where each worker node samples data uniformly without replacement from a given dataset.

Theorem 3.3.

Suppose 0 ≤ α ≤ β ≤ 1/2 and Assumptions 1-4 hold, and we choose M=\mathcal{O}(m(1+\sqrt{1-\delta})^{3}) and η = γ = c/T. Then, after T iterations of Algorithm 1, the generated sequence \{\mathbf{x}_{i}\}_{i=1}^{T} contains a point \tilde{x} such that

\|\nabla f(\tilde{x})\|\leq\frac{\chi_{1}}{T^{2/3}}+\epsilon_{g}+\mathcal{O}(1/T),\qquad\lambda_{\min}\left(\nabla^{2}f(\tilde{x})\right)\geq-\frac{\chi_{2}}{T^{1/3}}-\epsilon_{H}-\mathcal{O}(1/T), (2)

where
\chi_{1}=\mathcal{O}\Big(\frac{(1-\alpha)(1+\sqrt{1-\delta})^{2}}{2(1-\beta)}+m(1+\sqrt{1-\delta})^{3}\Big),\qquad
\chi_{2}=\mathcal{O}\Big(m(1+\sqrt{1-\delta})^{3}+\frac{(1+\sqrt{1-\delta})(1-\alpha)}{(1-\beta)}\Big).
Remark 3.4.

Both the gradient bound and the minimum-eigenvalue bound of the Hessian in Theorem 3.3 have two parts. The first parts decrease with the number of iterations T: the gradient and the minimum eigenvalue of the Hessian decay at rates O(1/T^{2/3}) and O(1/T^{1/3}), respectively. Both of these rates match those of the centralized version of the cubic-regularized Newton method. The second parts of the gradient bound and the minimum-eigenvalue bound contain terms with factors ε_g and ε_H. As mentioned above (see Remark 3.2), in special cases both ε_g and ε_H decrease at the rate 1/\sqrt{|S|}, where |S| is the number of data points in each worker machine.

Remark 3.5.

[Comparison with [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]] In a recent work, [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] provides a perturbed gradient based algorithm to escape saddle points in non-convex optimization in the presence of Byzantine worker machines. In that paper, Byzantine resilience is achieved using techniques such as trimmed mean, median, and collaborative filtering. These methods require additional assumptions (e.g., the coordinates of the gradient being sub-exponential) for the purpose of analysis. In this work, we do not require such assumptions; instead, we perform a simple norm based thresholding that provides robustness. Also, perturbed gradient descent (PGD) requires multiple rounds of communication between the center machine and the worker machines whenever the norm of the gradient is small, since a small gradient indicates either a local minimum or a saddle point. In contrast, our method does not require any additional communication for escaping saddle points; it gains this ability by virtue of the cubic regularization.

Remark 3.6.

Since our algorithm is second order in nature, it requires fewer iterations than first order gradient based algorithms. Our algorithm achieves a superior rate of O(1/T^{2/3}) compared to the O(1/\sqrt{T}) rate of gradient based approaches. Our algorithm dominates ByzantinePGD [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] in terms of convergence rate, communication rounds, and simplicity.

We now state two corollaries. First, we relax the compression by choosing δ = 1; the worker machines then communicate the actual update instead of the compressed one.

Corollary 3.7 (No compression).

Suppose 0 ≤ α ≤ β ≤ 1/2 and Assumptions 1-4 hold, and we choose M=\mathcal{O}(m) and η = γ = c/T. Then, after T iterations of Algorithm 1 with uncompressed updates (δ = 1), the generated sequence \{\mathbf{x}_{i}\}_{i=1}^{T} contains a point \tilde{x} such that

\|\nabla f(\tilde{x})\|\leq\frac{\chi_{1}}{T^{2/3}}+\epsilon_{g}+\mathcal{O}(1/T),\qquad\lambda_{\min}\left(\nabla^{2}f(\tilde{x})\right)\geq-\frac{\chi_{2}}{T^{1/3}}-\epsilon_{H}-\mathcal{O}(1/T), (3)

where \chi_{1}=\chi_{2}=\mathcal{O}\big(\frac{(1-\alpha)}{(1-\beta)}+m\big).

Next, we consider the non-Byzantine setup with α = β = 0, in addition to uncompressed updates. This is simply the distributed variant of the cubic-regularized Newton method.

Corollary 3.8 (Non Byzantine and no compression).

Suppose Assumptions 1-4 hold, and we choose M=\mathcal{O}(m) and η = γ = c/T. Then, after T iterations of Algorithm 1 with uncompressed updates (δ = 1), the generated sequence \{\mathbf{x}_{i}\}_{i=1}^{T} contains a point \tilde{x} such that

\|\nabla f(\tilde{x})\|\leq\frac{\chi_{1}}{T^{2/3}}+\epsilon_{g}+\mathcal{O}(1/T),\qquad\lambda_{\min}\left(\nabla^{2}f(\tilde{x})\right)\geq-\frac{\chi_{2}}{T^{1/3}}-\epsilon_{H}-\mathcal{O}(1/T), (4)

where \chi_{1}=\chi_{2}=\mathcal{O}(m).

Remark 3.9 (Two rounds of communication, ε_g = 0).

We can improve the bound in Corollary 3.8 by computing the actual gradient, which requires one more round of communication in each iteration. In the first round, all worker machines compute the gradient on their stored data and send it to the center machine. The center machine averages them and broadcasts the global gradient \nabla f(\mathbf{x}_{k})=\frac{1}{m}\sum_{i=1}^{m}\mathbf{g}_{i,k} at iteration k. In this manner, the worker machines solve the sub-problem (1) with the actual gradient. This improves the gradient bound while the communication remains O(d) per iteration.
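A minimal sketch of this extra communication round (our own illustration; the function name is hypothetical): workers send their local gradients, and the center broadcasts their average so that the sub-problem (1) is solved with the exact global gradient.

import numpy as np

def global_gradient_round(local_gradients):
    """Average the local gradients g_{i,k} received from the m workers and
    return grad f(x_k) = (1/m) * sum_i g_{i,k}, to be broadcast back."""
    return np.mean(np.stack(local_gradients), axis=0)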

3.2 Solution of the cubic sub-problem

The cubic-regularized sub-problem (1) needs to be solved in order to update the parameter. As this particular problem does not have a closed form solution, a solver is usually employed that yields a satisfactory solution. In previous works, different types of solvers have been used. [Cartis et al.(2011a)Cartis, Gould, and Toint, Cartis et al.(2011b)Cartis, Gould, and Toint] solve the sub-problem using a Lanczos based method in a Krylov subspace. In [Agarwal et al.(2017)Agarwal, Allen-Zhu, Bullins, Hazan, and Ma], the authors propose a solver based on Hessian-vector products and binary search. Gradient descent based solvers are proposed in [Carmon and Duchi(2016), Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan].

Previous works [Wang et al.(2020)Wang, Zhou, Liang, and Lan, Zhou et al.(2018)Zhou, Xu, and Gu, Wang et al.(2019)Wang, Zhou, Liang, and Lan] consider the exact solution of the cubic sub-problem for theoretical analysis. Recently, inexact solutions to the sub-problem have also been proposed in the centralized (non-distributed) framework. For instance, [Kohler and Lucchi(2017)] analyzes the cubic model with a sub-sampled Hessian using the approximate model minimization technique developed in [Cartis et al.(2011a)Cartis, Gould, and Toint]. Moreover, [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan] shows an improved analysis with gradient based minimization, a variant of which is studied in [Carmon and Duchi(2016)]. Both exact and inexact solutions to the sub-problem yield similar theoretical guarantees.

In our framework, each worker machine is tasked with solving the sub-problem. For the purpose of the theoretical convergence analysis, we assume that the worker machines obtain the exact solution in each round. However, in the experiments (Section 4), we apply the gradient based solver of [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan] to solve the sub-problem. Here, we let each worker machine run the gradient based solver for 10 iterations and send the update to the center machine in each iteration.
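For reference, a bare-bones gradient based solver for the sub-problem (1) might look like the following minimal sketch (our own illustration in the spirit of the cited solvers; the step size, initialization, and iteration count are assumptions, not the tuned choices of the cited works).

import numpy as np

def solve_cubic_subproblem(g, H, gamma, M, n_iter=10, lr=0.01):
    """Plain gradient descent on the local cubic model
    m(s) = g^T s + (gamma/2) s^T H s + (M gamma^2 / 6) ||s||^3."""
    s = np.zeros_like(g)
    for _ in range(n_iter):
        # Gradient of the cubic model; note d/ds of ||s||^3 is 3 ||s|| s.
        grad_m = g + gamma * (H @ s) + 0.5 * M * gamma**2 * np.linalg.norm(s) * s
        s = s - lr * grad_m
    return s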

4 Experimental Results

First, we show with a toy example that our algorithm indeed escapes saddle points. We choose a 2-dimensional example: \min_{w\in\mathbb{R}^{2}}[f_{1}(w)+f_{2}(w)] with f_{1}(w)=w_{1}^{2}-w_{2}^{2} and f_{2}(w)=2w_{1}^{2}-2w_{2}^{2}, where w_1 and w_2 denote the two coordinates of w. This problem is the sum of two non-convex functions and has a saddle point at (0,0). In Figure 1 (left-most panel), we observe that our algorithm escapes the saddle point (0,0) with random initialization.
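The strict-saddle structure of this toy problem can be verified directly (our own check, not part of the original experiments): the aggregate objective is f_1(w) + f_2(w) = 3w_1^2 - 3w_2^2, whose constant Hessian has eigenvalues of both signs.

import numpy as np

# Toy objective f(w) = f_1(w) + f_2(w) = 3*w1^2 - 3*w2^2
f = lambda w: 3 * w[0]**2 - 3 * w[1]**2
grad = lambda w: np.array([6 * w[0], -6 * w[1]])
hess = np.diag([6.0, -6.0])              # constant Hessian

w0 = np.zeros(2)
print(np.linalg.norm(grad(w0)))          # 0.0 -> first order stationary at (0, 0)
print(np.linalg.eigvalsh(hess))          # [-6., 6.] -> strict saddle at (0, 0)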

Next, we validate our algorithm on benchmark LIBSVM ([Chang and Lin(2011)]) datasets for both convex and non-convex problems. We choose the following loss functions: (a) logistic loss: \min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i}\log\left(1+\exp(-y_{i}\mathbf{x}_{i}^{T}\mathbf{w})\right)+\frac{\lambda}{2n}\|\mathbf{w}\|^{2}, and (b) non-convex robust linear regression: \min_{\mathbf{w}\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i}\log\left(\frac{(y_{i}-\mathbf{w}^{T}\mathbf{x}_{i})^{2}}{2}+1\right), where \mathbf{w}\in\mathbb{R}^{d} is the parameter, \{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} are the feature vectors, and \{y_{i}\}_{i=1}^{n}\in\{0,1\} are the corresponding labels. We choose the 'a9a' (d = 123, n ≈ 32K; we split the data 70/30 for training/testing) and 'w8a' (training data d = 300, n ≈ 50K; testing data d = 300, n ≈ 15K) classification datasets and partition the data across 20 worker machines.
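For the non-convex robust linear regression objective above, the loss and its gradient can be written as follows (a minimal NumPy sketch under the assumption that X is the n-by-d feature matrix and y the label vector; it is not the authors' experiment code).

import numpy as np

def robust_regression_loss(w, X, y):
    """(1/n) * sum_i log((y_i - w^T x_i)^2 / 2 + 1)."""
    r = y - X @ w
    return np.mean(np.log(r**2 / 2 + 1))

def robust_regression_grad(w, X, y):
    """Gradient of the loss above: (1/n) * sum_i [-r_i / (r_i^2/2 + 1)] * x_i."""
    r = y - X @ w
    return X.T @ (-r / (r**2 / 2 + 1)) / len(y)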

First, we demonstrate Algorithm 1 in the presence of Byzantine machines with compressed updates. For compression, each worker applies the compression operator of QSGD [Alistarh et al.(2017a)Alistarh, Grubic, Li, Tomioka, and Vojnovic]: for a given vector \mathbf{x}\in\mathbb{R}^{d}, [Q(\mathbf{x})]_{i}=\|\mathbf{x}\|_{2}\,\text{sign}(\mathbf{x}_{i})\times\text{Ber}(|\mathbf{x}_{i}|/\|\mathbf{x}\|_{2}) for all i\in[d]. In this work, we consider the following four Byzantine attacks: (1) 'Gaussian noise attack', where the Byzantine worker machines add Gaussian noise to the update; (2) 'Random label attack', where the Byzantine worker machines train on random labels instead of the proper labels; (3) 'Flipped label attack', where (for binary classification) the Byzantine worker machines flip the labels of the data and learn based on the wrong labels; and (4) 'Negative update attack', where the Byzantine workers compute the update \mathbf{s} (here, the solution of the sub-problem in Eq. (1)) and communicate -c\mathbf{s} with c ∈ (0,1), making the direction of the update opposite to the actual one. In Figure 2, we plot the function value of the robust linear regression problem for the 'flipped label' and 'negative update' attacks with compressed updates for both the 'w8a' and 'a9a' datasets. We choose the parameters λ = 1, M = 10, learning rate η_k = 1, α ∈ {0.1, 0.15, 0.2}, and β = α + 2/m, where the number of worker machines is m = 20. The results for the Gaussian noise and random label attacks are shown in the Appendix.
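A minimal sketch of this randomized QSGD operator (our own illustration of the formula stated above, not the experiment code):

import numpy as np

def qsgd_compress(x, rng=None):
    """Single-level QSGD: [Q(x)]_i = ||x||_2 * sign(x_i) * Ber(|x_i| / ||x||_2)."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0:
        return np.zeros_like(x)
    keep = rng.random(x.shape) < np.abs(x) / norm   # Bernoulli(|x_i| / ||x||_2)
    return norm * np.sign(x) * keep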

Figure 1: (a) Plot of the function value with different initializations, showing that the algorithm escapes the saddle point (function value 0). (b)-(d) Comparison of our algorithm with ByzantinePGD [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] in terms of the total number of iterations.
Figure 2: Training loss on 'a9a' (first row) and 'w8a' (second row) with 10%, 15%, 20% Byzantine worker machines. (a,c): Flipped label attack. (b,d): Negative update attack.

Comparison with [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]:

We compare the uncompressed version of our algorithm (δ = 1) with ByzantinePGD of [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett]. We take the total number of iterations as the comparison metric. One outer iteration of Algorithm 1 corresponds to one round of communication between the center and the worker machines (and hence one parameter update). Note that in our algorithm the worker machines run 10 steps of the gradient based solver (see [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan]) for the local sub-problem per iteration. So, the total number of iterations is given by 10 times the number of outer iterations. For both algorithms, we use the \ell_2 norm of the gradient as the stopping criterion. For ByzantinePGD, we choose R = 10, r = 5, Q = 10, T_{th} = 10 and the 'coordinate-wise trimmed mean'. In Figure 1, we plot the total number of iterations for all four types of attacks with different fractions of Byzantine machines. It is evident from the plot that our method requires fewer total iterations (at least 48.4%, 29%, and 25% fewer for 10%, 15%, and 20% Byzantine machines, respectively). We provide the results for the non-Byzantine case in the Appendix.

References

  • [Agarwal et al.(2017)Agarwal, Allen-Zhu, Bullins, Hazan, and Ma] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199, 2017.
  • [Alistarh et al.(2017a)Alistarh, Grubic, Li, Tomioka, and Vojnovic] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017a.
  • [Alistarh et al.(2017b)Alistarh, Grubic, Liu, Tomioka, and Vojnovic] Dan Alistarh, Demjan Grubic, Jerry Liu, Ryota Tomioka, and Milan Vojnovic. Communication-efficient stochastic gradient descent, with applications to neural networks. 2017b.
  • [Alistarh et al.(2018a)Alistarh, Allen-Zhu, and Li] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 31, pages 4613–4623, 2018a.
  • [Alistarh et al.(2018b)Alistarh, Hoefler, Johansson, Konstantinov, Khirirat, and Renggli] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018b.
  • [Allen-Zhu and Li(2017)] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
  • [Bernstein et al.(2018)Bernstein, Zhao, Azizzadenesheli, and Anandkumar] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd with majority vote is communication efficient and byzantine fault tolerant. arXiv preprint arXiv:1810.05291, 2018.
  • [Bhojanapalli et al.(2016)Bhojanapalli, Neyshabur, and Srebro] Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery, 2016.
  • [Blanchard et al.(2017)Blanchard, Mhamdi, Guerraoui, and Stainer] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757, 2017.
  • [Carmon and Duchi(2016)] Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547, 2016.
  • [Cartis et al.(2011a)Cartis, Gould, and Toint] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011a.
  • [Cartis et al.(2011b)Cartis, Gould, and Toint] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130(2):295–319, 2011b.
  • [Chang and Lin(2011)] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):27, 2011.
  • [Chen et al.(2017)Chen, Su, and Xu] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
  • [Choromanska et al.(2015)Choromanska, Henaff, Mathieu, Arous, and LeCun] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks, 2015.
  • [Damaskinos et al.(2019)Damaskinos, El Mhamdi, Guerraoui, Guirguis, and Rouault] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Hany Abdelmessih Guirguis, and Sébastien Louis Alexandre Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. page 19, 2019. URL http://infoscience.epfl.ch/record/265684. Published in the Conference on Systems and Machine Learning (SysML) 2019, Stanford, CA, USA.
  • [Dauphin et al.(2014)Dauphin, Pascanu, Gulcehre, Cho, Ganguli, and Bengio] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, volume 27, pages 2933–2941, 2014.
  • [Du et al.(2017)Du, Jin, Lee, Jordan, Poczos, and Singh] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412, 2017.
  • [Fallah et al.(2020)Fallah, Mokhtari, and Ozdaglar] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
  • [Feng et al.(2014)Feng, Xu, and Mannor] Jiashi Feng, Huan Xu, and Shie Mannor. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.
  • [Gandikota et al.(2019)Gandikota, Maity, and Mazumdar] Venkata Gandikota, Raj Kumar Maity, and Arya Mazumdar. vqsgd: Vector quantized stochastic gradient descent. arXiv preprint arXiv:1911.07971, 2019.
  • [Ge et al.(2017)Ge, Jin, and Zheng] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1233–1242, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [Ghosh et al.(2020a)Ghosh, Maity, Kadhe, Mazumdar, and Ramchandran] Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramchandran. Communication-efficient and byzantine-robust distributed learning. In 2020 Information Theory and Applications Workshop (ITA), pages 1–28. IEEE, 2020a.
  • [Ghosh et al.(2020b)Ghosh, Maity, and Mazumdar] Avishek Ghosh, Raj Kumar Maity, and Arya Mazumdar. Distributed newton can communicate less and resist byzantine workers. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020b.
  • [Ghosh et al.(2020c)Ghosh, Maity, Mazumdar, and Ramchandran] Avishek Ghosh, Raj Kumar Maity, Arya Mazumdar, and Kannan Ramchandran. Communication efficient distributed approximate newton method. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2539–2544. IEEE, 2020c.
  • [Ivkin et al.(2019)Ivkin, Rothchild, Ullah, Braverman, Stoica, and Arora] Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, and Raman Arora. Communication-efficient distributed sgd with sketching. arXiv preprint arXiv:1903.04488, 2019.
  • [Jain et al.(2017)Jain, Jin, Kakade, and Netrapalli] Prateek Jain, Chi Jin, Sham M. Kakade, and Praneeth Netrapalli. Global convergence of non-convex gradient descent for computing matrix squareroot, 2017.
  • [Jin et al.(2017)Jin, Ge, Netrapalli, Kakade, and Jordan] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
  • [Karimireddy et al.(2019)Karimireddy, Rebjock, Stich, and Jaggi] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR, 2019.
  • [Karimireddy et al.(2020)Karimireddy, Kale, Mohri, Reddi, Stich, and Suresh] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
  • [Kawaguchi(2016)] Kenji Kawaguchi. Deep learning without poor local minima. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 586–594. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/f2fc990265c712c49d51a18a32b39f0c-Paper.pdf.
  • [Kohler and Lucchi(2017)] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pages 1895–1904. PMLR, 2017.
  • [Konečnỳ et al.(2016)Konečnỳ, McMahan, Ramage, and Richtárik] Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • [Lamport et al.(1982)Lamport, Shostak, and Pease] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, July 1982. ISSN 0164-0925. 10.1145/357172.357176. URL http://doi.acm.org/10.1145/357172.357176.
  • [Lee et al.(2016)Lee, Simchowitz, Jordan, and Recht] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.
  • [Lee et al.(2017)Lee, Panageas, Piliouras, Simchowitz, Jordan, and Recht] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
  • [Mhamdi et al.(2018)Mhamdi, Guerraoui, and Rouault] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in byzantium. arXiv preprint arXiv:1802.07927, 2018.
  • [Nesterov and Polyak(2006)] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • [Soudry and Carmon(2016)] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks, 2016.
  • [Sun et al.(2016)Sun, Qu, and Wright] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. CoRR, abs/1602.06664, 2016. URL http://arxiv.org/abs/1602.06664.
  • [Sun et al.(2017)Sun, Qu, and Wright] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, Feb 2017. ISSN 1557-9654. 10.1109/tit.2016.2632162. URL http://dx.doi.org/10.1109/TIT.2016.2632162.
  • [Tripuraneni et al.(2018)Tripuraneni, Stern, Jin, Regier, and Jordan] N Tripuraneni, M Stern, C Jin, J Regier, and MI Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.
  • [Wang et al.(2018)Wang, Sievert, Liu, Charles, Papailiopoulos, and Wright] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pages 9850–9861, 2018.
  • [Wang et al.(2019)Wang, Zhou, Liang, and Lan] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Stochastic variance-reduced cubic regularization for nonconvex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2731–2740. PMLR, 2019.
  • [Wang et al.(2020)Wang, Zhou, Liang, and Lan] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Cubic regularization with momentum for nonconvex optimization. In Uncertainty in Artificial Intelligence, pages 313–322. PMLR, 2020.
  • [Wen et al.(2017)Wen, Xu, Yan, Wu, Wang, Chen, and Li] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
  • [Xu et al.(2017)Xu, Jin, and Yang] Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
  • [Yin et al.(2018)Yin, Chen, Kannan, and Bartlett] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
  • [Yin et al.(2019)Yin, Chen, Kannan, and Bartlett] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Defending against saddle point attack in byzantine-robust distributed learning. In International Conference on Machine Learning, pages 7074–7084. PMLR, 2019.
  • [Zhou et al.(2018)Zhou, Xu, and Gu] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized newton methods. In International Conference on Machine Learning, pages 5990–5999. PMLR, 2018.

5 Appendix

In this part, we establish some useful facts and lemmas. Next, we provide the analysis of Theorem 3.3.

5.1 Some useful facts

For the purpose of analysis we use the following sets of inequalities.

Fact 1.

For vectors a_{1},\ldots,a_{n}, the following inequalities hold:

\left\|\sum_{i=1}^{n}a_{i}\right\|^{3}\leq\left(\sum_{i=1}^{n}\|a_{i}\|\right)^{3}\leq n^{2}\sum_{i=1}^{n}\|a_{i}\|^{3} (5)
\left\|\sum_{i=1}^{n}a_{i}\right\|^{2}\leq\left(\sum_{i=1}^{n}\|a_{i}\|\right)^{2}\leq n\sum_{i=1}^{n}\|a_{i}\|^{2} (6)
Fact 2.

For a_{1},\ldots,a_{n}>0 and r<s,

\left(\frac{1}{n}\sum_{i=1}^{n}a_{i}^{r}\right)^{1/r}\leq\left(\frac{1}{n}\sum_{i=1}^{n}a_{i}^{s}\right)^{1/s} (7)
Lemma 5.1 ([Nesterov and Polyak(2006)]).

Under Assumption 2, i.e., when the Hessian of the function is L_2-Lipschitz continuous, for any \mathbf{x},\mathbf{y}\in\mathbb{R}^{d} we have

\|\nabla f(\mathbf{y})-\nabla f(\mathbf{x})-\nabla^{2}f(\mathbf{x})(\mathbf{y}-\mathbf{x})\|\leq\frac{L_{2}}{2}\|\mathbf{y}-\mathbf{x}\|^{2} (8)
\left|f(\mathbf{y})-f(\mathbf{x})-\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})-\frac{1}{2}(\mathbf{y}-\mathbf{x})^{T}\nabla^{2}f(\mathbf{x})(\mathbf{y}-\mathbf{x})\right|\leq\frac{L_{2}}{6}\|\mathbf{y}-\mathbf{x}\|^{3} (9)

Next, we establish the following Lemma that provides some nice properties of the cubic sub-problem.

Lemma 5.2.

Let M>0, γ>0, \mathbf{g}\in\mathbb{R}^{d}, \mathbf{H}\in\mathbb{R}^{d\times d}, and

\mathbf{s}=\arg\min_{\mathbf{x}}\mathbf{g}^{T}\mathbf{x}+\frac{\gamma}{2}\mathbf{x}^{T}\mathbf{H}\mathbf{x}+\frac{M\gamma^{2}}{6}\|\mathbf{x}\|^{3}. (10)

Then the following hold:

\mathbf{g}+\gamma\mathbf{H}\mathbf{s}+\frac{M\gamma^{2}}{2}\|\mathbf{s}\|\mathbf{s}=0, (11)
\mathbf{H}+\frac{M\gamma}{2}\|\mathbf{s}\|\mathbf{I}\succeq\mathbf{0}, (12)
\mathbf{g}^{T}\mathbf{s}+\frac{\gamma}{2}\mathbf{s}^{T}\mathbf{H}\mathbf{s}\leq-\frac{M}{4}\gamma^{2}\|\mathbf{s}\|^{3}. (13)
Proof 5.3.

Equations (11) and (12) are the first- and second-order optimality conditions of the cubic sub-problem (10). We prove (13) using (11) and (12):

\displaystyle\mathbf{g}^{T}\mathbf{s}+\frac{\gamma}{2}\mathbf{s}^{T}\mathbf{H}\mathbf{s}=-\left(\gamma\mathbf{H}\mathbf{s}+\frac{M}{2}\gamma^{2}\|\mathbf{s}\|\mathbf{s}\right)^{T}\mathbf{s}+\frac{\gamma}{2}\mathbf{s}^{T}\mathbf{H}\mathbf{s} (14)
\displaystyle=-\gamma\mathbf{s}^{T}\mathbf{H}\mathbf{s}-\frac{M}{2}\gamma^{2}\|\mathbf{s}\|^{3}+\frac{\gamma}{2}\mathbf{s}^{T}\mathbf{H}\mathbf{s}
\displaystyle\leq\frac{M}{4}\gamma^{2}\|\mathbf{s}\|^{3}-\frac{M}{2}\gamma^{2}\|\mathbf{s}\|^{3} (15)
\displaystyle=-\frac{M}{4}\gamma^{2}\|\mathbf{s}\|^{3}.

In (14), we substitute the expression for \mathbf{g} from (11). In (15), we use the fact that \mathbf{s}^{T}\mathbf{H}\mathbf{s}+\frac{M\gamma}{2}\|\mathbf{s}\|^{3}\geq 0, which follows from (12).
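For intuition, the following Python sketch solves an instance of the sub-problem (10) by plain gradient descent on the cubic model and numerically checks the three properties (11)-(13). The problem size, step size, and iteration count are ad-hoc assumptions, and gradient descent may in principle stop at a local minimizer (for which (12) need not hold exactly), so this is an illustration rather than the solver used by the algorithm.

import numpy as np

rng = np.random.default_rng(1)
d, M, gamma = 4, 10.0, 0.5
g = rng.normal(size=d)
A = rng.normal(size=(d, d))
H = (A + A.T) / 2                        # symmetric (possibly indefinite) Hessian surrogate

def cubic_grad(x):
    # gradient of  g^T x + (gamma/2) x^T H x + (M gamma^2 / 6) ||x||^3
    return g + gamma * (H @ x) + (M * gamma ** 2 / 2) * np.linalg.norm(x) * x

s = np.zeros(d)
for _ in range(200000):                  # plain gradient descent on the cubic model (10)
    s = s - 1e-3 * cubic_grad(s)

# (11): first-order stationarity, should be ~ 0
print(np.linalg.norm(cubic_grad(s)))
# (12): H + (M gamma / 2) ||s|| I should be PSD (smallest eigenvalue >= 0, up to numerics)
print(np.linalg.eigvalsh(H + (M * gamma / 2) * np.linalg.norm(s) * np.eye(d)).min())
# (13): g^T s + (gamma/2) s^T H s  should be  <=  -(M/4) gamma^2 ||s||^3
print(g @ s + 0.5 * gamma * s @ H @ s, -(M / 4) * gamma ** 2 * np.linalg.norm(s) ** 3)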

5.2 Proof of Theorem  3.3

First, we state the results of Lemma 5.2 for each worker machine at iteration k:

𝐠i,k+γ𝐇i,k𝐬i,k+1+M2γ2𝐬i,k+1𝐬i,k+1\displaystyle\mathbf{g}_{i,k}+\gamma\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\frac{M}{2}\gamma^{2}\|\mathbf{s}_{i,k+1}\|\mathbf{s}_{i,k+1} =0\displaystyle=0 (16)
γ𝐇i,k+M2γ2𝐬i,k+1𝐈\displaystyle\gamma\mathbf{H}_{i,k}+\frac{M}{2}\gamma^{2}\|\mathbf{s}_{i,k+1}\|\mathbf{I} 𝟎\displaystyle\succeq\mathbf{0} (17)
𝐠i,kT𝐬i,k+1+γ2𝐬i,k+1T𝐇i,k𝐬i,k+1\displaystyle\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\gamma}{2}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1} M4γ2𝐬i,k+13\displaystyle\leq-\frac{M}{4}\gamma^{2}\|\mathbf{s}_{i,k+1}\|^{3} (18)

We also use the following facts about the setup and the trimming rule:

|𝒰|\displaystyle|\mathcal{U}| =|𝒰|+|𝒰|\displaystyle=|\mathcal{U}\cap\mathcal{M}|+|\mathcal{U}\cap\mathcal{B}| (19)
||\displaystyle|\mathcal{M}| =|𝒰|+|𝒯|\displaystyle=|\mathcal{U}\cap\mathcal{M}|+|\mathcal{T}\cap\mathcal{M}| (20)

Combining both the equations  (19) and  (20), we have

|𝒰|\displaystyle|\mathcal{U}| =|||𝒯|+|𝒰|\displaystyle=|\mathcal{M}|-|\mathcal{T}\cap\mathcal{M}|+|\mathcal{U}\cap\mathcal{B}| (21)

Next, we state the following facts, which follow from the norm-based trimming rule and the bound |\mathcal{U}\cap\mathcal{B}|\leq\alpha m (here \|\cdot\| denotes the norm of the corresponding update from machine i); an illustrative sketch of the trimming step follows (22).

i𝒰.\displaystyle\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|.\| αmmaxi.\displaystyle\leq\alpha m\max_{i\in\mathcal{M}}\|.\|
i𝒯.\displaystyle\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|.\| i.\displaystyle\leq\sum_{i\in\mathcal{M}}\|.\|
Combining both we get
i𝒰.+i𝒯.\displaystyle\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|.\|+\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|.\| i.+αmmaxi.\displaystyle\leq\sum_{i\in\mathcal{M}}\|.\|+\alpha m\max_{i\in\mathcal{M}}\|.\| (22)
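To make the sets above concrete, here is a small Python sketch of the norm-based trimming step as we read it: the center keeps the (1-\beta)m received updates with the smallest norms (the set \mathcal{U}) and discards the rest (the set \mathcal{T}). The helper name and the toy data are illustrative assumptions.

import numpy as np

def untrimmed_set(updates, beta):
    """Keep the ceil((1-beta)*m) workers whose update norms are smallest; they form U."""
    m = len(updates)
    keep = int(np.ceil((1 - beta) * m))
    ranked = sorted(updates, key=lambda i: np.linalg.norm(updates[i]))
    return set(ranked[:keep])

# toy example: 10 workers, the last two are Byzantine and send huge updates
rng = np.random.default_rng(2)
updates = {i: rng.normal(size=5) for i in range(10)}
updates[8] *= 100.0
updates[9] *= 100.0
U = untrimmed_set(updates, beta=0.3)
print(sorted(U))                         # the large-norm Byzantine updates are trimmed away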

For the rest of the calculation, we use the following notation

Γ=maxi,k𝐬i,k.\displaystyle\Gamma=\max_{i\in\mathcal{M},k}\|\mathbf{s}_{i,k}\|.

If the optimization sub-problem domain is bounded, Γ\Gamma can be upper-bounded by the diameter of the parameter space. Note that in the definition of Γ\Gamma, the maximum is taken over good machines only.

From the definition of the δ\delta-approximate compressor in Definition 1.1, we use the following simple fact

Q(𝐱)(1+1δ)𝐱\displaystyle\|Q(\mathbf{x})\|\leq(1+\sqrt{1-\delta})\|\mathbf{x}\| (23)
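As one concrete example of a \delta-approximate compressor, the sketch below uses top-k sparsification, which satisfies the defining property \|Q(\mathbf{x})-\mathbf{x}\|^{2}\leq(1-\delta)\|\mathbf{x}\|^{2} with \delta=k/d, and numerically checks the consequence (23). The compressor actually used in the experiments is not restated here, so treat this as an illustrative instance.

import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x; a delta-approximate compressor with delta = k/d."""
    q = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    q[idx] = x[idx]
    return q

rng = np.random.default_rng(3)
d, k = 100, 20
delta = k / d
x = rng.normal(size=d)
q = top_k(x, k)

# defining property: ||Q(x) - x||^2 <= (1 - delta) ||x||^2
assert np.linalg.norm(q - x) ** 2 <= (1 - delta) * np.linalg.norm(x) ** 2
# consequence (23): ||Q(x)|| <= (1 + sqrt(1 - delta)) ||x||
assert np.linalg.norm(q) <= (1 + np.sqrt(1 - delta)) * np.linalg.norm(x)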

At any iteration kk, we have

f(𝐱k+1)f(𝐱k)\displaystyle f(\mathbf{x}_{k+1})-f(\mathbf{x}_{k})
\displaystyle\leq f(𝐱k)T(𝐱k+1𝐱k)+12(𝐱k+1𝐱k)T2f(𝐱k)(𝐱k+1𝐱k)+L26𝐱k+1𝐱k3\displaystyle\nabla f(\mathbf{x}_{k})^{T}(\mathbf{x}_{k+1}-\mathbf{x}_{k})+\frac{1}{2}(\mathbf{x}_{k+1}-\mathbf{x}_{k})^{T}\nabla^{2}f(\mathbf{x}_{k})(\mathbf{x}_{k+1}-\mathbf{x}_{k})+\frac{L_{2}}{6}\left\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\|^{3}
=\displaystyle= ηk|𝒰|f(𝐱k)Ti𝒰Q(𝐬i,k+1)Term1+ηk22|𝒰|2(i𝒰Q(𝐬i,k+1))T2f(𝐱k)(i𝒰Q(𝐬i,k+1))Term2\displaystyle\underbrace{\frac{\eta_{k}}{|\mathcal{U}|}\nabla f(\mathbf{x}_{k})^{T}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})}_{Term1}+\underbrace{\frac{\eta_{k}^{2}}{2|\mathcal{U}|^{2}}\left(\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right)^{T}\nabla^{2}f(\mathbf{x}_{k})\left(\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right)}_{Term2}
+L26ηk|𝒰|i𝒰Q(𝐬i,k+1)3Term3\displaystyle+\underbrace{\frac{L_{2}}{6}\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right\|^{3}}_{Term3} (24)

First, we consider Term 1 in (24) and expand it using (21):

ηk|𝒰|f(𝐱k)Ti𝒰Q(𝐬i,k+1)\displaystyle\frac{\eta_{k}}{|\mathcal{U}|}\nabla f(\mathbf{x}_{k})^{T}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})
=\displaystyle= ηk(1β)mf(𝐱k)T[iQ(𝐬i,k+1)i𝒯Q(𝐬i,k+1)+i𝒰Q(𝐬i,k+1)]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\nabla f(\mathbf{x}_{k})^{T}\left[\sum_{i\in\mathcal{M}}Q(\mathbf{s}_{i,k+1})-\sum_{i\in\mathcal{M}\cap\mathcal{T}}Q(\mathbf{s}_{i,k+1})+\sum_{i\in\mathcal{U}\cap\mathcal{B}}Q(\mathbf{s}_{i,k+1})\right]
=\displaystyle= ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1β)mi[f(𝐱k)TQ(𝐬i,k+1)𝐠i,kT𝐬i,k+1]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[\nabla f(\mathbf{x}_{k})^{T}Q(\mathbf{s}_{i,k+1})-\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}\right]
+ηk(1β)mf(𝐱k)T[i𝒯Q(𝐬i,k+1)+i𝒰Q(𝐬i,k+1)]\displaystyle+\frac{\eta_{k}}{(1-\beta)m}\nabla f(\mathbf{x}_{k})^{T}\left[-\sum_{i\in\mathcal{M}\cap\mathcal{T}}Q(\mathbf{s}_{i,k+1})+\sum_{i\in\mathcal{U}\cap\mathcal{B}}Q(\mathbf{s}_{i,k+1})\right]
\displaystyle\leq ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1β)mi[f(𝐱k)TQ(𝐬i,k+1)f(𝐱k)T𝐬i,k+1+f(𝐱k)T𝐬i,k+1𝐠i,kT𝐬i,k+1]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[\nabla f(\mathbf{x}_{k})^{T}Q(\mathbf{s}_{i,k+1})-\nabla f(\mathbf{x}_{k})^{T}\mathbf{s}_{i,k+1}+\nabla f(\mathbf{x}_{k})^{T}\mathbf{s}_{i,k+1}-\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}\right]
+ηk(1β)m[i𝒯f(𝐱k)Q(𝐬i,k+1)+i𝒰f(𝐱k)Q(𝐬i,k+1)]\displaystyle+\frac{\eta_{k}}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|\nabla f(\mathbf{x}_{k})\|\|Q(\mathbf{s}_{i,k+1})\|+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|\nabla f(\mathbf{x}_{k})\|\|Q(\mathbf{s}_{i,k+1})\|\right]
\displaystyle\leq ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1β)mi[f(𝐱k)Q(𝐬i,k+1)𝐬i,k+1+f(𝐱k)𝐠i,k𝐬i,k+1]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[\|\nabla f(\mathbf{x}_{k})\|\|Q(\mathbf{s}_{i,k+1})-\mathbf{s}_{i,k+1}\|+\|\nabla f(\mathbf{x}_{k})-\mathbf{g}_{i,k}\|\|\mathbf{s}_{i,k+1}\|\right]
+ηkL(1β)m[i𝒯Q(𝐬i,k+1)+i𝒰Q(𝐬i,k+1)]\displaystyle+\frac{\eta_{k}L}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|Q(\mathbf{s}_{i,k+1})\|+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|Q(\mathbf{s}_{i,k+1})\|\right]
\displaystyle\leq ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1β)mi[L1δ𝐬i,k+1+ϵg𝐬i,k+1]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[L\sqrt{1-\delta}\|\mathbf{s}_{i,k+1}\|+\epsilon_{g}\|\mathbf{s}_{i,k+1}\|\right]
+ηkL(1β)m[i𝒯Q(𝐬i,k+1)+i𝒰Q(𝐬i,k+1)]\displaystyle+\frac{\eta_{k}L}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|Q(\mathbf{s}_{i,k+1})\|+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|Q(\mathbf{s}_{i,k+1})\|\right]

In the above calculation, we use the boundedness of the gradient, \|\nabla f(\mathbf{x}_{k})\|\leq L, the gradient dissimilarity bound \|\nabla f(\mathbf{x}_{k})-\mathbf{g}_{i,k}\|\leq\epsilon_{g}, and the compressor property \|Q(\mathbf{s}_{i,k+1})-\mathbf{s}_{i,k+1}\|\leq\sqrt{1-\delta}\|\mathbf{s}_{i,k+1}\|.

Using (22), we get

Term1\displaystyle Term1
\displaystyle\leq ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1β)mi[L1δ𝐬i,k+1+ϵg𝐬i,k+1]\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[L\sqrt{1-\delta}\|\mathbf{s}_{i,k+1}\|+\epsilon_{g}\|\mathbf{s}_{i,k+1}\|\right]
+ηkL(1+1δ)(1β)m[iQ(𝐬i,k+1)+αmΓ]\displaystyle+\frac{\eta_{k}L(1+\sqrt{1-\delta})}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|Q(\mathbf{s}_{i,k+1})\|+\alpha m\Gamma\right]
\displaystyle\leq ηk(1β)mi𝐠i,kT𝐬i,k+1+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
=\displaystyle= ηk(1β)mi[𝐠i,kT𝐬i,k+1+γ2𝐬i,k+1T𝐇i,k𝐬i,k+1]ηk(1β)miγ2𝐬i,k+1T𝐇i,k𝐬i,k+1\displaystyle\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\left[\mathbf{g}_{i,k}^{T}\mathbf{s}_{i,k+1}+\frac{\gamma}{2}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right]-\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\frac{\gamma}{2}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}
+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
\displaystyle\leq γ2Mηk4(1β)mi𝐬i,k+13ηk(1β)miγ2𝐬i,k+1T𝐇i,k𝐬i,k+1\displaystyle-\frac{\gamma^{2}M\eta_{k}}{4(1-\beta)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{3}-\frac{\eta_{k}}{(1-\beta)m}\sum_{i\in\mathcal{M}}\frac{\gamma}{2}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}
+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma (25)

In (25), we use the bound stated in (18). Next, we consider Term 3 in (24):

L26ηk|𝒰|i𝒰Q(𝐬i,k+1)3\displaystyle\frac{L_{2}}{6}\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right\|^{3}
\displaystyle\leq L2ηk36|𝒰|i𝒰Q(𝐬i,k+1)3\displaystyle\frac{L_{2}\eta_{k}^{3}}{6|\mathcal{U}|}\sum_{i\in\mathcal{U}}\left\|Q(\mathbf{s}_{i,k+1})\right\|^{3}
\displaystyle\leq L2ηk36|𝒰|[iQ(𝐬i,k+1)3i𝒯Q(𝐬i,k+1)3+i𝒰Q(𝐬i,k+1)3]\displaystyle\frac{L_{2}\eta_{k}^{3}}{6|\mathcal{U}|}\left[\sum_{i\in\mathcal{M}}\|Q(\mathbf{s}_{i,k+1})\|^{3}-\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|Q(\mathbf{s}_{i,k+1})\|^{3}+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|Q(\mathbf{s}_{i,k+1})\|^{3}\right]
\displaystyle\leq L2ηk36|𝒰|[iQ(𝐬i,k+1)3+i𝒰Q(𝐬i,k+1)3]\displaystyle\frac{L_{2}\eta_{k}^{3}}{6|\mathcal{U}|}\left[\sum_{i\in\mathcal{M}}\|Q(\mathbf{s}_{i,k+1})\|^{3}+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|Q(\mathbf{s}_{i,k+1})\|^{3}\right]

In the first inequality, we use (5). Next, we expand the untrimmed set \mathcal{U} using (21).

Using (22) and the definition of the \delta-approximate compressor, we get

Term3\displaystyle Term3\leq L2ηk36(1β)m[i(1+1δ)3𝐬i,k+13+(1+1δ)3(αm)Γ3]\displaystyle\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)m}\left[\sum_{i\in\mathcal{M}}(1+\sqrt{1-\delta})^{3}\|\mathbf{s}_{i,k+1}\|^{3}+(1+\sqrt{1-\delta})^{3}(\alpha m)\Gamma^{3}\right] (26)

Now we consider the Term 2 in  (24)

ηk22|𝒰|2(i𝒰Q(𝐬i,k+1))T2f(𝐱k)(i𝒰Q(𝐬i,k+1))\displaystyle\frac{\eta_{k}^{2}}{2|\mathcal{U}|^{2}}\left(\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right)^{T}\nabla^{2}f(\mathbf{x}_{k})\left(\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right)
=ηk22(1β)2m2(i𝒰Q(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)Term4+ij𝒰Q(𝐬i,k+1)T2f(𝐱k)Q(𝐬j,k+1)Term5)\displaystyle=\frac{\eta_{k}^{2}}{2(1-\beta)^{2}m^{2}}\left(\underbrace{\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})}_{Term4}+\underbrace{\sum_{i\neq j\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{j,k+1})}_{Term5}\right) (27)

Now we consider Term 4 in (27) and expand it using  (21)

i𝒰Q(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)\displaystyle\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})
=\displaystyle= iQ(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)i𝒯Q(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)\displaystyle\sum_{i\in\mathcal{M}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\sum_{i\in\mathcal{M}\cap\mathcal{T}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})
+i𝒰Q(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)\displaystyle+\sum_{i\in\mathcal{B}\cap\mathcal{U}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})
\displaystyle\leq i𝐬i,k+1T𝐇i,k𝐬i,k+1i𝐬i,k+1T𝐇i,k𝐬i,k+1+iQ(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)\displaystyle\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}-\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\sum_{i\in\mathcal{M}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})
+i𝒯L1Q(𝐬i,k+1)2+i𝒰L1Q(𝐬i,k+1)2\displaystyle\qquad+\sum_{i\in\mathcal{M}\cap\mathcal{T}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}+\sum_{i\in\mathcal{B}\cap\mathcal{U}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}
\displaystyle\leq\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}(\nabla^{2}f(\mathbf{x}_{k})-\mathbf{H}_{i,k})\mathbf{s}_{i,k+1}-\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\nabla^{2}f(\mathbf{x}_{k})\mathbf{s}_{i,k+1}
+iQ(𝐬i,k+1)T2f(𝐱k)Q(𝐬i,k+1)+i𝒯L1Q(𝐬i,k+1)2+i𝒰L1Q(𝐬i,k+1)2\displaystyle\qquad+\sum_{i\in\mathcal{M}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})+\sum_{i\in\mathcal{M}\cap\mathcal{T}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}+\sum_{i\in\mathcal{B}\cap\mathcal{U}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}
\displaystyle\leq i𝐬i,k+1T𝐇i,k𝐬i,k+1+i(ϵH+L1)𝐬i,k+12\displaystyle\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\sum_{i\in\mathcal{M}}(\epsilon_{H}+L_{1})\|\mathbf{s}_{i,k+1}\|^{2}
+iL1Q(𝐬i,k+1)2+iL1Q(𝐬i,k+1)2+L1αm(1+1δ)2Γ2\displaystyle+\sum_{i\in\mathcal{M}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}+\sum_{i\in\mathcal{M}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|^{2}+L_{1}\alpha m(1+\sqrt{1-\delta})^{2}\Gamma^{2}
\displaystyle\leq i𝐬i,k+1T𝐇i,k𝐬i,k+1+i(ϵH+L1)𝐬i,k+12+iL12(1+1δ)2𝐬i,k+12+L1αm(1+1δ)2Γ2\displaystyle\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\sum_{i\in\mathcal{M}}(\epsilon_{H}+L_{1})\|\mathbf{s}_{i,k+1}\|^{2}+\sum_{i\in\mathcal{M}}L_{1}2(1+\sqrt{1-\delta})^{2}\|\mathbf{s}_{i,k+1}\|^{2}+L_{1}\alpha m(1+\sqrt{1-\delta})^{2}\Gamma^{2}
=i𝐬i,k+1)T𝐇i,k𝐬i,k+1+i[ϵH+L1(1+2(1+1δ)2)]𝐬i,k+12+L1αm(1+1δ)2Γ2\displaystyle=\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1})^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\sum_{i\in\mathcal{M}}[\epsilon_{H}+L_{1}(1+2(1+\sqrt{1-\delta})^{2})]\|\mathbf{s}_{i,k+1}\|^{2}+L_{1}\alpha m(1+\sqrt{1-\delta})^{2}\Gamma^{2} (28)

Now we consider Term 5, defined in (27):

\displaystyle\sum_{i\neq j\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})^{T}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{j,k+1})
\displaystyle\leq\sum_{i\neq j\in\mathcal{U}}L_{1}\|Q(\mathbf{s}_{i,k+1})\|\|Q(\mathbf{s}_{j,k+1})\|
\displaystyle\leq\sum_{i\neq j\in\mathcal{U}}L_{1}(1+\sqrt{1-\delta})^{2}\|\mathbf{s}_{i,k+1}\|\|\mathbf{s}_{j,k+1}\|
\displaystyle=L_{1}(1+\sqrt{1-\delta})^{2}\left[\Big(\sum_{i\in\mathcal{U}}\|\mathbf{s}_{i,k+1}\|\Big)^{2}-\sum_{i\in\mathcal{U}}\|\mathbf{s}_{i,k+1}\|^{2}\right]
\displaystyle\leq L1(1+1δ)2[|𝒰|i𝒰𝐬i,k+12i𝒰𝐬i,k+12]\displaystyle L_{1}(1+\sqrt{1-\delta})^{2}\left[|\mathcal{U}|\sum_{i\in\mathcal{U}}\|\mathbf{s}_{i,k+1}\|^{2}-\sum_{i\in\mathcal{U}}\|\mathbf{s}_{i,k+1}\|^{2}\right]
=\displaystyle= L1(1+1δ)2((1β)m1)[i𝐬i,k+12i𝒯𝐬i,k+12+i𝒰𝐬i,k+12]\displaystyle L_{1}(1+\sqrt{1-\delta})^{2}((1-\beta)m-1)\left[\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}-\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|\mathbf{s}_{i,k+1}\|^{2}+\sum_{i\in\mathcal{B}\cap\mathcal{U}}\|\mathbf{s}_{i,k+1}\|^{2}\right]
\displaystyle\leq L1(1+1δ)2((1β)m1)[i𝐬i,k+12+αmΓ2]\displaystyle L_{1}(1+\sqrt{1-\delta})^{2}((1-\beta)m-1)\left[\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}+\alpha m\Gamma^{2}\right] (29)

Now, combining the bounds (28) and (29) in (27), we get

Term2\displaystyle Term2
\displaystyle\leq ηk22(1β)2m2i𝐬i,k+1)T𝐇i,k𝐬i,k+1+ηk2(1α)2(1β)2m[ϵH+L1(1+(1β)m+(1+1δ)2)]Γ2\displaystyle\frac{\eta_{k}^{2}}{2(1-\beta)^{2}m^{2}}\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1})^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}+\frac{\eta_{k}^{2}(1-\alpha)}{2(1-\beta)^{2}m}\left[\epsilon_{H}+L_{1}(1+(1-\beta)m+(1+\sqrt{1-\delta})^{2})\right]\Gamma^{2}
+ηk22(1β)2m2L1(1+1δ)2α(1β)m2Γ2\displaystyle+\frac{\eta_{k}^{2}}{2(1-\beta)^{2}m^{2}}L_{1}(1+\sqrt{1-\delta})^{2}\alpha(1-\beta)m^{2}\Gamma^{2} (30)

Now we combine the upper bounds on Term 1, Term 2, and Term 3:

f(𝐱k+1)f(𝐱k)\displaystyle f(\mathbf{x}_{k+1})-f(\mathbf{x}_{k})
\displaystyle\leq [γ2Mηk4(1β)m+L2ηk36(1β)m(1+1δ)3]i𝐬i,k+13ηk2(1β)m[γηk(1β)m]i𝐬i,k+1T𝐇i,k𝐬i,k+1\displaystyle[-\frac{\gamma^{2}M\eta_{k}}{4(1-\beta)m}+\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)m}(1+\sqrt{1-\delta})^{3}]\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{3}-\frac{\eta_{k}}{2(1-\beta)m}[\gamma-\frac{\eta_{k}}{(1-\beta)m}]\sum_{i\in\mathcal{M}}\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}
+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
+ηk2(1α)2(1β)2m[ϵH+L1(1+(1β)m+(1+1δ)2)]Γ2\displaystyle+\frac{\eta_{k}^{2}(1-\alpha)}{2(1-\beta)^{2}m}\left[\epsilon_{H}+L_{1}(1+(1-\beta)m+(1+\sqrt{1-\delta})^{2})\right]\Gamma^{2}
+ηk22(1β)2L1(1+1δ)2α(1β)Γ2+L2ηk36(1β)(1+1δ)3αΓ3\displaystyle+\frac{\eta_{k}^{2}}{2(1-\beta)^{2}}L_{1}(1+\sqrt{1-\delta})^{2}\alpha(1-\beta)\Gamma^{2}+\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)}(1+\sqrt{1-\delta})^{3}\alpha\Gamma^{3}

Here we also assume that \gamma\geq\frac{\eta_{k}}{(1-\beta)m} and use the fact that -\mathbf{s}_{i,k+1}^{T}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\leq\frac{M\gamma}{2}\|\mathbf{s}_{i,k+1}\|^{3}, which follows from (17).

Now we have,

f(𝐱k+1)f(𝐱k)\displaystyle f(\mathbf{x}_{k+1})-f(\mathbf{x}_{k})
\displaystyle\leq\left[-\frac{\gamma^{2}M}{4(1-\beta)\eta_{k}^{2}m}+\frac{L_{2}}{6(1-\beta)m}(1+\sqrt{1-\delta})^{3}\right]\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}+\frac{\eta_{k}}{2(1-\beta)m}\left[\gamma-\frac{\eta_{k}}{(1-\beta)m}\right]\sum_{i\in\mathcal{M}}\frac{M\gamma}{2}\|\mathbf{s}_{i,k+1}\|^{3}
+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
+ηk2(1α)2(1β)2m[ϵH+L1(1+(1β)m+(1+1δ)2)]Γ2\displaystyle+\frac{\eta_{k}^{2}(1-\alpha)}{2(1-\beta)^{2}m}\left[\epsilon_{H}+L_{1}(1+(1-\beta)m+(1+\sqrt{1-\delta})^{2})\right]\Gamma^{2}
+ηk22(1β)2L1(1+1δ)2α(1β)Γ2+L2ηk36(1β)(1+1δ)3αΓ3\displaystyle+\frac{\eta_{k}^{2}}{2(1-\beta)^{2}}L_{1}(1+\sqrt{1-\delta})^{2}\alpha(1-\beta)\Gamma^{2}+\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)}(1+\sqrt{1-\delta})^{3}\alpha\Gamma^{3}
\displaystyle\leq [γM4(1β)ηkm2+L26(1β)m(1+1δ)3]iηk𝐬i,k+13\displaystyle[-\frac{\gamma M}{4(1-\beta)\eta_{k}m^{2}}+\frac{L_{2}}{6(1-\beta)m}(1+\sqrt{1-\delta})^{3}]\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}
+ηk(1α)(1β)(L1δ+ϵg)Γ+ηkL(1β)(1+1δ)Γ\displaystyle+\frac{\eta_{k}(1-\alpha)}{(1-\beta)}(L\sqrt{1-\delta}+\epsilon_{g})\Gamma+\frac{\eta_{k}L}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
+ηk2(1α)2(1β)2m[ϵH+L1(1+(1β)m+(1+1δ)2)]Γ2\displaystyle+\frac{\eta_{k}^{2}(1-\alpha)}{2(1-\beta)^{2}m}\left[\epsilon_{H}+L_{1}(1+(1-\beta)m+(1+\sqrt{1-\delta})^{2})\right]\Gamma^{2}
+ηk22(1β)2L1(1+1δ)2α(1β)Γ2+L2ηk36(1β)(1+1δ)3αΓ3\displaystyle+\frac{\eta_{k}^{2}}{2(1-\beta)^{2}}L_{1}(1+\sqrt{1-\delta})^{2}\alpha(1-\beta)\Gamma^{2}+\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)}(1+\sqrt{1-\delta})^{3}\alpha\Gamma^{3}
=\displaystyle= λcomp1(1α)miηk𝐬i,k+13+λΓ\displaystyle-\lambda_{comp}\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}+\lambda_{\Gamma}

where

λcomp=[γM4(1β)ηkm2L26(1β)m(1+1δ)3](1α)m\displaystyle\lambda_{comp}=[\frac{\gamma M}{4(1-\beta)\eta_{k}m^{2}}-\frac{L_{2}}{6(1-\beta)m}(1+\sqrt{1-\delta})^{3}](1-\alpha)m

and

\displaystyle\lambda_{\Gamma}=\frac{\eta_{k}}{(1-\beta)}\left[(1-\alpha)(L\sqrt{1-\delta}+\epsilon_{g})+L(1+\sqrt{1-\delta})\right]\Gamma+\frac{L_{2}\eta_{k}^{3}}{6(1-\beta)}(1+\sqrt{1-\delta})^{3}\alpha\Gamma^{3}
+ηk2(1α)2(1β)2m[ϵH+L1(1+(1β)m+(1+1δ)2)]Γ2+ηk22(1β)2L1(1+1δ)2α(1β)Γ2\displaystyle+\frac{\eta_{k}^{2}(1-\alpha)}{2(1-\beta)^{2}m}\left[\epsilon_{H}+L_{1}(1+(1-\beta)m+(1+\sqrt{1-\delta})^{2})\right]\Gamma^{2}+\frac{\eta_{k}^{2}}{2(1-\beta)^{2}}L_{1}(1+\sqrt{1-\delta})^{2}\alpha(1-\beta)\Gamma^{2}

To ensure λcomp>0\lambda_{comp}>0, we need

M>4ηk2mγ2L26(1+1δ)3\displaystyle M>\frac{4\eta_{k}^{2}m}{\gamma^{2}}\frac{L_{2}}{6}(1+\sqrt{1-\delta})^{3} (31)

For the choice \eta_{k}=\frac{c}{T} and \gamma=\frac{c_{1}}{T}, for some constants c,c_{1}>0, it suffices to take M=\mathcal{O}(m); see the worked step below. With this choice, we have \lambda_{comp}=\mathcal{O}(1) and \lambda_{\Gamma}=\mathcal{O}(\frac{1}{T})+\text{minor terms}.
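To make the scaling explicit, substituting \eta_{k}=c/T and \gamma=c_{1}/T into (31) gives the following short worked step (treating c, c_{1}, L_{2}, \delta, \alpha, \beta as absolute constants; no new assumption is introduced):

M>\frac{4\eta_{k}^{2}m}{\gamma^{2}}\cdot\frac{L_{2}}{6}(1+\sqrt{1-\delta})^{3}=\frac{4c^{2}}{c_{1}^{2}}\cdot\frac{L_{2}}{6}(1+\sqrt{1-\delta})^{3}\,m=\mathcal{O}(m),

and with M=\Theta(m) and \gamma/\eta_{k}=c_{1}/c=\Theta(1),

\lambda_{comp}=\left[\frac{\gamma M}{4(1-\beta)\eta_{k}m^{2}}-\frac{L_{2}}{6(1-\beta)m}(1+\sqrt{1-\delta})^{3}\right](1-\alpha)m=\mathcal{O}(1),

while every term of \lambda_{\Gamma} carries at least one factor of \eta_{k}=c/T, so \lambda_{\Gamma}=\mathcal{O}(1/T).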

Now we have

1(1α)miηk𝐬i,k+131λcomp[f(𝐱k)f(𝐱k+1)+λΓ]\displaystyle\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\leq\frac{1}{\lambda_{comp}}\left[f(\mathbf{x}_{k})-f(\mathbf{x}_{k+1})+\lambda_{\Gamma}\right]

Now we consider the step k_{0}, where k_{0}=\arg\min_{0\leq k\leq T-1}\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|=\arg\min_{0\leq k\leq T-1}\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right\|.

min0kT𝐱k+1𝐱k3\displaystyle\min_{0\leq k\leq T}\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|^{3} =min0kTηkQ(sk+1)3\displaystyle=\min_{0\leq k\leq T}\|\eta_{k}\ Q(s_{k+1})\|^{3}
1(1β)mi𝒰ηk0Q(𝐬i,k0+1)3\displaystyle\leq\frac{1}{(1-\beta)m}\sum_{i\in\mathcal{U}}\|\eta_{k_{0}}Q(\mathbf{s}_{i,k_{0}+1})\|^{3}
(1+1δ)3(1β)mi𝒰ηk0𝐬i,k0+13\displaystyle\leq\frac{(1+\sqrt{1-\delta})^{3}}{(1-\beta)m}\sum_{i\in\mathcal{U}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}
=(1+1δ)3(1β)m[iηk0𝐬i,k0+13i𝒯ηk0𝐬i,k0+13+i𝒰ηk0𝐬i,k0+13]\displaystyle=\frac{(1+\sqrt{1-\delta})^{3}}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}-\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}\right]
(1+1δ)3(1β)m[iηk0𝐬i,k0+13+αmηk03Γ3]\displaystyle\leq\frac{(1+\sqrt{1-\delta})^{3}}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}+\alpha m\eta^{3}_{k_{0}}\Gamma^{3}\right]
1Tk=0T1(1+1δ)3(1α)(1β)[1(1α)miηk𝐬i,k+13+α1αηk03Γ3]\displaystyle\leq\frac{1}{T}\sum_{k=0}^{T-1}(1+\sqrt{1-\delta})^{3}\frac{(1-\alpha)}{(1-\beta)}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}+\frac{\alpha}{1-\alpha}\eta^{3}_{k_{0}}\Gamma^{3}\right]
1Tk=0T1(1+1δ)3(1α)(1β)[f(𝐱k)f(𝐱k+1)λcomp+λΓλcomp+α1αηk03Γ3λcomp]\displaystyle\leq\frac{1}{T}\sum_{k=0}^{T-1}(1+\sqrt{1-\delta})^{3}\frac{(1-\alpha)}{(1-\beta)}\left[\frac{f(\mathbf{x}_{k})-f(\mathbf{x}_{k+1})}{\lambda_{comp}}+\frac{\lambda_{\Gamma}}{\lambda_{comp}}+\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}}\right]
1T(1+1δ)3(1α)(1β)[f(𝐱0)fλcomp+k=0T1λΓλcomp+k=0T1α1αηk03Γ3λcomp]\displaystyle\leq\frac{1}{T}(1+\sqrt{1-\delta})^{3}\frac{(1-\alpha)}{(1-\beta)}\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}}\right]

With the above choice of \eta_{k} and \gamma, the terms \sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}} and \sum_{k=0}^{T-1}\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}} are upper bounded by a constant.

We have

1(1β)m[iηk0𝐬i,k0+13+αmηk03Γ3]\displaystyle\frac{1}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}+\alpha m\eta^{3}_{k_{0}}\Gamma^{3}\right] 1T(1α)(1β)[f(𝐱0)fλcomp+k=0T1λΓλcomp+k=0T1α1αηk03Γ3λcomp]\displaystyle\leq\frac{1}{T}\frac{(1-\alpha)}{(1-\beta)}\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}}\right]
1(1α)m[iηk0𝐬i,k0+13+αmηk03Γ3]\displaystyle\Rightarrow\frac{1}{(1-\alpha)m}\left[\sum_{i\in\mathcal{M}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3}+\alpha m\eta^{3}_{k_{0}}\Gamma^{3}\right] 1T[f(𝐱0)fλcomp+k=0T1λΓλcomp+k=0T1α1αηk03Γ3λcomp]\displaystyle\leq\frac{1}{T}\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}}\right]
1(1α)miηk0𝐬i,k0+13\displaystyle\Rightarrow\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k_{0}}\mathbf{s}_{i,k_{0}+1}\|^{3} 1T[f(𝐱0)fλcomp+k=0T1λΓλcomp]=ψcompT\displaystyle\leq\frac{1}{T}\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}\right]=\frac{\psi_{comp}}{T}

where ψcomp=[f(𝐱0)fλcomp+k=0T1λΓλcomp]\psi_{comp}=\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}\right]

Consequently,

𝐱k0+1𝐱k0ψcomp,1T1/3\displaystyle\|\mathbf{x}_{k_{0}+1}-\mathbf{x}_{k_{0}}\|\leq\frac{\psi_{comp,1}}{T^{1/3}}

where ψcomp,1=(1+1δ)(1α)1/3(1β)1/3[f(𝐱0)fλcomp+k=0T1λΓλcomp+k=0T1α1αηk03Γ3λcomp]1/3\psi_{comp,1}=(1+\sqrt{1-\delta})\frac{(1-\alpha)^{1/3}}{(1-\beta)^{1/3}}\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\alpha}{1-\alpha}\frac{\eta^{3}_{k_{0}}\Gamma^{3}}{\lambda_{comp}}\right]^{1/3}

So both \psi_{comp} and \psi_{comp,1} are of order \mathcal{O}(1).

Next, we bound the gradient norm at the iterate:

f(𝐱k+1)\displaystyle\left\|\nabla f(\mathbf{x}_{k+1})\right\|
\displaystyle=\left\|\nabla f(\mathbf{x}_{k+1})-\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{g}_{i,k}-\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\gamma\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}-\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\frac{M\gamma^{2}}{2}\|\mathbf{s}_{i,k+1}\|\mathbf{s}_{i,k+1}\right\|
f(𝐱k+1)f(𝐱k)2f(𝐱k)(xk+1xk)+1||i(𝐠i,kf(𝐱k))\displaystyle\leq\left\|\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k})-\nabla^{2}f(\mathbf{x}_{k})(x_{k+1}-x_{k})\right\|+\left\|\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}(\mathbf{g}_{i,k}-\nabla f(\mathbf{x}_{k}))\right\|
+2f(𝐱k)(xk+1xk)γ1||i𝐇i,k𝐬i,k+1+1||iMγ22𝐬i,k+1𝐬i,k+1\displaystyle\qquad+\left\|\nabla^{2}f(\mathbf{x}_{k})(x_{k+1}-x_{k})-\gamma\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\|+\left\|\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\frac{M\gamma^{2}}{2}\|\mathbf{s}_{i,k+1}\|\mathbf{s}_{i,k+1}\right\|
L2ηk221|𝒰|i𝒰Q(𝐬i,k+1)2+ϵg+Mγ221||i𝐬i,k+12\displaystyle\leq\frac{L_{2}\eta_{k}^{2}}{2}\left\|\frac{1}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right\|^{2}+\epsilon_{g}+\frac{M\gamma^{2}}{2}\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}
+ηk|𝒰|i𝒰2f(𝐱k)Q(𝐬i,k+1)γ||i𝐇i,k𝐬i,k+1\displaystyle+\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\| (32)

Now consider the last term in (32):

ηk|𝒰|i𝒰2f(𝐱k)Q(𝐬i,k+1)γ||i𝐇i,k𝐬i,k+1\displaystyle\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\|
\displaystyle\leq ηk|𝒰|[i2f(𝐱k)Q(𝐬i,k+1)i𝒯2f(𝐱k)Q(𝐬i,k+1)+i𝒰2f(𝐱k)Q(𝐬i,k+1)]γ||i𝐇i,k𝐬i,k+1\displaystyle\left\|\frac{\eta_{k}}{|\mathcal{U}|}\left[\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\sum_{i\in\mathcal{M}\cap\mathcal{T}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})+\sum_{i\in\mathcal{B}\cap\mathcal{U}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})\right]-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\|
\displaystyle\leq ηk|𝒰|i2f(𝐱k)Q(𝐬i,k+1)γ||i𝐇i,k𝐬i,k+1+ηk|𝒰|i𝒯2f(𝐱k)Q(𝐬i,k+1)\displaystyle\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\|+\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{M}\cap\mathcal{T}}\|\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})\|
+ηk|𝒰|i𝒰2f(𝐱k)Q(𝐬i,k+1)\displaystyle+\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{B}\cap\mathcal{U}}\|\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})\|
\displaystyle\leq ηk|𝒰|i2f(𝐱k)Q(𝐬i,k+1)γ||i2f(𝐱k)Q(𝐬i,k+1)\displaystyle\left\|\frac{\eta_{k}}{|\mathcal{U}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})\right\|
+γ||i2f(𝐱k)Q(𝐬i,k+1)γ||i2f(𝐱k)𝐬i,k+1\displaystyle+\left\|\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})Q(\mathbf{s}_{i,k+1})-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})\mathbf{s}_{i,k+1}\right\|
+γ||i2f(𝐱k)𝐬i,k+1γ||i𝐇i,k𝐬i,k+1+ηk(1β)m(1+1δ)L1[i𝐬i,k+1+αmΓ]\displaystyle+\left\|\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\nabla^{2}f(\mathbf{x}_{k})\mathbf{s}_{i,k+1}-\frac{\gamma}{|\mathcal{M}|}\sum_{i\in\mathcal{M}}\mathbf{H}_{i,k}\mathbf{s}_{i,k+1}\right\|+\frac{\eta_{k}}{(1-\beta)m}(1+\sqrt{1-\delta})L_{1}[\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|+\alpha m\Gamma]
\displaystyle\leq (ηk(1β)mγ(1α)m)L1(1+1δ)i𝐬i,k+1+γ(1α)mL11δi𝐬i,k+1\displaystyle(\frac{\eta_{k}}{(1-\beta)m}-\frac{\gamma}{(1-\alpha)m})L_{1}(1+\sqrt{1-\delta})\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|+\frac{\gamma}{(1-\alpha)m}L_{1}\sqrt{1-\delta}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|
+γϵH(1α)mi𝐬i,k+1+ηk(1β)m(1+1δ)L1[i𝐬i,k+1+αmΓ]\displaystyle+\frac{\gamma\epsilon_{H}}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|+\frac{\eta_{k}}{(1-\beta)m}(1+\sqrt{1-\delta})L_{1}[\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|+\alpha m\Gamma]
(ηk(1β)mL1(1+1δ)(2+αm)γL1(1α)m)i𝐬i,k+1+γϵH(1α)mi𝐬i,k+1\displaystyle\leq\left(\frac{\eta_{k}}{(1-\beta)m}L_{1}(1+\sqrt{1-\delta})(2+\alpha m)-\frac{\gamma L_{1}}{(1-\alpha)m}\right)\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|+\frac{\gamma\epsilon_{H}}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|
=((1α)(1β)2L1(1+1δ)γηk(L1ϵH))1(1α)miηk𝐬i,k+1+ηkα(1β)(1+1δ)Γ\displaystyle=\left(\frac{(1-\alpha)}{(1-\beta)}2L_{1}(1+\sqrt{1-\delta})-\frac{\gamma}{\eta_{k}}(L_{1}-\epsilon_{H})\right)\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|+\frac{\eta_{k}\alpha}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma

Next, we consider the first term in (32):

L2ηk221|𝒰|i𝒰Q(𝐬i,k+1)2\displaystyle\frac{L_{2}\eta_{k}^{2}}{2}\left\|\frac{1}{|\mathcal{U}|}\sum_{i\in\mathcal{U}}Q(\mathbf{s}_{i,k+1})\right\|^{2}
\displaystyle\leq L2(1+1δ)2ηk22(1β)mi𝒰𝐬i,k+12\displaystyle\frac{L_{2}(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)m}\sum_{i\in\mathcal{U}}\|\mathbf{s}_{i,k+1}\|^{2}
\displaystyle\leq L2(1+1δ)2ηk22(1β)m[i𝐬i,k+12+i𝒰𝐬i,k+12]\displaystyle\frac{L_{2}(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}+\sum_{i\in\mathcal{U}\cap\mathcal{B}}\|\mathbf{s}_{i,k+1}\|^{2}\right]
=\displaystyle= L2(1+1δ)2ηk22(1β)mi𝐬i,k+12+L2α(1+1δ)2ηk22(1β)Γ2\displaystyle\frac{L_{2}(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}+\frac{L_{2}\alpha(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)}\Gamma^{2} (33)

So finally we have

f(𝐱k+1)\displaystyle\left\|\nabla f(\mathbf{x}_{k+1})\right\|
\displaystyle\leq L2(1+1δ)2ηk22(1β)mi𝐬i,k+12+ϵg+Mγ22(1α)mi𝐬i,k+12\displaystyle\frac{L_{2}(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}+\epsilon_{g}+\frac{M\gamma^{2}}{2(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\mathbf{s}_{i,k+1}\|^{2}
+((1α)(1β)2L1(1+1δ)γηk(L1ϵH))1(1α)miηk𝐬i,k+1\displaystyle+\left(\frac{(1-\alpha)}{(1-\beta)}2L_{1}(1+\sqrt{1-\delta})-\frac{\gamma}{\eta_{k}}(L_{1}-\epsilon_{H})\right)\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|
+L2α(1+1δ)2ηk22(1β)Γ2+ηkα(1β)(1+1δ)Γ\displaystyle+\frac{L_{2}\alpha(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)}\Gamma^{2}+\frac{\eta_{k}\alpha}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma

Now we choose \gamma>\frac{(1-\alpha)}{(1-\beta)}2L_{1}(1+\sqrt{1-\delta})\frac{\eta_{k}}{L_{1}-\epsilon_{H}}, assuming \epsilon_{H}<L_{1}, so that the coefficient of the \sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\| term above is non-positive and can be dropped.

f(𝐱k+1)\displaystyle\left\|\nabla f(\mathbf{x}_{k+1})\right\|
\displaystyle\leq [L2(1α)(1+1δ)22(1β)+Mγ22ηk2]1(1α)miηk𝐬i,k+12\displaystyle\left[\frac{L_{2}(1-\alpha)(1+\sqrt{1-\delta})^{2}}{2(1-\beta)}+\frac{M\gamma^{2}}{2\eta_{k}^{2}}\right]\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{2}
+L2α(1+1δ)2ηk22(1β)Γ2+ηkα(1β)(1+1δ)Γ+ϵg\displaystyle+\frac{L_{2}\alpha(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)}\Gamma^{2}+\frac{\eta_{k}\alpha}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma+\epsilon_{g}
\displaystyle\leq [L2(1α)(1+1δ)22(1β)+Mγ22ηk2][1(1α)miηk𝐬i,k+13]2/3\displaystyle\left[\frac{L_{2}(1-\alpha)(1+\sqrt{1-\delta})^{2}}{2(1-\beta)}+\frac{M\gamma^{2}}{2\eta_{k}^{2}}\right]\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\right]^{2/3}
+L2α(1+1δ)2ηk22(1β)Γ2+ηkα(1β)(1+1δ)Γ+ϵg\displaystyle+\frac{L_{2}\alpha(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)}\Gamma^{2}+\frac{\eta_{k}\alpha}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma+\epsilon_{g}

At step k=k0k=k_{0},

f(𝐱k0+1)\displaystyle\left\|\nabla f(\mathbf{x}_{k_{0}+1})\right\|
[L2(1α)(1+1δ)22(1β)+Mγ22ηk2](ψcompT)2/3+ϵg+L2α(1+1δ)2ηk22(1β)Γ2+ηkα(1β)(1+1δ)Γ\displaystyle\leq\left[\frac{L_{2}(1-\alpha)(1+\sqrt{1-\delta})^{2}}{2(1-\beta)}+\frac{M\gamma^{2}}{2\eta_{k}^{2}}\right]\left(\frac{\psi_{comp}}{T}\right)^{2/3}+\epsilon_{g}+\frac{L_{2}\alpha(1+\sqrt{1-\delta})^{2}\eta_{k}^{2}}{2(1-\beta)}\Gamma^{2}+\frac{\eta_{k}\alpha}{(1-\beta)}(1+\sqrt{1-\delta})\Gamma
χ1T2/3+ϵg+Minor term of order 𝒪(1T)\displaystyle\leq\frac{\chi_{1}}{T^{2/3}}+\epsilon_{g}+\text{Minor term of order }\mathcal{O}(\frac{1}{T})

where χ1=[L2(1α)(1+1δ)22(1β)+Mγ22ηk2](ψcomp)2/3\chi_{1}=\left[\frac{L_{2}(1-\alpha)(1+\sqrt{1-\delta})^{2}}{2(1-\beta)}+\frac{M\gamma^{2}}{2\eta_{k}^{2}}\right](\psi_{comp})^{2/3} and as M=O(m)M=O(m), we have χ1=𝒪(m)\chi_{1}=\mathcal{O}(m).

Finally, we lower bound the minimum eigenvalue of the Hessian:

λmin(2f(𝐱k+1))\displaystyle\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{k+1}))
=1(1α)miλmin[2f(𝐱k+1)]\displaystyle=\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\lambda_{\min}\left[\nabla^{2}f(\mathbf{x}_{k+1})\right]
=1(1α)miλmin[𝐇i,k(𝐇i,k2f(𝐱k+1))]\displaystyle=\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\lambda_{\min}\left[\mathbf{H}_{i,k}-(\mathbf{H}_{i,k}-\nabla^{2}f(\mathbf{x}_{k+1}))\right]
1(1α)mi[λmin(𝐇i,k)𝐇i,k2f(𝐱k+1)]\displaystyle\geq\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\left[\lambda_{\min}(\mathbf{H}_{i,k})-\|\mathbf{H}_{i,k}-\nabla^{2}f(\mathbf{x}_{k+1})\|\right]
1(1α)miλmin(𝐇i,k)1(1α)mi𝐇i,k2f(𝐱k+1)\displaystyle\geq\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\lambda_{\min}(\mathbf{H}_{i,k})-\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\mathbf{H}_{i,k}-\nabla^{2}f(\mathbf{x}_{k+1})\|
1(1α)miMγ2𝐬i,k+11(1α)mi𝐇i,k2f(𝐱k)\displaystyle\geq\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}-\frac{M\gamma}{2}\|\mathbf{s}_{i,k+1}\|-\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\mathbf{H}_{i,k}-\nabla^{2}f(\mathbf{x}_{k})\|
1(1α)mi2f(𝐱k)2f(𝐱k+1)\displaystyle\qquad-\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\nabla^{2}f(\mathbf{x}_{k})-\nabla^{2}f(\mathbf{x}_{k+1})\|
1(1α)miMγ2ηkηk𝐬i,k+1ϵH1(1α)miL2𝐱k𝐱k+1\displaystyle\geq\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}-\frac{M\gamma}{2\eta_{k}}\|\eta_{k}\mathbf{s}_{i,k+1}\|-\epsilon_{H}-\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}L_{2}\|\mathbf{x}_{k}-\mathbf{x}_{k+1}\|
Mγ2ηk[1(1α)miηk𝐬i,k+13]1/3L2𝐱k𝐱k+1ϵH\displaystyle\geq-\frac{M\gamma}{2\eta_{k}}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\right]^{1/3}-L_{2}\|\mathbf{x}_{k}-\mathbf{x}_{k+1}\|-\epsilon_{H}
Mγ2ηk[1(1α)miηk𝐬i,k+13]1/3L2[(1+1δ)(1β)mi𝒰ηk𝐬i,k+1]ϵH\displaystyle\geq-\frac{M\gamma}{2\eta_{k}}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\right]^{1/3}-L_{2}\left[\frac{(1+\sqrt{1-\delta})}{(1-\beta)m}\sum_{i\in\mathcal{U}}\|\eta_{k}\mathbf{s}_{i,k+1}\|\right]-\epsilon_{H}
Mγ2ηk[1(1α)miηk𝐬i,k+13]1/3L2(1+1δ)(1β)m[iηk𝐬i,k+1+i𝒰ηk𝐬i,k+1]ϵH\displaystyle\geq-\frac{M\gamma}{2\eta_{k}}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\right]^{1/3}-L_{2}\frac{(1+\sqrt{1-\delta})}{(1-\beta)m}\left[\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|+\sum_{i\in\mathcal{B}\cap\mathcal{U}}\|\eta_{k}\mathbf{s}_{i,k+1}\|\right]-\epsilon_{H}
Mγ2ηk[1(1α)miηk𝐬i,k+13]1/3L2(1+1δ)(1α)(1β)[1(1α)miηk𝐬i,k+1]ϵH\displaystyle\geq-\frac{M\gamma}{2\eta_{k}}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|^{3}\right]^{1/3}-L_{2}\frac{(1+\sqrt{1-\delta})(1-\alpha)}{(1-\beta)}\left[\frac{1}{(1-\alpha)m}\sum_{i\in\mathcal{M}}\|\eta_{k}\mathbf{s}_{i,k+1}\|\right]-\epsilon_{H}
L2(1+1δ)α(1β)ηkΓ\displaystyle-L_{2}\frac{(1+\sqrt{1-\delta})\alpha}{(1-\beta)}\eta_{k}\Gamma (34)

At k=k0k=k_{0} we have

λmin(2f(𝐱k0+1))\displaystyle\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{k_{0}+1})) (35)
\displaystyle\geq Mγ2ηk[ψcompT]1/3L2(1+1δ)(1α)(1β)[ψcompT]1/3ϵHL2(1+1δ)α(1β)ηkΓ\displaystyle-\frac{M\gamma}{2\eta_{k}}\left[\frac{\psi_{comp}}{T}\right]^{1/3}-L_{2}\frac{(1+\sqrt{1-\delta})(1-\alpha)}{(1-\beta)}\left[\frac{\psi_{comp}}{T}\right]^{1/3}-\epsilon_{H}-L_{2}\frac{(1+\sqrt{1-\delta})\alpha}{(1-\beta)}\eta_{k}\Gamma
\displaystyle\geq [Mγ2ηk+L2(1+1δ)(1α)(1β)]ψcomp1/3(1T)1/3ϵH Minor term 𝒪(1/T)\displaystyle-\left[\frac{M\gamma}{2\eta_{k}}+L_{2}\frac{(1+\sqrt{1-\delta})(1-\alpha)}{(1-\beta)}\right]\psi^{1/3}_{comp}\left(\frac{1}{T}\right)^{1/3}-\epsilon_{H}-\text{ Minor term }\mathcal{O}(1/T)
\displaystyle\geq χ2T1/3ϵH Minor term 𝒪(1/T)\displaystyle-\frac{\chi_{2}}{T^{1/3}}-\epsilon_{H}-\text{ Minor term }\mathcal{O}(1/T) (36)

where χ2=[Mγ2ηk+L2(1+1δ)(1α)(1β)]ψcomp1/3\chi_{2}=\left[\frac{M\gamma}{2\eta_{k}}+L_{2}\frac{(1+\sqrt{1-\delta})(1-\alpha)}{(1-\beta)}\right]\psi^{1/3}_{comp} and we have χ2=𝒪(m)\chi_{2}=\mathcal{O}(m).

For ease of reference, we recall the parameter

ψcomp\displaystyle\psi_{comp} =[f(𝐱0)fλcomp+k=0T1λΓλcomp]\displaystyle=\left[\frac{f(\mathbf{x}_{0})-f^{*}}{\lambda_{comp}}+\sum_{k=0}^{T-1}\frac{\lambda_{\Gamma}}{\lambda_{comp}}\right] (37)

6 Additional Experiments

In this section, we provide additional experiments. We choose the parameters \lambda=1, M=10, learning rate \eta_{k}=1, fraction of Byzantine machines \alpha\in\{0.1,0.15,0.2\}, and \beta=\alpha+\frac{2}{m}.

Compressed and Byzantine:

In Figure 3, we plot the function value of the robust linear regression problem under the ‘Gaussian’ and ‘random label’ attacks with compressed updates, for both the ‘w8a’ and ‘a9a’ datasets.


Figure 3: Training loss on the ‘a9a’ (first row) and ‘w8a’ (second row) datasets with 10%, 15%, and 20% Byzantine worker machines: (a,c) Gaussian attack; (b,d) random label attack.

Figure 4: Training loss on the ‘a9a’ (first row) and ‘w8a’ (second row) datasets with 10%, 15%, and 20% Byzantine worker machines for the non-convex robust linear regression problem: (a,e) flipped label attack; (b,f) negative update attack; (c,g) Gaussian noise attack; (d,h) random label attack.

Figure 5: Classification accuracy on the ‘a9a’ (first row) and ‘w8a’ (second row) test datasets with 10%, 15%, and 20% Byzantine worker machines for the logistic regression problem: (a,e) flipped label attack; (b,f) negative update attack; (c,g) Gaussian noise attack; (d,h) random label attack.

Training loss for uncompressed update:

In Figure 4, we plot the function value of the robust linear regression problem under all four attacks, for both the ‘w8a’ and ‘a9a’ datasets, with uncompressed updates (\delta=1).

Classification accuracy:

We show the classification accuracy on the test data of the ‘a9a’ and ‘w8a’ datasets for the logistic regression problem in Figure 5, and the training loss of the ‘a9a’ and ‘w8a’ datasets for the robust linear regression problem in Figure 4. It is evident from the plots that simple norm-based thresholding makes the learning algorithm robust.
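For completeness, the sketch below illustrates one plausible implementation of the four Byzantine attacks referred to above (flipped label, negative update, Gaussian noise, and random label). The exact magnitudes, label alphabets, and helper names are assumptions made for illustration only and are not restated from the experimental setup.

import numpy as np

rng = np.random.default_rng(4)

def gaussian_attack(update, scale=10.0):
    """Replace the honest update with pure Gaussian noise."""
    return scale * rng.normal(size=update.shape)

def negative_update_attack(update, scale=1.0):
    """Send the negated (scaled) honest update."""
    return -scale * update

def flipped_label_attack(labels, num_classes=2):
    """Flip binary labels (or reverse multi-class labels) before computing the local update."""
    return (num_classes - 1) - labels

def random_label_attack(labels, num_classes=2):
    """Replace labels with uniformly random ones before computing the local update."""
    return rng.integers(0, num_classes, size=labels.shape)

# toy usage on a dummy update vector and label vector
u = rng.normal(size=5)
y = np.array([0, 1, 1, 0])
print(gaussian_attack(u), negative_update_attack(u))
print(flipped_label_attack(y), random_label_attack(y))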