Federated Deep AUC Maximization for Heterogeneous Data
with a Constant Communication Complexity

Zhuoning Yuan* (zhuoning-yuan@uiowa.edu), The University of Iowa
Zhishuai Guo* (zhishuai-guo@uiowa.edu), The University of Iowa
Yi Xu (yixu@alibaba-inc.com), DAMO Academy, Alibaba Group
Yiming Ying (yying@albany.edu), State University of New York at Albany
Tianbao Yang (tianbao-yang@uiowa.edu), The University of Iowa
*Equal contribution
Abstract

Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant, independent of both the number of machines and the accuracy level, which improves an existing result by orders of magnitude. Of independent interest, the proposed algorithm can also be applied to a class of non-convex strongly-concave min-max problems. Experiments demonstrate the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest X-ray images from different organizations. They show that FDAM using data from multiple hospitals can improve the AUC score on test data from a single hospital for detecting life-threatening diseases based on chest radiographs.


1 Introduction

Federated learning (FL) is an emerging paradigm for large-scale learning that deals with data distributed (often geographically) over multiple clients, e.g., mobile phones or organizations. An important feature of FL is that the data remain at their own clients, allowing for the preservation of data privacy. This feature makes FL attractive not only to internet companies such as Google and Apple but also to conventional industries that provide services to people, such as hospitals and banks, in the era of big data [rieke2020future, long2020federated]. Data in these industries are usually collected from people who are concerned about data leakage. But in order to provide better service, large-scale machine learning from diverse data sources is important for addressing model bias. For example, patients in hospitals located in urban areas could differ dramatically in demographics, lifestyle, and disease profile from patients in rural areas. Machine learning models (in particular, deep neural networks) trained on patients' data from one hospital could be dramatically biased towards its major population, which could raise serious ethical concerns [pooch2020can].

One of the fundamental issues that can cause model bias is data imbalance, where the numbers of samples from different classes are skewed. Although FL provides an effective framework for leveraging multiple data sources, most existing FL methods still lack the capability to tackle the model bias caused by data imbalance. The reason is that most existing FL methods are developed for minimizing a conventional objective, e.g., the average of a standard loss function over all data, and are not amenable to optimizing measures better suited to imbalanced data, such as the area under the ROC curve (AUC). It has recently been shown that directly maximizing AUC for deep learning can lead to great improvements on difficult real-world classification tasks [robustdeepAUC]. For example, robustdeepAUC reported the best performance achieved by DAM on the Stanford CheXpert competition for interpreting chest X-ray images like radiologists [chexpert19].

Table 1: Summary of the sample and communication complexities of different algorithms for FDAM under a $\mu$-PL condition in both heterogeneous and homogeneous settings, where $K$ is the number of machines and $\mu\leq 1$. NPA denotes the naive parallel (large mini-batch) version of PPD-SG [liu2019stochastic] for DAM, where $M$ denotes the batch size in the NPA. The $*$ indicates results derived by us. $\widetilde{O}(\cdot)$ suppresses a logarithmic factor.
| Algorithm | Communication (Heterogeneous Data) | Communication (Homogeneous Data) | Sample Complexity |
| --- | --- | --- | --- |
| NPA ($M<\frac{1}{K\mu\epsilon}$) | $\widetilde{O}\left(\frac{1}{KM\mu^{2}\epsilon}+\frac{1}{\mu\epsilon}\right)$ | $\widetilde{O}\left(\frac{1}{KM\mu^{2}\epsilon}+\frac{1}{\mu\epsilon}\right)$ | $\widetilde{O}\left(\frac{M}{\mu\epsilon}+\frac{1}{\mu^{2}K\epsilon}\right)$ |
| NPA ($M\geq\frac{1}{K\mu\epsilon}$) | $\widetilde{O}\left(\frac{1}{\mu}\right)^{*}$ | $\widetilde{O}\left(\frac{1}{\mu}\right)^{*}$ | $\widetilde{O}\left(\frac{M}{\mu}\right)^{*}$ |
| CODA+ (CODA) | $\widetilde{O}\left(\frac{K}{\mu}+\frac{1}{\mu\epsilon^{1/2}}+\frac{1}{\mu^{3/2}\epsilon^{1/2}}\right)$ | $\widetilde{O}\left(\frac{K}{\mu}\right)^{*}$ | $\widetilde{O}\left(\frac{1}{\mu\epsilon}+\frac{1}{\mu^{2}K\epsilon}\right)$ |
| CODASCA | $\widetilde{O}\left(\frac{1}{\mu}\right)$ | $\widetilde{O}\left(\frac{1}{\mu}\right)$ | $\widetilde{O}\left(\frac{1}{\mu\epsilon}+\frac{1}{\mu^{2}K\epsilon}\right)$ |

However, research on FDAM is still limited. To the best of our knowledge, dist_auc_guo is the only work dedicated to FDAM by solving the non-convex strongly-concave min-max problem in a distributed manner. Their algorithm (CODA) is similar to the standard FedAvg method [DBLP:conf/aistats/McMahanMRHA17], except that the periodic averaging is applied to both the primal and the dual variables. Nevertheless, their results on FDAM are not comprehensive. Through a close investigation of their algorithms and analysis, we found that (i) although their FL algorithm CODA was shown to be better than the naive parallel algorithm (NPA) with a small mini-batch for DAM, the NPA using a larger mini-batch at local machines can enjoy a smaller communication complexity than CODA; and (ii) the communication complexity of CODA for homogeneous data is better than that established for heterogeneous data, but still worse than that of the NPA with a large mini-batch at local clients. These shortcomings of CODA for FDAM motivate us to develop better federated averaging algorithms and analysis with a lower communication complexity, without sacrificing the sample complexity.

This paper aims to provide more comprehensive results for FDAM, with a focus on improving the communication complexity of CODA for heterogeneous data. In particular, our contributions are summarized below:

  • First, we provide a stronger baseline with a simpler algorithm than CODA, named CODA+, and establish its complexity in both the homogeneous and heterogeneous data settings. Although CODA+ involves only a slight change from CODA, its analysis is much more involved, as it is based on a duality gap analysis instead of a primal objective gap analysis.

  • Second, we propose a new variant of CODA+ named CODASCA with a much lower communication complexity than CODA+. The key thrust is to incorporate the idea of stochastic controlled averaging from SCAFFOLD [karimireddy2019scaffold] into the framework of CODA+ to correct the client drift in both the local primal updates and the local dual updates. A striking result is that, under a PL condition for deep learning, the communication complexity of CODASCA is independent of the number of machines and of the targeted accuracy level, which is even better than CODA+ in the homogeneous data setting. The analysis of CODASCA is also non-trivial: it combines the duality gap analysis of CODA+ for a non-convex strongly-concave min-max problem with the variance reduction analysis of SCAFFOLD. The comparison among CODASCA, CODA+, and the NPA for FDAM is shown in Table 1.

  • Third, we conduct experiments on benchmark datasets to verify our theory, showing that CODASCA enjoys a larger communication window size than CODA+ without sacrificing performance. Moreover, we conduct empirical studies on medical chest X-ray images from different hospitals, showing that CODASCA trained on data from multiple organizations improves performance on test data from a single hospital.

2 Related Work

Federated Learning (FL).

Many empirical studies [povey2014parallel, su2015experiments, mcmahan2016communication, chen2016scalable, lin2018don, kamp2018efficient, DBLP:conf/eccv/YuanGYWY20] have shown that FL exhibits good empirical performance for distributed deep learning. For a more thorough survey of FL, we refer the readers to [mcmahan14advances]. This paper is closely related to recent studies that focus on the design of distributed stochastic algorithms for FL with provable convergence guarantee.

The most popular FL algorithm is Federated Averaging (FedAvg) [DBLP:conf/aistats/McMahanMRHA17], which is also referred to as local SGD [stich2018local]. [stich2018local] was the first to establish the convergence of local SGD for strongly convex functions. yu2019parallel, yu_linear establish the convergence of local SGD and its momentum variants for non-convex functions. Although not explicitly discussed in their paper, the analysis in [yu2019parallel] already exhibited the difference between the communication complexities of local SGD in the homogeneous and heterogeneous data settings, which was also discovered in recent works [khaled2020tighter, woodworth2020local, DBLP:conf/nips/WoodworthPS20]. These latter studies provide a tight analysis of local SGD in homogeneous and/or heterogeneous data settings, improving the upper bounds of earlier works for convex and strongly convex functions; the resulting bounds sometimes improve on those of large mini-batch SGD, e.g., when the level of heterogeneity is sufficiently small.

DBLP:conf/nips/HaddadpourKMC19 improve the complexities of local SGD for non-convex optimization by leveraging the PL condition. [karimireddy2019scaffold] propose a new FedAvg algorithm, SCAFFOLD, which introduces control variates (variance reduction) to correct for the 'client drift' in the local updates on heterogeneous data. The communication complexities of SCAFFOLD are no worse than those of large mini-batch SGD for both strongly convex and non-convex functions. The proposed algorithm CODASCA is inspired by the stochastic controlled averaging idea of SCAFFOLD. However, the analysis of CODASCA for non-convex min-max optimization under a PL condition on the primal objective function is non-trivial compared to that of SCAFFOLD.

AUC Maximization. This work builds on the foundations of stochastic AUC maximization developed in many previous works. ying2016stochastic address the scalability issue of optimizing AUC by introducing a min-max reformulation of the AUC square surrogate loss and solving it by a convex-concave stochastic gradient method [nemirovski2009robust]. natole2018stochastic improve the convergence rate by adding a strongly convex regularizer to the original formulation. Based on the same min-max formulation as [ying2016stochastic], liu2018fast achieve an improved convergence rate by developing a multi-stage algorithm that leverages the quadratic growth condition of the problem. However, all of these studies focus on learning a linear model, for which the corresponding problem is convex and strongly concave. robustdeepAUC propose a more robust margin-based surrogate loss for the AUC score, which can be formulated as a min-max problem similar to the AUC square surrogate loss.

Deep AUC Maximization (DAM). [rafique2018non] is the first work to develop algorithms and convergence theories for weakly convex, strongly concave min-max problems, which are applicable to DAM. However, their convergence rate is too slow for practical purposes. liu2019stochastic improve the convergence rate for DAM under a practical PL condition on the primal objective function. guo2020fast further develop more generic algorithms for non-convex strongly-concave min-max problems, which can also be applied to DAM. Several other studies [yan2020optimal, lin2019gradient, arXiv:2001.03724, yang2020global] focus on non-convex strongly-concave min-max problems without considering the application to DAM. Based on liu2019stochastic's algorithm, dist_auc_guo propose a communication-efficient FL algorithm (CODA) for DAM. However, its communication cost is still high for heterogeneous data.

DL for Medical Image Analysis. In the past decade, machine learning, especially deep learning, has revolutionized many domains such as computer vision and natural language processing. For medical image analysis, deep learning methods are also showing great potential, such as in the classification of skin lesions [esteva2017dermatologist, li2018skin], the interpretation of chest radiographs [ardila2019end, chexpert19], and breast cancer screening [bejnordi2017diagnostic, mckinney2020international, wang2016deep]. Some works have already achieved expert-level performance on different tasks [ardila2019end, mckinney2020international, DBLP:journals/mia/LitjensKBSCGLGS17]. Recently, robustdeepAUC employed DAM for medical image classification and achieved great success on two challenging tasks, namely the CheXpert competition for chest X-ray image classification and a Kaggle competition for melanoma classification based on skin lesion images. However, to the best of our knowledge, the application of FDAM methods to medical datasets from different hospitals has not been thoroughly investigated.

3 Preliminaries and Notations

We consider federated learning of deep neural networks by maximizing the AUC score. The setting is the same as that considered in [dist_auc_guo]. Below, we present some preliminaries and notations, which are mostly the same as in [dist_auc_guo]. In this paper, we consider the following min-max formulation for the distributed AUC maximization problem:

\min_{\mathbf{w}\in\mathbb{R}^{d},(a,b)\in\mathbb{R}^{2}}\ \max_{\alpha\in\mathbb{R}}\ f(\mathbf{w},a,b,\alpha)=\frac{1}{K}\sum_{k=1}^{K}f_{k}(\mathbf{w},a,b,\alpha), \qquad (1)

where $K$ is the total number of machines and $f_{k}(\mathbf{w},a,b,\alpha)$ is defined below:

\begin{split}
f_{k}(\mathbf{w},a,b,\alpha)&=\mathbb{E}_{\mathbf{z}^{k}}[F_{k}(\mathbf{w},a,b,\alpha;\mathbf{z}^{k})]\\
&=\mathbb{E}_{\mathbf{z}^{k}}\big[(1-p)(h(\mathbf{w};\mathbf{x}^{k})-a)^{2}\mathbb{I}_{[y^{k}=1]}+p(h(\mathbf{w};\mathbf{x}^{k})-b)^{2}\mathbb{I}_{[y^{k}=-1]}\\
&\qquad+2(1+\alpha)\big(ph(\mathbf{w};\mathbf{x}^{k})\mathbb{I}_{[y^{k}=-1]}-(1-p)h(\mathbf{w};\mathbf{x}^{k})\mathbb{I}_{[y^{k}=1]}\big)-p(1-p)\alpha^{2}\big], \qquad (2)
\end{split}

where $\mathbf{z}^{k}=(\mathbf{x}^{k},y^{k})\sim\mathbb{P}_{k}$, $\mathbb{P}_{k}$ is the data distribution on machine $k$, and $p$ is the ratio of positive data. When $\mathbb{P}_{k}=\mathbb{P}_{l},\ \forall k\neq l$, this is referred to as the homogeneous data setting; otherwise it is the heterogeneous data setting.
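To make Eq. (2) concrete, the per-example surrogate can be evaluated directly. The sketch below is our own plain-Python illustration (not the authors' code): the network output $h(\mathbf{w};\mathbf{x})$ is abstracted to a precomputed scalar `score`, and the function name is ours.

```python
def auc_minmax_loss(score, y, a, b, alpha, p):
    """Per-example value F_k(w, a, b, alpha; z) of the min-max AUC
    surrogate in Eq. (2), with the network output h(w; x) passed in
    as a precomputed scalar `score`.

    y: label in {+1, -1};  p: fraction of positive data;
    a, b: auxiliary primal variables (estimates of the mean scores
    of the positive / negative class);  alpha: dual variable.
    """
    pos = 1.0 if y == 1 else 0.0   # indicator I[y = 1]
    neg = 1.0 - pos                # indicator I[y = -1]
    return ((1 - p) * (score - a) ** 2 * pos
            + p * (score - b) ** 2 * neg
            + 2 * (1 + alpha) * (p * score * neg - (1 - p) * score * pos)
            - p * (1 - p) * alpha ** 2)
```

Averaging this quantity over minibatches drawn from $\mathbb{P}_{k}$ gives an unbiased estimate of $f_{k}$; its gradients with respect to $(\mathbf{w},a,b)$ and $\alpha$ drive the primal descent and dual ascent steps of the algorithms that follow.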

Notations. We define the following notations:

\mathbf{v}=(\mathbf{w}^{T},a,b)^{T},\qquad \phi(\mathbf{v})=\max_{\alpha}f(\mathbf{v},\alpha),
\phi_{s}(\mathbf{v})=\phi(\mathbf{v})+\frac{1}{2\gamma}\|\mathbf{v}-\mathbf{v}_{s-1}\|^{2},
f^{s}(\mathbf{v},\alpha)=f(\mathbf{v},\alpha)+\frac{1}{2\gamma}\|\mathbf{v}-\mathbf{v}_{s-1}\|^{2},
F^{s}_{k}(\mathbf{v},\alpha;\mathbf{z}^{k})=F_{k}(\mathbf{v},\alpha;\mathbf{z}^{k})+\frac{1}{2\gamma}\|\mathbf{v}-\mathbf{v}_{s-1}\|^{2},
\mathbf{v}^{*}_{\phi}=\arg\min_{\mathbf{v}}\phi(\mathbf{v}),\qquad \mathbf{v}^{*}_{\phi_{s}}=\arg\min_{\mathbf{v}}\phi_{s}(\mathbf{v}).

In the remainder of this paper, we consider the following distributed min-max problem, which covers distributed DAM:

\min_{\mathbf{v}}\max_{\alpha}f(\mathbf{v},\alpha)=\frac{1}{K}\sum_{k=1}^{K}f_{k}(\mathbf{v},\alpha), \qquad (3)

where

f_{k}(\mathbf{v},\alpha)=\mathbb{E}_{\mathbf{z}^{k}}[F_{k}(\mathbf{v},\alpha;\mathbf{z}^{k})]. \qquad (4)

Assumptions. Similar to [dist_auc_guo], we make the following assumptions throughout this paper.

Assumption 1

(i) There exist $\mathbf{v}_{0}$ and $\Delta_{0}>0$ such that $\phi(\mathbf{v}_{0})-\phi(\mathbf{v}^{*}_{\phi})\leq\Delta_{0}$.
(ii) PL condition: $\phi(\mathbf{v})$ satisfies the $\mu$-PL condition, i.e., $\mu(\phi(\mathbf{v})-\phi(\mathbf{v}_{*}))\leq\frac{1}{2}\|\nabla\phi(\mathbf{v})\|^{2}$.
(iii) Smoothness: for any $\mathbf{z}$, $F_{k}(\mathbf{v},\alpha;\mathbf{z})$ is $\ell$-smooth in $\mathbf{v}$ and $\alpha$, and $\phi(\mathbf{v})$ is $L$-smooth, i.e., $\|\nabla\phi(\mathbf{v}_{1})-\nabla\phi(\mathbf{v}_{2})\|\leq L\|\mathbf{v}_{1}-\mathbf{v}_{2}\|$.
(iv) Bounded variance:

\begin{split}
&\mathbb{E}[\|\nabla_{\mathbf{v}}f_{k}(\mathbf{v},\alpha)-\nabla_{\mathbf{v}}F_{k}(\mathbf{v},\alpha;\mathbf{z})\|^{2}]\leq\sigma^{2},\\
&\mathbb{E}[|\nabla_{\alpha}f_{k}(\mathbf{v},\alpha)-\nabla_{\alpha}F_{k}(\mathbf{v},\alpha;\mathbf{z})|^{2}]\leq\sigma^{2}. \qquad (5)
\end{split}

To quantify the drifts between different clients, we introduce the following assumption.

Assumption 2

Bounded client drift:

\begin{split}
&\frac{1}{K}\sum_{k=1}^{K}\|\nabla_{\mathbf{v}}f_{k}(\mathbf{v},\alpha)-\nabla_{\mathbf{v}}f(\mathbf{v},\alpha)\|^{2}\leq D^{2},\\
&\frac{1}{K}\sum_{k=1}^{K}\|\nabla_{\alpha}f_{k}(\mathbf{v},\alpha)-\nabla_{\alpha}f(\mathbf{v},\alpha)\|^{2}\leq D^{2}. \qquad (6)
\end{split}

Remark. $D$ quantifies the drift between the local objectives on the clients and the global objective. $D=0$ corresponds to the homogeneous data setting, in which all the local objectives are identical; $D>0$ corresponds to the heterogeneous data setting.

4 CODA+: A stronger baseline

In this section, we present a stronger baseline than CODA [dist_auc_guo]. The motivation is twofold: (i) the CODA algorithm uses a step that computes the dual variable from the primal variable using sampled data from all clients, but an improved analysis shows this step is unnecessary; (ii) the complexity of CODA for homogeneous data is not given in its original paper. Hence, CODA+ is a simplified version of CODA with a much refined analysis.

We present the steps of CODA+ in Algorithm 1. Like CODA, it uses stagewise updates. In the $s$-th stage, a strongly convex, strongly concave subproblem is constructed as follows:

\min_{\mathbf{v}}\max_{\alpha}f(\mathbf{v},\alpha)+\frac{\gamma}{2}\|\mathbf{v}-\mathbf{v}_{0}^{s}\|^{2}, \qquad (7)

where 𝐯0s\mathbf{v}_{0}^{s} is the output of the previous stage.

CODA+ improves upon CODA in two ways. First, the CODA+ algorithm is more concise: the output primal and dual variables of each stage are used directly as input to the next stage, whereas CODA needs an extra large batch of data after each stage to compute the dual variable. This modification not only reduces the sample complexity but also makes the algorithm applicable to a broader family of nonconvex min-max problems. Second, CODA+ has a smaller communication complexity for homogeneous data than for heterogeneous data, whereas the previous analysis of CODA yields the same communication complexity in both settings.

Algorithm 1 CODA+
1:  Initialization: $(\mathbf{v}_{0},\alpha_{0},\gamma)$.
2:  for $s=1,\ldots,S$ do
3:     $\mathbf{v}_{s},\alpha_{s}=\text{DSG+}(\mathbf{v}_{s-1},\alpha_{s-1},\eta_{s},T_{s},I_{s},\gamma)$;
4:  end for
5:  Return $\mathbf{v}_{S},\alpha_{S}$.
Algorithm 2 DSG+($\mathbf{v}_{0},\alpha_{0},\eta,T,I,\gamma$)
  Each machine does initialization: $\mathbf{v}_{0}^{k}=\mathbf{v}_{0},\alpha_{0}^{k}=\alpha_{0}$,
  for $t=0,1,\ldots,T-1$ do
     Each machine $k$ updates its local solution in parallel:
         $\mathbf{v}_{t+1}^{k}=\mathbf{v}_{t}^{k}-\eta(\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t}^{k},\alpha_{t}^{k};\mathbf{z}_{t}^{k})+\gamma(\mathbf{v}_{t}^{k}-\mathbf{v}_{0}))$,
         $\alpha_{t+1}^{k}=\alpha_{t}^{k}+\eta\nabla_{\alpha}F_{k}(\mathbf{v}_{t}^{k},\alpha_{t}^{k};\mathbf{z}_{t}^{k})$,
     if $(t+1)\bmod I=0$ then
        $\mathbf{v}^{k}_{t+1}=\frac{1}{K}\sum_{j=1}^{K}\mathbf{v}_{t+1}^{j}$, ◇ communicate
        $\alpha^{k}_{t+1}=\frac{1}{K}\sum_{j=1}^{K}\alpha_{t+1}^{j}$, ◇ communicate
     end if
  end for
  Return $\bar{\mathbf{v}}=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{T}\sum_{t=1}^{T}\mathbf{v}_{t}^{k}$, $\bar{\alpha}=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{T}\sum_{t=1}^{T}\alpha_{t}^{k}$.
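To make the control flow of DSG+ concrete, the following is our own single-process simulation sketch (not the authors' implementation): the $K$ machines are emulated by rows of an array, the stochastic gradient oracles `grad_v` and `grad_alpha` are hypothetical stand-ins for $\nabla F_{k}$, and every $I$ steps the iterates are averaged as in the communication step.

```python
import numpy as np

def dsg_plus(grad_v, grad_alpha, v0, alpha0, eta, T, I, gamma, K, rng):
    """Single-process sketch of DSG+ (Algorithm 2).

    grad_v(k, v, alpha, rng) and grad_alpha(k, v, alpha, rng) return
    stochastic gradients of machine k's local objective F_k.
    """
    v = np.tile(np.asarray(v0, dtype=float), (K, 1))   # v_t^k, one row per machine
    alpha = np.full(K, alpha0, dtype=float)            # alpha_t^k
    v_sum, alpha_sum = np.zeros_like(v0, dtype=float), 0.0
    for t in range(T):
        for k in range(K):
            gv = grad_v(k, v[k], alpha[k], rng) + gamma * (v[k] - v0)
            ga = grad_alpha(k, v[k], alpha[k], rng)
            v[k] = v[k] - eta * gv           # local primal descent
            alpha[k] = alpha[k] + eta * ga   # local dual ascent
        if (t + 1) % I == 0:                 # periodic averaging (communication)
            v[:] = v.mean(axis=0)
            alpha[:] = alpha.mean()
        v_sum = v_sum + v.mean(axis=0)       # accumulate for the averaged output
        alpha_sum = alpha_sum + alpha.mean()
    return v_sum / T, alpha_sum / T          # averaged iterates
```

On a toy strongly-convex strongly-concave problem with heterogeneous local objectives, the averaged output approaches the saddle point of the averaged objective, as Lemma 1 below suggests it should.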

The following lemma bounds the convergence of Algorithm 2 for the subproblem of each stage.

Lemma 1

(One call of Algorithm 2) Let $(\bar{\mathbf{v}},\bar{\alpha})$ be the output of Algorithm 2. Suppose Assumptions 1 and 2 hold. By running Algorithm 2 with input $\mathbf{v}_{0},\alpha_{0}$ for $T$ iterations, $\gamma=2\ell$, and $\eta\leq\min\left(\frac{1}{3\ell+3\ell^{2}/\mu_{2}},\frac{1}{4\ell}\right)$, we have for any $\mathbf{v}$ and $\alpha$

\begin{split}
\mathbb{E}[f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha})]&\leq\frac{1}{\eta T}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\eta T}(\alpha_{0}-\alpha)^{2}\\
&\quad+\underbrace{\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\left(12\eta^{2}I\sigma^{2}+36\eta^{2}I^{2}D^{2}\right)\mathbb{I}_{I>1}}_{A_{1}}+\frac{3\eta\sigma^{2}}{K},
\end{split}

where $\mu_{2}=2p(1-p)$ is the strong concavity parameter of $f(\mathbf{v},\alpha)$ in $\alpha$.

Remark. The term $A_{1}$ on the RHS is the client drift caused by skipping communication. When $D=0$, i.e., the machines have homogeneous data distributions, we need $\eta I=O\left(\frac{1}{K}\right)$ so that $A_{1}$ can be merged with the last term. When $D>0$, we need $\eta I^{2}=O\left(\frac{1}{K}\right)$, which means $I$ has to be smaller in the heterogeneous data setting and thus the communication complexity is higher.

Remark. The key difference between the analysis of CODA+ and that of CODA lies in how to handle the term $(\alpha_{0}-\alpha)^{2}$ in Lemma 1. In CODA, the initial dual variable $\alpha_{0}$ is computed from the initial primal variable $\mathbf{v}_{0}$, which reduces the error term $(\alpha_{0}-\alpha)^{2}$ to one similar to $\|\mathbf{v}_{0}-\mathbf{v}\|^{2}$, which is in turn bounded by the primal objective gap due to the PL condition. Since we do not perform this extra computation of $\alpha_{0}$ from $\mathbf{v}_{0}$, our analysis deals with this error term directly through the duality gap of $f^{s}$. This technique was originally developed in [yan2020optimal].

Theorem 1

Set $\gamma=2\ell$, $\hat{L}=L+2\ell$, $c=\frac{\mu/\hat{L}}{5+\mu/\hat{L}}$, $\eta_{s}=\eta_{0}\exp(-(s-1)c)\leq O(1)$, and $T_{s}=\frac{212}{\eta_{0}\min(\ell,\mu_{2})}\exp((s-1)c)$. To return $\mathbf{v}_{S}$ such that $\mathbb{E}[\phi(\mathbf{v}_{S})-\phi(\mathbf{v}^{*}_{\phi})]\leq\epsilon$, it suffices to choose $S\geq O\left(\frac{5\hat{L}+\mu}{\mu}\max\left\{\log\left(\frac{2\Delta_{0}}{\epsilon}\right),\ \log S+\log\left[\frac{2\eta_{0}}{\epsilon}\cdot\frac{12\sigma^{2}}{5K}\right]\right\}\right)$. The iteration complexity is $\widetilde{O}\left(\max\left(\frac{\Delta_{0}}{\mu\epsilon\eta_{0}K},\frac{\hat{L}}{\mu^{2}K\epsilon}\right)\right)$. The communication complexity is $\widetilde{O}\left(\frac{K}{\mu}\right)$ by setting $I_{s}=\Theta(1/(K\eta_{s}))$ if $D=0$, and $\widetilde{O}\left(\max\left(\frac{K}{\mu}+\frac{\Delta_{0}^{1/2}}{\mu(\eta_{0}\epsilon)^{1/2}},\ \frac{K}{\mu}+\frac{\hat{L}^{1/2}}{\mu^{3/2}\epsilon^{1/2}}\right)\right)$ by setting $I_{s}=\Theta(1/\sqrt{K\eta_{s}})$ if $D>0$, where $\widetilde{O}$ suppresses logarithmic factors.

Remark. Due to the PL condition, the step size $\eta_{s}$ decreases geometrically. Accordingly, $I_{s}$ increases geometrically per Lemma 1, and it increases at a faster rate when the data are homogeneous than when they are heterogeneous. As a result, the total number of communications in the homogeneous setting is much smaller than in the heterogeneous setting.

5 CODASCA

Although CODA+ significantly reduces the communication complexity for homogeneous data, it still suffers from a high communication complexity for heterogeneous data. Even for homogeneous data, CODA+ has a worse communication complexity, with a dependence on the number of clients $K$, than the NPA algorithm for FDAM with a large batch size.

Can we further reduce the communication complexity of FDAM for both homogeneous and heterogeneous data without using a large batch size?

The main cause of the degeneration in the heterogeneous data setting is the data difference: even at the global optimum $(\mathbf{v}_{*},\alpha_{*})$, the gradients of the local functions on different clients can be different and non-zero. In the homogeneous data setting, different clients still produce different solutions due to stochastic error (cf. the $\eta^{2}\sigma^{2}I$ term of $A_{1}$ in Lemma 1). Together, these contribute to the client drift.
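The drift can be seen in a toy scalar example (our own illustration, not from the paper): with two clients whose local losses are minimized at $v=1$ and $v=-1$, the global minimizer is $v=0$, yet each local gradient there is nonzero, so uncorrected local updates pull each client toward its own minimizer between communications.

```python
# Two hypothetical local objectives f_1(v) = (v - 1)^2 / 2 and
# f_2(v) = (v + 1)^2 / 2; their average is minimized at v* = 0.
def grad(center, v):
    return v - center  # f_k'(v) for f_k(v) = (v - center)^2 / 2

v_star = 0.0
g1, g2 = grad(1.0, v_star), grad(-1.0, v_star)
assert (g1 + g2) / 2 == 0.0       # the global gradient vanishes at v*
assert g1 == -1.0 and g2 == 1.0   # but each local gradient is nonzero
```

This is exactly the bias that CODASCA's control variates are designed to cancel.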

To correct the client drift, we leverage the idea of stochastic controlled averaging due to [karimireddy2019scaffold]. The key idea is to maintain and update control variates that accommodate the client drift and are taken into account when updating the local solutions. In the proposed algorithm CODASCA, we apply control variates to both the primal and dual variables. CODASCA shares the same stagewise framework as CODA+: in each stage, a strongly convex, strongly concave subproblem is constructed and approximately optimized in a distributed fashion. The steps of CODASCA are presented in Algorithm 3 and Algorithm 4. Below, we describe the algorithm within each stage.

Each stage has $R$ communication rounds. Between two consecutive rounds, there are $I$ local updates; each machine $k$ performs the local updates

\begin{split}
&\mathbf{v}_{r,t+1}^{k}=\mathbf{v}^{k}_{r,t}-\eta_{l}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})-c_{\mathbf{v}}^{k}+c_{\mathbf{v}}),\\
&\alpha_{r,t+1}^{k}=\alpha^{k}_{r,t}+\eta_{l}(\nabla_{\alpha}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})-c_{\alpha}^{k}+c_{\alpha}),
\end{split}

where $c_{\mathbf{v}}^{k}$ and $c_{\mathbf{v}}$ are the local and global control variates for the primal variable, and $c^{k}_{\alpha}$ and $c_{\alpha}$ are those for the dual variable. Note that $\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})$ and $\nabla_{\alpha}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})$ are unbiased stochastic gradients of the local objectives, but they are biased estimates of the global gradient when data on different clients are heterogeneous. Intuitively, the terms $-c_{\mathbf{v}}^{k}+c_{\mathbf{v}}$ and $-c_{\alpha}^{k}+c_{\alpha}$ correct the local gradients to bring them closer to the global gradient. They also reduce the variance of the stochastic gradients, which helps reduce the communication complexity in the homogeneous data setting as well.

At each communication round, the primal and dual variables of all clients are aggregated, averaged, and broadcast back to all clients. The control variates at the $r$-th round are updated as

\begin{split}
&c_{\mathbf{v}}^{k}=c_{\mathbf{v}}^{k}-c_{\mathbf{v}}+\frac{1}{I\eta_{l}}(\mathbf{v}_{r-1}-\mathbf{v}^{k}_{r,I}),\\
&c_{\alpha}^{k}=c_{\alpha}^{k}-c_{\alpha}+\frac{1}{I\eta_{l}}(\alpha^{k}_{r,I}-\alpha_{r-1}),
\end{split} \qquad (8)

which is equivalent to

\begin{split}
&c_{\mathbf{v}}^{k}=\frac{1}{I}\sum_{t=1}^{I}\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t}),\\
&c_{\alpha}^{k}=\frac{1}{I}\sum_{t=1}^{I}\nabla_{\alpha}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t}).
\end{split} \qquad (9)

Notice that these are simply the averages of the stochastic gradients used in this round. An alternative is to compute the control variates with a large batch of extra samples at the same point on each client, but this would incur extra cost and is unnecessary. $c_{\mathbf{v}}$ and $c_{\alpha}$ are the averages of $c_{\mathbf{v}}^{k}$ and $c_{\alpha}^{k}$ over all clients. After the local primal and dual variables are averaged, an extrapolation step with $\eta_{g}>1$ is performed; this step boosts the convergence.
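The equivalence between the displacement form (8) and the gradient-average form (9) can be checked numerically. In the sketch below (our own illustration, with random vectors standing in for the stochastic gradients), running $I$ corrected local steps and applying both formulas yields the same new control variate.

```python
import numpy as np

rng = np.random.default_rng(1)
I, eta_l = 8, 0.1
c_k = rng.standard_normal(3)      # local control variate c_v^k
c_glob = rng.standard_normal(3)   # global control variate c_v
v0 = rng.standard_normal(3)       # v_{r-1}, the point broadcast last round

v, grads = v0.copy(), []
for _ in range(I):
    g = rng.standard_normal(3)            # stand-in for grad_v F_k^s(v, alpha; z)
    grads.append(g)
    v = v - eta_l * (g - c_k + c_glob)    # corrected local update

# Form (8): recover the control variate from the net displacement.
c_from_displacement = c_k - c_glob + (v0 - v) / (I * eta_l)
# Form (9): the plain average of the stochastic gradients actually used.
c_from_average = np.mean(grads, axis=0)
assert np.allclose(c_from_displacement, c_from_average)
```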

In order to establish the convergence of CODASCA, we first present a key lemma below.

Lemma 2

(One call of Algorithm 4) Under the same setting as in Theorem 2, with $\tilde{\eta}=\eta_{l}\eta_{g}I\leq\frac{\mu_{2}}{40\ell^{2}}$, for $\mathbf{v}^{\prime}=\arg\min_{\mathbf{v}}f^{s}(\mathbf{v},\alpha_{\tilde{r}})$ and $\alpha^{\prime}=\arg\max_{\alpha}f^{s}(\mathbf{v}_{\tilde{r}},\alpha)$, we have

\begin{split}
\mathbb{E}[f^{s}(\mathbf{v}_{\tilde{r}},\alpha^{\prime})-f^{s}(\mathbf{v}^{\prime},\alpha_{\tilde{r}})]&\leq\frac{2}{\eta_{l}\eta_{g}T}\|\mathbf{v}_{0}-\mathbf{v}^{\prime}\|^{2}+\frac{2}{\eta_{l}\eta_{g}T}(\alpha_{0}-\alpha^{\prime})^{2}\\
&\quad+\underbrace{\frac{10\eta_{l}\sigma^{2}}{\eta_{g}}}_{A_{2}}+\frac{10\eta_{l}\eta_{g}\sigma^{2}}{K},
\end{split}

where $T=I\cdot R$ is the number of iterations in each stage.

Remark. Comparing the above bound with that in Lemma 1, in particular the term $A_{2}$ versus the term $A_{1}$, we can see that CODASCA is not affected by the data heterogeneity $D>0$, and the stochastic variance is also much reduced. As will be seen in the next theorem, the values of $\tilde{\eta}$ and $R$ stay the same across all stages. Therefore, by decreasing the local step size $\eta_{l}$ geometrically, the communication window size $I_{s}$ increases geometrically while ensuring $\tilde{\eta}\leq O(1)$.

Algorithm 3 CODASCA
1:  Initialization: $(\mathbf{v}_{0},\alpha_{0},\gamma)$.
2:  for $s=1,\ldots,S$ do
3:     $\mathbf{v}_{s},\alpha_{s}=\text{DSGSCA}(\mathbf{v}_{s-1},\alpha_{s-1},\eta_{l}^{s},\eta_{g},I_{s},R,\gamma)$;
4:  end for
5:  Return $\mathbf{v}_{S},\alpha_{S}$.
Algorithm 4 DSGSCA(𝐯0,α0,ηl,ηg,I,R,γ\mathbf{v}_{0},\alpha_{0},\eta_{l},\eta_{g},I,R,\gamma)
  Each machine does initialization: 𝐯0,0k=𝐯0,α0,0k=α0\mathbf{v}_{0,0}^{k}=\mathbf{v}_{0},\alpha_{0,0}^{k}=\alpha_{0}, c𝐯k=𝟎c_{\mathbf{v}}^{k}=\mathbf{0}, cαk=0c_{\alpha}^{k}=0
  for r=1,,Rr=1,...,R do
     for t=0,1,,I1t=0,1,...,I-1 do
        Each machine kk updates its local solution in parallel:
            𝐯r,t+1k=𝐯r,tkηl(𝐯Fks(𝐯r,tk,αr,tk;𝐳r,tk)c𝐯k+c𝐯)\mathbf{v}_{r,t+1}^{k}\!=\!\mathbf{v}^{k}_{r,t}\!-\!\eta_{l}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})-c_{\mathbf{v}}^{k}+c_{\mathbf{v}}),
            αr,t+1k=αr,tk+ηl(αFks(𝐯r,tk,αr,tk;𝐳r,tk)cαk+cα)\alpha_{r,t+1}^{k}\!=\!\alpha^{k}_{r,t}\!+\!\eta_{l}(\nabla_{\alpha}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};\mathbf{z}^{k}_{r,t})\!-\!c_{\alpha}^{k}\!+\!c_{\alpha}),
     end for
     c𝐯k=c𝐯kc𝐯+1Iηl(𝐯r1𝐯r,Ik)c_{\mathbf{v}}^{k}=c_{\mathbf{v}}^{k}-c_{\mathbf{v}}+\frac{1}{I\eta_{l}}(\mathbf{v}_{r-1}-\mathbf{v}^{k}_{r,I})
     cαk=cαkcα+1Iηl(αr,Ikαr1)c_{\alpha}^{k}=c_{\alpha}^{k}-c_{\alpha}+\frac{1}{I\eta_{l}}(\alpha^{k}_{r,I}-\alpha_{r-1})
     c𝐯=1Kk=1Kc𝐯kc_{\mathbf{v}}=\frac{1}{K}\sum\limits_{k=1}^{K}c_{\mathbf{v}}^{k}, cα=1Kk=1Kcαkc_{\alpha}=\frac{1}{K}\sum\limits_{k=1}^{K}{}c_{\alpha}^{k} \diamond communicate
     𝐯r=1Kk=1K𝐯r,Ik,αr=1Kk=1Kαr,Ik\mathbf{v}_{r}=\frac{1}{K}\sum\limits_{k=1}^{K}\mathbf{v}^{k}_{r,I},\alpha_{r}=\frac{1}{K}\sum\limits_{k=1}^{K}\alpha^{k}_{r,I} \diamond communicate
     𝐯r=𝐯r1+ηg(𝐯r𝐯r1)\mathbf{v}_{r}=\mathbf{v}_{r-1}+\eta_{g}(\mathbf{v}_{r}-\mathbf{v}_{r-1}),
     αr=αr1+ηg(αrαr1)\alpha_{r}=\alpha_{r-1}+\eta_{g}(\alpha_{r}-\alpha_{r-1})
     Broadcast 𝐯r,αr,c𝐯,cα\mathbf{v}_{r},\alpha_{r},c_{\mathbf{v}},c_{\alpha} \diamond communicate
  end for
  Return 𝐯r~,αr~\mathbf{v}_{\tilde{r}},\alpha_{\tilde{r}} where r~\tilde{r} is randomly sampled from 1,,R1,...,R
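To make the updates concrete, below is a minimal numerical sketch of one call of DSGSCA (corrected local steps, control-variate updates, averaging, and extrapolation) on a toy scalar strongly-convex-strongly-concave saddle problem. The local objectives f_k, the step sizes, and all constants are illustrative assumptions with exact gradients (σ = 0); this is not the paper's deep-network implementation.

```python
import numpy as np

# A minimal sketch of Algorithm 4 (DSGSCA) on a toy scalar saddle problem.
# Local objectives f_k(v, a) = 0.5*(v - a_k)^2 + a*(v - c_k) - 0.5*a^2 are
# illustrative assumptions; gradients are exact (sigma = 0).
def dsgsca(v0, alpha0, a, c, eta_l, eta_g, I, R):
    K = len(a)
    cv = ca = 0.0                        # server control variates c_v, c_alpha
    cvk = np.zeros(K)                    # client control variates c_v^k
    cak = np.zeros(K)                    # client control variates c_alpha^k
    v_prev, alpha_prev = v0, alpha0
    for _ in range(R):
        vk = np.full(K, v_prev)          # broadcast current global iterate
        alk = np.full(K, alpha_prev)
        for _ in range(I):
            gv = (vk - a) + alk          # grad_v f_k
            ga = (vk - c) - alk          # grad_alpha f_k
            vk = vk - eta_l * (gv - cvk + cv)     # corrected descent step
            alk = alk + eta_l * (ga - cak + ca)   # corrected ascent step
        # control variates from the realized local drift (no extra samples)
        cvk = cvk - cv + (v_prev - vk) / (I * eta_l)
        cak = cak - ca + (alk - alpha_prev) / (I * eta_l)
        cv, ca = cvk.mean(), cak.mean()           # communicate
        v_avg, al_avg = vk.mean(), alk.mean()     # communicate
        # extrapolation with global step size eta_g
        v_prev = v_prev + eta_g * (v_avg - v_prev)
        alpha_prev = alpha_prev + eta_g * (al_avg - alpha_prev)
    return v_prev, alpha_prev

a = np.array([0.0, 2.0, 4.0, 6.0])       # heterogeneous client data
c = np.array([1.0, 1.0, 3.0, 3.0])
v_star = (a.mean() + c.mean()) / 2       # saddle point of the averaged objective
alpha_star = v_star - c.mean()
v, alpha = dsgsca(5.0, 5.0, a, c, eta_l=0.02, eta_g=1.0, I=10, R=600)
```

Despite the heterogeneous a_k, the corrected local steps track the global gradient direction, so the averaged iterate approaches the saddle point of the average objective.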

The main convergence result of CODASCA is presented below.

Theorem 2

Define constants L^=L+2\hat{L}=L+2\ell, c=4+24853L^c=4\ell+\frac{248}{53}\hat{L}. Setting ηg=K\eta_{g}=\sqrt{K}, Is=I0exp(2μc+2μ(s1))I_{s}=I_{0}\exp\left(\frac{2\mu}{c+2\mu}(s-1)\right), ηls=η~ηgIs=η~KI0exp(2μc+2μ(s1))\eta_{l}^{s}=\frac{\tilde{\eta}}{\eta_{g}I_{s}}=\frac{\tilde{\eta}}{\sqrt{K}I_{0}}\exp\left(-\frac{2\mu}{c+2\mu}(s-1)\right), R=1000η~μ2R=\frac{1000}{\tilde{\eta}\mu_{2}}, η~min{13+32/μ2,μ2402}\tilde{\eta}\leq\min\{\frac{1}{3\ell+3\ell^{2}/\mu_{2}},\frac{\mu_{2}}{40\ell^{2}}\}. After S=O(max{c+2μ2μlog4ϵ0ϵ,c+2μ2μlog160L^S(c+2μ)ϵη~σ2KI0})S=O(\max\bigg{\{}\frac{c+2\mu}{2\mu}\log\frac{4\epsilon_{0}}{\epsilon},\frac{c+2\mu}{2\mu}\log\frac{160\hat{L}S}{(c+2\mu)\epsilon}\frac{\tilde{\eta}\sigma^{2}}{KI_{0}}\bigg{\}}) stages, the output 𝐯S\mathbf{v}_{S} satisfies 𝔼[ϕ(𝐯S)ϕ(𝐯ϕ)]ϵ\mathbb{E}[\phi(\mathbf{v}_{S})-\phi(\mathbf{v}^{*}_{\phi})]\leq\epsilon. The communication complexity is O~(1μ)\widetilde{O}\left(\frac{1}{\mu}\right). The iteration complexity is O~(max{1μϵ,1μ2Kϵ})\widetilde{O}\left(\max\{\frac{1}{\mu\epsilon},\frac{1}{\mu^{2}K\epsilon}\}\right).

Remark. (i) The number of communications is O~(1μ)\widetilde{O}\left(\frac{1}{\mu}\right), which is independent of the number of clients KK and the accuracy level ϵ\epsilon. This is a significant improvement over CODA+, which has a communication complexity of O~(K/μ+1/(μ3/2ϵ1/2))\widetilde{O}\left(K/\mu+1/({\mu^{3/2}\epsilon^{1/2}})\right) in the heterogeneous setting. Moreover, O~(1/μ)\widetilde{O}\left(1/{\mu}\right) is nearly optimal up to a logarithmic factor, since O(1/μ)O(1/\mu) is the lower bound on the communication complexity of distributed strongly convex optimization [karimireddy2019scaffold, ArjevaniS15], and strong convexity is a stronger condition than the PL condition.

(ii) Each stage has the same number of communication rounds. However, IsI_{s} increases geometrically; therefore, the numbers of iterations and samples per stage increase geometrically. Theoretically, we could also set ηls\eta^{s}_{l} to the same value as in the last stage, in which case IsI_{s} can be set to a fixed large value. But this increases the number of samples to be processed without further speeding up convergence. Our setting of IsI_{s} balances skipping communications against reducing sample complexity. For simplicity, we use a fixed setting of IsI_{s} to compare CODASCA and the baseline CODA+ in our experiments to corroborate the theory.
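The stagewise schedule of IsI_{s} and ηls\eta_{l}^{s} can be sketched as follows; the constants (μ, c, I₀, η̃, K) are placeholder values chosen for illustration, not tuned quantities from Theorem 2.

```python
import math

# Sketch of the stagewise schedule in Theorem 2: eta_l^s decays geometrically
# while I_s grows geometrically, keeping eta_tilde = eta_l^s * eta_g * I_s
# constant across stages. All constants are illustrative placeholders.
def schedule(S, K=16, I0=4, eta_tilde=0.1, mu=0.1, c=1.0):
    eta_g = math.sqrt(K)
    rho = math.exp(2 * mu / (c + 2 * mu))   # per-stage growth factor
    stages = []
    for s in range(1, S + 1):
        I_s = I0 * rho ** (s - 1)           # communication window grows
        eta_l = eta_tilde / (eta_g * I_s)   # local step size shrinks
        stages.append((I_s, eta_l))
    return stages

sched = schedule(5)
```

The invariant ηlsηgIs=η~\eta_{l}^{s}\eta_{g}I_{s}=\tilde{\eta} holds by construction at every stage, which is what keeps RR fixed across stages.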

(iii) The local step size ηl\eta_{l} of CODASCA decreases in a similar way as the step size η\eta in CODA+. However, Is=O(1/(Kηls))I_{s}=O(1/(\sqrt{K}\eta_{l}^{s})) in CODASCA increases at a faster rate than Is=O(1/(Kηs))I_{s}=O(1/(\sqrt{K\eta_{s}})) in CODA+ on heterogeneous data. Notably, unlike CODA+, we do not need Assumption 2, which bounds the client drift. This means CODASCA can be applied to optimizing the global objective even if the local objectives deviate arbitrarily from the global function.

Figure 1: Top row: the testing AUC score of CODASCA vs. # of iterations for different values of II on ImageNet-IH and CIFAR100-IH with imratio = 10% and KK=16, 8 on DenseNet121. Bottom row: the achieved testing AUC vs. different values of II for CODASCA and CODA+. The AUC scores in the legends of the top-row figures represent the AUC score at the last iteration.

6 Experiments

In this section, we first verify the effectiveness of CODASCA compared to CODA+ on various datasets, including two benchmark datasets, i.e., ImageNet and CIFAR100 [imagenet_cvpr09, krizhevsky2009cifar], and a constructed large-scale chest X-ray dataset. Then, we demonstrate the effectiveness of FDAM for improving the performance on a single domain (CheXpert) by using data from multiple sources. For notation, KK denotes the number of “clients” (# of machines, # of data sources) and II denotes the communication window size.

Table 2: Statistics of Medical Chest X-ray Datasets.
Dataset Source Samples
CheXpert Stanford Hospital (US) 224,316
ChestXray8 NIH Clinical Center (US) 112,120
PadChest Hospital San Juan (Spain) 110,641
MIMIC-CXR BIDMC (US) 377,110
ChestXrayAD H108 and HMUH (Vietnam) 15,000

Chest X-ray datasets. Five medical chest X-ray datasets, i.e., CheXpert, ChestXray8, MIMIC-CXR, PadChest, ChestXray-AD [chexpert19, wang2017chestx, johnson2019mimic, bustos2020padchest, nguyen2020vindr], are collected from different organizations. The statistics of these medical datasets are summarized in Table 2. We construct five binary classification tasks for predicting five common diseases, Cardiomegaly (C0), Edema (C1), Consolidation (C2), Atelectasis (C3), P. Effusion (C4), as in the CheXpert competition [chexpert19] for our experiments. These datasets are naturally imbalanced and heterogeneous due to different patient populations, different data collection protocols, etc. We refer to the whole medical dataset as ChestXray-IH.

Imbalanced and Heterogeneous (IH) Benchmark Datasets. For the benchmark datasets, we manually construct imbalanced heterogeneous data. For ImageNet, we first randomly select 500 classes as the positive class and the remaining 500 classes as the negative class. To increase data heterogeneity, we further split the positive/negative classes into KK groups so that each group owns samples from its own classes without overlapping with other groups. To increase the imbalance level, we randomly remove some samples from the positive classes on each machine. Note that due to this operation, the whole sample set differs for different KK. We refer to the proportion of positive samples among all samples as the imbalance ratio (imratioimratio). For CIFAR100, we follow similar steps to construct imbalanced heterogeneous data. We keep the testing/validation sets untouched and balanced. For the imbalance ratio, we explore two values: 10% and 30%. We refer to the constructed datasets as ImageNet-IH (10%), ImageNet-IH (30%), CIFAR100-IH (10%), CIFAR100-IH (30%). Due to limited space, we only report imratio=10% with DenseNet121 and defer the other results to the supplement.
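The construction above can be sketched as follows. The label array, class counts, and random seed are synthetic placeholders for illustration, not the actual ImageNet pipeline.

```python
import numpy as np

# Sketch of the IH construction: split classes into positive/negative halves,
# give each of K clients disjoint classes, then drop positives to reach the
# target imbalance ratio. Labels here are synthetic.
def make_ih_split(labels, K, imratio, rng):
    classes = rng.permutation(np.unique(labels))
    pos_cls, neg_cls = np.array_split(classes, 2)   # binary relabeling
    pos_groups = np.array_split(pos_cls, K)         # disjoint classes per client
    neg_groups = np.array_split(neg_cls, K)
    clients = []
    for pc, nc in zip(pos_groups, neg_groups):
        pos_idx = np.where(np.isin(labels, pc))[0]
        neg_idx = np.where(np.isin(labels, nc))[0]
        # subsample positives so that pos / (pos + neg) == imratio
        n_pos = int(imratio / (1 - imratio) * len(neg_idx))
        pos_idx = rng.choice(pos_idx, size=min(n_pos, len(pos_idx)), replace=False)
        clients.append((pos_idx, neg_idx))
    return clients

rng = np.random.default_rng(0)
labels = rng.integers(0, 100, size=20000)           # 100 synthetic classes
clients = make_ih_split(labels, K=4, imratio=0.10, rng=rng)
```

Each client ends up with classes no other client holds (heterogeneity) and roughly the requested positive fraction (imbalance).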

Parameters and Settings. We train DenseNet121 on all datasets. For the parameters of CODASCA/CODA+, we tune 1/γ1/\gamma in [500, 700, 1000] and η\eta in [0.1, 0.01, 0.001]. For the learning rate schedule, we decay the step size by a factor of 3 every T0T_{0} iterations, where T0T_{0} is tuned in [2000, 3000, 4000]. We experiment with a fixed value of II selected from [1, 32, 64, 128, 512, 1024]. We tune ηg\eta_{g} in [1.1, 1, 0.99, 0.999]. The local batch size is set to 32 on each machine. We run a total of 20000 iterations for all experiments.

6.1 Comparison with CODA+

We plot the testing AUC on ImageNet (10%) vs. # of iterations for CODASCA and CODA+ in Figure 1 (top row) by varying the value of II for different values of KK. Results on CIFAR100 are shown in the supplement. In the bottom row of Figure 1, we plot the achieved testing AUC score vs. different values of II for CODASCA and CODA+. We have the following observations: \bullet~{}CODASCA enjoys a larger communication window size. Comparing CODASCA and CODA+ in the bottom panel of Figure 1, we can see that CODASCA enjoys a larger communication window size than CODA+ without hurting the performance, which is consistent with our theory.

\bullet~{} CODASCA is consistently better for different values of KK. We compare the largest value of II such that the performance does not degenerate too much compared with I=1I=1, denoted by ImaxI_{\max}. From the bottom figures of Figure 1, we can see that the ImaxI_{\max} value of CODASCA on ImageNet is 128 (KK=16) and 512 (KK=8), respectively, while that of CODA+ on ImageNet is 32 (KK=16) and 128 (KK=8). This demonstrates that CODASCA enjoys a consistent advantage over CODA+, i.e., when K=16K=16, ImaxCODASCA/ImaxCODA+=4I^{\text{CODASCA}}_{\max}/I^{\text{CODA+}}_{\max}=4, and when K=8K=8, ImaxCODASCA/ImaxCODA+=4I^{\text{CODASCA}}_{\max}/I^{\text{CODA+}}_{\max}=4. The same phenomena occur on the CIFAR100 data.

Next, we compare CODASCA with CODA+ on the ChestXray-IH medical dataset, which is also highly heterogeneous. We split the ChestXray-IH data into K=16K=16 groups according to patient ID, and each machine only owns samples from one organization without overlapping patients. The testing set is the collection of 5% of the data sampled from each organization. In addition, we use a train/val split of 7:3 for parameter tuning. We run CODASCA and CODA+ for the same number of iterations. The performance on the testing set is reported in Table 3. From the results, we observe that CODASCA performs consistently better than CODA+ on C0, C2, C3, C4.

Table 3: Performance on ChestXray-IH testing set when KK=16.
Method II C0 C1 C2 C3 C4
1 0.8472 0.8499 0.7406 0.7475 0.8688
CODA+ 512 0.8361 0.8464 0.7356 0.7449 0.8680
CODASCA 512 0.8427 0.8457 0.7401 0.7468 0.8680
CODA+ 1024 0.8280 0.8451 0.7322 0.7431 0.8660
CODASCA 1024 0.8363 0.8444 0.7346 0.7481 0.8674
Table 4: Performance of FDAM on Chexpert validation set. The improvement from K=1K=1 to K=5K=5 is significant in light of the top 77 CheXpert leaderboard results that only differ within 0.1%0.1\%.
#of sources C0 C1 C2 C3 C4 AVG
KK=1 0.9007 0.9536 0.9542 0.9090 0.9571 0.9353
KK=2 0.9027 0.9586 0.9542 0.9065 0.9583 0.9361
KK=3 0.9021 0.9558 0.9550 0.9068 0.9583 0.9356
KK=4 0.9055 0.9603 0.9542 0.9072 0.9588 0.9372
KK=5 0.9066 0.9583 0.9544 0.9101 0.9584 0.9376

6.2 FDAM for improving performance on CheXpert

Finally, we show that FDAM can be used to leverage data from multiple hospitals to improve the performance at a single target hospital. For this experiment, we choose the CheXpert data from Stanford Hospital as the target; its validation data is used for evaluating the performance of our FDAM method. Note that improving the AUC score on CheXpert is a very challenging task. The top 7 teams on the CheXpert leaderboard differ by only 0.1% 111https://stanfordmlgroup.github.io/competitions/chexpert/. Hence, we consider any improvement over 0.1%0.1\% significant. Our procedure is as follows: we gradually increase the number of data sources, e.g., K=1K=1 includes only the CheXpert training data, K=2K=2 adds ChestXray8, K=3K=3 further adds PadChest, and so on.

Parameters and Settings. Due to limited computing resources, we resize all images to 320x320. We follow the two-stage method proposed in [robustdeepAUC] and compare with the baseline on a single machine with a single data source (the CheXpert training data, KK=1) for learning DenseNet121. More specifically, we first train a base model by minimizing the cross-entropy loss on the CheXpert training data using Adam with an initial learning rate of 1e-5 and a batch size of 32 for 2 epochs. Then, we discard the trained classifier, use the same pretrained model to initialize the local models on all machines, and continue training using CODASCA. For parameter tuning, we try II=[16, 32, 64, 128], learning rate=[0.1, 0.01], and we fix γ\gamma=1e-3, T0T_{0}=1000 and batch size=32.

Results. We report all results in terms of the AUC score on the CheXpert validation data in Table 4. Overall, we can see that using more data sources from different organizations effectively improves the performance on CheXpert. The average improvement across all 5 classification tasks from K=1K=1 to K=5K=5 is over 0.2%0.2\%, which is significant in light of the top CheXpert leaderboard results. Specifically, CODASCA with KK=5 achieves the highest validation AUC scores on C0 and C3, and with KK=4 the highest on C1 and C4.

7 Conclusion

In this work, we have conducted comprehensive studies of federated learning for deep AUC maximization. We analyzed a stronger baseline for deep AUC maximization by establishing its convergence for both homogeneous and heterogeneous data. We also developed an improved variant by adding control variates to the local stochastic gradients of both the primal and dual variables, which dramatically reduces the communication complexity. Besides strong theoretical guarantees, we also exhibit the power of FDAM on real-world medical imaging problems. We have shown that our FDAM method can improve the performance on medical imaging classification tasks by leveraging data from different organizations that are kept locally.

Appendix A Auxiliary Lemmas

Noting that all algorithms discussed in this paper, including the baselines, implement a stagewise framework, we define the duality gap of the ss-th stage at a point (𝐯,α)(\mathbf{v},\alpha) as

Gaps(𝐯,α)=maxαfs(𝐯,α)min𝐯fs(𝐯,α).\begin{split}Gap_{s}(\mathbf{v},\alpha)=\max\limits_{\alpha^{\prime}}f^{s}(\mathbf{v},\alpha^{\prime})-\min\limits_{\mathbf{v}^{\prime}}f^{s}(\mathbf{v}^{\prime},\alpha).\end{split} (10)
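As a quick numerical illustration of (10), the snippet below evaluates the duality gap of a toy quadratic strongly-convex-strongly-concave function whose inner max and min have closed forms; the constants μ₁, μ₂, ℓ are illustrative placeholders, not quantities from the paper.

```python
# Toy illustration of the duality gap (10) for
# f(v, a) = 0.5*mu1*v**2 + l*v*a - 0.5*mu2*a**2 (placeholder constants).
mu1, mu2, l = 1.0, 0.5, 0.3

def f(v, a):
    return 0.5 * mu1 * v**2 + l * v * a - 0.5 * mu2 * a**2

def duality_gap(v, a):
    a_best = l * v / mu2      # closed-form argmax_a f(v, a)
    v_best = -l * a / mu1     # closed-form argmin_v f(v, a)
    return f(v, a_best) - f(v_best, a)

gap_at_saddle = duality_gap(0.0, 0.0)   # saddle point of this toy f is (0, 0)
gap_off = duality_gap(1.0, -1.0)        # positive away from the saddle
```

The gap vanishes exactly at the saddle point and is strictly positive elsewhere, which is why it serves as a convergence measure in the analysis.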

Before we show the proofs, we first present the lemmas from [yan2020optimal].

Lemma 3 (Lemma 1 of [yan2020optimal])

Suppose a function h(𝐯,α)h(\mathbf{v},\alpha) is λ1\lambda_{1}-strongly convex in 𝐯\mathbf{v} and λ2\lambda_{2}-strongly concave in α\alpha. Consider the following problem

min𝐯XmaxαYh(𝐯,α),\displaystyle\min\limits_{\mathbf{v}\in X}\max\limits_{\alpha\in Y}h(\mathbf{v},\alpha),

where XX and YY are convex compact sets. Denote 𝐯^h(α)=argmin𝐯Xh(𝐯,α)\hat{\mathbf{v}}_{h}(\alpha)=\arg\min\limits_{\mathbf{v}^{\prime}\in X}h(\mathbf{v}^{\prime},\alpha) and α^h(𝐯)=argmaxαYh(𝐯,α)\hat{\alpha}_{h}(\mathbf{v})=\arg\max\limits_{\alpha^{\prime}\in Y}h(\mathbf{v},\alpha^{\prime}). Suppose we have two solutions (𝐯0,α0)(\mathbf{v}_{0},\alpha_{0}) and (𝐯1,α1)(\mathbf{v}_{1},\alpha_{1}). Then the following relation between the variable distance and the duality gap holds

λ14𝐯^h(α1)𝐯02+λ24α^h(𝐯1)α02maxαYh(𝐯0,α)min𝐯Xh(𝐯,α0)+maxαYh(𝐯1,α)min𝐯Xh(𝐯,α1).\displaystyle\begin{split}\frac{\lambda_{1}}{4}\|\hat{\mathbf{v}}_{h}(\alpha_{1})-\mathbf{v}_{0}\|^{2}+\frac{\lambda_{2}}{4}\|\hat{\alpha}_{h}(\mathbf{v}_{1})-\alpha_{0}\|^{2}\leq&\max\limits_{\alpha^{\prime}\in Y}h(\mathbf{v}_{0},\alpha^{\prime})-\min\limits_{\mathbf{v}^{\prime}\in X}h(\mathbf{v}^{\prime},\alpha_{0})\\ &+\max\limits_{\alpha^{\prime}\in Y}h(\mathbf{v}_{1},\alpha^{\prime})-\min\limits_{\mathbf{v}^{\prime}\in X}h(\mathbf{v}^{\prime},\alpha_{1}).\end{split} (11)

\hfill\Box

Lemma 4 (Lemma 5 of [yan2020optimal])

We have the following lower bound for Gaps(𝐯s,αs)\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})

Gaps(𝐯s,αs)350Gaps+1(𝐯0s+1,α0s+1)+45(ϕ(𝐯0s+1)ϕ(𝐯0s)),\displaystyle\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\geq\frac{3}{50}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})+\frac{4}{5}(\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{0}^{s})),

where 𝐯0s+1=𝐯s\mathbf{v}_{0}^{s+1}=\mathbf{v}_{s} and α0s+1=αs\alpha_{0}^{s+1}=\alpha_{s}, i.e., the initialization of (s+1)(s+1)-th stage is the output of the ss-th stage.

\hfill\Box

Appendix B Analysis of CODA+

The proof sketch is similar to the proof of CODA in [dist_auc_guo]. However, there are two noticeable differences from [dist_auc_guo]. First, in Lemma 1, we bound the duality gap instead of the objective gap as in [dist_auc_guo], because the analysis later in this proof requires a bound on the duality gap.

Second, in Lemma 1, the bound for homogeneous data is better than that for heterogeneous data. The better analysis for homogeneous data is inspired by the analysis in [yu_linear], which tackles a minimization problem. Note that fsf^{s} denotes the subproblem for stage ss; we omit the index ss in variables when the context is clear.

B.1 Lemmas

We need the following lemmas for the proof. Lemma 5, Lemma 6 and Lemma 7 are similar to Lemma 3, Lemma 4 and Lemma 5 of [dist_auc_guo], respectively. For the sake of completeness, we include the proofs of Lemma 5 and Lemma 6, since there is a change in the update of the primal variable.

Lemma 5

Define 𝐯¯t=1Kk=1K𝐯tk,α¯t=1Kk=1Kαtk\bar{\mathbf{v}}_{t}=\frac{1}{K}\sum_{k=1}^{K}\mathbf{v}^{k}_{t},\bar{\alpha}_{t}=\frac{1}{K}\sum_{k=1}^{K}\alpha^{k}_{t}. Suppose Assumption 5 holds. Then by running Algorithm 2, we have for any 𝐯,α\mathbf{v},\alpha,

fs(𝐯¯,α)fs(𝐯,α¯)\displaystyle f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha}) 1Tt=1T[𝐯f(𝐯¯t1,α¯t1),𝐯¯t𝐯B1+αf(𝐯¯t1,α¯t1),αα¯tB2\leq\frac{1}{T}\sum\limits_{t=1}^{T}\bigg{[}\underbrace{\langle\nabla_{\mathbf{v}}f(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\!\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle}_{B_{1}}+\underbrace{\langle\nabla_{\alpha}f(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\alpha-\bar{\alpha}_{t}\rangle}_{B_{2}}
+3+32/μ22𝐯¯t𝐯¯t12+2(α¯tα¯t1)2B33𝐯¯t𝐯2μ23(α¯t1α)2],~{}~{}~{}+\underbrace{\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\!+2\ell(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}}_{B_{3}}-\frac{\ell}{3}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{3}(\bar{\alpha}_{t-1}-\alpha)^{2}\bigg{]},

where μ2=2p(1p)\mu_{2}=2p(1-p) is the strong concavity coefficient of f(𝐯,α)f(\mathbf{v},\alpha) in α\alpha.

{proof}

For any 𝐯\mathbf{v} and α\alpha, using Jensen’s inequality and the fact that fs(𝐯,α)f^{s}(\mathbf{v},\alpha) is convex in 𝐯\mathbf{v} and concave in α\alpha,

fs(𝐯¯,α)fs(𝐯,α¯)1Tt=1T(fs(𝐯¯t,α)fs(𝐯,α¯t))\begin{split}&f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha})\leq\frac{1}{T}\sum\limits_{t=1}^{T}\left(f^{s}(\bar{\mathbf{v}}_{t},\alpha)-f^{s}(\mathbf{v},\bar{\alpha}_{t})\right)\\ \end{split} (12)

By the \ell-strong convexity of fs(𝐯,α)f^{s}(\mathbf{v},\alpha) in 𝐯\mathbf{v}, we have

fs(𝐯¯t1,α¯t1)+𝐯fs(𝐯¯t1,α¯t1),𝐯𝐯¯t1+2𝐯¯t1𝐯2fs(𝐯,α¯t1).f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\mathbf{v}-\bar{\mathbf{v}}_{t-1}\rangle+\frac{\ell}{2}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}\leq f^{s}(\mathbf{v},\bar{\alpha}_{t-1}). (13)

By 33\ell-smoothness of fs(𝐯,α)f^{s}(\mathbf{v},\alpha) in 𝐯\mathbf{v}, we have

fs(𝐯¯t,α)fs(𝐯¯t1,α)+𝐯fs(𝐯¯t1,α),𝐯¯t𝐯¯t1+32𝐯¯t𝐯¯t12=fs(𝐯¯t1,α)+𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯¯t1+32𝐯¯t𝐯¯t12+𝐯fs(𝐯¯t1,α)𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯¯t1(a)fs(𝐯¯t1,α)+𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯¯t1+32𝐯¯t𝐯¯t12+|α¯t1α|𝐯¯t𝐯¯t1(b)fs(𝐯¯t1,α)+𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯¯t1+32𝐯¯t𝐯¯t12+μ26(α¯t1α)2+322μ2𝐯¯t𝐯¯t12,\begin{split}f^{s}(\bar{\mathbf{v}}_{t},\alpha)&\leq f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\alpha),\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\rangle+\frac{3\ell}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\\ &=f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\rangle+\frac{3\ell}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\\ &~{}~{}~{}+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)-\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\rangle\\ &\overset{(a)}{\leq}f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\rangle+\frac{3\ell}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\\ &~{}~{}~{}~{}+\ell|\bar{\alpha}_{t-1}-\alpha|\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|\\ &\overset{(b)}{\leq}f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\rangle+\frac{3\ell}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\\ &~{}~{}~{}~{}+\frac{\mu_{2}}{6}(\bar{\alpha}_{t-1}-\alpha)^{2}+\frac{3\ell^{2}}{2\mu_{2}}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2},\\ \end{split} (14)

where (a)(a) holds because 𝐯f(𝐯,α)\partial_{\mathbf{v}}f(\mathbf{v},\alpha) is \ell-Lipschitz in α\alpha since f(𝐯,α)f(\mathbf{v},\alpha) is \ell-smooth, (b)(b) holds by Young’s inequality, and μ2=2p(1p)\mu_{2}=2p(1-p) is the strong concavity coefficient of fsf^{s} in α\alpha.

Adding (13) and (14) and rearranging terms, we have

fs(𝐯¯t1,α¯t1)+fs(𝐯¯t,α)f(𝐯,α¯t1)+f(𝐯¯t1,α)+𝐯f(𝐯¯t1,α¯t1),𝐯¯t𝐯+3+32/μ22𝐯¯t𝐯¯t122𝐯¯t1𝐯2+μ26(α¯t1α)2.\begin{split}&f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})+f^{s}(\bar{\mathbf{v}}_{t},\alpha)\\ &\leq f(\mathbf{v},\bar{\alpha}_{t-1})+f(\bar{\mathbf{v}}_{t-1},\alpha)+\langle\partial_{\mathbf{v}}f(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}\\ &~{}~{}~{}-\frac{\ell}{2}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}+\frac{\mu_{2}}{6}(\bar{\alpha}_{t-1}-\alpha)^{2}.\end{split} (15)

We know fs(𝐯,α)f^{s}(\mathbf{v},\alpha) is μ2\mu_{2}-strongly concave in α\alpha (i.e., fs(𝐯,α)-f^{s}(\mathbf{v},\alpha) is μ2\mu_{2}-strongly convex in α\alpha). Thus, we have

fs(𝐯¯t1,α¯t1)αfs(𝐯¯t1,α¯t1)(αα¯t1)+μ22(αα¯t1)2fs(𝐯¯t1,α).\begin{split}-f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})-\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})^{\top}(\alpha-\bar{\alpha}_{t-1})+\frac{\mu_{2}}{2}(\alpha-\bar{\alpha}_{t-1})^{2}\leq-f^{s}(\bar{\mathbf{v}}_{t-1},\alpha).\end{split} (16)

Since fs(𝐯,α)f^{s}(\mathbf{v},\alpha) is \ell-smooth in α\alpha, we get

fs(𝐯,α¯t)fs(𝐯,α¯t1)αfs(𝐯,α¯t1),α¯tα¯t1+2(α¯tα¯t1)2=fs(𝐯,α¯t1)αfs(𝐯¯t1,α¯t1),α¯tα¯t1+2(α¯tα¯t1)2α(fs(𝐯,α¯t1)fs(𝐯¯t1,α¯t1)),α¯tα¯t1(a)fs(𝐯,α¯t1)αfs(𝐯¯t1,α¯t1),α¯tα¯t1+2(α¯tα¯t1)2+𝐯𝐯¯t1(α¯tα¯t1)fs(𝐯,α¯t1)αfs(𝐯¯t1,α¯t1),α¯tα¯t1+2(α¯tα¯t1)2+6𝐯¯t1𝐯2+32(α¯tα¯t1)2\begin{split}-f^{s}(\mathbf{v},\bar{\alpha}_{t})&\leq-f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\bar{\alpha}_{t-1}\rangle+\frac{\ell}{2}(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ &=-f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-\langle\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\bar{\alpha}_{t-1}\rangle+\frac{\ell}{2}(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ &~{}~{}~{}~{}~{}-\langle\partial_{\alpha}(f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})),\bar{\alpha}_{t}-\bar{\alpha}_{t-1}\rangle\\ &\overset{(a)}{\leq}-f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-\langle\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\bar{\alpha}_{t-1}\rangle+\frac{\ell}{2}(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ &~{}~{}~{}+\ell\|\mathbf{v}-\bar{\mathbf{v}}_{t-1}\|(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})\\ &\leq-f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-\langle\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\bar{\alpha}_{t-1}\rangle+\frac{\ell}{2}(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ &~{}~{}~{}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}+\frac{3\ell}{2}(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ \end{split} (17)

where (a) holds because αfs(𝐯,α)\partial_{\alpha}f^{s}(\mathbf{v},\alpha) is \ell-Lipschitz in 𝐯\mathbf{v}.

Adding (16) and (17) and rearranging terms, we have

fs(𝐯¯t1,α¯t1)fs(𝐯,α¯t)fs(𝐯¯t1,α)fs(𝐯,α¯t1)αfs(𝐯¯t1,α¯t1),α¯tα+2(α¯tα¯t1)2+6𝐯¯t1𝐯2μ22(αα¯t1)2.\begin{split}&-f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})-f^{s}(\mathbf{v},\bar{\alpha}_{t})\leq-f^{s}(\bar{\mathbf{v}}_{t-1},\alpha)-f^{s}(\mathbf{v},\bar{\alpha}_{t-1})-\langle\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\alpha\rangle\\ &~{}~{}~{}~{}~{}~{}+2\ell(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{2}(\alpha-\bar{\alpha}_{t-1})^{2}.\end{split} (18)

Adding (15) and (18), we get

fs(𝐯¯t,α)fs(𝐯,α¯t)𝐯f(𝐯¯t1,α¯t1),𝐯¯t𝐯αf(𝐯¯t1,α¯t1),α¯tα+3+32/μ22𝐯¯t𝐯¯t12+2(α¯tα¯t1)23𝐯¯t1𝐯2μ23(α¯t1α)2\begin{split}&f^{s}(\bar{\mathbf{v}}_{t},\alpha)-f^{s}(\mathbf{v},\bar{\alpha}_{t})\\ &\leq\langle\partial_{\mathbf{v}}f(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle-\langle\partial_{\alpha}f(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\alpha}_{t}-\alpha\rangle\\ &~{}~{}~{}+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}+2\ell(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}\\ &~{}~{}~{}-\frac{\ell}{3}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{3}(\bar{\alpha}_{t-1}-\alpha)^{2}\end{split} (19)

Taking average over t=1,,Tt=1,...,T, we get

fs(𝐯¯,α)fs(𝐯,α¯)1Tt=1T[fs(𝐯¯t,α)fs(𝐯,α¯t)]1Tt=1T[𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯B1+αfs(𝐯¯t1,α¯t1),αα¯tB2+3+32/μ22𝐯¯t𝐯¯t12+2(α¯tα¯t1)2B33𝐯𝐯¯t2μ23(α¯t1α)2]\begin{split}&f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha})\\ &\leq\frac{1}{T}\sum\limits_{t=1}^{T}[f^{s}(\bar{\mathbf{v}}_{t},\alpha)-f^{s}(\mathbf{v},\bar{\alpha}_{t})]\\ &\leq\frac{1}{T}\sum\limits_{t=1}^{T}\bigg{[}\underbrace{\langle\partial_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle}_{B_{1}}+\underbrace{\langle\partial_{\alpha}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\alpha-\bar{\alpha}_{t}\rangle}_{B_{2}}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\underbrace{\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}\|^{2}+2\ell(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}}_{B_{3}}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{\ell}{3}\|\mathbf{v}-\bar{\mathbf{v}}_{t}\|^{2}-\frac{\mu_{2}}{3}(\bar{\alpha}_{t-1}-\alpha)^{2}\bigg{]}\end{split}

In the following, we will bound the term B1B_{1} by Lemma 6, B2B_{2} by Lemma 7 and B3B_{3} by Lemma 8.

Lemma 6

Define 𝐯^t=𝐯¯t1ηKk=1K𝐯fks(𝐯t1k,αt1k)\hat{\mathbf{v}}_{t}=\bar{\mathbf{v}}_{t-1}-\frac{\eta}{K}\sum\limits_{k=1}^{K}\!\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1}) and

𝐯~t=𝐯~t1ηKk=1K(𝐯Fks(𝐯t1k,αt1k;zt1k)𝐯fks(𝐯t1k,αt1k)), for t>0𝐯~0=𝐯0.\begin{split}\tilde{\mathbf{v}}_{t}=\tilde{\mathbf{v}}_{t-1}-\frac{\eta}{K}\sum\limits_{k=1}^{K}\left(\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1};z_{t-1}^{k})-\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})\right),\text{ for $t>0$; }\tilde{\mathbf{v}}_{0}=\mathbf{v}_{0}.\end{split} (20)

We have

B1321Kk=1K(α¯t1αt1k)2+321Kk=1K𝐯¯t1𝐯t1k2+3η21Kk=1K[𝐯fk(𝐯t1k,αt1k)𝐯Fk(𝐯t1k,αt1k;zt1k)]2+1Kk=1K[𝐯fk(𝐯t1k,αt1k)𝐯Fk(𝐯t1k,αt1k;zt1k)],𝐯^t𝐯~t1+12η(𝐯¯t1𝐯2𝐯¯t1𝐯¯t2𝐯¯t𝐯2)+3𝐯¯t𝐯2+12η(𝐯𝐯~t12𝐯𝐯~t2)\begin{split}&B_{1}\leq\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}+\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}\\ &~{}~{}~{}~{}~{}+\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\\ &~{}~{}~{}~{}~{}+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\right\rangle\\ &~{}~{}~{}~{}~{}+\frac{1}{2\eta}(\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{v}}_{t}\|^{2}-\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2})\\ &~{}~{}~{}~{}~{}+\frac{\ell}{3}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}+\frac{1}{2\eta}(\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}-\|\mathbf{v}-\tilde{\mathbf{v}}_{t}\|^{2})\end{split}
{proof}

We have

𝐯fs(𝐯¯t1,α¯t1),𝐯¯t𝐯=1Kk=1K𝐯fks(𝐯¯t1,α¯t1),𝐯¯t𝐯=1Kk=1K[𝐯fks(𝐯¯t1,α¯t1)𝐯fks(𝐯¯t1,αt1k)],𝐯¯t𝐯\small{1}⃝+1Kk=1K[𝐯fks(𝐯¯t1,αt1k)𝐯fks(𝐯t1k,αt1k)],𝐯¯t𝐯\small{2}⃝+1Kk=1K[𝐯fks(𝐯t1k,αt1k)𝐯Fks(𝐯t1k,αt1k;zt1k)],𝐯¯t𝐯\small{3}⃝+1Kk=1K𝐯Fks(𝐯t1k,αt1k;zt1k),𝐯¯t𝐯\small{4}⃝\begin{split}&\langle\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle=\bigg{\langle}\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\bigg{\rangle}\\ &=\bigg{\langle}\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\alpha_{t-1}^{k})],\bar{\mathbf{v}}_{t}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{1}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})],\bar{\mathbf{v}}_{t}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{2}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\bar{\mathbf{v}}_{t}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{3}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k}),\bar{\mathbf{v}}_{t}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{4}⃝\end{split} (21)

Then we will bound \small{1}⃝, \small{2}⃝, \small{3}⃝ and \small{4}⃝, respectively,

\small{1}⃝(a)321Kk=1K[𝐯fks(𝐯¯t1,α¯t1)𝐯fks(𝐯¯t1,αt1k)]2+6𝐯¯t𝐯2(b)321Kk=1K𝐯fks(𝐯¯t1,α¯t1)𝐯fks(𝐯¯t1,αt1k)2+6𝐯¯t𝐯2(c)321Kk=1K(α¯t1αt1k)2+6𝐯¯t𝐯2,\begin{split}\small{1}⃝&\overset{(a)}{\leq}\frac{3}{2\ell}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\alpha_{t-1}^{k})]\right\|^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}\\ &\overset{(b)}{\leq}\frac{3}{2\ell}\frac{1}{K}\sum\limits_{k=1}^{K}\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\alpha_{t-1}^{k})\|^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}\\ &\overset{(c)}{\leq}\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2},\\ \end{split} (22)

where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality, and (c) holds because 𝐯fks(𝐯,α)\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v},\alpha) is \ell-Lipschitz in α\alpha. Using similar techniques, we have

\begin{split}
②&\leq\frac{3}{2\ell}\frac{1}{K}\sum\limits_{k=1}^{K}\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{t-1},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})\|^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}\\
&\leq\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}+\frac{\ell}{6}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}.
\end{split} (23)
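For completeness, the weighted form of Young's inequality invoked in step (a) of (22), and again in (23), is the following standard fact (for any vectors $a,b$ and any $c>0$):

```latex
\langle a,b\rangle \le \frac{1}{2c}\|a\|^{2}+\frac{c}{2}\|b\|^{2},
\qquad\text{here applied with } c=\frac{\ell}{3}:\quad
\langle a,b\rangle \le \frac{3}{2\ell}\|a\|^{2}+\frac{\ell}{6}\|b\|^{2}.
```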

Let $\hat{\mathbf{v}}_{t}=\arg\min\limits_{\mathbf{v}}\left(\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1})\right)^{\top}\mathbf{v}+\frac{1}{2\eta}\|\mathbf{v}-\bar{\mathbf{v}}_{t-1}\|^{2}$; then we have

\begin{split}
\bar{\mathbf{v}}_{t}-\hat{\mathbf{v}}_{t}=\eta\bigg(\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1})-\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1};z_{t-1}^{k})\bigg)
\end{split} (24)

Hence we get

\begin{split}
③&=\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\bar{\mathbf{v}}_{t}-\hat{\mathbf{v}}_{t}\right\rangle\\
&\quad+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\mathbf{v}\right\rangle\\
&=\eta\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\\
&\quad+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\mathbf{v}\right\rangle
\end{split} (25)

Define another auxiliary sequence as

\begin{split}
\tilde{\mathbf{v}}_{t}=\tilde{\mathbf{v}}_{t-1}-\frac{\eta}{K}\sum\limits_{k=1}^{K}\left(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})\right),\text{ for $t>0$; }\tilde{\mathbf{v}}_{0}=\mathbf{v}_{0}.
\end{split} (26)

Denote

\Theta_{t-1}(\mathbf{v})=\left(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\right)^{\top}\mathbf{v}+\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}. (27)

Hence, for the auxiliary sequence $\tilde{\mathbf{v}}_{t}$, we can verify that

\tilde{\mathbf{v}}_{t}=\arg\min\limits_{\mathbf{v}}\Theta_{t-1}(\mathbf{v}). (28)

Since $\Theta_{t-1}(\mathbf{v})$ is $\frac{1}{\eta}$-strongly convex, we have

\begin{split}
&\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t}\|^{2}\leq\Theta_{t-1}(\mathbf{v})-\Theta_{t-1}(\tilde{\mathbf{v}}_{t})\\
&=\bigg(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg)^{\top}\mathbf{v}+\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}\\
&\quad-\bigg(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg)^{\top}\tilde{\mathbf{v}}_{t}-\frac{1}{2\eta}\|\tilde{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\|^{2}\\
&=\bigg(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg)^{\top}(\mathbf{v}-\tilde{\mathbf{v}}_{t-1})+\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}\\
&\quad-\bigg(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg)^{\top}(\tilde{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1})-\frac{1}{2\eta}\|\tilde{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\|^{2}\\
&\leq\bigg(-\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg)^{\top}(\mathbf{v}-\tilde{\mathbf{v}}_{t-1})+\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}\\
&\quad+\frac{\eta}{2}\bigg\|\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg\|^{2}
\end{split} (29)

where the last inequality uses $\langle a,u\rangle-\frac{1}{2\eta}\|u\|^{2}\leq\frac{\eta}{2}\|a\|^{2}$.

Adding this with (25), we get

\begin{split}
③\leq&\frac{3\eta}{2}\bigg\|\frac{1}{K}\sum\limits_{k=1}^{K}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k}))\bigg\|^{2}+\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}-\frac{1}{2\eta}\|\mathbf{v}-\tilde{\mathbf{v}}_{t}\|^{2}\\
&+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\right\rangle
\end{split} (30)

④ can be bounded as

④=-\frac{1}{\eta}\langle\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1},\bar{\mathbf{v}}_{t}-\mathbf{v}\rangle=\frac{1}{2\eta}(\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{v}}_{t}\|^{2}-\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}) (31)
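The second equality in (31) is the standard three-point identity, stated here for completeness:

```latex
-2\langle a-b,\;a-c\rangle=\|b-c\|^{2}-\|a-b\|^{2}-\|a-c\|^{2},
\qquad a=\bar{\mathbf{v}}_{t},\; b=\bar{\mathbf{v}}_{t-1},\; c=\mathbf{v},
```

combined with $\bar{\mathbf{v}}_{t}-\bar{\mathbf{v}}_{t-1}=-\frac{\eta}{K}\sum_{k=1}^{K}\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{t-1},\alpha^{k}_{t-1};z^{k}_{t-1})$ from the averaged update rule.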

Plugging (22), (23), (30) and (31) into (21), we get

\begin{split}
&\left\langle\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{t-1},\bar{\alpha}_{t-1}),\bar{\mathbf{v}}_{t}-\mathbf{v}\right\rangle\\
&\leq\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}+\frac{3\ell}{2}\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}\\
&\quad+\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\\
&\quad+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\right\rangle\\
&\quad+\frac{1}{2\eta}(\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{v}}_{t}\|^{2}-\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2})\\
&\quad+\frac{\ell}{3}\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}+\frac{1}{2\eta}(\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}-\|\mathbf{v}-\tilde{\mathbf{v}}_{t}\|^{2})
\end{split}

$B_{2}$ can be bounded by the following lemma, whose proof is identical to that of Lemma 5 in [dist_auc_guo].

Lemma 7

Define $\hat{\alpha}_{t}=\bar{\alpha}_{t-1}+\frac{\eta}{K}\sum\limits_{k=1}^{K}\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})$, and

\tilde{\alpha}_{t}=\tilde{\alpha}_{t-1}+\frac{\eta}{K}\sum\limits_{k=1}^{K}(\nabla_{\alpha}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})-\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})).

We have,

\begin{split}
B_{2}&\leq\frac{3\ell^{2}}{2\mu_{2}}\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}+\frac{3\ell^{2}}{2\mu_{2}}\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}\\
&\quad+\frac{3\eta}{2}\left(\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\alpha}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right)^{2}\\
&\quad+\left\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\alpha}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\tilde{\alpha}_{t-1}-\hat{\alpha}_{t}\right\rangle\\
&\quad+\frac{1}{2\eta}((\bar{\alpha}_{t-1}-\alpha)^{2}-(\bar{\alpha}_{t-1}-\bar{\alpha}_{t})^{2}-(\bar{\alpha}_{t}-\alpha)^{2})\\
&\quad+\frac{\mu_{2}}{3}(\bar{\alpha}_{t}-\alpha)^{2}+\frac{1}{2\eta}(\alpha-\tilde{\alpha}_{t-1})^{2}-\frac{1}{2\eta}(\alpha-\tilde{\alpha}_{t})^{2}.
\end{split}

\hfill\Box

$B_{3}$ can be bounded by the following lemma.

Lemma 8

If $K$ machines communicate every $I$ iterations, where $I\leq\frac{1}{18\sqrt{2}\eta\ell}$, then

\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\bar{\mathbf{v}}_{t}-\mathbf{v}_{t}^{k}\|^{2}+\|\bar{\alpha}_{t}-\alpha_{t}^{k}\|^{2}\right]\leq\left(12\eta^{2}I\sigma^{2}T+36\eta^{2}I^{2}D^{2}T\right)\mathbb{I}_{I>1}
{proof}

In this proof, we introduce two shorthand notations to keep the presentation concise: $F^{s}_{k,t}=F^{s}_{k}(\mathbf{v}^{k}_{t},\alpha^{k}_{t};z_{t}^{k})$ and $f^{s}_{k,t}=f^{s}_{k}(\mathbf{v}^{k}_{t},\alpha^{k}_{t})$. Similar bounds for minimization problems have been analyzed in [yu_linear, stich2018local].

Denote by $t_{0}$ the nearest communication round before $t$, i.e., $t-t_{0}\leq I$. By the update rule of $\mathbf{v}$, we have that on each machine $k$,

\begin{split}
\mathbf{v}_{t}^{k}=\bar{\mathbf{v}}_{t_{0}}-\eta\sum\limits_{\tau=t_{0}}^{t-1}\nabla_{\mathbf{v}}F^{s}_{k,\tau}.
\end{split} (32)

Taking the average over all $K$ machines,

\begin{split}
\bar{\mathbf{v}}_{t}=\bar{\mathbf{v}}_{t_{0}}-\eta\sum\limits_{\tau=t_{0}}^{t-1}\frac{1}{K}\sum\limits_{k=1}^{K}\nabla_{\mathbf{v}}F^{s}_{k,\tau}.
\end{split} (33)

Therefore,

\begin{split}
&\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\bar{\mathbf{v}}_{t}-\mathbf{v}_{t}^{k}\|^{2}\right]=\frac{\eta^{2}}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}\left[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}F^{s}_{j,\tau}\right]\right\|^{2}\right]\\
&\leq\frac{2\eta^{2}}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}\left[[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}]-\frac{1}{K}\sum\limits_{j=1}^{K}[\nabla_{\mathbf{v}}F^{s}_{j,\tau}-\nabla_{\mathbf{v}}f^{s}_{j,\tau}]\right]\right\|^{2}\right]\\
&\quad+\frac{2\eta^{2}}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}\left[\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\right]\right\|^{2}\right]
\end{split} (34)

In the following, we will address these two terms on the right hand side separately. First, we have

\begin{split}
&\frac{2\eta^{2}}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}\left[[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}]-\frac{1}{K}\sum\limits_{j=1}^{K}[\nabla_{\mathbf{v}}F^{s}_{j,\tau}-\nabla_{\mathbf{v}}f^{s}_{j,\tau}]\right]\right\|^{2}\right]\\
&\overset{(a)}{\leq}\frac{2\eta^{2}}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}]\right\|^{2}\right]\\
&\overset{(b)}{=}\frac{2\eta^{2}}{K}\sum\limits_{k=1}^{K}\sum\limits_{\tau=t_{0}}^{t-1}\mathbb{E}\left[\|\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}\|^{2}\right]\\
&\leq 2\eta^{2}I\sigma^{2},
\end{split} (35)

where $(a)$ holds by $\frac{1}{K}\sum\limits_{k=1}^{K}\|a_{k}-\frac{1}{K}\sum\limits_{j=1}^{K}a_{j}\|^{2}=\frac{1}{K}\sum\limits_{k=1}^{K}\|a_{k}\|^{2}-\|\frac{1}{K}\sum\limits_{k=1}^{K}a_{k}\|^{2}\leq\frac{1}{K}\sum\limits_{k=1}^{K}\|a_{k}\|^{2}$ with $a_{k}=\sum\limits_{\tau=t_{0}}^{t-1}[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}]$; $(b)$ follows because $\mathbb{E}_{k,\tau-1}[\nabla_{\mathbf{v}}F^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k,\tau}]=0$, so the cross terms vanish.
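As a quick numerical sanity check (illustrative only, not part of the proof), the variance identity in $(a)$ can be verified for random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 5
a = rng.standard_normal((K, d))   # a_k, k = 1..K
a_bar = a.mean(axis=0)            # (1/K) sum_j a_j

# (1/K) sum_k ||a_k - a_bar||^2
lhs = np.mean(np.sum((a - a_bar) ** 2, axis=1))
# (1/K) sum_k ||a_k||^2 - ||a_bar||^2
rhs = np.mean(np.sum(a ** 2, axis=1)) - np.sum(a_bar ** 2)

assert np.isclose(lhs, rhs)                              # exact identity
assert lhs <= np.mean(np.sum(a ** 2, axis=1)) + 1e-12    # dropping -||a_bar||^2 gives the bound
```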

Second, we have

\begin{split}
&\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\sum\limits_{\tau=t_{0}}^{t-1}\left[\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\right]\right\|^{2}\right]\\
&\leq\frac{1}{K}\sum\limits_{k=1}^{K}(t-t_{0})\sum\limits_{\tau=t_{0}}^{t-1}\mathbb{E}\left[\left\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\right\|^{2}\right]\\
&\leq I\sum\limits_{\tau=t_{0}}^{t-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\left\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\right\|^{2}\right],
\end{split} (36)

where

\begin{split}
&\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\right\|^{2}\\
&=\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\bigg\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})+\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})+\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\frac{1}{K}\sum\limits_{j=1}^{K}\nabla_{\mathbf{v}}f^{s}_{j,\tau}\bigg\|^{2}\\
&\leq\frac{1}{K}\sum\limits_{k=1}^{K}\bigg[3\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}+3\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}\bigg]\\
&\quad+3\mathbb{E}\bigg\|\frac{1}{K}\sum\limits_{j=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{j}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}_{j,\tau}]\bigg\|^{2}\\
&\leq\frac{1}{K}\sum\limits_{k=1}^{K}\bigg[3\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{k,\tau}-\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}+3\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}\bigg]\\
&\quad+\frac{3}{K}\sum\limits_{j=1}^{K}\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{j}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}_{j,\tau}\|^{2}\\
&\overset{(a)}{\leq}\frac{54\ell^{2}}{K}\sum\limits_{k=1}^{K}\left[\|\mathbf{v}^{k}_{\tau}-\bar{\mathbf{v}}_{\tau}\|^{2}+|\alpha^{k}_{\tau}-\bar{\alpha}_{\tau}|^{2}\right]+\frac{3}{K}\sum\limits_{k=1}^{K}\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}\\
&\leq\frac{54\ell^{2}}{K}\sum\limits_{k=1}^{K}\left[\|\mathbf{v}^{k}_{\tau}-\bar{\mathbf{v}}_{\tau}\|^{2}+|\alpha^{k}_{\tau}-\bar{\alpha}_{\tau}|^{2}\right]+3D^{2},
\end{split} (37)

where $(a)$ holds because $f$ is $\ell$-smooth and hence $f^{s}$ is $3\ell$-smooth, and the last inequality uses the bounded gradient dissimilarity $\|\nabla_{\mathbf{v}}f^{s}_{k}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})-\nabla_{\mathbf{v}}f^{s}(\bar{\mathbf{v}}_{\tau},\bar{\alpha}_{\tau})\|^{2}\leq D^{2}$.

Combining (34), (35), (36) and (37),

\begin{split}
\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\bar{\mathbf{v}}_{t}-\mathbf{v}_{t}^{k}\|^{2}\right]\leq 2\eta^{2}I\sigma^{2}+2\eta^{2}I\sum\limits_{\tau=t_{0}}^{t-1}\left[\frac{54\ell^{2}}{K}\sum\limits_{k=1}^{K}\left[\|\mathbf{v}^{k}_{\tau}-\bar{\mathbf{v}}_{\tau}\|^{2}+|\alpha^{k}_{\tau}-\bar{\alpha}_{\tau}|^{2}\right]+3D^{2}\right]
\end{split} (38)

Summing over $t\in\{0,\dots,T-1\}$,

\begin{split}
\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\bar{\mathbf{v}}_{t}-\mathbf{v}_{t}^{k}\|^{2}\right]\leq 2\eta^{2}I\sigma^{2}T+108\eta^{2}I^{2}\ell^{2}\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\mathbf{v}^{k}_{t}-\bar{\mathbf{v}}_{t}\|^{2}+|\alpha^{k}_{t}-\bar{\alpha}_{t}|^{2}\right]+6\eta^{2}I^{2}D^{2}T.
\end{split} (39)

Similarly, for the $\alpha$ side we have

\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[|\bar{\alpha}_{t}-\alpha_{t}^{k}|^{2}\right]\leq 2\eta^{2}I\sigma^{2}T+108\eta^{2}I^{2}\ell^{2}\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\mathbf{v}_{t}^{k}-\bar{\mathbf{v}}_{t}\|^{2}+|\alpha_{t}^{k}-\bar{\alpha}_{t}|^{2}\right]+6\eta^{2}I^{2}D^{2}T. (40)

Summing up the above two inequalities,

\begin{split}
\sum\limits_{t=0}^{T-1}\frac{1}{K}\sum\limits_{k=1}^{K}\mathbb{E}\left[\|\bar{\mathbf{v}}_{t}-\mathbf{v}_{t}^{k}\|^{2}+|\bar{\alpha}_{t}-\alpha_{t}^{k}|^{2}\right]&\leq\frac{4\eta^{2}I\sigma^{2}}{1-216\eta^{2}I^{2}\ell^{2}}T+\frac{12\eta^{2}I^{2}D^{2}}{1-216\eta^{2}I^{2}\ell^{2}}T\\
&\leq 12\eta^{2}I\sigma^{2}T+36\eta^{2}I^{2}D^{2}T,
\end{split} (41)

where the second inequality is due to $I\leq\frac{1}{18\sqrt{2}\eta\ell}$, which gives $216\eta^{2}I^{2}\ell^{2}\leq\frac{216}{(18\sqrt{2})^{2}}=\frac{1}{3}$, i.e., $1-216\eta^{2}I^{2}\ell^{2}\geq\frac{2}{3}$.
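The arithmetic behind this last step can be checked directly; the sketch below (with arbitrary illustrative values of $\eta$ and $\ell$, since the ratio is scale-free) confirms that at the largest admissible $I$ the product $216\eta^{2}I^{2}\ell^{2}$ equals exactly $1/3$:

```python
import math

eta, ell = 0.01, 2.0                      # arbitrary positive values
I = 1 / (18 * math.sqrt(2) * eta * ell)   # largest I allowed by the lemma

# 216 * (eta * I * ell)^2 = 216 / (18*sqrt(2))^2 = 216 / 648 = 1/3
assert math.isclose(216 * (eta * I * ell) ** 2, 1 / 3)
assert 1 - 216 * (eta * I * ell) ** 2 >= 2 / 3 - 1e-12
```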

Based on the above lemmas, we are ready to establish the convergence of the duality gap in one stage of CODA+.

B.2 Proof of Lemma 1

{proof}

Noting that $\mathbb{E}\big\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\hat{\mathbf{v}}_{t}-\tilde{\mathbf{v}}_{t-1}\big\rangle=0$ and $\mathbb{E}\big\langle\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\alpha}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})],\tilde{\alpha}_{t-1}-\hat{\alpha}_{t}\big\rangle=0$, plugging Lemma 6 and Lemma 7 into Lemma 5 and taking expectation, we get

\begin{split}
&\mathbb{E}[f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha})]\\
&\leq\frac{1}{T}\sum\limits_{t=1}^{T}\mathbb{E}\Bigg[\underbrace{\left(\frac{3\ell+3\ell^{2}/\mu_{2}}{2}-\frac{1}{2\eta}\right)\|\bar{\mathbf{v}}_{t-1}-\bar{\mathbf{v}}_{t}\|^{2}+\left(2\ell-\frac{1}{2\eta}\right)(\bar{\alpha}_{t}-\bar{\alpha}_{t-1})^{2}}_{C_{1}}\\
&\quad+\underbrace{\left(\frac{1}{2\eta}-\frac{\mu_{2}}{3}\right)(\bar{\alpha}_{t-1}-\alpha)^{2}-\left(\frac{1}{2\eta}-\frac{\mu_{2}}{3}\right)(\bar{\alpha}_{t}-\alpha)^{2}}_{C_{2}}+\underbrace{\left(\frac{1}{2\eta}-\frac{\ell}{3}\right)\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}\|^{2}-\left(\frac{1}{2\eta}-\frac{\ell}{3}\right)\|\bar{\mathbf{v}}_{t}-\mathbf{v}\|^{2}}_{C_{3}}\\
&\quad+\underbrace{\frac{1}{2\eta}\big((\alpha-\tilde{\alpha}_{t-1})^{2}-(\alpha-\tilde{\alpha}_{t})^{2}\big)}_{C_{4}}+\underbrace{\frac{1}{2\eta}\big(\|\mathbf{v}-\tilde{\mathbf{v}}_{t-1}\|^{2}-\|\mathbf{v}-\tilde{\mathbf{v}}_{t}\|^{2}\big)}_{C_{5}}\\
&\quad+\underbrace{\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}+\left(\frac{3\ell}{2}+\frac{3\ell^{2}}{2\mu_{2}}\right)\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}}_{C_{6}}\\
&\quad+\underbrace{\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}}_{C_{7}}\\
&\quad+\underbrace{\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\alpha}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\alpha}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}}_{C_{8}}\Bigg]
\end{split} (42)

Since $\eta\leq\min\big(\frac{1}{3\ell+3\ell^{2}/\mu_{2}},\frac{1}{4\ell}\big)$, the term $C_{1}$ on the RHS of (42) is non-positive and can be dropped. $C_{2}$, $C_{3}$, $C_{4}$ and $C_{5}$ are handled by a telescoping sum, and $C_{6}$ is bounded by Lemma 8.

Taking expectation over $C_{7}$,

\begin{split}
&\mathbb{E}\left[\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\right]\\
&=\mathbb{E}\left[\frac{3\eta}{2K^{2}}\left\|\sum\limits_{k=1}^{K}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\right]\\
&=\mathbb{E}\bigg[\frac{3\eta}{2K^{2}}\bigg(\sum\limits_{k=1}^{K}\|\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})\|^{2}\\
&\qquad+2\sum\limits_{k=1}^{K}\sum\limits_{j=k+1}^{K}\big\langle\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k}),\nabla_{\mathbf{v}}f^{s}_{j}(\mathbf{v}_{t-1}^{j},\alpha_{t-1}^{j})-\nabla_{\mathbf{v}}F^{s}_{j}(\mathbf{v}_{t-1}^{j},\alpha_{t-1}^{j};z_{t-1}^{j})\big\rangle\bigg)\bigg]\\
&\leq\frac{3\eta\sigma^{2}}{2K}.
\end{split} (43)

The last inequality holds because $\|\nabla_{\mathbf{v}}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})\|^{2}\leq\sigma^{2}$ and $\mathbb{E}\langle\nabla_{\mathbf{v}}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\mathbf{v}}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k}),\nabla_{\mathbf{v}}f_{j}(\mathbf{v}_{t-1}^{j},\alpha_{t-1}^{j})-\nabla_{\mathbf{v}}F_{j}(\mathbf{v}_{t-1}^{j},\alpha_{t-1}^{j};z_{t-1}^{j})\rangle=0$ for any $k\neq j$, since each machine draws data independently. Similarly, taking expectation over $C_{8}$, we have

\begin{split}
\mathbb{E}\left[\frac{3\eta}{2}\left\|\frac{1}{K}\sum\limits_{k=1}^{K}[\nabla_{\alpha}f_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k})-\nabla_{\alpha}F_{k}(\mathbf{v}_{t-1}^{k},\alpha_{t-1}^{k};z_{t-1}^{k})]\right\|^{2}\right]\leq\frac{3\eta\sigma^{2}}{2K}.
\end{split} (44)
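The $\sigma^{2}/K$ scaling in (43) and (44) is the usual variance reduction from averaging independent zero-mean noise across machines; a small Monte Carlo illustration (not part of the proof, with arbitrary dimensions) is:

```python
import numpy as np

# The squared norm of an average of K independent zero-mean noise vectors
# scales like sigma^2 / K, because all cross terms vanish in expectation.
rng = np.random.default_rng(0)
K, d, trials = 16, 4, 20000
noise = rng.standard_normal((trials, K, d))   # unit-variance entries, so sigma^2 = d
avg = noise.mean(axis=1)                      # (1/K) sum_k noise_k
est = np.mean(np.sum(avg ** 2, axis=1))       # Monte Carlo estimate of E||avg||^2
sigma2 = d
assert abs(est - sigma2 / K) < 0.05 * sigma2 / K   # matches sigma^2 / K up to MC error
```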

Plugging (43) and (44) into (42) and taking expectation, it yields

\begin{split}
&\mathbb{E}[f^{s}(\bar{\mathbf{v}},\alpha)-f^{s}(\mathbf{v},\bar{\alpha})]\\
&\leq\mathbb{E}\bigg\{\frac{1}{T}\left(\frac{1}{2\eta}-\frac{\ell}{3}\right)\|\bar{\mathbf{v}}_{0}-\mathbf{v}\|^{2}+\frac{1}{2\eta T}\|\tilde{\mathbf{v}}_{0}-\mathbf{v}\|^{2}+\frac{1}{T}\left(\frac{1}{2\eta}-\frac{\mu_{2}}{3}\right)(\bar{\alpha}_{0}-\alpha)^{2}+\frac{1}{2\eta T}(\tilde{\alpha}_{0}-\alpha)^{2}\\
&\quad+\frac{1}{T}\sum\limits_{t=1}^{T}\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\frac{1}{K}\sum\limits_{k=1}^{K}\|\bar{\mathbf{v}}_{t-1}-\mathbf{v}_{t-1}^{k}\|^{2}+\frac{1}{T}\sum\limits_{t=1}^{T}\left(\frac{3\ell}{2}+\frac{3\ell^{2}}{2\mu_{2}}\right)\frac{1}{K}\sum\limits_{k=1}^{K}(\bar{\alpha}_{t-1}-\alpha_{t-1}^{k})^{2}+\frac{1}{T}\sum\limits_{t=1}^{T}\frac{3\eta\sigma^{2}}{K}\bigg\}\\
&\leq\frac{1}{\eta T}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\eta T}(\alpha_{0}-\alpha)^{2}+\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)(12\eta^{2}I\sigma^{2}+36\eta^{2}I^{2}D^{2})\mathbb{I}_{I>1}+\frac{3\eta\sigma^{2}}{K},
\end{split}

where the last inequality uses Lemma 8 together with the initializations $\tilde{\mathbf{v}}_{0}=\mathbf{v}_{0}=\bar{\mathbf{v}}_{0}$ and $\tilde{\alpha}_{0}=\alpha_{0}=\bar{\alpha}_{0}$.

B.3 Main Proof of Theorem 1

{proof}

Since $f(\mathbf{v},\alpha)$ is $\ell$-smooth (thus $\ell$-weakly convex) in $\mathbf{v}$ for any $\alpha$, $\phi(\mathbf{v})=\max\limits_{\alpha'}f(\mathbf{v},\alpha')$ is also $\ell$-weakly convex. Taking $\gamma=2\ell$, we have

\begin{split}
\phi(\mathbf{v}_{s-1})&\geq\phi(\mathbf{v}_{s})+\langle\partial\phi(\mathbf{v}_{s}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle-\frac{\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\
&=\phi(\mathbf{v}_{s})+\langle\partial\phi(\mathbf{v}_{s})+2\ell(\mathbf{v}_{s}-\mathbf{v}_{s-1}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\
&\overset{(a)}{=}\phi(\mathbf{v}_{s})+\langle\partial\phi_{s}(\mathbf{v}_{s}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\
&\overset{(b)}{=}\phi(\mathbf{v}_{s})-\frac{1}{2\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\rangle+\frac{3}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\|^{2}\\
&=\phi(\mathbf{v}_{s})-\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{1}{4\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi(\mathbf{v}_{s})\rangle+\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2},
\end{split} (45)

where $(a)$ and $(b)$ hold by the definition of $\phi_{s}(\mathbf{v})$, which gives $\partial\phi_{s}(\mathbf{v}_{s})=\partial\phi(\mathbf{v}_{s})+2\ell(\mathbf{v}_{s}-\mathbf{v}_{s-1})$, i.e., $\mathbf{v}_{s-1}-\mathbf{v}_{s}=-\frac{1}{2\ell}(\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s}))$.

Rearranging the terms in (45) yields

\begin{split}
\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{s-1})&\leq\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\frac{1}{4\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi(\mathbf{v}_{s})\rangle-\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\
&\overset{(a)}{\leq}\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\frac{1}{8\ell}(\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\|\partial\phi(\mathbf{v}_{s})\|^{2})-\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\
&=\frac{1}{4\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{1}{4\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\
&\overset{(b)}{\leq}\frac{1}{4\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{\mu}{2\ell}(\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{*})),
\end{split} (46)

where $(a)$ holds by using $\langle\mathbf{a},\mathbf{b}\rangle\leq\frac{1}{2}(\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2})$, and $(b)$ holds by the $\mu$-PL property of $\phi(\mathbf{v})$, i.e., $\|\partial\phi(\mathbf{v}_{s})\|^{2}\geq 2\mu(\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{*}))$.

Thus, we have

\left(4\ell+2\mu\right)(\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{*}))-4\ell(\phi(\mathbf{v}_{s-1})-\phi(\mathbf{v}_{*}))\leq\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}. (47)

Since $\gamma=2\ell$, $f^{s}(\mathbf{v},\alpha)$ is $\ell$-strongly convex in $\mathbf{v}$ and $\mu_{2}=2p(1-p)$-strongly concave in $\alpha$. Applying Lemma 3 to $f^{s}$, we know that

\frac{\ell}{4}\|\hat{\mathbf{v}}_{s}(\alpha_{s})-\mathbf{v}_{0}^{s}\|^{2}+\frac{\mu_{2}}{4}\|\hat{\alpha}_{s}(\mathbf{v}_{s})-\alpha_{0}^{s}\|^{2}\leq\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s}). (48)

By the setting of $\eta_{s}=\eta_{0}\exp\big(-(s-1)\frac{2\mu}{c+2\mu}\big)$ and $T_{s}=\frac{212}{\eta_{0}\min\{\ell,\mu_{2}\}}\exp\big((s-1)\frac{2\mu}{c+2\mu}\big)$, we note that $\frac{1}{\eta_{s}T_{s}}\leq\frac{\min\{\ell,\mu_{2}\}}{212}$. Set $I_{s}$ such that $\big(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\big)(12\eta_{s}^{2}I_{s}\sigma^{2}+36\eta_{s}^{2}I_{s}^{2}D^{2})\leq\frac{\eta_{s}\sigma^{2}}{K}$, where the specific choice of $I_{s}$ will be made later. Applying Lemma 1 with $\hat{\mathbf{v}}_{s}(\alpha_{s})=\arg\min\limits_{\mathbf{v}'}f^{s}(\mathbf{v}',\alpha_{s})$ and $\hat{\alpha}_{s}(\mathbf{v}_{s})=\arg\max\limits_{\alpha'}f^{s}(\mathbf{v}_{s},\alpha')$, we have

\begin{split}
\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]&\leq\frac{4\eta_{s}\sigma^{2}}{K}+\frac{1}{53}\mathbb{E}\left[\frac{\ell}{4}\|\hat{\mathbf{v}}_{s}(\alpha_{s})-\mathbf{v}_{0}^{s}\|^{2}+\frac{\mu_{2}}{4}\|\hat{\alpha}_{s}(\mathbf{v}_{s})-\alpha_{0}^{s}\|^{2}\right]\\
&\leq\frac{4\eta_{s}\sigma^{2}}{K}+\frac{1}{53}\mathbb{E}\left[\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right].
\end{split} (49)
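The stagewise schedule above keeps the product $\eta_{s}T_{s}$ constant across stages, so $\frac{1}{\eta_{s}T_{s}}=\frac{\min\{\ell,\mu_{2}\}}{212}$ at every stage; a small sketch (with illustrative, problem-dependent constants $\mu$, $c$, $\ell$, $\mu_{2}$ chosen only for demonstration) is:

```python
import math

# Illustrative constants; in the paper mu, c, ell, mu2 are problem-dependent.
mu, c, ell, mu2, eta0 = 0.1, 1.0, 1.0, 0.5, 0.1
rho = 2 * mu / (c + 2 * mu)

def eta_T(s):
    """Stage-s step size (geometrically decaying) and stage length (geometrically growing)."""
    eta_s = eta0 * math.exp(-(s - 1) * rho)
    T_s = 212 / (eta0 * min(ell, mu2)) * math.exp((s - 1) * rho)
    return eta_s, T_s

# eta_s * T_s is constant, so 1/(eta_s * T_s) = min(ell, mu2)/212 at every stage.
for s in (1, 2, 5):
    eta_s, T_s = eta_T(s)
    assert math.isclose(eta_s * T_s, 212 / min(ell, mu2))
```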

Since $\phi(\mathbf{v})$ is $L$-smooth and $\gamma=2\ell$, $\phi_{s}(\mathbf{v})$ is $\hat{L}=(L+2\ell)$-smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have

\begin{split}&\mathbb{E}[\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}]\leq 2\hat{L}\,\mathbb{E}\left[\phi_{s}(\mathbf{v}_{s})-\min\limits_{\mathbf{v}\in\mathbb{R}^{d}}\phi_{s}(\mathbf{v})\right]\leq 2\hat{L}\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]\\ &=2\hat{L}\mathbb{E}[4\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})-3\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]\\ &\leq 2\hat{L}\mathbb{E}\left[4\left(\frac{4\eta_{s}\sigma^{2}}{K}+\frac{1}{53}\left(\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right)\right)-3\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right]\\ &=2\hat{L}\mathbb{E}\left[\frac{16\eta_{s}\sigma^{2}}{K}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})-\frac{155}{53}\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right]\end{split} (50)

Applying Lemma 4 to (50), we have

𝔼[ϕs(𝐯s)2]2L^𝔼[16ηsσ2K+453Gaps(𝐯0s,α0s)15553(350Gaps+1(𝐯0s+1,α0s+1)+45(ϕ(𝐯0s+1)ϕ(𝐯0s)))]=2L^𝔼[16ηsσ2K+453Gaps(𝐯0s,α0s)93530Gaps+1(𝐯0s+1,α0s+1)12453(ϕ(𝐯0s+1)ϕ(𝐯0s))].\displaystyle\begin{split}&\mathbb{E}[\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}]\leq 2\hat{L}\mathbb{E}\bigg{[}\frac{16\eta_{s}\sigma^{2}}{K}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{155}{53}\left(\frac{3}{50}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})+\frac{4}{5}(\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{0}^{s}))\right)\bigg{]}\\ &=2\hat{L}\mathbb{E}\bigg{[}\frac{16\eta_{s}\sigma^{2}}{K}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\!-\!\frac{93}{530}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})\!-\!\frac{124}{53}(\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{0}^{s}))\bigg{]}.\end{split} (51)

Combining this with (47), rearranging the terms, and defining the constant $c=4\ell+\frac{248}{53}\hat{L}\in O(L+\ell)$, we get

(c+2μ)𝔼[ϕ(𝐯0s+1)ϕ(𝐯)]+93265L^𝔼[Gaps+1(𝐯0s+1,α0s+1)](4+24853L^)𝔼[ϕ(𝐯0s)ϕ(𝐯)]+8L^53𝔼[Gaps(𝐯0s,α0s)]+32ηsL^σ2Kc𝔼[ϕ(𝐯0s)ϕ(𝐯)+8L^53cGaps(𝐯0s,α0s)]+32ηsL^σ2K\displaystyle\begin{split}&\left(c+2\mu\right)\mathbb{E}[\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{*})]+\frac{93}{265}\hat{L}\mathbb{E}[\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})]\\ &\leq\left(4\ell+\frac{248}{53}\hat{L}\right)\mathbb{E}[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})]+\frac{8\hat{L}}{53}\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})]+\frac{32\eta_{s}\hat{L}\sigma^{2}}{K}\\ &\leq c\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\right]+\frac{32\eta_{s}\hat{L}\sigma^{2}}{K}\end{split} (52)

Using the fact that L^μ\hat{L}\geq\mu,

(c+2μ)8L^53c=(4+24853L^+2μ)8L^53(4+24853L^)8L^53+16μL^248L^93265L^.\displaystyle\begin{split}(c+2\mu)\frac{8\hat{L}}{53c}=\left(4\ell+\frac{248}{53}\hat{L}+2\mu\right)\frac{8\hat{L}}{53(4\ell+\frac{248}{53}\hat{L})}\leq\frac{8\hat{L}}{53}+\frac{16\mu\hat{L}}{248\hat{L}}\leq\frac{93}{265}\hat{L}.\end{split} (53)

Then, we have

(c+2μ)𝔼[ϕ(𝐯0s+1)ϕ(𝐯)+8L^53cGaps+1(𝐯0s+1,α0s+1)]c𝔼[ϕ(𝐯0s)ϕ(𝐯)+8L^53cGaps(𝐯0s,α0s)]+32ηsL^σ2K.\displaystyle\begin{split}&(c+2\mu)\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})\right]\\ &\leq c\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\right]+\frac{32\eta_{s}\hat{L}\sigma^{2}}{K}.\end{split} (54)
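Inequality (53) can be spot-checked numerically over a small grid of hypothetical parameter values satisfying $\hat{L}\geq\mu$. This is only a sanity check of the constants, not a proof:

```python
# Numeric spot check of inequality (53):
# (c + 2*mu) * 8*L_hat / (53*c) <= (93/265) * L_hat whenever L_hat >= mu.
# All grid values below are illustrative.
for ell in (0.1, 1.0, 10.0):
    for L in (0.1, 1.0, 10.0):
        L_hat = L + 2 * ell
        for mu in (0.001, 0.1 * L_hat, L_hat):   # mu <= L_hat throughout
            c = 4 * ell + (248 / 53) * L_hat
            lhs = (c + 2 * mu) * 8 * L_hat / (53 * c)
            assert lhs <= (93 / 265) * L_hat
```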

Defining $\Delta_{s}=\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})$, we have

𝔼[Δs+1]cc+2μ𝔼[Δs]+32ηsL^σ2(c+2μ)K\displaystyle\begin{split}&\mathbb{E}[\Delta_{s+1}]\leq\frac{c}{c+2\mu}\mathbb{E}[\Delta_{s}]+\frac{32\eta_{s}\hat{L}\sigma^{2}}{(c+2\mu)K}\end{split} (55)

Applying this inequality recursively yields

E[ΔS+1](cc+2μ)SE[Δ1]+32L^σ2(c+2μ)Ks=1S(ηs(cc+2μ)S+1s)\displaystyle\begin{split}&E[\Delta_{S+1}]\leq\left(\frac{c}{c+2\mu}\right)^{S}E[\Delta_{1}]+\frac{32\hat{L}\sigma^{2}}{(c+2\mu)K}\sum\limits_{s=1}^{S}\left(\eta_{s}\left(\frac{c}{c+2\mu}\right)^{S+1-s}\right)\end{split} (56)
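The contraction (55) can be illustrated numerically. The constants below are hypothetical; the point is that with $\rho=\frac{c}{c+2\mu}<1$ and a noise term that decays with $\eta_s$, the recursion drives $\Delta_s$ down geometrically:

```python
import math

# Toy numeric illustration (hypothetical constants) of the recursion
# E[Delta_{s+1}] <= rho * E[Delta_s] + 32*eta_s*L_hat*sigma^2 / ((c+2mu)*K).
c, mu, eta0, L_hat, sigma2, K = 10.0, 0.5, 0.1, 6.0, 1.0, 4
rho = c / (c + 2 * mu)
delta = 1.0                      # Delta_1
for s in range(1, 201):
    eta_s = eta0 * math.exp(-(s - 1) * 2 * mu / (c + 2 * mu))
    delta = rho * delta + 32 * eta_s * L_hat * sigma2 / ((c + 2 * mu) * K)
# after enough stages, Delta is far below its initial value
assert delta < 1e-3
```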

By definition,

\begin{split}\Delta_{1}&=\phi(\mathbf{v}_{0}^{1})-\phi(\mathbf{v}^{*})+\frac{8\hat{L}}{53c}\text{Gap}_{1}(\mathbf{v}_{0}^{1},\alpha_{0}^{1})\\ &\leq\phi(\mathbf{v}_{0})-\phi(\mathbf{v}^{*})+\left(f(\mathbf{v}_{0},\hat{\alpha}_{1}(\mathbf{v}_{0}))+\frac{\gamma}{2}\|\mathbf{v}_{0}-\mathbf{v}_{0}\|^{2}-f(\hat{\mathbf{v}}_{1}(\alpha_{0}),\alpha_{0})-\frac{\gamma}{2}\|\hat{\mathbf{v}}_{1}(\alpha_{0})-\mathbf{v}_{0}\|^{2}\right)\\ &\leq\epsilon_{0}+f(\mathbf{v}_{0},\hat{\alpha}_{1}(\mathbf{v}_{0}))-f(\hat{\mathbf{v}}_{1}(\alpha_{0}),\alpha_{0})\leq 2\epsilon_{0},\end{split} (57)

where the first inequality uses $\frac{8\hat{L}}{53c}\leq 1$ (since $53c\geq 248\hat{L}$) and $\text{Gap}_{1}\geq 0$.

Using the inequality $1-x\leq\exp(-x)$, we have

𝔼[ΔS+1]exp(2μSc+2μ)𝔼[Δ1]+32η0L^σ2(c+2μ)Ks=1Sexp(2μSc+2μ)2ϵ0exp(2μSc+2μ)+32η0L^σ2(c+2μ)KSexp(2μS(c+2μ)).\displaystyle\begin{split}&\mathbb{E}[\Delta_{S+1}]\leq\exp\left(\frac{-2\mu S}{c+2\mu}\right)\mathbb{E}[\Delta_{1}]+\frac{32\eta_{0}\hat{L}\sigma^{2}}{(c+2\mu)K}\sum\limits_{s=1}^{S}\exp\left(-\frac{2\mu S}{c+2\mu}\right)\\ &\leq 2\epsilon_{0}\exp\left(\frac{-2\mu S}{c+2\mu}\right)+\frac{32\eta_{0}\hat{L}\sigma^{2}}{(c+2\mu)K}S\exp\left(-\frac{2\mu S}{(c+2\mu)}\right).\end{split}

To make this less than ϵ\epsilon, it suffices to make

2ϵ0exp(2μSc+2μ)ϵ232η0L^σ2(c+2μ)KSexp(2μSc+2μ)ϵ2\displaystyle\begin{split}&2\epsilon_{0}\exp\left(\frac{-2\mu S}{c+2\mu}\right)\leq\frac{\epsilon}{2}\\ &\frac{32\eta_{0}\hat{L}\sigma^{2}}{(c+2\mu)K}S\exp\left(-\frac{2\mu S}{c+2\mu}\right)\leq\frac{\epsilon}{2}\end{split} (58)

Let SS be the smallest value such that exp(2μSc+2μ)min{ϵ4ϵ0,(c+2μ)Kϵ64η0L^Sσ2}\exp\left(\frac{-2\mu S}{c+2\mu}\right)\leq\min\{\frac{\epsilon}{4\epsilon_{0}},\frac{(c+2\mu)K\epsilon}{64\eta_{0}\hat{L}S\sigma^{2}}\}. We can set S=max{c+2μ2μlog4ϵ0ϵ,c+2μ2μlog64η0L^Sσ2(c+2μ)Kϵ}S=\max\bigg{\{}\frac{c+2\mu}{2\mu}\log\frac{4\epsilon_{0}}{\epsilon},\frac{c+2\mu}{2\mu}\log\frac{64\eta_{0}\hat{L}S\sigma^{2}}{(c+2\mu)K\epsilon}\bigg{\}}.

Then, the total iteration complexity is

s=1STsO(424η0min{,μ2}s=1Sexp((s1)2μc+2μ))O(1η0min{,μ2}exp(S2μc+2μ)1exp(2μc+2μ)1)(a)O~(cη0μmin{,μ2}max{ϵ0ϵ,η0L^Sσ2(c+2μ)Kϵ})O~(max{(L+)ϵ0η0μmin{,μ2}ϵ,(L+)2σ2μ2min{,μ2}Kϵ})O~(max{1μ1μ22ϵ,1μ12μ23Kϵ}),\begin{split}\sum\limits_{s=1}^{S}T_{s}&\leq O\left(\frac{424}{\eta_{0}\min\{\ell,\mu_{2}\}}\sum\limits_{s=1}^{S}\exp\left((s-1)\frac{2\mu}{c+2\mu}\right)\right)\\ &\leq O\bigg{(}\frac{1}{\eta_{0}\min\{\ell,\mu_{2}\}}\frac{\exp(S\frac{2\mu}{c+2\mu})-1}{\exp(\frac{2\mu}{c+2\mu})-1}\bigg{)}\\ &\overset{(a)}{\leq}\widetilde{O}\left(\frac{c}{\eta_{0}\mu\min\{\ell,\mu_{2}\}}\max\left\{\frac{\epsilon_{0}}{\epsilon},\frac{\eta_{0}\hat{L}S\sigma^{2}}{(c+2\mu)K\epsilon}\right\}\right)\\ &\leq\widetilde{O}\left(\max\left\{\frac{(L+\ell)\epsilon_{0}}{\eta_{0}\mu\min\{\ell,\mu_{2}\}\epsilon},\frac{(L+\ell)^{2}\sigma^{2}}{\mu^{2}\min\{\ell,\mu_{2}\}K\epsilon}\right\}\right)\\ &\leq\widetilde{O}\left(\max\left\{\frac{1}{\mu_{1}\mu_{2}^{2}\epsilon},\frac{1}{\mu_{1}^{2}\mu_{2}^{3}K\epsilon}\right\}\right),\end{split} (59)

where (a)(a) uses the setting of SS and exp(x)1x\exp(x)-1\geq x, and O~\widetilde{O} suppresses logarithmic factors.

Recall that $\eta_{s}=\eta_{0}\exp\left(-(s-1)\frac{2\mu}{c+2\mu}\right)$ and $T_{s}=\frac{212}{\eta_{0}\min\{\ell,\mu_{2}\}}\exp\left((s-1)\frac{2\mu}{c+2\mu}\right)$.

Next, we analyze the communication cost, investigating both the $D=0$ and $D>0$ cases.

(i) Homogeneous Data ($D=0$): To ensure $\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\left(12\eta_{s}^{2}I_{s}+36\eta_{s}^{2}I_{s}^{2}D^{2}\right)\leq\frac{\eta_{s}\sigma^{2}}{K}$, which we used in the proof above, we take $I_{s}=\frac{1}{MK\eta_{s}}=\frac{\exp((s-1)\frac{2\mu}{c+2\mu})}{MK\eta_{0}}$, where $M$ is a suitable constant.

If 1MKη0>1\frac{1}{MK\eta_{0}}>1, then Is=max(1,exp((s1)2μc+2μ)MKη0)=exp((s1)2μc+2μ)MKη0I_{s}=\max(1,\frac{\exp((s-1)\frac{2\mu}{c+2\mu})}{MK\eta_{0}})=\frac{\exp((s-1)\frac{2\mu}{c+2\mu})}{MK\eta_{0}}.

Otherwise, if $\frac{1}{MK\eta_{0}}\leq 1$, then $I_{s}=1$ for $s\leq S_{1}:=\frac{c+2\mu}{2\mu}\log(MK\eta_{0})+1$ and $I_{s}=\frac{\exp((s-1)\frac{2\mu}{c+2\mu})}{MK\eta_{0}}$ for $s>S_{1}$.

\begin{split}\sum\limits_{s=1}^{S_{1}}T_{s}&=\sum\limits_{s=1}^{S_{1}}O\left(\frac{212}{\eta_{0}}\exp\left((s-1)\frac{2\mu}{c+2\mu}\right)\right)\\ &=\widetilde{O}\left(\frac{212}{\eta_{0}}\frac{\exp\left(\frac{2\mu}{c+2\mu}S_{1}\right)-1}{\exp\left(\frac{2\mu}{c+2\mu}\right)-1}\right)=\widetilde{O}\left(\frac{K}{\mu}\right)\end{split} (60)

Thus, in both cases above, the total communication complexity can be bounded by

s=1S1Ts+s=S1+1STsIs=O~(Kμ+KS)O~(Kμ).\begin{split}&\sum\limits_{s=1}^{S_{1}}T_{s}+\sum\limits_{s=S_{1}+1}^{S}\frac{T_{s}}{I_{s}}\\ &=\widetilde{O}\left(\frac{K}{\mu}+KS\right)\leq\widetilde{O}\left(\frac{K}{\mu}\right).\end{split} (61)
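The case-(i) schedule can be sketched as follows (hypothetical constants). The point is that $I_s$ grows at exactly the same exponential rate as $T_s$, so the communication per stage $T_s/I_s$ is bounded by a constant independent of the stage index:

```python
import math

# Sketch of the homogeneous-data (D = 0) schedule; constants are illustrative.
ell, mu2, mu, L, K, M, eta0 = 1.0, 0.2, 0.05, 4.0, 8, 2.0, 0.01
c = 4 * ell + (248 / 53) * (L + 2 * ell)
a = 2 * mu / (c + 2 * mu)

def T(s):
    # stage length grows like exp((s-1)*a)
    return 212 / (eta0 * min(ell, mu2)) * math.exp((s - 1) * a)

def I(s):
    # communication interval grows at the same rate
    return max(1.0, math.exp((s - 1) * a) / (M * K * eta0))

cap = 212 * M * K / min(ell, mu2)   # constant per-stage communication bound
for s in range(1, 101):
    assert T(s) / I(s) <= cap + 1e-3
```

Since the number of stages is $S=\widetilde{O}(1/\mu)$, a constant number of rounds per stage gives the $\widetilde{O}(K/\mu)$ total above.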

(ii) Heterogeneous Data (D>0D>0):

To ensure $\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\left(12\eta_{s}^{2}I_{s}+36\eta_{s}^{2}I_{s}^{2}D^{2}\right)\leq\frac{\eta_{s}\sigma^{2}}{K}$, which we used in the proof above, we take $I_{s}=\frac{1}{M\sqrt{K\eta_{s}}}$, where $M$ is a suitable constant.

If $\frac{1}{M\sqrt{K\eta_{0}}}\leq 1$, then $I_{s}=1$ for $s\leq S_{2}:=\frac{c+2\mu}{2\mu}\log(M^{2}K\eta_{0})+1$ and $I_{s}=\frac{\exp\left((s-1)\frac{\mu}{c+2\mu}\right)}{M\sqrt{K\eta_{0}}}$ for $s>S_{2}$.

s=1S2Ts=s=1S2O(212η0exp((s1)2μc+2μ))=O~(Kμ)\begin{split}&\sum\limits_{s=1}^{S_{2}}T_{s}=\sum\limits_{s=1}^{S_{2}}O\left(\frac{212}{\eta_{0}}\exp\left((s-1)\frac{2\mu}{c+2\mu}\right)\right)\\ &=\widetilde{O}\left(\frac{K}{\mu}\right)\end{split} (62)

Thus, the communication complexity can be bounded by

\begin{split}&\sum\limits_{s=1}^{S_{2}}T_{s}+\sum\limits_{s=S_{2}+1}^{S}\frac{T_{s}}{I_{s}}=\widetilde{O}\left(\frac{K}{\mu}+\sqrt{K}\sum\limits_{s=S_{2}+1}^{S}\exp\left(\frac{(s-1)}{2}\frac{2\mu}{c+2\mu}\right)\right)\\ &\leq\widetilde{O}\left(\frac{K}{\mu}+\sqrt{K}\frac{\exp\left(\frac{S}{2}\frac{2\mu}{c+2\mu}\right)-1}{\exp\left(\frac{\mu}{c+2\mu}\right)-1}\right)\leq\widetilde{O}\left(\frac{K}{\mu}+\frac{1}{\mu^{3/2}\epsilon^{1/2}}\right).\end{split} (63)
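A matching sketch for case (ii), again with hypothetical constants: here $I_s\propto 1/\sqrt{K\eta_s}$ grows at only half the exponential rate of $T_s$, so the communication per stage $T_s/I_s$ grows like $\exp\left((s-1)\frac{\mu}{c+2\mu}\right)$, i.e., like the square root of the stage length:

```python
import math

# Sketch of the heterogeneous-data (D > 0) schedule; constants are illustrative.
ell, mu2, mu, L, K, M, eta0 = 1.0, 0.2, 0.05, 4.0, 8, 2.0, 0.01
c = 4 * ell + (248 / 53) * (L + 2 * ell)
a = 2 * mu / (c + 2 * mu)

def T(s):
    return 212 / (eta0 * min(ell, mu2)) * math.exp((s - 1) * a)

def I(s):
    eta_s = eta0 * math.exp(-(s - 1) * a)
    return max(1.0, 1.0 / (M * math.sqrt(K * eta_s)))

# (T_s/I_s) / exp((s-1)*a/2) is constant once I_s > 1
r = [(T(s) / I(s)) / math.exp((s - 1) * a / 2) for s in range(1, 50)]
assert max(r) / min(r) < 1.0 + 1e-9
```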

Appendix C Baseline: Naive Parallel Algorithm

Note that if we set $I_{s}=1$ for all $s$, CODA+ reduces to a naive parallel version of PPD-SG [liu2019stochastic]. We analyze this naive parallel algorithm in the following theorem.

Theorem 3

Consider Algorithm 1 with Is=1I_{s}=1. Set γ=2\gamma=2\ell, L^=L+2\hat{L}=L+2\ell, c=μ/L^5+μ/L^c=\frac{\mu/\hat{L}}{5+\mu/\hat{L}}.

(1) If M<1KμϵM<\frac{1}{K\mu\epsilon}, set ηs=η0exp((s1)c)O(1)\eta_{s}=\eta_{0}\exp(-(s-1)c)\leq O(1) and Ts=212η0min(,μ2)exp((s1)c)T_{s}=\frac{212}{\eta_{0}\min(\ell,\mu_{2})}\exp((s-1)c), then the communication/iteration complexity is O~(max(Δ0μϵη0K,L^μ2Kϵ))\widetilde{O}\bigg{(}\max\left(\frac{\Delta_{0}}{\mu\epsilon\eta_{0}K},\frac{\hat{L}}{\mu^{2}K\epsilon}\right)\bigg{)} to return 𝐯S\mathbf{v}_{S} such that 𝔼[ϕ(𝐯S)ϕ(𝐯ϕ)]ϵ\mathbb{E}[\phi(\mathbf{v}_{S})-\phi(\mathbf{v}^{*}_{\phi})]\leq\epsilon.

(2) If M1KμϵM\geq\frac{1}{K\mu\epsilon}, set ηs=min(13+32/μ2,14)\eta_{s}=\min(\frac{1}{3\ell+3\ell^{2}/\mu_{2}},\frac{1}{4\ell}) and Ts=212ηsmin{,μ2}T_{s}=\frac{212}{\eta_{s}\min\{\ell,\mu_{2}\}}, then the communication/iteration complexity is O~(1μ)\widetilde{O}\bigg{(}\frac{1}{\mu}\bigg{)} to return 𝐯S\mathbf{v}_{S} such that 𝔼[ϕ(𝐯S)ϕ(𝐯ϕ)]ϵ\mathbb{E}[\phi(\mathbf{v}_{S})-\phi(\mathbf{v}^{*}_{\phi})]\leq\epsilon.

{proof}

(1) If $M<\frac{1}{K\mu\epsilon}$, note that the settings of $\eta_{s}$ and $T_{s}$ are identical to those in CODA+ (Theorem 1). However, since a batch of size $M$ is used on each machine at each iteration, the variance at each iteration is reduced to $\frac{\sigma^{2}}{KM}$. Therefore, by an analysis similar to that of Theorem 1 (specifically (59)), the iteration complexity of NPA is $\widetilde{O}\left(\frac{1}{\mu\epsilon}+\frac{1}{\mu^{2}KM\epsilon}\right)$. Thus, the sample complexity of each machine is $\widetilde{O}\left(\frac{M}{\mu\epsilon}+\frac{1}{\mu^{2}K\epsilon}\right)$.

(2) If $M\geq\frac{1}{K\mu\epsilon}$, note that $\frac{1}{\eta_{s}T_{s}}\leq\frac{\min\{\ell,\mu_{2}\}}{212}$; we can then follow the proof of Theorem 1 and derive

Δs+1cc+2μ𝔼[Δs]+32ηsL^σ2KMcc+2μ𝔼[Δs]+32ηsL^σ2μϵ\begin{split}\Delta_{s+1}&\leq\frac{c}{c+2\mu}\mathbb{E}[\Delta_{s}]+\frac{32\eta_{s}\hat{L}\sigma^{2}}{KM}\\ &\leq\frac{c}{c+2\mu}\mathbb{E}[\Delta_{s}]+32\eta_{s}\hat{L}\sigma^{2}\mu\epsilon\end{split} (64)

where the first inequality is derived similarly to (55) and $\Delta_{s}$ is defined as in Theorem 1. Thus,

ΔS+1(cc+2μ)S+μϵO(s=1S(cc+2μ)s1)(cc+2μ)S+O(ϵ)exp(2μSc+2μ)+O(ϵ)\begin{split}\Delta_{S+1}&\leq\left(\frac{c}{c+2\mu}\right)^{S}+\mu\epsilon O\left(\sum\limits_{s=1}^{S}\left(\frac{c}{c+2\mu}\right)^{s-1}\right)\\ &\leq\left(\frac{c}{c+2\mu}\right)^{S}+O(\epsilon)\\ &\leq\exp\left(\frac{-2\mu S}{c+2\mu}\right)+O(\epsilon)\end{split} (65)

Therefore, it suffices to take $S=\widetilde{O}\left(\frac{1}{\mu}\right)$. Hence, the total number of communication rounds is $S\cdot T_{s}=\widetilde{O}\left(\frac{1}{\mu}\right)$ and the sample complexity on each machine is $\widetilde{O}\left(\frac{M}{\mu}\right)$.
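One iteration of this naive parallel scheme ($I_{s}=1$) can be sketched as follows. The gradient forms below are hypothetical placeholders, not the actual AUC min-max objective; the sketch only shows the structure: each machine computes stochastic primal-dual gradients on a local batch of size $M$, gradients are averaged across the $K$ machines (one communication per iteration), and a single primal-descent / dual-ascent step is taken.

```python
import random

random.seed(0)
K, M, d, eta = 4, 8, 5, 0.1   # machines, batch size, dimension, step size
v = [0.0] * d                 # primal variable
alpha = 0.0                   # dual variable

def local_grads():
    # placeholder local batch and hypothetical gradient forms of F_k
    batch = [[random.gauss(0, 1) for _ in range(d)] for _ in range(M)]
    mean = [sum(x[j] for x in batch) / M for j in range(d)]
    gv = [mean[j] + v[j] for j in range(d)]   # hypothetical grad_v
    ga = sum(mean) / d - 0.5 * alpha          # hypothetical grad_alpha
    return gv, ga

grads = [local_grads() for _ in range(K)]     # one round: K machines
avg_gv = [sum(g[0][j] for g in grads) / K for j in range(d)]
avg_ga = sum(g[1] for g in grads) / K
v = [v[j] - eta * avg_gv[j] for j in range(d)]   # averaged descent in v
alpha = alpha + eta * avg_ga                     # averaged ascent in alpha
```

Averaging $K$ batches of size $M$ is what reduces the per-iteration variance to $\sigma^{2}/(KM)$ in the proof above, at the cost of communicating every iteration.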

Appendix D Proof of Lemma 2

In this section, we prove Lemma 2, which provides the convergence analysis of one stage of CODASCA.

First, the duality gap in stage $s$ can be bounded as follows.

Lemma 9

For any 𝐯,α\mathbf{v},\alpha,

1Rr=1R[fs(𝐯r,α)fs(𝐯,αr)]1Rr=1R[𝐯fs(𝐯r1,αr1),𝐯r𝐯B4+αfs(𝐯r1,αr1),ααrB5+3+32/μ22𝐯r𝐯r12+2(αrαr1)23𝐯r1𝐯2μ23(αr1α)2]\begin{split}&\frac{1}{R}\sum\limits_{r=1}^{R}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\frac{1}{R}\sum\limits_{r=1}^{R}\bigg{[}\underbrace{\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\rangle}_{B4}+\underbrace{\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha-\alpha_{r}\rangle}_{B5}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}+2\ell(\alpha_{r}-\alpha_{r-1})^{2}-\frac{\ell}{3}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{3}(\alpha_{r-1}-\alpha)^{2}\bigg{]}\end{split}
{proof}

By the $\ell$-strong convexity of $f^{s}(\mathbf{v},\alpha)$ in $\mathbf{v}$, we have

fs(𝐯r1,αr1)+𝐯fs(𝐯r1,αr1),𝐯𝐯r1+2𝐯r1𝐯2fs(𝐯,αr1).f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}-\mathbf{v}_{r-1}\rangle+\frac{\ell}{2}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}\leq f^{s}(\mathbf{v},\alpha_{r-1}). (66)

By 33\ell-smoothness of fs(𝐯,α)f^{s}(\mathbf{v},\alpha) in 𝐯\mathbf{v}, we have

fs(𝐯r,α)fs(𝐯r1,α)+𝐯fs(𝐯r1,α),𝐯r𝐯r1+32𝐯r𝐯r12=fs(𝐯r1,α)+𝐯fs(𝐯r1,αr1),𝐯r𝐯r1+32𝐯r𝐯r12+𝐯fs(𝐯r1,α)𝐯fs(𝐯r1,αr1),𝐯r𝐯r1(a)fs(𝐯r1,α)+𝐯fs(𝐯r1,αr1),𝐯r𝐯r1+32𝐯r𝐯r12+|αr1α|𝐯r𝐯r1(b)fs(𝐯r1,α)+𝐯fs(𝐯r1,αr1),𝐯r𝐯r1+32𝐯r𝐯r12+μ26(αr1α)2+322μ2𝐯r𝐯r12,\begin{split}f^{s}(\mathbf{v}_{r},\alpha)&\leq f^{s}(\mathbf{v}_{r-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha),\mathbf{v}_{r}-\mathbf{v}_{r-1}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}\\ &=f^{s}(\mathbf{v}_{r-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}_{r-1}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}\\ &~{}~{}~{}+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha)-\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}_{r-1}\rangle\\ &\overset{(a)}{\leq}f^{s}(\mathbf{v}_{r-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}_{r-1}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}\\ &~{}~{}~{}~{}+\ell|\alpha_{r-1}-\alpha|\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|\\ &\overset{(b)}{\leq}f^{s}(\mathbf{v}_{r-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}_{r-1}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}\\ &~{}~{}~{}~{}+\frac{\mu_{2}}{6}(\alpha_{r-1}-\alpha)^{2}+\frac{3\ell^{2}}{2\mu_{2}}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2},\\ \end{split} (67)

where $(a)$ holds because $\partial_{\mathbf{v}}f^{s}(\mathbf{v},\alpha)$ is $\ell$-Lipschitz in $\alpha$ (since $f(\mathbf{v},\alpha)$ is $\ell$-smooth), and $(b)$ holds by Young's inequality.

Adding (66) and (67) and rearranging terms, we have

fs(𝐯r1,αr1)+fs(𝐯r,α)fs(𝐯,αr1)+fs(𝐯r1,α)+𝐯fs(𝐯r1,αr1),𝐯r𝐯+3+32/μ22𝐯r𝐯r122𝐯r1𝐯2+μ26(αr1α)2.\begin{split}&f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})+f^{s}(\mathbf{v}_{r},\alpha)\\ &\leq f^{s}(\mathbf{v},\alpha_{r-1})+f^{s}(\mathbf{v}_{r-1},\alpha)+\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\rangle\\ &~{}~{}~{}+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}-\frac{\ell}{2}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}+\frac{\mu_{2}}{6}(\alpha_{r-1}-\alpha)^{2}.\end{split} (68)

We know $f^{s}(\mathbf{v},\alpha)$ is $\mu_{2}$-strongly concave in $\alpha$ (i.e., $-f^{s}(\mathbf{v},\alpha)$ is $\mu_{2}$-strongly convex in $\alpha$). Thus, we have

fs(𝐯r1,αr1)αfs(𝐯r1,αr1),ααr1+μ22(ααr1)2fs(𝐯r1,α).\begin{split}-f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha-\alpha_{r-1}\rangle+\frac{\mu_{2}}{2}(\alpha-\alpha_{r-1})^{2}\leq-f^{s}(\mathbf{v}_{r-1},\alpha).\end{split} (69)

Since fs(𝐯,α)f^{s}(\mathbf{v},\alpha) is \ell-smooth in α\alpha, we get

fs(𝐯,αr)fs(𝐯,αr1)αfs(𝐯,αr1),αrαr1+2(αrαr1)2=fs(𝐯,αr1)αfs(𝐯r1,αr1),αrαr1+2(αrαr1)2α(fs(𝐯,αr1)fs(𝐯r1,αr1)),αrαr1(a)fs(𝐯,αr1)αfs(𝐯r1,αr1),αrαr1+2(αrαr1)2+𝐯𝐯r1|αrαr1|fs(𝐯,αr1)αfs(𝐯r1,αr1),αrαr1+2(αrαr1)2+6𝐯r1𝐯2+32(αrαr1)2\begin{split}-f^{s}(\mathbf{v},\alpha_{r})&\leq-f^{s}(\mathbf{v},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v},\alpha_{r-1}),\alpha_{r}-\alpha_{r-1}\rangle+\frac{\ell}{2}(\alpha_{r}-\alpha_{r-1})^{2}\\ &=-f^{s}(\mathbf{v},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha_{r}-\alpha_{r-1}\rangle+\frac{\ell}{2}(\alpha_{r}-\alpha_{r-1})^{2}\\ &~{}~{}~{}~{}~{}-\langle\partial_{\alpha}(f^{s}(\mathbf{v},\alpha_{r-1})-f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})),\alpha_{r}-\alpha_{r-1}\rangle\\ &\overset{(a)}{\leq}-f^{s}(\mathbf{v},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha_{r}-\alpha_{r-1}\rangle+\frac{\ell}{2}(\alpha_{r}-\alpha_{r-1})^{2}\\ &~{}~{}~{}+\ell\|\mathbf{v}-\mathbf{v}_{r-1}\||\alpha_{r}-\alpha_{r-1}|\\ &\leq-f^{s}(\mathbf{v},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha_{r}-\alpha_{r-1}\rangle+\frac{\ell}{2}(\alpha_{r}-\alpha_{r-1})^{2}\\ &~{}~{}~{}+\frac{\ell}{6}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}+\frac{3\ell}{2}(\alpha_{r}-\alpha_{r-1})^{2}\\ \end{split} (70)

where (a) holds because $\partial_{\alpha}f^{s}(\mathbf{v},\alpha)$ is $\ell$-Lipschitz in $\mathbf{v}$.

Adding (69) and (70) and rearranging terms, we have

fs(𝐯r1,αr1)fs(𝐯,αr)fs(𝐯r1,α)fs(𝐯,αr1)αfs(𝐯r1,αr1),αrα+2(αrαr1)2+6𝐯r1𝐯2μ22(ααr1)2.\begin{split}&-f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})-f^{s}(\mathbf{v},\alpha_{r})\leq-f^{s}(\mathbf{v}_{r-1},\alpha)-f^{s}(\mathbf{v},\alpha_{r-1})-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha_{r}-\alpha\rangle\\ &~{}~{}~{}~{}~{}~{}+2\ell(\alpha_{r}-\alpha_{r-1})^{2}+\frac{\ell}{6}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{2}(\alpha-\alpha_{r-1})^{2}.\end{split} (71)

Adding (68) and (71), we get

fs(𝐯r,α)fs(𝐯,αr)𝐯fs(𝐯r1,αr1),𝐯r𝐯αfs(𝐯r1,αr1),αrα+3+32/μ22𝐯r𝐯r12+2(αrαr1)23𝐯r1𝐯2μ23(αr1α)2\begin{split}&f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})\\ &\leq\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\rangle-\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha_{r}-\alpha\rangle\\ &~{}~{}~{}+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}+2\ell(\alpha_{r}-\alpha_{r-1})^{2}\\ &~{}~{}~{}-\frac{\ell}{3}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{3}(\alpha_{r-1}-\alpha)^{2}\end{split} (72)

Averaging over $r=1,\ldots,R$, we get

1Rr=1R[fs(𝐯r,α)fs(𝐯,αr)]1Rr=1R[𝐯fs(𝐯r1,αr1),𝐯r𝐯B4+αfs(𝐯r1,αr1),ααrB5+3+32/μ22𝐯r𝐯r12+2(αrαr1)23𝐯r1𝐯2μ23(αr1α)2]\begin{split}&\frac{1}{R}\sum\limits_{r=1}^{R}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\frac{1}{R}\sum\limits_{r=1}^{R}\bigg{[}\underbrace{\langle\partial_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\rangle}_{B_{4}}+\underbrace{\langle\partial_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha-\alpha_{r}\rangle}_{B_{5}}\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\frac{3\ell+3\ell^{2}/\mu_{2}}{2}\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}+2\ell(\alpha_{r}-\alpha_{r-1})^{2}-\frac{\ell}{3}\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\frac{\mu_{2}}{3}(\alpha_{r-1}-\alpha)^{2}\bigg{]}\end{split}

B4B_{4} and B5B_{5} can be bounded by the following lemma. For simplicity of notation, we define

Ξr=1KIk,t𝔼[𝐯r,tk𝐯r2+(αr,tkαr)2],\begin{split}\Xi_{r}=\frac{1}{KI}\sum\limits_{k,t}\mathbb{E}[\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r}\|^{2}+(\alpha^{k}_{r,t}-\alpha_{r})^{2}],\end{split} (73)

which is the drift of the variables between the sequence in the $r$-th round and the ending point, and

r=1KIk,t𝔼[𝐯r,tk𝐯r12+(αr,tkαr1)2],\begin{split}\mathcal{E}_{r}=\frac{1}{KI}\sum\limits_{k,t}\mathbb{E}[\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}+(\alpha^{k}_{r,t}-\alpha_{r-1})^{2}],\end{split} (74)

which is the drift of the variables between the sequence in the $r$-th round and the starting point.
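The two drift quantities can be sketched as a direct computation over stored local iterates. The shapes and toy values below are assumptions for illustration only:

```python
# Sketch of computing Xi_r and E_r: averages over machines k and local
# steps t of the squared distance of the local iterates (v^k_{r,t},
# alpha^k_{r,t}) to a reference point (end point for Xi_r, start point for E_r).
K, I, d = 3, 4, 2
# toy local iterates v_loc[k][t] (vectors) and a_loc[k][t] (scalars)
v_loc = [[[0.1 * (k + t + j) for j in range(d)] for t in range(I)] for k in range(K)]
a_loc = [[0.05 * (k - t) for t in range(I)] for k in range(K)]
v_end, a_end = [0.2] * d, 0.0        # (v_r, alpha_r)
v_start, a_start = [0.0] * d, 0.1    # (v_{r-1}, alpha_{r-1})

def drift(v_ref, a_ref):
    total = 0.0
    for k in range(K):
        for t in range(I):
            total += sum((v_loc[k][t][j] - v_ref[j]) ** 2 for j in range(d))
            total += (a_loc[k][t] - a_ref) ** 2
    return total / (K * I)

xi_r = drift(v_end, a_end)      # Xi_r: drift to the end point
e_r = drift(v_start, a_start)   # E_r: drift to the start point
assert xi_r >= 0.0 and e_r >= 0.0
```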

$B_{4}$ and $B_{5}$ can now be bounded as follows.

Lemma 10
\begin{split}&\mathbb{E}\left\langle\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\right\rangle\\ &\leq\frac{3\ell}{2}\mathcal{E}_{r}+\frac{\ell}{3}\mathbb{E}\|\bar{\mathbf{v}}_{r}-\mathbf{v}\|^{2}+\frac{3\tilde{\eta}}{2}\mathbb{E}\left\|\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right\|^{2}\\ &~{}~{}~{}~{}~{}+\frac{1}{2\tilde{\eta}}\mathbb{E}(\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}-\|\mathbf{v}_{r}-\mathbf{v}\|^{2})+\frac{1}{2\tilde{\eta}}\mathbb{E}(\|\tilde{\mathbf{v}}_{r-1}-\mathbf{v}\|^{2}-\|\tilde{\mathbf{v}}_{r}-\mathbf{v}\|^{2}),\end{split}

and

\begin{split}&\mathbb{E}\langle\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha-\alpha_{r}\rangle\leq\frac{3\ell^{2}}{2\mu_{2}}\mathcal{E}_{r}+\frac{\mu_{2}}{3}\mathbb{E}(\bar{\alpha}_{r}-\alpha)^{2}\\ &~{}~{}~{}+\frac{3\tilde{\eta}}{2}\mathbb{E}\left(\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\alpha}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\alpha}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right)^{2}\\ &~{}~{}~{}+\frac{1}{2\tilde{\eta}}\mathbb{E}((\bar{\alpha}_{r-1}-\alpha)^{2}-(\bar{\alpha}_{r-1}-\bar{\alpha}_{r})^{2}-(\bar{\alpha}_{r}-\alpha)^{2})+\frac{1}{2\tilde{\eta}}\mathbb{E}((\alpha-\tilde{\alpha}_{r-1})^{2}-(\alpha-\tilde{\alpha}_{r})^{2}).\end{split}
{proof}
𝐯fs(𝐯r1,αr1),𝐯r𝐯=1KIk,t𝐯fks(𝐯r1,αr1),𝐯r𝐯1KIk,t[𝐯fks(𝐯r1,αr1)𝐯fks(𝐯r1,αr,tk)],𝐯r𝐯\small{1}⃝+1KIi,t[𝐯fks(𝐯r1,αr,tk)𝐯fks(𝐯r,tk,αr,tk)],𝐯r𝐯\small{2}⃝+1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)],𝐯r𝐯\small{3}⃝+1KIk,t𝐯Fks(𝐯r,tk,αr,tk;zr,tk),𝐯r𝐯\small{4}⃝\begin{split}&\langle\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\rangle\\ &=\bigg{\langle}\frac{1}{KI}\sum\limits_{k,t}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\bigg{\rangle}\\ &\leq\bigg{\langle}\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r,t}^{k})],\mathbf{v}_{r}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{1}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{KI}\sum\limits_{i,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})],\mathbf{v}_{r}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{2}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha_{r,t}^{k};z_{r,t}^{k})],\mathbf{v}_{r}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{3}⃝\\ &~{}~{}~{}+\bigg{\langle}\frac{1}{KI}\sum\limits_{k,t}\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha_{r,t}^{k};z_{r,t}^{k}),\mathbf{v}_{r}-\mathbf{v}\bigg{\rangle}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\small{4}⃝\end{split} (75)

We now bound \small{1}⃝, \small{2}⃝ and \small{3}⃝ in turn.

\small{1}⃝(a)321KIk,t[𝐯fks(𝐯r1,αr1)𝐯fks(𝐯r1,αr,tk)]2+6𝐯r𝐯2(b)321KIk,t𝐯fks(𝐯r1,αr1)𝐯fks(𝐯r1,αr,tk)2+6𝐯r𝐯2(c)321KIk,tαr1αr,tk2+6𝐯r𝐯2,\begin{split}\small{1}⃝&\overset{(a)}{\leq}\frac{3}{2\ell}\left\|\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r,t}^{k})]\right\|^{2}+\frac{\ell}{6}\|\mathbf{v}_{r}-\mathbf{v}\|^{2}\\ &\overset{(b)}{\leq}\frac{3}{2\ell}\frac{1}{KI}\sum\limits_{k,t}\|\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r,t}^{k})\|^{2}+\frac{\ell}{6}\|\mathbf{v}_{r}-\mathbf{v}\|^{2}\\ &\overset{(c)}{\leq}\frac{3\ell}{2}\frac{1}{KI}\sum\limits_{k,t}\|\alpha_{r-1}-\alpha_{r,t}^{k}\|^{2}+\frac{\ell}{6}\|\mathbf{v}_{r}-\mathbf{v}\|^{2},\end{split} (76)

where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality. and (c) holds because 𝐯fks(𝐯,α)\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v},\alpha) is \ell-smooth in α\alpha. Using similar techniques, we have

\small{2}⃝321KIk,t𝐯fks(𝐯r1,αr,tk)𝐯fks(𝐯r,tk,αr,tk)2+6𝐯r𝐯2321KIk,t𝐯r1𝐯r,ti2+6𝐯r𝐯2.\begin{split}\small{2}⃝&\leq\frac{3}{2\ell}\frac{1}{KI}\sum\limits_{k,t}\|\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})\|^{2}+\frac{\ell}{6}\|\mathbf{v}_{r}-\mathbf{v}\|^{2}\\ &\leq\frac{3\ell}{2}\frac{1}{KI}\sum\limits_{k,t}\|\mathbf{v}_{r-1}-\mathbf{v}_{r,t}^{i}\|^{2}+\frac{\ell}{6}\|\mathbf{v}_{r}-\mathbf{v}\|^{2}.\end{split} (77)

Let $\hat{\mathbf{v}}_{r}=\arg\min\limits_{\mathbf{v}}\left(\frac{1}{KI}\sum\limits_{k,t}\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})\right)^{\top}\mathbf{v}+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\mathbf{v}_{r-1}\|^{2}$; then we have

\begin{split}\bar{\mathbf{v}}_{r}-\hat{\mathbf{v}}_{r}=\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}\bigg{(}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};z_{r,t}^{k})\bigg{)}.\end{split} (78)

Hence we get

\small{3}⃝=1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)],𝐯r𝐯^r+1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)],𝐯^r𝐯=η~1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)]2+1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)],𝐯^r𝐯.\begin{split}&\small{3}⃝=\left\langle\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})],\mathbf{v}_{r}-\hat{\mathbf{v}}_{r}\right\rangle\\ &~{}~{}~{}~{}+\left\langle\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})],\hat{\mathbf{v}}_{r}-\mathbf{v}\right\rangle\\ &={\tilde{\eta}}\left\|\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right\|^{2}\\ &~{}~{}~{}~{}+\left\langle\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})],\hat{\mathbf{v}}_{r}-\mathbf{v}\right\rangle.\end{split} (79)

Define another auxiliary sequence as

\begin{split}\tilde{\mathbf{v}}_{r}=\tilde{\mathbf{v}}_{r-1}-\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}\left(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})\right),\text{ for }r>0;\quad\tilde{\mathbf{v}}_{0}=\mathbf{v}_{0}.\end{split} (80)

Denote

\Theta_{r}(\mathbf{v})=\left(\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\right)^{\top}\mathbf{v}+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r-1}\|^{2}. (81)

Hence, for the auxiliary sequence $\tilde{\mathbf{v}}_{r}$, we can verify that

𝐯~r=argmin𝐯Θr(𝐯).\tilde{\mathbf{v}}_{r}=\arg\min\limits_{\mathbf{v}}\Theta_{r}(\mathbf{v}). (82)

Since Θr(𝐯)\Theta_{r}(\mathbf{v}) is 1η~\frac{1}{\tilde{\eta}}-strongly convex, we have

\begin{split}&\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r}\|^{2}\leq\Theta_{r}(\mathbf{v})-\Theta_{r}(\tilde{\mathbf{v}}_{r})\\ &=\bigg{(}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{)}^{\top}\mathbf{v}+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r-1}\|^{2}\\ &~{}~{}~{}-\bigg{(}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{)}^{\top}\tilde{\mathbf{v}}_{r}-\frac{1}{2\tilde{\eta}}\|\tilde{\mathbf{v}}_{r}-\tilde{\mathbf{v}}_{r-1}\|^{2}\\ &=\bigg{(}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{)}^{\top}(\mathbf{v}-\tilde{\mathbf{v}}_{r-1})+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r-1}\|^{2}\\ &~{}~{}~{}-\bigg{(}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{)}^{\top}(\tilde{\mathbf{v}}_{r}-\tilde{\mathbf{v}}_{r-1})-\frac{1}{2\tilde{\eta}}\|\tilde{\mathbf{v}}_{r}-\tilde{\mathbf{v}}_{r-1}\|^{2}\\ &\leq\bigg{(}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{)}^{\top}(\mathbf{v}-\tilde{\mathbf{v}}_{r-1})+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r-1}\|^{2}\\ &~{}~{}~{}+\frac{\tilde{\eta}}{2}\bigg{\|}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{\|}^{2}\end{split} (83)

Adding this with (79), we get

3η~21KIk,t(𝐯Fks(𝐯r,tk,αr,tk;zr,tk)𝐯fks(𝐯r,tk,αr,tk))2+12η~𝐯𝐯~r1212η~𝐯𝐯~r2+1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)],𝐯^r𝐯~r1\begin{split}③\leq&\frac{3\tilde{\eta}}{2}\bigg{\|}\frac{1}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k}))\bigg{\|}^{2}+\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r-1}\|^{2}-\frac{1}{2\tilde{\eta}}\|\mathbf{v}-\tilde{\mathbf{v}}_{r}\|^{2}\\ &+\left\langle\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})],\hat{\mathbf{v}}_{r}-\tilde{\mathbf{v}}_{r-1}\right\rangle\end{split} (84)

Term ④ can be bounded as

\begin{split}④&=\frac{1}{\tilde{\eta}}\langle\mathbf{v}_{r}-\mathbf{v}_{r-1},\mathbf{v}-\mathbf{v}_{r}\rangle=\frac{1}{2\tilde{\eta}}(\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}-\|\mathbf{v}_{r}-\mathbf{v}\|^{2})\end{split} (85)
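The second equality is the standard three-point identity; since the proof relies on it, a one-line verification (elementary algebra, added here for completeness) is:

```latex
% Three-point identity: for any vectors a, b, c,
%   2\langle a-b,\,c-a\rangle = \|b-c\|^2 - \|a-b\|^2 - \|a-c\|^2,
% verified by expanding the right-hand side:
\begin{align*}
\|b-c\|^2-\|a-b\|^2-\|a-c\|^2
  &= -2\|a\|^2+2\langle a,b\rangle+2\langle a,c\rangle-2\langle b,c\rangle\\
  &= 2\langle a-b,\,c-a\rangle.
\end{align*}
% Taking a=\mathbf{v}_r, b=\mathbf{v}_{r-1}, c=\mathbf{v} gives (85).
```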

Plugging (76), (77), (84) and (85) into (75), we obtain

𝔼𝐯fs(𝐯r1,αr1),𝐯r𝐯32r+3𝔼𝐯¯r𝐯2+3η~2𝔼1KIk,t[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)]2+12η~𝔼(𝐯r1𝐯2𝐯r1𝐯r2𝐯r𝐯2)+12η~𝔼(𝐯~r1𝐯2𝐯~r𝐯2)\begin{split}&\mathbb{E}\left\langle\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\mathbf{v}_{r}-\mathbf{v}\right\rangle\\ &\leq\frac{3\ell}{2}\mathcal{E}_{r}+\frac{\ell}{3}\mathbb{E}\|\bar{\mathbf{v}}_{r}-\mathbf{v}\|^{2}+\frac{3\tilde{\eta}}{2}\mathbb{E}\left\|\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right\|^{2}\\ &~{}~{}~{}~{}~{}+\frac{1}{2\tilde{\eta}}\mathbb{E}(\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}-\|\mathbf{v}_{r}-\mathbf{v}\|^{2})+\frac{1}{2\tilde{\eta}}\mathbb{E}(\|\tilde{\mathbf{v}}_{r-1}-\mathbf{v}\|^{2}-\|\tilde{\mathbf{v}}_{r}-\mathbf{v}\|^{2})\end{split}

Similarly for α\alpha, noting that fksf^{s}_{k} is \ell-smooth and μ2\mu_{2}-strongly concave in α\alpha, we have

\begin{split}&\mathbb{E}\langle\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}),\alpha-\alpha_{r}\rangle\leq\frac{3\ell^{2}}{2\mu_{2}}\mathcal{E}_{r}+\frac{\mu_{2}}{3}\mathbb{E}(\bar{\alpha}_{r}-\alpha)^{2}\\ &~{}~{}~{}+\frac{3\tilde{\eta}}{2}\mathbb{E}\left(\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\alpha}f^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\alpha}F^{s}_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right)^{2}\\ &~{}~{}~{}+\frac{1}{2\tilde{\eta}}\mathbb{E}((\bar{\alpha}_{r-1}-\alpha)^{2}-(\bar{\alpha}_{r-1}-\bar{\alpha}_{r})^{2}-(\bar{\alpha}_{r}-\alpha)^{2})+\frac{1}{2\tilde{\eta}}\mathbb{E}((\alpha-\tilde{\alpha}_{r-1})^{2}-(\alpha-\tilde{\alpha}_{r})^{2})\end{split}

We next establish the following lemmas, in which the error terms Ξr\Xi_{r} and r\mathcal{E}_{r} are coupled with each other.

Lemma 11
\begin{split}\Xi_{r}&\leq 4\mathcal{E}_{r}+8\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}^{2}\sigma^{2}}{KI}.\end{split} (86)
{proof}
\begin{split}&\mathbb{E}[\|\mathbf{v}_{r}-\mathbf{v}_{r-1}\|^{2}]=\mathbb{E}\left\|-\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};z^{k}_{r,t})-c_{\mathbf{v}}^{k}+c_{\mathbf{v}})\right\|^{2}\\ &=\mathbb{E}\left\|-\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}\left[\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t};z^{k}_{r,t})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})+\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})\right]\right\|^{2}\\ &\leq\mathbb{E}\left\|-\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}\left[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})\right]\right\|^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &=\mathbb{E}\left\|-\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1})]-\tilde{\eta}\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\right\|^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 2\mathbb{E}\left\|\frac{\tilde{\eta}}{KI}\sum\limits_{k,t}[\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t},\alpha^{k}_{r,t})-\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}_{r-1},\alpha_{r-1})]\right\|^{2}+2\tilde{\eta}^{2}\mathbb{E}\left\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\right\|^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq\frac{2\tilde{\eta}^{2}\ell^{2}}{KI}\sum\limits_{k,t}\mathbb{E}[\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}+(\alpha^{k}_{r,t}-\alpha_{r-1})^{2}]+2\tilde{\eta}^{2}\mathbb{E}\left\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\right\|^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 2\tilde{\eta}^{2}\ell^{2}\mathcal{E}_{r}+2\tilde{\eta}^{2}\mathbb{E}\left\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\right\|^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}\end{split} (87)

Similarly,

𝔼[(αrαr1)2]2η~22r+2η~2𝔼(αfs(𝐯r1,αr1))2+η~2σ2KI.\begin{split}&\mathbb{E}[(\alpha_{r}-\alpha_{r-1})^{2}]\leq 2\tilde{\eta}^{2}\ell^{2}\mathcal{E}_{r}+2\tilde{\eta}^{2}\mathbb{E}\left(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\right)^{2}+\frac{\tilde{\eta}^{2}\sigma^{2}}{KI}.\end{split} (88)
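The σ²/KI terms in (87) and (88) come from averaging KI independent mean-zero stochastic-gradient noises, which shrinks the expected squared norm by a factor KI. A quick numerical illustration (toy dimensions and a unit noise level are illustrative assumptions, not from the paper):

```python
import numpy as np

# Averaging n = K*I independent mean-zero noise vectors with per-vector
# variance sigma^2 reduces the expected squared norm to sigma^2 / n.
rng = np.random.default_rng(0)
K, I, d, sigma, trials = 5, 10, 4, 1.0, 20_000
n = K * I
# per-coordinate std sigma/sqrt(d) gives E||xi||^2 = sigma^2 per draw
noise = rng.normal(0.0, sigma / np.sqrt(d), size=(trials, n, d))
mean_noise = noise.mean(axis=1)               # average over the K*I draws
emp = np.mean(np.sum(mean_noise**2, axis=1))  # empirical E||mean||^2
print(emp, sigma**2 / n)                      # both close to sigma^2 / (K*I)
```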

Using the 3\ell-smoothness of f^{s} and combining with the above results,

𝐯fs(𝐯r1,αr1)2+(αfs(𝐯r1,αr1))2=𝐯fs(𝐯r1,αr1)𝐯fs(𝐯r,αr)+𝐯fs(𝐯r,αr)2+(αfs(𝐯r1,αr1)αfs(𝐯r,αr)+αfs(𝐯r,αr))22[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+182(𝐯r1𝐯r2+(αr1αr)2)2[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+604η~2r+40η~22σ2KI2[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+224r+σ2144KI.\begin{split}&\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}\\ &=\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})-\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})+\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}\\ &~{}~{}~{}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})-\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r})+\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}\\ &\leq 2[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+18\ell^{2}(\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}+(\alpha_{r-1}-\alpha_{r})^{2})\\ &\leq 2[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+60\ell^{4}\tilde{\eta}^{2}\mathcal{E}_{r}+\frac{40\tilde{\eta}^{2}\ell^{2}\sigma^{2}}{KI}\\ &\leq 2[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{\ell^{2}}{24}\mathcal{E}_{r}+\frac{\sigma^{2}}{144KI}.\end{split} (89)

Thus,

Ξr=1KIk,t𝔼[𝐯r,tk𝐯r2+(αr,tkαr)2]2KIk,t𝔼[𝐯r,tk𝐯r12+𝐯r1𝐯r2+(αr,tkαr1)2+(αr1αr)2]2r+2𝔼[𝐯r1𝐯r2+(αr1αr)2]2r+8η~22r+4η~2𝔼[(𝐯fs(𝐯r1,αr1))2+(αfs(𝐯r1,αr1))2]+4η~2σ2KI3r+4η~2(2[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+224r+σ2144KI)+4η~2σ2KI4r+8η~2[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+5η~2σ2KI.\begin{split}\Xi_{r}&=\frac{1}{KI}\sum\limits_{k,t}\mathbb{E}[\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r}\|^{2}+(\alpha^{k}_{r,t}-\alpha_{r})^{2}]\\ &\leq\frac{2}{KI}\sum\limits_{k,t}\mathbb{E}[\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}+\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}+(\alpha^{k}_{r,t}-\alpha_{r-1})^{2}+(\alpha_{r-1}-\alpha_{r})^{2}]\\ &\leq 2\mathcal{E}_{r}+2\mathbb{E}[\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}+(\alpha_{r-1}-\alpha_{r})^{2}]\\ &\leq 2\mathcal{E}_{r}+8\tilde{\eta}^{2}\ell^{2}\mathcal{E}_{r}+4\tilde{\eta}^{2}\mathbb{E}[(\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]+\frac{4\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 3\mathcal{E}_{r}+4\tilde{\eta}^{2}\left(2[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{\ell^{2}}{24}\mathcal{E}_{r}+\frac{\sigma^{2}}{144KI}\right)+\frac{4\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 4\mathcal{E}_{r}+8\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}^{2}\sigma^{2}}{KI}.\end{split} (90)
Lemma 12
\begin{split}&\mathcal{E}_{r}\leq\frac{\tilde{\eta}\sigma^{2}}{2\ell I\eta_{g}^{2}}+\tilde{\eta}\ell\Xi_{r-1}+\frac{48\tilde{\eta}^{2}}{\eta_{g}^{2}}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}].\end{split} (91)
{proof}
\begin{split}\mathbb{E}\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}&=\mathbb{E}\|\mathbf{v}^{k}_{r,t-1}-\eta_{l}(\nabla_{\mathbf{v}}F^{s}_{k}(\mathbf{v}^{k}_{r,t-1},\alpha^{k}_{r,t-1};z^{k}_{r,t-1})-c_{\mathbf{v}}^{k}+c_{\mathbf{v}})-\mathbf{v}_{r-1}\|^{2}\\ &\leq\mathbb{E}\|\mathbf{v}^{k}_{r,t-1}-\eta_{l}(\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t-1},\alpha^{k}_{r,t-1})-\mathbb{E}[c^{k}_{\mathbf{v}}]+\mathbb{E}[c_{\mathbf{v}}])-\mathbf{v}_{r-1}\|^{2}+2\eta_{l}^{2}\sigma^{2}\\ &\leq\left(1+\frac{1}{I-1}\right)\mathbb{E}\|\mathbf{v}^{k}_{r,t-1}-\mathbf{v}_{r-1}\|^{2}+I\eta_{l}^{2}\mathbb{E}\|\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r,t-1},\alpha^{k}_{r,t-1})-\mathbb{E}[c^{k}_{\mathbf{v}}]+\mathbb{E}[c_{\mathbf{v}}]\|^{2}+2\eta_{l}^{2}\sigma^{2},\end{split} (92)

where \mathbb{E}[c^{k}_{\mathbf{v}}]=\frac{1}{I}\sum\limits_{t=1}^{I}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r-1,t},\alpha^{k}_{r-1,t}) and \mathbb{E}[c_{\mathbf{v}}]=\frac{1}{K}\sum\limits_{k=1}^{K}\frac{1}{I}\sum\limits_{t=1}^{I}\nabla_{\mathbf{v}}f^{s}_{k}(\mathbf{v}^{k}_{r-1,t},\alpha^{k}_{r-1,t}).

Then,

\begin{split}&I\eta_{l}^{2}\mathbb{E}\|\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}^{k}_{r,t-1},\alpha^{k}_{r,t-1})-\mathbb{E}[c^{k}_{\mathbf{v}}]+\mathbb{E}[c_{\mathbf{v}}]\|^{2}\\ &=I\eta_{l}^{2}\mathbb{E}\|\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}^{k}_{r,t-1},\alpha^{k}_{r,t-1})-\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r-1},\alpha_{r-1})+(\mathbb{E}[c_{\mathbf{v}}]-\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))\\ &~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})-(\mathbb{E}[c^{k}_{\mathbf{v}}]-\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))\|^{2}\\ &\leq 4I\eta_{l}^{2}\ell^{2}\bigg{(}\mathbb{E}[\|\mathbf{v}^{k}_{r,t-1}-\mathbf{v}_{r-1}\|^{2}]+\mathbb{E}[(\alpha^{k}_{r,t-1}-\alpha_{r-1})^{2}]\bigg{)}+4I\eta_{l}^{2}\mathbb{E}[\|\mathbb{E}[c^{k}_{\mathbf{v}}]-\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}]\\ &~{}~{}~{}+4I\eta_{l}^{2}\mathbb{E}[\|\mathbb{E}[c_{\mathbf{v}}]-\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}]+4I\eta_{l}^{2}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}]\\ &\leq 4I\eta_{l}^{2}\ell^{2}\bigg{(}\mathbb{E}[\|\mathbf{v}^{k}_{r,t-1}-\mathbf{v}_{r-1}\|^{2}]+\mathbb{E}[(\alpha^{k}_{r,t-1}-\alpha_{r-1})^{2}]\bigg{)}+4I\eta_{l}^{2}\ell^{2}\frac{1}{I}\sum_{\tau=1}^{I}\mathbb{E}[\|\mathbf{v}^{k}_{r-1,\tau}-\mathbf{v}_{r-1}\|^{2}+(\alpha^{k}_{r-1,\tau}-\alpha_{r-1})^{2}]\\ &~{}~{}~{}+4I\eta_{l}^{2}\ell^{2}\frac{1}{KI}\sum_{j=1}^{K}\sum_{t=1}^{I}\mathbb{E}[\|\mathbf{v}^{j}_{r-1,t}-\mathbf{v}_{r-1}\|^{2}+(\alpha^{j}_{r-1,t}-\alpha_{r-1})^{2}]+4I\eta_{l}^{2}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}]\end{split} (93)

For α\alpha, we have similar results, adding them together

\begin{split}&\mathbb{E}\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}+\mathbb{E}(\alpha^{k}_{r,t}-\alpha_{r-1})^{2}\leq\left(1+\frac{1}{I-1}+8I\eta_{l}^{2}\ell^{2}\right)(\mathbb{E}\|\mathbf{v}^{k}_{r,t-1}-\mathbf{v}_{r-1}\|^{2}+\mathbb{E}(\alpha^{k}_{r,t-1}-\alpha_{r-1})^{2})\\ &~{}~{}~{}+2\eta_{l}^{2}\sigma^{2}+4I\eta_{l}^{2}\ell^{2}\Xi_{r-1}+4I\eta_{l}^{2}\ell^{2}\frac{1}{I}\sum\limits_{\tau=1}^{I}\mathbb{E}[\|\mathbf{v}_{r-1,\tau}^{k}-\mathbf{v}_{r-1}\|^{2}+(\alpha_{r-1,\tau}^{k}-\alpha_{r-1})^{2}]\\ &~{}~{}~{}+4I\eta_{l}^{2}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\end{split} (94)

Taking average over all machines,

\begin{split}&\frac{1}{K}\sum\limits_{k}\mathbb{E}\|\mathbf{v}^{k}_{r,t}-\mathbf{v}_{r-1}\|^{2}+\mathbb{E}(\alpha^{k}_{r,t}-\alpha_{r-1})^{2}\\ &\leq\left(1+\frac{1}{I-1}+8I\eta_{l}^{2}\ell^{2}\right)\frac{1}{K}\sum_{k}(\mathbb{E}\|\mathbf{v}^{k}_{r,t-1}-\mathbf{v}_{r-1}\|^{2}+\mathbb{E}(\alpha^{k}_{r,t-1}-\alpha_{r-1})^{2})+2\eta_{l}^{2}\sigma^{2}\\ &~{}~{}~{}+8I\eta_{l}^{2}\ell^{2}\Xi_{r-1}+4I\eta_{l}^{2}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\\ &\leq\left(2\eta_{l}^{2}\sigma^{2}+8I\eta_{l}^{2}\ell^{2}\Xi_{r-1}+4I\eta_{l}^{2}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\right)\left(\sum\limits_{\tau=0}^{t-1}\Big(1+\frac{1}{I-1}+8I\eta_{l}^{2}\ell^{2}\Big)^{\tau}\right)\\ &\leq\left(\frac{2\tilde{\eta}^{2}\sigma^{2}}{I^{2}\eta_{g}^{2}}+\frac{8\tilde{\eta}^{2}\ell^{2}}{I\eta_{g}^{2}}\Xi_{r-1}+\frac{4\tilde{\eta}^{2}}{I\eta_{g}^{2}}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\right)3I\\ &\leq\left(\frac{\tilde{\eta}\sigma^{2}}{24\ell I^{2}\eta_{g}^{2}}+\frac{\tilde{\eta}\ell}{3I\eta_{g}^{2}}\Xi_{r-1}+\frac{4\tilde{\eta}^{2}}{I\eta_{g}^{2}}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\right)3I\end{split} (95)
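The second-to-last inequality in (95) bounds the geometric sum Σ_{τ=0}^{t−1}(1+1/(I−1)+8Iη_l²ℓ²)^τ by 3I. A quick numerical check of this bound under the illustrative small-step condition 8Iη_l²ℓ² ≤ 1/(2I) (an assumption made here for the check; the paper imposes its own step-size restriction):

```python
# Check sum_{tau=0}^{t-1} (1 + 1/(I-1) + eps)^tau <= 3*I for t <= I,
# with eps = 8*I*eta_l^2*ell^2 at its assumed maximum 1/(2*I).
worst = 0.0
for I in [2, 3, 5, 10, 50, 100, 1000]:
    eps = 1.0 / (2 * I)
    a = 1.0 / (I - 1) + eps
    total = sum((1.0 + a) ** tau for tau in range(I))  # worst case t = I
    worst = max(worst, total / (3 * I))
print(worst)  # stays below 1, i.e. the bound 3*I holds
```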

Taking average over t=1,,It=1,...,I,

rη~σ28Iηg2+η~Ξr1+12η~2ηg2𝔼[𝐯fs(𝐯r1,αr1)2+(αfs(𝐯r1,αr1))2]\begin{split}&\mathcal{E}_{r}\leq\frac{\tilde{\eta}\sigma^{2}}{8\ell I\eta_{g}^{2}}+{\tilde{\eta}\ell}\Xi_{r-1}+\frac{12\tilde{\eta}^{2}}{\eta_{g}^{2}}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r-1},\alpha_{r-1}))^{2}]\end{split} (96)

Using (89), we have

rη~σ28Iηg2+η~Ξr1+12η~2ηg2(4[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+224r+σ2144KI).\begin{split}&\mathcal{E}_{r}\leq\frac{\tilde{\eta}\sigma^{2}}{8\ell I\eta_{g}^{2}}+\tilde{\eta}\ell\Xi_{r-1}+\frac{12\tilde{\eta}^{2}}{\eta_{g}^{2}}\left(4[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{\ell^{2}}{24}\mathcal{E}_{r}+\frac{\sigma^{2}}{144KI}\right).\end{split} (97)

Rearranging terms,

rη~σ22Iηg2+η~Ξr1+48η~2ηg2[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]\begin{split}\mathcal{E}_{r}\leq\frac{\tilde{\eta}\sigma^{2}}{2\ell I\eta_{g}^{2}}+\tilde{\eta}\ell\Xi_{r-1}+\frac{48\tilde{\eta}^{2}}{\eta_{g}^{2}}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]\end{split} (98)

D.1 Main Proof of Lemma 2

{proof}

Plugging Lemma 10 into Lemma 9, we get

1Rr=1R[fs(𝐯r,α)fs(𝐯,αr)]1Rr=1R[(3+32/μ2212η~)𝐯r1𝐯r2+(212η~)(αrαr1)2C1+(12η~μ23)(αr1α)2(12η~μ23)(αrα)2C2+(12η~3)𝐯r1𝐯2(12η~3)𝐯r𝐯2C3+12η~((αα~r1)2(αα~r)2)C4+(32+322μ2)rC5+3η~21KIk,i[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)]2C6+3η~2(1KIk,iαfks(𝐯r,tk,αr,tk)αFks(𝐯r,tk,αr,tk;zr,tk))2C7\begin{split}&\frac{1}{R}\sum\limits_{r=1}^{R}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\frac{1}{R}\sum\limits_{r=1}^{R}\Bigg{[}\underbrace{\left(\frac{3\ell+3\ell^{2}/\mu_{2}}{2}-\frac{1}{2\tilde{\eta}}\right)\|\mathbf{v}_{r-1}-\mathbf{v}_{r}\|^{2}+\left(2\ell-\frac{1}{2\tilde{\eta}}\right)(\alpha_{r}-\alpha_{r-1})^{2}}_{C_{1}}\\ &+\underbrace{\left(\frac{1}{2\tilde{\eta}}-\frac{\mu_{2}}{3}\right)(\alpha_{r-1}-\alpha)^{2}-\left(\frac{1}{2\tilde{\eta}}-\frac{\mu_{2}}{3}\right)(\alpha_{r}-\alpha)^{2}}_{C_{2}}+\underbrace{\left(\frac{1}{2\tilde{\eta}}-\frac{\ell}{3}\right)\|\mathbf{v}_{r-1}-\mathbf{v}\|^{2}-\left(\frac{1}{2\tilde{\eta}}-\frac{\ell}{3}\right)\|\mathbf{v}_{r}-\mathbf{v}\|^{2}}_{C_{3}}\\ &+\underbrace{\frac{1}{2\tilde{\eta}}((\alpha-\tilde{\alpha}_{r-1})^{2}-(\alpha-\tilde{\alpha}_{r})^{2})}_{C_{4}}+\underbrace{\left(\frac{3\ell}{2}+\frac{3\ell^{2}}{2\mu_{2}}\right)\mathcal{E}_{r}}_{C_{5}}\\ &+\underbrace{\frac{3\tilde{\eta}}{2}\left\|\frac{1}{KI}\sum\limits_{k,i}[\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right\|^{2}}_{C_{6}}+\underbrace{\frac{3\tilde{\eta}}{2}\left(\frac{1}{KI}\sum\limits_{k,i}\nabla_{\alpha}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\alpha}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})\right)^{2}}_{C_{7}}\\ \end{split} (99)

Since \tilde{\eta}\leq\min(\frac{1}{3\ell+3\ell^{2}/\mu_{2}},\frac{1}{4\ell},\frac{3}{2\mu_{2}}), the term C1C_{1} on the RHS of (99) is non-positive and can be dropped. C2C_{2}, C3C_{3} and C4C_{4} are handled by telescoping sums. C5C_{5} is bounded by Lemma 12.

Taking expectation over C6C_{6},

𝔼[3η~21KIk,i[𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)]2]=𝔼[3η~2K2I2k,i𝐯fks(𝐯r,tk,αr,tk)𝐯Fks(𝐯r,tk,αr,tk;zr,tk)2]3η~σ22KI.\begin{split}&\mathbb{E}\left[\frac{3\tilde{\eta}}{2}\left\|\frac{1}{KI}\sum\limits_{k,i}[\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})]\right\|^{2}\right]\\ &=\mathbb{E}\left[\frac{3\tilde{\eta}}{2K^{2}I^{2}}\sum\limits_{k,i}\|\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})\|^{2}\right]\\ &\leq\frac{3\tilde{\eta}\sigma^{2}}{2KI}.\end{split} (100)

The equality is due to
\mathbb{E}_{r,t}\left\langle\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k}),\nabla_{\mathbf{v}}f_{j}^{s}(\mathbf{v}_{r,t}^{j},\alpha_{r,t}^{j})-\nabla_{\mathbf{v}}F_{j}^{s}(\mathbf{v}_{r,t}^{j},\alpha_{r,t}^{j};z_{r,t}^{j})\right\rangle=0 for any k\neq j, as each machine draws its data independently, where \mathbb{E}_{r,t} denotes the expectation in round r conditioned on the events up to iteration t. The last inequality holds because \|\nabla_{\mathbf{v}}f_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\mathbf{v}}F_{k}^{s}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};z_{r,t}^{k})\|^{2}\leq\sigma^{2} for any k. Similarly, taking expectation over C_{7}, we have

𝔼[3η~2(1KIk,t[αfk(𝐯r,tk,αr,tk)αFk(𝐯r,tk,αr,tk;𝐳r,tk)])2]3η~σ22KI.\begin{split}&\mathbb{E}\left[\frac{3\tilde{\eta}}{2}\left(\frac{1}{KI}\sum\limits_{k,t}[\nabla_{\alpha}f_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k})-\nabla_{\alpha}F_{k}(\mathbf{v}_{r,t}^{k},\alpha_{r,t}^{k};\mathbf{z}_{r,t}^{k})]\right)^{2}\right]\leq\frac{3\tilde{\eta}\sigma^{2}}{2KI}.\end{split} (101)

Plugging (100) and (101) into (99), and taking expectation, it yields

\begin{split}&\frac{1}{R}\sum\limits_{r}\mathbb{E}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\mathbb{E}\bigg{\{}\frac{1}{R}\left(\frac{1}{2\tilde{\eta}}-\frac{\ell}{3}\right)\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{R}\left(\frac{1}{2\tilde{\eta}}-\frac{\mu_{2}}{3}\right)(\alpha_{0}-\alpha)^{2}+\frac{1}{2\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{2\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}\\ &~{}~{}~{}~{}~{}+\frac{1}{R}\sum\limits_{r=1}^{R}\left(\frac{3\ell^{2}}{2\mu_{2}}+\frac{3\ell}{2}\right)\mathcal{E}_{r}+\frac{3\tilde{\eta}\sigma^{2}}{KI}\bigg{\}}\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{3\ell^{2}}{\mu_{2}}\frac{1}{R}\sum\limits_{r=1}^{R}\mathcal{E}_{r}+\frac{3\tilde{\eta}\sigma^{2}}{KI},\end{split}

where we use 𝐯0=𝐯¯0\mathbf{v}_{0}=\bar{\mathbf{v}}_{0}, and α0=α¯0\alpha_{0}=\bar{\alpha}_{0} in the last inequality.

Using Lemma 12,

1Rr𝔼[fs(𝐯r,α)fs(𝐯,αr)]1η~R𝐯0𝐯2+1η~R(α0α)2+32μ21Rr=1Rr+3η~σ2KI1η~R𝐯0𝐯2+1η~R(α0α)2+32μ21Rr=1R[(η~σ22Iηg2+η~Ξr1+48η~2ηg2𝔼[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2])]+3η~σ2KI1η~R𝐯0𝐯2+1η~R(α0α)2+3η~3μ2Rηg2rΞr1+5μ2Iηg2η~σ2+3000η~24μ22ηg21Rr=1RGapr\begin{split}&\frac{1}{R}\sum\limits_{r}\mathbb{E}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{3\ell^{2}}{\mu_{2}}\frac{1}{R}\sum\limits_{r=1}^{R}\mathcal{E}_{r}+\frac{3\tilde{\eta}\sigma^{2}}{KI}\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}\\ &~{}~{}~{}+\frac{3\ell^{2}}{\mu_{2}}\frac{1}{R}\sum\limits_{r=1}^{R}\left[\left(\frac{\tilde{\eta}\sigma^{2}}{2\ell I\eta_{g}^{2}}+\tilde{\eta}\ell\Xi_{r-1}+\frac{48\tilde{\eta}^{2}}{\eta_{g}^{2}}\mathbb{E}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]\right)\right]+\frac{3\tilde{\eta}\sigma^{2}}{KI}\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{3\tilde{\eta}\ell^{3}}{\mu_{2}R\eta_{g}^{2}}\sum_{r}\Xi_{r-1}+\frac{5\ell}{\mu_{2}I\eta_{g}^{2}}\tilde{\eta}\sigma^{2}+\frac{3000\tilde{\eta}^{2}\ell^{4}}{\mu_{2}^{2}\eta_{g}^{2}}\frac{1}{R}\sum\limits_{r=1}^{R}Gap_{r}\end{split}

where the last inequality holds because

𝐯fs(𝐯r,αr)2+αfs(𝐯r,αr)292(𝐯r𝐯fs2+(αrαfs)2)182μ2Gaps(𝐯r,αr),\begin{split}\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+\|\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}\leq 9\ell^{2}(\|\mathbf{v}_{r}-\mathbf{v}^{*}_{f_{s}}\|^{2}+(\alpha_{r}-\alpha^{*}_{f_{s}})^{2})\leq\frac{18\ell^{2}}{\mu_{2}}Gap_{s}(\mathbf{v}_{r},\alpha_{r}),\end{split} (102)

where (𝐯fs,αfs)(\mathbf{v}^{*}_{f^{s}},\alpha^{*}_{f^{s}}) denotes a saddle point of fsf^{s} and the second inequality uses the strong convexity and strong concavity of fsf^{s}. In detail,

Gaps(𝐯r,αr)=maxαfs(𝐯r,α)fs(𝐯fs,αfs)+fs(𝐯fs,αfs)min𝐯fs(𝐯,αr)2𝐯r𝐯fs2+μ22(αrαfs)2.\begin{split}Gap_{s}(\mathbf{v}_{r},\alpha_{r})&=\max\limits_{\alpha}f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v}^{*}_{f^{s}},\alpha^{*}_{f^{s}})+f^{s}(\mathbf{v}^{*}_{f^{s}},\alpha^{*}_{f^{s}})-\min\limits_{\mathbf{v}}f^{s}(\mathbf{v},\alpha_{r})\\ &\geq\frac{\ell}{2}\|\mathbf{v}_{r}-\mathbf{v}^{*}_{f^{s}}\|^{2}+\frac{\mu_{2}}{2}(\alpha_{r}-\alpha^{*}_{f^{s}})^{2}.\end{split} (103)
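For completeness, the chain of inequalities in (102) can be filled in as follows (a short derivation using the 3\ell-smoothness of f^{s} and the first-order optimality of the saddle point; the intermediate steps are added here, not verbatim from the paper):

```latex
\begin{align*}
\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}
 &= \big\|\nabla f^{s}(\mathbf{v}_{r},\alpha_{r})-\nabla f^{s}(\mathbf{v}^{*}_{f^{s}},\alpha^{*}_{f^{s}})\big\|^{2}
 && (\nabla f^{s}(\mathbf{v}^{*}_{f^{s}},\alpha^{*}_{f^{s}})=0)\\
 &\leq 9\ell^{2}\big(\|\mathbf{v}_{r}-\mathbf{v}^{*}_{f^{s}}\|^{2}+(\alpha_{r}-\alpha^{*}_{f^{s}})^{2}\big)
 && (3\ell\text{-smoothness of } f^{s})\\
 &\leq \frac{18\ell^{2}}{\mu_{2}}\,Gap_{s}(\mathbf{v}_{r},\alpha_{r})
 && (\text{by } (103) \text{ and } \mu_{2}\leq\ell).
\end{align*}
```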

Using Lemma 11, we have

\begin{split}\Xi_{r}&\leq 4\mathcal{E}_{r}+8\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 4\left(\frac{\tilde{\eta}\sigma^{2}}{2\ell I\eta_{g}^{2}}+\tilde{\eta}\ell\Xi_{r-1}+\frac{48\tilde{\eta}^{2}}{\eta_{g}^{2}}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]\right)\\ &~{}~{}~{}+8\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}^{2}\sigma^{2}}{KI}\\ &\leq 4\tilde{\eta}\ell\Xi_{r-1}+160\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}\sigma^{2}}{KI}\left(1+\frac{K}{\eta_{g}^{2}}\right)\\ &\leq\Xi_{r-1}+160\tilde{\eta}^{2}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}\sigma^{2}}{KI}\left(1+\frac{K}{\eta_{g}^{2}}\right).\end{split} (104)

Thus,

2η~3μ2Rηg2r=1RΞr2η~3μ2Rηg2rΞr1+320η~33μ2Rηg2r=1R[𝐯fs(𝐯r,αr)2+(αfs(𝐯r,αr))2]+5η~σ2KI(1+Kηg2)2η~3μ2Rηg2rΞr1+12RrGapr+5η~σ2KI(1+Kηg2)\begin{split}\frac{2\tilde{\eta}\ell^{3}}{\mu_{2}R\eta_{g}^{2}}\sum\limits_{r=1}^{R}\Xi_{r}&\leq\frac{2\tilde{\eta}\ell^{3}}{\mu_{2}R\eta_{g}^{2}}\sum_{r}\Xi_{r-1}+\frac{320\tilde{\eta}^{3}\ell^{3}}{\mu_{2}R\eta_{g}^{2}}\sum\limits_{r=1}^{R}[\|\nabla_{\mathbf{v}}f^{s}(\mathbf{v}_{r},\alpha_{r})\|^{2}+(\nabla_{\alpha}f^{s}(\mathbf{v}_{r},\alpha_{r}))^{2}]+\frac{5\tilde{\eta}\sigma^{2}}{KI}(1+\frac{K}{\eta_{g}^{2}})\\ &\leq\frac{2\tilde{\eta}\ell^{3}}{\mu_{2}R\eta_{g}^{2}}\sum_{r}\Xi_{r-1}+\frac{1}{2R}\sum\limits_{r}Gap_{r}+\frac{5\tilde{\eta}\sigma^{2}}{KI}(1+\frac{K}{\eta_{g}^{2}})\end{split} (105)

Taking Ξ0=0\Xi_{0}=0,

1Rr𝔼[fs(𝐯r,α)fs(𝐯,αr)]1η~R𝐯0𝐯2+1η~R(α0α)2+12RrGapr+5η~σ2KI(1+Kηg2).\begin{split}&\frac{1}{R}\sum\limits_{r}\mathbb{E}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{1}{2R}\sum\limits_{r}Gap_{r}+\frac{5\tilde{\eta}\sigma^{2}}{KI}(1+\frac{K}{\eta_{g}^{2}}).\end{split}

It follows that

1Rr𝔼[fs(𝐯r,α)fs(𝐯,αr)]12RrGapr1η~R𝐯0𝐯2+1η~R(α0α)2+5η~σ2KI(1+Kηg2).\begin{split}&\frac{1}{R}\sum\limits_{r}\mathbb{E}[f^{s}(\mathbf{v}_{r},\alpha)-f^{s}(\mathbf{v},\alpha_{r})]-\frac{1}{2R}\sum\limits_{r}Gap_{r}\\ &\leq\frac{1}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{1}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{5\tilde{\eta}\sigma^{2}}{KI}(1+\frac{K}{\eta_{g}^{2}}).\end{split}

Sampling r~\tilde{r} uniformly at random from 1,,R1,...,R, we have

𝔼[Gapr~s]2η~R𝐯0𝐯2+2η~R(α0α)2+10η~σ2KI(1+Kηg2).\begin{split}\mathbb{E}[Gap^{s}_{\tilde{r}}]\leq\frac{2}{\tilde{\eta}R}\|\mathbf{v}_{0}-\mathbf{v}\|^{2}+\frac{2}{\tilde{\eta}R}(\alpha_{0}-\alpha)^{2}+\frac{10\tilde{\eta}\sigma^{2}}{KI}\left(1+\frac{K}{\eta_{g}^{2}}\right).\end{split} (106)

Appendix E Proof of Theorem 1

{proof}

Since f(𝐯,α)f(\mathbf{v},\alpha) is \ell-weakly convex in 𝐯\mathbf{v} for any α\alpha, ϕ(𝐯)=maxαf(𝐯,α)\phi(\mathbf{v})=\max\limits_{\alpha^{\prime}}f(\mathbf{v},\alpha^{\prime}) is also \ell-weakly convex. Taking γ=2\gamma=2\ell, we have

ϕ(𝐯s1)ϕ(𝐯s)+ϕ(𝐯s),𝐯s1𝐯s2𝐯s1𝐯s2=ϕ(𝐯s)+ϕ(𝐯s)+2(𝐯s𝐯s1),𝐯s1𝐯s+32𝐯s1𝐯s2=(a)ϕ(𝐯s)+ϕs(𝐯s),𝐯s1𝐯s+32𝐯s1𝐯s2=(b)ϕ(𝐯s)12ϕs(𝐯s),ϕs(𝐯s)ϕ(𝐯s)+38ϕs(𝐯s)ϕ(𝐯s)2=ϕ(𝐯s)18ϕs(𝐯s)214ϕs(𝐯s),ϕ(𝐯s)+38ϕ(𝐯s)2,\displaystyle\begin{split}\phi(\mathbf{v}_{s-1})&\geq\phi(\mathbf{v}_{s})+\langle\partial\phi(\mathbf{v}_{s}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle-\frac{\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\ &=\phi(\mathbf{v}_{s})+\langle\partial\phi(\mathbf{v}_{s})+2\ell(\mathbf{v}_{s}-\mathbf{v}_{s-1}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\ &\overset{(a)}{=}\phi(\mathbf{v}_{s})+\langle\partial\phi_{s}(\mathbf{v}_{s}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle+\frac{3\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}\\ &\overset{(b)}{=}\phi(\mathbf{v}_{s})-\frac{1}{2\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\rangle+\frac{3}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\|^{2}\\ &=\phi(\mathbf{v}_{s})-\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{1}{4\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi(\mathbf{v}_{s})\rangle+\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2},\end{split} (107)

where (a)(a) and (b)(b) hold by the definition of ϕs(𝐯)\phi_{s}(\mathbf{v}).
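Steps (a) and (b) can be made fully explicit. Assuming, as in the stagewise construction, that \phi_{s}(\mathbf{v})=\phi(\mathbf{v})+\ell\|\mathbf{v}-\mathbf{v}_{s-1}\|^{2} (this definition is taken from the surrounding context rather than restated in this appendix), we have:

```latex
\begin{align*}
\partial\phi_{s}(\mathbf{v}_{s})
 &= \partial\phi(\mathbf{v}_{s})+2\ell(\mathbf{v}_{s}-\mathbf{v}_{s-1})
 \quad\Longrightarrow\quad
 \mathbf{v}_{s-1}-\mathbf{v}_{s}
 =\frac{1}{2\ell}\big(\partial\phi(\mathbf{v}_{s})-\partial\phi_{s}(\mathbf{v}_{s})\big),\\
\langle\partial\phi_{s}(\mathbf{v}_{s}),\mathbf{v}_{s-1}-\mathbf{v}_{s}\rangle
 &= -\frac{1}{2\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),
      \partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\rangle,\\
\frac{3\ell}{2}\|\mathbf{v}_{s-1}-\mathbf{v}_{s}\|^{2}
 &= \frac{3}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})-\partial\phi(\mathbf{v}_{s})\|^{2},
\end{align*}
```

which are exactly the substitutions used in (a) and (b).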

Rearranging the terms in (107) yields

\displaystyle\begin{split}\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{s-1})&\leq\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\frac{1}{4\ell}\langle\partial\phi_{s}(\mathbf{v}_{s}),\partial\phi(\mathbf{v}_{s})\rangle-\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\ &\overset{(a)}{\leq}\frac{1}{8\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\frac{1}{8\ell}(\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}+\|\partial\phi(\mathbf{v}_{s})\|^{2})-\frac{3}{8\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\ &=\frac{1}{4\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{1}{4\ell}\|\partial\phi(\mathbf{v}_{s})\|^{2}\\ &\overset{(b)}{\leq}\frac{1}{4\ell}\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}-\frac{\mu}{2\ell}(\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{*}))\end{split} (108)

where (a)(a) holds by using 𝐚,𝐛12(𝐚2+𝐛2)\langle\mathbf{a},\mathbf{b}\rangle\leq\frac{1}{2}(\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2}), and (b)(b) holds by the μ\mu-PL property of ϕ(𝐯)\phi(\mathbf{v}).

Thus, we have

\displaystyle\left(4\ell+2\mu\right)(\phi(\mathbf{v}_{s})-\phi(\mathbf{v}_{*}))-4\ell(\phi(\mathbf{v}_{s-1})-\phi(\mathbf{v}_{*}))\leq\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}. (109)

Since \gamma=2\ell, f^{s}(\mathbf{v},\alpha) is \ell-strongly convex in \mathbf{v} and \mu_{2}-strongly concave in \alpha. Applying Lemma 3 to f^{s}, we know that

4𝐯^s(αs)𝐯0s2+μ24(α^s(𝐯s)α0s)2Gaps(𝐯0s,α0s)+Gaps(𝐯s,αs).\displaystyle\frac{\ell}{4}\|\hat{\mathbf{v}}_{s}(\alpha_{s})-\mathbf{v}_{0}^{s}\|^{2}+\frac{\mu_{2}}{4}(\hat{\alpha}_{s}(\mathbf{v}_{s})-\alpha_{0}^{s})^{2}\leq\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s}). (110)

By the settings of \tilde{\eta}_{s}, I_{s}=I_{0}\cdot 2^{s}, and R_{s}=\frac{1000}{\tilde{\eta}\min(\ell,\mu_{2})}, we note that \frac{4}{\tilde{\eta}R_{s}}\leq\frac{\min\{\ell,\mu_{2}\}}{212}. Applying Lemma 2, we have

𝔼[Gaps(𝐯s,αs)]10η~σ2KI02s+153𝔼[4𝐯^s(αs)𝐯0s2+μ24(α^s(𝐯s)α0s)2]10η~σ2KI02s+153𝔼[Gaps(𝐯0s,α0s)+Gaps(𝐯s,αs)].\displaystyle\begin{split}&\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]\leq\frac{10\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{1}{53}\mathbb{E}\left[\frac{\ell}{4}\|\hat{\mathbf{v}}_{s}(\alpha_{s})-\mathbf{v}_{0}^{s}\|^{2}+\frac{\mu_{2}}{4}(\hat{\alpha}_{s}(\mathbf{v}_{s})-\alpha_{0}^{s})^{2}\right]\\ &\leq\frac{10\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{1}{53}\mathbb{E}\left[\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right].\end{split} (111)
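As a sanity check on the stagewise settings above: R_s is a constant independent of the number of machines K and of the stage index s, while I_s doubles per stage, so the noise term 10η~σ²/(K I_0 2^s) in (111) halves each stage. A toy computation (the numeric values of η~, ℓ, μ2, I_0, K, σ² are illustrative assumptions, not from the paper):

```python
# Illustrative parameter values (assumptions, not from the paper).
ell, mu2 = 1.0, 0.1
eta_tilde = 1.0 / (48 * ell)
I0, K, sigma2 = 4, 16, 1.0

# Rounds per stage: depends only on eta_tilde, ell, mu2 -- no K, no s.
R = 1000 / (eta_tilde * min(ell, mu2))
rounds = [R for s in range(5)]
# Per-stage noise term from (111): halves as I_s = I0 * 2^s doubles.
noise = [10 * eta_tilde * sigma2 / (K * I0 * 2**s) for s in range(5)]

print(R, noise)
```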

Since $\phi(\mathbf{v})$ is $L$-smooth and $\gamma=2\ell$, $\phi_{s}(\mathbf{v})$ is $\hat{L}=(L+2\ell)$-smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have

\displaystyle\begin{split}&\mathbb{E}[\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}]\leq 2\hat{L}\,\mathbb{E}\Big[\phi_{s}(\mathbf{v}_{s})-\min\limits_{\mathbf{v}\in\mathbb{R}^{d}}\phi_{s}(\mathbf{v})\Big]\leq 2\hat{L}\,\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]\\ &=2\hat{L}\,\mathbb{E}[4\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})-3\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})]\\ &\leq 2\hat{L}\,\mathbb{E}\left[4\left(\frac{10\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{1}{53}\left(\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})+\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right)\right)-3\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right]\\ &=2\hat{L}\,\mathbb{E}\left[\frac{40\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})-\frac{155}{53}\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})\right].\end{split} (112)
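For reference, the consequence of smoothness invoked in the first inequality above (Theorem 2.1.5 of [DBLP:books/sp/Nesterov04]) is that any $\hat{L}$-smooth function $\phi_{s}$ satisfies

```latex
\|\partial\phi_{s}(\mathbf{v})\|^{2}
  \leq 2\hat{L}\Big(\phi_{s}(\mathbf{v})-\min_{\mathbf{v}'\in\mathbb{R}^{d}}\phi_{s}(\mathbf{v}')\Big)
  \quad \text{for all } \mathbf{v}\in\mathbb{R}^{d},
```

and the second inequality holds because, by weak duality, the primal suboptimality $\phi_{s}(\mathbf{v}_{s})-\min_{\mathbf{v}}\phi_{s}(\mathbf{v})$ is upper bounded by the duality gap $\text{Gap}_{s}(\mathbf{v}_{s},\alpha_{s})$.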

Applying Lemma 4 to (112), we have

\displaystyle\begin{split}&\mathbb{E}[\|\partial\phi_{s}(\mathbf{v}_{s})\|^{2}]\leq 2\hat{L}\,\mathbb{E}\bigg{[}\frac{40\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\\ &\quad-\frac{155}{53}\left(\frac{3}{50}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})+\frac{4}{5}(\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{0}^{s}))\right)\bigg{]}\\ &=2\hat{L}\,\mathbb{E}\bigg{[}\frac{40\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}+\frac{4}{53}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})-\frac{93}{530}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})-\frac{124}{53}(\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{0}^{s}))\bigg{]}.\end{split} (113)

Combining this with (109) (noting that $\mathbf{v}_{s}=\mathbf{v}_{0}^{s+1}$ and $\mathbf{v}_{s-1}=\mathbf{v}_{0}^{s}$), rearranging the terms, and defining the constant $c=4\ell+\frac{248}{53}\hat{L}\in O(L+\ell)$, we get

\displaystyle\begin{split}&\left(c+2\mu\right)\mathbb{E}[\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{*})]+\frac{93}{265}\hat{L}\,\mathbb{E}[\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})]\\ &\leq\left(4\ell+\frac{248}{53}\hat{L}\right)\mathbb{E}[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})]+\frac{8\hat{L}}{53}\mathbb{E}[\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})]+\frac{80\hat{L}\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}\\ &\leq c\,\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\right]+\frac{80\hat{L}\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}.\end{split} (114)

Using the fact that $\hat{L}\geq\mu$,

\displaystyle\begin{split}(c+2\mu)\frac{8\hat{L}}{53c}=\left(4\ell+\frac{248}{53}\hat{L}+2\mu\right)\frac{8\hat{L}}{53(4\ell+\frac{248}{53}\hat{L})}\leq\frac{8\hat{L}}{53}+\frac{16\mu\hat{L}}{248\hat{L}}\leq\frac{93}{265}\hat{L}.\end{split} (115)
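The arithmetic behind the last step of (115) is a direct numeric check: since $53c\geq 248\hat{L}$ and $\hat{L}\geq\mu$,

```latex
(c+2\mu)\frac{8\hat{L}}{53c}
  = \frac{8\hat{L}}{53} + \frac{16\mu\hat{L}}{53c}
  \leq \frac{8\hat{L}}{53} + \frac{16\mu\hat{L}}{248\hat{L}}
  \leq \left(\frac{8}{53}+\frac{2}{31}\right)\hat{L}
  \approx 0.215\,\hat{L}
  \leq \frac{93}{265}\hat{L}\approx 0.351\,\hat{L}.
```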

Then, we have

\displaystyle\begin{split}&(c+2\mu)\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s+1})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s+1}(\mathbf{v}_{0}^{s+1},\alpha_{0}^{s+1})\right]\\ &\leq c\,\mathbb{E}\left[\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})\right]+\frac{80\hat{L}\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}.\end{split} (116)

Defining $\Delta_{s}=\phi(\mathbf{v}_{0}^{s})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{s}(\mathbf{v}_{0}^{s},\alpha_{0}^{s})$, we then have

\displaystyle\begin{split}&\mathbb{E}[\Delta_{s+1}]\leq\frac{c}{c+2\mu}\mathbb{E}[\Delta_{s}]+\frac{80\hat{L}}{c+2\mu}\frac{\tilde{\eta}\sigma^{2}}{KI_{0}2^{s}}.\end{split} (117)

Applying this inequality recursively yields

\displaystyle\begin{split}&\mathbb{E}[\Delta_{S+1}]\leq\left(\frac{c}{c+2\mu}\right)^{S}\mathbb{E}[\Delta_{1}]+\frac{80\hat{L}}{c+2\mu}\frac{\tilde{\eta}\sigma^{2}}{KI_{0}}\sum\limits_{s=1}^{S}\left(\exp\left(-\frac{2\mu}{c+2\mu}(s-1)\right)\left(\frac{c}{c+2\mu}\right)^{S+1-s}\right)\\ &\leq 2\epsilon_{0}\exp\left(\frac{-2\mu S}{c+2\mu}\right)+\frac{80\tilde{\eta}\hat{L}\sigma^{2}}{(c+2\mu)KI_{0}}S\exp\left(-\frac{2\mu S}{c+2\mu}\right),\end{split} (118)

where the second inequality uses the fact $1-x\leq\exp(-x)$, and

\displaystyle\begin{split}\Delta_{1}&=\phi(\mathbf{v}_{0}^{1})-\phi(\mathbf{v}_{*})+\frac{8\hat{L}}{53c}\text{Gap}_{1}(\mathbf{v}_{0}^{1},\alpha_{0}^{1})\\ &\leq\phi(\mathbf{v}_{0})-\phi(\mathbf{v}_{*})+\left(f(\mathbf{v}_{0},\hat{\alpha}_{1}(\mathbf{v}_{0}))+\frac{\gamma}{2}\|\mathbf{v}_{0}-\mathbf{v}_{0}\|^{2}-f(\hat{\mathbf{v}}_{1}(\alpha_{0}),\alpha_{0})-\frac{\gamma}{2}\|\hat{\mathbf{v}}_{1}(\alpha_{0})-\mathbf{v}_{0}\|^{2}\right)\\ &\leq\epsilon_{0}+f(\mathbf{v}_{0},\hat{\alpha}_{1}(\mathbf{v}_{0}))-f(\hat{\mathbf{v}}_{1}(\alpha_{0}),\alpha_{0})\leq 2\epsilon_{0}.\end{split} (119)
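As a quick numerical illustration (not part of the proof), the recursion (117) can be simulated with hypothetical constants; the upper-bound sequence $\Delta_{s}$ contracts geometrically while the noise term is halved at each stage, matching the decay predicted by (118). All numeric values below are illustrative placeholders.

```python
# Numerical illustration of recursion (117): Delta_{s+1} <= rho * Delta_s + noise_s.
# All constants are hypothetical placeholders, not values from the paper.
c, mu = 10.0, 0.5                     # hypothetical c = O(L + ell) and PL constant mu
L_hat, eta, sigma2 = 12.0, 0.1, 1.0   # hypothetical smoothness, step size, variance
K, I_0 = 8, 100                       # number of machines, initial stage length
rho = c / (c + 2 * mu)                # contraction factor, strictly less than 1

delta = 2.0                           # Delta_1 <= 2 * eps_0, taking eps_0 = 1
deltas = [delta]
for s in range(1, 31):
    # per-stage noise term: 80 * L_hat / (c + 2 mu) * eta * sigma2 / (K * I_0 * 2^s)
    noise = (80 * L_hat / (c + 2 * mu)) * eta * sigma2 / (K * I_0 * 2 ** s)
    delta = rho * delta + noise
    deltas.append(delta)

print(deltas[-1])  # the bound decays geometrically toward zero
```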

To make this less than ϵ\epsilon, it suffices to make

\displaystyle\begin{split}&2\epsilon_{0}\exp\left(\frac{-2\mu S}{c+2\mu}\right)\leq\frac{\epsilon}{2},\\ &\frac{80\tilde{\eta}\hat{L}\sigma^{2}}{(c+2\mu)KI_{0}}S\exp\left(-\frac{2\mu S}{c+2\mu}\right)\leq\frac{\epsilon}{2}.\end{split} (120)

To this end, let $S$ be the smallest value such that $\exp\left(\frac{-2\mu S}{c+2\mu}\right)\leq\min\left\{\frac{\epsilon}{4\epsilon_{0}},\frac{(c+2\mu)\epsilon}{160\hat{L}S}\frac{KI_{0}}{\tilde{\eta}\sigma^{2}}\right\}$, i.e., the smallest value such that $S>\max\left\{\frac{c+2\mu}{2\mu}\log\frac{4\epsilon_{0}}{\epsilon},\frac{c+2\mu}{2\mu}\log\left(\frac{160\hat{L}S}{(c+2\mu)\epsilon}\frac{\tilde{\eta}\sigma^{2}}{KI_{0}}\right)\right\}$.
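Since $S$ appears on both sides of the last condition, it can be found by a simple fixed-point search; the sketch below uses hypothetical placeholder constants ($c$, $\mu$, $\hat{L}$, $\epsilon_{0}$, $\epsilon$, $\tilde{\eta}$, $\sigma^{2}$, $K$, $I_{0}$) purely for illustration.

```python
import math

# Fixed-point search for the smallest S with S > max{...} from the text.
# All constants below are hypothetical placeholders.
c, mu, L_hat = 10.0, 0.5, 12.0
eps_0, eps = 1.0, 1e-3
eta, sigma2, K, I_0 = 0.1, 1.0, 8, 100

def rhs(S):
    # max{ (c+2mu)/(2mu) * log(4 eps_0 / eps),
    #      (c+2mu)/(2mu) * log(160 L_hat S / ((c+2mu) eps) * eta sigma2 / (K I_0)) }
    a = (c + 2 * mu) / (2 * mu) * math.log(4 * eps_0 / eps)
    b = (c + 2 * mu) / (2 * mu) * math.log(
        160 * L_hat * S / ((c + 2 * mu) * eps) * eta * sigma2 / (K * I_0))
    return max(a, b)

S = 1
while S <= rhs(S):  # increase S until the implicit condition S > rhs(S) holds
    S += 1
print(S)
```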

Then, the total communication complexity is

\displaystyle\sum\limits_{s=1}^{S}R_{s}\leq O\left(\frac{1000}{\tilde{\eta}\mu_{2}}S\right)\leq\widetilde{O}\left(\frac{1}{\tilde{\eta}\mu_{2}}\frac{c}{\mu}\right)\leq\widetilde{O}\left(\frac{1}{\mu}\right).

The total iteration complexity is

\begin{split}&\sum\limits_{s=1}^{S}T_{s}=\sum\limits_{s=1}^{S}R_{s}I_{s}=\sum\limits_{s=1}^{S}R_{s}I_{0}\exp\left(\frac{2\mu}{c+2\mu}(s-1)\right)=O\left(I_{0}\sum_{s}\exp\left(\frac{2\mu}{c+2\mu}(s-1)\right)\right)\\ &=\widetilde{O}\left(I_{0}\frac{\exp(\frac{2\mu}{c+2\mu}S)}{\exp(\frac{2\mu}{c+2\mu})}\right)=\widetilde{O}\left(\frac{c}{\mu_{2}^{2}\mu}\max\left(\frac{\epsilon_{0}}{\epsilon},\frac{S\tilde{\eta}\sigma^{2}}{I_{0}K\epsilon}\right)\right)\\ &=\widetilde{O}\left(\max\left(\frac{1}{\mu\epsilon},\frac{c^{2}}{\mu^{2}}\frac{\tilde{\eta}\sigma^{2}}{K\epsilon}\right)\right)=\widetilde{O}\left(\max\left(\frac{1}{\mu\epsilon},\frac{1}{K\mu^{2}\epsilon}\right)\right),\end{split} (121)

which is also the sample complexity on each machine.

Appendix F More Results

In this section, we report more experimental results with DenseNet121 on CIFAR100-IH (imratio = 10% and 30%) and ImageNet-IH (imratio = 30%) in Figures 2, 3 and 4.

Figure 2: Imbalanced Heterogeneous CIFAR100 with imratio = 10% and K=16,8 on DenseNet121.
Figure 3: Imbalanced Heterogeneous ImageNet with imratio = 30% and K=16,8 on DenseNet121.
Figure 4: Imbalanced Heterogeneous CIFAR100 with imratio = 30% and K=16,8 on DenseNet121.

Appendix G Descriptions of Datasets

Table 5: Statistics of Medical Chest X-ray Datasets. The numbers for each disease denote the imbalance ratio (imratio).

\begin{tabular}{llrccccc}
\hline
Dataset & Source & Samples & Cardiomegaly & Edema & Consolidation & Atelectasis & Effusion \\
\hline
CheXpert & Stanford Hospital (US) & 224,316 & 0.211 & 0.342 & 0.120 & 0.310 & 0.414 \\
ChestXray8 & NIH Clinical Center (US) & 112,120 & 0.025 & 0.021 & 0.042 & 0.103 & 0.119 \\
PadChest & Hospital San Juan (Spain) & 110,641 & 0.089 & 0.012 & 0.015 & 0.056 & 0.064 \\
MIMIC-CXR & BIDMC (US) & 377,110 & 0.196 & 0.179 & 0.047 & 0.246 & 0.237 \\
ChestXrayAD & H108 and HMUH (Vietnam) & 15,000 & 0.153 & 0.000 & 0.024 & 0.012 & 0.069 \\
\hline
\end{tabular}