Federated Deep AUC Maximization for Heterogeneous Data
with a Constant Communication Complexity
Abstract
Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant that is independent of the number of machines and of the accuracy level, which improves an existing result by orders of magnitude. Of independent interest, the proposed algorithm can also be applied to a class of non-convex-strongly-concave min-max problems. Experiments demonstrate the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest X-ray images from different organizations. Our experiments show that FDAM using data from multiple hospitals can improve the AUC score on testing data from a single hospital for detecting life-threatening diseases based on chest radiographs.
1 Introduction
Federated learning (FL) is an emerging paradigm for large-scale learning with data that are (geographically) distributed over multiple clients, e.g., mobile phones and organizations. An important feature of FL is that the data remains on its own clients, allowing for the preservation of data privacy. This feature makes FL attractive not only to internet companies such as Google and Apple but also to conventional industries that provide services to people, such as hospitals and banks, in the era of big data [rieke2020future, long2020federated]. Data in these industries is usually collected from people who are concerned about data leakage. But in order to provide better services, large-scale machine learning from diverse data sources is important for addressing model bias. For example, patients in hospitals located in urban areas could have dramatic differences in demographics, lifestyles, and diseases from patients in rural areas. Machine learning models (in particular, deep neural networks) trained on patients’ data from one hospital could be dramatically biased towards its major population, which could raise serious ethical concerns [pooch2020can].
One of the fundamental issues that can cause model bias is data imbalance, where the numbers of samples from different classes are skewed. Although FL provides an effective framework for leveraging multiple data sources, most existing FL methods still lack the capability to tackle the model bias caused by data imbalance. The reason is that most existing FL methods are developed for minimizing a conventional objective function, e.g., the average of a standard loss over all data, which is not amenable to optimizing more suitable measures for imbalanced data, such as the area under the ROC curve (AUC). It has been shown recently that directly maximizing AUC for deep learning can lead to great improvements on difficult real-world classification tasks [robustdeepAUC]. For example, robustdeepAUC reported the best performance achieved by DAM on the Stanford CheXpert Competition for interpreting chest X-ray images like radiologists [chexpert19].
Method | Heterogeneous Data | Homogeneous Data | Sample Complexity
---|---|---|---
NPA (small mini-batch) | | |
NPA (large mini-batch) | | |
CODA+ (CODA) | | |
CODASCA | | |
However, the research on FDAM is still limited. To the best of our knowledge, dist_auc_guo is the only work dedicated to FDAM by solving the non-convex strongly-concave min-max problem in a distributed manner. Their algorithm (CODA) is similar to the standard FedAvg method [DBLP:conf/aistats/McMahanMRHA17] except that the periodic averaging is applied to both the primal and the dual variables. Nevertheless, their results on FDAM are not comprehensive. By a deep investigation of their algorithms and analysis, we found that (i) although their FL algorithm CODA was shown to be better than the naive parallel algorithm (NPA) with a small mini-batch for DAM, the NPA using a larger mini-batch at local machines can enjoy a smaller communication complexity than CODA; and (ii) the communication complexity of CODA for homogeneous data is better than that established for heterogeneous data, but is still worse than that of NPA with a large mini-batch at local clients. These shortcomings of CODA for FDAM motivate us to develop federated averaging algorithms and analysis with an improved communication complexity without sacrificing the sample complexity.
This paper aims to provide more comprehensive results for FDAM, with a focus on improving the communication complexity of CODA for heterogeneous data. In particular, our contributions are summarized below:
• First, we provide a stronger baseline than CODA with a simpler algorithm named CODA+, and establish its complexity in both the homogeneous and heterogeneous data settings. Although CODA+ differs only slightly from CODA, its analysis is much more involved, as it is based on a duality gap analysis instead of a primal objective gap analysis.
• Second, we propose a new variant of CODA+ named CODASCA with a much better communication complexity than CODA+. The key thrust is to incorporate the idea of stochastic controlled averaging of SCAFFOLD [karimireddy2019scaffold] into the framework of CODA+ to correct the client drift in both the local primal updates and the local dual updates. A striking result is that, under a PL condition for deep learning, the communication complexity of CODASCA is independent of the number of machines and of the targeted accuracy level, which is even better than CODA+ in the homogeneous data setting. The analysis of CODASCA is also non-trivial, combining the duality gap analysis of CODA+ for a non-convex strongly-concave min-max problem with the variance reduction analysis of SCAFFOLD. A comparison among CODASCA, CODA+ and NPA for FDAM is shown in Table 1.
• Third, we conduct experiments on benchmark datasets to verify our theory by showing that CODASCA can enjoy a larger communication window size than CODA+ without sacrificing performance. Moreover, we conduct empirical studies on medical chest X-ray images from different hospitals, showing that CODASCA trained on data from multiple organizations improves performance on testing data from a single hospital.
2 Related Work
Federated Learning (FL).
Many empirical studies [povey2014parallel, su2015experiments, mcmahan2016communication, chen2016scalable, lin2018don, kamp2018efficient, DBLP:conf/eccv/YuanGYWY20] have shown that FL performs well for distributed deep learning. For a more thorough survey of FL, we refer the readers to [mcmahan14advances]. This paper is closely related to recent studies that focus on the design of distributed stochastic algorithms for FL with provable convergence guarantees.
The most popular FL algorithm is called Federated Averaging (FedAvg) [DBLP:conf/aistats/McMahanMRHA17], which is also referred to as local SGD [stich2018local]. [stich2018local] is the first work to establish the convergence of local SGD for strongly convex functions. yu2019parallel, yu_linear establish the convergence of local SGD and its momentum variants for non-convex functions. Although not explicitly discussed in their paper, the analysis in [yu2019parallel] already exhibits the difference between the communication complexities of local SGD in the homogeneous and heterogeneous data settings, which was also discovered in recent works [khaled2020tighter, woodworth2020local, DBLP:conf/nips/WoodworthPS20]. These latter studies provide tight analyses of local SGD in the homogeneous and/or heterogeneous data settings, improving the upper bounds for convex and strongly convex functions over some earlier works; the resulting bounds sometimes improve on that of large mini-batch SGD, e.g., when the level of heterogeneity is sufficiently small.
DBLP:conf/nips/HaddadpourKMC19 improve the complexities of local SGD for non-convex optimization by leveraging the PL condition. [karimireddy2019scaffold] propose a new FedAvg algorithm SCAFFOLD by introducing control variates (variance reduction) to correct for the ‘client-drift’ in the local updates for heterogeneous data. The communication complexities of SCAFFOLD are no worse than that of large mini-batch SGD for both strongly convex and non-convex functions. The proposed algorithm CODASCA is inspired by the idea of stochastic controlled averaging of SCAFFOLD. However, the analysis of CODASCA for non-convex min-max optimization under a PL condition of the primal objective function is non-trivial compared to that of SCAFFOLD.
AUC Maximization. This work builds on the foundations of stochastic AUC maximization developed in many previous works. ying2016stochastic address the scalability issue of optimizing AUC by introducing a min-max reformulation of the AUC square surrogate loss and solving it by a convex-concave stochastic gradient method [nemirovski2009robust]. natole2018stochastic improve the convergence rate by adding a strongly convex regularizer into the original formulation. Based on the same min-max formulation as in [ying2016stochastic], liu2018fast achieve an improved convergence rate by developing a multi-stage algorithm by leveraging the quadratic growth condition of the problem. However, all of these studies focus on learning a linear model, whose corresponding problem is convex and strongly concave. robustdeepAUC propose a more robust margin-based surrogate loss for the AUC score, which can be formulated as a similar min-max problem to the AUC square surrogate loss.
Deep AUC Maximization (DAM). [rafique2018non] is the first work that develops algorithms and convergence theory for weakly convex and strongly concave min-max problems, which is applicable to DAM. However, their convergence rate is slow for practical purposes. liu2019stochastic consider improving the convergence rate for DAM under a practical PL condition of the primal objective function. guo2020fast further develop more generic algorithms for non-convex strongly-concave min-max problems, which can also be applied to DAM. There are also several studies [yan2020optimal, lin2019gradient, arXiv:2001.03724, yang2020global] focusing on non-convex strongly concave min-max problems without considering the application to DAM. Based on liu2019stochastic’s algorithm, dist_auc_guo propose a communication-efficient FL algorithm (CODA) for DAM. However, its communication cost is still high for heterogeneous data.
DL for Medical Image Analysis. In the past decades, machine learning, and especially deep learning, methods have revolutionized many domains, such as machine vision and natural language processing. For medical image analysis, deep learning methods have also shown great potential, e.g., in the classification of skin lesions [esteva2017dermatologist, li2018skin], the interpretation of chest radiographs [ardila2019end, chexpert19], and breast cancer screening [bejnordi2017diagnostic, mckinney2020international, wang2016deep]. Some works have already achieved expert-level performance on different tasks [ardila2019end, mckinney2020international, DBLP:journals/mia/LitjensKBSCGLGS17]. Recently, robustdeepAUC employed DAM for medical image classification and achieved great success on two challenging tasks, namely the CheXpert competition for chest X-ray image classification and the Kaggle competition for melanoma classification based on skin lesion images. However, to the best of our knowledge, the application of FDAM methods to medical datasets from different hospitals has not been thoroughly investigated.
3 Preliminaries and Notations
We consider federated learning of deep neural networks by maximizing the AUC score. The setting is the same as that considered in [dist_auc_guo]. Below, we present some preliminaries and notations, which are mostly the same as in [dist_auc_guo]. In this paper, we consider the following min-max formulation for the distributed AUC maximization problem:
(1) |
where is the total number of machines and is defined below.
(2) |
where , is the data distribution on machine , and is the ratio of positive data. When , this is referred to as the homogeneous data setting; otherwise, it is the heterogeneous data setting.
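For reference, the min-max reformulation of the AUC square surrogate from [ying2016stochastic], as used in [liu2019stochastic, dist_auc_guo], can be written as follows; the notation here is ours and the exact constants in equations (1)-(2) may differ:
\[
\min_{\mathbf{w},a,b}\ \max_{\alpha\in\mathbb{R}}\ f(\mathbf{w},a,b,\alpha) \;=\; \frac{1}{K}\sum_{k=1}^{K} f_k(\mathbf{w},a,b,\alpha),
\]
\[
\begin{aligned}
f_k(\mathbf{w},a,b,\alpha) \;=\; \mathbb{E}_{\mathbf{z}\sim\mathcal{P}_k}\Big[&(1-p)\big(h_{\mathbf{w}}(\mathbf{x})-a\big)^2\,\mathbb{I}[y=1] + p\big(h_{\mathbf{w}}(\mathbf{x})-b\big)^2\,\mathbb{I}[y=-1]\\
&+ 2(1+\alpha)\big(p\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=-1] - (1-p)\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=1]\big) - p(1-p)\,\alpha^2\Big],
\end{aligned}
\]
where $K$ is the number of machines, $\mathcal{P}_k$ is the data distribution on machine $k$, $p$ is the ratio of positive data, and $h_{\mathbf{w}}$ is the prediction of the deep network.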
Notations. We define the following notations:
In the remaining of this paper, we will consider the following distributed min-max problem that can cover the distributed DAM,
(3) |
where
(4) |
Assumptions. Similar to [dist_auc_guo], we make the following assumptions throughout this paper.
Assumption 1
(i) There exist such that .
(ii) PL condition: satisfies the -PL condition, i.e., ;
(iii) Smoothness: For any , is -smooth in and . is -smooth, i.e., .
(iv) Bounded variance:
(5) |
To quantify the drifts between different clients, we introduce the following assumption.
Assumption 2
Bounded client drift:
(6) |
Remark. quantifies the drift between the local objectives on clients and the global objective. denotes the homogeneous data setting that all the local objectives are identical. corresponds to the heterogeneous data setting.
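For reference, one common form of these assumptions in the min-max DAM literature [liu2019stochastic, yan2020optimal, dist_auc_guo] is stated below in our own notation, with $\phi(\mathbf{v}) := \max_{\alpha} f(\mathbf{v},\alpha)$; the paper's exact constants may differ:
\[
\begin{aligned}
&\text{(i) Initial gap: } \exists\, \mathbf{v}_0,\ \Delta_0 \text{ such that } \phi(\mathbf{v}_0)-\min_{\mathbf{v}}\phi(\mathbf{v})\le \Delta_0;\\
&\text{(ii) PL condition: } \tfrac{1}{2}\|\nabla\phi(\mathbf{v})\|^2 \ \ge\ \mu\big(\phi(\mathbf{v})-\min_{\mathbf{v}'}\phi(\mathbf{v}')\big);\\
&\text{(iii) Smoothness: } \|\nabla f_k(\mathbf{v},\alpha)-\nabla f_k(\mathbf{v}',\alpha')\| \ \le\ \ell\,\|(\mathbf{v},\alpha)-(\mathbf{v}',\alpha')\|;\\
&\text{(iv) Bounded variance: } \mathbb{E}_{\mathbf{z}}\big[\|\nabla F_k(\mathbf{v},\alpha;\mathbf{z})-\nabla f_k(\mathbf{v},\alpha)\|^2\big] \ \le\ \sigma^2;\\
&\text{(Assumption 2) Bounded client drift: } \tfrac{1}{K}\sum_{k=1}^{K}\|\nabla f_k(\mathbf{v},\alpha)-\nabla f(\mathbf{v},\alpha)\|^2 \ \le\ D^2.
\end{aligned}
\]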
4 CODA+: A stronger baseline
In this section, we present a stronger baseline than CODA [dist_auc_guo]. The motivation is that (i) the CODA algorithm uses a step to compute the dual variable from the primal variable using sampled data from all clients, but we find this step to be unnecessary thanks to an improved analysis; and (ii) the complexity of CODA for homogeneous data is not given in its original paper. Hence, CODA+ is a simplified version of CODA but with a much refined analysis.
We present the steps of CODA+ in Algorithm 1. Like CODA, it uses stagewise updates. In the -th stage, a strongly convex strongly concave subproblem is constructed as follows:
(7) |
where is the output of the previous stage.
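For reference, the analogous proximal-point construction in [liu2019stochastic, dist_auc_guo] adds a quadratic regularizer centered at the previous stage's output; we state it here in our own notation, and the exact form of (7) may differ in constants:
\[
f_s(\mathbf{v},\alpha) \;=\; f(\mathbf{v},\alpha) \;+\; \frac{1}{2\gamma}\,\big\|\mathbf{v}-\mathbf{v}_{s-1}\big\|^2, \qquad \gamma < \tfrac{1}{L},
\]
so that $f_s$ is strongly convex in $\mathbf{v}$ while remaining strongly concave in $\alpha$.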
CODA+ improves upon CODA in two respects. First, CODA+ is more concise since the output primal and dual variables of each stage can be directly used as the input for the next stage, while CODA needs an extra large batch of data after each stage to compute the dual variable. This modification not only reduces the sample complexity, but also makes the algorithm applicable to a broader family of nonconvex min-max problems. Second, CODA+ has a smaller communication complexity for homogeneous data than for heterogeneous data, whereas the previous analysis of CODA yields the same communication complexity for both settings.
The following lemma bounds the convergence of the subproblem in the -th stage.
Lemma 1
Remark. Note that the term on the RHS is the client drift caused by skipping communication. When , i.e., the machines have homogeneous data distributions, we need , so that it can be merged with the last term. When , we need , which means that has to be smaller in the heterogeneous data setting and thus the communication complexity is higher.
Remark. The key difference between the analysis of CODA+ and that of CODA lies in how to handle the term in Lemma 1. In CODA, the initial dual variable is computed from the initial primal variable , which reduces the error term to one similar to , which is then bounded by the primal objective gap due to the PL condition. However, since we do not conduct the extra computation of from , our analysis directly handles this error term by using the duality gap of . This technique was originally developed by [yan2020optimal].
Theorem 1
Set , , , and . To return such that , it suffices to choose . The iteration complexity is and the communication complexity is by setting if , and is by setting if , where suppresses logarithmic factors.
Remark. Due to the PL condition, the step size decreases geometrically. Accordingly, increases geometrically due to Lemma 1, and it increases at a faster rate when the data are homogeneous than when they are heterogeneous. As a result, the total number of communications in the homogeneous setting is much smaller than that in the heterogeneous setting.
5 CODASCA
Although CODA+ has a significantly reduced communication complexity for homogeneous data, it still suffers from a high communication complexity for heterogeneous data. Even for homogeneous data, CODA+ has a worse communication complexity, with a dependence on the number of clients, than the NPA algorithm for FDAM with a large batch size.
Can we further reduce the communication complexity of FDAM for both homogeneous and heterogeneous data without using a large batch size?
The main cause of the degradation in the heterogeneous data setting is the difference among the clients' data. Even at the global optimum , the gradients of the local functions on different clients can be different and non-zero. In the homogeneous data setting, different clients still produce different solutions due to stochastic error (cf. the term of in Lemma 1). These together contribute to the client drift.
To correct the client drift, we propose to leverage the idea of stochastic controlled averaging due to [karimireddy2019scaffold]. The key idea is to maintain and update a control variate that accounts for the client drift, which is taken into account when updating the local solutions. In the proposed algorithm CODASCA for FDAM, we apply control variates to both the primal and dual variables. CODASCA shares the same stagewise framework as CODA+, where a strongly convex strongly concave subproblem is constructed and optimized approximately in a distributed fashion in each stage. The steps of CODASCA are presented in Algorithm 3 and Algorithm 4. Below, we describe the algorithm in each stage.
Each stage has communication rounds. Between two consecutive rounds, there are local updates; each machine performs the local updates as
where are local and global control variates for the primal variable, and are local and global control variates for the dual variable. Note that and are unbiased stochastic gradients on local data. However, they are biased estimates of the global gradient when the data on different clients are heterogeneous. Intuitively, the terms and correct the local gradients so that they move closer to the global gradient. They also reduce the variance of the stochastic gradients, which helps reduce the communication complexity in the homogeneous data setting as well.
At each communication round, the primal and dual variables of all clients are aggregated, averaged, and broadcast back to all clients. The control variates at the -th round are updated as
(8) |
which is equivalent to
(9) |
Notice that they are simply the averages of the stochastic gradients used in this round. An alternative way to compute the control variates is to compute the stochastic gradient with a large batch of extra samples at the same point on each client, but this would incur extra cost and is unnecessary. and are the averages of and over all clients. After the local primal and dual variables are averaged, an extrapolation step with is performed, which boosts the convergence.
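To make the update scheme concrete, below is a minimal numpy sketch of CODASCA-style local updates with primal and dual control variates, periodic averaging, and a server-side extrapolation step. The toy min-max objective, step sizes, and noise model are illustrative assumptions only; Algorithms 3 and 4 specify the exact procedure.

```python
# Minimal sketch of control-variate-corrected local primal/dual updates with
# periodic averaging and server extrapolation (one stage, toy objective).
# Toy local objectives: f_k(v, a) = 0.5*||v - m_k||^2 + a*(b_k^T v) - 0.5*a^2,
# which are strongly convex in v and strongly concave in a (an assumption,
# not the paper's DAM objective).
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 5                      # number of clients, primal dimension
I, R = 20, 50                    # local steps per round, communication rounds
eta_l, eta_g = 0.05, 1.0         # local step size; eta_g > 1 gives extrapolation
noise = 0.1                      # stochastic gradient noise level

m = rng.normal(size=(K, d))      # heterogeneous client "data"
bvec = rng.normal(size=(K, d))

def stoch_grads(k, v, a):
    """Stochastic gradients of the toy local min-max objective on client k."""
    gv = (v - m[k]) + a * bvec[k] + noise * rng.normal(size=d)
    ga = bvec[k] @ v - a + noise * rng.normal()
    return gv, ga

v, a = np.zeros(d), 0.0                          # global primal/dual variables
cv, ca = np.zeros((K, d)), np.zeros(K)           # local control variates
cv_g, ca_g = np.zeros(d), 0.0                    # global control variates

for r in range(R):
    v_loc, a_loc = np.tile(v, (K, 1)), np.full(K, a)
    gv_sum, ga_sum = np.zeros((K, d)), np.zeros(K)
    for _ in range(I):
        for k in range(K):
            gv, ga = stoch_grads(k, v_loc[k], a_loc[k])
            gv_sum[k] += gv
            ga_sum[k] += ga
            # client-drift-corrected updates: descent in v, ascent in a
            v_loc[k] -= eta_l * (gv - cv[k] + cv_g)
            a_loc[k] += eta_l * (ga - ca[k] + ca_g)
    # control variates = averages of the stochastic gradients used this round
    cv, ca = gv_sum / I, ga_sum / I
    cv_g, ca_g = cv.mean(axis=0), ca.mean()
    # server: average the local variables, then extrapolate from the old point
    v = v + eta_g * (v_loc.mean(axis=0) - v)
    a = a + eta_g * (a_loc.mean() - a)

print("final v:", np.round(v, 3), " final a:", round(a, 3))
```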
In order to establish the convergence of CODASCA, we first present a key lemma below.
Lemma 2
Remark. Comparing the above bound with that in Lemma 1, in particular the term vs. the term , we can see that CODASCA is not affected by the data heterogeneity , and the stochastic variance is also much reduced. As will be seen in the next theorem, the values of and stay the same in all stages. Therefore, by decreasing the local step size geometrically, the communication window size can increase geometrically to ensure .
The main convergence result of CODASCA is presented below.
Theorem 2
Define constants , . Setting , , , , . After stages, the output satisfies . The communication complexity is . The iteration complexity is .
Remark. (i) The number of communications is , which is independent of the number of clients and the accuracy level . This is a significant improvement over CODA+, which has a communication complexity of in the heterogeneous setting. Moreover, is a nearly optimal rate up to a logarithmic factor, since is the lower bound on the communication complexity of distributed strongly convex optimization [karimireddy2019scaffold, ArjevaniS15] and strong convexity is a stronger condition than the PL condition.
(ii) Each stage has the same number of communication rounds. However, increases geometrically. Therefore, the numbers of iterations and samples in a stage increase geometrically. Theoretically, we could also set to the same value as in the last stage and correspondingly set to a fixed large value. But this increases the number of samples to be processed without further speeding up the convergence. Our setting of is a balance between skipping communications and reducing the sample complexity. For simplicity, we use a fixed setting of to compare CODASCA and the baseline CODA+ in our experiments to corroborate the theory.
(iii) The local step size of CODASCA decreases in a similar way to the step size in CODA+. But the in CODASCA increases at a faster rate than that in CODA+ on heterogeneous data. Notably, unlike CODA+, we do not need Assumption 2, which bounds the client drift. This means CODASCA can be applied to optimizing the global objective even if the local objectives deviate arbitrarily from the global function.
[Figure 1: Top: testing AUC on ImageNet-IH (10%) vs. the number of iterations for CODASCA and CODA+ with different communication window sizes and numbers of machines. Bottom: achieved testing AUC vs. the communication window size.]
6 Experiments
In this section, we first verify the effectiveness of CODASCA compared to CODA+ on various datasets, including two benchmark datasets, i.e., ImageNet and CIFAR100 [imagenet_cvpr09, krizhevsky2009cifar], and a constructed large-scale chest X-ray dataset. Then, we demonstrate the effectiveness of FDAM for improving the performance on a single domain (CheXpert) by using data from multiple sources. For notation, denotes the number of “clients” (# of machines, # of data sources) and denotes the communication window size.
Dataset | Source | Samples |
---|---|---|
CheXpert | Stanford Hospital (US) | 224,316 |
ChestXray8 | NIH Clinical Center (US) | 112,120 |
PadChest | Hospital San Juan (Spain) | 110,641 |
MIMIC-CXR | BIDMC (US) | 377,110 |
ChestXrayAD | H108 and HMUH (Vietnam) | 15,000 |
Chest X-ray datasets. Five medical chest X-ray datasets, i.e., CheXpert, ChestXray8, MIMIC-CXR, PadChest, and ChestXray-AD [chexpert19, wang2017chestx, johnson2019mimic, bustos2020padchest, nguyen2020vindr], are collected from different organizations. The statistics of these medical datasets are summarized in Table 2. We construct five binary classification tasks for predicting five popular diseases, Cardiomegaly (C0), Edema (C1), Consolidation (C2), Atelectasis (C3), and P. Effusion (C4), as in the CheXpert competition [chexpert19]. These datasets are naturally imbalanced and heterogeneous due to different patient populations, different data collection protocols, etc. We refer to the whole medical dataset as ChestXray-IH.
Imbalanced and Heterogeneous (IH) Benchmark Datasets. For the benchmark datasets, we manually construct imbalanced heterogeneous datasets. For ImageNet, we first randomly select 500 classes as the positive class and 500 classes as the negative class. To increase data heterogeneity, we further split all positive/negative classes into groups so that each group only owns samples from unique classes that do not overlap with those of other groups. To increase the data imbalance level, we randomly remove some samples from the positive classes for each machine. Note that, due to this operation, the whole sample set differs for different . We refer to the proportion of positive samples among all samples as the imbalance ratio (). For CIFAR100, we follow similar steps to construct imbalanced heterogeneous data. We keep the testing/validation sets untouched and balanced. For the imbalance ratio (imratio), we explore two ratios: 10% and 30%. We refer to the constructed datasets as ImageNet-IH (10%), ImageNet-IH (30%), CIFAR100-IH (10%), and CIFAR100-IH (30%). Due to the limited space, we only report imratio=10% with DenseNet121 and defer the other results to the supplement.
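Below is a minimal sketch of this IH construction on synthetic (class, sample) indices; the class counts, the number of clients, and the downsampling rule are illustrative assumptions rather than the exact preprocessing script.

```python
# Sketch of an imbalanced-heterogeneous (IH) split: relabel classes into a
# binary task, partition classes disjointly across clients, then drop positives
# on each client to reach a target imbalance ratio.
import numpy as np

rng = np.random.default_rng(0)
num_classes, samples_per_class = 1000, 50
K, imratio = 16, 0.10                      # number of clients, target positive ratio

classes = rng.permutation(num_classes)
pos_classes, neg_classes = classes[:500], classes[500:]   # binary relabeling

def split_classes(cls, K):
    """Partition class ids into K disjoint groups (one group per client)."""
    return np.array_split(rng.permutation(cls), K)

pos_groups, neg_groups = split_classes(pos_classes, K), split_classes(neg_classes, K)

clients = []
for k in range(K):
    neg = [(c, i) for c in neg_groups[k] for i in range(samples_per_class)]
    pos = [(c, i) for c in pos_groups[k] for i in range(samples_per_class)]
    # drop positives so they make up roughly `imratio` of this client's data
    n_pos = int(imratio / (1.0 - imratio) * len(neg))
    keep = rng.choice(len(pos), size=min(n_pos, len(pos)), replace=False)
    clients.append({"pos": [pos[i] for i in keep], "neg": neg})

for k in (0, K - 1):
    n_p, n_n = len(clients[k]["pos"]), len(clients[k]["neg"])
    print(f"client {k}: positives={n_p}, negatives={n_n}, ratio={n_p/(n_p+n_n):.2f}")
```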
Parameters and Settings. We train DenseNet121 on all datasets. For the parameters in CODASCA/CODA+, we tune in [500, 700, 1000] and in [0.1, 0.01, 0.001]. For the learning rate schedule, we decay the step size by a factor of 3 every iterations, where is tuned in [2000, 3000, 4000]. We experiment with a fixed value of selected from [1, 32, 64, 128, 512, 1024]. We tune in [1.1, 1, 0.99, 0.999]. The local batch size is set to 32 for each machine. We run a total of 20000 iterations for all experiments.
6.1 Comparison with CODA+
We plot the testing AUC on ImageNet-IH (10%) vs. the number of iterations for CODASCA and CODA+ in Figure 1 (top row) by varying the value of for different values of . Results on CIFAR100 are shown in the Supplement. In the bottom row of Figure 1, we plot the achieved testing AUC score vs. different values of for CODASCA and CODA+. We have the following observations: CODASCA enjoys a larger communication window size. Comparing CODASCA and CODA+ in the bottom panel of Figure 1, we can see that CODASCA tolerates a larger communication window size than CODA+ without hurting the performance, which is consistent with our theory.
CODASCA is consistently better for different values of . We compare the largest value of such that the performance does not degenerate too much compared with , which is denoted by . From the bottom figures of Figure 1, we can see that the value of CODASCA on ImageNet is 128 (=16) and 512 (=8), respectively, and that of CODA+ on ImageNet is 32 (=16) and 128 (=8). This demonstrates that CODASCA enjoys a consistent advantage over CODA+, i.e., when , , and when , . The same phenomena occur on the CIFAR100 data.
Next, we compare CODASCA with CODA+ on the ChestXray-IH medical dataset, which is also highly heterogeneous. We split the ChestXray-IH data into groups according to the patient ID, and each machine only owns samples from one organization without overlapping patients. The testing set is the collection of 5% of the data sampled from each organization. In addition, we use a train/val split of 7:3 for the parameter tuning. We run CODASCA and CODA+ with the same number of iterations. The performance on the testing set is reported in Table 3. From the results, we can observe that CODASCA performs consistently better than CODA+ on C0, C2, C3, and C4.
Method | Comm. window | C0 | C1 | C2 | C3 | C4
---|---|---|---|---|---|---
 | 1 | 0.8472 | 0.8499 | 0.7406 | 0.7475 | 0.8688
CODA+ | 512 | 0.8361 | 0.8464 | 0.7356 | 0.7449 | 0.8680
CODASCA | 512 | 0.8427 | 0.8457 | 0.7401 | 0.7468 | 0.8680
CODA+ | 1024 | 0.8280 | 0.8451 | 0.7322 | 0.7431 | 0.8660
CODASCA | 1024 | 0.8363 | 0.8444 | 0.7346 | 0.7481 | 0.8674
# of sources | C0 | C1 | C2 | C3 | C4 | AVG
---|---|---|---|---|---|---
1 | 0.9007 | 0.9536 | 0.9542 | 0.9090 | 0.9571 | 0.9353
2 | 0.9027 | 0.9586 | 0.9542 | 0.9065 | 0.9583 | 0.9361
3 | 0.9021 | 0.9558 | 0.9550 | 0.9068 | 0.9583 | 0.9356
4 | 0.9055 | 0.9603 | 0.9542 | 0.9072 | 0.9588 | 0.9372
5 | 0.9066 | 0.9583 | 0.9544 | 0.9101 | 0.9584 | 0.9376
6.2 FDAM for improving performance on CheXpert
Finally, we show that FDAM can be used to leverage data from multiple hospitals to improve the performance at a single target hospital. For this experiment, we choose the CheXpert data from Stanford Hospital as the target data. Its validation data will be used for evaluating the performance of our FDAM method. Note that improving the AUC score on CheXpert is a very challenging task. The top 7 teams on the CheXpert leaderboard differ by only 0.1% (https://stanfordmlgroup.github.io/competitions/chexpert/). Hence, we consider any improvement on this scale significant. Our procedure is as follows: we gradually increase the number of data sources, e.g., only includes the CheXpert training data, includes the CheXpert training data and ChestXray8, includes the CheXpert training data, ChestXray8 and PadChest, and so on.
Parameters and Settings. Due to the limited computing resources, we resize all images to 320x320. We follow the two-stage method proposed in [robustdeepAUC] and compare with the baseline on a single machine with a single data source (the CheXpert training data) (=1) for learning DenseNet121. More specifically, we first train a base model by minimizing the cross-entropy loss on the CheXpert training dataset using Adam with an initial learning rate of 1e-5 and a batch size of 32 for 2 epochs. Then, we discard the trained classifier, use the same pretrained model for initializing the local models at all machines, and continue training using CODASCA. For the parameter tuning, we try =[16, 32, 64, 128], learning rate=[0.1, 0.01], and we fix =1e-3, =1000 and batch size=32.
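Below is a minimal single-machine PyTorch sketch of this two-stage recipe: cross-entropy pre-training of DenseNet121 with Adam, then discarding the classifier before the distributed CODASCA stage, which is only indicated by a comment. The data loader is a placeholder, not the CheXpert pipeline.

```python
# Stage 1: cross-entropy (BCE-with-logits) pre-training of DenseNet121.
# Stage 2: discard the trained classifier; the backbone then initializes the
# local models on all machines before federated AUC training (not shown).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# placeholder data: replace with the CheXpert training set (320x320 images)
images = torch.randn(64, 3, 320, 320)
labels = torch.randint(0, 2, (64,)).float()
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

model = models.densenet121(weights=None)          # or ImageNet-pretrained weights
model.classifier = nn.Linear(model.classifier.in_features, 1)
model = model.to(device)

opt = torch.optim.Adam(model.parameters(), lr=1e-5)
bce = nn.BCEWithLogitsLoss()
for epoch in range(2):                            # 2 epochs, as described above
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = bce(model(x).squeeze(1), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reset the classifier and hand the backbone to the CODASCA stage,
# which continues training on the min-max AUC objective across machines.
model.classifier = nn.Linear(model.classifier.in_features, 1).to(device)
torch.save(model.state_dict(), "densenet121_ce_pretrained.pt")
```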
Results. We report all results in terms of AUC score on the CheXpert validation data in Table 4. Overall, we can see that using more data sources from different organizations effectively improves the performance on CheXpert. The average improvement across all 5 classification tasks from to is over , which is significant in light of the top CheXpert leaderboard results. Specifically, we can see that CODASCA with =5 achieves the highest validation AUC score on C0 and C3, and with =4 it achieves the highest AUC score on C1 and C4.
7 Conclusion
In this work, we have conducted comprehensive studies of federated learning for deep AUC maximization. We analyzed a stronger baseline for deep AUC maximization by establishing its convergence for both homogeneous data and heterogeneous data. We also developed an improved variant by adding control variates to the local stochastic gradients of both the primal and dual variables, which dramatically reduces the communication complexity. Besides strong theoretical guarantees, we also exhibited the power of FDAM on real-world medical imaging problems. We have shown that our FDAM method can improve the performance on medical imaging classification tasks by leveraging data from different organizations that are kept locally.
Appendix A Auxiliary Lemmas
Noting that all algorithms discussed in this paper, including the baselines, implement a stagewise framework, we define the duality gap of the -th stage at a point as
(10) |
Before we show the proofs, we first present the lemmas from [yan2020optimal].
Lemma 3 (Lemma 1 of [yan2020optimal])
Suppose a function is -strongly convex in and -strongly concave in . Consider the following problem
where and are convex compact sets. Denote and . Suppose we have two solutions and . Then the following relation between variable distance and duality gap holds
(11) |
Lemma 4 (Lemma 5 of [yan2020optimal])
We have the following lower bound for
where and , i.e., the initialization of -th stage is the output of the -th stage.
Appendix B Analysis of CODA+
The proof sketch is similar to the proof of CODA in [dist_auc_guo]. However, there are two noticeable differences from [dist_auc_guo]. First, in Lemma 1, we bound the duality gap instead of the objective gap as in [dist_auc_guo]. This is because the analysis later in this proof requires a bound on the duality gap.
Second, in Lemma 1, the bound for homogeneous data is better than that for heterogeneous data. The better analysis for homogeneous data is inspired by the analysis in [yu_linear], which tackles a minimization problem. Note that denotes the subproblem for stage ; we omit the index of the variables when the context is clear.
B.1 Lemmas
We need the following lemmas for the proof. Lemma 5, Lemma 6 and Lemma 7 are similar to Lemma 3, Lemma 4 and Lemma 5 of [dist_auc_guo], respectively. For the sake of completeness, we include the proofs of Lemma 5 and Lemma 6 since there is a change in the update of the primal variable.
Lemma 5
For any and , using Jensen’s inequality and the fact that is convex in and concave in ,
(12) |
By the -strong convexity of in , we have
(13) |
By -smoothness of in , we have
(14) |
where holds because is -Lipschitz in since is -smooth, holds by Young’s inequality, and is the strong concavity coefficient of in .
We know is -strongly concave in ( is -strongly convex in ). Thus, we have
(16) |
Since is -smooth in , we get
(17) |
where (a) holds because is -Lipschitz in .
Taking average over , we get
Lemma 6
Define and
(20) |
. We have
We have
(21) |
Then we will bound \small{1}⃝, \small{2}⃝, \small{3}⃝ and \small{4}⃝, respectively,
(22) |
where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality, and (c) holds because is -Lipschitz in . Using similar techniques, we have
(23) |
Let , then we have
(24) |
Hence we get
(25) |
Define another auxiliary sequence as
(26) |
Denote
(27) |
Hence, for the auxiliary sequence , we can verify that
(28) |
Since is -strongly convex, we have
(29) |
Adding this with (25), we get
(30) |
④ can be bounded as
(31) |
can be bounded by the following lemma, whose proof is identical to that of Lemma 5 in [dist_auc_guo].
Lemma 7
Define , and
We have,
can be bounded by the following lemma.
Lemma 8
If machines communicate every iterations, where , then
In this proof, we introduce a couple of new notations to make the proof brief: and . Similar bounds for minimization problems have been analyzed in [yu_linear, stich2018local].
Denote as the nearest communication round before , i.e., . By the update rule of , we have that on each machine ,
(32) |
Taking average over all machines,
(33) |
Therefore,
(34) |
In the following, we will address these two terms on the right hand side separately. First, we have
(35) |
where holds by , where ; follows because .
Second, we have
(36) |
where
(37) |
where holds because is -smooth, i.e., is -smooth.
Summing over ,
(39) |
Similarly for side, we have
(40) |
Summing up the above two inequalities,
(41) |
where the second inequality is due to , i.e., .
Based on the above lemmas, we are ready to give the convergence of the duality gap in one stage of CODA+.
B.2 Proof of Lemma 1
(42) |
Since , the term in the RHS of (42) can be cancelled. , , and will be handled by a telescoping sum. can be bounded by Lemma 8.
Taking expectation over ,
(43) |
The last inequality holds because and for any as each machine draws data independently. Similarly, we take expectation over and have
(44) |
B.3 Main Proof of Theorem 1
Since is -smooth (thus -weakly convex) in for any , is also -weakly convex. Taking , we have
(45) |
where and hold by the definition of .
Thus, we have
(47) |
Since , is -strongly convex in and -strongly concave in . Applying Lemma 3 to , we know that
(48) |
By the setting of , and , we note that . Set such that , where the specific choice of will be made later. Applying Lemma 1 with and , we have
(49) |
Since is -smooth and , then is -smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have
(50) |
Combining this with (47), rearranging the terms, and defining a constant , we get
(52) |
Using the fact that ,
(53) |
Then, we have
(54) |
Defining , then
(55) |
Using this inequality recursively yields
(56) |
By definition,
(57) |
Using inequality , we have
To make this less than , it suffices to make
(58) |
Let be the smallest value such that . We can set .
Then, the total iteration complexity is
(59) |
where uses the setting of and , and suppresses logarithmic factors.
Next, we will analyze the communication cost. We investigate both and cases.
(i) Homogeneous Data (D = 0): To assure , which we used in the above proof, we take , where is a proper constant.
If , then .
Otherwise, , then for and for .
(60) |
Thus, for both above cases, the total communication complexity can be bounded by
(61) |
(ii) Heterogeneous Data ():
To assure , which we used in the above proof, we take , where is a proper constant.
If , then for and for .
(62) |
Thus, the communication complexity can be bounded by
(63) |
Appendix C Baseline: Naive Parallel Algorithm
Note that if we set for all , CODA+ will be reduced to a naive parallel version of PPD-SG [liu2019stochastic]. We analyze this naive parallel algorithm in the following theorem.
Theorem 3
Consider Algorithm 1 with . Set , , .
(1) If , set and , then the communication/iteration complexity is to return such that .
(2) If , set and , then the communication/iteration complexity is to return such that .
(1) If , note that the settings of and are identical to those in CODA+ (Theorem 1). However, as a batch of is used on each machine at each iteration, the variance at each iteration is reduced to . Therefore, by an analysis similar to that of Theorem 1 (specifically (59)), we see that the iteration complexity of NPA is . Thus, the sample complexity of each machine is .
Appendix D Proof of Lemma 2
In this section, we will prove Lemma 2, which is the convergence analysis of one stage in CODASCA.
First, the duality gap in stage can be bounded as
Lemma 9
For any ,
By the -strong convexity of in , we have
(66) |
By -smoothness of in , we have
(67) |
where holds because is -Lipschitz in since is -smooth, and holds by Young’s inequality.
We know is -strongly concave in ( is -strongly convex in ). Thus, we have
(69) |
Since is -smooth in , we get
(70) |
where (a) holds because is -Lipschitz in .
Taking average over , we get
and can be bounded by the following lemma. For simplicity of notation, we define
(73) |
which is the drift of the variables between the sequence in the -th round and the ending point, and
(74) |
which is the drift of the variables between the sequence in the -th round and the starting point.
can be bounded as
Lemma 10
and
(75) |
Then we will bound \small{1}⃝, \small{2}⃝ and \small{3}⃝, respectively,
(76) |
where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality, and (c) holds because is -smooth in . Using similar techniques, we have
(77) |
Let , then we have
(78) |
Hence we get
(79) |
Define another auxiliary sequence as
(80) |
Denote
(81) |
Hence, for the auxiliary sequence , we can verify that
(82) |
Since is -strongly convex, we have
(83) |
Adding this with (79), we get
(84) |
\small{4}⃝ can be bounded as
(85) |
Similarly for , noting is -smooth and -strongly concave in ,
We show the following lemmas where and are coupled.
Lemma 11
(86) |
(87) |
Similarly,
(88) |
Using the -smoothness of and combining with above results,
(89) |
Thus,
(90) |
Lemma 12
(91) |
(92) |
where and .
Then,
(93) |
For , we have similar results, adding them together
(94) |
Taking average over all machines,
(95) |
Taking average over ,
(96) |
Using (89), we have
(97) |
Rearranging terms,
(98) |
D.1 Main Proof of Lemma 2
(99) |
Since , the term in the RHS of (99) can be cancelled. , and will be handled by a telescoping sum. can be bounded by Lemma 12.
Taking expectation over ,
(100) |
The equality is due to
for any as each machine draws data independently, where denotes an expectation in round conditioned on events until .
The last inequality holds because
for any .
Similarly, we take expectation over and have
(101) |
Plugging (100) and (101) into (99), and taking expectation, it yields
where we use , and in the last inequality.
Using Lemma 12,
where the last inequality holds because
(102) |
where denotes a saddle point of and the second inequality uses the strong convexity and strong concavity of . In detail,
(103) |
Taking ,
It follows that
Sampling a from , we have
(106) |
Appendix E Proof of Theorem 2
Since is -weakly convex in for any , is also -weakly convex. Taking , we have
(107) |
where and hold by the definition of .
Rearranging the terms in (107) yields
(108) |
where holds by using , and holds by the -PL property of .
Thus, we have
(109) |
Since , is -strongly convex in and strong concave in . Apply Lemma 3 to , we know that
(110) |
By the setting of , , and , we note that . Applying Lemma 2, we have
(111) |
Since is -smooth and , then is -smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have
(112) |
Combining this with (109), rearranging the terms, and defining a constant , we get
(114) |
Using the fact that ,
(115) |
Then, we have
(116) |
Defining , then
(117) |
Using this inequality recursively yields
(118) |
where the second inequality uses the fact , and
(119) |
To make this less than , it suffices to make
(120) |
Let be the smallest value such that . We can set to be the smallest value such that .
Then, the total communication complexity is
Total iteration complexity is
(121) |
which is also the sample complexity on each single machine.
Appendix F More Results
In this section, we report more experimental results for imratio=30% with DenseNet121 on ImageNet-IH and CIFAR100-IH in Figures 2, 3 and 4.
[Figures 2, 3 and 4: testing AUC for imratio=30% with DenseNet121 on ImageNet-IH and CIFAR100-IH for CODASCA and CODA+.]
Appendix G Descriptions of Datasets
Dataset | Source | Samples | Cardiomegaly | Edema | Consolidation | Atelectasis | Effusion |
---|---|---|---|---|---|---|---|
CheXpert | Stanford Hospital (US) | 224,316 | 0.211 | 0.342 | 0.120 | 0.310 | 0.414 |
ChestXray8 | NIH Clinical Center (US) | 112,120 | 0.025 | 0.021 | 0.042 | 0.103 | 0.119 |
PadChest | Hospital San Juan (Spain) | 110,641 | 0.089 | 0.012 | 0.015 | 0.056 | 0.064 |
MIMIC-CXR | BIDMC (US) | 377,110 | 0.196 | 0.179 | 0.047 | 0.246 | 0.237 |
ChestXrayAD | H108 and HMUH (Vietnam) | 15,000 | 0.153 | 0.000 | 0.024 | 0.012 | 0.069 |