Federated Deep AUC Maximization for Heterogeneous Data
with a Constant Communication Complexity
Abstract
Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant that is independent of the number of machines and of the accuracy level, which improves an existing result by orders of magnitude. Of independent interest, the proposed algorithm can also be applied to a class of non-convex-strongly-concave min-max problems. Experiments demonstrate the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest X-ray images from different organizations. Our experiments show that FDAM using data from multiple hospitals can improve the AUC score on testing data from a single hospital for detecting life-threatening diseases based on chest radiographs.
1 Introduction
Federated learning (FL) is an emerging paradigm for large-scale learning with data that are (geographically) distributed over multiple clients, e.g., mobile phones and organizations. An important feature of FL is that the data remains on its own clients, allowing for the preservation of data privacy. This feature makes FL attractive not only to internet companies such as Google and Apple but also to conventional industries that provide services to people, such as hospitals and banks, in the era of big data [rieke2020future, long2020federated]. Data in these industries is usually collected from people who are concerned about data leakage. But in order to provide better services, large-scale machine learning from diverse data sources is important for addressing model bias. For example, patients in hospitals located in urban areas could have dramatic differences in demographics, lifestyles, and diseases from patients in rural areas. Machine learning models (in particular, deep neural networks) trained on patients’ data from one hospital could be dramatically biased towards its major population, which could raise serious ethical concerns [pooch2020can].
One of the fundamental issues that can cause model bias is data imbalance, where the numbers of samples from different classes are skewed. Although FL provides an effective framework for leveraging multiple data sources, most existing FL methods still lack the capability to tackle the model bias caused by data imbalance. The reason is that most existing FL methods are developed for minimizing a conventional objective function, e.g., the average of a standard loss over all data, which is not amenable to optimizing more suitable measures for imbalanced data, such as the area under the ROC curve (AUC). It has been shown recently that directly maximizing AUC for deep learning can lead to great improvements on difficult real-world classification tasks [robustdeepAUC]. For example, robustdeepAUC reported the best performance achieved by DAM on the Stanford CheXpert Competition for interpreting chest X-ray images like radiologists [chexpert19].
Method | Heterogeneous Data | Homogeneous Data | Sample Complexity
---|---|---|---
NPA (small mini-batch) | | |
NPA (large mini-batch) | | |
CODA+ (CODA) | | |
CODASCA | | |
However, the research on FDAM is still limited. To the best of our knowledge, dist_auc_guo is the only work dedicated to FDAM by solving the non-convex strongly-concave min-max problem in a distributed manner. Their algorithm (CODA) is similar to the standard FedAvg method [DBLP:conf/aistats/McMahanMRHA17] except that the periodic averaging is applied to both the primal and the dual variables. Nevertheless, their results on FDAM are not comprehensive. By a deep investigation of their algorithms and analysis, we found that (i) although their FL algorithm CODA was shown to be better than the naive parallel algorithm (NPA) with a small mini-batch for DAM, the NPA using a larger mini-batch at local machines can enjoy a smaller communication complexity than CODA; and (ii) the communication complexity of CODA for homogeneous data is better than that established for heterogeneous data, but is still worse than that of NPA with a large mini-batch at local clients. These shortcomings of CODA for FDAM motivate us to develop federated averaging algorithms and analysis with an improved communication complexity without sacrificing the sample complexity.
This paper aims to provide more comprehensive results for FDAM, with a focus on improving the communication complexity of CODA for heterogeneous data. In particular, our contributions are summarized below:
• First, we provide a stronger baseline than CODA with a simpler algorithm named CODA+, and establish its complexity in both the homogeneous and heterogeneous data settings. Although CODA+ differs only slightly from CODA, its analysis is much more involved, as it is based on a duality gap analysis instead of a primal objective gap analysis.
• Second, we propose a new variant of CODA+ named CODASCA with a much better communication complexity than CODA+. The key thrust is to incorporate the idea of stochastic controlled averaging of SCAFFOLD [karimireddy2019scaffold] into the framework of CODA+ to correct the client drift in both the local primal updates and the local dual updates. A striking result is that, under a PL condition for deep learning, the communication complexity of CODASCA is independent of the number of machines and of the targeted accuracy level, which is even better than CODA+ in the homogeneous data setting. The analysis of CODASCA is also non-trivial, combining the duality gap analysis of CODA+ for a non-convex strongly-concave min-max problem with the variance reduction analysis of SCAFFOLD. A comparison among CODASCA, CODA+ and NPA for FDAM is shown in Table 1.
• Third, we conduct experiments on benchmark datasets to verify our theory by showing that CODASCA can enjoy a larger communication window size than CODA+ without sacrificing performance. Moreover, we conduct empirical studies on medical chest X-ray images from different hospitals, showing that CODASCA trained on data from multiple organizations improves performance on testing data from a single hospital.
2 Related Work
Federated Learning (FL).
Many empirical studies [povey2014parallel, su2015experiments, mcmahan2016communication, chen2016scalable, lin2018don, kamp2018efficient, DBLP:conf/eccv/YuanGYWY20] have shown that FL performs well for distributed deep learning. For a more thorough survey of FL, we refer the readers to [mcmahan14advances]. This paper is closely related to recent studies that focus on the design of distributed stochastic algorithms for FL with provable convergence guarantees.
The most popular FL algorithm is called Federated Averaging (FedAvg) [DBLP:conf/aistats/McMahanMRHA17], which is also referred to as local SGD [stich2018local]. [stich2018local] is the first work to establish the convergence of local SGD for strongly convex functions. yu2019parallel, yu_linear establish the convergence of local SGD and its momentum variants for non-convex functions. Although not explicitly discussed in their paper, the analysis in [yu2019parallel] already exhibits the difference between the communication complexities of local SGD in the homogeneous and heterogeneous data settings, which was also discovered in recent works [khaled2020tighter, woodworth2020local, DBLP:conf/nips/WoodworthPS20]. These latter studies provide tight analyses of local SGD in the homogeneous and/or heterogeneous data settings, improving the upper bounds for convex and strongly convex functions over some earlier works; the resulting bounds sometimes improve on that of large mini-batch SGD, e.g., when the level of heterogeneity is sufficiently small.
DBLP:conf/nips/HaddadpourKMC19 improve the complexities of local SGD for non-convex optimization by leveraging the PL condition. [karimireddy2019scaffold] propose a new FedAvg algorithm SCAFFOLD by introducing control variates (variance reduction) to correct for the ‘client-drift’ in the local updates for heterogeneous data. The communication complexities of SCAFFOLD are no worse than that of large mini-batch SGD for both strongly convex and non-convex functions. The proposed algorithm CODASCA is inspired by the idea of stochastic controlled averaging of SCAFFOLD. However, the analysis of CODASCA for non-convex min-max optimization under a PL condition of the primal objective function is non-trivial compared to that of SCAFFOLD.
AUC Maximization. This work builds on the foundations of stochastic AUC maximization developed in many previous works. ying2016stochastic address the scalability issue of optimizing AUC by introducing a min-max reformulation of the AUC square surrogate loss and solving it by a convex-concave stochastic gradient method [nemirovski2009robust]. natole2018stochastic improve the convergence rate by adding a strongly convex regularizer into the original formulation. Based on the same min-max formulation as in [ying2016stochastic], liu2018fast achieve an improved convergence rate by developing a multi-stage algorithm by leveraging the quadratic growth condition of the problem. However, all of these studies focus on learning a linear model, whose corresponding problem is convex and strongly concave. robustdeepAUC propose a more robust margin-based surrogate loss for the AUC score, which can be formulated as a similar min-max problem to the AUC square surrogate loss.
Deep AUC Maximization (DAM). [rafique2018non] is the first work that develops algorithms and convergence theory for weakly convex and strongly concave min-max problems, which is applicable to DAM. However, their convergence rate is slow for practical purposes. liu2019stochastic consider improving the convergence rate for DAM under a practical PL condition of the primal objective function. guo2020fast further develop more generic algorithms for non-convex strongly-concave min-max problems, which can also be applied to DAM. There are also several studies [yan2020optimal, lin2019gradient, arXiv:2001.03724, yang2020global] focusing on non-convex strongly concave min-max problems without considering the application to DAM. Based on liu2019stochastic’s algorithm, dist_auc_guo propose a communication-efficient FL algorithm (CODA) for DAM. However, its communication cost is still high for heterogeneous data.
DL for Medical Image Analysis. In the past decades, machine learning, and especially deep learning, methods have revolutionized many domains, such as machine vision and natural language processing. For medical image analysis, deep learning methods have also shown great potential, e.g., in the classification of skin lesions [esteva2017dermatologist, li2018skin], the interpretation of chest radiographs [ardila2019end, chexpert19], and breast cancer screening [bejnordi2017diagnostic, mckinney2020international, wang2016deep]. Some works have already achieved expert-level performance on different tasks [ardila2019end, mckinney2020international, DBLP:journals/mia/LitjensKBSCGLGS17]. Recently, robustdeepAUC employed DAM for medical image classification and achieved great success on two challenging tasks, namely the CheXpert competition for chest X-ray image classification and the Kaggle competition for melanoma classification based on skin lesion images. However, to the best of our knowledge, the application of FDAM methods to medical datasets from different hospitals has not been thoroughly investigated.
3 Preliminaries and Notations
We consider federated learning of deep neural networks by maximizing the AUC score. The setting is the same as that considered in [dist_auc_guo]. Below, we present some preliminaries and notations, which are mostly the same as in [dist_auc_guo]. In this paper, we consider the following min-max formulation for the distributed AUC maximization problem:
(1) |
where is the total number of machines and is defined below.
(2) |
where , is the data distribution on machine , and is the ratio of positive data. When , this is referred to as the homogeneous data setting; otherwise, it is the heterogeneous data setting.
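For reference, the min-max reformulation of the AUC square surrogate from [ying2016stochastic], as used in [liu2019stochastic, dist_auc_guo], can be written as follows; the notation here is ours and the exact constants in equations (1)-(2) may differ:
\[
\min_{\mathbf{w},a,b}\ \max_{\alpha\in\mathbb{R}}\ f(\mathbf{w},a,b,\alpha) \;=\; \frac{1}{K}\sum_{k=1}^{K} f_k(\mathbf{w},a,b,\alpha),
\]
\[
\begin{aligned}
f_k(\mathbf{w},a,b,\alpha) \;=\; \mathbb{E}_{\mathbf{z}\sim\mathcal{P}_k}\Big[&(1-p)\big(h_{\mathbf{w}}(\mathbf{x})-a\big)^2\,\mathbb{I}[y=1] + p\big(h_{\mathbf{w}}(\mathbf{x})-b\big)^2\,\mathbb{I}[y=-1]\\
&+ 2(1+\alpha)\big(p\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=-1] - (1-p)\,h_{\mathbf{w}}(\mathbf{x})\,\mathbb{I}[y=1]\big) - p(1-p)\,\alpha^2\Big],
\end{aligned}
\]
where $K$ is the number of machines, $\mathcal{P}_k$ is the data distribution on machine $k$, $p$ is the ratio of positive data, and $h_{\mathbf{w}}$ is the prediction of the deep network.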
Notations. We define the following notations:
In the remaining of this paper, we will consider the following distributed min-max problem that can cover the distributed DAM,
(3) |
where
(4) |
Assumptions. Similar to [dist_auc_guo], we make the following assumptions throughout this paper.
Assumption 1
(i) There exist such that .
(ii) PL condition: satisfies the -PL condition, i.e., ;
(iii) Smoothness: For any , is -smooth in and . is -smooth, i.e., .
(iv) Bounded variance:
(5) |
To quantify the drifts between different clients, we introduce the following assumption.
Assumption 2
Bounded client drift:
(6) |
Remark. quantifies the drift between the local objectives on clients and the global objective. denotes the homogeneous data setting that all the local objectives are identical. corresponds to the heterogeneous data setting.
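For reference, one common form of these assumptions in the min-max DAM literature [liu2019stochastic, yan2020optimal, dist_auc_guo] is stated below in our own notation, with $\phi(\mathbf{v}) := \max_{\alpha} f(\mathbf{v},\alpha)$; the paper's exact constants may differ:
\[
\begin{aligned}
&\text{(i) Initial gap: } \exists\, \mathbf{v}_0,\ \Delta_0 \text{ such that } \phi(\mathbf{v}_0)-\min_{\mathbf{v}}\phi(\mathbf{v})\le \Delta_0;\\
&\text{(ii) PL condition: } \tfrac{1}{2}\|\nabla\phi(\mathbf{v})\|^2 \ \ge\ \mu\big(\phi(\mathbf{v})-\min_{\mathbf{v}'}\phi(\mathbf{v}')\big);\\
&\text{(iii) Smoothness: } \|\nabla f_k(\mathbf{v},\alpha)-\nabla f_k(\mathbf{v}',\alpha')\| \ \le\ \ell\,\|(\mathbf{v},\alpha)-(\mathbf{v}',\alpha')\|;\\
&\text{(iv) Bounded variance: } \mathbb{E}_{\mathbf{z}}\big[\|\nabla F_k(\mathbf{v},\alpha;\mathbf{z})-\nabla f_k(\mathbf{v},\alpha)\|^2\big] \ \le\ \sigma^2;\\
&\text{(Assumption 2) Bounded client drift: } \tfrac{1}{K}\sum_{k=1}^{K}\|\nabla f_k(\mathbf{v},\alpha)-\nabla f(\mathbf{v},\alpha)\|^2 \ \le\ D^2.
\end{aligned}
\]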
4 CODA+: A stronger baseline
In this section, we present a stronger baseline than CODA [dist_auc_guo]. The motivation is that (i) the CODA algorithm uses a step to compute the dual variable from the primal variable using sampled data from all clients, but we find this step to be unnecessary thanks to an improved analysis; and (ii) the complexity of CODA for homogeneous data is not given in its original paper. Hence, CODA+ is a simplified version of CODA but with a much refined analysis.
We present the steps of CODA+ in Algorithm 1. Like CODA, it uses stagewise updates. In the -th stage, a strongly convex strongly concave subproblem is constructed as follows:
(7) |
where is the output of the previous stage.
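For reference, the analogous proximal-point construction in [liu2019stochastic, dist_auc_guo] adds a quadratic regularizer centered at the previous stage's output; we state it here in our own notation, and the exact form of (7) may differ in constants:
\[
f_s(\mathbf{v},\alpha) \;=\; f(\mathbf{v},\alpha) \;+\; \frac{1}{2\gamma}\,\big\|\mathbf{v}-\mathbf{v}_{s-1}\big\|^2, \qquad \gamma < \tfrac{1}{L},
\]
so that $f_s$ is strongly convex in $\mathbf{v}$ while remaining strongly concave in $\alpha$.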
CODA+ improves upon CODA in two respects. First, CODA+ is more concise since the output primal and dual variables of each stage can be directly used as the input for the next stage, while CODA needs an extra large batch of data after each stage to compute the dual variable. This modification not only reduces the sample complexity, but also makes the algorithm applicable to a broader family of nonconvex min-max problems. Second, CODA+ has a smaller communication complexity for homogeneous data than for heterogeneous data, whereas the previous analysis of CODA yields the same communication complexity for both settings.
The following lemma bounds the convergence of the subproblem in the -th stage.
Lemma 1
Remark. Note that the term on the RHS is the client drift caused by skipping communication. When , i.e., the machines have homogeneous data distributions, we need , so that it can be merged with the last term. When , we need , which means that has to be smaller in the heterogeneous data setting and thus the communication complexity is higher.
Remark. The key difference between the analysis of CODA+ and that of CODA lies in how to handle the term in Lemma 1. In CODA, the initial dual variable is computed from the initial primal variable , which reduces the error term to one similar to , which is then bounded by the primal objective gap due to the PL condition. However, since we do not conduct the extra computation of from , our analysis directly handles this error term by using the duality gap of . This technique was originally developed by [yan2020optimal].
Theorem 1
Set , , , and . To return such that , it suffices to choose . The iteration complexity is and the communication complexity is by setting if , and is by setting if , where suppresses logarithmic factors.
Remark. Due to the PL condition, the step size decreases geometrically. Accordingly, increases geometrically due to Lemma 1, and it increases at a faster rate when the data are homogeneous than when they are heterogeneous. As a result, the total number of communications in the homogeneous setting is much smaller than that in the heterogeneous setting.
5 CODASCA
Although CODA+ has a significantly reduced communication complexity for homogeneous data, it still suffers from a high communication complexity for heterogeneous data. Even for homogeneous data, CODA+ has a worse communication complexity, with a dependence on the number of clients, than the NPA algorithm for FDAM with a large batch size.
Can we further reduce the communication complexity of FDAM for both homogeneous and heterogeneous data without using a large batch size?
The main cause of the degradation in the heterogeneous data setting is the difference among the clients' data. Even at the global optimum , the gradients of the local functions on different clients can be different and non-zero. In the homogeneous data setting, different clients still produce different solutions due to stochastic error (cf. the term of in Lemma 1). These together contribute to the client drift.
To correct the client drift, we propose to leverage the idea of stochastic controlled averaging due to [karimireddy2019scaffold]. The key idea is to maintain and update a control variate that accounts for the client drift, which is taken into account when updating the local solutions. In the proposed algorithm CODASCA for FDAM, we apply control variates to both the primal and dual variables. CODASCA shares the same stagewise framework as CODA+, where a strongly convex strongly concave subproblem is constructed and optimized approximately in a distributed fashion in each stage. The steps of CODASCA are presented in Algorithm 3 and Algorithm 4. Below, we describe the algorithm in each stage.
Each stage has communication rounds. Between two consecutive rounds, there are local updates; each machine performs the local updates as
where are local and global control variates for the primal variable, and are local and global control variates for the dual variable. Note that and are unbiased stochastic gradients on local data. However, they are biased estimates of the global gradient when the data on different clients are heterogeneous. Intuitively, the terms and correct the local gradients so that they move closer to the global gradient. They also reduce the variance of the stochastic gradients, which helps reduce the communication complexity in the homogeneous data setting as well.
At each communication round, the primal and dual variables of all clients are aggregated, averaged, and broadcast back to all clients. The control variates at the -th round are updated as
(8) |
which is equivalent to
(9) |
Notice that they are simply the averages of the stochastic gradients used in this round. An alternative way to compute the control variates is to compute the stochastic gradient with a large batch of extra samples at the same point on each client, but this would incur extra cost and is unnecessary. and are the averages of and over all clients. After the local primal and dual variables are averaged, an extrapolation step with is performed, which boosts the convergence.
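To make the update scheme concrete, below is a minimal numpy sketch of CODASCA-style local updates with primal and dual control variates, periodic averaging, and a server-side extrapolation step. The toy min-max objective, step sizes, and noise model are illustrative assumptions only; Algorithms 3 and 4 specify the exact procedure.

```python
# Minimal sketch of control-variate-corrected local primal/dual updates with
# periodic averaging and server extrapolation (one stage, toy objective).
# Toy local objectives: f_k(v, a) = 0.5*||v - m_k||^2 + a*(b_k^T v) - 0.5*a^2,
# which are strongly convex in v and strongly concave in a (an assumption,
# not the paper's DAM objective).
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 5                      # number of clients, primal dimension
I, R = 20, 50                    # local steps per round, communication rounds
eta_l, eta_g = 0.05, 1.0         # local step size; eta_g > 1 gives extrapolation
noise = 0.1                      # stochastic gradient noise level

m = rng.normal(size=(K, d))      # heterogeneous client "data"
bvec = rng.normal(size=(K, d))

def stoch_grads(k, v, a):
    """Stochastic gradients of the toy local min-max objective on client k."""
    gv = (v - m[k]) + a * bvec[k] + noise * rng.normal(size=d)
    ga = bvec[k] @ v - a + noise * rng.normal()
    return gv, ga

v, a = np.zeros(d), 0.0                          # global primal/dual variables
cv, ca = np.zeros((K, d)), np.zeros(K)           # local control variates
cv_g, ca_g = np.zeros(d), 0.0                    # global control variates

for r in range(R):
    v_loc, a_loc = np.tile(v, (K, 1)), np.full(K, a)
    gv_sum, ga_sum = np.zeros((K, d)), np.zeros(K)
    for _ in range(I):
        for k in range(K):
            gv, ga = stoch_grads(k, v_loc[k], a_loc[k])
            gv_sum[k] += gv
            ga_sum[k] += ga
            # client-drift-corrected updates: descent in v, ascent in a
            v_loc[k] -= eta_l * (gv - cv[k] + cv_g)
            a_loc[k] += eta_l * (ga - ca[k] + ca_g)
    # control variates = averages of the stochastic gradients used this round
    cv, ca = gv_sum / I, ga_sum / I
    cv_g, ca_g = cv.mean(axis=0), ca.mean()
    # server: average the local variables, then extrapolate from the old point
    v = v + eta_g * (v_loc.mean(axis=0) - v)
    a = a + eta_g * (a_loc.mean() - a)

print("final v:", np.round(v, 3), " final a:", round(a, 3))
```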
In order to establish the convergence of CODASCA, we first present a key lemma below.
Lemma 2
Remark. Comparing the above bound with that in Lemma 1, in particular the term vs. the term , we can see that CODASCA is not affected by the data heterogeneity , and the stochastic variance is also much reduced. As will be seen in the next theorem, the values of and stay the same in all stages. Therefore, by decreasing the local step size geometrically, the communication window size can increase geometrically to ensure .
The main convergence result of CODASCA is presented below.
Theorem 2
Define constants , . Setting , , , , . After stages, the output satisfies . The communication complexity is . The iteration complexity is .
Remark. (i) The number of communications is , which is independent of the number of clients and the accuracy level . This is a significant improvement over CODA+, which has a communication complexity of in the heterogeneous setting. Moreover, is a nearly optimal rate up to a logarithmic factor, since is the lower bound on the communication complexity of distributed strongly convex optimization [karimireddy2019scaffold, ArjevaniS15] and strong convexity is a stronger condition than the PL condition.
(ii) Each stage has the same number of communication rounds. However, increases geometrically. Therefore, the numbers of iterations and samples in a stage increase geometrically. Theoretically, we could also set to the same value as in the last stage and correspondingly set to a fixed large value. But this increases the number of samples to be processed without further speeding up the convergence. Our setting of is a balance between skipping communications and reducing the sample complexity. For simplicity, we use a fixed setting of to compare CODASCA and the baseline CODA+ in our experiments to corroborate the theory.
(iii) The local step size of CODASCA decreases in a similar way to the step size in CODA+. But the in CODASCA increases at a faster rate than that in CODA+ on heterogeneous data. Notably, unlike CODA+, we do not need Assumption 2, which bounds the client drift. This means CODASCA can be applied to optimizing the global objective even if the local objectives deviate arbitrarily from the global function.
[Figure 1: Top: testing AUC on ImageNet-IH (10%) vs. the number of iterations for CODASCA and CODA+ with different communication window sizes and numbers of machines. Bottom: achieved testing AUC vs. the communication window size.]
6 Experiments
In this section, we first verify the effectiveness of CODASCA compared to CODA+ on various datasets, including two benchmark datasets, i.e., ImageNet and CIFAR100 [imagenet_cvpr09, krizhevsky2009cifar], and a constructed large-scale chest X-ray dataset. Then, we demonstrate the effectiveness of FDAM for improving the performance on a single domain (CheXpert) by using data from multiple sources. For notation, denotes the number of “clients” (# of machines, # of data sources) and denotes the communication window size.
Dataset | Source | Samples |
---|---|---|
CheXpert | Stanford Hospital (US) | 224,316 |
ChestXray8 | NIH Clinical Center (US) | 112,120 |
PadChest | Hospital San Juan (Spain) | 110,641 |
MIMIC-CXR | BIDMC (US) | 377,110 |
ChestXrayAD | H108 and HMUH (Vietnam) | 15,000 |
Chest X-ray datasets. Five medical chest X-ray datasets, i.e., CheXpert, ChestXray8, MIMIC-CXR, PadChest, and ChestXray-AD [chexpert19, wang2017chestx, johnson2019mimic, bustos2020padchest, nguyen2020vindr], are collected from different organizations. The statistics of these medical datasets are summarized in Table 2. We construct five binary classification tasks for predicting five popular diseases, Cardiomegaly (C0), Edema (C1), Consolidation (C2), Atelectasis (C3), and P. Effusion (C4), as in the CheXpert competition [chexpert19]. These datasets are naturally imbalanced and heterogeneous due to different patient populations, different data collection protocols, etc. We refer to the whole medical dataset as ChestXray-IH.
Imbalanced and Heterogeneous (IH) Benchmark Datasets. For the benchmark datasets, we manually construct imbalanced heterogeneous datasets. For ImageNet, we first randomly select 500 classes as the positive class and 500 classes as the negative class. To increase data heterogeneity, we further split all positive/negative classes into groups so that each group only owns samples from unique classes that do not overlap with those of other groups. To increase the data imbalance level, we randomly remove some samples from the positive classes for each machine. Note that, due to this operation, the whole sample set differs for different . We refer to the proportion of positive samples among all samples as the imbalance ratio (). For CIFAR100, we follow similar steps to construct imbalanced heterogeneous data. We keep the testing/validation sets untouched and balanced. For the imbalance ratio (imratio), we explore two ratios: 10% and 30%. We refer to the constructed datasets as ImageNet-IH (10%), ImageNet-IH (30%), CIFAR100-IH (10%), and CIFAR100-IH (30%). Due to the limited space, we only report imratio=10% with DenseNet121 and defer the other results to the supplement.
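Below is a minimal sketch of this IH construction on synthetic (class, sample) indices; the class counts, the number of clients, and the downsampling rule are illustrative assumptions rather than the exact preprocessing script.

```python
# Sketch of an imbalanced-heterogeneous (IH) split: relabel classes into a
# binary task, partition classes disjointly across clients, then drop positives
# on each client to reach a target imbalance ratio.
import numpy as np

rng = np.random.default_rng(0)
num_classes, samples_per_class = 1000, 50
K, imratio = 16, 0.10                      # number of clients, target positive ratio

classes = rng.permutation(num_classes)
pos_classes, neg_classes = classes[:500], classes[500:]   # binary relabeling

def split_classes(cls, K):
    """Partition class ids into K disjoint groups (one group per client)."""
    return np.array_split(rng.permutation(cls), K)

pos_groups, neg_groups = split_classes(pos_classes, K), split_classes(neg_classes, K)

clients = []
for k in range(K):
    neg = [(c, i) for c in neg_groups[k] for i in range(samples_per_class)]
    pos = [(c, i) for c in pos_groups[k] for i in range(samples_per_class)]
    # drop positives so they make up roughly `imratio` of this client's data
    n_pos = int(imratio / (1.0 - imratio) * len(neg))
    keep = rng.choice(len(pos), size=min(n_pos, len(pos)), replace=False)
    clients.append({"pos": [pos[i] for i in keep], "neg": neg})

for k in (0, K - 1):
    n_p, n_n = len(clients[k]["pos"]), len(clients[k]["neg"])
    print(f"client {k}: positives={n_p}, negatives={n_n}, ratio={n_p/(n_p+n_n):.2f}")
```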
Parameters and Settings. We train DenseNet121 on all datasets. For the parameters in CODASCA/CODA+, we tune in [500, 700, 1000] and in [0.1, 0.01, 0.001]. For the learning rate schedule, we decay the step size by a factor of 3 every iterations, where is tuned in [2000, 3000, 4000]. We experiment with a fixed value of selected from [1, 32, 64, 128, 512, 1024]. We tune in [1.1, 1, 0.99, 0.999]. The local batch size is set to 32 for each machine. We run a total of 20000 iterations for all experiments.
6.1 Comparison with CODA+
We plot the testing AUC on ImageNet-IH (10%) vs. the number of iterations for CODASCA and CODA+ in Figure 1 (top row) by varying the value of for different values of . Results on CIFAR100 are shown in the Supplement. In the bottom row of Figure 1, we plot the achieved testing AUC score vs. different values of for CODASCA and CODA+. We have the following observations: CODASCA enjoys a larger communication window size. Comparing CODASCA and CODA+ in the bottom panel of Figure 1, we can see that CODASCA tolerates a larger communication window size than CODA+ without hurting the performance, which is consistent with our theory.
CODASCA is consistently better for different values of . We compare the largest value of such that the performance does not degenerate too much compared with , which is denoted by . From the bottom figures of Figure 1, we can see that the value of CODASCA on ImageNet is 128 (=16) and 512 (=8), respectively, and that of CODA+ on ImageNet is 32 (=16) and 128 (=8). This demonstrates that CODASCA enjoys a consistent advantage over CODA+, i.e., when , , and when , . The same phenomena occur on the CIFAR100 data.
Next, we compare CODASCA with CODA+ on the ChestXray-IH medical dataset, which is also highly heterogeneous. We split the ChestXray-IH data into groups according to the patient ID, and each machine only owns samples from one organization without overlapping patients. The testing set is the collection of 5% of the data sampled from each organization. In addition, we use a train/val split of 7:3 for the parameter tuning. We run CODASCA and CODA+ with the same number of iterations. The performance on the testing set is reported in Table 3. From the results, we can observe that CODASCA performs consistently better than CODA+ on C0, C2, C3, and C4.
Method | Comm. window | C0 | C1 | C2 | C3 | C4
---|---|---|---|---|---|---
 | 1 | 0.8472 | 0.8499 | 0.7406 | 0.7475 | 0.8688
CODA+ | 512 | 0.8361 | 0.8464 | 0.7356 | 0.7449 | 0.8680
CODASCA | 512 | 0.8427 | 0.8457 | 0.7401 | 0.7468 | 0.8680
CODA+ | 1024 | 0.8280 | 0.8451 | 0.7322 | 0.7431 | 0.8660
CODASCA | 1024 | 0.8363 | 0.8444 | 0.7346 | 0.7481 | 0.8674
# of sources | C0 | C1 | C2 | C3 | C4 | AVG
---|---|---|---|---|---|---
1 | 0.9007 | 0.9536 | 0.9542 | 0.9090 | 0.9571 | 0.9353
2 | 0.9027 | 0.9586 | 0.9542 | 0.9065 | 0.9583 | 0.9361
3 | 0.9021 | 0.9558 | 0.9550 | 0.9068 | 0.9583 | 0.9356
4 | 0.9055 | 0.9603 | 0.9542 | 0.9072 | 0.9588 | 0.9372
5 | 0.9066 | 0.9583 | 0.9544 | 0.9101 | 0.9584 | 0.9376
6.2 FDAM for improving performance on CheXpert
Finally, we show that FDAM can be used to leverage data from multiple hospitals to improve the performance at a single target hospital. For this experiment, we choose the CheXpert data from Stanford Hospital as the target data. Its validation data will be used for evaluating the performance of our FDAM method. Note that improving the AUC score on CheXpert is a very challenging task. The top 7 teams on the CheXpert leaderboard differ by only 0.1% (https://stanfordmlgroup.github.io/competitions/chexpert/). Hence, we consider any improvement on this scale significant. Our procedure is as follows: we gradually increase the number of data sources, e.g., only includes the CheXpert training data, includes the CheXpert training data and ChestXray8, includes the CheXpert training data, ChestXray8 and PadChest, and so on.
Parameters and Settings. Due to the limited computing resources, we resize all images to 320x320. We follow the two-stage method proposed in [robustdeepAUC] and compare with the baseline on a single machine with a single data source (the CheXpert training data) (=1) for learning DenseNet121. More specifically, we first train a base model by minimizing the cross-entropy loss on the CheXpert training dataset using Adam with an initial learning rate of 1e-5 and a batch size of 32 for 2 epochs. Then, we discard the trained classifier, use the same pretrained model for initializing the local models at all machines, and continue training using CODASCA. For the parameter tuning, we try =[16, 32, 64, 128], learning rate=[0.1, 0.01], and we fix =1e-3, =1000 and batch size=32.
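Below is a minimal single-machine PyTorch sketch of this two-stage recipe: cross-entropy pre-training of DenseNet121 with Adam, then discarding the classifier before the distributed CODASCA stage, which is only indicated by a comment. The data loader is a placeholder, not the CheXpert pipeline.

```python
# Stage 1: cross-entropy (BCE-with-logits) pre-training of DenseNet121.
# Stage 2: discard the trained classifier; the backbone then initializes the
# local models on all machines before federated AUC training (not shown).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# placeholder data: replace with the CheXpert training set (320x320 images)
images = torch.randn(64, 3, 320, 320)
labels = torch.randint(0, 2, (64,)).float()
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

model = models.densenet121(weights=None)          # or ImageNet-pretrained weights
model.classifier = nn.Linear(model.classifier.in_features, 1)
model = model.to(device)

opt = torch.optim.Adam(model.parameters(), lr=1e-5)
bce = nn.BCEWithLogitsLoss()
for epoch in range(2):                            # 2 epochs, as described above
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = bce(model(x).squeeze(1), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reset the classifier and hand the backbone to the CODASCA stage,
# which continues training on the min-max AUC objective across machines.
model.classifier = nn.Linear(model.classifier.in_features, 1).to(device)
torch.save(model.state_dict(), "densenet121_ce_pretrained.pt")
```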
Results. We report all results in terms of AUC score on the CheXpert validation data in Table 4. Overall, we can see that using more data sources from different organizations effectively improves the performance on CheXpert. The average improvement across all 5 classification tasks from to is over , which is significant in light of the top CheXpert leaderboard results. Specifically, we can see that CODASCA with =5 achieves the highest validation AUC score on C0 and C3, and with =4 it achieves the highest AUC score on C1 and C4.
7 Conclusion
In this work, we have conducted comprehensive studies of federated learning for deep AUC maximization. We analyzed a stronger baseline for deep AUC maximization by establishing its convergence for both homogeneous data and heterogeneous data. We also developed an improved variant by adding control variates to the local stochastic gradients of both the primal and dual variables, which dramatically reduces the communication complexity. Besides strong theoretical guarantees, we also exhibited the power of FDAM on real-world medical imaging problems. We have shown that our FDAM method can improve the performance on medical imaging classification tasks by leveraging data from different organizations that are kept locally.
Appendix A Auxiliary Lemmas
Noting that all algorithms discussed in this paper, including the baselines, implement a stagewise framework, we define the duality gap of the -th stage at a point as
(10) |
Before we show the proofs, we first present the lemmas from [yan2020optimal].
Lemma 3 (Lemma 1 of [yan2020optimal])
Suppose a function is -strongly convex in and -strongly concave in . Consider the following problem
where and are convex compact sets. Denote and . Suppose we have two solutions and . Then the following relation between variable distance and duality gap holds
(11) |
Lemma 4 (Lemma 5 of [yan2020optimal])
We have the following lower bound for
where and , i.e., the initialization of -th stage is the output of the -th stage.
Appendix B Analysis of CODA+
The proof sketch is similar to the proof of CODA in [dist_auc_guo]. However, there are two noticeable differences from [dist_auc_guo]. First, in Lemma 1, we bound the duality gap instead of the objective gap as in [dist_auc_guo]. This is because the analysis later in this proof requires a bound on the duality gap.
Second, in Lemma 1, the bound for homogeneous data is better than that for heterogeneous data. The better analysis for homogeneous data is inspired by the analysis in [yu_linear], which tackles a minimization problem. Note that denotes the subproblem for stage ; we omit the index of the variables when the context is clear.
B.1 Lemmas
We need the following lemmas for the proof. Lemma 5, Lemma 6 and Lemma 7 are similar to Lemma 3, Lemma 4 and Lemma 5 of [dist_auc_guo], respectively. For the sake of completeness, we include the proofs of Lemma 5 and Lemma 6 since there is a change in the update of the primal variable.
Lemma 5
For any and , using Jensen’s inequality and the fact that is convex in and concave in ,
(12) |
By the -strong convexity of in , we have
(13) |
By -smoothness of in , we have
(14) |
where holds because is -Lipschitz in since is -smooth, holds by Young’s inequality, and is the strong concavity coefficient of in .
We know is -strongly concave in ( is -strongly convex in ). Thus, we have
(16) |
Since is -smooth in , we get
(17) |
where (a) holds because is -Lipschitz in .
Taking average over , we get
Lemma 6
Define and
(20) |
. We have
We have
(21) |
Then we will bound \small{1}⃝, \small{2}⃝, \small{3}⃝ and \small{4}⃝, respectively,
(22) |
where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality, and (c) holds because is -Lipschitz in . Using similar techniques, we have
(23) |
Let , then we have
(24) |
Hence we get
(25) |
Define another auxiliary sequence as
(26) |
Denote
(27) |
Hence, for the auxiliary sequence , we can verify that
(28) |
Since is -strongly convex, we have
(29) |
Adding this with (25), we get
(30) |
④ can be bounded as
(31) |
can be bounded by the following lemma, whose proof is identical to that of Lemma 5 in [dist_auc_guo].
Lemma 7
Define , and
We have,
can be bounded by the following lemma.
Lemma 8
If machines communicate every iterations, where , then
In this proof, we introduce a couple of new notations to make the proof brief: and . Similar bounds for minimization problems have been analyzed in [yu_linear, stich2018local].
Denote as the nearest communication round before , i.e., . By the update rule of , we have that on each machine ,
(32) |
Taking average over all machines,
(33) |
Therefore,
(34) |
In the following, we will address these two terms on the right hand side separately. First, we have
(35) |
where holds by , where ; follows because .
Second, we have
(36) |
where
(37) |
where holds because is -smooth, i.e., is -smooth.
Summing over ,
(39) |
Similarly for side, we have
(40) |
Summing up the above two inequalities,
(41) |
where the second inequality is due to , i.e., .
Based on the above lemmas, we are ready to give the convergence of the duality gap in one stage of CODA+.
B.2 Proof of Lemma 1
(42) |
Since , the term in the RHS of (42) can be cancelled. , , and will be handled by a telescoping sum. can be bounded by Lemma 8.
Taking expectation over ,
(43) |
The last inequality holds because and for any as each machine draws data independently. Similarly, we take expectation over and have
(44) |
B.3 Main Proof of Theorem 1
Since is -smooth (thus -weakly convex) in for any , is also -weakly convex. Taking , we have
(45) |
where and hold by the definition of .
Thus, we have
(47) |
Since , is -strongly convex in and -strongly concave in . Applying Lemma 3 to , we know that
(48) |
By the setting of , and , we note that . Set such that , where the specific choice of will be made later. Applying Lemma 1 with and , we have
(49) |
Since is -smooth and , then is -smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have
(50) |
Combining this with (47), rearranging the terms, and defining a constant , we get
(52) |
Using the fact that ,
(53) |
Then, we have
(54) |
Defining , then
(55) |
Using this inequality recursively yields
(56) |
By definition,
(57) |
Using inequality , we have
To make this less than , it suffices to make
(58) |
Let be the smallest value such that . We can set .
Then, the total iteration complexity is
(59) |
where uses the setting of and , and suppresses logarithmic factors.
Next, we will analyze the communication cost. We investigate both and cases.
(i) Homogeneous Data (D = 0): To assure , which we used in the above proof, we take , where is a proper constant.
If , then .
Otherwise, , then for and for .
(60) |
Thus, for both above cases, the total communication complexity can be bounded by
(61) |
(ii) Heterogeneous Data ():
To assure , which we used in the above proof, we take , where is a proper constant.
If , then for and for .
(62) |
Thus, the communication complexity can be bounded by
(63) |
Appendix C Baseline: Naive Parallel Algorithm
Note that if we set for all , CODA+ will be reduced to a naive parallel version of PPD-SG [liu2019stochastic]. We analyze this naive parallel algorithm in the following theorem.
Theorem 3
Consider Algorithm 1 with . Set , , .
(1) If , set and , then the communication/iteration complexity is to return such that .
(2) If , set and , then the communication/iteration complexity is to return such that .
(1) If , note that the settings of and are identical to those in CODA+ (Theorem 1). However, as a batch of is used on each machine at each iteration, the variance at each iteration is reduced to . Therefore, by an analysis similar to that of Theorem 1 (specifically (59)), we see that the iteration complexity of NPA is . Thus, the sample complexity of each machine is .
Appendix D Proof of Lemma 2
In this section, we will prove Lemma 2, which is the convergence analysis of one stage in CODASCA.
First, the duality gap in stage can be bounded as
Lemma 9
For any ,
By the -strong convexity of in , we have
(66) |
By -smoothness of in , we have
(67) |
where holds because is -Lipschitz in since is -smooth, and holds by Young’s inequality.
We know is -strongly concave in ( is -strongly convex in ). Thus, we have
(69) |
Since is -smooth in , we get
(70) |
where (a) holds because is -Lipschitz in .
Taking average over , we get
and can be bounded by the following lemma. For simplicity of notation, we define
(73) |
which is the drift of the variables between the sequence in the -th round and the ending point, and
(74) |
which is the drift of the variables between the sequence in the -th round and the starting point.
can be bounded as
Lemma 10
and
(75) |
Then we will bound \small{1}⃝, \small{2}⃝ and \small{3}⃝, respectively,
(76) |
where (a) follows from Young’s inequality, (b) follows from Jensen’s inequality, and (c) holds because is -smooth in . Using similar techniques, we have
(77) |
Let , then we have
(78) |
Hence we get
(79) |
Define another auxiliary sequence as
(80) |
Denote
(81) |
Hence, for the auxiliary sequence , we can verify that
(82) |
Since is -strongly convex, we have
(83) |
Adding this with (79), we get
(84) |
\small{4}⃝ can be bounded as
(85) |
Similarly for , noting is -smooth and -strongly concave in ,
We show the following lemmas where and are coupled.
Lemma 11
(86) |
(87) |
Similarly,
(88) |
Using the -smoothness of and combining with above results,
(89) |
Thus,
(90) |
Lemma 12
(91) |
(92) |
where and .
Then,
(93) |
For , we have similar results, adding them together
(94) |
Taking average over all machines,
(95) |
Taking average over ,
(96) |
Using (89), we have
(97) |
Rearranging terms,
(98) |
D.1 Main Proof of Lemma 2
(99) |
Since , the term in the RHS of (99) can be cancelled. , and will be handled by a telescoping sum. can be bounded by Lemma 12.
Taking expectation over ,
(100) |
The equality is due to
for any as each machine draws data independently, where denotes an expectation in round conditioned on events until .
The last inequality holds because
for any .
Similarly, we take expectation over and have
(101) |
Plugging (100) and (101) into (99), and taking expectation, it yields
where we use , and in the last inequality.
Using Lemma 12,
where the last inequality holds because
(102) |
where denotes a saddle point of and the second inequality uses the strong convexity and strong concavity of . In detail,
(103) |
Taking ,
It follows that
Sampling a from , we have
(106) |
Appendix E Proof of Theorem 2
Since is -weakly convex in for any , is also -weakly convex. Taking , we have
(107) |
where and hold by the definition of .
Rearranging the terms in (107) yields
(108) |
where holds by using , and holds by the -PL property of .
Thus, we have
(109) |
Since , is -strongly convex in and strong concave in . Apply Lemma 3 to , we know that
(110) |
By the setting of , , and , we note that . Applying Lemma 2, we have
(111) |
Since is -smooth and , then is -smooth. According to Theorem 2.1.5 of [DBLP:books/sp/Nesterov04], we have
(112) |
Combining this with (109), rearranging the terms, and defining a constant , we get
(114) |
Using the fact that ,
(115) |
Then, we have
(116) |
Defining , then
(117) |
Using this inequality recursively yields
(118) |
where the second inequality uses the fact , and
(119) |
To make this less than , it suffices to make
(120) |
Let be the smallest value such that . We can set to be the smallest value such that .
Then, the total communication complexity is
Total iteration complexity is
(121) |
which is also the sample complexity on each single machine.
Appendix F More Results
In this section, we report more experimental results for imratio=30% with DenseNet121 on ImageNet-IH and CIFAR100-IH in Figures 2, 3 and 4.
[Figures 2, 3 and 4: testing AUC for imratio=30% with DenseNet121 on ImageNet-IH and CIFAR100-IH for CODASCA and CODA+.]
Appendix G Descriptions of Datasets
Dataset | Source | Samples | Cardiomegaly | Edema | Consolidation | Atelectasis | Effusion |
---|---|---|---|---|---|---|---|
CheXpert | Stanford Hospital (US) | 224,316 | 0.211 | 0.342 | 0.120 | 0.310 | 0.414 |
ChestXray8 | NIH Clinical Center (US) | 112,120 | 0.025 | 0.021 | 0.042 | 0.103 | 0.119 |
PadChest | Hospital San Juan (Spain) | 110,641 | 0.089 | 0.012 | 0.015 | 0.056 | 0.064 |
MIMIC-CXR | BIDMC (US) | 377,110 | 0.196 | 0.179 | 0.047 | 0.246 | 0.237 |
ChestXrayAD | H108 and HMUH (Vietnam) | 15,000 | 0.153 | 0.000 | 0.024 | 0.012 | 0.069 |