FedSC: Provable Federated Self-supervised Learning
with Spectral Contrastive Objective over Non-i.i.d. Data
Abstract
Recent efforts have been made to integrate self-supervised learning (SSL) with the framework of federated learning (FL). One unique challenge of federated self-supervised learning (FedSSL) is that the global objective of FedSSL usually does not equal the weighted sum of local SSL objectives. Consequently, conventional approaches, such as federated averaging (FedAvg), fail to precisely minimize the FedSSL global objective, often resulting in suboptimal performance, especially when data is non-i.i.d. To fill this gap, we propose a provable FedSSL algorithm, named FedSC, based on the spectral contrastive objective. In FedSC, clients periodically share correlation matrices of data representations in addition to model weights, which enables inter-client contrast of data samples in addition to intra-client contrast and contraction, resulting in improved quality of data representations. Differential privacy (DP) protection is deployed to control the additional privacy leakage on local datasets when correlation matrices are shared. We also provide a theoretical analysis of the convergence and the extra privacy leakage. The experimental results validate the effectiveness of our proposed algorithm.
1 Introduction
As a type of unsupervised learning, self-supervised learning (SSL) aims to learn, from unlabeled data, a structured representation space in which data similarity can be measured by simple metrics, such as cosine and Euclidean distances (Chen et al., 2020; Chen & He, 2021; Grill et al., 2020; He et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; HaoChen et al., 2021). On top of the foundation model trained with SSL, a simple linear layer, also known as a linear probe, is sufficient to perform well on a wide range of downstream tasks with minimal labeled data. Owing to its high label efficiency, SSL has been adopted in a variety of applications, such as natural language processing (He et al., 2021; Brown et al., 2020) and computer vision (Ravi & Larochelle, 2016; Hu et al., 2021).
However, SSL algorithms are often executed on massive amounts of unlabeled data that may be dispersed across various locations. Moreover, the progressively tightening privacy-protection regulations frequently inhibit the centralization of data. Within this context, the federated learning (FL) framework is often favored, wherein a central server can learn from private data located on clients without the data being shared directly (McMahan et al., 2017; Stich, 2018; Li et al., 2019).
Despite the extensive study of conventional FL and its theoretical guarantees (Stich, 2018; Li et al., 2019), extending it to SSL is not straightforward. The fundamental challenge arises from the fact that, unlike FL within supervised learning, the global objective of FedSSL usually does not equal the weighted sum of local SSL objectives. Consequently, conventional FL approaches, e.g., federated averaging (FedAvg), cannot minimize the exact global objective of FedSSL, especially when data is non-independent and identically distributed (non-i.i.d.). From the perspective of contrastive learning, FedAvg only contrasts data samples within the same client (intra-client) rather than those across different clients (inter-client). Therefore, the learned representation might not be as effective at distinguishing inter-client data samples as it is with intra-client data samples.
Although recent works on FedSSL have shown great numerical success (Zhuang et al., 2021, 2022; Zhang et al., 2023; Han et al., 2022), the majority of them either overlook the previously mentioned challenge or fail to offer a theoretical analysis. FedU (Zhuang et al., 2021) and FedEMA (Zhuang et al., 2022) lack a formulation of the global objective and thus fail to provide a theoretical analysis. FedCA (Zhang et al., 2023) notices the unique challenge and proposes to share data representations, which, however, results in significant privacy leakage and communication overhead. Unlike FedU and FedEMA, which involve sharing predictors, and FedCA, which shares data representations, our proposed FedSC results in much lower communication costs, since sharing correlation matrices requires transmitting far fewer parameters than what is needed for predictors or data representations. FedX (Han et al., 2022) does not share additional information besides encoders, but still lacks a theoretical analysis. Among all these works, only our proposed FedSC deploys differential privacy (DP) protection to mitigate the extra privacy leakage from components other than encoders. Moreover, to the best of the authors' knowledge, FedSC is the only provable FedSSL method. Table 1 summarizes the differences between this work and state-of-the-art (SOTA) methods.
Method | Info. shared besides encoder | Privacy Protection | Provable
FedU | predictor | |
FedEMA | predictor | |
FedX | N/A | |
FedCA | representations | |
FedSC | correlation matrices | |
Contribution. In this work, we propose a novel FedSSL formulation based on the spectral contrastive (SC) objective (HaoChen et al., 2021). The formulation clarifies all the necessary components in FedSSL, encompassing intra-client contraction, intra-client contrast and inter-client contrast. Building upon this formulation, we propose the first provable FedSSL method, namely FedSC, with a convergence guarantee to the solutions of centralized SSL. Unlike FedAvg, clients in FedSC share correlation matrices of their local data representations in addition to the weights of local models. By leveraging the aggregated correlation matrix from the server, inter-client contrast of data samples, which is overlooked in FedAvg, can be performed in addition to local contrast and contraction. To better control and quantify the extra privacy leakage, we apply a DP mechanism to correlation matrices when they are shared. We provide a theoretical analysis of FedSC, demonstrating the convergence of the global objective and the efficacy of our method. Our contributions are summarized as follows:
We propose a novel FedSSL formulation delineating all essential components of FedSSL, which encompasses intra-client contraction, intra-client contrast and inter-client contrast. This highlights the limitations of FedAvg due to its neglect of the inter-client contrast.
We propose FedSC, in which clients are able to perform inter-client contrast of data samples by leveraging the correlation matrices of data representations shared from others, resulting in improved quality of data representations.
DP protection is applied, which effectively constrains the privacy leakage resulting from sharing correlation matrices with only negligible utility degradation.
We provide a theoretical analysis of FedSC, quantifying the extra privacy leakage and giving a convergence guarantee for the global FedSSL objective. We prove that FedSC can achieve a convergence rate, while FedAvg will have a constant error floor.
Through extensive experiments on datasets commonly used by SOTA methods, we show that FedSC achieves performance superior or comparable to that of other methods.
2 Related Works
Self-supervised learning. SSL can be mainly categorized into contrastive and non-contrastive SSL. The mechanisms and explicit objective of non-contrastive SSL algorithms are still not fully understood despite a few recent attempts (Halvagal et al., 2023; Tian et al., 2021; Zhang et al., 2022). In contrast, contrastive SSL is more intuitive and explainable. Contrastive SSL explicitly penalizes the distance between positive pairs (two samples that share the same semantic meaning), while encouraging distance between negative pairs (two samples that have different semantic meanings). For example, the SimCLR (Chen et al., 2020) objective accounts for the mutual information between positive pairs (Tschannen et al., 2019) preserved by representations. The SC objective (HaoChen et al., 2021) is equivalent to performing a spectral decomposition of the augmentation graph.
Federated Self-supervised Learning. In FedU (Zhuang et al., 2021), clients make decisions on whether the local model should be updated by the global based on the distances of two model weights when receiving global models from the server. As a follow up, FedEMA (Zhuang et al., 2022) is proposed, in which the hard decision in FedU is replaced with a weighted combination of local and global models. FedX (Han et al., 2022) designs local and global objectives using the idea of cross knowledge distillation to mitigate the effects of non-i.i.d. data. The authors of FedCA (Zhang et al., 2023) propose to share features of individual data samples in addition to local model weights for inter-client contrast, which however, results in significant privacy leakage and communication overhead.
Differential Privacy. The Gaussian and Laplace mechanisms are the most common DP approaches to protect a dataset from membership attacks (Dwork, 2006). To better analyze DP, Mironov (2017) proposes Rényi differential privacy (RDP), which characterizes operations on mechanisms, such as composition, in a more elegant way, and proves the equivalence between DP and RDP. Currently, DP has been widely deployed in FL (Wei et al., 2020; Truex et al., 2020; Hu et al., 2020; Geyer et al., 2017; Noble et al., 2022).

3 Preliminaries: Spectral Contrastive (SC) Self-supervised Learning
Spectral contrastive (SC) SSL is proposed in (HaoChen et al., 2021) with the following objective:
where is the dataset; is the representation mapping parameterized by ; is referred to as the operator of expectation; is referred to as the augmentation kernel, which is essentially a conditional distribution, and . We use to denote negative pairs, where samples and have different semantic meanings, and to denote positive pairs, where and have the same semantic meaning. Intuitively, minimizing the SC objective encourages the orthogonality of representations of a negative pair, and simultaneously promotes linear alignment of representations of a positive pair. It has been proved that solving this optimization problem is equivalent to performing a spectral decomposition of a well-defined augmentation graph, whose nodes are augmented images, i.e., from , and whose edges describe the semantic similarity of two images determined by the kernel , which results in high-quality and explainable data (node) representations (HaoChen et al., 2021).
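As a concrete illustration, the spectral contrastive loss of HaoChen et al. (2021) can be written as $-2\,\mathbb{E}[f(x)^\top f(x^+)] + \mathbb{E}[(f(x)^\top f(x^-))^2]$ for positive pairs $(x, x^+)$ and negative pairs $(x, x^-)$. The sketch below evaluates an empirical version of this loss on a small batch in plain Python. The function names, the two-view setup, and the choice to average the negative term over all pairs of representations in the batch are assumptions made for illustration and are not the authors' implementation. The second function computes the same value from correlation matrices, as $-2\,\mathrm{tr}(R^+) + \|R\|_F^2$, where $R^+$ and $R$ are our shorthand for the empirical positive correlation matrix and correlation matrix.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def outer_mean(us, vs):
    """Average outer product (1/n) * sum_i u_i v_i^T as a nested list."""
    n, d = len(us), len(us[0])
    return [[sum(u[r] * v[c] for u, v in zip(us, vs)) / n for c in range(d)]
            for r in range(d)]


def sc_loss_pairwise(z1, z2):
    """Empirical SC loss from pairs (illustrative assumption).

    z1[i] and z2[i] are representations of two views of sample i (a positive
    pair). The negative term averages the squared inner product over all
    ordered pairs of representations in the batch.
    """
    pos = sum(dot(a, b) for a, b in zip(z1, z2)) / len(z1)
    all_z = z1 + z2
    neg = sum(dot(a, b) ** 2 for a in all_z for b in all_z) / len(all_z) ** 2
    return -2.0 * pos + neg


def sc_loss_matrix(z1, z2):
    """Same quantity written as -2 tr(R_pos) + ||R||_F^2."""
    r_pos = outer_mean(z1, z2)          # positive correlation matrix
    all_z = z1 + z2
    r = outer_mean(all_z, all_z)        # correlation matrix
    trace = sum(r_pos[i][i] for i in range(len(r_pos)))
    frob2 = sum(x * x for row in r for x in row)
    return -2.0 * trace + frob2
```

For two samples whose views are perfectly aligned and mutually orthogonal, e.g. `z1 = z2 = [[1.0, 0.0], [0.0, 1.0]]`, both functions return -1.5. The matrix form is what makes correlation matrices a natural quantity to exchange between clients.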
4 Problem Formulation
In an FL system consisting of a server and clients, the -th client owns a private local dataset that is disjoint from those of the other clients. The goal of FedSSL is to optimize the SSL model (the SC model in this work) over the union of all local datasets, i.e.,
(2) |
where . Like the majority of SSL objectives, the global SC objective typically does not equal the weighted sum of local SC objectives, especially with non-i.i.d. data distribution. For the purpose of rigor, we make it an assumption instead of a claim in this work as follows.
(3) |
where are weights depending on the amount of local data. As a result, FedAvg is not guaranteed to minimize the global objective when data is non-i.i.d.
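One way to see why the global and local objectives can differ is through the quadratic negative-pair term. The sketch below assumes, as a simplification, that this term equals the squared Frobenius norm of the correlation matrix of representations, and that the global correlation matrix is the data-size-weighted average of the local ones. Under these assumptions the two-client gap between the weighted sum of local terms and the global term is exactly $p_1 p_2 \|R_1 - R_2\|_F^2$, which vanishes only when the two clients have identical correlation matrices. All names are hypothetical.

```python
def frob2(m):
    """Squared Frobenius norm of a nested-list matrix."""
    return sum(x * x for row in m for x in row)


def negative_term_gap(p1, r1, r2):
    """Weighted sum of local negative terms minus the global negative term.

    r1, r2: local correlation matrices of two clients (nested lists).
    p1: weight of client 1; client 2 has weight 1 - p1.
    """
    p2 = 1.0 - p1
    mixed = [[p1 * a + p2 * b for a, b in zip(row1, row2)]
             for row1, row2 in zip(r1, r2)]
    return p1 * frob2(r1) + p2 * frob2(r2) - frob2(mixed)


def predicted_gap(p1, r1, r2):
    """Closed form of the same gap: p1 * p2 * ||R1 - R2||_F^2."""
    diff = [[a - b for a, b in zip(row1, row2)] for row1, row2 in zip(r1, r2)]
    return p1 * (1.0 - p1) * frob2(diff)
```

With two clients whose representations occupy orthogonal directions, e.g. `r1 = [[1, 0], [0, 0]]` and `r2 = [[0, 0], [0, 1]]` with equal weights, the gap is 0.5, whereas identical matrices (the i.i.d.-like case) give a gap of zero.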
In addition, we adopt the SC framework for the following reasons. First, SC has solid theoretical derivations and simultaneously achieves performance comparable to SOTA SSL methods (HaoChen et al., 2021). Second, the SC objective suggests that correlation matrices of data representations are sufficient for contrasting negative pairs. Sharing correlation matrices only results in a constant, negligible extra communication overhead and a quantifiable privacy leakage.
5 FedSC: A Provable FedSSL Method
To simplify notation, we denote by and the positive correlation matrix and the correlation matrix, respectively. We start by manipulating the global objective
(4) |
where . From Eq. (4), we notice that can be decomposed into a weighted sum of terms corresponding to clients, where each term consists of three sub-terms accounting for intra-client contraction (of positive pairs), intra-client contrast (of negative pairs), and inter-client contrast (of negative pairs), respectively. Inspired by this decomposition, we construct the following local objective:
(5) |
where is an estimate of , whose updates rely on communication with the server. Since is treated as a constant (stop gradient) in local objectives, we intentionally remove the coefficient before the third term for gradient alignment between local and global objectives. That is to say, when , we have
(6) |
Note that directly applying FedAvg results in a misalignment of gradients, which is inherited from the fact that the global objective of FedSSL does not equal the weighted sum of local objectives, as suggested in Eq. (3).
The process of FedSC is similar to FedAvg, except that local correlation matrices are shared and aggregated in addition to model weights. To begin with, the server synchronizes local models with the global model. Subsequently, clients compute their local correlation matrices and send them to the server. Following this, the server distributes the aggregated global correlation matrices back to the clients. The clients then proceed to update their local models in accordance with the local objective specified in Eq. (5). Finally, the server aggregates the local models and initiates the next iteration. The process is summarized in Fig. 1.
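The order of operations in one communication round can be sketched as follows. This is a schematic under stated assumptions rather than the authors' Algorithm 1: `calc_corr` and `local_update` are hypothetical callables standing in for the client-side correlation computation and local training, model weights and correlation matrices are represented as nested lists, and aggregation is assumed to be a data-size-weighted average.

```python
def weighted_average(mats, weights):
    """Entry-wise weighted average of equally shaped nested-list matrices."""
    total = float(sum(weights))
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(w * m[r][c] for m, w in zip(mats, weights)) / total
             for c in range(cols)] for r in range(rows)]


def fedsc_round(global_w, clients, calc_corr, local_update):
    """One schematic FedSC round; each client is a dict with a size field "n"."""
    sizes = [c["n"] for c in clients]
    # Steps 1-2: clients start from the global model and report local
    # correlation matrices computed with it.
    local_corrs = [calc_corr(global_w, c) for c in clients]
    # Step 3: the server aggregates and redistributes the global matrix.
    global_corr = weighted_average(local_corrs, sizes)
    # Step 4: clients train locally, treating global_corr as a constant.
    local_ws = [local_update(global_w, c, global_corr) for c in clients]
    # Step 5: the server aggregates the local models.
    new_w = weighted_average(local_ws, sizes)
    return new_w, global_corr
```

Compared with a FedAvg round, the only additions are the upload and broadcast of the correlation matrices before local training starts.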
The detailed algorithm of FedSC is shown in Algorithm 1. Here, clients use Algorithm 2 DP-CalR to calculate local correlation matrices to be shared with differential privacy (DP) protection, which is detailed in Sec. 5.1. During local training, clients minimize through stochastic gradient descent (SGD), which is detailed in Sec. 5.2. It can be noticed that both clients and the server maintain the knowledge of global correlation matrix .
Intuitively, since the averaged local gradients align with the global gradient as shown in Eq. (6), the drift and variance of local gradients contribute and to the convergence rate, respectively, which has been extensively studied by previous works on FedAvg. The difference is that the shared correlation matrix introduces additional perturbation due to its aging (compared with instant correlation matrix ) and DP noise. The perturbation caused by aging is proportional to the movements of weights, which is proportional to the squared learning rate , thus contributing an additional factor to the convergence rate. This is what motivates the design of FedSC.
5.1 Correlation Matrices Sharing
DP protection is applied when correlation matrices are shared, to mitigate additional privacy leakage on local datasets. A typical Gaussian mechanism is adopted, with parameters and controlling sensitivity and noise scale, respectively. The process is summarized in Algorithm 2.
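A minimal sketch of a Gaussian mechanism applied to a correlation matrix is given below. It is not a transcription of Algorithm 2: the norm clipping of representations, the single-view averaging, the i.i.d. noise on every entry, and the names `clip` and `sigma` are all assumptions chosen to illustrate how one parameter bounds sensitivity while another sets the noise scale.

```python
import math
import random


def dp_corr(reps, clip, sigma, rng):
    """Noisy correlation matrix of clipped representations (illustrative).

    reps: list of representation vectors, one per local sample.
    clip: maximum L2 norm allowed per representation (bounds sensitivity).
    sigma: standard deviation of the Gaussian noise added to each entry.
    rng: a random.Random instance.
    """
    n, d = len(reps), len(reps[0])
    clipped = []
    for z in reps:
        norm = math.sqrt(sum(x * x for x in z))
        scale = min(1.0, clip / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in z])
    corr = [[sum(z[r] * z[c] for z in clipped) / n for c in range(d)]
            for r in range(d)]
    return [[corr[r][c] + rng.gauss(0.0, sigma) for c in range(d)]
            for r in range(d)]
```

In this sketch, replacing one sample changes the noiseless matrix by at most `2 * clip**2 / n` in Frobenius norm, so the sensitivity shrinks as the local dataset grows.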
5.2 Local Training
The local training process follows mini-batch stochastic gradient descent (SGD). At each iteration, consider a batch of samples . Let be views augmented from . The empirical correlation matrices are calculated as follows:
The batch loss can be obtained by substituting and in Eq. (5) with and , respectively. Local training then proceeds by back-propagating the batch loss and updating the model weights iteratively.
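To make the batch statistics concrete, the sketch below computes empirical correlation matrices from a batch with several augmented views per sample. The exact normalization is an assumption on our part: the correlation matrix averages outer products of every view with itself, and the positive correlation matrix averages outer products over ordered pairs of distinct views of the same sample. The function name and data layout are hypothetical.

```python
def batch_correlations(views):
    """Empirical (positive correlation, correlation) matrices of a batch.

    views[i][v] is the representation of view v of sample i; at least two
    views per sample are required.
    """
    b, v, d = len(views), len(views[0]), len(views[0][0])
    r = [[0.0] * d for _ in range(d)]
    r_pos = [[0.0] * d for _ in range(d)]
    for sample in views:
        for a, za in enumerate(sample):
            for c, zc in enumerate(sample):
                # Same view -> correlation matrix; distinct views of the
                # same sample -> positive correlation matrix.
                if a == c:
                    target, scale = r, b * v
                else:
                    target, scale = r_pos, b * v * (v - 1)
                for i in range(d):
                    for j in range(d):
                        target[i][j] += za[i] * zc[j] / scale
    return r_pos, r
```

In an automatic-differentiation framework the same sums would be written as batched matrix products so that gradients flow through both matrices.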
5.3 Comparison with existing FedSSL frameworks
In this subsection, we discuss the privacy leakage and communication overhead of FedSC in comparison with other FedSSL frameworks.
Sharing correlation matrices only results in negligible communication overhead: Although FedSC shares correlation matrices in addition, it still results in less communication overhead than SOTA non-contrastive FedSSL frameworks (Zhuang et al., 2021, 2022; He et al., 2020), due to the implementation of predictor in non-contrastive SSL methods. For example, the feature dimension is in our experiments, thus the correlation matrices yield additional parameters to be communicated. In contrast, the structure of the predictor is often a three-layer multilayer perceptron (MLP), which contains parameters that are multiples of the correlation matrices. In our case, we choose a typical size of resulting in parameters. The overhead of correlation matrices is negligible compared with the encoders. Therefore, even compared with contrastive SSL, which does not have a predictor, the communication overhead resulting from sharing correlation matrices is not a concern.
The extra privacy leakage is probably comparable to that caused by sharing predictors: The predictors in non-contrastive SSL also lead to potential privacy leakage. Although a theoretical characterization has not been established, recent works shed light on the operational meaning of the predictors (Halvagal et al., 2023; Tian et al., 2021), suggesting what information is probably leaked. In particular, Tian et al. (2021) report that linear predictors in BYOL align with the correlation matrices during training. This finding suggests that predictors probably contain information similar to that in the correlation matrices.
6 Theoretical Analysis
In this section, we analyze the additional privacy leakage and the convergence of FedSC. Our findings are summarized as follows:
We prove that sharing correlation matrices through DP-CalR results in -DP.
We provide the convergence analysis of FedSC. Specifically, with a large batch size , a large number of views and a small scale of DP noise , we can achieve a convergence rate close to .
The analysis indicates superior performance of FedSC over FedAvg, whose convergence bound is dominated by a constant term, i.e., an error floor.
6.1 Additional Privacy Leakage
In this subsection, we analyze the Gaussian mechanism in Algorithm 2 DP-CalR. We start with definitions of variations of differential privacy (DP).
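Definition 6.1 ($(\epsilon, \delta)$-DP).
A mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-DP, if for any neighboring datasets $\mathcal{D}$ and $\mathcal{D}'$ and any set of outputs $S$, the following inequality is satisfied:
$$\Pr[\mathcal{M}(\mathcal{D}) \in S] \le e^{\epsilon}\, \Pr[\mathcal{M}(\mathcal{D}') \in S] + \delta.$$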
DP protects the inputs of a mechanism from membership inference attacks. For a mechanism satisfying DP, we expect that one can hardly tell whether the input contains a certain entry by only looking at the output. In our case, we do not want the server to know whether a local dataset contains a particular data point.
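Definition 6.2 ($(\alpha, \epsilon)$-RDP (Mironov, 2017)).
A mechanism $\mathcal{M}$ has $(\alpha, \epsilon)$-Rényi differential privacy, if for any neighboring datasets $\mathcal{D}$ and $\mathcal{D}'$, the following inequality is satisfied:
$$D_{\alpha}\big(\mathcal{M}(\mathcal{D}) \,\|\, \mathcal{M}(\mathcal{D}')\big) \le \epsilon,$$
where $D_{\alpha}(P \,\|\, Q) \triangleq \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\big[(P(x)/Q(x))^{\alpha}\big]$ is the Rényi divergence of order $\alpha > 1$.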
RDP is a variation of DP with many good properties, which are summarized in the following Lemmas.
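Lemma 6.3 (Gaussian Mechanism of RDP (Mironov, 2017)).
Let $f$ be a function with $\ell_2$-sensitivity $\Delta$, then the Gaussian mechanism $f(\cdot) + \mathcal{N}(0, \sigma^2 I)$ is $\big(\alpha, \frac{\alpha \Delta^2}{2\sigma^2}\big)$-RDP.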
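Lemma 6.4 (Composition of RDP (Mironov, 2017)).
Let $\mathcal{M}_1$ be $(\alpha, \epsilon_1)$-RDP, and $\mathcal{M}_2$ be $(\alpha, \epsilon_2)$-RDP. Then the composed mechanism $(\mathcal{M}_1, \mathcal{M}_2)$ is $(\alpha, \epsilon_1 + \epsilon_2)$-RDP.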
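Lemma 6.5 (Mironov, 2017).
If a mechanism is $(\alpha, \epsilon)$-RDP, then it is $\big(\epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\big)$-DP for any $0 < \delta < 1$.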
With all these preparations, we use the following proposition to characterize the additional privacy leakage of FedSC.
Proposition 6.6 (Additional Privacy Leakage of FedSC).
Sharing correlation matrices for times through Algorithm 2 DP-CalR results in -DP.
Proof.
We start with the sensitivity of DP-CalR.
where is the representation of the -th view of data . Notice that for any
The sensitivity is finally bounded by . By Lemma 6.3, DP-CalR is -RDP. By Lemma 6.4, sharing correlation matrices for times results in -RDP, which is -DP using Lemma 6.5 with . ∎
From this result, we notice that for arbitrary , and , the parameter approaches zero as the size of the local dataset approaches infinity, indicating vanishing differential privacy leakage.
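The chain of lemmas used in the proof can be turned into a small privacy accountant. The sketch below relies only on the standard RDP facts from Mironov (2017): a Gaussian mechanism with $\ell_2$-sensitivity $\Delta$ and noise standard deviation $\sigma$ is $(\alpha, \alpha\Delta^2/(2\sigma^2))$-RDP, RDP budgets add under composition, and $(\alpha, \epsilon)$-RDP implies $(\epsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP. The function names, the grid of orders, and minimizing over that grid are our own illustrative choices and may differ from the specific order used in the proof.

```python
import math


def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP budget of one Gaussian mechanism release at order alpha."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def rdp_to_dp(alpha, eps_rdp, delta):
    """Convert an (alpha, eps_rdp)-RDP guarantee to an (eps, delta)-DP one."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)


def epsilon_after_sharing(times, sensitivity, sigma, delta,
                          alphas=(1.5, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)):
    """Smallest epsilon over a grid of orders after `times` releases."""
    return min(
        rdp_to_dp(a, times * gaussian_rdp(a, sensitivity, sigma), delta)
        for a in alphas
    )
```

Because the sensitivity of an averaged correlation matrix decreases with the local dataset size, calling `epsilon_after_sharing` with a smaller `sensitivity` yields a smaller epsilon for the same number of rounds and noise scale, in line with the remark above.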
SVHN | CIFAR10 | CIFAR100 | SVHN | CIFAR10 | CIFAR100 | |
Participation | ||||||
FedAvg + BYOL | ||||||
FedAvg + SC | ||||||
FedU | ||||||
FedEMA | ||||||
FedX | ||||||
FedCA | ||||||
FedSC (Proposed) | ||||||
Centralized SC |
6.2 Convergence of FedSC
This subsection presents the convergence of FedSC. We begin with the following assumptions.
Assumption 6.7.
For any and , NN’s output is bounded in norm for some constant .
Assumption 6.8.
For any and , the Jacobian matrix of NN’s output is bounded in norm for some constant .
Assumption 6.9.
The function represented by the NN has bounded second-order derivatives, i.e., for any and ,
for some constant , where refers to the partial derivative with respect to the -th entry of .
Assumption 6.7 can be satisfied when the NN has a normalization layer at the end or uses bounded activation functions, such as sigmoid, at the output layer. Assumption 6.8 accounts for the Lipschitz continuity of , which is often the case when the hidden layers of an NN use activation functions such as tanh, sigmoid and ReLU. Note that Assumption 6.8 is weaker than the bounded gradient norm assumption used in previous works (Li et al., 2019; Noble et al., 2022). However, in our case, it can lead to a bounded gradient norm due to the structure of SC objectives, which is detailed in the appendices. Assumption 6.9 accounts for the strong smoothness of the NN, which is widely adopted in existing works (Karimireddy et al., 2020b; Li et al., 2019; Karimireddy et al., 2020a). With these assumptions, we demonstrate the convergence of FedSC with the following theorem.
Theorem 6.10.
Let Assumptions 6.7, 6.8 and 6.9 hold. Choose , and the local learning rate , where and are the number of communication rounds and local updates, respectively. Then FedSC achieves
(7) |
where is the virtual averaged weights, and the local weights of client at the -th step in the -th round; and are batch size and number of augmented views, respectively; and here are constants only depending on , and .
6.2.1 Superior performance of FedSC
The convergence rate of FedSC is dominated by the following term when approaches infinity
(8) |
The bias of local batch gradients results in a rate of , in which specifically, sampling the data set (without replacement) leads to the rate of , and sampling the augmentation kernel results in . Note that this bias also exists in centralized SSL training, and does not result from federation. Sampling variance, from the augmentation kernel, in the shared correlation matrix contributes to the convergence. The DP noise contributes . If we set batch size , generate an infinite number of views, and do not apply DP protection, i.e., , then the error floor disappears, which results in a convergence rate similar to that of FedAvg in supervised FL. In comparison, if we directly use the SC objective without modification and apply FedAvg, there will be a constant error floor independent of the batch size and the number of views , due to the misalignment between the averaged local objectives and the global objective.
6.2.2 Sketch of Proof
We begin with the case of full client participation. The convergence is determined by two terms: 1) the squared norm of the bias of the averaged local gradient and 2) the variance of the averaged local gradient.
The norm of the bias can be factorized into three components: 1.a) The “drift” of the local weights, leading to a factor of . 1.b) The bias in the batch gradient of local objectives , which contributes a factor of . This bias is due to the fact that the SSL objective cannot be written as a sum of per-sample losses as in supervised learning. 1.c) The impact of aging, sampling variance, and DP noise in the shared correlation matrices . Note that remains constant during local training. The aging (compared with ) leads to a bias proportional to the drift of local weights, resulting in a factor of . The sampling variance in contributes a factor of . DP noise contributes a factor of , where is the dimension of the representation.
The variance in the averaged local gradient contributes a factor of , considering 2.a) the variance in batch gradient sampling and 2.b) the DP noise in .
For partial client participation, we need to consider the variance in aggregation and the additional aging of . Given a bounded gradient norm, the variance due to client sampling is . The additional aging is proportional to the extra drift, leading to a rate of .
SVHN | CIFAR10 | SVHN | CIFAR10 | |
Participation | ||||
FedAvg+SC | ||||
7 Experiments
7.1 Experimental Setup
Datasets: Three datasets, SVHN, CIFAR10 and CIFAR100, are used for evaluation. SVHN is split into disjoint local datasets, each of which contains classes. CIFAR10 is split into disjoint local datasets according to the classes. CIFAR100 is split into disjoint local datasets, each of which contains classes. Therefore, the sizes of the local datasets for the SVHN, CIFAR10 and CIFAR100 tasks are , and , respectively.
Models: For SVHN and CIFAR10, we use a modified version of ResNet20 as backbones. For CIFAR100, the backbone is a modified version of ResNet50.
Hyper-parameters: For all three tasks, the number of communication rounds is , and the number of local epochs is . For SVHN and CIFAR10, the batch size is . For CIFAR100, the batch size is . The number of views for all experiments. For correlation matrices sharing, the number of views is set as .
Benchmarks: Besides FedAvg+BYOL and FedAvg+SC, we also compare with the following state-of-the-art methods: FedU (Zhuang et al., 2021), FedEMA (Zhuang et al., 2022), FedX (Han et al., 2022) and FedCA (Zhang et al., 2023).



7.2 Experimental Results
Comparison with SOTA approaches: Table 2 presents the performance comparisons of various algorithms under linear evaluation, where the centralized SC serves as an ideal upper bound. We make the following three observations: (1) Our proposed algorithm, FedSC, demonstrates better or comparable performance across different tasks compared to other methods. (2) FedAvg+BYOL, FedU, and FedEMA show good results on SVHN but underperform on CIFAR10 and CIFAR100. We believe that this disparity is caused by the larger local dataset size in SVHN, leading to more local updates. Since these methods incorporate momentum updates in the target encoder, a larger number of updates might be necessary to effectively initiate local training. (3) FedSC and FedCA exhibit less performance degradation when switched to the partial client participation case. We believe this is because clients in both FedSC and FedCA have extra global information about representations. Additionally, predictors in FedAvg+BYOL, FedU, and FedEMA are affected by client sampling, which hinders their ability to provide global information.
CIFAR100 | ||
Participation | ||
FedAvg+SC | ||
SVHN | CIFAR10 | CIFAR100 | |
Participation | |||
FedAvg+SC | |||
DP Impact: Tables 3, 4 and 5 illustrate the impact of the DP mechanism on FedSC’s performance. They show that with a reasonable degree of DP protection, there is only a modest decline in FedSC’s performance, which remains better than that of FedAvg+SC. Given that our focus is on data-level DP, the extra privacy leakage shown in the tables is typically insignificant when compared to the leakage resulting from the encoders. On the other hand, according to the analysis in Sec. 6.1, a smaller dataset necessitates a higher level of DP noise to maintain the same degree of privacy protection. The local dataset sizes for SVHN, CIFAR10, and CIFAR100 tasks are , , and , respectively. As a result, for the CIFAR100 task, we choose a slightly higher privacy budget compared to the other two tasks.
Convergence: Fig. 2 compares the convergence of the proposed FedSC and FedAvg+SC, in terms of communication rounds and KNN accuracy. The figures reveal that FedAvg+SC tends to experience either a high error rate or overfitting as the number of communication rounds grows. In contrast, FedSC consistently improves KNN accuracy. This validates our theoretical analysis in Sec. 6.
8 Conclusion
In this paper, we proposed FedSC, a novel FedSSL framework based on spectral contrastive objectives. In FedSC, clients periodically share correlation matrices in addition to local weights. With shared correlation matrices, clients are able to perform inter-client sample contrast in addition to intra-client contrast and contraction. To mitigate the extra privacy leakage on local datasets, we adopted a DP mechanism on the shared correlation matrices. We provided a theoretical analysis of privacy leakage and convergence, demonstrating the efficacy of FedSC. To the best of the authors' knowledge, this is the first provable FedSSL method.
9 Impact Statements
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Bardes et al. (2021) Bardes, A., Ponce, J., and LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2020) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
- Chen & He (2021) Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15750–15758, 2021.
- Dwork (2006) Dwork, C. Differential privacy. In International colloquium on automata, languages, and programming, pp. 1–12. Springer, 2006.
- Geyer et al. (2017) Geyer, R. C., Klein, T., and Nabi, M. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
- Grill et al. (2020) Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- Halvagal et al. (2023) Halvagal, M. S., Laborieux, A., and Zenke, F. Implicit variance regularization in non-contrastive ssl. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Han et al. (2022) Han, S., Park, S., Wu, F., Kim, S., Wu, C., Xie, X., and Cha, M. Fedx: Unsupervised federated learning with cross knowledge distillation. In European Conference on Computer Vision, pp. 691–707. Springer, 2022.
- HaoChen et al. (2021) HaoChen, J. Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000–5011, 2021.
- He et al. (2021) He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2021.
- He et al. (2020) He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020.
- Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hu et al. (2020) Hu, R., Guo, Y., Li, H., Pei, Q., and Gong, Y. Personalized federated learning with differential privacy. IEEE Internet of Things Journal, 7(10):9530–9539, 2020.
- Karimireddy et al. (2020a) Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
- Karimireddy et al. (2020b) Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp. 5132–5143. PMLR, 2020b.
- Li et al. (2019) Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
- McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273–1282. PMLR, 20–22 Apr 2017.
- Mironov (2017) Mironov, I. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pp. 263–275. IEEE, 2017.
- Noble et al. (2022) Noble, M., Bellet, A., and Dieuleveut, A. Differentially private federated learning on heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 10110–10145. PMLR, 2022.
- Ravi & Larochelle (2016) Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International conference on learning representations, 2016.
- Stich (2018) Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
- Tian et al. (2021) Tian, Y., Chen, X., and Ganguli, S. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pp. 10268–10278. PMLR, 2021.
- Truex et al. (2020) Truex, S., Liu, L., Chow, K.-H., Gursoy, M. E., and Wei, W. LDP-Fed: Federated learning with local differential privacy. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, pp. 61–66, 2020.
- Tschannen et al. (2019) Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
- Wei et al. (2020) Wei, K., Li, J., Ding, M., Ma, C., Yang, H. H., Farokhi, F., Jin, S., Quek, T. Q., and Poor, H. V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
- Zbontar et al. (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow Twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.
- Zhang et al. (2022) Zhang, C., Zhang, K., Zhang, C., Pham, T. X., Yoo, C. D., and Kweon, I. S. How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. arXiv preprint arXiv:2203.16262, 2022.
- Zhang et al. (2023) Zhang, F., Kuang, K., Chen, L., You, Z., Shen, T., Xiao, J., Zhang, Y., Wu, C., Wu, F., Zhuang, Y., et al. Federated unsupervised representation learning. Frontiers of Information Technology & Electronic Engineering, 24(8):1181–1193, 2023.
- Zhuang et al. (2021) Zhuang, W., Gan, X., Wen, Y., Zhang, S., and Yi, S. Collaborative unsupervised visual representation learning from decentralized data. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4912–4921, 2021.
- Zhuang et al. (2022) Zhuang, W., Wen, Y., and Zhang, S. Divergence-aware federated self-supervised learning. arXiv preprint arXiv:2204.04385, 2022.
Appendix A Derivation of SC objective
(9) |
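For orientation, the spectral contrastive loss of HaoChen et al. (2021), on which this derivation builds, is commonly written as below. The notation is assumed here for illustration (encoder $f$, positive pair $(x, x^+)$ of augmented views of the same sample, independent augmented samples $x, x'$, correlation matrix $R$) and may differ from the symbols used in eq. (9).

```latex
\mathcal{L}_{\mathrm{SC}}(f)
  = -2\,\mathbb{E}_{(x,\,x^{+})}\!\left[ f(x)^{\top} f(x^{+}) \right]
    + \mathbb{E}_{x,\,x'}\!\left[ \left( f(x)^{\top} f(x') \right)^{2} \right],
\qquad
\mathbb{E}_{x,\,x'}\!\left[ \left( f(x)^{\top} f(x') \right)^{2} \right]
  = \left\| R \right\|_{F}^{2},
\quad
R = \mathbb{E}_{x}\!\left[ f(x) f(x)^{\top} \right].
```

The second identity holds when $x$ and $x'$ are drawn independently from the same distribution. It shows why the contrast term can be evaluated from a correlation matrix of representations rather than from the raw samples.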
Appendix B Proof of Theorem 6.10
B.1 Additional Notations
Let and be the local weights and local SGD direction, respectively, at the -th update in the -th communication round. Denote by and the virtual averaged weights and moving direction, respectively. Since the server aggregates periodically, we have . For simplicity, we drop the superscript "SC" in and where this causes no ambiguity.
B.2 Assumptions
Assumption B.1.
For any and , the NN's output is bounded in norm .
Assumption B.2.
For any and , the Jacobian of the NN's output is bounded in norm .
Assumption B.3.
The function represented by the NN has bounded second-order derivatives, i.e., for any and
(10) |
B.3 Lemmas
Lemma B.4.
For any , , and , the following inequalities hold
(11) |
(12) |
Proof.
(13) |
(14) |
The remaining results follow directly from Jensen's inequality. ∎
Lemma B.5.
The following function, whose -th entry is defined as
(15) |
is -Lipschitz continuous with
(16) |
Proof.
We start with the derivative of
(17) |
Using AM-GM, we have
(18) |
For the first term, recall the definition of , we have
(19) |
Consequently, we have
(20) |
where the first inequality uses AM-GM, and the second inequality uses Jensen’s inequality.
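The two inequalities invoked repeatedly in these proofs are standard. Their generic forms are recalled below for arbitrary vectors $a, b, a_1, \dots, a_n$. These symbols are placeholders introduced here, not the paper's notation, and the exact variant applied at each step may differ.

```latex
% AM-GM (Young) type bounds
2\,\langle a, b \rangle \le \|a\|^{2} + \|b\|^{2},
\qquad
\|a + b\|^{2} \le 2\,\|a\|^{2} + 2\,\|b\|^{2}.

% Jensen's inequality for the squared norm
\left\| \frac{1}{n} \sum_{i=1}^{n} a_i \right\|^{2}
  \le \frac{1}{n} \sum_{i=1}^{n} \left\| a_i \right\|^{2}.
```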
Corollary B.6.
The global loss is -smooth.
Proof.
Notice that . The result follows after applying Lemma B.5. ∎
Lemma B.7.
For any random matrix , we have
(28) |
Proof.
(29) |
∎
Lemma B.8.
The local stochastic gradient with -th entry defined as
(30) |
has bounded norm
(31) |
where is the history before the -th round; is the variance of the DP noise and is the dimension of the representation .
Proof.
(32) |
For the first term,
(33) |
Then we have
(34) |
and
(35) |
where the line uses Jensen’s inequality and AM-GM. For the second term,
(36) |
where the last inequality uses Lemma B.4. Combining the above results, we have
(37) |
where we use the fact that is essentially a correlation matrix plus DP noise with scale . ∎
B.4 Proof for the Full Participation Case
From the -smoothness of given by Corollary B.6, we have
(38) |
Denote the history of the optimization process as , then we have
(39) |
Letting and , we have
(40) |
By the choice of , we have
(41) |
B.4.1 Bounding the term
Recall the definition of local batch loss
(42) |
The -th entry of is
(43) |
Taking the expectation over , we have
(44) |
The -th entry of the global loss gradient is
(45) |
Decompose , where the terms are defined as follows.
(46) |
Then we have
(47) |
The term can be written as follows
(48) |
where the third line uses Jensen’s inequality and the last line uses AM-GM. By Lemma , we have
(49) |
For the term , we have
(50) |
where the third inequality uses Lemma B.4; is a batch of samples drawn from , and are augmented views of . Notice that
(51) |
we have
(52) |
where the last inequality uses Lemma B.4, the fact and sampling with and without replacement.
For the term , we have
(53) |
where the second inequality uses Lemma B.4, and the third inequality uses Jensen’s inequality.
B.4.2 Bounding the term
(58) |
Comparing eq. (43) and eq. (44), we have
(59) |
(60) |
For term , we have
(61) |
where the first equality uses ; the last inequality uses the variance under sampling without replacement.
B.4.3 Combining the results
B.5 Partial Participation Case
Partial participation introduces a perturbation in the aggregation.
(77) |
where one can easily verify that from Lemma B.8 serves as a bound on the gradient norm; the second line uses the mean-value theorem. Another aspect is that is updated less frequently on the server. Therefore, the term should include an additional term accounting for the aging of the correlation matrix,
(78) |
The reason is that the difference between the current and old correlation matrices is proportional to the distance between the current and old variables (shown in eq. (54)), which is proportional to (shown in eq. (69)). Thus, we finally have
(79) |
round indices | local dataset size | |||
SVHN | ||||
SVHN | ||||
SVHN | ||||
SVHN | ||||
CIFAR10 | ||||
CIFAR10 | ||||
CIFAR10 | ||||
CIFAR10 | ||||
CIFAR100 | ||||
CIFAR100 | ||||
CIFAR100 | ||||
CIFAR100 |
Appendix C Detailed Implementation of FedSC
Recall the local objective
(80) |
Here we replace with a general coefficient , and decay it linearly from to over the communication rounds. The motivation is as follows. At the beginning of training, the moving directions from the global objective and the average local objective tend to align closely. Moreover, the correlation matrices of clients are not yet stable at this stage, making it less critical to at early stages. Therefore, we choose a large for a quicker start. Conversely, the correlation matrices converge and become stable toward the end of training, so we give the inter-client contrast a larger weight, i.e., a smaller .
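The linear decay of the coefficient over communication rounds can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `linear_decay_coefficient` and the endpoint values `start=1.0` and `end=0.0` are hypothetical, since the actual endpoints are not recoverable from the text above.

```python
def linear_decay_coefficient(round_idx, total_rounds, start, end):
    """Linearly interpolate a coefficient from `start` (round 0) to `end` (last round)."""
    if total_rounds < 1:
        raise ValueError("total_rounds must be at least 1")
    if total_rounds == 1:
        return float(start)
    # Clamp so that out-of-range round indices do not extrapolate.
    clamped = min(max(round_idx, 0), total_rounds - 1)
    progress = clamped / (total_rounds - 1)
    return start + (end - start) * progress


# Hypothetical endpoints: a large coefficient early, a small one late.
schedule = [linear_decay_coefficient(t, 5, start=1.0, end=0.0) for t in range(5)]
```

A large early value favors the local term for a quicker start, and the smaller late value shifts weight to the inter-client contrast once the correlation matrices have stabilized.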
We also make modifications when DP protection is applied. Based on the above analysis, we start sharing at the middle or late stages of training to save privacy budget. The implementation details are as follows. For partial client participation, we only change according to the participation ratio.
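The delayed, noise-protected sharing described above might look like the following sketch. Several parts are assumptions made for illustration: the helper names, the `share_start_round` parameter, and the use of i.i.d. Gaussian noise on every matrix entry. Calibrating `noise_std` to a privacy budget (which relies on bounded representation norms, as in Assumption B.1) is not shown.

```python
import random


def correlation_matrix(representations):
    """Average outer product (1/n) * sum_i z_i z_i^T of d-dimensional rows."""
    n = len(representations)
    d = len(representations[0])
    corr = [[0.0] * d for _ in range(d)]
    for z in representations:
        for a in range(d):
            for b in range(d):
                corr[a][b] += z[a] * z[b] / n
    return corr


def maybe_share_correlation(representations, round_idx, share_start_round,
                            noise_std, rng):
    """Return a noisy correlation matrix, or None before sharing starts."""
    if round_idx < share_start_round:
        return None  # nothing is released, so no privacy budget is spent
    corr = correlation_matrix(representations)
    # Assumption: independent Gaussian noise per entry (Gaussian mechanism).
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in corr]


rng = random.Random(0)
reps = [[1.0, 0.0], [0.0, 1.0]]  # toy representations, hypothetical values
early = maybe_share_correlation(reps, 3, share_start_round=10, noise_std=0.1, rng=rng)
late = maybe_share_correlation(reps, 12, share_start_round=10, noise_std=0.1, rng=rng)
```

Before `share_start_round` the client releases nothing. Afterwards each release is perturbed, so the noisy matrix is generally not symmetric in this simplified form.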