
Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning

Da Yu1,2,∗, Huishuai Zhang2,∗, Wei Chen2, Tie-Yan Liu2
1School of Computer Science and Engineering, Sun Yat-sen University
2Microsoft Research Asia
1yuda3@mail2.sysu.edu.cn
2{huzhang,wche,tyliu}@microsoft.com
∗Authors contributed equally to this work.
Abstract

Differential privacy provides a way to bound how much a trained model can leak about its training data. However, for meaningful privacy parameters, differentially private training degrades utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm, Gradient Embedding Perturbation (GEP), towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradients into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with reasonable computational cost and modest privacy guarantee for deep models. In particular, with privacy bound ϵ=8\epsilon=8, we achieve 74.9%74.9\% test accuracy on CIFAR10 and 95.1%95.1\% test accuracy on SVHN, significantly improving over existing results.

1 Introduction

Recent works have shown that a trained model may leak/memorize the information of its training set (Fredrikson et al., 2015; Wu et al., 2016; Shokri et al., 2017; Hitaj et al., 2017), which raises privacy issues when the models are trained on sensitive data. The differential privacy (DP) mechanism provides a way to quantitatively measure and upper bound such information leakage. It theoretically ensures that the influence of any individual sample is negligible, as quantified by the DP parameter ϵ\epsilon or (ϵ,δ)(\epsilon,\delta). Moreover, it has been observed that differentially private models can also resist model inversion attacks (Carlini et al., 2019), membership inference attacks (Rahman et al., 2018; Bernau et al., 2019; Sablayrolles et al., 2019; Yu et al., 2021), gradient matching attacks (Zhu et al., 2019), and data poisoning attacks (Ma et al., 2019).

One popular way to achieve differentially private machine learning is to perturb the training process with noise (Song et al., 2013; Bassily et al., 2014; Shokri & Shmatikov, 2015; Wu et al., 2017; Fukuchi et al., 2017; Iyengar et al., 2019; Phan et al., 2020). Specifically, gradient perturbation perturbs the gradient at each iteration of the (stochastic) gradient descent algorithm and guarantees the privacy of the final model via the composition property of DP. It is worth noting that gradient perturbation does not assume a (strongly) convex objective and hence is applicable to various settings (Abadi et al., 2016; Wang et al., 2017; Lee & Kifer, 2018; Jayaraman et al., 2018; Wang & Gu, 2019; Yu et al., 2020). Specifically, for a given gradient sensitivity SS, a general form of gradient perturbation is to add isotropic Gaussian noise 𝒛{\bm{z}} to the gradient 𝒈p{\bm{g}}\in\mathbb{R}^{p} independently at each step,

𝒈~=𝒈+𝒛,where𝒛𝒩(0,σ2S2𝑰p×p).\displaystyle\tilde{\bm{g}}={\bm{g}}+{\bm{z}},\;\;\text{where}\;\;{\bm{z}}\sim\mathcal{N}(0,\sigma^{2}S^{2}{\bm{I}}_{p\times p}). (1)

One can set proper variance σ2\sigma^{2} to make each update differentially private with parameter (ϵ,δ)(\epsilon,\delta). It is easy to see that the intensity of the added noise 𝔼[𝒛2]\mathbb{E}[\|{\bm{z}}\|^{2}] scales linearly with the model dimension pp. This indicates that as the model becomes larger, the useful signal, i.e., gradient, would be submerged in the added noise (see Figure 2). This dimensional barrier restricts the utility of deep learning models trained with gradient perturbation.
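To make Equation 1 concrete, below is a minimal sketch of a single perturbed update, assuming the per-sample gradients have already been clipped to L2 norm at most S; the function name and array shapes are illustrative rather than part of any released implementation.

```python
# A minimal sketch of the gradient perturbation in Equation 1 (illustrative names;
# per-sample gradients are assumed to be pre-clipped to L2 norm at most S).
import numpy as np

def perturb_gradient(per_sample_grads, S, sigma):
    # per_sample_grads: (n, p) array; each row has L2 norm <= S, so the sum has sensitivity S.
    g = per_sample_grads.sum(axis=0)
    z = np.random.normal(0.0, sigma * S, size=g.shape)  # z ~ N(0, sigma^2 S^2 I_{p x p})
    return (g + z) / per_sample_grads.shape[0]

# Note: E[||z||^2] = p * sigma^2 * S^2, i.e., the noise energy grows linearly with
# the model dimension p -- the dimensional barrier discussed above.
```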

Figure 1: Noise norm vs gradient norm of ResNet20 at initialization. The noise variance is chosen such that SGD satisfies (5,105)(5,10^{-5})-DP after 9090 epochs in Abadi et al. (2016).
Figure 2: Stable rank F2/2\|\cdot\|_{F}^{2}/\|\cdot\|^{2} (Tropp, 2015) of batch gradient matrix of given groups (with pp parameters). The setting is ResNet20 on CIFAR-10. The stable rank is small throughout training.

The dimensional barrier is attributed to the fact that the added noise is isotropic while the gradients live on a very low dimensional manifold, which has been observed in (Gur-Ari et al., 2018; Vogels et al., 2019; Gooneratne et al., 2020; Li et al., 2020) and is also verified in Figure 2 for the gradients of a 20-layer ResNet (He et al., 2016). Hence to limit the noise energy, it is natural to think

“Can we reduce the dimension of gradients first and then add the isotropic noise onto a low-dimensional gradient embedding?”

The answer is affirmative. We propose a new algorithm Gradient Embedding Perturbation (GEP), illustrated in Figure 3. Specifically, we first compute anchor gradients on some non-sensitive auxiliary data, and identify an anchor subspace that is spanned by several top principal components of the anchor gradient matrix. Then we project the private gradients into the anchor subspace and obtain low-dimensional gradient embeddings and small-norm residual gradients. Finally, we perturb the gradient embedding and residual gradient separately according to the sensitivities and privacy budget.

We intuitively argue why GEP could reduce the perturbation variance and achieve good utility for large models. First, because the gradient embedding has a very low dimension, the added isotropic noise on the embedding has small energy that scales linearly only with the subspace dimension. Second, if the anchor subspace covers most of the gradient information, the residual gradient, though high dimensional, should have small magnitude, which permits smaller added noise to guarantee the same level of privacy because of the reduced sensitivity. Overall, we can use a much smaller perturbation than the original gradient perturbation to guarantee the same level of privacy.

We emphasize several properties of GEP. First, the non-sensitive auxiliary data assumption is weak. In fact, GEP only requires a small number of non-sensitive unlabeled data following a similar feature distribution as the private data, which often exist even for learning on sensitive data. In our experiments, we use a few unlabeled samples from ImageNet to serve as auxiliary data for MNIST, SVHN, and CIFAR-10. This assumption is much weaker than the public data assumption in previous works (Papernot et al., 2017; 2018; Alon et al., 2019; Wang & Zhou, 2020), where the public data should follow exactly the same distribution as the private data. Second, GEP produces an unbiased estimator of the target gradient because it releases both the perturbed gradient embedding and the perturbed residual gradient, which turns out to be critical for good utility. Third, we use the power method to estimate the principal components of the anchor gradients, which is achievable with a few matrix multiplications. The fact that GEP is not sensitive to the choice of subspace dimension further allows a very efficient implementation.

Compared with existing works on differentially private machine learning, our contribution can be summarized as follows: (1) we propose a novel algorithm GEP that achieves good utility for large models with a modest differential privacy guarantee; (2) we show that GEP returns an unbiased estimator of the target private gradient with much lower perturbation variance than original gradient perturbation; (3) we demonstrate that GEP achieves state-of-the-art utility in differentially private learning on three benchmark datasets. Specifically, for ϵ=8\epsilon=8, GEP achieves 74.9%74.9\% test accuracy on CIFAR-10 with a ResNet20 model. To the best of our knowledge, GEP is the first algorithm that can achieve such utility when training deep models from scratch for a “single-digit" privacy budget (Abadi et al. (2016) achieve 73%73\% accuracy on CIFAR-10, but they need to pre-train the model on CIFAR-100).

Figure 3: Overview of the proposed GEP approach. 1) We estimate an anchor subspace on some non-sensitive data; 2) We project the private gradients into the anchor subspace, producing low-dimensional embeddings and residual gradients; 3) We perturb the gradient embedding and residual gradient separately to guarantee differential privacy. The auxiliary data are only required to share similar features as the private data. In our experiments, we use 20002000 images from ImageNet as auxiliary data for MNIST, SVHN, and CIFAR-10 datasets.

1.1 Related work

Existing works studying differentially private machine learning in the high-dimensional setting can be roughly categorized into two lines. The first treats the optimization of the machine learning objective as a whole mechanism and adds noise into this process. The second is based on the knowledge transfer of machine learning models: it trains a publishable, differentially private student model with private signals from teacher models. We review them one by one.

Differentially private convex optimization in the high-dimensional setting has been studied extensively over the years (Kifer et al., 2012; Thakurta & Smith, 2013; Talwar et al., 2015; Wang & Xu, 2019; Wang & Gu, 2019). Although these methods demonstrate good utility in some convex settings, their analyses cannot be directly applied to the non-convex setting. Right before the submission, we noted two independent and concurrent works (Zhou et al., 2020; Kairouz et al., 2020) that also leverage the gradient redundancy to reduce the added noise. Specifically, Kairouz et al. (2020) track historical gradients to do dimension reduction for private AdaGrad. Zhou et al. (2020) require gradients on some public data and then project the noisy gradients into a public subspace at each update. One core difference between these two works and GEP is that we introduce residual gradient perturbation, so GEP produces an unbiased estimator of the private gradients, which is essential for achieving the superior utility. Moreover, we weaken the auxiliary data assumption and introduce several designs that significantly boost the efficiency and applicability of GEP.

One recent progress towards training arbitrary models with differential privacy is Private Aggregation of Teacher Ensembles (PATE) (Papernot et al., 2017; 2018; Jordon et al., 2019). PATE first trains independent teacher models on disjoint shards of private data. Then it trains a student model with privacy guarantee by distilling noisy predictions of teacher models on some public samples. In comparison, GEP only requires some non-sensitive data that have similar natural features as the private data while PATE requires the public data follow exactly the same distribution as the private data and in practice it uses a portion of the test data to serve as public data. Moreover, GEP demonstrates better performance than PATE especially for complex datasets, e.g., CIFAR-10, because GEP can train the model with the whole private data rather than a small shard of data.

2 Preliminaries

We introduce some notations and definitions. We use bold lowercase letters, e.g., 𝒗{\bm{v}}, and bold capital letters, e.g., 𝑴{\bm{M}}, to denote vectors and matrices, respectively. The L2L^{2} norm of a vector 𝒗{\bm{v}} is denoted by 𝒗\|{\bm{v}}\|. The spectral norm and the Frobenius norm of a matrix 𝑴{\bm{M}} are denoted by 𝑴\|{\bm{M}}\| and 𝑴F\|{\bm{M}}\|_{F}, respectively. A sample d=(𝒙,y)d=({\bm{x}},y) consists of feature 𝒙{\bm{x}} and label yy. A dataset 𝔻{\mathbb{D}} is a collection of individual samples. A dataset 𝔻{\mathbb{D}}^{\prime} is said to be a neighboring dataset of 𝔻{\mathbb{D}} if they differ in a single sample, denoted as 𝔻𝔻{\mathbb{D}}\sim{\mathbb{D}}^{\prime}. Differential privacy ensures that the outputs of an algorithm on neighboring datasets have approximately indistinguishable distributions.

Definition 1 ((ϵ,δ)(\epsilon,\delta)-DP (Dwork et al., 2006a; b)).

A randomized mechanism \mathcal{M} guarantees (ϵ,δ)(\epsilon,\delta)-differential privacy if for any two neighboring input datasets 𝔻𝔻{\mathbb{D}}\sim{\mathbb{D}}^{{}^{\prime}} and for any subset of outputs SS it holds that Pr[(𝔻)S]eϵPr[(𝔻)S]+δ\text{Pr}[\mathcal{M}({\mathbb{D}})\in S]\leq e^{\epsilon}\text{Pr}[\mathcal{M}({\mathbb{D}}^{{}^{\prime}})\in S]+\delta.

By its definition, (ϵ,δ)(\epsilon,\delta)-DP controls the maximum influence that any individual sample can produce. One can adjust the privacy parameters to trade off between privacy and utility. Differential privacy is immune to post-processing (Dwork et al., 2014), i.e., any function applied on the output of a differentially private algorithm would not increase the privacy loss as long as it does not have new interaction with the private dataset. Differential privacy also allows composition, i.e., the composition of a series of differentially private mechanisms is also differentially private but with different parameters. Several variants of (ϵ,δ)(\epsilon,\delta)-DP have been proposed (Bun & Steinke, 2016; Dong et al., 2019) to address certain weakness of (ϵ,δ)(\epsilon,\delta)-DP, e.g., they achieve better composition property. In this work, we use Rényi differential privacy (Mironov, 2017) to track the privacy loss and then convert it to (ϵ,δ)(\epsilon,\delta)-DP.

Suppose that there is a private dataset 𝔻={(𝒙i,yi)}i=1n{\mathbb{D}}=\{({\bm{x}}_{i},y_{i})\}_{i=1}^{n} with nn samples. We want to train a model ff to learn the mapping in 𝔻{\mathbb{D}}. Specifically, ff takes 𝒙{\bm{x}} as input and outputs a label yy, and ff has parameter θp\theta\in{\mathbb{R}}^{p}. The training objective is to minimize an empirical risk 1ni=1n(f(𝒙i),yi)\frac{1}{n}\sum_{i=1}^{n}\ell(f({\bm{x}}_{i}),y_{i}), where (,)\ell(\cdot,\cdot) is a loss function. We further assume that there is an auxiliary dataset 𝔻(a)={(𝒙~j,y~j)}j=1m{\mathbb{D}}^{(a)}=\{(\tilde{{\bm{x}}}_{j},\tilde{y}_{j})\}_{j=1}^{m} such that 𝒙~\tilde{{\bm{x}}} shares similar features with 𝒙{\bm{x}} in 𝔻{\mathbb{D}} while y~\tilde{y} could be random.

3 Gradient embedding perturbation

An overview of GEP is given in Figure 3. GEP has three major ingredients: 1) first, estimate an anchor subspace that contains the principal components of some non-sensitive anchor gradients via the power method; 2) then, project private gradients into the anchor subspace to produce low-dimensional embeddings of private gradients and residual gradients; 3) finally, perturb the gradient embedding and the residual gradient separately to establish the differential privacy guarantee. In Section 3.1, we present the GEP algorithm in detail. In Section 3.2, we give an analysis of the residual gradients. In Section 3.3, we give a differentially private learning algorithm that updates the model with the output of GEP.

3.1 The GEP algorithm and its privacy analysis

The pseudocode of GEP is presented in Algorithm 1. For convenience, we write a set of gradients and a set of basis vectors as matrices with each row being one gradient/basis vector.

The anchor subspace is constructed as follows. We first compute the gradients of the model on an auxiliary dataset 𝔻(a){\mathbb{D}}^{(a)} with mm samples, which is referred to as the anchor gradients 𝑮(a)m×p{\bm{G}}^{(a)}\in{\mathbb{R}}^{m\times p}. We then use the power method to estimate the principal components of 𝑮(a){\bm{G}}^{(a)} to construct a subspace basis 𝑩k×p{\bm{B}}\in{\mathbb{R}}^{k\times p}, which is referred to as the anchor subspace. All these matrices are publishable because 𝔻(a){\mathbb{D}}^{(a)} is non-sensitive. We expect that the anchor subspace 𝑩{\bm{B}} can cover most energy of private gradients when the auxiliary data are not far from private data and m,km,k are reasonably large.

Suppose that the private gradients are 𝑮n×p{\bm{G}}\in{\mathbb{R}}^{n\times p}. Then, we project the private gradients into the anchor subspace 𝑩{\bm{B}}. The projection produces low-dimensional embeddings 𝑾=𝑮𝑩T{\bm{W}}={\bm{G}}{\bm{B}}^{T} and residual gradients 𝑹=𝑮𝑮𝑩T𝑩{\bm{R}}={\bm{G}}-{\bm{G}}{\bm{B}}^{T}{\bm{B}}. The magnitude of the residual gradients is usually much smaller than that of the original gradients, even when kk is small, because of the gradient redundancy.

Then, we aggregate the gradient embeddings and the residual gradients separately, and perturb the aggregated embedding and the aggregated residual gradient to guarantee differential privacy. Finally, we release the perturbed embedding and the perturbed residual gradient and construct an unbiased estimator of the private gradient: 𝒗~:=(𝒘~T𝑩+𝒓~)/n\tilde{{\bm{v}}}:=(\tilde{\bm{w}}^{T}{\bm{B}}+\tilde{\bm{r}})/n. This construction does not result in additional privacy loss because of DP’s post-processing property. The privacy analysis of the whole GEP procedure is given in Theorem 1.

Algorithm 1 Gradient embedding perturbation
1:  Input: anchor gradients 𝑮(a)m×p{\bm{G}}^{(a)}\in\mathbb{R}^{m\times p}; number of basis vectors kk; private gradients 𝑮n×p{\bm{G}}\in\mathbb{R}^{n\times p}; clipping thresholds S1,S2S_{1},S_{2}; standard deviations σ1,σ2\sigma_{1},\sigma_{2}; number of power iterations tt.
2:  //First stage: compute an orthonormal basis for the anchor subspace.
3:  Initialize 𝑩k×p{\bm{B}}\in\mathbb{R}^{k\times p} randomly.
4:  for i=1i=1 to tt do
5:     Compute 𝑨=𝑮(a)𝑩T{\bm{A}}={\bm{G}}^{(a)}{\bm{B}}^{T} and 𝑩=𝑨T𝑮(a){\bm{B}}={\bm{A}}^{T}{\bm{G}}^{(a)}.
6:     Orthogonalize 𝑩{\bm{B}} and normalize its row vectors.
7:  end for
8:  Delete 𝑮(a){\bm{G}}^{(a)} to free memory.
9:  //Second stage: project the private gradients 𝑮{\bm{G}} into the anchor subspace 𝑩{\bm{B}}.
10:  Compute gradient embeddings 𝑾=𝑮𝑩T{\bm{W}}={\bm{G}}{\bm{B}}^{T} and clip its rows with S1S_{1} to obtain 𝑾^\hat{\bm{W}}.
11:  Compute residual gradients 𝑹=𝑮𝑾𝑩{\bm{R}}={\bm{G}}-{\bm{W}}{\bm{B}} and clip its rows with S2S_{2} to obtain 𝑹^\hat{\bm{R}}.
12:  //Third stage: perturb the gradient embedding and the residual gradient separately.
13:  Perturb embedding with noise 𝒛(1)𝒩(0,σ12𝑰k×k){\bm{z}}^{(1)}\sim\mathcal{N}(0,\sigma_{1}^{2}{\bm{I}}_{k\times k}): 𝒘:=i𝑾^i,:,𝒘~:=𝒘+𝒛(1).\;{\bm{w}}:=\sum_{i}\hat{\bm{W}}_{i,:},\;\;\tilde{\bm{w}}:={\bm{w}}+{\bm{z}}^{(1)}.
14:  Perturb residual gradient with noise 𝒛(2)𝒩(0,σ22𝑰p×p){\bm{z}}^{(2)}\sim\mathcal{N}(0,\sigma_{2}^{2}{\bm{I}}_{p\times p}): 𝒓:=i𝑹^i,:,𝒓~:=𝒓+𝒛(2).\;{\bm{r}}:=\sum_{i}\hat{\bm{R}}_{i,:},\;\;\tilde{\bm{r}}:={\bm{r}}+{\bm{z}}^{(2)}.
15:  Return 𝒗~:=(𝒘~T𝑩+𝒓~)/n\tilde{{\bm{v}}}:=(\tilde{\bm{w}}^{T}{\bm{B}}+\tilde{\bm{r}})/n.
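For concreteness, below is a minimal numpy sketch of Algorithm 1. It assumes the anchor and private per-sample gradient matrices are already materialized as dense arrays and uses a QR factorization for the orthogonalization step; helper names such as gep_release and clip_rows are illustrative and not taken from the released implementation.

```python
# A minimal numpy sketch of Algorithm 1 (illustrative, not the released implementation).
import numpy as np

def clip_rows(M, threshold):
    # Rescale each row so that its L2 norm is at most `threshold`.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M * np.minimum(1.0, threshold / np.maximum(norms, 1e-12))

def gep_release(G_anchor, G_private, k, S1, S2, sigma1, sigma2, t=1, seed=0):
    rng = np.random.default_rng(seed)
    m, p = G_anchor.shape
    n = G_private.shape[0]

    # First stage: power iterations to estimate an orthonormal anchor basis B (k x p).
    B = rng.standard_normal((k, p))
    for _ in range(t):
        A = G_anchor @ B.T              # (m x k)
        B = A.T @ G_anchor              # (k x p)
        B = np.linalg.qr(B.T)[0].T      # orthonormalize the rows of B

    # Second stage: project private gradients into the anchor subspace.
    W = G_private @ B.T                 # embeddings (n x k)
    R = G_private - W @ B               # residual gradients (n x p)
    W_hat, R_hat = clip_rows(W, S1), clip_rows(R, S2)

    # Third stage: aggregate and perturb the two parts separately.
    w_tilde = W_hat.sum(axis=0) + rng.normal(0.0, sigma1, size=k)
    r_tilde = R_hat.sum(axis=0) + rng.normal(0.0, sigma2, size=p)

    # Post-processing: reconstruct the released gradient estimate.
    return (w_tilde @ B + r_tilde) / n
```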
Theorem 1.

Let S1S_{1} and S2S_{2} be the sensitivities of 𝒘{\bm{w}} and 𝒓{\bm{r}}, respectively. Then the output of Algorithm 1 satisfies (ϵ,δ)(\epsilon,\delta)-DP for any δ(0,1)\delta\in(0,1) and ϵ2log(1/δ)\epsilon\leq 2\log(1/\delta) if we choose σ12S12log(1/δ)/ϵ\sigma_{1}\geq 2S_{1}\sqrt{2\log(1/\delta)}/\epsilon and σ22S22log(1/δ)/ϵ\sigma_{2}\geq 2S_{2}\sqrt{2\log(1/\delta)}/\epsilon.

A common practice to control sensitivity is to clip the output with a pre-defined threshold. In our experiments, we use different thresholds S1S_{1} and S2S_{2} to clip the gradient embeddings and residual gradients, respectively. The privacy loss of GEP consists of two parts: the privacy loss incurred by releasing the perturbed embedding and the privacy loss incurred by releasing the perturbed residual gradient. We compose these two parts via the Rényi differential privacy and convert it to (ϵ,δ)(\epsilon,\delta)-DP.
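As an illustration of the closed-form calibration suggested by Theorem 1 (the experiments in Section 4 instead track the privacy loss with the Rényi-DP accountant of Mironov et al. (2019)), one could set the two noise levels as follows; the function name and example values are hypothetical.

```python
# A sketch of the noise calibration implied by Theorem 1 (illustrative only).
import math

def gep_sigmas(S1, S2, eps, delta):
    # Theorem 1 applies for eps <= 2 * log(1 / delta).
    assert eps <= 2 * math.log(1 / delta)
    scale = 2 * math.sqrt(2 * math.log(1 / delta)) / eps
    return S1 * scale, S2 * scale  # (sigma_1, sigma_2)

# Hypothetical example: gep_sigmas(S1=10.0, S2=2.0, eps=8, delta=1e-5)
```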

We highlight several implementation techniques that make GEP widely applicable and implementable with reasonable computational cost. Firstly, the auxiliary non-sensitive data do not have to come from the same source as the private data, and the auxiliary data can be randomly labeled. This non-sensitive data assumption is very weak and easy to satisfy in practical scenarios. To understand why random labels work, a quick example is that for the least squares regression problem the individual gradient is aligned with the feature vector, while the label only scales the length but does not change the direction. This auxiliary data assumption avoids conducting principal component analysis (PCA) on private gradients, which would require releasing private high-dimensional basis vectors and hence introduce large privacy loss. Secondly, we use the power method (Panju, 2011; Vogels et al., 2019) to approximately estimate the principal components. The only new operation we introduce is standard matrix multiplication, which enjoys efficient implementation on GPUs. The computational complexity of each power iteration is 2mkp2mkp, where pp is the number of model parameters, mm is the number of anchor gradients and kk is the number of subspace basis vectors. Thirdly, we divide the parameters into different groups and compute one orthonormal basis for each group, which further reduces the computational cost. For example, suppose the parameters are divided into two groups of sizes p1,p2p_{1},p_{2} and the numbers of basis vectors are k1,k2k_{1},k_{2}; then the computational complexity of each power iteration is 2m(k1p1+k2p2)2m(k_{1}p_{1}+k_{2}p_{2}), which is smaller than 2m(k1+k2)(p1+p2)2m(k_{1}+k_{2})(p_{1}+p_{2}). In Appendix B, we analyze the additional computational and memory costs of GEP compared to standard gradient perturbation.
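The saving from parameter grouping can be checked with a quick back-of-the-envelope computation; the group sizes and basis allocations below are hypothetical and only illustrate the comparison.

```python
# Cost of one power iteration with and without parameter grouping (hypothetical numbers).
m = 2000                      # number of anchor gradients
p1, p2 = 150_000, 120_000     # sizes of two parameter groups
k1, k2 = 600, 400             # basis vectors allocated to each group

grouped = 2 * m * (k1 * p1 + k2 * p2)           # one basis per group
ungrouped = 2 * m * (k1 + k2) * (p1 + p2)       # a single basis for all parameters
print(grouped, ungrouped, grouped / ungrouped)  # the grouped cost is strictly smaller
```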

Curious readers may wonder if we can use random projection to reduce the dimensionality as Johnson–Lindenstrauss Lemma (Dasgupta & Gupta, 2003) guarantees that one can preserve the pairwise distance between any two points after projecting into a random subspace of much lower dimension. However, preserving the pairwise distance is not sufficient for high quality gradient reconstruction, which is verified by the empirical observation in Appendix C.

3.2 An analysis on the residual gradients of GEP

Let 𝒈:=1ni𝑮i,:{\bm{g}}:=\frac{1}{n}\sum_{i}{\bm{G}}_{i,:} be the target private gradient. For a given anchor subspace 𝑩{\bm{B}}, the residual gradients are defined as 𝑹:=𝑮𝑮𝑩T𝑩{\bm{R}}:={\bm{G}}-{\bm{G}}{\bm{B}}^{T}{\bm{B}}. We then analyze how large the residual gradients could be. The following argument holds for all time steps and we ignore the time step index for simplicity.

For the ease of discussion, we introduce 𝝃i:=(𝑮i,:)T{\bm{\xi}}_{i}:=({\bm{G}}_{i,:})^{T} for i[n]i\in[n] to denote the private gradients and 𝝃^j:=(𝑮j,:(a))T\hat{{\bm{\xi}}}_{j}:=({\bm{G}}^{(a)}_{j,:})^{T} for j[m]j\in[m] to denote the anchor gradients. We use λk()\lambda_{k}(\cdot) to denote the kthk_{th} largest eigenvalue of a given matrix. We assume that the private gradients 𝝃1,,𝝃n{\bm{\xi}}_{1},...,{\bm{\xi}}_{n} and the anchor gradients 𝝃^1,,𝝃^m\hat{{\bm{\xi}}}_{1},...,\hat{{\bm{\xi}}}_{m} are sampled independently from a distribution 𝒫{\mathcal{P}}. We denote by 𝚺:=𝔼𝝃𝒫𝝃𝝃Tp×p{\bm{\Sigma}}:=\mathbb{E}_{{\bm{\xi}}\sim{\mathcal{P}}}{\bm{\xi}}{\bm{\xi}}^{T}\in{\mathbb{R}}^{p\times p} the (uncentered) population gradient covariance matrix. We also consider the (uncentered) empirical gradient covariance matrix 𝑺^:=1mi=1m𝝃^i𝝃^iT\hat{{\bm{S}}}:=\frac{1}{m}\sum_{i=1}^{m}\hat{{\bm{\xi}}}_{i}\hat{{\bm{\xi}}}_{i}^{T}.

One case is that the population gradient covariance matrix 𝚺{\bm{\Sigma}} has low rank kk. In this case, we can argue that the residual gradients are 0 once the number of anchor gradients satisfies m>km>k.

Lemma 3.1.

Assume that the population covariance matrix 𝚺{\bm{\Sigma}} is with rank kk and the distribution 𝒫{\mathcal{P}} satisfies (𝛏𝔽s)=0{\mathbb{P}}({\bm{\xi}}\in{\mathbb{F}}_{s})=0 for all ss-flats 𝔽s{\mathbb{F}}_{s} in p{\mathbb{R}}^{p} with 0s<k0\leq s<k. Let 𝚺=𝐕kΛ𝐕kT{\bm{\Sigma}}={\bm{V}}_{k}\Lambda{\bm{V}}_{k}^{T} and 𝐒^=𝐕^kΛ^𝐕^kT\hat{{\bm{S}}}=\hat{{\bm{V}}}_{k^{\prime}}\hat{\Lambda}\hat{{\bm{V}}}_{k^{\prime}}^{T} be the eigendecompositions of 𝚺{\bm{\Sigma}} and the empirical covariance matrix 𝐒^\hat{{\bm{S}}}, respectively, such that λk(𝐒^)>0\lambda_{k^{\prime}}(\hat{{\bm{S}}})>0 and λk+1(𝐒^)=0\lambda_{k^{\prime}+1}(\hat{{\bm{S}}})=0. Then if mkm\geq k, we have with probability 1,

k=kand𝑽k𝑽kT𝑽^k𝑽^kT2=0.\displaystyle k^{\prime}=k\quad\text{and}\quad\|{\bm{V}}_{k}{\bm{V}}_{k}^{T}-\hat{{\bm{V}}}_{k}\hat{{\bm{V}}}_{k}^{T}\|_{2}=0. (2)
Proof.

The proof is based on the non-singularity of covariance matrix. See Appendix D. ∎

We note that an ss-flat is the translate 𝔽s=𝒙+𝔽s(0){\mathbb{F}}_{s}={{\bm{x}}}+{\mathbb{F}}_{s(0)} of an ss-dimensional linear subspace 𝔽s(0){\mathbb{F}}_{s(0)} in p{\mathbb{R}}^{p}, and that the normal distribution satisfies this condition (Eaton & Perlman, 1973; Muirhead, 2009). Therefore, for the low-rank case of the population covariance matrix, the residual gradients are 0 once m>km>k. In the general case, we measure the expected norm of the residual gradients.

Lemma 3.2.

Assume that 𝛏𝒫{\bm{\xi}}\sim{\mathcal{P}} and 𝛏2<T\|{\bm{\xi}}\|^{2}<T almost surely. Let 𝚺=𝐕Λ𝐕T{\bm{\Sigma}}={\bm{V}}\Lambda{\bm{V}}^{T} be the eigendecomposition of the population covariance matrix 𝚺{\bm{\Sigma}}. Let 𝐒^=𝐕^kΛ^𝐕^kT\hat{{\bm{S}}}=\hat{{\bm{V}}}_{k}\hat{\Lambda}\hat{{\bm{V}}}_{k}^{T} be the eigendecomposition of the empirical covariance matrix 𝐒^\hat{{\bm{S}}}. Then we have with probability 12exp(δ)1-2\exp(-\delta),

𝔼𝝃Π𝑽^k(𝝃)22k>kλk(𝚺)+kC/m+T2δ/m,\displaystyle\mathbb{E}\|{\bm{\xi}}-\Pi_{\hat{{\bm{V}}}_{k}}({\bm{\xi}})\|_{2}^{2}\leq\sum_{k^{\prime}>k}\lambda_{k^{\prime}}({\bm{\Sigma}})+\sqrt{kC/m}+T\sqrt{2\delta/m}, (3)

where C=[𝔼𝛏4iλi2(𝚺)]+[1mj=1m𝛏^j4iλi2(𝐒^)]C=\left[\mathbb{E}\|{\bm{\xi}}\|^{4}-\sum_{i}\lambda_{i}^{2}({\bm{\Sigma}})\right]+\left[\frac{1}{m}\sum_{j=1}^{m}\|\hat{\bm{\xi}}_{j}\|^{4}-\sum_{i}\lambda_{i}^{2}(\hat{\bm{S}})\right], Π𝐕^k\Pi_{\hat{{\bm{V}}}_{k}} is a projection operator onto the subspace 𝐕^k\hat{{\bm{V}}}_{k} and the 𝔼\mathbb{E} is taken over the randomness of 𝛏𝒫{\bm{\xi}}\sim{\mathcal{P}}.

Proof.

The proof is an adaptation of Theorem 3.1 in Blanchard et al. (2007). ∎

From Lemma 3.2, we can see that the larger the number of anchor gradients and the dimension kk of the anchor subspace are, the smaller the residual gradients are. We can choose m,km,k properly such that the upper bound on the expected residual gradient norm is small. This indicates that we may use a smaller clipping threshold and consequently apply smaller noise while achieving the same privacy guarantee.

We next empirically examine the projection error 𝒓=i𝑹i,:{\bm{r}}=\sum_{i}{\bm{R}}_{i,:} by training a 2020-layer ResNet on the CIFAR10 dataset. We try two different types of auxiliary data to compute the anchor gradients: 1) samples from the same source as the private data with correct labels, i.e., 20002000 random samples from the test data; 2) samples from a different source with random labels, i.e., 20002000 random samples from ImageNet. The relation between the dimension of the anchor subspace kk and the projection error rate (1n𝒓/𝒈\left\lVert\frac{1}{n}{\bm{r}}\right\rVert/\left\lVert{\bm{g}}\right\rVert) is presented in Figure 4. We can see that the projection error is small and decreases with kk, and the benefit of increasing kk diminishes when kk is large, which is implied by Lemma 3.2. In practice one can only use a small or moderate kk because of the memory constraint: GEP needs to store at least kk individual gradients and each individual gradient consumes the same amount of memory as the model itself. Moreover, we can see that the projection into the anchor subspace of randomly labeled auxiliary data yields comparable projection error, corroborating our argument that unlabeled auxiliary data are sufficient for finding the anchor subspace.

We also verify that the residual gradients have little remaining redundancy by plotting the stable rank of the residual gradient matrix in Figure 5. The stable rank of the residual gradient matrix is an order of magnitude higher than the stable rank of the original gradient matrix. This implies that it could be hard to further approximate 𝑹{\bm{R}} with low-dimensional embeddings.
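For reference, the stable rank plotted in Figures 2 and 5 can be computed as below; this is a small sketch where G stands for a gradient matrix with one per-sample gradient per row.

```python
# Stable rank ||G||_F^2 / ||G||_2^2 of a gradient matrix G (one gradient per row).
import numpy as np

def stable_rank(G):
    fro_sq = np.linalg.norm(G, ord='fro') ** 2   # squared Frobenius norm
    spec_sq = np.linalg.norm(G, ord=2) ** 2      # squared spectral norm (largest singular value)
    return fro_sq / spec_sq
```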


Figure 4: Relative projection error (1n𝒓/𝒈\left\lVert\frac{1}{n}{\bm{r}}\right\rVert/\left\lVert{\bm{g}}\right\rVert) of the second stage in ResNet20. The number of anchor gradients is 20002000. The dimension of anchor subspace is kk. The learning rate is decayed by 1010 at epoch 30. The left plot uses random samples from ImageNet. The right plot uses random samples from test data. The benefit of increasing kk becomes smaller when kk is larger.


Figure 5: Stable rank of the residual gradient matrix versus original gradient matrix. The gradients are computed on full batch data for the first stage in ResNet20. The dimension of anchor subspace is k=1000k=1000.

We next compare GEP with a scheme that simply discards the residual gradients and only outputs the perturbed gradient embedding, i.e., the output is 𝒖~:=𝒘~T𝑩/n\tilde{{\bm{u}}}:=\tilde{\bm{w}}^{T}{\bm{B}}/n.

Remark 1.

Let 𝐮~:=𝐰~T𝐁/n\tilde{\bm{u}}:=\tilde{\bm{w}}^{T}{\bm{B}}/n be the reconstructed gradient from the noisy gradient embedding and 𝐯~\tilde{{\bm{v}}} be the output of GEP. Ignoring the effect of gradient clipping, we have

𝔼[𝒖~]=𝒈𝒓/n,𝔼[𝒗~]=𝒈.\displaystyle\mathbb{E}[{\tilde{{\bm{u}}}}]={\bm{g}}-{\bm{r}}/n,\quad\mathbb{E}[{\tilde{{\bm{v}}}}]={\bm{g}}. (4)

where 𝐫=i𝐑i,:{\bm{r}}=\sum_{i}{\bm{R}}_{i,:} is the aggregated residual gradients, 𝐰~,𝐁\tilde{\bm{w}},{\bm{B}} are given in Algorithm 1 and the expectation is over the added random noises.

This indicates that 𝒖~{\tilde{{\bm{u}}}} contains a systematic error that makes 𝒖~{\tilde{{\bm{u}}}} always deviate from 𝒈{\bm{g}} by the residual gradient. This systematic error is the projection error, which is plotted in Figure 4. The systematic error cannot be mitigated by reducing the noise magnitude (e.g., by increasing the privacy budget or collecting more private data). We refer to the algorithm releasing 𝒖~{\tilde{{\bm{u}}}} directly as Biased-GEP, or B-GEP for short, which can be viewed as an efficient implementation of the algorithm in (Zhou et al., 2020). In our experiments, B-GEP can outperform standard gradient perturbation when kk is large but is inferior to GEP. We note that the above remark ignores the clipping effect (or assumes a large clipping threshold). In practice, we do clip the individual gradients at each time step, which makes the expectations in Remark 1 obscure (Chen et al., 2020b); hence the claim that 𝒗~{\tilde{{\bm{v}}}} is an unbiased estimator of 𝒈{\bm{g}} is not entirely precise when gradient clipping is applied.
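A small synthetic Monte Carlo check of Remark 1 (clipping disabled, hypothetical sizes and unit noise) illustrates the difference between the two estimators:

```python
# Monte Carlo check of Remark 1 on synthetic gradients (clipping disabled).
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 200, 10
G = rng.standard_normal((n, p))                      # synthetic private gradients
B = np.linalg.qr(rng.standard_normal((p, k)))[0].T   # orthonormal anchor basis (k x p)

g = G.mean(axis=0)                                   # target gradient
w = (G @ B.T).sum(axis=0)                            # aggregated embedding
r = (G - G @ B.T @ B).sum(axis=0)                    # aggregated residual

u_avg = np.mean([(w + rng.normal(0, 1, k)) @ B / n for _ in range(20000)], axis=0)
v_avg = np.mean([((w + rng.normal(0, 1, k)) @ B + (r + rng.normal(0, 1, p))) / n
                 for _ in range(20000)], axis=0)

print(np.allclose(u_avg, g - r / n, atol=1e-2))      # True: B-GEP is biased by the residual
print(np.allclose(v_avg, g, atol=1e-2))              # True: GEP recovers g in expectation
```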

3.3 Private learning with gradient embedding perturbation

GEP (Algorithm 1) describes how to release one-step gradient with privacy guarantee. In this section, we compose the privacy losses at each step to establish the privacy guarantee for the whole learning process. The differentially private learning process with GEP is given in Algorithm 2 and the privacy analysis is presented in Theorem 2.

Algorithm 2 Differentially private gradient descent with GEP
1:  Input: private dataset 𝔻{\mathbb{D}}; auxiliary dataset 𝔻(a){\mathbb{D}}^{(a)}; number of updates TT; learning rate η\eta; configuration of GEP {\mathbb{C}}; loss function \ell.
2:  Output: differentially private model 𝜽T\boldsymbol{\theta}_{T}.
3:  for t=0t=0 to T1T-1 do
4:     Compute the private gradients 𝑮t{\bm{G}}_{t} and anchor gradients 𝑮t(a){\bm{G}}^{(a)}_{t} of the loss with respect to 𝜽t\boldsymbol{\theta}_{t}.
5:     Call GEP with 𝑮t,𝑮t(a){\bm{G}}_{t},{\bm{G}}^{(a)}_{t} and configuration {\mathbb{C}} to get 𝒗~t\tilde{\bm{v}}_{t}.
6:     Update the model: 𝜽t+1=𝜽tη𝒗~t\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta\tilde{\bm{v}}_{t}.
7:  end for
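A minimal sketch of this loop is given below; it reuses the hypothetical gep_release helper sketched after Algorithm 1, and the per-sample gradient computation is framework-specific and only stubbed.

```python
# A minimal sketch of Algorithm 2 (per_sample_gradients is a framework-specific stub).
def private_training(theta, private_batches, anchor_data, T, eta, cfg):
    for t in range(T):
        G = per_sample_gradients(theta, next(private_batches))   # private gradients (n x p), stub
        G_anchor = per_sample_gradients(theta, anchor_data)       # anchor gradients (m x p), stub
        v_tilde = gep_release(G_anchor, G, **cfg)                  # noisy gradient from Algorithm 1
        theta = theta - eta * v_tilde                              # gradient descent update
    return theta
```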
Theorem 2.

For any ϵ<2log(1/δ)\epsilon<2\log(1/\delta) and δ(0,1)\delta\in(0,1), the output of Algorithm 2 satisfies (ϵ,δ)(\epsilon,\delta)-DP if we set σ22Tlog(1/δ)/ϵ\sigma\geq 2\sqrt{2T\log(1/\delta)}/\epsilon.

If the private gradients are randomly sampled from the full batch gradients, the privacy guarantee can be strengthened via the privacy amplification by subsampling theorem of DP (Balle et al., 2018; Wang et al., 2019; Zhu & Wang, 2019; Mironov et al., 2019). Theorem 3 gives the expected excess error of Algorithm 2. The expected excess error measures the distance between the algorithm’s output and the optimal solution in expectation.

Theorem 3.

Suppose the loss L(𝜽)=1n(𝒙,y)𝔻(f𝜽(𝒙),y)L(\boldsymbol{\theta})=\frac{1}{n}\sum_{({\bm{x}},y)\in{\mathbb{D}}}\ell(f_{\boldsymbol{\theta}}({\bm{x}}),y) is 1-Lipschitz, convex, and β\beta-smooth. If η=1β\eta=\frac{1}{\beta}, T=nβϵpT=\frac{n\beta\epsilon}{\sqrt{p}}, and 𝜽¯=1Tt=1T𝜽t\boldsymbol{\bar{\theta}}=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{\theta}_{t}, then we have 𝔼[L(𝜽¯)]L(𝜽)𝒪(klog(1/δ)nϵ+r¯plog(1/δ)nϵ)\mathbb{E}[L(\boldsymbol{\bar{\theta}})]-L(\boldsymbol{\theta_{*}})\leq\mathcal{O}\left(\frac{\sqrt{k\log(1/\delta)}}{n\epsilon}+\frac{\bar{r}\sqrt{p\log(1/\delta)}}{n\epsilon}\right), where r¯=1Tt=0T1rt2\bar{r}=\frac{1}{T}\sum_{t=0}^{T-1}r^{2}_{t} and rt=maxi(𝑹t)i,:r_{t}=\max_{i}\left\lVert({\bm{R}}_{t})_{i,:}\right\rVert is the sensitivity of the residual gradient at step tt.

The r¯\bar{r} term represents the average projection error over the training process. The previous best expected excess error for gradient perturbation is 𝒪(plog(1/δ)/(nϵ))\mathcal{O}(\sqrt{p\log(1/\delta)}/(n\epsilon)) (Wang et al., 2017). As shown in Lemma 3.1, if the gradients lie in a kk-dimensional subspace over the training process, r¯=0\bar{r}=0 and the excess error is 𝒪(klog(1/δ)/(nϵ))\mathcal{O}(\sqrt{k\log(1/\delta)}/(n\epsilon)), independent of the ambient dimension pp of the problem. When the gradients are in general position, i.e., the gradient matrix is not exactly low-rank, Lemma 3.2 and the empirical results give a hint on how small the residual gradients could be. However, it is hard to get a good bound on maxi(𝑹t)i,:\max_{i}\|({\bm{R}}_{t})_{i,:}\| and the bound in Theorem 3 does not explicitly improve over the previous result. One possible solution is to use a clipping threshold based on the expected residual gradient norm. Then the output gradient becomes biased because of clipping, and the utility and privacy guarantees in Theorems 3 and 2 would require new, more elaborate derivations. We leave this for future work.

4 Experiments

We conduct experiments on the MNIST, extended SVHN, and CIFAR-10 datasets. Our implementation is publicly available at https://github.com/dayu11/Gradient-Embedding-Perturbation. The model for MNIST has two convolutional layers with max-pooling and one fully connected layer. The model for SVHN and CIFAR-10 is ResNet20 in He et al. (2016). We replace all batch normalization (Ioffe & Szegedy, 2015) layers with group normalization (Wu & He, 2018) layers because batch normalization mixes the representations of different samples, which prevents the privacy loss from being analyzed accurately. The non-private accuracy for MNIST, SVHN, and CIFAR-10 is 99.1%, 95.9%, and 90.4%, respectively.

We also provide experiments with pre-trained models in Appendix A. Tramèr & Boneh (2020) show that a differentially private linear classifier can achieve high accuracy using the features produced by pre-trained models. We examine whether GEP can improve the performance of such private linear classifiers. Notably, using the features produced by a model pre-trained on unlabeled ImageNet, GEP achieves 94.8% validation accuracy on CIFAR10 with ϵ=2\epsilon=2.

Evaluated algorithms We use the algorithm in Abadi et al. (2016) as the benchmark gradient perturbation approach, referred to as “GP”. We also compare GEP with PATE (Papernot et al., 2017). We run the experiments for PATE using the official implementation. The privacy parameter ϵ\epsilon of PATE is data-dependent and hence cannot be released directly (see Section 3.3 in Papernot et al. (2017)). Nonetheless, we report the results of PATE for completeness.

Implementation details At each step, GEP needs to release two vectors: the noisy gradient embedding and the noisy residual gradient. The gradient embeddings have a sensitivity of S1S_{1} and the residual gradients have a sensitivity of S2S_{2} because of the clipping. The output of GEP can be constructed as follows: (1) normalize the gradient embeddings and residual gradients by 1/S11/S_{1} and 1/S21/S_{2}, respectively, (2) concatenate the rescaled vectors, (3) release the concatenated vector via the Gaussian mechanism with sensitivity 2\sqrt{2}, (4) rescale the two components by S1S_{1} and S2S_{2}. B-GEP only needs to release the normalized noisy gradient embedding. We use the numerical tool in Mironov et al. (2019) to compute the privacy loss. For a given privacy budget and sampling probability, σ\sigma is set to the smallest value that allows running the desired number of epochs within the privacy budget.
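A short sketch of this release, written on the already aggregated and clipped vectors (an equivalent formulation of the per-vector normalization described above); sigma denotes the noise multiplier returned by the accountant, and the function name is illustrative.

```python
# Joint release of the embedding and the residual via one Gaussian mechanism (sketch).
import numpy as np

def release_concatenated(w, r, S1, S2, sigma):
    # w: aggregated clipped embedding (sensitivity S1); r: aggregated clipped residual (sensitivity S2).
    v = np.concatenate([w / S1, r / S2])                  # joint sensitivity becomes sqrt(2)
    v_noisy = v + np.random.normal(0.0, sigma * np.sqrt(2), size=v.shape)
    w_tilde = v_noisy[:len(w)] * S1                       # rescale the two components back
    r_tilde = v_noisy[len(w):] * S2
    return w_tilde, r_tilde
```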

All experiments are run on a single Tesla V100 GPU with 16G memory. For ResNet20, the parameters are divided into five groups: input layer, output layer, and three intermediate stages. For a given quota of basis vectors, we allocate it to each group according to the square root of the number of parameters in each group. We compute an orthonormal subspace basis on each group separately. Then we concatenate the projections of all groups to construct gradient embeddings. The number of power iterations tt is set to 11, as empirical evaluations suggest that more iterations do not improve the performance of GEP or B-GEP.
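The square-root allocation rule can be sketched as follows; the helper name and the example group sizes are hypothetical.

```python
# Allocate a quota of basis vectors across parameter groups proportionally to sqrt(#parameters).
import numpy as np

def allocate_basis(total_k, group_sizes):
    weights = np.sqrt(np.asarray(group_sizes, dtype=float))
    alloc = np.floor(total_k * weights / weights.sum()).astype(int)
    alloc[0] += total_k - alloc.sum()   # hand any rounding remainder to the first group
    return alloc

# Hypothetical example with five groups (input layer, three stages, output layer):
# allocate_basis(1000, [450, 22_000, 86_000, 160_000, 650])
```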

For all datasets, the anchor gradients are computed on 20002000 random samples from ImageNet. In Appendix C, we examine the influence of choosing different numbers of anchor gradients and different sources of auxiliary data. The selected images are downsampled to size 32×3232\times 32 (28×2828\times 28 for MNIST) and we label them randomly at each update. For SVHN and CIFAR-10, kk is chosen from [500,1000,1500,2000][500,1000,1500,2000]. For MNIST, we halve the size of kk. We use SGD with momentum 0.9 as the optimizer. The initial learning rate and batch size are 0.10.1 and 10001000, respectively. The learning rate is divided by 1010 in the middle of training. Weight decay is set to 1×1041\times 10^{-4}. The clipping threshold is 1010 for original gradients and 22 for residual gradients. The number of training epochs for CIFAR-10 and MNIST is 50, 100, 200 for privacy parameter ϵ=2,5,8\epsilon=2,5,8, respectively. The number of training epochs for SVHN is 5, 10, 20 for privacy parameter ϵ=2,5,8\epsilon=2,5,8, respectively. The privacy parameter δ\delta is 1×1061\times 10^{-6} for SVHN and 1×1051\times 10^{-5} for CIFAR-10 and MNIST.

Results The best accuracy for each given ϵ\epsilon is reported in Table 1. For all datasets, GEP achieves considerable improvement over GP in Abadi et al. (2016). Specifically, GEP achieves 74.9%74.9\% test accuracy on CIFAR-10 with (8,105)(8,10^{-5})-DP, outperforming GP by 18.5%18.5\%. PATE achieves the best accuracy on MNIST but its performance drops as the dataset becomes more complex.

We also plot the relation between accuracy and kk in Figure 6. GEP is less sensitive to the choice of kk and outperforms B-GEP for all choices of kk. The improvement of increasing kk becomes smaller as kk becomes larger. We note that the memory cost of choosing large kk is high because we need to store at least kk individual gradients to compute anchor subspace.

Table 1: Test accuracy (in %) with varying choices of privacy bound ϵ\epsilon. The numbers under symbol Δ\Delta denote the improvement over GP baseline.
Dataset Algorithm ϵ=2\epsilon=2 Δ\Delta ϵ=5\epsilon=5 Δ\Delta ϵ=8\epsilon=8 Δ\Delta
MNIST GP 94.7 +0.0 96.8 +0.0 97.2 +0.0
PATE 98.5 +3.8 98.5 +1.7 98.6 +1.4
B-GEP 93.1 -1.6 94.5 -2.3 95.9 -1.3
GEP 96.3 +1.6 97.9 +1.1 98.4 +1.2
SVHN GP 87.1 +0.0 91.3 +0.0 91.6 +0.0
PATE 80.7 -6.4 91.6 +0.3 91.6 +0.0
B-GEP 88.5 +1.4 91.8 +0.5 92.3 +0.7
GEP 92.3 +5.2 94.7 +3.4 95.1 +3.5
CIFAR-10 GP 43.6 +0.0 52.2 +0.0 56.4* +0.0
PATE 34.2 -9.4 41.9 -10.3 43.6 -12.8
B-GEP 50.3 +6.7 59.5 +7.3 63.0 +6.6
GEP 59.7 +16.1 70.1 +17.9 74.9 +18.5
*The test accuracy of DP-SGD can be improved to \sim 62% by tuning the hyperparameters. See the implementation in https://github.com/dayu11/Differentially-Private-Deep-Learning.
Figure 6: Test accuracy when varying the dimension of anchor subspace. GEP significantly outperforms B-GEP for all kk. Moreover, the performance of GEP is not that sensitive to kk.

5 Conclusion

In this paper, we propose Gradient Embedding Perturbation (GEP) for learning with differential privacy. GEP leverages the gradient redundancy to reduce the added noise and outputs an unbiased estimator of the target gradient. Several key designs significantly boost the efficiency and applicability of GEP. Extensive experiments on real-world datasets demonstrate the superior utility of GEP.

References

  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
  • Alon et al. (2019) Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. In Advances in Neural Information Processing Systems, 2019.
  • Balle et al. (2018) Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, 2018.
  • Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. Annual Symposium on Foundations of Computer Science, 2014.
  • Bernau et al. (2019) Daniel Bernau, Philip-William Grassal, Jonas Robl, and Florian Kerschbaum. Assessing differentially private deep learning with membership inference. arXiv preprint arXiv:1912.11328, 2019.
  • Blanchard et al. (2007) Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259–294, 2007.
  • Bun & Steinke (2016) Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, 2016.
  • Carlini et al. (2019) Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, 2019.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020a.
  • Chen et al. (2020b) Xiangyi Chen, Steven Z Wu, and Mingyi Hong. Understanding gradient clipping in private sgd: A geometric perspective. Advances in Neural Information Processing Systems, 33, 2020b.
  • Dasgupta & Gupta (2003) Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 2003.
  • Dong et al. (2019) Jinshuo Dong, Aaron Roth, and Weijie J Su. Gaussian differential privacy. arXiv preprint arXiv:1905.02383, 2019.
  • Dwork et al. (2006a) Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2006a.
  • Dwork et al. (2006b) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, 2006b.
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
  • Eaton & Perlman (1973) Morris L Eaton and Michael D Perlman. The non-singularity of generalized sample covariance matrices. The Annals of Statistics, pp.  710–717, 1973.
  • Fredrikson et al. (2015) Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In ACM SIGSAC Conference on Computer and Communications Security, 2015.
  • Fukuchi et al. (2017) Kazuto Fukuchi, Quang Khai Tran, and Jun Sakuma. Differentially private empirical risk minimization with input perturbation. In International Conference on Discovery Science, 2017.
  • Gooneratne et al. (2020) Mary Gooneratne, Khe Chai Sim, Petr Zadrazil, Andreas Kabel, Françoise Beaufays, and Giovanni Motta. Low-rank gradient approximation for memory-efficient on-device training of deep neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
  • Gur-Ari et al. (2018) Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition, 2020.
  • Hitaj et al. (2017) Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
  • Iyengar et al. (2019) Roger Iyengar, Joseph P Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In IEEE Symposium on Security and Privacy, 2019.
  • Jayaraman et al. (2018) Bargav Jayaraman, Lingxiao Wang, David Evans, and Quanquan Gu. Distributed learning without distress: Privacy-preserving empirical risk minimization. In Advances in Neural Information Processing Systems, 2018.
  • Jordon et al. (2019) James Jordon, Jinsung Yoon, and Mihaela van der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2019.
  • Kairouz et al. (2020) Peter Kairouz, Mónica Ribero, Keith Rush, and Abhradeep Thakurta. Dimension independence in unconstrained private erm via adaptive preconditioning. arXiv preprint arXiv:2008.06570, 2020.
  • Kifer et al. (2012) Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, 2012.
  • Lee & Kifer (2018) Jaewoo Lee and Daniel Kifer. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
  • Li et al. (2020) Xinyan Li, Qilong Gu, Yingxue Zhou, Tiancong Chen, and Arindam Banerjee. Hessian based analysis of sgd for deep nets: Dynamics and generalization. In SIAM International Conference on Data Mining, 2020.
  • Ma et al. (2019) Yuzhe Ma, Xiaojin Zhu, and Justin Hsu. Data poisoning against differentially-private learners: attacks and defenses. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp.  4732–4738. AAAI Press, 2019.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In IEEE Computer Security Foundations Symposium, 2017.
  • Mironov et al. (2019) Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism. arXiv, 2019.
  • Muirhead (2009) Robb J Muirhead. Aspects of multivariate statistical theory, volume 197. John Wiley & Sons, 2009.
  • Panju (2011) Maysum Panju. Iterative methods for computing eigenvalues and eigenvectors. arXiv preprint arXiv:1105.1185, 2011.
  • Papernot et al. (2017) Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations, 2017.
  • Papernot et al. (2018) Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with pate. In International Conference on Learning Representations, 2018.
  • Phan et al. (2020) NhatHai Phan, My T Thai, Han Hu, Ruoming Jin, Tong Sun, and Dejing Dou. Scalable differential privacy with certified robustness in adversarial learning. International Conference on Machine Learning, 2020.
  • Rahman et al. (2018) Md Atiqur Rahman, Tanzila Rahman, Robert Laganiere, Noman Mohammed, and Yang Wang. Membership inference attack against differentially private deep learning model. Transactions on Data Privacy, 2018.
  • Sablayrolles et al. (2019) Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, and Hervé Jégou. White-box vs black-box: Bayes optimal strategies for membership inference. International Conference on Machine Learning, 2019.
  • Shokri & Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In ACM SIGSAC conference on computer and communications security, 2015.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP), 2017.
  • Song et al. (2013) Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013.
  • Talwar et al. (2015) Kunal Talwar, Abhradeep Guha Thakurta, and Li Zhang. Nearly optimal private lasso. In Advances in Neural Information Processing Systems, 2015.
  • Thakurta & Smith (2013) Abhradeep Guha Thakurta and Adam Smith. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory, 2013.
  • Tramèr & Boneh (2020) Florian Tramèr and Dan Boneh. Differentially private learning needs better features (or much more data). arXiv preprint arXiv:2011.11660, 2020.
  • Vogels et al. (2019) Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.
  • Wang & Xu (2019) Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, 2019.
  • Wang et al. (2017) Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, 2017.
  • Wang & Zhou (2020) Jun Wang and Zhi-Hua Zhou. Differentially private learning with small public data. In AAAI, 2020.
  • Wang & Gu (2019) Lingxiao Wang and Quanquan Gu. Differentially private iterative gradient hard thresholding for sparse learning. In International Joint Conference on Artificial Intelligence, 2019.
  • Wang et al. (2019) Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics, 2019.
  • Wu et al. (2016) Xi Wu, Matthew Fredrikson, Somesh Jha, and Jeffrey F Naughton. A methodology for formalizing model-inversion attacks. In IEEE Computer Security Foundations Symposium, 2016.
  • Wu et al. (2017) Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In ACM International Conference on Management of Data, 2017.
  • Wu & He (2018) Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), 2018.
  • Yu et al. (2020) Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Gradient perturbation is underrated for differentially private convex optimization. In Proc. of 29th Int. Joint Conf. Artificial Intelligence, 2020.
  • Yu et al. (2021) Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. How does data augmentation affect privacy in machine learning? In Proc. of the AAAI Conference on Artificial Intelligence, 2021.
  • Zhou et al. (2020) Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private sgd with gradient subspace identification. arXiv preprint arXiv:2007.03813, 2020.
  • Zhu et al. (2019) Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, 2019.
  • Zhu & Wang (2019) Yuqing Zhu and Yu-Xiang Wang. Poission subsampled rényi differential privacy. In International Conference on Machine Learning, 2019.

Appendix A Experiments with pre-trained models

Recent works have shown that pre-training models on unlabeled data can be beneficial for subsequent learning tasks (Chen et al., 2020a; He et al., 2020). Tramèr & Boneh (2020) demonstrate that a differentially private linear classifier can achieve high accuracy using the features produced by such pre-trained models. We show that GEP can also benefit from such pre-trained models.

Inspired by Tramèr & Boneh (2020), we use the output of the penultimate layer of a pre-trained ResNet152 model as features to train a private linear classifier. The ResNet152 model is pre-trained on unlabeled ImageNet using SimCLR (Chen et al., 2020a). The feature dimension is 4096.

Implementation Details We choose the privacy parameter ϵ\epsilon from [0.1,0.5,1,2][0.1,0.5,1,2]. The privacy parameter δ\delta is 1×1051\times 10^{-5}. We run all experiments 5 times and report the average accuracy. The clipping threshold of residual gradients is still one-fifth of the clipping threshold of the original gradients. The dimension of the anchor subspace is set as 200p200\simeq\sqrt{p}, where p=40960p=40960 is the model dimension. We randomly sample 500500 samples from the test set as auxiliary data and evaluate performance on the remaining test samples. The optimizer is Adam with default momentum coefficients. Other hyper-parameters are listed in Table 2.

Hyperparameter Values
Learning rate 0.01, 0.05, 0.1
Running steps 50, 100, 400
Clipping threshold 0.01, 0.1, 1
Table 2: Hyperparameter values used in Appendix A.

Results The experiment results are shown in Table 3. GEP outperforms GP for all values of ϵ\epsilon. With privacy bound ϵ=2\epsilon=2, GEP achieves 94.8% validation accuracy on the CIFAR10 dataset, improving over the GP baseline by 1.4%. For a very strong privacy guarantee (ϵ=0.1\epsilon=0.1), B-GEP performs on par with GEP because a strong privacy guarantee requires large noise, and the useful signal in the residual gradient is submerged in the added noise. B-GEP benefits less from larger ϵ\epsilon compared to GP or GEP. For ϵ=1\epsilon=1 and 22, the performance of B-GEP is worse than that of GP. This is because a larger ϵ\epsilon cannot reduce the systematic error of B-GEP (see Remark 1 in Section 3.2).

Table 3: Validation accuracy (in %) on CIFAR10 with varying choices of ϵ\epsilon. We train a private linear model on top of the features from a ResNet152 model, which is pre-trained on unlabeled ImageNet.
ϵ=0.1\epsilon=0.1 ϵ=0.5\epsilon=0.5 ϵ=1\epsilon=1 ϵ=2\epsilon=2
Non private 96.3 96.3 96.3 96.3
GP 88.2 (±\pm0.16) 91.1 (±\pm0.17) 93.2 (±\pm0.19) 93.4 (±\pm0.12)
B-GEP 91.0 (±\pm0.07) 92.9 (±\pm0.03) 93.1 (±\pm0.10) 93.2 (±\pm0.08)
GEP 90.9 (±\pm0.19) 93.5 (±\pm0.06) 94.3 (±\pm0.09) 94.8 (±\pm0.06)

Appendix B Complexity Analysis

We provide an analysis of the computational and memory costs of the construction of anchor subspace. The computation of the anchor subspace is the dominant additional cost of GEP compared to conventional gradient perturbation. Notations: kk, mm, nn, and pp are the dimension of anchor subspace, number of anchor gradients, number of private gradients, and the model dimension, respectively. In order to reduce the computational and memory costs, we divide the parameters into gg groups and compute one orthonormal basis for each group. We refer to this approach as ‘parameter grouping’. In this section, we assume the parameters and the dimension of the anchor subspace are both divided evenly. Table 4 summarizes the additional costs of GEP with/without parameter grouping. Using parameter grouping can reduce the computational/memory cost significantly.

Table 4: Computational and memory costs of a single power iteration in Algorithm 1. The computational cost is measured by the number of floating-point operations; the memory cost is measured by the number of floating-point numbers that need to be stored. 'GEP+PG' denotes GEP with parameter grouping and g denotes the number of groups. Notation: k, m, n, and p are the dimension of the anchor subspace, the number of anchor gradients, the number of private gradients, and the model dimension, respectively.
Method Computational Cost Memory Cost
GEP 2mkp+pk^{2} \max\left(0,\left(m-n+k\right)p+mk\right)
GEP+PG 2mkp/g+pk^{2}/g^{2} \max\left(0,\left(m-n+\frac{k}{g}\right)p+mk\right)
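To make the entries of Table 4 concrete, here is a minimal NumPy sketch of the subspace construction: one step of power (orthogonal) iteration on the anchor gradient matrix, with and without parameter grouping. The function and variable names are ours, chosen for illustration.

```python
# Sketch of the anchor-subspace construction costed in Table 4 (illustrative).
# A: (m, p) anchor gradients; k: subspace dimension; g: number of parameter groups.
import numpy as np

def power_iteration_basis(A, k, iters=1, seed=0):
    """Approximate the top-k row space of A with orthogonal iteration."""
    rng = np.random.default_rng(seed)
    m, p = A.shape
    B = rng.standard_normal((p, k))
    for _ in range(iters):
        B = A.T @ (A @ B)          # roughly 2*m*k*p flops per iteration
        B, _ = np.linalg.qr(B)     # roughly p*k^2 flops
    return B.T                     # (k, p) basis with orthonormal rows

def grouped_bases(A, k, g, iters=1):
    """Parameter grouping: k/g basis vectors for each of g parameter groups."""
    groups = np.array_split(np.arange(A.shape[1]), g)
    return [power_iteration_basis(A[:, idx], k // g, iters) for idx in groups]
```

With grouping, each of the g groups costs roughly 2m(k/g)(p/g) + (p/g)(k/g)^2 flops, so summing over groups gives the 2mkp/g + pk^2/g^2 entry in Table 4.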

Appendix C Ablation Study

The influence of choosing different auxiliary datasets. We conduct experiments with different choices of auxiliary datasets. For CIFAR10, we try 2000 random test samples from CIFAR10, 2000 random samples from CIFAR100, and 2000 random samples from ImageNet. When the auxiliary dataset is CIFAR10, we try both correct labels and random labels. For all choices of auxiliary datasets, the test accuracy is evaluated on the 8000 test samples of CIFAR10 that are not used as auxiliary data. Other implementation details are the same as in Section 4. The results are shown in Table 5. Surprisingly, using samples from CIFAR10 with correct labels yields the worst accuracy. This may be because the model 'overfits' the auxiliary data when it has access to the correct labels, which makes the anchor subspace contain less information about the private gradients. The best accuracy is achieved using samples from CIFAR10 with random labels; this makes sense because in this case the features of the auxiliary data and the private data have the same distribution. Using samples from CIFAR100 or ImageNet as auxiliary data has only a small influence on the test accuracy.

Table 5: Test accuracy on CIFAR10 with different choices of auxiliary datasets. The privacy guarantee is (8,10^{-5})-DP. We report the average accuracy of five runs with standard deviations in brackets.
Auxiliary Data Random Label? Test Accuracy
CIFAR10 No 72.9 (\pm0.31)
CIFAR10 Yes 75.1 (\pm0.42)
CIFAR100 Yes 74.7 (\pm0.46)
ImageNet Yes 74.8 (\pm0.39)

The influence of the number of anchor gradients. In the main text, the size of the auxiliary dataset is m=2000. We conduct additional experiments with different sizes of the auxiliary dataset to examine the influence of m. The auxiliary data are randomly sampled from ImageNet. Table 6 reports the test accuracy on CIFAR10 for different choices of m. For both B-GEP and GEP, increasing m leads to slightly improved performance.

Table 6: Test accuracy on CIFAR10 with different sizes of the auxiliary dataset. The privacy guarantee is (8,10^{-5})-DP. We report the average accuracy of five runs with standard deviations in brackets.
Algorithm m=1000 m=2000 m=4000
B-GEP 62.2 (\pm0.26) 62.6 (\pm0.24) 63.3 (\pm0.27)
GEP 74.6 (\pm0.41) 74.8 (\pm0.39) 75.2 (\pm0.34)

The projection error of random basis vectors. It is tempting to construct the anchor subspace from random basis vectors because the Johnson–Lindenstrauss Lemma (Dasgupta & Gupta, 2003) guarantees that pairwise distances between points are approximately preserved after projection onto a random subspace of much lower dimension. We empirically evaluate the projection error of Gaussian random basis vectors on CIFAR10 and SVHN; the experimental settings are the same as in Section 4. The projection errors over the course of training are plotted in Figure 7. The projection error of random basis vectors is very high (>95\%) throughout training. This is because preserving pairwise distances is not sufficient for high-quality gradient reconstruction, which requires preserving the average 'distance' between each individual gradient and all other gradients. A sketch of this comparison is given after Figure 7.

Figure 7: Projection error rate of random basis vectors. The dimension of the subspace is denoted by k.
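The following NumPy sketch illustrates the comparison behind Figure 7 on synthetic low-rank-plus-noise gradients (the synthetic data and dimensions are assumptions made purely for illustration): a random orthonormal basis leaves most of the gradient energy unexplained, while a basis aligned with the gradients' principal subspace does not.

```python
# Relative projection error ||G - (G B^T) B||_F^2 / ||G||_F^2 of a random basis
# versus a principal-subspace basis, on synthetic gradients (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 4096, 200

U = np.linalg.qr(rng.standard_normal((p, k)))[0]           # ground-truth subspace
G = rng.standard_normal((n, k)) @ U.T + 0.05 * rng.standard_normal((n, p))

def projection_error(G, B):
    """B: (k, p) with orthonormal rows."""
    recon = (G @ B.T) @ B
    return np.linalg.norm(G - recon) ** 2 / np.linalg.norm(G) ** 2

B_rand = np.linalg.qr(rng.standard_normal((p, k)))[0].T    # random orthonormal basis
B_top = np.linalg.svd(G, full_matrices=False)[2][:k]       # top-k right singular vectors
print(projection_error(G, B_rand))   # close to 1: random directions miss the signal
print(projection_error(G, B_top))    # small: principal directions capture it
```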

Appendix D Missing Proofs

Proof of Proposition 3.1.

We extend Theorem 3.2 of Eaton & Perlman (1973) to the low-rank case.

Theorem D.1 (Theorem 3.2 in Eaton & Perlman (1973)).

Let {\bm{X}}=({\bm{x}}_{1},...,{\bm{x}}_{n}) where the {\bm{x}}_{i} are i.i.d. random vectors in {\mathbb{R}}^{p}, n\geq p. If {\mathbb{P}}\{{\bm{x}}_{1}\in{\mathbb{M}}\}=0 for all proper manifolds {\mathbb{M}}\subset{\mathbb{R}}^{p}, then {\mathbb{P}}\{{\bm{X}}\text{ is non-singular}\}=1.

We note that the subspace spanned by \hat{{\bm{V}}}_{k^{\prime}} is in the space spanned by {\bm{V}}_{k} by definition. Hence k^{\prime}\leq k.

Let \hat{{\bm{x}}}_{i}:={\bm{V}}_{k}^{T}\hat{{\bm{\xi}}}_{i}\in{\mathbb{R}}^{k} for i\in[m]. Then \hat{{\bm{X}}}:=(\hat{{\bm{x}}}_{1},...,\hat{{\bm{x}}}_{m}) is non-singular by the assumption and Theorem D.1, that is, rank(\hat{{\bm{X}}})=k. Therefore rank((\hat{{\bm{\xi}}}_{1},...,\hat{{\bm{\xi}}}_{m}))\geq k, which implies rank(\hat{{\bm{S}}})\geq k and hence k^{\prime}\geq k. Therefore k^{\prime}=k, and the subspace spanned by \hat{{\bm{V}}}_{k^{\prime}} and the subspace spanned by {\bm{V}}_{k} are identical. ∎

See 1

Proof of Theorem 1.

We first introduce some background on Rényi differential privacy (RDP) (Mironov, 2017). RDP bounds the Rényi divergence between the output distributions of a mechanism on neighboring datasets.

Definition 2 ((\lambda,\gamma)-RDP).

A randomized mechanism f is said to guarantee (\lambda,\gamma)-RDP with \lambda>1 if for any neighboring datasets {\mathbb{D}},{\mathbb{D}}^{\prime} it holds that

D_{\lambda}(f({\mathbb{D}})||f({\mathbb{D}}^{\prime}))\leq\gamma,

where D_{\lambda}(\cdot||\cdot) denotes the Rényi divergence of order \lambda.

We next introduce some useful properties of RDP.

Lemma D.2 (Gaussian mechanism of RDP).

Let S=\max_{{\mathbb{D}}\sim{\mathbb{D}}^{\prime}}\left\lVert f({\mathbb{D}})-f({\mathbb{D}}^{\prime})\right\rVert be the \ell_{2} sensitivity of f. Then the Gaussian mechanism \mathcal{M}=f({\mathbb{D}})+{\bm{z}} with {\bm{z}}\sim\mathcal{N}(0,\sigma^{2}I_{p\times p}) satisfies (\lambda,\frac{\lambda S^{2}}{2\sigma^{2}})-RDP.

Lemma D.3 (Composition of RDP).

If M_{1}, M_{2} satisfy (\lambda,\gamma_{1})-RDP and (\lambda,\gamma_{2})-RDP respectively, then their composition satisfies (\lambda,\gamma_{1}+\gamma_{2})-RDP.

Lemma D.4 (Conversion from RDP to (ϵ,δ)(\epsilon,\delta)-DP).

If \mathcal{M} obeys (\lambda,\gamma)-RDP, then \mathcal{M} obeys (\gamma+\log(1/\delta)/(\lambda-1),\delta)-DP for all 0<\delta<1.

Now we prove Theorem 1. Let {\bm{W}},{\bm{W}}^{\prime} be the gradient embeddings of two neighboring datasets {\mathbb{D}}\sim{\mathbb{D}}^{\prime} and {\bm{R}},{\bm{R}}^{\prime} be the corresponding residual gradients. Without loss of generality, suppose {\bm{W}} ({\bm{R}}) has one more row than {\bm{W}}^{\prime} ({\bm{R}}^{\prime}). For given sensitivities S_{1},S_{2},

\max_{{\mathbb{D}}\sim{\mathbb{D}}^{\prime}}\left\lVert{\bm{w}}-{\bm{w}}^{\prime}\right\rVert=\max_{{\bm{W}}\sim{\bm{W}}^{\prime}}\left\lVert{\bm{W}}_{n,:}\right\rVert\leq S_{1},\quad\max_{{\mathbb{D}}\sim{\mathbb{D}}^{\prime}}\left\lVert{\bm{r}}-{\bm{r}}^{\prime}\right\rVert=\max_{{\bm{R}}\sim{\bm{R}}^{\prime}}\left\lVert{\bm{R}}_{n,:}\right\rVert\leq S_{2}.

If we set \sigma_{1}=S_{1}\sigma and \sigma_{2}=S_{2}\sigma for some \sigma, then by Lemma D.2 the perturbed embedding and the perturbed residual gradient each satisfy (\lambda,\frac{\lambda}{2\sigma^{2}})-RDP, and by Lemma D.3 their composition, i.e., Algorithm 1, satisfies (\lambda,\frac{\lambda}{\sigma^{2}})-RDP. By Lemma D.4, in order to guarantee (\epsilon,\delta)-DP we need

\displaystyle\frac{\lambda}{\sigma^{2}}+\frac{\log(1/\delta)}{\lambda-1}\leq\epsilon. (5)

Choosing \lambda=1+\frac{2\log(1/\delta)}{\epsilon} and rearranging Eq. (5), we need

\displaystyle\sigma^{2}\geq\frac{2\left(\epsilon+2\log(1/\delta)\right)}{\epsilon^{2}}. (6)

Then using the constraint on ϵ\epsilon concludes the proof.

Proof of Theorem 2.

From the proof of Theorem 1, each call of GEP satisfies (\lambda,\frac{\lambda}{\sigma^{2}})-RDP. Then by the composition property of RDP (Lemma D.3), the output of Algorithm 2 satisfies (\lambda,\frac{T\lambda}{\sigma^{2}})-RDP. Plugging \frac{T\lambda}{\sigma^{2}} in place of \frac{\lambda}{\sigma^{2}} into Eqs. (5) and (6) concludes the proof.
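As a numeric sanity check, the noise level prescribed by Eq. (6) and its T-step analogue from the proof above, \sigma^{2}\geq 2T(\epsilon+2\log(1/\delta))/\epsilon^{2} with T=1 recovering Eq. (6), can be computed directly. The helper below is ours, not from the paper's code, and uses the same (possibly loose) choice of \lambda as the proofs.

```python
# Noise multiplier implied by Eq. (6) and its T-step analogue (illustrative helper).
import math

def noise_multiplier(eps, delta, T=1):
    return math.sqrt(2.0 * T * (eps + 2.0 * math.log(1.0 / delta)) / eps ** 2)

print(noise_multiplier(8, 1e-5))          # single release at eps = 8: about 0.98
print(noise_multiplier(2, 1e-5, T=100))   # 100 GEP steps at eps = 2
```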

Proof of Theorem 3.

The \beta-smoothness of L gives

\displaystyle L(\boldsymbol{\theta}_{t+1})\leq L(\boldsymbol{\theta}_{t})+\langle\nabla L(\boldsymbol{\theta}_{t}),\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}\rangle+\frac{\beta}{2}\left\lVert\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}\right\rVert^{2}. (7)

Based on the update rule of GEP we have

\displaystyle\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{t}=-\eta\tilde{\bm{v}}=-\eta\nabla L(\boldsymbol{\theta}_{t})-\frac{\eta}{n}({\bm{z}}^{(1)}_{t}{\bm{B}}+{\bm{z}}^{(2)}_{t}), (8)

where {\bm{z}}^{(1)}_{t}\sim\mathcal{N}(0,\sigma^{2}{\bm{I}}_{k\times k}) and {\bm{z}}^{(2)}_{t}\sim\mathcal{N}(0,\sigma^{2}r_{t}^{2}{\bm{I}}_{p\times p}) are the perturbation noises and r_{t}=\max_{i}\left\lVert({\bm{R}}_{t})_{i,:}\right\rVert is the sensitivity of the residual gradients at step t.

Plugging Eq. (8) into Eq. (7) and taking expectation with respect to the perturbation noises, using \mathbb{E}\left\lVert{\bm{z}}^{(1)}_{t}{\bm{B}}+{\bm{z}}^{(2)}_{t}\right\rVert^{2}=\sigma^{2}(k+pr_{t}^{2}) (the rows of {\bm{B}} are orthonormal and the two noise vectors are independent), we obtain

\displaystyle\mathbb{E}[L(\boldsymbol{\theta}_{t+1})]\leq\mathbb{E}[L(\boldsymbol{\theta}_{t})]-(\eta-\beta\eta^{2}/2)\mathbb{E}[\left\lVert\nabla L(\boldsymbol{\theta}_{t})\right\rVert^{2}]+\frac{\beta\eta^{2}\sigma^{2}}{2n^{2}}\left(k+pr_{t}^{2}\right). (9)

Subtracting L(\boldsymbol{\theta}_{*}) from both sides, we have

\displaystyle\mathbb{E}[L(\boldsymbol{\theta}_{t+1})]-L(\boldsymbol{\theta}_{*})\leq\mathbb{E}[L(\boldsymbol{\theta}_{t})]-L(\boldsymbol{\theta}_{*})-(\eta-\beta\eta^{2}/2)\mathbb{E}[\left\lVert\nabla L(\boldsymbol{\theta}_{t})\right\rVert^{2}]+\frac{\beta\eta^{2}\sigma^{2}}{2n^{2}}\left(k+pr_{t}^{2}\right) (10)
\displaystyle\leq\mathbb{E}[\langle\nabla L(\boldsymbol{\theta}_{t}),\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{*}\rangle]-(\eta-\beta\eta^{2}/2)\mathbb{E}[\left\lVert\nabla L(\boldsymbol{\theta}_{t})\right\rVert^{2}]+\frac{\beta\eta^{2}\sigma^{2}}{2n^{2}}\left(k+pr_{t}^{2}\right).

The second inequality holds because L is convex. Then, choosing \eta=\frac{1}{\beta} and plugging \nabla L(\boldsymbol{\theta}_{t})=(\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t+1})/\eta-({\bm{z}}^{(1)}_{t}{\bm{B}}+{\bm{z}}^{(2)}_{t})/n into Eq. (10), we obtain

\displaystyle\mathbb{E}[L(\boldsymbol{\theta}_{t+1})]-L(\boldsymbol{\theta}_{*})\leq\beta\mathbb{E}[\langle\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t+1},\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{*}\rangle]-\frac{\beta}{2}\mathbb{E}[\left\lVert\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{t+1}\right\rVert^{2}]+\frac{\sigma^{2}}{\beta n^{2}}\left(k+pr_{t}^{2}\right) (11)
\displaystyle=\frac{\beta}{2}\left(\mathbb{E}[\left\lVert\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{*}\right\rVert^{2}]-\mathbb{E}[\left\lVert\boldsymbol{\theta}_{t+1}-\boldsymbol{\theta}_{*}\right\rVert^{2}]\right)+\frac{\sigma^{2}}{\beta n^{2}}\left(k+pr_{t}^{2}\right).

Summing over t=0,\ldots,T-1, dividing by T, and using the convexity of L (with \boldsymbol{\bar{\theta}} denoting the average of the iterates), we have

\displaystyle\mathbb{E}[L(\boldsymbol{\bar{\theta}})]-L(\boldsymbol{\theta}_{*})\leq\frac{\beta}{2T}\left\lVert\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{*}\right\rVert^{2}+\frac{\sigma^{2}}{\beta n^{2}}(k+\frac{p}{T}\sum_{t=0}^{T-1}r_{t}^{2}). (12)

Then substituting T=nβϵpT=\frac{n\beta\epsilon}{\sqrt{p}} and σ=𝒪(Tlog(1/δ)/ϵ)\sigma=\mathcal{O}(\sqrt{T\log(1/\delta)}/\epsilon) yields the desired bound.