
Differentially Private ADMM Algorithms for Machine Learning

Tao Xu, Fanhua Shang,  Yuanyuan Liu, 
Hongying Liu,  Longjie Shen, and Maoguo Gong
T. Xu, F. Shang (Corresponding author), Y. Liu, H. Liu, and L. Shen are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, China. E-mails: {fhshang, yyliu, hyliu}@xidian.edu.cn, xidianxutao@gmail.com, s269524963@gmail.com. M. Gong is with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Electronic Engineering, Xidian University, China. E-mail: gong@ieee.org. Manuscript received October 30, 2020.
Abstract

In this paper, we study efficient differentially private alternating direction methods of multipliers (ADMM) via gradient perturbation for many machine learning problems. For smooth convex loss functions with (non)-smooth regularization, we propose the first differentially private ADMM (DP-ADMM) algorithm with a performance guarantee of $(\epsilon,\delta)$-differential privacy ($(\epsilon,\delta)$-DP). From the viewpoint of theoretical analysis, we use the Gaussian mechanism and the conversion relationship between Rényi differential privacy (RDP) and DP to perform a comprehensive privacy analysis of our algorithm. We then establish a new criterion to prove the convergence of the proposed algorithms, including DP-ADMM, and give the utility analysis of DP-ADMM. Moreover, we propose an accelerated DP-ADMM (DP-AccADMM) with Nesterov's acceleration technique. Finally, we conduct numerical experiments on many real-world datasets to show the privacy-utility tradeoff of the two proposed algorithms; the comparative analysis shows that DP-AccADMM converges faster and has better utility than DP-ADMM when the privacy budget $\epsilon$ is larger than a threshold.

Index Terms:
Differentially private, alternating direction method of multipliers (ADMM), gradient perturbation, momentum acceleration, Gaussian mechanism.

I Introduction

In the era of ‘Big Data’, much of people’s lives is presented on the internet in different types of data, and personal privacy may be leaked when data are released or shared. Data are the most valuable resource for research institutions and decision-making departments and can greatly improve the service system of society, but we must also protect the privacy of the users who provide the data, which often contain personally identifiable information and various confidential data. Generally, research institutions use these datasets to train machine learning models. Sometimes the models are made publicly available, which gives attackers an opportunity to obtain private information about individuals. Thus, it is necessary to add privacy-preserving techniques to the training and learning process.

In recent years, privacy-preserving technologies have become very popular. Differential privacy (DP) [1], which provides rigorous guarantees for individual privacy by adding randomized noise, has been extensively studied in the literature [2, 3]. In both academia and industry, differential privacy is the most widely recognized privacy-preserving technology. Many applications have been equipped with differential privacy in fields such as data mining, machine learning and deep learning. For instance, differentially private recommender systems have been extensively studied [4][5][6][7] to deal with privacy leakage during the collaborative filtering process. Besides, personalized online advertising [8], health data [9], face recognition [10][11], network trace analysis [12] and search logs [13] have all been studied with differential privacy as the privacy-preserving tool. Real-world applications have also been widely developed by technology companies and government agencies, such as Google [14], Apple [15], Microsoft [16], and the US Census Bureau [17]. Thus, differential privacy plays an important role in privacy preservation nowadays.

Empirical risk minimization (ERM) is a commonly used supervised learning problem. Let $D=(l_{1},l_{2},\ldots,l_{n})$ be a dataset with $n$ samples, where $l_{i}\in\mathbb{R}^{d}$. Many machine learning problems such as ERM are formulated as the following minimization problem:

$$\min_{x\in\mathbb{R}^{d}}\;\Big\{F(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x,l_{i})+g(x)\Big\} \tag{1}$$

where each $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a smooth convex loss function, and $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a simple convex (non)-smooth regularizer such as the $\ell_{1}$-norm or $\ell_{2}$-norm regularizer. Many differentially private algorithms have been proposed to deal with ERM problems, such as DP-SGD [18], DP-SVRG [19], and DP-SRGD [20]. This paper mainly considers the generalized ERM problem with more complex regularizers (e.g., $g(x)=\lambda\|Ax\|_{1}$ with a given matrix $A$ and a regularization parameter $\lambda>0$), such as graph-guided fused Lasso [21], generalized Lasso [22] and graph-guided support vector machine (SVM) [23].

The alternating direction method of multipliers (ADMM) is an efficient and popular optimization method for solving the generalized ERM problem. Although there are many research works on differential privacy ADMMs, most of them focus on differential privacy of distributed ADMMs such as [24, 25, 26]. In fact, distributed ADMMs are mostly suitable for federated learning instead of centralized problems. Therefore, only a few of them work on the centralized and stochastic ADMMs [27] or objective perturbed centralized and deterministic ADMMs such as [28] to deal with centralized problems.

However, the gradients of stochastic ADMMs such as [23] carry their own noise due to random sampling [29], which makes it difficult to estimate the noise that must be added to stochastic algorithms. Generally, stochastic differentially private algorithms need additional methods to estimate their privacy loss [18, 30], and rough estimates may lead to poor utility. As for where to add the noise, gradient perturbation is a better choice than output or objective perturbation for first-order algorithms [31]. Firstly, at each iteration, gradient perturbation can release the noisy gradient so that the utility of the algorithm will not be affected. Secondly, objective perturbation often requires strong assumptions on the objective function, while gradient perturbation only needs a bound on the sensitivity of the gradients. Moreover, for DP-ERM problems, gradient perturbation often achieves better empirical utility than output or objective perturbation. Thus, it is meaningful to study differentially private ADMM algorithms under gradient perturbation. Until now, there has been no work on gradient-perturbed differentially private ADMM in the centralized and deterministic setting. Therefore, in this paper, we focus on centralized and deterministic ADMMs and propose efficient gradient-perturbed differentially private ADMM algorithms for solving the generalized ERM problem (i.e., Problem (2) in Section II).

ADMM algorithms can deal with more complicated ERM problems, especially ERM problems with equality constraints, e.g., graph-guided logistic regression and graph-guided SVM problems. In this paper, we propose two efficient deterministic differentially private ADMM algorithms that satisfy $(\epsilon,\delta)$-differential privacy (DP), namely the differentially private ADMM (DP-ADMM) algorithm and the differentially private accelerated ADMM (DP-AccADMM) algorithm. Our algorithms can be applied in many real-world domains, such as finance, medical treatment, the Internet and transportation. Because we give a quantitative characterization of privacy leakage, even attackers with the largest background knowledge cannot obtain individual private information. That is, under our privacy guarantees, our algorithms satisfy $(\epsilon,\delta)$-DP.

In the first proposed algorithm (i.e., DP-ADMM), we replace $f(x)$ by its first-order approximation to obtain the gradient term $\nabla f$, and then add Gaussian noise to it as the gradient perturbation. Moreover, we give a privacy guarantee analysis that specifies the variance of the noise added to the algorithm, which ensures adequate security. In particular, we use the relationship between Rényi differential privacy (RDP) and $(\epsilon,\delta)$-DP to get the privacy guarantee of our algorithm. Utility is one of the important indicators for an algorithm that seeks to preserve data privacy while maintaining an acceptable level of performance. Therefore, we provide the utility bound for our algorithm, which indicates how well the model can be trained. Unlike distributed differentially private ADMMs [25, 26], we define a new convergence criterion to analyze the convergence property of our algorithm and derive our utility bound.

However, common ADMMs converge very slowly when approaching the optimal solution, and so does DP-ADMM. A common remedy is to introduce an acceleration technique into ADMMs. Among acceleration techniques such as the Nesterov method [32] and the momentum acceleration method [33], the Nesterov accelerated method is a well-known momentum acceleration method [34], and has a faster convergence rate than traditional momentum acceleration methods [35]. In particular, Goldstein et al. [36] proved that their accelerated ADMM with Nesterov acceleration has a convergence rate of $O(1/T^{2})$, while traditional non-accelerated ADMMs only have a convergence rate of $O(1/T)$, where $T$ is the number of iterations.

Therefore, in the second proposed algorithm (i.e., DP-AccADMM), we use the Nesterov acceleration technique to accelerate DP-ADMM. The convergence of non-accelerated ADMMs, including DP-ADMM, gradually slows down as the iteration number increases, especially when the dataset is huge. We then conduct experiments to compare the performance of DP-ADMM and DP-AccADMM. Our comparative analysis shows that DP-AccADMM converges much faster than DP-ADMM and retains good performance on testing data when the privacy budget $\epsilon$ grows and reaches a threshold.

Our contributions of this paper are summarized as follows:

  • Based on deterministic ADMMs, we propose two efficient differentially private ADMM algorithms for solving (non)-smooth convex optimization problems with privacy protection. In the proposed algorithms, each subproblem is solved exactly in a closed-form expression by using a first-order approximation.

  • We use the relationship between RDP and $(\epsilon,\delta)$-DP to get the privacy guarantees of our DP-ADMM algorithm. Moreover, we design a new convergence criterion to complete the convergence analysis of DP-ADMM. It is non-trivial to provide the utility bound and gradient complexity for our algorithm. To the best of our knowledge, we are the first to give the theoretical analysis of gradient perturbed differentially private ADMM algorithms in the centralized and deterministic settings.

  • We empirically show the effectiveness of the proposed algorithms by performing extensive empirical evaluations on graph-guided fused Lasso models and comparing them with their counterparts. The results show that DP-AccADMM performs much better than DP-ADMM in terms of convergence speed. In particular, DP-AccADMM continuously improves performance on test sets with the increase of the privacy budget $\epsilon$ and even outperforms DP-ADMM.

The remainder of this paper is organized as follows. Section II discusses recent research advances in differential privacy methods. Section III introduces the preliminaries of differential privacy. We propose two new efficient differentially private ADMM algorithms and analyze their privacy guarantees in Section IV. Experimental results in Section V show the effectiveness of our algorithms. In Section VI, we conclude this paper and discuss future work.

II Related Work

In this section, we present the formulation considered in this paper, and review some differential privacy methods.

II-A Problem Setting

To solve the generalized ERM problem (1) with a smooth convex function $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x,l_{i})$ and a complex sparsity-inducing regularizer, we introduce an auxiliary variable $y$ and an equality constraint $Ax+By=c$, and reformulate Problem (1) as follows:

$$\min_{x\in\mathbb{R}^{d_{1}},\,y\in\mathbb{R}^{d_{2}}}\big\{f(x)+g(y),\;\textup{s.t.}\;Ax+By=c\big\} \tag{2}$$

where $A\in\mathbb{R}^{d_{3}\times d_{1}}$, $B\in\mathbb{R}^{d_{3}\times d_{2}}$, and $c\in\mathbb{R}^{d_{3}}$. Therefore, this paper aims to propose efficient differentially private ADMM algorithms for solving the more general equality constrained minimization problem (2).

Let $\rho>0$ be a penalty parameter, and $u$ be a dual variable. Then the augmented Lagrangian function of Problem (2) is:

$$L_{\rho}(x,y,u)=f(x)+g(y)+\langle u,\,Ax+By-c\rangle+\frac{\rho}{2}\|Ax+By-c\|^{2}.$$

In an alternating or sequential fashion, at iteration $t$, ADMM performs the following update rules:

$$y_{t}=\arg\min_{y}\Big\{g(y)+\frac{\rho}{2}\|Ax_{t-1}+By-c+u_{t-1}\|^{2}\Big\}, \tag{3}$$

$$x_{t}=\arg\min_{x}\Big\{f(x)+\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}\Big\}, \tag{4}$$

$$u_{t}=u_{t-1}+Ax_{t}+By_{t}-c. \tag{5}$$

This is the classic update form of ADMM [37]. The update of the variable $x$ usually has a high computational complexity, especially when the dataset is very large.

II-B Related Work

As for differential privacy, there are three main types of perturbation used to solve empirical risk minimization problems under differential privacy [2, 3]. Output perturbation perturbs the model parameters. For instance, [38] showed that introducing output perturbation can make $k$-anonymous algorithms satisfy differential privacy, and [39] analyzed the sensitivity of the optimal solution between neighboring databases. Objective perturbation perturbs the objective function optimized by the algorithm. [40] proposed an algorithm that uses objective perturbation to solve ERM problems and analyzed its privacy guarantee to show that it satisfies $(\epsilon,\delta)$-DP. Gradient perturbation perturbs the gradients used for updating parameters in first-order optimization methods. For instance, [19] proposed DP-SVRG by introducing gradient perturbation into the SVRG operator in [29], and used the moments accountant to establish both its privacy guarantee and utility guarantee.

In recent years, there has been some research on ADMM algorithms that satisfy differential privacy. However, most of it concerns distributed ADMM algorithms. For instance, [25] analyzed the relationship between the privacy guarantee and the utility guarantee of their differentially private distributed ADMM algorithms. Only a few works focus on differentially private centralized ADMM algorithms. For example, [41] proposed two stochastic ADMM algorithms that satisfy $(\alpha,\beta)$-RDP and provided their privacy guarantees. However, there is no research work on gradient perturbed differentially private centralized and deterministic ADMM algorithms. Therefore, in this paper, we focus on differentially private centralized deterministic ADMM algorithms, and propose two efficient deterministic algorithms, DP-ADMM and DP-AccADMM, with privacy guarantees.

III Preliminaries

Notations. Throughout this paper, the norm $\|\cdot\|$ is the standard Euclidean norm, $\|\cdot\|_{1}$ denotes the $\ell_{1}$-norm, i.e., $\|x\|_{1}=\sum_{i}|x_{i}|$, and $\|\cdot\|_{2}$ is the spectral norm (i.e., the largest singular value of the matrix). We denote by $\nabla f(x)$ the gradient of $f(x)$ if it is differentiable, and by $g^{\prime}(x)$ any subgradient of $g(\cdot)$ at $x$ if $g(\cdot)$ is only Lipschitz continuous. $D=\{l_{1},l_{2},\cdots,l_{n}\}$ is a dataset of $n$ samples.

III-A Differential Privacy

[1] introduced the formal notion of differential privacy as follows.

Definition 1 (Differential privacy).

A randomized mechanism $\mathcal{A}:\mathbb{D}^{n}\to\mathbb{R}$ is $(\epsilon,\delta)$-differentially private ($(\epsilon,\delta)$-DP) if for all neighboring datasets $D,D^{\prime}$ differing by one element and for all events $S$ in the output space of $\mathcal{A}$, we have

$$\textup{Pr}[\mathcal{A}(D)\in S]\leq e^{\epsilon}\,\textup{Pr}[\mathcal{A}(D^{\prime})\in S]+\delta.$$

When $\delta=0$, $\mathcal{A}$ is $\epsilon$-differentially private ($\epsilon$-DP).

Definition 2 ($\ell_{2}$-sensitivity [2]).

For a function $q:\mathbb{D}^{n}\to\mathbb{R}$, the $\ell_{2}$-sensitivity $\bigtriangleup(q)$ of $q(\cdot)$ is defined as follows:

$$\bigtriangleup(q)=\max_{D,D^{\prime}}\|q(D)-q(D^{\prime})\| \tag{6}$$

where $D,D^{\prime}\in\mathbb{D}^{n}$ are a pair of neighboring datasets, which differ in a single entry.

Sensitivity is a key indicator that determines the size of the noise added to the algorithm. That is, to achieve $(\epsilon,\delta)$-DP for a function $q:\mathbb{D}^{n}\to\mathbb{R}$, we usually use the following Gaussian mechanism, where the added noise is sampled from a Gaussian distribution whose standard deviation is proportional to the $\ell_{2}$-sensitivity of $q(\cdot)$.

Definition 3 (Gaussian mechanism [2]).

Given a function $q:\mathbb{D}^{n}\to\mathbb{R}$, the Gaussian mechanism $\mathcal{A}$ is defined as follows:

$$\mathcal{A}(D,q,\epsilon)=q(D)+v$$

where $v$ is drawn from the Gaussian distribution $N(0,\sigma^{2}I)$ with $\sigma\geq\frac{\sqrt{2\log(1.25/\delta)}\,\bigtriangleup(q)}{\epsilon}$. Here $\bigtriangleup(q)$ is the $\ell_{2}$-sensitivity of the function $q$, i.e., $\bigtriangleup(q)=\sup_{D\sim D^{\prime}}\|q(D)-q(D^{\prime})\|$.
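As an illustration of Definition 3, the following minimal sketch calibrates the noise scale from the sensitivity and releases a noisy query answer. The function names, the mean query and the synthetic records in $[0,1]$ are assumptions made for this example only; the classical calibration is usually stated for $\epsilon\in(0,1)$.

```python
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Smallest noise scale allowed by Definition 3."""
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Return value + v with v ~ N(0, sigma^2 I)."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    value = np.asarray(value, dtype=float)
    return value + rng.normal(0.0, sigma, size=value.shape)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)  # hypothetical records in [0, 1]
# Mean query: replacing one record moves the mean by at most 1/n.
sensitivity = 1.0 / data.size
noisy_mean = gaussian_mechanism(data.mean(), sensitivity,
                                epsilon=0.5, delta=1e-5, rng=rng)
```

The noise scale grows linearly with the sensitivity and shrinks as $\epsilon$ grows, which is why bounding the sensitivity of the released quantity is the central step in the privacy analysis.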

III-B Rényi Differential Privacy

Although the definition of $(\epsilon,\delta)$-DP is widely used in the objective and output perturbation methods, the notion of Rényi differential privacy (RDP) [42] is more suitable for gradient perturbation methods including the proposed algorithms.

Definition 4 (Rényi divergence [42]).

For two probability distributions $P$ and $Q$ defined on $\mathbb{R}$, the Rényi divergence of order $\alpha>1$ is defined as follows:

$$D_{\alpha}(P\parallel Q)\triangleq\frac{1}{\alpha-1}\log\textup{E}_{x\sim Q}\Big(\frac{P(x)}{Q(x)}\Big)^{\alpha}.$$

Definition 5 (Rényi Differential Privacy (RDP) [42]).

A randomized mechanism $\mathcal{A}$ is $(\alpha,\beta)$-Rényi differentially private ($(\alpha,\beta)$-RDP) if for all neighboring datasets $D,D^{\prime}$, we have

$$D_{\alpha}(\mathcal{A}(D)\,\|\,\mathcal{A}(D^{\prime}))\leq\beta.$$

That is, the Rényi divergence between the outputs of the mechanism $\mathcal{A}$ on neighboring datasets is at most $\beta$.

Definition 6 (Rényi Gaussian Mechanism [42]).

Given a function $q:\mathbb{D}^{n}\to\mathbb{R}$, the Gaussian mechanism $\mathcal{A}=q(D)+v$ satisfies $(\alpha,\alpha\bigtriangleup^{2}(q)/(2\sigma^{2}))$-RDP, where $v\sim N(0,\sigma^{2}I)$.
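The RDP parameter in Definition 6 can be checked numerically in one dimension. The sketch below evaluates the Rényi divergence of Definition 4 by a Riemann sum for two Gaussians whose means differ by the sensitivity, and compares it with $\alpha\bigtriangleup^{2}(q)/(2\sigma^{2})$. The grid, the parameter values and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def renyi_divergence_numeric(p, q, dx, alpha):
    """D_alpha(P || Q) = log E_{x~Q}[(P/Q)^alpha] / (alpha - 1), via a Riemann sum."""
    integrand = q * (p / q) ** alpha
    return np.log(np.sum(integrand) * dx) / (alpha - 1.0)

sigma, sens, alpha = 2.0, 1.0, 4.0          # illustrative values
x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]
p = normal_pdf(x, sens, sigma)              # output density when q(D)  = sens
q = normal_pdf(x, 0.0, sigma)               # output density when q(D') = 0

numeric = renyi_divergence_numeric(p, q, dx, alpha)
closed_form = alpha * sens ** 2 / (2.0 * sigma ** 2)
```

Both quantities should agree up to discretization error, which illustrates why the Gaussian mechanism has such a convenient RDP characterization.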

III-C Nesterov’s Accelerated Method

Algorithm 1 Nesterov’s Accelerated Gradient Descent
Input: $\theta_{1}=1$, $x_{0}=y_{1}\in\mathbb{R}^{d}$, learning rate $\eta<1/L_{F}$, where $L_{F}$ is the Lipschitz constant of $\nabla F$.
1:  for $t=1,2,\ldots,T$ do
2:     $x_{t}=y_{t}-\eta\nabla F(y_{t})$;
3:     $\theta_{t+1}=(1+\sqrt{4\theta_{t}^{2}+1})/2$;
4:     $y_{t+1}=x_{t}+(\theta_{t}-1)(x_{t}-x_{t-1})/\theta_{t+1}$;
5:  end for
Output: $y_{T}$.

In [32], Nesterov presented a first-order minimization scheme with a global convergence rate of $O(1/T^{2})$ for solving Problem (1). This convergence rate is provably optimal for the class of Lipschitz differentiable functionals. As shown in Algorithm 1, the Nesterov method accelerates gradient descent by using an overrelaxation step as follows:

$$y_{t+1}=x_{t}+\frac{\theta_{t}-1}{\theta_{t+1}}(x_{t}-x_{t-1})$$

where $\theta_{t+1}=(1+\sqrt{4\theta_{t}^{2}+1})/2$ with the initial value $\theta_{1}=1$.
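The steps of Algorithm 1 can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function name and the small quadratic test problem are assumptions, and the sketch returns the last gradient-step iterate $x_{T}$ instead of $y_{T}$.

```python
import numpy as np

def nesterov_agd(grad, x0, eta, T):
    """Nesterov's accelerated gradient descent with the theta recursion of Algorithm 1."""
    theta = 1.0                                   # theta_1 = 1
    x_prev = np.asarray(x0, dtype=float)          # x_0
    y = x_prev.copy()                             # y_1 = x_0
    for _ in range(T):
        x = y - eta * grad(y)                     # gradient step at the extrapolated point
        theta_next = (1.0 + np.sqrt(4.0 * theta ** 2 + 1.0)) / 2.0
        y = x + (theta - 1.0) / theta_next * (x - x_prev)   # overrelaxation step
        x_prev, theta = x, theta_next
    return x_prev

# Hypothetical smooth convex objective F(x) = 0.5 x^T Q x - b^T x with L_F = 10.
Q = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
F = lambda x: 0.5 * x @ Q @ x - b @ x
grad_F = lambda x: Q @ x - b

x_T = nesterov_agd(grad_F, np.zeros(2), eta=0.09, T=300)   # eta < 1 / L_F
x_star = np.linalg.solve(Q, b)
gap = F(x_T) - F(x_star)
```

In the first iteration the momentum weight $(\theta_{1}-1)/\theta_{2}$ is zero, so the method starts as plain gradient descent and the extrapolation becomes stronger as $\theta_{t}$ grows.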

In 2014, Goldstein et al. [36] proposed an accelerated ADMM algorithm (AccADMM) by introducing Nesterov acceleration. They also proved that their algorithm has a global convergence rate of $O(1/T^{2})$.

IV Differentially Private ADMMs

In this section, we propose two new deterministic differentially private ADMM algorithms for many machine learning problems such as the $\ell_{1}$-norm regularized and graph-guided fused Lasso. The proposed algorithms protect privacy by adding gradient perturbations. In particular, the second algorithm (DP-AccADMM) introduces the Nesterov acceleration into the first algorithm (DP-ADMM). Moreover, we also provide the privacy guarantees for the proposed algorithms.

IV-A Differentially Private ADMM

The specific algorithmic steps of our deterministic differentially private ADMM (DP-ADMM) algorithm for solving the equality constrained minimization problem (2) are presented in Algorithm 2.

Analogous to the general ADMM algorithm, our DP-ADMM algorithm updates the variables $x,y,u$ in an alternating fashion. When updating $x_{t}$, however, we replace $f(x)$ with its first-order approximation at $x_{t-1}$ perturbed by Gaussian noise (i.e., $f(x_{t-1})+\langle\nabla f(x_{t-1})+P_{t},\,x\rangle$), where $P_{t}\sim N(0,\sigma^{2}I_{d_{1}})$ is the added Gaussian noise, and $\sigma^{2}$ is the noise variance determined by the privacy guarantee, which satisfies the Gaussian mechanism. We then add a squared norm term $\frac{\|x-x_{t-1}\|_{G}^{2}}{2\eta}$ to obtain the following proximal update rule of $x_{t}$,

$$x_{t}=\arg\min_{x}\Big\{\langle\nabla f(x_{t-1})+P_{t},\,x\rangle+\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{\|x-x_{t-1}\|_{G}^{2}}{2\eta}\Big\} \tag{7}$$

where $\eta$ is a step-size or learning rate, $G=\gamma I-\eta\rho A^{T}A$, $\gamma\geq\gamma_{\min}\equiv\eta\rho\|A^{T}A\|_{2}+1$ to ensure that $G\succeq I$, and $\|z\|^{2}_{G}=z^{T}Gz$ with a given positive semi-definite matrix $G$ as in [23, 43]. This squared norm term keeps adjacent iterates (i.e., $x_{t-1}$ and $x_{t}$) from being too far apart, and prevents the noise from affecting the iterates too much.

By using the linearized proximal point method [23], we can obtain the closed-form solution of $x_{t}$ in Eq. (7) as follows:

$$x_{t}=x_{t-1}-\frac{\eta}{\gamma}\big[\nabla f(x_{t-1})+P_{t}+\rho A^{T}(Ax_{t-1}+By_{t}-c+u_{t-1})\big].$$
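The closed-form update can be sanity-checked numerically: at the claimed minimizer, the gradient of the objective in Eq. (7) should vanish, and $G$ should satisfy $G\succeq I$ when $\gamma=\gamma_{\min}$. The sketch below uses random matrices and vectors; all dimensions and values are hypothetical, and `noisy_grad` stands in for $\nabla f(x_{t-1})+P_{t}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d3 = 5, 3
A = rng.normal(size=(d3, d1))
B = -np.eye(d3)
c = np.zeros(d3)
rho, eta = 1.0, 0.1

# gamma_min = eta * rho * ||A^T A||_2 + 1 (spectral norm), so that G >= I.
gamma = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0
G = gamma * np.eye(d1) - eta * rho * A.T @ A

x_prev = rng.normal(size=d1)
y_t = rng.normal(size=d3)
u_prev = rng.normal(size=d3)
noisy_grad = rng.normal(size=d1)   # stands in for grad f(x_{t-1}) + P_t

# Closed-form solution of Eq. (7).
residual_prev = A @ x_prev + B @ y_t - c + u_prev
x_t = x_prev - (eta / gamma) * (noisy_grad + rho * A.T @ residual_prev)

# Gradient of the objective in Eq. (7), evaluated at x_t.
stationarity = (noisy_grad
                + rho * A.T @ (A @ x_t + B @ y_t - c + u_prev)
                + G @ (x_t - x_prev) / eta)
min_eig_G = np.linalg.eigvalsh(G).min()
```

The quadratic term $\rho A^{T}A$ cancels against the $-\eta\rho A^{T}A$ part of $G$, which is why no linear system has to be solved in the $x$-update.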

As shown in Algorithm 2, the update rules of $y_{t}$ in Eq. (3) and $u_{t}$ in Eq. (5) remain unchanged for DP-ADMM. Here, $g(y)$ is the regularization term used in many machine learning problems, e.g., the sparsity-inducing term $g(y)=\lambda\|Ay\|_{1}$. For instance, if it is the $\ell_{1}$-norm regularized term, then the closed-form solution can be easily obtained using the soft-thresholding operator [44], and when it is the $\ell_{2}$-norm term, the closed-form solution can be obtained by direct derivation.
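As a concrete instance of the $y$-update, consider the special case $g(y)=\lambda\|y\|_{1}$, $B=-I$ and $c=0$ (the setting of Problem (11)); this choice is an assumption for illustration. Eq. (3) then reduces to elementwise soft-thresholding of $Ax_{t-1}+u_{t-1}$ with threshold $\lambda/\rho$, as the following sketch shows.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def y_update_l1(A, x_prev, u_prev, lam, rho):
    """Minimizer of lam * ||y||_1 + (rho / 2) * ||A x_prev - y + u_prev||^2 over y."""
    return soft_threshold(A @ x_prev + u_prev, lam / rho)

example = soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)
```

Entries whose magnitude is below the threshold are set exactly to zero, which is how the $\ell_{1}$ term induces sparsity in $y$.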

Algorithm 2 DP-ADMM($f,g,x_{0},T,\eta,\sigma$)
Input: $f$ is $L_{f}$-smooth, learning rate $\eta$, and $T$.
Initialize: $x_{0}$, $u_{0}=-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{0})$;
1:  for $t=1,2,\ldots,T$ do
2:     $y_{t}=\arg\min_{y}\big\{g(y)+\frac{\rho}{2}\|Ax_{t-1}+By-c+u_{t-1}\|^{2}\big\}$;
3:     $x_{t}=\arg\min_{x}\big\{\langle\nabla f(x_{t-1})+P_{t},\,x\rangle+\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{\|x-x_{t-1}\|_{G}^{2}}{2\eta}\big\}$, where $P_{t}\sim N(0,\sigma^{2}I_{d_{1}})$;
4:     $u_{t}=u_{t-1}+Ax_{t}+By_{t}-c$;
5:  end for
Output: $x_{T},y_{T}$
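A minimal sketch of Algorithm 2 is given below under several assumptions that are not fixed by the algorithm itself: the loss is the logistic loss, $g(y)=\lambda\|y\|_{1}$, $B=-I$, $c=0$, $x_{0}=0$ and $\gamma=\gamma_{\min}$. The function names and arguments are hypothetical, and the noise scale `sigma` is taken as an input rather than derived from a privacy budget.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def logistic_grad(x, L, m):
    """Gradient of (1/n) * sum_i log(1 + exp(-m_i * l_i^T x))."""
    margins = m * (L @ x)
    return -(L.T @ (m / (1.0 + np.exp(margins)))) / L.shape[0]

def dp_admm(L, m, A, lam, rho, eta, sigma, T, rng):
    """Sketch of DP-ADMM for logistic loss with g(y) = lam * ||y||_1, B = -I, c = 0."""
    d1 = L.shape[1]
    gamma = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0       # gamma_min
    x = np.zeros(d1)                                           # x_0 (assumed)
    u = -np.linalg.pinv(A.T) @ logistic_grad(x, L, m) / rho    # u_0
    y = A @ x
    for _ in range(T):
        y = soft_threshold(A @ x + u, lam / rho)               # Step 2, Eq. (3)
        P = rng.normal(0.0, sigma, size=d1)                    # Gaussian perturbation P_t
        x = x - (eta / gamma) * (logistic_grad(x, L, m) + P
                                 + rho * A.T @ (A @ x - y + u))  # Step 3, closed form
        u = u + A @ x - y                                      # Step 4, Eq. (5)
    return x, y
```

Setting `sigma=0.0` recovers a non-private linearized ADMM, which is convenient for checking the implementation before adding noise.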

IV-B Privacy Guarantee Analysis

In this subsection, we theoretically analyze both privacy guarantee and utility guarantee of DP-ADMM. To facilitate our discussion, we first make the following basic assumptions.

Assumption 1.

For a convex and Lipschitz-smooth function $f$, there exists a constant $L_{f}$ such that $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq L_{f}\|x_{1}-x_{2}\|$ for any $x_{1},x_{2}$.

Assumption 2.

The matrix $A$ has full row rank. This ensures that the matrix $A$ has a pseudo-inverse.

Through the Gaussian mechanism of RDP, we can get the relationship between the RDP parameters $\alpha$ and $\beta$ and the variance $\sigma^{2}$ of the added noise. Then, we can obtain the relationship between $\epsilon,\delta$ and $\sigma$ by the conversion relationship between RDP and $(\epsilon,\delta)$-DP. Thus, we can get the size of the Gaussian noise variance.

Theorem 1 (Privacy guarantee).

DP-ADMM satisfies $(\epsilon,\delta)$-differential privacy with some constants $c,\delta>0$ and $\mu\in(0,1)$, if

$$\sigma^{2}=\frac{c^{2}\alpha T}{2n^{2}\epsilon\mu} \tag{8}$$

where $\alpha=\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1$.
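The noise variance in Eq. (8) can be computed directly, as in the sketch below. The consistency check at the end is an illustration under an assumption: it supposes that the per-iteration $\ell_{2}$-sensitivity of the released gradient is $c/n$ (the actual derivation is in Appendix A). It then uses Definition 6, the additivity of RDP under composition, and the standard conversion from $(\alpha,\beta)$-RDP to $(\beta+\log(1/\delta)/(\alpha-1),\delta)$-DP from [42]. All numerical values are hypothetical.

```python
import math

def rdp_order(epsilon, delta, mu):
    """alpha = log(1/delta) / ((1 - mu) * epsilon) + 1, as in Theorem 1."""
    return math.log(1.0 / delta) / ((1.0 - mu) * epsilon) + 1.0

def dp_admm_noise_variance(c, T, n, epsilon, delta, mu):
    """sigma^2 from Eq. (8); c is the constant of Theorem 1, treated as an input."""
    alpha = rdp_order(epsilon, delta, mu)
    return c ** 2 * alpha * T / (2.0 * n ** 2 * epsilon * mu)

# Hypothetical setting.
c, T, n = 1.0, 100, 1000
epsilon, delta, mu = 1.0, 1e-5, 0.5
alpha = rdp_order(epsilon, delta, mu)
sigma2 = dp_admm_noise_variance(c, T, n, epsilon, delta, mu)

# Assumed per-step sensitivity c / n: each step is (alpha, alpha * (c/n)^2 / (2 sigma^2))-RDP,
# and T steps compose additively in the RDP parameter.
beta_total = T * alpha * (c / n) ** 2 / (2.0 * sigma2)
# RDP-to-DP conversion.
epsilon_total = beta_total + math.log(1.0 / delta) / (alpha - 1.0)
```

Under these assumptions the RDP budget accumulated over $T$ iterations is $\epsilon\mu$ and the conversion term is $(1-\mu)\epsilon$, so $\mu$ controls how the overall budget $\epsilon$ is split between the two.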

The detailed proof of Theorem 1 is given in Appendix A.

Let $(x_{*},y_{*})$ be an optimal solution of Problem (2). Different from distributed differentially private ADMMs [25, 26], we construct the following convergence criterion $R(\tilde{x},\tilde{y})$,

$$\begin{aligned}R(\tilde{x},\tilde{y})&=f(\tilde{x})-f(x_{*})-\big\langle\nabla f(x_{*}),\,\tilde{x}-x_{*}\big\rangle\\&\quad+g(\tilde{y})-g(y_{*})-\big\langle g^{\prime}(y_{*}),\,\tilde{y}-y_{*}\big\rangle\end{aligned}$$

where $\tilde{x}=\frac{1}{T}\sum_{t=1}^{T}x_{t}$ and $\tilde{y}=\frac{1}{T}\sum_{t=1}^{T}y_{t}$, and use it to complete the convergence analysis of DP-ADMM. Then by introducing the noise variance term and finding the number of iterations $T$ required to reach the stopping criterion, we can get the utility bound and gradient complexity of DP-ADMM. The utility bound is an important indicator of the performance of differentially private algorithms including DP-ADMM. It usually reflects the minimum value that the algorithm can converge to. The lower the utility bound is, the lower the loss can go and the better the trained model is.

Gradient complexity reflects the number of gradient calculations required by the model to reach the utility bound. It is usually proportional to the running time of the algorithm and is related to the convergence rate of the algorithm. The smaller the gradient complexity is, the fewer gradients need to be calculated.

Theorem 2 (Utility guarantee).

In DP-ADMM, let $\sigma$ be defined in Eq. (8), $\tilde{x}=\frac{1}{T}\sum_{t=1}^{T}x_{t}$, and $\tilde{y}=\frac{1}{T}\sum_{t=1}^{T}y_{t}$. Then, if the number of iterations $T=O\big(\frac{n\sqrt{\epsilon\mu}}{\sqrt{\alpha\eta d_{1}}}\big)$, we can get the utility bound of our DP-ADMM algorithm as follows:

$$R(\tilde{x},\tilde{y})\leq O\Big(\frac{\sqrt{\alpha\eta d_{1}}}{n\sqrt{\epsilon\mu}}\Big). \tag{9}$$

Moreover, the gradient complexity of our DP-ADMM algorithm is $O\big(\frac{n^{2}\sqrt{\epsilon\mu}}{\sqrt{\alpha\eta d_{1}}}\big)$.

The detailed proof of Theorem 2 can be found in Appendix B. Theorems 1 and 2 give the privacy guarantee and the utility guarantee of DP-ADMM, respectively. That is, the variance of the added noise required for the algorithm to meet the differential privacy requirement is given, and the minimum value of the algorithm’s loss is obtained. Compared with the theoretical analysis of traditional ADMMs, we find that the convergence rate of DP-ADMM does not change. This means that differential privacy protection does not affect the rate at which the optimization algorithm converges, but only the value that the algorithm converges to.

IV-C Differentially Private Accelerated ADMM

Non-accelerated ADMMs, including DP-ADMM, can be used to solve many machine learning problems with equality constraints, but, as in the non-private case, their convergence is not fast enough. In particular, convergence becomes very slow when approaching the exact solution. Therefore, we propose an accelerated version of our DP-ADMM algorithm, called DP-AccADMM.

Algorithm 3 DP-AccADMM($f,g,\hat{x}_{0},T,\eta,\sigma,\alpha_{0}$)
Input: $f$ is $L_{f}$-smooth, learning rate $\eta$, and $T$.
Initialize: $\hat{x}_{0}=x_{0}$, $u_{0}=\hat{u}_{0}=-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(\hat{x}_{0})$, and $\theta_{1}=1$.
1:  for $t=1,2,\ldots,T$ do
2:     $y_{t}=\arg\min_{y}\big\{g(y)+\frac{\rho}{2}\|A\hat{x}_{t-1}+By-c+\hat{u}_{t-1}\|^{2}\big\}$;
3:     $x_{t}=\arg\min_{x}\big\{\langle\nabla f(\hat{x}_{t-1})+P_{t},\,x\rangle+\frac{\rho}{2}\|Ax+By_{t}-c+\hat{u}_{t-1}\|^{2}+\frac{\|x-\hat{x}_{t-1}\|_{G}^{2}}{2\eta}\big\}$, where $P_{t}\sim N(0,\sigma^{2}I_{d_{1}})$;
4:     $u_{t}=\hat{u}_{t-1}+Ax_{t}+By_{t}-c$;
5:     $\theta_{t+1}=(1+\sqrt{1+4\theta_{t}^{2}})/2$;
6:     $\hat{x}_{t}=x_{t}+\frac{\theta_{t}-1}{\theta_{t+1}}(x_{t}-x_{t-1})$;
7:     $\hat{u}_{t}=u_{t}+\frac{\theta_{t}-1}{\theta_{t+1}}(u_{t}-u_{t-1})$;
8:  end for
Output: $x_{T},y_{T}$

Inspired by the accelerated ADMM algorithm [36] using the Nesterov method, we introduce the Nesterov acceleration technique into our DP-ADMM algorithm. The detailed steps of our accelerated differentially private ADMM (DP-AccADMM) algorithm are presented in Algorithm 3. In particular, we add a weight coefficient during the iteration process to linearly combine two adjacent iterates, so that the current iterate can be updated by using the information of the previous iterate. That is, the current descent direction is the combination of the current optimal direction and the previous descent direction, which makes the current iterate move forward some distance along the previous descent direction. Note that the weight coefficient for momentum acceleration changes as the iterations proceed. In our DP-AccADMM algorithm, we set the weight coefficient $\theta_{t}$ at the $t$-th iteration as follows:

$$\theta_{t+1}=\frac{1+\sqrt{1+4\theta_{t}^{2}}}{2}$$

where $\theta_{1}=1$. The momentum accelerated update rule of $\hat{x}_{t}$ is defined as:

$$\hat{x}_{t}=x_{t}+\frac{\theta_{t}-1}{\theta_{t+1}}(x_{t}-x_{t-1}). \tag{10}$$

Such an accelerated scheme is consistent with the acceleration method proposed by Nesterov [34] and the accelerated ADMM algorithm proposed by Goldstein et al. [36]. In addition, the dual variable $\hat{u}_{t}$ also has the same accelerated scheme as $\hat{x}_{t}$, as shown in Step 7 of Algorithm 3.
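The steps of Algorithm 3 can be sketched as follows, again assuming $g(y)=\lambda\|y\|_{1}$, $B=-I$, $c=0$ and $\gamma=\gamma_{\min}$. The closed form used for Step 3 is an assumption obtained by applying the same linearization as in Eq. (7) with $x_{t-1},u_{t-1}$ replaced by $\hat{x}_{t-1},\hat{u}_{t-1}$; the function name, its arguments and the generic gradient callable `grad_f` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dp_acc_admm(grad_f, A, lam, rho, eta, sigma, T, x0, rng):
    """Sketch of DP-AccADMM for g(y) = lam * ||y||_1, B = -I, c = 0."""
    x0 = np.asarray(x0, dtype=float)
    d1 = x0.size
    gamma = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0     # gamma_min
    x_prev, x_hat = x0.copy(), x0.copy()                     # x_0 and x_hat_0
    u_prev = -np.linalg.pinv(A.T) @ grad_f(x0) / rho         # u_0
    u_hat = u_prev.copy()                                    # u_hat_0
    theta = 1.0                                              # theta_1
    x, y = x0.copy(), A @ x0
    for _ in range(T):
        y = soft_threshold(A @ x_hat + u_hat, lam / rho)     # Step 2
        P = rng.normal(0.0, sigma, size=d1)                  # Gaussian perturbation P_t
        x = x_hat - (eta / gamma) * (grad_f(x_hat) + P
                                     + rho * A.T @ (A @ x_hat - y + u_hat))  # Step 3
        u = u_hat + A @ x - y                                # Step 4
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0   # Step 5
        weight = (theta - 1.0) / theta_next
        x_hat = x + weight * (x - x_prev)                    # Step 6, Eq. (10)
        u_hat = u + weight * (u - u_prev)                    # Step 7
        x_prev, u_prev, theta = x, u, theta_next
    return x, y
```

Because $\theta_{1}=1$, the momentum weight is zero in the first iteration, so the first iterate of this sketch coincides with that of the non-accelerated update.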

Although we do not give a theoretical analysis of DP-AccADMM, we conduct many experiments to compare the performance of DP-AccADMM and DP-ADMM under the same noise level, which is given in Theorem 1. We report the experimental results and discuss them in the next section.

Figure 1: Comparison of the proposed differentially private ADMM algorithms (DP-ADMM and DP-AccADMM) and their non-differentially private counterparts for solving graph-guided fused Lasso problems on the four data sets. Panels: (a) a8a, (b) a9a, (c) bio_train, (d) ijcnn1. The $x$-axis is the objective value minus the minimum value, and the $y$-axis denotes the running time (seconds).

V Experimental Results

In this section, we evaluate the performance of our DP-ADMM and DP-AccADMM algorithms and report experimental results on four publicly available datasets, as shown in Table I. We use our two algorithms to solve the general convex graph-guided fused Lasso problem, and compare them with their non-private counterparts, ADMM and AccADMM. We first show the convergence speed of all the methods in terms of CPU time (seconds) during training, and then show the training and testing performance as the privacy parameter $\epsilon$ changes.

V-A Graph-Guided Fused Lasso

To evaluate the performance of the proposed algorithms, we consider the following $\ell_{1}$-norm regularized graph-guided fused Lasso problem,

$$\min_{x}\Big\{\frac{1}{n}\sum_{i=1}^{n}f_{i}(x,l_{i})+\lambda\|y\|_{1},\;\text{s.t.}\;Ax-y=0\Big\}, \tag{11}$$

where $f_{i}$ is the logistic loss function on the feature-label pair $(l_{i},m_{i})$, i.e., $\log(1+\exp(-m_{i}l_{i}^{T}x))$, and $\lambda$ is the regularization parameter. We set $A=[W;I]$ as in [43, 45, 46], where $W$ is the sparsity pattern of the graph obtained by sparse inverse covariance selection [47]. We used four publicly available datasets in our experiments, as listed in Table I.
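As a concrete illustration of problem (11), the following NumPy sketch evaluates the objective with the constraint $y=Ax$ substituted in, together with the gradient of the smooth logistic part. The function names, the toy data, and the one-edge matrix `W` are assumptions made only for this example; they are not taken from the paper's experiments.

```python
import numpy as np

def logistic_loss(x, L, m):
    """(1/n) * sum_i log(1 + exp(-m_i * l_i^T x)), computed stably."""
    margins = m * (L @ x)
    return float(np.mean(np.logaddexp(0.0, -margins)))

def logistic_grad(x, L, m):
    """Gradient of logistic_loss with respect to x."""
    margins = m * (L @ x)
    # 1 / (1 + exp(margin)) written as exp(-log(1 + exp(margin))) for stability.
    weights = -m * np.exp(-np.logaddexp(0.0, margins))
    return L.T @ weights / L.shape[0]

def fused_lasso_objective(x, L, m, A, lam):
    """Objective of Eq. (11) with y replaced by A x."""
    return logistic_loss(x, L, m) + lam * float(np.abs(A @ x).sum())

# Toy data: two samples, two features, and a hypothetical one-edge graph W.
L = np.array([[1.0, 0.0], [0.0, 1.0]])   # rows are the feature vectors l_i
m = np.array([1.0, -1.0])                # labels m_i in {-1, +1}
W = np.array([[1.0, -1.0]])
A = np.vstack([W, np.eye(2)])            # A = [W; I]
x0 = np.zeros(2)
obj0 = fused_lasso_objective(x0, L, m, A, lam=1e-5)
grad0 = logistic_grad(x0, L, m)
```

In the private algorithms, it is this gradient of the smooth part that is perturbed with Gaussian noise; the $\ell_{1}$ term is handled through the $y$-update.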

TABLE I: Summary of data sets and regularization parameters used in our experiments.
Data sets | #training | #test | #features | $\lambda_{1}$
a9a 32,561 16,281 123 1e-5
a8a 22,696 9,865 123 1e-5
ijcnn1 49,990 91,701 22 1e-5
bio_train 72,876 72,875 74 1e-5

V-B Parameter Setting

We fix $\delta=10^{-3}$ for all the experiments, and we choose $\mu=0.5$ and $\rho=1$, where $\mu$ is the intermediate constant that splits the privacy budget in the privacy analysis (see Appendix A) and $\rho$ is the penalty parameter of the ADMM algorithms. In our theoretical analysis, $\rho$ does not influence the utility bound of our differentially private algorithms, so we fix $\rho$ as a constant. Moreover, we set $\gamma=1$ and use a constant $\eta$ on each dataset, whose value differs across datasets.

V-C Objective Value Decreasing with CPU Time

Figure 2: Graph-guided fused Lasso results of all the algorithms with different privacy budgets $\epsilon$. Panels: (a) a8a, (b) a9a, (c) bio_train, (d) ijcnn1. Top: objective value vs. budget; bottom: classification accuracy vs. budget.

Fig. 1 plots the objective gap (i.e., the objective value minus the minimum value) of the differentially private ADMM algorithms as the running time increases. The accelerated ADMM algorithm can achieve a convergence rate of $O(1/T^{2})$, which is much faster than the traditional ADMM algorithm. In our experiments, the convergence speed of DP-AccADMM is almost the same as that of its non-private counterpart, AccADMM. In addition, there is a gap within each of the two pairs of algorithms (DP-ADMM vs. ADMM, and DP-AccADMM vs. AccADMM). These gaps reflect the utility bound from our utility analysis. The gap between DP-AccADMM and AccADMM is larger than that between DP-ADMM and ADMM, which suggests that the Nesterov acceleration may increase the utility bound of the algorithm and thus reduce its utility. Therefore, we conduct another experiment to examine the test accuracy of all these algorithms.

V-D Performance on Simulated Data

Fig. 2 plots the objective value and test accuracy of the differentially private ADMM algorithms with different privacy budgets $\epsilon$ for solving the $\ell_{1}$-norm regularized graph-guided fused Lasso problem. We can see that AccADMM converges much faster than ADMM in terms of objective value. DP-AccADMM performs worse than DP-ADMM when the privacy budget $\epsilon$ is small (e.g., $\epsilon=0.01$). But as the budget $\epsilon$ increases (e.g., $\epsilon=0.07$ on bio_train), DP-AccADMM outperforms DP-ADMM in terms of both objective value and test accuracy. As the privacy budget increases, the objective value of DP-AccADMM gradually approaches that of AccADMM, and the objective value of DP-ADMM approaches that of ADMM. Thus, the Nesterov acceleration usually leads to a worse utility when $\epsilon$ is small, whereas for larger $\epsilon$ DP-AccADMM improves and outperforms DP-ADMM.

The test accuracy results show the same pattern: DP-AccADMM performs worse than DP-ADMM when the privacy budget $\epsilon$ is small (e.g., $\epsilon=0.02$). As $\epsilon$ increases (e.g., $\epsilon=0.08$ on a8a and ijcnn1), the test accuracy of DP-AccADMM gradually improves, approaching or even exceeding that of DP-ADMM, and it gradually approaches the test accuracy of AccADMM. Likewise, the test accuracy of DP-ADMM gradually approaches that of ADMM.

VI Conclusions and Further Work

In this paper, we proposed two efficient differentially private ADMM algorithms with the guarantees of $(\epsilon,\delta)$-DP. The first algorithm, DP-ADMM, uses gradient perturbation to achieve $(\epsilon,\delta)$-DP. Moreover, we provided its privacy analysis and utility analysis through the Gaussian mechanism and convergence analysis. The second algorithm, DP-AccADMM, uses the Nesterov method to accelerate the proposed DP-ADMM algorithm. The experimental results showed that DP-AccADMM has a much faster convergence speed. In particular, DP-AccADMM can achieve a higher classification accuracy when the privacy budget $\epsilon$ reaches a certain threshold.

Similar to DP-SVRG [19], we can extend our deterministic differentially private ADMM algorithms to the stochastic setting for solving large-scale optimization problems as in [48]. As for the theoretical analysis of DP-AccADMM, we will complete it as our future work. Moreover, we can employ some added noise variance decay schemes as in [49] to reduce the negative effects of gradient perturbation, and provide stronger privacy guarantees and better utility [50]. In addition, an interesting direction of future work is to extend our differentially private algorithms and theoretical results from the two-block version to the multi-block ADMM case [51].

Appendix A: Proof of Theorem 1

Before giving the proof of Theorem 1, we present the following lemmas. We first give the following relationship between RDP and $(\epsilon,\delta)$-DP [42]:

Lemma 1 (From RDP to $(\epsilon,\delta)$-DP).

If a randomized mechanism $\mathcal{A}:\mathbb{D}^{n}\to\mathbb{R}$ satisfies $(\alpha,\beta)$-RDP, then for any $\delta\in(0,1)$, $\mathcal{A}$ satisfies $(\beta+\log(1/\delta)/(\alpha-1),\delta)$-DP.
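The conversion in Lemma 1 is a one-line formula. A minimal Python sketch, with the hypothetical helper name `rdp_to_dp` and example values chosen only for illustration, is:

```python
import math

def rdp_to_dp(alpha, beta, delta):
    """Lemma 1: (alpha, beta)-RDP implies (beta + log(1/delta)/(alpha - 1), delta)-DP."""
    if alpha <= 1.0:
        raise ValueError("the RDP order alpha must exceed 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return beta + math.log(1.0 / delta) / (alpha - 1.0)

# Example values (assumed for illustration only).
eps = rdp_to_dp(alpha=11.0, beta=0.5, delta=1e-3)
```

For a fixed $\beta$, a larger order $\alpha$ shrinks the $\log(1/\delta)/(\alpha-1)$ term, which is why the proof below trades off the two terms through the constant $\mu$.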

This lemma gives the conversion from $(\alpha,\beta)$-RDP to $(\epsilon,\delta)$-DP. RDP also has the following composition property [42]:

Lemma 2.

For $i\in[k]$, if $k$ randomized mechanisms $\mathcal{A}_{i}:\mathbb{D}^{n}\to\mathbb{R}$ satisfy $(\alpha,\beta_{i})$-RDP, then their composition $(\mathcal{A}_{1}(D),\dots,\mathcal{A}_{k}(D))$ satisfies $(\alpha,\sum_{i=1}^{k}\beta_{i})$-RDP. Moreover, the input of the $i$-th mechanism can be based on the outputs of the first $(i-1)$ mechanisms.

This lemma describes how the RDP parameters $\alpha,\beta$ change when a sequence of randomized mechanisms is composed over the iterations of the algorithm.

Lemma 3 ([42]).

Given a function $q:\mathbb{D}^{n}\to\mathbb{R}$, the Gaussian mechanism $\mathcal{A}=q(D)+v$ satisfies $(\alpha,\alpha\bigtriangleup^{2}(q)/(2\sigma^{2}))$-RDP, where $v\sim N(0,\sigma^{2}I)$.

This lemma relates the RDP parameters $\alpha,\beta$ to the variance $\sigma^{2}$ of the added noise.
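Lemmas 2 and 3 together give the RDP accounting used in the proof below. The following sketch shows that accounting under the assumption that every step is a Gaussian mechanism evaluated at the same order $\alpha$; the helper names and the numbers are hypothetical.

```python
def gaussian_rdp(alpha, sensitivity, sigma):
    """Lemma 3: RDP parameter beta of the Gaussian mechanism at order alpha."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

def compose_rdp(betas):
    """Lemma 2: at a fixed order alpha, the beta parameters add up."""
    return sum(betas)

# Ten adaptive Gaussian releases at order alpha = 4 (illustrative values).
per_step = gaussian_rdp(alpha=4.0, sensitivity=0.5, sigma=2.0)
total = compose_rdp([per_step] * 10)
```

The total grows linearly in the number of releases, so keeping a fixed overall budget over $T$ iterations requires the per-step noise variance to grow linearly in $T$.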

Proof of Theorem 1:

Proof.

If $x_{T}$ satisfies $(\alpha,\beta)$-RDP, then by Lemma 1 it satisfies $(\beta+\frac{\log(1/\delta)}{\alpha-1},\delta)$-DP, which is $(\epsilon,\delta)$-DP when

$$\beta+\frac{\log(1/\delta)}{\alpha-1}=\epsilon. \tag{12}$$

Let $\beta=\mu\epsilon$ and $\frac{\log(1/\delta)}{\alpha-1}=(1-\mu)\epsilon$ with $\mu\in(0,1)$, then

$$\alpha=\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1. \tag{13}$$

Hence, it suffices that $x_{T}$ satisfies $\big(\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1,\,\mu\epsilon\big)$-RDP.

According to Lemma 2, this holds if each iteration $\mathcal{A}_{k}$ satisfies $\big(\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1,\,\frac{\mu\epsilon}{T}\big)$-RDP, so that $x_{k}$ satisfies $\big(\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1,\,\frac{k\mu\epsilon}{T}\big)$-RDP.

According to Definition 2 (i.e., the $L_{2}$-sensitivity of the function $q$), let $\max_{i}\|\nabla f_{i}(x)\|=c$, then we have $\bigtriangleup(q)=\frac{c}{n}$. And according to Lemma 3, $\mathcal{A}_{k}$ satisfies $\big(\alpha,\frac{c^{2}\alpha}{2n^{2}\sigma^{2}}\big)$-RDP.

Therefore, setting $\frac{c^{2}\alpha}{2n^{2}\sigma^{2}}=\frac{\mu\epsilon}{T}$, we obtain

$$\sigma^{2}=\frac{c^{2}\alpha T}{2n^{2}\epsilon\mu}. \tag{14}$$

That is, when the variance of the added noise satisfies Eq. (14), the proposed DP-ADMM algorithm satisfies $(\epsilon,\delta)$-DP. This completes the proof. ∎
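The calibration in Eqs. (13) and (14) can be checked numerically. The sketch below computes $\alpha$ and $\sigma^{2}$ and then runs the accounting of Lemmas 1 to 3 in the forward direction to recover $\epsilon$. The function names and the parameter values are assumptions for illustration; `c` plays the role of the gradient-norm bound in the proof.

```python
import math

def dp_admm_noise(eps, delta, n, T, c, mu=0.5):
    """Return (alpha, sigma^2) from Eqs. (13) and (14)."""
    alpha = math.log(1.0 / delta) / ((1.0 - mu) * eps) + 1.0
    sigma2 = c ** 2 * alpha * T / (2.0 * n ** 2 * eps * mu)
    return alpha, sigma2

def implied_epsilon(alpha, sigma2, delta, n, T, c):
    """Compose T Gaussian steps (Lemmas 2 and 3), then convert to DP (Lemma 1)."""
    beta_total = T * alpha * (c / n) ** 2 / (2.0 * sigma2)
    return beta_total + math.log(1.0 / delta) / (alpha - 1.0)

# Illustrative values; delta and mu match the settings reported in Section V-B.
alpha, sigma2 = dp_admm_noise(eps=0.5, delta=1e-3, n=10_000, T=100, c=1.0)
eps_check = implied_epsilon(alpha, sigma2, delta=1e-3, n=10_000, T=100, c=1.0)
```

Here `eps_check` recovers the target budget up to floating-point error, which mirrors the argument of the proof: a share $\mu\epsilon$ of the budget is spent on the composed Gaussian steps and $(1-\mu)\epsilon$ on the RDP-to-DP conversion.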

Appendix B: Proof of Theorem 2

Before giving the proof of Theorem 2, we first present the following lemmas.

Lemma 4 ([43]).

$u_{*}=-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{*})$.

This lemma relates the optimal solution of the dual problem to the optimal solution of the primal problem. For convenience in the subsequent proof, we next express $x_{t}$ in terms of $x_{t-1}$ and a gradient term. Recall that

$$x_{t}=\arg\min_{x}\Big\{\langle\nabla f(x_{t-1})+P_{t},\,x\rangle+\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{\|x-x_{t-1}\|^{2}_{G}}{2\eta}\Big\}. \tag{15}$$

Setting the derivative with respect to $x$ at $x_{t}$ to zero, we have

$$\nabla f(x_{t-1})+P_{t}+\rho A^{T}(Ax_{t}+By_{t}-c+u_{t-1})+\frac{1}{\eta}G(x_{t}-x_{t-1})=0. \tag{16}$$

Let

$$\begin{aligned}
v_{t}&=\nabla f(x_{t-1})+P_{t},\quad P_{t}\sim N(0,\sigma^{2}I_{d_{1}}),\\
q_{t}&=\rho A^{T}(Ax_{t}+By_{t}-c+u_{t-1}),\\
g_{t}&=v_{t}+q_{t}.
\end{aligned}$$

Then $g_{t}+\frac{1}{\eta}G(x_{t}-x_{t-1})=0$, and $x_{t}=x_{t-1}-\eta G^{-1}g_{t}$. Thus, we get a simple representation of $x_{t}$.
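Since $q_{t}$ itself depends on $x_{t}$, the representation above is implicit. One way to make it concrete is to solve the linear optimality condition (16) directly and then verify the representation numerically. The sketch below does this with NumPy under several assumptions that are not from the paper: random toy data, $B=-I$ and $c=0$ as in problem (11), $G=I$, and a random vector standing in for $\nabla f(x_{t-1})$.

```python
import numpy as np

def x_update(v, A, B, c, y, u_prev, x_prev, G, rho, eta):
    """Solve the optimality condition (16) for x_t.

    (rho * A^T A + G / eta) x = G x_prev / eta - v - rho * A^T (B y - c + u_prev)
    """
    lhs = rho * A.T @ A + G / eta
    rhs = G @ x_prev / eta - v - rho * A.T @ (B @ y - c + u_prev)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
d, p = 4, 6
A = rng.standard_normal((p, d))
B = -np.eye(p)                               # constraint A x - y = 0
c = np.zeros(p)
y, u_prev = rng.standard_normal(p), rng.standard_normal(p)
x_prev = rng.standard_normal(d)
G = np.eye(d)                                # assumption: a symmetric positive definite G
rho, eta, sigma = 1.0, 0.1, 0.05
grad = rng.standard_normal(d)                # stand-in for the gradient at x_prev
v = grad + sigma * rng.standard_normal(d)    # v_t = gradient + Gaussian noise P_t

x_t = x_update(v, A, B, c, y, u_prev, x_prev, G, rho, eta)
q = rho * A.T @ (A @ x_t + B @ y - c + u_prev)
g = v + q                                    # g_t = v_t + q_t
```

With these quantities, `x_t` agrees with `x_prev - eta * np.linalg.solve(G, g)` up to numerical error, which is exactly the representation $x_{t}=x_{t-1}-\eta G^{-1}g_{t}$.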

Lemma 5 ([43]).

If $0\leq\eta\leq\frac{1}{L_{f}}$, then we have

$$\begin{aligned}
f(x)+q_{t}^{T}(x-x_{t})\geq{}&f(x_{t})+g_{t}^{T}(x-x_{t-1})+\frac{\eta}{2}\|g_{t}\|^{2}_{G^{-1}}\\
&+(v_{t}-\nabla f(x_{t-1}))^{T}(x_{t}-x).
\end{aligned}$$

Lemma 6 ([43]).

Let $\alpha_{t}=\rho(u_{t}-u_{*})$, we have

$$\begin{aligned}
&2\eta[g(y_{t})-g(y_{*})-g^{\prime}(y_{*})^{T}(y_{t}-y_{*})+(B^{T}\alpha_{t})^{T}(y_{*}-y_{t})]\\
\leq{}&\eta\rho[\|Ax_{t-1}+By_{*}-c\|^{2}-\|Ax_{t}+By_{*}-c\|^{2}+\|u_{t}-u_{t-1}\|^{2}].
\end{aligned}$$

Lemma 7 ([43]).

$$2\eta[-(Ax_{t}+By_{t}-c)^{T}\alpha_{t}]=\eta\rho[\|u_{t-1}-u_{*}\|^{2}-\|u_{t}-u_{*}\|^{2}-\|u_{t}-u_{t-1}\|^{2}].$$
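Lemma 7 can be checked in two lines, assuming the DP-ADMM dual update $u_{t}=u_{t-1}+Ax_{t}+By_{t}-c$ (the non-accelerated form of Step 4 of Algorithm 3) and the definition $\alpha_{t}=\rho(u_{t}-u_{*})$ from Lemma 6:

```latex
\begin{aligned}
-2\eta\,(Ax_{t}+By_{t}-c)^{T}\alpha_{t}
&= -2\eta\rho\,(u_{t}-u_{t-1})^{T}(u_{t}-u_{*})\\
&= \eta\rho\big[\|u_{t-1}-u_{*}\|^{2}-\|u_{t}-u_{*}\|^{2}-\|u_{t}-u_{t-1}\|^{2}\big],
\end{aligned}
```

where the last step expands $\|u_{t-1}-u_{*}\|^{2}=\|(u_{t}-u_{*})-(u_{t}-u_{t-1})\|^{2}$.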

Proof of Theorem 2:

Proof.
Since $x_{t}=x_{t-1}-\eta G^{-1}g_{t}$, applying Lemma 5 with $x=x_{*}$ gives

$$\begin{aligned}
\|x_{t}-x_{*}\|^{2}_{G}
&=\|x_{t-1}-x_{*}\|^{2}_{G}-2\eta(x_{t-1}-x_{*})^{T}g_{t}+\eta^{2}\|g_{t}\|^{2}_{G^{-1}}\\
&\leq\|x_{t-1}-x_{*}\|^{2}_{G}-2\eta[f(x_{t})-f(x_{*})]\\
&\quad-2\eta[v_{t}-\nabla f(x_{t-1})]^{T}(x_{t}-x_{*})+2\eta q_{t}^{T}(x_{*}-x_{t}).
\end{aligned}$$

Below let us bound $-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(x_{t}-x_{*})$. Let

$$\begin{aligned}
\psi_{t}(x)&=\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{1}{2\eta}\|x-x_{t-1}\|^{2}_{G-I},\\
\bar{x}&=\mathrm{prox}_{\eta\psi_{t}}(x_{t-1}-\eta\nabla f(x_{t-1})),
\end{aligned}$$

where $\mathrm{prox}_{\eta r}(y)=\arg\min_{x}\big\{r(x)+\frac{1}{2\eta}\|x-y\|^{2}\big\}$, then we have

$$\begin{aligned}
x_{t}&=\arg\min_{x}\Big\{v_{t}^{T}x+\frac{\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{\|x-x_{t-1}\|^{2}_{G}}{2\eta}\Big\}\\
&=\arg\min_{x}\Big\{\eta v_{t}^{T}x+\frac{\eta\rho}{2}\|Ax+By_{t}-c+u_{t-1}\|^{2}+\frac{\|x-x_{t-1}\|^{2}_{G-I}}{2}+\frac{\|x-x_{t-1}\|^{2}}{2}\Big\}\\
&=\arg\min_{x}\Big\{\eta\psi_{t}(x)+\frac{1}{2}\|x-(x_{t-1}-\eta v_{t})\|^{2}\Big\}.
\end{aligned}$$

Therefore,

$$x_{t}=\mathrm{prox}_{\eta\psi_{t}}(x_{t-1}-\eta v_{t}). \tag{17}$$

Then, by the Cauchy-Schwarz inequality and the non-expansiveness of the proximal operator (which gives $\|x_{t}-\bar{x}\|\leq\eta\|v_{t}-\nabla f(x_{t-1})\|$), we have

$$\begin{aligned}
&-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(x_{t}-x_{*})\\
={}&-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(x_{t}-\bar{x})-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(\bar{x}-x_{*})\\
\leq{}&2\eta^{2}\|v_{t}-\nabla f(x_{t-1})\|^{2}-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(\bar{x}-x_{*})
\end{aligned}$$

and

$$\begin{aligned}
&\|x_{t}-x_{*}\|^{2}_{G}-2\eta q_{t}^{T}(x_{*}-x_{t})\\
\leq{}&\|x_{t-1}-x_{*}\|^{2}_{G}-2\eta(f(x_{t})-f(x_{*}))\\
&+2\eta^{2}\|v_{t}-\nabla f(x_{t-1})\|^{2}-2\eta(v_{t}-\nabla f(x_{t-1}))^{T}(\bar{x}-x_{*}).
\end{aligned}$$

Since $v_{t}-\nabla f(x_{t-1})=P_{t}$, we have $\mathbb{E}\|P_{t}\|^{2}=\sigma^{2}d_{1}$ and $\mathbb{E}(P_{t})=0$. Taking expectations on both sides of the above inequalities, we have

$$\begin{aligned}
&\|x_{t}-x_{*}\|^{2}_{G}-2\eta q_{t}^{T}(x_{*}-x_{t})\\
\leq{}&\|x_{t-1}-x_{*}\|^{2}_{G}-2\eta(f(x_{t})-f(x_{*}))+2\eta^{2}\sigma^{2}d_{1},
\end{aligned}$$

and hence

$$\begin{aligned}
&2\eta[f(x_{t})-f(x_{*})-q_{t}^{T}(x_{*}-x_{t})]\\
\leq{}&\|x_{t-1}-x_{*}\|^{2}_{G}-\|x_{t}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}d_{1}.
\end{aligned}$$

Since

$$\begin{aligned}
&2\eta[f(x_{t})-f(x_{*})-q_{t}^{T}(x_{*}-x_{t})]\\
={}&2\eta[f(x_{t})-f(x_{*})-\nabla f(x_{*})^{T}(x_{t}-x_{*})-(A^{T}\alpha_{t})^{T}(x_{*}-x_{t})],
\end{aligned}$$

where $\alpha_{t}=\rho(u_{t}-u_{*})$ as in Lemma 6, we have

$$\begin{aligned}
&2\eta[f(x_{t})-f(x_{*})-\nabla f(x_{*})^{T}(x_{t}-x_{*})-(A^{T}\alpha_{t})^{T}(x_{*}-x_{t})]\\
\leq{}&\|x_{t-1}-x_{*}\|^{2}_{G}-\|x_{t}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}d_{1}.
\end{aligned}$$

According to Lemma 6, we have

$$\begin{aligned}
&2\eta[g(y_{t})-g(y_{*})-g^{\prime}(y_{*})^{T}(y_{t}-y_{*})+\alpha_{t}^{T}(y_{*}-y_{t})]\\
\leq{}&\eta\rho\big[\|Ax_{t-1}+By_{*}-c\|^{2}-\|Ax_{t}+By_{*}-c\|^{2}+\|u_{t}-u_{t-1}\|^{2}\big].
\end{aligned}$$

And we have

$$0=-(A^{T}\alpha_{t})^{T}(x_{*}-x_{t})-(B^{T}\alpha_{t})^{T}(y_{*}-y_{t})-(Ax_{t}+By_{t}-c)^{T}\alpha_{t}.$$

Thus, according to Lemma 7, we have

$$\begin{aligned}
2\eta R(x_{t},y_{t})\leq{}&\|x_{t-1}-x_{*}\|^{2}_{G}-\|x_{t}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}d_{1}-2\eta(Ax_{t}-y_{t})^{T}\alpha_{t}\\
&+\eta\rho[\|Ax_{t-1}+By_{*}-c\|^{2}-\|Ax_{t}+By_{*}-c\|^{2}+\|u_{t}-u_{t-1}\|^{2}]\\
={}&\|x_{t-1}-x_{*}\|^{2}_{G}-\|x_{t}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}d_{1}\\
&+\eta\rho[\|Ax_{t-1}+By_{*}-c\|^{2}-\|Ax_{t}+By_{*}-c\|^{2}+\|u_{t-1}-u_{*}\|^{2}-\|u_{t}-u_{*}\|^{2}].
\end{aligned}$$

Summing up the above inequality for all $t=1,2,\cdots,T$, we obtain

$$\begin{aligned}
2\eta\sum_{t=1}^{T}R(x_{t},y_{t})\leq{}&\|x_{0}-x_{*}\|^{2}_{G}-\|x_{T}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}Td_{1}\\
&+\eta\rho[\|Ax_{0}+By_{*}-c\|^{2}-\|Ax_{T}+By_{*}-c\|^{2}+\|u_{0}-u_{*}\|^{2}-\|u_{T}-u_{*}\|^{2}]\\
\leq{}&\|x_{0}-x_{*}\|^{2}_{G}+2\eta^{2}\sigma^{2}Td_{1}+\eta\rho(\|Ax_{0}+By_{*}-c\|^{2}+\|u_{0}-u_{*}\|^{2}).
\end{aligned}$$

The convexity of $R(\cdot,\cdot)$ can be obtained from the convexity of both $f(\cdot)$ and $g(\cdot)$. Let $\tilde{x}=\frac{1}{T}\sum_{t=1}^{T}x_{t}$ and $\tilde{y}=\frac{1}{T}\sum_{t=1}^{T}y_{t}$. Then

$$\begin{aligned}
2\eta R(\tilde{x},\tilde{y})&\leq2\eta\sum_{t=1}^{T}\frac{1}{T}R(x_{t},y_{t})\\
&\leq\frac{1}{T}[\|x_{0}-x_{*}\|^{2}_{G}+\eta\rho\|Ax_{0}+By_{*}-c\|^{2}+\eta\rho\|u_{0}-u_{*}\|^{2}]+2\eta^{2}\sigma^{2}d_{1},\\
R(\tilde{x},\tilde{y})&\leq\frac{1}{2\eta T}[\|x_{0}-x_{*}\|^{2}_{G}+\eta\rho\|Ax_{0}+By_{*}-c\|^{2}+\eta\rho\|u_{0}-u_{*}\|^{2}]+\eta\sigma^{2}d_{1}.
\end{aligned}$$

Since $Ax_{*}+By_{*}-c=0$, $u_{0}=-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{0})$, and $u_{*}=-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{*})$, we have

$$\begin{aligned}
\frac{\rho}{2T}\|Ax_{0}+By_{*}-c\|^{2}&=\frac{\rho}{2T}\|A(x_{0}-x_{*})\|^{2}\\
&=\frac{\rho}{2T}[A(x_{0}-x_{*})]^{T}[A(x_{0}-x_{*})]=\frac{\rho}{2T}\|x_{0}-x_{*}\|^{2}_{A^{T}A}\\
&\leq\frac{\rho}{2T}\|A^{T}A\|_{2}\cdot\|x_{0}-x_{*}\|^{2}
\end{aligned}$$

and

$$\begin{aligned}
\frac{\rho}{2T}\|u_{0}-u_{*}\|^{2}&=\frac{\rho}{2T}\Big\|-\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{0})+\frac{1}{\rho}(A^{T})^{\dagger}\nabla f(x_{*})\Big\|^{2}\\
&=\frac{1}{2\rho T}\|\nabla f(x_{0})-\nabla f(x_{*})\|^{2}_{A^{\dagger}(A^{\dagger})^{T}}\\
&\leq\frac{1}{2\rho T}\|A^{\dagger}(A^{\dagger})^{T}\|_{2}\cdot\|\nabla f(x_{0})-\nabla f(x_{*})\|^{2}\\
&\leq\frac{L_{f}}{2\rho T}\|A^{\dagger}(A^{\dagger})^{T}\|_{2}\cdot\|x_{0}-x_{*}\|^{2}.
\end{aligned}$$

Combining the above results, we have

$$R(\tilde{x},\tilde{y})\leq\Big(\frac{\|G\|_{2}}{2\eta T}+\frac{\rho\|A^{T}A\|_{2}}{2T}+\frac{L_{f}\|A^{\dagger}(A^{\dagger})^{T}\|_{2}}{2\rho T}\Big)\|x_{0}-x_{*}\|^{2}+\eta\sigma^{2}d_{1}.$$

The analysis of the utility bound is given below. Let $c_{1}=\frac{\|G\|_{2}}{2\eta}+\frac{\rho\|A^{T}A\|_{2}}{2}+\frac{L_{f}\|A^{\dagger}(A^{\dagger})^{T}\|_{2}}{2\rho}$, and $\bigtriangleup_{2}(q)=\frac{c_{2}}{n}$. Then, substituting the noise variance in Eq. (14), we have

$$\begin{aligned}
R(\tilde{x},\tilde{y})&\leq\frac{c_{1}}{T}\|x_{0}-x_{*}\|^{2}+\frac{c_{2}^{2}\alpha T\eta d_{1}}{2n^{2}\epsilon\mu}\\
&=\frac{c_{1}}{T}\|x_{0}-x_{*}\|^{2}+O\Big(\frac{\alpha\eta Td_{1}}{n^{2}\epsilon\mu}\Big).
\end{aligned} \tag{18}$$

Let $\frac{c_{1}}{T}\|x_{0}-x_{*}\|^{2}=O\big(\frac{\alpha\eta Td_{1}}{n^{2}\epsilon\mu}\big)$. Then

$$T=O\Big(\frac{n\sqrt{\epsilon\mu}}{\sqrt{\alpha\eta d_{1}}}\Big). \tag{19}$$

Substituting the result in (19) into (18), we obtain the following results:

Utility bound: $R(\tilde{x},\tilde{y})\leq O\Big(\frac{\sqrt{\alpha\eta d_{1}}}{n\sqrt{\epsilon\mu}}\Big)$,

Gradient complexity: $n\cdot T=O\Big(\frac{n^{2}\sqrt{\epsilon\mu}}{\sqrt{\alpha\eta d_{1}}}\Big)$,

where $\alpha=\frac{\log(1/\delta)}{(1-\mu)\epsilon}+1$. This completes the proof. ∎
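To get a feel for the orders of magnitude in Eq. (19) and the utility bound, the following sketch evaluates both expressions with all hidden constants set to 1, which is an assumption made purely for illustration. The values of `n` and `d1` match the a9a row of Table I and `delta` matches Section V-B, while `eps` and `eta` are hypothetical choices.

```python
import math

def balanced_schedule(n, eps, delta, eta, d1, mu=0.5):
    """Evaluate Eq. (19) and the utility bound with all hidden constants set to 1."""
    alpha = math.log(1.0 / delta) / ((1.0 - mu) * eps) + 1.0
    noise_scale = math.sqrt(alpha * eta * d1)
    budget_scale = n * math.sqrt(eps * mu)
    T = budget_scale / noise_scale          # Eq. (19)
    utility = noise_scale / budget_scale    # utility bound
    return T, utility, alpha

T, utility, alpha = balanced_schedule(n=32_561, eps=0.5, delta=1e-3, eta=0.1, d1=123)
```

By construction, this choice of $T$ balances the optimization term $1/T$ against the noise term $\alpha\eta Td_{1}/(n^{2}\epsilon\mu)$ in Eq. (18): running longer would let the accumulated noise dominate, and stopping earlier would leave optimization error.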

Acknowledgments

We thank all the reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (Nos. 61876220, 61876221, 61976164, 61836009 and U1701267), the Project supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the Science Foundation of Xidian University (Nos. 10251180018 and 10251180019), the National Science Basic Research Plan in Shaanxi Province of China (Nos. 2019JQ-657 and 2020JM-194), and the Key Special Project of China High Resolution Earth Observation System-Young Scholar Innovation Fund.

References

  • [1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in The Third Theory of Cryptography Conference, 2006, pp. 265–284.
  • [2] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Found. Trends Theor. Comput. Sci., vol. 9, no. 3-4, pp. 211–407, 2014.
  • [3] M. Gong, Y. Xie, K. Pan, K. Feng, and A. K. Qin, “A survey on differentially private machine learning,” IEEE Comput. Intell. Mag., vol. 15, no. 2, pp. 49–64, 2020.
  • [4] F. McSherry and I. Mironov, “Differentially private recommender systems: Building privacy into the netflix prize contenders,” in Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2009, pp. 627–636.
  • [5] A. Machanavajjhala, A. Korolova, and A. D. Sarma, “Personalized social recommendations-accurate or private?” in Proc. 37th Int. Conf. Very Large Data Bases, 2011, pp. 440–450.
  • [6] T. Zhu, G. Li, Y. Ren, W. Zhou, and P. Xiong, “Differential privacy for neighborhood-based collaborative filtering,” in Proc. IEEE/ACM Int. Conf. Advances in Social Networks Analysis and Mining, 2013, pp. 752–759.
  • [7] T. Zhu, G. Li, W. Zhou, and P. Xiong, “Privacy preserving for tagging recommender systems,” in IEEE WIC ACM Int. Conf. Web Intelligence and Intelligent Agent Technology, 2013, pp. 81–88.
  • [8] Y. Lindell and E. Omri, “A practical application of differential privacy to personalized online advertising,” IACR Cryptology ePrint Archive, vol. 2011, p. 152, 2011.
  • [9] F. K. Dankar and K. El Emam, “The application of differential privacy to health data,” in Proc. Joint EDBT/ICDT Workshops, 2012, pp. 158–166.
  • [10] M. A. P. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe, “Privacy preserving face recognition utilizing differential privacy,” Computers & Security, vol. 97, 2020.
  • [11] A. Othman and A. Ross, “Privacy of facial soft biometrics: Suppressing gender but retaining identity,” in Proc. European Conf. Computer Vision, 2014, pp. 682–696.
  • [12] F. McSherry and R. Mahajan, “Differentially-private network trace analysis,” ACM SIGCOMM Computer Communication Review, vol. 40, no. 4, pp. 123–134, 2010.
  • [13] M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, and J. Gehrke, “Publishing search logs—a comparative study of privacy guarantees,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 520–532, 2011.
  • [14] Ú. Erlingsson, V. Pihur, and A. Korolova, “Rappor: Randomized aggregatable privacy-preserving ordinal response,” in Proc. ACM SIGSAC Conf. Computer and Communications Security, 2014, pp. 1054–1067.
  • [15] D. P. T. Apple, “Learning with privacy at scale,” Technical report, Apple, 2017.
  • [16] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3571–3580.
  • [17] J. M. Abowd, “The us census bureau adopts differential privacy,” in Proc. 24th ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining, 2018, pp. 2867–2867.
  • [18] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proc. ACM SIGSAC Conf. Computer and Communications Security, 2016, pp. 308–318.
  • [19] D. Wang, M. Ye, and J. Xu, “Differentially private empirical risk minimization revisited: Faster and more general,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2722–2731.
  • [20] L. Wang, B. Jayaraman, D. Evans, and Q. Gu, “Efficient privacy-preserving nonconvex optimization,” arXiv: 1910.13659, 2019.
  • [21] S. Kim, K. A. Sohn, and E. P. Xing, “A multivariate regression approach to association analysis of a quantitative trait network,” Bioinformatics, vol. 25, pp. 204–212, 2009.
  • [22] R. J. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, 2011.
  • [23] H. Ouyang, N. He, L. Tran, and A. Gray, “Stochastic alternating direction method of multipliers,” in Proc. Int. Conf. Machine Learning, 2013, pp. 80–88.
  • [24] Z. Huang and Y. Gong, “Differentially private ADMM for convex distributed learning: Improved accuracy via multi-step approximation,” arXiv: 2005.07890v1, 2020.
  • [25] Z. Huang, R. Hu, Y. Guo, E. Chan-Tin, and Y. Gong, “DP-ADMM: ADMM-based distributed learning with differential privacy,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 1002–1012, 2019.
  • [26] X. Wang, H. Ishii, L. Du, P. Cheng, and J. Chen, “Privacy-preserving distributed machine learning via local randomization and ADMM perturbation,” IEEE Trans. Signal Process., vol. 68, pp. 4226–4241, 2020.
  • [27] C. Chen and J. Lee, “Rényi differentially private ADMM for non-smooth regularized optimization,” in Proc. the Tenth ACM Conf. Data and Application Security and Privacy, 2020, pp. 319–328.
  • [28] P. Wang and H. Zhang, “Differential privacy for sparse classification learning,” Neurocomputing, vol. 375, no. 29, pp. 91–101, 2020.
  • [29] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 315–323.
  • [30] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differential privacy,” in Proc. IEEE the 51st Annual Symposium on Foundations of Computer Science, 2010, pp. 51–60.
  • [31] D. Yu, H. Zhang, W. Chen, T.-Y. Liu, and J. Yin, “Gradient perturbation is underrated for differentially private convex optimization,” in Proc. 29th Int. Joint Conf. Artificial Intelligence, 2020, pp. 3117–3123.
  • [32] Y. Nesterov, “A method of solving a convex programming problem with convergence rate $O(1/k^{2})$,” Soviet Math. Doklady, vol. 27, pp. 372–376, 1983.
  • [33] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
  • [34] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course.   Boston: Springer US, 2014.
  • [35] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv: 1609.04747, 2016.
  • [36] T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk, “Fast alternating direction optimization methods,” SIAM J. Imaging Sciences, vol. 7, no. 3, pp. 1588–1623, 2014.
  • [37] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [38] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in IEEE Int. Conf. Data Engineering, 2007, pp. 24–24.
  • [39] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, pp. 1069–1109, 2011.
  • [40] K. Fukuchi, Q. K. Tran, and J. Sakuma, “Differentially private empirical risk minimization with input perturbation,” in Proc. Int. Conf. Discovery Science, 2017, pp. 82–90.
  • [41] C. Chen, J. Lee, and D. Kifer, “Renyi differentially private erm for smooth objectives,” in Proc. 22nd Int. Conf. Artificial Intelligence and Statistics, 2019, pp. 2037–2046.
  • [42] I. Mironov, “Rényi differential privacy,” in Proc. IEEE 30th Computer Security Foundations Symposium, 2017, pp. 263–275.
  • [43] S. Zheng and J. T. Kwok, “Fast-and-light stochastic ADMM,” in Proc. Int. Joint Conf. Artificial Intelligence, 2016, pp. 2407–2613.
  • [44] D. L. Donoho, “De-noising by soft-thresholding,” IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 613–627, 1995.
  • [45] H. Ouyang, N. He, and A. Gray, “Stochastic ADMM for nonsmooth optimization,” arXiv: 1211.0632v2, 2012.
  • [46] S. Azadi and S. Sra, “Towards an optimal stochastic alternating direction method of multipliers,” in Proc. Int. Conf. Machine Learning, 2014, pp. 620–628.
  • [47] L. El Ghaoui, O. Banerjee, and A. D’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data,” Journal of Machine Learning Research, pp. 485–516, 2008.
  • [48] Y. Liu, F. Shang, H. Liu, L. Kong, L. Jiao, and Z. Lin, “Accelerated variance reduction stochastic ADMM for large-scale machine learning,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
  • [49] J. Ding, X. Zhang, M. Chen, K. Xue, C. Zhang, and M. Pan, “Differentially private robust ADMM for distributed machine learning,” in Proc. 2019 IEEE Int. Conf. Big Data, 2019, pp. 1302–1311.
  • [50] A. Bellet, R. Guerraoui, M. Taziki, and M. Tommasi, “Personalized and private peer-to-peer machine learning,” in Proc. 21st Int. Conf. Artificial Intelligence and Statistics, 2018, pp. 473–481.
  • [51] C. Chen, B. He, Y. Ye, and X. Yuan, “The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent,” Math. Comp., vol. 155, pp. 57–79, 2016.